AI Research

Key Takeaways

  • Cross-language code clone detection (X-CCD) is challenging because semantically equivalent programs written in different languages often share little surface similarity.
  • Although large language models (LLMs) have shown promise for semantic clone detection, their use as black-box systems raises concerns about cost, reproducibility, privacy, and unreliable output formatting.
  • In particular, compact open-source models often struggle to follow reasoning-oriented prompts and to produce outputs that can be consistently mapped to binary clone labels.
  • To address these limitations, we propose a knowledge distillation framework that transfers reasoning capabilities from DeepSeek-R1 into compact open-source student models for X-CCD.
Paper Abstract

Cross-language code clone detection (X-CCD) is challenging because semantically equivalent programs written in different languages often share little surface similarity. Although large language models (LLMs) have shown promise for semantic clone detection, their use as black-box systems raises concerns about cost, reproducibility, privacy, and unreliable output formatting. In particular, compact open-source models often struggle to follow reasoning-oriented prompts and to produce outputs that can be consistently mapped to binary clone labels. To address these limitations, we propose a knowledge distillation framework that transfers reasoning capabilities from DeepSeek-R1 into compact open-source student models for X-CCD. Using cross-language code pairs derived from Project CodeNet, we construct reasoning-oriented synthetic training data and fine-tune Phi3 and Qwen-Coder with LoRA adapters. We further introduce response stabilization methods, including forced conclusion prompting, a binary classification head, and a contrastive classification head, and evaluate model behavior using both predictive metrics and response rate. Experiments on Python-Java, Rust-Java, Rust-Python, and Rust-Ruby show that knowledge distillation consistently improves the reliability of compact models and often improves predictive performance, especially under distribution shift. In addition, classification-head variants substantially reduce inference time compared to generation-based inference. Overall, our results show that reasoning-oriented distillation combined with response stabilization makes compact open-source models more practical and reliable for X-CCD.

Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
This research addresses the difficulty of identifying semantically equivalent code written in different programming languages, a task known as Cross-Language Code Clone Detection (X-CCD). While large language models (LLMs) are capable of this, they are often expensive, slow, and difficult to use for consistent, automated tasks. This paper introduces a knowledge distillation framework that transfers the advanced reasoning capabilities of a powerful teacher model, DeepSeek-R1, into smaller, open-source student models (Phi3 and Qwen-Coder). By training these compact models on synthetic, reasoning-oriented data, the authors aim to create efficient, locally deployable tools that can reliably detect code clones across different languages.

Creating Reasoning-Oriented Training Data

The core of the approach involves generating high-quality training data using the Project CodeNet dataset. The researchers filtered this dataset to ensure all code samples were functionally correct. They then used DeepSeek-R1 to analyze these code pairs, instructing the model to perform a step-by-step "System 2" style of thinking. This process required the teacher model to break down the functionality, mathematical logic, and structural differences of the code before reaching a final conclusion. This synthetic dataset, consisting of over 10,000 samples, teaches the smaller student models how to reason through code similarity rather than just guessing based on surface-level patterns.
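The data-generation step above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the prompt wording, the `teacher_generate` callable, and the conclusion marker are all assumptions standing in for the real DeepSeek-R1 prompting setup.

```python
# Hypothetical sketch of building one reasoning-oriented training sample.
# The prompt text and the teacher interface are illustrative assumptions.

REASONING_PROMPT = """You are comparing two programs written in different languages.
Think step by step:
1. Summarize the functionality of each program.
2. Compare their mathematical and algorithmic logic.
3. Note structural differences that do not affect behavior.
Finally answer with exactly one line: CONCLUSION: CLONE or CONCLUSION: NOT_CLONE.

Program A ({lang_a}):
{code_a}

Program B ({lang_b}):
{code_b}
"""

def build_training_example(lang_a, code_a, lang_b, code_b, teacher_generate):
    """Query the teacher model and package a (prompt, reasoning) pair
    for fine-tuning a student model."""
    prompt = REASONING_PROMPT.format(lang_a=lang_a, code_a=code_a,
                                     lang_b=lang_b, code_b=code_b)
    reasoning = teacher_generate(prompt)  # e.g. a call to DeepSeek-R1
    return {"prompt": prompt, "response": reasoning}
```

Collecting such (prompt, reasoning) pairs over the filtered Project CodeNet code pairs yields the kind of synthetic dataset the student models are later fine-tuned on.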

Stabilizing Model Responses

A major challenge with using LLMs for classification is that they often produce free-form text that is difficult to map to a simple "clone" or "non-clone" label. To solve this, the authors implemented three stabilization methods:

  • Forced Conclusion Prompting: This requires the model to follow its free-form reasoning with a single, parsable conclusion line.

  • Binary Classification Head: This adds a dedicated layer to the model to output a binary decision, which is more deterministic than generating text.

  • Contrastive Classification Head: This encourages the model to create more distinct internal representations for clone and non-clone pairs, which helps maintain performance even when the model encounters new types of problems.
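The first of these methods, forced conclusion prompting, can be illustrated on the output-parsing side: because the model is instructed to end with a fixed conclusion line, the free-form response can be mapped deterministically to a binary label. The marker text below is an assumption for illustration, not the paper's exact format.

```python
import re

# Sketch of mapping a forced-conclusion response to a binary clone label.
# The marker "CONCLUSION: CLONE / NOT_CLONE" is an assumed format.
CONCLUSION_RE = re.compile(r"CONCLUSION:\s*(CLONE|NOT_CLONE)", re.IGNORECASE)

def parse_conclusion(response: str):
    """Return 1 (clone), 0 (non-clone), or None if no parsable conclusion.

    A None result counts against the model's response rate, the
    reliability metric reported alongside predictive metrics.
    """
    matches = CONCLUSION_RE.findall(response)
    if not matches:
        return None  # free-form answer that cannot be mapped to a label
    # Use the last occurrence, since the conclusion follows the reasoning.
    return 1 if matches[-1].upper() == "CLONE" else 0
```

The classification-head variants sidestep this parsing entirely by reading a label from a dedicated output layer, which is why they are both more deterministic and faster at inference time.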

Performance and Efficiency Gains

The experiments demonstrated that knowledge distillation significantly improves the reliability of compact models. Across the evaluated language pairs (Python-Java, Rust-Java, Rust-Python, and Rust-Ruby), the distilled models performed better in both in-distribution and out-of-distribution settings. Notably, the classification-head variants drastically reduced inference time, turning a process that previously took hours into one that takes only minutes. By combining reasoning-oriented training with these stabilization techniques, the researchers made compact, open-source models more practical for real-world software engineering tasks.

Generalization and Practicality

The study highlights that these distilled models are capable of generalizing to unseen programming languages and new problem distributions. By using LoRA adapters for fine-tuning, the researchers kept the training process parameter-efficient, allowing the student models to acquire complex reasoning skills without the need for full-scale retraining. The results suggest that this framework provides a viable path for developers to deploy high-performing, private, and cost-effective code analysis tools locally, bypassing the limitations of relying on proprietary, black-box AI services.
