Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
This research addresses the difficulty of identifying semantically equivalent code written in different programming languages, a task known as Cross-Language Code Clone Detection (X-CCD). While large language models (LLMs) can perform this task, they are often expensive, slow, and hard to rely on for consistent, automated use. This paper introduces a knowledge distillation framework that transfers the advanced reasoning capabilities of a powerful teacher model, DeepSeek-R1, into smaller, open-source student models (Phi3 and Qwen-Coder). By training these compact models on synthetic, reasoning-oriented data, the authors aim to create efficient, locally deployable tools that can reliably detect code clones across different languages.
Creating Reasoning-Oriented Training Data
The core of the approach involves generating high-quality training data from the Project CodeNet dataset. The researchers filtered the dataset to ensure all code samples were functionally correct, then used DeepSeek-R1 to analyze code pairs, instructing the model to perform step-by-step "System 2" reasoning. This process required the teacher model to break down the functionality, mathematical logic, and structural differences of each code pair before reaching a final conclusion. The resulting synthetic dataset of over 10,000 samples teaches the smaller student models how to reason about code similarity rather than guess based on surface-level patterns.
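As an illustration, a reasoning-oriented teacher prompt of this kind might look like the sketch below. The wording, the "CONCLUSION" format, and the model identifier are assumptions made for illustration, not the authors' exact template; the sketch also assumes an OpenAI-compatible API endpoint serving DeepSeek-R1.

```python
# Sketch of assembling a "System 2" reasoning prompt for the teacher model.
# The prompt wording and the generate_reasoning_sample helper are illustrative,
# not the authors' exact setup.
from openai import OpenAI

SYSTEM_2_PROMPT = """You are an expert in cross-language program analysis.
Given two code snippets, reason step by step before answering:
1. Summarize the functionality of each snippet.
2. Compare their mathematical and algorithmic logic.
3. Compare their structure (control flow, data structures, I/O).
Then state a final verdict on its own line in the form:
CONCLUSION: clone
or
CONCLUSION: non-clone

Snippet A ({lang_a}):
{code_a}

Snippet B ({lang_b}):
{code_b}
"""

def generate_reasoning_sample(client: OpenAI, code_a, lang_a, code_b, lang_b):
    """Ask the teacher for a step-by-step analysis plus a parsable verdict."""
    prompt = SYSTEM_2_PROMPT.format(code_a=code_a, lang_a=lang_a,
                                    code_b=code_b, lang_b=lang_b)
    resp = client.chat.completions.create(
        model="deepseek-reasoner",           # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content   # reasoning trace + conclusion
```

Each returned reasoning trace, paired with the ground-truth clone label, becomes one training sample for the student models.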
Stabilizing Model Responses
A major challenge with using LLMs for classification is that they often produce free-form text that is difficult to map to a simple "clone" or "non-clone" label. To solve this, the authors implemented three stabilization methods:
Forced Conclusion Prompting: The prompt requires the model to end its free-form reasoning with an explicitly formatted, parsable conclusion.
Binary Classification Head: This adds a dedicated layer to the model to output a binary decision, which is more deterministic than generating text.
Contrastive Classification Head: This encourages the model to build more distinct internal representations for clone and non-clone pairs, which helps maintain performance when the model encounters new types of problems (both heads are sketched below).
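To make the second and third methods concrete, the sketch below shows one way such heads could be attached to a student model. It assumes a PyTorch/Transformers setup with last-token pooling and a supervised contrastive loss; the class and function names (CloneClassifier, contrastive_loss) are illustrative and not taken from the paper.

```python
# Sketch of a binary classification head plus a contrastive projection head
# on top of a decoder-only student model. Pooling strategy and loss are
# assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class CloneClassifier(nn.Module):
    def __init__(self, model_name: str, proj_dim: int = 128):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden = self.backbone.config.hidden_size
        # Binary head: deterministic clone / non-clone logits instead of
        # free-form generated text.
        self.binary_head = nn.Linear(hidden, 2)
        # Contrastive head: projects pooled states into a space where clone
        # and non-clone pairs are pushed apart.
        self.proj_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, proj_dim),
        )

    def forward(self, input_ids, attention_mask):
        states = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Pool the representation of the last non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = states[torch.arange(states.size(0)), last_idx]
        return self.binary_head(pooled), F.normalize(self.proj_head(pooled), dim=-1)

def contrastive_loss(z, labels, temperature: float = 0.07):
    """Supervised contrastive loss: representations with the same label
    attract, representations with different labels repel."""
    sim = z @ z.T / temperature
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e9)                       # drop self-similarity
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float() * (~eye).float()
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```

In such a setup, the cross-entropy loss on the binary head and the contrastive term would typically be combined with a weighting factor during training.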
Performance and Efficiency Gains
The experiments demonstrated that knowledge distillation significantly improves the reliability of compact models. Across various language pairs, including Python, Java, Rust, and Ruby, the distilled models outperformed their undistilled baselines in both in-distribution and out-of-distribution settings. Notably, the use of classification heads drastically reduced inference time, turning a process that previously took hours into one that takes only minutes. By combining reasoning-oriented training with these stabilization techniques, the researchers made compact, open-source models considerably more practical for real-world software engineering tasks.
Generalization and Practicality
The study highlights that these distilled models are capable of generalizing to unseen programming languages and new problem distributions. By using LoRA adapters for fine-tuning, the researchers kept the training process parameter-efficient, allowing the student models to acquire complex reasoning skills without the need for full-scale retraining. The results suggest that this framework provides a viable path for developers to deploy high-performing, private, and cost-effective code analysis tools locally, bypassing the limitations of relying on proprietary, black-box AI services.
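For reference, a parameter-efficient LoRA setup of the kind described might be configured as follows with the Hugging Face PEFT library. The checkpoint name, rank, and target modules are illustrative placeholders, not the paper's reported settings.

```python
# Sketch of LoRA-based parameter-efficient fine-tuning of a student model.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Example checkpoint; the paper's exact student checkpoints may differ.
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

lora_cfg = LoraConfig(
    r=16,                      # adapter rank (illustrative)
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

student = get_peft_model(student, lora_cfg)
student.print_trainable_parameters()   # only the adapter weights are trainable
```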