Multi-LCB: Extending LiveCodeBench to Multiple Prog...

Key Takeaways

Multi-LCB is a new benchmark designed to evaluate how well large language models (LLMs) perform across twelve different programming languages.
LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks.
By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability.
However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering.
We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python.

Paper Abstract

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.

Multi-LCB is a new benchmark designed to evaluate how well large language models (LLMs) perform across twelve programming languages. LiveCodeBench (LCB) has been effective at measuring coding ability, but it remains restricted to Python. Multi-LCB addresses this limitation by extending the same evaluation to other languages, so that models are tested on the varied syntax and requirements of real-world software engineering rather than on a single language.

Extending the Reach of Code Evaluation

The core objective of Multi-LCB is to provide a rigorous, cross-language assessment of LLMs. The researchers took the existing tasks from the LCB dataset, which are curated competitive programming problems, and transformed them into a format compatible with twelve languages, including C++, Java, Go, Rust, and JavaScript. This approach preserves the original benchmark’s contamination controls: because problems are filtered by release date, a model can be evaluated on problems released after its training data was collected.
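The release-date filtering described above can be illustrated with a minimal sketch. The `Problem` class, its field names, and the example dates are hypothetical and are not taken from the LCB or Multi-LCB data format; the only idea shown is keeping problems released after a chosen cutoff.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical fields for illustration; not the actual LCB schema.
    problem_id: str
    release_date: date


def filter_after_cutoff(problems, cutoff):
    """Keep only problems released strictly after the given cutoff date."""
    return [p for p in problems if p.release_date > cutoff]


# Toy problem set with made-up identifiers and dates.
problems = [
    Problem("p1", date(2023, 5, 1)),
    Problem("p2", date(2024, 2, 15)),
    Problem("p3", date(2024, 9, 30)),
]

# Assume a model whose training data ends on 2024-01-01.
fresh = filter_after_cutoff(problems, cutoff=date(2024, 1, 1))
fresh_ids = [p.problem_id for p in fresh]
print(fresh_ids)  # ['p2', 'p3']
```

Because the filter depends only on the release date, the same cutoff can be applied unchanged to every language version of a problem.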

How the Benchmark Works

To maintain consistency across languages, the researchers developed an automated pipeline that converts function-style tasks (those in which the solution is a function invoked by a test harness) into a unified standard input/output (STDIN/STDOUT) format. This is a significant design choice because it avoids the need for language-specific test harnesses, which are often error-prone and difficult to maintain. With this standardized format, the benchmark can evaluate a model on the exact same problem and test cases in all twelve languages, allowing a direct and fair comparison of how well the model's coding ability generalizes.
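As a rough sketch of why a STDIN/STDOUT format removes the need for per-language harnesses, the example below serializes a function-style test case as text and checks any executable against it. The encoding (one JSON value per line), the helper names `to_stdio_case` and `run_case`, and the sample task are assumptions made for illustration, not the pipeline described in the paper.

```python
import json
import subprocess
import sys


def to_stdio_case(args, expected):
    """Serialize function arguments and the expected return value as text.

    Assumed encoding (not from the paper): one JSON value per line.
    """
    stdin_text = "\n".join(json.dumps(a) for a in args) + "\n"
    expected_stdout = json.dumps(expected) + "\n"
    return stdin_text, expected_stdout


def run_case(command, stdin_text, expected_stdout, timeout=5.0):
    """Run an executable on the case and compare whitespace-trimmed output."""
    result = subprocess.run(
        command, input=stdin_text, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


# A function-style task: two_sum(nums, target) returns a pair of indices.
stdin_text, expected = to_stdio_case([[2, 7, 11, 15], 9], [0, 1])

# Stand-in submission written in Python so the sketch is self-contained.
solution = (
    "import json, sys\n"
    "nums = json.loads(sys.stdin.readline())\n"
    "target = json.loads(sys.stdin.readline())\n"
    "seen = {}\n"
    "for i, x in enumerate(nums):\n"
    "    if target - x in seen:\n"
    "        print(json.dumps([seen[target - x], i]))\n"
    "        break\n"
    "    seen[x] = i\n"
)
passed = run_case([sys.executable, "-c", solution], stdin_text, expected)
print(passed)
```

The checker only sees a command, an input string, and an output string, so the same `run_case` call would work for a compiled Rust or C++ binary by changing the command; only the build step differs by language.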

Key Findings on Model Performance

The evaluation of 24 different LLMs revealed several critical gaps in current AI capabilities. The researchers found clear evidence of "Python overfitting," where models that excel in Python tasks see a sharp decline in performance when tasked with other languages. Furthermore, the results showed that Python is not always a reliable proxy for a model's overall coding competence; some models that perform well in Python struggle significantly with statically typed or less common languages. The study also identified language-specific contamination, suggesting that some models may have been exposed to more training data for certain languages than others.
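One simple way to quantify the kind of gap described above is to compare a model's pass rate in Python with its pass rate in each other language. The sketch below is an assumption about how such a comparison could be computed, not the paper's metric; the per-problem outcomes are toy values invented for the example and are not results from the study.

```python
def pass_at_1(results):
    """Fraction of problems solved, given one boolean outcome per problem."""
    return sum(results) / len(results)


def python_gap(per_language):
    """Return the Python pass rate minus each other language's pass rate."""
    scores = {lang: pass_at_1(r) for lang, r in per_language.items()}
    base = scores["python"]
    return {lang: base - s for lang, s in scores.items() if lang != "python"}


# Toy outcomes for one hypothetical model on the same four problems.
toy_results = {
    "python": [True, True, True, False],
    "rust": [True, False, False, False],
    "go": [True, True, False, False],
}

gaps = python_gap(toy_results)
print(gaps)  # {'rust': 0.5, 'go': 0.25}
```

A large positive gap for most languages would be consistent with Python overfitting, while a gap that varies sharply between languages points to uneven multilingual ability.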

Why This Matters

Multi-LCB serves as a foundation for developing more robust, language-agnostic coding models. By exposing the disparities in multilingual performance, the benchmark highlights that true coding proficiency in AI requires more than just mastery of a single language. Because Multi-LCB is designed to automatically track future updates to the original LiveCodeBench, it provides a sustainable way for the research community to monitor progress and ensure that future LLMs are capable of meeting the diverse demands of modern software development.