STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

Key Takeaways

  • As Large Language Models (LLMs) become central to industries like finance and healthcare, the need for reliable, domain-specific, and multilingual testing has grown.
  • Existing automated benchmarking methods are often limited by reliance on pre-existing data, poor scalability, single-domain focus, and lack of multilingual support.
  • STELLAR-E is a fully automated system that generates high-quality synthetic datasets of custom size from minimal human input, without depending on existing datasets.
  • Its synthetic datasets achieve an average difference of +5.7% in LLM-as-a-judge scores against existing language-specific benchmarks, demonstrating quality comparable to human-curated sets.
Paper Abstract

The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, collecting such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost of manual creation. Existing automated benchmarking methods are often limited by reliance on pre-existing data, poor scalability, single-domain focus, and lack of multilingual support. We present STELLAR-E, a fully automated system that generates high-quality synthetic datasets of custom size from minimal human input, without depending on existing datasets. The system is structured in two stages: (1) a synthetic data engine, built by modifying the TGRT Self-Instruct framework, that enables controllable, custom synthetic dataset generation, and (2) an evaluation pipeline incorporating statistical and LLM-based metrics to assess the applicability of the synthetic dataset for LLM-based application evaluations. The synthetic datasets achieve an average difference of +5.7% in LLM-as-a-judge scores against existing language-specific benchmarks, demonstrating comparable quality for the comprehensive assessment of both large and small LLMs. While real datasets remain slightly more challenging for LLMs, especially smaller models, this work establishes a scalable and domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offering a faster alternative to manual approaches and enabling high-efficiency automated quality assurance cycles.

Overview
As Large Language Models (LLMs) become central to industries like finance and healthcare, the need for reliable, domain-specific, and multilingual testing has grown. However, creating high-quality evaluation datasets manually is slow, expensive, and often restricted by privacy concerns. STELLAR-E is a fully automated system designed to solve this by generating high-quality, synthetic instruction-answer datasets. It allows researchers to create custom benchmarks without relying on existing sensitive data, providing a faster and more scalable way to test LLM performance, safety, and compliance.

How the System Works

STELLAR-E operates through a multi-stage, automated pipeline that mimics human-like quality control. It begins by defining specific "Question Types" to target different domains. The system then generates topics and instructions, which are refined through an iterative feedback loop. To ensure the data is challenging and diverse, the pipeline includes two key features: Difficulty Enhancement (DFE), which paraphrases instructions to make them more complex, and Diversity Enhancement (DVE), which uses embedding models to remove redundant or similar questions. Throughout this process, a custom version of the "G-Eval" framework—using LLMs as judges—scores the content based on criteria like correctness, relevance, and safety.
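
The paper's prompts and thresholds are not reproduced here, but the flow just described can be sketched roughly in Python. In this illustrative sketch, call_llm, embed, the prompt strings, and the 0.9 similarity threshold are all assumptions standing in for the real system's components, not the authors' implementation:

```python
"""Illustrative sketch of a STELLAR-E-style generation loop.

`call_llm` and `embed` are hypothetical stand-ins for whatever model APIs
the real system uses; prompts and thresholds here are assumptions.
"""
from math import sqrt


def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a generator/judge LLM here")


def embed(text: str) -> list[float]:
    raise NotImplementedError("plug in an embedding model here")


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def diversity_filter(instructions: list[str], threshold: float = 0.9) -> list[str]:
    """DVE: drop instructions whose embedding is too close to one already kept."""
    kept, kept_vecs = [], []
    for text in instructions:
        vec = embed(text)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept


def generate_dataset(question_type: str, n_items: int, min_score: float = 0.8) -> list[dict]:
    """One pass: topics -> instructions -> DFE -> DVE -> answers -> judging."""
    topics = call_llm(f"List {n_items} topics for question type: {question_type}")
    instructions = [
        call_llm(f"Write one {question_type} instruction about: {topic}")
        for topic in topics.splitlines()
    ]
    # DFE: paraphrase each instruction to make it more complex.
    hardened = [call_llm(f"Rewrite this to be harder: {ins}") for ins in instructions]
    dataset = []
    for ins in diversity_filter(hardened):
        answer = call_llm(f"Answer the instruction: {ins}")
        # G-Eval-style judging: an LLM scores correctness, relevance, safety.
        score = float(call_llm(f"Score 0-1 on correctness/relevance/safety:\n{ins}\n{answer}"))
        if score >= min_score:
            dataset.append({"instruction": ins, "answer": answer, "score": score})
    return dataset
```

Note that in this sketch DFE and DVE act on instructions before any answers exist, so the more expensive answer generation and judging only run on the hardened, de-duplicated set.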

A Modular and Flexible Pipeline

A core strength of STELLAR-E is its modular design. Because each stage of the pipeline is independent, the system is robust and can recover easily if a specific step fails. By separating the generation of instructions from the generation of answers, the system can apply different quality checks to each, reducing the likelihood of hallucinations or biased content. This flexibility allows users to control the language, format, and volume of the generated data, making it adaptable to both high-resource languages like English and lower-resource languages that often lack sufficient evaluation materials.
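
One way to realize this stage independence is stage-level checkpointing: each stage persists its output, so a crash in one step never forces earlier steps to rerun. The sketch below is our illustration of that idea, not the authors' code; the function names, file layout, and JSON format are all assumptions:

```python
"""Illustrative stage-isolation sketch: each pipeline stage caches its
output on disk, so a failure in one stage never forces earlier stages
to rerun. File layout and JSON format are assumptions."""
import json
from pathlib import Path
from typing import Any, Callable


def run_stage(name: str, fn: Callable[[Any], Any], inputs: Any, workdir: Path) -> Any:
    """Run `fn` once and cache the result; later runs reuse the cache."""
    out_path = workdir / f"{name}.json"
    if out_path.exists():  # stage already completed in an earlier run
        return json.loads(out_path.read_text())
    result = fn(inputs)
    out_path.write_text(json.dumps(result))
    return result


def pipeline(workdir: Path, gen_instructions, gen_answers, judge):
    workdir.mkdir(parents=True, exist_ok=True)
    # Instructions and answers are produced by separate stages, so each
    # can get its own quality checks before the next stage consumes it.
    instructions = run_stage("instructions", gen_instructions, None, workdir)
    answers = run_stage("answers", gen_answers, instructions, workdir)
    return run_stage("judged", judge, answers, workdir)
```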

Performance and Results

To validate the effectiveness of the synthetic datasets, the researchers compared them against existing human-curated benchmarks. Under "LLM-as-a-judge" evaluation, models scored on average 5.7% higher on STELLAR-E's synthetic datasets than on the traditional benchmarks. This gap is small enough to indicate that the synthetic data is of comparable quality to human-made sets and is effective at evaluating both large and small LLMs.
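
Read as a simple mean of per-benchmark score deltas (our assumption; the paper's exact aggregation is not given here), the arithmetic behind such a figure looks like this, with placeholder numbers rather than the paper's measurements:

```python
# Illustrative arithmetic only: the scores below are placeholders, not the
# paper's measurements (the paper's own reported figure is +5.7%).
synthetic_scores = {"benchmark_a": 0.81, "benchmark_b": 0.72, "benchmark_c": 0.68}
real_scores = {"benchmark_a": 0.76, "benchmark_b": 0.70, "benchmark_c": 0.60}

deltas = [synthetic_scores[k] - real_scores[k] for k in synthetic_scores]
mean_delta = 100 * sum(deltas) / len(deltas)
print(f"average judge-score difference: {mean_delta:+.1f}%")  # prints +5.0% here
```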

Important Considerations

While STELLAR-E offers a powerful alternative to manual benchmarking, the authors note that real-world datasets can still be more challenging for smaller models. Additionally, the system relies on LLMs to act as both generators and judges, which means the quality of the output is inherently tied to the capabilities of the models used in the pipeline. Despite these factors, the framework provides a scalable, domain-adaptable solution that helps organizations implement high-efficiency quality assurance cycles for their LLM applications.
