STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
As Large Language Models (LLMs) become central to industries like finance and healthcare, the need for reliable, domain-specific, and multilingual testing has grown. However, creating high-quality evaluation datasets manually is slow, expensive, and often restricted by privacy concerns. STELLAR-E is a fully automated system designed to solve this by generating high-quality, synthetic instruction-answer datasets. It allows researchers to create custom benchmarks without relying on existing sensitive data, providing a faster and more scalable way to test LLM performance, safety, and compliance.
How the System Works
STELLAR-E operates through a multi-stage, automated pipeline that mimics human-like quality control. It begins by defining specific "Question Types" to target different domains. The system then generates topics and instructions, which are refined through an iterative feedback loop. To ensure the data is challenging and diverse, the pipeline includes two key features: Difficulty Enhancement (DFE), which paraphrases instructions to make them more complex, and Diversity Enhancement (DVE), which uses embedding models to remove redundant or similar questions. Throughout this process, a custom version of the "G-Eval" framework—using LLMs as judges—scores the content based on criteria like correctness, relevance, and safety.
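The Diversity Enhancement step can be sketched as a greedy embedding-based filter: each candidate instruction is kept only if it is sufficiently dissimilar from everything already kept. This is a minimal illustration, not the paper's implementation; the toy 2-D vectors stand in for real sentence-embedding output, and the 0.9 threshold is an assumed value.

```python
import numpy as np

def deduplicate(instructions, embeddings, threshold=0.9):
    """Greedy diversity filter: keep an instruction only if its cosine
    similarity to every previously kept instruction is below threshold."""
    kept_idx = []
    # Normalize once so a plain dot product equals cosine similarity.
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    for i in range(len(instructions)):
        if all(norms[i] @ norms[j] < threshold for j in kept_idx):
            kept_idx.append(i)
    return [instructions[i] for i in kept_idx]

# Toy embeddings standing in for a real embedding model's output.
instructions = ["Define net present value.",
                "Explain net present value.",   # near-duplicate of the first
                "List the symptoms of sepsis."]
vecs = np.array([[1.0, 0.0],
                 [0.98, 0.05],
                 [0.0, 1.0]])
print(deduplicate(instructions, vecs))  # the near-duplicate is dropped
```

In a real pipeline the threshold trades coverage against redundancy: a lower value prunes more aggressively, yielding a smaller but more varied benchmark.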
A Modular and Flexible Pipeline
A core strength of STELLAR-E is its modular design. Because each stage of the pipeline is independent, the system is robust and can recover easily if a specific step fails. By separating the generation of instructions from the generation of answers, the system can apply different quality checks to each, reducing the likelihood of hallucinations or biased content. This flexibility allows users to control the language, format, and volume of the generated data, making it adaptable to both high-resource languages like English and lower-resource languages that often lack sufficient evaluation materials.
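The staged design described above can be sketched as follows. This is an assumed structure, not the authors' code: each stage caches its output so a failed step can be retried without regenerating upstream results, and instructions and answers pass through separate, hypothetical quality gates.

```python
def check_instruction(text):
    # Placeholder quality gate for instructions (e.g. minimum length).
    return len(text.split()) >= 4

def check_answer(text):
    # A different gate for answers (e.g. non-empty, not a refusal).
    return bool(text) and "i don't know" not in text.lower()

class Pipeline:
    """Each stage runs independently; results are checkpointed so a
    failure in one stage does not force re-running earlier stages."""
    def __init__(self):
        self.checkpoints = {}

    def run_stage(self, name, fn, *args):
        # Reuse a cached result if this stage already succeeded.
        if name not in self.checkpoints:
            self.checkpoints[name] = fn(*args)
        return self.checkpoints[name]

# Stand-in generators; a real system would call an LLM at each step.
gen_instruction = lambda topic: f"Explain the concept of {topic} with an example."
gen_answer = lambda instr: f"Answer to: {instr}"

pipe = Pipeline()
instr = pipe.run_stage("instruction", gen_instruction, "compound interest")
assert check_instruction(instr)   # instruction-specific check
answer = pipe.run_stage("answer", gen_answer, instr)
assert check_answer(answer)       # answer-specific check
print(instr)
```

Separating the two generation stages is what lets different checks apply to each: an instruction can be validated for clarity and format before any answer is generated against it.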
Performance and Results
To validate the effectiveness of the synthetic datasets, the researchers compared them against existing human-curated benchmarks. On average, STELLAR-E's synthetic datasets scored 5.7% higher in "LLM-as-a-judge" evaluations than the traditional benchmarks. This indicates that the synthetic data matches or exceeds the quality of human-made sets and is effective at evaluating both large and small LLMs.
Important Considerations
While STELLAR-E offers a powerful alternative to manual benchmarking, the authors note that real-world datasets can still be more challenging for smaller models. Additionally, the system relies on LLMs to act as both generators and judges, which means the quality of the output is inherently tied to the capabilities of the models used in the pipeline. Despite these factors, the framework provides a scalable, domain-adaptable solution that helps organizations implement high-efficiency quality assurance cycles for their LLM applications.