
Key Takeaways

  • In AI-driven finance, researchers face a "pre-realization" problem: investment rationales must be evaluated long before realized market returns are observable.
  • LLM judges offer a tempting substitute for pre-deployment evaluation of AI-finance systems, but unvalidated judges may reward verbosity, confidence, or rubric mimicry rather than financial judgment.
  • This paper introduces ValueAlpha, a preregistered agreement-gated stress-test protocol for deciding when LLM-judged investment-rationale claims are publishable, qualified, or invalid.
  • A targeted anchor-specificity probe further shows that financial constructs such as constraint awareness are operationally load-bearing.
  • The contribution is therefore not a leaderboard and not a claim to measure true investment skill.
Paper Abstract

Long-horizon investment decisions create a pre-realization evaluation problem: realized returns are the eventual arbiter of investment quality, but they arrive too late and are too noisy to guide many model-development and governance decisions. LLM judges offer a tempting substitute for pre-deployment evaluation of AI-finance systems, but unvalidated judges may reward verbosity, confidence, or rubric mimicry rather than financial judgment. This paper introduces ValueAlpha, a preregistered agreement-gated stress-test protocol for deciding when LLM-judged investment-rationale claims are publishable, qualified, or invalid. In a controlled market-state capital-allocation prototype with 1,000 honest decision cycles and 100 preregistered adversarial controls (1,100 trajectories, 5,500 judge calls), ValueAlpha clears the aggregate agreement gate at κ̄_w = 0.7168 but prevents several overclaims. Lower-rank systems collapse into a tie-class, one rubric dimension fails the per-dimension gate (constraint_awareness, κ̄_w = 0.2022), single-judge rankings are family-dependent, and terse-correct rationales receive a Δ = -2.81 rubric-point penalty relative to honest rationales. A targeted anchor-specificity probe further shows that financial constructs such as constraint awareness are operationally load-bearing. The contribution is therefore not a leaderboard and not a claim to measure true investment skill. ValueAlpha is a pre-calibration metrology layer for AI-finance evaluation: it determines whether a proposed LLM-judge-based investment-rationale claim is stable enough, agreed enough, and uncontaminated enough to be reported at all.

ValueAlpha: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable
In the field of AI-driven finance, researchers often face a "pre-realization" problem: they must evaluate investment rationales and trading strategies long before the actual market returns are known. While LLMs are frequently used as judges to provide quick feedback on these rationales, they are prone to biases, such as favoring verbose or confident-sounding text over actual financial judgment. This paper introduces ValueAlpha, a rigorous, preregistered protocol designed to determine whether an LLM-based evaluation is stable, agreed-upon, and reliable enough to be reported, or if it should be qualified or discarded.

A Protocol for Permissioned Claims

ValueAlpha is not a leaderboard or a tool for measuring investment skill. Instead, it acts as a "metrology layer"—a system of checks and balances that decides if a claim made by an LLM judge is valid. The protocol produces a "verdict tuple" for every claim, which explicitly states whether the findings are publishable as a headline, should be downgraded to a methodology finding, or must be halted entirely. By requiring researchers to preregister their evaluation rules, ValueAlpha prevents the common practice of "post-hoc" tuning, where researchers might adjust their criteria after seeing the results to make their findings look more favorable.
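The gating logic described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the field names, the `gate_claim` function, and the 0.6 / 0.4 thresholds are assumptions; the paper's actual gates are preregistered values.

```python
# Hypothetical sketch of a ValueAlpha-style verdict tuple. Thresholds and
# names are illustrative assumptions, not the paper's preregistered values.
from dataclasses import dataclass

@dataclass
class Verdict:
    status: str   # "publishable" | "qualified" | "invalid"
    reason: str

def gate_claim(aggregate_kappa, per_dim_kappas, contaminated,
               agg_gate=0.6, dim_gate=0.4):
    """Decide whether a judged claim may be reported, and how."""
    if contaminated:
        # Adversarial controls exposed style-gaming; halt the claim entirely.
        return Verdict("invalid", "adversarial-control contamination detected")
    if aggregate_kappa < agg_gate:
        # Judges never converged; refuse to issue any ranking.
        return Verdict("invalid", "aggregate agreement gate failed")
    failing = [d for d, k in per_dim_kappas.items() if k < dim_gate]
    if failing:
        # Downgrade to a methodology finding rather than a headline result.
        return Verdict("qualified", f"per-dimension gate failed: {failing}")
    return Verdict("publishable", "all agreement gates cleared")
```

Using the paper's reported numbers, a claim with aggregate κ̄_w = 0.7168 but constraint_awareness at κ̄_w = 0.2022 would come back "qualified" rather than "publishable", which matches the downgrade described below.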

Stress-Testing the Judges

To test the protocol, the authors applied it to a controlled market-state capital-allocation prototype involving 1,100 decision trajectories and 5,500 individual judge calls. The protocol uses several diagnostic tools to expose potential failures:

  • Agreement Gates: It uses quadratic-weighted Cohen’s kappa to ensure judges are actually in agreement. If the judges fail to converge, the protocol refuses to issue a ranking.

  • Adversarial Controls: The researchers included specific test cases, such as "verbose-confident-but-wrong" rationales and "terse-correct" rationales, to see if the judges were being misled by writing style.

  • Stability Checks: The protocol performs "leave-one-judge-out" (LOFO) analysis to ensure that the final ranking isn't dependent on the preferences of a single judge family.
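The agreement gate above rests on quadratic-weighted Cohen's kappa, which penalizes disagreements by the squared distance between the two judges' scores. A minimal self-contained sketch of the standard statistic (not the paper's code):

```python
# Quadratic-weighted Cohen's kappa: the standard agreement statistic the
# protocol gates on. This is a generic textbook implementation for
# illustration, not code from the paper.
import numpy as np

def quadratic_weighted_kappa(ratings_a, ratings_b, n_classes):
    a, b = np.asarray(ratings_a), np.asarray(ratings_b)
    # Observed joint distribution of the two judges' scores
    obs = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        obs[i, j] += 1
    obs /= obs.sum()
    # Expected joint distribution if the judges scored independently
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    # Quadratic penalty grows with the squared score distance
    idx = np.arange(n_classes)
    w = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (w * obs).sum() / (w * exp).sum()
```

Perfect agreement yields κ_w = 1, chance-level agreement yields κ_w ≈ 0, so a gate threshold sits between the two; the paper's aggregate value of 0.7168 clears its gate while 0.2022 on one dimension does not.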
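The leave-one-judge-out stability check can likewise be sketched: recompute the system ranking with each judge family held out and flag any held-out family whose absence flips the order. The data layout and function below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical leave-one-family-out (LOFO) stability check. The input shape
# {family: {system: mean_score}} is an assumed layout for illustration.
def lofo_stable(scores_by_family):
    def ranking(families):
        totals = {}
        for fam in families:
            for system, score in scores_by_family[fam].items():
                totals[system] = totals.get(system, 0.0) + score
        return sorted(totals, key=totals.get, reverse=True)

    full = ranking(list(scores_by_family))
    for held_out in scores_by_family:
        rest = [f for f in scores_by_family if f != held_out]
        if ranking(rest) != full:
            # The ranking depends on this family: not stable.
            return False, held_out
    return True, None
```

A family-dependent ranking, as the paper reports for single-judge rankings, would fail this check and block a headline claim.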

Key Findings and Limitations

The study revealed that while the aggregate system was stable enough to clear the publication gate, it also surfaced significant issues that would have been hidden in a standard evaluation. For instance, the "constraint_awareness" dimension failed the protocol’s per-dimension agreement gate, meaning that any claims regarding this specific metric were downgraded to "methodology findings" rather than being presented as definitive results. Furthermore, the protocol discovered that the current rubric severely penalized "terse-correct" rationales (a 2.81 rubric-point gap relative to honest rationales), suggesting that LLM judges often confuse brevity with a lack of substance.
Ultimately, the authors emphasize that ValueAlpha is a starting point for better measurement discipline. It does not solve the fundamental challenge of predicting market returns, but it provides a necessary framework to ensure that the AI systems we build are being evaluated on their actual decision-making logic rather than their ability to mimic professional-sounding language.
