
Key Takeaways

  • In AI-driven finance, researchers face a "pre-realization" problem: investment rationales must be evaluated long before realized market returns are observable.
  • LLM judges offer a tempting substitute for pre-deployment evaluation of AI-finance systems, but unvalidated judges may reward verbosity, confidence, or rubric mimicry rather than financial judgment.
  • This paper introduces ValueAlpha, a preregistered agreement-gated stress-test protocol for deciding when LLM-judged investment-rationale claims are publishable, qualified, or invalid.
  • A targeted anchor-specificity probe further shows that financial constructs such as constraint awareness are operationally load-bearing.
  • The contribution is therefore not a leaderboard and not a claim to measure true investment skill.
Paper Abstract

Long-horizon investment decisions create a pre-realization evaluation problem: realized returns are the eventual arbiter of investment quality, but they arrive too late and are too noisy to guide many model-development and governance decisions. LLM judges offer a tempting substitute for pre-deployment evaluation of AI-finance systems, but unvalidated judges may reward verbosity, confidence, or rubric mimicry rather than financial judgment. This paper introduces ValueAlpha, a preregistered agreement-gated stress-test protocol for deciding when LLM-judged investment-rationale claims are publishable, qualified, or invalid. In a controlled market-state capital-allocation prototype with 1,000 honest decision cycles and 100 preregistered adversarial controls (1,100 trajectories, 5,500 judge calls), ValueAlpha clears the aggregate agreement gate at κ̄_w = 0.7168 but prevents several overclaims. Lower-rank systems collapse into a tie-class, one rubric dimension fails the per-dimension gate (constraint_awareness, κ̄_w = 0.2022), single-judge rankings are family-dependent, and terse-correct rationales receive a Δ = -2.81 rubric-point penalty relative to honest rationales. A targeted anchor-specificity probe further shows that financial constructs such as constraint awareness are operationally load-bearing. The contribution is therefore not a leaderboard and not a claim to measure true investment skill. ValueAlpha is a pre-calibration metrology layer for AI-finance evaluation: it determines whether a proposed LLM-judge-based investment-rationale claim is stable enough, agreed enough, and uncontaminated enough to be reported at all.

ValueAlpha: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable
In the field of AI-driven finance, researchers often face a "pre-realization" problem: they must evaluate investment rationales and trading strategies long before the actual market returns are known. While LLMs are frequently used as judges to provide quick feedback on these rationales, they are prone to biases, such as favoring verbose or confident-sounding text over actual financial judgment. This paper introduces ValueAlpha, a rigorous, preregistered protocol designed to determine whether an LLM-based evaluation is stable, agreed-upon, and reliable enough to be reported, or if it should be qualified or discarded.

A Protocol for Permissioned Claims

ValueAlpha is not a leaderboard or a tool for measuring investment skill. Instead, it acts as a "metrology layer"—a system of checks and balances that decides if a claim made by an LLM judge is valid. The protocol produces a "verdict tuple" for every claim, which explicitly states whether the findings are publishable as a headline, should be downgraded to a methodology finding, or must be halted entirely. By requiring researchers to preregister their evaluation rules, ValueAlpha prevents the common practice of "post-hoc" tuning, where researchers might adjust their criteria after seeing the results to make their findings look more favorable.
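The gating logic described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the field names, the `gate_claim` function, and the 0.6 / 0.4 thresholds are assumptions; the paper's actual gates are preregistered values.

```python
# Hypothetical sketch of a ValueAlpha-style verdict tuple. Thresholds and
# names are illustrative assumptions, not the paper's preregistered values.
from dataclasses import dataclass

@dataclass
class Verdict:
    status: str   # "publishable" | "qualified" | "invalid"
    reason: str

def gate_claim(aggregate_kappa, per_dim_kappas, contaminated,
               agg_gate=0.6, dim_gate=0.4):
    """Decide whether a judged claim may be reported, and how."""
    if contaminated:
        # Adversarial controls exposed style-gaming; halt the claim entirely.
        return Verdict("invalid", "adversarial-control contamination detected")
    if aggregate_kappa < agg_gate:
        # Judges never converged; refuse to issue any ranking.
        return Verdict("invalid", "aggregate agreement gate failed")
    failing = [d for d, k in per_dim_kappas.items() if k < dim_gate]
    if failing:
        # Downgrade to a methodology finding rather than a headline result.
        return Verdict("qualified", f"per-dimension gate failed: {failing}")
    return Verdict("publishable", "all agreement gates cleared")
```

Using the paper's reported numbers, a claim with aggregate κ̄_w = 0.7168 but constraint_awareness at κ̄_w = 0.2022 would come back "qualified" rather than "publishable", which matches the downgrade described below.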

Stress-Testing the Judges

To test the protocol, the authors applied it to a controlled market-state capital-allocation prototype involving 1,100 decision trajectories and 5,500 individual judge calls. The protocol uses several diagnostic tools to expose potential failures:

  • Agreement Gates: It uses quadratic-weighted Cohen’s kappa to ensure judges are actually in agreement. If the judges fail to converge, the protocol refuses to issue a ranking.

  • Adversarial Controls: The researchers included specific test cases, such as "verbose-confident-but-wrong" rationales and "terse-correct" rationales, to see if the judges were being misled by writing style.

  • Stability Checks: The protocol performs "leave-one-judge-out" (LOFO) analysis to ensure that the final ranking isn't dependent on the preferences of a single judge family.
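The agreement gate above rests on quadratic-weighted Cohen's kappa, which penalizes disagreements by the squared distance between the two judges' scores. A minimal self-contained sketch of the standard statistic (not the paper's code):

```python
# Quadratic-weighted Cohen's kappa: the standard agreement statistic the
# protocol gates on. This is a generic textbook implementation for
# illustration, not code from the paper.
import numpy as np

def quadratic_weighted_kappa(ratings_a, ratings_b, n_classes):
    a, b = np.asarray(ratings_a), np.asarray(ratings_b)
    # Observed joint distribution of the two judges' scores
    obs = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        obs[i, j] += 1
    obs /= obs.sum()
    # Expected joint distribution if the judges scored independently
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    # Quadratic penalty grows with the squared score distance
    idx = np.arange(n_classes)
    w = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (w * obs).sum() / (w * exp).sum()
```

Perfect agreement yields κ_w = 1, chance-level agreement yields κ_w ≈ 0, so a gate threshold sits between the two; the paper's aggregate value of 0.7168 clears its gate while 0.2022 on one dimension does not.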
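The leave-one-judge-out stability check can likewise be sketched: recompute the system ranking with each judge family held out and flag any held-out family whose absence flips the order. The data layout and function below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical leave-one-family-out (LOFO) stability check. The input shape
# {family: {system: mean_score}} is an assumed layout for illustration.
def lofo_stable(scores_by_family):
    def ranking(families):
        totals = {}
        for fam in families:
            for system, score in scores_by_family[fam].items():
                totals[system] = totals.get(system, 0.0) + score
        return sorted(totals, key=totals.get, reverse=True)

    full = ranking(list(scores_by_family))
    for held_out in scores_by_family:
        rest = [f for f in scores_by_family if f != held_out]
        if ranking(rest) != full:
            # The ranking depends on this family: not stable.
            return False, held_out
    return True, None
```

A family-dependent ranking, as the paper reports for single-judge rankings, would fail this check and block a headline claim.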

Key Findings and Limitations

The study revealed that while the aggregate system was stable enough to clear the publication gate, it also surfaced significant issues that would have been hidden in a standard evaluation. For instance, the "constraint_awareness" dimension failed the protocol’s per-dimension agreement gate, meaning that any claims regarding this specific metric were downgraded to "methodology findings" rather than being presented as definitive results. Furthermore, the protocol discovered that the current rubric severely penalized "terse-correct" rationales (a 2.81 rubric-point gap relative to honest rationales), suggesting that LLM judges often confuse brevity with a lack of substance.
Ultimately, the authors emphasize that ValueAlpha is a starting point for better measurement discipline. It does not solve the fundamental challenge of predicting market returns, but it provides a necessary framework to ensure that the AI systems we build are being evaluated on their actual decision-making logic rather than their ability to mimic professional-sounding language.
