AI Research

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

Paper Abstract

Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.

The Problem with Current Biomedical Encoders

The paper "Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery" addresses a critical failure in off-the-shelf biomedical encoders. Asked to compare unrelated concepts, such as "cortisol 28 ug/dL" and "stock-market volatility", a pretrained model returns a cosine similarity of 0.83, and every encoder the authors tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92. Retrieval systems tolerate this because a downstream language model filters the noise. A Large Behavioural Model (LBM) cannot: it reasons over a graph of a person's life and treats embedding proximity as evidence that two events are causally linked, so false proximity writes a false causal edge and everything downstream inherits the error.
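
The failure mode can be sketched in a few lines. The example below is a minimal illustration, not the paper's code: the vectors are made-up toy values, and `propose_edge` and its 0.7 threshold are hypothetical, chosen only to show how a similarity threshold turns false proximity into a graph edge.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def propose_edge(u, v, threshold=0.7):
    """Hypothetical rule: propose a candidate edge when similarity is high."""
    return cosine_similarity(u, v) >= threshold


# Toy embeddings, invented for illustration. A large shared component
# dominates both vectors, so they score high even though the remaining
# coordinates differ.
cortisol = [0.9, 0.8, 0.1, 0.3]
stock_volatility = [0.9, 0.8, 0.3, 0.1]

similarity = cosine_similarity(cortisol, stock_volatility)
print(f"similarity = {similarity:.3f}")
print("edge proposed:", propose_edge(cortisol, stock_volatility))
```

With these toy values the threshold rule proposes an edge between two concepts that share no mechanism, which is the error the paper describes. No threshold can fix this if unrelated pairs score as high as related ones.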

Improving Embedding Accuracy

To fix this, the authors refine the encoder in two steps. First, a contrastive training pass over 72,034 pairs raises PubMedBERT's BIOSSES correlation from 0.633 to 0.828 and its within-vs-across-domain separation from 1.05x to 1.63x. Second, a pass called BODHI mines "hard negatives" from edges that are absent in a biomedical knowledge graph. Hard negatives are pairs the graph does not connect but which the encoder may still place close together, which makes them a more informative training signal than obviously unrelated pairs. BODHI lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% cost in BIOSSES.
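
One way to read "hard negatives from edges absent in a knowledge graph" is sketched below. This is an illustration under stated assumptions, not the BODHI generator itself: the tiny graph, the similarity scores, and the function `mine_hard_negatives` are all hypothetical. The idea shown is to keep pairs the graph does not connect, then rank them by how similar the current encoder thinks they are, so the most confusable pairs come first.

```python
from itertools import combinations


def mine_hard_negatives(nodes, edges, similarity, top_k=2):
    """Return the top_k unconnected pairs with the highest similarity.

    nodes: iterable of concept names.
    edges: iterable of (a, b) pairs, treated as undirected.
    similarity: callable (a, b) -> float, e.g. encoder cosine similarity.
    """
    connected = {frozenset(edge) for edge in edges}
    candidates = [
        (a, b)
        for a, b in combinations(sorted(nodes), 2)
        if frozenset((a, b)) not in connected
    ]
    # Hardest negatives first: not linked in the graph, yet scored as
    # similar by the encoder.
    candidates.sort(key=lambda pair: similarity(*pair), reverse=True)
    return candidates[:top_k]


# Hypothetical toy graph and made-up encoder scores for the non-edges.
nodes = ["cortisol", "stress", "stock volatility", "interest rates"]
edges = [("cortisol", "stress"), ("stock volatility", "interest rates")]
scores = {
    frozenset(("cortisol", "stock volatility")): 0.83,
    frozenset(("cortisol", "interest rates")): 0.40,
    frozenset(("stress", "stock volatility")): 0.78,
    frozenset(("stress", "interest rates")): 0.35,
}


def toy_similarity(a, b):
    return scores[frozenset((a, b))]


hard_negatives = mine_hard_negatives(nodes, edges, toy_similarity)
print(hard_negatives)
```

An absent edge is only weak evidence that two concepts are unrelated, since knowledge graphs are incomplete. A real pipeline would likely need extra filtering before treating such pairs as negatives.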

Performance and Hardware Optimization

The authors also address deployment on an Intel Xeon 6737P with AMX acceleration. With OpenVINO, single-query latency drops from 1367 ms to 10 ms, which the paper reports as a 133x speedup, and throughput reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this hardware at every serving batch size, and the authors explain why. The same model runs 13-27x slower on an Ice Lake instance without AMX.
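
Latency and throughput figures of this kind typically come from a small timing harness. The sketch below is a generic illustration using only the Python standard library. `fake_encode` is a hypothetical stand-in for a real model call, and nothing here reflects the authors' OpenVINO scripts. Note that 1367 ms divided by 10 ms is about 137x, so the reported 133x presumably derives from unrounded latencies.

```python
import statistics
import time


def measure_latency_ms(fn, payload, warmup=3, runs=20):
    """Median wall-clock latency of fn(payload) in milliseconds."""
    for _ in range(warmup):
        fn(payload)  # warm caches before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def throughput_per_sec(batch_size, batch_latency_ms):
    """Items processed per second for one batch at the given latency."""
    return batch_size * 1000.0 / batch_latency_ms


def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized path is than the baseline."""
    return baseline_ms / optimized_ms


def fake_encode(sentences):
    """Hypothetical stand-in for an encoder call; does trivial work."""
    return [len(s) for s in sentences]


latency = measure_latency_ms(fake_encode, ["cortisol 28 ug/dL"])
print(f"median latency: {latency:.4f} ms")
print(f"speedup from rounded figures: {speedup(1367, 10):.1f}x")
```

The median is used rather than the mean so that a single slow run does not skew the result. Comparing precisions such as FP16 and INT8 would mean running the same harness once per configuration and batch size.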

Key Takeaways for Future Research

The authors emphasize that for LBMs, embedding geometry is not merely a technical detail—it is the foundation of correctness. By releasing their benchmark suite, training corpora, the BODHI generator, and OpenVINO scripts, they aim to provide the community with the tools necessary to build more reliable causal discovery systems. The findings highlight that as we move toward models that reason over individual human data, the accuracy of the underlying embeddings becomes a primary requirement for preventing the propagation of false causal edges.
