Meta Introduces TRIBE v2 for Predictive Brain Activity Mapping

Key Takeaways

  • Bridges the gap between AI and neuroscience by using foundation models to predict human brain activity across video, audio, and text.
  • Demonstrates that brain encoding models follow log-linear scaling laws, suggesting predictive accuracy will improve as neuroimaging data grows.
  • Enables 'in-silico' neuroscience, allowing researchers to run virtual experiments and identify functional brain networks without human subjects.

Meta’s Fundamental AI Research (FAIR) team has introduced TRIBE v2, a tri-modal foundation model that offers a unified framework for understanding how the human brain integrates multisensory information. Traditional neuroscientific research has often mapped cognitive functions to isolated brain regions using narrow experimental paradigms. TRIBE v2 instead aligns the latent representations of state-of-the-art AI architectures with human brain activity to predict high-resolution fMRI responses across diverse naturalistic and experimental conditions.

Multi-modal Architecture and Integration

TRIBE v2 builds on the representational alignment between deep neural networks and the primate brain. The architecture uses three frozen foundation models as feature extractors: LLaMA 3.2-3B for text, V-JEPA2-Giant for video, and Wav2Vec-BERT 2.0 for audio. These specialized encoders turn stimuli into contextualized embeddings, which are then compressed to a shared dimensionality and concatenated into a multi-modal time series.
This combined sequence is fed into a Transformer encoder with eight layers and eight attention heads, which exchanges information across a 100-second window. Finally, a subject-specific prediction block projects the resulting latent representations onto 20,484 cortical vertices and 8,802 subcortical voxels, predicting brain activity at a rate of 1 Hz.
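
The tensor shapes implied by this description can be tallied in a short sketch. It is bookkeeping only, not the released implementation: the class name TribeShapes and the shared_dim value of 384 are assumptions made for illustration, while the remaining numbers come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TribeShapes:
    """Hypothetical shape bookkeeping for the pipeline described above."""

    n_modalities: int = 3            # text, video, audio
    shared_dim: int = 384            # ASSUMPTION: per-modality width after compression
    window_seconds: int = 100        # context window of the Transformer encoder
    sample_rate_hz: int = 1          # prediction rate
    n_layers: int = 8
    n_heads: int = 8
    cortical_vertices: int = 20_484
    subcortical_voxels: int = 8_802

    @property
    def timesteps(self) -> int:
        # Number of time points the encoder attends over in one window.
        return self.window_seconds * self.sample_rate_hz

    @property
    def model_width(self) -> int:
        # Concatenating the compressed modalities along the feature axis.
        return self.n_modalities * self.shared_dim

    @property
    def n_targets(self) -> int:
        # Outputs of the subject-specific prediction block per time point.
        return self.cortical_vertices + self.subcortical_voxels


shapes = TribeShapes()
encoder_input_shape = (shapes.timesteps, shapes.model_width)
prediction_shape = (shapes.timesteps, shapes.n_targets)
print(encoder_input_shape, prediction_shape)
```

Under these assumptions, each window is a 100-step sequence, and the prediction block emits 29,286 values per step (20,484 cortical plus 8,802 subcortical).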

Scaling Laws and Predictive Performance

To address the challenge of data scarcity in brain encoding, the research team trained TRIBE v2 on 451.6 hours of fMRI data from 25 subjects. Evaluation was conducted across a broader collection totaling 1,117.7 hours from 720 subjects. The team observed a log-linear increase in encoding accuracy as training data volume expanded, suggesting that the model's predictive power will continue to scale as neuroimaging repositories grow.
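
A log-linear trend means accuracy grows by a roughly constant amount each time the training data is multiplied by a fixed factor. The sketch below fits score = a + b * log10(hours) by ordinary least squares. The data points are invented purely to illustrate the fit and are not results from TRIBE v2; the function name fit_log_linear is likewise hypothetical.

```python
import math


def fit_log_linear(hours, scores):
    """Least-squares fit of score = a + b * log10(hours)."""
    xs = [math.log10(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# ASSUMPTION: synthetic, noise-free points generated from a known line,
# used only to show that the fit recovers the intercept and slope.
hours = [10, 30, 100, 300]
scores = [0.10 + 0.05 * math.log10(h) for h in hours]

a, b = fit_log_linear(hours, scores)
predicted_at_1000 = a + b * math.log10(1000)
print(f"intercept={a:.3f}, gain per 10x data={b:.3f}")
```

Extrapolating such a fit beyond the observed range, as predicted_at_1000 does here, assumes the trend continues, which is the hypothesis the scaling result suggests rather than a guarantee.
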
The model improves significantly on traditional Finite Impulse Response (FIR) models. Notably, TRIBE v2 shows strong zero-shot generalization: its prediction of a new cohort's group-averaged response is more accurate than the actual recordings of many individual subjects. In the Human Connectome Project 7T dataset, the model achieved a group correlation near 0.4, a two-fold improvement over the median subject’s group-predictivity.
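
One way to read the comparison above is sketched here on synthetic data. It assumes that a subject's group-predictivity is the Pearson correlation between that subject's recording and the average of the other subjects, and that the model's group correlation is measured against the average of all subjects; the article does not spell out either definition. All signals and noise levels are made up.

```python
import random
from statistics import median


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mean_series(series_list):
    """Point-by-point average of several time series."""
    return [sum(vals) / len(vals) for vals in zip(*series_list)]


rng = random.Random(0)
n_timepoints, n_subjects = 200, 10

# ASSUMPTION: each subject is a shared signal plus heavy individual noise,
# and the model output is the same signal plus lighter noise.
shared = [rng.gauss(0, 1) for _ in range(n_timepoints)]
subjects = [[s + rng.gauss(0, 2.0) for s in shared] for _ in range(n_subjects)]
model_prediction = [s + rng.gauss(0, 0.5) for s in shared]

# Group-predictivity of each subject against the average of the others.
subject_scores = [
    pearson(subj, mean_series(subjects[:i] + subjects[i + 1:]))
    for i, subj in enumerate(subjects)
]
median_subject_score = median(subject_scores)

# Model prediction against the full group average.
model_score = pearson(model_prediction, mean_series(subjects))
print(f"median subject: {median_subject_score:.2f}, model: {model_score:.2f}")
```

Because single-subject recordings are noisy, a model that captures the shared signal can track the group average more closely than a typical individual does, which is the pattern the article reports.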

In-Silico Neuroscience and Interpretability

Beyond its predictive capabilities, TRIBE v2 serves as a tool for in-silico experimentation, enabling researchers to conduct virtual neuroscientific tests. By running simulations on the Individual Brain Charting dataset, the model successfully recovered classic functional landmarks, including the fusiform face area, the parahippocampal place area, the temporo-parietal junction, and Broca’s area.
Furthermore, applying Independent Component Analysis to the model’s final layer revealed that TRIBE v2 naturally learns five well-known functional networks: primary auditory, language, motion, default mode, and visual. When provided with as little as one hour of data for a new participant, fine-tuning the model for a single epoch leads to a two- to four-fold improvement over linear models trained from scratch.
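
A virtual localizer experiment of the kind described in this section can be sketched as a contrast between predicted responses to two stimulus categories. The example below is a toy illustration with five made-up vertices and invented values. The function names are hypothetical, and the procedure is a generic contrast analysis, not necessarily the one used by the authors.

```python
def contrast_map(responses_a, responses_b):
    """Per-vertex difference between mean responses to conditions A and B.

    Each argument is a list of trials; each trial is a list of per-vertex
    predicted responses.
    """
    mean_a = [sum(v) / len(v) for v in zip(*responses_a)]
    mean_b = [sum(v) / len(v) for v in zip(*responses_b)]
    return [a - b for a, b in zip(mean_a, mean_b)]


def top_vertices(contrast, k):
    """Indices of the k vertices with the largest contrast values."""
    order = sorted(range(len(contrast)), key=lambda i: contrast[i], reverse=True)
    return order[:k]


# ASSUMPTION: toy predicted responses for 5 vertices and 2 trials per
# condition, standing in for model outputs on face and place stimuli.
faces = [[0.1, 0.9, 0.2, 0.8, 0.1],
         [0.2, 1.1, 0.1, 0.7, 0.0]]
places = [[0.1, 0.2, 0.9, 0.3, 0.1],
          [0.2, 0.1, 1.0, 0.2, 0.1]]

contrast = contrast_map(faces, places)
face_selective = top_vertices(contrast, 2)
print(face_selective)
```

In this toy setup, vertices with a large positive contrast play the role of a face-selective region, and those with a large negative contrast play the role of a place-selective one.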
