Meta’s Fundamental AI Research (FAIR) team has introduced TRIBE v2, a tri-modal foundation model intended to give neuroscience a unified framework for understanding how the human brain integrates multisensory information. Traditional neuroscientific research has often mapped cognitive functions to isolated brain regions using narrow experimental paradigms. TRIBE v2 instead aligns the latent representations of state-of-the-art AI architectures with human brain activity to predict high-resolution fMRI responses across diverse naturalistic and experimental conditions.
Multi-modal Architecture and Integration
TRIBE v2 relies on representational alignment between deep neural networks and the primate brain. The architecture uses three frozen foundation models as feature extractors: LLaMA 3.2-3B for text, V-JEPA2-Giant for video, and Wav2Vec-BERT 2.0 for audio. These encoders turn stimuli into contextualized embeddings, which are then compressed to a shared dimension and concatenated into a multi-modal time series.
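The compress-and-concatenate step can be sketched in a few lines. This is only an illustration, not the team's implementation: the article does not say how the compression is done or along which axis the embeddings are joined, so the linear projection, the feature-axis concatenation, and every size below (`RAW_DIMS`, `SHARED_DIM`, `T`) are assumptions, and random arrays stand in for the frozen encoders' outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
T = 100                                             # time steps in one window
RAW_DIMS = {"text": 48, "video": 64, "audio": 32}   # made-up encoder widths
SHARED_DIM = 16                                     # made-up shared width
MODALITIES = ("text", "video", "audio")


def compress(features, weights):
    """Project (T, raw_dim) features to (T, SHARED_DIM) with a linear map.

    A linear projection is an assumption; the article only says the
    embeddings are "compressed into a shared dimension".
    """
    return features @ weights


def fuse(features_by_modality, weights_by_modality):
    """Compress each modality, then concatenate along the feature axis."""
    compressed = [
        compress(features_by_modality[name], weights_by_modality[name])
        for name in MODALITIES
    ]
    return np.concatenate(compressed, axis=-1)


# Random stand-ins for the frozen encoders' contextualized embeddings.
features = {name: rng.standard_normal((T, dim)) for name, dim in RAW_DIMS.items()}
weights = {name: rng.standard_normal((dim, SHARED_DIM)) for name, dim in RAW_DIMS.items()}

fused = fuse(features, weights)
print(fused.shape)  # (100, 48): one row per time step, three modalities side by side
```

Under these assumptions, the fused array is the multi-modal time series that a sequence model would consume.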
This combined sequence is fed into a Transformer encoder with eight layers and eight attention heads, which exchanges information across a 100-second window. Finally, a subject-specific prediction block projects the resulting latent representations onto 20,484 cortical vertices and 8,802 subcortical voxels, predicting brain activity at a 1 Hz sampling rate.
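To make the output dimensions concrete, here is a minimal sketch of a subject-specific readout. The vertex and voxel counts, the 100-second window, and the 1 Hz rate come from the text above. Everything else is assumed for illustration: the `SubjectReadout` class, the one-linear-map-per-subject design, `LATENT_DIM`, and the subject labels are hypothetical, and random values stand in for the Transformer's output.

```python
import numpy as np

N_CORTICAL_VERTICES = 20_484
N_SUBCORTICAL_VOXELS = 8_802
N_TARGETS = N_CORTICAL_VERTICES + N_SUBCORTICAL_VOXELS  # 29,286 prediction targets

WINDOW_SECONDS = 100
SAMPLING_HZ = 1
T = WINDOW_SECONDS * SAMPLING_HZ   # 100 time steps per window

LATENT_DIM = 16                    # hypothetical latent width


class SubjectReadout:
    """One linear map per subject from latent space to brain targets (assumed design)."""

    def __init__(self, subject_ids, latent_dim, n_targets, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = {
            subject: 0.01 * rng.standard_normal((latent_dim, n_targets))
            for subject in subject_ids
        }

    def predict(self, latents, subject_id):
        """Map (T, latent_dim) latents to (T, n_targets) predicted responses."""
        return latents @ self.weights[subject_id]


readout = SubjectReadout(["sub-01", "sub-02"], LATENT_DIM, N_TARGETS)
latents = np.random.default_rng(1).standard_normal((T, LATENT_DIM))  # stand-in latents

pred = readout.predict(latents, "sub-01")
cortical = pred[:, :N_CORTICAL_VERTICES]      # surface vertices
subcortical = pred[:, N_CORTICAL_VERTICES:]   # volumetric voxels
print(pred.shape, cortical.shape, subcortical.shape)
```

Keeping the shared layers common to all subjects and swapping only the final map is one simple way to realize a "subject-specific prediction block"; the article does not describe the actual design.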
Scaling Laws and Predictive Performance
To address data scarcity in brain encoding, the research team trained TRIBE v2 on 451.6 hours of fMRI data from 25 subjects. Evaluation was conducted across a broader collection totaling 1,117.7 hours from 720 subjects. The team observed a log-linear increase in encoding accuracy as the volume of training data grew, which suggests that the model's predictive power may keep improving as neuroimaging repositories expand.
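A log-linear trend means accuracy grows roughly as a straight line in the logarithm of training hours, i.e. accuracy ≈ a + b · log10(hours). The sketch below fits that line by ordinary least squares. The data points are synthetic and generated from the formula itself, purely to exercise the code; they are not results from TRIBE v2, and the function name `fit_log_linear` is made up for this example.

```python
import math


def fit_log_linear(hours, scores):
    """Least-squares fit of scores = intercept + slope * log10(hours)."""
    xs = [math.log10(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic example: NOT measurements from the paper.
hours = [1, 10, 100, 1000]
scores = [0.10 + 0.05 * math.log10(h) for h in hours]

slope, intercept = fit_log_linear(hours, scores)
print(f"gain per tenfold increase in data: {slope:.3f}")
```

Under such a fit, the slope is the accuracy gained for every tenfold increase in data, which is why each additional improvement requires far more recording time than the last.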
The model demonstrates significant improvements over traditional Finite Impulse Response (FIR) models. Notably, TRIBE v2 exhibits strong zero-shot generalization: its prediction of a new cohort's group-averaged response is more accurate than the actual recordings of many individual subjects in that cohort. In the Human Connectome Project 7T dataset, the model achieved a group correlation near 0.4, a two-fold improvement over the median subject’s group-predictivity.
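One way to read "group-predictivity" is as the correlation between a time series and the group-averaged response. The sketch below compares a model prediction against individual subjects on that measure. It rests on several assumptions not confirmed by the article: Pearson correlation as the score, a leave-one-out group mean for each subject, and a single time series per subject. The data are synthetic (a shared signal plus per-subject noise), so the printed numbers say nothing about TRIBE v2 itself.

```python
import math
import random
from statistics import median


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def group_mean(series_list):
    """Average several time series point by point."""
    return [sum(values) / len(values) for values in zip(*series_list)]


def subject_group_predictivity(subjects):
    """Correlate each subject with the mean of all *other* subjects."""
    scores = []
    for i, series in enumerate(subjects):
        others = subjects[:i] + subjects[i + 1:]
        scores.append(pearson(series, group_mean(others)))
    return scores


def model_group_predictivity(prediction, subjects):
    """Correlate a model prediction with the mean of all subjects."""
    return pearson(prediction, group_mean(subjects))


# Synthetic cohort: shared signal plus strong per-subject noise.
random.seed(0)
T, N_SUBJECTS = 500, 20
signal = [random.gauss(0, 1) for _ in range(T)]
subjects = [[s + random.gauss(0, 2) for s in signal] for _ in range(N_SUBJECTS)]
prediction = [s + random.gauss(0, 0.5) for s in signal]  # stand-in for a model output

subject_scores = subject_group_predictivity(subjects)
model_score = model_group_predictivity(prediction, subjects)
print(f"median subject: {median(subject_scores):.2f}, model: {model_score:.2f}")
```

The example shows why a model can beat individual recordings on this measure: a single subject's data carries that subject's own noise, while a prediction that tracks the shared signal does not.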
In-Silico Neuroscience and Interpretability
Beyond its predictive capabilities, TRIBE v2 serves as a tool for in-silico experimentation, enabling researchers to conduct virtual neuroscientific tests. By running simulations on the Individual Brain Charting dataset, the model successfully recovered classic functional landmarks, including the fusiform face area, the parahippocampal place area, the temporo-parietal junction, and Broca’s area.
Furthermore, applying Independent Component Analysis to the model’s final layer revealed that TRIBE v2 naturally learns five well-known functional networks: primary auditory, language, motion, default mode, and visual. When provided with as little as one hour of data for a new participant, fine-tuning the model for a single epoch leads to a two- to four-fold improvement over linear models trained from scratch.
