Meta AI has introduced NeuralBench, a comprehensive, open-source framework designed to standardize the evaluation of NeuroAI models. By providing a unified interface for benchmarking brain activity models, the framework addresses the fragmentation that has historically hindered progress in the field. The initial release, NeuralBench-EEG v1.0, stands as the largest benchmark of its kind, encompassing 36 downstream tasks, 94 datasets, and over 13,600 hours of electroencephalography (EEG) data.
A Unified Approach to NeuroAI Evaluation
The development of brain foundation models—large-scale systems pretrained on unlabeled brain recordings—has accelerated, yet the evaluation landscape has remained inconsistent. Research groups have frequently relied on varying preprocessing pipelines and narrow task sets, making it difficult to determine the true generalizability of these models. NeuralBench resolves these issues through a modular pipeline built on three core Python packages: NeuralFetch for data acquisition, NeuralSet for preprocessing and dataloading, and NeuralTrain for standardized execution.
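The announcement names these three packages but does not detail their public APIs, so the sketch below is a purely hypothetical illustration of how the three stages might compose; every import path, function name, and argument is an assumption rather than documented behavior.

```python
# Hypothetical sketch of the NeuralBench pipeline stages.
# NeuralFetch, NeuralSet, and NeuralTrain are the real package names;
# all function calls and arguments below are assumed for illustration.
from neuralfetch import fetch_dataset      # assumed import path
from neuralset import build_dataloader     # assumed import path
from neuraltrain import run_benchmark      # assumed import path

# 1. Acquire a raw EEG dataset (dataset ID is hypothetical).
raw = fetch_dataset("sleep-edf", cache_dir="~/.neuralbench")

# 2. Apply the shared preprocessing pipeline and build dataloaders.
loaders = build_dataloader(raw, split="standard", batch_size=64)

# 3. Run a standardized evaluation of one model on one task.
results = run_benchmark(model="LaBraM", task="sleep_staging", loaders=loaders)
print(results.summary())
```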
The framework allows researchers to run complex evaluations from a simple command-line interface. Lightweight YAML configuration files ensure that training hyperparameters, data splits, and evaluation metrics are applied identically across all models. This standardization strips away model-specific optimization tricks, enabling direct comparison between task-specific architectures, handcrafted feature baselines, and large-scale foundation models.
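The announcement does not reproduce a sample configuration, so the snippet below is an assumed illustration of what such a YAML file might contain; every key and value is hypothetical rather than the framework's documented schema.

```yaml
# Hypothetical NeuralBench run configuration.
# Every key name here is an illustrative assumption, not the documented schema.
task: motor_imagery_classification
dataset: bci_competition_iv_2a        # illustrative dataset ID
model: CTNet
split:
  scheme: subject_independent
  seed: 42
training:
  epochs: 50
  batch_size: 64
  learning_rate: 1.0e-3
metrics: [balanced_accuracy, f1_macro]
```

In a setup like this, swapping only the model field while holding every other key fixed is what makes cross-model comparisons apples-to-apples; a run might then be launched with a single command such as `neuralbench run config.yaml` (again, a hypothetical invocation).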
Performance Insights and Model Comparison
NeuralBench-EEG v1.0 evaluates 14 deep learning architectures, ranging from lightweight task-specific models to large foundation models with up to 157.1 million parameters. A notable finding from the initial benchmark is that foundation models such as REVE, LaBraM, and LUNA only marginally outperform smaller, task-specific models. For instance, the CTNet architecture contains roughly 270 times fewer parameters than the leading foundation models (on the order of 600,000 parameters against the 157.1 million ceiling), yet it demonstrated competitive performance, even overtaking some larger models when evaluated across a broader range of datasets.
The benchmark also highlights how much headroom remains for current AI systems. While tasks such as seizure detection and sleep stage classification are approaching saturation, cognitive decoding tasks, including the recovery of speech, video, and sentence representations from brain activity, still yield performance near the dummy baseline. These unsolved tasks are intended to serve as targets for future advances in the field.
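To make "near the dummy baseline" concrete: a dummy baseline ignores the recorded brain activity entirely and simply predicts, for example, the most frequent class in the training labels. The sketch below illustrates the concept with scikit-learn's DummyClassifier on synthetic data; it is a generic illustration of the idea, not NeuralBench's actual baseline implementation.

```python
# Generic illustration of a dummy baseline: it ignores the input
# features entirely and always predicts the most frequent class.
# This is the standard concept, not NeuralBench's implementation.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))   # stand-in "EEG features"
y = rng.integers(0, 5, size=1000)     # 5 hypothetical decoding classes

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
score = balanced_accuracy_score(y, dummy.predict(X))
print(f"dummy balanced accuracy: {score:.3f}")  # ~0.2 for 5 balanced classes
```

Any decoder that cannot beat this floor has, by definition, extracted no usable information from the neural signal.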
Cross-Modality Potential and Future Expansion
Although the current release focuses on EEG, NeuralBench is architected for broader application. The framework already includes proof-of-concept support for magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI). Early results suggest that representations learned from EEG data can transfer effectively to other modalities, as evidenced by the REVE model achieving top performance on MEG typing decoding tasks.
The framework is designed to scale, with future iterations planned to incorporate intracranial EEG, functional near-infrared spectroscopy, and electromyography. By providing a transparent and rigorous testing environment, NeuralBench aims to support the development of more robust and generalizable NeuroAI models. The framework is available under an MIT license, providing the research community with a standardized tool to stress-test the next generation of brain-computer interface technologies.
