FAM-Bench: A Multimodal Benchmark for Condition-Aware Food-as-Medicine Reasoning
The "Food-as-Medicine" movement aims to integrate clinically appropriate nutrition into healthcare to help manage chronic conditions like diabetes, hypertension, and kidney disease. While current AI models are good at identifying dishes or estimating calories, they often lack the ability to determine if a specific meal is actually safe or beneficial for a patient with a particular health condition. This paper introduces FAM-Bench, a new benchmark designed to test whether AI models can perform this "decision-oriented" reasoning by analyzing dish images and ingredient lists against specific medical requirements.
How the Benchmark Works
FAM-Bench consists of 2,500 expert-verified instances covering 13 diet-related health conditions. The benchmark evaluates models on two primary tasks:
Dish-Level Suitability Assessment: The model must decide if a single dish is "suitable" or "not suitable" for a given health condition based on its image and ingredients.
Comparative Dish Analysis: The model is presented with four candidate dishes and must rank them from most to least suitable for a specific health condition. This task mimics real-world dietary recommendations where a user must choose between multiple options.
To ensure accuracy, the researchers created a curated knowledge base of dietary guidelines, reviewed by nutrition experts, that defines which ingredients are beneficial or harmful for each of the 13 conditions.
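As a rough illustration of how such a knowledge base could be queried, the sketch below flags a dish's ingredients as harmful or beneficial for a condition. The structure, the names (`KNOWLEDGE_BASE`, `Dish`, `flag_ingredients`), and the example rules are all assumptions made for illustration; the benchmark's actual guideline entries are not reproduced here.

```python
from dataclasses import dataclass

# Illustrative placeholder rules only; NOT the benchmark's expert-reviewed entries.
KNOWLEDGE_BASE: dict[str, dict[str, set[str]]] = {
    "hypertension": {
        "harmful": {"soy sauce", "bacon"},
        "beneficial": {"spinach", "banana"},
    },
}


@dataclass
class Dish:
    name: str
    ingredients: list[str]


def flag_ingredients(dish: Dish, condition: str) -> tuple[list[str], list[str]]:
    """Return (harmful, beneficial) ingredients of the dish for a condition."""
    rules = KNOWLEDGE_BASE[condition]
    # Normalize case so "Soy Sauce" matches the lowercase rule entry.
    names = [ingredient.lower() for ingredient in dish.ingredients]
    harmful = [n for n in names if n in rules["harmful"]]
    beneficial = [n for n in names if n in rules["beneficial"]]
    return harmful, beneficial


dish = Dish("fried rice", ["rice", "Soy Sauce", "egg", "spinach"])
harmful, beneficial = flag_ingredients(dish, "hypertension")
print(harmful, beneficial)  # ['soy sauce'] ['spinach']
```

How flagged ingredients translate into a final "suitable" or "not suitable" label is a separate design decision that this sketch deliberately leaves open.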
Evaluating AI Performance
The researchers tested five different vision-language models, including both closed-source frontier models and open-weight models. They used four different prompting strategies: a baseline approach, Chain-of-Thought (CoT) reasoning, Knowledge Injection (KI) of dietary rules, and a combination of both.
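The four strategies can be read as a 2x2 grid: Chain-of-Thought on or off, Knowledge Injection on or off. The sketch below builds the text portion of one prompt per cell. The function name `build_prompt`, the prompt wording, and the placeholder rule text are hypothetical and are not the prompts used in the paper; the dish image would be passed to the model separately.

```python
from itertools import product


def build_prompt(condition, ingredients, use_cot, rules=None):
    """Assemble the text part of a suitability prompt (hypothetical wording)."""
    parts = [
        f"Health condition: {condition}",
        "Ingredients: " + ", ".join(ingredients),
    ]
    if rules is not None:
        # Knowledge Injection: place the dietary rules directly in the prompt.
        parts.append("Dietary guidelines:\n" + rules)
    if use_cot:
        # Chain-of-Thought: ask for explicit reasoning before the verdict.
        parts.append("Think step by step about each ingredient before answering.")
    parts.append(
        "Answer 'suitable' or 'not suitable' and list the ingredients "
        "that drive the decision."
    )
    return "\n".join(parts)


RULES = "Limit high-sodium ingredients."  # placeholder rule text, an assumption
NAMES = {
    (False, False): "baseline",
    (True, False): "cot",
    (False, True): "ki",
    (True, True): "cot+ki",
}

variants = {}
for use_cot, use_ki in product([False, True], repeat=2):
    variants[NAMES[(use_cot, use_ki)]] = build_prompt(
        "hypertension", ["rice", "soy sauce"], use_cot, RULES if use_ki else None
    )

print(sorted(variants))  # ['baseline', 'cot', 'cot+ki', 'ki']
```

Keeping the two switches independent makes it straightforward to attribute a change in score to reasoning traces, injected rules, or their combination.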
The results revealed a significant gap between a model's ability to provide a "yes/no" verdict and its ability to explain why. Models were often accurate at determining whether a dish was suitable, but they struggled to identify the specific ingredients that drove that decision. They also found the comparative ranking task much harder than the binary suitability task, indicating that ranking alternatives under health constraints remains a significant challenge for current AI.
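This summary does not say how the ranking task is scored. As one plausible way to see why ranking is harder than a binary verdict, the sketch below scores a predicted ordering of four dishes with exact match and with Kendall's tau; both metric choices are assumptions, not the paper's stated protocol.

```python
from itertools import combinations


def exact_match(pred, gold):
    """True only if the whole predicted ordering equals the gold ordering."""
    return pred == gold


def kendall_tau(pred, gold):
    """Kendall's tau between two rankings of the same items (no ties)."""
    position = {item: rank for rank, item in enumerate(pred)}
    concordant = discordant = 0
    # combinations() yields pairs (a, b) where a is ranked above b in gold.
    for a, b in combinations(gold, 2):
        if position[a] < position[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


gold = ["A", "B", "C", "D"]  # most to least suitable
pred = ["A", "C", "B", "D"]  # one adjacent pair swapped

print(exact_match(pred, gold))            # False
print(round(kendall_tau(pred, gold), 3))  # 0.667
```

A single swapped pair already fails exact match, while tau still credits the five of six pairs that are ordered correctly. With four dishes there are 24 possible orderings, so chance performance on exact match is far lower than on a two-way verdict.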
Key Findings and Limitations
The study highlights that while models are becoming more capable of basic dietary understanding, they are not yet fully reliable for clinical-grade decision-making. The "verdict-rationale gap" shows that models often guess the correct suitability label without truly grounding their reasoning in the relevant ingredients.
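One way to make the verdict-rationale gap concrete is to score the label and the cited ingredients separately and compare the two. The sketch below uses label accuracy and a set-based F1 over ingredients; the metric choice, the field names, and the two toy records are assumptions for illustration, not results or definitions from the paper.

```python
def ingredient_f1(pred, gold):
    """Set-based F1 between predicted and gold decision-driving ingredients."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # nothing to cite and nothing cited
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up records: both verdicts are right, but the cited ingredients are not.
records = [
    {"pred_label": "not suitable", "gold_label": "not suitable",
     "pred_ingredients": ["rice"], "gold_ingredients": ["soy sauce"]},
    {"pred_label": "suitable", "gold_label": "suitable",
     "pred_ingredients": ["spinach"], "gold_ingredients": ["spinach", "banana"]},
]

verdict_acc = sum(r["pred_label"] == r["gold_label"] for r in records) / len(records)
rationale_f1 = sum(
    ingredient_f1(r["pred_ingredients"], r["gold_ingredients"]) for r in records
) / len(records)
gap = verdict_acc - rationale_f1

print(verdict_acc, round(rationale_f1, 3), round(gap, 3))  # 1.0 0.333 0.667
```

In this toy case the model is always right about the label yet cites the correct ingredients only about a third of the time, which is the pattern the gap describes.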
Additionally, the researchers found that Chain-of-Thought prompting and Knowledge Injection serve different purposes: reasoning traces help models cite the correct ingredients, while injecting dietary rules helps improve the accuracy of the final suitability decision. Even with these tools, the models still struggle to reach the level of nuanced, consistent reasoning required for professional dietary guidance.