Key Takeaways

  • We introduce DualFact, a dual-layer, multimodal factuality evaluation framework for procedural video captioning.
  • To support complete and role-consistent evaluation, DualFact incorporates implicit argument augmentation (VIA) and contrastive fact sets.
  • We instantiate DualFact in two modes: DualFact-T, which verifies facts against textual evidence, and DualFact-V, which verifies facts against video-grounded visual evidence.
  • Experiments on YouCook3-Fact and CraftBench-Fact show that state-of-the-art multimodal language models produce fluent but often factually incomplete captions, with systematic omissions and role-level inconsistencies.
Paper Abstract

We introduce DualFact, a dual-layer, multimodal factuality evaluation framework for procedural video captioning. DualFact separates factual correctness into conceptual facts, capturing abstract semantic roles (e.g., Action, Ingredient, Tool, Location), and contextual facts, capturing their grounded predicate-argument realizations in video. To support complete and role-consistent evaluation, DualFact incorporates implicit argument augmentation (VIA) and contrastive fact sets. We instantiate DualFact in two modes: DualFact-T, which verifies facts against textual evidence, and DualFact-V, which verifies facts against video-grounded visual evidence. Experiments on YouCook3-Fact and CraftBench-Fact show that state-of-the-art multimodal language models produce fluent but often factually incomplete captions, with systematic omissions and role-level inconsistencies. DualFact correlates more strongly with human factuality judgments than standard metrics, particularly for contextual facts, and reveals that caption-only evaluation overestimates hallucinations compared to video-grounded verification. Overall, DualFact offers an interpretable and human-aligned evaluation protocol that highlights persistent challenges in multimodal factual grounding, extending beyond surface-level fluency.

DualFact+: A Multimodal Fact Verification Framework for Procedural Video Understanding
Procedural video captioning—the task of describing step-by-step actions like cooking or crafting—requires models to be both fluent and factually accurate. However, current AI models often produce captions that sound natural but fail to accurately represent the specific actions, tools, or ingredients shown in a video. The DualFact framework addresses this by introducing a dual-layer evaluation system that breaks down factual correctness into two distinct categories: conceptual facts and contextual facts. By doing so, it provides a more precise way to identify where AI models go wrong, such as omitting critical steps or misassigning roles to objects.

A Two-Layered Approach to Factuality

DualFact evaluates captions by separating information into two layers. Conceptual facts focus on the abstract roles within a task, such as identifying the Action, Ingredient, Tool, or Location involved. Contextual facts, meanwhile, look at how these elements interact within the video, such as verifying the specific predicate-argument relationship (e.g., "cut the tomato" vs. "cut the board"). This separation allows the framework to distinguish between a model that understands the general goal of a task and one that correctly grounds those actions in the visual evidence provided by the video.
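The two layers described above can be sketched as a simple data structure. This is an illustrative sketch only; the class and field names are assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass

# Illustrative dual-layer fact record. All names here are assumptions
# for exposition, not DualFact's actual data structures.

@dataclass(frozen=True)
class ConceptualFact:
    """An abstract semantic role mentioned in a caption."""
    role: str    # e.g. "Action", "Ingredient", "Tool", "Location"
    value: str   # e.g. "cut", "tomato", "knife"

@dataclass(frozen=True)
class ContextualFact:
    """A grounded predicate-argument realization."""
    predicate: str    # the action, e.g. "cut"
    arguments: tuple  # ordered (role, value) pairs bound to that action

def extract_facts(caption_roles):
    """Split role-labeled caption content into the two layers."""
    conceptual = [ConceptualFact(role, value) for role, value in caption_roles]
    predicate = next((v for r, v in caption_roles if r == "Action"), None)
    args = tuple((r, v) for r, v in caption_roles if r != "Action")
    contextual = ContextualFact(predicate, args) if predicate else None
    return conceptual, contextual

# "cut the tomato on the board"
roles = [("Action", "cut"), ("Ingredient", "tomato"), ("Location", "board")]
conc, ctx = extract_facts(roles)
```

The point of the separation is visible in the output: the conceptual layer lists each role in isolation, while the contextual layer binds the arguments to their predicate, so "cut the tomato" and "cut the board" yield identical conceptual facts but different contextual ones.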

Identifying Systematic Errors

To better understand why models fail, the framework categorizes errors into three specific types: hallucinations (introducing content not present in the video), salience errors (focusing on visible but irrelevant items), and omissions (leaving out essential information). By using implicit argument augmentation—a process that fills in missing but implied details, such as identifying that "stir it" implies "stirring the soup with a spoon"—the framework can detect when a model fails to mention a task-critical component. This reveals that many state-of-the-art models suffer from systematic omissions and role-level inconsistencies that standard metrics often overlook.
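The three error types follow naturally from set differences between what a caption states, what the task requires, and what the video shows. A minimal sketch, assuming facts are represented as comparable items (the function and variable names are hypothetical, not the paper's):

```python
def categorize_errors(predicted, essential, visible):
    """Classify caption errors by comparing three fact sets.

    predicted: facts stated in the caption
    essential: task-critical reference facts (incl. implicit arguments)
    visible:   everything actually observable in the video
    """
    hallucinations = predicted - visible          # stated, but not in the video
    salience = (predicted & visible) - essential  # visible, but not task-critical
    omissions = essential - predicted             # task-critical, but unstated
    return hallucinations, salience, omissions

predicted = {"stir the soup", "add salt", "pour wine"}
essential = {"stir the soup", "use a spoon"}
visible = {"stir the soup", "use a spoon", "add salt"}
hall, sal, omit = categorize_errors(predicted, essential, visible)
```

Note that the omission check only works because the essential set already contains the implied arguments ("use a spoon"); without implicit argument augmentation, that omission would be invisible to the comparison.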

Improving Evaluation Accuracy

The research demonstrates that relying solely on text-based evaluation distorts the picture of a model's factuality. When captions are verified only against other text, content can be flagged as hallucinated even though it is clearly supported by the video; caption-only evaluation therefore overestimates hallucinations compared to video-grounded verification. DualFact also shows a stronger correlation with human factuality judgments than traditional metrics like BLEU or ROUGE, which only measure surface-level word overlap. By incorporating both textual and video-grounded verification, DualFact provides a more interpretable and human-aligned protocol for assessing how well AI models truly understand the procedural steps they are describing.
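Agreement between a metric and human judgments is typically measured with a rank correlation such as Spearman's ρ; the paper summary does not specify which statistic DualFact uses, so the following is a generic, stdlib-only sketch of that comparison:

```python
from statistics import mean

def ranks(xs):
    """Average ranks (ties share their mean rank), for Spearman correlation."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var = (sum((x - ma) ** 2 for x in ra)
           * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return cov / var

# Hypothetical example: human factuality ratings vs. a metric's scores.
human = [1, 2, 3, 4]
metric = [0.10, 0.40, 0.35, 0.90]
rho = spearman(human, metric)  # 0.8: metric mostly tracks human ranking
```

A higher ρ means the metric orders captions the way human raters do, which is the sense in which DualFact is reported to align better with human judgment than overlap-based metrics.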

Key Findings and Benchmarks

To support this research, the authors introduced two new benchmarks: YouCook3-Fact and CraftBench-Fact. These datasets provide structured, role-annotated facts that serve as a gold standard for testing. Experiments using these benchmarks show that while modern multimodal language models are highly fluent, they frequently struggle with the "grounding" of their descriptions. The framework highlights that even when a model correctly identifies an object, it may still fail to correctly link that object to the right action or role, a persistent challenge in multimodal AI that DualFact helps to diagnose and measure.
