Key Takeaways

  • We introduce DualFact, a dual-layer, multimodal factuality evaluation framework for procedural video captioning.
  • To support complete and role-consistent evaluation, DualFact incorporates implicit argument augmentation (VIA) and contrastive fact sets.
  • We instantiate DualFact in two modes: DualFact-T, which verifies facts against textual evidence, and DualFact-V, which verifies facts against video-grounded visual evidence.
  • Experiments on YouCook3-Fact and CraftBench-Fact show that state-of-the-art multimodal language models produce fluent but often factually incomplete captions, with systematic omissions and role-level inconsistencies.
Paper Abstract

We introduce DualFact, a dual-layer, multimodal factuality evaluation framework for procedural video captioning. DualFact separates factual correctness into conceptual facts, capturing abstract semantic roles (e.g., Action, Ingredient, Tool, Location), and contextual facts, capturing their grounded predicate-argument realizations in video. To support complete and role-consistent evaluation, DualFact incorporates implicit argument augmentation (VIA) and contrastive fact sets. We instantiate DualFact in two modes: DualFact-T, which verifies facts against textual evidence, and DualFact-V, which verifies facts against video-grounded visual evidence. Experiments on YouCook3-Fact and CraftBench-Fact show that state-of-the-art multimodal language models produce fluent but often factually incomplete captions, with systematic omissions and role-level inconsistencies. DualFact correlates more strongly with human factuality judgments than standard metrics, particularly for contextual facts, and reveals that caption-only evaluation overestimates hallucinations compared to video-grounded verification. Overall, DualFact offers an interpretable and human-aligned evaluation protocol that highlights persistent challenges in multimodal factual grounding, extending beyond surface-level fluency.

DualFact+: A Multimodal Fact Verification Framework for Procedural Video Understanding
Procedural video captioning—the task of describing step-by-step actions like cooking or crafting—requires models to be both fluent and factually accurate. However, current AI models often produce captions that sound natural but fail to accurately represent the specific actions, tools, or ingredients shown in a video. The DualFact framework addresses this by introducing a dual-layer evaluation system that breaks down factual correctness into two distinct categories: conceptual facts and contextual facts. By doing so, it provides a more precise way to identify where AI models go wrong, such as omitting critical steps or misassigning roles to objects.

A Two-Layered Approach to Factuality

DualFact evaluates captions by separating information into two layers. Conceptual facts focus on the abstract roles within a task, such as identifying the Action, Ingredient, Tool, or Location involved. Contextual facts, meanwhile, look at how these elements interact within the video, such as verifying the specific predicate-argument relationship (e.g., "cut the tomato" vs. "cut the board"). This separation allows the framework to distinguish between a model that understands the general goal of a task and one that correctly grounds those actions in the visual evidence provided by the video.
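The two layers described above can be sketched as a simple data structure. This is an illustrative sketch only; the class and field names are assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass

# Illustrative dual-layer fact record. All names here are assumptions
# for exposition, not DualFact's actual data structures.

@dataclass(frozen=True)
class ConceptualFact:
    """An abstract semantic role mentioned in a caption."""
    role: str    # e.g. "Action", "Ingredient", "Tool", "Location"
    value: str   # e.g. "cut", "tomato", "knife"

@dataclass(frozen=True)
class ContextualFact:
    """A grounded predicate-argument realization."""
    predicate: str    # the action, e.g. "cut"
    arguments: tuple  # ordered (role, value) pairs bound to that action

def extract_facts(caption_roles):
    """Split role-labeled caption content into the two layers."""
    conceptual = [ConceptualFact(role, value) for role, value in caption_roles]
    predicate = next((v for r, v in caption_roles if r == "Action"), None)
    args = tuple((r, v) for r, v in caption_roles if r != "Action")
    contextual = ContextualFact(predicate, args) if predicate else None
    return conceptual, contextual

# "cut the tomato on the board"
roles = [("Action", "cut"), ("Ingredient", "tomato"), ("Location", "board")]
conc, ctx = extract_facts(roles)
```

The point of the separation is visible in the output: the conceptual layer lists each role in isolation, while the contextual layer binds the arguments to their predicate, so "cut the tomato" and "cut the board" yield identical conceptual facts but different contextual ones.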

Identifying Systematic Errors

To better understand why models fail, the framework categorizes errors into three specific types: hallucinations (introducing content not present in the video), salience errors (focusing on visible but irrelevant items), and omissions (leaving out essential information). By using implicit argument augmentation—a process that fills in missing but implied details, such as identifying that "stir it" implies "stirring the soup with a spoon"—the framework can detect when a model fails to mention a task-critical component. This reveals that many state-of-the-art models suffer from systematic omissions and role-level inconsistencies that standard metrics often overlook.
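The three error types follow naturally from set differences between what a caption states, what the task requires, and what the video shows. A minimal sketch, assuming facts are represented as comparable items (the function and variable names are hypothetical, not the paper's):

```python
def categorize_errors(predicted, essential, visible):
    """Classify caption errors by comparing three fact sets.

    predicted: facts stated in the caption
    essential: task-critical reference facts (incl. implicit arguments)
    visible:   everything actually observable in the video
    """
    hallucinations = predicted - visible          # stated, but not in the video
    salience = (predicted & visible) - essential  # visible, but not task-critical
    omissions = essential - predicted             # task-critical, but unstated
    return hallucinations, salience, omissions

predicted = {"stir the soup", "add salt", "pour wine"}
essential = {"stir the soup", "use a spoon"}
visible = {"stir the soup", "use a spoon", "add salt"}
hall, sal, omit = categorize_errors(predicted, essential, visible)
```

Note that the omission check only works because the essential set already contains the implied arguments ("use a spoon"); without implicit argument augmentation, that omission would be invisible to the comparison.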

Improving Evaluation Accuracy

The research demonstrates that relying solely on text-based evaluation distorts the picture of a model's factuality. When captions are verified only against other text, content can be flagged as hallucinated even though it is clearly supported by the video; caption-only evaluation therefore overestimates hallucinations compared to video-grounded verification. DualFact also shows a stronger correlation with human factuality judgments than traditional metrics like BLEU or ROUGE, which only measure surface-level word overlap. By incorporating both textual and video-grounded verification, DualFact provides a more interpretable and human-aligned protocol for assessing how well AI models truly understand the procedural steps they are describing.
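Agreement between a metric and human judgments is typically measured with a rank correlation such as Spearman's ρ; the paper summary does not specify which statistic DualFact uses, so the following is a generic, stdlib-only sketch of that comparison:

```python
from statistics import mean

def ranks(xs):
    """Average ranks (ties share their mean rank), for Spearman correlation."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var = (sum((x - ma) ** 2 for x in ra)
           * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return cov / var

# Hypothetical example: human factuality ratings vs. a metric's scores.
human = [1, 2, 3, 4]
metric = [0.10, 0.40, 0.35, 0.90]
rho = spearman(human, metric)  # 0.8: metric mostly tracks human ranking
```

A higher ρ means the metric orders captions the way human raters do, which is the sense in which DualFact is reported to align better with human judgment than overlap-based metrics.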

Key Findings and Benchmarks

To support this research, the authors introduced two new benchmarks: YouCook3-Fact and CraftBench-Fact. These datasets provide structured, role-annotated facts that serve as a gold standard for testing. Experiments using these benchmarks show that while modern multimodal language models are highly fluent, they frequently struggle with the "grounding" of their descriptions. The framework highlights that even when a model correctly identifies an object, it may still fail to correctly link that object to the right action or role, a persistent challenge in multimodal AI that DualFact helps to diagnose and measure.
