Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models
This research investigates a specific challenge in training vision-language models: when models are fine-tuned to generate dense lists of object coordinates, they often develop unintended behaviors regarding how they serialize, repeat, and terminate their outputs. The authors demonstrate that this "interference" is not a random failure but a predictable, measurable surface that can be managed through specific control axes. By understanding this surface, developers can improve a model's ability to locate objects in images while preventing the generation of repetitive, redundant data.
The Nature of the Interference
When vision-language models are adapted to perform dense visual grounding (listing multiple objects along with their bounding box coordinates), they must manage class labels, numeric coordinates, and the logical end of a list all at once. The researchers found that high-capacity fine-tuning (using LoRA adapters) significantly improves the model's ability to identify objects, but it also triggers "repeated-tail pressure," where the model begins to output duplicate records at the end of its response. This behavior is specific to coordinate-list structures; other structured outputs, such as general JSON data, remain free of these repetitions even under the same training conditions.
Navigating the Control Surface
The study organizes the model's behavior into two orthogonal control axes, allowing users to navigate the trade-off between accuracy and output quality:
Density-Precision Axis: This axis manages how many objects the model attempts to identify. By using prompt-level budgets, users can control the number of objects the model commits to, which directly influences the density of the output.
Structural-Integrity Axis: This axis addresses the repetition issue. The authors introduced an "object-level repeat-stop" mechanism that monitors the output in real time as it is generated. If the model attempts to generate an exact duplicate of a previously listed object, the system terminates the list at that point. This keeps the output clean without sacrificing the accuracy of the objects already identified.
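The repeat-stop idea on the structural-integrity axis can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record format (a tuple of label and four box coordinates), the function name repeat_stop, and the sample data are all assumptions made for illustration. The sketch only shows the general logic of ending a list at the first exact duplicate of an earlier object.

```python
from typing import Iterable, Iterator, Tuple

# Assumed record format (hypothetical): (label, x1, y1, x2, y2).
Record = Tuple[str, int, int, int, int]


def repeat_stop(records: Iterable[Record]) -> Iterator[Record]:
    """Yield records until the first exact duplicate, then end the list."""
    seen = set()
    for record in records:
        if record in seen:
            # Exact duplicate of an earlier object: terminate here,
            # dropping this record and everything after it.
            return
        seen.add(record)
        yield record


# Made-up example: two distinct objects followed by a repeated tail.
stream = [
    ("cat", 10, 20, 50, 80),
    ("dog", 60, 15, 90, 40),
    ("cat", 10, 20, 50, 80),
    ("dog", 60, 15, 90, 40),
]
kept = list(repeat_stop(stream))
print(kept)
```

Because the function consumes an iterable, it could in principle wrap a stream of records parsed during decoding, so the check runs as each object is completed rather than after the full response. How records are parsed out of the token stream is left out here and would depend on the serialization format in use.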
Key Findings and Reproducibility
The researchers validated these findings across different model families, including Gemma 4 12B and Qwen3-VL-8B, and confirmed that the patterns persist across different datasets, such as the industrial InsPLAD dataset and the public COCO 2017 dataset. A critical takeaway is that the "repeated-tail" is a separable structural fault; removing these duplicates does not degrade the model's localization performance. In fact, applying these controls allows the models to reach a "clean endpoint" where they maintain high accuracy while achieving a zero-duplicate rate and perfect list termination.
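To make the zero-duplicate endpoint concrete, here is a small Python sketch of one way a duplicate rate could be computed. The summary does not give the paper's exact metric definition, so the definition used here (the fraction of records that exactly repeat an earlier record) and the function name duplicate_rate are assumptions.

```python
from collections import Counter
from typing import Sequence, Tuple

# Assumed record format (hypothetical): (label, x1, y1, x2, y2).
Record = Tuple[str, int, int, int, int]


def duplicate_rate(records: Sequence[Record]) -> float:
    """Fraction of records that exactly repeat an earlier record (assumed definition)."""
    if not records:
        return 0.0
    counts = Counter(records)
    # Every occurrence beyond the first counts as a duplicate.
    duplicates = sum(count - 1 for count in counts.values())
    return duplicates / len(records)


# Made-up outputs: one with a repeated tail, one clean.
with_tail = [
    ("cat", 10, 20, 50, 80),
    ("dog", 60, 15, 90, 40),
    ("cat", 10, 20, 50, 80),
    ("dog", 60, 15, 90, 40),
]
clean = with_tail[:2]

print(duplicate_rate(with_tail))  # 2 of 4 records are repeats -> 0.5
print(duplicate_rate(clean))      # no repeats -> 0.0
```

In this toy case the set of distinct boxes is identical before and after the tail is removed, which is the sense in which the duplicates are separable from the objects already listed.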
Implications for Model Training
The study highlights that this interference is not a result of limited model capacity or a specific implementation error, as it persists across a wide range of adapter ranks. Instead, it is a structural byproduct of how models learn to serialize coordinate lists. By treating this behavior as a controllable surface rather than a simple error, practitioners can use these two axes to fine-tune models for dense visual tasks more effectively, ensuring that the generated output is both accurate and structurally sound.