Key Takeaways

  • Fine-tuning vision-language models to emit dense coordinate lists improves visual grounding but also changes how models serialize, repeat, and terminate structured outputs.
  • We study this behavior as a generation and control surface.
  • In Gemma 4 12B, high-capacity q/k/v/o LoRA raises class-aware [email protected] from 0.007 to 0.448 while inducing repeated-tail pressure (duplicate rate 0.080, max repeat 23).
  • A q/v rank sweep keeps max repeat at 21-22 across ranks 4-64, showing capacity persistence.
Paper Abstract

Fine-tuning vision-language models to emit dense coordinate lists improves visual grounding but also changes how models serialize, repeat, and terminate structured outputs. We study this behavior as a generation and control surface. In Gemma 4 12B, high-capacity q/k/v/o LoRA raises class-aware [email protected] from 0.007 to 0.448 while inducing repeated-tail pressure (duplicate rate 0.080, max repeat 23). A q/v rank sweep keeps max repeat at 21-22 across ranks 4-64, showing capacity persistence. The target signal is separable: object-level repeat-stop removes exact repeated records (duplicate rate 0.000, max repeat 1) while preserving F1 (0.494 to 0.490) and stricter [email protected] (0.381 to 0.385). Structure-axis probes localize the effect to bbox-coordinate object lists; dense non-bbox and spatial/count JSON remain repeat-clean, including under high-capacity adapters. Qwen3-VL-8B reproduces a clean controlled endpoint ([email protected] 0.318, duplicate rate 0.000), and COCO 2017 reproduces acquisition plus duplicate pressure. Dense coordinate-list adaptation therefore creates a structure-bound, cross-family interference surface that can be measured and controlled.

Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models
This research examines a side effect of fine-tuning vision-language models to generate dense lists of object coordinates: the same training that improves visual grounding also changes how the models serialize, repeat, and terminate their outputs. The authors show that this "interference" is not a random failure but a predictable, measurable surface that can be managed along specific control axes. Understanding this surface lets developers improve a model's ability to locate objects in images while preventing repeated, redundant records.

The Nature of the Interference

When vision-language models are adapted for dense visual grounding, that is, listing multiple objects together with their bounding-box coordinates, they must manage class labels, numeric coordinates, and the logical end of the list at the same time. The researchers found that high-capacity fine-tuning with LoRA adapters on the q/k/v/o attention projections substantially improves object localization in Gemma 4 12B (the reported class-aware score rises from 0.007 to 0.448), but it also triggers "repeated-tail pressure," where the model emits duplicate records at the end of its response (duplicate rate 0.080, max repeat 23). This behavior is specific to bounding-box coordinate lists: dense non-bbox outputs and spatial/count JSON remain free of repeats, even under high-capacity adapters.
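The two repetition figures quoted above can be made concrete with a small sketch. The definitions here are assumptions, since this summary does not spell them out: duplicate rate is taken as the fraction of records that exactly repeat an earlier record, and max repeat as the number of occurrences of the most frequent record. The `Record` format and function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical record format: (class label, (x1, y1, x2, y2)).
Record = Tuple[str, Tuple[int, int, int, int]]


def duplicate_rate(records: List[Record]) -> float:
    """Fraction of records that exactly repeat an earlier record (assumed definition)."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for record in records:
        if record in seen:
            duplicates += 1
        else:
            seen.add(record)
    return duplicates / len(records)


def max_repeat(records: List[Record]) -> int:
    """Occurrence count of the most frequent record (assumed definition)."""
    if not records:
        return 0
    return max(Counter(records).values())


# A list whose tail repeats the first record twice.
noisy = [
    ("person", (10, 20, 30, 40)),
    ("dog", (0, 0, 100, 200)),
    ("person", (10, 20, 30, 40)),
    ("person", (10, 20, 30, 40)),
]
print(duplicate_rate(noisy), max_repeat(noisy))
```

Under these definitions a list with no exact repeats has a duplicate rate of 0.0 and a max repeat of 1, which matches the shape of the clean values reported in the abstract.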

Navigating the Control Surface

The study organizes the model's behavior into two orthogonal control axes, allowing users to navigate the trade-off between accuracy and output quality:

  • Density-Precision Axis: This axis manages how many objects the model attempts to identify. By using prompt-level budgets, users can control the number of objects the model commits to, which directly influences the density of the output.

  • Structural-Integrity Axis: This axis addresses the repetition issue. The authors introduced an "object-level repeat-stop" mechanism that monitors the output in real time. If the model attempts to generate an exact duplicate of a previously listed object, the system terminates the list. In the reported results this removes exact repeated records (duplicate rate 0.000, max repeat 1) while leaving F1 essentially unchanged (0.494 to 0.490).
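The repeat-stop idea on the structural-integrity axis can be sketched as a filter over a stream of already-parsed records. This is a minimal illustration rather than the authors' implementation: it assumes records arrive one at a time as hashable `(label, bbox)` tuples, and the name `repeat_stop` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record format: (class label, (x1, y1, x2, y2)).
Record = Tuple[str, Tuple[int, int, int, int]]


def repeat_stop(stream: Iterable[Record]) -> Iterator[Record]:
    """Yield records until the first exact duplicate, then end the list."""
    seen = set()
    for record in stream:
        if record in seen:
            # An exact repeat of an earlier object: terminate the list here.
            return
        seen.add(record)
        yield record


generated = [
    ("person", (10, 20, 30, 40)),
    ("dog", (0, 0, 100, 200)),
    ("person", (10, 20, 30, 40)),  # exact repeat triggers the stop
    ("cat", (5, 5, 50, 50)),       # never reached
]
print(list(repeat_stop(generated)))
```

Note the design choice: the list is terminated at the first duplicate instead of merely skipping it, so anything after the repeat is dropped. In a real decoding loop the same check would presumably run as a stopping condition on the generated text, which is outside the scope of this sketch.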

Key Findings and Reproducibility

The researchers validated these findings across different model families, including Gemma 4 12B and Qwen3-VL-8B, and confirmed that the patterns persist across different datasets, such as the industrial InsPLAD dataset and the public COCO 2017 dataset. A critical takeaway is that the "repeated-tail" is a separable structural fault; removing these duplicates does not degrade the model's localization performance. Applying these controls lets the models reach a "clean endpoint" where they maintain their accuracy while achieving a zero-duplicate rate and perfect list termination; Qwen3-VL-8B, for example, reaches a duplicate rate of 0.000 at its controlled endpoint.
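To make "localization performance" concrete, the sketch below scores a predicted list against ground truth with a class-aware F1. It rests on assumptions that this summary does not confirm: a prediction counts as correct only if its label matches an unmatched ground-truth box and their intersection-over-union (IoU) meets a threshold, and matching is greedy in prediction order. The paper's exact protocol and thresholds may differ, and all names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Record = Tuple[str, Box]                 # hypothetical (label, box) format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def class_aware_f1(preds: List[Record], gts: List[Record], threshold: float) -> float:
    """Greedy one-to-one matching; the label must agree and IoU must reach the threshold."""
    matched = set()
    true_pos = 0
    for label, box in preds:
        best_j, best_iou = None, threshold
        for j, (gt_label, gt_box) in enumerate(gts):
            if j in matched or gt_label != label:
                continue
            score = iou(box, gt_box)
            if score >= best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            matched.add(best_j)
            true_pos += 1
    false_pos = len(preds) - true_pos
    false_neg = len(gts) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


gts = [("dog", (0, 0, 10, 10)), ("cat", (20, 20, 30, 30))]
preds = [("dog", (0, 0, 10, 10)), ("dog", (20, 20, 30, 30))]  # second label is wrong
print(class_aware_f1(preds, gts, threshold=0.5))
```

In this toy case the second prediction overlaps the cat box perfectly but carries the wrong label, so it counts as a false positive under the class-aware rule.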

Implications for Model Training

The study highlights that this interference is not a result of limited adapter capacity or a specific implementation error: in a q/v rank sweep, max repeat stays at 21-22 across adapter ranks 4 to 64. Instead, it is a structural byproduct of how models learn to serialize coordinate lists. By treating this behavior as a controllable surface rather than a simple error, practitioners can use the two axes to fine-tune models for dense visual tasks more effectively, ensuring that the generated output is both accurate and structurally sound.
