NeurIPS 2026  ·  Submission

SteerSeg: Attention Steering for
Reasoning Video Segmentation

Ali Cheraghian¹, Hamidreza Dastmalchi², Abdelwahed Khamis³, Morteza Saberi⁴, Aijun An², Lars Petersson³

¹Macquarie University   ²York University   ³CSIRO Data61   ⁴University of Technology Sydney
Figure 1. (a) A diagnostic study reveals a discrepancy between reasoning and grounding: for the query "a white dog with gray patches", the LVLM identifies the correct target object, yet the corresponding attention remains poorly localized, leading to inaccurate masks. (b) Effect of attention refinement on segmentation: raw and contrast-based attention produce ambiguous localization, while soft prompts and Chain-of-Thought reasoning progressively concentrate attention and improve segmentation accuracy.

Abstract

Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals.

In this work, we identify attention misalignment as the key bottleneck and introduce SteerSeg, a lightweight framework that steers attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting: the soft prompts reshape the attention distribution to produce spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects. The resulting maps are converted into point prompts that guide a segmentation model, and candidate tracklets are ranked by correlation-based scoring.

Method

SteerSeg performs input-level attention steering: learnable soft prompts, paired with single-step Chain-of-Thought reasoning, produce concentrated attention maps that are converted into accurate point prompts for SAM2.

Figure 2. SteerSeg pipeline. CoT-derived attributes augment the input expression; soft prompts steer LVLM attention; rollout from the response token produces point prompts for SAM2; candidate tracklets are ranked by correlation against the rollout maps.
  1. Soft Prompt Steering. Learnable soft prompts prepended to the input of a frozen LVLM participate in self-attention at every layer, reshaping how the response token attends to visual tokens and yielding more concentrated, spatially aligned attention maps (minimal code sketches of these steps follow the list).

  2. Chain-of-Thought Attribute Reasoning. A single-step CoT module elicits discriminative attributes (color, position, motion) for the referred object. Appending these attributes to the prompt disambiguates similar instances and guides attention toward the correct target.

  3. Dual-Granularity Rollout. Attention rollout is computed at both frame and video granularities, balancing spatial precision with temporal consistency. The refined maps are converted into point prompts across sampled keyframes.

  4. Correlation-Based Tracklet Selection. SAM2 propagates the point prompts into candidate tracklets, which are ranked by Pearson correlation against the rollout maps. The most consistent tracklet is selected, providing robustness to occlusion, fast motion, and visually similar distractors.
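
To make the soft-prompt steering step concrete, the sketch below shows how a small set of learnable prompt embeddings can be prepended to a frozen LVLM's input sequence so that only the prompts receive gradients. This is a minimal PyTorch-style sketch under assumed names and shapes (`frozen_lvlm`, `embed_dim`, and a HuggingFace-style `inputs_embeds`/`output_attentions` call), not the released implementation.

```python
import torch
import torch.nn as nn

class SoftPromptSteering(nn.Module):
    """Prepends learnable soft prompts to a frozen LVLM's input embeddings.

    Only `self.soft_prompts` is trained; the LVLM stays frozen.
    Names, shapes, and the forward call are illustrative assumptions.
    """

    def __init__(self, frozen_lvlm, num_prompts: int = 16, embed_dim: int = 4096):
        super().__init__()
        self.lvlm = frozen_lvlm.eval()
        for p in self.lvlm.parameters():
            p.requires_grad_(False)
        # Learnable soft prompts, small random initialization.
        self.soft_prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, vision_embeds, text_embeds):
        # vision_embeds: (B, Nv, D), text_embeds: (B, Nt, D)
        B = vision_embeds.size(0)
        prompts = self.soft_prompts.unsqueeze(0).expand(B, -1, -1)
        # The soft prompts participate in self-attention at every layer,
        # reshaping how the response token attends to visual tokens.
        inputs = torch.cat([prompts, vision_embeds, text_embeds], dim=1)
        return self.lvlm(inputs_embeds=inputs, output_attentions=True)
```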
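
The CoT attribute step reduces to prompt construction: first query the LVLM for discriminative attributes of the referred object, then append them to the original expression. The template wording below is an illustrative assumption, not the exact prompt used in the paper.

```python
# Hypothetical single-step CoT template; the exact wording is an assumption.
COT_TEMPLATE = (
    "Question: {expression}\n"
    "Before answering, briefly list the target object's discriminative "
    "attributes (color, position, motion)."
)

def augment_expression(expression: str, attributes: str) -> str:
    """Append CoT-derived attributes to the referring expression."""
    return f"{expression} (attributes: {attributes.strip()})"

# Example (hypothetical):
# attrs = lvlm.generate(COT_TEMPLATE.format(expression="the dog that caught the ball"))
# query = augment_expression("the dog that caught the ball", attrs)
```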
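
For the rollout step, the per-frame part can be sketched as standard attention rollout read off at the response token: multiply head-averaged attention matrices across layers, take the response token's row over the visual tokens, reshape it to the frame grid, and convert the top peaks into point prompts. The grid size, top-k value, and residual weighting below are assumptions for illustration.

```python
import torch

def rollout_point_prompts(attentions, response_idx, visual_slice, grid_hw, top_k=3):
    """Attention rollout from the response token to visual tokens -> point prompts.

    attentions: list of per-layer attention tensors, each (heads, T, T).
    All names, shapes, and constants are illustrative assumptions.
    """
    T = attentions[0].size(-1)
    rollout = torch.eye(T)
    for attn in attentions:
        a = attn.mean(dim=0)                  # average over heads: (T, T)
        a = 0.5 * a + 0.5 * torch.eye(T)      # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)   # re-normalize rows
        rollout = a @ rollout                  # accumulate across layers
    # Attention of the response token over the visual tokens only.
    h, w = grid_hw
    vis = rollout[response_idx, visual_slice].reshape(h, w)
    vis = (vis - vis.min()) / (vis.max() - vis.min() + 1e-6)
    # Top-k peaks become (x, y) point prompts for SAM2.
    flat_idx = torch.topk(vis.flatten(), top_k).indices
    points = torch.stack([flat_idx % w, flat_idx // w], dim=-1).float()
    return points, vis
```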
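
Finally, tracklet selection can be sketched as ranking each SAM2 tracklet by the mean Pearson correlation between its keyframe masks and the corresponding rollout maps; the helper below assumes masks and maps share the same resolution.

```python
import torch

def pearson(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between two flattened maps."""
    x, y = x.flatten().float(), y.flatten().float()
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (xc.norm() * yc.norm() + 1e-6)

def select_tracklet(tracklets, rollout_maps):
    """Pick the tracklet whose masks correlate best with the rollout maps.

    tracklets: list of (K, H, W) mask tensors over K keyframes.
    rollout_maps: (K, H, W) rollout maps on the same keyframes.
    """
    scores = []
    for masks in tracklets:
        frame_scores = [pearson(m, r) for m, r in zip(masks, rollout_maps)]
        scores.append(torch.stack(frame_scores).mean())
    best = int(torch.stack(scores).argmax())
    return best, scores
```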

Results

Trained only on Ref-YouTube-VOS, SteerSeg outperforms prior training-free frozen-LVLM baselines on nearly every benchmark and backbone, and is competitive with fully trained methods, despite never updating the LVLM or SAM2.

| Method | LVLM | Ref-DAVIS | ReasonVOS | ReVOS (Overall) | ReVOS (Referring) | ReVOS (Reasoning) |
|---|---|---|---|---|---|---|
| *Fully trained methods* | | | | | | |
| LISA [CVPR'24] | LLaVA-7B | 64.8 / 62.2 / 67.3 | 31.1 / 29.1 / 33.1 | 40.9 / 39.1 / 42.7 | 45.7 / 44.3 / 47.1 | 36.1 / 33.8 / 38.4 |
| VISA [ECCV'24] | ChatUniVi-7B | 69.4 / 66.3 / 72.5 | – | 46.9 / 44.9 / 49.0 | 50.9 / 49.2 / 52.6 | 43.0 / 40.6 / 45.4 |
| VideoLISA [NeurIPS'24] | LLaVA-Phi-3-V | 68.8 / 64.9 / 72.7 | 47.5 / 45.1 / 49.9 | – | – | – |
| GLUS [CVPR'25] | LLaVA-7B | – | 49.9 / 47.5 / 52.4 | 54.9 / 52.4 / 57.3 | 58.3 / 56.0 / 60.7 | 51.4 / 48.8 / 53.9 |
| VRS-HQ [CVPR'25] | ChatUniVi-7B | 76.0 / 72.6 / 79.4 | – | 59.1 / 56.6 / 61.6 | 62.1 / 59.8 / 64.5 | 56.1 / 53.5 / 58.7 |
| Veason-R1 [arXiv'25.08] | Qwen2.5VL-7B | – | 59.9 / 56.0 / 63.8 | 61.3 / 58.2 / 64.4 | 63.6 / 60.7 / 66.5 | 59.0 / 55.8 / 62.2 |
| *Frozen-LVLM methods* | | | | | | |
| Loc-Head* [CVPR'25] | LLaVA-7B | 56.3 / 52.1 / 60.5 | 33.6 / 29.3 / 38.0 | 32.5 / 28.2 / 36.9 | 36.9 / 32.5 / 41.3 | 28.1 / 23.8 / 32.5 |
| DecAF* [ICLR'26] | LLaVA-OV-7B | 59.4 / 54.8 / 64.0 | 52.8 / 49.3 / 56.3 | 40.0 / 35.8 / 44.1 | 43.4 / 39.1 / 47.6 | 36.6 / 32.6 / 40.7 |
| **SteerSeg [Ours]** | **LLaVA-OV-7B** | **70.0 / 65.7 / 74.3** | **58.6 / 55.7 / 61.5** | **49.2 / 45.6 / 52.8** | **51.9 / 48.4 / 55.5** | **47.0 / 43.4 / 50.6** |
| Loc-Head* [CVPR'25] | InternVL3-8B | 66.3 / 62.4 / 70.2 | 44.3 / 41.0 / 47.5 | 43.7 / 39.9 / 47.5 | 46.7 / 42.9 / 50.6 | 43.2 / 39.5 / 46.8 |
| DecAF* [ICLR'26] | InternVL3-8B | 62.8 / 56.9 / 68.6 | 58.9 / 55.1 / 62.7 | 47.4 / 43.7 / 51.2 | 51.7 / 47.9 / 55.5 | 43.2 / 39.5 / 46.8 |
| **SteerSeg [Ours]** | **InternVL3-8B** | **66.1 / 62.0 / 70.1** | **63.3 / 60.6 / 66.1** | **52.5 / 48.8 / 56.2** | **55.6 / 51.9 / 59.3** | **49.5 / 45.8 / 53.1** |
| Loc-Head* [CVPR'25] | Qwen2VL-7B | 61.9 / 58.0 / 65.8 | 34.0 / 31.8 / 36.2 | 44.0 / 40.8 / 47.2 | 52.7 / 49.1 / 56.2 | 35.4 / 32.6 / 38.2 |
| DecAF* [ICLR'26] | Qwen2VL-7B | 64.1 / 59.4 / 68.9 | 52.5 / 49.0 / 56.0 | 45.3 / 41.6 / 49.0 | 52.7 / 48.9 / 56.4 | 37.9 / 34.3 / 41.5 |
| **SteerSeg [Ours]** | **Qwen2VL-7B** | **77.8 / 74.2 / 81.4** | **63.6 / 60.8 / 66.4** | **53.8 / 50.4 / 57.1** | **59.2 / 56.2 / 62.3** | **48.3 / 44.6 / 52.0** |
| Loc-Head* [CVPR'25] | Qwen2.5VL-7B | 64.6 / 60.2 / 68.9 | 41.1 / 37.9 / 44.3 | 47.0 / 43.3 / 50.7 | 53.1 / 49.3 / 56.9 | 40.8 / 37.2 / 44.4 |
| DecAF* [ICLR'26] | Qwen2.5VL-7B | 75.2 / 70.9 / 79.5 | 63.9 / 60.5 / 67.2 | 54.2 / 50.1 / 58.2 | 58.7 / 54.8 / 62.6 | 49.7 / 45.4 / 53.9 |
| **SteerSeg [Ours]** | **Qwen2.5VL-7B** | **81.4 / 78.0 / 84.8** | **65.9 / 63.1 / 68.7** | **56.6 / 53.5 / 59.8** | **61.1 / 58.2 / 63.9** | **52.4 / 48.9 / 55.9** |

Each cell reports J&F / J / F; '–' marks results not reported.

Bolded rows are SteerSeg, which trains only learnable soft prompts while keeping the LVLM and SAM2 frozen. * marks reproduced baselines. Full comparison tables (including MeViS and Ref-YouTube-VOS) are in the paper.

Qualitative Results

Selected segmentation results on ReasonVOS, with the input expression overlaid on each frame. Hover or tap a clip to play; click to enlarge.

BibTeX

If SteerSeg is useful in your work, please cite:

@article{cheraghian2026steerseg,
  title   = {SteerSeg: Attention Steering for Reasoning Video Segmentation},
  author  = {Cheraghian, Ali and Dastmalchi, Hamidreza and Khamis, Abdelwahed and
             Saberi, Morteza and An, Aijun and Petersson, Lars},
  journal = {arXiv preprint},
  year    = {2026}
}

Acknowledgments

We thank the broader vision-language community for releasing the LVLMs and segmentation models that this work builds on. Webpage template inspired by Nerfies.