NeurIPS 2026  ·  Submission

SteerSeg: Attention Steering for
Reasoning Video Segmentation

Ali Cheraghian¹, Hamidreza Dastmalchi², Abdelwahed Khamis³, Morteza Saberi⁴, Aijun An², Lars Petersson³

¹Macquarie University   ²York University   ³CSIRO Data61   ⁴University of Technology Sydney
Figure 1. (a) A diagnostic study reveals a discrepancy between reasoning and grounding: for the query "a white dog with gray patches", the LVLM identifies the correct target object, yet the corresponding attention remains poorly localized, leading to inaccurate masks. (b) Effect of attention refinement on segmentation: raw and contrast-based attention produce ambiguous localization, while soft prompts and Chain-of-Thought reasoning progressively concentrate attention and improve segmentation accuracy.

Abstract

Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals.

In this work, we identify attention misalignment as the key bottleneck and introduce SteerSeg, a lightweight framework that steers attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting: the soft prompts reshape the attention distribution to produce spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects. The resulting maps are converted into point prompts that guide a segmentation model, and candidate tracklets are ranked by correlation-based scoring.

Method

SteerSeg performs input-level attention steering: learnable soft prompts, paired with single-step Chain-of-Thought reasoning, produce concentrated attention maps that are converted into accurate point prompts for SAM2.

Figure 2. SteerSeg pipeline. CoT-derived attributes augment the input expression; soft prompts steer LVLM attention; rollout from the response token produces point prompts for SAM2; candidate tracklets are ranked by correlation against the rollout maps.
  1. Soft Prompt Steering. Learnable soft prompts prepended to the input of a frozen LVLM participate in self-attention at every layer, reshaping how the response token attends to visual tokens and yielding more concentrated, spatially aligned attention maps (minimal code sketches of these steps follow the list).

  2. Chain-of-Thought Attribute Reasoning. A single-step CoT module elicits discriminative attributes (color, position, motion) for the referred object. Appending these attributes to the prompt disambiguates similar instances and guides attention toward the correct target.

  3. Dual-Granularity Rollout. Attention rollout is computed at both frame and video granularities, balancing spatial precision with temporal consistency. The refined maps are converted into point prompts across sampled keyframes.

  4. Correlation-Based Tracklet Selection. SAM2 propagates the point prompts into candidate tracklets, which are ranked by Pearson correlation against the rollout maps. The most consistent tracklet is selected, providing robustness to occlusion, fast motion, and visually similar distractors.
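
To make the soft-prompt steering step concrete, the sketch below shows how a small set of learnable prompt embeddings can be prepended to a frozen LVLM's input sequence so that only the prompts receive gradients. This is a minimal PyTorch-style sketch under assumed names and shapes (`frozen_lvlm`, `embed_dim`, and a HuggingFace-style `inputs_embeds`/`output_attentions` call), not the released implementation.

```python
import torch
import torch.nn as nn

class SoftPromptSteering(nn.Module):
    """Prepends learnable soft prompts to a frozen LVLM's input embeddings.

    Only `self.soft_prompts` is trained; the LVLM stays frozen.
    Names, shapes, and the forward call are illustrative assumptions.
    """

    def __init__(self, frozen_lvlm, num_prompts: int = 16, embed_dim: int = 4096):
        super().__init__()
        self.lvlm = frozen_lvlm.eval()
        for p in self.lvlm.parameters():
            p.requires_grad_(False)
        # Learnable soft prompts, small random initialization.
        self.soft_prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, vision_embeds, text_embeds):
        # vision_embeds: (B, Nv, D), text_embeds: (B, Nt, D)
        B = vision_embeds.size(0)
        prompts = self.soft_prompts.unsqueeze(0).expand(B, -1, -1)
        # The soft prompts participate in self-attention at every layer,
        # reshaping how the response token attends to visual tokens.
        inputs = torch.cat([prompts, vision_embeds, text_embeds], dim=1)
        return self.lvlm(inputs_embeds=inputs, output_attentions=True)
```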
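
The CoT attribute step reduces to prompt construction: first query the LVLM for discriminative attributes of the referred object, then append them to the original expression. The template wording below is an illustrative assumption, not the exact prompt used in the paper.

```python
# Hypothetical single-step CoT template; the exact wording is an assumption.
COT_TEMPLATE = (
    "Question: {expression}\n"
    "Before answering, briefly list the target object's discriminative "
    "attributes (color, position, motion)."
)

def augment_expression(expression: str, attributes: str) -> str:
    """Append CoT-derived attributes to the referring expression."""
    return f"{expression} (attributes: {attributes.strip()})"

# Example (hypothetical):
# attrs = lvlm.generate(COT_TEMPLATE.format(expression="the dog that caught the ball"))
# query = augment_expression("the dog that caught the ball", attrs)
```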
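
For the rollout step, the per-frame part can be sketched as standard attention rollout read off at the response token: multiply head-averaged attention matrices across layers, take the response token's row over the visual tokens, reshape it to the frame grid, and convert the top peaks into point prompts. The grid size, top-k value, and residual weighting below are assumptions for illustration.

```python
import torch

def rollout_point_prompts(attentions, response_idx, visual_slice, grid_hw, top_k=3):
    """Attention rollout from the response token to visual tokens -> point prompts.

    attentions: list of per-layer attention tensors, each (heads, T, T).
    All names, shapes, and constants are illustrative assumptions.
    """
    T = attentions[0].size(-1)
    rollout = torch.eye(T)
    for attn in attentions:
        a = attn.mean(dim=0)                  # average over heads: (T, T)
        a = 0.5 * a + 0.5 * torch.eye(T)      # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)   # re-normalize rows
        rollout = a @ rollout                  # accumulate across layers
    # Attention of the response token over the visual tokens only.
    h, w = grid_hw
    vis = rollout[response_idx, visual_slice].reshape(h, w)
    vis = (vis - vis.min()) / (vis.max() - vis.min() + 1e-6)
    # Top-k peaks become (x, y) point prompts for SAM2.
    flat_idx = torch.topk(vis.flatten(), top_k).indices
    points = torch.stack([flat_idx % w, flat_idx // w], dim=-1).float()
    return points, vis
```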
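
Finally, tracklet selection can be sketched as ranking each SAM2 tracklet by the mean Pearson correlation between its keyframe masks and the corresponding rollout maps; the helper below assumes masks and maps share the same resolution.

```python
import torch

def pearson(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between two flattened maps."""
    x, y = x.flatten().float(), y.flatten().float()
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (xc.norm() * yc.norm() + 1e-6)

def select_tracklet(tracklets, rollout_maps):
    """Pick the tracklet whose masks correlate best with the rollout maps.

    tracklets: list of (K, H, W) mask tensors over K keyframes.
    rollout_maps: (K, H, W) rollout maps on the same keyframes.
    """
    scores = []
    for masks in tracklets:
        frame_scores = [pearson(m, r) for m, r in zip(masks, rollout_maps)]
        scores.append(torch.stack(frame_scores).mean())
    best = int(torch.stack(scores).argmax())
    return best, scores
```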

Results

Trained only on Ref-YouTube-VOS, SteerSeg outperforms prior training-free frozen-LVLM baselines on nearly every benchmark and backbone, and is competitive with fully trained methods, despite never updating the LVLM or SAM2.

| Method | LVLM | Ref-DAVIS | ReasonVOS | ReVOS (Overall) | ReVOS (Referring) | ReVOS (Reasoning) |
|---|---|---|---|---|---|---|
| *Fully trained methods* | | | | | | |
| LISA [CVPR'24] | LLaVA-7B | 64.8 / 62.2 / 67.3 | 31.1 / 29.1 / 33.1 | 40.9 / 39.1 / 42.7 | 45.7 / 44.3 / 47.1 | 36.1 / 33.8 / 38.4 |
| VISA [ECCV'24] | ChatUniVi-7B | 69.4 / 66.3 / 72.5 | – | 46.9 / 44.9 / 49.0 | 50.9 / 49.2 / 52.6 | 43.0 / 40.6 / 45.4 |
| VideoLISA [NeurIPS'24] | LLaVA-Phi-3-V | 68.8 / 64.9 / 72.7 | 47.5 / 45.1 / 49.9 | – | – | – |
| GLUS [CVPR'25] | LLaVA-7B | – | 49.9 / 47.5 / 52.4 | 54.9 / 52.4 / 57.3 | 58.3 / 56.0 / 60.7 | 51.4 / 48.8 / 53.9 |
| VRS-HQ [CVPR'25] | ChatUniVi-7B | 76.0 / 72.6 / 79.4 | – | 59.1 / 56.6 / 61.6 | 62.1 / 59.8 / 64.5 | 56.1 / 53.5 / 58.7 |
| Veason-R1 [arXiv'25.08] | Qwen2.5VL-7B | – | 59.9 / 56.0 / 63.8 | 61.3 / 58.2 / 64.4 | 63.6 / 60.7 / 66.5 | 59.0 / 55.8 / 62.2 |
| *Frozen-LVLM methods* | | | | | | |
| Loc-Head* [CVPR'25] | LLaVA-7B | 56.3 / 52.1 / 60.5 | 33.6 / 29.3 / 38.0 | 32.5 / 28.2 / 36.9 | 36.9 / 32.5 / 41.3 | 28.1 / 23.8 / 32.5 |
| DecAF* [ICLR'26] | LLaVA-OV-7B | 59.4 / 54.8 / 64.0 | 52.8 / 49.3 / 56.3 | 40.0 / 35.8 / 44.1 | 43.4 / 39.1 / 47.6 | 36.6 / 32.6 / 40.7 |
| **SteerSeg [Ours]** | **LLaVA-OV-7B** | **70.0 / 65.7 / 74.3** | **58.6 / 55.7 / 61.5** | **49.2 / 45.6 / 52.8** | **51.9 / 48.4 / 55.5** | **47.0 / 43.4 / 50.6** |
| Loc-Head* [CVPR'25] | InternVL3-8B | 66.3 / 62.4 / 70.2 | 44.3 / 41.0 / 47.5 | 43.7 / 39.9 / 47.5 | 46.7 / 42.9 / 50.6 | 43.2 / 39.5 / 46.8 |
| DecAF* [ICLR'26] | InternVL3-8B | 62.8 / 56.9 / 68.6 | 58.9 / 55.1 / 62.7 | 47.4 / 43.7 / 51.2 | 51.7 / 47.9 / 55.5 | 43.2 / 39.5 / 46.8 |
| **SteerSeg [Ours]** | **InternVL3-8B** | **66.1 / 62.0 / 70.1** | **63.3 / 60.6 / 66.1** | **52.5 / 48.8 / 56.2** | **55.6 / 51.9 / 59.3** | **49.5 / 45.8 / 53.1** |
| Loc-Head* [CVPR'25] | Qwen2VL-7B | 61.9 / 58.0 / 65.8 | 34.0 / 31.8 / 36.2 | 44.0 / 40.8 / 47.2 | 52.7 / 49.1 / 56.2 | 35.4 / 32.6 / 38.2 |
| DecAF* [ICLR'26] | Qwen2VL-7B | 64.1 / 59.4 / 68.9 | 52.5 / 49.0 / 56.0 | 45.3 / 41.6 / 49.0 | 52.7 / 48.9 / 56.4 | 37.9 / 34.3 / 41.5 |
| **SteerSeg [Ours]** | **Qwen2VL-7B** | **77.8 / 74.2 / 81.4** | **63.6 / 60.8 / 66.4** | **53.8 / 50.4 / 57.1** | **59.2 / 56.2 / 62.3** | **48.3 / 44.6 / 52.0** |
| Loc-Head* [CVPR'25] | Qwen2.5VL-7B | 64.6 / 60.2 / 68.9 | 41.1 / 37.9 / 44.3 | 47.0 / 43.3 / 50.7 | 53.1 / 49.3 / 56.9 | 40.8 / 37.2 / 44.4 |
| DecAF* [ICLR'26] | Qwen2.5VL-7B | 75.2 / 70.9 / 79.5 | 63.9 / 60.5 / 67.2 | 54.2 / 50.1 / 58.2 | 58.7 / 54.8 / 62.6 | 49.7 / 45.4 / 53.9 |
| **SteerSeg [Ours]** | **Qwen2.5VL-7B** | **81.4 / 78.0 / 84.8** | **65.9 / 63.1 / 68.7** | **56.6 / 53.5 / 59.8** | **61.1 / 58.2 / 63.9** | **52.4 / 48.9 / 55.9** |

Each cell reports J&F / J / F; '–' marks results not reported.

Bolded rows are SteerSeg, which trains only learnable soft prompts while keeping the LVLM and SAM2 frozen. * marks reproduced baselines. Full comparison tables (including MeViS and Ref-YouTube-VOS) are in the paper.

Qualitative Results

Selected segmentation results on ReasonVOS, with the input expression overlaid on each frame. Hover or tap a clip to play; click to enlarge.

BibTeX

If SteerSeg is useful in your work, please cite:

@article{cheraghian2026steerseg,
  title   = {SteerSeg: Attention Steering for Reasoning Video Segmentation},
  author  = {Cheraghian, Ali and Dastmalchi, Hamidreza and Khamis, Abdelwahed and
             Saberi, Morteza and An, Aijun and Petersson, Lars},
  journal = {arXiv preprint},
  year    = {2026}
}

Acknowledgments

We thank the broader vision-language community for releasing the LVLMs and segmentation models that this work builds on. Webpage template inspired by Nerfies.