Technical Report · StepFun-Audio Team

Boosting Omni-Modal Language Models

Staged Post-Training with Visually Debiased Evaluation

Che Liu1,2, Lichao Ma1,3, Xiangyu Tony Zhang1,5, Yuxin Zhang1,4, Haoyang Zhang1,3, Xuerui Yang1, and Fei Tian1,*

1StepFun · 2Imperial College London · 3Peking University · 4Shanghai Jiao Tong University · 5The University of New South Wales

*Corresponding author: tianfei@stepfun.com

Benchmark score shifts after OmniClean filtering
Fig. 1 exposes the central measurement issue: benchmark scores can drop substantially after removing queries that are already answerable from the visual input and the question alone.

Key findings

What should readers take away?

01

Visual shortcuts can inflate omni-modal scores.

Some audio-visual-language queries remain answerable from the visual stream and the question alone. Fig. 1 shows why raw scores should not be read as direct evidence of audio-visual-language integration.

02

Cleaning changes benchmark meaning, not just scores.

OmniClean audits 16,968 queries and keeps 8,551 under a fixed protocol. After filtering, several benchmarks become less tied to visual or audio reference strength.

03

OmniBoost substantially improves the 3B baseline.

Stage 2 improves the macro average by +6.51 over Qwen2.5-Omni-3B. Stage 3 improves the query-weighted average by +5.10 and slightly exceeds Qwen3-Omni-30B-A3B-Instruct.

04

Self-distillation helps, but the profile matters.

Synthetic audio-visual-text supervision improves the 3B lineage, yet Stage 2 and Stage 3 emphasize different benchmark families and aggregation views.

Motivation

The visual shortcut is the first problem to control

Omni-modal language models are expected to integrate audio, vision, and language. However, a benchmark query may be labeled audio-visual-language on paper while still being answerable from the visual input and the question alone.

This creates a measurement problem: model gains can come from stronger visual shortcut exploitation rather than improved omni-modal reasoning. We therefore first construct a cleaned evaluation view, then use it to study which training signals actually transfer.

OmniClean

A visually debiased evaluation view

OmniClean probes each evaluation query with image or video plus the original text question while withholding audio. If a strong visual-language model recovers the verifiable answer in this visual-only setting, the query is excluded from the cleaned view. Benchmarks with protocol-specific exceptions are retained as full subsets.
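The probing protocol described above can be sketched as a simple filter: a visual-language model sees only the visuals and the question, and any query it answers correctly is dropped. This is a minimal illustrative sketch; the class and field names (`ToyProbe`, `visual`, `question`, `answer`) are assumptions, not the authors' implementation.

```python
class ToyProbe:
    """Stand-in for a strong visual-language model; answers from a lookup table."""
    def __init__(self, visually_answerable):
        self.visually_answerable = visually_answerable  # (visual, question) -> answer

    def answer(self, visual, question):
        return self.visually_answerable.get((visual, question))


def visual_only_probe(model, query):
    """Probe with visuals + text only, audio withheld; exact-match verification."""
    prediction = model.answer(query["visual"], query["question"])
    return prediction == query["answer"]


def omniclean_filter(model, queries):
    """Keep only queries the visual-only probe fails to answer."""
    return [q for q in queries if not visual_only_probe(model, q)]
```

In this sketch a query survives only if withholding audio actually breaks the probe, which is what makes the retained subset a cleaner test of audio-visual-language integration.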

9 audited omni benchmarks
16,968 queries before cleaning
8,551 retained OmniClean queries
Correlation shifts after cleaning
Cleaning changes whether omni scores track vision or audio reference strength, indicating that the cleaned view changes what the benchmark measures.

Leakage diagnostic

Visual-only solvability varies sharply by benchmark

Visual-only probing histograms, one per audited benchmark: CG-AV-Counting, Daily-Omni, IntentBench, OmniBench, UNO-Bench, Video-Holmes, WorldSense, and OmniVideoBench.

The diagnostic is query-level rather than benchmark-level: two benchmarks with similar raw omni scores can contain very different amounts of visually answerable content.
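Because the diagnostic is query-level, the per-benchmark leakage rate is just the fraction of queries the visual-only probe solves. A minimal sketch of that aggregation (function and record names are illustrative):

```python
from collections import defaultdict

def visual_solvability_rates(records):
    """records: iterable of (benchmark, solved_visual_only) pairs.
    Returns the per-benchmark fraction of visually answerable queries."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for bench, ok in records:
        totals[bench] += 1
        solved[bench] += int(ok)
    return {b: solved[b] / totals[b] for b in totals}
```

Two benchmarks with similar raw omni scores can still come out with very different rates under this accounting, which is the point of reporting the diagnostic per benchmark.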

OmniBoost

Testing what kind of post-training transfers to OmniClean

Stage 1

Mixed bi-modal SFT

Balanced audio-text, image-text, video-text, and text supervision.

Stage 2

Mixed-modality RLVR

Verifiable-reward optimization over text, visual, audio-image, and audio-video tasks.

Stage 3

Self-distillation SFT

SFT on filtered synthetic audio-visual-text traces generated by the same 3B lineage.
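The verifiable-reward signal used in Stage 2 can be as simple as exact-match against a reference answer. The sketch below is illustrative only; the `Answer:` output convention and the binary reward are assumptions, not the authors' reward design.

```python
import re

def verifiable_reward(response, reference):
    """Binary verifiable reward: 1.0 iff the final answer matches the reference.

    Assumes (illustratively) that the model ends its output with 'Answer: X'.
    Malformed outputs with no parseable final answer earn zero reward.
    """
    match = re.search(r"Answer:\s*(\S+)\s*$", response.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).upper() == reference.upper() else 0.0
```

A reward of this shape is what makes the optimization "verifiable": correctness is checked mechanically rather than judged by another model.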

Modality composition of the RLVR training mixture
Stage 2 explicitly shifts optimization toward audio-video-text and audio-image-text queries while retaining visual and textual replay.
Synthetic query construction pipeline
Stage 3 builds hard-matchable synthetic query pairs from dense audio/video captions and entity-relation records before rollout filtering.

Results

Stages improve the 3B baseline and approach larger references

OmniBoost stage comparison against Qwen2.5-Omni-3B and Qwen3-Omni-30B-Instruct
Relative to Qwen2.5-Omni-3B, Stage 2 gives the strongest macro-average gain, while Stage 3 gives the strongest query-weighted gain and slightly exceeds Qwen3-Omni-30B-A3B-Instruct on that retained-query summary.
OmniBoost aggregate ordering
Stage 2 is strongest under benchmark-level macro averaging, while Stage 3 leads under query-weighted averaging because larger retained subsets receive more weight.
Benchmark-level OmniBoost deltas relative to Qwen2.5-Omni-3B
| Benchmark | Baseline | Stage 1 | Stage 2 | Stage 3 |
| --- | --- | --- | --- | --- |
| Daily-Omni | 27.53 | 27.43 (-0.10) | 38.05 (+10.52) | 38.82 (+11.29) |
| IntentBench | 29.57 | 30.15 (+0.58) | 36.46 (+6.89) | 37.03 (+7.46) |
| Video-Holmes | 24.36 | 31.53 (+7.17) | 47.07 (+22.71) | 44.46 (+20.10) |
| WorldSense | 24.91 | 24.11 (-0.80) | 27.53 (+2.62) | 24.71 (-0.20) |
| OmniBench | 27.14 | 32.13 (+4.99) | 43.24 (+16.10) | 40.29 (+13.15) |
| UNO-Bench | 21.41 | 23.68 (+2.27) | 21.97 (+0.56) | 23.35 (+1.94) |
| CG-AV-Counting | 12.73 | 16.22 (+3.49) | 19.65 (+6.92) | 16.49 (+3.76) |
| OmniVideoBench | 27.67 | 25.16 (-2.51) | 21.00 (-6.67) | 22.33 (-5.34) |
| AV-Odyssey | 29.00 | 28.00 (-1.00) | 27.87 (-1.13) | 31.80 (+2.80) |
| Macro Avg. | 24.92 | 26.49 (+1.57) | 31.43 (+6.51) | 31.03 (+6.11) |
| Query-Weighted Avg. | 27.05 | 27.58 (+0.53) | 30.74 (+3.69) | 32.15 (+5.10) |
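The two summary rows differ only in weighting: the macro average counts every benchmark equally, while the query-weighted average weights each benchmark by its retained OmniClean query count. The baseline macro average can be reproduced directly from the table; the query-weighted example below uses hypothetical counts, since the per-benchmark retained counts are not listed here.

```python
# Baseline column of the table, in row order.
baseline = [27.53, 29.57, 24.36, 24.91, 27.14, 21.41, 12.73, 27.67, 29.00]

def macro_avg(scores):
    """Benchmark-level view: each benchmark counts equally."""
    return sum(scores) / len(scores)

def query_weighted_avg(scores, counts):
    """Query-level view: each benchmark weighted by its retained query count."""
    return sum(s * c for s, c in zip(scores, counts)) / sum(counts)

print(round(macro_avg(baseline), 2))  # → 24.92, matching the table's Macro Avg. row
```

Because larger retained subsets receive more weight, a stage that helps the big benchmarks (as Stage 3 does) can lead the query-weighted view while trailing under macro averaging.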

Synthetic query construction

Synthetic data and filtering turn caption evidence into verifiable supervision

Construct

Caption and entity records

Audio and video captions are organized into within-segment and temporal relation scaffolds.

Generate

Hard-matchable queries

Synthetic audio-visual-text questions are constrained to verifiable answer formats.

Filter

F1-F3 quality passes

Rollouts are filtered by difficulty, perception defects, malformed outputs, and answer consistency.
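The sequential quality passes above can be composed as predicates over a rollout record, where any failing check discards the rollout. A minimal sketch; the field names and the difficulty thresholds are illustrative assumptions, not the paper's F1–F3 definitions.

```python
def keep_rollout(rollout):
    """Apply quality passes in sequence; any failure discards the rollout.

    Field names and thresholds are illustrative, not the authors' criteria.
    """
    checks = [
        lambda r: 0.1 <= r["pass_rate"] <= 0.9,  # difficulty: not trivial, not impossible
        lambda r: not r["perception_defect"],    # perception: evidence actually grounded
        lambda r: r["well_formed"],              # format: parseable final answer
        lambda r: r["answer"] == r["reference"], # consistency: matches verifiable answer
    ]
    return all(check(rollout) for check in checks)
```

Filtering on a pass-rate band is one common way to operationalize "difficulty": rollouts every sample solves teach nothing, and rollouts no sample solves are likely defective.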

Detailed synthetic query construction process
Caption records, entity relations, answer-format constraints, and rollout filtering are used to construct verifiable supervision for the same 3B model lineage.

Resources

Paper and data

Paper: arXiv PDF
Data: OmniClean on Hugging Face

Citation

BibTeX

@misc{liu2026omnicleanomniboost,
  title = {Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation},
  author = {Liu, Che and Ma, Lichao and Zhang, Xiangyu Tony and Zhang, Yuxin and Zhang, Haoyang and Yang, Xuerui and Tian, Fei},
  year = {2026},
  note = {Preprint}
}