Technical Report · StepFun-Audio Team

Boosting Omni-Modal Language Models

Staged Post-Training with Visually Debiased Evaluation

Che Liu1,2, Lichao Ma1,3, Xiangyu Tony Zhang1,5, Yuxin Zhang1,4, Haoyang Zhang1,3, Xuerui Yang1, and Fei Tian1,*

1StepFun · 2Imperial College London · 3Peking University · 4Shanghai Jiao Tong University · 5The University of New South Wales

*Corresponding author: tianfei@stepfun.com

Benchmark score shifts after OmniClean filtering
Fig. 1 exposes the central measurement issue: benchmark scores can drop substantially after removing queries that are already answerable from the visual input and the question alone.

Key findings

What should readers take away?

01

Visual shortcuts can inflate omni-modal scores.

Some audio-visual-language queries remain answerable from the visual stream and the question alone. Fig. 1 shows why raw scores should not be read as direct evidence of audio-visual-language integration.

02

Cleaning changes benchmark meaning, not just scores.

OmniClean audits 16,968 queries and keeps 8,551 under a fixed protocol. After filtering, several benchmarks become less tied to visual or audio reference strength.

03

OmniBoost substantially improves the 3B baseline.

Stage 2 improves the macro average by +6.51 over Qwen2.5-Omni-3B. Stage 3 improves the query-weighted average by +5.10 and slightly exceeds Qwen3-Omni-30B-A3B-Instruct.

04

Self-distillation helps, but the profile matters.

Synthetic audio-visual-text supervision improves the 3B lineage, yet Stage 2 and Stage 3 emphasize different benchmark families and aggregation views.

Motivation

The visual shortcut is the first problem to control

Omni-modal language models are expected to integrate audio, vision, and language. However, a benchmark query may be labeled audio-visual-language on paper while still being answerable from the visual input and the question alone.

This creates a measurement problem: model gains can come from stronger visual shortcut exploitation rather than improved omni-modal reasoning. We therefore first construct a cleaned evaluation view, then use it to study which training signals actually transfer.

OmniClean

A visually debiased evaluation view

OmniClean probes each evaluation query with image or video plus the original text question while withholding audio. If a strong visual-language model recovers the verifiable answer in this visual-only setting, the query is excluded from the cleaned view. Benchmarks with protocol-specific exceptions are retained as full subsets.
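The probing protocol described above can be sketched as a simple filter: a visual-language model sees only the visuals and the question, and any query it answers correctly is dropped. This is a minimal illustrative sketch; the class and field names (`ToyProbe`, `visual`, `question`, `answer`) are assumptions, not the authors' implementation.

```python
class ToyProbe:
    """Stand-in for a strong visual-language model; answers from a lookup table."""
    def __init__(self, visually_answerable):
        self.visually_answerable = visually_answerable  # (visual, question) -> answer

    def answer(self, visual, question):
        return self.visually_answerable.get((visual, question))


def visual_only_probe(model, query):
    """Probe with visuals + text only, audio withheld; exact-match verification."""
    prediction = model.answer(query["visual"], query["question"])
    return prediction == query["answer"]


def omniclean_filter(model, queries):
    """Keep only queries the visual-only probe fails to answer."""
    return [q for q in queries if not visual_only_probe(model, q)]
```

In this sketch a query survives only if withholding audio actually breaks the probe, which is what makes the retained subset a cleaner test of audio-visual-language integration.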

9 audited omni benchmarks
16,968 queries before cleaning
8,551 retained OmniClean queries
Correlation shifts after cleaning
Cleaning changes whether omni scores track vision or audio reference strength, indicating that the cleaned view changes what the benchmark measures.

Leakage diagnostic

Visual-only solvability varies sharply by benchmark

Visual-only probing histograms, one per audited benchmark: CG-AV-Counting, Daily-Omni, IntentBench, OmniBench, UNO-Bench, Video-Holmes, WorldSense, and OmniVideoBench.

The diagnostic is query-level rather than benchmark-level: two benchmarks with similar raw omni scores can contain very different amounts of visually answerable content.
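Because the diagnostic is query-level, the per-benchmark leakage rate is just the fraction of queries the visual-only probe solves. A minimal sketch of that aggregation (function and record names are illustrative):

```python
from collections import defaultdict

def visual_solvability_rates(records):
    """records: iterable of (benchmark, solved_visual_only) pairs.
    Returns the per-benchmark fraction of visually answerable queries."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for bench, ok in records:
        totals[bench] += 1
        solved[bench] += int(ok)
    return {b: solved[b] / totals[b] for b in totals}
```

Two benchmarks with similar raw omni scores can still come out with very different rates under this accounting, which is the point of reporting the diagnostic per benchmark.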

OmniBoost

Testing what kind of post-training transfers to OmniClean

Stage 1

Mixed bi-modal SFT

Balanced audio-text, image-text, video-text, and text supervision.

Stage 2

Mixed-modality RLVR

Verifiable-reward optimization over text, visual, audio-image, and audio-video tasks.

Stage 3

Self-distillation SFT

SFT on filtered synthetic audio-visual-text traces generated by the same 3B lineage.
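The verifiable-reward signal used in Stage 2 can be as simple as exact-match against a reference answer. The sketch below is illustrative only; the `Answer:` output convention and the binary reward are assumptions, not the authors' reward design.

```python
import re

def verifiable_reward(response, reference):
    """Binary verifiable reward: 1.0 iff the final answer matches the reference.

    Assumes (illustratively) that the model ends its output with 'Answer: X'.
    Malformed outputs with no parseable final answer earn zero reward.
    """
    match = re.search(r"Answer:\s*(\S+)\s*$", response.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).upper() == reference.upper() else 0.0
```

A reward of this shape is what makes the optimization "verifiable": correctness is checked mechanically rather than judged by another model.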

Modality composition of the RLVR training mixture
Stage 2 explicitly shifts optimization toward audio-video-text and audio-image-text queries while retaining visual and textual replay.
Synthetic query construction pipeline
Stage 3 builds hard-matchable synthetic query pairs from dense audio/video captions and entity-relation records before rollout filtering.

Results

Stages improve the 3B baseline and approach larger references

OmniBoost stage comparison against Qwen2.5-Omni-3B and Qwen3-Omni-30B-Instruct
Relative to Qwen2.5-Omni-3B, Stage 2 gives the strongest macro-average gain, while Stage 3 gives the strongest query-weighted gain and slightly exceeds Qwen3-Omni-30B-A3B-Instruct on that retained-query summary.
OmniBoost aggregate ordering
Stage 2 is strongest under benchmark-level macro averaging, while Stage 3 leads under query-weighted averaging because larger retained subsets receive more weight.
Benchmark-level OmniBoost deltas relative to Qwen2.5-Omni-3B
| Benchmark | Baseline | Stage 1 | Stage 2 | Stage 3 |
| --- | --- | --- | --- | --- |
| Daily-Omni | 27.53 | 27.43 (-0.10) | 38.05 (+10.52) | 38.82 (+11.29) |
| IntentBench | 29.57 | 30.15 (+0.58) | 36.46 (+6.89) | 37.03 (+7.46) |
| Video-Holmes | 24.36 | 31.53 (+7.17) | 47.07 (+22.71) | 44.46 (+20.10) |
| WorldSense | 24.91 | 24.11 (-0.80) | 27.53 (+2.62) | 24.71 (-0.20) |
| OmniBench | 27.14 | 32.13 (+4.99) | 43.24 (+16.10) | 40.29 (+13.15) |
| UNO-Bench | 21.41 | 23.68 (+2.27) | 21.97 (+0.56) | 23.35 (+1.94) |
| CG-AV-Counting | 12.73 | 16.22 (+3.49) | 19.65 (+6.92) | 16.49 (+3.76) |
| OmniVideoBench | 27.67 | 25.16 (-2.51) | 21.00 (-6.67) | 22.33 (-5.34) |
| AV-Odyssey | 29.00 | 28.00 (-1.00) | 27.87 (-1.13) | 31.80 (+2.80) |
| Macro Avg. | 24.92 | 26.49 (+1.57) | 31.43 (+6.51) | 31.03 (+6.11) |
| Query-Weighted Avg. | 27.05 | 27.58 (+0.53) | 30.74 (+3.69) | 32.15 (+5.10) |
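The two summary rows differ only in weighting: the macro average counts every benchmark equally, while the query-weighted average weights each benchmark by its retained OmniClean query count. The baseline macro average can be reproduced directly from the table; the query-weighted example below uses hypothetical counts, since the per-benchmark retained counts are not listed here.

```python
# Baseline column of the table, in row order.
baseline = [27.53, 29.57, 24.36, 24.91, 27.14, 21.41, 12.73, 27.67, 29.00]

def macro_avg(scores):
    """Benchmark-level view: each benchmark counts equally."""
    return sum(scores) / len(scores)

def query_weighted_avg(scores, counts):
    """Query-level view: each benchmark weighted by its retained query count."""
    return sum(s * c for s, c in zip(scores, counts)) / sum(counts)

print(round(macro_avg(baseline), 2))  # → 24.92, matching the table's Macro Avg. row
```

Because larger retained subsets receive more weight, a stage that helps the big benchmarks (as Stage 3 does) can lead the query-weighted view while trailing under macro averaging.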

Synthetic query construction

Synthetic data and filtering turn caption evidence into verifiable supervision

Construct

Caption and entity records

Audio and video captions are organized into within-segment and temporal relation scaffolds.

Generate

Hard-matchable queries

Synthetic audio-visual-text questions are constrained to verifiable answer formats.

Filter

F1-F3 quality passes

Rollouts are filtered by difficulty, perception defects, malformed outputs, and answer consistency.
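The sequential quality passes above can be composed as predicates over a rollout record, where any failing check discards the rollout. A minimal sketch; the field names and the difficulty thresholds are illustrative assumptions, not the paper's F1–F3 definitions.

```python
def keep_rollout(rollout):
    """Apply quality passes in sequence; any failure discards the rollout.

    Field names and thresholds are illustrative, not the authors' criteria.
    """
    checks = [
        lambda r: 0.1 <= r["pass_rate"] <= 0.9,  # difficulty: not trivial, not impossible
        lambda r: not r["perception_defect"],    # perception: evidence actually grounded
        lambda r: r["well_formed"],              # format: parseable final answer
        lambda r: r["answer"] == r["reference"], # consistency: matches verifiable answer
    ]
    return all(check(rollout) for check in checks)
```

Filtering on a pass-rate band is one common way to operationalize "difficulty": rollouts every sample solves teach nothing, and rollouts no sample solves are likely defective.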

Detailed synthetic query construction process
Caption records, entity relations, answer-format constraints, and rollout filtering are used to construct verifiable supervision for the same 3B model lineage.

Resources

Paper and data

Paper: arXiv PDF
Data: OmniClean on Hugging Face

Citation

BibTeX

@misc{liu2026omnicleanomniboost,
  title = {Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation},
  author = {Liu, Che and Ma, Lichao and Zhang, Xiangyu Tony and Zhang, Yuxin and Zhang, Haoyang and Yang, Xuerui and Tian, Fei},
  year = {2026},
  note = {Preprint}
}