TraceAV-Bench

Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

Hengyi Feng1,2*, Hao Liang2,4*†, Mingrui Chen3, Bohan Zeng2, Meiyi Qiang2,
Zhengyang Zhao2, Zimo Meng2, Zeang Sheng2, Wentao Zhang2,4
1University of Electronic Science and Technology of China    2Peking University   
3Institute of Automation, Chinese Academy of Sciences    4Zhongguancun Academy
*Equal contribution    †Project leader    Corresponding author
2,200 Questions · 578 Long Videos · 339.5 h Total Duration · 3.68 Avg. Hops per Question · 15.1 min Avg. Temporal Span · 15 Sub-Tasks
TraceAV-Bench teaser: two QA examples grounded in multi-hop audio-visual trajectories

Two representative TraceAV-Bench questions, each grounded in an explicit multi-hop evidence trajectory whose hops are tagged with their source modality. Top: Temporal Sequencing (AVR) — chronologically ordering four events by chaining speech and on-screen cues. Bottom: Temporal Splicing Fallacy (MH) — rejecting a fabricated narrative that splices temporally isolated events into a false ownership timeline.

Abstract

Real-world audio-visual understanding requires chaining evidence that is sparse, temporally dispersed, and split across the visual and auditory streams. Existing benchmarks largely fail to evaluate this capability: they restrict videos to short clips, isolate modalities, or reduce questions to one-hop perception.

We introduce TraceAV-Bench, the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness. TraceAV-Bench comprises 2,200 rigorously validated multiple-choice questions over 578 long videos totaling 339.5 hours, spanning 4 evaluation dimensions and 15 sub-tasks. Each question is grounded in an explicit reasoning chain that averages 3.68 hops across a 15.1-minute temporal span. The dataset is built by a three-step semi-automated pipeline followed by a strict quality assurance process.

Evaluation of multiple representative OmniLLMs reveals that TraceAV-Bench poses a persistent challenge across all models, with the strongest closed-source model (Gemini 3.1 Pro) reaching only 68.29% on general tasks, and the best open-source model (Ming-Flash-Omni-2.0) reaching 51.70%, leaving substantial headroom. Moreover, robustness to multimodal hallucination is largely decoupled from general multimodal reasoning performance.

Key Features

Ultra-Long Videos

606–8,394 s per video (avg. 2,112 s, ~35 min). The only benchmark with an average duration beyond 30 min.

Multi-Hop Trajectories

Every question grounded in an explicit evidence chain of temporally dispersed, cross-modal hops.

4 Dims × 15 Tasks

Covering Audio-Visual Joint Reasoning, Visual-Centric Reasoning, Audio-Centric Reasoning, and Multimodal Hallucination.

Hallucination Stress Test

Dedicated MH dimension: V2A deception, A2V deception, and temporal splicing fallacy.

Task Taxonomy

TraceAV-Bench organizes 15 sub-tasks under 4 evaluation dimensions.

AVR · Audio-Visual Joint Reasoning (7 tasks · 835 Qs)
  IR: Information Retrieval (140)
  TS: Temporal Sequencing (97)
  ET: Entity Tracking (124)
  FCR: Forward Causal Reasoning (73)
  BCR: Backward Causal Reasoning (89)
  CMM: Cross-Modality Matching (85)
  SL: Spatiotemporal Localization (227)
VR · Visual-Centric Reasoning (2 tasks · 391 Qs)
  SR: Spatial Reasoning (165)
  VC: Visual Counting (226)
AR · Audio-Centric Reasoning (3 tasks · 349 Qs)
  SC: Speech Context (130)
  ES: Environmental Sound (88)
  BM: Background Music (131)
MH · Multimodal Hallucination (3 tasks · 625 Qs)
  V2A: Visual-to-Audio Deception (230)
  A2V: Audio-to-Visual Deception (229)
  TSF: Temporal Splicing Fallacy (166)

Dataset Statistics

578 long videos (339.5 h) spanning 5 top-level genres and 30+ sub-genres, with 2,200 questions grounded in multi-hop audio-visual trajectories.

Video Corpus

Total videos: 578
Total duration: 339.5 hrs
Duration range: 10.1 – 139.9 min
Avg. duration: 35.2 min
Resolution ≥ 720p: 73.7%

Question Pool

Total questions: 2,200
Single-choice: 1,848 (84.0%)
Multi-choice: 352 (16.0%)
Avg. question length: 43.6 words
Avg. option length: 22.1 words

Reasoning Trajectories

Evaluation dimensions: 4
Sub-tasks: 15
Avg. hop count: 3.68
Avg. temporal span: 15.1 min
Modality coverage: audio + visual
Video category distribution (sunburst chart)
Video category distribution. Our corpus covers 5 top-level genres — Knowledge & Information, Lifestyle & Leisure, Film & TV, Sports Competition, and Artistic Performance — with 30+ fine-grained sub-genres ensuring broad topical diversity.
Video duration distribution histogram and bucket counts
Video duration distribution. All videos are ≥ 10 min; 40.8% fall in the 25–45 min bucket, 17.5% are 45–60 min, and 7.4% exceed 1 hour, demonstrating TraceAV-Bench's ultra-long-form nature.

Per-Sub-Task Statistics

Dim. | Sub-task | #Q | % | Hops | Span (min) | M-ch
AVR | Information Retrieval (IR) | 140 | 6.4 | 2.87 | 13.5 | 9
AVR | Temporal Sequencing (TS) | 97 | 4.4 | 3.99 | 15.2 | 14
AVR | Entity Tracking (ET) | 124 | 5.6 | 4.19 | 18.3 | 43
AVR | Forward Causal Reasoning (FCR) | 73 | 3.3 | 3.11 | 10.1 | 14
AVR | Backward Causal Reasoning (BCR) | 89 | 4.0 | 3.53 | 14.1 | 44
AVR | Cross-Modality Matching (CMM) | 85 | 3.9 | 3.80 | 19.3 | 22
AVR | Spatiotemporal Localization (SL) | 227 | 10.3 | 3.40 | 12.1 | 13
VR | Spatial Reasoning (SR) | 165 | 7.5 | 3.38 | 14.6 | 0
VR | Visual Counting (VC) | 226 | 10.3 | 4.24 | 14.4 | 11
AR | Speech Context (SC) | 130 | 5.9 | 3.22 | 15.7 | 23
AR | Environmental Sound (ES) | 88 | 4.0 | 3.41 | 11.8 | 22
AR | Background Music (BM) | 131 | 6.0 | 3.68 | 17.4 | 30
MH | Visual-to-Audio Deception (V2A) | 230 | 10.5 | 3.60 | 14.2 | 30
MH | Audio-to-Visual Deception (A2V) | 229 | 10.4 | 4.00 | 15.9 | 25
MH | Temporal Splicing Fallacy (TSF) | 166 | 7.5 | 4.23 | 19.8 | 52
Total | | 2,200 | 100.0 | 3.68 | 15.1 | 352

Hops: average hop count per question. Span: average temporal span of the evidence trajectory. M-ch: number of multiple-choice questions.

Benchmark Construction Pipeline

A three-step semi-automated pipeline followed by a strict quality assurance stage.

TraceAV-Bench construction pipeline overview
1. Visual Captioning: minute-level visual captioning with Qwen3-VL-32B, equipped with an Entity Cache for long-range identity tracking.
2. Asynchronous A-V Fusion: Gemini-2.5-Flash aligns each one-minute audio segment with the visual narrative and updates entities from audio evidence.
3. Agentic QA Generation: event segmentation, trajectory proposal, and MCQ generation over explicit multi-hop evidence.
4. Quality Assurance: multi-stage verification with a blindfolded solver, deduplication, and LLM-based filtering.

Leaderboard

General Tasks (AVR / VR / AR Dimensions)

Accuracy (%) on the 12 general sub-tasks across the Audio-Visual Joint Reasoning, Visual-Centric Reasoning, and Audio-Centric Reasoning dimensions.

# | Model | Modality | IR | TS | ET | FCR | BCR | CMM | SL | SR | VC | SC | ES | BM | Avg
1 | Gemini 3.1 Pro | A+V | 83.57 | 60.82 | 71.77 | 86.30 | 61.80 | 49.41 | 51.54 | 73.94 | 41.15 | 96.92 | 63.64 | 78.63 | 68.29
2 | Gemini 2.5 Pro | A+V | 83.57 | 63.92 | 60.48 | 76.71 | 50.56 | 54.12 | 48.02 | 68.48 | 39.38 | 83.85 | 65.91 | 67.18 | 63.52
3 | Gemini 3 Flash | A+V | 82.14 | 53.61 | 65.32 | 83.56 | 59.55 | 49.41 | 29.07 | 73.33 | 36.73 | 86.92 | 61.36 | 66.41 | 62.28
4 | Gemini 2.5 Flash | A+V | 75.00 | 58.76 | 62.10 | 75.34 | 58.43 | 40.00 | 29.07 | 66.06 | 39.38 | 81.54 | 60.23 | 62.60 | 59.04
5 | Ming-Flash-Omni-2.0 | A+V | 56.43 | 53.61 | 47.58 | 57.53 | 40.45 | 44.71 | 31.28 | 65.45 | 39.38 | 63.85 | 56.82 | 63.36 | 51.70
6 | Gemini 2 Flash | A+V | 66.43 | 53.61 | 58.06 | 64.38 | 43.82 | 41.18 | 25.11 | 54.55 | 27.43 | 70.00 | 55.68 | 58.78 | 51.59
7 | Qwen3-Omni-30B-A3B | A+V | 47.14 | 51.55 | 35.48 | 43.84 | 50.56 | 40.00 | 32.60 | 58.18 | 38.50 | 63.85 | 59.09 | 60.31 | 48.43
8 | OmniVinci-9B | A+V | 49.29 | 44.33 | 38.71 | 57.53 | 34.83 | 35.29 | 33.48 | 55.15 | 34.51 | 65.38 | 65.91 | 54.20 | 47.38
9 | MiniCPM-o 4.5 | A+V | 45.71 | 36.08 | 37.90 | 28.77 | 26.97 | 41.18 | 37.44 | 60.61 | 38.50 | 61.54 | 59.09 | 64.12 | 44.83
10 | Qwen2.5-Omni-7B | A+V | 46.43 | 32.99 | 37.10 | 30.14 | 35.96 | 37.65 | 37.44 | 49.70 | 35.40 | 60.00 | 52.27 | 49.62 | 42.06
11 | Qwen3-VL-32B | V | 44.29 | 39.18 | 32.26 | 38.36 | 38.20 | 34.12 | 16.30 | 67.27 | 39.38 | 46.15 | 48.86 | 48.09 | 41.04
12 | Gemma 4-E4B | A+V | 39.29 | 38.14 | 37.10 | 36.99 | 29.21 | 36.47 | 16.74 | 55.15 | 34.07 | 55.38 | 54.55 | 58.02 | 40.93
13 | Video-SALMONN 2 | A+V | 42.14 | 41.24 | 29.03 | 30.14 | 29.21 | 32.94 | 31.28 | 48.48 | 39.38 | 47.69 | 47.73 | 44.27 | 38.63
14 | Ming-Flash-Omni-2.0 | V | 37.86 | 35.05 | 32.26 | 45.21 | 31.46 | 27.06 | 19.82 | 54.55 | 38.94 | 43.08 | 47.73 | 49.62 | 38.55
15 | HumanOmni-7B | A+V | 37.86 | 31.96 | 29.84 | 31.51 | 31.46 | 25.88 | 35.68 | 55.15 | 34.96 | 52.31 | 51.14 | 44.27 | 38.50
16 | Qwen3-Omni-30B-A3B | V | 37.86 | 34.02 | 37.90 | 32.88 | 34.83 | 24.71 | 30.84 | 53.94 | 38.94 | 38.46 | 40.91 | 43.51 | 37.40
17 | Qwen3-VL-8B | V | 34.29 | 28.87 | 29.84 | 26.03 | 24.72 | 24.71 | 17.62 | 59.39 | 32.30 | 45.38 | 42.05 | 46.56 | 34.31
18 | Qwen2-Audio-7B | A | 30.71 | 27.84 | 33.06 | 20.55 | 26.97 | 29.41 | 29.96 | 26.67 | 23.45 | 38.46 | 37.50 | 44.27 | 30.74
19 | VideoLLaMA2.1-AV-7B | A+V | 36.43 | 29.90 | 25.81 | 17.81 | 16.85 | 25.88 | 22.91 | 38.79 | 38.94 | 36.15 | 39.77 | 35.88 | 30.43
20 | Baichuan-Omni-1.5 | A+V | 37.14 | 14.43 | 20.97 | 15.07 | 12.36 | 20.00 | 30.40 | 37.58 | 26.99 | 32.31 | 42.05 | 33.59 | 26.91

Abbreviations. Audio-Visual Joint Reasoning: IR = Information Retrieval, TS = Temporal Sequencing, ET = Entity Tracking, FCR = Forward Causal Reasoning, BCR = Backward Causal Reasoning, CMM = Cross-Modality Matching, SL = Spatiotemporal Localization. Visual Reasoning: SR = Spatial Reasoning, VC = Visual Counting. Audio Reasoning: SC = Speech Context, ES = Environmental Sound, BM = Background Music.

Hallucination Robustness

Accuracy (%) on the 3 Multimodal Hallucination sub-tasks. Models are sorted by MH Avg; Gen. Avg is the average accuracy on the 12 general sub-tasks above.

# | Model | Modality | V2A | A2V | TSF | MH Avg | Gen. Avg
1 | Gemini 3.1 Pro | A+V | 89.57 | 79.91 | 84.34 | 84.61 | 68.29
2 | Gemini 3 Flash | A+V | 76.52 | 75.55 | 87.35 | 79.81 | 62.28
3 | Gemini 2 Flash | A+V | 74.78 | 81.66 | 77.11 | 77.85 | 51.59
4 | Gemini 2.5 Pro | A+V | 79.13 | 74.24 | 75.90 | 76.42 | 63.52
5 | Qwen3-Omni-30B-A3B | A+V | 65.65 | 69.87 | 66.87 | 67.46 | 48.43
6 | Ming-Flash-Omni-2.0 | A+V | 71.30 | 67.25 | 62.65 | 67.07 | 51.70
7 | Gemma 4-E4B | A+V | 74.35 | 69.43 | 56.02 | 66.60 | 40.93
8 | MiniCPM-o 4.5 | A+V | 70.87 | 72.05 | 56.63 | 66.52 | 44.83
9 | Gemini 2.5 Flash | A+V | 60.87 | 66.81 | 66.87 | 64.85 | 59.04
10 | Qwen2.5-Omni-7B | A+V | 60.43 | 55.46 | 53.61 | 56.50 | 42.06
11 | OmniVinci-9B | A+V | 42.17 | 44.10 | 42.17 | 42.81 | 47.38
12 | Video-SALMONN 2 | A+V | 45.65 | 39.30 | 37.95 | 40.97 | 38.63
13 | HumanOmni-7B | A+V | 33.91 | 37.99 | 28.31 | 33.40 | 38.50
14 | Baichuan-Omni-1.5 | A+V | 28.26 | 41.48 | 22.29 | 30.68 | 26.91
15 | VideoLLaMA2.1-AV-7B | A+V | 35.22 | 29.69 | 19.88 | 28.26 | 30.43

Abbreviations. V2A = Visual-to-Audio Deception, A2V = Audio-to-Visual Deception, TSF = Temporal Splicing Fallacy. MH Avg: mean accuracy over the three MH sub-tasks. Gen. Avg: mean accuracy over the 12 general sub-tasks above.

Data & Usage

1. Clone the repository

git clone https://github.com/Heinz217/TraceAV-Bench.git
cd TraceAV-Bench

2. Download source videos

Each video id in data/*.json is resolved through data/video_name_mapping.json. Videos are either from OmniVideoBench or fetched directly from YouTube via https://www.youtube.com/watch?v=<id>. Save them as <video_id>.mp4 under a flat directory (e.g. ~/traceav_videos/).
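
For convenience, below is a minimal download sketch in Python. It assumes yt-dlp is installed and that data/video_name_mapping.json maps each benchmark video_id to a YouTube video ID; the actual mapping schema may differ, so adjust the lookup to match.

# download_videos.py: minimal sketch; assumes `yt-dlp` is on PATH and that
# data/video_name_mapping.json maps benchmark video ids to YouTube ids.
import json
import subprocess
from pathlib import Path

VIDEOS_DIR = Path.home() / "traceav_videos"  # flat output directory
VIDEOS_DIR.mkdir(parents=True, exist_ok=True)

with open("data/video_name_mapping.json") as f:
    mapping = json.load(f)  # assumed shape: {"<video_id>": "<youtube_id>", ...}

for video_id, yt_id in mapping.items():
    out = VIDEOS_DIR / f"{video_id}.mp4"
    if out.exists():
        continue  # skip files already present (e.g. copied from OmniVideoBench)
    subprocess.run(
        ["yt-dlp", "-f", "mp4", "-o", str(out),
         f"https://www.youtube.com/watch?v={yt_id}"],
        check=True,
    )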

3. Data format

{
  "task_type": "v_visual_counting",
  "video_count": 219,
  "question_count": 226,
  "items": [
    {
      "question_id": 1,
      "video_id": "video2",
      "question": "...",
      "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
      "question_type": "single",           // "single" | "multiple"
      "correct_options": ["C"],
      "answer_text": "...",
      "minute_hop_count": 40,              // temporal span (minutes)
      "hop_length_label": "long",          // "short" | "medium" | "long"
      "trajectory_with_timestamps": [
        {
          "event_id": 6,
          "evidence": "...",
          "label": "visual",               // "visual" | "audio" | "audio-visual"
          "reason": "...",
          "timestamp_minute": 42,
          "event_time_range": {"start_minute": 41, "end_minute": 44}
        }
      ],
      "difficulty": "medium"               // "easy" | "medium" | "hard"
    }
  ]
}
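
As a sanity check, the snippet below loads one task file in this format and recomputes its hop and span statistics. The file name is hypothetical (substitute any task file under data/), and it assumes a question's hop count equals the length of its trajectory_with_timestamps list.

# inspect_task.py: minimal sketch over the schema shown above.
import json

# Hypothetical file name; substitute any task file under data/.
with open("data/v_visual_counting.json") as f:
    task = json.load(f)

items = task["items"]
hops = [len(q["trajectory_with_timestamps"]) for q in items]  # assumed: hops = trajectory length
spans = [q["minute_hop_count"] for q in items]                # temporal span in minutes

print(f"{task['task_type']}: {task['question_count']} questions")
print("avg hops:", round(sum(hops) / len(hops), 2))
print("avg span (min):", round(sum(spans) / len(spans), 1))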

4. Evaluate a model

# Example 1: evaluate Gemini via remote API
export BENCHMARK_DIR=$(pwd)/data
export GEMINI_API_KEY=<your_key>
bash eval/gemini/eval_gemini.sh

# Example 2: evaluate a local Qwen3-VL checkpoint
export QWEN3VL_MODEL_PATH=/path/to/Qwen3-VL-32B-Instruct
export QWEN3VL_CLEANED_DIR=$(pwd)/data
export QWEN3VL_VIDEOS_DIR=/path/to/videos
bash eval/qwen3_vl/eval_qwen3_vl.sh

# Example 3: evaluate an OpenAI-compatible vLLM server
export BENCHMARK_DIR=$(pwd)/data
export LVBENCH_BASE_URL=http://127.0.0.1:8000
bash eval/qwen3_omni_instruct/eval_qwen3_omni_instruct.sh
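
If you evaluate a model outside the provided scripts, a question should count as correct only when the predicted option set exactly matches correct_options, which covers both single- and multiple-choice items. Below is a scoring sketch under that exact-match assumption; it is not the repo's official scorer, and the eval scripts may apply additional answer normalization.

# score_predictions.py: exact-match scoring sketch (an assumption, not the
# repo's official scorer). `predictions` maps question_id -> set of letters.
import json

def score(task_file: str, predictions: dict) -> float:
    with open(task_file) as f:
        items = json.load(f)["items"]
    correct = sum(
        predictions.get(q["question_id"], set()) == set(q["correct_options"])
        for q in items
    )
    return correct / len(items)

# Usage with dummy predictions:
# acc = score("data/v_visual_counting.json", {1: {"C"}})
# print(f"accuracy: {acc:.2%}")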

BibTeX

@misc{feng2026traceavbenchbenchmarkingmultihoptrajectory,
      title={TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos},
      author={Hengyi Feng and Hao Liang and Mingrui Chen and Bohan Zeng and Meiyi Qiang and Zhengyang Zhao and Zimo Meng and Zeang Sheng and Wentao Zhang},
      year={2026},
      eprint={2605.07593},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.07593},
}