AV-Odyssey

Exploring Whether MLLMs Really Understand Audio-Visual Information

(*Equal Contribution, †Project Leader, ✉️Corresponding Author)

1CUHK MMLab, 2CUHK (Shenzhen)

3Stanford University, 4UC Berkeley, 5Yale University


Overview of AV-Odyssey Benchmark. AV-Odyssey Bench demonstrates three major features: 1. Comprehensive Audio Attributes; 2. Extensive Domains; 3. Interleaved Text, Audio, and Images.
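To make the interleaved format concrete, here is a minimal sketch of how one text-image-audio question could be represented in Python. The field names (question, image_paths, audio_paths, options, answer) and the example item are illustrative placeholders, not the released data schema.

```python
# Minimal sketch of one interleaved AV-Odyssey-style item.
# Field names are illustrative, not the released schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AVOdysseyItem:
    question: str                                        # text with [img1]/[audio1]-style placeholders
    image_paths: List[str] = field(default_factory=list)  # images referenced by the question
    audio_paths: List[str] = field(default_factory=list)  # audio clips referenced by the question
    options: List[str] = field(default_factory=list)      # multiple-choice options
    answer: str = ""                                       # ground-truth option letter

item = AVOdysseyItem(
    question="Which instrument in [img1] produces the timbre heard in [audio1]?",
    image_paths=["band.jpg"],
    audio_paths=["clip.wav"],
    options=["A. Violin", "B. Trumpet", "C. Flute", "D. Drum"],
    answer="B",
)
```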

Introduction

Recently, multimodal large language models (MLLMs) such as GPT-4o, Gemini 1.5 Pro, and Reka Core have expanded their capabilities to include the vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information. The benchmark comprises 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To infer the correct answer, models must effectively leverage clues from both the visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we structure the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize our observations. By revealing the limitations of current models, we aim to provide useful insights for future dataset collection and model development.
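Because every question is multiple-choice, scoring reduces to matching an extracted option letter against the ground truth. The snippet below is a minimal sketch of such letter-matching evaluation, assuming options are labeled A-D; the benchmark's exact answer-extraction rules may differ.

```python
import re

def extract_choice(response: str) -> str:
    """Pull the first standalone option letter (A-D) out of a free-form model response."""
    match = re.search(r"\b([A-D])\b", response.strip().upper())
    return match.group(1) if match else ""

def accuracy(predictions, ground_truth):
    """Fraction of questions whose extracted letter matches the gold answer."""
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

print(accuracy(["The answer is B.", "C", "I think (A)"], ["B", "C", "D"]))  # -> 0.666...
```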

Leaderboard

Evaluation results of various MLLMs in different parts of AV-Odyssey Bench.

By default, the leaderboard is sorted by the Overall score. To sort by another metric, click on the corresponding cell.

| # | Model | Organization | LLM Params | Date | Overall (%) | Timbre (%) | Tone (%) | Melody (%) | Space (%) | Time (%) | Hallucination (%) | Intricacy (%) |
|---|-------|--------------|------------|------|-------------|------------|----------|------------|-----------|----------|-------------------|---------------|
| 1 | GPT-4o audio caption | OpenAI | - | 2024-11-10 | 34.5 | 38.6 | 31.8 | 33.6 | 32.5 | 27.5 | 25.0 | 26.1 |
| 2 | GPT-4o visual caption | OpenAI | - | 2024-11-10 | 32.3 | 37.4 | 28.6 | 32.3 | 27.5 | 25.5 | 23.0 | 28.9 |
| 3 | Gemini 1.5 Pro | Google | - | 2024-11-10 | 30.8 | 30.8 | 31.4 | 31.3 | 37.5 | 27.7 | 20.5 | 33.0 |
| 4 | Gemini 1.5 Flash | Google | - | 2024-11-10 | 27.8 | 27.2 | 25.0 | 28.8 | 30.0 | 25.3 | 28.5 | 31.2 |
| 5 | OneLLM | MMLab | 7B | 2024-11-10 | 27.4 | 25.0 | 25.5 | 21.5 | 37.5 | 29.3 | 25.5 | 38.4 |
| 6 | Unified-IO-2 XXL | Allenai | 7B | 2024-11-10 | 27.2 | 26.3 | 22.7 | 26.4 | 32.5 | 26.8 | 24.5 | 33.8 |
| 7 | Reka Core | Reka | 67B | 2024-11-10 | 26.9 | 26.7 | 27.7 | 26.4 | 22.5 | 26.5 | 24.0 | 34.3 |
| 8 | Gemini 1.5 Flash-8B | Google | - | 2024-11-10 | 26.8 | 25.1 | 24.5 | 28.9 | 27.5 | 27.5 | 29.0 | 30.2 |
| 9 | VideoLLaMA2 | Alibaba | 7B | 2024-11-10 | 26.8 | 24.1 | 25.5 | 26.4 | 30.0 | 27.2 | 33.0 | 34.5 |
| 10 | PandaGPT | Cantab & Tencent | 7B | 2024-11-10 | 26.7 | 23.5 | 23.2 | 27.6 | 45.0 | 23.8 | 28.0 | 23.9 |
| 11 | VITA | Tencent | 8 x 7B | 2024-11-10 | 26.4 | 24.1 | 26.4 | 27.8 | 22.5 | 26.3 | 31.0 | 36.8 |
| 12 | Unified-IO-2 XL | Allenai | 3B | 2024-11-10 | 26.3 | 24.3 | 23.2 | 27.8 | 22.5 | 25.3 | 31.5 | 34.8 |
| 13 | Reka Flash | Reka | 21B | 2024-11-10 | 26.3 | 25.5 | 24.1 | 27.2 | 30.0 | 27.5 | 31.5 | 24.1 |
| 14 | Video-LLaMA | Alibaba | 7B | 2024-11-10 | 26.1 | 25.5 | 22.3 | 24.4 | 30.0 | 26.2 | 25.0 | 30.7 |
| 15 | AnyGPT | FDU | 7B | 2024-11-10 | 26.1 | 24.6 | 25.0 | 26.4 | 27.5 | 29.2 | 29.0 | 25.7 |
| 16 | Unified-IO-2 L | Allenai | 1B | 2024-11-10 | 26.0 | 23.8 | 24.1 | 28.8 | 15.0 | 26.8 | 30.0 | 30.4 |
| 17 | NExT-GPT | NUS | 7B | 2024-11-10 | 25.5 | 23.2 | 20.9 | 27.8 | 30.0 | 28.8 | 28.5 | 23.6 |
| 18 | Reka Edge | Reka | 7B | 2024-11-10 | 25.0 | 23.8 | 20.5 | 26.3 | 22.5 | 25.5 | 22.5 | 36.8 |

"-" indicates a closed-source model whose parameter count is not disclosed.
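Per-column scores like those above can be obtained by grouping question-level results by category. The sketch below assumes a hypothetical record format with category, pred, and gold fields; it is not the official evaluation script, and the exact aggregation used for the Overall column may differ.

```python
from collections import defaultdict

# Hypothetical per-question records; "category" follows the leaderboard columns.
records = [
    {"category": "Timbre", "pred": "A", "gold": "A"},
    {"category": "Timbre", "pred": "C", "gold": "B"},
    {"category": "Space",  "pred": "D", "gold": "D"},
]

totals, hits = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["category"]] += 1
    hits[r["category"]] += int(r["pred"] == r["gold"])

for cat in totals:
    print(f"{cat}: {100 * hits[cat] / totals[cat]:.1f}%")

# Assumption: the Overall column pools all questions together.
overall = 100 * sum(hits.values()) / sum(totals.values())
print(f"Overall: {overall:.1f}%")
```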

DeafTest
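As described in the introduction, DeafTest asks models to judge which of two sounds is louder and which of two sounds has a higher pitch. The snippet below generates illustrative stimuli for these two tasks from synthetic sine tones; it is only a sketch of the task format, not the released test audio.

```python
# Illustrative DeafTest-style stimuli (not the released test data):
# task 1 compares two tones of different loudness, task 2 two tones of different pitch.
import wave
import numpy as np

def write_tone(path, freq_hz, amplitude, duration_s=1.0, sr=16000):
    """Write a mono 16-bit PCM sine tone to `path`."""
    t = np.arange(int(sr * duration_s)) / sr
    samples = (amplitude * np.sin(2 * np.pi * freq_hz * t) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)      # 16-bit samples
        f.setframerate(sr)
        f.writeframes(samples.tobytes())

# Loudness comparison: same pitch, different amplitude.
write_tone("loud_a.wav", 440, amplitude=0.8)
write_tone("loud_b.wav", 440, amplitude=0.2)

# Pitch comparison: same amplitude, different frequency.
write_tone("pitch_a.wav", 330, amplitude=0.5)
write_tone("pitch_b.wav", 660, amplitude=0.5)
```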

Comparisons with Existing Benchmarks


Benchmark Task Overview


Overview of the 26 evaluation tasks in the AV-Odyssey Benchmark. We categorize these tasks into 7 classes according to their sound attributes.

Benchmark Statistics


Data Examples

All data are newly collected and annotated by humans, not from any existing audio-visual dataset.

Error Distributions


Distribution of 104 human-annotated errors made by Gemini 1.5 Pro.

Error Examples

Citation


      @misc{gong2024avodysseybenchmultimodalllms,
        title={AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?}, 
        author={Kaixiong Gong and Kaituo Feng and Bohao Li and Yibing Wang and Mofan Cheng and Shijia Yang and Jiaming Han and Benyou Wang and Yutong Bai and Zhuoran Yang and Xiangyu Yue},
        year={2024},
        eprint={2412.02611},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2412.02611}, 
      }