AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

Xilin Jiang, Qiaolin Wang, Junkai Wu, Xiaomin He, Zhongweiyang Xu, Yinghao Ma, Minshuo Piao, Kaiyi Yang, Xiuwen Zheng, Riki Shimizu, Yicong Chen, Arsalan Firoozi, Gavin Mischler, Sukru Samet Dindar, Richard Antonello, Linyang He, Tsun-An Hsieh, Xulin Fan, Yulun Wu, Yuesheng Ma, Chaitanya Amballa, Weixiong Chen, Jiarui Hai, Ruisi Li, Vishal Choudhari, Cong Han, Yinghao Aaron Li, Adeen Flinker, Mounya Elhilali, Emmanouil Benetos, Mark Hasegawa-Johnson, Romit Roy Choudhury, Nima Mesgarani
Published: January 25, 2026
Authors: 33
Word Count: 10,389

Testing MLLMs' cultural understanding with AVMeme Exam.

Abstract

Internet audio-visual clips convey meaning through time-varying sound and motion, extending beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A assessing levels of understanding, from surface content to context and emotion to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants on this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and they handle contextual and cultural reasoning far worse than surface content. These findings highlight a key gap in human-aligned multimodal intelligence and call for models that can perceive contextually and culturally beyond the surface of what they hear and see. Project page: avmemeexam.github.io/public
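The abstract describes each benchmark item as a clip paired with a Q&A at a given level of understanding, plus metadata (original year, transcript, summary, sensitivity). A minimal sketch of what one such record might look like is below; the field and class names are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical record layout for one AVMeme Exam entry.
# All names here are illustrative, not taken from the released dataset.
from dataclasses import dataclass, field

@dataclass
class AVMemeEntry:
    clip_id: str                 # identifier for the audio-visual clip
    modality: str                # "speech", "song", "music", or "sound effect"
    question: str                # the unique question paired with this meme
    choices: list                # candidate answers
    answer: str                  # ground-truth answer
    level: str                   # "surface", "context/emotion", or "usage/knowledge"
    original_year: int           # year the meme originated
    transcript: str = ""         # transcript, if the clip contains words
    summary: str = ""            # short description of the clip
    sensitive: bool = False      # sensitivity flag

entry = AVMemeEntry(
    clip_id="meme_0001",
    modality="speech",
    question="In what situation is this clip typically quoted online?",
    choices=["option A", "option B", "option C", "option D"],
    answer="option B",
    level="usage/knowledge",
    original_year=2016,
)
```

Structuring entries this way would let an evaluation script filter by `modality` or `level`, matching the paper's breakdown of model performance across audio types and levels of understanding.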

Key Takeaways

  1. MLLMs excel at surface-level tasks but struggle with cultural context.

  2. Humans outperform models, especially on familiar memes.

  3. Performance varies across languages and audio types.

Limitations

  • Cultural coverage influenced by researcher backgrounds.

  • Limited representation of lesser-known languages.

Keywords

multimodal large language models, audio-visual clips, human-curated benchmark, iconic Internet sounds, cultural context, surface content, semantic comprehension
