How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Published 19 Mar 2026 in eess.AS, cs.CL, and cs.SD | (2603.19195v1)

Abstract: LLMs have been widely used as knowledge backbones of Large Audio LLMs (LALMs), yet how much auditory knowledge they encode through text-only pre-training, and how this affects downstream performance, remain unclear. We address this gap by comparing different LLMs under two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into an LALM with an audio encoder. Our findings reveal that auditory knowledge varies substantially across LLM families, and that text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.
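The claim that text-only results correlate with audio performance can be checked with a rank correlation across backbones. The following is a minimal sketch, not the authors' analysis code: the abstract does not say which correlation statistic is used, so Spearman's rho is an assumption here, and the backbone names and scores are hypothetical placeholders rather than results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Hypothetical placeholder scores, NOT taken from the paper.
    # One entry per LLM backbone, in the same order in both lists.
    backbones = ["backbone_a", "backbone_b", "backbone_c", "backbone_d"]
    text_only_accuracy = [0.52, 0.61, 0.58, 0.70]       # e.g. direct probing
    audio_grounded_accuracy = [0.48, 0.63, 0.55, 0.66]  # e.g. after fine-tuning
    rho = spearman(text_only_accuracy, audio_grounded_accuracy)
    print(f"Spearman rho over {len(backbones)} backbones: {rho:.3f}")
```

A rank-based statistic is used in this sketch because it only asks whether backbones that score higher in the text-only settings also score higher once audio-grounded, without assuming a linear relationship between the two scales.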