CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining

Tristan Tsoi, Jiajun Deng, Yaolong Ju, Benno Weck, Holger Kirchhoff, Simon Lui. 2025 IEEE International Conference on Multimedia and Expo (ICME), 2025. 0 citations.

Evaluation Self-Supervised Similarity Search Tools & Libraries

Music similarity retrieval is fundamental for managing and exploring relevant content from large collections in streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships. To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach combining online scraping and LLM-based prompting, where carefully designed prompts leverage LLMs' comprehensive music knowledge to generate contextually rich descriptions. Extensive experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks through objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform.
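The abstract does not specify the loss, encoders, or hyperparameters, so the following is only a minimal sketch of how cross-modal contrastive training and similarity retrieval are commonly set up. It assumes a symmetric InfoNCE (CLIP-style) objective over paired music and text embeddings. The function names, the temperature value of 0.07, and the use of cosine similarity for retrieval are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed objective, not confirmed by the source).

    Row i of audio_emb and row i of text_emb form a positive pair;
    every other row in the batch serves as a negative.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_a2t = -log_softmax(logits)[idx, idx].mean()    # music -> text
    loss_t2a = -log_softmax(logits.T)[idx, idx].mean()  # text -> music
    return 0.5 * (loss_a2t + loss_t2a)


def retrieve_similar(query_emb, catalog_emb, k=3):
    """Return indices of the k catalog tracks closest to the query by cosine similarity."""
    q = l2_normalize(query_emb[None, :])
    c = l2_normalize(catalog_emb)
    scores = (c @ q.T).ravel()
    return np.argsort(-scores)[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))                      # stand-in music embeddings
    text = audio + 0.01 * rng.normal(size=(8, 16))        # stand-in matched text embeddings
    print("matched loss:", symmetric_contrastive_loss(audio, text))
    print("nearest to track 2:", retrieve_similar(audio[2], audio, k=3))
```

Under this setup, the text branch is only needed during training. At retrieval time, similar tracks are found by comparing music embeddings directly, as `retrieve_similar` does.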
