
Answering Multimodal Exclusion Queries With Lightweight Sparse Disentangled Representations

Prachi J, Sumit Bhatia, Srikanta Bedathur. Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR 2025).

Tags: Evaluation, Multimodal Retrieval, SIGIR

Multimodal representations that enable cross-modal retrieval are widely used. However, they often lack interpretability, making it difficult to explain the retrieved results. Existing solutions, such as learning sparse disentangled representations, are typically guided by the text tokens in the data, which makes the dimensionality of the resulting embeddings very high. We propose an approach that generates lower-dimensional, fixed-size embeddings that are not only disentangled but also offer better control for retrieval tasks. We demonstrate their utility using challenging exclusion queries over the MSCOCO and Conceptual Captions benchmarks. Our experiments show that our approach outperforms traditional dense models such as CLIP, BLIP, and VISTA (gains of up to 11% in AP@10), as well as sparse disentangled models such as VDR (gains of up to 21% in AP@10). We also present qualitative results that further underline the interpretability of disentangled representations.
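
The abstract does not spell out how exclusion queries are scored against sparse disentangled embeddings. The toy Python sketch below is purely illustrative and not the authors' method: it assumes each embedding dimension corresponds to an interpretable concept and that an exclusion query ("X but not Y") can be answered by rewarding activation on included concepts and penalizing activation on excluded ones. All names (`CONCEPTS`, `exclusion_score`, the subtraction-based penalty, `alpha`) are our own assumptions for illustration.

```python
import numpy as np

# Toy vocabulary of interpretable (disentangled) dimensions.
# In a sparse disentangled model each dimension is meant to map to a concept;
# here we simply hand-pick a few for illustration.
CONCEPTS = ["dog", "cat", "beach", "snow", "person"]
DIM = {c: i for i, c in enumerate(CONCEPTS)}

def embed(active):
    """Sparse non-negative embedding: 1.0 on the active concept dims, 0 elsewhere."""
    v = np.zeros(len(CONCEPTS))
    for c in active:
        v[DIM[c]] = 1.0
    return v

# A tiny "image collection", each item described by the concepts it activates.
corpus = {
    "img1": embed(["dog", "beach"]),
    "img2": embed(["dog", "snow"]),
    "img3": embed(["cat", "beach"]),
}

def exclusion_score(doc, include, exclude, alpha=1.0):
    """Score = match on included concepts minus a penalty on excluded ones (assumed scheme)."""
    return float(doc @ embed(include) - alpha * (doc @ embed(exclude)))

# Exclusion query: "dog, but NOT on a beach"
ranking = sorted(corpus.items(),
                 key=lambda kv: exclusion_score(kv[1], ["dog"], ["beach"]),
                 reverse=True)
for name, _ in ranking:
    print(name)
# Expected order: img2 (dog + snow) first; img1 and img3 are pushed down
# because they activate the excluded "beach" dimension.
```

Because the embedding dimensions are tied to concepts, the penalty term also doubles as an explanation: one can inspect exactly which excluded dimension caused an item to be demoted.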

Similar Work