Deep Visual-Semantic Hashing for Cross-Modal Retrieval

Yue Cao, Mingsheng Long, Jianmin Wang, Qiang Yang, Yu. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016 – 265 citations

[Paper]
Tags: Datasets, Efficiency, Hashing Methods, Multimodal Retrieval, Neural Hashing

Due to its storage and retrieval efficiency, hashing has been widely applied to approximate nearest neighbor search for large-scale multimedia retrieval. Cross-modal hashing, which enables efficient retrieval of images in response to text queries and vice versa, has received increasing attention recently. Most existing work on cross-modal hashing does not capture the spatial dependency of images and the temporal dynamics of text sentences needed to learn powerful feature representations and cross-modal embeddings that mitigate the heterogeneity of different modalities. This paper presents a new Deep Visual-Semantic Hashing (DVSH) model that generates compact hash codes of images and sentences in an end-to-end deep learning architecture, capturing the intrinsic cross-modal correspondences between visual data and natural language. DVSH is a hybrid deep architecture that consists of a visual-semantic fusion network for learning a joint embedding space of images and text sentences, and two modality-specific hashing networks for learning hash functions that generate compact binary codes. Our architecture effectively unifies joint multimodal embedding and cross-modal hashing through a novel combination of Convolutional Neural Networks over images, Recurrent Neural Networks over sentences, and a structured max-margin objective that integrates these components to enable learning of similarity-preserving and high-quality hash codes. Extensive empirical evidence shows that our DVSH approach yields state-of-the-art results in cross-modal retrieval experiments on image-sentence datasets, i.e., the standard IAPR TC-12 and the large-scale Microsoft COCO.
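
To make the described design concrete, here is a minimal PyTorch sketch of a DVSH-style model: a CNN-based image hashing network, an LSTM-based sentence hashing network, and a simplified similarity-preserving hinge loss standing in for the paper's structured max-margin objective. The backbone sizes, code length, vocabulary size, and margin are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a DVSH-style cross-modal hashing model (not the original code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageHashNet(nn.Module):
    """Small CNN image encoder followed by a hashing head (tanh-relaxed binary codes)."""
    def __init__(self, code_len=32):
        super().__init__()
        self.cnn = nn.Sequential(  # stand-in for a pretrained CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.hash_head = nn.Linear(64, code_len)

    def forward(self, images):
        return torch.tanh(self.hash_head(self.cnn(images)))  # continuous codes in (-1, 1)

class SentenceHashNet(nn.Module):
    """LSTM sentence encoder followed by a hashing head."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128, code_len=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.hash_head = nn.Linear(hidden_dim, code_len)

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))  # final hidden state summarizes the sentence
        return torch.tanh(self.hash_head(h_n[-1]))

def similarity_preserving_loss(img_codes, txt_codes, sim, margin=0.5):
    """Pull matched image/sentence codes together, push mismatched pairs apart.

    sim[i, j] = 1 if image i and sentence j describe the same content, else 0.
    A hinge-style simplification of a structured max-margin objective.
    """
    scores = F.normalize(img_codes, dim=1) @ F.normalize(txt_codes, dim=1).t()
    pos = (scores * sim).sum() / sim.sum().clamp(min=1)      # similarity of matched pairs
    neg = (F.relu(scores - margin) * (1 - sim)).mean()       # penalize mismatched pairs above margin
    return neg - pos

# Usage: continuous codes are binarized with sign() for Hamming-space retrieval.
if __name__ == "__main__":
    img_net, txt_net = ImageHashNet(), SentenceHashNet()
    images = torch.randn(4, 3, 64, 64)
    sentences = torch.randint(0, 10000, (4, 12))
    sim = torch.eye(4)  # the i-th sentence describes the i-th image
    loss = similarity_preserving_loss(img_net(images), txt_net(sentences), sim)
    binary_codes = torch.sign(img_net(images))
    print(loss.item(), binary_codes.shape)
```

Note that this sketch omits the visual-semantic fusion network that DVSH uses to learn the joint embedding space; it only illustrates the two modality-specific hashing networks and a similarity-preserving objective.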

Similar Work