Deep Semantic Text Hashing with Weak Supervision

Suthee Chaidaroon, Travis Ebesu, Yi Fang. SIGIR 2019

Deep Learning SIGIR Weakly Supervised

With an ever increasing amount of data available on the web, fast similarity search has become the critical component for large-scale information retrieval systems. One solution is semantic hashing which designs binary codes to accelerate similarity search. Recently, deep learning has been successfully applied to the semantic hashing problem and produces high-quality compact binary codes compared to traditional methods. However, most state-of-the-art semantic hashing approaches require large amounts of hand-labeled training data which are often expensive and time consuming to collect. The cost of getting labeled data is the key bottleneck in deploying these hashing methods. Motivated by the recent success in machine learning that makes use of weak supervision, we employ unsupervised ranking methods such as BM25 to extract weak signals from training data. We further introduce two deep generative semantic hashing models to leverage weak signals for text hashing. The experimental results on four public datasets show that our models can generate high-quality binary codes without using hand-labeled training data and significantly outperform the competitive unsupervised semantic hashing baselines.