Awesome Papers on Learning to Hash

🏷 Browse Papers by Tag

AAAI ACCV ACL AISTATS Arxiv BMVC CIKM CIVR CNN CVIU CVPR CVPRW Case Study Cross-Modal Dataset Deep Learning ECCV ECIR FOCS GAN Graph Has Code ICCV ICIP ICLR ICME ICML ICMR ICPR IJCAI IJCV Image Retrieval JCSS KDD LSH MM MSR NAACL NeurIPS Pattern Recognition Letters Quantisation SCG SDM SIGIR SIGMOD Self-Supervised Semi-Supervised Streaming Data Supervised Survey Paper TIP TKDE TNNLS TOIS TOM TOMM TPAMI Text Retrieval Unsupervised VLDB Video Retrieval WWW Weakly Supervised

The Field of Learning to Hash

This website exists to help researchers learn about, share and more easily discover recent work in the field of learning to hash. Awesome learning to hash is a living literature review that allows the reader to explore models in the field following a taxonomy based on model properties. New models can be added to this website by anyone simply by filling out a form see Contributing below. Select “All Papers” on the right-hand menu to get started.

Background

In general, Nearest neighbour search is the problem of finding the most similar data-points to a query in a large database of data-points. Nearest neighbour search is a fundamental operation in computer science that has found wide applicability in many fields, from Bioinformatics, through to Natural Language Processing (NLP) and Computer Vision. A selection of interesting and successful application areas include:

A Scalable Sourcecode Search Engine: Enabling code-to-code recommendation over large-scale sourcecode repositories with MinHash.
Making the Transformer from NLP much more efficient: large Transformer models achieve state-of-the-art results but training these models can be costly. LSH to the rescue!
Detecting and Tracking Interesting Events on Social Media: a real-time system incorporating LSH to detect and track interesting social media events over time.
Earthquake Hunting with Efficient Time Series Similarity Search: detecting earthquakes by comparing segments of seismic activity time series. Some nice lessons learnt here about applying LSH in real-world applications.
Detecting Fraudulent Taxi Rides: Uber has applied LSH to detect rides that are similar based on their spatial properties.
Audio Fingerprinting Matching a query snippet of audio to a large database (think Shazam!) can be achieved efficiently by using learning-to-hash methods.
Genomics: Locality sensitive hashing (LSH) is used by Biologists to assemble large genomes and to find genes with similar expression in genomic databases.
Image Retrieval: Google applies locality sensitive hashing (LSH) alongside PageRank to index planet-scale collections of images.
Malware Detection: Developers of anti-viral software use hashcode learning models to quickly match a possibly malign code snippet to a database of known viruses.

The list is truly endless!

Learning2hash provides a set of curated, community-driven links to a host of approximate nearest neighbour search (ANN) models that permit sublinear-time retrieval of nearest neighbours. Hashing models work by generating similar binary hashcodes for semantically similar data-points (illustrated by the diagram below). These similarity preserving hashcodes can then be used to index the data-points (images, documents etc) into the buckets of hashtables. Similar data-points should ideally end up in the same bucket of the hash table if we happen to have an effective hash function. This whole process is illustrated in the diagram below:

Locality Sensitive Hashing (LSH)

Diagram taken from the PhD thesis of Sean Moran.

Given a query, such as the image of the tiger in the example above, we can search for similar data-points by generating a hashcode for the query and only comparing the query data-point to the data-points that collide with it in the same hashtable bucket (or buckets if we have multiple independent hashtables). Usually the number of data-points in the colliding hash table bucket(s) should be much less than the total number of data-points in the entire dataset, yielding a query time that is substantially improved over a simple brute-force search. The only downside to these models is they might not return the closest nearest neighbour(s) every time (that is, we forgo a degree of accuracy), which is usually an acceptable trade-off in practice.

Much more introductory material on hashing and learning-to-hash can be found in Resources.

This literature review gives a particular focus on recently proposed approximate nearest neighbour search models that improve retrieval effectiveness by learning task specific hashcodes that are sensitive to the distribution of the data. All the papers listed in this review website suggest new methods for learning the hash functions to generate more effective similarity preserving hashcodes.

Contributing

Learning to hash is rapidly evolving. The purpose of this website is to augment static literature reviews with a dynamic website that can be updated by any interested researcher with newly published work in the field, including links to relevant code and datasets. To add a new paper to the website simply create a markdown file and open a pull request in GitHub by following these instructions for contributing.

Awesome Learning to Hash

Awesome Papers on Learning to Hash

🏷 Browse Papers by Tag

The Field of Learning to Hash

Background

Contributing