With Jeff Dean’s recent keynote at NeurIPS 2018, learned index structures of which learning-to-hash (hash function learning) is an important sub-field, are gaining much more prominence in the research literature and in industry. SageDB sketches the vision of a future in which a data processing system will be highly tuned to the distribution of the data via machine learning bringing about significant savings in storage, memory and time.
To address these exciting advances in hash function learning, we have put together this website to help researchers learn about, share and more easily discover recent work in the field. Learning2hash is a living literature review that allows the reader to explore models in the field following a taxonomy based on the fundamental model properties. New models can be added to this website by anyone simply by making a GitHub pull request see Contributing below. Select “All Papers” on the right-hand menu to get started.
Nearest neighbour search is the problem of finding the most similar data-points to a query in a large database of data-points, and is a fundamental operation that has found wide applicability in many fields, from Bioinformatics, through to Natural Language Processing (NLP) and Computer Vision. Some interesting application areas include:
A simple way of finding similar data-points would simply be to search through the entire dataset comparing each database data-point to the query data-point, a method known as brute-force search. Unfortunately, for most datasets of practical interest, particularly in the age of big-data and deep learning, this brute-force search (O(N)) is too computationally demanding (excessive compute time) and more efficient search methods are required.
Learning2hash provides a set of curated, community-driven links to a host of approximate nearest neighbour search (ANN) models that permit sublinear-time (O(log N), where N are the number of data-points in the dataset) retrieval of nearest neighbours. Hashing models work by generating similar binary hashcodes for semantically similar data-points (illustrated by the diagram below). These similarity preserving hashcodes can then be used to index the data-points (images, documents etc) into the buckets of a hashtable. Similar data-points should ideally end up in the same bucket of the hash table if we happen to have an effective hash function. This whole process is illustrated in the diagram below:
Diagram taken from the PhD thesis of Sean Moran.
Given a query (e.g. the image of the tiger in the example above), we can search for similar data-points by generating a hashcode for the query and only comparing the query data-point to the data-points that collide with it in the same hashtable bucket (or buckets if we have multiple independent hashtables). The principle here is that the number of data-points in the colliding hash table bucket(s) should be much less than the total number of data-points in the entire dataset, yielding a query time that is substantially improved over a simple brute-force search. The only downside to these models is they might not return the closest nearest neighbour(s) every time (that is, we forgo a degree of accuracy), which is usually an acceptable trade-off in practice.
Much more introductory material on hashing and learning-to-hash can be found in Resources.
This literature review gives a particular focus on recently proposed approximate nearest neighbour search models that improve retrieval effectiveness by learning task specific hashcodes that are sensitive to the distribution of the data. All the papers listed in this review website suggest new methods for learning the hash functions to generate more effective similarity preserving hashcodes.
Learning-to-hash is a vibrant research field and is rapidly evolving, particularly with the recent surge in interest in deep neural network-based models. The purpose of this website is to augment static literature reviews with a dynamic website that can be updated by any interested researcher with newly published work in the field, including links to relevant code and datasets. To add a new paper to the website simply create a markdown file and open a pull request in GitHub by following these instructions for contributing.
Copyright © Sean Moran 2020. All opinions are my own.