Practical Vector Search Challenge 2023: This challenge aims to push the boundaries of approximate nearest neighbor (ANN) search techniques and offers a platform for researchers and developers to benchmark their solutions on billion-scale datasets.
Billion-Scale Approximate Nearest Neighbor Search Challenge: NeurIPS’21 Competition Track: Competitors must improve search accuracy and speed on extremely large datasets, providing valuable insights into the performance of state-of-the-art methods for nearest neighbor search.
Compact and Efficient Feature Representation and Learning in Computer Vision, ICCV 2019: This workshop at ICCV 2019 focuses on efficient learning techniques for compact feature representations, including binary hashing methods.
International Conference on Similarity Search and Applications: SISAP is an annual conference dedicated to the study of similarity search techniques.
Joint Workshop on Efficient Deep Learning in Computer Vision: This workshop, co-located with CVPR 2020, focused on the intersection of deep learning and efficient computing techniques for computer vision tasks.
IEEE International Conference on Data Engineering (ICDE): ICDE is one of the leading conferences on data engineering, where researchers present advances in data management, indexing, and search.
ACM International Conference on Knowledge Discovery and Data Mining (KDD): KDD is a premier conference on data mining and machine learning.
SIAM International Conference on Data Mining (SDM): SDM is an important conference for researchers in data mining, focusing on the latest developments in algorithms, data analysis, and big data applications.
Dr. Wu-Jun Li’s tutorial slides: These slides offer a comprehensive introduction to learning to hash (L2H) techniques and are an excellent resource for anyone seeking a deep technical understanding of hashing.
Intro to LSH - Part 1: In this video, Dr. Victor Lavrenko provides an introduction to Locality-Sensitive Hashing (LSH). Part 1 covers the basic concepts and intuition behind LSH, making it accessible for beginners.
Intro to LSH - Part 2: Part 2 of Dr. Lavrenko’s LSH series dives deeper into the mathematics and mechanics of how LSH works; the core trick is sketched in code below.
Hashing Algorithms for Large-Scale Machine Learning - 2017 Rice Machine Learning Workshop: A recorded talk offering a detailed overview of the hashing algorithms used in large-scale machine learning.
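To make the idea in the tutorials above concrete, here is a minimal sketch of random-hyperplane (SimHash-style) LSH for cosine similarity; the dimensionality, bit count, and random data are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 128, 16                  # vector dimension, code length in bits

# Each bit of the code records which side of a random hyperplane a point falls on;
# nearby vectors (small angle) tend to fall on the same side and share bits.
planes = rng.standard_normal((n_bits, d))

def hash_vector(v):
    # Sign of the projection onto each hyperplane -> an n_bits binary code.
    return tuple((planes @ v > 0).astype(int))

# Index: vectors with identical codes land in the same bucket.
data = rng.standard_normal((1000, d))
buckets = {}
for i, v in enumerate(data):
    buckets.setdefault(hash_vector(v), []).append(i)

# Query: score only the candidates in the query's bucket, not all 1000 vectors.
q = data[0] + 0.01 * rng.standard_normal(d)   # a near-duplicate of vector 0
candidates = buckets.get(hash_vector(q), [])
print(len(candidates), 0 in candidates)
```

Real systems build several such tables (or use multi-probe variants) to recover neighbors that a single table misses; the videos above explain why one table is rarely enough.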
For a deeper dive, these survey papers are excellent resources:
Learning to Hash for Indexing Big Data - A Survey: This comprehensive survey explores the evolution of hashing techniques for indexing and retrieving big data.
A Survey on Learning to Hash: This survey provides a detailed overview of different learning-to-hash algorithms, categorized into unsupervised, semi-supervised, and supervised methods.
Learning Binary Hash Codes for Large-Scale Image Search: This paper focuses on learning binary hash codes for efficient large-scale image search and is particularly useful for researchers working on image retrieval and large-scale computer vision tasks.
Locality-Sensitive Hashing for Finding Nearest Neighbors: This tutorial-style survey introduces Locality-Sensitive Hashing (LSH) as a method for efficient nearest neighbor search. It explains the principles behind LSH and demonstrates how it can be applied to large-scale datasets.
Deep Learning for Hashing: A Survey: This survey provides an in-depth overview of deep learning-based hashing techniques, which have become increasingly popular for large-scale retrieval tasks.
Learning to Hash With Binary Deep Neural Networks: A Survey: This survey focuses on binary deep neural networks and their use in learning to hash. It explores how these networks are trained to produce compact binary codes that can be used for efficient data retrieval in large-scale datasets.
CIFAR-10 Gist Features (.mat): This dataset contains GIST features extracted from the CIFAR-10 dataset, a popular image classification benchmark.
LabelMe Gist Features (.mat): A set of GIST features extracted from the LabelMe dataset, which includes a large collection of labeled images.
MNIST Pixel Features (.mat): This dataset contains raw pixel features from the MNIST dataset, a benchmark for handwritten digit recognition.
SIFT 1M Features (.mat): This dataset consists of SIFT (Scale-Invariant Feature Transform) descriptors for one million image patches, commonly used in image matching and retrieval tasks.
20 Newsgroups (.mat): This dataset contains features extracted from the 20 Newsgroups text dataset, a collection of approximately 20,000 documents categorized into 20 different newsgroups.
TDT2 (.mat): This dataset includes features from the Topic Detection and Tracking (TDT2) dataset, designed for research on text retrieval and clustering.
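The .mat files above load with SciPy; a minimal sketch, with the caveat that the filename and the 'X'/'Y' variable names below are assumptions (the keys actually stored differ from dataset to dataset):

```python
from scipy.io import loadmat

# Inspect the file first: each dataset stores its features under different keys.
mat = loadmat('cifar10_gist.mat')   # hypothetical filename
print([k for k in mat if not k.startswith('__')])

# Suppose features sit under 'X' (n_samples x n_dims) and labels under 'Y';
# both names are assumptions, so check the printout above.
X, Y = mat['X'], mat['Y']
```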
BIGANN Dataset: The BIGANN dataset includes SIFT descriptors extracted from a large collection of images. It is widely used for benchmarking large-scale approximate nearest neighbor (ANN) search algorithms.
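BIGANN’s vectors ship in the TEXMEX .fvecs/.bvecs binary layout, where each record is a little-endian int32 dimension count followed by the vector’s components. A minimal NumPy reader for files that fit in memory:

```python
import numpy as np

def read_fvecs(path):
    # .fvecs record: int32 dimension d, then d float32 components.
    raw = np.fromfile(path, dtype=np.int32)
    d = raw[0]
    return raw.reshape(-1, d + 1)[:, 1:].copy().view(np.float32)

def read_bvecs(path):
    # .bvecs record: int32 dimension d, then d uint8 components.
    d = int(np.fromfile(path, dtype=np.int32, count=1)[0])
    raw = np.fromfile(path, dtype=np.uint8)
    return raw.reshape(-1, d + 4)[:, 4:]
```

The billion-scale files do not fit in RAM; the usual workaround is np.memmap over the same layout, reading slices on demand.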
Facebook SimSearchNet++: A large-scale dataset of image descriptors produced by Facebook’s SimSearchNet++ model, used to benchmark billion-scale similarity search algorithms in AI and machine learning applications.
Microsoft SPACEV-1B: This dataset from Microsoft contains one billion vectors for testing large-scale similarity search and serves as a standard benchmark for evaluating the performance of ANN retrieval systems.
Yandex DEEP-1B: The DEEP-1B dataset from Yandex consists of one billion deep image descriptors, providing a challenging, large-scale benchmark for evaluating hashing and ANN methods.
Yandex Text-to-Image-1B: A cross-modal dataset of one billion image embeddings with accompanying text-query embeddings, useful for benchmarking similarity search techniques that bridge the text and image modalities.
Deep1B Dataset: The Deep1B dataset contains one billion deep representations of images, widely used in large-scale similarity search benchmarks.
DEEP-10M: A smaller variant of the DEEP-1B dataset, containing 10 million deep image descriptors.
GLUE Benchmark: The General Language Understanding Evaluation (GLUE) benchmark consists of a variety of natural language processing tasks that test a model’s understanding of language. While not a retrieval benchmark, it provides useful material for evaluating text-based hashing techniques.
Some university courses cover topics related to machine learning and efficient computing, with publicly available materials:
Extreme Computing at the University of Edinburgh: This course focuses on the challenges and techniques involved in building and scaling systems for processing massive datasets.
Text Technologies for Data Science at the University of Edinburgh: This course covers the processing, analysis, and modeling of textual data in data science applications. It includes topics such as text mining, natural language processing, and information retrieval, with a focus on how these techniques can be used to extract insights from large text corpora.
CS276: Information Retrieval at Stanford University: A comprehensive course that covers the fundamental principles of information retrieval, including algorithms for vector similarity search and hashing.
Blog posts are a great way to keep up with cutting-edge research. Here are some of our favorites:
Learning to Hash — Finding the Needle in the Haystack with AI: This blog post, authored by Sean Moran, provides a beginner-friendly introduction to the concept of learning to hash, focusing on how AI techniques like deep learning can improve approximate nearest neighbor search.
Fast Near-Duplicate Image Search Using Locality-Sensitive Hashing: This post explains how Locality-Sensitive Hashing (LSH) can be applied to find near-duplicate images efficiently.
An Introduction to Hashing in the Era of Machine Learning: This blog post gives an overview of hashing techniques, specifically in the context of modern machine learning applications.
Locality-Sensitive Hashing: Reducing Data Dimensionality: This article introduces Locality-Sensitive Hashing (LSH) as a method for reducing the dimensionality of high-dimensional data while preserving similarity.
Efficient Similarity Search with Faiss: This blog post, from Facebook AI Research (FAIR), provides an in-depth explanation of Faiss, an open-source library for efficient similarity search.
Johnson–Lindenstrauss Lemma: This resource describes the Johnson-Lindenstrauss Lemma, a mathematical result that provides a way to reduce the dimensionality of data while approximately preserving distances between points.
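For reference, the lemma states that for any 0 < ε < 1 and any n points in R^d there is a linear map f : R^d → R^k, with k = O(ε⁻² log n), that preserves all pairwise distances up to a factor of 1 ± ε (the exact constant varies by proof):

```latex
(1 - \varepsilon)\,\lVert x_i - x_j \rVert^2
  \;\le\; \lVert f(x_i) - f(x_j) \rVert^2
  \;\le\; (1 + \varepsilon)\,\lVert x_i - x_j \rVert^2
  \quad \text{for all pairs } i, j .
```

In practice f can simply be a suitably scaled random Gaussian matrix, which is why random projections appear so often in LSH constructions.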
LSH Ideas: This article offers ideas and insights about Locality-Sensitive Hashing (LSH), focusing on its conceptual foundation and potential applications.
Introduction to Locality-Sensitive Hashing (Great Visualizations): This tutorial, rich with visual aids, provides an easy-to-follow introduction to Locality-Sensitive Hashing (LSH).
What is Locality-Sensitive Hashing?: This Quora discussion explains LSH in simple terms. It covers the core principles of how LSH works and why it is useful for approximate nearest neighbor search.
Faiss (Facebook AI Similarity Search): Faiss is a high-performance library developed by Facebook AI Research (FAIR) for efficient similarity search and clustering of dense vectors. Optimized for large-scale datasets, it provides a range of indexing algorithms, including brute-force search, approximate nearest neighbor indexes, and quantization techniques, and is widely used in machine learning and AI applications.
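A minimal sketch of the core Faiss workflow, using random placeholder data and the exact (brute-force) index; swapping in an approximate index is a one-line change:

```python
import numpy as np
import faiss

d = 64                                               # vector dimensionality
xb = np.random.random((10000, d)).astype('float32')  # database vectors (placeholders)
xq = np.random.random((5, d)).astype('float32')      # query vectors

index = faiss.IndexFlatL2(d)      # exact brute-force L2 index
# index = faiss.IndexLSH(d, 128)  # an LSH-based approximate alternative
index.add(xb)                     # add the database vectors
D, I = index.search(xq, 5)        # distances and ids of the 5 nearest neighbors per query
print(I)
```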
Annoy (Approximate Nearest Neighbors Oh Yeah): Annoy is a fast and lightweight C++ library developed by Spotify for performing approximate nearest neighbor search in high-dimensional spaces. It is particularly well-suited for applications requiring quick retrieval of similar vectors, such as music recommendation systems, and is optimized for read-heavy tasks with low memory overhead.
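A minimal Annoy sketch with random placeholder vectors:

```python
import random
from annoy import AnnoyIndex

f = 40                            # vector dimensionality
index = AnnoyIndex(f, 'angular')  # 'angular' approximates cosine distance
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])

index.build(10)                      # 10 trees: more trees, better recall, more memory
print(index.get_nns_by_item(0, 10))  # the 10 nearest neighbors of item 0
```

Built indexes can be saved to disk and memory-mapped by other processes, which is what makes Annoy attractive for read-heavy serving.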
Deep Hashing Toolbox: Deep Hashing Toolbox is an open-source implementation designed for learning to hash with deep neural networks. This toolbox is a valuable resource for researchers working on deep learning-driven similarity search tasks.
NMSLIB: NMSLIB (Non-Metric Space Library) is a cross-platform library for approximate nearest neighbor search in both metric and non-metric spaces, commonly used in search engines, recommendation systems, and machine learning pipelines.
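A minimal NMSLIB sketch (an HNSW index over cosine similarity; the data and parameters are placeholders):

```python
import numpy as np
import nmslib

data = np.random.randn(1000, 32).astype(np.float32)  # placeholder vectors

index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(data)
index.createIndex({'M': 16, 'efConstruction': 200}, print_progress=False)

ids, dists = index.knnQuery(data[0], k=10)  # 10 nearest neighbors of the first vector
print(ids)
```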
HNSWlib: HNSWlib is an efficient implementation of Hierarchical Navigable Small World (HNSW) graphs, which provide fast approximate nearest neighbor search. This library is widely used for large-scale search tasks, offering low memory usage and high accuracy.
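A minimal HNSWlib sketch along the same lines (placeholder data; M and ef control the accuracy/speed trade-off):

```python
import numpy as np
import hnswlib

dim, num = 128, 10000
data = np.random.random((num, dim)).astype(np.float32)  # placeholder vectors

index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=num, ef_construction=200, M=16)
index.add_items(data)
index.set_ef(50)  # query-time breadth: higher ef, better recall, slower queries

labels, distances = index.knn_query(data[:5], k=3)  # 3 nearest neighbors per query
print(labels)
```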
Here are a few recommended books on large-scale machine learning:
Mining of Massive Datasets: This book covers a wide range of large-scale data mining topics, including graph processing, machine learning algorithms, and large-scale search. It features a detailed section on Locality-Sensitive Hashing (LSH), explaining how LSH can be used to efficiently index and retrieve data in large datasets. This is a fundamental resource for understanding data mining techniques at scale.
Introduction to Information Retrieval: A classic textbook by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze, covering the full spectrum of data indexing and retrieval techniques. The book explores topics such as vector space models, relevance ranking, and search algorithms, with sections dedicated to hashing techniques for efficient data retrieval.
Efficient Processing of Deep Neural Networks: This book provides a thorough exploration of various techniques for optimizing and processing deep neural networks. It covers both the theoretical foundations and practical implementations for improving the efficiency of DNNs, including model compression, quantization, and hashing for efficient data storage and retrieval. It is essential for researchers focused on the intersection of deep learning and efficient computation.