A webpage dedicated to the latest research on Hash Function Learning. Maintained by Sean Moran.
A curated collection of books, courses, datasets, and tools covering Learning to Hash.
Dr. Wu-Jun Li’s tutorial slides: These tutorial slides by Dr. Wu-Jun Li offer a comprehensive introduction to learning to hash (L2H) techniques. They are an excellent resource for anyone seeking a deep, technical understanding of hashing.
Intro to LSH - Part 1: In this video, Dr. Victor Lavrenko provides an introduction to Locality-Sensitive Hashing (LSH). Part 1 covers the basic concepts and intuition behind LSH, making it accessible for beginners.
Intro to LSH - Part 2: Part 2 of Dr. Lavrenko’s LSH series dives deeper into the mathematics and mechanics of how LSH works. (A minimal code sketch of the idea follows this list.)
Hashing Algorithms for Large-Scale Machine Learning - 2017 Rice Machine Learning Workshop: This video is a recording of a presentation from the 2017 Rice Machine Learning Workshop. It offers a detailed overview of various hashing algorithms used for large-scale machine learning.
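To make the intuition from these tutorials concrete, here is a minimal sketch of the random-hyperplane (sign-random-projection) flavour of LSH in NumPy. The bit width, dimensionality, and data below are illustrative choices, not taken from any of the resources above.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_bits = 64, 16
hyperplanes = rng.standard_normal((n_bits, d))  # one random hyperplane per bit

def lsh_codes(X, hyperplanes):
    """One bit per hyperplane: 1 when a point falls on its positive side."""
    return (X @ hyperplanes.T > 0).astype(np.uint8)

X = rng.standard_normal((1000, d))   # toy database
codes = lsh_codes(X, hyperplanes)

q = rng.standard_normal((1, d))      # toy query
q_code = lsh_codes(q, hyperplanes)[0]

# Points with similar directions tend to agree on most bits, so a low
# Hamming distance on the codes marks a candidate for exact re-ranking.
hamming = (codes != q_code).sum(axis=1)
candidates = np.argsort(hamming)[:10]
```

In a real LSH index the codes would be grouped into hash-table buckets rather than scanned linearly, but the bucketing key is exactly this bit pattern.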
IJCNN 2025: Scalable and Deep Graph Learning and Mining: A workshop covering, among other topics, hashing methods applied to graph structures for retrieval and similarity search.
Practical Vector Search Challenge 2023: This challenge aims to push the boundaries of approximate nearest neighbor (ANN) search techniques and offers a platform for researchers and developers to benchmark their solutions on billion-scale datasets.
Billion-Scale Approximate Nearest Neighbor Search Challenge: NeurIPS’21 Competition Track: Competitors must improve search accuracy and speed on extremely large datasets, providing valuable insights into the performance of state-of-the-art methods for nearest neighbor search.
Compact and Efficient Feature Representation and Learning in Computer Vision, ICCV 2019: This workshop at ICCV 2019 focuses on efficient learning techniques for compact feature representations, including binary hashing methods.
International Conference on Similarity Search and Applications: SISAP is an annual conference dedicated to the study of similarity search techniques.
Joint Workshop on Efficient Deep Learning in Computer Vision: This workshop, co-located with CVPR 2020, focused on the intersection of deep learning and efficient computing techniques for computer vision tasks.
IEEE International Conference on Data Engineering (ICDE): ICDE is one of the leading conferences on data engineering, where researchers present advances in data management, indexing, and search.
ACM International Conference on Knowledge Discovery and Data Mining (KDD): KDD is a premier conference on data mining and machine learning.
SIAM International Conference on Data Mining (SDM): SDM is an important conference for researchers in data mining, focusing on the latest developments in algorithms, data analysis, and big data applications.
For a deeper dive, these survey papers are excellent resources:
Learning-Based Hashing for Approximate Nearest Neighbour (ANN) Search: Foundations and Early Advances (Moran, 2025): A foundational survey introducing the core principles of learning-based hashing for ANN search.
Learning to Hash for Recommendation: A Survey (2024): A dedicated overview of hashing-based methods used in recommender systems, from binary encodings to retrieval-aware deep architectures.
Learning to Hash for Indexing Big Data - A Survey: This comprehensive survey explores the evolution of hashing techniques for indexing and retrieving big data.
A Survey on Learning to Hash: This survey provides a detailed overview of different learning-to-hash algorithms, categorized into unsupervised, semi-supervised, and supervised methods.
Learning Binary Hash Codes for Large-Scale Image Search: This paper focuses on learning binary hash codes for efficient large-scale image search. It is particularly useful for researchers working on image retrieval and large-scale computer vision tasks.
Locality-Sensitive Hashing for Finding Nearest Neighbors: This tutorial-style survey introduces Locality-Sensitive Hashing (LSH) as a method for efficient nearest neighbor search. It explains the principles behind LSH and demonstrates how it can be applied to large-scale datasets.
Deep Learning for Hashing: A Survey: This survey provides an in-depth overview of deep learning-based hashing techniques, which have become increasingly popular for large-scale retrieval tasks.
Learning to Hash With Binary Deep Neural Networks: A Survey: This survey focuses on binary deep neural networks and their use in learning to hash. It explores how these networks are trained to produce compact binary codes for efficient data retrieval in large-scale datasets. (A minimal sketch of the Hamming-distance retrieval step shared by these methods follows this list.)
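Most of the methods covered by these surveys share the same retrieval back end: encode every item as a short binary code, then rank by Hamming distance. A minimal NumPy sketch of that back end, using random codes where a trained model would supply learned ones:

```python
import numpy as np

# Stand-in "learned" codes; a real learning-to-hash model would produce these bits.
rng = np.random.default_rng(1)
db_codes = rng.integers(0, 2, size=(100_000, 64), dtype=np.uint8)
q_code = rng.integers(0, 2, size=(1, 64), dtype=np.uint8)

# Pack 64 bits into 8 bytes per item for compact storage.
db_packed = np.packbits(db_codes, axis=1)  # shape (100_000, 8)
q_packed = np.packbits(q_code, axis=1)     # shape (1, 8)

# Hamming distance = popcount of the XOR; unpackbits serves as a simple popcount.
xor = np.bitwise_xor(db_packed, q_packed)
dists = np.unpackbits(xor, axis=1).sum(axis=1)
top10 = np.argsort(dists)[:10]
```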
Some university courses cover topics related to machine learning and efficient computing, with publicly available materials:
Learning from Data by Yaser S. Abu-Mostafa et al.: A concise, intuitive introduction to the principles of supervised learning and generalization theory — foundational for understanding supervised hashing methods.
Extreme Computing (University of Edinburgh): Focuses on the challenges and techniques involved in building and scaling systems for processing massive datasets.
Text Technologies for Data Science (University of Edinburgh): Covers processing, analysis, and modeling of textual data. Includes topics in text mining, NLP, and information retrieval — with relevance to similarity search and hashing.
CS276: Information Retrieval (Stanford University): A comprehensive, foundational course covering algorithms for vector similarity search, ranking, indexing, and hashing.
Vector Databases: from Embeddings to Applications: Learn how vector databases work (dense vs sparse search, multilingual embeddings, hybrid search) with real-world applications using Weaviate. (~55 min)
Retrieval Optimization: from Tokenization to Vector Quantization: Deep dive into ANN performance tuning — covering HNSW, product/scalar/binary quantization, and index compression techniques. Created with Qdrant. (A toy product quantization sketch follows this list.)
Building Applications with Vector Databases: Hands-on course for building RAG, semantic search, hybrid retrieval, and anomaly detection apps using Pinecone.
Retrieval Augmented Generation (RAG): Explore architectures and implementation of RAG pipelines using vector indices, chunking, retrieval filtering, and prompt design.
Knowledge Graphs for RAG: Learn how to connect vector embeddings with structured data using Neo4j to improve retrieval in multimodal and structured RAG systems.
Prompt Compression and Query Optimization: Covers retrieval latency reduction via query filtering, projection, re-ranking, and prompt shortening — with examples using MongoDB Atlas Vector Search.
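As a rough illustration of the product quantization technique mentioned in the retrieval-optimization course above, the sketch below compresses each vector to one byte per subspace using per-subspace k-means codebooks. The data, subspace count, and codebook size are arbitrary demo choices.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 64))

m, ks = 8, 256                   # 8 subspaces x 256 centroids -> 8 bytes per vector
d_sub = X.shape[1] // m

codebooks, codes = [], []
for j in range(m):
    sub = X[:, j * d_sub:(j + 1) * d_sub]
    centroids, labels = kmeans2(sub, ks, minit="points")  # one codebook per subspace
    codebooks.append(centroids)
    codes.append(labels.astype(np.uint8))
codes = np.stack(codes, axis=1)  # (n, m) uint8 code matrix

# Reconstruct from the codebooks and measure the quantization error.
recon = np.hstack([codebooks[j][codes[:, j]] for j in range(m)])
print("reconstruction MSE:", np.mean((X - recon) ** 2))
```

Production libraries pair these codes with asymmetric distance computation (comparing raw queries against codebook entries) rather than reconstructing, but the compression step is the same.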
Blog posts are a great way to keep up with cutting-edge research. Here are some of our favorites:
ANN-Benchmarks: A standard benchmarking platform for evaluating the performance of Approximate Nearest Neighbor (ANN) algorithms on a range of real-world and synthetic datasets. Continuously updated and widely cited, it provides reproducible results for comparison across indexing methods and libraries.
Learning to Hash — Finding the Needle in the Haystack with AI: This blog post, authored by Sean Moran, provides a beginner-friendly introduction to the concept of learning to hash, focusing on how AI techniques like deep learning can improve approximate nearest neighbor search.
Fast Near-Duplicate Image Search Using Locality-Sensitive Hashing: This post explains how Locality-Sensitive Hashing (LSH) can be applied to find near-duplicate images efficiently.
An Introduction to Hashing in the Era of Machine Learning: This blog post gives an overview of hashing techniques, specifically in the context of modern machine learning applications.
Locality-Sensitive Hashing: Reducing Data Dimensionality: This article introduces Locality-Sensitive Hashing (LSH) as a method for reducing the dimensionality of high-dimensional data while preserving similarity.
Efficient Similarity Search with Faiss: This blog post, from Facebook AI Research (FAIR), provides an in-depth explanation of Faiss, an open-source library for efficient similarity search.
Johnson–Lindenstrauss Lemma: This resource describes the Johnson–Lindenstrauss lemma, a mathematical result that provides a way to reduce the dimensionality of data while approximately preserving distances between points. (A small numerical demonstration follows this list.)
LSH Ideas: This article offers ideas and insights about Locality-Sensitive Hashing (LSH), focusing on its conceptual foundation and potential applications.
Introduction to Locality-Sensitive Hashing (Great Visualizations): This tutorial, rich with visual aids, provides an easy-to-follow introduction to Locality-Sensitive Hashing (LSH).
What is Locality-Sensitive Hashing?: This Quora discussion explains LSH in simple terms. It covers the core principles of how LSH works and why it is useful for approximate nearest neighbor search.
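To see the Johnson–Lindenstrauss lemma from the post above in action, here is a small NumPy experiment with a scaled Gaussian random projection; the sizes are arbitrary demo values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 10_000, 512          # 500 points, 10k dims projected down to 512

X = rng.standard_normal((n, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)  # scaled Gaussian projection
Y = X @ R

# A pairwise distance before and after projection: the ratio stays near 1,
# as the lemma predicts for k = O(log n / eps^2).
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(proj / orig)
```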
Deep Hashing Toolbox: An open-source implementation designed for learning to hash with deep neural networks. Useful for deep similarity search research.
Rensa (beowolx) – High-performance MinHash: A Rust-based MinHash implementation with Python bindings. Fast and memory-efficient for deduplication tasks. (A toy MinHash sketch follows this list.)
Deep Supervised Hashing (DSH): A PyTorch implementation of Deep Supervised Hashing, which learns compact binary codes using supervision for high retrieval performance.
HashNet: Implements HashNet, a deep hashing method that handles imbalanced data distributions and learns binary hash codes end-to-end.
Faiss (Facebook AI Similarity Search): A powerful library by Facebook AI Research for efficient similarity search of dense vectors. Supports PQ, IVF, HNSW, and more. (A short usage example follows this list.)
Annoy (Approximate Nearest Neighbors Oh Yeah): A C++/Python library from Spotify for fast approximate nearest neighbor search. Optimized for read-heavy workloads.
NMSLIB: A cross-platform library for similarity search in non-metric spaces. Frequently used in search and recommendation systems.
HNSWlib: Implements Hierarchical Navigable Small World (HNSW) graphs for fast and accurate ANN search.
ScaNN (Scalable Nearest Neighbors): Developed by Google Research, ScaNN is optimized for vector similarity search at production scale using quantization and reordering.
Milvus: A production-ready open-source vector database for similarity search. Supports multiple ANN algorithms and distributed deployments.
Weaviate: An open-source vector database with semantic search capabilities, supporting hybrid search, classification, and modules like CLIP and OpenAI.
Qdrant: A fast and scalable vector database written in Rust. Provides gRPC and REST APIs and supports filtering and payload-based search.
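For context on what a library like Rensa implements, here is a bare-bones MinHash in pure Python. The CRC32 token hashing and the universal-hash parameters are illustrative choices, not Rensa’s actual scheme.

```python
import random
import zlib

P = (1 << 61) - 1  # large Mersenne prime for the universal hash family

def make_hash_params(num_hashes, seed=42):
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(num_hashes)]

def minhash_signature(tokens, params):
    # For each hash h(x) = (a*x + b) mod P, keep the minimum over the token set.
    hashes = [zlib.crc32(t.encode()) for t in set(tokens)]
    return [min((a * x + b) % P for x in hashes) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature slots estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

params = make_hash_params(128)
a = minhash_signature("the quick brown fox".split(), params)
b = minhash_signature("the quick brown dog".split(), params)
print(estimate_jaccard(a, b))  # true Jaccard here is 3/5 = 0.6
```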
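And since Faiss comes up repeatedly on this page, a minimal usage example showing exact search plus an inverted-file index (sizes and parameters are arbitrary):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128
xb = np.random.rand(10_000, d).astype("float32")  # database vectors
xq = np.random.rand(5, d).astype("float32")       # query vectors

# Exact L2 baseline.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D, I = flat.search(xq, 10)  # top-10 distances and ids per query

# Approximate search with an inverted-file (IVF) index.
nlist = 100                                       # number of coarse clusters
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
ivf.train(xb)    # learn the coarse clustering
ivf.add(xb)
ivf.nprobe = 8   # clusters visited per query: the speed/recall knob
D, I = ivf.search(xq, 10)
```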
ANN-Benchmarks is the standard benchmarking framework for evaluating Approximate Nearest Neighbor (ANN) algorithms on a wide range of datasets and distance metrics.
Here are a few recommended books on large-scale machine learning:
Mining of Massive Datasets: (affiliate link) This classic book explores large-scale data mining techniques, including graph processing, clustering, recommendation, and Locality-Sensitive Hashing (LSH). It’s a core resource for anyone working on scalable algorithms for big data.
Introduction to Information Retrieval: (affiliate link) Authored by Manning, Raghavan, and Schütze, this book is essential reading for understanding search engines, indexing, relevance, and vector space models — including chapters on hashing for text retrieval.
Efficient Processing of Deep Neural Networks: (affiliate link) A practical and theoretical guide to optimizing deep neural networks for deployment. It covers model compression, quantization, and hashing, making it highly relevant for efficient deep hashing research.
Similarity Search: The Metric Space Approach (affiliate link) by Zezula et al.: A foundational text on similarity search in metric spaces, offering deep insight into indexing and retrieval techniques that predate modern hashing but remain highly relevant.
Foundations of Data Science (affiliate link) by Blum, Hopcroft, and Kannan: A mathematically rigorous treatment of data science topics, including high-dimensional geometry, random projections, and algorithms that underlie LSH and related hashing techniques.
Deep Learning (affiliate link) by Goodfellow, Bengio, and Courville: The definitive book on deep learning. While not specific to hashing, it provides the theoretical backbone for understanding the neural network architectures used in deep supervised hashing models.
CIFAR-10 Gist Features (.mat): This dataset contains GIST features extracted from the CIFAR-10 dataset, a popular image classification benchmark.
LabelMe Gist Features (.mat): A set of GIST features extracted from the LabelMe dataset, which includes a large collection of labeled images.
MNIST Pixel Features (.mat): This dataset contains pixel-level features extracted from the MNIST dataset, a benchmark for handwritten digit recognition.
SIFT 1M Features (.mat): This dataset consists of SIFT (Scale-Invariant Feature Transform) descriptors for one million image patches, commonly used in image matching and retrieval tasks.
20 Newsgroups (.mat): This dataset contains features extracted from the 20 Newsgroups text dataset, a collection of approximately 20,000 documents categorized into 20 different newsgroups.
TDT2 (.mat): This dataset includes features from the Topic Detection and Tracking (TDT2) dataset, designed for research on text retrieval and clustering.
BIGANN Dataset: The BIGANN dataset includes SIFT descriptors extracted from a large collection of images. It is widely used for benchmarking large-scale approximate nearest neighbor (ANN) search algorithms. (See the loader sketch at the end of this section.)
Facebook SimSearchNet++: A large-scale dataset developed by Facebook for the SimSearchNet++ model, which is used to benchmark billion-scale similarity search algorithms in the context of AI and machine learning applications.
Microsoft SPACEV-1B: This dataset from Microsoft includes one billion vectors for testing large-scale similarity search algorithms. It is a benchmark for efficient vector retrieval systems and helps evaluate ANN algorithms’ performance.
Yandex DEEP-1B: The DEEP-1B dataset from Yandex consists of one billion deep image descriptors for benchmarking approximate nearest neighbor search algorithms. It provides a challenging, large-scale benchmark for evaluating hashing and ANN methods.
Yandex Text-to-Image-1B: A dataset that includes one billion text-to-image matching features, useful for evaluating and benchmarking similarity search techniques that bridge the gap between text and image modalities.
Deep1B Dataset: The Deep1B dataset contains one billion deep representations of images, widely used in large-scale similarity search benchmarks.
DEEP-10M: A smaller variant of the DEEP-1B dataset, containing 10 million deep image descriptors.
GLUE Benchmark: The General Language Understanding Evaluation (GLUE) benchmark consists of a variety of natural language processing tasks that test a model’s understanding of language. While not traditionally used for hashing research, it provides valuable challenges for text-based hashing techniques.
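A quick note on reading these files: the .mat feature sets above open with scipy.io.loadmat (the variable name inside each file varies), and the SIFT/BIGANN-style sets ship in the TEXMEX .fvecs layout. A small loader sketch, assuming those standard formats; the paths and key below are placeholders:

```python
import numpy as np
from scipy.io import loadmat

def load_mat_features(path, key):
    """Feature matrix from a .mat file; the variable name (key) differs per file."""
    return loadmat(path)[key]

def read_fvecs(path):
    """TEXMEX .fvecs: each record is an int32 dimension followed by that many float32s."""
    raw = np.fromfile(path, dtype=np.float32)
    d = raw[:1].view(np.int32)[0]          # reinterpret the first 4 bytes as the dimension
    return raw.reshape(-1, d + 1)[:, 1:]   # drop the per-record dimension column

# X = load_mat_features("cifar10_gist.mat", "gist")  # hypothetical path and key
# sift = read_fvecs("sift_base.fvecs")               # hypothetical path
```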
Please feel free to submit a web form to add more links to this page. As an Amazon Associate, this site earns from qualifying purchases, at no additional cost to you. (All Amazon links are marked as “affiliate link.”)