Search across all paper titles, abstracts, and authors using the search field. Please consider contributing by updating the information for existing papers or by adding new work.
Year | Title | Authors | Venue | Abstract |
---|---|---|---|---|
2024 | Results Of The Big ANN Neurips23 Competition | Simhadri Harsha Vardhan, Aumüller Martin, Ingber Amir, Douze Matthijs, Williams George, Manohar Magdalen Dobson, Baranchuk Dmitry, Liberty Edo, Liu Frank, Landrum Ben, Karjikar Mazin, Dhulipala Laxman, Chen Meng, Chen Yue, Ma Rui, Zhang Kai, Cai Yuzheng, Shi Jiayang, Chen Yizhuo, Zheng Weiguo, Wan Zihao, Yin Jie, Huang Ben | Arxiv | The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state-of-the-art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search (Simhadri et al., 2021), this competition addressed filtered search, out-of-distribution data, sparse and streaming variants of ANNS. Participants developed and submitted innovative solutions that were evaluated on new standard datasets with constrained computational resources. The results showcased significant improvements in search accuracy and efficiency over industry-standard baselines, with notable contributions from both academic and industrial teams. This paper summarizes the competition tracks, datasets, evaluation metrics, and the innovative approaches of the top-performing submissions, providing insights into the current advancements and future directions in the field of approximate nearest neighbor search. |
2024 | Asymmetric LSH (ALSH) For Sublinear Time Maximum Inner Product Search (MIPS) | Shrivastava A., Li | Arxiv | We present the first provably sublinear time hashing algorithm for approximate Maximum Inner Product Search (MIPS). Searching with (un-normalized) inner product as the underlying similarity measure is a known difficult problem, and finding hashing schemes for MIPS was considered hard. While the existing Locality Sensitive Hashing (LSH) framework is insufficient for solving MIPS, in this paper we extend the LSH framework to allow asymmetric hashing schemes. Our proposal is based on a key observation that the problem of finding maximum inner products, after independent asymmetric transformations, can be converted into the problem of approximate near neighbor search in classical settings. This key observation makes an efficient sublinear hashing scheme for MIPS possible. Under the extended asymmetric LSH (ALSH) framework, this paper provides an explicit construction of a provably fast hashing scheme for MIPS. Our proposed algorithm is simple and easy to implement. The proposed hashing scheme leads to significant computational savings over the two popular conventional LSH schemes: (i) Sign Random Projection (SRP) and (ii) hashing based on p-stable distributions for L2 norm (L2LSH), in the collaborative filtering task of item recommendations on Netflix and Movielens (10M) datasets. |
2024 | Densifying One Permutation Hashing Via Rotation For Fast Near Neighbor Search | Shrivastava A., Li | Arxiv | The query complexity of locality sensitive hashing (LSH) based similarity search is dominated by the number of hash evaluations, and this number grows with the data size (Indyk & Motwani, 1998). In industrial applications such as search where the data are often high-dimensional and binary (e.g., text n-grams), minwise hashing is widely adopted, which requires applying a large number of permutations on the data. This is costly in computation and energy-consumption. In this paper, we propose a hashing technique which generates all the necessary hash evaluations needed for similarity search, using one single permutation. The heart of the proposed hash function is a “rotation” scheme which densifies the sparse sketches of one permutation hashing (Li et al., 2012) in an unbiased fashion thereby maintaining the LSH property. This makes the obtained sketches suitable for hash table construction. This idea of rotation presented in this paper could be of independent interest for densifying other types of sparse sketches. Using our proposed hashing method, the query time of a (K, L)-parameterized LSH is reduced from the typical O(dKL) complexity to merely O(KL + dL), where d is the number of nonzeros of the data vector, K is the number of hashes in each hash table, and L is the number of hash tables. Our experimental evaluation on real data confirms that the proposed scheme significantly reduces the query processing time over minwise hashing without loss in retrieval accuracies. |
2024 | Cayley Hashing With Cookies | Shpilrain Vladimir, Sosnovski Bianca | Arxiv | Cayley hash functions are based on a simple idea of using a pair of semigroup elements, A and B, to hash the 0 and 1 bit, respectively, and then to hash an arbitrary bit string in the natural way, by using multiplication of elements in the semigroup. The main advantage of Cayley hash functions compared to, say, hash functions in the SHA family is that when an already hashed document is amended, one does not have to hash the whole amended document all over again, but rather hash just the amended part and then multiply the result by the hash of the original document. Some authors argued that this may be a security hazard, specifically that this property may facilitate finding a second preimage by splitting a long bit string into shorter pieces. In this paper, we offer a way to get rid of this alleged disadvantage and keep the advantages at the same time. We call this method "Cayley hashing with cookies", using terminology borrowed from the theory of random walks in a random environment. For the platform semigroup, we use 2x2 matrices over F_p. |
2024 | Deskew-lsh Based Code-to-code Recommendation Engine | Silavong Fran, Moran, Georgiadis, Saphal, Otter | Arxiv | Machine learning on source code (MLOnCode) is a popular research field that has been driven by the availability of large-scale code repositories and the development of powerful probabilistic and deep learning models for mining source code. Code-to-code recommendation is a task in MLOnCode that aims to recommend relevant, diverse and concise code snippets that usefully extend the code currently being written by a developer in their development environment (IDE). Code-to-code recommendation engines hold the promise of increasing developer productivity by reducing context switching from the IDE and increasing code-reuse. Existing code-to-code recommendation engines do not scale gracefully to large codebases, exhibiting a linear growth in query time as the code repository increases in size. In addition, existing code-to-code recommendation engines fail to account for the global statistics of code repositories in the ranking function, such as the distribution of code snippet lengths, leading to sub-optimal retrieval results. We address both of these weaknesses with Senatus, a new code-to-code recommendation engine. At the core of Senatus is De-Skew LSH, a new locality sensitive hashing (LSH) algorithm that indexes the data for fast (sub-linear time) retrieval while also counteracting the skewness in the snippet length distribution using novel abstract syntax tree-based feature scoring and selection algorithms. We evaluate Senatus via automatic evaluation and with an expert developer user study and find the recommendations to be of higher quality than competing baselines, while achieving faster search. For example, on the CodeSearchNet dataset we show that Senatus improves performance by 6.7% F1 and is 16x faster in query time compared to Facebook Aroma on the task of code-to-code recommendation. |
2024 | Variable-length Quantization Strategy For Hashing | Shi Yang, Nie, Zhou, Xi, Yin | Arxiv | Hashing is widely used to solve fast Approximate Nearest Neighbor (ANN) search problems; it involves converting the original real-valued samples to binary-valued representations. The conventional quantization strategies, such as Single-Bit Quantization and Multi-Bit Quantization, are considered ineffective because of their serious information loss. To address this issue, we propose a novel variable-length quantization (VLQ) strategy for hashing. In the proposed VLQ technique, given the real-valued features of the samples, we first divide all samples into different regions in each dimension. Then we compute the dispersion degrees of these regions. Subsequently, we attempt to optimally assign a different number of bits to each dimension to obtain the minimum dispersion degree. Our experiments show that the VLQ strategy not only achieves superior performance over the state-of-the-art methods, but also has a faster retrieval speed on public datasets. |
2024 | Unsupervised Deep Hashing With Similarity-adaptive And Discrete Optimization | Shen Fumin, Xu, Liu, Yang, Huang, Shen | Arxiv | Recent vision and learning studies show that learning compact hash codes can facilitate massive data processing with significantly reduced storage and computation. Particularly, learning deep hash functions has greatly improved the retrieval performance, typically under the semantic supervision. In contrast, current unsupervised deep hashing algorithms can hardly achieve satisfactory performance due to either the relaxed optimization or absence of similarity-sensitive objective. In this work, we propose a simple yet effective unsupervised hashing framework, named Similarity-Adaptive Deep Hashing (SADH), which alternatingly proceeds over three training modules: deep hash model training, similarity graph updating and binary code optimization. The key difference from the widely-used two-step hashing method is that the output representations of the learned deep model help update the similarity graph matrix, which is then used to improve the subsequent code optimization. In addition, for producing high-quality binary codes, we devise an effective discrete optimization algorithm which can directly handle the binary constraints with a general hashing loss. Extensive experiments validate the efficacy of SADH, which consistently outperforms the state-of-the-arts by large gaps. |
2024 | NASH Toward End-to-end Neural Architecture For Generative Semantic Hashing | Shen Dinghan, Su, Chapfuwa, Wang, Wang, Carin, Henao | Arxiv | Semantic hashing has become a powerful paradigm for fast similarity search in many information retrieval systems. While fairly successful, previous techniques generally require two-stage training, and the binary constraints are handled ad-hoc. In this paper, we present an end-to-end Neural Architecture for Semantic Hashing (NASH), where the binary hashing codes are treated as Bernoulli latent variables. A neural variational inference framework is proposed for training, where gradients are directly backpropagated through the discrete latent variable to optimize the hash function. We also draw connections between the proposed method and rate-distortion theory, which provides a theoretical foundation for the effectiveness of the proposed framework. Experimental results on three public datasets demonstrate that our method significantly outperforms several state-of-the-art models in both unsupervised and supervised scenarios. |
2024 | Auto-encoding Twin-bottleneck Hashing | Shen Yuming, Qin, Chen, Yu, Liu, Zhu, Shen, Shao | Arxiv | Conventional unsupervised hashing methods usually take advantage of similarity graphs, which are either pre-computed in the high-dimensional space or obtained from random anchor points. On the one hand, existing methods uncouple the procedures of hash function learning and graph construction. On the other hand, graphs empirically built upon original data could introduce biased prior knowledge of data relevance, leading to sub-optimal retrieval performance. In this paper, we tackle the above problems by proposing an efficient and adaptive code-driven graph, which is updated by decoding in the context of an auto-encoder. Specifically, we introduce into our framework twin bottlenecks (i.e., latent variables) that exchange crucial information collaboratively. One bottleneck (i.e., binary codes) conveys the high-level intrinsic data structure captured by the code-driven graph to the other (i.e., continuous variables for low-level detail information), which in turn propagates the updated network feedback for the encoder to learn more discriminative binary codes. The auto-encoding learning objective literally rewards the code-driven graph to learn an optimal encoder. Moreover, the proposed model can be simply optimized by gradient descent without violating the binary constraints. Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods. |
2024 | Embarrassingly Simple Binary Representation Learning | Shen Yuming, Qin, Chen Jiaxin, Liu, Zhu | Arxiv | Recent binary representation learning models usually require sophisticated binary optimization, similarity measure or even generative models as auxiliaries. However, one may wonder whether these non-trivial components are needed to formulate practical and effective hashing models. In this paper, we answer the above question by proposing an embarrassingly simple approach to binary representation learning. With a simple classification objective, our model only incorporates two additional fully-connected layers on top of an arbitrary backbone network, whilst complying with the binary constraints during training. The proposed model lower-bounds the Information Bottleneck (IB) between data samples and their semantics, and can be related to many recent 'learning to hash' paradigms. We show that, when properly designed, even such a simple network can generate effective binary codes, by fully exploring data semantics without any held-out alternating updating steps or auxiliary models. Experiments are conducted on conventional large-scale benchmarks, i.e., CIFAR-10, NUS-WIDE, and ImageNet, where the proposed simple model outperforms the state-of-the-art methods. |
2024 | Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment | Schall Konstantin, Barthel Kai Uwe, Hezel Nico, Jung Klaus | Arxiv | Contrastive Language and Image Pairing (CLIP), a transformative method in multimedia retrieval, typically trains two neural networks concurrently to generate joint embeddings for text and image pairs. However, when applied directly, these models often struggle to differentiate between visually distinct images that have similar captions, resulting in suboptimal performance for image-based similarity searches. This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios, while maintaining their effectiveness in text-based search tasks such as text-to-image retrieval and zero-shot classification. We propose and evaluate two novel methods aimed at refining the retrieval capabilities of CLIP without compromising the alignment between text and image embeddings. The first method involves a sequential fine-tuning process: initially optimizing the image encoder for more precise image retrieval and subsequently realigning the text encoder to these optimized image embeddings. The second approach integrates pseudo-captions during the retrieval-optimization phase to foster direct alignment within the embedding space. Through comprehensive experiments, we demonstrate that these methods enhance CLIP’s performance on various benchmarks, including image retrieval, k-NN classification, and zero-shot text-based classification, while maintaining robustness in text-to-image retrieval. Our optimized models permit maintaining a single embedding per image, significantly simplifying the infrastructure needed for large-scale multi-modal similarity search systems. |
2024 | Semantic Hashing | Salakhutdinov R., Hinton | Arxiv | We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs “semantic hashing”: Documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. This way of extending the efficiency of hash-coding to approximate matching is much faster than locality sensitive hashing, which is the fastest current method. By using semantic hashing to filter the documents given to TF-IDF, we achieve higher accuracy than applying TF-IDF to the entire document set. |
2024 | On The Robustness Of Malware Detectors To Adversarial Samples | Salman Muhammad, Zhao Benjamin Zi Hao, Asghar Hassan Jameel, Ikram Muhammad, Kaushik Sidharth, Kaafar Mohamed Ali | Arxiv | Adversarial examples add imperceptible alterations to inputs with the objective to induce misclassification in machine learning models. They have been demonstrated to pose significant challenges in domains like image classification, with results showing that an adversarially perturbed image to evade detection against one classifier is most likely transferable to other classifiers. Adversarial examples have also been studied in malware analysis. Unlike images, program binaries cannot be arbitrarily perturbed without rendering them non-functional. Due to the difficulty of crafting adversarial program binaries, there is no consensus on the transferability of adversarially perturbed programs to different detectors. In this work, we explore the robustness of malware detectors against adversarially perturbed malware. We investigate the transferability of adversarial attacks developed against one detector, against other machine learning-based malware detectors, and code similarity techniques, specifically, locality sensitive hashing-based detectors. Our analysis reveals that adversarial program binaries crafted for one detector are generally less effective against others. We also evaluate an ensemble of detectors and show that they can potentially mitigate the impact of adversarial program binaries. Finally, we demonstrate that substantial program changes made to evade detection may result in the transformation technique being identified, implying that the adversary must make minimal changes to the program binary. |
2024 | Bio-inspired Hashing For Unsupervised Similarity Search | Ryali Chaitanya, Hopfield, Grinberg, Krotov | Arxiv | The fruit fly Drosophila’s olfactory circuit has inspired a new locality sensitive hashing (LSH) algorithm, FlyHash. In contrast with classical LSH algorithms that produce low dimensional hash codes, FlyHash produces sparse high-dimensional hash codes and has also been shown to have superior empirical performance compared to classical LSH algorithms in similarity search. However, FlyHash uses random projections and cannot learn from data. Building on inspiration from FlyHash and the ubiquity of sparse expansive representations in neurobiology, our work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner. We show that BioHash outperforms previously published benchmarks for various hashing methods. Since our learning algorithm is based on a local and biologically plausible synaptic plasticity rule, our work provides evidence for the proposal that LSH might be a computational reason for the abundance of sparse expansive motifs in a variety of biological systems. We also propose a convolutional variant BioConvHash that further improves performance. From the perspective of computer science, BioHash and BioConvHash are fast, scalable and yield compressed binary representations that are useful for similarity search. |
2024 | Labelme A Database And Web-based Tool For Image Annotation | Russell B., Torralba, Murphy, Freeman | Arxiv | We seek to build a large collection of images with ground truth labels to be used for object detection and recognition research. Such data is useful for supervised learning and quantitative evaluation. To achieve this, we developed a web-based tool that allows easy image annotation and instant sharing of such annotations. Using this annotation tool, we have collected a large dataset that spans many object categories, often containing multiple instances over a wide variety of images. We quantify the contents of the dataset and compare against existing state of the art datasets used for object recognition and detection. Also, we show how to extend the dataset to automatically enhance object labels with WordNet, discover object parts, recover a depth ordering of objects in a scene, and increase the number of labels using minimal user supervision and images from the web. |
2024 | Deep Transfer Hashing For Adaptive Learning On Federated Streaming Data | Röder Manuel, Schleif Frank-michael | Arxiv | This extended abstract explores the integration of federated learning with deep transfer hashing for distributed prediction tasks, emphasizing resource-efficient client training from evolving data streams. Federated learning allows multiple clients to collaboratively train a shared model while maintaining data privacy - by incorporating deep transfer hashing, high-dimensional data can be converted into compact hash codes, reducing data transmission size and network loads. The proposed framework utilizes transfer learning, pre-training deep neural networks on a central server, and fine-tuning on clients to enhance model accuracy and adaptability. A selective hash code sharing mechanism using a privacy-preserving global memory bank further supports client fine-tuning. This approach addresses challenges in previous research by improving computational efficiency and scalability. Practical applications include Car2X event predictions, where a shared model is collectively trained to recognize traffic patterns, aiding in tasks such as traffic density assessment and accident detection. The research aims to develop a robust framework that combines federated learning, deep transfer hashing and transfer learning for efficient and secure downstream task execution. |
2024 | Locality-sensitive Hashing For Earthquake Detection A Case Study Of Scaling Data-driven Science | Rong Kexin, Yoon, Bergen, Elezabi, Bailis Peter, Levis, Beroza | Arxiv | In this work, we report on a novel application of Locality Sensitive Hashing (LSH) to seismic data at scale. Based on the high waveform similarity between reoccurring earthquakes, our application identifies potential earthquakes by searching for similar time series segments via LSH. However, a straightforward implementation of this LSH-enabled application has difficulty scaling beyond 3 months of continuous time series data measured at a single seismic station. As a case study of a data-driven science workflow, we illustrate how domain knowledge can be incorporated into the workload to improve both the efficiency and result quality. We describe several end-to-end optimizations of the analysis pipeline from pre-processing to post-processing, which allow the application to scale to time series data measured at multiple seismic stations. Our optimizations enable an over 100× speedup in the end-to-end analysis pipeline. This improved scalability enabled seismologists to perform seismic analysis on more than ten years of continuous time series data from over ten seismic stations, and has directly enabled the discovery of 597 new earthquakes near the Diablo Canyon nuclear power plant in California and 6123 new earthquakes in New Zealand. |
2024 | Xnor-net Imagenet Classification Using Binary Convolutional Neural Networks | Rastegari M., Ordonez, Redmon, Farhadi | Arxiv | We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32x memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58x faster convolutional operations and 32x memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is only 2.9% less than the full-precision AlexNet (in top-1 measure). We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy. |
2024 | Relevance Filtering For Embedding-based Retrieval | Rossi Nicholas, Lin Juexin, Liu Feng, Yang Zhen, Lee Tony, Magnani Alessandro, Liao Ciya | Arxiv | In embedding-based retrieval, Approximate Nearest Neighbor (ANN) search enables efficient retrieval of similar items from large-scale datasets. While maximizing recall of relevant items is usually the goal of retrieval systems, a low precision may lead to a poor search experience. Unlike lexical retrieval, which inherently limits the size of the retrieved set through keyword matching, dense retrieval via ANN search has no natural cutoff. Moreover, the cosine similarity scores of embedding vectors are often optimized via contrastive or ranking losses, which make them difficult to interpret. Consequently, relying on top-K or cosine-similarity cutoff is often insufficient to filter out irrelevant results effectively. This issue is prominent in product search, where the number of relevant products is often small. This paper introduces a novel relevance filtering component (called “Cosine Adapter”) for embedding-based retrieval to address this challenge. Our approach maps raw cosine similarity scores to interpretable scores using a query-dependent mapping function. We then apply a global threshold on the mapped scores to filter out irrelevant results. We are able to significantly increase the precision of the retrieved set, at the expense of a small loss of recall. The effectiveness of our approach is demonstrated through experiments on both public MS MARCO dataset and internal Walmart product search data. Furthermore, online A/B testing on the Walmart site validates the practical value of our approach in real-world e-commerce settings. |
2024 | Blockboost Scalable And Efficient Blocking Through Boosting | Ramos Thiago, Schuller, Okuno, Nissenbaum, Oliveira, Orenstein | Arxiv | As datasets grow larger, matching and merging entries from different databases has become a costly task in modern data pipelines. To avoid expensive comparisons between entries, blocking similar items is a popular preprocessing step. In this paper, we introduce BlockBoost, a novel boosting-based method that generates compact binary hash codes for database entries, through which blocking can be performed efficiently. The algorithm is fast and scalable, resulting in computational costs that are orders of magnitude lower than current benchmarks. Unlike existing alternatives, BlockBoost comes with associated feature importance measures for interpretability, and possesses strong theoretical guarantees, including lower bounds on critical performance metrics like recall and reduction ratio. Finally, we show that BlockBoost delivers great empirical results, outperforming state-of-the-art blocking benchmarks in terms of both performance metrics and computational cost. |
2024 | Locality-sensitive Binary Codes From Shift-invariant Kernels | Raginsky M., Lazebnik | Arxiv | This paper addresses the problem of designing binary codes for high-dimensional data such that vectors that are similar in the original space map to similar binary strings. We introduce a simple distribution-free encoding scheme based on random projections, such that the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel (e.g., a Gaussian kernel) between the vectors. We present a full theoretical analysis of the convergence properties of the proposed scheme, and report favorable experimental performance as compared to a recent state-of-the-art method, spectral hashing. |
2024 | Hihpq Hierarchical Hyperbolic Product Quantization For Unsupervised Image Retrieval | Qiu Zexuan, Liu Jiahong, Chen Yankai, King Irwin | Arxiv | Existing unsupervised deep product quantization methods primarily aim for the increased similarity between different views of the identical image, whereas the delicate multi-level semantic similarities preserved between images are overlooked. Moreover, these methods predominantly focus on the Euclidean space for computational convenience, compromising their ability to map the multi-level semantic relationships between images effectively. To mitigate these shortcomings, we propose a novel unsupervised product quantization method dubbed Hierarchical Hyperbolic Product Quantization (HiHPQ), which learns quantized representations by incorporating hierarchical semantic similarity within hyperbolic geometry. Specifically, we propose a hyperbolic product quantizer, where the hyperbolic codebook attention mechanism and the quantized contrastive learning on the hyperbolic product manifold are introduced to expedite quantization. Furthermore, we propose a hierarchical semantics learning module, designed to enhance the distinction between similar and non-matching images for a query by utilizing the extracted hierarchical semantics as an additional training supervision. Experiments on benchmarks show that our proposed method outperforms state-of-the-art baselines. |
2024 | Deep Semantic Hashing With Generative Adversarial Networks | Qiu Zhaofan, Pan, Yao, Mei | Arxiv | Hashing has been a widely-adopted technique for nearest neighbor search in large-scale image retrieval tasks. Recent research has shown that leveraging supervised information can lead to high quality hashing. However, the cost of annotating data is often an obstacle when applying supervised hashing to a new domain. Moreover, the results can suffer from the robustness problem as the data at training and test stage may come from different distributions. This paper studies the exploration of generating synthetic data through semi-supervised generative adversarial networks (GANs), which leverages largely unlabeled and limited labeled training data to produce highly compelling data with intrinsic invariance and global coherence, for better understanding statistical structures of natural data. We demonstrate that the above two limitations can be well mitigated by applying the synthetic data for hashing. Specifically, a novel deep semantic hashing with GANs (DSH-GANs) is presented, which mainly consists of four components: a deep convolutional neural network (CNN) for learning image representations, an adversary stream to distinguish synthetic images from real ones, a hash stream for encoding image representations to hash codes and a classification stream. The whole architecture is trained end-to-end by jointly optimizing three losses, i.e., adversarial loss to correct label of synthetic or real for each sample, triplet ranking loss to preserve the relative similarity ordering in the input real-synthetic triplets and classification loss to classify each sample accurately. Extensive experiments conducted on both CIFAR-10 and NUS-WIDE image benchmarks validate the capability of exploiting synthetic images for hashing. Our framework also achieves superior results when compared to state-of-the-art deep hash models. |
2024 | Streaming First Story Detection With Application To Twitter | Petrovic S., Osborne, Lavrenko | Arxiv | With the recent rise in popularity and size of social media, there is a growing need for systems that can extract useful information from this amount of data. We address the problem of detecting new events from a stream of Twitter posts. To make event detection feasible on web-scale corpora, we present an algorithm based on locality-sensitive hashing which is able to overcome the limitations of traditional approaches, while maintaining competitive results. In particular, a comparison with a state-of-the-art system on the first story detection task shows that we achieve over an order of magnitude speedup in processing time, while retaining comparable performance. Event detection experiments on a collection of 160 million Twitter posts show that celebrity deaths are the fastest spreading news on Twitter. |
2024 | Using Paraphrases For Improving First Story Detection In News And Twitter | Petrovic S., Osborne, Lavrenko | Arxiv | First story detection (FSD) involves identifying first stories about events from a continuous stream of documents. A major problem in this task is the high degree of lexical variation in documents which makes it very difficult to detect stories that talk about the same event but expressed using different words. We suggest using paraphrases to alleviate this problem, making this the first work to use paraphrases for FSD. We show a novel way of integrating paraphrases with locality sensitive hashing (LSH) in order to obtain an efficient FSD system that can scale to very large datasets. Our system achieves state-of-the-art results on the first story detection task, beating both the best supervised and unsupervised systems. To test our approach on large data, we construct a corpus of events for Twitter, consisting of 50 million documents, and show that paraphrasing is also beneficial in this domain. |
2024 | Performance Evaluation Of Hashing Algorithms On Commodity Hardware | Pandya Marut | Arxiv | Hashing functions, which are created to provide brief and erratic digests for the message entered, are the primary cryptographic primitives used in blockchain networks. Hashing is employed in blockchain networks to create linked block lists, which offer safe and secure distributed repository storage for critical information. Due to the unique nature of the hash search problem in blockchain networks, a high degree of parallelism in the calculations is possible. This technical report presents a performance evaluation of three popular hashing algorithms: Blake3, SHA-256, and SHA-512. These hashing algorithms are widely used in various applications, such as digital signatures, message authentication, and password storage. It then discusses the performance metrics used to evaluate the algorithms, such as hash rate/throughput and memory usage. The evaluation is conducted on a range of hardware platforms, including desktops and VMs. The evaluation includes synthetic benchmarks. The results of the evaluation show that Blake3 generally outperforms both SHA-256 and SHA-512 in terms of throughput and latency. However, the performance advantage of Blake3 varies depending on the specific hardware platform and the size of the input data. The report concludes with recommendations for selecting the most suitable hashing algorithm for a given application, based on its performance requirements and security needs. The evaluation results can also inform future research and development efforts to improve the performance and security of hashing algorithms. |
2024 | Locality Sensitive Hashing A Comparison Of Hash Function Types And Querying Mechanisms | Pauleve Loic, Jegou, Amsaleg | Arxiv | It is well known that high-dimensional nearest-neighbor retrieval is very expensive. Dramatic performance gains are obtained using approximate search schemes, such as the popular Locality-Sensitive Hashing (LSH). Several extensions have been proposed to address the limitations of this algorithm, in particular, by choosing more appropriate hash functions to better partition the vector space. All the proposed extensions, however, rely on a structured quantizer for hashing, poorly fitting real data sets, limiting its performance in practice. In this paper, we compare several families of space hashing functions in a real setup, namely when searching for high-dimension SIFT descriptors. The comparison of random projections, lattice quantizers, k-means and hierarchical k-means reveals that an unstructured quantizer significantly improves the accuracy of LSH, as it closely fits the data in the feature space. We then compare two querying mechanisms introduced in the literature with the one originally proposed in LSH, and discuss their respective merits and limitations. |
2024 | A New Approach To Cross-modal Multimedia Retrieval | Rasiwasia N., Pereira, Coviello, Doyle, Lanckriet, Vasconcelos | Arxiv | The collected documents are selected sections from Wikipedia's featured articles collection. This is a continuously growing dataset that, at the time of collection (October 2009), had 2,669 articles spread over 29 categories. Some of the categories are very scarce, so we considered only the 10 most populated ones. The articles generally have multiple sections and pictures. We have split them into sections based on section headings, and assigned each image to the section in which it was placed by the author(s). This dataset was then pruned to keep only sections that contained a single image and at least 70 words. The final corpus contains 2,866 multimedia documents. The median text length is 200 words. |
2024 | Comparing Apples To Oranges A Scalable Solution With Heterogeneous Hashing | Ou M., Cui, Wang, Wang, Zhu, Yang | Arxiv | Although hashing techniques have been popular for the large scale similarity search problem, most of the existing methods for designing optimal hash functions focus on homogeneous similarity assessment, i.e., the data entities to be indexed are of the same type. Realizing that heterogeneous entities and relationships are also ubiquitous in real-world applications, there is an emerging need to retrieve and search similar or relevant data entities from multiple heterogeneous domains, e.g., recommending relevant posts and images to a certain Facebook user. In this paper, we address the problem of "comparing apples to oranges" under the large scale setting. Specifically, we propose a novel Relation-aware Heterogeneous Hashing (RaHH), which provides a general framework for generating hash codes of data entities sitting in multiple heterogeneous domains. Unlike some existing hashing methods that map heterogeneous data into a common Hamming space, the RaHH approach constructs a Hamming space for each type of data entities, and learns optimal mappings between them simultaneously. This makes the learned hash codes flexibly cope with the characteristics of different data domains. Moreover, the RaHH framework encodes both homogeneous and heterogeneous relationships between the data entities to design hash functions with improved accuracy. To validate the proposed RaHH method, we conduct extensive evaluations on two large datasets; one is crawled from a popular social media site, Tencent Weibo, and the other is an open dataset of Flickr (NUS-WIDE). The experimental results clearly demonstrate that RaHH outperforms several state-of-the-art hashing methods with significant performance gains. |
2024 | Sliding Block (slick) Hashing An Implementation Benchmarks | Oberst Jan | Arxiv | With hash tables being one of the most used data structures, Lehmann, Sanders and Walzer propose a novel, light-weight hash table, referred to as Slick Hash. Their idea is to hit a sweet spot between space consumption and speed. Building on the theoretical ideas by the authors, an implementation and experiments are required to evaluate the practical performance of Slick Hash. This work contributes to fulfilling this requirement by providing a basic implementation of Slick Hash, an analysis of its performance, and an evaluation of the entry deletion, focusing on the impact of backyard cleaning. The findings are discussed, and a conclusion is drawn. |
2024 | Hamming Distance Metric Learning | Norouzi M., Fleet, Salakhutdinov | Arxiv | Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. |
2024 | Minimal Loss Hashing | Norouzi M., Fleet | Arxiv | We propose a method for learning similarity-preserving hash functions that map high-dimensional data onto binary codes. The formulation is based on structured prediction with latent variables and a hinge-like loss function. It is efficient to train for large datasets, scales well to large code lengths, and outperforms state-of-the-art methods. |
2024 | Concepthash Interpretable Fine-grained Hashing Via Concept Discovery | Ng Kam Woh, Zhu Xiatian, Song Yi-zhe, Xiang Tao | Arxiv | Existing fine-grained hashing methods typically lack code interpretability as they compute hash code bits holistically using both global and local features. To address this limitation, we propose ConceptHash, a novel method that achieves sub-code level interpretability. In ConceptHash, each sub-code corresponds to a human-understandable concept, such as an object part, and these concepts are automatically discovered without human annotations. Specifically, we leverage a Vision Transformer architecture and introduce concept tokens as visual prompts, along with image patch tokens as model inputs. Each concept is then mapped to a specific sub-code at the model output, providing natural sub-code interpretability. To capture subtle visual differences among highly similar sub-categories (e.g., bird species), we incorporate language guidance to ensure that the learned hash codes are distinguishable within fine-grained object classes while maintaining semantic alignment. This approach allows us to develop hash codes that exhibit similarity within families of species while remaining distinct from species in other families. Extensive experiments on four fine-grained image retrieval benchmarks demonstrate that ConceptHash outperforms previous methods by a significant margin, offering unique sub-code interpretability as an additional benefit. Code at: https://github.com/kamwoh/concepthash. |
2024 | Unsupervised Hashing With Similarity Distribution Calibration | Ng Kam, Zhu, Hoe, Chan, Zhang, Song, Xiang | Arxiv | Unsupervised hashing methods typically aim to preserve the similarity between data points in a feature space by mapping them to binary hash codes. However, these methods often overlook the fact that the similarity between data points in the continuous feature space may not be preserved in the discrete hash code space, due to the limited similarity range of hash codes. The similarity range is bounded by the code length and can lead to a problem known as similarity collapse. That is, the positive and negative pairs of data points become less distinguishable from each other in the hash space. To alleviate this problem, in this paper a novel Similarity Distribution Calibration (SDC) method is introduced. SDC aligns the hash code similarity distribution towards a calibration distribution (e.g., beta distribution) with sufficient spread across the entire similarity range, thus alleviating the similarity collapse problem. Extensive experiments show that our SDC significantly outperforms the state-of-the-art alternatives on coarse category-level and instance-level image retrieval. |
2024 | Hashing Geographical Point Data Using The Space-filling H-curve | Netay Igor V. | Arxiv | We construct a geohashing procedure based on the space-filling H-curve. This curve provides a way to construct a geohash with fewer computations than the construction based on the Hilbert curve. At the same time, the H-curve has better clustering properties. |
2024 | The Power Of Asymmetry In Binary Hashing | Neyshabur B., Salakhutdinov, Srebro | Arxiv | When approximating binary similarity using the Hamming distance between short binary hashes, we show that even if the similarity is symmetric, we can have shorter and more accurate hashes by using two distinct code maps, i.e., by approximating the similarity between x and x′ as the Hamming distance between f(x) and g(x′), for two distinct binary codes f, g, rather than as the Hamming distance between f(x) and f(x′). |
2024 | An NMF Perspective On Binary Hashing | Mukherjee Lopamudra, Ravi, Ithapu, Singh | Arxiv | The pervasiveness of massive data repositories has led to much interest in efficient methods for indexing, search, and retrieval. For image data, a rapidly developing body of work for these applications shows impressive performance with methods that broadly fall under the umbrella term of Binary Hashing. Given a distance matrix, a binary hashing algorithm solves for a binary code for the given set of examples, whose Hamming distance nicely approximates the original distances. The formulation is non-convex — so existing solutions adopt spectral relaxations or perform coordinate descent (or quantization) on a surrogate objective that is numerically more tractable. In this paper, we first derive an Augmented Lagrangian approach to optimize the standard binary Hashing objective (i.e., maintain fidelity with a given distance matrix). With appropriate step sizes, we find that this scheme already yields results that match or substantially outperform state of the art methods on most benchmarks used in the literature. Then, to allow the model to scale to large datasets, we obtain an interesting reformulation of the binary hashing objective as a non-negative matrix factorization. Later, this leads to a simple multiplicative updates algorithm — whose parallelization properties are exploited to obtain a fast GPU based implementation. We give a probabilistic analysis of our initialization scheme and present a range of experiments to show that the method is simple to implement and competes favorably with available methods (both for optimization and generalization). |
2024 | Efficient Multi-vector Dense Retrieval Using Bit Vectors | Nardini Franco Maria, Rulli Cosimo, Venturini Rossano | Arxiv | Dense retrieval techniques employ pre-trained large language models to build a high-dimensional representation of queries and passages. These representations are used to compute the relevance of a passage w.r.t. a query using efficient similarity measures. In this line, multi-vector representations show improved effectiveness at the expense of a one-order-of-magnitude increase in memory footprint and query latency by encoding queries and documents on a per-token level. Recently, PLAID has tackled these problems by introducing a centroid-based term representation to reduce the memory impact of multi-vector systems. By exploiting a centroid interaction mechanism, PLAID filters out non-relevant documents, thus reducing the cost of the successive ranking stages. This paper proposes "Efficient Multi-Vector dense retrieval with Bit vectors" (EMVB), a novel framework for efficient query processing in multi-vector dense retrieval. First, EMVB employs a highly efficient pre-filtering step of passages using optimized bit vectors. Second, the computation of the centroid interaction happens column-wise, exploiting SIMD instructions, thus reducing its latency. Third, EMVB leverages Product Quantization (PQ) to reduce the memory footprint of storing vector representations while jointly allowing for fast late interaction. Fourth, we introduce a per-document term filtering method that further improves the efficiency of the last step. Experiments on MS MARCO and LoTTE show that EMVB is up to 2.8x faster while reducing the memory footprint by 1.8x with no loss in retrieval accuracy compared to PLAID. |
2024 | Deep Hashing With Hash-consistent Large Margin Proxy Embeddings | Morgado Pedro, Li, Pereira, Saberian, Vasconcelos | Arxiv | Image hash codes are produced by binarizing the embeddings of convolutional neural networks (CNN) trained for either classification or retrieval. While proxy embeddings achieve good performance on both tasks, they are non-trivial to binarize, due to a rotational ambiguity that encourages non-binary embeddings. The use of a fixed set of proxies (weights of the CNN classification layer) is proposed to eliminate this ambiguity, and a procedure to design proxy sets that are nearly optimal for both classification and hashing is introduced. The resulting hash-consistent large margin (HCLM) proxies are shown to encourage saturation of hashing units, thus guaranteeing a small binarization error, while producing highly discriminative hash-codes. A semantic extension (sHCLM), aimed to improve hashing performance in a transfer scenario, is also proposed. Extensive experiments show that sHCLM embeddings achieve significant improvements over state-of-the-art hashing procedures on several small and large datasets, both within and beyond the set of training classes. |
2024 | Learning To Project And Binarise For Hashing-based Approximate Nearest Neighbour Search | Moran S. | Arxiv | In this paper we focus on improving the effectiveness of hashing-based approximate nearest neighbour search. Generating similarity preserving hashcodes for images has been shown to be an effective and efficient method for searching through large datasets. Hashcode generation generally involves two steps: bucketing the input feature space with a set of hyperplanes, followed by quantising the projection of the data-points onto the normal vectors to those hyperplanes. This procedure results in the makeup of the hashcodes depending on the positions of the data-points with respect to the hyperplanes in the feature space, allowing a degree of locality to be encoded into the hashcodes. In this paper we study the effect of learning both the hyperplanes and the thresholds as part of the same model. Most previous research either learn the hyperplanes assuming a fixed set of thresholds, or vice-versa. In our experiments over two standard image datasets we find statistically significant increases in retrieval effectiveness versus a host of state-of-the-art data-dependent and independent hashing models. |
2024 | Enhancing First Story Detection Using Word Embeddings | Moran S., Mccreadie, Macdonald, Ounis | Arxiv | In this paper we show how word embeddings can be used to increase the effectiveness of a state-of-the art Locality Sensitive Hashing (LSH) based first story detection (FSD) system over a standard tweet corpus. Vocabulary mismatch, in which related tweets use different words, is a serious hindrance to the effectiveness of a modern FSD system. In this case, a tweet could be flagged as a first story even if a related tweet, which uses different but synonymous words, was already returned as a first story. In this work, we propose a novel approach to mitigate this problem of lexical variation, based on tweet expansion. In particular, we propose to expand tweets with semantically related paraphrases identified via automatically mined word embeddings over a background tweet corpus. Through experimentation on a large data stream comprised of 50 million tweets, we show that FSD effectiveness can be improved by 9.5% over a state-of-the-art FSD system. |
2024 | Graph Regularised Hashing | Moran S., Lavrenko | Arxiv | In this paper we propose a two-step iterative scheme, Graph Regularised Hashing (GRH), for incrementally adjusting the positioning of the hashing hypersurfaces to better conform to the supervisory signal: in the first step the binary bits are regularised using a data similarity graph so that similar data points receive similar bits. In the second step the regularised hashcodes form targets for a set of binary classifiers which shift the position of each hypersurface so as to separate opposite bits with maximum margin. GRH exhibits superior retrieval accuracy to competing hashing methods. |
2024 | Neighbourhood Preserving Quantisation For LSH | Moran S., Lavrenko, Osborne | Arxiv | We introduce a scheme for optimally allocating multiple bits per hyperplane for Locality Sensitive Hashing (LSH). Existing approaches binarise LSH projections by thresholding at zero yielding a single bit per dimension. We demonstrate that this is a sub-optimal bit allocation approach that can easily destroy the neighbourhood structure in the original feature space. Our proposed method, dubbed Neighbourhood Preserving Quantization (NPQ), assigns multiple bits per hyperplane based upon adaptively learned thresholds. NPQ exploits a pairwise affinity matrix to discretise each dimension such that nearest neighbours in the original feature space fall within the same quantisation thresholds and are therefore assigned identical bits. NPQ is not only applicable to LSH, but can also be applied to any low-dimensional projection scheme. Despite using half the number of hyperplanes, NPQ is shown to improve LSH-based retrieval accuracy by up to 65% compared to the state-of-the-art. |
2024 | Tile Compression And Embeddings For Multi-label Classification In Geolifeclef 2024 | Miyaguchi Anthony, Aphiwetsa Patcharapong, Mcduffie Mark | Arxiv | We explore methods to solve the multi-label classification task posed by the GeoLifeCLEF 2024 competition with the DS@GT team, which aims to predict the presence and absence of plant species at specific locations using spatial and temporal remote sensing data. Our approach uses frequency-domain coefficients via the Discrete Cosine Transform (DCT) to compress and pre-compute the raw input data for convolutional neural networks. We also investigate nearest neighborhood models via locality-sensitive hashing (LSH) for prediction and to aid in the self-supervised contrastive learning of embeddings through tile2vec. Our best competition model utilized geolocation features with a leaderboard score of 0.152 and a best post-competition score of 0.161. Source code and models are available at https://github.com/dsgt-kaggle-clef/geolifeclef-2024. |
2024 | Regularised Cross-modal Hashing | Moran S., Lavrenko | Arxiv | In this paper we propose Regularised Cross-Modal Hashing (RCMH) a new cross-modal hashing scheme that projects annotation and visual feature descriptors into a common Hamming space. RCMH optimises the intra-modality similarity of data-points in the annotation modality using an iterative three-step hashing algorithm: in the first step each training image is assigned a K-bit hashcode based on hyperplanes learnt at the previous iteration; in the second step the binary bits are smoothed by a formulation of graph regularisation so that similar data-points have similar bits; in the third step a set of binary classifiers are trained to predict the regularised bits with maximum margin. Visual descriptors are projected into the annotation Hamming space by a set of binary classifiers learnt using the bits of the corresponding annotations as labels. RCMH is shown to consistently improve retrieval effectiveness over state-of-the-art baselines. |
2024 | Locality-sensitive Hashing-based Efficient Point Transformer With Applications In High-energy Physics | Miao Siqi, Lu Zhiyuan, Liu Mia, Duarte Javier, Li Pan | Arxiv | This study introduces a novel transformer model optimized for large-scale point cloud processing in scientific domains such as high-energy physics (HEP) and astrophysics. Addressing the limitations of graph neural networks and standard transformers, our model integrates local inductive bias and achieves near-linear complexity with hardware-friendly regular operations. One contribution of this work is the quantitative analysis of the error-complexity tradeoff of various sparsification techniques for building efficient transformers. Our findings highlight the superiority of using locality-sensitive hashing (LSH), especially OR & AND-construction LSH, in kernel approximation for large-scale point cloud data with local inductive bias. Based on this finding, we propose LSH-based Efficient Point Transformer (HEPT), which combines E^2LSH with OR & AND constructions and is built upon regular computations. HEPT demonstrates remarkable performance on two critical yet time-consuming HEP tasks, significantly outperforming existing GNNs and transformers in accuracy and computational speed, marking a significant advancement in geometric deep learning and large-scale scientific data processing. Our code is available at https://github.com/Graph-COM/HEPT. |
|||||
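The HEPT entry above builds on E²LSH combined with OR & AND constructions. Below is a minimal, generic Python sketch of that building block (an illustration, not the HEPT code): each of L tables uses an AND construction (a point must match on all k quantised projections to share a bucket), and a query ORs together the candidates found across the L tables.

```python
import numpy as np

class E2LSH:
    """Minimal E2LSH index with AND (concatenate k hashes) and OR (L tables) constructions."""
    def __init__(self, dim, k=4, L=8, w=2.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(L, k, dim))      # p-stable (Gaussian) projections
        self.b = rng.uniform(0, w, size=(L, k))
        self.w = w
        self.tables = [dict() for _ in range(L)]

    def _keys(self, x):
        h = np.floor((self.a @ x + self.b) / self.w).astype(int)   # (L, k)
        return [tuple(row) for row in h]                           # AND: whole row must match

    def insert(self, idx, x):
        for table, key in zip(self.tables, self._keys(x)):
            table.setdefault(key, []).append(idx)

    def query(self, x):
        cands = set()
        for table, key in zip(self.tables, self._keys(x)):          # OR over the L tables
            cands.update(table.get(key, []))
        return cands

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 32))
index = E2LSH(dim=32, k=4, L=8)
for i, x in enumerate(data):
    index.insert(i, x)
print(len(index.query(data[0])))   # candidate set containing point 0 and nearby points
```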
2024 | Microsoft SPACEV-1B | Microsoft Microsoft | Arxiv | Microsoft SPACEV-1B is a new web search related dataset released by Microsoft Bing for this competition. It consists of document and query vectors encoded by Microsoft SpaceV Superior model to capture generic intent representation. |
|||||
2024 | Improved Space-efficient Approximate Nearest Neighbor Search Using Function Inversion | Mccauley Samuel | Arxiv | Approximate nearest neighbor search (ANN) data structures have widespread applications in machine learning, computational biology, and text processing. The goal of ANN is to preprocess a set S so that, given a query q, we can find a point y whose distance from q approximates the smallest distance from q to any point in S. For most distance functions, the best-known ANN bounds for high-dimensional point sets are obtained using techniques based on locality-sensitive hashing (LSH). Unfortunately, space efficiency is a major challenge for LSH-based data structures. Classic LSH techniques require a very large amount of space, oftentimes polynomial in |S|. A long line of work has developed intricate techniques to reduce this space usage, but these techniques suffer from downsides: they must be hand tailored to each specific LSH, are often complicated, and their space reduction comes at the cost of significantly increased query times. In this paper we explore a new way to improve the space efficiency of LSH using function inversion techniques, originally developed in (Fiat and Naor 2000). We begin by describing how function inversion can be used to improve LSH data structures. This gives a fairly simple, black box method to reduce LSH space usage. Then, we give a data structure that leverages function inversion to improve the query time of the best known near-linear space data structure for approximate nearest neighbor search under Euclidean distance: the ALRW data structure of (Andoni, Laarhoven, Razenshteyn, and Waingarten 2017). ALRW was previously shown to be optimal among “list-of-points” data structures for both Euclidean and Manhattan ANN; thus, in addition to giving improved bounds, our results imply that list-of-points data structures are not optimal for Euclidean or Manhattan ANN. |
|||||
2024 | An Improvement Of Degree-based Hashing (DBH) Graph Partition Method Using A Novel Metric | Mastikhina Anna, Senkevich Oleg, Sirotkin Dmitry, Demin Danila, Moiseev Stanislav | Arxiv | This paper examines the graph partition problem and introduces a new metric, MSIDS (maximal sum of inner degrees squared). We establish its connection to the replication factor (RF) optimization, which has been the main focus of theoretical work in this field. Additionally, we propose a new partition algorithm, DBH-X, based on the DBH partitioner. We demonstrate that DBH-X significantly improves both the RF and MSIDS, compared to the baseline DBH algorithm. In addition, we provide test results that show the runtime acceleration of GraphX-based PageRank and Label propagation algorithms. |
|||||
2024 | Locality Sensitive Hashing For Network Traffic Fingerprinting | Mashnoor Nowfel, Thom Jay, Rouf Abdur, Sengupta Shamik, Charyyev Batyr | Arxiv | The advent of the Internet of Things (IoT) has brought forth additional intricacies and difficulties to computer networks. These gadgets are particularly susceptible to cyber-attacks because of their simplistic design. Therefore, it is crucial to recognise these devices inside a network for the purpose of network administration and to identify any harmful actions. Network traffic fingerprinting is a crucial technique for identifying devices and detecting anomalies. Currently, the predominant methods for this depend heavily on machine learning (ML). Nevertheless, ML methods need the selection of features, adjustment of hyperparameters, and retraining of models to attain optimal outcomes and provide resilience to concept drifts detected in a network. In this research, we suggest using locality-sensitive hashing (LSH) for network traffic fingerprinting as a solution to these difficulties. Our study focuses on examining several design options for the Nilsimsa LSH function. We then use this function to create unique fingerprints for network data, which may be used to identify devices. We also compared it with ML-based traffic fingerprinting and observed that our method improves the state-of-the-art accuracy by 12%, achieving around 94% accuracy in identifying devices in a network. |
|||||
2024 | Fliphash A Constant-time Consistent Range-hashing Algorithm | Masson Charles, Lee Homin K. | Arxiv | Consistent range-hashing is a technique used in distributed systems, either directly or as a subroutine for consistent hashing, commonly to realize an even and stable data distribution over a variable number of resources. We introduce FlipHash, a consistent range-hashing algorithm with constant time complexity and low memory requirements. Like Jump Consistent Hash, FlipHash is intended for applications where resources can be indexed sequentially. Under this condition, it ensures that keys are hashed evenly across resources and that changing the number of resources only causes keys to be remapped from a removed resource or to an added one, but never shuffled across persisted ones. FlipHash differentiates itself with its low computational cost, achieving constant-time complexity. We show that FlipHash beats Jump Consistent Hash’s cost, which is logarithmic in the number of resources, both theoretically and in experiments over practical settings. |
|||||
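The FlipHash abstract above does not spell out its construction, but it benchmarks against Jump Consistent Hash, whose published algorithm (Lamping & Veach) is short enough to quote. The sketch below shows that O(log n)-expected-time baseline, which FlipHash improves to constant time.

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Jump Consistent Hash: maps a 64-bit key to a bucket in [0, num_buckets) so that
    changing the bucket count only remaps the minimal fraction of keys and never shuffles
    keys among persisting buckets. Expected cost is O(log num_buckets)."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF   # 64-bit LCG step
        j = int(float(b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b

print([jump_consistent_hash(k, 10) for k in range(5)])
```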
2024 | Variable Bit Quantisation For LSH | Moran S., Lavrenko, Osborne | Arxiv | We introduce a scheme for optimally allocating a variable number of bits per LSH hyperplane. Previous approaches assign a constant number of bits per hyperplane. This neglects the fact that a subset of hyperplanes may be more informative than others. Our method, dubbed Variable Bit Quantisation (VBQ), provides a data-driven non-uniform bit allocation across hyperplanes. Despite only using a fraction of the available hyperplanes, VBQ outperforms uniform quantisation by up to 168% for retrieval across standard text and image datasets. |
|||||
2024 | Progressive Generative Hashing For Image Retrieval | Ma Yuqing, He, Ding, Hu, Li, Liu | Arxiv | Recent years have witnessed the success of the emerging hashing techniques in large-scale image retrieval. Owing to the great learning capacity, deep hashing has become one of the most promising solutions, and achieved attractive performance in practice. However, without semantic label information, the unsupervised deep hashing still remains an open question. In this paper, we propose a novel progressive generative hashing (PGH) framework to help learn a discriminative hashing network in an unsupervised way. Different from existing studies, it first treats the hash codes as a kind of semantic condition for the similar image generation, and simultaneously feeds the original image and its codes into the generative adversarial networks (GANs). The real images together with the synthetic ones can further help train a discriminative hashing network based on a triplet loss. By iteratively inputting the learnt codes into the hash conditioned GANs, we can progressively enable the hashing network to discover the semantic relations. Extensive experiments on the widely-used image datasets demonstrate that PGH can significantly outperform state-of-the-art unsupervised hashing methods. |
|||||
2024 | HARR Learning Discriminative And High-quality Hash Codes For Image Retrieval | Ma Zeyu, Wang, Luo, Gu, Chen, Li, Hua, Lu | Arxiv | This article studies deep unsupervised hashing, which has attracted increasing attention in large-scale image retrieval. The majority of recent approaches usually reconstruct semantic similarity information, which then guides the hash code learning. However, they still fail to achieve satisfactory performance in reality for two reasons. On the one hand, without accurate supervised information, these methods usually fail to produce independent and robust hash codes with semantics information well preserved, which may hinder effective image retrieval. On the other hand, due to discrete constraints, how to effectively optimize the hashing network in an end-to-end manner with small quantization errors remains a problem. To address these difficulties, we propose a novel unsupervised hashing method called HARR to learn discriminative and high-quality hash codes. To comprehensively explore semantic similarity structure, HARR adopts the Winner-Take-All hash to model the similarity structure. Then similarity-preserving hash codes are learned under the reliable guidance of the reconstructed similarity structure. Additionally, we improve the quality of hash codes by a bit correlation reduction module, which forces the cross-correlation matrix between a batch of hash codes under different augmentations to approach the identity matrix. In this way, the generated hash bits are expected to be invariant to disturbances with minimal redundancy, which can be further interpreted as an instantiation of the information bottleneck principle. Finally, for effective hashing network training, we minimize the cosine distances between real-value network outputs and their binary codes for small quantization errors. Extensive experiments demonstrate the effectiveness of our proposed HARR. |
|||||
2024 | Spectral Toolkit Of Algorithms For Graphs Technical Report (2) | Macgregor Peter, Sun He | Arxiv | Spectral Toolkit of Algorithms for Graphs (STAG) is an open-source library for efficient graph algorithms. This technical report presents the newly implemented component on locality sensitive hashing, kernel density estimation, and fast spectral clustering. The report includes a user’s guide to the newly implemented algorithms, experiments and demonstrations of the new functionality, and several technical considerations behind our development. |
|||||
2024 | Fast Scalable Supervised Hashing | Luo Xin, Nie, He, Wu, Chen, Xu | Arxiv | Despite significant progress in supervised hashing, there are three common limitations of existing methods. First, most pioneer methods discretely learn hash codes bit by bit, making the learning procedure rather time-consuming. Second, to reduce the large complexity of the n by n pairwise similarity matrix, most methods apply sampling strategies during training, which inevitably results in information loss and suboptimal performance; some recent methods try to replace the large matrix with a smaller one, but the size is still large. Third, among the methods that leverage the pairwise similarity matrix, most of them only encode the semantic label information in learning the hash codes, failing to fully capture the characteristics of data. In this paper, we present a novel supervised hashing method, called Fast Scalable Supervised Hashing (FSSH), which circumvents the use of the large similarity matrix by introducing a pre-computed intermediate term whose size is independent of the size of the training data. Moreover, FSSH can learn the hash codes with not only the semantic information but also the features of data. Extensive experiments on three widely used datasets demonstrate its superiority over several state-of-the-art methods in both accuracy and scalability. Our experiment codes are available at: https://lcbwlx.wixsite.com/fssh. |
|||||
2024 | Fine-grained Embedding Dimension Optimization During Training For Recommender Systems | Luo Qinyi, Wang Penghan, Zhang Wei, Lai Fan, Mao Jiachen, Wei Xiaohan, Song Jun, Tsai Wei-yu, Yang Shuai, Hu Yuxi, Qian Xuehai | Arxiv | Huge embedding tables in modern Deep Learning Recommender Models (DLRM) require prohibitively large memory during training and inference. Aiming to reduce the memory footprint of training, this paper proposes FIne-grained In-Training Embedding Dimension optimization (FIITED). Given the observation that embedding vectors are not equally important, FIITED adjusts the dimension of each individual embedding vector continuously during training, assigning longer dimensions to more important embeddings while adapting to dynamic changes in data. A novel embedding storage system based on virtually-hashed physically-indexed hash tables is designed to efficiently implement the embedding dimension adjustment and effectively enable memory saving. Experiments on two industry models show that FIITED is able to reduce the size of embeddings by more than 65% while maintaining the trained model’s quality, saving significantly more memory than a state-of-the-art in-training embedding pruning method. On public click-through rate prediction datasets, FIITED is able to prune up to 93.75%-99.75% embeddings without significant accuracy loss. |
|||||
2024 | Online Multi-modal Hashing With Dynamic Query-adaption | Lu Xu, Zhu, Cheng, Zhang | Arxiv | Multi-modal hashing is an effective technique to support large-scale multimedia retrieval, due to its capability of encoding heterogeneous multi-modal features into compact and similarity-preserving binary codes. Although great progress has been achieved so far, existing methods still suffer from several problems, including: 1) All existing methods simply adopt fixed modality combination weights in online hashing process to generate the query hash codes. This strategy cannot adaptively capture the variations of different queries. 2) They either suffer from insufficient semantics (for unsupervised methods) or require high computation and storage cost (for the supervised methods, which rely on pair-wise semantic matrix). 3) They solve the hash codes with relaxed optimization strategy or bit-by-bit discrete optimization, which results in significant quantization loss or consumes considerable computation time. To address the above limitations, in this paper, we propose an Online Multi-modal Hashing with Dynamic Query-adaption (OMH-DQ) method in a novel fashion. Specifically, a self-weighted fusion strategy is designed to adaptively preserve the multi-modal feature information into hash codes by exploiting their complementarity. The hash codes are learned with the supervision of pair-wise semantic labels to enhance their discriminative capability, while avoiding the challenging symmetric similarity matrix factorization. Under such learning framework, the binary hash codes can be directly obtained with efficient operations and without quantization errors. Accordingly, our method can benefit from the semantic labels, and simultaneously, avoid the high computation complexity. Moreover, to accurately capture the query variations, at the online retrieval stage, we design a parameter-free online hashing module which can adaptively learn the query hash codes according to the dynamic query contents. Extensive experiments demonstrate the state-of-the-art performance of the proposed approach from various aspects. |
|||||
2024 | Label Self-adaption Hashing For Image Retrieval | Lu Jianglin, Lai, Wang, Zhou | Arxiv | Hashing has attracted widespread attention in image retrieval because of its fast retrieval speed and low storage cost. Compared with supervised methods, unsupervised hashing methods are more reasonable and suitable for large-scale image retrieval since it is always difficult and expensive to collect true labels of the massive data. Without label information, however, unsupervised hashing methods can not guarantee the quality of learned binary codes. To resolve this dilemma, this paper proposes a novel unsupervised hashing method called Label Self-Adaption Hashing (LSAH), which contains effective hashing function learning part and self-adaption label generation part. In the first part, we utilize anchor graph to keep the local structure of the data and introduce joint sparsity into the model to extract effective features for high-quality binary code learning. In the second part, a self-adaptive cluster label matrix is learned from the data under the assumption that the nearest neighbor points should have a large probability to be in the same cluster. Therefore, the proposed LSAH can make full use of the potential discriminative information of the data to guide the learning of binary code. It is worth noting that LSAH can learn effective binary codes, hashing function and cluster labels simultaneously in a unified optimization framework. To solve the resulting optimization problem, an Augmented Lagrange Multiplier based iterative algorithm is elaborately designed. Extensive experiments on three large-scale data sets indicate the promising performance of the proposed LSAH. |
|||||
2024 | A Survey On Deep Hashing Methods | Luo Xiao, Wang, Wu, Chen, Deng, Huang, Hua | Arxiv | Nearest neighbor search aims at obtaining the samples in the database with the smallest distances from them to the queries, which is a basic task in a range of fields, including computer vision and data mining. Hashing is one of the most widely used methods for its computational and storage efficiency. With the development of deep learning, deep hashing methods show more advantages than traditional methods. In this survey, we investigate current deep hashing algorithms in detail, including deep supervised hashing and deep unsupervised hashing. Specifically, we categorize deep supervised hashing methods into pairwise methods, ranking-based methods, pointwise methods as well as quantization, according to how the similarities of the learned hash codes are measured. Moreover, deep unsupervised hashing is categorized into similarity reconstruction-based methods, pseudo-label-based methods, and prediction-free self-supervised learning-based methods based on their semantic learning manners. We also introduce three related important topics including semi-supervised deep hashing, domain adaptation deep hashing, and multi-modal deep hashing. Meanwhile, we present some commonly used public datasets and the scheme to measure the performance of deep hashing algorithms. Finally, we discuss some potential research directions in conclusion. |
|||||
2024 | Deep Domain Adaptation Hashing With Adversarial Learning | Long Fuchen, Yao, Dai, Tian, Luo, Mei | Arxiv | The recent advances in deep neural networks have demonstrated high capability in a wide variety of scenarios. Nevertheless, fine-tuning deep models in a new domain still requires a significant amount of labeled data despite expensive labeling efforts. A valid question is how to leverage the source knowledge plus unlabeled or only sparsely labeled target data for learning a new model in target domain. The core problem is to bring the source and target distributions closer in the feature space. In the paper, we facilitate this issue in an adversarial learning framework, in which a domain discriminator is devised to handle domain shift. Particularly, we explore the learning in the context of hashing problem, which has been studied extensively due to its great efficiency in gigantic data. Specifically, a novel Deep Domain Adaptation Hashing with Adversarial learning (DeDAHA) architecture is presented, which mainly consists of three components: a deep convolutional neural networks (CNN) for learning basic image/frame representation followed by an adversary stream on one hand to optimize the domain discriminator, and on the other, to interact with each domain-specific hashing stream for encoding image representation to hash codes. The whole architecture is trained end-to-end by jointly optimizing two types of losses, i.e., triplet ranking loss to preserve the relative similarity ordering in the input triplets and adversarial loss to maximally fool the domain discriminator with the learnt source and target feature distributions. Extensive experiments are conducted on three domain transfer tasks, including cross-domain digits retrieval, image to image and image to video transfers, on several benchmarks. Our DeDAHA framework achieves superior results when compared to the state-of-the-art techniques. |
|||||
2024 | Supervised Hashing With Kernels | Liu W., Wang, Ji, Jiang, Chang | Arxiv | Recent years have witnessed the growing popularity of hashing in large-scale vision problems. It has been shown that the hashing quality could be boosted by leveraging supervised information into hash function learning. However, the existing supervised methods either lack adequate performance or often incur cumbersome model training. In this paper, we propose a novel kernel-based supervised hashing model which requires a limited amount of supervised information, i.e., similar and dissimilar data pairs, and a feasible training cost in achieving high quality hashing. The idea is to map the data to compact binary codes whose Hamming distances are minimized on similar pairs and simultaneously maximized on dissimilar pairs. Our approach is distinct from prior works by utilizing the equivalence between optimizing the code inner products and the Hamming distances. This enables us to sequentially and efficiently train the hash functions one bit at a time, yielding very short yet discriminative codes. We carry out extensive experiments on two image benchmarks with up to one million samples, demonstrating that our approach significantly outperforms the state-of-the-arts in searching both metric distance neighbors and semantically similar neighbors, with accuracy gains ranging from 13% to 46%. |
|||||
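The KSH entry above hinges on the equivalence between optimizing code inner products and Hamming distances: for codes in {-1, +1}^r, D_H(x, y) = (r - <x, y>) / 2. The tiny check below verifies this relation numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
r = 48
x = rng.choice([-1, 1], size=r)
y = rng.choice([-1, 1], size=r)

hamming = np.sum(x != y)
inner = x @ y
assert hamming == (r - inner) // 2   # D_H(x, y) = (r - <x, y>) / 2 for codes in {-1, +1}^r
print(hamming, inner)
```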
2024 | Model Optimization Boosting Framework For Linear Model Hash Learning | Liu Xingbo, Nie, Zhou, Nie, Yin | Arxiv | Efficient hashing techniques have attracted extensive research interests in both storage and retrieval of high dimensional data, such as images and videos. In existing hashing methods, a linear model is commonly utilized owing to its efficiency. To obtain better accuracy, linear-based hashing methods focus on designing a generalized linear objective function with different constraints or penalty terms that consider the inherent characteristics and neighborhood information of samples. Differing from existing hashing methods, in this study, we propose a self-improvement framework called Model Boost (MoBoost) to improve model parameter optimization for linear-based hashing methods without adding new constraints or penalty terms. In the proposed MoBoost, for a linear-based hashing method, we first repeatedly execute the hashing method to obtain several hash codes to training samples. Then, utilizing two novel fusion strategies, these codes are fused into a single set. We also propose two new criteria to evaluate the goodness of hash bits during the fusion process. Based on the fused set of hash codes, we learn new parameters for the linear hash function that can significantly improve the accuracy. In general, the proposed MoBoost can be adopted by existing linear-based hashing methods, achieving more precise and stable performance compared to the original methods, and adopting the proposed MoBoost will incur negligible time and space costs. To evaluate the proposed MoBoost, we performed extensive experiments on four benchmark datasets, and the results demonstrate superior performance. |
|||||
2024 | Moboost A Self-improvement Framework For Linear-based Hashing | Liu Xingbo, Nie, Xi, Zhu, Yin | Arxiv | The linear model is commonly utilized in hashing methods owing to its efficiency. To obtain better accuracy, linear-based hashing methods focus on designing a generalized linear objective function with different constraints or penalty terms that consider neighborhood information. In this study, we propose a novel generalized framework called Model Boost (MoBoost), which can achieve the self-improvement of the linear-based hashing. The proposed MoBoost is used to improve model parameter optimization for linear-based hashing methods without adding new constraints or penalty terms. In the proposed MoBoost, given a linear-based hashing method, we first execute the method several times to get several different hash codes for training samples, and then combine these different hash codes into one set utilizing one novel fusion strategy. Based on this set of hash codes, we learn some new parameters for the linear hash function that can significantly improve accuracy. The proposed MoBoost can be generally adopted in existing linear-based hashing methods, achieving more precise and stable performance compared to the original methods while imposing negligible added expenditure in terms of time and space. Extensive experiments are performed based on three benchmark datasets, and the results demonstrate the superior performance of the proposed framework. |
|||||
2024 | Multi-view Complementary Hash Tables For Nearest Neighbor Search | Liu Xianglong, Huang, Deng, Lang | Arxiv | Recent years have witnessed the success of hashing techniques in fast nearest neighbor search. In practice many applications (e.g., visual search, object detection, image matching, etc.) have enjoyed the benefits of complementary hash tables and information fusion over multiple views. However, most of prior research mainly focused on compact hash code cleaning, and rare work studies how to build multiple complementary hash tables, much less to adaptively integrate information stemming from multiple views. In this paper we first present a novel multi-view complementary hash table method that learns complementary hash tables from the data with multiple views. For single multiview table, using exemplar based feature fusion, we approximate the inherent data similarities with a low-rank matrix, and learn discriminative hash functions in an efficient way. To build complementary tables and meanwhile maintain scalable training and fast out-of-sample extension, an exemplar reweighting scheme is introduced to update the induced low-rank similarity in the sequential table construction framework, which indeed brings mutual benefits between tables by placing greater importance on exemplars shared by mis-separated neighbors. Extensive experiments on three large-scale image datasets demonstrate that the proposed method significantly outperforms various naive solutions and state-of-the-art multi-table methods. |
|||||
2024 | Hash Bit Selection A Unified Solution For Selection Problems In Hashing | Liu X., He, Lang, Chang | Arxiv | Hashing based methods recently have been shown promising for large-scale nearest neighbor search. However, good designs involve difficult decisions of many unknowns – data features, hashing algorithms, parameter settings, kernels, etc. In this paper, we provide a unified solution as hash bit selection, i.e., selecting the most informative hash bits from a pool of candidates that may have been generated under various conditions mentioned above. We represent the candidate bit pool as a vertex- and edge-weighted graph with the pooled bits as vertices. Then we formulate the bit selection problem as quadratic programming over the graph, and solve it efficiently by replicator dynamics. Extensive experiments show that our bit selection approach can achieve superior performance over both naive selection methods and state-of-the-art methods under each scenario, usually with significant accuracy gains from 10% to 50% relatively. |
|||||
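The bit-selection entry above formulates selection as quadratic programming over a bit-affinity graph and solves it with replicator dynamics. The sketch below shows only the generic replicator update on a synthetic affinity matrix; how the vertex and edge weights are actually built from the candidate bits is specific to the paper and is not reproduced here.

```python
import numpy as np

def select_bits(A, n_bits, iters=200):
    """Replicator-dynamics sketch for picking informative bits.

    A: symmetric, non-negative (n, n) affinity between candidate bits
       (vertex weights can be folded into the diagonal)."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)                 # start at the barycentre of the simplex
    for _ in range(iters):
        Ax = A @ x
        x = x * Ax / (x @ Ax)               # update keeps x on the probability simplex
    return np.argsort(x)[::-1][:n_bits]     # bits with the largest support

rng = np.random.default_rng(0)
B = rng.uniform(size=(50, 50))
A = (B + B.T) / 2                            # synthetic symmetric, non-negative affinity
print(select_bits(A, n_bits=16))
```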
2024 | Discretely Coding Semantic Rank Orders For Supervised Image Hashing | Liu Li, Shao, Shen, Yu | Arxiv | Learning to hash has been recognized to accomplish highly efficient storage and retrieval for large-scale visual data. Particularly, ranking-based hashing techniques have recently attracted broad research attention because ranking accuracy among the retrieved data is well explored and their objective is more applicable to realistic search tasks. However, directly optimizing discrete hash codes without continuous-relaxations on a nonlinear ranking objective is infeasible by either traditional optimization methods or even recent discrete hashing algorithms. To address this challenging issue, in this paper, we introduce a novel supervised hashing method, dubbed Discrete Semantic Ranking Hashing (DSeRH), which aims to directly embed semantic rank orders into binary codes. In DSeRH, a generalized Adaptive Discrete Minimization (ADM) approach is proposed to discretely optimize binary codes with the quadratic nonlinear ranking objective in an iterative manner and is guaranteed to converge quickly. Additionally, instead of using 0/1 independent labels to form rank orders as in previous works, we generate the listwise rank orders from the high-level semantic word embeddings which can quantitatively capture the intrinsic correlation between different categories. We evaluate our DSeRH, coupled with both linear and deep convolutional neural network (CNN) hash functions, on three image datasets, i.e., CIFAR-10, SUN397 and ImageNet100, and the results manifest that DSeRH can outperform the state-of-the-art ranking-based hashing methods. |
|||||
2024 | Hashing With Graphs | Liu W., Wang, Kumar, Chang | Arxiv | Hashing is becoming increasingly popular for efficient nearest neighbor search in massive databases. However, learning short codes that yield good search performance is still a challenge. Moreover, in many cases real-world data lives on a low-dimensional manifold, which should be taken into account to capture meaningful nearest neighbors. In this paper, we propose a novel graph-based hashing method which automatically discovers the neighborhood structure inherent in the data to learn appropriate compact codes. To make such an approach computationally feasible, we utilize Anchor Graphs to obtain tractable low-rank adjacency matrices. Our formulation allows constant time hashing of a new data point by extrapolating graph Laplacian eigenvectors to eigenfunctions. Finally, we describe a hierarchical threshold learning procedure in which each eigenfunction yields multiple bits, leading to higher search accuracy. Experimental comparison with the other state-of-the-art methods on two large datasets demonstrates the efficacy of the proposed method. |
|||||
2024 | Collaborative Hashing | Liu X., He, Deng, Lang | Arxiv | Hashing technique has become a promising approach for fast similarity search. Most of existing hashing research pursue the binary codes for the same type of entities by preserving their similarities. In practice, there are many scenarios involving nearest neighbor search on the data given in matrix form, where two different types of, yet naturally associated entities respectively correspond to its two dimensions or views. To fully explore the duality between the two views, we propose a collaborative hashing scheme for the data in matrix form to enable fast search in various applications such as image search using bag of words and recommendation using user-item ratings. By simultaneously preserving both the entity similarities in each view and the interrelationship between views, our collaborative hashing effectively learns the compact binary codes and the explicit hash functions for out-of-sample extension in an alternating optimization way. Extensive evaluations are conducted on three well-known datasets for search inside a single view and search across different views, demonstrating that our proposed method outperforms state-of-the-art baselines, with significant accuracy gains ranging from 7.67% to 45.87% relatively. |
|||||
2024 | Discrete Graph Hashing | Liu W., Mu, Kumar, Chang | Arxiv | Hashing has emerged as a popular technique for fast nearest neighbor search in gigantic databases. In particular, learning based hashing has received considerable attention due to its appealing storage and search efficiency. However, the performance of most unsupervised learning based hashing methods deteriorates rapidly as the hash code length increases. We argue that the degraded performance is due to inferior optimization procedures used to achieve discrete binary codes. This paper presents a graph-based unsupervised hashing model to preserve the neighborhood structure of massive data in a discrete code space. We cast the graph hashing problem into a discrete optimization framework which directly learns the binary codes. A tractable alternating maximization algorithm is then proposed to explicitly deal with the discrete constraints, yielding high-quality codes to well capture the local neighborhoods. Extensive experiments performed on four large datasets with up to one million samples show that our discrete optimization based graph hashing method obtains superior search accuracy over state-of-the-art unsupervised hashing methods, especially for longer codes. |
|||||
2024 | Joint-modal Distribution-based Similarity Hashing For Large-scale Unsupervised Deep Cross-modal Retrieval | Liu Song, Qian, Guan, Zhan, Ying | Arxiv | Hashing-based cross-modal search which aims to map multiple modality features into binary codes has attracted increasing attention due to its storage and search efficiency especially in large-scale database retrieval. Recent unsupervised deep cross-modal hashing methods have shown promising results. However, existing approaches typically suffer from two limitations: (1) They usually learn cross-modal similarity information separately or in a redundant fusion manner, which may fail to capture semantic correlations among instances from different modalities sufficiently and effectively. (2) They seldom consider the sampling and weighting schemes for unsupervised cross-modal hashing, resulting in the lack of satisfactory discriminative ability in hash codes. To overcome these limitations, we propose a novel unsupervised deep cross-modal hashing method called Joint-modal Distribution-based Similarity Hashing (JDSH) for large-scale cross-modal retrieval. Firstly, we propose a novel cross-modal joint-training method by constructing a joint-modal similarity matrix to fully preserve the cross-modal semantic correlations among instances. Secondly, we propose a sampling and weighting scheme termed the Distribution-based Similarity Decision and Weighting (DSDW) method for unsupervised cross-modal hashing, which is able to generate more discriminative hash codes by pushing semantic similar instance pairs closer and pulling semantic dissimilar instance pairs apart. The experimental results demonstrate the superiority of JDSH compared with several unsupervised cross-modal hashing methods on two public datasets NUS-WIDE and MIRFlickr. |
|||||
2024 | Multi-probe LSH Efficient Indexing For High-dimensional Similarity Search | Lv Q., Josephson, Wang, Charikar, Li | Arxiv | Similarity indices for high-dimensional data are very desirable for building content-based search systems for feature-rich data such as audio, images, videos, and other sensor data. Recently, locality sensitive hashing (LSH) and its variations have been proposed as indexing techniques for approximate similarity search. A significant drawback of these approaches is the requirement for a large number of hash tables in order to achieve good search quality. This paper proposes a new indexing scheme called multi-probe LSH that overcomes this drawback. Multi-probe LSH is built on the well-known LSH technique, but it intelligently probes multiple buckets that are likely to contain query results in a hash table. Our method is inspired by and improves upon recent theoretical work on entropy-based LSH designed to reduce the space requirement of the basic LSH method. We have implemented the multi-probe LSH method and evaluated the implementation with two different high-dimensional datasets. Our evaluation shows that the multi-probe LSH method substantially improves upon previously proposed methods in both space and time efficiency. To achieve the same search quality, multi-probe LSH has similar time efficiency to the basic LSH method while reducing the number of hash tables by an order of magnitude. In comparison with the entropy-based LSH method, to achieve the same search quality, multi-probe LSH uses less query time and 5 to 8 times fewer hash tables. |
|||||
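Multi-probe LSH, summarized above, looks beyond the exact bucket of a query. The helper below is a deliberately simplified stand-in (the paper derives a query-directed probing sequence from perturbation scores): it enumerates the exact key plus keys that differ by ±1 in a single coordinate, which is the kind of neighbouring bucket the method probes.

```python
def probe_keys(base_key, n_probes):
    """Yield up to n_probes bucket keys: the exact quantised LSH key first, then keys
    obtained by perturbing one coordinate by +/-1 (a crude probing sequence)."""
    yield tuple(base_key)
    emitted = 1
    for i in range(len(base_key)):
        for delta in (-1, 1):
            if emitted >= n_probes:
                return
            key = list(base_key)
            key[i] += delta
            yield tuple(key)
            emitted += 1

# With 10 probes, a query checks its own bucket plus 9 adjacent ones in the same table.
print(list(probe_keys((3, -1, 7), n_probes=10)))
```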
2024 | Deep Variational And Structural Hashing | Liong Venice, Lu, Duan, Tan | Arxiv | In this paper, we propose a deep variational and structural hashing (DVStH) method to learn compact binary codes for multimedia retrieval. Unlike most existing deep hashing methods which use a series of convolution and fully-connected layers to learn binary features, we develop a probabilistic framework to infer latent feature representation inside the network. Then, we design a struct layer rather than a bottleneck hash layer, to obtain binary codes through a simple encoding procedure. By doing these, we are able to obtain binary codes discriminatively and generatively. To make it applicable to cross-modal scalable multimedia retrieval, we extend our method to a cross-modal deep variational and structural hashing (CM-DVStH). We design a deep fusion network with a struct layer to maximize the correlation between image-text input pairs during the training stage so that a unified binary vector can be obtained. We then design modality-specific hashing networks to handle the out-of-sample extension scenario. Specifically, we train a network for each modality which outputs a latent representation that is as close as possible to the binary codes which are inferred from the fusion network. Experimental results on five benchmark datasets are presented to show the efficacy of the proposed approach. |
|||||
2024 | Cross-modal Deep Variational Hashing | Liong Venice, Lu, Tan, Zhou | Arxiv | In this paper, we propose a cross-modal deep variational hashing (CMDVH) method for cross-modality multimedia retrieval. Unlike existing cross-modal hashing methods which learn a single pair of projections to map each example as a binary vector, we design a pair of deep neural networks to learn non-linear transformations from image-text input pairs, so that unified binary codes can be obtained. We then design the modality-specific neural networks in a probabilistic manner where we model a latent variable to be as close as possible to the inferred binary codes, which is approximated by a posterior distribution regularized by a known prior. Experimental results on three benchmark datasets show the efficacy of the proposed approach. |
|||||
2024 | Semantics-preserving Hashing For Cross-view Retrieval | Lin Zijia, Ding, Wang | Arxiv | With benefits of low storage costs and high query speeds, hashing methods are widely researched for efficiently retrieving large-scale data, which commonly contains multiple views, e.g. a news report with images, videos and texts. In this paper, we study the problem of cross-view retrieval and propose an effective Semantics-Preserving Hashing method, termed SePH. Given semantic affinities of training data as supervised information, SePH transforms them into a probability distribution and approximates it with to-be-learnt hash codes in Hamming space via minimizing the Kullback-Leibler divergence. Then kernel logistic regression with a sampling strategy is utilized to learn the nonlinear projections from features in each view to the learnt hash codes. And for any unseen instance, predicted hash codes and their corresponding output probabilities from observed views are utilized to determine its unified hash code, using a novel probabilistic approach. Extensive experiments conducted on three benchmark datasets well demonstrate the effectiveness and reasonableness of SePH. |
|||||
2024 | Deep Learning Of Binary Hash Codes For Fast Image Retrieval | Lin Kevin, Yang, Hsiao, Chen | Arxiv | Approximate nearest neighbor search is an efficient strategy for large-scale image retrieval. Encouraged by the recent advances in convolutional neural networks (CNNs), we propose an effective deep learning framework to generate binary hash codes for fast image retrieval. Our idea is that when the data labels are available, binary codes can be learned by employing a hidden layer for representing the latent concepts that dominate the class labels. The utilization of the CNN also allows for learning image representations. Unlike other supervised methods that require pair-wise inputs for binary code learning, our method learns hash codes and image representations in a point-wise manner, making it suitable for large-scale datasets. Experimental results show that our method outperforms several state-of-the-art hashing algorithms on the CIFAR-10 and MNIST datasets. We further demonstrate its scalability and efficacy on a large-scale dataset of 1 million clothing images. |
|||||
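The entry above learns codes point-wise by inserting a latent layer between the CNN features and the classifier and binarising its activations for retrieval. A minimal PyTorch-style sketch of such a head is shown below, with assumed dimensions and names; it illustrates the scheme described in the abstract rather than reproducing the authors' implementation.

```python
import torch
import torch.nn as nn

class LatentHashHead(nn.Module):
    """Hidden 'latent layer' between a CNN feature extractor and the classifier;
    its sigmoid activations are thresholded at 0.5 to give binary retrieval codes."""
    def __init__(self, feat_dim=4096, n_bits=48, n_classes=10):
        super().__init__()
        self.latent = nn.Sequential(nn.Linear(feat_dim, n_bits), nn.Sigmoid())
        self.classifier = nn.Linear(n_bits, n_classes)

    def forward(self, feats):
        h = self.latent(feats)                               # continuous codes in (0, 1)
        return self.classifier(h), (h > 0.5).to(torch.uint8)  # class logits, binary code

head = LatentHashHead()
logits, codes = head(torch.randn(4, 4096))
print(codes.shape)   # torch.Size([4, 48])
```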
2024 | Fast Supervised Hashing With Decision Trees For High-dimensional Data | Lin Guosheng, Shen, Shi, Hengel, Suter | Arxiv | Supervised hashing aims to map the original features to compact binary codes that are able to preserve label based similarity in the Hamming space. Non-linear hash functions have demonstrated their advantage over linear ones due to their powerful generalization capability. In the literature, kernel functions are typically used to achieve non-linearity in hashing, which achieve encouraging retrieval performance at the price of slow evaluation and training time. Here we propose to use boosted decision trees for achieving non-linearity in hashing, which are fast to train and evaluate, hence more suitable for hashing with high dimensional data. In our approach, we first propose sub-modular formulations for the hashing binary code inference problem and an efficient GraphCut based block search method for solving large-scale inference. Then we learn hash functions by training boosted decision trees to fit the binary codes. Experiments demonstrate that our proposed method significantly outperforms most state-of-the-art methods in retrieval precision and training time. Especially for high-dimensional data, our method is orders of magnitude faster than many methods in terms of training time. |
|||||
2024 | A General Two-step Approach To Learning-based Hashing | Lin G., Shen, Suter, Hengel | Arxiv | Most existing approaches to hashing apply a single form of hash function, and an optimization process which is typically deeply coupled to this specific form. This tight coupling restricts the flexibility of the method to respond to the data, and can result in complex optimization problems that are difficult to solve. Here we propose a flexible yet simple framework that is able to accommodate different types of loss functions and hash functions. This framework allows a number of existing approaches to hashing to be placed in context, and simplifies the development of new problem-specific hashing methods. Our framework decomposes hashing learning problem into two steps: hash bit learning and hash function learning based on the learned bits. The first step can typically be formulated as binary quadratic problems, and the second step can be accomplished by training standard binary classifiers. Both problems have been extensively studied in the literature. Our extensive experiments demonstrate that the proposed framework is effective, flexible and outperforms the state-of-the-art. |
|||||
2024 | Microsoft COCO Common Objects In Context | Lin Tsung-yi, Maire, Belongie, Bourdev, Girshick, Hays, Perona, Ramanan, Zitnick, Dollar | Arxiv | We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model. |
|||||
2024 | Push For Quantization Deep Fisher Hashing | Li Yunqiang, Pei, Zha, Gemert | Arxiv | Current massive datasets demand light-weight access for analysis. Discrete hashing methods are thus beneficial because they map high-dimensional data to compact binary codes that are efficient to store and process, while preserving semantic similarity. To optimize powerful deep learning methods for image hashing, gradient-based methods are required. Binary codes, however, are discrete and thus have no continuous derivatives. Relaxing the problem by solving it in a continuous space and then quantizing the solution is not guaranteed to yield separable binary codes. The quantization needs to be included in the optimization. In this paper we push for quantization: We optimize maximum class separability in the binary space. We introduce a margin on distances between dissimilar image pairs as measured in the binary space. In addition to pair-wise distances, we draw inspiration from Fisher’s Linear Discriminant Analysis (Fisher LDA) to maximize the binary distances between classes and at the same time minimize the binary distance of images within the same class. Experiments on CIFAR-10, NUS-WIDE and ImageNet100 demonstrate compact codes comparing favorably to the current state of the art. |
|||||
2024 | Mixed-precision Embeddings For Large-scale Recommendation Models | Li Shiwei, Hu Zhuoqi, Lyu Fuyuan, Tang Xing, Wang Haozhao, Xu Shijie, Luo Weihong, Li Yuhua, Liu Xue, He Xiuqiang, Li Ruixuan | Arxiv | Embedding techniques have become essential components of large databases in the deep learning era. By encoding discrete entities, such as words, items, or graph nodes, into continuous vector spaces, embeddings facilitate more efficient storage, retrieval, and processing in large databases. Especially in the domain of recommender systems, millions of categorical features are encoded as unique embedding vectors, which facilitates the modeling of similarities and interactions among features. However, numerous embedding vectors can result in significant storage overhead. In this paper, we aim to compress the embedding table through quantization techniques. Given that features vary in importance levels, we seek to identify an appropriate precision for each feature to balance model accuracy and memory usage. To this end, we propose a novel embedding compression method, termed Mixed-Precision Embeddings (MPE). Specifically, to reduce the size of the search space, we first group features by frequency and then search precision for each feature group. MPE further learns the probability distribution over precision levels for each feature group, which can be used to identify the most suitable precision with a specially designed sampling strategy. Extensive experiments on three public datasets demonstrate that MPE significantly outperforms existing embedding compression methods. Remarkably, MPE achieves about 200x compression on the Criteo dataset without compromising the prediction accuracy. |
|||||
2024 | Learning Hash Functions Using Column Generation | Li X., Lin, Shen, Hengel, Dick | Arxiv | Fast nearest neighbor searching is becoming an increasingly important tool in solving many large-scale problems. Recently a number of approaches to learning data-dependent hash functions have been developed. In this work, we propose a column generation based method for learning data-dependent hash functions on the basis of proximity comparison information. Given a set of triplets that encode the pairwise proximity comparison information, our method learns hash functions that preserve the relative comparison relationships in the data as well as possible within the large-margin learning framework. The learning procedure is implemented using column generation and hence is named CGHash. At each iteration of the column generation procedure, the best hash function is selected. Unlike most other hashing methods, our method generalizes to new data points naturally and has a training objective which is convex, thus ensuring that the global optimum can be identified. Experiments demonstrate that the proposed method learns compact binary codes and that its retrieval performance compares favorably with state-of-the-art methods when tested on a few benchmark datasets. |
|||||
2024 | Neighborhood Preserving Hashing For Scalable Video Retrieval | Li Shuyan, Chen, Lu, Li, Zhou | Arxiv | In this paper, we propose a Neighborhood Preserving Hashing (NPH) method for scalable video retrieval in an unsupervised manner. Unlike most existing deep video hashing methods which indiscriminately compress an entire video into a binary code, we embed the spatial-temporal neighborhood information into the encoding network such that the neighborhood-relevant visual content of a video can be preferentially encoded into a binary code under the guidance of the neighborhood information. Specifically, we propose a neighborhood attention mechanism which focuses on partial useful content of each input frame conditioned on the neighborhood information. We then integrate the neighborhood attention mechanism into an RNN-based reconstruction scheme to encourage the binary codes to capture the spatial-temporal structure in a video which is consistent with that in the neighborhood. As a consequence, the learned hashing functions can map similar videos to similar binary codes. Extensive experiments on three widely-used benchmark datasets validate the effectiveness of our proposed approach. |
|||||
2024 | Self-supervised Video Hashing Via Bidirectional Transformers | Li Shuyan, Li, Lu, Zhou | Arxiv | Most existing unsupervised video hashing methods are built on unidirectional models with less reliable training objectives, which underuse the correlations among frames and the similarity structure between videos. To enable efficient scalable video retrieval, we propose a self-supervised video Hashing method based on Bidirectional Transformers (BTH). Based on the encoder-decoder structure of transformers, we design a visual cloze task to fully exploit the bidirectional correlations between frames. To unveil the similarity structure between unlabeled video data, we further develop a similarity reconstruction task by establishing reliable and effective similarity connections in the video space. Furthermore, we develop a cluster assignment task to exploit the structural statistics of the whole dataset such that more discriminative binary codes can be learned. Extensive experiments implemented on three public benchmark datasets, FCVID, ActivityNet and YFCC, demonstrate the superiority of our proposed approach. |
|||||
2024 | 0-bit Consistent Weighted Sampling | Li P. | Arxiv | We develop 0-bit consistent weighted sampling (CWS) for efficiently estimating min-max kernel, which is a generalization of the resemblance kernel originally designed for binary data. Because the estimator of 0-bit CWS constitutes a positive definite kernel, this method can be naturally applied to large-scale data mining problems. Basically, if we feed the sampled data from 0-bit CWS to a highly efficient linear classifier (e.g., linear SVM), we effectively (and approximately) train a nonlinear classifier based on the min-max kernel. The accuracy improves as we increase the sample size. In this paper, we first demonstrate, through an extensive classification study using kernel machines, that the min-max kernel often provides an effective measure of similarity for nonnegative data. This helps justify the use of min-max kernel. However, as the min-max kernel is nonlinear and might be difficult to be used for industrial applications with massive data, we propose to linearize the min-max kernel via 0-bit CWS, a simplification of the original CWS method. The previous remarkable work on consistent weighted sampling (CWS) produces samples in the form of (i, t) where the i* records the location (and in fact also the weights) information analogous to the samples produced by classical minwise hashing on binary data. Because the t* is theoretically unbounded, it was not immediately clear how to effectively implement CWS for building large-scale linear classifiers. We provide a simple solution by discarding t* (which we refer to as the “0-bit” scheme). Via an extensive empirical study, we show that this 0-bit scheme does not lose essential information. We then apply 0-bit CWS for building linear classifiers to approximate min-max kernel classifiers, as extensively validated on a wide range of public datasets. We expect this work will generate interests among data mining practitioners who would like to efficiently utilize the nonlinear information of non-binary and nonnegative data. |
|||||
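The 0-bit CWS entry above drops the unbounded t* component of each consistent weighted sample and keeps only the index i*. The sketch below follows the standard Ioffe-style CWS recipe (Gamma and uniform variates drawn per dimension) under that simplification; the collision rate between two sketches then approximates the min-max similarity of the underlying non-negative vectors.

```python
import numpy as np

def zero_bit_cws(x, n_samples=64, seed=0):
    """0-bit consistent weighted sampling: Ioffe-style CWS, keeping only i* per sample."""
    x = np.asarray(x, dtype=float)
    d = x.size
    rng = np.random.default_rng(seed)          # shared seed = shared hash functions across vectors
    sketch = np.empty(n_samples, dtype=int)
    for s in range(n_samples):
        r = rng.gamma(2.0, 1.0, size=d)        # per-dimension randomness, aligned across vectors
        c = rng.gamma(2.0, 1.0, size=d)
        beta = rng.uniform(0.0, 1.0, size=d)
        active = np.flatnonzero(x > 0)
        t = np.floor(np.log(x[active]) / r[active] + beta[active])
        y = np.exp(r[active] * (t - beta[active]))
        a = c[active] / (y * np.exp(r[active]))
        sketch[s] = int(active[np.argmin(a)])  # keep the location i*, discard t*
    return sketch

u = np.array([0.0, 2.0, 1.0, 0.0, 3.0])
v = np.array([0.0, 1.0, 1.0, 0.5, 3.0])
print(np.mean(zero_bit_cws(u) == zero_bit_cws(v)))   # roughly the min-max similarity of u and v
```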
2024 | COMAE Comprehensive Attribute Exploration For Zero-shot Hashing | Li Yuqi, Long Qingqing, Zhou Yihang, Cao Ning, Liu Shuai, Zheng Fang, Zhu Zhihong, Ning Zhiyuan, Xiao Meng, Wang Xuezhi, Wang Pengfei, Zhou Yuanchun | Arxiv | Zero-shot hashing (ZSH) has shown excellent success owing to its efficiency and generalization in large-scale retrieval scenarios. While considerable success has been achieved, there still exist urgent limitations. Existing works ignore the locality relationships of representations and attributes, which have effective transferability between seeable classes and unseeable classes. Also, the continuous-value attributes are not fully harnessed. In response, we conduct a COMprehensive Attribute Exploration for ZSH, named COMAE, which depicts the relationships from seen classes to unseen ones through three meticulously designed explorations, i.e., point-wise, pair-wise and class-wise consistency constraints. By regressing attributes from the proposed attribute prototype network, COMAE learns the local features that are relevant to the visual attributes. Then COMAE utilizes contrastive learning to comprehensively depict the context of attributes, rather than instance-independent optimization. Finally, the class-wise constraint is designed to cohesively learn the hash code, image representation, and visual attributes more effectively. Experimental results on the popular ZSH datasets demonstrate that COMAE outperforms state-of-the-art hashing techniques, especially in scenarios with a larger number of unseen label classes. |
|||||
2024 | Deep Unsupervised Image Hashing By Maximizing Bit Entropy | Li Yunqiang, Gemert | Arxiv | Unsupervised hashing is important for indexing huge image or video collections without having expensive annotations available. Hashing aims to learn short binary codes for compact storage and efficient semantic retrieval. We propose an unsupervised deep hashing layer called Bi-half Net that maximizes entropy of the binary codes. Entropy is maximal when both possible values of the bit are uniformly (half-half) distributed. To maximize bit entropy, we do not add a term to the loss function as this is difficult to optimize and tune. Instead, we design a new parameter-free network layer to explicitly force continuous image features to approximate the optimal half-half bit distribution. This layer is shown to minimize a penalized term of the Wasserstein distance between the learned continuous image features and the optimal half-half bit distribution. Experimental results on the image datasets Flickr25k, Nus-wide, Cifar-10, Mscoco, Mnist and the video datasets Ucf-101 and Hmdb-51 show that our approach leads to compact codes and compares favorably to the current state-of-the-art. |
|||||
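Bi-half Net, described above, forces each bit to be half +1 and half -1 without adding a loss term. A crude stand-in for the forward pass is per-dimension median thresholding over a batch, shown below; the actual layer also needs a straight-through or proxy gradient so the encoder can be trained end-to-end, which this sketch omits.

```python
import numpy as np

def half_half_binarise(U):
    """Per-dimension median thresholding: each bit is +1 for the top half of the batch
    and -1 for the bottom half, so every bit is half-half distributed (maximal entropy)."""
    med = np.median(U, axis=0, keepdims=True)
    return np.where(U >= med, 1, -1)

U = np.random.default_rng(0).normal(size=(8, 16))   # continuous features: batch of 8, 16 bits
B = half_half_binarise(U)
print(B.mean(axis=0))                               # each column is (near) zero-mean
```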
2024 | BERT-LSH Reducing Absolute Compute For Attention | Li Zezheng, Yip Kingston | Arxiv | This study introduces a novel BERT-LSH model that incorporates Locality Sensitive Hashing (LSH) to approximate the attention mechanism in the BERT architecture. We examine the computational efficiency and performance of this model compared to a standard baseline BERT model. Our findings reveal that BERT-LSH significantly reduces computational demand for the self-attention layer while unexpectedly outperforming the baseline model in pretraining and fine-tuning tasks. These results suggest that the LSH-based attention mechanism not only offers computational advantages but also may enhance the model’s ability to generalize from its training data. For more information, visit our GitHub repository: https://github.com/leo4life2/algoml-final |
|||||
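The BERT-LSH entry above approximates self-attention with LSH. The toy function below only illustrates the general flavour with sign-random-projection buckets, where each query attends to the keys sharing its bucket; it is a generic, hypothetical illustration, not the paper's architecture.

```python
import numpy as np

def lsh_bucketed_attention(Q, K, V, n_planes=4, seed=0):
    """Toy LSH attention: bucket queries/keys by the signs of random projections and
    restrict softmax attention to keys in the query's bucket."""
    d = Q.shape[-1]
    planes = np.random.default_rng(seed).normal(size=(d, n_planes))

    def bucket(X):
        bits = (X @ planes) > 0
        return bits.astype(int) @ (1 << np.arange(n_planes))   # integer bucket id per row

    qb, kb = bucket(Q), bucket(K)
    out = np.zeros_like(Q)
    for i in range(Q.shape[0]):
        idx = np.flatnonzero(kb == qb[i])
        if idx.size == 0:
            continue
        scores = Q[i] @ K[idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ V[idx]
    return out

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(32, 64)) for _ in range(3))
print(lsh_bucketed_attention(Q, K, V).shape)   # (32, 64)
```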
2024 | Feature Learning Based Deep Supervised Hashing With Pairwise Labels | Li Wu-jun, Kang | Arxiv | Recent years have witnessed wide application of hashing for large-scale image retrieval. However, most existing hashing methods are based on handcrafted features which might not be optimally compatible with the hashing procedure. Recently, deep hashing methods have been proposed to perform simultaneous feature learning and hash-code learning with deep neural networks, which have shown better performance than traditional hashing methods with hand-crafted features. Most of these deep hashing methods are supervised whose supervised information is given with triplet labels. For another common application scenario with pairwise labels, there have not existed methods for simultaneous feature learning and hash-code learning. In this paper, we propose a novel deep hashing method, called deep pairwise-supervised hashing (DPSH), to perform simultaneous feature learning and hashcode learning for applications with pairwise labels. Experiments on real datasets show that our DPSH method can outperform other methods to achieve the state-of-the-art performance in image retrieval applications. |
|||||
2024 | Two Birds One Stone Jointly Learning Binary Code For Large-scale Face Image Retrieval And Attributes Prediction | Li Yan, Wang, Liu, Jiang, Chen | Arxiv | We address the challenging large-scale content-based face image retrieval problem, intended as searching images based on the presence of specific subject, given one face image of him/her. To this end, one natural demand is a supervised binary code learning method. While the learned codes might be discriminating, people often have a further expectation that whether some semantic message (e.g., visual attributes) can be read from the human-incomprehensible codes. For this purpose, we propose a novel binary code learning framework by jointly encoding identity discriminability and a number of facial attributes into unified binary code. In this way, the learned binary codes can be applied to not only fine-grained face image retrieval, but also facial attributes prediction, which is the very innovation of this work, just like killing two birds with one stone. To evaluate the effectiveness of the proposed method, extensive experiments are conducted on a new purified large-scale web celebrity database, named CFW 60K, with abundant manual identity and attributes annotation, and experimental results exhibit the superiority of our method over state-of-the-art. |
|||||
2024 | Very Sparse Random Projections | Li Ping, Hastie, Church | Arxiv | There has been considerable interest in random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space. Let \(A \in \mathbb{R}^{n \times D}\) be our n points in D dimensions. The method multiplies A by a random matrix \(R \in \mathbb{R}^{D \times k}\), reducing the D dimensions down to just k for speeding up the computation. R typically consists of entries of standard normal N(0,1). It is well known that random projections preserve pairwise distances (in the expectation). Achlioptas proposed sparse random projections by replacing the N(0,1) entries in R with entries in \(\{-1, 0, +1\}\) with probabilities \(\{1/6, 2/3, 1/6\}\), achieving a threefold speedup in processing time. We recommend using R with entries in \(\{-1, 0, +1\}\) with probabilities \(\{\tfrac{1}{2\sqrt{D}}, 1-\tfrac{1}{\sqrt{D}}, \tfrac{1}{2\sqrt{D}}\}\) for achieving a significant \(\sqrt{D}\)-fold speedup, with little loss in accuracy. |
|||||
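The projection matrix in the Very Sparse Random Projections entry above is fully specified by its entry distribution, so a short sketch can make it concrete. The \(\sqrt{s}\) entry scaling and the \(1/\sqrt{k}\) normalization below follow common conventions for unbiased distance estimates and are assumptions, not the paper's exact presentation:

```python
# A minimal sketch of very sparse random projections with s = sqrt(D):
# entries of R are sqrt(s) * {+1, 0, -1} with probabilities
# {1/(2s), 1 - 1/s, 1/(2s)}. Scaling choices are standard conventions.
import numpy as np

def very_sparse_projection(D, k, seed=0):
    rng = np.random.default_rng(seed)
    s = np.sqrt(D)
    probs = [1.0 / (2.0 * s), 1.0 - 1.0 / s, 1.0 / (2.0 * s)]
    R = rng.choice([1.0, 0.0, -1.0], size=(D, k), p=probs)
    return np.sqrt(s) * R

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((500, 1024))          # n points in D dimensions
    R = very_sparse_projection(D=1024, k=64)
    B = (A @ R) / np.sqrt(64)                     # k-dimensional sketch of each point
    # Pairwise Euclidean distances in B approximate those in A in expectation.
```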
2024 | Optimizing Ranking Measures For Compact Binary Code Learning | Lin Guosheng, Shen, Wu. | Arxiv | Hashing has proven a valuable tool for large-scale information retrieval. Despite much success, existing hashing methods optimize over simple objectives such as the reconstruction error or graph Laplacian related loss functions, instead of the performance evaluation criteria of interest—multivariate performance measures such as the AUC and NDCG. Here we present a general framework (termed StructHash) that allows one to directly optimize multivariate performance measures. The resulting optimization problem can involve exponentially or infinitely many variables and constraints, which is more challenging than standard structured output learning. To solve the StructHash optimization problem, we use a combination of column generation and cutting-plane techniques. We demonstrate the generality of StructHash by applying it to ranking prediction and image retrieval, and show that it outperforms a few state-of-the-art hashing methods. |
|||||
2024 | The MNIST Database Of Handwritten Digits | Lecun Y., Cortes, Burges | Arxiv | The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. |
|||||
2024 | Hashing For Distributed Data | Leng Cong, Wu, Cheng, Lu | Arxiv | Recently, hashing based approximate nearest neighbors search has attracted much attention. Extensive centralized hashing algorithms have been proposed and achieved promising performance. However, due to the large scale of many applications, the data is often stored or even collected in a distributed manner. Learning hash functions by aggregating all the data into a fusion center is infeasible because of the prohibitively expensive communication and computation overhead. In this paper, we develop a novel hashing model to learn hash functions in a distributed setting. We cast a centralized hashing model as a set of subproblems with consensus constraints. We find these subproblems can be analytically solved in parallel on the distributed compute nodes. Since no training data is transmitted across the nodes in the learning process, the communication cost of our model is independent to the data size. Extensive experiments on several large scale datasets containing up to 100 million samples demonstrate the efficacy of our method. |
|||||
2024 | Simultaneous Feature Learning And Hash Coding With Deep Neural Networks | Lai H., Pan, Liu, Yan | Arxiv | Similarity-preserving hashing is a widely-used method for nearest neighbour search in large-scale image retrieval tasks. For most existing hashing methods, an image is first encoded as a vector of hand-engineering visual features, followed by another separate projection or quantization step that generates binary codes. However, such visual feature vectors may not be optimally compatible with the coding process, thus producing sub-optimal hashing codes. In this paper, we propose a deep architecture for supervised hashing, in which images are mapped into binary codes via carefully designed deep neural networks. The pipeline of the proposed deep architecture consists of three building blocks: 1) a sub-network with a stack of convolution layers to produce the effective intermediate image features; 2) a divide-and-encode module to divide the intermediate image features into multiple branches, each encoded into one hash bit; and 3) a triplet ranking loss designed to characterize that one image is more similar to the second image than to the third one. Extensive evaluations on several benchmark image datasets show that the proposed simultaneous feature learning and hash coding pipeline brings substantial improvements over other state-of-the-art supervised or unsupervised hashing methods. |
|||||
2024 | Learning To Hash With A Dimension Analysis-based Quantizer For Image Retrieval | Kwok Yuan | Arxiv | The last few years have witnessed the rise of the big data era in which approximate nearest neighbor search is a fundamental problem in many applications, such as large-scale image retrieval. Recently, many research results have demonstrated that hashing can achieve promising performance due to its appealing storage and search efficiency. Since complex optimization problems for loss functions are difficult to solve, most hashing methods decompose the hash code learning problem into two steps: projection and quantization. In the quantization step, binary codes are widely used because ranking them by the Hamming distance is very efficient. However, the massive information loss produced by the quantization step should be reduced in applications where high search accuracy is required, such as in image retrieval. Since many two-step hashing methods produce uneven projected dimensions in the projection step, in this paper, we propose a novel dimension analysis-based quantization (DAQ) on two-step hashing methods for image retrieval. We first perform an importance analysis of the projected dimensions and select a subset of them that are more informative than others, and then we divide the selected projected dimensions into several regions with our quantizer. Every region is quantized with its corresponding codebook. Finally, the similarity between two hash codes is estimated by the Manhattan distance between their corresponding codebooks, which is also efficient. We conduct experiments on three public benchmarks containing up to one million descriptors and show that the proposed DAQ method consistently leads to significant accuracy improvements over state-of-the-art quantization methods. |
|||||
2024 | LLC Accurate Multi-purpose Learnt Low-dimensional Binary Codes | Kusupati Aditya, Wallingford, Ramanujan, Somani, Park, Pillutla, Jain, Kakade, Farhadi | Arxiv | Learning binary representations of instances and classes is a classical problem with several high potential applications. In modern settings, the compression of high-dimensional neural representations to low-dimensional binary codes is a challenging task and often require large bit-codes to be accurate. In this work, we propose a novel method for Learning Low-dimensional binary Codes (LLC) for instances as well as classes. Our method does not require any side-information, like annotated attributes or label meta-data, and learns extremely low-dimensional binary codes (~20 bits for ImageNet-1K). The learnt codes are super-efficient while still ensuring nearly optimal classification accuracy for ResNet50 on ImageNet-1K. We demonstrate that the learnt codes capture intrinsically important features in the data, by discovering an intuitive taxonomy over classes. We further quantitatively measure the quality of our codes by applying it to the efficient image retrieval as well as out-of-distribution (OOD) detection problems. For ImageNet-100 retrieval problem, our learnt binary codes outperform 16 bit HashNet using only 10 bits and also are as accurate as 10 dimensional real representations. Finally, our learnt binary codes can perform OOD detection, out-of-the-box, as accurately as a baseline that needs ~3000 samples to tune its threshold, while we require none. |
|||||
2024 | Kernelized Locality-sensitive Hashing For Scalable Image Search | Kulis B., Grauman | Arxiv | Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be efficiently searched. However, existing methods do not apply for high-dimensional kernelized data when the underlying feature embedding for the kernel is unknown. We show how to generalize locality-sensitive hashing to accommodate arbitrary kernel functions, making it possible to preserve the algorithm’s sub-linear time similarity search guarantees for a wide class of useful similarity functions. Since a number of successful image-based kernels have unknown or incomputable embeddings, this is especially valuable for image retrieval tasks. We validate our technique on several large-scale datasets, and show that it enables accurate and fast performance for example-based object classification, feature matching, and content-based retrieval. |
|||||
2024 | Learning Multiple Layers Of Features From Tiny Images | Krizhevsky A. | Arxiv | Groups at MIT and NYU have collected a dataset of millions of tiny colour images from the web. It is, in principle, an excellent dataset for unsupervised training of deep generative models, but previous researchers who have tried this have found it difficult to learn a good set of filters from the images. We show how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex. Using a novel parallelization algorithm to distribute the work among multiple machines connected on a network, we show how training such a model can be done in reasonable time. A second problematic aspect of the tiny images dataset is that there are no reliable class labels which makes it hard to use for object recognition experiments. We created two sets of reliable labels. The CIFAR-10 set has 6000 examples of each of 10 classes and the CIFAR-100 set has 600 examples of each of 100 non-overlapping classes. Using these labels, we show that object recognition is significantly improved by pre-training a layer of features on a large set of unlabeled tiny images. |
|||||
2024 | Learning To Hash With Binary Reconstructive Embeddings | Kulis B., Darrell | Arxiv | Fast retrieval methods are increasingly critical for many large-scale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinate-descent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing state-of-the-art techniques. |
|||||
2024 | Manhattan Hashing For Large-scale Image Retrieval | Kong W., Li, Guo | Arxiv | Hashing is used to learn binary-code representation for data with expectation of preserving the neighborhood structure in the original feature space. Due to its fast query speed and reduced storage cost, hashing has been widely used for efficient nearest neighbor search in a large variety of applications like text and image retrieval. Most existing hashing methods adopt Hamming distance to measure the similarity (neighborhood) between points in the hashcode space. However, one problem with Hamming distance is that it may destroy the neighborhood structure in the original feature space, which violates the essential goal of hashing. In this paper, Manhattan hashing (MH), which is based on Manhattan distance, is proposed to solve the problem of Hamming distance based hashing. The basic idea of MH is to encode each projected dimension with multiple bits of natural binary code (NBC), based on which the Manhattan distance between points in the hashcode space is calculated for nearest neighbor search. MH can effectively preserve the neighborhood structure in the data to achieve the goal of hashing. To the best of our knowledge, this is the first work to adopt Manhattan distance with NBC for hashing. Experiments on several large-scale image data sets containing up to one million points show that our MH method can significantly outperform other state-of-the-art methods. |
|||||
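A minimal sketch of the Manhattan hashing encoding above: each projected dimension is quantized into \(2^b\) levels (whose natural binary codes would be stored), and ranking uses the Manhattan distance between the integer levels. The quantile-based thresholds are my own simplification, not necessarily the thresholding used in the paper:

```python
# Illustrative Manhattan-hashing style encoder and ranker.
import numpy as np

def mh_fit_thresholds(X_proj, bits_per_dim=2):
    """Per-dimension thresholds splitting values into 2**b equal-frequency levels."""
    levels = 2 ** bits_per_dim
    qs = np.linspace(0, 1, levels + 1)[1:-1]
    return np.quantile(X_proj, qs, axis=0)        # shape: (levels - 1, D)

def mh_encode(X_proj, thresholds):
    """Integer level per dimension; its natural binary code would be what is stored."""
    codes = np.zeros(X_proj.shape, dtype=np.int64)
    for t in thresholds:                          # count how many thresholds are exceeded
        codes += (X_proj > t)
    return codes

def mh_rank(query_code, db_codes):
    """Rank database items by Manhattan distance between integer levels."""
    return np.argsort(np.abs(db_codes - query_code).sum(axis=1))
```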
2024 | Isotropic Hashing | Kong W., Li | Arxiv | Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. Although this viewpoint has been widely accepted by many researchers, it is still not verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances. |
|||||
2024 | Utilizing Low-dimensional Molecular Embeddings For Rapid Chemical Similarity Search | Kirchoff Kathryn E., Wellnitz James, Hochuli Joshua E., Maxfield Travis, Popov Konstantin I., Gomez Shawn, Tropsha Alexander | Arxiv | Nearest neighbor-based similarity searching is a common task in chemistry, with notable use cases in drug discovery. Yet, some of the most commonly used approaches for this task still leverage a brute-force approach. In practice this can be computationally costly and overly time-consuming, due in part to the sheer size of modern chemical databases. Previous computational advancements for this task have generally relied on improvements to hardware or dataset-specific tricks that lack generalizability. Approaches that leverage lower-complexity searching algorithms remain relatively underexplored. However, many of these algorithms are approximate solutions and/or struggle with typical high-dimensional chemical embeddings. Here we evaluate whether a combination of low-dimensional chemical embeddings and a k-d tree data structure can achieve fast nearest neighbor queries while maintaining performance on standard chemical similarity search benchmarks. We examine different dimensionality reductions of standard chemical embeddings as well as a learned, structurally-aware embedding – SmallSA – for this task. With this framework, searches on over one billion chemicals execute in less than a second on a single CPU core, five orders of magnitude faster than the brute-force approach. We also demonstrate that SmallSA achieves competitive performance on chemical similarity benchmarks. |
|||||
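The pipeline in the chemical similarity search entry above reduces to three steps: reduce dimensionality, build a k-d tree, query it. Below is a hedged sketch with PCA standing in for any reduction (the learned SmallSA embedding is not reproduced, and the data sizes are placeholders):

```python
# Sketch of the low-dimensional-embedding plus k-d tree pipeline described above.
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
embeddings = rng.random((20_000, 256))       # stand-in for chemical embeddings

low_dim = PCA(n_components=8).fit_transform(embeddings)
tree = cKDTree(low_dim)                      # build once, query many times

queries = low_dim[:5]                        # reuse a few points as example queries
dists, idx = tree.query(queries, k=10)       # 10 nearest neighbours in the reduced space
print(idx.shape)                             # (5, 10)
```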
2024 | Double-bit Quantisation For Hashing | Kong W., Li | Arxiv | Hashing, which tries to learn similarity-preserving binary codes for data representation, has been widely used for efficient nearest neighbor search in massive databases due to its fast query speed and low storage cost. Because it is NP-hard to directly compute the best binary codes for a given data set, mainstream hashing methods typically adopt a two-stage strategy. In the first stage, several projected dimensions of real values are generated. Then in the second stage, the real values will be quantized into binary codes by thresholding. Currently, most existing methods use one single bit to quantize each projected dimension. One problem with this single-bit quantization (SBQ) is that the threshold typically lies in the region of the highest point density and consequently a lot of neighboring points close to the threshold will be hashed to totally different bits, which is unexpected according to the principle of hashing. In this paper, we propose a novel quantization strategy, called double-bit quantization (DBQ), to solve the problem of SBQ. The basic idea of DBQ is to quantize each projected dimension into double bits with adaptively learned thresholds. Extensive experiments on two real data sets show that our DBQ strategy can significantly outperform traditional SBQ strategy for hashing. |
|||||
2024 | Learning Hash Functions For Cross-view Similarity Search | Kumar S., Udupa | Arxiv | Many applications in Multilingual and Multimodal Information Access involve searching large databases of high dimensional data objects with multiple (conditionally independent) views. In this work we consider the problem of learning hash functions for similarity search across the views for such applications. We propose a principled method for learning a hash function for each view given a set of multiview training data objects. The hash functions map similar objects to similar codes across the views thus enabling cross-view similarity search. We present results from an extensive empirical study of the proposed approach which demonstrate its effectiveness on Japanese language People Search and Multilingual People Search problems. |
|||||
2024 | Fast Redescription Mining Using Locality-sensitive Hashing | Karjalainen Maiju, Galbrun Esther, Miettinen Pauli | Arxiv | Redescription mining is a data analysis technique that has found applications in diverse fields. The most used redescription mining approaches involve two phases: finding matching pairs among data attributes and extending the pairs. This process is relatively efficient when the number of attributes remains limited and when the attributes are Boolean, but becomes almost intractable when the data consist of many numerical attributes. In this paper, we present new algorithms that perform the matching and extension orders of magnitude faster than the existing approaches. Our algorithms are based on locality-sensitive hashing with a tailored approach to handle the discretisation of numerical attributes as used in redescription mining. |
|||||
2024 | On The Adversarial Robustness Of Locality-sensitive Hashing In Hamming Space | Kapralov Michael, Makarov Mikhail, Sohler Christian | Arxiv | Locality-sensitive hashing [Indyk, Motwani '98] is a classical data structure for approximate nearest neighbor search. It allows, after a close to linear time preprocessing of the input dataset, to find an approximately nearest neighbor of any fixed query in sublinear time in the dataset size. The resulting data structure is randomized and succeeds with high probability for every fixed query. In many modern applications of nearest neighbor search the queries are chosen adaptively. In this paper, we study the robustness of the locality-sensitive hashing to adaptive queries in Hamming space. We present a simple adversary that can, under mild assumptions on the initial point set, provably find a query to the approximate near neighbor search data structure that the data structure fails on. Crucially, our adaptive algorithm finds the hard query exponentially faster than random sampling. |
|||||
2024 | Column Sampling Based Discrete Supervised Hashing | Kang Wang-cheng, Li, Zhou | Arxiv | By leveraging semantic (label) information, supervised hashing has demonstrated better accuracy than unsupervised hashing in many real applications. Because the hashing-code learning problem is essentially a discrete optimization problem which is hard to solve, most existing supervised hashing methods try to solve a relaxed continuous optimization problem by dropping the discrete constraints. However, these methods typically suffer from poor performance due to the errors caused by the relaxation. Some other methods try to directly solve the discrete optimization problem. However, they are typically time-consuming and unscalable. In this paper, we propose a novel method, called column sampling based discrete supervised hashing (COSDISH), to directly learn the discrete hashing code from semantic information. COSDISH is an iterative method, in each iteration of which several columns are sampled from the semantic similarity matrix and then the hashing code is decomposed into two parts which can be alternately optimized in a discrete way. Theoretical analysis shows that the learning (optimization) algorithm of COSDISH has a constant-approximation bound in each step of the alternating optimization procedure. Empirical results on datasets with semantic labels illustrate that COSDISH can outperform the state-of-the-art methods in real applications like image retrieval. |
|||||
2024 | Maximum-margin Hamming Hashing | Kang Rong, Cao, Wang, Yu | Arxiv | Deep hashing enables computation and memory efficient image search through end-to-end learning of feature representations and binary codes. While linear scan over binary hash codes is more efficient than over the high-dimensional representations, its linear-time complexity is still unacceptable for very large databases. Hamming space retrieval enables constant-time search through hash lookups, where for each query, there is a Hamming ball centered at the query and the data points within the ball are returned as relevant. Since inside the Hamming ball implies retrievable while outside irretrievable, it is crucial to explicitly characterize the Hamming ball. The main idea of this work is to directly embody the Hamming radius into the loss functions, leading to Maximum-Margin Hamming Hashing (MMHH), a new model specifically optimized for Hamming space retrieval. We introduce a max-margin t-distribution loss, where the t-distribution concentrates more similar data points to be within the Hamming ball, and the margin characterizes the Hamming radius such that less penalization is applied to similar data points within the Hamming ball. The loss function also introduces robustness to data noise, where the similarity supervision may be inaccurate in practical problems. The model is trained end-to-end using a new semi-batch optimization algorithm tailored to extremely imbalanced data. Our method yields state-of-the-art results on four datasets and shows superior performance on noisy data. |
|||||
2024 | Random Maximum Margin Hashing | Joly A., Buisson | Arxiv | Following the success of hashing methods for multidimensional indexing, more and more works are interested in embedding visual feature space in compact hash codes. Such approaches are not an alternative to using index structures but a complementary way to reduce both the memory usage and the distance computation cost. Several data dependent hash functions have notably been proposed to closely fit data distribution and provide better selectivity than usual random projections such as LSH. However, improvements occur only for relatively small hash code sizes up to 64 or 128 bits. As discussed in the paper, this is mainly due to the lack of independence between the produced hash functions. We introduce a new hash function family that attempts to solve this issue in any kernel space. Rather than boosting the collision probability of close points, our method focuses on data scattering. By training purely random splits of the data, regardless of the closeness of the training samples, it is indeed possible to generate consistently more independent hash functions. On the other hand, the use of large-margin classifiers allows good generalization performance to be maintained. Experiments show that our new Random Maximum Margin Hashing scheme (RMMH) outperforms four state-of-the-art hashing methods, notably in kernel spaces. |
|||||
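To illustrate the RMMH construction above, the sketch below trains each hash bit as a max-margin classifier on a purely random half/half labelling of a few sampled points. A linear SVM stands in for the kernelized classifiers in the paper, and the constants are illustrative assumptions:

```python
# Hedged sketch of random maximum-margin hashing: one max-margin classifier per bit,
# each trained on an arbitrary half/half labelling of M randomly sampled points.
import numpy as np
from sklearn.svm import LinearSVC

def train_rmmh(X, num_bits=32, M=32, seed=0):
    rng = np.random.default_rng(seed)
    classifiers = []
    for _ in range(num_bits):
        idx = rng.choice(len(X), size=M, replace=False)
        labels = np.array([1] * (M // 2) + [-1] * (M - M // 2))  # purely random split
        classifiers.append(LinearSVC(C=1.0).fit(X[idx], labels))
    return classifiers

def rmmh_encode(X, classifiers):
    # One bit per trained classifier: which side of its max-margin hyperplane.
    return np.stack([clf.decision_function(X) > 0 for clf in classifiers], axis=1)
```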
2024 | SSAH Semi-supervised Adversarial Deep Hashing With Self-paced Hard Sample Generation | Jin Sheng, Zhou, Liu, Chen, Sun, Yao, Hua | Arxiv | Deep hashing methods have been proved to be effective and efficient for large-scale Web media search. The success of these data-driven methods largely depends on collecting sufficient labeled data, which is usually a crucial limitation in practical cases. The current solutions to this issue utilize Generative Adversarial Network (GAN) to augment data in semi-supervised learning. However, existing GAN-based methods treat image generations and hashing learning as two isolated processes, leading to generation ineffectiveness. Besides, most works fail to exploit the semantic information in unlabeled data. In this paper, we propose a novel Semi-supervised Self-pace Adversarial Hashing method, named SSAH to solve the above problems in a unified framework. The SSAH method consists of an adversarial network (A-Net) and a hashing network (H-Net). To improve the quality of generative images, first, the A-Net learns hard samples with multi-scale occlusions and multi-angle rotated deformations which compete against the learning of accurate hashing codes. Second, we design a novel self-paced hard generation policy to gradually increase the hashing difficulty of generated samples. To make use of the semantic information in unlabeled ones, we propose a semi-supervised consistent loss. The experimental results show that our method can significantly improve state-of-the-art models on both the widely-used hashing datasets and fine-grained datasets. |
|||||
2024 | Deep Saliency Hashing For Fine-grained Retrieval | Jin Sheng, Yao, Sun, Zhou, Zhang, Hua | Arxiv | In recent years, hashing methods have been proved to be effective and efficient for the large-scale Web media search. However, the existing general hashing methods have limited discriminative power for describing fine-grained objects that share similar overall appearance but have subtle difference. To solve this problem, we for the first time introduce the attention mechanism to the learning of fine-grained hashing codes. Specifically, we propose a novel deep hashing model, named deep saliency hashing (DSaH), which automatically mines salient regions and learns semantic-preserving hashing codes simultaneously. DSaH is a two-step end-to-end model consisting of an attention network and a hashing network. Our loss function contains three basic components, including the semantic loss, the saliency loss, and the quantization loss. As the core of DSaH, the saliency loss guides the attention network to mine discriminative regions from pairs of images. We conduct extensive experiments on both fine-grained and general retrieval datasets for performance evaluation. Experimental results on fine-grained datasets, including Oxford Flowers-17, Stanford Dogs-120 and CUB Bird, demonstrate that our DSaH performs the best for the fine-grained retrieval task and beats the strongest competitor (DTQ) by approximately 10% on both Stanford Dogs-120 and CUB Bird. DSaH is also comparable to several state-of-the-art hashing methods on general datasets, including CIFAR-10 and NUS-WIDE. |
|||||
2024 | Unsupervised Discrete Hashing With Affinity Similarity | Jin Sheng, Yao, Zhou, Liu, Huang, Hua | Arxiv | In recent years, supervised hashing has been validated to greatly boost the performance of image retrieval. However, the label-hungry property requires massive label collection, making it intractable in practical scenarios. To liberate the model training procedure from laborious manual annotations, some unsupervised methods are proposed. However, the following two factors make unsupervised algorithms inferior to their supervised counterparts: (1) Without manually-defined labels, it is difficult to capture the semantic information across data, which is of crucial importance to guide robust binary code learning. (2) The widely adopted relaxation on binary constraints results in quantization error accumulation in the optimization procedure. To address the above-mentioned problems, in this paper, we propose a novel Unsupervised Discrete Hashing method (UDH). Specifically, to capture the semantic information, we propose a balanced graph-based semantic loss which explores the affinity priors in the original feature space. Then, we propose a novel self-supervised loss, termed orthogonal consistent loss, which can leverage semantic loss of instance and impose independence of codes. Moreover, by integrating the discrete optimization into the proposed unsupervised framework, the binary constraints are consistently preserved, alleviating the influence of quantization errors. Extensive experiments demonstrate that UDH outperforms state-of-the-art unsupervised methods for image retrieval. |
|||||
2024 | Complementary Projection Hashing | Jin Z., Hu, Lin, Zhang, Lin, Cai, Li | Arxiv | Recently, hashing techniques have been widely applied to solve the approximate nearest neighbors search problem in many vision applications. Generally, these hashing approaches generate 2^c buckets, where c is the length of the hash code. A good hashing method should satisfy the following two requirements: 1) mapping the nearby data points into the same bucket or nearby (measured by the Hamming distance) buckets. 2) all the data points are evenly distributed among all the buckets. In this paper, we propose a novel algorithm named Complementary Projection Hashing (CPH) to find the optimal hashing functions which explicitly considers the above two requirements. Specifically, CPH aims at sequentially finding a series of hyperplanes (hashing functions) which cross the sparse region of the data. At the same time, the data points are evenly distributed in the hypercubes generated by these hyperplanes. The experiments comparing with the state-of-the-art hashing methods demonstrate the effectiveness of the proposed method. |
|||||
2024 | Deep Cross-modal Hashing | Jiang Qing-yuan, Li | Arxiv | Due to its low storage cost and fast query speed, cross-modal hashing (CMH) has been widely used for similarity search in multimedia retrieval applications. However, most existing CMH methods are based on hand-crafted features which might not be optimally compatible with the hash-code learning procedure. As a result, existing CMH methods with hand-crafted features may not achieve satisfactory performance. In this paper, we propose a novel CMH method, called deep cross-modal hashing (DCMH), by integrating feature learning and hash-code learning into the same framework. DCMH is an end-to-end learning framework with deep neural networks, one for each modality, to perform feature learning from scratch. Experiments on three real datasets with image-text modalities show that DCMH can outperform other baselines to achieve the state-of-the-art performance in cross-modal retrieval applications. |
|||||
2024 | Scalable Graph Hashing With Feature Transformation | Jiang Q., Li | Arxiv | Hashing has been widely used for approximate nearest neighbor (ANN) search in big data applications because of its low storage cost and fast retrieval speed. The goal of hashing is to map the data points from the original space into a binary-code space where the similarity (neighborhood structure) in the original space is preserved. By directly exploiting the similarity to guide the hashing code learning procedure, graph hashing has attracted much attention. However, most existing graph hashing methods cannot achieve satisfactory performance in real applications due to the high complexity for graph modeling. In this paper, we propose a novel method, called scalable graph hashing with feature transformation (SGH), for large-scale graph hashing. Through feature transformation, we can effectively approximate the whole graph without explicitly computing the similarity graph matrix, based on which a sequential learning method is proposed to learn the hash functions in a bit-wise manner. Experiments on two datasets with one million data points show that our SGH method can outperform the state-of-the-art methods in terms of both accuracy and scalability. |
|||||
2024 | Fast Online Hashing With Multi-label Projection | Jia Wenzhe, Cao, Liu, Gui | Arxiv | Hashing has been widely researched to solve the large-scale approximate nearest neighbor search problem owing to its time and storage superiority. In recent years, a number of online hashing methods have emerged, which can update the hash functions to adapt to the new stream data and realize dynamic retrieval. However, existing online hashing methods are required to update the whole database with the latest hash functions when a query arrives, which leads to low retrieval efficiency with the continuous increase of the stream data. On the other hand, these methods ignore the supervision relationship among the examples, especially in the multi-label case. In this paper, we propose a novel Fast Online Hashing (FOH) method which only updates the binary codes of a small part of the database. To be specific, we first build a query pool in which the nearest neighbors of each central point are recorded. When a new query arrives, only the binary codes of the corresponding potential neighbors are updated. In addition, we create a similarity matrix which takes the multi-label supervision information into account and bring in the multi-label projection loss to further preserve the similarity among the multi-label data. The experimental results on two common benchmarks show that the proposed FOH can achieve dramatic superiority on query time up to 6.28 seconds less than state-of-the-art baselines with competitive retrieval accuracy. |
|||||
2024 | Searching With Quantization Approximate Nearest Neighbor Search Using Short Codes And Distance Estimators | Jegou H., Douze, Schmid | Arxiv | We propose an approximate nearest neighbor search method based on quantization. It uses, in particular, a product quantizer to produce short codes and corresponding distance estimators approximating the Euclidean distance between the original vectors. The method is advantageously used in an asymmetric manner, by computing the distance between a vector and a code, unlike competing techniques such as spectral hashing that only compare codes. Our approach approximates the Euclidean distance based on memory efficient codes and, thus, permits efficient nearest neighbor search. Experiments performed on SIFT and GIST image descriptors show excellent search accuracy. The method is shown to outperform two state-of-the-art approaches of the literature. Timings measured when searching a vector set of 2 billion vectors are shown to be excellent given the high accuracy of the method. |
|||||
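A compact sketch of product quantization with the asymmetric distance computation described above: sub-space codebooks are learned with k-means, database vectors are stored as codeword indices, and a query is compared against codes via per-sub-space lookup tables. Codebook sizes and names are illustrative, and the dimension is assumed divisible by the number of sub-quantizers:

```python
# Illustrative product quantization with asymmetric distance computation (ADC).
import numpy as np
from sklearn.cluster import KMeans

def pq_train(X, m=8, k=256, seed=0):
    """Split D dims into m sub-vectors and learn one k-word codebook per sub-space."""
    subs = np.split(X, m, axis=1)                 # assumes D is divisible by m
    return [KMeans(n_clusters=k, n_init=4, random_state=seed).fit(s) for s in subs]

def pq_encode(X, codebooks):
    subs = np.split(X, len(codebooks), axis=1)
    return np.stack([cb.predict(s) for cb, s in zip(codebooks, subs)], axis=1)

def pq_adc_search(query, codes, codebooks):
    """Asymmetric distances: exact query sub-vectors vs. quantized database vectors."""
    q_subs = np.split(query, len(codebooks))
    # Per-sub-space lookup table of squared distances from the query to all centroids.
    tables = [((cb.cluster_centers_ - q) ** 2).sum(axis=1)
              for cb, q in zip(codebooks, q_subs)]
    dists = sum(tables[j][codes[:, j]] for j in range(len(codebooks)))
    return np.argsort(dists)                      # database ids ranked by estimated distance
```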
2024 | Distilling Vision-language Pretraining For Efficient Cross-modal Retrieval | Jang Young Kyun, Kim Donghyun, Lim Ser-nam | Arxiv | "Learning to hash" is a practical solution for efficient retrieval, offering fast search speed and low storage cost. It is widely applied in various applications, such as image-text cross-modal search. In this paper, we explore the potential of enhancing the performance of learning to hash with the proliferation of powerful large pre-trained models, such as Vision-Language Pre-training (VLP) models. We introduce a novel method named Distillation for Cross-Modal Quantization (DCMQ), which leverages the rich semantic knowledge of VLP models to improve hash representation learning. Specifically, we use the VLP as a |
|||||
2024 | Fast Similarity Search For Learned Metrics | Jain P., Kulis, Grauman | Arxiv | We propose a method to efficiently index into a large database of examples according to a learned metric. Given a collection of examples, we learn a Mahalanobis distance using an information-theoretic metric learning technique that adapts prior knowledge about pairwise distances to incorporate similarity and dissimilarity constraints. To enable sub-linear time similarity search under the learned metric, we show how to encode a learned Mahalanobis parameterization into randomized locality-sensitive hash functions. We further formulate an indirect solution that enables metric learning and hashing for sparse input vector spaces whose high dimensionality makes it infeasible to learn an explicit weighting over the feature dimensions. We demonstrate the approach applied to systems and image datasets, and show that our learned metrics improve accuracy relative to commonly-used metric baselines, while our hashing construction permits efficient indexing with a learned distance and very large databases. |
|||||
2024 | Hashing Hyperplane Queries To Near Points With Applications To Large-scale Active Learning | Jain P., Vijayanarasimhan, Grauman | Arxiv | We consider the problem of retrieving the database points nearest to a given hyperplane query without exhaustively scanning the database. We propose two hashing-based solutions. Our first approach maps the data to two-bit binary keys that are locality-sensitive for the angle between the hyperplane normal and a database point. Our second approach embeds the data into a vector space where the Euclidean norm reflects the desired distance between the original points and hyperplane query. Both use hashing to retrieve near points in sub-linear time. Our first method’s preprocessing stage is more efficient, while the second has stronger accuracy guarantees. We apply both to pool-based active learning: taking the current hyperplane classifier as a query, our algorithm identifies those points (approximately) satisfying the well-known minimal distance-to-hyperplane selection criterion. We empirically demonstrate our methods’ tradeoffs, and show that they make it practical to perform active selection with millions of unlabeled points. |
|||||
2024 | Data-dependent LSH For The Earth Movers Distance | Jayaram Rajesh, Waingarten Erik, Zhang Tian | Arxiv | We give new data-dependent locality sensitive hashing schemes (LSH) for the Earth Mover’s Distance (\(\mathsf{EMD}\)), and as a result, improve the best approximation for nearest neighbor search under \(\mathsf{EMD}\) by a quadratic factor. Here, the metric \(\mathsf{EMD}_s(\mathbb{R}^d,\ell_p)\) consists of sets of \(s\) vectors in \(\mathbb{R}^d\), and for any two sets \(x,y\) of \(s\) vectors the distance \(\mathsf{EMD}(x,y)\) is the minimum cost of a perfect matching between \(x,y\), where the cost of matching two vectors is their \(\ell_p\) distance. Previously, Andoni, Indyk, and Krauthgamer gave a (data-independent) locality-sensitive hashing scheme for \(\mathsf{EMD}_s(\mathbb{R}^d,\ell_p)\) when \(p \in [1,2]\) with approximation \(O(\log^2 s)\). By being data-dependent, we improve the approximation to \(\tilde{O}(\log s)\). Our main technical contribution is to show that for any distribution \(\mu\) supported on the metric \(\mathsf{EMD}_s(\mathbb{R}^d, \ell_p)\), there exists a data-dependent LSH for dense regions of \(\mu\) which achieves approximation \(\tilde{O}(\log s)\), and that the data-independent LSH actually achieves a \(\tilde{O}(\log s)\)-approximation outside of those dense regions. Finally, we show how to “glue” together these two hashing schemes without any additional loss in the approximation. Beyond nearest neighbor search, our data-dependent LSH also gives optimal (distributional) sketches for the Earth Mover’s Distance. By known sketching lower bounds, this implies that our LSH is optimal (up to \(\mathrm{poly}(\log \log s)\) factors) among those that collide close points with constant probability. |
|||||
2024 | Spatially Optimized Compact Deep Metric Learning Model For Similarity Search | Islam Md. Farhadul, Reza Md. Tanzim, Manab Meem Arafat, Mahin Mohammad Rakibul Hasan, Zabeen Sarah, Noor Jannatun | Arxiv | Spatial optimization is often overlooked in many computer vision tasks. Filters should be able to recognize the features of an object regardless of where it is in the image. Similarity search is a crucial task where spatial features decide an important output. The capacity of convolution to capture visual patterns across various locations is limited. In contrast to convolution, the involution kernel is dynamically created at each pixel based on the pixel value and parameters that have been learned. This study demonstrates that utilizing a single layer of involution feature extractor alongside a compact convolution model significantly enhances the performance of similarity search. Additionally, we improve predictions by using the GELU activation function rather than the ReLU. The negligible amount of weight parameters in involution with a compact model with better performance makes the model very useful in real-world implementations. Our proposed model is below 1 megabyte in size. We have experimented with our proposed methodology and other models on CIFAR-10, FashionMNIST, and MNIST datasets. Our proposed method outperforms across all three datasets. |
|||||
2024 | Locally Linear Hashing For Extracting Non-linear Manifolds | Irie G., Li, Wu, Chang | Arxiv | Previous efforts in hashing intend to preserve data variance or pairwise affinity, but neither is adequate in capturing the manifold structures hidden in most visual data. In this paper, we tackle this problem by reconstructing the locally linear structures of manifolds in the binary Hamming space, which can be learned by locality-sensitive sparse coding. We cast the problem as a joint minimization of reconstruction error and quantization loss, and show that, despite its NP-hardness, a local optimum can be obtained efficiently via alternative optimization. Our method distinguishes itself from existing methods in its remarkable ability to extract the nearest neighbors of the query from the same manifold, instead of from the ambient space. On extensive experiments on various image benchmarks, our results improve previous state-of-the-art by 28-74% typically, and 627% on the Yale face data. |
|||||
2024 | Datasets For Approximate Nearest Neighbor Search | Jegou Herve, Amsaleg | Arxiv | BIGANN consists of SIFT descriptors extracted from images in a large image dataset. |
|||||
2024 | Accelerate Learning Of Deep Hashing With Gradient Attention | Huang Long-kai, Chen, Pan | Arxiv | Recent years have witnessed the success of learning to hash in fast large-scale image retrieval. As deep learning has shown its superior performance on many computer vision applications, recent designs of learning-based hashing models have been moving from shallow ones to deep architectures. However, based on our analysis, we find that gradient descent based algorithms used in deep hashing models would potentially cause hash codes of a pair of training instances to be updated towards the directions of each other simultaneously during optimization. In the worst case, the paired hash codes switch their directions after update, and consequently, their corresponding distance in the Hamming space remain unchanged. This makes the overall learning process highly inefficient. To address this issue, we propose a new deep hashing model integrated with a novel gradient attention mechanism. Extensive experimental results on three benchmark datasets show that our proposed algorithm is able to accelerate the learning process and obtain competitive retrieval performance compared with state-of-the-art deep hashing models. |
|||||
2024 | Separated Variational Hashing Networks For Cross-modal Retrieval | Hu Peng, Wang, Zhen, Peng | Arxiv | Cross-modal hashing, due to its low storage cost and high query speed, has been successfully used for similarity search in multimedia retrieval applications. It projects high-dimensional data into a shared isomorphic Hamming space with similar binary codes for semantically-similar data. In some applications, all modalities may not be obtained or trained simultaneously for some reasons, such as privacy, secret, storage limitation, and computational resource limitation. However, most existing cross-modal hashing methods need all modalities to jointly learn the common Hamming space, thus hindering them from handling these problems. In this paper, we propose a novel approach called Separated Variational Hashing Networks (SVHNs) to overcome the above challenge. Firstly, it adopts a label network (LabNet) to exploit available and nonspecific label annotations to learn a latent common Hamming space by projecting each semantic label into a common binary representation. Then, each modality-specific network can separately map the samples of the corresponding modality into their binary semantic codes learned by LabNet. We achieve it by conducting variational inference to match the aggregated posterior of the hashing code of LabNet with an arbitrary prior distribution. The effectiveness and efficiency of our SVHNs are verified by extensive experiments carried out on four widely-used multimedia databases, in comparison with 11 state-of-the-art approaches. |
|||||
2024 | Residual Quantization With Implicit Neural Codebooks | Huijben Iris A. M., Douze Matthijs, Muckley Matthew, Van Sloun Ruud J. G., Verbeek Jakob | Arxiv | Vector quantization is a fundamental operation for data compression and vector search. To obtain high accuracy, multi-codebook methods represent each vector using codewords across several codebooks. Residual quantization (RQ) is one such method, which iteratively quantizes the error of the previous step. While the error distribution is dependent on previously-selected codewords, this dependency is not accounted for in conventional RQ as it uses a fixed codebook per quantization step. In this paper, we propose QINCo, a neural RQ variant that constructs specialized codebooks per step that depend on the approximation of the vector from previous steps. Experiments show that QINCo outperforms state-of-the-art methods by a large margin on several datasets and code sizes. For example, QINCo achieves better nearest-neighbor search accuracy using 12-byte codes than the state-of-the-art UNQ using 16 bytes on the BigANN1M and Deep1M datasets. |
|||||
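For context on the QINCo entry above, the sketch below implements plain residual quantization with a fixed codebook per step, i.e. the conventional baseline whose per-step codebooks QINCo replaces with neural, residual-dependent ones. This is not the QINCo model itself; sizes and names are assumptions:

```python
# Conventional residual quantization: each step quantizes the residual left by the
# previous step using a fixed k-means codebook.
import numpy as np
from sklearn.cluster import KMeans

def rq_train(X, num_steps=4, k=256, seed=0):
    codebooks, residual = [], X.copy()
    for _ in range(num_steps):
        km = KMeans(n_clusters=k, n_init=4, random_state=seed).fit(residual)
        codebooks.append(km.cluster_centers_)
        residual = residual - km.cluster_centers_[km.predict(residual)]
    return codebooks

def rq_encode(X, codebooks):
    codes, residual = [], X.copy()
    for C in codebooks:
        # Nearest codeword per point, then subtract it to form the next residual.
        idx = np.argmin(((residual[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        codes.append(idx)
        residual = residual - C[idx]
    return np.stack(codes, axis=1)
```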
2024 | K-semstamp A Clustering-based Semantic Watermark For Detection Of Machine-generated Text | Hou Abe Bohan, Zhang Jingyu, Wang Yichen, Khashabi Daniel, He Tianxing | Arxiv | Recent watermarked generation algorithms inject detectable signatures during language generation to facilitate post-hoc detection. While token-level watermarks are vulnerable to paraphrase attacks, SemStamp (Hou et al., 2023) applies watermark on the semantic representation of sentences and demonstrates promising robustness. SemStamp employs locality-sensitive hashing (LSH) to partition the semantic space with arbitrary hyperplanes, which results in a suboptimal tradeoff between robustness and speed. We propose k-SemStamp, a simple yet effective enhancement of SemStamp, utilizing k-means clustering as an alternative of LSH to partition the embedding space with awareness of inherent semantic structure. Experimental results indicate that k-SemStamp saliently improves its robustness and sampling efficiency while preserving the generation quality, advancing a more effective tool for machine-generated text detection. |
|||||
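A heavily simplified illustration of the k-means partitioning idea behind k-SemStamp above: sentence embeddings are clustered, and generation accepts only sentences whose cluster index falls in a pseudorandom valid subset. The real method derives the valid set from a secret key and the preceding context; everything below (k, the valid ratio, the embedder) is a placeholder:

```python
# Hedged sketch of k-means based semantic-space partitioning for watermarking.
import numpy as np
from sklearn.cluster import KMeans

def build_partition(sentence_embeddings, k=8, seed=0):
    """Cluster the embedding space once; the centroids define the partition."""
    return KMeans(n_clusters=k, n_init=4, random_state=seed).fit(sentence_embeddings)

def is_watermarked(embedding, kmeans, valid_ratio=0.5, seed=1):
    """Accept a candidate sentence only if its cluster is in the valid subset."""
    k = kmeans.n_clusters
    rng = np.random.default_rng(seed)                 # stands in for a shared secret
    valid = set(rng.permutation(k)[: int(valid_ratio * k)])
    return int(kmeans.predict(embedding[None, :])[0]) in valid
```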
2024 | Creating Something From Nothing Unsupervised Knowledge Distillation For Cross-modal Hashing | Hu Hengtong, Xie, Hong, Tian | Arxiv | In recent years, cross-modal hashing (CMH) has attracted increasing attention, mainly because of its ability to map content from different modalities, especially vision and language, into the same space, which makes cross-modal data retrieval efficient. There are two main frameworks for CMH, differing from each other in whether semantic supervision is required. Compared to the unsupervised methods, the supervised methods often enjoy more accurate results, but require much heavier labor in data annotation. In this paper, we propose a novel approach that enables guiding a supervised method using outputs produced by an unsupervised method. Specifically, we make use of teacher-student optimization for propagating knowledge. Experiments are performed on two popular CMH benchmarks, i.e., the MIRFlickr and NUS-WIDE datasets. Our approach outperforms all existing unsupervised methods by a large margin. |
|||||
2024 | The MIR Flickr Retrieval Evaluation. | Huiskes M., Lew | Arxiv | In most well known image retrieval test sets, the imagery typically cannot be freely distributed or is not representative of a large community of users. In this paper we present a collection for the MIR community comprising 25000 images from the Flickr website which are redistributable for research purposes and represent a real community of users both in the image content and image tags. We have extracted the tags and EXIF image metadata, and also make all of these publicly available. In addition we discuss several challenges for benchmarking retrieval and classification methods. |
|||||
2024 | A Non-alternating Graph Hashing Algorithm For Large Scale Image Search | Hemati Sobhan, Mehdizavareh, Chenouri, Tizhoosh | Arxiv | In the era of big data, methods for improving memory and computational efficiency have become crucial for successful deployment of technologies. Hashing is one of the most effective approaches to deal with computational limitations that come with big data. One natural way for formulating this problem is spectral hashing that directly incorporates affinity to learn binary codes. However, due to binary constraints, the optimization becomes intractable. To mitigate this challenge, different relaxation approaches have been proposed to reduce the computational load of obtaining binary codes and still attain a good solution. The problem with all existing relaxation methods is resorting to one or more additional auxiliary variables to attain high quality binary codes while relaxing the problem. The existence of auxiliary variables leads to coordinate descent approach which increases the computational complexity. We argue that introducing these variables is unnecessary. To this end, we propose a novel relaxed formulation for spectral hashing that adds no additional variables to the problem. Furthermore, instead of solving the problem in original space where number of variables is equal to the data points, we solve the problem in a much smaller space and retrieve the binary codes from this solution. This trick reduces both the memory and computational complexity at the same time. We apply two optimization techniques, namely projected gradient and optimization on manifold, to obtain the solution. Using comprehensive experiments on four public datasets, we show that the proposed efficient spectral hashing (ESH) algorithm achieves highly competitive retrieval performance compared with state of the art at low complexity. |
|||||
2024 | Spherical Hashing | Heo J., Lee, He, Chang, Yoon | Arxiv | Many binary code encoding schemes based on hashing have been actively studied recently, since they can provide efficient similarity search, especially nearest neighbor search, and compact data representations suitable for handling large scale image databases in many computer vision problems. Existing hashing techniques encode high-dimensional data points by using hyperplane-based hashing functions. In this paper we propose a novel hypersphere-based hashing function, spherical hashing, to map more spatially coherent data points into a binary code compared to hyperplane-based hashing functions. Furthermore, we propose a new binary code distance function, spherical Hamming distance, that is tailored to our hypersphere-based binary coding scheme, and design an efficient iterative optimization process to achieve balanced partitioning of data points for each hash function and independence between hashing functions. Our extensive experiments show that our spherical hashing technique significantly outperforms six state-of-the-art hashing techniques based on hyperplanes across various image benchmarks of sizes ranging from one to 75 million GIST descriptors. The performance gains are consistent and large, up to 100% improvements. The excellent results confirm the unique merits of the proposed idea in using hyperspheres to encode proximity regions in high-dimensional spaces. Finally, our method is intuitive and easy to implement. |
|||||
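The spherical hashing entry above replaces hyperplanes with hyperspheres; given learned centers and radii (the iterative optimization that produces them is not shown here), encoding and the spherical Hamming distance are straightforward, as in this hedged sketch:

```python
# Illustrative hypersphere-based encoding and spherical Hamming distance.
import numpy as np

def spherical_encode(X, centers, radii):
    """Bit i is 1 iff the point falls inside the hypersphere (centers[i], radii[i])."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return (dists <= radii[None, :]).astype(np.uint8)

def spherical_hamming_distance(a, B):
    """|a XOR b| / |a AND b| between one code a and a matrix of codes B."""
    xor = np.logical_xor(a, B).sum(axis=1)
    common = np.logical_and(a, B).sum(axis=1)
    return xor / np.maximum(common, 1)   # guard against an empty intersection
```

Normalizing by the number of common "inside" bits is what distinguishes the spherical Hamming distance from the plain Hamming distance used with hyperplane-based codes.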
2024 | Compact Parallel Hash Tables On The GPU | Hegeman Steef, Wöltgens Daan, Wijs Anton, Laarman Alfons | Arxiv | On the GPU, hash table operation speed is determined in large part by cache line efficiency, and state-of-the-art hashing schemes thus divide tables into cache line-sized buckets. This raises the question whether performance can be further improved by increasing the number of entries that fit in such buckets. Known compact hashing techniques have not yet been adapted to the massively parallel setting, nor have they been evaluated on the GPU. We consider a compact version of bucketed cuckoo hashing, and a version of compact iceberg hashing suitable for the GPU. We discuss the tables from a theoretical perspective, and provide an open source implementation of both schemes in CUDA for comparative benchmarking. In terms of performance, the state-of-the-art cuckoo hashing benefits from compactness on lookups and insertions (most experiments show at least 10-20% increase in throughput), and the iceberg table benefits significantly, to the point of being comparable to compact cuckoo hashing–while supporting performant dynamic operation. |
|||||
2024 | Beyond Neighbourhood-preserving Transformations For Quantization-based Unsupervised Hashing | Hemati Sobhan, Tizhoosh | Arxiv | An effective unsupervised hashing algorithm leads to compact binary codes preserving the neighborhood structure of data as much as possible. One of the most established schemes for unsupervised hashing is to reduce the dimensionality of data and then find a rigid (neighbourhood-preserving) transformation that reduces the quantization error. Although employing rigid transformations is effective, we may not reduce quantization loss to the ultimate limits. As well, reducing dimensionality and quantization loss in two separate steps seems to be sub-optimal. Motivated by these shortcomings, we propose to employ both rigid and non-rigid transformations to reduce quantization error and dimensionality simultaneously. We relax the orthogonality constraint on the projection in a PCA-formulation and regularize this by a quantization term. We show that both the non-rigid projection matrix and rotation matrix contribute towards minimizing quantization loss but in different ways. A scalable nested coordinate descent approach is proposed to optimize this mixed-integer optimization problem. We evaluate the proposed method on five public benchmark datasets providing almost half a million images. Comparative results indicate that the proposed method mostly outperforms state-of-art linear methods and competes with end-to-end deep solutions. |
|||||
2024 | PHOBIC Perfect Hashing With Optimized Bucket Sizes And Interleaved Coding | Hermann Stefan, Lehmann Hans-peter, Pibiri Giulio Ermanno, Sanders Peter, Walzer Stefan | Arxiv | A minimal perfect hash function (MPHF) maps a set of n keys to {1, …, n} without collisions. Such functions find widespread application e.g. in bioinformatics and databases. In this paper we revisit PTHash - a construction technique particularly designed for fast queries. PTHash distributes the input keys into small buckets and, for each bucket, it searches for a hash function seed that places its keys in the output domain without collisions. The collection of all seeds is then stored in a compressed way. Since the first buckets are easier to place, buckets are considered in non-increasing order of size. Additionally, PTHash heuristically produces an imbalanced distribution of bucket sizes by distributing 60% of the keys into 30% of the buckets. Our main contribution is to characterize, up to lower order terms, an optimal distribution of expected bucket sizes. We arrive at a simple, closed-form solution which improves construction throughput for space-efficient configurations in practice. Our second contribution is a novel encoding scheme for the seeds. We split the keys into partitions. Within each partition, we run the bucket distribution and search step. We then store the seeds in an interleaved way by consecutively placing the seeds for the i-th buckets from all partitions. The seeds for the i-th bucket of each partition follow the same statistical distribution. This allows us to tune a compressor for each bucket. Hence, we call our technique PHOBIC - Perfect Hashing with Optimized Bucket sizes and Interleaved Coding. Compared to PTHash, PHOBIC is 0.17 bits/key more space-efficient for the same query time and construction throughput. We also contribute a GPU implementation to further accelerate MPHF construction. For a configuration with fast queries, PHOBIC-GPU can construct a perfect hash function at 2.17 bits/key in 28 ns per key, which can be queried in 37 ns on the CPU. |
|||||
2024 | One Loss For All Deep Hashing With A Single Cosine Similarity Based Learning Objective | Hoe Jiun, Ng, Zhang, Chan Chee, Song Yi-zhe, Xiang Tao | Arxiv | A deep hashing model typically has two main learning objectives: to make the learned binary hash codes discriminative and to minimize a quantization error. With further constraints such as bit balance and code orthogonality, it is not uncommon for existing models to employ a large number (>4) of losses. This leads to difficulties in model training and subsequently impedes their effectiveness. In this work, we propose a novel deep hashing model with only a single learning objective. Specifically, we show that maximizing the cosine similarity between the continuous codes and their corresponding binary orthogonal codes can ensure both hash code discriminativeness and quantization error minimization. Further, with this learning objective, code balancing can be achieved by simply using a Batch Normalization (BN) layer and multi-label classification is also straightforward with label smoothing. The result is a one-loss deep hashing model that removes all the hassles of tuning the weights of various losses. Importantly, extensive experiments show that our model is highly effective, outperforming the state-of-the-art multi-loss hashing models on three large-scale instance retrieval benchmarks, often by significant margins. |
|||||
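As a rough illustration of the single-objective idea in the entry above, the sketch below scores continuous codes against fixed binary orthogonal class targets with cosine similarity. Taking the targets from a Hadamard matrix is an assumption made only for this toy example, and the batch-normalization and label-smoothing details of the actual model are omitted.

```python
import numpy as np
from scipy.linalg import hadamard

def one_loss(cont_codes, labels, targets):
    """Single objective: negative mean cosine similarity between each
    continuous code and the binary orthogonal target of its class."""
    t = targets[labels]                                  # (N, d) in {-1, +1}
    cos = np.sum(cont_codes * t, axis=1) / (
        np.linalg.norm(cont_codes, axis=1) * np.linalg.norm(t, axis=1) + 1e-9)
    return -cos.mean()

# Toy usage: 16-bit codes, 10 classes; orthogonal binary targets taken from a
# Hadamard matrix (one common choice, assumed here for illustration).
d, num_classes = 16, 10
targets = hadamard(d)[:num_classes].astype(np.float64)
rng = np.random.default_rng(0)
cont_codes = rng.normal(size=(32, d))
labels = rng.integers(0, num_classes, size=32)
loss = one_loss(cont_codes, labels, targets)
binary_codes = np.sign(cont_codes)                       # final hash codes
```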
2024 | Hashing As Tie-aware Learning To Rank | He K., Cakir, Bargal, Sclaroff | Arxiv | Hashing, or learning binary embeddings of data, is frequently used in nearest neighbor retrieval. In this paper, we develop learning to rank formulations for hashing, aimed at directly optimizing ranking-based evaluation metrics such as Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG). We first observe that the integer-valued Hamming distance often leads to tied rankings, and propose to use tie-aware versions of AP and NDCG to evaluate hashing for retrieval. Then, to optimize tie-aware ranking metrics, we derive their continuous relaxations, and perform gradient-based optimization with deep neural networks. Our results establish the new state-of-the-art for image retrieval by Hamming ranking in common benchmarks. |
|||||
2024 | Bit-mask Robust Contrastive Knowledge Distillation For Unsupervised Semantic Hashing | He Liyang, Huang Zhenya, Liu Jiayu, Chen Enhong, Wang Fei, Sha Jing, Wang Shijin | Arxiv | Unsupervised semantic hashing has emerged as an indispensable technique for fast image search, which aims to convert images into binary hash codes without relying on labels. Recent advancements in the field demonstrate that employing large-scale backbones (e.g., ViT) in unsupervised semantic hashing models can yield substantial improvements. However, the inference delay has become increasingly difficult to overlook. Knowledge distillation provides a means for practical model compression to alleviate this delay. Nevertheless, the prevailing knowledge distillation approaches are not explicitly designed for semantic hashing. They ignore the unique search paradigm of semantic hashing, the inherent necessities of the distillation process, and the property of hash codes. In this paper, we propose an innovative Bit-mask Robust Contrastive knowledge Distillation (BRCD) method, specifically devised for the distillation of semantic hashing models. To ensure the effectiveness of two kinds of search paradigms in the context of semantic hashing, BRCD first aligns the semantic spaces between the teacher and student models through a contrastive knowledge distillation objective. Additionally, to eliminate noisy augmentations and ensure robust optimization, a cluster-based method within the knowledge distillation process is introduced. Furthermore, through a bit-level analysis, we uncover the presence of redundancy bits resulting from the bit independence property. To mitigate these effects, we introduce a bit mask mechanism in our knowledge distillation objective. Finally, extensive experiments not only showcase the noteworthy performance of our BRCD method in comparison to other knowledge distillation methods but also substantiate the generality of our methods across diverse semantic hashing models and backbones. The code for BRCD is available at https://github.com/hly1998/BRCD. |
|||||
2024 | Hybridhash Hybrid Convolutional And Self-attention Deep Hashing For Image Retrieval | He Chao, Wei Hongxi | Arxiv | Deep image hashing aims to map input images into simple binary hash codes via deep neural networks and thus enable effective large-scale image retrieval. Recently, hybrid networks that combine convolution and Transformer have achieved superior performance on various computer vision tasks and have attracted extensive attention from researchers. Nevertheless, the potential benefits of such hybrid networks in image retrieval still need to be verified. To this end, we propose a hybrid convolutional and self-attention deep hashing method known as HybridHash. Specifically, we propose a backbone network with stage-wise architecture in which the block aggregation function is introduced to achieve the effect of local self-attention and reduce the computational complexity. The interaction module has been elaborately designed to promote the communication of information between image blocks and to enhance the visual representations. We have conducted comprehensive experiments on three widely used datasets: CIFAR-10, NUS-WIDE and IMAGENET. The experimental results demonstrate that the method proposed in this paper has superior performance with respect to state-of-the-art deep hashing methods. Source code is available at https://github.com/shuaichaochao/HybridHash. |
|||||
2024 | K-nearest Neighbors Hashing | He Xiangyu, Wang, Cheng | Arxiv | Hashing based approximate nearest neighbor search embeds high dimensional data to compact binary codes, which enables efficient similarity search and storage. However, the non-isometry sign(·) function makes it hard to project the nearest neighbors in continuous data space into the closest codewords in discrete Hamming space. In this work, we revisit the sign(·) function from the perspective of space partitioning. Specifically, we bridge the gap between k-nearest neighbors and binary hashing codes with Shannon entropy. We further propose a novel K-Nearest Neighbors Hashing (KNNH) method to learn binary representations from KNN within the subspaces generated by sign(·). Theoretical and experimental results show that the KNN relation is of central importance to neighbor preserving embeddings, and the proposed method outperforms the state-of-the-art methods on benchmark datasets. |
|||||
2024 | Explicit Orthogonal Arrays And Universal Hashing With Arbitrary Parameters | Harvey Nicholas, Sahami Arvin | Arxiv | Orthogonal arrays are a type of combinatorial design that were developed in the 1940s in the design of statistical experiments. In 1947, Rao proved a lower bound on the size of any orthogonal array, and raised the problem of constructing arrays of minimum size. Kuperberg, Lovett and Peled (2017) gave a non-constructive existence proof of orthogonal arrays whose size is near-optimal (i.e., within a polynomial of Rao’s lower bound), leaving open the question of an algorithmic construction. We give the first explicit, deterministic, algorithmic construction of orthogonal arrays achieving near-optimal size for all parameters. Our construction uses algebraic geometry codes. In pseudorandomness, the notions of \(t\)-independent generators or \(t\)-independent hash functions are equivalent to orthogonal arrays. Classical constructions of \(t\)-independent hash functions are known when the size of the codomain is a prime power, but very few constructions are known for an arbitrary codomain. Our construction yields algorithmically efficient \(t\)-independent hash functions for arbitrary domain and codomain. |
|||||
2024 | Content-aware Neural Hashing For Cold-start Recommendation | Hansen Casper, Hansen, Simonsen, Alstrup, Lioma | Arxiv | Content-aware recommendation approaches are essential for providing meaningful recommendations for new (i.e., cold-start) items in a recommender system. We present a content-aware neural hashing-based collaborative filtering approach (NeuHash-CF), which generates binary hash codes for users and items, such that the highly efficient Hamming distance can be used for estimating user-item relevance. NeuHash-CF is modelled as an autoencoder architecture, consisting of two joint hashing components for generating user and item hash codes. Inspired by semantic hashing, the item hashing component generates a hash code directly from an item’s content information (i.e., it generates cold-start and seen item hash codes in the same manner). This contrasts with existing state-of-the-art models, which treat the two item cases separately. The user hash codes are generated directly based on user id, through learning a user embedding matrix. We show experimentally that NeuHash-CF significantly outperforms state-of-the-art baselines by up to 12% NDCG and 13% MRR in cold-start recommendation settings, and up to 4% in both NDCG and MRR in standard settings where all items are present while training. Our approach uses 2-4x shorter hash codes, while obtaining the same or better performance compared to the state of the art, thus consequently also enabling a notable storage reduction. |
|||||
2024 | Unsupervised Semantic Hashing With Pairwise Reconstruction | Hansen Casper, Hansen, Simonsen, Alstrup, Lioma | Arxiv | Semantic Hashing is a popular family of methods for efficient similarity search in large-scale datasets. In Semantic Hashing, documents are encoded as short binary vectors (i.e., hash codes), such that semantic similarity can be efficiently computed using the Hamming distance. Recent state-of-the-art approaches have utilized weak supervision to train better performing hashing models. Inspired by this, we present Semantic Hashing with Pairwise Reconstruction (PairRec), which is a discrete variational autoencoder based hashing model. PairRec first encodes weakly supervised training pairs (a query document and a semantically similar document) into two hash codes, and then learns to reconstruct the same query document from both of these hash codes (i.e., pairwise reconstruction). This pairwise reconstruction enables our model to encode local neighbourhood structures within the hash code directly through the decoder. We experimentally compare PairRec to traditional and state-of-the-art approaches, and obtain significant performance improvements in the task of document similarity search. |
|||||
2024 | Microsoft Turing-anns-1b | Jegou Herve | Arxiv | Microsoft Turing-ANNS-1B is a new dataset being released by the Microsoft Turing team for this competition. It consists of Bing queries encoded by Turing AGI v5 that trains Transformers to capture similarity of intent in web search queries. An early version of the RNN-based AGI Encoder is described in a SIGIR’19 paper and a blogpost. |
|||||
2024 | Private And Secure Fuzzy Name Matching | Kasyap Harsh, Atmaca Ugur Ilker, Maple Carsten, Cormode Graham, He Jiancong | Arxiv | Modern financial institutions rely on data for many operations, including a need to drive efficiency, enhance services and prevent financial crime. Data sharing across an organisation or between institutions can facilitate rapid, evidence-based decision making, including identifying money laundering and fraud. However, data privacy regulations impose restrictions on data sharing. Privacy-enhancing technologies are being increasingly employed to allow organisations to derive shared intelligence while ensuring regulatory compliance. This paper examines the case in which regulatory restrictions mean a party cannot share data on accounts of interest with another (internal or external) party to identify people that hold an account in each dataset. We observe that the names of account holders may be recorded differently in each data set. We introduce a novel privacy-preserving approach for fuzzy name matching across institutions, employing fully homomorphic encryption with locality-sensitive hashing. The efficiency of the approach is enhanced using a clustering mechanism. The practicality and effectiveness of the proposed approach are evaluated using different datasets. Experimental results demonstrate it takes around 100 and 1000 seconds to search 1000 names from 10k and 100k names, respectively. Moreover, the proposed approach exhibits significant improvement in reducing communication overhead by 30-300 times, using clustering. |
|||||
2024 | Learning Binary Hash Codes For Large-scale Image Search | Grauman Kristen, Fergus | Arxiv | Algorithms to rapidly search massive image or video collections are critical for many vision applications, including visual search, content-based retrieval, and non-parametric models for object recognition. Recent work shows that learned binary projections are a powerful way to index large collections according to their content. The basic idea is to formulate the projections so as to approximately preserve a given similarity function of interest. Having done so, one can then search the data efficiently using hash tables, or by exploring the Hamming ball volume around a novel query. Both enable sub-linear time retrieval with respect to the database size. Further, depending on the design of the projections, in some cases it is possible to bound the number of database examples that must be searched in order to achieve a given level of accuracy. This chapter overviews data structures for fast search with binary codes, and then describes several supervised and unsupervised strategies for generating the codes. In particular, we review supervised methods that integrate metric learning, boosting, and neural networks into the hash key construction, and unsupervised methods based on spectral analysis or kernelized random projections that compute affinity-preserving binary codes. Whether learning from explicit semantic supervision or exploiting the structure among unlabeled data, these methods make scalable retrieval possible for a variety of robust visual similarity measures. We focus on defining the algorithms, and illustrate the main points with results using millions of images. |
|||||
2024 | Lagrangian Hashing For Compressed Neural Field Representations | Govindarajan Shrisudhan, Sambugaro Zeno, Shabanov Akhmedkhan, Takikawa Towaki, Rebain Daniel, Sun Weiwei, Conci Nicola, Yi Kwang Moo, Tagliasacchi Andrea | Arxiv | We present Lagrangian Hashing, a representation for neural fields combining the characteristics of fast training NeRF methods that rely on Eulerian grids (i.e., InstantNGP), with those that employ points equipped with features as a way to represent information (e.g. 3D Gaussian Splatting or PointNeRF). We achieve this by incorporating a point-based representation into the high-resolution layers of the hierarchical hash tables of an InstantNGP representation. As our points are equipped with a field of influence, our representation can be interpreted as a mixture of Gaussians stored within the hash table. We propose a loss that encourages the movement of our Gaussians towards regions that require more representation budget to be sufficiently well represented. Our main finding is that our representation allows the reconstruction of signals using a more compact representation without compromising quality. |
|||||
2024 | Learning Binary Codes For High-dimensional Data Using Bilinear Projections | Gong Y., Kumar, Rowley, Lazebnik | Arxiv | Recent advances in visual recognition indicate that to achieve good retrieval and classification accuracy on large-scale datasets like ImageNet, extremely high-dimensional visual descriptors, e.g., Fisher Vectors, are needed. We present a novel method for converting such descriptors to compact similarity-preserving binary codes that exploits their natural matrix structure to reduce their dimensionality using compact bilinear projections instead of a single large projection matrix. This method achieves comparable retrieval and classification accuracy to the original descriptors and to the state-of-the-art Product Quantization approach while having orders of magnitude faster code generation time and smaller memory footprint. |
|||||
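A minimal sketch of the bilinear-projection idea from the entry above: the descriptor is reshaped into a d1 x d2 matrix and rotated on both sides by two small orthogonal matrices, so only d1² + d2² parameters are stored instead of (d1·d2)² for a single full projection. Random orthogonal factors are used here purely for illustration; the paper learns them.

```python
import numpy as np

def bilinear_binary_code(X, R1, R2):
    """Encode a descriptor reshaped as a d1 x d2 matrix X into bits via two
    small rotations instead of one huge projection: sign(R1^T X R2)."""
    return np.sign(R1.T @ X @ R2).reshape(-1)

# Toy usage: a 64x64 "matrix-shaped" descriptor (4096-D) mapped to 4096 bits
# using two 64x64 orthogonal matrices rather than a 4096x4096 projection.
rng = np.random.default_rng(0)
d1 = d2 = 64
X = rng.normal(size=(d1, d2))
R1, _ = np.linalg.qr(rng.normal(size=(d1, d1)))   # random orthogonal factors
R2, _ = np.linalg.qr(rng.normal(size=(d2, d2)))
code = bilinear_binary_code(X, R1, R2)            # 4096 bits
```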
2024 | Iterative Quantization A Procrustean Approach To Learning Binary Codes | Gong Y., Lazebnik | Arxiv | This paper addresses the problem of learning similarity preserving binary codes for efficient retrieval in large-scale image collections. We propose a simple and efficient alternating minimization scheme for finding a rotation of zero-centered data so as to minimize the quantization error of mapping this data to the vertices of a zero-centered binary hypercube. This method, dubbed iterative quantization (ITQ), has connections to multi-class spectral clustering and to the orthogonal Procrustes problem, and it can be used both with unsupervised data embeddings such as PCA and supervised embeddings such as canonical correlation analysis (CCA). Our experiments show that the resulting binary coding schemes decisively outperform several other state-of-the-art methods. |
|||||
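The alternating minimization in the ITQ entry above is short enough to sketch: with zero-centered, PCA-reduced data V, it alternates between updating the binary codes B = sign(VR) and solving the orthogonal Procrustes problem for the rotation R. A minimal NumPy version, assuming V is already centered and reduced to the target code length:

```python
import numpy as np

def itq(V, n_iter=50, seed=0):
    """Iterative Quantization: given zero-centered, PCA-reduced data V (n x c),
    alternate between B = sign(V R) and the Procrustes-optimal rotation R."""
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.normal(size=(V.shape[1], V.shape[1])))  # random initial rotation
    for _ in range(n_iter):
        B = np.sign(V @ R)                 # fix R, update binary codes
        U, _, Wt = np.linalg.svd(B.T @ V)  # fix B, solve orthogonal Procrustes
        R = (U @ Wt).T
    return np.sign(V @ R), R
```

In the paper this rotation refinement is applied on top of unsupervised (PCA) or supervised (CCA) embeddings; only the refinement step is shown here.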
2024 | Weakly Supervised Deep Image Hashing Through Tag Embeddings | Gattupalli Vijetha, Zhuo, Li | Arxiv | Many approaches to semantic image hashing have been formulated as supervised learning problems that utilize images and label information to learn the binary hash codes. However, large-scale labeled image data is expensive to obtain, thus imposing a restriction on the usage of such algorithms. On the other hand, unlabelled image data is abundant due to the existence of many Web image repositories. Such Web images may often come with image tags that contain useful information, although raw tags, in general, do not readily lead to semantic labels. Motivated by this scenario, we formulate the problem of semantic image hashing as a weakly-supervised learning problem. We utilize the information contained in the user-generated tags associated with the images to learn the hash codes. More specifically, we extract the word2vec semantic embeddings of the tags and use the information contained in them for constraining the learning. Accordingly, we name our model Weakly Supervised Deep Hashing using Tag Embeddings (WDHT). WDHT is tested for the task of semantic image retrieval and is compared against several state-of-the-art models. Results show that our approach sets a new state-of-the-art in the area of weakly supervised image hashing. |
|||||
2024 | Graph Cuts For Supervised Binary Coding | Ge T., He, Sun | Arxiv | Learning short binary codes is challenged by the inherent discrete nature of the problem. The graph cuts algorithm is a well-studied discrete label assignment solution in computer vision, but has not yet been applied to solve the binary coding problems. This is partially because it was unclear how to use it to learn the encoding (hashing) functions for out-of-sample generalization. In this paper, we formulate supervised binary coding as a single optimization problem that involves both the encoding functions and the binary label assignment. Then we apply the graph cuts algorithm to address the discrete optimization problem involved, with no continuous relaxation. This method, named as Graph Cuts Coding (GCC), shows competitive results in various datasets. |
|||||
2024 | Practical And Asymptotically Optimal Quantization Of High-dimensional Vectors In Euclidean Space For Approximate Nearest Neighbor Search | Gao Jianyang, Gou Yutong, Xu Yuexuan, Yang Yongyi, Long Cheng, Wong Raymond Chi-wing | Arxiv | Approximate nearest neighbor (ANN) query in high-dimensional Euclidean space is a key operator in database systems. For this query, quantization is a popular family of methods developed for compressing vectors and reducing memory consumption. Recently, a method called RaBitQ achieves the state-of-the-art performance among these methods. It produces better empirical performance in both accuracy and efficiency when using the same compression rate and provides rigorous theoretical guarantees. However, the method is only designed for compressing vectors at high compression rates (32x) and lacks support for achieving higher accuracy by using more space. In this paper, we introduce a new quantization method to address this limitation by extending RaBitQ. The new method inherits the theoretical guarantees of RaBitQ and achieves the asymptotic optimality in terms of the trade-off between space and error bounds as to be proven in this study. Additionally, we present efficient implementations of the method, enabling its application to ANN queries to reduce both space and time consumption. Extensive experiments on real-world datasets confirm that our method consistently outperforms the state-of-the-art baselines in both accuracy and efficiency when using the same amount of memory. |
|||||
2024 | A General Framework For Distributed Approximate Similarity Search With Arbitrary Distances | Garcia-morato Elena, Algar Maria Jesus, Alfaro Cesar, Ortega Felipe, Gomez Javier, Moguerza Javier M. | Arxiv | Similarity search is a central problem in domains such as information management and retrieval or data analysis. Many similarity search algorithms are designed or specifically adapted to metric distances. Thus, they are unsuitable for alternatives like the cosine distance, which has become quite common, for example, with embeddings and in text mining. This paper presents GDASC (General Distributed Approximate Similarity search with Clustering), a general framework for distributed approximate similarity search that accepts arbitrary distances. This framework can build a multilevel index structure, by selecting a clustering algorithm, the number of prototypes in each cluster and any arbitrary distance function. As a result, this framework effectively overcomes the limitation of using metric distances and can address situations involving cosine similarity or other non-standard similarity measures. Experimental results using k-medoids clustering in GDASC with real datasets confirm the applicability of this approach for approximate similarity search, improving the performance of extant algorithms for this purpose. |
|||||
2024 | Pfeed Generating Near Real-time Personalized Feeds Using Precomputed Embedding Similarities | Gebre Binyam, Ranta Karoliina, Elzen Stef Van Den, Kuiper Ernst, Baars Thijs, Heskes Tom | Arxiv | In personalized recommender systems, embeddings are often used to encode customer actions and items, and retrieval is then performed in the embedding space using approximate nearest neighbor search. However, this approach can lead to two challenges: 1) user embeddings can restrict the diversity of interests captured and 2) the need to keep them up-to-date requires an expensive, real-time infrastructure. In this paper, we propose a method that overcomes these challenges in a practical, industrial setting. The method dynamically updates customer profiles and composes a feed every two minutes, employing precomputed embeddings and their respective similarities. We tested and deployed this method to personalise promotional items at Bol, one of the largest e-commerce platforms of the Netherlands and Belgium. The method enhanced customer engagement and experience, leading to a significant 4.9% uplift in conversions. |
|||||
2024 | Similarity Search In High Dimensions Via Hashing | Gionis A., Indyk, Motwani | Arxiv | The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the curse of dimensionality. That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50). |
|||||
2024 | Supervised Binary Hash Code Learning With Jensen Shannon Divergence | Fan Lixin | Arxiv | This paper proposes to learn binary hash codes within a statistical learning framework, in which an upper bound of the probability of Bayes decision errors is derived for different forms of hash functions and a rigorous proof of the convergence of the upper bound is presented. Consequently, minimizing such an upper bound leads to consistent performance improvements of existing hash code learning algorithms, regardless of whether original algorithms are unsupervised or supervised. This paper also illustrates a fast hash coding method that exploits simple binary tests to achieve orders of magnitude improvement in coding speed as compared to projection based methods. |
|||||
2024 | Facebook Simsearchnet++ | Facebook/Meta | Arxiv | Facebook SimSearchNet++ is a new dataset released by Facebook for this competition. It consists of features used for image copy detection for integrity purposes. The features are generated by the Facebook SimSearchNet++ model. |
|||||
2024 | Jumpbackhash Say Goodbye To The Modulo Operation To Distribute Keys Uniformly To Buckets | Ertl Otmar | Arxiv | The distribution of keys to a given number of buckets is a fundamental task in distributed data processing and storage. A simple, fast, and therefore popular approach is to map the hash values of keys to buckets based on the remainder after dividing by the number of buckets. Unfortunately, these mappings are not stable when the number of buckets changes, which can lead to severe spikes in system resource utilization, such as network or database requests. Consistent hash algorithms can minimize remappings, but are either significantly slower than the modulo-based approach, require floating-point arithmetic, or are based on a family of hash functions rarely available in standard libraries. This paper introduces JumpBackHash, which uses only integer arithmetic and a standard pseudorandom generator. Due to its speed and simple implementation, it can safely replace the modulo-based approach to improve assignment and system stability. A production-ready Java implementation of JumpBackHash has been released as part of the Hash4j open source library. |
|||||
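For context on the consistent-hashing entry above, the sketch below contrasts the unstable modulo mapping with the earlier JumpHash of Lamping and Veach (2014). This is not JumpBackHash itself; it only shows the floating-point-based predecessor that the paper sets out to replace with pure integer arithmetic.

```python
def modulo_bucket(key_hash: int, num_buckets: int) -> int:
    """Simple but unstable: changing num_buckets remaps almost every key."""
    return key_hash % num_buckets

def jump_consistent_hash(key_hash: int, num_buckets: int) -> int:
    """Classic JumpHash (Lamping & Veach, 2014), shown only for contrast:
    consistent (roughly 1/n of keys move when a bucket is added) but it relies
    on floating-point division, a limitation JumpBackHash is designed to avoid."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key_hash = (key_hash * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key_hash >> 33) + 1)))
    return b

# Toy usage: growing the cluster from 10 to 11 buckets moves few keys under
# jump_consistent_hash but most keys under modulo_bucket.
moved = sum(jump_consistent_hash(k, 10) != jump_consistent_hash(k, 11) for k in range(10000))
```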
2024 | Deep Polarized Network For Supervised Learning Of Accurate Binary Hashing Codes | Fan Lixin, Ng, Ju, Zhang, Chan | Arxiv | This paper proposes a novel deep polarized network (DPN) for learning to hash, in which each channel in the network outputs is pushed far away from zero by employing a differentiable bit-wise hinge-like loss which is dubbed as polarization loss. Reformulated within a generic Hamming Distance Metric Learning framework [Norouzi et al., 2012], the proposed polarization loss bypasses the requirement to prepare pairwise labels for (dis-)similar items and, yet, the proposed loss strictly bounds from above the pairwise Hamming Distance based losses. The intrinsic connection between pairwise and pointwise label information, as disclosed in this paper, brings about the following methodological improvements: (a) we may directly employ the proposed differentiable polarization loss with no large deviations incurred from the target Hamming distance based loss; and (b) the subtask of assigning binary codes becomes extremely simple — even random codes assigned to each class suffice to result in state-of-the-art performances, as demonstrated in CIFAR10, NUS-WIDE and ImageNet100 datasets. |
|||||
2024 | Approximate Nearest Neighbor Search With Window Filters | Engels Joshua, Landrum Benjamin, Yu Shangdi, Dhulipala Laxman, Shun Julian | Arxiv | We define and investigate the problem of \(\textit{c-approximate window search}\): approximate nearest neighbor search where each point in the dataset has a numeric label, and the goal is to find nearest neighbors to queries within arbitrary label ranges. Many semantic search problems, such as image and document search with timestamp filters, or product search with cost filters, are natural examples of this problem. We propose and theoretically analyze a modular tree-based framework for transforming an index that solves the traditional c-approximate nearest neighbor problem into a data structure that solves window search. On standard nearest neighbor benchmark datasets equipped with random label values, adversarially constructed embeddings, and image search embeddings with real timestamps, we obtain up to a \(75\times\) speedup over existing solutions at the same level of recall. |
|||||
2024 | Simisketch Efficiently Estimating Similarity Of Streaming Multisets | Dong Fenghao, He Yang, Liang Yutong, Liu Zirui, Wu Yuhan, Chen Peiqing, Yang Tong | Arxiv | The challenge of estimating similarity between sets has been a significant concern in data science, finding diverse applications across various domains. However, previous approaches, such as MinHash, have predominantly centered around hashing techniques, which are well-suited for sets but less naturally adaptable to multisets, a common occurrence in scenarios like network streams and text data. Moreover, with the increasing prevalence of data arriving in streaming patterns, many existing methods struggle to handle cases where set items are presented in a continuous stream. Consequently, our focus in this paper is on the challenging scenario of multisets with item streams. To address this, we propose SimiSketch, a sketching algorithm designed to tackle this specific problem. The paper begins by presenting two simpler versions that employ intuitive sketches for similarity estimation. Subsequently, we formally introduce SimiSketch and leverage SALSA to enhance accuracy. To validate our algorithms, we conduct extensive testing on synthetic datasets, real-world network traffic, and text articles. Our experiments show that compared with the state-of-the-art, SimiSketch can improve the accuracy by up to 42 times, and increase the throughput by up to 360 times. The complete source code is open-sourced and available on GitHub for reference. |
|||||
2024 | The Faiss Library | Douze Matthijs, Guzhva Alexandr, Deng Chengqi, Johnson Jeff, Szilvasy Gergely, Mazaré Pierre-emmanuel, Lomeli Maria, Hosseini Lucas, Jégou Hervé | Arxiv | Vector databases typically manage large collections of embedding vectors. Currently, AI applications are growing rapidly, and so is the number of embeddings that need to be stored and indexed. The Faiss library is dedicated to vector similarity search, a core functionality of vector databases. Faiss is a toolkit of indexing methods and related primitives used to search, cluster, compress and transform vectors. This paper describes the trade-off space of vector search and the design principles of Faiss in terms of structure, approach to optimization and interfacing. We benchmark key features of the library and discuss a few selected applications to highlight its broad applicability. |
|||||
2024 | Transformer-based Clipped Contrastive Quantization Learning For Unsupervised Image Retrieval | Dubey Ayush, Dubey Shiv Ram, Singh Satish Kumar, Chu Wei-ta | Arxiv | Unsupervised image retrieval aims to learn the important visual characteristics without any given label to retrieve similar images for a given query image. The Convolutional Neural Network (CNN)-based approaches have been extensively exploited with self-supervised contrastive learning for image hashing. However, the existing approaches suffer due to lack of effective utilization of global features by CNNs and biased-ness created by false negative pairs in the contrastive learning. In this paper, we propose a TransClippedCLR model by encoding the global context of an image using Transformer having local context through patch based processing, by generating the hash codes through product quantization and by avoiding the potential false negative pairs through clipped contrastive learning. The proposed model is tested with superior performance for unsupervised image retrieval on benchmark datasets, including CIFAR10, NUS-Wide and Flickr25K, as compared to the recent state-of-the-art deep models. The results using the proposed clipped contrastive learning are greatly improved on all datasets as compared to the same backbone network with vanilla contrastive learning. |
|||||
2024 | Knn Hashing With Factorized Neighborhood Representation | Ding Kun, Huo, Fan, Pan | Arxiv | Hashing is very effective for many tasks in reducing the processing time and in compressing massive databases. Although lots of approaches have been developed to learn data-dependent hash functions in recent years, how to learn hash functions to yield good performance with acceptable computational and memory cost is still a challenging problem. Based on the observation that retrieval precision is highly related to the kNN classification accuracy, this paper proposes a novel kNN-based supervised hashing method, which learns hash functions by directly maximizing the kNN accuracy of the Hamming-embedded training data. To make it scale well to large problems, we propose a factorized neighborhood representation to parsimoniously model the neighborhood relationships inherent in training data. Considering that real-world data are often linearly inseparable, we further kernelize this basic model to improve its performance. As a result, the proposed method is able to learn accurate hashing functions with tolerable computation and storage cost. Experiments on four benchmarks demonstrate that our method outperforms the state-of-the-art methods. |
|||||
2024 | Efficient Retrieval With Learned Similarities | Ding Bailu, Zhai Jiaqi | Arxiv | Retrieval plays a fundamental role in recommendation systems, search, and natural language processing by efficiently finding relevant items from a large corpus given a query. Dot products have been widely used as the similarity function in such retrieval tasks, thanks to Maximum Inner Product Search (MIPS) that enabled efficient retrieval based on dot products. However, state-of-the-art retrieval algorithms have migrated to learned similarities. Such algorithms vary in form; the queries can be represented with multiple embeddings, complex neural networks can be deployed, the item ids can be decoded directly from queries using beam search, and multiple approaches can be combined in hybrid solutions. Unfortunately, we lack efficient solutions for retrieval in these state-of-the-art setups. Our work investigates techniques for approximate nearest neighbor search with learned similarity functions. We first prove that Mixture-of-Logits (MoL) is a universal approximator, and can express all learned similarity functions. We next propose techniques to retrieve the approximate top K results using MoL with a tight bound. We finally compare our techniques with existing approaches, showing that MoL sets new state-of-the-art results on recommendation retrieval tasks, and our approximate top-k retrieval with learned similarities outperforms baselines by up to two orders of magnitude in latency, while achieving > .99 recall rate of exact algorithms. |
|||||
2024 | Key Compression Limits For k-minimum Value Sketches | Dickens Charlie, Bax Eric, Saydakov Alexander | Arxiv | The \(k\)-Minimum Values (KMV) data sketch algorithm stores the \(k\) least hash keys generated by hashing the items in a dataset. We show that compression based on ordering the keys and encoding successive differences can offer \(O(\log n)\) bits per key in expected storage savings, where \(n\) is the number of unique values in the data set. We also show that \(O(\log n)\) expected bits saved per key is optimal for any form of compression for the \(k\) least of \(n\) random values – that the encoding method is near-optimal among all methods to encode a KMV sketch. We present a practical method to perform that compression, show that it is computationally efficient, and demonstrate that its average savings in practice is within about five percent of the theoretical minimum based on entropy. We verify that our method outperforms off-the-shelf compression methods, and we demonstrate that it is practical, using real and synthetic data. |
|||||
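A small sketch related to the KMV entry above, under the assumption that 64-bit hashes are used: the sketch keeps the k smallest hash values, and compressing them by sorting and storing successive differences is where the roughly O(log n) bits-per-key savings described above come from. The estimator variant in the toy usage is one common form, assumed here for illustration.

```python
import hashlib

def kmv_sketch(items, k):
    """Keep the k smallest 64-bit hash values seen over the items."""
    hashes = sorted({int.from_bytes(
        hashlib.blake2b(str(x).encode(), digest_size=8).digest(), 'big')
        for x in items})
    return hashes[:k]

def delta_encode(sorted_keys):
    """Store the smallest key plus successive differences; for the k smallest
    of n random 64-bit values the gaps are small, which is what makes the
    per-key savings described above possible."""
    return [sorted_keys[0]] + [b - a for a, b in zip(sorted_keys, sorted_keys[1:])]

# Toy usage: distinct-count estimate from the k-th smallest hash, then the
# compressed (delta-encoded) keys.
k = 256
sketch = kmv_sketch(range(100000), k)
estimate = (k - 1) * (2**64) / sketch[-1]   # one common KMV estimator; variants exist
deltas = delta_encode(sketch)
```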
2024 | Identity With Locality An Ideal Hash For Gene Sequence Search | Desai Aditya, Gupta Gaurav, Zhang Tianyi, Shrivastava Anshumali | Arxiv | Gene sequence search is a fundamental operation in computational genomics. Due to the petabyte scale of genome archives, most gene search systems now use hashing-based data structures such as Bloom Filters (BF). The state-of-the-art systems such as Compact bit-slicing signature index (COBS) and Repeated And Merged Bloom filters (RAMBO) use BF with Random Hash (RH) functions for gene representation and identification. The standard recipe is to cast the gene search problem as a sequence of membership problems testing if each subsequent gene substring (called kmer) of Q is present in the set of kmers of the entire gene database D. We observe that RH functions, which are crucial to the memory and the computational advantage of BF, are also detrimental to the system performance of gene-search systems. While subsequent kmers being queried are likely very similar, RH, oblivious to any similarity, uniformly distributes the kmers to different parts of potentially large BF, thus triggering excessive cache misses and causing system slowdown. We propose a novel hash function called the Identity with Locality (IDL) hash family, which co-locates the keys close in input space without causing collisions. This approach ensures both cache locality and key preservation. IDL functions can be a drop-in replacement for RH functions and help improve the performance of information retrieval systems. We give a simple but practical construction of IDL function families and show that replacing the RH with IDL functions reduces cache misses by a factor of 5x, thus improving query and indexing times of SOTA methods such as COBS and RAMBO by factors up to 2x without compromising their quality. We also provide a theoretical analysis of the false positive rate of BF with IDL functions. Our hash function is the first study that bridges Locality Sensitive Hash (LSH) and RH to obtain cache efficiency. |
|||||
2024 | MUVERA Multi-vector Retrieval Via Fixed Dimensional Encodings | Dhulipala Laxman, Hadian Majid, Jayaram Rajesh, Lee Jason, Mirrokni Vahab | Arxiv | Neural embedding models have become a fundamental component of modern information retrieval (IR) pipelines. These models produce a single embedding \(x \in \mathbb{R}^d\) per data-point, allowing for fast retrieval via highly optimized maximum inner product search (MIPS) algorithms. Recently, beginning with the landmark ColBERT paper, multi-vector models, which produce a set of embeddings per data point, have achieved markedly superior performance for IR tasks. Unfortunately, using these models for IR is computationally expensive due to the increased complexity of multi-vector retrieval and scoring. In this paper, we introduce MUVERA (MUlti-VEctor Retrieval Algorithm), a retrieval mechanism which reduces multi-vector similarity search to single-vector similarity search. This enables the usage of off-the-shelf MIPS solvers for multi-vector retrieval. MUVERA asymmetrically generates Fixed Dimensional Encodings (FDEs) of queries and documents, which are vectors whose inner product approximates multi-vector similarity. We prove that FDEs give high-quality \(\epsilon\)-approximations, thus providing the first single-vector proxy for multi-vector similarity with theoretical guarantees. Empirically, we find that FDEs achieve the same recall as prior state-of-the-art heuristics while retrieving 2-5\(\times\) fewer candidates. Compared to prior state-of-the-art implementations, MUVERA achieves consistently good end-to-end recall and latency across a diverse set of the BEIR retrieval datasets, achieving an average of 10\(\%\) improved recall with \(90\%\) lower latency. |
|||||
2024 | Collective Matrix Factorization Hashing For Multimodal Data | Ding G., Guo, Zhou | Arxiv | Nearest neighbor search methods based on hashing have attracted considerable attention for effective and efficient large-scale similarity search in the computer vision and information retrieval communities. In this paper, we study the problems of learning hash functions in the context of multimodal data for cross-view similarity search. We put forward a novel hashing method, which is referred to as Collective Matrix Factorization Hashing (CMFH). CMFH learns unified hash codes by collective matrix factorization with latent factor model from different modalities of one instance, which not only supports cross-view search but also increases the search accuracy by merging multiple view information sources. We also prove that CMFH, a similarity-preserving hashing learning method, has upper and lower boundaries. Extensive experiments verify that CMFH significantly outperforms several state-of-the-art methods on three different datasets. |
|||||
2024 | Learning Space Partitions For Nearest Neighbor Search | Dong Yihe, Indyk, Razenshteyn, Wagner | Arxiv | Space partitions of \(\mathbb{R}^d\) underlie a vast and important class of fast nearest neighbor search (NNS) algorithms. Inspired by recent theoretical work on NNS for general metric spaces (Andoni et al. 2018b,c), we develop a new framework for building space partitions reducing the problem to balanced graph partitioning followed by supervised classification. We instantiate this general approach with the KaHIP graph partitioner (Sanders and Schulz 2013) and neural networks, respectively, to obtain a new partitioning procedure called Neural Locality-Sensitive Hashing (Neural LSH). On several standard benchmarks for NNS (Aumuller et al. 2017), our experiments show that the partitions obtained by Neural LSH consistently outperform partitions found by quantization-based and tree-based methods as well as classic, data-oblivious LSH. |
|||||
2024 | Locality-sensitive Hashing Scheme Based On P-stable Distributions | Datar M., Immorlica, Indyk, Mirrokni | Arxiv | We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the lp norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the lp norm. It also yields the first known provably efficient approximate NN algorithm for the case p<1. We also show that the algorithm finds the exact near neighbor in O(log n) time for data satisfying a certain “bounded growth” condition. Unlike earlier schemes, our LSH scheme works directly on points in the Euclidean space without embeddings. Consequently, the resulting query time bound is free of large factors and is simple and easy to implement. Our experiments (on synthetic data sets) show that our data structure is up to 40 times faster than a kd-tree. |
|||||
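The hash family in the entry above has a particularly compact form, h(v) = ⌊(a·v + b)/w⌋ with a drawn from a p-stable distribution and b uniform in [0, w). A minimal sketch of the l2 case (Gaussian a is 2-stable), with illustrative parameter choices:

```python
import numpy as np

class L2LSH:
    """One k-bit hash of a (k, L)-style LSH scheme under the l2 norm:
    h(v) = floor((a . v + b) / w) with a ~ N(0, I) and b ~ Uniform[0, w)."""
    def __init__(self, dim, k=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(k, dim))     # Gaussian entries: 2-stable
        self.b = rng.uniform(0, w, size=k)
        self.w = w

    def hash(self, v):
        return tuple(np.floor((self.a @ v + self.b) / self.w).astype(int))

# Toy usage: nearby points are likely to share the same k-tuple of hash values,
# while unrelated points rarely do; L independent tables would be used in practice.
rng = np.random.default_rng(1)
lsh = L2LSH(dim=32)
x = rng.normal(size=32)
print(lsh.hash(x) == lsh.hash(x + 0.01 * rng.normal(size=32)))  # often True
print(lsh.hash(x) == lsh.hash(rng.normal(size=32)))             # rarely True
```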
2024 | Imagenet A Large-scale Hierarchical Image Database | Deng J., Dong, Socher, Li, Li, Fei-fei | Arxiv | The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond. |
|||||
2024 | Exchnet A Unified Hashing Network For Large-scale Fine-grained Image Retrieval | Cui Quan, Jiang, Wei, Li, Yoshie | Arxiv | Retrieving content-relevant images from a large-scale fine-grained dataset could suffer from intolerably slow query speed and highly redundant storage cost, due to high-dimensional real-valued embeddings which aim to distinguish subtle visual differences of fine-grained objects. In this paper, we study the novel fine-grained hashing topic to generate compact binary codes for fine-grained images, leveraging the search and storage efficiency of hash learning to alleviate the aforementioned problems. Specifically, we propose a unified end-to-end trainable network, termed as ExchNet. Based on attention mechanisms and proposed attention constraints, it can firstly obtain both local and global features to represent object parts and whole fine-grained objects, respectively. Furthermore, to ensure the discriminative ability and semantic meaning’s consistency of these part-level features across images, we design a local feature alignment approach by performing a feature exchanging operation. Later, an alternating learning algorithm is employed to optimize the whole ExchNet and then generate the final binary hash codes. Validated by extensive experiments, our proposal consistently outperforms state-of-the-art generic hashing methods on five fine-grained datasets, which shows our effectiveness. Moreover, compared with other approximate nearest neighbor methods, ExchNet achieves the best speed-up and storage reduction, revealing its efficiency and practicality. |
|||||
2024 | Binomialhash A Constant Time Minimal Memory Consistent Hash Algorithm | Coluzzi Massimo, Brocco Amos, Antonucci Alessandro | Arxiv | Consistent hashing is employed in distributed systems and networking applications to evenly and effectively distribute data across a cluster of nodes. This paper introduces BinomialHash, a consistent hashing algorithm that operates in constant time and requires minimal memory. We provide a detailed explanation of the algorithm, offer a pseudo-code implementation, and formally establish its strong theoretical guarantees. |
|||||
2024 | Know Your Neighborhood General And Zero-shot Capable Binary Function Search Powered By Call Graphlets | Collyer Joshua, Watson Tim, Phillips Iain | Arxiv | Binary code similarity detection is an important problem with applications in areas like malware analysis, vulnerability research and plagiarism detection. This paper proposes a novel graph neural network architecture combined with a novel graph data representation called call graphlets. A call graphlet encodes the neighborhood around each function in a binary executable, capturing the local and global context through a series of statistical features. A specialized graph neural network model is then designed to operate on this graph representation, learning to map it to a feature vector that encodes semantic code similarities using deep metric learning. The proposed approach is evaluated across four distinct datasets covering different architectures, compiler toolchains, and optimization levels. Experimental results demonstrate that the combination of call graphlets and the novel graph neural network architecture achieves state-of-the-art performance compared to baseline techniques across cross-architecture, mono-architecture and zero shot tasks. In addition, our proposed approach also performs well when evaluated against an out-of-domain function inlining task. Overall, the work provides a general and effective graph neural network-based solution for conducting binary code similarity detection. |
|||||
2024 | Two-stream Deep Hashing With Class-specific Centers For Supervised Image Search | Deng Cheng, Yang, Liu, Tao | Arxiv | Hashing has been widely used for large-scale approximate nearest neighbor search due to its storage and search efficiency. Recent supervised hashing research has shown that deep learning-based methods can significantly outperform nondeep methods. Most existing supervised deep hashing methods exploit supervisory signals to generate similar and dissimilar image pairs for training. However, natural images can have large intraclass and small interclass variations, which may degrade the accuracy of hash codes. To address this problem, we propose a novel two-stream ConvNet architecture, which learns hash codes with class-specific representation centers. Our basic idea is that if we can learn a unified binary representation for each class as a center and encourage hash codes of images to be close to the corresponding centers, the intraclass variation will be greatly reduced. Accordingly, we design a neural network that leverages label information and outputs a unified binary representation for each class. Moreover, we also design an image network to learn hash codes from images and force these hash codes to be close to the corresponding class-specific centers. These two neural networks are then seamlessly incorporated to create a unified, end-to-end trainable framework. Extensive experiments on three popular benchmarks corroborate that our proposed method outperforms current state-of-the-art methods. |
|||||
2024 | NUS-WIDE A Real-world Web Image Database From National University Of Singapore | Chua T., Tang, Hong, Li, Luo, Zheng | Arxiv | This paper introduces a web image dataset created by NUS’s Lab for Media Search. The dataset includes: (1) 269,648 images and the associated tags from Flickr, with a total of 5,018 unique tags; (2) six types of low-level features extracted from these images, including 64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments extracted over 5x5 fixed grid partitions, and 500-D bag of words based on SIFT descriptions; and (3) ground-truth for 81 concepts that can be used for evaluation. Based on this dataset, we highlight characteristics of Web image collections and identify four research issues on web image annotation and retrieval. We also provide the baseline results for web image annotation by learning from the tags using the traditional k-NN algorithm. The benchmark results indicate that it is possible to learn effective models from sufficiently large image dataset to facilitate general image retrieval. |
|||||
2024 | A Two-step Cross-modal Hashing By Exploiting Label Correlations And Preserving Similarity In Both Steps | Chen Zhen-duo, Wang, Li, Luo, Nie, Xin-shun | Arxiv | In this paper, we present a novel Two-stEp Cross-modal Hashing method, TECH for short, for cross-modal retrieval tasks. As a two-step method, it first learns hash codes based on semantic labels, while preserving the similarity in the original space and exploiting the label correlations in the label space. In the light of this, it is able to make better use of label information and generate better binary codes. In addition, different from other two-step methods that mainly focus on the hash codes learning, TECH adopts a new hash function learning strategy in the second step, which also preserves the similarity in the original space. Moreover, with the help of a well-designed objective function and optimization scheme, it is able to generate hash codes discretely and is scalable to large-scale data. To the best of our knowledge, it is the first cross-modal hashing method exploiting label correlations, and also the first two-step hashing model preserving the similarity while learning the hash function. Extensive experiments demonstrate that the proposed approach outperforms some state-of-the-art cross-modal hashing methods. |
|||||
2024 | Robust Unsupervised Cross-modal Hashing For Multimedia Retrieval | Cheng Miaomiao, Jing, Ng | Arxiv | With the quick development of social websites, there are more opportunities to have different media types (such as text, image, video, etc.) describing the same topic from large-scale heterogeneous data sources. To efficiently identify the inter-media correlations for multimedia retrieval, unsupervised cross-modal hashing (UCMH) has gained increased interest due to the significant reduction in computation and storage. However, most UCMH methods assume that the data from different modalities are well paired. As a result, existing UCMH methods may not achieve satisfactory performance when partially paired data are given only. In this article, we propose a new-type of UCMH method called robust unsupervised cross-modal hashing (RUCMH). The major contribution lies in jointly learning modal-specific hash function, exploring the correlations among modalities with partial or even without any pairwise correspondence, and preserving the information of original features as much as possible. The learning process can be modeled via a joint minimization problem, and the corresponding optimization algorithm is presented. A series of experiments is conducted on four real-world datasets (Wiki, MIRFlickr, NUS-WIDE, and MS-COCO). The results demonstrate that RUCMH can significantly outperform the state-of-the-art unsupervised cross-modal hashing methods, especially for the partially paired case, which validates the effectiveness of RUCMH. |
|||||
2024 | Supervised Consensus Anchor Graph Hashing For Cross Modal Retrieval | Chen Rui, Wang | Arxiv | The target of cross-modal hashing is to embed heterogeneous multimedia data into a common low-dimensional Hamming space, which plays a pivotal part in multimedia retrieval due to the emergence of big multimodal data. Recently, matrix factorization has achieved great success in cross-modal hashing. However, how to effectively use label information and local geometric structure is still a challenging problem for these approaches. To address this issue, we propose a cross-modal hashing method based on collective matrix factorization, which considers both the label consistency across different modalities and the local geometric consistency in each modality. These two elements are formulated as a graph Laplacian term in the objective function, leading to a substantial improvement on the discriminative power of latent semantic features obtained by collective matrix factorization. Moreover, the proposed method learns unified hash codes for different modalities of an instance to facilitate cross-modal search, and the objective function is solved using an iterative strategy. The experimental results on two benchmark data sets show the effectiveness of the proposed method and its superiority over state-of-the-art cross-modal hashing methods. |
|||||
2024 | Long-tail Hashing | Chen Yong, Hou, Leng, Hu, Lin, Zhang | Arxiv | Hashing, which represents data items as compact binary codes, has become an increasingly popular technique, e.g., for large-scale image retrieval, owing to its super-fast search speed as well as its extremely economical memory consumption. However, existing hashing methods all try to learn binary codes from artificially balanced datasets, which are not commonly available in real-world scenarios. In this paper, we propose Long-Tail Hashing Network (LTHNet), a novel two-stage deep hashing approach that addresses the problem of learning to hash for more realistic datasets where the data labels roughly exhibit a long-tail distribution. Specifically, the first stage is to learn relaxed embeddings of the given dataset with its long-tail characteristic taken into account via an end-to-end deep neural network; the second stage is to binarize those obtained embeddings. A critical part of LTHNet is its extended dynamic meta-embedding module which can adaptively realize visual knowledge transfer between head and tail classes, and thus enrich image representations for hashing. Our experiments have shown that LTHNet achieves dramatic performance improvements over all state-of-the-art competitors on long-tail datasets, with little or no sacrifice on balanced datasets. Further analyses reveal that while, to our surprise, directly manipulating class weights in the loss function has little effect, the extended dynamic meta-embedding module, the usage of cross-entropy loss instead of square loss, and the relatively small batch-size for training all contribute to LTHNet’s success. |
|||||
2024 | Locality-sensitive Hashing For F-divergences Mutual Information Loss And Beyond | Chen L., Esfandiari, Fu, Mirrokni | Arxiv | Computing approximate nearest neighbors in high dimensional spaces is a central problem in large-scale data mining with a wide range of applications in machine learning and data science. A popular and effective technique in computing nearest neighbors approximately is the locality-sensitive hashing (LSH) scheme. In this paper, we aim to develop LSH schemes for distance functions that measure the distance between two probability distributions, particularly for f-divergences as well as a generalization to capture mutual information loss. First, we provide a general framework to design LSH schemes for f-divergence distance functions and develop LSH schemes for the generalized Jensen-Shannon divergence and triangular discrimination in this framework. We show a two-sided approximation result for approximation of the generalized Jensen-Shannon divergence by the Hellinger distance, which may be of independent interest. Next, we show a general method of reducing the problem of designing an LSH scheme for a Krein kernel (which can be expressed as the difference of two positive definite kernels) to the problem of maximum inner product search. We exemplify this method by applying it to the mutual information loss, due to its several important applications such as model compression. |
|||||
2024 | Strongly Constrained Discrete Hashing | Chen Yong, Tian, Zhang, Wang, Zhang | Arxiv | Learning to hash is a fundamental technique widely used in large-scale image retrieval. Most existing methods for learning to hash address the involved discrete optimization problem by the continuous relaxation of the binary constraint, which usually leads to large quantization errors and consequently suboptimal binary codes. A few discrete hashing methods have emerged recently. However, they either completely ignore some useful constraints (specifically the balance and decorrelation of hash bits) or just turn those constraints into regularizers that would make the optimization easier but less accurate. In this paper, we propose a novel supervised hashing method named Strongly Constrained Discrete Hashing (SCDH) which overcomes such limitations. It can learn the binary codes for all examples in the training set, and meanwhile obtain a hash function for unseen samples with the above mentioned constraints preserved. Although the model of SCDH is fairly sophisticated, we are able to find closed-form solutions to all of its optimization subproblems and thus design an efficient algorithm that converges quickly. In addition, we extend SCDH to a kernelized version, SCDH-K. Our experiments on three large benchmark datasets have demonstrated that not only can SCDH and SCDH-K achieve substantially higher MAP scores than state-of-the-art baselines, but they also train much faster than other supervised methods. |
|||||
2024 | Towards Effective Top-n Hamming Search Via Bipartite Graph Contrastive Hashing | Chen Yankai, Fang Yixiang, Zhang Yifei, Ma Chenhao, Hong Yang, King Irwin | Arxiv | Searching on bipartite graphs serves as a fundamental task for various real-world applications, such as recommendation systems, database retrieval, and document querying. Conventional approaches rely on similarity matching in continuous Euclidean space of vectorized node embeddings. To handle intensive similarity computation efficiently, hashing techniques for graph-structured data have emerged as a prominent research direction. However, despite the retrieval efficiency in Hamming space, previous studies have encountered catastrophic performance decay. To address this challenge, we investigate the problem of hashing with Graph Convolutional Network for effective Top-N search. Our findings indicate the learning effectiveness of incorporating hashing techniques within the exploration of bipartite graph reception fields, as opposed to simply treating hashing as post-processing to output embeddings. To further enhance the model performance, we advance upon these findings and propose Bipartite Graph Contrastive Hashing (BGCH+). BGCH+ introduces a novel dual augmentation approach to both intermediate information and hash code outputs in the latent feature spaces, thereby producing more expressive and robust hash codes within a dual self-supervised learning paradigm. Comprehensive empirical analyses on six real-world benchmarks validate the effectiveness of our dual feature contrastive learning in boosting the performance of BGCH+ compared to existing approaches. |
|||||
2024 | Deep Supervised Hashing With Anchor Graph | Chen Yudong, Lai, Ding, Lin, Wong | Arxiv | Recently, a series of deep supervised hashing methods were proposed for binary code learning. However, due to the high computation cost and limited hardware memory, these methods will first select a subset from the training set, and then form a mini-batch of data to update the network in each iteration. Therefore, the remaining labeled data cannot be fully utilized and the model cannot directly obtain the binary codes of the entire training set for retrieval. To address these problems, this paper proposes an interesting regularized deep model to seamlessly integrate the advantages of deep hashing and efficient binary code learning by using the anchor graph. As such, the deep features and label matrix can be jointly used to optimize the binary codes, and the network can obtain more discriminative feedback from the linear combinations of the learned bits. Moreover, we also reveal the algorithm mechanism and its computation essence. Experiments on three large-scale datasets indicate that the proposed method achieves better retrieval performance with less training time compared to previous deep hashing methods. |
|||||
2024 | Enhanced Discrete Multi-modal Hashing More Constraints Yet Less Time To Learn | Chen Yong, Zhang, Tian, Wang, Zhang, Li | Arxiv | Due to the exponential growth of multimedia data, multi-modal hashing as a promising technique to make cross-view retrieval scalable is attracting more and more attention. However, most of the existing multi-modal hashing methods either divide the learning process unnaturally into two separate stages or treat the discrete optimization problem simplistically as a continuous one, which leads to suboptimal results. Recently, a few discrete multi-modal hashing methods that try to address such issues have emerged, but they still ignore several important discrete constraints (such as the balance and decorrelation of hash bits). In this paper, we overcome those limitations by proposing a novel method named “Enhanced Discrete Multi-modal Hashing (EDMH)” which learns binary codes and hashing functions simultaneously from the pairwise similarity matrix of data, under the aforementioned discrete constraints. Although the model of EDMH looks a lot more complex than the other models for multi-modal hashing, we are actually able to develop a fast iterative learning algorithm for it, since the subproblems of its optimization all have closed-form solutions after introducing two auxiliary variables. Our experimental results on three real-world datasets have demonstrated that EDMH not only performs much better than state-of-the-art competitors but also runs much faster than them. |
|||||
2024 | HAC Hash-grid Assisted Context For 3D Gaussian Splatting Compression | Chen Yihang, Wu Qianyi, Lin Weiyao, Harandi Mehrtash, Cai Jianfei | ECCV | 3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the substantial Gaussians and their associated attributes necessitate effective compression techniques. Nevertheless, the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. To address this, we make use of the relations between the unorganized anchors and the structured hash grid, leveraging their mutual information for context modeling, and propose a Hash-grid Assisted Context (HAC) framework for highly compact 3DGS representation. Our approach introduces a binary hash grid to establish continuous spatial consistencies, allowing us to unveil the inherent spatial relations of anchors through a carefully designed context model. To facilitate entropy coding, we utilize Gaussian distributions to accurately estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Additionally, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Importantly, our work is the pioneer to explore context-based compression for 3DGS representation, resulting in a remarkable size reduction of over \(75\times\) compared to vanilla 3DGS, while simultaneously improving fidelity, and achieving over \(11\times\) size reduction over SOTA 3DGS compression approach Scaffold-GS. Our code is available here: https://github.com/YihangChen-ee/HAC |
|||||
2024 | Deep Semantic Text Hashing With Weak Supervision | Chaidaroon Suthee, Ebesu, Fang | Arxiv | With an ever increasing amount of data available on the web, fast similarity search has become the critical component for large-scale information retrieval systems. One solution is semantic hashing which designs binary codes to accelerate similarity search. Recently, deep learning has been successfully applied to the semantic hashing problem and produces high-quality compact binary codes compared to traditional methods. However, most state-of-the-art semantic hashing approaches require large amounts of hand-labeled training data which are often expensive and time consuming to collect. The cost of getting labeled data is the key bottleneck in deploying these hashing methods. Motivated by the recent success in machine learning that makes use of weak supervision, we employ unsupervised ranking methods such as BM25 to extract weak signals from training data. We further introduce two deep generative semantic hashing models to leverage weak signals for text hashing. The experimental results on four public datasets show that our models can generate high-quality binary codes without using hand-labeled training data and significantly outperform the competitive unsupervised semantic hashing baselines. |
|||||
2024 | Variational Deep Semantic Hashing For Text Documents | Chaidaroon Suthee, Fang | Arxiv | As the amount of textual data has been rapidly increasing over the past decade, efficient similarity search methods have become a crucial component of large-scale information retrieval systems. A popular strategy is to represent original data samples by compact binary codes through hashing. A spectrum of machine learning methods have been utilized, but they often lack expressiveness and flexibility in modeling to learn effective representations. The recent advances of deep learning in a wide range of applications have demonstrated its capability to learn robust and powerful feature representations for complex data. Especially, deep generative models naturally combine the expressiveness of probabilistic generative models with the high capacity of deep neural networks, which is very suitable for text modeling. However, little work has leveraged the recent progress in deep learning for text hashing. In this paper, we propose a series of novel deep document generative models for text hashing. The first proposed model is unsupervised while the second one is supervised by utilizing document labels/tags for hashing. The third model further considers document-specific factors that affect the generation of words. The probabilistic generative formulation of the proposed models provides a principled framework for model extension, uncertainty estimation, simulation, and interpretability. Based on variational inference and reparameterization, the proposed models can be interpreted as encoder-decoder deep neural networks and thus they are capable of learning complex nonlinear distributed representations of the original documents. We conduct a comprehensive set of experiments on four public testbeds. The experimental results have demonstrated the effectiveness of the proposed supervised learning models for text hashing. |
|||||
2024 | Hashing With Binary Autoencoders | Carreira-perpinan M., Raziperchikolaei | Arxiv | An attractive approach for fast search in image databases is binary hashing, where each high-dimensional, real-valued image is mapped onto a low-dimensional, binary vector and the search is done in this binary space. Finding the optimal hash function is difficult because it involves binary constraints, and most approaches approximate the optimization by relaxing the constraints and then binarizing the result. Here, we focus on the binary autoencoder model, which seeks to reconstruct an image from the binary code produced by the hash function. We show that the optimization can be simplified with the method of auxiliary coordinates. This reformulates the optimization as alternating two easier steps: one that learns the encoder and decoder separately, and one that optimizes the code for each image. Image retrieval experiments show the resulting hash function outperforms or is competitive with state-of-the-art methods for binary hashing. |
|||||
2024 | Deep Cauchy Hashing For Hamming Space Retrieval | Cao Yue, Long, Liu, Wang | Arxiv | Due to its computation efficiency and retrieval quality, hashing has been widely applied to approximate nearest neighbor search for large-scale image retrieval, while deep hashing further improves the retrieval quality by end-to-end representation learning and hash coding. With compact hash codes, Hamming space retrieval enables the most efficient constant-time search that returns data points within a given Hamming radius to each query, by hash table lookups instead of linear scan. However, subject to the weak capability of concentrating relevant images to be within a small Hamming ball due to mis-specified loss functions, existing deep hashing methods may underperform for Hamming space retrieval. This work presents Deep Cauchy Hashing (DCH), a novel deep hashing model that generates compact and concentrated binary hash codes to enable efficient and effective Hamming space retrieval. The main idea is to design a pairwise cross-entropy loss based on the Cauchy distribution, which significantly penalizes similar image pairs with Hamming distance larger than the given Hamming radius threshold. Comprehensive experiments demonstrate that DCH can generate highly concentrated hash codes and yield state-of-the-art Hamming space retrieval performance on three datasets, NUS-WIDE, CIFAR-10, and MS-COCO. |
|||||
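The Deep Cauchy Hashing entry above hinges on a pairwise cross-entropy loss built from a Cauchy-shaped probability of similarity. A minimal numerical sketch of that idea follows; the value of gamma and the exact weighting are assumptions for illustration, not the paper's precise formulation.

```python
import numpy as np

def cauchy_similarity_prob(hamming_dist, gamma=20.0):
    """Cauchy-shaped probability that a pair is similar: near 1 for small
    Hamming distance, decaying quickly once the distance exceeds roughly gamma."""
    return gamma / (gamma + hamming_dist)

def pairwise_cauchy_cross_entropy(hamming_dist, is_similar, gamma=20.0):
    """Cross-entropy between the pair label and the Cauchy similarity probability."""
    p = cauchy_similarity_prob(hamming_dist, gamma)
    return -(is_similar * np.log(p) + (1 - is_similar) * np.log(1 - p))

d = np.array([2.0, 30.0])  # one close pair, one distant pair (Hamming distances)
print(pairwise_cauchy_cross_entropy(d, is_similar=np.array([1, 1])))  # similar pairs: being far apart is costly
print(pairwise_cauchy_cross_entropy(d, is_similar=np.array([0, 0])))  # dissimilar pairs: being too close is costly
```

In a deep model the Hamming distance would be replaced by a differentiable surrogate computed from continuous codes, which is what makes end-to-end training possible.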
2024 | Correlation Autoencoder Hashing For Supervised Cross-modal Search | Cao Yue, Long, Wang, Zhu | Arxiv | Due to its storage and query efficiency, hashing has been widely applied to approximate nearest neighbor search from large-scale datasets. While there is increasing interest in cross-modal hashing which facilitates cross-media retrieval by embedding data from different modalities into a common Hamming space, how to distill the cross-modal correlation structure effectively remains a challenging problem. In this paper, we propose a novel supervised cross-modal hashing method, Correlation Autoencoder Hashing (CAH), to learn discriminative and compact binary codes based on deep autoencoders. Specifically, CAH jointly maximizes the feature correlation revealed by bimodal data and the semantic correlation conveyed in similarity labels, while embedding them into hash codes by nonlinear deep autoencoders. Extensive experiments clearly show the superior effectiveness and efficiency of CAH against the state-of-the-art hashing methods on standard cross-modal retrieval benchmarks. |
|||||
2024 | Hashgan Deep Learning To Hash With Pair Conditional Wasserstein GAN | Cao Yue, Long, Liu, Wang | Arxiv | Deep learning to hash improves image retrieval performance by end-to-end representation learning and hash coding from training data with pairwise similarity information. Subject to the scarcity of similarity information that is often expensive to collect for many application domains, existing deep learning to hash methods may overfit the training data and result in substantial loss of retrieval quality. This paper presents HashGAN, a novel architecture for deep learning to hash, which learns compact binary hash codes from both real images and diverse images synthesized by generative models. The main idea is to augment the training data with nearly real images synthesized from a new Pair Conditional Wasserstein GAN (PC-WGAN) conditioned on the pairwise similarity information. Extensive experiments demonstrate that HashGAN can generate high-quality binary hash codes and yield state-of-the-art image retrieval performance on three benchmarks, NUS-WIDE, CIFAR-10, and MS-COCO. |
|||||
2024 | Collective Deep Quantization For Efficient Cross-modal Retrieval | Cao Yue, Long, Wang, Liu | Arxiv | Cross-modal similarity retrieval is a problem about designing a retrieval system that supports querying across content modalities, e.g., using an image to retrieve texts. This paper presents a compact coding solution for efficient cross-modal retrieval, with a focus on the quantization approach which has already shown superior performance over hashing solutions in single-modal similarity retrieval. We propose a collective deep quantization (CDQ) approach, which is the first attempt to introduce quantization in end-to-end deep architecture for cross-modal retrieval. The major contribution lies in jointly learning deep representations and the quantizers for both modalities using carefully-crafted hybrid networks and well-specified loss functions. In addition, our approach simultaneously learns the common quantizer codebook for both modalities through which the cross-modal correlation can be substantially enhanced. CDQ enables efficient and effective cross-modal retrieval using inner product distance computed based on the common codebook with fast distance table lookup. Extensive experiments show that CDQ yields state-of-the-art cross-modal retrieval results on standard benchmarks. |
|||||
2024 | Hashnet Deep Learning To Hash By Continuation | Cao Zhangjie, Long, Wang, Yu | Arxiv | Learning to hash has been widely applied to approximate nearest neighbor search for large-scale multimedia retrieval, due to its computation efficiency and retrieval quality. Deep learning to hash, which improves retrieval quality by end-to-end representation learning and hash encoding, has received increasing attention recently. Subject to the ill-posed gradient difficulty in the optimization with sign activations, existing deep learning to hash methods need to first learn continuous representations and then generate binary hash codes in a separated binarization step, which suffer from substantial loss of retrieval quality. This work presents HashNet, a novel deep architecture for deep learning to hash by continuation method with convergence guarantees, which learns exactly binary hash codes from imbalanced similarity data. The key idea is to attack the ill-posed gradient problem in optimizing deep networks with non-smooth binary activations by continuation method, in which we begin from learning an easier network with smoothed activation function and let it evolve during the training, until it eventually goes back to being the original, difficult-to-optimize deep network with the sign activation function. Comprehensive empirical evidence shows that HashNet can generate exactly binary hash codes and yield state-of-the-art multimedia retrieval performance on standard benchmarks. |
|||||
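The HashNet entry above attacks the non-differentiable sign activation by continuation: train with a smoothed activation and sharpen it over time until it approaches sign. The toy sketch below only illustrates that schedule with tanh(beta*z); the actual training loop and loss are beyond this snippet, and the beta values are illustrative.

```python
import numpy as np

def smoothed_hash(z, beta):
    """Continuation surrogate for sign(): tanh(beta * z) approaches sign(z) as beta grows."""
    return np.tanh(beta * z)

z = np.linspace(-1.0, 1.0, 5)
for beta in (1, 10, 100):            # schedule: beta is scaled up stage by stage during training
    print(beta, np.round(smoothed_hash(z, beta), 3))
print("sign:", np.sign(z))           # the exactly-binary limit the continuation converges to
```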
2024 | Adaptive Hashing For Fast Similarity Search | Cakir F., Sclaroff | Arxiv | With the staggering growth in image and video datasets, algorithms that provide fast similarity search and compact storage are crucial. Hashing methods that map the data into Hamming space have shown promise; however, many of these methods employ a batch-learning strategy in which the computational cost and memory requirements may become intractable and infeasible with larger and larger datasets. To overcome these challenges, we propose an online learning algorithm based on stochastic gradient descent in which the hash functions are updated iteratively with streaming data. In experiments with three image retrieval benchmarks, our online algorithm attains retrieval accuracy that is comparable to competing state-of-the-art batch-learning solutions, while our formulation is orders of magnitude faster and, being online, adapts to variations in the data. Moreover, our formulation yields improved retrieval performance over a recently reported online hashing technique, Online Kernel Hashing. |
|||||
2024 | Hashing With Binary Matrix Pursuit | Cakir F., He, Sclaroff | Arxiv | We propose theoretical and empirical improvements for two-stage hashing methods. We first provide a theoretical analysis on the quality of the binary codes and show that, under mild assumptions, a residual learning scheme can construct binary codes that fit any neighborhood structure with arbitrary accuracy. Secondly, we show that with high-capacity hash functions such as CNNs, binary code inference can be greatly simplified for many standard neighborhood definitions, yielding smaller optimization problems and more robust codes. Incorporating our findings, we propose a novel two-stage hashing method that significantly outperforms previous hashing studies on widely used image retrieval benchmarks. |
|||||
2024 | Early Exit Strategies For Approximate K-nn Search In Dense Retrieval | Busolin Francesco, Lucchese Claudio, Nardini Franco Maria, Orlando Salvatore, Perego Raffaele, Trani Salvatore | Arxiv | Learned dense representations are a popular family of techniques for encoding queries and documents using high-dimensional embeddings, which enable retrieval by performing approximate k nearest-neighbors search (A-kNN). A popular technique for making A-kNN search efficient is based on a two-level index, where the embeddings of documents are clustered offline and, at query processing, a fixed number N of clusters closest to the query is visited exhaustively to compute the result set. In this paper, we build upon state-of-the-art for early exit A-kNN and propose an unsupervised method based on the notion of patience, which can reach competitive effectiveness with large efficiency gains. Moreover, we discuss a cascade approach where we first identify queries that find their nearest neighbor within the closest t ≪ N clusters, and then we decide how many more to visit based on our patience approach or other state-of-the-art strategies. Reproducible experiments employing state-of-the-art dense retrieval models and publicly available resources show that our techniques improve the A-kNN efficiency with up to 5x speedups while achieving negligible effectiveness losses. All the code used is available at https://github.com/francescobusolin/faiss_pEE |
|||||
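The early-exit entry above visits the clusters of a two-level (IVF-style) index in order of centroid distance and stops once further clusters stop helping. Below is a hedged sketch of a patience-style stopping rule for 1-NN search; the data layout, the fixed patience counter, and all names are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def ivf_search_with_patience(query, centroids, clusters, nprobe=32, patience=4):
    """Scan clusters nearest-centroid first, but exit early once the best candidate
    has not improved for `patience` consecutive clusters."""
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    best_id, best_dist, stale = -1, np.inf, 0
    for c in order:
        ids, vecs = clusters[c]                      # (point ids, their vectors)
        dists = np.linalg.norm(vecs - query, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < best_dist:
            best_id, best_dist, stale = int(ids[j]), float(dists[j]), 0
        else:
            stale += 1
            if stale >= patience:                    # early exit: recent clusters did not help
                break
    return best_id, best_dist

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 32))
centroids = X[rng.choice(2000, 64, replace=False)]   # crude stand-in for k-means centroids
assign = np.argmin(np.linalg.norm(X[:, None, :] - centroids[None], axis=2), axis=1)
clusters = {c: (np.where(assign == c)[0], X[assign == c]) for c in range(64)}
print(ivf_search_with_patience(X[5], centroids, clusters))
```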
2024 | Mihash Online Hashing With Mutual Information | Cakir F., He, Bargal, Sclaroff | Arxiv | Learning-based hashing methods are widely used for nearest neighbor retrieval, and recently, online hashing methods have demonstrated good performance-complexity trade-offs by learning hash functions from streaming data. In this paper, we first address a key challenge for online hashing: the binary codes for indexed data must be recomputed to keep pace with updates to the hash functions. We propose an efficient quality measure for hash functions, based on an information-theoretic quantity, mutual information, and use it successfully as a criterion to eliminate unnecessary hash table updates. Next, we also show how to optimize the mutual information objective using stochastic gradient descent. We thus develop a novel hashing method, MIHash, that can be used in both online and batch settings. Experiments on image retrieval benchmarks (including a 2.5M image dataset) confirm the effectiveness of our formulation, both in reducing hash table recomputations and in learning high-quality hash functions. |
|||||
2024 | Min-wise Independent Permutations | Broder Andrei, Charikar Moses, Frieze Alan, Mitzenmacher Michael | Arxiv | We define and study the notion of min-wise independent families of permutations. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept; we present the solutions to some of them and list the rest as open problems. |
|||||
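The min-wise independent permutations entry above is the theory behind MinHash signatures for near-duplicate detection. A minimal sketch follows, using random affine hash functions modulo a large prime as the usual practical approximation of truly min-wise independent permutations; the function names are illustrative.

```python
import random

PRIME = (1 << 61) - 1  # large Mersenne prime for the affine hash family

def minhash_signature(tokens, num_hashes=128, seed=0):
    """One signature slot per hash function: the minimum hash value over the token set."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]
    return [min((a * hash(t) + b) % PRIME for t in tokens) for a, b in coeffs]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing slots estimates the Jaccard similarity of the two sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc1 = set("min wise independent permutations for near duplicate detection".split())
doc2 = set("min wise hashing for near duplicate document detection".split())
print(estimated_jaccard(minhash_signature(doc1), minhash_signature(doc2)))
```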
2024 | LSH Forest Self-tuning Indexes For Similarity Search | Bawa M., Condie, Ganesan | Arxiv | We consider the problem of indexing high-dimensional data for answering (approximate) similarity-search queries. Similarity indexes prove to be important in a wide variety of settings: Web search engines desire fast, parallel, main-memory-based indexes for similarity search on text data; database systems desire disk-based similarity indexes for high-dimensional data, including text and images; peer-to-peer systems desire distributed similarity indexes with low communication cost. We propose an indexing scheme called LSH Forest which is applicable in all the above contexts. Our index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by (a) eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and (b) improving on LSH’s performance guarantees for skewed data distributions while retaining the same storage and query overhead. We show how to construct this index in main memory, on disk, in parallel systems, and in peer-to-peer systems. We evaluate the design with experiments on multiple text corpora and demonstrate both the self-tuning nature and the superior performance of LSH Forest. |
|||||
2024 | Targeted Attack For Deep Hashing Based Retrieval | Bai Jiawang, Chen, Li, Wu, Guo, Xia, Yang | Arxiv | The deep hashing based retrieval method is widely adopted in large-scale image and video retrieval. However, there is little investigation on its security. In this paper, we propose a novel method, dubbed deep hashing targeted attack (DHTA), to study the targeted attack on such retrieval. Specifically, we first formulate the targeted attack as a point-to-set optimization, which minimizes the average distance between the hash code of an adversarial example and those of a set of objects with the target label. Then we design a novel component-voting scheme to obtain an anchor code as the representative of the set of hash codes of objects with the target label, whose optimality guarantee is also theoretically derived. To balance the performance and perceptibility, we propose to minimize the Hamming distance between the hash code of the adversarial example and the anchor code under the ℓ∞ restriction on the perturbation. Extensive experiments verify that DHTA is effective in attacking both deep hashing based image retrieval and video retrieval. |
|||||
2024 | Ai-based Copyright Detection Of An Image In A Video Using Degree Of Similarity And Image Hashing | Ashutosh, Pandya Rahul Jashvantbhai | Arxiv | The expanse of information available over the internet makes it difficult to identify whether a specific work is a replica or duplication of a protected work, especially for visual content. Strategies exist to identify the use of a copyrighted image within a document, but the open issues are detecting a copyrighted image used within a video, computing the degree of similarity of the copyrighted image as it appears in the video, even for parts of the video where it is not prominently featured, and finally performing classification tasks on those frames. Machine learning (ML) and artificial intelligence (AI) are vital to addressing this problem. Numerous organizations have been developing algorithms to monitor and detect the use of copyrighted work. This work concentrates on those algorithms, recognizing patterns within the data, and building a more suitable model for copyrighted image classification and detection. We use algorithms such as image processing, convolutional neural networks (CNNs), and image hashing. Keywords: copyright, artificial intelligence (AI), copyrighted image, convolutional neural network (CNN), image processing, degree of similarity, image hashing. |
|||||
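The copyright-detection entry above relies on image hashing plus a degree-of-similarity score between a protected image and video frames. The sketch below shows a generic "average hash" with a normalized Hamming similarity; it is a common perceptual-hashing baseline and an assumption on my part, not the pipeline the paper actually describes.

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """Generic perceptual average hash: block-average the grayscale image down to
    hash_size x hash_size, then threshold each cell at the mean to get a 64-bit code."""
    h, w = gray.shape
    bh, bw = h // hash_size, w // hash_size
    small = gray[:bh * hash_size, :bw * hash_size] \
        .reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return (small > small.mean()).astype(np.uint8).ravel()

def degree_of_similarity(code_a, code_b):
    """1.0 means identical codes; values near 1 flag a frame as a likely copy."""
    return 1.0 - np.count_nonzero(code_a != code_b) / code_a.size

rng = np.random.default_rng(0)
frame = rng.random((240, 320))                                   # stand-in for a video frame
tampered = np.clip(frame + 0.02 * rng.random((240, 320)), 0, 1)  # lightly perturbed copy
print(degree_of_similarity(average_hash(frame), average_hash(tampered)))
```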
2024 | Near-optimal Hashing Algorithms For Approximate Nearest Neighbor In High Dimensions | Andoni A., Indyk | Arxiv | We present an algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of \(O(d n^{1/c^2 + o(1)})\) and space \(O(d n + n^{1 + 1/c^2 + o(1)})\). This almost matches the lower bound for hashing-based algorithms recently obtained in (R. Motwani et al., 2006). We also obtain a space-efficient version of the algorithm, which uses \(d n + n \log^{O(1)} n\) space, with a query time of \(d n^{O(1/c^2)}\). Finally, we discuss practical variants of the algorithms that utilize fast bounded-distance decoders for the Leech lattice. |
|||||
2024 | Learning To Hash Robustly With Guarantees | Andoni Alexandr, Beaglehole | Arxiv | The indexing algorithms for the high-dimensional nearest neighbor search (NNS) with the best worst-case guarantees are based on the randomized Locality Sensitive Hashing (LSH), and its derivatives. In practice, many heuristic approaches exist to “learn” the best indexing method in order to speed-up NNS, crucially adapting to the structure of the given dataset. Oftentimes, these heuristics outperform the LSH-based algorithms on real datasets, but, almost always, come at the cost of losing the guarantees of either correctness or robust performance on adversarial queries, or apply to datasets with an assumed extra structure/model. In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching that of theoretical algorithms, while optimizing the hashing to the structure of the dataset (think instance-optimal algorithms) for performance on the minimum-performing query. We evaluate the algorithm’s ability to optimize for a given dataset both theoretically and practically. On the theoretical side, we exhibit a natural setting (dataset model) where our algorithm is much better than the standard theoretical one. On the practical side, we run experiments that show that our algorithm has a 1.8x and 2.1x better recall on the worst-performing queries to the MNIST and ImageNet datasets. |
|||||
2024 | Practical And Optimal LSH For Angular Distance | Andoni A., Indyk, Laarhoven | Arxiv | We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent. Unlike earlier algorithms with this property (e.g., Spherical LSH [1, 2]), our algorithm is also practical, improving upon the well-studied hyperplane LSH [3] in practice. We also introduce a multiprobe version of this algorithm and conduct an experimental evaluation on real and synthetic data sets. We complement the above positive results with a fine-grained lower bound for the quality of any LSH family for angular distance. Our lower bound implies that the above LSH family exhibits a trade-off between evaluation time and quality that is close to optimal for a natural class of LSH functions. |
|||||
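The angular-distance LSH entry above positions its family against the classical hyperplane LSH baseline. For reference, here is a minimal sketch of that baseline (sign random projections): each bit is the side of a random hyperplane, two vectors agree on a bit with probability 1 - angle(x, y)/pi, so Hamming distance between codes tracks angular distance. The dataset and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))          # database vectors
planes = rng.standard_normal((64, 32))       # one random hyperplane per code bit

# Hyperplane LSH: bit i is the sign of the projection onto hyperplane i.
codes = X @ planes > 0

query = rng.standard_normal(64)
q_code = query @ planes > 0
hamming = np.count_nonzero(codes != q_code, axis=1)
candidates = np.argsort(hamming)[:10]        # shortlist to re-rank with exact angular distance
print(candidates)
```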
2024 | Locally-adaptive Quantization For Streaming Vector Search | Aguerrebere Cecilia, Hildebrand Mark, Bhati Ishwar Singh, Willke Theodore, Tepper Mariano | Arxiv | Retrieving the most similar vector embeddings to a given query among a massive collection of vectors has long been a key component of countless real-world applications. The recently introduced Retrieval-Augmented Generation is one of the most prominent examples. For many of these applications, the database evolves over time by inserting new data and removing outdated data. In these cases, the retrieval problem is known as streaming similarity search. While Locally-Adaptive Vector Quantization (LVQ), a highly efficient vector compression method, yields state-of-the-art search performance for non-evolving databases, its usefulness in the streaming setting has not been yet established. In this work, we study LVQ in streaming similarity search. In support of our evaluation, we introduce two improvements of LVQ: Turbo LVQ and multi-means LVQ that boost its search performance by up to 28% and 27%, respectively. Our studies show that LVQ and its new variants enable blazing fast vector search, outperforming its closest competitor by up to 9.4x for identically distributed data and by up to 8.8x under the challenging scenario of data distribution shifts (i.e., where the statistical distribution of the data changes over time). We release our contributions as part of Scalable Vector Search, an open-source library for high-performance similarity search. |
|||||
2024 | SCRATCH A Scalable Discrete Matrix Factorization Hashing For Cross-modal Retrieval | Chuan-xiang, Chen, Zhang, Luo, Nie, Zhang, Xu | Arxiv | In recent years, many hashing methods have been proposed for the cross-modal retrieval task. However, there are still some issues that need to be further explored. For example, some of them relax the binary constraints to generate the hash codes, which may generate large quantization error. Although some discrete schemes have been proposed, most of them are time-consuming. In addition, most of the existing supervised hashing methods use an n x n similarity matrix during the optimization, making them unscalable. To address these issues, in this paper, we present a novel supervised cross-modal hashing method—Scalable disCRete mATrix faCtorization Hashing, SCRATCH for short. It leverages the collective matrix factorization on the kernelized features and the semantic embedding with labels to find a latent semantic space to preserve the intra- and inter-modality similarities. In addition, it incorporates the label matrix instead of the similarity matrix into the loss function. Based on the proposed loss function and the iterative optimization algorithm, it can learn the hash functions and binary codes simultaneously. Moreover, the binary codes can be generated discretely, reducing the quantization error generated by the relaxation scheme. Its time complexity is linear to the size of the dataset, making it scalable to large-scale datasets. Extensive experiments on three benchmark datasets, namely, Wiki, MIRFlickr-25K, and NUS-WIDE, have verified that our proposed SCRATCH model outperforms several state-of-the-art unsupervised and supervised hashing methods for cross-modal retrieval. |
|||||
2024 | Leveraging High-resolution Features For Improved Deep Hashing-based Image Retrieval | Berriche Aymene, Zakaria Mehdi Adjal, Baghdadi Riyadh | Arxiv | Deep hashing techniques have emerged as the predominant approach for efficient image retrieval. Traditionally, these methods utilize pre-trained convolutional neural networks (CNNs) such as AlexNet and VGG-16 as feature extractors. However, the increasing complexity of datasets poses challenges for these backbone architectures in capturing meaningful features essential for effective image retrieval. In this study, we explore the efficacy of employing high-resolution features learned through state-of-the-art techniques for image retrieval tasks. Specifically, we propose a novel methodology that utilizes High-Resolution Networks (HRNets) as the backbone for the deep hashing task, termed High-Resolution Hashing Network (HHNet). Our approach demonstrates superior performance compared to existing methods across all tested benchmark datasets, including CIFAR-10, NUS-WIDE, MS COCO, and ImageNet. This performance improvement is more pronounced for complex datasets, which highlights the need to learn high-resolution features for intricate image retrieval tasks. Furthermore, we conduct a comprehensive analysis of different HRNet configurations and provide insights into the optimal architecture for the deep hashing task |
|||||
2024 | Hashing Based Contrastive Learning For Virtual Screening | Han Jin, Hong Yun, Li Wu-jun | Arxiv | Virtual screening (VS) is a critical step in computer-aided drug discovery, aiming to identify molecules that bind to a specific target receptor like protein. Traditional VS methods, such as docking, are often too time-consuming for screening large-scale molecular databases. Recent advances in deep learning have demonstrated that learning vector representations for both proteins and molecules using contrastive learning can outperform traditional docking methods. However, given that target databases often contain billions of molecules, real-valued vector representations adopted by existing methods can still incur significant memory and time costs in VS. To address this problem, in this paper we propose a hashing-based contrastive learning method, called DrugHash, for VS. DrugHash treats VS as a retrieval task that uses efficient binary hash codes for retrieval. In particular, DrugHash designs a simple yet effective hashing strategy to enable end-to-end learning of binary hash codes for both protein and molecule modalities, which can dramatically reduce the memory and time costs with higher accuracy compared with existing methods. Experimental results show that DrugHash can outperform existing methods to achieve state-of-the-art accuracy, with a memory saving of 32\(\times\) and a speed improvement of 3.5\(\times\). |
|||||
2024 | Circulant Binary Embedding | Yu F., Kumar, Gong, Chang | Arxiv | Binary embedding of high-dimensional data requires long codes to preserve the discriminative power of the input space. Traditional binary coding methods often suffer from very high computation and storage costs in such a scenario. To address this problem, we propose Circulant Binary Embedding (CBE) which generates binary codes by projecting the data with a circulant matrix. The circulant structure enables the use of Fast Fourier Transformation to speed up the computation. Compared to methods that use unstructured matrices, the proposed method improves the time complexity from O(d^2 ) to O(d log d), and the space complexity from O(d^2) to O(d) where d is the input dimensionality. We also propose a novel time-frequency alternating optimization to learn data-dependent circulant projections, which alternatively minimizes the objective in original and Fourier domains. We show by extensive experiments that the proposed approach gives much better performance than the state-of-the-art approaches for fixed time, and provides much faster computation with no performance degradation for fixed number of bits. |
|||||
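The circulant binary embedding entry above computes sign(C_r x) for a circulant matrix C_r, and the key point is that this product is a circular convolution computable with the FFT in O(d log d). A minimal sketch of that trick follows; it omits the random sign-flipping and the learned, data-dependent projection discussed in the abstract.

```python
import numpy as np

def circulant_binary_embedding(X, seed=0):
    """h(x) = sign(C_r x): for the circulant matrix defined by r, C_r x equals the
    circular convolution of r and x, i.e. ifft(fft(r) * fft(x)), so the d x d matrix
    is never formed explicitly."""
    d = X.shape[1]
    r = np.random.default_rng(seed).standard_normal(d)   # vector defining the circulant matrix
    fr = np.fft.fft(r)
    proj = np.fft.ifft(np.fft.fft(X, axis=1) * fr, axis=1).real
    return (proj > 0).astype(np.uint8)

X = np.random.default_rng(1).standard_normal((4, 256))
codes = circulant_binary_embedding(X)
print(codes.shape)  # (4, 256): d-bit codes per input, O(d log d) time and O(d) extra space
```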
2024 | Deep Graph-neighbor Coherence Preserving Network For Unsupervised Cross-modal Hashing | Yu Jun, Zhou, Zhan, Tao | Arxiv | Unsupervised cross-modal hashing (UCMH) has become a hot topic recently. Current UCMH focuses on exploring data similarities. However, current UCMH methods calculate the similarity between two data points mainly relying on their cross-modal features. These methods suffer from inaccurate similarity problems that result in a suboptimal retrieval Hamming space, because the cross-modal features between the data are not sufficient to describe the complex data relationships, such as situations where two data points have different feature representations but share the same inherent concepts. In this paper, we devise a deep graph-neighbor coherence preserving network (DGCPN). Specifically, DGCPN stems from graph models and explores graph-neighbor coherence by consolidating the information between data and their neighbors. DGCPN regulates comprehensive similarity preserving losses by exploiting three types of data similarities (i.e., the graph-neighbor coherence, the coexistent similarity, and the intra- and inter-modality consistency) and designs a half-real and half-binary optimization strategy to reduce the quantization errors during hashing. Essentially, DGCPN addresses the inaccurate similarity problem by exploring and exploiting the data’s intrinsic relationships in a graph. We conduct extensive experiments on three public UCMH datasets. The experimental results demonstrate the superiority of DGCPN, e.g., by improving the mean average precision from 0.722 to 0.751 on MIRFlickr-25K using 64-bit hash codes to retrieve texts from images. We will release the source code package and the trained model on https://github.com/Atmegal/DGCPN. |
|||||
2024 | Unsupervised Few-bits Semantic Hashing With Implicit Topics Modeling | Ye Fanghua, Manotumruksa, Yilmaz | Arxiv | Semantic hashing is a powerful paradigm for representing texts as compact binary hash codes. The explosion of short text data has spurred the demand of few-bits hashing. However, the performance of existing semantic hashing methods cannot be guaranteed when applied to few-bits hashing because of severe information loss. In this paper, we present a simple but effective unsupervised neural generative semantic hashing method with a focus on few-bits hashing. Our model is built upon variational autoencoder and represents each hash bit as a Bernoulli variable, which allows the model to be end-to-end trainable. To address the issue of information loss, we introduce a set of auxiliary implicit topic vectors. With the aid of these topic vectors, the generated hash codes are not only low-dimensional representations of the original texts but also capture their implicit topics. We conduct comprehensive experiments on four datasets. The results demonstrate that our approach achieves significant improvements over state-of-the-art semantic hashing methods in few-bits hashing. |
|||||
2024 | Nonlinear Robust Discrete Hashing For Cross-modal Retrieval | Yang Zhan, Long, Zhu, Huang | Arxiv | Hashing techniques have recently been successfully applied to solve similarity search problems in the information retrieval field because of their significantly reduced storage and high-speed search capabilities. However, the hash codes learned from most recent cross-modal hashing methods lack the ability to comprehensively preserve adequate information, resulting in a less than desirable performance. To solve this limitation, we propose a novel method termed Nonlinear Robust Discrete Hashing (NRDH), for cross-modal retrieval. The main idea behind NRDH is motivated by the success of neural networks, i.e., nonlinear descriptors, in the field of representation learning, and the use of nonlinear descriptors instead of simple linear transformations is more in line with the complex relationships that exist between common latent representation and heterogeneous multimedia data in the real world. In NRDH, we first learn a common latent representation through nonlinear descriptors to encode complementary and consistent information from the features of the heterogeneous multimedia data. Moreover, an asymmetric learning scheme is proposed to correlate the learned hash codes with the common latent representation. Empirically, we demonstrate that NRDH is able to successfully generate a comprehensive common latent representation that significantly improves the quality of the learned hash codes. Then, NRDH adopts a linear learning strategy to fast learn the hash function with the learned hash codes. Extensive experiments performed on two benchmark datasets highlight the superiority of NRDH over several state-of-the-art methods. |
|||||
2024 | Adaptive Labeling For Deep Learning To Hash | Yang Huei-fang, Tu, Chen | Arxiv | Hash function learning has been widely used for large-scale image retrieval because of the efficiency of computation and storage. We introduce AdaLabelHash, a binary hash function learning approach via deep neural networks in this paper. In AdaLabelHash, class label representations are variables that are adapted during the backward network training procedure. We express the labels as hypercube vertices in a K-dimensional space, and the class label representations together with the network weights are updated in the learning process. As the label representations (referred to as codewords in this work) are learned from data, semantically similar classes will be assigned with the codewords that are close to each other in terms of Hamming distance in the label space. The codewords then serve as the desired output of the hash function learning, and yield compact and discriminating binary hash representations. AdaLabelHash is easy to implement, which can jointly learn label representations and infer compact binary codes from data. It is applicable to both supervised and semi-supervised hashing. Experimental results on standard benchmarks demonstrate the satisfactory performance of AdaLabelHash. |
|||||
2024 | Distillhash Unsupervised Deep Hashing By Distilling Data Pairs | Yang Erkun, Liu, Deng, Liu, Tao | Arxiv | Due to the high storage and search efficiency, hashing has become prevalent for large-scale similarity search. Particularly, deep hashing methods have greatly improved the search performance under supervised scenarios. In contrast, unsupervised deep hashing models can hardly achieve satisfactory performance due to the lack of reliable supervisory similarity signals. To address this issue, we propose a novel deep unsupervised hashing model, dubbed DistillHash, which can learn a distilled data set consisting of data pairs with confident similarity signals. Specifically, we investigate the relationship between the initial noisy similarity signals learned from local structures and the semantic similarity labels assigned by a Bayes optimal classifier. We show that under a mild assumption, some data pairs, whose labels are consistent with those assigned by the Bayes optimal classifier, can be potentially distilled. Inspired by this fact, we design a simple yet effective strategy to distill data pairs automatically and further adopt a Bayesian learning framework to learn hash functions from the distilled data set. Extensive experimental results on three widely used benchmark datasets show that the proposed DistillHash consistently achieves state-of-the-art search performance. |
|||||
2024 | Hash3d Training-free Acceleration For 3D Generation | Yang Xingyi, Wang Xinchao | Arxiv | The evolution of 3D generative modeling has been notably propelled by the adoption of 2D diffusion models. Despite this progress, the cumbersome optimization process per se presents a critical hurdle to efficiency. In this paper, we introduce Hash3D, a universal acceleration for 3D generation without model training. Central to Hash3D is the insight that feature-map redundancy is prevalent in images rendered from camera positions and diffusion time-steps in close proximity. By effectively hashing and reusing these feature maps across neighboring timesteps and camera angles, Hash3D substantially prevents redundant calculations, thus accelerating the diffusion model’s inference in 3D generation tasks. We achieve this through an adaptive grid-based hashing. Surprisingly, this feature-sharing mechanism not only speeds up the generation but also enhances the smoothness and view consistency of the synthesized 3D objects. Our experiments covering 5 text-to-3D and 3 image-to-3D models demonstrate Hash3D’s versatility in speeding up optimization, enhancing efficiency by 1.3 to 4 times. Additionally, Hash3D’s integration with 3D Gaussian splatting largely speeds up 3D model creation, reducing text-to-3D processing to about 10 minutes and image-to-3D conversion to roughly 30 seconds. The project page is at https://adamdad.github.io/hash3D/. |
|||||
2024 | Central Similarity Hashing For Efficient Image And Video Retrieval | Yuan Li, Wang, Zhang, Jie, Tay, Feng | Arxiv | Existing data-dependent hashing methods usually learn hash functions from the pairwise or triplet data relationships, which only capture the data similarity locally, and often suffer low learning efficiency and low collision rate. In this work, we propose a new global similarity metric, termed as central similarity, with which the hash codes for similar data pairs are encouraged to approach a common center and those for dissimilar pairs to converge to different centers, to improve hash learning efficiency and retrieval accuracy. We principally formulate the computation of the proposed central similarity metric by introducing a new concept, i.e. hash center that refers to a set of data points scattered in the Hamming space with sufficient mutual distance between each other. We then provide an efficient method to construct well separated hash centers by leveraging the Hadamard matrix and Bernoulli distributions. Finally, we propose the Central Similarity Hashing (CSH) that optimizes the central similarity between data points w.r.t. their hash centers instead of optimizing the local similarity. The CSH is generic and applicable to both image and video hashing. Extensive experiments on large-scale image and video retrieval demonstrate CSH can generate cohesive hash codes for similar data pairs and dispersed hash codes for dissimilar pairs, and achieve a noticeable boost in retrieval performance, i.e. 3%-20% in mAP over the previous state-of-the-art. The codes are in: https://github.com/yuanli2333/Hadamard-Matrix-for-hashing |
|||||
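The central similarity hashing entry above builds hash centers from a Hadamard matrix so that centers are mutually far apart in Hamming space. A small sketch of that construction follows, assuming the code length is a power of two; the paper also covers more classes via Bernoulli sampling, which is omitted here.

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_hash_centers(num_classes, num_bits):
    """Rows of the Hadamard matrix and their negations give binary centers whose
    pairwise Hamming distance is at least num_bits / 2."""
    H = hadamard(num_bits)                      # num_bits must be a power of two
    centers = np.vstack([H, -H])[:num_classes]
    return (centers > 0).astype(np.uint8)       # map {-1, +1} to {0, 1}

C = hadamard_hash_centers(num_classes=10, num_bits=64)
print(C.shape, np.count_nonzero(C[0] != C[1]))  # centers 0 and 1 differ in exactly 32 bits
```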
2024 | Deep Hashing By Discriminating Hard Examples | Yan Cheng, Pang, Bai, Shen, Zhou, Hancock | Arxiv | This paper tackles a rarely explored but critical problem within learning to hash, i.e., to learn hash codes that effectively discriminate hard similar and dissimilar examples, to empower large-scale image retrieval. Hard similar examples refer to image pairs from the same semantic class that demonstrate some shared appearance but have different fine-grained appearance. Hard dissimilar examples are image pairs that come from different semantic classes but exhibit similar appearance. These hard examples generally have a small distance due to the shared appearance. Therefore, effective encoding of the hard examples can well discriminate the relevant images within a small Hamming distance, enabling more accurate retrieval in the top-ranked returned images. However, most existing hashing methods cannot capture this key information as their optimization is dominated by easy examples, i.e., distant similar/dissimilar pairs that share no or limited appearance. To address this problem, we introduce a novel Gamma distribution-enabled and symmetric Kullback-Leibler divergence-based loss, which is dubbed dual hinge loss because it works similarly as imposing two smoothed hinge losses on the respective similar and dissimilar pairs. Specifically, the loss enforces exponentially variant penalization on the hard similar (dissimilar) examples to emphasize and learn their fine-grained difference. It meanwhile imposes a bounding penalization on easy similar (dissimilar) examples to prevent the dominance of the easy examples in the optimization while preserving the high-level similarity (dissimilarity). This enables our model to well encode the key information carried by both easy and hard examples. Extensive empirical results on three widely-used image retrieval datasets show that (i) our method consistently and substantially outperforms state-of-the-art competing methods using hash codes of the same length and (ii) our method can use significantly (e.g., 50%-75%) shorter hash codes to perform substantially better than, or comparably well to, the competing methods. |
|||||
2024 | Yandex DEEP-1B | Yandex Yandex | Arxiv | The Yandex DEEP-1B image descriptor dataset consists of the projected and normalized outputs from the last fully-connected layer of the GoogLeNet model, which was pretrained on the ImageNet classification task. |
|||||
2024 | Convolutional Neural Networks For Text Hashing | Xu Jiaming, Pengwang, Tian, Xu, Zhao, Wang, Hao | Arxiv | Hashing, as a popular approximate nearest neighbor search, has been widely used for large-scale similarity search. Recently, a spectrum of machine learning methods have been utilized to learn similarity-preserving binary codes. However, most of them directly encode the explicit features (keywords), which fail to preserve the accurate semantic similarities in binary code beyond keyword matching, especially on short texts. Here we propose a novel text hashing framework with convolutional neural networks. In particular, we first embed the keyword features into compact binary code with a locality preserving constraint. Meanwhile, word features and position features are fed together into a convolutional network to learn the implicit features, which are further incorporated with the explicit features to fit the pre-trained binary code. This base method can be accomplished without any external tags/labels, and three other model variations are designed to integrate tags/labels. Experimental results show the superiority of our proposed approach over several state-of-the-art hashing methods when tested on one short text dataset as well as one normal text dataset. |
|||||
2024 | Adaptive Quantization For Hashing An Information-based Approach To Learning Binary Codes | Xiong C., Chen, Chen, Johnson, Corso | Arxiv | Large-scale data mining and retrieval applications have increasingly turned to compact binary data representations as a way to achieve both fast queries and efficient data storage; many algorithms have been proposed for learning effective binary encodings. Most of these algorithms focus on learning a set of projection hyperplanes for the data and simply binarizing the result from each hyperplane, but this neglects the fact that informativeness may not be uniformly distributed across the projections. In this paper, we address this issue by proposing a novel adaptive quantization (AQ) strategy that adaptively assigns varying numbers of bits to different hyperplanes based on their information content. Our method provides an information-based schema that preserves the neighborhood structure of data points, and we jointly find the globally optimal bit-allocation for all hyperplanes. In our experiments, we compare with state-of-the-art methods on four large-scale datasets and find that our adaptive quantization approach significantly improves on traditional hashing methods. |
|||||
2024 | Harmonious Hashing | Xu B., Bu, Chen, He, Cai | Arxiv | Hashing-based fast nearest neighbor search has attracted great attention in both research and industry recently. Many existing hashing approaches encode data with projection-based hash functions and represent each projected dimension by one bit. However, dimensions with high variance carry most of the energy or information of the data but are treated the same as dimensions with low variance, which leads to serious information loss. In this paper, we introduce a novel hashing algorithm called Harmonious Hashing, which aims at learning hash functions with low information loss. Specifically, we learn a set of optimized projections to preserve the maximum cumulative energy while meeting the constraint of equal variance on each dimension as much as possible. In this way, we minimize the information loss after binarization. Despite its extreme simplicity, our method substantially outperforms many state-of-the-art hashing methods in large-scale and high-dimensional nearest neighbor search experiments. |
|||||
2024 | Supervised Hashing Via Image Representation Learning | Xia R., Pan, Lai, Liu, Yan | Arxiv | Hashing is a popular approximate nearest neighbor search approach for large-scale image retrieval. Supervised hashing, which incorporates similarity/dissimilarity information on entity pairs to improve the quality of hashing function learning, has recently received increasing attention. However, in the existing supervised hashing methods for images, an input image is usually encoded by a vector of hand-crafted visual features. Such hand-crafted feature vectors do not necessarily preserve the accurate semantic similarities of image pairs, which may often degrade the performance of hashing function learning. In this paper, we propose a supervised hashing method for image retrieval, in which we automatically learn a good image representation tailored to hashing as well as a set of hash functions. The proposed method has two stages. In the first stage, given the pairwise similarity matrix S over training images, we propose a scalable coordinate descent method to decompose S into a product HH^T, where H is a matrix with each of its rows being the approximate hash code associated with a training image. In the second stage, we propose to simultaneously learn a good feature representation for the input images as well as a set of hash functions, via a deep convolutional network tailored to the learned hash codes in H and optionally the discrete class labels of the images. Extensive empirical evaluations on three benchmark datasets with different kinds of images show that the proposed method has superior performance gains over several state-of-the-art supervised and unsupervised hashing methods. |
|||||
2024 | Deep Incremental Hashing Network For Efficient Image Retrieval | Wu Dayan, Dai, Liu, Li, Wang | Arxiv | Hashing has shown great potential in large-scale image retrieval due to its storage and computation efficiency, especially the recent deep supervised hashing methods. To achieve promising performance, deep supervised hashing methods require a large amount of training data from different classes. However, when images of new categories emerge, existing deep hashing methods have to retrain the CNN model and generate hash codes for all the database images again, which is impractical for large-scale retrieval system. In this paper, we propose a novel deep hashing framework, called Deep Incremental Hashing Network (DIHN), for learning hash codes in an incremental manner. DIHN learns the hash codes for the new coming images directly, while keeping the old ones unchanged. Simultaneously, a deep hash function for query set is learned by preserving the similarities between training points. Extensive experiments on two widely used image retrieval benchmarks demonstrate that the proposed DIHN framework can significantly decrease the training time while keeping the state-of-the-art retrieval accuracy. |
|||||
2024 | Sign-guided Bipartite Graph Hashing For Hamming Space Search | Wu Xueyi | Arxiv | Bipartite graph hashing (BGH) is extensively used for Top-K search in Hamming space at low storage and inference costs. Recent research adopts graph convolutional hashing for BGH and has achieved the state-of-the-art performance. However, the contributions of its various influencing factors to hashing performance have not been explored in-depth, including the same/different sign count between two binary embeddings during Hamming space search (sign property), the contribution of sub-embeddings at each layer (model property), the contribution of different node types in the bipartite graph (node property), and the combination of augmentation methods. In this work, we build a lightweight graph convolutional hashing model named LightGCH by mainly removing the augmentation methods of the state-of-the-art model BGCH. By analyzing the contributions of each layer and node type to performance, as well as analyzing the Hamming similarity statistics at each layer, we find that the actual neighbors in the bipartite graph tend to have low Hamming similarity at the shallow layer, and all nodes tend to have high Hamming similarity at the deep layers in LightGCH. To tackle these problems, we propose a novel sign-guided framework SGBGH to make improvement, which uses sign-guided negative sampling to improve the Hamming similarity of neighbors, and uses sign-aware contrastive learning to help nodes learn more uniform representations. Experimental results show that SGBGH outperforms BGCH and LightGCH significantly in embedding quality. |
|||||
2024 | Online Hashing With Efficient Updating Of Binary Codes | Weng Zhenyu, Zhu | Arxiv | Online hashing methods are efficient in learning the hash functions from the streaming data. However, when the hash functions change, the binary codes for the database have to be recomputed to guarantee the retrieval accuracy. Recomputing the binary codes by accumulating the whole database brings a timeliness challenge to the online retrieval process. In this paper, we propose a novel online hashing framework to update the binary codes efficiently without accumulating the whole database. In our framework, the hash functions are fixed and the projection functions are introduced to learn online from the streaming data. Therefore, inefficient updating of the binary codes by accumulating the whole database can be transformed to efficient updating of the binary codes by projecting the binary codes into another binary space. The queries and the binary code database are projected asymmetrically to further improve the retrieval accuracy. The experiments on two multi-label image databases demonstrate the effectiveness and the efficiency of our method for multi-label image retrieval. |
|||||
2024 | Towards Optimal Deep Hashing Via Policy Gradient | Yuan Xin, Ren, Lu, Zhou | Arxiv | In this paper, we propose a simple yet effective relaxation-free method to learn more effective binary codes via policy gradient for scalable image search. While a variety of deep hashing methods have been proposed in recent years, most of them are confronted with the dilemma of obtaining optimal binary codes in a truly end-to-end manner with non-smooth sign activations. Unlike existing methods, which usually employ a general relaxation framework to adapt to gradient-based algorithms, our approach formulates the non-smooth part of the hashing network as sampling with a stochastic policy, so that the retrieval performance degradation caused by the relaxation can be avoided. Specifically, our method directly generates the binary codes and maximizes the expectation of rewards for similarity preservation, where the network can be trained directly via policy gradient. Hence, the differentiation challenge for discrete optimization can be naturally addressed, which leads to effective gradients and binary codes. Extensive experimental results on three benchmark datasets validate the effectiveness of the proposed method. |
|||||
2024 | A-net Learning Attribute-aware Hash Codes For Large-scale Fine-grained Image Retrieval | Wei Xiu-shen, Shen Yang, Sun, Ye, Yang | Arxiv | Our work focuses on tackling large-scale fine-grained image retrieval as ranking the images depicting the concept of interest (i.e., the same sub-category labels) highest based on the fine-grained details in the query. It is desirable to alleviate the challenges of both the fine-grained nature of small inter-class yet large intra-class variations and the explosive growth of fine-grained data for such a practical task. In this paper, we propose an Attribute-Aware hashing Network (A-Net) for generating attribute-aware hash codes to not only make the retrieval process efficient, but also establish explicit correspondences between hash codes and visual attributes. Specifically, based on the visual representations captured by attention, we develop an encoder-decoder network with a reconstruction task to distill, in an unsupervised manner, high-level attribute-specific vectors from the appearance-specific visual representations without attribute annotations. A-Net is also equipped with a feature decorrelation constraint upon these attribute vectors to enhance their representation abilities. Finally, the required hash codes are generated by the attribute vectors driven by preserving original similarities. Qualitative experiments on five benchmark fine-grained datasets show our superiority over competing methods. More importantly, quantitative results demonstrate the obtained hash codes can strongly correspond to certain kinds of crucial properties of fine-grained objects. |
|||||
2024 | Contrastive Masked Auto-encoders Based Self-supervised Hashing For 2D Image And 3D Point Cloud Cross-modal Retrieval | Wei Rukai, Cui Heng, Liu Yu, Hou Yufeng, Xie Yanzhao, Zhou Ke | Arxiv | Implementing cross-modal hashing between 2D images and 3D point-cloud data is a growing concern in real-world retrieval systems. Simply applying existing cross-modal approaches to this new task fails to adequately capture latent multi-modal semantics and effectively bridge the modality gap between 2D and 3D. To address these issues without relying on hand-crafted labels, we propose contrastive masked autoencoders based self-supervised hashing (CMAH) for retrieval between images and point-cloud data. We start by contrasting 2D-3D pairs and explicitly constraining them into a joint Hamming space. This contrastive learning process ensures robust discriminability for the generated hash codes and effectively reduces the modality gap. Moreover, we utilize multi-modal auto-encoders to enhance the model’s understanding of multi-modal semantics. By completing the masked image/point-cloud data modeling task, the model is encouraged to capture more localized clues. In addition, the proposed multi-modal fusion block facilitates fine-grained interactions among different modalities. Extensive experiments on three public datasets demonstrate that the proposed CMAH significantly outperforms all baseline methods. |
|||||
2024 | Uncertainty-aware Unsupervised Video Hashing | Wang Yucheng, Zhou, Sun, Qian | Arxiv | Learning to hash has become popular for video retrieval due to its fast speed and low storage consumption. Previous efforts formulate video hashing as training a binary auto-encoder, for which noncontinuous latent representations are optimized by the biased straight-through (ST) back-propagation heuristic. We propose to formulate video hashing as learning a discrete variational auto-encoder with the factorized Bernoulli latent distribution, termed as Bernoulli variational auto-encoder (BerVAE). The corresponding evidence lower bound (ELBO) in our BerVAE implementation leads to closed-form gradient expression, which can be applied to achieve principled training along with some other unbiased gradient estimators. BerVAE enables uncertainty-aware video hashing by predicting the probability distribution of video hash code-words, thus providing reliable uncertainty quantification. Experiments on both simulated and real-world large-scale video data demonstrate that our BerVAE trained with unbiased gradient estimators can achieve the state-of-the-art retrieval performance. Furthermore, we show that quantified uncertainty is highly correlated to video retrieval performance, which can be leveraged to further improve the retrieval accuracy. Our code is available at https://github.com/wangyucheng1234/BerVAE |
|||||
2024 | Sequential Projection Learning For Hashing With Compact Codes | Wang J., Kumar, Chang | Arxiv | Hashing-based Approximate Nearest Neighbor (ANN) search has attracted much attention due to its fast query time and drastically reduced storage. However, most of the hashing methods either use random projections or extract principal directions from the data to derive hash functions. The resulting embedding suffers from poor discrimination when compact codes are used. In this paper, we propose a novel data-dependent projection learning method such that each hash function is designed to correct the errors made by the previous one sequentially. The proposed method easily adapts to both unsupervised and semi-supervised scenarios and shows significant performance gains over the state-of-the-art methods on two large datasets containing up to 1 million points. |
|||||
2024 | Semi-supervised Deep Quantization For Cross-modal Search | Wang Xin, Zhu, Liu | Arxiv | The problem of cross-modal similarity search, which aims at making efficient and accurate queries across multiple domains, has become a significant and important research topic. Composite quantization, a compact coding solution superior to hashing techniques, has shown its effectiveness for similarity search. However, most existing works utilizing composite quantization to search multi-domain content only consider either pairwise similarity information or class label information across different domains, which fails to tackle the semi-supervised problem in composite quantization. In this paper, we address the semi-supervised quantization problem by considering: (i) pairwise similarity information (without class label information) across different domains, which captures the intra-document relation, (ii) cross-domain data with class label which can help capture inter-document relation, and (iii) cross-domain data with neither pairwise similarity nor class label which enables the full use of abundant unlabelled information. To the best of our knowledge, we are the first to consider both supervised information (pairwise similarity + class label) and unsupervised information (neither pairwise similarity nor class label) simultaneously in composite quantization. A challenging problem arises: how can we jointly handle these three sorts of information across multiple domains in an efficient way? To tackle this challenge, we propose a novel semi-supervised deep quantization (SSDQ) model that takes both supervised and unsupervised information into account. The proposed SSDQ model is capable of incorporating the above three kinds of information into one single framework when utilizing composite quantization for accurate and efficient queries across different domains. More specifically, we employ a modified deep autoencoder for better latent representation and formulate pairwise similarity loss, supervised quantization loss as well as unsupervised distribution match loss to handle all three types of information. The extensive experiments demonstrate the significant improvement of SSDQ over several state-of-the-art methods on various datasets. |
|||||
2024 | A Survey On Learning To Hash | Wang Jingdong, Zhang, Song, Sebe, Shen | Arxiv | Nearest neighbor search is a problem of finding the data points from the database such that the distances from them to the query point are the smallest. Learning to hash is one of the major solutions to this problem and has been widely studied recently. In this paper, we present a comprehensive survey of the learning to hash algorithms, categorize them according to the manners of preserving the similarities into: pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, as well as quantization, and discuss their relations. We separate quantization from pairwise similarity preserving as the objective function is very different though quantization, as we show, can be derived from preserving the pairwise similarities. In addition, we present the evaluation protocols, and the general performance analysis, and point out that the quantization algorithms perform superiorly in terms of search accuracy. |
|||||
2024 | Weakly Supervised Deep Hyperspherical Quantization For Image Retrieval | Wang Jinpeng, Chen Bin, Zhang Qiang, Meng Zaiqiao, Liang Shangsong, Xia Shu-tao | Arxiv | Deep quantization methods have shown high efficiency on large-scale image retrieval. However, current models heavily rely on ground-truth information, hindering the application of quantization in label-hungry scenarios. A more realistic demand is to learn from inexhaustible uploaded images that are associated with informal tags provided by amateur users. Though such sketchy tags do not obviously reveal the labels, they actually contain useful semantic information for supervising deep quantization. To this end, we propose Weakly-Supervised Deep Hyperspherical Quantization (WSDHQ), which is the first work to learn deep quantization from weakly tagged images. Specifically, 1) we use word embeddings to represent the tags and enhance their semantic information based on a tag correlation graph. 2) To better preserve semantic information in quantization codes and reduce quantization error, we jointly learn semantics-preserving embeddings and a supervised quantizer on the hypersphere by employing a well-designed fusion layer and tailor-made loss functions. Extensive experiments show that WSDHQ can achieve state-of-the-art performance on weakly-supervised compact coding. Code is available at https://github.com/gimpong/AAAI21-WSDHQ. |
|||||
2024 | Prototype-supervised Adversarial Network For Targeted Attack Of Deep Hashing | Wang Xunguang, Zhang, Wu, Shen, Lu | Arxiv | Due to its powerful capability of representation learning and high-efficiency computation, deep hashing has made significant progress in large-scale image retrieval. However, deep hashing networks are vulnerable to adversarial examples, which is a practical security problem that is seldom studied in the hashing-based retrieval field. In this paper, we propose a novel prototype-supervised adversarial network (ProS-GAN), which formulates a flexible generative architecture for efficient and effective targeted hashing attacks. To the best of our knowledge, this is the first generation-based method to attack deep hashing networks. Generally, our proposed framework consists of three parts, i.e., a PrototypeNet, a generator and a discriminator. Specifically, the designed PrototypeNet embeds the target label into the semantic representation and learns the prototype code as the category-level representative of the target label. Moreover, the semantic representation and the original image are jointly fed into the generator for flexible targeted attack. Particularly, the prototype code is adopted to supervise the generator to construct the targeted adversarial example by minimizing the Hamming distance between the hash code of the adversarial example and the prototype code. Furthermore, the generator is trained against the discriminator to simultaneously encourage the adversarial examples to be visually realistic and the semantic representation to be informative. Extensive experiments verify that the proposed framework can efficiently produce adversarial examples with better targeted attack performance and transferability over state-of-the-art targeted attack methods of deep hashing. |
|||||
2024 | Online Collective Matrix Factorization Hashing For Large-scale Cross-media Retrieval | Wang Di, Wang, An, Gao, Tian | Arxiv | Cross-modal hashing has been widely investigated recently for its efficiency in large-scale cross-media retrieval. However, most existing cross-modal hashing methods learn hash functions in a batch-based learning mode. Such a mode is not suitable for large-scale data sets due to its large memory consumption, and it loses its efficiency when training on streaming data. Online cross-modal hashing can deal with the above problems by learning the hash model in an online learning process. However, existing online cross-modal hashing methods cannot update hash codes of old data by the newly learned model. In this paper, we propose Online Collective Matrix Factorization Hashing (OCMFH) based on collective matrix factorization hashing (CMFH), which can adaptively update hash codes of old data according to dynamic changes of the hash model without accessing old data. Specifically, it learns discriminative hash codes for streaming data by collective matrix factorization in an online optimization scheme. Unlike conventional CMFH, which needs to load all data points into memory, the proposed OCMFH retrains hash functions using only newly arriving data points. Meanwhile, it generates hash codes of new data and updates hash codes of old data by the latest updated hash model. In this way, hash codes of new data and old data are well matched. Furthermore, a zero mean strategy is developed to solve the mean-varying problem in the online hash learning process. Extensive experiments on three benchmark data sets demonstrate the effectiveness and efficiency of OCMFH on online cross-media retrieval. |
|||||
2024 | RREH Reconstruction Relations Embedded Hashing For Semi-paired Cross-modal Retrieval | Wang Jianzong, Shi Haoxiang, Luo Kaiyi, Zhang Xulong, Cheng Ning, Xiao Jing | Arxiv | Known for efficient computation and easy storage, hashing has been extensively explored in cross-modal retrieval. The majority of current hashing models are predicated on the premise of a direct one-to-one mapping between data points. However, in real practice, data correspondence across modalities may be only partially provided. In this research, we introduce an innovative unsupervised hashing technique designed for semi-paired cross-modal retrieval tasks, named Reconstruction Relations Embedded Hashing (RREH). RREH assumes that multi-modal data share a common subspace. For paired data, RREH explores the latent consistent information of heterogeneous modalities by seeking a shared representation. For unpaired data, to effectively capture the latent discriminative features, the high-order relationships between unpaired data and anchors, computed by efficient linear reconstruction, are embedded into the latent subspace. The anchors are sampled from paired data, which improves the efficiency of hash learning. RREH trains the underlying features and the binary encodings in a unified framework with high-order reconstruction relations preserved. With a well-devised objective function and discrete optimization algorithm, RREH is designed to be scalable, making it suitable for large-scale datasets and facilitating efficient cross-modal retrieval. In the evaluation process, the proposed method is tested with partially paired data to establish its superiority over several existing methods. |
|||||
2024 | IDEA An Invariant Perspective For Efficient Domain Adaptive Image Retrieval | Wang Haixin, Wu, Sun, Zhang, Chen, Hua, Luo | Arxiv | In this paper, we investigate the problem of unsupervised domain adaptive hashing, which leverages knowledge from a label-rich source domain to expedite learning to hash on a label-scarce target domain. Although numerous existing approaches attempt to incorporate transfer learning techniques into deep hashing frameworks, they often neglect the essential invariance for adequate alignment between these two domains. Worse yet, these methods fail to distinguish between causal and non-causal effects embedded in images, rendering cross-domain retrieval ineffective. To address these challenges, we propose an Invariance-acquired Domain AdaptivE HAshing (IDEA) model. Our IDEA first decomposes each image into a causal feature representing label information, and a non-causal feature indicating domain information. Subsequently, we generate discriminative hash codes using causal features with consistency learning on both source and target domains. More importantly, we employ a generative model for synthetic samples to simulate the intervention of various non-causal effects, ultimately minimizing their impact on hash codes for domain invariance. Comprehensive experiments conducted on benchmark datasets validate the superior performance of our IDEA compared to a variety of competitive baselines. |
|||||
2024 | Hamming Compatible Quantization For Hashing | Wang Z., Duan, Lin, Wang, Gao | Arxiv | Hashing is one of the effective techniques for fast Approximate Nearest Neighbour (ANN) search. Traditional single-bit quantization (SBQ) in most hashing methods incurs substantial quantization error, which seriously degrades the search performance. To address the limitation of SBQ, researchers have proposed promising multi-bit quantization (MBQ) methods to quantize each projection dimension with multiple bits. However, some MBQ methods need to adopt a specific distance for binary code matching instead of the original Hamming distance, which would significantly decrease the retrieval speed. Two typical MBQ methods, Hierarchical Quantization and Double Bit Quantization, retain the Hamming distance, but both of them only consider the projection dimensions during quantization, ignoring the neighborhood structure of raw data inherent in Euclidean space. In this paper, we propose a multi-bit quantization method named Hamming Compatible Quantization (HCQ) to preserve the capability of similarity metric between Euclidean space and Hamming space by utilizing the neighborhood structure of raw data. Extensive experimental results show that our approach significantly improves the performance of various state-of-the-art hashing methods while maintaining fast retrieval speed. |
|||||
2024 | Deep Collaborative Discrete Hashing With Semantic-invariant Structure | Wang Zijian, Zhang, Huang | Arxiv | Existing deep hashing approaches fail to fully explore semantic correlations and neglect the effect of linguistic context on visual attention learning, leading to inferior performance. This paper proposes a dual-stream learning framework, dubbed Deep Collaborative Discrete Hashing (DCDH), which constructs a discriminative common discrete space by collaboratively incorporating the shared and individual semantics deduced from visual features and semantic labels. Specifically, the context-aware representations are generated by employing the outer product of visual embeddings and semantic encodings. Moreover, we reconstruct the labels and introduce the focal loss to take advantage of frequent and rare concepts. The common binary code space is built on the joint learning of the visual representations attended by language, the semantic-invariant structure construction and the label distribution correction. Extensive experiments demonstrate the superiority of our method. |
|||||
2024 | Hashing with Uncertainty Quantification via Sampling-based Hypothesis Testing | Yucheng Wang, Mingyuan Zhou, Xiaoning Qian | TMLR | To quantify different types of uncertainty when deriving hash-codes for image retrieval, we develop a probabilistic hashing model (ProbHash). Sampling-based hypothesis testing is then derived for hashing with uncertainty quantification (HashUQ) in ProbHash to improve the granularity of hashing-based retrieval by prioritizing the data with confident hash-codes. HashUQ can drastically improve the retrieval performance without sacrificing computational efficiency. For efficient deployment of HashUQ in real-world applications, we discretize the quantified uncertainty to reduce the potential storage overhead. Experimental results show that our HashUQ can achieve state-of-the-art retrieval performance on three image datasets. Ablation experiments on model hyperparameters, different model components, and effects of UQ are also provided with performance comparisons. Our code is available at https://github.com/QianLab/HashUQ. |
|||||
2024 | Neural Locality Sensitive Hashing For Entity Blocking | Wang Runhui, Kong Luyang, Tao Yefan, Borthwick Andrew, Golac Davor, Johnson Henrik, Hijazi Shadie, Deng Dong, Zhang Yongfeng | Arxiv | Locality-sensitive hashing (LSH) is a fundamental algorithmic technique widely employed in large-scale data processing applications, such as nearest-neighbor search, entity resolution, and clustering. However, its applicability in some real-world scenarios is limited due to the need for careful design of hashing functions that align with specific metrics. Existing LSH-based Entity Blocking solutions primarily rely on generic similarity metrics such as Jaccard similarity, whereas practical use cases often demand complex and customized similarity rules surpassing the capabilities of generic similarity metrics. Consequently, designing LSH functions for these customized similarity rules presents considerable challenges. In this research, we propose a neuralization approach to enhance locality-sensitive hashing by training deep neural networks to serve as hashing functions for complex metrics. We assess the effectiveness of this approach within the context of the entity resolution problem, which frequently involves the use of task-specific metrics in real-world applications. Specifically, we introduce NLSHBlock (Neural-LSH Block), a novel blocking methodology that leverages pre-trained language models, fine-tuned with a novel LSH-based loss function. Through extensive evaluations conducted on a diverse range of real-world datasets, we demonstrate the superiority of NLSHBlock over existing methods, exhibiting significant performance improvements. Furthermore, we showcase the efficacy of NLSHBlock in enhancing the performance of the entity matching phase, particularly within the semi-supervised setting. |
|||||
2024 | Semantic Topic Multimodal Hashing For Cross-media Retrieval | Wang Di, Gao, He | Arxiv | Multimodal hashing is essential to cross-media similarity search for its low storage cost and fast query speed. Most existing multimodal hashing methods embed heterogeneous data into a common low-dimensional Hamming space, and then round the continuous embeddings to obtain the binary codes. Yet they usually neglect the inherent discrete nature of hashing by relaxing the discrete constraints, which causes degraded retrieval performance, especially for long codes. For this purpose, a novel Semantic Topic Multimodal Hashing (STMH) is developed by considering latent semantic information in the coding procedure. It first discovers clustering patterns of texts and robustly factorizes the matrix of images to obtain multiple semantic topics of texts and concepts of images. Then the learned multimodal semantic features are transformed into a common subspace by their correlations. Finally, each bit of the unified hash code can be generated directly by figuring out whether a topic or concept is contained in a text or an image. Therefore, the model obtained by STMH is more suitable for a hashing scheme as it directly learns discrete hash codes in the coding process. Experimental results demonstrate that the proposed method outperforms several state-of-the-art methods. |
|||||
2024 | Multidimensional Spectral Hashing | Weiss Y., Fergus, Torralba | Arxiv | Recent years have seen a surge of interest in methods based on “semantic hashing”, i.e., compact binary codes of data-points so that the Hamming distance between codewords correlates with similarity. In reviewing and comparing existing methods, we show that their relative performance can change drastically depending on the definition of ground-truth neighbors. Motivated by this finding, we propose a new formulation for learning binary codes which seeks to reconstruct the affinity between datapoints, rather than their distances. We show that this criterion is intractable to solve exactly, but a spectral relaxation gives an algorithm where the bits correspond to thresholded eigenvectors of the affinity matrix, and as the number of datapoints goes to infinity these eigenvectors converge to eigenfunctions of Laplace-Beltrami operators, similar to the recently proposed Spectral Hashing (SH) method. Unlike SH, whose performance may degrade as the number of bits increases, the optimal code using our formulation is guaranteed to faithfully reproduce the affinities as the number of bits increases. We show that the number of eigenfunctions needed may increase exponentially with dimension, but we introduce a “kernel trick” that allows us to compute with an exponentially large number of bits using only memory and computation that grow linearly with dimension. Experiments show that MDSH outperforms the state of the art, especially in the challenging regime of small distance thresholds. |
|||||
2024 | Affinity Preserving Quantization For Hashing A Vector Quantization Approach To Learning Compact Binary Codes | Wang Z., Duan, Huang, Gao | Arxiv | Hashing techniques are powerful for approximate nearest neighbour (ANN) search. Existing quantization methods in hashing are all focused on scalar quantization (SQ) which is inferior in utilizing the inherent data distribution. In this paper, we propose a novel vector quantization (VQ) method named affinity preserving quantization (APQ) to improve the quantization quality of projection values, which has significantly boosted the performance of state-of-the-art hashing techniques. In particular, our method incorporates the neighbourhood structure in the pre- and post-projection data space into vector quantization. APQ minimizes the quantization errors of projection values as well as the loss of affinity property of original space. An effective algorithm has been proposed to solve the joint optimization problem in APQ, and the extension to larger binary codes has been resolved by applying product quantization to APQ. Extensive experiments have shown that APQ consistently outperforms the state-of-the-art quantization methods, and has significantly improved the performance of various hashing techniques. |
|||||
2024 | To Be Continuous Or To Be Discrete Those Are Bits Of Questions | Wang Yiran, Utiyama Masao | Arxiv | Recently, binary representation has been proposed as a novel representation that lies between continuous and discrete representations. It exhibits considerable information-preserving capability when being used to replace continuous input vectors. In this paper, we investigate the feasibility of further introducing it to the output side, aiming to allow models to output binary labels instead. To preserve the structural information on the output side along with label information, we extend the previous contrastive hashing method as structured contrastive hashing. More specifically, we upgrade CKY from label-level to bit-level, define a new similarity function with span marginal probabilities, and introduce a novel contrastive loss function with a carefully designed instance selection strategy. Our model achieves competitive performance on various structured prediction tasks, and demonstrates that binary representation can be considered a novel representation that further bridges the gap between the continuous nature of deep learning and the discrete intrinsic property of natural languages. |
|||||
2024 | Spectral Hashing | Weiss Y., Torralba, Fergus | Arxiv | Semantic hashing seeks compact binary codes of data-points so that the Hamming distance between codewords correlates with semantic similarity. In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP-hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state of the art. |
|||||
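As an illustration of the spectral-relaxation recipe in the Spectral Hashing entry above, here is a minimal sketch that binarizes graph-Laplacian eigenvectors for a small in-memory training set. It assumes a k-NN affinity graph and NumPy/SciPy/scikit-learn; all function and parameter names are illustrative, and the paper's out-of-sample extension via Laplace-Beltrami eigenfunctions is not shown.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh
from sklearn.neighbors import kneighbors_graph

def spectral_hash_codes(X, n_bits=16, n_neighbors=10):
    """Toy spectral-relaxation hashing: threshold graph-Laplacian eigenvectors."""
    # Symmetrized k-NN affinity graph over the training points.
    W = kneighbors_graph(X, n_neighbors=n_neighbors, mode="connectivity")
    W = 0.5 * (W + W.T)
    # Normalized graph Laplacian; its smallest non-trivial eigenvectors act as
    # the relaxed (real-valued) codes.
    L = laplacian(W, normed=True)
    vals, vecs = eigsh(L, k=n_bits + 1, which="SM")
    vecs = vecs[:, np.argsort(vals)][:, 1:]  # drop the trivial eigenvector
    # Threshold each eigenvector at its median so the bits stay roughly balanced.
    return (vecs > np.median(vecs, axis=0)).astype(np.uint8)

codes = spectral_hash_codes(np.random.randn(500, 32), n_bits=8)
print(codes.shape)  # (500, 8)
```

Median thresholding is one simple way to keep each bit roughly balanced, which is one of the properties the relaxation above is meant to encourage.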
2024 | Improving LSH Via Tensorized Random Projection | Verma Bhisham Dev, Pratap Rameshwar | Arxiv | Locality sensitive hashing (LSH) is a fundamental algorithmic toolkit for approximate nearest neighbour search, used extensively in many large-scale data processing applications such as near-duplicate detection, nearest neighbour search, clustering, etc. In this work, we propose faster and more space-efficient locality sensitive hash functions for Euclidean distance and cosine similarity on tensor data. Typically, the naive approach for obtaining LSH for tensor data involves first reshaping the tensor into a vector, followed by applying existing LSH methods for vector data (E2LSH and SRP). However, this approach becomes impractical for higher-order tensors because the size of the reshaped vector becomes exponential in the order of the tensor. Consequently, the size of the LSH parameters increases exponentially. To address this problem, we suggest two methods each for Euclidean distance and cosine similarity, namely CP-E2LSH and TT-E2LSH, and CP-SRP and TT-SRP, respectively, building on CP and tensor train (TT) decomposition techniques. Our approaches are space efficient and can be efficiently applied to low-rank CP or TT tensors. We provide a rigorous theoretical analysis of the correctness and efficacy of our proposals. |
|||||
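The entry above contrasts tensorized hash functions with the naive baseline of flattening the tensor and applying standard E2LSH. Below is a small sketch of that baseline, the classic p-stable hash h(x) = floor((a.x + b) / w); the names are illustrative, and the paper's CP/TT variants would replace the dense Gaussian vector `a` with a low-rank tensorized one rather than store it explicitly.

```python
import numpy as np

def e2lsh_hash(x, a, b, w):
    """Classic p-stable (Gaussian) LSH for Euclidean distance:
    h(x) = floor((a . x + b) / w)."""
    return int(np.floor((a @ x + b) / w))

rng = np.random.default_rng(0)
tensor = rng.standard_normal((8, 8, 8))  # a 3rd-order data tensor
x = tensor.reshape(-1)                   # naive baseline: flatten to a vector

w = 4.0                                  # bucket width
a = rng.standard_normal(x.size)          # the dense Gaussian projection CP/TT-E2LSH compresses
b = rng.uniform(0, w)
print(e2lsh_hash(x, a, b, w))
```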
2024 | Pixel Embedding Fully Quantized Convolutional Neural Network With Differentiable Lookup Table | Tokunaga Hiroyuki, Nicholls Joel, Vazhenina Daria, Kanemura Atsunori | Arxiv | By quantizing network weights and activations to low bitwidth, we can obtain hardware-friendly and energy-efficient networks. However, existing quantization techniques utilizing the straight-through estimator and piecewise constant functions face the issue of how to represent originally high-bit input data with low-bit values. To fully quantize deep neural networks, we propose pixel embedding, which replaces each float-valued input pixel with a vector of quantized values by using a lookup table. The lookup table or low-bit representation of pixels is differentiable and trainable by backpropagation. Such replacement of inputs with vectors is similar to word embedding in the natural language processing field. Experiments on ImageNet and CIFAR-100 show that pixel embedding reduces the top-5 error gap caused by quantizing the floating points at the first layer to only 1% for the ImageNet dataset, and the top-1 error gap caused by quantizing the first and last layers to slightly over 1% for the CIFAR-100 dataset. The usefulness of pixel embedding is further demonstrated by inference-time measurements, which show an over 1.7 times speedup compared to a floating-point-precision first layer. |
|||||
2024 | 80 Million Tiny Images A Large Dataset For Non-parametric Object And Scene Recognition | Torralba A., Freeman | Arxiv | With the advent of the Internet, billions of images are now freely available online and constitute a dense sampling of the visual world. Using a variety of non-parametric methods, we explore this world with the aid of a large dataset of 79,302,017 images collected from the Web. Motivated by psychophysical results showing the remarkable tolerance of the human visual system to degradations in image resolution, the images in the dataset are stored as 32 × 32 color images. Each image is loosely labeled with one of the 75,062 non-abstract nouns in English, as listed in the Wordnet lexical database. Hence the image database gives a comprehensive coverage of all object categories and scenes. The semantic information from Wordnet can be used in conjunction with nearest-neighbor methods to perform object classification over a range of semantic levels minimizing the effects of labeling noise. For certain classes that are particularly prevalent in the dataset, such as people, we are able to demonstrate a recognition performance comparable to class-specific Viola-Jones style detectors. |
|||||
2024 | Fast Approximate Nearest-neighbor Field By Cascaded Spherical Hashing | Torres-xirau I., Salvador, Pérez-pellitero | Arxiv | We present an efficient and fast algorithm for computing approximate nearest neighbor fields between two images. Our method builds on the concept of Coherency-Sensitive Hashing (CSH), but uses a recent hashing scheme, Spherical Hashing (SpH), which is known to be better adapted to the nearest-neighbor problem for natural images. Cascaded Spherical Hashing concatenates different configurations of SpH to build larger Hash Tables with less elements in each bin to achieve higher selectivity. Our method amply outperforms existing techniques like PatchMatch and CSH, and the experimental results show that our algorithm is faster and more accurate than existing methods. |
|||||
2024 | Aisaq All-in-storage ANNS With Product Quantization For Dram-free Information Retrieval | Tatsuno Kento, Miyashita Daisuke, Ikeda Taiga, Ishiyama Kiyoshi, Sumiyoshi Kazunari, Deguchi Jun | Arxiv | In approximate nearest neighbor search (ANNS) methods based on approximate proximity graphs, DiskANN achieves a good recall-speed balance for large-scale datasets using both RAM and storage. Although it claims to save memory by loading vectors compressed with product quantization (PQ), its memory usage still increases in proportion to the scale of the dataset. In this paper, we propose All-in-Storage ANNS with Product Quantization (AiSAQ), which offloads the compressed vectors to storage. Our method achieves ~10 MB memory usage during query search even with billion-scale datasets, with minor performance degradation. AiSAQ also reduces the index load time before query search, which enables switching the index between multiple billion-scale datasets and significantly enhances the flexibility of retrieval-augmented generation (RAG). This method is applicable to all graph-based ANNS algorithms and can be combined with higher-spec ANNS methods in the future. |
|||||
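AiSAQ above builds on product-quantization-compressed vectors. As background, here is a hedged sketch of vanilla PQ (per-subvector codebooks, byte codes, and asymmetric distance computation) using NumPy and scikit-learn's KMeans; it is a generic illustration with made-up parameter choices, not the AiSAQ or DiskANN implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(X, n_subvectors=4, n_centroids=16):
    """Train one codebook per subvector (vanilla product quantization)."""
    d_sub = X.shape[1] // n_subvectors
    codebooks = []
    for m in range(n_subvectors):
        sub = X[:, m * d_sub:(m + 1) * d_sub]
        codebooks.append(KMeans(n_clusters=n_centroids, n_init=4).fit(sub).cluster_centers_)
    return codebooks

def encode(X, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    d_sub = X.shape[1] // len(codebooks)
    codes = np.empty((X.shape[0], len(codebooks)), dtype=np.uint8)
    for m, C in enumerate(codebooks):
        sub = X[:, m * d_sub:(m + 1) * d_sub]
        codes[:, m] = np.argmin(((sub[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
    return codes

def adc_distances(q, codes, codebooks):
    """Asymmetric distance: the query stays uncompressed, the database stays as codes."""
    d_sub = q.size // len(codebooks)
    tables = [((q[m * d_sub:(m + 1) * d_sub] - C) ** 2).sum(-1) for m, C in enumerate(codebooks)]
    return sum(tables[m][codes[:, m]] for m in range(len(codebooks)))

X = np.random.randn(2000, 32).astype(np.float32)
books = train_pq(X)
codes = encode(X, books)
print(np.argsort(adc_distances(X[0], codes, books))[:5])  # approximate neighbors of X[0]
```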
2024 | Streaming Similarity Search Over One Billion Tweets Using Parallel Locality-sensitive Hashing | Sundaram Narayanan, Turmukhametova, Satish, Mostak, Indyk, Dubey | Arxiv | Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high-dimensional data (where spatial indexes like kd-trees do not perform well), is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects. In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH), designed to be extremely efficient, capable of scaling out on multiple nodes and multiple cores, and which supports high-throughput streaming of new data. Our approach employs several novel ideas, including: a cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash-table querying; an insert-optimized hash table structure and efficient data expiration algorithm for streaming data; and a performance model that accurately estimates performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of more than 1 billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1–2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7x faster and query times that are 8.3x faster than a basic implementation. |
|||||
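For reference alongside the PLSH entry above, here is a toy single-machine version of the multi-table LSH index that PLSH parallelizes, using sign random projections for cosine similarity. The class and parameter names are illustrative; PLSH's cache-conscious layout, 2-level merge construction, streaming inserts, and duplicate elimination are all omitted.

```python
import numpy as np
from collections import defaultdict

class SRPLSHIndex:
    """Toy multi-table LSH index: K sign-random-projection bits per table, L tables."""
    def __init__(self, dim, n_bits=12, n_tables=6, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_tables, n_bits, dim))
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, x):
        # One K-bit key per table: the sign pattern of the projections.
        bits = (self.planes @ x) > 0
        return [row.tobytes() for row in bits]

    def insert(self, idx, x):
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, x):
        # Union of candidates colliding in at least one table; the caller
        # re-ranks this (hopefully small) set with exact distances.
        cands = set()
        for table, key in zip(self.tables, self._keys(x)):
            cands.update(table.get(key, ()))
        return cands

X = np.random.randn(10000, 64)
index = SRPLSHIndex(dim=64)
for i, v in enumerate(X):
    index.insert(i, v)
print(len(index.query(X[0])))  # candidate set; always contains item 0 itself
```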
2024 | Supervised Hierarchical Cross-modal Hashing | Sun Changchang, Song, Feng, Zhao, Nie | Arxiv | Recently, due to the unprecedented growth of multimedia data, cross-modal hashing has gained increasing attention for efficient cross-media retrieval. Typically, existing methods on cross-modal hashing treat labels of one instance independently but overlook the correlations among labels. Indeed, in many real-world scenarios, like the online fashion domain, instances (items) are labeled with a set of categories correlated by a certain hierarchy. In this paper, we propose a new end-to-end solution for supervised cross-modal hashing, named HiCHNet, which explicitly exploits the hierarchical labels of instances. In particular, by the pre-established label hierarchy, we comprehensively characterize each modality of the instance with a set of layer-wise hash representations. In essence, hash codes are encouraged to not only preserve the layer-wise semantic similarities encoded by the label hierarchy, but also retain the hierarchical discriminative capabilities. Due to the lack of benchmark datasets, apart from adapting the existing dataset FashionVC from the fashion domain, we create a dataset from the online fashion platform Ssense consisting of 15,696 image-text pairs labeled by 32 hierarchical categories. Extensive experiments on two real-world datasets demonstrate the superiority of our model over the state-of-the-art methods. |
|||||
2024 | Deep Normalized Cross-modal Hashing With Bi-direction Relation Reasoning | Sun Changchang, Latapie, Liu, Yan | Arxiv | Due to the continuous growth of large-scale multi-modal data and increasing requirements for retrieval speed, deep cross-modal hashing has gained increasing attention recently. Most of existing studies take a similarity matrix as supervision to optimize their models, and the inner product between continuous surrogates of hash codes is utilized to depict the similarity in the Hamming space. However, all of them merely consider the relevant information to build the similarity matrix, ignoring the contribution of the irrelevant one, i.e., the categories that samples do not belong to. Therefore, they cannot effectively alleviate the effect of dissimilar samples. Moreover, due to the modality distribution difference, directly utilizing continuous surrogates of hash codes to calculate similarity may induce suboptimal retrieval performance. To tackle these issues, in this paper, we propose a novel deep normalized cross-modal hashing scheme with bi-direction relation reasoning, named Bi_NCMH. Specifically, we build the multi-level semantic similarity matrix by considering bi-direction relation, i.e., consistent and inconsistent relation. It hence can holistically characterize relations among instances. Besides, we execute feature normalization on continuous surrogates of hash codes to eliminate the deviation caused by modality gap, which further reduces the negative impact of binarization on retrieval performance. Extensive experiments on two cross-modal benchmark datasets demonstrate the superiority of our model over several state-of-the-art baselines. |
|||||
2024 | SOAR Improved Indexing For Approximate Nearest Neighbor Search | Sun Philip, Simcha David, Dopson Dave, Guo Ruiqi, Kumar Sanjiv | Advances in Neural Information Processing Systems | This paper introduces SOAR: Spilling with Orthogonality-Amplified Residuals, a novel data indexing technique for approximate nearest neighbor (ANN) search. SOAR extends upon previous approaches to ANN search, such as spill trees, that utilize multiple redundant representations while partitioning the data to reduce the probability of missing a nearest neighbor during search. Rather than training and computing these redundant representations independently, however, SOAR uses an orthogonality-amplified residual loss, which optimizes each representation to compensate for cases where other representations perform poorly. This drastically improves the overall index quality, resulting in state-of-the-art ANN benchmark performance while maintaining fast indexing times and low memory consumption. |
|||||
2024 | Deep Joint-semantics Reconstructing Hashing For Large-scale Unsupervised Cross-modal Retrieval | Su Shupeng, Zhong, Zhang | Arxiv | Cross-modal hashing encodes the multimedia data into a common binary hash space in which the correlations among the samples from different modalities can be effectively measured. Deep cross-modal hashing further improves the retrieval performance as the deep neural networks can generate more semantically relevant features and hash codes. In this paper, we study unsupervised deep cross-modal hash coding and propose Deep Joint Semantics Reconstructing Hashing (DJSRH), which has the following two main advantages. First, to learn binary codes that preserve the neighborhood structure of the original data, DJSRH constructs a novel joint-semantics affinity matrix which elaborately integrates the original neighborhood information from different modalities and accordingly is capable of capturing the latent intrinsic semantic affinity for the input multi-modal instances. Second, DJSRH later trains the networks to generate binary codes that maximally reconstruct the above joint-semantics relations via the proposed reconstructing framework, which is more competent for batch-wise training as it reconstructs the specific similarity value, unlike the common Laplacian constraint that merely preserves the similarity order. Extensive experiments demonstrate the significant improvement by DJSRH in various cross-modal retrieval tasks. |
|||||
2024 | Greedy Hash Towards Fast Optimization For Accurate Hash Coding In CNN | Su Shupeng, Zhang, Han, Tian | Arxiv | To convert the input into binary code, hashing algorithms have been widely used for approximate nearest neighbor search on large-scale image sets due to their computation and storage efficiency. Deep hashing further improves the retrieval quality by combining the hash coding with deep neural networks. However, a major difficulty in deep hashing lies in the discrete constraints imposed on the network output, which generally make the optimization NP-hard. In this work, we adopt the greedy principle to tackle this NP-hard problem by iteratively updating the network toward the probable optimal discrete solution in each iteration. A hash coding layer is designed to implement our approach, which strictly uses the sign function in forward propagation to maintain the discrete constraints, while in back propagation the gradients are transmitted intact to the preceding layer to avoid vanishing gradients. In addition to the theoretical derivation, we provide a new perspective to visualize and understand the effectiveness and efficiency of our algorithm. Experiments on benchmark datasets show that our scheme outperforms state-of-the-art hashing methods in both supervised and unsupervised tasks. |
|||||
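The hash coding layer described in the Greedy Hash entry above (a strict sign in the forward pass, with gradients passed to the preceding layer unchanged in the backward pass) can be sketched with a custom autograd function in PyTorch. This is a simplified reading of the abstract, not the authors' released code, and the tensor shapes are made up for the example.

```python
import torch

class GreedySign(torch.autograd.Function):
    """Forward: strict sign() so the layer outputs discrete codes.
    Backward: pass the incoming gradient through unchanged, as the
    abstract above describes for its hash coding layer."""
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

features = torch.randn(4, 48, requires_grad=True)  # e.g. CNN features for 4 images
codes = GreedySign.apply(features)                  # entries in {-1, 0, +1}
codes.sum().backward()                              # gradient flows through intact
print(codes[0, :8], features.grad.abs().sum().item())
```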
2024 | Diskann Fast Accurate Billion-point Nearest Neighbor Search On A Single Node | Subramanya Suhas, Devvrit, Simhadri, Krishnaswamy, Kadekodi | Arxiv | Current state-of-the-art approximate nearest neighbor search (ANNS) algorithms generate indices that must be stored in main memory for fast high-recall search. This makes them expensive and limits the size of the dataset. We present a new graph-based indexing and search system called DiskANN that can index, store, and search a billion-point database on a single workstation with just 64GB RAM and an inexpensive solid-state drive (SSD). Contrary to current wisdom, we demonstrate that the SSD-based indices built by DiskANN can meet all three desiderata for large-scale ANNS: high recall, low query latency and high density (points indexed per node). On the billion-point SIFT1B bigann dataset, DiskANN serves > 5000 queries a second with < 3ms mean latency and 95%+ 1-recall@1 on a 16-core machine, where state-of-the-art billion-point ANNS algorithms with similar memory footprint like FAISS and IVFOADC+G+P plateau at around 50% 1-recall@1. Alternatively, in the high-recall regime, DiskANN can index and serve 5-10x more points per node compared to state-of-the-art graph-based methods such as HNSW and NSG. Finally, as part of our overall DiskANN system, we introduce Vamana, a new graph-based ANNS index that is more versatile than existing graph indices, even for in-memory use. |
|||||
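As background for the graph-based indices mentioned above (Vamana, HNSW, NSG), here is a compact sketch of the greedy best-first search such indices run at query time over their adjacency lists. The random stand-in graph, beam width, and function names are illustrative only; DiskANN's SSD-resident layout and PQ-based candidate re-ranking are not modeled.

```python
import heapq
import numpy as np

def greedy_graph_search(query, vectors, neighbors, entry_point, beam_width=8, k=5):
    """Best-first search over a proximity graph's adjacency lists."""
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry_point}
    candidates = [(dist(entry_point), entry_point)]  # min-heap of nodes to expand
    results = [(dist(entry_point), entry_point)]     # best nodes seen so far
    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop once the closest unexpanded candidate is worse than our worst kept result.
        if len(results) >= beam_width and d > max(results)[0]:
            break
        for nb in neighbors[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(candidates, (dist(nb), nb))
                results.append((dist(nb), nb))
                results = sorted(results)[:beam_width]
    return [node for _, node in sorted(results)[:k]]

vecs = np.random.randn(1000, 16)
# Stand-in graph: each node linked to 8 random nodes (a real index builds a navigable graph).
adj = {i: list(np.random.choice(1000, 8, replace=False)) for i in range(1000)}
print(greedy_graph_search(vecs[0], vecs, adj, entry_point=0))
```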
2024 | Top Rank Supervised Binary Coding For Visual Search | Song Dongjin, Liu, Ji, Meyer, Smith | Arxiv | In recent years, binary coding techniques are becoming increasingly popular because of their high efficiency in handling large-scale computer vision applications. It has been demonstrated that supervised binary coding techniques that leverage supervised information can significantly enhance the coding quality, and hence greatly benefit visual search tasks. Typically, a modern binary coding method seeks to learn a group of coding functions which compress data samples into binary codes. However, few methods pursued the coding functions such that the precision at the top of a ranking list according to Hamming distances of the generated binary codes is optimized. In this paper, we propose a novel supervised binary coding approach, namely Top Rank Supervised Binary Coding (Top-RSBC), which explicitly focuses on optimizing the precision of top positions in a Hamming-distance ranking list towards preserving the supervision information. The core idea is to train the disciplined coding functions, by which the mistakes at the top of a Hamming-distance ranking list are penalized more than those at the bottom. To solve such coding functions, we relax the original discrete optimization objective with a continuous surrogate, and derive a stochastic gradient descent to optimize the surrogate objective. To further reduce the training time cost, we also design an online learning algorithm to optimize the surrogate objective more efficiently. Empirical studies based upon three benchmark image datasets demonstrate that the proposed binary coding approach achieves superior image search accuracy over the state-of-the-arts. |
|||||
2024 | Self-supervised Video Hashing With Hierarchical Binary Auto-encoder | Song Jingkuan, Zhang, Li, Gao, Wang, Hong | Arxiv | Existing video hash functions are built on three isolated stages: frame pooling, relaxed learning, and binarization, which have not adequately explored the temporal order of video frames in a joint binary optimization model, resulting in severe information loss. In this paper, we propose a novel unsupervised video hashing framework dubbed Self-Supervised Video Hashing (SSVH), that is able to capture the temporal nature of videos in an end-to-end learning-to-hash fashion. We specifically address two central problems: 1) how to design an encoder-decoder architecture to generate binary codes for videos; and 2) how to equip the binary codes with the ability of accurate video retrieval. We design a hierarchical binary autoencoder to model the temporal dependencies in videos with multiple granularities, and embed the videos into binary codes with less computations than the stacked architecture. Then, we encourage the binary codes to simultaneously reconstruct the visual content and neighborhood structure of the videos. Experiments on two real-world datasets (FCVID and YFCC) show that our SSVH method can significantly outperform the state-of-the-art methods and achieve the currently best performance on the task of unsupervised video retrieval. |
|||||
2024 | Inter-media Hashing For Large-scale Retrieval From Heterogeneous Data Sources | Song J., Yang, Yang, Huang, Shen | Arxiv | In this paper, we present a new multimedia retrieval paradigm to innovate large-scale search of heterogeneous multimedia data. It is able to return results of different media types from heterogeneous data sources, e.g., using a query image to retrieve relevant text documents or images from different data sources. This utilizes the widely available data from different sources and caters for the current users’ demand of receiving a result list simultaneously containing multiple types of data to obtain a comprehensive understanding of the query’s results. To enable large-scale inter-media retrieval, we propose a novel inter-media hashing (IMH) model to explore the correlations among multiple media types from different data sources and tackle the scalability issue. To this end, multimedia data from heterogeneous data sources are transformed into a common Hamming space, in which fast search can be easily implemented by XOR and bit-count operations. Furthermore, we integrate a linear regression model to learn hashing functions so that the hash codes for new data points can be efficiently generated. Experiments conducted on real-world large-scale multimedia datasets demonstrate the superiority of our proposed method compared with state-of-the-art techniques. |
|||||
2024 | Neurohash A Hyperdimensional Neuro-symbolic Framework For Spatially-aware Image Hashing And Retrieval | Yun Sanggeon, Masukawa Ryozo, Jeong Sungheon, Imani Mohsen | Arxiv | Customizable image retrieval from large datasets remains a critical challenge, particularly when preserving spatial relationships within images. Traditional hashing methods, primarily based on deep learning, often fail to capture spatial information adequately and lack transparency. In this paper, we introduce NeuroHash, a novel neuro-symbolic framework leveraging Hyperdimensional Computing (HDC) to enable highly customizable, spatially-aware image retrieval. NeuroHash combines pre-trained deep neural network models with HDC-based symbolic models, allowing for flexible manipulation of hash values to support conditional image retrieval. Our method includes a self-supervised context-aware HDC encoder and novel loss terms for optimizing lower-dimensional bipolar hashing using multilinear hyperplanes. We evaluate NeuroHash on two benchmark datasets, demonstrating superior performance compared to state-of-the-art hashing methods, as measured by mAP@5K scores and our newly introduced metric, mAP@5Kr, which assesses spatial alignment. The results highlight NeuroHash’s ability to achieve competitive performance while offering significant advantages in flexibility and customization, paving the way for more advanced and versatile image retrieval systems. |
|||||
2024 | Supervised Hashing With Latent Factor Models | Zhang P., Zhang, Li, Guo | Arxiv | Due to its low storage cost and fast query speed, hashing has been widely adopted for approximate nearest neighbor search in large-scale datasets. Traditional hashing methods try to learn the hash codes in an unsupervised way where the metric (Euclidean) structure of the training data is preserved. Very recently, supervised hashing methods, which try to preserve the semantic structure constructed from the semantic labels of the training points, have exhibited higher accuracy than unsupervised methods. In this paper, we propose a novel supervised hashing method, called latent factor hashing (LFH), to learn similarity-preserving binary codes based on latent factor models. An algorithm with convergence guarantee is proposed to learn the parameters of LFH. Furthermore, a linear-time variant with stochastic learning is proposed for training LFH on large-scale datasets. Experimental results on two large datasets with semantic labels show that LFH can achieve higher accuracy than state-of-the-art methods with comparable training time. |
|||||
2024 | Self-taught Hashing For Fast Similarity Search | Zhang D., Wang, Cai, Lu | Arxiv | The ability of fast similarity search at large scale is of great importance to many Information Retrieval (IR) applications. A promising way to accelerate similarity search is semantic hashing which designs compact binary codes for a large number of documents so that semantically similar documents are mapped to similar codes (within a short Hamming distance). Although some recently proposed techniques are able to generate high-quality codes for documents known in advance, obtaining the codes for previously unseen documents remains to be a very challenging problem. In this paper, we emphasise this issue and propose a novel Self-Taught Hashing (STH) approach to semantic hashing: we first find the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, and then train l classifiers via supervised learning to predict the l-bit code for any query document unseen before. Our experiments on three real-world text datasets show that the proposed approach using binarised Laplacian Eigenmap (LapEig) and linear Support Vector Machine (SVM) outperforms state-of-the-art techniques significantly. |
|||||
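The entry above describes a two-stage recipe: unsupervised codes for the corpus, then one classifier per bit for unseen queries. The sketch below illustrates that general idea with off-the-shelf scikit-learn pieces (SpectralEmbedding standing in for the binarised Laplacian Eigenmap, LinearSVC as the per-bit predictor); it is not the paper's exact formulation, and all data and names are illustrative.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.svm import LinearSVC

def self_taught_hashing(X_train, X_query, n_bits=8, n_neighbors=10):
    """Two-stage STH-style pipeline (illustrative, not the paper's exact solver).

    Stage 1: unsupervised codes for the corpus via a Laplacian-Eigenmap-style
             spectral embedding, binarised at the per-bit median.
    Stage 2: one linear SVM per bit learns to predict that bit for unseen queries.
    """
    emb = SpectralEmbedding(n_components=n_bits, n_neighbors=n_neighbors)
    Y = emb.fit_transform(X_train)                    # continuous corpus embedding
    codes = (Y > np.median(Y, axis=0)).astype(int)    # median-threshold binarisation

    classifiers = [LinearSVC().fit(X_train, codes[:, b]) for b in range(n_bits)]
    query_codes = np.column_stack([clf.predict(X_query) for clf in classifiers])
    return codes, query_codes

# toy usage with random vectors standing in for document features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
train_codes, query_codes = self_taught_hashing(X[:150], X[150:], n_bits=8)
print(train_codes.shape, query_codes.shape)
```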
2024 | Hierarchical Deep Hashing For Fast Large Scale Image Retrieval | Zhang Yongfei, Peng, Jingtao, Liu, Pu, Chen | Arxiv | Fast image retrieval is of great importance in many computer vision tasks and especially practical applications. Deep hashing, the state-of-the-art fast image retrieval scheme, introduces deep learning to learn the hash functions and generate binary hash codes, and outperforms the other image retrieval methods in terms of accuracy. However, all the existing deep hashing methods could only generate one level hash codes and require a linear traversal of all the hash codes to figure out the closest one when a new query arrives, which is very time-consuming and even intractable for large scale applications. In this work, we propose a Hierarchical Deep Hashing (HDHash) scheme to speed up the state-of-the-art deep hashing methods. More specifically, hierarchical deep hash codes of multiple levels can be generated and indexed with tree structures rather than linear ones, and pruning irrelevant branches can sharply decrease the retrieval time. To our best knowledge, this is the first work to introduce hierarchical indexed deep hashing for fast large scale image retrieval. Extensive experimental results on three benchmark datasets demonstrate that the proposed HDHash scheme achieves better or comparable accuracy with significantly improved efficiency and reduced memory as compared to state-of-the-art fast image retrieval schemes. |
|||||
2024 | Fast Discrete Cross-modal Hashing Based On Label Relaxation And Matrix Factorization | Zhang Donglin, Wu, Liu, Yu, Kittler | Arxiv | In recent years, cross-media retrieval has drawn considerable attention due to the exponential growth of multimedia data. Many hashing approaches have been proposed for the cross-media search task. However, there are still open problems that warrant investigation. For example, most existing supervised hashing approaches employ a binary label matrix, which achieves small margins between wrong labels (0) and true labels (1). This may affect the retrieval performance by generating many false negatives and false positives. In addition, some methods adopt a relaxation scheme to solve the binary constraints, which may cause large quantization errors. There are also some discrete hashing methods that have been presented, but most of them are time-consuming. To conquer these problems, we present a label relaxation and discrete matrix factorization method (LRMF) for cross-modal retrieval. It offers a number of innovations. First of all, the proposed approach employs a novel label relaxation scheme to control the margins adaptively, which has the benefit of reducing the quantization error. Second, by virtue of the proposed discrete matrix factorization method designed to learn the binary codes, large quantization errors caused by relaxation can be avoided. The experimental results obtained on two widely-used databases demonstrate that LRMF outperforms state-of-the-art cross-media methods. |
|||||
2024 | An Enhanced Batch Query Architecture In Real-time Recommendation | Zhang Qiang, Teng Zhipeng, Wu Disheng, Wang Jiayin | Arxiv | In industrial recommendation systems on websites and apps, it is essential to recall and predict top-n results relevant to user interests from a content pool of billions within milliseconds. To cope with continuous data growth and improve real-time recommendation performance, we have designed and implemented a high-performance batch query architecture for real-time recommendation systems. Our contributions include optimizing hash structures with a cacheline-aware probing method to enhance coalesced hashing, as well as the implementation of a hybrid storage key-value service built upon it. Our experiments indicate this approach significantly surpasses conventional hash tables in batch query throughput, achieving up to 90% of the query throughput of random memory access when incorporating parallel optimization. The support for NVMe, integrating two-tier storage for hot and cold data, notably reduces resource consumption. Additionally, the system facilitates dynamic updates, automated sharding of attributes and feature embedding tables, and introduces innovative protocols for consistency in batch queries, thereby enhancing the effectiveness of real-time incremental learning updates. This architecture has been deployed and in use in the bilibili recommendation system for over a year, a video content community with hundreds of millions of users, supporting 10x increase in model computation with minimal resource growth, improving outcomes while preserving the system’s real-time performance. |
|||||
2024 | Gaussianimage 1000 FPS Image Representation And Compression By 2D Gaussian Splatting | Zhang Xinjie, Ge Xingtong, Xu Tongda, He Dailan, Wang Yan, Qin Hongwei, Lu Guo, Geng Jing, Zhang Jun | Arxiv | Implicit neural representations (INRs) recently achieved great success in image representation and compression, offering high visual quality and fast rendering speeds with 10-1000 FPS, assuming sufficient GPU resources are available. However, this requirement often hinders their use on low-end devices with limited memory. In response, we propose a groundbreaking paradigm of image representation and compression by 2D Gaussian Splatting, named GaussianImage. We first introduce 2D Gaussian to represent the image, where each Gaussian has 8 parameters including position, covariance and color. Subsequently, we unveil a novel rendering algorithm based on accumulated summation. Remarkably, our method with a minimum of 3\(\times\) lower GPU memory usage and 5\(\times\) faster fitting time not only rivals INRs (e.g., WIRE, I-NGP) in representation performance, but also delivers a faster rendering speed of 1500-2000 FPS regardless of parameter size. Furthermore, we integrate existing vector quantization technique to build an image codec. Experimental results demonstrate that our codec attains rate-distortion performance comparable to compression-based INRs such as COIN and COIN++, while facilitating decoding speeds of approximately 2000 FPS. Additionally, preliminary proof of concept shows that our codec surpasses COIN and COIN++ in performance when using partial bits-back coding. Code is available at https://github.com/Xinjie-Q/GaussianImage. |
|||||
2024 | High-order Nonlocal Hashing For Unsupervised Cross-modal Retrieval | Zhang Peng-fei, Luo, Huang, Xu, Song | Arxiv | In light of the ability to enable efficient storage and fast query for big data, hashing techniques for cross-modal search have aroused extensive attention. Despite the great success achieved, unsupervised cross-modal hashing still suffers from lacking reliable similarity supervision and struggles with handling the heterogeneity issue between different modalities. To cope with these, in this paper, we devise a new deep hashing model, termed as High-order Nonlocal Hashing (HNH) to facilitate cross-modal retrieval with the following advantages. First, different from existing methods that mainly leverage low-level local-view similarity as the guidance for hashing learning, we propose a high-order affinity measure that considers the multi-modal neighbourhood structures from a nonlocal perspective, thereby comprehensively capturing the similarity relationships between data items. Second, a common representation is introduced to correlate different modalities. By enforcing the modal-specific descriptors and the common representation to be aligned with each other, the proposed HNH significantly bridges the modality gap and maintains the intra-consistency. Third, an effective affinity preserving objective function is delicately designed to generate high-quality binary codes. Extensive experiments evidence the superiority of the proposed HNH in unsupervised cross-modal retrieval tasks over the state-of-the-art baselines. |
|||||
2024 | Deep Center-based Dual-constrained Hashing For Discriminative Face Image Retrieval | Zhang Ming, Zhe, Yan | Arxiv | With the advantages of low storage cost and extremely fast retrieval speed, deep hashing methods have attracted much attention for image retrieval recently. However, large-scale face image retrieval with significant intra-class variations is still challenging. Neither existing pairwise/triplet labels-based nor softmax classification loss-based deep hashing works can generate compact and discriminative binary codes. Considering these issues, we propose a center-based framework integrating end-to-end hashing learning and class centers learning simultaneously. The framework minimizes the intra-class variance by clustering intra-class samples into a learnable class center. To strengthen inter-class separability, it additionally imposes a novel regularization term to enlarge the Hamming distance between pairwise class centers. Moreover, a simple yet effective regression matrix is introduced to encourage intra-class samples to generate the same binary codes, which further enhances the hashing codes compactness. Experiments on four large-scale datasets show the proposed method outperforms state-of-the-art baselines under various code lengths and commonly-used evaluation metrics. |
|||||
2024 | Composite Hashing With Multiple Information Sources | Zhang D., Wang, Si | Arxiv | Similarity search applications with a large amount of text and image data demand an efficient and effective solution. One useful strategy is to represent the examples in databases as compact binary codes through semantic hashing, which has attracted much attention due to its fast query/search speed and drastically reduced storage requirement. All of the current semantic hashing methods only deal with the case when each example is represented by one type of features. However, examples are often described from several different information sources in many real world applications. For example, the characteristics of a webpage can be derived from both its content part and its associated links. To address the problem of learning good hashing codes in this scenario, we propose a novel research problem – Composite Hashing with Multiple Information Sources (CHMIS). The focus of the new research problem is to design an algorithm for incorporating the features from different information sources into the binary hashing codes efficiently and effectively. In particular, we propose an algorithm CHMISAW (CHMIS with Adjusted Weights) for learning the codes. The proposed algorithm integrates information from several different sources into the binary hashing codes by adjusting the weights on each individual source for maximizing the coding performance, and enables fast conversion from query examples to their binary hashing codes. Experimental results on five different datasets demonstrate the superior performance of the proposed method against several other state-of-the-art semantic hashing techniques. |
|||||
2024 | DEMO A Statistical Perspective For Efficient Image-text Matching | Zhang Fan, Hua Xian-sheng, Chen Chong, Luo Xiao | Arxiv | Image-text matching has been a long-standing problem, which seeks to connect vision and language through semantic understanding. Due to the capability to manage large-scale raw data, unsupervised hashing-based approaches have gained prominence recently. They typically construct a semantic similarity structure using the natural distance, which subsequently provides guidance to the model optimization process. However, the similarity structure could be biased at the boundaries of semantic distributions, causing error accumulation during sequential optimization. To tackle this, we introduce a novel hashing approach termed Distribution-based Structure Mining with Consistency Learning (DEMO) for efficient image-text matching. From a statistical view, DEMO characterizes each image using multiple augmented views, which are considered as samples drawn from its intrinsic semantic distribution. Then, we employ a non-parametric distribution divergence to ensure a robust and precise similarity structure. In addition, we introduce collaborative consistency learning which not only preserves the similarity structure in the Hamming space but also encourages consistency between retrieval distribution from different directions in a self-supervised manner. Through extensive experiments on three benchmark image-text matching datasets, we demonstrate that DEMO achieves superior performance compared with many state-of-the-art methods. |
|||||
2024 | Binary Code Ranking With Weighted Hamming Distance | Zhang Lei, Zhang, Tang, Lu, Tian | Arxiv | Binary hashing has been widely used for efficient similarity search due to its query and storage efficiency. In most existing binary hashing methods, the high-dimensional data are embedded into Hamming space and the distance or similarity of two points are approximated by the Hamming distance between their binary codes. The Hamming distance calculation is efficient, however, in practice, there are often lots of results sharing the same Hamming distance to a query, which makes this distance measure ambiguous and poses a critical issue for similarity search where ranking is important. In this paper, we propose a weighted Hamming distance ranking algorithm (WhRank) to rank the binary codes of hashing methods. By assigning different bit-level weights to different hash bits, the returned binary codes are ranked at a finer-grained binary code level. We give an algorithm to learn the data-adaptive and query-sensitive weight for each hash bit. Evaluations on two large-scale image data sets demonstrate the efficacy of our weighted Hamming distance for binary code ranking. |
|||||
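For orientation, the snippet below shows the core ranking idea described in the WhRank entry above: mismatched bits contribute different per-bit weights to the distance, so ties under the plain Hamming distance are broken at a finer granularity. How the data-adaptive, query-sensitive weights are actually learned is not shown; the weights here are arbitrary stand-ins.

```python
import numpy as np

def weighted_hamming_rank(query_code, db_codes, bit_weights):
    """Rank database binary codes against a query by a bit-weighted Hamming distance.

    query_code:  (n_bits,) array of 0/1
    db_codes:    (n_items, n_bits) array of 0/1
    bit_weights: (n_bits,) non-negative per-bit weights (query-sensitive in WhRank)
    """
    mismatches = db_codes != query_code       # (n_items, n_bits) boolean mask
    distances = mismatches @ bit_weights      # weighted Hamming distance per item
    return np.argsort(distances), distances

# toy usage: 8-bit codes, heavier weight on the leading bits
rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(10, 8))
q = rng.integers(0, 2, size=8)
weights = np.linspace(1.0, 0.2, 8)
order, dists = weighted_hamming_rank(q, db, weights)
print(order[:5], dists[order[:5]])
```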
2024 | Bit-scalable Deep Hashing With Regularized Similarity Learning For Image Retrieval And Person Re-identification | Zhang R., Lin, Zhang, Zuo, Zhang | Arxiv | Extracting informative image features and learning effective approximate hashing functions are two crucial steps in image retrieval. Conventional methods often study these two steps separately, e.g., learning hash functions from a predefined hand-crafted feature space. Meanwhile, the bit lengths of output hashing codes are preset in most previous methods, neglecting the significance level of different bits and restricting their practical flexibility. To address these issues, we propose a supervised learning framework to generate compact and bit-scalable hashing codes directly from raw images. We pose hashing learning as a problem of regularized similarity learning. Specifically, we organize the training images into a batch of triplet samples, each sample containing two images with the same label and one with a different label. With these triplet samples, we maximize the margin between matched pairs and mismatched pairs in the Hamming space. In addition, a regularization term is introduced to enforce the adjacency consistency, i.e., images of similar appearances should have similar codes. The deep convolutional neural network is utilized to train the model in an end-to-end fashion, where discriminative image features and hash functions are simultaneously optimized. Furthermore, each bit of our hashing codes is unequally weighted so that we can manipulate the code lengths by truncating the insignificant bits. Our framework outperforms state-of-the-arts on public benchmarks of similar image search and also achieves promising results in the application of person re-identification in surveillance. It is also shown that the generated bit-scalable hashing codes well preserve the discriminative powers with shorter code lengths. |
|||||
2024 | Efficient Training Of Very Deep Neural Networks For Supervised Hashing | Zhang Ziming, Chen, Saligrama | Arxiv | In this paper, we propose training very deep neural networks (DNNs) for supervised learning of hash codes. Existing methods in this context train relatively “shallow” networks limited by the issues arising in back propagation (e.g., vanishing gradients) as well as computational efficiency. We propose a novel and efficient training algorithm inspired by alternating direction method of multipliers (ADMM) that overcomes some of these limitations. Our method decomposes the training process into independent layer-wise local updates through auxiliary variables. Empirically we observe that our training algorithm always converges and its computational complexity is linearly proportional to the number of edges in the networks. Empirically we manage to train DNNs with 64 hidden layers and 1024 nodes per layer for supervised hashing in about 3 hours using a single GPU. Our proposed very deep supervised hashing (VDSH) method significantly outperforms the state-of-the-art on several benchmark datasets. |
|||||
2024 | Large-scale Supervised Multimodal Hashing With Semantic Correlation Maximization | Zhang D., Li | Arxiv | Due to its low storage cost and fast query speed, hashing has been widely adopted for similarity search in multimedia data. In particular, more and more attention has been paid to multimodal hashing for search in multimedia data with multiple modalities, such as images with tags. Typically, supervised information of semantic labels is also available for the data points in many real applications. Hence, many supervised multimodal hashing (SMH) methods have been proposed to utilize such semantic labels to further improve the search accuracy. However, the training time complexity of most existing SMH methods is too high, which makes them unscalable to large-scale datasets. In this paper, a novel SMH method, called semantic correlation maximization (SCM), is proposed to seamlessly integrate semantic labels into the hashing learning procedure for large-scale data modeling. Experimental results on two real-world datasets show that SCM can significantly outperform the state-of-the-art SMH methods, in terms of both accuracy and scalability. |
|||||
2024 | Deep Semantic Ranking Based Hashing For Multi-label Image Retrieval | Zhao F., Huang, Wang, Tan | Arxiv | With the rapid growth of web images, hashing has received increasing interest in large scale image retrieval. Research efforts have been devoted to learning compact binary codes that preserve semantic similarity based on labels. However, most of these hashing methods are designed to handle simple binary similarity. The complex multilevel semantic structure of images associated with multiple labels has not yet been well explored. Here we propose a deep semantic ranking based method for learning hash functions that preserve multilevel semantic similarity between multilabel images. In our approach, deep convolutional neural network is incorporated into hash functions to jointly learn feature representations and mappings from them to hash codes, which avoids the limitation of semantic representation power of hand-crafted features. Meanwhile, a ranking list that encodes the multilevel similarity information is employed to guide the learning of such deep hash functions. An effective scheme based on surrogate loss is used to solve the intractable optimization problem of nonsmooth and multivariate ranking measures involved in the learning procedure. Experimental results show the superiority of our proposed approach over several state-of-the-art hashing methods in terms of ranking evaluation metrics when tested on multi-label image datasets. |
|||||
2024 | Cross-modal Similarity Learning Via Pairs Preferences And Active Supervision | Zhen Yi, Rai, Zha, Carin | Arxiv | We present a probabilistic framework for learning pairwise similarities between objects belonging to different modalities, such as drugs and proteins, or text and images. Our framework is based on learning a binary code based representation for objects in each modality, and has the following key properties: (i) it can leverage both pairwise as well as easy-to-obtain relative preference based cross-modal constraints, (ii) the probabilistic framework naturally allows querying for the most useful/informative constraints, facilitating an active learning setting (existing methods for cross-modal similarity learning do not have such a mechanism), and (iii) the binary code length is learned from the data. We demonstrate the effectiveness of the proposed approach on two problems that require computing pairwise similarities between cross-modal object pairs: cross-modal link prediction in bipartite graphs, and hashing based cross-modal similarity search. |
|||||
2024 | Co-regularized Hashing For Multimodal Data | Zhen Y., Yeung | Arxiv | Hashing-based methods provide a very promising approach to large-scale similarity search. To obtain compact hash codes, a recent trend seeks to learn the hash functions from data automatically. In this paper, we study hash function learning in the context of multimodal data. We propose a novel multimodal hash function learning method, called Co-Regularized Hashing (CRH), based on a boosted co-regularization framework. The hash functions for each bit of the hash codes are learned by solving DC (difference of convex functions) programs, while the learning for multiple bits proceeds via a boosting procedure so that the bias introduced by the hash functions can be sequentially minimized. We empirically compare CRH with two state-of-the-art multimodal hash function learning methods on two publicly available data sets. |
|||||
2024 | Deep Hashing Network For Efficient Similarity Retrieval | Zhu Han, Long, Wang, Cao | Arxiv | Due to the storage and retrieval efficiency, hashing has been widely deployed to approximate nearest neighbor search for large-scale multimedia retrieval. Supervised hashing, which improves the quality of hash coding by exploiting the semantic similarity on data pairs, has received increasing attention recently. For most existing supervised hashing methods for image retrieval, an image is first represented as a vector of hand-crafted or machine-learned features, followed by another separate quantization step that generates binary codes. However, suboptimal hash coding may be produced, because the quantization error is not statistically minimized and the feature representation is not optimally compatible with the binary coding. In this paper, we propose a novel Deep Hashing Network (DHN) architecture for supervised hashing, in which we jointly learn good image representation tailored to hash coding and formally control the quantization error. The DHN model constitutes four key components: (1) a sub-network with multiple convolution-pooling layers to capture image representations; (2) a fully-connected hashing layer to generate compact binary hash codes; (3) a pairwise cross-entropy loss layer for similarity-preserving learning; and (4) a pairwise quantization loss for controlling hashing quality. Extensive experiments on standard image retrieval datasets show the proposed DHN model yields substantial boosts over latest state-of-the-art hashing methods. |
|||||
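The DHN entry above describes a loss with two parts: a pairwise cross-entropy term on code inner products and a quantization penalty. The PyTorch sketch below loosely paraphrases that structure for intuition; it is not the paper's exact objective, and the tensors are random stand-ins for network outputs.

```python
import torch
import torch.nn.functional as F

def dhn_style_loss(h, sim, lam=0.1):
    """Pairwise cross-entropy over code inner products plus a quantization penalty.

    h:   (n, bits) continuous codes from the hashing layer (e.g. tanh outputs)
    sim: (n, n) binary similarity matrix derived from labels
    """
    ip = h @ h.t()                          # pairwise inner products <h_i, h_j>
    pair_ce = F.softplus(ip) - sim * ip     # logistic NLL: log(1 + exp(ip)) - s_ij * ip
    quant = (h.abs() - 1.0).abs().mean()    # push every entry toward +1 or -1
    return pair_ce.mean() + lam * quant

torch.manual_seed(0)
z = torch.randn(32, 48, requires_grad=True)           # stand-in for a pre-hash layer
labels = torch.randint(0, 5, (32,))
sim = (labels[:, None] == labels[None, :]).float()    # semantic similarity from labels
loss = dhn_style_loss(torch.tanh(z), sim)
loss.backward()
print(float(loss))
```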
2024 | Linear Cross-modal Hashing For Efficient Multimedia Search | Zhu Xiaofeng, Huang, Shen, Zhao | Arxiv | Most existing cross-modal hashing methods suffer from the scalability issue in the training phase. In this paper, we propose a novel cross-modal hashing approach with a linear time complexity to the training data size, to enable scalable indexing for multimedia search across multiple modals. Taking both the intra-similarity in each modal and the inter-similarity across different modals into consideration, the proposed approach aims at effectively learning hash functions from large-scale training datasets. More specifically, for each modal, we first partition the training data into \(k\) clusters and then represent each training data point with its distances to \(k\) centroids of the clusters. Interestingly, such a \(k\)-dimensional data representation can reduce the time complexity of the training phase from traditional \(O(n^2)\) or higher to \(O(n)\), where \(n\) is the training data size, leading to practical learning on large-scale datasets. We further prove that this new representation preserves the intra-similarity in each modal. To preserve the inter-similarity among data points across different modals, we transform the derived data representations into a common binary subspace in which binary codes from all the modals are “consistent” and comparable. The transformation simultaneously outputs the hash functions for all modals, which are used to convert unseen data into binary codes. Given a query of one modal, it is first mapped into the binary codes using the modal’s hash functions, followed by matching the database binary codes of any other modals. Experimental results on two benchmark datasets confirm the scalability and the effectiveness of the proposed approach in comparison with the state of the art. |
|||||
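To make the \(O(n)\) trick above concrete, the sketch below re-represents each modality by its distances to \(k\) cluster centroids (using scikit-learn's KMeans). The subsequent learning of the common binary subspace is omitted; all data, sizes, and names here are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def centroid_distance_features(X, k=16, seed=0):
    """Re-represent each point by its distances to k cluster centroids.

    Downstream hash learning then works on the k-dimensional representation
    instead of on pairwise similarities, which is what makes training linear in n.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return km.transform(X), km     # (n, k) distance matrix, fitted model for unseen data

rng = np.random.default_rng(0)
X_img = rng.normal(size=(1000, 128))    # stand-in image features
X_txt = rng.normal(size=(1000, 300))    # stand-in text features
Z_img, km_img = centroid_distance_features(X_img)
Z_txt, km_txt = centroid_distance_features(X_txt)
print(Z_img.shape, Z_txt.shape)         # both (1000, 16): comparable low-dimensional inputs
```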
2023 | Deeplsh Deep Locality-sensitive Hash Learning For Fast And Efficient Near-duplicate Crash Report Detection | Remil Youcef, Bendimerad Anes, Mathonat Romain, Raissi Chedy, Kaytoue Mehdi | Arxiv | Automatic crash bucketing is a crucial phase in the software development process for efficiently triaging bug reports. It generally consists in grouping similar reports through clustering techniques. However, with real-time streaming bug collection, systems are needed to quickly answer the question: What are the most similar bugs to a new one?, that is, efficiently find near-duplicates. It is thus natural to consider nearest neighbors search to tackle this problem and especially the well-known locality-sensitive hashing (LSH) to deal with large datasets due to its sublinear performance and theoretical guarantees on the similarity search accuracy. Surprisingly, LSH has not been considered in the crash bucketing literature. It is indeed not trivial to derive hash functions that satisfy the so-called locality-sensitive property for the most advanced crash bucketing metrics. Consequently, we study in this paper how to leverage LSH for this task. To be able to consider the most relevant metrics used in the literature, we introduce DeepLSH, a Siamese DNN architecture with an original loss function, that perfectly approximates the locality-sensitivity property even for Jaccard and Cosine metrics for which exact LSH solutions exist. We support this claim with a series of experiments on an original dataset, which we make available. |
|||||
2023 | Description-based Text Similarity | Ravfogel Shauli, Pyatkin Valentina, Cohen Amir Dn, Manevich Avshalom, Goldberg Yoav | Arxiv | Identifying texts with a given semantics is central for many information seeking scenarios. Similarity search over vector embeddings appears to be central to this ability, yet the similarity reflected in current text embeddings is corpus-driven, and is inconsistent and sub-optimal for many use cases. What, then, is a good notion of similarity for effective retrieval of text? We identify the need to search for texts based on abstract descriptions of their content, and the corresponding notion of description based similarity. We demonstrate the inadequacy of current text embeddings and propose an alternative model that significantly improves when used in standard nearest neighbor search. The model is trained using positive and negative pairs sourced through prompting an LLM, demonstrating how data from LLMs can be used for creating new capabilities not immediately possible using the original model. |
|||||
2023 | Improving Code Example Recommendations On Informal Documentation Using BERT And Query-aware LSH A Comparative Study | Rahmani Sajjad, Naghshzan Amirhossein, Guerrouj Latifa | Arxiv | Our research investigates the recommendation of code examples to aid software developers, a practice that saves developers significant time by providing ready-to-use code snippets. The focus of our study is Stack Overflow, a commonly used resource for coding discussions and solutions, particularly in the context of the Java programming language. We applied BERT, a powerful Large Language Model (LLM) that enables us to transform code examples into numerical vectors by extracting their semantic information. Once these numerical representations are prepared, we identify Approximate Nearest Neighbors (ANN) using Locality-Sensitive Hashing (LSH). Our research employed two variants of LSH: Random Hyperplane-based LSH and Query-Aware LSH. We rigorously compared these two approaches across four parameters: HitRate, Mean Reciprocal Rank (MRR), Average Execution Time, and Relevance. Our study revealed that the Query-Aware (QA) approach showed superior performance over the Random Hyperplane-based (RH) method. Specifically, it exhibited a notable improvement of 20% to 35% in HitRate for query pairs compared to the RH approach. Furthermore, the QA approach proved significantly more time-efficient, with its speed in creating hashing tables and assigning data samples to buckets being at least four times faster. It can return code examples within milliseconds, whereas the RH approach typically requires several seconds to recommend code examples. Due to the superior performance of the QA approach, we tested it against PostFinder and FaCoY, the state-of-the-art baselines. Our QA method showed comparable efficiency proving its potential for effective code recommendation. |
|||||
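As background for the comparison in the entry above, here is a minimal sign-random-projection (random hyperplane) LSH index of the kind referred to as the RH baseline: nearby embeddings tend to collide in at least one table and are then re-ranked by exact cosine similarity. This is a generic illustration, not the authors' implementation, and the query-aware variant is not shown; the vectors are random stand-ins for BERT embeddings.

```python
import numpy as np

class RandomHyperplaneLSH:
    """Sign-random-projection LSH for cosine similarity."""

    def __init__(self, dim, n_bits=16, n_tables=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_tables, dim, n_bits))  # one set of hyperplanes per table
        self.tables = [dict() for _ in range(n_tables)]
        self.vectors = None

    def _keys(self, x):
        # One n_bits-long sign pattern per table; similar vectors agree on most signs.
        return [tuple((x @ p > 0).astype(int)) for p in self.planes]

    def index(self, X):
        self.vectors = X
        for i, x in enumerate(X):
            for t, key in enumerate(self._keys(x)):
                self.tables[t].setdefault(key, []).append(i)

    def query(self, q, topk=5):
        cand = set()
        for t, key in enumerate(self._keys(q)):
            cand.update(self.tables[t].get(key, []))
        cand = list(cand)
        if not cand:
            return []
        sims = self.vectors[cand] @ q / (
            np.linalg.norm(self.vectors[cand], axis=1) * np.linalg.norm(q))
        return [cand[i] for i in np.argsort(-sims)[:topk]]  # exact re-rank of candidates

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 64))          # stand-ins for snippet embeddings
lsh = RandomHyperplaneLSH(dim=64)
lsh.index(X)
print(lsh.query(X[42] + 0.05 * rng.normal(size=64)))  # a slightly perturbed copy should surface index 42
```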
2023 | Large-scale Distributed Learning Via Private On-device Locality-sensitive Hashing | Rabbani Tahseen, Bornstein Marco, Huang Furong | Arxiv | Locality-sensitive hashing (LSH) based frameworks have been used efficiently to select weight vectors in a dense hidden layer with high cosine similarity to an input, enabling dynamic pruning. While this type of scheme has been shown to improve computational training efficiency, existing algorithms require repeated randomized projection of the full layer weight, which is impractical for computational- and memory-constrained devices. In a distributed setting, deferring LSH analysis to a centralized host is (i) slow if the device cluster is large and (ii) requires access to input data which is forbidden in a federated context. Using a new family of hash functions, we develop one of the first private, personalized, and memory-efficient on-device LSH frameworks. Our framework enables privacy and personalization by allowing each device to generate hash tables, without the help of a central host, using device-specific hashing hyper-parameters (e.g. number of hash tables or hash length). Hash tables are generated with a compressed set of the full weights, and can be serially generated and discarded if the process is memory-intensive. This allows devices to avoid maintaining (i) the fully-sized model and (ii) large amounts of hash tables in local memory for LSH analysis. We prove several statistical and sensitivity properties of our hash functions, and experimentally demonstrate that our framework is competitive in training large-scale recommender networks compared to other LSH frameworks which assume unrestricted on-device capacity. |
|||||
2023 | Language Embedded 3D Gaussians For Open-vocabulary Scene Understanding | Shi Jin-chuan, Wang Miao, Duan Hao-bin, Guan Shao-hua | Arxiv | Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy heavily depends on neural networks that are resource-intensive in training and rendering. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis, directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we propose a dedicated quantization scheme that drastically alleviates the memory requirement, and a novel embedding procedure that achieves smoother yet high accuracy query, countering the multi-view feature inconsistencies and the high-frequency inductive bias in point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy across current language-embedded representations, while maintaining real-time rendering frame rates on a single desktop GPU. |
|||||
2023 | Deep Hashing Via Householder Quantization | Schwengber Lucas R., Resende Lucas, Orenstein Paulo, Oliveira Roberto I. | Arxiv | Hashing is at the heart of large-scale image similarity search, and recent methods have been substantially improved through deep learning techniques. Such algorithms typically learn continuous embeddings of the data. To avoid a subsequent costly binarization step, a common solution is to employ loss functions that combine a similarity learning term (to ensure similar images are grouped to nearby embeddings) and a quantization penalty term (to ensure that the embedding entries are close to binarized entries, e.g., -1 or 1). Still, the interaction between these two terms can make learning harder and the embeddings worse. We propose an alternative quantization strategy that decomposes the learning problem in two stages: first, perform similarity learning over the embedding space with no quantization; second, find an optimal orthogonal transformation of the embeddings so each coordinate of the embedding is close to its sign, and then quantize the transformed embedding through the sign function. In the second step, we parametrize orthogonal transformations using Householder matrices to efficiently leverage stochastic gradient descent. Since similarity measures are usually invariant under orthogonal transformations, this quantization strategy comes at no cost in terms of performance. The resulting algorithm is unsupervised, fast, hyperparameter-free and can be run on top of any existing deep hashing or metric learning algorithm. We provide extensive experimental results showing that this approach leads to state-of-the-art performance on widely used image datasets, and, unlike other quantization strategies, brings consistent improvements in performance to existing deep hashing algorithms. |
|||||
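A compact sketch of the second-stage idea in the entry above: parametrize an orthogonal transform as a product of Householder reflections and fit it with gradient descent so each rotated coordinate sits near its sign, then quantize. It assumes fixed pretrained embeddings (random stand-ins here) and simplifies the paper's procedure; all names are illustrative.

```python
import torch

def householder_product(vs):
    """Build an orthogonal matrix as a product of Householder reflections.

    vs: (m, d) tensor; each row v contributes the factor H = I - 2 v v^T / ||v||^2.
    """
    d = vs.shape[1]
    Q = torch.eye(d)
    for v in vs:
        v = v / v.norm()
        Q = Q - 2.0 * torch.outer(v, v @ Q)   # left-multiply by (I - 2 v v^T)
    return Q

torch.manual_seed(0)
Z = torch.randn(256, 16)                        # pretrained continuous embeddings (stand-ins)
vs = torch.randn(16, 16, requires_grad=True)    # Householder parameters
opt = torch.optim.Adam([vs], lr=1e-2)
for step in range(200):
    Q = householder_product(vs)
    R = Z @ Q.T                                  # rotated embeddings
    loss = ((R - R.sign().detach()) ** 2).mean() # quantization error after rotation
    opt.zero_grad()
    loss.backward()
    opt.step()

codes = (Z @ householder_product(vs).T).sign()   # final binary codes
print(codes.shape, float(loss))
```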
2023 | Beyond Two-tower Matching Learning Sparse Retrievable Cross-interactions For Recommendation | Su Liangcai, Yan Fan, Zhu Jieming, Xiao Xi, Duan Haoyi, Zhao Zhou, Dong Zhenhua, Tang Ruiming | Arxiv | Two-tower models are a prevalent matching framework for recommendation, which have been widely deployed in industrial applications. The success of two-tower matching attributes to its efficiency in retrieval among a large number of items, since the item tower can be precomputed and used for fast Approximate Nearest Neighbor (ANN) search. However, it suffers two main challenges, including limited feature interaction capability and reduced accuracy in online serving. Existing approaches attempt to design novel late interactions instead of dot products, but they still fail to support complex feature interactions or lose retrieval efficiency. To address these challenges, we propose a new matching paradigm named SparCode, which supports not only sophisticated feature interactions but also efficient retrieval. Specifically, SparCode introduces an all-to-all interaction module to model fine-grained query-item interactions. Besides, we design a discrete code-based sparse inverted index jointly trained with the model to achieve effective and efficient model inference. Extensive experiments have been conducted on open benchmark datasets to demonstrate the superiority of our framework. The results show that SparCode significantly improves the accuracy of candidate item matching while retaining the same level of retrieval efficiency with two-tower models. Our source code will be available at MindSpore/models. |
|||||
2023 | Multivariate Representation Learning For Information Retrieval | Zamani Hamed, Bendersky Michael | Arxiv | Dense retrieval models use bi-encoder network architectures for learning query and document representations. These representations are often in the form of a vector representation and their similarities are often computed using the dot product function. In this paper, we propose a new representation learning framework for dense retrieval. Instead of learning a vector for each query and document, our framework learns a multivariate distribution and uses negative multivariate KL divergence to compute the similarity between distributions. For simplicity and efficiency reasons, we assume that the distributions are multivariate normals and then train large language models to produce mean and variance vectors for these distributions. We provide a theoretical foundation for the proposed framework and show that it can be seamlessly integrated into the existing approximate nearest neighbor algorithms to perform retrieval efficiently. We conduct an extensive suite of experiments on a wide range of datasets, and demonstrate significant improvements compared to competitive dense retrieval models. |
|||||
2023 | Better Generalization With Semantic Ids A Case Study In Ranking For Recommendations | Singh Anima, Vu Trung, Mehta Nikhil, Keshavan Raghunandan, Sathiamoorthy Maheswaran, Zheng Yilin, Hong Lichan, Heldt Lukasz, Wei Li, Tandon Devansh, Chi Ed H., Yi Xinyang | Arxiv | Randomly-hashed item ids are used ubiquitously in recommendation models. However, the learned representations from random hashing prevent generalization across similar items, causing problems of learning unseen and long-tail items, especially when item corpus is large, power-law distributed, and evolving dynamically. In this paper, we propose using content-derived features as a replacement for random ids. We show that simply replacing ID features with content-based embeddings can cause a drop in quality due to reduced memorization capability. To strike a good balance of memorization and generalization, we propose to use Semantic IDs – a compact discrete item representation learned from frozen content embeddings using RQ-VAE that captures the hierarchy of concepts in items – as a replacement for random item ids. Similar to content embeddings, the compactness of Semantic IDs poses a problem of easy adaption in recommendation models. We propose novel methods for adapting Semantic IDs in industry-scale ranking models, through hashing sub-pieces of the Semantic-ID sequences. In particular, we find that the SentencePiece model that is commonly used in LLM tokenization outperforms manually crafted pieces such as N-grams. To this end, we evaluate our approaches in a real-world ranking model for YouTube recommendations. Our experiments demonstrate that Semantic IDs can replace the direct use of video IDs by improving the generalization ability on new and long-tail item slices without sacrificing overall model quality. |
|||||
2023 | Divideclassify Fine-grained Classification For City-wide Visual Place Recognition | Trivigno Gabriele, Berton Gabriele, Aragon Juan, Caputo Barbara, Masone Carlo | Arxiv | Visual Place recognition is commonly addressed as an image retrieval problem. However, retrieval methods are impractical to scale to large datasets, densely sampled from city-wide maps, since their dimension impacts negatively on the inference time. Using approximate nearest neighbour search for retrieval helps to mitigate this issue, at the cost of a performance drop. In this paper we investigate whether we can effectively approach this task as a classification problem, thus bypassing the need for a similarity search. We find that existing classification methods for coarse, planet-wide localization are not suitable for the fine-grained and city-wide setting. This is largely due to how the dataset is split into classes, because these methods are designed to handle a sparse distribution of photos and as such do not consider the visual aliasing problem across neighbouring classes that naturally arises in dense scenarios. Thus, we propose a partitioning scheme that enables a fast and accurate inference, preserving a simple learning procedure, and a novel inference pipeline based on an ensemble of novel classifiers that uses the prototypes learned via an angular margin loss. Our method, Divide&Classify (D&C), enjoys the fast inference of classification solutions and an accuracy competitive with retrieval methods on the fine-grained, city-wide setting. Moreover, we show that D&C can be paired with existing retrieval pipelines to speed up computations by over 20 times while increasing their recall, leading to new state-of-the-art results. |
|||||
2023 | What If We Tried Less Power -- Lessons From Studying The Power Of Choices In Hashing-based Data Structures | Walzer Stefan | Arxiv | In the first part of this survey, we review how the power of two choices underlies space-efficient data structures like cuckoo hash tables. We’ll find that the additional power afforded by more than 2 choices is often outweighed by the additional costs they bring. In the second part, we present a data structure where choices play a role at coarser than per-element granularity. In some sense, we rely on the power of \(1+\epsilon\) choices. |
|||||
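The balls-into-bins experiment below illustrates the effect the survey above is about: with two choices the maximum load drops dramatically versus one, while a third choice buys comparatively little, matching the survey's point that extra choices carry diminishing returns. Parameters are arbitrary.

```python
import random

def throw_balls(n_balls, n_bins, d_choices, seed=0):
    """Place each ball in the least-loaded of d randomly chosen bins and return the max load.

    d=1 is plain random hashing; d=2 is the classic "power of two choices",
    which shrinks the maximum load from roughly log n / log log n to roughly log log n.
    """
    rng = random.Random(seed)
    load = [0] * n_bins
    for _ in range(n_balls):
        candidates = [rng.randrange(n_bins) for _ in range(d_choices)]
        best = min(candidates, key=lambda b: load[b])
        load[best] += 1
    return max(load)

n = 100_000
for d in (1, 2, 3):
    print(f"d={d}: max load = {throw_balls(n, n, d)}")
```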
2023 | Integrating Visual And Semantic Similarity Using Hierarchies For Image Retrieval | Venkataramanan Aishwarya, Laviale Martin, Pradalier Cédric | Arxiv | Most of the research in content-based image retrieval (CBIR) focuses on developing robust feature representations that can effectively retrieve instances from a database of images that are visually similar to a query. However, the retrieved images sometimes contain results that are not semantically related to the query. To address this, we propose a method for CBIR that captures both visual and semantic similarity using a visual hierarchy. The hierarchy is constructed by merging classes with overlapping features in the latent space of a deep neural network trained for classification, assuming that overlapping classes share high visual and semantic similarities. Finally, the constructed hierarchy is integrated into the distance calculation metric for similarity search. Experiments on standard datasets: CUB-200-2011 and CIFAR100, and a real-life use case using diatom microscopy images show that our method achieves superior performance compared to the existing methods on image retrieval. |
|||||
2023 | Algorithms For Massive Data -- Lecture Notes | Prezza Nicola | Arxiv | These are the lecture notes for the course CM0622 - Algorithms for Massive Data, Ca’ Foscari University of Venice. The goal of this course is to introduce algorithmic techniques for dealing with massive data: data so large that it does not fit in the computer’s memory. There are two main solutions to deal with massive data: (lossless) compressed data structures and (lossy) data sketches. These notes cover both topics: compressed suffix arrays, probabilistic filters, sketching under various metrics, Locality Sensitive Hashing, nearest neighbour search, algorithms on streams (pattern matching, counting). |
|||||
2023 | Let Them Have CAKES A Cutting-edge Algorithm For Scalable Efficient And Exact Search On Big Data | Prior Morgan E., Howard Thomas J. Iii, Mclaughlin Oliver, Ferguson Terrence, Ishaq Najib, Daniels Noah M. | Arxiv | The ongoing Big Data explosion has created a demand for efficient and scalable algorithms for similarity search. Most recent work has focused on *approximate* \(k\)-NN search, and while this may be sufficient for some applications, *exact* \(k\)-NN search would be ideal for many applications. We present CAKES, a set of three novel, exact algorithms for \(k\)-NN search. CAKES’s algorithms are generic over *any* distance function, and they *do not* scale with the cardinality or embedding dimension of the dataset, but rather with its metric entropy and fractal dimension. We test these claims on datasets from the ANN-Benchmarks suite under commonly-used distance functions, as well as on a genomic dataset with Levenshtein distance and a radio-frequency dataset with Dynamic Time Warping distance. We demonstrate that CAKES exhibits near-constant scaling with cardinality on data conforming to the manifold hypothesis, and has perfect recall on data in *metric* spaces. We also demonstrate that CAKES exhibits significantly higher recall than state-of-the-art \(k\)-NN search algorithms when the distance function is not a metric. Additionally, we show that indexing and tuning time for CAKES is an order of magnitude, or more, faster than state-of-the-art approaches. We conclude that CAKES is a highly efficient and scalable algorithm for exact \(k\)-NN search on Big Data. We provide a Rust implementation of CAKES. |
|||||
2023 | Efficient Online String Matching Through Linked Weak Factors | Palmer Matthew N., Faro Simone, Scafiti Stefano | Arxiv | Online string matching is a computational problem involving the search for patterns or substrings in a large text dataset, with the pattern and text being processed sequentially, without prior access to the entire text. Its relevance stems from applications in data compression, data mining, text editing, and bioinformatics, where rapid and efficient pattern matching is crucial. Various solutions have been proposed over the past few decades, employing diverse techniques. Recently, weak recognition approaches have attracted increasing attention. This paper presents Hash Chain, a new algorithm based on a robust weak factor recognition approach that connects adjacent factors through hashing. Despite its O(nm) complexity, the algorithm exhibits a sublinear behavior in practice and achieves superior performance compared to the most effective algorithms. |
|||||
2023 | CAGRA Highly Parallel Graph Construction And Approximate Nearest Neighbor Search For Gpus | Ootomo Hiroyuki, Naruse Akira, Nolet Corey, Wang Ray, Feher Tamas, Wang Yong | Arxiv | Approximate Nearest Neighbor Search (ANNS) plays a critical role in various disciplines spanning data mining and artificial intelligence, from information retrieval and computer vision to natural language processing and recommender systems. Data volumes have soared in recent years and the computational cost of an exhaustive exact nearest neighbor search is often prohibitive, necessitating the adoption of approximate techniques. The balanced performance and recall of graph-based approaches have more recently garnered significant attention in ANNS algorithms, however, only a few studies have explored harnessing the power of GPUs and multi-core processors despite the widespread use of massively parallel and general-purpose computing. To bridge this gap, we introduce a novel parallel computing hardware-based proximity graph and search algorithm. By leveraging the high-performance capabilities of modern hardware, our approach achieves remarkable efficiency gains. In particular, our method surpasses existing CPU and GPU-based methods in constructing the proximity graph, demonstrating higher throughput in both large- and small-batch searches while maintaining comparable accuracy. In graph construction time, our method, CAGRA, is 2.2~27x faster than HNSW, which is one of the CPU SOTA implementations. In large-batch query throughput in the 90% to 95% recall range, our method is 33~77x faster than HNSW, and is 3.8~8.8x faster than the SOTA implementations for GPU. For a single query, our method is 3.4~53x faster than HNSW at 95% recall. |
|||||
2023 | The Lower Energy Consumption In Cryptocurrency Mining Processes By SHA-256 Quantum Circuit Design Used In Hybrid Computing Domains | Orun Ahmet, Kurugollu Fatih | Arxiv | Cryptocurrency mining processes always lead to a high energy consumption at considerably high production cost, which is nearly one-third of cryptocurrency (e.g. Bitcoin) price itself. As the core of mining process is based on SHA-256 cryptographic hashing function, by using the alternative quantum computers, hybrid quantum computers or more larger quantum computing devices like quantum annealers, it would be possible to reduce the mining energy consumption with a quantum hardware’s low-energy-operation characteristics. Within this work we demonstrated the use of optimized quantum mining facilities which would replace the classical SHA-256 and high energy consuming classical hardware in near future. |
|||||
2023 | Hashreid Dynamic Network With Binary Codes For Efficient Person Re-identification | Nikhal Kshitij, Ma Yujunrong, Bhattacharyya Shuvra S., Riggan Benjamin S. | Arxiv | Biometric applications, such as person re-identification (ReID), are often deployed on energy constrained devices. While recent ReID methods prioritize high retrieval performance, they often come with large computational costs and high search time, rendering them less practical in real-world settings. In this work, we propose an input-adaptive network with multiple exit blocks, that can terminate computation early if the retrieval is straightforward or noisy, saving a lot of computation. To assess the complexity of the input, we introduce a temporal-based classifier driven by a new training strategy. Furthermore, we adopt a binary hash code generation approach instead of relying on continuous-valued features, which significantly improves the search process by a factor of 20. To ensure similarity preservation, we utilize a new ranking regularizer that bridges the gap between continuous and binary features. Extensive analysis of our proposed method is conducted on three datasets: Market1501, MSMT17 (Multi-Scene Multi-Time), and the BGC1 (BRIAR Government Collection). Using our approach, more than 70% of the samples with compact hash codes exit early on the Market1501 dataset, saving 80% of the network's computational cost and improving over other hash-based methods by 60%. These results demonstrate a significant improvement over dynamic networks and showcase comparable accuracy performance to conventional ReID methods. Code will be made available. |
|||||
2023 | Relative Nn-descent A Fast Index Construction For Graph-based Approximate Nearest Neighbor Search | Ono Naoki, Matsui Yusuke | Arxiv | Approximate Nearest Neighbor Search (ANNS) is the task of finding the database vector that is closest to a given query vector. Graph-based ANNS is the family of methods with the best balance of accuracy and speed for million-scale datasets. However, graph-based methods have the disadvantage of long index construction time. Recently, many researchers have improved the tradeoff between accuracy and speed during a search. However, there is little research on accelerating index construction. We propose a fast graph construction algorithm, Relative NN-Descent (RNN-Descent). RNN-Descent combines NN-Descent, an algorithm for constructing approximate K-nearest neighbor graphs (K-NN graphs), and RNG Strategy, an algorithm for selecting edges effective for search. This algorithm allows the direct construction of graph-based indexes without ANNS. Experimental results demonstrated that the proposed method had the fastest index construction speed, while its search performance is comparable to existing state-of-the-art methods such as NSG. For example, in experiments on the GIST1M dataset, the construction of the proposed method is 2x faster than NSG. Additionally, it was even faster than the construction speed of NN-Descent. |
|||||
2023 | Unsupervised Hashing With Similarity Distribution Calibration | Ng Kam Woh, Zhu Xiatian, Hoe Jiun Tian, Chan Chee Seng, Zhang Tianyu, Song Yi-zhe, Xiang Tao | Arxiv | Unsupervised hashing methods typically aim to preserve the similarity between data points in a feature space by mapping them to binary hash codes. However, these methods often overlook the fact that the similarity between data points in the continuous feature space may not be preserved in the discrete hash code space, due to the limited similarity range of hash codes. The similarity range is bounded by the code length and can lead to a problem known as similarity collapse. That is, the positive and negative pairs of data points become less distinguishable from each other in the hash space. To alleviate this problem, in this paper a novel Similarity Distribution Calibration (SDC) method is introduced. SDC aligns the hash code similarity distribution towards a calibration distribution (e.g., beta distribution) with sufficient spread across the entire similarity range, thus alleviating the similarity collapse problem. Extensive experiments show that our SDC outperforms significantly the state-of-the-art alternatives on coarse category-level and instance-level image retrieval. Code is available at https://github.com/kamwoh/sdc. |
|||||
2023 | Learning Multi-stage Multi-grained Semantic Embeddings For E-commerce Search | Wang Binbin, Li Mingming, Zeng Zhixiong, Zhuo Jingwei, Wang Songlin, Xu Sulong, Long Bo, Yan Weipeng | Arxiv | Retrieving relevant items that match users’ queries from billion-scale corpus forms the core of industrial e-commerce search systems, in which embedding-based retrieval (EBR) methods are prevailing. These methods adopt a two-tower framework to learn embedding vectors for query and item separately and thus leverage efficient approximate nearest neighbor (ANN) search to retrieve relevant items. However, existing EBR methods usually ignore inconsistent user behaviors in industrial multi-stage search systems, resulting in insufficient retrieval efficiency with a low commercial return. To tackle this challenge, we propose to improve EBR methods by learning Multi-level Multi-Grained Semantic Embeddings (MMSE). We propose the multi-stage information mining to exploit the ordered, clicked, unclicked and random sampled items in practical user behavior data, and then capture query-item similarity via a post-fusion strategy. We then propose multi-grained learning objectives that integrate the retrieval loss with global comparison ability and the ranking loss with local comparison ability to generate semantic embeddings. Both experiments on a real-world billion-scale dataset and online A/B tests verify the effectiveness of MMSE in achieving significant performance improvements on metrics such as offline recall and online conversion rate (CVR). |
|||||
2023 | A Note On efficient Task-specific Data Valuation For Nearest Neighbor Algorithms | Wang Jiachen T., Jia Ruoxi | Arxiv | Data valuation is a growing research field that studies the influence of individual data points for machine learning (ML) models. Data Shapley, inspired by cooperative game theory and economics, is an effective method for data valuation. However, it is well-known that the Shapley value (SV) can be computationally expensive. Fortunately, Jia et al. (2019) showed that for K-Nearest Neighbors (KNN) models, the computation of Data Shapley is surprisingly simple and efficient. In this note, we revisit the work of Jia et al. (2019) and propose a more natural and interpretable utility function that better reflects the performance of KNN models. We derive the corresponding calculation procedure for the Data Shapley of KNN classifiers/regressors with the new utility functions. Our new approach, dubbed soft-label KNN-SV, achieves the same time complexity as the original method. We further provide an efficient approximation algorithm for soft-label KNN-SV based on locality sensitive hashing (LSH). Our experimental results demonstrate that Soft-label KNN-SV outperforms the original method on most datasets in the task of mislabeled data detection, making it a better baseline for future work on data valuation. |
|||||
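For context, the closed-form KNN Data Shapley recursion of Jia et al. (2019), which this note revisits, can be written down directly. The sketch below is a from-memory transcription of the hard-label recursion (not the note's soft-label variant), so the exact formula should be checked against the original papers before relying on it.

```python
import numpy as np

def knn_shapley_single_test(X_train, y_train, x_test, y_test, K):
    """Hard-label KNN Data Shapley for one test point (recursion of Jia et al., 2019).
    Soft-label KNN-SV replaces the 0/1 label-match indicator with a probabilistic
    utility but keeps the same sort-and-recurse structure."""
    n = len(X_train)
    order = np.argsort(np.linalg.norm(X_train - x_test, axis=1))   # closest first
    match = (y_train[order] == y_test).astype(float)
    s = np.zeros(n)
    s[order[n - 1]] = match[n - 1] / n                             # farthest point
    for i in range(n - 2, -1, -1):
        j = i + 1                                                  # 1-based rank
        s[order[i]] = s[order[i + 1]] + (match[i] - match[i + 1]) / K * min(K, j) / j
    return s

# Toy usage with 6 training points in 2-D and K = 3.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(6, 2)), np.array([0, 0, 1, 1, 0, 1])
print(knn_shapley_single_test(X, y, np.zeros(2), 1, K=3))
```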
2023 | Masked Space-time Hash Encoding For Efficient Dynamic Scene Reconstruction | Wang Feng, Chen Zilong, Wang Guokang, Song Yafei, Liu Huaping | Arxiv | In this paper, we propose the Masked Space-Time Hash encoding (MSTH), a novel method for efficiently reconstructing dynamic 3D scenes from multi-view or monocular videos. Based on the observation that dynamic scenes often contain substantial static areas that result in redundancy in storage and computations, MSTH represents a dynamic scene as a weighted combination of a 3D hash encoding and a 4D hash encoding. The weights for the two components are represented by a learnable mask which is guided by an uncertainty-based objective to reflect the spatial and temporal importance of each 3D position. With this design, our method can reduce the hash collision rate by avoiding redundant queries and modifications on static areas, making it feasible to represent a large number of space-time voxels by hash tables with small size. Besides, without the requirements to fit the large numbers of temporally redundant features independently, our method is easier to optimize and converges rapidly with only twenty minutes of training for a 300-frame dynamic scene. As a result, MSTH obtains consistently better results than previous methods with only 20 minutes of training time and 130 MB of memory storage. Code is available at https://github.com/masked-spacetime-hashing/msth |
|||||
2023 | Instant Complexity Reduction In Cnns Using Locality-sensitive Hashing | Meiner Lukas, Mehnert Jens, Condurache Alexandru Paul | Arxiv | To reduce the computational cost of convolutional neural networks (CNNs) for usage on resource-constrained devices, structured pruning approaches have shown promising results, drastically reducing floating-point operations (FLOPs) without substantial drops in accuracy. However, most recent methods require fine-tuning or specific training procedures to achieve a reasonable trade-off between retained accuracy and reduction in FLOPs. This introduces additional cost in the form of computational overhead and requires training data to be available. To this end, we propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module. It instantly reduces the network’s test-time inference cost without requiring any training or fine-tuning. We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH) to detect redundancies in the channel dimension. Similar channels are aggregated to reduce the input and filter depth simultaneously, allowing for cheaper convolutions. We demonstrate our approach on the popular vision benchmarks CIFAR-10 and ImageNet. In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module. |
|||||
2023 | Deep Supervised Hashing For Fast Retrieval Of Radio Image Cubes | Ndung'u Steven, Grobler Trienko, Wijnholds Stefan J., Karastoyanova Dimka, Azzopardi George | Arxiv | The sheer number of sources that will be detected by next-generation radio surveys will be astronomical, which will result in serendipitous discoveries. Data-dependent deep hashing algorithms have been shown to be efficient at image retrieval tasks in the fields of computer vision and multimedia. However, there are limited applications of these methodologies in the field of astronomy. In this work, we utilize deep hashing to rapidly search for similar images in a large database. The experiment uses a balanced dataset of 2708 samples consisting of four classes: Compact, FRI, FRII, and Bent. The performance of the method was evaluated using the mean average precision (mAP) metric, where a precision of 88.5\% was achieved. The experimental results demonstrate the capability to search and retrieve similar radio images efficiently and at scale. The retrieval is based on the Hamming distance between the binary hash of the query image and those of the reference images in the database. |
|||||
2023 | Graph-collaborated Auto-encoder Hashing For Multi-view Binary Clustering | Wang Huibing, Yao Mingze, Jiang Guangqi, Mi Zetian, Fu Xianping | Arxiv | Unsupervised hashing methods have attracted widespread attention with the explosive growth of large-scale data, which can greatly reduce storage and computation by learning compact binary codes. Existing unsupervised hashing methods attempt to exploit the valuable information from samples but fail to take the local geometric structure of unlabeled samples into consideration. Moreover, hashing based on auto-encoders aims to minimize the reconstruction loss between the input data and binary codes, which ignores the potential consistency and complementarity of multiple-source data. To address the above issues, we propose a hashing algorithm based on auto-encoders for multi-view binary clustering, which dynamically learns affinity graphs with low-rank constraints and adopts collaborative learning between auto-encoders and affinity graphs to learn a unified binary code, called Graph-Collaborated Auto-Encoder Hashing for Multi-view Binary Clustering (GCAE). Specifically, we propose a multi-view affinity graphs learning model with low-rank constraint, which can mine the underlying geometric information from multi-view data. Then, we design an encoder-decoder paradigm to collaborate the multiple affinity graphs, which can learn a unified binary code effectively. Notably, we impose the decorrelation and code balance constraints on binary codes to reduce the quantization errors. Finally, we utilize an alternating iterative optimization scheme to obtain the multi-view clustering results. Extensive experimental results on \(5\) public datasets are provided to reveal the effectiveness of the algorithm and its superior performance over other state-of-the-art alternatives. |
|||||
2023 | Parlayann Scalable And Deterministic Parallel Graph-based Approximate Nearest Neighbor Search Algorithms | Manohar Magdalen Dobson, Shen Zheqi, Blelloch Guy E., Dhulipala Laxman, Gu Yan, Simhadri Harsha Vardhan, Sun Yihan | Arxiv | Approximate nearest-neighbor search (ANNS) algorithms are a key part of the modern deep learning stack due to enabling efficient similarity search over high-dimensional vector space representations (i.e., embeddings) of data. Among various ANNS algorithms, graph-based algorithms are known to achieve the best throughput-recall tradeoffs. Despite the large scale of modern ANNS datasets, existing parallel graph based implementations suffer from significant challenges to scale to large datasets due to heavy use of locks and other sequential bottlenecks, which 1) prevents them from efficiently scaling to a large number of processors, and 2) results in nondeterminism that is undesirable in certain applications. In this paper, we introduce ParlayANN, a library of deterministic and parallel graph-based approximate nearest neighbor search algorithms, along with a set of useful tools for developing such algorithms. In this library, we develop novel parallel implementations for four state-of-the-art graph-based ANNS algorithms that scale to billion-scale datasets. Our algorithms are deterministic and achieve high scalability across a diverse set of challenging datasets. In addition to the new algorithmic ideas, we also conduct a detailed experimental study of our new algorithms as well as two existing non-graph approaches. Our experimental results both validate the effectiveness of our new techniques, and lead to a comprehensive comparison among ANNS algorithms on large scale datasets with a list of interesting findings. |
|||||
2023 | Fast Approximation Of Similarity Graphs With Kernel Density Estimation | Macgregor Peter, Sun He | Arxiv | Constructing a similarity graph from a set \(X\) of data points in \(\mathbb{R}^d\) is the first step of many modern clustering algorithms. However, typical constructions of a similarity graph have high time complexity, and a quadratic space dependency with respect to \(|X|\). We address this limitation and present a new algorithmic framework that constructs a sparse approximation of the fully connected similarity graph while preserving its cluster structure. Our presented algorithm is based on the kernel density estimation problem, and is applicable for arbitrary kernel functions. We compare our designed algorithm with the well-known implementations from the scikit-learn library and the FAISS library, and find that our method significantly outperforms the implementation from both libraries on a variety of datasets. |
|||||
2023 | Anserini Gets Dense Retrieval Integration Of Lucenes HNSW Indexes | Ma Xueguang, Teofili Tommaso, Lin Jimmy | Arxiv | Anserini is a Lucene-based toolkit for reproducible information retrieval research in Java that has been gaining traction in the community. It provides retrieval capabilities for both “traditional” bag-of-words retrieval models such as BM25 as well as retrieval using learned sparse representations such as SPLADE. With Pyserini, which provides a Python interface to Anserini, users gain access to both sparse and dense retrieval models, as Pyserini implements bindings to the Faiss vector search library alongside Lucene inverted indexes in a uniform, consistent interface. Nevertheless, hybrid fusion techniques that integrate sparse and dense retrieval models need to stitch together results from two completely different “software stacks”, which creates unnecessary complexities and inefficiencies. However, the introduction of HNSW indexes for dense vector search in Lucene promises the integration of both dense and sparse retrieval within a single software framework. We explore exactly this integration in the context of Anserini. Experiments on the MS MARCO passage and BEIR datasets show that our Anserini HNSW integration supports (reasonably) effective and (reasonably) efficient approximate nearest neighbor search for dense retrieval models, using only Lucene. |
|||||
2023 | On The Maximal Independent Sets Of k-mers With The Edit Distance | Ma Leran, Chen Ke, Shao Mingfu | Arxiv | In computational biology, \(k\)-mers and edit distance are fundamental concepts. However, little is known about the metric space of all \(k\)-mers equipped with the edit distance. In this work, we explore the structure of the \(k\)-mer space by studying its maximal independent sets (MISs). An MIS is a sparse sketch of all \(k\)-mers with nice theoretical properties, and therefore admits critical applications in clustering, indexing, hashing, and sketching large-scale sequencing data, particularly those with high error-rates. Finding an MIS is a challenging problem, as the size of a \(k\)-mer space grows geometrically with respect to \(k\). We propose three algorithms for this problem. The first and the most intuitive one uses a greedy strategy. The second method implements two techniques to avoid redundant comparisons by taking advantage of the locality-property of the \(k\)-mer space and the estimated bounds on the edit distance. The last algorithm avoids expensive calculations of the edit distance by translating the edit distance into the shortest path in a specifically designed graph. These algorithms are implemented and the calculated MISs of \(k\)-mer spaces and their statistical properties are reported and analyzed for \(k\) up to 15. Source code is freely available at https://github.com/Shao-Group/kmerspace . |
|||||
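The greedy strategy mentioned first in this abstract is easy to sketch. The snippet below is an illustrative, unoptimized version rather than the authors' tool: it assumes two k-mers are adjacent when their edit distance is at most a threshold d, and it greedily builds a maximal independent set over all 4^k k-mers, which is only practical for very small k.

```python
from itertools import product

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def greedy_kmer_mis(k: int, d: int = 1) -> list[str]:
    """Greedy maximal independent set over all k-mers, treating two k-mers as
    adjacent when their edit distance is <= d. Illustrative: O(4^k * |MIS|) time."""
    mis: list[str] = []
    for kmer in ("".join(p) for p in product("ACGT", repeat=k)):
        if all(edit_distance(kmer, m) > d for m in mis):
            mis.append(kmer)
    return mis

# Toy usage: MIS of the 3-mer space under edit distance <= 1.
mis = greedy_kmer_mis(3, d=1)
print(len(mis), mis[:5])
```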
2023 | Central Similarity Multi-view Hashing For Multimedia Retrieval | Zhu Jian, Cheng Wen, Cui Yu, Tang Chang, Dai Yuyang, Li Yong, Zeng Lingfang | Arxiv | Hash representation learning of multi-view heterogeneous data is the key to improving the accuracy of multimedia retrieval. However, existing methods rely only on local similarity to train their models, ignoring global similarity, and fall short of deeply fusing the multi-view features, resulting in poor retrieval accuracy. Furthermore, most recent works fuse the multi-view features via a weighted sum or concatenation. We contend that these fusion methods are insufficient for capturing the interaction between various views. We present a novel Central Similarity Multi-View Hashing (CSMVH) method to address the mentioned problems. Central similarity learning is used for solving the local similarity problem, which can utilize the global similarity between the hash center and samples. We present copious empirical data demonstrating the superiority of gate-based fusion over conventional approaches. On the MS COCO and NUS-WIDE, the proposed CSMVH performs better than the state-of-the-art methods by a large margin (up to 11.41% mean Average Precision (mAP) improvement). |
|||||
2023 | Attributes Grouping And Mining Hashing For Fine-grained Image Retrieval | Lu Xin, Chen Shikun, Cao Yichao, Zhou Xin, Lu Xiaobo | Proceedings of the | In recent years, hashing methods have been popular in large-scale media search for their low storage cost and strong representation capabilities. To describe objects with similar overall appearance but subtle differences, more and more studies focus on hashing-based fine-grained image retrieval. Existing hashing networks usually generate both local and global features through attention guidance on the same deep activation tensor, which limits the diversity of feature representations. To handle this limitation, we substitute convolutional descriptors for attention-guided features and propose an Attributes Grouping and Mining Hashing (AGMH), which groups and embeds the category-specific visual attributes in multiple descriptors to generate a comprehensive feature representation for efficient fine-grained image retrieval. Specifically, an Attention Dispersion Loss (ADL) is designed to force the descriptors to attend to various local regions and capture diverse subtle details. Moreover, we propose a Stepwise Interactive External Attention (SIEA) to mine critical attributes in each descriptor and construct correlations between fine-grained attributes and objects. The attention mechanism is dedicated to learning discrete attributes, which incurs no additional computation in hash code generation. Finally, the compact binary codes are learned by preserving pairwise similarities. Experimental results demonstrate that AGMH consistently yields the best performance against state-of-the-art methods on fine-grained benchmark datasets. |
|||||
2023 | CHAIN Exploring Global-local Spatio-temporal Information For Improved Self-supervised Video Hashing | Wei Rukai, Liu Yu, Song Jingkuan, Cui Heng, Xie Yanzhao, Zhou Ke | Arxiv | Compressing videos into binary codes can improve retrieval speed and reduce storage overhead. However, learning accurate hash codes for video retrieval can be challenging due to high local redundancy and complex global dependencies between video frames, especially in the absence of labels. Existing self-supervised video hashing methods have been effective in designing expressive temporal encoders, but have not fully utilized the temporal dynamics and spatial appearance of videos due to less challenging and unreliable learning tasks. To address these challenges, we begin by utilizing the contrastive learning task to capture global spatio-temporal information of videos for hashing. With the aid of our designed augmentation strategies, which focus on spatial and temporal variations to create positive pairs, the learning framework can generate hash codes that are invariant to motion, scale, and viewpoint. Furthermore, we incorporate two collaborative learning tasks, i.e., frame order verification and scene change regularization, to capture local spatio-temporal details within video frames, thereby enhancing the perception of temporal structure and the modeling of spatio-temporal relationships. Our proposed Contrastive Hashing with Global-Local Spatio-temporal Information (CHAIN) outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets. Our codes will be released. |
|||||
2023 | Building K-anonymous User Cohorts With Consecutive Consistent Weighted Sampling (CCWS) | Zheng Xinyi, Zhao Weijie, Li Xiaoyun, Li Ping | Arxiv | To retrieve personalized campaigns and creatives while protecting user privacy, digital advertising is shifting from member-based identity to cohort-based identity. Under such identity regime, an accurate and efficient cohort building algorithm is desired to group users with similar characteristics. In this paper, we propose a scalable \(K\)-anonymous cohort building algorithm called {\em consecutive consistent weighted sampling} (CCWS). The proposed method combines the spirit of the (\(p\)-powered) consistent weighted sampling and hierarchical clustering, so that the \(K\)-anonymity is ensured by enforcing a lower bound on the size of cohorts. Evaluations on a LinkedIn dataset consisting of \(>70\)M users and ads campaigns demonstrate that CCWS achieves substantial improvements over several hashing-based methods including sign random projections (SignRP), minwise hashing (MinHash), as well as the vanilla CWS. |
|||||
2023 | Adaptive Confidence Multi-view Hashing For Multimedia Retrieval | Zhu Jian, Cui Yu, Huang Zhangmin, Li Xingyu, Liu Lei, Zeng Lingfang, Dai Li-rong | Arxiv | The multi-view hash method converts heterogeneous data from multiple views into binary hash codes, which is one of the critical technologies in multimedia retrieval. However, the current methods mainly explore the complementarity among multiple views while lacking confidence learning and fusion. Moreover, in practical application scenarios, the single-view data contain redundant noise. To conduct the confidence learning and eliminate unnecessary noise, we propose a novel Adaptive Confidence Multi-View Hashing (ACMVH) method. First, a confidence network is developed to extract useful information from various single-view features and remove noise information. Furthermore, an adaptive confidence multi-view network is employed to measure the confidence of each view and then fuse multi-view features through a weighted summation. Lastly, a dilation network is designed to further enhance the feature representation of the fused features. To the best of our knowledge, we pioneer the application of confidence learning into the field of multimedia retrieval. Extensive experiments on two public datasets show that the proposed ACMVH performs better than state-of-the-art methods (maximum increase of 3.24%). The source code is available at https://github.com/HackerHyper/ACMVH. |
|||||
2023 | CLIP Multi-modal Hashing A New Baseline CLIPMH | Zhu Jian, Sheng Mingkai, Ke Mingda, Huang Zhangmin, Chang Jingfei | Arxiv | The multi-modal hashing method is widely used in multimedia retrieval. It can fuse multi-source data to generate binary hash codes. However, the current multi-modal methods have the problem of low retrieval accuracy. The reason is that the individual backbone networks have limited feature expression capabilities and are not jointly pre-trained on large-scale unsupervised multi-modal data. To solve this problem, we propose a new baseline CLIP Multi-modal Hashing (CLIPMH) method. It uses the CLIP model to extract text and image features, which are then fused to generate the hash code. CLIP improves the expressiveness of each modal feature. In this way, it can greatly improve the retrieval performance of multi-modal hashing methods. In comparison to state-of-the-art unsupervised and supervised multi-modal hashing methods, experiments reveal that the proposed CLIPMH can significantly enhance performance (maximum increase of 8.38%). CLIP also has great advantages over the text and visual backbone networks commonly used before. |
|||||
2023 | Attribute-aware Deep Hashing With Self-consistency For Large-scale Fine-grained Image Retrieval | Wei Xiu-shen, Shen Yang, Sun Xuhao, Wang Peng, Peng Yuxin | Arxiv | Our work focuses on tackling large-scale fine-grained image retrieval as ranking the images depicting the concept of interests (i.e., the same sub-category labels) highest based on the fine-grained details in the query. It is desirable to alleviate the challenges of both fine-grained nature of small inter-class variations with large intra-class variations and explosive growth of fine-grained data for such a practical task. In this paper, we propose attribute-aware hashing networks with self-consistency for generating attribute-aware hash codes to not only make the retrieval process efficient, but also establish explicit correspondences between hash codes and visual attributes. Specifically, based on the captured visual representations by attention, we develop an encoder-decoder structure network of a reconstruction task to unsupervisedly distill high-level attribute-specific vectors from the appearance-specific visual representations without attribute annotations. Our models are also equipped with a feature decorrelation constraint upon these attribute vectors to strengthen their representative abilities. Then, driven by preserving original entities’ similarity, the required hash codes can be generated from these attribute-specific vectors and thus become attribute-aware. Furthermore, to combat simplicity bias in deep hashing, we consider the model design from the perspective of the self-consistency principle and propose to further enhance models’ self-consistency by equipping an additional image reconstruction path. Comprehensive quantitative experiments under diverse empirical settings on six fine-grained retrieval datasets and two generic retrieval datasets show the superiority of our models over competing methods. |
|||||
2023 | Cryptanalysis Of A Cayley Hash Function Based On Affine Maps In One Variable Over A Finite Field | Sosnovski Bianca | Arxiv | Cayley hash functions are cryptographic hashes constructed from Cayley graphs of groups. The hash function proposed by Shpilrain and Sosnovski (2016), based on linear functions over a finite field, was proven insecure. This paper shows that the proposal by Ghaffari and Mostaghim (2018) that uses the Shpilrain and Sosnovski’s hash in its construction is also insecure. We demonstrate its security vulnerability by constructing collisions. |
|||||
2023 | Embedding In Recommender Systems A Survey | Zhao Xiangyu, Wang Maolin, Zhao Xinjian, Li Jiansheng, Zhou Shucheng, Yin Dawei, Li Qing, Tang Jiliang, Guo Ruocheng | Arxiv | Recommender systems have become an essential component of many online platforms, providing personalized recommendations to users. A crucial aspect is embedding techniques that convert high-dimensional discrete features, such as user and item IDs, into low-dimensional continuous vectors and can enhance recommendation performance. Applying embedding techniques captures complex entity relationships and has spurred substantial research. In this survey, we provide an overview of the recent literature on embedding techniques in recommender systems. This survey covers embedding methods like collaborative filtering, self-supervised learning, and graph-based techniques. Collaborative filtering generates embeddings capturing user-item preferences, excelling in sparse data. Self-supervised methods leverage contrastive or generative learning for various tasks. Graph-based techniques like node2vec exploit complex relationships in network-rich environments. Addressing the scalability challenges inherent to embedding methods, our survey delves into innovative directions within the field of recommendation systems. These directions aim to enhance performance and reduce computational complexity, paving the way for improved recommender systems. Among these innovative approaches, we will introduce Auto Machine Learning (AutoML), hash techniques, and quantization techniques in this survey. We discuss various architectures and techniques and highlight the challenges and future directions in these aspects. This survey aims to provide a comprehensive overview of the state-of-the-art in this rapidly evolving field and serve as a useful resource for researchers and practitioners working in the area of recommender systems. |
|||||
2023 | Learning Category Trees For Id-based Recommendation Exploring The Power Of Differentiable Vector Quantization | Liu Qijiong, Fan Lu, Xiao Jiaren, Zhu Jieming, Wu Xiao-ming | Arxiv | Category information plays a crucial role in enhancing the quality and personalization of recommender systems. Nevertheless, the availability of item category information is not consistently present, particularly in the context of ID-based recommendations. In this work, we propose a novel approach to automatically learn and generate entity (i.e., user or item) category trees for ID-based recommendation. Specifically, we devise a differentiable vector quantization framework for automatic category tree generation, namely CAGE, which enables the simultaneous learning and refinement of categorical code representations and entity embeddings in an end-to-end manner, starting from the randomly initialized states. With its high adaptability, CAGE can be easily integrated into both sequential and non-sequential recommender systems. We validate the effectiveness of CAGE on various recommendation tasks including list completion, collaborative filtering, and click-through rate prediction, across different recommendation models. We release the code and data for others to reproduce the reported results. |
|||||
2023 | HS-GCN Hamming Spatial Graph Convolutional Networks For Recommendation | Liu Han, Wei Yinwei, Yin Jianhua, Nie Liqiang | Arxiv | An efficient solution to the large-scale recommender system is to represent users and items as binary hash codes in the Hamming space. Towards this end, existing methods tend to code users by modeling their Hamming similarities with the items they historically interact with, which are termed as the first-order similarities in this work. Despite their efficiency, these methods suffer from the suboptimal representative capacity, since they forgo the correlation established by connecting multiple first-order similarities, i.e., the relation among the indirect instances, which could be defined as the high-order similarity. To tackle this drawback, we propose to model both the first- and the high-order similarities in the Hamming space through the user-item bipartite graph. Therefore, we develop a novel learning to hash framework, namely Hamming Spatial Graph Convolutional Networks (HS-GCN), which explicitly models the Hamming similarity and embeds it into the codes of users and items. Extensive experiments on three public benchmark datasets demonstrate that our proposed model significantly outperforms several state-of-the-art hashing models, and obtains performance comparable with the real-valued recommendation models. |
|||||
2023 | Sparse-inductive Generative Adversarial Hashing For Nearest Neighbor Search | Liu Hong | Arxiv | Unsupervised hashing has received extensive research focus over the past decade; it typically aims at preserving a predefined metric (i.e., the Euclidean metric) in the Hamming space. To this end, the encoding functions of existing hashing methods are typically quasi-isometric, devoted to reducing the quantization loss from the target metric space to the discrete Hamming space. However, it is indeed problematic to directly minimize such error, since such mentioned two metric spaces are heterogeneous, and the quasi-isometric mapping is non-linear. The former leads to inconsistent feature distributions, while the latter leads to problematic optimization issues. In this paper, we propose a novel unsupervised hashing method, termed Sparsity-Induced Generative Adversarial Hashing (SiGAH), to encode large-scale high-dimensional features into binary codes, which well solves the two problems through a generative adversarial training framework. Instead of minimizing the quantization loss, our key innovation lies in enforcing the learned Hamming space to have similar data distribution to the target metric space via a generative model. In particular, we formulate a ReLU-based neural network as a generator to output binary codes and an MSE-loss based auto-encoder network as a discriminator, upon which a generative adversarial learning is carried out to train hash functions. Furthermore, to generate the synthetic features from the hash codes, a compressed sensing procedure is introduced into the generative model, which enforces the reconstruction boundary of binary codes to be consistent with that of original features. Finally, such a generative adversarial framework can be trained via the Adam optimizer. Experimental results on four benchmarks, i.e., Tiny100K, GIST1M, Deep1M, and MNIST, have shown that the proposed SiGAH has superior performance over the state-of-the-art approaches. |
|||||
2023 | Reliable And Efficient Evaluation Of Adversarial Robustness For Deep Hashing-based Retrieval | Wang Xunguang, Bai Jiawang, Xu Xinyue, Li Xiaomeng | Arxiv | Deep hashing has been extensively applied to massive image retrieval due to its efficiency and effectiveness. Recently, several adversarial attacks have been presented to reveal the vulnerability of deep hashing models against adversarial examples. However, existing attack methods suffer from degraded performance or inefficiency because they underutilize the semantic relations between original samples or spend a lot of time learning these relations with a deep neural network. In this paper, we propose a novel Pharos-guided Attack, dubbed PgA, to evaluate the adversarial robustness of deep hashing networks reliably and efficiently. Specifically, we design pharos code to represent the semantics of the benign image, which preserves the similarity to semantically relevant samples and dissimilarity to irrelevant ones. It is proven that we can quickly calculate the pharos code via a simple math formula. Accordingly, PgA can directly conduct a reliable and efficient attack on deep hashing-based retrieval by maximizing the similarity between the hash code of the adversarial example and the pharos code. Extensive experiments on the benchmark datasets verify that the proposed algorithm outperforms the prior state-of-the-arts in both attack strength and speed. |
|||||
2023 | Deep Metric Multi-view Hashing For Multimedia Retrieval | Zhu Jian, Huang Zhangmin, Ruan Xiaohu, Cui Yu, Cheng Yongli, Zeng Lingfang | Arxiv | Learning the hash representation of multi-view heterogeneous data is an important task in multimedia retrieval. However, existing methods fail to effectively fuse the multi-view features and utilize the metric information provided by the dissimilar samples, leading to limited retrieval precision. Current methods utilize weighted sum or concatenation to fuse the multi-view features. We argue that these fusion methods cannot capture the interaction among different views. Furthermore, these methods ignored the information provided by the dissimilar samples. We propose a novel deep metric multi-view hashing (DMMVH) method to address the mentioned problems. Extensive empirical evidence is presented to show that gate-based fusion is better than typical methods. We introduce deep metric learning to the multi-view hashing problems, which can utilize metric information of dissimilar samples. On the MIR-Flickr25K, MS COCO, and NUS-WIDE, our method outperforms the current state-of-the-art methods by a large margin (up to 15.28 mean Average Precision (mAP) improvement). |
|||||
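Several multi-view hashing entries above (CSMVH, ACMVH, DMMVH) argue that gate-based fusion captures cross-view interaction better than a weighted sum or concatenation. The PyTorch sketch below is a generic illustration of such a gated fusion followed by a tanh-relaxed hashing layer; the module layout, dimensions, and exact gating form are assumptions, not any of those papers' architectures.

```python
import torch
import torch.nn as nn

class GatedFusionHash(nn.Module):
    """Generic two-view gated fusion plus a hashing head (illustrative sketch)."""
    def __init__(self, dim_a: int, dim_b: int, hidden: int, code_len: int):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, hidden)
        self.proj_b = nn.Linear(dim_b, hidden)
        self.gate = nn.Sequential(nn.Linear(dim_a + dim_b, hidden), nn.Sigmoid())
        self.hash_head = nn.Linear(hidden, code_len)

    def forward(self, view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([view_a, view_b], dim=-1))      # per-dimension gate
        fused = g * self.proj_a(view_a) + (1 - g) * self.proj_b(view_b)
        return torch.tanh(self.hash_head(fused))                # relaxed codes for training

model = GatedFusionHash(dim_a=512, dim_b=300, hidden=256, code_len=64)
relaxed = model(torch.randn(4, 512), torch.randn(4, 300))
binary = torch.sign(relaxed)                                    # discrete codes at inference
print(binary.shape)
```

The gate lets every code dimension decide per sample how much to trust each view, which is the kind of interaction a fixed weighted sum cannot express.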
2023 | An Elementary Proof Of The First LP Bound On The Rate Of Binary Codes | Linial Nati, Loyfer Elyassaf | Arxiv | The asymptotic rate vs. distance problem is a long-standing fundamental problem in coding theory. The best upper bound to date was given in 1977 and has received since then numerous proofs and interpretations. Here we provide a new, elementary proof of this bound based on counting walks in the Hamming cube. |
|||||
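For reference, the 1977 bound discussed in this entry, the first linear-programming (MRRW) bound, can be stated for a binary code of relative distance \(\delta\) as follows; this restatement is recalled from standard coding-theory references rather than quoted from the paper:

\[ R(\delta) \;\le\; h\!\left(\tfrac{1}{2} - \sqrt{\delta(1-\delta)}\right), \qquad 0 \le \delta \le \tfrac{1}{2}, \qquad h(x) = -x\log_2 x - (1-x)\log_2(1-x). \]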
2023 | Binary Code Similarity Detection | Liu Zian | Arxiv | Binary code similarity detection aims to detect the similarity of code at the binary (assembly) level without source code. Existing works have their limitations when dealing with mutated binary code generated by different compiling options. In this paper, we propose a novel approach to addressing this problem. By inspecting the binary code, we found that generally, within a function, some instructions aim to calculate (prepare) values for other instructions. The latter instructions are defined by us as key instructions. Currently, we define four categories of key instructions: calling subfunctions, comparing instruction, returning instruction, and memory-store instruction. Thus, if we symbolically execute similar binary codes, symbolic values at these key instructions are expected to be similar. As such, we implement a prototype tool, which has three steps. First, it symbolically executes the binary code; second, it extracts symbolic values at the defined key instructions into a graph; last, it compares the similarity of the symbolic graphs. In our implementation, we also address some problems, including path explosion and loop handling. |
|||||
2023 | RAFIC Retrieval-augmented Few-shot Image Classification | Lin Hangfei, Miao Li, Ziai Amir | Arxiv | Few-shot image classification is the task of classifying unseen images to one of N mutually exclusive classes, using only a small number of training examples for each class. The limited availability of these examples (denoted as K) presents a significant challenge to classification accuracy in some cases. To address this, we have developed a method for augmenting the set of K with an additional set of A retrieved images. We call this system Retrieval-Augmented Few-shot Image Classification (RAFIC). Through a series of experiments, we demonstrate that RAFIC markedly improves performance of few-shot image classification across two challenging datasets. RAFIC consists of two main components: (a) a retrieval component which uses CLIP, LAION-5B, and faiss, in order to efficiently retrieve images similar to the supplied images, and (b) retrieval meta-learning, which learns to judiciously utilize the retrieved images. Code and data are available at github.com/amirziai/rafic. |
|||||
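The retrieval component described here reduces to a standard ANN lookup over CLIP embeddings. The sketch below is a minimal faiss-based stand-in, using an exact inner-product index rather than whatever index RAFIC actually uses, and assuming the embeddings are precomputed and L2-normalized so that inner product equals cosine similarity.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 512                                    # CLIP embedding dimension (assumed)
rng = np.random.default_rng(0)

# Random stand-ins for precomputed, L2-normalized CLIP embeddings.
database = rng.normal(size=(10_000, d)).astype("float32")
database /= np.linalg.norm(database, axis=1, keepdims=True)
queries = rng.normal(size=(5, d)).astype("float32")
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

index = faiss.IndexFlatIP(d)               # exact maximum inner-product search
index.add(database)
scores, ids = index.search(queries, 8)     # top-8 most similar images per query
print(ids.shape, scores[0])
```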
2023 | Learning Compact Compositional Embeddings Via Regularized Pruning For Recommendation | Liang Xurong, Chen Tong, Nguyen Quoc Viet Hung, Li Jianxin, Yin Hongzhi | Arxiv | Latent factor models are the dominant backbones of contemporary recommender systems (RSs) given their performance advantages, where a unique vector embedding with a fixed dimensionality (e.g., 128) is required to represent each entity (commonly a user/item). Due to the large number of users and items on e-commerce sites, the embedding table is arguably the least memory-efficient component of RSs. For any lightweight recommender that aims to efficiently scale with the growing size of users/items or to remain applicable in resource-constrained settings, existing solutions either reduce the number of embeddings needed via hashing, or sparsify the full embedding table to switch off selected embedding dimensions. However, as hash collision arises or embeddings become overly sparse, especially when adapting to a tighter memory budget, those lightweight recommenders inevitably have to compromise their accuracy. To this end, we propose a novel compact embedding framework for RSs, namely Compositional Embedding with Regularized Pruning (CERP). Specifically, CERP represents each entity by combining a pair of embeddings from two independent, substantially smaller meta-embedding tables, which are then jointly pruned via a learnable element-wise threshold. In addition, we innovatively design a regularized pruning mechanism in CERP, such that the two sparsified meta-embedding tables are encouraged to encode information that is mutually complementary. Given the compatibility with agnostic latent factor models, we pair CERP with two popular recommendation models for extensive experiments, where results on two real-world datasets under different memory budgets demonstrate its superiority against state-of-the-art baselines. The codebase of CERP is available in https://github.com/xurong-liang/CERP. |
|||||
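The core idea, composing each entity embedding from two much smaller meta-embedding tables and pruning them with a learnable element-wise threshold, can be sketched compactly. The PyTorch code below is an illustrative approximation: the table sizes, the quotient/remainder index mapping, and the soft-threshold form are assumptions, and CERP's actual regularized pruning is more involved.

```python
import torch
import torch.nn as nn

class CompositionalEmbedding(nn.Module):
    """Compose entity embeddings from two small meta-tables (illustrative sketch)."""
    def __init__(self, num_entities: int, bucket_size: int, dim: int):
        super().__init__()
        num_q = (num_entities + bucket_size - 1) // bucket_size
        self.bucket_size = bucket_size
        self.table_q = nn.Embedding(num_q, dim)           # quotient meta-table
        self.table_r = nn.Embedding(bucket_size, dim)     # remainder meta-table
        # Learnable element-wise thresholds used to softly prune each table.
        self.thresh_q = nn.Parameter(torch.zeros(num_q, dim))
        self.thresh_r = nn.Parameter(torch.zeros(bucket_size, dim))

    @staticmethod
    def _soft_prune(weight, thresh):
        # Magnitude-based soft thresholding: small entries are zeroed out,
        # yielding sparse meta-embeddings.
        return torch.sign(weight) * torch.relu(weight.abs() - torch.sigmoid(thresh))

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        wq = self._soft_prune(self.table_q.weight, self.thresh_q)[ids // self.bucket_size]
        wr = self._soft_prune(self.table_r.weight, self.thresh_r)[ids % self.bucket_size]
        return wq + wr            # combine the pair of meta-embeddings per entity

emb = CompositionalEmbedding(num_entities=1_000_000, bucket_size=1000, dim=32)
print(emb(torch.tensor([3, 42, 999_999])).shape)   # torch.Size([3, 32])
```

With a million entities this stores two 1000-row tables instead of one million-row table, which is the memory saving the compositional construction targets.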
2023 | Practice With Graph-based ANN Algorithms On Sparse Data Chi-square Two-tower Model HNSW Sign Cauchy Projections | Li Ping, Zhao Weijie, Wang Chao, Xia Qi, Wu Alice, Peng Lijun | Arxiv | Sparse data are common. The traditional |
|||||
2023 | Locality Preserving Multiview Graph Hashing For Large Scale Remote Sensing Image Search | Li Wenyun, Zhong Guo, Lu Xingyu, Pun Chi-man | Arxiv | Hashing is very popular for remote sensing image search. This article proposes a multiview hashing method with learnable parameters to retrieve queried images from a large-scale remote sensing dataset. Existing methods always neglect that real-world remote sensing data lies on a low-dimensional manifold embedded in high-dimensional ambient space. Unlike previous methods, this article proposes to learn the consensus compact codes in a view-specific low-dimensional subspace. Furthermore, we have added a hyperparameter learnable module to avoid complex parameter tuning. In order to prove the effectiveness of our method, we carried out experiments on three widely used remote sensing data sets and compared our method with seven state-of-the-art methods. Extensive experiments show that the proposed method can achieve competitive results compared to the other methods. |
|||||
2023 | Dual-stream Knowledge-preserving Hashing For Unsupervised Video Retrieval | Li Pandeng, Xie Hongtao, Ge Jiannan, Zhang Lei, Min Shaobo, Zhang Yongdong | Arxiv | Unsupervised video hashing usually optimizes binary codes by learning to reconstruct input videos. Such reconstruction constraint spends much effort on frame-level temporal context changes without focusing on video-level global semantics that are more useful for retrieval. Hence, we address this problem by decomposing video information into reconstruction-dependent and semantic-dependent information, which disentangles the semantic extraction from reconstruction constraint. Specifically, we first design a simple dual-stream structure, including a temporal layer and a hash layer. Then, with the help of semantic similarity knowledge obtained from self-supervision, the hash layer learns to capture information for semantic retrieval, while the temporal layer learns to capture the information for reconstruction. In this way, the model naturally preserves the disentangled semantics into binary codes. Validated by comprehensive experiments, our method consistently outperforms the state-of-the-arts on three video benchmarks. |
|||||
2023 | Pb-hash Partitioned B-bit Hashing | Li Ping, Zhao Weijie | Arxiv | Many hashing algorithms including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS) generate integers of \(B\) bits. With \(k\) hashes for each data vector, the storage would be \(B\times k\) bits; and when used for large-scale learning, the model size would be \(2^B\times k\), which can be expensive. A standard strategy is to use only the lowest \(b\) bits out of the \(B\) bits and somewhat increase \(k\), the number of hashes. In this study, we propose to re-use the hashes by partitioning the \(B\) bits into \(m\) chunks, e.g., \(b\times m =B\). Correspondingly, the model size becomes \(m\times 2^b \times k\), which can be substantially smaller than the original \(2^B\times k\). Our theoretical analysis reveals that by partitioning the hash values into \(m\) chunks, the accuracy would drop. In other words, using \(m\) chunks of \(B/m\) bits would not be as accurate as directly using \(B\) bits. This is due to the correlation from re-using the same hash. On the other hand, our analysis also shows that the accuracy would not drop much for (e.g.,) \(m=2\sim 4\). In some regions, Pb-Hash still works well even for \(m\) much larger than 4. We expect Pb-Hash would be a good addition to the family of hashing methods/applications and benefit industrial practitioners. We verify the effectiveness of Pb-Hash in machine learning tasks, for linear SVM models as well as deep learning models. Since the hashed data are essentially categorical (ID) features, we follow the standard practice of using embedding tables for each hash. With Pb-Hash, we need to design an effective strategy to combine \(m\) embeddings. Our study provides an empirical evaluation on four pooling schemes: concatenation, max pooling, mean pooling, and product pooling. There is no definite answer which pooling would be always better and we leave that for future study. |
|||||
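The partitioning step itself is plain bit manipulation: a B-bit hash value is split into m chunks of b bits, each chunk indexes its own small embedding table, and the m embeddings are pooled. The numpy sketch below illustrates this with mean pooling; it is a schematic reading of the abstract, not the authors' code.

```python
import numpy as np

def split_hash(value: int, B: int = 16, m: int = 4) -> list[int]:
    """Split one B-bit hash value into m chunks of b = B/m bits each."""
    b = B // m
    mask = (1 << b) - 1
    return [(value >> (i * b)) & mask for i in range(m)]

def pb_hash_features(hash_values, B=16, m=4, dim=8, seed=0):
    """Map k hash values to a pooled embedding: each of the m chunks of every
    hash indexes its own 2^b x dim table; chunk embeddings are then mean-pooled
    (one of several pooling choices discussed in the paper)."""
    b = B // m
    rng = np.random.default_rng(seed)
    tables = rng.normal(size=(m, 2 ** b, dim))   # one small table per chunk position
    chunks = np.array([split_hash(v, B, m) for v in hash_values])          # (k, m)
    embs = np.stack([tables[j, chunks[:, j]] for j in range(m)], axis=1)   # (k, m, dim)
    return embs.mean(axis=(0, 1))                # pool over hashes and chunks

print(pb_hash_features([0xBEEF, 0x1234, 0xCAFE]).shape)   # (8,)
```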
2023 | Can LSH (locality-sensitive Hashing) Be Replaced By Neural Network | Liu Renyang, Zhao Jun, Chu Xing, Liang Yu, Zhou Wei, He Jing | Arxiv | With the rapid development of GPU (Graphics Processing Unit) technologies and neural networks, we can explore more appropriate data structures and algorithms. Recent progress shows that neural networks can partly replace traditional data structures. In this paper, we propose a novel DNN (Deep Neural Network)-based learned locality-sensitive hashing, called LLSH, to efficiently and flexibly map high-dimensional data to low-dimensional space. LLSH replaces the traditional LSH (Locality-sensitive Hashing) function families with parallel multi-layer neural networks, which reduces the time and memory consumption and guarantees query accuracy simultaneously. The proposed LLSH demonstrates the feasibility of replacing the hash index with learning-based neural networks and opens a new door for developers to design and configure data organization more accurately to improve information-searching performance. Extensive experiments on different types of datasets show the superiority of the proposed method in query accuracy, time consumption, and memory usage. |
|||||
2023 | Differentially Private One Permutation Hashing And Bin-wise Consistent Weighted Sampling | Li Xiaoyun, Li Ping | Arxiv | Minwise hashing (MinHash) is a standard algorithm widely used in the industry, for large-scale search and learning applications with the binary (0/1) Jaccard similarity. One common use of MinHash is for processing massive n-gram text representations so that practitioners do not have to materialize the original data (which would be prohibitive). Another popular use of MinHash is for building hash tables to enable sub-linear time approximate near neighbor (ANN) search. MinHash has also been used as a tool for building large-scale machine learning systems. The standard implementation of MinHash requires applying \(K\) random permutations. In comparison, the method of one permutation hashing (OPH), is an efficient alternative of MinHash which splits the data vectors into \(K\) bins and generates hash values within each bin. OPH is substantially more efficient and also more convenient to use. In this paper, we combine the differential privacy (DP) with OPH (as well as MinHash), to propose the DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH. A detailed roadmap to the algorithm design is presented along with the privacy analysis. An analytical comparison of our proposed DP-OPH methods with the DP minwise hashing (DP-MH) is provided to justify the advantage of DP-OPH. Experiments on similarity search confirm the merits of DP-OPH, and guide the choice of the proper variant in different practical scenarios. Our technique is also extended to bin-wise consistent weighted sampling (BCWS) to develop a new DP algorithm called DP-BCWS for non-binary data. Experiments on classification tasks demonstrate that DP-BCWS is able to achieve excellent utility at around \(\epsilon = 5\sim 10\), where \(\epsilon\) is the standard parameter in the language of \((\epsilon, \delta)\)-DP. |
|||||
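One permutation hashing (OPH), the building block being privatized here, can be sketched directly: permute the coordinates of a sparse binary vector once, divide them into K equal bins, and record the smallest permuted offset of a nonzero entry in each bin; empty bins are exactly why densification schemes exist. The snippet below is an illustrative non-private version and does not include the paper's DP mechanisms.

```python
import numpy as np

def one_permutation_hash(nonzero_ids, dim, K, seed=0):
    """Basic OPH: one shared random permutation, K bins, minimum permuted offset
    per bin. Returns -1 for empty bins (handled by densification in practice)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(dim)              # the single shared permutation
    bin_size = dim // K                      # assume K divides dim for simplicity
    sketch = np.full(K, -1, dtype=np.int64)
    for idx in nonzero_ids:
        p = perm[idx]                        # permuted position of this nonzero
        b, offset = p // bin_size, p % bin_size
        if sketch[b] == -1 or offset < sketch[b]:
            sketch[b] = offset
    return sketch

# Two similar sparse binary vectors agree on most bin-wise minima.
x = [3, 17, 256, 900, 4095]
y = [3, 17, 256, 901, 4095]
print(one_permutation_hash(x, dim=4096, K=8))
print(one_permutation_hash(y, dim=4096, K=8))
```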
2023 | Lrvs-fashion Extending Visual Search With Referring Instructions | Lepage Simon, Mary Jérémie, Picard David | Arxiv | This paper introduces a new challenge for image similarity search in the context of fashion, addressing the inherent ambiguity in this domain stemming from complex images. We present Referred Visual Search (RVS), a task allowing users to define more precisely the desired similarity, following recent interest in the industry. We release a new large public dataset, LRVS-Fashion, consisting of 272k fashion products with 842k images extracted from fashion catalogs, designed explicitly for this task. However, unlike traditional visual search methods in the industry, we demonstrate that superior performance can be achieved by bypassing explicit object detection and adopting weakly-supervised conditional contrastive learning on image tuples. Our method is lightweight and demonstrates robustness, reaching Recall at one superior to strong detection-based baselines against 2M distractors. The dataset is available at https://huggingface.co/datasets/Slep/LAION-RVS-Fashion . |
|||||
2023 | Fast Consistent Hashing In Constant Time | Leu Eric | Arxiv | Consistent hashing is a technique that can minimize key remapping when the number of hash buckets changes. The paper proposes a fast consistent hash algorithm (called power consistent hash) that has \(O(1)\) expected time for key lookup, independent of the number of buckets. Hash values are computed in real time. No search data structure is constructed to store bucket ranges or key mappings. The algorithm has a lightweight design using \(O(1)\) space with superior scalability. In particular, it uses two auxiliary hash functions to achieve distribution uniformity and \(O(1)\) expected time for key lookup. Furthermore, it performs consistent hashing such that only a minimal number of keys are remapped when the number of buckets changes. Consistent hashing has a wide range of use cases, including load balancing, distributed caching, and distributed key-value stores. The proposed algorithm is faster than well-known consistent hash algorithms with \(O(\log n)\) lookup time. |
|||||
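The paper's power consistent hash is not reproduced here, but the problem it targets can be made concrete with a different, well-known constant-space consistent hash: Jump Consistent Hash (Lamping and Veach, 2014), shown below. It maps a 64-bit key to one of n buckets with minimal remapping when n grows, though its lookup is O(ln n) rather than the O(1) claimed by this paper.

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Jump Consistent Hash: assigns a 64-bit key to a bucket in [0, num_buckets)
    such that growing the bucket count remaps only ~1/n of the keys."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF   # 64-bit LCG step
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b

# Only a small fraction of keys move when the bucket count grows from 10 to 11.
moved = sum(jump_consistent_hash(k, 10) != jump_consistent_hash(k, 11)
            for k in range(100_000))
print(f"{moved / 100_000:.3f} of keys remapped")   # roughly 1/11
```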
2023 | Shockhash Near Optimal-space Minimal Perfect Hashing Beyond Brute-force | Lehmann Hans-peter, Sanders Peter, Walzer Stefan | Arxiv | A minimal perfect hash function (MPHF) maps a set S of n keys to the first n integers without collisions. There is a lower bound of n*log(e)=1.44n bits needed to represent an MPHF. This can be reached by a brute-force algorithm that tries e^n hash function seeds in expectation and stores the first seed leading to an MPHF. The most space-efficient previous algorithms for constructing MPHFs all use such a brute-force approach as a basic building block. In this paper, we introduce ShockHash - Small, heavily overloaded cuckoo hash tables for minimal perfect hashing. ShockHash uses two hash functions h_0 and h_1, hoping for the existence of a function f : S->{0, 1} such that x -> h_{f(x)}(x) is an MPHF on S. It then uses a 1-bit retrieval data structure to store f using n + o(n) bits. In graph terminology, ShockHash generates n-edge random graphs until stumbling on a pseudoforest - where each component contains as many edges as nodes. Using cuckoo hashing, ShockHash then derives an MPHF from the pseudoforest in linear time. We show that ShockHash needs to try only about (e/2)^n=1.359^n seeds in expectation. This reduces the space for storing the seed by roughly n bits (maintaining the asymptotically optimal space consumption) and speeds up construction by almost a factor of 2^n compared to brute-force. Bipartite ShockHash reduces the expected construction time again to 1.166^n by maintaining a pool of candidate hash functions and checking all possible pairs. ShockHash as a building block within the RecSplit framework can be constructed up to 3 orders of magnitude faster than competing approaches. It can build an MPHF for 10 million keys with 1.489 bits per key in about half an hour. When instead using ShockHash after an efficient k-perfect hash function, it achieves space usage similar to the best competitors, while being significantly faster to construct and query. |
|||||
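The brute-force building block that ShockHash speeds up can be shown in a few lines: try hash-function seeds until one maps the n keys bijectively onto {0, ..., n-1}, then store that seed. The sketch below uses a seeded BLAKE2b hash as the hash family, which is an illustrative stand-in rather than anything from the paper.

```python
import hashlib

def seeded_hash(key: str, seed: int, n: int) -> int:
    """Deterministic seeded hash into [0, n) via BLAKE2b (illustrative choice)."""
    h = hashlib.blake2b(key.encode(), salt=seed.to_bytes(8, "little"), digest_size=8)
    return int.from_bytes(h.digest(), "little") % n

def brute_force_mphf_seed(keys: list[str], max_tries: int = 10_000_000) -> int:
    """Try seeds until the seeded hash is a bijection on `keys` (about e^n tries
    in expectation). ShockHash replaces this single-function search with two hash
    functions plus a 1-bit-per-key retrieval structure, cutting the expected
    number of tries to roughly (e/2)^n."""
    n = len(keys)
    for seed in range(max_tries):
        if len({seeded_hash(k, seed, n) for k in keys}) == n:   # no collisions
            return seed
    raise RuntimeError("no perfect seed found within max_tries")

keys = ["apple", "banana", "cherry", "date", "elderberry", "fig", "grape"]
seed = brute_force_mphf_seed(keys)
print(seed, sorted(seeded_hash(k, seed, len(keys)) for k in keys))
```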
2023 | Optimal-hash Exact String Matching Algorithms | Lecroq Thierry | Arxiv | String matching is the problem of finding all the occurrences of a pattern in a text. We propose improved versions of the fast family of string matching algorithms based on hashing \(q\)-grams. The improvement consists of considering the minimal value of \(q\) such that each \(q\)-gram of the pattern has a unique hash value. The new algorithms are faster than the algorithms of the HASH family for short patterns on large alphabets. |
|||||
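The stated improvement, choosing the smallest q such that every q-gram of the pattern is distinct and then filtering alignments with a q-gram check before verification, can be sketched simply. The code below compares exact q-grams instead of q-gram hash values, so it is a simplification of the HASH-family algorithms rather than the paper's optimal-hash variants.

```python
def minimal_unique_q(pattern: str) -> int:
    """Smallest q such that all q-grams of the pattern are pairwise distinct."""
    m = len(pattern)
    for q in range(1, m + 1):
        grams = [pattern[i:i + q] for i in range(m - q + 1)]
        if len(set(grams)) == len(grams):
            return q
    return m

def qgram_search(text: str, pattern: str) -> list[int]:
    """Boyer-Moore-style scan that shifts using the q-gram at the window's right
    end. Exact q-grams stand in for q-gram hashes (illustrative simplification)."""
    m, n = len(pattern), len(text)
    q = minimal_unique_q(pattern)
    # Rightmost end position (0-based, within the pattern) of each q-gram.
    last = {pattern[i:i + q]: i + q - 1 for i in range(m - q + 1)}
    hits, pos = [], 0
    while pos + m <= n:
        gram = text[pos + m - q: pos + m]
        if gram in last:
            if text[pos:pos + m] == pattern:
                hits.append(pos)
            pos += (m - 1 - last[gram]) if last[gram] < m - 1 else 1
        else:
            pos += m - q + 1            # the q-gram never occurs in the pattern
    return hits

print(minimal_unique_q("abracadabra"))                          # 5
print(qgram_search("abracadabra abracadabra", "abracadabra"))   # [0, 12]
```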
2023 | Sliding Block Hashing (slick) -- Basic Algorithmic Ideas | Lehmann Hans-peter, Sanders Peter, Walzer Stefan | Arxiv | We present {\bf Sli}ding Blo{\bf ck} Hashing (Slick), a simple hash table data structure that combines high performance with very good space efficiency. This preliminary report outlines avenues for analysis and implementation that we intend to pursue. |
|||||
2023 | Elastichash Semantic Image Similarity Search By Deep Hashing With Elasticsearch | Korfhage Nikolaus, Mühling Markus, Freisleben Bernd | The | We present ElasticHash, a novel approach for high-quality, efficient, and large-scale semantic image similarity search. It is based on a deep hashing model to learn hash codes for fine-grained image similarity search in natural images and a two-stage method for efficiently searching binary hash codes using Elasticsearch (ES). In the first stage, a coarse search based on short hash codes is performed using multi-index hashing and ES terms lookup of neighboring hash codes. In the second stage, the list of results is re-ranked by computing the Hamming distance on long hash codes. We evaluate the retrieval performance of ElasticHash for more than 120,000 query images on about 6.9 million database images of the OpenImages data set. The results show that our approach achieves high-quality retrieval results and low search latencies. |
|||||
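The two-stage scheme, coarse filtering on short codes followed by Hamming re-ranking on long codes, can be sketched independently of Elasticsearch. The numpy code below is a generic illustration: the short code is taken to be a prefix of the long code, candidates are all items whose short code lies within a small Hamming radius of the query's, and survivors are re-ranked by full-length Hamming distance. The actual multi-index hashing and ES terms-lookup machinery is not reproduced.

```python
import numpy as np

def hamming(codes: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Row-wise Hamming distance between a matrix of codes and one query code."""
    return np.count_nonzero(codes != query, axis=1)

def two_stage_search(db_long, query_long, short_bits=16, radius=3, topk=5):
    """Stage 1: keep items whose first `short_bits` bits differ from the query's
    in at most `radius` positions. Stage 2: re-rank survivors by full Hamming."""
    cand = np.where(hamming(db_long[:, :short_bits], query_long[:short_bits]) <= radius)[0]
    if cand.size == 0:                       # fallback: scan everything
        cand = np.arange(len(db_long))
    dists = hamming(db_long[cand], query_long)
    order = np.argsort(dists)[:topk]
    return cand[order], dists[order]

rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(100_000, 64), dtype=np.uint8)     # 64-bit codes
query = db[12345] ^ (rng.random(64) < 0.05)                     # noisy copy of item 12345
ids, dists = two_stage_search(db, query.astype(np.uint8))
print(ids, dists)
```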
2023 | A Relaxation Method For Binary Optimizations On Constrained Stiefel Manifold | Xiao Lianghai, Qian Yitian, Pan Shaohua | Arxiv | This paper focuses on a class of binary orthogonal optimization problems frequently arising in semantic hashing. Since this class of problems may have an empty feasible set, it is not always well-defined. We introduce an equivalent model involving a restricted Stiefel manifold and a matrix box set, and then investigate its penalty problems induced by the \(\ell_1\)-distance from the box set and its Moreau envelope. The two penalty problems are always well-defined. Moreover, they serve as the global exact penalties provided that the original feasible set is non-empty. Notably, the penalty problem induced by the Moreau envelope is a smooth optimization over an embedded submanifold with a favorable structure. We develop a retraction-based line-search Riemannian gradient method to address the penalty problem. Finally, the proposed method is applied to supervised and unsupervised hashing tasks and is compared with several popular methods on the MNIST and CIFAR-10 datasets. The numerical comparisons reveal that our algorithm is significantly superior to other solvers in terms of feasibility violation, and it is comparable or even superior to others in terms of evaluation metrics related to the Hamming distance. |
|||||
2023 | Avscan2vec Feature Learning On Antivirus Scan Data For Production-scale Malware Corpora | Joyce Robert J., Patel Tirth, Nicholas Charles, Raff Edward | Arxiv | When investigating a malicious file, searching for related files is a common task that malware analysts must perform. Given that production malware corpora may contain over a billion files and consume petabytes of storage, many feature extraction and similarity search approaches are computationally infeasible. Our work explores the potential of antivirus (AV) scan data as a scalable source of features for malware. This is possible because AV scan reports are widely available through services such as VirusTotal and are ~100x smaller than the average malware sample. An AV scan report is rich in information and can indicate a malicious file’s family, behavior, target operating system, and many other characteristics. We introduce AVScan2Vec, a language model trained to comprehend the semantics of AV scan data. AVScan2Vec ingests AV scan data for a malicious file and outputs a meaningful vector representation. AVScan2Vec vectors are ~3 to 85x smaller than popular alternatives in use today, enabling faster vector comparisons and lower memory usage. By incorporating Dynamic Continuous Indexing, we show that nearest-neighbor queries on AVScan2Vec vectors can scale to even the largest malware production datasets. We also demonstrate that AVScan2Vec vectors are superior to other leading malware feature vector representations across nearly all classification, clustering, and nearest-neighbor lookup algorithms that we evaluated. |
|||||
2023 | Mem-rec Memory Efficient Recommendation System Using Alternative Representation | Jha Gopi Krishna, Thomas Anthony, Jain Nilesh, Gobriel Sameh, Rosing Tajana, Iyer Ravi | Arxiv | Deep learning-based recommendation systems (e.g., DLRMs) are widely used AI models to provide high-quality personalized recommendations. Training data used for modern recommendation systems commonly includes categorical features taking on tens-of-millions of possible distinct values. These categorical tokens are typically assigned learned vector representations, that are stored in large embedding tables, on the order of 100s of GB. Storing and accessing these tables represent a substantial burden in commercial deployments. Our work proposes MEM-REC, a novel alternative representation approach for embedding tables. MEM-REC leverages bloom filters and hashing methods to encode categorical features using two cache-friendly embedding tables. The first table (token embedding) contains raw embeddings (i.e. learned vector representation), and the second table (weight embedding), which is much smaller, contains weights to scale these raw embeddings to provide better discriminative capability to each data point. We provide a detailed architecture, design and analysis of MEM-REC addressing trade-offs in accuracy and computation requirements, in comparison with state-of-the-art techniques. We show that MEM-REC can not only maintain the recommendation quality and significantly reduce the memory footprint for commercial scale recommendation models but can also improve the embedding latency. In particular, based on our results, MEM-REC compresses the MLPerf CriteoTB benchmark DLRM model size by 2900x and performs up to 3.4x faster embeddings while achieving the same AUC as that of the full uncompressed model. |
|||||
2023 | Unsupervised Multi-criteria Adversarial Detection In Deep Image Retrieval | Xiao Yanru, Wang Cong, Gao Xing | Arxiv | The vulnerability in the algorithm supply chain of deep learning has imposed new challenges to image retrieval systems in the downstream. Among a variety of techniques, deep hashing is gaining popularity. As it inherits the algorithmic backend from deep learning, a handful of attacks are recently proposed to disrupt normal image retrieval. Unfortunately, the defense strategies in softmax classification are not readily available to be applied in the image retrieval domain. In this paper, we propose an efficient and unsupervised scheme to identify unique adversarial behaviors in the hamming space. In particular, we design three criteria from the perspectives of hamming distance, quantization loss and denoising to defend against both untargeted and targeted attacks, which collectively limit the adversarial space. The extensive experiments on four datasets demonstrate 2-23% improvements of detection rates with minimum computational overhead for real-time image queries. |
|||||
2023 | Worst-case Performance Of Popular Approximate Nearest Neighbor Search Implementations Guarantees And Limitations | Indyk Piotr, Xu Haike | Arxiv | Graph-based approaches to nearest neighbor search are popular and powerful tools for handling large datasets in practice, but they have limited theoretical guarantees. We study the worst-case performance of recent graph-based approximate nearest neighbor search algorithms, such as HNSW, NSG and DiskANN. For DiskANN, we show that its “slow preprocessing” version provably supports approximate nearest neighbor search query with constant approximation ratio and poly-logarithmic query time, on data sets with bounded “intrinsic” dimension. For the other data structure variants studied, including DiskANN with “fast preprocessing”, HNSW and NSG, we present a family of instances on which the empirical query time required to achieve a “reasonable” accuracy is linear in instance size. For example, for DiskANN, we show that the query procedure can take at least \(0.1 n\) steps on instances of size \(n\) before it encounters any of the \(5\) nearest neighbors of the query. |
|||||
2023 | Lightweight-yet-efficient Revitalizing Ball-tree For Point-to-hyperplane Nearest Neighbor Search | Huang Qiang, Tung Anthony K. H. | Arxiv | Finding the nearest neighbor to a hyperplane (or Point-to-Hyperplane Nearest Neighbor Search, simply P2HNNS) is a new and challenging problem with applications in many research domains. While existing state-of-the-art hashing schemes (e.g., NH and FH) are able to achieve sublinear time complexity without the assumption of the data being in a unit hypersphere, they require an asymmetric transformation, which increases the data dimension from \(d\) to \(Ω(d^2)\). This leads to considerable overhead for indexing and incurs significant distortion errors. In this paper, we investigate a tree-based approach for solving P2HNNS using the classical Ball-Tree index. Compared to hashing-based methods, tree-based methods usually require roughly linear costs for construction, and they provide different kinds of approximations with excellent flexibility. A simple branch-and-bound algorithm with a novel lower bound is first developed on Ball-Tree for performing P2HNNS. Then, a new tree structure named BC-Tree, which maintains the Ball and Cone structures in the leaf nodes of Ball-Tree, is described together with two effective strategies, i.e., point-level pruning and collaborative inner product computing. BC-Tree inherits both the low construction cost and lightweight property of Ball-Tree while providing a similar or more efficient search. Experimental results over 16 real-world data sets show that Ball-Tree and BC-Tree are around 1.1\(\sim\)10\(\times\) faster than NH and FH, and they can reduce the index size and indexing time by about 1\(\sim\)3 orders of magnitude on average. The code is available at \url{https://github.com/HuangQiang/BC-Tree}. |
|||||
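The query solved here is worth stating concretely: given a hyperplane \(w^\top x + b = 0\), find the data point minimizing \(|w^\top x + b| / \|w\|\). A minimal brute-force baseline (the operation that Ball-Tree and BC-Tree accelerate) might look like the following sketch; data and parameters are placeholders:

```python
import numpy as np

def p2h_nn_bruteforce(points: np.ndarray, w: np.ndarray, b: float):
    """Exact point-to-hyperplane NN: the point minimizing |w.x + b| / ||w||."""
    dists = np.abs(points @ w + b) / np.linalg.norm(w)
    i = int(np.argmin(dists))
    return i, float(dists[i])

rng = np.random.default_rng(0)
pts = rng.standard_normal((10000, 32))        # placeholder dataset
idx, dist = p2h_nn_bruteforce(pts, w=rng.standard_normal(32), b=0.3)
print(idx, dist)
```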
2023 | Semstamp A Semantic Watermark With Paraphrastic Robustness For Text Generation | Hou Abe Bohan, Zhang Jingyu, He Tianxing, Wang Yichen, Chuang Yung-sung, Wang Hongwei, Shen Lingfeng, Van Durme Benjamin, Khashabi Daniel, Tsvetkov Yulia | Arxiv | Existing watermarking algorithms are vulnerable to paraphrase attacks because of their token-level design. To address this issue, we propose SemStamp, a robust sentence-level semantic watermarking algorithm based on locality-sensitive hashing (LSH), which partitions the semantic space of sentences. The algorithm encodes and LSH-hashes a candidate sentence generated by an LLM, and conducts sentence-level rejection sampling until the sampled sentence falls in watermarked partitions in the semantic embedding space. A margin-based constraint is used to enhance its robustness. To show the advantages of our algorithm, we propose a “bigram” paraphrase attack using the paraphrase that has the fewest bigram overlaps with the original sentence. This attack is shown to be effective against the existing token-level watermarking method. Experimental results show that our novel semantic watermark algorithm is not only more robust than the previous state-of-the-art method on both common and bigram paraphrase attacks, but also is better at preserving the quality of generation. |
|||||
2023 | A Sparse Johnson-lindenstrauss Transform Using Fast Hashing | Houen Jakob Bæk Tejs, Thorup Mikkel | Arxiv | The Sparse Johnson-Lindenstrauss Transform of Kane and Nelson (SODA 2012) provides a linear dimensionality-reducing map \(A \in \mathbb{R}^{m \times u}\) in \(ℓ₂\) that preserves distances up to distortion of \(1 + \epsilon\) with probability \(1 - \delta\), where \(m = O(\epsilon^{-2} log 1/\delta)\) and each column of \(A\) has \(O(\epsilon m)\) non-zero entries. The previous analyses of the Sparse Johnson-Lindenstrauss Transform all assumed access to a \(Ω(log 1/\delta)\)-wise independent hash function. The main contribution of this paper is a more general analysis of the Sparse Johnson-Lindenstrauss Transform with less assumptions on the hash function. We also show that the Mixed Tabulation hash function of Dahlgaard, Knudsen, Rotenberg, and Thorup (FOCS 2015) satisfies the conditions of our analysis, thus giving us the first analysis of a Sparse Johnson-Lindenstrauss Transform that works with a practical hash function. |
|||||
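A minimal sketch of a sparse JL construction of this flavour, with illustrative constants (the paper's contribution is the analysis under weaker hash-function assumptions, not this construction itself):

```python
import numpy as np

def sparse_jl_matrix(u, eps, delta, rng=None):
    """Sparse JL sketch: m = O(eps^-2 log 1/delta) rows, each of the u columns has
    s = O(eps * m) non-zeros of value +-1/sqrt(s) placed in random rows.
    The leading constants below are illustrative, not the paper's."""
    rng = rng or np.random.default_rng(0)
    m = int(np.ceil(4 * np.log(1 / delta) / eps**2))
    s = max(1, int(np.ceil(eps * m)))
    A = np.zeros((m, u))
    for col in range(u):
        rows = rng.choice(m, size=s, replace=False)
        A[rows, col] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return A

A = sparse_jl_matrix(u=1000, eps=0.2, delta=0.01)
x = np.random.default_rng(1).standard_normal(1000)
print(np.linalg.norm(A @ x) / np.linalg.norm(x))   # close to 1 with high probability
```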
2023 | A Study On The Use Of Perceptual Hashing To Detect Manipulation Of Embedded Messages In Images | Wöhnert Sven-jannik, Wöhnert Kai Hendrik, Almamedov Eldar, Frank Carsten, Skwarek Volker | Arxiv | Typically, metadata of images are stored in a specific data segment of the image file. However, to securely detect changes, data can also be embedded within images. This follows the goal to invisibly and robustly embed as much information as possible to, ideally, even survive compression. This work searches for embedding principles which allow to distinguish between unintended changes by lossy image compression and malicious manipulation of the embedded message based on the change of its perceptual or robust hash. Different embedding and compression algorithms are compared. The study shows that embedding a message via integer wavelet transform and compression with Karhunen-Loeve-transform yields the best results. However, it was not possible to distinguish between manipulation and compression in all cases. |
|||||
2023 | Cascading Hierarchical Networks With Multi-task Balanced Loss For Fine-grained Hashing | Zeng Xianxian, Zheng Yanjun | Arxiv | With the explosive growth in the number of fine-grained images in the Internet era, it has become a challenging problem to perform fast and efficient retrieval from large-scale fine-grained images. Among the many retrieval methods, hashing methods are widely used due to their high efficiency and small storage space occupation. Fine-grained hashing is more challenging than traditional hashing problems due to difficulties such as low inter-class variances and high intra-class variances caused by the characteristics of fine-grained images. To improve the retrieval accuracy of fine-grained hashing, we propose a cascaded network to learn compact and highly semantic hash codes, and introduce an attention-guided data augmentation method. We refer to this network as a cascaded hierarchical data augmentation network. We also propose a novel approach to coordinately balance the loss of multi-task learning. We conduct extensive experiments on common fine-grained visual classification datasets. The experimental results demonstrate that our proposed method outperforms several state-of-the-art hashing methods and can effectively improve the accuracy of fine-grained retrieval. The source code is publicly available: https://github.com/kaiba007/FG-CNET. |
|||||
2023 | Deep Lifelong Cross-modal Hashing | Xu Liming, Li Hanqi, Zheng Bochuan, Li Weisheng, Lv Jiancheng | Arxiv | Hashing methods have made significant progress in cross-modal retrieval tasks with fast query speed and low storage cost. Among them, deep learning-based hashing achieves better performance on large-scale data due to its excellent extraction and representation ability for nonlinear heterogeneous features. However, two main challenges remain: catastrophic forgetting when data with new categories arrive continuously, and the time-consuming retraining that non-continual hashing retrieval requires to stay up to date. To this end, we propose in this paper a novel deep lifelong cross-modal hashing method that achieves lifelong hashing retrieval instead of repeatedly re-training the hash functions when new data arrive. Specifically, we design a lifelong learning strategy that updates hash functions by directly training on the incremental data instead of retraining new hash functions on all the accumulated data, which significantly reduces training time. Then, we propose a lifelong hashing loss that lets the original hash codes participate in lifelong learning while remaining invariant, and further preserves the similarity and dissimilarity among original and incremental hash codes to maintain performance. Additionally, considering distribution heterogeneity when new data arrive continuously, we introduce multi-label semantic similarity to supervise hash learning, and we show with detailed analysis that this similarity improves performance. Experimental results on benchmark datasets show that the proposed method achieves performance comparable to recent state-of-the-art cross-modal hashing methods, while yielding average gains of over 20\% in retrieval accuracy and reducing training time by over 80\% when new data arrive continuously. |
|||||
2023 | Identifying Reducible K-tuples Of Vectors With Subspace-proximity Sensitive Hashing/filtering | Holden Gabriella, Shiu Daniel, Strutt Lauren | Arxiv | We introduce and analyse a family of hash and predicate functions that are more likely to produce collisions for small reducible configurations of vectors. These may offer practical improvements to lattice sieving for short vectors. In particular, in one asymptotic regime the family exhibits significantly different convergent behaviour than existing hash functions and predicates. |
|||||
2023 | Rediscovering Hashed Random Projections For Efficient Quantization Of Contextualized Sentence Embeddings | Hamster Ulf A., Lee Ji-ung, Geyken Alexander, Gurevych Iryna | Arxiv | Training and inference on edge devices often requires an efficient setup due to computational limitations. While pre-computing data representations and caching them on a server can mitigate extensive edge device computation, this leads to two challenges: first, the amount of storage required on the server, which scales linearly with the number of instances; and second, the bandwidth required to send extensively large amounts of data to an edge device. To reduce the memory footprint of pre-computed data representations, we propose a simple, yet effective approach that uses randomly initialized hyperplane projections. To further reduce their size by up to 98.96%, we quantize the resulting floating-point representations into binary vectors. Despite the greatly reduced size, we show that the embeddings remain effective for training models across various English and German sentence classification tasks, retaining 94%–99% of the performance of their floating-point counterparts. |
|||||
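A minimal sketch of the recipe described above, randomly initialized hyperplane projections followed by binary quantization and bit packing; dimensions and data below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 768, 1024                              # e.g. sentence-embedding dim -> 1024-bit codes
hyperplanes = rng.standard_normal((dim, n_bits))     # randomly initialized, never trained

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Project onto random hyperplanes, keep only the signs, pack into bytes."""
    bits = (embeddings @ hyperplanes) > 0
    return np.packbits(bits, axis=-1)                # n_bits/8 bytes per embedding

emb = rng.standard_normal((5, dim)).astype(np.float32)   # placeholder embeddings
codes = binarize(emb)
print(codes.shape, codes.dtype)   # (5, 128) uint8 -- about 24x smaller than float32*768
```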
2023 | Efficient High-resolution Template Matching With Vector Quantized Nearest Neighbour Fields | Gupta Ankit, Sintorn Ida-maria | Arxiv | Template matching is a fundamental problem in computer vision with applications in fields including object detection, image registration, and object tracking. Current methods rely on nearest-neighbour (NN) matching, where the query feature space is converted to NN space by representing each query pixel with its NN in the template. NN-based methods have been shown to perform better in occlusions, appearance changes, and non-rigid transformations; however, they scale poorly with high-resolution data and high feature dimensions. We present an NN-based method which efficiently reduces the NN computations and introduces filtering in the NN fields (NNFs). A vector quantization step is introduced before the NN calculation to represent the template with \(k\) features, and the filter response over the NNFs is used to compare the template and query distributions over the features. We show that state-of-the-art performance is achieved in low-resolution data, and our method outperforms previous methods at higher resolution. |
|||||
2023 | SHACIRA Scalable Hash-grid Compression For Implicit Neural Representations | Girish Sharath, Shrivastava Abhinav, Gupta Kamal | Arxiv | Implicit Neural Representations (INR) or neural fields have emerged as a popular framework to encode multimedia signals such as images and radiance fields while retaining high-quality. Recently, learnable feature grids proposed by Instant-NGP have allowed significant speed-up in the training as well as the sampling of INRs by replacing a large neural network with a multi-resolution look-up table of feature vectors and a much smaller neural network. However, these feature grids come at the expense of large memory consumption which can be a bottleneck for storage and streaming applications. In this work, we propose SHACIRA, a simple yet effective task-agnostic framework for compressing such feature grids with no additional post-hoc pruning/quantization stages. We reparameterize feature grids with quantized latent weights and apply entropy regularization in the latent space to achieve high levels of compression across various domains. Quantitative and qualitative results on diverse datasets consisting of images, videos, and radiance fields, show that our approach outperforms existing INR approaches without the need for any large datasets or domain-specific heuristics. Our project page is available at http://shacira.github.io . |
|||||
2023 | Geometric Covering Using Random Fields | Goncalves Felipe, Keren Daniel, Shahar Amit, Yehuda Gal | Arxiv | A set of vectors \(S \subseteq \mathbb{R}^d\) is \((k_1,\epsilon)\)-clusterable if there are \(k_1\) balls of radius \(\epsilon\) that cover \(S\). A set of vectors \(S \subseteq \mathbb{R}^d\) is \((k_2,\delta)\)-far from being clusterable if there are at least \(k_2\) vectors in \(S\), with all pairwise distances at least \(\delta\). We propose a probabilistic algorithm to distinguish between these two cases. Our algorithm reaches a decision by only looking at the extreme values of a scalar valued hash function, defined by a random field, on \(S\); hence, it is especially suitable in distributed and online settings. An important feature of our method is that the algorithm is oblivious to the number of vectors: in the online setting, for example, the algorithm stores only a constant number of scalars, which is independent of the stream length. We introduce random field hash functions, which are a key ingredient in our paradigm. Random field hash functions generalize locality-sensitive hashing (LSH). In addition to the LSH requirement that |
|||||
2023 | Unified Functional Hashing In Automatic Machine Learning | Gillard Ryan, Jonany Stephen, Miao Yingjie, Munn Michael, De Souza Connal, Dungay Jonathan, Liang Chen, So David R., Le Quoc V., Real Esteban | Arxiv | The field of Automatic Machine Learning (AutoML) has recently attained impressive results, including the discovery of state-of-the-art machine learning solutions, such as neural image classifiers. This is often done by applying an evolutionary search method, which samples multiple candidate solutions from a large space and evaluates the quality of each candidate through a long training process. As a result, the search tends to be slow. In this paper, we show that large efficiency gains can be obtained by employing a fast unified functional hash, especially through the functional equivalence caching technique, which we also present. The central idea is to detect by hashing when the search method produces equivalent candidates, which occurs very frequently, and this way avoid their costly re-evaluation. Our hash is “functional” in that it identifies equivalent candidates even if they were represented or coded differently, and it is “unified” in that the same algorithm can hash arbitrary representations; e.g. compute graphs, imperative code, or lambda functions. As evidence, we show dramatic improvements on multiple AutoML domains, including neural architecture search and algorithm discovery. Finally, we consider the effect of hash collisions, evaluation noise, and search distribution through empirical analysis. Altogether, we hope this paper may serve as a guide to hashing techniques in AutoML. |
|||||
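A hedged sketch of the functional-equivalence-caching idea: hash a candidate by its outputs on a fixed probe set, so differently-coded but equivalent candidates collide and their expensive evaluation can be reused. The probe inputs and candidates below are hypothetical:

```python
import hashlib

def functional_hash(candidate, probe_inputs):
    """Hash a candidate by its behaviour: run it on fixed probe inputs, hash the outputs."""
    outputs = repr([candidate(x) for x in probe_inputs]).encode()
    return hashlib.sha256(outputs).hexdigest()

cache = {}
probes = [0.0, 0.5, 1.0, -2.0, 3.25]
for name, f in [("a", lambda x: (x + 1) ** 2), ("b", lambda x: x * x + 2 * x + 1)]:
    key = functional_hash(f, probes)
    if key in cache:
        print(name, "is functionally equivalent to", cache[key], "- skip re-evaluation")
    else:
        cache[key] = name
```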
2023 | High-dimensional Approximate Nearest Neighbor Search With Reliable And Efficient Distance Comparison Operations | Gao Jianyang, Long Cheng | Arxiv | Approximate K nearest neighbor (AKNN) search is a fundamental and challenging problem. We observe that in high-dimensional space, the time consumption of nearly all AKNN algorithms is dominated by that of the distance comparison operations (DCOs). For each operation, it scans full dimensions of an object and thus, runs in linear time wrt the dimensionality. To speed it up, we propose a randomized algorithm named ADSampling which runs in logarithmic time wrt to the dimensionality for the majority of DCOs and succeeds with high probability. In addition, based on ADSampling we develop one general and two algorithm-specific techniques as plugins to enhance existing AKNN algorithms. Both theoretical and empirical studies confirm that: (1) our techniques introduce nearly no accuracy loss and (2) they consistently improve the efficiency. |
|||||
2023 | Count-min Sketch With Variable Number Of Hash Functions An Experimental Study | Fusy Éric, Kucherov Gregory | Arxiv | Conservative Count-Min, an improved version of Count-Min sketch [Cormode, Muthukrishnan 2005], is an online-maintained hashing-based data structure summarizing element frequency information without storing elements themselves. Although several works attempted to analyze the error that can be made by Count-Min, the behavior of this data structure remains poorly understood. In [Fusy, Kucherov 2022], we demonstrated that under the uniform distribution of input elements, the error of conservative Count-Min follows two distinct regimes depending on its load factor. In this work, we provide a series of experimental results providing new insights into the behavior of conservative Count-Min. Our contributions can be seen as twofold. On one hand, we provide a detailed experimental analysis of the behavior of Count-Min sketch in different regimes and under several representative probability distributions of input elements. On the other hand, we demonstrate improvements that can be made by assigning a variable number of hash functions to different elements. This includes, in particular, reduced space of the data structure while still supporting a small error. |
|||||
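For reference, the conservative update rule studied here can be sketched as follows (with a fixed number of hash functions per element; the paper's variable-hash-count variant is not reproduced):

```python
import random

class ConservativeCountMin:
    """Count-Min sketch with the conservative update rule (illustrative sketch;
    Python's built-in hash is used for simplicity)."""
    def __init__(self, width, depth, seed=0):
        self.width, self.depth = width, depth
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        return [hash((item, s)) % self.width for s in self.salts]

    def update(self, item, delta=1):
        cols = self._cols(item)
        est = min(self.rows[r][c] for r, c in enumerate(cols))
        # conservative update: only raise counters that sit at the current minimum
        for r, c in enumerate(cols):
            self.rows[r][c] = max(self.rows[r][c], est + delta)

    def query(self, item):
        return min(self.rows[r][c] for r, c in enumerate(self._cols(item)))

cms = ConservativeCountMin(width=2048, depth=4)
for w in ["a", "b", "a", "c", "a"]:
    cms.update(w)
print(cms.query("a"))   # never underestimates; typically exactly 3 here
```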
2023 | Bounds For C-ideal Hashing | Frei Fabian, Wehner David | Arxiv | In this paper, we analyze hashing from a worst-case perspective. To this end, we study a new property of hash families that is strongly related to d-perfect hashing, namely c-ideality. On the one hand, this notion generalizes the definition of perfect hashing, which has been studied extensively; on the other hand, it provides a direct link to the notion of c-approximativity. We focus on the usually neglected case where the average load \(\alpha\) is at least 1 and prove upper and lower parametrized bounds on the minimal size of c-ideal hash families. As an aside, we show how c-ideality helps to analyze the advice complexity of hashing. The concept of advice, introduced a decade ago, lets us measure the information content of an online problem. We prove hashing’s advice complexity to be linear in the hash table size. |
|||||
2023 | Learned Monotone Minimal Perfect Hashing | Ferragina Paolo, Lehmann Hans-peter, Sanders Peter, Vinciguerra Giorgio | Arxiv | A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of keys is a function that maps each key in S to its rank. On keys not in S, the function returns an arbitrary value. Applications range from databases, search engines, data encryption, to pattern-matching algorithms. In this paper, we describe LeMonHash, a new technique for constructing MMPHFs for integers. The core idea of LeMonHash is surprisingly simple and effective: we learn a monotone mapping from keys to their rank via an error-bounded piecewise linear model (the PGM-index), and then we solve the collisions that might arise among keys mapping to the same rank estimate by associating small integers with them in a retrieval data structure (BuRR). On synthetic random datasets, LeMonHash needs 34% less space than the next larger competitor, while achieving about 16 times faster queries. On real-world datasets, the space usage is very close to or much better than the best competitors, while achieving up to 19 times faster queries than the next larger competitor. As far as the construction of LeMonHash is concerned, we get an improvement by a factor of up to 2, compared to the competitor with the next best space usage. We also investigate the case of keys being variable-length strings, introducing the so-called LeMonHash-VL: it needs space within 13% of the best competitors while achieving up to 3 times faster queries than the next larger competitor. |
|||||
2023 | Invertible Bloom Lookup Tables With Less Memory And Randomness | Fleischhacker Nils, Larsen Kasper Green, Obremski Maciej, Simkin Mark | Arxiv | In this work we study Invertible Bloom Lookup Tables (IBLTs) with small failure probabilities. IBLTs are highly versatile data structures that have found applications in set reconciliation protocols, error-correcting codes, and even the design of advanced cryptographic primitives. For storing \(n\) elements and ensuring correctness with probability at least \(1 - \delta\), existing IBLT constructions require \(Ω(n(\frac{log(1/\delta)}{log(n)}+1))\) space and they crucially rely on fully random hash functions. We present new constructions of IBLTs that are simultaneously more space efficient and require less randomness. For storing \(n\) elements with a failure probability of at most \(\delta\), our data structure only requires \(\mathcal{O}(n + log(1/\delta)loglog(1/\delta))\) space and \(\mathcal{O}(log(log(n)/\delta))\)-wise independent hash functions. As a key technical ingredient we show that hashing \(n\) keys with any \(k\)-wise independent hash function \(h:U \to [Cn]\) for some sufficiently large constant \(C\) guarantees with probability \(1 - 2^{-Ω(k)}\) that at least \(n/2\) keys will have a unique hash value. Proving this is highly non-trivial as \(k\) approaches \(n\). We believe that the techniques used to prove this statement may be of independent interest. |
|||||
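A minimal, illustrative IBLT over integer keys; a production IBLT additionally stores value sums and a per-cell key checksum to verify that a cell really holds a single key before peeling:

```python
import random

class IBLT:
    """Minimal Invertible Bloom Lookup Table sketch (integer keys, no checksums)."""
    def __init__(self, num_cells, num_hashes=3, seed=0):
        self.m, self.k, self.seed = num_cells, num_hashes, seed
        self.count = [0] * num_cells
        self.key_sum = [0] * num_cells

    def _cells(self, key):
        rng = random.Random(hash((key, self.seed)))
        return rng.sample(range(self.m), self.k)     # k distinct cells per key

    def insert(self, key):
        for c in self._cells(key):
            self.count[c] += 1
            self.key_sum[c] ^= key

    def delete(self, key):
        for c in self._cells(key):
            self.count[c] -= 1
            self.key_sum[c] ^= key

    def list_entries(self):
        """Peel cells holding a single key; succeeds w.h.p. when num_cells is large enough."""
        inserted, deleted = [], []
        changed = True
        while changed:
            changed = False
            for c in range(self.m):
                if self.count[c] == 1:
                    key = self.key_sum[c]
                    inserted.append(key)
                    self.delete(key)
                    changed = True
                elif self.count[c] == -1:
                    key = self.key_sum[c]
                    deleted.append(key)
                    self.insert(key)
                    changed = True
        return inserted, deleted

t = IBLT(num_cells=40)
for k in [11, 42, 99, 7]:
    t.insert(k)
t.delete(42)
print(t.list_entries())   # typically ([11, 99, 7] in some order, [])
```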
2023 | Towards Efficient Deep Hashing Retrieval Condensing Your Data Via Feature-embedding Matching | Feng Tao, Zhang Jie, Wang Peizheng, Wang Zhijie | Arxiv | The expenses involved in training state-of-the-art deep hashing retrieval models have witnessed an increase due to the adoption of more sophisticated models and large-scale datasets. Dataset Distillation (DD), or Dataset Condensation (DC), focuses on generating a smaller synthetic dataset that retains the original information. Nevertheless, existing DD methods face challenges in maintaining a trade-off between accuracy and efficiency. Moreover, state-of-the-art dataset distillation methods cannot be extended to all deep hashing retrieval methods. In this paper, we propose an efficient condensation framework that addresses these limitations by matching the feature embeddings between the synthetic set and the real set. Furthermore, we enhance the diversity of features by incorporating the strategies of early-stage augmented models and multi-formation. Extensive experiments provide compelling evidence of the remarkable superiority of our approach, both in terms of performance and efficiency, compared to state-of-the-art baseline methods. |
|||||
2023 | Binary Embedding-based Retrieval At Tencent | Gan Yukang, Ge Yixiao, Zhou Chang, Su Shupeng, Xu Zhouchuan, Xu Xuyuan, Hui Quanchao, Chen Xiang, Wang Yexin, Shan Ying | Arxiv | Large-scale embedding-based retrieval (EBR) is the cornerstone of search-related industrial applications. Given a user query, the system of EBR aims to identify relevant information from a large corpus of documents that may be tens or hundreds of billions in size. The storage and computation turn out to be expensive and inefficient with massive documents and high concurrent queries, making it difficult to further scale up. To tackle the challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension. Specifically, we compress the full-precision query and document embeddings, formulated as float vectors in general, into a composition of multiple binary vectors using a lightweight transformation model with residual multilayer perception (MLP) blocks. We can therefore tailor the number of bits for different applications to trade off accuracy loss and cost savings. Importantly, we enable task-agnostic efficient training of the binarization model using a new embedding-to-embedding strategy. We also exploit the compatible training of binary embeddings so that the BEBR engine can support indexing among multiple embedding versions within a unified system. To further realize efficient search, we propose Symmetric Distance Calculation (SDC) to achieve lower response time than Hamming codes. We successfully employed the introduced BEBR to Tencent products, including Sogou, Tencent Video, QQ World, etc. The binarization algorithm can be seamlessly generalized to various tasks with multiple modalities. Extensive experiments on offline benchmarks and online A/B tests demonstrate the efficiency and effectiveness of our method, significantly saving 30%~50% index costs with almost no loss of accuracy at the system level. |
|||||
2023 | A Comprehensive Survey On Vector Database Storage And Retrieval Technique Challenge | Han Yikun, Liu Chunjiang, Wang Pengfei | Arxiv | A vector database is used to store high-dimensional data that cannot be characterized by a traditional DBMS. Although there are not many articles describing existing or introducing new vector database architectures, the approximate nearest neighbor search problem behind vector databases has been studied for a long time, and a considerable number of related algorithmic articles can be found in the literature. This article attempts to comprehensively review relevant algorithms to provide a general understanding of this booming research area. Our framework categorises these studies by their approach to solving the ANNS problem: hash-based, tree-based, graph-based, and quantization-based. We then present an overview of existing challenges for vector databases. Lastly, we sketch how vector databases can be combined with large language models and provide new possibilities. |
|||||
2023 | Review Of Extreme Multilabel Classification | Dasgupta Arpan, Katyan Siddhant, Das Shrutimoy, Kumar Pawan | Arxiv | Extreme multilabel classification, or XML, is an active area of interest in machine learning. Compared to traditional multilabel classification, the number of labels here is extremely large, hence the name extreme multilabel classification. Classical one-versus-all classification won't scale in this case due to the large number of labels, and the same is true for other classifiers. Embedding of labels as well as features into a smaller label space is an essential first step. Moreover, other issues include the existence of head and tail labels, where tail labels are labels that appear in a relatively small number of the given samples. The existence of tail labels creates issues during embedding. This area has invited the application of a wide range of approaches, ranging from bit compression motivated by compressed sensing, tree-based embeddings, deep learning-based latent space embedding (including using attention weights), linear algebra-based embeddings such as SVD, clustering, and hashing, to name a few. The community has also come up with a useful set of metrics to correctly evaluate predictions for head and tail labels. |
|||||
2023 | Two-way Linear Probing Revisited | Dalal Ketan, Devroye Luc, Malalla Ebrahim | Arxiv | We introduce linear probing hashing schemes that construct a hash table of size \(n\), with constant load factor \(\alpha\), on which the worst-case unsuccessful search time is asymptotically almost surely \(O(log log n)\). The schemes employ two linear probe sequences to find empty cells for the keys. Matching lower bounds on the maximum cluster size produced by any algorithm that uses two linear probe sequences are obtained as well. |
|||||
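One plausible variant of two-way linear probing, sketched below: walk both probe sequences in lockstep and place the key in the first empty cell reached. The paper analyses schemes of this flavour and proves the \(O(log log n)\) maxima; the hash functions here are placeholders:

```python
def two_way_linear_probe_insert(table, key, h1, h2):
    """Insert `key` using two linear probe sequences, taking whichever reaches
    an empty cell first (ties broken in favour of the h1 sequence)."""
    n = len(table)
    a, b = h1(key) % n, h2(key) % n
    for step in range(n):
        for pos in ((a + step) % n, (b + step) % n):
            if table[pos] is None:
                table[pos] = key
                return pos
    raise RuntimeError("table full")

table = [None] * 8
for k in [3, 11, 19, 5]:
    two_way_linear_probe_insert(table, k,
                                h1=lambda x: x * 2654435761,   # placeholder hashes
                                h2=lambda x: x * 40503 + 7)
print(table)
```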
2023 | Semantic-aware Adversarial Training For Reliable Deep Hashing Retrieval | Yuan Xu, Zhang Zheng, Wang Xunguang, Wu Lin | IEEE Transactions on Information Forensics and Security vol. | Deep hashing has been intensively studied and successfully applied in large-scale image retrieval systems due to its efficiency and effectiveness. Recent studies have recognized that the existence of adversarial examples poses a security threat to deep hashing models, that is, adversarial vulnerability. Notably, it is challenging to efficiently distill reliable semantic representatives for deep hashing to guide adversarial learning, and thereby it hinders the enhancement of adversarial robustness of deep hashing-based retrieval models. Moreover, current researches on adversarial training for deep hashing are hard to be formalized into a unified minimax structure. In this paper, we explore Semantic-Aware Adversarial Training (SAAT) for improving the adversarial robustness of deep hashing models. Specifically, we conceive a discriminative mainstay features learning (DMFL) scheme to construct semantic representatives for guiding adversarial learning in deep hashing. Particularly, our DMFL with the strict theoretical guarantee is adaptively optimized in a discriminative learning manner, where both discriminative and semantic properties are jointly considered. Moreover, adversarial examples are fabricated by maximizing the Hamming distance between the hash codes of adversarial samples and mainstay features, the efficacy of which is validated in the adversarial attack trials. Further, we, for the first time, formulate the formalized adversarial training of deep hashing into a unified minimax optimization under the guidance of the generated mainstay codes. Extensive experiments on benchmark datasets show superb attack performance against the state-of-the-art algorithms, meanwhile, the proposed adversarial training can effectively eliminate adversarial perturbations for trustworthy deep hashing-based retrieval. Our code is available at https://github.com/xandery-geek/SAAT. |
|||||
2023 | Embersim A Large-scale Databank For Boosting Similarity Search In Malware Analysis | Corlatescu Dragos Georgian, Dinu Alexandru, Gaman Mihaela, Sumedrea Paul | Arxiv | In recent years there has been a shift from heuristics-based malware detection towards machine learning, which proves to be more robust in the current heavily adversarial threat landscape. While we acknowledge machine learning to be better equipped to mine for patterns in the increasingly high amounts of similar-looking files, we also note a remarkable scarcity of the data available for similarity-targeted research. Moreover, we observe that the focus in the few related works falls on quantifying similarity in malware, often overlooking the clean data. This one-sided quantification is especially dangerous in the context of detection bypass. We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER - one of the largest malware classification data sets. We enhance EMBER with similarity information as well as malware class tags, to enable further research in the similarity space. Our contribution is threefold: (1) we publish EMBERSim, an augmented version of EMBER, that includes similarity-informed tags; (2) we enrich EMBERSim with automatically determined malware class tags using the open-source tool AVClass on VirusTotal data and (3) we describe and share the implementation for our class scoring technique and leaf similarity method. |
|||||
2023 | Mementohash A Stateful Minimal Memory Best Performing Consistent Hash Algorithm | Coluzzi Massimo, Brocco Amos, Antonucci Alessandro, Leidi Tiziano | Arxiv | Consistent hashing is used in distributed systems and networking applications to spread data evenly and efficiently across a cluster of nodes. In this paper, we present MementoHash, a novel consistent hashing algorithm that eliminates known limitations of state-of-the-art algorithms while keeping optimal performance and minimal memory usage. We describe the algorithm in detail, provide a pseudo-code implementation, and formally establish its solid theoretical guarantees. To measure the efficacy of MementoHash, we compare its performance, in terms of memory usage and lookup time, to that of state-of-the-art algorithms, namely, AnchorHash, DxHash, and JumpHash. Unlike JumpHash, MementoHash can handle random failures. Moreover, MementoHash does not require fixing the overall capacity of the cluster (as AnchorHash and DxHash do), allowing it to scale indefinitely. The number of removed nodes affects the performance of all the considered algorithms. Therefore, we conduct experiments considering three different scenarios: stable (no removed nodes), one-shot removals (90% of the nodes removed at once), and incremental removals. We report experimental results that averaged a varying number of nodes from ten to one million. Results indicate that our algorithm shows optimal lookup performance and minimal memory usage in its best-case scenario. It behaves better than AnchorHash and DxHash in its average-case scenario and at least as well as those two algorithms in its worst-case scenario. However, the worst-case scenario for MementoHash occurs when more than 70% of the nodes fail, which describes a unlikely scenario. Therefore, MementoHash shows the best performance during the regular life cycle of a cluster. |
|||||
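For context, JumpHash, one of the baselines MementoHash is compared against, fits in a few lines; note that, unlike MementoHash, it only supports growing or shrinking the bucket count at the end, not arbitrary node removals:

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Lamping & Veach's JumpHash: map a 64-bit key to a bucket in [0, num_buckets)
    with minimal key movement when num_buckets changes."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (1 << 31) / ((key >> 33) + 1))
    return b

print([jump_consistent_hash(k, 10) for k in range(5)])
```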
2023 | Constant Sequence Extension For Fast Search Using Weighted Hamming Distance | Weng Zhenyu, Zhuang Huiping, Li Haizhou, Lin Zhiping | Arxiv | Representing visual data using compact binary codes is attracting increasing attention as binary codes are used as direct indices into hash table(s) for fast non-exhaustive search. Recent methods show that ranking binary codes using weighted Hamming distance (WHD) rather than Hamming distance (HD) by generating query-adaptive weights for each bit can better retrieve query-related items. However, search using WHD is slower than that using HD. One main challenge is that the complexity of extending a monotone increasing sequence using WHD to probe buckets in hash table(s) for existing methods is at least proportional to the square of the sequence length, while that using HD is proportional to the sequence length. To overcome this challenge, we propose a novel fast non-exhaustive search method using WHD. The key idea is to design a constant sequence extension algorithm to perform each sequence extension in constant computational complexity and the total complexity is proportional to the sequence length, which is justified by theoretical analysis. Experimental results show that our method is faster than other WHD-based search methods. Also, compared with the HD-based non-exhaustive search method, our method has comparable efficiency but retrieves more query-related items for the dataset of up to one billion items. |
|||||
2023 | Model-enhanced Vector Index | Zhang Hailin, Wang Yujing, Chen Qi, Chang Ruiheng, Zhang Ting, Miao Ziming, Hou Yingyan, Ding Yang, Miao Xupeng, Wang Haonan, Pang Bochen, Zhan Yuefeng, Sun Hao, Deng Weiwei, Zhang Qi, Yang Fan, Xie Xing, Yang Mao, Cui Bin | Arxiv | Embedding-based retrieval methods construct vector indices to search for document representations that are most similar to the query representations. They are widely used in document retrieval due to low latency and decent recall performance. Recent research indicates that deep retrieval solutions offer better model quality, but are hindered by unacceptable serving latency and the inability to support document updates. In this paper, we aim to enhance the vector index with end-to-end deep generative models, leveraging the differentiable advantages of deep retrieval models while maintaining desirable serving efficiency. We propose Model-enhanced Vector Index (MEVI), a differentiable model-enhanced index empowered by a twin-tower representation model. MEVI leverages a Residual Quantization (RQ) codebook to bridge the sequence-to-sequence deep retrieval and embedding-based models. To substantially reduce the inference time, instead of decoding the unique document ids in long sequential steps, we first generate some semantic virtual cluster ids of candidate documents in a small number of steps, and then leverage the well-adapted embedding vectors to further perform a fine-grained search for the relevant documents in the candidate virtual clusters. We empirically show that our model achieves better performance on the commonly used academic benchmarks MSMARCO Passage and Natural Questions, with comparable serving latency to dense retrieval solutions. |
|||||
2023 | On The Relationship Between Several Variants Of The Linear Hashing Conjecture | Westover Alek | Arxiv | In Linear Hashing (\(\mathsf{LH}\)) with \(\beta\) bins on a size \(u\) universe \(\mathcal{U}=\{0,1,\ldots, u-1\}\), items \(\{x_1,x_2,\ldots, x_n\}\subset \mathcal{U}\) are placed in bins by the hash function \(x_i\mapsto ((ax_i+b) \bmod p) \bmod \beta\) for some prime \(p\in [u,2u]\) and randomly chosen integers \(a,b \in [1,p]\). The “maxload” of \(\mathsf{LH}\) is the number of items assigned to the fullest bin. Expected maxload for a worst-case set of items is a natural measure of how well \(\mathsf{LH}\) distributes items amongst the bins. Fix \(\beta=n\). Despite \(\mathsf{LH}\)’s simplicity, bounding \(\mathsf{LH}\)’s worst-case maxload is extremely challenging. It is well-known that on random inputs \(\mathsf{LH}\) achieves maxload \(Ω\left(\frac{log n}{loglog n}\right)\); this is currently the best lower bound for \(\mathsf{LH}\)’s expected maxload. Recently Knudsen established an upper bound of \(\widetilde{O}(n^{1/3})\). The question “Is the worst-case expected maxload of \(\mathsf{LH}\) \(n^{o(1)}\)?” is one of the most basic open problems in discrete math. In this paper we propose a set of intermediate open questions to help researchers make progress on this problem. We establish the relationship between these intermediate open questions and make some partial progress on them. |
|||||
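The hash function in question is concrete enough to experiment with directly; a small empirical maxload measurement (random inputs, with an illustrative universe size and a hard-coded prime) might look like:

```python
import random
from collections import Counter

def linear_hash_maxload(items, beta, p=65537, trials=1000):
    """Empirical expected maxload of LH: x -> ((a*x + b) mod p) mod beta,
    averaged over random (a, b). p = 65537 is a prime in [2**16, 2**17],
    so the item universe is assumed to be [0, 2**16)."""
    total = 0
    for _ in range(trials):
        a, b = random.randint(1, p), random.randint(1, p)
        loads = Counter(((a * x + b) % p) % beta for x in items)
        total += max(loads.values())
    return total / trials

n = 1024   # n = beta, uniformly random items (the regime where maxload ~ log n / loglog n)
print(linear_hash_maxload(random.sample(range(2**16), n), beta=n))
```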
2023 | Supervised Auto-encoding Twin-bottleneck Hashing | Chen Yuan, Marchand-maillet Stéphane | Arxiv | Deep hashing has shown to be a complexity-efficient solution for the Approximate Nearest Neighbor search problem in high dimensional space. Many methods usually build the loss function from pairwise or triplet data points to capture the local similarity structure. Other existing methods construct the similarity graph and consider all points simultaneously. Auto-encoding Twin-bottleneck Hashing is one such method that dynamically builds the graph. Specifically, each input data is encoded into a binary code and a continuous variable, or the so-called twin bottlenecks. The similarity graph is then computed from these binary codes, which get updated consistently during the training. In this work, we generalize the original model into a supervised deep hashing network by incorporating the label information. In addition, we examine the differences of codes structure between these two networks and consider the class imbalance problem especially in multi-labeled datasets. Experiments on three datasets yield statistically significant improvement against the original model. Results are also comparable and competitive to other supervised methods. |
|||||
2023 | Bipartite Graph Convolutional Hashing For Effective And Efficient Top-n Search In Hamming Space | Chen Yankai, Fang Yixiang, Zhang Yifei, King Irwin | Arxiv | Searching on bipartite graphs is basal and versatile to many real-world Web applications, e.g., online recommendation, database retrieval, and query-document searching. Given a query node, the conventional approaches rely on the similarity matching with the vectorized node embeddings in the continuous Euclidean space. To efficiently manage intensive similarity computation, developing hashing techniques for graph structured data has recently become an emerging research direction. Despite the retrieval efficiency in Hamming space, prior work is however confronted with catastrophic performance decay. In this work, we investigate the problem of hashing with Graph Convolutional Network on bipartite graphs for effective Top-N search. We propose an end-to-end Bipartite Graph Convolutional Hashing approach, namely BGCH, which consists of three novel and effective modules: (1) adaptive graph convolutional hashing, (2) latent feature dispersion, and (3) Fourier serialized gradient estimation. Specifically, the former two modules achieve the substantial retention of the structural information against the inevitable information loss in hash encoding; the last module develops Fourier Series decomposition to the hashing function in the frequency domain mainly for more accurate gradient estimation. The extensive experiments on six real-world datasets not only show the performance superiority over the competing hashing-based counterparts, but also demonstrate the effectiveness of all proposed model components contained therein. |
|||||
2023 | Homomorphic Hashing Based On Elliptic Curve Cryptography | Chen Abel C. H. | Arxiv | To avoid exposing plaintexts in cloud environments, homomorphic hashing algorithms have been proposed that generate a hash value for each plaintext, so that cloud environments only store the hash values and compute over them for future needs. However, these homomorphic hashing algorithms may require longer hash value generation and summary times at higher security strengths. Therefore, this study proposes a homomorphic hashing scheme based on elliptic curve cryptography (ECC) that provides a homomorphic hashing function in accordance with the characteristics of ECC. Furthermore, mathematical models and practical cases are given to validate the proposed method. In experiments, the results show that the proposed method has higher efficiency at different security strengths. |
|||||
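The additive homomorphism such schemes rely on is the standard one for scalar multiplication in an elliptic-curve group of prime order \(q\) with generator \(G\); whether the paper's construction takes exactly this form is an assumption:

```latex
% Additive homomorphism of a point-multiplication hash on an elliptic-curve group
% of prime order q with generator G (messages encoded as scalars m in Z_q):
H(m) = m \cdot G, \qquad
H(m_1) + H(m_2) = m_1 G + m_2 G = (m_1 + m_2)\,G = H\bigl((m_1 + m_2) \bmod q\bigr)
```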
2023 | Hashing Neural Video Decomposition With Multiplicative Residuals In Space-time | Chan Cheng-hung, Yuan Cheng-yang, Sun Cheng, Chen Hwann-tzong | Arxiv | We present a video decomposition method that facilitates layer-based editing of videos with spatiotemporally varying lighting and motion effects. Our neural model decomposes an input video into multiple layered representations, each comprising a 2D texture map, a mask for the original video, and a multiplicative residual characterizing the spatiotemporal variations in lighting conditions. A single edit on the texture maps can be propagated to the corresponding locations in the entire video frames while preserving other contents’ consistencies. Our method efficiently learns the layer-based neural representations of a 1080p video in 25s per frame via coordinate hashing and allows real-time rendering of the edited result at 71 fps on a single GPU. Qualitatively, we run our method on various videos to show its effectiveness in generating high-quality editing effects. Quantitatively, we propose to adopt feature-tracking evaluation metrics for objectively assessing the consistency of video editing. Project page: https://lightbulb12294.github.io/hashing-nvd/ |
|||||
2023 | Distinct Elements In Streams An Algorithm For The (text) Book | Chakraborty Sourav, Vinodchandran N. V., Meel Kuldeep S. | Apppeared in Proceedings of | Given a data stream \(\mathcal{A} = \langle a_1, a_2, \ldots, a_m \rangle\) of \(m\) elements where each \(a_i \in [n]\), the Distinct Elements problem is to estimate the number of distinct elements in \(\mathcal{A}\).Distinct Elements has been a subject of theoretical and empirical investigations over the past four decades resulting in space optimal algorithms for it.All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory. |
|||||
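The sampling-based estimator is simple enough to state in full; the sketch below follows the description above (maintain a sample under a shrinking inclusion probability) and, as in the paper, should report failure in the unlikely event that thinning does not shrink the sample:

```python
import random

def cvm_distinct_estimate(stream, capacity):
    """Sampling-based distinct-elements estimator: keep each seen element with
    probability p, halving p whenever the sample fills up; output |X| / p."""
    p, X = 1.0, set()
    for a in stream:
        X.discard(a)                      # re-decide membership of a fresh occurrence
        if random.random() < p:
            X.add(a)
        if len(X) == capacity:
            X = {x for x in X if random.random() < 0.5}   # thin the sample
            p /= 2
            if len(X) == capacity:        # vanishingly unlikely for reasonable capacity
                raise RuntimeError("estimator failed")
    return len(X) / p

stream = [random.randrange(500) for _ in range(100000)]
print(cvm_distinct_estimate(stream, capacity=100))   # close to the true count (at most 500)
```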
2023 | Cuckoo Hashing In Cryptography Optimal Parameters Robustness And Applications | Yeo Kevin | Arxiv | Cuckoo hashing is a powerful primitive that enables storing items using small space with efficient querying. At a high level, cuckoo hashing maps \(n\) items into \(b\) entries storing at most \(\ell\) items such that each item is placed into one of \(k\) randomly chosen entries. Additionally, there is an overflow stash that can store at most \(s\) items. Many cryptographic primitives rely upon cuckoo hashing to privately embed and query data where it is integral to ensure small failure probability when constructing cuckoo hashing tables as it directly relates to the privacy guarantees. As our main result, we present a more query-efficient cuckoo hashing construction using more hash functions. For construction failure probability \(\epsilon\), the query overhead of our scheme is \(O(1 + \sqrt{log(1/\epsilon)/log n})\). Our scheme has quadratically smaller query overhead than prior works for any target failure probability \(\epsilon\). We also prove lower bounds matching our construction. Our improvements come from a new understanding of the locality of cuckoo hashing failures for small sets of items. We also initiate the study of robust cuckoo hashing where the input set may be chosen with knowledge of the hash functions. We present a cuckoo hashing scheme using more hash functions with query overhead \(\tilde{O}(log \lambda)\) that is robust against poly\((\lambda)\) adversaries. Furthermore, we present lower bounds showing that this construction is tight and that extending previous approaches of large stashes or entries cannot obtain robustness except with \(Ω(n)\) query overhead. As applications of our results, we obtain improved constructions for batch codes and PIR. In particular, we present the most efficient explicit batch code and blackbox reduction from single-query PIR to batch PIR. |
|||||
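A minimal sketch of cuckoo hashing with \(k=2\) choices and a small stash, the primitive this paper parameterizes; hash functions, kick limits, and sizes are illustrative:

```python
import random

class CuckooHashTable:
    """Two-choice cuckoo hashing with an overflow stash (illustrative sketch)."""
    def __init__(self, num_buckets, stash_size=4, max_kicks=500, seed=0):
        self.n = num_buckets
        self.table = [None] * num_buckets
        self.stash, self.stash_size = [], stash_size
        self.max_kicks = max_kicks
        rng = random.Random(seed)
        self.salts = (rng.getrandbits(64), rng.getrandbits(64))

    def _slots(self, key):
        return [hash((key, s)) % self.n for s in self.salts]

    def insert(self, key):
        cur = key
        for _ in range(self.max_kicks):
            for slot in self._slots(cur):
                if self.table[slot] is None:
                    self.table[slot] = cur
                    return True
            # both candidate slots full: evict a random occupant and re-insert it
            slot = random.choice(self._slots(cur))
            cur, self.table[slot] = self.table[slot], cur
        if len(self.stash) < self.stash_size:
            self.stash.append(cur)        # construction "failure" absorbed by the stash
            return True
        return False

    def lookup(self, key):
        return any(self.table[s] == key for s in self._slots(key)) or key in self.stash

t = CuckooHashTable(num_buckets=16)
for k in ["alice", "bob", "carol"]:
    t.insert(k)
print(t.lookup("bob"), t.lookup("mallory"))   # True False
```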
2023 | Review Of The NIST Light-weight Cryptography Finalists | Buchanan William J, Maglaras Leandros | Arxiv | Since 2016, NIST has been assessing lightweight encryption methods, and, in 2022, NIST published the final 10: ASCON, Elephant, GIFT-COFB, Grain128-AEAD, ISAP, Photon-Beetle, Romulus, Sparkle, TinyJambu, and Xoodyak. At the time this article was written, NIST had announced ASCON as the chosen method that will be published as NIST’s lightweight cryptography standard later in 2023. In this article, we provide a comparison between these methods in terms of energy efficiency, time for encryption, and time for hashing. |
|||||
2023 | Simple And Efficient Four-cycle Counting On Sparse Graphs | Burkhardt Paul, Harris David G. | Arxiv | We consider the problem of counting 4-cycles (\(C_4\)) in an undirected graph \(G\) of \(n\) vertices and \(m\) edges (in bipartite graphs, 4-cycles are also often referred to as \(\textit{butterflies}\)). Most recently, Wang et al. (2019, 2022) developed algorithms for this problem based on hash tables and sorting the graph by degree. Their algorithm takes \(O(m\bar\delta)\) expected time and \(O(m)\) space, where \(\bar \delta \leq O(\sqrt{m})\) is the \(\textit{average degeneracy}\) parameter introduced by Burkhardt, Faber \& Harris (2020). We develop a streamlined version of this algorithm requiring \(O(m\bar\delta)\) time and precisely \(n\) words of space. It has several practical improvements and optimizations; for example, it is fully deterministic, does not require any auxiliary storage or sorting of the input graph, and uses only addition and array access in its inner loops. Our algorithm is very simple and easily adapted to count 4-cycles incident to each vertex and edge. Empirical tests demonstrate that our array-based approach is \(4\times\) – \(7\times\) faster on average compared to popular hash table implementations. |
|||||
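For orientation, the quantity being counted admits a simple codegree formula: every 4-cycle is counted twice across its two pairs of opposite vertices, so \(C_4 = \frac{1}{2}\sum_{\{u,w\}} \binom{codeg(u,w)}{2}\). The sketch below is this slower \(O(\sum_v deg(v)^2)\) baseline, not the paper's \(O(m\bar\delta)\) algorithm:

```python
from collections import defaultdict
from itertools import combinations

def count_4cycles(adj):
    """adj: dict vertex -> set of neighbours (undirected simple graph)."""
    codeg = defaultdict(int)
    for v, nbrs in adj.items():            # every wedge u - v - w raises codeg(u, w) by one
        for u, w in combinations(sorted(nbrs), 2):
            codeg[(u, w)] += 1
    return sum(c * (c - 1) // 2 for c in codeg.values()) // 2

# a 4-cycle 0-1-2-3-0 plus the chord 0-2 still contains exactly one 4-cycle
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(count_4cycles(adj))   # 1
```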
2023 | Weighted Minwise Hashing Beats Linear Sketching For Inner Product Estimation | Bessa Aline, Daliri Majid, Freire Juliana, Musco Cameron, Musco Christopher, Santos Aécio, Zhang Haoxiang | In Proceedings of the ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems | We present a new approach for computing compact sketches that can be used to approximate the inner product between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees that improve on the guarantees of popular linear sketching approaches for inner product estimation, such as CountSketch and Johnson-Lindenstrauss projection. Specifically, while our method admits guarantees that exactly match linear sketching for dense vectors, it yields significantly lower error for sparse vectors with limited overlap between non-zero entries. Such vectors arise in many applications involving sparse data. They are also important in increasingly popular dataset search applications, where inner product sketches are used to estimate data covariance, conditional means, and other quantities involving columns in unjoined tables. We complement our theoretical results by showing that our approach empirically outperforms existing linear sketches and unweighted hashing-based sketches for sparse vectors. |
|||||
2023 | Corrembed Evaluating Pre-trained Model Image Similarity Efficacy With A Novel Metric | Borgersen Karl Audun Kagnes, Goodwin Morten, Sharma Jivitesh, Aasmoe Tobias, Leonhardsen Mari, Rørvik Gro Herredsvela | Arxiv | Detecting visually similar images is a particularly useful attribute to look to when calculating product recommendations. Embedding similarity, which utilizes pre-trained computer vision models to extract high-level image features, has demonstrated remarkable efficacy in identifying images with similar compositions. However, there is a lack of methods for evaluating the embeddings generated by these models, as conventional loss and performance metrics do not adequately capture their performance in image similarity search tasks. In this paper, we evaluate the viability of the image embeddings from numerous pre-trained computer vision models using a novel approach named CorrEmbed. Our approach computes the correlation between distances in image embeddings and distances in human-generated tag vectors. We extensively evaluate numerous pre-trained Torchvision models using this metric, revealing an intuitive relationship of linear scaling between ImageNet1k accuracy scores and tag-correlation scores. Importantly, our method also identifies deviations from this pattern, providing insights into how different models capture high-level image features. By offering a robust performance evaluation of these pre-trained models, CorrEmbed serves as a valuable tool for researchers and practitioners seeking to develop effective, data-driven approaches to similar item recommendations in fashion retail. |
|||||
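A hedged sketch of the evaluation idea: correlate pairwise embedding distances with pairwise distances between human-generated tag vectors. The distance metric and correlation statistic chosen here are assumptions, not necessarily CorrEmbed's exact choices:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def correembed_score(image_embeddings: np.ndarray, tag_vectors: np.ndarray) -> float:
    """Correlation between pairwise distances in embedding space and in tag space."""
    d_emb = pdist(image_embeddings, metric="euclidean")
    d_tag = pdist(tag_vectors, metric="euclidean")
    r, _ = pearsonr(d_emb, d_tag)
    return float(r)

rng = np.random.default_rng(0)
print(correembed_score(rng.standard_normal((50, 128)), rng.random((50, 20))))
```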
2023 | Locally Uniform Hashing | Bercea Ioana O., Beretta Lorenzo, Klausen Jonas, Houen Jakob Bæk Tejs, Thorup Mikkel | Arxiv | Hashing is a common technique used in data processing, with a strong impact on the time and resources spent on computation. Hashing also affects the applicability of theoretical results that often assume access to (unrealistic) uniform/fully-random hash functions. In this paper, we are concerned with designing hash functions that are practical and come with strong theoretical guarantees on their performance. To this end, we present tornado tabulation hashing, which is simple, fast, and exhibits a certain full, local randomness property that provably makes diverse algorithms perform almost as if (abstract) fully-random hashing was used. For example, this includes classic linear probing, the widely used HyperLogLog algorithm of Flajolet, Fusy, Gandouet, Meunier [AOFA 97] for counting distinct elements, and the one-permutation hashing of Li, Owen, and Zhang [NIPS 12] for large-scale machine learning. We also provide a very efficient solution for the classical problem of obtaining fully-random hashing on a fixed (but unknown to the hash function) set of \(n\) keys using \(O(n)\) space. As a consequence, we get more efficient implementations of the splitting trick of Dietzfelbinger and Rink [ICALP’09] and the succinct space uniform hashing of Pagh and Pagh [SICOMP’08]. Tornado tabulation hashing is based on a simple method to systematically break dependencies in tabulation-based hashing techniques. |
|||||
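For background, plain simple tabulation hashing, which tornado tabulation builds on by adding derived characters, looks like this:

```python
import random

def make_simple_tabulation(num_chars=8, char_bits=8, out_bits=64, seed=0):
    """Simple tabulation hashing: split the key into characters and XOR one
    random table entry per character. (Tornado tabulation, the paper's scheme,
    additionally feeds derived characters through further tables.)"""
    rng = random.Random(seed)
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(num_chars)]
    mask = (1 << char_bits) - 1

    def h(key: int) -> int:
        out = 0
        for i in range(num_chars):
            out ^= tables[i][(key >> (i * char_bits)) & mask]
        return out
    return h

h = make_simple_tabulation()
print(hex(h(123456789)))
```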
2023 | Dedrift Robust Similarity Search Under Content Drift | Baranchuk Dmitry, Douze Matthijs, Upadhyay Yash, Yalniz I. Zeki | Arxiv | The statistical distribution of content uploaded and searched on media sharing sites changes over time due to seasonal, sociological and technical factors. We investigate the impact of this “content drift” for large-scale similarity search tools, based on nearest neighbor search in embedding space. Unless a costly index reconstruction is performed frequently, content drift degrades the search accuracy and efficiency. The degradation is especially severe since, in general, both the query and database distributions change. We introduce and analyze real-world image and video datasets for which temporal information is available over a long time period. Based on the learnings, we devise DeDrift, a method that updates embedding quantizers to continuously adapt large-scale indexing structures on-the-fly. DeDrift almost eliminates the accuracy degradation due to the query and database content drift while being up to 100x faster than a full index reconstruction. |
|||||
2023 | Yes We CANN Constrained Approximate Nearest Neighbors For Local Feature-based Visual Localization | Aiger Dror, Araujo André, Lynen Simon | Arxiv | Large-scale visual localization systems continue to rely on 3D point clouds built from image collections using structure-from-motion. While the 3D points in these models are represented using local image features, directly matching a query image’s local features against the point cloud is challenging due to the scale of the nearest-neighbor search problem. Many recent approaches to visual localization have thus proposed a hybrid method, where first a global (per image) embedding is used to retrieve a small subset of database images, and local features of the query are matched only against those. It seems to have become common belief that global embeddings are critical for said image-retrieval in visual localization, despite the significant downside of having to compute two feature types for each query image. In this paper, we take a step back from this assumption and propose Constrained Approximate Nearest Neighbors (CANN), a joint solution of k-nearest-neighbors across both the geometry and appearance space using only local features. We first derive the theoretical foundation for k-nearest-neighbor retrieval across multiple metrics and then showcase how CANN improves visual localization. Our experiments on public localization benchmarks demonstrate that our method significantly outperforms both state-of-the-art global feature-based retrieval and approaches using local feature aggregation schemes. Moreover, it is an order of magnitude faster in both index and query time than feature aggregation schemes for these datasets. Code: \url{https://github.com/google-research/google-research/tree/master/cann} |
|||||
2023 | Similarity Search In The Blink Of An Eye With Compressed Indices | Aguerrebere Cecilia, Bhati Ishwar, Hildebrand Mark, Tepper Mariano, Willke Ted | Arxiv | Nowadays, data is represented by vectors. Retrieving those vectors, among millions and billions, that are similar to a given query is a ubiquitous problem, known as similarity search, of relevance for a wide range of applications. Graph-based indices are currently the best performing techniques for billion-scale similarity search. However, their random-access memory pattern presents challenges to realize their full potential. In this work, we present new techniques and systems for creating faster and smaller graph-based indices. To this end, we introduce a novel vector compression method, Locally-adaptive Vector Quantization (LVQ), that uses per-vector scaling and scalar quantization to improve search performance with fast similarity computations and a reduced effective bandwidth, while decreasing memory footprint and barely impacting accuracy. LVQ, when combined with a new high-performance computing system for graph-based similarity search, establishes the new state of the art in terms of performance and memory footprint. For billions of vectors, LVQ outcompetes the second-best alternatives: (1) in the low-memory regime, by up to 20.7x in throughput with up to a 3x memory footprint reduction, and (2) in the high-throughput regime by 5.8x with 1.4x less memory. |
|||||
2023 | Massively-parallel Heat Map Sorting And Applications To Explainable Clustering | Aghamolaei Sepideh, Ghodsi Mohammad | Arxiv | Given a set of points labeled with \(k\) labels, we introduce the heat map sorting problem as reordering and merging the points and dimensions while preserving the clusters (labels). A cluster is preserved if it remains connected, i.e., if it is not split into several clusters and no two clusters are merged. We prove the problem is NP-hard and we give a fixed-parameter algorithm with a constant number of rounds in the massively parallel computation model, where each machine has a sublinear memory and the total memory of the machines is linear. We give an approximation algorithm for an NP-hard special case of the problem. We empirically compare our algorithm with k-means and density-based clustering (DBSCAN) using a dimensionality reduction via locality-sensitive hashing on several directed and undirected graphs of email and computer networks. |
|||||
2023 | Glued Lattices Are Better Quantizers Than K_12 | Agrell Erik, Pook-kolb Daniel, Allen Bruce | Arxiv | 40 years ago, Conway and Sloane proposed using the highly symmetrical Coxeter-Todd lattice \(K_{12}\) for quantization, and estimated its second moment. Since then, all published lists identify \(K_{12}\) as the best 12-dimensional lattice quantizer. Surprisingly, \(K_{12}\) is not optimal: we construct two new 12-dimensional lattices with lower normalized second moments. The new lattices are obtained by gluing together 6-dimensional lattices. |
|||||
2023 | Efficient Deduplication And Leakage Detection In Large Scale Image Datasets With A Focus On The Crowdai Mapping Challenge Dataset | Adimoolam Yeshwanth Kumar, Chatterjee Bodhiswatta, Poullis Charalambos, Averkiou Melinos | Arxiv | Recent advancements in deep learning and computer vision have led to widespread use of deep neural networks to extract building footprints from remote-sensing imagery. The success of such methods relies on the availability of large databases of high-resolution remote sensing images with high-quality annotations. The CrowdAI Mapping Challenge Dataset is one of these datasets that has been used extensively in recent years to train deep neural networks. This dataset consists of \( \sim\ \)280k training images and \( \sim\ \)60k testing images, with polygonal building annotations for all images. However, issues such as low-quality and incorrect annotations, extensive duplication of image samples, and data leakage significantly reduce the utility of deep neural networks trained on the dataset. Therefore, it is an imperative pre-condition to adopt a data validation pipeline that evaluates the quality of the dataset prior to its use. To this end, we propose a drop-in pipeline that employs perceptual hashing techniques for efficient de-duplication of the dataset and identification of instances of data leakage between training and testing splits. In our experiments, we demonstrate that nearly 250k(\( \sim\ \)90%) images in the training split were identical. Moreover, our analysis on the validation split demonstrates that roughly 56k of the 60k images also appear in the training split, resulting in a data leakage of 93%. The source code used for the analysis and de-duplication of the CrowdAI Mapping Challenge dataset is publicly available at https://github.com/yeshwanth95/CrowdAI_Hash_and_search . |
|||||
2023 | Fast Locality Sensitive Hashing With Theoretical Guarantee | Tan Zongyuan, Wang Hongya, Xu Bo, Luo Minjie, Du Ming | Arxiv | Locality-sensitive hashing (LSH) is an effective randomized technique widely used in many machine learning tasks. The cost of hashing is proportional to data dimensions, and thus often the performance bottleneck when dimensionality is high and the number of hash functions involved is large. Surprisingly, however, little work has been done to improve the efficiency of LSH computation. In this paper, we design a simple yet efficient LSH scheme, named FastLSH, under l2 norm. By combining random sampling and random projection, FastLSH reduces the time complexity from O(n) to O(m) (m<n), where n is the data dimensionality and m is the number of sampled dimensions. Moreover, FastLSH has provable LSH property, which distinguishes it from the non-LSH fast sketches. We conduct comprehensive experiments over a collection of real and synthetic datasets for the nearest neighbor search task. Experimental results demonstrate that FastLSH is on par with the state-of-the-arts in terms of answer quality, space occupation and query efficiency, while enjoying up to 80x speedup in hash function evaluation. We believe that FastLSH is a promising alternative to the classic LSH scheme. |
|||||
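The FastLSH abstract above combines random sampling with random projection. The sketch below is a minimal reading of that recipe under the l2 norm: sample m of the n coordinates, then apply an E2LSH-style bucketing hash to the sampled sub-vector. The class name and the parameters m and w are our own illustration, not the paper's API.

```python
import numpy as np

class FastLSHSketch:
    """Hedged sketch of the idea described in the FastLSH abstract: sample m
    of the n dimensions, then apply an E2LSH-style hash
    h(x) = floor((a . x_sampled + b) / w) on the sampled coordinates."""

    def __init__(self, n_dims, m, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.dims = rng.choice(n_dims, size=m, replace=False)  # random sampling
        self.a = rng.normal(size=m)                            # random projection
        self.b = rng.uniform(0, w)
        self.w = w

    def hash(self, x):
        return int(np.floor((x[self.dims] @ self.a + self.b) / self.w))

h = FastLSHSketch(n_dims=128, m=30)
x = np.random.default_rng(1).normal(size=128)
print(h.hash(x))
```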
2023 | Unfolded Self-reconstruction LSH Towards Machine Unlearning In Approximate Nearest Neighbour Search | Tan Kim Yong, Lyu Yueming, Ong Yew Soon, Tsang Ivor W. | Arxiv | Approximate nearest neighbour (ANN) search is an essential component of search engines, recommendation systems, etc. Many recent works focus on learning-based data-distribution-dependent hashing and achieve good retrieval performance. However, due to increasing demand for users’ privacy and security, we often need to remove users’ data information from Machine Learning (ML) models to satisfy specific privacy and security requirements. This need requires the ANN search algorithm to support fast online data deletion and insertion. Current learning-based hashing methods require retraining the hash function, which is prohibitive due to the vast time cost on large-scale data. To address this problem, we propose a novel data-dependent hashing method named unfolded self-reconstruction locality-sensitive hashing (USR-LSH). Our USR-LSH unfolds the optimization update for instance-wise data reconstruction, which preserves data information better than data-independent LSH. Moreover, our USR-LSH supports fast online data deletion and insertion without retraining. To the best of our knowledge, we are the first to address machine unlearning for retrieval problems. Empirically, we demonstrate that USR-LSH outperforms the state-of-the-art data-distribution-independent LSH in ANN tasks in terms of precision and recall. We also show that USR-LSH has significantly faster data deletion and insertion time than learning-based data-dependent hashing. |
|||||
2023 | Vector Embeddings By Sequence Similarity And Context For Improved Compression Similarity Search Clustering Organization And Manipulation Of Cdna Libraries | Um Daniel H., Knowles David A., Kaiser Gail E. | Arxiv | This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ5). FASTA/FASTQ files have several current limitations, such as their large file sizes, slow processing speeds for mapping and alignment, and contextual dependencies. These challenges significantly hinder investigations and tasks that involve finding similar sequences. The solution lies in transforming sequences into an alternative representation that facilitates easier clustering into similar groups compared to the raw sequences themselves. By assigning a unique vector embedding to each short sequence, it is possible to more efficiently cluster and improve upon compression performance for the string representations of cDNA libraries. Furthermore, through learning alternative coordinate vector embeddings based on the contexts of codon triplets, we can demonstrate clustering based on amino acid properties. Finally, using this sequence embedding method to encode barcodes and cDNA sequences, we can improve the time complexity of the similarity search by coupling vector embeddings with an algorithm that determines the proximity of vectors in Euclidean space; this allows us to perform sequence similarity searches in a quicker and more modular fashion. |
|||||
2023 | Locality-sensitive Hashing Does Not Guarantee Privacy! Attacks On Google’s FLoC And The MinHash Hierarchy System | Turati Florian, Cotrini Carlos, Kubicek Karel, Basin David | Arxiv | Recently proposed systems aim at achieving privacy using locality-sensitive hashing. We show how these approaches fail by presenting attacks against two such systems: Google’s FLoC proposal for privacy-preserving targeted advertising and the MinHash Hierarchy, a system for processing mobile users’ traffic behavior in a privacy-preserving way. Our attacks refute the pre-image resistance, anonymity, and privacy guarantees claimed for these systems. In the case of FLoC, we show how to deanonymize users using Sybil attacks and to reconstruct 10% or more of the browsing history for 30% of its users using Generative Adversarial Networks. We achieve this by analyzing only the hashes used by FLoC. For MinHash, we precisely identify the movement of a subset of individuals and, on average, we can limit users’ movement to just 10% of the possible geographic area, again using just the hashes. In addition, we refute their differential privacy claims. |
|||||
2023 | Fast Private Kernel Density Estimation Via Locality Sensitive Quantization | Wagner Tal, Naamad Yonatan, Mishra Nina | Arxiv | We study efficient mechanisms for differentially private kernel density estimation (DP-KDE). Prior work for the Gaussian kernel described algorithms that run in time exponential in the number of dimensions \(d\). This paper breaks the exponential barrier, and shows how the KDE can privately be approximated in time linear in \(d\), making it feasible for high-dimensional data. We also present improved bounds for low-dimensional data. Our results are obtained through a general framework, which we term Locality Sensitive Quantization (LSQ), for constructing private KDE mechanisms where existing KDE approximation techniques can be applied. It lets us leverage several efficient non-private KDE methods – like Random Fourier Features, the Fast Gauss Transform, and Locality Sensitive Hashing – and “privatize” them in a black-box manner. Our experiments demonstrate that our resulting DP-KDE mechanisms are fast and accurate on large datasets in both high and low dimensions. |
|||||
2022 | Sichash -- Small Irregular Cuckoo Tables For Perfect Hashing | Lehmann Hans-peter, Sanders Peter, Walzer Stefan | Arxiv | A Perfect Hash Function (PHF) is a hash function that has no collisions on a given input set. PHFs can be used for space efficient storage of data in an array, or for determining a compact representative of each object in the set. In this paper, we present the PHF construction algorithm SicHash - Small Irregular Cuckoo Tables for Perfect Hashing. At its core, SicHash uses a known technique: It places objects in a cuckoo hash table and then stores the final hash function choice of each object in a retrieval data structure. We combine the idea with irregular cuckoo hashing, where each object has a different number of hash functions. Additionally, we use many small tables that we overload beyond their asymptotic maximum load factor. The most space efficient competitors often use brute force methods to determine the PHFs. SicHash provides a more direct construction algorithm that only rarely needs to recompute parts. Our implementation improves the state of the art in terms of space usage versus construction time for a wide range of configurations. At the same time, it provides very fast queries. |
|||||
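SicHash builds on cuckoo hashing, so a minimal two-choice cuckoo insertion routine is sketched below purely as background; it omits everything SicHash adds (irregular numbers of hash functions per object, overloaded small tables, and the retrieval structure that stores each object's final choice).

```python
class CuckooTable:
    """Minimal two-choice cuckoo hash table, shown only as background for the
    cuckoo-hashing step the SicHash abstract builds on."""

    def __init__(self, size, max_kicks=100):
        self.size = size
        self.max_kicks = max_kicks
        self.slots = [None] * size

    def _h(self, key, i):
        # Two hash functions derived from Python's built-in hash.
        return hash((key, i)) % self.size

    def insert(self, key):
        pos = self._h(key, 0)
        for _ in range(self.max_kicks):
            if self.slots[pos] is None:
                self.slots[pos] = key
                return True
            # Evict the occupant and move it to its alternate slot.
            key, self.slots[pos] = self.slots[pos], key
            p0, p1 = self._h(key, 0), self._h(key, 1)
            pos = p1 if pos == p0 else p0
        return False  # a real implementation would rehash here

t = CuckooTable(size=128)
inserted = sum(t.insert(k) for k in range(50))
print(f"placed {inserted}/50 keys")   # typically all, well below the load limit
```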
2022 | Set2box Similarity Preserving Representation Learning Of Sets | Lee Geon, Park Chanyoung, Shin Kijung | Arxiv | Sets have been used for modeling various types of objects (e.g., a document as the set of keywords in it and a customer as the set of the items that she has purchased). Measuring similarity (e.g., Jaccard Index) between sets has been a key building block of a wide range of applications, including, plagiarism detection, recommendation, and graph compression. However, as sets have grown in numbers and sizes, the computational cost and storage required for set similarity computation have become substantial, and this has led to the development of hashing and sketching based solutions. In this work, we propose Set2Box, a learning-based approach for compressed representations of sets from which various similarity measures can be estimated accurately in constant time. The key idea is to represent sets as boxes to precisely capture overlaps of sets. Additionally, based on the proposed box quantization scheme, we design Set2Box+, which yields more concise but more accurate box representations of sets. Through extensive experiments on 8 real-world datasets, we show that, compared to baseline approaches, Set2Box+ is (a) Accurate: achieving up to 40.8X smaller estimation error while requiring 60% fewer bits to encode sets, (b) Concise: yielding up to 96.8X more concise representations with similar estimation error, and (c) Versatile: enabling the estimation of four set-similarity measures from a single representation of each set. |
|||||
2022 | Composite Code Sparse Autoencoders For First Stage Retrieval | Lassance Carlos, Formal Thibault, Clinchant Stephane | Arxiv | We propose a Composite Code Sparse Autoencoder (CCSA) approach for Approximate Nearest Neighbor (ANN) search of document representations based on Siamese-BERT models. In Information Retrieval (IR), the ranking pipeline is generally decomposed into two stages: the first stage focuses on retrieving a candidate set from the whole collection. The second stage re-ranks the candidate set by relying on more complex models. Recently, Siamese-BERT models have been used as first-stage rankers to replace or complement the traditional bag-of-word models. However, indexing and searching a large document collection require efficient similarity search on dense vectors and this is why ANN techniques come into play. Since composite codes are naturally sparse, we first show how CCSA can learn an efficient parallel inverted index thanks to a uniformity regularizer. Second, CCSA can be used as a binary quantization method and we propose to combine it with the recent graph based ANN techniques. Our experiments on the MSMARCO dataset reveal that CCSA outperforms IVF with product quantization. Furthermore, CCSA binary quantization is beneficial for the index size, and memory usage for the graph-based HNSW method, while maintaining a good level of recall and MRR. Third, we compare with recent supervised quantization methods for image retrieval and find that CCSA is able to outperform them. |
|||||
2022 | A Hash Table Without Hash Functions And How To Get The Most Out Of Your Random Bits | Kuszmaul William | Arxiv | This paper considers the basic question of how strong of a probabilistic guarantee can a hash table, storing \(n\) \((1 + \Theta(1)) \log n\)-bit key/value pairs, offer? Past work on this question has been bottlenecked by limitations of the known families of hash functions: The only hash tables to achieve failure probabilities less than \(1 / 2^{\operatorname{polylog} n}\) require access to fully-random hash functions – if the same hash tables are implemented using the known explicit families of hash functions, their failure probabilities become \(1 / \operatorname{poly}(n)\). To get around these obstacles, we show how to construct a randomized data structure that has the same guarantees as a hash table, but that avoids the direct use of hash functions. Building on this, we are able to construct a hash table using \(O(n)\) random bits that achieves failure probability \(1 / n^{n^{1 - \epsilon}}\) for an arbitrary positive constant \(\epsilon\). In fact, we show that this guarantee can even be achieved by a succinct dictionary, that is, by a dictionary that uses space within a \(1 + o(1)\) factor of the information-theoretic optimum. Finally we also construct a succinct hash table whose probabilistic guarantees fall on a different extreme, offering a failure probability of \(1 / \operatorname{poly}(n)\) while using only \(\tilde{O}(\log n)\) random bits. This latter result matches (up to low-order terms) a guarantee previously achieved by Dietzfelbinger et al., but with increased space efficiency and with several surprising technical components. |
|||||
2022 | Pachash Packed And Compressed Hash Tables | Kurpicz Florian, Lehmann Hans-peter, Sanders Peter | Arxiv | We introduce PaCHash, a hash table that stores its objects contiguously in an array without intervening space, even if the objects have variable size. In particular, each object can be compressed using standard compression techniques. A small search data structure allows locating the objects in constant expected time. PaCHash is most naturally described as a static external hash table where it needs a constant number of bits of internal memory per block of external memory. Here, in some sense, PaCHash beats a lower bound on the space consumption of k-perfect hashing. An implementation for fast SSDs needs about 5 bits of internal memory per block of external memory, requires only one disk access (of variable length) per search operation, and has small internal search overhead compared to the disk access cost. Our experiments show that it has lower space consumption than all previous approaches even when considering objects of identical size. |
|||||
2022 | Parameterizing Kterm Hashing | Wurzer Dominik, Qin Yumeng | SIGIR | Kterm Hashing provides an innovative approach to novelty detection on massive data streams. Previous research focused on maximizing the efficiency of Kterm Hashing and succeeded in scaling First Story Detection to Twitter-size data stream without sacrificing detection accuracy. In this paper, we focus on improving the effectiveness of Kterm Hashing. Traditionally, all kterms are considered as equally important when calculating a document’s degree of novelty with respect to the past. We believe that certain kterms are more important than others and hypothesize that uniform kterm weights are sub-optimal for determining novelty in data streams. To validate our hypothesis, we parameterize Kterm Hashing by assigning weights to kterms based on their characteristics. Our experiments apply Kterm Hashing in a First Story Detection setting and reveal that parameterized Kterm Hashing can surpass state-of-the-art detection accuracy and significantly outperform the uniformly weighted approach. |
|||||
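As we read the Kterm Hashing abstract above, a document's novelty is the fraction of its kterms (combinations of k terms) whose hash buckets have not been seen before, and the paper's contribution is to weight kterms instead of treating them uniformly. The sketch below encodes that reading with a pluggable weight function; the class, the bit-array size, and the default uniform weighting are our assumptions, not the paper's implementation.

```python
from itertools import combinations

class KtermHasher:
    """Hedged sketch of kterm-hashing-style novelty scoring: hash every
    combination of k terms into a bit array; a document's novelty is the
    weighted fraction of its kterms that hit previously unset bits. The
    weight_fn hook mirrors the parameterization the paper argues for, but
    the concrete weighting is left to the caller."""

    def __init__(self, k=2, num_bits=1 << 20, weight_fn=None):
        self.k = k
        self.num_bits = num_bits
        self.seen = bytearray(num_bits)
        self.weight_fn = weight_fn or (lambda kterm: 1.0)  # uniform baseline

    def novelty(self, terms):
        total, novel = 0.0, 0.0
        for kterm in combinations(sorted(set(terms)), self.k):
            w = self.weight_fn(kterm)
            total += w
            bit = hash(kterm) % self.num_bits
            if not self.seen[bit]:
                novel += w
                self.seen[bit] = 1
        return novel / total if total else 0.0

h = KtermHasher(k=2)
print(h.novelty("earthquake hits city".split()))      # first story: high novelty
print(h.novelty("earthquake hits big city".split()))  # overlapping kterms: lower
```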
2022 | Hierarchical Locality Sensitive Hashing For Structured Data A Survey | Wu Wei, Li Bin | Arxiv | Data similarity (or distance) computation is a fundamental research topic which fosters a variety of similarity-based machine learning and data mining applications. In big data analytics, it is impractical to compute the exact similarity of data instances due to high computational cost. To this end, the Locality Sensitive Hashing (LSH) technique has been proposed to provide accurate estimators for various similarity measures between sets or vectors in an efficient manner without the learning process. Structured data (e.g., sequences, trees and graphs), which are composed of elements and relations between the elements, are commonly seen in the real world, but the traditional LSH algorithms cannot preserve the structure information represented as relations between elements. In order to conquer the issue, researchers have been devoted to the family of the hierarchical LSH algorithms. In this paper, we explore the present progress of the research into hierarchical LSH from the following perspectives: 1) Data structures, where we review various hierarchical LSH algorithms for three typical data structures and uncover their inherent connections; 2) Applications, where we review the hierarchical LSH algorithms in multiple application scenarios; 3) Challenges, where we discuss some potential challenges as future directions. |
|||||
2022 | HQANN Efficient And Robust Similarity Search For Hybrid Queries With Structured And Unstructured Constraints | Wu Wei, He Junlin, Qiao Yu, Fu Guoheng, Liu Li, Yu Jin | Arxiv | The in-memory approximate nearest neighbor search (ANNS) algorithms have achieved great success for fast high-recall query processing, but are extremely inefficient when handling hybrid queries with unstructured (i.e., feature vectors) and structured (i.e., related attributes) constraints. In this paper, we present HQANN, a simple yet highly efficient hybrid query processing framework which can be easily embedded into existing proximity graph-based ANNS algorithms. We guarantee both low latency and high recall by leveraging navigation sense among attributes and fusing vector similarity search with attribute filtering. Experimental results on both public and in-house datasets demonstrate that HQANN is 10x faster than the state-of-the-art hybrid ANNS solutions to reach the same recall quality and its performance is hardly affected by the complexity of attributes. It can reach 99\% recall@10 in just around 50 microseconds on GLOVE-1.2M with thousands of attribute constraints. |
|||||
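To make the hybrid-query setting above concrete, the sketch below is a brute-force reference: keep only items whose structured attributes satisfy the constraint, then rank the survivors by vector similarity. It is the naive baseline such systems are designed to beat, not HQANN's graph-based fusion; all names are illustrative.

```python
import numpy as np

def hybrid_search(query_vec, query_attrs, vectors, attrs, k=10):
    """Brute-force reference for a hybrid query: attribute filtering followed
    by inner-product ranking of the surviving candidates."""
    mask = np.array([query_attrs.items() <= a.items() for a in attrs])
    cand = np.flatnonzero(mask)
    if cand.size == 0:
        return []
    sims = vectors[cand] @ query_vec            # inner-product similarity
    order = np.argsort(-sims)[:k]
    return [(int(cand[i]), float(sims[i])) for i in order]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))
attrs = [{"color": rng.choice(["red", "blue"]), "size": int(rng.integers(1, 4))}
         for _ in range(1000)]
print(hybrid_search(rng.normal(size=64), {"color": "red"}, vectors, attrs, k=5))
```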
2022 | A Lower Bound Of Hash Codes Performance | Xiaosu Zhu, Jingkuan Song, Yu Lei, Lianli Gao, Hengtao Shen | Neural Information Processing Systems | As a crucial approach for compact representation learning, hashing has achieved great success in effectiveness and efficiency. Numerous heuristic Hamming space metric learning objectives are designed to obtain high-quality hash codes. Nevertheless, a theoretical analysis of criteria for learning good hash codes remains largely unexploited. In this paper, we prove that inter-class distinctiveness and intra-class compactness among hash codes determine the lower bound of hash codes’ performance. Promoting these two characteristics could lift the bound and improve hash learning. We then propose a surrogate model to fully exploit the above objective by estimating the posterior of hash codes and controlling it, which results in a low-bias optimization. Extensive experiments reveal the effectiveness of the proposed method. By testing on a series of hash-models, we obtain performance improvements among all of them, with an up to \(26.5\%\) increase in mean Average Precision and an up to \(20.5\%\) increase in accuracy. Our code is publicly available at https://github.com/VL-Group/LBHash. |
|||||
2022 | Self-supervised Consistent Quantization For Fully Unsupervised Image Retrieval | Wu Guile, Zhang Chao, Liwicki Stephan | Arxiv | Unsupervised image retrieval aims to learn an efficient retrieval system without expensive data annotations, but most existing methods rely heavily on handcrafted feature descriptors or pre-trained feature extractors. To minimize human supervision, recent advances propose deep fully unsupervised image retrieval, aiming at training a deep model from scratch to jointly optimize visual features and quantization codes. However, existing approaches mainly focus on instance contrastive learning without considering underlying semantic structure information, resulting in sub-optimal performance. In this work, we propose a novel self-supervised consistent quantization approach to deep fully unsupervised image retrieval, which consists of part consistent quantization and global consistent quantization. In part consistent quantization, we devise part neighbor semantic consistency learning with codeword diversity regularization. This allows discovering underlying neighbor structure information of sub-quantized representations as self-supervision. In global consistent quantization, we employ contrastive learning for both embedding and quantized representations and fuse these representations for consistent contrastive regularization between instances. This can make up for the loss of useful representation information during quantization and regularize consistency between instances. With a unified learning objective of part and global consistent quantization, our approach exploits richer self-supervision cues to facilitate model learning. Extensive experiments on three benchmark datasets show the superiority of our approach over the state-of-the-art methods. |
|||||
2022 | Givens Coordinate Descent Methods For Rotation Matrix Learning In Trainable Embedding Indexes | Jiang Yunjiang, Zhang Han, Qiu Yiming, Xiao Yun, Long Bo, Yang Wen-yun | The Tenth International Conference on Learning Representations | Product quantization (PQ), coupled with a space rotation, is widely used in modern approximate nearest neighbor (ANN) search systems to significantly compress the disk storage for embeddings and speed up the inner product computation. Existing rotation learning methods, however, minimize quantization distortion for fixed embeddings, which are not applicable to an end-to-end training scenario where embeddings are updated constantly. In this paper, based on geometric intuitions from Lie group theory, in particular the special orthogonal group \(SO(n)\), we propose a family of block Givens coordinate descent algorithms to learn rotation matrices, provably convergent on any convex objective. Compared to the state-of-the-art SVD method, the Givens algorithms are much more parallelizable, reducing runtime by orders of magnitude on modern GPUs, and converge more stably according to experimental studies. They further improve upon vanilla product quantization significantly in an end-to-end training scenario. |
|||||
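The entry above parameterizes rotations through Givens (planar) rotations and optimizes them coordinate-wise. The sketch below shows that structure on a toy objective; the per-plane grid search stands in for the paper's derived updates, so treat it as an illustration of the parameterization rather than the algorithm itself.

```python
import numpy as np

def givens(n, i, j, theta):
    """Return the n x n Givens rotation acting in the (i, j) plane."""
    g = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    g[i, i] = c; g[j, j] = c
    g[i, j] = -s; g[j, i] = s
    return g

def coordinate_step(R, loss, i, j, n_grid=64):
    """One block-coordinate step: rotate only in the (i, j) plane and keep the
    angle that minimizes the loss. The grid includes theta = 0, so a step can
    never increase the loss; the paper derives more efficient updates."""
    thetas = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    candidates = [givens(R.shape[0], i, j, t) @ R for t in thetas]
    return min(candidates, key=loss)

# Toy objective: rotate data back toward the identity basis (quantization
# distortion would play this role in a trainable embedding index).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
loss = lambda R: float(np.linalg.norm(X @ R - X) ** 2)

R = givens(4, 0, 1, 0.7) @ givens(4, 2, 3, -0.4)   # start from some rotation
print("before:", round(loss(R), 2))
for i in range(4):
    for j in range(i + 1, 4):
        R = coordinate_step(R, loss, i, j)
print("after:", round(loss(R), 2))
```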
2022 | Fast Online Hashing With Multi-label Projection | Jia Wenzhe, Cao Yuan, Liu Junwei, Gui Jie | Arxiv | Hashing has been widely researched to solve the large-scale approximate nearest neighbor search problem owing to its time and storage superiority. In recent years, a number of online hashing methods have emerged, which can update the hash functions to adapt to the new stream data and realize dynamic retrieval. However, existing online hashing methods are required to update the whole database with the latest hash functions when a query arrives, which leads to low retrieval efficiency with the continuous increase of the stream data. On the other hand, these methods ignore the supervision relationship among the examples, especially in the multi-label case. In this paper, we propose a novel Fast Online Hashing (FOH) method which only updates the binary codes of a small part of the database. To be specific, we first build a query pool in which the nearest neighbors of each central point are recorded. When a new query arrives, only the binary codes of the corresponding potential neighbors are updated. In addition, we create a similarity matrix which takes the multi-label supervision information into account and bring in a multi-label projection loss to further preserve the similarity among the multi-label data. The experimental results on two common benchmarks show that the proposed FOH achieves dramatically lower query time, up to 6.28 seconds less than state-of-the-art baselines, while maintaining competitive retrieval accuracy. |
|||||
2022 | Ood-diskann Efficient And Scalable Graph ANNS For Out-of-distribution Queries | Jaiswal Shikhar, Krishnaswamy Ravishankar, Garg Ankit, Simhadri Harsha Vardhan, Agrawal Sheshansh | Arxiv | State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) such as DiskANN, FAISS-IVF, and HNSW build data dependent indices that offer substantially better accuracy and search efficiency over data-agnostic indices by overfitting to the index data distribution. When the query data is drawn from a different distribution - e.g., when index represents image embeddings and query represents textual embeddings - such algorithms lose much of this performance advantage. On a variety of datasets, for a fixed recall target, latency is worse by an order of magnitude or more for Out-Of-Distribution (OOD) queries as compared to In-Distribution (ID) queries. The question we address in this work is whether ANNS algorithms can be made efficient for OOD queries if the index construction is given access to a small sample set of these queries. We answer positively by presenting OOD-DiskANN, which uses a sparing sample (1% of index set size) of OOD queries, and provides up to 40% improvement in mean query latency over SoTA algorithms of a similar memory footprint. OOD-DiskANN is scalable and has the efficiency of graph-based ANNS indices. Some of our contributions can improve query efficiency for ID queries as well. |
|||||
2022 | Experimental Analysis Of Machine Learning Techniques For Finding Search Radius In Locality Sensitive Hashing | Jafari Omid, Nagarkar Parth | Arxiv | Finding similar data in high-dimensional spaces is one of the important tasks in multimedia applications. Exact search techniques often use tree-based index structures, which are known to suffer from the curse of dimensionality that limits their performance. Approximate search techniques prefer performance over accuracy: they return good-enough results while achieving better performance. Locality Sensitive Hashing (LSH) is one of the most popular approximate nearest neighbor search techniques for high-dimensional spaces. One of the most time-consuming processes in LSH is to find the neighboring points in the projected spaces. An improved LSH-based index structure, called radius-optimized Locality Sensitive Hashing (roLSH), has been proposed to utilize Machine Learning to efficiently find these neighboring points and thus further improve the overall performance of LSH. In this paper, we extend roLSH by experimentally studying the effect of different popular Machine Learning techniques on overall performance. We compare ten regression techniques on four real-world datasets and show that Neural Network-based techniques are the best fit to be used in roLSH, as their accuracy-performance trade-off is the best among the compared techniques. |
|||||
2022 | A Multilabel Classification Framework For Approximate Nearest Neighbor Search | Ville Hyvönen, Elias Jääsaari, Teemu Roos | Neural Information Processing Systems | Both supervised and unsupervised machine learning algorithms have been used to learn partition-based index structures for approximate nearest neighbor (ANN) search. Existing supervised algorithms formulate the learning task as finding a partition in which the nearest neighbors of a training set point belong to the same partition element as the point itself, so that the nearest neighbor candidates can be retrieved by naive lookup or backtracking search. We formulate candidate set selection in ANN search directly as a multilabel classification problem where the labels correspond to the nearest neighbors of the query point, and interpret the partitions as partitioning classifiers for solving this task. Empirical results suggest that the natural classifier based on this interpretation leads to strictly improved performance when combined with any unsupervised or supervised partitioning strategy. We also prove a sufficient condition for consistency of a partitioning classifier for ANN search, and illustrate the result by verifying this condition for chronological \(k\)-d trees. |
|||||
2022 | Hyp^2 Loss Beyond Hypersphere Metric Space For Multi-label Image Retrieval | Xu Chengyin, Chai Zenghao, Xu Zhengzhuo, Yuan Chun, Fan Yanbo, Wang Jue | Arxiv | Image retrieval has become an increasingly appealing technique with broad multimedia application prospects, where deep hashing serves as the dominant branch towards low storage and efficient retrieval. In this paper, we carried out in-depth investigations on metric learning in deep hashing for establishing a powerful metric space in multi-label scenarios, where the pair loss suffers from high computational overhead and convergence difficulty, while the proxy loss is theoretically incapable of expressing the profound label dependencies and exhibits conflicts in the constructed hypersphere space. To address the problems, we propose a novel metric learning framework with Hybrid Proxy-Pair Loss (HyP\(^2\) Loss) that constructs an expressive metric space with efficient training complexity w.r.t. the whole dataset. The proposed HyP\(^2\) Loss focuses on optimizing the hypersphere space by learnable proxies and excavating data-to-data correlations of irrelevant pairs, which integrates sufficient data correspondence of pair-based methods and high-efficiency of proxy-based methods. Extensive experiments on four standard multi-label benchmarks show that the proposed method outperforms the state-of-the-art, is robust across different hash bit lengths and achieves significant performance gains with a faster, more stable convergence speed. Our code is available at https://github.com/JerryXu0129/HyP2-Loss. |
|||||
2022 | SAH Shifting-aware Asymmetric Hashing For Reverse k-maximum Inner Product Search | Huang Qiang, Wang Yanhao, Tung Anthony K. H. | Arxiv | This paper investigates a new yet challenging problem called Reverse \(k\)-Maximum Inner Product Search (R\(k\)MIPS). Given a query (item) vector, a set of item vectors, and a set of user vectors, the problem of R\(k\)MIPS aims to find a set of user vectors whose inner products with the query vector are one of the \(k\) largest among the query and item vectors. We propose the first subquadratic-time algorithm, i.e., Shifting-aware Asymmetric Hashing (SAH), to tackle the R\(k\)MIPS problem. To speed up the Maximum Inner Product Search (MIPS) on item vectors, we design a shifting-invariant asymmetric transformation and develop a novel sublinear-time Shifting-Aware Asymmetric Locality Sensitive Hashing (SA-ALSH) scheme. Furthermore, we devise a new blocking strategy based on the Cone-Tree to effectively prune user vectors (in a batch). We prove that SAH achieves a theoretical guarantee for solving the RMIPS problem. Experimental results on five real-world datasets show that SAH runs 4\(\sim\)8\(\times\) faster than the state-of-the-art methods for R\(k\)MIPS while achieving F1-scores of over 90\%. The code is available at \url{https://github.com/HuangQiang/SAH}. |
|||||
2022 | Badhash Invisible Backdoor Attacks Against Deep Hashing With Clean Label | Hu Shengshan, Zhou Ziqi, Zhang Yechao, Zhang Leo Yu, Zheng Yifeng, He Yuanyuan, Jin Hai | Arxiv | Due to its powerful feature learning capability and high efficiency, deep hashing has achieved great success in large-scale image retrieval. Meanwhile, extensive works have demonstrated that deep neural networks (DNNs) are susceptible to adversarial examples, and exploring adversarial attacks against deep hashing has attracted many research efforts. Nevertheless, backdoor attack, another famous threat to DNNs, has not been studied for deep hashing yet. Although various backdoor attacks have been proposed in the field of image classification, existing approaches failed to realize a truly imperceptible backdoor attack that enjoys invisible triggers and clean label setting simultaneously, and they also cannot meet the intrinsic demand of an image retrieval backdoor. In this paper, we propose BadHash, the first generative-based imperceptible backdoor attack against deep hashing, which can effectively generate invisible and input-specific poisoned images with clean label. Specifically, we first propose a new conditional generative adversarial network (cGAN) pipeline to effectively generate poisoned samples. For any given benign image, it seeks to generate a natural-looking poisoned counterpart with a unique invisible trigger. In order to improve the attack effectiveness, we introduce a label-based contrastive learning network LabCLN to exploit the semantic characteristics of different labels, which are subsequently used for confusing and misleading the target model to learn the embedded trigger. We finally explore the mechanism of backdoor attacks on image retrieval in the hash space. Extensive experiments on multiple benchmark datasets verify that BadHash can generate imperceptible poisoned samples with strong attack ability and transferability over state-of-the-art deep hashing schemes. |
|||||
2022 | Understanding The Moments Of Tabulation Hashing Via Chaoses | Houen Jakob Bæk Tejs, Thorup Mikkel | Arxiv | Simple tabulation hashing dates back to Zobrist in 1970 and is defined as follows: Each key is viewed as \(c\) characters from some alphabet \(\Sigma\), we have \(c\) fully random hash functions \(h_0, \ldots, h_{c - 1} \colon \Sigma \to \{0, \ldots, 2^l - 1\}\), and a key \(x = (x_0, \ldots, x_{c - 1})\) is hashed to \(h(x) = h_0(x_0) \oplus \ldots \oplus h_{c - 1}(x_{c - 1})\) where \(\oplus\) is the bitwise XOR operation. The previous results on tabulation hashing by Pătraşcu and Thorup [J.ACM’11] and by Aamand et al. [STOC’20] focused on proving Chernoff-style tail bounds on hash-based sums, e.g., the number of keys hashing to a given value, for simple tabulation hashing, but their bounds do not cover the entire tail. Chaoses are random variables of the form \(\sum a_{i_0, \ldots, i_{c - 1}} X_{i_0} \cdot \ldots \cdot X_{i_{c - 1}}\) where \(X_i\) are independent random variables. Chaoses are a well-studied concept from probability theory, and tight analysis has been proven in several instances, e.g., when the independent random variables are standard Gaussian variables and when the independent random variables have logarithmically convex tails. We notice that hash-based sums of simple tabulation hashing can be seen as a sum of chaoses that are not independent. This motivates us to use techniques from the theory of chaoses to analyze hash-based sums of simple tabulation hashing. In this paper, we obtain bounds for all the moments of hash-based sums for simple tabulation hashing which are tight up to constants depending only on \(c\). In contrast with the previous attempts, our approach will mostly be analytical and does not employ intricate combinatorial arguments. The improved analysis of simple tabulation hashing allows us to obtain bounds for the moments of hash-based sums for the mixed tabulation hashing introduced by Dahlgaard et al. [FOCS’15]. |
|||||
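The abstract above defines simple tabulation hashing explicitly, and the definition translates directly into code: split the key into c characters, look each one up in its own fully random table, and XOR the results. A small sketch, with the character width and output width chosen arbitrarily:

```python
import random

def make_simple_tabulation(c=4, char_bits=8, out_bits=32, seed=0):
    """Simple tabulation hashing as defined in the abstract: view a key as c
    characters, look each character up in its own random table, and XOR."""
    rng = random.Random(seed)
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]

    def h(key):
        out = 0
        for i in range(c):
            char = (key >> (i * char_bits)) & ((1 << char_bits) - 1)
            out ^= tables[i][char]
        return out

    return h

h = make_simple_tabulation()
print(h(0xDEADBEEF), h(0xDEADBEF0))
```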
2022 | Similarity Search On Computational Notebooks | Horiuchi Misato, Sasaki Yuya, Xiao Chuan, Onizuka Makoto | Arxiv | Computational notebook software such as Jupyter Notebook is popular for data science tasks. Numerous computational notebooks are available on the Web and reusable; however, searching for computational notebooks manually is a tedious task, and so far, there are no tools to search for computational notebooks effectively and efficiently. In this paper, we propose a similarity search on computational notebooks and develop a new framework for the similarity search. Given contents (i.e., source code, tabular data, libraries, and output formats) in computational notebooks as a query, the similarity search problem aims to find the top-k computational notebooks with the most similar contents. We define two similarity measures: set-based and graph-based similarities. Set-based similarity handles each content independently, while graph-based similarity captures the relationships between contents. Our framework can effectively prune the candidates of computational notebooks that should not be in the top-k results. Furthermore, we develop optimization techniques such as caching and indexing to accelerate the search. Experiments using Kaggle notebooks show that our method, in particular graph-based similarity, can achieve high accuracy and high efficiency. |
|||||
2022 | Cross-scale Context Extracted Hashing For Fine-grained Image Binary Encoding | Xue Xuetong, Shi Jiaying, He Xinxue, Xu Shenghui, Pan Zhaoming | Arxiv | Deep hashing has been widely applied to large-scale image retrieval tasks owing to efficient computation and low storage cost by encoding high-dimensional image data into binary codes. Since binary codes do not contain as much information as float features, the essence of binary encoding is preserving the main context to guarantee retrieval quality. However, the existing hashing methods have great limitations on suppressing redundant background information and accurately encoding from Euclidean space to Hamming space by a simple sign function. In order to solve these problems, a Cross-Scale Context Extracted Hashing Network (CSCE-Net) is proposed in this paper. Firstly, we design a two-branch framework to capture fine-grained local information while maintaining high-level global semantic information. Besides, Attention guided Information Extraction module (AIE) is introduced between two branches, which suppresses areas of low context information cooperated with global sliding windows. Unlike previous methods, our CSCE-Net learns a content-related Dynamic Sign Function (DSF) to replace the original simple sign function. Therefore, the proposed CSCE-Net is context-sensitive and able to perform well on accurate image binary encoding. We further demonstrate that our CSCE-Net is superior to the existing hashing methods, which improves retrieval performance on standard benchmarks. |
|||||
2022 | Progressively Optimized Bi-granular Document Representation For Scalable Embedding Based Retrieval | Xiao Shitao, Liu Zheng, Han Weihao, Zhang Jianjin, Shao Yingxia, Lian Defu, Li Chaozhuo, Sun Hao, Deng Denvy, Zhang Liangjie, Zhang Qi, Xie Xing | Arxiv | Ad-hoc search calls for the selection of appropriate answers from a massive-scale corpus. Nowadays, embedding-based retrieval (EBR) has become a promising solution, where deep learning based document representation and ANN search techniques are allied to handle this task. However, a major challenge is that the ANN index can be too large to fit into memory, given the considerable size of the answer corpus. In this work, we tackle this problem with Bi-Granular Document Representation, where the lightweight sparse embeddings are indexed and standby in memory for coarse-grained candidate search, and the heavyweight dense embeddings are hosted on disk for fine-grained post verification. For the best retrieval accuracy, a Progressive Optimization framework is designed. The sparse embeddings are learned ahead for high-quality search of candidates. Conditioned on the candidate distribution induced by the sparse embeddings, the dense embeddings are continuously learned to optimize the discrimination of ground-truth from the shortlisted candidates. Besides, two techniques, contrastive quantization and locality-centric sampling, are introduced for the learning of sparse and dense embeddings, which substantially contribute to their performances. Thanks to the above features, our method effectively handles massive-scale EBR with strong advantages in accuracy: with up to +4.3% recall gain on million-scale corpus, and up to +17.5% recall gain on billion-scale corpus. Besides, our method is applied to a major sponsored search platform with substantial gains on revenue (+1.95%), Recall (+1.01%) and CTR (+0.49%). Our code is available at https://github.com/microsoft/BiDR. |
|||||
2022 | Hyperdimensional Hashing A Robust And Efficient Dynamic Hash Table | Heddes Mike, Nunes Igor, Givargis Tony, Nicolau Alexandru, Veidenbaum Alex | Arxiv | Most cloud services and distributed applications rely on hashing algorithms that allow dynamic scaling of a robust and efficient hash table. Examples include AWS, Google Cloud and BitTorrent. Consistent and rendezvous hashing are algorithms that minimize key remapping as the hash table resizes. While memory errors in large-scale cloud deployments are common, neither algorithm offers both efficiency and robustness. Hyperdimensional Computing is an emerging computational model that has inherent efficiency, robustness and is well suited for vector or hardware acceleration. We propose Hyperdimensional (HD) hashing and show that it has the efficiency to be deployed in large systems. Moreover, a realistic level of memory errors causes more than 20% mismatches for consistent hashing while HD hashing remains unaffected. |
|||||
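The entry above compares against consistent hashing, whose defining property is that resizing only remaps the keys on the affected arc of the ring. For readers unfamiliar with that baseline, a minimal ring implementation is sketched below; it is the classical method being compared against, not the proposed HD hashing scheme.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hashing ring: nodes are placed at several points on
    a hash ring, and a key is served by the first node clockwise from it."""

    def __init__(self, nodes=(), replicas=64):
        self.replicas = replicas
        self.ring = []          # sorted list of (position, node)
        for n in nodes:
            self.add(n)

    def _pos(self, item):
        return int(hashlib.sha1(item.encode()).hexdigest(), 16)

    def add(self, node):
        for r in range(self.replicas):
            bisect.insort(self.ring, (self._pos(f"{node}#{r}"), node))

    def lookup(self, key):
        i = bisect.bisect(self.ring, (self._pos(key), "")) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
before = {k: ring.lookup(k) for k in map(str, range(1000))}
ring.add("server-d")                 # resize: only some keys should remap
moved = sum(before[k] != ring.lookup(k) for k in before)
print(f"{moved / len(before):.0%} of keys remapped")
```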
2022 | Hashformers Towards Vocabulary-independent Pre-trained Transformers | Xue Huiyin, Aletras Nikolaos | Arxiv | Transformer-based pre-trained language models are vocabulary-dependent, mapping by default each token to its corresponding embedding. This one-to-one mapping results in embedding matrices that occupy a lot of memory (i.e. millions of parameters) and grow linearly with the size of the vocabulary. Previous work on on-device transformers dynamically generates token embeddings on-the-fly without embedding matrices using locality-sensitive hashing over morphological information. These embeddings are subsequently fed into transformer layers for text classification. However, these methods are not pre-trained. Inspired by this line of work, we propose HashFormers, a new family of vocabulary-independent pre-trained transformers that support an unlimited vocabulary (i.e. all possible tokens in a corpus) given a substantially smaller fixed-sized embedding matrix. We achieve this by first introducing computationally cheap hashing functions that bucket together individual tokens to embeddings. We also propose three variants that do not require an embedding matrix at all, further reducing the memory requirements. We empirically demonstrate that HashFormers are more memory efficient compared to standard pre-trained transformers while achieving comparable predictive performance when fine-tuned on multiple text classification tasks. For example, our most efficient HashFormer variant has a negligible performance degradation (0.4\% on GLUE) using only 99.1K parameters for representing the embeddings compared to 12.3-38M parameters of state-of-the-art models. |
|||||
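The HashFormers abstract above buckets tokens into a small fixed embedding matrix with cheap hash functions. The sketch below shows one simple variant of that hashing-trick idea (sum the rows selected by a few hash functions); the bucket count, number of hashes, and aggregation are our assumptions, not the paper's exact design.

```python
import numpy as np

class HashedEmbedding:
    """Vocabulary-independent token embeddings via the hashing trick: each
    token is bucketed by several cheap hash functions into a small fixed
    matrix and the selected rows are summed, so no vocabulary-sized
    embedding matrix is ever stored."""

    def __init__(self, num_buckets=50_000, dim=128, num_hashes=2, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(scale=0.02, size=(num_buckets, dim))
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes

    def __call__(self, token):
        rows = [hash((i, token)) % self.num_buckets for i in range(self.num_hashes)]
        return self.table[rows].sum(axis=0)

emb = HashedEmbedding()
print(emb("hashing").shape)                          # (128,), for any vocabulary
print(np.allclose(emb("hashing"), emb("hashing")))   # deterministic within a run
```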
2022 | Accelerating Code Search With Deep Hashing And Code Classification | Gu Wenchao, Wang Yanlin, Du Lun, Zhang Hongyu, Han Shi, Zhang Dongmei, Lyu Michael R. | Arxiv | Code search retrieves reusable code snippets from a source code corpus based on natural language queries. Deep learning-based methods of code search have shown promising results. However, previous methods focus on retrieval accuracy but pay little attention to the efficiency of the retrieval process. We propose a novel method, CoSHC, to accelerate code search with deep hashing and code classification, aiming to perform an efficient code search without sacrificing too much accuracy. To evaluate the effectiveness of CoSHC, we apply our method to five code search models. Extensive experimental results indicate that compared with previous code search baselines, CoSHC can save more than 90% of retrieval time while preserving at least 99% of retrieval accuracy. |
|||||
2022 | Vit2hash Unsupervised Information-preserving Hashing | Gong Qinkang, Wang Liangdao, Lai Hanjiang, Pan Yan, Yin Jian | Arxiv | Unsupervised image hashing, which maps images into binary codes without supervision, is a compressor with a high compression rate. Hence, how to preserve meaningful information from the original data is a critical problem. Inspired by the large-scale vision pre-training model, known as ViT, which has shown significant progress for learning visual representations, in this paper, we propose a simple information-preserving compressor to finetune the ViT model for the target unsupervised hashing task. Specifically, from pixels to continuous features, we first propose a feature-preserving module, using the corrupted image as input to reconstruct the original feature from the pre-trained ViT model and the complete image, so that the feature extractor can focus on preserving the meaningful information of the original data. Secondly, from continuous features to hash codes, we propose a hashing-preserving module, which aims to keep the semantic information from the pre-trained ViT model by using the proposed Kullback-Leibler divergence loss. Besides, the quantization loss and the similarity loss are added to minimize the quantization error. Our method is very simple and achieves significantly higher mAP on three benchmark image datasets. |
|||||
2022 | Supervised Deep Hashing For High-dimensional And Heterogeneous Case-based Reasoning | Zhang Qi, Hu Liang, Shi Chongyang, Liu Ke, Cao Longbing | Arxiv | Case-based Reasoning (CBR) on high-dimensional and heterogeneous data is a trending yet challenging and computationally expensive task in the real world. A promising approach is to obtain low-dimensional hash codes representing cases and perform a similarity retrieval of cases in Hamming space. However, previous methods based on data-independent hashing rely on random projections or manual construction, and are thus unable to address specific data issues (e.g., high-dimensionality and heterogeneity) due to their insensitivity to data characteristics. To address these issues, this work introduces a novel deep hashing network to learn similarity-preserving compact hash codes for efficient case retrieval and proposes a deep-hashing-enabled CBR model, HeCBR. Specifically, we introduce position embedding to represent heterogeneous features and utilize a multilinear interaction layer to obtain case embeddings, which effectively filters out zero-valued features to tackle high-dimensionality and sparsity and captures inter-feature couplings. Then, we feed the case embeddings into fully-connected layers, and subsequently a hash layer generates hash codes with a quantization regularizer to control the quantization loss during relaxation. To cater to incremental learning of CBR, we further propose an adaptive learning strategy to update the hash function. Extensive experiments on public datasets show that HeCBR greatly reduces storage and significantly accelerates case retrieval. HeCBR achieves desirable performance compared with the state-of-the-art CBR methods and performs significantly better than hashing-based CBR methods in classification. |
|||||
2022 | Learning To Collide Recommendation System Model Compression With Learned Hash Functions | Ghaemmaghami Benjamin, Ozdal Mustafa, Komuravelli Rakesh, Korchev Dmitriy, Mudigere Dheevatsa, Nair Krishnakumar, Naumov Maxim | Arxiv | A key characteristic of deep recommendation models is the immense memory requirements of their embedding tables. These embedding tables can often reach hundreds of gigabytes which increases hardware requirements and training cost. A common technique to reduce model size is to hash all of the categorical variable identifiers (ids) into a smaller space. This hashing reduces the number of unique representations that must be stored in the embedding table; thus decreasing its size. However, this approach introduces collisions between semantically dissimilar ids that degrade model quality. We introduce an alternative approach, Learned Hash Functions, which instead learns a new mapping function that encourages collisions between semantically similar ids. We derive this learned mapping from historical data and embedding access patterns. We experiment with this technique on a production model and find that a mapping informed by the combination of access frequency and a learned low dimension embedding is the most effective. We demonstrate a small improvement relative to the hashing trick and other collision related compression techniques. This is ongoing work that explores the impact of categorical id collisions on recommendation model quality and how those collisions may be controlled to improve model performance. |
|||||
2022 | Free Resolutions And Generalized Hamming Weights Of Binary Linear Codes | García-marco Ignacio, Márquez-corbella Irene, Martínez-moro Edgar, Pitones Yuriko | Arxiv | In this work, we explore the relationship between free resolution of some monomial ideals and Generalized Hamming Weights (GHWs) of binary codes. More precisely, we look for a structure smaller than the set of codewords of minimal support that provides us some information about the GHWs. We prove that the first and second generalized Hamming weight of a binary linear code can be computed (by means of a graded free resolution) from a set of monomials associated to a binomial ideal related with the code. Moreover, the remaining weights are bounded by the Betti numbers for that set. |
|||||
2022 | Long-tail Cross Modal Hashing | Gao Zijun, Wang Jun, Yu Guoxian, Yan Zhongmin, Domeniconi Carlotta, Zhang Jinglin | Arxiv | Existing Cross Modal Hashing (CMH) methods are mainly designed for balanced data, while imbalanced data with long-tail distribution is more common in the real world. Several long-tail hashing methods have been proposed but they cannot adapt to multi-modal data, due to the complex interplay between labels and individuality and commonality information of multi-modal data. Furthermore, CMH methods mostly mine the commonality of multi-modal data to learn hash codes, which may override tail labels encoded by the individuality of respective modalities. In this paper, we propose LtCMH (Long-tail CMH) to handle imbalanced multi-modal data. LtCMH firstly adopts auto-encoders to mine the individuality and commonality of different modalities by minimizing the dependency between the individuality of respective modalities and by enhancing the commonality of these modalities. Then it dynamically combines the individuality and commonality with direct features extracted from respective modalities to create meta features that enrich the representation of tail labels, and binarizes the meta features to generate hash codes. LtCMH significantly outperforms state-of-the-art baselines on long-tail datasets and holds a better (or comparable) performance on datasets with balanced labels. |
|||||
2022 | Streaming Encoding Algorithms For Scalable Hyperdimensional Computing | Thomas Anthony, Khaleghi Behnam, Jha Gopi Krishna, Dasgupta Sanjoy, Himayat Nageen, Iyer Ravi, Jain Nilesh, Rosing Tajana | Arxiv | Hyperdimensional computing (HDC) is a paradigm for data representation and learning originating in computational neuroscience. HDC represents data as high-dimensional, low-precision vectors which can be used for a variety of information processing tasks like learning or recall. The mapping to high-dimensional space is a fundamental problem in HDC, and existing methods encounter scalability issues when the input data itself is high-dimensional. In this work, we explore a family of streaming encoding techniques based on hashing. We show formally that these methods enjoy comparable guarantees on performance for learning applications while being substantially more efficient than existing alternatives. We validate these results experimentally on a popular high-dimensional classification problem and show that our approach easily scales to very large data sets. |
|||||
2022 | Active Image Indexing | Fernandez Pierre, Douze Matthijs, Jégou Hervé, Furon Teddy | Arxiv | Image copy detection and retrieval from large databases leverage two components. First, a neural network maps an image to a vector representation that is relatively robust to various transformations of the image. Second, an efficient but approximate similarity search algorithm trades scalability (size and speed) against quality of the search, thereby introducing a source of error. This paper improves the robustness of image copy detection with active indexing, which optimizes the interplay of these two components. We reduce the quantization loss of a given image representation by making imperceptible changes to the image before its release. The loss is back-propagated through the deep neural network back to the image, under perceptual constraints. These modifications make the image more retrievable. Our experiments show that the retrieval and copy detection of activated images is significantly improved. For instance, activation improves Recall@1 by \(+40\%\) under various image transformations, and for several popular indexing structures based on product quantization and locality-sensitive hashing. |
|||||
2022 | Embedding Compression With Hashing For Efficient Representation Learning In Large-scale Graph | Yeh Chin-chia Michael, Gu Mengting, Zheng Yan, Chen Huiyuan, Ebrahimi Javid, Zhuang Zhongfang, Wang Junpeng, Wang Liang, Zhang Wei | Arxiv | Graph neural networks (GNNs) are deep learning models designed specifically for graph data, and they typically rely on node features as the input to the first layer. When applying such a type of network on the graph without node features, one can extract simple graph-based node features (e.g., number of degrees) or learn the input node representations (i.e., embeddings) when training the network. While the latter approach, which trains node embeddings, more likely leads to better performance, the number of parameters associated with the embeddings grows linearly with the number of nodes. It is therefore impractical to train the input node embeddings together with GNNs within graphics processing unit (GPU) memory in an end-to-end fashion when dealing with industrial-scale graph data. Inspired by the embedding compression methods developed for natural language processing (NLP) tasks, we develop a node embedding compression method where each node is compactly represented with a bit vector instead of a floating-point vector. The parameters utilized in the compression method can be trained together with GNNs. We show that the proposed node embedding compression method achieves superior performance compared to the alternatives. |
|||||
2022 | Rapid Person Re-identification Via Sub-space Consistency Regularization | Yin Qingze, Wang Guanan, Ding Guodong, Li Qilei, Gong Shaogang, Tang Zhenmin | Arxiv | Person Re-Identification (ReID) matches pedestrians across disjoint cameras. Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation as well as complex quick-sort algorithms. Recently, some works have proposed to yield binary encoded person descriptors, which instead only require fast Hamming distance computation and simple counting-sort algorithms. However, the performance of such binary encoded descriptors, especially with short codes (e.g., 32 and 64 bits), is hardly satisfactory given the sparse binary space. To strike a balance between model accuracy and efficiency, we propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by \(0.25\) times compared with real-value features under the same dimensions whilst maintaining competitive accuracy, especially under short codes. SCR transforms a real-value feature vector (e.g., 2048 float32 values) into short binary codes (e.g., 64 bits) by first dividing the real-value feature vector into \(M\) sub-spaces, each with \(C\) clustered centroids. Thus the distance between two samples can be expressed as the summation of the respective distances to the centroids, which can be sped up by offline calculation and maintained via a look-up table. On the other hand, these real-value centroids help to achieve significantly higher accuracy than using binary codes. Lastly, we convert the distance look-up table to integers and apply the counting-sort algorithm to speed up the ranking stage. We also propose a novel consistency regularization with an iterative framework. Experimental results on Market-1501 and DukeMTMC-reID are promising: under short codes, our proposed SCR enjoys real-value-level accuracy and hashing-level speed. |
|||||
2022 | Knn-embed Locally Smoothed Embedding Mixtures For Multi-interest Candidate Retrieval | El-kishky Ahmed, Markovich Thomas, Leung Kenny, Portman Frank, Haghighi Aria, Xiao Ying | Arxiv | Candidate retrieval is the first stage in recommendation systems, where a light-weight system is used to retrieve potentially relevant items for an input user. These candidate items are then ranked and pruned in later stages of recommender systems using a more complex ranking model. As the top of the recommendation funnel, it is important to retrieve a high-recall candidate set to feed into downstream ranking models. A common approach is to leverage approximate nearest neighbor (ANN) search from a single dense query embedding; however, this approach can yield a low-diversity result set with many near duplicates. As users often have multiple interests, candidate retrieval should ideally return a diverse set of candidates reflective of the user’s multiple interests. To this end, we introduce kNN-Embed, a general approach to improving diversity in dense ANN-based retrieval. kNN-Embed represents each user as a smoothed mixture over learned item clusters that represent distinct “interests” of the user. By querying each of a user’s mixture components in proportion to their mixture weights, we retrieve a high-diversity set of candidates reflecting elements from each of the user’s interests. We experimentally compare kNN-Embed to standard ANN candidate retrieval, and show significant improvements in overall recall and improved diversity across three datasets. Accompanying this work, we open-source a large Twitter follow-graph dataset (https://huggingface.co/datasets/Twitter/TwitterFollowGraph), to spur further research in graph-mining and representation learning for recommender systems. |
|||||
2022 | Wavelet Feature Maps Compression For Image-to-image Cnns | Shahaf E. Finder, Yair Zohav, Maor Ashkenazi, Eran Treister | Neural Information Processing Systems | Convolutional Neural Networks (CNNs) are known for requiring extensive computational resources, and quantization is among the best and most common methods for compressing them. While aggressive quantization (i.e., less than 4-bits) performs well for classification, it may cause severe performance degradation in image-to-image tasks such as semantic segmentation and depth estimation. In this paper, we propose Wavelet Compressed Convolution (WCC)—a novel approach for high-resolution activation maps compression integrated with point-wise convolutions, which are the main computational cost of modern architectures. To this end, we use an efficient and hardware-friendly Haar-wavelet transform, known for its effectiveness in image compression, and define the convolution on the compressed activation map. We experiment with various tasks that benefit from high-resolution input. By combining WCC with light quantization, we achieve compression rates equivalent to 1-4bit activation quantization with relatively small and much more graceful degradation in performance. Our code is available at https://github.com/BGUCompSci/WaveletCompressedConvolution. |
|||||
2022 | One Loss For Quantization Deep Hashing With Discrete Wasserstein Distributional Matching | Doan Khoa D., Yang Peng, Li Ping | Arxiv | Image hashing is a principled approximate nearest neighbor approach to find similar items to a query in a large collection of images. Hashing aims to learn a binary-output function that maps an image to a binary vector. For optimal retrieval performance, producing balanced hash codes with low-quantization error to bridge the gap between the learning stage’s continuous relaxation and the inference stage’s discrete quantization is important. However, in the existing deep supervised hashing methods, coding balance and low-quantization error are difficult to achieve and involve several losses. We argue that this is because the existing quantization approaches in these methods are heuristically constructed and not effective to achieve these objectives. This paper considers an alternative approach to learning the quantization constraints. The task of learning balanced codes with low quantization error is re-formulated as matching the learned distribution of the continuous codes to a pre-defined discrete, uniform distribution. This is equivalent to minimizing the distance between two distributions. We then propose a computationally efficient distributional distance by leveraging the discrete property of the hash functions. This distributional distance is a valid distance and enjoys lower time and sample complexities. The proposed single-loss quantization objective can be integrated into any existing supervised hashing method to improve code balance and quantization error. Experiments confirm that the proposed approach substantially improves the performance of several representative hashing methods. |
|||||
2022 | Asymmetric Hashing For Fast Ranking Via Neural Network Measures | Doan Khoa, Tan Shulong, Zhao Weijie, Li Ping | Arxiv | Fast item ranking is an important task in recommender systems. In previous works, graph-based Approximate Nearest Neighbor (ANN) approaches have demonstrated good performance on item ranking tasks with generic searching/matching measures (including complex measures such as neural network measures). However, since these ANN approaches must go through the neural measures several times during ranking, the computation is not practical if the neural measure is a large network. On the other hand, fast item ranking using existing hashing-based approaches, such as Locality Sensitive Hashing (LSH), only works with a limited set of measures. Previous learning-to-hash approaches are also not suitable to solve the fast item ranking problem since they can take a significant amount of time and computation to train the hash functions. Hashing approaches, however, are attractive because they provide a principled and efficient way to retrieve candidate items. In this paper, we propose a simple and effective learning-to-hash approach for the fast item ranking problem that can be used for any type of measure, including neural network measures. Specifically, we solve this problem with an asymmetric hashing framework based on discrete inner product fitting. We learn a pair of related hash functions that map heterogeneous objects (e.g., users and items) into a common discrete space where the inner product of their binary codes reveals their true similarity defined via the original searching measure. The fast ranking problem is reduced to an ANN search via this asymmetric hashing scheme. Then, we propose a sampling strategy to efficiently select relevant and contrastive samples to train the hashing model. We empirically validate the proposed method against the existing state-of-the-art fast item ranking methods in several combinations of non-linear searching functions and prominent datasets. |
|||||
2022 | Coophash Cooperative Learning Of Multipurpose Descriptor And Contrastive Pair Generator Via Variational MCMC Teaching For Supervised Image Hashing | Doan Khoa D., Xie Jianwen, Zhu Yaxuan, Zhao Yang, Li Ping | Arxiv | Leveraging supervised information can lead to superior retrieval performance in the image hashing domain, but the performance degrades significantly without enough labeled data. One effective solution to boost performance is to employ generative models, such as Generative Adversarial Networks (GANs), to generate synthetic data in an image hashing model. However, GAN-based methods are difficult to train, which prevents the hashing approaches from jointly training the generative models and the hash functions. This limitation results in sub-optimal retrieval performance. To overcome this limitation, we propose a novel framework, the generative cooperative hashing network, which is based on energy-based cooperative learning. This framework jointly learns a powerful generative representation of the data and a robust hash function via two components: a top-down contrastive pair generator that synthesizes contrastive images and a bottom-up multipurpose descriptor that simultaneously represents the images from multiple perspectives, including probability density, hash code, latent code, and category. The two components are jointly learned via a novel likelihood-based cooperative learning scheme. We conduct experiments on several real-world datasets and show that the proposed method outperforms competing supervised hashing methods, achieving up to 10\% relative improvement over the current state-of-the-art supervised hashing methods, and exhibits significantly better performance in out-of-distribution retrieval. |
|||||
2022 | A Generalized Approach For Cancellable Template And Its Realization For Minutia Cylinder-code | Dong Xingbo, Jin Zhe, Wong Koksheik | Arxiv | Hashing technology has lately gained much attention for protecting biometric templates. For instance, Index-of-Max (IoM), a recently reported hashing technique, is a ranking-based locality-sensitive hashing technique that demonstrates the feasibility of protecting ordered, fixed-length biometric templates. However, biometric templates are not always ordered and of fixed length; they may instead be unordered, variable-size point sets, e.g., fingerprint minutiae, which restricts the usage of traditional hashing technology. In this paper, we propose a generalized version of IoM hashing, namely gIoM, so that unordered and variable-size biometric templates can be used. We demonstrate a realization using a well-known variable-size feature representation, the fingerprint Minutia Cylinder-Code (MCC). gIoM transforms MCC into the index domain to form an indexing-based feature representation. Consequently, inverting MCC from the transformed representation is computationally infeasible, thus achieving non-invertibility while preserving performance. The public fingerprint databases FVC2002 and FVC2004 are employed as benchmarks for a fair comparison with other methods. Moreover, security and privacy analyses suggest that gIoM meets the criteria of template protection: non-invertibility, revocability, and non-linkability. |
|||||
2022 | Pattern Spotting And Image Retrieval In Historical Documents Using Deep Hashing | Dias Caio Da S., Britto Alceu De S. Jr., Barddal Jean P., Heutte Laurent, Koerich Alessandro L. | Arxiv | This paper presents a deep learning approach for image retrieval and pattern spotting in digital collections of historical documents. First, a region proposal algorithm detects object candidates in the document page images. Next, deep learning models are used for feature extraction, considering two distinct variants, which provide either real-valued or binary code representations. Finally, candidate images are ranked by computing the feature similarity with a given input query. A robust experimental protocol evaluates the proposed approach considering each representation scheme (real-valued and binary code) on the DocExplore image database. The experimental results show that the proposed deep models compare favorably to the state-of-the-art image retrieval approaches for images of historical documents, outperforming other deep models by 2.56 percentage points using the same techniques for pattern spotting. Besides, the proposed approach also reduces the search time by up to 200x and the storage cost up to 6,000x when compared to related works based on real-valued representations. |
|||||
2022 | Linear Hashing With \(\ell_\infty\) Guarantees And Two-sided Kakeya Bounds | Dhar Manik, Dvir Zeev | TheoretiCS Volume | We show that a randomly chosen linear map over a finite field gives a good hash function in the \(\ell_\infty\) sense. More concretely, consider a set \(S \subset \mathbb{F}_q^n\) and a randomly chosen linear map \(L : \mathbb{F}_q^n \to \mathbb{F}_q^t\) with \(q^t\) taken to be sufficiently smaller than \(|S|\). Let \(U_S\) denote a random variable distributed uniformly on \(S\). Our main theorem shows that, with high probability over the choice of \(L\), the random variable \(L(U_S)\) is close to uniform in the \(\ell_\infty\) norm. In other words, every element in the range \(\mathbb{F}_q^t\) has about the same number of elements in \(S\) mapped to it. This complements the widely-used Leftover Hash Lemma (LHL) which proves the analogous statement under the statistical, or \(\ell_1\), distance (for a richer class of functions) as well as prior work on the expected largest ‘bucket size’ in linear hash functions [ADMPT99]. By known bounds from the load balancing literature [RS98], our results are tight and show that linear functions hash as well as truly random functions up to a constant factor in the entropy loss. Our proof leverages a connection between linear hashing and the finite field Kakeya problem and extends some of the tools developed in this area, in particular the polynomial method. |
|||||
2022 | Weighted Contrastive Hashing | Yu Jiaguo, Qiu Huming, Chen Dubing, Zhang Haofeng | Arxiv | The development of unsupervised hashing is advanced by the recent popular contrastive learning paradigm. However, previous contrastive learning-based works have been hampered by (1) insufficient data similarity mining based on global-only image representations, and (2) the hash code semantic loss caused by data augmentation. In this paper, we propose a novel method, namely Weighted Contrastive Hashing (WCH), to take a step towards solving these two problems. We introduce a novel mutual attention module to alleviate the problem of information asymmetry in network features caused by the missing image structure during contrastive augmentation. Furthermore, we explore the fine-grained semantic relations between images, i.e., we divide the images into multiple patches and calculate similarities between patches. The aggregated weighted similarities, which reflect the deep image relations, are distilled to facilitate hash code learning with a distillation loss, so as to obtain better retrieval performance. Extensive experiments show that the proposed WCH significantly outperforms existing unsupervised hashing methods on three benchmark datasets. |
|||||
2022 | Learning To Hash Naturally Sorts | Yu Jiaguo, Shen Yuming, Wang Menghan, Zhang Haofeng, Torr Philip H. S. | Arxiv | Learning to hash pictures a list-wise sorting problem. Its testing metrics, e.g., mean-average precision, count on a sorted candidate list ordered by pair-wise code similarity. However, scarcely does one train a deep hashing model with the sorted results end-to-end because of the non-differentiable nature of the sorting operation. This inconsistency in the objectives of training and test may lead to sub-optimal performance since the training loss often fails to reflect the actual retrieval metric. In this paper, we tackle this problem by introducing Naturally-Sorted Hashing (NSH). We sort the Hamming distances of samples’ hash codes and accordingly gather their latent representations for self-supervised training. Thanks to the recent advances in differentiable sorting approximations, the hash head receives gradients from the sorter so that the hash encoder can be optimized along with the training procedure. Additionally, we describe a novel Sorted Noise-Contrastive Estimation (SortedNCE) loss that selectively picks positive and negative samples for contrastive learning, which allows NSH to mine data semantic relations during training in an unsupervised manner. Our extensive experiments show the proposed NSH model significantly outperforms the existing unsupervised hashing methods on three benchmarked datasets. |
|||||
2022 | Post-quantum Hash Functions Using \(\mathrm{SL}_n(\mathbb{F}_p)\) | Coz Corentin Le, Battarbee Christopher, Flores Ramón, Koberda Thomas, Kahrobaei Delaram | Arxiv | We define new families of Tillich-Zémor hash functions, using higher dimensional special linear groups over finite fields as platforms. The Cayley graphs of these groups combine fast mixing properties and high girth, which together give rise to good preimage and collision resistance of the corresponding hash functions. We justify the claim that the resulting hash functions are post-quantum secure. |
|||||
2022 | Fedhap Federated Hashing With Global Prototypes For Cross-silo Retrieval | Yang Meilin, Xu Jian, Liu Yang, Ding Wenbo | Arxiv | Deep hashing has been widely applied in large-scale data retrieval due to its superior retrieval efficiency and low storage cost. However, data are often scattered in data silos with privacy concerns, so performing centralized data storage and retrieval is not always possible. Leveraging the concept of federated learning (FL) to perform deep hashing is a recent research trend. However, existing frameworks mostly rely on the aggregation of the local deep hashing models, which are trained by performing similarity learning with local skewed data only. Therefore, they cannot work well for non-IID clients in a real federated environment. To overcome these challenges, we propose a novel federated hashing framework that enables participating clients to jointly train the shared deep hashing model by leveraging the prototypical hash codes for each class. Globally, the transmission of global prototypes with only one prototypical hash code per class will minimize the impact of communication cost and privacy risk. Locally, the use of global prototypes is maximized by jointly training a discriminator network and the local hashing network. Extensive experiments on benchmark datasets are conducted to demonstrate that our method can significantly improve the performance of the deep hashing model in federated environments with non-IID data distributions. |
|||||
2022 | Clustering The Sketch A Novel Approach To Embedding Table Compression | Tsang Henry Ling-hei, Ahle Thomas Dybdahl | Arxiv | Embedding tables are used by machine learning systems to work with categorical features. In modern Recommendation Systems, these tables can be very large, necessitating the development of new methods for fitting them in memory, even during training. We suggest Clustered Compositional Embeddings (CCE), which combines clustering-based compression, like quantization to codebooks, with dynamic methods, like the Hashing Trick and Compositional Embeddings (Shi et al., 2020). Experimentally, CCE achieves the best of both worlds: the high compression rate of codebook-based quantization while remaining dynamic like hashing-based methods, so it can be used during training. Theoretically, we prove that CCE is guaranteed to converge to the optimal codebook and give a tight bound for the number of iterations required. |
|||||
2022 | Unitail Detecting Reading And Matching In Retail Scene | Chen Fangyi, Zhang Han, Li Zaiwang, Dou Jiachen, Mo Shentong, Chen Hao, Zhang Yongxin, Ahmed Uzair, Zhu Chenchen, Savvides Marios | Arxiv | To make full use of computer vision technology in stores, it is necessary to consider the actual needs that fit the characteristics of the retail scene. Pursuing this goal, we introduce the United Retail Datasets (Unitail), a large-scale benchmark of basic visual tasks on products that challenges algorithms for detecting, reading, and matching. With 1.8M quadrilateral-shaped instances annotated, the Unitail offers a detection dataset to better align product appearance. Furthermore, it provides a gallery-style OCR dataset containing 1454 product categories, 30k text regions, and 21k transcriptions to enable robust reading on products and motivate enhanced product matching. Besides benchmarking the datasets using various state-of-the-art methods, we customize a new detector for product detection and provide a simple OCR-based matching solution that verifies its effectiveness. |
|||||
2022 | FINGER Fast Inference For Graph-based Approximate Nearest Neighbor Search | Chen Patrick H., Wei-cheng Chang, Hsiang-fu Yu, Dhillon Inderjit S., Cho-jui Hsieh | Arxiv | Approximate K-Nearest Neighbor Search (AKNNS) has now become ubiquitous in modern applications, for example, as a fast search procedure with two tower deep learning models. Graph-based methods for AKNNS in particular have received great attention due to their superior performance. These methods rely on greedy graph search to traverse the data points as embedding vectors in a database. Under this greedy search scheme, we make a key observation: many distance computations do not influence search updates so these computations can be approximated without hurting performance. As a result, we propose FINGER, a fast inference method to achieve efficient graph search. FINGER approximates the distance function by estimating angles between neighboring residual vectors with low-rank bases and distribution matching. The approximated distance can be used to bypass unnecessary computations, which leads to faster searches. Empirically, accelerating a popular graph-based method named HNSW by FINGER is shown to outperform existing graph-based methods by 20%-60% across different benchmark datasets. |
|||||
2022 | Approximate Nearest Neighbor Search Under Neural Similarity Metric For Large-scale Recommendation | Chen Rihan, Liu Bin, Zhu Han, Wang Yaoxuan, Li Qi, Ma Buting, Hua Qingbo, Jiang Jun, Xu Yunlong, Deng Hongbo, Zheng Bo | Arxiv | Model-based methods for recommender systems have been studied extensively for years. Modern recommender systems usually resort to 1) representation learning models which define user-item preference as the distance between their embedding representations, and 2) embedding-based Approximate Nearest Neighbor (ANN) search to tackle the efficiency problem introduced by large-scale corpus. While providing efficient retrieval, the embedding-based retrieval pattern also limits the model capacity since the form of user-item preference measure is restricted to the distance between their embedding representations. However, for other more precise user-item preference measures, e.g., preference scores directly derived from a deep neural network, they are computationally intractable because of the lack of an efficient retrieval method, and an exhaustive search for all user-item pairs is impractical. In this paper, we propose a novel method to extend ANN search to arbitrary matching functions, e.g., a deep neural network. Our main idea is to perform a greedy walk with a matching function in a similarity graph constructed from all items. To solve the problem that the similarity measures of graph construction and user-item matching function are heterogeneous, we propose a pluggable adversarial training task to ensure the graph search with arbitrary matching function can achieve fairly high precision. Experimental results in both open source and industry datasets demonstrate the effectiveness of our method. The proposed method has been fully deployed in the Taobao display advertising platform and brings a considerable advertising revenue increase. We also summarize our detailed experiences in deployment in this paper. |
|||||
2022 | Learning Binarized Graph Representations With Multi-faceted Quantization Reinforcement For Top-k Recommendation | Chen Yankai, Guo Huifeng, Zhang Yingxue, Ma Chen, Tang Ruiming, Li Jingjie, King Irwin | Arxiv | Learning vectorized embeddings is at the core of various recommender systems for user-item matching. To perform efficient online inference, representation quantization, aiming to embed the latent features by a compact sequence of discrete numbers, recently shows the promising potentiality in optimizing both memory and computation overheads. However, existing work merely focuses on numerical quantization whilst ignoring the concomitant information loss issue, which, consequently, leads to conspicuous performance degradation. In this paper, we propose a novel quantization framework to learn Binarized Graph Representations for Top-K Recommendation (BiGeaR). BiGeaR introduces multi-faceted quantization reinforcement at the pre-, mid-, and post-stage of binarized representation learning, which substantially retains the representation informativeness against embedding binarization. In addition to saving the memory footprint, BiGeaR further develops solid online inference acceleration with bitwise operations, providing alternative flexibility for the realistic deployment. The empirical results over five large real-world benchmarks show that BiGeaR achieves about 22%~40% performance improvement over the state-of-the-art quantization-based recommender system, and recovers about 95%~102% of the performance capability of the best full-precision counterpart with over 8x time and space reduction. |
|||||
2022 | Locality-sensitive Bucketing Functions For The Edit Distance | Chen Ke, Shao Mingfu | Arxiv | Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing \(k\)-mer-based bucketing methods have been efficient in processing sequencing data with low error rate, but encounter much reduced sensitivity on data with high error rate. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be \((d_1, d_2)\)-sensitive if any two sequences within an edit distance of \(d_1\) are mapped into at least one shared bucket, and any two sequences with distance at least \(d_2\) are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of \((d_1,d_2)\) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results provide theoretical foundations for their practical use in analyzing sequences with high error rate while also providing insights for the hardness of designing ungapped LSH functions. |
|||||
2022 | High Performance Construction Of Recsplit Based Minimal Perfect Hash Functions | Bez Dominik, Kurpicz Florian, Lehmann Hans-peter, Sanders Peter | Arxiv | A minimal perfect hash function (MPHF) bijectively maps a set S of objects to the first |S| integers. It can be used as a building block in databases and data compression. RecSplit [Esposito et al., ALENEX’20] is currently the most space efficient practical minimal perfect hash function. It heavily relies on trying out hash functions in a brute force way. We introduce rotation fitting, a new technique that makes the search more efficient by drastically reducing the number of tried hash functions. Additionally, we greatly improve the construction time of RecSplit by harnessing parallelism on the level of bits, vectors, cores, and GPUs. In combination, the resulting improvements yield speedups up to 239 on an 8-core CPU and up to 5438 using a GPU. The original single-threaded RecSplit implementation needs 1.5 hours to construct an MPHF for 5 Million objects with 1.56 bits per object. On the GPU, we achieve the same space usage in just 5 seconds. Given that the speedups are larger than the increase in energy consumption, our implementation is more energy efficient than the original implementation. |
|||||
2022 | A Learned Index For Exact Similarity Search In Metric Spaces | Tian Yao, Yan Tingyun, Zhao Xi, Huang Kai, Zhou Xiaofang | Arxiv | Indexing is an effective way to support efficient query processing in large databases. Recently the concept of learned index, which replaces or complements traditional index structures with machine learning models, has been actively explored to reduce storage and search costs. However, accurate and efficient similarity query processing in high-dimensional metric spaces remains to be an open challenge. In this paper, we propose a novel indexing approach called LIMS that uses data clustering, pivot-based data transformation techniques and learned indexes to support efficient similarity query processing in metric spaces. In LIMS, the underlying data is partitioned into clusters such that each cluster follows a relatively uniform data distribution. Data redistribution is achieved by utilizing a small number of pivots for each cluster. Similar data are mapped into compact regions and the mapped values are totally ordinal. Machine learning models are developed to approximate the position of each data record on disk. Efficient algorithms are designed for processing range queries and nearest neighbor queries based on LIMS, and for index maintenance with dynamic updates. Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes and state-of-the-art learned indexes. |
|||||
2022 | On The Size Of Maximal Binary Codes With 2 3 And 4 Distances | Barg Alexander, Glazyrin Alexey, Kao Wei-jiun, Lai Ching-yi, Tseng Pin-chieh, Yu Wei-hsuan | Arxiv | We address the maximum size of binary codes and binary constant weight codes with few distances. Previous works established a number of bounds for these quantities as well as the exact values for a range of small code lengths. As our main results, we determine the exact size of maximal binary codes with two distances for all lengths \(n\ge 6\) as well as the exact size of maximal binary constant weight codes with 2,3, and 4 distances for several values of the weight and for all but small lengths. |
|||||
2022 | Tight Bounds For Monotone Minimal Perfect Hashing | Assadi Sepehr, Farach-colton Martin, Kuszmaul William | Arxiv | The monotone minimal perfect hash function (MMPHF) problem is the following indexing problem. Given a set \(S= \{s_1,\ldots,s_n\}\) of \(n\) distinct keys from a universe \(U\) of size \(u\), create a data structure \(DS\) that answers the following query: \(\mathrm{Rank}(q) =\) the rank of \(q\) in \(S\) for all \(q\in S\), and an arbitrary answer otherwise. Solutions to the MMPHF problem are in widespread use in both theory and practice. The best upper bound known for the problem encodes \(DS\) in \(O(n\log\log\log u)\) bits and performs queries in \(O(\log u)\) time. It has been an open problem to either improve the space upper bound or to show that this somewhat odd looking bound is tight. In this paper, we show the latter: specifically that any data structure (deterministic or randomized) for monotone minimal perfect hashing of any collection of \(n\) elements from a universe of size \(u\) requires \(\Omega(n \cdot \log\log\log u)\) expected bits to answer every query correctly. We achieve our lower bound by defining a graph \(\mathbf{G}\) where the nodes are the possible \({u \choose n}\) inputs and where two nodes are adjacent if they cannot share the same \(DS\). The size of \(DS\) is then lower bounded by the log of the chromatic number of \(\mathbf{G}\). Finally, we show that the fractional chromatic number (and hence the chromatic number) of \(\mathbf{G}\) is lower bounded by \(2^{\Omega(n \log\log\log u)}\). |
|||||
2022 | Satellite Image Search In Agoraeo | Aksoy Ahmet Kerem, Dushev Pavel, Zacharatou Eleni Tzirita, Hemsen Holmer, Charfuelan Marcela, Quiané-ruiz Jorge-arnulfo, Demir Begüm, Markl Volker | Arxiv | The growing operational capability of global Earth Observation (EO) creates new opportunities for data-driven approaches to understand and protect our planet. However, the current use of EO archives is very restricted due to the huge archive sizes and the limited exploration capabilities provided by EO platforms. To address this limitation, we have recently proposed MiLaN, a content-based image retrieval approach for fast similarity search in satellite image archives. MiLaN is a deep hashing network based on metric learning that encodes high-dimensional image features into compact binary hash codes. We use these codes as keys in a hash table to enable real-time nearest neighbor search and highly accurate retrieval. In this demonstration, we showcase the efficiency of MiLaN by integrating it with EarthQube, a browser and search engine within AgoraEO. EarthQube supports interactive visual exploration and Query-by-Example over satellite image repositories. Demo visitors will interact with EarthQube playing the role of different users that search images in a large-scale remote sensing archive by their semantic content and apply other filters. |
|||||
2022 | Privacy-preserving Record Linkage Using Local Sensitive Hash And Private Set Intersection | Adir Allon, Aharoni Ehud, Drucker Nir, Kushnir Eyal, Masalha Ramy, Mirkin Michael, Soceanu Omri | Arxiv | The amount of data stored in data repositories increases every year. This makes it challenging to link records between different datasets across companies and even internally, while adhering to privacy regulations. Address or name changes, and even different spelling used for entity data, can prevent companies from using private deduplication or record-linking solutions such as private set intersection (PSI). To this end, we propose a new and efficient privacy-preserving record linkage (PPRL) protocol that combines PSI and local sensitive hash (LSH) functions, and runs in linear time. We explain the privacy guarantees that our protocol provides and demonstrate its practicality by executing the protocol over two datasets with \(2^{20}\) records each, in \(11-45\) minutes, depending on network settings. |
|||||
2022 | Unsupervised Hashing With Semantic Concept Mining | Tu Rong-cheng, Mao Xian-ling, Lin Kevin Qinghong, Cai Chengfei, Qin Weize, Wang Hongfa, Wei Wei, Huang Heyan | Arxiv | Recently, to improve the unsupervised image retrieval performance, plenty of unsupervised hashing methods have been proposed by designing a semantic similarity matrix, which is based on the similarities between image features extracted by a pre-trained CNN model. However, most of these methods tend to ignore high-level abstract semantic concepts contained in images. Intuitively, concepts play an important role in calculating the similarity among images. In real-world scenarios, each image is associated with some concepts, and the similarity between two images will be larger if they share more identical concepts. Inspired by the above intuition, in this work, we propose a novel Unsupervised Hashing with Semantic Concept Mining, called UHSCM, which leverages a VLP model to construct a high-quality similarity matrix. Specifically, a set of randomly chosen concepts is first collected. Then, by employing a vision-language pretraining (VLP) model with the prompt engineering which has shown strong power in visual representation learning, the set of concepts is denoised according to the training images. Next, the proposed method UHSCM applies the VLP model with prompting again to mine the concept distribution of each image and construct a high-quality semantic similarity matrix based on the mined concept distributions. Finally, with the semantic similarity matrix as guiding information, a novel hashing loss with a modified contrastive loss based regularization item is proposed to optimize the hashing network. Extensive experiments on three benchmark datasets show that the proposed method outperforms the state-of-the-art baselines in the image retrieval task. |
|||||
2022 | Hashpim High-throughput SHA-3 Via Memristive Digital Processing-in-memory | Oved Batel, Leitersdorf Orian, Ronen Ronny, Kvatinsky Shahar | Arxiv | Recent research has sought to accelerate cryptographic hash functions as they are at the core of modern cryptography. Traditional designs, however, suffer from the von Neumann bottleneck that originates from the separation of processing and memory units. An emerging solution to overcome this bottleneck is processing-in-memory (PIM): performing logic within the same devices responsible for memory to eliminate data-transfer and simultaneously provide massive computational parallelism. In this paper, we seek to vastly accelerate the state-of-the-art SHA-3 cryptographic function using the memristive memory processing unit (mMPU), a general-purpose memristive PIM architecture. To that end, we propose a novel in-memory algorithm for variable rotation, and utilize an efficient mapping of the SHA-3 state vector for memristive crossbar arrays to efficiently exploit PIM parallelism. We demonstrate a massive energy efficiency of 1,422 Gbps/W, improving a state-of-the-art memristive SHA-3 accelerator (SHINE-2) by 4.6x. |
|||||
2022 | Unconventional Application Of K-means For Distributed Approximate Similarity Search | Ortega Felipe, Algar Maria Jesus, De Diego Isaac Martín, Moguerza Javier M. | Arxiv | Similarity search based on a distance function in metric spaces is a fundamental problem for many applications. Queries for similar objects lead to the well-known machine learning task of nearest-neighbours identification. Many data indexing strategies, collectively known as Metric Access Methods (MAM), have been proposed to speed up queries for similar elements in this context. Moreover, since exact approaches to solve similarity queries can be complex and time-consuming, alternative options have appeared to reduce query execution time, such as returning approximate results or resorting to distributed computing platforms. In this paper, we introduce MASK (Multilevel Approximate Similarity search with \(k\)-means), an unconventional application of the \(k\)-means algorithm as the foundation of a multilevel index structure for approximate similarity search, suitable for metric spaces. We show that inherent properties of \(k\)-means, like representing high-density data areas with fewer prototypes, can be leveraged for this purpose. An implementation of this new indexing method is evaluated, using a synthetic dataset and a real-world dataset in a high-dimensional and high-sparsity space. Results are promising and underpin the applicability of this novel indexing method in multiple domains. |
|||||
2022 | Hybrid Contrastive Quantization For Efficient Cross-view Video Retrieval | Wang Jinpeng, Chen Bin, Liao Dongliang, Zeng Ziyun, Li Gongfu, Xia Shu-tao, Xu Jin | Arxiv | With the recent boom of video-based social platforms (e.g., YouTube and TikTok), video retrieval using sentence queries has become an important demand and attracts increasing research attention. Despite the decent performance, existing text-video retrieval models in vision and language communities are impractical for large-scale Web search because they adopt brute-force search based on high-dimensional embeddings. To improve efficiency, Web search engines widely apply vector compression libraries (e.g., FAISS) to post-process the learned embeddings. Unfortunately, separate compression from feature encoding degrades the robustness of representations and incurs performance decay. To pursue a better balance between performance and efficiency, we propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ). Specifically, HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos and preserve comprehensive semantic information. By performing Asymmetric-Quantized Contrastive Learning (AQ-CL) across views, HCQ aligns texts and videos at coarse-grained and multiple fine-grained levels. This hybrid-grained learning strategy serves as strong supervision on the cross-view video quantization model, where contrastive learning at different levels can be mutually promoted. Extensive experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods while showing high efficiency in storage and computation. Code and configurations are available at https://github.com/gimpong/WWW22-HCQ. |
|||||
2022 | Contrastive Masked Autoencoders For Self-supervised Video Hashing | Wang Yuting, Wang Jinpeng, Chen Bin, Zeng Ziyun, Xia Shutao | Arxiv | Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision, facilitating large-scale video retrieval efficiency and attracting increasing research attention. The success of SSVH lies in the understanding of video content and the ability to capture the semantic relation among unlabeled videos. Typically, state-of-the-art SSVH methods consider these two points in a two-stage training pipeline, where they firstly train an auxiliary network by instance-wise mask-and-predict tasks and secondly train a hashing model to preserve the pseudo-neighborhood structure transferred from the auxiliary network. This consecutive training strategy is inflexible and also unnecessary. In this paper, we propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding in a single stage. To capture video semantic information for better hashing learning, we adopt an encoder-decoder structure to reconstruct the video from its temporal-masked frames. Particularly, we find that a higher masking ratio helps video understanding. Besides, we fully exploit the similarity relationship between videos by maximizing agreement between two augmented views of a video, which contributes to more discriminative and robust hash codes. Extensive experiments on three large-scale video datasets (i.e., FCVID, ActivityNet and YFCC) indicate that ConMH achieves state-of-the-art results. Code is available at https://github.com/huangmozhi9527/ConMH. |
|||||
2022 | Anchor Graph Structure Fusion Hashing For Cross-modal Similarity Search | Wang Lu, Yang Jie, Zareapoor Masoumeh, Zheng Zhonglong | Arxiv | Cross-modal hashing still has several challenges to address: (1) most existing CMH methods take graphs as input to model the data distribution, but omit to consider the correlation of graph structure among multiple modalities; (2) most existing CMH methods ignore the fusion affinity among multi-modal data; (3) most existing CMH methods relax the discrete constraints to solve the optimization objective, significantly degrading retrieval performance. To address these limitations, we propose a novel Anchor Graph Structure Fusion Hashing (AGSFH). AGSFH constructs the anchor graph structure fusion matrix from different anchor graphs of multiple modalities with the Hadamard product, which can fully exploit the geometric property of the underlying data structure. Based on the anchor graph structure fusion matrix, AGSFH attempts to directly learn an intrinsic anchor graph, where the structure of the intrinsic anchor graph is adaptively tuned so that the number of components of the intrinsic graph is exactly equal to the number of clusters. Besides, AGSFH preserves the anchor fusion affinity in the common binary Hamming space. Furthermore, a discrete optimization framework is designed to learn the unified binary codes. Extensive experimental results on three public social datasets demonstrate the superiority of AGSFH. |
|||||
2022 | A Detection Method Of Temporally Operated Videos Using Robust Hashing | Niwa Shoko, Tanaka Miki, Kiya Hitoshi | Arxiv | SNS providers are known to recompress and resize uploaded videos/images, but most conventional methods for detecting tampered videos/images are not robust enough against such operations. In addition, videos can be temporally operated on, for example through the insertion of new frames and the permutation of frames, and such operations are difficult to detect with conventional methods. Accordingly, in this paper, we propose a novel method based on a robust hashing algorithm for detecting temporally operated videos, even when resizing and compression are applied to the videos. |
|||||
2022 | Cgat Center-guided Adversarial Training For Deep Hashing-based Retrieval | Wang Xunguang, Lin Yiqun, Li Xiaomeng | Arxiv | Deep hashing has been extensively utilized in massive image retrieval because of its efficiency and effectiveness. However, deep hashing models are vulnerable to adversarial examples, making it essential to develop adversarial defense methods for image retrieval. Existing solutions achieved limited defense performance because of using weak adversarial samples for training and lacking discriminative optimization objectives to learn robust features. In this paper, we present a min-max based Center-guided Adversarial Training, namely CgAT, to improve the robustness of deep hashing networks through worst adversarial examples. Specifically, we first formulate the center code as a semantically-discriminative representative of the input image content, which preserves the semantic similarity with positive samples and dissimilarity with negative examples. We prove that a mathematical formula can calculate the center code immediately. After obtaining the center codes in each optimization iteration of the deep hashing network, they are adopted to guide the adversarial training process. On the one hand, CgAT generates the worst adversarial examples as augmented data by maximizing the Hamming distance between the hash codes of the adversarial examples and the center codes. On the other hand, CgAT learns to mitigate the effects of adversarial samples by minimizing the Hamming distance to the center codes. Extensive experiments on the benchmark datasets demonstrate the effectiveness of our adversarial training algorithm in defending against adversarial attacks for deep hashing-based retrieval. Compared with the current state-of-the-art defense method, we significantly improve the defense performance by an average of 18.61\%, 12.35\%, and 11.56\% on FLICKR-25K, NUS-WIDE, and MS-COCO, respectively. The code is available at https://github.com/xunguangwang/CgAT. |
|||||
2022 | Inverted Semantic-index For Image Retrieval | Wang Ying | Arxiv | This paper addresses the construction of inverted index for large-scale image retrieval. The inverted index proposed by J. Sivic brings a significant acceleration by reducing distance computations with only a small fraction of the database. The state-of-the-art inverted indices aim to build finer partitions that produce a concise and accurate candidate list. However, partitioning in these frameworks is generally achieved by unsupervised clustering methods which ignore the semantic information of images. In this paper, we replace the clustering method with image classification, during the construction of codebook. We then propose a merging and splitting method to solve the problem that the number of partitions is unchangeable in the inverted semantic-index. Next, we combine our semantic-index with the product quantization (PQ) so as to alleviate the accuracy loss caused by PQ compression. Finally, we evaluate our model on large-scale image retrieval benchmarks. Experiment results demonstrate that our model can significantly improve the retrieval accuracy by generating high-quality candidate lists. |
|||||
2022 | Binary Representation Via Jointly Personalized Sparse Hashing | Wang Xiaoqin, Chen Chen, Lan Rushi, Liu Licheng, Liu Zhenbing, Zhou Huiyu, Luo Xiaonan | Arxiv | Unsupervised hashing has attracted much attention for binary representation learning due to the requirement of economical storage and the efficiency of binary codes. It aims to encode high-dimensional features in the Hamming space with similarity preservation between instances. However, most existing methods learn hash functions via manifold-based approaches. Those methods capture the local geometric structures (i.e., pairwise relationships) of data, and lack satisfactory performance in dealing with real-world scenarios that produce similar features (e.g., color and shape) with different semantic information. To address this challenge, in this work, we propose an effective unsupervised method, namely Jointly Personalized Sparse Hashing (JPSH), for binary representation learning. To be specific, firstly, we propose a novel personalized hashing module, i.e., Personalized Sparse Hashing (PSH). Different personalized subspaces are constructed to reflect category-specific attributes for different clusters, adaptively mapping instances within the same cluster to the same Hamming space. In addition, we deploy sparse constraints for different personalized subspaces to select important features. We also collect the strengths of the other clusters to build the PSH module while avoiding over-fitting. Then, to simultaneously preserve semantic and pairwise similarities in our JPSH, we incorporate the PSH and manifold-based hash learning into a seamless formulation. As such, JPSH not only distinguishes the instances from different clusters, but also preserves local neighborhood structures within each cluster. Finally, an alternating optimization algorithm is adopted to iteratively obtain analytical solutions of the JPSH model. Extensive experiments on four benchmark datasets verify that JPSH outperforms several hashing algorithms on the similarity search task. |
|||||
2022 | Instant Neural Graphics Primitives With A Multiresolution Hash Encoding | Müller Thomas, Evans Alex, Schied Christoph, Keller Alexander | ACM Trans. Graph. | Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. We reduce this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing quality, thus significantly reducing the number of floating point and memory access operations: a small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. The multiresolution structure allows the network to disambiguate hash collisions, making for a simple architecture that is trivial to parallelize on modern GPUs. We leverage this parallelism by implementing the whole system using fully-fused CUDA kernels with a focus on minimizing wasted bandwidth and compute operations. We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds, and rendering in tens of milliseconds at a resolution of \(1920\times 1080\). |
|||||
2022 | Navigable Proximity Graph-driven Native Hybrid Queries With Structured And Unstructured Constraints | Wang Mengzhao, Lv Lingwei, Xu Xiaoliang, Wang Yuxiang, Yue Qiang, Ni Jiongkang | Arxiv | As research interest surges, vector similarity search is applied in multiple fields, including data mining, computer vision, and information retrieval. Given a set of objects (e.g., a set of images) and a query object, we can easily transform each object into a feature vector and apply vector similarity search to retrieve the most similar objects. However, the original vector similarity search cannot well support hybrid queries, where users input not only an unstructured query constraint (i.e., the feature vector of the query object) but also a structured query constraint (i.e., the desired attributes of interest). Hybrid query processing aims at identifying objects whose feature vectors are similar to that of the query object and that satisfy the given attribute constraints. Recent efforts have attempted to answer a hybrid query by performing attribute filtering and vector similarity search separately and then merging the results later, which limits efficiency and accuracy because they are not purpose-built for hybrid queries. In this paper, we propose a native hybrid query (NHQ) framework based on proximity graphs (PGs), which provides specialized composite index and joint pruning modules for hybrid queries. Various existing PGs can easily be deployed on this framework to process hybrid queries efficiently. Moreover, we present two novel navigable PGs (NPGs) with optimized edge selection and routing strategies, which obtain better overall performance than existing PGs. After that, we deploy the proposed NPGs in NHQ to form two hybrid query methods, which significantly outperform the state-of-the-art competitors on all experimental datasets (10\(\times\) faster under the same Recall), including eight public and one in-house real-world datasets. Our code and datasets have been released at https://github.com/AshenOn3/NHQ. |
|||||
2022 | Multi Hash Embeddings In Spacy | Miranda Lester James, Kádár Ákos, Boyd Adriane, Van Landeghem Sofie, Søgaard Anders, Honnibal Matthew | Arxiv | The distributed representation of symbols is one of the key technologies in machine learning systems today, playing a pivotal role in modern natural language processing. Traditional word embeddings associate a separate vector with each word. While this approach is simple and leads to good performance, it requires a lot of memory for representing a large vocabulary. To reduce the memory footprint, the default embedding layer in spaCy is a hash embeddings layer. It is a stochastic approximation of traditional embeddings that provides unique vectors for a large number of words without explicitly storing a separate vector for each of them. To be able to compute meaningful representations for both known and unknown words, hash embeddings represent each word as a summary of the normalized word form, subword information and word shape. Together, these features produce a multi-embedding of a word. In this technical report, we first lay out a bit of history and introduce the embedding methods in spaCy in detail. Second, we critically evaluate the hash embedding architecture with multi-embeddings on Named Entity Recognition datasets from a variety of domains and languages. The experiments validate most key design choices behind spaCy’s embedders, but we also uncover a few surprising results. |
|||||
2022 | Deep Unsupervised Contrastive Hashing For Large-scale Cross-modal Text-image Retrieval In Remote Sensing | Mikriukov Georgii, Ravanbakhsh Mahdyar, Demir Begüm | Arxiv | Due to the availability of large-scale multi-modal data (e.g., satellite images acquired by different sensors, text sentences, etc) archives, the development of cross-modal retrieval systems that can search and retrieve semantically relevant data across different modalities based on a query in any modality has attracted great attention in RS. In this paper, we focus our attention on cross-modal text-image retrieval, where queries from one modality (e.g., text) can be matched to archive entries from another (e.g., image). Most of the existing cross-modal text-image retrieval systems require a high number of labeled training samples and also do not allow fast and memory-efficient retrieval due to their intrinsic characteristics. These issues limit the applicability of the existing cross-modal retrieval systems for large-scale applications in RS. To address this problem, in this paper we introduce a novel deep unsupervised cross-modal contrastive hashing (DUCH) method for RS text-image retrieval. The proposed DUCH is made up of two main modules: 1) feature extraction module (which extracts deep representations of the text-image modalities); and 2) hashing module (which learns to generate cross-modal binary hash codes from the extracted representations). Within the hashing module, we introduce a novel multi-objective loss function including: i) contrastive objectives that enable similarity preservation in both intra- and inter-modal similarities; ii) an adversarial objective that is enforced across two modalities for cross-modal representation consistency; iii) binarization objectives for generating representative hash codes. Experimental results show that the proposed DUCH outperforms state-of-the-art unsupervised cross-modal hashing methods on two multi-modal (image and text) benchmark archives in RS. Our code is publicly available at https://git.tu-berlin.de/rsim/duch. |
|||||
2022 | An Instance Selection Algorithm For Big Data In High Imbalanced Datasets Based On LSH | Melo-acosta Germán E., Duitama-muñoz Freddy, Arias-londoño Julián D. | Arxiv | Training of Machine Learning (ML) models in real contexts often deals with big data sets and high-class imbalance samples where the class of interest is unrepresented (minority class). Practical solutions using classical ML models address the problem of large data sets using parallel/distributed implementations of training algorithms, approximate model-based solutions, or applying instance selection (IS) algorithms to eliminate redundant information. However, the combined problem of big and high imbalanced datasets has been less addressed. This work proposes three new methods for IS to be able to deal with large and imbalanced data sets. The proposed methods use Locality Sensitive Hashing (LSH) as a base clustering technique, and then three different sampling methods are applied on top of the clusters (or buckets) generated by LSH. The algorithms were developed in the Apache Spark framework, guaranteeing their scalability. The experiments carried out in three different datasets suggest that the proposed IS methods can improve the performance of a base ML model between 5% and 19% in terms of the geometric mean. |
|||||
2022 | Hamming Distributions Of Popular Perceptual Hashing Techniques | Mckeown Sean, Buchanan William J | DFRWS | Content-based file matching has been widely deployed for decades, largely for the detection of sources of copyright infringement, extremist materials, and abusive sexual media. Perceptual hashes, such as Microsoft’s PhotoDNA, are one automated mechanism for facilitating detection, allowing for machines to approximately match visual features of an image or video in a robust manner. However, there does not appear to be much public evaluation of such approaches, particularly when it comes to how effective they are against content-preserving modifications to media files. In this paper, we present a million-image scale evaluation of several perceptual hashing archetypes for popular algorithms (including Facebook’s PDQ, Apple’s Neuralhash, and the popular pHash library) against seven image variants. The focal point is the distribution of Hamming distance scores between both unrelated images and image variants to better understand the problems faced by each approach. |
|||||
2022 | Look-ups Are Not (yet) All You Need For Deep Learning Inference | Mccarter Calvin, Dronen Nicholas | Arxiv | Fast approximations to matrix multiplication have the potential to dramatically reduce the cost of neural network inference. Recent work on approximate matrix multiplication proposed to replace costly multiplications with table-lookups by fitting a fast hash function from training data. In this work, we propose improvements to this previous work, targeted to the deep learning inference setting, where one has access to both training data and fixed (already learned) model weight matrices. We further propose a fine-tuning procedure for accelerating entire neural networks while minimizing loss in accuracy. Finally, we analyze the proposed method on a simple image classification task. While we show improvements to prior work, overall classification accuracy remains substantially diminished compared to exact matrix multiplication. Our work, despite this negative result, points the way towards future efforts to accelerate inner products with fast nonlinear hashing methods. |
|||||
2022 | Unsupervised Contrastive Hashing For Cross-modal Retrieval In Remote Sensing | Mikriukov Georgii, Ravanbakhsh Mahdyar, Demir Begüm | Arxiv | The development of cross-modal retrieval systems that can search and retrieve semantically relevant data across different modalities based on a query in any modality has attracted great attention in remote sensing (RS). In this paper, we focus our attention on cross-modal text-image retrieval, where queries from one modality (e.g., text) can be matched to archive entries from another (e.g., image). Most of the existing cross-modal text-image retrieval systems in RS require a high number of labeled training samples and also do not allow fast and memory-efficient retrieval. These issues limit the applicability of the existing cross-modal retrieval systems for large-scale applications in RS. To address this problem, in this paper we introduce a novel unsupervised cross-modal contrastive hashing (DUCH) method for text-image retrieval in RS. To this end, the proposed DUCH is made up of two main modules: 1) feature extraction module, which extracts deep representations of two modalities; 2) hashing module that learns to generate cross-modal binary hash codes from the extracted representations. We introduce a novel multi-objective loss function including: i) contrastive objectives that enable similarity preservation in intra- and inter-modal similarities; ii) an adversarial objective that is enforced across two modalities for cross-modal representation consistency; and iii) binarization objectives for generating hash codes. Experimental results show that the proposed DUCH outperforms state-of-the-art methods. Our code is publicly available at https://git.tu-berlin.de/rsim/duch. |
|||||
2022 | Hashencoding Autoencoding With Multiscale Coordinate Hashing | Zhornyak Lukas, Xu Zhengjie, Tang Haoran, Shi Jianbo | Arxiv | We present HashEncoding, a novel autoencoding architecture that leverages a non-parametric multiscale coordinate hash function to facilitate a per-pixel decoder without convolutions. By leveraging the space-folding behaviour of hashing functions, HashEncoding allows for an inherently multiscale embedding space that remains much smaller than the original image. As a result, the decoder requires very few parameters compared with decoders in traditional autoencoders, approaching a non-parametric reconstruction of the original image and allowing for greater generalizability. Finally, by allowing backpropagation directly to the coordinate space, we show that HashEncoding can be exploited for geometric tasks such as optical flow. |
|||||
2022 | Codes From Incidence Matrices Of Hypergraphs | Mallik Sudipta, Yildiz Bahattin | Arxiv | Binary codes are constructed from incidence matrices of hypergraphs. A combinatorial description is given for the minimum distances of such codes via a combinatorial tool called "eonv". This combinatorial approach provides a faster alternative method of finding the minimum distance, which is known to be a hard problem. This is demonstrated on several classes of codes from hypergraphs. Moreover, self-duality and self-orthogonality conditions are also studied through hypergraphs. |
|||||
2022 | Learning To Embed Semantic Similarity For Joint Image-text Retrieval | Malali Noam, Keller Yosi | Arxiv | We present a deep learning approach for learning the joint semantic embeddings of images and captions in a Euclidean space, such that the semantic similarity is approximated by the L2 distances in the embedding space. For that, we introduce a metric learning scheme that utilizes multitask learning to learn the embedding of identical semantic concepts using a center loss. By introducing a differentiable quantization scheme into the end-to-end trainable network, we derive a semantic embedding of semantically similar concepts in Euclidean space. We also propose a novel metric learning formulation using an adaptive margin hinge loss that is refined during the training phase. The proposed scheme was applied to the MS-COCO, Flickr30K and Flickr8K datasets, and was shown to compare favorably with contemporary state-of-the-art approaches. |
|||||
2022 | Analysis Of The New Standard Hash Function | Martín-fernández F, Caballero-gil P | International Conference on Computer Aided Systems Theory | On 2\(^{nd}\) October 2012 the NIST (National Institute of Standards and Technology) in the United States of America announced the new hashing algorithm which will be adopted as standard from now on. Among a total of 73 candidates, the winner was Keccak, designed by a group of cryptographers from Belgium and Italy. The public selection of a new standard of cryptographic hash function SHA (Secure Hash Algorithm) took five years. Its objective is to generate a fixed-size hash from a pattern of arbitrary length. The first selection on behalf of NIST on a standard of this family took place in 1993 when SHA-1 was chosen, which later on was replaced by SHA-2. This paper is focused on the analysis both from the point of view of security and the implementation of the Keccak function, which is the base of the new SHA-3 standard. In particular, an implementation in the mobile platform Android is presented here, providing the first known external library in this mobile operating system so that any developer can use the new hashing standard. Finally, applications of the new standard in the Internet of Things are analysed. |
|||||
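As a quick way to exercise the standard discussed above without the Android library, Python's standard hashlib module has exposed the SHA-3 (Keccak-based) functions since version 3.6; the snippet below simply hashes an arbitrary-length message to a fixed 256-bit digest.

```python
import hashlib

# SHA3-256 maps an arbitrary-length input to a fixed-size 256-bit digest.
digest = hashlib.sha3_256(b"sensor reading from an IoT device").hexdigest()
print(digest)           # 64 hex characters
print(len(digest) * 4)  # 256 bits, regardless of the input length
```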
2022 | Deep Forest With Hashing Screening And Window Screening | Ma Pengfei, Wu Youxi, Li Yan, Guo Lei, Jiang He, Zhu Xingquan, Wu Xindong | Arxiv | As a novel deep learning model, gcForest has been widely used in various applications. However, the current multi-grained scanning of gcForest produces many redundant feature vectors, and this increases the time cost of the model. To screen out redundant feature vectors, we introduce a hashing screening mechanism for multi-grained scanning and propose a model called HW-Forest which adopts two strategies: hashing screening and window screening. HW-Forest employs a perceptual hashing algorithm to calculate the similarity between feature vectors in the hashing screening strategy, which is used to remove the redundant feature vectors produced by multi-grained scanning and can significantly decrease the time cost and memory consumption. Furthermore, we adopt a self-adaptive instance screening strategy, called window screening, to improve the performance of our approach, which can achieve higher accuracy without hyperparameter tuning on different datasets. Our experimental results show that HW-Forest has higher accuracy than other models, and the time cost is also reduced. |
|||||
2022 | Asymmetric Transfer Hashing With Adaptive Bipartite Graph Learning | Lu Jianglin, Zhou Jie, Chen Yudong, Pedrycz Witold, Hung Kwok-wai | Arxiv | Thanks to the efficient retrieval speed and low storage consumption, learning to hash has been widely used in visual retrieval tasks. However, existing hashing methods assume that the query and retrieval samples lie in homogeneous feature space within the same domain. As a result, they cannot be directly applied to heterogeneous cross-domain retrieval. In this paper, we propose a Generalized Image Transfer Retrieval (GITR) problem, which encounters two crucial bottlenecks: 1) the query and retrieval samples may come from different domains, leading to an inevitable {domain distribution gap}; 2) the features of the two domains may be heterogeneous or misaligned, bringing up an additional {feature gap}. To address the GITR problem, we propose an Asymmetric Transfer Hashing (ATH) framework with its unsupervised/semi-supervised/supervised realizations. Specifically, ATH characterizes the domain distribution gap by the discrepancy between two asymmetric hash functions, and minimizes the feature gap with the help of a novel adaptive bipartite graph constructed on cross-domain data. By jointly optimizing asymmetric hash functions and the bipartite graph, not only can knowledge transfer be achieved but information loss caused by feature alignment can also be avoided. Meanwhile, to alleviate negative transfer, the intrinsic geometrical structure of single-domain data is preserved by involving a domain affinity graph. Extensive experiments on both single-domain and cross-domain benchmarks under different GITR subtasks indicate the superiority of our ATH method in comparison with the state-of-the-art hashing methods. |
|||||
2022 | Hyperbolic Hierarchical Contrastive Hashing | Wei Rukai, Liu Yu, Song Jingkuan, Xie Yanzhao, Zhou Ke | Transaction on Image Processing | Hierarchical semantic structures, naturally existing in real-world datasets, can assist in capturing the latent distribution of data to learn robust hash codes for retrieval systems. Although hierarchical semantic structures can be simply expressed by integrating semantically relevant data into a high-level taxon with coarser-grained semantics, the construction, embedding, and exploitation of the structures remain tricky for unsupervised hash learning. To tackle these problems, we propose a novel unsupervised hashing method named Hyperbolic Hierarchical Contrastive Hashing (HHCH). We propose to embed continuous hash codes into hyperbolic space for accurate semantic expression since embedding hierarchies in hyperbolic space generates less distortion than in hyper-sphere space and Euclidean space. In addition, we extend the K-Means algorithm to hyperbolic space and perform the proposed hierarchical hyperbolic K-Means algorithm to construct hierarchical semantic structures adaptively. To exploit the hierarchical semantic structures in hyperbolic space, we designed the hierarchical contrastive learning algorithm, including hierarchical instance-wise and hierarchical prototype-wise contrastive learning. Extensive experiments on four benchmark datasets demonstrate that the proposed method outperforms the state-of-the-art unsupervised hashing methods. Codes will be released. |
|||||
2022 | Adaptive Asymmetric Label-guided Hashing For Multimedia Search | Long Yitian | Arxiv | With the rapid growth of multimodal media data on the Web in recent years, hash learning methods as a way to achieve efficient and flexible cross-modal retrieval of massive multimedia data have received a lot of attention from the current Web resource retrieval research community. Existing supervised hashing methods simply transform label information into pairwise similarity information to guide hash learning, leading to a potential risk of semantic error in the face of multi-label data. In addition, most existing hash optimization methods solve NP-hard optimization problems by employing relaxation-based approximation strategies, leading to a large quantization error. In order to address the above obstacles, we present a simple yet efficient Adaptive Asymmetric Label-guided Hashing, named A2LH, for Multimedia Search. Specifically, A2LH is a two-step hashing method. In the first step, we design an association representation model between the different modality representations and the semantic label representation separately, and use the semantic label representation as an intermediate bridge to solve the semantic gap existing between different modalities. In addition, we present an efficient discrete optimization algorithm for solving the quantization error problem caused by relaxation-based optimization algorithms. In the second step, we leverage the generated hash codes to learn the hash mapping functions. The experimental results show that our proposed method achieves the best performance among all compared baseline methods. |
|||||
2022 | Hashing Learning With Hyper-class Representation | Zhang Shichao, Li Jiaye | Arxiv | Existing unsupervised hash learning is a kind of attribute-centered calculation. It may not accurately preserve the similarity between data. This lowers the performance of hash function learning. In this paper, a hash algorithm is proposed with a hyper-class representation. It is a two-step approach. The first step finds potential decision features and establishes hyper-classes. The second step constructs hash learning based on the hyper-class information from the first step, so that the hash codes of data within a hyper-class are as similar as possible, while the hash codes of data in different hyper-classes are as different as possible. To evaluate the efficiency, a series of experiments are conducted on four public datasets. The experimental results show that the proposed hash algorithm is more efficient than the compared algorithms, in terms of mean average precision (MAP), average precision (AP) and Hamming radius 2 (HAM2). |
|||||
2022 | Prototype-based Layered Federated Cross-modal Hashing | Liu Jiale, Zhan Yu-wei, Luo Xin, Chen Zhen-duo, Wang Yongxin, Xu Xin-shun | Arxiv | Recently, deep cross-modal hashing has gained increasing attention. However, in many practical cases, data are distributed and cannot be collected due to privacy concerns, which greatly reduces the cross-modal hashing performance on each client. And due to the problems of statistical heterogeneity, model heterogeneity, and forcing each client to accept the same parameters, applying federated learning to cross-modal hash learning becomes very tricky. In this paper, we propose a novel method called prototype-based layered federated cross-modal hashing. Specifically, the prototype is introduced to learn the similarity between instances and classes on server, reducing the impact of statistical heterogeneity (non-IID) on different clients. And we monitor the distance between local and global prototypes to further improve the performance. To realize personalized federated learning, a hypernetwork is deployed on server to dynamically update different layers’ weights of local model. Experimental results on benchmark datasets show that our method outperforms state-of-the-art methods. |
|||||
2022 | Deep Unsupervised Hashing With Latent Semantic Components | Lin Qinghong, Chen Xiaojun, Zhang Qin, Cai Shaotian, Zhao Wenzhe, Wang Hongfa | Arxiv | Deep unsupervised hashing has been appreciated in the regime of image retrieval. However, most prior arts failed to detect the semantic components and their relationships behind the images, which makes them lack discriminative power. To make up for this defect, we propose a novel Deep Semantic Components Hashing (DSCH), which builds on the common-sense observation that an image normally contains a bunch of semantic components with homology and co-occurrence relationships. Based on this prior, DSCH regards the semantic components as latent variables under the Expectation-Maximization framework and designs a two-step iterative algorithm with the objective of maximum likelihood of training data. Firstly, DSCH constructs a semantic component structure by uncovering the fine-grained semantic components of images with a Gaussian Mixture Model (GMM), where an image is represented as a mixture of multiple components, and the semantic co-occurrences are exploited. Besides, coarse-grained semantic components are discovered by considering the homology relationships between fine-grained components, and the hierarchy organization is then constructed. Secondly, DSCH makes the images close to their semantic component centers at both fine-grained and coarse-grained levels, and also makes images that share similar semantic components close to each other. Extensive experiments on three benchmark datasets demonstrate that the proposed hierarchical semantic components indeed facilitate the hashing model to achieve superior performance. |
|||||
2022 | Adaptive Structural Similarity Preserving For Unsupervised Cross Modal Hashing | Li Liang, Zheng Baihua, Sun Weiwei | Arxiv | Cross-modal hashing is an important approach for multimodal data management and application. Existing unsupervised cross-modal hashing algorithms mainly rely on data features in pre-trained models to mine their similarity relationships. However, their optimization objectives are based on the static metric between the original uni-modal features, without further exploring data correlations during the training. In addition, most of them mainly focus on association mining and alignment among pairwise instances in continuous space but ignore the latent structural correlations contained in the semantic hashing space. In this paper, we propose an unsupervised hash learning framework, namely Adaptive Structural Similarity Preservation Hashing (ASSPH), to solve the above problems. Firstly, we propose an adaptive learning scheme, with limited data and training batches, to enrich semantic correlations of unlabeled instances during the training process and meanwhile to ensure a smooth convergence of the training process. Secondly, we present an asymmetric structural semantic representation learning scheme. We introduce structural semantic metrics based on graph adjacency relations during the semantic reconstruction and correlation mining stage and meanwhile align the structure semantics in the hash space with an asymmetric binary optimization process. Finally, we conduct extensive experiments to validate the enhancements of our work in comparison with existing works. |
|||||
2022 | Asymmetric Scalable Cross-modal Hashing | Li Wenyun, Pun Chi-man | Arxiv | Cross-modal hashing is a successful method to solve the large-scale multimedia retrieval problem. Many matrix factorization-based hashing methods have been proposed. However, the existing methods still struggle with a few problems, such as how to generate the binary codes efficiently rather than directly relaxing them to continuous values. In addition, most of the existing methods choose to use an \(n\times n\) similarity matrix for optimization, which makes the memory and computation unaffordable. In this paper we propose a novel Asymmetric Scalable Cross-Modal Hashing (ASCMH) to address these issues. It firstly introduces a collective matrix factorization to learn a common latent space from the kernelized features of different modalities, and then transforms the similarity matrix optimization to a distance-distance difference problem minimization with the help of semantic labels and the common latent space. Hence, the computational complexity of the \(n\times n\) asymmetric optimization is relieved. In the generation of hash codes we also employ an orthogonal constraint of label information, which is indispensable for search accuracy. So the redundancy of computation can be much reduced. For efficient optimization and scalability to large-scale datasets, we adopt a two-step approach rather than optimizing simultaneously. Extensive experiments on three benchmark datasets: Wiki, MIRFlickr-25K, and NUS-WIDE, demonstrate that our ASCMH outperforms the state-of-the-art cross-modal hashing methods in terms of accuracy and efficiency. |
|||||
2022 | An Efficient Hashing-based Ensemble Method For Collaborative Outlier Detection | Li Kitty, Pham Ninh | Arxiv | In collaborative outlier detection, multiple participants exchange their local detectors trained on decentralized devices without exchanging their own data. A key problem of collaborative outlier detection is efficiently aggregating multiple local detectors to form a global detector without breaching the privacy of participants’ data and degrading the detection accuracy. We study locality-sensitive hashing-based ensemble methods to detect collaborative outliers since they are mergeable and compatible with differentially private mechanisms. Our proposed LSH iTables is simple and outperforms recent ensemble competitors on centralized and decentralized scenarios over many real-world data sets. |
|||||
2022 | Asymmetric Hash Code Learning For Remote Sensing Image Retrieval | Song Weiwei, Gao Zhi, Dian Renwei, Ghamisi Pedram, Zhang Yongjun, Benediktsson Jón Atli | Arxiv | Remote sensing image retrieval (RSIR), aiming at searching for a set of similar items to a given query image, is a very important task in remote sensing applications. Deep hashing learning as the current mainstream method has achieved satisfactory retrieval performance. On one hand, various deep neural networks are used to extract semantic features of remote sensing images. On the other hand, the hashing techniques are subsequently adopted to map the high-dimensional deep features to the low-dimensional binary codes. This kind of methods attempts to learn one hash function for both the query and database samples in a symmetric way. However, with the number of database samples increasing, it is typically time-consuming to generate the hash codes of large-scale database images. In this paper, we propose a novel deep hashing method, named asymmetric hash code learning (AHCL), for RSIR. The proposed AHCL generates the hash codes of query and database images in an asymmetric way. In more detail, the hash codes of query images are obtained by binarizing the output of the network, while the hash codes of database images are directly learned by solving the designed objective function. In addition, we combine the semantic information of each image and the similarity information of pairs of images as supervised information to train a deep hashing network, which improves the representation ability of deep features and hash codes. The experimental results on three public datasets demonstrate that the proposed method outperforms symmetric methods in terms of retrieval accuracy and efficiency. The source code is available at https://github.com/weiweisong415/Demo AHCL for TGRS2022. |
|||||
2022 | Elastic Product Quantization For Time Series | Robberechts Pieter, Meert Wannes, Davis Jesse | Arxiv | Analyzing numerous or long time series is difficult in practice due to the high storage costs and computational requirements. Therefore, techniques have been proposed to generate compact similarity-preserving representations of time series, enabling real-time similarity search on large in-memory data collections. However, the existing techniques are not ideally suited for assessing similarity when sequences are locally out of phase. In this paper, we propose the use of product quantization for efficient similarity-based comparison of time series under time warping. The idea is to first compress the data by partitioning the time series into equal length sub-sequences which are represented by a short code. The distance between two time series can then be efficiently approximated by pre-computed elastic distances between their codes. The partitioning into sub-sequences forces unwanted alignments, which we address with a pre-alignment step using the maximal overlap discrete wavelet transform (MODWT). To demonstrate the efficiency and accuracy of our method, we perform an extensive experimental evaluation on benchmark datasets in nearest neighbors classification and clustering applications. Overall, the proposed solution emerges as a highly efficient (both in terms of memory usage and computation time) replacement for elastic measures in time series applications. |
|||||
2022 | Context Unaware Knowledge Distillation For Image Retrieval | Reddy Bytasandram Yaswanth, Dubey Shiv Ram, Sanodiya Rakesh Kumar, Karn Ravi Ranjan Prasad | Arxiv | Existing data-dependent hashing methods use large backbone networks with millions of parameters and are computationally complex. Existing knowledge distillation methods use logits and other features of the deep (teacher) model as knowledge for the compact (student) model, which requires the teacher network to be fine-tuned on the target context in parallel with the student model. Training the teacher on the target context requires more time and computational resources. In this paper, we propose context unaware knowledge distillation that uses the knowledge of the teacher model without fine-tuning it on the target context. We also propose a new efficient student model architecture for knowledge distillation. The proposed approach follows a two-step process. The first step involves pre-training the student model with the help of context unaware knowledge distillation from the teacher model. The second step involves fine-tuning the student model on the context of image retrieval. In order to show the efficacy of the proposed approach, we compare the retrieval results, no. of parameters and no. of operations of the student models with the teacher models under different retrieval frameworks, including deep Cauchy hashing (DCH) and central similarity quantization (CSQ). The experimental results confirm that the proposed approach provides a promising trade-off between the retrieval results and efficiency. The code used in this paper is released publicly at \url{https://github.com/satoru2001/CUKDFIR}. |
|||||
2022 | Efficient Document Retrieval By End-to-end Refining And Quantizing BERT Embedding With Contrastive Product Quantization | Qiu Zexuan, Su Qinliang, Yu Jianxing, Si Shijing | EMNLP | Efficient document retrieval heavily relies on the technique of semantic hashing, which learns a binary code for every document and employs Hamming distance to evaluate document distances. However, existing semantic hashing methods are mostly established on outdated TFIDF features, which obviously do not contain lots of important semantic information about documents. Furthermore, the Hamming distance can only be equal to one of several integer values, significantly limiting its representational ability for document distances. To address these issues, in this paper, we propose to leverage BERT embeddings to perform efficient retrieval based on the product quantization technique, which will assign for every document a real-valued codeword from the codebook, instead of a binary code as in semantic hashing. Specifically, we first transform the original BERT embeddings via a learnable mapping and feed the transformed embedding into a probabilistic product quantization module to output the assigned codeword. The refining and quantizing modules can be optimized in an end-to-end manner by minimizing the probabilistic contrastive loss. A mutual information maximization based method is further proposed to improve the representativeness of codewords, so that documents can be quantized more accurately. Extensive experiments conducted on three benchmarks demonstrate that our proposed method significantly outperforms current state-of-the-art baselines. |
|||||
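For readers unfamiliar with the underlying mechanics, the sketch below shows the plain product-quantization building block that the paper refines: vectors are split into sub-vectors, a small codebook is learned per subspace with k-means, and each document is then represented by a handful of codeword indices. This is only a generic, offline illustration (using scikit-learn's KMeans on random stand-in embeddings); the paper's end-to-end refinement, probabilistic assignment, and contrastive loss are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(X, num_subspaces=4, num_codes=16, seed=0):
    """Learn one small codebook per subspace of the embedding."""
    subdim = X.shape[1] // num_subspaces
    codebooks = []
    for m in range(num_subspaces):
        sub = X[:, m * subdim:(m + 1) * subdim]
        km = KMeans(n_clusters=num_codes, n_init=4, random_state=seed).fit(sub)
        codebooks.append(km.cluster_centers_)
    return codebooks

def pq_encode(x, codebooks):
    """Represent x by the index of the nearest codeword in each subspace."""
    subdim = codebooks[0].shape[1]
    return [int(np.argmin(np.linalg.norm(cb - x[m * subdim:(m + 1) * subdim], axis=1)))
            for m, cb in enumerate(codebooks)]

rng = np.random.default_rng(0)
docs = rng.normal(size=(5000, 64))    # stand-in for (refined) document embeddings
codebooks = pq_train(docs)
print(pq_encode(docs[0], codebooks))  # e.g. [3, 11, 7, 0] -- four small codes per doc
```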
2022 | Adaptive And Dynamic Multi-resolution Hashing For Pairwise Summations | Qin Lianke, Reddy Aravind, Song Zhao, Xu Zhaozhuo, Zhuo Danyang | Arxiv | In this paper, we propose Adam-Hash: an adaptive and dynamic multi-resolution hashing data-structure for fast pairwise summation estimation. Given a data-set \(X \subset \mathbb{R}^d\), a binary function \(f:\mathbb{R}^d\times \mathbb{R}^d\to \mathbb{R}\), and a point \(y \in \mathbb{R}^d\), the Pairwise Summation Estimate \(\mathrm{PSE}_X(y) := \frac{1}{|X|} \sum_{x \in X} f(x,y)\). For any given data-set \(X\), we need to design a data-structure such that given any query point \(y \in \mathbb{R}^d\), the data-structure approximately estimates \(\mathrm{PSE}_X(y)\) in time that is sub-linear in \(|X|\). Prior works on this problem have focused exclusively on the case where the data-set is static, and the queries are independent. In this paper, we design a hashing-based PSE data-structure which works for the more practical \textit{dynamic} setting in which insertions, deletions, and replacements of points are allowed. Moreover, our proposed Adam-Hash is also robust to adaptive PSE queries, where an adversary can choose query \(q_j \in \mathbb{R}^d\) depending on the output from previous queries \(q_1, q_2, \dots, q_{j-1}\). |
|||||
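To make the quantity being indexed concrete, the snippet below computes the pairwise summation estimate \(\mathrm{PSE}_X(y)\) exactly with the linear-time loop that Adam-Hash is designed to beat; the Gaussian-kernel choice of \(f\) is just an example, and none of the paper's hashing data structure is reproduced here.

```python
import numpy as np

def pse_exact(X, y, f):
    """Exact PSE_X(y) = (1/|X|) * sum over x in X of f(x, y).

    This is the O(|X|) baseline; Adam-Hash maintains a hash-based index so the
    same quantity can be approximated in time sublinear in |X|, even as points
    are inserted, deleted, or replaced.
    """
    return sum(f(x, y) for x in X) / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 16))
y = rng.normal(size=16)
gaussian = lambda a, b: np.exp(-np.linalg.norm(a - b) ** 2)  # example choice of f
print(pse_exact(X, y, gaussian))
```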
2022 | Hashvfl Defending Against Data Reconstruction Attacks In Vertical Federated Learning | Qiu Pengyu, Zhang Xuhong, Ji Shouling, Fu Chong, Yang Xing, Wang Ting | Arxiv | Vertical Federated Learning (VFL) is a trending collaborative machine learning model training solution. Existing industrial frameworks employ secure multi-party computation techniques such as homomorphic encryption to ensure data security and privacy. Despite these efforts, studies have revealed that data leakage remains a risk in VFL due to the correlations between intermediate representations and raw data. Neural networks can accurately capture these correlations, allowing an adversary to reconstruct the data. This emphasizes the need for continued research into securing VFL systems. Our work shows that hashing is a promising solution to counter data reconstruction attacks. The one-way nature of hashing makes it difficult for an adversary to recover data from hash codes. However, implementing hashing in VFL presents new challenges, including vanishing gradients and information loss. To address these issues, we propose HashVFL, which integrates hashing and simultaneously achieves learnability, bit balance, and consistency. Experimental results indicate that HashVFL effectively maintains task performance while defending against data reconstruction attacks. It also brings additional benefits in reducing the degree of label leakage, mitigating adversarial attacks, and detecting abnormal inputs. We hope our work will inspire further research into the potential applications of HashVFL. |
|||||
2022 | Locality-preserving Minimal Perfect Hashing Of K-mers | Pibiri Giulio Ermanno, Shibuya Yoshihiro, Limasset Antoine | Arxiv | Minimal perfect hashing is the problem of mapping a static set of \(n\) distinct keys into the address space \(\{1,\ldots,n\}\) bijectively. It is well-known that \(n\log_2(e)\) bits are necessary to specify a minimal perfect hash function (MPHF) \(f\), when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of \(f\). For example, consider a string and the set of all its distinct \(k\)-mers as input keys: since two consecutive \(k\)-mers share an overlap of \(k-1\) symbols, it seems possible to beat the classic \(\log_2(e)\) bits/key barrier in this case. Moreover, we would like \(f\) to map consecutive \(k\)-mers to consecutive addresses, so as to also preserve as much as possible their relationship in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for \(f\), resulting in a better evaluation time when querying consecutive \(k\)-mers. Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for \(k\)-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing \(k\) and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature. |
|||||
2022 | Falconn++ A Locality-sensitive Filtering Approach For Approximate Nearest Neighbor Search | Ninh Pham, Tao Liu | Neural Information Processing Systems | We present Falconn++, a novel locality-sensitive filtering (LSF) approach for approximate nearest neighbor search on angular distance. Falconn++ can filter out potential far away points in any hash bucket before querying, which results in higher quality candidates compared to other hashing-based solutions. Theoretically, Falconn++ asymptotically achieves lower query time complexity than Falconn, an optimal locality-sensitive hashing scheme on angular distance. Empirically, Falconn++ achieves a higher recall-speed tradeoff than Falconn on many real-world data sets. Falconn++ is also competitive with HNSW, an efficient representative of graph-based solutions on high search recall regimes. |
|||||
2022 | Learning Compressed Embeddings For On-device Inference | Pansare Niketan, Katukuri Jay, Arora Aditya, Cipollone Frank, Shaik Riyaaz, Tokgozoglu Noyan, Venkataraman Chandru | Arxiv | In deep learning, embeddings are widely used to represent categorical entities such as words, apps, and movies. An embedding layer maps each entity to a unique vector, causing the layer’s memory requirement to be proportional to the number of entities. In the recommendation domain, a given category can have hundreds of thousands of entities, and its embedding layer can take gigabytes of memory. The scale of these networks makes them difficult to deploy in resource constrained environments. In this paper, we propose a novel approach for reducing the size of an embedding table while still mapping each entity to its own unique embedding. Rather than maintaining the full embedding table, we construct each entity’s embedding “on the fly” using two separate embedding tables. The first table employs hashing to force multiple entities to share an embedding. The second table contains one trainable weight per entity, allowing the model to distinguish between entities sharing the same embedding. Since these two tables are trained jointly, the network is able to learn a unique embedding per entity, helping it maintain a discriminative capability similar to a model with an uncompressed embedding table. We call this approach MEmCom (Multi-Embedding Compression). We compare with state-of-the-art model compression techniques for multiple problem classes including classification and ranking. On four popular recommender system datasets, MEmCom had a 4% relative loss in nDCG while compressing the input embedding sizes of our recommendation models by 16x, 4x, 12x, and 40x. MEmCom outperforms the state-of-the-art techniques, which achieved 16%, 6%, 10%, and 8% relative loss in nDCG at the respective compression ratios. Additionally, MEmCom is able to compress the RankNet ranking model by 32x on a dataset with millions of users’ interactions with games while incurring only a 1% relative loss in nDCG. |
|||||
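A minimal NumPy sketch of the two-table idea described above is given below: a hashed table shared by many entities plus one trainable scalar per entity. Combining the two by elementwise scaling is an assumption made here for illustration (the abstract only states that the two tables are trained jointly), and the class name and sizes are hypothetical.

```python
import numpy as np

class TwoTableEmbedding:
    """Sketch of a compressed embedding layer in the spirit of MEmCom.

    `shared` holds far fewer vectors than there are entities; a hash maps each
    entity to one of its rows. `per_entity` stores one scalar per entity so
    colliding entities can still be told apart. Scaling the shared vector by
    the per-entity scalar is an illustrative assumption, not the paper's exact
    combination rule.
    """

    def __init__(self, num_entities, num_buckets, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.shared = rng.normal(scale=0.01, size=(num_buckets, dim))
        self.per_entity = np.ones(num_entities)  # trained jointly in the real model
        self.num_buckets = num_buckets

    def lookup(self, entity_id):
        bucket = hash(("bucket", entity_id)) % self.num_buckets
        return self.per_entity[entity_id] * self.shared[bucket]

emb = TwoTableEmbedding(num_entities=1_000_000, num_buckets=50_000, dim=16)
print(emb.lookup(42).shape)  # (16,) while storing ~20x fewer vectors than entities
```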
2022 | Iceberght High Performance PMEM Hash Tables Through Stability And Low Associativity | Pandey Prashant, Bender Michael A., Conway Alex, Farach-colton Martín, Kuszmaul William, Tagliavini Guido, Johnson Rob | Arxiv | Modern hash table designs strive to minimize space while maximizing speed. The most important factor in speed is the number of cache lines accessed during updates and queries. This is especially important on PMEM, which is slower than DRAM and in which writes are more expensive than reads. This paper proposes two stronger design objectives: stability and low-associativity. A stable hash table doesn’t move items around, and a hash table has low associativity if there are only a few locations where an item can be stored. Low associativity ensures that queries need to examine only a few memory locations, and stability ensures that insertions write to very few cache lines. Stability also simplifies scaling and crash safety. We present IcebergHT, a fast, crash-safe, concurrent, and space-efficient hash table for PMEM based on the design principles of stability and low associativity. IcebergHT combines in-memory metadata with a new hashing technique, iceberg hashing, that is (1) space efficient, (2) stable, and (3) supports low associativity. In contrast, existing hash-tables either modify numerous cache lines during insertions (e.g. cuckoo hashing), access numerous cache lines during queries (e.g. linear probing), or waste space (e.g. chaining). Moreover, the combination of (1)-(3) yields several emergent benefits: IcebergHT scales better than other hash tables, supports crash-safety, and has excellent performance on PMEM (where writes are particularly expensive). |
|||||
2022 | A Principled Design Of Image Representation Towards Forensic Tasks | Qi Shuren, Zhang Yushu, Wang Chao, Zhou Jiantao, Cao Xiaochun | Arxiv | Image forensics is a rising topic as the trustworthy multimedia content is critical for modern society. Like other vision-related applications, forensic analysis relies heavily on the proper image representation. Despite the importance, current theoretical understanding for such representation remains limited, with varying degrees of neglect for its key role. For this gap, we attempt to investigate the forensic-oriented image representation as a distinct problem, from the perspectives of theory, implementation, and application. Our work starts from the abstraction of basic principles that the representation for forensics should satisfy, especially revealing the criticality of robustness, interpretability, and coverage. At the theoretical level, we propose a new representation framework for forensics, called Dense Invariant Representation (DIR), which is characterized by stable description with mathematical guarantees. At the implementation level, the discrete calculation problems of DIR are discussed, and the corresponding accurate and fast solutions are designed with generic nature and constant complexity. We demonstrate the above arguments on the dense-domain pattern detection and matching experiments, providing comparison results with state-of-the-art descriptors. Also, at the application level, the proposed DIR is initially explored in passive and active forensics, namely copy-move forgery detection and perceptual hashing, exhibiting the benefits in fulfilling the requirements of such forensic tasks. |
|||||
2022 | Information-theoretic Hashing For Zero-shot Cross-modal Retrieval | Shi Yufeng, Yu Shujian, Xu Duanquan, You Xinge | Arxiv | Zero-shot cross-modal retrieval (ZS-CMR) deals with the retrieval problem among heterogeneous data from unseen classes. Typically, to guarantee generalization, the pre-defined class embeddings from natural language processing (NLP) models are used to build a common space. In this paper, instead of using an extra NLP model to define a common space beforehand, we consider a totally different way to construct (or learn) a common Hamming space from an information-theoretic perspective. We term our model Information-Theoretic Hashing (ITH), which is composed of two cascading modules: an Adaptive Information Aggregation (AIA) module; and a Semantic Preserving Encoding (SPE) module. Specifically, our AIA module takes the inspiration from the Principle of Relevant Information (PRI) to construct a common space that adaptively aggregates the intrinsic semantics of different modalities of data and filters out redundant or irrelevant information. On the other hand, our SPE module further generates the hashing codes of different modalities by preserving the similarity of intrinsic semantics with the element-wise Kullback-Leibler (KL) divergence. A total correlation regularization term is also imposed to reduce the redundancy amongst different dimensions of hash codes. Sufficient experiments on three benchmark datasets demonstrate the superiority of the proposed ITH in ZS-CMR. Source code is available in the supplementary material. |
|||||
2022 | Efficient Cross-modal Retrieval Via Deep Binary Hashing And Quantization | Shi Yang, Chung Young-joo | BMVC | Cross-modal retrieval aims to search for data with similar semantic meanings across different content modalities. However, cross-modal retrieval requires huge amounts of storage and retrieval time since it needs to process data in multiple modalities. Existing works focused on learning single-source compact features such as binary hash codes that preserve similarities between different modalities. In this work, we propose a jointly learned deep hashing and quantization network (HQ) for cross-modal retrieval. We simultaneously learn binary hash codes and quantization codes to preserve semantic information in multiple modalities by an end-to-end deep learning architecture. At the retrieval step, binary hashing is used to retrieve a subset of items from the search space, then quantization is used to re-rank the retrieved items. We theoretically and empirically show that this two-stage retrieval approach provides faster retrieval results while preserving accuracy. Experimental results on the NUS-WIDE, MIR-Flickr, and Amazon datasets demonstrate that HQ achieves boosts of more than 7% in precision compared to supervised neural network-based compact coding models. |
|||||
2022 | Learning Similarity Preserving Binary Codes For Recommender Systems | Shi Yang, Chung Young-joo | Arxiv | Hashing-based Recommender Systems (RSs) are widely studied to provide scalable services. The existing methods for the systems combine three modules to achieve efficiency: feature extraction, interaction modeling, and binarization. In this paper, we study an unexplored module combination for the hashing-based recommender systems, namely Compact Cross-Similarity Recommender (CCSR). Inspired by cross-modal retrieval, CCSR utilizes Maximum a Posteriori similarity instead of matrix factorization and rating reconstruction to model interactions between users and items. We conducted experiments on MovieLens1M, Amazon product review, Ichiba purchase dataset and confirmed CCSR outperformed the existing matrix factorization-based methods. On the Movielens1M dataset, the absolute performance improvements are up to 15.69% in NDCG and 4.29% in Recall. In addition, we extensively studied three binarization modules: \(sign\), scaled tanh, and sign-scaled tanh. The result demonstrated that although differentiable scaled tanh is popular in recent discrete feature learning literature, a huge performance drop occurs when outputs of scaled \(tanh\) are forced to be binary. |
|||||
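The binarization modules compared in the abstract above are easy to see side by side: sign is the hard quantizer used at retrieval time, while scaled tanh is the differentiable surrogate used during training, approaching sign as its scale grows. The toy snippet below (parameter values are arbitrary) shows the residual gap that remains when the soft outputs are finally forced to be binary.

```python
import numpy as np

def scaled_tanh(x, alpha):
    """Differentiable surrogate for sign(x); tends to +-1 as alpha grows."""
    return np.tanh(alpha * x)

x = np.linspace(-1.0, 1.0, 9)   # toy pre-binarization activations
hard = np.sign(x)               # what is actually stored as the binary code
for alpha in (1.0, 5.0, 25.0):
    gap = np.abs(scaled_tanh(x, alpha) - hard).max()
    print(f"alpha={alpha:>5}: max |tanh - sign| = {gap:.3f}")
```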
2022 | Deep Manifold Hashing A Divide-and-conquer Approach For Semi-paired Unsupervised Cross-modal Retrieval | Shi Yufeng, You Xinge, Xu Jiamiao, Zheng Feng, Peng Qinmu, Ou Weihua | Arxiv | Hashing that projects data into binary codes has shown extraordinary talents in cross-modal retrieval due to its low storage usage and high query speed. Despite their empirical success in some scenarios, existing cross-modal hashing methods usually fail to cross the modality gap when fully-paired data with plenty of labeled information is nonexistent. To circumvent this drawback, motivated by the Divide-and-Conquer strategy, we propose Deep Manifold Hashing (DMH), a novel method of dividing the problem of semi-paired unsupervised cross-modal retrieval into three sub-problems and building one simple yet efficient model for each sub-problem. Specifically, the first model is constructed for obtaining modality-invariant features by complementing semi-paired data based on manifold learning, whereas the second model and the third model aim to learn hash codes and hash functions, respectively. Extensive experiments on three benchmarks demonstrate the superiority of our DMH compared with the state-of-the-art fully-paired and semi-paired unsupervised cross-modal hashing methods. |
|||||
2022 | SEMICON A Learning-to-hash Solution For Large-scale Fine-grained Image Retrieval | Shen Yang, Sun Xuhao, Wei Xiu-shen, Jiang Qing-yuan, Yang Jian | Arxiv | In this paper, we propose Suppression-Enhancing Mask based attention and Interactive Channel transformatiON (SEMICON) to learn binary hash codes for dealing with large-scale fine-grained image retrieval tasks. In SEMICON, we first develop a suppression-enhancing mask (SEM) based attention to dynamically localize discriminative image regions. More importantly, different from existing attention mechanisms that simply erase previous discriminative regions, our SEM is developed to restrain such regions and then discover other complementary regions by considering the relation between activated regions in a stage-by-stage fashion. In each stage, the interactive channel transformation (ICON) module is afterwards designed to exploit correlations across channels of attended activation tensors. Since channels could generally correspond to the parts of fine-grained objects, the part correlation can also be modeled accordingly, which further improves fine-grained retrieval accuracy. Moreover, to be computationally economical, ICON is realized by an efficient two-step process. Finally, the hash learning of our SEMICON consists of both global- and local-level branches for better representing fine-grained objects and then generating binary hash codes explicitly corresponding to multiple levels. Experiments on five benchmark fine-grained datasets show our superiority over competing methods. |
|||||
2022 | Johnson-lindenstrauss Embeddings For Noisy Vectors -- Taking Advantage Of The Noise | Shao Zhen | Arxiv | This paper investigates theoretical properties of subsampling and hashing as tools for approximate Euclidean norm-preserving embeddings for vectors with (unknown) additive Gaussian noise. Such embeddings are sometimes called Johnson-Lindenstrauss embeddings due to the celebrated lemma. Previous work shows that as sparse embeddings, the success of subsampling and hashing closely depends on the \(l_\infty\) to \(l_2\) ratios of the vectors to be mapped. This paper shows that the presence of noise removes this constraint in high dimensions; in other words, sparse embeddings such as subsampling and hashing with embedding dimensions comparable to those of dense embeddings have similar approximate norm-preserving dimensionality-reduction properties. The key is that the noise should be treated as information to be exploited, not simply something to be removed. Theoretical bounds for subsampling and hashing to recover the approximate norm of a high-dimensional vector in the presence of noise are derived, with numerical illustrations showing that better performance is achieved in the presence of noise. |
|||||
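As a concrete reference point for the hashing map analysed above, the snippet below builds a count-sketch-style embedding (each coordinate is sent to one of \(m\) buckets with a random sign) and empirically checks how well it preserves the Euclidean norm of a clean versus a noise-corrupted vector; it only illustrates the object of study and makes no attempt to reproduce the paper's bounds.

```python
import numpy as np

def hashing_embed(x, m, seed=0):
    """Sparse, hashing-based embedding: coordinate i of x is added to bucket
    h(i) with a random sign s(i), so each input coordinate touches one output."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, m, size=x.size)
    signs = rng.choice([-1.0, 1.0], size=x.size)
    y = np.zeros(m)
    np.add.at(y, buckets, signs * x)
    return y

rng = np.random.default_rng(1)
d, m = 100_000, 2_000
x = np.zeros(d)
x[:5] = 1.0                                 # a vector with a large l_inf / l_2 ratio
noisy = x + rng.normal(scale=1e-2, size=d)  # additive Gaussian noise
for v, name in ((x, "noiseless"), (noisy, "noisy")):
    ratio = np.linalg.norm(hashing_embed(v, m)) / np.linalg.norm(v)
    print(f"{name:>9}: embedded-to-original norm ratio = {ratio:.3f}")
```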
2022 | Mlp-hash Protecting Face Templates Via Hashing Of Randomized Multi-layer Perceptron | Shahreza Hatef Otroshi, Hahn Vedrana Krivokuća, Marcel Sébastien | Arxiv | Applications of face recognition systems for authentication purposes are growing rapidly. Although state-of-the-art (SOTA) face recognition systems have high recognition accuracy, the features which are extracted for each user and are stored in the system’s database contain privacy-sensitive information. Accordingly, compromising this data would jeopardize users’ privacy. In this paper, we propose a new cancelable template protection method, dubbed MLP-hash, which generates protected templates by passing the extracted features through a user-specific randomly-weighted multi-layer perceptron (MLP) and binarizing the MLP output. We evaluated the unlinkability, irreversibility, and recognition accuracy of our proposed biometric template protection method to fulfill the ISO/IEC 30136 standard requirements. Our experiments with SOTA face recognition systems on the MOBIO and LFW datasets show that our method has competitive performance with the BioHashing and IoM Hashing (IoM-GRP and IoM-URP) template protection algorithms. We provide an open-source implementation of all the experiments presented in this paper so that other researchers can verify our findings and build upon our work. |
|||||
2022 | Double-hashing Algorithm For Frequency Estimation In Data Streams | Seleznev Nikita, Kumar Senthil, Bruss C. Bayan | Arxiv | Frequency estimation of elements is an important task for summarizing data streams and machine learning applications. The problem is often addressed by using streaming algorithms with sublinear space data structures. These algorithms allow processing of large data while using limited data storage. Commonly used streaming algorithms, such as count-min sketch, have many advantages, but do not take into account properties of a data stream for performance optimization. In the present paper we introduce a novel double-hashing algorithm that provides flexibility to optimize streaming algorithms depending on the properties of a given stream. In the double-hashing approach, first a standard streaming algorithm is employed to obtain an estimate of the element frequencies. This estimate is derived using a fraction of the stream and allows identification of the heavy hitters. Next, it uses a modified hash table where the heavy hitters are mapped into individual buckets and other stream elements are mapped into the remaining buckets. Finally, the element frequencies are estimated based on the constructed hash table over the entire data stream with any streaming algorithm. We demonstrate on both synthetic data and an internet query log dataset that our approach is capable of improving frequency estimation due to removing heavy hitters from the hashing process and, thus, reducing collisions in the hash table. Our approach avoids employing additional machine learning models to identify heavy hitters and, thus, reduces algorithm complexity and streamlines implementation. Moreover, because it is not dependent on specific features of the stream elements for identifying heavy hitters, it is applicable to a large variety of streams. In addition, we propose a procedure on how to dynamically adjust the proposed double-hashing algorithm when frequencies of the elements in a stream are changing over time. |
|||||
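A minimal sketch of the two-pass idea described above is given below: a prefix of the stream is used to identify heavy hitters, those elements then get dedicated exact counters, and everything else is estimated with a small count-min sketch. Using an exact Counter on the prefix and a plain dictionary for the heavy hitters are simplifications of the paper's modified hash table, and the width, depth, and prefix fraction are arbitrary illustration values.

```python
import hashlib
import random
from collections import Counter

class CountMin:
    """Tiny count-min sketch for the non-heavy tail of the stream."""

    def __init__(self, width=512, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for d in range(self.depth):
            h = hashlib.blake2b(f"{d}:{item}".encode(), digest_size=8).digest()
            yield d, int.from_bytes(h, "big") % self.width

    def add(self, item):
        for d, j in self._cells(item):
            self.table[d][j] += 1

    def estimate(self, item):
        return min(self.table[d][j] for d, j in self._cells(item))

def double_hash_counts(stream, prefix_frac=0.1, top_k=10):
    """First pass (on a prefix) finds heavy hitters; second pass gives them
    dedicated counters and sketches the rest, reducing sketch collisions."""
    n_prefix = max(1, int(len(stream) * prefix_frac))
    heavy = {x for x, _ in Counter(stream[:n_prefix]).most_common(top_k)}
    exact, rest = Counter(), CountMin()
    for item in stream:
        if item in heavy:
            exact[item] += 1
        else:
            rest.add(item)
    return lambda item: exact[item] if item in heavy else rest.estimate(item)

stream = ["a"] * 5000 + ["b"] * 3000 + [f"x{i}" for i in range(2000)]
random.seed(0)
random.shuffle(stream)
estimate = double_hash_counts(stream)
print(estimate("a"), estimate("x7"))  # heavy hitter counted exactly, tail item sketched
```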
2022 | Benchmarking Hashing Algorithms For Load Balancing In A Distributed Database Environment | Slesarev Alexander, Mikhailov Mikhail, Chernishev George | Arxiv | Modern high load applications store data using multiple database instances. Such an architecture requires data consistency, and it is important to ensure even distribution of data among nodes. Load balancing is used to achieve these goals. Hashing is the backbone of virtually all load balancing systems. Since the introduction of classic Consistent Hashing, many algorithms have been devised for this purpose. One of the purposes of the load balancer is to ensure storage cluster scalability. It is crucial for the performance of the whole system to transfer as few data records as possible during node addition or removal. The load balancer hashing algorithm has the greatest impact on this process. In this paper we experimentally evaluate several hashing algorithms used for load balancing, conducting both simulated and real system experiments. To evaluate algorithm performance, we have developed a benchmark suite based on Unidata MDM – a scalable toolkit for various Master Data Management (MDM) applications. For assessment, we have employed three criteria – uniformity of the produced distribution, the number of moved records, and computation speed. Following the results of our experiments, we have created a table, in which each algorithm is given an assessment according to the abovementioned criteria. |
|||||
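For context, the snippet below implements the classic consistent-hashing baseline that such benchmarks typically include, and measures the fraction of records that move when a node is added; it is an illustration of the evaluation criterion (number of moved records), not of the Unidata MDM benchmark suite itself.

```python
import bisect
import hashlib

def _h(key):
    """Stable 128-bit hash of a key, used for both nodes and records."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Classic consistent hashing: virtual nodes and records are hashed onto a
    ring, and each record is owned by the first virtual node clockwise, so
    adding a node only remaps records falling on the arcs it takes over."""

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((_h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))

    def locate(self, key):
        i = bisect.bisect(self.ring, (_h(key),)) % len(self.ring)
        return self.ring[i][1]

records = [f"record-{i}" for i in range(10_000)]
before = ConsistentHashRing(["node1", "node2", "node3"])
after = ConsistentHashRing(["node1", "node2", "node3", "node4"])
moved = sum(before.locate(r) != after.locate(r) for r in records)
print(f"{moved / len(records):.1%} of records moved after adding one node")  # ~25%
```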
2022 | Noise-robust De-duplication At Scale | Silcock Emily, D'amico-wong Luca, Yang Jinglin, Dell Melissa | Arxiv | Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210 document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible - despite the massive overall size of the corpus - as duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a re-rank style approach combining a bi- and cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours. We also apply our pre-trained model to the RealNews and patent portions of C4 (Colossal Clean Crawled Corpus), illustrating that a neural approach can identify many near duplicates missed by hashing, in the presence of various types of noise. The public release of our NEWS-COPY de-duplication dataset, codebase, and the pre-trained models will facilitate further research and applications. |
|||||
2021 | Efficient Approximate Search For Sets Of Vectors | Leybovich Michael, Shmueli Oded | Arxiv | We consider a similarity measure between two sets \(A\) and \(B\) of vectors, that balances the average and maximum cosine distance between pairs of vectors, one from set \(A\) and one from set \(B\). As a motivation for this measure, we present lineage tracking in a database. To practically realize this measure, we need an approximate search algorithm that given a set of vectors \(A\) and sets of vectors \(B_1,…,B_n\), the algorithm quickly locates the set \(B_i\) that maximizes the similarity measure. For the case where all sets are singleton sets, essentially each is a single vector, there are known efficient approximate search algorithms, e.g., approximated versions of tree search algorithms, locality-sensitive hashing (LSH), vector quantization (VQ) and proximity graph algorithms. In this work, we present approximate search algorithms for the general case. The underlying idea in these algorithms is encoding a set of vectors via a “long” single vector. The proposed approximate approach achieves significant performance gains over an optimized, exact search on vector sets. |
|||||
2021 | When Similarity Digest Meets Vector Management System A Survey On Similarity Hash Function | Tang Zhushou, Tang Lingyi, Tang Keying, Tang Ruoying | Arxiv | The booming of vector management systems calls for feasible similarity hash functions as a front-end to perform similarity analysis. In this paper, we make a systematic survey of the existing well-known similarity hash functions to tease out the suitable ones. We conclude that the similarity hash functions MinHash and Nilsimsa can be directly marshaled into the pipeline of similarity analysis using a vector management system. After that, we make a brief, empirical discussion of the performance and drawbacks of these functions, and highlight that MinHash, a variant of SimHash, and feature hashing are the best suited to vector management systems for large-scale similarity analysis. |
|||||
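The survey above highlights MinHash as a similarity digest that slots directly into a vector-management pipeline. As a quick reference, the sketch below builds MinHash signatures and estimates Jaccard similarity as the fraction of agreeing signature positions; it is a generic illustration, not code from the survey.

```python
import hashlib

def minhash_signature(tokens, num_hashes=128):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value observed over the token set."""
    signature = []
    for seed in range(num_hashes):
        min_val = min(
            int.from_bytes(
                hashlib.blake2b(tok.encode(), digest_size=8,
                                salt=seed.to_bytes(16, "little")).digest(),
                "little")
            for tok in tokens)
        signature.append(min_val)
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the signatures collide ~ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox leaps over a sleepy dog".split())
sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
true_jaccard = len(doc_a & doc_b) / len(doc_a | doc_b)
print(f"estimated {estimate_jaccard(sig_a, sig_b):.2f} vs exact {true_jaccard:.2f}")
```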
2021 | Hashing-accelerated Graph Neural Networks For Link Prediction | Wu Wei, Li Bin, Luo Chuan, Nejdl Wolfgang | The Web Conference | Networks are ubiquitous in the real world. Link prediction, as one of the key problems for network-structured data, aims to predict whether there exists a link between two nodes. The traditional approaches are based on the explicit similarity computation between the compact node representation by embedding each node into a low-dimensional space. In order to efficiently handle the intensive similarity computation in link prediction, the hashing technique has been successfully used to produce the node representation in the Hamming space. However, the hashing-based link prediction algorithms face accuracy loss from the randomized hashing techniques or inefficiency from the learning to hash techniques in the embedding process. Currently, the Graph Neural Network (GNN) framework has been widely applied to the graph-related tasks in an end-to-end manner, but it commonly requires substantial computational resources and memory costs due to massive parameter learning, which makes the GNN-based algorithms impractical without the help of a powerful workhorse. In this paper, we propose a simple and effective model called #GNN, which balances the trade-off between accuracy and efficiency. #GNN is able to efficiently acquire node representation in the Hamming space for link prediction by exploiting the randomized hashing technique to implement message passing and capture high-order proximity in the GNN framework. Furthermore, we characterize the discriminative power of #GNN in probability. The extensive experimental results demonstrate that the proposed #GNN algorithm achieves accuracy comparable to the learning-based algorithms and outperforms the randomized algorithm, while running significantly faster than the learning-based algorithms. Also, the proposed algorithm shows excellent scalability on a large-scale network with the limited resources. |
|||||
2021 | Federated Nearest Neighbor Classification With A Colony Of Fruit-flies With Supplement | Ram Parikshit, Sinha Kaushik | Arxiv | The mathematical formalization of a neurological mechanism in the olfactory circuit of a fruit-fly as a locality sensitive hash (Flyhash) and bloom filter (FBF) has been recently proposed and “reprogrammed” for various machine learning tasks such as similarity search, outlier detection and text embeddings. We propose a novel reprogramming of this hash and bloom filter to emulate the canonical nearest neighbor classifier (NNC) in the challenging Federated Learning (FL) setup where training and test data are spread across parties and no data can leave their respective parties. Specifically, we utilize Flyhash and FBF to create the FlyNN classifier, and theoretically establish conditions where FlyNN matches NNC. We show how FlyNN is trained exactly in a FL setup with low communication overhead to produce FlyNNFL, and how it can be differentially private. Empirically, we demonstrate that (i) FlyNN matches NNC accuracy across 70 OpenML datasets, (ii) FlyNNFL training is highly scalable with low communication overhead, providing up to \(8\times\) speedup with \(16\) parties. |
|||||
2021 | LLC Accurate Multi-purpose Learnt Low-dimensional Binary Codes | Aditya Kusupati, Matthew Wallingford, Vivek Ramanujan, Raghav Somani, Jae Sung Park, Krishna Pillutla, Prateek Jain, Sham Kakade, Ali Farhadi | Neural Information Processing Systems | Learning binary representations of instances and classes is a classical problem with several high potential applications. In modern settings, the compression of high-dimensional neural representations to low-dimensional binary codes is a challenging task and often require large bit-codes to be accurate. In this work, we propose a novel method for \(\textbf{L}\)earning \(\textbf{L}\)ow-dimensional binary \(\textbf{C}\)odes \((\textbf{LLC})\) for instances as well as classes. Our method does \({\textit{not}}\) require any side-information, like annotated attributes or label meta-data, and learns extremely low-dimensional binary codes (\(\approx 20\) bits for ImageNet-1K). The learnt codes are super-efficient while still ensuring \(\textit{nearly optimal}\) classification accuracy for ResNet50 on ImageNet-1K. We demonstrate that the learnt codes capture intrinsically important features in the data, by discovering an intuitive taxonomy over classes. We further quantitatively measure the quality of our codes by applying it to the efficient image retrieval as well as out-of-distribution (OOD) detection problems. For ImageNet-100 retrieval problem, our learnt binary codes outperform \(16\) bit HashNet using only \(10\) bits and also are as accurate as \(10\) dimensional real representations. Finally, our learnt binary codes can perform OOD detection, out-of-the-box, as accurately as a baseline that needs \(\approx3000\) samples to tune its threshold, while we require \({\textit{none}}\). Code is open-sourced at https://github.com/RAIVNLab/LLC. |
|||||
2021 | Online Hashing With Similarity Learning | Weng Zhenyu, Zhu Yuesheng | Arxiv | Online hashing methods usually learn the hash functions online, aiming to efficiently adapt to the data variations in the streaming environment. However, when the hash functions are updated, the binary codes for the whole database have to be updated to be consistent with the hash functions, resulting in the inefficiency in the online image retrieval process. In this paper, we propose a novel online hashing framework without updating binary codes. In the proposed framework, the hash functions are fixed and a parametric similarity function for the binary codes is learnt online to adapt to the streaming data. Specifically, a parametric similarity function that has a bilinear form is adopted and a metric learning algorithm is proposed to learn the similarity function online based on the characteristics of the hashing methods. The experiments on two multi-label image datasets show that our method is competitive or outperforms the state-of-the-art online hashing methods in terms of both accuracy and efficiency for multi-label image retrieval. |
|||||
2021 | Projective Clustering Product Quantization | Krishnan Aditya, Liberty Edo | Arxiv | This paper suggests the use of projective clustering based product quantization for improving nearest neighbor and max-inner-product vector search (MIPS) algorithms. We provide anisotropic and quantized variants of projective clustering which outperform previous clustering methods used for this problem such as ScaNN. We show that even with comparable running time complexity, in terms of lookup-multiply-adds, projective clustering produces more quantization centers resulting in more accurate dot-product estimates. We provide thorough experimentation to support our claims. |
|||||
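The entry above builds on product quantization (PQ), which splits each vector into sub-vectors, snaps each sub-vector to the nearest centroid of a small per-subspace codebook, and approximates distances by table lookups. The sketch below shows plain PQ encoding and asymmetric distance computation using scikit-learn's KMeans for the codebooks; the projective and anisotropic variants proposed in the paper are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(data, num_subspaces=4, num_centroids=64):
    """Learn one KMeans codebook per subspace. data: (n, d) with d divisible by num_subspaces."""
    sub_dim = data.shape[1] // num_subspaces
    codebooks = []
    for m in range(num_subspaces):
        block = data[:, m * sub_dim:(m + 1) * sub_dim]
        km = KMeans(n_clusters=num_centroids, n_init=4, random_state=0).fit(block)
        codebooks.append(km.cluster_centers_)
    return codebooks

def encode_pq(codebooks, vectors):
    """Replace each sub-vector by the index of its nearest centroid."""
    sub_dim = codebooks[0].shape[1]
    codes = np.empty((len(vectors), len(codebooks)), dtype=np.uint8)
    for m, centers in enumerate(codebooks):
        block = vectors[:, m * sub_dim:(m + 1) * sub_dim]
        dists = ((block[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        codes[:, m] = dists.argmin(axis=1)
    return codes

def adc_distances(codebooks, codes, query):
    """Asymmetric distance computation: precompute query-to-centroid tables,
    then approximate each distance as a sum of table lookups."""
    sub_dim = codebooks[0].shape[1]
    tables = [((query[m * sub_dim:(m + 1) * sub_dim] - centers) ** 2).sum(-1)
              for m, centers in enumerate(codebooks)]
    return sum(tables[m][codes[:, m]] for m in range(len(codebooks)))

rng = np.random.default_rng(0)
base = rng.standard_normal((5000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)
books = train_pq(base)
codes = encode_pq(books, base)
approx = adc_distances(books, codes, query)
print("approximate nearest id:", int(approx.argmin()))
```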
2021 | Semantic-aware Binary Code Representation With BERT | Koo Hyungjoon, Park Soyeon, Choi Daejin, Kim Taesoo | Arxiv | A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovery of contextual meanings on a binary code. Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary instead of manually crafting specifics of the analysis algorithm. However, the existing approaches utilizing machine learning are still specialized to solve one domain of problems, requiring models to be recreated for different types of binary analysis. In this paper, we propose DeepSemantic, utilizing BERT to produce a semantic-aware code representation of a binary code. To this end, we introduce well-balanced instruction normalization that preserves rich information for each instruction while minimizing the out-of-vocabulary (OOV) problem. DeepSemantic has been carefully designed based on our study with large swaths of binaries. Besides, DeepSemantic leverages the essence of the BERT architecture by re-purposing a pre-trained generic model that is readily available as a one-time processing step, followed by quickly applying specific downstream tasks with a fine-tuning process. We demonstrate DeepSemantic with two downstream tasks, namely, binary similarity comparison and compiler provenance (i.e., compiler and optimization level) prediction. Our experimental results show that the binary similarity model outperforms two state-of-the-art binary similarity tools, DeepBinDiff and SAFE, by 49.84% and 15.83% on average, respectively. |
|||||
2021 | Linear-time Self Attention With Codeword Histogram For Efficient Recommendation | Wu Yongji, Lian Defu, Gong Neil Zhenqiang, Yin Lu, Yin Mingyang, Zhou Jingren, Yang Hongxia | Arxiv | Self-attention has become increasingly popular in a variety of sequence modeling tasks from natural language processing to recommendation, due to its effectiveness. However, self-attention suffers from quadratic computational and memory complexities, prohibiting its applications on long sequences. Existing approaches that address this issue mainly rely on a sparse attention context, either using a local window, or a permuted bucket obtained by locality-sensitive hashing (LSH) or sorting, while crucial information may be lost. Inspired by the idea of vector quantization that uses cluster centroids to approximate items, we propose LISA (LInear-time Self Attention), which enjoys both the effectiveness of vanilla self-attention and the efficiency of sparse attention. LISA scales linearly with the sequence length, while enabling full contextual attention via computing differentiable histograms of codeword distributions. Meanwhile, unlike some efficient attention methods, our method poses no restriction on causal masking or sequence length. We evaluate our method on four real-world datasets for sequential recommendation. The results show that LISA outperforms the state-of-the-art efficient attention methods in both performance and speed; and it is up to 57x faster and 78x more memory efficient than vanilla self-attention. |
|||||
2021 | Low-precision Quantization For Efficient Nearest Neighbor Search | Ko Anthony, Keivanloo Iman, Lakshman Vihan, Schkufza Eric | Arxiv | Fast k-Nearest Neighbor search over real-valued vector spaces (KNN) is an important algorithmic task for information retrieval and recommendation systems. We present a method for using reduced precision to represent vectors through quantized integer values, enabling both a reduction in the memory overhead of indexing these vectors and faster distance computations at query time. While most traditional quantization techniques focus on minimizing the reconstruction error between a point and its uncompressed counterpart, we focus instead on preserving the behavior of the underlying distance metric. Furthermore, our quantization approach is applied at the implementation level and can be combined with existing KNN algorithms. Our experiments on both open source and proprietary datasets across multiple popular KNN frameworks validate that quantized distance metrics can reduce memory by 60% and improve query throughput by 30%, while incurring only a 2% reduction in recall. |
|||||
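A minimal version of the low-precision idea described above, assuming a single global scale and symmetric int8 quantization, is sketched below: distances are computed directly on the integer codes, which preserves the distance ordering up to a constant factor. The paper's specific quantization scheme and its integration with existing KNN frameworks are not reproduced.

```python
import numpy as np

def fit_scale(vectors, bits=8):
    """One global scale mapping the observed range onto the signed integer range."""
    max_abs = np.abs(vectors).max()
    return max_abs / (2 ** (bits - 1) - 1)

def quantize(vectors, scale):
    q = np.round(vectors / scale)
    return np.clip(q, -127, 127).astype(np.int8)

def l2_int8(query_q, base_q):
    """Squared L2 on int8 codes; accumulate in int32 to avoid overflow.
    Up to a constant scale**2 factor this preserves the distance ordering."""
    diff = base_q.astype(np.int32) - query_q.astype(np.int32)
    return (diff * diff).sum(axis=1)

rng = np.random.default_rng(1)
base = rng.standard_normal((10000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

scale = fit_scale(np.vstack([base, query[None, :]]))
base_q, query_q = quantize(base, scale), quantize(query, scale)

exact = ((base - query) ** 2).sum(axis=1)
approx = l2_int8(query_q, base_q)
overlap = len(set(exact.argsort()[:10]) & set(approx.argsort()[:10]))
print(f"top-10 overlap between exact and int8 distances: {overlap}/10")
```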
2021 | A Comprehensive Survey And Experimental Comparison Of Graph-based Approximate Nearest Neighbor Search | Wang Mengzhao, Xu Xiaoliang, Yue Qiang, Wang Yuxiang | Arxiv | Approximate nearest neighbor search (ANNS) constitutes an important operation in a multitude of applications, including recommendation systems, information retrieval, and pattern recognition. In the past decade, graph-based ANNS algorithms have been the leading paradigm in this domain, with dozens of graph-based ANNS algorithms proposed. Such algorithms aim to provide effective, efficient solutions for retrieving the nearest neighbors for a given query. Nevertheless, these efforts focus on developing and optimizing algorithms with different approaches, so there is a real need for a comprehensive survey about the approaches' relative performance, strengths, and pitfalls. Thus, here we provide a thorough comparative analysis and experimental evaluation of 13 representative graph-based ANNS algorithms via a new taxonomy and fine-grained pipeline. We compared each algorithm in a uniform test environment on eight real-world datasets and 12 synthetic datasets with varying sizes and characteristics. Our study yields novel discoveries and offers several useful principles for improving algorithms, which we use to design an optimized method that outperforms the state-of-the-art algorithms. This effort also helped us pinpoint the working portions of each algorithm, along with rule-of-thumb recommendations about promising research directions and suitable algorithms for practitioners in different fields. |
|||||
2021 | Locality Sensitive Hashing For Efficient Similar Polygon Retrieval | Kaplan Haim, Tenenbaum Jay | Arxiv | Locality Sensitive Hashing (LSH) is an effective method of indexing a set of items to support efficient nearest neighbors queries in high-dimensional spaces. The basic idea of LSH is that similar items should produce hash collisions with higher probability than dissimilar items. We study LSH for (not necessarily convex) polygons, and use it to give efficient data structures for similar shape retrieval. Arkin et al. represent polygons by their “turning function” - a function which follows the angle between the polygon’s tangent and the \( x \)-axis while traversing the perimeter of the polygon. They define the distance between polygons to be variations of the \( L_p \) (for \(p=1,2\)) distance between their turning functions. This metric is invariant under translation, rotation and scaling (and the selection of the initial point on the perimeter) and therefore models well the intuitive notion of shape resemblance. We develop and analyze LSH near neighbor data structures for several variations of the \( L_p \) distance for functions (for \(p=1,2\)). By applying our schemes to the turning functions of a collection of polygons we obtain efficient near neighbor LSH-based structures for polygons. To tune our structures to turning functions of polygons, we prove some new properties of these turning functions that may be of independent interest. As part of our analysis, we address the following problem which is of independent interest. Find the vertical translation of a function \( f \) that is closest in \( L_1 \) distance to a function \( g \). We prove tight bounds on the approximation guarantee obtained by the translation which is equal to the difference between the averages of \( g \) and \( f \). |
|||||
2021 | DEANN Speeding Up Kernel-density Estimation Using Approximate Nearest Neighbor Search | Karppa Matti, Aumüller Martin, Pagh Rasmus | Arxiv | Kernel Density Estimation (KDE) is a nonparametric method for estimating the shape of a density function, given a set of samples from the distribution. Recently, locality-sensitive hashing, originally proposed as a tool for nearest neighbor search, has been shown to enable fast KDE data structures. However, these approaches do not take advantage of the many other advances that have been made in nearest neighbor search algorithms. We present an algorithm called Density Estimation from Approximate Nearest Neighbors (DEANN) where we apply Approximate Nearest Neighbor (ANN) algorithms as a black box subroutine to compute an unbiased KDE. The idea is to find points that have a large contribution to the KDE using ANN, compute their contribution exactly, and approximate the remainder with Random Sampling (RS). We present a theoretical argument that supports the idea that an ANN subroutine can speed up the evaluation. Furthermore, we provide a C++ implementation with a Python interface that can make use of an arbitrary ANN implementation as a subroutine for kernel density estimation. We show empirically that our implementation outperforms state-of-the-art implementations in all high dimensional datasets we considered, and matches the performance of RS in cases where the ANN yields no gains in performance. |
|||||
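DEANN's core recipe, as summarized above, is to evaluate the kernel exactly on the query's (approximate) nearest neighbors and to estimate the remaining mass by random sampling. The sketch below illustrates that split with brute-force k-NN standing in for the ANN black box; any ANN index could be substituted, and the function name and parameters are illustrative.

```python
import numpy as np

def deann_style_kde(query, data, bandwidth=0.5, k=50, num_samples=200, rng=None):
    """KDE estimate in the DEANN spirit: exact kernel on the k nearest points,
    random-sampling estimate of the remaining contribution.
    Brute-force k-NN stands in for the ANN subroutine here."""
    rng = rng or np.random.default_rng(0)
    n = len(data)
    dists = np.linalg.norm(data - query, axis=1)
    near_idx = np.argpartition(dists, k)[:k]

    kernel = lambda d: np.exp(-(d ** 2) / (2 * bandwidth ** 2))
    near_mass = kernel(dists[near_idx]).sum()

    # Estimate the far-away contribution by uniform sampling of the rest.
    far_idx = np.setdiff1d(np.arange(n), near_idx)
    sample = rng.choice(far_idx, size=min(num_samples, len(far_idx)), replace=False)
    far_mass = kernel(dists[sample]).mean() * len(far_idx)

    return (near_mass + far_mass) / n

rng = np.random.default_rng(2)
data = rng.standard_normal((20000, 16))
query = rng.standard_normal(16)
exact = np.exp(-np.linalg.norm(data - query, axis=1) ** 2 / (2 * 0.5 ** 2)).mean()
print(f"estimate {deann_style_kde(query, data):.5f} vs exact {exact:.5f}")
```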
2021 | Contrastive Quantization With Code Memory For Unsupervised Image Retrieval | Wang Jinpeng, Zeng Ziyun, Chen Bin, Dai Tao, Xia Shu-tao | Arxiv | The high efficiency in computation and storage makes hashing (including binary hashing and quantization) a common strategy in large-scale retrieval systems. To alleviate the reliance on expensive annotations, unsupervised deep hashing becomes an important research problem. This paper provides a novel solution to unsupervised deep quantization, namely Contrastive Quantization with Code Memory (MeCoQ). Different from existing reconstruction-based strategies, we learn unsupervised binary descriptors by contrastive learning, which can better capture discriminative visual semantics. Besides, we uncover that codeword diversity regularization is critical to prevent contrastive learning-based quantization from model degeneration. Moreover, we introduce a novel quantization code memory module that boosts contrastive learning with lower feature drift than conventional feature memories. Extensive experiments on benchmark datasets show that MeCoQ outperforms state-of-the-art methods. Code and configurations are publicly available at https://github.com/gimpong/AAAI22-MeCoQ. |
|||||
2021 | Instance-weighted Central Similarity For Multi-label Image Retrieval | Zhang Zhiwei, Peng Hanyu | Arxiv | Deep hashing has been widely applied to large-scale image retrieval by encoding high-dimensional data points into binary codes for efficient retrieval. Compared with pairwise/triplet similarity based hash learning, central similarity based hashing can more efficiently capture the global data distribution. For multi-label image retrieval, however, previous methods only use multiple hash centers with equal weights to generate one centroid as the learning target, which ignores the relationship between the weights of hash centers and the proportion of instance regions in the image. To address the above issue, we propose a two-step alternative optimization approach, Instance-weighted Central Similarity (ICS), to automatically learn the center weight corresponding to a hash code. Firstly, we apply the maximum entropy regularizer to prevent one hash center from dominating the loss function, and compute the center weights via projection gradient descent. Secondly, we update neural network parameters by standard back-propagation with fixed center weights. More importantly, the learned center weights can well reflect the proportion of foreground instances in the image. Our method achieves the state-of-the-art performance on the image retrieval benchmarks, and especially improves the mAP by 1.6%-6.4% on the MS COCO dataset. |
|||||
2021 | A Fast Randomized Algorithm For Massive Text Normalization | Jiang Nan, Luo Chen, Lakshman Vihan, Dattatreya Yesh, Xue Yexiang | Arxiv | Many popular machine learning techniques in natural language processing and data mining rely heavily on high-quality text sources. However real-world text datasets contain a significant amount of spelling errors and improperly punctuated variants where the performance of these models would quickly deteriorate. Moreover, real-world, web-scale datasets contain hundreds of millions or even billions of lines of text, where the existing text cleaning tools are prohibitively expensive to execute over and may require an overhead to learn the corrections. In this paper, we present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data. Our algorithm relies on the Jaccard similarity between words to suggest correction results. We efficiently handle the pairwise word-to-word comparisons via Locality Sensitive Hashing (LSH). We also propose a novel stabilization process to address the issue of hash collisions between dissimilar words, which is a consequence of the randomized nature of LSH and is exacerbated by the massive scale of real-world datasets. Compared with existing approaches, our method is more efficient, both asymptotically and in empirical evaluations, and does not rely on additional features, such as lexical/phonetic similarity or word embedding features. In addition, FLAN does not require any annotated data or supervised learning. We further theoretically show the robustness of our algorithm with upper bounds on the false positive and false negative rates of corrections. Our experimental results on real-world datasets demonstrate the efficiency and efficacy of FLAN. |
|||||
2021 | Joint Representation Learning And Novel Category Discovery On Single- And Multi-modal Data | Jia Xuhui, Han Kai, Zhu Yukun, Green Bradley | Arxiv | This paper studies the problem of novel category discovery on single- and multi-modal data with labels from different but relevant categories. We present a generic, end-to-end framework to jointly learn a reliable representation and assign clusters to unlabelled data. To avoid over-fitting the learnt embedding to labelled data, we take inspiration from self-supervised representation learning by noise-contrastive estimation and extend it to jointly handle labelled and unlabelled data. In particular, we propose using category discrimination on labelled data and cross-modal discrimination on multi-modal data to augment instance discrimination used in conventional contrastive learning approaches. We further employ Winner-Take-All (WTA) hashing algorithm on the shared representation space to generate pairwise pseudo labels for unlabelled data to better predict cluster assignments. We thoroughly evaluate our framework on large-scale multi-modal video benchmarks Kinetics-400 and VGG-Sound, and image benchmarks CIFAR10, CIFAR100 and ImageNet, obtaining state-of-the-art results. |
|||||
2021 | Picarrange -- Visually Sort Search And Explore Private Images On A Mac Computer | Jung Klaus, Barthel Kai Uwe, Hezel Nico, Schall Konstantin | Arxiv | The native macOS application PicArrange integrates state-of-the-art image sorting and similarity search to enable users to get a better overview of their images. Many file and image management features have been added to make it a tool that addresses a full image management workflow. A modification of the Self Sorting Map algorithm enables a list-like image arrangement without losing the visual sorting. Efficient calculation and storage of visual features as well as the use of many macOS APIs result in an application that is fluid to use. |
|||||
2021 | Senatus -- A Fast And Accurate Code-to-code Recommendation Engine | Silavong Fran, Moran Sean, Georgiadis Antonios, Saphal Rohan, Otter Robert | Arxiv | Machine learning on source code (MLOnCode) is a popular research field that has been driven by the availability of large-scale code repositories and the development of powerful probabilistic and deep learning models for mining source code. Code-to-code recommendation is a task in MLOnCode that aims to recommend relevant, diverse and concise code snippets that usefully extend the code currently being written by a developer in their development environment (IDE). Code-to-code recommendation engines hold the promise of increasing developer productivity by reducing context switching from the IDE and increasing code-reuse. Existing code-to-code recommendation engines do not scale gracefully to large codebases, exhibiting a linear growth in query time as the code repository increases in size. In addition, existing code-to-code recommendation engines fail to account for the global statistics of code repositories in the ranking function, such as the distribution of code snippet lengths, leading to sub-optimal retrieval results. We address both of these weaknesses with Senatus, a new code-to-code recommendation engine. At the core of Senatus is De-Skew LSH a new locality sensitive hashing (LSH) algorithm that indexes the data for fast (sub-linear time) retrieval while also counteracting the skewness in the snippet length distribution using novel abstract syntax tree-based feature scoring and selection algorithms. We evaluate Senatus and find the recommendations to be of higher quality than competing baselines, while achieving faster search. For example on the CodeSearchNet dataset Senatus improves performance by 31.21\% F1 and 147.9x faster query time compared to Facebook Aroma. Senatus also outperforms standard MinHash LSH by 29.2\% F1 and 51.02x faster query time. |
|||||
2021 | The Many Faces Of Anger A Multicultural Video Dataset Of Negative Emotions In The Wild (mfa-wild) | Javadi Roya, Lim Angelica | Arxiv | The portrayal of negative emotions such as anger can vary widely between cultures and contexts, depending on the acceptability of expressing full-blown emotions rather than suppression to maintain harmony. The majority of emotional datasets collect data under the broad label "anger", but social signals can range from annoyed, contemptuous, angry, furious, hateful, and more. In this work, we curated the first in-the-wild multicultural video dataset of emotions, and deeply explored anger-related emotional expressions by asking culture-fluent annotators to label the videos with 6 labels and 13 emojis in a multi-label framework. We provide a baseline multi-label classifier on our dataset, and show how emojis can be effectively used as a language-agnostic tool for annotation. |
|||||
2021 | Similarity Guided Deep Face Image Retrieval | Jang Young Kyun, Cho Nam Ik | Arxiv | Face image retrieval, which searches for images of the same identity as the query input face image, is drawing more attention as the size of the image database increases rapidly. In order to conduct fast and accurate retrieval, compact hash code-based methods have been proposed, and recently, deep face image hashing methods with supervised classification training have shown outstanding performance. However, the classification-based scheme has a disadvantage in that it cannot incorporate complex similarities between face images into hash code learning. In this paper, we attempt to improve the face image retrieval quality by proposing a Similarity Guided Hashing (SGH) method, which gently considers self and pairwise-similarity simultaneously. SGH employs various data augmentations designed to explore elaborate similarities between face images, solving both intra and inter identity-wise difficulties. Extensive experimental results on the protocols with existing benchmarks and an additionally proposed large scale higher resolution face image dataset demonstrate that our SGH delivers state-of-the-art retrieval performance. |
|||||
2021 | HHF Hashing-guided Hinge Function For Deep Hashing Retrieval | Xu Chengyin, Chai Zenghao, Xu Zhengzhuo, Li Hongjia, Zuo Qiruyi, Yang Lingyu, Yuan Chun | Arxiv | Deep hashing has shown promising performance in large-scale image retrieval. However, latent codes extracted by Deep Neural Networks (DNNs) will inevitably lose semantic information during the binarization process, which damages the retrieval accuracy and makes it challenging. Although many existing approaches perform regularization to alleviate quantization errors, we identify an inherent conflict between metric learning and quantization learning. The metric loss penalizes inter-class distances, pushing different classes far apart without constraint. Worse still, it tends to make the latent codes deviate from the ideal binarization points and generates severe ambiguity in the binarization process. Based on the minimum distance of the binary linear code, we creatively propose the Hashing-guided Hinge Function (HHF) to avoid such conflict. In detail, the carefully-designed inflection point, which relies on the hash bit length and category numbers, is explicitly adopted to balance the metric term and quantization term. Such a modification prevents the network from falling into local metric optimal minima in deep hashing. Extensive experiments on CIFAR-10, CIFAR-100, ImageNet, and MS-COCO show that HHF consistently outperforms existing techniques, and is robust and flexible enough to be transplanted into other methods. Code is available at https://github.com/JerryXu0129/HHF. |
|||||
2021 | Deep Hash Distillation For Image Retrieval | Jang Young Kyun, Gu Geonmo, Ko Byungsoo, Kang Isaac, Cho Nam Ik | Arxiv | In hash-based image retrieval systems, degraded or transformed inputs usually generate different codes from the original, deteriorating the retrieval accuracy. To mitigate this issue, data augmentation can be applied during training. However, even if augmented samples of an image are similar in real feature space, the quantization can scatter them far away in Hamming space. This results in representation discrepancies that can impede training and degrade performance. In this work, we propose a novel self-distilled hashing scheme to minimize the discrepancy while exploiting the potential of augmented data. By transferring the hash knowledge of the weakly-transformed samples to the strong ones, we make the hash code insensitive to various transformations. We also introduce hash proxy-based similarity learning and binary cross entropy-based quantization loss to provide fine quality hash codes. Ultimately, we construct a deep hashing framework that not only improves the existing deep hashing approaches, but also achieves the state-of-the-art retrieval results. Extensive experiments are conducted and confirm the effectiveness of our work. |
|||||
2021 | MOON Multi-hash Codes Joint Learning For Cross-media Retrieval | Zhang Donglin, Wu Xiao-jun, Yin He-feng, Kittler Josef | Arxiv | In recent years, cross-media hashing technique has attracted increasing attention for its high computation efficiency and low storage cost. However, the existing approaches still have some limitations, which need to be explored. 1) A fixed hash length (e.g., 16bits or 32bits) is predefined before learning the binary codes. Therefore, these models need to be retrained when the hash length changes, that consumes additional computation power, reducing the scalability in practical applications. 2) Existing cross-modal approaches only explore the information in the original multimedia data to perform the hash learning, without exploiting the semantic information contained in the learned hash codes. To this end, we develop a novel Multiple hash cOdes jOint learNing method (MOON) for cross-media retrieval. Specifically, the developed MOON synchronously learns the hash codes with multiple lengths in a unified framework. Besides, to enhance the underlying discrimination, we combine the clues from the multimodal data, semantic labels and learned hash codes for hash learning. As far as we know, the proposed MOON is the first work to simultaneously learn different length hash codes without retraining in cross-media retrieval. Experiments on several databases show that our MOON can achieve promising performance, outperforming some recent competitive shallow and deep methods. |
|||||
2021 | Self-supervised Product Quantization For Deep Unsupervised Image Retrieval | Jang Young Kyun, Cho Nam Ik | Arxiv | Supervised deep learning-based hash and vector quantization are enabling fast and large-scale image retrieval systems. By fully exploiting label annotations, they are achieving outstanding retrieval performances compared to the conventional methods. However, it is painstaking to assign labels precisely for a vast amount of training data, and also, the annotation process is error-prone. To tackle these issues, we propose the first deep unsupervised image retrieval method dubbed Self-supervised Product Quantization (SPQ) network, which is label-free and trained in a self-supervised manner. We design a Cross Quantized Contrastive learning strategy that jointly learns codewords and deep visual descriptors by comparing individually transformed images (views). Our method analyzes the image contents to extract descriptive features, allowing us to understand image representations for accurate retrieval. By conducting extensive experiments on benchmarks, we demonstrate that the proposed method yields state-of-the-art results even without supervised pretraining. |
|||||
2021 | Improved Deep Classwise Hashing With Centers Similarity Learning For Image Retrieval | Zhang Ming, Yan Hong | Arxiv | Deep supervised hashing for image retrieval has attracted researchers’ attention due to its high efficiency and superior retrieval performance. Most existing deep supervised hashing works, which are based on pairwise/triplet labels, suffer from the expensive computational cost and insufficient utilization of the semantics information. Recently, deep classwise hashing introduced a classwise loss supervised by class labels information alternatively; however, we find it still has its drawback. In this paper, we propose an improved deep classwise hashing, which enables hashing learning and class centers learning simultaneously. Specifically, we design a two-step strategy on center similarity learning. It interacts with the classwise loss to attract the class center to concentrate on the intra-class samples while pushing other class centers as far as possible. The centers similarity learning contributes to generating more compact and discriminative hashing codes. We conduct experiments on three benchmark datasets. It shows that the proposed method effectively surpasses the original method and outperforms state-of-the-art baselines under various commonly-used evaluation metrics for image retrieval. |
|||||
2021 | Yandex Text-to-Image-1B | Yandex | NeurIPS | Yandex Text-to-Image-1B is a new cross-modal dataset (text and visual), where database and query vectors have different distributions in a shared representation space. The base set consists of image embeddings produced by the SE-ResNeXt-101 model, and queries are textual embeddings produced by a variant of the DSSM model. Since the distributions are different, a 50M sample of the query distribution is provided. |
|||||
2021 | Joint Learning Of Deep Retrieval Model And Product Quantization Based Embedding Index | Zhang Han, Shen Hongwei, Qiu Yiming, Jiang Yunjiang, Wang Songlin, Xu Sulong, Xiao Yun, Long Bo, Yang Wen-yun | Arxiv | An embedding index that enables fast approximate nearest neighbor (ANN) search serves as an indispensable component of state-of-the-art deep retrieval systems. Traditional approaches, which often separate the two steps of embedding learning and index building, incur additional indexing time and degraded retrieval accuracy. In this paper, we propose a novel method called Poeem, which stands for product quantization based embedding index jointly trained with a deep retrieval model, to unify the two separate steps within end-to-end training, by utilizing a few techniques including the gradient straight-through estimator, a warm start strategy, optimal space decomposition and Givens rotation. Extensive experimental results show that the proposed method not only improves retrieval accuracy significantly but also reduces the indexing time to almost none. We have open-sourced our approach for the sake of comparison and reproducibility. |
|||||
2021 | Yandex DEEP-1B | Yandex | NeurIPS | Yandex DEEP-1B image descriptor dataset consisting of the projected and normalized outputs from the last fully-connected layer of the GoogLeNet model, which was pretrained on the Imagenet classification task. |
|||||
2021 | Efficient Passage Retrieval With Hashing For Open-domain Question Answering | Yamada Ikuya, Asai Akari, Hajishirzi Hannaneh | Arxiv | Most state-of-the-art open-domain question answering systems use a neural retrieval model to encode passages into continuous vectors and extract them from a knowledge source. However, such retrieval models often require large memory to run because of the massive size of their passage index. In this paper, we introduce Binary Passage Retriever (BPR), a memory-efficient neural retrieval model that integrates a learning-to-hash technique into the state-of-the-art Dense Passage Retriever (DPR) to represent the passage index using compact binary codes rather than continuous vectors. BPR is trained with a multi-task objective over two tasks: efficient candidate generation based on binary codes and accurate reranking based on continuous vectors. Compared with DPR, BPR substantially reduces the memory cost from 65GB to 2GB without a loss of accuracy on two standard open-domain question answering benchmarks: Natural Questions and TriviaQA. Our code and trained models are available at https://github.com/studio-ousia/bpr. |
|||||
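BPR, described above, pairs cheap candidate generation on binary codes with accurate reranking on continuous vectors. The snippet below illustrates that two-stage retrieval pattern on random data: embeddings are binarized by sign and packed into bytes, Hamming distances (XOR plus popcount) select candidates, and inner products on the original float vectors rerank them. This shows the retrieval pattern only, not the trained BPR model.

```python
import numpy as np

def pack_sign_bits(vectors):
    """Binarize by sign and pack into uint8 for compact storage."""
    return np.packbits(vectors > 0, axis=1)

# Lookup table mapping a byte to its popcount, for fast Hamming distances.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_distances(query_code, base_codes):
    return POPCOUNT[np.bitwise_xor(base_codes, query_code)].sum(axis=1)

def two_stage_search(query_vec, query_code, base_vecs, base_codes,
                     num_candidates=100, top_k=10):
    # Stage 1: cheap Hamming-distance candidate generation on binary codes.
    cand = np.argpartition(hamming_distances(query_code, base_codes),
                           num_candidates)[:num_candidates]
    # Stage 2: accurate reranking with inner products on continuous vectors.
    scores = base_vecs[cand] @ query_vec
    return cand[np.argsort(-scores)[:top_k]]

rng = np.random.default_rng(3)
passages = rng.standard_normal((50000, 256)).astype(np.float32)
query = rng.standard_normal(256).astype(np.float32)
codes = pack_sign_bits(passages)
qcode = pack_sign_bits(query[None, :])[0]
print(two_stage_search(query, qcode, passages, codes))
```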
2021 | Binary Code Based Hash Embedding For Web-scale Applications | Yan Bencheng, Wang Pengjie, Liu Jinquan, Lin Wei, Lee Kuang-chih, Xu Jian, Zheng Bo | Arxiv | Nowadays, deep learning models are widely adopted in web-scale applications such as recommender systems and online advertising. In these applications, embedding learning of categorical features is crucial to the success of deep learning models. In these models, a standard method is that each categorical feature value is assigned a unique embedding vector which can be learned and optimized. Although this method can well capture the characteristics of the categorical features and promise good performance, it can incur a huge memory cost to store the embedding table, especially for web-scale applications. Such a huge memory cost significantly holds back the effectiveness and usability of such embedding-based deep models. In this paper, we propose a binary code based hash embedding method which allows the size of the embedding table to be reduced by an arbitrary factor without compromising too much performance. Experimental evaluation results show that, with our proposed method, one can still achieve 99% of the original performance even if the embedding table is reduced to a size 1000\(\times\) smaller than the original. |
|||||
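One plausible, minimal realization of the general idea in the entry above is sketched below: each categorical value is hashed to a short binary code, and its embedding is composed as the sum of per-position, per-bit embedding vectors, so the table holds only 2*b rows (two per bit position) regardless of vocabulary size. This is a hedged simplification; the exact code construction and composition used in the paper are not reproduced.

```python
import hashlib
import numpy as np

class BinaryCodeHashEmbedding:
    """Compose an embedding for a categorical value from a small table of
    per-bit embeddings instead of one row per distinct value.
    A hashed b-bit code selects one vector per bit position (two choices per
    position), and the selected vectors are summed. This is a simplified
    sketch of the general idea, not the paper's exact method."""
    def __init__(self, code_bits=32, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.code_bits = code_bits
        # table[pos, bit_value] is a dim-dimensional (learnable) vector.
        self.table = 0.1 * rng.standard_normal((code_bits, 2, dim)).astype(np.float32)

    def _code(self, value):
        digest = hashlib.blake2b(str(value).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "little")
        return [(h >> i) & 1 for i in range(self.code_bits)]

    def embed(self, value):
        bits = self._code(value)
        return sum(self.table[pos, bit] for pos, bit in enumerate(bits))

emb = BinaryCodeHashEmbedding()
print(emb.embed("user_123456").shape)   # (16,) regardless of vocabulary size
```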
2021 | Multi-modal Mutual Information Maximization A Novel Approach For Unsupervised Deep Cross-modal Hashing | Hoang Tuan, Do Thanh-toan, Nguyen Tam V., Cheung Ngai-man | Arxiv | In this paper, we adopt the maximizing mutual information (MI) approach to tackle the problem of unsupervised learning of binary hash codes for efficient cross-modal retrieval. We proposed a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH). First, to learn informative representations that can preserve both intra- and inter-modal similarities, we leverage the recent advances in estimating variational lower-bound of MI to maximize the MI between the binary representations and input features and between binary representations of different modalities. By jointly maximizing these MIs under the assumption that the binary representations are modelled by multivariate Bernoulli distributions, we can learn binary representations, which can preserve both intra- and inter-modal similarities, effectively in a mini-batch manner with gradient descent. Furthermore, we find out that trying to minimize the modality gap by learning similar binary representations for the same instance from different modalities could result in less informative representations. Hence, balancing between reducing the modality gap and losing modality-private information is important for the cross-modal retrieval tasks. Quantitative evaluations on standard benchmark datasets demonstrate that the proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods. |
|||||
2021 | One Loss For All Deep Hashing With A Single Cosine Similarity Based Learning Objective | Hoe Jiun Tian, Ng Kam Woh, Zhang Tianyu, Chan Chee Seng, Song Yi-zhe, Xiang Tao | Arxiv | A deep hashing model typically has two main learning objectives: to make the learned binary hash codes discriminative and to minimize a quantization error. With further constraints such as bit balance and code orthogonality, it is not uncommon for existing models to employ a large number (>4) of losses. This leads to difficulties in model training and subsequently impedes their effectiveness. In this work, we propose a novel deep hashing model with only a single learning objective. Specifically, we show that maximizing the cosine similarity between the continuous codes and their corresponding binary orthogonal codes can ensure both hash code discriminativeness and quantization error minimization. Further, with this learning objective, code balancing can be achieved by simply using a Batch Normalization (BN) layer and multi-label classification is also straightforward with label smoothing. The result is a one-loss deep hashing model that removes all the hassles of tuning the weights of various losses. Importantly, extensive experiments show that our model is highly effective, outperforming the state-of-the-art multi-loss hashing models on three large-scale instance retrieval benchmarks, often by significant margins. Code is available at https://github.com/kamwoh/orthohash |
|||||
2021 | Beyond Neighbourhood-preserving Transformations For Quantization-based Unsupervised Hashing | Hemati Sobhan, Tizhoosh H. R. | Arxiv | An effective unsupervised hashing algorithm leads to compact binary codes preserving the neighborhood structure of data as much as possible. One of the most established schemes for unsupervised hashing is to reduce the dimensionality of data and then find a rigid (neighbourhood-preserving) transformation that reduces the quantization error. Although employing rigid transformations is effective, we may not be able to reduce quantization loss to its ultimate limit. Moreover, reducing dimensionality and quantization loss in two separate steps seems to be sub-optimal. Motivated by these shortcomings, we propose to employ both rigid and non-rigid transformations to reduce quantization error and dimensionality simultaneously. We relax the orthogonality constraint on the projection in a PCA-formulation and regularize this by a quantization term. We show that both the non-rigid projection matrix and rotation matrix contribute towards minimizing quantization loss but in different ways. A scalable nested coordinate descent approach is proposed to optimize this mixed-integer optimization problem. We evaluate the proposed method on five public benchmark datasets providing almost half a million images. Comparative results indicate that the proposed method mostly outperforms state-of-the-art linear methods and competes with end-to-end deep solutions. |
|||||
2021 | Fake-image Detection With Robust Hashing | Tanaka Miki, Kiya Hitoshi | Arxiv | In this paper, we investigate, for the first time, whether robust hashing can reliably detect fake images even when multiple manipulation techniques, such as JPEG compression, are applied to the images. In an experiment, the proposed fake-image detection with robust hashing is demonstrated to outperform state-of-the-art methods on various datasets, including fake images generated with GANs. |
|||||
2021 | Self-supervised Video Retrieval Transformer Network | He Xiangteng, Pan Yulin, Tang Mingqian, Lv Yiliang | Arxiv | Content-based video retrieval aims to find videos from a large video database that are similar to or even near-duplicates of a given query video. Video representation and similarity search algorithms are crucial to any video retrieval system. To derive effective video representations, most video retrieval systems require a large amount of manually annotated data for training, making them costly and inefficient. In addition, most retrieval systems are based on frame-level features for video similarity searching, making them expensive in terms of both storage and search. We propose a novel video retrieval system, termed SVRTN, that effectively addresses the above shortcomings. It first applies self-supervised training to effectively learn video representations from unlabeled data, avoiding the expensive cost of manual annotation. Then, it exploits a transformer structure to aggregate frame-level features into clip-level ones to reduce both storage space and search complexity. It can learn complementary and discriminative information from the interactions among clip frames, as well as acquire invariance to frame permutation and missing frames, which supports more flexible retrieval manners. Comprehensive experiments on two challenging video retrieval datasets, namely FIVR-200K and SVD, verify the effectiveness of our proposed SVRTN method, which achieves the best video retrieval performance in terms of both accuracy and efficiency. |
|||||
2021 | Unsupervised Domain-adaptive Hash For Networks | He Tao, Gao Lianli, Song Jingkuan, Li Yuan-fang | Arxiv | Abundant real-world data can be naturally represented by large-scale networks, which demands efficient and effective learning algorithms. At the same time, labels may only be available for some networks, which demands these algorithms to be able to adapt to unlabeled networks. Domain-adaptive hash learning has enjoyed considerable success in the computer vision community in many practical tasks due to its lower cost in both retrieval time and storage footprint. However, it has not been applied to multiple-domain networks. In this work, we bridge this gap by developing an unsupervised domain-adaptive hash learning method for networks, dubbed UDAH. Specifically, we develop four task-specific yet correlated components: (1) network structure preservation via a hard groupwise contrastive loss, (2) relaxation-free supervised hashing, (3) cross-domain intersected discriminators, and (4) semantic center alignment. We conduct a wide range of experiments to evaluate the effectiveness and efficiency of our method on a range of tasks including link prediction, node classification, and neighbor recommendation. Our evaluation results demonstrate that our model achieves better performance than the state-of-the-art conventional discrete embedding methods over all the tasks. |
|||||
2021 | Unsupervised Multi-index Semantic Hashing | Hansen Christian, Hansen Casper, Simonsen Jakob Grue, Alstrup Stephen, Lioma Christina | Arxiv | Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve similarity search effectiveness, while assuming a brute-force linear scan strategy for searching over all the hash codes, even though much faster alternatives exist. One such alternative is multi-index hashing, an approach that constructs a smaller candidate set to search over, which depending on the distribution of the hash codes can lead to sub-linear search time. In this work, we propose Multi-Index Semantic Hashing (MISH), an unsupervised hashing model that learns hash codes that are both effective and highly efficient by being optimized for multi-index hashing. We derive novel training objectives, which enable to learn hash codes that reduce the candidate sets produced by multi-index hashing, while being end-to-end trainable. In fact, our proposed training objectives are model agnostic, i.e., not tied to how the hash codes are generated specifically in MISH, and are straight-forward to include in existing and future semantic hashing models. We experimentally compare MISH to state-of-the-art semantic hashing baselines in the task of document similarity search. We find that even though multi-index hashing also improves the efficiency of the baselines compared to a linear scan, they are still upwards of 33% slower than MISH, while MISH is still able to obtain state-of-the-art effectiveness. |
|||||
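MISH, described above, optimizes hash codes for multi-index hashing rather than a brute-force linear scan. The classic multi-index lookup that search strategy relies on is sketched below: split each code into m disjoint substrings, index each substring in its own table, probe the m tables with the query's substrings (any code within Hamming distance m-1 must collide with the query in at least one substring, by pigeonhole), and filter the candidates by full Hamming distance. This is the generic data structure, not the MISH training code.

```python
import numpy as np
from collections import defaultdict

class MultiIndexHashTable:
    """Multi-index hashing: split each binary code into num_blocks substrings
    and index each substring in its own table. Probing the tables with the
    query's substrings returns every code within Hamming distance
    num_blocks - 1 (pigeonhole), plus some further-away candidates that are
    then filtered by exact Hamming distance."""
    def __init__(self, codes, num_blocks=4):
        self.codes = codes                       # (n, b) array of 0/1 values
        self.blocks = np.array_split(np.arange(codes.shape[1]), num_blocks)
        self.tables = [defaultdict(list) for _ in self.blocks]
        for i, code in enumerate(codes):
            for t, cols in enumerate(self.blocks):
                self.tables[t][code[cols].tobytes()].append(i)

    def query(self, qcode, radius):
        candidates = set()
        for t, cols in enumerate(self.blocks):
            candidates.update(self.tables[t].get(qcode[cols].tobytes(), []))
        cand = np.fromiter(candidates, dtype=np.int64)
        if len(cand) == 0:
            return cand
        # Filter the candidate set by exact Hamming distance.
        dists = (self.codes[cand] != qcode).sum(axis=1)
        return cand[dists <= radius]

rng = np.random.default_rng(4)
codes = rng.integers(0, 2, size=(100000, 64), dtype=np.uint8)
index = MultiIndexHashTable(codes, num_blocks=4)
query = codes[123] ^ (rng.random(64) < 0.02)     # a lightly perturbed database code
print(index.query(query.astype(np.uint8), radius=3))
```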
2021 | Projected Hamming Dissimilarity For Bit-level Importance Coding In Collaborative Filtering | Hansen Christian, Hansen Casper, Simonsen Jakob Grue, Lioma Christina | Arxiv | When reasoning about tasks that involve large amounts of data, a common approach is to represent data items as objects in the Hamming space where operations can be done efficiently and effectively. Object similarity can then be computed by learning binary representations (hash codes) of the objects and computing their Hamming distance. While this is highly efficient, each bit dimension is equally weighted, which means that potentially discriminative information of the data is lost. A more expressive alternative is to use real-valued vector representations and compute their inner product; this allows varying the weight of each dimension but is many magnitudes slower. To fix this, we derive a new way of measuring the dissimilarity between two objects in the Hamming space with binary weighting of each dimension (i.e., disabling bits): we consider a field-agnostic dissimilarity that projects the vector of one object onto the vector of the other. When working in the Hamming space, this results in a novel projected Hamming dissimilarity, which by choice of projection, effectively allows a binary importance weighting of the hash code of one object through the hash code of the other. We propose a variational hashing model for learning hash codes optimized for this projected Hamming dissimilarity, and experimentally evaluate it in collaborative filtering experiments. The resultant hash codes lead to effectiveness gains of up to +7% in NDCG and +14% in MRR compared to state-of-the-art hashing-based collaborative filtering baselines, while requiring no additional storage and no computational overhead compared to using the Hamming distance. |
|||||
2021 | Representation Learning For Efficient And Effective Similarity Search And Recommendation | Hansen Casper | Arxiv | How data is represented and operationalized is critical for building computational solutions that are both effective and efficient. A common approach is to represent data objects as binary vectors, denoted \textit{hash codes}, which require little storage and enable efficient similarity search through direct indexing into a hash table or through similarity computations in an appropriate space. Due to the limited expressibility of hash codes, compared to real-valued representations, a core open challenge is how to generate hash codes that well capture semantic content or latent properties using a small number of bits, while ensuring that the hash codes are distributed in a way that does not reduce their search efficiency. State of the art methods use representation learning for generating such hash codes, focusing on neural autoencoder architectures where semantics are encoded into the hash codes by learning to reconstruct the original inputs of the hash codes. This thesis addresses the above challenge and makes a number of contributions to representation learning that (i) improve effectiveness of hash codes through more expressive representations and a more effective similarity measure than the current state of the art, namely the Hamming distance, and (ii) improve efficiency of hash codes by learning representations that are especially suited to the choice of search method. The contributions are empirically validated on several tasks related to similarity search and recommendation. |
|||||
2021 | Unsupervised Hashing With Contrastive Information Bottleneck | Qiu Zexuan, Su Qinliang, Ou Zijing, Yu Jianxing, Chen Changyou | Arxiv | Many unsupervised hashing methods are implicitly established on the idea of reconstructing the input data, which basically encourages the hashing codes to retain as much information of the original data as possible. However, this requirement may force the models to spend much of their effort on reconstructing useless background information, while failing to preserve the discriminative semantic information that is more important for the hashing task. To tackle this problem, inspired by the recent success of contrastive learning in learning continuous representations, we propose to adapt this framework to learn binary hashing codes. Specifically, we first propose to modify the objective function to meet the specific requirement of hashing and then introduce a probabilistic binary representation layer into the model to facilitate end-to-end training of the entire model. We further prove the strong connection between the proposed contrastive-learning-based hashing method and mutual information, and show that the proposed model can be considered under the broader framework of the information bottleneck (IB). Under this perspective, a more general hashing model is naturally obtained. Extensive experimental results on three benchmark image datasets demonstrate that the proposed hashing method significantly outperforms existing baselines. |
|||||
2021 | BCD A Cross-architecture Binary Comparison Database Experiment Using Locality Sensitive Hashing Algorithms | Tan Haoxi | Arxiv | Given a binary executable without source code, it is difficult to determine what each function in the binary does by reverse engineering it, and even harder without prior experience and context. In this paper, we performed a comparison of different hashing functions’ effectiveness at detecting similar lifted snippets of LLVM IR code, and present the design and implementation of a framework for cross-architecture binary code similarity search database using MinHash as the chosen hashing algorithm, over SimHash, SSDEEP and TLSH. The motivation is to help reverse engineers to quickly gain context of functions in an unknown binary by comparing it against a database of known functions. The code for this project is open source and can be found at https://github.com/h4sh5/bcddb |
|||||
2021 | IRLI Iterative Re-partitioning For Learning To Index | Gupta Gaurav, Medini Tharun, Shrivastava Anshumali, Smola Alexander J | Arxiv | Neural models have transformed the fundamental information retrieval problem of mapping a query to a giant set of items. However, the need for efficient and low latency inference forces the community to reconsider efficient approximate near-neighbor search in the item space. To this end, learning to index is gaining much interest in recent times. Methods have to trade off between obtaining high accuracy and maintaining load balance and scalability in distributed settings. We propose a novel approach called IRLI (pronounced 'early'), which iteratively partitions the items by learning the relevant buckets directly from the query-item relevance data. Furthermore, IRLI employs a superior power-of-\(k\)-choices based load balancing strategy. We mathematically show that IRLI retrieves the correct item with high probability under very natural assumptions and provides superior load balancing. IRLI surpasses the best baseline's precision on multi-label classification while being \(5\times\) faster on inference. For near-neighbor search tasks, the same method outperforms the state-of-the-art Learned Hashing approach NeuralLSH by requiring only about 1/6th of the candidates for the same recall. IRLI is both data and model parallel, making it ideal for distributed GPU implementation. We demonstrate this advantage by indexing 100 million dense vectors and surpassing the popular FAISS library by >10% on recall. |
|||||
2021 | Bytesteady Fast Classification Using Byte-level N-gram Embeddings | Zhang Xiang, Drouin Alexandre, Li Raymond | Arxiv | This article introduces byteSteady – a fast model for classification using byte-level n-gram embeddings. byteSteady assumes that each input comes as a sequence of bytes. A representation vector is produced using the averaged embedding vectors of byte-level n-grams, with a pre-defined set of n. The hashing trick is used to reduce the number of embedding vectors. This input representation vector is then fed into a linear classifier. A straightforward application of byteSteady is text classification. We also apply byteSteady to one type of non-language data – DNA sequences for gene classification. For both problems we achieved competitive classification results against strong baselines, suggesting that byteSteady can be applied to both language and non-language data. Furthermore, we find that simple compression using Huffman coding does not significantly impact the results, which offers an accuracy-speed trade-off previously unexplored in machine learning. |
|||||
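As a rough illustration of the byte-level n-gram recipe described in the byteSteady entry above (not the authors' implementation): byte n-grams are hashed into a fixed-size embedding table, their rows are averaged, and the result feeds a linear classifier. The table size, n-gram orders, and hash function below are arbitrary choices for the sketch.

```python
import zlib
import numpy as np

def byte_ngram_embedding(data: bytes, table: np.ndarray, ns=(1, 2, 4)) -> np.ndarray:
    """Average the hashed byte n-gram embeddings of `data` (hashing trick)."""
    num_rows, dim = table.shape
    acc, count = np.zeros(dim), 0
    for n in ns:
        for i in range(len(data) - n + 1):
            row = zlib.crc32(data[i:i + n]) % num_rows   # n-gram -> shared table row
            acc += table[row]
            count += 1
    return acc / max(count, 1)

rng = np.random.default_rng(0)
table = rng.normal(scale=0.1, size=(1 << 16, 64))        # hashed embedding table
x = byte_ngram_embedding(b"ACGTACGTTTGACC", table)       # vector fed to a linear classifier
print(x.shape)
```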
2021 | Backdoor Attack On Hash-based Image Retrieval Via Clean-label Data Poisoning | Gao Kuofeng, Bai Jiawang, Chen Bin, Wu Dongxian, Xia Shu-tao | Arxiv | A backdoored deep hashing model is expected to behave normally on original query images and return the images with the target label when a specific trigger pattern is present. To this end, we propose the confusing perturbations-induced backdoor attack (CIBA). It injects a small number of poisoned images with the correct label into the training data, which makes the attack hard to detect. To craft the poisoned images, we first propose the confusing perturbations to disturb the hashing code learning. As such, the hashing model can learn more about the trigger. The confusing perturbations are imperceptible and generated by optimizing the intra-class dispersion and inter-class shift in the Hamming space. We then employ the targeted adversarial patch as the backdoor trigger to improve the attack performance. We have conducted extensive experiments to verify the effectiveness of our proposed CIBA. Our code is available at https://github.com/KuofengGao/CIBA. |
|||||
2021 | Sketch-qnet A Quadruplet Convnet For Color Sketch-based Image Retrieval | Fuentes Anibal, Saavedra Jose M. | Arxiv | Architectures based on siamese networks with triplet loss have shown outstanding performance on the image-based similarity search problem. This approach attempts to discriminate between positive (relevant) and negative (irrelevant) items. However, it suffers from a critical weakness. Given a query, it cannot discriminate weakly relevant items, for instance items of the same type but a different color or texture than the given query, which could be a serious limitation for many real-world search applications. Therefore, in this work, we present a quadruplet-based architecture that overcomes the aforementioned weakness. Moreover, we present an instance of this quadruplet network, which we call Sketch-QNet, to deal with the color sketch-based image retrieval (CSBIR) problem, achieving new state-of-the-art results. |
|||||
2021 | Deep Triplet Hashing Network For Case-based Medical Image Retrieval | Fang Jiansheng, Fu Huazhu, Liu Jiang | Arxiv | Deep hashing methods have been shown to be the most efficient approximate nearest neighbor search techniques for large-scale image retrieval. However, existing deep hashing methods have a poor small-sample ranking performance for case-based medical image retrieval. The top-ranked images in the returned query results may be of a different class than the query image. This ranking problem is caused by classification, regions of interest (ROI), and small-sample information loss in the hashing space. To address the ranking problem, we propose an end-to-end framework, called Attention-based Triplet Hashing (ATH) network, to learn low-dimensional hash codes that preserve the classification, ROI, and small-sample information. We embed a spatial-attention module into the network structure of our ATH to focus on ROI information. The spatial-attention module aggregates the spatial information of feature maps by utilizing max-pooling, element-wise maximum, and element-wise mean operations jointly along the channel axis. The triplet cross-entropy loss can help to map the classification information of images and similarity between images into the hash codes. Extensive experiments on two case-based medical datasets demonstrate that our proposed ATH can further improve the retrieval performance compared to the state-of-the-art deep hashing methods and boost the ranking performance for small samples. Compared to the other loss methods, the triplet cross-entropy loss can enhance the classification performance and hash code discriminability. |
|||||
2021 | Facebook SimSearchNet++ | Facebook/Meta | NeurIPS | Facebook SimSearchNet++ is a new dataset released by Facebook for this competition. It consists of features used for image copy detection for integrity purposes. The features are generated by Facebook SimSearchNet++ model. |
|||||
2021 | A Faster Algorithm For Finding Closest Pairs In Hamming Metric | Esser Andre, Kübler Robert, Zweydinger Floyd | Arxiv | We study the Closest Pair Problem in Hamming metric, which asks to find the pair with the smallest Hamming distance in a collection of binary vectors. We give a new randomized algorithm for the problem on uniformly random input outperforming previous approaches whenever the dimension of input points is small compared to the dataset size. For moderate to large dimensions, our algorithm matches the time complexity of the previously best-known locality sensitive hashing based algorithms. Technically our algorithm follows similar design principles as Dubiner (IEEE Trans. Inf. Theory 2010) and May-Ozerov (Eurocrypt 2015). Besides improving the time complexity in the aforementioned areas, we significantly simplify the analysis of these previous works. We give a modular analysis, which allows us to investigate the performance of the algorithm also on non-uniform input distributions. Furthermore, we give a proof of concept implementation of our algorithm which performs well in comparison to a quadratic search baseline. This is the first step towards answering an open question raised by May and Ozerov regarding the practicability of algorithms following these design principles. |
|||||
2021 | Setsketch Filling The Gap Between Minhash And Hyperloglog | Ertl Otmar | Arxiv | MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases. |
|||||
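For context on the MinHash side of the gap the SetSketch entry above describes, here is a minimal MinHash signature and Jaccard estimator (this is the classical sketch, not the SetSketch data structure; the number of hash functions and the seeding scheme are arbitrary).

```python
import zlib

def minhash(items, num_hashes=128):
    """One signature value per hash function: the minimum hash over the set's items."""
    return [min(zlib.crc32(f"{seed}:{x}".encode()) for x in items)
            for seed in range(num_hashes)]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of agreeing positions estimates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = {"cat", "dog", "fish", "bird"}
B = {"cat", "dog", "fish", "lizard"}
print(jaccard_estimate(minhash(A), minhash(B)))  # true Jaccard is 3/5
```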
2021 | Image-hashing-based Anomaly Detection For Privacy-preserving Online Proctoring | Yaqub Waheeb, Mohanty Manoranjan, Suleiman Basem | Arxiv | Online proctoring has become a necessity in online teaching. Video-based crowd-sourced online proctoring solutions are being used, where an exam-taking student’s video is monitored by third parties, leading to privacy concerns. In this paper, we propose a privacy-preserving online proctoring system. The proposed image-hashing-based system can detect the student’s excessive face and body movement (i.e., anomalies) that results when the student tries to cheat in the exam. The detection can be done even if the student’s face is blurred or masked in video frames. Experiments with an in-house dataset show the usability of the proposed system. |
|||||
2021 | Hash Layers For Large Sparse Models | Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston | Neural Information Processing Systems | We investigate the training of sparse layers that use different parameters for different inputs based on hashing in large Transformer models. Specifically, we modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence. We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods such as Switch Transformers and BASE Layers, while requiring no routing parameters or extra terms in the objective function such as a load balancing loss, and no sophisticated assignment algorithm. We study the performance of different hashing techniques, hash sizes and input features, and show that balanced and random hashes focused on the most local features work best, compared to either learning clusters or using longer-range context. We show our approach works well both on large language modeling and dialogue tasks, and on downstream fine-tuning tasks. |
|||||
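A bare-bones sketch of the routing idea summarized in the Hash Layers entry above: each token id is hashed to one of a fixed set of expert feed-forward blocks, so no routing parameters or load-balancing loss are needed. Dimensions, the hash function, and the NumPy forward pass are illustrative stand-ins, not the paper's configuration.

```python
import numpy as np

class HashFFN:
    """Feed-forward layer whose weights are selected by hashing the input token id."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.02, (n_experts, d_model, d_ff))
        self.w2 = rng.normal(0, 0.02, (n_experts, d_ff, d_model))
        self.n_experts = n_experts

    def __call__(self, hidden, token_ids):
        out = np.empty_like(hidden)
        for t, (h, tok) in enumerate(zip(hidden, token_ids)):
            e = hash(int(tok)) % self.n_experts      # fixed (random-like) token -> expert map
            out[t] = np.maximum(h @ self.w1[e], 0) @ self.w2[e]
        return out

layer = HashFFN()
hidden = np.random.default_rng(1).normal(size=(5, 64))   # 5 tokens in a sequence
print(layer(hidden, token_ids=[17, 3, 17, 99, 3]).shape)
```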
2021 | Practical Near Neighbor Search Via Group Testing | Joshua Engels, Benjamin Coleman, Anshumali Shrivastava | Neural Information Processing Systems | We present a new algorithm for the approximate near neighbor problem that combines classical ideas from group testing with locality-sensitive hashing (LSH). We reduce the near neighbor search problem to a group testing problem by designating neighbors as “positives,” non-neighbors as “negatives,” and approximate membership queries as group tests. We instantiate this framework using distance-sensitive Bloom Filters to Identify Near-Neighbor Groups (FLINNG). We prove that FLINNG has sub-linear query time and show that our algorithm comes with a variety of practical advantages. For example, FLINNG can be constructed in a single pass through the data, consists entirely of efficient integer operations, and does not require any distance computations. We conduct large-scale experiments on high-dimensional search tasks such as genome search, URL similarity search, and embedding search over the massive YFCC100M dataset. In our comparison with leading algorithms such as HNSW and FAISS, we find that FLINNG can provide up to a 10x query speedup with substantially smaller indexing time and memory. |
|||||
2021 | Vision Transformer Hashing For Image Retrieval | Dubey Shiv Ram, Singh Satish Kumar, Chu Wei-ta | Arxiv | Deep learning has shown a tremendous growth in hashing techniques for image retrieval. Recently, Transformer has emerged as a new architecture by utilizing self-attention without convolution. Transformer is also extended to Vision Transformer (ViT) for visual recognition with a promising performance on ImageNet. In this paper, we propose a Vision Transformer based Hashing (VTS) for image retrieval. We utilize the pre-trained ViT on ImageNet as the backbone network and add the hashing head. The proposed VTS model is fine-tuned for hashing under six different image retrieval frameworks, including Deep Supervised Hashing (DSH), HashNet, GreedyHash, Improved Deep Hashing Network (IDHN), Deep Polarized Network (DPN) and Central Similarity Quantization (CSQ) with their objective functions. We perform extensive experiments on CIFAR10, ImageNet, NUS-Wide, and COCO datasets. The proposed VTS based image retrieval outperforms recent state-of-the-art hashing techniques by a large margin. We also find that the proposed VTS model, used as the backbone network, is better than existing networks such as AlexNet and ResNet. The code is released at \url{https://github.com/shivram1987/VisionTransformerHashing}. |
|||||
2021 | ASH A Modern Framework For Parallel Spatial Hashing In 3D Perception | Dong Wei, Lao Yixing, Kaess Michael, Koltun Vladlen | Arxiv | We present ASH, a modern and high-performance framework for parallel spatial hashing on GPU. Compared to existing GPU hash map implementations, ASH achieves higher performance, supports richer functionality, and requires fewer lines of code (LoC) when used for implementing spatially varying operations from volumetric geometry reconstruction to differentiable appearance reconstruction. Unlike existing GPU hash maps, the ASH framework provides a versatile tensor interface, hiding low-level details from the users. In addition, by decoupling the internal hashing data structures and key-value data in buffers, we offer direct access to spatially varying data via indices, enabling seamless integration to modern libraries such as PyTorch. To achieve this, we 1) detach stored key-value data from the low-level hash map implementation; 2) bridge the pointer-first low level data structures to index-first high-level tensor interfaces via an index heap; 3) adapt both generic and non-generic integer-only hash map implementations as backends to operate on multi-dimensional keys. We first profile our hash map against state-of-the-art hash maps on synthetic data to show the performance gain from this architecture. We then show that ASH can consistently achieve higher performance on various large-scale 3D perception tasks with fewer LoC by showcasing several applications, including 1) point cloud voxelization, 2) retargetable volumetric scene reconstruction, 3) non-rigid point cloud registration and volumetric deformation, and 4) spatially varying geometry and appearance refinement. ASH and its example applications are open sourced in Open3D (http://www.open3d.org). |
|||||
2021 | Dxhash A Scalable Consistent Hash Based On The Pseudo-random Sequence | Dong Chao, Wang Fang, Feng Dan | Arxiv | Consistent hashing (CH) has been pivotal as a data router and load balancer in diverse fields, including distributed databases, cloud infrastructure, and peer-to-peer networks. However, existing CH algorithms often fall short in simultaneously meeting various critical requirements, such as load balance, minimal disruption, statelessness, high lookup rate, small memory footprint, and low update overhead. To address these limitations, we introduce DxHash, a scalable consistent hashing algorithm based on pseudo-random sequences. To adjust workloads on heterogeneous nodes and enhance flexibility, we propose weighted DxHash. Through comprehensive evaluations, DxHash demonstrates substantial improvements across all six requirements compared to state-of-the-art alternatives. Notably, even when confronted with a 50% failure ratio in a cluster of one million nodes, DxHash maintains remarkable processing capabilities, handling up to 13.3 million queries per second. |
|||||
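To make the pseudo-random-sequence idea behind DxHash-style consistent hashing concrete, here is a hedged toy version (not DxHash's actual algorithm): each key walks a key-seeded pseudo-random sequence of slots and stops at the first active one, so removing a node only remaps the keys that were mapped to it. The slot count and key range are arbitrary.

```python
import random

def lookup(key, active, num_slots):
    """Walk a key-seeded pseudo-random slot sequence until an active node is found."""
    rng = random.Random(key)
    while True:                       # assumes at least one node is active
        slot = rng.randrange(num_slots)
        if active[slot]:
            return slot

num_slots = 16
active = [True] * num_slots
before = {k: lookup(k, active, num_slots) for k in range(10_000)}

active[5] = False                                   # take one node down
after = {k: lookup(k, active, num_slots) for k in range(10_000)}
moved = sum(before[k] != after[k] for k in before)
print(moved, sum(v == 5 for v in before.values()))  # only keys on node 5 move
```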
2021 | Dynamic Texture Recognition Using PDV Hashing And Dictionary Learning On Multi-scale Volume Local Binary Pattern | Ding Ruxin, Ren Jianfeng, Yu Heng, Li Jiawei | Arxiv | Spatial-temporal local binary pattern (STLBP) has been widely used in dynamic texture recognition. STLBP often encounters the high-dimension problem as its dimension increases exponentially, so that STLBP can only utilize a small neighborhood. To tackle this problem, we propose a method for dynamic texture recognition using PDV hashing and dictionary learning on multi-scale volume local binary pattern (PHD-MVLBP). Instead of forming very high-dimensional LBP histogram features, it first uses hash functions to map the pixel difference vectors (PDVs) to binary vectors, then forms a dictionary using the derived binary vectors, and encodes them using the derived dictionary. In this way, the PDVs are mapped to feature vectors of the size of the dictionary, instead of LBP histograms of very high dimension. Such an encoding scheme can extract the discriminant information from videos in a much larger neighborhood effectively. The experimental results on two widely-used dynamic texture datasets, DynTex++ and UCLA, show the superior performance of the proposed approach over state-of-the-art methods. |
|||||
2021 | Semantically Constrained Memory Allocation (SCMA) For Embedding In Efficient Recommendation Systems | Desai Aditya, Pan Yanzhou, Sun Kuangyuan, Chou Li, Shrivastava Anshumali | Arxiv | Deep learning-based models are utilized to achieve state-of-the-art performance for recommendation systems. A key challenge for these models is to work with millions of categorical classes or tokens. The standard approach is to learn end-to-end, dense latent representations or embeddings for each token. The resulting embeddings require large amounts of memory that blow up with the number of tokens. Training and inference with these models create storage and memory bandwidth bottlenecks, leading to significant computing and energy consumption when deployed in practice. To this end, we present the problem of \textit{Memory Allocation} under budget for embeddings and propose a novel formulation of memory shared embedding, where memory is shared in proportion to the overlap in semantic information. Our formulation admits a practical and efficient randomized solution with Locality sensitive hashing based Memory Allocation (LMA). We demonstrate a significant reduction in the memory footprint while maintaining performance. In particular, our LMA embeddings achieve the same performance compared to standard embeddings with a 16\(\times\) reduction in memory footprint. Moreover, LMA achieves an average improvement of over 0.003 AUC across different memory regimes than standard DLRM models on Criteo and Avazu datasets. |
|||||
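The entry above is about allocating shared embedding memory; the simpler baseline it builds on is the hashing trick, where many token ids share rows of one small table. A minimal sketch of such shared (hashed) embeddings follows - it is not the LSH-based LMA allocation itself, and the table size, dimension, and two-hash scheme are arbitrary.

```python
import zlib
import numpy as np

class HashedEmbedding:
    """Embeddings for arbitrarily many token ids stored in one small shared table.
    Each id gets the sum of two hashed rows, so ids share memory."""
    def __init__(self, num_rows=1 << 16, dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(0, 0.01, (num_rows, dim))
        self.num_rows = num_rows

    def __call__(self, token_id: int) -> np.ndarray:
        r1 = zlib.crc32(f"a:{token_id}".encode()) % self.num_rows
        r2 = zlib.crc32(f"b:{token_id}".encode()) % self.num_rows
        return self.table[r1] + self.table[r2]

emb = HashedEmbedding()
print(emb(123_456_789).shape)   # memory is O(num_rows), not O(#token ids)
```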
2021 | Scaling Shared-memory Data Structures As Distributed Global-view Data Structures In The Partitioned Global Address Space Model | Dewan Garvit, Jenkins Louis | Arxiv | The Partitioned Global Address Space (PGAS), a memory model in which the global address space is explicitly partitioned across compute nodes in a cluster, strives to bridge the gap between shared-memory and distributed-memory programming. To further bridge this gap, there has been an adoption of global-view distributed data structures, such as ‘global arrays’ or ‘distributed arrays’. This work demonstrates how shared-memory data structures can be modified to scale in distributed memory. Presented in this work is the Distributed Interlocked Hash Table (DIHT), a global-view distributed map data structure inspired by the Interlocked Hash Table (IHT). At 64 nodes with 44 cores per node, DIHT provides up to 110x the performance of the Chapel standard-library HashedDist. |
|||||
2021 | DOLG Single-stage Image Retrieval With Deep Orthogonal Fusion Of Local And Global Features | Yang Min, He Dongliang, Fan Miao, Shi Baorong, Xue Xuetong, Li Fu, Ding Errui, Huang Jizhou | Arxiv | Image Retrieval is a fundamental task of obtaining images similar to the query one from a database. A common image retrieval practice is to firstly retrieve candidate images via similarity search using global image features and then re-rank the candidates by leveraging their local features. Previous learning-based studies mainly focus on either global or local image representation learning to tackle the retrieval task. In this paper, we abandon the two-stage paradigm and seek to design an effective single-stage solution by integrating local and global information inside images into compact image representations. Specifically, we propose a Deep Orthogonal Local and Global (DOLG) information fusion framework for end-to-end image retrieval. It first attentively extracts representative local information with multi-atrous convolutions and self-attention. Components orthogonal to the global image representation are then extracted from the local information. At last, the orthogonal components are concatenated with the global representation as a complement, and aggregation is then performed to generate the final representation. The whole framework is end-to-end differentiable and can be trained with image-level labels. Extensive experimental results validate the effectiveness of our solution and show that our model achieves state-of-the-art image retrieval performance on the Revisited Oxford and Paris datasets. |
|||||
2021 | Scalable Hash Table For NUMA Systems | Tripathy Alok, Green Oded | Arxiv | Hash tables are used in a plethora of applications, including database operations, DNA sequencing, string searching, and many more. As such, there are many parallelized hash tables targeting multicore, distributed, and accelerator-based systems. We present in this work a multi-GPU hash table implementation that can process keys at a throughput comparable to that of distributed hash tables. Distributed CPU hash tables have received significantly more attention than GPU-based hash tables. We show that a single node with multiple GPUs offers roughly the same performance as a 500-1,000-core CPU-based cluster. Our algorithm’s key component is our use of multiple sparse-graph data structures and binning techniques to build the hash table. As has been shown individually, these components can be written with massive parallelism that is amenable to GPU acceleration. Since we focus on an individual node, we also leverage communication primitives that are typically prohibitive in distributed environments. We show that our new multi-GPU algorithm shares many of the same features of the single GPU algorithm – thus we have efficient collision management capabilities and can deal with a large number of duplicates. We evaluate our algorithm on two multi-GPU compute nodes: 1) an NVIDIA DGX2 server with 16 GPUs and 2) an IBM Power 9 Processor with 6 NVIDIA GPUs. With 32-bit keys, our implementation processes 8B keys per second, comparable to some 500-1,000-core CPU-based clusters and 4X faster than prior single-GPU implementations. |
|||||
2021 | Sketches Image Analysis: Web Image Search Engine Using LSH Index And DNN InceptionV3 | Schiavo Alessio, Minutella Filippo, Daole Mattia, Gomez Marsha Gomez | Arxiv | The adoption of an appropriate approximate similarity search method is an essential prerequisite for developing a fast and efficient CBIR system, especially when dealing with large amounts of data. In this study we implement a web image search engine on top of a Locality Sensitive Hashing (LSH) index to allow fast similarity search on deep features. Specifically, we exploit transfer learning for deep feature extraction from images. Firstly, we adopt InceptionV3 pretrained on ImageNet as the feature extractor; secondly, we try out several CNNs built on top of InceptionV3 as a convolutional base fine-tuned on our dataset. In both of the previous cases we index the extracted features within our LSH index implementation so as to compare the retrieval performance with and without fine-tuning. In our approach we try out two different LSH implementations: the first one working with real-number feature vectors and the second one with the binary transposed version of those vectors. Interestingly, we obtain the best performance when using the binary LSH, reaching almost the same result, in terms of mean average precision, obtained by performing a sequential scan of the features, thus avoiding the bias introduced by the LSH index. Lastly, we carry out a performance analysis class by class in terms of recall against mAP, highlighting, as expected, a strong positive correlation between the two. |
|||||
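The binary LSH variant reported above typically amounts to random-hyperplane (SimHash-style) codes over the extracted feature vectors. A hedged sketch of that indexing step follows; feature dimensionality, code length, and number of tables are placeholders, and the deep feature extractor is omitted.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_bits, n_tables = 2048, 16, 8            # e.g. pooled CNN features -> 16-bit codes
planes = rng.normal(size=(n_tables, n_bits, dim))

def codes(x):
    """One binary code per table: the sign pattern of x against random hyperplanes."""
    return [tuple((planes[t] @ x > 0).astype(int)) for t in range(n_tables)]

# Index a toy set of feature vectors, then retrieve candidates for a query.
features = rng.normal(size=(1000, dim))
tables = [defaultdict(list) for _ in range(n_tables)]
for i, f in enumerate(features):
    for t, c in enumerate(codes(f)):
        tables[t][c].append(i)

query = features[0] + 0.05 * rng.normal(size=dim)
candidates = set()
for t, c in enumerate(codes(query)):
    candidates.update(tables[t][c])
print(len(candidates), 0 in candidates)        # candidates are then re-ranked exactly
```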
2021 | LSH Methods For Data Deduplication In A Wikipedia Artificial Dataset | Ciro Juan, Galvez Daniel, Schlippe Tim, Kanter David | Arxiv | This paper illustrates locality sensitive hashing (LSH) models for the identification and removal of nearly redundant data in a text dataset. To evaluate the different models, we create an artificial dataset for data deduplication using English Wikipedia articles. Area-Under-Curve (AUC) values over 0.9 were observed for most models, with the best model reaching 0.96. Deduplication enables more effective model training by preventing the model from learning a distribution that differs from the real one as a result of the repeated data. |
|||||
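One standard LSH pipeline for this kind of deduplication (possibly among the models the entry above evaluates, though the paper's exact configuration is not given here) is MinHash signatures over shingles plus banding, so that near-duplicate articles collide in at least one band. The shingle length, band count, and rows per band below are illustrative.

```python
import zlib
from collections import defaultdict
from itertools import combinations

def signature(text, num_hashes=64):
    shingles = {text[i:i + 5] for i in range(len(text) - 4)}   # character 5-grams
    return [min(zlib.crc32(f"{s}:{sh}".encode()) for sh in shingles)
            for s in range(num_hashes)]

def candidate_pairs(docs, bands=16, rows=4):
    """LSH banding: docs whose signatures agree on a whole band become candidates."""
    buckets, pairs = defaultdict(list), set()
    for doc_id, text in docs.items():
        sig = signature(text, bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

docs = {0: "the quick brown fox jumps over the lazy dog",
        1: "the quick brown fox jumped over the lazy dog",
        2: "an entirely different wikipedia article about hashing"}
print(candidate_pairs(docs))   # expect {(0, 1)} with high probability
```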
2021 | Edge Sampling And Graph Parameter Estimation Via Vertex Neighborhood Accesses | Tětek Jakub, Thorup Mikkel | Arxiv | In this paper, we consider the problems from the area of sublinear-time algorithms of edge sampling, edge counting, and triangle counting. Part of our contribution is that we consider three different settings, differing in the way in which one may access the neighborhood of a given vertex. In previous work, people have considered indexed neighbor access, with a query returning the \(i\)-th neighbor of a given vertex. Full neighborhood access model, which has a query that returns the entire neighborhood at a unit cost, has recently been considered in the applied community. Between these, we propose hash-ordered neighbor access, inspired by coordinated sampling, where we have a global fully random hash function, and can access neighbors in order of their hash values, paying a constant for each accessed neighbor. For edge sampling and counting, our new lower bounds are in the most powerful full neighborhood access model. We provide matching upper bounds in the weaker hash-ordered neighbor access model. Our new faster algorithms can be provably implemented efficiently on massive graphs in external memory and with the current APIs for, e.g., Twitter or Wikipedia. For triangle counting, we provide a separation: a better upper bound with full neighborhood access than the known lower bounds with indexed neighbor access. The technical core of our paper is our edge-sampling algorithm on which the other results depend. |
|||||
2021 | Accurate And Efficient Suffix Tree Based Privacy-preserving String Matching | Vaiwsri Sirintra, Ranbaduge Thilina, Christen Peter, Ng Kee Siong | Arxiv | The task of calculating similarities between strings held by different organizations without revealing these strings is an increasingly important problem in areas such as health informatics, national censuses, genomics, and fraud detection. Most existing privacy-preserving string comparison functions are either based on comparing sets of encoded character q-grams, allow only exact matching of encrypted strings, or they are aimed at long genomic sequences that have a small alphabet. The set-based privacy-preserving similarity functions commonly used to compare name and address strings in the context of privacy-preserving record linkage do not take the positions of sub-strings into account. As a result, two very different strings can potentially be considered as an exact match leading to wrongly linked records. Existing set-based techniques also cannot identify the length of the longest common sub-string across two strings. In this paper we propose a novel approach for accurate and efficient privacy-preserving string matching based on suffix trees that are encoded using chained hashing. We incorporate a hashing based encoding technique upon the encoded suffixes to improve privacy against frequency attacks such as those exploiting Benford’s law. Our approach allows various operations to be performed without the strings to be compared being revealed: the length of the longest common sub-string, do two strings have the same beginning, middle or end, and the longest common sub-string similarity between two strings. These functions allow a more accurate comparison of, for example, bank account, credit card, or telephone numbers, which cannot be compared appropriately with existing privacy-preserving string matching techniques. Our evaluation on several data sets with different types of strings validates the privacy and accuracy of our proposed approach. |
|||||
2021 | Towards Low-loss 1-bit Quantization Of User-item Representations For Top-k Recommendation | Chen Yankai, Zhang Yifei, Zhang Yingxue, Guo Huifeng, Li Jingjie, Tang Ruiming, He Xiuqiang, King Irwin | Arxiv | Due to the promising advantages in space compression and inference acceleration, quantized representation learning for recommender systems has become an emerging research direction recently. As the target is to embed latent features in the discrete embedding space, developing quantization for user-item representations with a few low-precision integers confronts the challenge of high information loss, thus leading to unsatisfactory performance in Top-K recommendation. In this work, we study the problem of representation learning for recommendation with 1-bit quantization. We propose a model named Low-loss Quantized Graph Convolutional Network (L^2Q-GCN). Different from previous work that plugs quantization as the final encoder of user-item embeddings, L^2Q-GCN learns the quantized representations whilst capturing the structural information of user-item interaction graphs at different semantic levels. This achieves the substantial retention of intermediate interactive information, alleviating the feature smoothing issue for ranking caused by numerical quantization. To further improve the model performance, we also present an advanced solution named L^2Q-GCN-anl with quantization approximation and annealing training strategy. We conduct extensive experiments on four benchmarks over Top-K recommendation task. The experimental results show that, with nearly 9x representation storage compression, L^2Q-GCN-anl attains about 90~99% performance recovery compared to the state-of-the-art model. |
|||||
2021 | A High-dimensional Sparse Fourier Transform In The Continuous Setting | Chen Liang | Arxiv | In this paper, we theoretically propose a new hashing scheme to establish the sparse Fourier transform in high-dimensional space. The estimation of the algorithm complexity shows that this sparse Fourier transform can overcome the curse of dimensionality. To the best of our knowledge, this is the first polynomial-time algorithm to recover the high-dimensional continuous frequencies. |
2021 | DVHN A Deep Hashing Framework For Large-scale Vehicle Re-identification | Chen Yongbiao, Zhang Sheng, Liu Fangxin, Wu Chenggang, Guo Kaicheng, Qi Zhengwei | Arxiv | In this paper, we make the very first attempt to investigate the integration of deep hash learning with vehicle re-identification. We propose a deep hash-based vehicle re-identification framework, dubbed DVHN, which substantially reduces memory usage and promotes retrieval efficiency while preserving nearest neighbor search accuracy. Concretely, DVHN directly learns discrete compact binary hash codes for each image by jointly optimizing the feature learning network and the hash code generating module. Specifically, we directly constrain the output from the convolutional neural network to be discrete binary codes and ensure the learned binary codes are optimal for classification. To optimize the deep discrete hashing framework, we further propose an alternating minimization method for learning binary similarity-preserved hashing codes. Extensive experiments on two widely-studied vehicle re-identification datasets - \textbf{VehicleID} and \textbf{VeRi} - have demonstrated the superiority of our method against the state-of-the-art deep hash methods. \textbf{DVHN} of \(2048\) bits can achieve 13.94\% and 10.21\% accuracy improvement in terms of \textbf{mAP} and \textbf{Rank@1} for the \textbf{VehicleID (800)} dataset. For \textbf{VeRi}, we achieve 35.45\% and 32.72\% performance gains for \textbf{Rank@1} and \textbf{mAP}, respectively. |
|||||
2021 | SPANN Highly-efficient Billion-scale Approximate Nearest Neighborhood Search | Qi Chen, Bing Zhao, Haidong Wang, Mingqin Li, Chuanjie Liu, Zhiyong Zheng, Mao Yang, Jingdong Wang | Neural Information Processing Systems | The in-memory algorithms for approximate nearest neighbor search (ANNS) have achieved great success for fast high-recall search, but are extremely expensive when handling very large scale database. Thus, there is an increasing request for the hybrid ANNS solutions with small memory and inexpensive solid-state drive (SSD). In this paper, we present a simple but efficient memory-disk hybrid indexing and search system, named SPANN, that follows the inverted index methodology. It stores the centroid points of the posting lists in the memory and the large posting lists in the disk. We guarantee both disk-access efficiency (low latency) and high recall by effectively reducing the disk-access number and retrieving high-quality posting lists. In the index-building stage, we adopt a hierarchical balanced clustering algorithm to balance the length of posting lists and augment the posting list by adding the points in the closure of the corresponding clusters. In the search stage, we use a query-aware scheme to dynamically prune the access of unnecessary posting lists. Experiment results demonstrate that SPANN is 2X faster than the state-of-the-art ANNS solution DiskANN to reach the same recall quality 90% with same memory cost in three billion-scale datasets. It can reach 90% recall@1 and recall@10 in just around one millisecond with only 32GB memory cost. Code is available at: https://github.com/microsoft/SPTAG. |
|||||
2021 | Transhash Transformer-based Hamming Hashing For Efficient Image Retrieval | Chen Yongbiao, Zhang Sheng, Liu Fangxin, Chang Zhigang, Ye Mang, Qi Zhengwei | Arxiv | Deep hamming hashing has gained growing popularity in approximate nearest neighbour search for large-scale image retrieval. Until now, deep hashing for image retrieval has been dominated by convolutional neural network architectures, e.g. \texttt{Resnet}\cite{he2016deep}. In this paper, inspired by the recent advancements of vision transformers, we present \textbf{Transhash}, a pure transformer-based framework for deep hashing learning. Concretely, our framework is composed of two major modules: (1) Based on \textit{Vision Transformer} (ViT), we design a siamese vision transformer backbone for image feature extraction. To learn fine-grained features, we innovate a dual-stream feature learning on top of the transformer to learn discriminative global and local features. (2) Besides, we adopt a Bayesian learning scheme with a dynamically constructed similarity matrix to learn compact binary hash codes. The entire framework is jointly trained in an end-to-end manner. To the best of our knowledge, this is the first work to tackle deep hashing learning problems without convolutional neural networks (\textit{CNNs}). We perform comprehensive experiments on three widely-studied datasets: \textbf{CIFAR-10}, \textbf{NUSWIDE} and \textbf{IMAGENET}. The experiments demonstrate our superiority over the existing state-of-the-art deep hashing methods. Specifically, we achieve 8.2\%, 2.6\%, 12.7\% performance gains in terms of average \textit{mAP} for different hash bit lengths on the three public datasets, respectively. |
|||||
2021 | Exploration Into Translation-equivariant Image Quantization | Shin Woncheol, Lee Gyubok, Lee Jiyoung, Lyou Eunyi, Lee Joonseok, Choi Edward | Arxiv | This is an exploratory study showing that current image quantization (vector quantization) methods do not satisfy translation equivariance in the quantized space due to aliasing. Instead of focusing on anti-aliasing, we propose a simple yet effective way to achieve translation-equivariant image quantization by enforcing orthogonality among the codebook embeddings. To explore the advantages of translation-equivariant image quantization, we conduct three proof-of-concept experiments with a carefully controlled dataset: (1) text-to-image generation, where the quantized image indices are the target to predict, (2) image-to-text generation, where the quantized image indices are given as a condition, (3) using a smaller training set to analyze sample efficiency. From the strictly controlled experiments, we empirically verify that the translation-equivariant image quantizer improves not only sample efficiency but also the accuracy over VQGAN up to +11.9% in text-to-image generation and +3.9% in image-to-text generation. |
|||||
2021 | Deep Learning To Ternary Hash Codes By Continuation | Chen Mingrui, Li Weiyu, Lu Weizhi | Arxiv | Recently, it has been observed that {0,1,-1}-ternary codes which are simply generated from deep features by hard thresholding, tend to outperform {-1,1}-binary codes in image retrieval. To obtain better ternary codes, we for the first time propose to jointly learn the features with the codes by appending a smoothed function to the networks. During training, the function could evolve into a non-smoothed ternary function by a continuation method. The method circumvents the difficulty of directly training discrete functions and reduces the quantization errors of ternary codes. Experiments show that the generated codes indeed could achieve higher retrieval accuracy. |
|||||
2021 | Fractal Measures Of Image Local Features An Application To Texture Recognition | Silva Pedro M., Florindo Joao B. | Arxiv | Here we propose a new method for the classification of texture images combining fractal measures (fractal dimension, multifractal spectrum and lacunarity) with local binary patterns. More specifically we compute the box counting dimension of the local binary codes thresholded at different levels to compose the feature vector. The proposal is assessed in the classification of three benchmark databases: KTHTIPS-2b, UMD and UIUC as well as in a real-world problem, namely the identification of Brazilian plant species (database 1200Tex) using scanned images of their leaves. The proposed method demonstrated to be competitive with other state-of-the-art solutions reported in the literature. Such results confirmed the potential of combining a powerful local coding description with the multiscale information captured by the fractal dimension for texture classification. |
|||||
2021 | A Triangle Inequality For Cosine Similarity | Schubert Erich | Arxiv | Similarity search is a fundamental problem for many data analysis techniques. Many efficient search techniques rely on the triangle inequality of metrics, which allows pruning parts of the search space based on transitive bounds on distances. Recently, Cosine similarity has become a popular alternative choice to the standard Euclidean metric, in particular in the context of textual data and neural network embeddings. Unfortunately, Cosine similarity is not metric and does not satisfy the standard triangle inequality. Instead, many search techniques for Cosine rely on approximation techniques such as locality sensitive hashing. In this paper, we derive a triangle inequality for Cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, Cover-tree, and M-tree); show that this bound is tight and discuss fast approximations for it. We hope that this spurs new research on accelerating exact similarity search for cosine similarity, and possible other similarity measures beyond the existing work for distance metrics. |
|||||
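Without reproducing the paper's tight bound, the connection above can be made concrete with a simpler observation: on unit-normalized vectors, cosine similarity is an affine function of squared Euclidean distance, so the ordinary triangle inequality already yields a (looser) transitive bound that a VP-tree or M-tree could use for pruning. A sketch of that elementary derivation:

```latex
% For unit vectors x, y, z (||x|| = ||y|| = ||z|| = 1):
%   ||x - y||^2 = 2 - 2 sim(x, y),  i.e.  ||x - y|| = sqrt(2 - 2 sim(x, y)).
% Applying the Euclidean triangle inequality ||x - z|| <= ||x - y|| + ||y - z||:
\[
\sqrt{2 - 2\,\operatorname{sim}(x,z)}
  \;\le\;
\sqrt{2 - 2\,\operatorname{sim}(x,y)} + \sqrt{2 - 2\,\operatorname{sim}(y,z)},
\]
% which rearranges to a lower bound on sim(x, z) usable for pruning candidates
% whose bound already falls below the current best similarity:
\[
\operatorname{sim}(x,z) \;\ge\;
1 - \tfrac{1}{2}\Bigl(\sqrt{2-2\,\operatorname{sim}(x,y)} + \sqrt{2-2\,\operatorname{sim}(y,z)}\Bigr)^{2}.
\]
```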
2021 | Learning Discrete Representations Via Constrained Clustering For Effective And Efficient Dense Retrieval | Zhan Jingtao, Mao Jiaxin, Liu Yiqun, Guo Jiafeng, Zhang Min, Ma Shaoping | Arxiv | Dense Retrieval (DR) has achieved state-of-the-art first-stage ranking effectiveness. However, the efficiency of most existing DR models is limited by the large memory cost of storing dense vectors and the time-consuming nearest neighbor search (NNS) in vector space. Therefore, we present RepCONC, a novel retrieval model that learns discrete Representations via CONstrained Clustering. RepCONC jointly trains dual-encoders and the Product Quantization (PQ) method to learn discrete document representations and enables fast approximate NNS with compact indexes. It models quantization as a constrained clustering process, which requires the document embeddings to be uniformly clustered around the quantization centroids and supports end-to-end optimization of the quantization method and dual-encoders. We theoretically demonstrate the importance of the uniform clustering constraint in RepCONC and derive an efficient approximate solution for constrained clustering by reducing it to an instance of the optimal transport problem. Besides constrained clustering, RepCONC further adopts a vector-based inverted file system (IVF) to support highly efficient vector search on CPUs. Extensive experiments on two popular ad-hoc retrieval benchmarks show that RepCONC achieves better ranking effectiveness than competitive vector quantization baselines under different compression ratio settings. It also substantially outperforms a wide range of existing retrieval models in terms of retrieval effectiveness, memory efficiency, and time efficiency. |
|||||
2021 | Hashing Embeddings Of Optimal Dimension With Applications To Linear Least Squares | Cartis Coralia, Fiala Jan, Shao Zhen | Arxiv | The aim of this paper is two-fold: firstly, to present subspace embedding properties for \(s\)-hashing sketching matrices, with \(s\geq 1\), that are optimal in the projection dimension \(m\) of the sketch, namely, \(m=\mathcal{O}(d)\), where \(d\) is the dimension of the subspace. A diverse set of results are presented that address the case when the input matrix has sufficiently low coherence (thus removing the \(log^2 d\) factor dependence in \(m\), in the low-coherence result of Bourgain et al (2015) at the expense of a smaller coherence requirement); how this coherence changes with the number \(s\) of column nonzeros (allowing a scaling of \(\sqrt{s}\) of the coherence bound), or is reduced through suitable transformations (when considering hashed – instead of subsampled – coherence reducing transformations such as randomised Hadamard). Secondly, we apply these general hashing sketching results to the special case of Linear Least Squares (LLS), and develop Ski-LLS, a generic software package for these problems, that builds upon and improves the Blendenpik solver on dense input and the (sequential) LSRN performance on sparse problems. In addition to the hashing sketching improvements, we add suitable linear algebra tools for rank-deficient and for sparse problems that lead Ski-LLS to outperform not only sketching-based routines on randomly generated input, but also state of the art direct solver SPQR and iterative code HSL on certain subsets of the sparse Florida matrix collection; namely, on least squares problems that are significantly overdetermined, or moderately sparse, or difficult. |
|||||
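As a toy illustration of the sketching ingredient described above - not the Ski-LLS pipeline, which uses the sketch more carefully (for example as a preconditioner) - an s-hashing matrix with s nonzero entries of value ±1/√s per column can be applied to an overdetermined least-squares problem before solving the smaller sketched problem. The problem sizes and the value of s below are arbitrary.

```python
import numpy as np

def s_hashing_sketch(m, n, s, rng):
    """Sparse sketching matrix: each column has s random ±1/sqrt(s) entries."""
    S = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)
        S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return S

rng = np.random.default_rng(0)
n, d = 20_000, 50                              # tall least-squares problem: min ||A x - b||
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.01 * rng.normal(size=n)

S = s_hashing_sketch(4 * d, n, s=3, rng=rng)   # projection dimension m = O(d)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_sketch - x_exact))      # small for this well-conditioned A
```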
2021 | RED Looking For Redundancies For Data-freestructured Compression Of Deep Neural Networks | Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, Kevin Bailly | Neural Information Processing Systems | Deep Neural Networks (DNNs) are ubiquitous in today’s computer vision landscape, despite involving considerable computational costs. The mainstream approaches for runtime acceleration consist in pruning connections (unstructured pruning) or, better, filters (structured pruning), both often requiring data to retrain the model. In this paper, we present RED, a data-free, unified approach to tackle structured pruning. First, we propose a novel adaptive hashing of the scalar DNN weight distribution densities to increase the number of identical neurons represented by their weight vectors. Second, we prune the network by merging redundant neurons based on their relative similarities, as defined by their distance. Third, we propose a novel uneven depthwise separation technique to further prune convolutional layers. We demonstrate through a large variety of benchmarks that RED largely outperforms other data-free pruning methods, often reaching performance similar to unconstrained, data-driven methods. |
|||||
2021 | Hash-based Tree Similarity And Simplification In Genetic Programming For Symbolic Regression | Burlacu Bogdan, Kammerer Lukas, Affenzeller Michael, Kronberger Gabriel | In Moreno-Diaz R. et al. Computer Aided Systems Theory. Lecture Notes in Computer Science Vol. | We introduce in this paper a runtime-efficient tree hashing algorithm for the identification of isomorphic subtrees, with two important applications in genetic programming for symbolic regression: fast, online calculation of population diversity and algebraic simplification of symbolic expression trees. Based on this hashing approach, we propose a simple diversity-preservation mechanism with promising results on a collection of symbolic regression benchmark problems. |
|||||
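A hedged sketch of the general subtree-hashing idea (Merkle-style bottom-up hashes), not necessarily the paper's exact algorithm: each node's hash combines its symbol with its children's hashes, so isomorphic subtrees receive identical hashes and can be located for diversity measurement or simplification in near-linear time. The tree encoding below is illustrative.

```python
import hashlib

def subtree_hash(node, table):
    """Bottom-up hash: identical (isomorphic) subtrees get identical hashes."""
    symbol, children = node
    child_hashes = [subtree_hash(c, table) for c in children]
    h = hashlib.sha1((symbol + "(" + ",".join(child_hashes) + ")").encode()).hexdigest()
    table.setdefault(h, []).append(node)
    return h

# (symbol, children) expression trees for x*(x+1) and (x+1)+(x+1)
t1 = ("*", [("x", []), ("+", [("x", []), ("1", [])])])
t2 = ("+", [("+", [("x", []), ("1", [])]), ("+", [("x", []), ("1", [])])])

table = {}
subtree_hash(t1, table)
subtree_hash(t2, table)
h_xp1 = subtree_hash(("+", [("x", []), ("1", [])]), {})  # hash of the pattern x+1
print(len(table[h_xp1]))                                 # 3: once in t1, twice in t2
```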
2021 | Compositional Sketch Search | Black Alexander, Bui Tu, Mai Long, Jin Hailin, Collomosse John | Arxiv | We present an algorithm for searching image collections using free-hand sketches that describe the appearance and relative positions of multiple objects. Sketch based image retrieval (SBIR) methods predominantly match queries containing a single, dominant object invariant to its position within an image. Our work exploits drawings as a concise and intuitive representation for specifying entire scene compositions. We train a convolutional neural network (CNN) to encode masked visual features from sketched objects, pooling these into a spatial descriptor encoding the spatial relationships and appearances of objects in the composition. Training the CNN backbone as a Siamese network under triplet loss yields a metric search embedding for measuring compositional similarity which may be efficiently leveraged for visual search by applying product quantization. |
|||||
2021 | State Of The Art Image Hashing | Biswas Rubel, Blanco-medina Pablo | Arxiv | Perceptual image hashing methods are often applied for various objectives, such as image retrieval, finding duplicate or near-duplicate images, and finding similar images in large-scale image collections. The main challenge in image hashing techniques is robust feature extraction, which generates the same or similar hashes for images that are visually identical. In this article, we present a short review of the state-of-the-art traditional perceptual hashing and deep learning-based perceptual hashing methods, identifying the best approaches. |
|||||
2021 | QUINT Node Embedding Using Network Hashing | Bera Debajyoti, Pratap Rameshwar, Verma Bhisham Dev, Sen Biswadeep, Chakraborty Tanmoy | Arxiv | Representation learning using network embedding has received tremendous attention due to its efficacy to solve downstream tasks. Popular embedding methods (such as deepwalk, node2vec, LINE) are based on a neural architecture, thus unable to scale on large networks both in terms of time and space usage. Recently, we proposed BinSketch, a sketching technique for compressing binary vectors to binary vectors. In this paper, we show how to extend BinSketch and use it for network hashing. Our proposal named QUINT is built upon BinSketch, and it embeds nodes of a sparse network onto a low-dimensional space using simple bit-wise operations. QUINT is the first of its kind that provides tremendous gain in terms of speed and space usage without compromising much on the accuracy of the downstream tasks. Extensive experiments are conducted to compare QUINT with seven state-of-the-art network embedding methods for two end tasks - link prediction and node classification. We observe huge performance gain for QUINT in terms of speedup (up to 7000x) and space saving (up to 80x) due to its bit-wise nature to obtain node embedding. Moreover, QUINT is a consistent top-performer for both the tasks among the baselines across all the datasets. Our empirical observations are backed by rigorous theoretical analysis to justify the effectiveness of QUINT. In particular, we prove that QUINT retains enough structural information which can be used further to approximate many topological properties of networks with high confidence. |
|||||
2021 | On The Optimal Time/space Tradeoff For Hash Tables | Bender Michael A., Farach-colton Martín, Kuszmaul John, Kuszmaul William, Liu Mingmou | Arxiv | For nearly six decades, the central open question in the study of hash tables has been to determine the optimal achievable tradeoff curve between time and space. State-of-the-art hash tables offer the following guarantee: If keys/values are Theta(log n) bits each, then it is possible to achieve constant-time insertions/deletions/queries while wasting only O(loglog n) bits of space per key when compared to the information-theoretic optimum. Even prior to this bound being achieved, the target of O(loglog n) wasted bits per key was known to be a natural end goal, and was proven to be optimal for a number of closely related problems (e.g., stable hashing, dynamic retrieval, and dynamically-resized filters). This paper shows that O(loglog n) wasted bits per key is not the end of the line for hashing. In fact, for any k \in [log* n], it is possible to achieve O(k)-time insertions/deletions, O(1)-time queries, and O(log^{(k)} n) wasted bits per key (all with high probability in n). This means that, each time we increase insertion/deletion time by an additive constant, we reduce the wasted bits per key exponentially. We further show that this tradeoff curve is the best achievable by any of a large class of hash tables, including any hash table designed using the current framework for making constant-time hash tables succinct. |
|||||
2021 | Linear Probing Revisited Tombstones Mark The Death Of Primary Clustering | Bender Michael A., Kuszmaul Bradley C., Kuszmaul William | Arxiv | First introduced in 1954, linear probing is one of the oldest data structures in computer science, and due to its unrivaled data locality, it continues to be one of the fastest hash tables in practice. It is widely believed and taught, however, that linear probing should never be used at high load factors; this is because primary-clustering effects cause insertions at load factor \(1 - 1/x\) to take expected time \(\Theta(x^2)\) (rather than the ideal \(\Theta(x)\)). The dangers of primary clustering, first discovered by Knuth in 1963, have been taught to generations of computer scientists, and have influenced the design of many widely used hash tables. We show that primary clustering is not a foregone conclusion. We demonstrate that small design decisions in how deletions are implemented have dramatic effects on the asymptotic performance of insertions, so that, even if a hash table operates continuously at a load factor \(1 - \Theta(1/x)\), the expected amortized cost per operation is \(\tilde{O}(x)\). This is because tombstones created by deletions actually cause an anti-clustering effect that combats primary clustering. We also present a new variant of linear probing (which we call graveyard hashing) that completely eliminates primary clustering on any sequence of operations: if, when an operation is performed, the current load factor is \(1 - 1/x\) for some \(x\), then the expected cost of the operation is \(O(x)\). One corollary is that, in the external-memory model with data blocks of size \(B\), graveyard hashing offers the following remarkable guarantee: at any load factor \(1 - 1/x\) satisfying \(x = o(B)\), graveyard hashing achieves \(1 + o(1)\) expected block transfers per operation. Past external-memory hash tables have only been able to offer a \(1 + o(1)\) guarantee when the block size \(B\) is at least \(\Omega(x^2)\). |
|||||
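For readers who want the mechanism above in code, here is a textbook linear-probing table with tombstone deletion. This is the classical scheme the paper analyzes, not its graveyard-hashing variant; capacity is fixed and no resizing is done in this sketch.

```python
EMPTY, TOMBSTONE = object(), object()

class LinearProbingTable:
    """Textbook linear probing with tombstone deletion (not graveyard hashing)."""
    def __init__(self, capacity=16):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        start = hash(key) % len(self.slots)
        for i in range(len(self.slots)):               # probe h(key), h(key)+1, ...
            yield (start + i) % len(self.slots)

    def insert(self, key, value):
        first_free = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is TOMBSTONE:
                if first_free is None:
                    first_free = i                     # remember first reusable slot
            elif slot is EMPTY:
                self.slots[i if first_free is None else first_free] = (key, value)
                return
            elif slot[0] == key:
                self.slots[i] = (key, value)           # overwrite existing key
                return

    def get(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:                          # a truly empty slot ends the search...
                return None
            if slot is not TOMBSTONE and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return
            if slot is not TOMBSTONE and slot[0] == key:
                self.slots[i] = TOMBSTONE              # ...so deletions leave a marker
                return

t = LinearProbingTable()
t.insert("a", 1); t.insert("b", 2); t.delete("a")
print(t.get("b"), t.get("a"))                          # 2 None
```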
2021 | Iceberg Hashing Optimizing Many Hash-table Criteria At Once | Bender Michael A., Conway Alex, Farach-colton Martín, Kuszmaul William, Tagliavini Guido | Arxiv | Despite being one of the oldest data structures in computer science, hash tables continue to be the focus of a great deal of both theoretical and empirical research. A central reason for this is that many of the fundamental properties that one desires from a hash table are difficult to achieve simultaneously; thus many variants offering different trade-offs have been proposed. This paper introduces Iceberg hashing, a hash table that simultaneously offers the strongest known guarantees on a large number of core properties. Iceberg hashing supports constant-time operations while improving on the state of the art for space efficiency, cache efficiency, and low failure probability. Iceberg hashing is also the first hash table to support a load factor of up to \(1 - o(1)\) while being stable, meaning that the position where an element is stored only ever changes when resizes occur. In fact, in the setting where keys are \(\Theta(log n)\) bits, the space guarantees that Iceberg hashing offers, namely that it uses at most \(log \binom{|U|}{n} + O(n log log n)\) bits to store \(n\) items from a universe \(U\), matches a lower bound by Demaine et al. that applies to any stable hash table. Iceberg hashing introduces new general-purpose techniques for some of the most basic aspects of hash-table design. Notably, our indirection-free technique for dynamic resizing, which we call waterfall addressing, and our techniques for achieving stability and very-high probability guarantees, can be applied to any hash table that makes use of the front-yard/backyard paradigm for hash table design. |
|||||
2021 | Sampling A Near Neighbor In High Dimensions -- Who Is The Fairest Of Them All | Aumüller Martin, Har-peled Sariel, Mahabadi Sepideh, Pagh Rasmus, Silvestri Francesco | Arxiv | Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points \(S\) and a radius parameter \(r>0\), the \(r\)-near neighbor (\(r\)-NN) problem asks for a data structure that, given any query point \(q\), returns a point \(p\) within distance at most \(r\) from \(q\). In this paper, we study the \(r\)-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance \(r\) from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. In this work, we show that LSH based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. We also develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights the inherent unfairness of NN data structures and shows the performance of our algorithms on real-world datasets. |
|||||
2021 | Better GPU Hash Tables | Awad Muhammad A., Ashkiani Saman, Porumbescu Serban D., Farach-colton Martín, Owens John D. | Arxiv | We revisit the problem of building static hash tables on the GPU and design and build three bucketed hash tables that use different probing schemes. Our implementations are lock-free and offer efficient memory access patterns; thus, only the probing scheme is the factor affecting the performance of the hash table’s different operations. Our results show that a bucketed cuckoo hash table that uses three hash functions (BCHT) outperforms alternative methods that use power-of-two choices, iceberg hashing, and a cuckoo hash table that uses a bucket size of one. At load factors as high as 0.99, BCHT enjoys an average probe count of 1.43 during insertion. Using three hash functions only, positive and negative queries require at most 1.39 and 2.8 average probes per key, respectively. |
|||||
2021 | Compactness Of Hashing Modes And Efficiency Beyond Merkle Tree | Andreeva Elena, Bhattacharyya Rishiraj, Roy Arnab | Arxiv | We revisit the classical problem of designing optimally efficient cryptographically secure hash functions. Hash functions are traditionally designed via applying modes of operation on primitives with smaller domains. The results of Shrimpton and Stam (ICALP 2008), Rogaway and Steinberger (CRYPTO 2008), and Mennink and Preneel (CRYPTO 2012) show how to achieve optimally efficient designs of \(2n\)-to-\(n\)-bit compression functions from non-compressing primitives with asymptotically optimal \(2^{n/2-\epsilon}\)-query collision resistance. Designing optimally efficient and secure hash functions for larger domains (\(> 2n\) bits) is still an open problem. In this work we propose the new \textit{compactness} efficiency notion. It allows us to focus on asymptotically optimally collision resistant hash functions and normalize their parameters based on Stam’s bound from CRYPTO 2008 to obtain maximal efficiency. We then present two tree-based modes of operation. Our first construction is an \underline{A}ugmented \underline{B}inary T\underline{r}ee (ABR) mode. The design is a \((2^{\ell}+2^{\ell-1} -1)n\)-to-\(n\)-bit hash function making a total of \((2^{\ell}-1)\) calls to \(2n\)-to-\(n\)-bit compression functions for any \(\ell\geq 2\). Our construction is optimally compact with asymptotically (optimal) \(2^{n/2-\epsilon}\)-query collision resistance in the ideal model. For a tree of height \(\ell\), in comparison with Merkle tree, the \(ABR\) mode processes additional \((2^{\ell-1}-1)\) data blocks making the same number of internal compression function calls. While the \(ABR\) mode achieves collision resistance, it fails to achieve indifferentiability from a random oracle within \(2^{n/3}\) queries. \(ABR^{+}\) compresses only \(1\) less data block than \(ABR\) with the same number of compression calls and achieves in addition indifferentiability up to \(2^{n/2-\epsilon}\) queries. |
|||||
2021 | Halftimehash Modern Hashing Without 64-bit Multipliers Or Finite Fields | Apple Jim | Arxiv | HalftimeHash is a new algorithm for hashing long strings. The goals are few collisions (different inputs that produce identical output hash values) and high performance. Compared to the fastest universal hash functions on long strings (clhash and UMASH), HalftimeHash decreases collision probability while also increasing performance by over 50%, exceeding 16 bytes per cycle. In addition, HalftimeHash does not use any widening 64-bit multiplications or any finite field arithmetic that could limit its portability. |
|||||
2021 | Additive Feature Hashing | Andrecut M. | Arxiv | The hashing trick is a machine learning technique used to encode categorical features into a numerical vector representation of pre-defined fixed length. It works by using the categorical hash values as vector indices, and updating the vector values at those indices. Here we discuss a different approach based on additive-hashing and the “almost orthogonal” property of high-dimensional random vectors. That is, we show that additive feature hashing can be performed directly by adding the hash values and converting them into high-dimensional numerical vectors. We show that the performance of additive feature hashing is similar to the hashing trick, and we illustrate the results numerically using synthetic, language recognition, and SMS spam detection data. |
|||||
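As a companion to the "Additive Feature Hashing" entry above, here is a small sketch of my reading of the idea (not the paper's code): each token is mapped to a pseudo-random \(\pm 1\) vector seeded by its hash and the token vectors are simply summed; because high-dimensional random vectors are almost orthogonal, distinct tokens stay roughly separable. The dimension and seeding scheme are illustrative choices, and the classic hashing trick is included only for contrast.

```python
import numpy as np

def hashing_trick(tokens, dim=1024):
    """Classic hashing trick: the token hash picks a slot, a second bit picks the sign."""
    v = np.zeros(dim)
    for tok in tokens:
        h = hash(tok)
        v[h % dim] += 1.0 if (h >> 1) % 2 == 0 else -1.0
    return v

def additive_hashing(tokens, dim=1024):
    """Additive feature hashing: sum one pseudo-random +-1 vector per token occurrence."""
    v = np.zeros(dim)
    for tok in tokens:
        tok_rng = np.random.default_rng(abs(hash(tok)) % (2**32))  # vector seeded by the token's hash
        v += tok_rng.choice([-1.0, 1.0], size=dim)
    return v

a = additive_hashing("the cat sat on the mat".split())
b = additive_hashing("the cat sat on a mat".split())
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(float(cos), 3))   # high cosine similarity for nearly identical token sets
```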
2021 | From Average Embeddings To Nearest Neighbor Search | Andoni Alexandr, Cheikhi David | Arxiv | In this note, we show that one can use average embeddings, introduced recently in [Naor’20, arXiv:1905.01280], to obtain efficient algorithms for approximate nearest neighbor search. In particular, a metric \(X\) embeds into \(\ell_2\) on average, with distortion \(D\), if, for any distribution \(\mu\) on \(X\), the embedding is \(D\) Lipschitz and the (square of) distance does not decrease on average (wrt \(\mu\)). In particular, the existence of such an embedding (assuming it is efficient) implies an \(O(D^3)\)-approximate nearest neighbor search under \(X\). This can be seen as a strengthening of the classic (bi-Lipschitz) embedding approach to nearest neighbor search, and is another application of the data-dependent hashing paradigm. |
|||||
2021 | Learning To Hash Robustly Guaranteed | Andoni Alexandr, Beaglehole Daniel | Arxiv | The indexing algorithms for the high-dimensional nearest neighbor search (NNS) with the best worst-case guarantees are based on the randomized Locality Sensitive Hashing (LSH), and its derivatives. In practice, many heuristic approaches exist to “learn” the best indexing method in order to speed up NNS, crucially adapting to the structure of the given dataset. Oftentimes, these heuristics outperform the LSH-based algorithms on real datasets, but, almost always, come at the cost of losing the guarantees of either correctness or robust performance on adversarial queries, or apply to datasets with an assumed extra structure/model. In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching that of theoretical algorithms, while optimizing the hashing to the structure of the dataset (think instance-optimal algorithms) for performance on the minimum-performing query. We evaluate the algorithm’s ability to optimize for a given dataset both theoretically and practically. On the theoretical side, we exhibit a natural setting (dataset model) where our algorithm is much better than the standard theoretical one. On the practical side, we run experiments that show that our algorithm achieves 1.8x and 2.1x better recall on the worst-performing queries for the MNIST and ImageNet datasets, respectively. |
|||||
2021 | Hashing And Metric Learning For Charged Particle Tracking | Amrouche Sabrina, Kiehn Moritz, Golling Tobias, Salzburger Andreas | Arxiv | We propose a novel approach to charged particle tracking at high intensity particle colliders based on Approximate Nearest Neighbors search. With hundreds of thousands of measurements per collision to be reconstructed e.g. at the High Luminosity Large Hadron Collider, the currently employed combinatorial track finding approaches become inadequate. Here, we use hashing techniques to separate measurements into buckets of 20-50 hits and increase their purity using metric learning. Two different approaches are studied to further resolve tracks inside buckets: Local Fisher Discriminant Analysis and Neural Networks for triplet similarity learning. We demonstrate the proposed approach on simulated collisions and show significant speed improvement with bucket tracking efficiency of 96% and a fake rate of 8% on unseen particle events. |
|||||
2021 | Nearest Neighbor Search With Compact Codes A Decoder Perspective | Amara Kenza, Douze Matthijs, Sablayrolles Alexandre, Jégou Hervé | Arxiv | Modern approaches for fast retrieval of similar vectors on billion-scaled datasets rely on compressed-domain approaches such as binary sketches or product quantization. These methods minimize a certain loss, typically the mean squared error or other objective functions tailored to the retrieval problem. In this paper, we re-interpret popular methods such as binary hashing or product quantizers as auto-encoders, and point out that they implicitly make suboptimal assumptions on the form of the decoder. We design backward-compatible decoders that improve the reconstruction of the vectors from the same codes, which translates to a better performance in nearest neighbor search. Our method significantly improves over binary hashing methods or product quantization on popular benchmarks. |
|||||
2021 | PHPQ Pyramid Hybrid Pooling Quantization For Efficient Fine-grained Image Retrieval | Zeng Ziyun, Wang Jinpeng, Chen Bin, Dai Tao, Xia Shu-tao, Wang Zhi | Pattern Recognition Letters Volume | Deep hashing approaches, including deep quantization and deep binary hashing, have become a common solution to large-scale image retrieval due to their high computation and storage efficiency. Most existing hashing methods cannot produce satisfactory results for fine-grained retrieval, because they usually adopt the outputs of the last CNN layer to generate binary codes. Since deeper layers tend to summarize visual clues, e.g., texture, into abstract semantics, e.g., dogs and cats, the feature produced by the last CNN layer is less effective in capturing subtle but discriminative visual details that mostly exist in shallow layers. To improve fine-grained image hashing, we propose Pyramid Hybrid Pooling Quantization (PHPQ). Specifically, we propose a Pyramid Hybrid Pooling (PHP) module to capture and preserve fine-grained semantic information from multi-level features, which emphasizes the subtle discrimination of different sub-categories. Besides, we propose a learnable quantization module with a partial codebook attention mechanism, which helps to optimize the most relevant codewords and improves the quantization. Comprehensive experiments on two widely-used public benchmarks, i.e., CUB-200-2011 and Stanford Dogs, demonstrate that PHPQ outperforms state-of-the-art methods. |
|||||
2021 | Orthonormal Product Quantization Network For Scalable Face Image Retrieval | Zhang Ming, Zhe Xuefei, Yan Hong | Arxiv | Existing deep quantization methods provide an efficient solution for large-scale image retrieval. However, the significant intra-class variations like pose, illumination, and expressions in face images, still pose a challenge for face image retrieval. In light of this, face image retrieval requires sufficiently powerful learning metrics, which are absent in current deep quantization works. Moreover, to tackle the growing unseen identities in the query stage, face image retrieval drives more demands regarding model generalization and system scalability than general image retrieval tasks. This paper integrates product quantization with orthonormal constraints into an end-to-end deep learning framework to effectively retrieve face images. Specifically, a novel scheme that uses predefined orthonormal vectors as codewords is proposed to enhance the quantization informativeness and reduce codewords’ redundancy. A tailored loss function maximizes discriminability among identities in each quantization subspace for both the quantized and original features. An entropy-based regularization term is imposed to reduce the quantization error. Experiments are conducted on four commonly-used face datasets under both seen and unseen identities retrieval settings. Our method outperforms all the compared deep hashing/quantization state-of-the-arts under both settings. Results validate the effectiveness of the proposed orthonormal codewords in improving models’ standard retrieval performance and generalization ability. Combined with further experiments on two general image datasets, this demonstrates the broad superiority of our method for scalable image retrieval. |
|||||
2021 | Dynamic Enumeration Of Similarity Joins | Agarwal Pankaj K., Hu Xiao, Sintos Stavros, Yang Jun | Arxiv | This paper considers enumerating answers to similarity-join queries under dynamic updates: Given two sets of \(n\) points \(A,B\) in \(\mathbb{R}^d\), a metric \(\phi(\cdot)\), and a distance threshold \(r > 0\), report all pairs of points \((a, b) \in A \times B\) with \(\phi(a,b) \le r\). Our goal is to store \(A,B\) into a dynamic data structure that, whenever asked, can enumerate all result pairs with worst-case delay guarantee, i.e., the time between enumerating two consecutive pairs is bounded. Furthermore, the data structure can be efficiently updated when a point is inserted into or deleted from \(A\) or \(B\). We propose several efficient data structures for answering similarity-join queries in low dimension. For exact enumeration of similarity join, we present near-linear-size data structures for \(\ell_1, \ell_\infty\) metrics with \(\log^{O(1)} n\) update time and delay. We show that such a data structure is not feasible for the \(\ell_2\) metric for \(d \ge 4\). For approximate enumeration of similarity join, where the distance threshold is a soft constraint, we obtain a unified linear-size data structure for \(\ell_p\) metric, with \(\log^{O(1)} n\) delay and update time. In high dimensions, we present an efficient data structure with worst-case delay guarantee using locality sensitive hashing (LSH). |
|||||
2021 | Integrating Semantics And Neighborhood Information With Graph-driven Generative Models For Document Retrieval | Ou Zijing, Su Qinliang, Yu Jianxing, Liu Bang, Wang Jingwen, Zhao Ruihui, Chen Changyou, Zheng Yefeng | ACL | With the need for fast retrieval speed and small memory footprint, document hashing has been playing a crucial role in large-scale information retrieval. To generate high-quality hashing code, both semantics and neighborhood information are crucial. However, most existing methods leverage only one of them or simply combine them via some intuitive criteria, lacking a theoretical principle to guide the integration process. In this paper, we encode the neighborhood information with a graph-induced Gaussian distribution, and propose to integrate the two types of information with a graph-driven generative model. To deal with the complicated correlations among documents, we further propose a tree-structured approximation method for learning. Under the approximation, we prove that the training objective can be decomposed into terms involving only singleton or pairwise documents, enabling the model to be trained as efficiently as uncorrelated ones. Extensive experimental results on three benchmark datasets show that our method achieves superior performance over state-of-the-art methods, demonstrating the effectiveness of the proposed model for simultaneously preserving semantic and neighborhood information. |
|||||
2021 | Refining BERT Embeddings For Document Hashing Via Mutual Information Maximization | Ou Zijing, Su Qinliang, Yu Jianxing, Zhao Ruihui, Zheng Yefeng, Liu Bang | Arxiv | Existing unsupervised document hashing methods are mostly established on generative models. Due to the difficulties of capturing long dependency structures, these methods rarely model the raw documents directly, but instead model the features extracted from them (e.g. bag-of-words (BOW), TFIDF). In this paper, we propose to learn hash codes from BERT embeddings after observing their tremendous successes on downstream tasks. As a first try, we modify existing generative hashing models to accommodate the BERT embeddings. However, little improvement is observed over the codes learned from the old BOW or TFIDF features. We attribute this to the reconstruction requirement in the generative hashing, which forces irrelevant information that is abundant in the BERT embeddings to also be compressed into the codes. To remedy this issue, a new unsupervised hashing paradigm is further proposed based on the mutual information (MI) maximization principle. Specifically, the method first constructs appropriate global and local codes from the documents and then seeks to maximize their mutual information. Experimental results on three benchmark datasets demonstrate that the proposed method is able to generate hash codes that outperform existing ones learned from BOW features by a substantial margin. |
|||||
2021 | Prototype-supervised Adversarial Network For Targeted Attack Of Deep Hashing | Wang Xunguang, Zhang Zheng, Wu Baoyuan, Shen Fumin, Lu Guangming | Arxiv | Due to its powerful capability of representation learning and high-efficiency computation, deep hashing has made significant progress in large-scale image retrieval. However, deep hashing networks are vulnerable to adversarial examples, which is a practical security problem that is seldom studied in the hashing-based retrieval field. In this paper, we propose a novel prototype-supervised adversarial network (ProS-GAN), which formulates a flexible generative architecture for efficient and effective targeted hashing attack. To the best of our knowledge, this is the first generation-based method to attack deep hashing networks. Generally, our proposed framework consists of three parts, i.e., a PrototypeNet, a generator, and a discriminator. Specifically, the designed PrototypeNet embeds the target label into the semantic representation and learns the prototype code as the category-level representative of the target label. Moreover, the semantic representation and the original image are jointly fed into the generator for a flexible targeted attack. Particularly, the prototype code is adopted to supervise the generator to construct the targeted adversarial example by minimizing the Hamming distance between the hash code of the adversarial example and the prototype code. Furthermore, the generator is trained against the discriminator to simultaneously encourage the adversarial examples to be visually realistic and the semantic representation to be informative. Extensive experiments verify that the proposed framework can efficiently produce adversarial examples with better targeted attack performance and transferability over state-of-the-art targeted attack methods of deep hashing. The related code is available at https://github.com/xunguangwang/ProS-GAN . |
|||||
2021 | Oscar-net Object-centric Scene Graph Attention For Image Attribution | Nguyen Eric, Bui Tu, Swaminathan Vishy, Collomosse John | Arxiv | Images tell powerful stories but cannot always be trusted. Matching images back to trusted sources (attribution) enables users to make a more informed judgment of the images they encounter online. We propose a robust image hashing algorithm to perform such matching. Our hash is sensitive to manipulation of subtle, salient visual details that can substantially change the story told by an image. Yet the hash is invariant to benign transformations (changes in quality, codecs, sizes, shapes, etc.) experienced by images during online redistribution. Our key contribution is OSCAR-Net (Object-centric Scene Graph Attention for Image Attribution Network); a robust image hashing model inspired by recent successes of Transformers in the visual domain. OSCAR-Net constructs a scene graph representation that attends to fine-grained changes of every object’s visual appearance and their spatial relationships. The network is trained via contrastive learning on a dataset of original and manipulated images yielding a state of the art image hash for content fingerprinting that scales to millions of images. |
|||||
2021 | Towards A Model For LSH | Wang Li | Arxiv | As data volumes continue to grow, clustering and outlier detection algorithms are becoming increasingly time-consuming. Classical index structures for neighbor search are no longer sustainable due to the “curse of dimensionality”. Instead, approximated index structures offer a good opportunity to significantly accelerate the neighbor search for clustering and outlier detection and to have the lowest possible error rate in the results of the algorithms. Locality-sensitive hashing is one of those. We indicate directions to model the properties of LSH. |
|||||
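Since the "Towards A Model For LSH" entry above discusses modeling the behavior of LSH for neighbor search, a toy random-hyperplane LSH index is sketched below for readers who want a concrete object to reason about (illustrative parameters; this is generic cosine LSH, not a construction from the paper). The property such a model would capture is that two vectors collide on a single hyperplane with probability \(1 - \theta/\pi\), where \(\theta\) is the angle between them.

```python
import numpy as np

class HyperplaneLSH:
    """Toy (n_bits, n_tables) LSH index for cosine similarity."""
    def __init__(self, dim, n_bits=10, n_tables=4, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_tables, n_bits, dim))
        self.tables = [dict() for _ in range(n_tables)]
        self.data = []

    def _keys(self, x):
        # one sign-pattern bucket key per table
        return [tuple((P @ x > 0).tolist()) for P in self.planes]

    def add(self, x):
        idx = len(self.data)
        self.data.append(np.asarray(x, dtype=float))
        for table, key in zip(self.tables, self._keys(x)):
            table.setdefault(key, []).append(idx)

    def query(self, q, k=5):
        cand = set()
        for table, key in zip(self.tables, self._keys(q)):
            cand.update(table.get(key, []))       # union of the query's buckets
        return sorted(cand, key=lambda i: -self._cos(q, self.data[i]))[:k]

    @staticmethod
    def _cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(1)
index = HyperplaneLSH(dim=32)
for x in rng.normal(size=(1000, 32)):
    index.add(x)
print(index.query(index.data[0]))   # candidate ids; id 0 should rank first
```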
2021 | Rescuing Deep Hashing From Dead Bits Problem | Zhao Shu, Wu Dayan, Zhou Yucan, Li Bo, Wang Weiping | Arxiv | Deep hashing methods have shown great retrieval accuracy and efficiency in large-scale image retrieval. How to optimize discrete hash bits is always the focus in deep hashing methods. A common strategy in these methods is to adopt an activation function, e.g. \(\operatorname{sigmoid}(\cdot)\) or \(\operatorname{tanh}(\cdot)\), and minimize a quantization loss to approximate discrete values. However, this paradigm may leave more and more hash bits stuck in the wrong saturated area of the activation functions, never to escape. We call this problem the “Dead Bits Problem (DBP)”. Besides, the existing quantization loss will aggravate DBP as well. In this paper, we propose a simple but effective gradient amplifier which acts before activation functions to alleviate DBP. Moreover, we devise an error-aware quantization loss to further alleviate DBP. It avoids the negative effect of quantization loss based on the similarity between two images. The proposed gradient amplifier and error-aware quantization loss are compatible with a variety of deep hashing methods. Experimental results on three datasets demonstrate the efficiency of the proposed gradient amplifier and the error-aware quantization loss. |
|||||
2021 | Evaluating Post-training Compression In Gans Using Locality-sensitive Hashing | Mordido Gonçalo, Yang Haojin, Meinel Christoph | Arxiv | The analysis of the compression effects in generative adversarial networks (GANs) after training, i.e. without any fine-tuning, remains an unstudied, albeit important, topic with the increasing trend of their computation and memory requirements. While existing works discuss the difficulty of compressing GANs during training, requiring novel methods designed with the instability of GANs training in mind, we show that existing compression methods (namely clipping and quantization) may be directly applied to compress GANs post-training, without any additional changes. High compression levels may distort the generated set, likely leading to an increase of outliers that may negatively affect the overall assessment of existing k-nearest neighbor (KNN) based metrics. We propose two new precision and recall metrics based on locality-sensitive hashing (LSH), which, on top of increasing the outlier robustness, decrease the complexity of assessing an evaluation sample against \(n\) reference samples from \(O(n)\) to \(O(\log n)\), if using LSH and KNN, and to \(O(1)\), if only applying LSH. We show that low-bit compression of several pre-trained GANs on multiple datasets induces a trade-off between precision and recall, retaining sample quality while sacrificing sample diversity. |
|||||
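Below is a simplified numpy sketch of the LSH-only variant of the precision/recall idea in the entry above (my own reading; the hash family, bit count, and data are illustrative): a generated sample counts toward precision if it falls into a bucket occupied by real samples, which is an O(1) membership test rather than an O(n) nearest-neighbor check, and recall swaps the roles of the two sets.

```python
import numpy as np

def lsh_keys(X, planes):
    """Sign-pattern bucket key (hyperplane LSH) for each row of X."""
    return [tuple(row) for row in (X @ planes.T > 0)]

def lsh_precision_recall(real, fake, n_bits=8, seed=0):
    planes = np.random.default_rng(seed).normal(size=(n_bits, real.shape[1]))
    real_keys, fake_keys = lsh_keys(real, planes), lsh_keys(fake, planes)
    real_buckets, fake_buckets = set(real_keys), set(fake_keys)
    precision = float(np.mean([k in real_buckets for k in fake_keys]))  # fake covered by real
    recall = float(np.mean([k in fake_buckets for k in real_keys]))     # real covered by fake
    return precision, recall

rng = np.random.default_rng(1)
real = rng.normal(size=(2000, 16))
fake = rng.normal(size=(2000, 16)) + 3.0   # a "mode-collapsed" generator stuck in one region
print(lsh_precision_recall(real, fake))    # precision should stay high while recall drops
```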
2021 | Meta Cross-modal Hashing On Long-tailed Data | Wang Runmin, Yu Guoxian, Domeniconi Carlotta, Zhang Xiangliang | Arxiv | Due to the advantage of reducing storage while speeding up query time on big heterogeneous data, cross-modal hashing has been extensively studied for approximate nearest neighbor search of multi-modal data. Most hashing methods assume that training data is class-balanced. However, in practice, real world data often have a long-tailed distribution. In this paper, we introduce a meta-learning based cross-modal hashing method (MetaCMH) to handle long-tailed data. Due to the lack of training samples in the tail classes, MetaCMH first learns direct features from data in different modalities, and then introduces an associative memory module to learn the memory features of samples of the tail classes. It then combines the direct and memory features to obtain meta features for each sample. For samples of the head classes of the long tail distribution, the weight of the direct features is larger, because there are enough training data to learn them well; while for rare classes, the weight of the memory features is larger. Finally, MetaCMH uses a likelihood loss function to preserve the similarity in different modalities and learns hash functions in an end-to-end fashion. Experiments on long-tailed datasets show that MetaCMH performs significantly better than state-of-the-art methods, especially on the tail classes. |
|||||
2021 | Microsoft Turing-ANNS-1B | Herve Jegou | NeurIPS | Microsoft Turing-ANNS-1B is a new dataset being released by the Microsoft Turing team for this competition. It consists of Bing queries encoded by Turing AGI v5 that trains Transformers to capture similarity of intent in web search queries. An early version of the RNN-based AGI Encoder is described in a SIGIR’19 paper and a blogpost. |
|||||
2021 | Microsoft SPACEV-1B | Microsoft | NeurIPS | Microsoft SPACEV-1B is a new web search related dataset released by Microsoft Bing for this competition. It consists of document and query vectors encoded by Microsoft SpaceV Superior model to capture generic intent representation. |
|||||
2021 | Hashing Modulo Alpha-equivalence | Maziarz Krzysztof, Ellis Tom, Lawrence Alan, Fitzgibbon Andrew, Jones Simon Peyton | Arxiv | In many applications one wants to identify identical subtrees of a program syntax tree. This identification should ideally be robust to alpha-renaming of the program, but no existing technique has been shown to achieve this with good efficiency (better than \(\mathcal{O}(n^2)\) in expression size). We present a new, asymptotically efficient way to hash modulo alpha-equivalence. A key insight of our method is to use a weak (commutative) hash combiner at exactly one point in the construction, which admits an algorithm with \(\mathcal{O}(n (\log n)^2)\) time complexity. We prove that the use of the commutative combiner nevertheless yields a strong hash with low collision probability. Numerical benchmarks attest to the asymptotic behaviour of the method. |
|||||
2021 | Rank-consistency Deep Hashing For Scalable Multi-label Image Search | Ma Cheng, Lu Jiwen, Zhou Jie | IEEE Transactions on Multimedia | As hashing becomes an increasingly appealing technique for large-scale image retrieval, multi-label hashing is also attracting more attention for the ability to exploit multi-level semantic contents. In this paper, we propose a novel deep hashing method for scalable multi-label image search. Unlike existing approaches with conventional objectives such as contrast and triplet losses, we employ a rank list, rather than pairs or triplets, to provide sufficient global supervision information for all the samples. Specifically, a new rank-consistency objective is applied to align the similarity orders from two spaces, the original space and the Hamming space. A powerful loss function is designed to penalize the samples whose semantic similarity and Hamming distance are mismatched in two spaces. Besides, a multi-label softmax cross-entropy loss is presented to enhance the discriminative power with a concise formulation of the derivative function. In order to manipulate the neighborhood structure of the samples with different labels, we design a multi-label clustering loss to cluster the hashing vectors of the samples with the same labels by reducing the distances between the samples and their multiple corresponding class centers. The state-of-the-art experimental results achieved on three public multi-label datasets, MIRFLICKR-25K, IAPRTC12 and NUS-WIDE, demonstrate the effectiveness of the proposed method. |
|||||
2021 | A^2-net Learning Attribute-aware Hash Codes For Large-scale Fine-grained Image Retrieval | Xiu-shen Wei, Yang Shen, Xuhao Sun, Han-jia Ye, Jian Yang | Neural Information Processing Systems | Our work focuses on tackling large-scale fine-grained image retrieval as ranking the images depicting the concept of interests (i.e., the same sub-category labels) highest based on the fine-grained details in the query. It is desirable to alleviate the challenges of both fine-grained nature of small inter-class variations with large intra-class variations and explosive growth of fine-grained data for such a practical task. In this paper, we propose an Attribute-Aware hashing Network (A\(^2\)-Net) for generating attribute-aware hash codes to not only make the retrieval process efficient, but also establish explicit correspondences between hash codes and visual attributes. Specifically, based on the captured visual representations by attention, we develop an encoder-decoder structure network of a reconstruction task to unsupervisedly distill high-level attribute-specific vectors from the appearance-specific visual representations without attribute annotations. A\(^2\)-Net is also equipped with a feature decorrelation constraint upon these attribute vectors to enhance their representation abilities. Finally, the required hash codes are generated by the attribute vectors driven by preserving original similarities. Qualitative experiments on five benchmark fine-grained datasets show our superiority over competing methods. More importantly, quantitative results demonstrate the obtained hash codes can strongly correspond to certain kinds of crucial properties of fine-grained objects. |
|||||
2021 | Deep Unsupervised Hashing By Distilled Smooth Guidance | Luo Xiao, Ma Zeyu, Wu Daqing, Zhong Huasong, Chen Chong, Ma Jinwen, Deng Minghua | ICME | Hashing has been widely used in approximate nearest neighbor search for its storage and computational efficiency. Deep supervised hashing methods are not widely used because of the lack of labeled data, especially when the domain is transferred. Meanwhile, unsupervised deep hashing models can hardly achieve satisfactory performance due to the lack of reliable similarity signals. To tackle this problem, we propose a novel deep unsupervised hashing method, namely Distilled Smooth Guidance (DSG), which can learn a distilled dataset consisting of similarity signals as well as smooth confidence signals. To be specific, we obtain the similarity confidence weights based on the initial noisy similarity signals learned from local structures and construct a priority loss function for smooth similarity-preserving learning. Besides, global information based on clustering is utilized to distill the image pairs by removing contradictory similarity signals. Extensive experiments on three widely used benchmark datasets show that the proposed DSG consistently outperforms the state-of-the-art search methods. |
|||||
2021 | Large-scale Visual Search With Binary Distributed Graph At Alibaba | Zhao Kang, Pan Pan, Zheng Yun, Zhang Yanhao, Wang Changxu, Zhang Yingya, Xu Yinghui, Jin Rong | Arxiv | Graph-based approximate nearest neighbor search has attracted more and more attention due to its online search advantages. A number of methods studying the enhancement of speed and recall have been put forward. However, few of them focus on the efficiency and scale of offline graph-construction. For a deployed visual search system with several billions of online images in total, building a billion-scale offline graph in hours is essential, which is almost unachievable by most existing methods. In this paper, we propose a novel algorithm called Binary Distributed Graph to solve this problem. Specifically, we combine binary codes with graph structure to speed up online and offline procedures, and achieve comparable performance with the ones in real-value based scenarios by recalling more binary candidates. Furthermore, the graph-construction is optimized into a completely distributed implementation, which significantly accelerates the offline process and gets rid of the limitation of memory and disk within a single machine. Experimental comparisons on Alibaba Commodity Data Set (more than three billion images) show that the proposed method outperforms the state-of-the-art with respect to the online/offline trade-off. |
|||||
2021 | One Loss For All Deep Hashing With A Single Cosine Similarity Based Learning Objective | Jiun Tian Hoe, Kam Woh Ng, Tianyu Zhang, Chee Seng Chan, Yi-zhe Song, Tao Xiang | Neural Information Processing Systems | A deep hashing model typically has two main learning objectives: to make the learned binary hash codes discriminative and to minimize a quantization error. With further constraints such as bit balance and code orthogonality, it is not uncommon for existing models to employ a large number (>4) of losses. This leads to difficulties in model training and subsequently impedes their effectiveness. In this work, we propose a novel deep hashing model with only \(\textit{a single learning objective}\). Specifically, we show that maximizing the cosine similarity between the continuous codes and their corresponding \(\textit{binary orthogonal codes}\) can ensure both hash code discriminativeness and quantization error minimization. Further, with this learning objective, code balancing can be achieved by simply using a Batch Normalization (BN) layer and multi-label classification is also straightforward with label smoothing. The result is a one-loss deep hashing model that removes all the hassles of tuning the weights of various losses. Importantly, extensive experiments show that our model is highly effective, outperforming the state-of-the-art multi-loss hashing models on three large-scale instance retrieval benchmarks, often by significant margins. |
|||||
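Here is a numpy sketch of the single objective described in the "One Loss For All" entry above, with assumed details (the target-code construction via a Hadamard matrix and the plain cosine loss are illustrative choices, not necessarily the released implementation): each class gets a binary orthogonal target code, the network output is pushed toward it by maximizing cosine similarity, and the final hash code is simply the sign of the output.

```python
import numpy as np

def hadamard(n_bits):
    """Sylvester construction: rows are mutually orthogonal +-1 codes (n_bits a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n_bits:
        H = np.block([[H, H], [H, -H]])
    return H

def one_loss(continuous_codes, labels, targets):
    """1 - cosine(code, target of its class); minimizing it drives codes toward discriminative
    binary targets while shrinking quantization error at the same time."""
    t = targets[labels]                                  # (batch, n_bits) per-sample target codes
    num = np.sum(continuous_codes * t, axis=1)
    den = np.linalg.norm(continuous_codes, axis=1) * np.linalg.norm(t, axis=1)
    return float(np.mean(1.0 - num / den))

n_bits, n_classes = 32, 10
targets = hadamard(n_bits)[:n_classes]                   # one binary orthogonal code per class
rng = np.random.default_rng(0)
codes = rng.normal(size=(8, n_bits))                     # stand-in for a network's continuous outputs
labels = rng.integers(n_classes, size=8)
print(one_loss(codes, labels, targets))                  # value in [0, 2]; lower is better
hash_codes = np.sign(codes)                              # binary codes used at retrieval time
```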
2021 | Hard Example Guided Hashing For Image Retrieval | Su Hai, Han Meiyin, Liang Junle, Liang Jun, Yu Songsen | Arxiv | Compared with traditional hashing methods, deep hashing methods generate hash codes with rich semantic information and greatly improve performance in the image retrieval field. However, current deep hashing methods remain unsatisfactory at predicting the similarity of hard examples. Two main factors limit the ability to learn hard examples: weak key-feature extraction and the shortage of hard examples. In this paper, we give a novel end-to-end model to extract the key features from hard examples and obtain hash codes with accurate semantic information. In addition, we redesign a hard pair-wise loss function to assess the hard degree and update penalty weights of examples. It effectively alleviates the shortage problem in hard examples. Experimental results on CIFAR-10 and NUS-WIDE demonstrate that our model outperforms the mainstream hashing-based image retrieval methods. |
|||||
2021 | Deep Asymmetric Hashing With Dual Semantic Regression And Class Structure Quantization | Lu Jianglin, Wang Hailing, Zhou Jie, Yan Mengfan, Wen Jiajun | Arxiv | Recently, deep hashing methods have been widely used in image retrieval task. Most existing deep hashing approaches adopt one-to-one quantization to reduce information loss. However, such class-unrelated quantization cannot give discriminative feedback for network training. In addition, these methods only utilize single label to integrate supervision information of data for hashing function learning, which may result in inferior network generalization performance and relatively low-quality hash codes since the inter-class information of data is totally ignored. In this paper, we propose a dual semantic asymmetric hashing (DSAH) method, which generates discriminative hash codes under three-fold constraints. Firstly, DSAH utilizes class prior to conduct class structure quantization so as to transmit class information during the quantization process. Secondly, a simple yet effective label mechanism is designed to characterize both the intra-class compactness and inter-class separability of data, thereby achieving semantic-sensitive binary code learning. Finally, a meaningful pairwise similarity preserving loss is devised to minimize the distances between class-related network outputs based on an affinity graph. With these three main components, high-quality hash codes can be generated through network. Extensive experiments conducted on various datasets demonstrate the superiority of DSAH in comparison with state-of-the-art deep hashing methods. |
|||||
2021 | Learnable Locality-sensitive Hashing For Video Anomaly Detection | Lu Yue, Cao Congqi, Zhang Yanning | Arxiv | Video anomaly detection (VAD) mainly refers to identifying anomalous events that have not occurred in the training set where only normal samples are available. Existing works usually formulate VAD as a reconstruction or prediction problem. However, the adaptability and scalability of these methods are limited. In this paper, we propose a novel distance-based VAD method to take advantage of all the available normal data efficiently and flexibly. In our method, the smaller the distance between a testing sample and normal samples, the higher the probability that the testing sample is normal. Specifically, we propose to use locality-sensitive hashing (LSH) to map samples whose similarity exceeds a certain threshold into the same bucket in advance. In this manner, the complexity of near neighbor search is cut down significantly. To pull semantically similar samples closer together and push dissimilar samples further apart, we propose a novel learnable version of LSH that embeds LSH into a neural network and optimizes the hash functions with a contrastive learning strategy. The proposed method is robust to data imbalance and can handle the large intra-class variations in normal data flexibly. Besides, it scales well. Extensive experiments demonstrate the superiority of our method, which achieves new state-of-the-art results on VAD benchmarks. |
|||||
2021 | Learning To Break Deep Perceptual Hashing The Use Case Neuralhash | Struppek Lukas, Hintersdorf Dominik, Neider Daniel, Kersting Kristian | Arxiv | Apple recently revealed its deep perceptual hashing system NeuralHash to detect child sexual abuse material (CSAM) on user devices before files are uploaded to its iCloud service. Public criticism quickly arose regarding the protection of user privacy and the system’s reliability. In this paper, we present the first comprehensive empirical analysis of deep perceptual hashing based on NeuralHash. Specifically, we show that current deep perceptual hashing may not be robust. An adversary can manipulate the hash values by applying slight changes in images, either induced by gradient-based approaches or simply by performing standard image transformations, forcing or preventing hash collisions. Such attacks permit malicious actors easily to exploit the detection system: from hiding abusive material to framing innocent users, everything is possible. Moreover, using the hash values, inferences can still be made about the data stored on user devices. In our view, based on our results, deep perceptual hashing in its current form is generally not ready for robust client-side scanning and should not be used from a privacy perspective. |
|||||
2021 | Assessing The Effectiveness Of YARA Rules For Signature-based Malware Detection And Classification | Lockett Adam | Arxiv | Malware often uses obfuscation techniques or is modified slightly to evade signature detection from antivirus software and malware analysis tools. Traditionally, to determine if a file is malicious and identify what type of malware a sample is, a cryptographic hash of a file is calculated. A more recent and flexible solution for malware detection is YARA, which enables the creation of rules to identify and classify malware based on a file’s binary patterns. In this paper, the author will critically evaluate the effectiveness of YARA rules for signature-based detection and classification of malware in comparison to alternative methods, which include cryptographic and fuzzy hashing. |
|||||
2021 | SLOSH Set Locality Sensitive Hashing Via Sliced-wasserstein Embeddings | Lu Yuzhe, Liu Xinran, Soltoggio Andrea, Kolouri Soheil | Arxiv | Learning from set-structured data is an essential problem with many applications in machine learning and computer vision. This paper focuses on non-parametric and data-independent learning from set-structured data using approximate nearest neighbor (ANN) solutions, particularly locality-sensitive hashing. We consider the problem of set retrieval from an input set query. Such retrieval problem requires: 1) an efficient mechanism to calculate the distances/dissimilarities between sets, and 2) an appropriate data structure for fast nearest neighbor search. To that end, we propose Sliced-Wasserstein set embedding as a computationally efficient “set-2-vector” mechanism that enables downstream ANN, with theoretical guarantees. The set elements are treated as samples from an unknown underlying distribution, and the Sliced-Wasserstein distance is used to compare sets. We demonstrate the effectiveness of our algorithm, denoted as Set-LOcality Sensitive Hashing (SLOSH), on various set retrieval datasets and compare our proposed embedding with standard set embedding approaches, including Generalized Mean (GeM) embedding/pooling, Featurewise Sort Pooling (FSPool), and Covariance Pooling and show consistent improvement in retrieval results. The code for replicating our results is available here: \href{https://github.com/mint-vu/SLOSH}{https://github.com/mint-vu/SLOSH}. |
|||||
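Below is a simplified sketch of the "set-2-vector" step described in the SLOSH entry above (slice count and quantile grid are illustrative; this is not the released code): project the set's elements onto random directions, take fixed quantiles of each 1-D projection, and concatenate. The Euclidean distance between two such embeddings is a Monte Carlo estimate of the sliced-Wasserstein distance between the sets, so any off-the-shelf LSH/ANN index can be applied on top.

```python
import numpy as np

def sw_embed(point_set, directions, n_quantiles=16):
    """Sliced-Wasserstein set embedding: per slice, fixed quantiles of the 1-D projections."""
    qs = np.linspace(0.0, 1.0, n_quantiles)
    proj = point_set @ directions.T                      # (set_size, n_slices)
    return np.concatenate([np.quantile(proj[:, j], qs) for j in range(directions.shape[0])])

rng = np.random.default_rng(0)
dim, n_slices = 8, 32
directions = rng.normal(size=(n_slices, dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

set_a = rng.normal(size=(50, dim))                       # sets may have different cardinalities
set_b = set_a[:40] + 0.01 * rng.normal(size=(40, dim))   # small perturbation of set_a
set_c = rng.normal(size=(60, dim)) + 2.0                 # clearly different set

ea, eb, ec = (sw_embed(s, directions) for s in (set_a, set_b, set_c))
print(np.linalg.norm(ea - eb) < np.linalg.norm(ea - ec))  # True: the perturbed set is closer
```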
2021 | FDDH Fast Discriminative Discrete Hashing For Large-scale Cross-modal Retrieval | Liu Xin, Wang Xingzhi, Cheung Yiu-ming | IEEE Transactions on Neural Networks and Learning Systems | Cross-modal hashing, favored for its effectiveness and efficiency, has received wide attention to facilitating efficient retrieval across different modalities. Nevertheless, most existing methods do not sufficiently exploit the discriminative power of semantic information when learning the hash codes, while often involving time-consuming training procedure for handling the large-scale dataset. To tackle these issues, we formulate the learning of similarity-preserving hash codes in terms of orthogonally rotating the semantic data so as to minimize the quantization loss of mapping such data to Hamming space, and propose an efficient Fast Discriminative Discrete Hashing (FDDH) approach for large-scale cross-modal retrieval. More specifically, FDDH introduces an orthogonal basis to regress the targeted hash codes of training examples to their corresponding semantic labels, and utilizes the \(\epsilon\)-dragging technique to provide provably large semantic margins. Accordingly, the discriminative power of semantic information can be explicitly captured and maximized. Moreover, an orthogonal transformation scheme is further proposed to map the nonlinear embedding data into the semantic subspace, which can well guarantee the semantic consistency between the data feature and its semantic representation. Consequently, an efficient closed form solution is derived for discriminative hash code learning, which is very computationally efficient. In addition, an effective and stable online learning strategy is presented for optimizing modality-specific projection functions, featuring adaptivity to different training sizes and streaming data. The proposed FDDH approach theoretically approximates the bi-Lipschitz continuity, runs sufficiently fast, and also significantly improves the retrieval performance over the state-of-the-art methods. The source code is released at: https://github.com/starxliu/FDDH. |
|||||
2021 | Ternary Hashing | Liu Chang, Fan Lixin, Ng Kam Woh, Jin Yilun, Ju Ce, Zhang Tianyu, Chan Chee Seng, Yang Qiang | Arxiv | This paper proposes a novel ternary hash encoding for learning to hash methods, which provides a principled and more efficient coding scheme with performance better than that of the state-of-the-art binary hashing counterparts. Two kinds of axiomatic ternary logic, Kleene logic and {\L}ukasiewicz logic, are adopted to calculate the Ternary Hamming Distance (THD) for both the learning/encoding and testing/querying phases. Our work demonstrates that, with an efficient implementation of ternary logic on standard binary machines, the proposed ternary hashing compares favorably to the binary hashing methods, with consistent improvements of retrieval mean average precision (mAP) ranging from 1\% to 5.9\% as shown on the CIFAR10, NUS-WIDE and ImageNet100 datasets. |
|||||
2021 | Sentence Embeddings And High-speed Similarity Search For Fast Computer Assisted Annotation Of Legal Documents | Westermann Hannes, Savelka Jaromir, Walker Vern R., Ashley Kevin D., Benyekhlef Karim | Frontiers in Artificial Intelligence and Applications Volume | Human-performed annotation of sentences in legal documents is an important prerequisite to many machine learning based systems supporting legal tasks. Typically, the annotation is done sequentially, sentence by sentence, which is often time consuming and, hence, expensive. In this paper, we introduce a proof-of-concept system for annotating sentences “laterally.” The approach is based on the observation that sentences that are similar in meaning often have the same label in terms of a particular type system. We use this observation in allowing annotators to quickly view and annotate sentences that are semantically similar to a given sentence, across an entire corpus of documents. Here, we present the interface of the system and empirically evaluate the approach. The experiments show that lateral annotation has the potential to make the annotation process quicker and more consistent. |
|||||
2021 | Deep Self-adaptive Hashing For Image Retrieval | Lin Qinghong, Chen Xiaojun, Zhang Qin, Tian Shangxuan, Chen Yudong | Arxiv | Hashing technology has been widely used in image retrieval due to its computational and storage efficiency. Recently, deep unsupervised hashing methods have attracted increasing attention due to the high cost of human annotations in the real world and the superiority of deep learning technology. However, most deep unsupervised hashing methods usually pre-compute a similarity matrix to model the pairwise relationship in the pre-trained feature space. Then this similarity matrix would be used to guide hash learning, in which most of the data pairs are treated equivalently. The above process is confronted with the following defects: 1) The pre-computed similarity matrix is inalterable and disconnected from the hash learning process, which cannot explore the underlying semantic information. 2) The informative data pairs may be buried by the large number of less-informative data pairs. To solve the aforementioned problems, we propose a Deep Self-Adaptive Hashing (DSAH) model to adaptively capture the semantic information with two special designs: Adaptive Neighbor Discovery (AND) and Pairwise Information Content (PIC). Firstly, we adopt the AND to initially construct a neighborhood-based similarity matrix, and then refine this initial similarity matrix with a novel update strategy to further investigate the semantic structure behind the learned representation. Secondly, we measure the priorities of data pairs with PIC and assign adaptive weights to them, which relies on the assumption that more dissimilar data pairs contain more discriminative information for hash learning. Extensive experiments on several datasets demonstrate that the above two technologies facilitate the deep hashing model to achieve superior performance. |
|||||
2021 | 3rd Place A Global And Local Dual Retrieval Solution To Facebook AI Image Similarity Challenge | Sun Xinlong, Qin Yangyang, Xu Xuyuan, Gong Guoping, Fang Yang, Wang Yexin | Arxiv | As a basic task of computer vision, image similarity retrieval is facing the challenge of large-scale data and image copy attacks. This paper presents our 3rd place solution to the matching track of Image Similarity Challenge (ISC) 2021 organized by Facebook AI. We propose a multi-branch retrieval method of combining global descriptors and local descriptors to cover all attack cases. Specifically, we attempt many strategies to optimize global descriptors, including abundant data augmentations, self-supervised learning with a single Transformer model, overlay detection preprocessing. Moreover, we introduce the robust SIFT feature and GPU Faiss for local retrieval which makes up for the shortcomings of the global retrieval. Finally, KNN-matching algorithm is used to judge the match and merge scores. We show some ablation experiments of our method, which reveals the complementary advantages of global and local features. |
|||||
2021 | Online Enhanced Semantic Hashing Towards Effective And Efficient Retrieval For Streaming Multi-modal Data | Wu Xiao-ming, Luo Xin, Zhan Yu-wei, Ding Chen-lu, Chen Zhen-duo, Xu Xin-shun | Arxiv | With the vigorous development of multimedia equipment and applications, efficient retrieval of large-scale multi-modal data has become a trendy research topic. Among them, hashing has become a prevalent choice due to its retrieval efficiency and low storage cost. Although multi-modal hashing has drawn lots of attention in recent years, there still remain some problems. The first point is that existing methods are mainly designed in batch mode and not able to efficiently handle streaming multi-modal data. The second point is that all existing online multi-modal hashing methods fail to effectively handle unseen new classes which come continuously with streaming data chunks. In this paper, we propose a new model, termed Online enhAnced SemantIc haShing (OASIS). We design a novel semantic-enhanced representation for data, which could help handle newly arriving classes, and thereby construct the enhanced semantic objective function. An efficient and effective discrete online optimization algorithm is further proposed for OASIS. Extensive experiments show that our method can exceed the state-of-the-art models. For good reproducibility and to benefit the community, our code and data are already available in supplementary material and will be made publicly available. |
|||||
2021 | Cross-modal Zero-shot Hashing By Label Attributes Embedding | Wang Runmin, Yu Guoxian, Liu Lei, Cui Lizhen, Domeniconi Carlotta, Zhang Xiangliang | Arxiv | Cross-modal hashing (CMH) is one of the most promising methods in cross-modal approximate nearest neighbor search. Most CMH solutions ideally assume the labels of training and testing set are identical. However, the assumption is often violated, causing a zero-shot CMH problem. Recent efforts to address this issue focus on transferring knowledge from the seen classes to the unseen ones using label attributes. However, the attributes are isolated from the features of multi-modal data. To reduce the information gap, we introduce an approach called LAEH (Label Attributes Embedding for zero-shot cross-modal Hashing). LAEH first gets the initial semantic attribute vectors of labels by word2vec model and then uses a transformation network to transform them into a common subspace. Next, it leverages the hash vectors and the feature similarity matrix to guide the feature extraction network of different modalities. At the same time, LAEH uses the attribute similarity as the supplement of label similarity to rectify the label embedding and common subspace. Experiments show that LAEH outperforms related representative zero-shot and cross-modal hashing methods. |
|||||
2021 | More Robust Dense Retrieval With Contrastive Dual Learning | Li Yizhi, Liu Zhenghao, Xiong Chenyan, Liu Zhiyuan | Arxiv | Dense retrieval conducts text retrieval in the embedding space and has shown many advantages compared to sparse retrieval. Existing dense retrievers optimize representations of queries and documents with contrastive training and map them to the embedding space. The embedding space is optimized by aligning the matched query-document pairs and pushing the negative documents away from the query. However, in such a training paradigm, the queries are only optimized to align to the documents and are coarsely positioned, leading to an anisotropic query embedding space. In this paper, we analyze the embedding space distributions and propose an effective training paradigm, Contrastive Dual Learning for Approximate Nearest Neighbor (DANCE), to learn fine-grained query representations for dense retrieval. DANCE incorporates an additional dual training objective of query retrieval, inspired by the classic information retrieval training axiom, query likelihood. With contrastive learning, the dual training objective of DANCE learns more tailored representations for queries and documents to keep the embedding space smooth and uniform, improving the ranking performance of DANCE on the MS MARCO document retrieval task. Different from ANCE, which is optimized only with the document retrieval task, DANCE concentrates the query embeddings closer to document representations while making the document distribution more discriminative. Such concentrated query embedding distribution assigns more uniform negative sampling probabilities to queries and helps to sufficiently optimize query representations in the query retrieval task. Our codes are released at https://github.com/thunlp/DANCE. |
|||||
2021 | Ce-dedup Cost-effective Convolutional Neural Nets Training Based On Image Deduplication | Li Xuan, Chang Liqiong, Liu Xue | Arxiv | Owing to ever-increasing large image datasets, Convolutional Neural Networks (CNNs) have become popular for vision-based tasks. It is generally desirable to have larger datasets for higher network training accuracy. However, the impact of dataset quality has received little attention. It is reasonable to assume that near-duplicate images exist in these datasets. For instance, the Street View House Numbers (SVHN) dataset, which contains cropped house-plate digits from 0 to 9, is likely to have repetitive digits from the same or similar house plates. Redundant images may take up a certain portion of the dataset without being noticed. While contributing little to no accuracy improvement to CNN training, these duplicated images incur unnecessary extra resource and computation consumption. To this end, this paper proposes a framework to assess the impact of the near-duplicate images on CNN training performance, called CE-Dedup. Specifically, CE-Dedup associates a hashing-based image deduplication approach with downstream CNN-based image classification tasks. CE-Dedup balances the tradeoff between a large deduplication ratio and a stable accuracy by adjusting the deduplication threshold. The effectiveness of CE-Dedup is validated through extensive experiments on well-known CNN benchmarks. On one hand, while maintaining the same validation accuracy, CE-Dedup can reduce the dataset size by 23%. On the other hand, when allowing a small validation accuracy drop (by 5%), CE-Dedup can trim the dataset size by 75%. |
|||||
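A toy sketch of the hashing-based deduplication stage described in the CE-Dedup entry above, assuming a simple block-mean "average hash" and numpy arrays standing in for images (the actual pipeline, hash function, and threshold in the paper may differ): images whose hashes are within a few bits of an already kept image are treated as near-duplicates and dropped.

```python
import numpy as np

def average_hash(img, hash_side=8):
    """Block-mean perceptual hash: downsample to hash_side x hash_side, threshold at the mean."""
    h, w = img.shape
    cropped = img[:h - h % hash_side, :w - w % hash_side]
    blocks = cropped.reshape(hash_side, cropped.shape[0] // hash_side,
                             hash_side, cropped.shape[1] // hash_side).mean(axis=(1, 3))
    return (blocks > blocks.mean()).flatten()

def deduplicate(images, threshold=4):
    """Keep one representative per group of images whose hashes differ by <= threshold bits."""
    kept, hashes = [], []
    for i, img in enumerate(images):
        h = average_hash(img)
        if all(np.count_nonzero(h ^ hk) > threshold for hk in hashes):
            kept.append(i)
            hashes.append(h)
    return kept

rng = np.random.default_rng(0)
base = rng.random((64, 64))
images = [base, base + 0.01 * rng.random((64, 64)), rng.random((64, 64))]
print(deduplicate(images))   # e.g. [0, 2]: the slightly perturbed copy of image 0 is dropped
```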
2021 | C-OPH Improving The Accuracy Of One Permutation Hashing (OPH) With Circulant Permutations | Li Xiaoyun, Li Ping | Arxiv | Minwise hashing (MinHash) is a classical method for efficiently estimating the Jaccard similarity in massive binary (0/1) data. To generate \(K\) hash values for each data vector, the standard theory of MinHash requires \(K\) independent permutations. Interestingly, the recent work on “circulant MinHash” (C-MinHash) has shown that merely two permutations are needed. The first permutation breaks the structure of the data and the second permutation is re-used \(K\) times in a circulant manner. Surprisingly, the estimation variance of C-MinHash is proved to be strictly smaller than that of the original MinHash. The more recent work further demonstrates that practically only one permutation is needed. Note that C-MinHash is different from the well-known work on “One Permutation Hashing (OPH)” published in NIPS’12. OPH and its variants using different “densification” schemes are popular alternatives to the standard MinHash. The densification step is necessary in order to deal with empty bins which exist in One Permutation Hashing. In this paper, we propose to incorporate the essential ideas of C-MinHash to improve the accuracy of One Permutation Hashing. Basically, we develop a new densification method for OPH, which achieves the smallest estimation variance compared to all existing densification schemes for OPH. Our proposed method is named C-OPH (Circulant OPH). After the initial permutation (which breaks the existing structure of the data), C-OPH only needs a “shorter” permutation of length \(D/K\) (instead of \(D\)), where \(D\) is the original data dimension and \(K\) is the total number of bins in OPH. This short permutation is re-used in \(K\) bins in a circulant shifting manner. It can be shown that the estimation variance of the Jaccard similarity is strictly smaller than that of the existing (densified) OPH methods. |
|||||
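For context on the C-OPH entry above, here is a small numpy sketch of the baseline it improves, plain one permutation hashing: the permuted binary vector is split into K bins and the smallest non-zero (local) index in each bin is recorded, with a sentinel for empty bins. The densification step that C-OPH replaces with a circulant permutation is deliberately not reproduced here; only bins that are non-empty in both sketches are used for the Jaccard estimate.

```python
import numpy as np

def one_permutation_hash(nonzero_ids, dim, n_bins, perm):
    """OPH sketch: permute the binary vector once, split into n_bins equal bins,
    and record the minimum local index of a non-zero in each bin; -1 marks an empty bin."""
    bin_size = dim // n_bins
    permuted = np.sort(perm[np.asarray(list(nonzero_ids))])
    sketch = np.full(n_bins, -1)
    for p in permuted:
        b = p // bin_size
        if sketch[b] == -1:
            sketch[b] = p % bin_size        # first (i.e. smallest) hit in this bin
    return sketch

def estimate_jaccard(s1, s2):
    """Estimate Jaccard similarity over bins that are non-empty in both sketches."""
    both = (s1 != -1) & (s2 != -1)
    return float(np.mean(s1[both] == s2[both])) if both.any() else 0.0

rng = np.random.default_rng(0)
dim, n_bins = 1024, 64
perm = rng.permutation(dim)
a = set(rng.choice(dim, size=200, replace=False))
b = set(list(a)[:150]) | set(rng.choice(dim, size=50, replace=False))
true_j = len(a & b) / len(a | b)
est_j = estimate_jaccard(one_permutation_hash(a, dim, n_bins, perm),
                         one_permutation_hash(b, dim, n_bins, perm))
print(round(true_j, 3), round(est_j, 3))    # the estimate should be close to the true value
```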
2021 | EXTRA Explanation Ranking Datasets For Explainable Recommendation | Li Lei, Zhang Yongfeng, Chen Li | Arxiv | Recently, research on explainable recommender systems has drawn much attention from both academia and industry, resulting in a variety of explainable models. As a consequence, their evaluation approaches vary from model to model, which makes it quite difficult to compare the explainability of different models. To achieve a standard way of evaluating recommendation explanations, we provide three benchmark datasets for EXplanaTion RAnking (denoted as EXTRA), on which explainability can be measured by ranking-oriented metrics. Constructing such datasets, however, poses great challenges. First, user-item-explanation triplet interactions are rare in existing recommender systems, so how to find alternatives becomes a challenge. Our solution is to identify nearly identical sentences from user reviews. This idea then leads to the second challenge, i.e., how to efficiently categorize the sentences in a dataset into different groups, since estimating the similarity between all pairs of sentences has quadratic runtime complexity. To mitigate this issue, we provide a more efficient method based on Locality Sensitive Hashing (LSH) that can detect near-duplicates in sub-linear time for a given query. Moreover, we make our code publicly available to allow researchers in the community to create their own datasets. |
|||||
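A compact sketch of the sub-linear near-duplicate detection step that the EXTRA entry above relies on, with assumed parameters (character 4-gram shingles, 64 MinHashes, 16 bands; the paper's exact setup is not reproduced): sentences sharing any band of their MinHash signature become candidate duplicates, so only those candidates need a pairwise check.

```python
import numpy as np

def shingles(text, n=4):
    """Character n-gram shingles of a sentence."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(sh, n_hashes=64, seed=0):
    """MinHash signature via random affine hash functions over shingle ids."""
    rng = np.random.default_rng(seed)
    a, b = rng.integers(1, 2**31, size=(2, n_hashes))
    ids = np.array([hash(s) % (2**31) for s in sh])
    return ((a[:, None] * ids[None, :] + b[:, None]) % (2**31 - 1)).min(axis=1)

def lsh_buckets(signatures, bands=16):
    """Band the signatures; sentences sharing any band key become candidate duplicates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for sid, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, set()).add(sid)
    return {k: v for k, v in buckets.items() if len(v) > 1}

sentences = {
    0: "the battery life of this phone is excellent",
    1: "battery life of this phone is excellent!",
    2: "the camera produces very sharp photos",
}
sigs = {i: minhash_signature(shingles(s)) for i, s in sentences.items()}
dupes = lsh_buckets(sigs)
cands = set().union(*dupes.values()) if dupes else set()
print(cands)   # expected to contain the near-duplicate pair {0, 1}
```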
2021 | Parallel And External-memory Construction Of Minimal Perfect Hash Functions With Pthash | Pibiri Giulio Ermanno, Trani Roberto | Arxiv | A function \(f : U \to \{0,\ldots,n-1\}\) is a minimal perfect hash function for a set \(S \subseteq U\) of size \(n\), if \(f\) bijectively maps \(S\) into the first \(n\) natural numbers. These functions are important for many practical applications in computing, such as search engines, computer networks, and databases. Several algorithms have been proposed to build minimal perfect hash functions that: scale well to large sets, retain fast evaluation time, and take very little space, e.g., 2 - 3 bits/key. PTHash is one such algorithm, achieving very fast evaluation in compressed space, typically several times faster than other techniques. In this work, we propose a new construction algorithm for PTHash enabling: (1) multi-threading, to either build functions more quickly or more space-efficiently, and (2) external-memory processing to scale to inputs much larger than the available internal memory. Only a few other algorithms in the literature share these features, despite their big practical impact. We conduct an extensive experimental assessment on large real-world string collections and show that, with respect to other techniques, PTHash is competitive in construction time and space consumption, but retains 2 |
|||||
2021 | Pthash Revisiting FCH Minimal Perfect Hashing | Pibiri Giulio Ermanno, Trani Roberto | SIGIR | Given a set \(S\) of \(n\) distinct keys, a function \(f\) that bijectively maps the keys of \(S\) into the range \(\{0,\ldots,n-1\}\) is called a minimal perfect hash function for \(S\). Algorithms that find such functions when \(n\) is large and retain constant evaluation time are of practical interest; for instance, search engines and databases typically use minimal perfect hash functions to quickly assign identifiers to static sets of variable-length keys such as strings. The challenge is to design an algorithm which is efficient in three different aspects: time to find \(f\) (construction time), time to evaluate \(f\) on a key of \(S\) (lookup time), and space of representation for \(f\). Several algorithms have been proposed to trade-off between these aspects. In 1992, Fox, Chen, and Heath (FCH) presented an algorithm at SIGIR providing very fast lookup evaluation. However, the approach received little attention because of its large construction time and higher space consumption compared to other subsequent techniques. Almost thirty years later we revisit their framework and present an improved algorithm that scales well to large sets and reduces space consumption altogether, without compromising the lookup time. We conduct an extensive experimental assessment and show that the algorithm finds functions that are competitive in space with state-of-the art techniques and provide \(2-4\times\) better lookup time. |
|||||
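To make the minimal-perfect-hashing guarantee above concrete, here is a toy Python sketch of the "hash, displace" principle that the FCH/PTHash line of work builds on: keys are grouped into buckets by a first-level hash, and each bucket searches for a pilot value whose second-level hash sends all of its keys to free slots of a table of size \(n\). The bucket count, pilot search, and compressed encoding of pilots are where the real algorithms differ, so treat this purely as an illustration.

```python
# Toy "hash, displace" construction in the spirit of FCH/PTHash (illustration only).
import hashlib

def h(key: str, seed: int, mod: int) -> int:
    digest = hashlib.blake2b(key.encode(), key=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little") % mod

def build_mphf(keys):
    n = len(keys)
    m = max(1, n // 3)                      # number of buckets (illustrative choice)
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[h(k, 0, m)].append(k)
    taken = [False] * n
    pilots = [0] * m
    for b in sorted(range(m), key=lambda b: -len(buckets[b])):  # large buckets first
        pilot = 1
        while True:
            pos = [h(k, pilot, n) for k in buckets[b]]
            if len(set(pos)) == len(pos) and not any(taken[p] for p in pos):
                for p in pos:
                    taken[p] = True
                pilots[b] = pilot
                break
            pilot += 1
    return pilots, m, n

def evaluate(key, pilots, m, n):            # the minimal perfect hash f(key)
    return h(key, pilots[h(key, 0, m)], n)

keys = [f"key{i}" for i in range(100)]
pilots, m, n = build_mphf(keys)
assert sorted(evaluate(k, pilots, m, n) for k in keys) == list(range(n))
```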
2021 | LAION-400M Open Dataset Of Clip-filtered 400 Million Image-text Pairs | Schuhmann Christoph, Vencu Richard, Beaumont Romain, Kaczmarczyk Robert, Mullis Clayton, Katta Aarush, Coombes Theo, Jitsev Jenia, Komatsuzaki Aran | Arxiv | Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) have recently surged in popularity, showing a remarkable capability to perform zero- or few-shot learning and transfer even in the absence of per-sample labels on target image data. Despite this trend, to date there have been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and publicly release LAION-400M, a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity search. |
|||||
2021 | Vision Transformer Based Video Hashing Retrieval For Tracing The Source Of Fake Videos | Pei Pengfei, Zhao Xianfeng, Cao Yun, Li Jinchuan, Lai Xuyuan | Arxiv | In recent years, the spread of fake videos has brought great influence on individuals and even countries. It is important to provide robust and reliable results for fake videos. The results of conventional detection methods are not reliable and not robust for unseen videos. Another alternative and more effective way is to find the original video of the fake video. For example, fake videos from the Russia-Ukraine war and the Hong Kong law revision storm are refuted by finding the original video. We use an improved retrieval method to find the original video, named ViTHash. Specifically, tracing the source of fake videos requires finding the unique one, which is difficult when there are only small differences in the original videos. To solve the above problems, we designed a novel loss Hash Triplet Loss. In addition, we designed a tool called Localizator to compare the difference between the original traced video and the fake video. We have done extensive experiments on FaceForensics++, Celeb-DF and DeepFakeDetection, and we also have done additional experiments on our built three datasets: DAVIS2016-TL (video inpainting), VSTL (video splicing) and DFTL (similar videos). Experiments have shown that our performance is better than state-of-the-art methods, especially in cross-dataset mode. Experiments also demonstrated that ViTHash is effective in various forgery detection: video inpainting, video splicing and deepfakes. Our code and datasets have been released on GitHub: \url{https://github.com/lajlksdf/vtl}. |
|||||
2021 | Scalable Reverse Image Search Engine For NASA Worldview | Sodani Abhigya, Levy Michael, Koul Anirudh, Kasam Meher Anand, Ganju Siddha | Arxiv | Researchers often spend weeks sifting through decades of unlabeled satellite imagery (on NASA Worldview) in order to develop datasets on which they can start conducting research. We developed an interactive, scalable and fast image similarity search engine (which can take one or more images as the query image) that automatically sifts through the unlabeled dataset, reducing dataset generation time from weeks to minutes. In this work, we describe key components of the end-to-end pipeline. Our similarity search system was created to identify images from a potentially petabyte-scale database that are similar to an input image, and for this we had to break down each query image into its features, which were generated by a classification-layer-stripped CNN trained in a supervised manner. To store and search these features efficiently, we had to make several scalability improvements. To improve the speed, reduce the storage, and shrink memory requirements for embedding search, we add a fully connected layer to our CNN to map each image to a 128-dimensional vector before the classification layers. This helped us compress the size of our image features from 2048 (for ResNet, which was initially tried as our featurizer) to 128 for our new custom model. Additionally, we utilize existing approximate nearest neighbor search libraries to significantly speed up embedding search. Our system currently searches over our entire database of images at 5 seconds per query on a single virtual machine in the cloud. In the future, we would like to incorporate a SimCLR based featurizing model which could be trained without any labelling by a human (since the classification aspect of the model is irrelevant to this use case). |
|||||
2021 | Web Image Search Engine Based On LSH Index And CNN Resnet50 | Parola Marco, Nannini Alice, Poleggi Stefano | Arxiv | To implement a good Content Based Image Retrieval (CBIR) system, it is essential to adopt efficient search methods. One way to achieve this result is by exploiting approximate search techniques. In fact, when we deal with very large collections of data, using an exact search method makes the system very slow. In this project, we adopt the Locality Sensitive Hashing (LSH) index to implement a CBIR system that allows us to perform fast similarity search on deep features. Specifically, we exploit transfer learning techniques to extract deep features from images; this phase is done using two famous Convolutional Neural Networks (CNNs) as feature extractors: Resnet50 and Resnet50v2, both pre-trained on ImageNet. Then we try out several fully connected deep neural networks, built on top of both of the previously mentioned CNNs, in order to fine-tune them on our dataset. In both cases, we index the features within our LSH index implementation and within a sequential scan, to better understand how much the introduction of the index affects the results. Finally, we carry out a performance analysis: we evaluate the relevance of the result set, computing the mAP (mean Average Precision) value obtained during the different experiments with respect to the number of comparisons performed and varying the hyper-parameter values of the LSH index. |
|||||
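A minimal sketch of the pipeline described above, under assumed names and settings: a ResNet50 with its classification layer stripped produces a 2048-dimensional feature, and a random-hyperplane LSH signature assigns it to a hash bucket. The 16-bit signature length and the preprocessing are illustrative choices, not the authors' configuration.

```python
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

# Headless ResNet50: replace the classifier with an identity to expose features.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

rng = np.random.default_rng(0)
hyperplanes = rng.standard_normal((16, 2048))    # 16-bit random-hyperplane LSH

def lsh_bucket(image_path: str) -> int:
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = backbone(img)[0].numpy()          # 2048-d deep feature
    bits = (hyperplanes @ feat > 0).astype(int)  # sign of each random projection
    return int("".join(map(str, bits)), 2)       # bucket id in [0, 2^16)
```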
2020 | Locality-sensitive Hashing Scheme Based On Longest Circular Co-substring | Lei Yifan, Huang Qiang, Kankanhalli Mohan, Tung Anthony K. H. | Arxiv | Locality-Sensitive Hashing (LSH) is one of the most popular methods for \(c\)-Approximate Nearest Neighbor Search (\(c\)-ANNS) in high-dimensional spaces. In this paper, we propose a novel LSH scheme based on the Longest Circular Co-Substring (LCCS) search framework (LCCS-LSH) with a theoretical guarantee. We introduce a novel concept of LCCS and a new data structure named Circular Shift Array (CSA) for \(k\)-LCCS search. The insight of LCCS search framework is that close data objects will have a longer LCCS than the far-apart ones with high probability. LCCS-LSH is LSH-family-independent, and it supports \(c\)-ANNS with different kinds of distance metrics. We also introduce a multi-probe version of LCCS-LSH and conduct extensive experiments over five real-life datasets. The experimental results demonstrate that LCCS-LSH outperforms state-of-the-art LSH schemes. |
|||||
2020 | Generative Semantic Hashing Enhanced Via Boltzmann Machines | Zheng Lin, Su Qinliang, Shen Dinghan, Chen Changyou | Arxiv | Generative semantic hashing is a promising technique for large-scale information retrieval thanks to its fast retrieval speed and small memory footprint. For the tractability of training, existing generative-hashing methods mostly assume a factorized form for the posterior distribution, enforcing independence among the bits of hash codes. From the perspectives of both model representation and code space size, independence is not always the best assumption. In this paper, to introduce correlations among the bits of hash codes, we propose to employ the distribution of a Boltzmann machine as the variational posterior. To address the intractability issue of training, we first develop an approximate method to reparameterize the distribution of a Boltzmann machine by augmenting it as a hierarchical concatenation of a Gaussian-like distribution and a Bernoulli distribution. Based on that, an asymptotically-exact lower bound is further derived for the evidence lower bound (ELBO). With these novel techniques, the entire model can be optimized efficiently. Extensive experimental results demonstrate that by effectively modeling correlations among different bits within a hash code, our model can achieve significant performance gains. |
|||||
2020 | Flexor Trainable Fractional Quantization | Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Yongkweon Jeon, Baeseong Park, Jeongin Yun | Neural Information Processing Systems | Quantization based on the binary codes is gaining attention because each quantized bit can be directly utilized for computations without dequantization using look-up tables. Previous attempts, however, only allow for integer numbers of quantization bits, which ends up restricting the search space for compression ratio and accuracy. In this paper, we propose an encryption algorithm/architecture to compress quantized weights so as to achieve fractional numbers of bits per weight. Decryption during inference is implemented by digital XOR-gate networks added into the neural network model while XOR gates are described by utilizing \(\tanh(x)\) for backward propagation to enable gradient calculations. We perform experiments using MNIST, CIFAR-10, and ImageNet to show that inserting XOR gates learns quantization/encrypted bit decisions through training and obtains high accuracy even for fractional sub 1-bit weights. As a result, our proposed method yields smaller size and higher model accuracy compared to binary neural networks. |
|||||
2020 | Learning To Hash With Graph Neural Networks For Recommender Systems | Tan Qiaoyu, Liu Ninghao, Zhao Xing, Yang Hongxia, Zhou Jingren, Hu Xia | Arxiv | Graph representation learning has attracted much attention in supporting high quality candidate search at scale. Despite its effectiveness in learning embedding vectors for objects in the user-item interaction network, the computational costs to infer users’ preferences in continuous embedding space are tremendous. In this work, we investigate the problem of hashing with graph neural networks (GNNs) for high quality retrieval, and propose a simple yet effective discrete representation learning framework to jointly learn continuous and discrete codes. Specifically, a deep hashing with GNNs (HashGNN) is presented, which consists of two components, a GNN encoder for learning node representations, and a hash layer for encoding representations to hash codes. The whole architecture is trained end-to-end by jointly optimizing two losses, i.e., reconstruction loss from reconstructing observed links, and ranking loss from preserving the relative ordering of hash codes. A novel discrete optimization strategy based on the straight-through estimator (STE) with guidance is proposed. The principal idea is to avoid gradient magnification in back-propagation of STE with continuous embedding guidance, in which we begin by learning an easier network that mimics the continuous embedding and let it evolve during the training until it finally goes back to STE. Comprehensive experiments over three publicly available and one real-world Alibaba company datasets demonstrate that our model not only achieves performance comparable with its continuous counterpart but also runs multiple times faster during inference. |
|||||
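The straight-through estimator mentioned above is easy to state in code: the forward pass applies sign() to produce binary codes, while the backward pass passes gradients through unchanged. The guidance/annealing scheme the paper layers on top is omitted in this minimal PyTorch sketch.

```python
import torch

class SignSTE(torch.autograd.Function):
    """sign() forward, identity gradient backward (straight-through estimator)."""
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # gradients flow through as if sign() were the identity

node_embeddings = torch.randn(4, 16, requires_grad=True)
hash_codes = SignSTE.apply(node_embeddings)   # values in {-1, 0, +1}
hash_codes.sum().backward()                   # gradients still reach node_embeddings
```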
2020 | A Peer-to-peer Distributed Secured Sustainable Large Scale Identity Document Verification System With Bittorrent Network And Hash Function | Rashedun-naby Khan Mohammad | Arxiv | Verifying identity documents from a large Central Identity Database (CIDB) is always challenging, and it gets more challenging when we need to verify a large number of documents at the same time. Usually we set up a gateway server connected to the CIDB that serves all identity document verification requests. Though it works well, there is still a chance that this model will collapse under high traffic. We obviously can tune the system to be sustainable, but the process is economically expensive. In this paper we propose an economically cheaper way to verify ID documents with a private BitTorrent network and hash functions. |
|||||
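The verification step itself reduces to comparing a document's cryptographic hash against a set of reference hashes; the sketch below stands in a plain Python set for the hash collection that the paper proposes to distribute over a private BitTorrent network.

```python
import hashlib

def sha256_of_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):  # stream in 64 KiB chunks
            h.update(chunk)
    return h.hexdigest()

def verify_document(path: str, reference_hashes: set) -> bool:
    # In the paper's setting the reference hashes would be fetched from peers
    # on a private BitTorrent swarm rather than held locally.
    return sha256_of_file(path) in reference_hashes
```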
2020 | Deep Reinforcement Learning With Label Embedding Reward For Supervised Image Hashing | Wang Zhenzhen, Hong Weixiang, Yuan Junsong | Arxiv | Deep hashing has shown promising results in image retrieval and recognition. Despite its success, most existing deep hashing approaches are rather similar: either multi-layer perceptron or CNN is applied to extract image feature, followed by different binarization activation functions such as sigmoid, tanh or autoencoder to generate binary code. In this work, we introduce a novel decision-making approach for deep supervised hashing. We formulate the hashing problem as travelling across the vertices in the binary code space, and learn a deep Q-network with a novel label embedding reward defined by Bose-Chaudhuri-Hocquenghem (BCH) codes to explore the best path. Extensive experiments and analysis on the CIFAR-10 and NUS-WIDE dataset show that our approach outperforms state-of-the-art supervised hashing methods under various code lengths. |
|||||
2020 | Sparse Hashing For Scalable Approximate Model Counting Theory And Practice | Meel Kuldeep S., Akshay S. | Arxiv | Given a CNF formula F on n variables, the problem of model counting or #SAT is to compute the number of satisfying assignments of F. Model counting is a fundamental but hard problem in computer science with varied applications. Recent years have witnessed a surge of effort towards developing efficient algorithmic techniques that combine the classical 2-universal hashing with the remarkable progress in SAT solving over the past decade. These techniques augment the CNF formula F with random XOR constraints and invoke an NP oracle repeatedly on the resultant CNF-XOR formulas. In practice, calls to the NP oracle are replaced by a SAT solver whose runtime performance is adversely affected by the size of the XOR constraints. The standard construction of 2-universal hash functions chooses every variable with probability p = 1/2, leading to XOR constraints of size n/2 in expectation. Consequently, the challenge is to design sparse hash functions where variables can be chosen with smaller probability, leading to smaller XOR constraints. In this paper, we address this challenge from theoretical and practical perspectives. First, we formalize a relaxation of universal hashing, called concentrated hashing, and establish a novel and beautiful connection between concentration measures of these hash functions and isoperimetric inequalities on boolean hypercubes. This allows us to obtain tight bounds on the variance and dispersion index, and to show that \(p = O(\log m / m)\) suffices for the design of sparse hash functions from \(\{0, 1\}^n\) to \(\{0, 1\}^m\). We then use sparse hash functions belonging to this concentrated hash family to develop new approximate counting algorithms. A comprehensive experimental evaluation of our algorithm on 1893 benchmarks demonstrates that usage of sparse hash functions can lead to significant speedups. |
|||||
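A sketch of the hashing step used by such counters: each of the \(m\) random XOR constraints includes variable \(x_i\) independently with probability \(p\) and gets a random parity; \(p = 1/2\) recovers the classical 2-universal family, while the paper's point is that a much smaller \(p\) suffices under its concentrated-hashing relaxation. This is illustrative and not the authors' implementation.

```python
import random

def random_xor_constraints(n_vars: int, m: int, p: float, seed: int = 0):
    """Return m constraints, each a (variable_list, parity) pair meaning
    x_{i1} XOR x_{i2} XOR ... = parity; variables chosen i.i.d. with prob p."""
    rng = random.Random(seed)
    constraints = []
    for _ in range(m):
        variables = [i for i in range(1, n_vars + 1) if rng.random() < p]
        constraints.append((variables, rng.randint(0, 1)))
    return constraints

# Dense (classical) vs. sparse constraints over 1000 variables, 20 hash cells:
dense = random_xor_constraints(1000, 20, p=0.5)
sparse = random_xor_constraints(1000, 20, p=0.05)
```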
2020 | Nearest Neighbor Machine Translation | Khandelwal Urvashi, Fan Angela, Jurafsky Dan, Zettlemoyer Luke, Lewis Mike | Arxiv | We introduce \(k\)-nearest-neighbor machine translation (\(k\)NN-MT), which predicts tokens with a nearest neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search. This approach requires no additional training and scales to give the decoder direct access to billions of examples at test time, resulting in a highly expressive model that consistently improves performance across many settings. Simply adding nearest neighbor search improves a state-of-the-art German-English translation model by 1.5 BLEU. \(k\)NN-MT allows a single model to be adapted to diverse domains by using a domain-specific datastore, improving results by an average of 9.2 BLEU over zero-shot transfer, and achieving new state-of-the-art results – without training on these domains. A massively multilingual model can also be specialized for particular language pairs, with improvements of 3 BLEU for translating from English into German and Chinese. Qualitatively, \(k\)NN-MT is easily interpretable; it combines source and target context to retrieve highly relevant examples. |
|||||
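The \(k\)NN-MT decoding rule is easy to sketch: the next-token distribution is an interpolation of the base model's softmax with a distribution induced by the k nearest cached (decoder-state, target-token) pairs. The temperature, \(\lambda\), and k below are illustrative values, not the paper's tuned settings.

```python
import numpy as np

def knn_mt_distribution(p_model, query, keys, values, vocab_size,
                        k=8, temperature=10.0, lam=0.5):
    # Distances from the current decoder state to the cached datastore keys.
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]
    # Softmax over negative distances of the retrieved neighbours.
    w = np.exp(-dists[nn] / temperature)
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    for weight, token in zip(w, values[nn]):
        p_knn[token] += weight      # neighbours vote for their stored target tokens
    return lam * p_knn + (1.0 - lam) * p_model
```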
2020 | Proximity Preserving Binary Code Using Signed Graph-cut | Lav Inbal, Avidan Shai, Singer Yoram, Hel-or Yacov | AAAI Conference on Artificial Intelligence Feb. | We introduce a binary embedding framework, called Proximity Preserving Code (PPC), which learns similarity and dissimilarity between data points to create a compact and affinity-preserving binary code. This code can be used to apply fast and memory-efficient approximation to nearest-neighbor searches. Our framework is flexible, enabling different proximity definitions between data points. In contrast to previous methods that extract binary codes based on unsigned graph partitioning, our system models the attractive and repulsive forces in the data by incorporating positive and negative graph weights. The proposed framework is shown to boil down to finding the minimal cut of a signed graph, a problem known to be NP-hard. We offer an efficient approximation and achieve superior results by constructing the code bit after bit. We show that the proposed approximation is superior to the commonly used spectral methods with respect to both accuracy and complexity. Thus, it is useful for many other problems that can be translated into signed graph cut. |
|||||
2020 | Deep Robust Multilevel Semantic Cross-modal Hashing | Song Ge, Zhao Jun, Tan Xiaoyang | Arxiv | Hashing based cross-modal retrieval has recently made significant progress. But straightforward embedding of data from different modalities into a joint Hamming space will inevitably produce false codes due to the intrinsic modality discrepancy and noises. We present a novel Robust Multilevel Semantic Hashing (RMSH) for more accurate cross-modal retrieval. It seeks to preserve fine-grained similarity among data with rich semantics, while explicitly requiring distances between dissimilar points to be larger than a specific value for strong robustness. For this, we give an effective bound on this value based on an information coding-theoretic analysis, and the above goals are embodied into a margin-adaptive triplet loss. Furthermore, we introduce pseudo-codes via fusing multiple hash codes to explore seldom-seen semantics, alleviating the sparsity problem of similarity information. Experiments on three benchmarks show the validity of the derived bounds, and our method achieves state-of-the-art performance. |
|||||
2020 | Scaling Up Kernel Ridge Regression Via Locality Sensitive Hashing | Kapralov Michael, Nouri Navid, Razenshteyn Ilya, Velingker Ameya, Zandieh Amir | Arxiv | Random binning features, introduced in the seminal paper of Rahimi and Recht (2007), are an efficient method for approximating a kernel matrix using locality sensitive hashing. Random binning features provide a very simple and efficient way of approximating the Laplace kernel but unfortunately do not apply to many important classes of kernels, notably ones that generate smooth Gaussian processes, such as the Gaussian kernel and Matern kernel. In this paper, we introduce a simple weighted version of random binning features and show that the corresponding kernel function generates Gaussian processes of any desired smoothness. We show that our weighted random binning features provide a spectral approximation to the corresponding kernel matrix, leading to efficient algorithms for kernel ridge regression. Experiments on large scale regression datasets show that our method outperforms the accuracy of random Fourier features method. |
|||||
2020 | Locality Sensitive Hashing For Set-queries Motivated By Group Recommendations | Kaplan Haim, Tenenbaum Jay | Arxiv | Locality Sensitive Hashing (LSH) is an effective method to index a set of points such that we can efficiently find the nearest neighbors of a query point. We extend this method to our novel Set-query LSH (SLSH), such that it can find the nearest neighbors of a set of points, given as a query. Let \( s(x,y) \) be the similarity between two points \( x \) and \( y \). We define a similarity between a set \( Q\) and a point \( x \) by aggregating the similarities \( s(p,x) \) for all \( p\in Q \). For example, we can take \( s(p,x) \) to be the angular similarity between \( p \) and \( x \) (i.e., \(1-{\angle (x,p)}/{\pi}\)), and aggregate by arithmetic or geometric averaging, or taking the lowest similarity. We develop locality sensitive hash families and data structures for a large set of such arithmetic and geometric averaging similarities, and analyze their collision probabilities. We also establish an analogous framework and hash families for distance functions. Specifically, we give a structure for the euclidean distance aggregated by either averaging or taking the maximum. We leverage SLSH to solve a geometric extension of the approximate near neighbors problem. In this version, we consider a metric for which the unit ball is an ellipsoid and its orientation is specified with the query. An important application that motivates our work is group recommendation systems. Such a system embeds movies and users in the same feature space, and the task of recommending a movie for a group to watch together, translates to a set-query \( Q \) using an appropriate similarity. |
|||||
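The set-to-point similarities described above are simple to state: take the angular similarity \(1-\angle(p,x)/\pi\) between each query point and a candidate, then aggregate by arithmetic mean, geometric mean, or minimum. The brute-force sketch below only illustrates these similarity definitions, not the SLSH hash families themselves.

```python
import numpy as np

def angular_similarity(p, x):
    cos = np.dot(p, x) / (np.linalg.norm(p) * np.linalg.norm(x))
    return 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def set_similarity(Q, x, aggregate="mean"):
    sims = np.array([angular_similarity(p, x) for p in Q])
    if aggregate == "mean":        # arithmetic averaging
        return float(sims.mean())
    if aggregate == "geometric":   # geometric averaging
        return float(np.exp(np.log(sims).mean()))
    return float(sims.min())       # lowest similarity over the query set
```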
2020 | Learning To Embed Categorical Features Without Embedding Tables For Recommendation | Kang Wang-cheng, Cheng Derek Zhiyuan, Yao Tiansheng, Yi Xinyang, Chen Ting, Hong Lichan, Chi Ed H. | Arxiv | Embedding learning of categorical features (e.g. user/item IDs) is at the core of various recommendation models including matrix factorization and neural collaborative filtering. The standard approach creates an embedding table where each row represents a dedicated embedding vector for every unique feature value. However, this method fails to efficiently handle high-cardinality features and unseen feature values (e.g. new video ID) that are prevalent in real-world recommendation systems. In this paper, we propose an alternative embedding framework Deep Hash Embedding (DHE), replacing embedding tables by a deep embedding network to compute embeddings on the fly. DHE first encodes the feature value to a unique identifier vector with multiple hashing functions and transformations, and then applies a DNN to convert the identifier vector to an embedding. The encoding module is deterministic, non-learnable, and free of storage, while the embedding network is updated during the training time to learn embedding generation. Empirical results show that DHE achieves comparable AUC against the standard one-hot full embedding, with smaller model sizes. Our work sheds light on the design of DNN-based alternative embedding schemes for categorical features without using embedding table lookup. |
|||||
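A rough PyTorch sketch of the DHE idea: a categorical ID is mapped by k deterministic hash functions into a dense identifier vector, and a small MLP converts that vector into the final embedding, so no embedding table is stored. The hash count, hash range, layer sizes, and normalisation below are illustrative assumptions, not the paper's exact configuration.

```python
import hashlib
import torch
import torch.nn as nn

K, M = 1024, 1_000_000        # number of hash functions, hash range (assumed)

def identifier_vector(raw_id: int) -> torch.Tensor:
    # k deterministic, non-learnable hashes of the ID, scaled to [-1, 1].
    vals = []
    for j in range(K):
        digest = hashlib.blake2b(f"{j}:{raw_id}".encode(), digest_size=8).digest()
        vals.append(int.from_bytes(digest, "little") % M)
    v = torch.tensor(vals, dtype=torch.float32)
    return v / (M - 1) * 2.0 - 1.0

embedding_net = nn.Sequential(    # learnable part: identifier vector -> embedding
    nn.Linear(K, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 64),
)

item_embedding = embedding_net(identifier_vector(raw_id=42))
```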
2020 | A Multi-modal Neural Embeddings Approach For Detecting Mobile Counterfeit Apps A Case Study On Google Play Store | Karunanayake Naveen, Rajasegaran Jathushan, Gunathillake Ashanie, Seneviratne Suranga, Jourjon Guillaume | Arxiv | Counterfeit apps impersonate existing popular apps in attempts to misguide users to install them for various reasons such as collecting personal information or spreading malware. Many counterfeits can be identified once installed, however even a tech-savvy user may struggle to detect them before installation. To this end, this paper proposes to leverage the recent advances in deep learning methods to create image and text embeddings so that counterfeit apps can be efficiently identified when they are submitted for publication. We show that a novel approach of combining content embeddings and style embeddings outperforms the baseline methods for image similarity such as SIFT, SURF, and various image hashing methods. We first evaluate the performance of the proposed method on two well-known datasets for evaluating image similarity methods and show that content, style, and combined embeddings increase precision@k and recall@k by 10%-15% and 12%-25%, respectively when retrieving five nearest neighbours. Second, specifically for the app counterfeit detection problem, combined content and style embeddings achieve 12% and 14% increase in precision@k and recall@k, respectively compared to the baseline methods. Third, we present an analysis of approximately 1.2 million apps from Google Play Store and identify a set of potential counterfeits for top-10,000 popular apps. Under a conservative assumption, we were able to find 2,040 potential counterfeits that contain malware in a set of 49,608 apps that showed high similarity to one of the top-10,000 popular apps in Google Play Store. We also find 1,565 potential counterfeits asking for at least five additional dangerous permissions than the original app and 1,407 potential counterfeits having at least five extra third party advertisement libraries. |
|||||
2020 | Dynamic Similarity Search On Integer Sketches | Kanda Shunsuke, Tabei Yasuo | Arxiv | Similarity-preserving hashing is a core technique for fast similarity searches, and it randomly maps data points in a metric space to strings of discrete symbols (i.e., sketches) in the Hamming space. While traditional hashing techniques produce binary sketches, recent ones produce integer sketches for preserving various similarity measures. However, most similarity search methods are designed for binary sketches and are inefficient for integer sketches. Moreover, most methods are either inapplicable or inefficient for dynamic datasets, although modern real-world datasets are updated over time. We propose the dynamic filter trie (DyFT), a dynamic similarity search method for both binary and integer sketches. An extensive experimental analysis using large real-world datasets shows that DyFT performs superiorly with respect to scalability, time performance, and memory efficiency. For example, on a huge dataset of 216 million data points, DyFT performs a similarity search 6,000 times faster than a state-of-the-art method while using one-thirteenth of the memory. |
|||||
2020 | Succinct Trit-array Trie For Scalable Trajectory Similarity Search | Kanda Shunsuke, Takeuchi Koh, Fujii Keisuke, Tabei Yasuo | Arxiv | Massive datasets of spatial trajectories representing the mobility of a diversity of moving objects are ubiquitous in research and industry. Similarity search over a large collection of trajectories is indispensable for turning these datasets into knowledge. Locality sensitive hashing (LSH) is a powerful technique for fast similarity searches. Recent methods employ LSH and attempt to realize an efficient similarity search of trajectories; however, those methods are inefficient in terms of search time and memory when applied to massive datasets. To address this problem, we present the trajectory-indexing succinct trit-array trie (tSTAT), which is a scalable method leveraging LSH for trajectory similarity searches. tSTAT quickly performs the search on a tree data structure called a trie. We also present two novel techniques that dramatically enhance the memory efficiency of tSTAT. One is a node reduction technique that substantially omits redundant trie nodes while maintaining the time performance. The other is a space-efficient representation that leverages the idea behind succinct data structures (i.e., a compressed data structure supporting fast data operations). We experimentally test tSTAT on its ability to retrieve similar trajectories for a query from large collections of trajectories and show that tSTAT performs superiorly in comparison to state-of-the-art similarity search methods. |
|||||
2020 | Optimized Feature Space Learning For Generating Efficient Binary Codes For Image Retrieval | Jose Abin, Ottlik Erik Stefan, Rohlfing Christian, Ohm Jens-rainer | Arxiv | In this paper we propose an approach for learning a low dimensional optimized feature space with minimum intra-class variance and maximum inter-class variance. We address the problem of the high dimensionality of feature vectors extracted from neural networks by taking care of the global statistics of the feature space. The classical approach of Linear Discriminant Analysis (LDA) is generally used for generating an optimized low dimensional feature space for single-labeled images. Since image retrieval involves both multi-labeled and single-labeled images, we utilize the equivalence between LDA and Canonical Correlation Analysis (CCA) to generate an optimized feature space for single-labeled images and use CCA to generate an optimized feature space for multi-labeled images. Our approach correlates the projections of feature vectors with label vectors in our CCA based network architecture. The neural network minimizes a loss function which maximizes the correlation coefficients. We binarize our generated feature vectors with the popular Iterative Quantization (ITQ) approach and also propose an ensemble network to generate binary codes of desired bit length for image retrieval. Our measurement of mean average precision shows competitive results on standard single-labeled and multi-labeled image retrieval datasets. |
|||||
2020 | Procrustean Orthogonal Sparse Hashing | Tepper Mariano, Sengupta Dipanjan, Willke Ted | Arxiv | Hashing is one of the most popular methods for similarity search because of its speed and efficiency. Dense binary hashing is prevalent in the literature. Recently, insect olfaction was shown to be structurally and functionally analogous to sparse hashing [6]. Here, we prove that this biological mechanism is the solution to a well-posed optimization problem. Furthermore, we show that orthogonality increases the accuracy of sparse hashing. Next, we present a novel method, Procrustean Orthogonal Sparse Hashing (POSH), that unifies these findings, learning an orthogonal transform from training data compatible with the sparse hashing mechanism. We provide theoretical evidence of the shortcomings of Optimal Sparse Lifting (OSL) [22] and BioHash [30], two related olfaction-inspired methods, and propose two new methods, Binary OSL and SphericalHash, to address these deficiencies. We compare POSH, Binary OSL, and SphericalHash to several state-of-the-art hashing methods and provide empirical results for the superiority of the proposed methods across a wide range of standard benchmarks and parameter settings. |
|||||
2020 | A New Hashing Based Nearest Neighbors Selection Technique For Big Datasets | Tchaye-kondi Jude, Zhai Yanlong, Zhu Liehuang | Arxiv | KNN has the reputation of being the world's simplest yet effective supervised learning algorithm, used for either classification or regression. KNN's prediction efficiency highly depends on the size of its training data, but when this training data grows, KNN suffers from slow decision making, since it needs to search for nearest neighbors within the entire dataset at every decision. This paper proposes a new technique that enables the selection of nearest neighbors directly in the neighborhood of a given observation. The proposed approach consists of dividing the data space into subcells of a virtual grid built on top of the data space. The mapping between data points and subcells is performed using hashing. To select the nearest neighbors of a given observation, we first identify the cell the observation belongs to by using hashing, and then look for nearest neighbors in that central cell and the cells around it, layer by layer. In our experimental performance analysis on publicly available datasets, our algorithm outperforms the original KNN in time efficiency with a prediction quality as good as that of KNN; it also offers competitive performance with solutions like KD-tree. |
|||||
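The grid idea reads naturally as code: each point is mapped to a virtual cell keyed by its quantised coordinates (stored in a hash map), and neighbour candidates are gathered from the query's cell plus the surrounding layer of cells. The cell size and single-layer expansion below are illustrative; this is not the authors' implementation.

```python
from collections import defaultdict
from itertools import product
import numpy as np

class GridIndex:
    def __init__(self, points: np.ndarray, cell_size: float):
        self.points = points
        self.cell_size = cell_size
        self.cells = defaultdict(list)           # cell key -> ids of points inside
        for i, p in enumerate(points):
            self.cells[self._key(p)].append(i)

    def _key(self, p):
        return tuple(np.floor(p / self.cell_size).astype(int))

    def _candidates(self, q, layers=1):
        center = self._key(q)
        ids = []
        for off in product(range(-layers, layers + 1), repeat=len(center)):
            ids.extend(self.cells.get(tuple(c + o for c, o in zip(center, off)), []))
        return ids

    def knn(self, q, k=5):
        cand = self._candidates(q)               # only points from nearby cells
        dists = np.linalg.norm(self.points[cand] - q, axis=1)
        return [cand[i] for i in np.argsort(dists)[:k]]

rng = np.random.default_rng(0)
data = rng.random((10_000, 2))
index = GridIndex(data, cell_size=0.05)
print(index.knn(np.array([0.5, 0.5]), k=5))
```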
2020 | Fast Top-k Cosine Similarity Search Through Xor-friendly Binary Quantization On Gpus | Jian Xiaozheng, Lu Jianqiu, Yuan Zexi, Li Ao | Arxiv | We explore the use of GPUs for accelerating large-scale nearest neighbor search, and we propose a fast vector-quantization-based exhaustive nearest neighbor search algorithm that can achieve high accuracy without any indexing construction, specifically designed for cosine similarity. This algorithm uses a novel XOR-friendly binary quantization method to encode floating-point numbers such that high-complexity multiplications can be optimized as low-complexity bitwise operations. Experiments show that our quantization method takes a short preprocessing time, and helps make the search speed of our exhaustive search method much faster than that of popular approximate nearest neighbor algorithms when high accuracy is needed. |
|||||
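The bitwise trick behind such binary quantisation is easy to show for the 1-bit case: after sign-binarising two d-dimensional vectors, the inner product of their \(\{-1,+1\}\) codes equals d - 2*popcount(a XOR b), so float multiply-adds reduce to XOR and bit counting on packed words. The sketch below illustrates only this 1-bit case, not the paper's multi-bit XOR-friendly codes or its GPU kernels.

```python
import numpy as np

def pack_signs(x: np.ndarray) -> np.ndarray:
    return np.packbits((x > 0).astype(np.uint8))      # 1 bit per coordinate

def signed_dot_from_bits(a_packed, b_packed, d):
    xor = np.bitwise_xor(a_packed, b_packed)
    hamming = int(np.unpackbits(xor)[:d].sum())       # popcount of the XOR
    return d - 2 * hamming                            # dot product of the +/-1 codes

rng = np.random.default_rng(0)
x, y = rng.standard_normal(256), rng.standard_normal(256)
# Coarse angular-similarity proxy computed from bits alone (no float multiplies).
proxy = signed_dot_from_bits(pack_signs(x), pack_signs(y), d=256) / 256
```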
2020 | Generalized Product Quantization Network For Semi-supervised Image Retrieval | Jang Young Kyun, Cho Nam Ik | Arxiv | Image retrieval methods that employ hashing or vector quantization have achieved great success by taking advantage of deep learning. However, these approaches do not meet expectations unless expensive label information is sufficient. To resolve this issue, we propose the first quantization-based semi-supervised image retrieval scheme: Generalized Product Quantization (GPQ) network. We design a novel metric learning strategy that preserves semantic similarity between labeled data, and employ entropy regularization term to fully exploit inherent potentials of unlabeled data. Our solution increases the generalization capacity of the quantization network, which allows overcoming previous limitations in the retrieval community. Extensive experimental results demonstrate that GPQ yields state-of-the-art performance on large-scale real image benchmark datasets. |
|||||
2020 | Multi-feature Discrete Collaborative Filtering For Fast Cold-start Recommendation | Xu Yang, Zhu Lei, Cheng Zhiyong, Li Jingjing, Sun Jiande | Arxiv | Hashing is an effective technique to address the large-scale recommendation problem, due to its high computation and storage efficiency on calculating the user preferences on items. However, existing hashing-based recommendation methods still suffer from two important problems: 1) Their recommendation process mainly relies on the user-item interactions and single specific content feature. When the interaction history or the content feature is unavailable (the cold-start problem), their performance will be seriously deteriorated. 2) Existing methods learn the hash codes with relaxed optimization or adopt discrete coordinate descent to directly solve binary hash codes, which results in significant quantization loss or consumes considerable computation time. In this paper, we propose a fast cold-start recommendation method, called Multi-Feature Discrete Collaborative Filtering (MFDCF), to solve these problems. Specifically, a low-rank self-weighted multi-feature fusion module is designed to adaptively project the multiple content features into binary yet informative hash codes by fully exploiting their complementarity. Additionally, we develop a fast discrete optimization algorithm to directly compute the binary hash codes with simple operations. Experiments on two public recommendation datasets demonstrate that MFDCF outperforms the state-of-the-arts on various aspects. |
|||||
2020 | Asymmetric Correlation Quantization Hashing For Cross-modal Retrieval | Wang Lu, Yang Jie | Arxiv | Due to the superiority in similarity computation and database storage for large-scale multiple modalities data, cross-modal hashing methods have attracted extensive attention in similarity retrieval across the heterogeneous modalities. However, there are still some limitations to be further taken into account: (1) most current CMH methods transform real-valued data points into discrete compact binary codes under the binary constraints, limiting the capability of representation for original data on account of abundant loss of information and producing suboptimal hash codes; (2) the discrete binary constraint learning model is hard to solve, where the retrieval performance may be greatly reduced by relaxing the binary constraints, owing to the large quantization error; (3) handling the learning problem of CMH in a symmetric framework, leading to difficult and complex optimization objectives. To address the above challenges, in this paper, a novel Asymmetric Correlation Quantization Hashing (ACQH) method is proposed. Specifically, ACQH learns the projection matrices of heterogeneous modalities data points for transforming a query into a low-dimensional real-valued vector in latent semantic space and constructs the stacked compositional quantization embedding in a coarse-to-fine manner for indicating database points by a series of learnt real-valued codewords in the codebook, with the help of pointwise label information regression simultaneously. Besides, the unified hash codes across modalities can be directly obtained by the discrete iterative optimization framework devised in the paper. Comprehensive experiments on three diverse benchmark datasets have shown the effectiveness and rationality of ACQH. |
|||||
2020 | Lsf-join Locality Sensitive Filtering For Distributed All-pairs Set Similarity Under Skew | Rashtchian Cyrus, Sharma Aneesh, Woodruff David P. | Arxiv | All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets. Traditionally, similarity search has focused on discovering very similar pairs, for which a variety of efficient algorithms are known. However, recent work highlights the importance of finding pairs of sets with relatively small intersection sizes. For example, in a recommender system, two users may be alike even though their interests only overlap on a small percentage of items. In such systems, some dimensions are often highly skewed because they are very popular. Together these two properties render previous approaches infeasible for large input sizes. To address this problem, we present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity. The core of our algorithm is a randomized selection procedure based on Locality Sensitive Filtering. Our method deviates from prior approximate algorithms, which are based on Locality Sensitive Hashing. Theoretically, we show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets. We prove guarantees on the communication, work, and maximum load of LSF-Join, and we also experimentally demonstrate its accuracy on multiple graphs. |
|||||
2020 | Efficient Approximate Nearest Neighbor Search For Multiple Weighted l_pleq2 Distance Functions | Hu Huan, Li Jianzhong | Arxiv | Nearest neighbor search is fundamental to a wide range of applications. Since the exact nearest neighbor search suffers from the “curse of dimensionality”, approximate approaches, such as Locality-Sensitive Hashing (LSH), are widely used to trade a little query accuracy for a much higher query efficiency. In many scenarios, it is necessary to perform nearest neighbor search under multiple weighted distance functions in high-dimensional spaces. This paper considers the important problem of supporting efficient approximate nearest neighbor search for multiple weighted distance functions in high-dimensional spaces. To the best of our knowledge, prior work can only solve the problem for the \(l_2\) distance. However, numerous studies have shown that the \(l_p\) distance with \(p\in(0,2)\) could be more effective than the \(l_2\) distance in high-dimensional spaces. We propose a novel method, WLSH, to address the problem for the \(l_p\) distance for \(p\in(0,2]\). WLSH takes the LSH approach and can theoretically guarantee both the efficiency of processing queries and the accuracy of query results while minimizing the required total number of hash tables. We conduct extensive experiments on synthetic and real data sets, and the results show that WLSH achieves high performance in terms of query efficiency, query accuracy and space consumption. |
|||||
2020 | Creating Something From Nothing Unsupervised Knowledge Distillation For Cross-modal Hashing | Hu Hengtong, Xie Lingxi, Hong Richang, Tian Qi | Arxiv | In recent years, cross-modal hashing (CMH) has attracted increasing attention, mainly because of its potential ability to map content from different modalities, especially in vision and language, into the same space, so that it becomes efficient in cross-modal data retrieval. There are two main frameworks for CMH, differing from each other in whether semantic supervision is required. Compared to the unsupervised methods, the supervised methods often enjoy more accurate results, but require much heavier labor in data annotation. In this paper, we propose a novel approach that enables guiding a supervised method using outputs produced by an unsupervised method. Specifically, we make use of teacher-student optimization for propagating knowledge. Experiments are performed on two popular CMH benchmarks, i.e., the MIRFlickr and NUS-WIDE datasets. Our approach outperforms all existing unsupervised methods by a large margin. |
|||||
2020 | Ihashnet Iris Hashing Network Based On Efficient Multi-index Hashing | Singh Avantika, Vashist Chirag, Gaurav Pratyush, Nigam Aditya, Pratap Rameshwar | Arxiv | Massive biometric deployments are pervasive in today’s world. But despite the high accuracy of biometric systems, their computational efficiency degrades drastically with an increase in the database size. Thus, it is essential to index them. An ideal indexing scheme needs to generate codes that preserve the intra-subject similarity as well as inter-subject dissimilarity. Here, in this paper, we propose an iris indexing scheme using real-valued deep iris features binarized to iris bar codes (IBC) compatible with the indexing structure. Firstly, for extracting robust iris features, we have designed a network utilizing the domain knowledge of ordinal filtering and learning their nonlinear combinations. Later these real-valued features are binarized. Finally, for indexing the iris dataset, we have proposed a loss that can transform the binary feature into an improved feature compatible with the Multi-Index Hashing scheme. This loss function ensures that the Hamming distance is equally distributed among all the contiguous disjoint sub-strings. To the best of our knowledge, this is the first work in the iris indexing domain that presents an end-to-end iris indexing structure. Experimental results on four datasets are presented to depict the efficacy of the proposed approach. |
|||||
2020 | Model Size Reduction Using Frequency Based Double Hashing For Recommender Systems | Zhang Caojin, Liu Yicun, Xie Yuanpu, Ktena Sofia Ira, Tejani Alykhan, Gupta Akshay, Myana Pranay Kumar, Dilipkumar Deepak, Paul Suvadip, Ihara Ikuhiro, Upadhyaya Prasang, Huszar Ferenc, Shi Wenzhe | Arxiv | Deep Neural Networks (DNNs) with sparse input features have been widely used in recommender systems in industry. These models have large memory requirements and need a huge amount of training data. The large model size usually entails a cost, in the range of millions of dollars, for storage and communication with the inference services. In this paper, we propose a hybrid hashing method to combine frequency hashing and double hashing techniques for model size reduction, without compromising performance. We evaluate the proposed models on two product surfaces. In both cases, experiment results demonstrated that we can reduce the model size by around 90 % while keeping the performance on par with the original baselines. |
|||||
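An illustrative PyTorch sketch of the hybrid idea described above: frequent IDs keep dedicated embedding rows, while long-tail IDs are double hashed into two small tables whose rows are summed. The table sizes, frequency cut-off handling, and combine-by-sum choice are assumptions for illustration, not the paper's exact design.

```python
import hashlib
import torch
import torch.nn as nn

def stable_hash(salt: str, raw_id: int, mod: int) -> int:
    digest = hashlib.blake2b(f"{salt}:{raw_id}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % mod

class HybridHashEmbedding(nn.Module):
    def __init__(self, frequent_ids, dim=32, hash_size=10_000):
        super().__init__()
        self.slot = {i: s for s, i in enumerate(frequent_ids)}   # frequent id -> row
        self.frequent = nn.Embedding(len(frequent_ids), dim)
        self.table_a = nn.Embedding(hash_size, dim)              # double-hashed tail
        self.table_b = nn.Embedding(hash_size, dim)
        self.hash_size = hash_size

    def forward(self, raw_id: int) -> torch.Tensor:
        if raw_id in self.slot:                                  # head: dedicated row
            return self.frequent(torch.tensor(self.slot[raw_id]))
        a = stable_hash("a", raw_id, self.hash_size)             # tail: two hashes
        b = stable_hash("b", raw_id, self.hash_size)
        return self.table_a(torch.tensor(a)) + self.table_b(torch.tensor(b))
```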
2020 | Unsupervised Deep Cross-modality Spectral Hashing | Hoang Tuan, Do Thanh-toan, Nguyen Tam V., Cheung Ngai-man | Arxiv | This paper presents a novel framework, namely Deep Cross-modality Spectral Hashing (DCSH), to tackle the unsupervised learning problem of binary hash codes for efficient cross-modal retrieval. The framework is a two-step hashing approach which decouples the optimization into (1) binary optimization and (2) hashing function learning. In the first step, we propose a novel spectral embedding-based algorithm to simultaneously learn single-modality and binary cross-modality representations. While the former is capable of well preserving the local structure of each modality, the latter reveals the hidden patterns from all modalities. In the second step, to learn mapping functions from informative data inputs (images and word embeddings) to binary codes obtained from the first step, we leverage the powerful CNN for images and propose a CNN-based deep architecture to learn text modality. Quantitative evaluations on three standard benchmark datasets demonstrate that the proposed DCSH method consistently outperforms other state-of-the-art methods. |
|||||
2020 | Optimal Binary Linear Codes From Maximal Arcs | Heng Ziling, Ding Cunsheng, Wang Weiqiong | Arxiv | The binary Hamming codes with parameters \([2^m-1, 2^m-1-m, 3]\) are perfect. Their extended codes have parameters \([2^m, 2^m-1-m, 4]\) and are distance-optimal. The first objective of this paper is to construct a class of binary linear codes with parameters \([2^{m+s}+2^s-2^m,2^{m+s}+2^s-2^m-2m-2,4]\), which have better information rates than the class of extended binary Hamming codes, and are also distance-optimal. The second objective is to construct a class of distance-optimal binary codes with parameters \([2^m+2, 2^m-2m, 6]\). Both classes of binary linear codes have new parameters. |
|||||
2020 | Deep Multi-view Enhancement Hashing For Image Retrieval | Yan Chenggang, Gong Biao, Wei Yuxuan, Gao Yue | Arxiv | Hashing is an efficient method for nearest neighbor search in large-scale data space by embedding high-dimensional feature descriptors into a similarity preserving Hamming space with a low dimension. However, large-scale high-speed retrieval through binary code has a certain degree of reduction in retrieval accuracy compared to traditional retrieval methods. We have noticed that multi-view methods can well preserve the diverse characteristics of data. Therefore, we try to introduce the multi-view deep neural network into the hash learning field, and design an efficient and innovative retrieval model, which has achieved a significant improvement in retrieval performance. In this paper, we propose a supervised multi-view hash model which can enhance the multi-view information through neural networks. This is a completely new hash learning method that combines multi-view and deep learning methods. The proposed method utilizes an effective view stability evaluation method to actively explore the relationship among views, which will affect the optimization direction of the entire network. We have also designed a variety of multi-data fusion methods in the Hamming space to preserve the advantages of both convolution and multi-view. In order to avoid excessive computing resources on the enhancement procedure during retrieval, we set up a separate structure called memory network which participates in training together. The proposed method is systematically evaluated on the CIFAR-10, NUS-WIDE and MS-COCO datasets, and the results show that our method significantly outperforms the state-of-the-art single-view and multi-view hashing methods. |
|||||
2020 | On Learning Semantic Representations For Million-scale Free-hand Sketches | Xu Peng, Huang Yongye, Yuan Tongtong, Xiang Tao, Hospedales Timothy M., Song Yi-zhe, Wang Liang | Arxiv | In this paper, we study learning semantic representations for million-scale free-hand sketches. This is highly challenging due to the domain-unique traits of sketches, e.g., diverse, sparse, abstract, noisy. We propose a dual-branch CNN-RNN network architecture to represent sketches, which simultaneously encodes both the static and temporal patterns of sketch strokes. Based on this architecture, we further explore learning the sketch-oriented semantic representations in two challenging yet practical settings, i.e., hashing retrieval and zero-shot recognition on million-scale sketches. Specifically, we use our dual-branch architecture as a universal representation framework to design two sketch-specific deep models: (i) We propose a deep hashing model for sketch retrieval, where a novel hashing loss is specifically designed to accommodate both the abstract and messy traits of sketches. (ii) We propose a deep embedding model for sketch zero-shot recognition, via collecting a large-scale edge-map dataset and proposing to extract a set of semantic vectors from edge-maps as the semantic knowledge for sketch zero-shot domain alignment. Both deep models are evaluated by comprehensive experiments on million-scale sketches and outperform the state-of-the-art competitors. |
|||||
2020 | Directed Graph Hashing | Helbling Caleb | Arxiv | This paper presents several algorithms for hashing directed graphs. The algorithms given are capable of hashing entire graphs as well as assigning hash values to specific nodes in a given graph. The notion of node symmetry is made precise via computation of vertex orbits and the graph automorphism group, and nodes that are symmetrically identical are assigned equal hashes. We also present a novel Merkle-style hashing algorithm that seeks to fulfill the recursive principle that a hash of a node should depend only on the hash of its neighbors. This algorithm works even in the presence of cycles, which would not be possible with a naive approach. Structurally hashing trees has seen widespread use in blockchain, source code version control, and web applications. Despite the popularity of tree hashing, directed graph hashing remains unstudied in the literature. Our algorithms open new possibilities to hashing both directed graphs and more complex data structures that can be reduced to directed graphs such as hypergraphs. |
|||||
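For the acyclic case, the recursive principle described above is a few lines of Python: a node's hash depends only on its own label and the sorted hashes of its successors. Handling cycles and symmetry via vertex orbits and the automorphism group, which is the paper's actual contribution, is not attempted here.

```python
import hashlib
from functools import lru_cache

def dag_node_hashes(graph: dict, labels: dict) -> dict:
    """graph maps node -> list of successor nodes; must be acyclic."""
    @lru_cache(maxsize=None)
    def h(node):
        child_hashes = sorted(h(c) for c in graph.get(node, []))
        payload = labels[node] + "|" + "|".join(child_hashes)
        return hashlib.sha256(payload.encode()).hexdigest()
    return {node: h(node) for node in graph}

g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(dag_node_hashes(g, {n: n.upper() for n in g}))
```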
2020 | A Non-alternating Graph Hashing Algorithm For Large Scale Image Search | Hemati Sobhan, Mehdizavareh Mohammad Hadi, Chenouri Shojaeddin, Tizhoosh Hamid R | Arxiv | In the era of big data, methods for improving memory and computational efficiency have become crucial for successful deployment of technologies. Hashing is one of the most effective approaches to deal with computational limitations that come with big data. One natural way of formulating this problem is spectral hashing, which directly incorporates affinity to learn binary codes. However, due to binary constraints, the optimization becomes intractable. To mitigate this challenge, different relaxation approaches have been proposed to reduce the computational load of obtaining binary codes and still attain a good solution. The problem with all existing relaxation methods is that they resort to one or more additional auxiliary variables to attain high quality binary codes while relaxing the problem. The existence of auxiliary variables leads to a coordinate descent approach, which increases the computational complexity. We argue that introducing these variables is unnecessary. To this end, we propose a novel relaxed formulation for spectral hashing that adds no additional variables to the problem. Furthermore, instead of solving the problem in the original space, where the number of variables equals the number of data points, we solve the problem in a much smaller space and retrieve the binary codes from this solution. This trick reduces both the memory and computational complexity at the same time. We apply two optimization techniques, namely projected gradient and optimization on manifold, to obtain the solution. Using comprehensive experiments on four public datasets, we show that the proposed efficient spectral hashing (ESH) algorithm achieves highly competitive retrieval performance compared with the state of the art at low complexity. |
|||||
2020 | SIR Similar Image Retrieval For Product Search In E-commerce | Stanley Theban, Vanjara Nihar, Pan Yanxin, Pirogova Ekaterina, Chakraborty Swagata, Chaudhuri Abon | Arxiv | We present a similar image retrieval (SIR) platform that is used to quickly discover visually similar products in a catalog of millions. Given the size, diversity, and dynamism of our catalog, product search poses many challenges. It can be addressed by building supervised models to tagging product images with labels representing themes and later retrieving them by labels. This approach suffices for common and perennial themes like “white shirt” or “lifestyle image of TV”. It does not work for new themes such as “e-cigarettes”, hard-to-define ones such as “image with a promotional badge”, or the ones with short relevance span such as “Halloween costumes”. SIR is ideal for such cases because it allows us to search by an example, not a pre-defined theme. We describe the steps - embedding computation, encoding, and indexing - that power the approximate nearest neighbor search back-end. We also highlight two applications of SIR. The first one is related to the detection of products with various types of potentially objectionable themes. This application is run with a sense of urgency, hence the typical time frame to train and bootstrap a model is not permitted. Also, these themes are often short-lived based on current trends, hence spending resources to build a lasting model is not justified. The second application is a variant item detection system where SIR helps discover visual variants that are hard to find through text search. We analyze the performance of SIR in the context of these applications. |
|||||
2020 | Content-aware Neural Hashing For Cold-start Recommendation | Hansen Casper, Hansen Christian, Simonsen Jakob Grue, Alstrup Stephen, Lioma Christina | Arxiv | Content-aware recommendation approaches are essential for providing meaningful recommendations for \textit{new} (i.e., \textit{cold-start}) items in a recommender system. We present a content-aware neural hashing-based collaborative filtering approach (NeuHash-CF), which generates binary hash codes for users and items, such that the highly efficient Hamming distance can be used for estimating user-item relevance. NeuHash-CF is modelled as an autoencoder architecture, consisting of two joint hashing components for generating user and item hash codes. Inspired from semantic hashing, the item hashing component generates a hash code directly from an item’s content information (i.e., it generates cold-start and seen item hash codes in the same manner). This contrasts existing state-of-the-art models, which treat the two item cases separately. The user hash codes are generated directly based on user id, through learning a user embedding matrix. We show experimentally that NeuHash-CF significantly outperforms state-of-the-art baselines by up to 12\% NDCG and 13\% MRR in cold-start recommendation settings, and up to 4\% in both NDCG and MRR in standard settings where all items are present while training. Our approach uses 2-4x shorter hash codes, while obtaining the same or better performance compared to the state of the art, thus consequently also enabling a notable storage reduction. |
|||||
2020 | Unsupervised Semantic Hashing With Pairwise Reconstruction | Hansen Casper, Hansen Christian, Simonsen Jakob Grue, Alstrup Stephen, Lioma Christina | Arxiv | Semantic Hashing is a popular family of methods for efficient similarity search in large-scale datasets. In Semantic Hashing, documents are encoded as short binary vectors (i.e., hash codes), such that semantic similarity can be efficiently computed using the Hamming distance. Recent state-of-the-art approaches have utilized weak supervision to train better performing hashing models. Inspired by this, we present Semantic Hashing with Pairwise Reconstruction (PairRec), which is a discrete variational autoencoder based hashing model. PairRec first encodes weakly supervised training pairs (a query document and a semantically similar document) into two hash codes, and then learns to reconstruct the same query document from both of these hash codes (i.e., pairwise reconstruction). This pairwise reconstruction enables our model to encode local neighbourhood structures within the hash code directly through the decoder. We experimentally compare PairRec to traditional and state-of-the-art approaches, and obtain significant performance improvements in the task of document similarity search. |
|||||
2020 | Deep Hashing With Hash-consistent Large Margin Proxy Embeddings | Morgado Pedro, Li Yunsheng, Pereira Jose Costa, Saberian Mohammad, Vasconcelos Nuno | Arxiv | Image hash codes are produced by binarizing the embeddings of convolutional neural networks (CNN) trained for either classification or retrieval. While proxy embeddings achieve good performance on both tasks, they are non-trivial to binarize, due to a rotational ambiguity that encourages non-binary embeddings. The use of a fixed set of proxies (weights of the CNN classification layer) is proposed to eliminate this ambiguity, and a procedure to design proxy sets that are nearly optimal for both classification and hashing is introduced. The resulting hash-consistent large margin (HCLM) proxies are shown to encourage saturation of hashing units, thus guaranteeing a small binarization error, while producing highly discriminative hash-codes. A semantic extension (sHCLM), aimed to improve hashing performance in a transfer scenario, is also proposed. Extensive experiments show that sHCLM embeddings achieve significant improvements over state-of-the-art hashing procedures on several small and large datasets, both within and beyond the set of training classes. |
|||||
2020 | Mosaic Finding Artistic Connections Across Culture With Conditional Image Retrieval | Hamilton Mark, Fu Stephanie, Lu Mindren, Bui Johnny, Bopp Darius, Chen Zhenbang, Tran Felix, Wang Margaret, Rogers Marina, Zhang Lei, Hoder Chris, Freeman William T. | Arxiv | We introduce MosAIc, an interactive web app that allows users to find pairs of semantically related artworks that span different cultures, media, and millennia. To create this application, we introduce Conditional Image Retrieval (CIR) which combines visual similarity search with user supplied filters or “conditions”. This technique allows one to find pairs of similar images that span distinct subsets of the image corpus. We provide a generic way to adapt existing image retrieval data-structures to this new domain and provide theoretical bounds on our approach’s efficiency. To quantify the performance of CIR systems, we introduce new datasets for evaluating CIR methods and show that CIR performs non-parametric style transfer. Finally, we demonstrate that our CIR data-structures can identify “blind spots” in Generative Adversarial Networks (GAN) where they fail to properly model the true data distribution. |
|||||
2020 | HMQ Hardware Friendly Mixed Precision Quantization Block For Cnns | Habi Hai Victor, Jennings Roy H., Netzer Arnon | Arxiv | Recent work in network quantization produced state-of-the-art results using mixed precision quantization. An imperative requirement for many efficient edge device hardware implementations is that their quantizers are uniform and with power-of-two thresholds. In this work, we introduce the Hardware Friendly Mixed Precision Quantization Block (HMQ) in order to meet this requirement. The HMQ is a mixed precision quantization block that repurposes the Gumbel-Softmax estimator into a smooth estimator of a pair of quantization parameters, namely, bit-width and threshold. HMQs use this to search over a finite space of quantization schemes. Empirically, we apply HMQs to quantize classification models trained on CIFAR10 and ImageNet. For ImageNet, we quantize four different architectures and show that, in spite of the added restrictions to our quantization scheme, we achieve competitive and, in some cases, state-of-the-art results. |
|||||
2020 | Deep Kernel Supervised Hashing For Node Classification In Structural Networks | Guo Jia-nan, Mao Xian-ling, Lin Shu-yang, Wei Wei, Huang Heyan | Arxiv | Node classification in structural networks has been proven to be useful in many real world applications. With the development of network embedding, the performance of node classification has been greatly improved. However, nearly all existing network embedding based methods struggle to capture the actual category features of a node because of the linear inseparability problem in low-dimensional space; meanwhile, they cannot simultaneously incorporate network structure information and node label information into the network embedding. To address the above problems, in this paper, we propose a novel Deep Kernel Supervised Hashing (DKSH) method to learn the hashing representations of nodes for node classification. Specifically, a deep multiple kernel learning method is first proposed to map nodes into a suitable Hilbert space to deal with the linear inseparability problem. Then, instead of only considering structural similarity between two nodes, a novel similarity matrix is designed to merge both network structure information and node label information. Supervised by the similarity matrix, the learned hashing representations of nodes preserve both kinds of information well in the learned Hilbert space. Extensive experiments show that the proposed method significantly outperforms the state-of-the-art baselines over three real world benchmark datasets. |
|||||
2020 | Collaborative Generative Hashing For Marketing And Fast Cold-start Recommendation | Zhang Yan, Tsang Ivor W., Duan Lixin | Arxiv | Cold-start has been a critical issue in recommender systems with the explosion of data in e-commerce. Most existing approaches proposed to alleviate the cold-start problem are hybrid recommender systems that learn representations of users and items by combining user-item interactive and user/item content information. However, previous hybrid methods regularly suffer from an efficiency bottleneck in online recommendation with large-scale item sets, because they were designed to project users and items into a continuous latent space where online recommendation is expensive. To this end, we propose a collaborative generated hashing (CGH) framework to improve the efficiency by representing users and items as binary codes, so that fast hashing search techniques can be used to speed up online recommendation. In addition, the proposed CGH can generate potential users or items for marketing applications, where the generative network is designed with the principle of Minimum Description Length (MDL), which is used to learn compact and informative binary codes. Extensive experiments on two public datasets show its advantages for recommendation in various settings over competing baselines and analyze its feasibility in marketing applications. |
|||||
2020 | I Know Why You Like This Movie Interpretable Efficient Multimodal Recommender | Rychalska Barbara, Basaj Dominika, Dąbrowski Jacek, Daniluk Michał | Arxiv | Recently, the Efficient Manifold Density Estimator (EMDE) model has been introduced. The model exploits Locality Sensitive Hashing and Count-Min Sketch algorithms, combining them with a neural network to achieve state-of-the-art results on multiple recommender datasets. However, this model ingests a compressed joint representation of all input items for each user/session, so calculating attributions for separate items via gradient-based methods does not seem applicable. We prove that interpreting this model in a white-box setting is possible thanks to the properties of the EMDE item retrieval method. By exploiting the multimodal flexibility of this model, we obtain meaningful results showing the influence of multiple modalities: text, categorical features, and images, on movie recommendation output. |
|||||
2020 | Camera-based Piano Sheet Music Identification | Yang Daniel, Tsai Tj | Arxiv | This paper presents a method for large-scale retrieval of piano sheet music images. Our work differs from previous studies on sheet music retrieval in two ways. First, we investigate the problem at a much larger scale than previous studies, using all solo piano sheet music images in the entire IMSLP dataset as a searchable database. Second, we use cell phone images of sheet music as our input queries, which lends itself to a practical, user-facing application. We show that a previously proposed fingerprinting method for sheet music retrieval is far too slow for a real-time application, and we diagnose its shortcomings. We propose a novel hashing scheme called dynamic n-gram fingerprinting that significantly reduces runtime while simultaneously boosting retrieval accuracy. In experiments on IMSLP data, our proposed method achieves a mean reciprocal rank of 0.85 and an average runtime of 0.98 seconds per query. |
|||||
2020 | A Survey On Deep Hashing For Image Retrieval | Zhang Xiaopeng | Arxiv | Hashing has been widely used in approximate nearest neighbor search for large-scale database retrieval owing to its computation and storage efficiency. Deep hashing, which devises convolutional neural network architectures to extract the semantic information or features of images, has received increasing attention recently. In this survey, several deep supervised hashing methods for image retrieval are evaluated, and I identify three main directions for deep supervised hashing methods. Several comments are made at the end. Moreover, to break through the bottleneck of the existing hashing methods, I propose a Shadow Recurrent Hashing (SRH) method as a first attempt. Specifically, I devise a CNN architecture to extract the semantic features of images and design a loss function to encourage similar images to be projected close together. To this end, I propose the concept of a shadow of the CNN output. During the optimization process, the CNN output and its shadow guide each other so as to approach the optimal solution as closely as possible. Several experiments on the CIFAR-10 dataset show the satisfying performance of SRH. |
|||||
2020 | Cluster-and-conquer When Randomness Meets Graph Locality | Giakkoupis George, Kermarrec Anne-marie, Ruas Olivier, Taïani François | Arxiv | K-Nearest-Neighbors (KNN) graphs are central to many emblematic data mining and machine-learning applications. Some of the most efficient KNN graph algorithms are incremental and local: they start from a random graph, which they incrementally improve by traversing neighbors-of-neighbors links. Paradoxically, this random start is also one of the key weaknesses of these algorithms: nodes are initially connected to dissimilar neighbors that lie far away according to the similarity metric. As a result, incremental algorithms must first laboriously explore spurious potential neighbors before they can identify similar nodes, and start converging. In this paper, we remove this drawback with Cluster-and-Conquer (C\(^2\) for short). Cluster-and-Conquer boosts the starting configuration of greedy algorithms thanks to a novel lightweight clustering mechanism, dubbed FastRandomHash. FastRandomHash leverages randomness and recursion to pre-cluster similar nodes at a very low cost. Our extensive evaluation on real datasets shows that Cluster-and-Conquer significantly outperforms existing approaches, including LSH, yielding speed-ups of up to 4.42\(\times\) while incurring only a negligible loss in terms of KNN quality. |
|||||
2020 | Discrete Few-shot Learning For Pan Privacy | Gelbhart Roei, Rubinstein Benjamin I. P. | Arxiv | In this paper we present the first baseline results for the task of few-shot learning of discrete embedding vectors for image recognition. Few-shot learning is a highly researched task, commonly leveraged by recognition systems that are resource constrained to train on a small number of images per class. Few-shot systems typically store a continuous embedding vector of each class, posing a risk to privacy where system breaches or insider threats are a concern. Using discrete embedding vectors, we devise a simple cryptographic protocol, which uses one-way hash functions in order to build recognition systems that do not store their users’ embedding vectors directly, thus providing the guarantee of computational pan privacy in a practical and wide-spread setting. |
|||||
2020 | Deep Pairwise Hashing For Cold-start Recommendation | Zhang Yan, Tsang Ivor W., Yin Hongzhi, Yang Guowu, Lian Defu, Li Jingjing | Recommendation efficiency and data sparsity problems have been regarded as two challenges of improving performance for online recommendation. Most of the previous related work focus on improving recommendation accuracy instead of efficiency. In this paper, we propose a Deep Pairwise Hashing (DPH) to map users and items to binary vectors in Hamming space, where a user’s preference for an item can be efficiently calculated by Hamming distance, which significantly improves the efficiency of online recommendation. To alleviate data sparsity and cold-start problems, the user-item interactive information and item content information are unified to learn effective representations of items and users. Specifically, we first pre-train robust item representation from item content data by a Denoising Auto-encoder instead of other deterministic deep learning frameworks; then we finetune the entire framework by adding a pairwise loss objective with discrete constraints; moreover, DPH aims to minimize a pairwise ranking loss that is consistent with the ultimate goal of recommendation. Finally, we adopt the alternating optimization method to optimize the proposed model with discrete constraints. Extensive experiments on three different datasets show that DPH can significantly advance the state-of-the-art frameworks regarding data sparsity and item cold-start recommendation. |
|||||
2020 | Fast Compact And Highly Scalable Visual Place Recognition Through Sequence-based Matching Of Overloaded Representations | Garg Sourav, Milford Michael | Arxiv | Visual place recognition algorithms trade off three key characteristics: their storage footprint, their computational requirements, and their resultant performance, often expressed in terms of recall rate. Significant prior work has investigated highly compact place representations, sub-linear computational scaling and sub-linear storage scaling techniques, but have always involved a significant compromise in one or more of these regards, and have only been demonstrated on relatively small datasets. In this paper we present a novel place recognition system which enables for the first time the combination of ultra-compact place representations, near sub-linear storage scaling and extremely lightweight compute requirements. Our approach exploits the inherently sequential nature of much spatial data in the robotics domain and inverts the typical target criteria, through intentionally coarse scalar quantization-based hashing that leads to more collisions but is resolved by sequence-based matching. For the first time, we show how effective place recognition rates can be achieved on a new very large 10 million place dataset, requiring only 8 bytes of storage per place and 37K unitary operations to achieve over 50% recall for matching a sequence of 100 frames, where a conventional state-of-the-art approach both consumes 1300 times more compute and fails catastrophically. We present analysis investigating the effectiveness of our hashing overload approach under varying sizes of quantized vector length, comparison of near miss matches with the actual match selections and characterise the effect of variance re-scaling of data on quantization. |
|||||
2020 | Bio-inspired Hashing For Unsupervised Similarity Search | Ryali Chaitanya K., Hopfield John J., Grinberg Leopold, Krotov Dmitry | Proceedings of the International Conference on Machine Learning | The fruit fly Drosophila’s olfactory circuit has inspired a new locality sensitive hashing (LSH) algorithm, FlyHash. In contrast with classical LSH algorithms that produce low dimensional hash codes, FlyHash produces sparse high-dimensional hash codes and has also been shown to have superior empirical performance compared to classical LSH algorithms in similarity search. However, FlyHash uses random projections and cannot learn from data. Building on inspiration from FlyHash and the ubiquity of sparse expansive representations in neurobiology, our work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner. We show that BioHash outperforms previously published benchmarks for various hashing methods. Since our learning algorithm is based on a local and biologically plausible synaptic plasticity rule, our work provides evidence for the proposal that LSH might be a computational reason for the abundance of sparse expansive motifs in a variety of biological systems. We also propose a convolutional variant BioConvHash that further improves performance. From the perspective of computer science, BioHash and BioConvHash are fast, scalable and yield compressed binary representations that are useful for similarity search. |
|||||
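A minimal sketch of the FlyHash-style encoding the entry above builds on: a sparse random expansion to a high-dimensional space followed by a winner-take-all step that keeps only the top activations as set bits. The projection sparsity, expansion size, and number of active bits are illustrative assumptions, and this sketch contains no learned (BioHash) component.

```python
import numpy as np

def flyhash(X, expand_dim=2000, active_bits=32, seed=0):
    """Simplified FlyHash-style encoder: sparse random expansion followed by a
    winner-take-all step that keeps only the top `active_bits` units per input."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # each expansion unit samples a small random subset of input dimensions
    proj = (rng.random((d, expand_dim)) < 0.1).astype(np.float64)
    act = X @ proj
    codes = np.zeros_like(act, dtype=np.uint8)
    top = np.argpartition(-act, active_bits, axis=1)[:, :active_bits]
    np.put_along_axis(codes, top, 1, axis=1)     # sparse, high-dimensional binary code
    return codes

X = np.random.default_rng(1).random((5, 128))
H = flyhash(X)
print(H.shape, H.sum(axis=1))  # (5, 2000), 32 active bits per row
```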
2020 | Deep Momentum Uncertainty Hashing | Fu Chaoyou, Wang Guoli, Wu Xiang, Zhang Qian, He Ran | Arxiv | Combinatorial optimization (CO) has been a hot research topic because of its theoretic and practical importance. As a classic CO problem, deep hashing aims to find an optimal code for each data from finite discrete possibilities, while the discrete nature brings a big challenge to the optimization process. Previous methods usually mitigate this challenge by binary approximation, substituting binary codes for real-values via activation functions or regularizations. However, such approximation leads to uncertainty between real-values and binary ones, degrading retrieval performance. In this paper, we propose a novel Deep Momentum Uncertainty Hashing (DMUH). It explicitly estimates the uncertainty during training and leverages the uncertainty information to guide the approximation process. Specifically, we model bit-level uncertainty via measuring the discrepancy between the output of a hashing network and that of a momentum-updated network. The discrepancy of each bit indicates the uncertainty of the hashing network to the approximate output of that bit. Meanwhile, the mean discrepancy of all bits in a hashing code can be regarded as image-level uncertainty. It embodies the uncertainty of the hashing network to the corresponding input image. The hashing bit and image with higher uncertainty are paid more attention during optimization. To the best of our knowledge, this is the first work to study the uncertainty in hashing bits. Extensive experiments are conducted on four datasets to verify the superiority of our method, including CIFAR-10, NUS-WIDE, MS-COCO, and a million-scale dataset Clothing1M. Our method achieves the best performance on all of the datasets and surpasses existing state-of-the-art methods by a large margin. |
|||||
2020 | Fast Geometric Learning With Symbolic Matrices | Jean Feydy, Alexis Glaunès, Benjamin Charlier, Michael Bronstein | Neural Information Processing Systems | Geometric methods rely on tensors that can be encoded using a symbolic formula and data arrays, such as kernel and distance matrices. We present an extension for standard machine learning frameworks that provides comprehensive support for this abstraction on CPUs and GPUs: our toolbox combines a versatile, transparent user interface with fast runtimes and low memory usage. Unlike general purpose acceleration frameworks such as XLA, our library turns generic Python code into binaries whose performances are competitive with state-of-the-art geometric libraries - such as FAISS for nearest neighbor search - with the added benefit of flexibility. We perform an extensive evaluation on a broad class of problems: Gaussian modelling, K-nearest neighbors search, geometric deep learning, non-Euclidean embeddings and optimal transport theory. In practice, for geometric problems that involve 1k to 1M samples in dimension 1 to 100, our library speeds up baseline GPU implementations by up to two orders of magnitude. |
|||||
2020 | Fedocr Communication-efficient Federated Learning For Scene Text Recognition | Zhang Wenqing, Qiu Yang, Bai Song, Zhang Rui, Wei Xiaolin, Bai Xiang | Arxiv | While scene text recognition techniques have been widely used in commercial applications, data privacy has rarely been taken into account by this research community. Most existing algorithms have assumed a set of shared or centralized training data. However, in practice, data may be distributed on different local devices that cannot be centralized due to privacy restrictions. In this paper, we study how to make use of decentralized datasets for training a robust scene text recognizer while keeping them on local devices. To the best of our knowledge, we propose the first framework leveraging federated learning for scene text recognition, which is trained with decentralized datasets collaboratively. Hence we name it FedOCR. To make FedOCR suitable for deployment on end devices, we make two improvements: using lightweight models and hashing techniques. We argue that both are crucial for FedOCR in terms of the communication efficiency of federated learning. The simulations on decentralized datasets show that the proposed FedOCR achieves results competitive with models trained on centralized data, with lower communication costs and a higher level of privacy preservation. |
|||||
2020 | Locality Sensitive Hashing With Extended Differential Privacy | Fernandes Natasha, Kawamoto Yusuke, Murakami Takao | Proceedings of the | Extended differential privacy, a generalization of standard differential privacy (DP) using a general metric, has been widely studied to provide rigorous privacy guarantees while keeping high utility. However, existing works on extended DP are limited to few metrics, such as the Euclidean metric. Consequently, they have only a small number of applications, such as location-based services and document processing. In this paper, we propose a couple of mechanisms providing extended DP with a different metric: angular distance (or cosine distance). Our mechanisms are based on locality sensitive hashing (LSH), which can be applied to the angular distance and work well for personal data in a high-dimensional space. We theoretically analyze the privacy properties of our mechanisms, and prove extended DP for input data by taking into account that LSH preserves the original metric only approximately. We apply our mechanisms to friend matching based on high-dimensional personal data with angular distance in the local model, and evaluate our mechanisms using two real datasets. We show that LDP requires a very large privacy budget and that RAPPOR does not work in this application. Then we show that our mechanisms enable friend matching with high utility and rigorous privacy guarantees based on extended DP. |
|||||
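As background for the angular-distance LSH used in the entry above, here is the standard random-hyperplane (SimHash) scheme: each bit is the sign of a random projection, so the expected fraction of differing bits grows with the angle between inputs. This sketch illustrates only the hashing primitive, not the paper's privacy mechanisms or their guarantees; all sizes are illustrative.

```python
import numpy as np

def simhash(X, n_bits=16, seed=0):
    """Random-hyperplane LSH (SimHash): each bit is the sign of a random projection,
    so collision probability depends on the angular distance between inputs."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_bits))
    return (X @ planes > 0).astype(np.uint8)

rng = np.random.default_rng(1)
a = rng.normal(size=(1, 64))
b = a + 0.1 * rng.normal(size=(1, 64))   # nearly parallel to a
c = rng.normal(size=(1, 64))             # unrelated direction
ha, hb, hc = (simhash(v) for v in (a, b, c))
print("a vs b bits differing:", int((ha != hb).sum()))   # small
print("a vs c bits differing:", int((ha != hc).sum()))   # roughly half of 16
```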
2020 | Chromatic Learning For Sparse Datasets | Feinberg Vladimir, Bailis Peter | Arxiv | Learning over sparse, high-dimensional data frequently necessitates the use of specialized methods such as the hashing trick. In this work, we design a highly scalable alternative approach that leverages the low degree of feature co-occurrences present in many practical settings. This approach, which we call Chromatic Learning (CL), obtains a low-dimensional dense feature representation by performing graph coloring over the co-occurrence graph of features—an approach previously used as a runtime performance optimization for GBDT training. This color-based dense representation can be combined with additional dense categorical encoding approaches, e.g., submodular feature compression, to further reduce dimensionality. CL exhibits linear parallelizability and consumes memory linear in the size of the co-occurrence graph. By leveraging the structural properties of the co-occurrence graph, CL can compress sparse datasets, such as KDD Cup 2012, that contain over 50M features down to 1024, using an order of magnitude fewer features than frequency-based truncation and the hashing trick while maintaining the same test error for linear models. This compression further enables the use of deep networks in this wide, sparse setting, where CL similarly has favorable performance compared to existing baselines for budgeted input dimension. |
|||||
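A toy sketch of the coloring idea behind Chromatic Learning described above: build the co-occurrence graph over features, greedily color it so that co-occurring features get distinct colors, and fold all features sharing a color into one dense column. The tiny dataset and the simple greedy ordering are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict
from itertools import combinations

# toy sparse dataset: each row lists the indices of its active (nonzero) features
rows = [
    {0, 3, 7},
    {1, 3},
    {2, 5, 7},
    {0, 4},
    {1, 6},
]

# feature co-occurrence graph: two features are adjacent if they appear in the same row
adj = defaultdict(set)
for row in rows:
    for u, v in combinations(sorted(row), 2):
        adj[u].add(v)
        adj[v].add(u)

# greedy coloring: co-occurring features get different colors, so every feature
# sharing a color can be folded into one dense column without within-row collisions
features = sorted(set().union(*rows))
color = {}
for f in features:
    used = {color[n] for n in adj[f] if n in color}
    color[f] = next(c for c in range(len(features)) if c not in used)

n_colors = max(color.values()) + 1
print("features:", len(features), "-> dense columns:", n_colors)
dense_rows = [[next((f for f in row if color[f] == c), None) for c in range(n_colors)]
              for row in rows]
print(dense_rows)
```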
2020 | Attention-based Saliency Hashing For Ophthalmic Image Retrieval | Fang Jiansheng, Xu Yanwu, Zhang Xiaoqing, Hu Yan, Liu Jiang | Arxiv | Deep hashing methods have been proven effective for large-scale medical image search assisting reference-based diagnosis for clinicians. However, although the salient region plays a maximal discriminative role in ophthalmic images, existing deep hashing methods do not fully exploit the learning ability of the deep network to capture the features of salient regions pointedly. The different grades or classes of ophthalmic images may share a similar overall appearance but have subtle differences that can be differentiated by mining salient regions. To address this issue, we propose a novel end-to-end network, named Attention-based Saliency Hashing (ASH), for learning compact hash-codes to represent ophthalmic images. ASH embeds a spatial-attention module to focus more on the representation of salient regions and highlights their essential role in differentiating ophthalmic images. Benefiting from the spatial-attention module, the information of salient regions can be mapped into the hash-code for similarity calculation. In the training stage, we input the image pairs to share the weights of the network, and a pairwise loss is designed to maximize the discriminability of the hash-code. In the retrieval stage, ASH obtains the hash-code from an input image in an end-to-end manner, and the hash-code is then used for similarity calculation to return the most similar images. Extensive experiments on two different modalities of ophthalmic image datasets demonstrate that the proposed ASH can further improve the retrieval performance compared to the state-of-the-art deep hashing methods due to the huge contributions of the spatial-attention module. |
|||||
2020 | LANNS A Web-scale Approximate Nearest Neighbor Lookup System | Doshi Ishita, Das Dhritiman, Bhutani Ashish, Kumar Rajeev, Bhatt Rushi, Balasubramanian Niranjan | Arxiv | Nearest neighbor search (NNS) has a wide range of applications in information retrieval, computer vision, machine learning, databases, and other areas. The existing state-of-the-art algorithm for nearest neighbor search, Hierarchical Navigable Small World Networks (HNSW), is unable to scale to large datasets of 100M records in high dimensions. In this paper, we propose LANNS, an end-to-end platform for Approximate Nearest Neighbor Search, which scales to web-scale datasets. Library for Large Scale Approximate Nearest Neighbor Search (LANNS) is deployed in multiple production systems for identifying topK (\(100 \leq topK \leq 200\)) approximate nearest neighbors with a latency of a few milliseconds per query, high throughput of 2.5k Queries Per Second (QPS) on a single node, on large (\(\sim\)180M data points) high dimensional (50-2048 dimensional) datasets. |
|||||
2020 | Adversarial Collision Attacks On Image Hashing Functions | Dolhansky Brian, Ferrer Cristian Canton | Arxiv | Hashing images with a perceptual algorithm is a common approach to solving duplicate image detection problems. However, perceptual image hashing algorithms are differentiable, and are thus vulnerable to gradient-based adversarial attacks. We demonstrate that not only is it possible to modify an image to produce an unrelated hash, but an exact image hash collision between a source and target image can be produced via minuscule adversarial perturbations. In a white box setting, these collisions can be replicated across nearly every image pair and hash type (including both deep and non-learned hashes). Furthermore, by attacking points other than the output of a hashing function, an attacker can avoid having to know the details of a particular algorithm, resulting in collisions that transfer across different hash sizes or model architectures. Using these techniques, an adversary can poison the image lookup table of a duplicate image detection service, resulting in undefined or unwanted behavior. Finally, we offer several potential mitigations to gradient-based image hash attacks. |
|||||
2020 | A Genetic Algorithm For Obtaining Memory Constrained Near-perfect Hashing | Domnita Dan, Oprisa Ciprian | Arxiv | The problem of fast item retrieval from a fixed collection is often encountered in many computer science areas, from operating system components to databases and user interfaces. We present an approach based on hash tables that focuses on both minimizing the number of comparisons performed during the search and minimizing the total collection size. The standard open-addressing double-hashing approach is improved with a non-linear transformation that can be parametrized in order to ensure a uniform distribution of the data in the hash table. The optimal parameter is determined using a genetic algorithm. The results show that near-perfect hashing is faster than binary search yet uses less memory than perfect hashing, making it a good choice for memory-constrained applications where search time is also critical. |
|||||
2020 | HM4 Hidden Markov Model With Memory Management For Visual Place Recognition | Doan Anh-dzung, Latif Yasir, Chin Tat-jun, Reid Ian | Arxiv | Visual place recognition needs to be robust against appearance variability due to natural and man-made causes. Training data collection should thus be an ongoing process to allow continuous appearance changes to be recorded. However, this creates an unboundedly-growing database that poses time and memory scalability challenges for place recognition methods. To tackle the scalability issue for visual place recognition in autonomous driving, we develop a Hidden Markov Model approach with a two-tiered memory management. Our algorithm, dubbed HM\(^4\), exploits temporal look-ahead to transfer promising candidate images between passive storage and active memory when needed. The inference process takes into account both promising images and a coarse representations of the full database. We show that this allows constant time and space inference for a fixed coverage area. The coarse representations can also be updated incrementally to absorb new data. To further reduce the memory requirements, we derive a compact image representation inspired by Locality Sensitive Hashing (LSH). Through experiments on real world data, we demonstrate the excellent scalability and accuracy of the approach under appearance changes and provide comparisons against state-of-the-art techniques. |
|||||
2020 | Image Hashing By Minimizing Discrete Component-wise Wasserstein Distance | Doan Khoa D., Manchanda Saurav, Badirli Sarkhan, Reddy Chandan K. | Arxiv | Image hashing is one of the fundamental problems that demand both efficient and effective solutions for various practical scenarios. Adversarial autoencoders are shown to be able to implicitly learn a robust, locality-preserving hash function that generates balanced and high-quality hash codes. However, the existing adversarial hashing methods are inefficient to be employed for large-scale image retrieval applications. Specifically, they require an exponential number of samples to be able to generate optimal hash codes and a significantly high computational cost to train. In this paper, we show that the high sample-complexity requirement often results in sub-optimal retrieval performance of the adversarial hashing methods. To address this challenge, we propose a new adversarial-autoencoder hashing approach that has a much lower sample requirement and computational cost. Specifically, by exploiting the desired properties of the hash function in the low-dimensional, discrete space, our method efficiently estimates a better variant of Wasserstein distance by averaging a set of easy-to-compute one-dimensional Wasserstein distances. The resulting hashing approach has an order-of-magnitude better sample complexity, thus better generalization property, compared to the other adversarial hashing methods. In addition, the computational cost is significantly reduced using our approach. We conduct experiments on several real-world datasets and show that the proposed method outperforms the competing hashing methods, achieving up to 10% improvement over the current state-of-the-art image hashing methods. The code accompanying this paper is available on Github (https://github.com/khoadoan/adversarial-hashing). |
|||||
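To illustrate the idea of averaging easy-to-compute one-dimensional Wasserstein distances mentioned above, here is a plain sliced-Wasserstein sketch: project both samples onto random directions, sort, and average the absolute differences. This is a generic estimator under simple assumptions (equal sample sizes, W1 ground cost), not the discrete component-wise variant proposed in the paper.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=64, seed=0):
    """Average of one-dimensional Wasserstein-1 distances over random projections.
    In 1D, W1 between equal-size empirical distributions is the mean absolute
    difference of their sorted samples, which makes each slice cheap to compute."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(X.shape[1], n_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    xp = np.sort(X @ dirs, axis=0)
    yp = np.sort(Y @ dirs, axis=0)
    return np.abs(xp - yp).mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(512, 16))
Y = rng.normal(0.5, 1.0, size=(512, 16))
print(sliced_wasserstein(X, X[rng.permutation(512)]))  # 0: order of samples is irrelevant
print(sliced_wasserstein(X, Y))                        # larger for shifted samples
```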
2020 | Self-supervised Asymmetric Deep Hashing With Margin-scalable Constraint | Yu Zhengyang, Wu Song, Dou Zhihao, Bakker Erwin M. | Arxiv | Due to their effectiveness and efficiency, deep hashing approaches are widely used for large-scale visual search. However, it is still challenging to produce compact and discriminative hash codes for images associated with multiple semantics, for two main reasons: 1) similarity constraints designed in most of the existing methods are based upon an oversimplified similarity assignment (i.e., 0 for instance pairs sharing no label, 1 for instance pairs sharing at least 1 label), 2) the exploration of multi-semantic relevance is insufficient or even neglected in many of the existing methods. These problems significantly limit the discrimination of generated hash codes. In this paper, we propose a novel self-supervised asymmetric deep hashing method with a margin-scalable constraint (SADH) approach to cope with these problems. SADH implements a self-supervised network to sufficiently preserve semantic information in a semantic feature dictionary and a semantic code dictionary for the semantics of the given dataset, which efficiently and precisely guides a feature learning network to preserve multilabel semantic information using an asymmetric learning strategy. By further exploiting semantic dictionaries, a new margin-scalable constraint is employed for both precise similarity searching and robust hash code generation. Extensive empirical research on four popular benchmarks validates the proposed method and shows it outperforms several state-of-the-art approaches. |
|||||
2020 | Encode The Unseen Predictive Video Hashing For Scalable Mid-stream Retrieval | Yu Tong, Padoy Nicolas | Arxiv | This paper tackles a new problem in computer vision: mid-stream video-to-video retrieval. This task, which consists in searching a database for content similar to a video right as it is playing, e.g. from a live stream, exhibits challenging characteristics. Only the beginning part of the video is available as query and new frames are constantly added as the video plays out. To perform retrieval in this demanding situation, we propose an approach based on a binary encoder that is both predictive and incremental in order to (1) account for the missing video content at query time and (2) keep up with repeated, continuously evolving queries throughout the streaming. In particular, we present the first hashing framework that infers the unseen future content of a currently playing video. Experiments on FCVID and ActivityNet demonstrate the feasibility of this task. Our approach also yields a significant mAP@20 performance increase compared to a baseline adapted from the literature for this task, for instance 7.4% (2.6%) increase at 20% (50%) of elapsed runtime on FCVID using bitcodes of size 192 bits. |
|||||
2020 | MAC Address Anonymization For Crowd Counting | Determe Jean-françois, Azzagnuni Sophia, Horlin François, De Doncker Philippe | Algorithms | Research has shown that counting WiFi packets called probe requests (PRs) implicitly provides a proxy for the number of people in an area. In this paper, we discuss a crowd counting system involving WiFi sensors detecting PRs over the air, then extracting and anonymizing their media access control (MAC) addresses using a hash-based approach. This paper discusses an anonymization procedure and shows time-synchronization inaccuracies among sensors and hashing collision rates to be low enough to prevent anonymization from interfering with counting algorithms. In particular, we derive an approximation of the collision rate of uniformly distributed identifiers, with analytical error bounds. |
|||||
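A minimal sketch of the kind of hash-based anonymization and collision-rate reasoning discussed above: MAC addresses are replaced by salted, truncated hashes, and a birthday-bound approximation estimates how often two distinct devices collide on the truncated identifier. The salt handling, identifier length, and choice of hash function are assumptions for illustration, not the deployed system from the paper.

```python
import hashlib
import math
import os

SALT = os.urandom(16)          # would be rotated periodically in a real deployment
ID_BITS = 32                   # size of the truncated, anonymized identifier

def anonymize_mac(mac: str) -> int:
    """Salted, truncated hash of a MAC address: the original address is not stored,
    only a short identifier that stays stable within one salt period."""
    digest = hashlib.sha256(SALT + mac.lower().encode()).digest()
    return int.from_bytes(digest[:ID_BITS // 8], "big")

def expected_collision_rate(n_devices: int, id_bits: int = ID_BITS) -> float:
    """Birthday-bound approximation of the probability that at least one pair of
    distinct devices collides on the truncated identifier."""
    m = 2 ** id_bits
    return 1.0 - math.exp(-n_devices * (n_devices - 1) / (2.0 * m))

print(hex(anonymize_mac("AA:BB:CC:DD:EE:FF")))
print(f"{expected_collision_rate(10_000):.4%}")  # roughly 1.2% for 10k devices, 32-bit ids
```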
2020 | Comprehensive Graph-conditional Similarity Preserving Network For Unsupervised Cross-modal Hashing | Yu Jun, Zhou Hao, Zhan Yibing, Tao Dacheng | Arxiv | Unsupervised cross-modal hashing (UCMH) has become a hot topic recently. Current UCMH focuses on exploring data similarities. However, current UCMH methods calculate the similarity between two data, mainly relying on the two data’s cross-modal features. These methods suffer from inaccurate similarity problems that result in a suboptimal retrieval Hamming space, because the cross-modal features between the data are not sufficient to describe the complex data relationships, such as situations where two data have different feature representations but share the inherent concepts. In this paper, we devise a deep graph-neighbor coherence preserving network (DGCPN). Specifically, DGCPN stems from graph models and explores graph-neighbor coherence by consolidating the information between data and their neighbors. DGCPN regulates comprehensive similarity preserving losses by exploiting three types of data similarities (i.e., the graph-neighbor coherence, the coexistent similarity, and the intra- and inter-modality consistency) and designs a half-real and half-binary optimization strategy to reduce the quantization errors during hashing. Essentially, DGCPN addresses the inaccurate similarity problem by exploring and exploiting the data’s intrinsic relationships in a graph. We conduct extensive experiments on three public UCMH datasets. The experimental results demonstrate the superiority of DGCPN, e.g., by improving the mean average precision from 0.722 to 0.751 on MIRFlickr-25K using 64-bit hashing codes to retrieve texts from images. We will release the source code package and the trained model on https://github.com/Atmegal/DGCPN. |
|||||
2020 | ZSCRGAN A Gan-based Expectation Maximization Model For Zero-shot Retrieval Of Images From Textual Descriptions | Roy Anurag, Verma Vinay Kumar, Ghosh Kripabandhu, Ghosh Saptarshi | Arxiv | Most existing algorithms for cross-modal Information Retrieval are based on a supervised train-test setup, where a model learns to align the mode of the query (e.g., text) to the mode of the documents (e.g., images) from a given training set. Such a setup assumes that the training set contains an exhaustive representation of all possible classes of queries. In reality, a retrieval model may need to be deployed on previously unseen classes, which implies a zero-shot IR setup. In this paper, we propose a novel GAN-based model for zero-shot text to image retrieval. When given a textual description as the query, our model can retrieve relevant images in a zero-shot setup. The proposed model is trained using an Expectation-Maximization framework. Experiments on multiple benchmark datasets show that our proposed model comfortably outperforms several state-of-the-art zero-shot text to image retrieval models, as well as zero-shot classification and hashing models suitably used for retrieval. |
|||||
2020 | Improved Bounds For (b,k)-hashing | Della Fiore Stefano, Costa Simone, Dalai Marco | Arxiv | For fixed integers \(b\geq k\), a problem of relevant interest in computer science and combinatorics is that of determining the asymptotic growth, with \(n\), of the largest set for which a \((b, k)\)-hash family of \(n\) functions exists. Equivalently, determining the asymptotic growth of a largest subset of \(\{1,2,\ldots,b\}^n\) such that, for any \(k\) distinct elements in the set, there is a coordinate where they all differ. An important asymptotic upper bound for general \(b, k\), was derived by Fredman and Komlós in the ’80s and improved for certain \(b\neq k\) by Körner and Marton and by Arikan. Only very recently better bounds were derived for the general \(b,k\) case by Guruswami and Riazanov while stronger results for small values of \(b=k\) were obtained by Arikan, by Dalai, Guruswami and Radhakrishnan and by Costa and Dalai. In this paper, we both show how some of the latter results extend to \(b\neq k\) and further strengthen the bounds for some specific small values of \(b\) and \(k\). The method we use, which depends on the reduction of an optimization problem to a finite number of cases, shows that further results might be obtained by refined arguments at the expense of higher complexity which could be reduced by using more sophisticated and optimized algorithmic approaches. |
|||||
2020 | Faster Binary Embeddings For Preserving Euclidean Distances | Zhang Jinjie, Saab Rayan | Arxiv | We propose a fast, distance-preserving, binary embedding algorithm to transform a high-dimensional dataset \(\mathcal{T}\subseteq\mathbb{R}^n\) into binary sequences in the cube \(\{\pm 1\}^m\). When \(\mathcal{T}\) consists of well-spread (i.e., non-sparse) vectors, our embedding method applies a stable noise-shaping quantization scheme to \(A x\) where \(A\in\mathbb{R}^{m\times n}\) is a sparse Gaussian random matrix. This contrasts with most binary embedding methods, which usually use \(x\mapsto \mathrm{sign}(Ax)\) for the embedding. Moreover, we show that Euclidean distances among the elements of \(\mathcal{T}\) are approximated by the \(\ell_1\) norm on the images of \(\{\pm 1\}^m\) under a fast linear transformation. This again contrasts with standard methods, where the Hamming distance is used instead. Our method is both fast and memory efficient, with time complexity \(O(m)\) and space complexity \(O(m)\). Further, we prove that the method is accurate and its associated error is comparable to that of a continuous valued Johnson-Lindenstrauss embedding plus a quantization error that admits a polynomial decay as the embedding dimension \(m\) increases. Thus the length of the binary codes required to achieve a desired accuracy is quite small, and we show it can even be compressed further without compromising the accuracy. To illustrate our results, we test the proposed method on natural images and show that it achieves strong performance. |
|||||
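For contrast with the noise-shaping approach described above, the following sketch shows the standard \(x\mapsto \mathrm{sign}(Ax)\) binary embedding with a dense Gaussian matrix, where distances are then compared in Hamming space. It is the baseline the entry argues against, not the proposed method; the embedding dimension and test vectors are illustrative.

```python
import numpy as np

def sign_binary_embedding(X, m=256, seed=0):
    """Standard x -> sign(Ax) binary embedding with a dense Gaussian matrix A.
    Distances between embedded points are compared in Hamming space."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(X.shape[1], m))
    return np.where(X @ A >= 0, 1, -1)

rng = np.random.default_rng(1)
x = rng.normal(size=(1, 64)); x /= np.linalg.norm(x)
y = x + 0.05 * rng.normal(size=(1, 64)); y /= np.linalg.norm(y)   # close to x
z = rng.normal(size=(1, 64)); z /= np.linalg.norm(z)              # unrelated to x

B = sign_binary_embedding(np.vstack([x, y, z]))
ham = lambda u, v: int((u != v).sum())
print("close pair :", ham(B[0], B[1]))   # few differing bits
print("random pair:", ham(B[0], B[2]))   # roughly half the bits differ
```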
2020 | SMYRF - Efficient Attention Using Asymmetric Clustering | Giannis Daras, Nikita Kitaev, Augustus Odena, Alexandros G. Dimakis | Neural Information Processing Systems | We propose a novel type of balanced clustering algorithm to approximate attention. Attention complexity is reduced from \(O(N^2)\) to \(O(N log N)\), where N is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new Asymmetric transformations and an adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF is that it can be used as a drop-in replacement for dense attention layers without any retraining. On the contrary, prior fast attention methods impose constraints (e.g. tight queries and keys) and require re-training from scratch. We apply our method to pre-trained state-of-the-art Natural Language Processing and Computer Vision models and we report significant memory and speed benefits. Notably, SMYRF-BERT outperforms (slightly) BERT on GLUE, while using 50% less memory. We also show that SMYRF can be used interchangeably with dense attention before and after training. Finally, we use SMYRF to train GANs with attention in high resolutions. Using a single TPU, we train BigGAN on Celeba-HQ, with attention at resolution 128x128 and 256x256, capable of generating realistic human faces. |
|||||
2020 | Convolutional Embedding For Edit Distance | Dai Xinyan, Yan Xiao, Zhou Kaiwen, Wang Yuxuan, Yang Han, Cheng James | Arxiv | Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment. However, computing edit distance is known to have high complexity, which makes string similarity search challenging for large datasets. In this paper, we propose a deep learning pipeline (called CNN-ED) that embeds edit distance into Euclidean distance for fast approximate similarity search. A convolutional neural network (CNN) is used to generate fixed-length vector embeddings for a dataset of strings and the loss function is a combination of the triplet loss and the approximation error. To justify our choice of using CNN instead of other structures (e.g., RNN) as the model, theoretical analysis is conducted to show that some basic operations in our CNN model preserve edit distance. Experimental results show that CNN-ED outperforms data-independent CGK embedding and RNN-based GRU embedding in terms of both accuracy and efficiency by a large margin. We also show that string similarity search can be significantly accelerated using CNN-based embeddings, sometimes by orders of magnitude. |
|||||
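As a reminder of the cost that CNN-ED tries to avoid at query time, here is the classical Levenshtein dynamic program; its O(|a|·|b|) running time per string pair is what motivates embedding edit distance into Euclidean distance. The strings below are placeholders.

```python
def edit_distance(a: str, b: str) -> int:
    """Classical Levenshtein DP in O(len(a) * len(b)) time and O(len(b)) space."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution or match
        prev = cur
    return prev[-1]

print(edit_distance("ACGTACGT", "ACGTTCGA"))  # 2
```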
2020 | Parallel Index-based Structural Graph Clustering And Its Approximation | Tseng Tom, Dhulipala Laxman, Shun Julian | Arxiv | SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely used graph clustering algorithm. For large graphs, however, sequential SCAN variants are prohibitively slow, and parallel SCAN variants do not effectively share work among queries with different SCAN parameter settings. Since users of SCAN often explore many parameter settings to find good clusterings, it is worthwhile to precompute an index that speeds up queries. This paper presents a practical and provably efficient parallel index-based SCAN algorithm based on GS-Index, a recent sequential algorithm. Our parallel algorithm improves upon the asymptotic work of the sequential algorithm by using integer sorting. It is also highly parallel, achieving logarithmic span (parallel time) for both index construction and clustering queries. Furthermore, we apply locality-sensitive hashing (LSH) to design a novel approximate SCAN algorithm and prove guarantees for its clustering behavior. We present an experimental evaluation of our algorithms on large real-world graphs. On a 48-core machine with two-way hyper-threading, our parallel index construction achieves 50–151\(\times\) speedup over the construction of GS-Index. In fact, even on a single thread, our index construction algorithm is faster than GS-Index. Our parallel index query implementation achieves 5–32\(\times\) speedup over GS-Index queries across a range of SCAN parameter values, and our implementation is always faster than ppSCAN, a state-of-the-art parallel SCAN algorithm. Moreover, our experiments show that applying LSH results in faster index construction while maintaining good clustering quality. |
|||||
2020 | Pairwise Supervised Hashing With Bernoulli Variational Auto-encoder And Self-control Gradient Estimator | Dadaneh Siamak Zamani, Boluki Shahin, Yin Mingzhang, Zhou Mingyuan, Qian Xiaoning | Uncertainty in Artificial Intelligence Conference | Semantic hashing has become a crucial component of fast similarity search in many large-scale information retrieval systems, in particular, for text data. Variational auto-encoders (VAEs) with binary latent variables as hashing codes provide state-of-the-art performance in terms of precision for document retrieval. We propose a pairwise loss function with discrete latent VAE to reward within-class similarity and between-class dissimilarity for supervised hashing. Instead of solving the optimization relying on existing biased gradient estimators, an unbiased low-variance gradient estimator is adopted to optimize the hashing function by evaluating the non-differentiable loss function over two correlated sets of binary hashing codes to control the variance of gradient estimates. This new semantic hashing framework achieves superior performance compared to the state-of-the-arts, as demonstrated by our comprehensive experiments. |
|||||
2020 | Exchnet A Unified Hashing Network For Large-scale Fine-grained Image Retrieval | Cui Quan, Jiang Qing-yuan, Wei Xiu-shen, Li Wu-jun, Yoshie Osamu | Arxiv | Retrieving content relevant images from a large-scale fine-grained dataset could suffer from intolerably slow query speed and highly redundant storage cost, due to high-dimensional real-valued embeddings which aim to distinguish subtle visual differences of fine-grained objects. In this paper, we study the novel fine-grained hashing topic to generate compact binary codes for fine-grained images, leveraging the search and storage efficiency of hash learning to alleviate the aforementioned problems. Specifically, we propose a unified end-to-end trainable network, termed as ExchNet. Based on attention mechanisms and proposed attention constraints, it can firstly obtain both local and global features to represent object parts and whole fine-grained objects, respectively. Furthermore, to ensure the discriminative ability and semantic meaning’s consistency of these part-level features across images, we design a local feature alignment approach by performing a feature exchanging operation. Later, an alternative learning algorithm is employed to optimize the whole ExchNet and then generate the final binary hash codes. Validated by extensive experiments, our proposal consistently outperforms state-of-the-art generic hashing methods on five fine-grained datasets, which shows our effectiveness. Moreover, compared with other approximate nearest neighbor methods, ExchNet achieves the best speed-up and storage reduction, revealing its efficiency and practicality. |
|||||
2020 | New Bounds For Perfect k-hashing | Costa Simone, Dalai Marco | Arxiv | Let \(C\subseteq \{1,\ldots,k\}^n\) be such that for any \(k\) distinct elements of \(C\) there exists a coordinate where they all differ simultaneously. Fredman and Komlós studied upper and lower bounds on the largest cardinality of such a set \(C\), in particular proving that as \(n\to\infty\), \(|C|\leq \exp(n k!/k^{k-1}+o(n))\). Improvements over this result were first derived by different authors for \(k=4\). More recently, Guruswami and Riazanov showed that the coefficient \(k!/k^{k-1}\) is certainly not tight for any \(k>3\), although they could only determine explicit improvements for \(k=5,6\). For larger \(k\), their method gives numerical values modulo a conjecture on the maxima of certain polynomials. In this paper, we first prove their conjecture, completing the explicit computation of an improvement over the Fredman-Komlós bound for any \(k\). Then, we develop a different method which gives substantial improvements for \(k=5,6\). |
|||||
2020 | Dartminhash Fast Sketching For Weighted Sets | Christiani Tobias | Arxiv | Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines. We introduce a simple algorithm that takes a weighted set \(x \in \mathbb{R}_{\geq 0}^{d}\) and computes \(k\) independent minhashes in expected time \(O(k \log k + \Vert x \Vert_0 \log(\Vert x \Vert_1 + 1/\Vert x \Vert_1))\), improving upon the state-of-the-art BagMinHash algorithm (KDD ‘18) and representing the fastest weighted minhash algorithm for sparse data. Our experiments show running times that scale better with \(k\) and \(\Vert x \Vert_0\) compared to ICWS (ICDM ‘10) and BagMinhash, obtaining \(10\times\) speedups in common use cases. Our approach also gives rise to a technique for computing fully independent locality-sensitive hash values for \((L, K)\)-parameterized approximate near neighbor search under weighted Jaccard similarity in optimal expected time \(O(LK + \Vert x \Vert_0)\), improving on prior work even in the case of unweighted sets. |
|||||
2020 | Deep Cross-modal Hashing Via Margin-dynamic-softmax Loss | Tu Rong-cheng, Mao Xian-ling, Tu Rongxin, Bian Binbin, Wei Wei, Huang Heyan | Arxiv | Due to their high retrieval efficiency and low storage cost for the cross-modal search task, cross-modal hashing methods have attracted considerable attention. For the supervised cross-modal hashing methods, how to make the learned hash codes sufficiently preserve the semantic information contained in the labels of datapoints is the key to further enhancing retrieval performance. Hence, almost all supervised cross-modal hashing methods depend on defining a similarity between datapoints based on the label information to guide the hashing model learning, fully or partly. However, the defined similarity between datapoints can only capture the label information of datapoints partially and misses abundant semantic information, which hinders further improvement of retrieval performance. Thus, in this paper, different from previous works, we propose a novel cross-modal hashing method without defining the similarity between datapoints, called Deep Cross-modal Hashing via \textit{Margin-dynamic-softmax Loss} (DCHML). Specifically, DCHML first trains a proxy hashing network to transform each category of a dataset into a semantically discriminative hash code, called a proxy hash code. Each proxy hash code can preserve the semantic information of its corresponding category well. Next, without defining the similarity between datapoints to supervise the training process of the modality-specific hashing networks, we propose a novel \textit{margin-dynamic-softmax loss} to directly utilize the proxy hashing codes as supervised information. Finally, by minimizing the novel \textit{margin-dynamic-softmax loss}, the modality-specific hashing networks can be trained to generate hash codes which can simultaneously preserve the cross-modal similarity and abundant semantic information well. |
|||||
2020 | Deep Hashing For Secure Multimodal Biometrics | Talreja Veeru, Valenti Matthew, Nasrabadi Nasser | IEEE Transactions on Information Forensics and Securityvol. | When compared to unimodal systems, multimodal biometric systems have several advantages, including lower error rate, higher accuracy, and larger population coverage. However, multimodal systems have an increased demand for integrity and privacy because they must store multiple biometric traits associated with each user. In this paper, we present a deep learning framework for feature-level fusion that generates a secure multimodal template from each user’s face and iris biometrics. We integrate a deep hashing (binarization) technique into the fusion architecture to generate a robust binary multimodal shared latent representation. Further, we employ a hybrid secure architecture by combining cancelable biometrics with secure sketch techniques and integrate it with a deep hashing framework, which makes it computationally prohibitive to forge a combination of multiple biometrics that pass the authentication. The efficacy of the proposed approach is shown using a multimodal database of face and iris and it is observed that the matching performance is improved due to the fusion of multiple biometrics. Furthermore, the proposed approach also provides cancelability and unlinkability of the templates along with improved privacy of the biometric data. Additionally, we also test the proposed hashing function for an image retrieval application using a benchmark dataset. The main goal of this paper is to develop a method for integrating multimodal fusion, deep hashing, and biometric security, with an emphasis on structural data from modalities like face and iris. The proposed approach is in no way a general biometric security framework that can be applied to all biometric modalities, as further research is needed to extend the proposed framework to other unconstrained biometric modalities. |
|||||
2020 | Deep Learning For Image Search And Retrieval In Large Remote Sensing Archives | Sumbul Gencer, Kang Jian, Demir Begüm | Arxiv | This chapter presents recent advances in content based image search and retrieval (CBIR) systems in remote sensing (RS) for fast and accurate information discovery from massive data archives. Initially, we analyze the limitations of the traditional CBIR systems that rely on the hand-crafted RS image descriptors. Then, we focus our attention on the advances in RS CBIR systems for which deep learning (DL) models are at the forefront. In particular, we present the theoretical properties of the most recent DL based CBIR systems for the characterization of the complex semantic content of RS images. After discussing their strengths and limitations, we present the deep hashing based CBIR systems that have high time-efficient search capability within huge data archives. Finally, the most promising research directions in RS CBIR are discussed. |
|||||
2020 | Making Online Sketching Hashing Even Faster | Chen Xixian, Yang Haiqin, Zhao Shenglin, Lyu Michael R., King Irwin | IEEE Transactions on Knowledge and Data Engineering | Data-dependent hashing methods have demonstrated good performance in various machine learning applications to learn a low-dimensional representation from the original data. However, they still suffer from several obstacles: First, most of existing hashing methods are trained in a batch mode, yielding inefficiency for training streaming data. Second, the computational cost and the memory consumption increase extraordinarily in the big data setting, which perplexes the training procedure. Third, the lack of labeled data hinders the improvement of the model performance. To address these difficulties, we utilize online sketching hashing (OSH) and present a FasteR Online Sketching Hashing (FROSH) algorithm to sketch the data in a more compact form via an independent transformation. We provide theoretical justification to guarantee that our proposed FROSH consumes less time and achieves a comparable sketching precision under the same memory cost of OSH. We also extend FROSH to its distributed implementation, namely DFROSH, to further reduce the training time cost of FROSH while deriving the theoretical bound of the sketching precision. Finally, we conduct extensive experiments on both synthetic and real datasets to demonstrate the attractive merits of FROSH and DFROSH. |
|||||
2020 | Kernel Density Estimation Through Density Constrained Near Neighbor Search | Charikar Moses, Kapralov Michael, Nouri Navid, Siminelakis Paris | Arxiv | In this paper we revisit the kernel density estimation problem: given a kernel \(K(x, y)\) and a dataset of \(n\) points in high dimensional Euclidean space, prepare a data structure that can quickly output, given a query \(q\), a \((1+\epsilon)\)-approximation to \(\mu:=\frac1{|P|}\sum_{p\in P} K(p, q)\). First, we give a single data structure based on classical near neighbor search techniques that improves upon or essentially matches the query time and space complexity for all radial kernels considered in the literature so far. We then show how to improve both the query complexity and runtime by using recent advances in data-dependent near neighbor search. We achieve our results by giving a new implementation of the natural importance sampling scheme. Unlike previous approaches, our algorithm first samples the dataset uniformly (considering a geometric sequence of sampling rates), and then uses existing approximate near neighbor search techniques on the resulting smaller dataset to retrieve the sampled points that lie at an appropriate distance from the query. We show that the resulting sampled dataset has strong geometric structure, making approximate near neighbor search return the required samples much more efficiently than for worst case datasets of the same size. As an example application, we show that this approach yields a data structure that achieves query time \(\mu^{-(1+o(1))/4}\) and space complexity \(\mu^{-(1+o(1))}\) for the Gaussian kernel. Our data dependent approach achieves query time \(\mu^{-0.173-o(1)}\) and space \(\mu^{-(1+o(1))}\) for the Gaussian kernel. The data dependent analysis relies on new techniques for tracking the geometric structure of the input datasets in a recursive hashing process that we hope will be of interest in other applications in near neighbor search. |
|||||
2020 | A Showcase Of The Use Of Autoencoders In Feature Learning Applications | Charte David, Charte Francisco, Del Jesus María J., Herrera Francisco | In From Bioinspired Systems and Biomedical Applications to Machine Learning/IWINAC | Autoencoders are techniques for data representation learning based on artificial neural networks. Unlike other feature learning methods, which may be focused on finding specific transformations of the feature space, they can be adapted to fulfill many purposes, such as data visualization, denoising, anomaly detection and semantic hashing. This work presents these applications and provides details on how autoencoders can perform them, including code samples making use of an R package with an easy-to-use interface for autoencoder design and training, \texttt{ruta}. Along the way, explanations of how each learning task is achieved are provided, with the aim of helping the reader design their own autoencoders for these or other objectives. |
|||||
2020 | Fixed-length Protein Embeddings Using Contextual Lenses | Shanehsazzadeh Amir, Belanger David, Dohan David | Arxiv | The Basic Local Alignment Search Tool (BLAST) is currently the most popular method for searching databases of biological sequences. BLAST compares sequences via similarity defined by a weighted edit distance, which results in it being computationally expensive. As opposed to working with edit distance, a vector similarity approach can be accelerated substantially using modern hardware or hashing techniques. Such an approach would require fixed-length embeddings for biological sequences. There has been recent interest in learning fixed-length protein embeddings using deep learning models under the hypothesis that the hidden layers of supervised or semi-supervised models could produce potentially useful vector embeddings. We consider transformer (BERT) protein language models that are pretrained on the TrEMBL data set and learn fixed-length embeddings on top of them with contextual lenses. The embeddings are trained to predict the family a protein belongs to for sequences in the Pfam database. We show that for nearest-neighbor family classification, pretraining offers a noticeable boost in performance and that the corresponding learned embeddings are competitive with BLAST. Furthermore, we show that the raw transformer embeddings, obtained via static pooling, do not perform well on nearest-neighbor family classification, which suggests that learning embeddings in a supervised manner via contextual lenses may be a compute-efficient alternative to fine-tuning. |
|||||
2020 | Faster Person Re-identification | Wang Guan'an, Gong Shaogang, Cheng Jian, Hou Zengguang | Arxiv | Fast person re-identification (ReID) aims to search person images quickly and accurately. The main idea of recent fast ReID methods is the hashing algorithm, which learns compact binary codes and performs fast Hamming distance and counting sort. However, a very long code is needed for high accuracy (e.g. 2048), which compromises search speed. In this work, we introduce a new solution for fast ReID by formulating a novel Coarse-to-Fine (CtF) hashing code search strategy, which complementarily uses short and long codes, achieving both faster speed and better accuracy. It uses shorter codes to coarsely rank broad matching similarities and longer codes to refine only a few top candidates for more accurate instance ReID. Specifically, we design an All-in-One (AiO) framework together with a Distance Threshold Optimization (DTO) algorithm. In AiO, we simultaneously learn and enhance multiple codes of different lengths in a single model. It learns multiple codes in a pyramid structure, and encourages shorter codes to mimic longer codes by self-distillation. DTO solves a complex threshold search problem by a simple optimization process, and the balance between accuracy and speed is easily controlled by a single parameter. It formulates the optimization target as an \(F_{\beta}\) score that can be optimised by Gaussian cumulative distribution functions. Experimental results on 2 datasets show that our proposed method (CtF) is not only 8% more accurate but also 5x faster than contemporary hashing ReID methods. Compared with non-hashing ReID methods, CtF is \(50\times\) faster with comparable accuracy. Code is available at https://github.com/wangguanan/light-reid. |
|||||
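The coarse-to-fine idea summarized in the entry above can be illustrated with a minimal sketch: rank the whole database with short codes, then re-rank only a shortlist with long codes. The code lengths, random data, and shortlist size below are illustrative assumptions and do not reproduce the paper's AiO or DTO components.

```python
# Minimal coarse-to-fine binary-code search sketch (illustrative data only).
import numpy as np

rng = np.random.default_rng(0)
n, short_bits, long_bits, top_k = 10_000, 32, 512, 50

db_short = rng.integers(0, 2, size=(n, short_bits), dtype=np.uint8)
db_long = rng.integers(0, 2, size=(n, long_bits), dtype=np.uint8)
q_short = rng.integers(0, 2, size=short_bits, dtype=np.uint8)
q_long = rng.integers(0, 2, size=long_bits, dtype=np.uint8)

# Coarse stage: rank the whole database by short-code Hamming distance.
coarse_dist = np.count_nonzero(db_short != q_short, axis=1)
candidates = np.argsort(coarse_dist)[:top_k]

# Fine stage: re-rank only the shortlisted candidates with the long codes.
fine_dist = np.count_nonzero(db_long[candidates] != q_long, axis=1)
ranking = candidates[np.argsort(fine_dist)]
print("top-5 ids after refinement:", ranking[:5])
```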
2020 | Its The Best Only When It Fits You Most Finding Related Models For Serving Based On Dynamic Locality Sensitive Hashing | Zhou Lixi, Wang Zijie, Das Amitabh, Zou Jia | Arxiv | In recent, deep learning has become the most popular direction in machine learning and artificial intelligence. However, preparation of training data is often a bottleneck in the lifecycle of deploying a deep learning model for production or research. Reusing models for inferencing a dataset can greatly save the human costs required for training data creation. Although there exist a number of model sharing platform such as TensorFlow Hub, PyTorch Hub, DLHub, most of these systems require model uploaders to manually specify the details of each model and model downloaders to screen keyword search results for selecting a model. They are in lack of an automatic model searching tool. This paper proposes an end-to-end process of searching related models for serving based on the similarity of the target dataset and the training datasets of the available models. While there exist many similarity measurements, we study how to efficiently apply these metrics without pair-wise comparison and compare the effectiveness of these metrics. We find that our proposed adaptivity measurement which is based on Jensen-Shannon (JS) divergence, is an effective measurement, and its computation can be significantly accelerated by using the technique of locality sensitive hashing. |
|||||
2020 | Massively Parallel Graph Drawing And Representation Learning | Böhm Christian, Plant Claudia | IEEE BigData | To fully exploit the performance potential of modern multi-core processors, machine learning and data mining algorithms for big data must be parallelized in multiple ways. Today’s CPUs consist of multiple cores, each following an independent thread of control, and each equipped with multiple arithmetic units which can perform the same operation on a vector of multiple data objects. Graph embedding, i.e. converting the vertices of a graph into numerical vectors is a data mining task of high importance and is useful for graph drawing (low-dimensional vectors) and graph representation learning (high-dimensional vectors). In this paper, we propose MulticoreGEMPE (Graph Embedding by Minimizing the Predictive Entropy), an information-theoretic method which can generate low and high-dimensional vectors. MulticoreGEMPE applies MIMD (Multiple Instructions Multiple Data, using OpenMP) and SIMD (Single Instructions Multiple Data, using AVX-512) parallelism. We propose general ideas applicable in other graph-based algorithms like vectorized hashing and vectorized reduction. Our experimental evaluation demonstrates the superiority of our approach. |
|||||
2020 | Scalable Blocking For Very Large Databases | Borthwick Andrew, Ash Stephen, Pang Bin, Qureshi Shehzad, Jones Timothy | Arxiv | In the field of database deduplication, the goal is to find approximately matching records within a database. Blocking is a typical stage in this process that involves cheaply finding candidate pairs of records that are potential matches for further processing. We present here Hashed Dynamic Blocking, a new approach to blocking designed to address datasets larger than those studied in most prior work. Hashed Dynamic Blocking (HDB) extends Dynamic Blocking, which leverages the insight that rare matching values and rare intersections of values are predictive of a matching relationship. We also present a novel use of Locality Sensitive Hashing (LSH) to build blocking key values for huge databases with a convenient configuration to control the trade-off between precision and recall. HDB achieves massive scale by minimizing data movement, using compact block representation, and greedily pruning ineffective candidate blocks using a Count-min Sketch approximate counting data structure. We benchmark the algorithm by focusing on real-world datasets in excess of one million rows, demonstrating that the algorithm displays linear time complexity scaling in this range. Furthermore, we execute HDB on a 530 million row industrial dataset, detecting 68 billion candidate pairs in less than three hours at a cost of $307 on a major cloud service. |
|||||
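As a hedged illustration of the approximate counting structure that the entry above uses to greedily prune oversized candidate blocks, here is a minimal Count-min Sketch; the width, depth, and hashing scheme are illustrative choices, not the paper's configuration.

```python
# Small Count-min Sketch: estimates are always upper bounds on the true counts.
import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indices(self, key: str):
        # One hash per row, derived by salting a single base hash function.
        for row in range(self.depth):
            h = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([row])).digest()
            yield row, int.from_bytes(h, "little") % self.width

    def add(self, key: str, count: int = 1):
        for row, col in self._indices(key):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # Count-min returns the minimum over rows, which upper-bounds the true count.
        return min(self.table[row][col] for row, col in self._indices(key))

cms = CountMinSketch()
for value in ["smith"] * 500 + ["rare-surname"] * 3:
    cms.add(value)
print(cms.estimate("smith"), cms.estimate("rare-surname"))
```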
2020 | Embedded Blockchains A Synthesis Of Blockchains Spread Spectrum Watermarking Perceptual Hashing Digital Signatures | Blake Sam | Arxiv | In this paper we introduce a scheme for detecting manipulated audio and video. The scheme is a synthesis of blockchains, encrypted spread spectrum watermarks, perceptual hashing and digital signatures, which we call an Embedded Blockchain. Within this scheme, we use the blockchain for its data structure of a cryptographically linked list, cryptographic hashing for absolute comparisons, perceptual hashing for flexible comparisons, digital signatures for proof of ownership, and encrypted spread spectrum watermarking to embed the blockchain into the background noise of the media. So each media recording has its own unique blockchain, with each block holding information describing the media segment. The problem of verifying the integrity of the media is recast to traversing the blockchain, block-by-block, and segment-by-segment of the media. If any chain is broken, the difference in the computed and extracted perceptual hash is used to estimate the level of manipulation. |
|||||
2020 | Perceptual Hashing Applied To Tor Domains Recognition | Biswas Rubel, Vasco-carofilis Roberto A., Fernandez Eduardo Fidalgo, Martino Francisco Jáñez, Medina Pablo Blanco | Arxiv | The Tor darknet hosts different types of illegal content, which are monitored by cybersecurity agencies. However, manually classifying Tor content can be slow and error-prone. To support this task, we introduce Frequency-Dominant Neighborhood Structure (F-DNS), a new perceptual hashing method for automatically classifying domains by their screenshots. First, we evaluated F-DNS using images subject to various content preserving operations. We compared them with their original images, achieving better correlation coefficients than other state-of-the-art methods, especially in the case of rotation. Then, we applied F-DNS to categorize Tor domains using the Darknet Usage Service Images-2K (DUSI-2K), a dataset with screenshots of active Tor service domains. Finally, we measured the performance of F-DNS against an image classification approach and a state-of-the-art hashing method. Our proposal obtained 98.75% accuracy in Tor images, surpassing all other methods compared. |
|||||
2020 | MMH With Arbitrary Modulus Is Always Almost-universal | Bibak Khodakhast, Kapron Bruce M., Srinivasan Venkatesh | Information Processing Letters | Universal hash functions, discovered by Carter and Wegman in 1979, are of great importance in computer science with many applications. MMH\(^*\) is a well-known \(\triangle\)-universal hash function family, based on the evaluation of a dot product modulo a prime. In this paper, we introduce a generalization of MMH\(^*\), that we call GMMH\(^*\), using the same construction as MMH\(^*\) but with an arbitrary integer modulus \(n>1\), and show that GMMH\(^*\) is \(\frac{1}{p}\)-almost-\(\triangle\)-universal, where \(p\) is the smallest prime divisor of \(n\). This bound is tight. |
|||||
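A minimal sketch of the dot-product hash family discussed above: the key is a random vector and the hash of a message vector is their inner product reduced modulo \(n\). With a prime modulus this is the classical MMH\(^*\)-style construction; the entry above analyzes the same construction for an arbitrary modulus \(n>1\). The word count and modulus below are illustrative choices.

```python
# Dot-product hash: h_key(m) = (sum_i key_i * m_i) mod n.
import secrets

def dot_product_hash(key, message, n):
    assert len(key) == len(message)
    return sum(k * m for k, m in zip(key, message)) % n

n = 2**31 - 1                       # any modulus n > 1; a prime is the classical case
k = 8                               # number of words per message (illustrative)
key = [secrets.randbelow(n) for _ in range(k)]

msg_a = [3, 1, 4, 1, 5, 9, 2, 6]
msg_b = [2, 7, 1, 8, 2, 8, 1, 8]
print(dot_product_hash(key, msg_a, n), dot_product_hash(key, msg_b, n))
```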
2020 | Weakly-supervised Online Hashing | Zhan Yu-wei, Luo Xin, Sun Yu, Wang Yongxin, Chen Zhen-duo, Xu Xin-shun | Arxiv | With the rapid development of social websites, recent years have witnessed an explosive growth of social images with user-provided tags which continuously arrive in a streaming fashion. Due to the fast query speed and low storage cost, hashing-based methods for image search have attracted increasing attention. However, existing hashing methods for social image retrieval are based on batch mode which violates the nature of social images, i.e., social images are usually generated periodically or collected in a stream fashion. Although there exist many online image hashing methods, they either adopt unsupervised learning, which ignores the relevant tags, or are designed in a supervised manner that needs high-quality labels. In this paper, to overcome the above limitations, we propose a new method named Weakly-supervised Online Hashing (WOH). In order to learn high-quality hash codes, WOH exploits the weak supervision by considering the semantics of tags and removing the noise. Besides, we develop a discrete online optimization algorithm for WOH, which is efficient and scalable. Extensive experiments conducted on two real-world datasets demonstrate the superiority of WOH compared with several state-of-the-art hashing baselines. |
|||||
2020 | Locality-sensitive Hashing In Function Spaces | Shand Will, Becker Stephen | Arxiv | We discuss the problem of performing similarity search over function spaces. To perform search over such spaces in a reasonable amount of time, we use {\it locality-sensitive hashing} (LSH). We present two methods that allow LSH functions on \(\mathbb{R}^N\) to be extended to \(L^p\) spaces: one using function approximation in an orthonormal basis, and another using (quasi-)Monte Carlo-style techniques. We use the presented hashing schemes to construct an LSH family for Wasserstein distance over one-dimensional, continuous probability distributions. |
|||||
2020 | Locality-sensitive Hashing For Efficient Web Application Security Testing | Ben-bassat Ilan, Rokah Erez | In Proceedings of the | Web application security has become a major concern in recent years, as more and more content and services are available online. A useful method for identifying security vulnerabilities is black-box testing, which relies on an automated crawling of web applications. However, crawling Rich Internet Applications (RIAs) is a very challenging task. One of the key obstacles crawlers face is the state similarity problem: how to determine if two client-side states are equivalent. As current methods do not completely solve this problem, a successful scan of many real-world RIAs is still not possible. We present a novel approach to detect redundant content for security testing purposes. The algorithm applies locality-sensitive hashing using MinHash sketches in order to analyze the Document Object Model (DOM) structure of web pages, and to efficiently estimate similarity between them. Our experimental results show that this approach allows a successful scan of RIAs that cannot be crawled otherwise. |
|||||
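The core primitive the entry above applies to DOM structures is the MinHash sketch, which estimates Jaccard similarity from per-hash minima. The sketch below is a minimal, generic version; the tokenization of a real DOM and the number of hash functions are illustrative assumptions.

```python
# MinHash signatures and a Jaccard estimate from the fraction of matching minima.
import hashlib

def minhash_signature(tokens, num_hashes=64):
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode(), digest_size=8,
                                salt=seed.to_bytes(2, "little")).digest(), "little")
            for t in tokens))
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical token sets standing in for features extracted from two DOM trees.
page_a = {"html", "body", "div.menu", "div.content", "a.login", "form.search"}
page_b = {"html", "body", "div.menu", "div.content", "a.logout", "form.search"}
print(estimated_jaccard(minhash_signature(page_a), minhash_signature(page_b)))
```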
2020 | Multi-simlex A Large-scale Evaluation Of Multilingual And Cross-lingual Lexical Semantic Similarity | Vulić Ivan, Baker Simon, Ponti Edoardo Maria, Petti Ulla, Leviant Ira, Wing Kelly, Majewska Olga, Bar Eden, Malone Matt, Poibeau Thierry, Reichart Roi, Korhonen Anna | Arxiv | We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering datasets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets. Due to its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and cross-lingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and cross-lingual representation models, including static and contextualized word embeddings (such as fastText, M-BERT and XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised cross-lingual word embeddings. We also present a step-by-step dataset creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions – the public release of Multi-SimLex datasets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning – available via a website which will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages. |
|||||
2020 | Targeted Attack For Deep Hashing Based Retrieval | Bai Jiawang, Chen Bin, Li Yiming, Wu Dongxian, Guo Weiwei, Xia Shu-tao, Yang En-hui | Arxiv | The deep hashing based retrieval method is widely adopted in large-scale image and video retrieval. However, there is little investigation on its security. In this paper, we propose a novel method, dubbed deep hashing targeted attack (DHTA), to study the targeted attack on such retrieval. Specifically, we first formulate the targeted attack as a point-to-set optimization, which minimizes the average distance between the hash code of an adversarial example and those of a set of objects with the target label. Then we design a novel component-voting scheme to obtain an anchor code as the representative of the set of hash codes of objects with the target label, whose optimality guarantee is also theoretically derived. To balance the performance and perceptibility, we propose to minimize the Hamming distance between the hash code of the adversarial example and the anchor code under the \(\ell^\infty\) restriction on the perturbation. Extensive experiments verify that DHTA is effective in attacking both deep hashing based image retrieval and video retrieval. |
|||||
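The component-voting step described above can be sketched directly: each bit of the anchor code is the majority vote of that bit over the hash codes of objects carrying the target label. The codes below are random placeholders rather than outputs of a real deep hashing model.

```python
# Component voting: majority vote per bit over codes in {-1, +1}.
import numpy as np

rng = np.random.default_rng(1)
target_codes = rng.choice([-1, 1], size=(20, 48))      # 20 codes of 48 bits

anchor = np.sign(target_codes.sum(axis=0))             # per-bit majority
anchor[anchor == 0] = 1                                 # break ties arbitrarily
print(anchor[:16])
```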
2020 | Distilling Knowledge By Mimicking Features | Wang Guo-hua, Ge Yifan, Wu Jianxin | Arxiv | Knowledge distillation (KD) is a popular method to train efficient networks (“student”) with the help of high-capacity networks (“teacher”). Traditional methods use the teacher’s soft logits as extra supervision to train the student network. In this paper, we argue that it is more advantageous to make the student mimic the teacher’s features in the penultimate layer. Not only can the student directly learn more effective information from the teacher’s features, but feature mimicking can also be applied to teachers trained without a softmax layer. Experiments show that it can achieve higher accuracy than traditional KD. To further facilitate feature mimicking, we decompose a feature vector into the magnitude and the direction. We argue that the teacher should give more freedom to the student feature’s magnitude, and let the student pay more attention to mimicking the feature direction. To meet this requirement, we propose a loss term based on locality-sensitive hashing (LSH). With the help of this new loss, our method indeed mimics feature directions more accurately, relaxes constraints on feature magnitudes, and achieves state-of-the-art distillation accuracy. We provide theoretical analyses of how LSH facilitates feature direction mimicking, and further extend feature mimicking to multi-label recognition and object detection. |
|||||
2020 | Minimizing Flops To Learn Efficient Sparse Representations | Paria Biswajit, Yeh Chih-kuan, Yen Ian E. H., Xu Ning, Ravikumar Pradeep, Póczos Barnabás | Arxiv | Deep representation learning has become one of the most widely adopted approaches for visual search, recommendation, and identification. Retrieval of such representations from a large database is however computationally challenging. Approximate methods based on learning compact representations, have been widely explored for this problem, such as locality sensitive hashing, product quantization, and PCA. In this work, in contrast to learning compact representations, we propose to learn high dimensional and sparse representations that have similar representational capacity as dense embeddings while being more efficient due to sparse matrix multiplication operations which can be much faster than dense multiplication. Following the key insight that the number of operations decreases quadratically with the sparsity of embeddings provided the non-zero entries are distributed uniformly across dimensions, we propose a novel approach to learn such distributed sparse embeddings via the use of a carefully constructed regularization function that directly minimizes a continuous relaxation of the number of floating-point operations (FLOPs) incurred during retrieval. Our experiments show that our approach is competitive to the other baselines and yields a similar or better speed-vs-accuracy tradeoff on practical datasets. |
|||||
2020 | Towards Evaluating Gaussian Blurring In Perceptual Hashing As A Facial Image Filter | Alparslan Yigit, Alparslan Ken, Kshettry Mannika, Kratz Louis | Arxiv | With the growth in social media, there is a huge amount of images of faces available on the internet. Often, people use other people’s pictures on their own profile. Perceptual hashing is often used to detect whether two images are identical. Therefore, it can be used to detect whether people are misusing others’ pictures. In perceptual hashing, a hash is calculated for a given image, and a new test image is mapped to one of the existing hashes if duplicate features are present. Therefore, it can be used as an image filter to flag banned image content or adversarial attacks (modifications made on purpose to deceive the filter), even when the content has been changed to evade the filter. For this reason, it is critical for perceptual hashing to be robust enough to take transformations such as resizing, cropping, and slight pixel modifications into account. In this paper, we propose to experiment with the effect of Gaussian blurring in perceptual hashing for detecting misuse of personal images, specifically face images. We hypothesize that applying Gaussian blurring to an image before calculating its hash will increase the accuracy of our filter in detecting adversarial attacks that consist of image cropping, added text annotations, and image rotation. |
|||||
2020 | Practical Hash-based Anonymity For MAC Addresses | Ali Junade, Dyo Vladimir | Arxiv | Given that a MAC address can uniquely identify a person or a vehicle, continuous tracking over a large geographical scale has raised serious privacy concerns amongst governments and the general public. Prior work has demonstrated that simple hash-based approaches to anonymization can be easily inverted due to the small search space of MAC addresses. In particular, it is possible to represent the entire allocated MAC address space in 39 bits and that frequency-based attacks allow for 50% of MAC addresses to be enumerated in 31 bits. We present a practical approach to MAC address anonymization using both computationally expensive hash functions and truncating the resulting hashes to allow for k-anonymity. We provide an expression for computing the percentage of expected collisions, demonstrating that for digests of 24 bits it is possible to store up to 168,617 MAC addresses with the rate of collisions less than 1%. We experimentally demonstrate that a rate of collision of 1% or less can be achieved by storing data sets of 100 MAC addresses in 13 bits, 1,000 MAC addresses in 17 bits and 10,000 MAC addresses in 20 bits. |
|||||
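A minimal sketch of the truncate-an-expensive-hash idea summarized above: a MAC address is passed through a slow key-derivation function and only the leading bits of the digest are kept, so that many addresses share each truncated value. The KDF choice, iteration count, digest length, and salt are illustrative assumptions, not the paper's settings.

```python
# Anonymize a MAC address by truncating an expensive hash to `digest_bits` bits.
import hashlib

def anonymize_mac(mac: str, salt: bytes, digest_bits: int = 24) -> int:
    # PBKDF2 stands in for "a computationally expensive hash function" here.
    digest = hashlib.pbkdf2_hmac("sha256", mac.encode(), salt, iterations=100_000)
    # Keep only the top digest_bits bits, so distinct MACs can collide (k-anonymity).
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - digest_bits)

salt = b"per-deployment-salt"   # hypothetical deployment-specific salt
print(hex(anonymize_mac("00:1A:2B:3C:4D:5E", salt)))
```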
2020 | Cross Hashing Anonymizing Encounters In Decentralised Contact Tracing Protocols | Ali Junade, Dyo Vladimir | Arxiv | During the COVID-19 (SARS-CoV-2) epidemic, Contact Tracing emerged as an essential tool for managing the epidemic. App-based solutions have emerged for Contact Tracing, including a protocol designed by Apple and Google (influenced by an open-source protocol known as DP3T). This protocol contains two well-documented de-anonymisation attacks. Firstly that when someone is marked as having tested positive and their keys are made public, they can be tracked over a large geographic area for 24 hours at a time. Secondly, whilst the app requires a minimum exposure duration to register a contact, there is no cryptographic guarantee for this property. This means an adversary can scan Bluetooth networks and retrospectively find who is infected. We propose a novel “cross hashing” approach to cryptographically guarantee minimum exposure durations. We further mitigate the 24-hour data exposure of infected individuals and reduce computational time for identifying if a user has been exposed using \(k\)-Anonymous buckets of hashes and Private Set Intersection. We empirically demonstrate that this modified protocol can offer like-for-like efficacy to the existing protocol. |
|||||
2020 | A Quantum Algorithm To Locate Unknown Hashgrams | Allgood Nicholas R., Nicholas Charles K. | Arxiv | Quantum computing has evolved quickly in recent years and is showing significant benefits in a variety of fields, especially in the realm of cybersecurity. The combination of software used to locate the most frequent hashes and \(n\)-grams that identify malicious software could greatly benefit from a quantum algorithm. By loading the table of hashes and \(n\)-grams into a quantum computer we can speed up the process of mapping \(n\)-grams to their hashes. The first phase will be to use KiloGram to find the top-\(k\) hashes and \(n\)-grams for a large malware corpus. From here, the resulting hash table is then loaded into a quantum simulator. A quantum search algorithm is then used to search among every permutation of the entangled key and value pairs to find the desired hash value. This prevents one from having to re-compute hashes for a set of \(n\)-grams, which can take on average \(O(MN)\) time, whereas the quantum algorithm could take \(O(\sqrt{N})\) in the number of table lookups to find the desired hash values. |
|||||
2020 | On The Problem Of p_1^-1 In Locality-sensitive Hashing | Ahle Thomas Dybdahl | Arxiv | A Locality-Sensitive Hash (LSH) function is called \((r,cr,p_1,p_2)\)-sensitive, if two data-points with a distance less than \(r\) collide with probability at least \(p_1\) while data points with a distance greater than \(cr\) collide with probability at most \(p_2\). These functions form the basis of the successful Indyk-Motwani algorithm (STOC 1998) for nearest neighbour problems. In particular one may build a \(c\)-approximate nearest neighbour data structure with query time \(\tilde O(n^\rho/p_1)\) where \(\rho=\frac{\log 1/p_1}{\log 1/p_2}\in(0,1)\). That is, sub-linear time, as long as \(p_1\) is not too small. This is significant since most high-dimensional nearest neighbour problems suffer from the curse of dimensionality and cannot be solved exactly any faster than a brute-force linear-time scan of the database. Unfortunately, the best LSH functions tend to have very low collision probabilities \(p_1\) and \(p_2\), including the best functions for cosine and Jaccard similarity. This means that the \(n^\rho/p_1\) query time of LSH is often not sub-linear after all, even for approximate nearest neighbours! In this paper, we improve the general Indyk-Motwani algorithm to reduce the query time of LSH to \(\tilde O(n^\rho/p_1^{1-\rho})\) (and the space usage correspondingly.) Since \(n^\rho p_1^{\rho-1} < n \Leftrightarrow p_1 > n^{-1}\), our algorithm always obtains sublinear query time, for any collision probabilities at least \(1/n\). For \(p_1\) and \(p_2\) small enough, our improvement over all previous methods can be up to a factor \(n\) in both query time and space. The improvement comes from a simple change to the Indyk-Motwani algorithm, which can easily be implemented in existing software packages. |
|||||
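A small worked comparison of the two query-time bounds discussed above, \(\tilde O(n^\rho/p_1)\) versus the improved \(\tilde O(n^\rho/p_1^{1-\rho})\); the values of \(n\), \(p_1\), and \(p_2\) are made up purely for illustration.

```python
# Compare the classical and improved LSH query-time bounds for one made-up setting.
from math import log

n, p1, p2 = 10**9, 1e-4, 1e-6
rho = log(1 / p1) / log(1 / p2)          # rho = log(1/p1) / log(1/p2)

classical = n**rho / p1                  # classical bound: often super-linear
improved = n**rho / p1**(1 - rho)        # improved bound: sublinear whenever p1 > 1/n
print(f"rho = {rho:.3f}")
print(f"classical ~ {classical:.2e}, improved ~ {improved:.2e}, n = {n:.1e}")
```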
2020 | The Power Of Hashing With Mersenne Primes | Ahle Thomas Dybdahl, Knudsen Jakob Tejs Bæk, Thorup Mikkel | Arxiv | The classic way of computing a \(k\)-universal hash function is to use a random degree-\((k-1)\) polynomial over a prime field \(\mathbb Z_p\). For a fast computation of the polynomial, the prime \(p\) is often chosen as a Mersenne prime \(p=2^b-1\). In this paper, we show that there are other nice advantages to using Mersenne primes. Our view is that the hash function’s output is a \(b\)-bit integer that is uniformly distributed in \(\{0, \dots, 2^b-1\}\), except that \(p\) (the all-ones value in binary) is missing. Uniform bit strings have many nice properties, such as splitting into substrings which gives us two or more hash functions for the cost of one, while preserving strong theoretical qualities. We call this trick “Two for one” hashing, and we demonstrate it on 4-universal hashing in the classic Count Sketch algorithm for second-moment estimation. We also provide a new fast branch-free code for division and modulus with Mersenne primes. In contrast to our analytic work, this code generalizes to any pseudo-Mersenne prime \(p=2^b-c\) for small \(c\). |
|||||
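The fast reduction highlighted above exploits that \(2^b \equiv 1 \pmod{p}\) for a Mersenne prime \(p=2^b-1\), so any value below \(p^2\) can be reduced with a shift, a mask, and at most one subtraction. The sketch below applies this inside a degree-3 polynomial hash; the prime, degree, and evaluation point are illustrative.

```python
# Reduction modulo a Mersenne prime, used inside a polynomial hash over Z_p.
import random

b = 61
p = (1 << b) - 1              # 2^61 - 1 is a Mersenne prime

def mersenne_mod(y: int) -> int:
    # For 0 <= y < p^2: y mod p equals ((y >> b) + (y & p)) mod p, and the folded
    # value is below 2p, so at most one subtraction finishes the reduction.
    y = (y & p) + (y >> b)
    return y - p if y >= p else y

def poly_hash(x: int, coeffs) -> int:
    # Degree-(k-1) polynomial over Z_p evaluated by Horner's rule; assumes x < p.
    h = 0
    for a in reversed(coeffs):
        h = mersenne_mod(h * x + a)
    return h

coeffs = [random.randrange(p) for _ in range(4)]   # random seed of a 4-universal hash
print(poly_hash(123456789, coeffs))
```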
2020 | No Repetition Fast Streaming With Highly Concentrated Hashing | Aamand Anders, Das Debarati, Kipouridis Evangelos, Knudsen Jakob B. T., Rasmussen Peter M. R., Thorup Mikkel | Arxiv | To get estimators that work within a certain error bound with high probability, a common strategy is to design one that works with constant probability, and then boost the probability using independent repetitions. Important examples of this approach are small space algorithms for estimating the number of distinct elements in a stream, or estimating the set similarity between large sets. Using standard strongly universal hashing to process each element, we get a sketch-based estimator where the probability of a too large error is, say, 1/4. By performing \(r\) independent repetitions and taking the median of the estimators, the error probability falls exponentially in \(r\). However, running \(r\) independent experiments increases the processing time by a factor \(r\). Here we make the point that if we have a hash function with strong concentration bounds, then we get the same high probability bounds without any need for repetitions. Instead of \(r\) independent sketches, we have a single sketch that is \(r\) times bigger, so the total space is the same. However, we only apply a single hash function, so we save a factor \(r\) in time, and the overall algorithms just get simpler. Fast practical hash functions with strong concentration bounds were recently proposed by Aamand et al. (to appear in STOC 2020). Using their hashing schemes, the algorithms thus become very fast and practical, suitable for online processing of high volume data streams. |
|||||
2020 | HM-ANN Efficient Billion-point Nearest Neighbor Search On Heterogeneous Memory | Jie Ren, Minjia Zhang, Dong Li | Neural Information Processing Systems | The state-of-the-art approximate nearest neighbor search (ANNS) algorithms face a fundamental tradeoff between query latency and accuracy, because of small main memory capacity: To store indices in main memory for short query latency, the ANNS algorithms have to limit dataset size or use a quantization scheme which hurts search accuracy. The emergence of heterogeneous memory (HM) brings a solution to significantly increase memory capacity and break the above tradeoff: Using HM, billions of data points can be placed in the main memory on a single machine without using any data compression. However, HM consists of both fast (but small) memory and slow (but large) memory, and using HM inappropriately slows down query significantly. In this work, we present a novel graph-based similarity search algorithm called HM-ANN, which takes both memory and data heterogeneity into consideration and enables billion-scale similarity search on a single node without using compression. On two billion-sized datasets BIGANN and DEEP1B, HM-ANN outperforms state-of-the-art compression-based solutions such as L&C and IMI+OPQ in recall-vs-latency by a large margin, obtaining 46% higher recall under the same search latency. We also extend existing graph-based methods such as HNSW and NSG with two strong baseline implementations on HM. At billion-point scale, HM-ANN is 2X and 5.8X faster than our HNSW and NSG baselines respectively to reach the same accuracy. |
|||||
2020 | Bitpruning Learning Bitlengths For Aggressive And Accurate Quantization | Nikolić Miloš, Hacene Ghouthi Boukli, Bannon Ciaran, Lascorz Alberto Delmas, Courbariaux Matthieu, Bengio Yoshua, Gripon Vincent, Moshovos Andreas | Arxiv | Neural networks have demonstrably achieved state-of-the art accuracy using low-bitlength integer quantization, yielding both execution time and energy benefits on existing hardware designs that support short bitlengths. However, the question of finding the minimum bitlength for a desired accuracy remains open. We introduce a training method for minimizing inference bitlength at any granularity while maintaining accuracy. Namely, we propose a regularizer that penalizes large bitlength representations throughout the architecture and show how it can be modified to minimize other quantifiable criteria, such as number of operations or memory footprint. We demonstrate that our method learns thrifty representations while maintaining accuracy. With ImageNet, the method produces an average per layer bitlength of 4.13, 3.76 and 4.36 bits on AlexNet, ResNet18 and MobileNet V2 respectively, remaining within 2.0%, 0.5% and 0.5% of the base TOP-1 accuracy. |
|||||
2020 | A Novel Incremental Cross-modal Hashing Approach | Mandal Devraj, Biswas Soma | Arxiv | Cross-modal retrieval deals with retrieving relevant items from one modality, when provided with a search query from another modality. Hashing techniques, where the data is represented as binary bits have specifically gained importance due to the ease of storage, fast computations and high accuracy. In real world, the number of data categories is continuously increasing, which requires algorithms capable of handling this dynamic scenario. In this work, we propose a novel incremental cross-modal hashing algorithm termed “iCMH”, which can adapt itself to handle incoming data of new categories. The proposed approach consists of two sequential stages, namely, learning the hash codes and training the hash functions. At every stage, a small amount of old category data termed “exemplars” is used so as not to forget the old data while trying to learn for the new incoming data, i.e. to avoid catastrophic forgetting. In the first stage, the hash codes for the exemplars are used, and simultaneously, hash codes for the new data are computed such that they maintain the semantic relations with the existing data. For the second stage, we propose both non-deep and deep architectures to learn the hash functions effectively. Extensive experiments across a variety of cross-modal datasets and comparisons with state-of-the-art cross-modal algorithms show the usefulness of our approach. |
|||||
2020 | Dual-level Semantic Transfer Deep Hashing For Efficient Social Image Retrieval | Zhu Lei, Cui Hui, Cheng Zhiyong, Li Jingjing, Zhang Zheng | Arxiv | Social networks store and disseminate a tremendous amount of user-shared images. Deep hashing is an efficient indexing technique to support large-scale social image retrieval, due to its deep representation capability, fast retrieval speed and low storage cost. In particular, unsupervised deep hashing scales well as it does not require any manually labelled data for training. However, owing to the lack of label guidance, existing methods suffer from a severe semantic shortage when optimizing a large number of deep neural network parameters. In contrast, in this paper, we propose a Dual-level Semantic Transfer Deep Hashing (DSTDH) method to alleviate this problem with a unified deep hash learning framework. Our model targets learning semantically enhanced deep hash codes by specially exploiting the user-generated tags associated with the social images. Specifically, we design a complementary dual-level semantic transfer mechanism to efficiently discover the potential semantics of tags and seamlessly transfer them into binary hash codes. On the one hand, instance-level semantics are directly preserved into hash codes from the associated tags with adverse noise removed. Besides, an image-concept hypergraph is constructed for indirectly transferring the latent high-order semantic correlations of images and tags into hash codes. Moreover, the hash codes are obtained simultaneously with the deep representation learning by the discrete hash optimization strategy. Extensive experiments on two public social image retrieval datasets validate the superior performance of our method compared with state-of-the-art hashing methods. The source codes of our method can be obtained at https://github.com/research2020-1/DSTDH |
|||||
2020 | A Survey On Deep Hashing Methods | Luo Xiao, Wang Haixin, Wu Daqing, Chen Chong, Deng Minghua, Huang Jianqiang, Hua Xian-sheng | Arxiv | Nearest neighbor search aims to obtain the samples in the database with the smallest distances to the queries, which is a basic task in a range of fields, including computer vision and data mining. Hashing is one of the most widely used methods for its computational and storage efficiency. With the development of deep learning, deep hashing methods show more advantages than traditional methods. In this survey, we investigate current deep hashing algorithms in detail, including deep supervised hashing and deep unsupervised hashing. Specifically, we categorize deep supervised hashing methods into pairwise methods, ranking-based methods, pointwise methods, as well as quantization-based methods, according to how the similarities of the learned hash codes are measured. Moreover, deep unsupervised hashing is categorized into similarity reconstruction-based methods, pseudo-label-based methods and prediction-free self-supervised learning-based methods based on their semantic learning manners. We also introduce three related important topics including semi-supervised deep hashing, domain adaptation deep hashing and multi-modal deep hashing. Meanwhile, we present some commonly used public datasets and the schemes used to measure the performance of deep hashing algorithms. Finally, we conclude by discussing some potential research directions. |
|||||
2020 | CIMON Towards High-quality Hash Codes | Luo Xiao, Wu Daqing, Ma Zeyu, Chen Chong, Deng Minghua, Ma Jinwen, Jin Zhongming, Huang Jianqiang, Hua Xian-sheng | Arxiv | Recently, hashing is widely used in approximate nearest neighbor search for its storage and computational efficiency. Most of the unsupervised hashing methods learn to map images into semantic similarity-preserving hash codes by constructing local semantic similarity structure from the pre-trained model as the guiding information, i.e., treating each point pair similar if their distance is small in feature space. However, due to the inefficient representation ability of the pre-trained model, many false positives and negatives in local semantic similarity will be introduced and lead to error propagation during the hash code learning. Moreover, few of the methods consider the robustness of models, which will cause instability of hash codes to disturbance. In this paper, we propose a new method named {\textbf{C}}omprehensive s{\textbf{I}}milarity {\textbf{M}}ining and c{\textbf{O}}nsistency lear{\textbf{N}}ing (CIMON). First, we use global refinement and similarity statistical distribution to obtain reliable and smooth guidance. Second, both semantic and contrastive consistency learning are introduced to derive both disturb-invariant and discriminative hash codes. Extensive experiments on several benchmark datasets show that the proposed method outperforms a wide range of state-of-the-art methods in both retrieval performance and robustness. |
|||||
2020 | Secure Single-server Nearly-identical Image Deduplication | Takeshita Jonathan, Karl Ryan, Jung Taeho | Arxiv | Cloud computing is often utilized for file storage. Clients of cloud storage services want to ensure the privacy of their data, and both clients and servers want to use as little storage as possible. Cross-user deduplication is one method to reduce the amount of storage a server uses. Deduplication and privacy are naturally conflicting goals, especially for nearly-identical (“fuzzy”) deduplication, as some information about the data must be used to perform deduplication. Prior solutions thus utilize multiple servers, or only function for exact deduplication. In this paper, we present a single-server protocol for cross-user nearly-identical deduplication based on secure locality-sensitive hashing (SLSH). We formally define our ideal security, and rigorously prove our protocol secure against fully malicious, colluding adversaries with a proof by simulation. We show experimentally that the individual parts of the protocol are computationally feasible, and further discuss practical issues of security and efficiency. |
|||||
2020 | Reinforcing Short-length Hashing | Liu Xingbo, Nie Xiushan, Dai Qi, Huang Yupan, Yin Yilong | Arxiv | Due to the compelling efficiency in retrieval and storage, similarity-preserving hashing has been widely applied to approximate nearest neighbor search in large-scale image retrieval. However, existing methods perform poorly when retrieving with extremely short hash codes, due to weak classification ability and a poor distribution of the hash bits. To address this issue, in this study, we propose a novel reinforcing short-length hashing (RSLH). In this proposed RSLH, mutual reconstruction between the hash representation and semantic labels is performed to preserve the semantic information. Furthermore, to enhance the accuracy of the hash representation, a pairwise similarity matrix is designed to balance accuracy against the memory cost of training. In addition, a parameter boosting strategy is integrated to reinforce precision through hash bit fusion. Extensive experiments on three large-scale image benchmarks demonstrate the superior performance of RSLH under various short-length hashing scenarios. |
|||||
2020 | Shuffle And Learn Minimizing Mutual Information For Unsupervised Hashing | Liu Fangrui, Liu Zheng | Arxiv | Unsupervised binary representation allows fast data retrieval without any annotations, enabling practical application like fast person re-identification and multimedia retrieval. It is argued that conflicts in binary space are one of the major barriers to high-performance unsupervised hashing, as current methods fail to capture the precise code conflicts in the full domain. A novel relaxation method called Shuffle and Learn is proposed to tackle code conflicts in the unsupervised hash. Approximated derivatives for joint probability and the gradients for the binary layer are introduced to bridge the update from the hash to the input. A proof of \(\epsilon\)-convergence of the joint probability with approximated derivatives is provided to guarantee the preciseness of the updates applied to the mutual information. The proposed algorithm is carried out with iterative global updates to minimize mutual information, diverging the code before regular unsupervised optimization. Experiments suggest that the proposed method can relax the code optimization from local optimum and help to generate binary representations that are more discriminative and informative without any annotations. Performance benchmarks on image retrieval with the unsupervised binary code are conducted on three open datasets, and the model achieves state-of-the-art accuracy on the image retrieval task for all those datasets. Datasets and reproducible code are provided. |
|||||
2020 | Robust Homomorphic Video Hashing | Singh Priyanka | Arxiv | The Internet has been weaponized to carry out cybercriminal activities at an unprecedented pace. The rising concern for preserving the privacy of personal data while using modern tools and technologies is alarming. End-to-end encrypted solutions are in demand for almost all commercial platforms. On one side, it seems imperative to provide such solutions and give people trust to reliably use these platforms. On the other side, this creates a huge opportunity to carry out unchecked cybercrimes. This paper proposes a robust video hashing technique that is scalable and efficient at finding matches within the enormous bulk of videos circulating on these commercial platforms. The video hash is validated to be robust to common manipulations like scaling, corruptions by noise, compression, and contrast changes that are most probable to happen during transmission. It can also be transformed into the encrypted domain and work on top of encrypted videos without deciphering. Thus, it can serve as a potential forensic tool that can trace the illegal sharing of videos without knowing the underlying content. Hence, it can help preserve privacy and combat cybercrimes such as revenge porn, hateful content, child abuse, or illegal material propagated in a video. |
|||||
2020 | Random VLAD Based Deep Hashing For Efficient Image Retrieval | Weng Li, Ye Lingzhi, Tian Jiangmin, Cao Jiuwen, Wang Jianzhong | Arxiv | Image hash algorithms generate compact binary representations that can be quickly matched by Hamming distance, thus becoming an efficient solution for large-scale image retrieval. This paper proposes RV-SSDH, a deep image hash algorithm that incorporates the classical VLAD (vector of locally aggregated descriptors) architecture into neural networks. Specifically, a novel neural network component is formed by coupling a random VLAD layer with a latent hash layer through a transform layer. This component can be combined with convolutional layers to realize a hash algorithm. We implement RV-SSDH as a point-wise algorithm that can be efficiently trained by minimizing classification error and quantization loss. Comprehensive experiments show this new architecture significantly outperforms baselines such as NetVLAD and SSDH, and offers a cost-effective trade-off in the state-of-the-art. In addition, the proposed random VLAD layer leads to satisfactory accuracy with low complexity, thus showing promising potential as an alternative to NetVLAD. |
|||||
2020 | Approximate Nearest Neighbor Negative Contrastive Learning For Dense Text Retrieval | Xiong Lee, Xiong Chenyan, Li Ye, Tang Kwok-fung, Liu Jialin, Bennett Paul, Ahmed Junaid, Overwijk Arnold | Arxiv | Conducting text retrieval in a dense learned representation space has many intriguing advantages over sparse retrieval. Yet the effectiveness of dense retrieval (DR) often requires combination with sparse retrieval. In this paper, we identify that the main bottleneck is in the training mechanisms, where the negative instances used in training are not representative of the irrelevant documents in testing. This paper presents Approximate nearest neighbor Negative Contrastive Estimation (ANCE), a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is parallelly updated with the learning process to select more realistic negative training instances. This fundamentally resolves the discrepancy between the data distribution used in the training and testing of DR. In our experiments, ANCE boosts the BERT-Siamese DR model to outperform all competitive dense and sparse retrieval baselines. It nearly matches the accuracy of sparse-retrieval-and-BERT-reranking using dot-product in the ANCE-learned representation space and provides almost 100x speed-up. |
|||||
2020 | Self-supervised Bernoulli Autoencoders For Semi-supervised Hashing | Ñanculef Ricardo, Mena Francisco, Macaluso Antonio, Lodi Stefano, Sartori Claudio | Arxiv | Semantic hashing is an emerging technique for large-scale similarity search based on representing high-dimensional data using similarity-preserving binary codes used for efficient indexing and search. It has recently been shown that variational autoencoders, with Bernoulli latent representations parametrized by neural nets, can be successfully trained to learn such codes in supervised and unsupervised scenarios, improving on more traditional methods thanks to their ability to handle the binary constraints architecturally. However, the scenario where labels are scarce has not been studied yet. This paper investigates the robustness of hashing methods based on variational autoencoders to the lack of supervision, focusing on two semi-supervised approaches currently in use. The first augments the variational autoencoder’s training objective to jointly model the distribution over the data and the class labels. The second approach exploits the annotations to define an additional pairwise loss that enforces consistency between the similarity in the code (Hamming) space and the similarity in the label space. Our experiments show that both methods can significantly increase the hash codes’ quality. The pairwise approach can exhibit an advantage when the number of labelled points is large. However, we found that this method degrades quickly and loses its advantage when labelled samples decrease. To circumvent this problem, we propose a novel supervision method in which the model uses its label distribution predictions to implement the pairwise objective. Compared to the best baseline, this procedure yields similar performance in fully supervised settings but improves the results significantly when labelled data is scarce. Our code is made publicly available at https://github.com/amacaluso/SSB-VAE. |
|||||
2020 | Fast Class-wise Updating For Online Hashing | Lin Mingbao, Ji Rongrong, Sun Xiaoshuai, Zhang Baochang, Huang Feiyue, Tian Yonghong, Tao Dacheng | Arxiv | Online image hashing has received increasing research attention recently, which processes large-scale data in a streaming fashion to update the hash functions on-the-fly. To this end, most existing works exploit this problem under a supervised setting, i.e., using class labels to boost the hashing performance, which suffers from the defects in both adaptivity and efficiency: First, large amounts of training batches are required to learn up-to-date hash functions, which leads to poor online adaptivity. Second, the training is time-consuming, which contradicts with the core need of online learning. In this paper, a novel supervised online hashing scheme, termed Fast Class-wise Updating for Online Hashing (FCOH), is proposed to address the above two challenges by introducing a novel and efficient inner product operation. To achieve fast online adaptivity, a class-wise updating method is developed to decompose the binary code learning and alternatively renew the hash functions in a class-wise fashion, which well addresses the burden on large amounts of training batches. Quantitatively, such a decomposition further leads to at least 75% storage saving. To further achieve online efficiency, we propose a semi-relaxation optimization, which accelerates the online training by treating different binary constraints independently. Without additional constraints and variables, the time complexity is significantly reduced. Such a scheme is also quantitatively shown to well preserve past information during updating hashing functions. We have quantitatively demonstrated that the collective effort of class-wise updating and semi-relaxation optimization provides a superior performance comparing to various state-of-the-art methods, which is verified through extensive experiments on three widely-used datasets. |
|||||
2020 | Embedding Compression With Isotropic Iterative Quantization | Liao Siyu, Chen Jie, Wang Yanzhi, Qiu Qinru, Yuan Bo | Arxiv | Continuous representation of words is a standard component in deep learning-based NLP models. However, representing a large vocabulary requires significant memory, which can cause problems, particularly on resource-constrained platforms. Therefore, in this paper we propose an isotropic iterative quantization (IIQ) approach for compressing embedding vectors into binary ones, leveraging the iterative quantization technique well established for image retrieval, while satisfying the desired isotropic property of PMI based models. Experiments with pre-trained embeddings (i.e., GloVe and HDC) demonstrate a more than thirty-fold compression ratio with comparable and sometimes even improved performance over the original real-valued embedding vectors. |
|||||
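The entry above builds on the classical iterative quantization (ITQ) technique from image retrieval. As a hedged illustration of that underlying loop only (the isotropy-specific parts of IIQ are not reproduced), the sketch below alternates between binarizing the rotated data and re-solving the rotation as an orthogonal Procrustes problem; the data and iteration count are illustrative.

```python
# Classical ITQ loop: minimize ||B - X R||_F over binary B and rotation R.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))
X -= X.mean(axis=0)                              # zero-center, as ITQ assumes

R = np.linalg.qr(rng.normal(size=(32, 32)))[0]   # random initial rotation
for _ in range(50):
    B = np.sign(X @ R)                           # fix R, update binary codes
    U, _, Wt = np.linalg.svd(X.T @ B)            # fix B, Procrustes solution for R
    R = U @ Wt

codes = (np.sign(X @ R) > 0).astype(np.uint8)    # final 32-bit binary codes
print(codes.shape, codes[:2])
```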
2020 | A Practical Blockchain Framework Using Image Hashing For Image Authentication | White Cameron, Paul Manoranjan, Chakraborty Subrata | Arxiv | Blockchain is a relatively new technology that can be seen as a decentralised database. Blockchain systems heavily rely on cryptographic hash functions to store their data, which makes it difficult to tamper with any data stored in the system. A topic that was researched along with blockchain is image authentication. Image authentication focuses on investigating and maintaining the integrity of images. As a blockchain system can be useful for maintaining data integrity, image authentication has the potential to be enhanced by blockchain. There are many techniques that can be used to authenticate images; the technique investigated by this work is image hashing. Image hashing is a technique used to calculate how similar two different images are. This is done by converting the images into hashes and then comparing them using a distance formula. To investigate the topic, an experiment involving a simulated blockchain was created. The blockchain acted as a database for images. This blockchain was made up of devices which contained their own unique image hashing algorithms. The blockchain was tested by creating modified copies of the images contained in the database, and then submitting them to the blockchain to see if it will return the original image. Through this experiment it was discovered that it is plausible to create an image authentication system using blockchain and image hashing. However, the design proposed by this work requires refinement, as it appears to struggle in some situations. This work shows that blockchain can be a suitable approach for authenticating images, particularly via image hashing. Other observations include that using multiple image hash algorithms at the same time can increase performance in some cases, as well as that each type of test done to the blockchain has its own unique pattern to its data. |
|||||
2020 | Hashing-based Non-maximum Suppression For Crowded Object Detection | Wang Jianfeng, Yin Xi, Wang Lijuan, Zhang Lei | Arxiv | In this paper, we propose an algorithm, named hashing-based non-maximum suppression (HNMS) to efficiently suppress the non-maximum boxes for object detection. Non-maximum suppression (NMS) is an essential component to suppress the boxes at closely located locations with similar shapes. The time cost tends to be huge when the number of boxes becomes large, especially for crowded scenes. The basic idea of HNMS is to firstly map each box to a discrete code (hash cell) and then remove the boxes with lower confidences if they are in the same cell. Considering the intersection-over-union (IoU) as the metric, we propose a simple yet effective hashing algorithm, named IoUHash, which guarantees that the boxes within the same cell are close enough by a lower IoU bound. For two-stage detectors, we replace NMS in region proposal network with HNMS, and observe significant speed-up with comparable accuracy. For one-stage detectors, HNMS is used as a pre-filter to speed up the suppression with a large margin. Extensive experiments are conducted on CARPK, SKU-110K, CrowdHuman datasets to demonstrate the efficiency and effectiveness of HNMS. Code is released at \url{https://github.com/microsoft/hnms.git}. |
|||||
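A much-simplified sketch in the spirit of the entry above: each box is mapped to a discrete cell key and only the highest-confidence box per cell survives. The paper's IoUHash comes with an IoU guarantee inside each cell; the plain position and log-scale size quantization below is an illustrative stand-in, not the published algorithm.

```python
# Hash-style suppression: quantize boxes into cells, keep the best-scoring box per cell.
import math

def cell_key(box, pos_step=16.0, scale_base=1.25):
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = x2 - x1, y2 - y1
    # Quantize center position linearly and box size logarithmically.
    return (int(cx // pos_step), int(cy // pos_step),
            int(math.log(w, scale_base)), int(math.log(h, scale_base)))

def hashed_suppression(boxes, scores):
    best = {}
    for i, box in enumerate(boxes):
        key = cell_key(box)
        if key not in best or scores[i] > scores[best[key]]:
            best[key] = i
    return sorted(best.values())

boxes = [(10, 10, 60, 60), (12, 11, 62, 61), (200, 80, 260, 150)]
scores = [0.9, 0.7, 0.8]
print(hashed_suppression(boxes, scores))   # the near-duplicate of box 0 is suppressed
```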
2020 | Fast Search On Binary Codes By Weighted Hamming Distance | Weng Zhenyu, Zhu Yuesheng, Liu Ruixin | Arxiv | Weighted Hamming distance, as a similarity measure between binary codes and binary queries, provides better accuracy in search tasks than Hamming distance. However, how to efficiently and accurately find \(K\) binary codes that have the smallest weighted Hamming distance to the query remains an open issue. In this paper, a fast search algorithm is proposed to perform the non-exhaustive search for \(K\) nearest binary codes by weighted Hamming distance. By using binary codes as direct bucket indices in a hash table, the search algorithm generates a sequence to probe the buckets based on the independence characteristic of the weights for each bit. Furthermore, a fast search framework based on the proposed search algorithm is designed to solve the problem of long binary codes. Specifically, long binary codes are split into substrings and multiple hash tables are built on them. Then, the search algorithm probes the buckets to obtain candidates according to the generated substring indices, and a merging algorithm is proposed to find the nearest binary codes by merging the candidates. Theoretical analysis and experimental results demonstrate that the search algorithm improves the search accuracy compared to other non-exhaustive algorithms and provides orders-of-magnitude faster search than the linear scan baseline. |
|||||
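The bucket-probing idea described above can be sketched as follows: given per-bit weights, candidate buckets are enumerated in nondecreasing weighted Hamming distance from the query code using a best-first expansion over bit-flip sets. This is a generic multi-probe enumeration over assumed integer bucket codes, not the exact probing or merging algorithm of the paper.

```python
import heapq

def weighted_probe_sequence(query_code, weights, max_probes):
    """Yield (bucket_code, distance) pairs in nondecreasing weighted
    Hamming distance from `query_code` (an integer whose bits index into
    `weights`). Each heap state records which of the weight-sorted bits
    are flipped; 'extend' and 'substitute' moves enumerate every flip set
    exactly once, in weight order.
    """
    order = sorted(range(len(weights)), key=lambda i: weights[i])  # cheapest flips first
    heap = [(0.0, ())]
    emitted = 0
    while heap and emitted < max_probes:
        dist, flips = heapq.heappop(heap)
        code = query_code
        for f in flips:
            code ^= 1 << order[f]
        yield code, dist
        emitted += 1
        last = flips[-1] if flips else -1
        if last + 1 < len(order):
            # extend: additionally flip the next-cheapest unused bit
            heapq.heappush(heap, (dist + weights[order[last + 1]], flips + (last + 1,)))
            if flips:
                # substitute: replace the last flipped bit with the next one
                heapq.heappush(heap, (dist - weights[order[last]] + weights[order[last + 1]],
                                      flips[:-1] + (last + 1,)))

weights = [0.3, 1.2, 0.7, 2.0]            # per-bit weights of a 4-bit code
for code, dist in weighted_probe_sequence(0b1010, weights, max_probes=5):
    print(f"{code:04b}  weighted distance {dist:.1f}")
```
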
2020 | Deep Optimized Multiple Description Image Coding Via Scalar Quantization Learning | Zhao Lijun, Bai Huihui, Wang Anhong, Zhao Yao | Arxiv | In this paper, we introduce a deep multiple description coding (MDC) framework optimized by minimizing multiple description (MD) compressive loss. First, MD multi-scale-dilated encoder network generates multiple description tensors, which are discretized by scalar quantizers, while these quantized tensors are decompressed by MD cascaded-ResBlock decoder networks. To greatly reduce the total amount of artificial neural network parameters, an auto-encoder network composed of these two types of network is designed as a symmetrical parameter sharing structure. Second, this autoencoder network and a pair of scalar quantizers are simultaneously learned in an end-to-end self-supervised way. Third, considering the variation in the image spatial distribution, each scalar quantizer is accompanied by an importance-indicator map to generate MD tensors, rather than using direct quantization. Fourth, we introduce the multiple description structural similarity distance loss, which implicitly regularizes the diversified multiple description generations, to explicitly supervise multiple description diversified decoding in addition to MD reconstruction loss. Finally, we demonstrate that our MDC framework performs better than several state-of-the-art MDC approaches regarding image coding efficiency when tested on several commonly available datasets. |
|||||
2020 | Error-corrected Margin-based Deep Cross-modal Hashing For Facial Image Retrieval | Taherkhani Fariborz, Talreja Veeru, Valenti Matthew C., Nasrabadi Nasser M. | Arxiv | Cross-modal hashing facilitates mapping of heterogeneous multimedia data into a common Hamming space, which can be utilized for fast and flexible retrieval across different modalities. In this paper, we propose a novel cross-modal hashing architecture, deep neural decoder cross-modal hashing (DNDCMH), which uses a binary vector specifying the presence of certain facial attributes as an input query to retrieve relevant face images from a database. The DNDCMH network consists of two separate components: an attribute-based deep cross-modal hashing (ADCMH) module, which uses a margin (m)-based loss function to efficiently learn compact binary codes to preserve similarity between modalities in the Hamming space, and a neural error correcting decoder (NECD), which is an error correcting decoder implemented with a neural network. The goal of the NECD network in DNDCMH is to error correct the hash codes generated by ADCMH to improve the retrieval efficiency. The NECD network is trained such that it has an error correcting capability greater than or equal to the margin (m) of the margin-based loss function, so that NECD can correct the corrupted hash codes generated by ADCMH up to a Hamming distance of m. We have evaluated and compared DNDCMH with state-of-the-art cross-modal hashing methods on standard datasets to demonstrate the superiority of our method. |
|||||
2020 | Topology-aware Hashing For Effective Control Flow Graph Similarity Analysis | Li Yuping, Jang Jiong, Ou Xinming | International Conference on Security and Privacy in Communication Systems | Control Flow Graph (CFG) similarity analysis is an essential technique for a variety of security analysis tasks, including malware detection and malware clustering. Even though various algorithms have been developed, existing CFG similarity analysis methods still suffer from limited efficiency, accuracy, and usability. In this paper, we propose a novel fuzzy hashing scheme called topology-aware hashing (TAH) for effective and efficient CFG similarity analysis. Given the CFGs constructed from program binaries, we extract blended n-gram graphical features of the CFGs, encode the graphical features into numeric vectors (called graph signatures), and then measure the graph similarity by comparing the graph signatures. We further employ a fuzzy hashing technique to convert the numeric graph signatures into smaller fixed-size fuzzy hash signatures for efficient similarity calculation. Our comprehensive evaluation demonstrates that TAH is more effective and efficient compared to existing CFG comparison techniques. To demonstrate the applicability of TAH to real-world security analysis tasks, we develop a binary similarity analysis tool based on TAH, and show that it outperforms existing similarity analysis tools while conducting malware clustering. |
|||||
2020 | Task-adaptive Asymmetric Deep Cross-modal Hashing | Li Fengling, Wang Tong, Zhu Lei, Zhang Zheng, Wang Xinhua | Arxiv | Supervised cross-modal hashing aims to embed the semantic correlations of heterogeneous modality data into the binary hash codes with discriminative semantic labels. Because of its advantages on retrieval and storage efficiency, it is widely used for solving efficient cross-modal retrieval. However, existing researches equally handle the different tasks of cross-modal retrieval, and simply learn the same couple of hash functions in a symmetric way for them. Under such circumstance, the uniqueness of different cross-modal retrieval tasks are ignored and sub-optimal performance may be brought. Motivated by this, we present a Task-adaptive Asymmetric Deep Cross-modal Hashing (TA-ADCMH) method in this paper. It can learn task-adaptive hash functions for two sub-retrieval tasks via simultaneous modality representation and asymmetric hash learning. Unlike previous cross-modal hashing approaches, our learning framework jointly optimizes semantic preserving that transforms deep features of multimedia data into binary hash codes, and the semantic regression which directly regresses query modality representation to explicit label. With our model, the binary codes can effectively preserve semantic correlations across different modalities, meanwhile, adaptively capture the query semantics. The superiority of TA-ADCMH is proved on two standard datasets from many aspects. |
|||||
2020 | Hamming OCR A Locality Sensitive Hashing Neural Network For Scene Text Recognition | Li Bingcong, Tang Xin, Qi Xianbiao, Chen Yihao, Xiao Rong | Arxiv | Recently, inspired by Transformer, self-attention-based scene text recognition approaches have achieved outstanding performance. However, we find that the model size expands rapidly as the lexicon grows. Specifically, the number of parameters in the softmax classification layer and the output embedding layer is proportional to the vocabulary size. This hinders the development of lightweight text recognition models, especially for Chinese and multilingual settings. Thus, we propose a lightweight scene text recognition model named Hamming OCR. In this model, a novel Hamming classifier, which adopts a locality sensitive hashing (LSH) algorithm to encode each character, is proposed to replace the softmax regression, and the generated LSH code is directly employed to replace the output embedding. We also present a simplified transformer decoder that reduces the number of parameters by removing the feed-forward network and using a cross-layer parameter sharing technique. Compared with traditional methods, the number of parameters in both the classification and embedding layers is independent of the vocabulary size, which significantly reduces the storage requirement without loss of accuracy. Experimental results on several datasets, including four public benchmarks and a Chinese text dataset synthesized by SynthText with more than 20,000 characters, show that Hamming OCR achieves competitive results. |
|||||
2020 | Deep Unsupervised Image Hashing By Maximizing Bit Entropy | Li Yunqiang, Van Gemert Jan | Arxiv | Unsupervised hashing is important for indexing huge image or video collections without having expensive annotations available. Hashing aims to learn short binary codes for compact storage and efficient semantic retrieval. We propose an unsupervised deep hashing layer called Bi-half Net that maximizes entropy of the binary codes. Entropy is maximal when both possible values of the bit are uniformly (half-half) distributed. To maximize bit entropy, we do not add a term to the loss function as this is difficult to optimize and tune. Instead, we design a new parameter-free network layer to explicitly force continuous image features to approximate the optimal half-half bit distribution. This layer is shown to minimize a penalized term of the Wasserstein distance between the learned continuous image features and the optimal half-half bit distribution. Experimental results on the image datasets Flickr25k, Nus-wide, Cifar-10, Mscoco, Mnist and the video datasets Ucf-101 and Hmdb-51 show that our approach leads to compact codes and compares favorably to the current state-of-the-art. |
|||||
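A minimal sketch of the half-half idea above: within a batch, each bit dimension assigns +1 to the samples in the upper half of its activations and -1 to the rest, so every bit is perfectly balanced (maximum entropy). This numpy version only illustrates the forward binarization; the published Bi-half layer sits inside a network with a suitable gradient estimator for end-to-end training.

```python
import numpy as np

def bi_half_binarize(features):
    """Per-bit balanced binarization of a batch of continuous features.

    For every column (bit), the half of the samples with the largest
    activations receive +1 and the other half receive -1, so each bit is
    uniformly (half-half) distributed over the batch.
    """
    n, d = features.shape
    codes = -np.ones((n, d), dtype=np.int8)
    ranks = np.argsort(features, axis=0)        # ascending order per column
    top_half = ranks[n // 2:, :]                # row indices of the larger half
    codes[top_half, np.arange(d)] = 1
    return codes

batch = np.random.randn(8, 4).astype(np.float32)
codes = bi_half_binarize(batch)
print(codes.sum(axis=0))                        # every bit sums to 0: perfectly balanced
```
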
2020 | Multiple Code Hashing For Efficient Image Retrieval | Li Ming-wei, Jiang Qing-yuan, Li Wu-jun | Arxiv | Due to its low storage cost and fast query speed, hashing has been widely used in large-scale image retrieval tasks. Hash bucket search returns data points within a given Hamming radius to each query, which can enable search at a constant or sub-linear time cost. However, existing hashing methods cannot achieve satisfactory retrieval performance for hash bucket search in complex scenarios, since they learn only one hash code for each image. More specifically, by using one hash code to represent one image, existing methods might fail to put similar image pairs to the buckets with a small Hamming distance to the query when the semantic information of images is complex. As a result, a large number of hash buckets need to be visited for retrieving similar images, based on the learned codes. This will deteriorate the efficiency of hash bucket search. In this paper, we propose a novel hashing framework, called multiple code hashing (MCH), to improve the performance of hash bucket search. The main idea of MCH is to learn multiple hash codes for each image, with each code representing a different region of the image. Furthermore, we propose a deep reinforcement learning algorithm to learn the parameters in MCH. To the best of our knowledge, this is the first work that proposes to learn multiple hash codes for each image in image retrieval. Experiments demonstrate that MCH can achieve a significant improvement in hash bucket search, compared with existing methods that learn only one hash code for each image. |
|||||
2020 | Perceptual Robust Hashing For Color Images With Canonical Correlation Analysis | Li Xinran, Qin Chuan, Qian Zhenxing, Yao Heng, Zhang Xinpeng | Arxiv | In this paper, a novel perceptual image hashing scheme for color images is proposed based on ring-ribbon quadtree and color vector angle. First, the original image is subjected to normalization and Gaussian low-pass filtering to produce a secondary image, which is divided into a series of ring-ribbons with different radii and the same number of pixels. Then, both textural and color features are extracted locally and globally. Quadtree decomposition (QD) is applied on luminance values of the ring-ribbons to extract local textural features, and the gray level co-occurrence matrix (GLCM) is used to extract global textural features. Local color features of significant corner points on outer boundaries of ring-ribbons are extracted through color vector angles (CVA), and color low-order moments (CLMs) are utilized to extract global color features. Finally, the two types of feature vectors are fused via canonical correlation analysis (CCA) to produce the final hash after scrambling. Compared with direct concatenation, the CCA feature fusion method improves classification performance, as it better reflects the overall correlation between the two sets of feature vectors. The receiver operating characteristic (ROC) curve shows that our scheme performs satisfactorily with respect to robustness, discrimination and security, and it can be effectively used in copy detection and content authentication. |
|||||
2020 | Distributed Tera-scale Similarity Search With MPI Provably Efficient Similarity Search Over Billions Without A Single Distance Computation | Meisburger Nicholas, Shrivastava Anshumali | Arxiv | We present SLASH (Sketched LocAlity Sensitive Hashing), an MPI (Message Passing Interface) based distributed system for approximate similarity search over terabyte scale datasets. SLASH provides a multi-node implementation of the popular LSH (locality sensitive hashing) algorithm, which is generally implemented on a single machine. We show how the LSH algorithm can be augmented with heavy-hitter sketches to provably solve the (high) similarity search problem without a single distance computation. Overall, we mathematically show that, under realistic data assumptions, we can identify the near-neighbor of a given query \(q\) using only a sub-linear (\(\ll O(n)\)) number of simple sketch aggregation operations. To make such a system practical, we offer a novel design and sketching solution to reduce the inter-machine communication overheads exponentially. In a direct comparison on comparable hardware, SLASH is more than 10000x faster than the popular LSH package in PySpark. PySpark is a widely-adopted distributed implementation of the LSH algorithm for large datasets and is deployed in commercial platforms. Finally, we show how our system scales to the tera-scale Criteo dataset with more than 4 billion samples. SLASH can index this 2.3 terabyte dataset over 20 nodes in under an hour, with query times of a fraction of a millisecond. To the best of our knowledge, there is no open-source system that can index and perform a similarity search on Criteo with a commodity cluster. |
|||||
2020 | Auto-encoding Twin-bottleneck Hashing | Shen Yuming, Qin Jie, Chen Jiaxin, Yu Mengyang, Liu Li, Zhu Fan, Shen Fumin, Shao Ling | Arxiv | Conventional unsupervised hashing methods usually take advantage of similarity graphs, which are either pre-computed in the high-dimensional space or obtained from random anchor points. On the one hand, existing methods uncouple the procedures of hash function learning and graph construction. On the other hand, graphs empirically built upon original data could introduce biased prior knowledge of data relevance, leading to sub-optimal retrieval performance. In this paper, we tackle the above problems by proposing an efficient and adaptive code-driven graph, which is updated by decoding in the context of an auto-encoder. Specifically, we introduce into our framework twin bottlenecks (i.e., latent variables) that exchange crucial information collaboratively. One bottleneck (i.e., binary codes) conveys the high-level intrinsic data structure captured by the code-driven graph to the other (i.e., continuous variables for low-level detail information), which in turn propagates the updated network feedback for the encoder to learn more discriminative binary codes. The auto-encoding learning objective literally rewards the code-driven graph to learn an optimal encoder. Moreover, the proposed model can be simply optimized by gradient descent without violating the binary constraints. Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods. Our source code can be found at https://github.com/ymcidence/TBH. |
|||||
2019 | Quantum Security Of Hash Functions And Property-preservation Of Iterated Hashing | Hamlin Ben, Song Fang | Arxiv | This work contains two major parts: comprehensively studying the security notions of cryptographic hash functions against quantum attacks and the relationships between them; and revisiting whether Merkle-Damgard and related iterated hash constructions preserve the security properties of the compression function in the quantum setting. Specifically, we adapt the seven notions in Rogaway and Shrimpton (FSE’04) to the quantum setting and prove that the seemingly stronger attack model where an adversary accesses a challenger in quantum superposition does not make a difference. We confirm the implications and separations between the seven properties in the quantum setting, and in addition we construct explicit examples separating an inherently quantum notion called collapsing from several proposed properties. Finally, we pin down the properties that are preserved under several iterated hash schemes. In particular, we prove that the ROX construction in Andreeva et al. (Asiacrypt’07) preserves the seven properties in the quantum random oracle model. |
|||||
2019 | Quotient Hash Tables - Efficiently Detecting Duplicates In Streaming Data | Géraud Rémi, Lombard-platet Marius, Naccache David | Arxiv | This article presents the Quotient Hash Table (QHT) a new data structure for duplicate detection in unbounded streams. QHTs stem from a corrected analysis of streaming quotient filters (SQFs), resulting in a 33\% reduction in memory usage for equal performance. We provide a new and thorough analysis of both algorithms, with results of interest to other existing constructions. We also introduce an optimised version of our new data structure dubbed Queued QHT with Duplicates (QQHTD). Finally we discuss the effect of adversarial inputs for hash-based duplicate filters similar to QHT. |
|||||
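The quotienting idea behind QHT-style duplicate detection can be sketched as follows: an item's hash is split into a bucket index (its remainder) and a short fingerprint (its quotient), and the item is flagged as a duplicate when its fingerprint is already present in its bucket. This is only an illustration of the idea; it is not the exact QHT/QQHTD design, which bounds memory per bucket and defines an eviction policy for unbounded streams.

```python
import hashlib

class QuotientDuplicateFilter:
    """Simplified quotient-style duplicate detector for a stream of items."""

    def __init__(self, bucket_bits=16, fingerprint_bits=8):
        self.bucket_bits = bucket_bits
        self.fingerprint_bits = fingerprint_bits
        self.buckets = [set() for _ in range(1 << bucket_bits)]

    def seen(self, item: bytes) -> bool:
        """Return True if an item with the same fingerprint was already
        inserted into the same bucket, otherwise record it and return False."""
        h = int.from_bytes(hashlib.blake2b(item, digest_size=8).digest(), "big")
        bucket = h & ((1 << self.bucket_bits) - 1)                     # remainder: which bucket
        fingerprint = (h >> self.bucket_bits) & ((1 << self.fingerprint_bits) - 1)  # quotient
        if fingerprint in self.buckets[bucket]:
            return True
        self.buckets[bucket].add(fingerprint)
        return False

f = QuotientDuplicateFilter()
print(f.seen(b"alpha"), f.seen(b"beta"), f.seen(b"alpha"))  # False False True
```
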
2019 | Deep Hashing For Signed Social Network Embedding | Guo Jia-nan, Mao Xian-ling, Jiang Xiao-jian, Sun Ying-xiang, Wei Wei, Huang He-yan | Arxiv | Network embedding is a promising way of network representation, facilitating many signed social network processing and analysis tasks such as link prediction and node classification. Recently, feature hashing has been adopted in several existing embedding algorithms to improve the efficiency, which has obtained a great success. However, the existing feature hashing based embedding algorithms only consider the positive links in signed social networks. Intuitively, negative links can also help improve the performance. Thus, in this paper, we propose a novel deep hashing method for signed social network embedding by considering simultaneously positive and negative links. Extensive experiments show that the proposed method performs better than several state-of-the-art baselines through link prediction task over two real-world signed social networks. |
|||||
2019 | Conv-codes Audio Hashing For Bird Species Classification | Thakur Anshul, Sharma Pulkit, Abrol Vinayak, Rajan Padmanabhan | Arxiv | In this work, we propose a supervised, convex representation based audio hashing framework for bird species classification. The proposed framework utilizes archetypal analysis, a matrix factorization technique, to obtain convex-sparse representations of a bird vocalization. These convex representations are hashed using Bloom filters with non-cryptographic hash functions to obtain compact binary codes, designated as conv-codes. The conv-codes extracted from the training examples are clustered using class-specific k-medoids clustering with Jaccard coefficient as the similarity metric. A hash table is populated using the cluster centers as keys while hash values/slots are pointers to the species identification information. During testing, the hash table is searched to find the species information corresponding to a cluster center that exhibits maximum similarity with the test conv-code. Hence, the proposed framework classifies a bird vocalization in the conv-code space and requires no explicit classifier or reconstruction error calculations. Apart from that, based on min-hash and direct addressing, we also propose a variant of the proposed framework that provides faster and effective classification. The performances of both these frameworks are compared with existing bird species classification frameworks on the audio recordings of 50 different bird species. |
|||||
2019 | Supervised Discrete Hashing With Relaxation | Gui Jie, Liu Tongliang, Sun Zhenan, Tao Dacheng, Tan Tieniu | Arxiv | Data-dependent hashing has recently attracted attention due to being able to support efficient retrieval and storage of high-dimensional data such as documents, images, and videos. In this paper, we propose a novel learning-based hashing method called “Supervised Discrete Hashing with Relaxation” (SDHR) based on “Supervised Discrete Hashing” (SDH). SDH uses ordinary least squares regression and traditional zero-one matrix encoding of class label information as the regression target (code words), thus fixing the regression target. In SDHR, the regression target is instead optimized. The optimized regression target matrix satisfies a large margin constraint for correct classification of each example. Compared with SDH, which uses the traditional zero-one matrix, SDHR utilizes the learned regression target matrix and, therefore, more accurately measures the classification error of the regression model and is more flexible. As expected, SDHR generally outperforms SDH. Experimental results on two large-scale image datasets (CIFAR-10 and MNIST) and a large-scale and challenging face dataset (FRGC) demonstrate the effectiveness and efficiency of SDHR. |
|||||
2019 | Post-training 4-bit Quantization On Embedding Tables | Guan Hui, Malevich Andrey, Yang Jiyan, Park Jongsoo, Yuen Hector | Arxiv | Continuous representations have been widely adopted in recommender systems where a large number of entities are represented using embedding vectors. As the cardinality of the entities increases, the embedding components can easily contain millions of parameters and become the bottleneck in both storage and inference due to large memory consumption. This work focuses on post-training 4-bit quantization on the continuous embeddings. We propose row-wise uniform quantization with greedy search and codebook-based quantization that consistently outperforms state-of-the-art quantization approaches on reducing accuracy degradation. We deploy our uniform quantization technique on a production model in Facebook and demonstrate that it can reduce the model size to only 13.89% of the single-precision version while the model quality stays neutral. |
|||||
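Row-wise uniform quantization, as described above, keeps a separate scale and offset per embedding row so that rows with very different dynamic ranges do not share one coarse 16-level grid. The sketch below uses the plain per-row min/max range; the paper additionally applies a greedy search over the clipping range and a codebook-based variant, both omitted here.

```python
import numpy as np

def quantize_rows_4bit(table):
    """Row-wise uniform 4-bit quantization of an embedding table.

    Each row stores its own (scale, offset); values are mapped onto the
    16 levels 0..15. This uses the plain min/max range per row, which is
    a simplification of the greedy range search described in the paper.
    """
    lo = table.min(axis=1, keepdims=True)
    hi = table.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0          # 2**4 - 1 quantization steps
    q = np.clip(np.round((table - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_rows(q, scale, lo):
    return q.astype(np.float32) * scale + lo

emb = np.random.randn(1000, 64).astype(np.float32)
q, scale, lo = quantize_rows_4bit(emb)
err = np.abs(dequantize_rows(q, scale, lo) - emb).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```
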
2019 | Fast Supervised Discrete Hashing | Gui Jie, Liu Tongliang, Sun Zhenan, Tao Dacheng, Tan Tieniu | Arxiv | Learning-based hashing algorithms are |
|||||
2019 | Hashgraph -- Scalable Hash Tables Using A Sparse Graph Data Structure | Green Oded | Arxiv | Hash tables are ubiquitous and used in a wide range of applications for efficient probing of large and unsorted data. If designed properly, hash tables can enable efficient lookups in a constant number of operations, commonly referred to as O(1) operations. As data sizes continue to grow and data becomes less structured (as is common for big-data applications), the need for efficient and scalable hash tables also grows. In this paper we introduce HashGraph, a new scalable approach for building hash tables that uses concepts taken from sparse graph representations–hence the name HashGraph. We show two different variants of HashGraph: a simple algorithm that outlines the method to create the hash table, and an advanced method that creates the hash table in a more efficient manner (with an improved memory access pattern). HashGraph shows a new way to deal with hash collisions that does not use “open-addressing” or “chaining”, yet has all the benefits of both these approaches. HashGraph currently works for static inputs, though recent progress with dynamic graph data structures suggests that HashGraph might be extended to dynamic inputs as well. We show that HashGraph can deal with a large number of hash values per entry without the loss of performance that most open-addressing and chaining approaches exhibit. Further, we show that HashGraph is indifferent to the load factor. Lastly, we show a new probing algorithm for the second phase of value lookups. Given the above, HashGraph is extremely fast and outperforms several state-of-the-art hash-table implementations. The implementation of HashGraph in this paper is for NVIDIA GPUs, though HashGraph is not architecture dependent. Using an NVIDIA GV100 GPU, HashGraph is anywhere from 2X-8X faster than cuDPP, WarpDrive, and cuDF. HashGraph is able to build a hash table at a rate of 2.5 billion keys per second and can probe at nearly the same rate. |
|||||
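The CSR-style layout that HashGraph borrows from sparse graph representations can be sketched in a few linear passes: hash every key, count keys per bucket, prefix-sum the counts into bucket offsets, then scatter key indices into a single flat array. This single-threaded Python sketch only illustrates the layout; the paper targets GPUs and adds an optimized probing phase.

```python
import numpy as np

def build_hashgraph(keys, num_buckets):
    """Build a CSR-like hash index: `offsets` delimits each bucket's
    contiguous slice of `slots`, and `slots` holds key indices grouped
    by bucket (the sparse-graph layout HashGraph is named after)."""
    h = np.array([hash(int(k)) % num_buckets for k in keys], dtype=np.int64)
    counts = np.bincount(h, minlength=num_buckets)
    offsets = np.concatenate(([0], np.cumsum(counts)))   # bucket boundaries
    slots = np.empty(len(keys), dtype=np.int64)
    cursor = offsets[:-1].copy()
    for i, b in enumerate(h):                            # scatter pass
        slots[cursor[b]] = i
        cursor[b] += 1
    return offsets, slots

def lookup(key, keys, offsets, slots, num_buckets):
    """Probe only the bucket's contiguous slice; return the key's index or -1."""
    b = hash(int(key)) % num_buckets
    for s in slots[offsets[b]:offsets[b + 1]]:
        if keys[s] == key:
            return int(s)
    return -1

keys = np.array([42, 7, 13, 42, 99])
offsets, slots = build_hashgraph(keys, num_buckets=8)
print(lookup(13, keys, offsets, slots, 8))   # index of 13 in `keys`
```
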
2019 | Fusion-supervised Deep Cross-modal Hashing | Wang Li, Zhu Lei, Yu En, Sun Jiande, Zhang Huaxiang | Arxiv | Deep hashing has recently received attention in cross-modal retrieval for its impressive advantages. However, existing hashing methods for cross-modal retrieval cannot fully capture the heterogeneous multi-modal correlation and exploit the semantic information. In this paper, we propose a novel Fusion-supervised Deep Cross-modal Hashing (FDCH) approach. Firstly, FDCH learns unified binary codes through a fusion hash network with paired samples as input, which effectively enhances the modeling of the correlation of heterogeneous multi-modal data. Then, these high-quality unified hash codes further supervise the training of the modality-specific hash networks for encoding out-of-sample queries. Meanwhile, both pair-wise similarity information and classification information are embedded in the hash networks under one stream framework, which simultaneously preserves cross-modal similarity and keeps semantic consistency. Experimental results on two benchmark datasets demonstrate the state-of-the-art performance of FDCH. |
|||||
2019 | Deep Collaborative Discrete Hashing With Semantic-invariant Structure | Wang Zijian, Zhang Zheng, Luo Yadan, Huang Zi | SIGIR | Existing deep hashing approaches fail to fully explore semantic correlations and neglect the effect of linguistic context on visual attention learning, leading to inferior performance. This paper proposes a dual-stream learning framework, dubbed Deep Collaborative Discrete Hashing (DCDH), which constructs a discriminative common discrete space by collaboratively incorporating the shared and individual semantics deduced from visual features and semantic labels. Specifically, the context-aware representations are generated by employing the outer product of visual embeddings and semantic encodings. Moreover, we reconstruct the labels and introduce the focal loss to take advantage of frequent and rare concepts. The common binary code space is built on the joint learning of the visual representations attended by language, the semantic-invariant structure construction and the label distribution correction. Extensive experiments demonstrate the superiority of our method. |
|||||
2019 | The Bitwise Hashing Trick For Personalized Search | Gaskill Braddock | Applied Artificial Intelligence Volume | Many real world problems require fast and efficient lexical comparison of large numbers of short text strings. Search personalization is one such domain. We introduce the use of feature bit vectors using the hashing trick for improving relevance in personalized search and other personalization applications. We present results of several lexical hashing and comparison methods. These methods are applied to a user’s historical behavior and are used to predict future behavior. Using a single bit per dimension instead of floating point results in an order of magnitude decrease in data structure size, while preserving or even improving quality. We use real data to simulate a search personalization task. A simple method for combining bit vectors demonstrates an order of magnitude improvement in compute time on the task with only a small decrease in accuracy. |
|||||
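A minimal sketch of the bitwise hashing trick discussed above: each token of a short text sets one bit chosen by its hash, and two profiles are compared with bit operations and popcounts instead of floating-point vectors. The tokenization and the way multiple historical vectors would be combined are simplified assumptions here.

```python
def text_to_bitvector(text, num_bits=256):
    """Feature hashing with one bit per dimension: each token sets a single
    bit chosen by its hash (bits are computed within one process, since
    Python's string hash is randomized per run)."""
    v = 0
    for token in text.lower().split():
        v |= 1 << (hash(token) % num_bits)
    return v

def bit_similarity(a, b):
    """Jaccard-style similarity between two bit vectors via popcounts."""
    inter = bin(a & b).count("1")
    union = bin(a | b).count("1")
    return inter / union if union else 0.0

history = text_to_bitvector("cheap flights to tokyo hotel deals")
query = text_to_bitvector("tokyo hotel booking")
print(round(bit_similarity(history, query), 3))
```
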
2019 | Nearly-unsupervised Hashcode Representations For Relation Extraction | Garg Sahil, Galstyan Aram, Steeg Greg Ver, Cecchi Guillermo | Arxiv | Recently, kernelized locality sensitive hashcodes have been successfully employed as representations of natural language text, especially showing high relevance to biomedical relation extraction tasks. In this paper, we propose to optimize the hashcode representations in a nearly unsupervised manner, in which we only use data points, but not their class labels, for learning. The optimized hashcode representations are then fed to a supervised classifier following the prior work. This nearly unsupervised approach allows fine-grained optimization of each hash function, which is particularly suitable for building hashcode representations generalizing from a training set to a test set. We empirically evaluate the proposed approach for biomedical relation extraction tasks, obtaining significant accuracy improvements w.r.t. state-of-the-art supervised and semi-supervised approaches. |
|||||
2019 | Efficient Discrete Supervised Hashing For Large-scale Cross-modal Retrieval | Yao Tao, Kong Xiangwei, Yan Lianshan, Tang Wenjing, Tian Qi | Arxiv | Supervised cross-modal hashing has gained increasing research interest on large-scale retrieval tasks owing to its satisfactory performance and efficiency. However, it still has some challenging issues to be further studied: 1) most methods fail to preserve the semantic correlations in hash codes well because of the large heterogeneous gap; 2) most of them relax the discrete constraint on hash codes, leading to large quantization error and consequently low performance; 3) most of them suffer from relatively high memory cost and computational complexity during the training procedure, which makes them unscalable. In this paper, to address the above issues, we propose a supervised cross-modal hashing method based on matrix factorization dubbed Efficient Discrete Supervised Hashing (EDSH). Specifically, collective matrix factorization on heterogeneous features and semantic embedding with class labels are seamlessly integrated to learn hash codes. Therefore, both the feature-based similarities and the semantic correlations can be preserved in hash codes, which makes the learned hash codes more discriminative. Then an efficient discrete optimization algorithm is proposed to handle the scalability issue. Instead of learning hash codes bit-by-bit, the hash code matrix can be obtained directly, which is more efficient. Extensive experimental results on three public real-world datasets demonstrate that EDSH produces superior performance in both accuracy and scalability over some existing cross-modal hashing methods. |
|||||
2019 | Metric-learning Based Deep Hashing Network For Content Based Retrieval Of Remote Sensing Images | Roy Subhankar, Sangineto Enver, Demir Begüm, Sebe Nicu | Arxiv | Hashing methods have been recently found very effective in retrieval of remote sensing (RS) images due to their computational efficiency and fast search speed. The traditional hashing methods in RS usually exploit hand-crafted features to learn hash functions to obtain binary codes, which can be insufficient to optimally represent the information content of RS images. To overcome this problem, in this paper we introduce a metric-learning based hashing network, which learns: 1) a semantic-based metric space for effective feature representation; and 2) compact binary hash codes for fast archive search. Our network considers an interplay of multiple loss functions that allows to jointly learn a metric based semantic space facilitating similar images to be clustered together in that target space and at the same time producing compact final activations that lose negligible information when binarized. Experiments carried out on two benchmark RS archives point out that the proposed network significantly improves the retrieval performance under the same retrieval time when compared to the state-of-the-art hashing methods in RS. |
|||||
2019 | Bag Of Negatives For Siamese Architectures | Gajic Bojana, Amato Ariel, Baldrich Ramon, Gatta Carlo | Arxiv | Training a Siamese architecture for re-identification with a large number of identities is a challenging task due to the difficulty of finding relevant negative samples efficiently. In this work we present Bag of Negatives (BoN), a method for accelerated and improved training of Siamese networks that scales well on datasets with a very large number of identities. BoN is an efficient and loss-independent method, able to select a bag of high quality negatives, based on a novel online hashing strategy. |
|||||
2019 | Feature Pyramid Hashing | Yang Yifan, Geng Libing, Lai Hanjiang, Pan Yan, Yin Jian | Arxiv | In recent years, deep-networks-based hashing has become a leading approach for large-scale image retrieval. Most deep hashing approaches use the high layer to extract powerful semantic representations. However, these methods have limited ability for fine-grained image retrieval because the semantic features extracted from the high layer struggle to capture subtle differences. To this end, we propose a novel two-pyramid hashing architecture to learn both the semantic information and the subtle appearance details for fine-grained image search. Inspired by the feature pyramids of convolutional neural networks, a vertical pyramid is proposed to capture the high-layer features and a horizontal pyramid combines multiple low-layer features with structural information to capture the subtle differences. To fuse the low-level features, a novel combination strategy, called consensus fusion, is proposed to capture all subtle information from several low layers for finer retrieval. Extensive evaluation on two fine-grained datasets, CUB-200-2011 and Stanford Dogs, demonstrates that the proposed method achieves significantly better performance than the state-of-the-art baselines. |
|||||
2019 | Deep Cross-modal Hashing With Hashing Functions And Unified Hash Codes Jointly Learning | Tu Rong-cheng, Mao Xian-ling, Ma Bing, Hu Yong, Yan Tan, Wei Wei, Huang Heyan | Arxiv | Due to their high retrieval efficiency and low storage cost, cross-modal hashing methods have attracted considerable attention. Generally, compared with shallow cross-modal hashing methods, deep cross-modal hashing methods can achieve more satisfactory performance by integrating feature learning and hash code optimization into the same framework. However, most existing deep cross-modal hashing methods either cannot learn a unified hash code for the two correlated data points of different modalities in a database instance, or cannot use feedback from the hashing function learning procedure to guide the learning of unified hash codes, which would enhance the retrieval accuracy. To address the issues above, in this paper, we propose a novel end-to-end Deep Cross-Modal Hashing with Hashing Functions and Unified Hash Codes Jointly Learning (DCHUC). Specifically, by an iterative optimization algorithm, DCHUC jointly learns unified hash codes for image-text pairs in a database and a pair of hash functions for unseen query image-text pairs. With the iterative optimization algorithm, the learned unified hash codes can be used to guide the hashing function learning procedure; meanwhile, the learned hashing functions can in turn guide the unified hash code optimization procedure. Extensive experiments on three public datasets demonstrate that the proposed method outperforms the state-of-the-art cross-modal hashing methods. |
|||||
2019 | Recsplit Minimal Perfect Hashing Via Recursive Splitting | Esposito Emmanuel, Graf Thomas Mueller, Vigna Sebastiano | Arxiv | A minimal perfect hash function bijectively maps a key set \(S\) out of a universe \(U\) into the first \(|S|\) natural numbers. Minimal perfect hash functions are used, for example, to map irregularly-shaped keys, such as strings, into a compact space so that metadata can then be simply stored in an array. While it is known that just \(1.44\) bits per key are necessary to store a minimal perfect hash function, no published technique can go below \(2\) bits per key in practice. We propose a new technique for storing minimal perfect hash functions with expected linear construction time and expected constant lookup time that makes it possible to build, for the first time, structures which need, for example, \(1.56\) bits per key, that is, within \(8.3\)% of the lower bound, in less than \(2\) ms per key. We show that instances of our construction are able to simultaneously beat the construction time, space usage and lookup time of the state-of-the-art data structure reaching \(2\) bits per key. Moreover, we provide parameter choices giving structures which are competitive with alternative, larger-size data structures in terms of space and lookup time. The construction of our data structures can be easily parallelized or mapped on distributed computational units (e.g., within the MapReduce framework), and structures larger than the available RAM can be directly built in mass storage. |
|||||
2019 | Probminhash -- A Class Of Locality-sensitive Hash Algorithms For The (probability) Jaccard Similarity | Ertl Otmar | Arxiv | The probability Jaccard similarity was recently proposed as a natural generalization of the Jaccard similarity to measure the proximity of sets whose elements are associated with relative frequencies or probabilities. In combination with a hash algorithm that maps those weighted sets to compact signatures which allow fast estimation of pairwise similarities, it constitutes a valuable method for big data applications such as near-duplicate detection, nearest neighbor search, or clustering. This paper introduces a class of one-pass locality-sensitive hash algorithms that are orders of magnitude faster than the original approach. The performance gain is achieved by calculating signature components not independently, but collectively. Four different algorithms are proposed based on this idea. Two of them are statistically equivalent to the original approach and can be used as drop-in replacements. The other two may even improve the estimation error by introducing statistical dependence between signature components. Moreover, the presented techniques can be specialized for the conventional Jaccard similarity, resulting in highly efficient algorithms that outperform traditional minwise hashing and that are able to compete with the state of the art. |
|||||
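For reference, the straightforward (per-component independent) scheme that ProbMinHash accelerates looks roughly like the sketch below: for every signature slot, each element draws an exponential variate scaled by the inverse of its weight, and the element attaining the minimum is recorded; the fraction of matching slots then estimates the probability Jaccard similarity. This is only the slow baseline idea under assumed hash-seeded randomness, not the collective one-pass algorithms proposed in the paper.

```python
import hashlib
import math

def p_minhash_signature(weighted_set, num_hashes=64):
    """Signature for the probability Jaccard similarity: per slot, keep the
    element whose hash-seeded Exp(1)/weight variate is smallest. Slots are
    computed independently here, which is exactly the inefficiency the
    ProbMinHash algorithms remove by computing them collectively."""
    sig = []
    for k in range(num_hashes):
        best_d, best_v = None, math.inf
        for d, w in weighted_set.items():
            raw = hashlib.blake2b(f"{k}:{d}".encode(), digest_size=8).digest()
            u = (int.from_bytes(raw, "big") + 0.5) / 2**64   # uniform in (0, 1)
            v = -math.log(u) / w                             # Exp(1) scaled by 1/weight
            if v < best_v:
                best_d, best_v = d, v
        sig.append(best_d)
    return sig

def estimate_probability_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = p_minhash_signature({"x": 3.0, "y": 1.0, "z": 1.0})
b = p_minhash_signature({"x": 3.0, "y": 1.0, "w": 2.0})
print(estimate_probability_jaccard(a, b))
```
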
2019 | Deep Spherical Quantization For Image Search | Eghbali Sepehr, Tahvildari Ladan | Arxiv | Hashing methods, which encode high-dimensional images with compact discrete codes, have been widely applied to enhance large-scale image retrieval. In this paper, we put forward Deep Spherical Quantization (DSQ), a novel method to make deep convolutional neural networks generate supervised and compact binary codes for efficient image search. Our approach simultaneously learns a mapping that transforms the input images into a low-dimensional discriminative space, and quantizes the transformed data points using multi-codebook quantization. To eliminate the negative effect of norm variance on codebook learning, we force the network to L_2 normalize the extracted features and then quantize the resulting vectors using a new supervised quantization technique specifically designed for points lying on a unit hypersphere. Furthermore, we introduce an easy-to-implement extension of our quantization technique that enforces sparsity on the codebooks. Extensive experiments demonstrate that DSQ and its sparse variant can generate semantically separable compact binary codes outperforming many state-of-the-art image retrieval methods on three benchmarks. |
|||||
2019 | Pairwise Teacher-student Network For Semi-supervised Hashing | Zhang Shifeng, Li Jianmin, Zhang Bo | Arxiv | Hashing methods map similar high-dimensional data to binary hash codes with small Hamming distances, and they have received broad attention due to their low storage cost and fast retrieval speed. Pairwise similarity is easily obtained and widely used for retrieval, and most supervised hashing algorithms are carefully designed for pairwise supervision. As labeling all data pairs is difficult, semi-supervised hashing has been proposed, which aims at learning efficient codes with limited labeled pairs and abundant unlabeled ones. Existing methods build graphs to capture the structure of the dataset, but they do not work well for complex data, as the graph is built based on the data representations and determining the representations of complex data is difficult. In this paper, we propose a novel teacher-student semi-supervised hashing framework in which the student is trained with the pairwise information produced by the teacher network. The network follows the smoothness assumption, which achieves consistent distances for similar data pairs so that the retrieval results are similar for neighborhood queries. Experiments on large-scale datasets show that the proposed method achieves impressive gains over the supervised baselines and is superior to state-of-the-art semi-supervised hashing methods. |
|||||
2019 | Unsupervised Rank-preserving Hashing For Large-scale Image Retrieval | Karaman Svebor, Lin Xudong, Hu Xuefeng, Chang Shih-fu | Arxiv | We propose an unsupervised hashing method which aims to produce binary codes that preserve the ranking induced by a real-valued representation. Such compact hash codes enable the complete elimination of real-valued feature storage and allow for significant reduction of the computation complexity and storage cost of large-scale image retrieval applications. Specifically, we learn a neural network-based model, which transforms the input representation into a binary representation. We formalize the training objective of the network in an intuitive and effective way, considering each training sample as a query and aiming to obtain the same retrieval results using the produced hash codes as those obtained with the original features. This training formulation directly optimizes the hashing model for the target usage of the hash codes it produces. We further explore the addition of a decoder trained to obtain an approximated reconstruction of the original features. At test time, we retrieve the most promising database samples with an efficient graph-based search procedure using only our hash codes and perform re-ranking using the reconstructed features, thus without needing to access the original features at all. Experiments conducted on multiple publicly available large-scale datasets show that our method consistently outperforms all compared state-of-the-art unsupervised hashing methods and that the reconstruction procedure can effectively boost the search accuracy with a minimal constant additional cost. |
|||||
2019 | Online Hashing With Efficient Updating Of Binary Codes | Weng Zhenyu, Zhu Yuesheng | Arxiv | Online hashing methods are efficient in learning the hash functions from the streaming data. However, when the hash functions change, the binary codes for the database have to be recomputed to guarantee the retrieval accuracy. Recomputing the binary codes by accumulating the whole database brings a timeliness challenge to the online retrieval process. In this paper, we propose a novel online hashing framework to update the binary codes efficiently without accumulating the whole database. In our framework, the hash functions are fixed and the projection functions are introduced to learn online from the streaming data. Therefore, inefficient updating of the binary codes by accumulating the whole database can be transformed to efficient updating of the binary codes by projecting the binary codes into another binary space. The queries and the binary code database are projected asymmetrically to further improve the retrieval accuracy. The experiments on two multi-label image databases demonstrate the effectiveness and the efficiency of our method for multi-label image retrieval. |
|||||
2019 | Candidate Generation With Binary Codes For Large-scale Top-n Recommendation | Kang Wang-cheng, Mcauley Julian | Arxiv | Generating the Top-N recommendations from a large corpus is computationally expensive to perform at scale. Candidate generation and re-ranking based approaches are often adopted in industrial settings to alleviate efficiency problems. However it remains to be fully studied how well such schemes approximate complete rankings (or how many candidates are required to achieve a good approximation), or to develop systematic approaches to generate high-quality candidates efficiently. In this paper, we seek to investigate these questions via proposing a candidate generation and re-ranking based framework (CIGAR), which first learns a preference-preserving binary embedding for building a hash table to retrieve candidates, and then learns to re-rank the candidates using real-valued ranking models with a candidate-oriented objective. We perform a comprehensive study on several large-scale real-world datasets consisting of millions of users/items and hundreds of millions of interactions. Our results show that CIGAR significantly boosts the Top-N accuracy against state-of-the-art recommendation models, while reducing the query time by orders of magnitude. We hope that this work could draw more attention to the candidate generation problem in recommender systems. |
|||||
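A rough sketch of the candidate-generation-plus-re-ranking pipeline described above: items are bucketed by their binary embeddings in a hash table, a query fetches the buckets within a small Hamming radius, and only that shortlist is scored by an expensive real-valued model. The embedding learning and the candidate-oriented re-ranking objective of CIGAR are not shown; the code assumes binary codes are already available as integers and uses a placeholder scoring function.

```python
from collections import defaultdict
from itertools import combinations

def build_candidate_index(item_codes):
    """Hash table from binary code (int) to the list of item ids with that code."""
    index = defaultdict(list)
    for item, code in enumerate(item_codes):
        index[code].append(item)
    return index

def codes_within_radius(code, num_bits, radius):
    """Enumerate all codes within the given Hamming radius of `code`."""
    yield code
    for r in range(1, radius + 1):
        for bits in combinations(range(num_bits), r):
            c = code
            for b in bits:
                c ^= 1 << b
            yield c

def recommend(user_code, index, rerank_score, num_bits=16, radius=1, k=10):
    # Stage 1: cheap candidate generation from nearby hash buckets.
    candidates = set()
    for c in codes_within_radius(user_code, num_bits, radius):
        candidates.update(index.get(c, ()))
    # Stage 2: expensive real-valued re-ranking restricted to the shortlist.
    return sorted(candidates, key=rerank_score, reverse=True)[:k]

item_codes = [0b1010101010101010, 0b1010101010101011, 0b0101010101010101]
index = build_candidate_index(item_codes)
print(recommend(0b1010101010101010, index, rerank_score=lambda i: -i))
```
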
2019 | Cross-modal Zero-shot Hashing | Liu Xuanwu, Li Zhao, Wang Jun, Yu Guoxian, Domeniconi Carlotta, Zhang Xiangliang | Arxiv | Hashing has been widely studied for big data retrieval due to its low storage cost and fast query speed. Zero-shot hashing (ZSH) aims to learn a hashing model that is trained using only samples from seen categories, but can generalize well to samples of unseen categories. ZSH generally uses category attributes to seek a semantic embedding space to transfer knowledge from seen categories to unseen ones. As a result, it may perform poorly when labeled data are insufficient. ZSH methods are mainly designed for single-modality data, which prevents their application to the widely spread multi-modal data. On the other hand, existing cross-modal hashing solutions assume that all the modalities share the same category labels, while in practice the labels of different data modalities may be different. To address these issues, we propose a general Cross-modal Zero-shot Hashing (CZHash) solution to effectively leverage unlabeled and labeled multi-modality data with different label spaces. CZHash first quantifies the composite similarity between instances using label and feature information. It then defines an objective function to achieve deep feature learning compatible with the composite similarity preserving, category attribute space learning, and hashing coding function learning. CZHash further introduces an alternative optimization procedure to jointly optimize these learning objectives. Experiments on benchmark multi-modal datasets show that CZHash significantly outperforms related representative hashing approaches both on effectiveness and adaptability. |
|||||
2019 | (b)-bit Sketch Trie Scalable Similarity Search On Integer Sketches | Kanda Shunsuke, Tabei Yasuo | Arxiv | Recently, randomly mapping vectorial data to strings of discrete symbols (i.e., sketches) for fast and space-efficient similarity searches has become popular. Such random mapping is called similarity-preserving hashing and approximates a similarity metric by using the Hamming distance. Although many efficient similarity searches have been proposed, most of them are designed for binary sketches. Similarity searches on integer sketches are in their infancy. In this paper, we present a novel space-efficient trie named \(b\)-bit sketch trie on integer sketches for scalable similarity searches by leveraging the idea behind succinct data structures (i.e., space-efficient data structures while supporting various data operations in the compressed format) and a favorable property of integer sketches as fixed-length strings. Our experimental results obtained using real-world datasets show that a trie-based index is built from integer sketches and efficiently performs similarity searches on the index by pruning useless portions of the search space, which greatly improves the search time and space-efficiency of the similarity search. The experimental results show that our similarity search is at most one order of magnitude faster than state-of-the-art similarity searches. Besides, our method needs only 10 GiB of memory on a billion-scale database, while state-of-the-art similarity searches need 29 GiB of memory. |
|||||
2019 | PDH Probabilistic Deep Hashing Based On MAP Estimation Of Hamming Distance | Kaga Yosuke, Fujio Masakazu, Takahashi Kenta, Ohki Tetsushi, Nishigaki Masakatsu | | With the growth of images on the web, research on hashing, which enables high-speed image retrieval, has been actively studied. In recent years, various hashing methods based on deep neural networks have been proposed and have achieved higher precision than other hashing methods. In these methods, multiple losses for hash codes and the parameters of neural networks are defined, and hash codes are generated to minimize the weighted sum of the losses. Therefore, an expert has to tune the weights for the losses heuristically, and the probabilistic optimality of the loss function cannot be explained. In order to generate explainable hash codes without weight tuning, we theoretically derive a single loss function with no hyperparameters for the hash code from the probability distribution of the images. By generating hash codes that minimize this loss function, highly accurate image retrieval with probabilistic optimality is performed. We evaluate the performance of hashing using MNIST, CIFAR-10, and SVHN and show that the proposed method outperforms state-of-the-art hashing methods. |
|||||
2019 | Note On Distance Matrix Hashing | Junussov I. A. | Arxiv | A hashing algorithm for a dynamic set of distances is described. The proposed hashing function is residual. A data structure whose implementation accelerates the computations is presented. |
|||||
2019 | Mutual Linear Regression-based Discrete Hashing | Liu Xingbo, Nie Xiushan, Yin Yilong | Arxiv | Label information is widely used in hashing methods because of its effectiveness of improving the precision. The existing hashing methods always use two different projections to represent the mutual regression between hash codes and class labels. In contrast to the existing methods, we propose a novel learning-based hashing method termed stable supervised discrete hashing with mutual linear regression (S2DHMLR) in this study, where only one stable projection is used to describe the linear correlation between hash codes and corresponding labels. To the best of our knowledge, this strategy has not been used for hashing previously. In addition, we further use a boosting strategy to improve the final performance of the proposed method without adding extra constraints and with little extra expenditure in terms of time and space. Extensive experiments conducted on three image benchmarks demonstrate the superior performance of the proposed method. |
|||||
2019 | SSAH Semi-supervised Adversarial Deep Hashing With Self-paced Hard Sample Generation | Jin Sheng, Zhou Shangchen, Liu Yao, Chen Chao, Sun Xiaoshuai, Yao Hongxun, Hua Xiansheng | Arxiv | Deep hashing methods have been proved to be effective and efficient for large-scale Web media search. The success of these data-driven methods largely depends on collecting sufficient labeled data, which is usually a crucial limitation in practical cases. Current solutions to this issue utilize Generative Adversarial Networks (GANs) to augment data in semi-supervised learning. However, existing GAN-based methods treat image generation and hashing learning as two isolated processes, leading to generation ineffectiveness. Besides, most works fail to exploit the semantic information in unlabeled data. In this paper, we propose a novel Semi-supervised Self-paced Adversarial Hashing method, named SSAH, to solve the above problems in a unified framework. The SSAH method consists of an adversarial network (A-Net) and a hashing network (H-Net). To improve the quality of generated images, the A-Net first learns hard samples with multi-scale occlusions and multi-angle rotated deformations which compete against the learning of accurate hashing codes. Second, we design a novel self-paced hard generation policy to gradually increase the hashing difficulty of generated samples. To make use of the semantic information in unlabeled data, we propose a semi-supervised consistency loss. The experimental results show that our method can significantly improve state-of-the-art models on both widely-used hashing datasets and fine-grained datasets. |
|||||
2019 | Deep Semantic Multimodal Hashing Network For Scalable Image-text And Video-text Retrievals | Jin Lu, Li Zechao, Tang Jinhui | Arxiv | Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. In this article, we propose a novel deep semantic multimodal hashing network (DSMHN) for scalable image-text and video-text retrieval. The proposed deep hashing framework leverages 2-D convolutional neural networks (CNN) as the backbone network to capture the spatial information for image-text retrieval, while the 3-D CNN as the backbone network to capture the spatial and temporal information for video-text retrieval. In the DSMHN, two sets of modality-specific hash functions are jointly learned by explicitly preserving both intermodality similarities and intramodality semantic labels. Specifically, with the assumption that the learned hash codes should be optimal for the classification task, two stream networks are jointly trained to learn the hash functions by embedding the semantic labels on the resultant hash codes. Moreover, a unified deep multimodal hashing framework is proposed to learn compact and high-quality hash codes by exploiting the feature representation learning, intermodality similarity-preserving learning, semantic label-preserving learning, and hash function learning with different types of loss functions simultaneously. The proposed DSMHN method is a generic and scalable deep hashing framework for both image-text and video-text retrievals, which can be flexibly integrated with different types of loss functions. We conduct extensive experiments for both single modal- and cross-modal-retrieval tasks on four widely used multimodal-retrieval data sets. Experimental results on both image-text- and video-text-retrieval tasks demonstrate that the DSMHN significantly outperforms the state-of-the-art methods. |
|||||
2019 | Optimal Projection Guided Transfer Hashing For Image Retrieval | Liu Ji, Zhang Lei | Arxiv | Recently, learning to hash has been widely studied for image retrieval thanks to the computation and storage efficiency of binary codes. For most existing learning to hash methods, sufficient training images are required and used to learn precise hashing codes. However, in some real-world applications, there are not always sufficient training images in the domain of interest. In addition, some existing supervised approaches need a large amount of labeled data, which is expensive in terms of time, labeling effort and human expertise. To handle such problems, inspired by transfer learning, we propose a simple yet effective unsupervised hashing method named Optimal Projection Guided Transfer Hashing (GTH), where we borrow images from a different but related domain, i.e., the source domain, to help learn precise hashing codes for the domain of interest, i.e., the target domain. Besides, we propose to seek the maximum likelihood estimation (MLE) solution of the hashing functions of the target and source domains due to the domain gap. Furthermore, an alternating optimization method is adopted to obtain the two projections of the target and source domains such that the domain hashing disparity is reduced gradually. Extensive experiments on various benchmark databases verify that our method outperforms many state-of-the-art learning to hash methods. The implementation details are available at https://github.com/liuji93/GTH. |
|||||
2019 | Graph-based Multi-view Binary Learning For Image Clustering | Jiang Guangqi, Wang Huibing, Peng Jinjia, Chen Dongyan, Fu Xianping | Arxiv | Hashing techniques, also known as binary code learning, have recently gained increasing attention in large-scale data analysis and storage. Generally, most existing hash clustering methods are single-view ones, which lack complete structure or complementary information from multiple views. For clustering tasks, prior research mainly focuses on learning discrete hash codes, while few works take the original data structure into consideration. To address these problems, we propose a novel binary code learning algorithm for clustering, called Graph-based Multi-view Binary Learning (GMBL), which adopts graph embedding to preserve the original data structure. GMBL mainly focuses on encoding the information of multiple views into a compact binary code, which explores complementary information from multiple views. In particular, in order to maintain the graph-based structure of the original data, we adopt a Laplacian matrix to preserve the local linear relationship of the data and map it to the Hamming space. Considering that different views make distinctive contributions to the final clustering results, GMBL adopts a strategy of automatically assigning weights to each view to better guide the clustering. Finally, an alternating iterative optimization method is adopted to optimize discrete binary codes directly instead of relaxing the binary constraint in two steps. Experiments on five public datasets demonstrate the superiority of our proposed method compared with previous approaches in terms of clustering performance. |
|||||
2019 | On The Evaluation Metric For Hashing | Jiang Qing-yuan, Li Ming-wei, Li Wu-jun | Arxiv | Due to its low storage cost and fast query speed, hashing has been widely used for large-scale approximate nearest neighbor (ANN) search. Bucket search, also called hash lookup, can achieve fast query speed with a sub-linear time cost based on the inverted index table constructed from hash codes. Many metrics have been adopted to evaluate hashing algorithms. However, all existing metrics are ill-suited for evaluating hash codes for bucket search. On one hand, all existing metrics ignore the retrieval time cost, which is an important factor reflecting search performance. On the other hand, some of them, such as mean average precision (MAP), suffer from the uncertainty problem as the ranked list is based on integer-valued Hamming distance, and are insensitive to Hamming radius as these metrics only depend on relative Hamming distance. Other metrics, such as precision at Hamming radius R, fail to evaluate global performance as these metrics only depend on one specific Hamming radius. In this paper, we first point out the problems of existing metrics which have been ignored by the hashing community, and then propose a novel evaluation metric called radius aware mean average precision (RAMAP) to evaluate hash codes for bucket search. Furthermore, two coding strategies are also proposed to qualitatively show the problems of existing metrics. Experiments demonstrate that our proposed RAMAP can provide more proper evaluation than existing metrics. |
|||||
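The entry above critiques metrics such as precision at Hamming radius R for depending on a single radius and for ignoring retrieval time. For reference, here is a minimal NumPy sketch of that conventional metric for one query; the function name and array layout are illustrative choices of this example, not anything defined by the paper.

```python
import numpy as np

def precision_at_hamming_radius(query_code, db_codes, relevant, radius):
    """Fraction of items retrieved by a bucket/radius search (Hamming
    distance <= radius from the query) that are truly relevant."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)  # integer Hamming distances
    retrieved = dists <= radius
    if not retrieved.any():
        return 0.0  # empty bucket: nothing retrieved at this radius
    return float(relevant[retrieved].mean())

# Toy usage: 8-bit codes for 4 database items, radius 2
db = np.array([[0,0,0,0,1,1,1,1], [0,0,0,0,0,0,0,0], [1,1,1,1,0,0,0,0], [0,0,0,1,1,1,1,1]])
q = np.array([0,0,0,0,1,1,1,1])
rel = np.array([True, False, False, True])
print(precision_at_hamming_radius(q, db, rel, radius=2))  # items 0 and 3 are retrieved
```

As the abstract notes, this number changes with the choice of radius, which is one of the issues RAMAP is designed to address.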
2019 | End-to-end Efficient Representation Learning Via Cascading Combinatorial Optimization | Jeong Yeonwoo, Kim Yoonsung, Song Hyun Oh | Arxiv | We develop hierarchically quantized efficient embedding representations for similarity-based search and show that this representation not only provides state-of-the-art search accuracy but also yields several orders of magnitude speedup during inference. The idea is to hierarchically quantize the representation so that the quantization granularity is greatly increased while maintaining the accuracy and keeping the computational complexity low. We also show that the problem of finding the optimal sparse compound hash code respecting the hierarchical structure can be optimized in polynomial time via minimum cost flow in an equivalent flow network. This allows us to train the method end-to-end in a mini-batch stochastic gradient descent setting. Our experiments on Cifar100 and ImageNet datasets show state-of-the-art search accuracy while providing several orders of magnitude search speedup over exhaustive linear search on the dataset. |
|||||
2019 | Lucene For Approximate Nearest-neighbors Search On Arbitrary Dense Vectors | Teofili Tommaso, Lin Jimmy | Arxiv | We demonstrate three approaches for adapting the open-source Lucene search library to perform approximate nearest-neighbor search on arbitrary dense vectors, using similarity search on word embeddings as a case study. At its core, Lucene is built around inverted indexes of a document collection’s (sparse) term-document matrix, which is incompatible with the lower-dimensional dense vectors that are common in deep learning applications. We evaluate three techniques to overcome these challenges that can all be natively integrated into Lucene: the creation of documents populated with fake words, LSH applied to lexical realizations of dense vectors, and k-d trees coupled with dimensionality reduction. Experiments show that the “fake words” approach represents the best balance between effectiveness and efficiency. These techniques are integrated into the Anserini open-source toolkit and made available to the community. |
|||||
2019 | Diskann Fast Accurate Billion-point Nearest Neighbor Search On A Single Node | Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnawamy, Rohan Kadekodi | Neural Information Processing Systems | Current state-of-the-art approximate nearest neighbor search (ANNS) algorithms generate indices that must be stored in main memory for fast high-recall search. This makes them expensive and limits the size of the dataset. We present a new graph-based indexing and search system called DiskANN that can index, store, and search a billion point database on a single workstation with just 64GB RAM and an inexpensive solid-state drive (SSD). Contrary to current wisdom, we demonstrate that the SSD-based indices built by DiskANN can meet all three desiderata for large-scale ANNS: high recall, low query latency and high density (points indexed per node). On the billion point SIFT1B bigann dataset, DiskANN serves > 5000 queries a second with < 3ms mean latency and 95%+ 1-recall@1 on a 16 core machine, where state-of-the-art billion-point ANNS algorithms with a similar memory footprint, like FAISS and IVFOADC+G+P, plateau at around 50% 1-recall@1. Alternately, in the high recall regime, DiskANN can index and serve 5-10x more points per node compared to state-of-the-art graph-based methods such as HNSW and NSG. Finally, as part of our overall DiskANN system, we introduce Vamana, a new graph-based ANNS index that is more versatile than existing graph indices, even for in-memory search. |
|||||
2019 | Unsupervised Neural Quantization For Compressed-domain Similarity Search | Morozov Stanislav, Babenko Artem | Arxiv | We tackle the problem of unsupervised visual descriptors compression, which is a key ingredient of large-scale image retrieval systems. While the deep learning machinery has benefited literally all computer vision pipelines, the existing state-of-the-art compression methods employ shallow architectures, and we aim to close this gap by our paper. In more detail, we introduce a DNN architecture for the unsupervised compressed-domain retrieval, based on multi-codebook quantization. The proposed architecture is designed to incorporate both fast data encoding and efficient distances computation via lookup tables. We demonstrate the exceptional advantage of our scheme over existing quantization approaches on several datasets of visual descriptors via outperforming the previous state-of-the-art by a large margin. |
|||||
2019 | Fpscreen A Rapid Similarity Search Tool For Massive Molecular Library Based On Molecular Fingerprint Comparison | Wang Lijun, Gong Jianbing, Zhang Yingxia, Liu Tianmou, Gao Junhui | Arxiv | We designed a fast similarity search engine for large molecular libraries: FPScreen. We downloaded 100 million molecules’ structure files in PubChem with SDF extension, then applied the computational chemistry tool RDKit to convert each structure file into one line of text in MACCS format and stored them in a text file as our molecule library. The search engine computes similarity while traversing the 166-bit strings in the library file line by line. FPScreen can complete a similarity search through the 100 million entries in our molecule library within one hour, which is very fast for a computational biology tool. Additionally, we divided our library into several strides for parallel processing. FPScreen was developed as a web application. |
|||||
2019 | Deephashing Using Tripletloss | James Jithin | Arxiv | Hashing is one of the most efficient techniques for approximate nearest neighbour search for large scale image retrieval. Most of the techniques are based on hand-engineered features and do not give optimal results all the time. Deep Convolutional Neural Networks have proven to generate very effective representations of images that are used for various computer vision tasks, and inspired by this, several deep hashing models such as Wang et al. (2016) have been proposed. These models train on the triplet loss function, which can be used to train models with superior representation capabilities. Building on the latest advancements in triplet-loss training, I propose new techniques that help deep hashing models train faster and more efficiently. Experimental results show that using the more efficient techniques for training on the triplet loss, we obtain a 5% improvement in our model compared to the original work of Wang et al. (2016). Using a larger model and more training data, we can drastically improve performance with the techniques we propose. |
|||||
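Since the entry above centers on training deep hashing models with the triplet loss, a minimal PyTorch sketch of that loss may be useful; the margin value and the assumption that the network outputs real-valued (relaxed) hash embeddings are choices of this example, not details taken from the report.

```python
import torch
import torch.nn.functional as F

def triplet_hash_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on relaxed hash embeddings:
    the positive should be closer to the anchor than the negative by `margin`."""
    d_pos = F.pairwise_distance(anchor, positive)  # per-sample Euclidean distances
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with random 48-dimensional embeddings for a batch of 8 triplets
a, p, n = (torch.randn(8, 48, requires_grad=True) for _ in range(3))
loss = triplet_hash_loss(a, p, n)
loss.backward()
```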
2019 | An Empirical Comparison Of FAISS And FENSHSES For Nearest Neighbor Search In Hamming Space | Mu Cun, Yang Binwei, Yan Zheng | Arxiv | In this paper, we compare the performance of FAISS and FENSHSES on nearest neighbor search in Hamming space, a fundamental task with ubiquitous applications in today's eCommerce. Comprehensive evaluations are made in terms of indexing speed, search latency and RAM consumption. This comparison is conducted towards a better understanding of the trade-offs between nearest neighbor search systems implemented in main memory and those implemented in secondary memory, which is largely unaddressed in the literature. |
|||||
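Both systems compared in the entry above ultimately reduce to computing Hamming distances over packed binary codes (XOR followed by a population count). A minimal NumPy sketch of that primitive, using byte-packed codes; the packing layout and function names here are illustrative choices, not either library's API.

```python
import numpy as np

def pack_codes(bits):
    """Pack an (N, n_bits) array of 0/1 values into (N, ceil(n_bits/8)) bytes."""
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_distances(packed_query, packed_db):
    """Hamming distance = number of set bits in the XOR of the packed codes."""
    xor = np.bitwise_xor(packed_db, packed_query)   # broadcasts over the database rows
    return np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per row

rng = np.random.default_rng(0)
db_bits = rng.integers(0, 2, size=(1000, 64))
q_bits = rng.integers(0, 2, size=(1, 64))
dists = hamming_distances(pack_codes(q_bits), pack_codes(db_bits))
print(dists.min(), dists.mean())
```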
2019 | Understanding Sparse JL For Feature Hashing | Meena Jagadeesan | Neural Information Processing Systems | Feature hashing and other random projection schemes are commonly used to reduce the dimensionality of feature vectors. The goal is to efficiently project a high-dimensional feature vector living in \(\mathbb{R}^n\) into a much lower-dimensional space \(\mathbb{R}^m\), while approximately preserving Euclidean norm. These schemes can be constructed using sparse random projections, for example using a sparse Johnson-Lindenstrauss (JL) transform. A line of work introduced by Weinberger et al. (ICML ‘09) analyzes the accuracy of sparse JL with sparsity 1 on feature vectors with small \(\ell_\infty\)-to-\(\ell_2\) norm ratio. Recently, Freksen, Kamma, and Larsen (NeurIPS ‘18) closed this line of work by proving a tight tradeoff between \(\ell_\infty\)-to-\(\ell_2\) norm ratio and accuracy for sparse JL with sparsity 1. In this paper, we demonstrate the benefits of using sparsity s greater than 1 in sparse JL on feature vectors. Our main result is a tight tradeoff between \(\ell_\infty\)-to-\(\ell_2\) norm ratio and accuracy for a general sparsity s, that significantly generalizes the result of Freksen et al. Our result theoretically demonstrates that sparse JL with s > 1 can have significantly better norm-preservation properties on feature vectors than sparse JL with s = 1; we also empirically demonstrate this finding. |
|||||
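The entry above analyzes sparse JL transforms with sparsity s > 1. The construction itself is simple: every input coordinate is mapped to s output rows with random signs and scaling 1/sqrt(s). Below is a small dense-matrix sketch of that standard construction, assuming the usual "exactly s nonzeros per column" variant; a practical implementation would use hash functions rather than materializing the matrix.

```python
import numpy as np

def sparse_jl_matrix(n, m, s, seed=0):
    """Sparse JL matrix with exactly s nonzeros per column: each input
    coordinate is hashed to s distinct output rows with random +-1 signs,
    scaled by 1/sqrt(s) so norms are preserved in expectation."""
    rng = np.random.default_rng(seed)
    A = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)
        A[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return A

# A feature vector with small l_inf-to-l_2 ratio is flattened well by the transform
x = np.random.default_rng(1).normal(size=5000)
A = sparse_jl_matrix(n=5000, m=256, s=4)
print(np.linalg.norm(A @ x) / np.linalg.norm(x))  # typically close to 1
```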
2019 | Efficient Bitmap-based Indexing And Retrieval Of Similarity Search Image Queries | Jafari Omid, Nagarkar Parth, Montaño Jonathan | | Finding similar images is a necessary operation in many multimedia applications. Images are often represented and stored as a set of high-dimensional features, which are extracted using localized feature extraction algorithms. Locality Sensitive Hashing is one of the most popular approximate processing techniques for finding similar points in high-dimensional spaces. Locality Sensitive Hashing (LSH) and its variants are designed to find similar points, but they are not designed to find objects (such as images, which are made up of a collection of points) efficiently. In this paper, we propose an index structure, Bitmap-Image LSH (bImageLSH), for efficient processing of high-dimensional images. Using a real dataset, we experimentally show the performance benefit of our novel design while keeping the accuracy of the image results high. |
|||||
2019 | Clustered Hierarchical Entropy-scaling Search Of Astronomical And Biological Data | Ishaq Najib, Student George, Daniels Noah M. | Arxiv | Both astronomy and biology are experiencing explosive growth of data, resulting in a “big data” problem that stands in the way of a “big data” opportunity for discovery. One common question asked of such data is that of approximate search (\(\rho-\)nearest neighbors search). We present a hierarchical search algorithm for such data sets that takes advantage of particular geometric properties apparent in both astronomical and biological data sets, namely the metric entropy and fractal dimensionality of the data. We present CHESS (Clustered Hierarchical Entropy-Scaling Search), a search tool with virtually no loss in specificity or sensitivity, demonstrating a \(13.6\times\) speedup over linear search on the Sloan Digital Sky Survey’s APOGEE data set and a \(68\times\) speedup on the GreenGenes 16S metagenomic data set, as well as asymptotically fewer distance comparisons on APOGEE when compared to the FALCONN locality-sensitive hashing library. CHESS demonstrates an asymptotic complexity not directly dependent on data set size, and is in practice at least an order of magnitude faster than linear search by performing fewer distance comparisons. Unlike locality-sensitive hashing approaches, CHESS can work with any user-defined distance function. CHESS also allows for implicit data compression, which we demonstrate on the APOGEE data set. We also discuss an extension allowing for efficient k-nearest neighbors search. |
|||||
2019 | Compositional Coding For Collaborative Filtering | Liu Chenghao, Lu Tao, Wang Xin, Cheng Zhiyong, Sun Jianling, Hoi Steven C. H. | Arxiv | Efficiency is crucial to online recommender systems. Representing users and items as binary vectors for Collaborative Filtering (CF) enables fast user-item affinity computation in the Hamming space, and in recent years we have witnessed an emerging research effort in exploiting binary hashing techniques for CF methods. However, CF with binary codes naturally suffers from low accuracy due to limited representation capability in each bit, which impedes it from modeling the complex structure of the data. In this work, we attempt to improve the efficiency without hurting the model performance by utilizing both the accuracy of real-valued vectors and the efficiency of binary codes to represent users/items. In particular, we propose the Compositional Coding for Collaborative Filtering (CCCF) framework, which not only gains better recommendation efficiency than the state-of-the-art binarized CF approaches but also achieves even higher accuracy than the real-valued CF method. Specifically, CCCF innovatively represents each user/item with a set of binary vectors, which are associated with a sparse real-valued weight vector. Each value of the weight vector encodes the importance of the corresponding binary vector to the user/item. The continuous weight vectors greatly enhance the representation capability of binary codes, and their sparsity guarantees the processing speed. Furthermore, an integer weight approximation scheme is proposed to further accelerate the speed. Based on the CCCF framework, we design an efficient discrete optimization algorithm to learn its parameters. Extensive experiments on three real-world datasets show that our method outperforms the state-of-the-art binarized CF methods (and even achieves better performance than the real-valued CF method) by a large margin in terms of both recommendation accuracy and efficiency. |
|||||
2019 | Learning Hash Function Through Codewords | Huang Yinjie, Georgiopoulos Michael, Anagnostopoulos Georgios C. | Arxiv | In this paper, we propose a novel hash learning approach that has the following main distinguishing features when compared to past frameworks. First, codewords in the Hamming space are utilized as an ancillary technique to accomplish the hash learning task. These codewords, which are inferred from the data, attempt to capture grouping aspects of the data’s hash codes. Furthermore, the proposed framework is capable of addressing supervised, unsupervised and, even, semi-supervised hash learning scenarios. Additionally, the framework adopts a regularization term over the codewords, which automatically chooses the codewords for the problem. To efficiently solve the problem, a Block Coordinate Descent algorithm is showcased in the paper. We also show that one step of the algorithm can be cast as several Support Vector Machine problems, which enables our algorithm to utilize efficient software packages. For the regularization term, a closed-form solution of the proximal operator is provided in the paper. A series of comparative experiments focused on content-based image retrieval highlights its performance advantages. |
|||||
2019 | Weakly-paired Cross-modal Hashing | Liu Xuanwu, Wang Jun, Yu Guoxian, Domeniconi Carlotta, Zhang Xiangliang | Arxiv | Hashing has been widely adopted for large-scale data retrieval in many domains, due to its low storage cost and high retrieval speed. Existing cross-modal hashing methods optimistically assume that the correspondence between training samples across modalities is readily available. This assumption is unrealistic in practical applications. In addition, these methods generally require the same number of samples across different modalities, which restricts their flexibility. We propose a flexible cross-modal hashing approach (Flex-CMH) to learn effective hashing codes from weakly-paired data, whose correspondence across modalities is partially (or even totally) unknown. FlexCMH first introduces a clustering-based matching strategy to explore the local structure of each cluster, and thus to find the potential correspondence between clusters (and samples therein) across modalities. To reduce the impact of an incomplete correspondence, it jointly optimizes in a unified objective function the potential correspondence, the cross-modal hashing functions derived from the correspondence, and a hashing quantization loss. An alternating optimization technique is also proposed to coordinate the correspondence and hash functions, and to reinforce the reciprocal effects of the two objectives. Experiments on public multi-modal datasets show that FlexCMH achieves significantly better results than state-of-the-art methods, and it indeed offers a high degree of flexibility for practical cross-modal hashing tasks. |
|||||
2019 | Using Deep Cross Modal Hashing And Error Correcting Codes For Improving The Efficiency Of Attribute Guided Facial Image Retrieval | Talreja Veeru, Taherkhani Fariborz, Valenti Matthew C., Nasrabadi Nasser M. | Arxiv | With benefits of fast query speed and low storage cost, hashing-based image retrieval approaches have garnered considerable attention from the research community. In this paper, we propose a novel Error-Corrected Deep Cross Modal Hashing (CMH-ECC) method which uses a bitmap specifying the presence of certain facial attributes as an input query to retrieve relevant face images from the database. In this architecture, we generate compact hash codes using an end-to-end deep learning module, which effectively captures the inherent relationships between the face and attribute modality. We also integrate our deep learning module with forward error correction codes to further reduce the distance between different modalities of the same subject. Specifically, the properties of deep hashing and forward error correction codes are exploited to design a cross modal hashing framework with high retrieval performance. Experimental results using two standard datasets with facial attributes-image modalities indicate that our CMH-ECC face image retrieval model outperforms most of the current attribute-based face image retrieval approaches. |
|||||
2019 | Ranking-based Deep Cross-modal Hashing | Liu Xuanwu, Yu Guoxian, Domeniconi Carlotta, Wang Jun, Ren Yazhou, Guo Maozu | Arxiv | Cross-modal hashing has been receiving increasing interests for its low storage cost and fast query speed in multi-modal data retrievals. However, most existing hashing methods are based on hand-crafted or raw level features of objects, which may not be optimally compatible with the coding process. Besides, these hashing methods are mainly designed to handle simple pairwise similarity. The complex multilevel ranking semantic structure of instances associated with multiple labels has not been well explored yet. In this paper, we propose a ranking-based deep cross-modal hashing approach (RDCMH). RDCMH firstly uses the feature and label information of data to derive a semi-supervised semantic ranking list. Next, to expand the semantic representation power of hand-crafted features, RDCMH integrates the semantic ranking information into deep cross-modal hashing and jointly optimizes the compatible parameters of deep feature representations and of hashing functions. Experiments on real multi-modal datasets show that RDCMH outperforms other competitive baselines and achieves the state-of-the-art performance in cross-modal retrieval applications. |
|||||
2019 | Fast And Exact Nearest Neighbor Search In Hamming Space On Full-text Search Engines | Mu Cun, Zhao Jun, Yang Guang, Yang Binwei, Yan Zheng | Arxiv | A growing interest has been witnessed recently from both academia and industry in building nearest neighbor search (NNS) solutions on top of full-text search engines. Compared with other NNS systems, such solutions are capable of effectively reducing main memory consumption, coherently supporting multi-model search and being immediately ready for production deployment. In this paper, we continue the journey to explore specifically how to empower full-text search engines with fast and exact NNS in Hamming space (i.e., the set of binary codes). By revisiting three techniques (bit operation, sub-code filtering and data preprocessing with permutation) from the information retrieval literature, we develop a novel engineering solution for full-text search engines to efficiently accomplish this special but important NNS task. In the experiments, we show that our proposed approach enables full-text search engines to achieve significant speed-ups over the state-of-the-art term match approach for NNS within binary codes. |
|||||
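The sub-code filtering technique revisited in the entry above rests on a pigeonhole argument: if a binary code is split into r+1 substrings and its Hamming distance to the query is at most r, then at least one substring matches the query's substring exactly. Below is a minimal Python sketch of that filter over codes stored as bit strings; the data layout and function names are illustrative and do not reflect the paper's actual search engine integration.

```python
from collections import defaultdict

def build_subcode_index(codes, n_chunks):
    """Index each code under each of its n_chunks substrings, analogous to
    how a full-text engine would index them as terms in separate fields."""
    chunk_len = len(codes[0]) // n_chunks
    tables = [defaultdict(list) for _ in range(n_chunks)]
    for i, code in enumerate(codes):
        for k in range(n_chunks):
            tables[k][code[k * chunk_len:(k + 1) * chunk_len]].append(i)
    return tables, chunk_len

def hamming_radius_search(query, codes, tables, chunk_len, radius):
    """Exact r-neighbor search: gather candidates that share at least one
    substring with the query, then verify their full Hamming distance."""
    candidates = set()
    for k, table in enumerate(tables):
        candidates.update(table.get(query[k * chunk_len:(k + 1) * chunk_len], []))
    dist = lambda a, b: sum(x != y for x, y in zip(a, b))
    return sorted(i for i in candidates if dist(query, codes[i]) <= radius)

codes = ["0000111101010101", "0000000011111111", "1111000000001111"]
# Split into radius + 1 = 3 chunks so the pigeonhole guarantee holds for radius 2
tables, clen = build_subcode_index(codes, n_chunks=3)
print(hamming_radius_search("0000111101010111", codes, tables, clen, radius=2))  # [0]
```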
2019 | Deep Triplet Quantization | Liu Bin, Cao Yue, Long Mingsheng, Wang Jianmin, Wang Jingdong | Arxiv | Deep hashing establishes efficient and effective image retrieval by end-to-end learning of deep representations and hash codes from similarity data. We present a compact coding solution, focusing on deep learning to quantization approach that has shown superior performance over hashing solutions for similarity retrieval. We propose Deep Triplet Quantization (DTQ), a novel approach to learning deep quantization models from the similarity triplets. To enable more effective triplet training, we design a new triplet selection approach, Group Hard, that randomly selects hard triplets in each image group. To generate compact binary codes, we further apply a triplet quantization with weak orthogonality during triplet training. The quantization loss reduces the codebook redundancy and enhances the quantizability of deep representations through back-propagation. Extensive experiments demonstrate that DTQ can generate high-quality and compact binary codes, which yields state-of-the-art image retrieval performance on three benchmark datasets, NUS-WIDE, CIFAR-10, and MS-COCO. |
|||||
2019 | Space-efficient Algorithms For Computing Minimal/shortest Unique Substrings | Mieno Takuya, Köppl Dominik, Nakashima Yuto, Inenaga Shunsuke, Bannai Hideo, Takeda Masayuki | Arxiv | Given a string \(T\) of length \(n\), a substring \(u = T[i..j]\) of \(T\) is called a shortest unique substring (SUS) for an interval \([s,t]\) if (a) \(u\) occurs exactly once in \(T\), (b) \(u\) contains the interval \([s,t]\) (i.e. \(i \leq s \leq t \leq j\)), and (c) every substring \(v\) of \(T\) with \(|v| < |u|\) containing \([s,t]\) occurs at least twice in \(T\). Given a query interval \([s, t] \subset [1, n]\), the interval SUS problem is to output all the SUSs for the interval \([s,t]\). In this article, we propose a \(4n + o(n)\) bits data structure answering an interval SUS query in output-sensitive \(O(\mathit{occ})\) time, where \(\mathit{occ}\) is the number of returned SUSs. Additionally, we focus on the point SUS problem, which is the interval SUS problem for \(s = t\). Here, we propose a \(\lceil (log_2{3} + 1)n \rceil + o(n)\) bits data structure answering a point SUS query in the same output-sensitive time. We also propose space-efficient algorithms for computing the minimal unique substrings of \(T\). |
|||||
2019 | 2-bit Model Compression Of Deep Convolutional Neural Network On ASIC Engine For Image Retrieval | Yang Bin, Yang Lin, Li Xiaochun, Zhang Wenhan, Zhou Hua, Zhang Yequn, Ren Yongxiong, Shi Yinbo | Arxiv | Image retrieval utilizes image descriptors to retrieve the most similar images to a given query image. Convolutional neural network (CNN) is becoming the dominant approach to extract image descriptors for image retrieval. For low-power hardware implementation of image retrieval, the drawback of CNN-based feature descriptor is that it requires hundreds of megabytes of storage. To address this problem, this paper applies deep model quantization and compression to CNN in ASIC chip for image retrieval. It is demonstrated that the CNN-based features descriptor can be extracted using as few as 2-bit weights quantization to deliver a similar performance as floating-point model for image retrieval. In addition, to implement CNN in ASIC, especially for large scale images, the limited buffer size of chips should be considered. To retrieve large scale images, we propose an improved pooling strategy, region nested invariance pooling (RNIP), which uses cropped sub-images for CNN. Testing results on chip show that integrating RNIP with the proposed 2-bit CNN model compression approach is capable of retrieving large scale images. |
|||||
2019 | Shared Predictive Cross-modal Deep Quantization | Yang Erkun, Deng Cheng, Li Chao, Liu Wei, Li Jie, Tao Dacheng | Arxiv | With explosive growth of data volume and ever-increasing diversity of data modalities, cross-modal similarity search, which conducts nearest neighbor search across different modalities, has been attracting increasing interest. This paper presents a deep compact code learning solution for efficient cross-modal similarity search. Many recent studies have proven that quantization-based approaches perform generally better than hashing-based approaches on single-modal similarity search. In this paper, we propose a deep quantization approach, which is among the early attempts of leveraging deep neural networks into quantization-based cross-modal similarity search. Our approach, dubbed shared predictive deep quantization (SPDQ), explicitly formulates a shared subspace across different modalities and two private subspaces for individual modalities, and representations in the shared subspace and the private subspaces are learned simultaneously by embedding them to a reproducing kernel Hilbert space, where the mean embedding of different modality distributions can be explicitly compared. In addition, in the shared subspace, a quantizer is learned to produce the semantics preserving compact codes with the help of label alignment. Thanks to this novel network architecture in cooperation with supervised quantization training, SPDQ can preserve intramodal and intermodal similarities as much as possible and greatly reduce quantization error. Experiments on two popular benchmarks corroborate that our approach outperforms state-of-the-art methods. |
|||||
2019 | Asymmetric Deep Semantic Quantization For Image Retrieval | Yang Zhan, Raymond Osolo Ian, Sun Wuqing, Long Jun | Arxiv | Due to its fast retrieval and storage efficiency capabilities, hashing has been widely used in nearest neighbor retrieval tasks. By using deep learning based techniques, hashing can outperform non-learning based hashing techniques in many applications. However, we argue that current deep learning based hashing methods ignore some critical problems (e.g., the learned hash codes are not discriminative because the hashing methods are unable to discover rich semantic information, and the training strategy has difficulty optimizing the discrete binary codes). In this paper, we propose a novel image hashing method, termed Asymmetric Deep Semantic Quantization (ADSQ). ADSQ is implemented using a three-stream framework, which consists of one LabelNet and two ImgNets. The LabelNet leverages the power of three fully-connected layers, which are used to capture rich semantic information between image pairs. The two ImgNets each adopt the same convolutional neural network structure, but with different weights (i.e., asymmetric convolutional neural networks). The two ImgNets are used to generate discriminative compact hash codes. Specifically, the function of the LabelNet is to capture rich semantic information that is used to guide the two ImgNets in minimizing the gap between the real-continuous features and the discrete binary codes. Furthermore, ADSQ can utilize the most critical semantic information to guide the feature learning process and consider the consistency of the common semantic space and Hamming space. Experimental results on three benchmarks (i.e., CIFAR-10, NUS-WIDE, and ImageNet) demonstrate that the proposed ADSQ outperforms current state-of-the-art methods. |
|||||
2019 | One Network For Multi-domains Domain Adaptive Hashing With Intersectant Generative Adversarial Network | He Tao, Li Yuan-fang, Gao Lianli, Zhang Dongxiang, Song Jingkuan | Arxiv | With the recent explosive increase of digital data, image recognition and retrieval become a critical practical application. Hashing is an effective solution to this problem, due to its low storage requirement and high query speed. However, most past works focus on hashing in a single (source) domain. Thus, the learned hash function may not adapt well in a new (target) domain that has a large distributional difference from the source domain. In this paper, we explore an end-to-end domain adaptive learning framework that simultaneously and precisely generates discriminative hash codes and classifies target domain images. Our method encodes images from the two domains into a common semantic space, followed by two independent generative adversarial networks aiming at crosswise reconstruction of the two domains’ images, reducing domain disparity and improving alignment in the shared space. We evaluate our framework on four public benchmark datasets, all of which show that our method is superior to the other state-of-the-art methods on the tasks of object recognition and image retrieval. |
|||||
2019 | Near Neighbor Who Is The Fairest Of Them All | Sariel Har-peled, Sepideh Mahabadi | Neural Information Processing Systems | In this work we study a “fair” variant of the near neighbor problem. Namely, given a set of \(n\) points \(P\) and a parameter \(r\), the goal is to preprocess the points such that, given a query point \(q\), any point in the \(r\)-neighborhood of the query, i.e., \(B(q,r)\), has the same probability of being reported as the near neighbor. We show that LSH based algorithms can be made fair, without a significant loss in efficiency. Specifically, we show an algorithm that reports a point \(p\) in the \(r\)-neighborhood of a query \(q\) with almost uniform probability. The time to report such a point is proportional to \(O(\mathrm{dns}(q,r)\, Q(n,c))\), and its space is \(O(S(n,c))\), where \(Q(n,c)\) and \(S(n,c)\) are the query time and space of an LSH algorithm for \(c\)-approximate near neighbor, and \(\mathrm{dns}(q,r)\) is a function of the local density around \(q\). Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. Finally, we run experiments to show the performance of our approach on real data. |
|||||
2019 | A Survey Of Binary Code Similarity | Haq Irfan Ul, Caballero Juan | Arxiv | Binary code similarity approaches compare two or more pieces of binary code to identify their similarities and differences. The ability to compare binary code enables many real-world applications on scenarios where source code may not be available such as patch analysis, bug search, and malware detection and analysis. Over the past 20 years numerous binary code similarity approaches have been proposed, but the research area has not yet been systematically analyzed. This paper presents a first survey of binary code similarity. It analyzes 61 binary code similarity approaches, which are systematized on four aspects: (1) the applications they enable, (2) their approach characteristics, (3) how the approaches are implemented, and (4) the benchmarks and methodologies used to evaluate them. In addition, the survey discusses the scope and origins of the area, its evolution over the past two decades, and the challenges that lie ahead. |
|||||
2019 | Distillhash Unsupervised Deep Hashing By Distilling Data Pairs | Yang Erkun, Liu Tongliang, Deng Cheng, Liu Wei, Tao Dacheng | Arxiv | Due to the high storage and search efficiency, hashing has become prevalent for large-scale similarity search. Particularly, deep hashing methods have greatly improved the search performance under supervised scenarios. In contrast, unsupervised deep hashing models can hardly achieve satisfactory performance due to the lack of reliable supervisory similarity signals. To address this issue, we propose a novel deep unsupervised hashing model, dubbed DistillHash, which can learn a distilled data set consisting of data pairs that have confident similarity signals. Specifically, we investigate the relationship between the initial noisy similarity signals learned from local structures and the semantic similarity labels assigned by a Bayes optimal classifier. We show that under a mild assumption, some data pairs, whose labels are consistent with those assigned by the Bayes optimal classifier, can be potentially distilled. Inspired by this fact, we design a simple yet effective strategy to distill data pairs automatically and further adopt a Bayesian learning framework to learn hash functions from the distilled data set. Extensive experimental results on three widely used benchmark datasets show that the proposed DistillHash consistently accomplishes state-of-the-art search performance. |
|||||
2019 | Optimal Few-weight Codes From Simplicial Complexes | Wu Yansheng, Zhu Xiaomeng, Yue Qin | Arxiv | Recently, some infinite families of binary minimal and optimal linear codes are constructed from simplicial complexes by Hyun {\em et al}. Inspired by their work, we present two new constructions of codes over the ring \(\Bbb F_2+u\Bbb F_2\) by employing simplicial complexes. When the simplicial complexes are all generated by a maximal element, we determine the Lee weight distributions of two classes of the codes over \(\Bbb F_2+u\Bbb F_2\). Our results show that the codes have few Lee weights. Via the Gray map, we obtain an infinite family of binary codes meeting the Griesmer bound and a class of binary distance optimal codes. |
|||||
2019 | Unsupervised Neural Generative Semantic Hashing | Hansen Casper, Hansen Christian, Simonsen Jakob Grue, Alstrup Stephen, Lioma Christina | Arxiv | Fast similarity search is a key component in large-scale information retrieval, where semantic hashing has become a popular strategy for representing documents as binary hash codes. Recent advances in this area have been obtained through neural network based models: generative models trained by learning to reconstruct the original documents. We present a novel unsupervised generative semantic hashing approach, \textit{Ranking based Semantic Hashing} (RBSH) that consists of both a variational and a ranking based component. Similarly to variational autoencoders, the variational component is trained to reconstruct the original document conditioned on its generated hash code, and as in prior work, it only considers documents individually. The ranking component solves this limitation by incorporating inter-document similarity into the hash code generation, modelling document ranking through a hinge loss. To circumvent the need for labelled data to compute the hinge loss, we use a weak labeller and thus keep the approach fully unsupervised. Extensive experimental evaluation on four publicly available datasets against traditional baselines and recent state-of-the-art methods for semantic hashing shows that RBSH significantly outperforms all other methods across all evaluated hash code lengths. In fact, RBSH hash codes are able to perform similarly to state-of-the-art hash codes while using 2-4x fewer bits. |
|||||
2019 | Deep Heterogeneous Hashing For Face Video Retrieval | Qiao Shishi, Wang Ruiping, Shan Shiguang, Chen Xilin | IEEE Transactions on Image Processing | Retrieving videos of a particular person with face image as a query via hashing technique has many important applications. While face images are typically represented as vectors in Euclidean space, characterizing face videos with some robust set modeling techniques (e.g. covariance matrices as exploited in this study, which reside on Riemannian manifold), has recently shown appealing advantages. This hence results in a thorny heterogeneous spaces matching problem. Moreover, hashing with handcrafted features as done in many existing works is clearly inadequate to achieve desirable performance for this task. To address such problems, we present an end-to-end Deep Heterogeneous Hashing (DHH) method that integrates three stages including image feature learning, video modeling, and heterogeneous hashing in a single framework, to learn unified binary codes for both face images and videos. To tackle the key challenge of hashing on the manifold, a well-studied Riemannian kernel mapping is employed to project data (i.e. covariance matrices) into Euclidean space and thus enables to embed the two heterogeneous representations into a common Hamming space, where both intra-space discriminability and inter-space compatibility are considered. To perform network optimization, the gradient of the kernel mapping is innovatively derived via structured matrix backpropagation in a theoretically principled way. Experiments on three challenging datasets show that our method achieves quite competitive performance compared with existing hashing methods. |
|||||
2019 | Constructing Minimal Perfect Hash Functions Using SAT Technology | Weaver Sean, Heule Marijn | Arxiv | Minimal perfect hash functions (MPHFs) are used to provide efficient access to values of large dictionaries (sets of key-value pairs). Discovering new algorithms for building MPHFs is an area of active research, especially from the perspective of storage efficiency. The information-theoretic limit for MPHFs is 1/(ln 2) or roughly 1.44 bits per key. The current best practical algorithms range between 2 and 4 bits per key. In this article, we propose two SAT-based constructions of MPHFs. Our first construction yields MPHFs near the information-theoretic limit. For this construction, current state-of-the-art SAT solvers can handle instances where the dictionaries contain up to 40 elements, thereby outperforming the existing (brute-force) methods. Our second construction uses XOR-SAT filters to realize a practical approach with long-term storage of approximately 1.83 bits per key. |
|||||
2019 | Polytopes Lattices And Spherical Codes For The Nearest Neighbor Problem | Laarhoven Thijs | ICALP | We study locality-sensitive hash methods for the nearest neighbor problem for the angular distance, focusing on the approach of first projecting down onto a low-dimensional subspace, and then partitioning the projected vectors according to Voronoi cells induced by a suitable spherical code. This approach generalizes and interpolates between the fast but suboptimal hyperplane hashing of Charikar [STOC’02] and the asymptotically optimal but practically often slower hash families of Andoni-Indyk [FOCS’06], Andoni-Indyk-Nguyen-Razenshteyn [SODA’14] and Andoni-Indyk-Laarhoven-Razenshteyn-Schmidt [NIPS’15]. We set up a framework for analyzing the performance of any spherical code in this context, and we provide results for various codes from the literature, such as those related to regular polytopes and root lattices. Similar to hyperplane hashing, and unlike cross-polytope hashing, our analysis of collision probabilities and query exponents is exact and does not hide order terms which vanish only for large \(d\), facilitating an easy parameter selection. For the two-dimensional case, we derive closed-form expressions for arbitrary spherical codes, and we show that the equilateral triangle is optimal, achieving a better performance than the two-dimensional analogues of hyperplane and cross-polytope hashing. In three and four dimensions, we numerically find that the tetrahedron, \(5\)-cell, and \(16\)-cell achieve the best query exponents, while in five or more dimensions orthoplices appear to outperform regular simplices, as well as the root lattice families \(A_k\) and \(D_k\). We argue that in higher dimensions, larger spherical codes will likely exist which will outperform orthoplices in theory, and we argue why using the \(D_k\) root lattices will likely lead to better results in practice, due to a better trade-off between the asymptotic query exponent and the concrete costs of hashing. |
|||||
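The entry above generalizes the hyperplane hashing of Charikar, in which each hash bit records which side of a random hyperplane a vector falls on, and two vectors at angle theta collide on a bit with probability 1 - theta/pi. Here is a minimal NumPy sketch of that baseline scheme; the dimension and code-length parameters are illustrative choices.

```python
import numpy as np

def hyperplane_hash(x, hyperplanes):
    """Sign-random-projection hash: one bit per random hyperplane,
    set to 1 when the vector lies on its positive side."""
    return (hyperplanes @ x > 0).astype(np.uint8)

rng = np.random.default_rng(0)
d, n_bits = 128, 32
H = rng.normal(size=(n_bits, d))      # Gaussian hyperplane normals
x = rng.normal(size=d)
y = x + 0.2 * rng.normal(size=d)      # a vector at a small angle to x
# Nearby vectors should disagree on only a few of the 32 bits
print(int((hyperplane_hash(x, H) != hyperplane_hash(y, H)).sum()))
```

The spherical-code view in the paper replaces the two half-spaces per projection with the Voronoi cells of a code on a low-dimensional sphere.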
2019 | Separate Chaining Meets Compact Hashing | Köppl Dominik | Arxiv | While separate chaining is a common strategy for resolving collisions in a hash table taught in most textbooks, compact hashing is a less common technique for saving space when hashing integers whose domain is relatively small with respect to the problem size. It is widely believed that hash tables waste a considerable amount of memory, as they either leave allocated space untouched (open addressing) or store additional pointers (separate chaining). For the former, Cleary introduced the compact hashing technique that stores only a part of a key to save space. However, as can be seen by the line of research focusing on compact hash tables with open addressing, there is additional information, called displacement, required for restoring a key. There are several representations of this displacement information with different space and time trade-offs. In this article, we introduce a separate chaining hash table that applies the compact hashing technique without the need for the displacement information. Practical evaluations reveal that insertions in this hash table are faster or use less space than all previously known compact hash tables on modern computer architectures when storing sufficiently large satellite data. |
|||||
2019 | Focused Quantization For Sparse Cnns | Yiren Zhao, Xitong Gao, Daniel Bates, Robert Mullins, Cheng-zhong Xu | Neural Information Processing Systems | Deep convolutional neural networks (CNNs) are powerful tools for a wide range of vision tasks, but the enormous amount of memory and compute resources required by CNNs poses a challenge in deploying them on constrained devices. Existing compression techniques, while excelling at reducing model sizes, struggle to be computationally friendly. In this paper, we attend to the statistical properties of sparse CNNs and present focused quantization, a novel quantization strategy based on power-of-two values, which exploits the weight distributions after fine-grained pruning. The proposed method dynamically discovers the most effective numerical representation for weights in layers with varying sparsities, significantly reducing model sizes. Multiplications in quantized CNNs are replaced with much cheaper bit-shift operations for efficient inference. Coupled with lossless encoding, we build a compression pipeline that provides CNNs with high compression ratios (CR), low computation cost and minimal loss in accuracies. In ResNet-50, we achieved a 18.08x CR with only 0.24% loss in top-5 accuracy, outperforming existing compression methods. We fully compress a ResNet-18 and found that it is not only higher in CR and top-5 accuracy, but also more hardware efficient as it requires fewer logic gates to implement when compared to other state-of-the-art quantization methods assuming the same throughput. |
|||||
2019 | Efficient Inner Product Approximation In Hybrid Spaces | Wu Xiang, Guo Ruiqi, Simcha David, Dopson Dave, Kumar Sanjiv | Arxiv | Many emerging use cases of data mining and machine learning operate on large datasets with data from heterogeneous sources, specifically with both sparse and dense components. For example, dense deep neural network embedding vectors are often used in conjunction with sparse textual features to provide high dimensional hybrid representation of documents. Efficient search in such hybrid spaces is very challenging as the techniques that perform well for sparse vectors have little overlap with those that work well for dense vectors. Popular techniques like Locality Sensitive Hashing (LSH) and its data-dependent variants also do not give good accuracy in high dimensional hybrid spaces. Even though hybrid scenarios are becoming more prevalent, currently there exist no efficient techniques in literature that are both fast and accurate. In this paper, we propose a technique that approximates the inner product computation in hybrid vectors, leading to substantial speedup in search while maintaining high accuracy. We also propose efficient data structures that exploit modern computer architectures, resulting in orders of magnitude faster search than the existing baselines. The performance of the proposed method is demonstrated on several datasets including a very large scale industrial dataset containing one billion vectors in a billion dimensional space, achieving over 10x speedup and higher accuracy against competitive baselines. |
|||||
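For context on what the approximation in the entry above targets: the exact inner product in a hybrid space decomposes into a dense part and a sparse part. A small sketch of that exact computation follows; the dict-based sparse format is an assumption of this example, and the paper's contribution is approximating this quantity far more efficiently at scale.

```python
import numpy as np

def hybrid_inner_product(dense_q, sparse_q, dense_x, sparse_x):
    """Exact inner product of hybrid vectors: dense dot product plus the
    contribution of indices present in both sparse components."""
    dense_part = float(np.dot(dense_q, dense_x))
    sparse_part = sum(v * sparse_x[i] for i, v in sparse_q.items() if i in sparse_x)
    return dense_part + sparse_part

score = hybrid_inner_product(
    dense_q=np.array([0.1, 0.3, -0.2]), sparse_q={7: 1.0, 42: 0.5},
    dense_x=np.array([0.2, 0.1, 0.4]),  sparse_x={42: 2.0, 99: 1.0},
)
print(score)  # dense part -0.03 plus sparse part 1.0 -> 0.97
```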
2019 | Video Segment Copy Detection Using Memory Constrained Hierarchical Batch-normalized LSTM Autoencoder | Krishna Arjun, Ibrahim A S Akil Arif | Arxiv | In this report, we introduce a video hashing method for scalable video segment copy detection. The objective of video segment copy detection is to find the video(s) present in a large database, one of whose segments (cropped in time) is a (transformed) copy of the given query video. This transformation may be temporal (for example frame dropping, change in frame rate) or spatial (brightness and contrast change, addition of noise, etc.) in nature, although the primary focus of this report is detecting temporal attacks. The video hashing method proposed by us uses a deep learning neural network to learn variable length binary hash codes for the entire video, taking both temporal and spatial features into account. This is in contrast to most existing video hashing methods, as they use conventional image hashing techniques to obtain hash codes for a video after extracting features for every frame or certain key frames, in which case the temporal information present in the video is not exploited. Our hashing method is specifically resilient to time cropping, making it extremely useful in video segment copy detection. Experimental results obtained on a large augmented dataset consisting of around 25,000 videos with segment copies demonstrate the efficacy of our proposed video hashing method. |
|||||
2019 | Proof-of-forgery For Hash-based Signatures | Kiktenko E. O., Kudinov M. A., Bulychev A. A., Fedorov A. K. | Proceedings of the | In the present work, a peculiar property of hash-based signatures allowing detection of their forgery event is explored. This property relies on the fact that a successful forgery of a hash-based signature most likely results in a collision with respect to the employed hash function, while the demonstration of this collision could serve as convincing evidence of the forgery. Here we prove that with properly adjusted parameters Lamport and Winternitz one-time signatures schemes could exhibit a forgery detection availability property. This property is of significant importance in the framework of crypto-agility paradigm since the considered forgery detection serves as an alarm that the employed cryptographic hash function becomes insecure to use and the corresponding scheme has to be replaced. |
|||||
2019 | Interleaved Composite Quantization For High-dimensional Similarity Search | Khoram Soroosh, Wright Stephen J, Li Jing | Arxiv | Similarity search retrieves the nearest neighbors of a query vector from a dataset of high-dimensional vectors. As the size of the dataset grows, the cost of performing the distance computations needed to implement a query can become prohibitive. A method often used to reduce this computational cost is quantization of the vector space and location-based encoding of the dataset vectors. These encodings can be used during query processing to find approximate nearest neighbors of the query point quickly. Search speed can be improved by using shorter codes, but shorter codes have higher quantization error, leading to degraded precision. In this work, we propose the Interleaved Composite Quantization (ICQ) which achieves fast similarity search without using shorter codes. In ICQ, a small subset of the code is used to approximate the distances, with complete codes being used only when necessary. Our method effectively reduces both code length and quantization error. Furthermore, ICQ is compatible with several recently proposed techniques for reducing quantization error and can be used in conjunction with these other techniques to improve results. We confirm these claims and show strong empirical performance of ICQ using several synthetic and real-word datasets. |
|||||
2019 | Lock-free Hopscotch Hashing | Kelly Robert, Pearlmutter Barak A., Maguire Phil | Arxiv | In this paper we present a lock-free version of Hopscotch Hashing. Hopscotch Hashing is an open addressing algorithm originally proposed by Herlihy, Shavit, and Tzafrir, which is known for fast performance and excellent cache locality. The algorithm allows users of the table to skip or jump over irrelevant entries, allowing quick search, insertion, and removal of entries. Unlike traditional linear probing, Hopscotch Hashing is capable of operating under a high load factor, as probe counts remain small. Our lock-free version improves on both speed, cache locality, and progress guarantees of the original, being a chimera of two concurrent hash tables. We compare our data structure to various other lock-free and blocking hashing algorithms and show that its performance is in many cases superior to existing strategies. The proposed lock-free version overcomes some of the drawbacks associated with the original blocking version, leading to a substantial boost in scalability while maintaining attractive features like physical deletion or probe-chain compression. |
|||||
2019 | BTEL A Binary Tree Encoding Approach For Visual Localization | Le Huu, Hoang Tuan, Milford Michael | Arxiv | Visual localization algorithms have achieved significant improvements in performance thanks to recent advances in camera technology and vision-based techniques. However, there remains one critical caveat: all current approaches that are based on image retrieval currently scale at best linearly with the size of the environment with respect to both storage, and consequentially in most approaches, query time. This limitation severely curtails the capability of autonomous systems in a wide range of compute, power, storage, size, weight or cost constrained applications such as drones. In this work, we present a novel binary tree encoding approach for visual localization which can serve as an alternative for existing quantization and indexing techniques. The proposed tree structure allows us to derive a compressed training scheme that achieves sub-linearity in both required storage and inference time. The encoding memory can be easily configured to satisfy different storage constraints. Moreover, our approach is amenable to an optional sequence filtering mechanism to further improve the localization results, while maintaining the same amount of storage. Our system is entirely agnostic to the front-end descriptors, allowing it to be used on top of recent state-of-the-art image representations. Experimental results show that the proposed method significantly outperforms state-of-the-art approaches under limited storage constraints. |
|||||
2019 | Nearest Neighbor Search-based Bitwise Source Separation Using Discriminant Winner-take-all Hashing | Kim Sunwoo, Kim Minje | Arxiv | We propose an iteration-free source separation algorithm based on Winner-Take-All (WTA) hash codes, which is a faster, yet accurate alternative to a complex machine learning model for single-channel source separation in a resource-constrained environment. We first generate random permutations with WTA hashing to encode the shape of the multidimensional audio spectrum to a reduced bitstring representation. A nearest neighbor search on the hash codes of an incoming noisy spectrum as the query string results in the closest matches among the hashed mixture spectra. Using the indices of the matching frames, we obtain the corresponding ideal binary mask vectors for denoising. Since both the training data and the search operation are bitwise, the procedure can be done efficiently in hardware implementations. Experimental results show that the WTA hash codes are discriminant and provide an affordable dictionary search mechanism that leads to a competent performance compared to a comprehensive model and oracle masking. |
|||||
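The entry above builds its bitwise source separation on Winner-Take-All (WTA) hash codes. The encoding step is easy to state: for each of several random permutations, look at the first k permuted coordinates of a spectrum frame and record which one is largest. A minimal NumPy sketch of that encoder; the dimension and code-length values are illustrative, not the paper's settings.

```python
import numpy as np

def wta_hash(x, permutations, k):
    """Winner-Take-All hashing: for every permutation, record the argmax
    among the first k permuted coordinates. The code depends only on the
    rank order of values, which makes it robust to magnitude changes."""
    return np.array([int(np.argmax(x[p[:k]])) for p in permutations])

rng = np.random.default_rng(0)
dim, n_codes, k = 513, 64, 4                      # e.g. magnitude-spectrum frames
perms = [rng.permutation(dim) for _ in range(n_codes)]
frame = rng.random(dim)
code = wta_hash(frame, perms, k)                  # 64 integers, each in [0, k)
# Similarity between two frames can then be scored by how many code entries agree
```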
2019 | Query-adaptive Hash Code Ranking For Large-scale Multi-view Visual Search | Liu Xianglong, Huang Lei, Deng Cheng, Lang Bo, Tao Dacheng | Arxiv | Hash based nearest neighbor search has become attractive in many applications. However, the quantization in hashing usually degenerates the discriminative power when using Hamming distance ranking. Besides, for large-scale visual search, existing hashing methods cannot directly support efficient search over data with multiple sources, even though the literature has shown that adaptively incorporating complementary information from diverse sources or views can significantly boost the search performance. To address these problems, this paper proposes a novel and generic approach to building multiple hash tables with multiple views and generating fine-grained ranking results at bitwise and tablewise levels. For each hash table, a query-adaptive bitwise weighting is introduced to alleviate the quantization loss by simultaneously exploiting the quality of hash functions and their complement for nearest neighbor search. From the tablewise aspect, multiple hash tables are built for different data views as a joint index, over which a query-specific rank fusion is proposed to rerank all results from the bitwise ranking by diffusing in a graph. Comprehensive experiments on image search over three well-known benchmarks show that the proposed method achieves up to 17.11% and 20.28% performance gains on single and multiple table search over state-of-the-art methods. |
|||||
2019 | On-device Text Representations Robust To Misspellings Via Projections | Sankar Chinnadhurai, Ravi Sujith, Kozareva Zornitsa | Arxiv | Recently, there has been a strong interest in developing natural language applications that live on personal devices such as mobile phones, watches and IoT devices, with the objective of preserving user privacy and having a low memory footprint. Advances in Locality-Sensitive Hashing (LSH)-based projection networks have demonstrated state-of-the-art performance in various classification tasks without explicit word (or word-piece) embedding lookup tables, by computing on-the-fly text representations. In this paper, we show that projection-based neural classifiers are inherently robust to misspellings and perturbations of the input text. We empirically demonstrate that the LSH projection based classifiers are more robust to common misspellings compared to BiLSTMs (with both word-piece & word-only tokenization) and fine-tuned BERT based methods. When subject to misspelling attacks, LSH projection based classifiers had a small average accuracy drop of 2.94% across multiple classification tasks, while the fine-tuned BERT model accuracy had a significant drop of 11.44%. |
|||||
2019 | Fast And Bayes-consistent Nearest Neighbors | Efremenko Klim, Kontorovich Aryeh, Noivirt Moshe | Arxiv | Research on nearest-neighbor methods tends to focus somewhat dichotomously either on the statistical or the computational aspects – either on, say, Bayes consistency and rates of convergence or on techniques for speeding up the proximity search. This paper aims at bridging these realms: to reap the advantages of fast evaluation time while maintaining Bayes consistency, and further without sacrificing too much in the risk decay rate. We combine the locality-sensitive hashing (LSH) technique with a novel missing-mass argument to obtain a fast and Bayes-consistent classifier. Our algorithm’s prediction runtime compares favorably against state of the art approximate NN methods, while maintaining Bayes-consistency and attaining rates comparable to minimax. On samples of size \(n\) in \(\mathbb{R}^d\), our pre-processing phase has runtime \(O(dn \log n)\), while the evaluation phase has runtime \(O(d \log n)\) per query point. |
|||||
2019 | Transferable Neural Projection Representations | Sankar Chinnadhurai, Ravi Sujith, Kozareva Zornitsa | Proc. of NAACL | Neural word representations are at the core of many state-of-the-art natural language processing models. A widely used approach is to pre-train, store and look up word or character embedding matrices. While useful, such representations occupy huge memory, making it hard to deploy on-device, and often do not generalize to unknown words due to vocabulary pruning. In this paper, we propose a skip-gram based architecture coupled with Locality-Sensitive Hashing (LSH) projections to learn efficient dynamically computable representations. Our model does not need to store lookup tables as representations are computed on-the-fly and require only a low memory footprint. The representations can be trained in an unsupervised fashion and can be easily transferred to other NLP tasks. For qualitative evaluation, we analyze the nearest neighbors of the word representations and discover semantically similar words even with misspellings. For quantitative evaluation, we plug our transferable projections into a simple LSTM and run it on multiple NLP tasks and show how our transferable projections achieve better performance compared to prior work. |
|||||
2019 | Document Hashing With Mixture-prior Generative Models | Dong Wei, Su Qinliang, Shen Dinghan, Chen Changyou | Arxiv | Hashing is promising for large-scale information retrieval tasks thanks to the efficiency of distance evaluation between binary codes. Generative hashing is often used to generate hashing codes in an unsupervised way. However, existing generative hashing methods only considered the use of simple priors, like Gaussian and Bernoulli priors, which limits these methods to further improve their performance. In this paper, two mixture-prior generative models are proposed, under the objective to produce high-quality hashing codes for documents. Specifically, a Gaussian mixture prior is first imposed onto the variational auto-encoder (VAE), followed by a separate step to cast the continuous latent representation of VAE into binary code. To avoid the performance loss caused by the separate casting, a model using a Bernoulli mixture prior is further developed, in which an end-to-end training is admitted by resorting to the straight-through (ST) discrete gradient estimator. Experimental results on several benchmark datasets demonstrate that the proposed methods, especially the one using Bernoulli mixture priors, consistently outperform existing ones by a substantial margin. |
|||||
2019 | Learning Space Partitions For Nearest Neighbor Search | Dong Yihe, Indyk Piotr, Razenshteyn Ilya, Wagner Tal | Arxiv | Space partitions of \(\mathbb{R}^d\) underlie a vast and important class of fast nearest neighbor search (NNS) algorithms. Inspired by recent theoretical work on NNS for general metric spaces [Andoni, Naor, Nikolov, Razenshteyn, Waingarten STOC 2018, FOCS 2018], we develop a new framework for building space partitions reducing the problem to balanced graph partitioning followed by supervised classification. We instantiate this general approach with the KaHIP graph partitioner [Sanders, Schulz SEA 2013] and neural networks, respectively, to obtain a new partitioning procedure called Neural Locality-Sensitive Hashing (Neural LSH). On several standard benchmarks for NNS, our experiments show that the partitions obtained by Neural LSH consistently outperform partitions found by quantization-based and tree-based methods as well as classic, data-oblivious LSH. |
|||||
2019 | Deep Hashing Learning For Visual And Semantic Retrieval Of Remote Sensing Images | Song Weiwei, Li Shutao, Benediktsson Jon Atli | Arxiv | Driven by the urgent demand for managing remote sensing big data, large-scale remote sensing image retrieval (RSIR) attracts increasing attention in the remote sensing field. In general, existing retrieval methods can be regarded as visual-based retrieval approaches which search and return a set of similar images from a database to a given query image. Although retrieval methods have achieved great success, there is still a question that needs to be answered: Can we obtain the accurate semantic labels of the returned similar images to further help analyzing and processing imagery? Inspired by the above question, in this paper, we redefine the image retrieval problem as visual and semantic retrieval of images. Specifically, we propose a novel deep hashing convolutional neural network (DHCNN) to simultaneously retrieve the similar images and classify their semantic labels in a unified framework. In more detail, a convolutional neural network (CNN) is used to extract high-dimensional deep features. Then, a hash layer is inserted into the network to transfer the deep features into compact hash codes. In addition, a fully connected layer with a softmax function is performed on the hash layer to generate a class distribution. Finally, a loss function is elaborately designed to simultaneously consider the label loss of each image and the similarity loss of pairs of images. Experimental results on two remote sensing datasets demonstrate that the proposed method achieves state-of-the-art retrieval and classification performance. |
|||||
2019 | Simultaneous Feature Aggregating And Hashing For Compact Binary Code Learning | Do Thanh-toan, Le Khoa, Hoang Tuan, Le Huu, Nguyen Tam V., Cheung Ngai-man | Arxiv | Representing images by compact hash codes is an attractive approach for large-scale content-based image retrieval. In most state-of-the-art hashing-based image retrieval systems, for each image, local descriptors are first aggregated as a global representation vector. This global vector is then subjected to a hashing function to generate a binary hash code. In previous works, the aggregating and the hashing processes are designed independently. Hence these frameworks may generate suboptimal hash codes. In this paper, we first propose a novel unsupervised hashing framework in which feature aggregating and hashing are designed simultaneously and optimized jointly. Specifically, our joint optimization generates aggregated representations that can be better reconstructed by some binary codes. This leads to more discriminative binary hash codes and improved retrieval accuracy. In addition, the proposed method is flexible. It can be extended for supervised hashing. When the data label is available, the framework can be adapted to learn binary codes which minimize the reconstruction loss w.r.t. label vectors. Furthermore, we also propose a fast version of the state-of-the-art hashing method Binary Autoencoder to be used in our proposed frameworks. Extensive experiments on benchmark datasets under various settings show that the proposed methods outperform state-of-the-art unsupervised and supervised hashing methods. |
|||||
2019 | Unsupervised Multi-modal Hashing For Cross-modal Retrieval | Yu Jun, Wu Xiao-jun | Arxiv | With the advantage of low storage cost and high efficiency, hashing learning has received much attention in the domain of Big Data. In this paper, we propose a novel unsupervised hashing learning method for cross-modal retrieval that directly preserves the manifold structure by hashing. To this end, both the semantic correlation in the textual space and the locally geometric structure in the visual space are explored simultaneously in our framework. Besides, the \(\ell_{2,1}\)-norm constraint is imposed on the projection matrices to learn the discriminative hash function for each modality. Extensive experiments are performed to evaluate the proposed method on three publicly available datasets, and the experimental results show that our method can achieve superior performance over the state-of-the-art methods. |
|||||
2019 | Bilinear Supervised Hashing Based On 2D Image Features | Ding Yujuan, Wong Wai Kueng, Lai Zhihui, Zhang Zheng | Arxiv | Hashing has been recognized as an efficient representation learning method to effectively handle big data due to its low computational complexity and memory cost. Most of the existing hashing methods focus on learning the low-dimensional vectorized binary features based on the high-dimensional raw vectorized features. However, studies on how to obtain preferable binary codes from the original 2D image features for retrieval are very limited. This paper proposes a bilinear supervised discrete hashing (BSDH) method based on 2D image features which utilizes bilinear projections to binarize the image matrix features such that the intrinsic characteristics in the 2D image space are preserved in the learned binary codes. Meanwhile, the bilinear projection approximation and vectorization binary codes regression are seamlessly integrated together to formulate the final robust learning framework. Furthermore, a discrete optimization strategy is developed to alternately update each variable for obtaining the high-quality binary codes. In addition, two 2D image features, the traditional SURF-based FVLAD feature and the CNN-based AlexConv5 feature, are designed for further improving the performance of the proposed BSDH method. Results of extensive experiments conducted on four benchmark datasets show that the proposed BSDH method outperforms almost all competing hashing methods with different input features under different evaluation protocols. |
|||||
2019 | Beyond Product Quantization Deep Progressive Quantization For Image Retrieval | Gao Lianli, Zhu Xiaosu, Song Jingkuan, Zhao Zhou, Shen Heng Tao | Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence | Product Quantization (PQ) has long been a mainstream for generating an exponentially large codebook at very low memory/time cost. Despite its success, PQ is still tricky for the decomposition of high-dimensional vector space, and the retraining of model is usually unavoidable when the code length changes. In this work, we propose a deep progressive quantization (DPQ) model, as an alternative to PQ, for large scale image retrieval. DPQ learns the quantization codes sequentially and approximates the original feature space progressively. Therefore, we can train the quantization codes with different code lengths simultaneously. Specifically, we first utilize the label information for guiding the learning of visual features, and then apply several quantization blocks to progressively approach the visual features. Each quantization block is designed to be a layer of a convolutional neural network, and the whole framework can be trained in an end-to-end manner. Experimental results on the benchmark datasets show that our model significantly outperforms the state-of-the-art for image retrieval. Our model is trained once for different code lengths and therefore requires less computation time. Additional ablation study demonstrates the effect of each component of our proposed model. Our code is released at https://github.com/cfm-uestc/DPQ. |
|||||
2019 | Triplet-based Deep Hashing Network For Cross-modal Retrieval | Deng Cheng, Chen Zhaojia, Liu Xianglong, Gao Xinbo, Tao Dacheng | Arxiv | Given the benefits of its low storage requirements and high retrieval efficiency, hashing has recently received increasing attention. In particular, cross-modal hashing has been widely and successfully used in multimedia similarity search applications. However, almost all existing methods employing cross-modal hashing cannot obtain powerful hash codes because they ignore the relative similarity between heterogeneous data, which contains richer semantic information, leading to unsatisfactory retrieval performance. In this paper, we propose a triplet-based deep hashing (TDH) network for cross-modal retrieval. First, we utilize triplet labels, which describe the relative relationships among three instances, as supervision in order to capture more general semantic correlations between cross-modal instances. We then establish a loss function from the inter-modal view and the intra-modal view to boost the discriminative abilities of the hash codes. Finally, graph regularization is introduced into our proposed TDH method to preserve the original semantic similarity between hash codes in Hamming space. Experimental results show that our proposed method outperforms several state-of-the-art approaches on two popular cross-modal datasets. |
|||||
2019 | Snap And Find Deep Discrete Cross-domain Garment Image Retrieval | Luo Yadan, Wang Ziwei, Huang Zi, Yang Yang, Lu Huimin | Arxiv | With the increasing number of online stores, there is a pressing need for intelligent search systems to understand the item photos snapped by customers and search against large-scale product databases to find their desired items. However, it is challenging for conventional retrieval systems to match up the item photos captured by customers and the ones officially released by stores, especially for garment images. To bridge the customer- and store-provided garment photos, existing studies have been widely exploiting the clothing attributes (\textit{e.g.,} black) and landmarks (\textit{e.g.,} collar) to learn a common embedding space for garment representations. Unfortunately, they omit the sequential correlation of attributes and consume a large quantity of human labor to label the landmarks. In this paper, we propose a deep multi-task cross-domain hashing method termed \textit{DMCH}, in which cross-domain embedding and sequential attribute learning are modeled simultaneously. Sequential attribute learning not only provides the semantic guidance for embedding, but also generates rich attention on discriminative local details (\textit{e.g.,} black buttons) of clothing items without requiring extra landmark labels. This leads to promising performance and a 306\(\times\) boost in efficiency when compared with the state-of-the-art models, which is demonstrated through rigorous experiments on two public fashion datasets. |
|||||
2019 | Forestdsh A Universal Hash Design For Discrete Probability Distributions | Davoodi Arash Gholami, Chang Sean, Yoo Hyun Gon, Baweja Anubhav, Mongia Mihir, Mohimani Hosein | DAMI | In this paper, we consider the problem of classification of \(M\) high dimensional queries \(y^1,\cdots,y^M\in B^S\) to \(N\) high dimensional classes \(x^1,\cdots,x^N\in A^S\) where \(A\) and \(B\) are discrete alphabets and the probabilistic model that relates data to the classes \(P(x,y)\) is known. This problem has applications in various fields including the database search problem in mass spectrometry. The problem is analogous to the nearest neighbor search problem, where the goal is to find the data point in a database that is the most similar to a query point. The state of the art method for solving an approximate version of the nearest neighbor search problem in high dimensions is locality sensitive hashing (LSH). LSH is based on designing hash functions that map near points to the same buckets with a probability higher than random (far) points. To solve our high dimensional classification problem, we introduce distribution sensitive hashes that map jointly generated pairs \((x,y)\sim P\) to the same bucket with probability higher than random pairs \(x\sim P^A\) and \(y\sim P^B\), where \(P^A\) and \(P^B\) are the marginal probability distributions of \(P\). We design distribution sensitive hashes using a forest of decision trees and we show that the complexity of search grows with \(O(N^{\lambda^*(P)})\) where \(\lambda^*(P)\) is expressed in an analytical form. We further show that the proposed hashes perform faster than state of the art approximate nearest neighbor search methods for a range of probability distributions, in both theory and simulations. Finally, we apply our method to the spectral library search problem in mass spectrometry, and show that it is an order of magnitude faster than the state of the art methods. |
|||||
2019 | Supervised Quantization For Similarity Search | Wang Xiaojuan, Zhang Ting, Q Guo-jun, Tang Jinhui, Wang Jingdong | Arxiv | In this paper, we address the problem of searching for semantically similar images from a large database. We present a compact coding approach, supervised quantization. Our approach simultaneously learns feature selection that linearly transforms the database points into a low-dimensional discriminative subspace, and quantizes the data points in the transformed space. The optimization criterion is that the quantized points not only approximate the transformed points accurately, but also are semantically separable: the points belonging to a class lie in a cluster that is not overlapped with other clusters corresponding to other classes, which is formulated as a classification problem. The experiments on several standard datasets show the superiority of our approach over the state-of-the art supervised hashing and unsupervised quantization algorithms. |
|||||
2019 | Higher-order Count Sketch Dimensionality Reduction That Retains Efficient Tensor Operations | Shi Yang, Anandkumar Animashree | Arxiv | Sketching is a randomized dimensionality-reduction method that aims to preserve relevant information in large-scale datasets. Count sketch is a simple popular sketch which uses a randomized hash function to achieve compression. In this paper, we propose a novel extension known as Higher-order Count Sketch (HCS). While count sketch uses a single hash function, HCS uses multiple (smaller) hash functions for sketching. HCS reshapes the input (vector) data into a higher-order tensor and employs a tensor product of the random hash functions to compute the sketch. This results in an exponential saving (with respect to the order of the tensor) in the memory requirements of the hash functions, under certain conditions on the input data. Furthermore, when the input data itself has an underlying structure in the form of various tensor representations such as the Tucker decomposition, we obtain significant advantages. We derive efficient (approximate) computation of various tensor operations such as tensor products and tensor contractions directly on the sketched data. Thus, HCS is the first sketch to fully exploit the multi-dimensional nature of higher-order tensors. We apply HCS to tensorized neural networks where we replace fully connected layers with sketched tensor operations. We achieve nearly state of the art accuracy with significant compression on the image classification benchmark. |
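The following is a hedged, two-dimensional sketch of the idea described above: the input vector is reshaped into a matrix, each mode gets its own small hash and sign function, and the sketch entry is indexed by the product of the per-mode hashes. It omits the tensor-operation machinery of the full method; dimensions and hash sizes are illustrative.

```python
import numpy as np

def hcs_sketch(x, d1, d2, m1=64, m2=64, seed=0):
    """Sketch a length d1*d2 vector with a 2-D higher-order count sketch.

    The vector is reshaped to a (d1, d2) matrix; each mode has its own hash
    and sign function, and the sketch is an (m1, m2) matrix, so only
    d1 + d2 hash values need to be stored instead of d1 * d2.
    """
    rng = np.random.default_rng(seed)
    h1, h2 = rng.integers(0, m1, d1), rng.integers(0, m2, d2)
    s1, s2 = rng.choice([-1, 1], d1), rng.choice([-1, 1], d2)
    X = x.reshape(d1, d2)
    S = np.zeros((m1, m2))
    for i in range(d1):
        for j in range(d2):
            S[h1[i], h2[j]] += s1[i] * s2[j] * X[i, j]
    return S, (h1, h2, s1, s2)

def hcs_query(S, hashes, i, j):
    """Unbiased estimate of entry (i, j) of the original (d1, d2) matrix."""
    h1, h2, s1, s2 = hashes
    return s1[i] * s2[j] * S[h1[i], h2[j]]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_normal(128 * 128)
    S, hashes = hcs_sketch(x, 128, 128)
    print("true:", x.reshape(128, 128)[3, 7], "estimate:", hcs_query(S, hashes, 3, 7))
```

A classic count sketch would instead use a single hash over all 128 * 128 positions; the per-mode factorization is what yields the memory saving claimed in the abstract.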
|||||
2019 | Implementing Noise With Hash Functions For Graphics Processing Units | Valdenegro-toro Matias, Pincheira Hector | Arxiv | We propose a modification to Perlin noise which uses computable hash functions instead of textures as lookup tables. We implemented the FNV1, Jenkins and Murmur hashes on Shader Model 4.0 Graphics Processing Units for noise generation. Modified versions of the FNV1 and Jenkins hashes provide very close performance compared to a texture based Perlin noise implementation. Our noise modification enables noise function evaluation without any texture fetches, trading computational power for memory bandwidth. |
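A small CPU-side sketch of the hash-as-lookup-table idea: an FNV-1a hash of the integer lattice coordinates replaces the usual permutation texture, and the hashed values are smoothly interpolated. This is plain value noise rather than gradient (Perlin) noise, and it is Python rather than shader code, so it only illustrates the principle the paper applies on GPUs.

```python
import math

def fnv1a(*ints):
    """FNV-1a hash of a few integers, used in place of a permutation/lookup texture."""
    h = 0x811C9DC5
    for v in ints:
        for byte in (v & 0xFFFFFFFF).to_bytes(4, "little"):
            h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h

def lattice_value(ix, iy, seed=0):
    """Pseudo-random value in [0, 1) attached to an integer lattice point."""
    return fnv1a(ix, iy, seed) / 2**32

def fade(t):
    return t * t * (3.0 - 2.0 * t)  # smoothstep interpolation curve

def value_noise(x, y, seed=0):
    """2-D value noise: smooth interpolation of hashed lattice values."""
    ix, iy = math.floor(x), math.floor(y)
    fx, fy = x - ix, y - iy
    v00, v10 = lattice_value(ix, iy, seed), lattice_value(ix + 1, iy, seed)
    v01, v11 = lattice_value(ix, iy + 1, seed), lattice_value(ix + 1, iy + 1, seed)
    sx, sy = fade(fx), fade(fy)
    top = v00 + sx * (v10 - v00)
    bottom = v01 + sx * (v11 - v01)
    return top + sy * (bottom - top)

if __name__ == "__main__":
    print([round(value_noise(0.17 * i, 0.3), 3) for i in range(6)])
```

Evaluating the hash at the four surrounding lattice points requires no texture fetches, which is exactly the compute-for-bandwidth trade-off described in the abstract.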
|||||
2019 | Deep Hashing Using Entropy Regularised Product Quantisation Network | Schlemper Jo, Caballero Jose, Aitken Andy, Van Amersfoort Joost | Arxiv | In large scale systems, approximate nearest neighbour search is a crucial algorithm to enable efficient data retrievals. Recently, deep learning-based hashing algorithms have been proposed as a promising paradigm to enable data dependent schemes. Often their efficacy is only demonstrated on data sets with fixed, limited numbers of classes. In practical scenarios, those labels are not always available or one requires a method that can handle a higher input variability, as well as a higher granularity. To fulfil those requirements, we look at more flexible similarity measures. In this work, we present a novel, flexible, end-to-end trainable network for large-scale data hashing. Our method works by transforming the data distribution to behave as a uniform distribution on a product of spheres. The transformed data is subsequently hashed to a binary form in a way that maximises entropy of the output, (i.e. to fully utilise the available bit-rate capacity) while maintaining the correctness (i.e. close items hash to the same key in the map). We show that the method outperforms baseline approaches such as locality-sensitive hashing and product quantisation in the limited capacity regime. |
|||||
2019 | Deep Metric Learning Using Similarities From Nonlinear Rank Approximations | Schall Konstantin, Barthel Kai Uwe, Hezel Nico, Jung Klaus | Arxiv | In recent years, deep metric learning has achieved promising results in learning high dimensional semantic feature embeddings where the spatial relationships of the feature vectors match the visual similarities of the images. Similarity search for images is performed by determining the vectors with the smallest distances to a query vector. However, high retrieval quality does not depend on the actual distances of the feature vectors, but rather on the ranking order of the feature vectors from similar images. In this paper, we introduce a metric learning algorithm that focuses on identifying and modifying those feature vectors that most strongly affect the retrieval quality. We compute normalized approximated ranks and convert them to similarities by applying a nonlinear transfer function. These similarities are used in a newly proposed loss function that better contracts similar and disperses dissimilar samples. Experiments demonstrate significant improvement over existing deep feature embedding methods on the CUB-200-2011, Cars196, and Stanford Online Products data sets for all embedding sizes. |
|||||
2019 | Central Similarity Quantization For Efficient Image And Video Retrieval | Yuan Li, Wang Tao, Zhang Xiaopeng, Tay Francis Eh, Jie Zequn, Liu Wei, Feng Jiashi | Arxiv | Existing data-dependent hashing methods usually learn hash functions from pairwise or triplet data relationships, which only capture the data similarity locally, and often suffer from low learning efficiency and low collision rate. In this work, we propose a new global similarity metric, termed as central similarity, with which the hash codes of similar data pairs are encouraged to approach a common center and those for dissimilar pairs to converge to different centers, to improve hash learning efficiency and retrieval accuracy. We principally formulate the computation of the proposed central similarity metric by introducing a new concept, i.e., hash center that refers to a set of data points scattered in the Hamming space with a sufficient mutual distance between each other. We then provide an efficient method to construct well separated hash centers by leveraging the Hadamard matrix and Bernoulli distributions. Finally, we propose the Central Similarity Quantization (CSQ) that optimizes the central similarity between data points w.r.t.\ their hash centers instead of optimizing the local similarity. CSQ is generic and applicable to both image and video hashing scenarios. Extensive experiments on large-scale image and video retrieval tasks demonstrate that CSQ can generate cohesive hash codes for similar data pairs and dispersed hash codes for dissimilar pairs, achieving a noticeable boost in retrieval performance, i.e. 3\%-20\% in mAP over the previous state-of-the-arts. The code is at: \url{https://github.com/yuanli2333/Hadamard-Matrix-for-hashing} |
|||||
2019 | Sub-linear Memory Sketches For Near Neighbor Search On Streaming Data | Coleman Benjamin, Baraniuk Richard G., Shrivastava Anshumali | Arxiv | We present the first sublinear memory sketch that can be queried to find the nearest neighbors in a dataset. Our online sketching algorithm compresses an N element dataset to a sketch of size \(O(N^b log^3 N)\) in \(O(N^{(b+1)} log^3 N)\) time, where \(b < 1\). This sketch can correctly report the nearest neighbors of any query that satisfies a stability condition parameterized by \(b\). We achieve sublinear memory performance on stable queries by combining recent advances in locality sensitive hash (LSH)-based estimators, online kernel density estimation, and compressed sensing. Our theoretical results shed new light on the memory-accuracy tradeoff for nearest neighbor search, and our sketch, which consists entirely of short integer arrays, has a variety of attractive features in practice. We evaluate the memory-recall tradeoff of our method on a friend recommendation task in the Google Plus social media network. We obtain orders of magnitude better compression than the random projection based alternative while retaining the ability to report the nearest neighbors of practical queries. |
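A rough sketch of the repeated array-of-counts style LSH estimator that this line of work builds on: each row hashes every inserted point into a small counter array, and a query's averaged bucket count estimates how much of the dataset collides with (is close to) the query. The sign-random-projection hash, the row and bucket sizes, and the normalization are assumptions for illustration; the paper additionally compresses the counters to reach sublinear memory, which is not shown here.

```python
import numpy as np

class RaceSketch:
    """Repeated array-of-counts estimator built on sign-random-projection LSH."""

    def __init__(self, dim, n_rows=32, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_rows, n_bits, dim))
        self.counts = np.zeros((n_rows, 2 ** n_bits))
        self.n = 0

    def _bucket(self, x):
        bits = (self.planes @ x >= 0).astype(np.uint64)      # (n_rows, n_bits)
        return bits @ (2 ** np.arange(bits.shape[1], dtype=np.uint64))

    def add(self, x):
        self.counts[np.arange(len(self.counts)), self._bucket(x)] += 1
        self.n += 1

    def density(self, q):
        """Average collision frequency of q with the inserted points."""
        return self.counts[np.arange(len(self.counts)), self._bucket(q)].mean() / self.n

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sketch = RaceSketch(dim=32)
    center = np.ones(32)
    data = center + 0.1 * rng.standard_normal((5000, 32))
    for x in data:
        sketch.add(x)
    print("inside the cluster  :", sketch.density(center))
    print("far from the cluster:", sketch.density(-center))
```

The sketch stores only counters, never the points themselves, which is why queries can still separate dense neighborhoods from empty ones after the raw data is discarded.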
|||||
2019 | Cluster-wise Unsupervised Hashing For Cross-modal Similarity Search | Wang Lu, Yang Jie | Arxiv | Large-scale cross-modal hashing similarity retrieval has attracted more and more attention in modern search applications such as search engines and autopilot, showing great superiority in computation and storage. However, current unsupervised cross-modal hashing methods still have some limitations: (1) many methods relax the discrete constraints to solve the optimization objective, which may significantly degrade the retrieval performance; (2) most existing hashing models project heterogeneous data into a common latent space, which may lose sight of the diversity in heterogeneous data; (3) transforming real-valued data points to binary codes always results in an abundant loss of information, producing a suboptimal continuous latent space. To overcome the above problems, in this paper, a novel Cluster-wise Unsupervised Hashing (CUH) method is proposed. Specifically, CUH jointly performs multi-view clustering that projects the original data points from different modalities into their own low-dimensional latent semantic spaces, finds the cluster centroid points and the common clustering indicators in each low-dimensional space, and learns the compact hash codes and the corresponding linear hash functions. A discrete optimization framework is developed to learn the unified binary codes across modalities under the guidance of cluster-wise code-prototypes. The reasonableness and effectiveness of CUH are well demonstrated by comprehensive experiments on diverse benchmark datasets. |
|||||
2019 | Algorithms For Similarity Search And Pseudorandomness | Christiani Tobias | Arxiv | We study the problem of approximate near neighbor (ANN) search and show the following results: |
|||||
2019 | An Efficient Approach For Super And Nested Term Indexing And Retrieval | Chowdhury Md Faisal Mahbub, Farrell Robert | Arxiv | This paper describes a new approach, called Terminological Bucket Indexing (TBI), for efficient indexing and retrieval of both nested and super terms using a single method. We propose a hybrid data structure for facilitating faster indexing building. An evaluation of our approach with respect to widely used existing approaches on several publicly available dataset is provided. Compared to Trie based approaches, TBI provides comparable performance on nested term retrieval and far superior performance on super term retrieval. Compared to traditional hash table, TBI needs 80\% less time for indexing. |
|||||
2019 | VISIR Visual And Semantic Image Label Refinement | Chowdhury Sreyasi Nag, Tandon Niket, Ferhatosmanoglu Hakan, Weikum Gerhard | ACM ISBN | The social media explosion has populated the Internet with a wealth of images. There are two existing paradigms for image retrieval: 1) content-based image retrieval (CBIR), which has traditionally used visual features for similarity search (e.g., SIFT features), and 2) tag-based image retrieval (TBIR), which has relied on user tagging (e.g., Flickr tags). CBIR now gains semantic expressiveness by advances in deep-learning-based detection of visual labels. TBIR benefits from query-and-click logs to automatically infer more informative labels. However, learning-based tagging still yields noisy labels and is restricted to concrete objects, missing out on generalizations and abstractions. Click-based tagging is limited to terms that appear in the textual context of an image or in queries that lead to a click. This paper addresses the above limitations by semantically refining and expanding the labels suggested by learning-based object detection. We consider the semantic coherence between the labels for different objects, leverage lexical and commonsense knowledge, and cast the label assignment into a constrained optimization problem solved by an integer linear program. Experiments show that our method, called VISIR, improves the quality of the state-of-the-art visual labeling tools like LSDA and YOLO. |
|||||
2019 | Modal-aware Features For Multimodal Hashing | Zeng Haien, Lai Hanjiang, Chu Hanlu, Tang Yong, Yin Jian | Arxiv | Many retrieval applications can benefit from multiple modalities, e.g., text that contains images on Wikipedia, for which how to represent multimodal data is the critical component. Most deep multimodal learning methods typically involve two steps to construct the joint representations: 1) learning of multiple intermediate features, with each intermediate feature corresponding to a modality, using separate and independent deep models; 2) merging the intermediate features into a joint representation using a fusion strategy. However, in the first step, these intermediate features do not have previous knowledge of each other and cannot fully exploit the information contained in the other modalities. In this paper, we present a modal-aware operation as a generic building block to capture the non-linear dependences among the heterogeneous intermediate features that can learn the underlying correlation structures in other multimodal data as soon as possible. The modal-aware operation consists of a kernel network and an attention network. The kernel network is utilized to learn the non-linear relationships with other modalities. Then, to learn better representations for binary hash codes, we present an attention network that finds the informative regions of these modal-aware features that are favorable for retrieval. Experiments conducted on three public benchmark datasets demonstrate significant improvements in the performance of our method relative to state-of-the-art methods. |
|||||
2019 | Joint Cluster Unary Loss For Efficient Cross-modal Hashing | Zhang Shifeng, Li Jianmin, Zhang Bo | Arxiv | With the rapid growth of various types of multimodal data, cross-modal deep hashing has received broad attention for solving cross-modal retrieval problems efficiently. Most cross-modal hashing methods follow the traditional supervised hashing framework in which the \(O(n^2)\) data pairs and \(O(n^3)\) data triplets are generated for training, but the training procedure is less efficient because the complexity is high for large-scale datasets. To address these issues, we propose a novel and efficient cross-modal hashing algorithm in which the unary loss is introduced. First, we introduce the Cross-Modal Unary Loss (CMUL) with \(O(n)\) complexity to bridge the traditional triplet loss and classification-based unary loss. A more accurate bound of the triplet loss for structured multilabel data is also proposed in CMUL. Second, we propose the novel Joint Cluster Cross-Modal Hashing (JCCH) algorithm for efficient hash learning, in which the CMUL is involved. The resultant hashcodes form several clusters in which the hashcodes in the same cluster share similar semantic information, and the heterogeneity gap on different modalities is diminished by sharing the clusters. The proposed algorithm is able to be applied to various types of data, and experiments on large-scale datasets show that the proposed method is superior to or comparable with state-of-the-art cross-modal hashing methods, and training with the proposed method is more efficient than others. |
|||||
2019 | Analysis Of Sparsehash An Efficient Embedding Of Set-similarity Via Sparse Projections | Valsesia Diego, Fosson Sophie Marie, Ravazzi Chiara, Bianchi Tiziano, Magli Enrico | Arxiv | Embeddings provide compact representations of signals in order to perform efficient inference in a wide variety of tasks. In particular, random projections are common tools to construct Euclidean distance-preserving embeddings, while hashing techniques are extensively used to embed set-similarity metrics, such as the Jaccard coefficient. In this letter, we theoretically prove that a class of random projections based on sparse matrices, called SparseHash, can preserve the Jaccard coefficient between the supports of sparse signals, which can be used to estimate set similarities. Moreover, besides the analysis, we provide an efficient implementation and we test the performance in several numerical experiments, both on synthetic and real datasets. |
|||||
2019 | A Standalone Fpga-based Miner For Lyra2rev2 Cryptocurrencies | Têtu Jean-françois, Trudeau Louis-charles, Van Beirendonck Michiel, Balatsoukas-stimming Alexios, Giard Pascal | Arxiv | Lyra2REv2 is a hashing algorithm that consists of a chain of individual hashing algorithms, and it is used as a proof-of-work function in several cryptocurrencies. The most crucial and exotic hashing algorithm in the Lyra2REv2 chain is a specific instance of the general Lyra2 algorithm. This work presents the first hardware implementation of the specific instance of Lyra2 that is used in Lyra2REv2. Several properties of the aforementioned algorithm are exploited in order to optimize the design. In addition, an FPGA-based hardware implementation of a standalone miner for Lyra2REv2 on a Xilinx Multi-Processor System on Chip is presented. The proposed Lyra2REv2 miner is shown to be significantly more energy efficient than both a GPU and a commercially available FPGA-based miner. Finally, we also explain how the simplified Lyra2 and Lyra2REv2 architectures can be modified with minimal effort to also support the recent Lyra2REv3 chained hashing algorithm. |
|||||
2019 | Unaligned Sequence Similarity Search Using Deep Learning | Senter James K., Royalty Taylor M., Steen Andrew D., Sadovnik Amir | Arxiv | Gene annotation has traditionally required direct comparison of DNA sequences between an unknown gene and a database of known ones using string comparison methods. However, these methods do not provide useful information when a gene does not have a close match in the database. In addition, each comparison can be costly when the database is large since it requires alignments and a series of string comparisons. In this work we propose a novel approach: using recurrent neural networks to embed DNA or amino-acid sequences in a low-dimensional space in which distances correlate with functional similarity. This embedding space overcomes both shortcomings of the method of aligning sequences and comparing homology. First, it allows us to obtain information about genes which do not have exact matches by measuring their similarity to other ones in the database. If our database is labeled this can provide labels for a query gene as is done in traditional methods. However, even if the database is unlabeled it allows us to find clusters and infer some characteristics of the gene population. In addition, each comparison is much faster than traditional methods since the distance metric is reduced to the Euclidean distance, and thus efficient approximate nearest neighbor algorithms can be used to find the best match. We present results showing the advantage of our algorithm. More specifically we show how our embedding can be useful for both classification tasks when our labels are known, and clustering tasks where our sequences belong to classes which have not been seen before. |
|||||
2019 | Efficient Querying From Weighted Binary Codes | Weng Zhenyu, Zhu Yuesheng | Arxiv | Binary codes are widely used to represent data due to their small storage cost and efficient computation. However, there is an ambiguity problem: many binary codes share the same Hamming distance to a query. To alleviate the ambiguity problem, weighted binary codes assign different weights to each bit of binary codes and compare the binary codes by the weighted Hamming distance. Until now, efficiently querying weighted binary codes has remained an open issue. In this paper, we propose a new method to rank the weighted binary codes and return the nearest weighted binary codes of the query efficiently. In our method, based on the multi-index hash tables, two algorithms, the table bucket finding algorithm and the table merging algorithm, are proposed to select the nearest weighted binary codes of the query in a non-exhaustive and accurate way. The proposed algorithms are justified by proving their theoretic properties. The experiments on three large-scale datasets validate both the search efficiency and the search accuracy of our method. In particular, for up to one billion weighted binary codes, our method is more than 1000 times faster than a linear scan. |
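To illustrate the multi-index hash table setup this abstract starts from, here is a simplified sketch: codes are split into substrings, each substring indexes its own table, candidates sharing a substring with the query are collected, and only those candidates are reranked by the weighted Hamming distance. This is not the paper's bucket finding/merging algorithms; the chunking, the exact-match-only probing, and the random weights are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def build_multi_index(codes, n_chunks=4):
    """Index binary codes (a {0,1} uint8 matrix) by splitting them into substrings."""
    chunks = np.array_split(np.arange(codes.shape[1]), n_chunks)
    tables = []
    for cols in chunks:
        table = defaultdict(list)
        for i, code in enumerate(codes):
            table[tuple(code[cols])].append(i)
        tables.append((cols, table))
    return tables

def weighted_query(query, codes, weights, tables, topk=5):
    """Collect candidates sharing at least one substring, rerank by weighted Hamming distance.

    A full multi-index scheme would also probe buckets within a small Hamming
    radius of each substring; only exact substring matches are checked here.
    """
    cand = set()
    for cols, table in tables:
        cand.update(table.get(tuple(query[cols]), []))
    cand = np.fromiter(cand, dtype=int) if cand else np.arange(len(codes))
    dists = ((codes[cand] != query) * weights).sum(axis=1)
    return cand[np.argsort(dists)[:topk]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codes = rng.integers(0, 2, size=(10000, 32), dtype=np.uint8)
    weights = rng.random(32)                      # per-bit weights
    tables = build_multi_index(codes)
    print(weighted_query(codes[123], codes, weights, tables))
```

The point of the substring tables is that only a small candidate set ever reaches the (more expensive) weighted distance computation.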
|||||
2019 | Revisiting Consistent Hashing With Bounded Loads | Chen John, Coleman Ben, Shrivastava Anshumali | Arxiv | Dynamic load balancing lies at the heart of distributed caching. Here, the goal is to assign objects (load) to servers (computing nodes) in a way that provides load balancing while at the same time dynamically adjusts to the addition or removal of servers. One essential requirement is that the addition or removal of small servers should not require us to recompute the complete assignment. A popular and widely adopted solution is the two-decade-old Consistent Hashing (CH). Recently, an elegant extension was provided to account for server bounds. In this paper, we identify that existing methodologies for CH and its variants suffer from cascaded overflow, leading to poor load balancing. This cascading effect leads to decreasing performance of the hashing procedure with increasing load. To overcome the cascading effect, we propose a simple solution to CH based on recent advances in fast minwise hashing. We show, both theoretically and empirically, that our proposed solution is significantly superior for load balancing and is optimal in many senses. On the AOL search dataset and Indiana University Clicks dataset with real user activity, our proposed solution reduces cache misses by several orders of magnitude. |
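For context, here is a minimal sketch of the baseline the paper revisits, consistent hashing with per-server capacity bounds, including the clockwise forwarding of overflowing items that produces the cascading behaviour analysed above. It is not the paper's minwise-hashing-based replacement; the capacity factor, virtual-node count, and SHA-1 ring hash are illustrative choices.

```python
import bisect
import hashlib
import math

def ring_hash(key):
    """Stable hash used for placement on the ring."""
    return int(hashlib.sha1(str(key).encode()).hexdigest(), 16) % (2**64)

class BoundedLoadRing:
    """Consistent hashing with bounded loads: overflow is forwarded clockwise."""

    def __init__(self, servers, n_items, c=1.25, vnodes=100):
        self.ring = sorted((ring_hash((s, v)), s) for s in servers for v in range(vnodes))
        self.keys = [k for k, _ in self.ring]
        self.capacity = math.ceil(c * n_items / len(servers))
        self.load = {s: 0 for s in servers}

    def assign(self, item):
        pos = bisect.bisect(self.keys, ring_hash(item)) % len(self.ring)
        for step in range(len(self.ring)):
            server = self.ring[(pos + step) % len(self.ring)][1]
            if self.load[server] < self.capacity:   # first server with spare capacity wins
                self.load[server] += 1
                return server
        raise RuntimeError("all servers are at capacity")

if __name__ == "__main__":
    items = [f"url-{i}" for i in range(10000)]
    ring = BoundedLoadRing([f"s{i}" for i in range(20)], len(items))
    for it in items:
        ring.assign(it)
    print(sorted(ring.load.values()))
```

When a popular region of the ring fills up, its overflow spills onto the next servers, which may then overflow in turn; that chain of forwards is the cascaded overflow the abstract describes.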
|||||
2019 | Locality-sensitive Hashing For F-divergences Mutual Information Loss And Beyond | Lin Chen, Hossein Esfandiari, Gang Fu, Vahab Mirrokni | Neural Information Processing Systems | Computing approximate nearest neighbors in high dimensional spaces is a central problem in large-scale data mining with a wide range of applications in machine learning and data science. A popular and effective technique in computing nearest neighbors approximately is the locality-sensitive hashing (LSH) scheme. In this paper, we aim to develop LSH schemes for distance functions that measure the distance between two probability distributions, particularly for f-divergences as well as a generalization to capture mutual information loss. First, we provide a general framework to design LSH schemes for f-divergence distance functions and develop LSH schemes for the generalized Jensen-Shannon divergence and triangular discrimination in this framework. We show a two-sided approximation result for approximation of the generalized Jensen-Shannon divergence by the Hellinger distance, which may be of independent interest. Next, we show a general method of reducing the problem of designing an LSH scheme for a Krein kernel (which can be expressed as the difference of two positive definite kernels) to the problem of maximum inner product search. We exemplify this method by applying it to the mutual information loss, due to its several important applications such as model compression. |
|||||
2019 | Vector And Line Quantization For Billion-scale Similarity Search On Gpus | Chen Wei, Chen Jincai, Zou Fuhao, Li Yuan-fang, Lu Ping, Wang Qiang, Zhao Wei | Arxiv | Billion-scale high-dimensional approximate nearest neighbour (ANN) search has become an important problem for searching similar objects among the vast amount of images and videos available online. The existing ANN methods are usually characterized by their specific indexing structures, including the inverted index and the inverted multi-index structure. The inverted index structure is amenable to GPU-based implementations, and the state-of-the-art systems such as Faiss are able to exploit the massive parallelism offered by GPUs. However, the inverted index requires high memory overhead to index the dataset effectively. The inverted multi-index structure is difficult to implement for GPUs, and also ineffective in dealing with database with different data distributions. In this paper we propose a novel hierarchical inverted index structure generated by vector and line quantization methods. Our quantization method improves both search efficiency and accuracy, while maintaining comparable memory consumption. This is achieved by reducing search space and increasing the number of indexed regions. We introduce a new ANN search system, VLQ-ADC, that is based on the proposed inverted index, and perform extensive evaluation on two public billion-scale benchmark datasets SIFT1B and DEEP1B. Our evaluation shows that VLQ-ADC significantly outperforms the state-of-the-art GPU- and CPU-based systems in terms of both accuracy and search speed. The source code of VLQ-ADC is available at https://github.com/zjuchenwei/vector-line-quantization. |
|||||
2019 | Hadamard Codebook Based Deep Hashing | Chen Shen, Cao Liujuan, Lin Mingbao, Wang Yan, Sun Xiaoshuai, Wu Chenglin, Qiu Jingfei, Ji Rongrong | Arxiv | As an approximate nearest neighbor search technique, hashing has been widely applied in large-scale image retrieval due to its excellent efficiency. Most supervised deep hashing methods have similar loss designs with embedding learning, while quantizing the continuous high-dim feature into compact binary space. We argue that the existing deep hashing schemes are defective in two issues that seriously affect the performance, i.e., bit independence and bit balance. The former refers to hash codes of different classes should be independent of each other, while the latter means each bit should have a balanced distribution of +1s and -1s. In this paper, we propose a novel supervised deep hashing method, termed Hadamard Codebook based Deep Hashing (HCDH), which solves the above two problems in a unified formulation. Specifically, we utilize an off-the-shelf algorithm to generate a binary Hadamard codebook to satisfy the requirement of bit independence and bit balance, which subsequently serves as the desired outputs of the hash functions learning. We also introduce a projection matrix to solve the inconsistency between the order of Hadamard matrix and the number of classes. Besides, the proposed HCDH further exploits the supervised labels by constructing a classifier on top of the outputs of hash functions. Extensive experiments demonstrate that HCDH can yield discriminative and balanced binary codes, which well outperforms many state-of-the-arts on three widely-used benchmarks. |
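A sketch of how a Hadamard codebook can supply balanced, mutually orthogonal target codes for classes, in the spirit of the codebook described above. The row-assignment policy is simplified here, and the paper's learned projection matrix (used when the number of classes does not match the Hadamard order) is omitted; the sketch simply requires fewer classes than bits.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def class_hash_targets(n_classes, n_bits):
    """Assign each class a (non-constant) Hadamard row as its target binary code.

    Distinct rows are mutually orthogonal, so class targets are far apart in
    Hamming space, and every row except the first has an equal number of +1
    and -1 entries, so each code is balanced.
    """
    assert n_classes < n_bits, "more classes than usable Hadamard rows"
    H = hadamard(n_bits)
    return H[1:n_classes + 1]        # skip the all-ones row

if __name__ == "__main__":
    targets = class_hash_targets(10, 64)
    print(targets.shape, bool((targets @ targets.T == 64 * np.eye(10)).all()))
```

In the deep hashing setting, these rows would serve as the desired outputs that the network's hash layer is trained to regress for each class.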
|||||
2019 | A Literature Study Of Embeddings On Source Code | Chen Zimin, Monperrus Martin | Arxiv | Natural language processing has improved tremendously after the success of word embedding techniques such as word2vec. Recently, the same idea has been applied on source code with encouraging results. In this survey, we aim to collect and discuss the usage of word embedding techniques on programs and source code. The articles in this survey have been collected by asking authors of related work and with an extensive search on Google Scholar. Each article is categorized into five categories: 1. embedding of tokens 2. embedding of functions or methods 3. embedding of sequences or sets of method calls 4. embedding of binary code 5. other embeddings. We also provide links to experimental data and show some remarkable visualization of code embeddings. In summary, word embedding has been successfully applied on different granularities of source code. With access to countless open-source repositories, we see a great potential of applying other data-driven natural language processing techniques on source code in the future. |
|||||
2019 | Simultaneous Region Localization And Hash Coding For Fine-grained Image Retrieval | Zeng Haien, Lai Hanjiang, Yin Jian | Arxiv | Fine-grained image hashing is a challenging problem due to the difficulties of discriminative region localization and hash code generation. Most existing deep hashing approaches solve the two tasks independently, even though these two tasks are correlated and can reinforce each other. In this paper, we propose a deep fine-grained hashing approach to simultaneously localize the discriminative regions and generate efficient binary codes. The proposed approach consists of a region localization module and a hash coding module. The region localization module aims to provide informative regions to the hash coding module. The hash coding module aims to generate effective binary codes and give feedback for learning a better localizer. Moreover, to better capture subtle differences, multi-scale regions at different layers are learned without the need of bounding-box/part annotations. Extensive experiments are conducted on two public benchmark fine-grained datasets. The results demonstrate significant improvements in the performance of our method relative to other fine-grained hashing algorithms. |
|||||
2019 | Search Efficient Binary Network Embedding | Zhang Daokun, Yin Jie, Zhu Xingquan, Zhang Chengqi | TKDD- | Traditional network embedding primarily focuses on learning a continuous vector representation for each node, preserving network structure and/or node content information, such that off-the-shelf machine learning algorithms can be easily applied to the vector-format node representations for network analysis. However, the learned continuous vector representations are inefficient for large-scale similarity search, which often involves finding nearest neighbors measured by distance or similarity in a continuous vector space. In this paper, we propose a search efficient binary network embedding algorithm called BinaryNE to learn a binary code for each node, by simultaneously modeling node context relations and node attribute relations through a three-layer neural network. BinaryNE learns binary node representations through a stochastic gradient descent based online learning algorithm. The learned binary encoding not only reduces memory usage to represent each node, but also allows fast bit-wise comparisons to support faster node similarity search than using Euclidean distance or other distance measures. Extensive experiments and comparisons demonstrate that BinaryNE not only delivers more than 25 times faster search speed, but also provides comparable or better search quality than traditional continuous vector based network embedding methods. The binary codes learned by BinaryNE also render competitive performance on node classification and node clustering tasks. The source code of this paper is available at https://github.com/daokunzhang/BinaryNE. |
|||||
2019 | Exploring Auxiliary Context Discrete Semantic Transfer Hashing For Scalable Image Retrieval | Zhu Lei, Huang Zi, Li Zhihui, Xie Liang, Shen Heng Tao | Arxiv | Unsupervised hashing can desirably support scalable content-based image retrieval (SCBIR) for its appealing advantages of semantic label independence, memory and search efficiency. However, the learned hash codes are embedded with limited discriminative semantics due to the intrinsic limitation of image representation. To address the problem, in this paper, we propose a novel hashing approach, dubbed as Discrete Semantic Transfer Hashing (DSTH). The key idea is to directly augment the semantics of discrete image hash codes by exploring auxiliary contextual modalities. To this end, a unified hashing framework is formulated to simultaneously preserve visual similarities of images and perform semantic transfer from contextual modalities. Further, to guarantee direct semantic transfer and avoid information loss, we explicitly impose the discrete constraint, bit–uncorrelation constraint and bit-balance constraint on hash codes. A novel and effective discrete optimization method based on augmented Lagrangian multiplier is developed to iteratively solve the optimization problem. The whole learning process has linear computation complexity and desirable scalability. Experiments on three benchmark datasets demonstrate the superiority of DSTH compared with several state-of-the-art approaches. |
|||||
2019 | An Efficient Approximate Knn Graph Method For Diffusion On Image Retrieval | Magliani Federico, Mcguinness Kevin, Mohedano Eva, Prati Andrea | Arxiv | The application of diffusion in many computer vision and artificial intelligence projects has been shown to give excellent improvements in performance. One of the main bottlenecks of this technique is the quadratic growth of the kNN graph size due to the high quantity of new connections between nodes in the graph, resulting in long computation times. Several strategies have been proposed to address this, but none are both effective and efficient. Our novel technique, based on LSH projections, obtains the same performance as the exact kNN graph after diffusion, but in less time (approximately 18 times faster on a dataset of a hundred thousand images). The proposed method was validated and compared with other state-of-the-art methods on several public image datasets, including Oxford5k, Paris6k, and Oxford105k. |
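A hedged sketch of the LSH-bucketed graph construction idea: points are hashed into buckets by random hyperplane signatures, and exact distances are computed only among bucket-mates, so the quadratic all-pairs comparison is avoided. The sign-random-projection hash and all sizes are illustrative assumptions rather than the paper's exact projection scheme.

```python
import numpy as np
from collections import defaultdict

def approx_knn_graph(X, k=10, n_tables=4, n_bits=8, seed=0):
    """Approximate kNN graph: compare points only within shared LSH buckets.

    Each table hashes points with sign random projections; the candidate set
    of a point is every other point sharing a bucket in at least one table.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    candidates = [set() for _ in range(n)]
    for _ in range(n_tables):
        planes = rng.standard_normal((n_bits, X.shape[1]))
        keys = (X @ planes.T >= 0).astype(np.uint8)
        buckets = defaultdict(list)
        for i, key in enumerate(map(tuple, keys)):
            buckets[key].append(i)
        for members in buckets.values():
            for i in members:
                candidates[i].update(members)
    graph = []
    for i in range(n):
        cand = np.array(sorted(candidates[i] - {i}))
        if len(cand) == 0:
            graph.append(np.array([], dtype=int))
            continue
        d = np.linalg.norm(X[cand] - X[i], axis=1)
        graph.append(cand[np.argsort(d)[:k]])
    return graph

if __name__ == "__main__":
    X = np.random.default_rng(1).standard_normal((2000, 64))
    g = approx_knn_graph(X)
    print(len(g), g[0][:5])
```

The resulting approximate graph can then be handed to the diffusion step in place of the exact kNN graph.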
|||||
2019 | Massively Parallel Path Space Filtering | Binder Nikolaus, Fricke Sascha, Keller Alexander | Arxiv | Restricting path tracing to a small number of paths per pixel for performance reasons rarely achieves a satisfactory image quality for scenes of interest. However, path space filtering may dramatically improve the visual quality by sharing information across vertices of paths classified as proximate. Unlike screen space-based approaches, these paths neither need to be present on the screen, nor is filtering restricted to the first intersection with the scene. While searching proximate vertices had been more expensive than filtering in screen space, we greatly improve over this performance penalty by storing, updating, and looking up the required information in a hash table. The keys are constructed from jittered and quantized information, such that only a single query very likely replaces costly neighborhood searches. A massively parallel implementation of the algorithm is demonstrated on a graphics processing unit (GPU). |
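A toy sketch of the hashed-key idea the abstract relies on: vertex positions are (optionally jittered and) quantized into integer cell keys, so one hash-table lookup stands in for a costly spatial neighborhood search. The cell size, jitter, scalar contributions, and single-threaded Python are all illustrative simplifications; the actual method uses richer keys and runs massively parallel on the GPU.

```python
import numpy as np

def voxel_key(position, cell=0.25, jitter=None, rng=None):
    """Quantize a 3-D vertex position into an integer hash key.

    Jittering before quantization (a stand-in for the paper's jittered,
    quantized key construction) hides hard voxel boundaries on average.
    """
    p = np.asarray(position, dtype=float)
    if jitter:
        p = p + (rng or np.random.default_rng()).uniform(-jitter, jitter, 3)
    return tuple(np.floor(p / cell).astype(int))

class PathSpaceFilter:
    """Accumulate and query per-cell averaged contributions with one hash lookup."""

    def __init__(self):
        self.sums, self.counts = {}, {}

    def add(self, position, contribution, **kw):
        k = voxel_key(position, **kw)
        self.sums[k] = self.sums.get(k, 0.0) + contribution
        self.counts[k] = self.counts.get(k, 0) + 1

    def query(self, position, **kw):
        k = voxel_key(position, **kw)
        return self.sums[k] / self.counts[k] if k in self.counts else None

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f = PathSpaceFilter()
    for _ in range(1000):
        f.add(rng.uniform(0, 1, 3), rng.random())
    print(f.query([0.5, 0.5, 0.5]))
```

Sharing the averaged cell value among all path vertices that map to the same key is the "filtering" step; the hash table replaces the neighborhood search that earlier approaches needed.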
|||||
2019 | Learning To Route In Similarity Graphs | Baranchuk Dmitry, Persiyanov Dmitry, Sinitsin Anton, Babenko Artem | Arxiv | Recently similarity graphs became the leading paradigm for efficient nearest neighbor search, outperforming traditional tree-based and LSH-based methods. Similarity graphs perform the search via greedy routing: a query traverses the graph and in each vertex moves to the adjacent vertex that is the closest to this query. In practice, similarity graphs are often susceptible to local minima, when queries do not reach its nearest neighbors, getting stuck in suboptimal vertices. In this paper we propose to learn the routing function that overcomes local minima via incorporating information about the graph global structure. In particular, we augment the vertices of a given graph with additional representations that are learned to provide the optimal routing from the start vertex to the query nearest neighbor. By thorough experiments, we demonstrate that the proposed learnable routing successfully diminishes the local minima problem and significantly improves the overall search performance. |
|||||
2019 | Collaborative Quantization For Cross-modal Similarity Search | Zhang Ting, Wang Jingdong | Arxiv | Cross-modal similarity search is a problem about designing a search system supporting querying across content modalities, e.g., using an image to search for texts or using a text to search for images. This paper presents a compact coding solution for efficient search, with a focus on the quantization approach which has already shown the superior performance over the hashing solutions in the single-modal similarity search. We propose a cross-modal quantization approach, which is among the early attempts to introduce quantization into cross-modal search. The major contribution lies in jointly learning the quantizers for both modalities through aligning the quantized representations for each pair of image and text belonging to a document. In addition, our approach simultaneously learns the common space for both modalities in which quantization is conducted to enable efficient and effective search using the Euclidean distance computed in the common space with fast distance table lookup. Experimental results compared with several competitive algorithms over three benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance. |
|||||
2019 | Hierarchy Neighborhood Discriminative Hashing For An Unified View Of Single-label And Multi-label Image Retrieval | Ma Lei, Li Hongliang, Wu Qingbo, Meng Fanman, Ngan King Ngi | Arxiv | Recently, deep supervised hashing methods have become popular for the large-scale image retrieval task. To preserve the semantic similarity notion between examples, they typically utilize the pairwise supervision or the triplet supervised information for hash learning. However, these methods usually ignore the semantic class information which can help the improvement of the semantic discriminative ability of hash codes. In this paper, we propose a novel hierarchy neighborhood discriminative hashing method. Specifically, we construct a bipartite graph to build a coarse semantic neighbourhood relationship between the sub-class feature centers and the embedding features. Moreover, we utilize the pairwise supervised information to construct the fine semantic neighbourhood relationship between embedding features. Finally, we propose a hierarchy neighborhood discriminative hashing loss to unify the single-label and multi-label image retrieval problems with a one-stream deep neural network architecture. Experimental results on two large-scale datasets demonstrate that the proposed method can outperform the state-of-the-art hashing methods. |
|||||
2019 | Visualizing Deep Similarity Networks | Stylianou Abby, Souvenir Richard, Pless Robert | Arxiv | For convolutional neural network models that optimize an image embedding, we propose a method to highlight the regions of images that contribute most to pairwise similarity. This work is a corollary to the visualization tools developed for classification networks, but applicable to the problem domains better suited to similarity learning. The visualization shows how similarity networks that are fine-tuned learn to focus on different features. We also generalize our approach to embedding networks that use different pooling strategies and provide a simple mechanism to support image similarity searches on objects or sub-regions in the query image. |
|||||
2019 | The Superm-tree Indexing Metric Spaces With Sized Objects | Bachmann Jörg P. | Arxiv | A common approach to implementing similarity search applications is the usage of distance functions, where small distances indicate high similarity. In the case of metric distance functions, metric index structures can be used to accelerate nearest neighbor queries. On the other hand, many applications ask for approximate subsequences or subsets, e.g. searching for a similar partial sequence of a gene, for a similar scene in a movie, or for a similar object in a picture which is represented by a set of multidimensional features. Metric index structures such as the M-Tree cannot be utilized for these tasks because of the symmetry of the metric distance functions. In this work, we propose the SuperM-Tree as an extension of the M-Tree where approximate subsequence and subset queries become nearest neighbor queries. In order to do this, we introduce metric subset spaces as a generalized concept of metric spaces. Various metric distance functions can be extended to metric subset distance functions, e.g. the Euclidean distance (on windows), the Hausdorff distance (on subsets), the Edit distance and the Dog-Keeper distance (on subsequences). We show that these examples subsume the applications mentioned above. |
|||||
2019 | Space And Time Efficient Kernel Density Estimation In High Dimensions | Arturs Backurs, Piotr Indyk, Tal Wagner | Neural Information Processing Systems | Recently, Charikar and Siminelakis (2017) presented a framework for kernel density estimation in provably sublinear query time, for kernels that possess a certain hashing-based property. However, their data structure requires a significantly increased super-linear storage space, as well as super-linear preprocessing time. These limitations inhibit the practical applicability of their approach on large datasets. In this work, we present an improvement to their framework that retains the same query time, while requiring only linear space and linear preprocessing time. We instantiate our framework with the Laplacian and Exponential kernels, two popular kernels which possess the aforementioned property. Our experiments on various datasets verify that our approach attains accuracy and query time similar to Charikar and Siminelakis (2017), with significantly improved space and preprocessing time. |
|||||
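A note for readers skimming this entry: the quantity being approximated is the average kernel value between the query and the dataset, \(\mathrm{KDE}_P(q) = \frac{1}{|P|}\sum_{p \in P} k(p,q)\), where \(k\) is, for example, the exponential kernel \(e^{-\|p-q\|_2/\sigma}\); the notation in this note is ours, not taken from the paper.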
2019 | Fair Near Neighbor Search Independent Range Sampling In High Dimensions | Aumüller Martin, Pagh Rasmus, Silvestri Francesco | Arxiv | Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the \(r\)-near neighbor (\(r\)-NN) problem: given a radius \(r>0\) and a set of points \(S\), construct a data structure that, for any given query point \(q\), returns a point \(p\) within distance at most \(r\) from \(q\). In this paper, we study the \(r\)-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance \(r\) from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for \(r\)-NN where all points in \(S\) that are near \(q\) have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem. |
|||||
2019 | PUFFINN Parameterless And Universally Fast Finding Of Nearest Neighbors | Aumüller Martin, Christiani Tobias, Pagh Rasmus, Vesterli Michael | Arxiv | We present PUFFINN, a parameterless LSH-based index for solving the \(k\)-nearest neighbor problem with probabilistic guarantees. By parameterless we mean that the user is only required to specify the amount of memory the index is supposed to use and the result quality that should be achieved. The index combines several heuristic ideas known in the literature. By small adaptations to the query algorithm, we make these heuristics rigorous. We perform experiments on real-world and synthetic inputs to evaluate implementation choices and show that the implementation satisfies the quality guarantees while being competitive with other state-of-the-art approaches to nearest neighbor search. We describe a novel synthetic data set that is difficult to solve for almost all existing nearest neighbor search approaches, and for which PUFFINN significantly outperforms previous methods. |
|||||
2019 | Global Hashing System For Fast Image Search | Tian Dayong, Tao Dacheng | Arxiv | Hashing methods have been widely investigated for fast approximate nearest neighbor searching in large data sets. Most existing methods use binary vectors in lower dimensional spaces to represent data points that are usually real vectors of higher dimensionality. We divide the hashing process into two steps. Data points are first embedded in a low-dimensional space, and the global positioning system method is subsequently introduced but modified for binary embedding. We devise data-independent and data-dependent methods to distribute the satellites at appropriate locations. Our methods are based on finding the tradeoff between the information losses in these two steps. Experiments show that our data-dependent method outperforms other methods in different-sized data sets from 100k to 10M. By incorporating the orthogonality of the code matrix, both our data-independent and data-dependent methods are particularly impressive in experiments with longer code lengths. |
|||||
2019 | SHREWD Semantic Hierarchy-based Relational Embeddings For Weakly-supervised Deep Hashing | Arponen Heikki, Bishop Tom E | Arxiv | Using class labels to represent class similarity is a typical approach to training deep hashing systems for retrieval; samples from the same or different classes take binary 1 or 0 similarity values. This similarity does not model the full rich knowledge of semantic relations that may be present between data points. In this work we build upon the idea of using semantic hierarchies to form distance metrics between all available sample labels; for example cat to dog has a smaller distance than cat to guitar. We combine this type of semantic distance into a loss function to promote similar distances between the deep neural network embeddings. We also introduce an empirical Kullback-Leibler divergence loss term to promote binarization and uniformity of the embeddings. We test the resulting SHREWD method and demonstrate improvements in hierarchical retrieval scores using compact, binary hash codes instead of real valued ones, and show that in a weakly supervised hashing setting we are able to learn competitively without explicitly relying on class labels, but instead on similarities between labels. |
|||||
2019 | SADIH Semantic-aware Discrete Hashing | Zhang Zheng, Xie Guo-sen, Li Yang, Li Sheng, Huang Zi | Arxiv | Due to its low storage cost and fast query speed, hashing has been recognized as a way to accomplish similarity search in large-scale multimedia retrieval applications. In particular, supervised hashing has recently received considerable research attention by leveraging the label information to preserve the pairwise similarities of data points in the Hamming space. However, there still remain two crucial bottlenecks: 1) the learning process of full pairwise similarity preservation is computationally unaffordable and unscalable for big data; 2) the available category information of the data is not well-explored to learn discriminative hash functions. To overcome these challenges, we propose a unified Semantic-Aware DIscrete Hashing (SADIH) framework, which aims to directly embed the transformed semantic information into the asymmetric similarity approximation and discriminative hashing function learning. Specifically, a semantic-aware latent embedding is introduced to asymmetrically preserve the full pairwise similarities while skillfully handling the cumbersome \(n \times n\) pairwise similarity matrix. Meanwhile, a semantic-aware autoencoder is developed to jointly preserve the data structures in the discriminative latent semantic space and perform data reconstruction. Moreover, an efficient alternating optimization algorithm is proposed to solve the resulting discrete optimization problem. Extensive experimental results on multiple large-scale datasets demonstrate that our SADIH can clearly outperform the state-of-the-art baselines with the additional benefit of lower computational costs. |
|||||
2019 | Derived Codebooks For High-accuracy Nearest Neighbor Search | André Fabien, Kermarrec Anne-marie, Scouarnec Nicolas Le | Arxiv | High-dimensional Nearest Neighbor (NN) search is central in multimedia search systems. Product Quantization (PQ) is a widespread NN search technique which has a high performance and good scalability. PQ compresses high-dimensional vectors into compact codes thanks to a combination of quantizers. Large databases can, therefore, be stored entirely in RAM, enabling fast responses to NN queries. In almost all cases, PQ uses 8-bit quantizers as they offer low response times. In this paper, we advocate the use of 16-bit quantizers. Compared to 8-bit quantizers, 16-bit quantizers boost accuracy but they increase response time by a factor of 3 to 10. We propose a novel approach that allows 16-bit quantizers to offer the same response time as 8-bit quantizers, while still providing a boost of accuracy. Our approach builds on two key ideas: (i) the construction of derived codebooks that allow a fast and approximate distance evaluation, and (ii) a two-pass NN search procedure which builds a candidate set using the derived codebooks, and then refines it using 16-bit quantizers. On 1 billion SIFT vectors, with an inverted index, our approach offers a Recall@100 of 0.85 in 5.2 ms. By contrast, 16-bit quantizers alone offer a Recall@100 of 0.85 in 39 ms, and 8-bit quantizers a Recall@100 of 0.82 in 3.8 ms. |
|||||
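The derived-codebook scheme above refines standard product quantization (PQ), in which query-to-database distances are evaluated through per-subspace lookup tables. Below is a minimal, illustrative sketch of that PQ baseline with asymmetric distance computation, not the authors' derived-codebook method; the subspace count, codebook size, and function names are our own assumptions.

```python
import numpy as np

def train_pq(data, m=4, k=64, iters=10, seed=0):
    """Train a product quantizer: split dimensions into m subspaces and run
    a few Lloyd (k-means) iterations per subspace to learn k centroids each."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    ds = d // m
    codebooks = []
    for j in range(m):
        sub = data[:, j * ds:(j + 1) * ds]
        cent = sub[rng.choice(n, k, replace=False)].copy()
        for _ in range(iters):
            assign = np.argmin(((sub[:, None, :] - cent[None]) ** 2).sum(-1), axis=1)
            for c in range(k):
                pts = sub[assign == c]
                if len(pts):
                    cent[c] = pts.mean(axis=0)
        codebooks.append(cent)
    return codebooks

def encode(data, codebooks):
    """Compress each vector into m small codes (one centroid id per subspace)."""
    m, ds = len(codebooks), data.shape[1] // len(codebooks)
    codes = np.empty((data.shape[0], m), dtype=np.int32)
    for j, cent in enumerate(codebooks):
        sub = data[:, j * ds:(j + 1) * ds]
        codes[:, j] = np.argmin(((sub[:, None, :] - cent[None]) ** 2).sum(-1), axis=1)
    return codes

def adc_search(query, codes, codebooks, topk=5):
    """Asymmetric distance computation: precompute per-subspace tables of squared
    distances from the query to every centroid, then score each database code by
    summing m table lookups."""
    m, ds = len(codebooks), query.shape[0] // len(codebooks)
    tables = [((codebooks[j] - query[j * ds:(j + 1) * ds]) ** 2).sum(-1) for j in range(m)]
    dist = sum(tables[j][codes[:, j]] for j in range(m))
    return np.argsort(dist)[:topk]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    base = rng.normal(size=(2000, 32)).astype(np.float32)
    cb = train_pq(base, m=4, k=64)
    codes = encode(base, cb)
    print(adc_search(base[0], codes, cb))  # item 0 should appear among the top results
```

The two-pass procedure described in the abstract would then re-rank the candidate set produced by a coarse pass of this kind using the finer (e.g., 16-bit) quantizers.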
2019 | Adversarially Trained Deep Neural Semantic Hashing Scheme For Subjective Search In Fashion Inventory | Singh Saket, Sheet Debdoot, Dasgupta Mithun | Arxiv | The simple approach of retrieving the closest match of a query image from the gallery compares an image pair using the sum of absolute differences in pixel or feature space. The process is computationally expensive, sensitive to illumination, background composition, and pose variation, and inefficient to deploy on gallery sets with more than 1000 elements. Hashing is a faster alternative which involves representing images in reduced-dimensional simple feature spaces. Encoding images into binary hash codes enables similarity comparison in an image pair using the Hamming distance measure. The challenge, however, lies in encoding the images using a semantic hashing scheme that lets subjective neighbors lie within the tolerable Hamming radius. This work presents a solution employing adversarial learning of a deep neural semantic hashing network for fashion inventory retrieval. It consists of a feature-extracting convolutional neural network (CNN) learned to (i) minimize error in classifying the type of clothing, (ii) minimize the Hamming distance between semantic neighbors and maximize the distance between semantically dissimilar images, (iii) maximally scramble a discriminator’s ability to identify the corresponding hash code-image pair when processing a semantically similar query-gallery image pair. Experimental validation for fashion inventory search yields a mean average precision (mAP) of 90.65% in finding the closest match as compared to 53.26% obtained by the prior art of deep Cauchy hashing for Hamming space retrieval. |
|||||
2019 | Deep Supervised Hashing Leveraging Quadratic Spherical Mutual Information For Content-based Image Retrieval | Passalis Nikolaos, Tefas Anastasios | Arxiv | Several deep supervised hashing techniques have been proposed to allow for efficiently querying large image databases. However, deep supervised image hashing techniques are developed, to a great extent, heuristically often leading to suboptimal results. Contrary to this, we propose an efficient deep supervised hashing algorithm that optimizes the learned codes using an information-theoretic measure, the Quadratic Mutual Information (QMI). The proposed method is adapted to the needs of large-scale hashing and information retrieval leading to a novel information-theoretic measure, the Quadratic Spherical Mutual Information (QSMI). Apart from demonstrating the effectiveness of the proposed method under different scenarios and outperforming existing state-of-the-art image hashing techniques, this paper provides a structured way to model the process of information retrieval and develop novel methods adapted to the needs of each application. |
|||||
2019 | Transductive Zero-shot Hashing For Multilabel Image Retrieval | Zou Qin, Zhang Zheng, Cao Ling, Chen Long, Wang Song | IEEE Transactions on Neural Networks and Learning Systems | Hash coding has been widely used in approximate nearest neighbor search for large-scale image retrieval. Given semantic annotations such as class labels and pairwise similarities of the training data, hashing methods can learn and generate effective and compact binary codes. As some newly introduced images may contain undefined semantic labels, which we call unseen images, zero-shot hashing techniques have been studied. However, existing zero-shot hashing methods focus on the retrieval of single-label images, and cannot handle multi-label images. In this paper, for the first time, a novel transductive zero-shot hashing method is proposed for multi-label unseen image retrieval. In order to predict the labels of the unseen/target data, a visual-semantic bridge is built via instance-concept coherence ranking on the seen/source data. Then, pairwise similarity loss and focal quantization loss are constructed for training a hashing model using both the seen/source and unseen/target data. Extensive evaluations on three popular multi-label datasets demonstrate that the proposed hashing method achieves significantly better results than the competing methods. |
|||||
2019 | Zero-shot Deep Hashing And Neural Network Based Error Correction For Face Template Protection | Talreja Veeru, Valenti Matthew C., Nasrabadi Nasser M. | Arxiv | In this paper, we present a novel architecture that integrates a deep hashing framework with a neural network decoder (NND) for application to face template protection. It improves upon existing face template protection techniques to provide better matching performance with one-shot and multi-shot enrollment. A key novelty of our proposed architecture is that the framework can also be used with zero-shot enrollment. This implies that our architecture does not need to be re-trained even if a new subject is to be enrolled into the system. The proposed architecture consists of two major components: a deep hashing (DH) component, which is used for robust mapping of face images to their corresponding intermediate binary codes, and a NND component, which corrects errors in the intermediate binary codes that are caused by differences in the enrollment and probe biometrics due to factors such as variation in pose, illumination, and other factors. The final binary code generated by the NND is then cryptographically hashed and stored as a secure face template in the database. The efficacy of our approach with zero-shot, one-shot, and multi-shot enrollments is shown for CMU-PIE, Extended Yale B, WVU multimodal and Multi-PIE face databases. With zero-shot enrollment, the system achieves approximately 85% genuine accept rates (GAR) at 0.01% false accept rate (FAR), and with one-shot and multi-shot enrollments, it achieves approximately 99.95% GAR at 0.01% FAR, while providing a high level of template security. |
|||||
2019 | Scalable Source Code Similarity Detection In Large Code Repositories | Alomari F, Harbi M | EAI Endorsed Transactions on Scalable Information Systems Online first | Source code similarity detection is increasingly used in application development to identify clones, isolate bugs, and find copyright violations. Similar code fragments can be very problematic due to the fact that errors in the original code must be fixed in every copy. Other maintenance changes, such as extensions or patches, must be applied multiple times. Furthermore, the diversity of coding styles and the flexibility of modern languages make it difficult and cost-ineffective to manually inspect large code repositories. Therefore, detection is only feasible with automatic techniques. We present an efficient and scalable approach for similar code fragment identification based on fingerprinting of source code control flow graphs. The source code is processed to generate control flow graphs that are then hashed to create a unique fingerprint of the code, capturing semantic as well as syntactic similarity. The fingerprints can then be efficiently stored and retrieved to perform similarity search between code fragments. Experimental results from our prototype implementation support the validity of our approach and show its effectiveness and efficiency in comparison with other solutions. |
|||||
2019 | Conlsh Context Based Locality Sensitive Hashing For Mapping Of Noisy SMRT Reads | Chakraborty Angana, Bandyopadhyay Sanghamitra | Arxiv | Single Molecule Real-Time (SMRT) sequencing is a recent advancement of Next Gen technology developed by Pacific Biosciences (PacBio). It comes with an explosion of long and noisy reads, demanding cutting-edge research to get the most out of it. To deal with the high error probability of SMRT data, a novel contextual Locality Sensitive Hashing (conLSH) based algorithm is proposed in this article, which can effectively align the noisy SMRT reads to the reference genome. Here, sequences are hashed together based not only on their closeness, but also on similarity of context. The algorithm has \(\mathcal{O}(n^{\rho+1})\) space requirement, where \(n\) is the number of sequences in the corpus and \(\rho\) is a constant. The indexing time and querying time are bounded by \(\mathcal{O}( \frac{n^{\rho+1} \cdot \ln n}{\ln \frac{1}{P_2}})\) and \(\mathcal{O}(n^\rho)\) respectively, where \(P_2 > 0\) is a probability value. This algorithm is particularly useful for retrieving similar sequences, a widely used task in biology. The proposed conLSH based aligner is compared with rHAT, popularly used for aligning SMRT reads, and is found to comprehensively beat it in speed as well as in memory requirements. In particular, it takes approximately \(24.2\%\) less processing time, while saving about \(70.3\%\) in peak memory requirement for the H.sapiens PacBio dataset. |
|||||
2019 | Signal-to-noise Ratio A Robust Distance Metric For Deep Metric Learning | Yuan Tongtong, Deng Weihong, Tang Jian, Tang Yinan, Chen Binghui | Arxiv | Deep metric learning, which learns discriminative features to process image clustering and retrieval tasks, has attracted extensive attention in recent years. A number of deep metric learning methods, which ensure that similar examples are mapped close to each other and dissimilar examples are mapped farther apart, have been proposed to construct effective structures for loss functions and have shown promising results. In this paper, different from the approaches on learning the loss structures, we propose a robust SNR distance metric based on Signal-to-Noise Ratio (SNR) for measuring the similarity of image pairs for deep metric learning. By exploring the properties of our SNR distance metric from the view of geometry space and statistical theory, we analyze the properties of our metric and show that it can preserve the semantic similarity between image pairs, which well justify its suitability for deep metric learning. Compared with Euclidean distance metric, our SNR distance metric can further jointly reduce the intra-class distances and enlarge the inter-class distances for learned features. Leveraging our SNR distance metric, we propose Deep SNR-based Metric Learning (DSML) to generate discriminative feature embeddings. By extensive experiments on three widely adopted benchmarks, including CARS196, CUB200-2011 and CIFAR10, our DSML has shown its superiority over other state-of-the-art methods. Additionally, we extend our SNR distance metric to deep hashing learning, and conduct experiments on two benchmarks, including CIFAR10 and NUS-WIDE, to demonstrate the effectiveness and generality of our SNR distance metric. |
|||||
2019 | Fast Hashing With Strong Concentration Bounds | Aamand Anders, Knudsen Jakob B. T., Knudsen Mathias B. T., Rasmussen Peter M. R., Thorup Mikkel | Arxiv | Previous work on tabulation hashing by Patrascu and Thorup from STOC’11 on simple tabulation and from SODA’13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of \(c=O(1)\) characters, e.g., a 64-bit key as \(c=8\) characters of 8-bits. The character domain \(\Sigma\) should be small enough that character tables of size \(|\Sigma|\) fit in fast cache. The schemes then use \(O(1)\) tables of this size, so the space of tabulation hashing is \(O(|\Sigma|)\). However, the concentration bounds by Patrascu and Thorup only apply if the expected sums are \(\ll |\Sigma|\). To see the problem, consider the very simple case where we use tabulation hashing to throw \(n\) balls into \(m\) bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if \(n=m\), for then the expected value is \(1\). However, if \(m=2\), as when tossing \(n\) unbiased coins, the expected value \(n/2\) is \(\gg |\Sigma|\) for large data sets, e.g., data sets that do not fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds. |
|||||
2019 | Semantic Hierarchy Preserving Deep Hashing For Large-scale Image Retrieval | Zhang Ming, Zhe Xuefei, Ou-yang Le, Chen Shifeng, Yan Hong | Arxiv | Deep hashing models have been proposed as an efficient method for large-scale similarity search. However, most existing deep hashing methods only utilize fine-level labels for training while ignoring the natural semantic hierarchy structure. This paper presents an effective method that preserves the classwise similarity of full-level semantic hierarchy for large-scale image retrieval. Experiments on two benchmark datasets show that our method helps improve the fine-level retrieval performance. Moreover, with the help of the semantic hierarchy, it can produce significantly better binary codes for hierarchical retrieval, which indicates its potential of providing more user-desired retrieval results. |
|||||
2019 | Approximate Similarity Search Under Edit Distance Using Locality-sensitive Hashing | Mccauley Samuel | Arxiv | Edit distance similarity search, also called approximate pattern matching, is a fundamental problem with widespread database applications. The goal of the problem is to preprocess \(n\) strings of length \(d\), to quickly answer queries \(q\) of the form: if there is a database string within edit distance \(r\) of \(q\), return a database string within edit distance \(cr\) of \(q\). Previous approaches to this problem either rely on very large (superconstant) approximation ratios \(c\), or very small search radii \(r\). Outside of a narrow parameter range, these solutions are not competitive with trivially searching through all \(n\) strings. In this work we give a simple and easy-to-implement hash function that can quickly answer queries for a wide range of parameters. Specifically, our strategy can answer queries in time \(\tilde{O}(d3^rn^{1/c})\). The best known practical results require \(c \gg r\) to achieve any correctness guarantee; meanwhile, the best known theoretical results are very involved and difficult to implement, and require query time at least \(24^r\). Our results significantly broaden the range of parameters for which we can achieve nontrivial bounds, while retaining the practicality of a locality-sensitive hash function. We also show how to apply our ideas to the closely-related Approximate Nearest Neighbor problem for edit distance, obtaining similar time bounds. |
|||||
2019 | Compositional Embeddings Using Complementary Partitions For Memory-efficient Recommendation Systems | Shi Hao-jun Michael, Mudigere Dheevatsa, Naumov Maxim, Yang Jiyan | Arxiv | Modern deep learning-based recommendation systems exploit hundreds to thousands of different categorical features, each with millions of different categories ranging from clicks to posts. To respect the natural diversity within the categorical data, embeddings map each category to a unique dense representation within an embedded space. Since each categorical feature could take on as many as tens of millions of different possible categories, the embedding tables form the primary memory bottleneck during both training and inference. We propose a novel approach for reducing the embedding size in an end-to-end fashion by exploiting complementary partitions of the category set to produce a unique embedding vector for each category without explicit definition. By storing multiple smaller embedding tables based on each complementary partition and combining embeddings from each table, we define a unique embedding for each category at smaller memory cost. This approach may be interpreted as using a specific fixed codebook to ensure uniqueness of each category’s representation. Our experimental results demonstrate the effectiveness of our approach over the hashing trick for reducing the size of the embedding tables in terms of model loss and accuracy, while retaining a similar reduction in the number of parameters. |
|||||
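One well-known instance of complementary partitions is the quotient-remainder construction: index one small table by `id // m` and a second by `id % m`, then combine the two rows so that every category still receives a distinct vector. The sketch below illustrates that idea under our own assumptions (table sizing via \(\lceil\sqrt{N}\rceil\) and an element-wise-product combiner); it is not the authors' reference implementation.

```python
import numpy as np

class QREmbedding:
    """Compositional embedding via two complementary partitions of the id space:
    quotient (id // m) and remainder (id % m). Each category still maps to a
    unique vector, but storage drops from O(N * dim) to O((N/m + m) * dim)."""

    def __init__(self, num_categories, dim, m=None, seed=0):
        rng = np.random.default_rng(seed)
        self.m = m or int(np.ceil(np.sqrt(num_categories)))
        q_rows = int(np.ceil(num_categories / self.m))
        self.q_table = rng.normal(scale=0.1, size=(q_rows, dim))
        self.r_table = rng.normal(scale=0.1, size=(self.m, dim))

    def lookup(self, ids):
        ids = np.asarray(ids)
        q = self.q_table[ids // self.m]   # row from the "quotient" partition
        r = self.r_table[ids % self.m]    # row from the "remainder" partition
        return q * r                      # element-wise product combiner

if __name__ == "__main__":
    emb = QREmbedding(num_categories=1_000_000, dim=16)
    vecs = emb.lookup([0, 1, 999_999])
    print(vecs.shape)  # (3, 16); roughly 2,000 stored rows instead of 1,000,000
```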
2019 | Effective And Efficient Indexing In Cross-modal Hashing-based Datasets | Markchit Sarawut, Chiu Chih-yi | Arxiv | To overcome the barrier of storage and computation, the hashing technique has recently been widely used for nearest neighbor search in multimedia retrieval applications. In particular, cross-modal retrieval that searches across different modalities has become an active but challenging problem. Although dozens of cross-modal hashing algorithms have been proposed to yield compact binary codes, exhaustive search is impractical for real-time purposes, and Hamming distance computation suffers from inaccurate results. In this paper, we propose a novel search method that utilizes a probability-based index scheme over binary hash codes in cross-modal retrieval. The proposed hash code indexing scheme exploits a few binary bits of the hash code as the index code. We construct an inverted index table based on index codes and train a neural network to improve the indexing accuracy and efficiency. Experiments are performed on two benchmark datasets for retrieval across image and text modalities, where hash codes are generated by three cross-modal hashing methods. Results show that the proposed method effectively boosts the performance of these hashing methods. |
|||||
2019 | Towards Optimal Discrete Online Hashing With Balanced Similarity | Lin Mingbao, Ji Rongrong, Liu Hong, Sun Xiaoshuai, Wu Yongjian, Wu Yunsheng | Arxiv | When facing large-scale image datasets, online hashing serves as a promising solution for online retrieval and prediction tasks. It encodes the online streaming data into compact binary codes, and simultaneously updates the hash functions to renew codes of the existing dataset. To this end, existing methods update the hash functions solely based on the new data batch, without investigating the correlation between such new data and the existing dataset. In addition, existing works update the hash functions using a relaxation process in the corresponding approximated continuous space, and it remains an open problem to directly apply discrete optimization in online hashing. In this paper, we propose a novel supervised online hashing method, termed Balanced Similarity for Online Discrete Hashing (BSODH), to solve the above problems in a unified framework. BSODH employs a well-designed hashing algorithm to preserve the similarity between the streaming data and the existing dataset via an asymmetric graph regularization. We further identify the “data-imbalance” problem brought by the constructed asymmetric graph, which restricts the application of discrete optimization in our problem. Therefore, a novel balanced similarity is further proposed, which uses two equilibrium factors to balance the similar and dissimilar weights and eventually enables the usage of discrete optimization. Extensive experiments conducted on three widely-used benchmarks demonstrate the advantages of the proposed method over the state-of-the-art methods. |
|||||
2019 | Supervised Online Hashing Via Similarity Distribution Learning | Lin Mingbao, Ji Rongrong, Chen Shen, Zheng Feng, Sun Xiaoshuai, Zhang Baochang, Cao Liujuan, Guo Guodong, Huang Feiyue | Arxiv | Online hashing has attracted extensive research attention when facing streaming data. Most online hashing methods, learning binary codes based on pairwise similarities of training instances, fail to capture the semantic relationship, and suffer from poor generalization in large-scale applications due to large variations. In this paper, we propose to model the similarity distributions between the input data and the hashing codes, upon which a novel supervised online hashing method, dubbed Similarity Distribution based Online Hashing (SDOH), is proposed to keep the intrinsic semantic relationship in the produced Hamming space. Specifically, we first transform the discrete similarity matrix into a probability matrix via a Gaussian-based normalization to address the extremely imbalanced distribution issue. Then, we introduce a scaling Student t-distribution to solve the challenging initialization problem, and efficiently bridge the gap between the known and unknown distributions. Lastly, we align the two distributions by minimizing the Kullback-Leibler divergence (KL-divergence) with stochastic gradient descent (SGD), by which an intuitive similarity constraint is imposed to update the hashing model on the new streaming data with a strong ability to generalize to the past data. Extensive experiments on three widely-used benchmarks validate the superiority of the proposed SDOH over the state-of-the-art methods in the online retrieval task. |
|||||
2019 | Hadamard Matrix Guided Online Hashing | Lin Mingbao, Ji Rongrong, Liu Hong, Sun Xiaoshuai, Chen Shen, Tian Qi | Arxiv | Online image hashing has attracted increasing research attention recently; it receives large-scale data in a streaming manner and updates the hash functions on-the-fly. Its key challenge lies in the difficulty of balancing learning timeliness and model accuracy. To this end, most works follow a supervised setting, i.e., using class labels to boost the hashing performance, which is deficient in two aspects: First, strong constraints, e.g., orthogonality or similarity preservation, are used, which however are typically relaxed and lead to a large accuracy drop. Second, large numbers of training batches are required to learn the up-to-date hash functions, which largely increases the learning complexity. To handle the above challenges, a novel supervised online hashing scheme termed Hadamard Matrix Guided Online Hashing (HMOH) is proposed in this paper. Our key innovation lies in introducing the Hadamard matrix, which is an orthogonal binary matrix built via the Sylvester method. In particular, to release the need for strong constraints, we regard each column of the Hadamard matrix as the target code for each class label, which by nature satisfies several desired properties of hashing codes. To accelerate the online training, LSH is first adopted to align the lengths of the target code and the to-be-learned binary code. We then treat the learning of hash functions as a set of binary classification problems to fit the assigned target code. Finally, extensive experiments demonstrate the superior accuracy and efficiency of the proposed method over various state-of-the-art methods. Codes are available at https://github.com/lmbxmu/mycode. |
|||||
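The central ingredient named in the HMOH abstract, a Hadamard matrix built by the Sylvester method whose columns serve as per-class target codes, is easy to reproduce. The sketch below shows only that ingredient, not the full online-hashing pipeline; the label-to-column assignment is an assumption made for illustration.

```python
import numpy as np

def sylvester_hadamard(order):
    """Build a Hadamard matrix of size `order` (a power of two) by the
    Sylvester recursion H_{2n} = [[H_n, H_n], [H_n, -H_n]]."""
    assert order > 0 and (order & (order - 1)) == 0, "order must be a power of two"
    H = np.array([[1]])
    while H.shape[0] < order:
        H = np.block([[H, H], [H, -H]])
    return H

def class_target_codes(num_classes, code_length):
    """Assign one Hadamard column per class as its +/-1 target code."""
    H = sylvester_hadamard(code_length)
    assert num_classes <= code_length, "need at least as many columns as classes"
    # Distinct Hadamard columns are mutually orthogonal, so any two class codes
    # disagree on exactly half of the bits: balanced, far-apart Hamming targets.
    return H[:, :num_classes].T  # shape (num_classes, code_length)

if __name__ == "__main__":
    targets = class_target_codes(num_classes=10, code_length=32)
    d01 = np.sum(targets[0] != targets[1])
    print(targets.shape, d01)  # (10, 32) and a pairwise Hamming distance of 16
```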
2019 | A Memory-efficient Sketch Method For Estimating High Similarities In Streaming Sets | Wang Pinghui, Qi Yiyan, Zhang Yuanming, Zhai Qiaozhu, Wang Chenxu, Lui John C. S., Guan Xiaohong | Arxiv | Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating Jaccard similarity of sets and has been successfully used for many applications such as similarity search and large scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). Although MinHash can be applied to static sets as well as streaming sets, of which elements are given in a streaming fashion and cardinality is unknown or even infinite, unfortunately, b-bit MinHash and Odd Sketch fail to deal with streaming data. To solve this problem, we design a memory efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller sized registers (each register consists of less than 7 bits) to build a compact sketch for each set. We also provide a simple yet accurate estimator for inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive formulas for bounding the estimation error and determine the smallest necessary memory usage (i.e., the number of registers used for a MaxLogHash sketch) for the desired accuracy. We conduct experiments on a variety of datasets, and experimental results show that our method MaxLogHash is about 5 times more memory efficient than MinHash with the same accuracy and computational cost for estimating high similarities. |
|||||
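For context, the MinHash baseline that MaxLogHash compresses estimates Jaccard similarity as the fraction of hash functions whose set-wise minima collide. The following is a minimal sketch of that classic baseline (not the MaxLogHash register layout); the universal hash family and the parameter choices are illustrative assumptions.

```python
import numpy as np

MERSENNE = (1 << 31) - 1  # prime modulus for the hash family h_i(x) = (a_i*x + b_i) mod p

def make_hashes(num_hashes=256, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.integers(1, MERSENNE, size=num_hashes, dtype=np.uint64)
    b = rng.integers(0, MERSENNE, size=num_hashes, dtype=np.uint64)
    return a, b

def minhash_signature(items, a, b):
    """For each hash function, keep the minimum hash value over the whole set."""
    xs = np.array([hash(x) % MERSENNE for x in items], dtype=np.uint64)
    vals = (a[:, None] * xs[None, :] + b[:, None]) % MERSENNE
    return vals.min(axis=1)

def estimate_jaccard(sig1, sig2):
    """The fraction of matching minima estimates the Jaccard similarity."""
    return float(np.mean(sig1 == sig2))

if __name__ == "__main__":
    a, b = make_hashes()
    s1 = minhash_signature(range(0, 1000), a, b)
    s2 = minhash_signature(range(100, 1100), a, b)
    print(estimate_jaccard(s1, s2))  # true Jaccard is 900 / 1100, about 0.82
```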
2019 | Embarrassingly Simple Binary Representation Learning | Shen Yuming, Qin Jie, Chen Jiaxin, Liu Li, Zhu Fan | Arxiv | Recent binary representation learning models usually require sophisticated binary optimization, similarity measure or even generative models as auxiliaries. However, one may wonder whether these non-trivial components are needed to formulate practical and effective hashing models. In this paper, we answer the above question by proposing an embarrassingly simple approach to binary representation learning. With a simple classification objective, our model only incorporates two additional fully-connected layers onto the top of an arbitrary backbone network, whilst complying with the binary constraints during training. The proposed model lower-bounds the Information Bottleneck (IB) between data samples and their semantics, and can be related to many recent `learning to hash’ paradigms. We show that, when properly designed, even such a simple network can generate effective binary codes, by fully exploring data semantics without any held-out alternating updating steps or auxiliary models. Experiments are conducted on conventional large-scale benchmarks, i.e., CIFAR-10, NUS-WIDE, and ImageNet, where the proposed simple model outperforms the state-of-the-art methods. |
|||||
2019 | Re-randomized Densification For One Permutation Hashing And Bin-wise Consistent Weighted Sampling | Ping Li, Xiaoyun Li, Cun-hui Zhang | Neural Information Processing Systems | Jaccard similarity is widely used as a distance measure in many machine learning and search applications. Typically, hashing methods are essential for the use of Jaccard similarity to be practical in large-scale settings. For hashing binary (0/1) data, the idea of one permutation hashing (OPH) with densification significantly accelerates traditional minwise hashing algorithms while providing unbiased and accurate estimates. In this paper, we propose a strategy named “re-randomization” in the process of densification that could achieve the smallest variance among all densification schemes. The success of this idea naturally inspires us to generalize one permutation hashing to weighted (non-binary) data, which results in the so-called “bin-wise consistent weighted sampling (BCWS)” algorithm. We analyze the behavior of BCWS and compare it with a recent alternative. Extensive experiments on various datasets illustrate the effectiveness of our proposed methods. |
|||||
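As background for the densification discussed above: one permutation hashing permutes the universe once, splits it into \(k\) bins, and records the minimum permuted value falling into each bin; empty bins are then filled ("densified") from neighboring bins. The sketch below is a toy version under our own assumptions; it uses a simplified rotation fill and omits both the offset trick used for unbiased estimates in the literature and the re-randomization proposed in the paper.

```python
import numpy as np

def oph_sketch(items, num_bins=32, universe=1 << 20, seed=0):
    """One permutation hashing: permute the universe once (odd multiplier on a
    power-of-two range acts as a cheap permutation), split it into num_bins
    ranges, and keep the minimum within-bin offset per range (-1 = empty bin)."""
    rng = np.random.default_rng(seed)
    a = int(rng.integers(1, universe)) | 1
    b = int(rng.integers(0, universe))
    bin_size = universe // num_bins
    sketch = np.full(num_bins, -1, dtype=np.int64)
    for x in items:
        pos = (a * (hash(x) % universe) + b) % universe
        bin_id, offset = pos // bin_size, pos % bin_size
        if sketch[bin_id] == -1 or offset < sketch[bin_id]:
            sketch[bin_id] = offset
    return sketch

def densify(sketch):
    """Simplified rotation densification: fill each empty bin from the nearest
    non-empty bin to its right (circularly). Assumes at least one non-empty bin."""
    k, out = len(sketch), sketch.copy()
    for i in range(k):
        if out[i] != -1:
            continue
        j = (i + 1) % k
        while sketch[j] == -1:
            j = (j + 1) % k
        out[i] = sketch[j]
    return out

if __name__ == "__main__":
    s1 = densify(oph_sketch(range(0, 500)))
    s2 = densify(oph_sketch(range(100, 600)))
    print(np.mean(s1 == s2))  # rough Jaccard estimate; true value is 400/600, about 0.67
```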
2019 | Random Projections With Asymmetric Quantization | Xiaoyun Li, Ping Li | Neural Information Processing Systems | The method of random projection has been a popular tool for data compression, similarity search, and machine learning. In many practical scenarios, applying quantization on randomly projected data could be very helpful to further reduce storage cost and facilitate more efficient retrievals, while only suffering from little loss in accuracy. In real-world applications, however, data collected from different sources may be quantized under different schemes, which calls for a need to study the asymmetric quantization problem. In this paper, we investigate the cosine similarity estimators derived in such setting under the Lloyd-Max (LM) quantization scheme. We thoroughly analyze the biases and variances of a series of estimators including the basic simple estimators, their normalized versions, and their debiased versions. Furthermore, by studying the monotonicity, we show that the expectation of proposed estimators increases with the true cosine similarity, on a broader family of stair-shaped quantizers. Experiments on nearest neighbor search justify the theory and illustrate the effectiveness of our proposed estimators. |
|||||
2019 | Coupled Cyclegan Unsupervised Hashing Network For Cross-modal Retrieval | Li Chao, Deng Cheng, Wang Lei, Xie De, Liu Xianglong | Arxiv | In recent years, hashing has attracted more and more attention owing to its superior capacity of low storage cost and high query efficiency in large-scale cross-modal retrieval. Benefiting from deep learning, compelling results have continually been achieved in the cross-modal retrieval community. However, existing deep cross-modal hashing methods either rely on large amounts of labeled information or are unable to learn an accurate correlation between different modalities. In this paper, we propose Unsupervised coupled Cycle generative adversarial Hashing networks (UCH) for cross-modal retrieval, where an outer-cycle network is used to learn a powerful common representation, and an inner-cycle network is used to generate reliable hash codes. Specifically, our proposed UCH seamlessly couples these two networks with a generative adversarial mechanism, which can be optimized simultaneously to learn representations and hash codes. Extensive experiments on three popular benchmark datasets show that the proposed UCH outperforms the state-of-the-art unsupervised cross-modal hashing methods. |
|||||
2019 | Deep Multi-index Hashing For Person Re-identification | Li Ming-wei, Jiang Qing-yuan, Li Wu-jun | Arxiv | Traditional person re-identification (ReID) methods typically represent person images as real-valued features, which makes ReID inefficient when the gallery set is extremely large. Recently, some hashing methods have been proposed to make ReID more efficient. However, these hashing methods will deteriorate the accuracy in general, and the efficiency of them is still not high enough. In this paper, we propose a novel hashing method, called deep multi-index hashing (DMIH), to improve both efficiency and accuracy for ReID. DMIH seamlessly integrates multi-index hashing and multi-branch based networks into the same framework. Furthermore, a novel block-wise multi-index hashing table construction approach and a search-aware multi-index (SAMI) loss are proposed in DMIH to improve the search efficiency. Experiments on three widely used datasets show that DMIH can outperform other state-of-the-art baselines, including both hashing methods and real-valued methods, in terms of both efficiency and accuracy. |
|||||
2019 | Push For Quantization Deep Fisher Hashing | Li Yunqiang, Pei Wenjie, Zha Yufei, Van Gemert Jan | Arxiv | Current massive datasets demand light-weight access for analysis. Discrete hashing methods are thus beneficial because they map high-dimensional data to compact binary codes that are efficient to store and process, while preserving semantic similarity. To optimize powerful deep learning methods for image hashing, gradient-based methods are required. Binary codes, however, are discrete and thus have no continuous derivatives. Relaxing the problem by solving it in a continuous space and then quantizing the solution is not guaranteed to yield separable binary codes. The quantization needs to be included in the optimization. In this paper we push for quantization: We optimize maximum class separability in the binary space. We introduce a margin on distances between dissimilar image pairs as measured in the binary space. In addition to pair-wise distances, we draw inspiration from Fisher’s Linear Discriminant Analysis (Fisher LDA) to maximize the binary distances between classes and at the same time minimize the binary distance of images within the same class. Experiments on CIFAR-10, NUS-WIDE and ImageNet100 demonstrate compact codes comparing favorably to the current state of the art. |
|||||
2018 | Quantized Guided Pruning For Efficient Hardware Implementations Of Convolutional Neural Networks | Hacene Ghouthi Boukli, Gripon Vincent, Arzel Matthieu, Farrugia Nicolas, Bengio Yoshua | Arxiv | Convolutional Neural Networks (CNNs) are state-of-the-art in numerous computer vision tasks such as object classification and detection. However, the large number of parameters they contain leads to a high computational complexity and strongly limits their usability in budget-constrained devices such as embedded devices. In this paper, we propose a combination of a new pruning technique and a quantization scheme that effectively reduce the complexity and memory usage of convolutional layers of CNNs, and replace the complex convolutional operation by a low-cost multiplexer. We perform experiments on CIFAR10, CIFAR100 and SVHN, and show that the proposed method achieves almost state-of-the-art accuracy, while drastically reducing the computational and memory footprints. We also propose an efficient hardware architecture to accelerate CNN operations. The proposed hardware architecture is a pipeline and accommodates multiple layers working at the same time to speed up the inference process. |
|||||
2018 | Beating Fredman-komlos For Perfect k-hashing | Guruswami Venkatesan, Riazanov Andrii | Arxiv | We say a subset \(C \subseteq \{1,2,\dots,k\}^n\) is a \(k\)-hash code (also called \(k\)-separated) if for every subset of \(k\) codewords from \(C\), there exists a coordinate where all these codewords have distinct values. Understanding the largest possible rate (in bits), defined as \((\log_2 |C|)/n\), of a \(k\)-hash code is a classical problem. It arises in two equivalent contexts: (i) the smallest size possible for a perfect hash family that maps a universe of \(N\) elements into \(\{1,2,\dots,k\}\), and (ii) the zero-error capacity for decoding with lists of size less than \(k\) for a certain combinatorial channel. A general upper bound of \(k!/k^{k-1}\) on the rate of a \(k\)-hash code (in the limit of large \(n\)) was obtained by Fredman and Komlós in 1984 for any \(k \geq 4\). While better bounds have been obtained for \(k=4\), their original bound has remained the best known for each \(k \ge 5\). In this work, we obtain the first improvement to the Fredman-Komlós bound for every \(k \ge 5\). While we get explicit (numerical) bounds for \(k=5,6\), for larger \(k\) we only show that the FK bound can be improved by a positive, but unspecified, amount. Under a conjecture on the optimum value of a certain polynomial optimization problem over the simplex, our methods allow an effective bound to be computed for every \(k\). |
|||||
2018 | Hardness Of Approximate Nearest Neighbor Search | Rubinstein Aviad | Arxiv | We prove conditional near-quadratic running time lower bounds for approximate Bichromatic Closest Pair with Euclidean, Manhattan, Hamming, or edit distance. Specifically, unless the Strong Exponential Time Hypothesis (SETH) is false, for every \(\delta>0\) there exists a constant \(\epsilon>0\) such that computing a \((1+\epsilon)\)-approximation to the Bichromatic Closest Pair requires \(n^{2-\delta}\) time. In particular, this implies a near-linear query time for Approximate Nearest Neighbor search with polynomial preprocessing time. Our reduction uses the Distributed PCP framework of [ARW’17], but obtains improved efficiency using Algebraic Geometry (AG) codes. Efficient PCPs from AG codes have been constructed in other settings before [BKKMS’16, BCGRS’17], but our construction is the first to yield new hardness results. |
|||||
2018 | Convolutional Hashing For Automated Scene Matching | Loncaric Martin, Liu Bowei, Weber Ryan | Arxiv | We present a powerful new loss function and training scheme for learning binary hash functions. In particular, we demonstrate our method by creating for the first time a neural network that outperforms state-of-the-art Haar wavelets and color layout descriptors at the task of automated scene matching. By accurately relating distance on the manifold of network outputs to distance in Hamming space, we achieve a 100-fold reduction in nontrivial false positive rate and significantly higher true positive rate. We expect our insights to provide large wins for hashing models applied to other information retrieval hashing tasks as well. |
|||||
2018 | Round-hashing For Data Storage Distributed Servers And External-memory Tables | Grossi Roberto, Versari Luca | Arxiv | This paper proposes round-hashing, which is suitable for data storage on distributed servers and for implementing external-memory tables in which each lookup retrieves at most a single block of external memory, using a stash. For data storage, round-hashing is like consistent hashing as it avoids a full rehashing of the keys when new servers are added. Experiments show that the speed of serving requests is ten times or more that of the state of the art. In distributed data storage, this guarantees better throughput for serving requests and, moreover, greatly reduces decision times for which data should move to new servers, as rescanning data is much faster. |
|||||
2018 | Robust Set Reconciliation Via Locality Sensitive Hashing | Mitzenmacher Michael, Morgan Tom | Arxiv | We consider variations of set reconciliation problems where two parties, Alice and Bob, each hold a set of points in a metric space, and the goal is for Bob to conclude with a set of points that is close to Alice’s set of points in a well-defined way. This setting has been referred to as robust set reconciliation. More specifically, in one variation we examine, the goal is for Bob to end with a set of points that is close to Alice’s in earth mover’s distance, and in another, the goal is for Bob to have a point that is close to each of Alice’s. The first problem has been studied before; our results scale better with the dimension of the space. The second problem appears new. Our primary novelty is utilizing Invertible Bloom Lookup Tables in combination with locality sensitive hashing. This combination allows us to cope with the geometric setting in a communication-efficient manner. |
|||||
2018 | Understanding The Gist Of Images - Ranking Of Concepts For Multimedia Indexing | Weiland Lydia, Ponzetto Simone Paolo, Effelsberg Wolfgang, Dietz Laura | Arxiv | Nowadays, where multimedia data is continuously generated, stored, and distributed, multimedia indexing, with its purpose of grouping similar data, becomes more important than ever. Understanding the gist (=message) of multimedia instances is framed in related work as a ranking of concepts from a knowledge base, i.e., Wikipedia. We cast the task of multimedia indexing as a gist understanding problem. Our pipeline benefits from external knowledge and two subsequent learning-to-rank (l2r) settings. The first l2r produces a ranking of concepts representing the respective multimedia instance. The second l2r produces a mapping between the concept representation of an instance and the targeted class topic(s) for the multimedia indexing task. The evaluation on an established big size corpus (MIRFlickr25k, with 25,000 images), shows that multimedia indexing benefits from understanding the gist. Finally, with a MAP of 61.42, it can be shown that the multimedia indexing task benefits from understanding the gist. Thus, the presented end-to-end setting outperforms DBM and competes with Hashing-based methods. |
|||||
2018 | Geometry And Clustering With Metrics Derived From Separable Bregman Divergences | Gomes-gonçalves Erika, Gzyl Henryk, Nielsen Frank | Arxiv | Separable Bregman divergences induce Riemannian metric spaces that are isometric to the Euclidean space after monotone embeddings. We investigate fixed rate quantization and its codebook Voronoi diagrams, and report on experimental performances of partition-based, hierarchical, and soft clustering algorithms with respect to these Riemann-Bregman distances. |
|||||
2018 | End-to-end Retrieval In Continuous Space | Gillick Daniel, Presta Alessandro, Tomar Gaurav Singh | Arxiv | Most text-based information retrieval (IR) systems index objects by words or phrases. These discrete systems have been augmented by models that use embeddings to measure similarity in continuous space. But continuous-space models are typically used just to re-rank the top candidates. We consider the problem of end-to-end continuous retrieval, where standard approximate nearest neighbor (ANN) search replaces the usual discrete inverted index, and rely entirely on distances between learned embeddings. By training simple models specifically for retrieval, with an appropriate model architecture, we improve on a discrete baseline by 8% and 26% (MAP) on two similar-question retrieval tasks. We also discuss the problem of evaluation for retrieval systems, and show how to modify existing pairwise similarity datasets for this purpose. |
|||||
2018 | Learning Hash Codes Via Hamming Distance Targets | Loncaric Martin, Liu Bowei, Weber Ryan | Arxiv | We present a powerful new loss function and training scheme for learning binary hash codes with any differentiable model and similarity function. Our loss function improves over prior methods by using log likelihood loss on top of an accurate approximation for the probability that two inputs fall within a Hamming distance target. Our novel training scheme obtains a good estimate of the true gradient by better sampling inputs and evaluating loss terms between all pairs of inputs in each minibatch. To fully leverage the resulting hashes, we use multi-indexing. We demonstrate that these techniques provide large improvements on similarity search tasks. We report the best results to date on competitive information retrieval tasks for ImageNet and SIFT 1M, improving MAP from 73% to 84% and reducing query cost by a factor of 2-8, respectively. |
|||||
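The multi-indexing mentioned above is commonly realized by splitting each binary code into \(m\) disjoint substrings and indexing each substring in its own hash table: by the pigeonhole principle, two codes within Hamming distance \(r < m\) must match exactly on at least one substring. Below is a generic sketch of that candidate-generation step followed by an XOR/popcount verification; it is not the paper's exact retrieval system, and the code length and block count are arbitrary choices.

```python
from collections import defaultdict
import random

BITS = 64                       # total code length
BLOCKS = 4                      # number of disjoint substrings (one hash table each)
BLOCK_BITS = BITS // BLOCKS
MASK = (1 << BLOCK_BITS) - 1

def substrings(code):
    """Split a BITS-bit integer code into BLOCKS disjoint substrings."""
    return [(code >> (i * BLOCK_BITS)) & MASK for i in range(BLOCKS)]

def build_index(codes):
    """One hash table per substring position, mapping substring value -> item ids."""
    tables = [defaultdict(list) for _ in range(BLOCKS)]
    for idx, code in enumerate(codes):
        for i, sub in enumerate(substrings(code)):
            tables[i][sub].append(idx)
    return tables

def query(tables, codes, q, radius):
    """Candidates must share at least one exact substring with q (guaranteed when
    radius < BLOCKS); each candidate is then verified with an XOR/popcount check."""
    cand = set()
    for i, sub in enumerate(substrings(q)):
        cand.update(tables[i].get(sub, ()))
    return [idx for idx in cand if bin(codes[idx] ^ q).count("1") <= radius]

if __name__ == "__main__":
    random.seed(0)
    codes = [random.getrandbits(BITS) for _ in range(10_000)]
    q = codes[123] ^ 0b111      # a query 3 bit-flips away from item 123
    print(query(build_index(codes), codes, q, radius=3))  # should contain 123
```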
2018 | Regularizing Deep Hashing Networks Using GAN Generated Fake Images | Geng Libing, Pan Yan, Chen Jikai, Lai Hanjiang | Arxiv | Recently, deep-networks-based hashing (deep hashing) has become a leading approach for large-scale image retrieval. It aims to learn a compact bitwise representation for images via deep networks, so that similar images are mapped to nearby hash codes. Since a deep network model usually has a large number of parameters, it may be too complicated for the training data we have, leading to model over-fitting. To address this issue, in this paper, we propose a simple two-stage pipeline to learn deep hashing models, by regularizing the deep hashing networks using fake images. The first stage is to generate fake images from the original training set without extra data, via a generative adversarial network (GAN). In the second stage, we propose a deep architecture to learn hash functions, in which we use a maximum-entropy based loss to incorporate the newly created fake images by the GAN. We show that this loss acts as a strong regularizer of the deep architecture, by penalizing low-entropy output hash codes. This loss can also be interpreted as a model ensemble by simultaneously training many network models with massive weight sharing but over different training sets. Empirical evaluation results on several benchmark datasets show that the proposed method has superior performance gains over state-of-the-art hashing methods. |
|||||
2018 | Weakly Supervised Deep Image Hashing Through Tag Embeddings | Gattupalli Vijetha, Zhuo Yaoxin, Li Baoxin | Arxiv | Many approaches to semantic image hashing have been formulated as supervised learning problems that utilize images and label information to learn the binary hash codes. However, large-scale labeled image data is expensive to obtain, thus imposing a restriction on the usage of such algorithms. On the other hand, unlabelled image data is abundant due to the existence of many Web image repositories. Such Web images may often come with image tags that contain useful information, although raw tags, in general, do not readily lead to semantic labels. Motivated by this scenario, we formulate the problem of semantic image hashing as a weakly-supervised learning problem. We utilize the information contained in the user-generated tags associated with the images to learn the hash codes. More specifically, we extract the word2vec semantic embeddings of the tags and use the information contained in them for constraining the learning. Accordingly, we name our model Weakly Supervised Deep Hashing using Tag Embeddings (WDHT). WDHT is tested for the task of semantic image retrieval and is compared against several state-of-the-art models. Results show that our approach sets a new state of the art in the area of weakly supervised image hashing. |
|||||
2018 | Representation Learning By Reconstructing Neighborhoods | Yeh Chin-chia Michael, Zhu Yan, Papalexakis Evangelos E., Mueen Abdullah, Keogh Eamonn | Arxiv | Since its introduction, unsupervised representation learning has attracted a lot of attention from the research community, as it is demonstrated to be highly effective and easy-to-apply in tasks such as dimension reduction, clustering, visualization, information retrieval, and semi-supervised learning. In this work, we propose a novel unsupervised representation learning framework called neighbor-encoder, in which domain knowledge can be easily incorporated into the learning process without modifying the general encoder-decoder architecture of the classic autoencoder. In contrast to autoencoder, which reconstructs the input data itself, neighbor-encoder reconstructs the input data’s neighbors. As the proposed representation learning problem is essentially a neighbor reconstruction problem, domain knowledge can be easily incorporated in the form of an appropriate definition of similarity between objects. Based on that observation, our framework can leverage any off-the-shelf similarity search algorithms or side information to find the neighbor of an input object. Applications of other algorithms (e.g., association rule mining) in our framework are also possible, given that the appropriate definition of neighbor can vary in different contexts. We have demonstrated the effectiveness of our framework in many diverse domains, including images, text, and time series, and for various data mining tasks including classification, clustering, and visualization. Experimental results show that neighbor-encoder not only outperforms autoencoder in most of the scenarios we consider, but also achieves the state-of-the-art performance on text document clustering. |
|||||
2018 | Learning Decorrelated Hashing Codes For Multimodal Retrieval | Tian Dayong | Arxiv | In social networks, heterogeneous multimedia data correlate to each other, such as videos and their corresponding tags in YouTube and image-text pairs in Facebook. Nearest neighbor retrieval across multiple modalities on large data sets becomes a hot yet challenging problem. Hashing is expected to be an efficient solution, since it represents data as binary codes. As bit-wise XOR operations can be computed quickly, the retrieval time is greatly reduced. Few existing multimodal hashing methods consider the correlation among hashing bits. This correlation has a negative impact on hashing codes. When the hashing code length becomes longer, the retrieval performance improvement becomes slower. In this paper, we propose a minimum correlation regularization (MCR) for multimodal hashing. First, the sigmoid function is used to embed the data matrices. Then, the MCR is applied on the output of the sigmoid function. As the output of the sigmoid function approximates a binary code matrix, the proposed MCR can efficiently decorrelate the hashing codes. Experiments show that the superiority of the proposed method becomes greater as the code length increases. |
|||||
2018 | A Deep Learning Pipeline For Product Recognition On Store Shelves | Tonioni Alessio, Serra Eugenio, Di Stefano Luigi | Arxiv | Recognition of grocery products on store shelves poses peculiar challenges. Firstly, the task mandates the recognition of an extremely high number of different items, in the order of several thousands for medium-small shops, with many of them featuring small inter and intra class variability. Then, available product databases usually include just one or a few studio-quality images per product (referred to herein as reference images), whilst at test time recognition is performed on pictures displaying a portion of a shelf containing several products and taken in the store by cheap cameras (referred to as query images). Moreover, as the items on sale in a store as well as their appearance change frequently over time, a practical recognition system should seamlessly handle new products/packages. Inspired by recent advances in object detection and image retrieval, we propose to leverage state-of-the-art object detectors based on deep learning to obtain an initial product-agnostic item detection. Then, we pursue product recognition through a similarity search between global descriptors computed on reference and cropped query images. To maximize performance, we learn an ad-hoc global descriptor by a CNN trained on reference images based on an image embedding loss. Our system is computationally expensive at training time but can perform recognition rapidly and accurately at test time. |
|||||
2018 | Fmhash Deep Hashing Of In-air-handwriting For User Identification | Lu Duo, Huang Dijiang, Rai Anshul | Arxiv | Many mobile systems and wearable devices, such as Virtual Reality (VR) or Augmented Reality (AR) headsets, lack a keyboard or touchscreen to type an ID and password for signing into a virtual website. However, they are usually equipped with gesture capture interfaces to allow the user to interact with the system directly with hand gestures. Although gesture-based authentication has been well-studied, less attention is paid to the gesture-based user identification problem, which is essentially an input method of account ID and an efficient searching and indexing method of a database of gesture signals. In this paper, we propose FMHash (i.e., Finger Motion Hash), a user identification framework that can generate a compact binary hash code from a piece of in-air-handwriting of an ID string. This hash code enables indexing and fast search of a large account database using the in-air-handwriting by a hash table. To demonstrate the effectiveness of the framework, we implemented a prototype and achieved >99.5% precision and >92.6% recall with exact hash code match on a dataset of 200 accounts collected by us. The ability to hash an in-air-handwriting pattern into a binary code can be used to achieve convenient sign-in and sign-up with an in-air-handwriting gesture ID on future mobile and wearable systems connected to the Internet. |
|||||
2018 | Neurons Merging Layer Towards Progressive Redundancy Reduction For Deep Supervised Hashing | Fu Chaoyou, Song Liangchen, Wu Xiang, Wang Guoli, He Ran | Arxiv | Deep supervised hashing has become an active topic in information retrieval. It generates hashing bits by the output neurons of a deep hashing network. During binary discretization, there often exists much redundancy between hashing bits that degrades retrieval performance in terms of both storage and accuracy. This paper proposes a simple yet effective Neurons Merging Layer (NMLayer) for deep supervised hashing. A graph is constructed to represent the redundancy relationship between hashing bits that is used to guide the learning of a hashing network. Specifically, it is dynamically learned by a novel mechanism defined in our active and frozen phases. According to the learned relationship, the NMLayer merges the redundant neurons together to balance the importance of each output neuron. Moreover, multiple NMLayers are progressively trained for a deep hashing network to learn a more compact hashing code from a long redundant code. Extensive experiments on four datasets demonstrate that our proposed method outperforms state-of-the-art hashing methods. |
|||||
2018 | Fully Understanding The Hashing Trick | Freksen Casper Benjamin, Kamma Lior, Larsen Kasper Green | Arxiv | Feature hashing, also known as {\em the hashing trick}, introduced by Weinberger et al. (2009), is one of the key techniques used in scaling-up machine learning algorithms. Loosely speaking, feature hashing uses a random sparse projection matrix \(A : \mathbb{R}^n \to \mathbb{R}^m\) (where \(m \ll n\)) in order to reduce the dimension of the data from \(n\) to \(m\) while approximately preserving the Euclidean norm. Every column of \(A\) contains exactly one non-zero entry, equal to either \(-1\) or \(1\). Weinberger et al. showed tail bounds on \(\|Ax\|_2^2\). Specifically they showed that for every \(\epsilon, \delta\), if \(\|x\|_{\infty} / \|x\|_2\) is sufficiently small, and \(m\) is sufficiently large, then \(\Pr[\; |\, \|Ax\|_2^2 - \|x\|_2^2 \,| < \epsilon \|x\|_2^2 \;] \ge 1 - \delta\). These bounds were later extended by Dasgupta et al. (2010) and most recently refined by Dahlgaard et al. (2017); however, the true nature of the performance of this key technique, and specifically the correct tradeoff between the pivotal parameters \(\|x\|_{\infty} / \|x\|_2\), \(m\), \(\epsilon\), \(\delta\), remained an open question. We settle this question by giving tight asymptotic bounds on the exact tradeoff between the central parameters, thus providing a complete understanding of the performance of feature hashing. We complement the asymptotic bound with empirical data, which shows that the constants “hiding” in the asymptotic notation are, in fact, very close to \(1\), thus further illustrating the tightness of the presented bounds in practice. |
|||||
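The mechanics of the hashing trick itself are simple enough to sketch. The snippet below is a minimal illustration (not the paper's analysis or any particular library's implementation): each feature name is hashed once to pick an output coordinate and once more to pick a sign, so a sparse feature vector in \(\mathbb{R}^n\) is folded into \(\mathbb{R}^m\) while the squared norm is approximately preserved in expectation. The helper names and the use of MD5 are illustrative choices only.

```python
import hashlib

def _h(s: str) -> int:
    # Deterministic integer hash of a string (MD5 chosen only for reproducibility).
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

def feature_hash(features: dict, m: int) -> list:
    """Hashing trick: fold a sparse {feature name: value} dict into m dimensions.
    One hash picks the destination bucket, a second (differently salted) hash
    picks a +/-1 sign, so each input coordinate lands in exactly one output slot."""
    v = [0.0] * m
    for name, value in features.items():
        bucket = _h("bucket:" + name) % m
        sign = 1.0 if _h("sign:" + name) % 2 == 0 else -1.0
        v[bucket] += sign * value
    return v

# Toy usage: two sparse text features hashed into 8 dimensions.
print(feature_hash({"word:hash": 2.0, "word:trick": 1.0}, m=8))
```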
2018 | A Scalable Optimization Mechanism For Pairwise Based Discrete Hashing | Shi Xiaoshuang, Xing Fuyong, Zhang Zizhao, Sapkota Manish, Guo Zhenhua, Yang Lin | Arxiv | Maintaining the pair similarity relationship among originally high-dimensional data into a low-dimensional binary space is a popular strategy to learn binary codes. One simple and intuitive method is to utilize two identical code matrices produced by hash functions to approximate a pairwise real label matrix. However, the resulting quartic problem is difficult to directly solve due to the non-convex and non-smooth nature of the objective. In this paper, unlike previous optimization methods using various relaxation strategies, we aim to directly solve the original quartic problem using a novel alternative optimization mechanism to linearize the quartic problem by introducing a linear regression model. Additionally, we find that gradually learning each batch of binary codes in a sequential mode, i.e. batch by batch, is greatly beneficial to the convergence of binary code learning. Based on this significant discovery and the proposed strategy, we introduce a scalable symmetric discrete hashing algorithm that gradually and smoothly updates each batch of binary codes. To further improve the smoothness, we also propose a greedy symmetric discrete hashing algorithm to update each bit of batch binary codes. Moreover, we extend the proposed optimization mechanism to solve the non-convex optimization problems for binary code learning in many other pairwise based hashing algorithms. Extensive experiments on benchmark single-label and multi-label databases demonstrate the superior performance of the proposed mechanism over recent state-of-the-art methods. |
|||||
2018 | An O(N) Sorting Algorithm Machine Learning Sort | Zhao Hanqing, Luo Yuehan | Arxiv | We propose an \(O(N\cdot M)\) sorting algorithm based on a machine learning method, which shows huge potential for sorting big data. This sorting algorithm can be applied to parallel sorting and is suitable for GPU or TPU acceleration. Furthermore, we discuss the application of this algorithm to sparse hash tables. |
|||||
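As a rough intuition for how a model of the data distribution can drive a near-linear-time sort, the sketch below replaces the paper's machine-learned model with plain min/max linear interpolation (a deliberate simplification, not the authors' method): each key's approximate position in the output is predicted, keys are scattered into ordered buckets, and each small bucket is then sorted exactly.

```python
import random

def model_sort(data, n_buckets=None):
    """Distribution-driven sort: predict where each key falls in the sorted output,
    scatter keys into roughly ordered buckets, then sort each small bucket exactly.
    Here the 'model' is simple linear interpolation of the key range, standing in
    for the learned distribution model described in the abstract."""
    if not data:
        return []
    n_buckets = n_buckets or max(1, len(data) // 8)
    lo, hi = min(data), max(data)
    span = (hi - lo) or 1.0
    buckets = [[] for _ in range(n_buckets)]
    for x in data:
        idx = int((x - lo) / span * (n_buckets - 1)) if n_buckets > 1 else 0
        buckets[idx].append(x)
    out = []
    for b in buckets:
        out.extend(sorted(b))  # exact sort inside each small bucket
    return out

random.seed(1)
data = [random.uniform(0, 100) for _ in range(1000)]
assert model_sort(data) == sorted(data)
```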
2018 | SIG-DB Leveraging Homomorphic Encryption To Securely Interrogate Privately Held Genomic Databases | Titus Alexander J., Flower Audrey, Hagerty Patrick, Gamble Paul, Lewis Charlie, Stavish Todd, Oconnell Kevin P., Shipley Greg, Rogers Stephanie M. | PLoS Computational Biology; | Genomic data are becoming increasingly valuable as we develop methods to utilize the information at scale and gain a greater understanding of how genetic information relates to biological function. Advances in synthetic biology and the decreased cost of sequencing are increasing the amount of privately held genomic data. As the quantity and value of private genomic data grows, so does the incentive to acquire and protect such data, which creates a need to store and process these data securely. We present an algorithm for the Secure Interrogation of Genomic DataBases (SIG-DB). The SIG-DB algorithm enables databases of genomic sequences to be searched with an encrypted query sequence without revealing the query sequence to the Database Owner or any of the database sequences to the Querier. SIG-DB is the first application of its kind to take advantage of locality-sensitive hashing and homomorphic encryption to allow generalized sequence-to-sequence comparisons of genomic data. |
|||||
2018 | A Filter Of Minhash For Image Similarity Measures | Long Jun, Liu Qunfeng, Yuan Xinpan, Zhang Chengyuan, Liu Junfeng | Arxiv | Image similarity measures play an important role in nearest neighbor search and duplicate detection for large-scale image datasets. Recently, Minwise Hashing (or Minhash) and its related hashing algorithms have achieved great performance in large-scale image retrieval systems. However, there are a large number of comparisons for image pairs in these applications, which may consume a lot of computation time and affect the performance. In order to quickly obtain the image pairs whose similarities are higher than a specific threshold T (e.g., 0.5), we propose a dynamic threshold filter of Minwise Hashing for image similarity measures. It greatly reduces the calculation time by terminating the unnecessary comparisons in advance. We also find that the filter can be extended to other hashing algorithms whose estimators follow a binomial distribution, such as b-Bit Minwise Hashing, One Permutation Hashing, etc. In this paper, we use the Bag-of-Visual-Words (BoVW) model based on the Scale Invariant Feature Transform (SIFT) to represent the image features. We have proved that the filter is correct and effective through experiments on real image datasets. |
|||||
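To make the early-termination idea concrete, here is a minimal sketch (under assumed simplifications, not the paper's exact dynamic threshold rule): MinHash signatures are compared position by position, and the comparison stops as soon as the similarity threshold T can no longer be reached even if every remaining position matched.

```python
import random

def minhash_signature(item_set, seeds):
    # One MinHash value per seed: the minimum seeded hash over the set's elements.
    return [min(hash((seed, x)) for x in item_set) for seed in seeds]

def passes_threshold(sig_a, sig_b, threshold):
    """Early-terminating signature comparison: stop once even perfect agreement on
    the remaining positions could not lift the match fraction above `threshold`."""
    k = len(sig_a)
    needed = threshold * k
    matches = 0
    for i in range(k):
        matches += (sig_a[i] == sig_b[i])
        if matches + (k - 1 - i) < needed:  # threshold can no longer be reached
            return False
    return matches / k >= threshold

random.seed(0)
seeds = [random.randrange(1 << 30) for _ in range(128)]
a = minhash_signature({"cat", "dog", "fish", "bird"}, seeds)
b = minhash_signature({"cat", "dog", "fish", "mouse"}, seeds)
print(passes_threshold(a, b, threshold=0.5))
```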
2018 | H-CNN Spatial Hashing Based CNN For 3D Shape Analysis | Shao Tianjia, Yang Yin, Weng Yanlin, Hou Qiming, Zhou Kun | Arxiv | We present a novel spatial hashing based data structure to facilitate 3D shape analysis using convolutional neural networks (CNNs). Our method exploits the sparse occupancy of the 3D shape boundary and builds hierarchical hash tables for an input model under different resolutions. Based on this data structure, we design two efficient GPU algorithms namely hash2col and col2hash so that the CNN operations like convolution and pooling can be efficiently parallelized. The spatial hashing is nearly minimal, and our data structure is almost of the same size as the raw input. Compared with state-of-the-art octree-based methods, our data structure significantly reduces the memory footprint during the CNN training. As the input geometry features are more compactly packed, CNN operations also run faster with our data structure. Experiments show that, under the same network structure, our method yields comparable or better benchmarks than the state-of-the-art while consuming only one-third of the memory. Such superior memory performance allows the CNN to handle high-resolution shape analysis. |
|||||
2018 | When Hashing Met Matching Efficient Spatio-temporal Search For Ridesharing | Dutta Chinmoy | Arxiv | Carpooling, or sharing a ride with other passengers, holds immense potential for urban transportation. Ridesharing platforms enable such sharing of rides using real-time data. Finding ride matches in real-time at urban scale is a difficult combinatorial optimization task and mostly heuristic approaches are applied. In this work, we mathematically model the problem as that of finding near-neighbors and devise a novel efficient spatio-temporal search algorithm based on the theory of locality sensitive hashing for Maximum Inner Product Search (MIPS). The proposed algorithm can find \(k\) near-optimal potential matches for every ride from a pool of \(n\) rides in time \(O(n^{1 + \rho} (k + \log n) \log k)\) and space \(O(n^{1 + \rho} \log k)\) for a small \(\rho < 1\). Our algorithm can be extended in several useful and interesting ways increasing its practical appeal. Experiments with large NY yellow taxi trip datasets show that our algorithm consistently outperforms state-of-the-art heuristic methods thereby proving its practical applicability. |
|||||
2018 | Graph Kernels Based On High Order Graphlet Parsing And Hashing | Dutta Anjan, Sahbi Hichem | Arxiv | Graph-based methods are known to be successful in many machine learning and pattern classification tasks. These methods consider semi-structured data as graphs where nodes correspond to primitives (parts, interest points, segments, etc.) and edges characterize the relationships between these primitives. However, these non-vectorial graph data cannot be straightforwardly plugged into off-the-shelf machine learning algorithms without a preliminary step of – explicit/implicit – graph vectorization and embedding. This embedding process should be resilient to intra-class graph variations while being highly discriminant. In this paper, we propose a novel high-order stochastic graphlet embedding (SGE) that maps graphs into vector spaces. Our main contribution includes a new stochastic search procedure that efficiently parses a given graph and extracts/samples graphlets of arbitrarily high order. We consider these graphlets, with increasing orders, to model local primitives as well as their increasingly complex interactions. In order to build our graph representation, we measure the distribution of these graphlets in a given graph, using particular hash functions that efficiently assign sampled graphlets into isomorphic sets with a very low probability of collision. When combined with maximum margin classifiers, these graphlet-based representations have positive impact on the performance of pattern comparison and recognition as corroborated through extensive experiments using standard benchmark databases. |
|||||
2018 | Semi-supervised Hashing For Semi-paired Cross-view Retrieval | Yu Jun, Wu Xiao-jun, Kittler Josef | Arxiv | Recently, hashing techniques have gained importance in large-scale retrieval tasks because of their retrieval speed. Most of the existing cross-view frameworks assume that data are well paired. However, the fully-paired multiview situation is not universal in real applications. The aim of the method proposed in this paper is to learn the hashing function for semi-paired cross-view retrieval tasks. To utilize the label information of partial data, we propose a semi-supervised hashing learning framework which jointly performs feature extraction and classifier learning. The experimental results on two datasets show that our method outperforms several state-of-the-art methods in terms of retrieval accuracy. |
|||||
2018 | Learning Discriminative Hashing Codes For Cross-modal Retrieval Based On Multi-view Features | Yu Jun, Wu Xiao-jun, Kittler Josef | Arxiv | Hashing techniques have been applied broadly in retrieval tasks due to their low storage requirements and high speed of processing. Many hashing methods based on a single view have been extensively studied for information retrieval. However, the representation capacity of a single view is insufficient and some discriminative information is not captured, which results in limited improvement. In this paper, we employ multiple views to represent images and texts for enriching the feature information. Our framework exploits the complementary information among multiple views to better learn the discriminative compact hash codes. A discrete hashing learning framework that jointly performs classifier learning and subspace learning is proposed to complete multiple search tasks simultaneously. Our framework includes two stages, namely a kernelization process and a quantization process. Kernelization aims to find a common subspace where multi-view features can be fused. The quantization stage is designed to learn discriminative unified hashing codes. Extensive experiments are performed on single-label datasets (WiKi and MMED) and multi-label datasets (MIRFlickr and NUS-WIDE) and the experimental results indicate the superiority of our method compared with the state-of-the-art methods. |
|||||
2018 | Link And Code Fast Indexing With Graphs And Compact Regression Codes | Douze Matthijs, Sablayrolles Alexandre, Jégou Hervé | Arxiv | Similarity search approaches based on graph walks have recently attained outstanding speed-accuracy trade-offs, setting aside memory requirements. In this paper, we revisit these approaches by considering, additionally, the memory constraint required to index billions of images on a single server. This leads us to propose a method based both on graph traversal and compact representations. We encode the indexed vectors using quantization and exploit the graph structure to refine the similarity estimation. In essence, our method takes the best of these two worlds: the search strategy is based on nested graphs, thereby providing high precision with a relatively small set of comparisons. At the same time it offers a significant memory compression. As a result, our approach outperforms the state of the art on operating points considering 64-128 bytes per vector, as demonstrated by our results on two billion-scale public benchmarks. |
|||||
2018 | Discriminative Supervised Hashing For Cross-modal Similarity Search | Yu Jun, Wu Xiao-jun, Kittler Josef | Arxiv | With the advantage of low storage cost and high retrieval efficiency, hashing techniques have recently been an emerging topic in cross-modal similarity search. As multiple modal data reflect similar semantic content, many studies aim at learning unified binary codes. However, discriminative hashing features learned by these methods are not adequate. This results in lower accuracy and robustness. We propose a novel hashing learning framework which jointly performs classifier learning, subspace learning and matrix factorization to preserve class-specific semantic content, termed Discriminative Supervised Hashing (DSH), to learn the discriminative unified binary codes for multi-modal data. Besides reducing the loss of information and preserving the non-linear structure of data, DSH non-linearly projects different modalities into the common space in which the similarity among heterogeneous data points can be measured. Extensive experiments conducted on three publicly available datasets demonstrate that the framework proposed in this paper outperforms several state-of-the-art methods. |
|||||
2018 | Collaborative Learning For Extremely Low Bit Asymmetric Hashing | Luo Yadan, Huang Zi, Li Yang, Shen Fumin, Yang Yang, Cui Peng | Arxiv | Hashing techniques are in great demand for a wide range of real-world applications such as image retrieval and network compression. Nevertheless, existing approaches could hardly guarantee a satisfactory performance with the extremely low-bit (e.g., 4-bit) hash codes due to the severe information loss and the shrinking of the discrete solution space. In this paper, we propose a novel \textit{Collaborative Learning} strategy that is tailored for generating high-quality low-bit hash codes. The core idea is to jointly distill bit-specific and informative representations for a group of pre-defined code lengths. The learning of short hash codes among the group can benefit from the manifold shared with other long codes, where multiple views from different hash codes provide the supplementary guidance and regularization, making the convergence faster and more stable. To achieve that, an asymmetric hashing framework with two variants of multi-head embedding structures is derived, termed as Multi-head Asymmetric Hashing (MAH), leading to great efficiency of training and querying. Extensive experiments on three benchmark datasets have been conducted to verify the superiority of the proposed MAH, and have shown that the 8-bit hash codes generated by MAH achieve a \(94.3\%\) Mean Average Precision (MAP) score on the CIFAR-10 dataset, which significantly surpasses the performance of the 48-bit codes of the state-of-the-art methods in image retrieval tasks. |
|||||
2018 | Vector Quantized Spectral Clustering Applied To Soybean Whole Genome Sequences | Shastri Aditya A., Ahuja Kapil, Ratnaparkhe Milind B., Shah Aditya, Gagrani Aishwary, Lal Anant | Arxiv | We develop a Vector Quantized Spectral Clustering (VQSC) algorithm that is a combination of Spectral Clustering (SC) and Vector Quantization (VQ) sampling for grouping Soybean genomes. The inspiration here is to use SC for its accuracy and VQ to make the algorithm computationally cheap (the complexity of SC is cubic in terms of the input size). Although the combination of SC and VQ is not new, the novelty of our work is in developing the crucial similarity matrix in SC as well as the use of k-medoids in VQ, both adapted for the Soybean genome data. We compare our approach with commonly used techniques like UPGMA (Un-weighted Pair Graph Method with Arithmetic Mean) and NJ (Neighbour Joining). Experimental results show that our approach outperforms both these techniques significantly in terms of cluster quality (up to 25% better cluster quality) and time complexity (an order of magnitude faster). |
|||||
2018 | Fast Locality Sensitive Hashing For Beam Search On GPU | Shi Xing, Xu Shizhen, Knight Kevin | Arxiv | We present a GPU-based Locality Sensitive Hashing (LSH) algorithm to speed up beam search for sequence models. We utilize the winner-take-all (WTA) hash, which is based on the relative ranking order of hidden dimensions and thus resilient to perturbations in numerical values. Our algorithm is designed by fully considering the underlying architecture of CUDA-enabled GPUs (Algorithm/Architecture Co-design): 1) A parallel Cuckoo hash table is applied for LSH code lookup (guaranteed O(1) lookup time); 2) Candidate lists are shared across beams to maximize the parallelism; 3) Top frequent words are merged into candidate lists to improve performance. Experiments on 4 large-scale neural machine translation models demonstrate that our algorithm can achieve up to 4x speedup on the softmax module, and 2x overall speedup without hurting BLEU on GPU. |
|||||
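The winner-take-all (WTA) hash at the heart of this approach is easy to sketch. The snippet below is an illustrative rendering with arbitrary parameter choices, not the paper's GPU implementation: each band looks at a small random subset of dimensions and records the position of the largest value, so the code depends only on relative ordering and small numerical perturbations rarely change it.

```python
import random

def wta_hash(vector, n_hashes=16, window=4, seed=0):
    """Winner-Take-All hash: each band takes `window` randomly chosen coordinates
    (the head of a random permutation) and outputs the argmax position, so the
    code reflects relative ranking order rather than exact values."""
    rng = random.Random(seed)
    dims = len(vector)
    codes = []
    for _ in range(n_hashes):
        perm = rng.sample(range(dims), window)
        codes.append(max(range(window), key=lambda i: vector[perm[i]]))
    return codes

h1 = wta_hash([0.1, 0.9, 0.3, 0.7, 0.2, 0.5, 0.8, 0.4])
h2 = wta_hash([0.1, 0.95, 0.3, 0.7, 0.2, 0.5, 0.8, 0.4])  # small numerical perturbation
print(sum(a == b for a, b in zip(h1, h2)), "of", len(h1), "bands agree")
```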
2018 | Binary Constrained Deep Hashing Network For Image Retrieval Without Manual Annotation | Do Thanh-toan, Hoang Tuan, Tan Dang-khoa Le, Pham Trung, Le Huu, Cheung Ngai-man, Reid Ian | Arxiv | Learning compact binary codes for image retrieval task using deep neural networks has attracted increasing attention recently. However, training deep hashing networks for the task is challenging due to the binary constraints on the hash codes, the similarity preserving property, and the requirement for a vast amount of labelled images. To the best of our knowledge, none of the existing methods has tackled all of these challenges completely in a unified framework. In this work, we propose a novel end-to-end deep learning approach for the task, in which the network is trained to produce binary codes directly from image pixels without the need of manual annotation. In particular, to deal with the non-smoothness of binary constraints, we propose a novel pairwise constrained loss function, which simultaneously encodes the distances between pairs of hash codes, and the binary quantization error. In order to train the network with the proposed loss function, we propose an efficient parameter learning algorithm. In addition, to provide similar / dissimilar training images to train the network, we exploit 3D models reconstructed from unlabelled images for automatic generation of enormous training image pairs. The extensive experiments on image retrieval benchmark datasets demonstrate the improvements of the proposed method over the state-of-the-art compact representation methods on the image retrieval problem. |
|||||
2018 | Mean Local Group Average Precision (mlgap) A New Performance Metric For Hashing-based Retrieval | Ding Pak Lun Kevin, Li Yikang, Li Baoxin | Arxiv | The research on hashing techniques for visual data is gaining increased attention in recent years due to the need for compact representations supporting efficient search/retrieval in large-scale databases such as online images. Among many possibilities, Mean Average Precision (mAP) has emerged as the dominant performance metric for hashing-based retrieval. One glaring shortcoming of mAP is its inability to balance retrieval accuracy and utilization of hash codes: pushing a system to attain higher mAP will inevitably lead to poorer utilization of the hash codes. Poor utilization of the hash codes hinders good retrieval because of increased collision of samples in the hash space. This means that a model giving a higher mAP value does not necessarily do a better job in retrieval. In this paper, we introduce a new metric named Mean Local Group Average Precision (mLGAP) for better evaluation of the performance of hashing-based retrieval. The new metric provides a retrieval performance measure that also reconciles the utilization of hash codes, leading to a more practically meaningful performance metric than conventional ones like mAP. To this end, we start with a mathematical analysis of the deficiencies of mAP for hashing-based retrieval. We then propose mLGAP and show why it is more appropriate for hashing-based retrieval. Experiments on image retrieval are used to demonstrate the effectiveness of the proposed metric. |
|||||
2018 | Bagminhash - Minwise Hashing Algorithm For Weighted Sets | Ertl Otmar | Arxiv | Minwise hashing has become a standard tool to calculate signatures which allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, the calculation of signatures for weighted sets is still a time consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than current state of the art without any particular restrictions or assumptions on weights or data dimensionality. Applied to the special case of unweighted sets, it represents the first efficient algorithm producing independent signature components. A series of tests finally verifies the new algorithm and also reveals limitations of other approaches published in the recent past. |
|||||
2018 | From Selective Deep Convolutional Features To Compact Binary Representations For Image Retrieval | Do Thanh-toan, Hoang Tuan, Tan Dang-khoa Le, Le Huu, Nguyen Tam V., Cheung Ngai-man | Arxiv | In the large-scale image retrieval task, the two most important requirements are the discriminability of image representations and the efficiency in computation and storage of representations. Regarding the former requirement, Convolutional Neural Network (CNN) is proven to be a very powerful tool to extract highly discriminative local descriptors for effective image search. Additionally, in order to further improve the discriminative power of the descriptors, recent works adopt fine-tuned strategies. In this paper, taking a different approach, we propose a novel, computationally efficient, and competitive framework. Specifically, we firstly propose various strategies to compute masks, namely SIFT-mask, SUM-mask, and MAX-mask, to select a representative subset of local convolutional features and eliminate redundant features. Our in-depth analyses demonstrate that proposed masking schemes are effective to address the burstiness drawback and improve retrieval accuracy. Secondly, we propose to employ recent embedding and aggregating methods which can significantly boost the feature discriminability. Regarding the computation and storage efficiency, we include a hashing module to produce very compact binary image representations. Extensive experiments on six image retrieval benchmarks demonstrate that our proposed framework achieves the state-of-the-art retrieval performances. |
|||||
2018 | Large-scale Speaker Retrieval On Random Speaker Variability Subspace | Shon Suwon, Lee Younggun, Kim Taesu | Arxiv | This paper describes a fast speaker search system to retrieve segments of the same voice identity in large-scale data. A recent study shows that Locality Sensitive Hashing (LSH) enables quick retrieval of a relevant voice in large-scale data in conjunction with i-vector while maintaining accuracy. In this paper, we propose a Random Speaker-variability Subspace (RSS) projection to map data into LSH-based hash tables. We hypothesized that rather than projecting onto a completely random subspace without considering the data, projecting onto a randomly generated speaker variability subspace would give a better chance of putting the same speaker's representations into the same hash bins, so fewer hash tables are needed. Multiple RSS can be generated by randomly selecting a subset of speakers from a large speaker cohort. Experimental results show that the proposed approach is 100 times and 7 times faster than linear search and LSH, respectively. |
|||||
2018 | Instance-level Sketch-based Retrieval By Deep Triplet Classification Siamese Network | Lu Peng, Lin Hangyu, Fu Yanwei, Gong Shaogang, Jiang Yu-gang, Xue Xiangyang | Arxiv | Sketch has been employed as an effective communicative tool to express the abstract and intuitive meanings of object. Recognizing the free-hand sketch drawing is extremely useful in many real-world applications. While content-based sketch recognition has been studied for several decades, the instance-level Sketch-Based Image Retrieval (SBIR) tasks have attracted significant research attention recently. The existing datasets such as QMUL-Chair and QMUL-Shoe, focus on the retrieval tasks of chairs and shoes. However, there are several key limitations in previous instance-level SBIR works. The state-of-the-art works have to heavily rely on the pre-training process, quality of edge maps, multi-cropping testing strategy, and augmenting sketch images. To efficiently solve the instance-level SBIR, we propose a new Deep Triplet Classification Siamese Network (DeepTCNet) which employs DenseNet-169 as the basic feature extractor and is optimized by the triplet loss and classification loss. Critically, our proposed DeepTCNet can break the limitations existed in previous works. The extensive experiments on five benchmark sketch datasets validate the effectiveness of the proposed model. Additionally, to study the tasks of sketch-based hairstyle retrieval, this paper contributes a new instance-level photo-sketch dataset - Hairstyle Photo-Sketch dataset, which is composed of 3600 sketches and photos, and 2400 sketch-photo pairs. |
|||||
2018 | Diving Deep Onto Discriminative Ensemble Of Histological Hashing Class-specific Manifold Learning For Multi-class Breast Carcinoma Taxonomy | Pratiher Sawon, Chattoraj Subhankar | Arxiv | Histopathological images (HI) encrypt resolution dependent heterogeneous textures & diverse color distribution variability, manifesting in micro-structural surface tissue convolutions. Also, inherently high coherency of cancerous cells poses significant challenges to breast cancer (BC) multi-classification. As such, multi-class stratification is sparsely explored & prior work mainly focus on benign & malignant tissue characterization only, which forestalls further quantitative analysis of subordinate classes like adenosis, mucinous carcinoma & fibroadenoma etc, for diagnostic competence. In this work, a fully-automated, near-real-time & computationally inexpensive robust multi-classification deep framework from HI is presented. The proposed scheme employs deep neural network (DNN) aided discriminative ensemble of holistic class-specific manifold learning (CSML) for underlying HI sub-space embedding & HI hashing based local shallow signatures. The model achieves 95.8% accuracy pertinent to multi-classification & 2.8% overall performance improvement & 38.2% enhancement for Lobular carcinoma (LC) sub-class recognition rate as compared to the existing state-of-the-art on well known BreakHis dataset is achieved. Also, 99.3% recognition rate at 200X & a sensitivity of 100% for binary grading at all magnification validates its suitability for clinical deployment in hand-held smart devices. |
|||||
2018 | Improving Similarity Search With High-dimensional Locality-sensitive Hashing | Sharma Jaiyam, Navlakha Saket | Arxiv | We propose a new class of data-independent locality-sensitive hashing (LSH) algorithms based on the fruit fly olfactory circuit. The fundamental difference of this approach is that, instead of assigning hashes as dense points in a low dimensional space, hashes are assigned in a high dimensional space, which enhances their separability. We show theoretically and empirically that this new family of hash functions is locality-sensitive and preserves rank similarity for inputs in any \(\ell_p\) space. We then analyze different variations on this strategy and show empirically that they outperform existing LSH methods for nearest-neighbors search on six benchmark datasets. Finally, we propose a multi-probe version of our algorithm that achieves higher performance for the same query time, or conversely, that maintains performance of prior approaches while taking significantly less indexing time and memory. Overall, our approach leverages the advantages of separability provided by high-dimensional spaces, while still remaining computationally efficient. |
|||||
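The expansion-then-sparsify idea behind this family of hashes can be sketched in a few lines. The code below is a simplified illustration of the fruit-fly-inspired scheme this line of work builds on (the parameter values and dense projection matrix are assumptions for readability, not the paper's configuration): project into a much higher-dimensional space with a sparse binary random matrix, then keep only the top activations as a sparse binary code.

```python
import numpy as np

def fly_hash(x, expansion_dim=2048, nnz_per_unit=12, top_k=32, seed=0):
    """Fly-inspired LSH: expand into a higher-dimensional space via a sparse binary
    random projection, then apply winner-take-all by keeping the top_k activations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    proj = np.zeros((expansion_dim, d))
    for row in proj:
        row[rng.choice(d, size=nnz_per_unit, replace=False)] = 1.0  # sparse binary row
    activations = proj @ x
    code = np.zeros(expansion_dim, dtype=np.uint8)
    code[np.argsort(activations)[-top_k:]] = 1  # winner-take-all sparsification
    return code

data_rng = np.random.default_rng(1)
a = data_rng.random(64)
b = a + 0.01 * data_rng.random(64)  # a nearby point
print("shared active bits:", int((fly_hash(a) & fly_hash(b)).sum()), "of 32")
```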
2018 | Incremental Sparse TFIDF Incremental Similarity With Bipartite Graphs | Sarmento Rui Portocarrero, Brazdil Pavel | Arxiv | In this report, we experimented with several concepts regarding text stream analysis. We tested an implementation of Incremental Sparse TF-IDF (IS-TFIDF) and Incremental Cosine Similarity (ICS) with the use of bipartite graphs. We are using bipartite graphs - one type of node represents documents, and the other type represents words - to know which documents are affected by a word's arrival in the stream (the neighbors of the word in the graph). Thus, with this information, we leverage optimized algorithms used for graph-based applications. The concept is similar to, for example, the use of hash tables or other computer science concepts used for fast access to information in memory. |
|||||
2018 | Fast Similarity Search Via Optimal Sparse Lifting | Wenye Li, Jingwei Mao, Yin Zhang, Shuguang Cui | Neural Information Processing Systems | Similarity search is a fundamental problem in computing science with various applications and has attracted significant research attention, especially in large-scale search with high dimensions. Motivated by the evidence in biological science, our work develops a novel approach for similarity search. Fundamentally different from existing methods that typically reduce the dimension of the data to lessen the computational complexity and speed up the search, our approach projects the data into an even higher-dimensional space while ensuring the sparsity of the data in the output space, with the objective of further improving precision and speed. Specifically, our approach has two key steps. Firstly, it computes the optimal sparse lifting for given input samples and increases the dimension of the data while approximately preserving their pairwise similarity. Secondly, it seeks the optimal lifting operator that best maps input samples to the optimal sparse lifting. Computationally, both steps are modeled as optimization problems that can be efficiently and effectively solved by the Frank-Wolfe algorithm. Simple as it is, our approach has reported significantly improved results in empirical evaluations, and exhibited its high potentials in solving practical problems. |
|||||
2018 | Self-supervised Adversarial Hashing Networks For Cross-modal Retrieval | Li Chao, Deng Cheng, Li Ning, Liu Wei, Gao Xinbo, Tao Dacheng | Arxiv | Thanks to the success of deep learning, cross-modal retrieval has made significant progress recently. However, there still remains a crucial bottleneck: how to bridge the modality gap to further enhance the retrieval accuracy. In this paper, we propose a self-supervised adversarial hashing (\textbf{SSAH}) approach, which lies among the early attempts to incorporate adversarial learning into cross-modal hashing in a self-supervised fashion. The primary contribution of this work is that two adversarial networks are leveraged to maximize the semantic correlation and consistency of the representations between different modalities. In addition, we harness a self-supervised semantic network to discover high-level semantic information in the form of multi-label annotations. Such information guides the feature learning process and preserves the modality relationships in both the common semantic space and the Hamming space. Extensive experiments carried out on three benchmark datasets validate that the proposed SSAH surpasses the state-of-the-art methods. |
|||||
2018 | Confirmation Sampling For Exact Nearest Neighbor Search | Christiani Tobias, Pagh Rasmus, Thorup Mikkel | Arxiv | Locality-sensitive hashing (LSH), introduced by Indyk and Motwani in STOC '98, has been an extremely influential framework for nearest neighbor search in high-dimensional data sets. While theoretical work has focused on the approximate nearest neighbor problems, in practice LSH data structures with suitably chosen parameters are used to solve the exact nearest neighbor problem (with some error probability). Sublinear query time is often possible in practice even for exact nearest neighbor search, intuitively because the nearest neighbor tends to be significantly closer than other data points. However, theory offers little advice on how to choose LSH parameters outside of pre-specified worst-case settings. We introduce the technique of confirmation sampling for solving the exact nearest neighbor problem using LSH. First, we give a general reduction that transforms a sequence of data structures that each find the nearest neighbor with a small, unknown probability, into a data structure that returns the nearest neighbor with probability \(1-\delta\), using as few queries as possible. Second, we present a new query algorithm for the LSH Forest data structure with \(L\) trees that is able to return the exact nearest neighbor of a query point within the same time bound as an LSH Forest of \(\Omega(L)\) trees with internal parameters specifically tuned to the query and data. |
|||||
2018 | CRH A Simple Benchmark Approach To Continuous Hashing | Cheng Miao, Tsoi Ah Chung | Arxiv | In recent years, the distinctive advancement in handling huge data has promoted the evolution of ubiquitous computing and analysis technologies. With the constantly growing system burden and computational complexity, adaptive coding has been a fascinating topic for pattern analysis, with outstanding performance. In this work, a continuous hashing method, termed continuous random hashing (CRH), is proposed to encode sequential data streams, even when previous hashing knowledge is unavailable. Instead, a random selection idea is adopted to adaptively approximate the differential encoding patterns of the data stream, e.g., streaming media, and iteration is avoided for stepwise learning. Experimental results demonstrate our method is able to provide outstanding performance as a benchmark approach to continuous hashing. |
|||||
2018 | Adaptive Mapreduce Similarity Joins | Mccauley Samuel, Silvestri Francesco | Arxiv | Similarity joins are a fundamental database operation. Given data sets S and R, the goal of a similarity join is to find all points x in S and y in R with distance at most r. Recent research has investigated how locality-sensitive hashing (LSH) can be used for similarity join, and in particular two recent lines of work have made exciting progress on LSH-based join performance. Hu, Tao, and Yi (PODS 17) investigated joins in a massively parallel setting, showing strong results that adapt to the size of the output. Meanwhile, Ahle, Aumüller, and Pagh (SODA 17) showed a sequential algorithm that adapts to the structure of the data, matching classic bounds in the worst case but improving them significantly on more structured data. We show that this adaptive strategy can be adapted to the parallel setting, combining the advantages of these approaches. In particular, we show that a simple modification to Hu et al.'s algorithm achieves bounds that depend on the density of points in the dataset as well as the total size of the output. Our algorithm uses no extra parameters over other LSH approaches (in particular, its execution does not depend on the structure of the dataset), and is likely to be efficient in practice. |
|||||
2018 | Object Detection Based Deep Unsupervised Hashing | Tu Rong-cheng, Mao Xian-ling, Feng Bo-si, Bian Bing-bing, Ying Yu-shu | Arxiv | Recently, similarity-preserving hashing methods have been extensively studied for large-scale image retrieval. Compared with unsupervised hashing, supervised hashing methods for labeled data usually have better performance by utilizing semantic label information. Intuitively, for unlabeled data, it will improve the performance of unsupervised hashing methods if we can first mine some supervised semantic 'label information' from unlabeled data and then incorporate the 'label information' into the training process. Thus, in this paper, we propose a novel Object Detection based Deep Unsupervised Hashing method (ODDUH). Specifically, a pre-trained object detection model is utilized to mine supervised 'label information', which is used to guide the learning process to generate high-quality hash codes. Extensive experiments on two public datasets demonstrate that the proposed method outperforms the state-of-the-art unsupervised hashing methods in the image retrieval task. |
|||||
2018 | Anchorhash A Scalable Consistent Hash | Mendelson Gal, Vargaftik Shay, Barabash Katherine, Lorenz Dean, Keslassy Isaac, Orda Ariel | Arxiv | Consistent hashing (CH) is a central building block in many networking applications, from datacenter load-balancing to distributed storage. Unfortunately, state-of-the-art CH solutions cannot ensure full consistency under arbitrary changes and/or cannot scale while maintaining reasonable memory footprints and update times. We present AnchorHash, a scalable and fully-consistent hashing algorithm. AnchorHash achieves high key lookup rates, a low memory footprint, and low update times. We formally establish its strong theoretical guarantees, and present advanced implementations with a memory footprint of only a few bytes per resource. Moreover, extensive evaluations indicate that it outperforms state-of-the-art algorithms, and that it can scale on a single core to 100 million resources while still achieving a key lookup rate of more than 15 million keys per second. |
|||||
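For readers unfamiliar with the problem AnchorHash addresses, the sketch below shows classical ring-based consistent hashing, the baseline notion of "consistency": when a resource is removed, only the keys it owned are remapped. This is background illustration only and is not the AnchorHash construction itself; the resource names and virtual-node count are arbitrary.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Classical ring-based consistent hashing (illustrative baseline only).
    Each resource owns several virtual points on a ring; a key is served by the
    first resource point clockwise from the key's own hash."""
    def __init__(self, resources, vnodes=64):
        self.ring = sorted((_h(f"{r}#{i}"), r) for r in resources for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def lookup(self, key):
        idx = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c"])
before = {f"key-{i}": ring.lookup(f"key-{i}") for i in range(1000)}
ring_after = HashRing(["server-a", "server-b"])  # server-c removed
moved = sum(before[k] != ring_after.lookup(k) for k in before)
print(f"{moved} of 1000 keys moved")  # only keys previously on server-c should move
```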
2018 | Local Density Estimation In High Dimensions | Wu Xian, Charikar Moses, Natchu Vishnu | Arxiv | An important question that arises in the study of high dimensional vector representations learned from data is: given a set \(\mathcal{D}\) of vectors and a query \(q\), estimate the number of points within a specified distance threshold of \(q\). We develop two estimators, LSH Count and Multi-Probe Count that use locality sensitive hashing to preprocess the data to accurately and efficiently estimate the answers to such questions via importance sampling. A key innovation is the ability to maintain a small number of hash tables via preprocessing data structures and algorithms that sample from multiple buckets in each hash table. We give bounds on the space requirements and sample complexity of our schemes, and demonstrate their effectiveness in experiments on a standard word embedding dataset. |
|||||
2018 | Distributed Collaborative Hashing And Its Applications In Ant Financial | Chen Chaochao, Liu Ziqi, Zhao Peilin, Li Longfei, Zhou Jun, Li Xiaolong | Arxiv | Collaborative filtering, especially latent factor model, has been popularly used in personalized recommendation. Latent factor model aims to learn user and item latent factors from user-item historic behaviors. To apply it into real big data scenarios, efficiency becomes the first concern, including offline model training efficiency and online recommendation efficiency. In this paper, we propose a Distributed Collaborative Hashing (DCH) model which can significantly improve both efficiencies. Specifically, we first propose a distributed learning framework, following the state-of-the-art parameter server paradigm, to learn the offline collaborative model. Our model can be learnt efficiently by distributedly computing subgradients in minibatches on workers and updating model parameters on servers asynchronously. We then adopt hashing technique to speedup the online recommendation procedure. Recommendation can be quickly made through exploiting lookup hash tables. We conduct thorough experiments on two real large-scale datasets. The experimental results demonstrate that, comparing with the classic and state-of-the-art (distributed) latent factor models, DCH has comparable performance in terms of recommendation accuracy but has both fast convergence speed in offline model training procedure and realtime efficiency in online recommendation procedure. Furthermore, the encouraging performance of DCH is also shown for several real-world applications in Ant Financial. |
|||||
2018 | Greedy Hash Towards Fast Optimization For Accurate Hash Coding In CNN | Shupeng Su, Chao Zhang, Kai Han, Yonghong Tian | Neural Information Processing Systems | To convert the input into binary code, hashing algorithm has been widely used for approximate nearest neighbor search on large-scale image sets due to its computation and storage efficiency. Deep hashing further improves the retrieval quality by combining the hash coding with deep neural network. However, a major difficulty in deep hashing lies in the discrete constraints imposed on the network output, which generally makes the optimization NP hard. In this work, we adopt the greedy principle to tackle this NP hard problem by iteratively updating the network toward the probable optimal discrete solution in each iteration. A hash coding layer is designed to implement our approach which strictly uses the sign function in forward propagation to maintain the discrete constraints, while in back propagation the gradients are transmitted intact to the preceding layers to avoid vanishing gradients. In addition to the theoretical derivation, we provide a new perspective to visualize and understand the effectiveness and efficiency of our algorithm. Experiments on benchmark datasets show that our scheme outperforms state-of-the-art hashing methods in both supervised and unsupervised tasks. |
|||||
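The coding layer described in the abstract (hard sign in the forward pass, gradients passed through untouched in the backward pass) corresponds to the familiar straight-through pattern and can be sketched in PyTorch as below. This is a minimal reading of the abstract rather than the authors' released code, and the surrounding loss is only a placeholder.

```python
import torch

class SignStraightThrough(torch.autograd.Function):
    """Forward: strict sign, producing binary codes. Backward: pass gradients
    through unchanged so the layers below still receive a learning signal."""
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)       # note: sign(0) = 0

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output         # identity gradient ("transmitted intact")

features = torch.randn(4, 16, requires_grad=True)    # stand-in for CNN features
codes = SignStraightThrough.apply(features)           # discrete +/-1 codes
loss = (codes - torch.randn(4, 16)).pow(2).mean()     # placeholder retrieval loss
loss.backward()
print(codes[0], float(features.grad.abs().sum()) > 0)
```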
2018 | Zero-shot Sketch-image Hashing | Shen Yuming, Liu Li, Shen Fumin, Shao Ling | Arxiv | Recent studies show that large-scale sketch-based image retrieval (SBIR) can be efficiently tackled by cross-modal binary representation learning methods, where Hamming distance matching significantly speeds up the process of similarity search. Providing training and test data subjected to a fixed set of pre-defined categories, the cutting-edge SBIR and cross-modal hashing works obtain acceptable retrieval performance. However, most of the existing methods fail when the categories of query sketches have never been seen during training. In this paper, the above problem is briefed as a novel but realistic zero-shot SBIR hashing task. We elaborate the challenges of this special task and accordingly propose a zero-shot sketch-image hashing (ZSIH) model. An end-to-end three-network architecture is built, two of which are treated as the binary encoders. The third network mitigates the sketch-image heterogeneity and enhances the semantic relations among data by utilizing the Kronecker fusion layer and graph convolution, respectively. As an important part of ZSIH, we formulate a generative hashing scheme in reconstructing semantic knowledge representations for zero-shot retrieval. To the best of our knowledge, ZSIH is the first zero-shot hashing work suitable for SBIR and cross-modal search. Comprehensive experiments are conducted on two extended datasets, i.e., Sketchy and TU-Berlin with a novel zero-shot train-test split. The proposed model remarkably outperforms related works. |
|||||
2018 | Multi-resolution Hashing For Fast Pairwise Summations | Charikar Moses, Siminelakis Paris | Arxiv | A basic computational primitive in the analysis of massive datasets is summing simple functions over a large number of objects. Modern applications pose an additional challenge in that such functions often depend on a parameter vector \(y\) (query) that is unknown a priori. Given a set of points \(X\subset \mathbb{R}^{d}\) and a pairwise function \(w:\mathbb{R}^{d}\times \mathbb{R}^{d}\to [0,1]\), we study the problem of designing a data-structure that enables sublinear-time approximation of the summation \(Z_{w}(y)=\frac{1}{|X|}\sum_{x\in X}w(x,y)\) for any query \(y\in \mathbb{R}^{d}\). By combining ideas from Harmonic Analysis (partitions of unity and approximation theory) with Hashing-Based-Estimators [Charikar, Siminelakis FOCS’17], we provide a general framework for designing such data structures through hashing that reaches far beyond what previous techniques allowed. A key design principle is a collection of \(T\geq 1\) hashing schemes with collision probabilities \(p_{1},\ldots, p_{T}\) such that \(\sup_{t\in [T]}\{p_{t}(x,y)\} = \Theta(\sqrt{w(x,y)})\). This leads to a data-structure that approximates \(Z_{w}(y)\) using a sub-linear number of samples from each hash family. Using this new framework along with Distance Sensitive Hashing [Aumuller, Christiani, Pagh, Silvestri PODS’18], we show that such a collection can be constructed and evaluated efficiently for any log-convex function \(w(x,y)=e^{\phi(\langle x,y\rangle)}\) of the inner product on the unit sphere \(x,y\in \mathcal{S}^{d-1}\). Our method leads to data structures with sub-linear query time that significantly improve upon random sampling and can be used for Kernel Density or Partition Function Estimation. We provide extensions of our result from the sphere to \(\mathbb{R}^{d}\) and from scalar functions to vector functions. |
|||||
2018 | Deep Reinforcement Learning For Image Hashing | Peng Yuxin, Zhang Jian, Ye Zhaoda | Arxiv | Deep hashing methods have received much attention recently, which achieve promising results by taking advantage of the strong representation power of deep networks. However, most existing deep hashing methods learn a whole set of hashing functions independently, while ignore the correlations between different hashing functions that can promote the retrieval accuracy greatly. Inspired by the sequential decision ability of deep reinforcement learning, we propose a new Deep Reinforcement Learning approach for Image Hashing (DRLIH). Our proposed DRLIH approach models the hashing learning problem as a sequential decision process, which learns each hashing function by correcting the errors imposed by previous ones and promotes retrieval accuracy. To the best of our knowledge, this is the first work to address hashing problem from deep reinforcement learning perspective. The main contributions of our proposed DRLIH approach can be summarized as follows: (1) We propose a deep reinforcement learning hashing network. In the proposed network, we utilize recurrent neural network (RNN) as agents to model the hashing functions, which take actions of projecting images into binary codes sequentially, so that the current hashing function learning can take previous hashing functions’ error into account. (2) We propose a sequential learning strategy based on proposed DRLIH. We define the state as a tuple of internal features of RNN’s hidden layers and image features, which can reflect history decisions made by the agents. We also propose an action group method to enhance the correlation of hash functions in the same group. Experiments on three widely-used datasets demonstrate the effectiveness of our proposed DRLIH approach. |
|||||
2018 | Hashing-based-estimators For Kernel Density In High Dimensions | Charikar Moses, Siminelakis Paris | Arxiv | Given a set of points \(P\subset \mathbb{R}^{d}\) and a kernel \(k\), the Kernel Density Estimate at a point \(x\in\mathbb{R}^{d}\) is defined as \(\mathrm{KDE}_{P}(x)=\frac{1}{|P|}\sum_{y\in P} k(x,y)\). We study the problem of designing a data structure that given a data set \(P\) and a kernel function, returns approximations to the kernel density of a query point in sublinear time. We introduce a class of unbiased estimators for kernel density implemented through locality-sensitive hashing, and give general theorems bounding the variance of such estimators. These estimators give rise to efficient data structures for estimating the kernel density in high dimensions for a variety of commonly used kernels. Our work is the first to provide data-structures with theoretical guarantees that improve upon simple random sampling in high dimensions. |
|||||
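To fix notation, the snippet below computes the kernel density estimate exactly and with the plain uniform-sampling estimator that hashing-based estimators are designed to improve upon (the Gaussian kernel and sample sizes are arbitrary illustrative choices; the LSH-biased sampling itself is not reproduced here).

```python
import numpy as np

def kde_exact(P, x, sigma=1.0):
    """KDE_P(x) = (1/|P|) * sum_{y in P} k(x, y), with a Gaussian kernel here."""
    d2 = np.sum((P - x) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2 * sigma ** 2))))

def kde_uniform_sample(P, x, m, sigma=1.0, seed=0):
    """Uniform random-sampling estimator: unbiased, but its relative variance blows
    up when KDE_P(x) is small, which is the regime hashing-based estimators target."""
    rng = np.random.default_rng(seed)
    sample = P[rng.choice(len(P), size=m, replace=True)]
    return kde_exact(sample, x, sigma)

rng = np.random.default_rng(0)
P = rng.normal(size=(10000, 16))
x = rng.normal(size=16)
print(kde_exact(P, x), kde_uniform_sample(P, x, m=500))
```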
2018 | Cycle-consistent Deep Generative Hashing For Cross-modal Retrieval | Wu Lin, Wang Yang, Shao Ling | Arxiv | In this paper, we propose a novel deep generative approach to cross-modal retrieval to learn hash functions in the absence of paired training samples through the cycle consistency loss. Our proposed approach employs an adversarial training scheme to learn a couple of hash functions enabling translation between modalities while assuming the underlying semantic relationship. To induce the hash codes with semantics to the input-output pair, cycle consistency loss is further proposed upon the adversarial training to strengthen the correlations between inputs and corresponding outputs. Our approach is generative to learn hash functions such that the learned hash codes can maximally correlate each input-output correspondence, meanwhile can also regenerate the inputs so as to minimize the information loss. The learning to hash embedding is thus performed to jointly optimize the parameters of the hash functions across modalities as well as the associated generative models. Extensive experiments on a variety of large-scale cross-modal data sets demonstrate that our proposed method achieves better retrieval results than the state-of-the-art methods. |
|||||
2018 | On Finding Quantum Multi-collisions | Liu Qipeng, Zhandry Mark | Arxiv | A \(k\)-collision for a compressing hash function \(H\) is a set of \(k\) distinct inputs that all map to the same output. In this work, we show that for any constant \(k\), \(\Theta\left(N^{\frac{1}{2}(1-\frac{1}{2^k-1})}\right)\) quantum queries are both necessary and sufficient to achieve a \(k\)-collision with constant probability. This improves on both the best prior upper bound (Hosoyamada et al., ASIACRYPT 2017) and provides the first non-trivial lower bound, completely resolving the problem. |
|||||
2018 | FRESH Frechet Similarity With Hashing | Ceccarello Matteo, Driemel Anne, Silvestri Francesco | Proc. of Algorithms and Data Structures Symposium | This paper studies the \(r\)-range search problem for curves under the continuous Fréchet distance: given a dataset \(S\) of \(n\) polygonal curves and a threshold \(r>0\), construct a data structure that, for any query curve \(q\), efficiently returns all entries in \(S\) with distance at most \(r\) from \(q\). We propose FRESH, an approximate and randomized approach for \(r\)-range search, that leverages a locality-sensitive hashing scheme for detecting candidate near neighbors of the query curve, and a subsequent pruning step based on a cascade of curve simplifications. We experimentally compare FRESH to exact and deterministic solutions, and we show that high performance can be reached by suitably relaxing precision and recall. |
|||||
2018 | Efficient Nearest Neighbors Search For Large-scale Landmark Recognition | Magliani Federico, Fontanini Tomaso, Prati Andrea | Arxiv | The problem of landmark recognition has achieved excellent results in small-scale datasets. When dealing with large-scale retrieval, issues that were irrelevant with small amount of data, quickly become fundamental for an efficient retrieval phase. In particular, computational time needs to be kept as low as possible, whilst the retrieval accuracy has to be preserved as much as possible. In this paper we propose a novel multi-index hashing method called Bag of Indexes (BoI) for Approximate Nearest Neighbors (ANN) search. It allows to drastically reduce the query time and outperforms the accuracy results compared to the state-of-the-art methods for large-scale landmark recognition. It has been demonstrated that this family of algorithms can be applied on different embedding techniques like VLAD and R-MAC obtaining excellent results in very short times on different public datasets: Holidays+Flickr1M, Oxford105k and Paris106k. |
|||||
2018 | Deep Priority Hashing | Cao Zhangjie, Sun Ziping, Long Mingsheng, Wang Jianmin, Yu Philip S. | Arxiv | Deep hashing enables image retrieval by end-to-end learning of deep representations and hash codes from training data with pairwise similarity information. Subject to the distribution skewness underlying the similarity information, most existing deep hashing methods may underperform for imbalanced data due to misspecified loss functions. This paper presents Deep Priority Hashing (DPH), an end-to-end architecture that generates compact and balanced hash codes in a Bayesian learning framework. The main idea is to reshape the standard cross-entropy loss for similarity-preserving learning such that it down-weighs the loss associated to highly-confident pairs. This idea leads to a novel priority cross-entropy loss, which prioritizes the training on uncertain pairs over confident pairs. Also, we propose another priority quantization loss, which prioritizes hard-to-quantize examples for generation of nearly lossless hash codes. Extensive experiments demonstrate that DPH can generate high-quality hash codes and yield state-of-the-art image retrieval results on three datasets, ImageNet, NUS-WIDE, and MS-COCO. |
|||||
2018 | Improved Deep Hashing With Soft Pairwise Similarity For Multi-label Image Retrieval | Zhang Zheng, Zou Qin, Lin Yuewei, Chen Long, Wang Song | Arxiv | Hash coding has been widely used in the approximate nearest neighbor search for large-scale image retrieval. Recently, many deep hashing methods have been proposed and shown largely improved performance over traditional feature-learning-based methods. Most of these methods examine the pairwise similarity on the semantic-level labels, where the pairwise similarity is generally defined in a hard-assignment way. That is, the pairwise similarity is ‘1’ if they share no less than one class label and ‘0’ if they do not share any. However, such similarity definition cannot reflect the similarity ranking for pairwise images that hold multiple labels. In this paper, a new deep hashing method is proposed for multi-label image retrieval by re-defining the pairwise similarity into an instance similarity, where the instance similarity is quantified into a percentage based on the normalized semantic labels. Based on the instance similarity, a weighted cross-entropy loss and a minimum mean square error loss are tailored for loss-function construction, and are efficiently used for simultaneous feature learning and hash coding. Experiments on three popular datasets demonstrate that, the proposed method outperforms the competing methods and achieves the state-of-the-art performance in multi-label image retrieval. |
|||||
2018 | Hashing With Binary Matrix Pursuit | Cakir Fatih, He Kun, Sclaroff Stan | Arxiv | We propose theoretical and empirical improvements for two-stage hashing methods. We first provide a theoretical analysis on the quality of the binary codes and show that, under mild assumptions, a residual learning scheme can construct binary codes that fit any neighborhood structure with arbitrary accuracy. Secondly, we show that with high-capacity hash functions such as CNNs, binary code inference can be greatly simplified for many standard neighborhood definitions, yielding smaller optimization problems and more robust codes. Incorporating our findings, we propose a novel two-stage hashing method that significantly outperforms previous hashing studies on widely used image retrieval benchmarks. |
|||||
2018 | SCH-GAN Semi-supervised Cross-modal Hashing By Generative Adversarial Network | Zhang Jian, Peng Yuxin, Yuan Mingkuan | Arxiv | Cross-modal hashing aims to map heterogeneous multimedia data into a common Hamming space, which can realize fast and flexible retrieval across different modalities. Supervised cross-modal hashing methods have achieved considerable progress by incorporating semantic side information. However, they mainly have two limitations: (1) they heavily rely on large-scale labeled cross-modal training data, which are labor-intensive and hard to obtain; (2) they ignore the rich information contained in the large amount of unlabeled data across different modalities, especially the margin examples that are easily retrieved incorrectly, which can help to model the correlations. To address these problems, in this paper we propose a novel Semi-supervised Cross-Modal Hashing approach by Generative Adversarial Network (SCH-GAN). We aim to take advantage of GAN’s ability for modeling data distributions to promote cross-modal hashing learning in an adversarial way. The main contributions can be summarized as follows: (1) We propose a novel generative adversarial network for cross-modal hashing. In our proposed SCH-GAN, the generative model tries to select margin examples of one modality from unlabeled data when given a query of another modality, while the discriminative model tries to distinguish the selected examples from true positive examples of the query. These two models play a minimax game so that the generative model can promote the hashing performance of the discriminative model. (2) We propose a reinforcement learning based algorithm to drive the training of the proposed SCH-GAN. The generative model takes the correlation score predicted by the discriminative model as a reward, and tries to select the examples close to the margin to promote the discriminative model by maximizing the margin between positive and negative data. Experiments on 3 widely-used datasets verify the effectiveness of our proposed approach. |
|||||
2018 | Gradient Augmented Information Retrieval With Autoencoders And Semantic Hashing | Billings Sean | Arxiv | This paper will explore the use of autoencoders for semantic hashing in the context of Information Retrieval. This paper will summarize how to efficiently train an autoencoder in order to create meaningful and low-dimensional encodings of data. This paper will demonstrate how computing and storing the closest encodings to an input query can help speed up search time and improve the quality of our search results. The novel contributions of this paper involve using the representation of the data learned by an auto-encoder in order to augment our search query in various ways. I present and evaluate the new gradient search augmentation (GSA) approach, as well as the more well-known pseudo-relevance-feedback (PRF) adjustment. I find that GSA helps to improve the performance of the TF-IDF based information retrieval system, and PRF combined with GSA works best overall for the systems compared in this paper. |
|||||
2018 | Texture Synthesis Guided Deep Hashing For Texture Image Retrieval | Bhunia Ayan Kumar, Kishore Perla Sai Raj, Mukherjee Pranay, Das Abhirup, Roy Partha Pratim | Arxiv | With the large-scale explosion of images and videos over the internet, efficient hashing methods have been developed to facilitate memory and time efficient retrieval of similar images. However, none of the existing works uses hashing to address texture image retrieval mostly because of the lack of sufficiently large texture image databases. Our work addresses this problem by developing a novel deep learning architecture that generates binary hash codes for input texture images. For this, we first pre-train a Texture Synthesis Network (TSN) which takes a texture patch as input and outputs an enlarged view of the texture by injecting newer texture content. Thus it signifies that the TSN encodes the learnt texture specific information in its intermediate layers. In the next stage, a second network gathers the multi-scale feature representations from the TSN’s intermediate layers using channel-wise attention, combines them in a progressive manner to a dense continuous representation which is finally converted into a binary hash code with the help of individual and pairwise label information. The new enlarged texture patches also help in data augmentation to alleviate the problem of insufficient texture data and are used to train the second stage of the network. Experiments on three public texture image retrieval datasets indicate the superiority of our texture synthesis guided hashing approach over current state-of-the-art methods. |
|||||
2018 | Do You Like What I Like Similarity Estimation In Proximity-based Mobile Social Networks | Beierle Felix | Arxiv | While existing social networking services tend to connect people who know each other, people show a desire to also connect to yet unknown people in physical proximity. Existing research shows that people tend to connect to similar people. Utilizing technology in order to stimulate human interaction between strangers, we consider the scenario of two strangers meeting. Using the example of similarity in musical taste, we develop a solution for the problem of similarity estimation in proximity-based mobile social networks. We show that a single exchange of a probabilistic data structure between two devices can closely estimate the similarity of two users - without the need to contact a third-party server. We introduce metrics for fast and space-efficient approximation of the Dice coefficient of two multisets - based on the comparison of two Counting Bloom Filters or two Count-Min Sketches. Our analysis shows that utilizing a single hash function minimizes the error when comparing these probabilistic data structures. The size that should be chosen for the data structure depends on the expected average number of unique input elements. Using real user data, we show that a Counting Bloom Filter with a single hash function and a length of 128 is sufficient to accurately estimate the similarity between two multisets representing the musical tastes of two users. Our approach generalizes to any other similarity estimation of frequencies represented as multisets. |
|||||
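The abstract above describes estimating the Dice coefficient of two multisets from a single exchange of Counting Bloom Filters built with one hash function. Below is a minimal Python sketch of that idea under stated assumptions (counter length 128, as the paper reports sufficient, and SHA-1 as the single hash); it is an illustration, not the authors' implementation.

```python
# Sketch: Dice-coefficient estimation from Counting Bloom Filters (CBFs)
# that use a single hash function. M=128 and SHA-1 are illustrative choices.
import hashlib
from collections import Counter

M = 128  # number of counters; the paper reports a length of 128 as sufficient

def h(item: str) -> int:
    """Single hash function mapping an item to one of M counters."""
    return int(hashlib.sha1(item.encode()).hexdigest(), 16) % M

def counting_bloom(multiset):
    cbf = [0] * M
    for item, count in Counter(multiset).items():
        cbf[h(item)] += count
    return cbf

def dice_estimate(cbf_a, cbf_b):
    """Approximate Dice coefficient 2|A∩B| / (|A|+|B|) from two CBFs."""
    inter = sum(min(a, b) for a, b in zip(cbf_a, cbf_b))
    total = sum(cbf_a) + sum(cbf_b)
    return 2.0 * inter / total if total else 0.0

if __name__ == "__main__":
    tastes_a = ["rock", "rock", "jazz", "pop"]
    tastes_b = ["rock", "jazz", "jazz", "metal"]
    print(dice_estimate(counting_bloom(tastes_a), counting_bloom(tastes_b)))
```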
2018 | Learning-based Efficient Graph Similarity Computation Via Multi-scale Convolutional Set Matching | Bai Yunsheng, Ding Hao, Sun Yizhou, Wang Wei | Arxiv | Graph similarity computation is one of the core operations in many graph-based applications, such as graph similarity search, graph database analysis, graph clustering, etc. Since computing the exact distance/similarity between two graphs is typically NP-hard, a series of approximate methods have been proposed with a trade-off between accuracy and speed. Recently, several data-driven approaches based on neural networks have been proposed, most of which model the graph-graph similarity as the inner product of their graph-level representations, with different techniques proposed for generating one embedding per graph. However, using one fixed-dimensional embedding per graph may fail to fully capture graphs in varying sizes and link structures, a limitation that is especially problematic for the task of graph similarity computation, where the goal is to find the fine-grained difference between two graphs. In this paper, we address the problem of graph similarity computation from another perspective, by directly matching two sets of node embeddings without the need to use fixed-dimensional vectors to represent whole graphs for their similarity computation. The model, GraphSim, achieves the state-of-the-art performance on four real-world graph datasets under six out of eight settings (here we count a specific dataset and metric combination as one setting), compared to existing popular methods for approximate Graph Edit Distance (GED) and Maximum Common Subgraph (MCS) computation. |
|||||
2018 | XNORBIN A 95 Top/s/w Hardware Accelerator For Binary Convolutional Neural Networks | Bahou Andrawes Al, Karunaratne Geethan, Andri Renzo, Cavigelli Lukas, Benini Luca | Arxiv | Deploying state-of-the-art CNNs requires power-hungry processors and off-chip memory. This precludes the implementation of CNNs in low-power embedded systems. Recent research shows CNNs sustain extreme quantization, binarizing their weights and intermediate feature maps, thereby saving 8-32x memory and collapsing energy-intensive sum-of-products into XNOR-and-popcount operations. We present XNORBIN, an accelerator for binary CNNs with computation tightly coupled to memory for aggressive data reuse. Implemented in UMC 65nm technology, XNORBIN achieves an energy efficiency of 95 TOp/s/W and an area efficiency of 2.0 TOp/s/MGE at 0.8 V. |
|||||
2018 | Convolutional Set Matching For Graph Similarity | Bai Yunsheng, Ding Hao, Sun Yizhou, Wang Wei | Arxiv | We introduce GSimCNN (Graph Similarity Computation via Convolutional Neural Networks) for predicting the similarity score between two graphs. As the core operation of graph similarity search, pairwise graph similarity computation is a challenging problem due to the NP-hard nature of computing many graph distance/similarity metrics. We demonstrate our model using the Graph Edit Distance (GED) as the example metric. Experiments on three real graph datasets demonstrate that our model achieves the state-of-the-art performance on graph similarity search. |
|||||
2018 | Revisiting The Inverted Indices For Billion-scale Approximate Nearest Neighbors | Baranchuk Dmitry, Babenko Artem, Malkov Yury | Arxiv | This work addresses the problem of billion-scale nearest neighbor search. The state-of-the-art retrieval systems for billion-scale databases are currently based on the inverted multi-index, the recently proposed generalization of the inverted index structure. The multi-index provides a very fine-grained partition of the feature space that allows extracting concise and accurate short-lists of candidates for the search queries. In this paper, we argue that the potential of the simple inverted index was not fully exploited in previous works and advocate its usage both for the highly-entangled deep descriptors and relatively disentangled SIFT descriptors. We introduce a new retrieval system that is based on the inverted index and outperforms the multi-index by a large margin for the same memory consumption and construction complexity. For example, our system achieves the state-of-the-art recall rates several times faster on the dataset of one billion deep descriptors compared to the efficient implementation of the inverted multi-index from the FAISS library. |
|||||
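The paper above compares a plain inverted index against the inverted multi-index as implemented in the FAISS library. For orientation, here is a minimal, generic FAISS inverted-file (IVF) usage sketch with toy dimensions and parameters; it shows the stock IVF structure only, not the authors' modified retrieval system.

```python
# Minimal FAISS inverted-file (IVF) example; d, nlist, nprobe are toy values,
# far from the billion-scale setting discussed in the paper.
import numpy as np
import faiss

d, nb, nq, nlist = 64, 10000, 5, 100
rng = np.random.default_rng(0)
xb = rng.random((nb, d), dtype=np.float32)   # database vectors
xq = rng.random((nq, d), dtype=np.float32)   # query vectors

quantizer = faiss.IndexFlatL2(d)             # coarse quantizer over centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
index.train(xb)                              # learn the nlist centroids
index.add(xb)                                # assign vectors to inverted lists
index.nprobe = 8                             # how many lists to visit per query

D, I = index.search(xq, 10)                  # distances and ids of the top-10
print(I[0])
```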
2018 | Fully Understanding The Hashing Trick | Casper B. Freksen, Lior Kamma, Kasper Green Larsen | Neural Information Processing Systems | Feature hashing, also known as the hashing trick, introduced by Weinberger et al. (2009), is one of the key techniques used in scaling-up machine learning algorithms. Loosely speaking, feature hashing uses a random sparse projection matrix \(A : \mathbb{R}^n \to \mathbb{R}^m\) (where \(m \ll n\)) in order to reduce the dimension of the data from \(n\) to \(m\) while approximately preserving the Euclidean norm. Every column of \(A\) contains exactly one non-zero entry, equal to either \(-1\) or \(1\). Weinberger et al. showed tail bounds on \(\|Ax\|_2^2\). Specifically they showed that for every \(\epsilon, \delta\), if \(\|x\|_{\infty} / \|x\|_2\) is sufficiently small, and \(m\) is sufficiently large, then \begin{equation*}\Pr[ \; | \;\|Ax\|_2^2 - \|x\|_2^2\; | < \epsilon \|x\|_2^2 \;] \ge 1 - \delta \;.\end{equation*} These bounds were later extended by Dasgupta et al. (2010) and most recently refined by Dahlgaard et al. (2017); however, the true nature of the performance of this key technique, and specifically the correct tradeoff between the pivotal parameters \(\|x\|_{\infty} / \|x\|_2, m, \epsilon, \delta\), remained an open question. We settle this question by giving tight asymptotic bounds on the exact tradeoff between the central parameters, thus providing a complete understanding of the performance of feature hashing. We complement the asymptotic bound with empirical data, which shows that the constants “hiding” in the asymptotic notation are, in fact, very close to \(1\), thus further illustrating the tightness of the presented bounds in practice. |
|||||
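As a concrete illustration of the feature-hashing map analyzed above, the sketch below computes \(Ax\) for a random sparse sign matrix in which every column has exactly one \(\pm 1\) entry; the specific hash choices (an MD5-derived bucket and sign) are illustrative assumptions, not part of the paper.

```python
# Sketch of the hashing trick: coordinate j of x is routed to one of m buckets
# with a random sign, i.e. column j of the implicit matrix A has one ±1 entry.
import hashlib
import numpy as np

def _bucket_and_sign(j: int, m: int):
    digest = hashlib.md5(str(j).encode()).digest()
    bucket = int.from_bytes(digest[:8], "little") % m   # which row of A
    sign = 1.0 if digest[8] % 2 == 0 else -1.0          # the ±1 entry
    return bucket, sign

def feature_hash(x: np.ndarray, m: int) -> np.ndarray:
    """Compute Ax for the sparse sign matrix A defined by the hashes."""
    out = np.zeros(m)
    for j, v in enumerate(x):
        if v != 0.0:
            bucket, sign = _bucket_and_sign(j, m)
            out[bucket] += sign * v
    return out

if __name__ == "__main__":
    x = np.random.default_rng(1).normal(size=10000)
    y = feature_hash(x, m=512)
    # E[||Ax||^2] = ||x||^2, so this ratio should concentrate near 1
    print(np.dot(y, y) / np.dot(x, x))
```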
2018 | Fuzzy Hashing As Perturbation-consistent Adversarial Kernel Embedding | Azarafrooz Ari, Brock John | Arxiv | Measuring the similarity of two files is an important task in malware analysis, with fuzzy hash functions being a popular approach. Traditional fuzzy hash functions are data agnostic: they do not learn from a particular dataset how to determine similarity; their behavior is fixed across all datasets. In this paper, we demonstrate that fuzzy hash functions can be learned in a novel minimax training framework and that these learned fuzzy hash functions outperform traditional fuzzy hash functions at the file similarity task for Portable Executable files. In our approach, hash digests can be extracted from the kernel embeddings of two kernel networks, trained in a minimax framework, where the roles of players during training (i.e adversary versus generator) alternate along with the input data. We refer to this new minimax architecture as perturbation-consistent. The similarity score for a pair of files is the utility of the minimax game in equilibrium. Our experiments show that learned fuzzy hash functions generalize well, capable of determining that two files are similar even when one of those files was generated using insertion and deletion operations. |
|||||
2018 | Supermodular Locality Sensitive Hashes | Berman Maxim, Blaschko Matthew B. | Arxiv | In this work, we show deep connections between Locality Sensitive Hashability and submodular analysis. We show that the LSHability of the most commonly analyzed set similarities is in one-to-one correspondence with the supermodularity of these similarities when taken with respect to the symmetric difference of their arguments. We find that the supermodularity of equivalent LSHable similarities can be dependent on the set encoding. While monotonicity and supermodularity do not imply the metric condition necessary for supermodularity, this condition is guaranteed for the more restricted class of supermodular Hamming similarities that we introduce. We show moreover that LSH preserving transformations are also supermodular-preserving, yielding a way to generate families of similarities both LSHable and supermodular. Finally, we show that even the more restricted family of cardinality-based supermodular Hamming similarities presents promising aspects for the study of the link between LSHability and supermodularity. We hope that the several bridges that we introduce between LSHability and supermodularity pave the way to a better understanding both of supermodular analysis and LSHability, notably in the context of large-scale supermodular optimization. |
|||||
2018 | Ann-benchmarks A Benchmarking Tool For Approximate Nearest Neighbor Algorithms | Aumüller Martin, Bernhardsson Erik, Faithfull Alexander | Arxiv | This paper describes ANN-Benchmarks, a tool for evaluating the performance of in-memory approximate nearest neighbor algorithms. It provides a standard interface for measuring the performance and quality achieved by nearest neighbor algorithms on different standard data sets. It supports several different ways of integrating \(k\)-NN algorithms, and its configuration system automatically tests a range of parameter settings for each algorithm. Algorithms are compared with respect to many different (approximate) quality measures, and adding more is easy and fast; the included plotting front-ends can visualise these as images, \(\LaTeX\) plots, and websites with interactive plots. ANN-Benchmarks aims to provide a constantly updated overview of the current state of the art of \(k\)-NN algorithms. In the short term, this overview allows users to choose the correct \(k\)-NN algorithm and parameters for their similarity search task; in the longer term, algorithm designers will be able to use this overview to test and refine automatic parameter tuning. The paper gives an overview of the system, evaluates the results of the benchmark, and points out directions for future work. Interestingly, very different approaches to \(k\)-NN search yield comparable quality-performance trade-offs. The system is available at http://ann-benchmarks.com . |
|||||
2018 | Massively Multilingual Sentence Embeddings For Zero-shot Cross-lingual Transfer And Beyond | Artetxe Mikel, Schwenk Holger | Arxiv | We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI dataset), cross-lingual document classification (MLDoc dataset) and parallel corpus mining (BUCC dataset) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our implementation, the pre-trained encoder and the multilingual test set are available at https://github.com/facebookresearch/LASER |
|||||
2018 | Near-lossless Binarization Of Word Embeddings | Tissier Julien, Gravier Christophe, Habrard Amaury | Arxiv | Word embeddings are commonly used as a starting point in many NLP models to achieve state-of-the-art performances. However, with a large vocabulary and many dimensions, these floating-point representations are expensive both in terms of memory and calculations which makes them unsuitable for use on low-resource devices. The method proposed in this paper transforms real-valued embeddings into binary embeddings while preserving semantic information, requiring only 128 or 256 bits for each vector. This leads to a small memory footprint and fast vector operations. The model is based on an autoencoder architecture, which also allows to reconstruct original vectors from the binary ones. Experimental results on semantic similarity, text classification and sentiment analysis tasks show that the binarization of word embeddings only leads to a loss of ~2% in accuracy while vector size is reduced by 97%. Furthermore, a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors. |
|||||
2018 | Compressing Deep Neural Networks A New Hashing Pipeline Using Kacs Random Walk Matrices | Parker-holder Jack, Gass Sam | Arxiv | The popularity of deep learning is increasing by the day. However, despite the recent advancements in hardware, deep neural networks remain computationally intensive. Recent work has shown that by preserving the angular distance between vectors, random feature maps are able to reduce dimensionality without introducing bias to the estimator. We test a variety of established hashing pipelines as well as a new approach using Kac’s random walk matrices. We demonstrate that this method achieves similar accuracy to existing pipelines. |
|||||
2018 | Hashtran-dnn A Framework For Enhancing Robustness Of Deep Neural Networks Against Adversarial Malware Samples | Li Deqiang, Baral Ramesh, Li Tao, Wang Han, Li Qianmu, Xu Shouhuai | Arxiv | Adversarial machine learning in the context of image processing and related applications has received a large amount of attention. However, adversarial machine learning, especially adversarial deep learning, in the context of malware detection has received much less attention despite its apparent importance. In this paper, we present a framework for enhancing the robustness of Deep Neural Networks (DNNs) against adversarial malware samples, dubbed Hashing Transformation Deep Neural Networks (HashTran-DNN). The core idea is to use hash functions with a certain locality-preserving property to transform samples to enhance the robustness of DNNs in malware classification. The framework further uses a Denoising Auto-Encoder (DAE) regularizer to reconstruct the hash representations of samples, making the resulting DNN classifiers capable of attaining the locality information in the latent space. We experiment with two concrete instantiations of the HashTran-DNN framework to classify Android malware. Experimental results show that four known attacks can render standard DNNs useless in classifying Android malware, that known defenses can at most defend three of the four attacks, and that HashTran-DNN can effectively defend against all of the four attacks. |
|||||
2018 | Learning Product Codebooks Using Vector Quantized Autoencoders For Image Retrieval | Wu Hanwei, Flierl Markus | Arxiv | Vector-Quantized Variational Autoencoders (VQ-VAE)[1] provide an unsupervised model for learning discrete representations by combining vector quantization and autoencoders. In this paper, we study the use of VQ-VAE for representation learning for downstream tasks, such as image retrieval. We first describe the VQ-VAE in the context of an information-theoretic framework. We show that the regularization term on the learned representation is determined by the size of the embedded codebook before the training and it affects the generalization ability of the model. As a result, we introduce a hyperparameter to balance the strength of the vector quantizer and the reconstruction error. By tuning the hyperparameter, the embedded bottleneck quantizer is used as a regularizer that forces the output of the encoder to share a constrained coding space such that learned latent features preserve the similarity relations of the data space. In addition, we provide a search range for finding the best hyperparameter. Finally, we incorporate the product quantization into the bottleneck stage of VQ-VAE and propose an end-to-end unsupervised learning model for the image retrieval task. The product quantizer has the advantage of generating large-size codebooks. Fast retrieval can be achieved by using the lookup tables that store the distance between any pair of sub-codewords. State-of-the-art retrieval results are achieved by the learned codebooks. |
|||||
2018 | Non-empty Bins With Simple Tabulation Hashing | Aamand Anders, Thorup Mikkel | Arxiv | We consider the hashing of a set \(X\subseteq U\) with \(|X|=m\) using a simple tabulation hash function \(h:U\to [n]=\{0,\dots,n-1\}\) and analyse the number of non-empty bins, that is, the size of \(h(X)\). We show that the expected size of \(h(X)\) matches that with fully random hashing to within low-order terms. We also provide concentration bounds. The number of non-empty bins is a fundamental measure in the balls and bins paradigm, and it is critical in applications such as Bloom filters and Filter hashing. For example, normally Bloom filters are proportioned for a desired low false-positive probability assuming fully random hashing (see \url{en.wikipedia.org/wiki/Bloom_filter}). Our results imply that if we implement the hashing with simple tabulation, we obtain the same low false-positive probability for any possible input. |
|||||
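For readers unfamiliar with simple tabulation hashing, the following minimal sketch hashes a key by splitting it into characters, looking each character up in an independent random table, and XORing the results, and then counts non-empty bins as in the analysis above. Key width, table sizes, and the bin count are illustrative assumptions.

```python
# Sketch of simple tabulation hashing: 32-bit keys viewed as 4 bytes, each
# byte indexes its own table of random 32-bit values, and lookups are XORed.
import random

random.seed(42)
C = 4                                     # number of 8-bit characters per key
TABLES = [[random.getrandbits(32) for _ in range(256)] for _ in range(C)]

def simple_tabulation(key: int) -> int:
    h = 0
    for i in range(C):
        char = (key >> (8 * i)) & 0xFF    # i-th byte of the key
        h ^= TABLES[i][char]
    return h

# Count non-empty bins when hashing m keys into n bins.
m, n = 10000, 4096
bins = {simple_tabulation(x) % n for x in range(m)}
print(len(bins), "non-empty bins out of", n)
```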
2018 | Semantic Cluster Unary Loss For Efficient Deep Hashing | Zhang Shifeng, Li Jianmin, Zhang Bo | IEEE Transactions on Image Processing | Hashing method maps similar data to binary hashcodes with smaller hamming distance, which has received a broad attention due to its low storage cost and fast retrieval speed. With the rapid development of deep learning, deep hashing methods have achieved promising results in efficient information retrieval. Most of the existing deep hashing methods adopt pairwise or triplet losses to deal with similarities underlying the data, but the training is difficult and less efficient because \(O(n^2)\) data pairs and \(O(n^3)\) triplets are involved. To address these issues, we propose a novel deep hashing algorithm with unary loss which can be trained very efficiently. We first of all introduce a Unary Upper Bound of the traditional triplet loss, thus reducing the complexity to \(O(n)\) and bridging the classification-based unary loss and the triplet loss. Second, we propose a novel Semantic Cluster Deep Hashing (SCDH) algorithm by introducing a modified Unary Upper Bound loss, named Semantic Cluster Unary Loss (SCUL). The resultant hashcodes form several compact clusters, which means hashcodes in the same cluster have similar semantic information. We also demonstrate that the proposed SCDH is easy to be extended to semi-supervised settings by incorporating the state-of-the-art semi-supervised learning algorithms. Experiments on large-scale datasets show that the proposed method is superior to state-of-the-art hashing algorithms. |
|||||
2018 | Dual Asymmetric Deep Hashing Learning | Li Jinxing, Zhang Bob, Lu Guangming, Zhang David | Arxiv | Due to the impressive learning power, deep learning has achieved a remarkable performance in supervised hash function learning. In this paper, we propose a novel asymmetric supervised deep hashing method to preserve the semantic structure among different categories and generate the binary codes simultaneously. Specifically, two asymmetric deep networks are constructed to reveal the similarity between each pair of images according to their semantic labels. The deep hash functions are then learned through two networks by minimizing the gap between the learned features and discrete codes. Furthermore, since the binary codes in the Hamming space also should keep the semantic affinity existing in the original space, another asymmetric pairwise loss is introduced to capture the similarity between the binary codes and real-value features. This asymmetric loss not only improves the retrieval performance, but also contributes to a quick convergence at the training phase. By taking advantage of the two-stream deep structures and two types of asymmetric pairwise functions, an alternating algorithm is designed to optimize the deep features and high-quality binary codes efficiently. Experimental results on three real-world datasets substantiate the effectiveness and superiority of our approach as compared with state-of-the-art. |
|||||
2018 | Deep Semantic Hashing With Generative Adversarial Networks | Qiu Zhaofan, Pan Yingwei, Yao Ting, Mei Tao | Arxiv | Hashing has been a widely-adopted technique for nearest neighbor search in large-scale image retrieval tasks. Recent research has shown that leveraging supervised information can lead to high quality hashing. However, the cost of annotating data is often an obstacle when applying supervised hashing to a new domain. Moreover, the results can suffer from the robustness problem as the data at training and test stage could come from similar but different distributions. This paper studies the exploration of generating synthetic data through semi-supervised generative adversarial networks (GANs), which leverages largely unlabeled and limited labeled training data to produce highly compelling data with intrinsic invariance and global coherence, for better understanding statistical structures of natural data. We demonstrate that the above two limitations can be well mitigated by applying the synthetic data for hashing. Specifically, a novel deep semantic hashing with GANs (DSH-GANs) is presented, which mainly consists of four components: a deep convolution neural networks (CNN) for learning image representations, an adversary stream to distinguish synthetic images from real ones, a hash stream for encoding image representations to hash codes and a classification stream. The whole architecture is trained end-to-end by jointly optimizing three losses, i.e., adversarial loss to correct label of synthetic or real for each sample, triplet ranking loss to preserve the relative similarity ordering in the input real-synthetic triplets and classification loss to classify each sample accurately. Extensive experiments conducted on both CIFAR-10 and NUS-WIDE image benchmarks validate the capability of exploiting synthetic images for hashing. Our framework also achieves superior results when compared to state-of-the-art deep hash models. |
|||||
2018 | Fast Counting In Machine Learning Applications | Karan Subhadeep, Eichhorn Matthew, Hurlburt Blake, Iraci Grant, Zola Jaroslaw | Arxiv | We propose scalable methods to execute counting queries in machine learning applications. To achieve memory and computational efficiency, we abstract counting queries and their context such that the counts can be aggregated as a stream. We demonstrate performance and scalability of the resulting approach on random queries, and through extensive experimentation using Bayesian networks learning and association rule mining. Our methods significantly outperform commonly used ADtrees and hash tables, and are practical alternatives for processing large-scale data. |
|||||
2018 | Discriminative Cross-view Binary Representation Learning | Liu Liu, Qi Hairong | WACV | Learning compact representation is vital and challenging for large scale multimedia data. Cross-view/cross-modal hashing for effective binary representation learning has received significant attention with exponentially growing availability of multimedia content. Most existing cross-view hashing algorithms emphasize the similarities in individual views, which are then connected via cross-view similarities. In this work, we focus on the exploitation of the discriminative information from different views, and propose an end-to-end method to learn semantic-preserving and discriminative binary representation, dubbed Discriminative Cross-View Hashing (DCVH), in light of learning multitasking binary representation for various tasks including cross-view retrieval, image-to-image retrieval, and image annotation/tagging. The proposed DCVH has the following key components. First, it uses convolutional neural network (CNN) based nonlinear hashing functions and multilabel classification for both images and texts simultaneously. Such hashing functions achieve effective continuous relaxation during training without explicit quantization loss by using Direct Binary Embedding (DBE) layers. Second, we propose an effective view alignment via Hamming distance minimization, which is efficiently accomplished by bit-wise XOR operation. Extensive experiments on two image-text benchmark datasets demonstrate that DCVH outperforms state-of-the-art cross-view hashing algorithms as well as single-view image hashing algorithms. In addition, DCVH can provide competitive performance for image annotation/tagging. |
|||||
2018 | Extracting Parallel Paragraphs From Common Crawl | Kúdela Jakub, Holubová Irena, Bojar Ondřej | The Prague Bulletin of Mathematical Linguistics Volume | Most of the current methods for mining parallel texts from the web assume that web pages of web sites share the same structure across languages. We believe that there still exists a non-negligible amount of parallel data spread across sources not satisfying this assumption. We propose an approach based on a combination of bivec (a bilingual extension of word2vec) and locality-sensitive hashing which allows us to efficiently identify pairs of parallel segments located anywhere on pages of a given web domain, regardless of their structure. We validate our method on realigning segments from a large parallel corpus. Another experiment with real-world data provided by Common Crawl Foundation confirms that our solution scales to a set of web-crawled data hundreds of terabytes in size. |
|||||
2018 | Word2bits - Quantized Word Vectors | Lam Maximilian | Arxiv | Word vectors require significant amounts of memory and storage, posing issues to resource limited devices like mobile phones and GPUs. We show that high quality quantized word vectors using 1-2 bits per parameter can be learned by introducing a quantization function into Word2Vec. We furthermore show that training with the quantization function acts as a regularizer. We train word vectors on English Wikipedia (2017) and evaluate them on standard word similarity and analogy tasks and on question answering (SQuAD). Our quantized word vectors not only take 8-16x less space than full precision (32 bit) word vectors but also outperform them on word similarity tasks and question answering. |
|||||
2018 | Bernoulli Embeddings For Graphs | Misra Vinith, Bhatia Sumit | Arxiv | Just as semantic hashing can accelerate information retrieval, binary valued embeddings can significantly reduce latency in the retrieval of graphical data. We introduce a simple but effective model for learning such binary vectors for nodes in a graph. By imagining the embeddings as independent coin flips of varying bias, continuous optimization techniques can be applied to the approximate expected loss. Embeddings optimized in this fashion consistently outperform the quantization of both spectral graph embeddings and various learned real-valued embeddings, on both ranking and pre-ranking tasks for a variety of datasets. |
|||||
2018 | MTFH A Matrix Tri-factorization Hashing Framework For Efficient Cross-modal Retrieval | Liu Xin, Hu Zhikai, Ling Haibin, Cheung Yiu-ming | IEEE Transactions on Pattern Analysis and Machine Intelligence | Hashing has recently sparked a great revolution in cross-modal retrieval because of its low storage cost and high query speed. Recent cross-modal hashing methods often learn unified or equal-length hash codes to represent the multi-modal data and make them intuitively comparable. However, such unified or equal-length hash representations could inherently sacrifice their representation scalability because the data from different modalities may not have one-to-one correspondence and could be encoded more efficiently by different hash codes of unequal lengths. To mitigate these problems, this paper explores a related and relatively unexplored problem: encoding the heterogeneous data with hash codes of varying lengths and generalizing cross-modal retrieval to various challenging scenarios. To this end, a generalized and flexible cross-modal hashing framework, termed Matrix Tri-Factorization Hashing (MTFH), is proposed to work seamlessly in various settings including paired or unpaired multi-modal data, and equal or varying hash length encoding scenarios. More specifically, MTFH exploits an efficient objective function to flexibly learn the modality-specific hash codes with different length settings, while synchronously learning two semantic correlation matrices to semantically correlate the different hash representations and make the heterogeneous data comparable. As a result, the derived hash codes are more semantically meaningful for various challenging cross-modal retrieval tasks. Extensive experiments evaluated on public benchmark datasets highlight the superiority of MTFH under various retrieval scenarios and show its competitive performance with the state of the art. |
|||||
2018 | Self-supervised Video Hashing With Hierarchical Binary Auto-encoder | Song Jingkuan, Zhang Hanwang, Li Xiangpeng, Gao Lianli, Wang Meng, Hong Richang | Arxiv | Existing video hash functions are built on three isolated stages: frame pooling, relaxed learning, and binarization, which have not adequately explored the temporal order of video frames in a joint binary optimization model, resulting in severe information loss. In this paper, we propose a novel unsupervised video hashing framework dubbed Self-Supervised Video Hashing (SSVH), that is able to capture the temporal nature of videos in an end-to-end learning-to-hash fashion. We specifically address two central problems: 1) how to design an encoder-decoder architecture to generate binary codes for videos; and 2) how to equip the binary codes with the ability of accurate video retrieval. We design a hierarchical binary autoencoder to model the temporal dependencies in videos with multiple granularities, and embed the videos into binary codes with less computations than the stacked architecture. Then, we encourage the binary codes to simultaneously reconstruct the visual content and neighborhood structure of the videos. Experiments on two real-world datasets (FCVID and YFCC) show that our SSVH method can significantly outperform the state-of-the-art methods and achieve the currently best performance on the task of unsupervised video retrieval. |
|||||
2018 | Deep Segment Hash Learning For Music Generation | Joslyn Kevin, Zhuang Naifan, Hua Kien A. | Arxiv | Music generation research has grown in popularity over the past decade, thanks to the deep learning revolution that has redefined the landscape of artificial intelligence. In this paper, we propose a novel approach to music generation inspired by musical segment concatenation methods and hash learning algorithms. Given a segment of music, we use a deep recurrent neural network and ranking-based hash learning to assign a forward hash code to the segment to retrieve candidate segments for continuation with matching backward hash codes. The proposed method is thus called Deep Segment Hash Learning (DSHL). To the best of our knowledge, DSHL is the first end-to-end segment hash learning method for music generation, and the first to use pair-wise training with segments of music. We demonstrate that this method is capable of generating music which is both original and enjoyable, and that DSHL offers a promising new direction for music generation research. |
|||||
2018 | Error Correction Maximization For Deep Image Hashing | Xu Xiang, Wang Xiaofang, Kitani Kris M. | Arxiv | We propose to use the concept of the Hamming bound to derive the optimal criteria for learning hash codes with a deep network. In particular, when the number of binary hash codes (typically the number of image categories) and code length are known, it is possible to derive an upper bound on the minimum Hamming distance between the hash codes. This upper bound can then be used to define the loss function for learning hash codes. By encouraging the margin (minimum Hamming distance) between the hash codes of different image categories to match the upper bound, we are able to learn theoretically optimal hash codes. Our experiments show that our method significantly outperforms competing deep learning-based approaches and obtains top performance on benchmark datasets. |
|||||
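The abstract above derives an upper bound on the minimum Hamming distance between class codes from the Hamming bound. The helper below evaluates the classical sphere-packing form of that bound for a given number of categories and code length; it is a generic illustration under that assumption and does not reproduce the paper's exact loss construction.

```python
# Sphere-packing (Hamming) bound: K codewords of length b with minimum
# distance d must satisfy K * sum_{i<=t} C(b, i) <= 2^b, where t = (d-1)//2.
# The function returns the largest d consistent with this inequality, which
# could serve as a target margin between class hash codes.
from math import comb

def hamming_bound_min_distance(num_codewords: int, code_length: int) -> int:
    best = 0
    for d in range(1, code_length + 1):
        t = (d - 1) // 2
        ball = sum(comb(code_length, i) for i in range(t + 1))
        if num_codewords * ball <= 2 ** code_length:
            best = d
    return best

# e.g. 100 image categories encoded with 48-bit hash codes
print(hamming_bound_min_distance(100, 48))
```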
2018 | Engineering A Simplified 0-bit Consistent Weighted Sampling | Raff Edward, Sylvester Jared, Nicholas Charles | In Proceedings of the | The Min-Hashing approach to sketching has become an important tool in data analysis, information retrieval, and classification. To apply it to real-valued datasets, the ICWS algorithm has become a seminal approach that is widely used, and provides state-of-the-art performance for this problem space. However, ICWS suffers from a computational burden as the sketch size K increases. We develop a new Simplified approach to the ICWS algorithm that enables us to obtain over 20x speedups compared to the standard algorithm. The veracity of our approach is demonstrated empirically on multiple datasets and scenarios, showing that our new Simplified CWS obtains the same quality of results while being an order of magnitude faster. |
|||||
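For context, the sketch below implements the standard ICWS weighted-minhash sampler (Ioffe, 2010) that the paper above simplifies, together with the 0-bit variant that keeps only the argmin index as the hash value; it follows the textbook construction from memory under that assumption and is not the authors' accelerated algorithm.

```python
# Sketch of ICWS (Ioffe, 2010) and its 0-bit variant for weighted minhashing.
# Collision probability of the sketches approximates the generalized Jaccard
# similarity sum(min)/sum(max) of two non-negative weight vectors.
import math
import random

def icws_hash(weights, seed):
    """One (k*, t_{k*}) ICWS sample for a non-negative weight vector."""
    rng = random.Random(seed)
    best, best_a = None, float("inf")
    for k, w in enumerate(weights):
        # Draw per-dimension randomness in a fixed order so that hashes are
        # consistent across vectors of the same dimensionality.
        r = rng.gammavariate(2.0, 1.0)
        c = rng.gammavariate(2.0, 1.0)
        beta = rng.random()
        if w <= 0.0:
            continue
        t = math.floor(math.log(w) / r + beta)
        y = math.exp(r * (t - beta))
        a = c / (y * math.exp(r))
        if a < best_a:
            best_a, best = a, (k, t)
    return best

def zero_bit_cws_sketch(weights, sketch_size=64):
    # 0-bit CWS: keep only the index k*, dropping t, for each of K seeds.
    return [icws_hash(weights, seed)[0] for seed in range(sketch_size)]

x = [0.0, 2.5, 0.0, 1.0, 4.0, 0.3]
y = [0.0, 2.0, 0.5, 1.0, 4.0, 0.0]
sx, sy = zero_bit_cws_sketch(x), zero_bit_cws_sketch(y)
print(sum(a == b for a, b in zip(sx, sy)) / len(sx))  # ≈ generalized Jaccard
```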
2018 | Unsupervised Semantic Deep Hashing | Jin Sheng | Arxiv | In recent years, deep hashing methods have proved to be efficient, since they employ convolutional neural networks to learn features and hash codes simultaneously. However, these methods are mostly supervised. In real-world applications, annotating a large number of images is a time-consuming and burdensome task. In this paper, we propose a novel unsupervised deep hashing method for large-scale image retrieval. Our method, namely unsupervised semantic deep hashing (USDH), uses semantic information preserved in the CNN feature layer to guide the training of the network. We enforce four criteria on hash code learning based on the VGG-19 model: 1) preserving relevant information of the feature space in the hashing space; 2) minimizing quantization loss between binary-like codes and hashing codes; 3) improving the usage of each bit in hashing codes by using maximum information entropy; and 4) making the codes invariant to image rotation. Extensive experiments on CIFAR-10 and NUS-WIDE have demonstrated that USDH outperforms several state-of-the-art unsupervised hashing methods for image retrieval. We also conduct experiments on the Oxford 17 dataset for fine-grained classification to verify its efficiency for other computer vision tasks. |
|||||
2018 | Deep Saliency Hashing | Jin Sheng, Yao Hongxun, Sun Xiaoshuai, Zhou Shangchen, Zhang Lei, Hua Xiansheng | Arxiv | In recent years, hashing methods have been proved to be effective and efficient for the large-scale Web media search. However, the existing general hashing methods have limited discriminative power for describing fine-grained objects that share similar overall appearance but have subtle difference. To solve this problem, we for the first time introduce the attention mechanism to the learning of fine-grained hashing codes. Specifically, we propose a novel deep hashing model, named deep saliency hashing (DSaH), which automatically mines salient regions and learns semantic-preserving hashing codes simultaneously. DSaH is a two-step end-to-end model consisting of an attention network and a hashing network. Our loss function contains three basic components, including the semantic loss, the saliency loss, and the quantization loss. As the core of DSaH, the saliency loss guides the attention network to mine discriminative regions from pairs of images. We conduct extensive experiments on both fine-grained and general retrieval datasets for performance evaluation. Experimental results on fine-grained datasets, including Oxford Flowers-17, Stanford Dogs-120, and CUB Bird demonstrate that our DSaH performs the best for fine-grained retrieval task and beats the strongest competitor (DTQ) by approximately 10% on both Stanford Dogs-120 and CUB Bird. DSaH is also comparable to several state-of-the-art hashing methods on general datasets, including CIFAR-10 and NUS-WIDE. |
|||||
2018 | GPU Accelerated Cascade Hashing Image Matching For Large Scale 3D Reconstruction | Xu Tao, Sun Kun, Tao Wenbing | Arxiv | Image feature point matching is a key step in Structure from Motion (SfM). However, it is becoming more and more time-consuming because the number of images is getting larger and larger. In this paper, we propose a GPU-accelerated image matching method with improved Cascade Hashing. Firstly, we propose a Disk-Memory-GPU data exchange strategy and optimize the load order of data, so that the proposed method can deal with big data. Next, we parallelize the Cascade Hashing method on GPU. An improved parallel reduction and an improved parallel hashing ranking are proposed to fulfill this task. Finally, extensive experiments show that our image matching is about 20 times faster than SiftGPU on the same graphics card, nearly 100 times faster than the CPU CasHash method and hundreds of times faster than the CPU Kd-Tree based matching method. Furthermore, we introduce the epipolar constraint to the proposed method, and use the epipolar geometry to guide the feature matching procedure, which further reduces the matching cost. |
|||||
2018 | On The Needs For Rotations In Hypercubic Quantization Hashing | Morvan Anne, Souloumiac Antoine, Choromanski Krzysztof, Gouy-pailler Cédric, Atif Jamal | Arxiv | The aim of this paper is to endow the well-known family of hypercubic quantization hashing methods with theoretical guarantees. In hypercubic quantization, applying a suitable (random or learned) rotation after dimensionality reduction has been experimentally shown to improve the results accuracy in the nearest neighbors search problem. We prove in this paper that the use of these rotations is optimal under some mild assumptions: getting optimal binary sketches is equivalent to applying a rotation uniformizing the diagonal of the covariance matrix between data points. Moreover, for two close points, the probability of obtaining dissimilar binary sketches is upper bounded by a factor of the initial distance between the data points. Relaxing these assumptions, we obtain a general concentration result for random matrices. We also provide some experiments illustrating these theoretical points and compare a set of algorithms in both the batch and online settings. |
|||||
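The family of methods analyzed above quantizes rotated, dimensionality-reduced data to the vertices of a hypercube. The sketch below shows that pipeline with a random Gaussian projection and a random orthogonal rotation followed by sign quantization; the projection and rotation choices are illustrative assumptions, not the learned rotations studied in the paper.

```python
# Sketch of hypercubic quantization hashing: reduce dimension, apply a random
# orthogonal rotation, and quantize each coordinate to its sign.
import numpy as np

def random_rotation(dim, seed=0):
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q  # a random orthogonal matrix

def hypercubic_hash(X, n_bits, seed=0):
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(X.shape[1], n_bits)) / np.sqrt(n_bits)  # reduce dim
    R = random_rotation(n_bits, seed)
    return (X @ proj @ R > 0).astype(np.uint8)       # binary sketches in {0,1}

X = np.random.default_rng(1).normal(size=(1000, 128))
codes = hypercubic_hash(X, n_bits=32)
print(codes.shape, codes[:2])
```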
2018 | Attribute-guided Network For Cross-modal Zero-shot Hashing | Ji Zhong, Sun Yuxin, Yu Yunlong, Pang Yanwei, Han Jungong | Arxiv | Zero-Shot Hashing aims at learning a hashing model that is trained only by instances from seen categories but can generalize well to those of unseen categories. Typically, it is achieved by utilizing a semantic embedding space to transfer knowledge from the seen domain to the unseen domain. Existing efforts mainly focus on the single-modal retrieval task, especially Image-Based Image Retrieval (IBIR). However, as a highlighted research topic in the field of hashing, cross-modal retrieval is more common in real world applications. To address the Cross-Modal Zero-Shot Hashing (CMZSH) retrieval task, we propose a novel Attribute-Guided Network (AgNet), which can perform not only IBIR, but also Text-Based Image Retrieval (TBIR). In particular, AgNet aligns different modal data into a semantically rich attribute space, which bridges the gap caused by modality heterogeneity and zero-shot setting. We also design an effective strategy that exploits the attribute to guide the generation of hash codes for image and text within the same network. Extensive experimental results on three benchmark datasets (AwA, SUN, and ImageNet) demonstrate the superiority of AgNet on both cross-modal and single-modal zero-shot image retrieval tasks. |
|||||
2018 | Efficient End-to-end Learning For Quantizable Representations | Jeong Yeonwoo, Song Hyun Oh | Arxiv | Embedding representation learning via neural networks is at the core foundation of modern similarity based search. While much effort has been put in developing algorithms for learning binary hamming code representations for search efficiency, this still requires a linear scan of the entire dataset per each query and trades off the search accuracy through binarization. To this end, we consider the problem of directly learning a quantizable embedding representation and the sparse binary hash code end-to-end which can be used to construct an efficient hash table not only providing significant search reduction in the number of data but also achieving the state of the art search accuracy outperforming previous state of the art deep metric learning methods. We also show that finding the optimal sparse binary hash code in a mini-batch can be computed exactly in polynomial time by solving a minimum cost flow problem. Our results on Cifar-100 and on ImageNet datasets show the state of the art search accuracy in precision@k and NMI metrics while providing up to 98X and 478X search speedup respectively over exhaustive linear search. The source code is available at https://github.com/maestrojeong/Deep-Hash-Table-ICML18 |
|||||
2018 | Sketchmate Deep Hashing For Million-scale Human Sketch Retrieval | Xu Peng, Huang Yongye, Yuan Tongtong, Pang Kaiyue, Song Yi-zhe, Xiang Tao, Hospedales Timothy M., Ma Zhanyu, Guo Jun | Arxiv | We propose a deep hashing framework for sketch retrieval that, for the first time, works on a multi-million scale human sketch dataset. Leveraging on this large dataset, we explore a few sketch-specific traits that were otherwise under-studied in prior literature. Instead of following the conventional sketch recognition task, we introduce the novel problem of sketch hashing retrieval which is not only more challenging, but also offers a better testbed for large-scale sketch analysis, since: (i) more fine-grained sketch feature learning is required to accommodate the large variations in style and abstraction, and (ii) a compact binary code needs to be learned at the same time to enable efficient retrieval. Key to our network design is the embedding of unique characteristics of human sketch, where (i) a two-branch CNN-RNN architecture is adapted to explore the temporal ordering of strokes, and (ii) a novel hashing loss is specifically designed to accommodate both the temporal and abstract traits of sketches. By working with a 3.8M sketch dataset, we show that state-of-the-art hashing models specifically engineered for static images fail to perform well on temporal sketch data. Our network on the other hand not only offers the best retrieval performance on various code sizes, but also yields the best generalization performance under a zero-shot setting and when re-purposed for sketch recognition. Such superior performances effectively demonstrate the benefit of our sketch-specific design. |
|||||
2018 | NASH Toward End-to-end Neural Architecture For Generative Semantic Hashing | Shen Dinghan, Su Qinliang, Chapfuwa Paidamoyo, Wang Wenlin, Wang Guoyin, Carin Lawrence, Henao Ricardo | Arxiv | Semantic hashing has become a powerful paradigm for fast similarity search in many information retrieval systems. While fairly successful, previous techniques generally require two-stage training, and the binary constraints are handled ad-hoc. In this paper, we present an end-to-end Neural Architecture for Semantic Hashing (NASH), where the binary hashing codes are treated as Bernoulli latent variables. A neural variational inference framework is proposed for training, where gradients are directly back-propagated through the discrete latent variable to optimize the hash function. We also draw connections between proposed method and rate-distortion theory, which provides a theoretical foundation for the effectiveness of the proposed framework. Experimental results on three public datasets demonstrate that our method significantly outperforms several state-of-the-art models on both unsupervised and supervised scenarios. |
|||||
2018 | Fusion Hashing A General Framework For Self-improvement Of Hashing | Liu Xingbo, Nie Xiushan, Yin Yilong | Arxiv | Hashing has been widely used for efficient similarity search based on its query and storage efficiency. To obtain better precision, most studies focus on designing different objective functions with different constraints or penalty terms that consider neighborhood information. In this paper, in contrast to existing hashing methods, we propose a novel generalized framework called fusion hashing (FH) to improve the precision of existing hashing methods without adding new constraints or penalty terms. In the proposed FH, given an existing hashing method, we first execute it several times to get several different hash codes for a set of training samples. We then propose two novel fusion strategies that combine these different hash codes into one set of final hash codes. Based on the final hash codes, we learn a simple linear hash function for the samples that can significantly improve model precision. In general, the proposed FH can be adopted in existing hashing method and achieve more precise and stable performance compared to the original hashing method with little extra expenditure in terms of time and space. Extensive experiments were performed based on three benchmark datasets and the results demonstrate the superior performance of the proposed framework |
|||||
2018 | Deep Class-wise Hashing Semantics-preserving Hashing Via Class-wise Loss | Zhe Xuefei, Chen Shifeng, Yan Hong | Arxiv | Deep supervised hashing has emerged as an influential solution to large-scale semantic image retrieval problems in computer vision. In the light of recent progress, convolutional neural network based hashing methods typically seek pair-wise or triplet labels to conduct the similarity preserving learning. However, complex semantic concepts of visual contents are hard to capture by similar/dissimilar labels, which limits the retrieval performance. Generally, pair-wise or triplet losses not only suffer from expensive training costs but also lack in extracting sufficient semantic information. In this regard, we propose a novel deep supervised hashing model to learn more compact class-level similarity preserving binary codes. Our deep learning based model is motivated by deep metric learning that directly takes semantic labels as supervised information in training and generates corresponding discriminant hashing code. Specifically, a novel cubic constraint loss function based on Gaussian distribution is proposed, which preserves semantic variations while penalizes the overlap part of different classes in the embedding space. To address the discrete optimization problem introduced by binary codes, a two-step optimization strategy is proposed to provide efficient training and avoid the problem of gradient vanishing. Extensive experiments on four large-scale benchmark databases show that our model can achieve the state-of-the-art retrieval performance. Moreover, when training samples are limited, our method surpasses other supervised deep hashing methods with non-negligible margins. |
|||||
2018 | VISER Visual Self-regularization | Izadinia Hamid, Garrigues Pierre | Arxiv | In this work, we propose the use of a large set of unlabeled images as a source of regularization data for learning robust visual representation. Given a visual model trained by a labeled dataset in a supervised fashion, we augment our training samples by incorporating a large amount of unlabeled data and train a semi-supervised model. We demonstrate that our proposed learning approach leverages an abundance of unlabeled images and boosts the visual recognition performance, which alleviates the need to rely on large labeled datasets for learning robust representation. To increase the number of image instances needed to learn robust visual models in our approach, each labeled image propagates its label to its nearest unlabeled image instances. These retrieved unlabeled images serve as local perturbations of each labeled image to perform Visual Self-Regularization (VISER). To retrieve such visual self-regularizers, we compute the cosine similarity in a semantic space defined by the penultimate layer in a fully convolutional neural network. We use the publicly available Yahoo Flickr Creative Commons 100M dataset as the source of our unlabeled image set and propose a distributed approximate nearest neighbor algorithm to make retrieval practical at that scale. Using the labeled instances and their regularizer samples we show that we significantly improve object categorization and localization performance on the MS COCO and Visual Genome datasets where objects appear in context. |
|||||
2018 | Local Orthogonal-group Testing | Iscen Ahmet, Chum Ondrej | Arxiv | This work addresses approximate nearest neighbor search applied in the domain of large-scale image retrieval. Within the group testing framework we propose an efficient off-line construction of the search structures. The linear-time complexity orthogonal grouping increases the probability that at most one element from each group is matching to a given query. Non-maxima suppression with each group efficiently reduces the number of false positive results at no extra cost. Unlike in other well-performing approaches, all processing is local, fast, and suitable to process data in batches and in parallel. We experimentally show that the proposed method achieves search accuracy of the exhaustive search with significant reduction in the search complexity. The method can be naturally combined with existing embedding methods. |
|||||
2018 | Approximate Nearest Neighbors In Limited Space | Indyk Piotr, Wagner Tal | Arxiv | We consider the \((1+\epsilon)\)-approximate nearest neighbor search problem: given a set \(X\) of \(n\) points in a \(d\)-dimensional space, build a data structure that, given any query point \(y\), finds a point \(x \in X\) whose distance to \(y\) is at most \((1+\epsilon) \min_{x \in X} \|x-y\|\) for an accuracy parameter \(\epsilon \in (0,1)\). Our main result is a data structure that occupies only \(O(\epsilon^{-2} n \log(n) \log(1/\epsilon))\) bits of space, assuming all point coordinates are integers in the range \(\{-n^{O(1)} \ldots n^{O(1)}\}\), i.e., the coordinates have \(O(\log n)\) bits of precision. This improves over the best previously known space bound of \(O(\epsilon^{-2} n \log(n)^2)\), obtained via the randomized dimensionality reduction method of Johnson and Lindenstrauss (1984). We also consider the more general problem of estimating all distances from a collection of query points to all data points \(X\), and provide almost tight upper and lower bounds for the space complexity of this problem. |
|||||
2018 | Robust Image Identification For Double-compressed JPEG Images | Iida Kenta, Kiya Hitoshi | Arxiv | It is known that JPEG images uploaded to social networks (SNs) are mostly re-compressed by the social network providers. Because of such a situation, a new image identification scheme for double-compressed JPEG images is proposed in this paper. The aim is to detect single-compressed images that have the same original image as that of a double-compressed one. In the proposed scheme, the signs of only DC coefficients in DCT coefficients and one threshold value are used for the identification. The use of them allows us to robustly avoid errors caused by double-compression, which are not considered in conventional schemes. The proposed scheme has applications not only to find uploaded images corresponding to double-compressed ones, but also to detect some image integrity. The simulation results demonstrate that the proposed scheme outperforms conventional ones, including a state-of-the-art image hashing scheme, in terms of querying performance. |
|||||
2018 | Norm-ranging LSH For Maximum Inner Product Search | Xiao Yan, Jinfeng Li, Xinyan Dai, Hongzhi Chen, James Cheng | Neural Information Processing Systems | Neyshabur and Srebro proposed SIMPLE-LSH, which is the state-of-the-art hashing-based algorithm for maximum inner product search (MIPS). We found that the performance of SIMPLE-LSH, in both theory and practice, suffers from long tails in the 2-norm distribution of real datasets. We propose NORM-RANGING LSH, which addresses the excessive normalization problem caused by long tails by partitioning a dataset into sub-datasets and building a hash index for each sub-dataset independently. We prove that NORM-RANGING LSH achieves lower query time complexity than SIMPLE-LSH under mild conditions. We also show that the idea of dataset partitioning can improve another hashing based MIPS algorithm. Experiments show that NORM-RANGING LSH probes far fewer items than SIMPLE-LSH at the same recall, thus significantly benefiting MIPS-based applications. |
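A minimal sketch of the norm-ranging idea described above, assuming sign random projections as the base hash family and a SIMPLE-LSH-style asymmetric transform; the partitioning rule, code length and function names are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def simple_lsh_transform(X, max_norm):
    """SIMPLE-LSH-style transform: scale by a maximum 2-norm and append one
    coordinate so every point lies on the unit sphere."""
    Xs = X / max_norm
    extra = np.sqrt(np.maximum(0.0, 1.0 - np.sum(Xs ** 2, axis=1)))
    return np.hstack([Xs, extra[:, None]])

def build_norm_ranging_index(X, num_ranges=4, num_bits=8, seed=0):
    """Norm-ranging idea: partition the dataset by 2-norm and hash each
    sub-dataset with its own local maximum norm (sign random projections)."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(X, axis=1)
    parts = np.array_split(np.argsort(norms), num_ranges)   # equal-size norm ranges
    index = []
    for ids in parts:
        local_max = norms[ids].max()
        P = simple_lsh_transform(X[ids], local_max)          # (m, d+1) unit vectors
        A = rng.standard_normal((P.shape[1], num_bits))      # random hyperplanes
        index.append((ids, A, (P @ A > 0)))                  # ids, planes, bit codes
    return index

def query(index, q):
    """Collect ids whose code matches the query code in each norm range; the
    query is only normalized and zero-padded (the asymmetric side)."""
    qn = q / np.linalg.norm(q)
    candidates = []
    for ids, A, codes in index:
        qcode = (np.append(qn, 0.0) @ A > 0)
        candidates.extend(ids[(codes == qcode).all(axis=1)].tolist())
    return candidates

rng = np.random.default_rng(1)
X = rng.standard_normal((10_000, 32))
index = build_norm_ranging_index(X)
print(len(query(index, X[0])))   # number of candidates colliding with the query
```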
|||||
2018 | Probabilistic Blocking With An Application To The Syrian Conflict | Steorts Rebecca C., Shrivastava Anshumali | Steorts R.C. Shrivastava A. | Entity resolution seeks to merge databases as to remove duplicate entries where unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce \(k\)-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce a subquadratic variant of LSH to the literature, known as Densified One Permutation Hashing (DOPH). Third, we propose a weighted variant of DOPH. We illustrate each method on an application to a subset of the ongoing Syrian conflict, giving a discussion of each method. |
|||||
2018 | Subword Semantic Hashing For Intent Classification On Small Datasets | Shridhar Kumar, Dash Ayushman, Sahu Amit, Pihlgren Gustav Grund, Alonso Pedro, Pondenkandath Vinaychandran, Kovacs Gyorgy, Simistira Foteini, Liwicki Marcus | Arxiv | In this paper, we introduce the use of Semantic Hashing as embedding for the task of Intent Classification and achieve state-of-the-art performance on three frequently used benchmarks. Intent Classification on a small dataset is a challenging task for data-hungry state-of-the-art Deep Learning based systems. Semantic Hashing is an attempt to overcome such a challenge and learn robust text classification. Current word-embedding-based methods are dependent on vocabularies. One of the major drawbacks of such methods is out-of-vocabulary terms, especially when having small training datasets and using a wider vocabulary. This is the case in Intent Classification for chatbots, where typically small datasets are extracted from internet communication. Two problems arise from the use of internet communication. First, such datasets miss a lot of the vocabulary terms needed to use word embeddings efficiently. Second, users frequently make spelling errors. Typically, the models for intent classification are not trained with spelling errors and it is difficult to think about ways in which users will make mistakes. Models depending on a word vocabulary will always face such issues. An ideal classifier should handle spelling errors inherently. With Semantic Hashing, we overcome these challenges and achieve state-of-the-art results on three datasets: AskUbuntu, Chatbot, and Web Application. Our benchmarks are available online: https://github.com/kumar-shridhar/Know-Your-Intent |
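A rough sketch of the character-level subword hashing idea that gives robustness to out-of-vocabulary words and typos; the padding symbol, n-gram size and helper names are illustrative assumptions:

```python
def subword_tokens(text, n=3):
    """Pad each word with '#' and split it into character n-grams, so that
    misspelled or unseen words still share most of their subword features
    with the correct spelling."""
    tokens = []
    for word in text.lower().split():
        padded = f"#{word}#"
        tokens.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return tokens

# A typo still overlaps heavily with the intended word, so a standard
# bag-of-subwords classifier sees nearly the same features for both.
print(subword_tokens("restart the server"))
print(set(subword_tokens("restart")) & set(subword_tokens("restrat")))
```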
|||||
2018 | A Review For Weighted Minhash Algorithms | Wu Wei, Li Bin, Chen Ling, Gao Junbin, Zhang Chengqi | Arxiv | Data similarity (or distance) computation is a fundamental research topic which underpins many high-level applications based on similarity measures in machine learning and data mining. However, in large-scale real-world scenarios, the exact similarity computation has become daunting due to “3V” nature (volume, velocity and variety) of big data. In such cases, the hashing techniques have been verified to efficiently conduct similarity estimation in terms of both theory and practice. Currently, MinHash is a popular technique for efficiently estimating the Jaccard similarity of binary sets and furthermore, weighted MinHash is generalized to estimate the generalized Jaccard similarity of weighted sets. This review focuses on categorizing and discussing the existing works of weighted MinHash algorithms. In this review, we mainly categorize the Weighted MinHash algorithms into quantization-based approaches, “active index”-based ones and others, and show the evolution and inherent connection of the weighted MinHash algorithms, from the integer weighted MinHash algorithms to real-valued weighted MinHash ones (particularly the Consistent Weighted Sampling scheme). Also, we have developed a python toolbox for the algorithms, and released it in our github. Based on the toolbox, we experimentally conduct a comprehensive comparative study of the standard MinHash algorithm and the weighted MinHash ones. |
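For reference, a minimal sketch of the standard (unweighted) MinHash estimator that the review uses as its baseline; the prime modulus, the hash-function family and the use of Python's built-in `hash` are illustrative choices:

```python
import random

def minhash_signature(items, num_hashes=128, seed=0):
    """For each random hash function h(x) = (a*x + b) mod p, keep the minimum
    value over the set; two minima agree with probability J(A, B)."""
    rng = random.Random(seed)
    p = (1 << 61) - 1                                   # large prime modulus
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions estimates the Jaccard index."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

A = {"ann", "lsh", "minhash", "jaccard", "hashing"}
B = {"ann", "lsh", "minhash", "sketch"}
print(estimate_jaccard(minhash_signature(A), minhash_signature(B)))  # about 3/6 = 0.5
```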
|||||
2018 | From Hashing To Cnns Training Binaryweight Networks Via Hashing | Hu Qinghao, Wang Peisong, Cheng Jian | Arxiv | Deep convolutional neural networks (CNNs) have shown appealing performance on various computer vision tasks in recent years. This motivates people to deploy CNNs to real-world applications. However, most state-of-the-art CNNs require large memory and computational resources, which hinders the deployment on mobile devices. Recent studies show that low-bit weight representation can reduce much storage and memory demand, and also can achieve efficient network inference. To achieve this goal, we propose a novel approach named BWNH to train Binary Weight Networks via Hashing. In this paper, we first reveal the strong connection between inner-product preserving hashing and binary weight networks, and show that training binary weight networks can be intrinsically regarded as a hashing problem. Based on this perspective, we propose an alternating optimization method to learn the hash codes instead of directly learning binary weights. Extensive experiments on CIFAR10, CIFAR100 and ImageNet demonstrate that our proposed BWNH outperforms the current state-of-the-art by a large margin. |
|||||
2018 | Deep LDA Hashing | Hu Di, Nie Feiping, Li Xuelong | Arxiv | The conventional supervised hashing methods based on classification do not entirely meet the requirements of the hashing technique, but Linear Discriminant Analysis (LDA) does. In this paper, we propose to perform a revised LDA objective over deep networks to learn efficient hashing codes in a truly end-to-end fashion. However, the complicated eigenvalue decomposition within each mini-batch in every epoch has to be faced when simply optimizing the deep network w.r.t. the LDA objective. In this work, the revised LDA objective is transformed into a simple least square problem, which naturally overcomes the intractable problems and can be easily solved by the off-the-shelf optimizer. Such a deep extension can also overcome the weakness of LDA Hashing in the limited linear projection and feature learning. Extensive experiments are conducted on three benchmark datasets. The proposed Deep LDA Hashing shows a nearly 70-point improvement over the conventional one on the CIFAR-10 dataset. It also beats several state-of-the-art methods on various metrics. |
|||||
2018 | Data-parallel Hashing Techniques For GPU Architectures | Lessley Brenton | Arxiv | Hash tables are one of the most fundamental data structures for effectively storing and accessing sparse data, with widespread usage in domains ranging from computer graphics to machine learning. This study surveys the state-of-the-art research on data-parallel hashing techniques for emerging massively-parallel, many-core GPU architectures. Key factors affecting the performance of different hashing schemes are discovered and used to suggest best practices and pinpoint areas for further research. |
|||||
2018 | Simultaneous Compression And Quantization A Joint Approach For Efficient Unsupervised Hashing | Hoang Tuan, Do Thanh-toan, Le Huu, Le-tan Dang-khoa, Cheung Ngai-man | Arxiv | For unsupervised data-dependent hashing, the two most important requirements are to preserve similarity in the low-dimensional feature space and to minimize the binary quantization loss. A well-established hashing approach is Iterative Quantization (ITQ), which addresses these two requirements in separate steps. In this paper, we revisit the ITQ approach and propose novel formulations and algorithms to the problem. Specifically, we propose a novel approach, named Simultaneous Compression and Quantization (SCQ), to jointly learn to compress (reduce dimensionality) and binarize input data in a single formulation under strict orthogonal constraint. With this approach, we introduce a loss function and its relaxed version, termed Orthonormal Encoder (OnE) and Orthogonal Encoder (OgE) respectively, which involve challenging binary and orthogonal constraints. We propose to attack the optimization using novel algorithms based on recent advances in cyclic coordinate descent approach. Comprehensive experiments on unsupervised image retrieval demonstrate that our proposed methods consistently outperform other state-of-the-art hashing methods. Notably, our proposed methods outperform recent deep neural networks and GAN based hashing in accuracy, while being very computationally-efficient. |
|||||
2018 | Deep Attention-guided Hashing | Yang Zhan, Raymond Osolo Ian, Sun Wuqing, Long Jun | Arxiv | With the rapid growth of multimedia data (e.g., image, audio and video) on the web, learning-based hashing techniques such as Deep Supervised Hashing (DSH) have proven to be very efficient for large-scale multimedia search. Much of the recent success of learning-based hashing is due to deep-learning-based hashing methods. However, there are some limitations to previous learning-based hashing methods (e.g., learned hash codes that contain repetitive and highly correlated information). In this paper, we propose a novel learning-based hashing method, named Deep Attention-guided Hashing (DAgH). DAgH is implemented using two stream frameworks. The core idea is to use guided hash codes which are generated by the hashing network of the first stream framework (called first hashing network) to guide the training of the hashing network of the second stream framework (called second hashing network). Specifically, in the first network, it leverages an attention network and hashing network to generate the attention-guided hash codes from the original images. The loss function we propose contains two components: the semantic loss and the attention loss. The attention loss penalizes the attention network so that it focuses on the salient regions of pairs of images; in the second network, these attention-guided hash codes are used to guide the training of the second hashing network (i.e., these codes are treated as supervised labels to train the second network). By doing this, DAgH can make full use of the most critical information contained in images to guide the second hashing network in order to learn efficient hash codes in a true end-to-end fashion. Results from our experiments demonstrate that DAgH can generate high quality hash codes and it outperforms current state-of-the-art methods on three benchmark datasets, CIFAR-10, NUS-WIDE, and ImageNet. |
|||||
2018 | Fast Binary Embeddings And Quantized Compressed Sensing With Structured Matrices | Huynh Thang, Saab Rayan | Arxiv | This paper deals with two related problems, namely distance-preserving binary embeddings and quantization for compressed sensing. First, we propose fast methods to replace points from a subset \(\mathcal{X} \subset \mathbb{R}^n\), associated with the Euclidean metric, with points in the cube \(\{\pm 1\}^m\) and we associate the cube with a pseudo-metric that approximates Euclidean distance among points in \(\mathcal{X}\). Our methods rely on quantizing fast Johnson-Lindenstrauss embeddings based on bounded orthonormal systems and partial circulant ensembles, both of which admit fast transforms. Our quantization methods utilize noise-shaping, and include Sigma-Delta schemes and distributed noise-shaping schemes. The resulting approximation errors decay polynomially and exponentially fast in \(m\), depending on the embedding method. This dramatically outperforms the current decay rates associated with binary embeddings and Hamming distances. Additionally, it is the first such binary embedding result that applies to fast Johnson-Lindenstrauss maps while preserving \(\ell_2\) norms. Second, we again consider noise-shaping schemes, albeit this time to quantize compressed sensing measurements arising from bounded orthonormal ensembles and partial circulant matrices. We show that these methods yield a reconstruction error that again decays with the number of measurements (and bits), when using convex optimization for reconstruction. Specifically, for Sigma-Delta schemes, the error decays polynomially in the number of measurements, and it decays exponentially for distributed noise-shaping schemes based on beta encoding. These results are near optimal and the first of their kind dealing with bounded orthonormal systems. |
|||||
2017 | Hash Embeddings For Efficient Word Representations | Dan Tito Svenstrup, Jonas Hansen, Ole Winther | Neural Information Processing Systems | We present hash embeddings, an efficient method for representing words in a continuous vector form. A hash embedding may be seen as an interpolation between a standard word embedding and a word embedding created using a random hash function (the hashing trick). In hash embeddings each token is represented by \(k\) \(d\)-dimensional embedding vectors and one \(k\)-dimensional weight vector. The final \(d\)-dimensional representation of the token is the product of the two. Rather than fitting the embedding vectors for each token, these are selected by the hashing trick from a shared pool of \(B\) embedding vectors. Our experiments show that hash embeddings can easily deal with huge vocabularies consisting of millions of tokens. When using a hash embedding there is no need to create a dictionary before training nor to perform any kind of vocabulary pruning after training. We show that models trained using hash embeddings exhibit at least the same level of performance as models trained using regular embeddings across a wide range of tasks. Furthermore, the number of parameters needed by such an embedding is only a fraction of what is required by a regular embedding. Since standard embeddings and embeddings constructed using the hashing trick are actually just special cases of a hash embedding, hash embeddings can be considered an extension and improvement over the existing regular embedding types. |
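A toy sketch of the construction described above: k component vectors are picked from a shared pool by hash functions and combined with a token-specific k-dimensional importance weight vector. Sizes and the use of Python's built-in `hash` are illustrative; in the paper the pool and weights are trained end to end rather than left random:

```python
import numpy as np

class HashEmbedding:
    """Minimal hash-embedding sketch: final embedding = weights @ component vectors."""

    def __init__(self, num_buckets=10_000, pool_size=1_000, k=2, dim=20, seed=0):
        rng = np.random.default_rng(seed)
        self.pool = 0.1 * rng.standard_normal((pool_size, dim))     # shared B x d pool
        self.weights = 0.1 * rng.standard_normal((num_buckets, k))  # per-token k weights
        self.k, self.num_buckets, self.pool_size = k, num_buckets, pool_size

    def embed(self, token):
        tid = hash(token) % self.num_buckets           # hashing trick for the token id
        rows = [hash((i, token)) % self.pool_size      # k hash functions pick k pool rows
                for i in range(self.k)]
        return self.weights[tid] @ self.pool[rows]     # (k,) @ (k, d) -> (d,)

emb = HashEmbedding()
print(emb.embed("hashing").shape)   # (20,)
```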
|||||
2017 | Optimal Densification For Fast And Accurate Minwise Hashing | Shrivastava Anshumali | Arxiv | Minwise hashing is a fundamental and one of the most successful hashing algorithms in the literature. Recent advances based on the idea of densification~\cite{Proc:OneHashLSH_ICML14,Proc:Shrivastava_UAI14} have shown that it is possible to compute \(k\) minwise hashes, of a vector with \(d\) nonzeros, in mere \((d + k)\) computations, a significant improvement over the classical \(O(dk)\). These advances have led to an algorithmic improvement in the query complexity of traditional indexing algorithms based on minwise hashing. Unfortunately, the variance of the current densification techniques is unnecessarily high, which leads to significantly poorer accuracy compared to vanilla minwise hashing, especially when the data is sparse. In this paper, we provide a novel densification scheme which relies on carefully tailored 2-universal hashes. We show that the proposed scheme is variance-optimal, and without losing the runtime efficiency, it is significantly more accurate than existing densification techniques. As a result, we obtain a significantly more efficient hashing scheme which has the same variance and collision probability as minwise hashing. Experimental evaluations on real sparse and high-dimensional datasets validate our claims. We believe that given the significant advantages, our method will replace minwise hashing implementations in practice. |
|||||
2017 | Hashed Binary Search Sampling For Convolutional Network Training With Large Overhead Image Patches | Lunga Dalton, Yang Lexie, Bhaduri Budhendra | Arxiv | Very large overhead imagery associated with ground truth maps has the potential to generate billions of training image patches for machine learning algorithms. However, random sampling selection criteria often lead to redundant and noisy image patches for model training. With minimal research effort devoted to this challenge, the current state of affairs represents a missed opportunity to develop supervised learning algorithms that generalize over wide geographical scenes. In addition, many of the computational cycles for large scale machine learning are poorly spent crunching through noisy and redundant image patches. We demonstrate a potential framework to address these challenges, evaluating it on a human settlement detection task. A novel binary search tree sampling scheme is fused with a kernel based hashing procedure that maps image patches into hash-buckets using binary codes generated from image content. The framework exploits inherent redundancy within billions of image patches to promote mostly high variance preserving samples for accelerating algorithmic training and increasing model generalization. |
|||||
2017 | A Multi-layer Network Based On Sparse Ternary Codes For Universal Vector Compression | Ferdowsi Sohrab, Voloshynovskiy Slava, Kostadinov Dimche | Arxiv | We present the multi-layer extension of the Sparse Ternary Codes (STC) for fast similarity search where we focus on the reconstruction of the database vectors from the ternary codes. To consider the trade-offs between the compactness of the STC and the quality of the reconstructed vectors, we study the rate-distortion behavior of these codes under different setups. We show that a single-layer code cannot achieve satisfactory results at high rates. Therefore, we extend the concept of STC to multiple layers and design the ML-STC, a codebook-free system that successively refines the reconstruction of the residuals of previous layers. While the ML-STC keeps the sparse ternary structure of the single-layer STC and hence is suitable for fast similarity search in large-scale databases, we show its superior rate-distortion performance on both model-based synthetic data and public large-scale databases, as compared to several binary hashing methods. |
|||||
2017 | Sparse Ternary Codes For Similarity Search Have Higher Coding Gain Than Dense Binary Codes | Ferdowsi Sohrab, Voloshynovskiy Slava, Kostadinov Dimche, Holotyak Taras | Arxiv | This paper addresses the problem of Approximate Nearest Neighbor (ANN) search in pattern recognition where feature vectors in a database are encoded as compact codes in order to speed-up the similarity search in large-scale databases. Considering the ANN problem from an information-theoretic perspective, we interpret it as an encoding, which maps the original feature vectors to a less entropic sparse representation while requiring them to be as informative as possible. We then define the coding gain for ANN search using information-theoretic measures. We next show that the classical approach to this problem, which consists of binarization of the projected vectors is sub-optimal. Instead, a properly designed ternary encoding achieves higher coding gains and lower complexity. |
|||||
2017 | Image2song Song Retrieval Via Bridging Image Content And Lyric Words | Li Xuelong, Hu Di, Lu Xiaoqiang | Arxiv | Images are usually taken to express certain emotions or purposes, such as love or celebrating Christmas. A better way to amplify the expression is to combine the image with a relevant song, which has drawn much attention on social networks recently. Hence, automatic selection of songs is desirable. In this paper, we propose to retrieve semantically relevant songs just by an image query, which is named the image2song problem. Motivated by the requirement of establishing correlation in semantics/content, we build a semantic-based song retrieval framework, which learns the correlation between image content and lyric words. This model uses a convolutional neural network to generate rich tags from image regions, a recurrent neural network to model lyrics, and then establishes correlation via a multi-layer perceptron. To reduce the content gap between image and lyric, we propose to make the lyric modeling focus on the main image content via a tag attention. We collect a dataset from the social-sharing multimodal data to study the proposed problem, which consists of (image, music clip, lyric) triplets. We demonstrate that our proposed model shows noticeable results in the image2song retrieval task and provides suitable songs. Besides, the song2image task is also performed. |
|||||
2017 | An Efficient Deep Learning Hashing Neural Network For Mobile Visual Search | Qi Heng, Liu Wu, Liu Liang | Arxiv | Mobile visual search applications are emerging that enable users to sense their surroundings with smart phones. However, because of the particular challenges of mobile visual search, achieving a high recognition bitrate has become a consistent target of previous related works. In this paper, we propose a few-parameter, low-latency, and high-accuracy deep hashing approach for constructing binary hash codes for mobile visual search. First, we exploit the architecture of the MobileNet model, which significantly decreases the latency of deep feature extraction by reducing the number of model parameters while maintaining accuracy. Second, we add a hash-like layer into MobileNet to train the model on labeled mobile visual data. Evaluations show that the proposed system can exceed state-of-the-art accuracy performance in terms of MAP. More importantly, the memory consumption is much less than that of other deep learning models. The proposed method requires only \(13\) MB of memory for the neural network and achieves a MAP of \(97.80\%\) on the mobile location recognition dataset used for testing. |
|||||
2017 | Part-based Deep Hashing For Large-scale Person Re-identification | Zhu Fuqing, Kong Xiangwei, Zheng Liang, Fu Haiyan, Tian Qi | Arxiv | Large-scale is a trend in person re-identification (re-id). It is important that real-time search be performed in a large gallery. While previous methods mostly focus on discriminative learning, this paper makes the attempt in integrating deep learning and hashing into one framework to evaluate the efficiency and accuracy for large-scale person re-id. We integrate spatial information for discriminative visual representation by partitioning the pedestrian image into horizontal parts. Specifically, Part-based Deep Hashing (PDH) is proposed, in which batches of triplet samples are employed as the input of the deep hashing architecture. Each triplet sample contains two pedestrian images (or parts) with the same identity and one pedestrian image (or part) of the different identity. A triplet loss function is employed with a constraint that the Hamming distance of pedestrian images (or parts) with the same identity is smaller than ones with the different identity. In the experiment, we show that the proposed Part-based Deep Hashing method yields very competitive re-id accuracy on the large-scale Market-1501 and Market-1501+500K datasets. |
|||||
2017 | Stochastic Graphlet Embedding | Dutta Anjan, Sahbi Hichem | IEEE TNNLS | Graph-based methods are known to be successful in many machine learning and pattern classification tasks. These methods consider semi-structured data as graphs where nodes correspond to primitives (parts, interest points, segments, etc.) and edges characterize the relationships between these primitives. However, these non-vectorial graph data cannot be straightforwardly plugged into off-the-shelf machine learning algorithms without a preliminary step of – explicit/implicit – graph vectorization and embedding. This embedding process should be resilient to intra-class graph variations while being highly discriminant. In this paper, we propose a novel high-order stochastic graphlet embedding (SGE) that maps graphs into vector spaces. Our main contribution includes a new stochastic search procedure that efficiently parses a given graph and extracts/samples arbitrarily high-order graphlets. We consider these graphlets, with increasing orders, to model local primitives as well as their increasingly complex interactions. In order to build our graph representation, we measure the distribution of these graphlets in a given graph, using particular hash functions that efficiently assign sampled graphlets into isomorphic sets with a very low probability of collision. When combined with maximum margin classifiers, these graphlet-based representations have positive impact on the performance of pattern comparison and recognition as corroborated through extensive experiments using standard benchmark databases. |
|||||
2017 | End-to-end Binary Representation Learning Via Direct Binary Embedding | Liu Liu, Rahimpour Alireza, Taalimi Ali, Qi Hairong | Arxiv | Learning binary representation is essential to large-scale computer vision tasks. Most existing algorithms require a separate quantization constraint to learn effective hashing functions. In this work, we present Direct Binary Embedding (DBE), a simple yet very effective algorithm to learn binary representation in an end-to-end fashion. By appending an ingeniously designed DBE layer to the deep convolutional neural network (DCNN), DBE learns binary code directly from the continuous DBE layer activation without quantization error. By employing the deep residual network (ResNet) as DCNN component, DBE captures rich semantics from images. Furthermore, in the effort of handling multilabel images, we design a joint cross entropy loss that includes both softmax cross entropy and weighted binary cross entropy in consideration of the correlation and independence of labels, respectively. Extensive experiments demonstrate the significant superiority of DBE over state-of-the-art methods on tasks of natural object recognition, image retrieval and image annotation. |
|||||
2017 | Locality-sensitive Hashing Of Curves | Driemel Anne, Silvestri Francesco | Arxiv | We study data structures for storing a set of polygonal curves in \(\mathbb{R}^d\) such that, given a query curve, we can efficiently retrieve similar curves from the set, where similarity is measured using the discrete Fréchet distance or the dynamic time warping distance. To this end we devise the first locality-sensitive hashing schemes for these distance measures. A major challenge is posed by the fact that these distance measures internally optimize the alignment between the curves. We give solutions for different types of alignments including constrained and unconstrained versions. For unconstrained alignments, we improve over a result by Indyk from 2002 for short curves. Let \(n\) be the number of input curves and let \(m\) be the maximum complexity of a curve in the input. In the particular case where \(m \leq \frac{\alpha}{4d} \log n\), for some fixed \(\alpha>0\), our solutions imply an approximate near-neighbor data structure for the discrete Fréchet distance that uses space in \(O(n^{1+\alpha}\log n)\) and achieves query time in \(O(n^{\alpha}\log^2 n)\) and constant approximation factor. Furthermore, our solutions provide a trade-off between approximation quality and computational performance: for any parameter \(k \in [m]\), we can give a data structure that uses space in \(O(2^{2k}m^{k-1} n \log n + nm)\), answers queries in \(O(2^{2k} m^{k}\log n)\) time and achieves approximation factor in \(O(m/k)\). |
|||||
2017 | Linear Hashing Is Awesome | Knudsen Mathias Bæk Tejs | Arxiv | We consider the hash function \(h(x) = ((ax+b) \bmod p) \bmod n\) where \(a,b\) are chosen uniformly at random from \(\{0,1,\ldots,p-1\}\). We prove that when we use \(h(x)\) in hashing with chaining to insert \(n\) elements into a table of size \(n\) the expected length of the longest chain is \(\tilde{O}\left(n^{1/3}\right)\). The proof also generalises to give the same bound when we use the multiply-shift hash function by Dietzfelbinger et al. [Journal of Algorithms 1997]. |
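A tiny sketch of the setting the paper analyzes: hashing with chaining under \(h(x) = ((ax+b) \bmod p) \bmod n\) with \(a, b\) drawn uniformly at random; the prime, table size and key set are illustrative:

```python
import random

p = 2_147_483_647                         # a prime larger than every key below
n = 1_000                                 # number of chains (table size)
a, b = random.randrange(p), random.randrange(p)

table = [[] for _ in range(n)]
for x in random.sample(range(10_000_000), n):     # insert n distinct keys
    table[((a * x + b) % p) % n].append(x)

# The paper proves the expected longest chain is O~(n^(1/3)) for any fixed key set.
print("longest chain:", max(len(chain) for chain in table))
```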
|||||
2017 | End-to-end Network For Twitter Geolocation Prediction And Hashing | Lau Jey Han, Chi Lianhua, Tran Khoi-nguyen, Cohn Trevor | Arxiv | We propose an end-to-end neural network to predict the geolocation of a tweet. The network takes as input a number of raw Twitter metadata such as the tweet message and associated user account information. Our model is language independent, and despite minimal feature engineering, it is interpretable and capable of learning location indicative words and timing patterns. Compared to state-of-the-art systems, our model outperforms them by 2%-6%. Additionally, we propose extensions to the model to compress representation learnt by the network into binary codes. Experiments show that it produces compact codes compared to benchmark hashing algorithms. An implementation of the model is released publicly. |
|||||
2017 | Practical Data-dependent Metric Compression With Provable Guarantees | Piotr Indyk, Ilya Razenshteyn, Tal Wagner | Neural Information Processing Systems | We introduce a new distance-preserving compact representation of multi-dimensional point-sets. Given n points in a d-dimensional space where each coordinate is represented using B bits (i.e., dB bits per point), it produces a representation of size \(O(d \log(dB/\epsilon) + \log n)\) bits per point from which one can approximate the distances up to a factor of \(1 + \epsilon\). Our algorithm almost matches the recent bound of (Indyk et al, 2017) while being much simpler. We compare our algorithm to Product Quantization (PQ) (Jegou et al, 2011), a state-of-the-art heuristic metric compression method. We evaluate both algorithms on several data sets: SIFT, MNIST, New York City taxi time series and a synthetic one-dimensional data set embedded in a high-dimensional space. Our algorithm produces representations that are comparable to or better than those produced by PQ, while having provable guarantees on its performance. |
|||||
2017 | Graph-based Time-space Trade-offs For Approximate Near Neighbors | Laarhoven Thijs | | We take a first step towards a rigorous asymptotic analysis of graph-based approaches for finding (approximate) nearest neighbors in high-dimensional spaces, by analyzing the complexity of (randomized) greedy walks on the approximate near neighbor graph. For random data sets of size \(n = 2^{o(d)}\) on the \(d\)-dimensional Euclidean unit sphere, using near neighbor graphs we can provably solve the approximate nearest neighbor problem with approximation factor \(c > 1\) in query time \(n^{\rho_q + o(1)}\) and space \(n^{1 + \rho_s + o(1)}\), for arbitrary \(\rho_q, \rho_s \geq 0\) satisfying \begin{align} (2c^2 - 1) \rho_q + 2 c^2 (c^2 - 1) \sqrt{\rho_s (1 - \rho_s)} \geq c^4. \end{align} Graph-based near neighbor searching is especially competitive with hash-based methods for small \(c\) and near-linear memory, and in this regime the asymptotic scaling of a greedy graph-based search matches the recent optimal hash-based trade-offs of Andoni-Laarhoven-Razenshteyn-Waingarten [SODA’17]. We further study how the trade-offs scale when the data set is of size \(n = 2^{\Theta(d)}\), and analyze asymptotic complexities when applying these results to lattice sieving. |
|||||
2017 | Grayscale Image Authentication Using Neural Hashing | Kutlu Yakup, Yayık Apdullah | Arxiv | Many different approaches for neural network based hash functions have been proposed. Statistical analysis must confirm their security. This paper proposes a novel neural hashing approach for gray scale image authentication. The suggested system is rapid, robust, useful and secure. The proposed hash function generates hash values using the one-way property of neural networks and non-linear techniques. As a result, security and performance analyses are performed and satisfactory results are achieved. These features are the dominant reasons for preferring it over traditional approaches. |
|||||
2017 | Kernelized Hashcode Representations For Relation Extraction | Garg Sahil, Galstyan Aram, Steeg Greg Ver, Rish Irina, Cecchi Guillermo, Gao Shuyang | Arxiv | Kernel methods have produced state-of-the-art results for a number of NLP tasks such as relation extraction, but suffer from poor scalability due to the high cost of computing kernel similarities between natural language structures. A recently proposed technique, kernelized locality-sensitive hashing (KLSH), can significantly reduce the computational cost, but is only applicable to classifiers operating on kNN graphs. Here we propose to use random subspaces of KLSH codes for efficiently constructing an explicit representation of NLP structures suitable for general classification methods. Further, we propose an approach for optimizing the KLSH model for classification problems by maximizing an approximation of mutual information between the KLSH codes (feature vectors) and the class labels. We evaluate the proposed approach on biomedical relation extraction datasets, and observe significant and robust improvements in accuracy w.r.t. state-of-the-art classifiers, along with drastic (orders-of-magnitude) speedup compared to conventional kernel methods. |
|||||
2017 | Effective Multi-query Expansions Collaborative Deep Networks For Robust Landmark Retrieval | Wang Yang, Lin Xuemin, Wu Lin, Zhang Wenjie | Arxiv | Given a query photo issued by a user (q-user), landmark retrieval is to return a set of photos with their landmarks similar to those of the query, while the existing studies on landmark retrieval focus on exploiting geometries of landmarks for similarity matches between candidate photos and a query photo. We observe that the same landmarks provided by different users over a social media community may convey different geometry information depending on the viewpoints and/or angles, and may subsequently yield very different results. In fact, dealing with landmarks with ill shapes caused by the photography of q-users is often nontrivial and has seldom been studied. In this paper we propose a novel framework, namely multi-query expansions, to retrieve semantically robust landmarks in two steps. Firstly, we identify the top-\(k\) photos regarding the latent topics of a query landmark to construct a multi-query set so as to remedy its possible ill shape. For this purpose, we significantly extend the techniques of Latent Dirichlet Allocation. Then, motivated by the typical collaborative filtering methods, we propose to learn collaborative-deep-network-based semantic, nonlinear and high-level features over the latent factors of landmark photos as the training set, which is formed by matrix factorization over the collaborative user-photo matrix regarding the multi-query set. The learned deep network is further applied to generate the features for all the other photos, meanwhile resulting in a compact multi-query set within such space. Extensive experiments are conducted on real-world social media data with landmark photos together with their user information to show the superior performance over the existing methods. |
|||||
2017 | MILD Multi-index Hashing For Loop Closure Detection | Han Lei, Fang Lu | Arxiv | Loop Closure Detection (LCD) has been proved to be extremely useful in globally consistent visual Simultaneous Localization and Mapping (SLAM) and appearance-based robot relocalization. Methods exploiting binary features in bag of words representation have recently gained a lot of popularity for their efficiency, but suffer from low recall due to the inherent drawback that high dimensional binary feature descriptors lack well-defined centroids. In this paper, we propose a realtime LCD approach called MILD (Multi-Index Hashing for Loop closure Detection), in which image similarity is measured by feature matching directly to achieve high recall without introducing extra computational complexity with the aid of Multi-Index Hashing (MIH). A theoretical analysis of the approximate image similarity measurement using MIH is presented, which reveals the trade-off between efficiency and accuracy from a probabilistic perspective. Extensive comparisons with state-of-the-art LCD methods demonstrate the superiority of MILD in both efficiency and accuracy. |
|||||
2017 | Faster Tuple Lattice Sieving Using Spherical Locality-sensitive Filters | Laarhoven Thijs | Arxiv | To overcome the large memory requirement of classical lattice sieving algorithms for solving hard lattice problems, Bai-Laarhoven-Stehlé [ANTS 2016] studied tuple lattice sieving, where tuples instead of pairs of lattice vectors are combined to form shorter vectors. Herold-Kirshanova [PKC 2017] recently improved upon their results for arbitrary tuple sizes, for example showing that a triple sieve can solve the shortest vector problem (SVP) in dimension \(d\) in time \(2^{0.3717d + o(d)}\), using a technique similar to locality-sensitive hashing for finding nearest neighbors. In this work, we generalize the spherical locality-sensitive filters of Becker-Ducas-Gama-Laarhoven [SODA 2016] to obtain space-time tradeoffs for near neighbor searching on dense data sets, and we apply these techniques to tuple lattice sieving to obtain even better time complexities. For instance, our triple sieve heuristically solves SVP in time \(2^{0.3588d + o(d)}\). For practical sieves based on Micciancio-Voulgaris’ GaussSieve [SODA 2010], this shows that a triple sieve uses less space and less time than the current best near-linear space double sieve. |
|||||
2017 | Deep Hashing With Category Mask For Fast Video Retrieval | Liu Xu, Zhao Lili, Ding Dajun, Dong Yajiao | Arxiv | This paper proposes an end-to-end deep hashing framework with category mask for fast video retrieval. We train our network in a supervised way by fully exploiting inter-class diversity and intra-class identity. Classification loss is optimized to maximize inter-class diversity, while an intra-pair loss is introduced to learn representative intra-class identity. We investigate the distribution of binary bits across categories and find out that the effectiveness of binary bits is highly correlated with data categories, and some bits may degrade classification performance of some categories. We then design a hash code generation scheme with category mask to filter out bits with negative contribution. Experimental results demonstrate the proposed method outperforms several state-of-the-art methods under various evaluation metrics on public datasets. |
|||||
2017 | Hashing As Tie-aware Learning To Rank | He Kun, Cakir Fatih, Bargal Sarah Adel, Sclaroff Stan | Arxiv | Hashing, or learning binary embeddings of data, is frequently used in nearest neighbor retrieval. In this paper, we develop learning to rank formulations for hashing, aimed at directly optimizing ranking-based evaluation metrics such as Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG). We first observe that the integer-valued Hamming distance often leads to tied rankings, and propose to use tie-aware versions of AP and NDCG to evaluate hashing for retrieval. Then, to optimize tie-aware ranking metrics, we derive their continuous relaxations, and perform gradient-based optimization with deep neural networks. Our results establish the new state-of-the-art for image retrieval by Hamming ranking in common benchmarks. |
|||||
2017 | Beyond SIFT Using Binary Features For Loop Closure Detection | Han Lei, Zhou Guyue, Xu Lan, Fang Lu | Arxiv | In this paper a binary feature based Loop Closure Detection (LCD) method is proposed, which for the first time achieves higher precision-recall (PR) performance compared with state-of-the-art SIFT feature based approaches. The proposed system originates from our previous work Multi-Index hashing for Loop closure Detection (MILD), which employs Multi-Index Hashing (MIH)~\cite{greene1994multi} for Approximate Nearest Neighbor (ANN) search of binary features. As the accuracy of MILD is limited by repeating textures and inaccurate image similarity measurement, burstiness handling is introduced to solve this problem and achieves considerable accuracy improvement. Additionally, a comprehensive theoretical analysis on MIH used in MILD is conducted to further explore the potentials of hashing methods for ANN search of binary features from probabilistic perspective. This analysis provides more freedom on best parameter choosing in MIH for different application scenarios. Experiments on popular public datasets show that the proposed approach achieved the highest accuracy compared with state-of-the-art while running at 30Hz for databases containing thousands of images. |
|||||
2017 | SUBIC A Supervised Structured Binary Code For Image Search | Jain Himalaya, Zepeda Joaquin, Pérez Patrick, Gribonval Rémi | Arxiv | For large-scale visual search, highly compressed yet meaningful representations of images are essential. Structured vector quantizers based on product quantization and its variants are usually employed to achieve such compression while minimizing the loss of accuracy. Yet, unlike binary hashing schemes, these unsupervised methods have not yet benefited from the supervision, end-to-end learning and novel architectures ushered in by the deep learning revolution. We hence propose herein a novel method to make deep convolutional neural networks produce supervised, compact, structured binary codes for visual search. Our method makes use of a novel block-softmax non-linearity and of batch-based entropy losses that together induce structure in the learned encodings. We show that our method outperforms state-of-the-art compact representations based on deep hashing or structured quantization in single and cross-domain category retrieval, instance retrieval and classification. We make our code and models publicly available online. |
|||||
2017 | End-to-end Supervised Product Quantization For Image Search And Retrieval | Klein Benjamin, Wolf Lior | Arxiv | Product Quantization, a dictionary based hashing method, is one of the leading unsupervised hashing techniques. While it ignores the labels, it harnesses the features to construct look up tables that can approximate the feature space. In recent years, several works have achieved state of the art results on hashing benchmarks by learning binary representations in a supervised manner. This work presents Deep Product Quantization (DPQ), a technique that leads to more accurate retrieval and classification than the latest state of the art methods, while having similar computational complexity and memory footprint as the Product Quantization method. To our knowledge, this is the first work to introduce a dictionary-based representation that is inspired by Product Quantization and which is learned end-to-end, and thus benefits from the supervised signal. DPQ explicitly learns soft and hard representations to enable an efficient and accurate asymmetric search, by using a straight-through estimator. Our method obtains state of the art results on an extensive array of retrieval and classification experiments. |
|||||
2017 | Simple Strategies For Recovering Inner Products From Coarsely Quantized Random Projections | Ping Li, Martin Slawski | Neural Information Processing Systems | Random projections have been increasingly adopted for a diverse set of tasks in machine learning involving dimensionality reduction. One specific line of research on this topic has investigated the use of quantization subsequent to projection with the aim of additional data compression. Motivated by applications in nearest neighbor search and linear learning, we revisit the problem of recovering inner products (respectively cosine similarities) in such a setting. We show that even under coarse scalar quantization with 3 to 5 bits per projection, the loss in accuracy tends to range from "negligible" to "moderate". One implication is that in most scenarios of practical interest, there is no need for a sophisticated recovery approach like maximum likelihood estimation as considered in previous work on the subject. What we propose herein also yields considerable improvements in terms of accuracy over the Hamming distance-based approach in Li et al. (ICML 2014), which is comparable in terms of simplicity. |
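A small sketch of the plug-in strategy suggested above: project with a Gaussian matrix, quantize each projection coarsely, and estimate the inner product directly from the quantized sketches; the quantizer range, bit depth and number of projections are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def quantize(v, bits=3, clip=3.0):
    """Uniform scalar quantizer with 2**bits levels over [-clip, clip]."""
    levels = 2 ** bits
    step = 2 * clip / levels
    idx = np.clip(np.floor((v + clip) / step), 0, levels - 1)
    return -clip + (idx + 0.5) * step

def estimate_inner_product(x, y, m=2000, bits=3, seed=0):
    """Plug-in estimate: treat the quantized projections as if they were exact
    and return their scaled inner product."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, len(x)))
    return float(quantize(A @ x, bits) @ quantize(A @ y, bits)) / m

x = np.array([1.0, 0.5, -0.2, 0.3])
y = np.array([0.8, -0.1, 0.4, 0.2])
print(np.dot(x, y), estimate_inner_product(x, y))   # exact vs. coarse-quantized estimate
```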
|||||
2017 | Superminhash - A New Minwise Hashing Algorithm For Jaccard Similarity Estimation | Ertl Otmar | Arxiv | This paper presents a new algorithm for calculating hash signatures of sets which can be directly used for Jaccard similarity estimation. The new approach is an improvement over the MinHash algorithm, because it has a better runtime behavior and the resulting signatures allow a more precise estimation of the Jaccard index. |
|||||
2017 | Video Retrieval Based On Deep Convolutional Neural Network | Dong Yj, Li Jg | Arxiv | Recently, with the enormous growth of online videos, fast video retrieval research has received increasing attention. As an extension of image hashing techniques, traditional video hashing methods mainly depend on hand-crafted features and transform the real-valued features into binary hash codes. As videos provide far more diverse and complex visual information than images, extracting features from videos is much more challenging than that from images. Therefore, high-level semantic features to represent videos are needed rather than low-level hand-crafted methods. In this paper, a deep convolutional neural network is proposed to extract high-level semantic features and a binary hash function is then integrated into this framework to achieve an end-to-end optimization. Particularly, our approach also combines triplet loss function which preserves the relative similarity and difference of videos and classification loss function as the optimization objective. Experiments have been performed on two public datasets and the results demonstrate the superiority of our proposed method compared with other state-of-the-art video retrieval methods. |
|||||
2017 | Semi-supervised Multimodal Hashing | Tian Dayong, Gong Maoguo, Zhou Deyun, Shi Jiao, Lei Yu | Arxiv | Retrieving nearest neighbors across correlated data in multiple modalities, such as image-text pairs on Facebook and video-tag pairs on YouTube, has become a challenging task due to the huge amount of data. Multimodal hashing methods that embed data into binary codes can boost the retrieving speed and reduce storage requirement. As unsupervised multimodal hashing methods are usually inferior to supervised ones, while the supervised ones require too much manually labeled data, the proposed method in this paper utilizes a subset of the labels to design a semi-supervised multimodal hashing method. It first computes the transformation matrices for data matrices and label matrix. Then, with these transformation matrices, fuzzy logic is introduced to estimate a label matrix for unlabeled data. Finally, it uses the estimated label matrix to learn hashing functions for data in each modality to generate a unified binary code matrix. Experiments show that the proposed semi-supervised method with 50% of the labels achieves performance in the middle of the compared supervised methods and approaches the performance of the best supervised method trained with 90% of the labels. With only 10% of the labels, the proposed method can still compete with the weakest of the compared supervised methods. |
|||||
2017 | Improved Search In Hamming Space Using Deep Multi-index Hashing | Lai Hanjiang, Pan Yan | Arxiv | Similarity-preserving hashing is a widely-used method for nearest neighbour search in large-scale image retrieval tasks. There has been considerable research on generating efficient image representation via deep-network-based hashing methods. However, the issue of efficient searching in the deep representation space remains largely unsolved. To this end, we propose a simple yet efficient deep-network-based multi-index hashing method for simultaneously learning the powerful image representation and the efficient searching. To achieve these two goals, we introduce the multi-index hashing (MIH) mechanism into the proposed deep architecture, which divides the binary codes into multiple substrings. Because non-uniformly distributed codes result in inefficient searching, we add two balance constraints at the feature level and the instance level, respectively. Extensive evaluations on several benchmark image retrieval datasets show that the learned balanced binary codes bring dramatic speedups and achieve comparable performance over the existing baselines. |
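A compact sketch of the multi-index hashing (MIH) mechanism referenced above: codes are split into substrings, each substring is indexed in its own hash table, and exact substring lookups yield a candidate set that is re-checked by full Hamming distance; the code length, number of substrings and toy data are illustrative:

```python
from collections import defaultdict

def split_code(code, m):
    """Split a bit-string into m contiguous substrings."""
    step = len(code) // m
    return [code[i * step:(i + 1) * step] for i in range(m)]

def build_mih_index(codes, m=4):
    """One hash table per substring: substring value -> list of item ids."""
    tables = [defaultdict(list) for _ in range(m)]
    for item_id, code in enumerate(codes):
        for table, sub in zip(tables, split_code(code, m)):
            table[sub].append(item_id)
    return tables

def mih_query(tables, codes, q, m=4, radius=2):
    """Pigeonhole argument: any code within Hamming distance radius < m of q
    must match q exactly on at least one substring, so exact lookups give a
    complete candidate set that is then verified."""
    candidates = set()
    for table, sub in zip(tables, split_code(q, m)):
        candidates.update(table.get(sub, []))
    dist = lambda x, y: sum(c1 != c2 for c1, c2 in zip(x, y))
    return sorted(i for i in candidates if dist(codes[i], q) <= radius)

codes = ["10110010", "10110011", "01001100", "11110000"]
tables = build_mih_index(codes)
print(mih_query(tables, codes, "10110001"))   # [0, 1, 3], all within distance 2
```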
|||||
2017 | Arrays Of (locality-sensitive) Count Estimators (ACE) High-speed Anomaly Detection Via Cache Lookups | Luo Chen, Shrivastava Anshumali | Arxiv | Anomaly detection is one of the frequent and important subroutines deployed in large-scale data processing systems. Despite being a well-studied topic, existing techniques for unsupervised anomaly detection require storing significant amounts of data, which is prohibitive from a memory and latency perspective. In the big-data world, existing methods fail to address the new set of memory and latency constraints. In this paper, we propose the ACE (Arrays of (locality-sensitive) Count Estimators) algorithm, which can be 60x faster than the ELKI package~\cite{DBLP:conf/ssd/AchtertBKSZ09}, which has the fastest implementation of the unsupervised anomaly detection algorithms. The ACE algorithm requires less than \(4MB\) of memory to dynamically compress the full data information into a set of count arrays. These tiny \(4MB\) arrays of counts are sufficient for unsupervised anomaly detection. At the core of the ACE algorithm, there is a novel statistical estimator which is derived from the sampling view of Locality Sensitive Hashing (LSH). This view is significantly different from, and more efficient than, the widely popular view of LSH for near-neighbor search. We show the superiority of the ACE algorithm over 11 popular baselines on 3 benchmark datasets, including the KDD-Cup99 data which is the largest available benchmark comprising more than half a million entries with ground truth anomaly labels. |
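A rough sketch of the count-estimator idea behind ACE, assuming sign random projections as the LSH family: every data point increments one counter per array, and a query landing in buckets with low counts scores as anomalous; the array sizes and the scoring rule are illustrative, not the authors' exact estimator:

```python
import numpy as np

class ACE:
    """Arrays of LSH-indexed counters; dense regions accumulate large counts."""

    def __init__(self, dim, num_arrays=4, num_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((num_arrays, num_bits, dim))  # LSH hyperplanes
        self.counts = np.zeros((num_arrays, 2 ** num_bits), dtype=np.int64)
        self.powers = 1 << np.arange(num_bits)

    def _buckets(self, x):
        return ((self.proj @ x) > 0) @ self.powers    # one bucket index per array

    def add(self, x):
        for a, b in enumerate(self._buckets(x)):
            self.counts[a, b] += 1

    def score(self, x):
        # Fewer similar points seen -> lower counts -> larger anomaly score.
        return -float(np.mean([self.counts[a, b] for a, b in enumerate(self._buckets(x))]))

rng = np.random.default_rng(2)
ace = ACE(dim=8)
for x in rng.standard_normal((5000, 8)):
    ace.add(x)
print(ace.score(rng.standard_normal(8)))        # inlier: strongly negative
print(ace.score(10 + rng.standard_normal(8)))   # outlier: close to zero
```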
|||||
2017 | Supervised Deep Hashing For Hierarchical Labeled Data | Wang Dan, Huang Heyan, Lu Chi, Feng Bo-si, Nie Liqiang, Wen Guihua, Mao Xian-ling | Arxiv | Recently, hashing methods have been widely used in large-scale image retrieval. However, most existing hashing methods did not consider the hierarchical relation of labels, which means that they ignored the rich information stored in the hierarchy. Moreover, most of previous works treat each bit in a hash code equally, which does not meet the scenario of hierarchical labeled data. In this paper, we propose a novel deep hashing method, called supervised hierarchical deep hashing (SHDH), to perform hash code learning for hierarchical labeled data. Specifically, we define a novel similarity formula for hierarchical labeled data by weighting each layer, and design a deep convolutional neural network to obtain a hash code for each data point. Extensive experiments on several real-world public datasets show that the proposed method outperforms the state-of-the-art baselines in the image retrieval task. |
|||||
2017 | Pqtable Non-exhaustive Fast Search For Product-quantized Codes Using Hash Tables | Matsui Yusuke, Yamasaki Toshihiko, Aizawa Kiyoharu | Arxiv | In this paper, we propose a product quantization table (PQTable); a fast search method for product-quantized codes via hash-tables. An identifier of each database vector is associated with the slot of a hash table by using its PQ-code as a key. For querying, an input vector is PQ-encoded and hashed, and the items associated with that code are then retrieved. The proposed PQTable produces the same results as a linear PQ scan, and is 10^2 to 10^5 times faster. Although state-of-the-art performance can be achieved by previous inverted-indexing-based approaches, such methods require manually-designed parameter setting and significant training; our PQTable is free of these limitations, and therefore offers a practical and effective solution for real-world problems. Specifically, when the vectors are highly compressed, our PQTable achieves one of the fastest search performances on a single CPU to date with significantly efficient memory usage (0.059 ms per query over 10^9 data points with just 5.5 GB memory consumption). Finally, we show that our proposed PQTable can naturally handle the codes of an optimized product quantization (OPQTable). |
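A bare-bones sketch of the PQTable lookup idea: product-quantize each database vector and use the tuple of sub-codebook indices directly as a hash key, so querying becomes encode-then-lookup instead of a linear scan. The codebooks here are merely sampled data points, and the real method adds table division and merging that this sketch omits:

```python
import numpy as np
from collections import defaultdict

def train_codebooks(X, num_subspaces=4, k=16, seed=0):
    """Illustrative codebooks: k 'centroids' per subspace, sampled from the data
    instead of being trained with k-means."""
    rng = np.random.default_rng(seed)
    d_sub = X.shape[1] // num_subspaces
    return [X[rng.choice(len(X), k, replace=False), i * d_sub:(i + 1) * d_sub]
            for i in range(num_subspaces)]

def pq_encode(x, codebooks):
    """PQ-code: index of the nearest centroid in each subspace."""
    code, offset = [], 0
    for cb in codebooks:
        sub = x[offset:offset + cb.shape[1]]
        code.append(int(np.argmin(np.linalg.norm(cb - sub, axis=1))))
        offset += cb.shape[1]
    return tuple(code)

def build_pq_table(X, codebooks):
    """The PQ-code tuple is the hash key, the values are database ids."""
    table = defaultdict(list)
    for i, x in enumerate(X):
        table[pq_encode(x, codebooks)].append(i)
    return table

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 16)).astype(np.float32)
codebooks = train_codebooks(X)
table = build_pq_table(X, codebooks)
q = X[0] + 0.01 * rng.standard_normal(16).astype(np.float32)
print(table.get(pq_encode(q, codebooks), []))   # ids sharing the query's PQ-code
```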
|||||
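As a rough illustration of the PQTable lookup described above: each database vector is hashed by the tuple of its PQ sub-codebook indices, and a query is answered by encoding it the same way and reading the matching slot. The sketch below assumes the sub-codebooks are already trained (random ones stand in for a demo) and omits the paper's handling of empty slots and multi-table merging.

```python
import numpy as np
from collections import defaultdict

def pq_encode(x, codebooks):
    """Encode a vector as the tuple of nearest-centroid ids, one per subspace."""
    subs = np.split(x, len(codebooks))
    return tuple(int(np.argmin(((cb - s) ** 2).sum(axis=1)))
                 for cb, s in zip(codebooks, subs))

def build_pqtable(X, codebooks):
    table = defaultdict(list)
    for i, x in enumerate(X):
        table[pq_encode(x, codebooks)].append(i)   # the PQ code is the hash key
    return table

# Toy demo with random "codebooks" standing in for trained sub-quantizers.
rng = np.random.default_rng(0)
D, M, K = 32, 4, 16
codebooks = [rng.standard_normal((K, D // M)) for _ in range(M)]
X = rng.standard_normal((1000, D))
table = build_pqtable(X, codebooks)
q = X[123] + 0.01 * rng.standard_normal(D)          # a slightly perturbed database point
print(table.get(pq_encode(q, codebooks), []))       # candidates sharing q's PQ code
```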
2017 | Deep Discrete Supervised Hashing | Jiang Qing-yuan, Cui Xue, Li Wu-jun | Arxiv | Hashing has been widely used for large-scale search due to its low storage cost and fast query speed. By using supervised information, supervised hashing can significantly outperform unsupervised hashing. Recently, discrete supervised hashing and deep hashing are two representative advances in supervised hashing. On one hand, hashing is essentially a discrete optimization problem. Hence, utilizing supervised information to directly guide the discrete (binary) coding procedure can avoid sub-optimal solutions and improve accuracy. On the other hand, deep hashing, which integrates deep feature learning and hash-code learning into an end-to-end architecture, can enhance the feedback between feature learning and hash-code learning. The key in discrete supervised hashing is to adopt supervised information to directly guide the discrete coding procedure in hashing. The key in deep hashing is to adopt the supervised information to directly guide the deep feature learning procedure. However, no existing work uses the supervised information to directly guide both the discrete coding procedure and the deep feature learning procedure in the same framework. In this paper, we propose a novel deep hashing method, called deep discrete supervised hashing (DDSH), to address this problem. DDSH is the first deep hashing method which can utilize supervised information to directly guide both the discrete coding procedure and the deep feature learning procedure, and thus enhance the feedback between these two important procedures. Experiments on three real datasets show that DDSH can outperform other state-of-the-art baselines, including both discrete hashing and deep hashing baselines, for image retrieval. |
|||||
2017 | Discrete Latent Factor Model For Cross-modal Hashing | Jiang Qing-yuan, Li Wu-jun | Arxiv | Due to its storage and retrieval efficiency, cross-modal hashing~(CMH) has been widely used for cross-modal similarity search in multimedia applications. According to the training strategy, existing CMH methods can be mainly divided into two categories: relaxation-based continuous methods and discrete methods. In general, the training of relaxation-based continuous methods is faster than discrete methods, but the accuracy of relaxation-based continuous methods is not satisfactory. On the contrary, the accuracy of discrete methods is typically better than relaxation-based continuous methods, but the training of discrete methods is time-consuming. In this paper, we propose a novel CMH method, called discrete latent factor model based cross-modal hashing~(DLFH), for cross modal similarity search. DLFH is a discrete method which can directly learn the binary hash codes for CMH. At the same time, the training of DLFH is efficient. Experiments on real datasets show that DLFH can achieve significantly better accuracy than existing methods, and the training time of DLFH is comparable to that of relaxation-based continuous methods which are much faster than existing discrete methods. |
|||||
2017 | Compact Hash Code Learning With Binary Deep Neural Network | Do Thanh-toan, Hoang Tuan, Tan Dang-khoa Le, Doan Anh-dzung, Cheung Ngai-man | Arxiv | Learning compact binary codes for image retrieval problem using deep neural networks has recently attracted increasing attention. However, training deep hashing networks is challenging due to the binary constraints on the hash codes. In this paper, we propose deep network models and learning algorithms for learning binary hash codes given image representations under both unsupervised and supervised manners. The novelty of our network design is that we constrain one hidden layer to directly output the binary codes. This design has overcome a challenging problem in some previous works: optimizing non-smooth objective functions because of binarization. In addition, we propose to incorporate independence and balance properties in the direct and strict forms into the learning schemes. We also include a similarity preserving property in our objective functions. The resulting optimizations involving these binary, independence, and balance constraints are difficult to solve. To tackle this difficulty, we propose to learn the networks with alternating optimization and careful relaxation. Furthermore, by leveraging the powerful capacity of convolutional neural networks, we propose an end-to-end architecture that jointly learns to extract visual features and produce binary hash codes. Experimental results for the benchmark datasets show that the proposed methods compare favorably or outperform the state of the art. |
|||||
2017 | Simultaneous Feature Aggregating And Hashing For Large-scale Image Search | Do Thanh-toan, Tan Dang-khoa Le, Pham Trung T., Cheung Ngai-man | Arxiv | In most state-of-the-art hashing-based visual search systems, local image descriptors of an image are first aggregated as a single feature vector. This feature vector is then subjected to a hashing function that produces a binary hash code. In previous work, the aggregating and the hashing processes are designed independently. In this paper, we propose a novel framework where feature aggregating and hashing are designed simultaneously and optimized jointly. Specifically, our joint optimization produces aggregated representations that can be better reconstructed by some binary codes. This leads to more discriminative binary hash codes and improved retrieval accuracy. In addition, we also propose a fast version of the recently-proposed Binary Autoencoder to be used in our proposed framework. We perform extensive retrieval experiments on several benchmark datasets with both SIFT and convolutional features. Our results suggest that the proposed framework achieves significant improvements over the state of the art. |
|||||
2017 | Hypercube LSH For Approximate Near Neighbors | Laarhoven Thijs | | A celebrated technique for finding near neighbors for the angular distance involves using a set of \textit{random} hyperplanes to partition the space into hash regions [Charikar, STOC 2002]. Experiments later showed that using a set of \textit{orthogonal} hyperplanes, thereby partitioning the space into the Voronoi regions induced by a hypercube, leads to even better results [Terasawa and Tanaka, WADS 2007]. However, no theoretical explanation for this improvement was ever given, and it remained unclear how the resulting hypercube hash method scales in high dimensions. In this work, we provide explicit asymptotics for the collision probabilities when using hypercubes to partition the space. For instance, two near-orthogonal vectors are expected to collide with probability \((\frac{1}{\pi})^{d + o(d)}\) in dimension \(d\), compared to \((\frac{1}{2})^d\) when using random hyperplanes. Vectors at angle \(\frac{\pi}{3}\) collide with probability \((\frac{\sqrt{3}}{\pi})^{d + o(d)}\), compared to \((\frac{2}{3})^d\) for random hyperplanes, and near-parallel vectors collide with similar asymptotic probabilities in both cases. For \(c\)-approximate nearest neighbor searching, this translates to a decrease in the exponent \(\rho\) of locality-sensitive hashing (LSH) methods of a factor up to \(\log_2(\pi) \approx 1.652\) compared to hyperplane LSH. For \(c = 2\), we obtain \(\rho \approx 0.302 + o(1)\) for hypercube LSH, improving upon the \(\rho \approx 0.377\) for hyperplane LSH. We further describe how to use hypercube LSH in practice, and we consider an example application in the area of lattice algorithms. |
|||||
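In practice, the hypercube hash from the entry above amounts to applying a random rotation and reading off the sign pattern of the rotated coordinates, i.e. the Voronoi cell of the hypercube the vector falls into. A small numpy sketch of that hash follows; drawing the rotation from a QR decomposition is my own shortcut, not the paper's construction.

```python
import numpy as np

def random_rotation(d, seed=0):
    """An orthogonal matrix obtained by QR-decomposing a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

def hypercube_hash(x, Q):
    # The sign pattern of the rotated vector identifies the hypercube Voronoi region.
    return tuple((Q @ x >= 0).astype(int))

d = 16
Q = random_rotation(d)
u = np.random.default_rng(1).standard_normal(d)
print(hypercube_hash(u, Q))   # a d-bit hash region label
```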
2017 | Subspace Approximation For Approximate Nearest Neighbor Search In NLP | Wang Jing | Arxiv | Most natural language processing tasks, such as word analogy, document similarity, and machine translation, can be formulated as an approximate nearest neighbor search problem. Take the question-answering task as an example: given a question as the query, the goal is to search for its nearest neighbor in the training dataset as the answer. However, existing methods for the approximate nearest neighbor search problem may not perform well owing to the following practical challenges: 1) there is noise in the data; 2) the large-scale dataset yields a huge retrieval space and high search time complexity. In order to solve these problems, we propose a novel approximate nearest neighbor search framework which i) projects the data to a subspace based on spectral analysis, which eliminates the influence of noise; ii) partitions the training dataset into different groups in order to reduce the search space. Specifically, the retrieval space is reduced from \(O(n)\) to \(O(log n)\) (where \(n\) is the number of data points in the training dataset). We prove that the retrieved nearest neighbor in the projected subspace is the same as the one in the original feature space. We demonstrate the outstanding performance of our framework on real-world natural language processing tasks. |
|||||
2017 | Sketching Word Vectors Through Hashing | Qasemizadeh Behrang, Kallmeyer Laura | Arxiv | We propose a new fast word embedding technique using hash functions. The method is a derandomization of a new type of random projections: By disregarding the classic constraint used in designing random projections (i.e., preserving pairwise distances in a particular normed space), our solution exploits extremely sparse non-negative random projections. Our experiments show that the proposed method can achieve competitive results, comparable to neural embedding learning techniques, however, with only a fraction of the computational complexity of these methods. While the proposed derandomization enhances the computational and space complexity of our method, the possibility of applying weighting methods such as positive pointwise mutual information (PPMI) to our models after their construction (and at a reduced dimensionality) imparts a high discriminatory power to the resulting embeddings. Obviously, this method comes with other known benefits of random projection-based techniques such as ease of update. |
|||||
2017 | Evaluation Of Hashing Methods Performance On Binary Feature Descriptors | Komorowski Jacek, Trzcinski Tomasz | Arxiv | In this paper we evaluate performance of data-dependent hashing methods on binary data. The goal is to find a hashing method that can effectively produce lower dimensional binary representation of 512-bit FREAK descriptors. A representative sample of recent unsupervised, semi-supervised and supervised hashing methods was experimentally evaluated on large datasets of labelled binary FREAK feature descriptors. |
|||||
2017 | Ranking Based Locality Sensitive Hashing Enabled Cancelable Biometrics Index-of-max Hashing | Jin Zhe, Lai Yen-lung, Hwang Jung-yeon, Kim Soohyung, Teoh Andrew Beng Jin | Arxiv | In this paper, we propose a ranking based locality sensitive hashing inspired two-factor cancelable biometrics, dubbed “Index-of-Max” (IoM) hashing, for biometric template protection. With externally generated random parameters, IoM hashing transforms a real-valued biometric feature vector into a discrete index (max ranked) hashed code. We demonstrate two realizations of the IoM hashing notion, namely Gaussian Random Projection based and Uniformly Random Permutation based hashing schemes. The discrete index representation of IoM hashed codes enjoys several merits. Firstly, IoM hashing provides strong concealment of the biometric information, which gives solid ground for the non-invertibility guarantee. Secondly, IoM hashing is insensitive to the feature magnitude, and hence is more robust against biometric feature variation. Thirdly, the magnitude-independence trait of IoM hashing makes the hash codes scale-invariant, which is critical for matching and feature alignment. The experimental results demonstrate favorable accuracy performance on the benchmark FVC2002 and FVC2004 fingerprint databases. The analyses justify its resilience to existing and newly introduced security and privacy attacks, and show that it satisfies the revocability and unlinkability criteria of cancelable biometrics. |
|||||
2017 | Stochastic Generative Hashing | Dai Bo, Guo Ruiqi, Kumar Sanjiv, He Niao, Song Le | Arxiv | Learning-based binary hashing has become a powerful paradigm for fast search and retrieval in massive databases. However, due to the requirement of discrete outputs for the hash functions, learning such functions is known to be very challenging. In addition, the objective functions adopted by existing hashing techniques are mostly chosen heuristically. In this paper, we propose a novel generative approach to learn hash functions through Minimum Description Length principle such that the learned hash codes maximally compress the dataset and can also be used to regenerate the inputs. We also develop an efficient learning algorithm based on the stochastic distributional gradient, which avoids the notorious difficulty caused by binary output constraints, to jointly optimize the parameters of the hash function and the associated generative model. Extensive experiments on a variety of large-scale datasets show that the proposed method achieves better retrieval results than the existing state-of-the-art methods. |
|||||
2017 | Deep Hashing With Triplet Quantization Loss | Zhou Yuefu, Huang Shanshan, Zhang Ya, Wang Yanfeng | Arxiv | With the explosive growth of image databases, deep hashing, which learns compact binary descriptors for images, has become critical for fast image retrieval. Many existing deep hashing methods leverage quantization loss, defined as distance between the features before and after quantization, to reduce the error from binarizing features. While minimizing the quantization loss guarantees that quantization has minimal effect on retrieval accuracy, it unfortunately significantly reduces the expressiveness of features even before the quantization. In this paper, we show that the above definition of quantization loss is too restricted and in fact not necessary for maintaining high retrieval accuracy. We therefore propose a new form of quantization loss measured in triplets. The core idea of the triplet quantization loss is to learn discriminative real-valued descriptors which lead to minimal loss on retrieval accuracy after quantization. Extensive experiments on two widely used benchmark data sets of different scales, CIFAR-10 and In-shop, demonstrate that the proposed method outperforms the state-of-the-art deep hashing methods. Moreover, we show that the compact binary descriptors obtained with triplet quantization loss lead to very small performance drop after quantization. |
|||||
2017 | A Genetic Algorithm Approach For Image Representation Learning Through Color Quantization | Pereira Érico M., Torres Ricardo Da S., Santos Jefersson A. Dos | Arxiv | Over the last decades, hand-crafted feature extractors have been used to encode image visual properties into feature vectors. Recently, data-driven feature learning approaches have been successfully explored as alternatives for producing more representative visual features. In this work, we combine both research venues, focusing on the color quantization problem. We propose two data-driven approaches to learn image representations through the search for optimized quantization schemes, which lead to more effective feature extraction algorithms and compact representations. Our strategy employs a Genetic Algorithm, a soft-computing apparatus successfully utilized in information-retrieval-related optimization problems. We hypothesize that changing the quantization affects the quality of image description approaches, leading to effective and efficient representations. We evaluate our approaches in content-based image retrieval tasks, considering eight well-known datasets with different visual properties. Results indicate that the approach focused on representation effectiveness outperformed baselines in all tested scenarios. The other approach, which also considers the size of created representations, produced competitive results while keeping or even reducing the dimensionality of feature vectors by up to 25%. |
|||||
2017 | Segmentation Of Objects By Hashing | Curtó J. D., Zarza I. C., Smola Alex, Van Gool Luc | Arxiv | We propose a novel approach to address the problem of Simultaneous Detection and Segmentation introduced in [Hariharan et al 2014]. Using the hierarchical structures first presented in [Arbeláez et al 2011], we apply an efficient and accurate procedure that exploits the feature information of the hierarchy using Locality Sensitive Hashing. We build on recent work that utilizes convolutional neural networks to detect bounding boxes in an image [Ren et al 2015] and then, after hashing, use the most similar hierarchical region that best fits each bounding box; we call this approach C&Z Segmentation. We then refine our final segmentation results by automatic hierarchical pruning. C&Z Segmentation introduces a train-free alternative to Hypercolumns [Hariharan et al 2015]. We conduct extensive experiments on the PASCAL VOC 2012 segmentation dataset, showing that C&Z gives competitive state-of-the-art segmentations of objects. |
|||||
2017 | Fast Similarity Sketching | Dahlgaard Søren, Langhede Mathias Bæk Tejs, Houen Jakob Bæk Tejs, Thorup Mikkel | Arxiv | We consider the \(\textit{Similarity Sketching}\) problem: Given a universe \([u] = \{0,\ldots, u-1\}\) we want a random function \(S\) mapping subsets \(A\subseteq [u]\) into vectors \(S(A)\) of size \(t\), such that the Jaccard similarity \(J(A,B) = |A\cap B|/|A\cup B|\) between sets \(A\) and \(B\) is preserved. More precisely, define \(X_i = [S(A)[i] = S(B)[i]]\) and \(X = \sum_{i\in [t]} X_i\). We want \(E[X_i]=J(A,B)\), and we want \(X\) to be strongly concentrated around \(E[X] = t \cdot J(A,B)\) (i.e. Chernoff-style bounds). This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors \(S(A)\) are also called \(\textit{sketches}\). Strong concentration is critical, for often we want to sketch many sets \(B_1,\ldots,B_n\) so that we later, for a query set \(A\), can find (one of) the most similar \(B_i\). It is then critical that no \(B_i\) looks much more similar to \(A\) due to errors in the sketch. The seminal \(t\times\textit{MinHash}\) algorithm uses \(t\) random hash functions \(h_1,\ldots, h_t\), and stores \(\left ( \min_{a\in A} h_1(A),\ldots, \min_{a\in A} h_t(A) \right )\) as the sketch of \(A\). The main drawback of MinHash is, however, its \(O(t\cdot |A|)\) running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. (continued…) |
|||||
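For context, the classic t×MinHash sketch that the entry above sets out to speed up can be written in a few lines; the affine hash functions modulo a prime used below are a common stand-in for the truly random functions assumed in the analysis, and the parameters are illustrative.

```python
import random

P = 2_147_483_647  # Mersenne prime 2^31 - 1, used for modular hashing

def make_hashes(t, seed=0):
    """t independent hash functions of the form h(x) = (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(t)]

def minhash_sketch(A, hashes):
    """Classic t x MinHash: the minimum hash value per function, O(t*|A|) time."""
    return [min((a * x + b) % P for x in A) for a, b in hashes]

def jaccard_estimate(SA, SB):
    # Fraction of coordinates that agree is an unbiased estimate of J(A, B).
    return sum(sa == sb for sa, sb in zip(SA, SB)) / len(SA)

hashes = make_hashes(t=256)
SA = minhash_sketch({1, 2, 3, 4, 5}, hashes)
SB = minhash_sketch({3, 4, 5, 6}, hashes)   # same hash functions for both sets
print(jaccard_estimate(SA, SB))             # concentrates around J = 3/6 = 0.5
```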
2017 | Scalable Prototype Selection By Genetic Algorithms And Hashing | Plasencia-calaña Yenisel, Orozco-alzate Mauricio, Méndez-vázquez Heydi, García-reyes Edel, Duin Robert P. W. | Arxiv | Classification in the dissimilarity space has become a very active research area since it provides a possibility to learn from data given in the form of pairwise non-metric dissimilarities, which otherwise would be difficult to cope with. The selection of prototypes is a key step for the further creation of the space. However, despite previous efforts to find good prototypes, how to select the best representation set remains an open issue. In this paper we propose scalable methods to select the set of prototypes out of very large datasets. The methods are based on genetic algorithms, dissimilarity-based hashing, and two different unsupervised and supervised scalable criteria. The unsupervised criterion is based on the Minimum Spanning Tree of the graph created by the prototypes as nodes and the dissimilarities as edges. The supervised criterion is based on counting matching labels of objects and their closest prototypes. The suitability of this type of algorithm is analyzed for the specific case of dissimilarity representations. The experimental results show that the methods select good prototypes, taking advantage of the large datasets, and do so at low runtimes. |
|||||
2017 | Leveraging Sparsity For Efficient Submodular Data Summarization | Lindgren Erik M., Wu Shanshan, Dimakis Alexandros G. | Arxiv | The facility location problem is widely used for summarizing large datasets and has additional applications in sensor placement, image retrieval, and clustering. One difficulty of this problem is that submodular optimization algorithms require the calculation of pairwise benefits for all items in the dataset. This is infeasible for large problems, so recent work proposed to only calculate nearest neighbor benefits. One limitation is that several strong assumptions were invoked to obtain provable approximation guarantees. In this paper we establish that these extra assumptions are not necessary—solving the sparsified problem will be almost optimal under the standard assumptions of the problem. We then analyze a different method of sparsification that is a better model for methods such as Locality Sensitive Hashing to accelerate the nearest neighbor computations and extend the use of the problem to a broader family of similarities. We validate our approach by demonstrating that it rapidly generates interpretable summaries. |
|||||
2017 | High-dimensional Simplexes For Supermetric Search | Connor Richard, Vadicamo Lucia, Rabitti Fausto | Arxiv | In 1953, Blumenthal showed that every semi-metric space that is isometrically embeddable in a Hilbert space has the n-point property; we have previously called such spaces supermetric spaces. Although this is a strictly stronger property than triangle inequality, it is nonetheless closely related and many useful metric spaces possess it. These include Euclidean, Cosine and Jensen-Shannon spaces of any dimension. A simple corollary of the n-point property is that, for any (n+1) objects sampled from the space, there exists an n-dimensional simplex in Euclidean space whose edge lengths correspond to the distances among the objects. We show how the construction of such simplexes in higher dimensions can be used to give arbitrarily tight lower and upper bounds on distances within the original space. This allows the construction of an n-dimensional Euclidean space, from which lower and upper bounds of the original space can be calculated, and which is itself an indexable space with the n-point property. For similarity search, the engineering tradeoffs are good: we show significant reductions in data size and metric cost with little loss of accuracy, leading to a significant overall improvement in search performance. |
|||||
2017 | Composite Quantization | Wang Jingdong, Zhang Ting | Arxiv | This paper studies the compact coding approach to approximate nearest neighbor search. We introduce a composite quantization framework. It uses the composition of several (\(M\)) elements, each of which is selected from a different dictionary, to accurately approximate a \(D\)-dimensional vector, thus yielding accurate search, and represents the data vector by a short code composed of the indices of the selected elements in the corresponding dictionaries. Our key contribution lies in introducing a near-orthogonality constraint, which guarantees search efficiency, as the cost of the distance computation is reduced from \(O(D)\) to \(O(M)\) through a distance table lookup scheme. The resulting approach is called near-orthogonal composite quantization. We theoretically justify the equivalence between near-orthogonal composite quantization and minimizing an upper bound of a function formed by jointly considering the quantization error and the search cost according to a generalized triangle inequality. We empirically show the efficacy of the proposed approach over several benchmark datasets. In addition, we demonstrate superior performance in three other applications: combination with inverted multi-index, quantizing the query for mobile search, and inner-product similarity search. |
|||||
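The \(O(M)\) distance evaluation mentioned above works like the lookup scheme in product quantization: the query's interaction with every dictionary element is tabulated once, and each database code is then scored with M table lookups. A hedged numpy sketch follows; the constant contributed by the near-orthogonality cross terms is simply dropped, since under the constraint it is (approximately) shared by all database points.

```python
import numpy as np

def cq_lookup_tables(q, dictionaries):
    """One table per dictionary: table[m][k] = ||c_mk||^2 - 2 <q, c_mk>.

    dictionaries: list of M arrays of shape (K, D); a database vector is
    approximated by the sum of one element chosen from each dictionary.
    """
    return [np.sum(C * C, axis=1) - 2.0 * (C @ q) for C in dictionaries]

def cq_distance(code, tables):
    # M table lookups replace a D-dimensional distance computation; the
    # omitted cross terms add (roughly) the same constant to every point.
    return sum(tables[m][k] for m, k in enumerate(code))

rng = np.random.default_rng(0)
M, K, D = 8, 256, 128
dictionaries = [rng.standard_normal((K, D)) for _ in range(M)]   # assumed trained
q = rng.standard_normal(D)
tables = cq_lookup_tables(q, dictionaries)
print(cq_distance([3, 17, 250, 0, 42, 99, 7, 128], tables))      # score of one code
```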
2017 | Learning Robust Hash Codes For Multiple Instance Image Retrieval | Conjeti Sailesh, Paschali Magdalini, Katouzian Amin, Navab Nassir | Arxiv | In this paper, for the first time, we introduce a multiple instance (MI) deep hashing technique for learning discriminative hash codes with weak bag-level supervision suited for large-scale retrieval. We learn such hash codes by aggregating deeply learnt hierarchical representations across bag members through a dedicated MI pool layer. For better trainability and retrieval quality, we propose a two-pronged approach that includes robust optimization and training with an auxiliary single instance hashing arm which is down-regulated gradually. We pose retrieval for tumor assessment as an MI problem because tumors often coexist with benign masses and could exhibit complementary signatures when scanned from different anatomical views. Experimental validations on benchmark mammography and histology datasets demonstrate improved retrieval performance over the state-of-the-art methods. |
|||||
2017 | Practical Hash Functions For Similarity Estimation And Dimensionality Reduction | Søren Dahlgaard, Mathias Knudsen, Mikkel Thorup | Neural Information Processing Systems | Hashing is a basic tool for dimensionality reduction employed in several aspects of machine learning. However, the performance analysis is often carried out under the abstract assumption that a truly random unit cost hash function is used, without concern for which concrete hash function is employed. The concrete hash function may work fine on sufficiently random input. The question is if it can be trusted in the real world when faced with more structured input. In this paper we focus on two prominent applications of hashing, namely similarity estimation with the one permutation hashing (OPH) scheme of Li et al. [NIPS’12] and feature hashing (FH) of Weinberger et al. [ICML’09], both of which have found numerous applications, e.g., in approximate near-neighbour search with LSH and large-scale classification with SVM. We consider the recent mixed tabulation hash function of Dahlgaard et al. [FOCS’15] which was proved theoretically to perform like a truly random hash function in many applications, including the above OPH. Here we first show improved concentration bounds for FH with truly random hashing and then argue that mixed tabulation performs similarly when the input vectors are sparse. Our main contribution, however, is an experimental comparison of different hashing schemes when used inside FH, OPH, and LSH. We find that mixed tabulation hashing is almost as fast as the classic multiply-mod-prime scheme ax+b mod p. Multiply-mod-prime is guaranteed to work well on sufficiently random data, but we demonstrate that in the above applications, it can lead to bias and poor concentration on both real-world and synthetic data. We also compare with the very popular MurmurHash3, which has no proven guarantees. Mixed tabulation and MurmurHash3 both perform similarly to truly random hashing in our experiments. However, mixed tabulation was 40% faster than MurmurHash3, and it has the proven guarantee of good performance on all possible input, making it more reliable. |
|||||
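To make the comparison above concrete, here is what the multiply-mod-prime scheme looks like when plugged into feature hashing: one hash picks the bucket, a second hash picks a ±1 sign so that sketched inner products stay unbiased. This is a minimal sketch of the generic FH construction, not the paper's experimental setup; the prime and seeds are arbitrary.

```python
import random
import numpy as np

P = 2_305_843_009_213_693_951  # Mersenne prime 2^61 - 1

def make_mmp(seed):
    """A multiply-mod-prime hash h(x) = (a*x + b) mod P, as benchmarked above."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(0, P)
    return lambda x: (a * x + b) % P

def feature_hash(pairs, m, index_hash=make_mmp(0), sign_hash=make_mmp(1)):
    """Feature hashing: scatter (feature_id, value) pairs into m buckets,
    flipping signs with a second hash to keep inner products unbiased."""
    v = np.zeros(m)
    for f, val in pairs:
        v[index_hash(f) % m] += (1.0 if sign_hash(f) % 2 == 0 else -1.0) * val
    return v

print(feature_hash([(101, 2.0), (202, -1.0), (303, 0.5)], m=16))
```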
2017 | Fast Locality-sensitive Hashing Frameworks For Approximate Near Neighbor Search | Christiani Tobias | Arxiv | The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a general technique for constructing a data structure to answer approximate near neighbor queries by using a distribution \(\mathcal{H}\) over locality-sensitive hash functions that partition space. For a collection of \(n\) points, after preprocessing, the query time is dominated by \(O(n^{\rho} log n)\) evaluations of hash functions from \(\mathcal{H}\) and \(O(n^{\rho})\) hash table lookups and distance computations where \(\rho \in (0,1)\) is determined by the locality-sensitivity properties of \(\mathcal{H}\). It follows from a recent result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive hash functions can be reduced to \(O(log^2 n)\), leaving the query time to be dominated by \(O(n^{\rho})\) distance computations and \(O(n^{\rho} log n)\) additional word-RAM operations. We state this result as a general framework and provide a simpler analysis showing that the number of lookups and distance computations closely match the Indyk-Motwani framework, making it a viable replacement in practice. Using ideas from another locality-sensitive hashing framework by Andoni and Indyk (SODA 2006) we are able to reduce the number of additional word-RAM operations to \(O(n^\rho)\). |
|||||
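For reference, the exponent \(\rho\) in the entry above is the standard Indyk–Motwani quantity, built from the two collision probabilities of the hash family; in the usual notation:

```latex
% An (r, cr, p_1, p_2)-sensitive family \mathcal{H} satisfies, for all points q, x:
%   dist(q, x) \le r   \implies \Pr_{h \sim \mathcal{H}}[h(q) = h(x)] \ge p_1,
%   dist(q, x) \ge cr  \implies \Pr_{h \sim \mathcal{H}}[h(q) = h(x)] \le p_2.
\rho = \frac{\log(1/p_1)}{\log(1/p_2)} \in (0,1), \qquad
\text{query cost} = O\!\left(n^{\rho}\log n\right)\ \text{hash evaluations}
 \;+\; O\!\left(n^{\rho}\right)\ \text{lookups and distance computations}.
```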
2017 | Large-scale Image Geo-localization Using Dominant Sets | Zemene Eyasu, Tariku Yonatan, Idrees Haroon, Prati Andrea, Pelillo Marcello, Shah Mubarak | Arxiv | This paper presents a new approach for the challenging problem of geo-locating an image using image matching in a structured database of city-wide reference images with known GPS coordinates. We cast the geo-localization as a clustering problem on local image features. Akin to existing approaches on the problem, our framework builds on low-level features which allow partial matching between images. For each local feature in the query image, we find its approximate nearest neighbors in the reference set. Next, we cluster the features from reference images using Dominant Set clustering, which affords several advantages over existing approaches. First, it permits variable number of nodes in the cluster which we use to dynamically select the number of nearest neighbors (typically coming from multiple reference images) for each query feature based on its discrimination value. Second, as we also quantify in our experiments, this approach is several orders of magnitude faster than existing approaches. Thus, we obtain multiple clusters (different local maximizers) and obtain a robust final solution to the problem using multiple weak solutions through constrained Dominant Set clustering on global image features, where we enforce the constraint that the query image must be included in the cluster. This second level of clustering also bypasses heuristic approaches to voting and selecting the reference image that matches to the query. We evaluated the proposed framework on an existing dataset of 102k street view images as well as a new dataset of 300k images, and show that it outperforms the state-of-the-art by 20% and 7%, respectively, on the two datasets. |
|||||
2017 | Binary Generative Adversarial Networks For Image Retrieval | Song Jingkuan | Arxiv | The most striking successes in image retrieval using deep hashing have mostly involved discriminative models, which require labels. In this paper, we use binary generative adversarial networks (BGAN) to embed images to binary codes in an unsupervised way. By restricting the input noise variable of generative adversarial networks (GAN) to be binary and conditioned on the features of each input image, BGAN can simultaneously learn a binary representation per image, and generate an image plausibly similar to the original one. In the proposed framework, we address two main problems: 1) how to directly generate binary codes without relaxation? 2) how to equip the binary representation with the ability to perform accurate image retrieval? We resolve these problems by proposing a new sign-activation strategy and a loss function steering the learning process, which consists of new models for adversarial loss, a content loss, and a neighborhood structure loss. Experimental results on standard datasets (CIFAR-10, NUSWIDE, and Flickr) demonstrate that our BGAN significantly outperforms existing hashing methods by up to 107% in terms of mAP. Our anonymous code is available at: https://github.com/htconquer/BGAN. |
|||||
2017 | Discrete Multi-modal Hashing With Canonical Views For Robust Mobile Landmark Search | Zhu Lei, Huang Zi, Liu Xiaobai, He Xiangnan, Song Jingkuan, Zhou Xiaofang | Arxiv | Mobile landmark search (MLS) recently receives increasing attention for its great practical values. However, it still remains unsolved due to two important challenges. One is high bandwidth consumption of query transmission, and the other is the huge visual variations of query images sent from mobile devices. In this paper, we propose a novel hashing scheme, named as canonical view based discrete multi-modal hashing (CV-DMH), to handle these problems via a novel three-stage learning procedure. First, a submodular function is designed to measure visual representativeness and redundancy of a view set. With it, canonical views, which capture key visual appearances of landmark with limited redundancy, are efficiently discovered with an iterative mining strategy. Second, multi-modal sparse coding is applied to transform visual features from multiple modalities into an intermediate representation. It can robustly and adaptively characterize visual contents of varied landmark images with certain canonical views. Finally, compact binary codes are learned on intermediate representation within a tailored discrete binary embedding model which preserves visual relations of images measured with canonical views and removes the involved noises. In this part, we develop a new augmented Lagrangian multiplier (ALM) based optimization method to directly solve the discrete binary codes. We can not only explicitly deal with the discrete constraint, but also consider the bit-uncorrelated constraint and balance constraint together. Experiments on real world landmark datasets demonstrate the superior performance of CV-DMH over several state-of-the-art methods. |
|||||
2017 | Distributed Stratified Locality Sensitive Hashing For Critical Event Prediction In The Cloud | De Palma Alessandro, Hemberg Erik, O'reilly Una-may | Arxiv | The availability of massive healthcare data repositories calls for efficient tools for data-driven medicine. We introduce a distributed system for Stratified Locality Sensitive Hashing to perform fast similarity-based prediction on large medical waveform datasets. Our implementation, for an ICU use case, prioritizes latency over throughput and is targeted at a cloud environment. We demonstrate our system on Acute Hypotensive Episode prediction from Arterial Blood Pressure waveforms. On a dataset of \(1.37\) million points, we show scaling up to \(40\) processors and a \(21\times\) speedup in number of comparisons to parallel exhaustive search at the price of a \(10\%\) Matthews correlation coefficient (MCC) loss. Furthermore, if additional MCC loss can be tolerated, our system achieves speedups up to two orders of magnitude. |
|||||
2017 | On Hash-based Work Distribution Methods For Parallel Best-first Search | Jinnai Yuu, Fukunaga Alex | | Parallel best-first search algorithms such as Hash Distributed A* (HDA*) distribute work among the processes using a global hash function. We analyze the search and communication overheads of state-of-the-art hash-based parallel best-first search algorithms, and show that although Zobrist hashing, the standard hash function used by HDA*, achieves good load balance for many domains, it incurs significant communication overhead since almost all generated nodes are transferred to a different processor than their parents. We propose Abstract Zobrist hashing, a new work distribution method for parallel search which, instead of computing a hash value based on the raw features of a state, uses a feature projection function to generate a set of abstract features, resulting in higher locality and reduced communication overhead. We show that Abstract Zobrist hashing outperforms previous methods on search domains using hand-coded, domain-specific feature projection functions. We then propose GRAZHDA*, a graph-partitioning based approach to automatically generating feature projection functions. GRAZHDA* seeks to approximate the partitioning of the actual search space graph by partitioning the domain transition graph, an abstraction of the state space graph. We show that GRAZHDA* outperforms previous methods on domain-independent planning. |
|||||
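Zobrist hashing, as used by HDA* in the entry above, assigns a fixed random bit string to every (feature, value) pair and XORs together the strings present in a state; Abstract Zobrist hashing applies the same XOR to projected (abstract) features instead, so nearby states tend to land on the same processor. The sketch below is illustrative; the projection function and the toy state encoding are hypothetical.

```python
import random

class ZobristHash:
    """Zobrist hashing: XOR of a fixed random 64-bit string per (feature, value)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.table = {}

    def _rand(self, key):
        # Lazily assign a reproducible random 64-bit string to each key.
        if key not in self.table:
            self.table[key] = self.rng.getrandbits(64)
        return self.table[key]

    def __call__(self, state, project=lambda fv: fv):
        # state: iterable of (feature, value) pairs; `project` maps each pair to an
        # abstract feature (identity -> plain Zobrist, a coarser map -> abstract Zobrist).
        h = 0
        for fv in state:
            h ^= self._rand(project(fv))
        return h

z = ZobristHash()
coarse = lambda fv: (fv[0], fv[1] // 4)                 # hypothetical abstraction: bucket values by 4
print(z([("tile1", 5), ("tile2", 9)]) % 8)              # owning process id under Zobrist (8 processes)
print(z([("tile1", 5), ("tile2", 9)], coarse) % 8)      # owning process id under abstract Zobrist
```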
2017 | Momentsnet A Simple Learning-free Method For Binary Image Recognition | Wu Jiasong, Qiu Shijie, Kong Youyong, Chen Yang, Senhadji Lotfi, Shu Huazhong | Arxiv | In this paper, we propose a new simple and learning-free deep learning network named MomentsNet, whose convolution layer, nonlinear processing layer and pooling layer are constructed by Moments kernels, binary hashing and block-wise histogram, respectively. Twelve typical moments (including geometrical moment, Zernike moment, Tchebichef moment, etc.) are used to construct the MomentsNet whose recognition performance for binary image is studied. The results reveal that MomentsNet has better recognition performance than its corresponding moments in almost all cases and ZernikeNet achieves the best recognition performance among MomentsNet constructed by twelve moments. ZernikeNet also shows better recognition performance on binary image database than that of PCANet, which is a learning-based deep learning network. |
|||||
2017 | Hashing In The Zero Shot Framework With Domain Adaptation | Pachori Shubham, Deshpande Ameya, Raman Shanmuganathan | Arxiv | Techniques to learn hash codes which can store and retrieve large dimensional multimedia data efficiently have attracted broad research interest in recent years. With the rapid explosion of newly emerging concepts and online data, existing supervised hashing algorithms suffer from the problem of scarcity of ground truth annotations due to the high cost of obtaining manual annotations. Therefore, we propose an algorithm to learn a hash function from training images belonging to (continued…) |
|||||
2017 | Structured Deep Hashing With Convolutional Neural Networks For Fast Person Re-identification | Wu Lin, Wang Yang | Arxiv | Given a pedestrian image as a query, the purpose of person re-identification is to identify the correct match from a large collection of gallery images depicting the same person captured by disjoint camera views. The critical challenge is how to construct a robust yet discriminative feature representation to capture the compounded variations in pedestrian appearance. To this end, deep learning methods have been proposed to extract hierarchical features against extreme variability of appearance. However, existing methods in this category generally neglect the efficiency in the matching stage, whereas the searching speed of a re-identification system is crucial in real-world applications. In this paper, we present a novel deep hashing framework with Convolutional Neural Networks (CNNs) for fast person re-identification. Technically, we simultaneously learn both CNN features and hash functions/codes to get robust yet discriminative features and similarity-preserving hash codes. Thereby, person re-identification can be resolved by efficiently computing and ranking the Hamming distances between images. A structured loss function defined over positive pairs and hard negatives is proposed to formulate a novel optimization problem so that fast convergence and a more stable optimized solution can be obtained. Extensive experiments on two benchmarks, CUHK03 \cite{FPNN} and Market-1501 \cite{Market1501}, show that the proposed deep architecture is effective compared with the state of the art. |
|||||
2017 | Efficient Large-scale Approximate Nearest Neighbor Search On The GPU | Wieschollek Patrick, Wang Oliver, Sorkine-hornung Alexander, Lensch Hendrik P. A. | The IEEE Conference on Computer Vision and Pattern Recognition | We present a new approach for efficient approximate nearest neighbor (ANN) search in high dimensional spaces, extending the idea of Product Quantization. We propose a two-level product and vector quantization tree that reduces the number of vector comparisons required during tree traversal. Our approach also includes a novel highly parallelizable re-ranking method for candidate vectors by efficiently reusing already computed intermediate values. Due to its small memory footprint during traversal, the method lends itself to an efficient, parallel GPU implementation. This Product Quantization Tree (PQT) approach significantly outperforms recent state of the art methods for high dimensional nearest neighbor queries on standard reference datasets. Ours is the first work that demonstrates GPU performance superior to CPU performance on high dimensional, large scale ANN problems in time-critical real-world applications, like loop-closing in videos. |
|||||
2017 | Neural Network-based Graph Embedding For Cross-platform Binary Code Similarity Detection | Xu Xiaojun, Liu Chang, Feng Qian, Yin Heng, Song Le, Song Dawn | Arxiv | The problem of cross-platform binary code similarity detection aims at detecting whether two binary functions coming from different platforms are similar or not. It has many security applications, including plagiarism detection, malware detection, vulnerability search, etc. Existing approaches rely on approximate graph matching algorithms, which are inevitably slow and sometimes inaccurate, and hard to adapt to a new task. To address these issues, in this work, we propose a novel neural network-based approach to compute the embedding, i.e., a numeric vector, based on the control flow graph of each binary function, then the similarity detection can be done efficiently by measuring the distance between the embeddings for two functions. We implement a prototype called Gemini. Our extensive evaluation shows that Gemini outperforms the state-of-the-art approaches by large margins with respect to similarity detection accuracy. Further, Gemini can speed up prior art’s embedding generation time by 3 to 4 orders of magnitude and reduce the required training time from more than 1 week down to 30 minutes to 10 hours. Our real world case studies demonstrate that Gemini can identify significantly more vulnerable firmware images than the state-of-the-art, i.e., Genius. Our research showcases a successful application of deep learning on computer security problems. |
|||||
2017 | Derandomized Balanced Allocation | Chen Xue | Arxiv | In this paper, we study the maximum loads of explicit hash families in the \(d\)-choice schemes when allocating sequentially \(n\) balls into \(n\) bins. We consider the Uniform-Greedy scheme, which provides \(d\) independent bins for each ball and places the ball into the bin with the least load, and its non-uniform variant, the Always-Go-Left scheme introduced by Vöcking. We construct a hash family with \(O(log n log log n)\) random bits based on the previous work of Celis et al. and show the following results. (continued…) |
|||||
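The Uniform-Greedy scheme analysed above is the classic "power of d choices" allocation; a toy simulation with truly random choices (the entry's contribution is replacing these with an explicit low-randomness hash family) looks like this:

```python
import random

def uniform_greedy(n, d, seed=0):
    """Place n balls into n bins; each ball probes d random bins and goes to
    the least loaded one. Returns the maximum bin load."""
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(n):
        choices = [rng.randrange(n) for _ in range(d)]
        best = min(choices, key=loads.__getitem__)
        loads[best] += 1
    return max(loads)

for d in (1, 2, 4):
    print(d, uniform_greedy(100_000, d))   # the max load drops sharply once d >= 2
```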
2017 | Approximate String Matching Theory And Applications (la Recherche Approchee De Motifs Theorie Et Applications) | Chegrane Ibrahim | Arxiv | Approximate string matching is a fundamental and recurrent problem that arises in most computer science fields. This problem can be defined as follows: Let \(D=\{x_1,x_2,\ldots x_d\}\) be a set of \(d\) words defined on an alphabet \(\Sigma\), let \(q\) be a query defined also on \(\Sigma\), and let \(k\) be a positive integer. We want to build a data structure on \(D\) capable of answering the following query: find all words in \(D\) that differ from the query word \(q\) by at most \(k\) errors. In this thesis, we study approximate string matching methods in dictionaries, texts, and indexes, to propose practical methods that solve this problem efficiently. We explore this problem in three complementary directions: 1) Approximate string matching in a dictionary. We propose two solutions to this problem: the first uses hash tables for \(k \geq 2\); the second uses the Trie and reverse Trie and is restricted to \(k = 1\). The two solutions are adaptable, without loss of performance, to approximate string matching in a text. 2) Approximate string matching for \textit{autocompletion}, that is, finding all suffixes of a given prefix that may contain errors. We give a new solution that is better in practice than all previously proposed solutions. 3) The problem of aligning biological sequences can be interpreted as an approximate string matching problem. We propose a solution for pairwise and multiple sequence alignment. All the results obtained show that our algorithms give the best performance on practical data (benchmarks from the real world). All our methods are provided as libraries, and they are published online. |
|||||
2017 | Deep Binaries Encoding Semantic-rich Cues For Efficient Textual-visual Cross Retrieval | Shen Yuming, Liu Li, Shao Ling, Song Jingkuan | Arxiv | Cross-modal hashing is usually regarded as an effective technique for large-scale textual-visual cross retrieval, where data from different modalities are mapped into a shared Hamming space for matching. Most of the traditional textual-visual binary encoding methods only consider holistic image representations and fail to model descriptive sentences. This renders existing methods inappropriate to handle the rich semantics of informative cross-modal data for quality textual-visual search tasks. To address the problem of hashing cross-modal data with semantic-rich cues, in this paper, a novel integrated deep architecture is developed to effectively encode the detailed semantics of informative images and long descriptive sentences, named as Textual-Visual Deep Binaries (TVDB). In particular, region-based convolutional networks with long short-term memory units are introduced to fully explore image regional details while semantic cues of sentences are modeled by a text convolutional network. Additionally, we propose a stochastic batch-wise training routine, where high-quality binary codes and deep encoding functions are efficiently optimized in an alternating manner. Experiments are conducted on three multimedia datasets, i.e. Microsoft COCO, IAPR TC-12, and INRIA Web Queries, where the proposed TVDB model significantly outperforms state-of-the-art binary coding methods in the task of cross-modal retrieval. |
|||||
2017 | Compression Of Deep Neural Networks For Image Instance Retrieval | Chandrasekhar Vijay, Lin Jie, Liao Qianli, Morère Olivier, Veillard Antoine, Duan Lingyu, Poggio Tomaso | Arxiv | Image instance retrieval is the problem of retrieving images from a database which contain the same object. Convolutional Neural Network (CNN) based descriptors are becoming the dominant approach for generating {\it global image descriptors} for the instance retrieval problem. One major drawback of CNN-based {\it global descriptors} is that uncompressed deep neural network models require hundreds of megabytes of storage making them inconvenient to deploy in mobile applications or in custom hardware. In this work, we study the problem of neural network model compression focusing on the image instance retrieval task. We study quantization, coding, pruning and weight sharing techniques for reducing model size for the instance retrieval problem. We provide extensive experimental results on the trade-off between retrieval performance and model size for different types of networks on several data sets providing the most comprehensive study on this topic. We compress models to the order of a few MBs: two orders of magnitude smaller than the uncompressed models while achieving negligible loss in retrieval performance. |
|||||
2017 | Lattice-based Locality Sensitive Hashing Is Optimal | Chandrasekaran Karthekeyan, Dadush Daniel, Gandikota Venkata, Grigorescu Elena | Arxiv | Locality sensitive hashing (LSH) was introduced by Indyk and Motwani (STOC 1998). (continued…) |
|||||
2017 | Asymmetric Deep Supervised Hashing | Jiang Qing-yuan, Li Wu-jun | Arxiv | Hashing has been widely used for large-scale approximate nearest neighbor search because of its storage and search efficiency. Recent work has found that deep supervised hashing can significantly outperform non-deep supervised hashing in many applications. However, most existing deep supervised hashing methods adopt a symmetric strategy to learn one deep hash function for both query points and database (retrieval) points. The training of these symmetric deep supervised hashing methods is typically time-consuming, which makes them hard to effectively utilize the supervised information for cases with large-scale database. In this paper, we propose a novel deep supervised hashing method, called asymmetric deep supervised hashing (ADSH), for large-scale nearest neighbor search. ADSH treats the query points and database points in an asymmetric way. More specifically, ADSH learns a deep hash function only for query points, while the hash codes for database points are directly learned. The training of ADSH is much more efficient than that of traditional symmetric deep supervised hashing methods. Experiments show that ADSH can achieve state-of-the-art performance in real applications. |
|||||
2017 | Billion-scale Similarity Search With Gpus | Johnson Jeff, Douze Matthijs, Jégou Hervé | Arxiv | Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data-parallel tasks, prior approaches are bottlenecked by algorithms that expose less parallelism, such as k-min selection, or make poor use of the memory hierarchy. We propose a design for k-selection that operates at up to 55% of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5x faster than prior GPU state of the art. We apply it in different similarity search scenarios, by proposing optimized design for brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation enables the construction of a high accuracy k-NN graph on 95 million images from the Yfcc100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility. |
|||||
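The library described in the entry above has been open-sourced as Faiss; a minimal CPU-side usage sketch of its brute-force index is shown below (array shapes are illustrative, and the GPU path wraps the same index objects).

```python
import numpy as np
import faiss  # the open-sourced similarity-search library from the entry above

d, nb, nq, k = 64, 10_000, 5, 10
rng = np.random.default_rng(0)
xb = rng.standard_normal((nb, d)).astype("float32")   # database vectors
xq = rng.standard_normal((nq, d)).astype("float32")   # query vectors

index = faiss.IndexFlatL2(d)   # exact (brute-force) L2 index
index.add(xb)                  # store the database
D, I = index.search(xq, k)     # distances and ids of the k nearest neighbors
print(I.shape)                 # (nq, k)
```

The compressed-domain search discussed in the paper corresponds to the library's IVF/PQ index types (for example `IndexIVFPQ`), which trade some accuracy for a much smaller memory footprint.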
2017 | Variational Deep Semantic Hashing For Text Documents | Chaidaroon Suthee, Fang Yi | Arxiv | As the amount of textual data has been rapidly increasing over the past decade, efficient similarity search methods have become a crucial component of large-scale information retrieval systems. A popular strategy is to represent original data samples by compact binary codes through hashing. A spectrum of machine learning methods have been utilized, but they often lack expressiveness and flexibility in modeling to learn effective representations. The recent advances of deep learning in a wide range of applications has demonstrated its capability to learn robust and powerful feature representations for complex data. Especially, deep generative models naturally combine the expressiveness of probabilistic generative models with the high capacity of deep neural networks, which is very suitable for text modeling. However, little work has leveraged the recent progress in deep learning for text hashing. In this paper, we propose a series of novel deep document generative models for text hashing. The first proposed model is unsupervised while the second one is supervised by utilizing document labels/tags for hashing. The third model further considers document-specific factors that affect the generation of words. The probabilistic generative formulation of the proposed models provides a principled framework for model extension, uncertainty estimation, simulation, and interpretability. Based on variational inference and reparameterization, the proposed models can be interpreted as encoder-decoder deep neural networks and thus they are capable of learning complex nonlinear distributed representations of the original documents. We conduct a comprehensive set of experiments on four public testbeds. The experimental results have demonstrated the effectiveness of the proposed supervised learning models for text hashing. |
|||||
2017 | Analysing The Performance Of GPU Hash Tables For State Space Exploration | Cassee Nathan, Wijs Anton | EPTCS | In the past few years, General Purpose Graphics Processors (GPUs) have been used to significantly speed up numerous applications. One of the areas in which GPUs have recently led to a significant speed-up is model checking. In model checking, state spaces, i.e., large directed graphs, are explored to verify whether models satisfy desirable properties. GPUexplore is a GPU-based model checker that uses a hash table to efficiently keep track of already explored states. As a large number of states is discovered and stored during such an exploration, the hash table should be able to quickly handle many inserts and queries concurrently. In this paper, we experimentally compare two different hash tables optimised for the GPU, one being the GPUexplore hash table, and the other using Cuckoo hashing. We compare the performance of both hash tables using random and non-random data obtained from model checking experiments, to analyse the applicability of the two hash tables for state space exploration. We conclude that Cuckoo hashing is three times faster than GPUexplore hashing for random data, and that Cuckoo hashing is five to nine times faster for non-random data. This suggests great potential to further speed up GPUexplore in the near future. |
|||||
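Cuckoo hashing, the faster of the two tables compared above, resolves collisions by evicting the resident key and re-inserting it into its alternate slot; a minimal sequential (CPU) sketch of the standard two-table scheme is below. The GPU versions add atomics and bucketing, which this illustration omits.

```python
import random

class CuckooTable:
    """Two-table cuckoo hashing: every key has one candidate slot per table;
    a colliding insert evicts the resident key into its other table."""

    def __init__(self, capacity, max_kicks=500, seed=0):
        rng = random.Random(seed)
        self.salt = [rng.getrandbits(64) for _ in range(2)]
        self.tables = [[None] * capacity, [None] * capacity]
        self.capacity, self.max_kicks = capacity, max_kicks

    def _slot(self, key, i):
        return hash((self.salt[i], key)) % self.capacity

    def contains(self, key):
        return any(self.tables[i][self._slot(key, i)] == key for i in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return True
        i = 0
        for _ in range(self.max_kicks):
            s = self._slot(key, i)
            if self.tables[i][s] is None:
                self.tables[i][s] = key
                return True
            self.tables[i][s], key = key, self.tables[i][s]   # evict the resident key
            i = 1 - i                                          # and retry it in the other table
        return False  # give up; a full implementation would rehash or grow

t = CuckooTable(1024)
for state_id in range(500):
    t.insert(state_id)
print(t.contains(42), t.contains(10_000))   # True False
```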
2017 | Hashnet Deep Learning To Hash By Continuation | Cao Zhangjie, Long Mingsheng, Wang Jianmin, Yu Philip S. | Arxiv | Learning to hash has been widely applied to approximate nearest neighbor search for large-scale multimedia retrieval, due to its computation efficiency and retrieval quality. Deep learning to hash, which improves retrieval quality by end-to-end representation learning and hash encoding, has received increasing attention recently. Subject to the ill-posed gradient difficulty in the optimization with sign activations, existing deep learning to hash methods need to first learn continuous representations and then generate binary hash codes in a separated binarization step, which suffer from substantial loss of retrieval quality. This work presents HashNet, a novel deep architecture for deep learning to hash by continuation method with convergence guarantees, which learns exactly binary hash codes from imbalanced similarity data. The key idea is to attack the ill-posed gradient problem in optimizing deep networks with non-smooth binary activations by continuation method, in which we begin from learning an easier network with smoothed activation function and let it evolve during the training, until it eventually goes back to being the original, difficult to optimize, deep network with the sign activation function. Comprehensive empirical evidence shows that HashNet can generate exactly binary hash codes and yield state-of-the-art multimedia retrieval performance on standard benchmarks. |
|||||
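The continuation trick described in the HashNet entry above replaces the sign activation with a scaled tanh whose slope grows during training, so the outputs drift toward exact binary codes while gradients stay usable. A framework-agnostic numpy sketch of the idea follows; the linear schedule and the cap on beta are illustrative choices, not the paper's schedule.

```python
import numpy as np

def continuation_codes(features, step, total_steps, beta_max=10.0):
    """Smoothed binary codes h = tanh(beta * z), with beta annealed upward.

    As beta -> infinity, tanh(beta * z) -> sign(z), recovering exact hash bits.
    """
    beta = 1.0 + (beta_max - 1.0) * step / total_steps   # illustrative linear schedule
    return np.tanh(beta * features)

z = np.array([-0.8, 0.05, 1.3])   # pre-activation outputs of the hash layer
for step in (0, 500, 1000):
    print(step, continuation_codes(z, step, total_steps=1000))
```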
2017 | Transfer Adversarial Hashing For Hamming Space Retrieval | Cao Zhangjie, Long Mingsheng, Huang Chao, Wang Jianmin | Arxiv | Hashing is widely applied to large-scale image retrieval due to the storage and retrieval efficiency. Existing work on deep hashing assumes that the database in the target domain is identically distributed with the training set in the source domain. This paper relaxes this assumption to a transfer retrieval setting, which allows the database and the training set to come from different but relevant domains. However, the transfer retrieval setting will introduce two technical difficulties: first, the hash model trained on the source domain cannot work well on the target domain due to the large distribution gap; second, the domain gap makes it difficult to concentrate the database points to be within a small Hamming ball. As a consequence, transfer retrieval performance within Hamming Radius 2 degrades significantly in existing hashing methods. This paper presents Transfer Adversarial Hashing (TAH), a new hybrid deep architecture that incorporates a pairwise \(t\)-distribution cross-entropy loss to learn concentrated hash codes and an adversarial network to align the data distributions between the source and target domains. TAH can generate compact transfer hash codes for efficient image retrieval on both source and target domains. Comprehensive experiments validate that TAH yields state of the art Hamming space retrieval performance on standard datasets. |
|||||
2017 | Mihash Online Hashing With Mutual Information | Cakir Fatih, He Kun, Bargal Sarah Adel, Sclaroff Stan | Arxiv | Learning-based hashing methods are widely used for nearest neighbor retrieval, and recently, online hashing methods have demonstrated good performance-complexity trade-offs by learning hash functions from streaming data. In this paper, we first address a key challenge for online hashing: the binary codes for indexed data must be recomputed to keep pace with updates to the hash functions. We propose an efficient quality measure for hash functions, based on an information-theoretic quantity, mutual information, and use it successfully as a criterion to eliminate unnecessary hash table updates. Next, we also show how to optimize the mutual information objective using stochastic gradient descent. We thus develop a novel hashing method, MIHash, that can be used in both online and batch settings. Experiments on image retrieval benchmarks (including a 2.5M image dataset) confirm the effectiveness of our formulation, both in reducing hash table recomputations and in learning high-quality hash functions. |
|||||
2017 | A Revisit On Deep Hashings For Large-scale Content Based Image Retrieval | Cai Deng, Gu Xiuye, Wang Chaoqi | Arxiv | There is a growing trend in studying deep hashing methods for content-based image retrieval (CBIR), where hash functions and binary codes are learnt using deep convolutional neural networks and then the binary codes can be used to do approximate nearest neighbor (ANN) search. All the existing deep hashing papers report their methods’ superior performance over the traditional hashing methods according to their experimental results. However, there are serious flaws in the evaluations of existing deep hashing papers: (1) The datasets they used are too small and simple to simulate the real CBIR situation. (2) They did not correctly include the search time in their evaluation criteria, while the search time is crucial in real CBIR systems. (3) The performance of some unsupervised hashing algorithms (e.g., LSH) can easily be boosted if one uses multiple hash tables, which is an important factor that should be considered in the evaluation, yet most of the deep hashing papers failed to do so. We re-evaluate several state-of-the-art deep hashing methods with a carefully designed experimental setting. Empirical results reveal that the performance of these deep hashing methods is inferior to multi-table IsoH, a very simple unsupervised hashing method. Thus, the conclusions in all the deep hashing papers should be carefully re-examined. |
|||||
2017 | An Improved Video Analysis Using Context Based Extension Of LSH | Chakraborty Angana, Bandyopadhyay Sanghamitra | Arxiv | Locality Sensitive Hashing (LSH) based algorithms have already shown their promise in finding approximate nearest neighbors in high dimensional data space. However, there are certain scenarios, as in sequential data, where the proximity of a pair of points cannot be captured without considering their surroundings or context. In videos, for example, a particular frame is meaningful only when it is seen in the context of its preceding and following frames. LSH has no mechanism to handle the contexts of the data points. In this article, a novel scheme of Context based Locality Sensitive Hashing (conLSH) has been introduced, in which points are hashed together not only based on their closeness, but also because of similar context. The contribution made in this article is threefold. First, conLSH is integrated with a recently proposed fast optimal sequence alignment algorithm (FOGSAA) using a layered approach. The resultant method is applied to video retrieval for extracting similar sequences. The proposed algorithm yields more than 80% accuracy on average in different datasets. It has been found to save 36.3% of the total time consumed by the exhaustive search. conLSH reduces the search space to approximately 42% of the entire dataset, when compared with an exhaustive search by the aforementioned FOGSAA, the Bag of Words method and the standard LSH implementations. Secondly, the effectiveness of conLSH is demonstrated in action recognition of the video clips, which yields an average gain of 12.83% in terms of classification accuracy over the state of the art methods using STIP descriptors. Last, and of great significance, this article provides a way of automatically annotating long and composite real life videos. The source code of conLSH is made available at http://www.isical.ac.in/~bioinfo_miu/conLSH/conLSH.html |
|||||
2017 | Multi-level Spherical Locality Sensitive Hashing For Approximate Near Neighbors | Brooks Teresa Nicole, Almajalid Rania | Arxiv | This paper introduces “Multi-Level Spherical LSH”: a parameter-free, multi-level, data-dependent Locality Sensitive Hashing data structure for solving the Approximate Near Neighbors Problem (ANN). This data structure uses a modified version of a multi-probe adaptive querying algorithm, with the potential of achieving an \(O(n^p + t)\) query run time for all inputs \(n\) where \(t \le n\). |
|||||
2017 | Uniqueness Of Codes Using Semidefinite Programming | Brouwer Andries E., Polak Sven C. | Arxiv | For \(n,d,w \in \mathbb{N}\), let \(A(n,d,w)\) denote the maximum size of a binary code of word length \(n\), minimum distance \(d\) and constant weight \(w\). Schrijver recently showed using semidefinite programming that \(A(23,8,11)=1288\), and the second author that \(A(22,8,11)=672\) and \(A(22,8,10)=616\). Here we show uniqueness of the codes achieving these bounds. Let \(A(n,d)\) denote the maximum size of a binary code of word length \(n\) and minimum distance \(d\). Gijswijt, Mittelmann and Schrijver showed that \(A(20,8)=256\). We show that there are several nonisomorphic codes achieving this bound, and classify all such codes with all distances divisible by 4. |
|||||
2017 | Open-set Language Identification | Malmasi Shervin | Arxiv | We present the first open-set language identification experiments using one-class classification. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One-Class Support Vector Machine using only a monolingual corpus for each language. Each model is evaluated against a test set of data from all 10 languages and we achieve an average F-score of 0.99, highlighting the effectiveness of this approach for open-set language identification. |
|||||
2017 | A Resource-frugal Probabilistic Dictionary And Applications In Bioinformatics | Marchet Camille, Lecompte Lolita, Limasset Antoine, Bittner Lucie, Peterlongo Pierre | Arxiv | Indexing massive data sets is extremely expensive for large scale problems. In many fields, huge amounts of data are currently generated; however, extracting meaningful information from voluminous data sets, such as computing similarity between elements, is far from trivial. It remains nonetheless a fundamental need. This work proposes a probabilistic data structure based on a minimal perfect hash function for indexing large sets of keys. Our structure outperforms the hash table in construction time, query time and memory usage when indexing a static set. To illustrate the impact of algorithm performance, we provide two applications based on similarity computation between collections of sequences, for which this calculation is an expensive but required operation. In particular, we show a practical case in which other bioinformatics tools fail to scale up to the tested data set or provide lower recall quality results. |
|||||
2017 | Fast Nearest Neighbor Preserving Embeddings | Sivertsen Johan | Arxiv | We show an analog to the Fast Johnson-Lindenstrauss Transform for Nearest Neighbor Preserving Embeddings in \(\ell_2\). These are sparse, randomized embeddings that preserve the (approximate) nearest neighbors. The dimensionality of the embedding space is bounded not by the size of the embedded set \(n\), but by its doubling dimension \(\lambda\). For most large real-world datasets this will mean a considerably lower-dimensional embedding space than possible when preserving all distances. The resulting embeddings can be used with existing approximate nearest neighbor data structures to yield speed improvements. |
|||||
2017 | Deduplication In A Massive Clinical Note Dataset | Shenoy Sanjeev, Kuo Tsung-ting, Gabriel Rodney, Mcauley Julian, Hsu Chun-nan | Arxiv | Duplication, whether exact or partial, is a common issue in many datasets. In clinical notes data, duplication (and near duplication) can arise for many reasons, such as the pervasive use of templates, copy-pasting, or notes being generated by automated procedures. A key challenge in removing such near duplicates is the size of such datasets; our own dataset consists of more than 10 million notes. To detect and correct such duplicates requires algorithms that are both accurate and highly scalable. We describe a solution based on Minhashing with Locality Sensitive Hashing. In this paper, we present the theory behind this method and present a database-inspired approach to make the method scalable. We also present a clustering technique using disjoint sets to produce dense clusters, which speeds up our algorithm. |
|||||
2017 | Hashgan Attention-aware Deep Adversarial Hashing For Cross Modal Retrieval | Zhang Xi, Zhou Siyu, Feng Jiashi, Lai Hanjiang, Li Bo, Pan Yan, Yin Jian, Yan Shuicheng | Arxiv | With the rapid growth of multi-modal data, hashing methods for cross-modal retrieval have received considerable attention. Deep-networks-based cross-modal hashing methods are appealing as they can integrate feature learning and hash coding into end-to-end trainable frameworks. However, it is still challenging to find content similarities between different modalities of data due to the heterogeneity gap. To further address this problem, we propose an adversarial hashing network with an attention mechanism to enhance the measurement of content similarities by selectively focusing on informative parts of multi-modal data. The proposed new adversarial network, HashGAN, consists of three building blocks: 1) the feature learning module to obtain feature representations, 2) the generative attention module to generate an attention mask, which is used to obtain the attended (foreground) and the unattended (background) feature representations, 3) the discriminative hash coding module to learn hash functions that preserve the similarities between different modalities. In our framework, the generative module and the discriminative module are trained in an adversarial way: the generator is trained so that the discriminator cannot preserve the similarities of multi-modal data w.r.t. the background feature representations, while the discriminator aims to preserve the similarities of multi-modal data w.r.t. both the foreground and the background feature representations. Extensive evaluations on several benchmark datasets demonstrate that the proposed HashGAN brings substantial improvements over other state-of-the-art cross-modal hashing methods. |
|||||
2017 | Exact Clustering In Linear Time | Marshall Jonathan A., Rafsky Lawrence C. | Arxiv | The time complexity of data clustering has been viewed as fundamentally quadratic, slowing with the number of data items, as each item is compared for similarity to preceding items. Clustering of large data sets has been infeasible without resorting to probabilistic methods or to capping the number of clusters. Here we introduce MIMOSA, a novel class of algorithms which achieve linear time computational complexity on clustering tasks. MIMOSA algorithms mark and match partial-signature keys in a hash table to obtain exact, error-free cluster retrieval. Benchmark measurements, on clustering a data set of 10,000,000 news articles by news topic, found that a MIMOSA implementation finished more than four orders of magnitude faster than a standard centroid implementation. |
|||||
2017 | Dynamic Space Efficient Hashing | Maier Tobias, Sanders Peter | Arxiv | We consider space efficient hash tables that can grow and shrink dynamically and are always highly space efficient, i.e., their space consumption is always close to the lower bound even while growing and when taking into account storage that is only needed temporarily. None of the traditionally used hash tables have this property. We show how known approaches like linear probing and bucket cuckoo hashing can be adapted to this scenario by subdividing them into many subtables or using virtual memory overcommitting. However, these rather straightforward solutions suffer from slow amortized insertion times due to frequent reallocation in small increments. Our main result is DySECT ({\bf Dy}namic {\bf S}pace {\bf E}fficient {\bf C}uckoo {\bf T}able), which avoids these problems. DySECT consists of many subtables which grow by doubling their size. The resulting inhomogeneity in subtable sizes is equalized by the flexibility available in bucket cuckoo hashing, where each element can go to several buckets, each of which contains several cells. Experiments indicate that DySECT works well with load factors up to 98\%, with up to 2.7 times better performance than the next best solution. |
|||||
2017 | Unsupervised Generative Adversarial Cross-modal Hashing | Zhang Jian, Peng Yuxin, Yuan Mingkuan | Arxiv | Cross-modal hashing aims to map heterogeneous multimedia data into a common Hamming space, which can realize fast and flexible retrieval across different modalities. Unsupervised cross-modal hashing is more flexible and applicable than supervised methods, since no intensive labeling work is involved. However, existing unsupervised methods learn hashing functions by preserving inter and intra correlations, while ignoring the underlying manifold structure across different modalities, which is extremely helpful to capture meaningful nearest neighbors of different modalities for cross-modal retrieval. To address the above problem, in this paper we propose an Unsupervised Generative Adversarial Cross-modal Hashing approach (UGACH), which makes full use of GAN’s ability for unsupervised representation learning to exploit the underlying manifold structure of cross-modal data. The main contributions can be summarized as follows: (1) We propose a generative adversarial network to model cross-modal hashing in an unsupervised fashion. In the proposed UGACH, given data of one modality, the generative model tries to fit the distribution over the manifold structure, and selects informative data of another modality to challenge the discriminative model. The discriminative model learns to distinguish the generated data from the true positive data sampled from the correlation graph to achieve better retrieval accuracy. These two models are trained in an adversarial way to improve each other and promote hashing function learning. (2) We propose a correlation graph based approach to capture the underlying manifold structure across different modalities, so that data of different modalities but within the same manifold can have smaller Hamming distance and promote retrieval accuracy. Extensive experiments compared with 6 state-of-the-art methods verify the effectiveness of our proposed approach. |
|||||
2017 | Streaming Binary Sketching Based On Subspace Tracking And Diagonal Uniformization | Morvan Anne, Souloumiac Antoine, Gouy-pailler Cédric, Atif Jamal | Arxiv | In this paper, we address the problem of learning compact similarity-preserving embeddings for massive high-dimensional streams of data in order to perform efficient similarity search. We present a new online method for computing binary compressed representations (sketches) of high-dimensional real feature vectors. Given an expected code length \(c\) and high-dimensional input data points, our algorithm provides a \(c\)-bit binary code for preserving the distance between the points from the original high-dimensional space. Our algorithm requires neither the storage of the whole dataset nor a chunk of it, and is thus fully adaptable to the streaming setting. It also provides low time complexity and convergence guarantees. We demonstrate the quality of our binary sketches through experiments on real data for the nearest neighbors search task in the online setting. |
|||||
2017 | Transductive Zero-shot Hashing Via Coarse-to-fine Similarity Mining | Lai Hanjiang, Pan Yan | Arxiv | Zero-shot Hashing (ZSH) is to learn hashing models for novel/target classes without training data, which is an important and challenging problem. Most existing ZSH approaches exploit transfer learning via an intermediate shared semantic representation between the seen/source classes and novel/target classes. However, because the source and target classes are disjoint, the hash functions learned from the source dataset are biased when applied directly to the target classes. In this paper, we study transductive ZSH, i.e., we have unlabeled data for the novel classes. We put forward a simple yet efficient joint learning approach via coarse-to-fine similarity mining which transfers knowledge from source data to target data. It mainly consists of two building blocks in the proposed deep architecture: 1) a shared two-stream network, in which the first stream operates on the source data and the second stream operates on the unlabeled data, to learn effective common image representations, and 2) a coarse-to-fine module, which begins with finding the most representative images from the target classes and then further detects similarities among these images, to transfer the similarities of the source data to the target data in a greedy fashion. Extensive evaluation results on several benchmark datasets demonstrate that the proposed hashing method achieves significant improvement over the state-of-the-art methods. |
|||||
2017 | Characterizing And Enumerating Walsh-hadamard Transform Algorithms | Serre François, Püschel Markus | Arxiv | We propose a way of characterizing the algorithms computing a Walsh-Hadamard transform that consist of a sequence of arrays of butterflies (\(I_{2^{n-1}}\otimes \text{DFT}_2\)) interleaved by linear permutations. Linear permutations are those that map linearly the binary representation of its element indices. We also propose a method to enumerate these algorithms. |
|||||
2017 | Distance-sensitive Hashing | Aumüller Martin, Christiani Tobias, Pagh Rasmus, Silvestri Francesco | Arxiv | Locality-sensitive hashing (LSH) is an important tool for managing high-dimensional noisy or uncertain data, for example in connection with data cleaning (similarity join) and noise-robust search (similarity search). However, for a number of problems the LSH framework is not known to yield good solutions, and instead ad hoc solutions have been designed for particular similarity and distance measures. For example, this is true for output-sensitive similarity search/join, and for indexes supporting annulus queries that aim to report a point close to a certain given distance from the query point. In this paper we initiate the study of distance-sensitive hashing (DSH), a generalization of LSH that seeks a family of hash functions such that the probability of two points having the same hash value is a given function of the distance between them. More precisely, given a distance space \((X, \text{dist})\) and a “collision probability function” (CPF) \(f\colon \mathbb{R}\rightarrow [0,1]\) we seek a distribution over pairs of functions \((h,g)\) such that for every pair of points \(x, y \in X\) the collision probability is \(\Pr[h(x)=g(y)] = f(\text{dist}(x,y))\). Locality-sensitive hashing is the study of how fast a CPF can decrease as the distance grows. For many spaces, \(f\) can be made exponentially decreasing even if we restrict attention to the symmetric case where \(g=h\). We show that the asymmetry achieved by having a pair of functions makes it possible to achieve CPFs that are, for example, increasing or unimodal, and show how this leads to principled solutions to problems not addressed by the LSH framework. This includes a novel application to privacy-preserving distance estimation. We believe that the DSH framework will find further applications in high-dimensional data management. |
|||||
2017 | Deep Discrete Hashing With Self-supervised Pairwise Labels | Song Jingkuan, He Tao, Fan Hangbo, Gao Lianli | Arxiv | Hashing methods have been widely used for applications of large-scale image retrieval and classification. Non-deep hashing methods using handcrafted features have been significantly outperformed by deep hashing methods due to their better feature representation and end-to-end learning framework. However, the most striking successes in deep hashing have mostly involved discriminative models, which require labels. In this paper, we propose a novel unsupervised deep hashing method, named Deep Discrete Hashing (DDH), for large-scale image retrieval and classification. In the proposed framework, we address two main problems: 1) how to directly learn discrete binary codes? 2) how to equip the binary representation with the ability of accurate image retrieval and classification in an unsupervised way? We resolve these problems by introducing an intermediate variable and a loss function steering the learning process, which is based on the neighborhood structure in the original space. Experimental results on standard datasets (CIFAR-10, NUS-WIDE, and Oxford-17) demonstrate that our DDH significantly outperforms existing hashing methods by a large margin in terms of mAP for image retrieval and object recognition. Code is available at \url{https://github.com/htconquer/ddh}. |
|||||
2017 | Generic LSH Families For The Angular Distance Based On Johnson-lindenstrauss Projections And Feature Hashing LSH | Argerich Luis, Golmar Natalia | Arxiv | In this paper we propose the creation of generic LSH families for the angular distance based on Johnson-Lindenstrauss projections. We show that feature hashing is a valid J-L projection and propose two new LSH families based on feature hashing. These new LSH families are tested on both synthetic and real datasets with very good results and a considerable performance improvement over other LSH families. While the theoretical analysis is done for the angular distance, these families can also be used in practice for the Euclidean distance with excellent results [2]. Our tests using real datasets show that the proposed LSH functions work well for the Euclidean distance. |
|||||
2017 | Supervised Hashing With End-to-end Binary Deep Neural Network | Tan Dang-khoa Le, Do Thanh-toan, Cheung Ngai-man | Arxiv | Image hashing is a popular technique applied to large scale content-based visual retrieval due to its compact and efficient binary codes. Our work proposes a new end-to-end deep network architecture for supervised hashing which directly learns binary codes from input images and maintains good properties over binary codes such as similarity preservation, independence, and balancing. Furthermore, we also propose a new learning scheme that can cope with the binary constrained loss function. The proposed algorithm not only is scalable for learning over large-scale datasets but also outperforms state-of-the-art supervised hashing methods, which are illustrated throughout extensive experiments from various image retrieval benchmarks. |
|||||
2017 | Accelerated Nearest Neighbor Search With Quick ADC | André Fabien Technicolor, Kermarrec Anne-marie Inria, Scouarnec Nicolas Le Technicolor | Arxiv | Efficient Nearest Neighbor (NN) search in high-dimensional spaces is a foundation of many multimedia retrieval systems. Because it offers low response times, Product Quantization (PQ) is a popular solution. PQ compresses high-dimensional vectors into short codes using several sub-quantizers, which enables in-RAM storage of large databases. This allows fast answers to NN queries, without accessing the SSD or HDD. The key feature of PQ is that it can compute distances between short codes and high-dimensional vectors using cache-resident lookup tables. The efficiency of this technique, named Asymmetric Distance Computation (ADC), remains limited because it performs many cache accesses. In this paper, we introduce Quick ADC, a novel technique that achieves a 3 to 6 times speedup over ADC by exploiting Single Instruction Multiple Data (SIMD) units available in current CPUs. Efficiently exploiting SIMD requires algorithmic changes to the ADC procedure. Namely, Quick ADC relies on two key modifications of ADC: (i) the use of 4-bit sub-quantizers instead of the standard 8-bit sub-quantizers and (ii) the quantization of floating-point distances. This allows Quick ADC to exceed the performance of state-of-the-art systems, e.g., it achieves a Recall@100 of 0.94 in 3.4 ms on 1 billion SIFT descriptors (128-bit codes). |
|||||
2017 | Exploiting Modern Hardware For High-dimensional Nearest Neighbor Search | André Fabien | Arxiv | Many multimedia information retrieval or machine learning problems require efficient high-dimensional nearest neighbor search techniques. For instance, multimedia objects (images, music or videos) can be represented by high-dimensional feature vectors. Finding two similar multimedia objects then comes down to finding two objects that have similar feature vectors. In the current context of mass use of social networks, large scale multimedia databases or large scale machine learning applications are more and more common, calling for efficient nearest neighbor search approaches. This thesis builds on product quantization, an efficient nearest neighbor search technique that compresses high-dimensional vectors into short codes. This makes it possible to store very large databases entirely in RAM, enabling low response times. We propose several contributions that exploit the capabilities of modern CPUs, especially SIMD and the cache hierarchy, to further decrease response times offered by product quantization. |
|||||
2017 | Foresthash Semantic Hashing With Shallow Random Forests And Tiny Convolutional Networks | Qiu Qiang, Lezama Jose, Bronstein Alex, Sapiro Guillermo | Arxiv | Hash codes are efficient data representations for coping with the ever growing amounts of data. In this paper, we introduce a random forest semantic hashing scheme that embeds tiny convolutional neural networks (CNN) into shallow random forests, with near-optimal information-theoretic code aggregation among trees. We start with a simple hashing scheme, where random trees in a forest act as hashing functions by setting |
|||||
2017 | Scalable Nearest Neighbor Search Based On Knn Graph | Zhao Wan-lei, Yang Jie, Deng Cheng-hao | Arxiv | Nearest neighbor search is known as a challenging issue that has been studied for several decades. Recently, this issue has become more and more pressing as big data problems arise in various fields. In this paper, a scalable solution based on a hill-climbing strategy with the support of the k-nearest neighbor graph (kNN) is presented. Two major issues have been considered in the paper. Firstly, an efficient kNN graph construction method based on a two means tree is presented. For the nearest neighbor search, an enhanced hill-climbing procedure is proposed, which sees a considerable performance boost over the original procedure. Furthermore, with the support of inverted indexing derived from residue vector quantization, our method achieves close to 100% recall with high speed efficiency in two state-of-the-art evaluation benchmarks. In addition, a comparative study on both the compressional and traditional nearest neighbor search methods is presented. We show that our method achieves the best trade-off between search quality, efficiency and memory complexity. |
|||||
2017 | Fast K-nearest Neighbour Search Via Prioritized DCI | Li Ke, Malik Jitendra | Arxiv | Most exact methods for k-nearest neighbour search suffer from the curse of dimensionality; that is, their query times exhibit exponential dependence on either the ambient or the intrinsic dimensionality. Dynamic Continuous Indexing (DCI) offers a promising way of circumventing the curse and successfully reduces the dependence of query time on intrinsic dimensionality from exponential to sublinear. In this paper, we propose a variant of DCI, which we call Prioritized DCI, and show a remarkable improvement in the dependence of query time on intrinsic dimensionality. In particular, a linear increase in intrinsic dimensionality, or equivalently, an exponential increase in the number of points near a query, can be mostly counteracted with just a linear increase in space. We also demonstrate empirically that Prioritized DCI significantly outperforms prior methods. In particular, relative to Locality-Sensitive Hashing (LSH), Prioritized DCI reduces the number of distance evaluations by a factor of 14 to 116 and the memory consumption by a factor of 21. |
|||||
2017 | Random Binary Trees For Approximate Nearest Neighbour Search In Binary Space | Komorowski Michal, Trzcinski Tomasz | Arxiv | Approximate nearest neighbour (ANN) search is one of the most important problems in computer science fields such as data mining or computer vision. In this paper, we focus on ANN for high-dimensional binary vectors and we propose a simple yet powerful search method that uses Random Binary Search Trees (RBST). We apply our method to a dataset of 1.25M binary local feature descriptors obtained from a real-life image-based localisation system provided by Google as a part of Project Tango. An extensive evaluation of our method against the state-of-the-art variations of Locality Sensitive Hashing (LSH), namely Uniform LSH and Multi-probe LSH, shows the superiority of our method in terms of retrieval precision, with a performance boost of over 20%. |
|||||
2017 | On Fast Bounded Locality Sensitive Hashing | Wygocki Piotr | Arxiv | In this paper, we examine the hash functions expressed as scalar products, i.e., \(f(x)=<v,x>\), for some bounded random vector \(v\). Such hash functions have numerous applications, but often there is a need to optimize the choice of the distribution of \(v\). In the present work, we focus on so-called anti-concentration bounds, i.e., the upper bounds of \(\mathbb{P}\left[|<v,x>| < \alpha \right]\). In many applications, \(v\) is a vector of independent random variables with standard normal distribution. In such a case, the distribution of \(<v,x>\) is also normal and it is easy to approximate \(\mathbb{P}\left[|<v,x>| < \alpha \right]\). Here, we consider two bounded distributions in the context of the anti-concentration bounds. Particularly, we analyze \(v\) being a random vector from the unit ball in \(l_{\infty}\) and \(v\) being a random vector from the unit sphere in \(l_{2}\). We show anti-concentration measures for the functions \(f(x)=<v,x>\) that are optimal up to a constant. As a consequence of our research, we obtain new best results for \(c\)-approximate nearest neighbors without false negatives for \(l_p\) in high dimensional space for all \(p\in[1,\infty]\), for \(c=\Omega(\max\{\sqrt{d},d^{1/p}\})\). These results improve over those presented in [16]. Finally, our paper reports progress on answering the open problem by Pagh [17], who considered the nearest neighbor search without false negatives for the Hamming distance. |
|||||
2017 | High-quality Shared-memory Graph Partitioning | Akhremtsev Yaroslav, Sanders Peter, Schulz Christian | Arxiv | Partitioning graphs into blocks of roughly equal size such that few edges run between blocks is a frequently needed operation in processing graphs. Recently, the size, variety, and structural complexity of these networks have grown dramatically. Unfortunately, previous approaches to parallel graph partitioning have problems in this context since they often show a negative trade-off between speed and quality. We present an approach to multi-level shared-memory parallel graph partitioning that guarantees balanced solutions, shows high speed-ups for a variety of large graphs and yields very good quality independently of the number of cores used. For example, on 31 cores, our algorithm partitions our largest test instance into 16 blocks cutting less than half as many edges as our main competitor when both algorithms are given the same amount of time. Important ingredients include parallel label propagation for both coarsening and improvement, parallel initial partitioning, a simple yet effective approach to parallel localized local search, and fast locality preserving hash tables. |
|||||
2017 | Similarity Search Over Graphs Using Localized Spectral Analysis | Aizenbud Yariv, Averbuch Amir, Shabat Gil, Ziv Guy | Arxiv | This paper provides a new similarity detection algorithm. Given an input set of multi-dimensional data points and an additional reference data point for similarity finding, the algorithm uses a kernel method that embeds the data points into a low dimensional manifold. Unlike other kernel methods, which consider the entire data for the embedding, our method selects a specific set of kernel eigenvectors. The eigenvectors are chosen to separate between the data points and the reference data point so that similar data points can be easily identified as being distinct from most of the members in the dataset. |
|||||
2017 | Optimal Las Vegas Locality Sensitive Data Structures | Ahle Thomas Dybdahl | Arxiv | We show that approximate similarity (near neighbour) search can be solved in high dimensions with performance matching state of the art (data independent) Locality Sensitive Hashing, but with a guarantee of no false negatives. Specifically, we give two data structures for common problems. For \(c\)-approximate near neighbour in Hamming space we get query time \(dn^{1/c+o(1)}\) and space \(dn^{1+1/c+o(1)}\) matching that of \cite{indyk1998approximate} and answering a long standing open question from~\cite{indyk2000dimensionality} and~\cite{pagh2016locality} in the affirmative. By means of a new deterministic reduction from \(\ell_1\) to Hamming we also solve \(\ell_1\) and \(\ell_2\) with query time \(d^2n^{1/c+o(1)}\) and space \(d^2 n^{1+1/c+o(1)}\). For \((s_1,s_2)\)-approximate Jaccard similarity we get query time \(dn^{\rho+o(1)}\) and space \(dn^{1+\rho+o(1)}\), \(\rho=\log\frac{1+s_1}{2s_1}\big/\log\frac{1+s_2}{2s_2}\), when sets have equal size, matching the performance of~\cite{tobias2016}. The algorithms are based on space partitions, as with classic LSH, but we construct these using a combination of brute force, tensoring, perfect hashing and splitter functions à la~\cite{naor1995splitters}. We also show a new dimensionality reduction lemma with 1-sided error. |
|||||
2017 | Fast And Scalable Minimal Perfect Hashing For Massive Key Sets | Limasset Antoine, Rizk Guillaume, Chikhi Rayan, Peterlongo Pierre | Arxiv | Minimal perfect hash functions provide space-efficient and collision-free hashing on static sets. Existing algorithms and implementations that build such functions have practical limitations on the number of input elements they can process, due to high construction time, RAM or external memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It is capable of creating a minimal perfect hash function of \(10^{10}\) elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality \(10^{12}\). Source code: https://github.com/rizkg/BBHash |
|||||
2017 | Deep Hashing Network For Unsupervised Domain Adaptation | Venkateswara Hemanth, Eusebio Jose, Chakraborty Shayok, Panchanathan Sethuraman | Arxiv | In recent years, deep neural networks have emerged as a dominant machine learning tool for a wide variety of application domains. However, training a deep neural network requires a large amount of labeled data, which is an expensive process in terms of time, labor and human expertise. Domain adaptation or transfer learning algorithms address this challenge by leveraging labeled data in a different, but related source domain, to develop a model for the target domain. Further, the explosive growth of digital data has posed a fundamental challenge concerning its storage and retrieval. Due to its storage and retrieval efficiency, recent years have witnessed a wide application of hashing in a variety of computer vision applications. In this paper, we first introduce a new dataset, Office-Home, to evaluate domain adaptation algorithms. The dataset contains images of a variety of everyday objects from multiple domains. We then propose a novel deep learning framework that can exploit labeled source data and unlabeled target data to learn informative hash codes, to accurately classify unseen target data. To the best of our knowledge, this is the first research effort to exploit the feature learning capabilities of deep neural networks to learn representative hash codes to address the domain adaptation problem. Our extensive empirical studies on multiple transfer tasks corroborate the usefulness of the framework in learning efficient hash codes which outperform existing competitive baselines for unsupervised domain adaptation. |
|||||
2017 | Multiscale Quantization For Fast Similarity Search | Xiang Wu, Ruiqi Guo, Ananda Theertha Suresh, Sanjiv Kumar, Daniel N. Holtmann-rice, David Simcha, Felix Yu | Neural Information Processing Systems | We propose a multiscale quantization approach for fast similarity search on large, high-dimensional datasets. The key insight of the approach is that quantization methods, in particular product quantization, perform poorly when there is large variance in the norms of the data points. This is a common scenario for real- world datasets, especially when doing product quantization of residuals obtained from coarse vector quantization. To address this issue, we propose a multiscale formulation where we learn a separate scalar quantizer of the residual norm scales. All parameters are learned jointly in a stochastic gradient descent framework to minimize the overall quantization error. We provide theoretical motivation for the proposed technique and conduct comprehensive experiments on two large-scale public datasets, demonstrating substantial improvements in recall over existing state-of-the-art methods. |
|||||
2017 | Hierarchical Bloom Filter Trees For Approximate Matching | Lillis David, Breitinger Frank, Scanlon Mark | Arxiv | Bytewise approximate matching algorithms have in recent years shown significant promise in detecting files that are similar at the byte level. This is very useful for digital forensic investigators, who are regularly faced with the problem of searching through a seized device for pertinent data. A common scenario is where an investigator is in possession of a collection of “known-illegal” files (e.g. a collection of child abuse material) and wishes to find whether copies of these are stored on the seized device. Approximate matching addresses shortcomings in traditional hashing, which can only find identical files, by also being able to deal with cases of merged files, embedded files, partial files, or if a file has been changed in any way. Most approximate matching algorithms work by comparing pairs of files, which is not a scalable approach when faced with large corpora. This paper demonstrates the effectiveness of using a “Hierarchical Bloom Filter Tree” (HBFT) data structure to reduce the running time of collection-against-collection matching, with a specific focus on the MRSH-v2 algorithm. Three experiments are discussed, which explore the effects of different configurations of HBFTs. The proposed approach dramatically reduces the number of pairwise comparisons required, and demonstrates substantial speed gains, while maintaining effectiveness. |
|||||
2017 | Set-to-set Hashing With Applications In Visual Recognition | Jhuo I-hong, Wang Jun | Arxiv | Visual data, such as an image or a sequence of video frames, is often naturally represented as a point set. In this paper, we consider the fundamental problem of finding a nearest set from a collection of sets, to a query set. This problem has obvious applications in large-scale visual retrieval and recognition, and also in applied fields beyond computer vision. One challenge stands out in solving the problem: set representation and measure of similarity. Particularly, the query set and the sets in the dataset collection can have varying cardinalities. The training collection is large enough such that linear scan is impractical. We propose a simple representation scheme that encodes both statistical and structural information of the sets. The derived representations are integrated in a kernel framework for flexible similarity measurement. For the query set process, we adopt a learning-to-hash pipeline that turns the kernel representations into hash bits based on simple learners, using multiple kernel learning. Experiments on two visual retrieval datasets show unambiguously that our set-to-set hashing framework outperforms prior methods that do not take the set-to-set search setting into account. |
|||||
2017 | Efficient Inferencing Of Compressed Deep Neural Networks | Vooturi Dharma Teja, Goyal Saurabh, Choudhury Anamitra R., Sabharwal Yogish, Verma Ashish | Arxiv | The large number of weights in deep neural networks makes the models difficult to deploy in low memory environments such as mobile phones, IoT edge devices, as well as “inferencing as a service” environments on the cloud. Prior work has considered reduction in the size of the models, through compression techniques like pruning, quantization, Huffman encoding etc. However, efficient inferencing using the compressed models has received little attention, especially with the Huffman encoding in place. In this paper, we propose efficient parallel algorithms for inferencing of single images and batches, under various memory constraints. Our experimental results show that our approach of using a variable batch size for inferencing achieves 15-25\% performance improvement in the inference throughput for AlexNet, while maintaining memory and latency constraints. |
|||||
2017 | Unsupervised Triplet Hashing For Fast Image Retrieval | Huang Shanshan, Xiong Yichao, Zhang Ya, Wang Jia | Arxiv | Hashing has played a pivotal role in large-scale image retrieval. With the development of Convolutional Neural Networks (CNN), hashing learning has shown great promise. But existing methods are mostly tuned for classification and are not optimized for retrieval tasks, especially for instance-level retrieval. In this study, we propose a novel hashing method for large-scale image retrieval. Considering the difficulty in obtaining labeled datasets for the image retrieval task at large scale, we propose a novel CNN-based unsupervised hashing method, namely Unsupervised Triplet Hashing (UTH). The unsupervised hashing network is designed under the following three principles: 1) more discriminative representations for image retrieval; 2) minimum quantization loss between the original real-valued feature descriptors and the learned hash codes; 3) maximum information entropy for the learned hash codes. Extensive experiments on CIFAR-10, MNIST and In-shop datasets have shown that UTH outperforms several state-of-the-art unsupervised hashing methods in terms of retrieval accuracy. |
|||||
2017 | Online Hashing | Huang Long-kai, Yang Qiang, Zheng Wei-shi | Arxiv | Although hash function learning algorithms have achieved great success in recent years, most existing hash models are off-line, which are not suitable for processing sequential or online data. To address this problem, this work proposes an online hash model to accommodate data coming in stream for online learning. Specifically, a new loss function is proposed to measure the similarity loss between a pair of data samples in Hamming space. Then, a structured hash model is derived and optimized in a passive-aggressive way. Theoretical analysis on the upper bound of the cumulative loss for the proposed online hash model is provided. Furthermore, we extend our online hashing from a single-model to a multi-model online hashing that trains multiple models so as to retain diverse online hashing models and avoid biased updates. The competitive efficiency and effectiveness of the proposed online hash models are verified through extensive experiments on several large-scale datasets as compared to related hashing methods. |
|||||
2017 | Supervised Hashing Based On Energy Minimization | Hu Zihao, Luo Xiyi, Lu Hongtao, Yu Yong | Arxiv | Recently, supervised hashing methods have attracted much attention since they can optimize retrieval speed and storage cost while preserving semantic information. Because hash code learning is NP-hard, many methods resort to some form of relaxation technique. But the performance of these methods can easily deteriorate due to the relaxation. Luckily, many supervised hashing formulations can be viewed as energy functions, hence solving for hash codes is equivalent to learning marginals in the corresponding conditional random field (CRF). By minimizing the KL divergence between a fully factorized distribution and the Gibbs distribution of this CRF, a set of consistency equations can be obtained, but updating them in parallel may not yield a local optimum since the variational lower bound is not guaranteed to increase. In this paper, we use a linear approximation of the sigmoid function to convert these consistency equations to linear systems, which have a closed-form solution. By applying this novel technique to two classical hashing formulations, KSH and SPLH, we obtain two new methods called EM (energy minimizing based)-KSH and EM-SPLH. Experimental results on three datasets show the superiority of our methods. |
|||||
2017 | On The Vc-dimension Of Binary Codes | Hu Sihuang, Weinberger Nir, Shayevitz Ofer | SIAM J. Discrete Math. | We investigate the asymptotic rates of length-\(n\) binary codes with VC-dimension at most \(dn\) and minimum distance at least \(\delta n\). Two upper bounds are obtained, one as a simple corollary of a result by Haussler and the other via a shortening approach combining Sauer-Shelah lemma and the linear programming bound. Two lower bounds are given using Gilbert-Varshamov type arguments over constant-weight and Markov-type sets. |
|||||
2017 | Variant Tolerant Read Mapping Using Min-hashing | Quedenfeld Jens, Rahmann Sven | Arxiv | DNA read mapping is a ubiquitous task in bioinformatics, and many tools have been developed to solve the read mapping problem. However, there are two trends that are changing the landscape of read mapping: First, new sequencing technologies provide very long reads with high error rates (up to 15%). Second, many genetic variants in the population are known, so the reference genome is not considered as a single string over ACGT, but as a complex object containing these variants. Most existing read mappers do not handle these new circumstances appropriately. We introduce a new read mapper prototype called VATRAM that considers variants. It is based on Min-Hashing of q-gram sets of reference genome windows. Min-Hashing is one form of locality sensitive hashing. The variants are directly inserted into VATRAM's index, which leads to a fast mapping process. Our results show that VATRAM achieves better precision and recall than state-of-the-art read mappers like BWA under certain circumstances. VATRAM is open source and can be accessed at https://bitbucket.org/Quedenfeld/vatram-src/. |
|||||
2017 | Enhance Feature Discrimination For Unsupervised Hashing | Hoang Tuan, Do Thanh-toan, Tan Dang-khoa Le, Cheung Ngai-man | Arxiv | We introduce a novel approach to improve unsupervised hashing. Specifically, we propose a very efficient embedding method: Gaussian Mixture Model embedding (Gemb). The proposed method, using a Gaussian Mixture Model, embeds a feature vector into a low-dimensional vector and, simultaneously, enhances the discriminative property of features before passing them into hashing. Our experiments show that the proposed method boosts the hashing performance of many state-of-the-art methods, e.g., Binary Autoencoder (BA) [1] and Iterative Quantization (ITQ) [2], in standard evaluation metrics for the three main benchmark datasets. |
|||||
2017 | Hash Embeddings For Efficient Word Representations | Svenstrup Dan, Hansen Jonas Meinertz, Winther Ole | Arxiv | We present hash embeddings, an efficient method for representing words in a continuous vector form. A hash embedding may be seen as an interpolation between a standard word embedding and a word embedding created using a random hash function (the hashing trick). In hash embeddings each token is represented by \(k\) \(d\)-dimensional embedding vectors and one \(k\)-dimensional weight vector. The final \(d\)-dimensional representation of the token is the product of the two. Rather than fitting the embedding vectors for each token, these are selected by the hashing trick from a shared pool of \(B\) embedding vectors. Our experiments show that hash embeddings can easily deal with huge vocabularies consisting of millions of tokens. When using a hash embedding there is no need to create a dictionary before training nor to perform any kind of vocabulary pruning after training. We show that models trained using hash embeddings exhibit at least the same level of performance as models trained using regular embeddings across a wide range of tasks. Furthermore, the number of parameters needed by such an embedding is only a fraction of what is required by a regular embedding. Since standard embeddings and embeddings constructed using the hashing trick are actually just special cases of a hash embedding, hash embeddings can be considered an extension and improvement over the existing regular embedding types. |
|||||
2017 | Deep Supervised Discrete Hashing | Qi Li, Zhenan Sun, Ran He, Tieniu Tan | Neural Information Processing Systems | With the rapid growth of image and video data on the web, hashing has been extensively studied for image or video search in recent years. Benefiting from recent advances in deep learning, deep hashing methods have achieved promising results for image retrieval. However, there are some limitations of previous deep hashing methods (e.g., the semantic information is not fully exploited). In this paper, we develop a deep supervised discrete hashing algorithm based on the assumption that the learned binary codes should be ideal for classification. Both the pairwise label information and the classification information are used to learn the hash codes within one stream framework. We constrain the outputs of the last layer to be binary codes directly, which is rarely investigated in deep hashing algorithm. Because of the discrete nature of hash codes, an alternating minimization method is used to optimize the objective function. Experimental results have shown that our method outperforms current state-of-the-art methods on benchmark datasets. |
|||||
2017 | FLASH Randomized Algorithms Accelerated Over CPU-GPU For Ultra-high Dimensional Similarity Search | Wang Yiqiu, Shrivastava Anshumali, Wang Jonathan, Ryu Junghee | Arxiv | We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for \textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search system for ultra-high dimensional datasets on a single machine, that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging a LSH style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count based estimations, we reduce the computational and parallelization costs of similarity search, while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, social networks, etc. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset, using brute-force (\(n^2D\)), will require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results. |
|||||
2017 | Compact Environment-invariant Codes For Robust Visual Place Recognition | Jain Unnat, Namboodiri Vinay P., Pandey Gaurav | Arxiv | Robust visual place recognition (VPR) requires scene representations that are invariant to various environmental challenges such as seasonal changes and variations due to ambient lighting conditions during day and night. Moreover, a practical VPR system necessitates compact representations of environmental features. To satisfy these requirements, in this paper we suggest a modification to the existing pipeline of VPR systems to incorporate supervised hashing. The modified system learns (in a supervised setting) compact binary codes from image feature descriptors. These binary codes imbibe robustness to the visual variations exposed to it during the training phase, thereby, making the system adaptive to severe environmental changes. Also, incorporating supervised hashing makes VPR computationally more efficient and easy to implement on simple hardware. This is because binary embeddings can be learned over simple-to-compute features and the distance computation is also in the low-dimensional hamming space of binary codes. We have performed experiments on several challenging data sets covering seasonal, illumination and viewpoint variations. We also compare two widely used supervised hashing methods of CCAITQ and MLH and show that this new pipeline out-performs or closely matches the state-of-the-art deep learning VPR methods that are based on high-dimensional features extracted from pre-trained deep convolutional neural networks. |
|||||
2017 | Fast Spectral Ranking For Similarity Search | Iscen Ahmet, Avrithis Yannis, Tolias Giorgos, Furon Teddy, Chum Ondrej | Arxiv | Despite the success of deep learning on representing images for particular object retrieval, recent studies show that the learned representations still lie on manifolds in a high dimensional space. This makes the Euclidean nearest neighbor search biased for this task. Exploring the manifolds online remains expensive even if a nearest neighbor graph has been computed offline. This work introduces an explicit embedding reducing manifold search to Euclidean search followed by dot product similarity search. This is equivalent to linear graph filtering of a sparse signal in the frequency domain. To speed up online search, we compute an approximate Fourier basis of the graph offline. We improve the state of the art on particular object retrieval datasets including the challenging Instre dataset containing small objects. At a scale of \(10^5\) images, the offline cost is only a few hours, while query time is comparable to standard similarity search. |
|||||
2016 | Hashmod A Hashing Method For Scalable 3D Object Detection | Kehl Wadim, Tombari Federico, Navab Nassir, Ilic Slobodan, Lepetit Vincent | Arxiv | We present a scalable method for detecting objects and estimating their 3D poses in RGB-D data. To this end, we rely on an efficient representation of object views and employ hashing techniques to match these views against the input frame in a scalable way. While a similar approach already exists for 2D detection, we show how to extend it to estimate the 3D pose of the detected objects. In particular, we explore different hashing strategies and identify the one which is more suitable to our problem. We show empirically that the complexity of our method is sublinear with the number of objects and we enable detection and pose estimation of many 3D objects with high accuracy while outperforming the state-of-the-art in terms of runtime. |
|||||
2016 | Deep Supervised Hashing With Triplet Labels | Wang Xiaofang, Shi Yi, Kitani Kris M. | Arxiv | Hashing is one of the most popular and powerful approximate nearest neighbor search techniques for large-scale image retrieval. Most traditional hashing methods first represent images as off-the-shelf visual features and then produce hashing codes in a separate stage. However, off-the-shelf visual features may not be optimally compatible with the hash code learning procedure, which may result in sub-optimal hash codes. Recently, deep hashing methods have been proposed to simultaneously learn image features and hash codes using deep neural networks and have shown superior performance over traditional hashing methods. Most deep hashing methods are given supervised information in the form of pairwise labels or triplet labels. The current state-of-the-art deep hashing method DPSH~\cite{li2015feature}, which is based on pairwise labels, performs image feature learning and hash code learning simultaneously by maximizing the likelihood of pairwise similarities. Inspired by DPSH~\cite{li2015feature}, we propose a triplet label based deep hashing method which aims to maximize the likelihood of the given triplet labels. Experimental results show that our method outperforms all the baselines on CIFAR-10 and NUS-WIDE datasets, including the state-of-the-art method DPSH~\cite{li2015feature} and all the previous triplet label based deep hashing methods. |
|||||
2016 | Noisy 1-bit Compressed Sensing Embeddings Enjoy A Restricted Isometry Property | Spencer Scott | Arxiv | We investigate the sign-linear embeddings of 1-bit compressed sensing given by Gaussian measurements. One can give short arguments concerning a Restricted Isometry Property of such maps using Vapnik-Chervonenkis dimension of sparse hemispheres. This approach has a natural extension to the presence of additive white noise prior to quantization. Noisy one-bit mappings are shown to satisfy an RIP when the metric on the sphere is given by the noise. |
|||||
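To make the sign-linear map in the entry above concrete, here is a minimal NumPy sketch (not taken from the paper; the dimensions and perturbation level are arbitrary assumptions) of the noiseless embedding x ↦ sign(Gx) with a Gaussian matrix G, illustrating the geometric fact such RIP arguments build on: the normalized Hamming distance between sign patterns tracks the angular distance on the sphere.

```python
# Sketch only: Gaussian 1-bit (sign) measurements; parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 4000                      # ambient dimension, number of 1-bit measurements
G = rng.standard_normal((m, d))       # Gaussian measurement matrix

x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = x + 0.3 * rng.standard_normal(d); y /= np.linalg.norm(y)

bx, by = np.sign(G @ x), np.sign(G @ y)
hamming = np.mean(bx != by)                               # fraction of disagreeing signs
angle = np.arccos(np.clip(x @ y, -1.0, 1.0)) / np.pi      # normalized angular distance

print(f"normalized Hamming distance: {hamming:.3f}, angle/pi: {angle:.3f}")
```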
2016 | Deviation Results For Sparse Tables In Hashing With Linear Probing | Klein Thierry Imt, Lagnoux A Imt, Petit P Imt | Arxiv | We consider the model of hashing with linear probing and we establish the moderate and large deviations for the total displacement in sparse tables. In this context, Weibull-like-tailed random variables appear. Deviations for sums of such heavy-tailed random variables are studied in \cite{Nagaev69-1,Nagaev69-2}. Here we adapt the proofs therein to deal with conditioned sums of such variables and solve the open question in \cite{TFC12}. Along the way, we establish the deviations of the total displacement in full tables, which can be derived from the deviations of empirical processes of i.i.d.\ random variables established in \cite{Wu94}. |
|||||
2016 | Quantized Random Projections And Non-linear Estimation Of Cosine Similarity | Ping Li, Michael Mitzenmacher, Martin Slawski | Neural Information Processing Systems | Random projections constitute a simple, yet effective technique for dimensionality reduction with applications in learning and search problems. In the present paper, we consider the problem of estimating cosine similarities when the projected data undergo scalar quantization to \(b\) bits. We here argue that the maximum likelihood estimator (MLE) is a principled approach to deal with the non-linearity resulting from quantization, and subsequently study its computational and statistical properties. A specific focus is on the trade-off between bit depth and the number of projections given a fixed budget of bits for storage or transmission. Along the way, we also touch upon the existence of a qualitative counterpart to the Johnson-Lindenstrauss lemma in the presence of quantization. |
|||||
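As a small illustration of the setting above, the sketch below quantizes Gaussian random projections to \(b\) bits and estimates the cosine similarity with a simple plug-in estimator. The uniform quantizer, its clipping range, and the estimator are assumptions made here for brevity; they are not the paper's tuned quantizer or its MLE.

```python
# Illustrative sketch: b-bit scalar quantization of random projections,
# followed by a naive plug-in cosine estimate (not the paper's MLE).
import numpy as np

def quantize(v, bits, clip=3.0):
    """Uniform scalar quantization of projected values to 2**bits levels."""
    levels = 2 ** bits
    step = 2 * clip / levels
    idx = np.clip(np.floor((v + clip) / step), 0, levels - 1)
    return (idx + 0.5) * step - clip          # reconstruction value of each cell

rng = np.random.default_rng(1)
d, k, bits = 256, 2000, 2
x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = 0.8 * x + 0.6 * rng.standard_normal(d); y /= np.linalg.norm(y)

R = rng.standard_normal((k, d))               # random projection matrix
qx, qy = quantize(R @ x, bits), quantize(R @ y, bits)

est = (qx @ qy) / (np.linalg.norm(qx) * np.linalg.norm(qy))
print(f"true cosine: {x @ y:.3f}, estimate from {bits}-bit projections: {est:.3f}")
```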
2016 | Supervised Matrix Factorization For Cross-modality Hashing | Liu Hong, Ji Rongrong, Wu Yongjian, Hua Gang | Arxiv | Matrix factorization has been recently utilized for the task of multi-modal hashing for cross-modality visual search, where basis functions are learned to map data from different modalities to the same Hamming embedding. In this paper, we propose a novel cross-modality hashing algorithm termed Supervised Matrix Factorization Hashing (SMFH) which tackles the multi-modal hashing problem with a collective non-negative matrix factorization across the different modalities. In particular, SMFH employs a well-designed binary code learning algorithm to preserve the similarities among multi-modal original features through a graph regularization. At the same time, semantic labels, when available, are incorporated into the learning procedure. We conjecture that all these would facilitate preserving the most relevant information during the binary quantization process, and hence improve the retrieval accuracy. We demonstrate the superior performance of SMFH on three cross-modality visual search benchmarks, i.e., the PASCAL-Sentence, Wiki, and NUS-WIDE, with quantitative comparison to various state-of-the-art methods. |
|||||
2016 | Fast Approximate Furthest Neighbors With Data-dependent Hashing | Curtin Ryan R., Gardner Andrew B. | Arxiv | We present a novel hashing strategy for approximate furthest neighbor search that selects projection bases using the data distribution. This strategy leads to an algorithm, which we call DrusillaHash, that is able to outperform existing approximate furthest neighbor strategies. Our strategy is motivated by an empirical study of the behavior of the furthest neighbor search problem, which lends intuition for where our algorithm is most useful. We also present a variant of the algorithm that gives an absolute approximation guarantee; to our knowledge, this is the first such approximate furthest neighbor hashing approach to give such a guarantee. Performance studies indicate that DrusillaHash can achieve comparable levels of approximation to other algorithms while giving up to an order of magnitude speedup. An implementation is available in the mlpack machine learning library (found at http://www.mlpack.org). |
|||||
2016 | Learning A Deep ell_infty Encoder For Hashing | Wang Zhangyang, Yang Yingzhen, Chang Shiyu, Ling Qing, Huang Thomas S. | Arxiv | We investigate the \(\ell_\infty\)-constrained representation which demonstrates robustness to quantization errors, utilizing the tool of deep learning. Based on the Alternating Direction Method of Multipliers (ADMM), we formulate the original convex minimization problem as a feed-forward neural network, named \textit{Deep \(\ell_\infty\) Encoder}, by introducing the novel Bounded Linear Unit (BLU) neuron and modeling the Lagrange multipliers as network biases. Such a structural prior acts as an effective network regularization, and facilitates the model initialization. We then investigate the effective use of the proposed model in the application of hashing, by coupling the proposed encoders under a supervised pairwise loss, to develop a \textit{Deep Siamese \(\ell_\infty\) Network}, which can be optimized from end to end. Extensive experiments demonstrate the impressive performances of the proposed model. We also provide an in-depth analysis of its behaviors against the competitors. |
|||||
2016 | Large Margin Discriminant Dimensionality Reduction In Prediction Space | Mohammad Saberian, Jose Costa Pereira, Can Xu, Jian Yang, Nuno Vasconcelos | Neural Information Processing Systems | In this paper we establish a duality between boosting and SVM, and use this to derive a novel discriminant dimensionality reduction algorithm. In particular, using the multiclass formulation of boosting and SVM we note that both use a combination of mapping and linear classification to maximize the multiclass margin. In SVM this is implemented using a pre-defined mapping (induced by the kernel) and optimizing the linear classifiers. In boosting the linear classifiers are pre-defined and the mapping (predictor) is learned through a combination of weak learners. We argue that the intermediate mapping, e.g. the boosting predictor, preserves the discriminant aspects of the data, and by controlling the dimension of this mapping it is possible to achieve discriminant low dimensional representations for the data. We use the aforementioned duality and propose a new method, Large Margin Discriminant Dimensionality Reduction (LADDER), that jointly learns the mapping and the linear classifiers in an efficient manner. This leads to a data-driven mapping which can embed data into any number of dimensions. Experimental results show that this embedding can significantly improve performance on tasks such as hashing and image/scene classification. |
|||||
2016 | Binary Subspace Coding For Query-by-image Video Retrieval | Xu Ruicong, Yang Yang, Luo Yadan, Shen Fumin, Huang Zi, Shen Heng Tao | Arxiv | The query-by-image video retrieval (QBIVR) task has been attracting considerable research attention recently. However, most existing methods represent a video by either aggregating or projecting all its frames into a single datum point, which may easily cause severe information loss. In this paper, we propose an efficient QBIVR framework to enable an effective and efficient video search with image query. We first define a similarity-preserving distance metric between an image and its orthogonal projection in the subspace of the video, which can be equivalently transformed to a Maximum Inner Product Search (MIPS) problem. Besides, to boost the efficiency of solving the MIPS problem, we propose two asymmetric hashing schemes, which bridge the domain gap of images and videos. The first approach, termed Inner-product Binary Coding (IBC), preserves the inner relationships of images and videos in a common Hamming space. To further improve the retrieval efficiency, we devise a Bilinear Binary Coding (BBC) approach, which employs compact bilinear projections instead of a single large projection matrix. Extensive experiments have been conducted on four real-world video datasets to verify the effectiveness of our proposed approaches as compared to the state-of-the-arts. |
|||||
2016 | Deep Residual Hashing | Conjeti Sailesh, Roy Abhijit Guha, Katouzian Amin, Navab Nassir | Arxiv | Hashing aims at generating highly compact similarity preserving code words which are well suited for large-scale image retrieval tasks. Most existing hashing methods first encode the images as a vector of hand-crafted features followed by a separate binarization step to generate hash codes. This two-stage process may produce sub-optimal encoding. In this paper, for the first time, we propose a deep architecture for supervised hashing through residual learning, termed Deep Residual Hashing (DRH), for an end-to-end simultaneous representation learning and hash coding. The DRH model comprises four key elements: (1) a sub-network with multiple stacked residual blocks; (2) a hashing layer for binarization; (3) a supervised retrieval loss function based on neighbourhood component analysis for similarity preserving embedding; and (4) hashing related losses and regularisation to control the quantization error and improve the quality of hash coding. We present results of extensive experiments on a large public chest x-ray image database with co-morbidities and discuss the outcome showing substantial improvements over the latest state-of-the-art methods. |
|||||
2016 | Hilbert Exclusion Improved Metric Search Through Finite Isometric Embeddings | Connor Richard, Cardillo Franco Alberto, Vadicamo Lucia, Rabitti Fausto | ACM Transactions on Information Systems | Most research into similarity search in metric spaces relies upon the triangle inequality property. This property allows the space to be arranged according to relative distances to avoid searching some subspaces. We show that many common metric spaces, notably including those using Euclidean and Jensen-Shannon distances, also have a stronger property, sometimes called the four-point property: in essence, these spaces allow an isometric embedding of any four points in three-dimensional Euclidean space, as well as any three points in two-dimensional Euclidean space. In fact, we show that any space which is isometrically embeddable in Hilbert space has the stronger property. This property gives stronger geometric guarantees, and one in particular, which we name the Hilbert Exclusion property, allows any indexing mechanism which uses hyperplane partitioning to perform better. One outcome of this observation is that a number of state-of-the-art indexing mechanisms over high dimensional spaces can be easily extended to give a significant increase in performance; furthermore, the improvement given is greater in higher dimensions. This therefore leads to a significant improvement in the cost of metric search in these spaces. |
|||||
2016 | Improved Upper Bound On A(18,8) | Polak Sven | Arxiv | For nonnegative integers \(n\) and \(d\), let \(A(n,d)\) be the maximum cardinality of a binary code of length \(n\) and minimum distance at least \(d\). We consider a slight sharpening of the semidefinite programming bound of Gijswijt, Mittelmann and Schrijver, and obtain that \(A(18,8)\leq 70\). |
|||||
2016 | Fasttext.zip Compressing Text Classification Models | Joulin Armand, Grave Edouard, Bojanowski Piotr, Douze Matthijs, Jégou Hérve, Mikolov Tomas | Arxiv | We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory. After considering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent quantization artefacts. Our experiments carried out on several benchmarks show that our approach typically requires two orders of magnitude less memory than fastText while being only slightly inferior with respect to accuracy. As a result, it outperforms the state of the art by a good margin in terms of the compromise between memory usage and accuracy. |
|||||
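The sketch below shows the basic product-quantization idea that the compression approach above builds on: split each embedding into sub-blocks, learn a small codebook per block with k-means, and store only centroid indices. It uses scikit-learn's KMeans for brevity on synthetic vectors, and omits the additional tricks the paper applies to limit the accuracy loss.

```python
# Minimal product-quantization sketch (plain per-block k-means; parameters illustrative).
# A d-dim float vector is stored as m one-byte centroid indices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n, d, m, k = 5000, 64, 8, 256          # vectors, dim, sub-blocks, centroids per block
X = rng.standard_normal((n, d)).astype(np.float32)
sub = d // m

codebooks, codes = [], np.empty((n, m), dtype=np.uint8)
for j in range(m):
    block = X[:, j * sub:(j + 1) * sub]
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(block)
    codebooks.append(km.cluster_centers_)
    codes[:, j] = km.labels_

# Decompress from the codebooks and measure the reconstruction error.
X_hat = np.hstack([codebooks[j][codes[:, j]] for j in range(m)])
err = np.mean(np.linalg.norm(X - X_hat, axis=1) / np.linalg.norm(X, axis=1))
print(f"compressed to {m} bytes/vector, mean relative reconstruction error: {err:.3f}")
```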
2016 | Generalized Intersection Kernel | Li Ping | Arxiv | Following the very recent line of work on the |
|||||
2016 | Zero-shot Hashing Via Transferring Supervised Knowledge | Yang Yang, Chen Weilun, Luo Yadan, Shen Fumin, Shao Jie, Shen Heng Tao | Arxiv | Hashing has shown its efficiency and effectiveness in facilitating large-scale multimedia applications. Supervised knowledge (e.g. semantic labels or pair-wise relationships) associated with data is capable of significantly improving the quality of hash codes and hash functions. However, confronted with the rapid growth of newly-emerging concepts and multimedia data on the Web, existing supervised hashing approaches may easily suffer from the scarcity and validity of supervised information due to the expensive cost of manual labelling. In this paper, we propose a novel hashing scheme, termed zero-shot hashing (ZSH), which compresses images of “unseen” categories to binary codes with hash functions learned from limited training data of “seen” categories. Specifically, we project independent data labels (i.e. 0/1-form label vectors) into a semantic embedding space, where semantic relationships among all the labels can be precisely characterized and thus seen supervised knowledge can be transferred to unseen classes. Moreover, in order to cope with the semantic shift problem, we rotate the embedded space to more suitably align the embedded semantics with the low-level visual feature space, thereby alleviating the influence of the semantic gap. In the meantime, to exert positive effects on learning high-quality hash functions, we further propose to preserve local structural property and discrete nature in binary codes. Besides, we develop an efficient alternating algorithm to solve the ZSH model. Extensive experiments conducted on various real-life datasets show the superior zero-shot image retrieval performance of ZSH as compared to several state-of-the-art hashing methods. |
|||||
2016 | A Framework For Similarity Search With Space-time Tradeoffs Using Locality-sensitive Filtering | Christiani Tobias | Arxiv | We present a framework for similarity search based on Locality-Sensitive Filtering (LSF), generalizing the Indyk-Motwani (STOC 1998) Locality-Sensitive Hashing (LSH) framework to support space-time tradeoffs. Given a family of filters, defined as a distribution over pairs of subsets of space with certain locality-sensitivity properties, we can solve the approximate near neighbor problem in \(d\)-dimensional space for an \(n\)-point data set with query time \(dn^{\rho_q+o(1)}\), update time \(dn^{\rho_u+o(1)}\), and space usage \(dn + n^{1+\rho_u+o(1)}\), where the exponents \(\rho_q\) and \(\rho_u\) are determined by the locality-sensitivity properties of the filter family. |
|||||
2016 | Approximate Furthest Neighbor With Application To Annulus Query | Pagh Rasmus, Silvestri Francesco, Sivertsen Johan, Skala Matthew | Information Systems Available online | Much recent work has been devoted to approximate nearest neighbor queries. Motivated by applications in recommender systems, we consider approximate furthest neighbor (AFN) queries and present a simple, fast, and highly practical data structure for answering AFN queries in high-dimensional Euclidean space. The method builds on the technique of Indyk (SODA 2003), storing random projections to provide sublinear query time for AFN. However, we introduce a different query algorithm, improving on Indyk’s approximation factor and reducing the running time by a logarithmic factor. We also present a variation based on a query-independent ordering of the database points; while this does not have the provable approximation factor of the query-dependent data structure, it offers significant improvement in time and space complexity. We give a theoretical analysis, and experimental results. As an application, the query-dependent approach is used for deriving a data structure for the approximate annulus query problem, which is defined as follows: given an input set S and two parameters r > 0 and w >= 1, construct a data structure that returns for each query point q a point p in S such that the distance between p and q is at least r/w and at most wr. |
|||||
2016 | Set Similarity Search Beyond Minhash | Christiani Tobias, Pagh Rasmus | Arxiv | We consider the problem of approximate set similarity search under Braun-Blanquet similarity \(B(\mathbf{x}, \mathbf{y}) = |\mathbf{x} \cap \mathbf{y}| / \max(|\mathbf{x}|, |\mathbf{y}|)\). The \((b_1, b_2)\)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets \(P\) such that, given a query set \(\mathbf{q}\), if there exists \(\mathbf{x} \in P\) with \(B(\mathbf{q}, \mathbf{x}) \geq b_1\), then we can efficiently return \(\mathbf{x}' \in P\) with \(B(\mathbf{q}, \mathbf{x}') > b_2\). We present a simple data structure that solves this problem with space usage \(O(n^{1+\rho}\log n + \sum_{\mathbf{x} \in P}|\mathbf{x}|)\) and query time \(O(|\mathbf{q}|n^{\rho} \log n)\) where \(n = |P|\) and \(\rho = \log(1/b_1)/\log(1/b_2)\). Making use of existing lower bounds for locality-sensitive hashing by O’Donnell et al. (TOCT 2014) we show that this value of \(\rho\) is tight across the parameter space, i.e., for every choice of constants \(0 < b_2 < b_1 < 1\). In the case where all sets have the same size our solution strictly improves upon the value of \(\rho\) that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder’s MinHash (CCS 1997) for Jaccard similarity and Andoni et al.’s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015). |
|||||
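A quick worked example of the quantities used in the entry above: the Braun-Blanquet similarity and the exponent \(\rho = \log(1/b_1)/\log(1/b_2)\) that governs the stated space and query bounds. The thresholds chosen below are arbitrary, not values from the paper.

```python
# Worked example: Braun-Blanquet similarity and the exponent rho.
import math

def braun_blanquet(x: set, y: set) -> float:
    """B(x, y) = |x ∩ y| / max(|x|, |y|)."""
    return len(x & y) / max(len(x), len(y))

q = {1, 2, 3, 4, 5, 6, 7, 8}
x = {1, 2, 3, 4, 5, 6, 9, 10}
print("B(q, x) =", braun_blanquet(q, x))      # 6 / 8 = 0.75

b1, b2 = 0.75, 0.4                             # illustrative thresholds
rho = math.log(1 / b1) / math.log(1 / b2)
print(f"rho = {rho:.3f}  ->  query time ~ n^rho, space ~ n^(1+rho)")
```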
2016 | Fast Training Of Triplet-based Deep Binary Embedding Networks | Zhuang Bohan, Lin Guosheng, Shen Chunhua, Reid Ian | Arxiv | In this paper, we aim to learn a mapping (or embedding) from images to a compact binary space in which Hamming distances correspond to a ranking measure for the image retrieval task. We make use of a triplet loss because this has been shown to be most effective for ranking problems. However, training in previous works can be prohibitively expensive due to the fact that optimization is directly performed on the triplet space, where the number of possible triplets for training is cubic in the number of training examples. To address this issue, we propose to formulate high-order binary codes learning as a multi-label classification problem by explicitly separating learning into two interleaved stages. To solve the first stage, we design a large-scale high-order binary codes inference algorithm to reduce the high-order objective to a standard binary quadratic problem such that graph cuts can be used to efficiently infer the binary code which serve as the label of each training datum. In the second stage we propose to map the original image to compact binary codes via carefully designed deep convolutional neural networks (CNNs) and the hashing function fitting can be solved by training binary CNN classifiers. An incremental/interleaved optimization strategy is proffered to ensure that these two steps are interactive with each other during training for better accuracy. We conduct experiments on several benchmark datasets, which demonstrate both improved training time (by as much as two orders of magnitude) as well as producing state-of-the-art hashing for various retrieval tasks. |
|||||
2016 | Locality-sensitive Hashing Without False Negatives For L_p | Pacuk Andrzej, Sankowski Piotr, Wegrzycki Karol, Wygocki Piotr | Computing and Combinatorics - | In this paper, we show a construction of locality-sensitive hash functions without false negatives, i.e., which ensure collision for every pair of points within a given radius \(R\) in \(d\)-dimensional space equipped with the \(l_p\) norm when \(p \in [1,\infty]\). Furthermore, we show how to use these hash functions to solve the \(c\)-approximate nearest neighbor search problem without false negatives. Namely, if there is a point at distance \(R\), we will certainly report it, and points at distance greater than \(cR\) will not be reported, for \(c=\Omega(\sqrt{d},d^{1-\frac{1}{p}})\). The constructed algorithms work with preprocessing time \(\mathcal{O}(n \log(n))\) and sublinear expected query time. |
|||||
2016 | Theory Of The GMM Kernel | Li Ping, Zhang Cun-hui | Arxiv | We develop some theoretical results for a robust similarity measure named “generalized min-max” (GMM). This similarity has direct applications in machine learning as a positive definite kernel and can be efficiently computed via probabilistic hashing. Owing to its discrete nature, the hashed values can also be used for efficient near neighbor search. We prove the theoretical limit of GMM and the consistency result, assuming that the data follow an elliptical distribution, which is a very general family of distributions and includes the multivariate \(t\)-distribution as a special case. The consistency result holds as long as the data have bounded first moment (an assumption which essentially holds for datasets commonly encountered in practice). Furthermore, we establish the asymptotic normality of GMM. Compared to the “cosine” similarity which is routinely adopted in current practice in statistics and machine learning, the consistency of GMM requires much weaker conditions. Interestingly, when the data follow the \(t\)-distribution with \(\nu\) degrees of freedom, GMM typically provides a better measure of similarity than “cosine” roughly when \(\nu<8\) (which is already very close to normal). These theoretical results will help explain the recent success of GMM in learning tasks. |
|||||
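For reference, the generalized min-max similarity studied above can be computed directly as in the sketch below. This is a plain NumPy rendering of the usual definition (split each coordinate into its positive and negative parts, then take the sum of minima over the sum of maxima); it is not code from the paper, and the comparison data are synthetic.

```python
# GMM (generalized min-max) similarity for general real-valued vectors.
import numpy as np

def gmm_similarity(x, y):
    # Transform: (x_i)+ and (-x_i)+ become two nonnegative coordinates each.
    tx = np.concatenate([np.maximum(x, 0), np.maximum(-x, 0)])
    ty = np.concatenate([np.maximum(y, 0), np.maximum(-y, 0)])
    return np.minimum(tx, ty).sum() / np.maximum(tx, ty).sum()

rng = np.random.default_rng(3)
x = rng.standard_normal(50)
y = x + 0.5 * rng.standard_normal(50)
cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(f"GMM similarity: {gmm_similarity(x, y):.3f}, cosine: {cos:.3f}")
```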
2016 | Zero Shot Hashing | Pachori Shubham, Raman Shanmuganathan | Arxiv | This paper provides a framework to hash images containing instances of unknown object classes. In many object recognition problems, we might have access to huge amount of data. It may so happen that even this huge data doesn’t cover the objects belonging to classes that we see in our day to day life. Zero shot learning exploits auxiliary information (also called as signatures) in order to predict the labels corresponding to unknown classes. In this work, we attempt to generate the hash codes for images belonging to unseen classes, information of which is available only through the textual corpus. We formulate this as an unsupervised hashing formulation as the exact labels are not available for the instances of unseen classes. We show that the proposed solution is able to generate hash codes which can predict labels corresponding to unseen classes with appreciably good precision. |
|||||
2016 | Supervised Incremental Hashing | Ozdemir Bahadir, Najibi Mahyar, Davis Larry S. | Arxiv | We propose an incremental strategy for learning hash functions with kernels for large-scale image search. Our method is based on a two-stage classification framework that treats binary codes as intermediate variables between the feature space and the semantic space. In the first stage of classification, binary codes are considered as class labels by a set of binary SVMs; each corresponds to one bit. In the second stage, binary codes become the input space of a multi-class SVM. Hash functions are learned by an efficient algorithm where the NP-hard problem of finding optimal binary codes is solved via cyclic coordinate descent and SVMs are trained in a parallelized incremental manner. For modifications like adding images from a previously unseen class, we describe an incremental procedure for effective and efficient updates to the previous hash functions. Experiments on three large-scale image datasets demonstrate the effectiveness of the proposed hashing method, Supervised Incremental Hashing (SIH), over the state-of-the-art supervised hashing methods. |
|||||
2016 | Concurrent Hash Tables Fast And General(!) | Maier Tobias, Sanders Peter, Dementiev Roman | Arxiv | Concurrent hash tables are one of the most important concurrent data structures with numerous applications. Since hash table accesses can dominate the execution time of the overall application, we need implementations that achieve good speedup. Unfortunately, currently available concurrent hashing libraries turn out to be far away from this requirement in particular when contention on some elements occurs. Our starting point for better performing data structures is a fast and simple lock-free concurrent hash table based on linear probing that is limited to word-sized key-value types and does not support dynamic size adaptation. We explain how to lift these limitations in a provably scalable way and demonstrate that dynamic growing has a performance overhead comparable to the same generalization in sequential hash tables. We perform extensive experiments comparing the performance of our implementations with six of the most widely used concurrent hash tables. Ours are considerably faster than the best algorithms with similar restrictions and an order of magnitude faster than the best more general tables. In some extreme cases, the difference even approaches four orders of magnitude. |
|||||
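For readers unfamiliar with the underlying layout, here is a minimal sequential linear-probing table in Python. It only illustrates the probing scheme the paper starts from; the paper's actual contributions (lock-free concurrency and dynamic growing) and even table resizing are deliberately omitted.

```python
# Sequential linear-probing hash table sketch (no resizing, no concurrency).
class LinearProbingTable:
    EMPTY = object()

    def __init__(self, capacity=16):
        # Assumes the table never fills up; illustration only.
        self.keys = [self.EMPTY] * capacity
        self.vals = [None] * capacity
        self.capacity = capacity

    def _slot(self, key):
        i = hash(key) % self.capacity
        while self.keys[i] is not self.EMPTY and self.keys[i] != key:
            i = (i + 1) % self.capacity       # probe the next cell on collision
        return i

    def put(self, key, value):
        i = self._slot(key)
        self.keys[i], self.vals[i] = key, value

    def get(self, key, default=None):
        i = self._slot(key)
        return self.vals[i] if self.keys[i] == key else default

t = LinearProbingTable()
t.put("ann", 1); t.put("lsh", 2)
print(t.get("lsh"), t.get("missing"))
```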
2016 | 2-bit Random Projections Nonlinear Estimators And Approximate Near Neighbor Search | Li Ping, Mitzenmacher Michael, Shrivastava Anshumali | Arxiv | The method of random projections has become a standard tool for machine learning, data mining, and search with massive data at Web scale. The effective use of random projections requires efficient coding schemes for quantizing (real-valued) projected data into integers. In this paper, we focus on a simple 2-bit coding scheme. In particular, we develop accurate nonlinear estimators of data similarity based on the 2-bit strategy. This work will have important practical applications. For example, in the task of near neighbor search, a crucial step (often called re-ranking) is to compute or estimate data similarities once a set of candidate data points have been identified by hash table techniques. This re-ranking step can take advantage of the proposed coding scheme and estimator. As a related task, in this paper, we also study a simple uniform quantization scheme for the purpose of building hash tables with projected data. Our analysis shows that typically only a small number of bits are needed. For example, when the target similarity level is high, 2 or 3 bits might be sufficient. When the target similarity level is not so high, it is preferable to use only 1 or 2 bits. Therefore, a 2-bit scheme appears to be overall a good choice for the task of sublinear time approximate near neighbor search via hash tables. Combining these results, we conclude that 2-bit random projections should be recommended for approximate near neighbor search and similarity estimation. Extensive experimental results are provided. |
|||||
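A small sketch of the 2-bit coding idea described above: each random projection is mapped to one of four regions split at \(-w, 0, +w\). The threshold \(w\) and the plain code-agreement count used at the end are illustrative assumptions, not the paper's tuned parameters or its nonlinear estimators.

```python
# Illustrative 2-bit coding of random projections (threshold w is arbitrary here).
import numpy as np

def two_bit_code(projected, w=0.75):
    """Map each real projection to a code in {0, 1, 2, 3} via thresholds -w, 0, +w."""
    return np.digitize(projected, bins=[-w, 0.0, w]).astype(np.uint8)

rng = np.random.default_rng(4)
d, k = 128, 512
R = rng.standard_normal((k, d))

x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = 0.9 * x + np.sqrt(1 - 0.9**2) * rng.standard_normal(d); y /= np.linalg.norm(y)

cx, cy = two_bit_code(R @ x), two_bit_code(R @ y)
agreement = np.mean(cx == cy)      # code collision rate rises with similarity
print(f"cosine ~ {x @ y:.2f}, fraction of matching 2-bit codes: {agreement:.3f}")
```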
2016 | Scalable Gaussian Processes For Supervised Hashing | Ozdemir Bahadir, Davis Larry S. | Arxiv | We propose a flexible procedure for large-scale image search by hash functions with kernels. Our method treats binary codes and pairwise semantic similarity as latent and observed variables, respectively, in a probabilistic model based on Gaussian processes for binary classification. We present an efficient inference algorithm with the sparse pseudo-input Gaussian process (SPGP) model and parallelization. Experiments on three large-scale image datasets demonstrate the effectiveness of the proposed hashing method, Gaussian Process Hashing (GPH), for short binary codes and datasets without predefined classes in comparison to the state-of-the-art supervised hashing methods. |
|||||
2016 | Ordinal Constrained Binary Code Learning For Nearest Neighbor Search | Liu Hong, Ji Rongrong, Wu Yongjian, Huang Feiyue | Arxiv | Recent years have witnessed extensive attention in binary code learning, a.k.a. hashing, for nearest neighbor search problems. It has been seen that high-dimensional data points can be quantized into binary codes to give an efficient similarity approximation via Hamming distance. Among existing schemes, ranking-based hashing is a recent promising direction that targets preserving ordinal relations of ranking in the Hamming space to minimize retrieval loss. However, the size of the ranking tuples, which shows the ordinal relations, is quadratic or cubic in the number of training samples. Given a large-scale training data set, it is very expensive to embed such ranking tuples in binary code learning. Besides, it remains difficult to build ranking tuples efficiently for most ranking-preserving hashing methods, which are deployed over an ordinal graph-based setting. To handle these problems, we propose a novel ranking-preserving hashing method, dubbed Ordinal Constraint Hashing (OCH), which efficiently learns the optimal hashing functions with a graph-based approximation to embed the ordinal relations. The core idea is to reduce the size of the ordinal graph with ordinal constraint projection, which preserves the ordinal relations through a small data set (such as clusters or random samples). In particular, to learn such hash functions effectively, we further relax the discrete constraints and design a specific stochastic gradient descent algorithm for optimization. Experimental results on three large-scale visual search benchmark datasets, i.e. LabelMe, Tiny100K and GIST1M, show that the proposed OCH method can achieve superior performance over the state-of-the-art approaches. |
|||||
2016 | Revisiting Winner Take All (WTA) Hashing For Sparse Datasets | Chen Beidi, Shrivastava Anshumali | Arxiv | WTA (Winner Take All) hashing has been successfully applied in many large scale vision applications. This hashing scheme was tailored to take advantage of the comparative reasoning (or order based information), which showed significant accuracy improvements. In this paper, we identify a subtle issue with WTA, which grows with the sparsity of the datasets. This issue limits the discriminative power of WTA. We then propose a solution for this problem based on the idea of Densification which provably fixes the issue. Our experiments show that Densified WTA Hashing outperforms Vanilla WTA both in image classification and retrieval tasks consistently and significantly. |
|||||
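For context, a standard (non-densified) WTA hash can be sketched as below: permute the coordinates with a shared random permutation and record which of the first \(K\) entries is largest. On very sparse inputs the \(K\)-window can be entirely zero, which is the failure mode the paper identifies; the densified fix itself is not reproduced here, and all parameters are illustrative.

```python
# Plain WTA (Winner Take All) hashing sketch; shared permutations via a fixed seed.
import numpy as np

def wta_hashes(x, num_hashes=16, K=4, seed=0):
    rng = np.random.default_rng(seed)          # same seed -> same permutations for all inputs
    d = len(x)
    codes = []
    for _ in range(num_hashes):
        perm = rng.permutation(d)
        window = x[perm[:K]]
        codes.append(int(np.argmax(window)))   # index in 0..K-1 of the winner
    return codes

rng = np.random.default_rng(5)
x = rng.random(100)
y = x + 0.05 * rng.standard_normal(100)        # a slightly perturbed copy
hx, hy = wta_hashes(x), wta_hashes(y)
print("matching hashes:", sum(a == b for a, b in zip(hx, hy)), "of", len(hx))
```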
2016 | Unsupervised Cross-media Hashing With Structure Preservation | Wang Xiangyu, Chia Alex Yong-sang | Arxiv | Recent years have seen the exponential growth of heterogeneous multimedia data. The need for effective and accurate data retrieval from heterogeneous data sources has attracted much research interest in cross-media retrieval. Here, given a query of any media type, cross-media retrieval seeks to find relevant results of different media types from heterogeneous data sources. To facilitate large-scale cross-media retrieval, we propose a novel unsupervised cross-media hashing method. Our method incorporates local affinity and distance repulsion constraints into a matrix factorization framework. Correspondingly, the proposed method learns hash functions that generates unified hash codes from different media types, while ensuring intrinsic geometric structure of the data distribution is preserved. These hash codes empower the similarity between data of different media types to be evaluated directly. Experimental results on two large-scale multimedia datasets demonstrate the effectiveness of the proposed method, where we outperform the state-of-the-art methods. |
|||||
2016 | An Algorithm For L1 Nearest Neighbor Search Via Monotonic Embedding | Xinan Wang, Sanjoy Dasgupta | Neural Information Processing Systems | Fast algorithms for nearest neighbor (NN) search have in large part focused on L2 distance. Here we develop an approach for L1 distance that begins with an explicit and exact embedding of the points into L2. We show how this embedding can efficiently be combined with random projection methods for L2 NN search, such as locality-sensitive hashing or random projection trees. We rigorously establish the correctness of the methodology and show by experimentation that it is competitive in practice with available alternatives. |
|||||
2016 | Robust Hashing For Multi-view Data Jointly Learning Low-rank Kernelized Similarity Consensus And Hash Functions | Wu Lin, Wang Yang | Arxiv | Learning hash functions/codes for similarity search over multi-view data is attracting increasing attention, where similar hash codes are assigned to the data objects characterizing consistently neighborhood relationship across views. Traditional methods in this category inherently suffer three limitations: 1) they commonly adopt a two-stage scheme where similarity matrix is first constructed, followed by a subsequent hash function learning; 2) these methods are commonly developed on the assumption that data samples with multiple representations are noise-free,which is not practical in real-life applications; 3) they often incur cumbersome training model caused by the neighborhood graph construction using all \(N\) points in the database (\(O(N)\)). In this paper, we motivate the problem of jointly and efficiently training the robust hash functions over data objects with multi-feature representations which may be noise corrupted. To achieve both the robustness and training efficiency, we propose an approach to effectively and efficiently learning low-rank kernelized \footnote{We use kernelized similarity rather than kernel, as it is not a squared symmetric matrix for data-landmark affinity matrix.} hash functions shared across views. Specifically, we utilize landmark graphs to construct tractable similarity matrices in multi-views to automatically discover neighborhood structure in the data. To learn robust hash functions, a latent low-rank kernel function is used to construct hash functions in order to accommodate linearly inseparable data. In particular, a latent kernelized similarity matrix is recovered by rank minimization on multiple kernel-based similarity matrices. Extensive experiments on real-world multi-view datasets validate the efficacy of our method in the presence of error corruptions. |
|||||
2016 | Randomised Relevance Model | Wurzer Dominik, Osborne Miles, Lavrenko Victor | Arxiv | Relevance Models are well-known retrieval models and capable of producing competitive results. However, because they use query expansion they can be very slow. We address this slowness by incorporating two variants of locality sensitive hashing (LSH) into the query expansion process. Results on two document collections suggest that we can obtain large reductions in the amount of work, with a small reduction in effectiveness. Our approach is shown to be additive when pruning query terms. |
|||||
2016 | An Ensemble Diversity Approach To Supervised Binary Hashing | Carreira-perpiñán Miguel Á., Raziperchikolaei Ramin | Arxiv | Binary hashing is a well-known approach for fast approximate nearest-neighbor search in information retrieval. Much work has focused on affinity-based objective functions involving the hash functions or binary codes. These objective functions encode neighborhood information between data points and are often inspired by manifold learning algorithms. They ensure that the hash functions differ from each other through constraints or penalty terms that encourage codes to be orthogonal or dissimilar across bits, but this couples the binary variables and complicates the already difficult optimization. We propose a much simpler approach: we train each hash function (or bit) independently from each other, but introduce diversity among them using techniques from classifier ensembles. Surprisingly, we find that not only is this faster and trivially parallelizable, but it also improves over the more complex, coupled objective function, and achieves state-of-the-art precision and recall in experiments with image retrieval. |
|||||
2016 | Query-adaptive Image Retrieval By Deep Weighted Hashing | Zhang Jian, Peng Yuxin | Arxiv | Hashing methods have attracted much attention for large scale image retrieval. Some deep hashing methods have achieved promising results by taking advantage of the strong representation power of deep networks recently. However, existing deep hashing methods treat all hash bits equally. On one hand, a large number of images share the same distance to a query image due to the discrete Hamming distance, which raises a critical issue of image retrieval where fine-grained rankings are very important. On the other hand, different hash bits actually contribute to the image retrieval differently, and treating them equally greatly affects the retrieval accuracy of image. To address the above two problems, we propose the query-adaptive deep weighted hashing (QaDWH) approach, which can perform fine-grained ranking for different queries by weighted Hamming distance. First, a novel deep hashing network is proposed to learn the hash codes and corresponding class-wise weights jointly, so that the learned weights can reflect the importance of different hash bits for different image classes. Second, a query-adaptive image retrieval method is proposed, which rapidly generates hash bit weights for different query images by fusing its semantic probability and the learned class-wise weights. Fine-grained image retrieval is then performed by the weighted Hamming distance, which can provide more accurate ranking than the traditional Hamming distance. Experiments on four widely used datasets show that the proposed approach outperforms eight state-of-the-art hashing methods. |
|||||
2016 | Scalable Discrete Supervised Hash Learning With Asymmetric Matrix Factorization | Zhang Shifeng, Li Jianmin, Guo Jinma, Zhang Bo | Arxiv | Hashing methods map similar data to binary hash codes with smaller Hamming distance, and have received broad attention due to their low storage cost and fast retrieval speed. However, existing limitations make it difficult for present algorithms to deal with large-scale datasets: (1) discrete constraints are involved in the learning of the hash function; (2) pairwise or triplet similarity is adopted to generate efficient hash codes, resulting in both time and space complexity greater than O(n^2). To address these issues, we propose a novel discrete supervised hash learning framework which can be scalable to large-scale datasets. First, the discrete learning procedure is decomposed into a binary classifier learning scheme and a binary code learning scheme, which makes the learning procedure more efficient. Second, we adopt the Asymmetric Low-rank Matrix Factorization and propose the Fast Clustering-based Batch Coordinate Descent method, such that the time and space complexity are reduced to O(n). The proposed framework also provides a flexible paradigm to incorporate arbitrary hash functions, including deep neural networks and kernel methods. Experiments on large-scale datasets demonstrate that the proposed method is superior or comparable to state-of-the-art hashing algorithms. |
|||||
2016 | Transitive Hashing Network For Heterogeneous Multimedia Retrieval | Cao Zhangjie, Long Mingsheng, Yang Qiang | Arxiv | Hashing has been widely applied to large-scale multimedia retrieval due to the storage and retrieval efficiency. Cross-modal hashing enables efficient retrieval from database of one modality in response to a query of another modality. Existing work on cross-modal hashing assumes heterogeneous relationship across modalities for hash function learning. In this paper, we relax the strong assumption by only requiring such heterogeneous relationship in an auxiliary dataset different from the query/database domain. We craft a hybrid deep architecture to simultaneously learn the cross-modal correlation from the auxiliary dataset, and align the dataset distributions between the auxiliary dataset and the query/database domain, which generates transitive hash codes for heterogeneous multimedia retrieval. Extensive experiments exhibit that the proposed approach yields state of the art multimedia retrieval performance on public datasets, i.e. NUS-WIDE, ImageNet-YahooQA. |
|||||
2016 | Scalable And Sustainable Deep Learning Via Randomized Hashing | Spring Ryan, Shrivastava Anshumali | Arxiv | Current deep learning architectures are growing larger in order to learn from complex datasets. These architectures require giant matrix multiplication operations to train millions of parameters. Conversely, there is another growing trend to bring deep learning to low-power, embedded devices. The matrix operations, associated with both training and testing of deep networks, are very expensive from a computational and energy standpoint. We present a novel hashing based technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select the nodes with the highest activation efficiently. Our new algorithm for deep learning reduces the overall computational cost of forward and back-propagation by operating on significantly fewer (sparse) nodes. As a consequence, our algorithm uses only 5% of the total multiplications, while keeping on average within 1% of the accuracy of the original model. A unique property of the proposed hashing based back-propagation is that the updates are always sparse. Due to the sparse gradient updates, our algorithm is ideally suited for asynchronous and parallel training leading to near linear speedup with increasing number of cores. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations on several real datasets. |
|||||
2016 | Functional Hashing For Compressing Neural Networks | Shi Lei, Feng Shikun, Zhu Zhifan | Arxiv | As the complexity of deep neural networks (DNNs) grows to absorb increasing data sizes, memory and energy consumption have been receiving more and more attention in industrial applications, especially on mobile devices. This paper presents a novel structure based on functional hashing to compress DNNs, namely FunHashNN. For each entry in a deep net, FunHashNN uses multiple low-cost hash functions to fetch values in the compression space, and then employs a small reconstruction network to recover that entry. The reconstruction network is plugged into the whole network and trained jointly. FunHashNN includes the recently proposed HashedNets as a degenerate case, and benefits from larger value capacity and less reconstruction loss. We further discuss extensions with dual space hashing and multi-hops. On several benchmark datasets, FunHashNN demonstrates high compression ratios with little loss in prediction accuracy. |
|||||
2016 | LOH And Behold Web-scale Visual Search Recommendation And Clustering Using Locally Optimized Hashing | Kalantidis Yannis, Kennedy Lyndon, Nguyen Huy, Mellina Clayton, Shamma David A. | Arxiv | We propose a novel hashing-based matching scheme, called Locally Optimized Hashing (LOH), based on a state-of-the-art quantization algorithm that can be used for efficient, large-scale search, recommendation, clustering, and deduplication. We show that matching with LOH only requires set intersections and summations to compute and so is easily implemented in generic distributed computing systems. We further show application of LOH to: a) large-scale search tasks where performance is on par with other state-of-the-art hashing approaches; b) large-scale recommendation where queries consisting of thousands of images can be used to generate accurate recommendations from collections of hundreds of millions of images; and c) efficient clustering with a graph-based algorithm that can be scaled to massive collections in a distributed environment or can be used for deduplication for small collections, like search results, performing better than traditional hashing approaches while only requiring a few milliseconds to run. In this paper we experiment on datasets of up to 100 million images, but in practice our system can scale to larger collections and can be used for other types of data that have a vector representation in a Euclidean space. |
|||||
2016 | A Revisit Of Hashing Algorithms For Approximate Nearest Neighbor Search | Cai Deng | Arxiv | Approximate Nearest Neighbor Search (ANNS) is a fundamental problem in many areas of machine learning and data mining. During the past decade, numerous hashing algorithms have been proposed to solve this problem. Every proposed algorithm claims to outperform other state-of-the-art hashing methods. However, the evaluation of these hashing papers was not thorough enough, and those claims should be re-examined. The ultimate goal of an ANNS method is returning the most accurate answers (nearest neighbors) in the shortest time. If implemented correctly, almost all the hashing methods will have their performance improved as the code length increases. However, many existing hashing papers only report the performance with the code length shorter than 128. In this paper, we carefully revisit the problem of search with a hash index, and analyze the pros and cons of two popular hash index search procedures. We then propose a very simple but effective two-level index structure and make a thorough comparison of eleven popular hashing algorithms. Surprisingly, the random-projection-based Locality Sensitive Hashing (LSH) is the best-performing algorithm, which is in contradiction to the claims in all the other ten hashing papers. Despite the extreme simplicity of random-projection-based LSH, our results show that the capability of this algorithm has been far underestimated. For the sake of reproducibility, all the codes used in the paper are released on GitHub, which can be used as a testing platform for a fair comparison between various hashing algorithms. |
|||||
2016 | Fast Cross-polytope Locality-sensitive Hashing | Kennedy Christopher, Ward Rachel | Arxiv | We provide a variant of cross-polytope locality sensitive hashing with respect to angular distance which is provably optimal in asymptotic sensitivity and enjoys \(\mathcal{O}(d \ln d)\) hash computation time. Building on a recent result (by Andoni, Indyk, Laarhoven, Razenshteyn, Schmidt, 2015), we show that optimal asymptotic sensitivity for cross-polytope LSH is retained even when the dense Gaussian matrix is replaced by a fast Johnson-Lindenstrauss transform followed by a discrete pseudo-rotation, reducing the hash computation time from \(\mathcal{O}(d^2)\) to \(\mathcal{O}(d \ln d)\). Moreover, our scheme achieves the optimal rate of convergence for sensitivity. By incorporating a low-randomness Johnson-Lindenstrauss transform, our scheme can be modified to require only \(\mathcal{O}(\ln^9(d))\) random bits. |
|||||
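The baseline cross-polytope hash that the speedup above targets can be sketched as follows: rotate the point with a dense Gaussian matrix and hash to the nearest signed standard basis vector. The fast Johnson-Lindenstrauss transform and pseudo-rotation that bring the cost down to \(\mathcal{O}(d \ln d)\) are not shown; all dimensions and perturbations below are illustrative.

```python
# Basic cross-polytope LSH with a dense Gaussian "rotation" (baseline, not the fast variant).
import numpy as np

def cross_polytope_hash(x, A):
    """Hash of x: the closest signed standard basis vector of A @ x."""
    z = A @ x
    i = int(np.argmax(np.abs(z)))
    return (i, int(np.sign(z[i])))            # (coordinate index, sign)

rng = np.random.default_rng(6)
d = 64
A = rng.standard_normal((d, d))

x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = x + 0.2 * rng.standard_normal(d); y /= np.linalg.norm(y)
z = rng.standard_normal(d);  z /= np.linalg.norm(z)

print("hash(x):", cross_polytope_hash(x, A))
print("hash(y):", cross_polytope_hash(y, A), "(near point, likely to collide)")
print("hash(z):", cross_polytope_hash(z, A), "(random point, unlikely to collide)")
```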
2016 | SSDH Semi-supervised Deep Hashing For Large Scale Image Retrieval | Zhang Jian, Peng Yuxin | Arxiv | Hashing methods have been widely used for efficient similarity retrieval on large scale image database. Traditional hashing methods learn hash functions to generate binary codes from hand-crafted features, which achieve limited accuracy since the hand-crafted features cannot optimally represent the image content and preserve the semantic similarity. Recently, several deep hashing methods have shown better performance because the deep architectures generate more discriminative feature representations. However, these deep hashing methods are mainly designed for supervised scenarios, which only exploit the semantic similarity information, but ignore the underlying data structures. In this paper, we propose the semi-supervised deep hashing (SSDH) approach, to perform more effective hash function learning by simultaneously preserving semantic similarity and underlying data structures. The main contributions are as follows: (1) We propose a semi-supervised loss to jointly minimize the empirical error on labeled data, as well as the embedding error on both labeled and unlabeled data, which can preserve the semantic similarity and capture the meaningful neighbors on the underlying data structures for effective hashing. (2) A semi-supervised deep hashing network is designed to extensively exploit both labeled and unlabeled data, in which we propose an online graph construction method to benefit from the evolving deep features during training to better capture semantic neighbors. To the best of our knowledge, the proposed deep network is the first deep hashing method that can perform hash code learning and feature learning simultaneously in a semi-supervised fashion. Experimental results on 5 widely-used datasets show that our proposed approach outperforms the state-of-the-art hashing methods. |
|||||
2016 | Efficient Similarity Search In Dynamic Data Streams | Bury Marc, Schwiegelshohn Chris, Sorella Mara | Arxiv | The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs both due to time and space constraints, giving rise to faster approximate methods. The algorithm of choice used to quickly compute the Jaccard index \(\frac{\vert A \cap B \vert}{\vert A\cup B\vert}\) of two item sets \(A\) and \(B\) is usually a form of min-hashing. Most min-hashing schemes are maintainable in data streams processing only additions, but none are known to work when facing item-wise deletions. In this paper, we investigate scalable approximation algorithms for rational set similarities, a broad class of similarity measures including Jaccard. Motivated by a result of Chierichetti and Kumar [J. ACM 2015] who showed any rational set similarity \(S\) admits a locality sensitive hashing (LSH) scheme if and only if the corresponding distance \(1-S\) is a metric, we can show that there exists a space efficient summary maintaining a \((1\pm \epsilon)\) multiplicative approximation to \(1-S\) in dynamic data streams. This in turn also yields a \(\epsilon\) additive approximation of the similarity. The existence of these approximations hints at, but does not directly imply a LSH scheme in dynamic data streams. Our second and main contribution now lies in the design of such a LSH scheme maintainable in dynamic data streams. The scheme is space efficient, easy to implement and to the best of our knowledge the first of its kind able to process deletions. |
|||||
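As background for the entry above, here is a plain insertion-only MinHash sketch for estimating the Jaccard index. The affine hash family and its parameters are arbitrary choices for illustration, and, unlike the paper's construction, this scheme cannot handle item deletions from the stream.

```python
# Insertion-only MinHash sketch for Jaccard estimation (illustrative hash family).
import numpy as np

def minhash_signature(items, num_hashes=128, seed=0):
    rng = np.random.default_rng(seed)
    p = 2**31 - 1                              # prime modulus for the affine hashes
    a = rng.integers(1, p, size=num_hashes)
    b = rng.integers(0, p, size=num_hashes)
    sig = np.full(num_hashes, p, dtype=np.int64)
    for it in items:
        h = (a * (hash(it) % p) + b) % p       # one value per hash function
        sig = np.minimum(sig, h)               # keep the running minimum
    return sig

A = set(range(0, 80))
B = set(range(20, 100))
sa, sb = minhash_signature(A), minhash_signature(B)
est = np.mean(sa == sb)                        # fraction of matching minima
true = len(A & B) / len(A | B)
print(f"estimated Jaccard: {est:.3f}, exact Jaccard: {true:.3f}")
```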
2016 | Sub-linear Privacy-preserving Near-neighbor Search | Riazi M. Sadegh, Chen Beidi, Shrivastava Anshumali, Wallach Dan, Koushanfar Farinaz | Arxiv | In Near-Neighbor Search (NNS), a new client queries a database (held by a server) for the most similar data (near-neighbors) given a certain similarity metric. The Privacy-Preserving variant (PP-NNS) requires that neither server nor the client shall learn information about the other party’s data except what can be inferred from the outcome of NNS. The overwhelming growth in the size of current datasets and the lack of a truly secure server in the online world render the existing solutions impractical; either due to their high computational requirements or non-realistic assumptions which potentially compromise privacy. PP-NNS having query time {\it sub-linear} in the size of the database has been suggested as an open research direction by Li et al. (CCSW’15). In this paper, we provide the first such algorithm, called Secure Locality Sensitive Indexing (SLSI) which has a sub-linear query time and the ability to handle honest-but-curious parties. At the heart of our proposal lies a secure binary embedding scheme generated from a novel probabilistic transformation over locality sensitive hashing family. We provide information theoretic bound for the privacy guarantees and support our theoretical claims using substantial empirical evidence on real-world datasets. |
|||||
2016 | Faster Kernels For Graphs With Continuous Attributes Via Hashing | Morris Christopher, Kriege Nils M., Kersting Kristian, Mutzel Petra | Arxiv | While state-of-the-art kernels for graphs with discrete labels scale well to graphs with thousands of nodes, the few existing kernels for graphs with continuous attributes, unfortunately, do not scale well. To overcome this limitation, we present hash graph kernels, a general framework to derive kernels for graphs with continuous attributes from discrete ones. The idea is to iteratively turn continuous attributes into discrete labels using randomized hash functions. We illustrate hash graph kernels for the Weisfeiler-Lehman subtree kernel and for the shortest-path kernel. The resulting novel graph kernels are shown to be, both, able to handle graphs with continuous attributes and scalable to large graphs and data sets. This is supported by our theoretical analysis and demonstrated by an extensive experimental evaluation. |
|||||
2016 | Correlation Hashing Network For Efficient Cross-modal Retrieval | Cao Yue, Long Mingsheng, Wang Jianmin, Yu Philip S. | Arxiv | Hashing is widely applied to approximate nearest neighbor search for large-scale multimodal retrieval with storage and computation efficiency. Cross-modal hashing improves the quality of hash coding by exploiting semantic correlations across different modalities. Existing cross-modal hashing methods first transform data into low-dimensional feature vectors, and then generate binary codes by another separate quantization step. However, suboptimal hash codes may be generated since the quantization error is not explicitly minimized and the feature representation is not jointly optimized with the binary codes. This paper presents a Correlation Hashing Network (CHN) approach to cross-modal hashing, which jointly learns good data representation tailored to hash coding and formally controls the quantization error. The proposed CHN is a hybrid deep architecture that consists of a convolutional neural network for learning good image representations, a multilayer perceptron for learning good text representations, two hashing layers for generating compact binary codes, and a structured max-margin loss that integrates all things together to enable learning similarity-preserving and high-quality hash codes. Extensive empirical study shows that CHN yields state-of-the-art cross-modal retrieval performance on standard benchmarks. |
|||||
2016 | Nested Invariance Pooling And RBM Hashing For Image Instance Retrieval | Morère Olivier, Lin Jie, Veillard Antoine, Chandrasekhar Vijay, Poggio Tomaso | Arxiv | The goal of this work is the computation of very compact binary hashes for image instance retrieval. Our approach has two novel contributions. The first one is Nested Invariance Pooling (NIP), a method inspired from i-theory, a mathematical theory for computing group invariant transformations with feed-forward neural networks. NIP is able to produce compact and well-performing descriptors with visual representations extracted from convolutional neural networks. We specifically incorporate scale, translation and rotation invariances but the scheme can be extended to any arbitrary sets of transformations. We also show that using moments of increasing order throughout nesting is important. The NIP descriptors are then hashed to the target code size (32-256 bits) with a Restricted Boltzmann Machine with a novel batch-level regularization scheme specifically designed for the purpose of hashing (RBMH). A thorough empirical evaluation with state-of-the-art shows that the results obtained both with the NIP descriptors and the NIP+RBMH hashes are consistently outstanding across a wide range of datasets. |
|||||
2016 | Scalability And Total Recall With Fast CoveringLSH | Pham Ninh, Pagh Rasmus | Arxiv | Locality-sensitive hashing (LSH) has emerged as the dominant algorithmic technique for similarity search with strong performance guarantees in high-dimensional spaces. A drawback of traditional LSH schemes is that they may have false negatives, i.e., the recall is less than 100\%. This limits the applicability of LSH in settings requiring precise performance guarantees. Building on the recent theoretical “CoveringLSH” construction that eliminates false negatives, we propose a fast and practical covering LSH scheme for Hamming space called Fast CoveringLSH (fcLSH). Inheriting the design benefits of CoveringLSH, our method avoids false negatives and always reports all near neighbors. Compared to CoveringLSH, we achieve an asymptotic improvement to the hash function computation time from \(\mathcal{O}(dL)\) to \(\mathcal{O}(d + L\log{L})\), where \(d\) is the dimensionality of data and \(L\) is the number of hash tables. Our experiments on synthetic and real-world data sets demonstrate that fcLSH is comparable (and often superior) to traditional hashing-based approaches for search radius up to 20 in high-dimensional Hamming space. |
|||||
2016 | Massively-parallel Similarity Join Edge-isoperimetry And Distance Correlations On The Hypercube | Beame Paul, Rashtchian Cyrus | Arxiv | We study distributed protocols for finding all pairs of similar vectors in a large dataset. Our results pertain to a variety of discrete metrics, and we give concrete instantiations for Hamming distance. In particular, we give improved upper bounds on the overhead required for similarity defined by Hamming distance \(r>1\) and prove a lower bound showing qualitative optimality of the overhead required for similarity over any Hamming distance \(r\). Our main conceptual contribution is a connection between similarity search algorithms and certain graph-theoretic quantities. For our upper bounds, we exhibit a general method for designing one-round protocols using edge-isoperimetric shapes in similarity graphs. For our lower bounds, we define a new combinatorial optimization problem, which can be stated in purely graph-theoretic terms yet also captures the core of the analysis in previous theoretical work on distributed similarity joins. As one of our main technical results, we prove new bounds on distance correlations in subsets of the Hamming cube. |
|||||
2016 | Leveraging Sparsity For Efficient Submodular Data Summarization | Erik Lindgren, Shanshan Wu, Alexandros G. Dimakis | Neural Information Processing Systems | The facility location problem is widely used for summarizing large datasets and has additional applications in sensor placement, image retrieval, and clustering. One difficulty of this problem is that submodular optimization algorithms require the calculation of pairwise benefits for all items in the dataset. This is infeasible for large problems, so recent work proposed to only calculate nearest neighbor benefits. One limitation is that several strong assumptions were invoked to obtain provable approximation guarantees. In this paper we establish that these extra assumptions are not necessary—solving the sparsified problem will be almost optimal under the standard assumptions of the problem. We then analyze a different method of sparsification that is a better model for methods such as Locality Sensitive Hashing to accelerate the nearest neighbor computations and extend the use of the problem to a broader family of similarities. We validate our approach by demonstrating that it rapidly generates interpretable summaries. |
|||||
2016 | A Survey On Learning To Hash | Wang Jingdong, Zhang Ting, Song Jingkuan, Sebe Nicu, Shen Heng Tao | Arxiv | Nearest neighbor search is a problem of finding the data points from the database such that the distances from them to the query point are the smallest. Learning to hash is one of the major solutions to this problem and has been widely studied recently. In this paper, we present a comprehensive survey of the learning to hash algorithms, categorize them according to the manners of preserving the similarities into: pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, as well as quantization, and discuss their relations. We separate quantization from pairwise similarity preserving as the objective function is very different though quantization, as we show, can be derived from preserving the pairwise similarities. In addition, we present the evaluation protocols, and the general performance analysis, and point out that the quantization algorithms perform superiorly in terms of search accuracy, search time cost, and space cost. Finally, we introduce a few emerging topics. |
|||||
2016 | Extraction Of Layout Entities And Sub-layout Query-based Retrieval Of Document Images | Bansal Anukriti, Roy Sumantra Dutta, Harit Gaurav | Arxiv | Layouts and sub-layouts constitute an important clue while searching a document on the basis of its structure, or when textual content is unknown/irrelevant. A sub-layout specifies the arrangement of document entities within a smaller portion of the document. We propose an efficient graph-based matching algorithm, integrated with hash-based indexing, to prune a possibly large search space. A user can specify a combination of sub-layouts of interest using sketch-based queries. The system supports partial matching for unspecified layout entities. We handle cases of segmentation pre-processing errors (for text/non-text blocks) with a symmetry maximization-based strategy, and by accounting for multiple domain-specific plausible segmentation hypotheses. We show promising results of our system on a database of unstructured entities, containing 4776 newspaper images. |
|||||
2016 | Group Invariant Deep Representations For Image Instance Retrieval | Morère Olivier, Veillard Antoine, Lin Jie, Petta Julie, Chandrasekhar Vijay, Poggio Tomaso | Arxiv | Most image instance retrieval pipelines are based on comparison of vectors known as global image descriptors between a query image and the database images. Due to their success in large scale image classification, representations extracted from Convolutional Neural Networks (CNN) are quickly gaining ground on Fisher Vectors (FVs) as state-of-the-art global descriptors for image instance retrieval. While CNN-based descriptors are generally remarked for good retrieval performance at lower bitrates, they nevertheless present a number of drawbacks including the lack of robustness to common object transformations such as rotations compared with their interest point based FV counterparts. In this paper, we propose a method for computing invariant global descriptors from CNNs. Our method implements a recently proposed mathematical theory for invariance in a sensory cortex modeled as a feedforward neural network. The resulting global descriptors can be made invariant to multiple arbitrary transformation groups while retaining good discriminativeness. Based on a thorough empirical evaluation using several publicly available datasets, we show that our method is able to significantly and consistently improve retrieval results every time a new type of invariance is incorporated. We also show that our method which has few parameters is not prone to overfitting: improvements generalize well across datasets with different properties with regard to invariances. Finally, we show that our descriptors are able to compare favourably to other state-of-the-art compact descriptors in similar bit ranges, exceeding the highest retrieval results reported in the literature on some datasets. A dedicated dimensionality reduction step (quantization or hashing) may be able to further improve the competitiveness of the descriptors. |
|||||
2016 | A Faster Algorithm For Cuckoo Insertion And Bipartite Matching In Large Graphs | Khosla Megha, Anand Avishek | Algorithmica | Hash tables are ubiquitous in computer science for efficient access to large datasets. However, there is always a need for approaches that offer compact memory utilisation without substantial degradation of lookup performance. Cuckoo hashing is an efficient technique for creating hash tables with high space utilisation and offers a guaranteed constant access time. We are given \(n\) locations and \(m\) items. Each item has to be placed in one of the \(k\ge2\) locations chosen by \(k\) random hash functions. By allowing more than one choice for a single item, cuckoo hashing resembles multiple-choice allocation schemes. In addition, it supports dynamically changing the location of an item among its possible locations. We propose and analyse an insertion algorithm for cuckoo hashing that runs in linear time with high probability and in expectation. Previous work on total allocation time has analysed breadth first search, and it was shown to be linear only in expectation. Our algorithm finds an assignment (with probability 1) whenever it exists. In contrast, the other known insertion method, known as random walk insertion, may run indefinitely even for a solvable instance. We also present experimental results comparing the performance of our algorithm with the random walk method, also for the case when each location can hold more than one item. As a corollary we obtain a linear time algorithm (with high probability and in expectation) for finding perfect matchings in a special class of sparse random bipartite graphs. We support this by performing experiments on a real world large dataset for finding maximum matchings in general large bipartite graphs. We report an order of magnitude improvement in the running time as compared to the Hopcroft-Karp matching algorithm. |
|||||
2016 | Dual Purpose Hashing | Liu Haomiao, Wang Ruiping, Shan Shiguang, Chen Xilin | Arxiv | Recent years have seen more and more demand for a unified framework to address multiple realistic image retrieval tasks concerning both category and attributes. Considering the scale of modern datasets, hashing is favorable for its low complexity. However, most existing hashing methods are designed to preserve one single kind of similarity, thus improper for dealing with the different tasks simultaneously. To overcome this limitation, we propose a new hashing method, named Dual Purpose Hashing (DPH), which jointly preserves the category and attribute similarities by exploiting the Convolutional Neural Network (CNN) models to hierarchically capture the correlations between category and attributes. Since images with both category and attribute labels are scarce, our method is designed to take the abundant partially labelled images on the Internet as training inputs. With such a framework, the binary codes of new-coming images can be readily obtained by quantizing the network outputs of a binary-like layer, and the attributes can be recovered from the codes easily. Experiments on two large-scale datasets show that our dual purpose hash codes can achieve comparable or even better performance than those state-of-the-art methods specifically designed for each individual retrieval task, while being more compact than the compared methods. |
|||||
2016 | Structured Learning Of Binary Codes With Column Generation | Lin Guosheng, Liu Fayao, Shen Chunhua, Wu Jianxin, Shen Heng Tao | Arxiv | Hashing methods aim to learn a set of hash functions which map the original features to compact binary codes while preserving similarity in the Hamming space. Hashing has proven a valuable tool for large-scale information retrieval. We propose a column generation based binary code learning framework for data-dependent hash function learning. Given a set of triplets that encode the pairwise similarity comparison information, our column generation based method learns hash functions that preserve the relative comparison relations within the large-margin learning framework. Our method iteratively learns the best hash functions during the column generation procedure. Existing hashing methods optimize over simple objectives such as the reconstruction error or graph Laplacian related loss functions, instead of the performance evaluation criteria of interest—multivariate performance measures such as the AUC and NDCG. Our column generation based method can be further generalized from the triplet loss to a general structured learning based framework that allows one to directly optimize multivariate performance measures. For optimizing general ranking measures, the resulting optimization problem can involve exponentially or infinitely many variables and constraints, which is more challenging than standard structured output learning. We use a combination of column generation and cutting-plane techniques to solve the optimization problem. To speed up the training we further explore stage-wise training and propose to use a simplified NDCG loss for efficient inference. We demonstrate the generality of our method by applying it to ranking prediction and image retrieval, and show that it outperforms a few state-of-the-art hashing methods. |
|||||
2016 | A Simple Hash Class With Strong Randomness Properties In Graphs And Hypergraphs | Aumüller Martin, Dietzfelbinger Martin, Woelfel Philipp | Arxiv | We study randomness properties of graphs and hypergraphs generated by simple hash functions. Several hashing applications can be analyzed by studying the structure of \(d\)-uniform random (\(d\)-partite) hypergraphs obtained from a set \(S\) of \(n\) keys and \(d\) randomly chosen hash functions \(h_1,\dots,h_d\) by associating each key \(x\in S\) with a hyperedge \(\{h_1(x),\dots, h_d(x)\}\). Often it is assumed that \(h_1,\dots,h_d\) exhibit a high degree of independence. We present a simple construction of a hash class whose hash functions have small constant evaluation time and can be stored in sublinear space. We devise general techniques to analyze the randomness properties of the graphs and hypergraphs generated by these hash functions, and we show that they can replace other, less efficient constructions in cuckoo hashing (with and without stash), the simulation of a uniform hash function, the construction of a perfect hash function, generalized cuckoo hashing and different load balancing scenarios. |
|||||
2016 | Approximate Search With Quantized Sparse Representations | Jain Himalaya, Pérez Patrick, Gribonval Rémi, Zepeda Joaquin, Jégou Hervé | Arxiv | This paper tackles the task of storing a large collection of vectors, such as visual descriptors, and of searching in it. To this end, we propose to approximate database vectors by constrained sparse coding, where possible atom weights are restricted to belong to a finite subset. This formulation encompasses, as particular cases, previous state-of-the-art methods such as product or residual quantization. As opposed to traditional sparse coding methods, quantized sparse coding includes memory usage as a design constraint, thereby allowing us to index a large collection such as the BIGANN billion-sized benchmark. Our experiments, carried out on standard benchmarks, show that our formulation leads to competitive solutions when considering different trade-offs between learning/coding time, index size and search quality. |
|||||
2016 | Regular And Almost Universal Hashing An Efficient Implementation | Ivanchykhin Dmytro, Ignatchenko Sergey, Lemire Daniel | Software Practice and Experience | Random hashing can provide guarantees regarding the performance of data structures such as hash tables—even in an adversarial setting. Many existing families of hash functions are universal: given two data objects, the probability that they have the same hash value is low given that we pick hash functions at random. However, universality fails to ensure that all hash functions are well behaved. We further require regularity: when picking data objects at random they should have a low probability of having the same hash value, for any fixed hash function. We present the efficient implementation of a family of non-cryptographic hash functions (PM+) offering good running times, good memory usage as well as distinguishing theoretical guarantees: almost universality and component-wise regularity. On a variety of platforms, our implementations are comparable to the state of the art in performance. On recent Intel processors, PM+ achieves a speed of 4.7 bytes per cycle for 32-bit outputs and 3.3 bytes per cycle for 64-bit outputs. We review vectorization through SIMD instructions (e.g., AVX2) and optimizations for superscalar execution. |
|||||
2016 | Hash2vec Feature Hashing For Word Embeddings | Argerich Luis, Zaffaroni Joaquín Torré, Cano Matías J | Arxiv | In this paper we propose the application of feature hashing to create word embeddings for natural language processing. Feature hashing has been used successfully to create document vectors in related tasks like document classification. In this work we show that feature hashing can be applied to obtain word embeddings in linear time with the size of the data. The results show that this algorithm, which does not need training, is able to capture the semantic meaning of words. We compare the results against GloVe showing that they are similar. As far as we know this is the first application of feature hashing to the word embeddings problem and the results indicate this is a scalable technique with practical results for NLP applications. |
|||||
2016 | Exact Weighted Minwise Hashing In Constant Time | Shrivastava Anshumali | Arxiv | Weighted minwise hashing (WMH) is one of the fundamental subroutines required by many celebrated approximation algorithms, commonly adopted in industrial practice for large-scale search and learning. The resource bottleneck of the algorithms is the computation of multiple (typically a few hundreds to thousands) independent hashes of the data. The fastest hashing algorithm is by Ioffe \cite{Proc:Ioffe_ICDM10}, which requires one pass over the entire data vector, \(O(d)\) (\(d\) is the number of non-zeros), for computing one hash. However, the requirement of multiple hashes demands hundreds or thousands of passes over the data. This is very costly for modern massive datasets. In this work, we break this expensive barrier and show an expected constant amortized time algorithm which computes \(k\) independent and unbiased WMH in time \(O(k)\) instead of \(O(dk)\) required by Ioffe’s method. Moreover, our proposal only needs a few bits (5 - 9 bits) of storage per hash value compared to around \(64\) bits required by the state-of-the-art methodologies. Experimental evaluations, on real datasets, show that for computing 500 WMH, our proposal can be 60000x faster than Ioffe’s method without losing any accuracy. Our method is also around 100x faster than approximate heuristics capitalizing on the efficient “densified” one permutation hashing schemes \cite{Proc:OneHashLSH_ICML14}. Given the simplicity of our approach and its significant advantages, we hope that it will replace existing implementations in practice. |
|||||
2016 | Note On Optimal Trees For Parallel Hash Functions | Atighehchi Kevin | Arxiv | A recent work shows how we can optimize a tree based mode of operation for a rate 1 hash function. In particular, an algorithm and a theorem are presented for selecting a good tree topology in order to optimize both the running time and the number of processors at each step of the computation. Because this paper deals only with trees having their leaves at the same depth, the number of saved computing resources is perfectly optimal only for this category of trees. In this note, we address the more general case and describe a simple algorithm which, starting from such a tree topology, reworks it to further reduce the number of processors and the total amount of work done to hash a message. |
|||||
2016 | Optimizing Affinity-based Binary Hashing Using Auxiliary Coordinates | Ramin Raziperchikolaei, Miguel A. Carreira-perpinan | Neural Information Processing Systems | In supervised binary hashing, one wants to learn a function that maps a high-dimensional feature vector to a vector of binary codes, for application to fast image retrieval. This typically results in a difficult optimization problem, nonconvex and nonsmooth, because of the discrete variables involved. Much work has simply relaxed the problem during training, solving a continuous optimization, and truncating the codes a posteriori. This gives reasonable results but is quite suboptimal. Recent work has tried to optimize the objective directly over the binary codes and achieved better results, but the hash function was still learned a posteriori, which remains suboptimal. We propose a general framework for learning hash functions using affinity-based loss functions that uses auxiliary coordinates. This closes the loop and optimizes jointly over the hash functions and the binary codes so that they gradually match each other. The resulting algorithm can be seen as an iterated version of the procedure of optimizing first over the codes and then learning the hash function. Compared to this, our optimization is guaranteed to obtain better hash functions while being not much slower, as demonstrated experimentally in various supervised datasets. In addition, our framework facilitates the design of optimization algorithms for arbitrary types of loss and hash functions. |
|||||
2016 | Lower Bounds On Time-space Trade-offs For Approximate Near Neighbors | Andoni Alexandr, Laarhoven Thijs, Razenshteyn Ilya, Waingarten Erik | Arxiv | We show tight lower bounds for the entire trade-off between space and query time for the Approximate Near Neighbor search problem. Our lower bounds hold in a restricted model of computation, which captures all hashing-based approaches. In particular, our lower bound matches the upper bound recently shown in [Laarhoven 2015] for the random instance on a Euclidean sphere (which we show in fact extends to the entire space \(\mathbb{R}^d\) using the techniques from [Andoni, Razenshteyn 2015]). We also show tight, unconditional cell-probe lower bounds for one and two probes, improving upon the best known bounds from [Panigrahy, Talwar, Wieder 2010]. In particular, this is the first space lower bound (for any static data structure) for two probes which is not polynomially smaller than for one probe. To show the result for two probes, we establish and exploit a connection to locally-decodable codes. |
|||||
2016 | Approximate Near Neighbors For General Symmetric Norms | Andoni Alexandr, Nguyen Huy L., Nikolov Aleksandar, Razenshteyn Ilya, Waingarten Erik | Arxiv | We show that every symmetric normed space admits an efficient nearest neighbor search data structure with doubly-logarithmic approximation. Specifically, for every \(n\), \(d = n^{o(1)}\), and every \(d\)-dimensional symmetric norm \(\|\cdot\|\), there exists a data structure for \(\mathrm{poly}(\log\log n)\)-approximate nearest neighbor search over \(\|\cdot\|\) for \(n\)-point datasets achieving \(n^{o(1)}\) query time and \(n^{1+o(1)}\) space. The main technical ingredient of the algorithm is a low-distortion embedding of a symmetric norm into a low-dimensional iterated product of top-\(k\) norms. We also show that our techniques cannot be extended to general norms. |
|||||
2016 | Optimal Hashing-based Time-space Trade-offs For Approximate Near Neighbors | Andoni Alexandr, Laarhoven Thijs, Razenshteyn Ilya, Waingarten Erik | Arxiv | [See the paper for the full abstract.] We show tight upper and lower bounds for time-space trade-offs for the \(c\)-Approximate Near Neighbor Search problem. For the \(d\)-dimensional Euclidean space and \(n\)-point datasets, we develop a data structure with space \(n^{1 + \rho_u + o(1)} + O(dn)\) and query time \(n^{\rho_q + o(1)} + d n^{o(1)}\) for every \(\rho_u, \rho_q \geq 0\) such that: \begin{equation} c^2 \sqrt{\rho_q} + (c^2 - 1) \sqrt{\rho_u} = \sqrt{2c^2 - 1}. \end{equation} This is the first data structure that achieves sublinear query time and near-linear space for every approximation factor \(c > 1\), improving upon [Kapralov, PODS 2015]. The data structure is a culmination of a long line of work on the problem for all space regimes; it builds on Spherical Locality-Sensitive Filtering [Becker, Ducas, Gama, Laarhoven, SODA 2016] and data-dependent hashing [Andoni, Indyk, Nguyen, Razenshteyn, SODA 2014] [Andoni, Razenshteyn, STOC 2015]. Our matching lower bounds are of two types: conditional and unconditional. First, we prove tightness of the whole above trade-off in a restricted model of computation, which captures all known hashing-based approaches. We then show unconditional cell-probe lower bounds for one and two probes that match the above trade-off for \(\rho_q = 0\), improving upon the best known lower bounds from [Panigrahy, Talwar, Wieder, FOCS 2010]. In particular, this is the first space lower bound (for any static data structure) for two probes which is not polynomially smaller than the one-probe bound. To show the result for two probes, we establish and exploit a connection to locally-decodable codes. |
|||||
2016 | A Refined Analysis Of LSH For Well-dispersed Data Points | Mou Wenlong, Wang Liwei | Arxiv | Near neighbor problems are fundamental in algorithms for high-dimensional Euclidean spaces. While classical approaches suffer from the curse of dimensionality, locality sensitive hashing (LSH) can effectively solve the \(a\)-approximate \(r\)-near neighbor problem, and has been proven to be optimal in the worst case. However, for real-world data sets, LSH can naturally benefit from well-dispersed data and low doubling dimension, leading to significantly improved performance. In this paper, we address this issue and propose a refined analysis of the running time of approximate near neighbor queries via LSH. We characterize the dispersion of data using \(N_b\), the number of \(b \cdot r\)-near pairs among the data points. Combined with an optimal data-oblivious LSH scheme, we get a new query time bound depending on \(N_b\) and the doubling dimension. For many natural scenarios where points are well-dispersed or lying in a low-doubling-dimension space, our result leads to sharper performance than the existing worst-case analysis. This paper not only presents the first rigorous proof of how LSH makes use of the structure of data points, but also provides important insights into parameter setting in the practice of LSH beyond the worst case. Besides, the techniques in our analysis involve a generalized version of the sphere packing problem, which might be of some independent interest. |
|||||
2016 | Using Apache Lucene To Search Vector Of Locally Aggregated Descriptors | Amato Giuseppe, Bolettieri Paolo, Falchi Fabrizio, Gennaro Claudio, Vadicamo Lucia | Arxiv | Surrogate Text Representation (STR) is a profitable solution to efficient similarity search on metric spaces using conventional text search engines, such as Apache Lucene. This technique is based on comparing the permutations of some reference objects in place of the original metric distance. However, the Achilles heel of the STR approach is the need to reorder the result set of the search according to the metric distance. This forces the use of a support database to store the original objects, which requires efficient random I/O on a fast secondary memory (such as flash-based storages). In this paper, we propose to extend the Surrogate Text Representation to specifically address a class of visual metric objects known as Vector of Locally Aggregated Descriptors (VLAD). This approach is based on representing the individual sub-vectors forming the VLAD vector with the STR, providing a finer representation of the vector and enabling us to get rid of the reordering phase. The experiments on a publicly available dataset show that the extended STR outperforms the baseline STR achieving satisfactory performance close to that obtained with the original VLAD vectors. |
|||||
2016 | On Reducing The Number Of Visual Words In The Bag-of-features Representation | Amato Giuseppe, Falchi Fabrizio, Gennaro Claudio | VISAPP | A new class of applications based on visual search engines is emerging, especially on smart-phones that have evolved into powerful tools for processing images and videos. The state-of-the-art algorithms for large visual content recognition and content based similarity search today use the “Bag of Features” (BoF) or “Bag of Words” (BoW) approach. The idea, borrowed from text retrieval, enables the use of inverted files. A very well known issue with this approach is that the query images, as well as the stored data, are described with thousands of words. This poses obvious efficiency problems when using inverted files to perform efficient image matching. In this paper, we propose and compare various techniques to reduce the number of words describing an image to improve efficiency and we study the effects of this reduction on effectiveness in landmark recognition and retrieval scenarios. We show that very relevant improvements in performance are achievable while still preserving the advantages of the BoF-based approach. |
|||||
2016 | Fast Keyed Hash/pseudo-random Function Using SIMD Multiply And Permute | Alakuijala Jyrki, Cox Bill, Wassenberg Jan | Arxiv | HighwayHash is a new pseudo-random function based on SIMD multiply and permute instructions for thorough and fast hashing. It is 5.2 times as fast as SipHash for 1 KiB inputs. An open-source implementation is available under a permissive license. We discuss design choices and provide statistical analysis, speed measurements and preliminary cryptanalysis. Assuming it withstands further analysis, strengthened variants may also substantially accelerate file checksums and stream ciphers. |
|||||
2016 | Parameter-free Locality Sensitive Hashing For Spherical Range Reporting | Ahle Thomas D., Aumüller Martin, Pagh Rasmus | Arxiv | We present a data structure for spherical range reporting on a point set \(S\), i.e., reporting all points in \(S\) that lie within radius \(r\) of a given query point \(q\). Our solution builds upon the Locality-Sensitive Hashing (LSH) framework of Indyk and Motwani, which represents the asymptotically best solutions to near neighbor problems in high dimensions. While traditional LSH data structures have several parameters whose optimal values depend on the distance distribution from \(q\) to the points of \(S\), our data structure is parameter-free, except for the space usage, which is configurable by the user. Nevertheless, its expected query time basically matches that of an LSH data structure whose parameters have been optimally chosen for the data and query in question under the given space constraints. In particular, our data structure provides a smooth trade-off between hard queries (typically addressed by standard LSH) and easy queries such as those where the number of points to report is a constant fraction of \(S\), or where almost all points in \(S\) are far away from the query point. In contrast, known data structures fix LSH parameters based on certain parameters of the input alone. The algorithm has expected query time bounded by \(O(t (n/t)^\rho)\), where \(t\) is the number of points to report and \(\rho\in (0,1)\) depends on the data distribution and the strength of the LSH family used. We further present a parameter-free way of using multi-probing, for LSH families that support it, and show that for many such families this approach allows us to get expected query time close to \(O(n^\rho+t)\), which is the best we can hope to achieve using LSH. The previously best running time in high dimensions was \(\Omega(t n^\rho)\). For many data distributions where the intrinsic dimensionality of the point set close to \(q\) is low, we can give improved upper bounds on the expected query time. |
|||||
2016 | Elliptic Curve Multiset Hash | Maitin-shepard Jeremy, Tibouchi Mehdi, Aranha Diego | Arxiv | A homomorphic, or incremental, multiset hash function, associates a hash value to arbitrary collections of objects (with possible repetitions) in such a way that the hash of the union of two collections is easy to compute from the hashes of the two collections themselves: it is simply their sum under a suitable group operation. In particular, hash values of large collections can be computed incrementally and/or in parallel. Homomorphic hashing is thus a very useful primitive with applications ranging from database integrity verification to streaming set/multiset comparison and network coding. Unfortunately, constructions of homomorphic hash functions in the literature are hampered by two main drawbacks: they tend to be much longer than usual hash functions at the same security level (e.g. to achieve a collision resistance of 2^128, they are several thousand bits long, as opposed to 256 bits for usual hash functions), and they are also quite slow. In this paper, we introduce the Elliptic Curve Multiset Hash (ECMH), which combines a usual bit string-valued hash function like BLAKE2 with an efficient encoding into binary elliptic curves to overcome both difficulties. On the one hand, the size of ECMH digests is essentially optimal: 2m-bit hash values provide O(2^m) collision resistance. On the other hand, we demonstrate a highly-efficient software implementation of ECMH, which our thorough empirical evaluation shows to be capable of processing over 3 million set elements per second on a 4 GHz Intel Haswell machine at the 128-bit security level—many times faster than previous practical methods. |
|||||
2016 | Near-isometric Binary Hashing For Large-scale Datasets | Aghazadeh Amirali, Lan Andrew, Shrivastava Anshumali, Baraniuk Richard | Arxiv | We develop a scalable algorithm to learn binary hash codes for indexing large-scale datasets. Near-isometric binary hashing (NIBH) is a data-dependent hashing scheme that quantizes the output of a learned low-dimensional embedding to obtain a binary hash code. In contrast to conventional hashing schemes, which typically rely on an \(\ell_2\)-norm (i.e., average distortion) minimization, NIBH is based on an \(\ell_{\infty}\)-norm (i.e., worst-case distortion) minimization that provides several benefits, including superior distance, ranking, and near-neighbor preservation performance. We develop a practical and efficient algorithm for NIBH based on column generation that scales well to large datasets. A range of experimental evaluations demonstrate the superiority of NIBH over ten state-of-the-art binary hashing schemes. |
|||||
2016 | An Ensemble Diversity Approach To Supervised Binary Hashing | Miguel A. Carreira-perpinan, Ramin Raziperchikolaei | Neural Information Processing Systems | Binary hashing is a well-known approach for fast approximate nearest-neighbor search in information retrieval. Much work has focused on affinity-based objective functions involving the hash functions or binary codes. These objective functions encode neighborhood information between data points and are often inspired by manifold learning algorithms. They ensure that the hash functions differ from each other through constraints or penalty terms that encourage codes to be orthogonal or dissimilar across bits, but this couples the binary variables and complicates the already difficult optimization. We propose a much simpler approach: we train each hash function (or bit) independently from each other, but introduce diversity among them using techniques from classifier ensembles. Surprisingly, we find that not only is this faster and trivially parallelizable, but it also improves over the more complex, coupled objective function, and achieves state-of-the-art precision and recall in experiments with image retrieval. |
|||||
2016 | Contextual Visual Similarity | Wang Xiaofang, Kitani Kris M., Hebert Martial | Arxiv | Measuring visual similarity is critical for image understanding. But what makes two images similar? Most existing work on visual similarity assumes that images are similar because they contain the same object instance or category. However, the reason why images are similar is much more complex. For example, from the perspective of category, a black dog image is similar to a white dog image. However, in terms of color, a black dog image is more similar to a black horse image than the white dog image. This example serves to illustrate that visual similarity is ambiguous but can be made precise when given an explicit contextual perspective. Based on this observation, we propose the concept of contextual visual similarity. To be concrete, we examine the concept of contextual visual similarity in the application domain of image search. Instead of providing only a single image for image similarity search (e.g., Google image search), we require three images. Given a query image, a second positive image and a third negative image, dissimilar to the first two images, we define a contextualized similarity search criteria. In particular, we learn feature weights over all the feature dimensions of each image such that the distance between the query image and the positive image is small and their distances to the negative image are large after reweighting their features. The learned feature weights encode the contextualized visual similarity specified by the user and can be used for attribute specific image search. We also show the usefulness of our contextualized similarity weighting scheme for different tasks, such as answering visual analogy questions and unsupervised attribute discovery. |
|||||
2016 | Vector Quantization For Machine Vision | Liguori Vincenzo | Arxiv | This paper shows how to reduce the computational cost for a variety of common machine vision tasks by operating directly in the compressed domain, particularly in the context of hardware acceleration. Pyramid Vector Quantization (PVQ) is the compression technique of choice and its properties are exploited to simplify Support Vector Machines (SVM), Convolutional Neural Networks (CNNs), Histogram of Oriented Gradients (HOG) features, interest points matching and other algorithms. |
|||||
2016 | Instance-aware Hashing For Multi-label Image Retrieval | Lai Hanjiang, Yan Pan, Shu Xiangbo, Wei Yunchao, Yan Shuicheng | Arxiv | Similarity-preserving hashing is a commonly used method for nearest neighbour search in large-scale image retrieval. For image retrieval, deep-networks-based hashing methods are appealing since they can simultaneously learn effective image representations and compact hash codes. This paper focuses on deep-networks-based hashing for multi-label images, each of which may contain objects of multiple categories. In most existing hashing methods, each image is represented by one piece of hash code, which is referred to as semantic hashing. This setting may be suboptimal for multi-label image retrieval. To solve this problem, we propose a deep architecture that learns instance-aware image representations for multi-label image data, which are organized in multiple groups, with each group containing the features for one category. The instance-aware representations not only bring advantages to semantic hashing, but also can be used in category-aware hashing, in which an image is represented by multiple pieces of hash codes and each piece of code corresponds to a category. Extensive evaluations conducted on several benchmark datasets demonstrate that, for both semantic hashing and category-aware hashing, the proposed method shows substantial improvement over the state-of-the-art supervised and unsupervised hashing methods. |
|||||
2016 | Efficient Convolutional Neural Network With Binary Quantization Layer | Ravanbakhsh Mahdyar, Mousavi Hossein, Nabi Moin, Marcenaro Lucio, Regazzoni Carlo | Arxiv | In this paper we introduce a novel method for segmentation that can benefit from the general semantics of a Convolutional Neural Network (CNN). Our segmentation proposes visually and semantically coherent image segments. We use binary encoding of CNN features to overcome the difficulty of clustering in the high-dimensional CNN feature space. This binary encoding can be embedded into the CNN as an extra layer at the end of the network. This results in real-time segmentation. To the best of our knowledge our method is the first attempt at general semantic image segmentation using a CNN. All previous papers were limited to a small number of image categories (e.g. PASCAL VOC). Experiments show that our segmentation algorithm outperforms the state-of-the-art non-semantic segmentation methods by a large margin. |
|||||
2016 | Time For Dithering Fast And Quantized Random Embeddings Via The Restricted Isometry Property | Jacques Laurent, Cambareri Valerio | Arxiv | Recently, many works have focused on the characterization of non-linear dimensionality reduction methods obtained by quantizing linear embeddings, e.g., to reach fast processing time, efficient data compression procedures, novel geometry-preserving embeddings or to estimate the information/bits stored in this reduced data representation. In this work, we prove that many linear maps known to respect the restricted isometry property (RIP) can induce a quantized random embedding with controllable multiplicative and additive distortions with respect to the pairwise distances of the data points being considered. In other words, linear matrices having fast matrix-vector multiplication algorithms (e.g., based on partial Fourier ensembles or on the adjacency matrix of unbalanced expanders) can be readily used in the definition of fast quantized embeddings with small distortions. This implication is made possible by applying right after the linear map an additive and random “dither” that stabilizes the impact of the uniform scalar quantization operator applied afterwards. For different categories of RIP matrices, i.e., for different linear embeddings of a metric space \((\mathcal K \subset \mathbb R^n, \ell_q)\) in \((\mathbb R^m, \ell_p)\) with \(p,q \geq 1\), we derive upper bounds on the additive distortion induced by quantization, showing that it decays either when the embedding dimension \(m\) increases or when the distance of a pair of embedded vectors in \(\mathcal K\) decreases. Finally, we develop a novel “bi-dithered” quantization scheme, which allows for a reduced distortion that decreases when the embedding dimension grows and independently of the considered pair of vectors. |
|||||
2016 | Fast Binary Embedding Via Circulant Downsampled Matrix -- A Data-independent Approach | Hsieh Sung-hsien, Lu Chun-shien, Pei Soo-chang | Arxiv | Binary embedding of high-dimensional data aims to produce low-dimensional binary codes while preserving discriminative power. State-of-the-art methods often suffer from high computation and storage costs. We present a simple and fast embedding scheme by first downsampling \(N\)-dimensional data into \(M\)-dimensional data and then multiplying the data with an \(M \times M\) circulant matrix. Our method requires \(O(N + M \log M)\) computation and \(O(N)\) storage costs. We prove that if the data are sparse, our scheme preserves similarity well. Experiments further demonstrate that though our method is cost-effective and fast, it still achieves comparable performance in image applications. |
|||||
2016 | Auto-jacobin Auto-encoder Jacobian Binary Hashing | Fu Xiping, Mccane Brendan, Mills Steven, Albert Michael, Szymanski Lech | Arxiv | Binary codes can be used to speed up nearest neighbor search tasks in large scale data sets as they are efficient for both storage and retrieval. In this paper, we propose a robust auto-encoder model that preserves the geometric relationships of high-dimensional data sets in Hamming space. This is done by considering a noise-removing function in a region surrounding the manifold where the training data points lie. This function is defined with the property that it projects the data points near the manifold into the manifold wisely, and we approximate this function by its first order approximation. Experimental results show that the proposed method achieves better than state-of-the-art results on three large scale high dimensional data sets. |
|||||
2016 | EFANNA An Extremely Fast Approximate Nearest Neighbor Search Algorithm Based On Knn Graph | Fu Cong, Cai Deng | Arxiv | Approximate nearest neighbor (ANN) search is a fundamental problem in many areas of data mining, machine learning and computer vision. The performance of traditional hierarchical structure (tree) based methods decreases as the dimensionality of data grows, while hashing based methods usually lack efficiency in practice. Recently, the graph based methods have drawn considerable attention. The main idea is that a neighbor of a neighbor is also likely to be a neighbor, which we refer to as NN-expansion. These methods construct a \(k\)-nearest neighbor (\(k\)NN) graph offline. And at online search stage, these methods find candidate neighbors of a query point in some way (e.g., random selection), and then check the neighbors of these candidate neighbors for closer ones iteratively. Despite some promising results, there are mainly two problems with these approaches: 1) These approaches tend to converge to local optima. 2) Constructing a \(k\)NN graph is time consuming. We find that these two problems can be nicely solved when we provide a good initialization for NN-expansion. In this paper, we propose EFANNA, an extremely fast approximate nearest neighbor search algorithm based on \(k\)NN Graph. EFANNA nicely combines the advantages of hierarchical structure based methods and nearest-neighbor-graph based methods. Extensive experiments have shown that EFANNA outperforms the state-of-the-art algorithms both on approximate nearest neighbor search and approximate nearest neighbor graph construction. To the best of our knowledge, EFANNA is the fastest algorithm so far both on approximate nearest neighbor graph construction and approximate nearest neighbor search. A library EFANNA based on this research is released on Github. |
|||||
2016 | On The Insertion Time Of Random Walk Cuckoo Hashing | Frieze Alan, Johansson Tony | Arxiv | Cuckoo Hashing is a hashing scheme invented by Pagh and Rodler. It uses \(d\geq 2\) distinct hash functions to insert items into the hash table. It has been an open question for some time as to the expected time for Random Walk Insertion to add items. We show that if the number of hash functions \(d=O(1)\) is sufficiently large, then the expected insertion time is \(O(1)\) per item. |
|||||
2016 | Deep Cross-modal Hashing | Jiang Qing-yuan, Li Wu-jun | Arxiv | Due to its low storage cost and fast query speed, cross-modal hashing (CMH) has been widely used for similarity search in multimedia retrieval applications. However, almost all existing CMH methods are based on hand-crafted features which might not be optimally compatible with the hash-code learning procedure. As a result, existing CMH methods with hand-crafted features may not achieve satisfactory performance. In this paper, we propose a novel cross-modal hashing method, called deep cross-modal hashing (DCMH), by integrating feature learning and hash-code learning into the same framework. DCMH is an end-to-end learning framework with deep neural networks, one for each modality, to perform feature learning from scratch. Experiments on two real datasets with text-image modalities show that DCMH can outperform other baselines to achieve the state-of-the-art performance in cross-modal retrieval applications. |
|||||
2016 | Large-scale Video Search With Efficient Temporal Voting Structure | Esen Ersin, Ozkan Savas, Atil Ilkay | Arxiv | In this work, we propose a fast content-based video querying system for large-scale video search. The proposed system is distinguished from similar works by two major contributions. The first contribution is the superiority of the joint usage of repeated content representation and efficient hashing mechanisms. Repeated content representation is utilized with a simple yet robust feature, which is based on the edge energy of frames. Each representation is converted into a hash code with the Hamming Embedding method for further queries. The second contribution is a novel queue-based voting scheme that leads to modest memory requirements with gradual memory allocation capability, contrary to complete brute-force temporal voting schemes. This aspect enables us to make queries on large video databases conveniently, even on commodity computers with limited memory capacity. Our results show that the system can respond to video queries on a large video database with fast query times, high recall rate and very low memory and disk requirements. |
|||||
2016 | Deep Image Set Hashing | Feng Jie, Karaman Svebor, Jhuo I-hong, Chang Shih-fu | Arxiv | In applications involving matching of image sets, the information from multiple images must be effectively exploited to represent each set. State-of-the-art methods use probabilistic distribution or subspace to model a set and use specific distance measure to compare two sets. These methods are slow to compute and not compact to use in a large scale scenario. Learning-based hashing is often used in large scale image retrieval as they provide a compact representation of each sample and the Hamming distance can be used to efficiently compare two samples. However, most hashing methods encode each image separately and discard knowledge that multiple images in the same set represent the same object or person. We investigate the set hashing problem by combining both set representation and hashing in a single deep neural network. An image set is first passed to a CNN module to extract image features, then these features are aggregated using two types of set feature to capture both set specific and database-wide distribution information. The computed set feature is then fed into a multilayer perceptron to learn a compact binary embedding. Triplet loss is used to train the network by forming set similarity relations using class labels. We extensively evaluate our approach on datasets used for image matching and show highly competitive performance compared to state-of-the-art methods. |
|||||
2016 | Learning Binary Codes And Binary Weights For Efficient Classification | Shen Fumin, Mu Yadong, Liu Wei, Yang Yang, Shen Heng Tao | Arxiv | This paper proposes a generic formulation that significantly expedites the training and deployment of image classification models, particularly under the scenarios of many image categories and high feature dimensions. As a defining property, our method represents both the images and learned classifiers using binary hash codes, which are simultaneously learned from the training data. Classifying an image thereby reduces to computing the Hamming distance between the binary codes of the image and classifiers and selecting the class with minimal Hamming distance. Conventionally, compact hash codes are primarily used for accelerating image search. Our work is the first of its kind to represent classifiers using binary codes. Specifically, we formulate multi-class image classification as an optimization problem over binary variables. The optimization alternatively proceeds over the binary classifiers and image hash codes. Profiting from the special property of binary codes, we show that the sub-problems can be efficiently solved through either a binary quadratic program (BQP) or linear program. In particular, for attacking the BQP problem, we propose a novel bit-flipping procedure which enjoys high efficacy and local optimality guarantee. Our formulation supports a large family of empirical loss functions and is here instantiated by exponential / hinge losses. Comprehensive evaluations are conducted on several representative image benchmarks. The experiments consistently observe reduced complexities of model training and deployment, without sacrifice of accuracies. |
|||||
2016 | Gerbil A Fast And Memory-efficient k-mer Counter With Gpu-support | Erbert Marius, Rechner Steffen, Müller-hannemann Matthias | Arxiv | A basic task in bioinformatics is the counting of \(k\)-mers in genome strings. The \(k\)-mer counting problem is to build a histogram of all substrings of length \(k\) in a given genome sequence. We present the open source \(k\)-mer counting software Gerbil that has been designed for the efficient counting of \(k\)-mers for \(k\geq32\). Given the technology trend towards long reads of next-generation sequencers, support for large \(k\) becomes increasingly important. While existing \(k\)-mer counting tools suffer from excessive memory resource consumption or degrading performance for large \(k\), Gerbil is able to efficiently support large \(k\) without much loss of performance. Our software implements a two-disk approach. In the first step, DNA reads are loaded from disk and distributed to temporary files that are stored at a working disk. In a second step, the temporary files are read again, split into \(k\)-mers and counted via a hash table approach. In addition, Gerbil can optionally use GPUs to accelerate the counting step. For large \(k\), we outperform state-of-the-art open source \(k\)-mer counting tools for large genome data sets. |
|||||
2016 | Compact Hash Codes For Efficient Visual Descriptors Retrieval In Large Scale Databases | Ercoli Simone, Bertini Marco, Del Bimbo Alberto | Arxiv | In this paper we present an efficient method for visual descriptors retrieval based on compact hash codes computed using a multiple k-means assignment. The method has been applied to the problem of approximate nearest neighbor (ANN) search of local and global visual content descriptors, and it has been tested on different datasets: three large scale public datasets of up to one billion descriptors (BIGANN) and, supported by recent progress in convolutional neural networks (CNNs), also on the CIFAR-10 and MNIST datasets. Experimental results show that, despite its simplicity, the proposed method obtains a very high performance that makes it superior to more complex state-of-the-art methods. |
|||||
2016 | Variable-length Hashing | Yu Honghai, Moulin Pierre, Ng Hong Wei, Li Xiaoli | Arxiv | Hashing has emerged as a popular technique for large-scale similarity search. Most learning-based hashing methods generate compact yet correlated hash codes. However, this redundancy is storage-inefficient. Hence we propose a lossless variable-length hashing (VLH) method that is both storage- and search-efficient. Storage efficiency is achieved by converting the fixed-length hash code into a variable-length code. Search efficiency is obtained by using a multiple hash table structure. With VLH, we are able to deliberately add redundancy into hash codes to improve retrieval performance with little sacrifice in storage efficiency or search complexity. In particular, we propose a block K-means hashing (B-KMH) method to obtain significantly improved retrieval performance with no increase in storage and marginal increase in computational cost. |
|||||
2016 | Fast Cosine Similarity Search In Binary Space With Angular Multi-index Hashing | Eghbali Sepehr, Tahvildari Ladan | Arxiv | Given a large dataset of binary codes and a binary query point, we address how to efficiently find \(K\) codes in the dataset that yield the largest cosine similarities to the query. The straightforward answer to this problem is to compare the query with all items in the dataset, but this is practical only for small datasets. One potential solution to enhance the search time and achieve sublinear cost is to use a hash table populated with binary codes of the dataset and then look up the nearby buckets to the query to retrieve the nearest neighbors. However, if codes are compared in terms of cosine similarity rather than the Hamming distance, then the main issue is that the order of buckets to probe is not evident. To examine this issue, we first elaborate on the connection between the Hamming distance and the cosine similarity. Doing this allows us to systematically find the probing sequence in the hash table. However, solving the nearest neighbor search with a single table is only practical for short binary codes. To address this issue, we propose the angular multi-index hashing search algorithm which relies on building multiple hash tables on binary code substrings. The proposed search algorithm solves the exact angular \(K\) nearest neighbor problem in a time that is often orders of magnitude faster than the linear scan baseline and even approximation methods. |
|||||
2016 | Distortion-resistant Hashing For Rapid Search Of Similar DNA Subsequence | Duda Jarek | Arxiv | One of the basic tasks in bioinformatics is localizing a short subsequence \(S\), read while sequencing, in a long reference sequence \(R\), like the human genome. A natural rapid approach would be finding a hash value for \(S\) and comparing it with a prepared database of hash values for each of the length \(|S|\) subsequences of \(R\). The problem with such an approach is that it would only spot a perfect match, while in reality there are lots of small changes: substitutions, deletions and insertions. This issue could be repaired by having a hash function designed to tolerate some small distortion according to an alignment metric (like Needleman-Wunsch): designed so that two similar sequences will most likely give the same hash value. This paper discusses the construction of Distortion-Resistant Hashing (DRH) to generate such fingerprints for rapid search of similar subsequences. The proposed approach is based on rate distortion theory: in a nearly uniform subset of length \(|S|\) sequences, the hash value represents the closest sequence to \(S\). This gives some control over the distance of collisions: sequences having the same hash value. |
|||||
2016 | Unsupervised Deep Hashing For Large-scale Visual Search | Xia Zhaoqiang, Feng Xiaoyi, Peng Jinye, Hadid Abdenour | Arxiv | Learning based hashing plays a pivotal role in large-scale visual search. However, most existing hashing algorithms tend to learn shallow models that do not seek representative binary codes. In this paper, we propose a novel hashing approach based on unsupervised deep learning to hierarchically transform features into hash codes. Within the heterogeneous deep hashing framework, the autoencoder layers with specific constraints are considered to model the nonlinear mapping between features and binary codes. Then, a Restricted Boltzmann Machine (RBM) layer with constraints is utilized to reduce the dimension in the Hamming space. Extensive experiments on the problem of visual search demonstrate the competitiveness of our proposed approach compared to the state-of-the-art. |
|||||
2016 | How Should We Evaluate Supervised Hashing | Sablayrolles Alexandre, Douze Matthijs, Jégou Hervé, Usunier Nicolas | Arxiv | Hashing produces compact representations for documents, to perform tasks like classification or retrieval based on these short codes. When hashing is supervised, the codes are trained using labels on the training data. This paper first shows that the evaluation protocols used in the literature for supervised hashing are not satisfactory: we show that a trivial solution that encodes the output of a classifier significantly outperforms existing supervised or semi-supervised methods, while using much shorter codes. We then propose two alternative protocols for supervised hashing: one based on retrieval on a disjoint set of classes, and another based on transfer learning to new classes. We provide two baseline methods for image-related tasks to assess the performance of (semi-)supervised hashing: without coding and with unsupervised codes. These baselines give a lower- and upper-bound on the performance of a supervised hashing scheme. |
|||||
2016 | Polysemous Codes | Douze Matthijs, Jégou Hervé, Perronnin Florent | Arxiv | This paper considers the problem of approximate nearest neighbor search in the compressed domain. We introduce polysemous codes, which offer both the distance estimation quality of product quantization and the efficient comparison of binary codes with Hamming distance. Their design is inspired by algorithms introduced in the 90’s to construct channel-optimized vector quantizers. At search time, this dual interpretation accelerates the search. Most of the indexed vectors are filtered out with Hamming distance, leaving only a fraction of the vectors to be ranked with an asymmetric distance estimator. The method is complementary with a coarse partitioning of the feature space such as the inverted multi-index. This is shown by our experiments performed on several public benchmarks such as the BIGANN dataset comprising one billion vectors, for which we report state-of-the-art results for query times below 0.3 millisecond per core. Last but not least, our approach allows the approximate computation of the k-NN graph associated with the Yahoo Flickr Creative Commons 100M, described by CNN image descriptors, in less than 8 hours on a single machine. |
|||||
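The entry above describes a two-stage search: most candidates are rejected by Hamming distance between codes, and only the survivors are re-ranked with a more precise asymmetric distance estimator. A toy numpy sketch of that control flow (assuming uint8 codes, one byte per sub-quantizer, and precomputed per-query lookup tables; all names are hypothetical, and the channel-optimized code assignment that makes polysemous codes work is not shown):

```python
import numpy as np

def hamming(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Bitwise Hamming distance between a query code (1, M) and db codes (N, M), dtype uint8."""
    return np.unpackbits(a ^ b, axis=1).sum(axis=1)

def search(query_code, db_codes, adc_tables, hamming_threshold, k):
    """Cheap Hamming filter, then asymmetric-distance ranking of the survivors."""
    dists = hamming(query_code[None, :], db_codes)
    survivors = np.flatnonzero(dists <= hamming_threshold)
    # asymmetric distance: sum over sub-quantizers of table[m][code[m]]
    adc = np.zeros(len(survivors))
    for m in range(db_codes.shape[1]):
        adc += adc_tables[m][db_codes[survivors, m]]
    order = np.argsort(adc)[:k]
    return survivors[order], adc[order]
```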
2016 | De-hashing Server-side Context-aware Feature Reconstruction For Mobile Visual Search | Kuo Yin-hsi, Hsu Winston H. | Arxiv | Due to the prevalence of mobile devices, mobile search has become more convenient than desktop search. Different from traditional desktop search, mobile visual search needs more consideration for the limited resources on mobile devices (e.g., bandwidth, computing power, and memory consumption). The state-of-the-art approaches show that the bag-of-words (BoW) model is robust for image and video retrieval; however, the large vocabulary tree might be too large to load on a mobile device. We observe that recent works mainly focus on designing compact feature representations on mobile devices for bandwidth-limited networks (e.g., 3G) and directly adopt feature matching on remote servers (cloud). However, the compact (binary) representation might fail to retrieve target objects (images, videos). Based on the hashed binary codes, we propose a de-hashing process that reconstructs BoW by leveraging the computing power of remote servers. To mitigate the information loss from binary codes, we further utilize contextual information (e.g., GPS) to reconstruct a context-aware BoW for better retrieval results. Experimental results show that the proposed method can achieve competitive retrieval accuracy as BoW while transmitting only a few bits from mobile devices. |
|||||
2016 | Transfer Hashing With Privileged Information | Zhou Joey Tianyi, Xu Xinxing, Pan Sinno Jialin, Tsang Ivor W., Qin Zheng, Goh Rick Siow Mong | Arxiv | Most existing learning to hash methods assume that there are sufficient data, either labeled or unlabeled, on the domain of interest (i.e., the target domain) for training. However, this assumption cannot be satisfied in some real-world applications. To address this data sparsity issue in hashing, inspired by transfer learning, we propose a new framework named Transfer Hashing with Privileged Information (THPI). Specifically, we extend the standard learning to hash method, Iterative Quantization (ITQ), in a transfer learning manner, namely ITQ+. In ITQ+, a new slack function is learned from auxiliary data to approximate the quantization error in ITQ. We developed an alternating optimization approach to solve the resultant optimization problem for ITQ+. We further extend ITQ+ to LapITQ+ by utilizing the geometry structure among the auxiliary data for learning more precise binary codes in the target domain. Extensive experiments on several benchmark datasets verify the effectiveness of our proposed approaches through comparisons with several state-of-the-art baselines. |
|||||
2016 | SSH (Sketch, Shingle, Hash) For Indexing Massive-scale Time Series | Luo Chen, Shrivastava Anshumali | Arxiv | Similarity search on time series is a frequent operation in large-scale data-driven applications. Sophisticated similarity measures are standard for time series matching, as time series are usually misaligned. Dynamic Time Warping or DTW is the most widely used similarity measure for time series because it combines alignment and matching at the same time. However, the alignment makes DTW slow. To speed up the expensive similarity search with DTW, branch and bound based pruning strategies are adopted. However, branch and bound based pruning is only useful for very short queries (low dimensional time series), and the bounds are quite weak for longer queries. Due to the loose bounds, the branch and bound pruning strategy boils down to a brute-force search. To circumvent this issue, we design SSH (Sketch, Shingle, & Hashing), an efficient and approximate hashing scheme which is much faster than the state-of-the-art branch and bound searching technique: the UCR suite. SSH uses a novel combination of sketching, shingling and hashing techniques to produce (probabilistic) indexes which align (near perfectly) with the DTW similarity measure. The generated indexes are then used to create hash buckets for sub-linear search. Our results show that SSH is very effective for longer time sequences and prunes around 95% of candidates, leading to a massive speedup in search with DTW. Empirical results on two large-scale benchmark time series datasets show that our proposed method can be around 20 times faster than the state-of-the-art package (UCR suite) without any significant loss in accuracy. |
|||||
2016 | Learning To Hash With Binary Deep Neural Network | Do Thanh-toan, Doan Anh-dzung, Cheung Ngai-man | Arxiv | This work proposes deep network models and learning algorithms for unsupervised and supervised binary hashing. Our novel network design constrains one hidden layer to directly output the binary codes. This addresses a challenging issue in some previous works: optimizing non-smooth objective functions due to binarization. Moreover, we incorporate independence and balance properties in the direct and strict forms in the learning. Furthermore, we include similarity preserving property in our objective function. Our resulting optimization with these binary, independence, and balance constraints is difficult to solve. We propose to attack it with alternating optimization and careful relaxation. Experimental results on three benchmark datasets show that our proposed methods compare favorably with the state of the art. |
|||||
2016 | Binary Hashing With Semidefinite Relaxation And Augmented Lagrangian | Do Thanh-toan, Doan Anh-dzung, Nguyen Duc-thanh, Cheung Ngai-man | Arxiv | This paper proposes two approaches for inferring binary codes in two-step (supervised, unsupervised) hashing. We first introduce a unified formulation for both supervised and unsupervised hashing. Then, we cast the learning of one bit as a Binary Quadratic Problem (BQP). We propose two approaches to solve BQP. In the first approach, we relax BQP as a semidefinite programming problem whose global optimum can be achieved. We theoretically prove that the objective value of the binary solution achieved by this approach is well bounded. In the second approach, we propose an augmented Lagrangian based approach to solve BQP directly without relaxing the binary constraint. Experimental results on three benchmark datasets show that our proposed methods compare favorably with the state of the art. |
|||||
2016 | Fast Binary Embeddings With Gaussian Circulant Matrices Improved Bounds | Dirksen Sjoerd, Stollenwerk Alexander | Arxiv | We consider the problem of encoding a finite set of vectors into a small number of bits while approximately retaining information on the angular distances between the vectors. By deriving improved variance bounds related to binary Gaussian circulant embeddings, we largely fix a gap in the proof of the best known fast binary embedding method. Our bounds also show that well-spreadness assumptions on the data vectors, which were needed in earlier work on variance bounds, are unnecessary. In addition, we propose a new binary embedding with a faster running time on sparse data. |
|||||
2016 | Simple And Efficient Weighted Minwise Hashing | Anshumali Shrivastava | Neural Information Processing Systems | Weighted minwise hashing (WMH) is one of the fundamental subroutines, required by many celebrated approximation algorithms and commonly adopted in industrial practice for large-scale search and learning. The resource bottleneck with WMH is the computation of multiple (typically a few hundred to thousands) independent hashes of the data. We propose a simple rejection type sampling scheme based on a carefully designed red-green map, where we show that the number of rejected samples has exactly the same distribution as weighted minwise sampling. The running time of our method, for many practical datasets, is an order of magnitude smaller than existing methods. Experimental evaluations, on real datasets, show that for computing 500 WMH, our proposal can be 60000x faster than Ioffe’s method without losing any accuracy. Our method is also around 100x faster than approximate heuristics capitalizing on the efficient “densified” one permutation hashing schemes~\cite{Proc:OneHashLSHICML14,Proc:ShrivastavaUAI14}. Given the simplicity of our approach and its significant advantages, we hope that it will replace existing implementations in practice. |
|||||
2016 | Scalable Image Retrieval By Sparse Product Quantization | Ning Qingqun, Zhu Jianke, Zhong Zhiyuan, Hoi Steven C. H., Chen Chun | Arxiv | Fast Approximate Nearest Neighbor (ANN) search technique for high-dimensional feature indexing and retrieval is the crux of large-scale image retrieval. A recent promising technique is Product Quantization, which attempts to index high-dimensional image features by decomposing the feature space into a Cartesian product of low dimensional subspaces and quantizing each of them separately. Despite the promising results reported, their quantization approach follows the typical hard assignment of traditional quantization methods, which may result in large quantization errors and thus inferior search performance. Unlike the existing approaches, in this paper, we propose a novel approach called Sparse Product Quantization (SPQ) to encode the high-dimensional feature vectors into sparse representations. We optimize the sparse representations of the feature vectors by minimizing their quantization errors, making the resulting representation essentially close to the original data in practice. Experiments show that the proposed SPQ technique is not only able to compress data, but is also an effective encoding technique. We obtain state-of-the-art results for ANN search on four public image datasets and the promising results of content-based image retrieval further validate the efficacy of our proposed method. |
|||||
2016 | Triplet Similarity Embedding For Face Verification | Sankaranarayanan Swami, Alavi Azadeh, Chellappa Rama | Arxiv | In this work, we present an unconstrained face verification algorithm and evaluate it on the recently released IJB-A dataset that aims to push the boundaries of face verification methods. The proposed algorithm couples a deep CNN-based approach with a low-dimensional discriminative embedding learnt using triplet similarity constraints in a large margin fashion. Aside from yielding performance improvement, this embedding provides significant advantages in terms of memory and post-processing operations like hashing and visualization. Experiments on the IJB-A dataset show that the proposed algorithm outperforms state of the art methods in verification and identification metrics, while requiring less training time. |
|||||
2016 | Bloom Filters And Compact Hash Codes For Efficient And Distributed Image Retrieval | Salvi Andrea, Ercoli Simone, Bertini Marco, Del Bimbo Alberto | Arxiv | This paper presents a novel method for efficient image retrieval, based on a simple and effective hashing of CNN features and the use of an indexing structure based on Bloom filters. These filters are used as gatekeepers for the database of image features, allowing a query to be skipped if the query features are not stored in the database, and speeding up the query process without affecting retrieval performance. Thanks to the limited memory requirements, the system is suitable for mobile applications and distributed databases, associating each filter with a distributed portion of the database. Experimental validation has been performed on three standard image retrieval datasets, outperforming state-of-the-art hashing methods in terms of precision, while the proposed indexing method obtains a \(2\times\) speedup. |
|||||
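The entry above uses Bloom filters as gatekeepers: a shard of the feature database is queried only if its filter reports that the query code may be present there. A small self-contained sketch of that gating idea, with hypothetical sizes and shard layout (not the paper's implementation):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive num_hashes positions from a salted cryptographic digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# One filter per database shard; skip shards whose filter rejects the query code.
shard_filters = [BloomFilter() for _ in range(4)]
shard_filters[2].add(b"\x3f\xa1")                     # index a hash code into shard 2
query_code = b"\x3f\xa1"
shards_to_query = [i for i, f in enumerate(shard_filters) if f.might_contain(query_code)]
```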
2016 | Fast Supervised Discrete Hashing And Its Analysis | Koutaki Gou, Shirai Keiichiro, Ambai Mitsuru | Arxiv | In this paper, we propose a learning-based supervised discrete hashing method. Binary hashing is widely used for large-scale image retrieval as well as video and document searches because the compact representation of binary code is essential for data storage and reasonable for query searches using bit-operations. The recently proposed Supervised Discrete Hashing (SDH) efficiently solves mixed-integer programming problems by alternating optimization and the Discrete Cyclic Coordinate descent (DCC) method. We show that the SDH model can be simplified without performance degradation based on some preliminary experiments; we call the approximate model for this the “Fast SDH” (FSDH) model. We analyze the FSDH model and provide a mathematically exact solution for it. In contrast to SDH, our model does not require an alternating optimization algorithm and does not depend on initial values. FSDH is also easier to implement than Iterative Quantization (ITQ). Experimental results involving a large-scale database showed that FSDH outperforms conventional SDH in terms of precision, recall, and computation time. |
|||||
2016 | Deep Hashing A Joint Approach For Image Signature Learning | Mu Yadong, Liu Zhu | Arxiv | Similarity-based image hashing represents a crucial technique for visual data storage reduction and expedited image search. Conventional hashing schemes typically feed hand-crafted features into hash functions, which separates the procedures of feature extraction and hash function learning. In this paper, we propose a novel algorithm that concurrently performs feature engineering and non-linear supervised hashing function learning. Our technical contributions in this paper are twofold: 1) deep network optimization is often achieved by gradient propagation, which critically requires a smooth objective function. The discrete nature of hash codes makes them not amenable to gradient-based optimization. To address this issue, we propose an exponentiated hashing loss function and its bilinear smooth approximation. Effective gradient calculation and propagation are thereby enabled; 2) pre-training is an important trick in supervised deep learning. The impact of pre-training on the hash code quality has never been discussed in current deep hashing literature. We propose a pre-training scheme inspired by recent advances in deep network based image classification, and experimentally demonstrate its effectiveness. Comprehensive quantitative evaluations are conducted on several widely-used image benchmarks. On all benchmarks, our proposed deep hashing algorithm outperforms all state-of-the-art competitors by significant margins. In particular, our algorithm achieves a near-perfect 0.99 in terms of Hamming ranking accuracy with only 12 bits on MNIST, and a new record of 0.74 on the CIFAR10 dataset. In comparison, the best accuracies obtained on CIFAR10 by existing hashing algorithms without or with deep networks are known to be 0.36 and 0.58 respectively. |
|||||
2016 | Scalable Similarity Search For Molecular Descriptors | Tabei Yasuo, Puglisi Simon J. | Arxiv | Similarity search over chemical compound databases is a fundamental task in the discovery and design of novel drug-like molecules. Such databases often encode molecules as non-negative integer vectors, called molecular descriptors, which represent rich information on various molecular properties. While there exist efficient indexing structures for searching databases of binary vectors, solutions for more general integer vectors are in their infancy. In this paper we present a time- and space- efficient index for the problem that we call the succinct intervals-splitting tree algorithm for molecular descriptors (SITAd). Our approach extends efficient methods for binary-vector databases, and uses ideas from succinct data structures. Our experiments, on a large database of over 40 million compounds, show SITAd significantly outperforms alternative approaches in practice. |
|||||
2016 | End-to-end Learning Of Deep Visual Representations For Image Retrieval | Gordo Albert, Almazan Jon, Revaud Jerome, Larlus Diane | Arxiv | While deep learning has become a key ingredient in the top performing methods for many computer vision tasks, it has failed so far to bring similar improvements to instance-level image retrieval. In this article, we argue that reasons for the underwhelming results of deep methods on image retrieval are threefold: i) noisy training data, ii) inappropriate deep architecture, and iii) suboptimal training procedure. We address all three issues. First, we leverage a large-scale but noisy landmark dataset and develop an automatic cleaning method that produces a suitable training set for deep retrieval. Second, we build on the recent R-MAC descriptor, show that it can be interpreted as a deep and differentiable architecture, and present improvements to enhance it. Last, we train this network with a siamese architecture that combines three streams with a triplet loss. At the end of the training process, the proposed architecture produces a global image representation in a single forward pass that is well suited for image retrieval. Extensive experiments show that our approach significantly outperforms previous retrieval approaches, including state-of-the-art methods based on costly local descriptor indexing and spatial verification. On Oxford 5k, Paris 6k and Holidays, we respectively report 94.7, 96.6, and 94.8 mean average precision. Our representations can also be heavily compressed using product quantization with little loss in accuracy. For additional material, please see www.xrce.xerox.com/Deep-Image-Retrieval. |
|||||
2016 | Associative Memories To Accelerate Approximate Nearest Neighbor Search | Gripon Vincent, Löwe Matthias, Vermet Franck | Arxiv | Nearest neighbor search is a very active field in machine learning, as it appears in many applications, including classification and object retrieval. In its canonical version, the complexity of the search is linear with both the dimension and the cardinal of the collection of vectors the search is performed in. Recently many works have focused on reducing the dimension of vectors using quantization techniques or hashing, while providing an approximate result. In this paper we focus instead on tackling the cardinal of the collection of vectors. Namely, we introduce a technique that partitions the collection of vectors and stores each part in its own associative memory. When a query vector is given to the system, associative memories are polled to identify which one contains the closest match. Then an exhaustive search is conducted only on the subset of vectors stored in the selected associative memory. We study the effectiveness of the system when messages to store are generated from i.i.d. uniform \(\pm\)1 random variables or 0-1 sparse i.i.d. random variables. We also conduct experiments on both synthetic and real data and show that it is possible to achieve interesting trade-offs between complexity and accuracy. |
|||||
2016 | Barcodes For Medical Image Retrieval Using Autoencoded Radon Transform | Tizhoosh Hamid R., Mitcheltree Christopher, Zhu Shujin, Dutta Shamak | Arxiv | Using content-based binary codes to tag digital images has emerged as a promising retrieval technology. Recently, Radon barcodes (RBCs) have been introduced as a new binary descriptor for image search. RBCs are generated by binarization of Radon projections and by assembling them into a vector, namely the barcode. A simple local thresholding has been suggested for binarization. In this paper, we put forward the idea of “autoencoded Radon barcodes”. Using images in a training dataset, we autoencode Radon projections to perform binarization on outputs of hidden layers. We employed the mini-batch stochastic gradient descent approach for the training. Each hidden layer of the autoencoder can produce a barcode using a threshold determined based on the range of the logistic function used. The compressing capability of autoencoders apparently reduces the redundancies inherent in Radon projections leading to more accurate retrieval results. The IRMA dataset with 14,410 x-ray images is used to validate the performance of the proposed method. The experimental results, containing comparison with RBCs, SURF and BRISK, show that autoencoded Radon barcode (ARBC) has the capacity to capture important information and to learn richer representations resulting in lower retrieval errors for image retrieval measured with the accuracy of the first hit only. |
|||||
2016 | A Generic Inverted Index Framework For Similarity Search On The GPU - Technical Report | Zhou Jingbo, Guo Qi, Jagadish H. V., Krčál Luboš, Liu Siyuan, Luan Wenhao, Tung Anthony K. H., Yang Yueji, Zheng Yuxin | Arxiv | We propose a novel generic inverted index framework on the GPU (called GENIE), aiming to reduce the programming complexity of the GPU for parallel similarity search of different data types. Not every data type and similarity measure are supported by GENIE, but many popular ones are. We present the system design of GENIE, and demonstrate similarity search with GENIE on several data types along with a theoretical analysis of search results. A new concept of locality sensitive hashing (LSH) named \(\tau\)-ANN search, and a novel data structure c-PQ on the GPU are also proposed for achieving this purpose. Extensive experiments on different real-life datasets demonstrate the efficiency and effectiveness of our framework. The implemented system has been released as open source. |
|||||
2016 | Constructing Error-correcting Binary Codes Using Transitive Permutation Groups | Laaksonen Antti, Östergård Patric R. J. | Arxiv | Let \(A_2(n,d)\) be the maximum size of a binary code of length \(n\) and minimum distance \(d\). In this paper we present the following new lower bounds: \(A_2(18,4) \ge 5632\), \(A_2(21,4) \ge 40960\), \(A_2(22,4) \ge 81920\), \(A_2(23,4) \ge 163840\), \(A_2(24,4) \ge 327680\), \(A_2(24,10) \ge 136\), and \(A_2(25,6) \ge 17920\). The new lower bounds are a result of a systematic computer search over transitive permutation groups. |
|||||
2016 | Sharing Hash Codes For Multiple Purposes | Pronobis Wikor, Panknin Danny, Kirschnick Johannes, Srinivasan Vignesh, Samek Wojciech, Markl Volker, Kaul Manohar, Mueller Klaus-robert, Nakajima Shinichi | Arxiv | Locality sensitive hashing (LSH) is a powerful tool for sublinear-time approximate nearest neighbor search, and a variety of hashing schemes have been proposed for different dissimilarity measures. However, hash codes significantly depend on the dissimilarity, which prohibits users from adjusting the dissimilarity at query time. In this paper, we propose multiple purpose LSH (mp-LSH), which shares the hash codes for different dissimilarities. mp-LSH supports L2, cosine, and inner product dissimilarities, and their corresponding weighted sums, where the weights can be adjusted at query time. It also allows us to modify the importance of pre-defined groups of features. Thus, mp-LSH enables us, for example, to retrieve similar items to a query with the user preference taken into account, to find a similar material to a query with some properties (stability, utility, etc.) optimized, and to turn on or off a part of multi-modal information (brightness, color, audio, text, etc.) in image/video retrieval. We theoretically and empirically analyze the performance of three variants of mp-LSH, and demonstrate their usefulness on real-world data sets. |
|||||
2016 | Generalized Residual Vector Quantization For Large Scale Data | Liu Shicong, Shao Junru, Lu Hongtao | Arxiv | Vector quantization is an essential tool for tasks involving large scale data, for example, large scale similarity search, which is crucial for content-based information retrieval and analysis. In this paper, we propose a novel vector quantization framework that iteratively minimizes quantization error. First, we provide a detailed review of a relevant vector quantization method named residual vector quantization (RVQ). Next, we propose generalized residual vector quantization (GRVQ) to further improve over RVQ. Many vector quantization methods can be viewed as special cases of our proposed framework. We evaluate GRVQ on several large scale benchmark datasets for large scale search, classification and object retrieval. We compared GRVQ with existing methods in detail. Extensive experiments demonstrate our GRVQ framework substantially outperforms existing methods in terms of quantization accuracy and computational efficiency. |
|||||
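The entry above builds on residual vector quantization, where a vector is approximated by a sum of codewords, one per stage, each chosen against the residual left by the previous stages. A minimal numpy sketch of the encoding and decoding steps, assuming the codebooks are already given (GRVQ's iterative refinement of the codebooks is not shown):

```python
import numpy as np

def rvq_encode(x: np.ndarray, codebooks: list[np.ndarray]):
    """Encode x as one codeword index per stage; each stage quantizes the
    residual of the previous stages. codebooks[m] has shape (K, d)."""
    residual = x.copy()
    codes = []
    for C in codebooks:
        idx = int(np.argmin(((residual - C) ** 2).sum(axis=1)))  # nearest codeword
        codes.append(idx)
        residual = residual - C[idx]
    return codes, residual        # the residual norm is the remaining quantization error

def rvq_decode(codes, codebooks):
    """Reconstruction is simply the sum of the selected codewords."""
    return sum(C[i] for i, C in zip(codes, codebooks))
```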
2016 | Binary Codes For Tagging X-ray Images Via Deep De-noising Autoencoders | Sze-to Antonio, Tizhoosh Hamid R., Wong Andrew K. C. | Arxiv | A Content-Based Image Retrieval (CBIR) system which identifies similar medical images based on a query image can assist clinicians for more accurate diagnosis. The recent CBIR research trend favors the construction and use of binary codes to represent images. Deep architectures could learn the non-linear relationship among image pixels adaptively, allowing the automatic learning of high-level features from raw pixels. However, most of them require class labels, which are expensive to obtain, particularly for medical images. The methods which do not need class labels utilize a deep autoencoder for binary hashing, but the code construction involves a specific training algorithm and an ad-hoc regularization technique. In this study, we explored using a deep de-noising autoencoder (DDA), with a new unsupervised training scheme using only backpropagation and dropout, to hash images into binary codes. We conducted experiments on more than 14,000 x-ray images. By using class labels only for evaluating the retrieval results, we constructed a 16-bit DDA and a 512-bit DDA independently. Compared to other unsupervised methods, we obtained the lowest total error using the 512-bit codes for retrieval via exhaustive search, and achieved a 9.27x speedup using the 16-bit codes while keeping a comparable total error. We found that our new training scheme could reduce the total retrieval error significantly by 21.9%. To further boost the image retrieval performance, we developed Radon Autoencoder Barcode (RABC), which is learned from the Radon projections of images using a de-noising autoencoder. Experimental results demonstrated its superior performance in retrieval when it was combined with DDA binary codes. |
|||||
2016 | Accurate Deep Representation Quantization With Gradient Snapping Layer For Similarity Search | Liu Shicong, Lu Hongtao | Arxiv | Recent advance of large scale similarity search involves using deeply learned representations to improve the search accuracy and use vector quantization methods to increase the search speed. However, how to learn deep representations that strongly preserve similarities between data pairs and can be accurately quantized via vector quantization remains a challenging task. Existing methods simply leverage quantization loss and similarity loss, which result in unexpectedly biased back-propagating gradients and affect the search performance. To this end, we propose a novel gradient snapping layer (GSL) to directly regularize the back-propagating gradient towards a neighboring codeword; the generated gradients are unbiased for reducing similarity loss and also propel the learned representations to be accurately quantized. Joint deep representation and vector quantization learning can be easily performed by alternately optimizing the quantization codebook and the deep neural network. The proposed framework is compatible with various existing vector quantization approaches. Experimental results demonstrate that the proposed framework is effective, flexible and outperforms the state-of-the-art large scale similarity search methods. |
|||||
2015 | Efficient Training Of Very Deep Neural Networks For Supervised Hashing | Zhang Ziming, Chen Yuting, Saligrama Venkatesh | Arxiv | In this paper, we propose training very deep neural networks (DNNs) for supervised learning of hash codes. Existing methods in this context train relatively “shallow” networks limited by the issues arising in back propagation (e.g., vanishing gradients) as well as computational efficiency. We propose a novel and efficient training algorithm inspired by alternating direction method of multipliers (ADMM) that overcomes some of these limitations. Our method decomposes the training process into independent layer-wise local updates through auxiliary variables. Empirically we observe that our training algorithm always converges and its computational complexity is linearly proportional to the number of edges in the networks. In practice we manage to train DNNs with 64 hidden layers and 1024 nodes per layer for supervised hashing in about 3 hours using a single GPU. Our proposed very deep supervised hashing (VDSH) method significantly outperforms the state-of-the-art on several benchmark datasets. |
|||||
2015 | Bit-scalable Deep Hashing With Regularized Similarity Learning For Image Retrieval And Person Re-identification | Zhang Ruimao, Lin Liang, Zhang Rui, Zuo Wangmeng, Zhang Lei | Arxiv | Extracting informative image features and learning effective approximate hashing functions are two crucial steps in image retrieval. Conventional methods often study these two steps separately, e.g., learning hash functions from a predefined hand-crafted feature space. Meanwhile, the bit lengths of output hashing codes are preset in most previous methods, neglecting the significance level of different bits and restricting their practical flexibility. To address these issues, we propose a supervised learning framework to generate compact and bit-scalable hashing codes directly from raw images. We pose hashing learning as a problem of regularized similarity learning. Specifically, we organize the training images into a batch of triplet samples, each sample containing two images with the same label and one with a different label. With these triplet samples, we maximize the margin between matched pairs and mismatched pairs in the Hamming space. In addition, a regularization term is introduced to enforce the adjacency consistency, i.e., images of similar appearances should have similar codes. The deep convolutional neural network is utilized to train the model in an end-to-end fashion, where discriminative image features and hash functions are simultaneously optimized. Furthermore, each bit of our hashing codes is unequally weighted so that we can manipulate the code lengths by truncating the insignificant bits. Our framework outperforms the state of the art on public benchmarks of similar image search and also achieves promising results in the application of person re-identification in surveillance. It is also shown that the generated bit-scalable hashing codes well preserve the discriminative powers with shorter code lengths. |
|||||
2015 | Hashing With Binary Autoencoders | Carreira-perpiñán Miguel Á., Raziperchikolaei Ramin | Arxiv | An attractive approach for fast search in image databases is binary hashing, where each high-dimensional, real-valued image is mapped onto a low-dimensional, binary vector and the search is done in this binary space. Finding the optimal hash function is difficult because it involves binary constraints, and most approaches approximate the optimization by relaxing the constraints and then binarizing the result. Here, we focus on the binary autoencoder model, which seeks to reconstruct an image from the binary code produced by the hash function. We show that the optimization can be simplified with the method of auxiliary coordinates. This reformulates the optimization as alternating two easier steps: one that learns the encoder and decoder separately, and one that optimizes the code for each image. Image retrieval experiments, using precision/recall and a measure of code utilization, show the resulting hash function outperforms or is competitive with state-of-the-art methods for binary hashing. |
|||||
2015 | Reflectance Hashing For Material Recognition | Zhang Hang, Dana Kristin, Nishino Ko | Arxiv | We introduce a novel method for using reflectance to identify materials. Reflectance offers a unique signature of the material but is challenging to measure and use for recognizing materials due to its high-dimensionality. In this work, one-shot reflectance is captured using a unique optical camera measuring {\it reflectance disks} where the pixel coordinates correspond to surface viewing angles. The reflectance has class-specific structure and angular gradients computed in this reflectance space reveal the material class. These reflectance disks encode discriminative information for efficient and accurate material recognition. We introduce a framework called reflectance hashing that models the reflectance disks with dictionary learning and binary hashing. We demonstrate the effectiveness of reflectance hashing for material recognition with a number of real-world materials. |
|||||
2015 | Deep Semantic Ranking Based Hashing For Multi-label Image Retrieval | Zhao Fang, Huang Yongzhen, Wang Liang, Tan Tieniu | Arxiv | With the rapid growth of web images, hashing has received increasing interests in large scale image retrieval. Research efforts have been devoted to learning compact binary codes that preserve semantic similarity based on labels. However, most of these hashing methods are designed to handle simple binary similarity. The complex multilevel semantic structure of images associated with multiple labels has not yet been well explored. Here we propose a deep semantic ranking based method for learning hash functions that preserve multilevel semantic similarity between multi-label images. In our approach, deep convolutional neural network is incorporated into hash functions to jointly learn feature representations and mappings from them to hash codes, which avoids the limitation of semantic representation power of hand-crafted features. Meanwhile, a ranking list that encodes the multilevel similarity information is employed to guide the learning of such deep hash functions. An effective scheme based on surrogate loss is used to solve the intractable optimization problem of nonsmooth and multivariate ranking measures involved in the learning procedure. Experimental results show the superiority of our proposed approach over several state-of-the-art hashing methods in terms of ranking evaluation metrics when tested on multi-label image datasets. |
|||||
2015 | Perfect Consistent Hashing | Sackman Matthew | Arxiv | Consistent Hashing functions are widely used for load balancing across a variety of applications. However, the original presentation and typical implementations of Consistent Hashing rely on randomised allocation of hash codes to keys which results in a flawed and approximately-uniform allocation of keys to hash codes. We analyse the desired properties and present an algorithm that perfectly achieves them without resorting to any random distributions. The algorithm is simple and adds to our understanding of what is necessary to create a consistent hash function. |
|||||
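The entry above critiques the classical randomized construction of consistent hashing. For reference, a minimal sketch of that classical ring with virtual nodes, whose key-to-node allocation is only approximately uniform, which is exactly the flaw the paper sets out to remove:

```python
import bisect
import hashlib

class HashRing:
    """Classical consistent hashing: nodes get random positions on a ring,
    each key maps to the first node position at or after its own hash."""
    def __init__(self, nodes, vnodes: int = 64):
        self.ring = sorted(
            (self._hash(f"{node}#{v}"), node)
            for node in nodes for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        i = bisect.bisect(self.keys, h) % len(self.ring)   # wrap around the ring
        return self.ring[i][1]

ring = HashRing(["a", "b", "c"])
print(ring.node_for("some-object-id"))
```

Adding or removing one node only remaps the keys in that node's arcs, which is the property both the randomized scheme and the paper's deterministic alternative aim to preserve.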
2015 | Online Supervised Hashing For Ever-growing Datasets | Cakir Fatih, Bargal Sarah Adel, Sclaroff Stan | Arxiv | Supervised hashing methods are widely-used for nearest neighbor search in computer vision applications. Most state-of-the-art supervised hashing approaches employ batch-learners. Unfortunately, batch-learning strategies can be inefficient when confronted with large training datasets. Moreover, with batch-learners, it is unclear how to adapt the hash functions as a dataset continues to grow and diversify over time. Yet, in many practical scenarios the dataset grows and diversifies; thus, both the hash functions and the indexing must swiftly accommodate these changes. To address these issues, we propose an online hashing method that is amenable to changes and expansions of the datasets. Since it is an online algorithm, our approach offers linear complexity with the dataset size. Our solution is supervised, in that we incorporate available label information to preserve the semantic neighborhood. Such an adaptive hashing method is attractive; but it requires recomputing the hash table as the hash functions are updated. If the frequency of update is high, then recomputing the hash table entries may cause inefficiencies in the system, especially for large indexes. Thus, we also propose a framework to reduce hash table updates. We compare our method to state-of-the-art solutions on two benchmarks and demonstrate significant improvements over previous work. |
|||||
2015 | Achieving Arbitrary Locality And Availability In Binary Codes | Wang Anyu, Zhang Zhifang | Arxiv | The \(i\)th coordinate of an \((n,k)\) code is said to have locality \(r\) and availability \(t\) if there exist \(t\) disjoint groups, each containing at most \(r\) other coordinates that can together recover the value of the \(i\)th coordinate. This property is particularly useful for codes for distributed storage systems because it permits local repair and parallel accesses of hot data. In this paper, for any positive integers \(r\) and \(t\), we construct a binary linear code of length \(\binom{r+t}{t}\) which has locality \(r\) and availability \(t\) for all coordinates. The information rate of this code attains \(\frac{r}{r+t}\), which is always higher than that of the direct product code, the only known construction that can achieve arbitrary locality and availability. |
|||||
2015 | Rank Subspace Learning For Compact Hash Codes | Li Kai, Qi Guojun, Ye Jun, Hua Kien A. | Arxiv | The era of Big Data has spawned unprecedented interests in developing hashing algorithms for efficient storage and fast nearest neighbor search. Most existing works learn hash functions that are numeric quantizations of feature values in a projected feature space. In this work, we propose a novel hash learning framework that encodes the rank orders of features instead of their numeric values in a number of optimal low-dimensional ranking subspaces. We formulate the ranking subspace learning problem as the optimization of a piece-wise linear convex-concave function and present two versions of our algorithm: one with independent optimization of each hash bit and the other exploiting a sequential learning framework. Our work is a generalization of the Winner-Take-All (WTA) hash family and naturally enjoys all the numeric stability benefits of rank correlation measures while being optimized to achieve high precision at very short code length. We compare with several state-of-the-art hashing algorithms in both supervised and unsupervised domain, showing superior performance in a number of data sets. |
|||||
2015 | A Reliable Order-statistics-based Approximate Nearest Neighbor Search Algorithm | Verdoliva Luisa, Cozzolino Davide, Poggi Giovanni | Arxiv | We propose a new algorithm for fast approximate nearest neighbor search based on the properties of ordered vectors. Data vectors are classified based on the index and sign of their largest components, thereby partitioning the space in a number of cones centered in the origin. The query is itself classified, and the search starts from the selected cone and proceeds to neighboring ones. Overall, the proposed algorithm corresponds to locality sensitive hashing in the space of directions, with hashing based on the order of components. Thanks to the statistical features emerging through ordering, it deals very well with the challenging case of unstructured data, and is a valuable building block for more complex techniques dealing with structured data. Experiments on both simulated and real-world data prove the proposed algorithm to provide a state-of-the-art performance. |
|||||
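The entry above partitions the space into cones indexed by the position and sign of each vector's largest-magnitude component, and the query searches its own cone first. A toy numpy sketch of that bucketing (probing of neighbouring cones and the use of the full component ordering are omitted; all names are illustrative):

```python
import numpy as np

def cone_key(v: np.ndarray) -> tuple[int, int]:
    """Bucket key: index of the largest-magnitude component and its sign."""
    i = int(np.argmax(np.abs(v)))
    return i, 1 if v[i] >= 0 else -1

def build_index(data: np.ndarray) -> dict:
    buckets: dict[tuple[int, int], list[int]] = {}
    for idx, v in enumerate(data):
        buckets.setdefault(cone_key(v), []).append(idx)
    return buckets

def query(q: np.ndarray, data: np.ndarray, buckets: dict, k: int = 5):
    """Search only the query's own cone; a real search also probes nearby cones."""
    cand = buckets.get(cone_key(q), [])
    if not cand:
        return []
    d = np.linalg.norm(data[cand] - q, axis=1)
    return [cand[i] for i in np.argsort(d)[:k]]
```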
2015 | Coveringlsh Locality-sensitive Hashing Without False Negatives | Pagh Rasmus | Arxiv | We consider a new construction of locality-sensitive hash functions for Hamming space that is covering in the sense that it is guaranteed to produce a collision for every pair of vectors within a given radius \(r\). The construction is efficient in the sense that the expected number of hash collisions between vectors at distance \(cr\), for a given \(c>1\), comes close to that of the best possible data-independent LSH without the covering guarantee, namely, the seminal LSH construction of Indyk and Motwani (STOC ‘98). The efficiency of the new construction essentially matches their bound when the search radius is not too large, e.g., when \(cr = o(\log(n)/\log\log n)\), where \(n\) is the number of points in the data set, and when \(cr = \log(n)/k\) where \(k\) is an integer constant. In general, it differs by at most a factor \(\ln(4)\) in the exponent of the time bounds. As a consequence, LSH-based similarity search in Hamming space can avoid the problem of false negatives at little or no cost in efficiency. |
|||||
2015 | SHOE Supervised Hashing With Output Embeddings | Bondugula Sravanthi, Manjunatha Varun, Davis Larry S., Doermann David | Arxiv | We present a supervised binary encoding scheme for image retrieval that learns projections by taking into account similarity between classes obtained from output embeddings. Our motivation is that binary hash codes learned in this way improve both the visual quality of retrieval results and existing supervised hashing schemes. We employ a sequential greedy optimization that learns relationship aware projections by minimizing the difference between inner products of binary codes and output embedding vectors. We develop a joint optimization framework to learn projections which improve the accuracy of supervised hashing over the current state of the art with respect to standard and sibling evaluation metrics. We further boost performance by applying the supervised dimensionality reduction technique on kernelized input CNN features. Experiments are performed on three datasets: CUB-2011, SUN-Attribute and ImageNet ILSVRC 2010. As a by-product of our method, we show that using a simple k-nn pooling classifier with our discriminative codes improves over the complex classification models on fine grained datasets like CUB and offer an impressive compression ratio of 1024 on CNN features. |
|||||
2015 | Nearbucket-lsh Efficient Similarity Search In P2P Networks | Kraus Naama, Carmel David, Keidar Idit, Orenbach Meni | Arxiv | We present NearBucket-LSH, an effective algorithm for similarity search in large-scale distributed online social networks organized as peer-to-peer overlays. As communication is a dominant consideration in distributed systems, we focus on minimizing the network cost while guaranteeing good search quality. Our algorithm is based on Locality Sensitive Hashing (LSH), which limits the search to collections of objects, called buckets, that have a high probability to be similar to the query. More specifically, NearBucket-LSH employs an LSH extension that searches in near buckets, and improves search quality but also significantly increases the network cost. We decrease the network cost by considering the internals of both LSH and the P2P overlay, and harnessing their properties to our needs. We show that our NearBucket-LSH increases search quality for a given network cost compared to previous art. In many cases, the search quality increases by more than 50%. |
|||||
2015 | I/o-efficient Similarity Join | Pagh Rasmus, Pham Ninh, Silvestri Francesco, Stöckel Morten | Arxiv | We present an I/O-efficient algorithm for computing similarity joins based on locality-sensitive hashing (LSH). In contrast to the filtering methods commonly suggested our method has provable sub-quadratic dependency on the data size. Further, in contrast to straightforward implementations of known LSH-based algorithms on external memory, our approach is able to take significant advantage of the available internal memory: Whereas the time complexity of classical algorithms includes a factor of \(N^\rho\), where \(\rho\) is a parameter of the LSH used, the I/O complexity of our algorithm merely includes a factor \((N/M)^\rho\), where \(N\) is the data size and \(M\) is the size of internal memory. Our algorithm is randomized and outputs the correct result with high probability. It is a simple, recursive, cache-oblivious procedure, and we believe that it will be useful also in other computational settings such as parallel computation. |
|||||
2015 | Binary Speaker Embedding | Li Lantian, Wang Dong, Xing Chao, Yu Kaimin, Zheng Thomas Fang | Arxiv | The popular i-vector model represents speakers as low-dimensional continuous vectors (i-vectors), and hence it is a way of continuous speaker embedding. In this paper, we investigate binary speaker embedding, which transforms i-vectors to binary vectors (codes) by a hash function. We start from locality sensitive hashing (LSH), a simple binarization approach where binary codes are derived from a set of random hash functions. A potential problem of LSH is that the randomly sampled hash functions might be suboptimal. We therefore propose an improved Hamming distance learning approach, where the hash function is learned by a variable-sized block training that projects each dimension of the original i-vectors to variable-sized binary codes independently. Our experiments show that binary speaker embedding can deliver competitive or even better results on both speaker verification and identification tasks, while the memory usage and the computation cost are significantly reduced. |
|||||
2015 | Hash Function Learning Via Codewords | Huang Yinjie, Georgiopoulos Michael, Anagnostopoulos Georgios C. | Arxiv | In this paper we introduce a novel hash learning framework that has two main distinguishing features, when compared to past approaches. First, it utilizes codewords in the Hamming space as ancillary means to accomplish its hash learning task. These codewords, which are inferred from the data, attempt to capture similarity aspects of the data’s hash codes. Secondly and more importantly, the same framework is capable of addressing supervised, unsupervised and, even, semi-supervised hash learning tasks in a natural manner. A series of comparative experiments focused on content-based image retrieval highlights its performance advantages. |
|||||
2015 | Linear Probing With 5-independent Hashing | Thorup Mikkel | Arxiv | These lecture notes show that linear probing takes expected constant time if the hash function is 5-independent. This result was first proved by Pagh et al. [STOC’07,SICOMP’09]. The simple proof here is essentially taken from [Patrascu and Thorup ICALP’10]. We will also consider a smaller space version of linear probing that may have false positives like Bloom filters. These lecture notes illustrate the use of higher moments in data structures, and could be used in a course on randomized algorithms. |
|||||
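The entry above analyzes linear probing under 5-independent hashing; the data structure itself is short. A minimal sketch (Python's built-in `hash` stands in for the k-independent hash family the notes analyze, and the table assumes its load stays below capacity):

```python
class LinearProbingTable:
    """Open addressing with linear probing (no deletion, fixed capacity)."""
    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.slots = [None] * capacity          # each slot is (key, value) or None

    def _probe(self, key):
        i = hash(key) % self.capacity           # stand-in for a 5-independent hash
        while True:
            yield i
            i = (i + 1) % self.capacity         # scan forward, wrapping around

    def put(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return

    def get(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                return None                     # an empty slot ends the run
            if self.slots[i][0] == key:
                return self.slots[i][1]
```

The result discussed in the entry is that with a 5-independent hash function, the expected length of the probe sequence in `get`/`put` is constant.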
2015 | Hdidx High-dimensional Indexing For Efficient Approximate Nearest Neighbor Search | Wan Ji, Tang Sheng, Zhang Yongdong, Li Jintao, Wu Pengcheng, Hoi Steven C. H. | Arxiv | Fast Nearest Neighbor (NN) search is a fundamental challenge in large-scale data processing and analytics, particularly for analyzing multimedia contents which are often of high dimensionality. Instead of using exact NN search, extensive research efforts have been focusing on approximate NN search algorithms. In this work, we present “HDIdx”, an efficient high-dimensional indexing library for fast approximate NN search, which is open-source and written in Python. It offers a family of state-of-the-art algorithms that convert input high-dimensional vectors into compact binary codes, making them very efficient and scalable for NN search with very low space complexity. |
|||||
2015 | Bilinear Random Projections For Locality-sensitive Binary Codes | Kim Saehoon, Choi Seungjin | Arxiv | Locality-sensitive hashing (LSH) is a popular data-independent indexing method for approximate similarity search, where random projections followed by quantization hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. Most of high-dimensional visual descriptors for images exhibit a natural matrix structure. When visual descriptors are represented by high-dimensional feature vectors and long binary codes are assigned, a random projection matrix requires expensive complexities in both space and time. In this paper we analyze a bilinear random projection method where feature matrices are transformed to binary codes by two smaller random projection matrices. We base our theoretical analysis on extending Raginsky and Lazebnik’s result where random Fourier features are composed with random binary quantizers to form locality sensitive binary codes. To this end, we answer the following two questions: (1) whether a bilinear random projection also yields similarity-preserving binary codes; (2) whether a bilinear random projection yields performance gain or loss, compared to a large linear projection. Regarding the first question, we present upper and lower bounds on the expected Hamming distance between binary codes produced by bilinear random projections. In regards to the second question, we analyze the upper and lower bounds on covariance between two bits of binary codes, showing that the correlation between two bits is small. Numerical experiments on MNIST and Flickr45K datasets confirm the validity of our method. |
|||||
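The entry above replaces one large random projection of a vectorized d1 x d2 descriptor with two small matrices applied on either side of the descriptor matrix. A short numpy sketch of the code construction and the memory saving (dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k1, k2 = 64, 128, 16, 16        # descriptor is d1 x d2; code length is k1*k2

# Bilinear: two small Gaussian matrices instead of one (k1*k2) x (d1*d2) matrix.
U = rng.standard_normal((d1, k1))
V = rng.standard_normal((d2, k2))

def bilinear_code(X: np.ndarray) -> np.ndarray:
    """sign(U^T X V), flattened to a k1*k2-bit binary code."""
    return (U.T @ X @ V > 0).astype(np.uint8).ravel()

X = rng.standard_normal((d1, d2))
code = bilinear_code(X)                  # 256 bits from d1*k1 + d2*k2 stored floats,
                                         # versus d1*d2*k1*k2 for a full projection
```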
2015 | Quantization Based Fast Inner Product Search | Guo Ruiqi, Kumar Sanjiv, Choromanski Krzysztof, Simcha David | Arxiv | We propose a quantization based approach for fast approximate Maximum Inner Product Search (MIPS). Each database vector is quantized in multiple subspaces via a set of codebooks, learned directly by minimizing the inner product quantization error. Then, the inner product of a query to a database vector is approximated as the sum of inner products with the subspace quantizers. Different from recently proposed LSH approaches to MIPS, the database vectors and queries do not need to be augmented in a higher dimensional feature space. We also provide a theoretical analysis of the proposed approach, consisting of the concentration results under mild assumptions. Furthermore, if a small sample of example queries is given at the training time, we propose a modified codebook learning procedure which further improves the accuracy. Experimental results on a variety of datasets including those arising from deep neural networks show that the proposed approach significantly outperforms the existing state-of-the-art. |
|||||
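The entry above approximates the inner product between a query and a database vector as the sum, over subspaces, of the query's inner products with that vector's subspace codewords. A small numpy sketch of the scoring step, assuming codebooks and codes are already available (the codebook learning that minimizes inner-product quantization error is not shown):

```python
import numpy as np

def build_tables(query: np.ndarray, codebooks: list[np.ndarray]) -> list[np.ndarray]:
    """For each subspace, precompute <query_subvector, codeword> for all codewords."""
    tables, start = [], 0
    for C in codebooks:                       # C has shape (K, d_sub)
        d_sub = C.shape[1]
        tables.append(C @ query[start:start + d_sub])
        start += d_sub
    return tables

def approx_inner_products(codes: np.ndarray, tables: list[np.ndarray]) -> np.ndarray:
    """codes: (N, M) codeword indices; returns the approximate <query, x_i> per row."""
    scores = np.zeros(codes.shape[0])
    for m, t in enumerate(tables):
        scores += t[codes[:, m]]              # table lookup replaces a dot product
    return scores
```

Scoring a database of N vectors then costs N*M table lookups per query instead of N full d-dimensional dot products.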
2015 | Faster 64-bit Universal Hashing Using Carry-less Multiplications | Lemire Daniel, Kaser Owen | Journal of Cryptographic Engineering | Intel and AMD support the Carry-less Multiplication (CLMUL) instruction set in their x64 processors. We use CLMUL to implement an almost universal 64-bit hash family (CLHASH). We compare this new family with what might be the fastest almost universal family on x64 processors (VHASH). We find that CLHASH is at least 60% faster. We also compare CLHASH with a popular hash function designed for speed (Google’s CityHash). We find that CLHASH is 40% faster than CityHash on inputs larger than 64 bytes and just as fast otherwise. |
|||||
2015 | Approximate Nearest Neighbor Search For \(\ell_p\)-spaces (\(2 < p < \infty\)) Via Embeddings | Bartal Yair, Gottlieb Lee-ad | Arxiv | While the problem of approximate nearest neighbor search has been well-studied for Euclidean space and \(\ell_1\), few non-trivial algorithms are known for \(\ell_p\) when (\(2 < p < \infty\)). In this paper, we revisit this fundamental problem and present approximate nearest-neighbor search algorithms which give the first non-trivial approximation factor guarantees in this setting. |
|||||
2015 | Deephash Getting Regularization Depth And Fine-tuning Right | Lin Jie, Morere Olivier, Chandrasekhar Vijay, Veillard Antoine, Goh Hanlin | Arxiv | This work focuses on representing very high-dimensional global image descriptors using very compact 64-1024 bit binary hashes for instance retrieval. We propose DeepHash: a hashing scheme based on deep networks. Key to making DeepHash work at extremely low bitrates are three important considerations – regularization, depth and fine-tuning – each requiring solutions specific to the hashing problem. In-depth evaluation shows that our scheme consistently outperforms state-of-the-art methods across all data sets for both Fisher Vectors and Deep Convolutional Neural Network features, by up to 20 percent over other schemes. The retrieval performance with 256-bit hashes is close to that of the uncompressed floating point features – a remarkable 512 times compression. |
|||||
2015 | Nearest Neighbor Search In Complex Network For Community Detection | Saha Suman, Ghrera S. P. | Arxiv | Nearest neighbor search is a basic computational tool used extensively in almost all research domains of computer science, especially when dealing with large amounts of data. However, the use of nearest neighbor search for algorithmic development is restricted by the need for a notion of nearness among the data points. The recent trend of research is on large, complex networks and their structural analysis, where nodes represent entities and edges represent any kind of relation between entities. Community detection in complex networks is an important problem of much interest. In general, a community detection algorithm defines an objective function and captures communities by optimizing it to extract those of interest to the user. In this article, we have studied the nearest neighbor search problem in complex networks via the development of a suitable notion of nearness. Initially, we studied and analyzed exact nearest neighbor search using a metric tree on the proposed metric space constructed from the complex network. Afterwards, the approximate nearest neighbor search problem is studied using locality sensitive hashing. For evaluation of the proposed nearest neighbor search on complex networks, we applied it to the community detection problem. The results obtained using our methods are very competitive with most of the well-known algorithms in the literature, and this is verified on a collection of real networks. On the other hand, the time taken by our algorithm is considerably less than that of popular methods. |
|||||
2015 | Speeding Up Neural Networks For Large Scale Classification Using WTA Hashing | Bakhtiary Amir H., Lapedriza Agata, Masip David | Arxiv | In this paper we propose to use the Winner Takes All hashing technique to speed up forward propagation and backward propagation in fully connected layers in convolutional neural networks. The proposed technique reduces significantly the computational complexity, which, in turn, allows us to train layers with a large number of kernels without the associated time penalty. As a consequence we are able to train convolutional neural networks on a very large number of output classes with only a small increase in the computational cost. To show the effectiveness of the technique we train a new output layer on a pretrained network using both the regular multiplicative approach and our proposed hashing methodology. Our results showed no drop in performance and demonstrate, with our implementation, a 7-fold speedup during training. |
|||||
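The entry above speeds up fully connected layers with Winner-Takes-All (WTA) hashing. As background, a minimal sketch of the underlying WTA code computation is given below; the parameters `K` and `n_hashes` and the random-permutation construction are illustrative assumptions, not the authors' layer implementation.

```python
import numpy as np

def wta_hash(x, permutations, K):
    """Return one WTA code per permutation: the argmax among the
    first K permuted coordinates of x."""
    codes = []
    for perm in permutations:
        window = x[perm[:K]]          # first K coordinates under this permutation
        codes.append(int(np.argmax(window)))
    return np.array(codes)

rng = np.random.default_rng(0)
d, K, n_hashes = 128, 4, 16
permutations = [rng.permutation(d) for _ in range(n_hashes)]
x = rng.standard_normal(d)
print(wta_hash(x, permutations, K))   # 16 integers, each in [0, K)
```

Because each code depends only on the ordering of a few coordinates, WTA codes are invariant to monotonic rescaling of the input, which is what makes them attractive as a cheap similarity proxy.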
2015 | Diamond Sampling For Approximate Maximum All-pairs Dot-product (MAD) Search | Ballard Grey, Pinar Ali, Kolda Tamara G., Seshadhri C. | ICDM | Given two sets of vectors, \(A = \{{a_1}, \dots, {a_m}\}\) and \(B=\{{b_1},\dots,{b_n}\}\), our problem is to find the top-\(t\) dot products, i.e., the largest \(|{a_i}\cdot{b_j}|\) among all possible pairs. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all \(mn\) dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of \(A\) and \(B\). The probability of selecting a diamond corresponding to pair \((i,j)\) is proportional to \(({a_i}\cdot{b_j})^2\), amplifying the focus on the largest-magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods. |
|||||
2015 | LSHTC A Benchmark For Large-scale Text Classification | Partalas Ioannis, Kosmopoulos Aris, Baskiotis Nicolas, Artieres Thierry, Paliouras George, Gaussier Eric, Androutsopoulos Ion, Amini Massih-reza, Galinari Patrick | Arxiv | LSHTC is a series of challenges which aims to assess the performance of classification systems in large-scale classification involving a large number of classes (up to hundreds of thousands). This paper describes the datasets that have been released along the LSHTC series. The paper details the construction of the datasets and the design of the tracks, as well as the evaluation measures that we implemented, and gives a quick overview of the results. All of these datasets are available online and runs may still be submitted on the online server of the challenges. |
|||||
2015 | New Hashing Algorithm For Use In TCP Reassembly Module Of IPS | Bagaria Sankalp | Arxiv | Over the last decade, IDS/IPS systems have gained popularity in protecting large networks. They can employ signature-based and/or flow-based techniques to prevent intrusion from outside or inside the network they are trying to protect. Signature-based IDS/IPS can be stateless or stateful. A stateful IDS can store the state of the protocol and use it for better detection of malware. In the case of TCP/IP networks, an attacker can also launch an attack in which the malicious code is distributed over many packets. These packets pass through the traditional IDS/IPS and are reassembled inside the network. Once reassembled by the TCP/IP layer, the malicious code launches its attack. The TCP state and a copy of the last few packets of each active connection have to be maintained in the IDS/IPS. In TCP reassembly, packets are reassembled at the IDS/IPS and searched for signature matches. A connection table has to be maintained for active connections and their lists of the last few (at most 11) packets that have already arrived. We need data structures for finding the connection that the latest incoming packet belongs to. Popular hashing algorithms like CRC, XOR, summing the tuple, and taking the modulus are inefficient, as hash keys are not evenly distributed in the hash-key space. We show how an algorithm based on cryptography concepts can be used for efficient hashing in network connection management. We also show how to use the full four-tuple for calculating the hash key instead of simply summing the tuple and taking the modulus of the sum. |
|||||
2015 | Implicit Sparse Code Hashing | Lin Tsung-yu, Ke Tsung-wei, Liu Tyng-luh | Arxiv | We address the problem of converting large-scale high-dimensional image data into binary codes so that approximate nearest-neighbor search over them can be efficiently performed. Different from most of the existing unsupervised approaches for yielding binary codes, our method is based on a dimensionality-reduction criterion whose resulting mapping is designed to preserve the image relationships entailed by the inner products of sparse codes, rather than those implied by the Euclidean distances in the ambient space. While the proposed formulation does not require computing any sparse codes, the underlying computation model still inevitably involves solving an unmanageable eigenproblem when extremely high-dimensional descriptors are used. To overcome the difficulty, we consider the column-sampling technique and presume a special form of rotation matrix to facilitate subproblem decomposition. We test our method on several challenging image datasets and demonstrate its effectiveness by comparing with state-of-the-art binary coding techniques. |
|||||
2015 | Fast K-nearest Neighbour Search Via Dynamic Continuous Indexing | Li Ke, Malik Jitendra | Arxiv | Existing methods for retrieving k-nearest neighbours suffer from the curse of dimensionality. We argue this is caused in part by inherent deficiencies of space partitioning, which is the underlying strategy used by most existing methods. We devise a new strategy that avoids partitioning the vector space and present a novel randomized algorithm that runs in time linear in the dimensionality of the space and sub-linear in the intrinsic dimensionality and the size of the dataset, and takes space constant in the dimensionality of the space and linear in the size of the dataset. The proposed algorithm allows fine-grained control over accuracy and speed on a per-query basis, automatically adapts to variations in data density, supports dynamic updates to the dataset and is easy to implement. We show appealing theoretical properties and demonstrate empirically that the proposed algorithm outperforms locality-sensitive hashing (LSH) in terms of approximation quality, speed and space efficiency. |
|||||
2015 | Clustering Is Efficient For Approximate Maximum Inner Product Search | Auvolat Alex, Chandar Sarath, Vincent Pascal, Larochelle Hugo, Bengio Yoshua | Arxiv | Efficient Maximum Inner Product Search (MIPS) is an important task that has a wide applicability in recommendation systems and classification with a large number of classes. Solutions based on locality-sensitive hashing (LSH) as well as tree-based solutions have been investigated in the recent literature, to perform approximate MIPS in sublinear time. In this paper, we compare these to another extremely simple approach for solving approximate MIPS, based on variants of the k-means clustering algorithm. Specifically, we propose to train a spherical k-means, after having reduced the MIPS problem to a Maximum Cosine Similarity Search (MCSS). Experiments on two standard recommendation system benchmarks as well as on large vocabulary word embeddings, show that this simple approach yields much higher speedups, for the same retrieval precision, than current state-of-the-art hashing-based and tree-based methods. This simple method also yields more robust retrievals when the query is corrupted by noise. |
|||||
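The clustering approach in the entry above reduces MIPS to a Maximum Cosine Similarity Search and then clusters the database with spherical k-means. A hedged sketch of that pipeline follows, assuming the standard norm-augmentation reduction and arbitrary cluster/probe counts; the paper's exact variant may differ in details.

```python
import numpy as np

def augment(X):
    # Append one coordinate so every row has the same norm phi; inner products
    # with an augmented query [q; 0] are then preserved (standard MIPS -> MCSS reduction).
    norms = np.linalg.norm(X, axis=1)
    phi = norms.max()
    return np.hstack([X, np.sqrt(phi**2 - norms**2)[:, None]])

def spherical_kmeans(U, k=64, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    C = U[rng.choice(len(U), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(U @ C.T, axis=1)          # assign points by cosine similarity
        for j in range(k):
            m = U[assign == j]
            if len(m):
                C[j] = m.sum(0) / np.linalg.norm(m.sum(0))
    return C, assign

def mips_query(q, X, C, assign, n_probe=4):
    qa = np.append(q, 0.0)
    probe = np.argsort(-(C @ qa))[:n_probe]          # most aligned clusters
    cand = np.where(np.isin(assign, probe))[0]
    return cand[np.argmax(X[cand] @ q)]              # exact inner products on candidates only

rng = np.random.default_rng(1)
X = rng.standard_normal((5000, 32))
Xa = augment(X)
U = Xa / np.linalg.norm(Xa, axis=1, keepdims=True)
C, assign = spherical_kmeans(U)
q = rng.standard_normal(32)
print(mips_query(q, X, C, assign), np.argmax(X @ q))   # approximate vs. exact argmax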
2015 | Svcr An R Package For Support Vector Clustering Improved With Geometric Hashing Applied To Lexical Pattern Discovery | Turenne Nicolas | Arxiv | We present a new R package which takes a numerical matrix format as data input and computes clusters using a support vector clustering method (SVC). We have implemented an original 2D-grid labeling approach to speed up cluster extraction. In this sense, SVC can be seen as an efficient cluster-extraction method if clusters are separable in a 2-D map. Secondly, we show that this SVC approach, using a Jaccard-radial basis kernel, can classify a set of terms into ontological classes well enough to help define regular-expression rules for information extraction in documents; our case study concerns a set of terms and documents about developmental and molecular biology. |
|||||
2015 | Near-optimal Bounds For Binary Embeddings Of Arbitrary Sets | Oymak Samet, Recht Ben | Arxiv | We study embedding a subset \(K\) of the unit sphere to the Hamming cube \(\{-1,+1\}^m\). We characterize the tradeoff between distortion and sample complexity \(m\) in terms of the Gaussian width \(\omega(K)\) of the set. For subspaces and several structured sets we show that Gaussian maps provide the optimal tradeoff \(m\sim \delta^{-2}\omega^2(K)\); in particular, for \(\delta\) distortion one needs \(m\approx\delta^{-2}{d}\) where \(d\) is the subspace dimension. For general sets, we provide sharp characterizations which reduce to \(m\approx{\delta^{-4}}{\omega^2(K)}\) after simplification. We provide improved results for local embedding of points that are in close proximity to each other, which is related to locality sensitive hashing. We also discuss faster binary embedding where one takes advantage of an initial sketching procedure based on the Fast Johnson-Lindenstrauss Transform. Finally, we list several numerical observations and discuss open problems. |
|||||
2015 | Quicksort Largest Bucket And Min-wise Hashing With Limited Independence | Knudsen Mathias Bæk Tejs, Stöckel Morten | Arxiv | Randomized algorithms and data structures are often analyzed under the assumption of access to a perfect source of randomness. The most fundamental metric used to measure how “random” a hash function or a random number generator is, is its independence: a sequence of random variables is said to be \(k\)-independent if every variable is uniform and every size-\(k\) subset is independent. In this paper we consider three classic algorithms under limited independence. We provide new bounds for randomized quicksort, min-wise hashing and largest bucket size under limited independence. Our results can be summarized as follows. - Randomized quicksort: when pivot elements are computed using a \(5\)-independent hash function, Karloff and Raghavan (J. ACM'93) showed \(O(n \log n)\) expected worst-case running time for a special version of quicksort. We improve upon this, showing that the same running time is achieved with only \(4\)-independence. - Min-wise hashing: for a set \(A\), consider the probability of a particular element being mapped to the smallest hash value. It is known that \(5\)-independence implies the optimal probability \(O(1/n)\). Broder et al. (STOC'98) showed that \(2\)-independence implies it is \(O(1/\sqrt{|A|})\). We show a matching lower bound as well as new tight bounds for \(3\)- and \(4\)-independent hash functions. - Largest bucket: we consider the case where \(n\) balls are distributed to \(n\) buckets using a \(k\)-independent hash function and analyze the largest bucket size. Alon et al. (STOC'97) showed that there exists a \(2\)-independent hash function implying a bucket of size \(\Omega(n^{1/2})\). We generalize the bound, providing a \(k\)-independent family of functions that imply size \(\Omega(n^{1/k})\). |
|||||
2015 | From Independence To Expansion And Back Again | Christiani Tobias, Pagh Rasmus, Thorup Mikkel | Arxiv | We consider the following fundamental problems: (1) Constructing \(k\)-independent hash functions with a space-time tradeoff close to Siegel’s lower bound. (2) Constructing representations of unbalanced expander graphs having small size and allowing fast computation of the neighbor function. It is not hard to show that these problems are intimately connected in the sense that a good solution to one of them leads to a good solution to the other one. In this paper we exploit this connection to present efficient, recursive constructions of \(k\)-independent hash functions (and hence expanders with a small representation). While the previously most efficient construction (Thorup, FOCS 2013) needed time quasipolynomial in Siegel’s lower bound, our time bound is just a logarithmic factor from the lower bound. |
|||||
2015 | Efficient Data Hashing With Structured Binary Embeddings | Choromanski Krzysztof | Arxiv | We present here new mechanisms for hashing data via binary embeddings. Contrary to most of the techniques presented before, the embedding matrix of our mechanism is highly structured. That enables us to perform hashing more efficiently and use less memory. What is crucial and nonintuitive is the fact that imposing a structured mechanism does not affect the quality of the produced hash. To the best of our knowledge, we are the first to give strong theoretical guarantees for the proposed binary hashing method by proving the efficiency of the mechanism for several classes of structured projection matrices. As a corollary, we obtain binary hashing mechanisms with strong concentration results for circulant and Toeplitz matrices. Our approach is, however, much more general. |
|||||
2015 | More Analysis Of Double Hashing For Balanced Allocations | Mitzenmacher Michael | Arxiv | With double hashing, for a key \(x\), one generates two hash values \(f(x)\) and \(g(x)\), and then uses the combinations \((f(x) + i g(x)) \bmod n\) for \(i=0,1,2,\ldots\) to generate multiple hash values in the range \([0,n-1]\) from the initial two. For balanced allocations, keys are hashed into a hash table where each bucket can hold multiple keys, and each key is placed in the least loaded of \(d\) choices. It has been shown previously, using fluid limit methods, that asymptotically the performance of double hashing and fully random hashing is the same in the balanced allocation paradigm. Here we extend a coupling argument used by Lueker and Molodowitch to show that double hashing and ideal uniform hashing are asymptotically equivalent, from the setting of open address hash tables to the balanced allocation setting, providing further insight into this phenomenon. We also discuss the potential for, and the bottlenecks limiting, the use of this approach for other multiple-choice hashing schemes. |
|||||
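As an illustration of the scheme the entry above analyzes, a small sketch of double-hashed balanced allocation follows: the \(d\) probe locations come from \((f(x) + i g(x)) \bmod n\) and the ball goes to the least loaded of them. The hash construction (salted BLAKE2) and the trick of forcing \(g(x)\) odd so the probes stay distinct when \(n\) is a power of two are assumptions of this sketch.

```python
import hashlib

def h(salt, key, n):
    # Salted 64-bit hash reduced mod n; a stand-in for the f and g of the abstract.
    digest = hashlib.blake2b(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "little") % n

def place(key, loads, d=3):
    n = len(loads)
    f, g = h("f", key, n), h("g", key, n) | 1   # odd g => d distinct probes for power-of-two n
    choices = [(f + i * g) % n for i in range(d)]
    target = min(choices, key=lambda b: loads[b])
    loads[target] += 1
    return target

loads = [0] * 1024
for k in range(10_000):
    place(f"key{k}", loads)
print(max(loads), sum(loads) / len(loads))      # max load stays close to the average load
```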
2015 | Per-bucket Concurrent Rehashing Algorithms | Malakhov Anton | Arxiv | This paper describes a generic algorithm for concurrent resizing and on-demand per-bucket rehashing for an extensible hash table. In contrast to known lock-based hash table algorithms, the proposed algorithm separates the resizing and rehashing stages so that they neither invalidate existing buckets nor block any concurrent operations. Instead, the rehashing work is deferred and split across subsequent operations with the table. The rehashing operation uses bucket-level synchronization only and therefore allows a race condition between lookup and moving operations running in different threads. Instead of using explicit synchronization, the algorithm detects the race condition and restarts the lookup operation. In comparison with other lock-based algorithms, the proposed algorithm reduces high-level synchronization on the hot path, improving performance, concurrency, and scalability of the table. The response time of the operations is also more predictable. The algorithm is compatible with cache friendly data layouts for buckets and does not depend on any memory reclamation techniques thus potentially achieving additional performance gain with corresponding implementations. |
|||||
2015 | Unsupervised Feature Learning For Dense Correspondences Across Scenes | Zhang Chao, Shen Chunhua, Shen Tingzhi | Arxiv | We propose a fast, accurate matching method for estimating dense pixel correspondences across scenes. It is a challenging problem to estimate dense pixel correspondences between images depicting different scenes or instances of the same object category. While most such matching methods rely on hand-crafted features such as SIFT, we learn features from a large number of unlabeled image patches using unsupervised learning. Pixel-layer features are obtained by encoding over the dictionary, followed by spatial pooling to obtain patch-layer features. The learned features are then seamlessly embedded into a multi-layer matching framework. We experimentally demonstrate that the learned features, together with our matching model, outperform state-of-the-art methods such as SIFT flow, coherency sensitive hashing and the recent deformable spatial pyramid matching methods in terms of both accuracy and computational efficiency. Furthermore, we evaluate the performance of a few different dictionary learning and feature encoding methods in the proposed pixel correspondence estimation framework, and analyse the impact of dictionary learning and feature encoding on the final matching performance. |
|||||
2015 | Binary Embeddings With Structured Hashed Projections | Choromanska Anna, Choromanski Krzysztof, Bojarski Mariusz, Jebara Tony, Kumar Sanjiv, Lecun Yann | Arxiv | We consider the hashing mechanism for constructing binary embeddings, which involves pseudo-random projections followed by nonlinear (sign function) mappings. The pseudo-random projection is described by a matrix, where not all entries are independent random variables but instead a fixed “budget of randomness” is distributed across the matrix. Such matrices can be efficiently stored in sub-quadratic or even linear space, provide reduction in randomness usage (i.e., the number of required random values), and very often lead to computational speed ups. We prove several theoretical results showing that projections via various structured matrices followed by nonlinear mappings accurately preserve the angular distance between input high-dimensional vectors. To the best of our knowledge, these results are the first that give theoretical ground for the use of general structured matrices in the nonlinear setting. In particular, they generalize previous extensions of the Johnson-Lindenstrauss lemma and prove the plausibility of the approach that was so far only heuristically confirmed for some special structured matrices. Consequently, we show that many structured matrices can be used as an efficient information compression mechanism. Our findings build a better understanding of certain deep architectures, which contain randomly weighted and untrained layers, and yet achieve high performance on different learning tasks. We empirically verify our theoretical findings and show how learning via structured hashed projections affects the performance of a neural network as well as a nearest neighbor classifier. |
|||||
2015 | HCLAE High Capacity Locally Aggregating Encodings For Approximate Nearest Neighbor Search | Liu Shicong, Shao Junru, Lu Hongtao | Arxiv | Vector quantization-based approaches are successful in solving Approximate Nearest Neighbor (ANN) problems, which are critical to many applications. The idea is to generate effective encodings that allow fast distance approximation. We propose that quantization-based methods should partition the data space finely and exhibit locality of the dataset to allow efficient non-exhaustive search. In this paper, we introduce the concept of High Capacity Locally Aggregating Encodings (HCLAE) to this end, and propose Dictionary Annealing (DA) to learn HCLAE by a simulated annealing procedure. The resulting quantization error is lower than that of other state-of-the-art methods. The DA algorithms can easily be extended to an online learning scheme, allowing effective handling of large-scale data. Further, we propose Aggregating-Tree (A-Tree), a non-exhaustive search method using HCLAE to perform efficient ANN search. A-Tree achieves orders-of-magnitude speed-ups on ANN search tasks compared to the state of the art. |
|||||
2015 | Diverse Yet Efficient Retrieval Using Hash Functions | Rao Vidyadhar, Jain Prateek, Jawahar C. V | Arxiv | Typical retrieval systems have three requirements: a) Accurate retrieval i.e., the method should have high precision, b) Diverse retrieval, i.e., the obtained set of points should be diverse, c) Retrieval time should be small. However, most of the existing methods address only one or two of the above mentioned requirements. In this work, we present a method based on randomized locality sensitive hashing which tries to address all of the above requirements simultaneously. While earlier hashing approaches considered approximate retrieval to be acceptable only for the sake of efficiency, we argue that one can further exploit approximate retrieval to provide impressive trade-offs between accuracy and diversity. We extend our method to the problem of multi-label prediction, where the goal is to output a diverse and accurate set of labels for a given document in real-time. Moreover, we introduce a new notion to simultaneously evaluate a method’s performance for both the precision and diversity measures. Finally, we present empirical results on several different retrieval tasks and show that our method retrieves diverse and accurate images/labels while ensuring \(100x\)-speed-up over the existing diverse retrieval approaches. |
|||||
2015 | Min-max Kernels | Li Ping | Arxiv | The min-max kernel is a generalization of the popular resemblance kernel (which is designed for binary data). In this paper, we demonstrate, through an extensive classification study using kernel machines, that the min-max kernel often provides an effective measure of similarity for nonnegative data. As the min-max kernel is nonlinear and might be difficult to use for industrial applications with massive data, we show that the min-max kernel can be linearized via hashing techniques. This allows practitioners to apply the min-max kernel to large-scale applications using well matured linear algorithms such as linear SVM or logistic regression. The previous remarkable work on consistent weighted sampling (CWS) produces samples in the form of (\(i^*, t^*\)), where \(i^*\) records the location (and in fact also the weights) information analogous to the samples produced by classical minwise hashing on binary data. Because \(t^*\) is theoretically unbounded, it was not immediately clear how to effectively implement CWS for building large-scale linear classifiers. In this paper, we provide a simple solution by discarding \(t^*\) (which we refer to as the “0-bit” scheme). Via an extensive empirical study, we show that this 0-bit scheme does not lose essential information. We then apply the “0-bit” CWS for building linear classifiers to approximate min-max kernel classifiers, as extensively validated on a wide range of publicly available classification datasets. We expect this work will generate interest among data mining practitioners who would like to efficiently utilize the nonlinear information of non-binary and nonnegative data. |
|||||
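For reference, the min-max kernel discussed above is simply \(K(x,y)=\sum_i \min(x_i,y_i) / \sum_i \max(x_i,y_i)\) for nonnegative vectors. The snippet below computes it exactly; the hashing-based linearization via consistent weighted sampling is not shown.

```python
import numpy as np

def min_max_kernel(x, y):
    # K(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i), defined for nonnegative data.
    x, y = np.asarray(x, float), np.asarray(y, float)
    assert (x >= 0).all() and (y >= 0).all()
    denom = np.maximum(x, y).sum()
    return np.minimum(x, y).sum() / denom if denom > 0 else 1.0

print(min_max_kernel([1, 0, 2, 3], [2, 1, 2, 0]))   # 3/8 = 0.375
```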
2015 | Binary Coding In Stream | Ghashami Mina, Abdullah Amirali | Arxiv | Big data is becoming ever more ubiquitous, ranging over massive video repositories, document corpora, image sets and Internet routing history. Proximity search and clustering are two algorithmic primitives fundamental to data analysis, but they suffer from the “curse of dimensionality” on these gigantic datasets. A popular attack on this problem is to convert object representations into short binary codewords while approximately preserving near-neighbor structure. However, there has been limited research on constructing codewords in the “streaming” or “online” settings often applicable to this scale of data, where one may only make a single pass over data too massive to fit in local memory. In this paper, we apply recent advances in matrix sketching techniques to construct binary codewords in both the streaming and online settings. Our experimental results show that we outperform several of the most popularly used algorithms, and we prove theoretical guarantees on performance in the streaming setting under mild assumptions on the data and randomness of the training set. |
|||||
2015 | Short Text Hashing Improved By Integrating Multi-granularity Topics And Tags | Xu Jiaming, Xu Bo, Tian Guanhua, Zhao Jun, Wang Fangyuan, Hao Hongwei | Arxiv | Due to computational and storage efficiencies of compact binary codes, hashing has been widely used for large-scale similarity search. Unfortunately, many existing hashing methods based on observed keyword features are not effective for short texts due to the sparseness and shortness. Recently, some researchers try to utilize latent topics of certain granularity to preserve semantic similarity in hash codes beyond keyword matching. However, topics of certain granularity are not adequate to represent the intrinsic semantic information. In this paper, we present a novel unified approach for short text Hashing using Multi-granularity Topics and Tags, dubbed HMTT. In particular, we propose a selection method to choose the optimal multi-granularity topics depending on the type of dataset, and design two distinct hashing strategies to incorporate multi-granularity topics. We also propose a simple and effective method to exploit tags to enhance the similarity of related texts. We carry out extensive experiments on one short text dataset as well as on one normal text dataset. The results demonstrate that our approach is effective and significantly outperforms baselines on several evaluation metrics. |
|||||
2015 | Compressing Convolutional Neural Networks | Chen Wenlin, Wilson James T., Tyree Stephen, Weinberger Kilian Q., Chen Yixin | Arxiv | Convolutional neural networks (CNN) are increasingly used in many areas of computer vision. They are particularly attractive because of their ability to “absorb” great quantities of labeled data through millions of parameters. However, as model sizes increase, so do the storage and memory requirements of the classifiers. We present a novel network architecture, Frequency-Sensitive Hashed Nets (FreshNets), which exploits inherent redundancy in both convolutional layers and fully-connected layers of a deep learning model, leading to dramatic savings in memory and storage consumption. Based on the key observation that the weights of learned convolutional filters are typically smooth and low-frequency, we first convert filter weights to the frequency domain with a discrete cosine transform (DCT) and use a low-cost hash function to randomly group frequency parameters into hash buckets. All parameters assigned the same hash bucket share a single value learned with standard back-propagation. To further reduce model size we allocate fewer hash buckets to high-frequency components, which are generally less important. We evaluate FreshNets on eight data sets, and show that it leads to drastically better compressed performance than several relevant baselines. |
|||||
2015 | Improved Residual Vector Quantization For High-dimensional Approximate Nearest Neighbor Search | Liu Shicong, Lu Hongtao, Shao Junru | Arxiv | Quantization methods have been introduced to perform large scale approximate nearest neighbor search tasks. Residual Vector Quantization (RVQ) is one of the effective quantization methods. RVQ uses a multi-stage codebook learning scheme to lower the quantization error stage by stage. However, there are two major limitations for RVQ when applied to high-dimensional approximate nearest neighbor search: 1. The performance gain diminishes quickly with added stages. |
|||||
2015 | On Large-scale Retrieval Binary Or N-ary Coding | Najibi Mahyar, Rastegari Mohammad, Davis Larry S. | Arxiv | The growing amount of data available in modern-day datasets creates the need to efficiently search and retrieve information. To make large-scale search feasible, Distance Estimation and Subset Indexing are the main approaches. Although binary coding has been popular for implementing both techniques, n-ary coding (known as Product Quantization) is also very effective for Distance Estimation. However, their relative performance has not been studied for Subset Indexing. We investigate whether binary or n-ary coding works better under different retrieval strategies. This leads to the design of a new n-ary coding method, “Linear Subspace Quantization (LSQ)”, which, unlike other n-ary encoders, can be used as a similarity-preserving embedding. Experiments on image retrieval show that when Distance Estimation is used, n-ary LSQ outperforms other methods. However, when Subset Indexing is applied, interestingly, binary codings are more effective and binary LSQ achieves the best accuracy. |
|||||
2015 | Composite Correlation Quantization For Efficient Multimodal Retrieval | Long Mingsheng, Cao Yue, Wang Jianmin, Yu Philip S. | Arxiv | Efficient similarity retrieval from large-scale multimodal database is pervasive in modern search engines and social networks. To support queries across content modalities, the system should enable cross-modal correlation and computation-efficient indexing. While hashing methods have shown great potential in achieving this goal, current attempts generally fail to learn isomorphic hash codes in a seamless scheme, that is, they embed multiple modalities in a continuous isomorphic space and separately threshold embeddings into binary codes, which incurs substantial loss of retrieval accuracy. In this paper, we approach seamless multimodal hashing by proposing a novel Composite Correlation Quantization (CCQ) model. Specifically, CCQ jointly finds correlation-maximal mappings that transform different modalities into isomorphic latent space, and learns composite quantizers that convert the isomorphic latent features into compact binary codes. An optimization framework is devised to preserve both intra-modal similarity and inter-modal correlation through minimizing both reconstruction and quantization errors, which can be trained from both paired and partially paired data in linear time. A comprehensive set of experiments clearly show the superior effectiveness and efficiency of CCQ against the state of the art hashing methods for both unimodal and cross-modal retrieval. |
|||||
2015 | Accelerated Distance Computation With Encoding Tree For High Dimensional Data | Liu Shicong, Shao Junru, Lu Hongtao | Arxiv | We propose a novel method to calculate distances between high-dimensional vector pairs, utilizing encodings generated by vector quantization. Vector quantization based methods are successful in handling large-scale high-dimensional data. These methods compress vectors into short encodings, and allow efficient distance computation between an uncompressed vector and a compressed dataset without explicit decompression. However, for large datasets, these distance computing methods perform excessive computations. We avoid excessive computations by storing the encodings in an Encoding Tree (E-Tree); interestingly, the memory consumption is also lowered. We also propose the Encoding Forest (E-Forest) to further lower the computation cost. E-Tree and E-Forest are compatible with various existing quantization-based methods. We show by experiments that our methods drastically speed up distance computation for high-dimensional data, and that various existing algorithms can benefit from our methods. |
|||||
2015 | Properties And Examples Of Faber--walsh Polynomials | Sète Olivier, Liesen Jörg | Computational Methods and Function Theory Volume | The Faber–Walsh polynomials are a direct generalization of the (classical) Faber polynomials from simply connected sets to sets with several simply connected components. In this paper we derive new properties of the Faber–Walsh polynomials, where we focus on results of interest in numerical linear algebra, and on the relation between the Faber–Walsh polynomials and the classical Faber and Chebyshev polynomials. Moreover, we present examples of Faber–Walsh polynomials for two real intervals as well as some non-real sets consisting of several simply connected components. |
|||||
2015 | Constrained Sampling And Counting Universal Hashing Meets SAT Solving | Meel Kuldeep S., Vardi Moshe, Chakraborty Supratik, Fremont Daniel J., Seshia Sanjit A., Fried Dror, Ivrii Alexander, Malik Sharad | Arxiv | Constrained sampling and counting are two fundamental problems in artificial intelligence with a diverse range of applications, spanning probabilistic reasoning and planning to constrained-random verification. While the theory of these problems was thoroughly investigated in the 1980s, prior work either did not scale to industrial size instances or gave up correctness guarantees to achieve scalability. Recently, we proposed a novel approach that combines universal hashing and SAT solving and scales to formulas with hundreds of thousands of variables without giving up correctness guarantees. This paper provides an overview of the key ingredients of the approach and discusses challenges that need to be overcome to handle larger real-world instances. |
|||||
2015 | A Deep Hashing Learning Network | Zhong Guoqiang, Yang Pan, Wang Sijiang, Dong Junyu | Arxiv | Hashing-based methods seek compact and efficient binary codes that preserve the neighborhood structure in the original data space. For most existing hashing methods, an image is first encoded as a vector of hand-crafted visual features, followed by a hash projection and quantization step to get the compact binary vector. Most hand-crafted features encode only low-level information of the input, so they may not preserve the semantic similarities of image pairs. Meanwhile, the hash function learning process is independent of the feature representation, so the features may not be optimal for the hashing projection. In this paper, we propose a supervised hashing method based on a well designed deep convolutional neural network, which tries to learn the hash codes and compact representations of data simultaneously. The proposed model learns the binary codes by adding a compact sigmoid layer before the loss layer. Experiments on several image data sets show that the proposed model outperforms other state-of-the-art methods. |
|||||
2015 | On Binary Embedding Using Circulant Matrices | Yu Felix X., Bhaskara Aditya, Kumar Sanjiv, Gong Yunchao, Chang Shih-fu | Arxiv | Binary embeddings provide efficient and powerful ways to perform operations on large scale data. However binary embedding typically requires long codes in order to preserve the discriminative power of the input space. Thus binary coding methods traditionally suffer from high computation and storage costs in such a scenario. To address this problem, we propose Circulant Binary Embedding (CBE) which generates binary codes by projecting the data with a circulant matrix. The circulant structure allows us to use Fast Fourier Transform algorithms to speed up the computation. For obtaining \(k\)-bit binary codes from \(d\)-dimensional data, this improves the time complexity from \(O(dk)\) to \(O(d\log d)\), and the space complexity from \(O(dk)\) to \(O(d)\). We study two settings, which differ in the way we choose the parameters of the circulant matrix. In the first, the parameters are chosen randomly and in the second, the parameters are learned using the data. For randomized CBE, we give a theoretical analysis comparing it with binary embedding using an unstructured random projection matrix. The challenge here is to show that the dependencies in the entries of the circulant matrix do not lead to a loss in performance. In the second setting, we design a novel time-frequency alternating optimization to learn data-dependent circulant projections, which alternatively minimizes the objective in original and Fourier domains. In both the settings, we show by extensive experiments that the CBE approach gives much better performance than the state-of-the-art approaches if we fix a running time, and provides much faster computation with negligible performance degradation if we fix the number of bits in the embedding. |
|||||
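The randomized variant of the circulant embedding above can be sketched in a few lines: random sign flipping, a circulant projection computed as a circular convolution via the FFT in \(O(d\log d)\), then the signs of the first \(k\) coordinates. The convolution convention and the parameter choices below are assumptions of the sketch, not the paper's exact construction.

```python
import numpy as np

def cbe_hash(x, r, signs, k):
    xd = x * signs                                             # random sign flipping (D x)
    proj = np.fft.ifft(np.fft.fft(r) * np.fft.fft(xd)).real   # circulant projection via FFT
    return (proj[:k] > 0).astype(np.uint8)                     # first k sign bits

rng = np.random.default_rng(0)
d, k = 512, 64
r = rng.standard_normal(d)                # defines the random circulant matrix
signs = rng.choice([-1.0, 1.0], size=d)
x = rng.standard_normal(d)
print(cbe_hash(x, r, signs, k))
```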
2015 | Tiny Descriptors For Image Retrieval With Unsupervised Triplet Hashing | Lin Jie, Morère Olivier, Petta Julie, Chandrasekhar Vijay, Veillard Antoine | Arxiv | A typical image retrieval pipeline starts with the comparison of global descriptors from a large database to find a short list of candidate matches. A good image descriptor is key to the retrieval pipeline and should reconcile two contradictory requirements: providing recall rates as high as possible and being as compact as possible for fast matching. Following the recent successes of Deep Convolutional Neural Networks (DCNN) for large scale image classification, descriptors extracted from DCNNs are increasingly used in place of the traditional hand-crafted descriptors such as Fisher Vectors (FV), with better retrieval performance. Nevertheless, the dimensionality of a typical DCNN descriptor –extracted either from the visual feature pyramid or the fully-connected layers– remains quite high at several thousands of scalar values. In this paper, we propose Unsupervised Triplet Hashing (UTH), a fully unsupervised method to compute extremely compact binary hashes –in the 32-256 bits range– from high-dimensional global descriptors. UTH consists of two successive deep learning steps. First, Stacked Restricted Boltzmann Machines (SRBM), a type of unsupervised deep neural nets, are used to learn binary embedding functions able to bring the descriptor size down to the desired bitrate. SRBMs are typically able to ensure a very high compression rate at the expense of losing some desirable metric properties of the original DCNN descriptor space. Then, triplet networks, a rank learning scheme based on weight-sharing nets, are used to fine-tune the binary embedding functions to retain as much as possible of the useful metric properties of the original space. A thorough empirical evaluation conducted on multiple publicly available datasets using DCNN descriptors shows that our method is able to significantly outperform state-of-the-art unsupervised schemes in the target bit range. |
|||||
2015 | Cross-modality Hashing With Partial Correspondence | Gu Yun, Xue Haoyang, Yang Jie | Arxiv | Learning a hashing function for cross-media search is very desirable due to its low storage cost and fast query speed. However, the data crawled from the Internet cannot always guarantee good correspondence among different modalities, which affects the learning of the hashing function. In this paper, we focus on cross-modal hashing with partially corresponded data. The data without full correspondence are made use of to enhance the hashing performance. The experiments on the Wiki and NUS-WIDE datasets demonstrate that the proposed method outperforms some state-of-the-art hashing approaches with less correspondence information. |
|||||
2015 | Optimization Of Tree Modes For Parallel Hash Functions A Case Study | Atighehchi Kevin, Rolland Robert | Arxiv | This paper focuses on parallel hash functions based on tree modes of operation for an inner Variable-Input-Length function. This inner function can be either a single-block-length (SBL) and prefix-free MD hash function, or a sponge-based hash function. We discuss the various forms of optimality that can be obtained when designing parallel hash functions based on trees where all leaves have the same depth. The first result is a scheme which optimizes the tree topology in order to decrease the running time. Then, without affecting the optimal running time we show that we can slightly change the corresponding tree topology so as to minimize the number of required processors as well. Consequently, the resulting scheme decreases in the first place the running time and in the second place the number of required processors. |
|||||
2015 | Sampled Weighted Min-hashing For Large-scale Topic Mining | Fuentes-pineda Gibran, Meza-ruiz Ivan Vladimir | Arxiv | We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SWMH topics are ordered subsets of such vocabulary. Interestingly, the topics mined by SWMH underlie themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics both qualitatively and quantitatively on the NIPS (1.7 K documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora. Additionally, we compare the quality of SWMH with Online LDA topics for document representation in classification. |
|||||
2015 | Fast K-nn Search | Hyvönen Ville, Pitkänen Teemu, Tasoulis Sotiris, Jääsaari Elias, Tuomainen Risto, Wang Liang, Corander Jukka, Roos Teemu | IEEE International Conference on Big Data | Efficient index structures for fast approximate nearest neighbor queries are required in many applications such as recommendation systems. In high-dimensional spaces, many conventional methods suffer from excessive usage of memory and slow response times. We propose a method where multiple random projection trees are combined by a novel voting scheme. The key idea is to exploit the redundancy in a large number of candidate sets obtained by independently generated random projections in order to reduce the number of expensive exact distance evaluations. The method is straightforward to implement using sparse projections which leads to a reduced memory footprint and fast index construction. Furthermore, it enables grouping of the required computations into big matrix multiplications, which leads to additional savings due to cache effects and low-level parallelization. We demonstrate by extensive experiments on a wide variety of data sets that the method is faster than existing partitioning tree or hashing based approaches, making it the fastest available technique on high accuracy levels. |
|||||
2015 | Multi-probe Consistent Hashing | Appleton Ben, O'reilly Michael | Arxiv | We describe a consistent hashing algorithm which performs multiple lookups per key in a hash table of nodes. It requires no additional storage beyond the hash table, and achieves a peak-to-average load ratio of \(1 + \epsilon\) with just \(1 + 1/\epsilon\) lookups per key. |
|||||
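A hedged reading of the scheme in the entry above: nodes are hashed once onto a ring, each key is probed at several positions, and the key goes to the probe whose successor node lies closest. The probe count and hash construction below are assumptions for illustration, not the authors' exact parameters.

```python
import bisect, hashlib

def ring_pos(s):
    # 64-bit position on the ring for a string identifier.
    return int.from_bytes(hashlib.blake2b(s.encode()).digest()[:8], "big")

class MultiProbeRing:
    def __init__(self, nodes, n_probes=21):
        self.n_probes = n_probes
        self.points = sorted((ring_pos(n), n) for n in nodes)   # one point per node, no virtual nodes
        self.keys = [p for p, _ in self.points]

    def lookup(self, key):
        best, best_gap = None, None
        for i in range(self.n_probes):
            p = ring_pos(f"{key}#{i}")
            j = bisect.bisect(self.keys, p) % len(self.keys)    # successor node on the ring
            gap = (self.keys[j] - p) % (1 << 64)                # distance from probe to that node
            if best_gap is None or gap < best_gap:
                best, best_gap = self.points[j][1], gap
        return best

ring = MultiProbeRing([f"node{i}" for i in range(10)])
print(ring.lookup("some-key"))
```

The appeal over classic virtual-node consistent hashing is that only one ring point is stored per node, while load balance improves as the number of probes grows.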
2015 | A Coloring Of The Square Of The 8-cube With 13 Colors | Kokkala Janne I., Östergård Patric R. J. | Arxiv | Let \(\chi_{\bar{k}}(n)\) be the number of colors required to color the \(n\)-dimensional hypercube such that no two vertices with the same color are at a distance at most \(k\). In other words, \(\chi_{\bar{k}}(n)\) is the minimum number of binary codes with minimum distance at least \(k+1\) required to partition the \(n\)-dimensional Hamming space. By giving an explicit coloring, it is shown that \(\chi_{\bar{2}}(8)=13\). |
|||||
2015 | Pairwise Rotation Hashing For High-dimensional Features | Ishikawa Kohta, Sato Ikuro, Ambai Mitsuru | Arxiv | Binary hashing is widely used for effective approximate nearest neighbors search. Even though various binary hashing methods have been proposed, very few methods are feasible for extremely high-dimensional features often used in visual tasks today. We propose a novel highly sparse linear hashing method based on pairwise rotations. The encoding cost of the proposed algorithm is \(\mathrm{O}(n \log n)\) for \(n\)-dimensional features, whereas that of the existing state-of-the-art method is typically \(\mathrm{O}(n^2)\). The proposed method is also remarkably faster in the learning phase. Along with the efficiency, the retrieval accuracy is comparable to or slightly better than the state-of-the-art. Pairwise rotations used in our method are formulated from an analytical study of the trade-off relationship between quantization error and entropy of binary codes. Although these hashing criteria are widely used in previous research, their analytical behavior is rarely studied. All building blocks of our algorithm are based on the analytical solution, and it thus provides a fairly simple and efficient procedure. |
|||||
2015 | Projection Bank From High-dimensional Data To Medium-length Binary Codes | Liu Li, Yu Mengyang, Shao Ling | Arxiv | Recently, very high-dimensional feature representations, e.g., Fisher Vector, have achieved excellent performance for visual recognition and retrieval. However, these lengthy representations always cause extremely heavy computational and storage costs and even become unfeasible in some large-scale applications. A few existing techniques can transfer very high-dimensional data into binary codes, but they still require the reduced code length to be relatively long to maintain acceptable accuracies. To target a better balance between computational efficiency and accuracies, in this paper, we propose a novel embedding method called Binary Projection Bank (BPB), which can effectively reduce the very high-dimensional representations to medium-dimensional binary codes without sacrificing accuracies. Instead of using conventional single linear or bilinear projections, the proposed method learns a bank of small projections via the max-margin constraint to optimally preserve the intrinsic data similarity. We have systematically evaluated the proposed method on three datasets: Flickr 1M, ILSVR2010 and UCF101, showing competitive retrieval and recognition accuracies compared with state-of-the-art approaches, but with a significantly smaller memory footprint and lower coding complexity. |
|||||
2015 | CNN Based Hashing For Image Retrieval | Guo Jinma, Li Jianmin | Arxiv | As data on the web increases dramatically, hashing is becoming more and more popular as a method for approximate nearest neighbor search. Previous supervised hashing methods utilized a similarity/dissimilarity matrix to obtain semantic information, but such a matrix is not easy to construct for a new dataset. Rather than reconstructing the matrix, we propose a straightforward CNN-based hashing method, i.e., binarizing the activations of a fully connected layer with threshold 0 and taking the binary result as hash codes. This method achieved the best performance on CIFAR-10 and was comparable with the state-of-the-art on MNIST. Our experiments on CIFAR-10 suggest that the signs of activations may carry more information than the relative values of activations between samples, and that the co-adaptation between the feature extractor and the hash functions is important for hashing. |
|||||
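The coding step in the entry above is essentially thresholding fully connected activations at zero. A minimal NumPy illustration follows; the trained CNN feature extractor is stubbed out with random activations, so the numbers are purely illustrative.

```python
import numpy as np

def activations_to_hash(acts):
    # Binarize FC-layer activations at threshold 0 to obtain hash codes.
    return (np.asarray(acts) > 0).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

fc = np.random.default_rng(0).standard_normal((3, 48))   # 3 images, 48-d FC activations
codes = activations_to_hash(fc)
print(codes.shape, hamming(codes[0], codes[1]))
```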
2015 | Supervised Discrete Hashing | Shen Fumin, Shen Chunhua, Liu Wei, Shen Heng Tao | Arxiv | This paper has been withdrawn by the author. |
|||||
2015 | Simultaneous Feature Learning And Hash Coding With Deep Neural Networks | Lai Hanjiang, Pan Yan, Liu Ye, Yan Shuicheng | Arxiv | Similarity-preserving hashing is a widely-used method for nearest neighbour search in large-scale image retrieval tasks. For most existing hashing methods, an image is first encoded as a vector of hand-engineering visual features, followed by another separate projection or quantization step that generates binary codes. However, such visual feature vectors may not be optimally compatible with the coding process, thus producing sub-optimal hashing codes. In this paper, we propose a deep architecture for supervised hashing, in which images are mapped into binary codes via carefully designed deep neural networks. The pipeline of the proposed deep architecture consists of three building blocks: 1) a sub-network with a stack of convolution layers to produce the effective intermediate image features; 2) a divide-and-encode module to divide the intermediate image features into multiple branches, each encoded into one hash bit; and 3) a triplet ranking loss designed to characterize that one image is more similar to the second image than to the third one. Extensive evaluations on several benchmark image datasets show that the proposed simultaneous feature learning and hash coding pipeline brings substantial improvements over other state-of-the-art supervised or unsupervised hashing methods. |
|||||
2015 | A Conditional Berry-esseen Bound And A Conditional Large Deviation Result Without Laplace Transform. Application To Hashing With Linear Probing | Klein Thierry, Lagnoux Agnès, Petit Pierre | Arxiv | We study the asymptotic behavior of a sum of independent and identically distributed random variables conditioned by a sum of independent and identically distributed integer-valued random variables. We prove a Berry-Esseen bound in a general setting and a large deviation result when the Laplace transform of the underlying distribution is not defined in a neighborhood of zero. Then we present several combinatorial applications. In particular, we prove a large deviation result for the model of hashing with linear probing. |
|||||
2015 | Optimal Data-dependent Hashing For Approximate Near Neighbors | Andoni Alexandr, Razenshteyn Ilya | Arxiv | We show an optimal data-dependent hashing scheme for the approximate near neighbor problem. For an \(n\)-point data set in a \(d\)-dimensional space our data structure achieves query time \(O(d n^{\rho+o(1)})\) and space \(O(n^{1+\rho+o(1)} + dn)\), where \(\rho=\frac{1}{2c^2-1}\) for the Euclidean space and approximation factor \(c>1\); for the Hamming space the exponent improves to \(\rho=\frac{1}{2c-1}\). |
|||||
2015 | Efficient Similarity Indexing And Searching In High Dimensions | Zhong Yu | Arxiv | Efficient indexing and searching of high dimensional data has been an area of active research due to the growing exploitation of high dimensional data and the vulnerability of traditional search methods to the curse of dimensionality. This paper presents a new approach for fast and effective searching and indexing of high dimensional features using random partitions of the feature space. Experiments on both handwritten digits and 3-D shape descriptors have shown the proposed algorithm to be highly effective and efficient in indexing and searching real data sets of several hundred dimensions. We also compare its performance to that of the state-of-the-art locality sensitive hashing algorithm. |
|||||
2015 | Practical And Optimal LSH For Angular Distance | Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, Ludwig Schmidt | Neural Information Processing Systems | We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent. Unlike earlier algorithms with this property (e.g., Spherical LSH (Andoni-Indyk-Nguyen-Razenshteyn 2014) (Andoni-Razenshteyn 2015)), our algorithm is also practical, improving upon the well-studied hyperplane LSH (Charikar 2002) in practice. We also introduce a multiprobe version of this algorithm and conduct an experimental evaluation on real and synthetic data sets. We complement the above positive results with a fine-grained lower bound for the quality of any LSH family for angular distance. Our lower bound implies that the above LSH family exhibits a trade-off between evaluation time and quality that is close to optimal for a natural class of LSH functions. |
|||||
2015 | Sketch-based Manga Retrieval Using Manga109 Dataset | Matsui Yusuke, Ito Kota, Aramaki Yuji, Yamasaki Toshihiko, Aizawa Kiyoharu | Multimedia Tools and Applications Volume | Manga (Japanese comics) are popular worldwide. However, current e-manga archives offer very limited search support, including keyword-based search by title or author, or tag-based categorization. To make the manga search experience more intuitive, efficient, and enjoyable, we propose a content-based manga retrieval system. First, we propose a manga-specific image-describing framework. It consists of efficient margin labeling, edge orientation histogram feature description, and approximate nearest-neighbor search using product quantization. Second, we propose a sketch-based interface as a natural way to interact with manga content. The interface provides sketch-based querying, relevance feedback, and query retouch. For evaluation, we built a novel dataset of manga images, Manga109, which consists of 109 comic books of 21,142 pages drawn by professional manga artists. To the best of our knowledge, Manga109 is currently the biggest dataset of manga images available for research. We conducted a comparative study, a localization evaluation, and a large-scale qualitative study. From the experiments, we verified that: (1) the retrieval accuracy of the proposed method is higher than those of previous methods; (2) the proposed method can localize an object instance with reasonable runtime and accuracy; and (3) sketch querying is useful for manga search. |
|||||
2015 | First-take-all Temporal Order-preserving Hashing For 3D Action Videos | Ye Jun, Hu Hao, Li Kai, Qi Guo-jun, Hua Kien A. | Arxiv | With the prevalence of commodity depth cameras, the new paradigm of user interfaces based on 3D motion capture and recognition has dramatically changed the way humans interact with computers. Human action recognition, as one of the key components in these devices, plays an important role in guaranteeing the quality of the user experience. Although model-driven methods have achieved huge success, they cannot provide a scalable solution for efficiently storing, retrieving and recognizing actions in large-scale applications. These models are also vulnerable to temporal translation and warping, as well as to variations in motion scales and execution rates. To address these challenges, we propose to treat 3D human action recognition as a video-level hashing problem and propose a novel First-Take-All (FTA) hashing algorithm capable of hashing an entire video into hash codes of fixed length. We demonstrate that the FTA algorithm produces a compact representation of the video invariant to the above-mentioned variations, through which action recognition can be solved by an efficient nearest neighbor search using the Hamming distance between FTA hash codes. Experiments on public 3D human action datasets show that the FTA algorithm can reach a recognition accuracy higher than 80%, with about 15 bits per frame, considering there are 65 frames per video over the datasets. |
|||||
2015 | High Speed Hashing For Integers And Strings | Thorup Mikkel | Arxiv | These notes describe the most efficient hash functions currently known for hashing integers and strings. These modern hash functions are often an order of magnitude faster than those presented in standard textbooks. They are also simpler to implement, and hence a clear win in practice, but their analysis is harder. Some of the most practical hash functions have only appeared in theory papers, and some of them require combining results from different theory papers. The goal here is to combine the information in lecture-style notes that can be used by theoreticians and practitioners alike, thus making these practical fruits of theory more widely accessible. |
|||||
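One of the classic schemes covered by the notes above is multiply-shift hashing of \(w\)-bit integers: pick a random odd \(w\)-bit multiplier \(a\) and keep the top \(l\) bits of \(a \cdot x \bmod 2^w\). A small sketch with illustrative constants:

```python
import random

W = 64                                      # key width in bits

def make_multiply_shift(l, seed=0):
    rng = random.Random(seed)
    a = rng.getrandbits(W) | 1              # random odd multiplier
    def h(x):
        return ((a * x) & ((1 << W) - 1)) >> (W - l)
    return h

h = make_multiply_shift(l=20)
print(h(123456789), h(987654321))           # l-bit hash values in [0, 2^20)
```

The scheme needs only one multiplication and one shift per key, which is why it is so much faster than the textbook mod-prime constructions.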
2015 | Permutation Search Methods Are Efficient Yet Faster Search Is Possible | Naidan Bilegsaikhan, Boytsov Leonid, Nyberg Eric | Arxiv | We survey permutation-based methods for approximate k-nearest neighbor search. In these methods, every data point is represented by a ranked list of pivots sorted by the distance to this point. Such ranked lists are called permutations. The underpinning assumption is that, for both metric and non-metric spaces, the distance between permutations is a good proxy for the distance between original points. Thus, it should be possible to efficiently retrieve most true nearest neighbors by examining only a tiny subset of data points whose permutations are similar to the permutation of a query. We further test this assumption by carrying out an extensive experimental evaluation where permutation methods are pitted against state-of-the-art benchmarks (the multi-probe LSH, the VP-tree, and proximity-graph based retrieval) on a variety of realistically large data sets from the image and textual domains. The focus is on the high-accuracy retrieval methods for generic spaces. Additionally, we assume that both data and indices are stored in main memory. We find permutation methods to be reasonably efficient and describe a setup where these methods are most useful. To ease reproducibility, we make our software and data sets publicly available. |
|||||
2015 | Optimizing Affinity-based Binary Hashing Using Auxiliary Coordinates | Raziperchikolaei Ramin, Carreira-perpiñán Miguel Á. | Arxiv | In supervised binary hashing, one wants to learn a function that maps a high-dimensional feature vector to a vector of binary codes, for application to fast image retrieval. This typically results in a difficult optimization problem, nonconvex and nonsmooth, because of the discrete variables involved. Much work has simply relaxed the problem during training, solving a continuous optimization, and truncating the codes a posteriori. This gives reasonable results but is quite suboptimal. Recent work has tried to optimize the objective directly over the binary codes and achieved better results, but the hash function was still learned a posteriori, which remains suboptimal. We propose a general framework for learning hash functions using affinity-based loss functions that uses auxiliary coordinates. This closes the loop and optimizes jointly over the hash functions and the binary codes so that they gradually match each other. The resulting algorithm can be seen as a corrected, iterated version of the procedure of optimizing first over the codes and then learning the hash function. Compared to this, our optimization is guaranteed to obtain better hash functions while being not much slower, as demonstrated experimentally in various supervised datasets. In addition, our framework facilitates the design of optimization algorithms for arbitrary types of loss and hash functions. |
|||||
2015 | Supervised Learning Of Semantics-preserving Hash Via Deep Convolutional Neural Networks | Yang Huei-fang, Lin Kevin, Chen Chu-song | Arxiv | This paper presents a simple yet effective supervised deep hash approach that constructs binary hash codes from labeled data for large-scale image search. We assume that the semantic labels are governed by several latent attributes with each attribute on or off, and classification relies on these attributes. Based on this assumption, our approach, dubbed supervised semantics-preserving deep hashing (SSDH), constructs hash functions as a latent layer in a deep network and the binary codes are learned by minimizing an objective function defined over classification error and other desirable hash code properties. With this design, SSDH has a nice characteristic that classification and retrieval are unified in a single learning model. Moreover, SSDH performs joint learning of image representations, hash codes, and classification in a point-wise manner, and thus is scalable to large-scale datasets. SSDH is simple and can be realized by a slight enhancement of an existing deep architecture for classification; yet it is effective and outperforms other hashing approaches on several benchmarks and large datasets. Compared with state-of-the-art approaches, SSDH achieves higher retrieval accuracy, while the classification performance is not sacrificed. |
|||||
2015 | On The Complexity Of Inner Product Similarity Join | Ahle Thomas D., Pagh Rasmus, Razenshteyn Ilya, Silvestri Francesco | Arxiv | A number of tasks in classification, information retrieval, recommendation systems, and record linkage reduce to the core problem of inner product similarity join (IPS join): identifying pairs of vectors in a collection that have a sufficiently large inner product. IPS join is well understood when vectors are normalized and some approximation of inner products is allowed. However, the general case where vectors may have any length appears much more challenging. Recently, new upper bounds based on asymmetric locality-sensitive hashing (ALSH) and asymmetric embeddings have emerged, but little has been known on the lower bound side. In this paper we initiate a systematic study of inner product similarity join, showing new lower and upper bounds. Our main results are: |
|||||
2015 | Tight Lower Bounds For Data-dependent Locality-sensitive Hashing | Andoni Alexandr, Razenshteyn Ilya | Arxiv | We prove a tight lower bound for the exponent \(\rho\) for data-dependent Locality-Sensitive Hashing schemes, recently used to design efficient solutions for the \(c\)-approximate nearest neighbor search. In particular, our lower bound matches the bound of \(\rho\le \frac{1}{2c-1}+o(1)\) for the \(\ell_1\) space, obtained via the recent algorithm from [Andoni-Razenshteyn, STOC’15]. In recent years it emerged that data-dependent hashing is strictly superior to the classical Locality-Sensitive Hashing, when the hash function is data-independent. In the latter setting, the best exponent has been already known: for the \(\ell_1\) space, the tight bound is \(\rho=1/c\), with the upper bound from [Indyk-Motwani, STOC’98] and the matching lower bound from [O’Donnell-Wu-Zhou, ITCS’11]. We prove that, even if the hashing is data-dependent, it must hold that \(\rho\ge \frac{1}{2c-1}-o(1)\). To prove the result, we need to formalize the exact notion of data-dependent hashing that also captures the complexity of the hash functions (in addition to their collision properties). Without restricting such complexity, we would allow for obviously infeasible solutions such as the Voronoi diagram of a dataset. To preclude such solutions, we require our hash functions to be succinct. This condition is satisfied by all the known algorithmic results. |
|||||
2015 | Feature Learning Based Deep Supervised Hashing With Pairwise Labels | Li Wu-jun, Wang Sheng, Kang Wang-cheng | Arxiv | Recent years have witnessed wide application of hashing for large-scale image retrieval. However, most existing hashing methods are based on hand-crafted features which might not be optimally compatible with the hashing procedure. Recently, deep hashing methods have been proposed to perform simultaneous feature learning and hash-code learning with deep neural networks, which have shown better performance than traditional hashing methods with hand-crafted features. Most of these deep hashing methods are supervised, with the supervised information given in the form of triplet labels. For another common application scenario with pairwise labels, no methods for simultaneous feature learning and hash-code learning have existed. In this paper, we propose a novel deep hashing method, called deep pairwise-supervised hashing (DPSH), to perform simultaneous feature learning and hash-code learning for applications with pairwise labels. Experiments on real datasets show that our DPSH method can outperform other methods to achieve the state-of-the-art performance in image retrieval applications. |
|||||
2015 | Discrete Hashing With Deep Neural Network | Do Thanh-toan, Doan Anh-zung, Cheung Ngai-man | Arxiv | This paper addresses the problem of learning binary hash codes for large scale image search by proposing a novel hashing method based on a deep neural network. The advantage of our deep model over previous deep models used in hashing is that our model contains the necessary criteria for producing good codes, such as similarity preserving, balance and independence. Another advantage of our method is that, instead of relaxing the binary constraint of the codes during the learning process as in most previous works, we introduce an auxiliary variable and reformulate the optimization into two sub-optimization steps, allowing us to efficiently handle the binary constraints without any relaxation. The proposed method is also extended to supervised hashing by leveraging the label information such that the learned binary codes preserve the pairwise labels of the inputs. The experimental results on three benchmark datasets show the proposed methods outperform state-of-the-art hashing methods. |
|||||
2015 | General Graph Identification By Hashing | Portegys Tom | Arxiv | A method for identifying graphs using MD5 hashing is presented. This allows fast graph equality comparisons and can also be used to facilitate graph isomorphism testing. The graphs can be labeled or unlabeled. The method identifies vertices by hashing the graph configuration in their neighborhoods. With each vertex hashed, the entire graph can be identified by hashing the vertex hashes. |
|||||
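The graph-identification entry above hashes vertex neighborhoods and then hashes the sorted vertex hashes. Below is a small hedged sketch in that spirit; the iterative refinement scheme, the round count, and the label handling are illustrative choices and may differ from the paper's exact construction.

```python
# Hedged sketch: refine each vertex hash from its label and its neighbors' hashes,
# then take the MD5 of the sorted vertex hashes as the graph identifier. This is
# isomorphism-invariant but, like any such scheme, not a complete isomorphism test.
import hashlib

def md5(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

def graph_hash(adj, labels=None, rounds=3):
    """adj: dict vertex -> iterable of neighbor vertices."""
    h = {v: md5(str(labels[v]) if labels else "0") for v in adj}
    for _ in range(rounds):
        h = {v: md5(h[v] + "|" + "".join(sorted(h[u] for u in adj[v]))) for v in adj}
    return md5("".join(sorted(h.values())))

g1 = {0: [1, 2], 1: [0, 2], 2: [0, 1]}                       # a triangle
g2 = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}      # the same triangle, relabeled
print(graph_hash(g1) == graph_hash(g2))                       # True: identical up to relabeling
```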
2015 | Indexing Of CNN Features For Large Scale Image Search | Liu Ruoyu, Zhao Yao, Wei Shikui, Yang Yi | Arxiv | The convolutional neural network (CNN) features can give a good description of image content, which usually represent images with unique global vectors. Although they are compact compared to local descriptors, they still cannot efficiently deal with large-scale image retrieval due to the cost of the linear incremental computation and storage. To address this issue, we build a simple but effective indexing framework based on inverted table, which significantly decreases both the search time and memory usage. In addition, several strategies are fully investigated under an indexing framework to adapt it to CNN features and compensate for quantization errors. First, we use multiple assignment for the query and database images to increase the probability of relevant images’ co-existing in the same Voronoi cells obtained via the clustering algorithm. Then, we introduce embedding codes to further improve precision by removing false matches during a search. We demonstrate that by using hashing schemes to calculate the embedding codes and by changing the ranking rule, indexing framework speeds can be greatly improved. Extensive experiments conducted on several unsupervised and supervised benchmarks support these results and the superiority of the proposed indexing framework. We also provide a fair comparison between the popular CNN features. |
|||||
2015 | Improving Style Similarity Metrics Of 3D Shapes | Dev Kapil, Lau Manfred | Arxiv | The idea of style similarity metrics has been recently developed for various media types such as 2D clip art and 3D shapes. We explore this style metric problem and improve existing style similarity metrics of 3D shapes in four novel ways. First, we consider the color and texture of 3D shapes which are important properties that have not been previously considered. Second, we explore the effect of clustering a dataset of 3D models by comparing between style metrics for a single object type and style metrics that combine clusters of object types. Third, we explore the idea of user-guided learning for this problem. Fourth, we introduce an iterative approach that can learn a metric from a general set of 3D models. We demonstrate these contributions with various classes of 3D shapes and with applications such as style-based similarity search and scene composition. |
|||||
2015 | The Homogeneous Weight For r_k Related Gray Map And New Binary Quasicyclic Codes | Yildiz Bahattin, Kelebek Ismail G. | Arxiv | Using theoretical results about the homogeneous weights for Frobenius rings, we describe the homogeneous weight for the ring family \(R_k\), a recently introduced family of Frobenius rings which have been used extensively in coding theory. We find an associated Gray map for the homogeneous weight using first order Reed-Muller codes and we describe some of the general properties of the images of codes over \(R_k\) under this Gray map. We then discuss quasitwisted codes over \(R_k\) and their binary images under the homogeneous Gray map. In this way, we find many optimal binary codes which are self-orthogonal and quasicyclic. In particular, we find a substantial number of optimal binary codes that are quasicyclic of index 8, 16 and 24, nearly all of which are new additions to the database of quasicyclic codes kept by Chen. |
|||||
2015 | Learning Better Encoding For Approximate Nearest Neighbor Search With Dictionary Annealing | Liu Shicong, Lu Hongtao | Arxiv | We introduce a novel dictionary optimization method for high-dimensional vector quantization employed in approximate nearest neighbor (ANN) search. Vector quantization methods first seek a series of dictionaries, then approximate each vector by a sum of elements selected from these dictionaries. An optimal series of dictionaries should be mutually independent, and each dictionary should generate a balanced encoding for the target dataset. Existing methods did not explicitly consider this. To achieve these goals along with minimizing the quantization error (residue), we propose a novel dictionary optimization method called Dictionary Annealing (DA) that alternately “heats up” a single dictionary by generating an intermediate dataset with residual vectors, “cools down” the dictionary by fitting the intermediate dataset, then extracts the new residual vectors for the next iteration. Better codes can be learned by DA for the ANN search tasks. DA is easily implemented on GPU to utilize the latest computing technology, and can be easily extended to an online dictionary learning scheme. We show by experiments that our optimized dictionaries substantially reduce the overall quantization error. Jointly used with residual vector quantization, our optimized dictionaries lead to a better approximate nearest neighbor search performance compared to the state-of-the-art methods. |
|||||
2015 | Fast And Powerful Hashing Using Tabulation | Thorup Mikkel | Arxiv | Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees. Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as consisting of \(c\) characters and we have precomputed character tables \(h_1,…,h_c\) mapping characters to random hash values. A key \(x=(x_1,…,x_c)\) is hashed to \(h_1[x_1] \oplus h_2[x_2] \oplus \cdots \oplus h_c[x_c]\). This scheme is very fast with character tables in cache. While simple tabulation is not even 4-independent, it does provide many of the guarantees that are normally obtained via higher independence, e.g., linear probing and Cuckoo hashing. Next we consider twisted tabulation where one input character is “twisted” in a simple way. The resulting hash function has powerful distributional properties: Chernoff-Hoeffding type tail bounds and a very small bias for min-wise hashing. This also yields an extremely fast pseudo-random number generator that is provably good for many classic randomized algorithms and data-structures. Finally, we consider double tabulation where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Carter and Wegman [1977]. In fact, w.h.p., for a given set of size proportional to that of the space consumed, double tabulation gives fully-random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space. While these tabulation schemes are all easy to implement and use, their analysis is not. |
|||||
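The abstract above gives the simple tabulation formula explicitly, so a direct sketch is easy to state. Table sizes, key width, and the seed are illustrative assumptions.

```python
# Hedged sketch of simple tabulation hashing: a 32-bit key is split into c = 4 byte-sized
# "characters", each character indexes its own precomputed table of random words, and the
# looked-up words are XORed together, matching the formula in the abstract.
import random

random.seed(42)
C, CHAR_BITS = 4, 8
TABLES = [[random.getrandbits(32) for _ in range(1 << CHAR_BITS)] for _ in range(C)]

def simple_tabulation(key: int) -> int:
    h = 0
    for i in range(C):
        char = (key >> (i * CHAR_BITS)) & 0xFF   # extract the i-th 8-bit character
        h ^= TABLES[i][char]                     # XOR in the table lookup
    return h

print(hex(simple_tabulation(0xDEADBEEF)))
```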
2015 | Learning To Hash For Indexing Big Data - A Survey | Wang Jun, Liu Wei, Kumar Sanjiv, Chang Shih-fu | Arxiv | The explosive growth in big data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although having elegant theoretic guarantees on the search quality in certain metric spaces, performance of randomized hashing has been shown insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with systematic understanding of insights, pros and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing the deep learning models. Finally, we discuss the future direction and trends of research in this area. |
|||||
2015 | Ball-tree Efficient Spatial Indexing For Constrained Nearest-neighbor Search In Metric Spaces | Dolatshah Mohamad, Hadian Ali, Minaei-bidgoli Behrouz | Arxiv | Emerging location-based systems and data analysis frameworks require efficient management of spatial data for approximate and exact search. Exact similarity search can be done using space partitioning data structures, such as Kd-tree, R-tree, and Ball-tree. In this paper, we focus on Ball-tree, an efficient search tree that is specific for spatial queries which use Euclidean distance. Each node of a Ball-tree defines a ball, i.e. a hypersphere that contains a subset of the points to be searched. In this paper, we propose Ball*-tree, an improved Ball-tree that is more efficient for spatial queries. Ball*-tree enjoys a modified space partitioning algorithm that considers the distribution of the data points in order to find an efficient splitting hyperplane. Also, we propose a new algorithm for KNN queries with restricted range using Ball*-tree, which performs better than both KNN and range search for such queries. Results show that Ball*-tree performs 39%-57% faster than the original Ball-tree algorithm. |
|||||
2014 | Core Kernels | Li Ping | Arxiv | The term “CoRE kernel” stands for correlation-resemblance kernel. In many applications (e.g., vision), the data are often high-dimensional, sparse, and non-binary. We propose two types of (nonlinear) CoRE kernels for non-binary sparse data and demonstrate the effectiveness of the new kernels through a classification experiment. CoRE kernels are simple with no tuning parameters. However, training nonlinear kernel SVM can be (very) costly in time and memory and may not be suitable for truly large-scale industrial applications (e.g. search). In order to make the proposed CoRE kernels more practical, we develop basic probabilistic hashing algorithms which transform nonlinear kernels into linear kernels. |
|||||
2014 | Sequential Hypothesis Tests For Adaptive Locality Sensitive Hashing | Chakrabarti Aniket, Parthasarathy Srinivasan | Arxiv | All pairs similarity search is a problem where a set of data objects is given and the task is to find all pairs of objects that have similarity above a certain threshold for a given similarity measure-of-interest. When the number of points or dimensionality is high, standard solutions fail to scale gracefully. Approximate solutions such as Locality Sensitive Hashing (LSH) and its Bayesian variants (BayesLSH and BayesLSHLite) alleviate the problem to some extent and provide substantial speedup over traditional index based approaches. BayesLSH is used for pruning the candidate space and computation of approximate similarity, whereas BayesLSHLite can only prune the candidates, but similarity needs to be computed exactly on the original data. Thus, wherever the explicit data representation is available and exact similarity computation is not too expensive, BayesLSHLite can be used to aggressively prune candidates and provide substantial speedup without losing too much on quality. However, the loss in quality is higher in the BayesLSH variant, where explicit data representation is not available, rather only a hash sketch is available and similarity has to be estimated approximately. In this work we revisit the LSH problem from a Frequentist setting and formulate sequential tests for composite hypotheses (similarity greater than or less than threshold) that can be leveraged by such LSH algorithms for adaptively pruning candidates aggressively. We propose a vanilla sequential probability ratio test (SPRT) approach based on this idea and two novel variants. We extend these variants to the case where approximate similarity needs to be computed using a fixed-width sequential confidence interval generation technique. |
|||||
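To make the sequential-testing idea above concrete, here is a hedged sketch of a plain Wald SPRT applied to hash-bit agreements, not the paper's exact variants: the thresholds p0 and p1, the error rates, and the Bernoulli model of bit agreements are all illustrative assumptions.

```python
# Hedged sketch: hash-bit agreements between a query sketch and a candidate sketch are
# treated as Bernoulli trials whose success probability grows with similarity, so a
# sequential probability ratio test can accept or prune a candidate after few comparisons.
import math, random

def sprt_prune(agreements, p0=0.55, p1=0.7, alpha=0.05, beta=0.05):
    """agreements: iterable of 0/1 bit matches. Returns 'accept', 'prune', or 'undecided'."""
    A, B = math.log((1 - beta) / alpha), math.log(beta / (1 - alpha))
    llr = 0.0
    for a in agreements:
        llr += math.log((p1 if a else 1 - p1) / (p0 if a else 1 - p0))
        if llr >= A:
            return "accept"   # evidence that similarity exceeds the threshold
        if llr <= B:
            return "prune"    # evidence that similarity is below the threshold
    return "undecided"

random.seed(0)
similar_candidate = (random.random() < 0.75 for _ in range(256))  # true agreement rate 0.75
print(sprt_prune(similar_candidate))
```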
2014 | Eclipse Hashing Alexandrov Compactification And Hashing With Hyperspheres For Fast Similarity Search | Noma Yui, Konoshima Makiko | Arxiv | The similarity searches that use high-dimensional feature vectors consisting of a vast amount of data have a wide range of applications. One way of conducting a fast similarity search is to transform the feature vectors into binary vectors and perform the similarity search by using the Hamming distance. Such a transformation is a hashing method, and the choice of hashing function is important. Hashing methods using hyperplanes or hyperspheres have been proposed. The study reported here is inspired by Spherical LSH, and we use hyperspheres to hash the feature vectors. Our method, called Eclipse-hashing, performs a compactification of \(\mathbb{R}^n\) by using the inverse stereographic projection, which is a kind of Alexandrov compactification. By using Eclipse-hashing, one can obtain the hypersphere-hash function without explicitly using hyperspheres. Hence, the number of nonlinear operations is reduced and the processing time of hashing becomes shorter. Furthermore, we also show that as a result of improving the approximation accuracy, Eclipse-hashing is more accurate than hyperplane-hashing. |
|||||
2014 | Hashing For Statistics Over K-partitions | Dahlgaard Søren, Knudsen Mathias Bæk Tejs, Rotenberg Eva, Thorup Mikkel | Arxiv | In this paper we analyze a hash function for \(k\)-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS’83] in order to save a factor \(\Omega(k)\) of time per element over \(k\) independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al. [AOFA’97] and in large-scale machine learning by Li et al. [NIPS’12] for minwise estimation of set similarity. The main issue with \(k\)-partitioning is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of \(k\)-partitioning similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g. a simple and efficient construction for invertible bloom filters and uniform hashing on a given set. |
|||||
2014 | Optimized Cartesian k-means | Wang Jianfeng, Wang Jingdong, Song Jingkuan, Xu Xin-shun, Shen Heng Tao, Li Shipeng | Arxiv | Product quantization-based approaches are effective to encode high-dimensional data points for approximate nearest neighbor search. The space is decomposed into a Cartesian product of low-dimensional subspaces, each of which generates a sub codebook. Data points are encoded as compact binary codes using these sub codebooks, and the distance between two data points can be approximated efficiently from their codes by the precomputed lookup tables. Traditionally, to encode a subvector of a data point in a subspace, only one sub codeword in the corresponding sub codebook is selected, which may impose strict restrictions on the search accuracy. In this paper, we propose a novel approach, named Optimized Cartesian \(K\)-Means (OCKM), to better encode the data points for more accurate approximate nearest neighbor search. In OCKM, multiple sub codewords are used to encode the subvector of a data point in a subspace. Each sub codeword stems from different sub codebooks in each subspace, which are optimally generated with regard to the minimization of the distortion errors. The high-dimensional data point is then encoded as the concatenation of the indices of multiple sub codewords from all the subspaces. This can provide more flexibility and lower distortion errors than traditional methods. Experimental results on the standard real-life datasets demonstrate the superiority over state-of-the-art approaches for approximate nearest neighbor search. |
|||||
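Since the entry above builds on the standard product quantization pipeline (subspace codebooks, codes, and precomputed lookup tables), a hedged sketch of that baseline may help; it is not OCKM itself. scikit-learn's KMeans, the subspace count, codebook size, and dataset are illustrative assumptions.

```python
# Hedged sketch of plain product quantization with asymmetric distance computation (ADC):
# each subspace gets its own k-means codebook, vectors are encoded as codeword indices,
# and query-to-code distances are summed from per-subspace lookup tables.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 64)).astype(np.float32)
M, K = 8, 16                                   # 8 subspaces, 16 codewords each (small for speed)
D_sub = X.shape[1] // M

codebooks = [KMeans(n_clusters=K, n_init=4, random_state=0)
             .fit(X[:, m*D_sub:(m+1)*D_sub]) for m in range(M)]

def encode(x):
    """Concatenate the nearest sub-codeword index from each subspace."""
    return np.array([cb.predict(x[m*D_sub:(m+1)*D_sub][None])[0]
                     for m, cb in enumerate(codebooks)], dtype=np.uint8)

codes = np.stack([encode(x) for x in X[:1000]])

def adc_distances(q):
    """Per-subspace squared-distance tables computed once per query, then summed by code."""
    tables = [np.linalg.norm(cb.cluster_centers_ - q[m*D_sub:(m+1)*D_sub], axis=1)**2
              for m, cb in enumerate(codebooks)]
    return np.sum([tables[m][codes[:, m]] for m in range(M)], axis=0)

q = rng.normal(size=64).astype(np.float32)
print(np.argsort(adc_distances(q))[:5])        # approximate nearest encoded points
```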
2014 | Learning To Rank Binary Codes | Feng Jie, Liu Wei, Wang Yan | Arxiv | Binary codes have been widely used in vision problems as a compact feature representation to achieve both space and time advantages. Various methods have been proposed to learn data-dependent hash functions which map a feature vector to a binary code. However, considerable data information is inevitably lost during the binarization step which also causes ambiguity in measuring sample similarity using Hamming distance. Besides, the learned hash functions cannot be changed after training, which makes them incapable of adapting to new data outside the training data set. To address both issues, in this paper we propose a flexible bitwise weight learning framework based on the binary codes obtained by state-of-the-art hashing methods, and incorporate the learned weights into the weighted Hamming distance computation. We then formulate the proposed framework as a ranking problem and leverage the Ranking SVM model to offline tackle the weight learning. The framework is further extended to an online mode which updates the weights at each time new data comes, thereby making it scalable to large and dynamic data sets. Extensive experimental results demonstrate significant performance gains of using binary codes with bitwise weighting in image retrieval tasks. It is appealing that the online weight learning leads to comparable accuracy with its offline counterpart, which thus makes our approach practical for realistic applications. |
|||||
2014 | Packing And Padding Coupled Multi-index For Accurate Image Retrieval | Zheng Liang, Wang Shengjin, Liu Ziqiong, Tian Qi | Arxiv | In Bag-of-Words (BoW) based image retrieval, the SIFT visual word has a low discriminative power, so false positive matches occur prevalently. Apart from the information loss during quantization, another cause is that the SIFT feature only describes the local gradient distribution. To address this problem, this paper proposes a coupled Multi-Index (c-MI) framework to perform feature fusion at indexing level. Basically, complementary features are coupled into a multi-dimensional inverted index. Each dimension of c-MI corresponds to one kind of feature, and the retrieval process votes for images similar in both SIFT and other feature spaces. Specifically, we exploit the fusion of local color feature into c-MI. While the precision of visual match is greatly enhanced, we adopt Multiple Assignment to improve recall. The joint cooperation of SIFT and color features significantly reduces the impact of false positive matches. Extensive experiments on several benchmark datasets demonstrate that c-MI improves the retrieval accuracy significantly, while consuming only half of the query time compared to the baseline. Importantly, we show that c-MI is well complementary to many prior techniques. Assembling these methods, we have obtained an mAP of 85.8% and N-S score of 3.85 on Holidays and Ukbench datasets, respectively, which compare favorably with the state-of-the-arts. |
|||||
2014 | DISA At Imageclef 2014 Revised Search-based Image Annotation With Decaf Features | Budikova Petra, Botorek Jan, Batko Michal, Zezula Pavel | Arxiv | This paper constitutes an extension to the report on DISA-MU team participation in the ImageCLEF 2014 Scalable Concept Image Annotation Task as published in [3]. Specifically, we introduce a new similarity search component that was implemented into the system, report on the results achieved by utilizing this component, and analyze the influence of different similarity search parameters on the annotation quality. |
|||||
2014 | Approximately Minwise Independence With Twisted Tabulation | Dahlgaard Søren, Thorup Mikkel | Arxiv | A random hash function \(h\) is \(\epsilon\)-minwise if for any set \(S\), \(|S|=n\), and element \(x\in S\), \(\Pr[h(x)=\min h(S)]=(1\pm\epsilon)/n\). Minwise hash functions with low bias \(\epsilon\) have widespread applications within similarity estimation. Hashing from a universe \([u]\), the twisted tabulation hashing of Pătrașcu and Thorup [SODA’13] makes \(c=O(1)\) lookups in tables of size \(u^{1/c}\). Twisted tabulation was invented to get good concentration for hashing based sampling. Here we show that twisted tabulation yields \(\tilde O(1/u^{1/c})\)-minwise hashing. In the classic independence paradigm of Wegman and Carter [FOCS’79], \(\tilde O(1/u^{1/c})\)-minwise hashing requires \(\Omega(\log u)\)-independence [Indyk SODA’99]. Pătrașcu and Thorup [STOC’11] had shown that simple tabulation, using the same space and lookups, yields \(\tilde O(1/n^{1/c})\)-minwise independence, which is good for large sets, but useless for small sets. Our analysis uses some of the same methods, but is much cleaner, bypassing a complicated induction argument. |
|||||
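The minwise property in the entry above underlies the usual resemblance estimator. Below is a hedged sketch of that application using simple modular hash functions rather than twisted tabulation; the number of hash functions and the sets are illustrative assumptions.

```python
# Hedged sketch of minwise hashing for resemblance (Jaccard) estimation: the estimator is
# the fraction of hash functions under which the two sets attain the same minimum value.
import random

random.seed(1)
PRIME = (1 << 61) - 1
K = 128
params = [(random.randrange(1, PRIME), random.randrange(PRIME)) for _ in range(K)]

def minhash_signature(s):
    return [min((a * x + b) % PRIME for x in s) for a, b in params]

def estimated_jaccard(sig1, sig2):
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)

A = set(range(0, 1000))
B = set(range(300, 1300))
true_jaccard = len(A & B) / len(A | B)
print(true_jaccard, estimated_jaccard(minhash_signature(A), minhash_signature(B)))
```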
2014 | Clustering Hamming Embedding Generalized LSH And The Max Norm | Neyshabur Behnam, Makarychev Yury, Srebro Nathan | Arxiv | We study the convex relaxation of clustering and hamming embedding, focusing on the asymmetric case (co-clustering and asymmetric hamming embedding), understanding their relationship to LSH as studied by (Charikar 2002) and to the max-norm ball, and the differences between their symmetric and asymmetric versions. |
|||||
2014 | Bayes Merging Of Multiple Vocabularies For Scalable Image Retrieval | Zheng Liang, Wang Shengjin, Zhou Wengang, Tian Qi | Arxiv | The Bag-of-Words (BoW) representation is well applied to recent state-of-the-art image retrieval works. Typically, multiple vocabularies are generated to correct quantization artifacts and improve recall. However, this routine is corrupted by vocabulary correlation, i.e., overlapping among different vocabularies. Vocabulary correlation leads to an over-counting of the indexed features in the overlapped area, or the intersection set, thus compromising the retrieval accuracy. In order to address the correlation problem while preserve the benefit of high recall, this paper proposes a Bayes merging approach to down-weight the indexed features in the intersection set. Through explicitly modeling the correlation problem in a probabilistic view, a joint similarity on both image- and feature-level is estimated for the indexed features in the intersection set. We evaluate our method through extensive experiments on three benchmark datasets. Albeit simple, Bayes merging can be well applied in various merging tasks, and consistently improves the baselines on multi-vocabulary merging. Moreover, Bayes merging is efficient in terms of both time and memory cost, and yields competitive performance compared with the state-of-the-art methods. |
|||||
2014 | Stacked Quantizers For Compositional Vector Compression | Martinez Julieta, Hoos Holger H., Little James J. | Arxiv | Recently, Babenko and Lempitsky introduced Additive Quantization (AQ), a generalization of Product Quantization (PQ) where a non-independent set of codebooks is used to compress vectors into small binary codes. Unfortunately, under this scheme encoding cannot be done independently in each codebook, and optimal encoding is an NP-hard problem. In this paper, we observe that PQ and AQ are both compositional quantizers that lie on the extremes of the codebook dependence-independence assumption, and explore an intermediate approach that exploits a hierarchical structure in the codebooks. This results in a method that achieves quantization error on par with or lower than AQ, while being several orders of magnitude faster. We perform a complexity analysis of PQ, AQ and our method, and evaluate our approach on standard benchmarks of SIFT and GIST descriptors, as well as on new datasets of features obtained from state-of-the-art convolutional neural networks. |
|||||
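The stacked-quantizer entry above sits between product quantization and additive quantization by encoding residuals codebook by codebook. A hedged sketch of that greedy residual regime follows; the level count, codebook size, and use of scikit-learn's KMeans are illustrative assumptions, and the paper's codebook refinement is omitted.

```python
# Hedged sketch of sequential (residual) quantization with a stack of codebooks: each
# codebook quantizes the residual left by the previous levels, and decoding sums the
# selected codewords, so the reconstruction error shrinks as levels are added.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32)).astype(np.float32)

n_levels, K = 4, 16
codebooks, residual = [], X.copy()
for _ in range(n_levels):
    km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(residual)
    codebooks.append(km.cluster_centers_)
    residual = residual - km.cluster_centers_[km.labels_]   # pass residuals to the next level

def encode(x):
    code, r = [], x.copy()
    for cb in codebooks:
        i = int(np.argmin(np.linalg.norm(cb - r, axis=1)))
        code.append(i)
        r = r - cb[i]
    return code

def decode(code):
    return sum(cb[i] for cb, i in zip(codebooks, code))

x = X[0]
print(np.linalg.norm(x - decode(encode(x))))    # quantization error after all levels
```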
2014 | Hashing On Nonlinear Manifolds | Shen Fumin, Shen Chunhua, Shi Qinfeng, Hengel Anton Van Den, Tang Zhenmin, Shen Heng Tao | Arxiv | Learning based hashing methods have attracted considerable attention due to their ability to greatly increase the scale at which existing algorithms may operate. Most of these methods are designed to generate binary codes preserving the Euclidean similarity in the original space. Manifold learning techniques, in contrast, are better able to model the intrinsic structure embedded in the original high-dimensional data. The complexities of these models, and the problems with out-of-sample data, have previously rendered them unsuitable for application to large-scale embedding, however. In this work, how to learn compact binary embeddings on their intrinsic manifolds is considered. In order to address the above-mentioned difficulties, an efficient, inductive solution to the out-of-sample data problem, and a process by which non-parametric manifold learning may be used as the basis of a hashing method is proposed. The proposed approach thus allows the development of a range of new hashing techniques exploiting the flexibility of the wide variety of manifold learning approaches available. It is particularly shown that hashing on the basis of t-SNE outperforms state-of-the-art hashing methods on large-scale benchmark datasets, and is very effective for image classification with very short code lengths. The proposed hashing framework is shown to be easily improved, for example, by minimizing the quantization error with learned orthogonal rotations. In addition, a supervised inductive manifold hashing framework is developed by incorporating the label information, which is shown to greatly advance the semantic retrieval performance. |
|||||
2014 | Optimizing Ranking Measures For Compact Binary Code Learning | Lin Guosheng, Shen Chunhua, Wu Jianxin | Arxiv | Hashing has proven a valuable tool for large-scale information retrieval. Despite much success, existing hashing methods optimize over simple objectives such as the reconstruction error or graph Laplacian related loss functions, instead of the performance evaluation criteria of interest—multivariate performance measures such as the AUC and NDCG. Here we present a general framework (termed StructHash) that allows one to directly optimize multivariate performance measures. The resulting optimization problem can involve exponentially or infinitely many variables and constraints, which is more challenging than standard structured output learning. To solve the StructHash optimization problem, we use a combination of column generation and cutting-plane techniques. We demonstrate the generality of StructHash by applying it to ranking prediction and image retrieval, and show that it outperforms a few state-of-the-art hashing methods. |
|||||
2014 | Inner Product Similarity Search Using Compositional Codes | Du Chao, Wang Jingdong | Arxiv | This paper addresses the nearest neighbor search problem under inner product similarity and introduces a compact code-based approach. The idea is to approximate a vector using the composition of several elements selected from a source dictionary and to represent this vector by a short code composed of the indices of the selected elements. The inner product between a query vector and a database vector is efficiently estimated from the query vector and the short code of the database vector. We show the superior performance of the proposed group \(M\)-selection algorithm that selects \(M\) elements from \(M\) source dictionaries for vector approximation in terms of search accuracy and efficiency for compact codes of the same length via theoretical and empirical analysis. Experimental results on large-scale datasets (\(1M\) and \(1B\) SIFT features, \(1M\) linear models and Netflix) demonstrate the superiority of the proposed approach. |
|||||
2014 | On Symmetric And Asymmetric Lshs For Inner Product Search | Neyshabur Behnam, Srebro Nathan | Arxiv | We consider the problem of designing locality sensitive hashes (LSH) for inner product similarity, and of the power of asymmetric hashes in this context. Shrivastava and Li argue that there is no symmetric LSH for the problem and propose an asymmetric LSH based on different mappings for query and database points. However, we show there does exist a simple symmetric LSH that enjoys stronger guarantees and better empirical performance than the asymmetric LSH they suggest. We also show a variant of the settings where asymmetry is in-fact needed, but there a different asymmetric LSH is required. |
|||||
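A hedged numerical sketch of the kind of construction discussed in the entry above: with database norms scaled to at most 1 and unit-norm queries, appending sqrt(1 - ||x||^2) turns maximum inner product into maximum cosine, which ordinary random-hyperplane hashing handles. The scaling, code length, candidate-list size, and re-ranking step are illustrative assumptions rather than the paper's exact scheme.

```python
# Hedged sketch: augment database vectors so inner product becomes cosine similarity,
# hash with sign random projections, and re-rank a short Hamming-distance candidate list.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))
X /= np.linalg.norm(X, axis=1).max()             # scale so every database norm is <= 1

def augment(v):
    return np.append(v, np.sqrt(max(0.0, 1.0 - v @ v)))   # equals 0 for unit-norm queries

n_bits = 64
H = rng.normal(size=(33, n_bits))                # random hyperplanes in the lifted space

def srp_code(v):
    return augment(v) @ H > 0                    # sign random projection bits

codes = np.stack([srp_code(x) for x in X])

q = rng.normal(size=32)
q /= np.linalg.norm(q)                           # queries assumed unit-norm
candidates = np.argsort((codes != srp_code(q)).sum(axis=1))[:20]   # smallest Hamming distance
best = candidates[np.argmax(X[candidates] @ q)]  # exact inner-product re-ranking
print(best, X[best] @ q)
```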
2014 | Fast Supervised Hashing With Decision Trees For High-dimensional Data | Lin Guosheng, Shen Chunhua, Shi Qinfeng, Hengel Anton Van Den, Suter David | Arxiv | Supervised hashing aims to map the original features to compact binary codes that are able to preserve label based similarity in the Hamming space. Non-linear hash functions have demonstrated the advantage over linear ones due to their powerful generalization capability. In the literature, kernel functions are typically used to achieve non-linearity in hashing, which achieve encouraging retrieval performance at the price of slow evaluation and training time. Here we propose to use boosted decision trees for achieving non-linearity in hashing, which are fast to train and evaluate, hence more suitable for hashing with high dimensional data. In our approach, we first propose sub-modular formulations for the hashing binary code inference problem and an efficient GraphCut based block search method for solving large-scale inference. Then we learn hash functions by training boosted decision trees to fit the binary codes. Experiments demonstrate that our proposed method significantly outperforms most state-of-the-art methods in retrieval precision and training time. Especially for high-dimensional data, our method is orders of magnitude faster than many methods in terms of training time. |
|||||
2014 | Hashing For Similarity Search A Survey | Wang Jingdong, Shen Heng Tao, Song Jingkuan, Ji Jianqiu | Arxiv | Similarity search (nearest neighbor search) is a problem of pursuing the data items whose distances to a query item are the smallest from a large database. Various methods have been developed to address this problem, and recently a lot of efforts have been devoted to approximate search. In this paper, we present a survey on one of the main solutions, hashing, which has been widely studied since the pioneering work locality sensitive hashing. We divide the hashing algorithms two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution and learning to hash, which learns hash functions according the data distribution, and review them from various aspects, including hash function design and distance measure and search scheme in the hash coding space. |
|||||
2014 | In Defense Of Minhash Over Simhash | Shrivastava Anshumali, Li Ping | Arxiv | MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity (\(\mathcal{R}\)), while the collision probability of SimHash is a function of cosine similarity (\(\mathcal{S}\)). To provide a common basis for comparison, we evaluate retrieval results in terms of \(\mathcal{S}\) for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to \(\mathcal{S}\), by using a general inequality \(\mathcal{S}^2\leq \mathcal{R}\leq \frac{\mathcal{S}}{2-\mathcal{S}}\). Our worst-case analysis shows that MinHash significantly outperforms SimHash in the high-similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, often \(\mathcal{R}\geq \frac{\mathcal{S}}{z-\mathcal{S}}\) holds, where \(z\) is only slightly larger than 2 (e.g., \(z\leq 2.1\)). Our restricted worst-case analysis, assuming \(\frac{\mathcal{S}}{z-\mathcal{S}}\leq \mathcal{R}\leq \frac{\mathcal{S}}{2-\mathcal{S}}\), shows that MinHash indeed significantly outperforms SimHash even in the low-similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse. |
|||||
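The comparison in the entry above can be illustrated numerically: for a binary pair, the MinHash collision probability is the resemblance R, the SimHash collision probability is 1 - theta/pi as a function of the cosine S, and the inequality S^2 <= R <= S/(2-S) can be checked directly. The sparsity level and perturbation below are illustrative assumptions.

```python
# Hedged sketch: compute R and S for a synthetic binary pair, check the bound from the
# abstract, and compare the two closed-form per-bit collision probabilities.
import numpy as np

rng = np.random.default_rng(0)
x = (rng.random(1000) < 0.2).astype(float)
y = x.copy()
flip = rng.choice(1000, 150, replace=False)
y[flip] = 1 - y[flip]                            # perturb to get a moderately similar pair

a, b = x.astype(bool), y.astype(bool)
R = (a & b).sum() / (a | b).sum()                # resemblance (Jaccard)
S = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))   # cosine

p_minhash = R                                    # Pr[collision] for one MinHash
p_simhash = 1 - np.arccos(S) / np.pi             # Pr[collision] for one SimHash bit
print(f"R={R:.3f}  S={S:.3f}  bound holds: {S**2 <= R <= S/(2-S)}")
print(f"MinHash collision prob={p_minhash:.3f}  SimHash collision prob={p_simhash:.3f}")
```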
2014 | On Tight Bounds For Binary Frameproof Codes | Guo Chuan, Stinson Douglas R., Van Trung Tran | Arxiv | In this paper, we study \(w\)-frameproof codes, which are equivalent to \(\{1,w\}\)-separating hash families. Our main results concern binary codes, which are defined over an alphabet of two symbols. For all \(w \geq 3\), and for \(w+1 \leq N \leq 3w\), we show that an \(SHF(N; n,2, \{1,w \})\) exists only if \(n \leq N\), and an \(SHF(N; N,2, \{1,w \})\) must be a permutation matrix of degree \(N\). |
|||||
2014 | Improving Bilayer Product Quantization For Billion-scale Approximate Nearest Neighbors In High Dimensions | Babenko Artem, Lempitsky Victor | Arxiv | The top-performing systems for billion-scale high-dimensional approximate nearest neighbor (ANN) search are all based on two-layer architectures that include an indexing structure and a compressed datapoints layer. An indexing structure is crucial as it makes it possible to avoid exhaustive search, while the lossy data compression is needed to fit the dataset into RAM. Several of the most successful systems use product quantization (PQ) for both the indexing and the dataset compression layers. These systems are however limited in the way they exploit the interaction of product quantization processes that happen at different stages of these systems. Here we introduce and evaluate two approximate nearest neighbor search systems that both exploit the synergy of product quantization processes in a more efficient way. The first system, called Fast Bilayer Product Quantization (FBPQ), speeds up the runtime of the baseline system (Multi-D-ADC) by several times, while achieving the same accuracy. The second system, Hierarchical Bilayer Product Quantization (HBPQ), provides a significantly better recall for the same runtime at the cost of a small increase in memory footprint. For the BIGANN dataset of one billion SIFT descriptors, a 10% increase in Recall@1 and a 17% increase in Recall@10 are observed. |
|||||
2014 | Expected Number Of Uniformly Distributed Balls In A Most Loaded Bin Using Placement With Simple Linear Functions | Babka Martin | Arxiv | We estimate the size of a most loaded bin in the setting when the balls are placed into the bins using a random linear function in a finite field. The balls are chosen from a transformed interval. We show that in this setting the expected load of the most loaded bins is constant. This is an interesting fact because using fully random hash functions with the same class of input sets leads to an expectation of \(\Theta\left(\frac{\log m}{\log\log m}\right)\) balls in the most loaded bins, where \(m\) is the number of balls and bins. Although this family of functions is quite common, the size of the largest bins was not known even in this simple case. |
|||||
2014 | Supervised Hashing Using Graph Cuts And Boosted Decision Trees | Lin Guosheng, Shen Chunhua, Hengel Anton Van Den | Arxiv | Embedding image features into a binary Hamming space can improve both the speed and accuracy of large-scale query-by-example image retrieval systems. Supervised hashing aims to map the original features to compact binary codes in a manner which preserves the label-based similarities of the original data. Most existing approaches apply a single form of hash function, and an optimization process which is typically deeply coupled to this specific form. This tight coupling restricts the flexibility of those methods, and can result in complex optimization problems that are difficult to solve. In this work we proffer a flexible yet simple framework that is able to accommodate different types of loss functions and hash functions. The proposed framework allows a number of existing approaches to hashing to be placed in context, and simplifies the development of new problem-specific hashing methods. Our framework decomposes the hashing learning problem into two steps: binary code (hash bits) learning, and hash function learning. The first step can typically be formulated as a binary quadratic problem, and the second step can be accomplished by training standard binary classifiers. For solving large-scale binary code inference, we show how to ensure that the binary quadratic problems are submodular such that an efficient graph cut approach can be used. To achieve efficiency as well as efficacy on large-scale high-dimensional data, we propose to use boosted decision trees as the hash functions, which are nonlinear, highly descriptive, and very fast to train and evaluate. Experiments demonstrate that our proposed method significantly outperforms most state-of-the-art methods, especially on high-dimensional data. |
|||||
2014 | Random Forests Can Hash | Qiu Qiang, Sapiro Guillermo, Bronstein Alex | Arxiv | Hash codes are a very efficient data representation needed to be able to cope with the ever growing amounts of data. We introduce a random forest semantic hashing scheme with information-theoretic code aggregation, showing for the first time how random forest, a technique that, together with deep learning, has shown spectacular results in classification, can also be extended to large-scale retrieval. Traditional random forest fails to enforce the consistency of hashes generated from each tree for the same class data, i.e., to preserve the underlying similarity, and it also lacks a principled way for code aggregation across trees. We start with a simple hashing scheme, where independently trained random trees in a forest act as hashing functions. We then propose a subspace model as the splitting function, and show that it enforces the hash consistency in a tree for data from the same class. We also introduce an information-theoretic approach for aggregating codes of individual trees into a single hash code, producing a near-optimal unique hash for each class. Experiments on large-scale public datasets are presented, showing that the proposed approach significantly outperforms state-of-the-art hashing methods for retrieval tasks. |
|||||
2014 | Comparing Apples To Apples In The Evaluation Of Binary Coding Methods | Rastegari Mohammad, Fakhraei Shobeir, Choi Jonghyun, Jacobs David, Davis Larry S. | Arxiv | We discuss methodological issues related to the evaluation of unsupervised binary code construction methods for nearest neighbor search. These issues have been widely ignored in the literature. These coding methods attempt to preserve either Euclidean distance or angular (cosine) distance in the binary embedding space. We explain why, when comparing a method whose goal is preserving cosine similarity to one designed for preserving Euclidean distance, the original features should be normalized by mapping them to the unit hypersphere before learning the binary mapping functions. To compare a method whose goal is to preserve Euclidean distance to one that preserves cosine similarity, the original feature data must be mapped to a higher dimension by including a bias term in binary mapping functions. These conditions ensure the fair comparison between different binary code methods for the task of nearest neighbor search. Our experiments show under these conditions the very simple methods (e.g. LSH and ITQ) often outperform recent state-of-the-art methods (e.g. MDSH and OK-means). |
|||||
2014 | Consistent Subset Sampling | Kutzkov Konstantin, Pagh Rasmus | Arxiv | Consistent sampling is a technique for specifying, in small space, a subset \(S\) of a potentially large universe \(U\) such that the elements in \(S\) satisfy a suitably chosen sampling condition. Given a subset \(\mathcal{I}\subseteq U\) it should be possible to quickly compute \(\mathcal{I}\cap S\), i.e., the elements in \(\mathcal{I}\) satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-\(k\) subsets occurring in some set in a collection of sets of bounded size \(b\), where \(k\) is a small integer. This can be done by applying standard consistent sampling to the \(k\)-subsets of each set, but that approach requires time \(\Theta(b^k)\). Using a carefully designed hash function, for a given sampling probability \(p \in (0,1]\), we show how to improve the time complexity to \(\Theta(b^{\lceil k/2\rceil}\log\log b + pb^k)\) in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is \(\Theta(b^{\lceil k/4\rceil})\). We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent \(k\)-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as incidence stream. Further, building upon a recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining. |
|||||
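A hedged sketch of the basic consistent-sampling building block that the entry above generalizes to size-k subsets: an element is sampled iff its hash falls below the sampling probability, so any party hashing the same universe agrees on exactly which elements were sampled. The SHA-1-based hash and the set sizes are illustrative assumptions.

```python
# Hedged sketch: deterministic hash-threshold sampling, consistent across input sets.
import hashlib

def h01(x) -> float:
    """Map an element to a pseudo-random value in [0, 1) via a fixed hash."""
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def consistent_sample(items, p):
    return {x for x in items if h01(x) < p}

I1 = set(range(0, 800))
I2 = set(range(500, 1300))
s1, s2 = consistent_sample(I1, 0.05), consistent_sample(I2, 0.05)
# Consistency: the two samples agree on the overlap of the two input sets.
print(s1 & I2 == s2 & I1)                        # True
```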
2014 | Expanding The Family Of Grassmannian Kernels An Embedding Perspective | Harandi Mehrtash T., Salzmann Mathieu, Jayasumana Sadeep, Hartley Richard, Li Hongdong | Arxiv | Modeling videos and image-sets as linear subspaces has proven beneficial for many visual recognition tasks. However, it also incurs challenges arising from the fact that linear subspaces do not obey Euclidean geometry, but lie on a special type of Riemannian manifolds known as Grassmannian. To leverage the techniques developed for Euclidean spaces (e.g, support vector machines) with subspaces, several recent studies have proposed to embed the Grassmannian into a Hilbert space by making use of a positive definite kernel. Unfortunately, only two Grassmannian kernels are known, none of which -as we will show- is universal, which limits their ability to approximate a target function arbitrarily well. Here, we introduce several positive definite Grassmannian kernels, including universal ones, and demonstrate their superiority over previously-known kernels in various tasks, such as classification, clustering, sparse coding and hashing. |
|||||
2014 | Circulant Binary Embedding | Yu Felix X., Kumar Sanjiv, Gong Yunchao, Chang Shih-fu | Arxiv | Binary embedding of high-dimensional data requires long codes to preserve the discriminative power of the input space. Traditional binary coding methods often suffer from very high computation and storage costs in such a scenario. To address this problem, we propose Circulant Binary Embedding (CBE) which generates binary codes by projecting the data with a circulant matrix. The circulant structure enables the use of Fast Fourier Transformation to speed up the computation. Compared to methods that use unstructured matrices, the proposed method improves the time complexity from \(\mathcal{O}(d^2)\) to \(\mathcal{O}(d\log d)\), and the space complexity from \(\mathcal{O}(d^2)\) to \(\mathcal{O}(d)\) where \(d\) is the input dimensionality. We also propose a novel time-frequency alternating optimization to learn data-dependent circulant projections, which alternatively minimizes the objective in original and Fourier domains. We show by extensive experiments that the proposed approach gives much better performance than the state-of-the-art approaches for fixed time, and provides much faster computation with no performance degradation for fixed number of bits. |
|||||
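The FFT speedup claimed in the entry above follows from the fact that multiplying by a circulant matrix is a circular convolution. A hedged sketch with a random (data-independent) circulant projection follows; the learned, data-dependent optimization from the paper is omitted, and the dimension and random vectors are illustrative assumptions.

```python
# Hedged sketch: a circulant projection computed two ways, once with the explicit O(d^2)
# matrix multiply and once with the O(d log d) FFT route, then thresholded to a binary code.
import numpy as np

rng = np.random.default_rng(0)
d = 128
r = rng.normal(size=d)                           # defines the circulant matrix
x = rng.normal(size=d)

# Explicit circulant matrix: C[i, j] = r[(i - j) mod d], so C @ x is a circular convolution.
C = np.array([[r[(i - j) % d] for j in range(d)] for i in range(d)])
proj_dense = C @ x                               # O(d^2)

proj_fft = np.real(np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)))   # O(d log d)
assert np.allclose(proj_dense, proj_fft)

code = proj_fft > 0                              # binary code from the projection signs
print(code[:16].astype(int))
```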
2014 | Memory Vectors For Similarity Search In High-dimensional Spaces | Iscen Ahmet, Furon Teddy, Gripon Vincent, Rabbat Michael, Jégou Hervé | Arxiv | We study an indexing architecture to store and search in a database of high-dimensional vectors from the perspective of statistical signal processing and decision theory. This architecture is composed of several memory units, each of which summarizes a fraction of the database by a single representative vector. The potential similarity of the query to one of the vectors stored in the memory unit is gauged by a simple correlation with the memory unit’s representative vector. This representative optimizes the test of the following hypothesis: the query is independent from any vector in the memory unit vs. the query is a simple perturbation of one of the stored vectors. Compared to exhaustive search, our approach finds the most similar database vectors significantly faster without a noticeable reduction in search quality. Interestingly, the reduction of complexity is provably better in high-dimensional spaces. We empirically demonstrate its practical interest in a large-scale image search scenario with off-the-shelf state-of-the-art descriptors. |
|||||
2014 | Image Classification With A Deep Network Model Based On Compressive Sensing | Gan Yufei, Zhuo Tong, He Chu | Arxiv | To simplify the parameters of the deep learning network, a cascaded compressive sensing model “CSNet” is implemented for image classification. Firstly, we use a cascaded compressive sensing network to learn features from the data. Secondly, CSNet generates the features by binary hashing and block-wise histograms. Finally, a linear SVM classifier is used to classify these features. The experiments on the MNIST dataset indicate that higher classification accuracy can be obtained by this algorithm. |
|||||
2014 | A New Approach To Analyzing Robin Hood Hashing | Mitzenmacher Michael | Arxiv | Robin Hood hashing is a variation on open addressing hashing designed to reduce the maximum search time as well as the variance in the search time for elements in the hash table. While the insertion-only case of Robin Hood hashing is well understood, the behavior with deletions has remained open. Here we show that Robin Hood hashing can be analyzed under the framework of finite-level finite-dimensional jump Markov chains. This framework allows us to re-derive some past results for the insertion-only case with some new insight, as well as provide a new analysis for a standard deletion model, where we alternate between deleting a random old key and inserting a new one. In particular, we show that a simple but apparently unstudied approach for handling deletions with Robin Hood hashing offers good performance even under high loads. |
|||||
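For readers unfamiliar with the scheme analyzed in the entry above, here is a hedged sketch of Robin Hood insertion and lookup in a small open-addressing table; the table size, keys, and early-termination lookup rule are illustrative choices, and the paper's deletion handling is not shown.

```python
# Hedged sketch: during probing, if the incumbent entry is closer to its home slot ("richer")
# than the element being inserted, the two swap, which evens out and bounds probe lengths.
TABLE_SIZE = 16
table = [None] * TABLE_SIZE                      # each slot holds (key, probe_distance)

def insert(key):
    idx, entry = hash(key) % TABLE_SIZE, (key, 0)
    while True:
        if table[idx] is None:
            table[idx] = entry
            return
        if table[idx][1] < entry[1]:             # incumbent is richer: steal its slot
            table[idx], entry = entry, table[idx]
        idx = (idx + 1) % TABLE_SIZE
        entry = (entry[0], entry[1] + 1)         # carried entry is displaced one more slot

def lookup(key):
    idx, dist = hash(key) % TABLE_SIZE, 0
    while table[idx] is not None and table[idx][1] >= dist:
        if table[idx][0] == key:
            return True
        idx, dist = (idx + 1) % TABLE_SIZE, dist + 1
    return False                                  # stop early once entries are richer than us

for k in ["alpha", "beta", "gamma", "delta", "epsilon"]:
    insert(k)
print(lookup("gamma"), lookup("zeta"))            # True False
```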
2014 | Polynomials A New Tool For Length Reduction In Binary Discrete Convolutions | Amir Amihood, Kapah Oren, Porat Ely, Rothschild Amir | Arxiv | Efficient handling of sparse data is a key challenge in Computer Science. Binary convolutions, such as polynomial multiplication or the Walsh Transform, are a useful tool in many applications and are efficiently solved. In the last decade, several problems required efficient solution of sparse binary convolutions. Both randomized and deterministic algorithms were developed for efficiently computing the sparse polynomial multiplication. The key operation in all these algorithms was length reduction. The sparse data is mapped into small vectors that preserve the convolution result. The reduction method used to date was the modulo function since it preserves location (of the “1” bits) up to cyclic shift. To date there is no known efficient algorithm for computing the sparse Walsh transform. Since the modulo function does not preserve the Walsh transform, a new method for length reduction is needed. In this paper we present such a new method - polynomials. This method enables the development of an efficient algorithm for computing the binary sparse Walsh transform. To our knowledge, this is the first such algorithm. We also show that this method allows a faster deterministic computation of sparse polynomial multiplication than currently known in the literature. |
|||||
2014 | Fast Low-rank Representation Based Spatial Pyramid Matching For Image Classification | Peng Xi, Yan Rui, Zhao Bo, Tang Huajin, Yi Zhang | Knowledge based Systems | Spatial Pyramid Matching (SPM) and its variants have achieved a lot of success in image classification. The main difference among them is their encoding schemes. For example, ScSPM incorporates Sparse Code (SC) instead of Vector Quantization (VQ) into the framework of SPM. Although the methods achieve a higher recognition rate than the traditional SPM, they consume more time to encode the local descriptors extracted from the image. In this paper, we propose using Low Rank Representation (LRR) to encode the descriptors under the framework of SPM. Different from SC, LRR considers the group effect among data points instead of sparsity. Benefiting from this property, the proposed method (i.e., LrrSPM) can offer a better performance. To further improve the generalizability and robustness, we reformulate the rank-minimization problem as a truncated projection problem. Extensive experimental studies show that LrrSPM is more efficient than its counterparts (e.g., ScSPM) while achieving competitive recognition rates on nine image data sets. |
|||||
2014 | Approximate K-flat Nearest Neighbor Search | Mulzer Wolfgang, Nguyen Huy L., Seiferth Paul, Stein Yannik | Arxiv | Let \(k\) be a nonnegative integer. In the approximate \(k\)-flat nearest neighbor (\(k\)-ANN) problem, we are given a set \(P \subset \mathbb{R}^d\) of \(n\) points in \(d\)-dimensional space and a fixed approximation factor \(c > 1\). Our goal is to preprocess \(P\) so that we can efficiently answer approximate \(k\)-flat nearest neighbor queries: given a \(k\)-flat \(F\), find a point in \(P\) whose distance to \(F\) is within a factor \(c\) of the distance between \(F\) and the closest point in \(P\). The case \(k = 0\) corresponds to the well-studied approximate nearest neighbor problem, for which a plethora of results are known, both in low and high dimensions. The case \(k = 1\) is called approximate line nearest neighbor. In this case, we are aware of only one provably efficient data structure, due to Andoni, Indyk, Krauthgamer, and Nguyen. For \(k \geq 2\), we know of no previous results. We present the first efficient data structure that can handle approximate nearest neighbor queries for arbitrary \(k\). We use a data structure for \(0\)-ANN-queries as a black box, and the performance depends on the parameters of the \(0\)-ANN solution: suppose we have an \(0\)-ANN structure with query time \(O(n^{\rho})\) and space requirement \(O(n^{1+\sigma})\), for \(\rho, \sigma > 0\). Then we can answer \(k\)-ANN queries in time \(O(n^{k/(k + 1 - \rho) + t})\) and space \(O(n^{1+\sigma k/(k + 1 - \rho)} + nlog^{O(1/t)} n)\). Here, \(t > 0\) is an arbitrary constant and the \(O\)-notation hides exponential factors in \(k\), \(1/t\), and \(c\) and polynomials in \(d\). Our new data structures also give an improvement in the space requirement over the previous result for \(1\)-ANN: we can achieve near-linear space and sublinear query time, a further step towards practical applications where space constitutes the bottleneck. |
|||||
2014 | A New Non-mds Hash Function Resisting Birthday Attack And Meet-in-the-middle Attack | Su Shenghui, Xie Tao, Lu Shuwang | Theoretical Computer Science v | To examine the integrity and authenticity of an IP address efficiently and economically, this paper proposes a new non-Merkle-Damgård structural (non-MDS) hash function called JUNA that is based on a multivariate permutation problem and an anomalous subset product problem for which no subexponential-time solutions have been found so far. JUNA includes an initialization algorithm and a compression algorithm, and converts a short message of n bits, which is regarded as a single block, into a digest of m bits, where 80 <= m <= 232 and 80 <= m <= n <= 4096. The analysis and proof show that the new hash is one-way, weakly collision-free, and strongly collision-free, and that its security against existent attacks such as the birthday attack and the meet-in-the-middle attack is O(2^m). Moreover, a detailed proof that the new hash function is resistant to the birthday attack is given. Compared with the Chaum-Heijst-Pfitzmann hash based on a discrete logarithm problem, the new hash is lightweight, and thus it opens the door to convenient use of lightweight digital signing schemes. |
|||||
2014 | Revisiting Kernelized Locality-sensitive Hashing For Improved Large-scale Image Retrieval | Jiang Ke, Que Qichao, Kulis Brian | Arxiv | We present a simple but powerful reinterpretation of kernelized locality-sensitive hashing (KLSH), a general and popular method developed in the vision community for performing approximate nearest-neighbor searches in an arbitrary reproducing kernel Hilbert space (RKHS). Our new perspective is based on viewing the steps of the KLSH algorithm in an appropriately projected space, and has several key theoretical and practical benefits. First, it eliminates the problematic conceptual difficulties that are present in the existing motivation of KLSH. Second, it yields the first formal retrieval performance bounds for KLSH. Third, our analysis reveals two techniques for boosting the empirical performance of KLSH. We evaluate these extensions on several large-scale benchmark image retrieval data sets, and show that our analysis leads to improved recall performance of at least 12%, and sometimes much higher, over the standard KLSH method. |
|||||
2014 | Quantum Hashing Via Classical epsilon-universal Hashing Constructions | Ablayev Farid, Ablayev Marat | Arxiv | In this paper, we define the concept of a quantum hash generator and offer a design which allows one to build a large number of different quantum hash functions. The construction is based on the composition of a classical \(\epsilon\)-universal hash family and a given family of functions – the quantum hash generator. The proposed construction combines the properties of robust presentation of information by classical error-correcting codes together with the possibility of highly compressed presentation of information by quantum systems. In particular, we present a quantum hash function based on a Reed-Solomon code, and we prove that this construction is optimal in the sense of the number of qubits needed. |
|||||
2014 | Discrete Graph Hashing | Wei Liu, Cun Mu, Sanjiv Kumar, Shih-fu Chang | Neural Information Processing Systems | Hashing has emerged as a popular technique for fast nearest neighbor search in gigantic databases. In particular, learning based hashing has received considerable attention due to its appealing storage and search efficiency. However, the performance of most unsupervised learning based hashing methods deteriorates rapidly as the hash code length increases. We argue that the degraded performance is due to inferior optimization procedures used to achieve discrete binary codes. This paper presents a graph-based unsupervised hashing model to preserve the neighborhood structure of massive data in a discrete code space. We cast the graph hashing problem into a discrete optimization framework which directly learns the binary codes. A tractable alternating maximization algorithm is then proposed to explicitly deal with the discrete constraints, yielding high-quality codes to well capture the local neighborhoods. Extensive experiments performed on four large datasets with up to one million samples show that our discrete optimization based graph hashing method obtains superior search accuracy over state-of-the-art unsupervised hashing methods, especially for longer codes. |
|||||
2014 | Two Simple Full-text Indexes Based On The Suffix Array | Grabowski Szymon, Raniszewski Marcin | Arxiv | We propose two suffix array inspired full-text indexes. One, called SA-hash, augments the suffix array with a hash table to speed up pattern searches thanks to a significantly narrowed search interval before the binary search phase. The other, called FBCSA, is a compact data structure, similar to Mäkinen’s compact suffix array, but working on fixed sized blocks. Experiments on the Pizza & Chili 200 MB datasets show that SA-hash is about 2–3 times faster in pattern searches (counts) than the standard suffix array, for the price of requiring \(0.2n-1.1n\) bytes of extra space, where \(n\) is the text length, and setting a minimum pattern length. FBCSA is relatively fast in single cell accesses (a few times faster than related indexes at about the same or better compression), but not competitive if many consecutive cells are to be extracted. Still, for the task of extracting, e.g., 10 successive cells its time-space relation remains attractive. |
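To make the SA-hash idea concrete, here is a small Python sketch under simplifying assumptions (naive suffix-array construction, ASCII text, materialized suffix strings); the function names are illustrative, not the paper's. The hash table maps each length-k prefix to its suffix-array interval, so the search for a pattern starts from a much narrower range.

```python
from bisect import bisect_left, bisect_right

def build_sa_hash(text, k=2):
    """Suffix array plus a hash table mapping each length-k prefix to its SA interval.
    Naive construction for brevity; real indexes use fast suffix-array builders."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    intervals = {}
    for rank, pos in enumerate(sa):
        pref = text[pos:pos + k]
        lo, hi = intervals.get(pref, (rank, rank))
        intervals[pref] = (min(lo, rank), max(hi, rank))
    return sa, intervals

def count_occurrences(text, sa, intervals, pattern, k=2):
    """Count occurrences; the hash lookup narrows the binary-search range first."""
    if len(pattern) < k:
        lo, hi = 0, len(sa) - 1                 # short pattern: fall back to the full array
    else:
        if pattern[:k] not in intervals:
            return 0
        lo, hi = intervals[pattern[:k]]
    suffixes = [text[sa[i]:] for i in range(lo, hi + 1)]   # assumes characters < '\xff'
    return bisect_right(suffixes, pattern + "\xff") - bisect_left(suffixes, pattern)

text = "banana"
sa, iv = build_sa_hash(text)
print(count_occurrences(text, sa, iv, "ana"))   # 2
```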
|||||
2014 | Microsoft COCO: Common Objects in Context | Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollar | Arxiv | We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model. |
|||||
2014 | A Novel String Distance Function Based On Most Frequent K Characters | Seker Sadi Evren, Altun Oguz, Ayan Uğur, Mert Cihan | International Journal of Machine Learning and Computation | This study introduces a novel similarity metric to increase the speed of comparison operations; the new metric is also suitable for distance-based operations among strings. Most of the simple calculation methods, such as string length, are fast to compute but do not represent the string well. On the other hand, methods like keeping a histogram over all characters in the string are slower but represent the string characteristics well in some areas, such as natural language. We propose a new metric that is easy to calculate and satisfactory for string comparison. The method is built on a hash function which takes a string of any size and outputs the most frequent K characters with their frequencies. The outputs can be compared directly, and our studies show that the success rate is quite satisfactory for text mining operations. |
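A minimal Python sketch of the idea: hash a string to its K most frequent characters with their counts. The comparison below is one simple way to compare such hashes and is not necessarily the exact distance defined in the paper; all names are illustrative.

```python
from collections import Counter

def most_freq_k_hash(s, k=2):
    """Hash a string to its k most frequent characters and their frequencies."""
    return dict(Counter(s).most_common(k))

def hash_distance(h1, h2):
    """One simple way to compare two such hashes (the paper defines its own
    distance): sum of absolute count differences over characters in either hash."""
    chars = set(h1) | set(h2)
    return sum(abs(h1.get(c, 0) - h2.get(c, 0)) for c in chars)

a = most_freq_k_hash("research")   # e.g. {'r': 2, 'e': 2}
b = most_freq_k_hash("seeking")    # e.g. {'e': 2, 's': 1}
print(a, b, hash_distance(a, b))
```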
|||||
2014 | Wear Minimization For Cuckoo Hashing How Not To Throw A Lot Of Eggs Into One Basket | Eppstein David, Goodrich Michael T., Mitzenmacher Michael, Pszona Paweł | Arxiv | We study wear-leveling techniques for cuckoo hashing, showing that it is possible to achieve a memory wear bound of \(\log\log n + O(1)\) after the insertion of \(n\) items into a table of size \(Cn\) for a suitable constant \(C\) using cuckoo hashing. Moreover, we study our cuckoo hashing method empirically, showing that it significantly improves on the memory wear performance for classic cuckoo hashing and linear probing in practice. |
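For context, a minimal Python sketch of classic cuckoo insertion (two tables, evict-and-retry). The wear-leveling policy that the paper actually studies is not modeled here, and the seeds, capacity, and kick limit are illustrative choices.

```python
import random

class CuckooTable:
    """Minimal classic cuckoo hashing sketch; no wear-leveling, no rebuild logic."""

    def __init__(self, capacity=11, max_kicks=50):
        self.t = [[None] * capacity, [None] * capacity]
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.seeds = (0x9E3779B9, 0x85EBCA6B)    # arbitrary seeds for the two hashes

    def _h(self, which, key):
        return hash((self.seeds[which], key)) % self.capacity

    def insert(self, key):
        for _ in range(self.max_kicks):
            for which in (0, 1):                 # try both candidate slots first
                idx = self._h(which, key)
                if self.t[which][idx] is None:
                    self.t[which][idx] = key
                    return True
            which = random.randint(0, 1)         # both occupied: evict from one table
            idx = self._h(which, key)
            key, self.t[which][idx] = self.t[which][idx], key
        return False                             # too many kicks: a rehash would be needed

    def contains(self, key):
        return any(self.t[w][self._h(w, key)] == key for w in (0, 1))

c = CuckooTable()
for k in ["a", "b", "c", "d"]:
    c.insert(k)
print(c.contains("c"), c.contains("z"))          # True False
```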
|||||
2014 | Improved Asymmetric Locality Sensitive Hashing (ALSH) For Maximum Inner Product Search (MIPS) | Shrivastava Anshumali, Li Ping | Arxiv | Recently it was shown that the problem of Maximum Inner Product Search (MIPS) admits provably sub-linear time hashing algorithms. Asymmetric transformations before hashing were the key to solving MIPS, which was otherwise hard. In the prior work, the authors use asymmetric transformations which convert the problem of approximate MIPS into the problem of approximate near neighbor search which can be efficiently solved using hashing. In this work, we provide a different transformation which converts the problem of approximate MIPS into the problem of approximate cosine similarity search which can be efficiently solved using signed random projections. Theoretical analysis shows that the new scheme is significantly better than the original scheme for MIPS. Experimental evaluations strongly support the theoretical findings. |
|||||
2014 | Iterative Universal Hash Function Generator For Minhashing | De Franca Fabricio Olivetti | Arxiv | Minhashing is a technique used to estimate the Jaccard Index between two sets by exploiting the probability of collision in a random permutation. In order to speed up the computation, a random permutation can be approximated by using a universal hash function such as the \(h_{a,b}\) function proposed by Carter and Wegman. A better estimate of the Jaccard Index can be achieved by using many of these hash functions, created at random. In this paper a new iterative procedure to generate a set of \(h_{a,b}\) functions is devised that eliminates the need for a list of random values and avoids the multiplication operation during the calculation. The properties of the generated hash functions remain those of a universal hash function family. This is possible due to the random nature of feature occurrence in sparse datasets. Results show that the uniformity of hashing the features is maintained while obtaining a speedup of up to \(1.38\) compared to the traditional approach. |
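A Python sketch of the baseline the abstract speeds up: a MinHash estimator built from randomly drawn Carter-Wegman \(h_{a,b}\) functions. The paper's iterative, multiplication-free generator is not reproduced here, and the prime and signature length are illustrative.

```python
import random

P = (1 << 61) - 1   # a large Mersenne prime for Carter-Wegman hashing

def make_hab(rng):
    """One universal hash function h_{a,b}(x) = (a*x + b) mod p."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: (a * x + b) % P

def minhash_signature(items, hash_funcs):
    """MinHash signature: the minimum hash value per function over the set."""
    return [min(h(x) for x in items) for h in hash_funcs]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard index."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

rng = random.Random(42)
hs = [make_hab(rng) for _ in range(200)]
A = set(range(0, 80))
B = set(range(40, 120))
true_j = len(A & B) / len(A | B)                  # 40/120 = 0.333...
print(true_j, estimate_jaccard(minhash_signature(A, hs), minhash_signature(B, hs)))
```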
|||||
2014 | A Hash-based Co-clustering Algorithm For Categorical Data | De França Fabricio Olivetti | Arxiv | Many real-life data are described by categorical attributes without a pre-classification. A common data mining method used to extract information from this type of data is clustering. This method groups together the samples from the data that are more similar to each other than to all other samples. But categorical data pose a challenge when extracting information: the similarity of two objects is usually calculated by measuring the number of common features, which ignores a possible importance weighting; and if the data may be divided differently according to different subsets of the features, the algorithm may find clusters with different meanings from each other, complicating the post-analysis. Co-clustering of categorical data is the technique that tries to find subsets of samples that share a subset of features in common. By doing so, not only may a sample belong to more than one cluster, but the feature selection of each cluster describes its own characteristics. In this paper a novel co-clustering technique for categorical data is proposed that uses Locality Sensitive Hashing to preprocess a list of co-cluster seeds, building on previous research. Results indicate this technique is capable of finding high-quality co-clusters in many different categorical data sets and scales linearly with the data set size. |
|||||
2014 | Nearest Keyword Set Search In Multi-dimensional Datasets | Singh Vishwakarma, Singh Ambuj K. | Arxiv | Keyword-based search in text-rich multi-dimensional datasets facilitates many novel applications and tools. In this paper, we consider objects that are tagged with keywords and are embedded in a vector space. For these datasets, we study queries that ask for the tightest groups of points satisfying a given set of keywords. We propose a novel method called ProMiSH (Projection and Multi Scale Hashing) that uses random projection and hash-based index structures, and achieves high scalability and speedup. We present an exact and an approximate version of the algorithm. Our empirical studies, both on real and synthetic datasets, show that ProMiSH has a speedup of more than four orders of magnitude over state-of-the-art tree-based techniques. Our scalability tests on datasets of sizes up to 10 million and dimensions up to 100 for queries having up to 9 keywords show that ProMiSH scales linearly with the dataset size, the dataset dimension, the query size, and the result size. |
|||||
2014 | Quantized Kernel Learning For Feature Matching | Danfeng Qin, Xuanli Chen, Matthieu Guillaumin, Luc V. Gool | Neural Information Processing Systems | Matching local visual features is a crucial problem in computer vision and its accuracy greatly depends on the choice of similarity measure. As it is generally very difficult to design by hand a similarity or a kernel perfectly adapted to the data of interest, learning it automatically with as few assumptions as possible is preferable. However, available techniques for kernel learning suffer from several limitations, such as restrictive parametrization or scalability. In this paper, we introduce a simple and flexible family of non-linear kernels which we refer to as Quantized Kernels (QK). QKs are arbitrary kernels in the index space of a data quantizer, i.e., piecewise constant similarities in the original feature space. Quantization allows to compress features and keep the learning tractable. As a result, we obtain state-of-the-art matching performance on a standard benchmark dataset with just a few bits to represent each feature dimension. QKs also have explicit non-linear, low-dimensional feature mappings that grant access to Euclidean geometry for uncompressed features. |
|||||
2014 | A Perceptual Hash Function To Store And Retrieve Large Scale DNA Sequences | De Herve Jocelyn De Goer, Kang Myoung-ah, Bailly Xavier, Nguifo Engelbert Mephu | Arxiv | This paper proposes a novel approach for storing and retrieving massive DNA sequences. The method is based on a perceptual hash function, commonly used to determine the similarity between digital images, that we adapted for DNA sequences. The perceptual hash function presented here is based on a Discrete Cosine Transform Sign Only (DCT-SO). Each nucleotide is encoded as a fixed gray level intensity pixel and the hash is calculated from its significant frequency characteristics. This results in a drastic data reduction between the sequence and the perceptual hash. Unlike cryptographic hash functions, perceptual hashes are not affected by the “avalanche effect” and thus can be compared. The similarity distance between two hashes is estimated with the Hamming Distance, which is used to retrieve DNA sequences. The experiments that we conducted show that our approach is relevant for storing and retrieving massive DNA sequences. |
|||||
2014 | Efficient On-the-fly Category Retrieval Using Convnets And Gpus | Chatfield Ken, Simonyan Karen, Zisserman Andrew | Arxiv | We investigate the gains in precision and speed, that can be obtained by using Convolutional Networks (ConvNets) for on-the-fly retrieval - where classifiers are learnt at run time for a textual query from downloaded images, and used to rank large image or video datasets. We make three contributions: (i) we present an evaluation of state-of-the-art image representations for object category retrieval over standard benchmark datasets containing 1M+ images; (ii) we show that ConvNets can be used to obtain features which are incredibly performant, and yet much lower dimensional than previous state-of-the-art image representations, and that their dimensionality can be reduced further without loss in performance by compression using product quantization or binarization. Consequently, features with the state-of-the-art performance on large-scale datasets of millions of images can fit in the memory of even a commodity GPU card; (iii) we show that an SVM classifier can be learnt within a ConvNet framework on a GPU in parallel with downloading the new training images, allowing for a continuous refinement of the model as more images become available, and simultaneous training and ranking. The outcome is an on-the-fly system that significantly outperforms its predecessors in terms of: precision of retrieval, memory requirements, and speed, facilitating accurate on-the-fly learning and ranking in under a second on a single GPU. |
|||||
2014 | Asymmetric LSH (ALSH) For Sublinear Time Maximum Inner Product Search (MIPS) | Anshumali Shrivastava, Ping Li | Neural Information Processing Systems | We present the first provably sublinear time hashing algorithm for approximate Maximum Inner Product Search (MIPS). Searching with (un-normalized) inner product as the underlying similarity measure is a known difficult problem and finding hashing schemes for MIPS was considered hard. While the existing Locality Sensitive Hashing (LSH) framework is insufficient for solving MIPS, in this paper we extend the LSH framework to allow asymmetric hashing schemes. Our proposal is based on a key observation that the problem of finding maximum inner products, after independent asymmetric transformations, can be converted into the problem of approximate near neighbor search in classical settings. This key observation makes efficient sublinear hashing scheme for MIPS possible. Under the extended asymmetric LSH (ALSH) framework, this paper provides an example of explicit construction of provably fast hashing scheme for MIPS. Our proposed algorithm is simple and easy to implement. The proposed hashing scheme leads to significant computational savings over the two popular conventional LSH schemes: (i) Sign Random Projection (SRP) and (ii) hashing based on \(p\)-stable distributions for \(L_2\) norm (L2LSH), in the collaborative filtering task of item recommendations on Netflix and Movielens (10M) datasets. |
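A NumPy sketch of the asymmetric transformations described above, under illustrative parameter choices (the names, m, and U are assumptions for this sketch): rescale the data so all norms are below U < 1, append growing powers of the norm to the data points, append constants 1/2 to the normalized query, and then reuse any Euclidean LSH/ANN index on the transformed vectors.

```python
import numpy as np

def preprocess_data(X, m=3, U=0.83):
    """Asymmetric transform P(x): rescale so every norm is <= U < 1, then append
    ||x||^2, ||x||^4, ..., ||x||^(2^m).  A sketch following the ALSH construction;
    m and U are illustrative parameter choices."""
    X = np.asarray(X, dtype=float)
    scale = U / np.max(np.linalg.norm(X, axis=1))
    Xs = X * scale
    norms = np.linalg.norm(Xs, axis=1, keepdims=True)
    extra = np.hstack([norms ** (2 ** (i + 1)) for i in range(m)])
    return np.hstack([Xs, extra]), scale

def preprocess_query(q, m=3):
    """Asymmetric transform Q(q): normalize the query and append m constants 1/2."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)
    return np.concatenate([q, np.full(m, 0.5)])

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))
q = rng.normal(size=32)
P, _ = preprocess_data(X)
Q = preprocess_query(q)
# Compare the exact MIPS answer with the Euclidean nearest neighbor after the
# transform; the reduction is approximate for finite m, so they may differ slightly.
print(np.argmax(X @ q), np.argmin(np.linalg.norm(P - Q, axis=1)))
```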
|||||
2014 | Coding For Random Projections And Approximate Near Neighbor Search | Li Ping, Mitzenmacher Michael, Shrivastava Anshumali | Arxiv | This technical note compares two coding (quantization) schemes for random projections in the context of sub-linear time approximate near neighbor search. The first scheme is based on uniform quantization while the second scheme utilizes a uniform quantization plus a uniformly random offset (which has been popular in practice). The prior work compared the two schemes in the context of similarity estimation and training linear classifiers, with the conclusion that the step of random offset is not necessary and may hurt the performance (depending on the similarity level). The task of near neighbor search is related to similarity estimation, but with important distinctions, and requires its own study. In this paper, we demonstrate that in the context of near neighbor search, the step of random offset is not needed either and may hurt the performance (sometimes significantly so, depending on the similarity and other parameters). |
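A NumPy sketch of the two schemes being compared: Gaussian random projections followed by uniform quantization, with or without a uniformly random offset. The bucket width, code length, and function name are illustrative choices, not values from the note.

```python
import numpy as np

def project_and_code(X, k=64, width=1.0, random_offset=False, seed=0):
    """Random projections followed by uniform quantization, optionally with a
    uniformly random offset (the two coding schemes compared in the note)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], k))          # Gaussian projection matrix
    Y = X @ R
    offset = rng.uniform(0.0, width, size=k) if random_offset else 0.0
    return np.floor((Y + offset) / width).astype(np.int64)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 128))
codes_plain = project_and_code(X, random_offset=False)
codes_dither = project_and_code(X, random_offset=True)
print(codes_plain.shape, codes_dither.shape)      # (5, 64) integer codes per point
```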
|||||
2013 | Random Binary Mappings For Kernel Learning And Efficient SVM | Roig Gemma, Boix Xavier, Van Gool Luc | Arxiv | Support Vector Machines (SVMs) are powerful learners that have led to state-of-the-art results in various computer vision problems. SVMs suffer from various drawbacks in terms of selecting the right kernel, which depends on the image descriptors, as well as computational and memory efficiency. This paper introduces a novel kernel, which serves such issues well. The kernel is learned by exploiting a large amount of low-complex, randomized binary mappings of the input feature. This leads to an efficient SVM, while also alleviating the task of kernel selection. We demonstrate the capabilities of our kernel on 6 standard vision benchmarks, in which we combine several common image descriptors, namely histograms (Flowers17 and Daimler), attribute-like descriptors (UCI, OSR, and a-VOC08), and Sparse Quantization (ImageNet). Results show that our kernel learning adapts well to the different descriptors types, achieving the performance of the kernels specifically tuned for each image descriptor, and with similar evaluation cost as efficient SVM methods. |
|||||
2013 | A Study On Unsupervised Dictionary Learning And Feature Encoding For Action Classification | Peng Xiaojiang, Peng Qiang, Qiao Yu, Chen Junzhou, Afzal Mehtab | Arxiv | Many efforts have been devoted to develop alternative methods to traditional vector quantization in image domain such as sparse coding and soft-assignment. These approaches can be split into a dictionary learning phase and a feature encoding phase which are often closely connected. In this paper, we investigate the effects of these phases by separating them for video-based action classification. We compare several dictionary learning methods and feature encoding schemes through extensive experiments on KTH and HMDB51 datasets. Experimental results indicate that sparse coding performs consistently better than the other encoding methods in large complex dataset (i.e., HMDB51), and it is robust to different dictionaries. For small simple dataset (i.e., KTH) with less variation, however, all the encoding strategies perform competitively. In addition, we note that the strength of sophisticated encoding approaches comes not from their corresponding dictionaries but the encoding mechanisms, and we can just use randomly selected exemplars as dictionaries for video-based action classification. |
|||||
2013 | Scalable Locality-sensitive Hashing For Similarity Search In High-dimensional Large-scale Multimedia Datasets | Teixeira Thiago S. F. X., Teodoro George, Valle Eduardo, Saltz Joel H. | Arxiv | Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming volume of data, while keeping low response times. Thus, scalability is imperative for similarity search in Web-scale applications, but most existing methods are sequential and target shared-memory machines. Here we address these issues with a distributed, efficient, and scalable index based on Locality-Sensitive Hashing (LSH). LSH is one of the most efficient and popular techniques for similarity search, but its poor referential locality properties have made its implementation a challenging problem. Our solution is based on a widely asynchronous dataflow parallelization with a number of optimizations that include a hierarchical parallelization to decouple indexing and data storage, locality-aware data partition strategies to reduce message passing, and multi-probing to limit memory usage. The proposed parallelization attained an efficiency of 90% in a distributed system with about 800 CPU cores. In particular, the original locality-aware data partition reduced the number of messages exchanged by 30%. Our parallel LSH was evaluated using the largest public dataset for similarity search (to the best of our knowledge) with \(10^9\) 128-d SIFT descriptors extracted from Web images. This is two orders of magnitude larger than datasets that previous LSH parallelizations could handle. |
|||||
2013 | Instruction Sequence Expressions For The Secure Hash Algorithm SHA-256 | Bergstra J. A., Middelburg C. A. | Arxiv | The secure hash function SHA-256 is a function on bit strings. This means that its restriction to the bit strings of any given length can be computed by a finite instruction sequence that contains only instructions to set and get the content of Boolean registers, forward jump instructions, and a termination instruction. We describe such instruction sequences for the restrictions to bit strings of the different possible lengths by means of uniform terms from an algebraic theory. |
|||||
2013 | A Novel Block-dct And PCA Based Image Perceptual Hashing Algorithm | Jie Zeng | Arxiv | Image perceptual hashing finds applications in content indexing, large-scale image database management, certification and authentication, and digital watermarking. We propose a Block-DCT and PCA based image perceptual hash in this article and explore the algorithm in the application of tamper detection. The main idea of the algorithm is to integrate the color histogram and DCT coefficients of image blocks as the perceptual feature, then to compress the perceptual features into an inter-feature with PCA, and to threshold to create a robust hash. The robustness and discrimination properties of the proposed algorithm are evaluated in detail. Our algorithms first construct a secondary image, derived from the input image by pseudo-randomly extracting features that approximately capture semi-global geometric characteristics. From the secondary image (which does not perceptually resemble the input), we further extract the final features which can be used as a hash value (and can be further suitably quantized). In this paper, we use spectral matrix invariants as embodied by Singular Value Decomposition. Surprisingly, formation of the secondary image turns out to be quite important since it not only introduces further robustness, but also enhances the security properties. Indeed, our experiments reveal that our hashing algorithms extract most of the geometric information from the images and hence are robust to severe perturbations (e.g. up to 50% cropping by area with 20-degree rotations) on images while avoiding misclassification. Experimental results show that the proposed image perceptual hash algorithm can effectively address the tamper detection problem with advantageous robustness and discrimination. |
|||||
2013 | Markov Chain Monte Carlo For Arrangement Of Hyperplanes In Locality-sensitive Hashing | Noma Yui, Konoshima Makiko | Arxiv | Since Hamming distances can be calculated by bitwise computations, they can be calculated with less computational load than L2 distances. Similarity searches can therefore be performed faster in Hamming distance space. The elements of Hamming distance space are bit strings. On the other hand, an arrangement of hyperplanes induces a transformation from the feature vectors into feature bit strings. This transformation method is a type of locality-sensitive hashing that has been attracting attention as a way of performing approximate similarity searches at high speed. Supervised learning of hyperplane arrangements allows us to obtain a method that transforms them into feature bit strings reflecting the information of labels applied to higher-dimensional feature vectors. In this paper, we propose a supervised learning method for hyperplane arrangements in feature space that uses a Markov chain Monte Carlo (MCMC) method. We consider the probability density functions used during learning, and evaluate their performance. We also consider the sampling method for the learning data pairs needed in learning, and we evaluate its performance. We confirm that the accuracy of this learning method when using a suitable probability density function and sampling method is greater than the accuracy of existing learning methods. |
|||||
2013 | Space-efficient Las Vegas Algorithms For K-SUM | Wang Joshua | Arxiv | Using hashing techniques, this paper develops a family of space-efficient Las Vegas randomized algorithms for \(k\)-SUM problems. This family includes an algorithm that can solve 3-SUM in \(O(n^2)\) time and \(O(\sqrt{n})\) space. It also establishes a new time-space upper bound for SUBSET-SUM, which can be solved by a Las Vegas algorithm in \(O^*(2^{(1-\sqrt{8\beta/9})n})\) time and \(O^*(2^{\beta n})\) space, for any \(\beta \in [0, 9/32]\). |
|||||
2013 | ABC-SG A New Artificial Bee Colony Algorithm-based Distance Of Sequential Data Using Sigma Grams | Fuad Muhammad Marwan Muhammad | Arxiv | The problem of similarity search is one of the main problems in computer science. This problem has many applications in text-retrieval, web search, computational biology, bioinformatics and others. Similarity between two data objects can be depicted using a similarity measure or a distance metric. There are numerous distance metrics in the literature, some are used for a particular data type, and others are more general. In this paper we present a new distance metric for sequential data which is based on the sum of n-grams. The novelty of our distance is that these n-grams are weighted using artificial bee colony; a recent optimization algorithm based on the collective intelligence of a swarm of bees on their search for nectar. This algorithm has been used in optimizing a large number of numerical problems. We validate the new distance experimentally. |
|||||
2013 | Beyond Pairwise Provably Fast Algorithms For Approximate k-way Similarity Search | Anshumali Shrivastava, Ping Li | Neural Information Processing Systems | |
|||||
2013 | A General Two-step Approach To Learning-based Hashing | Lin Guosheng, Shen Chunhua, Suter David, Hengel Anton Van Den | Arxiv | Most existing approaches to hashing apply a single form of hash function, and an optimization process which is typically deeply coupled to this specific form. This tight coupling restricts the flexibility of the method to respond to the data, and can result in complex optimization problems that are difficult to solve. Here we propose a flexible yet simple framework that is able to accommodate different types of loss functions and hash functions. This framework allows a number of existing approaches to hashing to be placed in context, and simplifies the development of new problem-specific hashing methods. Our framework decomposes hashing learning problem into two steps: hash bit learning and hash function learning based on the learned bits. The first step can typically be formulated as binary quadratic problems, and the second step can be accomplished by training standard binary classifiers. Both problems have been extensively studied in the literature. Our extensive experiments demonstrate that the proposed framework is effective, flexible and outperforms the state-of-the-art. |
|||||
2013 | Inductive Hashing On Manifolds | Shen Fumin, Shen Chunhua, Shi Qinfeng, Hengel Anton Van Den, Tang Zhenmin | Arxiv | Learning based hashing methods have attracted considerable attention due to their ability to greatly increase the scale at which existing algorithms may operate. Most of these methods are designed to generate binary codes that preserve the Euclidean distance in the original space. Manifold learning techniques, in contrast, are better able to model the intrinsic structure embedded in the original high-dimensional data. The complexity of these models, and the problems with out-of-sample data, have previously rendered them unsuitable for application to large-scale embedding, however. In this work, we consider how to learn compact binary embeddings on their intrinsic manifolds. In order to address the above-mentioned difficulties, we describe an efficient, inductive solution to the out-of-sample data problem, and a process by which non-parametric manifold learning may be used as the basis of a hashing method. Our proposed approach thus allows the development of a range of new hashing techniques exploiting the flexibility of the wide variety of manifold learning approaches available. We particularly demonstrate the effectiveness of hashing on the basis of t-SNE. |
|||||
2013 | Coding For Random Projections | Li Ping, Mitzenmacher Michael, Shrivastava Anshumali | Arxiv | The method of random projections has become very popular for large-scale applications in statistical learning, information retrieval, bio-informatics and other applications. Using a well-designed coding scheme for the projected data, which determines the number of bits needed for each projected value and how to allocate these bits, can significantly improve the effectiveness of the algorithm, in storage cost as well as computational speed. In this paper, we study a number of simple coding schemes, focusing on the task of similarity estimation and on an application to training linear classifiers. We demonstrate that uniform quantization outperforms the standard existing influential method (Datar et. al. 2004). Indeed, we argue that in many cases coding with just a small number of bits suffices. Furthermore, we also develop a non-uniform 2-bit coding scheme that generally performs well in practice, as confirmed by our experiments on training linear support vector machines (SVM). |
|||||
2013 | Embed And Project Discrete Sampling With Universal Hashing | Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman | Neural Information Processing Systems | We consider the problem of sampling from a probability distribution defined over a high-dimensional discrete set, specified for instance by a graphical model. We propose a sampling algorithm, called PAWS, based on embedding the set into a higher-dimensional space which is then randomly projected using universal hash functions to a lower-dimensional subspace and explored using combinatorial search methods. Our scheme can leverage fast combinatorial optimization tools as a blackbox and, unlike MCMC methods, samples produced are guaranteed to be within an (arbitrarily small) constant factor of the true probability distribution. We demonstrate that by using state-of-the-art combinatorial search tools, PAWS can efficiently sample from Ising grids with strong interactions and from software verification instances, while MCMC and variational methods fail in both cases. |
|||||
2013 | The Power Of Asymmetry In Binary Hashing | Behnam Neyshabur, Nati Srebro, Russ R. Salakhutdinov, Yury Makarychev, Payman Yadollahpour | Neural Information Processing Systems | When approximating binary similarity using the Hamming distance between short binary hashes, we show that even if the similarity is symmetric, we can have shorter and more accurate hashes by using two distinct code maps. That is, by approximating the similarity between \(x\) and \(x’\) as the Hamming distance between \(f(x)\) and \(g(x’)\), for two distinct binary codes \(f,g\), rather than as the Hamming distance between \(f(x)\) and \(f(x’)\). |
|||||
2013 | Compressed Spaced Suffix Arrays | Gagie Travis, Manzini Giovanni, Valenzuela Daniel | Arxiv | Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data structure, either a hash table or a spaced suffix array (SSA). In this paper we show how to compress SSAs relative to normal suffix arrays (SAs) and still support fast random access to them. We first prove a theoretical upper bound on the space needed to store an SSA when we already have the SA. We then present experiments indicating that our approach works even better in practice. |
|||||
2013 | More Efficient Privacy Amplification With Less Random Seeds Via Dual Universal Hash Function | Hayashi Masahito, Tsurumaru Toyohiro | IEEE Transactions on Information Theory Volume | We explicitly construct random hash functions for privacy amplification (extractors) that require smaller random seed lengths than the previous literature, and still allow efficient implementations with complexity \(O(nlog n)\) for input length \(n\). The key idea is the concept of dual universal\(_2\) hash function introduced recently. We also use a new method for constructing extractors by concatenating \(\delta\)-almost dual universal\(_2\) hash functions with other extractors. Besides minimizing seed lengths, we also introduce methods that allow one to use non-uniform random seeds for extractors. These methods can be applied to a wide class of extractors, including dual universal\(_2\) hash function, as well as to conventional universal\(_2\) hash functions. |
|||||
2013 | Some Properties Of Faber-walsh Polynomials | Sète Olivier | Arxiv | Walsh introduced a generalisation of Faber polynomials to certain compact sets which need not be connected. We derive several equivalent representations of these Faber-Walsh polynomials, analogous to representations of Faber polynomials. Some simple asymptotic properties of the Faber-Walsh polynomials on the complement of the compact set are established. We further show that suitably normalised Faber-Walsh polynomials are asymptotically optimal polynomials in the sense of [Eiermann and Niethammer 1983]. |
|||||
2013 | Codebook Based Audio Feature Representation For Music Information Retrieval | Vaizman Yonatan, Mcfee Brian, Lanckriet Gert | Arxiv | Digital music has become prolific in the web in recent decades. Automated recommendation systems are essential for users to discover music they love and for artists to reach appropriate audience. When manual annotations and user preference data is lacking (e.g. for new artists) these systems must rely on content based methods. Besides powerful machine learning tools for classification and retrieval, a key component for successful recommendation is the audio content representation. Good representations should capture informative musical patterns in the audio signal of songs. These representations should be concise, to enable efficient (low storage, easy indexing, fast search) management of huge music repositories, and should also be easy and fast to compute, to enable real-time interaction with a user supplying new songs to the system. Before designing new audio features, we explore the usage of traditional local features, while adding a stage of encoding with a pre-computed codebook and a stage of pooling to get compact vectorial representations. We experiment with different encoding methods, namely the LASSO, vector quantization (VQ) and cosine similarity (CS). We evaluate the representations’ quality in two music information retrieval applications: query-by-tag and query-by-example. Our results show that concise representations can be used for successful performance in both applications. We recommend using top-\(\tau\) VQ encoding, which consistently performs well in both applications, and requires much less computation time than the LASSO. |
|||||
2013 | Beyond Locality-sensitive Hashing | Andoni Alexandr, Indyk Piotr, Nguyen Huy L., Razenshteyn Ilya | Arxiv | We present a new data structure for the \(c\)-approximate near neighbor problem (ANN) in the Euclidean space. For \(n\) points in \(\mathbb{R}^d\), our algorithm achieves \(O(n^{\rho} + d \log n)\) query time and \(O(n^{1 + \rho} + d \log n)\) space, where \(\rho \le 7/(8c^2) + O(1/c^3) + o(1)\). This is the first improvement over the result by Andoni and Indyk (FOCS 2006) and the first data structure that bypasses a locality-sensitive hashing lower bound proved by O’Donnell, Wu and Zhou (ICS 2011). By a standard reduction we obtain a data structure for the Hamming space and \(\ell_1\) norm with \(\rho \le 7/(8c) + O(1/c^{3/2}) + o(1)\), which is the first improvement over the result of Indyk and Motwani (STOC 1998). |
|||||
2013 | Functions With Diffusive Properties | Seraj Samer | Arxiv | While exploring desirable properties of hash functions in cryptography, the author was led to investigate three notions of functions with scattering or “diffusive” properties, where the functions map between binary strings of fixed finite length. These notions of diffusion ask for some property to be fulfilled by the Hamming distances between outputs corresponding to pairs of inputs that lie on the endpoints of edges of an \(n\)-dimensional hypercube. Given the dimension of the input space, we explicitly construct such functions for every dimension of the output space that allows for the functions to exist. |
|||||
2013 | An Efficient Index For Visual Search In Appearance-based SLAM | Hajebi Kiana, Zhang Hong | Arxiv | Vector-quantization can be a computationally expensive step in visual bag-of-words (BoW) search when the vocabulary is large. A BoW-based appearance SLAM needs to tackle this problem for an efficient real-time operation. We propose an effective method to speed up the vector-quantization process in BoW-based visual SLAM. We employ a graph-based nearest neighbor search (GNNS) algorithm to this aim, and experimentally show that it can outperform the state-of-the-art. The graph-based search structure used in GNNS can efficiently be integrated into the BoW model and the SLAM framework. The graph-based index, which is a k-NN graph, is built over the vocabulary words and can be extracted from the BoW’s vocabulary construction procedure, by adding one iteration to the k-means clustering, which adds small extra cost. Moreover, exploiting the fact that images acquired for appearance-based SLAM are sequential, GNNS search can be initiated judiciously which helps increase the speedup of the quantization process considerably. |
|||||
2013 | Sparse Similarity-preserving Hashing | Masci Jonathan, Bronstein Alex M., Bronstein Michael M., Sprechmann Pablo, Sapiro Guillermo | Arxiv | In recent years, a lot of attention has been devoted to efficient nearest neighbor search by means of similarity-preserving hashing. One of the plights of existing hashing techniques is the intrinsic trade-off between performance and computational complexity: while longer hash codes allow for lower false positive rates, it is very difficult to increase the embedding dimensionality without incurring very high false negative rates or prohibitive computational costs. In this paper, we propose a way to overcome this limitation by enforcing the hash codes to be sparse. Sparse high-dimensional codes enjoy the low false positive rates typical of long hashes, while keeping the false negative rates similar to those of a shorter dense hashing scheme with an equal number of degrees of freedom. We use a tailored feed-forward neural network for the hashing function. Extensive experimental evaluation involving visual and multi-modal data shows the benefits of the proposed method. |
|||||
2013 | Security Analysis Of Epsilon-almost Dual Universal2 Hash Functions Smoothing Of Min Entropy Vs. Smoothing Of Renyi Entropy Of Order 2 | Hayashi Masahito | IEEE Transactions on Information Theory Volume | Recently, \(\epsilon\)-almost dual universal\(_2\) hash functions have been proposed as a new and wider class of hash functions. Using this class of hash functions, several efficient hash functions were proposed. This paper evaluates the security performance when we apply this kind of hash function. We evaluate the security in several kinds of settings based on the \(L_1\) distinguishability criterion and the modified mutual information criterion. The obtained evaluation is based on smoothing of the Rényi entropy of order 2 and/or the min entropy. We clarify the difference between these two methods. |
|||||
2013 | Approximate Nearest Neighbor Search In ell_p | Nguyen Huy L. | Arxiv | We present a new locality sensitive hashing (LSH) algorithm for \(c\)-approximate nearest neighbor search in \(\ell_p\) with \(1<p<2\). For a database of \(n\) points in \(\ell_p\), we achieve \(O(dn^{\rho})\) query time and \(O(dn+n^{1+\rho})\) space, where \(\rho \le O((\ln c)^2/c^p)\). This improves upon the previous best upper bound \(\rho\le 1/c\) by Datar et al. (SOCG 2004), and is close to the lower bound \(\rho \ge 1/c^p\) by O’Donnell, Wu and Zhou (ITCS 2011). The proof is a simple generalization of the LSH scheme for \(ℓ₂\) by Andoni and Indyk (FOCS 2006). |
|||||
2013 | Stopping Rules For Bag-of-words Image Search And Its Application In Appearance-based Localization | Hajebi Kiana, Zhang Hong | Arxiv | We propose a technique to improve the search efficiency of the bag-of-words (BoW) method for image retrieval. We introduce a notion of difficulty for the image matching problems and propose methods that reduce the amount of computations required for the feature vector-quantization task in BoW by exploiting the fact that easier queries need less computational resources. Measuring the difficulty of a query and stopping the search accordingly is formulated as a stopping problem. We introduce stopping rules that terminate the image search depending on the difficulty of each query, thereby significantly reducing the computational cost. Our experimental results show the effectiveness of our approach when it is applied to appearance-based localization problem. |
|||||
2013 | On The K-independence Required By Linear Probing And Minwise Independence | Thorup Mikkel | Arxiv | We show that linear probing requires 5-independent hash functions for expected constant-time performance, matching an upper bound of [Pagh et al. STOC’07]. More precisely, we construct a 4-independent hash function yielding expected logarithmic search time. For \((1+\epsilon)\)-approximate minwise independence, we show that \(\Omega(\log 1/\epsilon)\)-independent hash functions are required, matching an upper bound of [Indyk, SODA’99]. We also show that the very fast 2-independent multiply-shift scheme of Dietzfelbinger [STACS’96] fails badly in both applications. |
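For reference, a small Python sketch of the 2-independent multiply-shift scheme mentioned above, as it is commonly described: pick random 2w-bit constants a and b and map a w-bit key x to the top l bits of (a*x + b) mod 2^(2w). The word size, table size, and seeds are illustrative.

```python
import random

W = 32                                    # keys are w-bit integers (assumption)
L = 16                                    # table has 2**L buckets (assumption)

def make_multiply_shift(rng):
    """2-independent multiply-shift sketch: random 2w-bit a, b; return the top
    l bits of (a*x + b) mod 2^(2w) as the bucket index."""
    a = rng.randrange(1 << (2 * W))
    b = rng.randrange(1 << (2 * W))
    mask = (1 << (2 * W)) - 1
    return lambda x: ((a * x + b) & mask) >> (2 * W - L)

rng = random.Random(7)
h = make_multiply_shift(rng)
print(h(12345), h(67890))                 # bucket indices in [0, 2**L)
```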
|||||
2013 | Fast Exact Search In Hamming Space With Multi-index Hashing | Norouzi Mohammad, Punjani Ali, Fleet David J. | Arxiv | There is growing interest in representing image data and feature descriptors using compact binary codes for fast near neighbor search. Although binary codes are motivated by their use as direct indices (addresses) into a hash table, codes longer than 32 bits are not being used as such, as it was thought to be ineffective. We introduce a rigorous way to build multiple hash tables on binary code substrings that enables exact k-nearest neighbor search in Hamming space. The approach is storage efficient and straightforward to implement. Theoretical analysis shows that the algorithm exhibits sub-linear run-time behavior for uniformly distributed codes. Empirical results show dramatic speedups over a linear scan baseline for datasets of up to one billion codes of 64, 128, or 256 bits. |
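A simplified Python sketch of the substring idea, restricted to the special case r < m: by the pigeonhole principle, any code within Hamming radius r of the query must agree with it exactly on at least one of the m disjoint substrings, so probing m hash tables with the query's substrings yields a complete candidate set to verify. The full algorithm also probes small Hamming balls around each substring; all names and parameters here are illustrative.

```python
from collections import defaultdict

def split(code, m=4, bits=64):
    """Split a bits-wide integer code into m disjoint substrings (low to high)."""
    step = bits // m
    return [(code >> (i * step)) & ((1 << step) - 1) for i in range(m)]

def build_tables(codes, m=4, bits=64):
    """One hash table per substring position, mapping substring -> code indices."""
    tables = [defaultdict(list) for _ in range(m)]
    for idx, c in enumerate(codes):
        for i, sub in enumerate(split(c, m, bits)):
            tables[i][sub].append(idx)
    return tables

def query(q, codes, tables, r=3, m=4, bits=64):
    """Exact r-neighbor search for r < m: gather candidates sharing a substring
    with the query, then verify the full Hamming distance."""
    cand = set()
    for i, sub in enumerate(split(q, m, bits)):
        cand.update(tables[i].get(sub, []))
    return [idx for idx in cand if bin(codes[idx] ^ q).count("1") <= r]

codes = [0x0123456789ABCDEF, 0x0123456789ABCDEE, 0xFEDCBA9876543210]
tables = build_tables(codes)
print(query(0x0123456789ABCDEF, codes, tables, r=3))   # indices 0 and 1 are within radius 3
```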
|||||
2013 | Kernelized Locality-sensitive Hashing For Semi-supervised Agglomerative Clustering | Xie Boyi, Zheng Shuheng | Arxiv | Large scale agglomerative clustering is hindered by computational burdens. We propose a novel scheme where exact inter-instance distance calculation is replaced by the Hamming distance between Kernelized Locality-Sensitive Hashing (KLSH) hashed values. This results in a method that drastically decreases computation time. Additionally, we take advantage of certain labeled data points via distance metric learning to achieve a competitive precision and recall comparing to K-Means but in much less computation time. |
|||||
2013 | Learning To Prune In Metric And Non-metric Spaces | Leonid Boytsov, Bilegsaikhan Naidan | Neural Information Processing Systems | Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-to-prune approaches: density estimation through sampling and “stretching” of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, the multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality. |
|||||
2013 | Collision And Preimage Resistance Of The Centera Content Address | Primmer Robert, D'halluin Carl | Arxiv | Centera uses cryptographic hash functions as a means of addressing stored objects, thus creating a new class of data storage referred to as CAS (content addressed storage). Such hashing serves the useful function of providing a means of uniquely identifying data and providing a global handle to that data, referred to as the Content Address or CA. However, such a model begs the question: how certain can one be that a given CA is indeed unique? In this paper we describe fundamental concepts of cryptographic hash functions, such as collision resistance, pre-image resistance, and second-preimage resistance. We then map these properties to the MD5 and SHA-256 hash algorithms, which are used to generate the Centera content address. Finally, we present a proof of the collision resistance of the Centera Content Address. |
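A minimal Python sketch of content addressing with SHA-256 (one of the two algorithms the paper discusses), using the standard library's hashlib: the digest of an object's bytes serves as its storage key, so identical content always maps to the same address. The store shown here is just an in-memory dictionary for illustration.

```python
import hashlib

def content_address(data: bytes) -> str:
    """Content address: a fixed-length digest of the object's bytes (SHA-256 here)."""
    return hashlib.sha256(data).hexdigest()

blob = b"hello, content-addressed storage"
store = {}
addr = content_address(blob)
store[addr] = blob                             # write: key the object by its digest
assert store[content_address(blob)] == blob    # read back via the same address
print(addr[:16], "...")
```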
|||||
2013 | Learning Hash Functions Using Column Generation | Li Xi, Lin Guosheng, Shen Chunhua, Hengel Anton Van Den, Dick Anthony | Arxiv | Fast nearest neighbor searching is becoming an increasingly important tool in solving many large-scale problems. Recently a number of approaches to learning data-dependent hash functions have been developed. In this work, we propose a column generation based method for learning data-dependent hash functions on the basis of proximity comparison information. Given a set of triplets that encode the pairwise proximity comparison information, our method learns hash functions that preserve the relative comparison relationships in the data as well as possible within the large-margin learning framework. The learning procedure is implemented using column generation and hence is named CGHash. At each iteration of the column generation procedure, the best hash function is selected. Unlike most other hashing methods, our method generalizes to new data points naturally; and has a training objective which is convex, thus ensuring that the global optimum can be identified. Experiments demonstrate that the proposed method learns compact binary codes and that its retrieval performance compares favorably with state-of-the-art methods when tested on a few benchmark datasets. |
|||||
2013 | Simple Compact And Robust Approximate String Dictionary | Chegrane Ibrahim, Belazzougui Djamal | Arxiv | This paper is concerned with practical implementations of approximate string dictionaries that allow edit errors. In this problem, we have as input a dictionary \(D\) of \(d\) strings of total length \(n\) over an alphabet of size \(\sigma\). Given a bound \(k\) and a pattern \(x\) of length \(m\), a query has to return all the strings of the dictionary which are at edit distance at most \(k\) from \(x\), where the edit distance between two strings \(x\) and \(y\) is defined as the minimum-cost sequence of edit operations that transform \(x\) into \(y\). The cost of a sequence of operations is defined as the sum of the costs of the operations involved in the sequence. In this paper, we assume that each of these operations has unit cost and consider only three operations: deletion of one character, insertion of one character and substitution of a character by another. We present a practical implementation of the data structure we recently proposed and which works only for one error. We extend the scheme to \(2\leq k<m\). Our implementation has many desirable properties: it has a very fast and space-efficient building algorithm. The dictionary data structure is compact and has fast and robust query time. Finally our data structure is simple to implement as it only uses basic techniques from the literature, mainly hashing (linear probing and hash signatures) and succinct data structures (bitvectors supporting rank queries). |
|||||
2013 | Which Space Partitioning Tree To Use For Search | Parikshit Ram, Alexander Gray | Neural Information Processing Systems | We consider the task of nearest-neighbor search with the class of binary-space-partitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearest-neighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve the search performance of a space-partitioning tree. |
|||||
2013 | Scalable Protein Sequence Similarity Search Using Locality-sensitive Hashing And Mapreduce | Sunarso Freddie, Venugopal Srikumar, Lauro Federico | Arxiv | Metagenomics is the study of environments through genetic sampling of their microbiota. Metagenomic studies produce large datasets that are estimated to grow at a faster rate than the available computational capacity. A key step in the study of metagenome data is sequence similarity searching, which is computationally intensive over large datasets. Tools such as BLAST require large dedicated computing infrastructure to perform such analysis and may not be available to every researcher. In this paper, we propose a novel approach called ScalLoPS that performs searching on protein sequence datasets using LSH (Locality-Sensitive Hashing) implemented with the MapReduce distributed framework. ScalLoPS is designed to scale across computing resources sourced from cloud computing providers. We present the design and implementation of ScalLoPS followed by evaluation with datasets derived from both traditional as well as metagenomic studies. Our experiments show that this method approximates the quality of BLAST results while improving the scalability of protein sequence search. |
|||||
2013 | Bottom-k And Priority Sampling Set Similarity And Subset Sums With Minimal Independence | Thorup Mikkel | Arxiv | We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the relative size f=|Y|/|X| of any subset Y as |S_k(X) intersect Y|/k. A standard application is the estimation of the Jaccard similarity f=|A intersect B|/|A union B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as S_k(A union B)=S_k(S_k(A) union S_k(B)), and then the similarity is estimated as |S_k(A union B) intersect S_k(A) intersect S_k(B)|/k. We show here that even if the hash function is only 2-independent, the expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of kxmin-wise where we use k independent hash functions h_1,…,h_k, storing the smallest element with each hash function. For kxmin-wise there is at least a constant bias with constant independence, and it is not reduced with larger k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simple union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling, which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied. |
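A small Python sketch of the Jaccard estimator described in the abstract: build the bottom-k samples of A and B, combine them into the bottom-k sample of the union, and count how many of its elements were sampled in both sets. The stand-in hash below is only for illustration and is not a 2-independent function; k is an illustrative choice.

```python
import random

def bottom_k(items, h, k):
    """Bottom-k sample: the k items with the smallest hash values under h."""
    return set(sorted(items, key=h)[:k])

def jaccard_estimate(A, B, h, k):
    """|S_k(A∪B) ∩ S_k(A) ∩ S_k(B)| / k, with S_k(A∪B) built from the two samples."""
    SA, SB = bottom_k(A, h, k), bottom_k(B, h, k)
    S_union = bottom_k(SA | SB, h, k)
    return len(S_union & SA & SB) / k

rng = random.Random(3)
salt = rng.randrange(1 << 30)
h = lambda x: hash((salt, x))            # stand-in hash, not 2-independent
A = set(range(0, 6000))
B = set(range(3000, 9000))
print(len(A & B) / len(A | B), jaccard_estimate(A, B, h, k=256))   # ~0.333 vs estimate
```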
|||||
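The bottom-k estimator summarized in the entry above is compact enough to sketch directly. A minimal Python illustration follows; the function names are ours, and a memoized random-valued function stands in for the 2-independent hash families the paper actually analyzes.

```python
import random

def bottom_k(items, h, k):
    """Return the k items with the smallest hash values under h."""
    return set(sorted(items, key=h)[:k])

def jaccard_estimate(A, B, h, k):
    """Estimate |A intersect B| / |A union B| from bottom-k samples of A and B."""
    SA, SB = bottom_k(A, h, k), bottom_k(B, h, k)
    S_union = bottom_k(SA | SB, h, k)   # equals the bottom-k sample of A union B
    return len(S_union & SA & SB) / k

# Toy run: a memoized random function plays the role of the hash function.
random.seed(0)
_table = {}
def h(x):
    if x not in _table:
        _table[x] = random.random()
    return _table[x]

A = set(range(0, 1500))
B = set(range(500, 2000))
print(jaccard_estimate(A, B, h, k=200))   # true Jaccard similarity is 0.5
```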
2013 | A Quantized Johnson Lindenstrauss Lemma The Finding Of Buffons Needle | Jacques Laurent | Arxiv | In 1733, Georges-Louis Leclerc, Comte de Buffon in France, laid the ground of geometric probability theory by defining an enlightening problem: What is the probability that a needle thrown randomly on a ground made of equispaced parallel strips lies on two of them? In this work, we show that the solution to this problem, and its generalization to \(N\) dimensions, allows us to discover a quantized form of the Johnson-Lindenstrauss (JL) Lemma, i.e., one that combines a linear dimensionality reduction procedure with a uniform quantization of precision \(\delta>0\). In particular, given a finite set \(\mathcal S \subset \mathbb R^N\) of \(S\) points and a distortion level \(\epsilon>0\), as soon as \(M > M_0 = O(\epsilon^{-2} \log S)\), we can (randomly) construct a mapping from \((\mathcal S, \ell_2)\) to \((\delta\mathbb Z^M, \ell_1)\) that approximately preserves the pairwise distances between the points of \(\mathcal S\). Interestingly, compared to the common JL Lemma, the mapping is quasi-isometric and we observe both an additive and a multiplicative distortion on the embedded distances. These two distortions, however, decay as \(O(\sqrt{(\log S)/M})\) when \(M\) increases. Moreover, for coarse quantization, i.e., for high \(\delta\) compared to the set radius, the distortion is mainly additive, while for small \(\delta\) we tend to a Lipschitz isometric embedding. Finally, we prove the existence of a “nearly” quasi-isometric embedding of \((\mathcal S, \ell_2)\) into \((\delta\mathbb Z^M, \ell_2)\). This one involves a non-linear distortion of the \(\ell_2\)-distance in \(\mathcal S\) that vanishes for distant points in this set. Noticeably, the additive distortion in this case decays more slowly, as \(O(\sqrt[4]{(\log S)/M})\). |
|||||
2013 | Fast Neighborhood Graph Search Using Cartesian Concatenation | Wang Jingdong, Wang Jing, Zeng Gang, Gan Rui, Li Shipeng, Guo Baining | Arxiv | In this paper, we propose a new data structure for approximate nearest neighbor search. This structure augments the neighborhood graph with a bridge graph. We propose to exploit Cartesian concatenation to produce a large set of vectors, called bridge vectors, from several small sets of subvectors. Each bridge vector is connected with a few reference vectors near to it, forming a bridge graph. Our approach finds nearest neighbors by simultaneously traversing the neighborhood graph and the bridge graph in the best-first strategy. The success of our approach stems from two factors: the exact nearest neighbor search over a large number of bridge vectors can be done quickly, and the reference vectors connected to a bridge (reference) vector near the query are also likely to be near the query. Experimental results on searching over large scale datasets (SIFT, GIST and HOG) show that our approach outperforms state-of-the-art ANN search algorithms in terms of efficiency and accuracy. The combination of our approach with the IVFADC system also shows superior performance over the BIGANN dataset of \(1\) billion SIFT features compared with the best previously published result. |
|||||
2012 | Properties Of Perfect Transitive Binary Codes Of Length 15 And Extended Perfect Transitive Binary Codes Of Length 16 | Guskov G. K., Solov'eva F. I. | Arxiv | Some properties of perfect transitive binary codes of length 15 and extended perfect transitive binary codes of length 16 are presented for reference purposes. |
|||||
2012 | An Efficient Cryptographic Hash Algorithm (BSA) | Mukherjee Subhabrata, Roy Bimal, Laha Anirban | In Proceedings of The | Recent cryptanalytic attacks have exposed the vulnerabilities of some widely used cryptographic hash functions like MD5 and SHA-1. Attacks in the line of differential attacks have been used to expose the weaknesses of several other hash functions like RIPEMD, HAVAL. In this paper we propose a new efficient hash algorithm that provides a near random hash output and overcomes some of the earlier weaknesses. Extensive simulations and comparisons with some existing hash functions have been done to prove the effectiveness of the BSA, which is an acronym for the name of the 3 authors. |
|||||
2012 | Compact Hyperplane Hashing With Bilinear Functions | Liu Wei Columbia University, Wang Jun Ibm T. J. Watson Research Center, Mu Yadong Columbia University, Kumar Sanjiv Google, Chang Shih-fu Columbia University | Arxiv | Hyperplane hashing aims at rapidly searching nearest points to a hyperplane, and has shown practical impact in scaling up active learning with SVMs. Unfortunately, the existing randomized methods need long hash codes to achieve reasonable search accuracy and thus suffer from reduced search speed and large memory overhead. To this end, this paper proposes a novel hyperplane hashing technique which yields compact hash codes. The key idea is the bilinear form of the proposed hash functions, which leads to higher collision probability than the existing hyperplane hash functions when using random projections. To further increase the performance, we propose a learning based framework in which the bilinear functions are directly learned from the data. This results in short yet discriminative codes, and also boosts the search performance over the random projection based solutions. Large-scale active learning experiments carried out on two datasets with up to one million samples demonstrate the overall superiority of the proposed approach. |
|||||
2012 | Density Sensitive Hashing | Lin Yue, Cai Deng, Li Cheng | Arxiv | Nearest neighbors search is a fundamental problem in various research fields like machine learning, data mining and pattern recognition. Recently, hashing-based approaches, e.g., Locality Sensitive Hashing (LSH), have proved effective for scalable high dimensional nearest neighbors search. Many hashing algorithms find their theoretical roots in random projection. Since these algorithms generate the hash tables (projections) randomly, a large number of hash tables (i.e., long codewords) are required in order to achieve both high precision and recall. To address this limitation, we propose a novel hashing algorithm called {\em Density Sensitive Hashing} (DSH) in this paper. DSH can be regarded as an extension of LSH. By exploring the geometric structure of the data, DSH avoids purely random projection selection and uses those projective functions which best agree with the distribution of the data. Extensive experimental results on real-world data sets have shown that the proposed method achieves better performance compared to the state-of-the-art hashing approaches. |
|||||
2012 | Super-bit Locality-sensitive Hashing | Jianqiu Ji, Jianmin Li, Shuicheng Yan, Bo Zhang, Qi Tian | Neural Information Processing Systems | Sign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic dimension reduction method which provides an unbiased estimate of angular similarity, yet suffers from the large variance of its estimation. In this work, we propose Super-Bit locality-sensitive hashing (SBLSH). It is easy to implement: it orthogonalizes the random projection vectors in batches, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, yet with a smaller variance when the angle to estimate is within \((0,\pi/2]\). Extensive experiments on real data validate that, given the same length of binary code, SBLSH can achieve significant mean squared error reduction in estimating pairwise angular similarity. Moreover, SBLSH shows superiority over SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments. |
|||||
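To make the batch-orthogonalization idea above concrete, here is a small sketch (our own construction, for illustration only): random projection directions are orthogonalized within batches via a QR decomposition, and each direction then contributes one sign bit exactly as in SRP-LSH, so the Hamming distance still estimates the angle.

```python
import numpy as np

def super_bit_projections(dim, n_bits, batch_size, rng):
    """Projection directions orthogonalized within batches of size batch_size (<= dim)."""
    blocks = []
    for start in range(0, n_bits, batch_size):
        b = min(batch_size, n_bits - start)
        G = rng.standard_normal((dim, b))
        Q, _ = np.linalg.qr(G)          # orthonormal columns within this batch
        blocks.append(Q)
    return np.hstack(blocks)            # dim x n_bits

def sign_bits(x, W):
    """One bit per projection direction, as in sign-random-projection LSH."""
    return (x @ W >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
W = super_bit_projections(dim=128, n_bits=64, batch_size=32, rng=rng)
x, y = rng.standard_normal(128), rng.standard_normal(128)
hx, hy = sign_bits(x, W), sign_bits(y, W)
# Fraction of differing bits times pi estimates the angle between x and y.
angle_est = np.count_nonzero(hx != hy) / W.shape[1] * np.pi
angle_true = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(angle_est, angle_true)
```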
2012 | Memoryless Near-collisions Revisited | Lamberger Mario, Teufl Elmar | Arxiv | In this paper we discuss the problem of generically finding near-collisions for cryptographic hash functions in a memoryless way. A common approach is to truncate several output bits of the hash function and to look for collisions of this modified function. In two recent papers, an enhancement to this approach was introduced which is based on classical cycle-finding techniques and covering codes. This paper investigates two aspects of the problem of memoryless near-collisions. Firstly, we give a full treatment of the trade-off between the number of truncated bits and the success probability of the truncation-based approach. Secondly, we demonstrate the limits of cycle-finding methods for finding near-collisions by showing that, as opposed to the collision case, a memoryless variant cannot match the query-complexity of the “memory-full” birthday-like near-collision finding method. |
|||||
2012 | Angular Quantization-based Binary Codes For Fast Similarity Search | Yunchao Gong, Sanjiv Kumar, Vishal Verma, Svetlana Lazebnik | Neural Information Processing Systems | This paper focuses on the problem of learning binary embeddings for efficient retrieval of high-dimensional non-negative data. Such data typically arises in a large number of vision and text applications where counts or frequencies are used as features. Also, cosine distance is commonly used as a measure of dissimilarity between such vectors. In this work, we introduce a novel spherical quantization scheme to generate binary embedding of such data and analyze its properties. The number of quantization landmarks in this scheme grows exponentially with data dimensionality resulting in low-distortion quantization. We propose a very efficient method for computing the binary embedding using such large number of landmarks. Further, a linear transformation is learned to minimize the quantization error by adapting the method to the input data resulting in improved embedding. Experiments on image and text retrieval applications show superior performance of the proposed method over other existing state-of-the-art methods. |
|||||
2012 | Locality-sensitive Hashing With Margin Based Feature Selection | Konoshima Makiko, Noma Yui | Arxiv | We propose a learning method with feature selection for Locality-Sensitive Hashing. Locality-Sensitive Hashing converts feature vectors into bit arrays. These bit arrays can be used to perform similarity searches and personal authentication. The proposed method uses bit arrays longer than those ultimately used for similarity and other searches, and selects by learning the bits that will be kept. We demonstrate that this method can effectively perform optimization for cases such as fingerprint images with a large number of labels and extremely little data sharing the same labels, and we verify that it is also effective for natural images, handwritten digits, and speech features. |
|||||
2012 | Multimodal Similarity-preserving Hashing | Masci Jonathan, Bronstein Michael M., Bronstein Alexander A., Schmidhuber Jürgen | Arxiv | We introduce an efficient computational framework for hashing data belonging to multiple modalities into a single representation space where they become mutually comparable. The proposed approach is based on a novel coupled siamese neural network architecture and allows unified treatment of intra- and inter-modality similarity learning. Unlike existing cross-modality similarity learning approaches, our hashing functions are not limited to binarized linear projections and can assume arbitrarily complex forms. We show experimentally that our method significantly outperforms state-of-the-art hashing approaches on multimedia retrieval tasks. |
|||||
2012 | Greedy Multiple Instance Learning Via Codebook Learning And Nearest Neighbor Voting | Chen Gang, Corso Jason | Arxiv | Multiple instance learning (MIL) has attracted great attention recently in the machine learning community. However, most MIL algorithms are very slow and cannot be applied to large datasets. In this paper, we propose a greedy strategy to speed up the multiple instance learning process. Our contribution is twofold. First, we propose a density ratio model, and show that maximizing a density ratio function gives a lower bound of the DD model under certain conditions. Second, we make use of a histogram ratio between positive bags and negative bags to represent the density ratio function and find codebooks separately for positive bags and negative bags by a greedy strategy. For testing, we use a nearest neighbor strategy to classify new bags. We test our method on both small benchmark datasets and the large TRECVID MED11 dataset. The experimental results show that our method yields comparable accuracy to the current state of the art, while being at least one order of magnitude faster. |
|||||
2012 | TH* Scalable Distributed Trie Hashing | Mohamed Aridj, Eddine Zegour Djamel | Arxiv | In today’s world of computers, dealing with huge amounts of data is not unusual. The need to distribute this data in order to increase its availability and the performance of accessing it is more urgent than ever. For these reasons it is necessary to develop scalable distributed data structures. In this paper we propose TH*, a distributed variant of the Trie Hashing data structure. First we propose Thsw, a new version of TH without Nil nodes in the digital tree (trie); this version is then adapted to a multicomputer environment. The simulation results reveal that TH* is scalable in the sense that it grows gracefully, one bucket at a time, to a large number of servers; TH* also offers good storage space utilization and high query efficiency, especially for ordering operations. |
|||||
2012 | Hamming Distance Metric Learning | Mohammad Norouzi, David J. Fleet, Russ R. Salakhutdinov | Neural Information Processing Systems | Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. |
|||||
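The retrieval step used above, exact kNN directly on binary codes, is simple enough to show; the sketch below uses random packed codes purely as stand-ins for learned codes and computes Hamming distances with XOR plus a bit count.

```python
import numpy as np

def hamming_distances(query, codes):
    """Hamming distances between one packed code and a database of packed codes (uint8 rows)."""
    return np.unpackbits(np.bitwise_xor(codes, query), axis=1).sum(axis=1)

def knn_binary(query, codes, k):
    d = hamming_distances(query, codes)
    order = np.argsort(d)[:k]
    return order, d[order]

rng = np.random.default_rng(0)
# 10,000 database items with 64-bit codes packed into 8 bytes each (learned codes would go here).
codes = rng.integers(0, 256, size=(10000, 8), dtype=np.uint8)
query = codes[42] ^ np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype=np.uint8)  # one bit away from item 42

idx, dist = knn_binary(query, codes, k=5)
print(idx, dist)   # item 42 should be returned first, at Hamming distance 1
```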
2012 | Topsig Topology Preserving Document Signatures | Geva Shlomo, De Vries Christopher M. | Arxiv | Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but these were not so far linked to general purpose, signature file based, search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We are able to demonstrate significant improvements in the performance of signature file based indexing and retrieval, performance that is comparable to that of state of the art inverted file based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings and from the theoretical perspective it positions the file signatures model in the class of Vector Space retrieval models. |
|||||
2012 | Balanced Allocations And Double Hashing | Mitzenmacher Michael | Arxiv | Double hashing has recently found more common usage in schemes that use multiple hash functions. In double hashing, for an item \(x\), one generates two hash values \(f(x)\) and \(g(x)\), and then uses combinations \((f(x) +k g(x)) \bmod n\) for \(k=0,1,2,…\) to generate multiple hash values from the initial two. We first perform an empirical study showing that, surprisingly, the performance difference between double hashing and fully random hashing appears negligible in the standard balanced allocation paradigm, where each item is placed in the least loaded of \(d\) choices, as well as several related variants. We then provide theoretical results that explain the behavior of double hashing in this context. |
|||||
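The combination rule quoted above translates directly into code. In the sketch below (illustrative only), two salted SHA-256 values stand in for the independent hash functions f and g, d probe locations are derived as (f(x) + k g(x)) mod n, and each item is placed in the least loaded of its d choices, which is the balanced-allocation setting the paper studies.

```python
import hashlib

def base_hashes(x: str, n: int):
    """Two base hash values from differently salted SHA-256 (stand-ins for independent hash functions)."""
    f = int(hashlib.sha256(b"f:" + x.encode()).hexdigest(), 16) % n
    g = int(hashlib.sha256(b"g:" + x.encode()).hexdigest(), 16) % n
    return f, g or 1          # keep g nonzero so the probe sequence is not degenerate

def double_hash_choices(x: str, n: int, d: int):
    """d probe locations (f(x) + k*g(x)) mod n for k = 0, ..., d-1."""
    f, g = base_hashes(x, n)
    return [(f + k * g) % n for k in range(d)]

def balanced_insert(items, n, d):
    """Place each item in the least loaded of its d double-hashing choices; return the max load."""
    load = [0] * n
    for x in items:
        choices = double_hash_choices(x, n, d)
        target = min(choices, key=lambda b: load[b])
        load[target] += 1
    return max(load)

print(balanced_insert([f"item{i}" for i in range(10000)], n=10000, d=2))
```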
2012 | B-bit Minwise Hashing In Practice Large-scale Batch And Online Learning And Using Gpus For Fast Preprocessing With Simple Hash Functions | Li Ping, Shrivastava Anshumali, Konig Arnd Christian | Arxiv | In this paper, we study several critical issues which must be tackled before one can apply b-bit minwise hashing to the volumes of data often encountered in industrial applications, especially in the context of search. |
|||||
2012 | Evaluation Of A Simple Scalable Parallel Best-first Search Strategy | Kishimoto Akihiro, Fukunaga Alex, Botea Adi | Artificial Intelligence | Large-scale, parallel clusters composed of commodity processors are increasingly available, enabling the use of vast processing capabilities and distributed RAM to solve hard search problems. We investigate Hash-Distributed A* (HDA*), a simple approach to parallel best-first search that asynchronously distributes and schedules work among processors based on a hash function of the search state. We use this approach to parallelize the A* algorithm in an optimal sequential version of the Fast Downward planner, as well as a 24-puzzle solver. The scaling behavior of HDA* is evaluated experimentally on a shared memory, multicore machine with 8 cores, a cluster of commodity machines using up to 64 cores, and large-scale high-performance clusters, using up to 2400 processors. We show that this approach scales well, allowing the effective utilization of large amounts of distributed memory to optimally solve problems which require terabytes of RAM. We also compare HDA* to Transposition-table Driven Scheduling (TDS), a hash-based parallelization of IDA*, and show that, in planning, HDA* significantly outperforms TDS. A simple hybrid which combines HDA* and TDS to exploit the strengths of both algorithms is proposed and evaluated. |
|||||
2012 | Isotropic Hashing | Weihao Kong, Wu-jun Li | Neural Information Processing Systems | Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. Although this viewpoint has been widely accepted by many researchers, it is still not verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances. |
|||||
2012 | On The Difficulty Of Nearest Neighbor Search | He Junfeng Columbia University, Kumar Sanjiv Google Research, Chang Shih-fu Columbia University | Arxiv | Fast approximate nearest neighbor (NN) search in large databases is becoming popular. Several powerful learning-based formulations have been proposed recently. However, not much attention has been paid to a more fundamental question: how difficult is (approximate) nearest neighbor search in a given data set? And which data properties affect the difficulty of nearest neighbor search and how? This paper introduces the first concrete measure, called Relative Contrast, that can be used to evaluate the influence of several crucial data characteristics such as dimensionality, sparsity, and database size simultaneously in arbitrary normed metric spaces. Moreover, we present a theoretical analysis to prove how the difficulty measure (relative contrast) determines/affects the complexity of Locality Sensitive Hashing, a popular approximate NN search method. Relative contrast also provides an explanation for a family of heuristic hashing algorithms with good practical performance based on PCA. Finally, we show that most of the previous work on measuring NN search meaningfulness/difficulty can be derived as special asymptotic cases, for dense vectors, of the proposed measure. |
|||||
2012 | An Approximate Coding-rate Versus Minimum Distance Formula For Binary Codes | Akhtman Yosef, Maunder Robert G., Hanzo Lajos | Arxiv | We devise an analytically simple as well as invertible approximate expression, which describes the relation between the minimum distance of a binary code and the corresponding maximum attainable code-rate. For example, for a rate-(1/4), length-256 binary code the best known bounds limit the attainable minimum distance to 65<d(n=256,k=64)<90, while our solution yields d(n=256,k=64)=74.4. The proposed formula attains the approximation accuracy within the rounding error for ~97% of (n,k) scenarios, where the exact value of the minimum distance is known. The results provided may be utilized for the analysis and design of efficient communication systems. |
|||||
2012 | Explicit And Efficient Hash Families Suffice For Cuckoo Hashing With A Stash | Aumüller Martin, Dietzfelbinger Martin, Woelfel Philipp | Arxiv | It is shown that for cuckoo hashing with a stash as proposed by Kirsch, Mitzenmacher, and Wieder (2008) families of very simple hash functions can be used, maintaining the favorable performance guarantees: with stash size \(s\) the probability of a rehash is \(O(1/n^{s+1})\), and the evaluation time is \(O(s)\). Instead of the full randomness needed for the analysis of Kirsch et al. and of Kutzelnigg (2010) (resp. \(\Theta(log n)\)-wise independence for standard cuckoo hashing) the new approach even works with 2-wise independent hash families as building blocks. Both construction and analysis build upon the work of Dietzfelbinger and Woelfel (2003). The analysis, which can also be applied to the fully random case, utilizes a graph counting argument and is much simpler than previous proofs. As a byproduct, an algorithm for simulating uniform hashing is obtained. While it requires about twice as much space as the most space efficient solutions, it is attractive because of its simple and direct structure. |
|||||
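A textbook sketch of cuckoo hashing with a stash may help place the result above; this is an illustrative toy (Python's built-in hash with per-instance salts stands in for the simple 2-wise independent families the paper shows to be sufficient), not the paper's construction.

```python
import random

class CuckooWithStash:
    """Two-table cuckoo hashing with a small stash (illustrative sketch)."""
    def __init__(self, n, stash_size=4, max_kicks=500, seed=0):
        self.n = n
        self.tables = [[None] * n, [None] * n]
        self.stash, self.stash_size = [], stash_size
        self.max_kicks = max_kicks
        self.rnd = random.Random(seed)
        self.salts = [self.rnd.getrandbits(64), self.rnd.getrandbits(64)]

    def _h(self, i, key):
        # Stand-in for a hash function drawn from a 2-wise independent family.
        return hash((self.salts[i], key)) % self.n

    def insert(self, key):
        for _ in range(self.max_kicks):
            for i in (0, 1):
                pos = self._h(i, key)
                if self.tables[i][pos] is None:
                    self.tables[i][pos] = key
                    return True
            # Both candidate cells are occupied: evict a random occupant and retry with it.
            i = self.rnd.randrange(2)
            pos = self._h(i, key)
            self.tables[i][pos], key = key, self.tables[i][pos]
        if len(self.stash) < self.stash_size:
            self.stash.append(key)      # the stash absorbs the rare failed insertion
            return True
        return False                    # a full implementation would rehash at this point

    def lookup(self, key):
        return any(self.tables[i][self._h(i, key)] == key for i in (0, 1)) or key in self.stash

t = CuckooWithStash(n=2000)
assert all(t.insert(f"key{i}") for i in range(1000))
assert t.lookup("key123") and not t.lookup("missing")
```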
2012 | One Permutation Hashing | Ping Li, Art Owen, Cun-hui Zhang | Neural Information Processing Systems | While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.,) \(k=500\) permutations on the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple \textbf{one permutation hashing} scheme to address this important issue. While it is true that the preprocessing step can be parallelized, it comes at the cost of additional hardware and implementation. Also, reducing \(k\) permutations to just one would be much more \textbf{energy-efficient}, which might be an important perspective as minwise hashing is commonly deployed in the search industry. While the theoretical probability analysis is interesting, our experiments on similarity estimation and SVM \& logistic regression also confirm the theoretical results. |
|||||
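The scheme above can be sketched in a few lines: permute the D binary coordinates once, split the permuted range into k bins, and keep the smallest permuted index of a nonzero in each bin. The estimator below (matching bins divided by bins that are non-empty on either side) follows the spirit of the paper's analysis; the names and parameters are ours.

```python
import numpy as np

def one_permutation_sketch(nonzeros, D, k, perm):
    """One permutation hashing sketch: k bins over the permuted coordinates (D divisible by k assumed);
    each bin stores the smallest permuted index of a nonzero coordinate, or None if the bin is empty."""
    bin_size = D // k
    sketch = [None] * k
    for idx in nonzeros:
        p = int(perm[idx])
        b = p // bin_size
        if sketch[b] is None or p < sketch[b]:
            sketch[b] = p
    return sketch

def estimate_resemblance(s1, s2):
    """Matching bins divided by the number of bins non-empty in at least one of the two sketches."""
    matches = sum(1 for a, b in zip(s1, s2) if a is not None and a == b)
    usable = sum(1 for a, b in zip(s1, s2) if a is not None or b is not None)
    return matches / max(usable, 1)

rng = np.random.default_rng(0)
D, k = 4096, 64
perm = rng.permutation(D)
A = set(rng.choice(D, size=800, replace=False).tolist())
B = (A - set(list(A)[:200])) | set(rng.choice(D, size=200, replace=False).tolist())
print(estimate_resemblance(one_permutation_sketch(A, D, k, perm),
                           one_permutation_sketch(B, D, k, perm)))
```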
2012 | Fast Exact Max-kernel Search | Curtin Ryan R., Ram Parikshit, Gray Alexander G. | Arxiv | The wide applicability of kernels makes the problem of max-kernel search ubiquitous and more general than the usual similarity search in metric spaces. We focus on solving this problem efficiently. We begin by characterizing the inherent hardness of the max-kernel search problem with a novel notion of directional concentration. Following that, we present a method to use an \(O(n \log n)\) algorithm to index any set of objects (points in \(\mathbb{R}^d\) or abstract objects) directly in the Hilbert space without any explicit feature representations of the objects in this space. We present the first provably \(O(\log n)\) algorithm for exact max-kernel search using this index. Empirical results for a variety of data sets as well as abstract objects demonstrate up to 4 orders of magnitude speedup in some cases. Extensions for approximate max-kernel search are also presented. |
|||||
2012 | Strongly Universal String Hashing Is Fast | Kaser Owen, Lemire Daniel | Computer Journal | We present fast strongly universal string hashing families: they can process data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that these families—though they require a large buffer of random numbers—are often faster than popular hash functions with weaker theoretical guarantees. Moreover, conventional wisdom is that hash functions with fewer multiplications are faster. Yet we find that they may fail to be faster due to operation pipelining. We present experimental results on several processors including low-powered processors. Our tests include hash functions designed for processors with the Carry-Less Multiplication (CLMUL) instruction set. We also prove, using accessible proofs, the strong universality of our families. |
|||||
2012 | Hyperplane Arrangements And Locality-sensitive Hashing With Lift | Konoshima Makiko, Noma Yui | Arxiv | Locality-sensitive hashing converts high-dimensional feature vectors, such as image and speech features, into bit arrays and allows high-speed similarity calculation with the Hamming distance. There is a hashing scheme that maps feature vectors to bit arrays depending on the signs of the inner products between feature vectors and the normal vectors of hyperplanes placed in the feature space. This hashing can be seen as a discretization of the feature space by hyperplanes. If labels for data are given, one can determine the hyperplanes by using learning algorithms. However, many proposed learning methods do not consider the hyperplanes’ offsets. Not doing so decreases the number of partitioned regions, and the correlation between Hamming distances and Euclidean distances becomes small. In this paper, we propose a lift map that converts learning algorithms without the offsets into ones that take the offsets into account. With this method, learning methods without the offsets give discretizations of the space as if the offsets were taken into account. For the proposed method, we input several high-dimensional feature data sets and studied the relationship between the statistical characteristics of the data, the number of hyperplanes, and the effect of the proposed method. |
|||||
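For reference, the underlying hashing scheme described above is just the sign of an affine function of the feature vector; the snippet below shows it with explicit offsets (set b to zero to recover origin-centred hyperplanes). The dimensions and random data are arbitrary illustrative choices.

```python
import numpy as np

def hyperplane_lsh(X, W, b):
    """One bit per hyperplane: 1 if <w, x> + b >= 0, else 0; nonzero offsets let hyperplanes miss the origin."""
    return (X @ W + b >= 0).astype(np.uint8)

def hamming(c1, c2):
    return int(np.count_nonzero(c1 != c2))

rng = np.random.default_rng(0)
d, n_bits = 64, 32
W = rng.standard_normal((d, n_bits))     # normal vectors of the hyperplanes
b = rng.standard_normal(n_bits)          # offsets (use b = 0 for the origin-centred scheme)
X = rng.standard_normal((5, d)) + 3.0    # data deliberately not centred at the origin

codes = hyperplane_lsh(X, W, b)
print(hamming(codes[0], codes[1]))
```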
2012 | Combined Descriptors In Spatial Pyramid Domain For Image Classification | Hu Junlin, Guo Ping | Arxiv | Recently spatial pyramid matching (SPM) with the scale invariant feature transform (SIFT) descriptor has been successfully used in image classification. Unfortunately, the codebook generation and feature quantization procedures using SIFT features have high complexity in both time and space. To address this problem, in this paper we propose an approach which combines local binary patterns (LBP) and three-patch local binary patterns (TPLBP) in the spatial pyramid domain. The proposed method does not need codebook learning or feature quantization processing, and hence is very efficient. Experiments on two popular benchmark datasets demonstrate that the proposed method consistently and significantly outperforms the very popular SPM-based SIFT descriptor method in both time and classification accuracy. |
|||||
2012 | Co-regularized Hashing For Multimodal Data | Yi Zhen, Dit-yan Yeung | Neural Information Processing Systems | Hashing-based methods provide a very promising approach to large-scale similarity search. To obtain compact hash codes, a recent trend seeks to learn the hash functions from data automatically. In this paper, we study hash function learning in the context of multimodal data. We propose a novel multimodal hash function learning method, called Co-Regularized Hashing (CRH), based on a boosted co-regularization framework. The hash functions for each bit of the hash codes are learned by solving DC (difference of convex functions) programs, while the learning for multiple bits proceeds via a boosting procedure so that the bias introduced by the hash functions can be sequentially minimized. We empirically compare CRH with two state-of-the-art multimodal hash function learning methods on two publicly available data sets. |
|||||
2012 | A New Hybrid Jpeg Image Compression Scheme Using Symbol Reduction Technique | Kumar Bheshaj, Thakur Kavita, Sinha G. R. | Arxiv | Lossy JPEG compression is a widely used compression technique. Normally the standard JPEG technique uses three processes: mapping, which reduces interpixel redundancy; quantization, which is a lossy process; and entropy encoding, which is considered a lossless process. In this paper, a new technique has been proposed by combining the JPEG algorithm and the Symbol Reduction Huffman technique for achieving a higher compression ratio. The symbol reduction technique reduces the number of symbols by combining them together to form new symbols. As a result, the number of Huffman codes to be generated is also reduced. It is simple, fast and easy to implement. The results show that the performance of the standard JPEG method can be improved by the proposed method. This hybrid approach achieves about 20% higher compression ratio than standard JPEG. |
|||||
2012 | Locally Linear Embedding Clustering Algorithm For Natural Imagery | Ziegelmeier Lori, Kirby Michael, Peterson Chris | Arxiv | The ability to characterize the color content of natural imagery is an important application of image processing. The pixel by pixel coloring of images may be viewed naturally as points in color space, and the inherent structure and distribution of these points affords a quantization, through clustering, of the color information in the image. In this paper, we present a novel topologically driven clustering algorithm that permits segmentation of the color features in a digital image. The algorithm blends Locally Linear Embedding (LLE) and vector quantization by mapping color information to a lower dimensional space, identifying distinct color regions, and classifying pixels together based on both a proximity measure and color content. It is observed that these techniques permit a significant reduction in color resolution while maintaining the visually important features of images. |
|||||
2011 | Similarity Join Size Estimation Using Locality Sensitive Hashing | Lee Hongrae University Of British Columbia, Ng Raymond T. University Of British Columbia, Shim Kyuseok Seoul National University | Proceedings of the VLDB Endowment | Similarity joins are important operations with a broad range of applications. In this paper, we study the problem of vector similarity join size estimation (VSJ). It is a generalization of the previously studied set similarity join size estimation (SSJ) problem and can handle more interesting cases such as TF-IDF vectors. One of the key challenges in similarity join size estimation is that the join size can change dramatically depending on the input similarity threshold. We propose a sampling based algorithm that uses the Locality-Sensitive-Hashing (LSH) scheme. The proposed algorithm LSH-SS uses an LSH index to enable effective sampling even at high thresholds. We compare the proposed technique with random sampling and the state-of-the-art technique for SSJ (adapted to VSJ) and demonstrate LSH-SS offers more accurate estimates at both high and low similarity thresholds and small variance using real-world data sets. |
|||||
2011 | Hash Function Based Secret Sharing Scheme Designs | Chum Chi Sing, Zhang Xiaowen | Arxiv | Secret sharing schemes create an effective method to safeguard a secret by dividing it among several participants. By using hash functions and the herding hashes technique, we first set up a (t+1, n) threshold scheme which is perfect and ideal, and then extend it to schemes for any general access structure. The schemes can be further set up as proactive or verifiable if necessary. The setup and recovery of the secret is efficient due to the fast calculation of the hash function. The proposed scheme is flexible because of the use of existing hash functions. |
|||||
2011 | Searching In One Billion Vectors Re-rank With Source Coding | Jégou Hervé Inria - Irisa, Tavenard Romain Inria - Irisa, Douze Matthijs Inria Rhône-alpes / Ljk Laboratoire Jean Kuntzmann, Sed, Amsaleg Laurent Inria - Irisa | Arxiv | Recent indexing techniques inspired by source coding have been shown successful to index billions of high-dimensional vectors in memory. In this paper, we propose an approach that re-ranks the neighbor hypotheses obtained by these compressed-domain indexing methods. In contrast to the usual post-verification scheme, which performs exact distance calculation on the short-list of hypotheses, the estimated distances are refined based on short quantization codes, to avoid reading the full vectors from disk. We have released a new public dataset of one billion 128-dimensional vectors and proposed an experimental setup to evaluate high dimensional indexing algorithms on a realistic scale. Experiments show that our method accurately and efficiently re-ranks the neighbor hypotheses using little memory compared to the full vectors representation. |
|||||
2011 | Accurate Estimators For Improving Minwise Hashing And B-bit Minwise Hashing | Li Ping, Konig Christian | Arxiv | Minwise hashing is the standard technique in the context of search and databases for efficiently estimating set (e.g., high-dimensional 0/1 vector) similarities. Recently, b-bit minwise hashing was proposed which significantly improves upon the original minwise hashing in practice by storing only the lowest b bits of each hashed value, as opposed to using 64 bits. b-bit hashing is particularly effective in applications which mainly concern sets of high similarities (e.g., the resemblance >0.5). However, there are other important applications in which not just pairs of high similarities matter. For example, many learning algorithms require all pairwise similarities and it is expected that only a small fraction of the pairs are similar. Furthermore, many applications care more about containment (e.g., how much one object is contained by another object) than the resemblance. In this paper, we show that the estimators for minwise hashing and b-bit minwise hashing used in the current practice can be systematically improved and the improvements are most significant for set pairs of low resemblance and high containment. |
|||||
2011 | Accelerating Nearest Neighbor Search On Manycore Systems | Cayton Lawrence | In Proceedings of the | We develop methods for accelerating metric similarity search that are effective on modern hardware. Our algorithms factor into easily parallelizable components, making them simple to deploy and efficient on multicore CPUs and GPUs. Despite the simple structure of our algorithms, their search performance is provably sublinear in the size of the database, with a factor dependent only on its intrinsic dimensionality. We demonstrate that our methods provide substantial speedups on a range of datasets and hardware platforms. In particular, we present results on a 48-core server machine, on graphics hardware, and on a multicore desktop. |
|||||
2011 | Bayesian Locality Sensitive Hashing For Fast Similarity Search | Satuluri Venu, Parthasarathy Srinivasan | PVLDB | Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a certain user-specified threshold. Locality-sensitive hashing (LSH) based methods have become a very popular approach for this problem. However, most such methods only use LSH for the first phase of similarity search - i.e. efficient indexing for candidate generation. In this paper, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search - performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. BayesLSH is able to quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, both in terms of accuracy and recall. Finally, the quality of BayesLSH’s output can be easily tuned and does not require any manual setting of the number of hashes to use for similarity estimation, unlike standard approaches. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range 2x-20x for a wide variety of datasets. |
|||||
2011 | Exploiting Spatial Overlap To Efficiently Compute Appearance Distances Between Image Windows | Bogdan Alexe, Viviana Petrescu, Vittorio Ferrari | Neural Information Processing Systems | We present a computationally efficient technique to compute the distance of high-dimensional appearance descriptor vectors between image windows. The method exploits the relation between appearance distance and spatial overlap. We derive an upper bound on appearance distance given the spatial overlap of two windows in an image, and use it to bound the distances of many pairs between two images. We propose algorithms that build on these basic operations to efficiently solve tasks relevant to many computer vision applications, such as finding all pairs of windows between two images with distance smaller than a threshold, or finding the single pair with the smallest distance. In experiments on the PASCAL VOC 07 dataset, our algorithms accurately solve these problems while greatly reducing the number of appearance distances computed, and achieve larger speedups than approximate nearest neighbour algorithms based on trees [18] and on hashing [21]. For example, our algorithm finds the most similar pair of windows between two images while computing only 1% of all distances on average. |
|||||
2011 | A Linear-time Approximation Of The Earth Movers Distance | Jang Min-hee, Kim Sang-wook, Faloutsos Christos, Park Sunju | Arxiv | Color descriptors are one of the important features used in content-based image retrieval. The Dominant Color Descriptor (DCD) represents a few perceptually dominant colors in an image through color quantization. For image retrieval based on DCD, the earth mover’s distance and the optimal color composition distance have been proposed to measure the dissimilarity between two images. Although providing good retrieval results, both methods are too time-consuming to be used in a large image database. To solve the problem, we propose a new distance function that calculates an approximate earth mover’s distance in linear time. To calculate the dissimilarity in linear time, the proposed approach employs a space-filling curve over the multidimensional color space. To improve the accuracy, the proposed approach uses multiple curves and adjusts the color positions. As a result, our approach achieves an order-of-magnitude time improvement while incurring only small errors. We have performed extensive experiments to show the effectiveness and efficiency of the proposed approach. The results reveal that our approach achieves almost the same results as the EMD in linear time. |
|||||
2011 | Fast Linear Time M-adic Hierarchical Clustering For Search And Retrieval Using The Baire Metric With Linkages To Generalized Ultrametrics Hashing Formal Concept Analysis And Precision Of Data Measurement | Murtagh Fionn, Contreras Pedro | P-Adic Numbers Ultrametric Analysis and Applications | We describe many vantage points on the Baire metric and its use in clustering data, or its use in preprocessing and structuring data in order to support search and retrieval operations. In some cases, we proceed directly to clusters and do not directly determine the distances. We show how a hierarchical clustering can be read directly from one pass through the data. We offer insights also on practical implications of precision of data measurement. As a mechanism for treating multidimensional data, including very high dimensional data, we use random projections. |
|||||
2011 | B-bit Minwise Hashing For Large-scale Linear SVM | Li Ping, Moore Joshua, Konig Christian | Arxiv | In this paper, we propose to (seamlessly) integrate b-bit minwise hashing with linear SVM to substantially improve the training (and testing) efficiency using much smaller memory, with essentially no loss of accuracy. Theoretically, we prove that the resemblance matrix, the minwise hashing matrix, and the b-bit minwise hashing matrix are all positive definite matrices (kernels). Interestingly, our proof for the positive definiteness of the b-bit minwise hashing kernel naturally suggests a simple strategy to integrate b-bit hashing with linear SVM. Our technique is particularly useful when the data can not fit in memory, which is an increasingly critical issue in large-scale machine learning. Our preliminary experimental results on a publicly available webspam dataset (350K samples and 16 million dimensions) verified the effectiveness of our algorithm. For example, the training time was reduced to merely a few seconds. In addition, our technique can be easily extended to many other linear and nonlinear machine learning applications such as logistic regression. |
|||||
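A common way to make the integration above concrete is to expand each of the m stored b-bit hash values into a one-hot block of length 2^b, producing a sparse binary vector of length m·2^b that a linear SVM or logistic regression can consume. The sketch below follows that idea with explicit permutations for the minwise step (real systems would simulate them with universal hashing); all sizes are illustrative.

```python
import numpy as np

def minwise_hashes(nonzeros, perms):
    """Smallest permuted index of a nonzero coordinate, for each of the m permutations."""
    return [min(int(p[i]) for i in nonzeros) for p in perms]

def b_bit_feature_vector(hashes, b):
    """Keep the lowest b bits of each hash and expand it into a one-hot block of length 2**b."""
    m, width = len(hashes), 2 ** b
    x = np.zeros(m * width)
    for j, h in enumerate(hashes):
        x[j * width + (h & (width - 1))] = 1.0
    return x

rng = np.random.default_rng(0)
D, m, b = 10000, 128, 2
perms = [rng.permutation(D) for _ in range(m)]

doc = {3, 17, 256, 999, 5000}   # a toy binary data point given as its set of nonzero coordinates
features = b_bit_feature_vector(minwise_hashes(doc, perms), b)
print(features.shape)            # (m * 2**b,), ready to feed to a linear SVM or logistic regression
```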
2011 | Effective Protocols For Low-distance File Synchronization | Chuklin Aleksandr | Arxiv | Suppose that we have two similar files stored on different computers. We need to send the file from the first computer to the second one while trying to minimize the number of bits transmitted. This article presents a survey of results known for this communication complexity problem in the case when the files are “similar” in the sense of Hamming distance. We mainly systematize earlier results obtained by various authors in the 1990s and 2000s and discuss their connection with coding theory, hashing algorithms and other domains of computer science. In particular cases we propose improvements of previous constructions. |
|||||
2011 | Descriptor Learning For Omnidirectional Image Matching | Masci Jonathan, Migliore Davide, Bronstein Michael M., Schmidhuber Jürgen | Arxiv | Feature matching in omnidirectional vision systems is a challenging problem, mainly because complicated optical systems make the theoretical modelling of invariance and construction of invariant feature descriptors hard or even impossible. In this paper, we propose learning invariant descriptors using a training set of similar and dissimilar descriptor pairs. We use the similarity-preserving hashing framework, in which we are trying to map the descriptor data to the Hamming space preserving the descriptor similarity on the training set. A neural network is used to solve the underlying optimization problem. Our approach outperforms not only straightforward descriptor matching, but also state-of-the-art similarity-preserving hashing methods. |
|||||
2011 | A Fast Nearest Neighbor Search Algorithm Based On Vector Quantization | Corlay Sylvain Lpma | Arxiv | In this article, we propose a new fast nearest neighbor search algorithm based on vector quantization. Like many other branch and bound search algorithms [1,10], a preprocessing step recursively partitions the data set into disjoint subsets until the number of points in each part is small enough. In doing so, a search-tree data structure is built. This preliminary recursive data-set partition is based on the vector quantization of the empirical distribution of the initial data set. Unlike the previously cited methods, this kind of partition does not a priori allow several sibling nodes in the search tree to be eliminated with a single test. To overcome this difficulty, we propose an algorithm that reduces the number of tested sibling nodes to a minimal list that we call “friend Voronoi cells”. The complete description of the method requires a deeper insight into the properties of Delaunay triangulations and Voronoi diagrams. |
|||||
2011 | Pattern Matching In Lempel-ziv Compressed Strings Fast Simple And Deterministic | Gawrychowski Pawel | Arxiv | Countless variants of the Lempel-Ziv compression are widely used in many real-life applications. This paper is concerned with a natural modification of the classical pattern matching problem inspired by the popularity of such compression methods: given an uncompressed pattern s[1..m] and a Lempel-Ziv representation of a string t[1..N], does s occur in t? Farach and Thorup gave a randomized O(nlog^2(N/n)+m) time solution for this problem, where n is the size of the compressed representation of t. We improve their result by developing a faster and fully deterministic O(nlog(N/n)+m) time algorithm with the same space complexity. Note that for highly compressible texts, log(N/n) might be of order n, so for such inputs the improvement is very significant. A (tiny) fragment of our method can be used to give an asymptotically optimal solution for the substring hashing problem considered by Farach and Muthukrishnan. |
|||||
2011 | Kernel Diff-hash | Bronstein Michael M | Arxiv | This paper presents a kernel formulation of the recently introduced diff-hash algorithm for the construction of similarity-sensitive hash functions. Our kernel diff-hash algorithm shows superior performance on the problem of image feature descriptor matching. |
|||||
2011 | Hashing Algorithms For Large-scale Learning | Ping Li, Anshumali Shrivastava, Joshua Moore, Arnd König | Neural Information Processing Systems | Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare \(b\)-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data. |
|||||
2011 | Improving The Performance Of K-means For Color Quantization | Celebi M. Emre | Image and Vision Computing | Color quantization is an important operation with many applications in graphics and image processing. Most quantization methods are essentially based on data clustering algorithms. However, despite its popularity as a general purpose clustering algorithm, k-means has not received much respect in the color quantization literature because of its high computational requirements and sensitivity to initialization. In this paper, we investigate the performance of k-means as a color quantizer. We implement fast and exact variants of k-means with several initialization schemes and then compare the resulting quantizers to some of the most popular quantizers in the literature. Experiments on a diverse set of images demonstrate that an efficient implementation of k-means with an appropriate initialization strategy can in fact serve as a very effective color quantizer. |
|||||
2011 | Independence Of Tabulation-based Hash Classes | Klassen Toryn Qwyllyn, Woelfel Philipp | Arxiv | A tabulation-based hash function maps a key into d derived characters indexing random values in tables that are then combined with bitwise xor operations to give the hash. Thorup and Zhang (2004) presented d-wise independent tabulation-based hash classes that use linear maps over finite fields to map a key, considered as a vector (a,b), to derived characters. We show that a variant where the derived characters are a+b*i for i=0,…,q-1 (using integer arithmetic) yields (2d-1)-wise independence. Our analysis is based on an algebraic property that characterizes k-wise independence of tabulation-based hashing schemes, and combines this characterization with a geometric argument. We also prove a non-trivial lower bound on the number of derived characters necessary for k-wise independence with our and related hash classes. |
|||||
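For background on the entry above, here is plain simple tabulation hashing, the scheme such variants build on: split the key into characters, look each character up in its own table of random words, and XOR the results. The specific derived-character construction analyzed in the paper is not reproduced here.

```python
import random

class SimpleTabulationHash:
    """Simple tabulation hashing for 32-bit keys: 4 byte-indexed tables of random 32-bit words, XORed together."""
    def __init__(self, seed=0):
        rnd = random.Random(seed)
        self.tables = [[rnd.getrandbits(32) for _ in range(256)] for _ in range(4)]

    def __call__(self, key: int) -> int:
        h = 0
        for i in range(4):
            byte = (key >> (8 * i)) & 0xFF   # the i-th derived character of the key
            h ^= self.tables[i][byte]
        return h

h = SimpleTabulationHash()
print(hex(h(123456789)), hex(h(123456790)))
```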
2011 | Anti-sparse Coding For Approximate Nearest Neighbor Search | Jégou Hervé Inria - Irisa, Furon Teddy Inria - Irisa, Fuchs Jean-jacques Inria - Irisa | Arxiv | This paper proposes a binarization scheme for vectors of high dimension based on the recent concept of anti-sparse coding, and shows its excellent performance for approximate nearest neighbor search. Unlike other binarization schemes, this framework allows, up to a scaling factor, the explicit reconstruction from the binary representation of the original vector. The paper also shows that random projections which are used in Locality Sensitive Hashing algorithms, are significantly outperformed by regular frames for both synthetic and real data if the number of bits exceeds the vector dimensionality, i.e., when high precision is required. |
|||||
2011 | A Cuckoo Hashing Variant With Improved Memory Utilization And Insertion Time | Porat Ely, Shalem Bar | Arxiv | Cuckoo hashing [4] is a multiple choice hashing scheme in which each item can be placed in multiple locations, and collisions are resolved by moving items to their alternative locations. In the classical implementation of two-way cuckoo hashing, the memory is partitioned into contiguous disjoint fixed-size buckets. Each item is hashed to two buckets, and may be stored in any of the positions within those buckets. Ref. [2] analyzed a variation in which the buckets are contiguous and overlap. However, many systems retrieve data from secondary storage in same-size blocks called pages. Fetching a page is a relatively expensive process; but once a page is fetched, its contents can be accessed orders of magnitude faster. We utilize this property of memory retrieval, presenting a variant of cuckoo hashing incorporating the following constraint: each bucket must be fully contained in a single page, but buckets are not necessarily contiguous. Empirical results show that this modification increases memory utilization and decreases the number of iterations required to insert an item. If each item is hashed to two buckets of capacity two, the page size is 8, and each bucket is fully contained in a single page, the memory utilization equals 89.71% in the classical contiguous disjoint bucket variant, 93.78% in the contiguous overlapping bucket variant, and increases to 97.46% in our new non-contiguous bucket variant. When the memory utilization is 92% and we use breadth first search to look for a vacant position, the number of iterations required to insert a new item is dramatically reduced from 545 in the contiguous overlapping buckets variant to 52 in our new non-contiguous bucket variant. In addition to the empirical results, we present a theoretical lower bound on the memory utilization of our variation as a function of the page size. |
|||||
2011 | Learning To Search Efficiently In High Dimensions | Zhen Li, Huazhong Ning, Liangliang Cao, Tong Zhang, Yihong Gong, Thomas S. Huang | Neural Information Processing Systems | High dimensional similarity search in large scale databases has become an important challenge with the advent of the Internet. For such applications, specialized data structures are required to achieve computational efficiency. Traditional approaches relied on algorithmic constructions that are often data independent (such as Locality Sensitive Hashing) or weakly dependent (such as kd-trees, k-means trees). While supervised learning algorithms have been applied to related problems, those proposed in the literature mainly focused on learning hash codes optimized for compact embedding of the data rather than search efficiency. Consequently such an embedding has to be used with linear scan or another search algorithm. Hence learning to hash does not directly address the search efficiency issue. This paper considers a new framework that applies supervised learning to directly optimize a data structure that supports efficient large scale search. Our approach takes both search quality and computational cost into consideration. Specifically, we learn a boosted search forest that is optimized using pair-wise similarity labeled examples. The output of this search forest can be efficiently converted into an inverted indexing data structure, which can leverage modern text search infrastructure to achieve both scalability and efficiency. Experimental results show that our approach significantly outperforms state-of-the-art learning-to-hash methods (such as spectral hashing), as well as state-of-the-art high dimensional search algorithms (such as LSH and k-means trees). |
|||||
2011 | Fast Compressed Tries Through Path Decompositions | Grossi Roberto, Ottaviano Giuseppe | Arxiv | Tries are popular data structures for storing a set of strings, where common prefixes are represented by common root-to-node paths. Over fifty years of usage have produced many variants and implementations to overcome some of their limitations. We explore new succinct representations of path-decomposed tries and experimentally evaluate the corresponding reduction in space usage and memory latency, comparing with the state of the art. We study two cases of applications: (1) a compressed dictionary for (compressed) strings, and (2) a monotone minimal perfect hash for strings that preserves their lexicographic order. For (1), we obtain data structures that outperform other state-of-the-art compressed dictionaries in space efficiency, while obtaining predictable query times that are competitive with data structures preferred by the practitioners. In (2), our tries perform several times faster than other trie-based monotone perfect hash functions, while occupying nearly the same space. |
|||||
2011 | Strict Authentication Watermarking With JPEG Compression (SAW-JPEG) For Medical Images | Zain Jasni Mohamad | European Journal of Scientific Research Volume | This paper proposes a strict authentication watermarking scheme for medical images. In this scheme, we define the region of interest (ROI) by taking the smallest rectangle around an image. The watermark is generated by hashing the area of interest. The embedding region is chosen to be outside the region of interest so as to preserve that area from distortion resulting from watermarking. The strict authentication watermarking is robust to some degree of JPEG compression (SAW-JPEG). JPEG compression is reviewed. To embed a watermark in the spatial domain, we have to make sure that the embedded watermark will survive the JPEG quantization process. The watermarking scheme, including the data embedding, extraction and verification procedures, is presented. Experimental results show that such a scheme can embed and extract the watermark at a high compression rate. The watermark is robust to a high compression rate of up to 90.6%. The JPEG image quality threshold is 60 for least significant bit embedding. The image quality threshold increases to 61 for 2nd and 3rd LSB manipulations. |
|||||
2011 | Training Logistic Regression And SVM On 200GB Data Using B-bit Minwise Hashing And Comparisons With Vowpal Wabbit (VW) | Li Ping, Shrivastava Anshumali, Konig Christian | Arxiv | We generated a dataset of 200 GB with 10^9 features, to test our recent b-bit minwise hashing algorithms for training very large-scale logistic regression and SVM. The results confirm our prior work that, compared with the VW hashing algorithm (which has the same variance as random projections), b-bit minwise hashing is substantially more accurate at the same storage. For example, with merely 30 hashed values per data point, b-bit minwise hashing can achieve similar accuracies as VW with 2^14 hashed values per data point. We demonstrate that the preprocessing cost of b-bit minwise hashing is roughly on the same order of magnitude as the data loading time. Furthermore, by using a GPU, the preprocessing cost can be reduced to a small fraction of the data loading time. Minwise hashing has been widely used in industry, at least in the context of search. One reason for its popularity is that one can efficiently simulate permutations by (e.g.,) universal hashing. In other words, there is no need to store the permutation matrix. In this paper, we empirically verify this practice, by demonstrating that even using the simplest 2-universal hashing does not degrade the learning performance. |
|||||
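To make the learning pipeline in the entry above concrete, here is a minimal Python sketch of how k b-bit minwise hashes of a token set can be expanded into a sparse binary vector for a linear SVM or logistic regression. The 2-universal hash standing in for random permutations, the parameter choices (k=30, b=2), and the use of Python's salted built-in hash() for tokens are illustrative assumptions, not the authors' implementation.
```python
# Illustrative sketch (not the authors' code): expand k b-bit minwise hashes of a
# token set into a sparse binary vector that a linear SVM / logistic regression
# can consume. A 2-universal hash stands in for true random permutations, and
# Python's salted built-in hash() maps tokens to integers.
import random

P = (1 << 61) - 1   # Mersenne prime used by the toy 2-universal hashes

def bbit_minhash(tokens, k=30, b=2, seed=0):
    """Return k minwise hash values of a token set, keeping only the lowest b bits."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]
    vals = [hash(t) % P for t in tokens]
    codes = []
    for a, c in coeffs:
        m = min((a * v + c) % P for v in vals)   # minimum under the j-th "permutation"
        codes.append(m & ((1 << b) - 1))         # store only the lowest b bits
    return codes

def expand(codes, b=2):
    """One-hot expand k b-bit codes into a length k * 2**b binary feature vector."""
    x = [0.0] * (len(codes) * (1 << b))
    for i, code in enumerate(codes):
        x[i * (1 << b) + code] = 1.0
    return x

codes = bbit_minhash({"the", "quick", "brown", "fox"})
print(len(expand(codes)))   # 120 = 30 hashes * 2**2 bins per hash
```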
2011 | Introduction To The Bag Of Features Paradigm For Image Classification And Retrieval | O'hara Stephen, Draper Bruce A. | Arxiv | The past decade has seen the growing popularity of Bag of Features (BoF) approaches to many computer vision tasks, including image classification, video search, robot localization, and texture recognition. Part of the appeal is simplicity. BoF methods are based on orderless collections of quantized local image descriptors; they discard spatial information and are therefore conceptually and computationally simpler than many alternative methods. Despite this, or perhaps because of this, BoF-based systems have set new performance standards on popular image classification benchmarks and have achieved scalability breakthroughs in image retrieval. This paper presents an introduction to BoF image representations, describes critical design choices, and surveys the BoF literature. Emphasis is placed on recent techniques that mitigate quantization errors, improve feature detection, and speed up image retrieval. At the same time, unresolved issues and fundamental challenges are raised. Among the unresolved issues are determining the best techniques for sampling images, describing local image features, and evaluating system performance. Among the more fundamental challenges are how and whether BoF methods can contribute to localizing objects in complex images, or to associating high-level semantics with natural images. This survey should be useful both for introducing new investigators to the field and for providing existing researchers with a consolidated reference to related work. |
|||||
2011 | Collision-resistant Hash Function Based On Composition Of Functions | Ndoundam Rene, Sadie Juvet Karnel | ARIMA Vol. | A cryptographic hash function is a deterministic procedure that compresses an arbitrary block of numerical data and returns a fixed-size bit string. There exist many hash functions: MD5, HAVAL, SHA, … It has been reported that these hash functions are no longer secure. Our work focuses on the construction of a new hash function based on composition of functions. The construction uses the NP-completeness of three-dimensional contingency tables and the relaxation of the constraint that a hash function should also be a compression function. |
|||||
2011 | Multimodal Diff-hash | Bronstein Michael M. | Arxiv | Many applications require comparing multimodal data with different structure and dimensionality that cannot be compared directly. Recently, there has been increasing interest in methods for learning and efficiently representing such multimodal similarity. In this paper, we present a simple algorithm for multimodal similarity-preserving hashing, trying to map multimodal data into the Hamming space while preserving the intra- and inter-modal similarities. We show that our method significantly outperforms the state-of-the-art method in the field. |
|||||
2011 | Design Of Image Cryptosystem By Simultaneous Vq-compression And Shuffling Of Codebook And Index Matrix | Pal Arup Kumar, Biswas G. P., Mukhopadhyay S. | The International journal of Multimedia Its Applications | Although the popularity of Internet usage increases exponentially, the Internet itself is incapable of providing security for the exchange of confidential data between users. As a result, several cryptosystems for the encryption of data and images have been developed for secure transmission over the Internet. In this work, a scheme for image encryption/decryption based on Vector Quantization (VQ) has been proposed that concurrently encodes the images for compression and shuffles the codebook and the index matrix using pseudorandom sequences for encryption. The processing time of the proposed scheme is much less than that of other cryptosystems, because it does not use any traditional cryptographic operations; instead it performs swapping between the contents of the codebook with respect to a random sequence, which results in an indirect shuffling of the contents of the index matrix. It may be noted that the security of the proposed cryptosystem depends on the generation and the exchange of the random sequences used. Since the generation of truly random sequences is not practically feasible, we simulate the proposed scheme using MATLAB, where operators such as rand(method, seed) and randperm(n) have been used to generate pseudorandom sequences; it has been observed that the proposed cryptosystem shows the expected performance. |
|||||
2010 | Reverse Nearest Neighbors Search In High Dimensions Using Locality-sensitive Hashing | Arthur David, Oudot Steve Y. | Arxiv | We investigate the problem of finding reverse nearest neighbors efficiently. Although provably good solutions exist for this problem in low or fixed dimensions, to this date the methods proposed in high dimensions are mostly heuristic. We introduce a method that is both provably correct and efficient in all dimensions, based on a reduction of the problem to one instance of \(\epsilon\)-nearest neighbor search plus a controlled number of instances of {\em exhaustive \(r\)-PLEB}, a variant of {\em Point Location among Equal Balls} where all the \(r\)-balls centered at the data points that contain the query point are sought for, not just one. The former problem has been extensively studied and elegantly solved in high dimensions using Locality-Sensitive Hashing (LSH) techniques. By contrast, the latter problem has a complexity that is still not fully understood. We revisit the analysis of the LSH scheme for exhaustive \(r\)-PLEB using a somewhat refined notion of locality-sensitive family of hash functions, which brings out a meaningful output-sensitive term in the complexity of the problem. Our analysis, combined with a non-isometric lifting of the data, enables us to answer exhaustive \(r\)-PLEB queries (and down the road reverse nearest neighbors queries) efficiently. Along the way, we obtain a simple algorithm for answering exact nearest neighbor queries, whose complexity is parametrized by some {\em condition number} measuring the inherent difficulty of a given instance of the problem. |
|||||
2010 | A Sparse Johnson--lindenstrauss Transform | Dasgupta Anirban, Kumar Ravi, Sarlós Tamás | Arxiv | Dimension reduction is a key algorithmic tool with many applications including nearest-neighbor search, compressed sensing and linear algebra in the streaming model. In this work we obtain a {\em sparse} version of the fundamental tool in dimension reduction — the Johnson–Lindenstrauss transform. Using hashing and local densification, we construct a sparse projection matrix with just \(\tilde{O}(\frac{1}{\epsilon})\) non-zero entries per column. We also show a matching lower bound on the sparsity for a large class of projection matrices. Our bounds are somewhat surprising, given the known lower bounds of \(\Omega(\frac{1}{\epsilon^2})\) both on the number of rows of any projection matrix and on the sparsity of projection matrices generated by natural constructions. Using this, we achieve an \(\tilde{O}(\frac{1}{\epsilon})\) update time per non-zero element for a \((1\pm\epsilon)\)-approximate projection, thereby substantially outperforming the \(\tilde{O}(\frac{1}{\epsilon^2})\) update time required by prior approaches. A variant of our method offers the same guarantees for sparse vectors, yet its \(\tilde{O}(d)\) worst case running time matches the best approach of Ailon and Liberty. |
|||||
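The following is a minimal sketch of a generic sparse random projection in the spirit of the entry above: each input coordinate touches only s output coordinates with random signs. It is not the paper's exact construction; s, k, and the +-1/sqrt(s) scaling are illustrative choices.
```python
# Generic sparse random projection in the spirit of sparse JL (not the paper's
# exact construction): each input coordinate is mapped to only s output
# coordinates, with random signs scaled by 1/sqrt(s).
import numpy as np

def sparse_jl_project(X, k, s=4, seed=0):
    """Project the rows of X (n x d) down to k dimensions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Y = np.zeros((n, k))
    for j in range(d):
        rows = rng.choice(k, size=s, replace=False)     # s non-zeros in column j
        signs = rng.choice([-1.0, 1.0], size=s)
        # A sparse input therefore costs O(s) work per non-zero coordinate.
        Y[:, rows] += np.outer(X[:, j], signs / np.sqrt(s))
    return Y

X = np.random.default_rng(1).normal(size=(5, 1000))
print(sparse_jl_project(X, k=64).shape)   # (5, 64)
```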
2010 | Gb-hash Hash Functions Using Groebner Basis | Dey Dhananjoy, Mishra Prasanna Raghaw, Sengupta Indranath | Arxiv | In this paper we present an improved version of HF-hash, viz., GB-hash : Hash Functions Using Groebner Basis. In case of HF-hash, the compression function consists of 32 polynomials with 64 variables which were taken from the first 32 polynomials of hidden field equations challenge-1 by forcing the last 16 variables to 0. In GB-hash we have designed the compression function in such a way that these 32 polynomials with 64 variables form a minimal Groebner basis of the ideal generated by them with respect to graded lexicographical (grlex) ordering as well as with respect to graded reverse lexicographical (grevlex) ordering. In this paper we prove that GB-hash is more secure than HF-hash as well as more secure than SHA-256. We have also compared the efficiency of our GB-hash with SHA-256 and HF-hash. |
|||||
2010 | Sharp Rate For The Dual Quantization Problem | Pagès Gilles Lpma, Wilbertz Benedikt Lpma | Arxiv | In this paper we establish the sharp rate of the optimal dual quantization problem. The notion of dual quantization was recently introduced in the paper [8], where it was shown that, at least in a Euclidean setting, dual quantizers are based on a Delaunay triangulation, the dual counterpart of the Voronoi tessellation on which “regular” quantization relies. Moreover, this new approach shares an intrinsic stationarity property, which makes it very valuable for numerical applications. We establish in this paper the counterpart for dual quantization of the celebrated Zador theorem, which describes the sharp asymptotics for the quantization error when the quantizer size tends to infinity. The proof of this theorem relies, among other things, on an extension of the so-called Pierce Lemma by means of a random quantization argument. |
|||||
2010 | A New Approach to Cross-Modal Multimedia Retrieval | N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy and N. Vasconcelos | ICME | The collected documents are selected sections from the Wikipedia's featured articles collection. This is a continuously growing dataset that, at the time of collection (October 2009), had 2,669 articles spread over 29 categories. Some of the categories are very scarce, so we considered only the 10 most populated ones. The articles generally have multiple sections and pictures. We have split them into sections based on section headings, and assigned each image to the section in which it was placed by the author(s). This dataset was then pruned to keep only sections that contained a single image and at least 70 words. The final corpus contains 2,866 multimedia documents. The median text length is 200 words. |
|||||
2010 | On The Insertion Time Of Cuckoo Hashing | Fountoulakis Nikolaos, Panagiotou Konstantinos, Steger Angelika | Arxiv | Cuckoo hashing is an efficient technique for creating large hash tables with high space utilization and guaranteed constant access times. There, each item can be placed in a location given by any one out of k different hash functions. In this paper we further investigate the random walk heuristic for inserting new items into the hash table in an online fashion. Provided that k > 2 and that the number of items in the table is below (but arbitrarily close to) the theoretically achievable load threshold, we show a polylogarithmic bound for the maximum insertion time that holds with high probability. |
|||||
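A toy illustration of the random-walk insertion heuristic analyzed in the entry above: each key has k candidate slots, and when all are occupied a random one is evicted and the walk continues with the evicted key. The table size, k=3, the step budget, and the salted use of Python's hash() are assumptions made purely for the sketch.
```python
# Toy cuckoo table with k hash functions and random-walk insertion: when all k
# candidate slots of a key are occupied, a random occupant is evicted and the
# walk continues with the evicted key. Table size, k, the step budget and the
# salted use of Python's hash() are illustrative choices only.
import random

class CuckooTable:
    def __init__(self, size, k=3, max_steps=500, seed=0):
        self.size, self.k, self.max_steps = size, k, max_steps
        self.slots = [None] * size
        self.rng = random.Random(seed)
        self.salts = [self.rng.randrange(1 << 30) for _ in range(k)]

    def _locations(self, key):
        return [hash((salt, key)) % self.size for salt in self.salts]

    def insert(self, key):
        for _ in range(self.max_steps):
            locs = self._locations(key)
            for loc in locs:
                if self.slots[loc] is None:      # a free choice: place the key
                    self.slots[loc] = key
                    return True
            loc = self.rng.choice(locs)          # all full: evict a random occupant
            key, self.slots[loc] = self.slots[loc], key
        return False                             # walk too long (would trigger a rehash)

table = CuckooTable(size=1000, k=3)
print(all(table.insert(i) for i in range(900)))  # load 0.9, below the k=3 threshold
```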
2010 | Self-taught Hashing For Fast Similarity Search | Zhang Dell, Wang Jun, Cai Deng, Lu Jinsong | Arxiv | The ability of fast similarity search at large scale is of great importance to many Information Retrieval (IR) applications. A promising way to accelerate similarity search is semantic hashing which designs compact binary codes for a large number of documents so that semantically similar documents are mapped to similar codes (within a short Hamming distance). Although some recently proposed techniques are able to generate high-quality codes for documents known in advance, obtaining the codes for previously unseen documents remains to be a very challenging problem. In this paper, we emphasise this issue and propose a novel Self-Taught Hashing (STH) approach to semantic hashing: we first find the optimal \(l\)-bit binary codes for all documents in the given corpus via unsupervised learning, and then train \(l\) classifiers via supervised learning to predict the \(l\)-bit code for any query document unseen before. Our experiments on three real-world text datasets show that the proposed approach using binarised Laplacian Eigenmap (LapEig) and linear Support Vector Machine (SVM) outperforms state-of-the-art techniques significantly. |
|||||
2010 | B-bit Minwise Hashing For Estimating Three-way Similarities | Ping Li, Arnd Konig, Wenhao Gui | Neural Information Processing Systems | Computing two-way and multi-way set similarities is a fundamental problem. This study focuses on estimating 3-way resemblance (Jaccard similarity) using b-bit minwise hashing. While traditional minwise hashing methods store each hashed value using 64 bits, b-bit minwise hashing only stores the lowest b bits (where b >= 2 for 3-way). The extension to 3-way similarity from the prior work on 2-way similarity is technically non-trivial. We develop the precise estimator, which is accurate but rather complicated, and we recommend a much simplified estimator suitable for sparse data. Our analysis shows that \(b\)-bit minwise hashing can normally achieve a 10 to 25-fold improvement in the storage space required for a given estimator accuracy of the 3-way resemblance. |
|||||
2010 | Invariant Spectral Hashing Of Image Saliency Graph | Taquet Maxime, Jacques Laurent, De Vleeschouwer Christophe, Macq Benoit | Arxiv | Image hashing is the process of associating a short vector of bits to an image. The resulting summaries are useful in many applications including image indexing, image authentication and pattern recognition. These hashes need to be invariant under transformations of the image that result in similar visual content, but should drastically differ for conceptually distinct contents. This paper proposes an image hashing method that is invariant under rotation, scaling and translation of the image. The gist of our approach relies on the geometric characterization of salient point distribution in the image. This is achieved by the definition of a “saliency graph” connecting these points jointly with an image intensity function on the graph nodes. An invariant hash is then obtained by considering the spectrum of this function in the eigenvector basis of the graph Laplacian, that is, its graph Fourier transform. Interestingly, this spectrum is invariant under any relabeling of the graph nodes. The graph reveals geometric information of the image, making the hash robust to image transformation, yet distinct for different visual content. The efficiency of the proposed method is assessed on a set of MRI 2-D slices and on a database of faces. |
|||||
2010 | Fast Color Quantization Using Weighted Sort-means Clustering | Celebi M. Emre | Journal of the Optical Society of America A | Color quantization is an important operation with numerous applications in graphics and image processing. Most quantization methods are essentially based on data clustering algorithms. However, despite its popularity as a general purpose clustering algorithm, k-means has not received much respect in the color quantization literature because of its high computational requirements and sensitivity to initialization. In this paper, a fast color quantization method based on k-means is presented. The method involves several modifications to the conventional (batch) k-means algorithm including data reduction, sample weighting, and the use of triangle inequality to speed up the nearest neighbor search. Experiments on a diverse set of images demonstrate that, with the proposed modifications, k-means becomes very competitive with state-of-the-art color quantization methods in terms of both effectiveness and efficiency. |
|||||
2010 | Comparison Of Modified Dual Ternary Indexing And Multi-key Hashing Algorithms For Music Information Retrieval | Sridhar Rajeswari Anna University-chennai, India, Amudha A. Anna University-chennai, India, Karthiga S. Anna University-chennai, India, T Geetha V Anna University-chennai, India | International Journal of Artificial Intelligence Applications | In this work we have compared two indexing algorithms that have been used to index and retrieve Carnatic music songs. We have compared a modified algorithm of the Dual ternary indexing algorithm for music indexing and retrieval with the multi-key hashing indexing algorithm proposed by us. The modification in the dual ternary algorithm was essential to handle variable length query phrases and to accommodate features specific to Carnatic music. The dual ternary indexing algorithm is adapted for Carnatic music by segmenting using the segmentation technique for Carnatic music. The dual ternary algorithm is compared with the multi-key hashing algorithm designed by us for indexing and retrieval in which features like MFCC, spectral flux, melody string and spectral centroid are used as features for indexing data into a hash table. The way in which collision resolution was handled by this hash table is different from normal hash table approaches. It was observed that multi-key hashing based retrieval had a lower time complexity than dual-ternary based indexing. The algorithms were also compared for their precision and recall, where multi-key hashing had a better recall than modified dual ternary indexing for the sample data considered. |
|||||
2010 | Hashing Hyperplane Queries To Near Points With Applications To Large-scale Active Learning | Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman | Neural Information Processing Systems | We consider the problem of retrieving the database points nearest to a given {\em hyperplane} query without exhaustively scanning the database. We propose two hashing-based solutions. Our first approach maps the data to two-bit binary keys that are locality-sensitive for the angle between the hyperplane normal and a database point. Our second approach embeds the data into a vector space where the Euclidean norm reflects the desired distance between the original points and hyperplane query. Both use hashing to retrieve near points in sub-linear time. Our first method’s preprocessing stage is more efficient, while the second has stronger accuracy guarantees. We apply both to pool-based active learning: taking the current hyperplane classifier as a query, our algorithm identifies those points (approximately) satisfying the well-known minimal distance-to-hyperplane selection criterion. We empirically demonstrate our methods’ tradeoffs, and show that they make it practical to perform active selection with millions of unlabeled points. |
|||||
2010 | Similarity Search And Locality Sensitive Hashing Using Tcams | Shinde Rajendra, Goel Ashish, Gupta Pankaj, Dutta Debojyoti | Arxiv | Similarity search methods are widely used as kernels in various machine learning applications. Nearest neighbor search (NNS) algorithms are often used to retrieve similar entries, given a query. While there exist efficient techniques for exact query lookup using hashing, similarity search using exact nearest neighbors is known to be a hard problem and in high dimensions, best known solutions offer little improvement over a linear scan. Fast solutions to the approximate NNS problem include Locality Sensitive Hashing (LSH) based techniques, which need storage polynomial in \(n\) with exponent greater than \(1\), and query time sublinear, but still polynomial in \(n\), where \(n\) is the size of the database. In this work we present a new technique of solving the approximate NNS problem in Euclidean space using a Ternary Content Addressable Memory (TCAM), which needs near linear space and has O(1) query time. In fact, this method also works around the best known lower bounds in the cell probe model for the query time using a data structure near linear in the size of the data base. TCAMs are high performance associative memories widely used in networking applications such as access control lists. A TCAM can query for a bit vector within a database of ternary vectors, where every bit position represents \(0\), \(1\) or \(*\). The \(*\) is a wild card representing either a \(0\) or a \(1\). We leverage TCAMs to design a variant of LSH, called Ternary Locality Sensitive Hashing (TLSH) wherein we hash database entries represented by vectors in the Euclidean space into \(\{0,1,*\}\). By using the added functionality of a TLSH scheme with respect to the \(*\) character, we solve an instance of the approximate nearest neighbor problem with 1 TCAM access and storage nearly linear in the size of the database. We believe that this work can open new avenues in very high speed data mining. |
|||||
2010 | The Power Of Simple Tabulation Hashing | Patrascu Mihai, Thorup Mikkel | Arxiv | Randomized algorithms are often enjoyed for their simplicity, but the hash functions used to yield the desired theoretical guarantees are often neither simple nor practical. Here we show that the simplest possible tabulation hashing provides unexpectedly strong guarantees. The scheme itself dates back to Carter and Wegman (STOC’77). Keys are viewed as consisting of c characters. We initialize c tables T_1, …, T_c mapping characters to random hash codes. A key x=(x_1, …, x_c) is hashed to T_1[x_1] xor … xor T_c[x_c]. While this scheme is not even 4-independent, we show that it provides many of the guarantees that are normally obtained via higher independence, e.g., Chernoff-type concentration, min-wise hashing for estimating set intersection, and cuckoo hashing. |
|||||
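Since the abstract above spells out the scheme, here is a direct Python sketch of simple tabulation hashing: a 32-bit key is split into c=4 bytes, each byte indexes its own table of random codes, and the results are XORed. The key width, character size, and random-table generation are illustrative choices.
```python
# Direct sketch of the scheme quoted above: a 32-bit key is viewed as c = 4
# bytes, each byte indexes its own table of random codes, and the per-character
# codes are XORed. Key width and table generation are illustrative choices.
import random

C, BITS = 4, 32
random.seed(0)
tables = [[random.getrandbits(BITS) for _ in range(256)] for _ in range(C)]

def tabulation_hash(key: int) -> int:
    """Hash a 32-bit key as T_1[x_1] xor ... xor T_c[x_c]."""
    h = 0
    for i in range(C):
        char = (key >> (8 * i)) & 0xFF   # extract the i-th character (byte)
        h ^= tables[i][char]
    return h

print(hex(tabulation_hash(0xDEADBEEF)))
```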
2010 | A Derandomized Sparse Johnson-lindenstrauss Transform | Kane Daniel M., Nelson Jelani | Arxiv | Recent work of [Dasgupta-Kumar-Sarlos, STOC 2010] gave a sparse Johnson-Lindenstrauss transform and left as a main open question whether their construction could be efficiently derandomized. We answer their question affirmatively by giving an alternative proof of their result requiring only bounded independence hash functions. Furthermore, the sparsity bound obtained in our proof is improved. The main ingredient in our proof is a spectral moment bound for quadratic forms that was recently used in [Diakonikolas-Kane-Nelson, FOCS 2010]. |
|||||
2010 | The Universality Of Iterated Hashing Over Variable-length Strings | Lemire Daniel | Discrete Applied Mathematics | Iterated hash functions process strings recursively, one character at a time. At each iteration, they compute a new hash value from the preceding hash value and the next character. We prove that iterated hashing can be pairwise independent, but never 3-wise independent. We show that it can be almost universal over strings much longer than the number of hash values; we bound the maximal string length given the collision probability. |
|||||
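A minimal sketch of an iterated string hash of the kind studied above: the value is updated once per character from the previous value and the next character. The particular recurrence (a random multiplier plus a random per-character table over a toy prime field) is an illustrative instance, not the specific family analyzed in the paper.
```python
# An iterated string hash in the sense described above: the value is updated
# once per character from the previous value and the next character. The
# specific recurrence (random multiplier plus a random per-character table over
# a toy prime field) is an illustrative instance, not the paper's exact family.
import random

P = (1 << 31) - 1   # toy prime modulus

def make_iterated_hash(seed=0):
    rng = random.Random(seed)
    a = rng.randrange(2, P)
    T = [rng.randrange(P) for _ in range(256)]   # random value per byte

    def h(s: bytes) -> int:
        value = 0
        for c in s:                              # one update per character
            value = (a * value + T[c]) % P
        return value

    return h

h = make_iterated_hash()
print(h(b"hashing"), h(b"hashing"), h(b"hashingx"))
```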
2010 | Balancing Clusters To Reduce Response Time Variability In Large Scale Image Search | Tavenard Romain Inria - Irisa, Amsaleg Laurent Inria - Irisa, Jégou Hervé Inria - Irisa | Arxiv | Many algorithms for approximate nearest neighbor search in high-dimensional spaces partition the data into clusters. At query time, in order to avoid exhaustive search, an index selects the few (or a single) clusters nearest to the query point. Clusters are often produced by the well-known \(k\)-means approach since it has several desirable properties. On the downside, it tends to produce clusters having quite different cardinalities. Imbalanced clusters negatively impact both the variance and the expectation of query response times. This paper proposes to modify \(k\)-means centroids to produce clusters with more comparable sizes without sacrificing the desirable properties. Experiments with a large scale collection of image descriptors show that our algorithm significantly reduces the variance of response times without seriously impacting the search quality. |
|||||
2010 | Bounds For Binary Codes Relative To Pseudo-distances Of K Points | Bachoc Christine Imb, Zemor Gilles Imb | Arxiv | We apply Schrijver’s semidefinite programming method to obtain improved upper bounds on generalized distances and list decoding radii of binary codes. |
|||||
2010 | Hashing Image Patches For Zooming | Gupta Mithun Das | Arxiv | In this paper we present a Bayesian image zooming/super-resolution algorithm based on a patch based representation. We work on a patch based model with overlap and employ a Locally Linear Embedding (LLE) based approach as our data fidelity term in the Bayesian inference. The image prior imposes continuity constraints across the overlapping patches. We apply an error back-projection technique, with an approximate cross bilateral filter. The problem of nearest neighbor search is handled by a variant of the locality sensitive hashing (LSH) scheme. The novelty of our work lies in the speed up achieved by the hashing scheme and the robustness and inherent modularity and parallel structure achieved by the LLE setup. The ill-posedness of the image reconstruction problem is handled by the introduction of regularization priors which encode the knowledge present in vast collections of natural images. We present comparative results for both run-time as well as visual image quality based measurements. |
|||||
2010 | Improved Fast Similarity Search In Dictionaries | Karch Daniel, Luxen Dennis, Sanders Peter | Arxiv | We engineer an algorithm to solve the approximate dictionary matching problem. Given a list of words \(\mathcal{W}\), maximum distance \(d\) fixed at preprocessing time and a query word \(q\), we would like to retrieve all words from \(\mathcal{W}\) that can be transformed into \(q\) with \(d\) or less edit operations. We present data structures that support fault tolerant queries by generating an index. On top of that, we present a generalization of the method that eases memory consumption and preprocessing time significantly. At the same time, running times of queries are virtually unaffected. We are able to match in lists of hundreds of thousands of words and beyond within microseconds for reasonable distances. |
|||||
2010 | Approximate Nearest Neighbor Search For Low Dimensional Queries | Har-peled Sariel, Kumar Nirman | Arxiv | We study the Approximate Nearest Neighbor problem for metric spaces where the query points are constrained to lie on a subspace of low doubling dimension, while the data is high-dimensional. We show that this problem can be solved efficiently despite the high dimensionality of the data. |
|||||
2010 | Maximum Bipartite Matching Size And Application To Cuckoo Hashing | Kanizo Yossi, Hay David, Keslassy Isaac | Arxiv | Cuckoo hashing with a stash is a robust multiple choice hashing scheme with high memory utilization that can be used in many network device applications. Unfortunately, for memory loads beyond 0.5, little is known on its performance. In this paper, we analyze its average performance over such loads. We tackle this problem by recasting the problem as an analysis of the expected maximum matching size of a given random bipartite graph. We provide exact results for any finite system, and also deduce asymptotic results as the memory size increases. We further consider other variants of this problem, and finally evaluate the performance of our models on Internet backbone traces. More generally, our results give a tight lower bound on the size of the stash needed for any multiple-choice hashing scheme. |
|||||
2009 | A Class Of Structured P2P Systems Supporting Browsing | Cohen Julien Lina | Arxiv | Browsing is a way of finding documents in a large amount of data which is complementary to querying and which is particularly suitable for multimedia documents. Locating particular documents in a very large collection of multimedia documents such as the ones available in peer to peer networks is a difficult task. However, current peer to peer systems do not allow to do this by browsing. In this report, we show how one can build a peer to peer system supporting a kind of browsing. In our proposal, one must extend an existing distributed hash table system with a few features : handling partial hash-keys and providing appropriate routing mechanisms for these hash-keys. We give such an algorithm for the particular case of the Tapestry distributed hash table. This is a work in progress as no proper validation has been done yet. |
|||||
2009 | Cophir A Test Collection For Content-based Image Retrieval | Bolettieri Paolo, Esuli Andrea, Falchi Fabrizio, Lucchese Claudio, Perego Raffaele, Piccioli Tommaso, Rabitti Fausto | Arxiv | The scalability, as well as the effectiveness, of the different Content-based Image Retrieval (CBIR) approaches proposed in literature, is today an important research issue. Given the wealth of images on the Web, CBIR systems must in fact leap towards Web-scale datasets. In this paper, we report on our experience in building a test collection of 100 million images, with the corresponding descriptive features, to be used in experimenting new scalable techniques for similarity searching, and comparing their results. In the context of the SAPIR (Search on Audio-visual content using Peer-to-peer Information Retrieval) European project, we had to experiment our distributed similarity searching technology on a realistic data set. Therefore, since no large-scale collection was available for research purposes, we had to tackle the non-trivial process of image crawling and descriptive feature extraction (we used five MPEG-7 features) using the European EGEE computer GRID. The result of this effort is CoPhIR, the first CBIR test collection of such scale. CoPhIR is now open to the research community for experiments and comparisons, and access to the collection was already granted to more than 50 research groups worldwide. |
|||||
2009 | Data Structure For Representing A Graph Combination Of Linked List And Hash Table | Kolosovskiy Maxim A. Altai State Technical University, Russia | Arxiv | In this article we discuss a data structure which combines the advantages of two different ways of representing graphs: the adjacency matrix and the collection of adjacency lists. This data structure can quickly add and search edges (an advantage of the adjacency matrix), uses a linear amount of memory, and allows obtaining the adjacency list of a given vertex (advantages of the collection of adjacency lists). Basic knowledge of linked lists and hash tables is required to understand this article. The article contains implementation examples in Java. |
|||||
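As a rough Python analogue of the structure described above (the article itself gives Java examples), a dictionary of hash sets provides expected O(1) edge insertion and lookup while still exposing per-vertex adjacency lists and using memory linear in the number of edges.
```python
# Rough Python analogue of the structure described above (the article gives
# Java examples): a dictionary of hash sets gives expected O(1) edge insertion
# and lookup, memory linear in the number of edges, and per-vertex adjacency lists.
class Graph:
    def __init__(self):
        self.adj = {}                                   # vertex -> set of neighbours

    def add_edge(self, u, v):
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)

    def has_edge(self, u, v):
        return v in self.adj.get(u, ())                 # expected O(1) lookup

    def neighbors(self, u):
        return iter(self.adj.get(u, ()))                # adjacency list of u

g = Graph()
g.add_edge(1, 2)
g.add_edge(2, 3)
print(g.has_edge(1, 2), g.has_edge(1, 3), sorted(g.neighbors(2)))
```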
2009 | Rank-approximate Nearest Neighbor Search Retaining Meaning And Speed In High Dimensions | Parikshit Ram, Dongryeol Lee, Hua Ouyang, Alexander Gray | Neural Information Processing Systems | The long-standing problem of efficient nearest-neighbor (NN) search has ubiquitous applications ranging from astrophysics to MP3 fingerprinting to bioinformatics to movie recommendations. As the dimensionality of the dataset increases, exact NN search becomes computationally prohibitive; (1+eps)-distance-approximate NN search can provide large speedups but risks losing the meaning of NN search present in the ranks (ordering) of the distances. This paper presents a simple, practical algorithm allowing the user to, for the first time, directly control the true accuracy of NN search (in terms of ranks) while still achieving the large speedups over exact NN. Experiments with high-dimensional datasets show that it often achieves faster and more accurate results than the best-known distance-approximate method, with much more stable behavior. |
|||||
2009 | Group Sparse Coding | Samy Bengio, Fernando Pereira, Yoram Singer, Dennis Strelow | Neural Information Processing Systems | Bag-of-words document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification. |
|||||
2009 | The Usefulness Of Multilevel Hash Tables With Multiple Hash Functions In Large Databases | Akinwalle A. T., Ibharalu F. T. | Ann. Univ. Tibiscus Comp. Sci. Series VII | In this work, an attempt is made to select three good hash functions which uniformly distribute hash values, permute their internal states, and allow the input bits to generate different output bits. These functions are used in different levels of hash tables that are coded in the Java programming language, and quite a number of data records serve as primary data for testing the performance. The result shows that the two-level hash tables with three different hash functions give a superior performance over a one-level hash table with two hash functions or a zero-level hash table with one function in terms of reducing key conflicts and quick lookup of a particular element. The result helps reduce the complexity of the join operation in query languages from O(n^2) to O(1) by placing larger query results, if any, in multilevel hash tables with multiple hash functions and generating shorter query results. |
|||||
2009 | Learning Multiple Layers of Features from Tiny Images | A. Krizhevsky | University of Toronto | Groups at MIT and NYU have collected a dataset of millions of tiny colour images from the web. It is, in principle, an excellent dataset for unsupervised training of deep generative models, but previous researchers who have tried this have found it difficult to learn a good set of filters from the images. We show how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex. Using a novel parallelization algorithm to distribute the work among multiple machines connected on a network, we show how training such a model can be done in reasonable time. A second problematic aspect of the tiny images dataset is that there are no reliable class labels which makes it hard to use for object recognition experiments. We created two sets of reliable labels. The CIFAR-10 set has 6000 examples of each of 10 classes and the CIFAR-100 set has 600 examples of each of 100 non-overlapping classes. Using these labels, we show that object recognition is significantly improved by pre-training a layer of features on a large set of unlabeled tiny images. |
|||||
2009 | Feature Hashing For Large Scale Multitask Learning | Weinberger Kilian, Dasgupta Anirban, Attenberg Josh, Langford John, Smola Alex | Arxiv | Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case – multitask learning with hundreds of thousands of tasks. |
|||||
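A minimal sketch of the signed hashing trick analyzed in the entry above: each feature name is hashed both to a bucket index and to a sign, so collision noise cancels in expectation and inner products are approximately preserved. The bucket count D and the choice of blake2b as the underlying hash are illustrative assumptions, not the paper's setup.
```python
# Minimal signed hashing trick: every feature name is hashed to a bucket and to
# a sign in {-1, +1}, so collision noise cancels in expectation and inner
# products are approximately preserved. D and the choice of blake2b are
# illustrative assumptions, not the paper's setup.
import hashlib

D = 2 ** 18   # number of hash buckets

def hash_features(tokens):
    """Map an iterable of string features to a sparse {bucket: value} dict."""
    x = {}
    for tok in tokens:
        h = int(hashlib.blake2b(tok.encode(), digest_size=8).hexdigest(), 16)
        idx = (h >> 1) % D
        sign = 1 if h & 1 else -1
        x[idx] = x.get(idx, 0) + sign
    return x

print(hash_features("the quick brown fox the".split()))
```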
2009 | Efficient Authenticated Data Structures For Graph Connectivity And Geometric Search Problems | Goodrich Michael T., Tamassia Roberto, Triandopoulos Nikos | Arxiv | Authenticated data structures provide cryptographic proofs that their answers are as accurate as the author intended, even if the data structure is being controlled by a remote untrusted host. We present efficient techniques for authenticating data structures that represent graphs and collections of geometric objects. We introduce the path hash accumulator, a new primitive based on cryptographic hashing for efficiently authenticating various properties of structured data represented as paths, including any decomposable query over sequences of elements. We show how to employ our primitive to authenticate queries about properties of paths in graphs and search queries on multi-catalogs. This allows the design of new, efficient authenticated data structures for fundamental problems on networks, such as path and connectivity queries over graphs, and complex queries on two-dimensional geometric objects, such as intersection and containment queries. |
|||||
2009 | B-bit Minwise Hashing | Li Ping, Konig Arnd Christian | Arxiv | This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, social networks and computational advertising. By only storing the lowest \(b\) bits of each (minwise) hashed value (e.g., b=1 or 2), one can gain substantial advantages in terms of computational efficiency and storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to using b=64 (or b=32), if one is interested in resemblance > 0.5. |
|||||
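To make the storage/estimation trade-off above concrete, here is a simplified Python sketch of estimating resemblance from aligned b-bit minwise hashes. It uses the crude correction R = (E - 2^-b) / (1 - 2^-b), which treats accidental b-bit collisions as uniform; the paper's unbiased estimator includes additional terms that depend on the relative set sizes.
```python
# Simplified resemblance estimate from aligned b-bit minwise hashes. The crude
# correction below treats accidental b-bit collisions as uniform; the paper's
# unbiased estimator adds terms depending on the relative set sizes.
def estimate_resemblance(codes1, codes2, b):
    assert len(codes1) == len(codes2)
    e = sum(c1 == c2 for c1, c2 in zip(codes1, codes2)) / len(codes1)   # match rate
    q = 2.0 ** (-b)                                                     # chance collision rate
    return (e - q) / (1.0 - q)

# 1000 one-bit minhashes agreeing 80% of the time -> estimated resemblance 0.6.
print(estimate_resemblance([1] * 800 + [0] * 200, [1] * 1000, b=1))
```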
2009 | Datasets for approximate nearest neighbor search | Herve Jegou, Laurent Amsaleg | | BIGANN consists of SIFT descriptors applied to images extracted from a large image dataset. |
|||||
2009 | Efficient Match Kernel Between Sets Of Features For Visual Recognition | Liefeng Bo, Cristian Sminchisescu | Neural Information Processing Systems | In visual recognition, the images are frequently modeled as sets of local features (bags). We show that bag of words, a common method to handle such cases, can be viewed as a special match kernel, which counts 1 if two local features fall into the same regions partitioned by visual words and 0 otherwise. Despite its simplicity, this quantization is too coarse. It is, therefore, appealing to design match kernels that more accurately measure the similarity between local features. However, it is impractical to use such kernels on large datasets due to their significant computational cost. To address this problem, we propose an efficient match kernel (EMK), which maps local features to a low-dimensional feature space, averages the resulting feature vectors to form a set-level feature, and then applies a linear classifier. The local feature maps are learned so that their inner products preserve, to the best possible, the values of the specified kernel function. EMK is linear both in the number of images and in the number of local features. We demonstrate that EMK is extremely efficient and achieves the current state-of-the-art performance on three difficult real world datasets: Scene-15, Caltech-101 and Caltech-256. |
|||||
2009 | Identification With Encrypted Biometric Data | Bringer Julien, Chabanne Herve, Kindarji Bruno | Arxiv | Biometrics make human identification possible with a sample of a biometric trait and an associated database. Classical identification techniques lead to privacy concerns. This paper introduces a new method to identify someone using his biometrics in an encrypted way. Our construction combines Bloom Filters with Storage and Locality-Sensitive Hashing. We apply this error-tolerant scheme, in a Hamming space, to achieve biometric identification in an efficient way. This is the first non-trivial identification scheme dealing with fuzziness and encrypted data. |
|||||
2009 | Learning To Hash With Binary Reconstructive Embeddings | Brian Kulis, Trevor Darrell | Neural Information Processing Systems | Fast retrieval methods are increasingly critical for many large-scale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinate-descent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing state-of-the-art techniques. |
|||||
2009 | Locality-sensitive Binary Codes From Shift-invariant Kernels | Maxim Raginsky, Svetlana Lazebnik | Neural Information Processing Systems | This paper addresses the problem of designing binary codes for high-dimensional data such that vectors that are similar in the original space map to similar binary strings. We introduce a simple distribution-free encoding scheme based on random projections, such that the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel (e.g., a Gaussian kernel) between the vectors. We present a full theoretical analysis of the convergence properties of the proposed scheme, and report favorable experimental performance as compared to a recent state-of-the-art method, spectral hashing. |
|||||
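A small sketch of the random-projection encoding described in the entry above, as it is commonly implemented: binarized random Fourier features with a random phase and a random threshold per bit. The Gaussian-kernel bandwidth parameterization and the specific constants are illustrative assumptions.
```python
# Binarised random Fourier features as commonly used to implement the scheme
# above: bit_j(x) = [cos(w_j . x + b_j) + t_j >= 0] with Gaussian w_j, uniform
# phase b_j and uniform threshold t_j. The bandwidth handling is an assumption.
import numpy as np

def shift_invariant_binary_codes(X, n_bits, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(gamma), size=(d, n_bits))   # kernel bandwidth gamma
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_bits)
    t = rng.uniform(-1.0, 1.0, size=n_bits)
    return (np.cos(X @ W + b) + t >= 0).astype(np.uint8)

X = np.random.default_rng(1).normal(size=(4, 32))
codes = shift_invariant_binary_codes(X, n_bits=64)
print(int(np.sum(codes[0] != codes[1])))   # Hamming distance between two codes
```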
2009 | Parallelization Of The LBG Vector Quantization Algorithm For Shared Memory Systems | Annaji Rajashekar, Rao Shrisha | International Journal of Image Processing vol. | This paper proposes a parallel approach for the Vector Quantization (VQ) problem in image processing. VQ deals with codebook generation from the input training data set and replacement of any arbitrary data with the nearest codevector. Most of the efforts in VQ have been directed towards designing parallel search algorithms for the codebook, and little has hitherto been done in evolving a parallelized procedure to obtain an optimum codebook. This parallel algorithm addresses the problem of designing an optimum codebook using the traditional LBG type of vector quantization algorithm for shared memory systems and for the efficient usage of parallel processors. Using the codebook formed from a training set, any arbitrary input data is replaced with the nearest codevector from the codebook. The effectiveness of the proposed algorithm is indicated. |
|||||
2009 | Sharp Load Thresholds For Cuckoo Hashing | Fountoulakis Nikolaos, Panagiotou Konstantinos | Arxiv | The paradigm of many choices has influenced significantly the design of efficient data structures and, most notably, hash tables. Cuckoo hashing is a technique that extends this concept. There, we are given a table with \(n\) locations, and we assume that each location can hold one item. Each item to be inserted chooses randomly k>1 locations and has to be placed in any one of them. How much load can cuckoo hashing handle before collisions prevent the successful assignment of the available items to the chosen locations? Practical evaluations of this method have shown that one can allocate a number of elements that is a large proportion of the size of the table, being very close to 1 even for small values of k such as 4 or 5. In this paper we show that there is a critical value for this proportion: with high probability, when the amount of available items is below this value, then these can be allocated successfully, but when it exceeds this value, the allocation becomes impossible. We give explicitly for each k>1 this critical value. This answers an open question posed by Mitzenmacher (ESA '09) and underpins theoretically the experimental results. Our proofs are based on the translation of the question into a hypergraph setting, and the study of the related typical properties of random k-uniform hypergraphs. |
|||||
2009 | NUS-WIDE: a real-world web image database from National University of Singapore | T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng | CIVR | This paper introduces a web image dataset created by NUS’s Lab for Media Search. The dataset includes: (1) 269,648 images and the associated tags from Flickr, with a total of 5,018 unique tags; (2) six types of low-level features extracted from these images, including 64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments extracted over 5x5 fixed grid partitions, and 500-D bag of words based on SIFT descriptions; and (3) ground-truth for 81 concepts that can be used for evaluation. Based on this dataset, we highlight characteristics of Web image collections and identify four research issues on web image annotation and retrieval. We also provide the baseline results for web image annotation by learning from the tags using the traditional k-NN algorithm. The benchmark results indicate that it is possible to learn effective models from sufficiently large image dataset to facilitate general image retrieval. |
|||||
2009 | Searching with quantization: approximate nearest neighbor search using short codes and distance estimators | H. Jegou, M. Douze, C. Schmid | INRIA Technical Report | We propose an approximate nearest neighbor search method based on quantization. It uses, in particular, a product quantizer to produce short codes and corresponding distance estimators approximating the Euclidean distance between the original vectors. The method is advantageously used in an asymmetric manner, by computing the distance between a vector and a code, unlike competing techniques such as spectral hashing that only compare codes. Our approach approximates the Euclidean distance based on memory efficient codes and, thus, permits efficient nearest neighbor search. Experiments performed on SIFT and GIST image descriptors show excellent search accuracy. The method is shown to outperform two state-of-the-art approaches of the literature. Timings measured when searching a vector set of 2 billion vectors are shown to be excellent given the high accuracy of the method. |
|||||
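A compact Python sketch of product quantization with asymmetric distance computation (ADC) as described in the entry above: vectors are split into m sub-vectors, a small k-means codebook is trained per subspace, and an uncompressed query is compared to codes via per-subspace lookup tables. The parameters (m=4, k=16, a handful of Lloyd iterations) are toy choices, not the paper's configuration.
```python
# Compact product-quantization sketch with asymmetric distance computation
# (ADC): split vectors into m sub-vectors, train a small k-means codebook per
# subspace, and compare an uncompressed query to codes via lookup tables.
# m, k and the number of Lloyd iterations are toy choices.
import numpy as np

def train_pq(X, m=4, k=16, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sub = d // m
    codebooks = []
    for j in range(m):
        S = X[:, j * sub:(j + 1) * sub]
        C = S[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(iters):                            # plain Lloyd iterations
            assign = np.argmin(((S[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
            for c in range(k):
                if np.any(assign == c):
                    C[c] = S[assign == c].mean(axis=0)
        codebooks.append(C)
    return codebooks

def encode(X, codebooks):
    m, sub = len(codebooks), codebooks[0].shape[1]
    codes = np.empty((X.shape[0], m), dtype=np.int32)
    for j, C in enumerate(codebooks):
        S = X[:, j * sub:(j + 1) * sub]
        codes[:, j] = np.argmin(((S[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
    return codes

def adc_distances(query, codes, codebooks):
    """Squared-distance estimates between one raw query and all encoded vectors."""
    m, sub = len(codebooks), codebooks[0].shape[1]
    tables = [((query[j * sub:(j + 1) * sub] - C) ** 2).sum(-1)
              for j, C in enumerate(codebooks)]           # one lookup table per subspace
    return sum(tables[j][codes[:, j]] for j in range(m))

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 32))
cb = train_pq(X)
codes = encode(X, cb)
print(np.argmin(adc_distances(X[0], codes, cb)))          # usually 0 (the query itself)
```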
2009 | Optimal Lower Bounds For Locality Sensitive Hashing (except When Q Is Tiny) | O'donnell Ryan, Wu Yi, Zhou Yuan | Arxiv | We study lower bounds for Locality Sensitive Hashing (LSH) in the strongest setting: point sets in {0,1}^d under the Hamming distance. Recall that here H is said to be an (r, cr, p, q)-sensitive hash family if all pairs x, y in {0,1}^d with dist(x,y) at most r have probability at least p of collision under a randomly chosen h in H, whereas all pairs x, y in {0,1}^d with dist(x,y) at least cr have probability at most q of collision. Typically, one considers d tending to infinity, with c fixed and q bounded away from 0. For its applications to approximate nearest neighbor search in high dimensions, the quality of an LSH family H is governed by how small its “rho parameter” rho = ln(1/p)/ln(1/q) is as a function of the parameter c. The seminal paper of Indyk and Motwani showed that for each c, the extremely simple family H = {x -> x_i : i in d} achieves rho at most 1/c. The only known lower bound, due to Motwani, Naor, and Panigrahy, is that rho must be at least .46/c (minus o_d(1)). In this paper we show an optimal lower bound: rho must be at least 1/c (minus o_d(1)). This lower bound for Hamming space yields a lower bound of 1/c^2 for Euclidean space (or the unit sphere) and 1/c for the Jaccard distance on sets; both of these match known upper bounds. Our proof is simple; the essence is that the noise stability of a boolean function at e^{-t} is a log-convex function of t. |
|||||
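A quick numerical check of the quantity discussed in the entry above: for the bit-sampling family {x -> x_i}, the collision probability at Hamming distance r is 1 - r/d, so rho = ln(1/p)/ln(1/q) tends to the 1/c bound as r/d shrinks, which is the value the paper proves to be optimal.
```python
# Numerical check of the rho discussed above for the bit-sampling family
# {x -> x_i}: p = 1 - r/d at distance r and q = 1 - c*r/d at distance c*r,
# so rho = ln(1/p)/ln(1/q) tends to the optimal 1/c as r/d shrinks.
import math

def rho_bit_sampling(r, c, d):
    p = 1.0 - r / d
    q = 1.0 - c * r / d
    return math.log(1.0 / p) / math.log(1.0 / q)

for d in (100, 1_000, 100_000):
    print(d, round(rho_bit_sampling(r=10, c=2, d=d), 4))   # approaches 0.5 = 1/c
```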
2009 | ImageNet: A large-scale hierarchical image database | J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei | CVPR | The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond. |
|||||
2009 | Hf-hash Hash Functions Using Restricted HFE Challenge-1 | Dey Dhananjoy, Mishra Prasanna Raghaw, Sengupta Indranath | Arxiv | Vulnerability of dedicated hash functions to various attacks has made the task of designing hash functions much more challenging. This provides us with a strong motivation to design a new cryptographic hash function viz. HF-hash. This is a hash function whose compression function is designed by using the first 32 polynomials of HFE Challenge-1 with 64 variables, forcing the remaining 16 variables to zero. HF-hash gives a 256-bit message digest and is as efficient as SHA-256. It is secure against the differential attack proposed by Chabaud and Joux as well as by Wang et al. applied to SHA-0 and SHA-1. |
|||||
2008 | The MIR Flickr Retrieval Evaluation. | M. J. Huiskes, M. S. Lew | MIR | In most well known image retrieval test sets, the imagery typically cannot be freely distributed or is not representative of a large community of users. In this paper we present a collection for the MIR community comprising 25000 images from the Flickr website which are redistributable for research purposes and represent a real community of users both in the image content and image tags. We have extracted the tags and EXIF image metadata, and also make all of these publicly available. In addition we discuss several challenges for benchmarking retrieval and classification methods. |
|||||
2008 | Online Metric Learning And Fast Similarity Search | Prateek Jain, Brian Kulis, Inderjit Dhillon, Kristen Grauman | Neural Information Processing Systems | Metric learning algorithms can provide useful distance functions for a variety of domains, and recent work has shown good accuracy for problems where the learner can access all distance constraints at once. However, in many real applications, constraints are only available incrementally, thus necessitating methods that can perform online updates to the learned metric. Existing online algorithms offer bounds on worst-case performance, but typically do not perform well in practice as compared to their offline counterparts. We present a new online metric learning algorithm that updates a learned Mahalanobis metric based on LogDet regularization and gradient descent. We prove theoretical worst-case performance bounds, and empirically compare the proposed method against existing online metric learning algorithms. To further boost the practicality of our approach, we develop an online locality-sensitive hashing scheme which leads to efficient updates for approximate similarity search data structures. We demonstrate our algorithm on multiple datasets and show that it outperforms relevant baselines. |
|||||
2008 | Optimal Hash Functions For Approximate Closest Pairs On The N-cube | Gordon Daniel M., Miller Victor, Ostapenko Peter | Arxiv | One way to find closest pairs in large datasets is to use hash functions. In recent years locality-sensitive hash functions for various metrics have been given; projecting an n-cube onto k bits is a simple hash function that performs well. In this paper we investigate alternatives to projection. For various parameters, hash functions given by complete decoding algorithms for codes work better, and, asymptotically, random codes perform better than projection. |
|||||
2008 | Tight Bounds For Hashing Block Sources | Chung Kai-min, Vadhan Salil | Arxiv | It is known that if a 2-universal hash function \(H\) is applied to elements of a {\em block source} \((X_1,…,X_T)\), where each item \(X_i\) has enough min-entropy conditioned on the previous items, then the output distribution \((H,H(X_1),…,H(X_T))\) will be ``close’’ to the uniform distribution. We provide improved bounds on how much min-entropy per item is required for this to hold, both when we ask that the output be close to uniform in statistical distance and when we only ask that it be statistically close to a distribution with small collision probability. In both cases, we reduce the dependence of the min-entropy on the number \(T\) of items from \(2\log T\) in previous work to \(\log T\), which we show to be optimal. This leads to corresponding improvements to the recent results of Mitzenmacher and Vadhan (SODA '08) on the analysis of hashing-based algorithms and data structures when the data items come from a block source. |
|||||
2008 | Bounds On Codes Based On Graph Theory | Rouayheb Salim Y. El, Georghiades C. N., Soljanin E., Sprintson A. | Arxiv | Let \(A_q(n,d)\) be the maximum order (maximum number of codewords) of a \(q\)-ary code of length \(n\) and Hamming distance at least \(d\), and let \(A(n,d,w)\) be that of a binary code of constant weight \(w\). Building on results from algebraic graph theory and Erdős-Ko-Rado-like theorems in extremal combinatorics, we show how several known bounds on \(A_q(n,d)\) and \(A(n,d,w)\) can be easily obtained in a single framework. For instance, both the Hamming and Singleton bounds can be derived as an application of a property relating the clique number and the independence number of vertex-transitive graphs. Using the same techniques, we also derive some new bounds and present some additional applications. |
|||||
2008 | Spectral Hashing | Yair Weiss, Antonio Torralba, Rob Fergus | Neural Information Processing Systems | Semantic hashing seeks compact binary codes of datapoints so that the Hamming distance between codewords correlates with semantic similarity. Hinton et al. used a clever implementation of autoencoders to find such codes. In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP-hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes significantly outperform the state of the art. |
|||||
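A minimal in-sample sketch of the relaxation described in the entry above: binary codes obtained by thresholding the low-frequency eigenvectors of a graph Laplacian built from a Gaussian affinity matrix. The bandwidth heuristic is an assumption, and the paper's out-of-sample extension via analytic eigenfunctions is omitted.
```python
# Minimal in-sample version of the relaxation quoted above: threshold the
# low-frequency eigenvectors of a graph Laplacian built from a Gaussian
# affinity matrix. The bandwidth heuristic is an assumption, and the paper's
# out-of-sample extension via analytic eigenfunctions is omitted.
import numpy as np

def spectral_codes(X, n_bits):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.exp(-sq / np.median(sq))                       # Gaussian affinities (median bandwidth)
    L = np.diag(W.sum(axis=1)) - W                        # unnormalised graph Laplacian
    _, eigvecs = np.linalg.eigh(L)                        # ascending eigenvalues
    V = eigvecs[:, 1:n_bits + 1]                          # skip the trivial constant eigenvector
    return (V > 0).astype(np.uint8)

X = np.random.default_rng(0).normal(size=(50, 8))
print(spectral_codes(X, n_bits=16).shape)                 # (50, 16)
```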
2008 | A Fast Generic Sequence Matching Algorithm | Musser David R., Nishanov Gor V. | Arxiv | A string matching – and more generally, sequence matching – algorithm is presented that has a linear worst-case computing time bound, a low worst-case bound on the number of comparisons (2n), and sublinear average-case behavior that is better than that of the fastest versions of the Boyer-Moore algorithm. The algorithm retains its efficiency advantages in a wide variety of sequence matching problems of practical interest, including traditional string matching; large-alphabet problems (as in Unicode strings); and small-alphabet, long-pattern problems (as in DNA searches). Since it is expressed as a generic algorithm for searching in sequences over an arbitrary type T, it is well suited for use in generic software libraries such as the C++ Standard Template Library. The algorithm was obtained by adding to the Knuth-Morris-Pratt algorithm one of the pattern-shifting techniques from the Boyer-Moore algorithm, with provision for use of hashing in this technique. In situations in which a hash function or random access to the sequences is not available, the algorithm falls back to an optimized version of the Knuth-Morris-Pratt algorithm. |
|||||
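The algorithm in this entry augments Knuth-Morris-Pratt with a Boyer-Moore-style (optionally hashed) shift; that hybrid is not reproduced here, but the KMP baseline it falls back to looks roughly like this sketch:

```python
def kmp_failure(pattern):
    """Failure (longest proper prefix-suffix) table for Knuth-Morris-Pratt."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    fail, k = kmp_failure(pattern), 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

print(kmp_search("abacabadabacabae", "abacabae"))  # 8
```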
2008 | Quasi-metrics Similarities And Searches Aspects Of Geometry Of Protein Datasets | Stojmirovic Aleksandar | Arxiv | A quasi-metric is a distance function which satisfies the triangle inequality but is not symmetric: it can be thought of as an asymmetric metric. The central result of this thesis, developed in Chapter 3, is that a natural correspondence exists between similarity measures between biological (nucleotide or protein) sequences and quasi-metrics. Chapter 2 presents basic concepts of the theory of quasi-metric spaces and introduces new examples of them: the universal countable rational quasi-metric space and its bicompletion, the universal bicomplete separable quasi-metric space. Chapter 4 is dedicated to development of a notion of the quasi-metric space with Borel probability measure, or pq-space. The main result of this chapter indicates that ``a high dimensional quasi-metric space is close to being a metric space''. Chapter 5 investigates the geometric aspects of the theory of database similarity search in the context of quasi-metrics. The results about \(pq\)-spaces are used to produce novel theoretical bounds on performance of indexing schemes. Finally, the thesis presents some biological applications. Chapter 6 introduces FSIndex, an indexing scheme that significantly accelerates similarity searches of short protein fragment datasets. Chapter 7 presents the prototype of the system for discovery of short functional protein motifs called PFMFind, which relies on FSIndex for similarity searches. |
|||||
2007 | Perfect Hashing For Data Management Applications | Botelho Fabiano C., Pagh Rasmus, Ziviani Nivio | Arxiv | Perfect hash functions can potentially be used to compress data in connection with a variety of data management tasks. Though there has been considerable work on how to construct good perfect hash functions, there is a gap between theory and practice among all previous methods on minimal perfect hashing. On one side, there are good theoretical results without experimentally proven practicality for large key sets. On the other side, there are theoretically analyzed time and space usage algorithms that assume that truly random hash functions are available for free, which is an unrealistic assumption. In this paper we attempt to bridge this gap between theory and practice, using a number of techniques from the literature to obtain a novel scheme that is theoretically well-understood and at the same time achieves an order-of-magnitude increase in performance compared to previous ``practical'' methods. This improvement comes from a combination of a novel, theoretically optimal perfect hashing scheme that greatly simplifies previous methods, and the fact that our algorithm is designed to make good use of the memory hierarchy. We demonstrate the scalability of our algorithm by considering a set of over one billion URLs from the World Wide Web of average length 64, for which we construct a minimal perfect hash function on a commodity PC in a little more than 1 hour. Our scheme produces minimal perfect hash functions using slightly more than 3 bits per key. For perfect hash functions in the range \(\{0,…,2n-1\}\) the space usage drops to just over 2 bits per key (i.e., one bit more than optimal for representing the key). This is significantly below what has been achieved previously for very large values of \(n\). |
|||||
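A toy hash-and-displace construction in the same spirit: split keys into buckets, then (largest bucket first) search for a per-bucket seed under which all of the bucket's keys land in free slots. It is only an illustration, not the paper's cache-aware construction, and its space usage is nowhere near 3 bits per key; the helper names are invented.

```python
import hashlib

def _h(key, seed, m):
    """Seeded hash of a string key into range(m) (illustrative helper)."""
    d = hashlib.blake2b(key.encode(), key=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(d[:8], "little") % m

def build_mphf(keys):
    """Hash-and-displace: assign keys to buckets, then search for a seed under
    which every key of the bucket lands in a still-free slot."""
    n, r = len(keys), max(1, len(keys) // 4)
    buckets = [[] for _ in range(r)]
    for key in keys:
        buckets[_h(key, 0, r)].append(key)
    g, taken = [0] * r, [False] * n
    for b in sorted(range(r), key=lambda b: -len(buckets[b])):
        if not buckets[b]:
            continue
        seed = 1
        while True:
            slots = [_h(key, seed, n) for key in buckets[b]]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                for s in slots:
                    taken[s] = True
                g[b] = seed
                break
            seed += 1
    return g, r, n

def mphf(key, g, r, n):
    """Evaluate the minimal perfect hash: a collision-free value in range(n)."""
    return _h(key, g[_h(key, 0, r)], n)

keys = [f"url-{i}" for i in range(200)]
g, r, n = build_mphf(keys)
assert sorted(mphf(k, g, r, n) for k in keys) == list(range(n))
```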
2007 | A Note On Approximate Nearest Neighbor Methods | Breuel Thomas M. | Arxiv | A number of authors have described randomized algorithms for solving the epsilon-approximate nearest neighbor problem. In this note I point out that the epsilon-approximate nearest neighbor property often fails to be a useful approximation property, since epsilon-approximate solutions fail to satisfy the necessary preconditions for using nearest neighbors for classification and related tasks. |
|||||
2007 | Trellis-coded Quantization Based On Maximum-hamming-distance Binary Codes | Cappellari Lorenzo | Arxiv | Most design approaches for trellis-coded quantization take advantage of the duality of trellis-coded quantization with trellis-coded modulation, and use the same empirically-found convolutional codes to label the trellis branches. This letter presents an alternative approach that instead takes advantage of maximum-Hamming-distance convolutional codes. The proposed source codes are shown to be competitive with the best in the literature for the same computational complexity. |
|||||
2007 | Avoiding Rotated Bitboards With Direct Lookup | Tannous Sam | ICGA Journal Vol. | This paper describes an approach for obtaining direct access to the attacked squares of sliding pieces without resorting to rotated bitboards. The technique involves creating four hash tables using the built-in hash arrays from an interpreted, high-level language. The rank, file, and diagonal occupancy are first isolated by masking the desired portion of the board. The attacked squares are then directly retrieved from the hash tables. Maintaining incrementally updated rotated bitboards becomes unnecessary as does all the updating, mapping and shifting required to access the attacked squares. Finally, rotated bitboard move generation speed is compared with that of the direct hash table lookup method. |
|||||
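A rough Python sketch of the per-rank lookup idea: mask out the rank's 8-bit occupancy and use it, together with the piece's file, as the key of a precomputed attack table. The paper builds four such tables (rank, file, and both diagonals); only the rank case is shown here, and the representation is illustrative.

```python
def rank_attacks(file, occupancy):
    """8-bit mask of squares on one rank attacked by a sliding piece on
    `file`, given the rank's 8-bit occupancy (blockers are included)."""
    attacks = 0
    for step in (1, -1):
        f = file + step
        while 0 <= f < 8:
            attacks |= 1 << f
            if occupancy & (1 << f):          # stop at the first blocker
                break
            f += step
    return attacks

# Precomputed table indexed by (file, rank-occupancy byte), standing in for
# one of the paper's four per-direction hash tables.
RANK_TABLE = {(f, occ): rank_attacks(f, occ)
              for f in range(8) for occ in range(256)}

# Example: rook on file c (index 2), blockers on files a and f.
occ = (1 << 0) | (1 << 5) | (1 << 2)
print(f"{RANK_TABLE[(2, occ)]:08b}")          # attacked squares on that rank
```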
2007 | LabelMe: a database and web-based tool for image annotation | B. Russell, A. Torralba, K. Murphy, W. T. Freeman | IJCV | We seek to build a large collection of images with ground truth labels to be used for object detection and recognition research. Such data is useful for supervised learning and quantitative evaluation. To achieve this, we developed a web-based tool that allows easy image annotation and instant sharing of such annotations. Using this annotation tool, we have collected a large dataset that spans many object categories, often containing multiple instances over a wide variety of images. We quantify the contents of the dataset and compare against existing state of the art datasets used for object recognition and detection. Also, we show how to extend the dataset to automatically enhance object labels with WordNet, discover object parts, recover a depth ordering of objects in a scene, and increase the number of labels using minimal user supervision and images from the web. |
|||||
2007 | Recursive N-gram Hashing Is Pairwise Independent At Best | Lemire Daniel, Kaser Owen | Computer Speech Language | Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck. For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent. While hashing by irreducible polynomials is pairwise independent, our implementations either run in time O(n) or use an exponential amount of memory. As a more scalable alternative, we make hashing by cyclic polynomials pairwise independent by ignoring n-1 bits. Experimentally, we show that hashing by cyclic polynomials is twice as fast as hashing by irreducible polynomials. We also show that randomized Karp-Rabin hash families are not pairwise independent. |
|||||
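A small sketch of recursive ("rolling") hashing by cyclic polynomials, i.e. rotate-and-XOR updates, which is the family the entry above analyzes; the pairwise-independence fix of discarding n-1 bits is not shown, and the symbol table and word size are illustrative.

```python
import random

BITS, MASK = 32, (1 << 32) - 1
random.seed(1)
H = [random.getrandbits(BITS) for _ in range(256)]   # one random word per symbol

def rotl(x, r):
    return ((x << r) | (x >> (BITS - r))) & MASK

def cyclic_hash(ngram):
    """Non-recursive reference: rotate-and-XOR ('cyclic polynomial') hash."""
    h = 0
    for c in ngram:
        h = rotl(h, 1) ^ H[c]
    return h

def rolling_hashes(data, n):
    """Recursive computation: O(1) update per position by rotating, removing
    the outgoing symbol and XORing in the incoming one."""
    h = cyclic_hash(data[:n])
    yield h
    for i in range(n, len(data)):
        h = rotl(h, 1) ^ rotl(H[data[i - n]], n) ^ H[data[i]]
        yield h

data = bytes(random.randrange(256) for _ in range(100))
assert list(rolling_hashes(data, 5)) == [cyclic_hash(data[i:i + 5])
                                         for i in range(len(data) - 4)]
```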
2007 | The Extended Edit Distance Metric | Fuad Muhammad Marwan Muhammad Valoria, Marteau Pierre-françois Valoria | Content-Based Multimedia Indexing CBMI | Similarity search is an important problem in information retrieval. This similarity is based on a distance. Symbolic representation of time series has attracted many researchers recently, since it reduces the dimensionality of these high dimensional data objects. We propose a new distance metric that is applied to symbolic data objects and we test it on time series databases in a classification task. We compare it to other distances that are well known in the literature for symbolic data objects. We also prove, mathematically, that our distance is a metric. |
|||||
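For context, the classical unit-cost edit distance that this entry extends, as a standard dynamic program; the paper's extended metric for symbolic time series is not reproduced here.

```python
def edit_distance(a, b):
    """Classical Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```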
2007 | Lower Bounds On The Minimum Average Distance Of Binary Codes | Mounits Beniamin | Arxiv | New lower bounds on the minimum average Hamming distance of binary codes are derived. The bounds are obtained using linear programming approach. |
|||||
2007 | A Learning Framework For Nearest Neighbor Search | Lawrence Cayton, Sanjoy Dasgupta | Neural Information Processing Systems | Can we leverage learning techniques to build a fast nearest-neighbor (NN) retrieval data structure? We present a general learning framework for the NN problem in which sample queries are used to learn the parameters of a data structure that minimize the retrieval time and/or the miss rate. We explore the potential of this novel framework through two popular NN data structures: KD-trees and the rectilinear structures employed by locality sensitive hashing. We derive a generalization theory for these data structure classes and present simple learning algorithms for both. Experimental results reveal that learning often improves on the already strong performance of these data structures. |
|||||
2007 | A New Lower Bound For A(17,6,6) | Chee Yeow Meng | Ars Combinatoria Vol. | We construct a record-breaking binary code of length 17, minimal distance 6, constant weight 6, and containing 113 codewords. |
|||||
2006 | One-pass One-hash N-gram Statistics Estimation | Lemire Daniel, Kaser Owen | Arxiv | In multimedia, text or bioinformatics databases, applications query sequences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire an unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass one-hash algorithm is sufficient for accurate estimates if the hashing is sufficiently independent. To reduce costs further, we investigate recursive random hashing algorithms and show that they are sufficiently independent in practice. We compare our running times with exact counts using suffix arrays and show that, while we use hardly any storage, we are an order of magnitude faster. The approach is further extended to a one-pass, one-hash computation of n-gram entropy and iceberg counts. The experiments use a large collection of English text from the Gutenberg Project as well as synthetic data. |
|||||
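An illustrative one-pass, hash-based estimator of the number of distinct n-grams: a generic k-minimum-values sketch rather than the paper's estimator or its recursive hash family; all parameters are arbitrary.

```python
import hashlib, heapq, random

def estimate_distinct_ngrams(text, n, k=256):
    """One pass over the text, one hash per n-gram; keep only the k smallest
    hash values seen (a k-minimum-values sketch) and estimate from the k-th."""
    heap, seen = [], set()        # max-heap (negated) of the k smallest hashes
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        h = int.from_bytes(hashlib.blake2b(gram.encode(), digest_size=8).digest(), "big")
        if h in seen:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            seen.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            seen.discard(evicted)
            seen.add(h)
    if len(heap) < k:             # fewer than k distinct n-grams: exact count
        return len(heap)
    kth = -heap[0]                # k-th smallest hash value observed
    return int((k - 1) * 2.0 ** 64 / kth)

random.seed(0)
text = "".join(random.choice("acgt") for _ in range(100000))
print(estimate_distinct_ngrams(text, n=10))   # rough distinct 10-gram count
```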
2006 | Cascade Hash Tables A Series Of Multilevel Double Hashing Schemes With O(1) Worst Case Lookup Time | Li Shaohua | Arxiv | In this paper, the author proposes a series of multilevel double hashing schemes called cascade hash tables. They use several levels of hash tables. In each table, we use the common double hashing scheme. Higher-level hash tables work as fail-safes of lower-level hash tables. This strategy effectively reduces collisions during insertion, and it achieves a constant worst-case lookup time with a relatively high load factor (70%-85%) in random experiments. Different parameters of cascade hash tables are tested. |
|||||
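A toy rendering of the idea in the entry above: several levels of open-addressed tables, each probed with double hashing for a bounded number of steps, with the next level acting as a fail-safe. Sizes, probe limits, and the hash function are illustrative choices, not the paper's.

```python
import hashlib

def _hashes(key, size):
    """Two hash values for double hashing; the step is forced odd so it is
    co-prime with the power-of-two table size (illustrative helper)."""
    d = hashlib.blake2b(str(key).encode(), digest_size=16).digest()
    return int.from_bytes(d[:8], "big") % size, int.from_bytes(d[8:], "big") | 1

class CascadeHashTable:
    """Toy cascade: each level is an open-addressed table probed at most
    max_probes times with double hashing; on failure the item falls through
    to the next, smaller level, which acts as a fail-safe."""
    def __init__(self, size=1024, levels=4, max_probes=8):
        self.levels = [[None] * (size >> i) for i in range(levels)]
        self.max_probes = max_probes

    def insert(self, key, value):
        for table in self.levels:
            h1, h2 = _hashes(key, len(table))
            for probe in range(self.max_probes):
                slot = (h1 + probe * h2) % len(table)
                if table[slot] is None or table[slot][0] == key:
                    table[slot] = (key, value)
                    return
        raise RuntimeError("all levels exhausted along this probe path")

    def lookup(self, key):
        for table in self.levels:
            h1, h2 = _hashes(key, len(table))
            for probe in range(self.max_probes):
                entry = table[(h1 + probe * h2) % len(table)]
                if entry is not None and entry[0] == key:
                    return entry[1]
        return None

t = CascadeHashTable()
for i in range(700):                 # ~68% combined load factor
    t.insert(i, i * i)
assert all(t.lookup(i) == i * i for i in range(700))
```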
2006 | Linear Probing With Constant Independence | Pagh Anna, Pagh Rasmus, Ruzic Milan | Arxiv | Hashing with linear probing dates back to the 1950s, and is among the most studied algorithms. In recent years it has become one of the most important hash table organizations since it uses the cache of modern computers very well. Unfortunately, previous analyses rely either on complicated and space-consuming hash functions, or on the unrealistic assumption of free access to a truly random hash function. Already Carter and Wegman, in their seminal paper on universal hashing, raised the question of extending their analysis to linear probing. However, we show in this paper that linear probing using a pairwise independent family may have expected {\em logarithmic} cost per operation. On the positive side, we show that 5-wise independence is enough to ensure constant expected time per operation. This resolves the question of finding a space and time efficient hash function that provably ensures good performance for linear probing. |
|||||
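For reference, textbook linear probing; the point of the entry above is how much independence the hash family needs for this probe loop to have constant expected cost, and Python's built-in hash used below carries no such guarantee.

```python
class LinearProbingTable:
    """Textbook linear probing with a doubling resize."""
    def __init__(self, capacity=16):
        self.slots = [None] * capacity

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)   # scan forward to the next slot
        return i

    def put(self, key, value):
        # keep the load factor below ~2/3 so probe sequences stay short
        if sum(s is not None for s in self.slots) * 3 >= 2 * len(self.slots):
            old, self.slots = self.slots, [None] * (2 * len(self.slots))
            for entry in old:
                if entry is not None:
                    self.slots[self._probe(entry[0])] = entry
        self.slots[self._probe(key)] = (key, value)

    def get(self, key, default=None):
        entry = self.slots[self._probe(key)]
        return entry[1] if entry is not None else default

t = LinearProbingTable()
for word in "to be or not to be".split():
    t.put(word, t.get(word, 0) + 1)
print(t.get("to"), t.get("be"), t.get("question"))   # 2 2 None
```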
2005 | Data Tastes Better Seasoned Introducing The ASH Family Of Hashing Algorithms | Capelis D. J. | Arxiv | Over the recent months it has become clear that the current generation of cryptographic hashing algorithms is insufficient to meet future needs. The ASH family of algorithms provides modifications to the existing SHA-2 family. These modifications are designed with two main goals: 1) Providing increased collision resistance. 2) Increasing mitigation of security risks post-collision. The unique public/private sections and salt/pepper design elements provide increased flexibility for a broad range of applications. The ASH family is a new generation of cryptographic hashing algorithms. |
|||||
2005 | Individual Displacements In Hashing With Coalesced Chains | Janson Svante | Arxiv | We study the asymptotic distribution of the displacements in hashing with coalesced chains, for both late-insertion and early-insertion. Asymptotic formulas for means and variances follow. The method uses Poissonization and some stochastic calculus. |
|||||
2005 | New Upper Bounds On A(n,d) | Mounits Beniamin, Etzion Tuvi, Litsyn Simon | Arxiv | Upper bounds on the maximum number of codewords in a binary code of a given length and minimum Hamming distance are considered. New bounds are derived by a combination of linear programming and counting arguments. Some of these bounds improve on the best known analytic bounds. Several new record bounds are obtained for codes with small lengths. |
|||||
2005 | De Dictionariis Dynamicis Pauco Spatio Utentibus | Demaine Erik D., Der Heide Friedhelm Meyer Auf, Pagh Rasmus, Patrascu Mihai | Arxiv | We develop dynamic dictionaries on the word RAM that use asymptotically optimal space, up to constant factors, subject to insertions and deletions, and subject to supporting perfect-hashing queries and/or membership queries, each operation in constant time with high probability. When supporting only membership queries, we attain the optimal space bound of Theta(n lg(u/n)) bits, where n and u are the sizes of the dictionary and the universe, respectively. Previous dictionaries either did not achieve this space bound or had time bounds that were only expected and amortized. When supporting perfect-hashing queries, the optimal space bound depends on the range {1,2,…,n+t} of hashcodes allowed as output. We prove that the optimal space bound is Theta(n lglg(u/n) + n lg(n/(t+1))) bits when supporting only perfect-hashing queries, and it is Theta(n lg(u/n) + n lg(n/(t+1))) bits when also supporting membership queries. All upper bounds are new, as is the Omega(n lg(n/(t+1))) lower bound. |
|||||
2005 | Entropy Based Nearest Neighbor Search In High Dimensions | Panigrahy Rina | Arxiv | In this paper we study the problem of finding the approximate nearest neighbor of a query point in the high dimensional space, focusing on the Euclidean space. The earlier approaches use locality-preserving hash functions (that tend to map nearby points to the same value) to construct several hash tables to ensure that the query point hashes to the same bucket as its nearest neighbor in at least one table. Our approach is different – we use one (or a few) hash table and hash several randomly chosen points in the neighborhood of the query point, showing that at least one of them will hash to the bucket containing its nearest neighbor. We show that the number of randomly chosen points required in the neighborhood of the query point \(q\) depends on the entropy of the hash value \(h(p)\) of a random point \(p\) at the same distance from \(q\) as its nearest neighbor, given \(q\) and the locality preserving hash function \(h\) chosen randomly from the hash family. Precisely, we show that if the entropy \(I(h(p)|q,h) = M\) and \(g\) is a bound on the probability that two far-off points will hash to the same bucket, then we can find the approximate nearest neighbor in \(O(n^\rho)\) time and near linear \(\tilde O(n)\) space where \(\rho = M/\log(1/g)\). Alternatively we can build a data structure of size \(\tilde O(n^{1/(1-\rho)})\) to answer queries in \(\tilde O(d)\) time. By applying this analysis to known locality preserving hash functions and adjusting the parameters we show that the \(c\) nearest neighbor can be computed in time \(\tilde O(n^\rho)\) and near linear space where \(\rho \approx 2.06/c\) as \(c\) becomes large. |
|||||
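A small sketch of the query strategy described above, under assumptions: a single random-hyperplane LSH table, with the query and several randomly perturbed copies of it hashed into that one table and the union of the buckets scanned. All parameters (bits, probes, perturbation radius) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, nbits = 64, 16
X = rng.normal(size=(5000, d))                 # dataset
planes = rng.normal(size=(nbits, d))           # one random-hyperplane LSH function

def bucket(v):
    """LSH key: the sign pattern of v against the random hyperplanes."""
    return tuple((planes @ v > 0).tolist())

table = {}
for idx, x in enumerate(X):
    table.setdefault(bucket(x), []).append(idx)

def query(q, probes=20, radius=0.5):
    """Hash the query plus several random points near it into the *same single*
    table, then scan the union of the buckets hit and return the closest."""
    candidates = set(table.get(bucket(q), []))
    for _ in range(probes):
        p = q + radius * rng.normal(size=d)     # random point near the query
        candidates.update(table.get(bucket(p), []))
    if not candidates:
        return None
    cand = list(candidates)
    return cand[np.argmin(np.linalg.norm(X[cand] - q, axis=1))]

q = X[123] + 0.1 * rng.normal(size=d)
print(query(q))                                 # usually 123
```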
2005 | Estimation Of Intrinsic Dimensionality Using High-rate Vector Quantization | Maxim Raginsky, Svetlana Lazebnik | Neural Information Processing Systems | We introduce a technique for dimensionality estimation based on the notion of quantization dimension, which connects the asymptotic optimal quantization error for a probability distribution on a manifold to its intrinsic dimension. The definition of quantization dimension yields a family of estimation algorithms, whose limiting case is equivalent to a recent method based on packing numbers. Using the formalism of high-rate vector quantization, we address issues of statistical consistency and analyze the behavior of our scheme in the presence of noise. |
|||||
2005 | Duality Between Packings And Coverings Of The Hamming Space | Cohen Gérard, Vardy Alexander | Arxiv | We investigate the packing and covering densities of linear and nonlinear binary codes, and establish a number of duality relationships between the packing and covering problems. Specifically, we prove that if almost all codes (in the class of linear or nonlinear codes) are good packings, then only a vanishing fraction of codes are good coverings, and vice versa: if almost all codes are good coverings, then at most a vanishing fraction of codes are good packings. We also show that any specific maximal binary code is either a good packing or a good covering, in a certain well-defined sense. |
|||||
2005 | A Unifying Class Of Skorokhod Embeddings Connecting The Azema-yor And Vallois Embeddings | Cox A. M. G., Hobson D. G. | Arxiv | In this paper we consider the Skorokhod embedding problem in Brownian motion. In particular, we give a solution based on the local time at zero of a variably skewed Brownian motion related to the underlying Brownian motion. Special cases of the construction include the Azema-Yor and Vallois embeddings. In turn, the construction has an interpretation in the Chacon-Walsh framework. |
|||||
2005 | Phase Transition For Parking Blocks Brownian Excursion And Coalescence | Chassaing Philippe Iec, Louchard Guy Ulb | Random Structures Algorithms | In this paper, we consider hashing with linear probing for a hashing table with m places, n items (n < m), and l = m-n empty places. For a non computer science-minded reader, we shall use the metaphor of n cars parking on m places: each car chooses a place at random, and if this place k is occupied, the car tries successively k+1, k+2, … until it finds an empty place (with the convention that place m+1 is actually place 1). Pittel [42] proves that when l/m goes to some positive limit a < 1, the size of the largest block of consecutive cars is O(log m). In this paper we examine at which level for n a phase transition occurs for the largest block of consecutive cars between o(m) and O(m). The intermediate case reveals an interesting behaviour of sizes of blocks, related to the standard additive coalescent in the same way as the sizes of connected components of the random graph are related to the multiplicative coalescent. |
|||||
2004 | Asymptotic Improvement Of The Gilbert-varshamov Bound On The Size Of Binary Codes | Jiang Tao, Vardy Alexander | IEEE TRANSACTIONS ON INFORMATION THEORY vol. | Given positive integers \(n\) and \(d\), let \(A_2(n,d)\) denote the maximum size of a binary code of length \(n\) and minimum distance \(d\). The well-known Gilbert-Varshamov bound asserts that \(A_2(n,d) \geq 2^n/V(n,d-1)\), where \(V(n,d) = \sum_{i=0}^{d} {n \choose i}\) is the volume of a Hamming sphere of radius \(d\). We show that, in fact, there exists a positive constant \(c\) such that \(A_2(n,d) \geq c \frac{2^n}{V(n,d-1)} \log_2 V(n,d-1)\) whenever \(d/n \le 0.499\). The result follows by recasting the Gilbert-Varshamov bound into a graph-theoretic framework and using the fact that the corresponding graph is locally sparse. Generalizations and extensions of this result are briefly discussed. |
|||||
2004 | Hash Sort A Linear Time Complexity Multiple-dimensional Sort Algorithm | Gilreath William F. | Proceedings of First Southern Symposium on Computing December | Sorting and hashing are two completely different concepts in computer science, and appear mutually exclusive to one another. Hashing is a search method using the data as a key to map to the location within memory, and is used for rapid storage and retrieval. Sorting is a process of organizing data from a random permutation into an ordered arrangement, and is a common activity performed frequently in a variety of applications. Almost all conventional sorting algorithms work by comparison, and in doing so have a linearithmic greatest lower bound on the algorithmic time complexity. Any improvement in the theoretical time complexity of a sorting algorithm results in much larger gains in speed for the applications that use it. Such a sort algorithm needs to use an alternative method for ordering the data than comparison, to exceed the linearithmic time complexity boundary on algorithmic performance. The hash sort is a general purpose non-comparison based sorting algorithm by hashing, which has some interesting features not found in conventional sorting algorithms. The hash sort asymptotically outperforms the fastest traditional sorting algorithm, the quick sort. The hash sort algorithm has a linear time complexity factor – even in the worst case. The hash sort opens an area for further work and investigation into alternative means of sorting. |
|||||
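The core idea of ordering by direct address rather than by comparison can be illustrated with a counting-sort-style routine for bounded integers; this is only a stand-in for intuition, not the paper's multidimensional in-place hash sort.

```python
def hash_like_sort(values, lo, hi):
    """Non-comparison sort for integers known to lie in [lo, hi]: each value
    is 'hashed' to a bucket by its own magnitude, so no element-to-element
    comparisons are ever made (counting-sort flavour)."""
    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1                # the value itself indexes the bucket
    out = []
    for offset, c in enumerate(counts):
        out.extend([lo + offset] * c)
    return out

print(hash_like_sort([7, 3, 9, 3, 1, 7], 0, 9))   # [1, 3, 3, 7, 7, 9]
```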
2004 | The Asymptotic Number Of Binary Codes And Binary Matroids | Wild Marcel | SIAM Journal of Discrete Mathematics | The asymptotic number of nonequivalent binary n-codes is determined. This is also the asymptotic number of nonisomorphic binary n-matroids. The connection to a result of Lefmann, Roedl, Phelps is explored. The latter states that almost all binary n-codes have a trivial automorphism group. |
|||||
2004 | An Investigation Of Practical Approximate Nearest Neighbor Algorithms | Ting Liu, Andrew Moore, Ke Yang, Alexander Gray | Neural Information Processing Systems | This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high dimensional perception areas such as computer vision, with dozens of publications in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hashing (LSH). In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. We also introduce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same random-projection-based approximations that LSH enjoys, but with a simpler algorithm and perhaps with greater efficiency. We then provide a detailed empirical evaluation on five large, high dimensional datasets which show up to 31-fold accelerations over LSH. This result holds true throughout the spectrum of approximation levels. |
|||||
2004 | Efficient Hashing With Lookups In Two Memory Accesses | Panigrahy Rina | Arxiv | The study of hashing is closely related to the analysis of balls and bins. It is well-known that instead of using a single hash function, if we randomly hash a ball into two bins and place it in the smaller of the two, then this dramatically lowers the maximum load on bins. This leads to the concept of two-way hashing, where the largest bucket contains \(O(\log\log n)\) balls with high probability. The hash lookup will now search in both the buckets an item hashes to. Since an item may be placed in one of two buckets, we could potentially move an item after it has been initially placed to reduce the maximum load. We show that by performing moves during inserts, a maximum load of 2 can be maintained on-line, with high probability, while supporting hash update operations. In fact, with \(n\) buckets, even if the space for two items are pre-allocated per bucket, as may be desirable in hardware implementations, more than \(n\) items can be stored giving a high memory utilization. We also analyze the trade-off between the number of moves performed during inserts and the maximum load on a bucket. By performing at most \(h\) moves, we can maintain a maximum load of \(O(\frac{\log\log n}{h \log(\log\log n/h)})\). So, even by performing one move, we achieve a better bound than by performing no moves at all. |
|||||
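A sketch of the basic two-choice placement the entry above starts from (hash to two buckets, insert into the less loaded one); the paper's additional moves during inserts, which bring the maximum load down to 2, are not implemented here, and the bucket count and hash function are illustrative.

```python
import hashlib

def two_hashes(key, m):
    """Two bucket indices for a key (illustrative helper)."""
    d = hashlib.blake2b(str(key).encode(), digest_size=16).digest()
    return int.from_bytes(d[:8], "big") % m, int.from_bytes(d[8:], "big") % m

m = 1000
buckets = [[] for _ in range(m)]

def insert(key):
    """Two-way hashing: place the key in the less loaded of its two buckets."""
    a, b = two_hashes(key, m)
    buckets[a if len(buckets[a]) <= len(buckets[b]) else b].append(key)

def lookup(key):
    a, b = two_hashes(key, m)
    return key in buckets[a] or key in buckets[b]

for k in range(m):                       # load factor 1
    insert(k)
assert all(lookup(k) for k in range(m))
print(max(len(b) for b in buckets))      # O(log log n) w.h.p.; small in practice
```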
2004 | Probabilities Of Randomly Centered Small Balls And Quantization In Banach Spaces | Dereich S., Lifshits M. A. | Annals of Probability | We investigate the Gaussian small ball probabilities with random centers, find their deterministic a.s.-equivalents and establish a relation to infinite-dimensional high-resolution quantization. |
|||||
2003 | Indexing Schemes For Similarity Search In Datasets Of Short Protein Fragments | Stojmirovic Aleksandar, Pestov Vladimir | Information Systems | We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix-based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search has importance in both providing a building block to more complex algorithms and for possible use in direct biological investigations where datasets are of the order of 60 million objects. Our scheme is based on the internal geometry of the amino acid alphabet and performs exceptionally well, for example outputting 100 nearest neighbours to any possible fragment of length 10 after scanning on average less than one per cent of the entire dataset. |
|||||
2003 | PHORMA Perfectly Hashable Order Restricted Multidimensional Arrays | Lins Lauro, Lins Sostenes, Melo Silvio | Arxiv | In this paper we propose a simple and efficient data structure yielding a perfect hashing of quite general arrays. The data structure is named phorma, which is an acronym for perfectly hashable order restricted multidimensional array. Keywords: Perfect hash function, Digraph, Implicit enumeration, Nijenhuis-Wilf combinatorial family. |
|||||
2003 | A Hash Of Hash Functions | Ozsari Turker | Arxiv | In this paper, we present a general review of hash functions in a cryptographic sense. We give special emphasis to some particular topics such as cipher block chaining message authentication code (CBC MAC) and its variants. This paper also broadens the information given in some well-known surveys, by including more details on block-cipher based hash functions and security of different hash schemes. |
|||||
2002 | Indexing Schemes For Similarity Search An Illustrated Paradigm | Pestov Vladimir, Stojmirovic Aleksandar | Fundamenta Informaticae Vol. | We suggest a variation of the Hellerstein–Koutsoupias–Papadimitriou indexability model for datasets equipped with a similarity measure, with the aim of better understanding the structure of indexing schemes for similarity-based search and the geometry of similarity workloads. This in particular provides a unified approach to a great variety of schemes used to index into metric spaces and facilitates their transfer to more general similarity measures such as quasi-metrics. We discuss links between performance of indexing schemes and high-dimensional geometry. The concepts and results are illustrated on a very large concrete dataset of peptide fragments equipped with a biologically significant similarity measure. |
|||||
2000 | A Method For Command Identification Using Modified Collision Free Hashing With Addition Rotation Iterative Hash Functions (part 1) | Skraparlis Dimitrios | Arxiv | This paper proposes a method for identification of a user's fixed string set (which can be a command/instruction set for a terminal or microprocessor). This method is fast and has very small memory requirements, compared to a traditional full string storage and compare method. The user feeds characters into a microcontroller via a keyboard or another microprocessor sends commands and the microcontroller hashes the input in order to identify valid commands, ensuring no collisions between hashed valid strings, while applying further criteria to narrow collision between random and valid strings. The method proposed narrows the possibility of the latter kind of collision, achieving small code and memory-size utilization and very fast execution. Hashing is achieved using additive & rotating hash functions in an iterative form, which can be very easily implemented in simple microcontrollers and microprocessors. Such hash functions are presented and compared according to their efficiency for a given string/command set, using the program found in the appendix. |
|||||
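A minimal sketch of an iterative additive-and-rotate hash of the kind the entry above describes, plus a brute-force search for a table size on which a hypothetical command set hashes without collisions (the command names and rotation amount are invented for illustration):

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK32

def add_rotate_hash(command, table_size):
    """Iterative additive-and-rotate hash: fold each character into a 32-bit
    state with an addition followed by a rotation, then reduce mod the table
    size; cheap enough for a small microcontroller-style dispatcher."""
    h = 0
    for ch in command:
        h = rotl32(h + ord(ch), 5)       # add the symbol, then rotate
    return h % table_size

# Hypothetical command set: find the smallest table size on which the hash is
# collision-free, then build the dispatch table.
commands = ["READ", "WRITE", "ERASE", "RESET", "STATUS", "SLEEP"]
size = len(commands)
while len({add_rotate_hash(c, size) for c in commands}) < len(commands):
    size += 1
table = {add_rotate_hash(c, size): c for c in commands}
print(size, table)
```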
1999 | The MNIST Database of Handwritten Digits | Y. LeCun, C. Cortes, C. Burges | The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. |
|||||
1999 | On The Geometry Of Similarity Search Dimensionality Curse And Concentration Of Measure | Pestov Vladimir | Information Processing Letters | We suggest that the curse of dimensionality affecting the similarity-based search in large datasets is a manifestation of the phenomenon of concentration of measure on high-dimensional structures. We prove that, under certain geometric assumptions on the query domain \(\Omega\) and the dataset \(X\), if \(\Omega\) satisfies the so-called concentration property, then for most query points \(x^\ast\) the ball of radius \((1+\epsilon)d_X(x^\ast)\) centred at \(x^\ast\) contains either all points of \(X\) or else at least \(C_1\exp(-C_2\epsilon^2 n)\) of them. Here \(d_X(x^\ast)\) is the distance from \(x^\ast\) to the nearest neighbour in \(X\) and \(n\) is the dimension of \(\Omega\). |
|||||
1998 | Linear Probing And Graphs | Knuth Donald E. | Algorithmica | Mallows and Riordan showed in 1968 that labeled trees with a small number of inversions are related to labeled graphs that are connected and sparse. Wright enumerated sparse connected graphs in 1977, and Kreweras related the inversions of trees to the so-called ``parking problem'' in 1980. A combination of these three results leads to a surprisingly simple analysis of the behavior of hashing by linear probing, including higher moments of the cost of successful search. |