C-OPH Improving The Accuracy Of One Permutation Hashing (OPH) With Circulant Permutations
Li Xiaoyun, Li Ping. Arxiv 2021
[Paper]
ARXIV
Independent
Minwise hashing (MinHash) is a classical method for efficiently estimating
the Jaccrad similarity in massive binary (0/1) data. To generate hash
values for each data vector, the standard theory of MinHash requires
independent permutations. Interestingly, the recent work on “circulant MinHash”
(C-MinHash) has shown that merely two permutations are needed. The first
permutation breaks the structure of the data and the second permutation is
re-used time in a circulant manner. Surprisingly, the estimation accuracy
of C-MinHash is proved to be strictly smaller than that of the original
MinHash. The more recent work further demonstrates that practically only one
permutation is needed. Note that C-MinHash is different from the well-known
work on “One Permutation Hashing (OPH)” published in NIPS’12. OPH and its
variants using different “densification” schemes are popular alternatives to
the standard MinHash. The densification step is necessary in order to deal with
empty bins which exist in One Permutation Hashing.
In this paper, we propose to incorporate the essential ideas of C-MinHash to
improve the accuracy of One Permutation Hashing. Basically, we develop a new
densification method for OPH, which achieves the smallest estimation variance
compared to all existing densification schemes for OPH. Our proposed method is
named C-OPH (Circulant OPH). After the initial permutation (which breaks the
existing structure of the data), C-OPH only needs a “shorter” permutation of
length (instead of ), where is the original data dimension and
is the total number of bins in OPH. This short permutation is re-used in
bins in a circulant shifting manner. It can be shown that the estimation
variance of the Jaccard similarity is strictly smaller than that of the
existing (densified) OPH methods.
Similar Work