Hacker Timesnew | past | comments | ask | show | jobs | submitlogin

I remember learning about locality sensitive hashing in school a decade ago, but since then I’ve never seen it used in an ML system in industry. Does anyone actually use this technique at scale? If you have, I’d love to hear about your application and experience with it.


The recommender system mention is very valuable use case that I've seen at some of the largest ml using companies in the world. If you have a recommendation system on millions or much higher pool of content you will need an ANN index. I would expect google image search is using an ANN index. Facebook definitely uses ANN indices. Spotify is where Annoy comes from. They're pretty key at tiktok.

Whether you use locality sensitive hashing or something else is a separate question. Personal experience is graph based ANN indices (hnsw is one nice one) have tended to a bit better, but LSH is competitive so I wouldn't strongly be against a design decision choosing it. One downside I've seen with some ann index libraries is a lot of them don't support incremental updates/deletions and force you to build the index as a batch job. That's fine in some use cases, but breaks others. LSH based approach spotify uses doesn't support incremental updates, but that's not because LSH can't support it just Annoy wasn't designed for it.


It's widely used in recommenders based on embeddings. See:

https://github.com/spotify/annoy

https://github.com/facebookresearch/faiss




It was used extensively in classic computer vision descriptor matching including most image lookup e.g. google image search today. There are many reasons it not much used in recent computer vision. The first is that it does not work when the descriptors themselves are explicitly designed to make visually similar features descriptively distinct, and the second is that it generally works poorly for binary vectors which dominate. With regards to ml, its been tried, but the datasets are either too small or not available to researchers, and the result would be far too slow compared to the current approaches. Its actually one of the areas where deep learning hasn't reached parity to my current knowledge. It would be a great project, the dataset issue aside.


For binary vectors you can choose a different distance metric (not geometric one, i.e. Jaccard) that can be used to effectively hash similar data points into similar buckets.

Treating your binary vector as a set allows you to use min-hashing as your LSH schema (min-hashing is just a random permutation of the given set). This simple trick makes LSH with min-hashing quite a powerful tool for binary vectors that are extensively used in recommenders systems and other domains.

I've used LSH + Min-Hash for image search (and subsequently for audio fingerprinting). If interested, I've blogged about it here [1].

[1] - https://emysound.com/blog/open-source/2020/06/12/how-audio-f...


Agree. Also cosine distance.


Locality sensitive hashing is extensively used in bioinformatics and genomics.

mash is probably the best-known implementation (https://doi.org/10.1186/s13059-016-0997-x). It allows you to do highly efficient pairwise comparisons of genome sequences, where it provides a good approximates their actual pairwise sequence divergence.

Is MinHash used industrially outside of genomics? In theory, it'd be useful on any kind of text corpus. The design of the hash functions that the sketch is built from is non-trivial The hashes typically come from k-mers of a fixed length and various fast hash functions over them.


I’ve used it in an HFT system. Basically as a form of latency-minimizing model compression for evaluation in production.

The problem was that the base layer model (tree ensembles) did very well in terms of training error but was much too slow to evaluate in production. With LSH, you can start with a bunch of points from the base model, then train the compressed model against the reference samples.


It was used in https://ai.googleblog.com/2020/01/reformer-efficient-transfo... for faster and more memory-efficient attention.


Academic example but I'm sure it's made it's way to this industry as well: in bioinformatics, this has proven tremendously useful for approximate string matching to do things like identify microbes in sequenced microbiome samples - cf https://genomebiology.biomedcentral.com/articles/10.1186/s13...


The general idea of quantizing points so that similar points fall into the same bucket or a near bucket is a powerful one.

It's more appropriately used to answer Ball queries (i.e. find all points within distance r of the query) than near neighbor queries.

I've used it in multiple context :

-Low dimension : Computing all pair-wise interaction for nearby particles in physics simulations : You hash particles positions to cells, and you check interactions between nearby cells.

-Image Similarity search for candidates to loop closure in SLAM algorithms (Depending on the number of images you have brute-forcing in hamming space is often sufficient)

-Visual vocabulary, each key-point has a descriptor value, when you use opencv with SIFT/SURF for matching you can use Flann with LSH.

-K-mean variant clustering with large K, when K is large you need an index to find the closer point efficiently (used to compute the visual vocabulary above)

-Neural networks : for example it can be very tempting to put some kind of LSH index on the GPU as a standard neural-network layer, but if it's too small you are better with standard brute-force, and if it's too big you run out of memory. So you are looking for the sweet-spot. But remembering that if you have an index with 10M elements and you update sparsely only 10 elements by iteration, you'll need at least 1M iterations before each element has moved. So your algorithm will take a long time to converge.

-P2P indices : Tables of hashes that can be used to recover similar data when querying them.

The version of LSH described in the article is not often used because in real data you can often take advantage of some hierarchical representation inside the data.

It's among one of the many space-partitioning technique with multiple overlapping partitions.

Depending on the constraints on your data, it's hard to make it shine. It shines best when there are large quantity of data, but it consumes a lot of memory so it may be useful to have the index on disk, and ideally you'd like to be able to update the index incrementally. The various existing libraries work well for the specific situations they are designed to handle, but quite often you are still better rolling your own space-partitioning technique for your specific constraints.

Also quite often instead of an approximate search and have to fiddle with some tuning parameters, it's better to use some exact search : If you can project into hamming spaces, there are fast and exact algorithms (although memory intensive) based on the pigeon-hole principle.


It's used in the cybersecurity world for fuzzy file comparison. See https://github.com/trendmicro/tlsh


A lifetime ago I used it on a document clustering system. It is not very good as a general similarity function, but excellent for quickly finding near duplicates in linear time. About half of the documents of set I was working on (news articles) were duplicates, so an early removal pass would speed up the actual clustering algorithm pass.

My recollection is a bit fuzzy, but I remember specifically using Min-Hash together with random projection.


We used it at work (anti-cybercrime / Phishing focused) company in conjunction with our crawler for clustering of phishing landing pages.

Typically, you'll have loads of script kiddies hosting slightly modified copies of the Apple ID landing page on a compromised web server... Alternatively you'd have loads of these pages build out of "exploit frameworks" / "kits" so we'd want to categorize and group to identify prevalence of a given framework / author.

Had the nice side-effect of prioritizing, speeding up, and automating takedowns for our SOC folks.


It is being used in the reformer architecture to relate the dot product attention layer. This enables much longer input sequences and reduces computational complexity from O(n^2) to O(nlog(n)), where n in the length of the sequence.

https://huggingface.co/transformers/model_doc/reformer.html


I've used it for a wide variety of search and recommendation problems, albeit not in isolation. In practice, I've found that it is a great way to reduce the size of a search space before handing the problem off to a more precise algorithm.


I used it in ecommerce to find duplicate product titles.


When I was in ByteDance, it was used extensively to deduplicate news articles.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: