That's why I said the scaled cosine similarity: the dot product between two vectors is the cosine similarity scaled by the product of the two vectors' norms.
This applies to practically every neural network layer built on a MAD (multiply-add) computational motif, including 3x3 convolutions.
Surprisingly few people know this.
I believe you might have the reduction order a bit backwards -- the dot product reduces to cosine similarity, not the other way around, since cosine similarity is the normalized dot product.
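To make the scaling concrete, here's a tiny pure-Python sketch (the helper names `dot`, `norm`, and `cosine_similarity` and the example vectors are just mine for illustration). It checks that the dot product equals the cosine similarity multiplied back by the two norms:

```python
import math

def dot(a, b):
    # Elementwise multiply, then collect with a single sum.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean (L2) length of the vector.
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    # The dot product with both vector lengths divided out.
    return dot(a, b) / (norm(a) * norm(b))

# Arbitrary example vectors, chosen only for illustration.
a = [1.0, 2.0, 3.0]
b = [4.0, -5.0, 6.0]

# Scaling the cosine similarity by both norms recovers the dot product.
rescaled = norm(a) * norm(b) * cosine_similarity(a, b)
print(dot(a, b), rescaled)
```

So normalizing takes you from the dot product to the cosine similarity, and multiplying by the norms takes you back.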
Any operation in neural networks that multiplies two vectors together elementwise and then collects the products with a single addition operation is a dot product.
So 3x3 convs are technically a dot product, even though they have a spatial dimension. Same for the initial part of most transformers' attention layers, MLP layers, etc....
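Here's a rough sketch of the 3x3 case, assuming a single channel, no padding, and the cross-correlation convention most deep learning frameworks call "convolution" (the function names and the toy image/kernel are hypothetical, just for illustration). Flattening the 3x3 patch and the 3x3 kernel into 9-element vectors turns each output value into a plain dot product:

```python
def dot(a, b):
    # Elementwise multiply, then collect with a single sum.
    return sum(x * y for x, y in zip(a, b))

def conv3x3_at(image, kernel, row, col):
    # Flatten the 3x3 patch and the 3x3 kernel into 9-element vectors;
    # the output at this location is just their dot product.
    patch = [image[row + i][col + j] for i in range(3) for j in range(3)]
    weights = [kernel[i][j] for i in range(3) for j in range(3)]
    return dot(patch, weights)

def conv3x3_loops(image, kernel, row, col):
    # The usual nested-loop form of the same multiply-add, for comparison.
    total = 0.0
    for i in range(3):
        for j in range(3):
            total += image[row + i][col + j] * kernel[i][j]
    return total

# Toy single-channel 4x4 image and a vertical-edge-style kernel.
image = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
kernel = [
    [1.0, 0.0, -1.0],
    [1.0, 0.0, -1.0],
    [1.0, 0.0, -1.0],
]

print(conv3x3_at(image, kernel, 0, 0), conv3x3_loops(image, kernel, 0, 0))
```

The spatial dimension only decides which patch gets flattened; the arithmetic at each location is the same multiply-and-sum.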
The overlap can be a bit tricky to deal with in terms of implications, but I've found it a very helpful formulation for squeezing out some performance boosts in previous implementations of neural networks that I've worked on.
Hope this helps, feel free to let me know if you have any more questions/thoughts/etc, love! <3 :)) :D :fireworks: