Authors: Harald Steck, Chaitanya Ekanadham, Nathan Kallus
Introduction
Cosine similarity is a widely used metric in various domains, particularly for quantifying the similarity between high-dimensional objects via learned low-dimensional embeddings. It is calculated as the cosine of the angle between two vectors, or equivalently, the dot product of their normalized versions. This measure is prevalent in fields like natural language processing (NLP) and recommender systems, where embeddings capture semantic similarities. However, recent studies have shown that cosine similarity can sometimes yield arbitrary or even meaningless results. This blog post explores why this happens and what alternatives might offer more reliable insights.
Understanding Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors. For vectors ( A ) and ( B ), it is defined as:
$\text{cosSim}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$
This measure focuses on the direction rather than the magnitude of the vectors, making it useful for identifying similarity in high-dimensional spaces where the actual values (magnitude) of the vectors are less important than their orientation.
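A minimal NumPy sketch of this measure, computed as the dot product divided by the product of the Euclidean norms. The function name `cosine_similarity` and the example vectors are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(a @ b / norm_product)

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                     # same direction as a, larger magnitude
c = np.array([3.0, 0.0, -1.0])   # orthogonal to a: 3 + 0 - 3 = 0

sim_scaled = cosine_similarity(a, b)       # magnitude is ignored
sim_orthogonal = cosine_similarity(a, c)   # no shared direction
```

Scaling a vector leaves the result unchanged, which is exactly the property that becomes a problem once the embedding dimensions themselves can be rescaled.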
The Problem with Cosine Similarity
Despite its popularity, cosine similarity can sometimes lead to inconsistent or arbitrary results. To understand why, let's consider embeddings derived from regularized linear models, which allow us to derive analytical insights.
Matrix Factorization Models
Matrix factorization (MF) models are commonly used in recommender systems. They approximate a user-item interaction matrix ( X ) using two lower-dimensional matrices ( A ) and ( B ):
$X \approx XAB^\top$
Here, the rows of ( XA ) serve as user embeddings and the rows of ( B ) serve as item embeddings. Typically, the dot product of these embeddings is used to approximate ( X ), but cosine similarity is often applied afterwards to measure the similarity between two items or between two users.
Regularization and Its Effects
Regularization techniques are applied during training to prevent overfitting. Two common regularization schemes in MF models are:

Joint Regularization:
$\min_{A,B} \|X - XAB^\top\|_F^2 + \lambda \|AB^\top\|_F^2$
Separate Regularization:
$\min_{A,B} \|X - XAB^\top\|_F^2 + \lambda (\|XA\|_F^2 + \|B\|_F^2)$
The first scheme regularizes the product ( AB^\top ), while the second regularizes ( XA ) and ( B ) separately. These different schemes result in different properties of the learned embeddings and their cosine similarities.
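The two training objectives can be written down directly in NumPy. This is a sketch for evaluating the objectives at given ( A ) and ( B ), not a training procedure. The function names, the tiny matrix `X`, and the choice of identity matrices (a degenerate case with no dimensionality reduction, picked only because the values are easy to check by hand) are assumptions for illustration.

```python
import numpy as np

def joint_objective(X, A, B, lam):
    """||X - X A B^T||_F^2 + lam * ||A B^T||_F^2"""
    P = A @ B.T
    fit = np.linalg.norm(X - X @ P, "fro") ** 2
    return fit + lam * np.linalg.norm(P, "fro") ** 2

def separate_objective(X, A, B, lam):
    """||X - X A B^T||_F^2 + lam * (||X A||_F^2 + ||B||_F^2)"""
    fit = np.linalg.norm(X - X @ A @ B.T, "fro") ** 2
    penalty = np.linalg.norm(X @ A, "fro") ** 2 + np.linalg.norm(B, "fro") ** 2
    return fit + lam * penalty

# Hypothetical 2-user, 3-item interaction matrix.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
I = np.eye(3)
lam = 0.5

# With A = B = I the fit term is zero, so only the penalties remain:
# joint:    lam * ||I||_F^2               = 0.5 * 3       = 1.5
# separate: lam * (||X||_F^2 + ||I||_F^2) = 0.5 * (4 + 3) = 3.5
joint_value = joint_objective(X, I, I, lam)
separate_value = separate_objective(X, I, I, lam)
```

The joint penalty only ever sees the product ( AB^\top ), whereas the separate penalty sees each factor on its own.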
Arbitrary Results from Cosine Similarity
The key insight is that the cosine similarity of embeddings learned under joint regularization can be arbitrary, because the solution is invariant under rescaling of ( A ) and ( B ). For example, if ( \hat{A} ) and ( \hat{B} ) are solutions, then ( \hat{A}D ) and ( \hat{B}D^{-1} ) are also solutions for any invertible diagonal matrix ( D ), since the product ( \hat{A}D(\hat{B}D^{-1})^\top = \hat{A}\hat{B}^\top ) is unchanged. The rescaling does change the normalization of the embedding vectors, however, so the resulting cosine similarities depend on an arbitrary choice of ( D ) and can be meaningless.
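The effect of the rescaling can be checked numerically. In the sketch below, `A_hat` and `B_hat` are random stand-ins rather than trained solutions, and the diagonal entries of `D` are arbitrary; both are assumptions made only to illustrate the invariance argument.

```python
import numpy as np

def row_cosine_similarity(M):
    """Pairwise cosine similarity between the rows of M."""
    unit_rows = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit_rows @ unit_rows.T

rng = np.random.default_rng(0)
n_items, k = 5, 3                          # hypothetical sizes
A_hat = rng.normal(size=(n_items, k))      # stand-in for a learned A
B_hat = rng.normal(size=(n_items, k))      # stand-in for learned item embeddings

D = np.diag([0.1, 1.0, 10.0])              # arbitrary invertible diagonal matrix
A_scaled = A_hat @ D
B_scaled = B_hat @ np.linalg.inv(D)

# The model's predictions depend only on the product, which is unchanged.
product_unchanged = np.allclose(A_hat @ B_hat.T, A_scaled @ B_scaled.T)

# The item-item cosine similarities, however, are not preserved.
cosine_unchanged = np.allclose(
    row_cosine_similarity(B_hat), row_cosine_similarity(B_scaled)
)
```

Both pairs of factors describe the same model, yet they disagree about which items are similar.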
Example Scenarios
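Consider two different choices for the diagonal matrix ( D ), where ( \sigma_i ) denotes the i-th singular value of ( X ) and dMat(...) denotes a diagonal matrix with the given entries:

Case 1:
$D = \mathrm{dMat}(..., \frac{1}{1+\lambda/\sigma_i^2}, ...)^{1/2}$
Case 2:
$D = \mathrm{dMat}(..., \frac{1}{1+\lambda/\sigma_i^2}, ...)^{-1/2}$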
Each choice yields different cosine similarities for the same embeddings, demonstrating the non-uniqueness and arbitrariness of cosine similarity in this context.
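The two cases can be worked through numerically. The sketch below assumes the closed-form solution from the underlying paper, which this post does not spell out: with the singular value decomposition ( X = U \Sigma V^\top ), one solution of the joint objective is ( \hat{A} = \hat{B} = V \cdot \mathrm{dMat}(..., \frac{1}{1+\lambda/\sigma_i^2}, ...)^{1/2} ). The random matrix `X`, the value of `lam`, and the full-rank setting (no truncation) are further assumptions for illustration.

```python
import numpy as np

def row_cosine_similarity(M):
    """Pairwise cosine similarity between the rows of M."""
    unit_rows = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit_rows @ unit_rows.T

rng = np.random.default_rng(0)
n_users, n_items = 30, 4                  # hypothetical sizes
X = rng.random((n_users, n_items))
lam = 1.0

# SVD of X; V holds the right singular vectors as columns.
_, sigma, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
shrink = 1.0 / (1.0 + lam / sigma**2)     # diagonal entries 1 / (1 + lam / sigma_i^2)

# Assumed closed-form solution (full rank): A_hat = B_hat = V @ diag(shrink)^(1/2)
A_hat = V * np.sqrt(shrink)
B_hat = A_hat.copy()

D1 = np.diag(shrink ** 0.5)               # Case 1
D2 = np.diag(shrink ** -0.5)              # Case 2

# Case 1: item embeddings become V, whose rows are orthonormal here.
items_case1 = row_cosine_similarity(B_hat @ np.linalg.inv(D1))
# Case 2: item embeddings become V @ diag(shrink).
items_case2 = row_cosine_similarity(B_hat @ np.linalg.inv(D2))
# Case 2: user embeddings X @ A_hat @ D2 become X @ V.
users_case2 = row_cosine_similarity(X @ A_hat @ D2)
```

Under these assumptions, Case 1 makes every item similar only to itself (the item-item matrix is the identity), while Case 2 gives user-user similarities identical to those computed from the raw rows of ( X ), so the learned embeddings add nothing.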
Remedies and Alternatives
To address the issues with cosine similarity, we propose the following remedies:
 Training with Cosine Similarity: Adjust the training process to directly optimize for cosine similarity, which can be facilitated by techniques like layer normalization.
 Avoiding Embedding Space: Instead of using the learned embeddings, project back into the original space and apply cosine similarity there. For instance, use ( X\hat{A}\hat{B}^\top ) as a smoothed version of ( X ).
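The second remedy can be sketched as follows. The factors `A_hat` and `B_hat` are again random stand-ins rather than trained solutions, and the code assumes that the rows of ( X ) are users and the columns are items, so item-item similarity is computed between columns of the smoothed matrix.

```python
import numpy as np

def column_cosine_similarity(M):
    """Pairwise cosine similarity between the columns of M."""
    unit_cols = M / np.linalg.norm(M, axis=0, keepdims=True)
    return unit_cols.T @ unit_cols

rng = np.random.default_rng(0)
X = rng.random((30, 6))                    # hypothetical user-item matrix
A_hat = rng.normal(size=(6, 3))            # stand-ins for learned factors
B_hat = rng.normal(size=(6, 3))
D = np.diag([0.1, 1.0, 10.0])              # arbitrary rescaling

# Smoothed version of X, computed with and without the rescaling.
smoothed = X @ A_hat @ B_hat.T
smoothed_rescaled = X @ (A_hat @ D) @ (B_hat @ np.linalg.inv(D)).T

# Item-item similarities in the original space.
item_sim = column_cosine_similarity(smoothed)
item_sim_rescaled = column_cosine_similarity(smoothed_rescaled)
```

Because the smoothed matrix depends only on the product of the factors, the similarities computed from it do not change with ( D ).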
Practical Recommendations
 Standardize Data: Apply normalization or reduction of popularity bias before or during training. Techniques like negative sampling or inverse propensity scaling can be effective.
 Evaluate Alternatives: Consider alternative similarity measures like unnormalized dot products, especially when working with embeddings derived from linear models.
Experimental Validation
To validate our findings, we conducted experiments on simulated data where the ground-truth semantic similarities are known. The results confirm that different regularization techniques and choices of rescaling matrices significantly impact the resulting cosine similarities, often leading to inconsistent outcomes.
Conclusion
Cosine similarity is a powerful tool for measuring similarity in high-dimensional spaces, but it is not without pitfalls. The choice of regularization and the inherent properties of the embeddings can lead to arbitrary and non-unique results. Therefore, it is crucial to carefully consider these factors and explore alternative approaches to ensure meaningful and reliable similarity measurements.
References
 J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization, 2016. arXiv:1607.06450.
 R. Jin, et al. Towards a better understanding of linear models for recommendation. In ACM KDD, 2021.
 V. Karpukhin, et al. Dense passage retrieval for open-domain question answering, 2020. arXiv:2004.04906v3.
 O. Khattab and M. Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, 2020. arXiv:2004.12832v2.
 T. Mikolov, et al. Efficient estimation of word representations in vector space, 2013. arXiv:1301.3781.
 H. Steck. Autoencoders that don’t overfit towards the identity. In NeurIPS, 2020.
 S. Zheng, et al. Regularized singular value decomposition and application to recommender system, 2018. arXiv:1804.05090.
 K. Zhou, et al. Problems with cosine as a measure of embedding similarity for high frequency words. In ACL, 2022.