Damon Who

Posted on May 21, 2024

Is Cosine Similarity of Embeddings Really About Similarity?

#llm #ai

Authors: Harald Steck, Chaitanya Ekanadham, Nathan Kallus

Introduction

Cosine similarity is a widely used metric in various domains, particularly for quantifying the similarity between high-dimensional objects via learned low-dimensional embeddings. It is calculated as the cosine of the angle between two vectors, or equivalently, the dot product of their normalized versions. This measure is prevalent in fields like natural language processing (NLP) and recommender systems, where embeddings capture semantic similarities. However, recent studies have shown that cosine similarity can sometimes yield arbitrary or even meaningless results. This blog post explores why this happens and what alternatives might offer more reliable insights.

Understanding Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors. For vectors $A$ and $B$, it is defined as:

$\text{cosSim}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$

This measure focuses on the direction rather than the magnitude of the vectors, making it useful for identifying similarity in high-dimensional spaces where the actual values (magnitude) of the vectors are less important than their orientation.
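As a minimal sketch of this definition (the helper name `cos_sim` and the example vectors are illustrative assumptions, not taken from the paper), the following NumPy snippet computes the score and checks that rescaling either vector by a positive factor leaves it unchanged:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative vectors (hypothetical values)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.5])

# A vector compared with a positively scaled copy of itself
same_direction = cos_sim(a, 10.0 * a)

# Rescaling both inputs by positive factors does not change the score
original = cos_sim(a, b)
rescaled = cos_sim(3.0 * a, 0.5 * b)
```

The score depends only on direction: `same_direction` comes out as 1 up to floating-point error, and `original` matches `rescaled`.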

The Problem with Cosine Similarity

Despite its popularity, cosine similarity can sometimes lead to inconsistent or arbitrary results. To understand why, let's consider embeddings derived from regularized linear models, which allow us to derive analytical insights.

Matrix Factorization Models

Matrix factorization (MF) models are commonly used in recommender systems. Given a user-item interaction matrix $X$ with one row per user and one column per item, they learn two low-rank matrices $A$ and $B$, each with one row per item and $k$ columns, such that:

$X \approx XAB^\top$

Here, the rows of $XA$ serve as user embeddings and the rows of $B$ as item embeddings. The model is trained so that the dot product of a user embedding and an item embedding approximates the corresponding entry of $X$, yet cosine similarity is often applied afterwards to compare embeddings with one another.

Regularization and Its Effects

Regularization techniques are applied during training to prevent overfitting. Two common regularization schemes in MF models are:

Joint Regularization:

$\min_{A,B} \|X - XAB^\top\|_F^2 + \lambda \|AB^\top\|_F^2$

Separate Regularization:

$\min_{A,B} \|X - XAB^\top\|_F^2 + \lambda (\|XA\|_F^2 + \|B\|_F^2)$

The first scheme penalizes the product $AB^\top$ as a whole, while the second penalizes the user embeddings $XA$ and the item embeddings $B$ individually. These different schemes result in different properties of the learned embeddings and their cosine similarities.
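To make the two objectives concrete, here is a small sketch that evaluates both for given factors. The function names, the random data, and the particular diagonal matrix `D` are assumptions for illustration only. It also checks how each objective responds when the columns of the two factors are rescaled in opposite directions.

```python
import numpy as np

def joint_objective(X, A, B, lam):
    # ||X - X A B^T||_F^2 + lam * ||A B^T||_F^2
    W = A @ B.T
    return np.linalg.norm(X - X @ W, "fro") ** 2 + lam * np.linalg.norm(W, "fro") ** 2

def separate_objective(X, A, B, lam):
    # ||X - X A B^T||_F^2 + lam * (||X A||_F^2 + ||B||_F^2)
    XA = X @ A
    fit = np.linalg.norm(X - XA @ B.T, "fro") ** 2
    penalty = np.linalg.norm(XA, "fro") ** 2 + np.linalg.norm(B, "fro") ** 2
    return fit + lam * penalty

# Hypothetical data and factors, not learned from anything
rng = np.random.default_rng(0)
n_users, n_items, k = 20, 8, 3
X = rng.random((n_users, n_items))
A = rng.normal(size=(n_items, k))
B = rng.normal(size=(n_items, k))
lam = 0.5

# Rescale columns of A by D and columns of B by D^{-1}
D = np.diag([0.1, 1.0, 10.0])
A_scaled, B_scaled = A @ D, B @ np.linalg.inv(D)

joint_before = joint_objective(X, A, B, lam)
joint_after = joint_objective(X, A_scaled, B_scaled, lam)
sep_before = separate_objective(X, A, B, lam)
sep_after = separate_objective(X, A_scaled, B_scaled, lam)
```

In this sketch the joint objective is unaffected by the rescaling because it only sees the product of the two factors, whereas the separate objective changes because its penalty acts on each factor on its own.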

Arbitrary Results from Cosine Similarity

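The key insight is that the cosine similarity of embeddings learned under joint regularization can be arbitrary, because the objective is invariant under rescaling of $A$ and $B$. If $\hat{A}$ and $\hat{B}$ are solutions, then $\hat{A}D$ and $\hat{B}D^{-1}$ are also solutions for any invertible diagonal matrix $D$, since $\hat{A}D(\hat{B}D^{-1})^\top = \hat{A}\hat{B}^\top$. The predictions stay exactly the same, but the rescaling changes how each embedding is normalized and therefore changes the cosine similarities, leading to potentially meaningless results. Under separate regularization the same rescaling alters the penalty term, so the solution does not have this freedom.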
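The effect of such a rescaling can be sketched numerically. The factors below are random stand-ins rather than trained embeddings, and `row_cos_sim` and the chosen `D` are hypothetical names and values introduced for this example.

```python
import numpy as np

def row_cos_sim(M):
    """Pairwise cosine similarity between the rows of M."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

# Random stand-ins for the learned factors (assumption for illustration)
rng = np.random.default_rng(1)
n_items, k = 6, 3
A = rng.normal(size=(n_items, k))
B = rng.normal(size=(n_items, k))

# Any invertible diagonal matrix works; these values are arbitrary
D = np.diag([5.0, 1.0, 0.2])
A_d = A @ D
B_d = B @ np.linalg.inv(D)

# The product A B^T, and hence every prediction X A B^T, is unchanged
product_unchanged = np.allclose(A @ B.T, A_d @ B_d.T)

# The item-item cosine similarities are not
max_cos_shift = np.abs(row_cos_sim(B) - row_cos_sim(B_d)).max()
```

Here `product_unchanged` is true while `max_cos_shift` is clearly nonzero, which is the sense in which the cosine similarities depend on an arbitrary choice.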

Example Scenarios

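Consider two different choices for the diagonal matrix $D$, where $\operatorname{dMat}(\ldots)$ denotes a diagonal matrix and $\sigma_i$ is the $i$-th singular value of $X$:

Case 1:

$D = \operatorname{dMat}(\ldots, \frac{1}{1+\lambda/\sigma_i^2}, \ldots)^{1/2}$

Case 2:

$D = \operatorname{dMat}(\ldots, \frac{1}{1+\lambda/\sigma_i^2}, \ldots)^{-1/2}$

Both choices give the same predictions $X\hat{A}\hat{B}^\top$, yet they yield different cosine similarities between the embeddings, demonstrating the non-uniqueness and arbitrariness of cosine similarity in this context.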
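The two cases can be checked on a toy example. This sketch assumes the closed-form solution of the jointly regularized objective reported in the paper, $\hat{A} = \hat{B} = V_k \cdot \operatorname{dMat}(\ldots, \frac{1}{1+\lambda/\sigma_i^2}, \ldots)^{1/2}$ with $V_k$ the top-$k$ right singular vectors of $X$, and uses the full-rank setting ($k$ equal to the number of items) to keep the algebra simple. The data is random and all variable names are hypothetical.

```python
import numpy as np

def row_cos_sim(M):
    """Pairwise cosine similarity between the rows of M."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(2)
n_users, n_items = 30, 5
X = rng.random((n_users, n_items))   # toy interaction matrix (assumption)
lam = 1.0

# Thin SVD: X = U diag(sigma) V^T
_, sigma, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
shrink = 1.0 / (1.0 + lam / sigma**2)

# Assumed closed form: A_hat = B_hat = V dMat(shrink)^{1/2}
A_hat = V * np.sqrt(shrink)          # scales column i of V by sqrt(shrink[i])
B_hat = V * np.sqrt(shrink)

# Sanity check: in the full-rank case the product is the ridge solution
ridge = np.linalg.solve(X.T @ X + lam * np.eye(n_items), X.T @ X)
matches_ridge = np.allclose(A_hat @ B_hat.T, ridge)

# Case 1: D = dMat(shrink)^{1/2}, so the item embeddings B_hat D^{-1} equal V
item_sim_case1 = row_cos_sim(B_hat / np.sqrt(shrink))

# Case 2: D = dMat(shrink)^{-1/2}, so the item embeddings equal V dMat(shrink)
item_sim_case2 = row_cos_sim(B_hat * np.sqrt(shrink))

# Case 2, user side: A_hat D equals V, so the user embeddings are X V
user_sim_case2 = row_cos_sim(X @ (A_hat / np.sqrt(shrink)))
```

In this full-rank toy setting, Case 1 makes every item similar only to itself (the item-item similarity matrix is the identity), while Case 2 makes the user-user similarities coincide with cosine similarity on the raw rows of $X$. The model and its predictions are identical in both cases.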

Remedies and Alternatives

To address the issues with cosine similarity, we propose the following remedies:

Training with Cosine Similarity: Train the model directly with respect to cosine similarity, which techniques like layer normalization can facilitate.
Avoiding Embedding Space: Instead of comparing the learned embeddings, project back into the original space and apply cosine similarity there. For instance, treat $X\hat{A}\hat{B}^\top$ as a smoothed version of $X$ and compare its rows.
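The second remedy can be sketched as follows. The factors are random stand-ins for learned embeddings, and the names `row_cos_sim`, `smoothed`, and the matrix `D` are assumptions made for this illustration. The point is that similarities computed on the smoothed matrix do not depend on how the factors happen to be scaled.

```python
import numpy as np

def row_cos_sim(M):
    """Pairwise cosine similarity between the rows of M."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(3)
n_users, n_items, k = 12, 6, 3
X = rng.random((n_users, n_items))        # toy interaction matrix
A_hat = rng.normal(size=(n_items, k))     # stand-in for learned factor
B_hat = rng.normal(size=(n_items, k))     # stand-in for learned factor

# An equivalent solution obtained by an arbitrary diagonal rescaling
D = np.diag([4.0, 1.0, 0.25])
A_d, B_d = A_hat @ D, B_hat @ np.linalg.inv(D)

# Project back into the original item space
smoothed = X @ A_hat @ B_hat.T
smoothed_d = X @ A_d @ B_d.T

# User-user similarity in the original space vs. in the embedding space
user_sim = row_cos_sim(smoothed)
user_sim_d = row_cos_sim(smoothed_d)
embedding_sim = row_cos_sim(X @ A_hat)
embedding_sim_d = row_cos_sim(X @ A_d)
```

In this sketch `user_sim` and `user_sim_d` agree, whereas `embedding_sim` and `embedding_sim_d` do not, so the comparison in the original space is unaffected by the rescaling ambiguity.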

Practical Recommendations

Standardize Data: Apply normalization or reduction of popularity bias before or during training. Techniques like negative sampling or inverse propensity scaling can be effective.
Evaluate Alternatives: Consider alternative similarity measures like unnormalized dot-products, especially when working with embeddings derived from linear models.

Experimental Validation

To validate our findings, we conducted experiments on simulated data where the ground-truth semantic similarities are known. The results confirm that different regularization techniques and choices of rescaling matrices significantly impact the resulting cosine similarities, often leading to inconsistent outcomes.

Conclusion

Cosine similarity is a powerful tool for measuring similarity in high-dimensional spaces, but it is not without pitfalls. The choice of regularization and the inherent properties of the embeddings can lead to arbitrary and non-unique results. Therefore, it is crucial to carefully consider these factors and explore alternative approaches to ensure meaningful and reliable similarity measurements.

References

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization, 2016. arXiv:1607.06450.
R. Jin, et al. Towards a better understanding of linear models for recommendation. In ACM KDD, 2021.
V. Karpukhin, et al. Dense passage retrieval for open-domain question answering, 2020. arXiv:2004.04906v3.
O. Khattab and M. Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, 2020. arXiv:2004.12832v2.
T. Mikolov, et al. Efficient estimation of word representations in vector space, 2013. arXiv:1301.3781.
H. Steck. Autoencoders that don’t overfit towards the identity. In NeurIPS, 2020.
S. Zheng, et al. Regularized singular value decomposition and application to recommender system, 2018. arXiv:1804.05090.
K. Zhou, et al. Problems with cosine as a measure of embedding similarity for high frequency words. In ACL, 2022.
