Keywords: TF-IDF | Cosine Similarity | Python Implementation | Document Similarity | scikit-learn
Abstract: This article explores the method of calculating document similarity using TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity. Through Python implementation, it details the entire process from text preprocessing to similarity computation, including the application of CountVectorizer and TfidfTransformer, and how to compute cosine similarity via custom functions and loops. Based on practical code examples, the article explains the construction of TF-IDF matrices, vector normalization, and compares the advantages and disadvantages of different approaches, providing practical technical guidance for information retrieval and text mining tasks.
Introduction
In the fields of information retrieval and natural language processing, document similarity calculation is a fundamental and critical task. TF-IDF (Term Frequency-Inverse Document Frequency), as a classic text representation method, effectively measures the importance of words in documents, while cosine similarity is commonly used to measure the similarity between vectors. This article, based on Python implementation, delves into how to combine TF-IDF and cosine similarity to calculate document similarity, with an in-depth analysis of its core principles and implementation details.
Basics of TF-IDF and Cosine Similarity
TF-IDF is a statistical method used to evaluate the importance of a word to a document in a collection. Its calculation is based on term frequency (TF) and inverse document frequency (IDF), with the formula: TF-IDF = TF × IDF. Term frequency represents the frequency of a word in a document, while inverse document frequency measures the word's general importance, calculated as log(total documents / documents containing the word). This representation highlights keywords in documents while suppressing the impact of common words.
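As a toy illustration of the formula (a minimal sketch with a hypothetical two-document corpus; real libraries such as scikit-learn apply smoothing, so their values differ slightly):

```python
import math

# Two tokenized documents (toy corpus, illustrative only)
docs = [["sky", "blue"], ["sun", "bright"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)          # term frequency within the document
    df = sum(term in d for d in docs)        # number of documents containing the term
    idf = math.log(len(docs) / df)           # inverse document frequency
    return tf * idf

# "sky" appears in 1 of 2 documents: tf = 1/2, idf = ln(2)
print(round(tf_idf("sky", docs[0], docs), 4))  # → 0.3466
```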
Cosine similarity measures the similarity between two vectors by computing the cosine of the angle between them, with the formula: cos(θ) = (A·B) / (||A|| × ||B||), where A and B are vectors, · denotes the dot product, and ||·|| denotes the Euclidean norm. The value range of cosine similarity is [-1, 1], with values closer to 1 indicating greater similarity. It is often used for similarity comparison after text vectorization, as it is insensitive to vector length, making it more suitable for sparse vectors like TF-IDF.
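The formula translates directly into a few lines of NumPy (a minimal sketch; the vectors here are illustrative, not tied to any corpus):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 1.0, 0.0])
b = np.array([0.0, 1.0, 1.0, 1.0])
# Dot product is 1, norms are sqrt(2) and sqrt(3): 1/sqrt(6) ≈ 0.408
print(round(cosine_similarity(a, b), 3))  # → 0.408
```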
Python Implementation: From Text Preprocessing to Similarity Calculation
The following is a complete Python example demonstrating how to implement TF-IDF transformation and cosine similarity calculation using the scikit-learn and NumPy libraries. The code is adapted from a widely shared community answer, lightly optimized and annotated.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords  # requires nltk.download('stopwords') on first run
import numpy as np
import numpy.linalg as LA
# Define training set and test set (query document)
train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright."]
stopWords = stopwords.words('english')
# Use CountVectorizer for bag-of-words transformation, removing stop words
vectorizer = CountVectorizer(stop_words=stopWords)
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set:', trainVectorizerArray)
print('Transform Vectorizer to test set:', testVectorizerArray)
# Define cosine similarity calculation function
cosine_function = lambda a, b: round(np.inner(a, b) / (LA.norm(a) * LA.norm(b)), 3)
# Iterate over training and test set vectors to compute similarity
for vector in trainVectorizerArray:
    for testV in testVectorizerArray:
        cosine = cosine_function(vector, testV)
        print(f'Cosine similarity between {vector} and {testV}: {cosine}')
# Apply TF-IDF transformation
transformer = TfidfTransformer()
transformer.fit(trainVectorizerArray)
train_tfidf = transformer.transform(trainVectorizerArray).toarray()
print('TF-IDF for train set:', train_tfidf)
# Reuse the transformer fitted on the training set so the test set
# shares the same IDF weights (refitting on the test set would recompute them)
test_tfidf = transformer.transform(testVectorizerArray).toarray()
print('TF-IDF for test set:', test_tfidf)
Running the above code yields the following output:
Fit Vectorizer to train set: [[1 0 1 0]
[0 1 0 1]]
Transform Vectorizer to test set: [[0 1 1 1]]
Cosine similarity between [1 0 1 0] and [0 1 1 1]: 0.408
Cosine similarity between [0 1 0 1] and [0 1 1 1]: 0.816
TF-IDF for train set: [[ 0.70710678 0. 0.70710678 0. ]
[ 0. 0.70710678 0. 0.70710678]]
TF-IDF for test set: [[ 0. 0.57735027 0.57735027 0.57735027]]
In this example, the training set consists of two documents, "The sky is blue." and "The sun is bright.", and the test set (query document) is "The sun in the sky is bright.". CountVectorizer converts each text into a bag-of-words vector of word occurrence counts after removing stop words (e.g., "the", "is"). Because features are ordered alphabetically (blue, bright, sky, sun), the vector [1 0 1 0] corresponds to "blue" and "sky"; the exact feature names can be obtained via vectorizer.get_feature_names_out().
The cosine similarity results show that the first training document scores 0.408 against the query and the second scores 0.816. The second document ("The sun is bright.") is therefore more similar to the query, as they share more content words ("sun" and "bright"). After TF-IDF transformation the vectors are L2-normalized, so each has unit Euclidean norm and cosine similarity reduces to a simple dot product.
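This equivalence can be verified numerically with the TF-IDF rows printed above (a small sketch; the vectors are copied from the example output):

```python
import numpy as np

# Row-normalized TF-IDF vectors from the example output above
doc = np.array([0.0, 0.70710678, 0.0, 0.70710678])        # "The sun is bright."
query = np.array([0.0, 0.57735027, 0.57735027, 0.57735027])  # query document

# Both rows have unit L2 norm, so the denominator of the cosine formula is 1
assert np.isclose(np.linalg.norm(doc), 1.0)
assert np.isclose(np.linalg.norm(query), 1.0)

cosine = np.dot(doc, query)  # the dot product alone is the cosine similarity
print(round(cosine, 3))      # → 0.816, matching the count-vector result
```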
In-depth Analysis of Core Knowledge Points
1. Text Preprocessing and Vectorization: Using CountVectorizer, text can be converted into numerical vectors, with stop words removed via the stop_words parameter to reduce noise. The fit_transform method is used for the training set to learn the vocabulary and transform data, while the transform method is used for the test set to ensure the same vocabulary is applied.
2. Role of TF-IDF Transformation: TfidfTransformer converts term-count vectors into TF-IDF values via its fit and transform methods. TF-IDF considers not only term frequency but also down-weights common words via inverse document frequency, highlighting each document's distinctive keywords. In the output, the TF-IDF rows are L2-normalized, giving each vector unit length; after such normalization, cosine similarity is simply the dot product.
3. Implementation of Cosine Similarity Calculation: The custom cosine_function uses NumPy's inner function to compute the dot product and LA.norm to compute the Euclidean norm. By looping through training and test set vectors, similarity between each pair of documents can be calculated. This method is straightforward but may be inefficient; for large datasets, optimized methods like linear kernel functions are recommended.
4. Comparison with the Linear Kernel: Alternatively, scikit-learn provides the linear_kernel function for efficient cosine similarity calculation, especially suitable for sparse matrices. Since TF-IDF vectors are L2-normalized, the linear kernel (a plain dot product) is equivalent to cosine similarity. For example, given a TF-IDF matrix tfidf_matrix:
from sklearn.metrics.pairwise import linear_kernel
cosine_similarities = linear_kernel(tfidf_matrix[0:1], tfidf_matrix).flatten()
This method avoids explicit norm calculation, improving computational efficiency and suitability for large document collections.
Practical Applications and Extensions
In practical applications, document similarity calculation can be used for tasks such as information retrieval, recommendation systems, and text clustering. For example, in a Q&A system, similarity between user queries and knowledge base documents can be computed to return the most relevant answers. While the TF-IDF and cosine similarity method is classic, it has some limitations:
- Sparsity Issue: TF-IDF vectors are typically high-dimensional and sparse, which may affect computational efficiency and storage.
- Limited Semantic Understanding: TF-IDF is based on term frequency statistics and cannot capture semantic relationships like synonyms or context dependencies.
- Extension Methods: To overcome these limitations, word embeddings (e.g., Word2Vec, GloVe) or pre-trained language models (e.g., BERT) can be considered, which better represent semantic information but at higher computational cost.
Additionally, the code example uses simple loop calculations; for large-scale data, parallel processing or distributed computing frameworks (e.g., Apache Spark) can be employed to enhance performance.
Conclusion
This article details the method for calculating document similarity based on TF-IDF and cosine similarity, demonstrating the entire process from text preprocessing to similarity computation through Python code examples. Core knowledge points include the application of CountVectorizer and TfidfTransformer, the mathematical principles and implementation of cosine similarity, and comparisons with efficient calculation methods. Although TF-IDF has limitations in semantic representation, its simplicity and efficiency make it valuable in many practical scenarios. In the future, combining deep learning methods can further improve the accuracy and semantic understanding of similarity calculations. Readers can extend the code to apply it to their own text data for specific information retrieval or analysis tasks.