Learn how to build a Content-based Recommender System using Python, TF-IDF, and Cosine Similarity with the MovieLens dataset.

In the previous article of our Recommender System series, we explored the fundamentals of recommendation systems. Today, we're diving deeper into one of the most popular approaches: Content-based Filtering – and more importantly, we'll build a complete system from scratch using Python.
Imagine you just finished watching an action movie and loved it. A smart friend would naturally suggest other action movies, right? Content-based Filtering works on the same principle.
Specifically, the system will:
Analyze the attributes of items the user has interacted with
Search for other items with similar attributes
Recommend those items to the user
Real-world example: If you frequently watch crime dramas, the system will suggest other crime-related content. Simple as that!
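The three steps above can be sketched in a few lines. This is only a toy illustration: the catalogue is made up, `recommend_by_overlap` and `jaccard` are hypothetical helper names, and plain genre-set overlap (Jaccard similarity) stands in for the TF-IDF and Cosine Similarity approach used later in the article.

```python
# Toy catalogue: title -> set of genres (made-up data for illustration)
CATALOGUE = {
    "Heat": {"Action", "Crime", "Thriller"},
    "Se7en": {"Crime", "Mystery", "Thriller"},
    "Toy Story": {"Animation", "Children", "Comedy"},
    "Die Hard": {"Action", "Thriller"},
}


def jaccard(a, b):
    """Share of genres two items have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_by_overlap(watched, catalogue, top_n=2):
    # Step 1: analyze the attributes of the items the user interacted with
    profile = set().union(*(catalogue[t] for t in watched))
    # Step 2: score every unseen item by how similar its attributes are
    scores = {
        title: jaccard(profile, genres)
        for title, genres in catalogue.items()
        if title not in watched
    }
    # Step 3: recommend the best-scoring items
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend_by_overlap(["Heat"], CATALOGUE))  # ['Die Hard', 'Se7en']
```

Notice that the result depends only on the watched item's own attributes; no other user's behavior is involved.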
Strengths of Content-based Filtering:
No need for ratings data from other users
Works well even for new users (as long as they've interacted with at least one item)
Easy to explain why recommendations were made
Weaknesses:
Only suggests "safe" items similar to what users already like
Cannot leverage the "wisdom of the crowd" from other users
Depends heavily on item metadata quality
We'll build a movie recommendation system using the MovieLens dataset – one of the most classic datasets in the Recommender System field.
The movies.csv file contains movie information with 3 main columns: movie_id, title, and genres. Notably, a movie can belong to multiple genres, separated by the | character.
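To get a feel for the `|`-separated genres before loading the real file, here is a small sketch on two made-up rows held in memory. The sample rows and variable names are assumptions for illustration only; the real file is much larger.

```python
import io

import pandas as pd

# Two made-up rows in the movie_id,title,genres layout described above
SAMPLE_CSV = io.StringIO(
    "movie_id,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children\n"
    "2,Heat (1995),Action|Crime|Thriller\n"
)

sample = pd.read_csv(SAMPLE_CSV)

# Split the '|'-joined string into a Python list per movie
genre_lists = sample["genres"].str.split("|")
print(genre_lists.tolist())

# One row per (movie, genre) pair, handy for counting how common each genre is
genre_counts = genre_lists.explode().value_counts()
print(genre_counts.to_dict())
```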
python
import pandas


def get_dataframe_movies_csv(file_path):
    """
    Read MovieLens CSV file and return a DataFrame
    with 3 columns: movie_id, title, genres
    """
    movie_cols = ['movie_id', 'title', 'genres']
    movies = pandas.read_csv(
        file_path,
        sep=',',
        # Assumes the file starts with a header row (as the comma-separated
        # MovieLens releases do); header=0 replaces it with movie_cols
        # instead of reading it as a movie
        header=0,
        names=movie_cols,
        encoding='latin-1'
    )
    return movies

The next step, feature vectorization, is the most crucial one. We need to transform movie genre information (text format) into feature vectors: a mathematical representation that computers can understand and compare.
What is TF-IDF?
TF-IDF (Term Frequency - Inverse Document Frequency) is a method for evaluating how important a word is within a document. It operates on two principles:
TF (Term Frequency): The more a word appears in a document → the more important it is to that document
IDF (Inverse Document Frequency): The fewer documents a word appears in → the more distinctive value it has
Example: The word "Action" appears in movie A but only 10% of all movies are Action → this word is highly valuable for identifying movie A.
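The effect can be checked by hand on a tiny corpus. The sketch below uses four made-up genre strings and assumes scikit-learn's default settings, where the IDF is smoothed as ln((1 + n) / (1 + df)) + 1 and each row is then L2-normalized; `smooth_idf` here is a hypothetical helper that mirrors that formula.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up genre strings: "action" is rare, "drama" appears everywhere
docs = ["action drama", "drama", "drama comedy", "drama"]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_  # token -> column index

n_docs = len(docs)


def smooth_idf(doc_freq):
    """IDF as scikit-learn computes it by default (smooth_idf=True)."""
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1


idf_action = smooth_idf(1)  # "action" appears in 1 of 4 documents
idf_drama = smooth_idf(4)   # "drama" appears in all 4 documents

# Document 0 contains each word once (tf = 1), then the row is L2-normalized
row = np.array([idf_action, idf_drama])
row = row / np.linalg.norm(row)

print("action:", row[0], weights[0, vocab["action"]])
print("drama: ", row[1], weights[0, vocab["drama"]])
```

In the first document both words occur once, yet "action" ends up with the larger weight because it is the rarer word across the corpus.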
python
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_matrix(movies):
    """
    Create TF-IDF matrix from the genres column

    TfidfVectorizer parameters:
    - analyzer='word': Extract by word units
    - ngram_range=(1, 1): Take 1 word at a time
    - min_df=1: Don't skip any words (including rare ones); this is the
      default, and newer scikit-learn versions may reject an integer 0

    Result: Sparse matrix with rows = number of movies,
    columns = unique genre tokens. The default tokenizer splits on
    punctuation, so "Sci-Fi" becomes the two tokens "sci" and "fi".
    """
    tf = TfidfVectorizer(
        analyzer='word',
        ngram_range=(1, 1),
        min_df=1
    )
    new_tfidf_matrix = tf.fit_transform(movies['genres'])
    return new_tfidf_matrix

After obtaining feature vectors for each movie, we need to measure the similarity between movies. Cosine Similarity is a popular choice because:
Independent of vector magnitude (only considers direction)
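Fast computation, especially with TF-IDF: the rows are already L2-normalized, so the similarity reduces to a dot product
For non-negative vectors such as TF-IDF, values fall in [0, 1], which is easy to interpret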
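These properties are easy to verify on small inputs. The sketch below uses made-up vectors and genre strings; `cosine` is a hypothetical helper written out for clarity, while `linear_kernel` and `cosine_similarity` are the scikit-learn functions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# Scaling a vector changes its magnitude but not its direction
print(np.isclose(cosine(a, b), cosine(10 * a, b)))

# TfidfVectorizer L2-normalizes rows by default (norm='l2'), so the plain
# dot product from linear_kernel matches cosine_similarity
tfidf = TfidfVectorizer().fit_transform(
    ["action crime", "crime thriller", "comedy romance"]  # made-up genre strings
)
sims = linear_kernel(tfidf, tfidf)
print(np.allclose(sims, cosine_similarity(tfidf, tfidf)))

# Non-negative TF-IDF weights keep every score between 0 and 1
print(sims.min() >= 0.0, sims.max() <= 1.0 + 1e-9)
```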
python
from sklearn.metrics.pairwise import linear_kernel


def cosine_sim(matrix):
    """
    Compute cosine similarity matrix between all movie pairs

    Uses linear_kernel instead of cosine_similarity: TfidfVectorizer
    L2-normalizes each row by default, so the plain dot product already
    equals the cosine similarity and the extra normalization is skipped

    Result: Dense square matrix NxN (N = number of movies)
    """
    new_cosine_sim = linear_kernel(matrix, matrix)
    return new_cosine_sim

With the MovieLens dataset containing roughly 9,700 movies, the resulting matrix will be about 9,700 x 9,700 in size. Each cell [i][j] contains the similarity score between movie i and movie j.
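It is worth estimating how much memory such a matrix needs before computing it. The arithmetic below is a rough sketch that assumes a dense matrix of 8-byte float64 values and uses a round figure of 9,700 movies; `dense_similarity_bytes` is a hypothetical helper.

```python
def dense_similarity_bytes(n_items, bytes_per_value=8):
    """Memory for a dense n x n matrix (float64 = 8 bytes per value)."""
    return n_items * n_items * bytes_per_value


n_movies = 9_700  # illustrative round figure, not an exact count

size_mib = dense_similarity_bytes(n_movies) / 2**20
print(f"{n_movies} movies -> {size_mib:.0f} MiB")

# Ten times more movies means one hundred times more memory
bigger_gib = dense_similarity_bytes(10 * n_movies) / 2**30
print(f"{10 * n_movies} movies -> {bigger_gib:.1f} GiB")
```

Under these assumptions the matrix takes a bit over 700 MiB, which is manageable on a laptop, but the quadratic growth means a precomputed all-pairs matrix stops being practical for much larger catalogues.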
python
import pandas as pd


class ContentBasedRecommender:
    """
    Content-based Movie Recommendation System
    """

    def __init__(self, movies_csv):
        self.movies = get_dataframe_movies_csv(movies_csv)
        self.tfidf_matrix = None
        self.cosine_sim = None

    def build_model(self):
        """Build model from data"""
        # Genres are joined by '|': turn "Action|Comedy" into "Action Comedy"
        # so the vectorizer sees one word per genre (missing values become "")
        self.movies['genres'] = (
            self.movies['genres'].fillna('').str.replace('|', ' ', regex=False)
        )
        # Create TF-IDF matrix and compute cosine similarity
        self.tfidf_matrix = tfidf_matrix(self.movies)
        self.cosine_sim = cosine_sim(self.tfidf_matrix)

    def fit(self):
        """Train the model"""
        self.build_model()

    def recommend(self, title, top_n=10):
        """
        Recommend top N movies similar to the input movie

        Args:
            title: Movie title to find recommendations for
            top_n: Number of movies to recommend
        Returns:
            List of (index, similarity score) pairs and recommended movie titles
        """
        titles = self.movies['title']
        indices = pd.Series(self.movies.index, index=self.movies['title'])
        # Keep the first row when a title appears more than once
        indices = indices[~indices.index.duplicated(keep='first')]
        # Get index of input movie (raises KeyError for an unknown title)
        idx = indices[title]
        # Get similarity scores with all other movies
        sim_scores = list(enumerate(self.cosine_sim[idx]))
        # Sort by score in descending order
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        # Drop the input movie explicitly: movies with identical genres tie
        # with it, so it is not guaranteed to come first. Then take top N
        sim_scores = [s for s in sim_scores if s[0] != idx][:top_n]
        movie_indices = [i[0] for i in sim_scores]
        return sim_scores, titles.iloc[movie_indices].values

Despite being simple and effective, Content-based Filtering has limitations worth noting:
Limitation 1: Cannot leverage "wisdom of the crowd"
The system completely ignores information from other users. In reality, user behavior often exhibits group patterns – if many people who like A also like B, you might like B too. Content-based doesn't capture this insight.
Limitation 2: Metadata dependency
We don't always have detailed descriptions for every item. Asking users to add tags is often impractical, and NLP algorithms for automatic feature extraction still face challenges like synonyms, abbreviations, typos, and multiple languages.
Solution: Collaborative Filtering – an approach that leverages the user community's behavior. We'll explore this technique in the next article of this series.
Through this article, we have:
Understood how Content-based Filtering works
Mastered using TF-IDF for feature vectorization
Learned how to compute Cosine Similarity for measuring similarity
Built a complete movie recommendation system with Python
Try applying this knowledge to your own real-world problems!
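As a starting point for experimenting, the whole pipeline can be condensed into one self-contained sketch. Everything here is an assumption for illustration: the five-movie catalogue is made up, and `toy_recommend` is a hypothetical, simplified function rather than the class built in this article.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up mini catalogue in the movie_id,title,genres layout
TOY_CSV = io.StringIO(
    "movie_id,title,genres\n"
    "1,Alpha Strike,Action|Thriller\n"
    "2,Beta Heist,Action|Crime|Thriller\n"
    "3,Gamma Laughs,Comedy|Romance\n"
    "4,Delta Chase,Action|Thriller\n"
    "5,Epsilon Hearts,Romance|Drama\n"
)


def toy_recommend(movies, title, top_n=2):
    """Return the top_n titles whose genres are closest to `title`."""
    # "Action|Thriller" -> "Action Thriller" so each genre is one word
    genres = movies["genres"].fillna("").str.replace("|", " ", regex=False)
    tfidf = TfidfVectorizer().fit_transform(genres)
    sims = linear_kernel(tfidf, tfidf)
    # Row position of the input movie (assumes unique titles)
    idx = movies.index[movies["title"] == title][0]
    # Label scores by title and drop the input movie itself
    scores = pd.Series(sims[idx], index=movies["title"]).drop(title)
    return scores.sort_values(ascending=False).head(top_n)


movies = pd.read_csv(TOY_CSV)
result = toy_recommend(movies, "Alpha Strike")
print(result)
```

"Delta Chase" shares exactly the same genres as "Alpha Strike" and ranks first, followed by "Beta Heist", which shares two of its three genres.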
References:
MovieLens Dataset: https://grouplens.org/datasets/movielens/
Scikit-learn TfidfVectorizer Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Ricci, F., Rokach, L., & Shapira, B. (2015). Recommender Systems Handbook. Springer.