Build Content-based Filtering Recommender System with Python | Step-by-Step Guide

Introduction

In the previous article of our Recommender System series, we explored the fundamentals of recommendation systems. Today, we're diving deeper into one of the most popular approaches: Content-based Filtering – and more importantly, we'll build a complete system from scratch using Python.

How Does Content-based Filtering Work?

Imagine you just finished watching an action movie and loved it. A smart friend would naturally suggest other action movies, right? Content-based Filtering works on the same principle.

Specifically, the system will:

Analyze the attributes of items the user has interacted with
Search for other items with similar attributes
Recommend those items to the user

Real-world example: If you frequently watch crime dramas, the system will suggest other crime-related content. Simple as that!
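
The three steps above can be sketched with plain Python sets. This is only a toy illustration, not the method built later in this article: the catalog, the titles, and the helper `recommend_by_overlap` are made up for the example, and simple Jaccard overlap stands in for the TF-IDF and cosine similarity used in the actual system.

```python
# Hypothetical mini-catalog: title -> set of genre attributes.
catalog = {
    "Heat": {"Action", "Crime", "Thriller"},
    "The Godfather": {"Crime", "Drama"},
    "Toy Story": {"Animation", "Children", "Comedy"},
    "Se7en": {"Crime", "Mystery", "Thriller"},
}

def jaccard(a, b):
    """Share of attributes two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_by_overlap(watched, catalog, top_n=2):
    # Step 1: collect the attributes of items the user interacted with.
    profile = set().union(*(catalog[t] for t in watched))
    # Step 2: score every unseen item by attribute overlap with the profile.
    scores = {
        title: jaccard(profile, genres)
        for title, genres in catalog.items()
        if title not in watched
    }
    # Step 3: recommend the best-scoring items.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

recs = recommend_by_overlap(["Heat"], catalog)
print(recs)
```

Someone who watched only "Heat" gets the two crime titles first, because they share the most attributes with the profile.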

Strengths of Content-based Filtering:

No need for ratings data from other users
Works well even for new users (as long as they've interacted with at least one item)
Easy to explain why recommendations were made

Weaknesses:

Only suggests "safe" items similar to what users already like
Cannot leverage the "wisdom of the crowd" from other users
Depends heavily on item metadata quality

System Architecture

We'll build a movie recommendation system using the MovieLens dataset – one of the most classic datasets in the Recommender System field.

Step 1: Data Loading and Initialization

The movies.csv file contains movie information with 3 main columns: movie_id, title, and genres. Notably, a movie can belong to multiple genres, separated by the | character.

python

import pandas

def get_dataframe_movies_csv(file_path):
    """
    Read MovieLens CSV file and return a DataFrame
    with 3 columns: movie_id, title, genres
    """
    movie_cols = ['movie_id', 'title', 'genres']
    movies = pandas.read_csv(
        file_path,
        sep=',',
        header=0,  # assumes the file starts with a header row; replace it with movie_cols
        names=movie_cols,
        encoding='latin-1'
    )
    return movies
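
To see what such a loader returns without downloading anything, the sketch below reads a few rows from an in-memory string. The sample rows and the `header=0` argument are assumptions: they presume the file begins with a header line (written here as `movieId,title,genres`), which `header=0` swaps for the column names used in this article.

```python
import io
import pandas as pd

# Made-up sample in the assumed movies.csv layout (header line + rows).
sample_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children\n"
    '2,"American President, The (1995)",Comedy|Drama|Romance\n'
    "3,Heat (1995),Action|Crime|Thriller\n"
)

movies = pd.read_csv(
    sample_csv,
    sep=",",
    header=0,  # treat the first line as a header and replace its names
    names=["movie_id", "title", "genres"],
)

# Each genres cell is a single '|'-separated string.
first_genres = movies.loc[0, "genres"].split("|")
print(movies.shape)
print(first_genres)
```

Without `header=0`, pandas would treat the header line as an extra movie when `names` is given, so checking `movies.head()` after loading is a cheap sanity test.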

Step 2: Building the TF-IDF Matrix

This is the most crucial step. We need to transform movie genre information (text format) into feature vectors – a mathematical representation that computers can understand and compare.

What is TF-IDF?

TF-IDF (Term Frequency - Inverse Document Frequency) is a method for evaluating how important a word is within a document. It operates on two principles:

TF (Term Frequency): The more a word appears in a document → the more important it is to that document
IDF (Inverse Document Frequency): The fewer documents a word appears in → the more distinctive value it has

Example: If movie A is tagged "Action" but only 10% of all movies are Action, the tag is rare enough to be a strong signal for identifying movie A and movies like it.
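
The two principles can be checked by hand on a tiny corpus. The sketch below is an illustration under assumptions: the four genre lists are made up, and the smoothed formula `ln((1 + N) / (1 + df)) + 1` is the variant scikit-learn documents for its default `smooth_idf=True` setting; textbook definitions of IDF differ slightly.

```python
import math

# Toy corpus: each "document" is one movie's genre list (made-up data).
docs = [
    ["action", "thriller"],
    ["comedy"],
    ["comedy", "romance"],
    ["comedy", "drama"],
]

def idf(term, docs):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tf(term, doc):
    """Raw count of the term in one document."""
    return doc.count(term)

idf_action = idf("action", docs)  # appears in 1 of 4 documents
idf_comedy = idf("comedy", docs)  # appears in 3 of 4 documents

# TF-IDF weight of each term inside the document that contains it.
w_action = tf("action", docs[0]) * idf_action
w_comedy = tf("comedy", docs[1]) * idf_comedy
print(round(idf_action, 3), round(idf_comedy, 3))
```

The rarer term ("action") receives the larger weight, which is exactly the "distinctive value" that IDF is meant to capture.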

python

from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_matrix(movies):
    """
    Create TF-IDF matrix from the genres column
    
    TfidfVectorizer parameters:
    - analyzer='word': Extract by word units
    - ngram_range=(1, 1): Take 1 word at a time
    - min_df=1: Don't skip any words (including rare ones)
    
    Result: Matrix with rows = number of movies, columns = unique genre tokens
    """
    tf = TfidfVectorizer(
        analyzer='word', 
        ngram_range=(1, 1), 
        min_df=1  # an integer min_df must be >= 1; recent scikit-learn versions reject 0
    )
    new_tfidf_matrix = tf.fit_transform(movies['genres'])
    return new_tfidf_matrix

Step 3: Computing Cosine Similarity

After obtaining feature vectors for each movie, we need to measure the similarity between movies. Cosine Similarity is a popular choice because:

Independent of vector magnitude (only considers direction)
Fast computation, especially when combined with TF-IDF
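Values fall in [0, 1] for TF-IDF vectors, whose components are never negative (in general, cosine similarity ranges over [-1, 1]), so scores are easy to interpret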
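
The magnitude-independence property is easy to verify numerically. The following is a minimal sketch with made-up vectors; the helper name `cosine` is chosen for illustration only.

```python
import numpy as np

def cosine(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = 10 * a                     # same direction, 10x the magnitude
c = np.array([0.0, 0.0, 3.0])  # shares no non-zero component with a

same_direction = cosine(a, b)
orthogonal = cosine(a, c)
print(same_direction, orthogonal)
```

Scaling a vector leaves the score unchanged, while two vectors with no shared features score zero.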

python

from sklearn.metrics.pairwise import linear_kernel

def cosine_sim(matrix):
    """
    Compute cosine similarity matrix between all movie pairs
    
    Uses linear_kernel instead of cosine_similarity
    for better performance with TF-IDF matrices (already normalized)
    
    Result: Square matrix NxN (N = number of movies)
    """
    new_cosine_sim = linear_kernel(matrix, matrix)
    return new_cosine_sim
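
The claim that a plain dot product is enough can be checked on a small example. The sketch below assumes `TfidfVectorizer` is left at its default `norm='l2'`, so every row has unit length; the genre strings are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

genres = ["Action Crime Thriller", "Crime Drama", "Animation Comedy"]
tfidf = TfidfVectorizer().fit_transform(genres)  # norm='l2' by default

# Every row already has unit length...
row_norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1))).ravel()

# ...so the plain dot product equals the cosine similarity.
same = np.allclose(
    linear_kernel(tfidf, tfidf),
    cosine_similarity(tfidf, tfidf),
)
print(row_norms, same)
```

If the vectorizer were created with `norm=None`, the two functions would no longer agree, and `cosine_similarity` would be the safer choice.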

With the MovieLens dataset containing approximately 9,743 movies, the resulting matrix will be 9743 x 9743 in size. Each cell [i][j] contains the similarity score between movie i and movie j.
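
A dense N x N matrix grows quadratically, which is worth estimating before scaling up. The sketch below is one possible alternative, not part of the system built in this article: the hypothetical helper `top_similar` computes a single 1 x N row of similarities on demand, and `dense_matrix_megabytes` assumes 8-byte float64 values.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def dense_matrix_megabytes(n_items, bytes_per_value=8):
    """Memory needed for a dense n x n float64 similarity matrix."""
    return n_items * n_items * bytes_per_value / 1e6

def top_similar(tfidf, idx, top_n=2):
    """Similarities for one movie only: a 1 x N row, not N x N."""
    scores = linear_kernel(tfidf[idx], tfidf).ravel()
    scores[idx] = -1.0  # exclude the movie itself
    return np.argsort(-scores, kind="stable")[:top_n]

# Made-up genre strings standing in for the real dataset.
genres = ["Action Crime", "Action Crime Thriller", "Comedy", "Crime Drama"]
tfidf = TfidfVectorizer().fit_transform(genres)

size_mb = dense_matrix_megabytes(9743)  # matrix size quoted above
nearest = top_similar(tfidf, idx=0)
print(round(size_mb), nearest)
```

At this dataset size the full matrix is still manageable at roughly 760 MB, but for catalogs ten times larger the per-row approach may be the only practical one.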

Step 4: Putting It All Together

python

import pandas as pd

class ContentBasedRecommender:
    """
    Content-based Movie Recommendation System
    """
    
    def __init__(self, movies_csv):
        self.movies = get_dataframe_movies_csv(movies_csv)
        self.tfidf_matrix = None
        self.cosine_sim = None

    def build_model(self):
        """Build model from data"""
        # Replace the '|' separators with spaces so each genre becomes a token
        self.movies['genres'] = (
            self.movies['genres'].fillna('').str.replace('|', ' ', regex=False)
        )
        
        # Create TF-IDF matrix and compute cosine similarity
        self.tfidf_matrix = tfidf_matrix(self.movies)
        self.cosine_sim = cosine_sim(self.tfidf_matrix)

    def fit(self):
        """Train the model"""
        self.build_model()
    
    def recommend(self, title, top_n=10):
        """
        Recommend top N movies similar to the input movie
        
        Args:
            title: Movie title to find recommendations for
            top_n: Number of movies to recommend
            
        Returns:
            List of similarity scores and recommended movie titles
        """
        titles = self.movies['title']
        
        # Get index of input movie (first match if the title appears more than once)
        matches = self.movies.index[titles == title]
        if len(matches) == 0:
            raise KeyError(f"Movie not found: {title}")
        idx = matches[0]
        
        # Get similarity scores with all other movies
        sim_scores = list(enumerate(self.cosine_sim[idx]))
        
        # Drop the input movie itself. Movies with identical genres also score 1.0,
        # so the input movie is not guaranteed to come first after sorting.
        sim_scores = [s for s in sim_scores if s[0] != idx]
        
        # Sort by score in descending order and take top N
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:top_n]
        
        movie_indices = [i[0] for i in sim_scores]
        return sim_scores, titles.iloc[movie_indices].values
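
To try the whole pipeline without the dataset, here is a compact end-to-end sketch on a made-up four-movie catalog. It is a simplified function-based variant rather than the class above, and the standalone `recommend` function, the titles, and their genres are assumptions for illustration. Two movies deliberately share the same genres to show how ties at a score of 1.0 behave.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up catalog; "Ronin" and "Heat" have identical genre strings.
movies = pd.DataFrame({
    "title": ["Ronin", "Heat", "Toy Story", "The Rock"],
    "genres": ["Action|Crime|Thriller", "Action|Crime|Thriller",
               "Animation|Children|Comedy", "Action|Thriller"],
})

tfidf = TfidfVectorizer().fit_transform(
    movies["genres"].str.replace("|", " ", regex=False)
)
sim = linear_kernel(tfidf, tfidf)

def recommend(title, top_n=2):
    matches = movies.index[movies["title"] == title]
    if len(matches) == 0:
        raise KeyError(f"Unknown title: {title!r}")
    idx = matches[0]
    # Rank all movies, then drop the query movie by index, not by position.
    order = np.argsort(-sim[idx], kind="stable")
    order = [i for i in order if i != idx][:top_n]
    return list(movies["title"].iloc[order])

result = recommend("Heat")
print(result)
```

Because genres are the only feature, every movie with the same genre set is indistinguishable from the query. Richer metadata (cast, director, plot keywords) would be needed to rank within such a group.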

Limitations and Future Directions

Despite being simple and effective, Content-based Filtering has limitations worth noting:

Limitation 1: Cannot leverage "wisdom of the crowd"

The system completely ignores information from other users. In reality, user behavior often exhibits group patterns: if many people who like A also like B, you might like B too. Content-based Filtering doesn't capture this signal.

Limitation 2: Metadata dependency

We don't always have detailed descriptions for every item. Asking users to add tags is often impractical, and NLP algorithms for automatic feature extraction still face challenges like synonyms, abbreviations, typos, and multiple languages.

Solution: Collaborative Filtering – an approach that leverages the user community's behavior. We'll explore this technique in the next article of this series.

Conclusion

Through this article, we have:

Understood how Content-based Filtering works
Mastered using TF-IDF for feature vectorization
Learned how to compute Cosine Similarity for measuring similarity
Built a complete movie recommendation system with Python

Try applying this knowledge to your own real-world problems!

References:

MovieLens Dataset: https://grouplens.org/datasets/movielens/
Scikit-learn TfidfVectorizer Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Ricci, F., Rokach, L., & Shapira, B. (2015). Recommender Systems Handbook. Springer.
