A comprehensive guide to building Collaborative Filtering Recommender Systems with Python. Learn User-User CF, Item-Item CF, and rating prediction using the MovieLens dataset.

In our previous article, we successfully built a Content-based Recommender System – an approach that recommends items based on their attributes. However, this method has a significant limitation: it cannot leverage the "wisdom of the crowd" from the user community.
Today, we'll explore Collaborative Filtering – a method that addresses this exact problem by exploiting the behavior of similar users.
Imagine you're looking for a good movie to watch. You have two options:
1. Find movies with similar genres to ones you've enjoyed (Content-based)
2. Ask a friend with similar taste: "What great movie did you just watch?" (Collaborative Filtering)
The second approach is the core idea behind Collaborative Filtering: predicting a user's preference for an item based on the behavior of "similar" users.
Real-world example:
Suppose users A and B both rate crime dramas highly (4-5 stars). History shows B loved "The Godfather." Since A and B have similar tastes, the system infers that A might also like "The Godfather" → Recommend this movie to A.
The key insight here is that the system doesn't need to know what genre "The Godfather" belongs to. It only needs to know that users with behavior similar to A liked this movie.
Collaborative Filtering has two main variants:
User-User CF (UUCF):
Find users with similar behavior to the target user
Recommend items that those similar users liked
Works well when number of users < number of items
Item-Item CF (IICF):
Find items similar to items the user has liked
Recommend those similar items
Works well when number of items < number of users (more common in practice)
In this article, we'll implement User-User CF to understand the principles, then discuss how to extend it to Item-Item CF.
Unlike the Content-based approach, which needs only item information, Collaborative Filtering requires three data components:
Users: List of users
Items (Movies): List of movies
Ratings: User scores for each item
These components are represented as a Utility Matrix – rows are users, columns are items, and values are ratings.
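```python
import pandas
import numpy as np


def get_dataframe_ratings_base(file_path):
    """
    Read MovieLens ratings file
    Returns a NumPy array with 3 columns: user_id, item_id, rating
    """
    # Assumes the MovieLens 100K layout: four tab-separated columns.
    # The timestamp column has to be named too; with only three names
    # pandas would take the first column as the index and shift the rest.
    r_cols = ['user_id', 'item_id', 'rating', 'timestamp']
    ratings = pandas.read_csv(
        file_path,
        sep='\t',
        names=r_cols,
        usecols=['user_id', 'item_id', 'rating'],
        encoding='latin-1'
    )
    # MovieLens IDs start at 1; shift them to 0-based indices,
    # which is what the CollaborativeFiltering class expects
    ratings[['user_id', 'item_id']] -= 1
    return ratings.values
```
The challenge: this matrix has many empty cells (missing values), namely the items that users haven't rated yet. Our task is to predict these values.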
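To make the Utility Matrix concrete, here is a minimal sketch that pivots a handful of made-up (user, item, rating) triples into a users × items table. The values and the names `toy_ratings` and `utility` are assumptions for illustration only, not MovieLens data. Unrated cells show up as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: (user_id, item_id, rating)
toy_ratings = pd.DataFrame(
    [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 2), (2, 1, 1), (2, 2, 5)],
    columns=["user_id", "item_id", "rating"],
)

# Rows are users, columns are items; missing ratings become NaN
utility = toy_ratings.pivot(index="user_id", columns="item_id", values="rating")

# Fraction of cells that are still unknown
sparsity = utility.isna().to_numpy().mean()
print(utility)
print(f"missing cells: {sparsity:.0%}")
```

Even in this tiny example a third of the cells are empty; on a real dataset the share of missing cells is far higher.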
Before computing, we need to handle missing values. There are several approaches:
Option 1: Replace with 0 → Simple but inaccurate
Option 2: Replace with 2.5 (midpoint of 0-5) → Better but still imprecise
Option 3 (Best): Normalize by each user's mean rating
Why is Option 3 best?
Each person has different rating standards. An "easy" rater might give 4 stars to an average movie, while a "tough" rater only gives 3 stars. By subtracting each user's mean rating, we remove personal bias and only retain information about whether the user likes an item more or less than their average.
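The effect of mean-centering can be sketched with two hypothetical users who rate the same two movies one star apart. The arrays and the helper `mean_center` are illustrative assumptions, not part of the class built in this article.

```python
import numpy as np

# Hypothetical ratings for three movies; NaN marks "not rated"
easy_rater = np.array([5.0, 4.0, np.nan])
tough_rater = np.array([4.0, 3.0, np.nan])


def mean_center(ratings):
    """Subtract the user's mean over rated items; unrated items become 0."""
    mean = np.nanmean(ratings)
    centered = ratings - mean
    return np.where(np.isnan(ratings), 0.0, centered)


print(mean_center(easy_rater))
print(mean_center(tough_rater))
```

Both users end up with the same centered vector, `[0.5, -0.5, 0.0]`, so the difference in rating strictness no longer hides the fact that they prefer the first movie to the second.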
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse


class CollaborativeFiltering:
    """
    Collaborative Filtering System
    Supports both User-User CF and Item-Item CF
    """

    def __init__(self, data_matrix, k, dist_func=cosine_similarity, uuCF=1):
        """
        Initialize CF
        Args:
            data_matrix: Matrix [user_id, item_id, rating]
            k: Number of neighbors for prediction
            dist_func: Similarity function (default: cosine_similarity)
            uuCF: 1 = User-User CF, 0 = Item-Item CF
        """
        self.uuCF = uuCF
        # For Item-Item CF, swap user and item columns
        self.Y_data = data_matrix if uuCF else data_matrix[:, [1, 0, 2]]
        self.k = k
        self.dist_func = dist_func
        self.Ybar_data = None
        # Number of users and items (+1 because index starts from 0)
        self.n_users = int(np.max(self.Y_data[:, 0])) + 1
        self.n_items = int(np.max(self.Y_data[:, 1])) + 1

    def normalize_matrix(self):
        """
        Normalize matrix by:
        1. Computing mean rating for each user
        2. Subtracting each rating by its user's mean
        3. Replacing unknown values with 0
        Result: Positive = liked more than average
                Negative = liked less than average
                Zero = not yet rated
        """
        users = self.Y_data[:, 0]
        # Copy as float: an integer array would truncate the centered ratings
        self.Ybar_data = self.Y_data.astype(np.float64)
        self.mu = np.zeros((self.n_users,))
        for n in range(self.n_users):
            # Get all ratings from user n
            ids = np.where(users == n)[0].astype(np.int32)
            ratings = self.Y_data[ids, 2]
            # Compute mean (a user with no ratings gets mean 0)
            self.mu[n] = np.mean(ratings) if ratings.size > 0 else 0.0
            # Normalize: rating - mean
            self.Ybar_data[ids, 2] = ratings - self.mu[n]
        # Convert to sparse matrix for memory efficiency
        # Layout: rows are items, columns are users
        rows = self.Y_data[:, 1].astype(np.int64)
        cols = self.Y_data[:, 0].astype(np.int64)
        self.Ybar = sparse.coo_matrix(
            (self.Ybar_data[:, 2], (rows, cols)),
            (self.n_items, self.n_users)
        )
        self.Ybar = self.Ybar.tocsr()
```
Why use Sparse Matrix?
Utility Matrices are typically very large (millions of users × millions of items) but only a small fraction of cells have values (users only rate a few items). Sparse Matrices only store non-zero values and their positions, significantly reducing memory usage.
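A rough sketch of the memory argument, using randomly generated ratings. The sizes (1,000 users, 2,000 items, 10,000 ratings) are arbitrary assumptions chosen so the example runs quickly; real utility matrices are much larger and sparser.

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes: 1,000 users x 2,000 items with 10,000 known ratings
n_users, n_items, n_ratings = 1_000, 2_000, 10_000
rng = np.random.default_rng(0)
rows = rng.integers(0, n_users, size=n_ratings)
cols = rng.integers(0, n_items, size=n_ratings)
vals = rng.integers(1, 6, size=n_ratings).astype(np.float64)

# Entries that share the same (row, col) pair are summed during conversion
Y = sparse.coo_matrix((vals, (rows, cols)), shape=(n_users, n_items)).tocsr()

# A dense float64 array stores every cell, rated or not
dense_bytes = n_users * n_items * 8
# CSR stores only the values, their column indices, and row pointers
sparse_bytes = Y.data.nbytes + Y.indices.nbytes + Y.indptr.nbytes
print(f"dense: {dense_bytes / 1e6:.1f} MB, sparse: {sparse_bytes / 1e6:.2f} MB")
```

The dense array needs 16 MB regardless of how many ratings exist, while the CSR representation grows only with the number of stored ratings.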
After normalization, we compute the Similarity Matrix – a matrix containing similarity scores between all user pairs.
```python
    def compute_similarity(self):
        """
        Compute similarity matrix between users
        Using Cosine Similarity
        Result: Matrix S[i][j] = similarity between user i and user j
        Values range from [-1, 1]
        1 = same rating pattern (vectors point in the same direction)
        0 = no correlation
        -1 = completely opposite
        """
        # Ybar is items x users, so transpose to get one row per user
        self.S = self.dist_func(self.Ybar.T, self.Ybar.T)
```
Cosine Similarity measures the cosine of the angle between two vectors in multi-dimensional space. Two users with similar rating patterns will have cosine similarity close to 1.
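As a small illustration of those three cases, the sketch below applies scikit-learn's `cosine_similarity` to four hypothetical users. The `centered` matrix is made up for this example and already assumed to be mean-centered, with one row per user.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mean-centered ratings; rows are users, columns are items
centered = np.array([
    [1.0, -1.0, 0.0],   # user 0
    [2.0, -2.0, 0.0],   # user 1: same pattern as user 0, stronger opinions
    [-1.0, 1.0, 0.0],   # user 2: opposite pattern
    [0.0, 0.0, 1.0],    # user 3: only rated a different item
])

# S[i, j] is the cosine similarity between user i and user j
S = cosine_similarity(centered)
print(np.round(S, 2))
```

User 0 gets similarity 1 with user 1 (the magnitude of the ratings does not matter, only the direction), -1 with user 2, and 0 with user 3.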
This is the most critical step. To predict user u's rating for item i, we:
1. Find all users who have rated item i
2. Among them, select the k users most similar to user u (k-nearest neighbors)
3. Compute the weighted average of their normalized ratings (weight = similarity)
Concrete example:
Suppose we need to predict user U1's rating for item I1 with k=2:
Users who rated I1: U0, U3, U5
U1's similarity with them: U0=0.83, U3=-0.4, U5=-0.23
Select 2 users with highest similarity: U0 (0.83), U5 (-0.23)
Their normalized ratings for I1: U0=0.75, U5=0.5
Predicted rating = (0.83×0.75 + (-0.23)×0.5) / (|0.83| + |-0.23|) ≈ 0.48
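The arithmetic above can be checked with a few lines of NumPy. This is a standalone sketch using the numbers from the example; the variable names are illustrative, and U3's normalized rating is not given in the text, so it is left as a NaN placeholder that is never selected.

```python
import numpy as np

k = 2
# Users who rated item I1, and U1's similarity with each of them
user_ids = np.array([0, 3, 5])
sim_to_u1 = np.array([0.83, -0.40, -0.23])
# Normalized ratings for I1; U3's value is unknown (placeholder NaN)
norm_ratings = np.array([0.75, np.nan, 0.50])

# Indices of the k largest similarities
top = np.argsort(sim_to_u1)[-k:]
neighbors = user_ids[top]

# Weighted average, dividing by the sum of absolute similarities
prediction = np.dot(sim_to_u1[top], norm_ratings[top]) / np.abs(sim_to_u1[top]).sum()
print(sorted(neighbors.tolist()), round(float(prediction), 2))
```

The selected neighbors are U0 and U5, and the predicted normalized rating is about 0.48, matching the calculation by hand.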
```python
    def __predict(self, u, i, normalized=1):
        """
        Predict user u's rating for item i
        Args:
            u: User ID
            i: Item ID
            normalized: 1 = return normalized rating, 0 = actual rating
        """
        # Find users who rated item i
        ids = np.where(self.Y_data[:, 1] == i)[0].astype(np.int32)
        users_rated_i = (self.Y_data[ids, 0]).astype(np.int32)
        # Get user u's similarity with these users
        sim = self.S[u, users_rated_i]
        # Select k users with highest similarity
        a = np.argsort(sim)[-self.k:]
        nearest_s = sim[a]
        # Get their normalized ratings as a dense 1-D array
        r = self.Ybar[i, users_rated_i[a]].toarray().ravel()
        # Compute weighted average (1e-8 guards against division by zero)
        pred = r.dot(nearest_s) / (np.abs(nearest_s).sum() + 1e-8)
        if normalized:
            return pred
        # Add back mean for actual rating
        return pred + self.mu[u]

    def predict(self, u, i, normalized=1):
        """Wrapper function supporting both UUCF and IICF"""
        if self.uuCF:
            return self.__predict(u, i, normalized)
        return self.__predict(i, u, normalized)
```
Finally, for each user, we predict ratings for all unrated items, sort in descending order, and take the top N.
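```python
    def recommend_top(self, user_id, top_n=10):
        """
        Recommend top N items for a user
        Args:
            user_id: Target user's ID
            top_n: Number of items to recommend
        Returns:
            List of recommended items, sorted by predicted rating descending
        """
        # __init__ swaps the first two columns for Item-Item CF,
        # so in that mode users live in column 1 and n_users counts items
        user_col, item_col = (0, 1) if self.uuCF else (1, 0)
        n_candidates = self.n_items if self.uuCF else self.n_users
        # Get items user has already rated (a set gives fast lookups)
        ids = np.where(self.Y_data[:, user_col] == user_id)[0]
        items_rated = set(self.Y_data[ids, item_col].tolist())
        # Predict ratings for unrated items
        recommendations = []
        for item_id in range(n_candidates):
            if item_id not in items_rated:
                # Use the public wrapper so both CF modes are handled, and
                # ask for actual ratings so scores are comparable across items
                predicted_rating = self.predict(user_id, item_id, normalized=0)
                recommendations.append({
                    'item_id': item_id,
                    'predicted_rating': predicted_rating
                })
        # Sort by rating descending and take top N
        recommendations.sort(key=lambda x: x['predicted_rating'], reverse=True)
        return recommendations[:top_n]
```
In practice, Item-Item CF is often preferred over User-User CF because: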
1. Computational efficiency: The number of items is usually much smaller than the number of users, so the item-item Similarity Matrix is smaller and faster to compute.
2. Greater stability: Each item is rated by many users, so item rating vectors are "denser." When new ratings come in, item-item similarities change less than user-user similarities.
3. Easy to implement from UUCF: Simply transpose the Utility Matrix, apply the same algorithm, then transpose the result back. The class above does this by swapping the user and item columns when uuCF=0.
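The transposition idea in point 3 can be sketched on a toy matrix: the same similarity function yields user-user similarities on the matrix and item-item similarities on its transpose. The matrix `Y` below is made up for illustration, with users as rows and items as columns.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical normalized utility matrix: 4 users (rows) x 3 items (columns)
Y = np.array([
    [1.0, -1.0, 0.0],
    [0.5, -0.5, 0.0],
    [-1.0, 1.0, 0.5],
    [0.0, 0.5, -0.5],
])

user_sim = cosine_similarity(Y)      # one row per user -> 4 x 4
item_sim = cosine_similarity(Y.T)    # one row per item -> 3 x 3
print(user_sim.shape, item_sim.shape)
```

With fewer items than users, the item-item matrix is the smaller of the two, which is the efficiency argument from point 1.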
```python
# Initialize Item-Item CF
iicf = CollaborativeFiltering(data_matrix, k=10, uuCF=0)
iicf.normalize_matrix()
iicf.compute_similarity()
```
Through these 3 articles, we have:
Understood the overview of Recommender Systems and popular methods
Built Content-based RS using TF-IDF and Cosine Similarity
Built Collaborative Filtering RS with both User-User and Item-Item variants
Mastered sparse matrix handling and data normalization
This is just the foundation. In production, systems typically combine multiple approaches (Hybrid Recommender Systems) and use advanced techniques like Matrix Factorization and Deep Learning for better performance.
References:
Ekstrand, Michael D., John T. Riedl, and Joseph A. Konstan. "Collaborative filtering recommender systems" (2011)
Leskovec, J., Rajaraman, A., & Ullman, J. D. "Mining of Massive Datasets" - Chapter 9: Recommendation Systems. Stanford University (2014)
MovieLens Dataset: https://grouplens.org/datasets/movielens/
Stanford CS246: Mining Massive Data Sets - Recommendation Systems