Collaborative Filtering Recommender System with Python | User-User CF and Item-Item CF Guide

Introduction

In our previous article, we successfully built a Content-based Recommender System – an approach that recommends items based on their attributes. However, this method has a significant limitation: it cannot leverage the "wisdom of the crowd" from the user community.

Today, we'll explore Collaborative Filtering – a method that addresses this exact problem by exploiting the behavior of similar users.

How Does Collaborative Filtering Work?

Imagine you're looking for a good movie to watch. You have two options:

Find movies with similar genres to ones you've enjoyed (Content-based)
Ask a friend with similar taste: "What great movie did you just watch?" (Collaborative Filtering)

The second approach is the core idea behind Collaborative Filtering: predicting a user's preference for an item based on the behavior of "similar" users.

Real-world example:

Suppose users A and B both rate crime dramas highly (4-5 stars). History shows B loved "The Godfather." Since A and B have similar tastes, the system infers that A might also like "The Godfather" → Recommend this movie to A.

The key insight here is that the system doesn't need to know what genre "The Godfather" belongs to. It only needs to know that users with behavior similar to A liked this movie.

Two Approaches to Collaborative Filtering

Collaborative Filtering has two main variants:

User-User CF (UUCF):

Find users with similar behavior to the target user
Recommend items that those similar users liked
Works well when the number of users is smaller than the number of items

Item-Item CF (IICF):

Find items similar to items the user has liked
Recommend those similar items
Works well when the number of items is smaller than the number of users (more common in practice)

In this article, we'll implement User-User CF to understand the principles, then discuss how to extend it to Item-Item CF.

System Design

Step 1: Building the Utility Matrix

Unlike the Content-based approach, which only needs item information, Collaborative Filtering requires three data components:

Users: List of users
Items (Movies): List of movies
Ratings: User scores for each item

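These components are represented as a Utility Matrix: rows are users, columns are items, and values are ratings. The loader below returns the raw (user_id, item_id, rating) triples; the matrix itself is assembled in Step 2, where the code stores it transposed (items as rows, users as columns).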

python

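import pandas
import numpy as np

def get_dataframe_ratings_base(file_path):
    """
    Read a MovieLens ratings file
    Returns a NumPy array with 3 columns: user_id, item_id, rating
    """
    # Assumes the tab-separated MovieLens 100K layout, which has a fourth
    # timestamp column; naming all four keeps user_id from becoming the index
    r_cols = ['user_id', 'item_id', 'rating', 'unix_timestamp']
    ratings = pandas.read_csv(
        file_path,
        sep='\t',
        names=r_cols,
        encoding='latin-1'
    )
    return ratings[['user_id', 'item_id', 'rating']].values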

The challenge: This matrix has many empty values (missing values) – items that users haven't rated yet. Our task is to predict these values.
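To make the sparsity concrete, here is a minimal sketch that turns a handful of rating triples into a small Utility Matrix. The toy data and the variable names (triples, utility) are made up for illustration and are not part of the MovieLens loader above.

```python
import numpy as np

# Hypothetical toy data: rows are (user_id, item_id, rating) triples, 0-indexed
triples = np.array([
    [0, 0, 5],
    [0, 1, 4],
    [1, 0, 2],
    [2, 2, 1],
])

n_users = int(triples[:, 0].max()) + 1
n_items = int(triples[:, 1].max()) + 1

# Start with every cell unknown (NaN), then fill in the observed ratings
utility = np.full((n_users, n_items), np.nan)
utility[triples[:, 0], triples[:, 1]] = triples[:, 2]

# Fraction of cells with no rating: these are the values CF must predict
missing_fraction = np.isnan(utility).mean()
print(utility)
print(f"missing cells: {missing_fraction:.0%}")
```

Even in this 3 × 3 toy matrix more than half of the cells are empty; on real data the share of missing cells is typically far higher.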

Step 2: Matrix Normalization

Before computing similarities, we need to handle missing values. There are several approaches:

Option 1: Replace with 0 → Simple but inaccurate
Option 2: Replace with 2.5 (midpoint of 0-5) → Better but still imprecise
Option 3 (Best): Normalize by each user's mean rating

Why is Option 3 best?

Each person has different rating standards. An "easy" rater might give 4 stars to an average movie, while a "tough" rater only gives 3 stars. By subtracting each user's mean rating, we remove personal bias and only retain information about whether the user likes an item more or less than their average.
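A small numeric sketch of this idea, using two hypothetical users who share the same preference ordering but rate on different scales (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical ratings: NaN marks an item the user has not rated
ratings = np.array([
    [5.0, 4.0, np.nan],   # "easy" rater
    [3.0, 2.0, np.nan],   # "tough" rater
])

# Per-user mean over rated items only
user_means = np.nanmean(ratings, axis=1, keepdims=True)

# Subtract each user's mean, then fill the unrated cells with 0
diff = ratings - user_means
centered = np.where(np.isnan(diff), 0.0, diff)

print(user_means.ravel())
print(centered)
```

After centering, both rows contain the same values: the difference in rating scale is gone and only the relative preference remains.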

python

from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

class CollaborativeFiltering:
    """
    Collaborative Filtering System
    Supports both User-User CF and Item-Item CF
    """
    
    def __init__(self, data_matrix, k, dist_func=cosine_similarity, uuCF=1):
        """
        Initialize CF
        
        Args:
            data_matrix: Matrix [user_id, item_id, rating]
            k: Number of neighbors for prediction
            dist_func: Similarity function (default: cosine_similarity)
            uuCF: 1 = User-User CF, 0 = Item-Item CF
        """
        self.uuCF = uuCF
        # For Item-Item CF, swap user and item columns
        self.Y_data = data_matrix if uuCF else data_matrix[:, [1, 0, 2]]
        self.k = k
        self.dist_func = dist_func
        self.Ybar_data = None
        
        # Number of users and items (+1 because index starts from 0)
        self.n_users = int(np.max(self.Y_data[:, 0])) + 1
        self.n_items = int(np.max(self.Y_data[:, 1])) + 1

    def normalize_matrix(self):
        """
        Normalize matrix by:
        1. Computing mean rating for each user
        2. Subtracting the user's mean from each rating
        3. Replacing unknown values with 0
        
        Result: Positive = liked more than average
                Negative = liked less than average
                Zero = not yet rated
        """
        users = self.Y_data[:, 0]
        # Copy as float so mean-centered values are not truncated to integers
        self.Ybar_data = self.Y_data.astype(np.float64)
        self.mu = np.zeros((self.n_users,))

        for n in range(self.n_users):
            # Get all ratings from user n
            ids = np.where(users == n)[0].astype(np.int32)
            ratings = self.Y_data[ids, 2]

            # Compute mean (0 for a user with no ratings)
            m = np.mean(ratings) if ids.size > 0 else 0
            self.mu[n] = m

            # Normalize: rating - mean
            self.Ybar_data[ids, 2] = ratings - self.mu[n]

        # Convert to sparse matrix (items x users) for memory efficiency;
        # row and column indices must be integers
        self.Ybar = sparse.coo_matrix(
            (self.Ybar_data[:, 2],
             (self.Ybar_data[:, 1].astype(np.int64),
              self.Ybar_data[:, 0].astype(np.int64))),
            (self.n_items, self.n_users)
        )
        self.Ybar = self.Ybar.tocsr()

Why use Sparse Matrix?

Utility Matrices are typically very large (millions of users × millions of items) but only a small fraction of cells have values (users only rate a few items). Sparse Matrices only store non-zero values and their positions, significantly reducing memory usage.
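As a rough illustration of the saving, the sketch below builds a random sparse matrix and compares its storage with a dense float64 array of the same shape. The sizes and the random ratings are assumptions chosen only to keep the example fast.

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes: 1,000 users x 2,000 items with about 5,000 ratings
n_users, n_items, n_ratings = 1000, 2000, 5000
rng = np.random.default_rng(0)
rows = rng.integers(0, n_users, size=n_ratings)
cols = rng.integers(0, n_items, size=n_ratings)
vals = rng.integers(1, 6, size=n_ratings).astype(np.float64)

# COO is convenient to build from triples; CSR is efficient for row slicing.
# Duplicate (row, col) pairs are summed during conversion, so nnz <= n_ratings.
m = sparse.coo_matrix((vals, (rows, cols)), shape=(n_users, n_items)).tocsr()

dense_bytes = n_users * n_items * 8  # what a dense float64 array would need
sparse_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

print(f"dense:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.3f} MB")
```

The sparse representation grows with the number of stored ratings rather than with users × items, which is why it stays practical as the matrix dimensions increase.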

Step 3: Computing User Similarity

After normalization, we compute the Similarity Matrix – a matrix containing similarity scores between all user pairs.

python

def compute_similarity(self):
    """
    Compute similarity matrix between users
    Using Cosine Similarity
    
    Result: Matrix S[i][j] = similarity between user i and user j
            Values lie in the range [-1, 1]
            1 = exactly the same
            0 = no correlation
            -1 = completely opposite
    """
    self.S = self.dist_func(self.Ybar.T, self.Ybar.T)

Cosine Similarity measures the angle between two vectors in multi-dimensional space. Two users with similar rating patterns will have cosine similarity close to 1.
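For intuition, cosine similarity can be computed by hand as the dot product divided by the product of the vector norms. The helper below is an illustrative stand-in (not the scikit-learn function used in the class), and the three rating vectors are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b (0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical mean-centered rating vectors for three users
u0 = np.array([1.0, -1.0, 0.5])
u1 = np.array([2.0, -2.0, 1.0])    # same pattern as u0, larger magnitude
u2 = np.array([-1.0, 1.0, -0.5])   # opposite pattern

sim_same = cosine(u0, u1)          # close to 1
sim_opposite = cosine(u0, u2)      # close to -1
print(sim_same, sim_opposite)
```

Note that u1 is simply u0 scaled by 2, yet the similarity is still 1: cosine similarity depends on the direction of the rating vectors, not on their length.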

Step 4: Predicting Ratings

This is the most critical step. To predict user u's rating for item i, we:

Find all users who have rated item i
Select the k most similar users to user u (k-nearest neighbors)
Compute weighted average of their ratings (weight = similarity)

Concrete example:

Suppose we need to predict user U1's rating for item I1 with k=2:

Users who rated I1: U0, U3, U5
U1's similarity with them: U0=0.83, U3=-0.4, U5=-0.23
Select 2 users with highest similarity: U0 (0.83), U5 (-0.23)
Their normalized ratings for I1: U0=0.75, U5=0.5
Predicted rating = (0.83×0.75 + (-0.23)×0.5) / (|0.83| + |-0.23|) ≈ 0.48
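The arithmetic of this example can be checked with a few lines of NumPy. The arrays simply restate the numbers above; U3's normalized rating is not given in the example, so it is left as NaN and never used because U3 is not among the two nearest neighbors.

```python
import numpy as np

k = 2
# U1's similarity with U0, U3, U5 (in that order)
sims = np.array([0.83, -0.40, -0.23])
# Normalized ratings for I1; U3's value is not stated in the example
norm_ratings = np.array([0.75, np.nan, 0.5])

# Indices of the k largest similarities
top = np.argsort(sims)[-k:]

# Similarity-weighted average, normalized by the sum of absolute weights
pred = (sims[top] * norm_ratings[top]).sum() / np.abs(sims[top]).sum()
print(round(pred, 2))  # 0.48
```

Dividing by the sum of absolute similarities keeps the prediction within the range of the neighbors' normalized ratings even when some weights are negative.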

python

def __predict(self, u, i, normalized=1):
    """
    Predict user u's rating for item i
    
    Args:
        u: User ID
        i: Item ID
        normalized: 1 = return normalized rating, 0 = actual rating
    """
    # Find users who rated item i
    ids = np.where(self.Y_data[:, 1] == i)[0].astype(np.int32)
    users_rated_i = (self.Y_data[ids, 0]).astype(np.int32)
    
    # Get user u's similarity with these users
    sim = self.S[u, users_rated_i]
    
    # Select k users with highest similarity
    a = np.argsort(sim)[-self.k:]
    nearest_s = sim[a]
    
    # Get their normalized ratings
    r = self.Ybar[i, users_rated_i[a]]
    
    # Compute the similarity-weighted average of the neighbors' normalized ratings
    pred = (r * nearest_s)[0] / (np.abs(nearest_s).sum() + 1e-8)

    # Add the mean back when the actual (unnormalized) rating is requested
    return pred if normalized else pred + self.mu[u]

def predict(self, u, i, normalized=1):
    """Wrapper function supporting both UUCF and IICF"""
    if self.uuCF:
        return self.__predict(u, i, normalized)
    return self.__predict(i, u, normalized)

Step 5: Generating Top-N Recommendations

Finally, for each user, we predict ratings for all unrated items, sort in descending order, and take the top N.

python

def recommend_top(self, user_id, top_n=10):
    """
    Recommend top N items for a user
    
    Args:
        user_id: Target user's ID
        top_n: Number of items to recommend
        
    Returns:
        List of recommended items, sorted by predicted rating descending
    """
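    # In Item-Item mode the first two columns of Y_data are swapped
    # (item, user) and n_users actually counts items, so pick accordingly
    user_col, item_col = (0, 1) if self.uuCF else (1, 0)
    n_candidates = self.n_items if self.uuCF else self.n_users

    # Get items user has already rated (a set gives fast membership tests)
    ids = np.where(self.Y_data[:, user_col] == user_id)[0]
    items_rated = set(self.Y_data[ids, item_col].tolist())

    # Predict ratings for unrated items
    recommendations = []
    for item_id in range(n_candidates):
        if item_id not in items_rated:
            predicted_rating = self.predict(user_id, item_id)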
            recommendations.append({
                'item_id': item_id,
                'predicted_rating': predicted_rating
            })
    
    # Sort by rating descending and take top N
    recommendations.sort(key=lambda x: x['predicted_rating'], reverse=True)
    return recommendations[:top_n]
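The selection step on its own can be sketched independently of the class. The function name top_n_unrated and the score values below are hypothetical; the sketch assumes the predicted scores for one user are already available as an array indexed by item id.

```python
import numpy as np

def top_n_unrated(scores, rated_items, n):
    """Return the ids of the n highest-scoring items the user has not rated."""
    scores = np.asarray(scores, dtype=float).copy()
    rated = np.array(sorted(rated_items), dtype=int)
    scores[rated] = -np.inf                  # exclude already-rated items
    order = np.argsort(scores)[::-1]         # best score first
    n_candidates = len(scores) - len(rated)  # never return excluded items
    return order[:min(n, n_candidates)].tolist()

# Hypothetical predicted (normalized) ratings for items 0..4
predicted = [0.2, 1.5, -0.3, 0.9, 0.4]
print(top_n_unrated(predicted, rated_items={1}, n=2))  # [3, 4]
```

Item 1 has the highest score but is skipped because the user already rated it, so the next two best items are returned.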

Item-Item CF: A Different Perspective

In practice, Item-Item CF is often preferred over User-User CF because:

1. Computational efficiency: The number of items is usually much smaller than the number of users, so the item-item Similarity Matrix is smaller and faster to compute.

2. Greater stability: Each item is rated by many users, so item feature vectors are "denser." When new ratings come in, item-item similarities change less than user-user similarities.

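3. Easy to implement from UUCF: Transposing the Utility Matrix turns items into "users", so the same algorithm applies; the result is then transposed back. In the class above, swapping the user and item columns (uuCF=0) has this effect.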

python

# Initialize Item-Item CF
iicf = CollaborativeFiltering(data_matrix, k=10, uuCF=0)
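To see what the column swap does, here is a minimal sketch on hypothetical rating triples. It only illustrates the data layout and the resulting similarity matrix sizes, not the full prediction pipeline.

```python
import numpy as np

# Hypothetical (user_id, item_id, rating) triples
data = np.array([
    [0, 0, 5],
    [0, 1, 3],
    [1, 0, 4],
    [2, 1, 2],
])

# Swapping the first two columns turns every row into (item_id, user_id, rating),
# so code written for "users" now operates on items instead
swapped = data[:, [1, 0, 2]]

n_users = int(data[:, 0].max()) + 1   # 3 users -> 3 x 3 user-user similarity
n_items = int(data[:, 1].max()) + 1   # 2 items -> 2 x 2 item-item similarity
print(swapped)
print(n_users, n_items)
```

With fewer items than users, the item-item similarity matrix has fewer entries, which is the efficiency argument made in point 1.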

Summary of Basic Recommender System Series

Through these 3 articles, we have:

Understood the overview of Recommender Systems and popular methods
Built Content-based RS using TF-IDF and Cosine Similarity
Built Collaborative Filtering RS with both User-User and Item-Item variants
Mastered sparse matrix handling and data normalization

This is just the foundation. In production, systems typically combine multiple approaches (Hybrid Recommender Systems) and use advanced techniques like Matrix Factorization and Deep Learning for better performance.

References:

Ekstrand, Michael D., John T. Riedl, and Joseph A. Konstan. "Collaborative filtering recommender systems" (2011)
Leskovec, J., Rajaraman, A., & Ullman, J. D. "Mining of Massive Datasets" - Chapter 9: Recommendation Systems. Stanford University (2014)
MovieLens Dataset: https://grouplens.org/datasets/movielens/
Stanford CS246: Mining Massive Data Sets - Recommendation Systems
