Building a Scalable Recommendation System with Collaborative Filtering

The Problem

Creating a recommendation system that can handle large volumes of user data and provide accurate suggestions is a challenging task, especially when dealing with the cold start problem and sparse user-item interaction matrices.

Step 1: Understanding the Approach

To tackle this problem, we will train two models on the same user-item interaction matrix: user-based collaborative filtering, which finds users with similar rating histories, and matrix factorization, which compresses the matrix into a small number of latent factors per user and per item.

import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
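
To make "similar users" concrete, here is a minimal sketch of the cosine similarity that user-based collaborative filtering relies on. The toy ratings and the helper name cosine_similarity are made up for illustration and are not part of the MovieLens data.

```python
import numpy as np

# Made-up toy ratings: rows are users, columns are items, 0 means "not rated".
toy_ratings = np.array([
    [5, 4, 0, 0],   # user 0
    [4, 5, 0, 1],   # user 1: overlaps with user 0
    [0, 0, 5, 4],   # user 2: no items in common with user 0
], dtype=float)

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

sim_01 = cosine_similarity(toy_ratings[0], toy_ratings[1])
sim_02 = cosine_similarity(toy_ratings[0], toy_ratings[2])
print(f"user 0 vs user 1: {sim_01:.3f}")  # user 0 vs user 1: 0.964
print(f"user 0 vs user 2: {sim_02:.3f}")  # user 0 vs user 2: 0.000
```

Users who rated the same items in a similar way score close to 1, while users with no items in common score 0.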

Step 2: Loading the Data

We will use the MovieLens 100K dataset (ml-100k) to train our recommendation system. Its u.data file holds one rating per line as four tab-separated fields: user ID, item ID, rating, and timestamp. We fetch the file from a public URL and load the first three fields into a pandas DataFrame.

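import requests

# Download the tab-separated ratings file (user_id, item_id, rating, timestamp).
response = requests.get("https://files.grouplens.org/datasets/movielens/ml-100k/u.data", timeout=30)
response.raise_for_status()  # fail loudly on an HTTP error instead of parsing an error page
ratings = []
for line in response.text.splitlines():
    if not line.strip():
        continue  # skip blank lines
    user_id, item_id, rating, _timestamp = line.split("\t")
    ratings.append((int(user_id), int(item_id), int(rating)))
df = pd.DataFrame(ratings, columns=["user_id", "item_id", "rating"])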
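
As an alternative to splitting lines by hand, pandas can parse the same layout directly. The sketch below runs on three made-up rows so it works offline. The helper name parse_ratings is hypothetical, and the sample values are not taken from the real file.

```python
import io
import pandas as pd

# Three made-up rows in the same tab-separated layout as u.data:
# user_id, item_id, rating, timestamp.
sample = "1\t10\t5\t881250949\n1\t20\t3\t891717742\n2\t10\t4\t878887116\n"

def parse_ratings(text):
    """Parse u.data-style text into a DataFrame, dropping the timestamp column."""
    return pd.read_csv(
        io.StringIO(text),
        sep="\t",
        names=["user_id", "item_id", "rating", "timestamp"],
        usecols=["user_id", "item_id", "rating"],
    )

sample_df = parse_ratings(sample)
print(sample_df)
```

With the real download, the same helper could be called as parse_ratings(response.text).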

Step 3: Building the User-Item Interaction Matrix

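We create a sparse user-item interaction matrix with scipy's csr_matrix, which stores only the ratings that exist rather than every user-item pair. MovieLens IDs start at 1, so we subtract 1 to get zero-based row and column indices. Without that shift the matrix gains an empty row 0 and an empty column 0. Row i of the result therefore holds the ratings of user i + 1.

user_item_matrix = csr_matrix(
    (df["rating"].astype(float), (df["user_id"] - 1, df["item_id"] - 1))
)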
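
The following sketch shows the same construction on a made-up four-rating DataFrame, with an explicit shape and a density check. The names toy_df, toy_matrix, and density are illustrative only.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up ratings using 1-based IDs, as in MovieLens.
toy_df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [1, 3, 2, 3],
    "rating":  [5, 3, 4, 1],
})

n_users = int(toy_df["user_id"].max())
n_items = int(toy_df["item_id"].max())

# Shift IDs to zero-based indices and pin the shape explicitly.
toy_matrix = csr_matrix(
    (
        toy_df["rating"].to_numpy(dtype=float),
        ((toy_df["user_id"] - 1).to_numpy(), (toy_df["item_id"] - 1).to_numpy()),
    ),
    shape=(n_users, n_items),
)

# Fraction of user-item pairs that actually have a rating.
density = toy_matrix.nnz / (n_users * n_items)
print(toy_matrix.shape, f"density={density:.2f}")  # (3, 3) density=0.44
```

Passing shape explicitly keeps the dimensions stable even if the highest-numbered user or item happens to have no ratings in a given subset of the data.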

Step 4: Implementing Collaborative Filtering

We fit scikit-learn's NearestNeighbors on the rows of the matrix, using cosine distance, so that each user's 10 most similar users can later be looked up with kneighbors. Fitting only indexes the users. Recommendations come from what those neighbors rated.

nn = NearestNeighbors(n_neighbors=10, algorithm="brute", metric="cosine")
nn.fit(user_item_matrix)
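
Fitting alone does not produce recommendations yet. One possible way to turn neighbors into suggestions is sketched below on a made-up 4x5 matrix. The helper recommend_from_neighbors and its similarity-weighted scoring are assumptions for illustration, not the method used in this post.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Made-up ratings: rows are users, columns are items, 0 means "not rated".
toy = csr_matrix(np.array([
    [5, 4, 0, 0, 0],   # user 0
    [5, 4, 4, 0, 0],   # user 1: similar to user 0, also rated item 2
    [0, 0, 0, 5, 4],   # user 2: different taste
    [4, 5, 5, 0, 0],   # user 3: similar to user 0, also rated item 2
], dtype=float))

toy_nn = NearestNeighbors(n_neighbors=3, algorithm="brute", metric="cosine")
toy_nn.fit(toy)

def recommend_from_neighbors(model, matrix, user, top_n=2):
    """Rank items the user has not rated by similarity-weighted neighbor ratings."""
    distances, indices = model.kneighbors(matrix[user])
    scores = np.zeros(matrix.shape[1])
    for dist, neighbor in zip(distances[0], indices[0]):
        if neighbor == user:
            continue  # the query user is returned as its own nearest neighbor
        scores += (1.0 - dist) * matrix[neighbor].toarray().ravel()
    scores[matrix[user].toarray().ravel() > 0] = -np.inf  # hide already-rated items
    ranked = np.argsort(scores)[::-1][:top_n]
    return [int(i) for i in ranked if scores[i] > 0]

print(recommend_from_neighbors(toy_nn, toy, user=0))  # [2]
```

Cosine distance is 1 minus cosine similarity, so 1.0 - dist recovers the weight given to each neighbor's ratings.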

Step 5: Matrix Factorization

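We use the NMF (Non-negative Matrix Factorization) algorithm from scikit-learn to approximate the user-item interaction matrix as the product of two smaller non-negative matrices: one with 10 latent factors per user and one with 10 latent factors per item. Working in this lower-dimensional space lets the model score items a user has never rated.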

from sklearn.decomposition import NMF
nmf = NMF(n_components=10, init="random", random_state=0)
nmf.fit(user_item_matrix)
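
To show what the fitted factors look like, here is a minimal sketch on a made-up 4x4 matrix. The init="nndsvda" and max_iter=500 settings are my own choices for a small deterministic example and differ from the init="random" used above. Note that NMF applied this way treats unrated cells as zeros, so the reconstructed values are better read as ranking scores than as predicted ratings.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

# Made-up ratings with two rough taste groups; 0 means "not rated".
toy = csr_matrix(np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float))

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
user_factors = toy_nmf.fit_transform(toy)   # shape: (n_users, n_components)
item_factors = toy_nmf.components_          # shape: (n_components, n_items)

# Multiplying the factors back together gives a dense score for every user-item pair.
predicted = user_factors @ item_factors
print(user_factors.shape, item_factors.shape, predicted.shape)  # (4, 2) (2, 4) (4, 4)
```

On the full dataset, nmf.transform(user_item_matrix) and nmf.components_ give the corresponding user and item factors for the model trained above.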

Complete Script

The full runnable script combining all steps:

#!/usr/bin/env python3
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import NMF
import requests

def load_data():
    """Download MovieLens 100K ratings and return them as a DataFrame."""
    response = requests.get("https://files.grouplens.org/datasets/movielens/ml-100k/u.data", timeout=30)
    response.raise_for_status()  # fail loudly on an HTTP error
    ratings = []
    for line in response.text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        user_id, item_id, rating, _timestamp = line.split("\t")
        ratings.append((int(user_id), int(item_id), int(rating)))
    return pd.DataFrame(ratings, columns=["user_id", "item_id", "rating"])

def build_user_item_matrix(df):
    """Build a sparse users x items matrix; IDs are shifted to zero-based indices."""
    return csr_matrix(
        (df["rating"].astype(float), (df["user_id"] - 1, df["item_id"] - 1))
    )

def train_collaborative_filtering(user_item_matrix):
    nn = NearestNeighbors(n_neighbors=10, algorithm="brute", metric="cosine")
    nn.fit(user_item_matrix)
    return nn

def train_matrix_factorization(user_item_matrix):
    nmf = NMF(n_components=10, init="random", random_state=0)
    nmf.fit(user_item_matrix)
    return nmf

if __name__ == "__main__":
    df = load_data()
    user_item_matrix = build_user_item_matrix(df)
    nn = train_collaborative_filtering(user_item_matrix)
    nmf = train_matrix_factorization(user_item_matrix)
    print("Collaborative filtering and matrix factorization models trained successfully")

Expected Output

When you run the script, you should see the message "Collaborative filtering and matrix factorization models trained successfully", indicating that the models have been trained and are ready for use.

What I'd Change

In a real-world application, I would consider using more advanced techniques such as deep learning-based recommendation systems, which can learn complex patterns in user behavior and provide more accurate recommendations. Additionally, I would focus on optimizing the system for scalability and performance, using techniques such as distributed computing and caching to handle large volumes of user data.
