Forecasting F1 Race Activity with Prophet and statsmodels: A Step-by-Step Guide

The Problem

Have you ever wondered how major sporting organizations plan their logistics, staffing, or even marketing campaigns far in advance? It often boils down to accurately forecasting future event schedules and their associated activity levels. For something as globally complex as Formula 1, predicting the number of races or significant meetings in upcoming months isn't just a nice-to-have; it’s a critical input for everything from broadcast scheduling to venue preparation. The real challenge isn't just having historical data, but choosing and implementing the right forecasting technique that can gracefully handle trends, seasonality, and the occasional unexpected pattern in real-world time series data.

Step 1 — Data Ingestion and Preprocessing

The first hurdle in any forecasting task is getting your hands on the data and shaping it into something usable. For F1, we'll tap into the OpenF1 API. Our goal here is to fetch the 2024 meetings and transform them into a clean monthly time series: the number of distinct meetings that start in each month. This count will serve as our proxy for "F1 race activity" (or "attendance" in a very loose sense), reflecting the intensity of the F1 calendar rather than actual crowd figures.

I'll start by making a simple HTTP GET request to the API. Since real-world APIs can be flaky, I’ve wrapped the request in a `try-except` block inside a retry loop: timeouts and other request errors trigger another attempt after a short delay, and if every attempt fails the function falls back to an empty list. After fetching, I parse the JSON response and load it into a pandas DataFrame. The key is to convert the `date_start` column to a datetime type, then group the rows by month and count the unique `meeting_key` values in each group. Months without any meetings are filled in with a count of 0 so the series has no gaps. This gives us our `ds` (date) and `y` (count) columns; Prophet requires exactly these column names, and the same frame can be fed to statsmodels by using `y` indexed by `ds`.


import requests
import pandas as pd
from datetime import datetime, timedelta
import time
import matplotlib.pyplot as plt
import seaborn as sns
from prophet import Prophet
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

def fetch_f1_data(year=2024, retries=3, delay=5):
    """Fetches F1 meeting data from the OpenF1 API, retrying on failure."""
    url = f'https://api.openf1.org/v1/meetings?year={year}'
    for i in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status() # Raise HTTPError for bad responses (4xx or 5xx)
            return response.json()
        except requests.exceptions.Timeout:
            print(f"Request timed out. Retrying {i+1}/{retries}...")
        except requests.exceptions.RequestException as e:
            print(f"API request failed: {e}. Retrying {i+1}/{retries}...")
        if i < retries - 1:
            time.sleep(delay)  # Pause before the next attempt, but not after the last one
    print("Failed to fetch data after multiple retries. Using an empty dataset.")
    return []

def preprocess_f1_data(raw_data):
    """Preprocesses raw F1 meeting data into a time series DataFrame."""
    if not raw_data:
        print("No raw data to preprocess. Returning empty DataFrame.")
        return pd.DataFrame(columns=['ds', 'y'])

    try:
        df = pd.DataFrame(raw_data)
        
        # Ensure 'date_start' is present and not null
        if 'date_start' not in df.columns or df['date_start'].isnull().all():
            raise ValueError("Missing or entirely null 'date_start' column in raw data.")

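        # Parse as UTC so any offsets in the strings are handled uniformly,
        # then drop the timezone: Prophet rejects tz-aware values in 'ds'
        df['date_start'] = pd.to_datetime(df['date_start'], errors='coerce', utc=True).dt.tz_localize(None)
        df = df.dropna(subset=['date_start'])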

        # Count unique meetings per month
        # Using a fixed start of the month for consistent grouping
        df['month_start'] = df['date_start'].dt.to_period('M').dt.start_time
        monthly_activity = df.groupby('month_start')['meeting_key'].nunique().reset_index()
        monthly_activity.columns = ['ds', 'y']
        
        # Ensure all months from min to max are present, filling missing with 0
        min_date = monthly_activity['ds'].min()
        max_date = monthly_activity['ds'].max()
        full_date_range = pd.date_range(start=min_date, end=max_date, freq='MS')
        full_df = pd.DataFrame({'ds': full_date_range})
        monthly_activity = pd.merge(full_df, monthly_activity, on='ds', how='left').fillna(0)
        
        return monthly_activity
    except Exception as e:
        print(f"Data preprocessing failed: {e}. Returning empty DataFrame.")
        return pd.DataFrame(columns=['ds', 'y'])
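
To see what this preprocessing produces without hitting the network, here is a minimal sketch that applies the same month-bucketing idea to three made-up records. The `meeting_key` values and dates are invented for illustration and are not real API output. The sketch also swaps the merge-based gap filling for a `reindex`, which is just an alternative way to get the same zero-filled result.

```python
import pandas as pd

# Hypothetical records: only the two fields the preprocessing relies on.
# Keys and dates are invented for this example.
sample = [
    {"meeting_key": 1, "date_start": "2024-03-01T10:00:00+00:00"},
    {"meeting_key": 2, "date_start": "2024-03-22T10:00:00+00:00"},
    {"meeting_key": 3, "date_start": "2024-05-03T10:00:00+00:00"},
]

df = pd.DataFrame(sample)

# Parse as UTC, then drop the timezone so the month buckets are tz-naive.
df["date_start"] = pd.to_datetime(df["date_start"], utc=True).dt.tz_localize(None)
df["month_start"] = df["date_start"].dt.to_period("M").dt.start_time

# Count distinct meetings per month.
counts = df.groupby("month_start")["meeting_key"].nunique()

# Reindex over every month start in the range so idle months show up as 0.
full_range = pd.date_range(counts.index.min(), counts.index.max(), freq="MS")
monthly = (
    counts.reindex(full_range, fill_value=0)
    .rename_axis("ds")
    .reset_index(name="y")
)

print(monthly)
```

With these three records, March gets a count of 2, April is filled in with 0, and May gets 1, which is the gap-free `ds`/`y` shape the models expect.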

Step 2 — Exploratory Data Analysis (EDA) with Visualizations

With our data preprocessed, the next crucial step is to understand its underlying patterns. Before we throw models at it, we need to visually inspect for trends, seasonality, and any unusual spikes or dips. This helps us confirm our assumptions and sometimes even reveals issues with the data itself.

I'll generate a simple line plot of our `ds` (date) against `y` (monthly F1 activity). This visualization will immediately show us how F1 event activity has changed over the year. We might expect higher activity during certain parts of the year and quieter periods, reflecting the typical F1 season calendar. Saving this plot to a file ensures we have a record of our initial observations.


def perform_eda(df, filename='f1_activity_eda.png'):
    """Performs basic EDA and visualizes the F1 activity."""
    if df.empty:
        print("EDA skipped: DataFrame is empty.")
        return

    plt.figure(figsize=(12, 6))
    sns.lineplot(x='ds', y='y', data=df)
    plt.title('Monthly F1 Race Activity (Meetings Count)')
    plt.xlabel('Date')
    plt.ylabel('Number of Unique Meetings')
    plt.grid(True)
    plt.tight_layout()
    plt.savefig(filename)
    print(f"EDA plot saved to {filename}")
    plt.close()
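
A plot is easier to trust when a few numbers back it up. As a complement to the chart, here is a small sketch of a numeric summary on a frame with the same `ds`/`y` layout. The counts below are hypothetical values chosen only to exercise the calls; they are not the 2024 calendar.

```python
import pandas as pd

# Hypothetical monthly counts, invented purely to demonstrate the calls.
df = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=6, freq="MS"),
    "y": [0, 1, 3, 2, 3, 0],
})

summary = df["y"].describe()        # count, mean, std, min, quartiles, max
busiest = df.loc[df["y"].idxmax()]  # first row holding the highest count
idle_months = df.loc[df["y"] == 0, "ds"].dt.strftime("%Y-%m").tolist()

print(summary)
print(f"Busiest month: {busiest['ds']:%Y-%m} with {busiest['y']} meetings")
print(f"Months with no meetings: {idle_months}")
```

Note that `idxmax` returns only the first maximum, so ties (March and May in this toy data) need a filter such as `df[df["y"] == df["y"].max()]` if you want every busiest month.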

Step 3 — Prophet Model Implementation

Now that we've got our clean time series and a basic understanding of its patterns,
