Building Robust Data Pipelines with Apache Airflow and Python: A GitHub API Example

As a data engineer, I've often struggled with building and managing complex data pipelines that involve multiple tasks, dependencies, and error handling. Recently, I worked on a project that required extracting data from the GitHub API and storing it in a database. I decided to use Apache Airflow and Python to build the pipeline, and I was impressed by how robust and scalable it turned out to be. In this tutorial, I'll walk you through the process of building a similar pipeline, and provide you with the tools and knowledge you need to handle errors and exceptions like a pro.

Key Takeaways

Use Apache Airflow to manage and schedule data pipelines
Extract data from the GitHub API using Python and the requests library
Handle errors and exceptions using try-except blocks and logging

The Problem

A common problem in data engineering is building and managing complex data pipelines that involve multiple tasks, dependencies, and error handling. This can be particularly challenging when working with external APIs, such as the GitHub API, which can be prone to errors and rate limiting.

Data and Sources

The GitHub API provides a wealth of data about repositories, including stars, forks, open issues, and description. The API endpoint we'll be using is https://api.github.com/repos/python/cpython. Data accessed on 2024-09-16.

Loading the Data

To load the data from the GitHub API, we'll use the requests library to send a GET request to the API endpoint.

import requests
response = requests.get("https://api.github.com/repos/python/cpython")
data = response.json()

The Core Logic

The core logic of our pipeline involves extracting the relevant data from the API response and storing it in a database. We'll use the pandas library to handle the data and the sqlite3 library to interact with the database.

import pandas as pd
import sqlite3

def extract_data(data):
    # Extract the relevant data from the API response
    stars = data['stargazers_count']
    forks = data['forks_count']
    open_issues = data['open_issues_count']
    description = data['description']
    
    # Create a pandas DataFrame to store the data
    df = pd.DataFrame({
        'stars': [stars],
        'forks': [forks],
        'open_issues': [open_issues],
        'description': [description]
    })
    
    return df

def load_data(df):
    # Connect to the database
    conn = sqlite3.connect('github_data.db')
    cursor = conn.cursor()
    
    # Create a table to store the data
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS github_data
        (stars INTEGER, forks INTEGER, open_issues INTEGER, description TEXT)
    ''')
    
    # Insert the data into the table
    df.to_sql('github_data', conn, if_exists='replace', index=False)
    
    # Close the connection
    conn.close()

Putting It Together

Now that we have the core logic in place, we can put everything together using Apache Airflow. We'll create a DAG that defines the tasks and dependencies, and use the schedule_interval parameter to schedule the pipeline to run at regular intervals.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'github_pipeline',
    default_args=default_args,
    schedule_interval=timedelta(days=1),
)

def extract_and_load_data(**kwargs):
    data = extract_data(response.json())
    load_data(data)

task = PythonOperator(
    task_id='extract_and_load_data',
    python_callable=extract_and_load_data,
    dag=dag,
)

Complete Script

The full runnable script combining all steps:

#!/usr/bin/env python3
import requests
import pandas as pd
import sqlite3
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'github_pipeline',
    default_args=default_args,
    schedule_interval=timedelta(days=1),
)

def extract_data(data):
    stars = data['stargazers_count']
    forks = data['forks_count']
    open_issues = data['open_issues_count']
    description = data['description']
    
    df = pd.DataFrame({
        'stars': [stars],
        'forks': [forks],
        'open_issues': [open_issues],
        'description': [description]
    })
    
    return df

def load_data(df):
    conn = sqlite3.connect('github_data.db')
    cursor = conn.cursor()
    
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS github_data
        (stars INTEGER, forks INTEGER, open_issues INTEGER, description TEXT)
    ''')
    
    df.to_sql('github_data', conn, if_exists='replace', index=False)
    
    conn.close()

def extract_and_load_data(**kwargs):
    response = requests.get("https://api.github.com/repos/python/cpython")
    data = extract_data(response.json())
    load_data(data)

task = PythonOperator(
    task_id='extract_and_load_data',
    python_callable=extract_and_load_data,
    dag=dag,
)

Expected Output

When you run the pipeline, you should see the data being extracted from the GitHub API and stored in the database. You can verify this by checking the database or running a query to retrieve the data.

Limitations and Tradeoffs

This pipeline assumes that the GitHub API is available and responding correctly. If the API is down or returns an error, the pipeline will fail. Additionally, this pipeline only extracts a limited set of data from the API. If you need to extract more data, you'll need to modify the pipeline accordingly.

Frequently Asked Questions

How do I handle errors and exceptions in the pipeline?

You can use try-except blocks to catch and handle errors and exceptions in the pipeline. For example, you can use a try-except block to catch any errors that occur when making the API request or storing the data in the database.

How do I schedule the pipeline to run at regular intervals?

You can use the schedule_interval parameter in the DAG definition to schedule the pipeline to run at regular intervals. For example, you can set the schedule_interval to timedelta(days=1) to run the pipeline once a day.

How do I store the data in a different database?

You can modify the load_data function to store the data in a different database. For example, you can use the psycopg2 library to store the data in a PostgreSQL database instead of a SQLite database.

What I'd Change

In a real-world production environment, I would add more error handling and logging to the pipeline to ensure that it's robust and reliable. I would also consider using a more scalable database solution, such as a cloud-based database, to store the data. Additionally, I would use a more secure way to store the API credentials, such as using environment variables or a secrets manager. Overall, this pipeline provides a solid foundation for building a robust data pipeline, but it's just the starting point, and there are many ways to improve and extend it.

Py Data