Apache Airflow in Data Engineering

So you are currently running cron jobs to schedule your work? A few months back, during my software engineering journey, I was doing the same.

Today I will share how I migrated from cron jobs to Apache Airflow: why I chose Apache Airflow over cron, and how to set up Apache Airflow on your machine.

A few months back, I was working at a company where I had to run scripts once a day. Guess what? I used a cron job, and it was amazing. I was able to run my scripts every day.
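For context, a daily cron job is just one line in the crontab. The script path and schedule below are made-up examples; note that you have to redirect output yourself, because cron keeps no job history:

```shell
# m  h  dom mon dow  command
# Hypothetical entry: run the daily export at 02:30 every day.
# stdout/stderr are appended to a log file by hand.
30 2 * * * /usr/bin/python3 /home/me/scripts/daily_export.py >> /var/log/daily_export.log 2>&1
```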

So cron made my life very easy: the scripts ran successfully and I was living a happy life. Oh wait, that happiness didn't last long.

So, What Happened? 

All was perfectly fine until we found that data was missing.

Wait, What?

Yes, I was checking on the jobs every day, but as the number of jobs increased it became difficult to track their status. It was hard to find out why a particular job had failed; what happened to a job was always a major concern.

What issues did I face while using cron?

  • Logging
  • Error handling
  • Difficulty re-running missed jobs
  • Difficulty retrying failed jobs

So what might be a good alternative that manages all of this?

I found background tasks and task schedulers: popular standalone ones like Celery, while frameworks such as Sanic and FastAPI ship their own background-task support, and so on.

"Celery: Distributed task queue. Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well."

So I went for a background task scheduler instead of cron. Most of the issues were solved, but I still had to wire up everything like logging and error handling myself.
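As a rough sketch of what that looked like (the module name, task, broker URL, and schedule below are all hypothetical; it assumes Celery with a running Redis broker), a periodic Celery task is defined like this:

```python
from celery import Celery
from celery.schedules import crontab

# Hypothetical broker URL; assumes a Redis instance on localhost.
app = Celery("jobs", broker="redis://localhost:6379/0")

@app.task
def daily_export():
    # The actual work; logging and retry behavior still have to be
    # configured by hand, unlike in Airflow.
    print("exporting data...")

# Celery beat runs the task daily at 02:30, mirroring the old cron entry.
app.conf.beat_schedule = {
    "daily-export": {
        "task": "jobs.daily_export",
        "schedule": crontab(hour=2, minute=30),
    },
}
```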

As I stepped into the world of Data Engineering, I had to work with not only Data but now I was working with big data.

I was looking for a tool I could use to create jobs, schedule them, and monitor them easily while interacting with distributed systems, with jobs monitored from a UI rather than the CLI.

Then I found Apache Airflow, and I thought: that's exactly what I was looking for.

It gave me everything in one place: a scheduler, a background task executor, and an interactive UI (to visualize workflows, logs, and job statuses).

"Airflow is a platform to programmatically author, schedule and monitor workflows"

It is used for the scheduling and orchestration of data pipelines or workflows as directed acyclic graphs (DAGs) of tasks.
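As a sketch of what such a DAG of tasks looks like (the DAG id, task ids, and schedule below are hypothetical), a minimal two-task pipeline can be written like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG groups tasks and defines their schedule and dependencies.
with DAG(
    dag_id="daily_export",           # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",      # same cadence as the old cron job
    catchup=False,                   # don't backfill past runs
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # extract must finish before load starts: a two-node acyclic graph.
    extract >> load
```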

 The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.

Orchestration of data pipelines refers to the sequencing, coordination, scheduling, and managing of complex data pipelines from diverse sources.

Rich command line utilities make performing complex surgeries on DAGs a snap.

The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

Since workflows are defined in Python scripts, it is easy to create complex ETL pipelines.

If I need to execute a Python script as well as a bash script, I can do both, because Airflow provides Operators.

"Operator describes a single task in a workflow. Operators are usually (but not always) atomic, meaning they can stand on their own and don't need to share resources with any other operators"

If I need interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig, there are Hooks: they implement a common interface when possible and act as building blocks for operators.
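A rough sketch of using a hook (this assumes the apache-airflow-providers-postgres package is installed and that a connection with id "my_postgres" has been configured in Airflow; both are assumptions, not part of a default install):

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

def fetch_row_count():
    # The hook reads host, credentials, and database from the
    # "my_postgres" connection stored in Airflow (hypothetical id).
    hook = PostgresHook(postgres_conn_id="my_postgres")
    # get_records runs the query and returns the result rows.
    return hook.get_records("SELECT count(*) FROM my_table")
```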

What are the principles of Airflow?

  • Dynamic: Airflow pipelines are configuration as code (Python), allowing you to write code that instantiates pipelines dynamically.
  • Extensible: Easily define your operators, executors and extend the library so that it fits the level of abstraction that suits your environment.
  • Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine.
  • Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
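As a small sketch of the Jinja templating mentioned above (the DAG and task ids are hypothetical), `{{ ds }}` is a built-in template variable that renders to the run's logical date as YYYY-MM-DD:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="templated_example",      # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
) as dag:
    # The bash_command is a Jinja template; Airflow fills in {{ ds }}
    # with the logical date before the command runs.
    print_date = BashOperator(
        task_id="print_date",
        bash_command="echo 'running for {{ ds }}'",
    )
```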

Now, Let's set up Apache Airflow Locally.

Install Airflow on your system using the pip Python package manager.

pip install apache-airflow

Now you can run Airflow in standalone mode:

airflow standalone

The standalone command will initialize the database, create a user, and start all components for you.

Visit localhost:8080 in the browser and use the admin account details shown in the terminal to log in.

If you want to run the individual parts of Airflow manually rather than using the all-in-one standalone command, you can instead run:

airflow db init    #initializes db for airflow

airflow users create \
--username admin \
--firstname hello \
--lastname world \
--role Admin \
--email [email protected]    #creates an admin user (prompts for a password)

airflow webserver --port 8080 #runs airflow webserver on port 8080

airflow scheduler #runs airflow scheduler

Alternatively, you can configure Airflow using Docker.

If you have docker and docker-compose installed on your system, you can follow this approach.

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.2.3/docker-compose.yaml'

mkdir -p ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" > .env

docker-compose up airflow-init    #initializes the database and creates the default airflow/airflow user

docker-compose up