If you want to work efficiently as a data scientist or engineer, it's important to have the right tools. The right tooling lets you automate repetitive processes and, just as importantly, run them on a reliable, regular schedule. This can be anything from extracting, analyzing, and loading data for your data science team's regular report, to re-training your machine learning model every time you receive new data from users.
Apache Airflow is one such tool for keeping your workflows on track. It is an open-source workflow management system that lets you automate anything from simple to complex processes, is written primarily in Python, and has a rich web UI for visualizing, monitoring, and fixing any issues that may arise. This tutorial is a step-by-step guide on how to install, configure, and set up Airflow, as well as how to schedule Python scripts with it.
Installation and setup
Let's start by setting up Airflow on our workstation so that we can test and run the pipelines we build. There are multiple ways to set up and run Apache Airflow; in this tutorial, we will run it with Docker. So if you don't already have Docker and Docker Compose installed on your system, download them from their respective websites first.
Next, in a new folder, let's download the Docker Compose file developed by the Airflow community. It describes all the services Airflow needs, which makes setting up the environment much easier. You can either download it from this link or use the following curl command:
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.2.4/docker-compose.yaml'
You should now have a file called docker-compose.yaml. It contains the service definitions needed to deploy Airflow with Docker Compose.
Before we start Airflow for the first time, we need to prepare the environment. In the folder containing docker-compose.yaml, create three subfolders – dags, plugins, and logs:
mkdir ./dags ./plugins ./logs
Your folder structure should look like this:
.
├── dags
├── docker-compose.yaml
├── logs
└── plugins
3 directories, 1 file
Next, if you're on Linux, you will need to export an environment variable so that files created in these folders from inside the containers get the same user and group permissions as the folders on your host:
echo -e "AIRFLOW_UID=$(id -u)" > .env
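If you are on another operating system, Airflow's quick-start notes that you may just see a warning that AIRFLOW_UID is not set, and suggests creating the .env file manually with a default value. Either way, .env is a plain key=value file that ends up looking roughly like this (the UID below is only the documented default, not necessarily yours):
# .env – lives next to docker-compose.yaml
AIRFLOW_UID=50000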
Initialize Airflow
Now that the folders are created and the permissions are set, let us initialize our Airflow instance. To do this, run:
docker-compose up airflow-init
This service runs airflow db init (or airflow db upgrade on later runs) and then creates a user "airflow" with the password "airflow", which you can confirm from the command's output. It also pulls the images and creates the containers required to run the Airflow services.
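If you would rather not keep the default airflow/airflow account, the community Compose file reads the initial admin credentials from environment variables, so you can override them in the same .env file before running the init step – double-check the exact variable names against your own docker-compose.yaml:
# added to .env before running docker-compose up airflow-init
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=change-me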
Now the last thing we need to do is execute the command:
docker-compose up
This command starts all the services specified in the docker-compose file – the scheduler, the webserver, the worker, Redis, and so on.
To check that the docker containers are up and running you can open up a new terminal and run the following command:
docker ps
The output should list all the Airflow containers (webserver, scheduler, worker, and so on) as up and running.
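If the full docker ps output is hard to read, a --format filter (a standard Docker option, nothing Airflow-specific) narrows it down to just the columns we care about:
docker ps --format "table {{.Names}}\t{{.Status}}"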
Now we can finally use our Airflow instance. Open up a web browser and go to http://localhost:8080 or http://127.0.0.1:8080/. If you see a login page, everything worked!
To log in, enter the username and password created by the init command: "airflow" and "airflow". You will then be greeted by the main dashboard of your Airflow instance running on Docker.
And that's it! Now we can take a look at how to schedule Python scripts with Apache Airflow.
Schedule Python scripts
To schedule Python scripts with Apache Airflow, open up the dags folder we created earlier (or create a folder called "dags" next to your docker-compose.yaml if you haven't). This is where all your DAGs – the Python scripts that define your workflows – will live. Once you have it, create a file in there with a .py extension, since every DAG is just a Python file.
The first step in creating a DAG is making the right imports. Open up the file you just created (crm-elastic-dag.py in my case) and add the following import:
from airflow import DAG
Here we are importing the DAG class, an essential import that tells Airflow this file actually defines a DAG.
The second important import deals with dates:
from airflow.utils.dates import days_ago
This matters because, as you will see, a data pipeline expects a start date – the point in time from which it should start being scheduled.
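For reference, days_ago(1) simply produces a timestamp one day in the past (at midnight); if you prefer a fixed start date instead, a plain datetime works just as well – the date below is arbitrary:
from datetime import datetime

# a fixed date in the past, instead of days_ago(1)
start_date = datetime(2022, 3, 1)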
Now we can define how we want to schedule our Python scripts with Apache Airflow. I have already created a DAG that simply runs a Python script of mine once a day. You can see how it is put together in the code below:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

from includes.vs_modules.test import hello

args = {
    'owner': 'Vincent Stevenson',
    'start_date': days_ago(1)  # make the start date lie in the past
}

# defining the DAG object
dag = DAG(
    dag_id='crm-elastic-dag',
    default_args=args,
    schedule_interval='@daily'  # make this workflow run every day
)

# assigning the task for our DAG to do
with dag:
    hello_world = PythonOperator(
        task_id='hello',
        python_callable=hello,
        # provide_context=True
    )
Notice that instead of defining the function inside the DAG file itself, I have imported it from a separate module:
from includes.vs_modules.test import hello
This makes it much easier to schedule multiple scripts and keep them organized.
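For reference, a layout that matches that import path could look roughly like this inside the dags folder (the includes and vs_modules names come from my example; Airflow adds the dags folder to the Python path, which is what makes the import work):
dags/
├── crm-elastic-dag.py
└── includes/
    ├── __init__.py
    └── vs_modules/
        ├── __init__.py
        └── test.py   # contains the hello() function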
The script that I am scheduling is just a simple function to print hello each time it’s run:
def hello():
    print('Hello!')
This is a very basic script, but once you understand the file structure you can add more function definitions and additional Python scripts, and really start to scale up your data pipelines, as sketched below.
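As a rough sketch of that scaling up, here is one way the same DAG could chain a second task after the first. The goodbye function and the task names are made up for illustration, and the functions are defined inline here just to keep the sketch self-contained:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

def hello():
    print('Hello!')

def goodbye():
    print('Goodbye!')

args = {
    'owner': 'Vincent Stevenson',
    'start_date': days_ago(1)
}

dag = DAG(
    dag_id='crm-elastic-dag',
    default_args=args,
    schedule_interval='@daily'
)

with dag:
    hello_task = PythonOperator(task_id='hello', python_callable=hello)
    goodbye_task = PythonOperator(task_id='goodbye', python_callable=goodbye)

    # goodbye runs only after hello has succeeded
    hello_task >> goodbye_task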
Once you have completed all the steps above, you can start the containers again and work with the DAG from the Airflow dashboard. Run the same command to start the services:
docker-compose up -d
This won't take long if you have run it before, since the images and layers are already downloaded from Docker Hub. Once it is done, we can visit the Airflow dashboard again to see our DAG.
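Before heading back to the UI, you can also confirm that the scheduler has picked up the new file by tailing its logs; airflow-scheduler is the service name used in the community Compose file, so adjust it if yours differs:
docker-compose logs -f airflow-scheduler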
If everything worked, you should see your DAG listed on the dashboard.
Click on the DAG to see what's inside it and to manage it. Flip the little switch in the top-left corner to activate it, and it will start running on its defined schedule. On this page you can also see each run's status, when it started, its logs, and so on.
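If you don't want to wait for the daily schedule, you can also trigger a run manually – either with the play button in the UI or from the command line inside one of the containers (again, airflow-webserver is the service name from the community Compose file):
docker-compose exec airflow-webserver airflow dags trigger crm-elastic-dag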
If you open the logs for the hello task, you can see the "Hello!" output produced by the external Python script we wrote.
Final Words
We hope you enjoyed this tutorial on installing and using Airflow. Now that you have learned how to install it and schedule Python scripts with Apache Airflow, you can start automating and streamlining your workflows today! If you have any questions, please do not hesitate to contact us anytime.
Here are some useful tutorials that you can read:
- Concurrency in Python
- Basic Neural Network in Python to Make Predictions
- Monitor Python scripts using Prometheus
- How to Implement Google Login in Flask App
- How to create a Word Guessing Game in Python
- Convert an image to 8-bit image
- Create Sudoku game in Python using Pygame
- How to Deploy Flask API on Heroku?
- Create APIs using gRPC in Python