Get Your Hands Dirty with Apache Airflow: A Comprehensive Guide to Installing Locally on Mac, Windows, and Linux

Learn how to install Apache Airflow locally on Mac, Windows, and Linux with this comprehensive guide. Get hands-on experience with the popular data management tool.

Apache Airflow is an open-source platform for managing and scheduling data workflows. It allows you to define, organize, and manage tasks and dependencies that make up a workflow. Airflow provides a user-friendly interface for monitoring and managing the execution of your workflows, making it a popular choice for data engineers, data scientists, and developers.

Airflow can be used in a variety of scenarios, including:

  1. ETL (Extract, Transform, Load) processes: Airflow can be used to automate the process of extracting data from multiple sources, transforming the data into a desired format, and loading the data into a target database.

  2. Data processing pipelines: Airflow can be used to automate complex data processing pipelines that involve multiple stages and tasks.

  3. Machine learning workflows: Airflow can be used to automate the deployment and management of machine learning models, including training, evaluation, and deployment.

  4. Monitoring and alerting: Airflow can be used to automate the process of monitoring data and triggering alerts when certain conditions are met.

To use Apache Airflow, you need to have a basic understanding of Python and know how to create virtual environments. Once you've installed Airflow and set up the environment, you can start defining and executing workflows using the Airflow UI and Python code.

Prerequisites

Before diving into Apache Airflow, there are a few prerequisites to keep in mind. To ensure a smooth and successful Airflow experience, it is recommended that your environment meets the following requirements:

  1. Python version: Airflow is compatible with Python 3.7, 3.8, 3.9, and 3.10.

  2. Supported databases: Airflow has been tested with PostgreSQL versions 11, 12, 13, 14, and 15, MySQL versions 5.7 and 8, SQLite version 3.15.0 or later, and experimental support for MSSQL versions 2017 and 2019.

  3. Kubernetes: Airflow has been tested with Kubernetes versions 1.20.2, 1.21.1, 1.22.0, 1.23.0, and 1.24.0.

It's important to make sure your environment meets these requirements to ensure the best possible experience with Apache Airflow.
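
As a quick sanity check before installing, you can confirm that the interpreter you plan to use falls within the supported range. Here is a minimal sketch (the version set below simply mirrors the list above):

import sys

# Versions listed in the prerequisites above
supported = {(3, 7), (3, 8), (3, 9), (3, 10)}
current = (sys.version_info.major, sys.version_info.minor)
print(f"Running Python {current[0]}.{current[1]}")
if current not in supported:
    print("Warning: this Python version is outside the supported range.")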

Here are the official resources for downloading Apache Airflow and related documentation:

  1. Download Apache Airflow: The latest version of Apache Airflow can be downloaded from the official Apache Airflow website at https://airflow.apache.org/.

  2. Installation Guide: Detailed installation instructions can be found in the official Apache Airflow documentation at https://airflow.apache.org/docs/stable/installation.html.

  3. User Guide: A comprehensive guide to using Apache Airflow can be found in the official documentation at https://airflow.apache.org/docs/stable/userguide/index.html.

  4. Tutorials: There are several tutorials available that cover different aspects of Apache Airflow, including creating and managing data pipelines. These tutorials can be found in the official documentation at https://airflow.apache.org/docs/stable/tutorial.html.

  5. API Reference: The API reference for Apache Airflow can be found in the official documentation at https://airflow.apache.org/docs/stable/api.html.

  6. Release Notes: The release notes for each version of Apache Airflow can be found in the official documentation at https://airflow.apache.org/docs/stable/releases.html.

These resources should provide you with everything you need to get started with Apache Airflow and understand how to use it effectively.

Here is a step-by-step tutorial on how to create and run a workflow using Apache Airflow:

Install Apache Airflow: The first step is to install Apache Airflow. You can install it using pip, by running the following command:

pip install "apache-airflow==2.2.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.2.3/constraints-no-providers-3.8.txt"

This command installs Apache Airflow with pip. The "apache-airflow==2.2.3" part pins the exact Airflow release, and the "--constraint" option points pip at a constraint file listing dependency versions that are known to work with that release.

The constraint file URL follows a fixed pattern: the "constraints-2.2.3" segment matches the Airflow version, the "no-providers" variant covers Airflow core without the optional provider packages, and the "3.8" suffix is the Python version the constraints were generated for. If you run a different Python version (say, 3.9), change the suffix to match.

In short, this command installs Apache Airflow 2.2.3 while pinning its dependencies to versions tested against Python 3.8.
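
If you're not sure which constraint file matches your interpreter, a small helper like the one below (just a sketch that follows the URL pattern shown above) prints the full install command for your local Python version:

import sys

# Builds the pip command shown above, substituting the local Python version
AIRFLOW_VERSION = "2.2.3"
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
constraint_url = (
    "https://raw.githubusercontent.com/apache/airflow/"
    f"constraints-{AIRFLOW_VERSION}/constraints-no-providers-{python_version}.txt"
)
print(f'pip install "apache-airflow=={AIRFLOW_VERSION}" --constraint "{constraint_url}"')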

Initialize the Airflow database: Before using Apache Airflow, you need to initialize its database by running the following command:

jai@MacBook-Air airflow % airflow db init   
This command creates the Airflow metadata database (a SQLite file by default) along with an airflow folder in your home directory.

Navigate to that folder to see what was generated:

jai@MacBook-Air ~ % cd airflow 
jai@MacBook-Air airflow % ls
airflow-scheduler.err	airflow-scheduler.pid	airflow-webserver.out	airflow.db
airflow-scheduler.log	airflow-webserver.err	airflow-webserver.pid	logs
airflow-scheduler.out	airflow-webserver.log	airflow.cfg		webserver_config.py
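
The listing already includes the metastore (airflow.db) and the main configuration file (airflow.cfg). If you're curious what the generated config points to, a small sketch like this reads the two most relevant settings (it assumes the default ~/airflow location; in newer Airflow releases the connection setting may live under a [database] section rather than [core]):

import configparser
import os

# Read the generated config from the default Airflow home
cfg = configparser.ConfigParser()
cfg.read(os.path.expanduser("~/airflow/airflow.cfg"))
print(cfg["core"]["dags_folder"])        # where Airflow looks for DAG files
print(cfg["core"]["sql_alchemy_conn"])   # metastore connection string (SQLite by default)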

Before diving into the Airflow metastore database (airflow.db), we'll create an Airflow user and set up what we need to access the database. Afterwards, we'll look at how to access the database and why editing the airflow.cfg file matters.

jai@MacBook-Air airflow % airflow users create \
          --username jai \
          --firstname jai \
          --lastname giri \
          --role Admin \
          --email admin@example.org

Since the command doesn't pass a --password flag, you'll typically be prompted to enter one interactively.

The Apache Airflow system operates through two essential components: the Webserver and the Scheduler. To schedule and execute your DAGs, both need to be running. Let's begin by starting the Webserver in the background (daemon mode) with the following command:

jai@MacBook-Air airflow % airflow webserver --daemon
or
jai@MacBook-Air airflow % airflow webserver -D

With the Webserver now running in the background, we can similarly launch the Scheduler using the following command:

jai@MacBook-Air airflow % airflow scheduler -D   
or
jai@MacBook-Air airflow % airflow scheduler --daemon

Access the Airflow UI: To access the Airflow UI, open your web browser and navigate to http://localhost:8080. You should see the Airflow dashboard, where you can manage and monitor your workflows.
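
If the dashboard doesn't come up, you can check whether the webserver is responding at all by hitting its /health endpoint; here is a quick sketch, assuming the default port 8080:

import json
from urllib.request import urlopen

# Query the local webserver's health endpoint (default port assumed)
with urlopen("http://localhost:8080/health") as response:
    health = json.load(response)

print(health)  # reports the status of the metadatabase and the scheduler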

Example 1:

Create your first workflow: To create your first workflow, you'll need to write some Python code. A basic workflow in Airflow consists of one or more tasks, which are defined using Python functions. For example, here's a simple workflow that prints "Hello, World!" to the console:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2021, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    'hello_world',
    default_args=default_args,
    description='A simple example of a DAG',
    schedule_interval=timedelta(hours=1),
)

def say_hello():
    print('Hello, World!')

hello_world_task = PythonOperator(
    task_id='hello_world_task',
    python_callable=say_hello,
    dag=dag,
)

Save this code as a Python file in your dags folder (by default ~/airflow/dags; create the folder if it doesn't already exist). Airflow will parse the file and the hello_world DAG will appear in the UI.
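
To confirm that the file parses without errors before looking for it in the UI, a short sketch like this can help (it assumes the file was saved in the dags folder configured in airflow.cfg):

from airflow.models import DagBag

# Parse every .py file in the configured dags folder
dag_bag = DagBag()
print(dag_bag.import_errors)            # {} means everything parsed cleanly
print('hello_world' in dag_bag.dags)    # True once the DAG has been picked up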

These are the basic steps to download, install, and use Apache Airflow. Once you have Airflow set up, you can start exploring its many features and using it to automate and manage your data pipelines.

Example 2:

This example shows how to use Apache Airflow to automate a simple data pipeline. We'll use Airflow to extract data from a CSV file, transform it, and load it into a database.

  1. Create a virtual environment: As before, it's a good practice to create a virtual environment for your Airflow installation. To create a virtual environment, run the following command:
python3 -m venv airflow-env
  2. Activate the virtual environment: Once you've created the virtual environment, activate it by running the following command:
source airflow-env/bin/activate
  3. Install Apache Airflow: To install Apache Airflow, run the following command:
pip install apache-airflow
  4. Initialize the Airflow database: Before using Apache Airflow, you need to initialize its database by running the following command:
jai@MacBook-Air airflow % airflow db init   
  5. Create the CSV file: To follow along with this example, create a CSV file with some sample data. Here's an example:
id,name,age
1,John Doe,35
2,Jane Doe,30
3,Bob Smith,40
  6. Write the extract task: The first step in our data pipeline is to extract data from the CSV file. We can write a Python function to do this using the csv library. Here's an example:
import csv

def extract_data_from_csv(**kwargs):
    file_path = '/path/to/sample.csv'
    data = []
    with open(file_path, 'r') as file:
        reader = csv.DictReader(file)
        for row in reader:
            data.append(row)
    return data
  7. Write the transform task: The next step in our data pipeline is to transform the data we extracted from the CSV file. For this example, we'll add a new field to each row called full_name, which is the concatenation of name and age. Here's an example:
def transform_data(**kwargs):
    ti = kwargs['ti']
    data = ti.xcom_pull(task_ids='extract_data_from_csv')
    for row in data:
        row['full_name'] = row['name'] + '-' + str(row['age'])
    return data
  8. Write the load task: The final step in our data pipeline is to load the transformed data into a database. For this example, we'll use SQLite, but you can use any database you prefer. Here's an example:
import sqlite3

def load_data_into_database(**kwargs):
    ti = kwargs['ti']
    data = ti.xcom_pull(task_ids='transform_data')
    conn = sqlite3.connect('/path/to/db.sqlite')
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS data (
            id INTEGER PRIMARY KEY,
            name TEXT,
            age INTEGER,
            full_name TEXT
        )
    ''')
    # Insert the transformed rows and persist the changes
    cursor.executemany(
        'INSERT OR REPLACE INTO data (id, name, age, full_name) VALUES (?, ?, ?, ?)',
        [(row['id'], row['name'], row['age'], row['full_name']) for row in data]
    )
    conn.commit()
    conn.close()
  9. Create the DAG: With the extract, transform, and load tasks written, we can now create the DAG in Airflow. Here's an example:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'you',
    'depends_on_past': False,
    'start_date': datetime(2022, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'data_pipeline',
    default_args=default_args,
    description='A simple data pipeline example',
    schedule_interval=timedelta(hours=1),
)

extract_data_task = PythonOperator(
    task_id='extract_data_from_csv',
    python_callable=extract_data_from_csv,
    dag=dag,
)

transform_data_task = PythonOperator(
    task_id='transform_data',
    python_callable=transform_data,
    dag=dag,
)

load_data_task = PythonOperator(
    task_id='load_data_into_database',
    python_callable=load_data_into_database,
    dag=dag,
)

extract_data_task >> transform_data_task >> load_data_task
  10. Start the Airflow webserver: To run the DAG, you need to start the Airflow webserver (the scheduler we started earlier also needs to be running). You can start the webserver by running the following command:
airflow webserver -p 8080
  11. Trigger the DAG: With the webserver and scheduler running, you can now trigger the DAG by visiting the Airflow UI in your web browser. The DAG will extract data from the CSV file, transform it, and load it into the database.

That's it! You now know how to use Apache Airflow to automate a simple data pipeline. With this knowledge, you can start building more complex pipelines to handle larger datasets and more sophisticated transformations.
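
If you want to double-check that the load step actually wrote rows, a short sketch like the following (using the same database path as load_data_into_database) queries the table directly:

import sqlite3

# Read back the rows written by the load task
conn = sqlite3.connect('/path/to/db.sqlite')
for row in conn.execute('SELECT id, name, age, full_name FROM data'):
    print(row)
conn.close()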

Extra Stuff:

Here are the steps to download and install Apache Airflow on Mac/Linux and Windows:

Mac/Linux:

Create a virtual environment: It's a good practice to create a virtual environment for your Airflow installation. To create a virtual environment, run the following command:

python3 -m venv airflow-env

Activate the virtual environment: Once you've created the virtual environment, activate it by running the following command:

source airflow-env/bin/activate

Install Apache Airflow: To install Apache Airflow, run the following command:

pip install apache-airflow

Initialize the Airflow database: Before using Apache Airflow, you need to initialize its database by running the following command:

airflow db init   

Windows:

Create a virtual environment: It's a good practice to create a virtual environment for your Airflow installation. To create a virtual environment, run the following command:

python -m venv airflow-env

Activate the virtual environment: Once you've created the virtual environment, activate it by running the following command:

airflow-env\Scripts\activate

Install Apache Airflow: To install Apache Airflow, run the following command:

pip install apache-airflow

Initialize the Airflow database: Before using Apache Airflow, you need to initialize its database by running the following command:

airflow db init   

Once you've followed these steps, you should have Apache Airflow installed and ready to use on your computer.

Start the Airflow web server: To start the Airflow web server, run the following command:

airflow webserver --daemon
or
airflow webserver -D

Start the Airflow scheduler: To start the Airflow scheduler, run the following command:

airflow scheduler -D   
or
airflow scheduler --daemon

Access the Airflow UI: To access the Airflow UI, open your web browser and navigate to http://localhost:8080. You should see the Airflow dashboard, where you can manage and monitor your workflows.

To uninstall Apache Airflow, follow these steps:

  1. Deactivate the virtual environment: If you have installed Apache Airflow in a virtual environment, you need to deactivate the environment first. You can do this by running the following command in your terminal:
deactivate
  2. Uninstall Apache Airflow using pip: Use the following command to uninstall Apache Airflow using pip (if Airflow lives only inside a virtual environment, you can skip this step and simply remove the environment in step 4):
For Windows:
pip uninstall apache-airflow

For macOS/Linux:
pip3 uninstall apache-airflow
  3. Delete the airflow home folder: Apache Airflow creates a home folder during installation, which stores configuration and log files. You can delete the home folder after uninstalling Apache Airflow. The default location for the home folder is ~/airflow, but you can check the location of your home folder in the airflow.cfg file.

  4. Remove the virtual environment: If you have used a virtual environment for installing Apache Airflow, you can remove the environment after uninstalling Apache Airflow. You can do this by deleting the folder that contains the environment, as shown in the sketch below.
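
If you prefer to script steps 3 and 4, a small sketch like this removes both folders (the paths assume the defaults used in this guide, so double-check them before running):

import os
import shutil

# Remove the Airflow home folder and the virtual environment.
# Both paths are the defaults used in this guide; verify them first.
shutil.rmtree(os.path.expanduser("~/airflow"), ignore_errors=True)
shutil.rmtree("airflow-env", ignore_errors=True)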

That's it! You have successfully uninstalled Apache Airflow from your system.

Conclusion

In conclusion, Apache Airflow is a powerful open-source platform for managing and scheduling data pipelines. With its ability to run on different operating systems, support for multiple databases, and comprehensive documentation, Apache Airflow is a great choice for anyone looking to manage their data pipelines efficiently. Whether you're a data engineer, data scientist, or data analyst, Apache Airflow can help you automate and streamline your data workflows. With the resources provided above, you should be able to get started with Apache Airflow easily and quickly.


DigitalOcean Sign Up: If you don't have a DigitalOcean account yet, you can sign up using the link below and receive a $200 credit for 60 days to get started: Get $200 free credit on DigitalOcean. (Note: this is a referral link, meaning both you and I will get credit.)

