Setting Up Apache Airflow and Jupyter Notebook on an AWS EC2 Instance

By Yashpal Singla

Industry: Technology | 5 Min Read

Introduction

Setting up Airflow on an EC2 instance, and managing DAGs on the server through a Jupyter Notebook, is one of the easiest and most convenient ways of managing the automated scripts that Apache Airflow calls DAGs. Before we move on to the deployment, let's take a quick look at Apache Airflow and Jupyter Notebook.

Apache Airflow

Automation is a transformative force that promises to change our approach toward work. It is paving the way for a more productive, efficient, and innovative future as we integrate it into our daily lives. Its potential ranges from streamlining mundane and repetitive tasks to transforming entire industries.

Use Airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

For detailed information, see the official documentation: https://airflow.apache.org/docs/stable
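A DAG itself is just a Python file placed in Airflow's DAGs folder. The sketch below is a minimal, hypothetical example (the dag_id, schedule, and bash command are placeholders), written against the Airflow 1.10-style imports used in the stable docs linked above:

# dags/hello_dag.py - a minimal, illustrative DAG
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# The scheduler picks this file up from the DAGs folder (AIRFLOW_HOME/dags by default).
dag = DAG(
    dag_id="hello_dag",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

# A single task that runs a shell command; dependencies between tasks are set with >>.
say_hello = BashOperator(task_id="say_hello", bash_command="echo hello", dag=dag)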

Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modelling, data visualization, machine learning, and much more.

Details are available on the official Jupyter website, jupyter.org.

Setting Up Apache Airflow and Jupyter Notebook

To set up Airflow and Jupyter Notebook, you need access to the Ubuntu terminal of the EC2 instance over SSH. Secondly, make sure ports 8080 and 8888 are allowed in the AWS security groups so that these ports, which are the defaults for Airflow and Jupyter Notebook respectively, are reachable.

Install Apache Airflow the traditional way and open the custom TCP ports in the EC2 security group. The ports we will be targeting are 8080 for Apache Airflow and 8888 for Jupyter Notebook.
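If you prefer the command line to the AWS console, the same security group rules can be added with the AWS CLI. This is only a sketch: the group ID below is a placeholder for your instance's actual security group, and opening the ports to 0.0.0.0/0 makes them publicly reachable.

# Replace sg-0123456789abcdef0 with your instance's security group ID.
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8080 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8888 --cidr 0.0.0.0/0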

For details on setting up Apache Airflow and Jupyter Notebook, you can follow the links below:

Apache Airflow

Jupyter Notebook
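As a rough sketch, the "traditional" installation referred to above boils down to the commands below (exact versions and constraint files vary by release, so check the linked docs):

pip install apache-airflow
pip install notebook
airflow initdb    # on Airflow 2.x the equivalent command is "airflow db init"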

The installation can also be done using venv (a virtual environment) or Anaconda (a Python environment manager). In such cases, we would need to provide the respective library paths for Airflow and Jupyter Notebook.
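For example, a venv-based setup might look like the sketch below (the environment path is illustrative); the Airflow and Jupyter binaries then live under the environment's bin directory, and those are the paths to reference in the systemd service files shown later:

python3 -m venv /home/ubuntu/venv
source /home/ubuntu/venv/bin/activate
pip install apache-airflow notebook
# Binaries are now at /home/ubuntu/venv/bin/airflow and /home/ubuntu/venv/bin/jupyter.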

To expose Jupyter Notebook on port 8888, you first need to generate its config file. To create a jupyter_notebook_config.py file, with all the defaults commented out, run the following command:

jupyter notebook --generate-config

If the OS is Ubuntu, the file can be found at:
/home/ubuntu/.jupyter/jupyter_notebook_config.py

You need to edit this file using vim or nano from the command line and add the lines below:

c.NotebookApp.allow_origin = '*'
c.NotebookApp.ip = '0.0.0.0'
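Because the notebook will now accept connections from any IP, it is also worth setting a password (an optional step, not part of the original instructions):

jupyter notebook password    # prompts for a password and stores a hashed copy in the Jupyter config directory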

We can run Airflow using the following two commands:

airflow webserver -p 8080 (this exposes the Airflow web interface on port 8080)

airflow scheduler (this refreshes DAGs and schedules and executes tasks)

To run Jupyter Notebook, use the below command:

jupyter notebook (this exposes a directory listing of the folder in which the command is executed)

We need to make sure that the jupyter command above is executed in the Airflow home directory, which will allow us to manage the DAGs. The commands above are fine for testing; to avoid keeping the SSH session open, we need to run them as services.
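For example, assuming the default Airflow home directory on an Ubuntu instance:

cd /home/ubuntu/airflow
jupyter notebook    # the dags folder is now browsable and editable from the notebook UI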

To run Apache Airflow and Jupyter Notebook as services in the background, follow the steps below:

1. Create a service file for the Airflow webserver using the command below:

sudo nano /etc/systemd/system/airflow-webserver.service

Paste the below code in the file:

 
[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service

[Service]
EnvironmentFile=/home/ubuntu/airflow/airflow
User=ubuntu
Group=ubuntu
Type=simple
ExecStart=/usr/bin/sudo /bin/bash -lc 'airflow webserver'
Restart=on-failure
RestartSec=5s
PrivateTmp=true

[Install]
WantedBy=multi-user.target

2. Create a service file for the Airflow scheduler using the command below:

sudo nano /etc/systemd/system/airflow-scheduler.service

Paste the below code in the file:

[Unit]
Description=Airflow scheduler daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service

[Service]
EnvironmentFile=/home/ubuntu/airflow/airflow
User=ubuntu
Group=ubuntu
Type=simple
ExecStart=/usr/bin/sudo /bin/bash -lc 'airflow scheduler'
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

3. Create a service file for Jupyter Notebook using the command below, and paste the following code into it:

sudo nano /etc/systemd/system/jupyter.service

[Unit]
Description=Jupyter Notebook

[Service]
Type=simple
PIDFile=/run/jupyter.pid
ExecStart=/usr/bin/sudo /bin/bash -lc 'jupyter-notebook --allow-root'
User=ubuntu
Group=ubuntu
WorkingDirectory=/home/ubuntu/airflow
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

4. Before starting the services, they need to be enabled using the commands below:

sudo systemctl enable airflow-webserver.service
sudo systemctl enable airflow-scheduler.service
sudo systemctl enable jupyter.service

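Note: after creating new unit files, systemd usually needs to reload its configuration before the units can be started:

sudo systemctl daemon-reload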
5. To start the services, run the commands below:

sudo systemctl start airflow-webserver
sudo systemctl start airflow-scheduler
sudo systemctl start jupyter

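To confirm that everything came up correctly, you can check the status of each service and, if needed, follow its logs, for example:

sudo systemctl status airflow-webserver
sudo journalctl -u airflow-webserver -f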
With the above instructions in place, Apache Airflow and Jupyter Notebook will run smoothly on the EC2 instance.

If you have any questions or need more details, please feel free to reach out to us. Team iotasol is here to help! Contact us now!
