Data processing in information systems is divided into three stages: extraction, transformation, and loading (Extract Transform Load, ETL). In solutions using Big Data, it is with the help of ETL that the original (“raw”) data is converted into information suitable for business analysis.
However, as data grows and analytical tasks become more complex, the number of ETL processes that must be planned, monitored, and restarted in the event of failures also increases – the need for an orchestrator arises.
In this article, we will talk about an effective open-source Apache Airflow tool that helps manage complex ETL processes and works well with the principles of Cloud-Native applications.
AirFlow Core Entities
Data processing processes, or pipelines, in Airflow are described below. This is a semantic combination of tasks that must be performed in a strictly defined sequence according to a specified schedule. Visually, the DAG looks like a directed acyclic graph, a graph that does not have cyclic dependencies.
DAG nodes perform tasks. These are direct operations applied to data, for example, loading data from various sources, aggregating them, indexing, clearing duplicates, saving the obtained results, and other ETL processes. At the code level, tasks can be Python functions or Bash scripts.
Operators are most often responsible for the implementation of tasks. If tasks describe what actions to perform with data, then operators describe how to perform these actions. It is a template for completing tasks.
A special group of operators is made up of sensors (Sensors), which allow prescribing a reaction to a specific event. The trigger can be the arrival of a specific time, the receipt of a certain file or line with data, another DAG / Task, and so on.
AirFlow has a rich selection of built-in operators. In addition, many custom operators are available by installing community-supported vendor packages. It is also possible to add custom operators by extending the BaseOperator base class. When frequently used code based on standard operators appears in your project. It is recommended that you convert it to your own operator.
AirFlow Architecture And How It Works
The AirFlow architecture is based on the following components:
- Web Server – responsible for the user interface, where it is possible to configure DAGs and their schedule, track their execution status, and so on.
- Metadata DB (metadata database) – own metadata repository based on the SqlAlchemy library for storing global variables, data source connection settings, Task Instance, DAG Run execution statuses, and so on. It requires the installation of a SQL Alchemy compatible database such as MySQL or PostgreSQL.
- Scheduler – The service responsible for scheduling in Airflow. Keeping track of all created Task and DAGs, the scheduler initializes the Task Instance as the conditions necessary to run them are met. By default, once a minute, the scheduler analyzes the DAG parsing results and checks to see if there are any tasks ready to run. Execute active tasks; the scheduler uses the Executor specified in the settings.For specific versions of the database (PostgreSQL 9.6+ and MySQL 8+), simultaneous operation of several schedulers is supported – for maximum fault tolerance.
- Worker (Worker) – a separate process in which tasks are performed. The Worker’s placement – locally or on a different machine – is determined by the selected Executor type.
- Executor – The mechanism by which task instances are launched. Works in conjunction with the scheduler within one process. Supported artist types are shown below.