The advantages of AirFlow cited most often are:
- Open-source: AirFlow is supported by the community and is well documented.
- Based on Python: Python is considered relatively easy to learn and is an accepted standard among Big Data engineers and Data Scientists. When ETL processes are defined as code, they become easier to develop, test, and maintain. Defining pipelines in code also eliminates the need for JSON or XML configuration files to describe them.
- Rich toolkit and friendly UI: Working with AirFlow is possible using the CLI, REST API, and a web interface built on top of the Flask Python framework.
- Data sources and services: AirFlow supports many databases and Big Data stores: MySQL, PostgreSQL, MongoDB, Redis, Apache Hive, Apache Spark, Apache Hadoop, S3 Object Storage, and more.
- Customization: It is possible to write your own custom operators.
- Scalability: The modular architecture and message queue place no fixed limit on the number of DAGs. Workers can scale out when using Celery or Kubernetes.
- Monitoring and Alerting: Integration with StatsD and Fluentd is supported for collecting and sending metrics and logs. An Airflow-exporter is also available for integration with Prometheus.
- Ability to customize role-based access: By default, AirFlow provides five roles with different access levels: Admin, Public, Viewer, Op, User. You can also create your own roles with access to a limited set of DAGs. Additionally, integration with Active Directory and flexible access configuration using RBAC (Role-Based Access Control) are possible.
- Testing support: You can add basic unit tests that check both the pipelines as a whole and specific tasks within them.
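To illustrate why pipelines defined as code are easy to test, here is a minimal sketch in plain Python. It does not use AirFlow's own DAG classes: the `PIPELINE` dictionary, the task names, and the `execution_order` helper are all hypothetical stand-ins, and the standard-library `graphlib` module (Python 3.9 or later) is assumed for the ordering check. The idea is simply that a dependency graph written in code can be checked for cycles and for correct ordering by an ordinary unit test.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(pipeline):
    """Return one valid run order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(pipeline).static_order())
    except CycleError as exc:
        # The second element of args holds the list of nodes forming the cycle.
        raise ValueError(f"pipeline contains a cycle: {exc.args[1]}") from exc


def test_pipeline_order():
    # Upstream tasks must come before the tasks that depend on them.
    order = execution_order(PIPELINE)
    assert order.index("extract") < order.index("transform") < order.index("load")


def test_cycle_is_rejected():
    broken = {"a": {"b"}, "b": {"a"}}
    try:
        execution_order(broken)
    except ValueError:
        return
    raise AssertionError("cycle was not detected")
```

A test runner such as pytest would pick up the two `test_` functions; in a real project the same kind of check would be applied to the actual DAG definitions rather than to a hand-written dictionary.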
Of course, there are drawbacks, but they are primarily associated with a fairly high entry threshold and the need to take into account various nuances when working with AirFlow:
- When designing tasks, it is essential to observe idempotency: tasks should be written so that the same input parameters produce the same result, regardless of how many times the task is run.
- It is necessary to understand how execution_date is processed. It is essential to realize that changes to task code are reflected in every rerun of those tasks for past periods. This rules out reproducing the earlier results but, on the other hand, lets you apply new algorithms to past data.
- There is no way to design a DAG graphically, as is possible, for example, in Apache NiFi. On the other hand, many see this as a plus since code review is easier than schema review.
- Some users note minor delays in starting tasks, caused by the scheduler’s overhead of queuing and prioritizing tasks. However, in Airflow 2, such delays were minimized, and it also became possible to run multiple schedulers for maximum performance.
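The idempotency requirement from the list above can be sketched in plain Python. This is an illustration under stated assumptions, not AirFlow code: the functions `daily_total`, `load_idempotent`, and `load_appending` are hypothetical, a dictionary stands in for a table partitioned by date, and `run_date` plays the role of the date a run is responsible for. The idempotent variant overwrites the partition for its date, so a retry or backfill leaves the result unchanged, while the appending variant duplicates data on every rerun.

```python
from datetime import date


def daily_total(events, run_date):
    """Sum the amounts of the events that belong to run_date only."""
    return sum(e["amount"] for e in events if e["day"] == run_date)


def load_idempotent(table, events, run_date):
    """Overwrite the entry for run_date, so reruns leave the table unchanged."""
    table[run_date] = daily_total(events, run_date)


def load_appending(rows, events, run_date):
    """Non-idempotent variant: every rerun appends another copy of the row."""
    rows.append((run_date, daily_total(events, run_date)))


events = [
    {"day": date(2024, 1, 1), "amount": 10},
    {"day": date(2024, 1, 1), "amount": 5},
    {"day": date(2024, 1, 2), "amount": 7},
]

table, rows = {}, []
for _ in range(2):  # simulate a retry or backfill of the same run
    load_idempotent(table, events, date(2024, 1, 1))
    load_appending(rows, events, date(2024, 1, 1))

print(table)      # one entry for 2024-01-01, however many times the run repeats
print(len(rows))  # grows with every rerun
```

Note that both variants take the date as a parameter instead of reading the current clock. That is what makes a rerun for a past period process the same slice of data, and it is also why changed task code produces new results for old dates when those runs are repeated.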