Below is a compilation of my notes taken during the presentation of Apache Airflow by Christian Trebing from BlueYonder.
Introduction
Use case: how to handle data coming in regularly from customers?
- Option 1: use cron
  - only time-based triggers
  - error handling is hard
  - inconvenient when runs overlap
- Option 2: write your own workflow processing tool
  - starting is easy
  - you soon reach its limits and end up investing much more work than envisioned
- Option 3: use an open source workflow processing tool
  - multiple options exist
  - BlueYonder chose Apache Airflow
Apache Airflow
Apache Airflow is a workflow scheduler like Apache Oozie or Azkaban
- Written in [Python](https://www.python.org/)
- Workflows are defined in Python
- Interface with a view of present & past runs and also logging
- Extensible with plugins
- Active development and community
- Provides a nice UI and REST interface
- Relatively lightweight (2 processes on a server & a database)
Development
An Airflow job is composed of multiple operators, each operator being one step of the job, and sensors that wait for inputs to become available. In a Python workflow, you build the DAG (directed acyclic graph) yourself, operator by operator.
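To make the "operator by operator" idea concrete, here is a minimal pure-Python sketch of a job as a graph of steps. It is not Airflow's API: the step names, `make_step` and `run_dag` are hypothetical, and it only assumes Python 3.9+ for the standard `graphlib` module.

```python
from graphlib import TopologicalSorter


def make_step(name):
    """Hypothetical stand-in for an operator: a callable that records its own name."""
    def step(log):
        log.append(name)
    return step


# The DAG, declared step by step: each key maps to the steps it depends on.
DEPENDENCIES = {
    "wait_for_file": set(),                           # plays the role of a sensor
    "load_data": {"wait_for_file"},
    "compute_stats": {"load_data"},
    "train_model": {"load_data"},
    "publish_report": {"compute_stats", "train_model"},
}

STEPS = {name: make_step(name) for name in DEPENDENCIES}


def run_dag(dependencies, steps):
    """Run every step exactly once, after all of its upstream steps have run."""
    log = []
    for name in TopologicalSorter(dependencies).static_order():
        steps[name](log)
    return log
```

Calling `run_dag(DEPENDENCIES, STEPS)` returns the step names in an order that respects the dependencies. A real scheduler adds what this sketch leaves out: triggers, retries, logging and persisted run history.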
Many operators are available in Airflow:
- BashOperator
- SimpleHttpOperator
- …
and sensors:
- HttpSensor
- HdfsSensor
- …
or you can develop your own operator/sensor in Python. Also, Airflow supports branching of the workflow through custom operators.
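As a rough illustration of what a custom sensor and a branching step do, the sketch below uses plain Python rather than Airflow's base classes. Airflow sensors are built around a `poke` check that is retried until it succeeds, which is the idea modelled here; `CountingSensor`, `wait_for`, `choose_branch` and the step names are hypothetical.

```python
import time


class CountingSensor:
    """Hypothetical sensor: its condition is met on the `ready_after`-th poke."""

    def __init__(self, ready_after):
        self.ready_after = ready_after
        self.pokes = 0

    def poke(self):
        # A real sensor would check an external condition here
        # (an HTTP endpoint, a file on HDFS, ...).
        self.pokes += 1
        return self.pokes >= self.ready_after


def wait_for(sensor, poke_interval=0.0, timeout=1.0):
    """Call sensor.poke() until it returns True; raise once `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while not sensor.poke():
        if time.monotonic() >= deadline:
            raise TimeoutError("sensor condition was not met in time")
        time.sleep(poke_interval)
    return True


def choose_branch(row_count):
    """Hypothetical branching rule: return the name of the next step to run."""
    return "process_data" if row_count > 0 else "notify_empty_delivery"
```

The branching function only returns the name of the step to follow; it is up to the surrounding workflow engine to skip the branch that was not chosen.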
State handling
- Variables are relative to the Airflow instance
- XComs (cross-communications) are relative to the DAG run / task
- Both kinds of state are persisted in the database
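The difference between the two scopes can be sketched with an in-memory SQLite database: one table holds instance-wide values, the other holds values tied to a single DAG run and task. The schema and function names below are assumptions made for illustration and do not reflect Airflow's actual tables or API.

```python
import sqlite3


def open_state_db():
    """Create an in-memory database with one table per scope (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    # Instance-wide state: one value per key, visible to every DAG.
    db.execute("CREATE TABLE variable (key TEXT PRIMARY KEY, value TEXT)")
    # Run-scoped state: the same key can hold different values per DAG run and task.
    db.execute(
        "CREATE TABLE task_message ("
        " dag_id TEXT, run_id TEXT, task_id TEXT, key TEXT, value TEXT,"
        " PRIMARY KEY (dag_id, run_id, task_id, key))"
    )
    return db


def set_variable(db, key, value):
    db.execute("INSERT OR REPLACE INTO variable VALUES (?, ?)", (key, value))


def get_variable(db, key):
    row = db.execute("SELECT value FROM variable WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def push_message(db, dag_id, run_id, task_id, key, value):
    db.execute(
        "INSERT OR REPLACE INTO task_message VALUES (?, ?, ?, ?, ?)",
        (dag_id, run_id, task_id, key, value),
    )


def pull_message(db, dag_id, run_id, task_id, key):
    row = db.execute(
        "SELECT value FROM task_message"
        " WHERE dag_id = ? AND run_id = ? AND task_id = ? AND key = ?",
        (dag_id, run_id, task_id, key),
    ).fetchone()
    return row[0] if row else None
```

A value stored with `set_variable` is visible everywhere, while a value stored with `push_message` for one run is not returned when another run asks for the same key.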
Deployment
Two processes and a database:
- scheduler
- webserver
- database (PostgreSQL, SQLite, …)
Notes
- Airflow doesn’t handle user impersonation; you have to do it yourself
- High Availability isn’t handled natively by Airflow
- The presented use case had no need to connect to services with Kerberos & High Availability
Conclusion
Airflow seems to be a very nice alternative to Oozie and its XML workflows. We would have loved for it to be in JavaScript with NodeJS instead of Python!