DataEng Digest - Issue #6: Guide to Data Engineering: The Finale, 7 circles of data testing hell, Cluster Schedulers, Data pipelining with Airflow and more.
Wow, DataEng Digest crossed 100 subscribers without any active promotions. I am really glad to introd
Real data behaves in many unexpected ways that can break even the most well-engineered data pipelines. To catch as much of this weird behaviour as possible before users are affected, the ING Wholesale Banking Advanced Analytics team has created 7 layers of data testing that they use in their CI setup and Apache Airflow pipelines to stay in control of their data.
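To make the idea of a data test concrete, here is a minimal sketch of one kind of check such a layer might run before data reaches users. The 7 layers themselves are not described in this blurb, so this is not the ING team's method; the function `validate_rows` and the column names `account_id` and `amount` are hypothetical, chosen only for illustration.

```python
def validate_rows(rows, required, non_negative):
    """Return a list of human-readable problems found in `rows`.

    `rows` is a list of dicts; `required` lists columns that must not be
    None; `non_negative` lists numeric columns that must be >= 0.
    All names here are hypothetical, not taken from any real pipeline.
    """
    problems = []
    for i, row in enumerate(rows):
        # Completeness check: every required column must have a value.
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: missing {col}")
        # Range check: skip missing values, flag negative ones.
        for col in non_negative:
            value = row.get(col)
            if value is not None and value < 0:
                problems.append(f"row {i}: negative {col}")
    return problems


# Hypothetical sample batch with one clean row and one bad row.
sample = [
    {"account_id": "A1", "amount": 10.0},
    {"account_id": None, "amount": -5.0},
]
problems = validate_rows(
    sample, required=["account_id", "amount"], non_negative=["amount"]
)
for problem in problems:
    print(problem)
```

A check like this could run as a unit test in CI against sample data, or as a pipeline task that fails the run when `problems` is non-empty, so that downstream steps never see the bad batch.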
— the purpose of schedulers the way they were originally envisaged and developed at Google
— how well (or not) they translate to solve the problems of the rest of us
— why they come in handy even…
Batch data processing, historically known as ETL, is extremely challenging. It’s time-consuming, brittle, and often unrewarding. How can you simplify the process? Use the right tools and follow best practices.