Once we know where the allow-list lives (T358695) and have the Spark SQL queries and the Spark Scala script productionized (T358681),
we should implement an Airflow DAG that uses them to populate the Commons Impact Metrics datasets as Iceberg tables.
The DAG should run at a monthly granularity (unless, after community feedback, we decide on something else: T358688).
It should execute each step of the pipeline in sequence, storing intermediate results in temporary tables that are passed to the next operator.
The final results should populate the 5 Commons Impact Metrics datasets as Iceberg tables.
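To illustrate the intended data flow (not the real implementation), here is a minimal Python sketch of how the pipeline chains steps through temporary tables and fans out into the final datasets. The step names, table names, and the five dataset names are all hypothetical placeholders; the real DAG would use Airflow operators and the productionized Spark jobs, with Iceberg tables instead of an in-memory dict.

```python
# Hypothetical sketch: each step consumes the previous step's temporary
# table and produces its own; the last step writes the 5 final datasets.
# All names below are illustrative, not the actual table names.

FINAL_DATASETS = [
    "commons_category_metrics",
    "commons_media_file_metrics",
    "commons_edit_metrics",
    "commons_pageview_metrics",
    "commons_top_metrics",
]  # placeholder names for the 5 Commons Impact Metrics tables


def run_pipeline(snapshot: str) -> dict:
    """Simulate one monthly run, chaining steps via temp tables."""
    store = {}

    # Step 1: materialize the allow-list for this snapshot.
    store["tmp_allow_list"] = f"allow_list@{snapshot}"

    # Step 2: a Spark SQL step reads the previous temp table and
    # writes its own intermediate result.
    store["tmp_enriched"] = f"enriched({store['tmp_allow_list']})"

    # Final step: fan out into the 5 Iceberg datasets.
    for name in FINAL_DATASETS:
        store[name] = f"{name}({store['tmp_enriched']})"
    return store


result = run_pipeline("2024-03")
```

In the real DAG, each function above would be an Airflow task (e.g. a Spark submit operator), with task dependencies enforcing the same ordering the dict chaining does here.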
Tasks:
- Write the DAG
- Test it in the development instance
- Code review and deploy
Definition of done:
- The DAG is running in production, and the datasets are being populated every month.