[Commons Impact Metrics] Create Airflow job that generates the datasets in Iceberg
Closed, Resolved · Public · 5 Estimated Story Points

Authored By

	mforns
	Feb 28 2024, 5:54 PM

Description

Once we know where the allow-list lives (T358695) and the SparkSql queries and the SparkScala script are productionized (T358681),
we should implement an Airflow DAG that uses them to populate the Commons Impact Metrics datasets as Iceberg tables.
The DAG should have monthly granularity (unless we decide on something else after Community feedback, see T358688).
It should execute the pipeline steps in sequence, with each step storing its intermediate results in temporary tables and passing them to the next operator.
The final results should populate the 5 Commons Impact Metrics datasets as Iceberg tables.
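
The chaining described above can be illustrated with a minimal, standard-library-only Python sketch. It is not Airflow code and not the actual DAG: the step names, the `tmp.` and `db.` table prefixes, and the `dataset_1` to `dataset_5` placeholders are all assumptions made for illustration. It only shows how each step of a monthly run could read the temporary table written by the previous step, with the last step writing the 5 final tables.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Step:
    name: str      # hypothetical step name
    reads: tuple   # tables this step reads
    writes: tuple  # tables this step writes


def plan_monthly_run(month_start, step_names, final_tables):
    """Chain steps so each one reads what the previous one wrote."""
    suffix = month_start.strftime("%Y_%m")  # one set of temp tables per month
    steps = []
    previous = ()  # the first step reads no temporary table
    for i, name in enumerate(step_names):
        if i == len(step_names) - 1:
            # The last step populates the final datasets.
            writes = tuple(final_tables)
        else:
            # Earlier steps write a temporary table for the next step.
            writes = (f"tmp.{name}_{suffix}",)
        steps.append(Step(name, previous, writes))
        previous = writes
    return steps


# Hypothetical step names and placeholder names for the 5 datasets.
STEP_NAMES = ["build_category_graph", "compute_metrics", "load_datasets"]
FINAL_TABLES = [f"db.dataset_{n}" for n in range(1, 6)]

plan = plan_monthly_run(date(2024, 3, 1), STEP_NAMES, FINAL_TABLES)
for step in plan:
    print(step.name, step.reads, "->", step.writes)
```

In a real DAG, each `Step` would presumably correspond to one operator and the month would come from the scheduler's data interval; both of those mappings are assumptions here.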

Tasks:

Write the DAG
Test it in the development instance
Code review and deploy

Definition of done:

The DAG is running in production, and the datasets are being populated every month.

Details

Subject	Repo	Branch	Lines +/-
Correctly apply distanceToPrimary in CommonsCategoryGraphBuilder	analytics/refinery/source	master	+83 -71
Modify Commons Impact Metrics queries to ignore ancestor categories	analytics/refinery	master	+154 -165
Commons Impact Metrics queries - Correct order of insert	analytics/refinery	master	+3 -3

Related Objects

Status	Assigned	Task
Resolved	VirginiaPoundstone	T358673 [Epic] Commons Impact Metrics Implementation
Resolved	VirginiaPoundstone	T360649 [Sprint 11 GOAL] Commons Impact Metrics: Deliver Data Pipeline
Resolved	mforns	T358699 [Commons Impact Metrics] Create Airflow job that generates the datasets in Iceberg

Event Timeline

mforns created this task. Feb 28 2024, 5:54 PM

Restricted Application added a subscriber: Aklapper. Feb 28 2024, 5:54 PM

mforns added a parent task: T358673: [Epic] Commons Impact Metrics Implementation. Feb 28 2024, 5:57 PM

mforns renamed this task from [Commons Impact Metrics] Create Airflow job that generates the Commons Impact Metrics datasets in Iceberg to [Commons Impact Metrics] Create Airflow job that generates the datasets in Iceberg. Feb 28 2024, 6:01 PM

mforns mentioned this in T358701: [Commons Impact Metrics] Create Airflow job that generates the public dumps. Feb 28 2024, 6:09 PM

mforns mentioned this in T358707: [Commons Impact Metrics] Create Airflow job that formats and loads the data to Cassandra for AQS. Feb 28 2024, 6:42 PM

mforns mentioned this in T358715: [Commons Impact Metrics] Add test data in AQS's test environments to back up new AQS service. Feb 28 2024, 7:31 PM

mforns mentioned this in T358673: [Epic] Commons Impact Metrics Implementation. Feb 28 2024, 9:02 PM

VirginiaPoundstone moved this task from Incoming requests to Q1 24/25 on the Commons-Impact-Metrics board. Mar 1 2024, 6:25 PM

VirginiaPoundstone moved this task from Incoming to Data Products Sprint 11 on the Data Products board. Mar 18 2024, 7:37 PM

VirginiaPoundstone edited projects, added Data Products (Data Products Sprint 11); removed Data Products.

mforns set the point value for this task to 5. Mar 21 2024, 2:15 PM

mforns mentioned this in T360640: [Commons Impact Metrics] Create Airflow job that formats and loads the data to Druid for AQS. Mar 21 2024, 2:25 PM

VirginiaPoundstone triaged this task as Medium priority. Mar 29 2024, 4:33 PM

Milimetric claimed this task. Apr 3 2024, 9:43 PM

Milimetric moved this task from Sprint Backlog to In Process on the Data Products (Data Products Sprint 11) board.