T368782 MediaWiki Reconciliation API
Closed, Declined · Public · 13 Estimated Story Points

Description

Overview

EventBus might have produced a page_change event that got lost somewhere along the pipeline: EventBus -> EventGate -> Kafka -> Flink enrichment -> Kafka -> Gobblin -> HDFS. Via a reconciliation mechanism, we can find out basic details about the events that we missed; these basic details include at least (wiki_db, revision_id). Different mechanisms might be able to get us more, but let's abstract from that and call this the missed event basic info.

Consumers of the page_content_change and page_change events would like to know about these missed events in a separate topic. So what we would like to build here is something that gives us valid page_change and page_content_change events in response to missed event basic info.
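As a rough illustration (not a settled schema), the missed event basic info could be modeled like this in Python; wiki_db and revision_id come from the description above, everything else is an assumption:

  from typing import TypedDict

  class MissedEventBasicInfo(TypedDict):
      """Minimal identity of an event that was lost somewhere in the pipeline."""
      wiki_db: str        # e.g. "enwiki"; also tells us which wiki's API/DB to hit
      revision_id: int    # the revision that never made it to HDFS

  # A mechanism that knows more (e.g. the page id) could extend this record;
  # whether that extra data helps MW is one of the open questions below.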

First Approach

In a recent meeting we reached consensus to try the following:

(for reference, the page change schema)

  • In the EventBus repository we build new MediaWiki REST API endpoints with one or more of the following signatures (the output is always JSON here, not a produced event; see the client-side sketch after this list):
    • missed event basic info -> page_change
    • missed event basic info -> page_content_change
    • missed event basic info -> page_change + page_content_change
    • Array[missed event basic info] -> Array[page_change]
    • Array[missed event basic info] -> Array[page_content_change]
    • Array[missed event basic info] -> Array[page_change + page_content_change]
  • We test these endpoints and adjust the following to maximize performance:
    • missed event basic info: is this just (wiki_db, revision_id), or does MW gain performance if we pass more, such as the page id? This will come down to which indices are used and what queries the MW PHP API performs against MariaDB.
    • return value: as far as we can see right now, the main question is whether we return page_content_change at the same time or separately. This will come down to how much more expensive it is to fetch the content than just the basic metadata for page_change, not only for a one-off query but also for batch queries and for the achievable throughput with either approach.
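To make the batch variants concrete, here is a minimal client-side sketch, assuming a hypothetical /eventbus/v0/reconcile REST route, a JSON request body, and the Array[missed event basic info] -> Array[page_change] signature; none of these names or shapes are decided yet:

  import requests

  # Hypothetical route and host; path, auth and response shape are all still open.
  RECONCILE_URL = "https://mw-api.example.org/w/rest.php/eventbus/v0/reconcile"

  missed = [
      {"wiki_db": "enwiki", "revision_id": 123456789},
      {"wiki_db": "dewiki", "revision_id": 98765432},
  ]

  resp = requests.post(RECONCILE_URL, json={"events": missed}, timeout=60)
  resp.raise_for_status()

  # Assumed synchronous response: a JSON array of page_change objects, one per
  # input record, each validating against the page change schema linked above.
  page_change_events = resp.json()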

Other questions

  • From T368176, we note that a bad day could yield ~50,000 missing revisions. So we need to figure out what a reasonable max size per request should be (see the back-of-the-envelope sketch after this list). That is, what is the per-revision cost of emitting to EventBus? Will the client wait, or will EventBus return right away and work on the request asynchronously?
  • If async, do we need state? Will the API return an ID so that we can track progress? Or should we keep it simple?
  • Are there considerations regarding running as a MediaWiki extension that I'm not aware of? For example: I'd like this API to be a global endpoint rather than per wiki. Is that possible?
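A quick back-of-the-envelope sketch for the batching question above, with an arbitrary placeholder cap of 500 revisions per request:

  import math

  missing_revisions = 50_000   # a bad day, per T368176
  max_batch_size = 500         # placeholder; the real cap is what this task needs to determine
  requests_needed = math.ceil(missing_revisions / max_batch_size)   # -> 100 POSTs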

Event Timeline

@gmodena and @Ottomata: the description above is just me thinking out loud. Please modify it as you see fit.

I'd like this API to be a global endpoint rather than per wiki. Is that possible?

I'm not sure but it won't be as easy as per wiki. Why do you want this?

We've got the wiki_id. I guess we just need config to map from wiki_id to the correct API endpoint?

Hm, a potential problem:

We really don't want this API to be publicly accessible. I'm not certain whether we have examples of MW-deployed private APIs? Maybe we do and I just don't know about them.

If async, do we need state? Will the API return an ID so that we can track progress? Or should we keep it simple?

Good q. Should we use the MW JobQueue for this? It has the same consistency issues as producing regular events, but it shouldn't be worse. Rather than having the API call produce reconciled events directly, it would enqueue a job with the work to do. We'd implement a job in EventBus that produces the events.

Are you thinking of targeting Action or REST endpoints?

I'd like this API to be a global endpoint rather than per wiki. Is that possible?

I'm not sure but it won't be as easy as per wiki. Why do you want this?

So that a client doesn't need to think about it on a per-wiki basis. We'd just fire once for all wikis and we're done. It's just a nice-to-have though; if it is not possible, then that is all right.

if it is not possible, then that is all right.

I'm not sure that it's impossible... but it will be awkward. Something is going to have to instantiate all the MW context stuff that allows us to e.g. get a RevisionRecord from the correct database. I've never worked in a global context like this. I know it's possible to have any wiki access a single one-off global database, but I'm not sure if it is possible to have a global endpoint access any wiki's database.

We'd have to ask a MW person to be more sure.

Given we now have wmf_dumps.wikitext_inconsistent_rows_rc1 in production (DDL here), we should start fleshing out this MediaWiki API.

A POST request from PySpark should be trivial, but we should agree on whether we want this to happen per revision_id, or whether the endpoint will allow N revision_ids. I would strongly prefer N revision_ids per submit to avoid a flood of requests. I would also prefer that this API send the resulting reconciled events via a page_change_late event stream instead of synchronously returning them in the POST API call, but I understand this may not be possible due to performance implications.
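For illustration, the PySpark side could look roughly like the sketch below. Only the table name comes from this discussion; the column names, the endpoint URL, and the batch size are assumptions, and the sketch uses the synchronous POST path rather than the preferred page_change_late stream:

  import requests
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Assumed columns; the actual DDL is linked above.
  rows = (
      spark.table("wmf_dumps.wikitext_inconsistent_rows_rc1")
      .select("wiki_db", "revision_id")
      .collect()
  )
  missed = [{"wiki_db": r.wiki_db, "revision_id": r.revision_id} for r in rows]

  # Hypothetical endpoint; submit N revision_ids per POST to avoid a flood of requests.
  RECONCILE_URL = "https://mw-api.example.org/w/rest.php/eventbus/v0/reconcile"
  BATCH_SIZE = 500   # placeholder, see the batching question in the description

  for i in range(0, len(missed), BATCH_SIZE):
      batch = missed[i:i + BATCH_SIZE]
      requests.post(RECONCILE_URL, json={"events": batch}, timeout=120).raise_for_status()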

@gmodena has a hacky PoC here: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/1053043

But ya let's discuss.

I would also prefer that this API send the resulting reconciled events via a page_change_late event stream

Same.

Or:

Remind me if I have forgotten, but did we discard doing this as an async streaming job? PySpark emits records to Kafka, streaming job consumes, enriches with MW API output, produces event(s).

Or:

we use MW job queue.

Ah, I had forgotten about that. @gmodena, how difficult would it be to make that endpoint take N revision_ids?

Remind me if I have forgotten, but did we discard doing this as an async streaming job? PySpark emits records to Kafka, streaming job consumes, enriches with MW API output, produces event(s).

IIRC, we agreed to try the EventBus API approach first because 1) we figured the throughput would be low given the findings from T368176, 2) teaching PySpark to produce events would be relatively onerous compared to an EventBus API, and 3) the write path stays consistent.

Or:

we use MW job queue.

EventBus could do that regardless of API design, yes?

we use MW job queue.

EventBus could do that regardless of API design, yes?

True.

I ask about the stream enrichment or job queue path (which could also use the EventBus API endpoint we are making) because then we don't need to think about batches. But making the EventBus endpoint handle batches (and optionally return vs. produce) is likely a nice feature to have and not hard to do, so ya, let's do it.

Hm, a potential problem:

We really don't want this API to be publicly accessible. I'm not certain whether we have examples of MW-deployed private APIs? Maybe we do and I just don't know about them.

You may be interested in T365752: REST: Introduce support for private modules.

Change #1053043 had a related patch set uploaded (by Gmodena; author: Gmodena):

[mediawiki/extensions/EventBus@master] WIP: hacky reconciliation API REST endpoint

https://gerrit.wikimedia.org/r/1053043

I am closing this ticket, as we have abandoned the idea of building this reconciliation API on top of EventBus. Instead, we will emit the events directly from Spark to EventGate via T368755.