T368782 MediaWiki Reconciliation API
Closed, Declined · Public · 13 Estimated Story Points

Description

Overview

EventBus might have produced a page_change event that got lost somewhere along the pipeline: EventBus -> EventGate -> Kafka -> Flink enrichment -> Kafka -> Gobblin -> HDFS. Via a reconciliation mechanism, we can find out basic details about the events that we missed; these basic details include at least (wiki_db, revision_id). Different mechanisms might be able to get us more, but let's abstract from that and call this the missed event basic info.

Consumers of the page_content_change and page_change events would like to know about these missed events in a separate topic. So what we would like to build here is something that gives us valid page_change and page_content_change events in response to missed event basic info.
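As a rough illustration (not a settled schema), the missed event basic info could be modeled like this in Python; wiki_db and revision_id come from the description above, everything else is an assumption:

  from typing import TypedDict

  class MissedEventBasicInfo(TypedDict):
      """Minimal identity of an event that was lost somewhere in the pipeline."""
      wiki_db: str        # e.g. "enwiki"; also tells us which wiki's API/DB to hit
      revision_id: int    # the revision that never made it to HDFS

  # A mechanism that knows more (e.g. the page id) could extend this record;
  # whether that extra data helps MW is one of the open questions below.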

First Approach

In a recent meeting we reached consensus to try the following:

(for reference, the page change schema)

  • In the EventBus repository we build new MediaWiki REST API endpoints with one or more of the following signatures (the output is always JSON here, not a produced event; see the client-side sketch after this list):
    • missed event basic info -> page_change
    • missed event basic info -> page_content_change
    • missed event basic info -> page_change + page_content_change
    • Array[missed event basic info] -> Array[page_change]
    • Array[missed event basic info] -> Array[page_content_change]
    • Array[missed event basic info] -> Array[page_change + page_content_change]
  • We test these endpoints and adjust the following to maximize performance:
    • missed event basic info: is this just (wiki_db, revision_id), or does MW gain performance if we pass more, such as the page id? This will come down to which indices are used and what queries the MW PHP API performs against MariaDB.
    • return value: as far as we can see right now, the main question is whether we return page_content_change at the same time or separately. This will come down to how much more expensive it is to fetch the content than just the basic metadata for page_change, not only for a one-off query but also for batch queries and for the achievable throughput with either approach.
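To make the batch variants concrete, here is a minimal client-side sketch, assuming a hypothetical /eventbus/v0/reconcile REST route, a JSON request body, and the Array[missed event basic info] -> Array[page_change] signature; none of these names or shapes are decided yet:

  import requests

  # Hypothetical route and host; path, auth and response shape are all still open.
  RECONCILE_URL = "https://mw-api.example.org/w/rest.php/eventbus/v0/reconcile"

  missed = [
      {"wiki_db": "enwiki", "revision_id": 123456789},
      {"wiki_db": "dewiki", "revision_id": 98765432},
  ]

  resp = requests.post(RECONCILE_URL, json={"events": missed}, timeout=60)
  resp.raise_for_status()

  # Assumed synchronous response: a JSON array of page_change objects, one per
  # input record, each validating against the page change schema linked above.
  page_change_events = resp.json()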

Other questions

  • From T368176, we note that a bad day could yield ~50,000 missing revisions. So we need to figure out what a reasonable max size per request should be (see the back-of-the-envelope sketch after this list). That is, what is the per-revision cost of emitting to EventBus? Will the client wait, or will EventBus return right away and work on the request asynchronously?
  • If async, do we need state? Will the API return an ID so that we can track progress? Or should we keep it simple?
  • Are there considerations regarding running as a MediaWiki extension that I'm not aware of? For example: I'd like this API to be a global endpoint rather than per wiki. Is that possible?
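A quick back-of-the-envelope sketch for the batching question above, with an arbitrary placeholder cap of 500 revisions per request:

  import math

  missing_revisions = 50_000   # a bad day, per T368176
  max_batch_size = 500         # placeholder; the real cap is what this task needs to determine
  requests_needed = math.ceil(missing_revisions / max_batch_size)   # -> 100 POSTs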

Event Timeline

@gmodena and @Ottomata: the description above is just me thinking out loud. Please modify it as you see fit.

I'd like this API to be a global endpoint rather than per wiki. Is that possible?

I'm not sure but it won't be as easy as per wiki. Why do you want this?

We've got the wiki_id. I guess we just need config to map from wiki_id to the correct API endpoint?

Hm, a potential problem:

We really don't want this API to be publicly accessible. I'm not certain whether we have examples of MW-deployed private APIs? Maybe we do and I just don't know about them.

If async, do we need state? Will the API return an ID so that we can track progress? Or should we keep it simple?

Good q. Should we use the MW JobQueue for this? It has the same consistency issues as producing regular events, but it shouldn't be worse. Rather than having the API call produce reconciled events directly, it would enqueue a job with the work to do. We'd implement a job in EventBus that produces the events.

Are you thinking of targeting Action or REST endpoints?

I'd like this API to be a global endpoint rather than per wiki. Is that possible?

I'm not sure but it won't be as easy as per wiki. Why do you want this?

So that a client doesn't need to think about it on a per-wiki basis. We'd just fire once for all wikis and we're done. It's just a nice-to-have though; if it is not possible, then that is all right.

if it is not possible, then that is all right.

I'm not sure that it's impossible... but it will be awkward. Something is going to have to instantiate all the MW context stuff that allows us to e.g. get a RevisionRecord from the correct database. I've never worked in a global context like this. I know it's possible to have any wiki access a single one-off global database, but I'm not sure if it is possible to have a global endpoint access any wiki's database.

We'd have to ask a MW person to be more sure.

Given we now have wmf_dumps.wikitext_inconsistent_rows_rc1 in production (DDL here), we should start fleshing out this MediaWiki API.

A POST request from PySpark should be trivial, but we should agree on whether we want this to happen per revision_id, or whether the endpoint will allow N revision_ids. I would strongly prefer N revision_ids per submit to avoid a flood of requests. I would also prefer that this API send the resulting reconciled events via a page_change_late event stream instead of synchronously returning them in the POST API call, but I understand this may not be possible due to performance implications.
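For illustration, the PySpark side could look roughly like the sketch below. Only the table name comes from this discussion; the column names, the endpoint URL, and the batch size are assumptions, and the sketch uses the synchronous POST path rather than the preferred page_change_late stream:

  import requests
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Assumed columns; the actual DDL is linked above.
  rows = (
      spark.table("wmf_dumps.wikitext_inconsistent_rows_rc1")
      .select("wiki_db", "revision_id")
      .collect()
  )
  missed = [{"wiki_db": r.wiki_db, "revision_id": r.revision_id} for r in rows]

  # Hypothetical endpoint; submit N revision_ids per POST to avoid a flood of requests.
  RECONCILE_URL = "https://mw-api.example.org/w/rest.php/eventbus/v0/reconcile"
  BATCH_SIZE = 500   # placeholder, see the batching question in the description

  for i in range(0, len(missed), BATCH_SIZE):
      batch = missed[i:i + BATCH_SIZE]
      requests.post(RECONCILE_URL, json={"events": batch}, timeout=120).raise_for_status()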

@gmodena has a hacky PoC here: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/1053043

But ya let's discuss.

I would also prefer that this API send the resulting reconciled events via a page_change_late event stream

Same.

Or:

Remind me if I have forgotten, but did we discard doing this as an async streaming job? PySpark emits records to Kafka, streaming job consumes, enriches with MW API output, produces event(s).

Or:

we use MW job queue.

Ah, I had forgotten about that. @gmodena, how difficult would it be to make that endpoint take N revision_ids?

Remind me if I have forgotten, but did we discard doing this as an async streaming job? PySpark emits records to Kafka, streaming job consumes, enriches with MW API output, produces event(s).

IIRC, we agreed to try the EventBus API approach first because 1) we figured the throughput would be low given the findings from T368176, 2) teaching PySpark to produce events would be relatively onerous compared to an EventBus API, and 3) the write path stays consistent.

Or:

we use MW job queue.

EventBus could do that regardless of API design, yes?

we use MW job queue.

EventBus could do that regardless of API design, yes?

True.

I ask about the stream enrichment or job queue path (which could also use the EventBus API endpoint we are making) because then we don't need to think about batches. But making the EventBus endpoint handle batches (and optionally return vs. produce) is likely a nice feature to have and not hard to do, so ya, let's do it.

Hm, a potential problem:

We really don't want this API to be publicly accessible. I'm not certain whether we have examples of MW-deployed private APIs? Maybe we do and I just don't know about them.

You may be interested in T365752: REST: Introduce support for private modules.

Change #1053043 had a related patch set uploaded (by Gmodena; author: Gmodena):

[mediawiki/extensions/EventBus@master] WIP: hacky reconciliation API REST endpoint

https://gerrit.wikimedia.org/r/1053043

I am closing this ticket, as we have abandoned the idea of building this reconciliation API on top of EventBus. Instead, we will emit the events directly from Spark to EventGate via T368755.