Overview
EventBus might have produced a page_change event that got lost somewhere along the pipeline: EventBus -> EventGate -> Kafka -> Flink enrichment -> Kafka -> Gobblin -> HDFS. Via a reconciliation mechanism we can find out basic details about the events we missed. These basic details include at least (wiki_db, revision_id). Different mechanisms might be able to get us more, but let's abstract from that and call this the missed event basic info.
Consumers of the page_change and page_content_change events would like to receive these missed events in a separate topic. So what we would like to build here is something that returns valid page_change and page_content_change events in response to missed event basic info.
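As a concrete illustration, here is a minimal sketch in Python of the shape such a record could take. The type and field names are hypothetical; only (wiki_db, revision_id) is guaranteed by the description above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MissedEventBasicInfo:
    """Identifying details for an event that was lost somewhere in the pipeline.

    wiki_db and revision_id are the minimum; extra hints such as page_id
    depend on what the reconciliation mechanism can recover.
    """
    wiki_db: str                   # e.g. "enwiki"
    revision_id: int               # MediaWiki rev_id of the missed revision
    page_id: Optional[int] = None  # optional extra hint, if available
```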
First Approach
In a recent meeting we reached consensus to try the following:
(for reference, the page change schema)
- In the EventBus repository we build new MediaWiki REST API endpoints with one or more of the following signatures (the output here is always JSON returned to the caller, not a produced event; see the request/response sketch after this list):
- missed event basic info -> page_change
- missed event basic info -> page_content_change
- missed event basic info -> page_change + page_content_change
- Array[missed event basic info] -> Array[page_change]
- Array[missed event basic info] -> Array[page_content_change]
- Array[missed event basic info] -> Array[page_change + page_content_change]
- We test these endpoints and adjust the following to maximize performance:
- missed event basic info: is this just (wiki_db, revision_id), or does MediaWiki perform better if we pass more, such as the page ID? This will come down to which indices are used and what queries the MediaWiki PHP API runs against MariaDB.
- return value. As far as we can see right now, the main question is whether we return page_content_change at the same time or separately. This will come down to how much more expensive it is to fetch the content compared to fetching only the basic metadata needed for page_change, and not just for a one-off query but also for batch queries and for the throughput achievable with either approach (a rough benchmarking sketch also follows this list).
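To make the endpoint signatures concrete, here is a minimal client-side sketch of the batched page_change variant. The endpoint path, payload field names, and exact response shape are assumptions to be settled during implementation; only the (wiki_db, revision_id) input and the idea of returning JSON page_change events come from the proposal above.

```python
import requests

# Hypothetical route; the real one would be defined by the EventBus extension.
RECONCILE_PAGE_CHANGE_URL = (
    "https://www.mediawiki.example/w/rest.php/eventbus/v0/reconcile/page_change"
)

# Array[missed event basic info]; page_id is an optional extra hint.
missed_events = [
    {"wiki_db": "enwiki", "revision_id": 123456789},
    {"wiki_db": "dewiki", "revision_id": 987654321, "page_id": 42},
]

resp = requests.post(
    RECONCILE_PAGE_CHANGE_URL,
    json={"events": missed_events},
    timeout=60,
)
resp.raise_for_status()

# Assumed response: a JSON array of fully formed page_change events, each valid
# against the page change schema, which the caller can then produce to a
# separate reconciliation topic.
for page_change in resp.json():
    # Field names depend on the schema version; printed here only to show
    # that we get back full events, not just ids.
    print(page_change)
```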
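Similarly, here is a rough sketch of how the batch-size and metadata-vs-content questions could be measured from the client side. All endpoint paths and batch sizes are placeholders, and a real test would also need to look at MariaDB query plans and server-side cost, not just wall-clock time.

```python
import time
import requests

# Hypothetical routes for the two return-value variants.
PAGE_CHANGE_URL = "https://www.mediawiki.example/w/rest.php/eventbus/v0/reconcile/page_change"
PAGE_CONTENT_CHANGE_URL = "https://www.mediawiki.example/w/rest.php/eventbus/v0/reconcile/page_content_change"

def make_batch(size: int, with_page_id: bool) -> list[dict]:
    """Synthetic batch of missed event basic info; real tests would use known missed revisions."""
    batch = [{"wiki_db": "enwiki", "revision_id": 1_000_000 + i} for i in range(size)]
    if with_page_id:
        for item in batch:
            item["page_id"] = 1  # placeholder; would be the real page id hint
    return batch

def time_batch(endpoint: str, batch: list[dict]) -> float:
    """Client-side wall-clock seconds for one batched reconciliation request."""
    start = time.monotonic()
    resp = requests.post(endpoint, json={"events": batch}, timeout=300)
    resp.raise_for_status()
    return time.monotonic() - start

for size in (10, 100, 1000):
    for with_page_id in (False, True):
        batch = make_batch(size, with_page_id)
        t_meta = time_batch(PAGE_CHANGE_URL, batch)          # metadata only
        t_full = time_batch(PAGE_CONTENT_CHANGE_URL, batch)  # includes content
        print(f"size={size} page_id_hint={with_page_id} "
              f"page_change={t_meta:.2f}s page_content_change={t_full:.2f}s")
```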
Other questions
- From T368176, we note that a bad day could yield ~50,000 missing revisions. So we need to figure out what a reasonable maximum size per request should be; that is, what is the cost per revision emitted to EventBus? Will the client wait, or will EventBus return right away and work on the request asynchronously? (A chunking sketch follows this list.)
- If async, do we need to keep state? Will the API return a job ID so that we can track progress, or should we keep it simple?
- Are there considerations regarding running as a MediaWiki extension that I'm not aware of? For example, I'd like this API to be a global endpoint rather than a per-wiki one. Is that possible?
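To anchor the sizing discussion, here is a minimal client-side sketch of splitting ~50,000 missed revisions into bounded requests, with the speculative async/job-id variant left as comments. The batch size, endpoint path, and job-tracking API are all assumptions, not decisions.

```python
import requests

# Hypothetical route and batch size; both would be tuned based on measured cost.
RECONCILE_URL = "https://www.mediawiki.example/w/rest.php/eventbus/v0/reconcile/page_change"
MAX_PER_REQUEST = 500  # ~50,000 missed revisions => ~100 requests at this size

def chunked(items: list[dict], size: int):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def reconcile(missed_events: list[dict]) -> list[dict]:
    """Synchronous variant: each request blocks until its batch of events is built."""
    produced = []
    for batch in chunked(missed_events, MAX_PER_REQUEST):
        resp = requests.post(RECONCILE_URL, json={"events": batch}, timeout=300)
        resp.raise_for_status()
        produced.extend(resp.json())
    return produced

# Speculative async variant: the API could instead return a job id immediately
# and expose a status endpoint that the client polls for progress.
#
#   job = requests.post(RECONCILE_URL, json={"events": missed_events}).json()
#   while requests.get(f"{RECONCILE_URL}/jobs/{job['job_id']}").json()["state"] != "done":
#       time.sleep(5)
```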