
T369898: Reduce the number of resource_change and resource_purge events emitted due to template changes
Open, Needs Triage, Public

Description

The number of resource_change and resource_purge events can get extremely high, spiking at 10k req/sec at times. These spikes are presumably caused by changes to highly used templates. We should investigate strategies to mitigate this effect.

These events are used to purge cached copies of rendered page content from various caches (Varnish, PCS/Cassandra, etc). This is needed to avoid users seeing vandalized content even after a malicious change to a template has been reverted.

Context

When a template is changed, we schedule several kinds of jobs, each of which recursively iterates over batches of pages that use the template in question: first HTMLCacheUpdateJob and RefreshLinksJob, then later CdnPurgeJob.

  • RefreshLinksJob is responsible for updating derived data in the database, in particular any entries in the link tables (pagelinks, templatelinks, etc) associated with the affected page. This is done by re-parsing the page content (using the new version of the template). Note that the rendered output is currently not cached in the ParserCache, because we want the cache to be populated based on organic page view access patterns. But that could change in the future.
  • HTMLCacheUpdateJob is responsible for invalidating the ParserCache and also for purging cached copies of the output from the CDN (Varnish) layer. It updates the page_touched field in the database and causes a CdnPurgeJob to be scheduled for any URLs affected by the change.
  • CdnPurgeJob uses the EventRelayerGroup service to notify any interested parties that URLs need purging. In WMF production, this triggers CdnPurgeEventRelayer, which sends resource_change events to the changeprop service, which then emits resource_purge events to the CDN layer. It also sends no-cache requests to services that manage their own caches, like RESTBase and PCS.

Diagram: https://miro.com/app/board/uXjVKI3NmLw=/
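
For orientation, the cascade above can be written out as a minimal Python sketch. It is purely illustrative: the real jobs are PHP classes in MediaWiki core, and the helpers (fetch_backlink_batch, enqueue, urls_for) and the batch size are assumptions made for illustration, not actual APIs.

```python
# Illustrative sketch of the job cascade (Python, not the actual MediaWiki
# PHP classes). fetch_backlink_batch(), enqueue() and urls_for() are
# hypothetical stand-ins; the batch size is a placeholder.

BATCH_SIZE = 300  # assumed batch size, for illustration only


def on_template_edit(template_title, fetch_backlink_batch, enqueue):
    """A template change walks the template's backlinks in batches and
    schedules the two recursive job families described above."""
    for batch in fetch_backlink_batch(template_title, BATCH_SIZE):
        # RefreshLinksJob: re-parse each page to update the link tables.
        enqueue("RefreshLinksJob", pages=batch)
        # HTMLCacheUpdateJob: bump page_touched, invalidate the ParserCache
        # entries, and fan out into CdnPurgeJob for the affected URLs.
        enqueue("HTMLCacheUpdateJob", pages=batch)


def on_html_cache_update(pages, urls_for, enqueue):
    """HTMLCacheUpdateJob schedules a CdnPurgeJob, which the
    EventRelayerGroup turns into resource_change events (and, via
    changeprop, resource_purge events plus no-cache requests)."""
    urls = [url for page in pages for url in urls_for(page)]
    enqueue("CdnPurgeJob", urls=urls)
```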

Ideas

  • Generally rely on natural expiry of caches, which should happen after one day. Only trigger recursive purges in certain cases (see the sketch after this list):
    • if the template isn't used too much (maybe 100 times or so)
    • when an admin explicitly requests a recursive purge (could be a button on the purge page, or a popup after a revert)
    • after rollback/undo (unconditionally or optionally)
    • if the template is unprotected (protected templates are unlikely to be vandalized).
  • Avoid purges if the generated output for a given change didn't actually change due to the change to the template.
    • Leave it to RefreshLinksJob to decide if a CdnPurgeJob is needed for a given page.
    • This would delay purging quite a bit, since RefreshLinksJob is slow.
  • Avoid purges based on traffic:
    • only purge pages that were requested in the last 24h (using a join in Flink)
    • only purge pages that are in the top n-th percentile of page views.
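
As a rough illustration of the first and last groups of ideas, the gate could look like the Python sketch below. The thresholds and helpers (is_protected, transclusion_count, recent_view_count) are hypothetical assumptions for illustration, not existing MediaWiki APIs.

```python
# Hypothetical gating logic for recursive purges. The thresholds and the
# helpers (is_protected, transclusion_count, recent_view_count) are
# assumptions for illustration, not existing MediaWiki APIs.

TRANSCLUSION_LIMIT = 100   # "isn't used too much (maybe 100 times or so)"
VIEW_WINDOW_HOURS = 24     # "only purge pages requested in the last 24h"


def should_purge_recursively(template, *, explicit_request=False,
                             is_revert=False):
    """Decide whether a template change should trigger a recursive purge
    at all; otherwise we rely on natural cache expiry (~1 day)."""
    if explicit_request or is_revert:
        return True                       # admin button / rollback or undo
    if template.is_protected():
        return False                      # unlikely to be vandalized
    return template.transclusion_count() <= TRANSCLUSION_LIMIT


def pages_worth_purging(pages, recent_view_count):
    """Traffic-based filter: only purge pages that were actually requested
    in the last 24h (the Flink-join idea above)."""
    return [p for p in pages
            if recent_view_count(p, hours=VIEW_WINDOW_HOURS) > 0]
```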

Event Timeline

Change #1053907 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] RefreshLinksJob: collect stats on redundant parses

https://gerrit.wikimedia.org/r/1053907

Umherirrender renamed this task from Redunce the number of resource_change and resource_purge events emitted due to template changes to Reduce the number of resource_change and resource_purge events emitted due to template changes. Jul 12 2024, 5:19 PM

"if the template is unprotected (protected templates are unlikely to be vandalized)."

But if a protected template is vandalized, the vandalism is on a much broader scale.

The English Wikipedia has had LTAs making 500 edits and waiting a month just to vandalize protected templates, for example https://en.wikipedia.org/wiki/Special:Contributions/CheezDeez32. Even with the recursive purges, that tends to generate complaints about the vandalism hours after it was reverted, like https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_212#Strange_bug_on_Flag_of_Russia_article. But I guess if the initial edit didn't purge things then the revert won't need to either.

when an admin explicitly requests a recursive purge (could be a button on the purge page, or a popup after a revert)

I suspect most vandalism patrollers will have no idea what they are being asked to answer.


I think the idea is probably fine, but you should at least think about how to handle cleanup after cases like the one I linked.

Another option is to subdivide pages into two categories, "high traffic pages" and "long tail low traffic pages". The latter would be put effectively into a no-cache state: the cache lifetime would be very short, and we would never emit purges for them, relying on the natural expiration to deal with vandalism. We'd only emit purges for the high traffic pages.

This is similar to the last item in the "ideas" category except for statically assigning pages to a category (instead of dynamically) and the addition of a no-cache (or limited-lifetime-cache) header for the "low traffic pages". Newly-created pages would probably default to the "high traffic"/"precise purging" category, and we could run a job "every so often" to reassign the categories, emitting a purge for any page moved from "high traffic" to "low traffic". Nothing needs to be done for a page moved from low traffic to high.

The goal is to increase the amount of /useful work/ done by the purge traffic: increasing the chances that the thing being purged is (a) actually in a cache, and (b) will be viewed from the cache before cache expiration.
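
A minimal sketch of how that static two-bucket scheme could be wired up, assuming a hypothetical per-page category store, a page-view count source, and placeholder thresholds. The key design choice matches the comment above: demotions emit one last purge, promotions cost nothing.

```python
# Sketch of the static "high traffic" / "long tail" split described above.
# The category store, the view-count source and both thresholds are
# assumptions for illustration.

HIGH_TRAFFIC_MIN_VIEWS = 1000   # placeholder cutoff per reassignment window
LONG_TAIL_MAX_AGE = 300         # short cache lifetime (seconds) for the long tail
NORMAL_MAX_AGE = 86400          # the ~1 day TTL mentioned above


def reassign_categories(view_counts, page_category, emit_purge):
    """Periodic job: reassign pages to buckets. A page demoted from
    "high traffic" to "long tail" gets one final purge; promotions in the
    other direction need nothing, as noted above."""
    for page, views in view_counts.items():
        new_cat = "high" if views >= HIGH_TRAFFIC_MIN_VIEWS else "long_tail"
        old_cat = page_category.get(page, "high")  # new pages default to "high"
        if old_cat == "high" and new_cat == "long_tail":
            emit_purge(page)
        page_category[page] = new_cat


def cache_control_for(page, page_category):
    """Long-tail pages get a very short lifetime and never receive purges;
    high-traffic pages keep the normal TTL and precise purging."""
    if page_category.get(page, "high") == "long_tail":
        return f"s-maxage={LONG_TAIL_MAX_AGE}"
    return f"s-maxage={NORMAL_MAX_AGE}"
```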

The number of resource_change and resource_purge events can get extremely high, spiking at 10k req/sec at times

I'm curious about the problem that this causes. Too many jobs inserted for the job queue to handle quickly enough? Too many purge requests at once?

For completeness, another option is the Varnish "x-key" system, which involves two research projects. One is that the implementation of x-key in Varnish appears to be incomplete, and the second is that assigning appropriate x-keys to URLs is non-trivial as well. There are too many templates used on a page like [[Barack Obama]] to naively assign one x-key to every recursively-included template, so we still need to come up with a mechanism to determine which of the templates deserve an x-key, likely based on purge statistics.
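
To make the "which templates deserve an x-key" question concrete, a selection pass driven by purge statistics might look like the sketch below. The purge-count source, the cutoff and the "tpl:" key format are assumptions; the Varnish-side x-key handling itself is not shown.

```python
# Hypothetical selection of templates that get a surrogate key, based on
# how often they trigger purges. purge_counts, TOP_N and the "tpl:" key
# format are assumptions for illustration.
from collections import Counter

TOP_N = 1000   # placeholder: only the most purge-heavy templates get a key


def select_xkey_templates(purge_counts: Counter) -> set:
    """Return the templates worth tagging with an x-key. Everything else
    falls back to URL-based purging or natural expiry, so a page like
    [[Barack Obama]] ends up carrying only a handful of keys."""
    return {tpl for tpl, _count in purge_counts.most_common(TOP_N)}


def xkey_header_for(page_templates, keyed_templates) -> str:
    """Build the x-key style response header for a page: one key per
    recursively included template that made the cut."""
    keys = sorted(t for t in page_templates if t in keyed_templates)
    return " ".join("tpl:" + k for k in keys)
```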

Change #1053907 merged by jenkins-bot:

[mediawiki/core@master] RefreshLinksJob: collect stats on redundant parses

https://gerrit.wikimedia.org/r/1053907

The number of resource_change and resource_purge events can get extremely high, spiking at 10k req/sec at times

I'm curious about the problem that this causes. Too many jobs inserted for the job queue to handle quickly enough? Too many purge requests at once?

In the case of RESTBase (and the API server cluster): outages, or at least incidents. One such incident is documented here: https://wikitech.wikimedia.org/wiki/Incidents/2022-03-27_api. It's one of the reasons that I am happy that we are removing RESTBase from the infrastructure.

Looking at metrics from LinksUpdate, it seems that we could reduce the number of purges a lot if we wait until after we re-parse to decide whether we need to purge or not.

grafik.png (410×622 px, 41 KB)

On average, we see

  • 33 times per second we find a new rendering already cached. This is probably mostly from direct edits. We'd still want to trigger purges on direct edits, immediately.
  • 16 times per second, the re-parse generates the exact same HTML as before. We could skip the purge in this case.
  • 4 times per second, we find that the HTML actually changed. So we'd have to purge.
  • 61 times per second, we don't find anything in the ParserCache. There is probably nothing in the edge caches in that case, but we'd still have to notify persistent caches (RESTBase/PCS).

So, of the 114 purges per second (33 + 16 + 4 + 61), we could:

  • skip 16 entirely
  • send 61 only to services, not Varnish/ATS
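
A sketch of how RefreshLinksJob could map the four cases above onto purge decisions. The inputs (whether the ParserCache already held a fresh rendering, and a stored hash of the previous HTML) are assumptions about what the job could observe, not what the linked patch implements.

```python
# Illustrative mapping of the four re-parse outcomes above to purge actions.
# The inputs are assumptions about what RefreshLinksJob could observe.
import hashlib
from enum import Enum


class PurgeAction(Enum):
    FULL = "purge CDN/ATS and service caches"
    NONE = "skip the purge entirely"
    SERVICES_ONLY = "notify RESTBase/PCS only"


def classify_reparse(old_html, new_html, cached_hit):
    if cached_hit:
        # A fresh rendering was already in the ParserCache: almost certainly
        # a direct edit, where we still want an immediate purge (~33/s).
        return PurgeAction.FULL
    if old_html is None:
        # Nothing in the ParserCache to compare against; probably nothing in
        # the edge caches either, but persistent caches still need a nudge (~61/s).
        return PurgeAction.SERVICES_ONLY
    # In practice only a stored hash of the previous rendering is needed.
    old_digest = hashlib.sha256(old_html.encode()).digest()
    new_digest = hashlib.sha256(new_html.encode()).digest()
    if old_digest == new_digest:
        return PurgeAction.NONE    # identical HTML after the re-parse (~16/s)
    return PurgeAction.FULL        # the HTML actually changed (~4/s)
```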