T366004 Add page-title to the x_analytics header

Add page-title to the x_analytics header
Open, Needs Triage, Public

Description

Following https://phabricator.wikimedia.org/T365321, @Krinkle suggested that we could send the page title from MediaWiki in the x_analytics header.
We currently parse wiki URLs to get the titles for pageviews, which produces a lot of not-found results and potential mistakes, so this idea feels great if achievable.

Event Timeline

Based on https://wikitech.wikimedia.org/wiki/Debugging_in_production:

krinkle@mwdebug1002:~$ curl -I --connect-to ::$HOSTNAME 'https://test.wikipedia.org/wiki/Main_Page'
…
X-Analytics: ns=0;page_id=11791;rev_id=585017
krinkle@mwdebug1002:~$ curl -I --connect-to ::$HOSTNAME 'https://test.wikipedia.org/w/index.php?curid=11791'
X-Analytics: ns=0;page_id=11791;rev_id=585017

As I suspected, MediaWiki does not treat curid differently. Once the initial routing handler normalizes it to that of a valid page view, the rest of the codebase handles it fully as a page view, including the way the XAnalytics extension for MediaWiki (via WikimediaEvents extension hook) emits the X-Analytics header for WMF pageview counting/metadata.

Titles can get fairly long (up to 255 bytes) and may contain special characters. So I guess the main question is whether to always include the title and, when including it, what format to use. I'm guessing you'd want the same format we use in the URLs (i.e. the "db key" format, in which spaces are underscores and the ucfirst transform is applied; any other form results in a redirect from MediaWiki to this canonical form). But do we percent-encode it here? What level of UTF-8 literals is supported here? Do you usually percent-decode the URL in the pageview pipeline? If so, it might make sense to set title= here without percent-encoding. Perhaps a quoted string would make sense (e.g. a JSON string literal, to remove any doubt about how to parse it), like ns=0;page_id=1;rev_id=1;title="Main_Page".

https://en.wikipedia.org/wiki/Special:PrefixIndex?prefix=%22&namespace=0&hideredirects=1

So we would also send ns=0;title="\"Weird Al\" Yankovic's Greatest Hits" for https://en.wikipedia.org/wiki/%22Weird_Al%22_Yankovic%27s_Greatest_Hits. While that should be easy and interoperable to parse on its own, it is not easy to extract from the larger X-Analytics value, since = and ; could then appear inside the title. As far as I know there isn't really a standard for how the X-Analytics value is formatted, other than "split by ; and =", right?
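To illustrate the extraction problem, here is a minimal Python sketch of a naive consumer that follows the "split by ; and =" convention. The function name parse_x_analytics is hypothetical and the title "A;B=C" is made up for illustration; this is not the actual pipeline parser.

```python
import json


def parse_x_analytics(value: str) -> dict:
    """Naive parser: split fields on ';', then key/value on the first '='."""
    fields = {}
    for part in value.split(";"):
        if not part:
            continue
        key, _, val = part.partition("=")
        fields[key] = val
    return fields


# A JSON-quoted title without separator characters survives the split
# (the value still carries its surrounding quotes).
ok = "ns=0;page_id=1;rev_id=1;title=" + json.dumps("Main_Page")
print(parse_x_analytics(ok))

# A made-up title containing ';' and '=' is cut at the separator,
# and the remainder shows up as a bogus extra field.
bad = "ns=0;title=" + json.dumps("A;B=C")
print(parse_x_analytics(bad))
```

A consumer would need a quote-aware tokenizer to handle the second case, which is more than existing "split by ; and =" readers are likely to do.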

So perhaps percent-encoding is the better fit: it was designed for query strings, where ; and & act as separators, so it naturally encodes those characters.

So I'd suggest plain urlencode() instead, producing: ns=0;title=%22Weird_Al%22_Yankovic%27s_Greatest_Hits.
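A rough Python sketch of this proposal follows, using urllib.parse.quote as a stand-in for PHP's urlencode(). The two are not identical (PHP encodes spaces as + and also encodes ~), but db-key titles contain no spaces, so they agree on this example. The helper names build_header and parse_header are hypothetical.

```python
from urllib.parse import quote, unquote


def build_header(ns: int, db_key: str) -> str:
    # safe="" percent-encodes everything except letters, digits and "_.-~",
    # so ';' and '=' can never leak into the header as separators.
    return f"ns={ns};title={quote(db_key, safe='')}"


def parse_header(value: str) -> dict:
    # The plain "split by ; and =" convention, plus decoding of the title.
    fields = dict(part.split("=", 1) for part in value.split(";") if part)
    if "title" in fields:
        fields["title"] = unquote(fields["title"])
    return fields


db_key = "\"Weird_Al\"_Yankovic's_Greatest_Hits"
header = build_header(0, db_key)
print(header)  # ns=0;title=%22Weird_Al%22_Yankovic%27s_Greatest_Hits

# Round trip: the naive split still works and the title comes back intact,
# even for a made-up title that contains the separator characters.
tricky = build_header(0, "A;B=C")
print(tricky, parse_header(tricky))
```

With this format, existing consumers that only split on ; and = keep working, and only readers that want the title need to percent-decode it.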

This would be byte-identical to the canonical URL after percent-decoding. It would not be byte-identical before percent-decoding, because canonical URLs in MediaWiki use wfUrlencode(), which prettifies various characters such as ;, : and ! that urlencode() would percent-encode by default.
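The difference can be sketched in Python by treating the "prettified" characters as safe. This is only an approximation of wfUrlencode(): the safe set below contains just the three characters named above, and the title is made up for illustration.

```python
from urllib.parse import quote, unquote

# Assumption: only the characters mentioned in this thread are left readable.
PRETTY_CHARS = ";:!"


def strict_encode(title: str) -> str:
    # Approximates plain urlencode() for a db-key title (no spaces).
    return quote(title, safe="")


def pretty_encode(title: str) -> str:
    # Approximates the canonical-URL style, leaving ';', ':' and '!' as-is.
    return quote(title, safe=PRETTY_CHARS)


title = "Who's_Next;_Live:_Encore!"  # hypothetical title
strict = strict_encode(title)
pretty = pretty_encode(title)

print(strict)
print(pretty)
print(strict == pretty)                    # the encoded forms differ
print(unquote(strict) == unquote(pretty))  # the decoded forms match
```

So a consumer comparing the header title with the URL title should compare the percent-decoded values, not the raw strings.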

Whichever format we decide on, this would be a simple patch to make in WikimediaEvents to the onXAnalyticsSetHeader hook handler. I would suggest limiting it to requests where title isn't set, so as to take on minimal risk and change to the pipeline.

This raises a question that @Milimetric and others have discussed for a while: using webrequests to identify pageviews is error-prone and computationally expensive.

Could we emit a pageview event instead? That would give greater control to MediaWiki to identify what constitutes a pageview.

Doing that would be a big project with a complicated migration to manage. But, we will probably keep having issues like this as the pageview logic lags behind MW changes.

Anyway, food for thought!

I grouped a couple of tasks under this so we're less likely to lose them in the fray.

These are important for data quality reasons. But they're also important for solidifying our team's understanding of the data pipelines. I look at these kinds of problems like rust. If you get a little rust you can polish it off, paint, all good. If you let the rust eat away at the structure, eventually everything breaks.