User Details
- User Since: Oct 8 2014, 5:48 PM (512 w, 5 d)
- Availability: Available
- IRC Nick: Milimetric
- LDAP User: Milimetric
- MediaWiki User: Milimetric (WMF) [ Global Accounts ]
Wed, Jul 31
@VirginiaPoundstone adding this to our sprint because it's basically a no-op and an easy resolution to an old task.
Tue, Jul 30
Oh! Thanks for the reminder. This is now available but not yet included in the sqoop lists; I'll make a patch, easy enough. labswiki seems to be available in both the analytics replicas and the cloud replicas (as labswiki_p in the latter).
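For reference, a quick way to sanity-check that labswiki is queryable on the cloud replicas (a sketch only, assuming the standard _p database/view naming; the analytics replicas use the plain labswiki name):

-- hypothetical spot check on the cloud replicas
select count(*) as page_count from labswiki_p.page;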
Mon, Jul 29
Some initial thoughts: https://docs.google.com/document/d/1NorKzBiQyz2nXCUUkGqdFP8SfPD0bzBMdfPO183dqNY/edit
Adding this to our Sprint board as it needs to get a look and deployment. But the change looks good, just have to alter the table and make sure the timing of all the jobs works out.
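Purely illustrative, since the thread doesn't name the table or column, but the alter step would be a plain Hive/Spark ALTER TABLE along these lines:

-- hypothetical only; the real table and column names are not in this thread
alter table some_db.some_table add columns (new_field string comment 'added by the change above');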
great, moving this to get deployed. Steps will be:
Just for the record, we met and discussed @Joe's proposal (this task's description) and were in general agreement that it's the best way forward. We have follow-up discussions to have and coordination to do, but we're aligned on the idea.
Sat, Jul 27
K, as a final update here, the pipeline is:
Fri, Jul 26
manually running this in a screen on an-launcher1002:
Ok, this ended up being very involved. I believe the root of all the confusion is that all the dumps jobs assume the PREVIOUS dump finished and work only on the CURRENT dump. So:
- We ran around dumpsdata and snapshot hosts hardcoding 20240701 wherever things were looking for "latest", and we're not sure whether we broke anything.
- At the end of the day, we basically figured the snapshot1010 version of the dumps files seemed all good, and we just rsynced them over to the dumpsdata and clouddumps hosts.
- The rsync service that runs ALSO assumes this "latest" thing, but only for the status files, so as far as we could tell everything was already rsynced except the status and html files.
- The monitor/html generation service ALSO assumes "latest", so we weren't able to run it to generate the html even after trying to hack it; the html files were already on the snapshot hosts, though, so we just moved those over with rsync too.
- The base rsync excludes json and html, so we just hacked it to include them instead.
Quick spike to size up how many revisions we're dealing with on a daily basis:
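The spike query itself isn't captured in this feed; a rough sketch of one way to count daily revisions, assuming wmf.mediawiki_history and a hypothetical snapshot:

select to_date(event_timestamp) as day, count(*) as revisions
from wmf.mediawiki_history
where snapshot = '2024-06'  -- hypothetical snapshot
  and event_entity = 'revision'
  and event_type = 'create'
  and event_timestamp >= '2024-06-01'
group by to_date(event_timestamp)
order by day;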
Thu, Jul 25
oof, I just realized this is for the month BEFORE. I see that's still in-process:
Wed, Jul 24
Tue, Jul 23
Ok, dug into this a bit more. Looks like the job set up to import the dumps XML is running fine but the status file says wikidatawiki is still in progress. Specifically it says this:
The airflow sensor timed out. But I never saw an alert for it (maybe it was before this week). I cleared it and will report back here in a bit after it has a chance to think about running again.
Deployed, started job, waiting to see if it works.
I deployed this and started the job, checking in now to make sure it runs.
Mon, Jul 22
- wmf.mediawiki_history: duplicate revision/create records indeed exist; some have 4 copies and some have 2, but all spot-checked duplicates come in even numbers (see the sketch after this list)
- wmf_raw.mediawiki_revision: does not show the same duplication
- analytics mysql replicas: the pages those revisions belong to were moved and had some delete/restore and delete/revision actions in the logging table
- cloud replicas: agrees with analytics replicas
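For the record, the duplicate check behind the first bullet looks roughly like this (a sketch assuming a wmf.mediawiki_history snapshot and enwiki scope, not the exact query used):

-- hypothetical duplicate check; snapshot and wiki are illustrative
select revision_id, count(*) as copies
from wmf.mediawiki_history
where snapshot = '2024-06'
  and event_entity = 'revision'
  and event_type = 'create'
  and wiki_db = 'enwiki'
group by revision_id
having count(*) > 1
order by copies desc
limit 100;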
+1 to decom
Sorry to ask this very basic question, but I found a bunch of others didn't know: how exactly is Dumps blocking the PHP 8 upgrade? Like, if we leave everything exactly as-is and just upgrade PHP, would it not run the way it is currently set up? On the surface I see no big difference between the current setup and a containerized MW running on the same servers, so I'm curious about the nuance I'm missing here.
Sorry, I just signed it. I'm sure I signed it, or some form of it, at some point before; I've been an employee for almost 12 years :P
Fri, Jul 19
ok, moving to ready to deploy. I'm going to ping @Krinkle one more time for data review. I executed this as I was testing and the results are available in milimetric.browser_general_test. You can query this like this:
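The original example isn't shown in this feed, but something as minimal as this should work (only the table name comes from the comment above):

-- minimal sketch: peek at the test output table
select * from milimetric.browser_general_test limit 20;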
Wed, Jul 17
Tue, Jul 16
weird... quick steps as I look into this.
Fri, Jul 12
Ok, sent updated code. It's fast now thanks to a CACHE statement, but that doesn't change the query plan, which is still absolutely nuts. Check this out:
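(The plan output itself isn't reproduced in this feed.) For context, a Spark SQL CACHE statement looks roughly like this; the table name and query here are hypothetical, not the actual change:

-- hypothetical illustration of caching an intermediate result in Spark SQL
cache table browser_intermediate as
select user_agent_map['browser_family'] as browser_family, count(*) as views
from wmf.pageview_actor
where year = 2024 and month = 6 and day = 1
group by user_agent_map['browser_family'];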
Quick spark-sql query to get link changes where someone tags a new wiki project on the talk page:
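The query itself isn't included in this feed. One way to approximate it from tables already discussed in this thread would be something like the sketch below; the real query may well have used a different source:

-- hypothetical approximation: talk-page revisions whose edit comment mentions a WikiProject banner
select wiki_db, page_id, page_title, revision_id, event_timestamp
from wmf.mediawiki_history
where snapshot = '2024-06'          -- hypothetical snapshot
  and event_entity = 'revision'
  and event_type = 'create'
  and page_namespace = 1            -- article talk pages
  and event_comment like '%WikiProject%'
order by event_timestamp
limit 1000;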
Thu, Jul 11
OK, so it seems most problems do indeed track back to not applying delete and restore events. It feels like we can mark this task complete. We can find a way to apply delete/restore/merge, and then run these queries again and see what we need to reconcile. The period I looked at above was 10 days of enwiki revisions. If anyone disagrees, do move this task back.
That one mismatch_page that has no other reason listed is apparently part of a merge, so if we're not following up on delete/restore properly then this makes perfect sense because merges are more complicated still. Here are the two pages involved and the logging table records for them:
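(The titles and records themselves aren't reproduced here.) Pulling that kind of history out of the logging table on the replicas looks roughly like this; the page titles below are placeholders:

-- hypothetical lookup of delete/move/merge events for two pages
select log_timestamp, log_type, log_action, log_namespace, log_title
from logging
where log_title in ('Example_page_A', 'Example_page_B')
  and log_type in ('delete', 'move', 'merge')
order by log_timestamp;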
Ok, I think I got this query to make sense... the results:
Wed, Jul 10
I am still trying to find an elegant way to change the queries and show all this, but I just wanted to share results so far:
Apologies for the week delay here, I was out sick, picking it back up soon.
Tue, Jul 9
Jul 3 2024
Quick summary of last meeting. Luke started working on a draft of what we were talking about (see the reconciliation flow on https://miro.com/app/board/uXjVNfaohl0=/).
Jul 1 2024
My first hunch, that the revisions were coming from only specific pages, is wrong:
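The supporting output isn't shown here; the check amounts to counting how many distinct pages the duplicated revisions span, roughly like this (a sketch against wmf.mediawiki_history, not the exact query):

-- hypothetical: how many distinct pages do the duplicated revision rows belong to?
select count(distinct page_id) as pages_with_duplicates
from (
  select page_id, revision_id
  from wmf.mediawiki_history
  where snapshot = '2024-06'  -- hypothetical snapshot
    and event_entity = 'revision'
    and event_type = 'create'
    and wiki_db = 'enwiki'
  group by page_id, revision_id
  having count(*) > 1
) dup;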
Jun 26 2024
are we literally saying that we should just change the value of statistics-users-active to Editors? Code here: https://gerrit.wikimedia.org/g/mediawiki/core/+/1aa990f1725bf81caaf44527b9e778b5a8fe7e4d/languages/i18n/en.json#1950
Thanks for pinging us, we don't use abuse filter tables anywhere I'm aware of, so this shouldn't affect us.
Thanks for pinging us on this. The sqoop code should run without modification, so we're good downstream. Thank you!
obligatory reference: https://www.mediawiki.org/wiki/Extension:NavigationTiming (is this roughly related?)
Jun 25 2024
This is now done.
Great question, @mforns. This was mostly for performance reasons. I couldn't find a way to get Spark to optimally work on the full day of pageviews without first aggregating it like this to > 250. But the execution plan I ended up with looks pretty wild. Let's talk tomorrow when you have some time. I'm attaching the change here.
Jun 24 2024
I've migrated and shut off the old instances. I will delete them in a couple of days, just in case. But everything's working fine without them. Did not know about the wmflabs -> wmcloud automatic redirect, that made everything very simple.
I grouped a couple of tasks under this so we're less likely to lose them in the fray.
Jun 21 2024
The simpler way to do this (just two phases as opposed to progressive) gets us fairly similar results, with about 200 fewer rows, all of which detail specific browser versions.
We get a ton more detailed results this way, and the total coverage increases to 99.7%. Still not 99.9%, but I think we may have too much detail at some point. I'm fairly happy with these results, and I'm going to prepare the new browser general query as a gerrit change. It'll be good to get some review.
This might affect some of the data we sqoop into HDFS and parts of how we compute commons impact metrics or similar future metrics. We have to wait until a schema change is proposed to know for sure.
From a discussion with @Krinkle about the data, a preliminary idea of how to roll up is:
Jun 20 2024
The long and the short of it is that we can get that "other" to about 2% if we simply roll up remaining data by browser family and os family. We could get fancier but let's see what folks think about just this approach.
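As a sketch of what that rollup looks like (field names and filters here are illustrative, assuming pageview_actor's user_agent_map; in the real rollup only the rows too small to report at the version level would be collapsed this way):

-- hypothetical sketch: aggregate a day of pageviews at the browser family / os family level
select user_agent_map['os_family'] as os_family,
       user_agent_map['browser_family'] as browser_family,
       count(*) as view_count
from wmf.pageview_actor
where year = 2024 and month = 5 and day = 1
group by user_agent_map['os_family'], user_agent_map['browser_family']
order by view_count desc;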
This spreadsheet (1) has all the different aggregations in separate sheets. The name of the sheet is the aggregation type. Described here:
ok, I have some results for us to peruse, from rolling up in different ways. First of all, my query so we can debate whether or not it's accurate.
Jun 17 2024
Analysis available in this spreadsheet: https://docs.google.com/spreadsheets/d/1iSlH5XsRXV7mDoku0F5HbLNJmx1CMBm6ECakZMPUbU8/edit?usp=sharing
Made up some slides to help think about this data:
Jun 12 2024
Jun 11 2024
This looks like a great system to get started with. I can think of some potential snags that might come up, so as we build it let's keep an eye out for these and similar:
Jun 7 2024
select day, http_status, count(*) count_by_status
from pageview_actor
where year = 2024 and month = 4 and day in (19, 26)
  and geocoded_data['country_code'] = 'HK'
  and normalized_host.project_class = 'wikifunctions'
group by day, http_status
day | http_status | count_by_status |
19 | 200 | 389 |
19 | 301 | 4311 |
19 | 302 | 931 |
26 | 200 | 198 |
26 | 301 | 133801 |
26 | 302 | 1028 |
Jun 6 2024
Jun 5 2024
Jun 4 2024
https://mpic.svc.eqiad.wmnet:30443/ is the endpoint we should use to talk to the mpic service, making sure we don't take unnecessary hops through DNS.
May 31 2024
hypothesis so far: maybe some workers are getting MaxMind updates on a staggered schedule relative to others, so there's always some variation?
(and sub-country it's much worse)
May 29 2024
The two instances have been moved, docs updated on wiki and in code, and proxies have been moved. The only problem is the new proxies can't use the old wmflabs.org domain. For now, I left the old proxies up and additionally set up the new proxies. So, for example, both https://pingback.wmflabs.org/ and https://pingback.wmcloud.org/ work. Whenever the old instances are deleted, the old URLs will stop working. I guess part of sign-off will be to communicate this and maybe delete the old instances?
Keeping track of how I do this for future reference. (The previous task where I did this was T236586, and I failed to take good notes there.)