https://dumps.wikimedia.org/backup-index.html shows 2024-04-17 13:16:19 wikidatawiki: Dump complete
https://dumps.wikimedia.org/wikidatawiki/ does not show 20240501
Description
Related Objects
- Mentioned In
  - T365155: Text id verification makes dumps skip many good rows
  - T325228: Migrate Dumps Snapshot hosts from Buster to Bullseye
- Mentioned Here
  - T365155: Text id verification makes dumps skip many good rows
  - T362566: Stop growth of text table by storing ES addresses in content table
  - T315902: New error "DB is set and has not been closed by the Load Balancer" for certain bad revisions during page content dumps
  - T325228: Migrate Dumps Snapshot hosts from Buster to Bullseye
Event Timeline
Confirming that no wikidatawiki dump is running on snapshot1011, which is the node that would typically run it:
hostname -f
snapshot1011.eqiad.wmnet
sudo -u dumpsgen bash
cd /srv/deployment/dumps/dumps/xmldumps-backup
python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki wikidatawiki --dryrun
would kill processes []
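For reference, the --kill --dryrun pass amounts to reporting the worker PIDs it would signal, without actually signalling them. A minimal illustrative sketch (the function and its argument handling are hypothetical, not dumpadmin.py's actual internals, which discover the PIDs themselves):

```python
import os
import signal

def kill_processes(pids, dryrun=True):
    """Report or terminate the given dump worker PIDs.

    Hypothetical stand-in for dumpadmin.py's --kill handling.
    """
    if dryrun:
        # Dry run: only report, never signal anything
        print("would kill processes", pids)
        return []
    killed = []
    for pid in pids:
        os.kill(int(pid), signal.SIGTERM)
        killed.append(pid)
    return killed
```

With an empty PID list, as in the output above, there is simply nothing to kill.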
It's weird that it never started, but I could not tell why.
Now let's start it on snapshot1014.eqiad.wmnet, which has already been upgraded to the latest OS (Bullseye), thus giving T325228 a chance to proceed on snapshot1011.
ssh snapshot1014.eqiad.wmnet
sudo -u dumpsgen bash
cd /srv/deployment/dumps/dumps/xmldumps-backup
screen -S wikidatawiki-20240501
bash ./worker --date 20240501 --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki
For reference, this late wikidata dump has had an impact on some downstream jobs.
The refinery-import-page-current-dumps systemd job on an-launcher1002 emitted WARNING Not writing success flag due to unavailable projects [wikidatawiki] and failed projects [] but still exited with a zero status code.
btullis@an-launcher1002:~$ systemctl status refinery-import-page-current-dumps.service
● refinery-import-page-current-dumps.service - Schedules daily an incremental import of the current month of pages-meta-current xmldumps into HDFS
     Loaded: loaded (/lib/systemd/system/refinery-import-page-current-dumps.service; static; vendor preset: enabled)
     Active: inactive (dead) since Fri 2024-05-10 07:11:26 UTC; 3h 58min ago
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 32400 ExecStart=/usr/local/bin/systemd-timer-mail-wrapper --subject refinery-import-page-current-dumps --mail-to data-engineering-alerts@lists.wikimedia.org --only-on-error /usr/local/bin/kerberos-run
   Main PID: 32400 (code=exited, status=0/SUCCESS)

May 10 05:00:00 an-launcher1002 systemd[1]: Starting Schedules daily an incremental import of the current month of pages-meta-current xmldumps into HDFS...
May 10 05:00:00 an-launcher1002 kerberos-run-command[32504]: User analytics executes as user analytics the command ['/usr/local/bin/refinery-import-page-current-dumps']
May 10 07:11:26 an-launcher1002 kerberos-run-command[32400]: 2024-05-10T05:00:20 WARNING Dump status not done in json dump-info file /mnt/data/xmldatadumps/public/wikidatawiki/20240501/dumpstatus.json
May 10 07:11:26 an-launcher1002 kerberos-run-command[32400]: 2024-05-10T07:11:26 WARNING Not writing success flag due to unavailable projects [wikidatawiki] and failed projects []
May 10 07:11:26 an-launcher1002 systemd[1]: refinery-import-page-current-dumps.service: Succeeded.
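The warning suggests the importer reads dumpstatus.json and only writes the success flag when each project's dump reports done. A minimal sketch of that kind of gate, assuming the standard dumpstatus.json layout; the job key used here is a placeholder, and the real refinery script may inspect different or multiple jobs:

```python
import json

def dump_is_done(dumpstatus_text, job="metacurrentdump"):
    """Return True when the named dump job reports status 'done'.

    The job key is hypothetical, for illustration only.
    """
    status = json.loads(dumpstatus_text)
    return status.get("jobs", {}).get(job, {}).get("status") == "done"

# Sample payloads for the two cases in the warning above
in_progress = '{"jobs": {"metacurrentdump": {"status": "in-progress"}}}'
finished = '{"jobs": {"metacurrentdump": {"status": "done"}}}'
```

Since wikidatawiki's dumpstatus.json still reported an unfinished run, the project counted as unavailable and no _SUCCESS flag was written.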
Subsequently the Airflow mediawiki_wikitext_current DAG timed out waiting for the _SUCCESS flag.
dumps.wikimedia.org still says that the wikidata dump is in progress:
...but the load on snapshot1014 looks suspiciously low, so I will check out that screen session and see if the dump processes are still running.
I see errors scrolling past from the dumps worker process that I don't understand:
Generic error getting text for revisions.
I also saw this status message scrolling past:
2024-05-09 14:42:08: wikidatawiki (ID 3932972) 1395999 pages (54.7|1265.5/sec all|curr), 1396000 revs (54.7|0.9/sec all|curr), 92.1%|92.1% prefetched (all|curr), ETA 2025-08-07 03:38:27 [max 2149446292]
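That ETA is alarming but arithmetically consistent with the status line: assuming the worker projects the whole-run revision rate (54.7/sec) against the max revision id, the remaining work lands in early August 2025. A back-of-the-envelope check, not necessarily the worker's exact formula:

```python
from datetime import datetime, timedelta

now = datetime(2024, 5, 9, 14, 42, 8)   # timestamp of the status line
revs_done = 1396000                     # "1396000 revs"
max_rev = 2149446292                    # "[max 2149446292]"
rate_all = 54.7                         # revs/sec over the whole run

remaining_seconds = (max_rev - revs_done) / rate_all
eta = now + timedelta(seconds=remaining_seconds)
print(eta)  # lands within an hour of the reported ETA 2025-08-07 03:38:27
```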
I'll see if I can find any more indicators of why we can't obtain this text.
That error message comes from here: https://gerrit.wikimedia.org/g/mediawiki/core/+/master/maintenance/includes/TextPassDumper.php#673 but that doesn't help me immediately.
I could kill and restart the dump processes, but I'm hesitant to do so without @xcollazo having looked at it first.
It's possible that this is some issue related to running the maintenance scripts on bullseye, because I think that this is the first time one of these dumps has been run at scale on bullseye.
It seems that there was maintenance work carried out on db1167 yesterday https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2024-05-09 which could have impacted this dump run.
I will proceed to kill the running dumps.
btullis@snapshot1014:/srv/mediawiki/wmf-config$ sudo -u dumpsgen bash
dumpsgen@snapshot1014:/srv/mediawiki/wmf-config$ cd /srv/deployment/dumps/dumps/xmldumps-backup
dumpsgen@snapshot1014:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki wikidatawiki --dryrun
would kill processes ['2331253', '3932951', '3932952', '3932953', '3932955', '3932960', '3932967', '3932974', '3932977']
dumpsgen@snapshot1014:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki wikidatawiki
Now restarting the dumps:
dumpsgen@snapshot1014:/srv/deployment/dumps/dumps/xmldumps-backup$ screen -r wikidatawiki-20240501
dumpsgen@snapshot1014:/srv/deployment/dumps/dumps/xmldumps-backup$ bash ./worker --date 20240501 --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki
Oh, the lock file still exists:
python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --log --skipdone --exclusive --date 20240501 wikidatawiki
Traceback (most recent call last):
  File "/srv/deployment/dumps/dumps-cache/revs/0d1f9be3610716a30b97df2ca671cc246c62c8f2/xmldumps-backup/./worker.py", line 571, in <module>
    main()
  File "/srv/deployment/dumps/dumps-cache/revs/0d1f9be3610716a30b97df2ca671cc246c62c8f2/xmldumps-backup/./worker.py", line 460, in main
    locker.lock()
  File "/srv/deployment/dumps/dumps-cache/revs/0d1f9be3610716a30b97df2ca671cc246c62c8f2/xmldumps-backup/dumps/wikidump.py", line 715, in lock
    lockf = FileUtils.atomic_create(self.get_lock_file_path(), "w")
  File "/srv/deployment/dumps/dumps-cache/revs/0d1f9be3610716a30b97df2ca671cc246c62c8f2/xmldumps-backup/dumps/fileutils.py", line 31, in atomic_create
    fdesc = os.open(filepath, os.O_EXCL + os.O_CREAT + os.O_WRONLY)
FileExistsError: [Errno 17] File exists: '/mnt/dumpsdata/xmldatadumps/private/wikidatawiki/lock_20240501'
Dump of wiki wikidatawiki failed.
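The lock comes from FileUtils.atomic_create, which leans on O_EXCL: the open fails with FileExistsError if the lock file already exists, so only one worker can own a run. A sketch of the same pattern (the temp-dir demo path is illustrative):

```python
import os
import tempfile

def atomic_create(filepath, mode="w"):
    # O_EXCL + O_CREAT: os.open fails if the file already exists,
    # so whichever process creates the lock file first wins.
    fdesc = os.open(filepath, os.O_EXCL | os.O_CREAT | os.O_WRONLY)
    return os.fdopen(fdesc, mode)

# First creation succeeds; any second attempt on the same path raises.
lock_path = os.path.join(tempfile.mkdtemp(), "lock_20240501")
with atomic_create(lock_path) as lockf:
    lockf.write("snapshot1014.eqiad.wmnet 2331253\n")
```

This is why a crashed or killed worker leaves the date-stamped lock behind: nothing removes the file on an unclean exit, and the next run trips over it.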
Checking the lock file:
cd /mnt/dumpsdata/xmldatadumps/private/wikidatawiki/
dumpsgen@snapshot1014:/mnt/dumpsdata/xmldatadumps/private/wikidatawiki$ cat lock_20240501
snapshot1014.eqiad.wmnet 2331253
Confirmed that this was the process killed earlier by dumpadmin.py --kill and it is no longer running, so I can remove the lock file.
dumpsgen@snapshot1014:/mnt/dumpsdata/xmldatadumps/private/wikidatawiki$ rm lock_20240501
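The stale-lock check performed above (parse host and PID from the lock file, verify the PID is no longer running before rm) can be sketched as follows; this is an illustration of the manual procedure, not code from xmldumps-backup:

```python
import os
import subprocess
import sys

def lock_is_stale(lock_text, this_host):
    """True if the lock was taken on this host by a PID that no longer runs.

    Locks held on another host can't be checked this way, so they are
    conservatively treated as live.
    """
    host, pid = lock_text.split()
    if host != this_host:
        return False
    try:
        os.kill(int(pid), 0)  # signal 0: existence check, sends nothing
    except ProcessLookupError:
        return True           # no such process: safe to remove the lock
    except PermissionError:
        return False          # process exists under another user
    return False

# Demo: a child process that has already exited yields a stale lock
child = subprocess.Popen([sys.executable, "-c", ""])
child.wait()
stale = lock_is_stale(f"localhost {child.pid}", "localhost")
live = lock_is_stale(f"localhost {os.getpid()}", "localhost")
other_host = lock_is_stale("snapshot1014.eqiad.wmnet 2331253", "snapshot1011.eqiad.wmnet")
```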
Now running the job again:
dumpsgen@snapshot1014:/mnt/dumpsdata/xmldatadumps/private/wikidatawiki$ cd /srv/deployment/dumps/dumps/xmldumps-backup
dumpsgen@snapshot1014:/srv/deployment/dumps/dumps/xmldumps-backup$ bash ./worker --date 20240501 --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki
python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --log --skipdone --exclusive --date 20240501 wikidatawiki
Running wikidatawiki...
2024-05-10 13:17:10: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/private/wikidatawiki/20240501 ...
2024-05-10 13:17:10: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20240501 ...
2024-05-10 13:17:10: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20240501 ...
2024-05-10 13:17:10: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/private/wikidatawiki/20240501 ...
2024-05-10 13:17:10: wikidatawiki Cleaning up old dumps for wikidatawiki
2024-05-10 13:17:10: wikidatawiki No old public dumps to purge.
2024-05-10 13:17:10: wikidatawiki No old private dumps to purge.
2024-05-10 13:17:10: wikidatawiki Starting backup of wikidatawiki
2024-05-10 13:17:11: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/latest ...
2024-05-10 13:17:11: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/latest ...
2024-05-10 13:17:11: wikidatawiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/latest/wikidatawiki-latest-site_stats.sql.gz-rss.xml
2024-05-10 13:17:11: wikidatawiki Reading wikidatawiki-20240501-site_stats.sql.gz checksum for md5 from file /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20240501/md5sums-wikidatawiki-20240501-site_stats.sql.gz.txt
2024-05-10 13:17:11: wikidatawiki Reading wikidatawiki-20240501-site_stats.sql.gz checksum for sha1 from file /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20240501/sha1sums-wikidatawiki-20240501-site_stats.sql.gz.txt
Thanks @BTullis for looking into this.
The generic logging that we get from the dumps doesn't help much. I hope this second restart solves the issue.
Interestingly, the errors that I saw after the database went away are the same as those seen in T315902: New error "DB is set and has not been closed by the Load Balancer" for certain bad revisions during page content dumps.
I'm not going to delve into the PHP at the moment, but it indicates that something is wrong with database routing after load-balancer changes made with dbctl.
It also suggests that this isn't a problem specific to bullseye, which is reassuring.
A bunch of recoverable errors were reported on https://groups.google.com/a/wikimedia.org/g/ops-dumps/c/efjbIJHS--Q/m/LekH2tawAgAJ. A sample:
*** Wiki: wikidatawiki
=====================
[20240512132606]: Skipping bad text id 44423a2f2f636c757374657232392f3739303236383533 of revision 2148925565
[20240512132620]: Skipping bad text id 44423a2f2f636c757374657232392f3739303435343333 of revision 2148962571
[20240512133014]: Skipping bad text id 44423a2f2f636c757374657232382f3739303137383638 of revision 2148962836
[20240512133027]: Skipping bad text id 44423a2f2f636c757374657232392f3739303336323538 of revision 2148944452
[20240512133121]: Skipping bad text id 44423a2f2f636c757374657232392f3739303339343635 of revision 2148950677
[20240512133833]: Skipping bad text id 44423a2f2f636c757374657232392f3739303339373430 of revision 2148951238
[20240512134223]: Skipping bad text id 44423a2f2f636c757374657232382f3739303035343633 of revision 2148937955
[20240512134256]: Skipping bad text id 44423a2f2f636c757374657232382f3739303033373434 of revision 2148934567
I have not seen these errors before, and they coincide with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1020948 being merged, which is part of T362566.
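Supporting that theory: the "bad" text ids in the log above are hex strings, and decoding one yields an external-store address of exactly the shape that change puts on the text path. For example:

```python
# Decode one of the skipped text ids from the log sample above
raw = "44423a2f2f636c757374657232392f3739303236383533"
decoded = bytes.fromhex(raw).decode("ascii")
print(decoded)  # DB://cluster29/79026853
```

So these are not corrupt rows at all: they are ES addresses that the dumper's text-id validation does not recognise.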
Perusing the code, the error is coming from:
...
protected function getText( $id, $model = null, $format = null, $expSize = null ) {
    if ( !$this->isValidTextId( $id ) ) {
        $msg = "Skipping bad text id " . $id . " of revision " . $this->thisRev;
        $this->progress( $msg );
        return '';
    }
...
and isValidTextId() is defined as:
private function isValidTextId( $id ) {
    if ( preg_match( '/:/', $id ) ) {
        return $id !== 'tt:0';
    } elseif ( preg_match( '/^\d+$/', $id ) ) {
        return intval( $id ) > 0;
    }
    return false;
}
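Ported to Python for illustration, the check accepts tt:-style addresses and positive numeric ids but rejects everything else, so the hex-encoded ES addresses above fall through to the "skip" branch:

```python
import re

def is_valid_text_id(text_id):
    # Python rendering of TextPassDumper::isValidTextId() above
    if ":" in text_id:
        return text_id != "tt:0"
    if re.fullmatch(r"\d+", text_id):
        return int(text_id) > 0
    return False
```

A hex string like "44423a..." contains no colon and is not purely numeric (it includes a-f digits), so it is rejected even though it decodes to a perfectly good address.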
CC @Ladsgroup
We should get rid of that part and bump the XML version as Daniel suggested. Wanna do it?
I'm afraid I am not versed in MediaWiki development. I just happen to be the guy that inherited Dumps 1.0 via a series of unfortunate events.
I am happy to bump the XML version and get rid of the text id as part of the Dumps 2.0 work, though (Dumps 2.0 is a tech stack of Hadoop/Flink/Spark, which I am very comfortable with).
We have all had such misfortunes! Fixing the issue in MediaWiki is rather easy: we just need to drop that piece of code, bump the version number in several places, update the schema validation, and update the tests. Here is an example: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/464768
I'd be more than happy to review or help getting it done.
Coming back to original issue: wikidatawiki continues to make progress, now generating the pages-meta-history 14's as per https://dumps.wikimedia.org/wikidatawiki/20240501/.
It looks like it will finish successfully.
It is significantly late, but since we are running it on a spare node (snapshot1014.eqiad.wmnet) it should not block the next run on the node typically assigned to run it (snapshot1011.eqiad.wmnet).
Ok, I just looked at snapshot1014.eqiad.wmnet: there is no lock file and I see no load on the node, which explains why https://dumps.wikimedia.org/wikidatawiki/20240501/ shows no recent progress.
@BTullis, perhaps the last run was not started in a screen session?
Anyhow, it is now too late to retry. We will have to live with wikidatawiki not having a full dump for 20240501.