T364250: May 1, 2024 wikidatawiki dump not started

Closed, Resolved · Public

Description

https://dumps.wikimedia.org/backup-index.html shows 2024-04-17 13:16:19 wikidatawiki: Dump complete
https://dumps.wikimedia.org/wikidatawiki/ does not show 20240501
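
For reference, the missing run can be confirmed from the public endpoint, since the per-run directory should only appear once the dump starts; a quick sketch (not part of the original report):

import urllib.error
import urllib.request

url = "https://dumps.wikimedia.org/wikidatawiki/20240501/"
try:
    urllib.request.urlopen(url)
    print("20240501 run directory exists")
except urllib.error.HTTPError as err:
    print(f"20240501 not started (HTTP {err.code})")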

Event Timeline

Thanks for the report. Will investigate.

Confirming that no wikidatawiki dump is running on snapshot1011, the node that would typically run it:

hostname -f
snapshot1011.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki wikidatawiki --dryrun
would kill processes []

It's odd that it never started, but I can't tell why.

Now let's start it on the snapshot1014.eqiad.wmnet host, which has already been upgraded to the latest OS (Bullseye), thus giving T325228 a chance to proceed on snapshot1011.

ssh snapshot1014.eqiad.wmnet

sudo -u dumpsgen bash

cd /srv/deployment/dumps/dumps/xmldumps-backup

screen -S wikidatawiki-20240501

bash ./worker --date 20240501 --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki
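
For reference, the worker wrapper ends up invoking worker.py with the same options, as the expanded command echoed later in this task shows:

python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --log --skipdone --exclusive --date 20240501 wikidatawiki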

For reference, this late wikidata dump has had an impact on some downstream jobs.

The refinery-import-page-current-dumps systemd job on an-launcher1002 emitted WARNING Not writing success flag due to unavailable projects [wikidatawiki] and failed projects [], but still exited with status zero (a sketch of the underlying check follows the log below).

btullis@an-launcher1002:~$ systemctl status refinery-import-page-current-dumps.service 
● refinery-import-page-current-dumps.service - Schedules daily an incremental import of the current month of pages-meta-current xmldumps into HDFS
   Loaded: loaded (/lib/systemd/system/refinery-import-page-current-dumps.service; static; vendor preset: enabled)
   Active: inactive (dead) since Fri 2024-05-10 07:11:26 UTC; 3h 58min ago
     Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
  Process: 32400 ExecStart=/usr/local/bin/systemd-timer-mail-wrapper --subject refinery-import-page-current-dumps --mail-to data-engineering-alerts@lists.wikimedia.org --only-on-error /usr/local/bin/kerberos-run
 Main PID: 32400 (code=exited, status=0/SUCCESS)

May 10 05:00:00 an-launcher1002 systemd[1]: Starting Schedules daily an incremental import of the current month of pages-meta-current xmldumps into HDFS...
May 10 05:00:00 an-launcher1002 kerberos-run-command[32504]: User analytics executes as user analytics the command ['/usr/local/bin/refinery-import-page-current-dumps']
May 10 07:11:26 an-launcher1002 kerberos-run-command[32400]: 2024-05-10T05:00:20 WARNING Dump status not done in json dump-info file /mnt/data/xmldatadumps/public/wikidatawiki/20240501/dumpstatus.json
May 10 07:11:26 an-launcher1002 kerberos-run-command[32400]: 2024-05-10T07:11:26 WARNING Not writing success flag due to unavailable projects [wikidatawiki] and failed projects []
May 10 07:11:26 an-launcher1002 systemd[1]: refinery-import-page-current-dumps.service: Succeeded.
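
That warning boils down to reading the per-run dumpstatus.json and refusing to write the success flag unless the dump reports done. A minimal sketch of that check (assumed JSON schema and paths; the real refinery script differs):

import json

def dump_is_done(info_path):
    # dumpstatus.json carries per-job status for the run; treat the run as
    # done only when every job reports "done" (assumed schema).
    with open(info_path) as fdesc:
        status = json.load(fdesc)
    return all(job.get("status") == "done" for job in status.get("jobs", {}).values())

info = "/mnt/data/xmldatadumps/public/wikidatawiki/20240501/dumpstatus.json"
if not dump_is_done(info):
    print("WARNING Not writing success flag: wikidatawiki not done")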

Subsequently, the Airflow mediawiki_wikitext_current DAG timed out waiting for the _SUCCESS flag.

[Screenshot: Airflow DAG timeout]
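
Conceptually, the sensor just polls for the flag until a timeout; a generic sketch (not the actual DAG code; the flag path is hypothetical):

import time
from pathlib import Path

def wait_for_flag(flag_path, timeout_s=6 * 3600, poke_s=60):
    # Poll for the _SUCCESS marker until the timeout, like an Airflow sensor.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if Path(flag_path).exists():
            return
        time.sleep(poke_s)
    raise TimeoutError(f"gave up waiting for {flag_path}")

# Hypothetical path for the pages-meta-current import.
wait_for_flag("/wmf/data/raw/mediawiki/pages_meta_current/20240501/_SUCCESS")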

dumps.wikimedia.org still says that the wikidata dump is in progress:

[Screenshot: dumps.wikimedia.org progress page]

...but the load on snapshot1014 looks suspiciously low, so I will check that screen session and see whether the dump processes are still running.

I see errors that I don't understand scrolling past from the dumps worker process.

[Screenshot: dumps worker errors]

Generic error getting text for revisions.

I also saw this status message scrolling past:

2024-05-09 14:42:08: wikidatawiki (ID 3932972) 1395999 pages (54.7|1265.5/sec all|curr), 1396000 revs (54.7|0.9/sec all|curr), 92.1%|92.1% prefetched (all|curr), ETA 2025-08-07 03:38:27 [max 2149446292]
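
That ETA is more than a year out, and it is consistent with a straight-line extrapolation over the remaining revision ids at the "all" rate; a quick back-of-the-envelope check (assuming linear extrapolation):

from datetime import datetime, timedelta

# Figures taken from the status line above.
revs_done = 1_396_000
max_rev_id = 2_149_446_292
revs_per_sec_all = 54.7

started = datetime(2024, 5, 9, 14, 42, 8)
remaining = timedelta(seconds=(max_rev_id - revs_done) / revs_per_sec_all)
print(started + remaining)  # ~2025-08-07, matching the reported ETA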

I'll see if I can find any more indicators of why we can't obtain this text.

That error message comes from here: https://gerrit.wikimedia.org/g/mediawiki/core/+/master/maintenance/includes/TextPassDumper.php#673, but that doesn't help me immediately.
I could kill and restart the dump processes, but I'm hesitant to do so without @xcollazo having looked at it first.

It's possible that this issue is related to running the maintenance scripts on Bullseye; I believe this is the first time one of these dumps has been run at scale on Bullseye.

It seems that maintenance work was carried out on db1167 yesterday (https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2024-05-09), which could have impacted this dump run.

I will proceed to kill the running dumps.

btullis@snapshot1014:/srv/mediawiki/wmf-config$ sudo -u dumpsgen bash

dumpsgen@snapshot1014:/srv/mediawiki/wmf-config$  cd /srv/deployment/dumps/dumps/xmldumps-backup

dumpsgen@snapshot1014:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki wikidatawiki --dryrun
would kill processes ['2331253', '3932951', '3932952', '3932953', '3932955', '3932960', '3932967', '3932974', '3932977']

dumpsgen@snapshot1014:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --wiki wikidatawiki

Now restarting the dumps:

dumpsgen@snapshot1014:/srv/deployment/dumps/dumps/xmldumps-backup$ screen -r wikidatawiki-20240501 

dumpsgen@snapshot1014:/srv/deployment/dumps/dumps/xmldumps-backup$ bash ./worker --date 20240501 --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki

Oh, the lock file still exists:

python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --log --skipdone --exclusive --date 20240501 wikidatawiki
Traceback (most recent call last):
  File "/srv/deployment/dumps/dumps-cache/revs/0d1f9be3610716a30b97df2ca671cc246c62c8f2/xmldumps-backup/./worker.py", line 571, in <module>
    main()
  File "/srv/deployment/dumps/dumps-cache/revs/0d1f9be3610716a30b97df2ca671cc246c62c8f2/xmldumps-backup/./worker.py", line 460, in main
    locker.lock()
  File "/srv/deployment/dumps/dumps-cache/revs/0d1f9be3610716a30b97df2ca671cc246c62c8f2/xmldumps-backup/dumps/wikidump.py", line 715, in lock
    lockf = FileUtils.atomic_create(self.get_lock_file_path(), "w")
  File "/srv/deployment/dumps/dumps-cache/revs/0d1f9be3610716a30b97df2ca671cc246c62c8f2/xmldumps-backup/dumps/fileutils.py", line 31, in atomic_create
    fdesc = os.open(filepath, os.O_EXCL + os.O_CREAT + os.O_WRONLY)
FileExistsError: [Errno 17] File exists: '/mnt/dumpsdata/xmldatadumps/private/wikidatawiki/lock_20240501'
Dump of wiki wikidatawiki failed.

Checking the lock file:

cd /mnt/dumpsdata/xmldatadumps/private/wikidatawiki/
dumpsgen@snapshot1014:/mnt/dumpsdata/xmldatadumps/private/wikidatawiki$ cat lock_20240501 
snapshot1014.eqiad.wmnet 2331253

Confirmed that this was the process killed earlier by dumpadmin.py --kill and that it is no longer running, so I can remove the lock file.

dumpsgen@snapshot1014:/mnt/dumpsdata/xmldatadumps/private/wikidatawiki$ rm lock_20240501
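
For the record, that stale-lock check can be scripted; a minimal sketch (hypothetical helper, assuming the "hostname pid" lock format shown above):

import os
import socket

LOCK = "/mnt/dumpsdata/xmldatadumps/private/wikidatawiki/lock_20240501"

def lock_is_stale(lock_path):
    # The lock file holds "hostname pid"; it is stale if it was taken on
    # this host and the holding process no longer exists.
    with open(lock_path) as fdesc:
        host, pid = fdesc.read().split()
    if host != socket.getfqdn():
        return False  # cannot check a lock held by another host from here
    try:
        os.kill(int(pid), 0)  # signal 0 probes for existence only
    except ProcessLookupError:
        return True
    return False

if lock_is_stale(LOCK):
    os.remove(LOCK)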

Now running the job again:

dumpsgen@snapshot1014:/mnt/dumpsdata/xmldatadumps/private/wikidatawiki$ cd /srv/deployment/dumps/dumps/xmldumps-backup
dumpsgen@snapshot1014:/srv/deployment/dumps/dumps/xmldumps-backup$ bash ./worker --date 20240501 --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --wiki wikidatawiki
python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps:wd --log --skipdone --exclusive --date 20240501 wikidatawiki
Running wikidatawiki...
2024-05-10 13:17:10: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/private/wikidatawiki/20240501 ...
2024-05-10 13:17:10: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20240501 ...
2024-05-10 13:17:10: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20240501 ...
2024-05-10 13:17:10: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/private/wikidatawiki/20240501 ...
2024-05-10 13:17:10: wikidatawiki Cleaning up old dumps for wikidatawiki
2024-05-10 13:17:10: wikidatawiki No old public dumps to purge.
2024-05-10 13:17:10: wikidatawiki No old private dumps to purge.
2024-05-10 13:17:10: wikidatawiki Starting backup of wikidatawiki
2024-05-10 13:17:11: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/latest ...
2024-05-10 13:17:11: wikidatawiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/latest ...
2024-05-10 13:17:11: wikidatawiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/latest/wikidatawiki-latest-site_stats.sql.gz-rss.xml 
2024-05-10 13:17:11: wikidatawiki Reading wikidatawiki-20240501-site_stats.sql.gz checksum for md5 from file /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20240501/md5sums-wikidatawiki-20240501-site_stats.sql.gz.txt
2024-05-10 13:17:11: wikidatawiki Reading wikidatawiki-20240501-site_stats.sql.gz checksum for sha1 from file /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20240501/sha1sums-wikidatawiki-20240501-site_stats.sql.gz.txt

Thanks @BTullis for looking into this.

The generic logging that we get from the dumps doesn't help much. I hope this second restart solves the issue.

BTullis triaged this task as High priority. (May 10 2024, 2:39 PM)
BTullis moved this task from Backlog to Active on the Dumps-Generation board.

Interestingly, the errors that I saw after the database went away are the same as those seen in T315902: New error "DB is set and has not been closed by the Load Balancer" for certain bad revisions during page content dumps.
I'm not going to delve into the PHP at the moment, but it indicates that something goes wrong with routing to the database after load-balancer changes are made with dbctl.
It also suggests that this isn't a problem specific to bullseye, which is reassuring.

A bunch of recoverable errors were reported on https://groups.google.com/a/wikimedia.org/g/ops-dumps/c/efjbIJHS--Q/m/LekH2tawAgAJ. A sample:

*** Wiki: wikidatawiki
=====================
[20240512132606]: Skipping bad text id 44423a2f2f636c757374657232392f3739303236383533 of revision 2148925565
[20240512132620]: Skipping bad text id 44423a2f2f636c757374657232392f3739303435343333 of revision 2148962571
[20240512133014]: Skipping bad text id 44423a2f2f636c757374657232382f3739303137383638 of revision 2148962836
[20240512133027]: Skipping bad text id 44423a2f2f636c757374657232392f3739303336323538 of revision 2148944452
[20240512133121]: Skipping bad text id 44423a2f2f636c757374657232392f3739303339343635 of revision 2148950677
[20240512133833]: Skipping bad text id 44423a2f2f636c757374657232392f3739303339373430 of revision 2148951238
[20240512134223]: Skipping bad text id 44423a2f2f636c757374657232382f3739303035343633 of revision 2148937955
[20240512134256]: Skipping bad text id 44423a2f2f636c757374657232382f3739303033373434 of revision 2148934567
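
Those text ids appear to be hex-encoded external-store addresses; decoding the first one (a quick check, not part of the original report):

print(bytes.fromhex("44423a2f2f636c757374657232392f3739303236383533").decode())
# -> DB://cluster29/79026853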

I have not seen these errors before, and they coincide with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1020948 being merged, which is part of T362566.

Perusing the code, the message is coming from:

...
	protected function getText( $id, $model = null, $format = null, $expSize = null ) {
		if ( !$this->isValidTextId( $id ) ) {
			$msg = "Skipping bad text id " . $id . " of revision " . $this->thisRev;
			$this->progress( $msg );
			return '';
		}
...

and isValidTextId() is defined as:

	private function isValidTextId( $id ) {
		if ( preg_match( '/:/', $id ) ) {
			return $id !== 'tt:0';
		} elseif ( preg_match( '/^\d+$/', $id ) ) {
			return intval( $id ) > 0;
		}

		return false;
	}
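
Transcribing that check into Python makes the failure mode clear: the raw hex id contains no colon and is not purely numeric, so it is rejected, while its decoded address form would pass (a sketch for illustration only):

import re

def is_valid_text_id(text_id):
    # Python transcription of the isValidTextId() shown above.
    if ":" in text_id:
        return text_id != "tt:0"
    if re.fullmatch(r"\d+", text_id):
        return int(text_id) > 0
    return False

print(is_valid_text_id("44423a2f2f636c757374657232392f3739303236383533"))  # False -> "Skipping bad text id"
print(is_valid_text_id("DB://cluster29/79026853"))                          # True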

CC @Ladsgroup

We should get rid of that part and bump the XML version as Daniel suggested. Wanna do it?

> We should get rid of that part and bump the XML version as Daniel suggested. Wanna do it?

I'm afraid I am not versed in MediaWiki development. I just happen to be the guy that inherited Dumps 1.0 via a series of unfortunate events.

I am happy to bump the XML version and get rid of the text id as part of the Dumps 2.0 work, though (Dumps 2.0 is a tech stack of Hadoop/Flink/Spark, which I am very comfortable with).

We have all had such misfortunes! Fixing the issue in MediaWiki is rather easy: we just drop that piece of code, bump the version number in several places, update the schema validation, and update the tests. Here is an example: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/464768

I'd be more than happy to review or help getting it done.

> We have all had such misfortunes! Fixing the issue in MediaWiki is rather easy: we just drop that piece of code, bump the version number in several places, update the schema validation, and update the tests. Here is an example: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/464768
>
> I'd be more than happy to review or help getting it done.

Thanks for the offer. Moved this work to T365155.

Coming back to the original issue: wikidatawiki continues to make progress and is now generating the pages-meta-history part-14 files, as per https://dumps.wikimedia.org/wikidatawiki/20240501/.

It looks like it will finish successfully.

It is significantly late, but since we are running it on a spare node (snapshot1014.eqiad.wmnet), it should not block the next run on the node typically assigned to it (snapshot1011.eqiad.wmnet).

OK, I just looked at snapshot1014.eqiad.wmnet: there is no lock file and I see no load on the node, which explains why https://dumps.wikimedia.org/wikidatawiki/20240501/ shows no recent progress.

@BTullis, perhaps the last run was not started inside a screen session?

Anyhow, it is now too late to retry. We will have to live with wikidatawiki not having a full dump for 20240501.