⚓ T327371 Fix Armenian sentence tokenization bug in the link recommendation algorithm

Closed, Resolved · Public

Description

The Armenian Wikipedia (hywiki) training pipeline got stuck at the step where it generates backtesting data as shown in the screenshot below:

[Screenshot: Armenian Wikipedia (hywiki) pipeline stuck at backtesting data generation (Screenshot from 2023-01-16 17-56-15.png, 288 KB)]

I decided to let it continue running for over 10 hours, as I wasn't sure whether it was stuck because of a bug or because of the amount of data it was processing.

I consulted @MGerlach about whether he had ever faced this issue, and he said:

In another project, I recently came across the fact that for Armenian Wikipedia the standard sentence-processing pipeline didn't work. Looking at some articles in hywiki, I quickly realized that Armenian uses "։" as a sentence marker (which is not the same character as a colon). It thus makes sense that the pipeline gets stuck when generating the backtesting data, where we extract individual sentences with existing links. In this case the extracted sentences will be way too long, so it probably gets hung up somewhere.
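The character distinction is easy to miss: the Armenian full stop "։" (U+0589) looks like a colon but is a different code point. A minimal illustration of the failure mode, using a regex-based splitter (this is a sketch, not the mwaddlink tokenizer):

```python
import re

# Two short Armenian sentences, each ending with the Armenian full
# stop "։" (U+0589), which is visually similar to the colon ":" (U+003A).
text = "Առաջին նախադասություն։ Երկրորդ նախադասություն։"

# A naive splitter that only knows Western sentence terminators
# treats the whole text as a single, overlong sentence:
naive = re.split(r"(?<=[.!?])\s+", text)

# Adding "։" to the terminator class splits the sentences correctly:
armenian_aware = re.split(r"(?<=[.!?։])\s+", text)
```

With the naive pattern the entire article body stays in one "sentence", which matches the symptom of overlong sentences hanging the backtesting step.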

The goal is to handle the Armenian sentence-terminating symbol "։" so that the link recommendation algorithm can run sentence tokenization successfully.

Event Timeline

kostajh subscribed.

@kevinbazira I see this is "Watching" on your team's board; is this something that @MGerlach and Research might work on?

In short: I resolved the issue by upgrading wikitextparser to version 0.51.1 (I previously used 0.45.1).
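One way to capture this fix is to pin the minimum dependency version in the project's requirements; the file name and exact pin below are illustrative, with the version numbers taken from this comment:

```
# requirements.txt (illustrative pin): require at least the release
# that fixed the table-parsing hangs
wikitextparser>=0.51.1
```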

I could reproduce the error for hywiki, specifically for two articles (Օլիմպիական_երդում and Ազգային_օլիմպիական_կոմիտե). I traced the error to wikitextparser's parse function, which we use here in our code. These seemed to be articles with large tables, and I found reported issues of wikitextparser hanging on tables (example). Those were subsequently fixed in newer releases of wikitextparser. After updating wikitextparser I could successfully run the script for hywiki.
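As a side note, a pipeline can defend itself against a parser hanging on pathological input by bounding the parse call with a wall-clock timeout. This is a generic defensive sketch, not code from mwaddlink; `parse_fn` stands in for whatever parsing call may hang, and the pattern is Unix-only since it relies on SIGALRM:

```python
import signal


class ParseTimeout(Exception):
    """Raised when parsing exceeds the allowed wall-clock time."""


def _on_alarm(signum, frame):
    raise ParseTimeout()


def parse_with_timeout(parse_fn, text, seconds=30):
    """Run parse_fn(text), returning None instead of hanging forever.

    Unix-only: arms SIGALRM before the call and cancels it afterwards,
    so a parse that exceeds `seconds` is aborted rather than stalling
    the whole pipeline.
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return parse_fn(text)
    except ParseTimeout:
        return None  # caller can log and skip the problematic page
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)
```

Skipping (and logging) the rare page that times out would have turned this silent hang into an actionable error message.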

Recommended steps:

@MGerlach, thank you for the recommendations. I have tested the fix locally and the hywiki training pipeline completed successfully.

Change 890351 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[research/mwaddlink@main] Fix Armenian sentence tokenization bug

https://gerrit.wikimedia.org/r/890351

Change 890351 merged by jenkins-bot:

[research/mwaddlink@main] Fix Armenian sentence tokenization bug

https://gerrit.wikimedia.org/r/890351

The Armenian sentence tokenization bug has been fixed in T327371#8631149.

hywiki has been added to the wikis that will be deployed in the 11th round (T308136).