⚓ T327371 Fix Armenian sentence tokenization bug in the link recommendation algorithm

Closed, Resolved · Public

Description

The Armenian Wikipedia (hywiki) training pipeline got stuck at the step where it generates backtesting data as shown in the screenshot below:

[Screenshot: Armenian Wikipedia (hywiki) pipeline stuck at backtesting data generation (Screenshot from 2023-01-16 17-56-15.png, 288 KB)]

I decided to let it continue running for over 10 hours, as I wasn't sure whether it was stuck because of a bug or because of the amount of data it was processing.

I consulted @MGerlach about whether he had ever faced this issue, and he said:

In another project, I recently came across the fact that for Armenian Wikipedia the standard sentence-processing pipeline didn't work. Looking at some articles in hywiki, I quickly realized that Armenian uses "։" as a sentence marker (which is not the same character as a colon). It thus makes sense that the pipeline gets stuck when generating the backtesting data, where we extract individual sentences with existing links. In this case the extracted sentences will be way too long, so it probably gets hung up somewhere.
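The character distinction is easy to miss: the Armenian full stop "։" (U+0589) looks like a colon but is a different code point. A minimal illustration of the failure mode, using a regex-based splitter (this is a sketch, not the mwaddlink tokenizer):

```python
import re

# Two short Armenian sentences, each ending with the Armenian full
# stop "։" (U+0589), which is visually similar to the colon ":" (U+003A).
text = "Առաջին նախադասություն։ Երկրորդ նախադասություն։"

# A naive splitter that only knows Western sentence terminators
# treats the whole text as a single, overlong sentence:
naive = re.split(r"(?<=[.!?])\s+", text)

# Adding "։" to the terminator class splits the sentences correctly:
armenian_aware = re.split(r"(?<=[.!?։])\s+", text)
```

With the naive pattern the entire article body stays in one "sentence", which matches the symptom of overlong sentences hanging the backtesting step.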

The goal is to handle the Armenian sentence-terminating symbol "։" so that the link recommendation algorithm can run sentence tokenization successfully.

Event Timeline

kostajh subscribed.

@kevinbazira I see this is "Watching" on your team's board; is this something that @MGerlach and Research might work on?

In short: I resolved the issue by upgrading wikitextparser to version 0.51.1 (I previously used 0.45.1).
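One way to capture this fix is to pin the minimum dependency version in the project's requirements; the file name and exact pin below are illustrative, with the version numbers taken from this comment:

```
# requirements.txt (illustrative pin): require at least the release
# that fixed the table-parsing hangs
wikitextparser>=0.51.1
```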

I could reproduce the error for hywiki, specifically for two articles (Օլիմպիական_երդում and Ազգային_օլիմպիական_կոմիտե). I traced the error to wikitextparser's parse function, which we use here in our code. These seemed to be articles with large tables, and I found reported issues of wikitextparser hanging on tables (example). Those were subsequently fixed in newer releases of wikitextparser. After updating wikitextparser I could successfully run the script for hywiki.
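As a side note, a pipeline can defend itself against a parser hanging on pathological input by bounding the parse call with a wall-clock timeout. This is a generic defensive sketch, not code from mwaddlink; `parse_fn` stands in for whatever parsing call may hang, and the pattern is Unix-only since it relies on SIGALRM:

```python
import signal


class ParseTimeout(Exception):
    """Raised when parsing exceeds the allowed wall-clock time."""


def _on_alarm(signum, frame):
    raise ParseTimeout()


def parse_with_timeout(parse_fn, text, seconds=30):
    """Run parse_fn(text), returning None instead of hanging forever.

    Unix-only: arms SIGALRM before the call and cancels it afterwards,
    so a parse that exceeds `seconds` is aborted rather than stalling
    the whole pipeline.
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return parse_fn(text)
    except ParseTimeout:
        return None  # caller can log and skip the problematic page
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)
```

Skipping (and logging) the rare page that times out would have turned this silent hang into an actionable error message.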

Recommended steps:

@MGerlach, thank you for the recommendations. I have tested the fix locally and the hywiki training pipeline completed successfully.

Change 890351 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[research/mwaddlink@main] Fix Armenian sentence tokenization bug

https://gerrit.wikimedia.org/r/890351

Change 890351 merged by jenkins-bot:

[research/mwaddlink@main] Fix Armenian sentence tokenization bug

https://gerrit.wikimedia.org/r/890351

The Armenian sentence tokenization bug has been fixed in T327371#8631149.

hywiki has been added to the wikis that will be deployed in the 11th round (T308136).