Belarusian NLP and Speech Processing resources

This repository contains links to Belarusian Natural Language and Speech Processing resources and datasets.

It is inspired by a similar project for Ukrainian speech processing resources: egorsmkv/speech-recognition-uk

TODOs:

add detailed descriptions to each list item
evaluate models on benchmarks and log their performance

🎙 Speech-to-Text

🎙💡 Implementations

wav2vec2 trained on Common Voice 8 + kenlm language model trained on Common Voice 8:
- Model: ales/wav2vec2-cv-be
- Demo: ales/wav2vec2-cv-be-lm
- Code: navalnica/wav2vec2-belarusian
whisper:
- original openai/whisper models
- Whisper models fine-tuned on Belarusian Common Voice 11 dataset:
  - Whisper Small:
    - Model: ales/whisper-small-belarusian
    - test WER on CommonVoice11: 6.79
    - Demo: ales/whisper-small-belarusian-demo
    - Code: navalnica/whisper-finetuning-be
  - Whisper Base:
    - Model: ales/whisper-base-belarusian
    - Code: navalnica/whisper-finetuning-be
Nvidia NeMo models:
- nvidia/stt_be_conformer_ctc_large
  - [huggingface self-reported metric] test WER on CommonVoice10: 4.8
- nvidia/stt_be_conformer_transducer_large
  - [huggingface self-reported metric] test WER on CommonVoice10: 3.8
- nvidia/stt_be_fastconformer_hybrid_large_pc
  - [huggingface self-reported metric] test WER on CommonVoice12: 2.72
  - [huggingface self-reported metric] test WER P&C CommonVoice12: 3.87
ESPnet:
- espnet/belarusian_commonvoice_blstm

🎙📊 Benchmarks

Model comparisons grouped by dataset. TODO
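
Until benchmark results are filled in, the WER (word error rate) figures quoted for the models above can be read with the following sketch in mind. It computes WER as the word-level edit distance between a reference transcript and a model hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and scaling the result to a percentage is an assumption made to match how the figures above appear to be reported. Text normalization (casing, punctuation) is left out and would affect real scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    # Assumption: report as a percentage of reference words.
    return 100.0 * prev[-1] / len(ref)
```

For example, a four-word reference with one substituted word gives `word_error_rate("a b c d", "a x c d") == 25.0`.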

🎙📚 Datasets

Common Voice. Speech recognition dataset
Dataset from knihi.com. TODO: what is the type of dataset?
google/fleurs
ssrlab: TODO. Speech recognition dataset

📢 Text-to-Speech

📢💡 Implementations

CoquiAI implementations
- jhlfrfufyfn/bel-tts. GlowTTS + HifiGan
  - Code
  - Model
  - Demo on HuggingFace
  - Demo on a custom web-page. The source code for the demo page: here
- alex73/belarusian-tts. CoquiAI implementation by Yurii Paniv (@robinhad).
  The original repo and models were deleted; only a fork is available now

📝 NLP

POS-tagging

KoichiYasuoka/roberta-small-belarusian-upos
stanfordnlp/stanza-be
poritski/YABC_Tagger. Rule-based POS-tagger and lemmatizer.
Written in Perl. Uses poritski/YABC as a Grammar base (?)
volchek/beltagger. An improved version of the poritski/YABC_Tagger rule-based POS-tagger and lemmatizer.
Cross-platform, written in C++.
Known issues:
- requires input data to be encoded in Windows-1251; does not support UTF-8
- tagset is not fully compatible with BNKorpus's tagset and grammar base
- the grammar base used is incomplete. Belarus/GrammarDB is a better source of paradigms but is not incorporated yet
- suffix table calculation script is not ported from Perl to C++
- code uses the Boost library
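
The Windows-1251 requirement above can be worked around by transcoding text before and after tagging. The following is a minimal sketch of that transcoding step only. How beltagger itself reads its input is not described here, and the helper names `to_cp1251` and `from_cp1251` are hypothetical. Windows-1251 covers the Belarusian Cyrillic alphabet (including і and ў), but not every Unicode character, so the sketch maps the modifier-letter apostrophe (U+02BC) to a plain ASCII apostrophe as an assumed normalization.

```python
def to_cp1251(text: str) -> bytes:
    """Encode text as Windows-1251 bytes for a tool that cannot read UTF-8."""
    # Assumption: U+02BC (modifier letter apostrophe) is not representable
    # in Windows-1251, so replace it with the ASCII apostrophe.
    normalized = text.replace("\u02bc", "'")
    # Raises UnicodeEncodeError if any other character is unsupported.
    return normalized.encode("cp1251")


def from_cp1251(data: bytes) -> str:
    """Decode Windows-1251 bytes (e.g. tagger output) back into a str."""
    return data.decode("cp1251")
```

A round trip such as `from_cp1251(to_cp1251("Прывітанне, свет"))` returns the original string. Characters outside Windows-1251 raise `UnicodeEncodeError` rather than being silently dropped, which makes unsupported input easy to spot.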

Other

pkasila/bel-sklony - web page for Belarusian noun declension. Demo: sklony.pkasila.net

Masked Language Modeling

KoichiYasuoka/roberta-small-belarusian

📝📚 Datasets

oscar
mc4
poritski/YABC - Experimental Corpus of the Belarusian Language (Эксперыментальны корпус беларускай мовы, ЭКБМ)
Belarus/GrammarDB - Grammar Database of the Belarusian language
tsimafeip/Translator - Dataset with Russian-Belarusian translation pairs
Universal Dependencies dataset:
- Page
- GitHub Repository
Tatoeba Belarusian sentences

🧍‍♀️🧍 Communities and platforms:

corpus.by
ssrlab.by
bnkorpus.info
Belarus organization on GitHub
nlproc.by community on GitHub

🦔 Unsorted

nothing for now

navalnica/be_nlp_speech_resources
