Wav2Vec2 - Wolfram Neural Net Repository

Wav2Vec2 Trained on LibriSpeech Data

Transcribe an English audio recording

This family of models was trained with self-supervised learning to learn powerful representations from speech audio alone, followed by fine-tuning on transcribed speech. At training time, Wav2Vec2 encodes raw speech audio into latent speech representations via a multilayer convolutional neural network. Parts of these latent representations are then masked and fed to a transformer network that outputs contextualized representations. The entire model is trained on a contrastive task in which the output at each masked time step is penalized for being distant from the true representation at that step. At the time of publication, Wav2Vec2 achieved state-of-the-art performance on the full LibriSpeech benchmark for noisy speech, while for the clean 100-hour LibriSpeech setup it outperformed the previous best result while using 100 times less labeled data.
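
As a sketch of the contrastive task described above, the masked-step loss can be written in the form used in the original wav2vec 2.0 paper. The symbols are taken from that paper and do not appear on this page: c_t is the transformer output at masked time step t, q_t is the true target at that step, Q_t is a candidate set containing q_t together with distractors sampled from other masked steps, and κ is a temperature. The paper also quantizes the targets and adds a codebook diversity term, both omitted here.

```latex
\mathcal{L}_m = -\log
  \frac{\exp\!\big(\mathrm{sim}(c_t, q_t)/\kappa\big)}
       {\sum_{\tilde{q} \in Q_t} \exp\!\big(\mathrm{sim}(c_t, \tilde{q})/\kappa\big)},
\qquad
\mathrm{sim}(a, b) = \frac{a^{\top} b}{\lVert a \rVert \, \lVert b \rVert}
```

The loss is small when c_t is closer (in cosine similarity) to the true target than to the distractors, which is the "penalized for being distant" behavior described in the text.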

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["Wav2Vec2 Trained on LibriSpeech Data"]
Out[2]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter. Inspect the available parameters:

In[3]:=
NetModel["Wav2Vec2 Trained on LibriSpeech Data", "ParametersInformation"]
Out[4]=

Pick a non-default net by specifying the parameters:

In[5]:=
NetModel[{"Wav2Vec2 Trained on LibriSpeech Data", "Size" -> "Large"}]
Out[6]=

Pick a non-default uninitialized net:

In[7]:=
NetModel[{"Wav2Vec2 Trained on LibriSpeech Data", "Size" -> "Large"}, "UninitializedEvaluationNet"]
Out[8]=

Evaluation function

Define an evaluation function that runs the net and produces the final transcribed text:

In[9]:=
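netevaluate[audio_] := Module[{chars},
  (* run the net; the result is a list of characters in which "|" separates words *)
  chars = NetModel["Wav2Vec2 Trained on LibriSpeech Data"][audio];
  (* join the characters and turn the word separators into spaces *)
  StringReplace[StringJoin[chars], "|" -> " "]
]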
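
The post-processing step of the evaluation function can be tried in isolation, without downloading the net. The following is a minimal sketch; the character list is written by hand for illustration and is not actual output of the model.

```wolfram
(* Hypothetical character sequence, written by hand for illustration;
   not actual output of the net *)
chars = {"H", "E", "L", "L", "O", "|", "W", "O", "R", "L", "D"};

(* Same post-processing as in netevaluate: join, then map "|" to a space *)
StringReplace[StringJoin[chars], "|" -> " "]
(* "HELLO WORLD" *)
```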

Basic usage

Record an audio sample and transcribe it:

In[10]:=
record = AudioCapture[]
Out[11]=
In[12]:=
netevaluate[record]
Out[12]=

Try it on different audio samples. Note that the output can contain spelling mistakes, especially with noisy audio, so a spellchecker is usually needed as a post-processing step:

In[13]:=
AssociationMap[netevaluate]@
 Map[ExampleData[{"Audio", #}] &, {"FemaleVoice", "MaleVoice", "NoisyTalk"}]
Out[13]=
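
One possible shape for the spellchecking step mentioned above is sketched below. It is an illustration only, not part of this resource: the helper names fixWord and spellfix are hypothetical, and the word-by-word strategy (keep dictionary words, otherwise take the first suggestion from the built-in SpellingCorrectionList) is a simple assumption that ignores context and may miscorrect proper nouns or contractions.

```wolfram
(* Hypothetical helper, not part of the resource: keep a word found in the
   built-in dictionary, otherwise take the first suggested correction
   (or leave the word unchanged when there is no suggestion) *)
fixWord[w_String] :=
  If[DictionaryWordQ[w], w, First[SpellingCorrectionList[w], w]]

(* Apply the helper word by word; the text is lowercased first so that
   the lookup does not depend on the letter case of the net output *)
spellfix[text_String] :=
  StringRiffle[fixWord /@ StringSplit[ToLowerCase[text]]]

(* Usage with the evaluation function defined earlier *)
spellfix[netevaluate[ExampleData[{"Audio", "NoisyTalk"}]]]
```

A language-model-based corrector would likely do better on noisy audio; the dictionary lookup here is only the simplest starting point.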

Requirements

Wolfram Language 13.2 (December 2022) or above

Resource History

Reference