Self-supervised learning to combat speech recognition errors



Speech recognition systems struggle to understand African American Vernacular English (AAVE). In a 2020 study by researchers at Stanford University, the software performed so poorly on AAVE that some leading systems correctly transcribed barely half of the words spoken.

The researchers speculated that the systems had a common flaw: “Insufficient audio from Black speakers when training the models”.

A startup called Speechmatics has developed a technique that appears to narrow this data gap.

The company announced last week that its software had “an overall accuracy of 82.8% for African American voices” based on data sets used in the Stanford study. In comparison, the systems developed by Google and Amazon each recorded an accuracy of only 68.6%.
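For readers wondering how figures like these are typically derived: speech recognition accuracy is commonly reported as one minus the word error rate (WER), the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the human reference, divided by the length of the reference. Whether Speechmatics computed its percentages exactly this way is an assumption; the function names below are illustrative and the snippet is only a minimal sketch of the standard metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER (assumed definition; can dip below 0
    when the transcript contains many inserted words)."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

Under this definition, a transcript that gets one word wrong in a six-word sentence has a WER of one sixth, or an accuracy of roughly 83%.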

Speechmatics attributed much of its performance to a technique known as self-supervised learning.

Training school

The advantage of self-supervised models is that not all of their training data has to be labeled by humans. This allows AI systems to learn from a much larger pool of information.

Speechmatics used the approach to increase its training data from around 30,000 hours of audio to around 1.1 million hours.
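To make the idea of learning without human labels concrete, here is a minimal sketch of one generic self-supervised setup, masked prediction: parts of an unlabeled input are hidden, and the hidden values themselves become the training targets. The article does not describe Speechmatics' actual method, so the function name, the parameters, and the use of plain numbers to stand in for audio frames are all assumptions made for illustration.

```python
import random


def make_masked_example(frames, mask_prob=0.15, mask_value=0.0, seed=None):
    """Build one (inputs, targets) training pair from unlabeled frames.

    Hypothetical helper: no human transcript is involved. The targets
    are simply the original values of the frames that were hidden.
    """
    rng = random.Random(seed)
    inputs = list(frames)
    targets = {}  # frame index -> original value the model must predict
    for i, frame in enumerate(frames):
        if rng.random() < mask_prob:
            targets[i] = frame
            inputs[i] = mask_value  # hide the frame from the model
    return inputs, targets


# Stand-in for a short stretch of unlabeled audio features.
unlabeled_audio = [0.12, 0.40, 0.33, 0.91, 0.58, 0.07]
inputs, targets = make_masked_example(unlabeled_audio, mask_prob=0.5, seed=0)
```

A model trained to fill in the hidden frames gets its supervision from the audio itself, which is why such methods can draw on far more data than a human-transcribed corpus provides.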

Will Williams, the company’s VP of Machine Learning, told TNW that the approach improves the software’s performance on a variety of speech patterns:

We want to develop scalable methods with which we can attack a wide range of accents at the same time.

Learn like a child

One benefit of the technique was narrowing Speechmatics’ gap in understanding speakers of different ages.

On data from the open-source project Common Voice, the software had a 92% accuracy rate for children’s voices. By comparison, the Google system had an accuracy of 83.4%.

Williams said that improving recognition of children’s voices was never a specific goal:

We train on millions of hours of audio, and just like a child learns, we expose our learning systems to all that online audio … In those millions of hours, children’s voices will be heard, so they will learn how to deal with them, but without them being labeled.

That doesn’t mean that self-supervised learning alone can remove AI biases. Allison Koenecke, the lead author of the Stanford study, noted that other issues also need to be addressed:

We also firmly believe that achieving fair results is as much a “people problem” as a “data problem”. That said, we hope that ASR [automatic speech recognition] developers themselves understand the need to be fully inclusive.

Still, Speechmatics’ performance suggests that self-supervised learning can at least mitigate dataset biases.


