It is well known that people hear speech not only by listening with their ears, but also by picking up cues from the mouth movements of the speakers.
Similarly, combining visual observation with audio could plausibly help a computer analyze human speech better. Computer programs can already read lips, in a sense, although developing them is a laborious task.
Recent work from Meta, the parent company of Facebook, Instagram, and WhatsApp, suggests a more efficient path to a day when computers can read lips as well as HAL 9000 did when Dr. David Bowman and Dr. Frank Poole tried to evade its audio sensors inside the capsule in the movie 2001: A Space Odyssey.
Scientists at Meta's artificial intelligence group published a research report last Friday in which they dramatically reduced the effort required to build software that parses words from the lip movements of speakers in recorded video. The work was also able to use lip-reading technology to meaningfully improve speech recognition in noisy environments.
The program is “75 percent more accurate than the best audio-visual speech recognition systems (which use both sound and images of the speaker to understand what the person is saying),” say the authors.
Of course, there’s a metaverse angle here: not only could the program be used for instant translation, it could one day “help create realistic lip movements in virtual reality avatars to give a real sense of presence, that feeling of being there with someone, even when they are on the other side of the world.”
The work represents progress along two lines. One is self-supervised learning, which avoids explicit cues such as text transcripts and instead has the program spontaneously divine structure in the data. The other is so-called multimodal neural networks, which combine data of different kinds in such a way that they reinforce one another.
The result, called AV-HuBERT, where the “AV” stands for audio-visual and the “Hu” stands for “hidden unit”, combines acoustic and visual signals in order to recognize words from lip movements.
Lead author Bowen Shi and colleagues Wei-Ning Hsu, Kushal Lakhotia and Abdelrahman Mohamed of Facebook posted their paper, “Learning Audio-Visual Speech Representation By Masked Multimodal Cluster Prediction,” on the arXiv preprint server last Friday. The authors also wrote a blog post that you may find easier to digest.
As Shi and company explain, earlier work was also multimodal, combining visual data (video frames) with audio data (waveform snippets) to train a neural network to predict how the two fit together.
But such programs tended to rely on some additional, prepared cue, such as transcriptions of the speakers’ videos into text sentences that then serve as labels. The new work takes the self-supervised route and pieces together patterns spontaneously, without any external structure.
“It is the first system to jointly model speech and lip movements from unlabeled data, raw video that has not yet been transcribed,” the authors write in their blog post.
“Many previous models rely on word-level annotated lip-reading videos for training, which are costly to collect because they require word boundary information. In contrast to these models, our models are pre-trained entirely from scratch with the proposed approach.”
The AV-HuBERT program they devised builds on an audio-only program called HuBERT, which Hsu and colleagues introduced last year. As the name suggests, HuBERT uses the bidirectional Transformer neural network approach, known as BERT, that was developed at Google in 2018.
By “masking” parts of an audio recording, that is, by leaving out sections of the audio waveform, the HuBERT neural network was forced during its training phase to reconstruct which pieces of audio go with one another.
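The masking step can be illustrated with a minimal sketch. It assumes a recording has already been reduced to a list of per-frame values; the function name, the span length, and the masking probability are illustrative assumptions, not settings taken from the HuBERT paper.

```python
import random

def mask_spans(frames, span_len=3, mask_prob=0.3, mask_value=0.0, seed=0):
    """Hide random contiguous spans of frames.

    Returns (masked_frames, mask), where mask[i] is True for hidden frames.
    All parameter values here are hypothetical defaults.
    """
    rng = random.Random(seed)
    n = len(frames)
    mask = [False] * n
    for start in range(n):
        # Each position may start a hidden span of up to span_len frames.
        if rng.random() < mask_prob / span_len:
            for i in range(start, min(start + span_len, n)):
                mask[i] = True
    # Hidden frames are replaced by a placeholder; the rest pass through.
    masked = [mask_value if m else x for x, m in zip(frames, mask)]
    return masked, mask

frames = [float(i + 1) for i in range(20)]
masked_frames, mask = mask_spans(frames)
```

During training, a network would see only `masked_frames` and be asked to predict something about the positions where `mask` is True.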
In AV-HuBERT, Shi and team now “fuse” bits of audio with frames from videos of people speaking. The training phase of the neural network essentially proceeds in two stages. First, as in the original audio-only HuBERT, they use the attention approach to mask the audio and then group the audio waveforms into clusters, groups of examples that are in some way close to one another in their attributes.
These groupings then become the target for the second stage of the neural network. The multimodal part of AV-HuBERT simultaneously masks both the images of the speaker’s lips and the audio waveform, and then tries to match them to the clusters formed in the first stage. In this way, the program works out which lip configurations correspond to which audio waveforms, and thus “learns” the correlation between mouth movement and audio output.
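The two stages can be sketched in a few lines, under heavy simplification. The sketch assumes each audio frame is a single number, uses a tiny hand-written k-means as the clustering step, and scores predictions on masked frames only; the function names, the toy data, and the hard-coded "model" probabilities are all hypothetical and stand in for a real neural network.

```python
import math

def kmeans_1d(values, k=2, iters=10):
    """Tiny 1-D k-means: returns (centroids, cluster id per value)."""
    lo, hi = min(values), max(values)
    # Spread the initial centroids evenly across the value range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign every value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels

def masked_prediction_loss(probs, targets, mask):
    """Average cross-entropy over masked frames (assumes at least one)."""
    losses = [-math.log(p[t]) for p, t, m in zip(probs, targets, mask) if m]
    return sum(losses) / len(losses)

# Stage 1: cluster audio features; the cluster ids become training targets.
audio_features = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]
_, targets = kmeans_1d(audio_features, k=2)

# Stage 2: score a (hypothetical) model's cluster guesses on hidden frames.
mask = [False, True, False, False, True, False]
probs = [[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3
loss = masked_prediction_loss(probs, targets, mask)
```

The point of the design is that no transcript is involved: the targets come from the data itself, and the loss rewards the model for guessing the right cluster at positions it could not see.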
This is effectively a self-supervised approach that guesses structures without explicit cues.
The fusion means that the attention paid to image frames and the attention paid to audio waveforms reinforce each other, producing better clusters than either would alone. These clusters become the “target” of subsequent tasks such as lip reading and speech recognition.
As the authors explain:
AV-HuBERT simultaneously captures linguistic and phonetic information for unmasked regions from both the lip-movement and audio streams in its latent representations, and then encodes their long-range temporal relationships in order to solve the masked-prediction task.
Once AV-HuBERT has been self-trained in this way, the authors fine-tune it by introducing actual labeled video, hours of it, with formal transcripts that tell the machine where the words are in the video.
The main dataset used to train and test the AV-HuBERT program is LRS3, developed in 2018 by Triantafyllos Afouras and colleagues at Oxford, which is “the largest publicly available sentence-level lip reading dataset to date,” consisting of video extracted from TED and TEDx talks in English on YouTube.
As a result of its self-supervised training, AV-HuBERT can predict the words from videos of speakers better than all previous attempts, write Shi and company.
More important than the raw score, however, is the enormous reduction in the amount of data required to train the program.
“AV-HuBERT achieves state-of-the-art results with 433 hours of text transcriptions, two orders of magnitude less than the 31,000 hours of labeled data used in the best approach to date,” they write.
Because far less data is required, it becomes possible to perform lip reading for languages that have much less data than others, so-called low-resource languages. (Think of languages other than English, French, and German, for example.)
The authors state that “as future work, AV-HuBERT can be applied to multilingual lip reading in low-resource languages,” and that the same approach can be extended to other applications of visual speech representation.
Shi and colleagues supplemented their results with a second paper, published last week, that describes the use of AV-HuBERT for automatic speech recognition. The focus here is on how speech can be parsed better in the presence of noise.
Speech recognition “deployed in meeting scenarios is subject to babble noise, while one used in a home environment naturally encounters music, cooking or vacuum cleaner noises.” Their question is whether AV-HuBERT can overcome such ambient noise.
During training, Shi and team mix noise clips in with AV-HuBERT’s video frames and audio waveform samples. The result, they write, is that the program gets good at working around the babble. So much so that AV-HuBERT achieves a 50% reduction in the word error rate, or WER, the proportion of wrong words, compared to previous speech recognition systems.
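Mixing noise into a training clip can be sketched as follows. The article does not say how loud the noise is made relative to the speech, so scaling to a target signal-to-noise ratio in decibels is an assumption here, as are the function name and the toy sample values.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled to hit a target SNR in dB.

    Assumes both clips are equal-length lists of samples and that the
    noise clip is not silent. The SNR-based scaling is an assumption.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Choose scale so that p_speech / (scale^2 * p_noise) = 10^(snr_db / 10).
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(speech, noise, snr_db=0)  # noise as loud as the speech
```

A lower `snr_db` makes the noise louder relative to the speech, which is the regime where the lip images would be expected to help most.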
“Our future work will include applying audiovisual speech recognition to real, low-resource and multilingual environments,” they write.
How realistic is something like HAL 9000’s lip reading? The notion that AI is now better than humans at lip reading has been written about over the past few years in connection with earlier AI work. At 26.9%, the word error rate of AV-HuBERT’s best result is in fact far better than that of professional human lip readers. Apparently, the best that most human lip readers manage is only 40% (they are wrong four times out of ten). Obviously, for things like transcribing conversations after the fact, this could be a huge boost for software programs.
In practice, however, there is one major limitation. This is really simulated lip reading. The AV-HuBERT results pass a test on canned video, not a live, free-form conversation like the one between Bowman and Poole in the film.
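To make the word error rates quoted above concrete, here is a minimal sketch of the usual way the metric is computed: the smallest number of word substitutions, insertions, and deletions needed to turn the system's output into the reference transcript, divided by the number of reference words. The function name and the example sentences are illustrative, and the papers' exact scoring setup may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edits needed to turn the ref words seen so far into hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # drop a reference word
                           cur[j - 1] + 1,           # extra hypothesis word
                           prev[j - 1] + (r != h)))  # match or substitute
        prev = cur
    return prev[-1] / len(ref)

# Two wrong words out of five: the 40% level cited for human lip readers.
wer = word_error_rate("open the pod bay doors", "open a pod day doors")
```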
For the moment, you may still be safe inside the capsule.