In the age of the internet, people are getting closer and closer: you can chat with a friend from Turkey on Snapchat, video-call your parents on their fancy vacation, or send a short text message to your old pen pal (now your new keyboard friend) in Japan.
But the more connected the world becomes, the more our attention is commodified. We spend hours scrolling Instagram while spending less time interacting directly with one another.
Ironically, Artificial Intelligence is now changing that.
In March 2021, Google introduced its Live Caption feature for the Chrome browser. Live Caption uses machine learning to instantly create subtitles for any video or audio clip, giving deaf and hard-of-hearing people better access to internet content.
In the past, and still today, subtitles were either prepared in advance for a video or typed by a stenographer almost in real time for television broadcast. However, in places where subtitles are not the “norm”, such as apps like Instagram or TikTok, they are almost impossible to find. Live Caption changes that: with a few taps on the screen, anyone can instantly have accurate subtitles that expand the reach of audio and video.
Google’s Live Caption is an application of NLP, or natural language processing. NLP is a branch of artificial intelligence that uses algorithms to enable a kind of “interaction” between humans and machines. NLP systems help translate human language into a form machines can process, and often vice versa.
To understand the history of NLP, we have to go back to one of the most brilliant scientists of modern times: Alan Turing. In 1950, Turing published “Computing Machinery and Intelligence”, in which he discussed the concept of sentient, thinking computers. Claiming that there were no convincing arguments against the idea that machines could think like humans, he proposed the “imitation game”, now known as the Turing Test. It offers a way of judging whether an artificial intelligence can think for itself: if it can convince a person, with a certain probability, that it is human, it can be described as intelligent.
From 1964 to 1966, the German-American computer scientist Joseph Weizenbaum wrote an NLP program called ELIZA. ELIZA used pattern-matching techniques to conduct a conversation. For example, in the DOCTOR script, if a patient told the computer “my head hurts”, it would respond with a phrase similar to “why does your head hurt?”. ELIZA is now considered one of the earliest chatbots and one of the first programs to fool a human in a limited type of Turing test.
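The pattern-matching idea can be sketched in a few lines of Python. This is only a toy illustration, not Weizenbaum's original DOCTOR script: the rules, the `respond` function, and the generic fallback reply are all made up for this example.

```python
import re

# Hypothetical rules for illustration; not the original DOCTOR script.
# Each rule pairs a pattern with a reply template that reuses the captured text.
RULES = [
    (re.compile(r"my (.+) hurts", re.IGNORECASE), "Why does your {0} hurt?"),
    (re.compile(r"i feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
]

# Generic reply used when no rule matches.
FALLBACK = "Tell me more about it."


def respond(utterance: str) -> str:
    """Return the reply for the first matching rule, else the generic fallback."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*match.groups())
    return FALLBACK


print(respond("My head hurts"))
print(respond("The weather is nice"))
```

The program never models what a head or pain is. It only rearranges the words it was given, which is why systems of this kind break down as soon as the input falls outside their rules.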
The 1980s were a major turning point in the development of NLP. Until then, NLP systems like ELIZA formed conversations by relying on a complex set of rules. The AI could not “think” for itself; rather, it picked “ready-made” responses to fit the context. If the person said something for which it had no answer, it would fall back on an “undirected” answer such as “tell me more about [a topic from earlier in the conversation]”.
In the late 1980s, NLP research shifted toward statistical models, which let systems choose their output based on probabilities rather than hand-written rules.
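To give a feel for what "probability-based" means, here is a minimal sketch of a bigram model, one of the simplest statistical language models. It stands in for the idea only and is not a reconstruction of any particular system from that era. The tiny corpus and the function names are invented for this example.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only.
CORPUS = [
    "tell me more",
    "tell me why",
    "tell me more please",
]


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Turn the raw counts for `prev` into relative frequencies."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in followers.items()}


model = train_bigrams(CORPUS)
probs = next_word_probs(model, "me")
print(probs)
print(max(probs, key=probs.get))
```

Instead of looking up a fixed rule, the model picks the continuation it has seen most often in its training data, so its behavior changes as the data changes.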
Modern speech-recognition NLP combines several common components, such as speech recognition, audio recognition, and diarization, which differentiates between speakers. Google’s Live Caption system uses three deep learning models to form the subtitles: a recurrent neural network (RNN) for speech recognition, a text-based RNN for predicting punctuation, and a convolutional neural network (CNN) for classifying sound events. These three models send signals that together make up the subtitle track, complete with labels for applause and music.
When speech is detected in an audio or video stream, the automatic speech recognition (ASR) RNN is activated so the device can begin transcribing the words into text. If the speech stops, for example because music is played instead, the ASR stops running to save the phone’s battery, and a [music] label appears in the caption.
As the recognized speech is formed into a caption, punctuation is predicted for the most recent complete sentence. The punctuation is adjusted continually until further ASR results no longer change the meaning of that sentence.
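How the three signals combine into a single caption track can be sketched roughly as follows. This is a toy under heavy assumptions, not Google's implementation: the `Frame` class, the `punctuate` placeholder, and the `caption` function are hypothetical, and the neural networks are replaced by pre-labelled input.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """One chunk of audio, pre-labelled here in place of real model output."""
    sound_event: str   # stand-in for the sound classifier, e.g. "speech" or "music"
    words: str = ""    # stand-in for the ASR result when speech is present


def punctuate(text: str) -> str:
    """Placeholder for the punctuation model: capitalize and add a period."""
    if not text:
        return text
    return text[0].upper() + text[1:] + "."


def caption(frames: List[Frame]) -> List[str]:
    """Build a caption track, using ASR output only while speech is detected."""
    track: List[str] = []
    for frame in frames:
        if frame.sound_event == "speech":
            if frame.words:
                track.append(punctuate(frame.words))
            continue
        # Non-speech audio: ASR is skipped and a sound label is shown instead.
        label = f"[{frame.sound_event}]"
        if not track or track[-1] != label:
            track.append(label)  # avoid repeating the same label back to back
    return track


frames = [
    Frame("speech", "hello everyone"),
    Frame("music"),
    Frame("music"),
    Frame("applause"),
    Frame("speech", "thank you"),
]
print(caption(frames))
```

The point of the sketch is the control flow: the sound classifier decides whether the comparatively expensive speech recognizer is needed at all, and the punctuation step only ever works on text the recognizer has already produced.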
At the moment, Live Caption can only create subtitles for English speech, but it is constantly being improved and will one day be expanded to other languages. Early versions of Spanish, German, and Portuguese subtitles are currently available in Google Meet.
Accessibility-centric NLP isn’t limited to creating subtitles. Another Google project, Project Euphonia, uses NLP to help people with atypical speech or speech disabilities be better understood by speech recognition software. Project Euphonia collects 300 to 1,500 audio phrases from volunteers with a speech impairment. These audio samples can then be “fed” to speech recognition models to train them on a variety of speech impairments. In addition, the project builds simplified voice systems that use face tracking or simple sounds to trigger various actions, such as turning on a light or playing a specific song.
One of Google’s latest ASR-based NLP tools aims to change the way we interact with those around us and to expand both the scope of communication and the people we can communicate with. Google’s interpreter mode uses ASR to recognize what you are saying and outputs an accurate translation into another language, effectively enabling a conversation between speakers of different languages and breaking down language barriers. Similar instant translation technology is also used by SayHi, which lets users control how fast or slow the translation is spoken.
There are still a few problems with ASR systems. Machines sometimes have difficulty understanding people with strong accents or dialects, a problem often referred to as the AI accent gap. This is currently being addressed on a case-by-case basis: scientists tend to use a “single accent” model, in which different algorithms are designed for different dialects or accents. For example, some companies have experimented with separate ASR systems for Mexican Spanish and for the Spanish spoken in Spain.
Many of these ASR systems also reflect some degree of implicit bias. In the United States, African-American Vernacular English, also known as AAVE, is a widely spoken dialect of English, used most commonly by African-Americans. However, several studies have found significant racial disparities in the average word error rate of different ASR systems, with one study finding that the average word error rate for Black speakers was almost twice that for white speakers in ASR programs from Amazon, Apple, Google, IBM and Microsoft.
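Word error rate is the standard way to score a transcript: the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of words in the reference. A minimal sketch of that calculation is shown below. The function name and the example sentences are made up here, and real evaluations usually normalize the text more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("turn on the light", "turn on a light"))
```

In this example one of four reference words is wrong, so the rate is 0.25. A rate twice as high for one group of speakers means the system gets roughly twice as many words wrong when they talk.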
In the future, training AI on more diverse data, including regional accents, dialects, and slang, can help reduce racial and ethnic disparities in ASR accuracy.
Technology has incredible potential to bring people together, but when people are left out, whether because of disability, race, ethnicity, or any other reason, it can be a divisive and isolating force. Thanks to natural language processing, we are starting to bridge these gaps between people and build a more accessible future.