Foreign language translation for the IC gets a machine learning boost from IARPA



Some of the hottest trending languages are Kazakh, Swahili and Pashto. At least for the US Intelligence Community (IC).

It is safe to say that no organization is more interested in what foreigners say and write than the IC. This is especially true of what is said in widely spoken languages by US adversaries such as China and Russia. However, it also applies to low-resource languages spoken by much smaller populations around the world, such as Kazakh, Swahili, and Pashto.

The constant challenge the IC faces is how to interpret these lesser-used languages, or any other language, quickly and accurately.

It would be an incredibly time-consuming and expensive endeavor to use humans to translate the quadrillions of words that are written and spoken by people around the world every day. Fortunately, IARPA is revolutionizing the way the IC consumes foreign language information with its Machine Translation for English Retrieval of Information in Any Language (MATERIAL) program.

By using machine learning to turn multilingual text and speech media into usable intelligence for analysts, regardless of their language skills, MATERIAL dramatically reduces the need for human translation.

“The MATERIAL program has really changed the landscape by enabling everyone to efficiently find information in low-resource languages,” said MATERIAL program manager Dr. Carl Rubino. “This is a turning point for the IC and is revolutionizing the way we access critical foreign language data.”

The MATERIAL program launched in October 2017. Its performers, including Johns Hopkins University, Raytheon BBN Technologies, Columbia University, and the University of Southern California Information Sciences Institute, were tasked with building robust, automated language capabilities over a period of four years. The ultimate goal of MATERIAL was to build Cross-Language Information Retrieval (CLIR) systems that find speech and text content in a variety of lower-resource languages using only English search queries, and then present the relevant foreign language information concisely in English. The performers exceeded expectations and did exactly that.

In addition to Kazakh, Swahili and Pashto, the CLIR systems the performers developed include state-of-the-art automatic speech recognition and machine translation systems, as well as models for other languages such as Tagalog, Somali, Lithuanian, Georgian, Bulgarian and Farsi.

MATERIAL technologies were recently featured at SCALE 2021, a multinational summer workshop at Johns Hopkins University devoted to exploring topics in human language technology. This summer’s theme was Cross-Language Information Retrieval. Using the knowledge gained and the baseline models from the program, the SCALE scientists were able to develop tailor-made CLIR capabilities for Chinese, Russian and Farsi.

“I am thrilled that this technology is taking root,” said Dr. Rubino. “With continued IC investment and mastery, this relatively novel approach to data discovery should soon become a standard and reliable tool for our analysts.”

Read the announcement at IARPA