OpenAI’s GPT-3 architecture represents a game changer in AI research and use. As the largest neural network ever developed, it promises significant improvements in natural language tools and applications.
Developers can use the deep learning-based language model to build almost anything language-related. The approach looks promising for startups developing advanced natural language processing (NLP) tools, not only for B2C applications but also for integration into companies’ B2B use cases.
Generative Pre-trained Transformer 3 (GPT-3) is “arguably the largest and best all-purpose NLP AI model out there,” said Vishwastam Shukla, CTO of HackerEarth.
Because the model is so generic, users are free to apply GPT-3 however they want, including building mobile apps, creating search engines, translating languages, and writing poetry. The model contains over ten times as many parameters as the next largest NLP model, Microsoft’s Turing NLG, and the accuracy and value it can deliver are correspondingly higher.
“The AI industry is excited about GPT-3 because its 175 billion parameters (weighted connections) open up new possibilities for NLP application development,” said Dattaraj Rao, chief data scientist at Persistent Systems.
OpenAI, the artificial intelligence research laboratory that developed GPT-3, trained the model with over 45 terabytes of data from the Internet and books to support its 175 billion parameters.
“Parameters in machine learning represent the skills or knowledge of the model. The higher the number of parameters, the more skilled the model,” said Shukla.
Parameters are like variables in an equation, explains Sri Megha Vujjini, data scientist at Saggezza, a global IT consultancy.
In a basic math equation like “a + 5b = y,” “a” and “b” are the parameters and “y” is the result. In a machine learning model, these parameters correspond to learned weights between words — for example, how strongly their meanings or typical usage correlate.
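The idea of parameters as learned weights can be sketched in a few lines. This is a toy illustration only (the data and weights below are made up, not anything from GPT-3): a two-parameter linear model is fitted by gradient descent, which is the same basic mechanism that tunes GPT-3’s 175 billion weights at vastly larger scale.

```python
# Toy illustration of "parameters as learned weights": fit the parameters
# w1 and w2 of the model  y = w1*x1 + w2*x2  by gradient descent.
# The data was generated with true weights 1.0 and 2.0.
data = [((1.0, 2.0), 5.0), ((2.0, 1.0), 4.0), ((3.0, 3.0), 9.0)]

w1, w2 = 0.0, 0.0   # the model's parameters, initially untrained
lr = 0.02           # learning rate

for _ in range(2000):
    for (x1, x2), y in data:
        pred = w1 * x1 + w2 * x2
        err = pred - y
        w1 -= lr * err * x1   # nudge each parameter against its error gradient
        w2 -= lr * err * x2

print(round(w1, 2), round(w2, 2))  # converges close to the true weights 1.0 and 2.0
```

GPT-3 differs only in scale and architecture: instead of two weights in a linear equation, it adjusts 175 billion weights in a Transformer network.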
The next largest model is Microsoft’s Turing NLG, with around 17 billion parameters; GPT-2, OpenAI’s predecessor to GPT-3, had only about 1.5 billion.
Earlier this year, EleutherAI, a collective of volunteer AI researchers, engineers, and developers, released GPT-Neo 1.3B and GPT-Neo 2.7B.
The GPT-Neo models are named for the number of parameters they have and have an architecture very similar to OpenAI’s GPT-2.
Rao said it offers comparable performance to GPT-2 and smaller GPT-3 models. Most importantly, developers can download it and refine it with domain-specific text to get new results. As a result, Rao expects many new applications from GPT-Neo.
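The download-and-refine workflow Rao describes can be illustrated with a deliberately tiny stand-in. The sketch below is not GPT-Neo; it is a toy word-bigram “language model” (all text and counts invented for the example) that is first trained on generic text and then refined with domain-specific text, shifting its predictions toward domain vocabulary — the same idea as fine-tuning a downloaded model.

```python
from collections import Counter, defaultdict

# Toy illustration of the pre-train/fine-tune idea: a word-bigram model
# counts which word follows which, and predicts the most frequent successor.
def train(counts, text):
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def predict(counts, prev):
    # most frequent continuation seen after `prev`
    return counts[prev].most_common(1)[0][0]

counts = defaultdict(Counter)

# "Pre-training" on generic text
train(counts, "the patient went to the store and the patient bought bread")
before = predict(counts, "patient")   # a generic continuation

# "Fine-tuning" on domain-specific clinical text (repeated so the new
# domain counts dominate the generic ones)
for _ in range(3):
    train(counts, "the patient reported chest pain and the patient reported fever")
after = predict(counts, "patient")    # now a domain continuation

print(before, "->", after)
```

A real fine-tuning run updates billions of Transformer weights rather than bigram counts, but the effect is the same: the refined model favors continuations seen in the domain-specific text.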
Researchers are now planning even larger models. Google’s Switch Transformer model has 1.6 trillion parameters.
Coding language skills
Sreekar Krishna, National Leader AI and Head of Data Engineering at KPMG US, said: “GPT-3 is essentially the next step in the evolution of natural language systems.”
Trained on millions of examples, it shows that a system can learn aspects of domain knowledge and language constructs.
Traditional algorithmic development broke problems down into core micro-problems that could be addressed individually to reach a final solution. People solve problems the same way, but we are backed by decades of training in common sense, general knowledge and business experience.
In the traditional machine learning training process, algorithms are shown a sample of training data and are expected to learn the various skills needed to mimic human decision-making.
For decades, scientists have hypothesized that if we fed enormous amounts of data into algorithms, they would assimilate domain-specific knowledge along with common knowledge, language grammar constructs and human social norms. However, limited computing power and the challenges of systematically testing highly complex systems made this theory difficult to verify.
However, the success of the GPT-3 architecture has shown that the researchers are on the right track, said Krishna. With enough data and the right architecture, it is possible to encode general knowledge, grammar and even humor in the network.
GPT-3 language models
Ingesting such huge amounts of data from various sources made GPT-3 a kind of all-purpose tool.
“We don’t have to tune it for different use cases,” said Vujjini.
For example, the accuracy of a traditional English-to-German translation model depends on how well it has been trained and how its data was prepared. With the GPT-3 architecture, the output tends to be correct regardless of how the input is phrased. More importantly, a developer does not need to train it specifically on translation examples.
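Instead of training a translation model, a developer typically conditions GPT-3 with a few-shot prompt: a handful of worked examples followed by the new input, from which the model infers the task. The sketch below only builds such a prompt string (the example word pairs are illustrative); sending it to the model is a separate API call not shown here.

```python
# Sketch of few-shot prompting: no task-specific training, just a prompt
# that demonstrates the task with a couple of examples. The example
# translation pairs below are illustrative.
examples = [
    ("sea otter", "Seeotter"),
    ("cheese", "Käse"),
]

def few_shot_prompt(examples, query):
    lines = ["Translate English to German:", ""]
    for en, de in examples:
        lines.append(f"English: {en}")
        lines.append(f"German: {de}")
        lines.append("")
    lines.append(f"English: {query}")
    lines.append("German:")   # the model is asked to continue from here
    return "\n".join(lines)

prompt = few_shot_prompt(examples, "bread")
print(prompt)
```

The prompt ends mid-pattern, so the model’s most likely continuation is the German translation of the final English word — the task is specified by example rather than by training.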
This makes it easy to extend GPT-3 for a wide range of use cases and language models.
Saggezza has also experimented with helping companies analyze customer feedback, interpreting speech patterns to generate insights.
However, Rao argues that some domain-specific training is required to tune the GPT-3 language models to get the most benefit in real world applications such as healthcare, banking, and programming.
For example, training a GPT-type model on a data set of symptoms and the corresponding patient diagnoses could make it easier to recommend diagnoses to physicians. Microsoft’s GitHub, together with OpenAI, has meanwhile refined GPT-3 on large amounts of source code for Copilot, a code autocompleter that can generate lines of source code automatically.
GPT-3 vs. BERT
GPT-3 is often compared to Google’s BERT language model because they are both large neural networks for NLP built on top of Transformer architectures. However, there are significant differences in size, development methods, and deployment models.
Also, due to a strategic partnership between Microsoft and OpenAI, GPT-3 is only offered as a private service, while BERT is available as open source software.
GPT-3 performs better out of the box than BERT in new application domains, Krishna said. This lets companies tackle simple business problems faster than they could with BERT.
However, GPT-3 can become cumbersome due to the sheer scale of the infrastructure organizations need to deploy and use it, Shukla said. By contrast, organizations can conveniently load the largest BERT model, at 345 million parameters, onto a single GPU workstation.
At 175 billion parameters, the largest GPT-3 model is roughly 500 times the size of the largest BERT model. That scale comes with far higher computational cost, which is why GPT-3 is offered only as a service, while BERT can be embedded directly in new applications.
Both BERT and GPT-3 use a Transformer architecture to encode and decode sequences of data. The encoder creates a contextual embedding for an input sequence, while the decoder uses this embedding to generate a new sequence.
BERT has the more extensive encoder for generating a contextual embedding from a sequence, which is useful for analyzing sentiment or answering questions. GPT-3, on the other hand, is stronger on the decoder side, taking context into account to generate new text, which is useful for writing content, summarizing documents or generating code.
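The “contextual embedding” both models produce comes from self-attention: each position’s output vector is a weighted mix of every input vector, so each word’s representation reflects its context. The sketch below is a minimal, single-head version with made-up token vectors and no learned projection matrices — a simplification of the real Transformer attention step, not either model’s actual implementation.

```python
import math

# Minimal single-head self-attention: each output embedding is an
# attention-weighted average of all input embeddings, making every
# token's representation contextual.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(embeddings):
    d = len(embeddings[0])
    outputs = []
    for q in embeddings:  # each position attends to every position
        scores = [dot(q, k) / math.sqrt(d) for k in embeddings]
        weights = softmax(scores)          # attention weights sum to 1
        mixed = [sum(w * v[i] for w, v in zip(weights, embeddings))
                 for i in range(d)]
        outputs.append(mixed)
    return outputs

# three toy token embeddings of dimension 4
tokens = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.5, 0.5, 0.0, 0.0]]
contextual = self_attention(tokens)
print(len(contextual), len(contextual[0]))  # 3 4: one contextual vector per token
```

An encoder like BERT stacks many such layers (with learned query/key/value projections) to build embeddings for analysis tasks; a decoder like GPT-3 masks attention so each position sees only earlier tokens, which is what lets it generate text left to right.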
According to Saggezza, GPT-3 supports significantly more use cases than BERT. GPT-3 is useful for writing articles, reviewing legal documents, creating résumés, gleaning business insights from consumer feedback, and building applications. BERT is used more for language support, analyzing customer reviews and some advanced searches.