OpenAI has warned that its Codex neural network, the model that powers GitHub’s code-completion tool Copilot, is likely to generate source code that looks plausible but is wrong, and some of its shortcomings grow worse as the model gets bigger.
The AI lab detailed the shortcomings and limitations of non-production builds of its Codex model in a pre-print paper this week. Note that GitHub Copilot is driven by its own production variant of the system; the preliminary models discussed in the paper are smaller and trained only on Python, while Copilot was trained on more data and supports code completion for a number of programming languages.
Still, GitHub Copilot suffers from problems similar to those of the Codex prototypes: the code it generates is unlikely to be correct and useful to developers on the first try, and it tends to offer answers that look reasonable at first glance but may be wrong. Programmers should carefully check automatically generated code for errors.
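That checking can itself be partly automated. The sketch below is a hypothetical illustration (the function name, suggestion string, and checks are ours, not from the paper or from Copilot): before accepting a model-suggested function, execute it in a scratch namespace and run a few quick assertions against it.

```python
# Hypothetical sketch: never trust a completion blindly. Define the
# suggested code in an isolated namespace and run spot checks first.

def accept_suggestion(source: str, checks) -> bool:
    """Return True only if the suggested code compiles and passes all checks."""
    namespace: dict = {}
    try:
        exec(source, namespace)          # compile and define the suggestion
    except SyntaxError:
        return False                     # reject syntactically invalid code
    for check in checks:
        try:
            if not check(namespace):
                return False             # reject on any failed assertion
        except Exception:
            return False                 # reject if the suggestion crashes
    return True

# A suggested (correct) completion and one quick spot check:
suggestion = "def add(a, b):\n    return a + b\n"
ok = accept_suggestion(suggestion, [lambda ns: ns["add"](2, 3) == 5])
```

A handful of assertions will not prove a suggestion correct, but it cheaply rejects the syntactically broken or obviously wrong completions the paper describes.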
The Codex model described in the paper was trained on 159GB of Python source code drawn from more than 50 million public repositories on GitHub; each Python file analyzed contained fewer than a thousand lines of code. To test the model’s AI pair-programming capabilities, the researchers devised 164 hand-written programming problems probing Codex’s ability to complete functions, understand simple algorithms, and do simple mathematics.
The most powerful version of the system, with 12 billion parameters, solved 28.8 percent of the problems on the first attempt. For comparison, OpenAI’s natural-language system GPT-3 could not solve any of them.
However, Codex does better when allowed to generate more responses: given ten attempts per problem, it produced a correct answer 46.81 percent of the time; given 100 attempts, 72.31 percent.
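Those multi-attempt figures are pass@k scores. The Codex paper estimates pass@k without bias by drawing n samples per problem, counting the c that pass the unit tests, and computing 1 − C(n−c, k)/C(n, k). A minimal sketch of that estimator (the example numbers are ours, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper:
    probability that at least one of k samples drawn from a pool of
    n samples (c of which are correct) passes the tests."""
    if n - c < k:
        return 1.0                       # not enough failures to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: 100 samples drawn, 30 correct, estimate pass@10.
p = pass_at_k(100, 30, 10)
```

Computing the naive 1 − (1 − c/n)^k instead would systematically overestimate pass@k for small n, which is why the paper uses the combinatorial form.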
In other words, it falls to human programmers, or perhaps other tools, to pick the best proposal out of Codex’s output. “This result suggests that accurate code samples can be selected via heuristic ranking instead of fully evaluating each sample,” the paper says, which “may not be possible or practical in deployment.”
GitHub Copilot fares a little better: it can generate correct code 43 percent of the time on the first attempt, and 57 percent of the time when ten attempts are allowed. If you were worried these code-completion tools might replace human programmers, don’t be: for now they are simple pattern-matching engines rather than a real threat, good for generating boilerplate code and the like.
The human touch
The researchers acknowledged that “a strong student who completes an introductory computer science course is expected to be able to solve a greater proportion of the problems than Codex,” despite the model having seen more code than a professional developer will ever see in their lifetime.
Codex tends to replicate the common coding examples it was trained on: when you write something that looks similar, it fills in the gaps with what it thinks should come next, even though the generated code is often not quite right. If you are writing something more specialized or complex than most scripts for a given task, Codex isn’t that useful.
“We find that Codex can recommend syntactically incorrect or undefined code, and can invoke functions, variables, and attributes that are undefined or outside the scope of the codebase. Moreover, Codex struggles to parse through increasingly long and higher-level or system-level specifications,” the paper says.
This problem only gets worse as the models get bigger and more capable, the paper says. And Codex is only as good as the programmer using it: fed prompts containing subtle flaws, it will “tend to produce worse code than it is capable of producing. This persists when the prompt also includes instructions to write correct code. This gap increases with model size,” the researchers wrote.
They also warned that Codex, like other language models, can be prompted into generating “racist, denigrating, and otherwise harmful output” as code comments. Biases related to gender and race were also observed in code structures. To avoid real-world harm, GitHub Copilot includes filters that automatically block offensive words, so it shouldn’t spit out harmful messages.
It’s not all doom and gloom, mind you. OpenAI prefers to focus on the potential positive effects the tool could have, such as whether it can make programmers more productive, or encourage them to document their code better so others can read it. ®