In the first days after OpenAI made ChatGPT publicly available last winter, the program proved its ability to generate computer code, which was a revelation for developers. ChatGPT immediately seemed so good at coding that even people with little programming knowledge could apparently use it to build powerful software.
Many months of experience and research on the subject have since revealed that ChatGPT and other generative AIs cannot actually develop programs as such. At best, they can provide baby steps toward solving simple coding problems.
“What generative AI has shown is that I can have a partner when I’m doing a task, who gives me suggestions and allows me to overcome creative obstacles,” said Naveen Rao, co-founder and CEO of the AI startup MosaicML, which was acquired in August by Databricks.
But this level of support for software development remains modest.
“It gives you some sort of scaffolding, with some things that can be repeated, but that’s it,” he said. “If I ask to solve a very difficult problem, it’s not possible. AIs don’t even write particularly good code. It’s beginner’s code.”
Some studies show that large language models such as GPT-4 are far inferior to human coders in overall code quality.
Developers superior to GPT-4
A recent study carried out by Sayed Erfan Arefin and colleagues at Texas Tech University tested GPT-4 and its predecessor, GPT-3.5, with coding problems taken from the online platform LeetCode. These are the types of problems that are posed to candidates at Google, for example.
The programs were evaluated based on two tests: “organizing data for efficient access (using appropriate data structures)” and “creating workflows to process data (using efficient algorithms)”. They were also assessed on what’s called “string manipulation”, which overlaps with the other two challenges.
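The first criterion, choosing appropriate data structures for efficient access, can be illustrated with a classic LeetCode-style exercise. This is a hypothetical example, not one drawn from the study: finding two numbers that sum to a target, where a hash map replaces a quadratic brute-force search with a single linear pass.

```python
def two_sum(nums, target):
    """Return the indices of the two numbers in `nums` that add up to `target`."""
    seen = {}  # maps value -> index; dict lookups are O(1) on average
    for i, n in enumerate(nums):
        complement = target - n
        if complement in seen:
            return [seen[complement], i]
        seen[n] = i
    return []  # no pair found

print(two_sum([2, 7, 11, 15], 9))  # the pair 2 + 7 is at indices 0 and 1
```

Problems of this kind are exactly where choosing the right data structure (a dict rather than a nested loop) separates efficient solutions from naive ones.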
When the language models were given what the authors call “full questions”, that is, when the programs were shown example solutions to the questions, GPT-4 answered only 26% of questions correctly, compared to 45% for human respondents.
When some of that information was removed, GPT-4’s success rate dropped to 19% of questions answered correctly. GPT-3.5 scored around 12% and 10% on the two conditions, respectively.
The authors also examined the quality of the GPT code, both for successes and failures. In both cases, they found a problem: GPT often struggled with a basic coding practice, namely “defining variables consistently.”
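As a hypothetical sketch (not taken from the study), “defining variables consistently” means always referring to a value by the name it was bound to; a model that drifts between names produces code that fails at run time:

```python
# The kind of naming slip the authors describe. A buggy version might
# bind one name and then reference another:
#
#     total_sum = sum(values)
#     return total / len(values)   # NameError: `total` was never defined
#
# The consistent version uses the same name throughout:
def average(values):
    if not values:
        return 0.0
    total_sum = sum(values)
    return total_sum / len(values)
```

An inconsistency like this is trivial for a human to spot, but it makes the generated program fail outright.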
Chart: success rates of GPT-3.5, GPT-4, and humans, on training and test sets, when given either complete information about the problem (with example solutions) or incomplete information. Texas Tech University
Moving from basic to complex programming problems
Scaling is also an issue for AI code generation. The most encouraging results obtained so far by GPT-4 in terms of code generation only concern trivial problems.
A study carried out by David Noever of the cybersecurity firm PeopleTec tested GPT-4’s ability to find flaws in code samples. This is work that is usually done by static application security testing (SAST) tools, such as Snyk.
In some cases, GPT-4 found more errors than Snyk, the study reported. But it also made many mistakes. Importantly, GPT-4 was tested on just over 2,000 lines of code, a tiny amount compared to applications that can contain hundreds of thousands or even millions of lines of code, spread across many linked files.
It is therefore not certain that the successes achieved by AIs on toy problems will extend to such complexity.
AI misjudges variables
A study last month by Zhijie Liu and colleagues at ShanghaiTech University looked at the quality of code based on its correctness, understandability and security. The study also tested ChatGPT on LeetCode tasks, and it tested the program’s code generation against the Common Weakness Enumeration (CWE), a catalog of vulnerability types maintained by the MITRE Corporation.
They tested ChatGPT on tasks formulated before or after 2021, because ChatGPT was trained only on pre-2021 data, and they wanted to see how the program performs on both older and newer problems.
And the results are striking. For the most recent problems, labeled “Aft.” for “after” 2021, Liu and his team observed very low acceptance rates. “The ability of ChatGPT to generate functionally correct code decreases significantly as problem difficulty increases,” they write. Only 15.4% of the generated C-language programs were acceptable, and none were acceptable for the most difficult problems. Additionally, “ChatGPT-generated code for hard and medium problems is more likely to contain compilation and runtime errors.” Human coders who took the same tests scored an average of 66%.
For older problems, labeled “Bef.”, the percentage rises to 31%, which is still low.
The team characterized the types of erroneous responses that ChatGPT provides. Often the mistake involves something as simple as misevaluating a variable or condition, an error that would be hard to imagine from even a novice developer.
Example of bad code generated by ChatGPT. The program is supposed to classify boxes into categories based on their description. On line 12, the code decides that if a box is neither “large” nor “heavy”, it should be categorized as “both”, when it should be “neither”. ShanghaiTech University
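The flaw described above can be reconstructed in a few lines (the function and category names here are hypothetical, chosen only to mirror the description of the figure):

```python
def categorize_box(description):
    """Classify a box as 'large', 'heavy', 'both', or 'neither' from its description."""
    is_large = "large" in description
    is_heavy = "heavy" in description
    if is_large and is_heavy:
        return "both"
    if is_large:
        return "large"
    if is_heavy:
        return "heavy"
    # This is the branch ChatGPT reportedly got wrong: its version
    # returned "both" here, even though the box is neither large nor heavy.
    return "neither"
```

The bug is a simple inversion of the fallback case, which is precisely why the researchers found it so surprising: the logic is elementary, yet the generated code got it backwards.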
ChatGPT performs better with strongly typed languages
The researchers reach some fascinating conclusions. First, they note that ChatGPT struggles with novelty: “ChatGPT may have limitations when generating code for unfamiliar or novel problems, even if the problems are easy for a human to solve.”
But the programming language used matters: the technology works best with certain programming languages that are “strongly typed” or more “expressive”. “The likelihood that ChatGPT will generate functionally correct code is higher when using more highly expressive languages (e.g., Python3),” they write.
Another flaw is that ChatGPT’s code can be convoluted, so its errors are harder to correct. “ChatGPT’s code generation process may lack care,” they write, “and the generated code may not meet some of the detailed requirements described, making it difficult to generate successfully or to fix (so that it is functional).”
Regarding MITRE’s Common Weakness Enumeration tests, “Code generated by ChatGPT often has vulnerabilities, which is a serious problem,” they write. Fortunately, they note that ChatGPT is able to fix many of these vulnerabilities in subsequent prompts when it receives more detailed information from the MITRE data set.
The three studies therefore suggest that the use of generative AI for programming is only in its infancy. As Mr. Rao said, generative AI is useful for simple support tasks, where the programmer is in charge.