Larger language AIs give wrong answers more and more often. That is the conclusion suggested by a study on the accuracy of large language models. To that end, the researchers took a close look at three major LLMs: GPT from OpenAI, Llama from Meta and BLOOM, an open-source model developed by the academic group BigScience.
Big doesn’t automatically mean good
For their study, José Hernández-Orallo of the Valencian Research Institute for Artificial Intelligence in Spain and his colleagues analyzed the answers of the language models mentioned above for errors. They also presented wrong answers to human subjects to see how good we humans are at detecting them.
To do this, they first examined an early version of each language model and then compared it with a more recent, improved version. The key difference: the newer versions had since been trained on significantly more data to refine their answers.
As expected, the scientists found that the improved AI models give more accurate answers, which the researchers attribute to human feedback used to refine the models’ responses. But there is a big caveat: the more precise answers only apply to cases in which the AI was actually able to answer the question asked.
At the same time, Hernández-Orallo and his team found that reliability declines, according to Nature. The researchers write: “Among the inaccurate answers, the proportion of incorrect answers has increased.”
This happens because models are less likely to respond that they don’t know something or change the subject. “These days they answer almost everything. And that means more right answers, but also more wrong answers,” explains Hernández-Orallo.
“The tendency of chatbots to express opinions beyond their own knowledge has increased. That looks to me like what we would call bullshitting,” Mike Hicks, a philosopher of science and technology at the University of Glasgow, UK, told Nature. “The result is that ordinary users are likely to overestimate the capabilities of chatbots, and that is dangerous,” Hernández-Orallo points out in the report.
Test shows: the share of incorrect answers rises to 60 percent or more
The scientists peppered the models with thousands of prompts, for example questions about arithmetic, anagrams, geography and science. The bots’ ability to transform information was also tested, such as arranging a list in alphabetical order. A rough sketch of this kind of test setup follows below.
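To illustrate what such a probe could look like, here is a minimal, hypothetical harness in Python. It is not the study’s actual benchmark: `ask_model` is a placeholder for any chatbot API, and the prompts, expected answers and avoidance markers are invented for illustration.

```python
# Minimal sketch of the kind of evaluation harness described above.
# ask_model is a hypothetical stand-in for any chatbot API call;
# prompts, expected answers and avoidance markers are illustrative only.

def ask_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    return "I don't know"  # placeholder response

# (task type, prompt, expected answer) — one per category from the article
PROMPTS = [
    ("arithmetic", "What is 17 * 24?", "408"),
    ("anagram", "Rearrange the letters of 'silent' into another word.", "listen"),
    ("transformation", "Sort alphabetically: pear, apple, fig", "apple, fig, pear"),
]

AVOIDANT_MARKERS = ("i don't know", "i can't", "i cannot")

def classify(response: str, expected: str) -> str:
    """Bucket a response the way the study does: correct, avoidant or incorrect."""
    text = response.strip().lower()
    if any(marker in text for marker in AVOIDANT_MARKERS):
        return "avoidant"
    return "correct" if expected.lower() in text else "incorrect"

for task, prompt, expected in PROMPTS:
    print(task, "->", classify(ask_model(prompt), expected))
```

Repeated over thousands of prompts and across model generations, counting the three buckets yields exactly the kind of statistics the study reports.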
The result is astonishing: the improved AIs were less likely to avoid difficult questions and instead tried to answer them. GPT-4 is cited as an example. Nature writes: “The proportion of incorrect answers, among those that were either incorrect or avoided, increased as the size of the models grew, reaching more than 60 percent for several improved models.”
This does not mean that larger chatbots give bad answers 60 percent of the time overall. The figure refers to the share of incorrect answers among the questions the AI cannot answer correctly. Where older versions tended to write “I don’t know” or dodge the question, models trained on a larger data pool instead invent false information. A small example calculation follows below.
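To make that statistic concrete, here is a tiny, hypothetical calculation in Python. The counts are invented for illustration, not taken from the study; they only show how a share of over 60 percent can arise within the subset of failed answers while the overall error rate stays much lower.

```python
# Hypothetical counts for 1,000 questions a model fails to answer correctly.
# Invented numbers — they illustrate the statistic, not real study data.

older_model = {"avoidant": 700, "incorrect": 300}  # often says "I don't know"
newer_model = {"avoidant": 350, "incorrect": 650}  # attempts almost everything

def incorrect_share(failures: dict) -> float:
    """Share of incorrect answers among all non-correct responses."""
    return failures["incorrect"] / (failures["incorrect"] + failures["avoidant"])

print(f"older model: {incorrect_share(older_model):.0%} of failures are incorrect")  # 30%
print(f"newer model: {incorrect_share(newer_model):.0%} of failures are incorrect")  # 65%
```

On this measure the newer model looks dramatically less reliable, even if its total number of correct answers has risen at the same time.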
As the study continued, the scientists checked with volunteers whether they classified the answers as correct, incorrect or avoidant: in around 10 to 40 percent of cases, the test subjects classified incorrect answers as correct, for both easy and difficult questions. The scientists’ conclusion therefore sounds pretty damning: “Humans are not capable of monitoring these models,” says Hernández-Orallo.