AI models trained on data from other AI can end up producing meaningless results

Generative artificial intelligence (AI) models trained on the output of other AIs can develop irreversible defects and contaminate their results with meaningless content. An article in Nature emphasizes the importance of using reliable data to train AI models, since recursively training on AI-generated data can cause the original content to be replaced by “nonsense unrelated” to the original in just a few generations.

Using AI-generated data sets to train future generations of machine learning models can contaminate their results, a phenomenon known as ‘model collapse,’ according to the study, led by the University of Oxford.

The paper defines ‘model collapse’ as a degenerative process affecting successive generations of AI models, in which each generation’s biased output contaminates the training data of the next. Having been trained on contaminated data, the models misperceive reality.

From architecture to hares

One of the examples in the study is a test using a text about medieval architecture as the original input. By the ninth generation of AI, the result was a list of North American hares. The authors propose that ‘model collapse’ is an inevitable result of AI models using training data sets created by previous generations.

Generative AI tools, such as large language models (LLMs), have gained popularity and have been trained primarily using human-generated input. However, as these models continue to proliferate on the internet, computer-generated content can be used to train other AI models—or themselves—in a recursive loop.

“We found that the indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which the tails of the original content distribution disappear,” the study states.

The team demonstrated that an AI may omit certain results from the training data, causing it to learn from only a portion of the data set. In addition, feeding a model AI-generated data causes subsequent generations’ learning ability to degrade, ultimately causing the model to collapse.
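The mechanism behind this degradation can be illustrated with a toy simulation. The sketch below is our simplification, not the paper’s actual experiment: a “model” is just a probability distribution over event types, and each generation is trained on a finite sample drawn from the previous generation’s model. Rare events that happen not to be sampled are forgotten forever, so the tail of the distribution erodes generation by generation.

```python
import random
from collections import Counter

# Toy analogue of "model collapse" (a simplification, not the paper's
# method): each generation of "model" is fit to a finite sample drawn
# from the previous generation, so rare events that go unsampled are
# lost forever and the distribution's tail disappears over time.

random.seed(42)

def train_next_generation(dist, n_samples=50):
    """Fit the next model to a finite sample from the current one."""
    events, probs = zip(*dist.items())
    sample = random.choices(events, weights=probs, k=n_samples)
    counts = Counter(sample)
    return {event: count / n_samples for event, count in counts.items()}

# Generation 0: a long-tailed distribution over 10 event types.
dist = {f"event_{i}": 2.0 ** -(i + 1) for i in range(10)}
dist["event_9"] *= 2  # top up the last entry so probabilities sum to 1.0

print(f"generation 0: {len(dist)} event types")
for generation in range(1, 21):
    dist = train_next_generation(dist)
print(f"generation 20: {len(dist)} event types survive")
```

Because an event with zero observed count can never reappear, the number of surviving event types can only shrink, mirroring the study’s observation that “the tails of the original content distribution disappear.”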

Training an AI with data generated by another AI is not impossible, but “we must take the filtering of this data seriously,” the authors warn. Still, tech companies that rely on human-generated content may be able to train more effective AI models than their competitors, they add.

What other experts say

Commenting on the results of the study, in which he did not participate, Victor Etxebarria, professor of Systems Engineering and Automation at the University of the Basque Country, describes the research as “excellent.”

AI is trained with huge amounts of data from the internet, produced by people who have legal copyrights to their material. To avoid lawsuits or save costs, technology companies use data generated by their own AI to continue training their machines. “This increasingly widespread procedure means that AI is not useful for any truly reliable function,” said Etxebarria. “This turns AI into tools that are not only useless in helping us solve our problems, but can also be harmful if we base our decisions on incorrect information.”

“The effect that the authors propose to call ‘model collapse’ is real: large language models really do collapse (they stop working, respond poorly, give incorrect information). This is a statistical effect that is perfectly demonstrated in the article,” explained Etxebarria, quoted by the Science Media Centre, a platform of scientific resources for journalists.

The study is “very interesting, of good quality, but its value is above all at a theoretical level,” said Andreas Kaltenbrunner, of the Universitat Oberta de Catalunya. “Their conclusions are based on the assumption that in future training only data generated by AI models will be used,” whereas in a real scenario there will always also be some portion of data generated by humans.

The researcher considered that “it is not clear what the outcome would be” if human-generated data were mixed with AI-generated data, and even less so what would happen if increasingly common data generated jointly by AI and humans were also added.

Source: Gestion
