With the success of generative AI models such as ChatGPT for text and Stable Diffusion for images, a growing share of the content published on the internet is synthetic.
To train and fine-tune those models, most of the data is collected from the internet.
What happens when the next generation of models is trained on synthetic data?
The MADness problem
A recent study, Self-Consuming Generative Models Go MAD, examined this question. It found that models trained recursively on synthetic data go MAD, a condition the authors call Model Autophagy Disorder. The content these models generate shows reduced diversity, reduced quality, or both. The wider ramifications of such self-consuming loops are still poorly understood.
There are three possible ways to train models with synthetic data:
- Training with only synthetic data
- Training with synthetic data and a fixed set of real data
- Training with synthetic data and a fresh batch of real data at every iteration
In the first case, each model is trained only on content generated by the previous iteration, and the quality and diversity of the output degrade quickly.
When every iteration is trained with the same fixed set of real data, the degradation is delayed but still inevitable.
Finally, when every iteration is trained with a mix that includes fresh real data, the quality and diversity of the models appear to be maintained across iterations.
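The three training loops can be sketched with a toy model. The example below is only an illustration under strong assumptions, not the setup used in the study: the "model" is a one-dimensional Gaussian fitted by maximum likelihood, the "real" data is a standard normal, and the names `run_loop` and `real_mode` are made up for this sketch.

```python
import random
import statistics


def run_loop(real_mode, generations=200, n=20, seed=0):
    """Return the fitted std after each generation of a self-consuming loop.

    real_mode: "none"  -> train only on the previous model's samples
               "fixed" -> reuse one fixed real sample every generation
               "fresh" -> draw a new real sample every generation
    """
    rng = random.Random(seed)

    def draw_real():
        # Assumed "real" distribution: standard normal (mean 0, std 1).
        return [rng.gauss(0.0, 1.0) for _ in range(n)]

    fixed_real = draw_real()
    # Generation 0 is trained on real data only.
    mu, sigma = statistics.fmean(fixed_real), statistics.pstdev(fixed_real)
    history = [sigma]

    for _ in range(generations):
        # The current model generates the synthetic part of the next training set.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n)]
        if real_mode == "none":
            train = synthetic
        elif real_mode == "fixed":
            train = synthetic + fixed_real
        elif real_mode == "fresh":
            train = synthetic + draw_real()
        else:
            raise ValueError(f"unknown real_mode: {real_mode!r}")
        # "Training" here is just a maximum-likelihood Gaussian fit.
        mu, sigma = statistics.fmean(train), statistics.pstdev(train)
        history.append(sigma)
    return history


if __name__ == "__main__":
    for mode in ("none", "fixed", "fresh"):
        print(mode, round(run_loop(mode)[-1], 4))
```

In the fully synthetic loop the fitted spread should shrink toward zero, which is the toy version of lost diversity, while the loop with fresh real data should stay close to the real spread. In this toy the fixed sample keeps anchoring the fit, so it is too simple to show the delayed degradation reported for that case.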
Why does this happen?
Models learn from the data they ingest, and we build them to generate what humans would generate. The intent is for the models to learn from humans. Once the data they consume comes from other models, they are no longer learning from humans but from themselves. This creates an effect like an echo chamber: each new generation of models misinterprets what is real by reinforcing what earlier generations have already learnt.
A related study by Shumailov et al. mentions two reasons:
- Statistical approximation error: each model is trained on a finite number of samples, so some information can be lost at every resampling step. This error would only disappear if the number of samples tended to infinity, and it compounds as synthetic data takes the place of real data.
- Functional approximation error: the error that comes from fitting a model that is too simple or too complex for the problem.
Imagine you want to make biscuits but you don't have enough flour, so you use a synthetic flour that is non-toxic and behaves just like flour but doesn't taste like it. The more real flour you replace with synthetic flour, the less your biscuits will resemble real biscuits. This is the statistical approximation error: the more synthetic data you use, the further away from the real thing you are.
A combination of those two errors, in particular the statistical approximation error, is the reason for this kind of problem.
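The statistical approximation error can be illustrated with a small resampling experiment. This is a minimal sketch under assumptions of our own, not code from either study: the "real" data is a made-up long-tailed distribution over 50 categories, each "model" is just the empirical frequency table of a finite sample, and `surviving_categories` is a hypothetical helper name.

```python
import random
from collections import Counter


def surviving_categories(generations=30, n_samples=200, n_categories=50, seed=0):
    """Count how many categories still have non-zero probability per generation."""
    rng = random.Random(seed)
    categories = list(range(n_categories))
    # Assumed long-tailed "real" distribution: weight 1 / (rank + 1).
    weights = [1.0 / (k + 1) for k in categories]
    support = []
    for _ in range(generations):
        # Draw a finite sample from the current model.
        sample = rng.choices(categories, weights=weights, k=n_samples)
        counts = Counter(sample)
        # Refit: the next model's probabilities are the empirical frequencies.
        weights = [counts[k] / n_samples for k in categories]
        # A category that received no samples has weight 0 and can never return.
        support.append(sum(1 for w in weights if w > 0))
    return support


if __name__ == "__main__":
    print(surviving_categories())
```

Because a category with zero samples can never be drawn again, the number of surviving categories can only stay the same or fall from one generation to the next. Rare categories in the tail are the first to go, which is one way to picture the loss of diversity described above.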
As generative models spread and their content is distributed online, it is inevitable that future generations of AI models will be trained on synthetic data.
While it is not impossible to train a good model with synthetic data, it is always a good idea to include data produced by humans in the mix.
Here at LHF Labs we are experts in extracting and curating data. One example is the esCorpius dataset.