Synthetic Data, Explained: Why AI Trained on AI Is The Next Big Thing (and Problem)

Key Points:

  • Synthetic data is seen as a solution for the scarcity of AI training data
  • Companies like Anthropic, Google, and OpenAI are working on creating quality synthetic data
  • AI models built on synthetic data have run into problems that some describe with terms like “Habsburg AI” and “Model Autophagy Disorder”


AI companies face a shortage of training data, and many are exploring synthetic data as a way around it. At first glance the solution seems simple and promising: it could ease data scarcity and sidestep potential copyright concerns related to AI development. However, major players like Anthropic, Google, and OpenAI have yet to successfully create high-quality synthetic data.


Models trained on synthetic data have encountered significant problems, which some describe as “Habsburg AI” or “Model Autophagy Disorder.” Both terms refer to systems trained so heavily on the outputs of generative AI that they become distorted, with exaggerated features, much like the pronounced jaw associated with the historical Habsburg dynasty.


Researchers and industry experts are exploring ways to develop synthetic data without causing system breakdowns. Companies like OpenAI and Anthropic are experimenting with checks-and-balances systems, in which one model generates data and another verifies its accuracy. Anthropic has been transparent about its synthetic data use, employing internal guidelines to train its two-model system and revealing that its LLM was trained on internally generated data.
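
To make the generate-then-verify idea concrete, here is a minimal sketch in Python. It is not how OpenAI or Anthropic actually build their systems; the function names (filter_synthetic, toy_generator, toy_verifier) are hypothetical, and the two "models" are simple stand-ins: one produces arithmetic statements that are sometimes wrong on purpose, and the other checks them.

```python
from typing import Callable, List


def filter_synthetic(
    generate: Callable[[int], str],
    verify: Callable[[str], bool],
    n_candidates: int,
) -> List[str]:
    """Keep only the generated samples that the verifier accepts."""
    accepted = []
    for i in range(n_candidates):
        sample = generate(i)
        if verify(sample):
            accepted.append(sample)
    return accepted


# Hypothetical stand-in for a generator model: writes "a + b = c"
# and deliberately gets every third answer wrong.
def toy_generator(i: int) -> str:
    a, b = i, i + 1
    answer = a + b + (1 if i % 3 == 0 else 0)
    return f"{a} + {b} = {answer}"


# Hypothetical stand-in for a verifier model: recomputes the sum
# and rejects any statement whose stated answer does not match.
def toy_verifier(sample: str) -> bool:
    lhs, rhs = sample.split(" = ")
    a, b = lhs.split(" + ")
    return int(a) + int(b) == int(rhs)


kept = filter_synthetic(toy_generator, toy_verifier, 6)
for line in kept:
    print(line)
```

In this toy setup the verifier is always right, so only correct statements reach the training set. A real verifier model makes mistakes of its own, which is part of why the approach is still experimental.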


Despite the concept’s promise, synthetic data research still faces open questions. Because researchers have only a limited understanding of how complex systems like deep learning models work, it remains uncertain whether effective synthetic data can be developed. The unresolved question adds to the complexity of AI development and underscores the need for further research in this area.




©2024 The Horizon