AI companies are running up against a shortage of training data, which has led them to explore synthetic data: data generated by AI models rather than collected from people. At a glance, the approach looks like a simple, promising answer to both data scarcity and the copyright concerns surrounding AI development. In practice, however, major players such as Anthropic, Google, and OpenAI have yet to show they can reliably produce high-quality synthetic data.
Models trained on synthetic data have run into significant problems, likened by some to “Habsburg AI” or “Model Autophagy Disorder.” Both terms describe systems trained heavily on the outputs of other generative AI, which gradually mutate and exaggerate their own quirks, much as the distinctive Habsburg jaw emerged over generations of the historical Habsburg dynasty.
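This failure mode has a well-known toy illustration (not from the article): fit a simple model to data, sample from the fit, refit on those samples, and repeat. Sampling error compounds with each generation, so the fitted distribution drifts away from the original one. The Gaussian example below is a minimal sketch of that feedback loop; the sample sizes and number of generations are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from the true distribution N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=200)

for generation in range(1, 31):
    # Fit a simple model (here just a Gaussian) to the current dataset.
    mu, sigma = data.mean(), data.std()
    # The next generation trains only on samples from the previous fit,
    # so estimation error accumulates instead of averaging out.
    data = rng.normal(loc=mu, scale=sigma, size=200)
    print(f"gen {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
```

Running the loop shows the estimated mean and standard deviation wandering away from the true values of 0 and 1, a small-scale analogue of the degradation seen when generative models are trained on their own outputs.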
Researchers and industry experts are looking for ways to produce synthetic data without triggering this kind of breakdown. Companies such as OpenAI and Anthropic are experimenting with checks-and-balances setups in which one model generates data and a second model verifies its accuracy. Anthropic has been relatively transparent about its use of synthetic data: it applies internal guidelines to steer this two-model process and has disclosed that its large language model was trained in part on internally generated data.
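The article does not describe how these pipelines are actually built. The sketch below shows one plausible shape of such a generate-then-verify loop, with `generator` and `verifier` standing in for calls to two separate models; all function names and the toy stand-ins are illustrative assumptions, not any company's implementation.

```python
from typing import Callable


def generate_synthetic_examples(
    generator: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    prompts: list[str],
) -> list[tuple[str, str]]:
    """Hypothetical checks-and-balances loop: one model drafts training
    examples and a second model accepts or rejects each candidate."""
    accepted = []
    for prompt in prompts:
        candidate = generator(prompt)
        # Keep only examples the second model judges acceptable, the idea
        # being to filter out the errors that feed model collapse.
        if verifier(prompt, candidate):
            accepted.append((prompt, candidate))
    return accepted


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model API.
    toy_generator = lambda p: f"Draft answer to: {p}"
    toy_verifier = lambda p, c: len(c) > 0
    print(generate_synthetic_examples(toy_generator, toy_verifier, ["What is 2 + 2?"]))
```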
Despite the concept’s promise, synthetic data research still faces real obstacles. Because researchers have only a limited understanding of how complex systems such as deep learning models actually work, whether effective synthetic data can be developed remains uncertain. That open question adds to the broader difficulty of AI development and underscores the need for further research and innovation in this area.