Sci-Tech

Can synthetic data make AI models accurate and reliable?

2025-01-22   

Elon Musk, founder of artificial intelligence (AI) startup xAI, recently stated that "in AI training, we have basically exhausted the cumulative sum of human knowledge." Earlier research has likewise suggested that real, human-generated data could be exhausted within two to eight years. As real data grows scarce, the technology industry is turning to synthetic data to satisfy AI's "appetite." The Australian website The Conversation reported earlier this month that synthetic data has many advantages, but that excessive reliance on it may also weaken the accuracy and reliability of AI.

Synthetic data emerges

For a long time, technology companies have relied mainly on real data to build, train, and refine AI models. Real data refers to text, video, and images created by humans, collected through surveys, experiments, observation, or by mining websites and social media. It is highly valuable because it captures real events and their scenes and contexts, but it is not perfect: it may contain spelling errors, inconsistencies, or irrelevant content, and can even harbor serious biases, which is why some generative AI models have, in certain situations, produced images showing only men or only white people.

Real data is also becoming increasingly scarce, because humans cannot generate it as fast as AI's demand for it grows. Ilya Sutskever, co-founder of OpenAI in the United States, said at a machine learning conference in December last year that the AI industry has reached what he called "peak data," and that AI training data, like fossil fuels, faces depletion. In addition, research predicts that by 2026 the training of large language models such as ChatGPT will have exhausted all available text data on the Internet, leaving no new real data to draw on.

To provide AI with enough "nutrients," synthetic data - data generated by algorithms to mimic real-world situations - has emerged. Synthetic data is created in the digital world rather than collected or measured in the real world, and it can substitute for real-world data when training, testing, and validating AI models. In theory, it offers an economical and fast route to training AI models, and it sidesteps the privacy and ethical problems long raised about using real data, especially sensitive information such as personal health records. More importantly, unlike real data, synthetic data can in principle be supplied without limit.

The research firm Gartner estimated that roughly 60% of the data used in AI and analytics projects in 2024 would be synthetic, and that by 2030 the vast majority of data used by AI models will be synthetic data generated by AI. In fact, technology leaders and startups such as Microsoft, Meta, and Anthropic have already begun using synthetic data widely to train their AI models. Microsoft's Phi-4 model, open-sourced on January 8, was trained on a mix of synthetic and real data, and Google's Gemma models take a similar approach. Anthropic used some synthetic data to develop Claude 3.5 Sonnet, one of its highest-performing AI systems, and Apple's own AI system, Apple Intelligence, also made extensive use of synthetic data during pre-training.
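To make the idea concrete, the minimal Python sketch below generates a small corpus of synthetic health records that mimic real-world structure without describing any real person. The schema, value ranges, and file name are illustrative assumptions, not any company's actual pipeline.

```python
# Minimal sketch of algorithmically generated ("synthetic") training data.
# The record schema and distributions are illustrative assumptions only.
import csv
import random

random.seed(42)

def synthetic_patient_record() -> dict:
    """Create one fake health record that mimics real-world structure
    without containing information about any real person."""
    return {
        "age": random.randint(18, 90),
        "systolic_bp": round(random.gauss(120, 15)),
        "cholesterol": round(random.gauss(200, 35)),
        "smoker": random.random() < 0.2,
    }

# Generate an arbitrarily large corpus; unlike real data, supply is limited
# only by compute, which is the "infinite availability" point above.
records = [synthetic_patient_record() for _ in range(1_000)]

with open("synthetic_patients.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

print(f"wrote {len(records)} synthetic records")
```

Because the records come from a program rather than from people, the corpus can be made as large as needed and shared without exposing personal data, which is precisely the appeal described above.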
As technology companies' demand for synthetic data grows, tools for producing it have appeared one after another. Omniverse Replicator, a 3D simulation data-generation engine released by Nvidia, can generate synthetic data for training autonomous vehicles and robots. Last June, Nvidia open-sourced the Nemotron-4 340B family of models, which developers can use to generate synthetic data for training large language models in industries such as healthcare, finance, manufacturing, and retail; in specialized fields such as healthcare and finance, these models can generate high-quality synthetic data tailored to specific needs, helping to build more accurate industry-specific models. Microsoft's open-source tool Synthetic Data Showcase aims to enable privacy-preserving data sharing and analysis through generated synthetic data and accompanying user interfaces. Amazon SageMaker Ground Truth, from Amazon Web Services, can generate hundreds of thousands of automatically labeled synthetic images for users. And in December last year, Meta released the open-source Llama 3.3 model, which significantly lowers the cost of generating synthetic data.

Although synthetic data has eased the immediate pressure on AI training, it is not a perfect fix, and over-reliance on it carries hard-to-predict risks. A key concern is "model collapse": when AI models depend too heavily on synthetic data, they produce more "hallucinations," fabricating information that sounds plausible but does not exist, and their quality and performance decline rapidly, sometimes to the point of becoming unusable (a toy illustration appears below). If one AI model generates data containing spelling errors, for example, and other models are trained on that flawed output, the errors inevitably propagate and compound. Synthetic data also carries a risk of oversimplification: it may lack the detail and diversity inherent in real datasets, so models trained on it can produce output that is too simplistic to be of practical use.

To address these problems, standards bodies such as the International Organization for Standardization need to begin building robust systems for tracking and validating AI training data. AI systems can also be equipped with metadata-tracking capabilities so that users or systems can trace where synthetic data came from. And humans need to supervise synthetic data throughout a model's training to ensure it remains high quality and meets ethical standards.

The future of AI depends largely on the quality of its data, and synthetic data will play an increasingly important role in overcoming data shortages. People must nevertheless remain cautious about how it is used, minimizing errors and ensuring that it serves as a reliable supplement to real data, so that AI systems stay accurate and credible. (New Society)
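The "model collapse" risk mentioned above can be illustrated with a toy experiment: repeatedly fit a simple model to samples drawn from the previous generation's model instead of from the original data. The sketch below uses a one-dimensional Gaussian as a stand-in for an AI model; it is a didactic assumption, not a simulation of any production LLM pipeline.

```python
# Toy illustration of "model collapse": each generation is fit only to
# synthetic samples drawn from its predecessor. With finite samples, the
# estimated spread tends to shrink, so later "models" gradually lose the
# diversity of the original data. Didactic sketch under stated assumptions.
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: a wide Gaussian standing in for human-generated content.
real_data = rng.normal(loc=0.0, scale=5.0, size=1_000)
mu, sigma = real_data.mean(), real_data.std()
print(f"generation  0: mu={mu:+.2f}, sigma={sigma:.2f}")

for generation in range(1, 31):
    # Train the next generation on a small synthetic corpus only.
    synthetic = rng.normal(loc=mu, scale=sigma, size=50)
    mu, sigma = synthetic.mean(), synthetic.std()
    print(f"generation {generation:2d}: mu={mu:+.2f}, sigma={sigma:.2f}")
```

Run over many generations, the estimated spread (sigma) tends to drift downward, so later "models" see an ever narrower slice of the original distribution - a small-scale analogue of the loss of detail and diversity the article warns about.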

Editor: He Chuanning    Responsible editor: Su Suiyue

Source: Sci-Tech Daily

