Why is synthetic data suddenly so popular?


Ajinkya Bhave, Country Head (India) at Siemens Engineering Services, recently spoke about how Siemens uses synthetic data to deploy machine learning in real-life scenarios. “The idea was that we created synthetic training data, which was then used to train a neural network on a digital twin of the model. We then tested this on real errors that occur in the ball bearings of the gears with the physical data. The graph showed us that the forecast was pretty accurate,” he said, discussing a gearbox problem in wind turbines. He is not the only one emphasizing the benefits of synthetic data for training AI models.

MIT Technology Review recently named synthetic data for AI one of its ten breakthrough technologies of 2022. Forrester’s research even identified synthetic data as part of AI 2.0. The world is getting more data hungry by the day, and in this AI-powered revolution, where models depend on data that often raises privacy issues, information is always scarce. This lack of available data is coupled with the need for accurate and adequate data to train a fair model. Synthetic data can come to the rescue here. In one example cited by MIT Technology Review, researchers at Data Science Nigeria created synthetic data on African clothing to offset the plethora of datasets on Western clothing. The African dataset and images were created from scratch using AI.

What is synthetic data?

Synthetic data is artificially generated data that reflects real-world data either mathematically or statistically. It has proven itself as an alternative to real data for training models in research. Several algorithms and tools generate synthetic data to create a simulation of reality. When used properly, synthetic data can be a good complement to human-annotated data while keeping a project fast and affordable.
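To make "reflects real-world data statistically" concrete, here is a minimal sketch in Python. It assumes the simplest possible generator, a single Gaussian fitted to one numeric column; the sample values and the function names `fit_gaussian` and `generate_synthetic` are hypothetical and chosen only for illustration. Real tools use far richer models, such as GANs or physics-based simulation.

```python
import random
import statistics

def fit_gaussian(real_values):
    """Estimate the mean and standard deviation of a real numeric sample."""
    return statistics.mean(real_values), statistics.stdev(real_values)

def generate_synthetic(real_values, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the real sample."""
    mu, sigma = fit_gaussian(real_values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical "real" measurements (illustrative only, not from the article).
real = [9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.2]
synthetic = generate_synthetic(real, n=10_000)

# The synthetic sample should mirror the real one in mean and spread.
print(round(statistics.mean(real), 2), round(statistics.mean(synthetic), 2))
print(round(statistics.stdev(real), 2), round(statistics.stdev(synthetic), 2))
```

The synthetic values contain none of the original records, yet their summary statistics track the real sample. The same sketch also shows the limit: the generator can only reproduce what the real sample already contains.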

How will synthetic data reshape AI training?

This fake or artificially created data can train AI in areas where real data is scarce or too sensitive to use. For example, Uber uses synthetic data to validate anomaly detection algorithms and predictions on scarce data. Synthetic data has been used in driverless cars for years, and Forbes and Gartner have both made predictions about what this data means. But what exactly makes it so important for AI?

Deepfakes, biased AI and privacy issues have become a major crisis for AI models. Put simply, models trained on insufficient data will produce incorrect and untrustworthy predictions. On the other hand, the development of GANs and their ability to generate realistic but fake data has facilitated the creation of synthetic data.

Last November, NVIDIA’s Jensen Huang launched the Omniverse Replicator, “an engine for generating synthetic data with ground truth for training AI networks.” In conversation with IEEE, Rev Lebaredian, VP of Simulation Technology and Omniverse Engineering at NVIDIA, said that synthetic data can make AI systems better and even more ethical.

Real data falls short in several respects. First, the information currently available is not all-inclusive. Second, large blocks of data are unusable due to security and privacy concerns. With laws like GDPR in the EU and several bills in the US protecting citizens’ data, engineering teams have limited data with which to train AI models. Synthetic data solves these problems and more. Because it is constructed, it can be created in whatever quantities training requires, already tagged and de-biased. In addition, this data is completely anonymous, which avoids the problems of anonymized personal data that can still be hacked.

Synthetic data also helps train AI models faster and better, because teams can generate datasets quickly. Additionally, once created, this data skips much of the data cleansing and maintenance stage, further saving time and money. As Paul Walborsky, co-founder of one of the first synthetic data services, told NVIDIA, “a single image that might cost $6 at a labeling service can be synthesized for six cents.”

Lebaredian drew on NVIDIA’s own experience to support this claim. First, to train a model to play dominoes with real data, the team would have had to buy hundreds of sets of dominoes, place them in different environments and conditions, capture them with different sensors and lighting, and then label the data. Instead, they trained the model on synthetically generated dominoes, which worked efficiently. Second, Lebaredian pointed to the impossibility of obtaining real data with the accuracy and variety needed to train self-driving cars. “There’s really no getting around it. Without a physically accurate simulation to generate the data we need for these AIs, there’s no way we’re going to move forward,” he said.

Removing the rose-tinted glasses

For all its importance, synthetic data is not an all-encompassing solution to the ethical and quantitative debates surrounding datasets. Synthetic data is only as unbiased as the real dataset on which it is based. It also brings with it the problem of the uncanny valley. Currently, the gap between real and synthetic data limits the real-world performance of machine learning models trained only in simulations.

