Conquering the World through Simulation: The Rise of Synthetic Data in AI



Would you trust AI that was trained on synthetic data rather than real data? You may not know it, but you probably already do, and that is fine, according to a newly published survey.

The lack of high quality, domain-specific data sets for testing and training AI applications has led teams to look for alternatives. Most in-house approaches require teams to collect, collate, and annotate their own DIY data – which further increases the potential for bias, inadequate edge-case performance (i.e. poor generalization), and data breaches.

A solution already seems to be in sight: advances in synthetic data. This computer-generated, realistic data addresses virtually every item on the list of business-critical problems teams currently face.

That is the gist of the introduction to “Synthetic Data: Key to Production-Ready AI in 2022”. The survey results are based on responses from people who work in the computer vision industry, but they are of broader interest. First, because a wide range of markets depend on computer vision, including extended reality, robotics, intelligent vehicles, and manufacturing. Second, because the approach of generating synthetic data for AI applications could be generalized beyond computer vision.

Lack of data kills AI projects

Datagen, a company specializing in simulated synthetic data, recently commissioned Wakefield Research to conduct an online survey of 300 computer vision professionals to better understand how they obtain and use AI/ML training data for computer vision systems and applications, and how those choices affect their projects.

The reason people turn to synthetic data for AI applications is clear. Training machine learning models requires high quality data that is not easy to obtain. That seems like a commonly shared experience.

Ninety-nine percent of respondents said they had seen an ML project abandoned completely due to insufficient training data, and 100% said they had experienced project delays for the same reason.

What is less clear is how synthetic data can help. Gil Elbaz, Datagen CTO and co-founder, can relate to this. When he first started using synthetic data in 2015, as part of his second degree at the Technion (Israel Institute of Technology), his focus was on computer vision and 3D data using deep learning.

Elbaz was surprised to see how well synthetic data worked: “It seemed like a hack, like something that shouldn’t work, but works anyway. It was very, very counter-intuitive,” he said.

However, after seeing this in practice, Elbaz and his co-founder Ofir Chakon saw an opportunity in it. In computer vision, as in other AI application areas, data must be annotated before it can be used to train machine learning algorithms. This is a very labor-intensive, bias-prone and error-prone process.

“You go out and take pictures of people and things on a large scale and then send them to manual annotation companies. This is not scalable and makes no sense. We have focused on solving this problem with a technological approach that meets the requirements of this growing industry,” said Elbaz.

Datagen started out working in garage mode, generating data through simulation. By simulating the real world, the team was able to create data that teaches AI to understand the real world. Convincing people that this worked was an uphill battle, but today Elbaz feels vindicated.

According to the survey results, 96% of teams say they use synthetic data to some extent to train computer vision models. Interestingly, 81% use synthetic data in proportions equal to or greater than manual data.

Synthetic data, according to Elbaz, can mean many things. Datagen focuses on so-called simulated synthetic data. This is a subset of synthetic data that focuses on 3D simulations of the real world. Virtual images captured in this 3D simulation are used to create fully labeled visual data that can then be used to train models.
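As a toy illustration of why simulation yields fully labeled data, the sketch below "renders" a tiny grayscale image containing one bright rectangle and returns the bounding box used to draw it. Everything in it is an assumption made for illustration (the names `render_sample` and `bbox_from_pixels`, the 32x32 image size, a rectangle standing in for a 3D scene); it is not Datagen's pipeline. The only point is that the label is known by construction instead of being annotated afterwards.

```python
import random


def render_sample(rng, width=32, height=32):
    """Draw one bright rectangle on a dark image and return (image, bbox).

    bbox is (x_min, y_min, x_max, y_max) in inclusive pixel coordinates.
    """
    box_w = rng.randint(4, width // 2)
    box_h = rng.randint(4, height // 2)
    x_min = rng.randint(0, width - box_w)
    y_min = rng.randint(0, height - box_h)
    x_max = x_min + box_w - 1
    y_max = y_min + box_h - 1

    image = [[0] * width for _ in range(height)]
    for y in range(y_min, y_max + 1):
        for x in range(x_min, x_max + 1):
            image[y][x] = 255
    return image, (x_min, y_min, x_max, y_max)


def bbox_from_pixels(image):
    """Recover the tight bounding box of the bright pixels, to check a label."""
    ys = [y for y, row in enumerate(image) if any(row)]
    xs = [x for row in image for x, value in enumerate(row) if value]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical data set: 100 images, each paired with its exact label.
rng = random.Random(0)
dataset = [render_sample(rng) for _ in range(100)]
```

Because the generator places the rectangle itself, every label agrees with the pixels exactly; `bbox_from_pixels` is only there to verify that claim.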

Simulated synthetic data to the rescue

There are two reasons this works in practice, Elbaz said. The first is that AI is really data-centric.

“Suppose we have a neural network, for example to recognize a dog in a picture. So it takes in 100 GB of dog pictures. It then produces a very specific output: a bounding box showing where the dog is in the picture. It’s like a function that maps the picture to a specific bounding box,” he said.

“The neural networks themselves only weigh a few megabytes, and they actually compress hundreds of gigabytes of visual information and extract only what is needed. So if you look at it that way, then the neural networks themselves are less interesting. The interesting thing is actually the data.”
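To make the scale of that compression concrete, here is a rough back-of-the-envelope sketch. The 100 GB figure comes from the quote above; the 5 MB model size is an assumed stand-in for "a few megabytes", and the type aliases and the `compression_ratio` helper are hypothetical names introduced only for this illustration.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]             # grayscale pixels, one list per row
BBox = Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max)
Detector = Callable[[Image], BBox]  # the "function" from picture to box


def compression_ratio(training_data_gb: float, model_size_mb: float) -> float:
    """Return how many times larger the training data is than the model.

    Assumes 1 GB = 1024 MB.
    """
    return training_data_gb * 1024 / model_size_mb


# 100 GB of training images versus an assumed 5 MB of model weights.
ratio = compression_ratio(training_data_gb=100, model_size_mb=5)
print(f"training data is {ratio:,.0f}x larger than the model")
```

Under these assumed numbers the data is roughly twenty thousand times larger than the weights, which is the sense in which the model keeps only what it needs.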

So the question is how do we create data that best represents the real world? According to Elbaz, this is best achieved by generating simulated synthetic data using techniques such as GANs.

Using GANs is one possibility, but according to Elbaz it is very difficult to create new information by simply training an algorithm on a given data set and then using it to generate more data. It does not work because the original data set limits how much information can be represented.

What Datagen does, and what companies like Tesla do, is create a simulation that focuses on understanding people and their surroundings. Instead of collecting videos of people doing things, they collect high-quality information that is decoupled from the real world. It’s an elaborate process that involves collecting high-quality scans and motion capture data from the real world.

The company then scans objects and models procedural environments, thus generating information that is decoupled from the real world. The magic lies in connecting it all at scale and making it available to the user in a controllable, simple way. Elbaz described the process as a combination of directional aspects and the simulation of aspects of real-world dynamics via models and environments such as game engines.

It’s a tedious process, but it looks like it works. And it is particularly valuable for edge cases that are otherwise difficult to obtain, such as extreme scenarios in autonomous driving. It is very important to have data for these edge cases.

The multi-million dollar question, however, is whether synthetic data generation could be generalized beyond computer vision. There is not a single AI application domain that is not data hungry and would not benefit from additional, high quality data representing the real world.

In answering this question, Elbaz addressed unstructured and structured data separately. Unstructured data such as images or audio signals can largely be simulated. Text, which is viewed as semi-structured data, and structured data such as tabular data or medical records are a different matter. But there, too, says Elbaz, we see a lot of innovation.

Many startups focus on tabular data, mostly with privacy in mind, because using tabular data raises privacy concerns. That is why there is work on simulating data from an existing data pool without expanding the amount of information it contains. Synthetic tabular data is used to create a privacy compliance layer on top of the existing data.
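A minimal sketch of the idea, under loud assumptions: the code below fits a mean and standard deviation to each numeric column of a made-up "real" table and samples new rows from those fitted distributions. The column meanings (basket value, items per order), the helper names `fit_columns` and `sample_rows`, and the independent Gaussian model are all hypothetical. Real synthetic-data tools use far richer models, and this toy version ignores correlations between columns and offers no formal privacy guarantee. It only shows that the synthetic table mirrors statistics already present in the source data without adding new information.

```python
import random
import statistics


def fit_columns(rows):
    """Estimate (mean, standard deviation) for each numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def sample_rows(params, n, rng):
    """Draw n synthetic rows, sampling every column independently."""
    return [
        tuple(rng.gauss(mu, sigma) for mu, sigma in params)
        for _ in range(n)
    ]


rng = random.Random(42)

# Made-up "real" table: (basket value, items per order) for 500 orders.
real_rows = [(rng.gauss(80.0, 15.0), rng.gauss(3.0, 1.0)) for _ in range(500)]

params = fit_columns(real_rows)
synthetic_rows = sample_rows(params, 500, rng)
synthetic_params = fit_columns(synthetic_rows)
```

Comparing `params` with `synthetic_params` shows the column statistics are close, while no synthetic row is a copy of a real one.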

Synthetic data can be shared with data scientists around the world so they can start training models and building insights without actually accessing the underlying real-world data. Elbaz expects this practice to become more widespread, for example in scenarios such as training personal assistants, as it eliminates the risk of using personal data.

Dealing with bias and privacy

Another interesting side effect of using synthetic data that Elbaz identified is the elimination of bias and higher-quality annotation. Bias creeps in with manually annotated data, be it due to different views among the annotators or the inability to effectively annotate ambiguous data. With synthetic data generated by simulation, this is not a problem, as the data comes perfectly and consistently pre-annotated.

Beyond computer vision, Datagen would like to extend this approach to audio, as the guiding principles are similar. Apart from synthetic substitute data for privacy and video and audio data that can be generated by simulation, is there a chance we will ever see synthetic data used in scenarios such as e-commerce?

Elbaz believes this could be a very interesting use case that an entire company could be built around. Both tabular data and unstructured behavioral data would need to be combined – for example, how consumers move the mouse and what they do on the screen. But there is a tremendous amount of information about shopper behavior and it should be possible to simulate interactions on e-commerce sites.

This could benefit the product people who optimize e-commerce sites, and it could also be used to train predictive models. Caution is advised in this scenario, as the e-commerce use case is closer to the GAN-generated data approach: it resembles structured synthetic data more than unstructured data.

“I think you won’t be creating any new information. For example, you can make sure you have a privacy-compliant version of the Black Friday data. The goal there would be for the data to represent the real data as faithfully as possible without compromising customer privacy. And then you can delete the real data at some point. So you would have a substitute for the real data without having to track customers in an ethically borderline way,” said Elbaz.

The bottom line is that while synthetic data can be very useful and is increasingly being adopted in certain scenarios, its limitations should also be clear.

