The Dangers of Synthetic Data: Training AIs with Simulations


A recent VentureBeat report discusses how synthetic data can help improve AI. However, engineers must be extremely careful when using such data because, ultimately, it is not real. Why is collecting data such a challenge, what is synthetic data, and how can it threaten AI training?

Why is collecting data a big challenge for AI?

While the concept of AI has existed for decades, only in the last decade has its use accelerated. One reason is that hardware platforms for accelerating AI algorithms have only recently emerged, meaning older AI algorithms had to run inefficiently on generic CPUs. But by far the most important reason for the sudden explosion in AI is the unfathomable amount of data now available to engineers for training.

Fundamentally, neural networks (the mechanism that powers AI) are surprisingly simple. A neural network consists of functional blocks that take multiple inputs, multiply those inputs by weights, and then add the weighted signals together to produce an output signal. These nodes are stacked in discrete layers that pass their processed signals on to further layers of nodes. By adjusting these weights, the network can be trained to produce the correct outputs for specific inputs. For example, a neural network can be trained to recognize cats, with a full image mapping to one of two outputs: cat or no cat.
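The node behavior described above can be sketched in a few lines of Python. This is a deliberately minimal illustration (the weights, bias, and step activation are invented for the example, not taken from any real model):

```python
def node_output(inputs, weights, bias):
    """One neural-network node: multiply inputs by weights, sum them,
    add a bias, then apply a simple step activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0  # fires (e.g. "cat") or stays silent ("no cat")

# A toy two-input node: 0.9*1.0 + 0.2*(-0.5) - 0.3 = 0.5 > 0, so it fires.
print(node_output([0.9, 0.2], [1.0, -0.5], -0.3))
```

Training is then the process of nudging `weights` and `bias` until the node (and the layers built from many such nodes) fires correctly across the whole dataset.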

In order for an AI to be trained, it needs data, and the more data an AI is given, the better it performs. In the case of the cat example, it is imperative that the AI be shown not just many images of cats, but images of many different breeds. This trains the AI to recognize cats by common traits such as tails, eyes, ears, and body shape.

And it’s this requirement for large, varied, carefully annotated datasets that causes problems for AI developers. Before data protection laws came into force, developers could turn to big tech companies like Facebook, Google, and Amazon for customer data that could include conversations, messages, images, behavior patterns, and browsing history. But the introduction of strict data protection laws (like the GDPR) now prevents companies from obtaining such data without explicit permission. Businesses can work within these limitations by enticing customers to train in-house AI solutions on data they collect directly. A common example is the Captcha test (“select all images with a fire truck”), which trains AI while also providing security.

What is synthetic data?

One solution to the data challenge is to create synthetic data. As the name suggests, synthetic data is artificially generated data that closely resembles real data and often comes from simulations. For example, a program can be designed to create brand-new handwriting styles (such programs already exist), large chunks of text can be written in those styles, and images of that text can then be fed automatically into a machine-learning neural network.

The main advantage of this is that the original text is already known digitally, which means the AI’s output can be compared directly against the known ground truth. Thus, a closed-loop system can improve the AI’s ability to recognize handwriting without it ever seeing real-world examples of handwriting.
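The closed loop can be sketched as follows. Here `render` and `recognize` are hypothetical stand-ins for a real text renderer and handwriting recognizer; the point is only that, because the ground-truth text is known up front, scoring the model needs no human annotation:

```python
import random

def render(text, noise=0.1):
    """Stand-in for rendering text as a synthetic handwriting image;
    here we just corrupt a fraction of characters to simulate distortion."""
    return "".join(c if random.random() > noise else "?" for c in text)

def recognize(image):
    """Stand-in recognizer: passes glyphs through unchanged."""
    return image

def accuracy(truth, prediction):
    """Fraction of characters the recognizer got right vs the known text."""
    return sum(t == p for t, p in zip(truth, prediction)) / len(truth)

random.seed(0)
truth = "the quick brown fox"          # ground truth is known digitally
synthetic_image = render(truth)        # generate the synthetic sample
score = accuracy(truth, recognize(synthetic_image))
print(f"closed-loop accuracy: {score:.2f}")
```

In a real pipeline, the score would drive weight updates, closing the loop: generate, recognize, compare, adjust, repeat.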

Another potential application for synthetic data is autonomous vehicles. For autonomous vehicles to be safe, they must prove that they can function under all conditions, which is very difficult to achieve in real life. Instead, hyper-realistic simulations can be constructed to quickly train a driving AI to navigate traffic, spot potential hazards, and respond to unexpected events such as swerving vehicles, falling trees, and law enforcement stops.

To what extent does synthetic data pose a threat to AI?

A recent VentureBeat article describes the benefits of synthetic data and how it can help AI, and while this is true, engineers should exercise caution. Synthetic data makes it possible to train AI in a controlled, simulated environment, but however closely it resembles real data, it ultimately is not real.

Climate change projections, financial forecasts, even our understanding of the universe: all depend on the accuracy of a model, and it’s startling how easily a model can be manipulated. Adjusting a few constants here and there can drastically alter a model’s output to suit an intended purpose, and that manipulation propagates directly into any AI trained on the model.
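A toy example makes the point. The "model" below is an invented linear trend, not a real climate model; flipping the sign of a single constant reverses the story the data tells, and an AI trained on its output would never know the difference:

```python
def trend(years, base, rate):
    """A trivially simple model: value = base + rate * year."""
    return [base + rate * y for y in years]

years = range(5)
honest = trend(years, 10.0, 0.5)    # rate chosen to show an upward trend
tweaked = trend(years, 10.0, -0.5)  # one sign change reverses the trend

print(honest[-1], tweaked[-1])  # 12.0 vs 8.0: same model, opposite conclusions
```

Real models have far more knobs than this, which makes such manipulation both easier to hide and harder to detect downstream.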

For example, a climate model built from current data that doesn’t support the prevailing consensus can be altered until it does, and a climate AI learning from that model would be fed incorrect information. Another example of a manipulated model is fabricated social media data laced with political bias and stereotypes that do not match real data. An AI trained on it could only see the world through the eyes of the manipulated model, which could be dangerous in real-world scenarios such as account monitoring, content tagging, and automated bans.

In short, an AI is only as good as the data fed to it, and humans have a tendency to use statistics and models to prove their existing beliefs. Synthetic data can be used to help AI learn, but extreme caution should be exercised because, ultimately, synthetic data is not real.
