Using Synthetic Data to resolve common problems for training visual AIs

5 Jan 2023


By Chris Longstaff, VP Product Management, Mindtech Global

Mindtech Chameleon Platform combines UI and CLI to enable rapid training data creation

From retail to law enforcement, and from healthcare to driverless cars, data scientists the world over are developing powerful visual AI applications that are bringing the benefits of deep machine learning networks to a whole swathe of industries.

In Hollywood, for instance, studios train networks to generate hyper-realistic CGI crowd scenes based on the way people move en masse, whilst AI trained on the way people in crowds respond when someone pulls a gun allows police departments to spot signs of lethal weapon threats in CCTV footage.

In retail and hospitality, visual activity recognition is helping firms analyse body language to sense when customers might need instant assistance — and on the streets visual AI systems for future driverless cars are already learning the rules of the road — and the sidewalk.

Despite such compelling use cases, however, trouble is stalking visual AI’s brave, new world. A clutch of problematic, real-world data acquisition issues — collectively amounting to what’s being called a data roadblock — are holding up the advancement of visual AI.

In short, those issues are:

· Acquiring and labelling the required hundreds of thousands (or millions) of high-resolution real-world images takes too long (many months) before a model can be trained;

· Even when labelled, annotation accuracy is variable, and labels are often incomplete or incorrect;


· Emerging privacy legislation, like Europe’s GDPR, is restricting access to many real-world image datasets, some of which have had to be deleted;

· Despite holding useful training datasets on millions of their users, Google, Apple, Amazon and Facebook typically don’t make them commercially available, leaving many companies training their models on the same commercially available datasets — potentially replicating data problems and limiting innovation.

Of all these issues, GDPR is perhaps having the most impact on visual AI, especially as its measures gain popularity and legislative footholds worldwide. In force in Europe since May 2018, it stipulates that personally identifiable image and video data may be used only with the express consent of the people pictured in it — and AI developers, concerned about infringing GDPR, are now extremely wary of using real people in their training data.

The answer to these data roadblocking issues, however, is a relatively simple one: visual AI developers need to augment what real-world data they can acquire with as much synthetic data as they can generate.

So just what is synthetic data?

The “Virtual World”

In the context of visual AI, synthetic data means creating hundreds of thousands of valid training images by realistically simulating an activity in a computer-generated, photorealistic 3D environment — in other words, synthesizing virtual scenes to stand in for real-world data.

In Chameleon, our synthetic data creation platform, users set up the scene in terms of buildings and environments, and then import all the assets relevant to their application — which could be anything: people, bicycles, cars or crowds in which people mill in multiple directions (with collision detection). They then set up the activities, events and “what if” scenarios that will generate enough training images, captured over a series of simulation runs. Finally, they place one or more virtual cameras to capture the scenarios.

Mindtech can supply the customer with the required assets, or they can use the provided Chameleon Asset Manager to import 3D artwork and convert it into smart, annotated objects that can have behaviours assigned for the task at hand (a drone would be assigned the ability to fly, say). Users — data scientists, machine learning teams — simply point and click on the platform to set up paths for mobile assets — vehicles, people, drones, muggers and gunmen for example — to follow. They can even choose the time of day and the weather.
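Chameleon itself is driven point-and-click, but to make the workflow concrete, here is a rough sketch of the same steps in Python. Every name below (the `Asset` and `Scene` classes, their methods, the coordinates) is an illustrative invention for this article, not the real Chameleon API:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for a scene-building workflow: environment,
# assets with assigned behaviours, motion paths, and virtual cameras.

@dataclass
class Asset:
    name: str
    behaviours: list = field(default_factory=list)  # e.g. a drone gets "fly"

@dataclass
class Scene:
    environment: str
    time_of_day: str = "noon"
    weather: str = "clear"
    assets: list = field(default_factory=list)
    cameras: list = field(default_factory=list)

    def add_asset(self, asset, path=None):
        # Attach an asset together with the waypoint path it should follow.
        self.assets.append((asset, path))

    def add_camera(self, position):
        # A virtual camera that will capture images during simulation runs.
        self.cameras.append(position)

# Build a simple street scenario at dusk, in the rain.
drone = Asset("drone", behaviours=["fly"])
pedestrian = Asset("pedestrian")

scene = Scene(environment="city_street", time_of_day="dusk", weather="rain")
scene.add_asset(drone, path=[(0, 0, 10), (50, 20, 15)])       # x, y, z waypoints
scene.add_asset(pedestrian, path=[(5, 0, 0), (5, 30, 0)])
scene.add_camera(position=(10, -5, 3))
```

In a real platform each simulation run would then render the scene from every camera, producing one batch of training images per run.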

You know the really great thing about this deterministic way of creating data? Because a computer generated it, it comes automatically and perfectly annotated (labels, segmentation, 3D pose information and so on), including advanced annotations that are extremely difficult or impossible to produce by hand. In short, it yields a dataset ready to train a machine learning network. And because it’s entirely computer-generated, the data is always 100% GDPR compliant.
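To illustrate what “perfectly annotated” means in practice, here is a hypothetical per-object ground-truth record of the kind a synthetic renderer can emit automatically for every frame. The field names and values are invented for this sketch; they are not Chameleon’s actual export format:

```python
# One object's ground truth for a single rendered frame (illustrative only).
annotation = {
    "object_id": 17,
    "class": "pedestrian",
    "bbox_2d": [312, 140, 388, 402],            # x_min, y_min, x_max, y_max (pixels)
    "segmentation_rle": "...",                   # pixel-perfect mask, run-length encoded
    "pose_3d": {
        "position": [4.2, 0.0, 11.7],            # metres, camera frame
        "rotation": [0.0, 1.57, 0.0],            # radians
    },
    "occluded_fraction": 0.12,                   # known exactly, since the scene is simulated
}
```

Annotations such as exact 3D pose and occlusion fraction are the “advanced” kind that human labellers essentially cannot produce from a 2D photograph, but which fall out of a simulation for free.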

Advanced, 100% Accurate Annotations

What’s vital here, however, is that a synthetic data platform is not so complex that it gets in the way of machine learning engineers and/or data scientists generating truly valid, realistic data that can substitute for real-world training data. That’s why Chameleon has been designed to be a flexible, generic tool that can build scenarios across diverse problem domains — and one that needs no 3D graphics expertise on the part of the user, either.

But there is, as the late Apple co-founder Steve Jobs famously used to say, just one more thing.

AI models are infamous for fragility — throwing up bizarre, unexpected results because they sometimes generalize from incomplete datasets, or because of a fault in the model design. For that reason, a synthetic data platform must be capable — as Chameleon is — of reproducing, at a later time, any dataset it once generated, so that anyone troubleshooting an ML model in development can forensically check the data it was trained on.
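One generic way to achieve that kind of reproducibility is to derive every randomized scenario parameter from a stored seed, so the whole dataset can be regenerated bit-for-bit later. This is a minimal sketch of the idea, not a description of Chameleon’s internals:

```python
import random

def generate_scenario(seed):
    """Deterministically sample scenario parameters from a seed.

    Storing only the seed alongside a trained model is enough to
    regenerate the exact scenario later for forensic debugging.
    """
    rng = random.Random(seed)  # private RNG; independent of global state
    return {
        "weather": rng.choice(["clear", "rain", "fog"]),
        "time_of_day_hours": rng.uniform(0, 24),
        "num_pedestrians": rng.randint(1, 50),
    }

# The same seed always regenerates the identical scenario...
assert generate_scenario(42) == generate_scenario(42)
# ...while different seeds yield different scenarios for dataset variety.
assert generate_scenario(42) != generate_scenario(43)
```

In practice the seed (plus asset and platform version identifiers) would be recorded with each generated dataset, making every training image traceable and re-creatable on demand.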

That key error checking capability ensures those tasked with training AI models can have as much faith in synthetic data as they currently do in real-world data — perhaps even more.

Using Synthetic Data to resolve common problems for training visual AIs was originally published in MindtechGlobal on Medium.