How To Boot Out Irrelevant Training Data for Visual AI — Intelligent Data Engineering

5 Jan 2023


By Chris Longstaff, VP Product Management, Mindtech

Anyone developing a machine vision system knows all too well just how difficult it is to get hold of enough images to train a machine learning network — and that getting hold of enough relevant, accurate data to cover as many critical corner cases as possible is tougher still.

Reducing the time Data Scientists spend on Data curation requires Intelligent Data Engineering

This is why a synthetic data creation platform must proactively help AI engineers and data scientists focus on generating only the most relevant and accurate visual training images, rather than simply creating thousands of irrelevant images, or images that are statistically too similar to one another to give the model any significant new vectors to learn from. Platforms that suffer from these pitfalls require images to be laboriously sorted through manually, frame by frame, to work out what is relevant and what is not. Worse still, irrelevant training data incurs significant penalties in time, cost and the processing power (and electrical power!) required for training.

At Mindtech, we have a term for our speedy, automation-assisted generation of relevant and accurate datasets: intelligent data engineering — and it’s a process that is fully implemented within Chameleon, our synthetic data creation and curation platform.

To understand quite how the platform brings relevance and accuracy to visual AI, let’s consider some example applications. In the safety-critical domain, for instance, imagine that a visual AI is being developed to maintain safe distancing between people and heavy construction machinery — excavators, say — on building sites. We want the system to sound an alarm before an excavator gets too close to someone, but not when it is too close to, say, shadows generated as the sun moves across the sky.

A Synthetic Image created by Chameleon for training Man-Machine “social distance” application

By automatically “driving” the excavator across multiple deterministic, pseudo-random paths in Chameleon’s 3D virtual world, allowing animated characters to move naturally through the environment with environmentally appropriate behaviors, and combining this with changing conditions such as weather and lighting, training images can be obtained of humans at safe and dangerous distances under many different conditions. Chameleon’s automation does most of the heavy lifting here, presenting image options to data scientists, who then apply their human intelligence to decide what is actually a likely real-world risk and what is not. (As a side note, and possibly a topic for a future article, Chameleon’s advanced annotations are also a great help in this use case: the full 3D depth map, pose information and velocities make it possible to understand, for example, whether a person is walking towards or away from a vehicle, how far away they are, and at what speed vehicles and people are moving.)

A typical construction scene

The scene depth map enables accurate calculations of man-machine distance
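To illustrate how a per-pixel depth map makes man-machine distance calculable, here is a minimal sketch assuming a simple pinhole camera model and metric depth values; the function names, intrinsics and pixel coordinates are illustrative assumptions, not Chameleon’s actual API:

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with metric depth into 3D camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def person_machine_distance(person_px, machine_px, depth_map, intrinsics):
    """Euclidean distance between two annotated points, using the scene depth map."""
    fx, fy, cx, cy = intrinsics
    p = backproject(*person_px, depth_map[person_px[1], person_px[0]], fx, fy, cx, cy)
    m = backproject(*machine_px, depth_map[machine_px[1], machine_px[0]], fx, fy, cx, cy)
    return float(np.linalg.norm(p - m))

# Toy example: a flat 5 m depth everywhere and a simple pinhole camera.
depth = np.full((480, 640), 5.0)
dist = person_machine_distance((100, 240), (500, 240), depth,
                               (500.0, 500.0, 320.0, 240.0))  # → 4.0 metres
```

With pose and velocity annotations added per frame, the same back-projected points can be differenced over time to tell whether the person is closing on the vehicle.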

In this way, the automation gently nudges the user into considering multiple risk scenarios, so they quickly and efficiently end up with a cache of accurate, relevant images, without having to waste time sifting out inappropriate ones. A key benefit of synthetic data here is the ability to set an appropriate “capture” rate for images. Typically, the entropy between two consecutive frames at 30 fps is minimal, and most applications don’t benefit from it. By enabling the “activity” rate and the capture rate to be set independently, we ensure that a good degree of entropy between captured images is maintained; for example, we may set the activity rate to 30 fps but the capture rate to only 2 fps.
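The activity-rate/capture-rate split can be sketched as simple frame subsampling; the function name and its assumption of an integer ratio between the two rates are illustrative, not part of Chameleon’s interface:

```python
def capture_frames(activity_fps: int, capture_fps: int, duration_s: int) -> list[int]:
    """Return the simulation-frame indices that get written as training images.

    The scene is animated at activity_fps, but images are captured only at
    capture_fps, so consecutive captures differ enough to be informative.
    """
    if activity_fps % capture_fps != 0:
        raise ValueError("capture rate should divide the activity rate")
    step = activity_fps // capture_fps
    total_frames = activity_fps * duration_s
    return list(range(0, total_frames, step))

# 30 fps activity, 2 fps capture, 1 second of simulation:
frames = capture_frames(30, 2, 1)  # → [0, 15]
```

Over a ten-second run this yields 20 well-separated captures instead of 300 near-duplicate frames, which is where the training-time and cost savings come from.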

Intelligent data engineering also extends to covering as many likely (and unlikely) permutations as possible, so the platform is designed to automatically randomize, for example:

· The types and colors of clothing people are wearing

· The number and orientation of people and vehicles

· Their height, race, gender, skin tone and ethnicity

· The background: grassy, sandy, concrete, tarmac, city, quarry …
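As a rough illustration of this kind of scene randomization, the sketch below samples one scene variant per seed, so any generated image can be reproduced exactly for inspection, matching the deterministic, pseudo-random behaviour described earlier. All parameter names and value lists are hypothetical, not Chameleon’s schema:

```python
import random

CLOTHING = ["hi-vis vest", "dark jacket", "white t-shirt"]
BACKGROUNDS = ["grassy", "sandy", "concrete", "tarmac", "city", "quarry"]
WEATHER = ["clear", "overcast", "rain", "fog"]

def random_scene(seed: int) -> dict:
    """Deterministically sample one scene variant from a seed."""
    rng = random.Random(seed)
    return {
        "n_people": rng.randint(0, 6),
        "n_vehicles": rng.randint(1, 3),
        "clothing": rng.choice(CLOTHING),
        "background": rng.choice(BACKGROUNDS),
        "weather": rng.choice(WEATHER),
        "sun_angle_deg": rng.uniform(0.0, 180.0),
    }

# One seed per image: the same seed always reproduces the same scene.
variants = [random_scene(s) for s in range(1000)]
```

Seeding each variant independently is what makes the randomization auditable: a data scientist who spots a problem image can regenerate exactly that scene from its seed.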

Varying Details such as camera type, person location and clothing help to give required coverage

All these factors, and more, can be changed at speed by the platform’s automation algorithm — giving extensive training image coverage for that particular kind of corner case.

But intelligent engineering is not only about safety-critical applications: imagine a major league vintner launches a new wine brand and pays handsomely for a supermarket chain to stock the bottles on the promotional end-of-aisle displays, known as ‘endcaps’ in the retail trade.

Randomized data for Endcap monitoring — note the missing and erroneous bottles

A visual AI camera, trained on the endcap, could ensure the bottles are displayed properly, label name outwards; that shop staff keep the shelves full; and that any rival brand placed there by mistake is quickly removed. Chameleon’s ability to automatically and quickly generate image permutations, randomly rotating and positioning bottles in 3D on the shelf, allows all orientations of the product to be predicted and imaged, ensuring the training data is as relevant as possible. Here, the platform’s ability to randomize camera positions whilst maintaining a focal point of interest ensures that coverage can be obtained without needing to know precisely where the camera will be located with respect to the shelving unit.
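Randomizing the camera position while keeping a fixed focal point can be sketched as sampling positions on a spherical shell around the target and always aiming back at it; the names, angle limits and parameters below are illustrative assumptions, not Chameleon’s actual controls:

```python
import math
import random

def random_camera(target, radius_range, seed):
    """Place the camera at a random point on a spherical shell around a fixed
    focal point, so every rendered view still frames the target."""
    rng = random.Random(seed)
    r = rng.uniform(*radius_range)
    theta = rng.uniform(0.0, 2.0 * math.pi)                 # azimuth around the shelf
    phi = rng.uniform(math.radians(10), math.radians(80))   # keep the camera above floor level
    pos = (target[0] + r * math.sin(phi) * math.cos(theta),
           target[1] + r * math.cos(phi),
           target[2] + r * math.sin(phi) * math.sin(theta))
    # The view direction always points back at the focal point.
    look = tuple(t - p for t, p in zip(target, pos))
    return pos, look

# A camera 2–4 m from the endcap, always looking at it.
target = (0.0, 1.5, 0.0)
pos, look = random_camera(target, (2.0, 4.0), seed=7)
```

Because the look vector is derived from the sampled position, every random viewpoint still frames the endcap, which is what lets one dataset cover an unknown final camera placement.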

Developers of AI-based vision systems are dealing with hundreds of thousands of images, making manual review and selection impractical. Creating datasets through intelligent data engineering is the only scalable approach.

How To Boot Out Irrelevant Training Data for Visual AI — Intelligent Data Engineering was originally published in MindtechGlobal on Medium.