When ‘Data Drift’ Sets In, Synthetic Data is the only solution for rapid responses

31 Jan 2023

By Chris Longstaff, Vice-President of Product Management, Mindtech

It might seem logical to imagine that, after successfully training a visual AI network, data scientists could simply deploy their system in the wild, put their feet up, and reap the benefits of their machine learning technology in perpetuity. Unfortunately, that is not the case.

It is rarely possible to deploy a visual machine learning network and then just forget about it. One very good reason is that we live in a dynamic world, and the salient features of the objects that visual models are trained to recognise will, almost without exception, change over time.

Fashions (data) change over time, causing issues for ML networks, which must address this data drift.

Called “data drift”, this inevitable change is a phenomenon that has long stalked the world of visual AI, and one which data scientists must constantly guard against.

The Covid-19 pandemic provided one of the most striking examples of the way data drift can quickly hobble visual AI models. When people suddenly started wearing face coverings in public (everything from elaborately patterned washable cloth masks to fabric surgical masks, medical-grade filtered N95 masks and plastic full-face visors), many visual AI systems designed to recognise faces simply malfunctioned.

The rapid arrival of face masks caused severe issues for ML-based vision systems.

And it was not just biometric devices and building entry systems that were affected: one system we’ve been made aware of at Mindtech was designed to assess the number of people visiting buildings like shopping malls and sports grounds by counting the number of faces going in. It failed when faced with the newly masked hordes. As a result, the model needed retraining with synthetic images of people wearing all kinds of novel masks and facewear.

Data drift can only be rapidly addressed using synthetic data. Mindtech’s Chameleon Datapacks are training-ready sets of annotated images and sequences that enable rapid retraining of deployed ML networks.

Elsewhere in the visual AI object recognition arena, models also fail when retail product packaging changes markedly, or when vehicle makers innovate with bizarre designs or paint jobs. Think of the Tesla Cybertruck, for instance, whose sharp, angular lines make it look more like a grounded stealth fighter jet, or the recent trend for “wrapping” vehicles in mirror-type materials. New clothing fashions can also change markedly, perhaps skewing the accuracy of models supposed to recognise people, though thankfully these trends tend to be slower.

The sudden trend for “wrapped” cars makes them difficult to detect accurately. They are not only an example of data drift but also a corner case: a “double whammy”.

Or consider the emerging cashier-less supermarkets, where shoppers simply take what they want off the shelves and image recognition bills their accounts. Here, heavily altered product designs could lead to model failure. Your cornflakes may scan fine one week, but the next week, when they arrive in a special discounted pack or a seasonal “Christmas special” pack peppered with loudly coloured, behatted Santas and elves, the model fails because the data has drifted. This could also affect automatic stock analysis robots, which travel up and down the shop’s shelving, assessing which stock items need replenishing.

The same product can have frequent changes to packaging for discounts, seasonal variations, etc., which can cause havoc with automated checkout systems.

So how should people in the business of visual AI tackle this data drift? What’s absolutely key is that data scientists, and the companies that deploy their models, understand that those models are not, and cannot be, “fit and forget”. The universe of possible inference data will shift, and they must be alert to those changes, aware of data drift, and prepared to act on it by augmenting and even replacing training data. Data drift really plays to the strengths of synthetic images: they are rapid to develop, privacy compliant, and produce optimum results when working alongside real-world data. In the face detection example above, you may have trained your network on real-world data; augmenting it with synthetic face mask data brings together the best of both forms of data.
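Being alert to shifts in inference data can start with something as simple as comparing the confidence scores a model produced at validation time with the scores it produces in production. The sketch below is one possible approach, not a method described by Mindtech: it computes a Population Stability Index (PSI) over two samples of detection confidences. The function names, the ten-bin histogram and the 0.25 alert threshold (a commonly quoted rule of thumb) are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence

# Assumed alert level: PSI above 0.25 is often treated as a significant shift.
DRIFT_THRESHOLD = 0.25


def _bin_fractions(scores: Sequence[float], bins: int, eps: float) -> List[float]:
    """Return the fraction of scores in each equal-width bin over [0, 1]."""
    if not scores:
        raise ValueError("need at least one score")
    counts = [0] * bins
    for score in scores:
        # Clamp so that out-of-range values and a score of exactly 1.0
        # still land in a valid bin.
        index = min(max(int(score * bins), 0), bins - 1)
        counts[index] += 1
    # Floor each fraction at eps so the logarithm below is always defined.
    return [max(count / len(scores), eps) for count in counts]


def population_stability_index(
    reference: Sequence[float],
    recent: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between reference and recent confidence scores (0.0 means identical)."""
    ref = _bin_fractions(reference, bins, eps)
    cur = _bin_fractions(recent, bins, eps)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def drift_suspected(
    reference: Sequence[float],
    recent: Sequence[float],
    threshold: float = DRIFT_THRESHOLD,
) -> bool:
    """Flag a possible drift when the PSI exceeds the chosen threshold."""
    return population_stability_index(reference, recent) > threshold
```

A rising PSI only says that the score distribution has moved, not why. In practice it would be a prompt to inspect recent images and decide whether new or synthetic training data is needed.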

There’s also a flipside to this: a whole class of data you’ve trained your network on may have become irrelevant and need removing, perhaps because a line of products in a cashier-less store, say, has been discontinued. So regularly reviewing the model’s training data, to ensure it is relevant to the task today, is vital.
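As a rough illustration of that kind of review, the sketch below filters a list of annotation records so that objects belonging to retired classes are removed. The record layout (an "image" name plus a list of "objects" with a "label") and the function name are hypothetical, not the format of any particular tool. Dropping images that end up with no remaining objects is also an assumption, since some projects prefer to keep them as negative examples.

```python
from typing import Any, Dict, Iterable, List, Set


def prune_retired_classes(
    annotations: Iterable[Dict[str, Any]],
    retired: Set[str],
) -> List[Dict[str, Any]]:
    """Return new records without objects whose label is in `retired`."""
    kept: List[Dict[str, Any]] = []
    for record in annotations:
        # Keep only the objects whose class is still relevant.
        objects = [obj for obj in record["objects"] if obj["label"] not in retired]
        if objects:
            # Build a new record so the input data is left unmodified.
            kept.append({**record, "objects": objects})
    return kept
```

Because the function returns new records rather than editing the input, the original annotations remain available if a retired product line is later reintroduced.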

Creating synthetic images, or deciding if any now-irrelevant data class needs removing, is made all the easier with a strong synthetic data management and curation tool. Mindtech’s Chameleon platform not only helps us understand what data classes have already gone into a project, but also lets users curate thousands of new images with vast variability on facets like lighting angles, skin tones and backgrounds.

So, in conclusion: be aware of data drift, be prepared to act on it quickly, and realise that synthetic data is quite likely to be the only way to update your models cost-effectively and rapidly. “If it ain’t broke, don’t fix it” is a well-worn phrase in engineering circles, but data drift makes visual AI models break invisibly, so eternal vigilance is the watchword here.

When ‘Data Drift’ Sets In, Synthetic Data is the only solution for rapid responses was originally published in MindtechGlobal on Medium, where people are continuing the conversation by highlighting and responding to this story.