By Chris Longstaff, Vice-President of Product Management, Mindtech
Here’s a funny thing: although visual AI systems trained on synthetic images are already successfully deployed in numerous domestic, retail, healthcare and industrial applications, one of the most influential arbiters of a technology’s worth is steadfastly listing synthetic data as an early-stage emerging idea that’s yet to really prove its mettle.
So what, precisely, is going on? Why is a mature, revenue-generating technology being depicted as not yet ready for primetime?
Under discussion here is the widely-cited Hype Cycle for Artificial Intelligence published by Gartner Inc, a technology market analyst based in Stamford, Connecticut. In its hype cycle curves, Gartner plots the expectations end users have for a novel technology against a nominal timeline, divided into five major developmental phases that the firm has identified as key to every innovation's lifecycle.
Gartner 2022 Hype Cycle
Those phases begin with an emerging ‘Innovation Trigger’ stage, in which a technology is deemed not yet ready to deliver on its inventor’s hopes. This rises, fueled by advances in capability, plus market and media hype, to a ‘Peak of Inflated Expectations’. After that, however, the curve descends precipitously to the ‘Trough of Disillusionment’, where end users have to start getting real about what’s actually achievable.
All being well, this can then lead to a renewed rise in expectations called the 'Slope of Enlightenment', until finally the technology reaches the sunlit uplands of the 'Plateau of Productivity': the long tail where the real money is made.
Those tongue-in-cheek labels make it tempting to view the Hype Cycle as something of an amusing aside, but it does nevertheless throw light on some stark realities. For instance, “autonomous vehicles” are not yet truly driverless — they all have remote safety drivers — so in placing them in the Trough of Disillusionment, Gartner is absolutely spot on.
Where I take issue, however, is with classifying synthetic data as a single category in its over-hyped 'Peak of Inflated Expectations' segment. This is, frankly, a nonsensical choice, because "synthetic data" is as diverse as "data" itself. With so many different types of data across a whole host of markets, many of them admittedly early stage, it is not prudent to assume they all share the same level of maturity. Even within a narrower category such as "visual synthetic data", we still find a broad sweep of applications at differing levels of maturity, from concept through to deployed.
The picture becomes even more varied once we add in the wider suite of synthetic data in use today. Consider, for instance, the emerging demand from the speech recognition sector for synthetic audio data: a nascent, underdeveloped field right now. The same goes for synthetic infrared, LIDAR and radar data, all of which remain early stage and deservedly sit in the Innovation Trigger phase.
By contrast, the insurance, education, finance and retail arenas are already using artificially-generated tabular datasets, often because privacy regulations like Europe's GDPR can make real-world data unusable, or because live data is biased or simply unavailable. These sectors are deploying models today and could justifiably be placed on the Plateau of Productivity.
Even if Gartner were to split this "Synthetic Data" categorization into subsets, its AI Hype Cycle could still run into trouble, for example by lumping all synthetic images into a single category. In some applications, synthetic images are already used to train highly accurate models deployed in commercial products right now, while in other segments they are still in the research phase. It is important to give potential users an accurate impression of synthetic imagery and the technology's state of readiness.
So it is time commentators and analysts alike became far more specific about which type of synthetic data is on the agenda. Do they mean tabular data, or other non-visual data such as audio, infrared or LIDAR? Or do they mean the visual kind that is out there now, solving safety ambiguities, corner cases and GDPR issues everywhere from shops to hazardous industries, with the huge added advantage to users of being automatically annotated?
It's crucial that people are clear on the type of synthetic data at issue if expectations (the vertical axis on the Hype Cycle graph, remember) are to be set at the right level. Gartner says it expects synthetic data to overshadow real-world data in AI training by 2030, but I expect future releases of the hype curve to break "synthetic data" down into more specific categorizations. With all that said, I believe the overall intent of the hype curve is simply to start conversations, so leave your comments here and let's discuss where you think synthetic data sits on the curve.
Is a single category for Synthetic Data in Gartner’s Hype Cycle appropriate? was originally published in MindtechGlobal on Medium, where people are continuing the conversation by highlighting and responding to this story.