Synthetic Data 101

23 Oct 2019

Recently we launched Mindtech’s first product, the Chameleon AI tools and simulator. This first blog post gives a high-level description of what synthetic data is.

Training neural networks requires data. Lots of data. With Chameleon, we are concerned with neural networks for visual processing. Each training example for these networks consists of an image, “The Input”, and an annotation, “The Output”. This data can be “real” or synthetic.

Real data consists of images captured by a camera (or similar recording device), with annotations added manually. The annotations depend on what the network is being trained to do, and may be simple labels, bounding boxes, more complex semantic segmentation (labelling all pixels of a particular object), or anything else appropriate to the desired output.
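
To make the annotation types above concrete, here is a minimal sketch in Python of how one training example might be represented. The names `Annotation`, `TrainingSample` and `mask_to_box` are hypothetical and chosen only for illustration; they are not part of Chameleon or any other library.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical types for illustration only; not a Mindtech or Chameleon API.
# A box is (x_min, y_min, x_max, y_max) in inclusive pixel coordinates.
BoundingBox = Tuple[int, int, int, int]


@dataclass
class Annotation:
    label: str                              # simple class label
    box: Optional[BoundingBox] = None       # bounding box, if the task needs one
    mask: Optional[List[List[int]]] = None  # per-pixel mask (1 = object) for segmentation


@dataclass
class TrainingSample:
    image: List[List[int]]                  # "The Input": greyscale pixels, one list per row
    annotations: List[Annotation] = field(default_factory=list)  # "The Output"


def mask_to_box(mask: List[List[int]]) -> Optional[BoundingBox]:
    """Derive the tightest bounding box around the non-zero pixels of a mask."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))
```

A segmentation mask carries strictly more information than a bounding box, which is why the box can be derived from the mask but not the other way round.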

Synthetic data (for visual networks) consists of images and corresponding annotations that are created by computer, specifically using 3D graphics engines. The synthetic images are rendered from a virtual world that accurately models the real world, including lighting, occlusions and so on. These virtual worlds are built by combining assets created by skilled graphic artists and then adding the required simulation behaviour. The result is an accurate virtual world that can be used to generate the required data (images + annotations) for training neural networks.
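
The idea that the generator produces the image and its annotation together can be shown with a toy sketch. This is not how Chameleon or any 3D graphics engine works: the hypothetical `render_scene` function below simply draws one rectangle on a blank greyscale image, with a random brightness standing in for lighting variation. Because the program placed the object itself, the bounding box and pixel mask are known exactly and need no manual labelling.

```python
import random


def render_scene(width, height, rng):
    """Toy 'simulator': draw one rectangle and return (image, annotation)."""
    # Randomise the object's size, position and brightness.
    w = rng.randint(2, width // 2)
    h = rng.randint(2, height // 2)
    x0 = rng.randint(0, width - w)
    y0 = rng.randint(0, height - h)
    brightness = rng.randint(100, 255)  # stand-in for lighting variation

    image = [[0] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            image[y][x] = brightness
            mask[y][x] = 1

    # The annotation comes from the same parameters used to draw the image.
    annotation = {
        "label": "rectangle",
        "bbox": (x0, y0, x0 + w - 1, y0 + h - 1),  # inclusive pixel coordinates
        "mask": mask,
    }
    return image, annotation


rng = random.Random(0)  # fixed seed so the dataset is reproducible
dataset = [render_scene(32, 24, rng) for _ in range(100)]
```

Each entry in `dataset` is an (image, annotation) pair ready for training; changing the seed or the number of scenes yields as much new labelled data as required.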

In upcoming blog posts, we will consider the key benefits of synthetic data and its advantages over real data.