Recently there has been a fair amount of discussion across the industry regarding the increased use of Synthetic Data in digital advertising. As a result, the IAB Australia Data Council decided that we should look to inform and support industry better regarding its usage by providing some simple definitions and best practices – hence this quick explainer.
A separate collaborative Q&A article will also follow shortly, with members of our Data Council answering some related questions and adding more details and examples as to how Synthetic Data can be leveraged responsibly and effectively in digital advertising here in Australia.
As soon as this follow-up Q&A article is published we will link it into this article.
We have also created a visual explainer of this article in the downloads at the bottom of the page.
What is Synthetic Data?
Synthetic data refers to artificially generated data that mimics real-world data in terms of its characteristics, patterns, and statistical properties – without containing any actual personal or sensitive information.
In contrast to real-world data (which originates from observations of natural occurrences such as customer interactions, sensor readings, or financial transactions), synthetic data is created by algorithms trained on real-world data sets, using simulations and, increasingly, generative AI models (such as Generative Adversarial Networks or Large Language Models). These are designed to preserve the utility of real data whilst mitigating privacy concerns.
The use of synthetic data has been rapidly gaining traction across various industries for a number of years now, and it has been in use for some time in areas such as computer games (e.g. flight simulators) as well as in scientific simulations of everything from atoms to far-flung galaxies.
More recently, businesses in sectors such as automotive, finance and retail have been leveraging synthetic data to improve the work of robots, drones, factories, hospitals and scientists.
For instance, to better optimise how it makes cars, BMW created a virtual factory using a large simulation platform; the data from this simulation lets BMW fine-tune how assembly workers and robots work together to build cars efficiently. In logistics, Amazon Robotics uses synthetic data to train robots to identify packages of varying types and sizes. Food and beverage giant PepsiCo employs an ‘omniverse replicator’ to generate the synthetic data it uses to train AI models, making its operations more efficient.
Healthcare providers in fields such as medical imaging use synthetic data to train AI models while protecting patient privacy, creating diagnostic models of thousands of simulated medical cases. In finance, American Express has for some time now studied ways to use Generative Adversarial Networks (GANs – which are deep learning models) to create synthetic data to help refine its ML & AI models that detect fraud.
The adoption is only increasing. In a report on synthetic data (Maverick Research: Forget About Your Real Data – Synthetic Data Is the Future of AI) Gartner predicted that by 2030 most of the data used in AI will be artificially generated by rules, statistical models, simulations and other techniques.
image source: Gartner
Generally there are four overall forms of synthetic data:
Synthetic structured data – represents individuals, products and other entities and their activities or attributes, for example customers and their purchasing habits, or patients and their symptoms, medications and diagnoses. Synthetic structured data is often sub-categorised again, depending upon its make-up and approach, into fully synthetic, partially synthetic, rule-based and hybrid.
Synthetic images – are crucial for training object detection, image classification and segmentation. This is very useful for early cancer detection, drug discovery and clinical trials, or teaching self-driving cars. Synthetic images can be used for rare edge cases where little data is available, like horizontally oriented traffic signals.
Synthetic text – can be tailored to enable robust, versatile natural language processing (NLP) models for translation, sentiment analysis and text generation for applications such as fraud detection and stress testing.
Synthetic time series data (including sensor data) – can be used in radar systems, IoT sensor readings, and light detection & ranging. It can be valuable for predictive maintenance and autonomous vehicle systems, where more data can ensure improved safety and reliability.
To create synthetic data, data scientists can use various generation tools and techniques. To start exploring approaches to creating your own, consider testing the free open-source resources available from the Synthetic Data Vault (SDV) Project, which was created by MIT’s ‘Data to AI Lab’ in 2016 and is a large ecosystem for synthetic data generation and evaluation.
image source: Synthetic Data Vault (SDV)
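As a rough illustration of the "statistical model" style of generation (and not a description of how SDV or any other tool works internally), the sketch below fits a simple two-column Gaussian model to a tiny, made-up "real" table and then samples new rows from it. The column meanings (age and monthly spend), the sample values and the function names `fit_gaussian` and `sample_synthetic` are all assumptions for illustration; production tools handle many more columns, data types and relationships.

```python
import math
import random
import statistics

# Hypothetical "real" records: (age, monthly_spend). All values are made up.
real_rows = [(23, 40.0), (31, 55.0), (38, 62.0), (45, 80.0), (52, 75.0), (60, 95.0)]


def fit_gaussian(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(model, n, seed=0):
    """Draw n new rows from the fitted statistics rather than copying records."""
    rng = random.Random(seed)
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mixing z1 into y reproduces the correlation between the two columns.
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


model = fit_gaussian(real_rows)
synthetic_rows = sample_synthetic(model, n=5000, seed=1)
refit = fit_gaussian(synthetic_rows)
print(f"real rho={model['rho']:.2f}  synthetic rho={refit['rho']:.2f}")
```

The synthetic rows are drawn from the fitted model, so aggregate properties such as the averages and the age/spend correlation should stay close to the source. A sketch this simple can also produce unrealistic values (for example negative spend), which is one reason generated data needs review before use.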
How is Synthetic Data used in Digital Advertising?
In marketing, consumer insights and behavioural predictions have long played critical roles. By generating synthetic datasets, businesses can simulate customer interactions, test market scenarios, and refine strategies. Critically, this can be done without relying solely on real-world data collection, which is limited by privacy regulations and accessibility concerns as traditional data collection and activation methods face increased legal and ethical scrutiny.
Beyond compliance, synthetic data also enhances data availability and quality. Real-world data is often incomplete, biased, or skewed due to limitations in collection methods. Synthetic data provides consistent and controlled environments from which businesses can generate high-quality datasets tailored to specific research needs, enabling brands to create detailed customer journey simulations. This helps to better understand and predict how consumers will interact with content, ads, and products – leading to more effective personalisation strategies.
By leveraging synthetic data, companies can refine their strategies based on simulated customer reactions before launching actual campaigns. This reduces the chances of failure, mitigates risk and helps ensure that marketing spend is as efficient and effective as possible.
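To make the idea of testing against simulated customer reactions more concrete, here is a minimal sketch that compares two ad variants on a synthetic audience. Everything in it is assumed for illustration: the segment names, audience shares, baseline click probabilities and per-variant effects are invented numbers, and `simulate_ctr` is a hypothetical helper rather than part of any advertising platform.

```python
import random

# Hypothetical audience segments: share of audience and baseline click probability.
SEGMENTS = {
    "bargain_hunters": {"share": 0.5, "base_ctr": 0.020},
    "loyalists": {"share": 0.3, "base_ctr": 0.035},
    "browsers": {"share": 0.2, "base_ctr": 0.010},
}

# Hypothetical multiplier each ad variant applies to click probability, per segment.
VARIANT_EFFECT = {
    "A": {"bargain_hunters": 1.0, "loyalists": 1.0, "browsers": 1.0},
    "B": {"bargain_hunters": 1.5, "loyalists": 0.9, "browsers": 1.0},
}


def simulate_ctr(variant, impressions, seed=0):
    """Simulate impressions on a synthetic audience and return the click-through rate."""
    rng = random.Random(seed)
    names = list(SEGMENTS)
    shares = [SEGMENTS[name]["share"] for name in names]
    # Assign each synthetic impression to a segment according to its audience share.
    audience = rng.choices(names, weights=shares, k=impressions)
    clicks = 0
    for segment in audience:
        p_click = SEGMENTS[segment]["base_ctr"] * VARIANT_EFFECT[variant][segment]
        if rng.random() < p_click:
            clicks += 1
    return clicks / impressions


ctr_a = simulate_ctr("A", impressions=200_000, seed=1)
ctr_b = simulate_ctr("B", impressions=200_000, seed=2)
print(f"variant A CTR: {ctr_a:.4f}, variant B CTR: {ctr_b:.4f}")
```

The output is only as trustworthy as the assumed click probabilities, so in practice those inputs would need to be calibrated against, and regularly re-checked with, real campaign results.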
Below are some core areas in which synthetic data is being used in marketing efforts:
Improved Ad Targeting – train machine learning algorithms more effectively by simulating real customer behaviours and preferences. Buyers can build more accurate predictive models to target ads to the most relevant audiences.
Enhanced Personalisation – create diverse and realistic customer profiles to personalise campaigns more effectively. Understanding different customer segments and preferences enables tailored messages and relevant offers for specific audiences.
Scaling Target Audiences – expand target audiences by simulating new customer segments and/or demographics, identifying potential untapped markets and developing strategies to reach them more effectively. Synthetic data can also replace real user data in analytics and lookalike modelling, where synthetic profiles mimic high-value customers to expand reach without compromising privacy.
Testing & Optimisation – more effectively conduct A/B testing and ‘market matching’ research without having to risk the privacy of real customers. Marketers can simulate more diverse scenarios and variations to identify the most effective strategies before deploying them to the market. Leveraging synthetic datasets can also create virtual audience segments which simulate user interactions with ads (clicks, conversions, impressions) to predict campaign performance – enabling advertisers to test multiple ad variations in controlled, synthetic environments and refine strategies before launching campaigns.
Data Privacy Compliance – better comply with data privacy regulations (such as GDPR) and ensure conformity with local privacy laws. Synthetic data is used to train ad targeting and recommendation algorithms when real data is limited, sensitive, or restricted. It allows advertisers to simulate user behaviours, preferences, and demographics to optimise ad campaigns without risking exposure of any personal data.
Data Augmentation – when real datasets are too small or incomplete, synthetic data augments them to improve the robustness of predictive models. This is particularly useful for niche markets or emerging platforms with limited user data. Also, generating synthetic data is often more cost-effective than collecting and maintaining large datasets of real customer information.
Fraud Detection & Anomaly Testing – synthetic data is used to simulate fraudulent activities (e.g. fake clicks or bot traffic) to train systems to detect and prevent ad fraud. It helps identify anomalies in ad performance metrics without exposing real campaign data.
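The fraud detection use case in the list above can be sketched in a few lines. The example generates labelled synthetic sessions for humans and bots, fits a very simple detector on them, and checks it on a second synthetic set. The click-rate distributions, the single `clicks_per_minute` feature and the helper names are assumptions chosen to keep the sketch short; real bot traffic is far more varied and real detectors use many more signals.

```python
import random
import statistics


def synthetic_sessions(n_human, n_bot, seed=0):
    """Generate labelled (clicks_per_minute, is_bot) sessions from assumed distributions."""
    rng = random.Random(seed)
    humans = [(max(0.0, rng.gauss(2.0, 1.0)), False) for _ in range(n_human)]
    bots = [(max(0.0, rng.gauss(25.0, 5.0)), True) for _ in range(n_bot)]
    sessions = humans + bots
    rng.shuffle(sessions)
    return sessions


def fit_threshold(sessions):
    """Place the decision threshold midway between the two class averages."""
    human_mean = statistics.mean(rate for rate, is_bot in sessions if not is_bot)
    bot_mean = statistics.mean(rate for rate, is_bot in sessions if is_bot)
    return (human_mean + bot_mean) / 2


def accuracy(sessions, threshold):
    """Share of sessions where 'rate above threshold' agrees with the bot label."""
    correct = sum((rate > threshold) == is_bot for rate, is_bot in sessions)
    return correct / len(sessions)


train = synthetic_sessions(2000, 2000, seed=1)
holdout = synthetic_sessions(500, 500, seed=2)
threshold = fit_threshold(train)
holdout_accuracy = accuracy(holdout, threshold)
print(f"threshold={threshold:.1f} clicks/min, holdout accuracy={holdout_accuracy:.3f}")
```

No real campaign data is touched at any point, which is the appeal of the approach. The high accuracy here mostly reflects how cleanly the two assumed distributions separate, so it should not be read as an expectation for real traffic.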
Benefits of Synthetic Data
The overall benefits as per the examples above are related to privacy preservation, scalability, and cost efficiency.
- Privacy & Security – Synthetic data can help businesses comply with data privacy regulations and protect sensitive customer data, as decisions do not have to rely on authentic customer data. Avoiding the privacy issues and legal complications that come with real-world data means there are fewer hurdles for companies to clear before they can use data.
- Scalability & Diversity – Synthetic data can be generated in large volumes, providing more opportunities for testing and training machine learning models. With competent algorithms, a training model and an output generator, one can create a practically unlimited volume of synthetic data for ongoing use.
- Efficiency & Effectiveness – Generating synthetic data sets can be much more cost-efficient than collecting and managing real-world data, as it doesn’t require the same resources, time, or effort. The resulting cost savings allow businesses to be more effective and invest resources in other areas.
There is also an argument that synthetic data can help to reduce bias by providing controlled, diverse, and representative datasets when training AI models. However, this does require competent ongoing monitoring and evaluation as bias is one of the potential areas of risk, as highlighted below.
Considerations when using Synthetic Data
The overall effectiveness of synthetic data always depends upon how well it is integrated with real-world insights, as concerns remain about how valid generated data proves to be when experts compare it with real-world data.
Brands can benefit from the efficiency and scale of synthetic data whilst maintaining the authenticity of real market insights. In short, it should be used thoughtfully, with continuous validation and refinement by competent staff, and only as a complementary capability rather than a direct replacement for more traditional methods.
Lastly, synthetic data generation tools and techniques are ever evolving, which means there will always be room for improvement in accuracy and efficiency. Some considerations are below:
- Quality & Realism – poorly generated synthetic data may lack the nuance of real data, leading to inaccurate models.
- Bias Amplification – if the underlying real data or algorithms are biased, synthetic data may perpetuate those biases.
- Validation – ensuring synthetic data accurately represents real-world scenarios requires rigorous testing.
- Computational Effort & Talent Costs – generating high-quality synthetic data can be resource-intensive and require specialised experts to manage and oversee its usage.
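One simple way to approach the validation point above is to compare the distribution of a metric in the real data with the same metric in the synthetic data. The sketch below computes a two-sample Kolmogorov-Smirnov statistic by hand for that purpose. The "real" and synthetic samples are randomly generated stand-ins, the helper name `ks_statistic` is hypothetical, and a single-metric check like this is only one part of a proper validation process.

```python
import random


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distributions (0 = identical)."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        value = min(a[i], b[j])
        # Step both samples past the current value, then compare cumulative shares.
        while i < len(a) and a[i] <= value:
            i += 1
        while j < len(b) and b[j] <= value:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


rng = random.Random(7)
real = [rng.gauss(50, 10) for _ in range(2000)]        # stand-in for a real metric
good_synth = [rng.gauss(50, 10) for _ in range(2000)]  # generator matches the source
poor_synth = [rng.gauss(60, 10) for _ in range(2000)]  # generator is shifted upwards

ks_good = ks_statistic(real, good_synth)
ks_poor = ks_statistic(real, poor_synth)
print(f"good generator: {ks_good:.3f}, poor generator: {ks_poor:.3f}")
```

A value near 0 suggests the synthetic metric follows the real distribution closely, while a larger value flags drift that should be investigated. What counts as acceptable is a judgment call for the team and depends on the use case.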