Data Quality and Bias
Similarities with the Original Data
Although synthetic data may differ considerably from the real data from which it was created, it will still resemble the original data to at least some extent. This resemblance is the aim of creating these datasets, but it also means that issues in the original data may be mirrored in the synthetic version. In particular, biases that exist within the original data may be reproduced in the synthetic data. Both real and synthetic data may contain hidden biases. Whilst some bias may be easy to find and mitigate, researchers should be aware of contextual, social, and historical biases which may exist within the original dataset and which may, in turn, be reproduced in the synthetic dataset. The same applies to data quality: the synthetic data are likely to mirror the real-life data, which may not be of sufficient quality.
Differences from the Original Data
In contrast, synthetic data will always differ from the real data from which it was created (except where synthetic rows happen, by coincidence, to replicate rows in the real data exactly). Whilst this difference is necessary to enhance privacy, synthetic data may differ in very important ways, which may in turn affect the quality and validity of the data. Synthetic data should be expected to contain errors and differences, so if high-quality data are required, it may be necessary to use the real data instead. This will depend on the aims of the research and the needs of the project.
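As an illustration of the coincidental replication mentioned above, the following minimal sketch counts how many synthetic rows also appear verbatim in the real data. The function name `count_exact_matches` and the toy records are hypothetical, and the sketch assumes that rows can be compared as simple tuples of values.

```python
def count_exact_matches(real_rows, synthetic_rows):
    """Count synthetic rows that are identical to at least one real row."""
    # Convert each real row to a tuple so it can be stored in a set.
    real_set = {tuple(row) for row in real_rows}
    return sum(1 for row in synthetic_rows if tuple(row) in real_set)


# Toy records (age, sex, town); purely illustrative, not real data.
real = [(34, "F", "Leeds"), (51, "M", "York"), (29, "F", "Hull")]
synthetic = [(33, "F", "Leeds"), (51, "M", "York"), (40, "M", "Hull")]

matches = count_exact_matches(real, synthetic)
print(f"{matches} of {len(synthetic)} synthetic rows replicate a real row")
```

A non-zero count is not necessarily a problem, but it may be worth reviewing which rows were replicated, particularly if they describe unusual individuals.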
Missing values
Because synthetic data only serves to mimic the data from which it was created, it may be missing important datapoints. For example, if the original data are missing values (because they were not sampled, or because the data do not represent the population they are meant to model), then this too will be reflected in the synthetic data. This is particularly relevant to outliers in the original data. Outliers can often be more important than regular datapoints: they help to identify errors in the data or model and, where valid, can tell a story about individuals who do not fit within a pattern or trend. They can also have a large influence on the statistical analysis of the dataset.
However, the importance of mimicking outliers (and other unique values) within the synthetic dataset needs to be weighed against the risk of identification.
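One way to see whether outliers in the original data have carried over is to count the values in each dataset that fall outside a range derived from the real data. The sketch below uses Tukey's 1.5 × IQR rule, which is an assumption chosen for illustration rather than a method prescribed by this guidance; the function names and the toy income values are hypothetical.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outside(values, low, high):
    """Count values lying strictly outside the interval [low, high]."""
    return sum(1 for v in values if v < low or v > high)


# Toy incomes (in thousands); the real data contains one extreme value.
real_income = [21, 23, 24, 25, 26, 27, 28, 30, 31, 95]
synthetic_income = [22, 23, 25, 25, 26, 27, 29, 30, 30, 32]

# Fences are computed from the real data and applied to both datasets.
low, high = tukey_fences(real_income)
print("real outliers:", count_outside(real_income, low, high))
print("synthetic outliers:", count_outside(synthetic_income, low, high))
```

If the synthetic data contains far fewer extreme values than the real data, analyses that depend on them may not carry over. Whether that is acceptable depends on the identification risk discussed above.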
Advice and possible mitigations
Synthetic data can be used for many purposes; however, you should consider how well it fits your intended purpose, given how it was produced. For this reason, producers of synthetic data should be transparent about the methods used to create the data, and researchers should take time to consider the use case for the data.
- If you are creating a synthetic dataset, think hard about the original data you are using and any biases that they may contain. How might these influence the synthetic data, and what can you do to mitigate the bias? Be aware, however, that by attempting to eliminate bias in the synthetic data, you are distorting the dataset and thus decreasing its fidelity. You should weigh up the benefit of minimising the bias against the needs of your use case.
- If you are a researcher using a synthetic dataset, it is equally important that you are aware of any potential biases, how these might limit what you can use the synthetic data for, and how they might affect any outcomes.
- ML models can be adjusted to account for bias and ensure that the synthetic dataset is more representative. More information can be found in our ethical guidance on machine learning.
- Output control by the producer of the synthetic data will be necessary, especially when the datasets are complex in nature. The best way to do this is by comparing the synthetic data with the original real-life data.
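The comparison between synthetic and original data described above could begin with something as simple as checking whether each category appears in similar proportions in both datasets. The following is a minimal sketch under that assumption; the function names `category_shares` and `share_gaps` and the toy values are hypothetical, and a real output check would normally cover many variables and their relationships, not a single column.

```python
from collections import Counter


def category_shares(values):
    """Return the proportion of records falling into each category."""
    total = len(values)
    if total == 0:
        return {}
    return {cat: n / total for cat, n in Counter(values).items()}


def share_gaps(real_values, synthetic_values):
    """Return synthetic share minus real share for every category seen."""
    real = category_shares(real_values)
    synth = category_shares(synthetic_values)
    categories = sorted(set(real) | set(synth))
    return {c: synth.get(c, 0.0) - real.get(c, 0.0) for c in categories}


# Toy column: the synthetic data under-represents category "F".
real_sex = ["F"] * 5 + ["M"] * 5
synthetic_sex = ["F"] * 3 + ["M"] * 7

for category, gap in share_gaps(real_sex, synthetic_sex).items():
    print(f"{category}: {gap:+.2f}")
```

Large gaps suggest that the synthetic data over- or under-represents a group relative to the original. Note that a gap of zero only shows that the synthetic data matches the original, including any bias the original already contains.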