Synthetic Data and Ethics
What is synthetic data, and why is it important?
Synthetic data are an artificial alternative to real-world data. Commonly created by machine learning systems, synthetic data replicate the important statistical properties of an original dataset whilst ensuring that re-identification of individuals is almost impossible. Synthetic data are particularly valuable because they can be created to meet specific needs or conditions that the real data cannot satisfy. This can be useful when:
- Privacy requirements limit a researcher’s access to, or use of, data.
- A product, piece of software, or section of code needs testing, but the real data either do not exist or are not accessible.
- New machine learning algorithms need to be trained or validated, but generating real data would be too costly or time consuming.
- Researchers want to understand the real data better before going through application processes to access them.
The utility of synthetic data depends heavily on how they are produced and how closely they mirror the real data from which they were created. The intended use case must therefore be a primary driver in constructing the synthetic data.
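To make this concrete, here is a minimal, illustrative sketch in Python (using numpy and pandas; the column names, distributions, and values are invented purely for illustration, not drawn from any real dataset). It generates a small synthetic table by sampling from distributions fitted to each column of a toy "real" dataset, then runs a quick check that the statistics relevant to the intended use case have carried over:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# A toy stand-in for a real dataset; the column names and distributions
# are invented purely for illustration.
real = pd.DataFrame({
    "age": rng.normal(40, 12, size=1_000).round().clip(18, 90),
    "income": rng.lognormal(mean=10, sigma=0.5, size=1_000),
})

# Generate synthetic rows by sampling from distributions fitted to each
# column, so no original record is copied into the output.
synthetic = pd.DataFrame({
    "age": rng.normal(real["age"].mean(), real["age"].std(),
                      size=1_000).round().clip(18, 90),
    "income": rng.lognormal(np.log(real["income"]).mean(),
                            np.log(real["income"]).std(), size=1_000),
})

# A quick fitness-for-purpose check: do the statistics that matter for
# the intended use case carry over?
print(real.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])
```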
Synthetic data can be defined in a number of ways, which can make them complex to conceptualise and understand. Most commonly, however, synthetic data are described in terms of their level of fidelity. Fidelity can be seen as a sliding scale, with low fidelity and high fidelity at the extremes, as defined below.
High fidelity
High fidelity datasets share, and deliberately conserve, many of the features of the original dataset from which they are created. This may include complex relationships between different variables, which in turn carries a greater risk of disclosure. Whilst the disclosure risk is higher, high fidelity data often have more analytic value than lower fidelity data, enabling them to be used for a wider range of applications, including hypothesis generation, image-to-image translation, and the robust testing of complicated AI models.
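As an illustrative sketch only, not a production method: one simple way to conserve relationships between variables is to fit a multivariate normal distribution to the real data and sample from it. The toy data, variables, and relationship below are all invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Toy "real" data with a deliberate relationship between the variables
# (both variables and the relationship are invented for illustration).
age = rng.normal(40, 12, size=1_000)
income = 20_000 + 800 * age + rng.normal(0, 5_000, size=1_000)
real = pd.DataFrame({"age": age, "income": income})

# Fit a multivariate normal to the real data and sample from it, so the
# covariance (and hence the age-income relationship) is conserved.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=1_000),
                         columns=real.columns)

# The correlation structure carries over, which is also why the
# disclosure risk is higher than for low fidelity data.
print(real.corr().round(2))
print(synthetic.corr().round(2))
```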
Low fidelity
Low fidelity datasets are generally seen as less risky because they mirror the original data less closely. At the lowest level of fidelity, the data may reflect the original only in the types of information they contain, or how they are laid out. Because low fidelity synthetic data do not preserve the relationships between variables in the original dataset, the risk of disclosure is lower than for high fidelity data. As a result, the use cases for low fidelity data are narrower; however, they can still serve as a form of metadata that helps researchers understand the real data, scope a research question, develop code, or act as a training tool.
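By contrast, a low fidelity generator might keep only the layout of the data: the column names and plausible value ranges, with each column sampled independently so that no inter-variable relationship survives. A hedged toy sketch, reusing the invented variables from the high fidelity example above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)

# The same invented "real" data as in the high fidelity sketch above.
age = rng.normal(40, 12, size=1_000)
income = 20_000 + 800 * age + rng.normal(0, 5_000, size=1_000)
real = pd.DataFrame({"age": age, "income": income})

# Low fidelity: keep only the layout (column names and value ranges) and
# sample each column independently, which destroys the relationships
# between variables.
synthetic = pd.DataFrame({
    col: rng.uniform(real[col].min(), real[col].max(), size=len(real))
    for col in real.columns
})

# The age-income correlation present in the real data does not survive:
print(round(real.corr().loc["age", "income"], 2))       # strongly positive
print(round(synthetic.corr().loc["age", "income"], 2))  # near zero
```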
Why ethics matters when creating, sharing, and using synthetic data
The use of synthetic data in research and statistics provides substantial benefits, some of which are listed above. Ultimately, synthetic data are considered a privacy-enhancing technique: they help researchers access and use data whilst allowing organisations to protect the privacy of their data subjects, resulting in an often safer, easier, and faster way to share data. This is particularly beneficial where the real data are considered sensitive (for example, where they contain information on protected characteristics, or identifiable personal information).
Nonetheless, although the use of synthetic data brings real value, there are also ethical issues which need to be considered. Synthetic data bring with them ethical considerations common to all types of data, but also some of their own. For example, synthetic data can only mirror the real-world data on which they are based; they cannot be an exact copy. This means that synthetic data may not contain some of the original data’s outliers, which often tell a very important story. Ultimately, the quality of the synthetic data will only ever be as good as the quality of the input data and the data generation model, and will therefore reflect any biases in the original data source.
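One practical consequence is that quality checks belong in any synthetic data pipeline. The hypothetical sketch below (invented data and names throughout) compares extreme quantiles of a toy heavy-tailed "real" variable against a deliberately poor synthetic version fitted only to the bulk of the distribution, flagging whether the outliers survived synthesis:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)

# Invented heavy-tailed "real" variable: a few large outliers in the tail.
real = pd.Series(rng.lognormal(mean=10, sigma=0.8, size=5_000), name="income")

# An intentionally poor synthetic version fitted only to the bulk of the
# distribution: a normal fit cannot reproduce the heavy tail.
synthetic = pd.Series(rng.normal(real.mean(), real.std(), size=5_000),
                      name="income")

# Comparing extreme quantiles flags whether the tail, and the story its
# outliers tell, survived synthesis.
for q in (0.5, 0.95, 0.99, 0.999):
    print(f"q={q}: real={real.quantile(q):,.0f}  "
          f"synthetic={synthetic.quantile(q):,.0f}")
```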
While these uses of synthetic data bring real benefits in terms of efficiency, convenience, and safety, they also bring risks. By taking a considered approach to ethics in every project, we can mitigate those risks and retain public trust in the use of data for research and statistics.
General ethical principles for research and statistics
To help researchers and statisticians navigate potential ethical issues for all types of projects, the UK Statistics Authority has developed a series of ethical principles and a related ethics self-assessment tool.
At a basic level, these principles focus on ensuring the public good of research and statistics, maintaining the confidentiality of data, understanding the potential risks and limitations of new research methods and technologies, complying with legal requirements, considering the public acceptability of the project, and being transparent in the collection, use, and sharing of data.
The ethical principles provide guidance for dealing with all types of data; you should consider your project against each of them early in, and throughout, your work.
This guidance is underpinned by these general principles but focuses specifically on ethical considerations relating to synthetic data which require us to take particular care. These include:
- The need to ensure the quality and validity of the synthetic data, and to understand how these may be affected by the original real data.
- The need to consider the public’s perceptions and understanding of synthetic data, and how their uses, benefits, and limitations can be clearly and transparently communicated to different audiences.
- The importance of maintaining accountability within all aspects of synthetic data processes, ensuring that the data are used only for the intended purposes.