Confidentiality, privacy and disclosure
The issue
Whilst synthetic data are often heralded as a privacy-enhancing method, it is important that researchers are aware that synthetic data should not be assumed to be completely risk free.
The real data from which synthetic data are produced may be complex, with complex sources, so it is important for researchers to understand how the synthetic data have been produced and what statistical methods have been used to minimise the risk of disclosure or identification. Synthetic data which more accurately reproduce the original data may contain records that match information about real individuals. This can happen by coincidence, but it can also happen because, in trying to reproduce the statistical properties of very small groups, the synthesis algorithm reuses the same combinations of values present in the real data. Even where the data subjects themselves have been de-identified, this remains problematic because the data can still be disclosive, whether by coincidence (as outlined above) or because the synthesis systematically reproduces real combinations, which is more likely if the original data are of poor quality. This is a particular risk where the data contain sensitive variables, or where a dataset has only a small number of variables or values.
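As an illustration (not part of the original guidance), one simple check for verbatim copies can be run before release. The sketch below assumes two pandas DataFrames, real_df and synthetic_df, with identical columns, and reports the proportion of synthetic rows that exactly reproduce a real record; it catches only exact matches, not near-matches.

```python
import pandas as pd

def exact_match_rate(real_df: pd.DataFrame, synthetic_df: pd.DataFrame) -> float:
    """Proportion of synthetic rows that are identical to some real record."""
    cols = list(real_df.columns)
    # Left-join each synthetic row against the de-duplicated real data on all columns;
    # the merge indicator marks rows that are verbatim copies of a real record.
    merged = synthetic_df.merge(real_df.drop_duplicates(), on=cols, how="left", indicator=True)
    return (merged["_merge"] == "both").mean()

# Example usage:
# rate = exact_match_rate(real_df, synthetic_df)
# if rate > 0:
#     print(f"{rate:.2%} of synthetic rows replicate a real record; review before release")
```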
Some data types are particularly sensitive. For example, health data, location data, and data containing information on protected characteristics are all considered sensitive, and the greatest care must be taken to ensure that nothing is done to put vulnerable populations or individuals at risk. For those creating synthetic datasets, this means taking steps to identify unique characteristics within the original dataset and ensuring that they are adequately masked (or eliminated) in the synthetic data being produced. Where this is done, the processes should be clearly documented and communicated to researchers so that they are aware of the limitations of the new dataset. For those using synthetic data, it is their responsibility to be aware of these limitations, and of the extent to which these unique datapoints have been omitted or masked.
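As a hypothetical example of that first step, the sketch below flags records in the original data that are unique on a chosen set of quasi-identifiers (the column names are illustrative, not prescribed); such records are candidates for masking, coarsening or suppression before or after synthesis.

```python
import pandas as pd

def sample_uniques(df: pd.DataFrame, quasi_identifiers: list[str]) -> pd.DataFrame:
    """Return records whose quasi-identifier combination appears only once."""
    # Count how often each combination of quasi-identifier values occurs.
    counts = df.groupby(quasi_identifiers, dropna=False).size().rename("count")
    flagged = df.join(counts, on=quasi_identifiers)
    # Records with a count of 1 are "sample uniques" and carry the highest re-identification risk.
    return flagged[flagged["count"] == 1].drop(columns="count")

# Example usage with hypothetical columns:
# risky = sample_uniques(original_df, ["age_band", "postcode_sector", "ethnicity"])
# print(f"{len(risky)} records are unique on the chosen quasi-identifiers")
```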
The ONS Synthetic Data Spectrum
The ONS Synthetic Data Spectrum is a high-level scale which classifies synthetic datasets based on how closely they resemble the original data, their purpose, and the perceived disclosure risk.
Structural datasets have the lowest disclosure risk (low fidelity), preserving only the format and datatypes of the original data, whilst replica data preserve format, structure, joint distributions, missingness patterns and low-level geographies, and therefore carry a higher disclosure risk (high fidelity). Structural datasets also have more limited uses, because their analytic value is low, whereas the analytic value of replica data is far higher. There is therefore a balance to be struck between the quality of synthetic data and its potential to be disclosive; where that balance lies will depend heavily on the intended use of the synthetic data.
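To make the low-fidelity end of this spectrum concrete, the sketch below (an illustration only, not an ONS method) builds a "structural" dataset that preserves column names and broad datatypes but nothing else: values are drawn independently at random and bear no relationship to the real distributions.

```python
import numpy as np
import pandas as pd

def structural_synthetic(real_df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Keep only column names and broad datatypes; all values are arbitrary."""
    rng = np.random.default_rng(seed)
    columns = {}
    for col in real_df.columns:
        if pd.api.types.is_integer_dtype(real_df[col]):
            columns[col] = rng.integers(0, 100, size=n_rows)        # arbitrary integers
        elif pd.api.types.is_float_dtype(real_df[col]):
            columns[col] = rng.normal(size=n_rows)                  # arbitrary floats
        else:
            columns[col] = rng.choice(["cat_1", "cat_2", "cat_3"], n_rows)  # placeholder categories
    return pd.DataFrame(columns)
```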
Introducing noise into datasets to reduce disclosure risk
In some projects, researchers have introduced a limited amount of “noise” or consistency conditions into the synthetic data in order to reduce its disclosure risk. However, this should be done with caution: the introduction of consistency conditions and noise into high-fidelity data is likely to render any modelling performed on the data invalid.
For high-fidelity datasets, then, should creators of synthetic data decide to introduce noise, this should be controlled so that the aggregate properties of the real-world data are still reproduced. Measurement-error techniques can also be used when performing statistical inference to further mitigate these problems. The introduction of noise into low-fidelity datasets is less problematic, and may indeed be used as a tool to minimise the risk of disclosure, provided that there is a clear justification for doing so and that the accuracy of the data is not important for the intended use case(s).
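One simple way to keep such noise controlled, sketched below purely as an illustration (assuming a numeric pandas Series), is to perturb a column and then re-centre and re-scale it so that the released values reproduce the mean and standard deviation of the original column. More sophisticated measurement-error corrections are beyond this sketch.

```python
import numpy as np
import pandas as pd

def add_controlled_noise(values: pd.Series, noise_scale: float = 0.1, seed: int = 0) -> pd.Series:
    """Add Gaussian noise, then restore the original mean and standard deviation."""
    rng = np.random.default_rng(seed)
    noisy = values + rng.normal(0.0, noise_scale * values.std(), size=len(values))
    # Re-standardise so the aggregate properties (mean, sd) match the input column.
    return (noisy - noisy.mean()) / noisy.std() * values.std() + values.mean()
```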
Advice and possible mitigations
Now that we have considered the issue of confidentiality and disclosure risk, alongside some case examples, here are some of the ways in which these risks may be mitigated by both data sharers and data users:
- Data sharers should be aware of the need to balance a researcher’s need for realistic data against any confidentiality or privacy requirements. Moreover, users of the data have a responsibility to uphold the disclosure control regulations of both their own organisation and the data owner’s.
- When synthetic data are shared, materials should be clearly labelled to communicate how representative of the original data the synthetic data are, how this may affect their quality, and the level of possible disclosure risk. Just because the data are synthetic does not mean that researchers should store, use, or share them without being able to clearly justify the reasons for doing so.
- Where “noise” or consistency conditions are introduced into synthetic datasets, this should be proportionately controlled. Moreover, these processes should be clearly documented, and users of the data should be made aware of the potential consequences for their findings.
- A wide range of tools and techniques exist to help generators of synthetic data obscure sensitive or private information within their datasets. These include traditional statistical methods, deep learning techniques, and natural language processing. Both the creators of a dataset and those who go on to use it should be aware of which methods are best suited to the type of data in question, and of their impact on the quality and validity of the data.
- Where relying on methods that promise theoretical privacy guarantees (e.g. differential privacy), users should ensure that their release plans are consistent with the assumptions under which the formal guarantees hold; a minimal sketch of one such mechanism is included after this list.
- Users should consider whether their data should be penetration tested for privacy concerns by conducting formal privacy attacks (for example, membership inference or attribute inference); a simple illustrative check is also sketched after this list. Synthetic data should be produced in a way that is unlikely to reproduce disclosive aspects of the real data, and must be checked for disclosure issues before distribution or use.
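As a minimal sketch of the point about formal guarantees above (assuming the standard Laplace mechanism, not any specific ONS tooling), a single count can be released with ε-differential privacy by adding Laplace noise scaled to the query’s sensitivity. The guarantee only holds if the sensitivity is correct and the privacy budget is accounted for across every release.

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0, seed=None) -> float:
    """Release a count with epsilon-differential privacy via the Laplace mechanism."""
    rng = np.random.default_rng(seed)
    # Noise scale = sensitivity / epsilon; smaller epsilon means more noise and stronger privacy.
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: release the size of a small group with a modest privacy budget.
# released = dp_count(true_count=17, epsilon=0.5)
```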
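And as a very rough illustration of the penetration-testing point (a distance-based screen that is much weaker than a full membership- or attribute-inference attack; the DataFrame names and the use of scipy are assumptions), synthetic rows that sit unusually close to a real record in the numeric feature space can be flagged for manual disclosure review.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def distance_to_closest_record(real_df: pd.DataFrame, synthetic_df: pd.DataFrame) -> np.ndarray:
    """Nearest-neighbour distance from each synthetic row to the real data (numeric columns only)."""
    numeric_cols = real_df.select_dtypes("number").columns
    tree = cKDTree(real_df[numeric_cols].to_numpy())
    distances, _ = tree.query(synthetic_df[numeric_cols].to_numpy(), k=1)
    return distances

# Example: send the 1% of synthetic rows closest to a real record for manual review.
# d = distance_to_closest_record(real_df, synthetic_df)
# for_review = synthetic_df[d <= np.quantile(d, 0.01)]
```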