Transparency – communicating the benefits and limitations of synthetic data

Because synthetic data are a relatively new concept, gaining user acceptance may be more challenging than it is for traditional methods of statistical production. The general public, and even other researchers and statisticians, may not have sufficient understanding of what synthetic data are, how they can be used, or their limitations and benefits, to make an informed judgement about their value. Moreover, little research has yet been conducted to determine how the public view synthetic data, or how well they understand them.

For these reasons, it is very important that those creating and using synthetic datasets are transparent in communicating the benefits and limitations of synthetic data. This information should be tailored to the relevant audience(s) to ensure that it is understood. The Centre for Applied Data Ethics’ Machine Learning guidance outlines the different types of information that should be given to different audiences depending on their knowledge and expertise, and this is also applicable here.

It is entirely possible that synthetic data could be mistaken for “real” data, particularly when they are of high fidelity. However, as discussed in the previous section, synthetic data will always differ in some way from the “real” data from which they were created. It is therefore imperative that developers clearly label their data as synthetic; otherwise, there is a risk that the data could be released into the public domain and, in turn, mistaken for “real” data.

If you are creating or sharing a synthetic dataset, the information you communicate should, as a minimum, include:

  • What the dataset’s intended use is
  • How the dataset was created, and what methods were used
  • What type of dataset this is (fully synthetic, partially synthetic, or hybrid)
  • The level of fidelity of the dataset
  • Any limitations of the dataset
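
One way to make sure this minimum information travels with the dataset is to store it in a machine-readable record published alongside the data. The following is a minimal Python sketch of that idea, not a prescribed format: the class name, the field names, the `synthetic` flag, and the example values are all hypothetical, and the check that treats an empty limitations list as missing is an assumption made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

# The three dataset types named in the list above.
ALLOWED_TYPES = ("fully synthetic", "partially synthetic", "hybrid")


@dataclass
class SyntheticDatasetRecord:
    """Hypothetical record holding the minimum information listed above."""

    intended_use: str
    creation_method: str
    dataset_type: str
    fidelity: str
    limitations: List[str]

    def problems(self) -> List[str]:
        """Return the names of items that are blank or not recognised."""
        # Assumption: an empty string or empty list counts as missing.
        issues = [name for name, value in asdict(self).items() if not value]
        if self.dataset_type and self.dataset_type not in ALLOWED_TYPES:
            issues.append("dataset_type")
        return issues

    def to_json(self) -> str:
        """Serialise the record with an explicit 'synthetic' label."""
        return json.dumps({"synthetic": True, **asdict(self)}, indent=2)


# Illustrative values only; they do not describe a real dataset.
record = SyntheticDatasetRecord(
    intended_use="Testing code before access to the original data is granted",
    creation_method="Values sampled independently for each variable",
    dataset_type="fully synthetic",
    fidelity="low",
    limitations=["Relationships between variables are not preserved"],
)

if not record.problems():
    print(record.to_json())
```

A record like this could be saved next to the data file, so that anyone who receives the dataset also receives its label, intended use, and limitations.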

If you have used a synthetic dataset and are publishing your results, you should, as a minimum, communicate:

  • Why you chose to use a synthetic dataset rather than the original data
  • Why this synthetic dataset was appropriate for the research and research aims
  • What limitations or issues you experienced whilst using the dataset, and how these could have affected the quality or validity of your results
  • The methods and tools you used when working with the dataset

When communicating with a lay audience, it may also be useful to give a brief outline of what synthetic data are, and of the benefits of creating, using and sharing them.


Remain accountable

It is the collective responsibility of those who develop and use synthetic datasets to ensure that the use case of the data is clear, and that the data are not used beyond their intended purpose. For researchers, it is important to consider how the synthetic data being used were developed, and for what purpose. Using these datasets beyond their intended purpose without care and caution is likely to affect the accuracy and validity of your results.

REMEMBER:

Synthetic data are not always a suitable alternative to the real thing, and their uses are limited! In many cases, it may be best to use the original dataset! This will depend on factors such as the precision needed in the findings, or the importance of the decisions being made on the basis of the findings (and how much these decisions may be influenced by potential issues or differences with the data). If you are considering using a synthetic dataset, it is important to understand what it was built for and to ensure, before you use it, that it is suitable for your own project.

Whilst developers may have limited control over who uses their datasets, and for what purpose, to remain accountable, anyone creating a synthetic dataset should be explicit in communicating the intended use of the data, so that it is not used by others in the wrong way.


Advice and Possible Mitigations

  • Ensure accountability by design, by having governance processes to ensure human oversight at the appropriate organisational level throughout design and implementation of your synthetic dataset.
  • Ensure that you document any decision-making processes throughout the development, sharing, or use of the synthetic dataset, and ensure that these decisions, and their justifications, are clearly communicated to the relevant stakeholders.
  • When developing a synthetic dataset, or publishing outputs from a synthetic dataset, clearly label the dataset or outputs so that it is clear that synthetic data have been used.
  • When creating a synthetic dataset, ensure that it is audited at regular intervals, and that the audits are well documented. This will provide assurance that the dataset fulfils its intended purpose without any unwanted consequences (such as bias). The Alan Turing Institute have produced literature on auditing and evaluation processes which may be helpful when planning this process. Sufficient time should be built into the project plan to allow for it.