Ethical considerations in the use of Machine Learning for research and statistics

Published:
26 October 2021
Last updated:
26 October 2021

Confidentiality

Machine learning – particularly when using predictive analytics – often involves linking data from multiple sources. New data sources (such as social media data and biometric data) are increasingly being used by analysts to identify social trends, and using these sources raises ethical questions surrounding confidentiality. Put simply, confidentiality refers to the measures taken to protect data and data subjects from identification, typically by separating or modifying the personal information participants provide from the rest of the data.

There are many ways to protect confidentiality, including techniques such as the anonymisation of datasets, de-identification, and data management and security processes (such as limiting access to data to specific, agreed purposes). The techniques used will depend on several factors, including the type of data being used and the purpose for which it is used. In choosing the most appropriate methods for a project, researchers may find it helpful to consider the Five Safes framework, which has been adopted by the Office for National Statistics. This helps researchers maximise the use of their data, whilst ensuring that the data is kept secure at all times.
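To make the de-identification techniques above concrete, the sketch below shows one common pattern: replacing a direct identifier with a keyed pseudonym, and generalising quasi-identifiers (postcode, age) into coarser bands. This is a minimal illustration, not a recommended production approach; the record fields, the key, and the banding choices are all hypothetical assumptions.

```python
import hmac
import hashlib

# Assumption: the key would be stored and managed separately from the data
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    The result is stable, so records can still be linked across
    datasets, but the original identifier cannot be read back.
    """
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

# Hypothetical participant record
record = {"name": "Jane Doe", "postcode": "AB1 2CD", "age": 34}

safe_record = {
    # Linkable pseudonym instead of the name
    "person_id": pseudonymise(record["name"]),
    # Generalise the postcode to its outward code
    "postcode_area": record["postcode"].split()[0],
    # Generalise exact age to a ten-year band
    "age_band": f"{record['age'] // 10 * 10}-{record['age'] // 10 * 10 + 9}",
}
```

Note that pseudonymised data of this kind is still personal data in most regimes: the key holder can re-link it, so access controls and key management remain essential.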

Potential Mitigations

  • Once the dataset has been explored and the data necessary for the research has been determined, any remaining data should be deleted. However, this should be done with care and only after careful consideration, as deleting some data may make the model perform worse, or make it less transparent to the user.
  • If anonymising data, this should be done at the earliest opportunity, and care should be taken to choose the most appropriate method(s) of anonymisation (e.g., natural language processing-based anonymisation, differential privacy, data masking, aggregation). Excluding unnecessary data will also help prevent the system from identifying unhelpful correlations.
  • Consideration should be given to the level of control data subjects should have over their own data, balancing the risks to individual privacy against the statistical value of the dataset. Data subject rights can be specified in materials that are publicly available to data subjects (such as privacy notices or participant information) and organisations should have appropriate mechanisms to delete or withdraw data as required.
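The differential privacy mentioned in the mitigations above can be sketched very simply for an aggregate statistic: add noise calibrated so that any one individual's presence or absence changes the published figure only slightly. The sketch below uses the standard Laplace mechanism on a count; the epsilon value is an illustrative assumption (smaller epsilon means more noise and stronger privacy), and a real release would need careful parameter and budget choices.

```python
import random

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count under the Laplace mechanism.

    A count has sensitivity 1 (one person changes it by at most 1),
    so Laplace noise with scale 1/epsilon gives epsilon-differential
    privacy for this single release.
    """
    scale = 1.0 / epsilon
    # The random module has no Laplace sampler; the difference of two
    # independent exponentials with mean `scale` is Laplace(0, scale).
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

# Illustrative use: publish a noisy version of a small-area count
noisy = dp_count(100, epsilon=1.0)
```

Each noisy release is different, so the true figure cannot be pinned down from any single output; repeated releases of the same statistic consume more of the privacy budget, which is why epsilon must be tracked across queries.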