Ethical considerations in the use of Machine Learning for research and statistics

Machine Learning and Ethics

What is Machine Learning and why is it useful in research and statistics?

Machine learning algorithms work by learning from “training” data and applying that learning to new, unseen data. By using machine learning, analysts are able to identify trends and patterns in very large datasets.

The information produced by machine learning can be:

descriptive (it uses data to explain a phenomenon),
predictive (it predicts what will happen based on trends and patterns from the data given), or
prescriptive (it can make decision-making suggestions).

This means that machine learning methods have an incredibly wide range of applications.

Machine learning processes can be further categorised into three main types, which describe how the algorithm is influenced by the researcher.

Supervised machine learning

In this instance, algorithms are fed labelled data (data which is annotated so that the machine knows its target) for training. Supervised learning leads to a prediction or classification of a known quantity (i.e., an outcome variable), using patterns that the machine finds in the data to predict an outcome. For example, if a data scientist wanted to teach a system to identify cats in images, they would feed the system images that have each been labelled as showing a cat or not.
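As a rough illustration of learning from labelled data, the sketch below trains a very simple nearest-centroid classifier in plain Python. The two numeric features, their values, and the labels are all invented for this example, and the technique is chosen only because it is short; it is not a method recommended by this guidance.

```python
# Toy supervised learning: a nearest-centroid classifier in pure Python.
# All feature values and labels below are hypothetical.

def train(examples):
    """Compute the mean feature vector (centroid) for each label."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, features):
    """Return the label whose centroid is closest to the new example."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(centroids[label], features))


# Labelled training data: each example pairs input features with a known label.
training_data = [
    ([0.9, 0.8], "cat"),
    ([0.8, 0.9], "cat"),
    ([0.2, 0.1], "not cat"),
    ([0.1, 0.3], "not cat"),
]

model = train(training_data)
prediction = predict(model, [0.85, 0.75])  # a new, unseen example
```

The key point is that the labels are supplied by a person before training, so any errors or bias in the labelling are learned by the model.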

Unsupervised machine learning

Unlike supervised machine learning, in which the machine is fed input variables and an output variable, unsupervised machine learning only uses input data. This means that the model learns without supervision to discover its own patterns and information from the data. This type of machine learning assists us in finding unknown patterns in data. The data fed into the model is typically unlabelled. Using the same example as for supervised machine learning, a data scientist would feed the system unlabelled images, and it would be the responsibility of the system to analyse the data and group similar images together, for example separating images of cats from other images.
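To make the contrast with supervised learning concrete, the following minimal sketch groups unlabelled points using a basic k-means procedure written in plain Python. The data values, the choice of two clusters, and the deterministic starting centroids are assumptions made purely for illustration.

```python
# Toy unsupervised learning: a basic k-means clustering in pure Python.
# No labels are provided; the algorithm only sees the input features.

def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def k_means(points, k, iterations=10):
    """Return a cluster index for each point."""
    # Assumption for reproducibility: start from the first k points.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return [nearest(p, centroids) for p in points]


# Hypothetical unlabelled data: the same kind of features, but no labels.
unlabelled = [[0.9, 0.8], [0.2, 0.1], [0.8, 0.9], [0.1, 0.3]]
assignments = k_means(unlabelled, k=2)
```

The output is only a set of group numbers. Deciding what each group means, for example whether one of them corresponds to cats, remains a human judgement.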

Reinforcement machine learning

Reinforcement machine learning trains models to make sequences of decisions through a process of trial and error. The programmer will reward the machine when it does what the programmer wants and penalise it when it does not (though the programmer will not give the model help in making these decisions). The model will then try to maximise its reward, causing it to change its decisions (strategise).
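The trial-and-error loop described above can be sketched with a very small example in which a model chooses between two actions. The reward values, the exploration rate `epsilon`, the number of steps, and the function name `run_trial_and_error` are all hypothetical, and this epsilon-greedy approach is just one simple way of balancing exploration and reward.

```python
import random

# Toy reinforcement learning: an epsilon-greedy agent choosing between two actions.

def run_trial_and_error(steps=1000, epsilon=0.1, seed=0):
    rng = random.Random(seed)        # fixed seed so the run is repeatable
    rewards = {0: -1.0, 1: 1.0}      # assumption: action 0 is penalised, action 1 rewarded
    values = [0.0, 0.0]              # the agent's running estimate of each action's reward
    counts = [0, 0]                  # how many times each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)                         # explore: try a random action
        else:
            action = max(range(2), key=lambda a: values[a])   # exploit the best estimate
        reward = rewards[action]
        counts[action] += 1
        # Update the running average reward for the chosen action.
        values[action] += (reward - values[action]) / counts[action]
    return values


values = run_trial_and_error()
best_action = max(range(2), key=lambda a: values[a])
```

The agent is never told which action is correct. It only receives rewards and penalties, so the design of the reward determines the behaviour that is learned.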

Examples of Machine Learning projects in research and statistics include:

The use of machine learning in research and statistics offers substantial potential benefits. Particularly beneficial is the ability to analyse large datasets and extract information quickly once a model is deployed. Automating tasks may be less resource intensive, and the ability of models to adapt autonomously, improving the quality and validity of outcomes, is often invaluable to data scientists, researchers, and analysts. As such, machine learning is now used in many types of research.

Why ethics matters in Machine Learning

When analysts embark on a research project using any method, it is always important to consider any possible ethical issues relating to the collection, access, use and storage of data. This helps to reduce potential harm to all individuals involved in the research and to maintain public acceptability around the production of research and statistics. It also enables researchers to efficiently access and harness data that supports the production of statistics for the public good. These ethical issues are particularly important when using more contemporary methods such as machine learning, as these methods raise not only traditional data ethics considerations, such as transparency and privacy, but also new ones. When algorithms are used, it may be more challenging to guard against mistakes or bias that result from human interaction with the model (for example, coding, design decisions or data input) and that could affect the system’s outputs. Moreover, research findings may be biased or erroneous if models are used outside of their intended purpose, or if machine outputs are not thoroughly reviewed and checked for validity before use.

Despite these ethical considerations, there are huge benefits to using machine learning methods. Taking a considered approach to ethics in every project helps to maintain public trust in the use of data for research and statistics more generally, enabling researchers to harness the power of data to support public good research. No matter what stage your machine learning project is at, it is always sensible to discuss possible ethical issues with other researchers. This applies whether you are thinking about starting a new project using machine learning, are in the process of designing your study, or have already started to create or deploy a machine learning system. It is most beneficial, however, to start thinking about ethical challenges at the earliest possible stage of the research. By doing this you are implementing good data ethics by design.

The UK Statistics Authority provides researchers with an ethics self-assessment tool, which helps researchers to identify and review any ethical challenges in a research project. This guidance supplements the ethics self-assessment tool and also provides a high-level checklist that you can use to ensure that any research or statistical project that uses machine learning techniques is ethically responsible.

General ethical principles for research and statistics

To help analysts navigate potential ethical issues, the UK Statistics Authority has developed six ethical principles to consider throughout the life cycle of a research project. These principles focus on ensuring the public good of research and statistics, maintaining confidentiality of data, understanding the potential risks and limitations in new research methods and technologies, compliance with legal requirements, considering public acceptability of the project, and transparency in the collection, use and sharing of data.

This guidance is underpinned by these general principles but focuses specifically on ethical considerations relating to machine learning which require us to take particular care. These include:

The importance of minimising and mitigating social bias and subsequent discrimination within machine learning research, and clearly communicating these biases, and the limitations of our research.
The need to consider the transparency and explainability of machine learning research, and the implications this has for reproducibility.
The importance of maintaining accountability within all aspects of machine learning processes, ensuring that models are used only for their intended purposes, and that different stakeholders are aware of their responsibilities.
The need to consider the confidentiality and privacy risks arising from the data used, both in relation to training data which is fed into the machine, and outputs resulting from the machine learning’s findings.

Also worth reflecting on here is the Office for Statistics Regulation’s Code of Practice for Statistics. The OSR Code of Practice sets the standards that producers of official statistics should commit to. It is framed around three main pillars: value, quality, and trustworthiness. These pillars support, and map clearly onto, the UK Statistics Authority’s ethical principles and are important statistical standards. We can benefit greatly from considering them in relation to machine learning and the ethical issues addressed below.
