Title and link to statistical output: Unistats Dataset 2020/21
Name of producer organisation: Higher Education Statistics Agency (HESA)
Rebecca Mantle
Jonathan Waller
Date of breach report: 18 September 2020

The Unistats dataset contains information about courses offered by higher education providers. It is published by HESA on behalf of the Office for Students (OfS). There is one major release each year, normally in September, with scheduled weekly minor updates between major releases. Following publication of each update to the dataset (major or minor), a third-party contractor working for the OfS takes a copy of the dataset and incorporates it into Discover Uni, a website intended to provide prospective students with course information.

A major release of 2020/21 Unistats data was scheduled to be published at 09:30 on 28 September 2020 in accordance with HESA’s pre-announced publication schedule. On 26 August 2020 at 09:30 a regular weekly minor release of 2019/20 data was due to be published, but instead a near-final version of the 28 September major release of 2020/21 data was published in error. The issue was identified at 10:42, at which point the 2020/21 data was removed and replaced with the correct 2019/20 minor update.

Relevant principle(s) and practice(s): T3.4 The circulation of statistics in their final form ahead of their publication should be restricted to eligible recipients, in line with the rules and principles on pre-release access set out in legislation for the UK and devolved administrations.
Date of occurrence of breach: 26 August 2020

The OfS contractor wanted to test their data workflows in updating the Discover Uni site because the data content of the 2020/21 release had changed markedly from the 2019/20 edition. To do this they needed test data. The OfS had asked HESA to deliver their contractor a near-final version of the 2020/21 dataset as test data.

This was an exception to the normal processes used by the contractor.

Normally, a weekly automated process creates the latest Unistats data file by collating and processing the most recent set of data submissions from Higher Education (HE) providers and stores the resulting file on a HESA physical file server.

An automated process scans for new files created, again on a weekly basis, and uploads the most recent file to a cloud storage area. Once in that area, and at the time of the publication of the weekly update, users can download the Unistats data file using a public API end point.

The contractor who maintains the Discover Uni website on behalf of the OfS normally obtains the latest Unistats dataset using the same API end point. On this occasion, however, they needed a test file of real (but not final) data. The file was created by HESA staff using existing processes and stored in the same HESA file server area where the regularly updated Unistats files are stored. The intended next step would have been to manually transfer this file to the third-party contractor and then delete it from the file server area.

Unfortunately, before the test file could be manually transferred and deleted, the weekly automated update process picked it up as the most recent file and uploaded it to the cloud storage area. As a result, the file became available from the API end point at the time the weekly update was expected to be published.

The cause of the breach was a combination of new and inexperienced staff using a complex automated system without fully appreciating the implications of file placement and timing. There was also a lack of a clearly documented procedure for the exceptional case of transferring a test data file to a third-party contractor outside the regular cycle.

Since this exceptional process was not thought to represent an official statistics publication, HESA’s Head of Official Statistics was not consulted.

Provide details of the impact of the breach both inside the producer body and externally:

There were five downloads of the Unistats 2020/21 dataset during the period in which it was available, at 09:31, 09:48, 10:19, 10:30 and 10:36. The last two of these were made by internal HESA colleagues involved in the production of the dataset, but the first three are unaccounted for.

Describe the short-term actions made to redress the situation and the longer-term changes to procedures:

As soon as the accidental release of the Unistats 2020/21 dataset had been identified (at 10:42 on 26 August), it was removed and replaced with the correct 2019/20 data.

An email was sent to the group of dataset users who have chosen to be notified of updates. It explained the details of the error and gave HESA’s apologies. It also asked any users who had downloaded the dataset while the 2020/21 data was available to delete all copies of the data and to re-download the correct updated dataset.

We are producing a clear procedure document that explains the correct process for creating and transferring a test Unistats data file. This document will be presented to those staff who operate the Unistats delivery process within a suite of further training to ensure they fully understand the details of the process and the implications of file placement and process timing.

Any future test data files will be stored in a separate file storage area that is not utilised by the regular automated update process. In addition, all such test data deliveries will be overseen by HESA’s Head of Official Statistics and only released with her explicit approval.
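
As an illustration only, the separation of test files described above could be backed by a check in the automated upload step itself. The Python sketch below is not HESA’s actual system: the function name, the directory layout and the assumption that each file name carries its edition (for example "2019-20") are all hypothetical. It shows the general idea of selecting the newest file for the expected edition, rather than simply the newest file in the directory.

```python
from pathlib import Path
from typing import Optional


def select_file_for_upload(publish_dir: Path, expected_edition: str) -> Optional[Path]:
    """Return the newest file in publish_dir whose name contains expected_edition.

    Hypothetical helper: files for any other edition are ignored, so a stray
    test file is not picked up merely because it is the most recent file.
    Returns None when no file for the expected edition is present.
    """
    candidates = [
        p for p in publish_dir.iterdir()
        if p.is_file() and expected_edition in p.name
    ]
    if not candidates:
        return None
    # Newest by modification time, restricted to the expected edition.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Under these assumptions, a 2020/21 test file placed in the publishing area during the 2019/20 cycle would be skipped by the weekly upload; storing test files in a separate area, as described above, remains the primary safeguard.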

We are working with the Good Practice Team to learn more about Reproducible Analytical Pipelines and how they may be able to help simplify our processes. We are also arranging training in official statistics and the Code of Practice for Statistics with the Good Practice Team to ensure all staff understand the regulations.