RMR Working Group

Steve Rogers (CDP: Statistical Lead)        Cal Ghee (CSD: Statistical Lead)
Lawrence Dyer (CDP: RMR Lead)               Tanita Barnett (CSD)
Pratibha Vellanki (CDP)                     Bethany Fitzgibbon (CSD)
Graham Greenlees (CDP)                      Orlaith Fraser (CSD)
Paul Waruszynski (CDP)                      Brendan Davis (CSD)
Bambi Choudhary (CDP)                       Gareth Powell (MD: Estimation)
Charles Odinigweh (DST)                     Fern Leather (MD: E&I & Adjustment)
Steve Smallwood (PSD: Statistical Lead)     Josie Plachta (MD: Matching)
Jayne Sholdis (NISRA)                       Fergus Christie (NRS)
Alistair Stoops (NISRA)                     Melissa Liew (NRS)

CDP: Census Data Processing                 CSD: Census Statistical Design
DST: Digital Services & Technology          MD: Statistical Methodology
PSD: Population Statistics & Demography     NRS: National Records of Scotland
NISRA: Northern Ireland Statistics and Research Agency

Background

As for previous Censuses, two of the primary objectives of the 2021 England and Wales (EW) Census are to provide Government and other consumers of statistical information with:

  1. accurate local authority level population estimates
  2. a representative statistical Census database on which to base ongoing research and analyses

However, while the ONS devotes considerable effort to the collection of accurate household (HH) and individual information, the task is complex and invariably leads to a wide range of errors in the data that can potentially undermine those objectives.  Missing and inconsistent responses in the collected data, undercount through missed enumeration or record-level non-response, and overcount through the collection of duplicate responses are just three of a wide range of significant challenges.

To help overcome these problems and minimise the impact on the quality of the statistical estimates and the utility of the final Census database, raw Census data are cleaned and adjusted through a series of deterministic and statistical processes.  First, a clean and consistent baseline database of collected response data is established through a series of preliminary data cleaning and classification methods and processes completed ultimately by statistical item-level Editing and Imputation.  Second, supported by linking Census data to a Census Coverage Survey through statistical Matching methodology, and by utilising information from other alternative sources of data and analyses, statistical Estimation methods are used to define a comprehensive set of population weights that take account of issues such as overcount and undercount in the observed data.  Finally, these population weights are used by additional statistical Adjustment, Edit and Imputation, and Disclosure Control methods to arrive at a fully adjusted Census database. This general strategy ensures that the two primary Census objectives are met successfully.

RMR is a preliminary data cleaning strategy implemented prior to any of the Census statistical adjustment processes.  While the two primary objectives of the Census program would, in principle, be met by a database containing accurate and discrete information about each HH or communal establishment (CE) and the individuals therein, collected Census data will inevitably contain multiple and/or duplicate responses associated with a particular HH, CE, and/or individual.  Multiple and duplicate responses can undermine both primary Census objectives, contributing to overcount and inconsistencies in the Census data. RMR is designed to help minimise the impact of these potential problems.

It is important to note here that the first and foremost objective for the RMR strategy is the resolution of multiple and duplicate responses received from a discrete enumeration address assigned a Unique Property Reference Number (UPRN) through the ONS address resolution strategy.  Consequently, RMR assumes that this has been assigned correctly. There is also a secondary objective to explore the possibility of extending RMR in 2021 to include a relatively local level of geography such as postcode or Lower Layer Super Output Area (LSOA).  The main point here is that the resolution of multiple responses or duplicates beyond these very local geographic boundaries is out of scope for the RMR strategy. Consequently, any incidental reference to ‘Census data’ from this point forward should be considered as always referring to data within these constraints.

Multiple responses are, of course, not always errors, sometimes occurring quite legitimately in the Census data associated with a discrete enumeration address. Typically, this would be associated with a design decision. For example, requests for additional household continuation (HC) questionnaires needed by larger HHs responding through paper format can lead to receiving two or more responses for that address. Also, to meet changing social norms, in 2021 individuals are being encouraged to complete and return an individual response (referred to as an iForm) should they choose not to disclose accurate but personal information on a primary HH questionnaire.  Consequently, in this case, the Census data could contain two or more responses from the same individual. While iForms are likely to contain different answers to some Census questions, fundamentally, this type of multiple response is a duplicate at the address in question.

Multiple responses and duplicates, however, can also emerge through unintentional (and perhaps unavoidable) error.  For instance, there may be errors in the underlying enumeration address frame leading to several questionnaires being sent to the same address that are subsequently completed and returned. Enumerators in the field may leave a paper questionnaire at an address that is completed and returned in addition to the householder completing an electronic questionnaire. In 2011 there was clear evidence that some individuals misunderstood how they should respond to the Census, entering their personal information more than once on a discrete HH questionnaire. All these examples manifest as duplicates in the Census data.  In contrast, receiving several responses from the same enumeration address could also be indicative of errors where the address frame has failed to recognise that there is more than one HH at the address.

Regardless of whether occurring through design or error, multiple responses and duplicates are problematic with respect to primary census objectives and can serve as contributing factors to a wide range of statistical errors in Census outputs.  For example, and amongst many other reasons, poor integration of HC questionnaires can lead to undercount of larger HHs.  Duplicate HH and individual responses can lead to overcount of the general population. Duplicate individual responses can also lead to potential overcount of larger HHs. Errors in the underlying address frame can lead to several different issues associated with both undercount and overcount.

The overarching role of RMR is to address these problems and minimise any associated risk to the quality of Census Outputs by resolving multiple responses and duplicates, not by simply removing them from the data, but by careful integration of the information provided through all associated responses.  As multiple responses and duplicates can also have a direct and negative impact on the performance and accuracy of statistical Matching, Estimation, Adjustment, and Edit & Imputation methodology, the RMR strategy for 2021 has been designed and developed in close conjunction with these methods through the RMR Working Group.  All in all, RMR is designed to support the production of accurate local authority level population estimates and a representative statistical Census database; the two primary Census objectives.

Aim of the current paper

The current paper has 3 primary aims:

  1. To provide an outline of the design and design principles behind the review and development of RMR for the 2021 Census
  2. To provide a high-level summary of the RMR strategy and the function of each module in the RMR method. New modules for 2021 will be highlighted.
  3. To provide a detailed overview of the business rules and statistical methods employed by each of the RMR modules. This will include notes on the assumptions and reasoning considered by the RMR Working Group during the review.  New Modules, and significant changes to the 2011 RMR design will be highlighted.

First implemented in the 2001 Census and successfully implemented again in 2011, the RMR strategy has undergone an extensive review for 2021.  The primary aim of the review was to ensure that the 2021 RMR strategy was built on a comprehensive set of design principles that ensure a consistent approach and a justifiable foundation in line with Census objectives.  Consequently, the review itself, and the decision-making processes leading to that design, has been an iterative process in which elements previously agreed were often, and necessarily, revisited a second and even a third time to incorporate emerging principles. While considerably time-consuming, we believe that this has led to a far more robust, accurate and consistent end-to-end design for the 2021 RMR strategy, one tied more tightly to the statistical and analytical aims of the Census programme than in the past.

In this Section we present an overview of the principles that have emerged to underpin the basic design of the 2021 RMR strategy. Where appropriate we indicate how the RMR strategy has changed since 2011, how we have incorporated lessons learned from 2011, and how we have linked the 2021 strategy closely to subsequent statistical adjustment methods and Census objectives.  The Section should also guide an understanding of the decision-making processes that led to the detail of how RMR functions, presented in Sections 5 and 6.

4.1 General principles

  • Following what was generally a positive assessment of how the 2011 RMR strategy performed in the last Census, it was agreed that the 2021 RMR strategy would build on that success. Consequently, the emerging design of the 2021 RMR strategy was guided by assessing the effectiveness and proficiency of the 2011 RMR specification and, where appropriate, removing, adjusting, or adding functionality. Key drivers of the review included:
    • lessons learned in 2011 that identified potential improvements to the RMR process
    • changes to overarching Census design objectives such as the modernisation of the collection strategy
    • changes to the analytical aims of the Census question set to meet the demands of a modern society
    • advances in statistical methodology implemented both within RMR and the methods it aims to support
    • advances in the computational power available on the new ONS Cloudera processing platform.
  • While the 2011 Census RMR strategy was quite successful in meeting its objectives, post-Census evaluation indicated that there were areas of the design that could be improved. For example, the statistical matching methodology in the 2011 RMR strategy left at least some unexpected duplicates in the data.  While these were ultimately resolved prior to outputs, they were unexpected by both Statistical Matching and Estimation, leading to delays and necessary revisions to approved methods. It is worth noting here that any RMR strategy will have limitations, and it is important that these are well understood and addressed in aggregate adjustments to Census estimates through Statistical Estimation. The RMR Working Group was established to ensure that all stakeholders with a significant interest in the way RMR functions were included in the development of its design.  Table 1 outlines the related and relevant topic areas covered by members of the Working Group.

 Table 1. Membership & roles of the RMR Working Group

ONS Division/Team | Role & Responsibility
Census Processing Team | Overall design & development of RMR & its implementation
Census Statistical Design | Overall statistical design of Census methodology
Methodology: Statistical Matching | Design & development of the Census to CCS and Census to Census Statistical Matching strategies
Methodology: Estimation & Adjustment | Design & development of the Census Statistical Estimation Strategy
Methodology: E&I & Adjustment | Design & development of the Census Statistical E&I and Adjustment Strategies
Population Statistics and Demography | Representing consumers of Census outputs regarding data quality and accuracy of estimates
NISRA & NRS | Representing the devolved UK statistical institutes with respect to sharing ideas and harmonisation

4.2 Overarching Design

  • Building on the 2011 design, the 2021 RMR strategy adopts a progressive and modular approach to the problem of resolving multiple and duplicate responses. Each module, or in some cases set of modules, focuses on one aspect of the potential problem space.  For example, Module 1 looks to resolve multiple and duplicate CEs; Modules 2 & 3 look at HHs; Module 4 at dummy forms; Module 5 at iForms, and so on. Section 5 provides a functional overview of all RMR modules while Section 6 delves further into the detail.  Each module builds on the outcome of one or more of the previous modules.  Consequently, sequencing formed a significant part of the review and several adjustments have been made to the 2011 ordering, not only to optimise accuracy and efficiency but also to accommodate new modules for 2021 such as the resolution of duplicate iForms (Module 5b).  We return to the sequencing of modules in Section 5.
  • In general, as in previous Censuses the resolution of multiple and duplicate responses relative to a topic area (i.e., in CEs, HHs, Dummy Forms, and so on) will be addressed through a 2-stage process:
    1. In Stage 1, multiple and duplicate responses need to be identified correctly. This is often achieved relatively easily through information provided on each type of Census questionnaire and/or through data from the Census Field Work Management Tool (FWMT) such as the UPRN.  However, in cases where this information is not available or not suitable, or in the presence of uncertainty, RMR relies on statistical matching methods.  Typically, statistical matching/linkage methods are used to identify people who, for some reason, are represented more than once in the data.
    2. In Stage 2, where multiple and duplicate data sources have been identified, they need to be resolved. This is not about throwing data away but about the careful integration of the alternative data sources. Typically, this will be driven by decision-making logic designed, for example, to decide where to place an iForm that could, in principle, belong to two or more HHs at the UPRN, or to decide which set of responses from a set of duplicate individual records will serve best as a baseline from which to build a unique and discrete observation.  Resolution is usually achieved through a hierarchy of deterministic business rules followed by rules based more on statistical probability. This general strategy serves to prioritise options and manage resolution as we move progressively towards and into uncertainty (the sketch below illustrates the modular design and this two-stage pattern).
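To make the modular sequencing and the two-stage identify-then-resolve pattern concrete, the minimal sketch below chains illustrative modules, each split into an identification step and a resolution step. The `CensusData` container, the module and function names, and the placeholder bodies are assumptions for illustration only; they are not the production implementation that runs on the ONS processing platform.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class CensusData:
    """Hypothetical container for the response tables that RMR operates on."""
    ce_responses: list = field(default_factory=list)
    hh_responses: list = field(default_factory=list)
    person_responses: list = field(default_factory=list)
    flags: list = field(default_factory=list)

# A module transforms the data and passes it on; ordering matters because
# each module builds on the outcome of the previous ones.
Module = Callable[[CensusData], CensusData]

def identify_duplicate_iforms(data: CensusData) -> list:
    return []    # placeholder for Stage 1: match-key based identification

def integrate_duplicates(data: CensusData, duplicates: list) -> CensusData:
    return data  # placeholder for Stage 2: rule-based / statistical resolution

def resolve_duplicate_iforms(data: CensusData) -> CensusData:
    """Illustrative module following the two-stage pattern."""
    # Stage 1: identify multiples/duplicates (questionnaire information,
    # FWMT/UPRN data, or statistical matching where there is uncertainty).
    duplicates = identify_duplicate_iforms(data)
    # Stage 2: resolve them by careful integration of the associated
    # responses, using business rules first and statistical allocation last.
    return integrate_duplicates(data, duplicates)

def run_rmr(data: CensusData, modules: List[Module]) -> CensusData:
    """Run the RMR modules in their agreed sequence."""
    for module in modules:
        data = module(data)
    return data

pipeline: List[Module] = [
    resolve_duplicate_iforms,
    # ... further modules: CEs, HHs, dummy forms, orphans, flags, diagnostics
]
```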

4.3 Identifying duplicates through Statistical matching

  • We mentioned earlier that during live Census processing in 2011 the RMR strategy left at least some unexpected duplicates in the data, indicating that this should be one of the most significant areas to explore for 2021. Lessons learned from 2011 noted that the matching methodology implemented in the 2011 RMR strategy was likely to be one of the most salient reasons for this. For example, 2011 RMR relied quite heavily on Soundex-based matching methods, which we now understand underperform with some cultural name sets.  However, one of the most interesting aspects of the review was that the performance of the 2011 RMR matching strategy was evaluated by comparing results with the far more sophisticated Census to CCS and Census to Census Matching methodologies.  From this, one of the primary guiding principles for the design and development of the 2021 RMR strategy was that the matching methods implemented in 2021 RMR should be designed by the ONS Methodology Matching Team in conjunction with the development of the Census to CCS and Census to Census Matching methods. This not only ensures compatibility and consistency between methods but also ensures that the RMR matching methods meet the high-quality criteria one would expect from the ONS Methodology Division.

Currently, there are two ONS internal technical papers available on the work of the Methodology Matching Team: Plachta and Shipsey, 2019; Plachta, 2020.  Both papers can be made available on request.  Here we provide a brief overview:

  • For the 2021 Census RMR strategy, a deterministic linkage method will be implemented to identify duplicate people within an enumeration address (UPRN) prior to resolution. The linkage method uses 17 match-keys developed using 2011 Census data to capture as many duplicate matches as possible with a very low tolerance of incorrect matches. Research into the optimal combination of match-keys was conducted by testing a variety of strengths of match-key on 2011 data and conducting a clerical review to identify those that made false positive matches (records that were not in truth duplicates). The remaining keys are loose enough to let through true matches that contain errors.

The match-keys allow for exactly matching records, records where date of birth is missing or contains slight error, records with missing gender, and records containing a variety of name errors. Name matching is implemented using two string comparators, Levenshtein edit distance and Jaro-Winkler, to identify matching names with spelling, handwriting or scanning errors. The benefit of using two comparators is that, although both are very high quality and find a high volume of correct matches, each can identify matches that the other misses, making the RMR match-keys more resilient than the Soundex phonetic comparator method used in 2011.
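The 17 match-keys themselves are documented in the Methodology papers cited above; the sketch below is only a simplified illustration of the general match-key approach within a UPRN, assuming hypothetical field names and just two keys (one exact, one tolerating small surname errors via Levenshtein edit distance).

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def is_within_uprn_duplicate(r1: dict, r2: dict, max_edits: int = 2) -> bool:
    """Apply two illustrative match-keys to a pair of person records."""
    key = lambda r: (r["forename"].lower(), r["surname"].lower(), r["dob"], r["sex"])
    # Key 1 (exact): full agreement on names, date of birth and sex.
    if key(r1) == key(r2):
        return True
    # Key 2 (loose): same date of birth and sex, surname within a small edit
    # distance to tolerate spelling, handwriting or scanning errors.
    if r1["dob"] == r2["dob"] and r1["sex"] == r2["sex"]:
        return levenshtein(r1["surname"].lower(), r2["surname"].lower()) <= max_edits
    return False

def duplicate_pairs(records_at_uprn: list) -> list:
    """Return all candidate duplicate pairs of person ids within one UPRN."""
    return [(a["person_id"], b["person_id"])
            for a, b in combinations(records_at_uprn, 2)
            if is_within_uprn_duplicate(a, b)]
```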

In 2021, the Methodology Linkage team will also run a probabilistic matching exercise alongside the deterministic RMR process to independently identify duplicate responses. This will use the Fellegi-Sunter algorithm with parameters calculated from 2011 data. The exercise is not designed to add further duplicates to RMR's list of duplicates, but to act as a quality assurance process running alongside it. If the probabilistic algorithm and the deterministic methods return similar matches, we can be confident that the match-keys are working well, despite the differences between 2011 and 2021 data. Using probabilistic matching within RMR itself is out of scope due to the resources required, firstly to conduct it to the required quality, and secondly to deal with any potential matching conflicts between deterministic and probabilistic methods.
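For readers unfamiliar with Fellegi-Sunter scoring, the sketch below shows the standard agreement/disagreement weight calculation; the field list, the m- and u-probabilities, and the threshold are placeholder values for illustration, not the parameters estimated from 2011 data.

```python
import math

# Placeholder m-probabilities (P(field agrees | records are a true match)) and
# u-probabilities (P(field agrees | records are not a match)) for each field.
M_PROBS = {"forename": 0.95, "surname": 0.96, "dob": 0.98, "sex": 0.99}
U_PROBS = {"forename": 0.01, "surname": 0.02, "dob": 0.003, "sex": 0.5}

def fellegi_sunter_score(rec_a: dict, rec_b: dict) -> float:
    """Sum log2 agreement/disagreement weights across the compared fields."""
    score = 0.0
    for field, m in M_PROBS.items():
        u = U_PROBS[field]
        if rec_a.get(field) == rec_b.get(field):
            score += math.log2(m / u)              # agreement weight
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight
    return score

def is_probable_duplicate(rec_a: dict, rec_b: dict, threshold: float = 10.0) -> bool:
    """Classify a pair as a likely duplicate if its total weight exceeds a threshold."""
    return fellegi_sunter_score(rec_a, rec_b) >= threshold
```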

In 2011, 237,200 records were identified as duplicates in RMR, although in subsequent matching stages many more duplicate records were found. The new match-keys for 2021 have been able to improve this number to 288,468 duplicates identified, with a precision of over 99.99%, when testing on 2011 data. Although we expect the 2021 Census data to differ from the 2011 data, due in part to changes in collection methods, we are confident that the improvements we have made to the method mean that it is flexible and robust enough to perform well with 2021 Census data, outperform the 2011 strategy, and provide the required consistency with the other matching methods also used to support Statistical Estimation & Adjustment.

4.4 Resolving multiple and duplicate responses through rule-based and statistical decision-making

  • As outlined earlier, the resolution of multiple or duplicate responses within each topic area (i.e., CEs, HHs, dummy forms, and so on) is addressed in RMR through a hierarchy of deterministic business rules followed by rules driven more by statistical probability to help manage uncertainty. Two overarching principles were agreed to guide the review of the 2011 strategy and the development of the 2021 RMR strategy. In all cases where a resolution process was implemented in the 2021 RMR strategy:
    • it should not inadvertently introduce bias into the Census data.
    • it should look to retain as much supplied information as possible.
  • An initial review of the 2011 RMR decision-making strategies revealed that in many cases the business rules implemented to resolve some multiple or duplicate responses were no longer valid relative to the general Census design principles laid out in Section 4.1. For instance, business rules driven by a predominantly paper-based Collection strategy in 2011 represented a particularly salient point of revision considering the transition to a predominantly electronic questionnaire in 2021.  It was also noted that in many of the 2011 modules, the resolution of residual cases that remained unresolved at the end of a sequence of deterministic rules was still being addressed through propositional (micro) business rules where a stochastic approach would have been a more appropriate way to manage uncertainty.  Retaining these rules would clearly mean breaking the principle of not inadvertently introducing bias into the Census data.

To ensure this overarching principle was always adhered to, a second general principle was applied to the process of reviewing and rebuilding the decision-making hierarchy in each of the 2021 RMR modules: to always start from a position where the general and default resolution strategy would be to distribute multiples or duplicates randomly amongst all available options.  During the discussion and review of each module, this baseline strategy could only be preceded by deterministic business rules or by other strategies if the members of the Working Group could justify their inclusion and why they should take priority.  In truth, the decision-making logic of all 2021 RMR modules ended up as a sequence of deterministic business rules, with the uncertainty associated with any residuals at the end of the decision-making sequence being managed by statistical resolution (see Table 2 for a comparison).  However, this principled approach ensured that each of the 2021 RMR modules was examined and developed in a structured and systematic way while maintaining the overarching aim of avoiding the introduction, or for that matter retention, of unintentional bias.

  • In 2011, across the entire suite of RMR modules, deterministic business rules driven by the overarching design of the 2011 Census, particularly the 2011 Census Collection strategy, were used quite frequently in the RMR decision-making hierarchy. While there was no reason to move away from a similar approach in 2021, significant changes to several aspects of the overarching Census design for 2021, including the Collection strategy, led to many revisions of business rules for 2021 RMR.  Details of all these changes can be found in Section 6. Here, as examples, we present just a few of the design-based business rules considered in the review that led to the most significant changes.  As we worked through each of the RMR modules the list of business rules was referred to and implementations amended where necessary.  Again, this principle ensured that we maintained a consistent approach to the review.  These changes ensured that the 2021 RMR strategy was consistent with the reasoning behind changes to the overarching Census design between 2011 and 2021.
  • EQ first. One of the most significant changes in the design of the 2021 Census compared to 2011 is the shift in the Collection strategy from predominantly paper questionnaires (PQ) to electronic questionnaires (EQ).   Reflecting the earlier design principle, some of the 2011 RMR modules prioritised PQ responses over EQ responses.  For 2021, the change in this overarching design principle led to similar rules being reconfigured, where appropriate, to prioritise EQ responses over PQ.
  • iForms first. While prioritising the information collected through individual forms is not new for 2021, more emphasis has been placed on encouraging and facilitating the use of iForms in the overarching Census design than in 2011. This shift is designed to meet the demand for more accurate statistics on changing social norms and allows individuals to provide personal information about themselves that they may not have disclosed on a standard HH form. This change was considered throughout the RMR review.
  • Non-response as a valid ‘prefer not to say’. In 2011, RMR often used completion rate to determine which of a set of multiple questionnaires returned by the same individual would be considered the baseline for integration; typically, the most complete record is selected for this purpose before ongoing integration. However, as an extension of the demand for more accurate statistics on changing social norms, the Working Group were advised that there is a legal obligation to count missing data in an iForm associated with voluntary questions, such as gender identity, as a valid ‘prefer not to say’ response.  This was factored into all decision-making logic that fell into this category.
  • Receipt date. In the 2011 RMR strategy, date of receipt was sometimes employed to prioritise one response over another when multiple responses were received from the same HH, CE, or individual. For 2021, several options were always considered, including first receipted, last receipted, and receipted closest to Census day, with the best option selected based on the circumstance.  However, in the overarching Census Collection design there is more focus on the significance of iForms than there was in 2011. In addition to encouraging individuals to complete iForms for the reasons previously outlined in the ‘iForms first’ principle above, public-facing Census Support will also guide people who ask how to revise information they have already provided towards completing an iForm. These design decisions render ‘last receipted’ a more predominant option amongst the alternatives than in 2011.
  • To close this Section, there are four more aspects of the 2021 RMR design considered during the review that we think are worth mentioning in this more generic overview. These represent relatively important extensions to the 2011 RMR design for 2021 that we are confident will lead to improved performance.
  • Retention of information. We mentioned at the beginning of this Section that the retention of as much information as possible was one of the main aims when integrating multiple or duplicate records belonging to the same HH, CE or individual. For 2021 this will generally be implemented in the same way as in 2011: once we have determined which record to retain as the baseline response through the decision-making strategies discussed so far, missing data in that record will be backfilled using information from the other duplicate records.  There was a considerable amount of discussion about optimising this process, particularly around trying to maintain consistency between responses within a person and between people within a HH.  However, on review, the Working Group decided that trying to implement an editing strategy within RMR was out of scope and that the resolution of inconsistencies would best be left to the subsequent statistical Editing and Imputation methodology.
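A minimal sketch of this backfilling step is given below; the field names are hypothetical, the disregarded duplicates are assumed to be pre-sorted by the same priority rules used to select the baseline record, and -9 is used as the missing-value sentinel (as in the module specifications later in this paper).

```python
MISSING = -9  # sentinel used for missing values in this illustration

def backfill(baseline: dict, duplicates_in_priority_order: list) -> dict:
    """Fill missing variables on the retained record from disregarded duplicates.

    `duplicates_in_priority_order` should already be sorted using the same
    priority rules that selected the baseline record.
    """
    resolved = dict(baseline)
    for variable, value in list(resolved.items()):
        if value != MISSING:
            continue
        for duplicate in duplicates_in_priority_order:
            if duplicate.get(variable, MISSING) != MISSING:
                resolved[variable] = duplicate[variable]
                break
    return resolved

# Example: two duplicate HH responses, the first retained as the baseline.
baseline = {"tenure": MISSING, "accommodation_type": 2}
duplicates = [{"tenure": 1, "accommodation_type": MISSING}]
print(backfill(baseline, duplicates))  # {'tenure': 1, 'accommodation_type': 2}
```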

There was, however, one area where we were able to improve upon the 2011 RMR design.  One of the inevitable consequences of resolving duplicates is that the original HH structure can be disrupted. For example, we could start with a 7-person HH, but persons 2 & 3 turn out to be duplicates and are resolved into one.  This means that when restructuring HHs at the end of RMR, person 4 becomes person 3, and so on.  The problem here is that with a paper questionnaire the relationship matrix is abbreviated. In 2011 we collected all the relationships between persons 1 to 6, but once the questionnaire reached person 7 it would only ask for the relationship to person 1. Consequently, by making person 6 person 5 after RMR, all the relationships that would have been collected had there not been a duplicate are now missing.  This would have been left for Edit & Imputation to resolve.

The problem in 2021 is likely to be more significant as abbreviation of the relationship matrix starts at person 5 rather than person 6 in the paper questionnaire, potentially increasing the amount of imputation required in the relationship matrix for this response mode.  However, the electronic questionnaire in 2021 has been designed to collect the entire relationship matrix regardless of the number of people in the HH.  While relatively complex, the 2021 RMR strategy has been coded to retain that information and adjust the relationship matrix accordingly when tidying up the HH structure, replacing what would have been imputed data with observed data.
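The sketch below illustrates that adjustment under the simplifying assumption that the relationship matrix is held as a mapping from ordered person-number pairs to relationship codes; the real data structures and relationship codes will differ.

```python
def renumber_relationship_matrix(relationships: dict, removed_person: int) -> dict:
    """Re-index a relationship matrix after a duplicate person is removed.

    `relationships` maps (person_a, person_b) pairs to relationship codes using
    the original person numbers. When `removed_person` is dropped (having been
    resolved into their duplicate), every higher-numbered person shifts down by
    one and the observed relationships are retained under the new numbers.
    """
    def new_number(person: int) -> int:
        return person - 1 if person > removed_person else person

    adjusted = {}
    for (a, b), code in relationships.items():
        if removed_person in (a, b):
            continue  # drop relationships involving the removed duplicate
        adjusted[(new_number(a), new_number(b))] = code
    return adjusted

# Example: a 4-person HH where person 3 was a duplicate of person 2.
relationships = {(1, 2): "partner", (1, 3): "partner", (1, 4): "child",
                 (2, 3): "self", (2, 4): "child", (3, 4): "child"}
print(renumber_relationship_matrix(relationships, removed_person=3))
# {(1, 2): 'partner', (1, 3): 'child', (2, 3): 'child'}
```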

  • Making use of the ‘Tuning Phase’. In contrast to 2011, the 2021 Census response data will be streamed live into the processing pipelines daily from the day the Census goes live during Collection. This represents an early opportunity to start exploring the data and adjusting method parameters and processes based on analyses of the actual 2021 Census data, prior to committing to a final end-to-end run of the Census cleaning and adjustment methods once all the data have been collected. This was considered throughout the review of the RMR strategy, and there were two areas in the decision-making logic space where we felt we could program in new strategies with adjustable parameters that not only take advantage of this ‘tuning phase’ but could potentially provide additional support to our aim of not inadvertently introducing bias into the Census data.

We have already mentioned that in 2011, RMR often used completion rate to determine which of a set of multiple questionnaires returned by the same individual would be considered the baseline for integration, and that, typically, the most complete record is selected for this purpose before ongoing integration.  For 2021 we have extended this function: rather than being based on the simple sum of completed Census questions, it is now based on a weighted sum, with the weights being parameters that can easily be adjusted through a configuration file read by the RMR program.  This allows evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during Census Collection.
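As a minimal sketch of the weighted completion score, assuming a simple JSON configuration file and hypothetical variable names (with the default weight of 1 per variable used in the module specifications later in this paper):

```python
import json

MISSING = -9  # sentinel for a missing response in this illustration

def load_weights(path: str, variables: list) -> dict:
    """Read per-variable weights from a configuration file, defaulting to 1."""
    try:
        with open(path) as f:
            configured = json.load(f)
    except FileNotFoundError:
        configured = {}
    return {v: configured.get(v, 1.0) for v in variables}

def weighted_completion(record: dict, weights: dict) -> float:
    """Weighted count of the questions a record has actually answered."""
    return sum(w for v, w in weights.items() if record.get(v, MISSING) != MISSING)

# Example: prioritise records that answered date of birth and sex.
weights = {"forename": 1.0, "surname": 1.0, "dob": 2.0, "sex": 2.0}
record_a = {"forename": "Sam", "surname": MISSING, "dob": "2001-03-21", "sex": 1}
record_b = {"forename": "Sam", "surname": "Smith", "dob": MISSING, "sex": MISSING}
print(weighted_completion(record_a, weights))  # 5.0
print(weighted_completion(record_b, weights))  # 2.0
```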

We have also mentioned that, in the resolution of residuals, all RMR modules will converge on distributing multiples and duplicates randomly where there are no valid business rules or other strategies to suggest otherwise.  For 2021, we have programmed RMR to accept and insert a sequence of up to 10 user-defined conditional propensity statements.  This provides the opportunity to fine-tune the probability distribution of residuals, if required or deemed appropriate, rather than automatically falling back on a completely random allocation.  It was not possible to implement this functionality through a configuration file as we have for the weighted sum of valid observed data, but the code has been written in such a way that adding them during live processing would not be difficult.
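The sketch below illustrates the intent of these conditional propensity statements: an ordered list of rules is tried in turn and, if none applies, allocation falls back to a completely random draw. The rule format, the example condition and the weighting function are assumptions for illustration, not the specification coded into RMR.

```python
import random

# Each rule is a (condition, weight_function) pair: if the condition holds for
# the residual record, candidate households are drawn with the given weights;
# otherwise the next rule is tried. Up to ten such rules can be supplied.
def is_student(record: dict) -> bool:                     # hypothetical condition
    return record.get("student", 0) == 1

def prefer_larger_households(candidates: list) -> list:   # hypothetical weights
    return [len(hh["persons"]) for hh in candidates]

PROPENSITY_RULES = [
    (is_student, prefer_larger_households),
]

def allocate_residual(record: dict, candidate_hhs: list, rng=random) -> dict:
    """Allocate a residual record to one of the candidate households."""
    for condition, weight_fn in PROPENSITY_RULES:
        if condition(record):
            return rng.choices(candidate_hhs,
                               weights=weight_fn(candidate_hhs), k=1)[0]
    # Default resolution strategy: distribute randomly amongst all options.
    return rng.choice(candidate_hhs)
```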

  • Extending the search area. Following the 2011 Census, it was noted that duplicate responses were not always constrained to a unique enumeration address.  This is not unusual with Census data or other data sources but as duplicates of any kind contribute to overcount in statistical estimates, aggregate adjustments for this type of error were made through the 2011 statistical Estimation and Adjustment methodology.  It was noted, however, that some of these duplicates occurred within small area geographies such as postcode or LSOA.  Consequently, a recommendation was carried forward to 2021 to explore the possibility of extending the remit of RMR to identify and resolve these duplicates earlier on in the Census processing pipeline.

To address this recommendation, research was conducted with three primary aims. First, to identify the primary source of the problem with a view to understanding whether it was likely to occur again in 2021.  Second, to explore the potential problem of establishing an appropriate rule-based resolution strategy that could effectively resolve all between-HH duplicates.  And third, to evaluate whether implementing a proportional rule-based resolution strategy within RMR had any real value compared to a statistical adjustment.  It is important to note here that ONS Methodology were already developing statistical Matching and Estimation methodology to address this type of overcount based on Census to Census matching.

The results of the research indicated that these localised duplicates could primarily be attributed to errors in the 2011 Address Register, typically where two or more questionnaires had been sent to the same address.  Following consultation with the ONS Address Register Team, we were assured that the risk of this occurring again in 2021 was far lower than it was in the previous Census.  In addition, mapping out the potential problem space for a rule-based resolution strategy applied across the full gamut of possible duplicate combinations that could occur at the person level between one or more HHs, with up to 30 individuals within a HH, demonstrated that this would be an extremely complex and time-consuming exercise.  It was also concluded that, due to this complexity, a full rule-based resolution strategy was likely to increase the risk of introducing bias into the Census data rather than reducing it.

All in all, the Working Group concluded that, fundamentally, a statistical approach through Estimation and Adjustment to what was likely to be a relatively low-impact problem in 2021 was a far better strategy than extending RMR.  However, it was also agreed that the idea was not completely redundant. Implementing the statistical Matching strategy to identify duplicates within the desired area is a relatively easy extension to the RMR program.  Consequently, it was agreed that RMR would be extended to identify and flag these duplicates, as this information could make a significant contribution to subsequent statistical Matching and Estimation strategies. It was also agreed that in the relatively simple case of wholly duplicate HHs, a simple rule-based approach of retaining only one of the HHs would be appropriate and carry little or no risk.  While we still have to finalise the detail, this is likely to be based on the principle of retaining as much information as possible from the duplicate HHs.

  • Using Admin or Alternative data. Making use of administrative and/or alternative data sources has been a consistent theme throughout the Census programme and this was considered throughout the RMR review. Specifically, following the 2011 Census it was noted that the 2011 RMR strategy may have struggled to resolve multiple responses where it was not clear whether a ‘dummy form’, representing a unique address with no valid Census return, related to a regular property or a CE.  There were obvious candidate administrative sources that might help with this issue, such as the VoA data.  However, in this case it became relatively clear that reliable information could be pulled into the RMR process from the 2021 Field Work Management Tool (FWMT) to achieve the same thing, but without having to consider potential problems with administrative data such as its accuracy and how it may lag in time relative to Census day.  While the FWMT data is not administrative, it is alternative data that was not used in 2011, so we expect far better resolution of dummy forms in 2021.

It is fair to note that during the review we did not identify anywhere else within the RMR process where other administrative data sources would be particularly useful or would improve the quality of the outcome.  Much of this was due to improvements in areas such as the Census Address Register, the FWMT, and the Census Collection strategy itself, which, by definition, have reduced the propensity for erroneous duplicates to occur in the Census data in the first place.  That said, we have left open the possibility of linking the RMR process to the Census Intelligence Database (CID), containing a suite of pre-linked administrative data sources for use elsewhere across the Census programme.

Table 2. General comparison of the 2011 & 2021 RMR matching & resolution strategies

Matching methods
  RMR 2011: Relatively basic & independent
  RMR 2021: Completely aligned with ongoing statistical Matching, Estimation & Adjustment methodology

Search & resolution of multiples & duplicates within UPRN
  RMR 2011: Based on overarching 2011 Census design: Business rule 1; Business rule 2; ...; Business rule n; Retention of information: Variable Count; Residual management: Micro business rules
  RMR 2021: Based on overarching 2021 Census design: Business rule 1; Business rule 2; ...; Business rule n; Retention of information: Weighted Variable Count; Retention of information: Variable Count; Residual management: Micro business rules; Residual management: Conditional propensity allocation/distribution; Residual management: Random allocation/distribution

Search & resolution within wider geography
  RMR 2011: n/a
  RMR 2021: Search: yes; Resolution: partial, but with full flagging to support ongoing statistical Matching, Estimation & Adjustment methodology

Use of alternative or administrative data sources
  RMR 2011: n/a
  RMR 2021: Wider admin sources considered; use of alternative information from FWMT

The following lists, for each RMR module, its description, an overview of its function, and the overarching assumptions it relies on.

Module 1: Resolves multiple CE responses
  Overview:
  - Multiple CE responses for a UPRN are resolved to one record.
  - HH responses at the same UPRN as a CE response are disregarded.
  - Persons captured on HH forms at these UPRNs are moved to the retained CE at the UPRN.
  Overarching assumptions:
  - The address frame accurately identifies HHs that are attached to CEs as different residences and assigns them a different Child UPRN to the CE.
  - When multiple residences at the same address are identified in the field, unique Child UPRNs will be created to identify the different residences as being at different addresses.

Modules 2 & 3: Resolve multiple HH responses: Stages 1 & 2
  Overview:
  - Duplicate HH responses within the UPRN are disregarded; persons within disregarded HH responses are moved to the retained HH (the HH that the disregarded response was identified to be a duplicate of).
  Overarching assumptions:
  - The address frame accurately identifies different HHs at the same address and assigns them different Child UPRNs so that they are identified as different residences through processing.
  - When multiple residences at the same address are identified in the field, unique Child UPRNs will be created to identify the different residences as being at different addresses.

Module 4: Resolve dummy responses
  Overview:
  - Multiple dummy responses for a UPRN are resolved to one.
  - A HH or CE is created at UPRNs where there are dummy response(s) but no HH or CE response.
  Overarching assumptions:
  - Multiple dummy responses for the same UPRN are identified as duplicates.
  - Duplicate dummy responses are always assumed to be duplicates from the same HH or CE.
  - When multiple residences at the same address are identified in the field, unique Child UPRNs will be created to identify the different residences as being at different addresses.

Module 5a: Resolve duplicate iForm responses (*New Module for 2021)
  Overview:
  - Duplicate iForm responses at the same UPRN are resolved to one iForm response.
  Overarching assumptions:
  - If the matching methods provided by Methodology identify individual responses as being duplicates then it is accepted that this is correct and these responses are resolved to one.

Module 5b: Assign continuation forms
  Overview:
  - HC forms are assigned to a HH or CE at their UPRN.

Module 5c: Assign iForms
  Overview:
  - iForms are assigned to a HH or CE at their UPRN.
  Overarching assumptions (Modules 5b & 5c):
  - Persons captured on iForms and HC forms are captured at the correct UPRN. Therefore, they should be assigned to a HH or CE that exists at the UPRN. They do not become discovered HHs.

Module 6: Resolve orphan responses
  Overview:
  - Orphan responses are identified as iForms or HC forms received at UPRNs where there was no CE, HH or dummy response.
  - One HH or one CE is created at UPRNs where there are orphan responses.
  - The orphan responses are assigned to the newly created residence at the UPRN.
  Overarching assumptions:
  - Persons on orphan iForms and HC forms are captured at the correct UPRN. Therefore, they should not be assigned to a HH/CE at a nearby address and need a residence created for them at the UPRN.
  - All orphan responses at the same UPRN are assumed to relate to the same residence.

Modules 7 & 8a: Identify duplicate individual responses (Module 7) and resolve duplicate individual responses (Module 8a)
  Overview:
  - Duplicate persons in the same residence are identified and resolved.
  Overarching assumptions:
  - If the matching methods provided by Methodology identify individual responses as being duplicates it is accepted that this is correct. These duplicate responses are then resolved to one.

Module 8b: Flag residual duplicate individual responses (*New Module for 2021)
  Overview:
  - Remaining duplicate individuals in the same UPRN are flagged.
  Overarching assumptions:
  - If the matching methods provided by Methodology identify individual responses as being duplicates it is accepted that this is correct.

Modules 9 & 10: Identify wholly duplicated HHs (Module 9) and resolve wholly duplicated HHs (Module 10) (*New Modules for 2021)
  Overview:
  - HHs within the same postcode that contain the same individuals are identified and resolved.
  Overarching assumptions:
  - If the matching methods provided by Methodology identify individual responses as being duplicates it is accepted that this is correct.
  - HHs of the same size that contain exactly the same individuals within the same postcode are duplicate responses.

Module 11: Resolve adjusted CE and HH data structures
  Overview:
  - Resolves the adjusted CE and HH data structures following the logical rules implemented in RFP and RMR Modules 1 to 10.
  Overarching assumptions:
  - HHs that have more than 30 individuals are assumed to have submitted the wrong residence form.

Module 12: Create RMR flags
  Overview:
  - Creates flags that detail data changes that have been undertaken by the RMR process.
  Overarching assumptions: N/A

Module 13: Run RMR diagnostics
  Overview:
  - Checks that the RMR process has run successfully.
  - Creates diagnostics on the RMR process.
  Overarching assumptions: N/A
 RMR Modularisation: Detailed Methods & Rationale

Module # Description of Module Description of Rules Rationale
1 Resolves multiple CE responses When both HH and CE responses are received for the same UPRN the following rules are applied: CE responses are prioritised over HH responses as there will be hand delivery of CE forms to some establishments. Therefore, if a UPRN has filled out a CE form there is likely to be good reason behind it.
-  The CE response is retained and the HH response is disregarded. There are likely to be fewer multiple responses of this type in 2021 as CEs and HHs are captured on the same Address Register.
-  Persons captured on the disregarded HH responses are moved to the retained CE. The persons captured on the disregarded HH responses are moved to the CE so that they are kept.
When multiple CE responses are received for the same UPRN. There should only be one CE at an address. If multiple responses are received, then it’s highly likely that these responses relate to the same CE.
We choose one of the CE responses to keep as the baseline CE record and disregard the other form(s).
1) The record with the greatest sum of weighted completed questions is retained to ensure that the most consistent record is kept. The weighted function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection.
The selection of the CE response to be retained is based on the following set of priority rules:
2) EQ (not forced) over PQ as EQ responses are known to be of higher quality than PQ responses. PQ is taken over forced EQ responses as PQ is a response that has been submitted by the respondent whereas they haven’t submitted the forced EQ response.
1) Greatest sum of weighted completed CE questions. (The default weights are 1 for each variable)
3) Later receipted responses are favoured over other responses as it is expected that some people may choose to fill out another form to correct for a mistake or change in circumstance on an earlier form.
2) If equal on sum of weighted completed CE questions, then prioritise EQ (Not Forced) then PQ, then EQ (Forced).
4) There is the functionality to add extra rules here, during the tuning phase if information comes to light.
3) If equally complete and same response mode, then prioritise the response that was last receipted.
5) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011 when the last measure was to take the first one found.
4) Additional rules (if required).
Backfilling of responses is undertaken to keep as much information on the residence as possible. Information from disregarded records is expected to be more accurate than imputed values.
5) If all else is equal, then one of the forms is selected at random.
Any missing CE variables on the retained record will be populated with valid responses from a disregarded CE response for the same UPRN using the same priority rules as those used to select which CE response to retain.
2 Resolve multiple HH responses: Stage 1 Where there are multiple HH forms received for the same UPRN, the following sequential rules are applied to decide whether they are duplicate responses and should be merged: 1) Responses with the same QID can only result when a paper form with an IAC is received by a HH that then return both internet and paper responses. This rule essentially covers the merging of EQ and PQ responses for the same address rule in 2011.
This rule will be coded in such a way that it can be toggled on/off in live running if required.
1) If the HH responses have the same QID then the HH responses are merged.
2) If matching identifies that responses contains the same individuals, then it’s highly likely that the responses relate to the same HH and are duplicates of each other.
2) The match-keys provided by Methodology are used to identify HH responses that contain the same individuals.
3 Resolve multiple HH responses: Continuing the resolution of multiple HH responses from Module 2: 3)  If matching identifies that responses contains the same individuals, then it’s highly likely that the responses relate to the same HH and are duplicates of each other.
Stage 2
3) HH forms that are found to contain persons in common in Module 3, Rule 2) are assumed to be the same HH and are therefore merged together. 4) Minor-only HHs can only legitimately occur under very rare circumstances, and, for disclosure reasons, statistics on these HHs cannot be output in the Census. Also, the likelihood of a minor-only HH occurring at the same UPRN as a non-minor-only HH would be rare and the likelihood is that it is the same HH but for some reason only children were captured on one of the forms. Therefore, it was agreed that we should merge the HH responses in these cases.
4) If one or more of the multiple responses is a minor-only (under 16) HH and there are other non-minor-only HHs then the minor-only HH response is merged with a non-minor-only HH response, prioritising merging to those HHs which hold the highest Levenshtein score match on surname (where the Levenshtein score is above a certain parameterised threshold).  If multiple forms have the same match score or none of the non-minor only HH responses have acceptable Levenshtein match scores, then one of these matched (or un-matched where no matches) HH responses is selected to be merged with the minor-only HH based on a random draw. 5) This is a new rule for 2021. There was concern that there may some discovered HHs may be lost unless they are identified at this stage. E.g. A HH with a dedicated Annex Therefore, this step will look for any strong evidence on the HH responses that identify them as being discovered HHs.  It is still to be decided what evidence can be used to support the identification of a discovered HH. Suggestions include using the responses to H8, H9 and/or H10 on the HH form. This step will be coded in such a way that additional rules for identification can be added later.
5) Based on a set of to be defined conditions, if there is evidence to suggest that a HH responses is a discovered HH at the UPRN then we will flag them as a discovered HH and they do not go through steps, 6, 7 & 8. 6) This is a new rule for 2021. If there is no reason to suggest that the multiple HH responses relate to different HHs in the previous rule and at least one person’s surname matched across the forms, then it’s assumed likely that the HH responses relate to the same HH.
6) Levenshtein matching is undertaken on the surnames of the person responses. Where the Levenshtein match score is deemed acceptable between multiple HH responses they are merged. 7) This is a new rule for 2021. This allows for rules to be added to merge the responses if there is evidence to suggest that they should be. A rule that may be added is the merging of Welsh language responses with English language responses at the same Child UPRN. This rule was implemented in 2011 as addresses in Wales are sent both the Welsh language form and the English language form and therefore have a higher propensity to respond twice than addresses in England.
7) Based on a set of to be defined conditions if there is evidence to suggest that the HH response(s) are not discovered HHs then we will merge them at this stage. 8) Empty forms received at the same UPRN as other responses are assumed to be incomplete responses or timeshares. Therefore, the responses are merged unless there is information to suggest that they shouldn’t be.
8) If there are “empty” HH responses (HH responses with no person information filled in), these are merged with “non-empty” HH responses unless there is evidence to suggest that they are legitimate empty HHs. Where there are multiple “non-empty” HH responses, one of these responses is selected at random, to be merged with the “empty” HH response.
When merging HH responses in Modules 2 and 3 one of the HH responses is selected to be retained as the baseline HH record, based on the following priority rules:
a) The record with the greatest sum of weighted completed questions is retained to ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during Census Collection.
a) Take the response with the greatest sum of weighted completed HH questions (the default weights are 1 for each variable).
b) EQ (not forced) over PQ as EQ responses are known to be of higher quality than PQ responses. PQ is taken over forced EQ responses as PQ is a response that has been submitted by the respondent whereas they haven’t submitted the forced EQ response.
b) If equally complete, then prioritise EQ (Not Forced) then PQ, then EQ (Forced)
c) Later receipted responses are favoured over other responses as it is expected that some people may choose to fill out another form to correct for a mistake or change in circumstance on an earlier form, even though they are advised against doing this.
c) If equally complete and same response mode, then prioritise the response that was last receipted.
d) This is a new rule for 2021. There is the functionality to add extra rules here, during the tuning phase if information comes to light.
d) Additional rules (if required)
e) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011 when the last measure was to take the first one found.
e) If all else is equal, then one of the forms is selected at random.
Backfilling of responses is done to keep as much information on the residence as possible. Information from disregarded records is expected to be more accurate than imputed values.
Any missing HH variables on the retained record will be populated with valid responses from a disregarded HH response for the same UPRN using the same priority rules as those used to select which HH response to retain.
4 Resolve dummy responses The following sequential rules are undertaken on dummy response data: 1) It is an assumption of the module that a dummy response is a duplicate response of a CE or HH response captured at the same UPRN. 
1) If there is a CE response captured at the UPRN then all dummy responses at the UPRN are discarded. The CE response is retained over the dummy response when both are captured at the same UPRN for a variety of reasons:
-  The CE form contains the questions that are required, whereas the dummy form contains questions that could be used to create CEs from.
2) If there is a HH response captured at the UPRN then all dummy responses at the UPRN are discarded. -  It is not reasonable to place a response burden on a respondent to fill out a CE form if it is then ignored in favour of a field interviewer’s view.
-  Responses filled out by a site manager are likely to be more reliable than those filled out by an enumerator who may not even have access to the establishment.
3) If a dummy response contains both HH and CE information, the response is split into two separate records in the data: one containing only HH information and the other containing only CE information.
4) One of the dummy responses at the UPRN is selected to be retained based on the following priority rules:
a) Greatest sum of weighted completed dummy questions (the default weights are 1 for each variable).
b) FWMT over any potential PQ back-up option.
c) Last receipted.
d) Additional rules (if required).
e) If all else is equal, then one of these forms is selected at random.
5) Any missing dummy variables on the retained record will be populated with valid responses from a disregarded dummy response for the same UPRN, using the same priority rules as those used to select which dummy response to retain.
6) If the retained dummy response is a CE, then a new CE record is created on the CE table for this UPRN. The CE record will be flagged as having been created from a dummy response. Any dummy question variables that exist on the CE table will be populated for this new record from the responses on the dummy form. IDs for this row will be created from the dummy Response ID. All other information will be set to missing (-9) on the new record.
7) Otherwise, if the retained dummy response is a HH, then a new HH record is created on the HH table for this UPRN. The HH record will be flagged as having been created from a dummy response. Any dummy question variables that exist on the HH table will be populated for this new record from the responses on the dummy form. IDs for this row will be created from the dummy Response ID. All other information will be set to missing (-9) on the new record.

Rationale:

2) It is an assumption of the module that a dummy response is a duplicate of a CE or HH response captured at the same UPRN. The HH response is retained over the dummy response when both are captured at the same UPRN for a variety of reasons:
-  The HH form contains more questions that directly map to the HH table on the CDM.
-  There is little point in asking a resident to fill out their HH form if we then ignore their response in favour of a field interviewer who most likely did not have access to the residence.
-  A resident's response about the HH they live in is more reliable than one filled out by an enumerator who may not even have access to the HH.
In 2011, the Type of Accommodation and Self-Contained questions on the dummy response were merged with the same fields on the HH response if the variables were missing on the HH response but observed on the dummy response. It was decided not to undertake this action for 2021 as there was concern about potential field interviewer bias in these responses, and it was noted that imputation would likely better recover the true values of these variables for the HH.
3) A residence can be either a HH or a CE; it cannot be both. The forms are therefore split so that either a HH or a CE is created.
4)
a) The record with the greatest sum of weighted completed questions is retained to ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection.
b) FWMT responses will be taken over any potential PQ back-up as the quality of response on EQ is known to be higher than that of PQ.
c) Later receipted responses are favoured over other responses as there is an assumption that a later response may be filled out to correct a mistake or a change in circumstance on an earlier form.
d) This is a new rule for 2021. There is the functionality to add extra rules here, during the tuning phase, if information comes to light.
e) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011, when the last measure was to take the first one found.
5) This follows the general principles of RMR: any missing information should be populated from other completed dummy responses at the same UPRN. The prioritisation rules are the same as those used to select the primary dummy response; keeping these prioritisation rules retains consistency between responses.
6) This action fulfils the key reason for collecting dummy response information: to create a residence when no response is received for a UPRN. If this action were not undertaken, the address could be assigned the wrong residence type, or it could be missed completely by the Census.
7) As for rationale 6, this action fulfils the key reason for collecting dummy response information: to create a residence when no response is received for a UPRN. If this action were not undertaken, the address could be assigned the wrong residence type, or it could be missed completely by the Census.
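The following minimal sketch (Python, with hypothetical field names such as `is_fwmt`, `receipt_date` and `questions`) illustrates how the priority rules in rule 4 could be applied to the dummy responses at a single UPRN. It is intended only to make the cascade of tie-breakers concrete, not to reflect the production implementation or the CDM schema.

```python
import random
from datetime import date

# Illustrative dummy-response records at one UPRN (field names are hypothetical).
dummy_responses = [
    {"response_id": "D1", "is_fwmt": True,  "receipt_date": date(2021, 3, 25),
     "questions": {"accom_type": "detached", "self_contained": None}},
    {"response_id": "D2", "is_fwmt": False, "receipt_date": date(2021, 3, 28),
     "questions": {"accom_type": "detached", "self_contained": "yes"}},
]

# Default weights of 1 per variable; these could be tuned after collection analyses.
weights = {"accom_type": 1, "self_contained": 1}

def weighted_completeness(resp):
    """Sum of weights for questions with a non-missing answer."""
    return sum(w for var, w in weights.items() if resp["questions"].get(var) is not None)

def select_dummy_to_retain(responses, rng=random.Random(0)):
    """Apply the priority rules in order: completeness, FWMT over PQ back-up,
    last receipted, then a random draw if still tied."""
    candidates = list(responses)
    for key in (weighted_completeness,               # a) most complete
                lambda r: r["is_fwmt"],              # b) FWMT over PQ back-up
                lambda r: r["receipt_date"]):        # c) last receipted
        best = max(key(r) for r in candidates)
        candidates = [r for r in candidates if key(r) == best]
        if len(candidates) == 1:
            return candidates[0]
    return rng.choice(candidates)                    # e) random if all else equal

retained = select_dummy_to_retain(dummy_responses)
print(retained["response_id"])  # "D2" in this example (more complete)
```

The rule-5 backfilling step would then copy any still-missing dummy variables onto the retained record from the disregarded responses, in the same priority order.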
5a Resolve duplicate iForm responses

Rules:

When multiple iForm responses are received for the same UPRN, the match-keys provided by Methodology are used to identify duplicate responses.

If there are duplicate responses and at least one of these is not an EQ (Forced) response, then the following sequential priority rules are used to decide which of the responses to retain (forced EQ responses are not considered in rules a to e):

a) Last receipted.
b) If receipted on the same date, take the response with the greatest sum of weighted completed individual questions (the default weights are 1 for each variable) of those last receipted.
c) If they are equal on sum of weighted completed questions and are last receipted, then choose EQ (not forced) over PQ.
d) Additional rules (if required).
e) If all else is equal, then one of these forms is selected at random.

Otherwise, if all the iForm responses are EQ (Forced), these rules are applied:

f) Take the response with the greatest sum of weighted completed individual questions (the default weights are 1 for each variable).
g) Additional rules (if required).
h) If all else is equal, then one of these forms is selected at random.

Missing person variables on the retained record will be populated with valid responses from the disregarded iForm responses for the same UPRN, prioritising valid responses from the disregarded forms based on the following rules:

1) Response with the greatest sum of weighted completed individual questions (the default weights are 1 for each variable).
2) If multiple responses have the greatest sum of weighted completed questions, then prioritise EQ (Not forced), then PQ, then EQ (Forced).
3) If equally complete and of the same response mode, prioritise the response that was receipted last.
4) Additional rules (if required).
5) If all else is equal, one of these forms is selected at random.

Exceptions for backfilling are:
-  Voluntary questions.
-  Partially filled Address, DOB, Year of Arrival and Citizenship fields.

Rationale:

This is a new module for 2021. In 2011, the de-duplication of all individual responses (all form types) within the same residence was undertaken in one rule set at the end of RMR. However, it was discovered that duplicate iForm responses could be assigned to different HHs at the same UPRN because the assigning of iForms to HHs contains a random element (see Module 5c). This meant that these duplicates would not be resolved by RMR as they were assigned to different residences. Therefore, for 2021, we are bringing forward the resolution of duplicate iForm responses to the stage before we assign them to HHs.

Duplicate iForms can be resolved before duplicates from other form types as iForm responses are prioritised over other form types in Module 8a. For more information, please see the Rationale for Module 8a.

In a change to the 2011 rules and to the treatment of individual responses from other form types, we are prioritising last receipted iForm responses ahead of most complete iForm responses. This is based on respondents being advised to fill out iForms as fully as possible when they wish to correct information from earlier submitted forms. It is therefore assumed that the last receipted iForm will contain the correct information for the individual, and these responses are prioritised accordingly. As a result, non-forced submitted iForms are prioritised ahead of forced submitted forms, regardless of completeness. This is because a forced submission will always be the last receipted form at the UPRN, as it does not get uploaded/receipted until EQ closes, yet it is not known when the information on that form was supplied.

a) If a respondent is adamant that they wish to correct information that they have already submitted, they will be advised to fill out an iForm as fully as possible. iForm responses that are receipted later are assumed to provide the "correct" information for the respondent and are therefore prioritised ahead of earlier responses.
b) The record with the greatest sum of weighted completed questions is retained to ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection. Again, submissions through EQ (Forced) are not considered at this point because these responses have not been submitted by a respondent, and this rule is used to decide between two submitted forms that are last receipted.
c) This is a general rule of RMR and is used as EQ responses are expected to be of higher quality than PQ responses.
d) There is the functionality to add extra rules here, during the tuning phase, if information comes to light.
e) A random approach is used as the last measure when looking at only non-forced submissions, to ensure no bias.
f) The forced response with the greatest sum of weighted completed individual questions is selected if there are only forced iForm responses. The record with the greatest sum of weighted completed questions is retained to ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection.
g) There is the functionality to add extra rules here, during the tuning phase, if information comes to light.
h) If a single retained response cannot be identified using the preceding rules, then one of these forms is selected using a random approach as a last measure to ensure no bias.

For the backfilling rules:
1) The record with the greatest sum of weighted completed questions is retained to ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection.
2) EQ responses are known to be of a higher quality than PQ responses. Non-forced submissions are prioritised over forced submissions as a respondent has knowingly sent in the non-forced entries, and therefore they are likely to be of higher quality.
3) Later receipted responses are favoured over other responses as there is an assumption that a later response may be filled out to correct a mistake or a change in circumstance on an earlier form.
4) There is the functionality to add extra rules here, during the tuning phase, if information comes to light.
5) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011, when the last measure was to take the first one found.

Voluntary questions will not be backfilled as, legally, missing responses have to be treated as valid responses because they could be seen as a refusal to answer the question. Partially filled blocks of questions will also not be backfilled, to avoid creating inconsistencies in the data.
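As an illustration of the two branches described above (at least one non-forced response versus forced-only responses), the sketch below applies the priority rules to a small set of hypothetical iForm records. Field names such as `mode`, `forced` and `answers` are invented for the example, and the weighting is the default of 1 per variable.

```python
import random
from datetime import date

# Illustrative iForm records at one UPRN; field names are hypothetical.
iforms = [
    {"id": "I1", "mode": "EQ", "forced": False, "receipt_date": date(2021, 3, 30),
     "answers": {"dob": "1980-01-01", "sex": "F", "occupation": None}},
    {"id": "I2", "mode": "EQ", "forced": True,  "receipt_date": date(2021, 5, 1),
     "answers": {"dob": "1980-01-01", "sex": "F", "occupation": "nurse"}},
    {"id": "I3", "mode": "PQ", "forced": False, "receipt_date": date(2021, 3, 30),
     "answers": {"dob": "1980-01-01", "sex": None, "occupation": None}},
]

def completeness(resp, weights=None):
    """Weighted count of answered questions (default weight 1 per variable)."""
    weights = weights or {k: 1 for k in resp["answers"]}
    return sum(w for k, w in weights.items() if resp["answers"].get(k) is not None)

def resolve_duplicate_iforms(duplicates, rng=random.Random(0)):
    """Retain one iForm from a set of duplicates at the same UPRN.
    Forced EQ submissions are only considered when nothing else exists."""
    non_forced = [r for r in duplicates if not r["forced"]]
    if non_forced:
        pool = non_forced
        # a) last receipted, b) most complete, c) EQ (not forced) over PQ
        tie_breakers = [lambda r: r["receipt_date"],
                        completeness,
                        lambda r: r["mode"] == "EQ"]
    else:
        pool = list(duplicates)
        # f) most complete among forced-only submissions
        tie_breakers = [completeness]
    for key in tie_breakers:
        best = max(key(r) for r in pool)
        pool = [r for r in pool if key(r) == best]
        if len(pool) == 1:
            return pool[0]
    return rng.choice(pool)  # e)/h) random draw as the last measure

print(resolve_duplicate_iforms(iforms)["id"])  # "I1": forced I2 is set aside
```

Note how the forced submission I2 is ignored even though it is the most complete and the last receipted, which mirrors the rationale that forced uploads are always receipted last but of unknown vintage.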
5b Assign continuation forms

Rules:

1) When an HC form is captured at the same UPRN as one retained CE, the persons captured on the HC form will be assigned to the CE.
2) Persons captured on HC forms at the same UPRN as one retained HH are assigned to the retained HH.
3) The following priority rules are used to choose which HH to assign the HC form to when there are multiple retained HHs at the same child UPRN at which a HC form was received:
a) Using match-keys, when there are matches on match-keys to individuals across multiple HHs, the HH is selected based on:
i) the HH with the individual that has matched on the highest of the match-keys in the hierarchical system;
ii) random selection.
b) Levenshtein score matching of full name (first name and last name) against empty persons; when there are matches on full name to individuals across multiple HHs, the HH is selected based on:
i) the highest first name Levenshtein match score;
ii) random selection.
c) Levenshtein score matching of full name (first name and last name) against non-empty persons; when there are matches on full name to multiple HHs, the HH is selected based on:
i) the highest first name Levenshtein match score;
ii) random selection.
d) Levenshtein score matching on last name only; when there are matches on last name to individuals across multiple HHs, the HH is selected based on:
i) the highest Levenshtein match score;
ii) random selection.
e) Additional rules (if required).
f) If all else is equal, one of the HHs is selected at random.

Rationale:

1) Anticipates best intentions where someone has requested an HC form to fill out individual information but resides at a CE. As Module 4 creates CEs from dummy responses, a HC response could be assigned here to a CE created from a dummy. As the HC form does not provide any residence information but the dummy form does, it was decided that the residence type should be taken from the dummy response.
2) Anticipates best intentions where someone has requested a HC form for a HH.
3) These priority rules are new for 2021. In 2011, the rules were based around the H2 question on the HH form (How many people usually live here?) and then Soundex matching of last name. It is known that the quality of the completion of question H2 was likely not fit for purpose in 2011, and in 2021 it will no longer be asked on the EQ. Therefore, we decided against using these methods for 2021 to prevent bias being introduced through this process.
When assigning the HC forms to a HH, we decided to focus on ensuring that duplicate persons are assigned to the same HH where possible. Therefore, when linking HC forms to a HH record, we prioritise assigning the HC form to the retained HH that contains a duplicate response of a resident captured on the HC form. Assigning the HC form to this HH will allow these duplicates to be resolved in Module 8. The assignment to a HH that contains a duplicate individual occurs in rules a) and b).
a) The match-keys are used to identify potential duplicates when multiple HHs share people in common with the HC form. The HH with the best matched (highest match-key) duplicate is selected.
b) This is the second phase of duplicate matching: as empty individuals only contain name information, the only way that they can be identified as a duplicate is through name matching. As this is not as strong a matching method as the match-keys, it sits behind them in the priority order. The rule (i) for choosing between multiple matched HHs is still to be confirmed through further research.
If duplicates are not found, further name matching methods are used to select a HH to assign the HC form to, through rules c) and d).
c) Full name matching of non-empty persons is undertaken ahead of last-name-only matching because people with the same full name at the same address are more likely to live together (for example, fathers and sons). The rule (i) for choosing between multiple matched HHs is still to be confirmed through further research.
d) Last name is then used as it is believed that persons with the same last name at the same address are more likely to live together.
e) There is the functionality to add extra rules here, during the tuning phase, if information comes to light.
f) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011, when the last measure was to take the first one found.
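The assignment cascade can be pictured with the following simplified sketch. It collapses rules b) to d) into a single full-name similarity step and uses a plain edit-distance implementation in place of the production Levenshtein scoring; the household and resident field names (`hh_id`, `residents`, `match_key_level`) are hypothetical.

```python
import random

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def name_score(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1 means identical names."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b))

def choose_household(hc_person, households, rng=random.Random(0)):
    """Pick the retained HH to attach an HC-form person to.
    Each household is a dict with hypothetical keys 'hh_id' and 'residents',
    where residents may carry a 'match_key_level' link to the HC person."""
    # a) best hierarchical match-key link, if any household shares a duplicate
    keyed = [(max(p.get("match_key_level", 0) for p in hh["residents"]), hh)
             for hh in households]
    best_level = max(level for level, _ in keyed)
    if best_level > 0:
        top = [hh for level, hh in keyed if level == best_level]
        return top[0] if len(top) == 1 else rng.choice(top)
    # b) to d), simplified: fall back to full-name similarity
    scored = [(name_score(hc_person["first_name"] + " " + hc_person["last_name"],
                          p["first_name"] + " " + p["last_name"]), hh)
              for hh in households for p in hh["residents"]]
    best_score = max(score for score, _ in scored)
    top = [hh for score, hh in scored if score == best_score]
    return top[0] if len(top) == 1 else rng.choice(top)   # f) random last resort

if __name__ == "__main__":
    hh_a = {"hh_id": "A", "residents": [{"first_name": "Ann", "last_name": "Lee",
                                         "match_key_level": 3}]}
    hh_b = {"hh_id": "B", "residents": [{"first_name": "Anne", "last_name": "Leigh"}]}
    person = {"first_name": "Ann", "last_name": "Lee"}
    print(choose_household(person, [hh_a, hh_b])["hh_id"])  # "A" via the match-key rule
```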
5c Assign iForms

Rules:

1) When an iForm is captured at the same UPRN as a retained CE, the persons captured on the iForm will be assigned to the CE.
2) Persons captured on iForms at the same UPRN as one retained HH are assigned to the retained HH.
3) The following priority rules are used to choose which HH to assign the iForm to when there are multiple retained HHs at the same child UPRN at which an iForm was received:
a) Using match-keys, when there are matches on match-keys to individuals across multiple HHs, the HH is selected based on:
i) the HH with the individual that has matched on the highest of the match-keys in the hierarchical system;
ii) random selection.
b) Levenshtein score matching of full name (first name and last name) against empty persons; when there are matches on full name to individuals across multiple HHs, the HH is selected based on:
i) the highest first name Levenshtein match score;
ii) random selection.
c) Levenshtein score matching of full name (first name and last name) against non-empty persons; when there are matches on full name to individuals across multiple HHs, the HH is selected based on:
i) the highest first name Levenshtein match score;
ii) random selection.
d) Levenshtein score matching on last name only; when there are matches on last name to individuals across multiple HHs, the HH is selected based on:
i) the highest Levenshtein match score;
ii) random selection.
e) Additional rules (if required).
f) If all else is equal, then one of the HHs is selected at random.

An "empty" person is an individual response that only contains the first name and last name of the individual. They are only retained through RFP if they are captured on a HH form and only contain full name.

The person captured on the iForm is assigned the person number of the first empty person that they match on full name or, if there are no matches to empty persons, a person number at the end of all existing individual places already filled. If it is assigned the person number of the first empty person that it matches on full name, then that matched empty HH person record is disregarded.

Copies of the relationships that include the matched disregarded empty person are created, with the resident_id/related_resident_id updated so that all instances of the resident_id of the disregarded empty person are changed to be that of the person captured on the iForm.

If the iForm response matches multiple empty persons, then it is assigned the person number of the matched empty person with the lowest person number.

Empty persons that are from PQ or EQ (Forced) are disregarded.

Empty persons captured on EQ (not forced) are retained, unless they match on name to an iForm.

The retaining of empty EQ (not forced) submissions will be coded in such a way that it can be toggled on and off as required during live processing.

Rationale:

1) Anticipates best intentions where someone has requested an iForm to fill out individual information but resides at a CE.
2) Anticipates best intentions where someone has requested an iForm but resides at a HH.
3) These priority rules are new for 2021. In 2011, the rules were based around Soundex matching of name and the H2 question on the HH form (How many people usually live here?). It is known that the quality of the completion of question H2 was likely not fit for purpose in 2011, and in 2021 it will no longer be asked on the EQ. Therefore, we decided against using these methods for 2021 to prevent bias being introduced through this process.
When assigning the iForms to a HH, we decided to focus on ensuring that duplicate persons are assigned to the same HH where possible. Therefore, when linking iForms to a HH record, we prioritise assigning the iForm to the retained HH that contains a duplicate response of the resident captured on the iForm. Assigning the iForm to this HH will allow these duplicates to be resolved in Module 8. The assignment to a HH that contains a duplicate individual occurs in rules a) and b).
a) The match-keys are used to identify potential duplicates when multiple HHs share people in common with the iForm. The HH with the best matched (highest match-key) duplicate is selected.
b) This is the second phase of duplicate matching: as empty individuals only contain name information, the only way that they can be identified as a duplicate is through name matching. As this is not as strong a matching method as the match-keys, it sits behind them in the priority order. The rule (i) for choosing between multiple matched HHs is still to be confirmed.
If duplicates are not found, then further name matching is used to select a HH to assign the iForm to, through rules c) and d).
c) Full name matching of non-empty persons is undertaken ahead of last-name-only matching because people with the same full name at the same address are more likely to live together (for example, fathers and sons). The rule (i) for choosing between multiple matched HHs is still to be confirmed.
d) Last name is then used as it is believed that persons with the same last name at the same address are more likely to live together.
e) There is the functionality to add extra rules here, during the tuning phase, if information comes to light.
f) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011, when the last measure was to take the first one found.

It is expected that the most common occurrence of "empty" persons is when a respondent is named (first name and last name) in the HH section of the form but no information is provided for them in the individual section, as the respondent intends to fill out an iForm for the individual response.

If an iForm matches an empty person, then the iForm is assumed to be a duplicate of the empty person. As the empty person contains less information than the matching iForm person, it is disregarded. The iForm then adopts the person number of the disregarded empty person, as this is the position within the HH structure at which it is assumed the iForm response should exist. Where the iForm matches multiple empty persons, taking the lowest matched person number preserves as much of the relationship information as possible for this person.

Empty persons captured on PQ or EQ (Forced) are disregarded at this point as there is no full assurance (no validation) for these collection modes that the empty persons are a result of requesting an iForm; they may simply be poorly filled out responses.

Empty persons captured on EQ (not forced) are retained because there is reasonable evidence that such a response represents a real person who requested but did not return an iForm. Without this rule, there is a risk of introducing a bias towards 1-person HHs in Houses in Multiple Occupation (HMOs), because Adjustment will not add people into counted HHs, and HMOs may be more likely to respond online.
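A rough sketch of the empty-person handling is given below, under the assumption that residents and relationships are held as simple records with illustrative keys (`person_number`, `resident_id`, `related_resident_id`, `is_empty`). It shows the adoption of the lowest matched person number and the re-pointing of copied relationship rows, not the production data model.

```python
def attach_iform_person(iform_person, household, relationships):
    """Assign a person number to an iForm person within an assigned HH.

    If the iForm person full-name matches one or more "empty" persons in the
    household, it adopts the lowest matched person number, the matched empty
    person is disregarded, and relationship rows are copied with the empty
    person's resident_id re-pointed at the iForm person. Otherwise the iForm
    person is appended after the last occupied person number.
    Field names are illustrative only.
    """
    def full_name(p):
        return (p["first_name"].strip().lower(), p["last_name"].strip().lower())

    matches = [p for p in household["residents"]
               if p.get("is_empty") and full_name(p) == full_name(iform_person)]
    if matches:
        target = min(matches, key=lambda p: p["person_number"])
        target["disregarded"] = True
        iform_person["person_number"] = target["person_number"]
        # Copy relationships involving the disregarded empty person, re-pointed
        # at the iForm person's resident_id.
        for rel in [r for r in relationships
                    if target["resident_id"] in (r["resident_id"], r["related_resident_id"])]:
            new_rel = dict(rel)
            for field in ("resident_id", "related_resident_id"):
                if new_rel[field] == target["resident_id"]:
                    new_rel[field] = iform_person["resident_id"]
            relationships.append(new_rel)
    else:
        occupied = [p["person_number"] for p in household["residents"]
                    if not p.get("disregarded")]
        iform_person["person_number"] = max(occupied, default=0) + 1
    household["residents"].append(iform_person)
    return iform_person
```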
6 Resolve orphan responses

Rules:

Orphan responses are identified as records captured on iForms or HC forms at UPRNs for which no HH forms or CE forms are received.

The following priority rules are used to determine whether a HH or a CE is created at a UPRN where there are orphan responses:

1) If there is at least one HC form at the UPRN, create a HH.
2) Else, if at least one of the iForms indicates on the Type of Establishment question that the form relates to a CE, create a CE.
3) Else, if at least one of the iForms indicates on the Type of Establishment question that the form relates to a HH, create a HH.
4) Else, use FWMT information for the UPRN to determine whether a HH or CE should be created.
5) Else, additional rules (if required).
6) Else, randomly choose whether to create a HH or CE.

All orphan responses are assigned to the newly created residence at their UPRN.

Rationale:

1) There must be a good reason why a respondent would fill out a HC form for an address. The form clearly states that it is for a HH, so the assumption is that the respondent would have been aware of this and only filled out the form if they lived in a HH. This takes priority over iForms as there is a clear indication on the form that it is to be filled out for a HH.
2)
-  The planned approach for collecting individuals residing at CEs is through iForms.
-  It is unlikely that a respondent would indicate on an iForm that they belong to a CE if they resided in a HH. For this reason, this response is prioritised ahead of the ticking of HH for this question.
Rules 3, 4, 5 and 6 are new for 2021. In 2011, the final rule was simply to set any remaining orphan residences to be a HH; it is likely that this would have resulted in a slight overcount of HHs. For 2021 we plan to use all available information on the questionnaires and FWMT to assess as correctly as possible the residence type of the address.
3) The only remaining information provided by the respondent that can be used to determine the residence type is this question. Information provided by the respondent is prioritised ahead of information provided by enumerators/administrative data.
4) A new rule for 2021. It is understood that there is information in the FWMT data that can be used to determine whether the residence at the UPRN is a HH or a CE. The current understanding is that this is a flag that indicates the residence type; it comes from the address register but can also be updated by an enumerator in the field. This is prioritised above a random draw as there is trustworthy information to inform the decision.
5) There is the functionality to add extra rules here, during the tuning phase, if information comes to light.
6) A random approach is used as the last measure to ensure no bias.
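The decision cascade for orphan responses can be summarised in a short sketch such as the one below; the keys `form_type` and `type_of_establishment` and the `fwmt_flag` argument are placeholders for whatever the questionnaire and FWMT data actually provide.

```python
import random

def decide_orphan_residence(orphans, fwmt_flag=None, rng=random.Random(0)):
    """Return "HH" or "CE" for a UPRN that has only orphan responses.

    `orphans` is a list of dicts with hypothetical keys:
      form_type             - "HC" or "iForm"
      type_of_establishment - "CE", "HH" or None (iForms only)
    `fwmt_flag` is an optional residence-type indicator from FWMT data.
    """
    if any(o["form_type"] == "HC" for o in orphans):          # 1) any HC form -> HH
        return "HH"
    establishment = [o.get("type_of_establishment") for o in orphans
                     if o["form_type"] == "iForm"]
    if "CE" in establishment:                                 # 2) an iForm indicates a CE
        return "CE"
    if "HH" in establishment:                                 # 3) an iForm indicates a HH
        return "HH"
    if fwmt_flag in ("HH", "CE"):                             # 4) FWMT residence flag
        return fwmt_flag
    return rng.choice(["HH", "CE"])                           # 6) random last resort

# Example: two orphan iForms, one indicating a CE.
print(decide_orphan_residence([
    {"form_type": "iForm", "type_of_establishment": None},
    {"form_type": "iForm", "type_of_establishment": "CE"},
]))  # -> "CE"
```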
7 Identify duplicate Individual responses

Rules:

The match-keys provided by Methodology are used to identify and flag duplicate individuals in the same residence.

Any empty persons that Levenshtein score match on full name to another individual in the residence are flagged as duplicates.

Rationale:

Matching is undertaken to identify duplicate persons captured at the same residence.
8a Resolve duplicate individual responses

Rules:

When duplicate individuals are identified within the same residence, one of the individuals is selected to be retained and the others are discarded.

The following priority rules are used to select which individual to retain as the baseline record:

1) iForm responses.
2) Response with the greatest sum of weighted completed individual questions (the default weights are 1 for each variable).
3) EQ (Not Forced), then PQ, then EQ (Forced).
4) Last receipted response.
5) Additional rules (if required).
6) If all else is equal, one of the individuals is selected at random.

Any missing individual variables on the retained record will be populated with valid responses from a disregarded duplicate individual response, using the same priority rules as above.

Exceptions for backfilling are:
-  Voluntary questions.
-  Partially filled Address, DOB, Year of Arrival and Citizenship fields.

Rationale:

1) Responses captured on iForms are prioritised over other form responses as there are special collection reasons for trusting iForm responses over other responses, including:
-  People are advised to fill out an iForm if there is sensitive information that they wish to record but do not want to disclose to other HH members.
-  If people feel strongly about correcting information that they have already submitted, then they are advised to do so using an iForm.
-  iForm responses are less likely to be filled out by proxy.
2) This rule helps ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection.
3) EQ responses are known to be of a higher quality than PQ responses. Non-forced submissions are prioritised over forced submissions as a respondent has knowingly sent in the non-forced entries, and therefore they are likely to be of higher quality.
4) Later receipted responses are favoured over other responses as it is expected that some people may choose to fill out another form to correct a mistake or a change in circumstance on an earlier form.
5) There is the functionality to add extra rules here, during the tuning phase, if information comes to light.
6) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011, when the last measure was to take the first one found.

Voluntary questions will not be backfilled as, legally, missing responses have to be treated as valid responses because they could be seen as a refusal to answer the question. Partially filled blocks of questions will also not be backfilled, to avoid creating inconsistencies in the data.
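The backfilling exceptions can be illustrated with the sketch below, in which the voluntary question set and the question blocks are invented placeholders (the real variable lists would come from the questionnaire specification). The point of the example is that a donor value is only copied across when the variable is neither voluntary nor part of a block that the retained record has already partially answered.

```python
# Hypothetical variable groupings used to illustrate the backfilling exceptions.
VOLUNTARY_QUESTIONS = {"religion"}
QUESTION_BLOCKS = {
    "address":         ["addr_line1", "addr_postcode"],
    "dob":             ["dob_day", "dob_month", "dob_year"],
    "year_of_arrival": ["arrival_year"],
    "citizenship":     ["citizenship_1", "citizenship_2"],
}

def backfill(retained, donors):
    """Fill missing (None) variables on the retained individual from disregarded
    duplicates, skipping voluntary questions and partially filled blocks.
    `donors` should already be ordered by the module's priority rules."""
    for donor in donors:
        for var, value in donor.items():
            if value is None or retained.get(var) is not None:
                continue
            if var in VOLUNTARY_QUESTIONS:
                continue  # a missing voluntary answer may be a refusal
            block = next((fields for fields in QUESTION_BLOCKS.values() if var in fields), None)
            if block and any(retained.get(f) is not None for f in block):
                continue  # never complete a partially filled block
            retained[var] = value
    return retained

retained = {"dob_day": 4, "dob_month": None, "dob_year": None, "religion": None, "sex": None}
donor    = {"dob_day": 4, "dob_month": 7, "dob_year": 1990, "religion": "none", "sex": "M"}
print(backfill(retained, [donor]))
# dob_month/dob_year stay missing (partially filled DOB block); religion stays
# missing (voluntary); only sex is backfilled.
```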
8b Flag residual duplicate individual responses

Rules:

The match-keys provided by Methodology are used to identify and flag duplicate retained individuals within the same UPRN.

These duplicates will be flagged and counted to determine whether an intervention or fix is required.

Rationale:

It is possible, but very unlikely, that residual duplicates within the UPRN can occur. This is caused by duplicates on two or more HC forms being assigned to different HHs at the same UPRN in Module 5b. The forms could be assigned to different HHs if they match multiple HHs or if they are assigned to a HH based on random selection.

The scale of these duplicates will determine whether further intervention is required. A solution to this has been discussed at the Working Group, and it was noted that there would not be a quick fix; it is likely that a fix would require recoding parts of Modules 2 and 3.

The counts of missed duplicates in 2011 were very low, likely in the 00s. It is expected that this will be even lower in 2021 due to:
1) improvements in the Address Register correctly identifying HHs as being at different addresses;
2) the uptake of HC forms being lower.
9 Identify wholly duplicated HHs

Rules:

The match-keys are used to identify duplicate persons within the postcode.

Levenshtein score matching is used to identify empty persons that are duplicated within the postcode.

HHs of the same size within the same postcode that contain the same persons are identified and flagged as wholly duplicated HHs.

Rationale:

The Assessment of the 2011 Census Overcount Methodology (Dini and Large, 2014) made the following recommendation for RMR: "The RMR processing would benefit from looking within the postcode for duplicates, in addition to looking within the address".

As mentioned in Section 2 of this paper, it was known in 2011 that problems in the Address Register meant that on occasion the same HH was listed multiple times at slightly different UAIs. This led to these HHs being followed up for multiple responses. As RMR sought to resolve duplicates within the HH, these duplicate responses were not resolved in 2011.

In this module we will identify and flag any remaining duplicates within the postcode. We will also flag HHs of the same size for which every individual is duplicated. We refer to these HHs as "wholly duplicated HHs".

We do not attempt to resolve any duplicates other than the wholly duplicated HHs, as it was found that the combinatorial problem space for resolving duplicates of this nature between UPRNs within the same small area is too vast to manage accurately and feasibly through a deterministic rule set.
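A simplified sketch of the wholly-duplicated-HH check is shown below, assuming each HH carries a postcode and a list of per-person match-key values (`resident_keys` is an invented field); the production module would work from the match-key and Levenshtein flags described above.

```python
from collections import defaultdict
from itertools import combinations

def wholly_duplicated_households(households):
    """Flag pairs of same-size HHs in the same postcode whose residents are all
    duplicates of each other. Households are dicts with hypothetical keys
    'hh_id', 'postcode' and 'resident_keys' (e.g. match-key values per person)."""
    flagged = set()
    by_postcode = defaultdict(list)
    for hh in households:
        by_postcode[hh["postcode"]].append(hh)
    for hhs in by_postcode.values():
        for a, b in combinations(hhs, 2):
            same_size = len(a["resident_keys"]) == len(b["resident_keys"])
            if same_size and sorted(a["resident_keys"]) == sorted(b["resident_keys"]):
                flagged.update({a["hh_id"], b["hh_id"]})
    return flagged

print(wholly_duplicated_households([
    {"hh_id": "H1", "postcode": "AB1 2CD", "resident_keys": ["k1", "k2"]},
    {"hh_id": "H2", "postcode": "AB1 2CD", "resident_keys": ["k2", "k1"]},
    {"hh_id": "H3", "postcode": "AB1 2CD", "resident_keys": ["k1"]},
]))  # -> {'H1', 'H2'}
```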
10 Resolve wholly duplicated HHs

Rules:

When wholly duplicated HHs are identified, one of the HHs (and its residents) is selected to be retained and the others are discarded.

The following priority rules are used to select which HH (and residents) to retain:

1) Greatest sum of weighted completed HH and individual questions (the default weights are 1 for each variable).
2) If equally completed, then one of the HHs is selected at random.

The other HHs and their residents are flagged to be disregarded.

Remaining duplicate individuals flagged in Module 9 are retained, but the flag is persisted to highlight to later processes the extent of the remaining overcount within the postcode. This flag will indicate the records that are identified as being the same individual.

Rationale:

In line with the rest of RMR, one of the wholly duplicated HHs (along with its individuals) is retained and the other(s) disregarded.

Responses have been merged by this point, and so the responses could have come from different response modes with varying receipt dates. Therefore, the only remaining general RMR priority rules that can be used to identify which records to retain are 1) most complete (to retain as much information as possible) and 2) random draw.

The HH response with the greatest sum of weighted completed HH and individual questions is selected to be retained. This ensures that as much information as possible is retained for the combined HH and individual(s) responses. The weighting function is included to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection.

The creation of the flag to identify duplicate persons will help later processes account for these responses in their methodology.
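For the completeness comparison, the combined score could be computed along the lines of the following sketch, where `answers` and `residents` are illustrative structures and the default weight is 1 per variable. The example deliberately produces a tie, which under rule 2 would be settled by a random draw.

```python
def household_completeness(hh, weights_hh=None, weights_person=None):
    """Combined weighted completeness of a HH record plus all of its residents.
    Keys ('answers', 'residents') are illustrative; default weights are 1."""
    def score(answers, weights):
        weights = weights or {k: 1 for k in answers}
        return sum(w for k, w in weights.items() if answers.get(k) is not None)
    return score(hh["answers"], weights_hh) + sum(
        score(person, weights_person) for person in hh["residents"])

hh_a = {"answers": {"tenure": "owned", "accom_type": None},
        "residents": [{"dob": "1990-01-01", "sex": "F"}]}
hh_b = {"answers": {"tenure": "owned", "accom_type": "flat"},
        "residents": [{"dob": "1990-01-01", "sex": None}]}
print(household_completeness(hh_a), household_completeness(hh_b))  # 3 3: a tie
```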
11 Resolve adjusted CE and HH data structures

Rules:

The rules for this module are still to be discussed at the RMR Working Group. The proposed rules to be taken for discussion are:

1) HHs that contain more than 30 persons are converted to CEs. The HH is disregarded, a CE is created, and the residents are moved to the new CE.
2) Person numbers are reordered, ensuring that they are sequential with no gaps. When reordering, persons captured on the same form will be given sequential person numbers.
3) All non-applicable relationship records are disregarded (i.e. the relationship is no longer required based on newly assigned person numbers, the resident is disregarded, or the resident is moved to residing at a CE).
4) New relationship records are created where required; these records will have their relationship set to missing (i.e. a relationship is now required based on newly assigned person numbers, or the resident is moved to residing at a HH). Where the new record is a HC form individual's relationship with Person 1, whatever relationship is captured on the HC form will populate the relationship field.
5) Visitor data from disregarded HHs are disregarded.
6) Address One Year Ago information is set to missing where an individual selected the "Same as Person 1" response and Person 1 on the questionnaire is disregarded.

Rationale:

1) HHs of this size are more likely to be a CE than a HH.
2) Sequential person numbering is required by later processes.
3) This relationship data is no longer required and is therefore logically deleted.
4) New relationships are required based on changes to residence and/or person number.
5) This visitor data is no longer required and is therefore logically deleted.
6) Person 1 no longer exists, so residents' responses are set to missing where they responded with "Same as Person 1".
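If the proposed rules are adopted, the renumbering and relationship pruning in rules 2 and 3 might look roughly like the sketch below; the keys `form_id`, `person_number`, `resident_id` and `disregarded` are illustrative rather than taken from the CDM.

```python
def reorder_person_numbers(residents, relationships):
    """Renumber retained residents sequentially (grouped by source form) and drop
    relationship rows that reference a disregarded resident."""
    retained = [p for p in residents if not p.get("disregarded")]
    # Keep people from the same form adjacent, then preserve their existing order.
    retained.sort(key=lambda p: (p["form_id"], p["person_number"]))
    for new_number, person in enumerate(retained, start=1):
        person["person_number"] = new_number
    kept_ids = {p["resident_id"] for p in retained}
    relationships[:] = [r for r in relationships
                        if r["resident_id"] in kept_ids
                        and r["related_resident_id"] in kept_ids]
    return retained, relationships
```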
12 Create RMR Flags (Rules: N/A; Rationale: N/A)
13 Run RMR Diagnostics (Rules: N/A; Rationale: N/A)

References

J Plachta and R Shipsey (2019). Methodology Report on Identifying Duplicate Persons for Resolving Multiple Responses in the 2021 Census. ONS Technical Report

J Plachta (2020). Methodology Report on Identifying Same Surnames for Resolving Multiple Responses in the 2021 Census. ONS Technical Report

Dini and Large (2014). Assessment of the 2011 Census Overcount Methodology.