1 | Resolve multiple CE responses | When both HH and CE responses are received for the same UPRN the following rules are applied: | CE responses are prioritised over HH responses as there will be hand delivery of CE forms to some establishments. Therefore, if a CE form has been filled out for a UPRN, there is likely to be a good reason behind it. |
| | | |
| | | |
| | | |
| | - The CE response is retained and the HH response is disregarded. | There are likely to be fewer multiple responses of this type in 2021 as CEs and HHs are captured on the same Address Register. |
| | | |
| | | |
| | | |
| | - Persons captured on the disregarded HH responses are moved to the retained CE. | The persons captured on the disregarded HH responses are moved to the CE so that they are kept. |
| | When multiple CE responses are received for the same UPRN: | There should only be one CE at an address. If multiple responses are received, then it’s highly likely that these responses relate to the same CE. |
| | | |
| | We choose one of the CE responses to keep as the baseline CE record and disregard the other form(s). | |
| | | |
| | | 1) The record with the greatest sum of weighted completed questions is retained to ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection. |
| | | |
| | The selection of the CE response to be retained is based on the following set of priority rules: | |
| | | |
| | | 2) EQ (not forced) over PQ as EQ responses are known to be of higher quality than PQ responses. PQ is taken over forced EQ responses as PQ is a response that has been submitted by the respondent whereas they haven’t submitted the forced EQ response. |
| | | |
| | 1) Greatest sum of weighted completed CE questions. (The default weights are 1 for each variable) | |
| | | |
| | | 3) Later receipted responses are favoured over other responses as it is expected that some people may choose to fill out another form to correct for a mistake or change in circumstance on an earlier form. |
| | | |
| | 2) If equal on sum of weighted completed CE questions, then prioritise EQ (Not Forced) then PQ, then EQ (Forced). | |
| | | |
| | | 4) There is the functionality to add extra rules here, during the tuning phase if information comes to light. |
| | | |
| | 3) If equally complete and same response mode, then prioritise the response that was last receipted. | |
| | | |
| | | 5) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011 when the last measure was to take the first one found. |
| | | |
| | 4) Additional rules (if required). | |
| | | |
| | | Backfilling of responses is undertaken to keep as much information on the residence as possible. Information from disregarded records is expected to be more accurate than imputed values. |
| | | |
| | 5) If all else is equal, then one of the forms is selected at random. | |
| | | |
| | | |
| | | |
| | Any missing CE variables on the retained record will be populated with valid responses from a disregarded CE response for the same UPRN using the same priority rules as those used to select which CE response to retain. | |
2 | Resolve multiple HH responses: Stage 1 | Where there are multiple HH forms received for the same UPRN, the following sequential rules are applied to decide whether they are duplicate responses and should be merged: | 1) Responses with the same QID can only result when a paper form with an IAC is received by a HH that then returns both internet and paper responses. This rule essentially covers the 2011 rule that merged EQ and PQ responses for the same address. |
| | | |
| | | This rule will be coded in such a way that it can be toggled on/off in live running if required. |
| | | |
| | 1) If the HH responses have the same QID then the HH responses are merged. | |
| | | |
| | | 2) If matching identifies that responses contain the same individuals, then it’s highly likely that the responses relate to the same HH and are duplicates of each other. |
| | | |
| | 2) The match-keys provided by Methodology are used to identify HH responses that contain the same individuals. | |
3 | Resolve multiple HH responses: Stage 2 | Continuing the resolution of multiple HH responses from Module 2: | 3) If matching identifies that responses contain the same individuals, then it’s highly likely that the responses relate to the same HH and are duplicates of each other. |
| | | |
| | | |
| | | |
| | 3) HH forms that are found to contain persons in common in Module 2, Rule 2) are assumed to be the same HH and are therefore merged together. | 4) Minor-only HHs can only legitimately occur under very rare circumstances and, for disclosure reasons, statistics on these HHs cannot be output in the Census. Also, a minor-only HH would rarely occur at the same UPRN as a non-minor-only HH; it is more likely that it is the same HH but, for some reason, only children were captured on one of the forms. Therefore, it was agreed that we should merge the HH responses in these cases. |
| | | |
| | | |
| | | |
| | 4) If one or more of the multiple responses is a minor-only (under 16) HH and there are other non-minor-only HHs then the minor-only HH response is merged with a non-minor-only HH response, prioritising merging to those HHs which hold the highest Levenshtein score match on surname (where the Levenshtein score is above a certain parameterised threshold). If multiple forms have the same match score or none of the non-minor-only HH responses have acceptable Levenshtein match scores, then one of these matched (or un-matched where no matches) HH responses is selected to be merged with the minor-only HH based on a random draw. | 5) This is a new rule for 2021. There was concern that some discovered HHs may be lost unless they are identified at this stage, e.g. a HH with a dedicated annex. Therefore, this step will look for any strong evidence on the HH responses that identifies them as being discovered HHs. It is still to be decided what evidence can be used to support the identification of a discovered HH. Suggestions include using the responses to H8, H9 and/or H10 on the HH form. This step will be coded in such a way that additional rules for identification can be added later. |
| | | |
| | | |
| | | |
| | 5) Based on a set of to-be-defined conditions, if there is evidence to suggest that a HH response is a discovered HH at the UPRN then we will flag it as a discovered HH and it does not go through steps 6, 7 and 8. | 6) This is a new rule for 2021. If there is no reason to suggest that the multiple HH responses relate to different HHs in the previous rule and at least one person’s surname matches across the forms, then it’s assumed likely that the HH responses relate to the same HH. |
| | | |
| | | |
| | | |
| | 6) Levenshtein matching is undertaken on the surnames of the person responses. Where the Levenshtein match score is deemed acceptable between multiple HH responses they are merged. | 7) This is a new rule for 2021. This allows for rules to be added to merge the responses if there is evidence to suggest that they should be. A rule that may be added is the merging of Welsh language responses with English language responses at the same Child UPRN. This rule was implemented in 2011 as addresses in Wales are sent both the Welsh language form and the English language form and therefore have a higher propensity to respond twice than addresses in England. |
| | | |
| | | |
| | | |
| | 7) Based on a set of to-be-defined conditions, if there is evidence to suggest that the HH response(s) are not discovered HHs then we will merge them at this stage. | 8) Empty forms received at the same UPRN as other responses are assumed to be incomplete responses or timeshares. Therefore, the responses are merged unless there is information to suggest that they shouldn’t be. |
| | | |
| | | |
| | | |
| | 8) If there are “empty” HH responses (HH responses with no person information filled in), these are merged with “non-empty” HH responses unless there is evidence to suggest that they are legitimate empty HHs. Where there are multiple “non-empty” HH responses, one of these responses is selected at random, to be merged with the “empty” HH response. | |
| | | |
| | | |
| | | |
| | When merging HH responses in Modules 2 and 3 one of the HH responses is selected to be retained as the baseline HH record, based on the following priority rules: | |
| | | |
| | | a) The record with the greatest sum of weighted completed questions is retained to ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during Census Collection. |
| | | |
| | a) Take the response with the greatest sum of weighted completed HH questions (the default weights are 1 for each variable). | |
| | | |
| | | b) EQ (not forced) over PQ as EQ responses are known to be of higher quality than PQ responses. PQ is taken over forced EQ responses as PQ is a response that has been submitted by the respondent whereas they haven’t submitted the forced EQ response. |
| | | |
| | b) If equally complete, then prioritise EQ (Not Forced) then PQ, then EQ (Forced) | |
| | | |
| | | c) Later receipted responses are favoured over other responses as it is expected that some people may choose to fill out another form to correct for a mistake or change in circumstance on an earlier form, even though they are advised against doing this. |
| | | |
| | c) If equally complete and same response mode, then prioritise the response that was last receipted. | |
| | | |
| | | d) This is a new rule for 2021. There is the functionality to add extra rules here, during the tuning phase if information comes to light. |
| | | |
| | d) Additional rules (if required) | |
| | | |
| | | e) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011 when the last measure was to take the first one found. |
| | | |
| | e) If all else is equal, then one of the forms is selected at random. | |
| | | |
| | | Backfilling of responses is done to keep as much information on the residence as possible. Information from disregarded records is expected to be more accurate than imputed values. |
| | | |
| | Any missing HH variables on the retained record will be populated with valid responses from a disregarded HH response for the same UPRN using the same priority rules as those used to select which HH response to retain. | |
4 | Resolve dummy responses | The following sequential rules are undertaken on dummy response data: | 1) It is an assumption of the module that a dummy response is a duplicate response of a CE or HH response captured at the same UPRN. |
| | | |
| | 1) If there is a CE response captured at the UPRN then all dummy responses at the UPRN are discarded. | The CE response is retained over the dummy response when both are captured at the same UPRN for a variety of reasons: |
| | | |
| | | - The CE form contains the questions that are required, whereas the dummy form contains questions that could be used to create CEs from. |
| | | |
| | 2) If there is a HH response captured at the UPRN then all dummy responses at the UPRN are discarded. | - It is not reasonable to place a response burden on a respondent to fill out a CE form if it is then ignored in favour of a field interviewer’s view. |
| | | |
| | | - Responses filled out by a site manager are likely to be more reliable than those filled out by an enumerator who may not even have access to the establishment. |
| | | |
| | 3) If a dummy response contains both HH and CE information, the response is split into two separate records in the data. One containing only HH information and the other containing only CE information. | |
| | | |
| | | 2) It is an assumption of the module that a dummy response is a duplicate response of a CE or HH response captured at the same UPRN. The HH response is retained over the dummy response when both are captured at the same UPRN for a variety of reasons: |
| | | |
| | 4) One of the dummy responses at the UPRN is selected to be retained based on the following priority rules: | - The HH form contains more questions that directly map to the HH table on the CDM. |
| | | |
| | a) Greatest sum of weighted completed dummy questions (the default weights are 1 for each variable). | - There is not much point in asking a resident to fill out their HH form if we then just ignore their response in favour of a field interviewer who most likely did not have access to the residence. |
| | | |
| | b) FWMT over any potential PQ back-up option. | - A resident’s response about the HH they live in is more reliable than one filled out by an enumerator who may not even have access to the HH. |
| | | |
| | c) Last Receipted | |
| | | |
| | d) Additional rules (if required) | In 2011, the Type of Accommodation and Self-Contained questions on the dummy response were merged with the same fields on the HH response if the variables were missing on the HH response but observed on the dummy response. It was decided against undertaking this action for 2021 as there was concern about any potential field interviewer bias on these responses and it was noted that imputation would likely better recover the true values of these variables for the HH. |
| | | |
| | e) If all else is equal, then one of these forms is selected at random. | |
| | | |
| | | 3) A residence can be either a HH or a CE; it cannot be both. Therefore, the forms are split so that either a HH or a CE is created. |
| | | |
| | 5) Any missing dummy variables on the retained record will be populated with valid responses from a disregarded dummy response for the same UPRN using the same priority rules as those used to select which dummy response to retain. | |
| | | |
| | | 4) |
| | | |
| | 6) If the retained dummy response is a CE then a new CE record is created on the CE table for this UPRN. The CE record will be flagged as having been created from a dummy response. Any dummy question variables that exist on the CE table will be populated for this new record from the responses on the dummy form. IDs for this row will be created from the dummy Response ID. All other information will be set to missing (-9) on the new record. | a) The record with the greatest sum of weighted completed questions is retained to ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection. |
| | | |
| | | |
| | | |
| | 7) Otherwise, if the retained dummy response is a HH then a new HH record is created on the HH table for this UPRN. The HH record will be flagged as having been created from a dummy response. Any dummy question variables that exist on the HH table will be populated for this new record from the responses on the dummy form. IDs for this row will be created from the dummy Response ID. All other information will be set to missing (-9) on the new record. | b) FWMT responses will be taken over any potential PQ back-up as the quality of response on EQ is known to be higher than that of PQ. |
| | | |
| | | |
| | | |
| | | c) Later receipted responses are favoured over other responses as there is an assumption that a later response may be filled out to correct for a mistake or change in circumstance on an earlier form. |
| | | |
| | | |
| | | |
| | | d) This is a new rule for 2021. There is the functionality to add extra rules here, during the tuning phase if information comes to light. |
| | | |
| | | |
| | | |
| | | e) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011 when the last measure was to take the first one found. |
| | | |
| | | |
| | | |
| | | 5) This follows the general principles of RMR. Any missing information should be populated from other completed dummy responses at the same UPRN. The prioritisation rules are the same as those used to select the primary dummy response. Keeping these prioritisation rules retains consistency between responses. |
| | | |
| | | |
| | | |
| | | 6) This action undertakes the key reason for collecting dummy response information, to create a residence when no response is received for a UPRN. If this action was not undertaken, the address may be assigned the wrong residence type, or it may be missed completely by the Census. |
| | | |
| | | |
| | | |
| | | 7) This action undertakes the key reason for collecting dummy response information, to create a residence when no response is received for a UPRN. If this action was not undertaken, the address may be assigned the wrong residence type, or it may be missed completely by the Census. |
| | | |
| | | |
5a | Resolve duplicate iForm responses | When multiple iForm responses are received for the same UPRN, the match-keys provided by Methodology are used to identify duplicate responses. | This is a new module for 2021. |
| | | |
| | | |
| | | |
| | If there are duplicate responses and at least one of these is not an EQ (Forced) response, then the following sequential priority rules are used to decide which of the responses to retain (forced EQ responses are not considered in rules a to e): | In 2011, the de-duplication of all individual responses (all form types) within the same residence was undertaken in one rule set at the end of RMR. |
| | | |
| | | |
| | | |
| | a) Last receipted. | However, it was discovered that duplicate iForm responses could be assigned to different HHs at the same UPRN because the assigning of iForms to HHs contains a random element (see Module 5c). This meant that these duplicates would not be resolved by RMR as they were assigned to different residences. |
| | | |
| | | Therefore, for 2021, we are bringing forward the resolution of duplicate iForm responses to the stage before we assign them to HHs. |
| | | |
| | b) If receipted on the same date, take the response with the greatest sum of weighted completed individual questions (the default weights are 1 for each variable) of those last receipted. | |
| | | |
| | | Duplicate iForms can be resolved before duplicates from other form types as iForm responses are prioritised over other form types in Module 8a. For more information, please see the Rationale for Module 8a. |
| | | |
| | c) If they are equal on sum of weighted completed questions and are last receipted, then choose EQ (not forced) over PQ. | |
| | | |
| | | In a change to the 2011 rules and individual responses from other form types, we are prioritising last receipted iForm responses ahead of most complete iForm responses. |
| | | |
| | d) Additional rules (if required) | |
| | | |
| | | This rule is based on respondents being advised to fill out iForms as fully as possible when they wish to correct information from earlier submitted forms. It is therefore an assumption that the last receipted iForm will contain the correct information for the individual and these responses should therefore be prioritised. |
| | | |
| | e) If all else is equal, then one of these forms is selected at random. | |
| | | |
| | | As a result of this, non-forced submitted iForms are prioritised ahead of forced submitted forms, regardless of completeness. This is because the forced submissions will always be the last receipted form at the UPRN as they do not get uploaded/receipted until EQ closes, yet it is not known when the information on these forms was supplied. |
| | | |
| | Otherwise if all the iForm responses are EQ (Forced) these rules are applied: | |
| | | |
| | | |
| | | |
| | f) Take the response with the greatest sum of weighted completed individual questions (the default weights are 1 for each variable). | a) If a respondent is adamant that they wish to correct information that they have already submitted, they will be advised to fill out an iForm as fully as possible. iForm responses that are receipted later are assumed to provide the “correct” information for the respondent and are therefore prioritised ahead of earlier responses. |
| | | |
| | | |
| | | |
| | g) Additional rules (if required) | b) The record with the greatest sum of weighted completed questions is retained to ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection. Again, the submissions through EQ forced are not considered at this point because these responses haven’t been submitted by a respondent and this rule is used to decide between two submitted forms that are last receipted. |
| | | |
| | | |
| | | |
| | h) If all else is equal, then one of these forms is selected at random. | c) This is a general rule of RMR and is used as EQ responses are expected to be of higher quality than PQ responses. |
| | | |
| | | |
| | | |
| | | d) There is the functionality to add extra rules here, during the tuning phase if information comes to light. |
| | | |
| | | |
| | | |
| | | e) A random approach is used as the last measure when looking at only non-forced submissions to ensure no bias. |
| | | |
| | Missing person variables on the retained record will be populated with valid responses from disregarded iForm responses for the same UPRN, prioritising valid responses from the disregarded forms based on the following rules: | |
| | | |
| | | f) The Forced response with the greatest sum of weighted completed individual questions is selected if there are only forced iForm responses. The record with the greatest sum of weighted completed questions is retained to ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection. |
| | | |
| | 1) Response with the greatest sum of weighted completed individual questions (the default weights are 1 for each variable). | |
| | | |
| | | g) There is the functionality to add extra rules here, during the tuning phase if information comes to light. |
| | | |
| | 2) If multiple responses have the greatest sum of weighted completed questions then prioritise EQ (Not forced), PQ then EQ (Forced) | |
| | | |
| | | h) If a single retained response cannot be identified using the preceding rules, then one of these forms will be selected using a random approach as a last measure to ensure no bias. |
| | | |
| | 3) If equally complete and same response mode, prioritise the response that was receipted last. | |
| | | |
| | | |
| | | |
| | 4) Additional rules (if required) | 1) The record with the greatest sum of weighted completed questions is retained to ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection. |
| | | |
| | | |
| | | |
| | 5) If all else is equal, one of these forms is selected at random. | 2) EQ responses are known to be of higher quality than PQ responses. Non-forced submissions are prioritised over forced submissions as a respondent has knowingly sent the non-forced entries in and they are therefore likely to be of higher quality. |
| | | |
| | | |
| | | |
| | Exceptions for backfilling are: | 3) Later receipted responses are favoured over other responses as there is an assumption that a later response may be filled out to correct for a mistake or change in circumstance on an earlier form. |
| | | |
| | - Voluntary questions. | |
| | | |
| | - Partially filled Address, DOB, Year of Arrival and Citizenship fields | 4) There is the functionality to add extra rules here, during the tuning phase if information comes to light. |
| | | |
| | | |
| | | |
| | | 5) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011 when the last measure was to take the first one found. |
| | | |
| | | |
| | | |
| | | Voluntary questions will not be backfilled as legally, missing responses have to be treated as valid responses as they could be seen as a refusal to answer the question. |
| | | |
| | | |
| | | |
| | | Partially filled blocks of questions will also not be backfilled, to avoid creating inconsistencies in this data. |
5b | Assign continuation forms | 1) When an HC Form is captured at the same UPRN as one retained CE, the persons captured on the HC form will be assigned to the CE. | 1) Anticipates best intentions where someone has requested an HC form to fill out individual information but resides at a CE. As Module 4 creates CEs from dummy responses, an HC response could be assigned to a CE created from a dummy here. |
| | | |
| | | As the HC form doesn’t provide any residence information but the dummy form does, it was decided that the residence type should be taken from the dummy response. |
| | | |
| | 2) Persons captured on HC forms at the same UPRN as one retained HH are assigned to the retained HH. | |
| | | |
| | | 2) Anticipates best intentions where someone has requested a HC form for a HH. |
| | | |
| | 3) The following priority rules are used to choose which HH to assign the HC form to when there are multiple retained HH at the same child UPRN at which a HC form was received: | |
| | | |
| | | 3) These priority rules are new for 2021. In 2011 the rules were based around the H2 Question on the HH form (How many people usually live here?) and then Soundex matching of last name. |
| | | |
| | a) Using match-keys, when there are matches on match-keys to individuals across multiple HHs the HH is selected based on: | |
| | | |
| | i) HH with the individual that has matched on the highest of the match-keys in the hierarchical system. | It is known that the quality of the completion of question H2 was likely not fit for purpose in 2011 and in 2021 it will no longer be asked on the EQ. Therefore, we decided against using these methods for 2021 to prevent bias being introduced through this process. |
| | | |
| | ii) Random selection | |
| | | |
| | | When assigning the HC forms to a HH we decided to focus on ensuring that duplicate persons are assigned to the same HH where possible. Therefore, when linking HC forms to a HH record, we prioritise assigning the HC forms to the retained HH that contains a duplicate response of a resident captured on the HC form. Assigning the HC form to this HH will allow these duplicates to be resolved in Module 8. The assigning to a HH that contains a duplicate individual occurs in rules a) and b). |
| | | |
| | b) Levenshtein Score matching of full name – first name and last name (Empty persons), when there are matches on full name to individuals across multiple HHs then the HH is selected based on: | |
| | | |
| | i) Highest first name Levenshtein match score | a) The match-keys are used to identify potential duplicates. When multiple HHs share people in common with the HC form, the HH with the best-matched (highest match-key) duplicate is selected. |
| | | |
| | ii) Random selection | |
| | | |
| | | b) This is the second phase of duplicate matching. As empty individuals only contain name information, the only way that they can be identified as a duplicate is through name matching. As this is not as strong a matching method as the match-keys, it is behind them in the priority order. The rule (i) for choosing between multiple matched HHs is still to be confirmed through further research. |
| | | |
| | c) Levenshtein Score matching of full name – first name and last name (Non-empty persons), when there are matches on full name to multiple HHs the HH is selected based on: | |
| | | |
| | | If duplicates are not found, further name matching methods are used to select a HH to assign the HC form to through rules c) and d). |
| | | |
| | i) Highest first name Levenshtein match score | |
| | | |
| | ii) Random selection | c) Full Name matching of non-empty persons is undertaken first because people with the same name at the same address are more likely to live together as they are likely to be fathers and sons. The rule (i) for choosing between multiple matched HHs is still to be confirmed through further research. |
| | | |
| | | |
| | | |
| | d) Levenshtein Score matching on last name only, when there are matches on last name to individuals across multiple HHs the HH is selected based on: | d) Last name is then used as it is believed that persons with the same last name at the same address are more likely to live together. |
| | | |
| | i) Highest Levenshtein match score | |
| | | |
| | ii) Random selection | e) There is the functionality to add extra rules here, during the tuning phase if information comes to light. |
| | | |
| | | |
| | | |
| | e) Additional rules (if required) | f) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011 when the last measure was to take the first one found. |
| | | |
| | | |
| | | |
| | f) If all else equal, one of the HHs is selected at random. | |
5c | Assign iForms | 1) When an iForm is captured at the same UPRN as a retained CE, the persons captured on the iForm will be assigned to the CE. | 1) Anticipates best intentions where someone has requested an iForm to fill out individual information but resides at a CE. |
| | | |
| | | |
| | | |
| | 2) Persons captured on iForms at the same UPRN as one retained HH are assigned to the retained HH. | 2) Anticipates best intentions where someone has requested an iForm but resides at a HH. |
| | | |
| | | |
| | | |
| | 3) The following priority rules are used to choose which HH to assign the iForm to when there are multiple retained HH at the same child UPRN at which an iForm was received: | 3) These priority rules are new for 2021. In 2011 the rules were based around Soundex matching of name and the H2 Question on the HH form (How many people usually live here?). |
| | | |
| | | |
| | | |
| | a) Using match-keys, when there are matches on match-keys to individuals across multiple HHs the HH is selected based on | It is known that the quality of the completion of question H2 was likely not fit for purpose in 2011 and in 2021 it will no longer be asked on the EQ. Therefore, we decided against using these methods for 2021 to prevent bias being introduced through this process. |
| | | |
| | i) HH with the individual that has matched on the highest of the match-keys in the hierarchical system. | |
| | | |
| | ii) Random selection | When assigning the iForms to a HH we decided to focus on ensuring that duplicate persons are assigned to the same HH where possible. Therefore, when linking iForms to a HH record, we prioritise assigning the iForms to the retained HH that contains a duplicate response of the resident captured on the iForm. Assigning the iForm to this HH will allow these duplicates to be resolved in Module 8. The assigning to a HH that contains a duplicate individual occurs in rules a) and b). |
| | | |
| | | |
| | | |
| | b) Levenshtein Score matching of full name – first name and last name (empty persons), when there are matches on full name to individuals across multiple HHs, the HH is selected based on: | |
| | | |
| | i) Highest first name Levenshtein match score | |
| | | |
| | ii) Random selection | a) The match-keys are used to identify potential duplicates. When multiple HHs share people in common with the iForm, the HH with the best-matched duplicate (highest match-key) is selected. |
| | | |
| | | |
| | | |
| | c) Levenshtein score matching of full name (first name and last name) for non-empty persons. When there are matches on full name to individuals across multiple HHs, the HH is selected based on: | b) This is the second phase of duplicate matching. As empty individuals only contain name information, the only way that they can be identified as a duplicate is through name matching. As this is not as strong a matching method as the match-keys, it sits behind them in the priority order. The rule i) for choosing between multiple matched HHs is still to be confirmed. |
| | | |
| | | |
| | | |
| | i) Highest first name Levenshtein match score | If duplicates are not found, then further name matching is used to select a HH to assign the iForm to through rules c) and d). |
| | | |
| | ii) Random selection | |
| | | |
| | | c) Full name matching of non-empty persons is undertaken first because people with the same name at the same address are more likely to live together (for example, a father and son who share a name). The rule i) for choosing between multiple matched HHs is still to be confirmed. |
| | | |
| | d) Levenshtein Score matching on last name only, when there are matches on last name to individuals across multiple HHs, the HH is selected based on: | |
| | | |
| | i) Highest Levenshtein match score | d) Last name is then used, as it is believed that persons with the same last name at the same address are more likely to live together. |
| | | |
| | ii) Random selection | |
| | | |
| | | e) Extra rules can be added here during the tuning phase if new information comes to light. |
| | | |
| | e) Additional rules (if required) | |
| | | |
| | | f) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011 when the last measure was to take the first one found. |
| | | |
| | f) If all else equal, then one of the HHs is selected at random. | |
| | | |
| | | An “empty” person is an individual response that only contains the first name and last name of the individual. They are only retained through RFP if they are captured on a HH form and only contain full name. |
| | | |
| | The person captured on the iForm is assigned the person number of the first empty person that they match on full name or, if there are no matches to empty persons, the next person number after all existing filled places. If the iForm person takes the person number of a matched empty person, that empty HH person record is disregarded. | It is expected that the most common occurrence of “empty” persons is when a respondent is named (first name and last name) in the HH section of the form but no information is provided for them in the individual section as they intend to fill out an iForm for the individual response. |
| | | |
| | | |
| | | |
| | Copies of the relationships that include the matched disregarded empty person are created, with the resident_id/related_resident_id updated so that all instances of the resident_id of the disregarded empty person are changed to be that of the person captured on the iForm. | If an iForm matches an empty person then the iForm is assumed to be a duplicate of the empty person. As the empty person contains less information than the matching iForm person, it is disregarded. The iForm then adopts the person number of the disregarded empty person, as this is assumed to be the position that the iForm response should occupy within the HH structure. |
| | | |
| | | If the iForm response matches multiple empty persons, then it is assigned the person number of the matched empty person with the lowest person number. This is to preserve as much of the relationship information as possible for this person. |
| | | |
| | Empty persons that are from PQ or EQ (Forced) are disregarded. | |
| | | |
| | | Empty persons captured on PQ or EQ (Forced) are disregarded at this point as there is not 100% assurance (no validation) for these collection modes that the empty persons are a result of requesting an iForm. They may simply be poorly filled out responses. |
| | | |
| | | |
| | | |
| | | Empty persons captured on EQ (not forced) are retained, unless they match on name to an iForm. This is because there is reasonable evidence that this response represents a real person that requested but did not return an iForm. |
| | | |
| | | Without this rule, there is a risk of introducing bias for 1-person HHs in Houses in Multiple Occupation (HMOs), because adjustment won’t add people into counted HHs, and HMOs may be more likely to respond online. |
| | | |
| | | |
| | | |
| | | The retaining of empty EQ (not forced) submissions will be coded in such a way that it can be toggled on and off as required during live processing. |
| | | |
| | | |
6 | Resolve orphan responses | Orphan responses are identified as records captured on iForms or HC forms at UPRNs for which neither HH forms nor CE forms are received. | 1) There must be a good reason why a respondent would fill out a HC form for an address. The form clearly states that it is for a HH, so the assumption is that the respondent would have been aware of this and only filled out the form if they lived in a HH. This takes priority over iForms as there is a clear indication on the form that it is to be filled out for a HH. |
| | | |
| | | |
| | | |
| | The following priority rules are used to determine whether a HH or CE is created at a UPRN where there are orphan responses: | 2) |
| | | |
| | | - The planned approach for collecting individuals residing at CEs is through iForms. |
| | | |
| | 1) If there is at least one HC form at the UPRN create a HH. | - It’s unlikely that a respondent would indicate on an iForm that the form relates to a CE if they resided in a HH. For this reason, this response is prioritised ahead of the ticking of HH to this question. |
| | | |
| | | |
| | | |
| | 2) Else, if at least one of the iForms indicates on the Type of Establishment question that the form relates to a CE, create a CE. | |
| | | |
| | | Rules 3, 4, 5 and 6 are new for 2021. In 2011, the final rule was simply to set any remaining orphan residences to be a HH, which is likely to have resulted in a slight overcount of HHs. |
| | | |
| | 3) Else, if at least one of the iForms indicates on the Type of Establishment question that the form relates to a HH, create a HH. | For 2021 we plan to use all available information on the questionnaires and FWMT to assess as correctly as possible the residence type of the address. |
| | | |
| | | |
| | | |
| | 4) Else, use FWMT information for the UPRN to determine whether a HH or CE should be created. | |
| | | |
| | | 3) This is the only remaining information provided by the respondent that can be used to determine the residence type. Information provided by the respondent is prioritised ahead of information provided by enumerators or administrative data. |
| | | |
| | 5) Else, additional Rules (if required) | |
| | | |
| | | 4) A new rule for 2021. It is understood that there is information on the FWMT data that can be used to determine whether the residence at the UPRN is a HH or CE. Current understanding is that this is a flag that indicates the residence type, and that it comes from the Address Register but can also be updated by an enumerator in the field. This is prioritised above a random draw as there is trustworthy information to inform the decision. |
| | | |
| | 6) Else, randomly choose whether to create a HH or CE. | |
| | | |
| | | 5) Extra rules can be added here during the tuning phase if new information comes to light. |
| | | |
| | All orphan responses are assigned to the newly created residence at their UPRN. | |
| | | |
| | | 6) A random approach is used as the last measure to ensure no bias. |
7 | Identify duplicate Individual responses | The match-keys provided by Methodology are used to identify and flag duplicate individuals in the same residence. | Matching is undertaken to identify duplicate persons captured at the same residence. |
| | | |
| | | |
| | | |
| | Any empty persons that Levenshtein Score match on full name to another individual in the residence are flagged as duplicates. | |
8a | Resolve duplicate individual responses | When duplicate individuals are identified within the same residence, one of the individuals is selected to be retained and the others are discarded. | 1) Responses captured on iForms are prioritised over other form responses as there are special collection reasons for trusting iForm responses over other responses. These include: |
| | | |
| | The following priority rules are used to select which individual to retain as the baseline record: | - People are advised to fill out an iForm if there is sensitive information that they wish to record but do not want to disclose to other HH members. |
| | | |
| | | - If people feel strongly about correcting information that they have already submitted, then they are advised to do so using an iForm. |
| | | |
| | 1) iForm | - iForm responses are less likely to be filled out by proxy. |
| | | |
| | | |
| | | |
| | 2) Response with the greatest sum of weighted completed individual questions (the default weights are 1 for each variable). | 2) This rule helps ensure that the most consistent record is kept. The weighting function has been added to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection. |
| | | |
| | | 3) EQ responses are known to be of higher quality than PQ responses. Non-forced submissions are prioritised over forced submissions because the respondent has knowingly sent in the non-forced entries, which are therefore likely to be of higher quality. |
| | | |
| | 3) EQ (Not Forced), PQ, then EQ (Forced) | |
| | | |
| | | 4) Later receipted responses are favoured over other responses as it is expected that some people may choose to fill out another form to correct for a mistake or change in circumstance on an earlier form. |
| | | |
| | 4) Last receipted response | |
| | | |
| | | 5) Extra rules can be added here during the tuning phase if new information comes to light. |
| | | |
| | 5) Additional Rules (if required) | |
| | | |
| | | 6) A random approach is used as the last measure to ensure no bias. This is an improvement on 2011 when the last measure was to take the first one found. |
| | | |
| | 6) If all else equal, one of the individuals is selected at random. | |
| | | |
| | | Voluntary questions will not be backfilled because, legally, missing responses have to be treated as valid responses: they could be seen as a refusal to answer the question. |
| | | |
| | Any missing Individual variables on the retained record will be populated with valid responses from a disregarded duplicate individual response using the same priority rules as above. | |
| | | |
| | | Partially filled blocks of questions will also not be backfilled, to avoid creating inconsistencies in this data. |
| | | |
| | Exceptions for backfilling are: | |
| | | |
| | - Voluntary questions. | |
| | | |
| | - Partially filled Address, DOB, Year of Arrival and Citizenship fields | |
| | | |
| | | |
8b | Flag residual duplicate individual responses | The match-keys provided by Methodology are used to identify and flag duplicate retained individuals that are within the same UPRN. | It is possible, but very unlikely, that residual duplicates within the UPRN can occur. |
| | | |
| | | This is caused by duplicates on 2 or more HC forms being assigned to different HHs at the same UPRN in Module 5b. The forms could be assigned to different HHs if they match multiple HHs or they are assigned to a HH based on random selection. |
| | | |
| | The scale of these duplicates will determine whether further intervention is required. | |
| | | |
| | | A solution to this has been discussed at the Working Group, and it was noted that there would not be a quick fix. It’s likely that the fix would result in having to recode parts of Modules 2 and 3. |
| | | |
| | | |
| | | |
| | | The counts of missed duplicates in 2011 were very low, likely in the hundreds. It is expected that this will be even lower in 2021 due to: |
| | | |
| | | 1) Improvements in the Address Register correctly identifying HHs as being at different addresses. |
| | | |
| | | 2) The uptake of HC forms being lower. |
| | | |
| | | |
| | | |
| | | These duplicates will be flagged and counted to determine whether an intervention or fix is required. |
| | | |
| | | |
9 | Identify wholly duplicated HHs | The match-keys are used to identify duplicate persons in the postcode. | Assessment of the 2011 Census Overcount Methodology (Dini & Large, 2014) made the following recommendation for RMR: |
| | | |
| | | |
| | | |
| | Levenshtein score matching is used to identify empty persons that are duplicated within the postcode. | “The RMR processing would benefit from looking within the postcode for duplicates, in addition to looking within the address”. |
| | | |
| | | |
| | | |
| | HHs of the same size within the same postcode that contain the same persons are identified and flagged as wholly duplicated HHs. | As mentioned in Section 2 of this paper, it was known in 2011 that problems in the Address Register meant that on occasion the same HH was listed multiple times at slightly different UAIs. This led to these HHs being followed up for multiple responses. |
| | | |
| | | |
| | | |
| | | As RMR sought to resolve duplicates within the HH, these duplicate responses were not resolved in 2011. |
| | | |
| | | |
| | | |
| | | In this module we will identify and flag any remaining duplicates within the postcode. |
| | | |
| | | We will also flag HHs of the same size for which every individual is duplicated. We refer to these HHs as “wholly duplicated HHs”. |
| | | |
| | | |
| | | |
| | | We do not attempt to resolve any duplicates other than the wholly duplicated HHs, as it was found that the combinatorial problem space for resolving duplicates of this nature between UPRNs within the same small area is too vast to manage accurately and feasibly through a deterministic rule set. |
10 | Resolve wholly duplicated HHs | When wholly duplicated HHs are identified, one of the HHs (and its residents) is selected to be retained and the others are discarded. | In line with the rest of RMR, one of the wholly duplicated HHs (along with its individuals) is retained and the other(s) disregarded. |
| | | |
| | The following priority rules are used to select which HH (and residents) to retain. | |
| | | |
| | | Responses have been merged by this point and so the responses could have come from different response modes with varying receipt dates. |
| | | |
| | 1) Greatest sum of weighted completed HH and individual questions (the default weights are 1 for each variable). | Therefore, the only remaining general RMR priority rules that can be used to identify which records to retain are 1) most complete (to retain as much information as possible) and 2) random draw. |
| | | |
| | 2) If equally completed, then one of the HHs is selected at random. | |
| | | |
| | | The HH response with the greatest sum of weighted completed HH and individual questions is selected to be retained. This ensures that as much information as possible is retained for the combined HH and individual(s) responses. The weighting function is included to allow for evidence-based prioritisation of any Census variable or subset of variables following analyses of the 2021 Census data during the Census Collection. |
| | | |
| | The other HHs and their residents are flagged to be disregarded. | |
| | | |
| | | The creation of the flag to identify duplicate persons will help later processes account for these responses in their methodology. |
| | | |
| | Remaining duplicate individuals flagged in Module 9 are retained but the flag is persisted to highlight the extent of the remaining overcount within the postcode to later processes. | |
| | | |
| | This flag will indicate the records that are identified to be the same individual. | |
11 | Resolve adjusted CE and HH data structures. | The rules for this module are still to be discussed at the RMR Working Group. The proposed rules to be taken for discussion are: | 1) HHs of this size are more likely to be a CE than a HH. |
| | | |
| | | |
| | | |
| | 1) HHs that contain more than 30 persons are converted to CEs. The HH is disregarded, a CE is created, and the residents are moved to the new CE. | 2) Sequential person numbering is required by later processes. |
| | | |
| | | |
| | | |
| | 2) Reorder person number, ensuring that person numbers are sequential with no gaps. When reordering, persons captured on the same form will be given sequential person numbers. | 3) This relationship data is no longer required and is therefore logically deleted. |
| | | |
| | | |
| | | |
| | 3) All non-applicable relationship records are disregarded (i.e. relationship no longer required based on newly assigned person numbers, resident is disregarded, or resident is moved to residing at a CE.) | 4) New relationships are required based on changes to residence and/or person number. |
| | | |
| | | |
| | | |
| | 4) New relationship records are created where required (i.e. relationship now required based on newly assigned person numbers, or resident is moved to residing at a HH). These records will have their Relationship set to missing. Where the new record is a HC form individual’s relationship with person 1, whatever relationship is captured on the HC form will populate the relationship field. | 5) This visitor data is no longer required and is therefore logically deleted. |
| | | |
| | | |
| | | |
| | 5) Disregard visitor data from disregarded HHs. | 6) Person 1 no longer exists, so residents’ responses are set to missing where they responded with “Same as Person 1”. |
| | | |
| | | |
| | | |
| | 6) Address One Year Ago information is set to missing where an individual selected “Same as Person 1” response and person 1 on the questionnaire is disregarded. | |
12 | Create RMR Flags | N/A | N/A |
13 | Run RMR Diagnostics | N/A | N/A |
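
The priority rules in Module 8a (iForm first, then greatest weighted completeness, then collection mode, then latest receipt, then random draw) can be sketched as a lexicographic comparison. This is only an illustrative sketch and not the production RMR code: the `Response` class, its field names, the mode labels in `MODE_RANK` and the function names are all assumptions made for the example, and the placeholder rule 5 (additional rules) is omitted.

```python
import random
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical labels for rule 3: EQ (not forced), then PQ, then EQ (forced).
# A lower rank is preferred.
MODE_RANK = {"EQ": 0, "PQ": 1, "EQ_FORCED": 2}


@dataclass
class Response:
    """Illustrative individual response; all field names are assumptions."""
    resident_id: str
    is_iform: bool
    mode: str
    receipt: datetime
    # Maps variable name to value; None marks a missing answer.
    answers: dict = field(default_factory=dict)


def weighted_completeness(resp, weights=None):
    """Sum the weights of completed questions (default weight is 1 each)."""
    weights = weights or {}
    return sum(
        weights.get(var, 1.0)
        for var, val in resp.answers.items()
        if val is not None
    )


def select_baseline(duplicates, weights=None, rng=None):
    """Pick the record to retain from a group of duplicate individuals."""
    rng = rng or random.Random()

    def priority(resp):
        # Tuples compare element by element, so earlier rules dominate.
        return (
            resp.is_iform,                         # rule 1: iForm
            weighted_completeness(resp, weights),  # rule 2: completeness
            -MODE_RANK[resp.mode],                 # rule 3: collection mode
            resp.receipt,                          # rule 4: last receipted
        )

    best = max(priority(r) for r in duplicates)
    tied = [r for r in duplicates if priority(r) == best]
    # Rule 6: if all else is equal, select one at random.
    return rng.choice(tied)
```

Passing a seeded `random.Random` as `rng` would make the final random draw reproducible between runs, which may be useful when comparing diagnostics.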