Sampling and Weighting Technical Report, Census of Population, 2021
3. Census data processing
3.1 Introduction
This chapter discusses the processing of all the completed questionnaires (all questionnaire types), which encompasses everything from the receipt of the questionnaires through to the creation of an accurate and complete census database. It describes the steps of questionnaire registration, questionnaire imaging and data capture, editing, error correction, failed edit follow‑up, coding, dwelling classification and non‑response adjustments, linkage of administrative data, imputation, weighting, and final response rates.
Automated processes, implemented for the 2021 Census, had to be monitored to ensure that all Canadian residences were enumerated once and only once. The Master Control System (MCS) was built to control and monitor the process flow, from collection to data processing. The MCS held a master list of all the dwellings in Canada, where each dwelling was identified with a unique identifier. This system was updated on an ongoing basis with information about each dwelling’s status in the census process flow (e.g., delivered, received or processed). Reports were generated daily by the system and made accessible online to managers to ensure that census operations were efficient and effective.
3.2 Receipt and registration
Responses submitted through the Internet or help-line telephone interviews went directly to the Data Operations Centre (DOC), where their receipt was registered automatically.
Respondents completing paper questionnaires mailed them back to the DOC. Canada Post registered their receipt automatically in multiple locations in Canada (as part of the normal mail flow process) by scanning the barcode on the front of the questionnaire through the transparent portion of the return envelope. The envelopes were then delivered to the DOC throughout each business day. Canada Post would also send files daily listing all census questionnaires received at each regional processing plant, by date of receipt.
The registration of each returned questionnaire was flagged on the MCS at Statistics Canada. The MCS generated a daily list of all the dwellings for which a questionnaire had not been received and transmitted it to field operations. This prevented non‑response follow-up with households that had already completed their questionnaire.
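The daily list of outstanding dwellings can be pictured as a filter over the master list. The sketch below is illustrative only: the `Dwelling` record, the `Status` values (taken from the examples "delivered, received or processed" given earlier) and the function name are assumptions, not the actual MCS design.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Status values borrowed from the examples in the text; the real MCS
    # status set is not described here.
    DELIVERED = "delivered"
    RECEIVED = "received"
    PROCESSED = "processed"


@dataclass
class Dwelling:
    dwelling_id: str  # hypothetical unique identifier
    status: Status


def outstanding_dwellings(master_list):
    """Return the IDs of dwellings with no registered questionnaire."""
    return sorted(
        d.dwelling_id for d in master_list if d.status is Status.DELIVERED
    )


master_list = [
    Dwelling("D001", Status.RECEIVED),
    Dwelling("D002", Status.DELIVERED),
    Dwelling("D003", Status.PROCESSED),
    Dwelling("D004", Status.DELIVERED),
]

# Only D002 and D004 would be passed on for non-response follow-up.
print(outstanding_dwellings(master_list))
```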
3.3 Scanning and keying from images
In 2021, all paper census forms (2A, 2C, 2A-L, 2A-R, 3A) were imaged. The following steps were part of the imaging process:
- Document preparation: Mailed-back questionnaires were removed from envelopes and foreign objects (i.e., clips and staples) were detached in preparation for scanning. The questionnaires were batched by form type. Their spine was cut off to separate them into single sheets.
- Scanning: The questionnaires were converted to digital images.
- Automated image quality assessment: An automated system analyzed the images for errors or anomalies. Images failing this process were sent to be reviewed by a document analysis operator.
- Document analysis: At this step, images containing anomalies were presented to an operator for review. The operator could accept the image as is and send it directly to key entry (bypassing automated recognition), or the operator could send the entire questionnaire to be pulled at the check-out step. See below for more details on the key entry and check-out steps.
- Automated recognition: This step attempted to automatically recognize all handwritten responses and marks on the questionnaires.
- Key entry: Operators entered responses that automated recognition could not determine with sufficient accuracy. About 12% of all responses were sent to keying.
- Check-out: Once the questionnaires were processed through all of the above steps, the paper questionnaires were checked out of the system. Check-out is a quality assurance process that ensures that the images and captured data are of sufficient quality that the paper questionnaires require no subsequent processing. Questionnaires that had been flagged as containing errors were pulled at check-out and reprocessed.
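The hand-off between automated recognition and key entry can be sketched as a simple routing rule. The text says only that responses not recognized "with sufficient accuracy" went to keying; the per-field confidence score, the 0.9 threshold and the field names below are assumptions made for illustration.

```python
def route_fields(recognized, threshold=0.9):
    """Split recognized fields into auto-accepted values and fields to key.

    `recognized` maps a field name to (value, confidence). Both the
    confidence score and the threshold are hypothetical.
    """
    accepted, to_key = {}, []
    for name, (value, confidence) in recognized.items():
        if confidence >= threshold:
            accepted[name] = value
        else:
            to_key.append(name)  # an operator keys this field from the image
    return accepted, to_key


fields = {
    "age": ("42", 0.98),
    "mother_tongue": ("Frnch", 0.55),
    "marital_status": ("married", 0.93),
}
accepted, to_key = route_fields(fields)
print(accepted)
print(to_key)
```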
3.4 Coverage edits, completion edits and failed edit follow-up
At this stage, a number of automated edits were performed on respondent data. These edits were designed to detect cases where the number of persons counted in the household was incorrect because of an error in collection, a respondent error or a data capture error. Most of these errors occurred on paper questionnaires, including:
- data erroneously entered in the wrong person column
- crossed-out data that were captured in error
- data not being provided for every household member listed in the roster at the beginning of the questionnaire.
Errors that can occur both on paper and online include:
- data provided for the same person on more than one questionnaire (e.g., a person completes their own 3A questionnaire and is also included on the household 2A questionnaire)
- the receipt of duplicate questionnaires (e.g., a person completes the Internet version and their spouse completes the paper version and mails it back).
For about 54% of edit failures, the system resolved the case automatically. This was done when the error was such that the solution was obvious. The solutions included deleting false person data that were created because of respondent or capture error and deleting duplicate responses. The remainder of the edit failure cases were forwarded to processing clerks for resolution. An interactive system enabled the clerks to compare data across questionnaires and examine the images of paper questionnaires to detect data capture or respondent errors. Edit failures were resolved by deleting invalid or duplicate persons or by adding missing persons (i.e., creating blank person records), as necessary and appropriate.
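The split between automatic resolution and clerical review can be illustrated with a toy rule for duplicate person records. This is a minimal sketch under stated assumptions: the matching key (dwelling, name, date of birth), the record layout and the rule "identical copies are obvious, conflicting copies go to a clerk" are hypothetical and are not the edit rules actually used.

```python
from collections import defaultdict


def resolve_duplicates(records):
    """Return (kept, for_clerk_review) for a list of person records."""
    by_person = defaultdict(list)
    for rec in records:
        key = (
            rec["dwelling_id"],
            rec["name"].strip().lower(),
            rec["date_of_birth"],
        )
        by_person[key].append(rec)

    kept, review = [], []
    for group in by_person.values():
        distinct = {tuple(sorted(r["answers"].items())) for r in group}
        if len(distinct) == 1:
            kept.append(group[0])  # identical copies: keep one automatically
        else:
            review.extend(group)   # conflicting copies: a clerk decides
    return kept, review


records = [
    {"dwelling_id": "D1", "name": "Ana Roy", "date_of_birth": "1980-02-01",
     "source": "internet", "answers": {"mother_tongue": "French"}},
    {"dwelling_id": "D1", "name": "ana roy ", "date_of_birth": "1980-02-01",
     "source": "paper", "answers": {"mother_tongue": "French"}},
    {"dwelling_id": "D1", "name": "Luc Roy", "date_of_birth": "2010-05-09",
     "source": "internet", "answers": {"mother_tongue": "French"}},
    {"dwelling_id": "D1", "name": "Luc Roy", "date_of_birth": "2010-05-09",
     "source": "paper", "answers": {"mother_tongue": "English"}},
]
kept, review = resolve_duplicates(records)
```

Here the two identical records for the first person collapse to one, while the two conflicting records for the second person are both sent for review.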
Following the coverage edits, another set of automated edits was run. These edits detected cases where too many questions had missing responses or where data had not been provided for all the usual residents in the household, including cases where missing persons were added by coverage edit clerks. Households that failed these edits were sent to follow-up: an interviewer called the respondent to resolve coverage issues and obtain missing responses, using a computer-assisted telephone interviewing application. For households that responded to the long-form questionnaire, only data missing for the short-form questions were followed up on. The data obtained through this follow-up activity were introduced into the system for subsequent processing steps. If the follow-up was unsuccessful, the data were imputed in the edit and imputation step (see Section 3.8).
3.5 Coding
The census questionnaires contained questions for which answers could be selected from a list, as well as questions requiring a written response. Where possible, written responses were automatically assigned a numerical code according to Statistics Canada reference files, code sets and standard classifications. Reference files for the automated match process were built using actual responses from past censuses or other surveys measuring the same concepts, as well as administrative files. For cases where a code could not be automatically assigned, codes were assigned using machine learning models that were developed with the natural language processing algorithm “fastText” (Note 1). Finally, records that were not assigned a code automatically through either a reference file or machine learning were coded by specially trained coders and subject-matter specialists.
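The three-stage cascade described above (reference-file match, then machine learning, then manual coding) can be sketched as follows. Everything specific here is an assumption for illustration: the codes are made up, `toy_model` is a stand-in for a trained classifier and does not call the fastText library, and the confidence threshold is hypothetical.

```python
def code_write_in(text, reference, model, min_confidence=0.8):
    """Return (code, method) for one written response."""
    key = " ".join(text.lower().split())  # normalize case and whitespace
    if key in reference:
        return reference[key], "reference_file"
    code, confidence = model(key)
    if confidence >= min_confidence:
        return code, "machine_learning"
    return None, "manual"  # left for trained coders


def toy_model(text):
    """Hypothetical stand-in for a trained text classifier."""
    if "nurs" in text:
        return "A01", 0.91
    return "Z99", 0.10


# Made-up reference file: normalized response -> made-up code.
reference = {"registered nurse": "A01"}

print(code_write_in("Registered  Nurse", reference, toy_model))
print(code_write_in("nurse (registered)", reference, toy_model))
print(code_write_in("xyz", reference, toy_model))
```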
The following questions required coding on both the long- and short-form questionnaires:
- gender
- relationship to Person 1
- home language
- mother tongue
- instruction in the minority official language.
The following questions required coding for the long-form sample only:
- place of birth of person
- place of birth of parents
- citizenship
- knowledge of non‑official languages
- ethnic or cultural origins
- population group
- religion
- First Nation/Indian band
- place of residence one year ago
- place of residence five years ago
- major field of study
- location of study
- industry
- occupation
- place of work
- Inuit land claim
- main reason for working part-time
- main reason for not working full-year
- Métis organization
- language of work.
A total of about 85 million write-ins were coded from the 2021 Census questionnaires. Overall, about 88% were coded automatically, and about 9% were coded using machine learning, although these rates varied considerably from one question to the next.
3.6 Classification and non‑response adjustments for unoccupied and non‑response dwellings
The Dwelling Classification Survey (DCS) was used to estimate the rate of enumerator error in classifying private dwellings as occupied or unoccupied. It excluded private dwellings in collection units (CUs) in First Nations communities, Métis settlements, Inuit regions and other remote areas, and all private dwellings attached to a collective dwelling. This information was used to make adjustments to the census database. The DCS selected a random sample of 1,903 mail-out, list/leave, and mail-out with drop-off CUs. Enumerators revisited these CUs in June, July and August 2021 to reassess the occupancy status as of Census Day of each private dwelling for which no response was received. The DCS estimated that 17.3% of the 1,259,149 private dwellings classified as unoccupied were actually occupied and that 38.5% of the 342,162 private dwellings with no response that were classified as occupied or that had an unknown occupancy status were actually unoccupied. Estimates based on the DCS sample were used to adjust the occupancy status for individual dwellings. This resulted in an increase of 3.0% in the number of occupied private dwellings and a decrease of 6.8% in the number of unoccupied dwellings at the Canada level.
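The reported decrease in unoccupied dwellings can be checked with a short calculation from the figures above. This is one reading of the numbers, not the official derivation: it assumes the percentage change is expressed relative to the 1,259,149 dwellings originally classified as unoccupied. The 3.0% increase in occupied dwellings is not reproduced, because the base count of occupied dwellings is not given in this section.

```python
unoccupied_before = 1_259_149   # classified as unoccupied
non_response = 342_162          # no response, occupied or unknown status

moved_to_occupied = 0.173 * unoccupied_before   # estimated actually occupied
moved_to_unoccupied = 0.385 * non_response      # estimated actually unoccupied

# Net flow out of the unoccupied group.
net_decrease = moved_to_occupied - moved_to_unoccupied
pct_decrease = 100 * net_decrease / unoccupied_before

print(round(moved_to_occupied), round(moved_to_unoccupied))
print(round(pct_decrease, 1))
```

Under that assumption the net decrease rounds to 6.8%, matching the published figure.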
The final non‑response status was determined after the DCS-based adjustment of the occupancy status. Occupied private dwellings with non‑response had their household size imputed based on the estimated distribution resulting from the DCS and then had the rest of their data imputed. The imputed responses came from administrative data or from another census-responding household, generally the geographically nearest neighbour with the same household size. This process is called whole household imputation (WHI) and is explained in Sections 3.7 and 3.8.
The WHI process has another component that is separate from the use of the DCS estimates to adjust the census database. The non‑DCS areas (CUs in First Nations communities, Métis settlements, Inuit regions and other remote areas, and all private dwellings attached to a collective dwelling) require a different imputation strategy. In these areas only, all unoccupied private dwellings are assumed to be classified correctly, so no imputation is done for them. All private dwellings with no response that were classified by enumerators as being occupied were assumed to be occupied and were imputed as occupied. As in DCS areas, dwellings imputed as occupied had their household size and responses imputed, and the imputed response came from another census-responding household or administrative data. Unlike in the DCS areas, no restrictions were placed on the household size for these imputations.
The WHI process results in all private dwellings being classified as either occupied or unoccupied (i.e., there are no longer any dwellings with total non‑response). At the Canada level (for DCS and non‑DCS areas), 3.1% of occupied private dwellings were imputed through the WHI process.
More details on the DCS and the WHI process will be available in the Coverage Technical Report, Census of Population, 2021, Statistics Canada Catalogue no. 98-303-X.
3.7 Use of administrative data
The use of administrative data increased for the 2021 Census compared with 2016. In addition to their use in the Income process, administrative data were used in the Immigration process and in the WHI process. All these uses benefited from the linkage of administrative data.
As was the case in 2016, administrative data were the only source of information on income for the Census Program. This not only reduced response burden, but also increased the quality and quantity of the income data available. The information on individuals’ income was compiled from administrative data for the entire population aged 15 and older. The T1 Income Tax and Benefit Return; the T3, T4, T4A, T4RIF, T4RSP, T5, T4A(P), T4A(OAS), T4E and T5007 tax slips; Canada Child Benefit data; and goods and services tax/harmonized sales tax credit data are examples of the sources of administrative data used. Regular, recurring taxable and non‑taxable income received during the 2020 calendar year (Note 2) was included. One-time receipts, such as lump‑sum withdrawals from registered retirement savings plans and other savings plans, lump‑sum insurance settlements, lump‑sum pension benefits, capital gains or losses, inheritances, and lottery winnings, were excluded.
The Immigration process is the successor to the 2016 Admission Category (Note 3) process and also incorporates elements that were in the 2016 Ethnocultural (Note 4) process. For the first time, in 2021, administrative data from Immigration, Refugees and Citizenship Canada (IRCC) were the main source of information for most variables processed in the Immigration process for the census long-form sample. In 2016, respondents were asked their place of birth, citizenship, immigrant status, and year of immigration (if applicable). For 2021, the immigrant status and year of immigration questions were replaced by administrative data. In addition to the variables processed in 2016, the IRCC administrative data provided new variables with information on non‑permanent residents, year of arrival, province or territory of intended destination and more.
Whole household imputation
During the WHI process, administrative data at the household and person levels were used to impute some non‑responding households to improve the data quality of the population and the dwelling counts. Administrative data were used to impute for the household size, date of birth and sex at birth when the administrative data were of sufficient quality.
3.8 Edit and imputation
The data collected in any survey or census contain some omissions or inconsistencies. For example, a respondent may be unwilling to answer a question, answer something that contradicts a previous answer or enter a meaningless answer. Other errors, such as incorrect coding, can also occur.
The final clean-up of data, done in the edit and imputation process, was fully automated using the Canadian Census Edit and Imputation System (CANCEIS) (Statistics Canada 2020) for all census topics. Two imputation methods were applied. The first method, called “deterministic imputation,” involved assigning specific values under certain conditions when problems were clear and unambiguous to resolve. Detailed edit rules were applied to identify these conditions, and the variables involved in the rules were assigned predetermined values. The second method, called “minimum-change nearest-neighbour donor imputation,” applied a series of detailed edit rules that identified any missing or inconsistent responses. When a record with missing or inconsistent responses was identified, another record that met the edit rules and was the most similar to it with respect to a set of defined characteristics was selected as a donor. Data from this donor record were borrowed and used to make the minimum number of changes to the variables to resolve all cases of missing or inconsistent responses.
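The donor method can be illustrated with a deliberately simplified sketch. It is not the CANCEIS algorithm: the fields, the single edit rule, the mismatch-count distance and the strategy of picking the single nearest donor and then searching for the smallest set of fields to overwrite are all assumptions chosen to keep the example short.

```python
from itertools import combinations

FIELDS = ("age_group", "marital_status", "employed", "student")


def passes_edits(rec):
    """Toy edit rules: no missing values, and a child cannot be married."""
    if any(rec[f] is None for f in FIELDS):
        return False
    return not (rec["age_group"] == "child" and rec["marital_status"] == "married")


def distance(rec, donor):
    """Number of reported fields on which the record and donor disagree."""
    return sum(1 for f in FIELDS if rec[f] is not None and rec[f] != donor[f])


def impute(rec, pool):
    """Fix `rec` by copying as few fields as possible from the nearest donor."""
    donors = [d for d in pool if passes_edits(d)]
    if not donors:
        return None
    donor = min(donors, key=lambda d: distance(rec, d))
    missing = [f for f in FIELDS if rec[f] is None]
    reported = [f for f in FIELDS if rec[f] is not None]
    # Missing fields are always filled; try overwriting 0, 1, 2, ... reported
    # fields until the record passes every edit.
    for extra in range(len(reported) + 1):
        for changed in combinations(reported, extra):
            candidate = dict(rec)
            for f in (*missing, *changed):
                candidate[f] = donor[f]
            if passes_edits(candidate):
                return candidate
    return None  # not reached: copying every field reproduces the donor


failed = {"age_group": "child", "marital_status": "married",
          "employed": None, "student": "yes"}
pool = [
    {"age_group": "adult", "marital_status": "married",
     "employed": "yes", "student": "no"},
    {"age_group": "child", "marital_status": "single",
     "employed": "no", "student": "yes"},
]
fixed = impute(failed, pool)
print(fixed)
```

In this toy case the second donor is nearest, the missing `employed` value is borrowed from it, and only `marital_status` is overwritten to resolve the inconsistency.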
The edit and imputation process starts with the WHI applied to census non‑respondents in CUs with a response rate lower than 90%. As a first step, non‑respondents with good-quality administrative data records have the household size, and the date of birth and sex at birth of all household members, imputed from those records. Their remaining missing variables are imputed in subsequent steps. The remainder of the census non‑respondents are imputed from the geographically nearest neighbour among the set of full or partial respondents, or the set of non‑respondents already imputed from administrative data. In the DCS areas, the donor must have the same household size.
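The donor search for whole household imputation can be sketched as a nearest-neighbour lookup with an optional household-size constraint. The planar coordinates, the Euclidean distance and the record layout are assumptions; the report does not specify how geographic proximity is measured.

```python
import math


def nearest_donor(target, donors, same_size=True):
    """Return the geographically closest donor household, or None.

    With same_size=True (the rule described for DCS areas) only donors
    with the target's household size are eligible.
    """
    eligible = [
        d for d in donors if not same_size or d["size"] == target["size"]
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda d: math.dist(target["xy"], d["xy"]))


# Hypothetical households with planar coordinates.
target = {"id": "T", "xy": (0.0, 0.0), "size": 2}
donors = [
    {"id": "A", "xy": (1.0, 0.0), "size": 4},
    {"id": "B", "xy": (3.0, 4.0), "size": 2},
    {"id": "C", "xy": (6.0, 8.0), "size": 2},
]

print(nearest_donor(target, donors)["id"])                   # size-constrained
print(nearest_donor(target, donors, same_size=False)["id"])  # unconstrained
```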
Once WHI is completed, the remainder of the missing or invalid information is imputed deterministically or by nearest neighbour donor imputation, module by module. These modules are built to process all variables of a common topic together.
3.9 Non‑response
A household’s non‑response status may differ between the collection and processing phases. The main differences arise because the occupancy status can change between collection and processing, and because the household must answer a minimum number of questions to be considered a respondent in the processing phase. Unless otherwise specified, the term “non‑response” refers to non‑response in the data processing phase. The same applies when response is referred to rather than non‑response.
For the 2021 Census long-form questionnaire, two types of households were considered non‑respondents:
- households from the sample that answered only the questions common to both types of questionnaires, i.e., only the short‑form questions
- households that did not answer any questions.
This refers to total non‑response, which is processed differently depending on the collection method and the type of household.
3.10 Weighting
The 2021 Canadian Census Program consisted of a Census of Population and a sample survey for which one-quarter of Canadian private households were selected. Households not sampled for the survey received a short-form questionnaire, while sampled households received a long‑form questionnaire. In addition to the short-form questions, the long-form questionnaire gathered sociocultural information, as well as information on daily activities, mobility, place of birth, education, labour market activity, etc. Weighting was used to represent the entire population based on the information gathered from the sample.
The first step in the weighting process was to assign a design weight to each household that reflected its probability of being sampled. In most CUs, the sampling fraction was one-quarter, and therefore, households in these CUs were assigned a design weight of 4. The design weights in these CUs then underwent an initial adjustment for coverage and total non‑response. This adjustment was applied to the weights of respondent households. Finally, a second adjustment, referred to as final calibration, was made to establish closer agreement between the estimates obtained from respondent households in the sample and the census counts for a number of characteristics from the short-form questionnaire or from administrative data sources. The weighting methodology is described in detail in Chapter 4. All private households attached to collective dwellings and all private households in CUs in First Nations communities, Métis settlements, Inuit regions and other remote areas were selected for the long-form sample and received a design weight of 1. They were then excluded from the coverage and non‑response adjustment processes, as well as from the final calibration process.
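The three weighting steps can be illustrated with textbook ratio adjustments. This is a generic sketch, not the methodology of Chapter 4: the single adjustment cell, the single calibration total and all names below are assumptions made for illustration.

```python
def design_weight(sampling_fraction):
    """Inverse of the selection probability, e.g. 1/4 -> 4."""
    return 1.0 / sampling_fraction


def nonresponse_adjust(households):
    """Spread the weight of non-respondents over respondents (one cell)."""
    total = sum(h["weight"] for h in households)
    responding = sum(h["weight"] for h in households if h["responded"])
    factor = total / responding
    return [
        {**h, "weight": h["weight"] * factor if h["responded"] else 0.0}
        for h in households
    ]


def calibrate(households, known_total):
    """Scale weights so that they sum to a known census count."""
    current = sum(h["weight"] for h in households)
    return [
        {**h, "weight": h["weight"] * known_total / current}
        for h in households
    ]


# Five sampled households at a one-quarter sampling fraction; one did not respond.
sample = [
    {"id": i, "weight": design_weight(0.25), "responded": i != 4}
    for i in range(5)
]
adjusted = nonresponse_adjust(sample)
final = calibrate(adjusted, known_total=22)  # hypothetical census count

print([h["weight"] for h in adjusted])
print([h["weight"] for h in final])
```

In this toy case each respondent's weight goes from 4 to 5 after the non-response adjustment and to 5.5 after calibration, while the non-respondent ends with a weight of zero.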
Long-form sample households with a non‑zero weight at the end of the weighting process were the respondent households, along with the households that were assigned a design weight of 1, i.e., private households attached to collective dwellings and all private households in CUs in First Nations communities, Métis settlements, Inuit regions and other remote areas. These households made up the set of households that contributed to the long-form estimates.
3.11 Final response rates
Table 3.11.1 presents the final response rates for private households in the 2021 Census of Population, for Canada and for each province and territory, followed by non‑weighted and weighted response rates for the long-form sample based on the definition of non‑response given in Section 3.9.
The final response rate is the ratio of the numerator to the denominator, where:
- the numerator is the number of private dwellings for which a questionnaire was completed (Note 5)
- the denominator is the number of private dwellings classified as occupied, according to the census database.
The final classification of a dwelling’s occupancy status is based on an analysis of the data gathered by field staff, data provided by respondents and the results of a study into the quality of occupancy status in the DCS (see Section 3.6). The response rates indicated in Table 3.11.1 differ from the collection response rates, which were previously published and were mentioned in Section 1.5, in that they take data processing and dwelling occupancy verification into account in identifying non‑respondent households. These response rates are therefore considered final.
Weighted response rates were produced for the long-form sample. They are defined as the ratio of the numerator to the denominator, where:
- the numerator is the design-weighted count of private dwellings for which a questionnaire was completed
- the denominator is the design-weighted count of private dwellings classified as occupied, according to the census database.
Region | Response rate, short-form questionnaire (%) | Non-weighted response rate, long-form questionnaire only (%) | Weighted response rate, long-form questionnaire only (%) |
Canada | 96.9 | 94.9 | 95.7 |
Newfoundland and Labrador | 97.0 | 95.0 | 95.6 |
Prince Edward Island | 97.6 | 96.5 | 96.8 |
Nova Scotia | 97.1 | 95.6 | 96.1 |
New Brunswick | 96.8 | 94.8 | 95.7 |
Quebec | 97.1 | 95.7 | 96.3 |
Ontario | 97.2 | 95.8 | 96.2 |
Manitoba | 96.5 | 93.1 | 94.4 |
Saskatchewan | 95.5 | 91.8 | 93.5 |
Alberta | 96.5 | 93.4 | 94.4 |
British Columbia | 96.5 | 94.0 | 95.1 |
Yukon | 95.7 | 85.5 | 89.5 |
Northwest Territories | 91.8 | 86.2 | 89.2 |
Nunavut | 79.7 | 78.1 | 78.1 |
Note: All private households and occupied dwellings are included in the calculation of these response rates, without exception.
Sources: Statistics Canada, 2021 Census of Population and 2021 Census long-form sample.