2. Census and National Household Survey data processing
This chapter describes the processing of all completed questionnaires, from their receipt through the creation of an accurate and complete census database and a National Household Survey (NHS) database. The steps described below are questionnaire registration, questionnaire imaging and data capture, editing, error correction, failed edit follow-up, coding, dwelling classification and non-response adjustments, imputation, and weighting.
2.1 Master Control System
Automated processes implemented for the 2011 Census and NHS had to be monitored to ensure that every Canadian residence was enumerated once and only once, and to indicate which residences were to be included in the NHS. The Master Control System (MCS) was built to control and monitor the process flow from collection through data processing. The MCS held a master list of all dwellings in Canada, each identified by a unique identifier. The system was updated daily with each dwelling's status in the census and NHS process flow (e.g., delivered, received, processed). Reports were generated and made available online to managers to ensure that census and NHS operations were efficient and effective.
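The role the MCS played can be illustrated with a small sketch. The class and status names below are hypothetical, not the actual MCS design; the sketch only shows the core idea of a master list keyed by a unique dwelling identifier, updated with status changes and summarized in daily reports.

```python
from dataclasses import dataclass, field

# Illustrative status values; the real MCS tracked many more states.
STATUSES = ("delivered", "received", "processed")

@dataclass
class MasterControlSystem:
    # Master list: unique dwelling identifier -> current status
    # (None means the dwelling has not yet entered the flow).
    dwellings: dict = field(default_factory=dict)

    def register_dwelling(self, dwelling_id):
        # Every dwelling appears once and only once in the master list.
        if dwelling_id in self.dwellings:
            raise ValueError(f"duplicate dwelling: {dwelling_id}")
        self.dwellings[dwelling_id] = None

    def update_status(self, dwelling_id, status):
        assert status in STATUSES
        self.dwellings[dwelling_id] = status

    def status_report(self):
        # Daily management report: number of dwellings in each state.
        report = {s: 0 for s in (None,) + STATUSES}
        for status in self.dwellings.values():
            report[status] += 1
        return report
```

A report built this way makes it easy to spot, for example, how many dwellings have been delivered a questionnaire but have not yet returned one.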
2.2 Receipt and registration
Responses received through the Internet or a help line telephone interview were transmitted directly to a centralized data processing centre, the Data Operations Centre (DOC), and their receipt was registered automatically.
Respondents completing paper questionnaires mailed them back to the DOC. Canada Post registered their receipt automatically at multiple locations in Canada (as part of the normal mail flow) by scanning the barcode on the front of the questionnaire through the transparent portion of the return envelope. The envelopes were then delivered to the DOC. Each day, Canada Post sent a file listing all census and NHS questionnaires received at each regional processing plant, by date of receipt.
The registration of each returned questionnaire was flagged on the Master Control System (MCS) at Statistics Canada. A list of all the dwellings for which a questionnaire had not been received was generated by the MCS and then transmitted to field operations for follow-up. Registration updates were sent to field operations on a daily basis to prevent follow-up on households which had already completed their questionnaire.
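The daily follow-up list described above amounts to a simple set difference between the master list and the registered returns. A minimal sketch (function name hypothetical):

```python
def followup_list(all_dwellings, registered):
    """Return dwelling IDs for which no questionnaire has been registered,
    so field operations follow up only on non-responding households."""
    registered = set(registered)
    return [d for d in all_dwellings if d not in registered]
```

Because the registered set is refreshed daily, a household that mails its questionnaire back drops off the next day's follow-up list automatically.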
2.3 Imaging and keying from images
In 2011, the forms imaged were the three census questionnaires (2A, 2C, 3A), the Census of Agriculture questionnaire (F6), and the two NHS questionnaires (N1, N2). Image quality improved relative to 2006 because black-and-white scanners were replaced with colour scanners. The imaging process comprised the following steps:
- Document preparation: mailed-back questionnaires were removed from envelopes and foreign objects, such as clips and staples, were detached in preparation for scanning. The questionnaires were batched by form type. Forms that were in a booklet format were separated into single sheets by cutting off the spine.
- Scanning: the questionnaires were converted to digital images.
- Automated image quality assessment: an automated system analyzed the images for errors or anomalies. Images failing this process were sent to be reviewed by a document analysis operator.
- Document analysis: at this step, images containing anomalies were presented to an operator for review. The operator could accept the image as is, send it directly to key entry, or send it to be rescanned.
- Automated recognition: this step attempted to automatically recognize hand-written responses and marks on the questionnaire.
- Key entry: operators entered responses that automated recognition could not determine with sufficient accuracy.
- Check-out: as soon as the questionnaires were processed successfully through all of the above steps, the paper questionnaires were checked out of the system. Check-out is a quality assurance process that ensures the images and captured data are of sufficient quality that the paper questionnaires are no longer required for subsequent processing. Questionnaires that had been flagged as containing errors were pulled at check-out and reprocessed.
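The routing logic implied by the steps above can be sketched as follows. The function and step names are illustrative, not the actual system's interfaces; the point is that a form takes the operator-review and key-entry detours only when the automated steps cannot handle it.

```python
def route_form(quality_ok, recognized_fields, required_fields):
    """Return the imaging steps a scanned form passes through."""
    route = ["scanned"]
    if not quality_ok:
        route.append("document_analysis")   # operator reviews the image
    route.append("automated_recognition")
    # Responses the recognizer could not capture with sufficient
    # accuracy are sent to key entry.
    if any(f not in recognized_fields for f in required_fields):
        route.append("key_entry")
    route.append("check_out")               # paper can now be released
    return route
```

A cleanly scanned, fully recognized form thus passes straight through, while a problem form accumulates extra manual steps before check-out.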
2.4 Coverage edits
Coverage edits were applied to both census and NHS questionnaires. At this stage, a number of automated edits were performed on respondent data. These edits were designed to detect cases where invalid persons may have been created either due to respondent error or data capture error. Examples include data erroneously entered in the wrong person column, crossed off data that was captured in error, or data provided for the same person more than once, usually due to the receipt of duplicate forms (e.g., a husband or wife completed the Internet version and their spouse filled in the paper form and mailed it back). The edits were also designed to detect the possible absence of usual residents, when data are not provided for every household member listed at the beginning of the questionnaire.
About 45% of edit failure cases were resolved deterministically by the system; the remainder were forwarded to processing clerks for resolution. An interactive system enabled the clerks to examine the captured data and compare them with the image, where one existed (online questionnaires have no image). Edit failures were resolved by deleting invalid or duplicate persons and adding missing ones (i.e., creating blank person records), as appropriate.
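Two of the coverage conditions described above, duplicate persons and missing usual residents, can be sketched with a toy check. The function and record layout are hypothetical; the real edits were far more elaborate.

```python
def coverage_edit(listed_residents, person_records):
    """Flag duplicate person records (e.g., from duplicate paper and
    Internet forms) and listed usual residents with no person record."""
    names = [p["name"] for p in person_records]
    duplicates = {n for n in names if names.count(n) > 1}
    missing = [r for r in listed_residents if r not in names]
    return {"duplicates": sorted(duplicates), "missing": missing}
```

A household failing either check would be resolved automatically where possible, or routed to a processing clerk otherwise.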
2.5 Completion edits and failed edit follow-up
Completion edits and failed edit follow-up apply only to census questionnaires. Following the coverage edits, another set of automated edits was run on census questionnaires to detect cases where there were either too many missing responses, or indications that data may not have been provided for all usual residents in the household. Households failing these edits were sent for follow-up: an interviewer telephoned the respondent to resolve any coverage issues and fill in the missing information, using a computer-assisted telephone interviewing application. The data were then sent back to the DOC for reintegration into the system for subsequent processing.
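A completion edit of this kind reduces to two tests: is too large a share of the questions unanswered, and are there fewer person records than usual residents listed? The sketch below is illustrative only; in particular, the 10% missing-response threshold is an assumption, not the actual rule.

```python
def fails_completion_edit(responses, n_listed_residents, n_person_records,
                          max_missing_rate=0.10):
    """Return True if the household should be sent for telephone
    follow-up (threshold is an assumed value, for illustration)."""
    missing = sum(1 for v in responses.values() if v is None)
    too_many_missing = missing / len(responses) > max_missing_rate
    coverage_gap = n_person_records < n_listed_residents
    return too_many_missing or coverage_gap
```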
2.6 Coding
Both the census and NHS questionnaires contained questions for which answers could be checked off against a list, as well as questions requiring a written response from the respondent in the boxes provided. These written responses underwent automated coding to assign each one a numerical code, using Statistics Canada reference files, code sets and standard classifications. Reference files for the automated match process were built using actual responses from past censuses, as well as administrative files. Specially trained coders and subject-matter specialists resolved cases where a code could not be automatically assigned. The following questions required coding for both the census and NHS: Relationship to Person 1, Home language, and Mother tongue. The following questions required coding for the NHS only: Place of birth, Citizenship, Non-official languages, Ethnic origin, Population group, First Nation/Indian band, Religion, Place of residence 1 year ago, Place of residence 5 years ago, Place of birth of parents, Major field of study, Location of study, Industry, Occupation, Place of work and Language of work.
About 15 million write-ins were coded from the 2011 Census questionnaires, while about 46 million were coded from the NHS questionnaires. Overall about 87% were coded automatically, although the autocoding rate varied considerably from one variable to the next.
As the responses for a particular variable were coded, the data for that variable were sent to the edit and imputation phase.
2.7 Classification and non-response adjustments for unoccupied and non-response dwellings
The Dwelling Classification Survey (DCS) was used to estimate the rate of enumerator error in classifying dwellings as occupied or unoccupied in the self-enumerated collection areas of the census. Based on this information, adjustments were made to the census database. The DCS selected a random sample of 1,729 self-enumerated collection units (CUs), which were revisited in July and August 2011 to reassess the census-day occupancy status of each dwelling for which no response had been received. The DCS found that 13.8% of the 1,099,156 dwellings classified as unoccupied were actually occupied, and that 30.8% of the 317,976 non-responding dwellings classified as occupied or of unknown occupancy status were actually unoccupied. Estimates based on the DCS sample were used to adjust the occupancy status of individual dwellings. This resulted in a 3.3% increase in the number of occupied dwellings and a 5.0% decrease in the number of unoccupied dwellings at the Canada level.
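Working through the DCS rates quoted above gives a sense of the scale of the reclassification. This is illustrative arithmetic only: the published 3.3% and 5.0% net changes also depend on total dwelling counts not shown here.

```python
# Counts and misclassification rates reported by the DCS.
unoccupied_classified = 1_099_156          # classified as unoccupied
nonresponse_occupied_or_unknown = 317_976  # non-response, occupied/unknown

# Estimated dwellings moving in each direction after reassessment.
to_occupied = round(0.138 * unoccupied_classified)
to_unoccupied = round(0.308 * nonresponse_occupied_or_unknown)
```

Roughly 152 thousand dwellings were reclassified as occupied and 98 thousand as unoccupied, which is why the occupancy adjustment mattered for the census counts.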
After this adjustment of occupancy status by the DCS, occupied dwellings with total non-response had the number of usual residents (if not known) and all responses to the census questions imputed by borrowing the unimputed responses of another household within the same CU. This process, called whole household imputation (WHI), imputed 99% of total non-response households. Using a single donor under WHI was computationally more efficient and less likely to produce implausible results than using several donors, as in the main edit and imputation process. The remaining 1% of total non-response households, for which no donor household could be found under the WHI process, were imputed as part of the main edit and imputation process.
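The WHI rule can be sketched in a few lines. The function and data layout are hypothetical; the real donor selection applied additional criteria, but the essential operations are shown: pick one donor household in the same CU and copy its responses wholesale, falling back to the main edit and imputation process when no donor exists.

```python
import random

def whole_household_imputation(cu, donors_by_cu, rng=random):
    """Borrow every response from a single donor household in the same
    collection unit; return None if no donor is available."""
    donors = donors_by_cu.get(cu, [])
    if not donors:
        return None  # household falls to main edit and imputation
    return dict(rng.choice(donors))  # copy the complete donor record
```

Copying a whole household at once guarantees the imputed responses are internally consistent, which is exactly why a single donor was preferred here.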
More details on the DCS and the whole household imputation procedure can be found in the Coverage Technical Report, 2011 Census, Catalogue no. 98-303-X.
2.8 Edit and imputation
The data collected in any survey or census contain some omissions or inconsistencies. For example, a respondent may be unwilling to answer a question, fail to remember the correct answer, or misunderstand the question. Other errors, such as incorrect coding, can also occur.
The final clean-up of the data, done in the edit and imputation process, was for the most part fully automated. Two types of imputation were applied. The first, called 'deterministic imputation', assigned specific values under conditions where the resolution of the problem was clear and unambiguous: detailed edit rules identified these conditions, and the variables involved were then assigned a predetermined value. The second, called 'minimum-change nearest-neighbour donor imputation', applied a series of detailed edit rules to identify any missing or inconsistent responses. When a record with missing or inconsistent responses was identified, another record sharing the most characteristics with the record in error was selected. Data from this donor record were borrowed to make the minimum number of changes to the variables needed to resolve all missing or inconsistent responses. The Canadian Census Edit and Imputation System (CANCEIS) (see CANCEIS version 5.2 Basic User Guide) performed nearly all deterministic and minimum-change nearest-neighbour donor imputation in the 2011 Census and National Household Survey (NHS).
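The nearest-neighbour idea can be shown with a toy example. The production system was CANCEIS; the distance measure and record layout below are illustrative only. The donor is chosen by similarity on the variables that pass edits, and only the failing variables are replaced, which is the "minimum change" in the method's name.

```python
def distance(record, donor, variables):
    """Count the variables on which two records disagree."""
    return sum(record.get(v) != donor.get(v) for v in variables)

def impute(record, donors, variables, failing_vars):
    """Fill a record's failing variables from its nearest donor."""
    # Judge similarity only on variables that passed the edits.
    ok_vars = [v for v in variables if v not in failing_vars]
    donor = min(donors, key=lambda d: distance(record, d, ok_vars))
    fixed = dict(record)
    for v in failing_vars:   # minimum change: touch only failing fields
        fixed[v] = donor[v]
    return fixed
```

Because the donor resembles the failing record on its reliable variables, the borrowed values are plausible for that household rather than merely valid in isolation.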
2.9 Weighting
In 2011, the census questionnaire consisted of the same eight questions that appeared on the 2006 Census short-form questionnaire, plus two additional questions on language. These questions were asked of 100% of the population. All remaining information was collected by the National Household Survey, which was distributed to about 30% of households. Weighting was used to project the information gathered from this 30% sample to the entire population.
The sampling approach used for the 2011 NHS differed from that used for the 2006 Census long form, so the weighting methodology also differed. The first step in the weighting process was to assign basic weights reflecting each household's probability of being sampled. These weights were then adjusted for total non-response. A final adjustment, made by the smallest possible amount, ensured closer agreement between the sample estimates and the census counts for a number of characteristics related to age, sex, marital status, common-law status, language, and household size. The weighting methodology is described in detail in Chapter 4.
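The three weighting stages can be sketched numerically. The functions and numbers below are illustrative assumptions: the real calibration adjusted to many census characteristics simultaneously, not a single total, and the non-response adjustment was done within classes of similar households.

```python
def base_weight(selection_prob):
    """Basic weight: inverse of the household's selection probability."""
    return 1.0 / selection_prob

def nonresponse_adjust(weights, responded):
    """Responding households absorb the weight of non-respondents,
    preserving the total weight."""
    total = sum(weights)
    responding = sum(w for w, r in zip(weights, responded) if r)
    factor = total / responding
    return [w * factor for w, r in zip(weights, responded) if r]

def calibrate(weights, census_total):
    """Scale weights so the weighted estimate matches a census count
    (single-total stand-in for the real multi-characteristic step)."""
    factor = census_total / sum(weights)
    return [w * factor for w in weights]
```

For example, households sampled with probability 0.25 start with weight 4; if one of four sampled households does not respond, the other three carry its weight; calibration then nudges the totals to agree with the census counts.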