

5. Census data processing

5.1 Introduction

5.2 Receipt and registration

5.3 Imaging and data capture

5.4 Coverage edits

5.5 Completion edits

5.6 Coding

5.7 Adjustments for non-response and misclassified occupied dwellings

5.8 Edit and imputation

5.9 Weighting

5.1 Introduction

Census data processing encompasses everything from the capture of data from completed questionnaires through to the creation of an accurate and complete census database: questionnaire registration, imaging, data capture, editing, error correction, coding, imputation and weighting. This section describes each operation.

Automated processes, implemented for the 2006 Census, had to be monitored to ensure that all Canadian residences were enumerated once and only once. The Master Control System was built to control and monitor the process flow. The Master Control System held a master list of all the dwellings included in the census. Each dwelling had a unique identifier providing the link to its questionnaire. This system was updated on a daily basis with information on each dwelling's status in the census process flow (i.e., delivered, received, processed, etc.). Reports were generated and made accessible online to census managers to ensure that operations were efficient and effective.
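As a rough illustration only (not Statistics Canada's actual implementation), the kind of dwelling-status tracking described above can be sketched as follows; the status values, field names and identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical status values; the actual Master Control System categories are not listed here.
STATUSES = ("delivered", "received", "processed")

@dataclass
class Dwelling:
    dwelling_id: str                 # unique identifier linking the dwelling to its questionnaire
    status: str = "delivered"
    last_updated: date = field(default_factory=date.today)

class MasterControlSketch:
    """Toy master list of dwellings with daily status updates and a simple progress report."""

    def __init__(self, dwelling_ids):
        self.dwellings = {d: Dwelling(d) for d in dwelling_ids}

    def update_status(self, dwelling_id, status, as_of):
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status}")
        dw = self.dwellings[dwelling_id]
        dw.status, dw.last_updated = status, as_of

    def report(self):
        counts = {s: 0 for s in STATUSES}
        for dw in self.dwellings.values():
            counts[dw.status] += 1
        return counts

# Example: two dwellings return questionnaires; the report shows progress by status.
mcs = MasterControlSketch(["D0001", "D0002", "D0003"])
mcs.update_status("D0001", "received", date(2006, 5, 20))
mcs.update_status("D0002", "received", date(2006, 5, 21))
print(mcs.report())   # {'delivered': 1, 'received': 2, 'processed': 0}
```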

5.2 Receipt and registration

Respondents completing paper questionnaires mailed them back to a centralized data processing centre (DPC). Canada Post scanned the barcode on the front of the questionnaire through the transparent portion of the return envelope. The envelopes were then transported to the DPC along with a compact disc containing the list of all of the identifiers for the scanned questionnaires. The returned questionnaires were then registered on the Master Control System at Statistics Canada. About ten days after Census Day, a list of all of the dwellings for which a questionnaire had not been received was generated by the Master Control System and then transmitted to field operations for follow-up. Afterwards, registration updates were sent to field operations on a daily basis to prevent follow-up on households whose questionnaires (either paper or electronic) were received after that point in time.
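A minimal sketch of the follow-up logic described above, assuming the master list and the registered returns are both available as sets of dwelling identifiers (the identifiers below are invented):

```python
# Master list of all dwellings in the census (hypothetical identifiers).
master_list = {"D0001", "D0002", "D0003", "D0004"}

# Identifiers registered as received (from Canada Post scans and internet returns).
registered = {"D0001", "D0003"}

# About ten days after Census Day: dwellings with no registered questionnaire
# are sent to field operations for non-response follow-up.
follow_up = sorted(master_list - registered)
print(follow_up)  # ['D0002', 'D0004']

# Daily registration updates then shrink the follow-up workload: households whose
# questionnaires arrive later are dropped from the list before interviewers visit.
late_returns = {"D0004"}
follow_up = sorted(master_list - registered - late_returns)
print(follow_up)  # ['D0002']
```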

5.3 Imaging and data capture

The 2006 Census was Canada's first census to capture data using automated capture technologies rather than manual keying. There were five steps in the imaging process:

  • Document preparation: mailed-back questionnaires were removed from their envelopes, and foreign objects, such as paper clips and staples, were removed in preparation for scanning. Forms in booklet format were separated into single sheets by cutting off the spine.
  • Scanning: 18 high-speed scanners converted the paper questionnaires to digital images.
  • Automated image quality assurance: an automated system verified the quality of the scanning. Images failing this process were flagged for rescanning or keying from paper.
  • Automated data capture: optical mark recognition and optical character recognition technologies were used to extract respondents' data from the images. Where the systems could not recognize the handwriting with sufficient accuracy, data repair was done by an operator.
  • Check-out: as soon as the questionnaires were processed successfully through all of the above steps, the paper questionnaires were checked out of the system. Check-out is a quality assurance process that ensures the images and captured data are of sufficient quality that the paper questionnaires are no longer required for subsequent processing. Questionnaires that had been flagged as containing errors were pulled at check-out and reprocessed as required.
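The routing between automated recognition and manual data repair in the steps above can be pictured as a simple confidence threshold; the threshold value and field structure below are illustrative only, not the parameters actually used in 2006.

```python
# Each captured field comes back from optical recognition with a value and a
# confidence score between 0 and 1 (structure and threshold are hypothetical).
CONFIDENCE_THRESHOLD = 0.90

def route_field(field_name, recognized_value, confidence):
    """Accept high-confidence recognition results; queue the rest for operator repair."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return ("accepted", recognized_value)
    return ("data_repair", None)   # an operator keys the value from the image

captured = [
    ("surname", "TREMBLAY", 0.98),
    ("date_of_birth", "19?6-03-12", 0.42),   # ambiguous handwriting
]
for name, value, conf in captured:
    print(name, route_field(name, value, conf))
# surname ('accepted', 'TREMBLAY')
# date_of_birth ('data_repair', None)
```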

5.4 Coverage edits

At this stage, a number of automated edits were performed on the respondent data. These edits were designed to detect cases where invalid persons may have been created, either through respondent error or data capture error. Examples include data erroneously entered in a blank person column, crossed-off data that was captured in error, or data provided for the same person more than once, usually because duplicate forms were received (e.g., a husband completed the Internet version while his wife filled in the paper form and mailed it back). The edits were also designed to detect the possible absence of usual residents when data were not provided for every household member listed at the beginning of the questionnaire. There was also some telephone follow-up for these edit failures.

Data from questionnaires that failed the edits were forwarded to processing clerks for verification. An interactive system enabled the clerks to examine the captured data and compare it with the image if available (online questionnaires would not have an image). Edit failures were resolved by manually deleting invalid/duplicate persons and adding missing ones (i.e., creating blank person records), as necessary and appropriate.
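One of the simpler coverage edits, detecting data provided for the same person more than once, could look roughly like the sketch below; the matching rule (identical name and date of birth within a household) is a simplification chosen for illustration, not the actual edit specification.

```python
from collections import Counter

def duplicate_person_edit(household):
    """Flag person records that appear to describe the same individual twice.

    A real coverage edit would use more variables and fuzzier matching; here a
    person is treated as a duplicate if name and date of birth both repeat.
    """
    keys = [(p["name"].strip().upper(), p["dob"]) for p in household]
    repeated = {k for k, n in Counter(keys).items() if n > 1}
    return [i for i, k in enumerate(keys) if k in repeated]

household = [
    {"name": "Jane Roy", "dob": "1970-04-02"},
    {"name": "JANE ROY", "dob": "1970-04-02"},   # same person captured twice (e.g., paper + internet)
    {"name": "Marc Roy", "dob": "1968-11-30"},
]
print(duplicate_person_edit(household))  # [0, 1] -> sent to a clerk for verification
```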

5.5 Completion edits

Following the coverage edits, another set of automated edits was run to detect cases where there were either too many missing responses, or indications that data may not have been provided for all usual residents in the household. Households failing these edits were subject to follow-up, whereby an interviewer used a computer-assisted telephone interview (CATI) application to telephone the respondent, resolve any coverage issues and fill in the missing information. The data were then sent back to the data processing centre for reintegration into the system for subsequent processing.
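In outline, a completion edit of this kind amounts to counting missing responses against a tolerance and comparing the number of person records with the household size reported at the start of the questionnaire. The 30% tolerance and record layout below are purely illustrative, not the 2006 edit parameters.

```python
MAX_MISSING_SHARE = 0.30   # illustrative tolerance, not the 2006 edit parameter

def completion_edit(person_records, reported_household_size):
    """Return True if the household should go to CATI follow-up."""
    answered = sum(1 for rec in person_records for v in rec.values() if v not in (None, ""))
    total = sum(len(rec) for rec in person_records)
    too_many_missing = (total - answered) / total > MAX_MISSING_SHARE
    possibly_missing_residents = len(person_records) < reported_household_size
    return too_many_missing or possibly_missing_residents

records = [{"age": 34, "sex": "F", "marital_status": None},
           {"age": None, "sex": None, "marital_status": None}]
print(completion_edit(records, reported_household_size=3))  # True -> CATI follow-up
```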

5.6 Coding

The long-form questionnaires (2B, 2C, 2D and 3B) contained questions where answers could be checked off against a list, as well as questions requiring a written response from the respondent in the boxes provided. These written responses underwent automated coding to assign each one a numerical code, using Statistics Canada reference files, code sets and standard classifications. Reference files for the automated match process were built using actual responses from past censuses. Specially trained coders and subject matter experts resolved cases where a code could not be automatically assigned. The variables requiring coding were: Relationship to Person 1, Place of birth, Citizenship, Non-official languages, Home language, Mother tongue, Ethnic origin, Population group, Indian band/First Nation, Place of residence 1 year ago, Place of residence 5 years ago, Major field of study, Location of study, Place of birth of parents, Language at work, Industry, Occupation and Place of work.
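In outline, automated coding is a lookup of the cleaned write-in response against a reference file of past responses and their codes, with unmatched responses routed to coders. The reference entries and code values below are invented for illustration and are not drawn from the actual classifications.

```python
# Hypothetical reference file: normalized write-in text -> classification code.
REFERENCE_FILE = {
    "REGISTERED NURSE": "3012",
    "NURSE": "3012",
    "CARPENTER": "7271",
}

def normalize(write_in):
    """Very rough clean-up; the production system also handled spelling variants, etc."""
    return " ".join(write_in.upper().split())

def code_write_in(write_in):
    """Return (code, 'auto') on a reference-file match, else queue for manual coding."""
    code = REFERENCE_FILE.get(normalize(write_in))
    if code is not None:
        return code, "auto"
    return None, "manual"   # resolved by specially trained coders / subject-matter experts

for response in ["registered  nurse", "cabinet maker"]:
    print(response, "->", code_write_in(response))
# registered  nurse -> ('3012', 'auto')
# cabinet maker -> (None, 'manual')
```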

About 37 million write-ins were coded from the 2006 long-form questionnaires. On average, about 82% of these were coded automatically.

As the responses for a particular variable were coded, the data for that variable were sent to the edit and imputation phase.

5.7 Adjustments for non-response and misclassified occupied dwellings

The Dwelling Classification Survey (DCS) was carried out during processing, after non-response follow-up, to estimate the error rates in classifying dwellings in the self-enumerated collection areas as occupied or unoccupied in the field. Based on this information, adjustments were made to the census database. The DCS selected a random sample of 1,405 self-enumerated collection units (CUs), which were revisited in July and August 2006 to reassess the occupancy status as of Census Day for each dwelling for which no questionnaire had been received. The DCS found that 17.4% of the 934,564 dwellings classified as unoccupied were actually occupied, and that 29.1% of the 366,527 non-responding dwellings classified as occupied or of unknown occupancy status were actually unoccupied. Estimates based on the DCS samples were used to adjust the occupancy status of individual dwellings so as to change (impute) appropriate proportions of unoccupied dwellings to occupied and of occupied non-responding dwellings to unoccupied. This resulted in an increase of 3.6% in the number of occupied dwellings (relative to the number of dwellings originally classified as occupied) and a decrease of 5.2% in the number of unoccupied dwellings (relative to the number originally classified as unoccupied) at the Canada level. More information on the DCS can be found in Section 6.
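The dwelling-level imputation of occupancy status can be pictured as reclassifying roughly the estimated proportion of dwellings in each group. The rates and dwelling records below are placeholders for illustration, not the published DCS estimates or the actual adjustment procedure.

```python
import random

def impute_occupancy(dwellings, rate, new_status, seed=2006):
    """Flip the occupancy status of approximately `rate` of the listed dwellings.

    `dwellings` is a list of dicts with an 'occupancy' key; random selection here is a
    simplification of how DCS-based estimates drove dwelling-level adjustments.
    """
    rng = random.Random(seed)
    for dw in dwellings:
        if rng.random() < rate:
            dw["occupancy"] = new_status
    return dwellings

# Placeholder data and rate (illustrative only).
unoccupied_no_return = [{"id": i, "occupancy": "unoccupied"} for i in range(10)]
impute_occupancy(unoccupied_no_return, rate=0.2, new_status="occupied")
print(sum(d["occupancy"] == "occupied" for d in unoccupied_no_return), "of 10 reclassified as occupied")
```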

After this adjustment of the occupancy status on the basis of the DCS results, occupied dwellings with total non-response had the number of usual residents (if not known) and all the responses to the census questions imputed by borrowing the unimputed responses from another household within the same CU that had the same type of questionnaire (long or short). This process, called whole household imputation (WHI), imputed 96% of the total non-response households. The other 4% of total non-response households, for which no donor household was found under the WHI process, were imputed as part of the main edit and imputation (E&I) process. Using a single donor under WHI was more computationally efficient and was less likely to produce implausible results than using several donors as part of the main E&I process, as was done in 2001. More information on WHI can be found in Section 6.2.4.
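Whole household imputation can be sketched as selecting a single responding donor household in the same collection unit with the same questionnaire type and copying all of its responses. The household structure, identifiers and donor-selection rule below are hypothetical simplifications.

```python
def whole_household_imputation(non_respondent, responding_households):
    """Borrow all responses from one donor household in the same CU with the same form type."""
    donors = [h for h in responding_households
              if h["cu"] == non_respondent["cu"]
              and h["form_type"] == non_respondent["form_type"]]
    if not donors:
        return None   # falls through to the main edit and imputation (E&I) process
    donor = donors[0]                     # a real system would pick the donor more carefully
    non_respondent["responses"] = dict(donor["responses"])
    non_respondent["imputed"] = True
    return donor["id"]

respondents = [{"id": "H12", "cu": "CU-042", "form_type": "2A",
                "responses": {"household_size": 3, "dwelling_type": "single-detached"}}]
missing = {"id": "H37", "cu": "CU-042", "form_type": "2A", "responses": {}}
print(whole_household_imputation(missing, respondents))  # 'H12'
print(missing["responses"])                              # borrowed from the donor household
```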

5.8 Edit and imputation

The data collected in any survey or census contains some omissions or inconsistencies. For example, a respondent might be unwilling to answer a question, fail to remember the right answer, or misunderstand the question. Also, census staff may code responses incorrectly or make other mistakes during processing.

The final clean-up of the data was done in edit and imputation and was, for the most part, fully automated. Two types of imputation were applied. The first type, called 'deterministic imputation,' involved assigning specific values under certain conditions. Detailed edit rules were applied to identify these conditions, and then the variables involved in the rules would be assigned a pre‑determined value. The second type of imputation, called 'minimum-change donor imputation,' applied a series of detailed edit rules that identified any missing or inconsistent responses. These missing or inconsistent responses were corrected by changing as few variables as possible. For minimum-change donor imputation, a record with a number of characteristics in common with the record in error was selected. Data from this 'donor' record were borrowed and used to change the minimum number of variables necessary to resolve all missing or inconsistent responses. The Canadian Census Edit and Imputation System (CANCEIS) was used for nearly all deterministic and minimum-change donor imputation in 2006.
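A highly simplified sketch of minimum-change donor imputation follows (the general approach implemented in CANCEIS, not its actual algorithm): among records that pass the edits, pick the donor that agrees with the failed record on the most variables, then overwrite only the variables needed to resolve the failures. The edit rule and variables below are invented.

```python
def fails_edits(record):
    """Toy edit rule: marital status must be 'never married' for anyone under 15."""
    return record.get("age") is None or (record["age"] < 15 and record["marital_status"] != "never married")

def minimum_change_donor_imputation(record, clean_records):
    """Pick the most similar clean donor and change as few variables as possible."""
    def similarity(donor):
        return sum(record.get(k) == v for k, v in donor.items())
    donor = max(clean_records, key=similarity)
    # Try single-variable changes first; a real system searches larger change sets too.
    for var in donor:
        if record.get(var) != donor[var]:
            trial = {**record, var: donor[var]}
            if not fails_edits(trial):
                return trial
    return dict(donor)   # fall back to borrowing everything from the donor

record = {"age": 12, "sex": "M", "marital_status": "married"}   # inconsistent
donors = [{"age": 13, "sex": "M", "marital_status": "never married"},
          {"age": 40, "sex": "F", "marital_status": "married"}]
print(minimum_change_donor_imputation(record, donors))
# {'age': 12, 'sex': 'M', 'marital_status': 'never married'}  -- one variable changed
```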

5.9 Weighting

Questions on age, sex, marital status, mother tongue and relationship to Person 1 were asked of 100% of the population, as in previous censuses. However, the bulk of census information was acquired on a 20% sample basis, using the additional questions on the 2B questionnaire. Weighting was used to project the information gathered from the 20% sample to the entire population.

For the 2006 Census, weighting employed the same methodology used in the 2001 Census, known as calibration estimation. It began by assigning initial weights of approximately 5 to the sampled households. These weights were then adjusted by the smallest possible amount needed to ensure closer agreement between the sample estimates and the population counts for a number of characteristics related to age, sex, marital status, common-law status and household size (e.g., number of males, number of people aged 15 to 19). More information on sampling and weighting can be found in the 2006 Census Technical Report on Sampling and Weighting.
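As a rough illustration of the idea, the sketch below uses a simplified raking adjustment rather than the calibration method actually used: initial weights near 5 are scaled so that weighted sample totals match known population counts for each calibration characteristic. All counts, characteristics and memberships are invented, and the "smallest possible adjustment" criterion of calibration estimation is not modelled here.

```python
def rake(weights, memberships, population_totals, iterations=50):
    """Iteratively scale weights so weighted counts match each calibration total.

    `memberships[c]` lists the sampled-household indices that each contribute one
    unit to characteristic c. This is a simplified raking sketch, not the 2006 method.
    """
    for _ in range(iterations):
        for c, total in population_totals.items():
            current = sum(weights[i] for i in memberships[c])
            factor = total / current
            for i in memberships[c]:
                weights[i] *= factor
    return weights

# Four sampled households, initial weights of about 5 (20% sample); invented totals.
weights = [5.0, 5.0, 5.0, 5.0]
memberships = {"males_15_to_19": [0, 2], "one_person_households": [1, 3]}
population_totals = {"males_15_to_19": 12.0, "one_person_households": 9.0}
print([round(w, 2) for w in rake(weights, memberships, population_totals)])
# [6.0, 4.5, 6.0, 4.5]
```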
