Statistics Canada
3. Data processing

3.1 General

3.1.1 Receipt and registration

3.1.2 Imaging and data capture from paper questionnaires

3.1.3 Edits and failed-edit follow-up

3.1.4 Automated coding

3.1.5 Edit and imputation

3.1.6 Weighting

3.2 Aboriginal Peoples – Processing

3.2.1 Coding of the band/First Nation write-in question

3.2.2 Edit and imputation

3.2.3 Impact of edit and imputation

3.1 General

The processing phase of the census translated responses into meaningful data. This part of the census cycle was divided into six main activities:

  • Receipt and registration
  • Imaging and data capture from paper questionnaires
  • Edits and failed-edit follow-up
  • Automated coding
  • Edit and imputation
  • Weighting

3.1.1 Receipt and registration

Respondents completing paper questionnaires in mail-back areas mailed them back to a centralized data processing centre.

Questionnaires in canvasser areas were completed by the enumerators and shipped to the data processing centre.

Responses submitted through the Internet or the Census Help Line telephone interview went directly to the data processing centre, and their receipt was registered automatically.

The registration of each returned questionnaire was flagged on the Master Control System at Statistics Canada. About 10 days after Census Day, the Master Control System generated a list of all dwellings for which a questionnaire had not been received and transmitted it to Field Operations for follow-up. Registration updates were sent to Field Operations daily to prevent follow-up on households that had subsequently completed their questionnaire by mail, by telephone or through the Internet.
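The follow-up list described above amounts to a set difference between all dwellings and those with a registered return. The following is a minimal sketch of that idea only; the function name and the dwelling identifiers are hypothetical and do not describe the actual Master Control System.

```python
def follow_up_list(all_dwellings, registered):
    """Return dwellings with no registered questionnaire, in sorted order."""
    return sorted(set(all_dwellings) - set(registered))


# Hypothetical dwelling identifiers, for illustration only.
dwellings = ["D001", "D002", "D003", "D004"]
registered = {"D002"}  # returns flagged so far

print(follow_up_list(dwellings, registered))

# A daily registration update removes late returns from the follow-up list.
registered |= {"D004"}
print(follow_up_list(dwellings, registered))
```

Recomputing the list from the current registrations each day, rather than editing yesterday's list, is one simple way to avoid contacting households that have since responded.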

3.1.2 Imaging and data capture from paper questionnaires

The 2006 Census was Canada's first census to capture data using automated capture technologies rather than manual keying.

Steps in imaging:

  • Document preparation – Mailed-back questionnaires were removed from their envelopes, and foreign objects such as clips and staples were detached in preparation for scanning. Forms in booklet format were separated into single sheets by cutting off the spine.
  • Scanning – Scanning, using 18 high-speed scanners, converted the paper to digital images (pictures).
  • Automated image quality assurance – An automated system verified the quality of the scanning. Images failing this process were flagged for rescanning or keying from paper.
  • Automated data capture – Optical mark recognition and optical character recognition technologies were used to extract respondents' data from the images. Where the systems could not recognize the handwriting with sufficient accuracy, data recognition was completed by a census operator (keyer).
  • Check-out – As soon as the questionnaires were processed successfully through all of the above steps, the paper questionnaires were checked out of the system. Check-out is a quality assurance process that ensures the images and captured data are of sufficient quality that the paper questionnaires are no longer required for subsequent processing. Questionnaires that had been flagged as containing errors were pulled at check-out and reprocessed as required.
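The automated data capture step routes each field either to automatic acceptance or to a keyer, depending on how well the handwriting was recognized. The sketch below illustrates that routing under assumed conditions: the confidence score, the 0.90 threshold and the function name are hypothetical, since the source does not state how "sufficient accuracy" was measured.

```python
CONFIDENCE_THRESHOLD = 0.90  # hypothetical cut-off, not from the source


def route_field(recognized_text, confidence, threshold=CONFIDENCE_THRESHOLD):
    """Accept a recognized value automatically or send the field to a keyer."""
    if confidence >= threshold:
        return ("auto", recognized_text)
    # Low confidence: discard the machine reading and let an operator key it.
    return ("keyer", None)


print(route_field("OTTAWA", 0.97))
print(route_field("0TT4WA", 0.42))
```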

3.1.3 Edits and failed-edit follow-up

At this stage, a number of automated edits were performed on the respondent data. These edits simulated those that enumerators would have done manually in previous censuses. They checked for completeness of the responses as well as coverage (e.g., the number of persons in the household).

Data from questionnaires that failed the edits were forwarded to a processing clerk for verification against the image if available (online questionnaires would not have an image). If multiple questionnaires were received for one household, they were also verified at this stage to determine if they were duplicates (e.g., a husband completed the Internet version and his wife filled in the paper form and mailed it back).

In cases where the processing clerk could not resolve an error, or there were too many missing responses, the data were transmitted to the Census Help Line for follow-up. An interviewer telephoned the respondent to resolve any coverage issues and to fill in the missing information, using a computer-assisted telephone interviewing application. The data were then sent back to the data processing centre for reintegration into the system for subsequent processing.
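The completeness and coverage edits, and the routing of failed questionnaires, can be sketched as follows. This is an illustration under stated assumptions, not the census edit rules: the field names, the threshold for "too many missing responses" and the three outcome labels are all hypothetical, and the clerk's own judgment is not modelled.

```python
REQUIRED_FIELDS = ("age", "sex", "marital_status")  # hypothetical field names
MAX_MISSING = 2  # hypothetical threshold for "too many missing responses"


def missing_count(person):
    """Count required fields that are absent or blank for one person."""
    return sum(1 for f in REQUIRED_FIELDS if person.get(f) in (None, ""))


def run_edits(questionnaire):
    """Return 'pass', 'clerk_review' or 'telephone_follow_up'."""
    persons = questionnaire["persons"]
    missing = sum(missing_count(p) for p in persons)
    # Coverage edit: the stated household size should match the persons listed.
    coverage_ok = questionnaire["reported_household_size"] == len(persons)
    if missing == 0 and coverage_ok:
        return "pass"
    if missing > MAX_MISSING:
        return "telephone_follow_up"
    return "clerk_review"


q = {
    "reported_household_size": 2,
    "persons": [{"age": 34, "sex": "F", "marital_status": "married"}],
}
print(run_edits(q))  # one person listed, two reported: fails the coverage edit
```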

No automated edits or follow-ups were performed on questionnaires in canvasser areas.

3.1.4 Automated coding

The 2B and 2D long-form questionnaires contained questions where answers could be checked off against a list, as well as questions requiring a written response from the respondent in the boxes provided. These written responses underwent automated coding to assign each one a numerical code, using Statistics Canada reference files, code sets and standard classifications. Reference files for the automated match process were built using actual responses from past censuses. Specially trained coders and experts resolved cases where a code could not be automatically assigned. The variables for which coding applied were: Relationship to person 1, Place of birth, Citizenship, Non-official languages, Home language, Mother tongue, Ethnic origin, Population group, Indian band/First Nation, Place of residence 1 year ago, Place of residence 5 years ago, Major field of study, Location of study, Place of birth of parents, Language at work, Industry, Occupation and Place of work.

Over 40 million write-in responses were coded from the 2006 long questionnaires; an average of about 75% of these were coded automatically.
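Automated coding of write-in responses can be pictured as a lookup against a reference file, with unmatched responses set aside for trained coders. The sketch below assumes a simple exact match after normalization; the reference entries and numeric codes are made up for illustration and are not Statistics Canada classifications.

```python
# Hypothetical reference file: normalized response text -> made-up code.
REFERENCE_FILE = {"english": 1, "french": 2, "francais": 2}


def normalize(response):
    """Lower-case the response and collapse runs of whitespace."""
    return " ".join(response.lower().split())


def code_responses(responses):
    """Split responses into auto-coded pairs and a queue for manual coding."""
    coded, manual = [], []
    for response in responses:
        code = REFERENCE_FILE.get(normalize(response))
        if code is None:
            manual.append(response)  # resolved later by a coder or expert
        else:
            coded.append((response, code))
    return coded, manual


coded, manual = code_responses(["  English ", "FRANCAIS", "Unlisted answer"])
print(coded)
print(manual)
```

Listing variant spellings as separate keys that map to the same code mirrors the idea of building reference files from actual past responses.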

3.1.5 Edit and imputation

The data collected in any survey or census contain some omissions or inconsistencies. These errors can be the result of respondents missing a question, or can be due to errors generated during processing. For example, a respondent might be unwilling to answer a question, fail to remember the right answer, or misunderstand the question. Census staff may code responses incorrectly or make other mistakes during processing.

After the capture, completeness, coverage editing, corrections and coding operations were completed, the data were processed through the final edit and imputation activity, which was almost fully automated. In general, the editing process detects the errors, and the imputation process corrects them.

3.1.6 Weighting

Questions on age, sex, marital status, mother tongue and relationship to Person 1 were asked of 100% of the population, as in previous censuses. In areas where canvasser enumeration was employed, using Form 2D (the Northern and Reserves Questionnaire), 100% of the population was asked all census questions. However, in the rest of Canada, the bulk of census information was acquired on a 20% sample basis, using the additional questions on the 2B questionnaire. Weighting was used to project the information gathered from the 20% sample to the entire population.

The weighting method produces estimates from the 20% sample data that are representative of 100% of the population, and it maximizes the quality of sample estimates.

For the 2006 Census, weighting employed the same methodology used in the 2001 Census, known as calibration estimation. This began with initial weights of approximately 5 (the inverse of the one-in-five sampling fraction) and then adjusted them by the smallest possible amount needed to ensure closer agreement between the sample estimates (e.g., number of males, number of people aged 15 to 19) and the population counts for age, sex, marital status, common-law status and household.
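To illustrate the idea of adjusting initial weights toward known population counts, the sketch below uses a single categorical control (sex) and a simple ratio adjustment within each category. This is a deliberate simplification: the census calibration handled several constraints at once, and the function name and toy counts here are assumptions made for illustration.

```python
from collections import Counter


def calibrate(sample_groups, population_counts, initial_weight=5.0):
    """Scale the initial weight within each group so weighted totals match."""
    sample_counts = Counter(sample_groups)
    factors = {
        group: population_counts[group] / (initial_weight * n)
        for group, n in sample_counts.items()
    }
    return [initial_weight * factors[group] for group in sample_groups]


# Toy data: a 20-person sample drawn from a population of 50 males, 50 females.
sample = ["M"] * 9 + ["F"] * 11
population = {"M": 50, "F": 50}
weights = calibrate(sample, population)

# With the initial weight of 5, males would be estimated at 45 and females at 55.
estimated_males = sum(w for w, g in zip(weights, sample) if g == "M")
print(round(estimated_males, 6))
```

After adjustment, the weighted sample count for each group equals its population count, while each weight stays close to the initial value of 5.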

This was the last processing step in producing the final 2006 Census database, the source of data for all publications, tabulations and custom products.

3.2 Aboriginal Peoples – Processing

3.2.1 Coding of the band/First Nation write-in question

Write-in responses to the Indian band/First Nation question were coded to a list of over 600 Indian bands/First Nations. Automated coding handled 75% of the responses. The remaining responses were coded using interactive applications designed specifically for Indian band/First Nation coding. The systems included several reference files, such as a file containing different spellings of Indian band names and the corresponding codes, and a file containing geographic codes for Indian reserves, names of Indian reserves, and names of Indian bands that are affiliated with these reserves (see note 1).

3.2.2 Edit and imputation

The edit and imputation process used for 2006 is essentially the same as what was used for 2001 and 1996. The process for the Aboriginal variables was re-designed for the 1996 Census when the current three Aboriginal questions (18, 20 and 21) were initially asked.

The general aim of the edit and imputation process for Aboriginal data is twofold:

  • To assign valid values in the case of missing or invalid responses to questions 18, 20 or 21. (An invalid response refers to a multiple response that is not allowed or does not make any sense, such as the 'Yes' and 'No' circles both being checked.)
  • To replace valid but questionable responses to questions 18, 20 or 21 with responses that are more reasonable given the known characteristics of the person.

Two types of imputation were applied to the Aboriginal data: deterministic imputation and donor imputation. Deterministic imputation assigns a unique value to a missing or invalid response, either through relationships among the person's own characteristics or, in the case of children with no responses, by using the characteristic(s) of their parent(s). Donor imputation identifies individuals in the same geographical area who have similar, but complete and consistent, characteristics, and then copies the values of randomly selected individuals to fill in the missing or erroneous data among the 'failed edit' individuals.

Because of the substantial differences involved in enumerating Aboriginal people on and off reserves, these two sub-populations were treated differently. The population on reserves was subject to much more deterministic imputation, since the chances were very good that their characteristics matched those we expect on a reserve (e.g., registered Indian status and member of a band if there is any indication that the person is a North American Indian). By contrast, persons living off reserves were subject to more editing and to donor imputation, both to eliminate 'false' Aboriginal responses which appear due to respondent misunderstanding, and to make up for high non-response among the Aboriginal population through the random process of imputation.

Early in the process, auxiliary information was used to perform deterministic imputation on the data. This information included Mother Tongue (Question 16), Place of Birth (Question 9), Ethnic Origin (Question 17) and Population Group (Question 19). The purpose of these comparisons was to correct responses from non-Aboriginal persons who reported themselves as Aboriginal, for example South Asians and Creoles who might have misunderstood the intent of the terms 'Indian' or 'Métis' and reported themselves as such. Through the use of related cultural variables, these people could be identified and their responses edited. For example, a positive response to one of the Aboriginal questions in combination with any of the following can signal a problem: mother tongue other than Aboriginal, English or French; non-Aboriginal, non-English, non-French, non-Canadian ethnic origin(s); a response of South Asian or Latin American in the Population Group question.
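The cross-checks against related cultural variables can be sketched as a flagging rule. The field names and category labels below are hypothetical, and the sketch assumes that an ethnic origin response signals a problem only when none of the reported origins is Aboriginal, English, French or Canadian; the source does not spell out that detail.

```python
EXPECTED_TONGUES = {"Aboriginal", "English", "French"}
EXPECTED_ORIGINS = {"Aboriginal", "English", "French", "Canadian"}
FLAGGED_GROUPS = {"South Asian", "Latin American"}


def is_questionable(person):
    """Flag a positive Aboriginal response that conflicts with other answers."""
    if not person["aboriginal_response"]:
        return False
    origins = person["ethnic_origins"]
    odd_tongue = person["mother_tongue"] not in EXPECTED_TONGUES
    # Assumption: flag only when every reported origin is outside the set.
    odd_origin = bool(origins) and all(o not in EXPECTED_ORIGINS for o in origins)
    odd_group = person["population_group"] in FLAGGED_GROUPS
    return odd_tongue or odd_origin or odd_group


person = {
    "aboriginal_response": True,
    "mother_tongue": "Other",
    "ethnic_origins": ["Other"],
    "population_group": "South Asian",
}
print(is_questionable(person))
```

A flag of this kind only marks the response for editing; it does not by itself decide what the corrected value should be.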

The place of residence was also a useful piece of information, especially if it was an Indian reserve. Most people living on an Indian reserve are Registered Indians and the reserve belongs to a specific Indian band. The strong association between these auxiliary questions and the Aboriginal questions therefore provides useful information for editing the data.

Another element of the process involves consistency checks between the various Aboriginal-related questions. For example, if we have an invalid multiple response to Question 18, consisting of 'No' in combination with 'Yes, North American Indian', 'Yes, Métis' and/or 'Yes, Inuit (Eskimo)', the person's other responses are checked for the existence of an Aboriginal mother tongue, Aboriginal ancestry (ethnic origin), or a 'Yes' to the question on band membership or Registered Indian status. If any of these exist, the 'No' is removed from the Question 18 response.
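The Question 18 consistency check described above can be written as a small rule. The sketch below is an illustration only; the function signature and the representation of responses as a set of strings are assumptions.

```python
def resolve_q18(
    q18_responses,
    aboriginal_mother_tongue=False,
    aboriginal_ancestry=False,
    band_member=False,
    registered_indian=False,
):
    """Drop a contradictory 'No' when other answers support a 'Yes'."""
    responses = set(q18_responses)
    yes_responses = responses - {"No"}
    evidence = (
        aboriginal_mother_tongue
        or aboriginal_ancestry
        or band_member
        or registered_indian
    )
    if "No" in responses and yes_responses and evidence:
        return yes_responses
    # Otherwise the response is returned unchanged for later processing.
    return responses


print(resolve_q18({"No", "Yes, Métis"}, aboriginal_ancestry=True))
```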

One other special type of deterministic imputation is the assignment of a parent's response to a child. Specifically, if a child has a missing or invalid response to one of the Aboriginal questions, their parent's response (if valid) will be assigned to the child. In a two-parent family, the mother's response (if valid) is used; otherwise, the father's response is selected.
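The parent-to-child assignment follows a fixed order of precedence, which the sketch below makes explicit. The set of valid values and the function name are hypothetical; only the mother-then-father ordering comes from the text.

```python
VALID_RESPONSES = {"Yes", "No"}  # hypothetical set of valid values


def impute_child_response(child, mother=None, father=None):
    """Keep a valid child response; otherwise use the mother's, then the father's."""
    if child in VALID_RESPONSES:
        return child
    for parent_response in (mother, father):
        if parent_response in VALID_RESPONSES:
            return parent_response
    # Neither parent has a valid response, so the value stays unresolved.
    return None


print(impute_child_response(None, mother="Yes", father="No"))
print(impute_child_response("", mother=None, father="No"))
```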

As mentioned previously, donor imputation involves finding an individual with similar characteristics and copying that individual's values for the missing or erroneous data. Donor imputation is performed only on spouses, lone parents and non-census family persons (see note 2) not living on a reserve. For donor imputation of Question 21 (Registered Indian status), for example, a potential donor must be living in the same geographical area with the same values of sex, census family status (i.e., spouse, lone parent or non-census family person) and band/First Nation membership as the person with the missing or inconsistent value. Additionally, preference is given to potential donors with similar ages and the same responses to Question 18 (Aboriginal identity).
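The donor search for Question 21 can be sketched as a filter on the required matching variables followed by a preference ranking. The record layout and field names are hypothetical, and the way the two preferences are combined (same Question 18 response first, then smallest age difference, with a random pick among ties) is an assumption; the source states the preferences but not how they were weighted.

```python
import random

MATCH_KEYS = ("area", "sex", "family_status", "band")  # must match exactly


def pick_donor(recipient, candidates, rng=random):
    """Return a donor record for Question 21, or None if no one qualifies."""
    pool = [
        c for c in candidates
        if c["q21"] is not None
        and all(c[key] == recipient[key] for key in MATCH_KEYS)
    ]
    if not pool:
        return None

    def score(c):
        # Lower is better: same Q18 response first, then closest age.
        return (c["q18"] != recipient["q18"], abs(c["age"] - recipient["age"]))

    best = min(score(c) for c in pool)
    return rng.choice([c for c in pool if score(c) == best])


base = {"area": "A", "sex": "F", "family_status": "spouse", "band": "B1"}
recipient = dict(base, age=40, q18="Yes", q21=None)
candidates = [
    dict(base, age=42, q18="Yes", q21="Yes"),
    dict(base, age=41, q18="No", q21="No"),
]
donor = pick_donor(recipient, candidates)
print(donor["q21"])
```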

At the end of the edit and imputation process, certain responses that had been assigned through donor imputation were modified where other information about the person pointed to a more reliable value. For example, certain communities that are not Indian Reserves or Indian Settlements nevertheless have known band affiliations. If a missing or invalid response to the band membership question had been replaced using donor imputation, but the person was living in a band-affiliated community, the imputed response was replaced by the band affiliated with that community.

In 2006, the total imputation rates from both deterministic imputation and donor imputation were as follows. (Note that all rates shown here and in successive tables are based on unweighted counts.) The table shows that the rates were higher for reserve communities.

3.2.3 Impact of edit and imputation

A review of the data from data capture to finalization indicates that only a minimal proportion of responses to Questions 18, 20 and 21 were changed as a result of the edit and imputation process. The following tables show the distribution of initial responses to these questions compared with the distribution of responses after edit and imputation. As intended, the process eliminated all blank and invalid responses, replacing them with a valid response of some sort. The main thing to note is that the overall distribution of the responses is not changed by the process.

Notes:

  1. Statistics Canada acknowledges with appreciation the expertise and assistance provided by Eric McGregor from Indian and Northern Affairs Canada in the coding of the Indian band/First Nation responses for the 2006 Census.
  2. A 'census family' is defined as a married couple (with or without children of either or both spouses), a couple living common-law (with or without children of either or both partners) or a lone parent of any marital status, with at least one child living in the same dwelling.
