Information identified as archived is provided for reference, research or recordkeeping purposes. It is not subject to the Government of Canada Web Standards and has not been altered or updated since it was archived.
3 Data processing
The processing phase of the 2011 Census and National Household Survey began with translating responses into meaningful data. This part of the survey cycle was divided into six main activities:
- Receipt, registration and storage
- Imaging and data capture from paper questionnaires
- Edits and failed edit follow-up
- Coding
- Edit and imputation
- Weighting
3.1 Receipt, registration and storage
Statistics Canada's Data Operations Centre (DOC) was the central reception, registration and storage point for electronic and printed questionnaires. Electronic questionnaires were transmitted directly to the DOC's servers, and printed questionnaires were scanned and stored as images. After the quality of the image was confirmed, the data were captured by optical mark recognition (OMR) and intelligent character recognition (ICR). If the image quality was inadequate, the data were captured manually by an operator.
3.2 Imaging and data capture from paper questionnaires
Upon reception and registration of paper questionnaires, the documents went through the following process:
- Document preparation – Mailed-back questionnaires were removed from envelopes, and foreign objects such as paper clips and staples were detached. The questionnaires were then organized in batches by form type, and booklet-format forms had their spines cut off in preparation for scanning.
- Scanning – Seven high-speed scanners were used to create images of each page of each questionnaire.
- Automated image quality assurance – An automated system verified the quality of the scanning. Images failing this process were flagged for rescanning or sent to keying.
- Automated data capture – Optical mark recognition and optical character recognition technologies were used to extract respondents' data from the images. Where the systems could not recognize the handwriting with sufficient accuracy, data capture was completed by a keying operator.
- Check-out – As soon as the questionnaires were processed successfully through all of the above steps, the paper questionnaires were checked out of the system. Check-out is a quality assurance process that ensured the images and captured data were of sufficient quality that the paper questionnaires were no longer required for subsequent processing. Questionnaires that had been flagged as containing errors were pulled at check-out and reprocessed as required.
3.3 Edits and failed edit follow-up
At this stage, a number of automated coverage edits were performed on the respondent data. If multiple questionnaires were received for one household, they were also verified at this stage to determine if they were duplicates (e.g., a husband completed the Internet version and his wife filled in the paper form and mailed it back). Data from questionnaires that failed the edits were forwarded to a processing clerk for verification against the image if available (online questionnaires would not have an image).
Once coverage edits were completed, the household data were subjected to automated completion edits that simulated those that enumerators would have done manually in censuses prior to 2006. They checked for completeness of the responses as well as coverage (e.g., the number of persons in the household). A score was attributed in cases where:
- the processing clerk was not able to resolve a coverage error in the Data Operations Centre
- there was an indication that the respondent was unsure whether one or more persons should be included in the household
- there were data indicating that this was a dwelling occupied solely by temporary or foreign residents
- there were many missing or invalid responses.
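The scoring step described above can be sketched as follows. This is a hypothetical illustration only: the field names, criteria weights and threshold are invented, not Statistics Canada's actual edit rules.

```python
# Hypothetical sketch of a completion edit that scores a household record.
# Field names and the missing-answer threshold are illustrative.

def completion_score(household):
    """Return a score; a higher score suggests follow-up is needed."""
    score = 0
    if household.get("unresolved_coverage_error"):
        score += 1
    if household.get("membership_uncertain"):       # respondent unsure who to include
        score += 1
    if household.get("temporary_or_foreign_only"):  # dwelling occupied solely by temporary/foreign residents
        score += 1
    # Count missing or invalid answers across the questionnaire.
    answers = household.get("answers", {})
    missing = sum(1 for v in answers.values() if v in (None, "", "INVALID"))
    if missing > 5:                                 # illustrative threshold for "many"
        score += 1
    return score

record = {"unresolved_coverage_error": False,
          "membership_uncertain": True,
          "answers": {"q1": "yes", "q2": None, "q3": ""}}
print(completion_score(record))  # 1
```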
Although census data were transmitted to one of five Statistics Canada regional offices for failed edit follow-up, NHS data were subject to follow-up only where there was total non-response, except in canvasser areas, where questionnaires were completed during enumeration.
While automated edits were applied to all form types as described above, follow-up was not performed on questionnaires in canvasser areas as this was done during enumeration.
3.4 Coding
The N1 and N2 questionnaires contained questions where answers could be checked off against a list, as well as questions requiring a written response from the respondent in the boxes provided. These written responses (write-in responses) underwent automated coding to assign each response a numeric code, using Statistics Canada reference files, code sets and standard classifications. When the system was unable to assign a code to a particular response, the response was coded manually by a specially trained coder. In 2011, coding was applied to the following variables: relationship to Person 1, place of birth, citizenship, non-official languages, home language, mother tongue, ethnic origin, population group, First Nation/Indian band membership, religion, place of residence 1 year ago, place of residence 5 years ago, place of birth of parents, major field of study, location of study, language at work, industry, occupation and place of work.
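At its core, automated coding is a lookup of the write-in text against reference files of known spellings and their codes, with unmatched responses routed to a trained coder. A minimal sketch, with an invented mapping (the real reference files, code sets and classifications are internal to Statistics Canada):

```python
# Illustrative sketch of automated coding of write-in responses.
# The spellings-to-code mapping is invented for the example.

REFERENCE = {
    "italian": 1101,
    "italien": 1101,   # alternate spelling, same code
    "ukrainian": 1202,
}

def autocode(write_in):
    """Return (code, needs_manual_coding) for a write-in response."""
    key = write_in.strip().lower()
    code = REFERENCE.get(key)
    if code is not None:
        return code, False          # coded automatically
    return None, True               # route to a specially trained coder

print(autocode("Italien"))   # (1101, False)
print(autocode("Unknown"))   # (None, True)
```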
3.4.1 Coding of the First Nation/Indian band membership write-in question
Write-in responses to the First Nation/Indian band membership question were coded to a list of over 600 Indian bands; 68% of responses were coded automatically. The remaining responses were coded using interactive applications designed specifically for First Nation/Indian band coding. The systems included several reference files, such as a file containing different spellings of Indian band names and the corresponding codes, and a file containing geographic codes for Indian reserves, names of Indian reserves, and names of the Indian bands affiliated with these reserves.Footnote 1 The First Nation/Indian band membership data are not available on the dissemination data file but are available on request.
3.5 Edit and imputation
After data capture, and initial editing and coding operations were completed, the data were processed up to the final edit and imputation stage. The final editing detected invalid responses and inconsistencies. This editing was based on rules determined by Statistics Canada's subject-matter analysts. Unanswered questions were also identified. Imputation replaced these missing, invalid or inconsistent responses with plausible values. When carried out properly, imputation can improve data quality by replacing non-responses with plausible responses similar to the ones that the respondents would have given if they had answered the questions. It also has the advantage of producing a complete data set.
The nearest-neighbour-donor method was used to impute NHS data. This method is widely used in the treatment of item non-response. It replaces missing, invalid or inconsistent information about one respondent with values from another 'similar' respondent. The rules for identifying the respondent most similar to the non-respondent may vary with the variables to be imputed. Donor-imputation methods have good properties: unlike many other imputation techniques, they generally do not alter the distribution of the data. Following nearest-neighbour imputation, consistency of the data is assured (see the NHS User Guide Chapter 4 – Data processing).
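The nearest-neighbour-donor idea can be sketched in a few lines. The distance function and matching variables here are illustrative; the production system's matching rules were far richer:

```python
# Minimal sketch of nearest-neighbour donor imputation: find the donor most
# similar to the recipient on the matching variables, then copy the donor's
# values into the recipient's gaps. Variables and data are invented.

def nearest_donor(recipient, donors, match_vars):
    """Pick the donor that disagrees with the recipient on the fewest matching variables."""
    def distance(donor):
        return sum(donor[v] != recipient[v] for v in match_vars)
    return min(donors, key=distance)

def impute(recipient, donors, match_vars, missing_vars):
    donor = nearest_donor(recipient, donors, match_vars)
    for v in missing_vars:
        recipient[v] = donor[v]     # copy the donor's value for each gap
    return recipient

donors = [
    {"age_group": "25-34", "region": "East", "income": 52000},
    {"age_group": "25-34", "region": "West", "income": 61000},
]
recipient = {"age_group": "25-34", "region": "West", "income": None}
print(impute(recipient, donors, ["age_group", "region"], ["income"])["income"])  # 61000
```

Because all imputed values come from one real, fully responding record, the filled-in record stays internally consistent, which is the property the text credits to donor imputation.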
3.6 Edit and imputation of Aboriginal variables
The edit and imputation of the ethnocultural variables, and specifically the Aboriginal variables, was almost entirely redesigned for 2011, with the primary goal being to streamline the processes and to use, as much as possible, one donor to impute data for a respondent who had provided incomplete or invalid responses on his/her NHS questionnaire.
In 2011, the variables of immigration, citizenship, place of birth, ethnic origin/Aboriginal ancestry, population group/visible minority, Aboriginal group, Registered or Treaty Indian status, and First Nation/Indian band membership were processed together, with the interrelations between these variables clearly defined in advance. Donor imputation for missing information within these variables was done with one donor for all variables, as much as possible.
In 2011, all people requiring imputation, who were not census family children, used a single donor, who was also not a census family child. Census family children who required imputation used a donor within their own census family (sibling or parent). As a result, the imputed records were internally consistent and based on actual full responses, rather than multiple-donor responses that might have donated inconsistent information. This is a definite improvement over past methods, where units were stratified for imputation based roughly on language and geography, but not on the host of variables that were used in 2011.
The low rates of item non-response and invalid responses, and the corresponding low imputation rates for the Aboriginal variables (Aboriginal group, Registered or Treaty Indian status, and Membership in a First Nation/Indian band) (see Table 2), had little overall impact on data quality.
The 2011 NHS total imputation rates for questions 18 (Aboriginal group), 20 (Registered or Treaty Indian status) and 21 (Membership in a First Nation/Indian band) are shown in Table 2.
Table 2
Imputation rates for Aboriginal group, Registered or Treaty Indian status, and Membership in a First Nation/Indian band, Canada, provinces and territories, 2011 NHS

| Provinces and territories | Aboriginal group (%) | Registered or Treaty Indian status (%) | Membership in a First Nation/Indian band (%) |
| --- | --- | --- | --- |
| Newfoundland and Labrador | 4.4 | 6.0 | 4.2 |
| Prince Edward Island | 3.9 | 6.1 | 3.9 |

Source: Statistics Canada, National Household Survey, 2011.
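An imputation rate such as those in Table 2 is, in essence, the share of in-scope responses that had to be imputed. A toy computation with invented counts:

```python
# Toy illustration of how an imputation rate is computed.
# Counts are invented; they are not the actual NHS tallies behind Table 2.

imputed = 880          # hypothetical number of imputed responses for one question
in_scope = 20000       # hypothetical number of in-scope responses

rate = 100 * imputed / in_scope
print(f"{rate:.1f}%")  # 4.4%
```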
The final responses were weighted so that the data from the sample accurately represent the NHS's target population. The NHS weighting process involved calculating sampling weights, adjusting the weights for the survey's total non-response and calibrating the weights against census totals.
The sampling fraction varied with the questionnaire delivery mode. For the mail delivery mode, about 3 in 10 households (29%) received a questionnaire. For the enumerator delivery mode, the sampling fraction was 1 in 3 households (33%). However, in remote areas, on Indian reserves or settlements, and in Inuit communities where only the interview response mode was offered, no sampling was done and all households were invited to participate in the NHS.
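The arithmetic implied by these fractions is straightforward; here is a back-of-envelope sketch with invented household counts:

```python
# Back-of-envelope arithmetic for the NHS sampling fractions described above.
# All household counts are invented for illustration.

mail_households = 100_000     # hypothetical mail-delivery households
enum_households = 30_000      # hypothetical enumerator-delivery households
remote_households = 5_000     # remote/reserve/Inuit communities: no sampling

expected_sample = (mail_households * 0.29      # about 3 in 10 by mail
                   + enum_households / 3       # 1 in 3 by enumerator
                   + remote_households)        # all households invited
print(round(expected_sample))  # 44000
```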
Then the sampling weights were adjusted to reflect the targeted non-response follow-up that was done on a subsample of those households that had not responded to the NHS by mid-July 2011.
Subsequent to non-response follow-up, the resulting weight was adjusted for the survey's residual non-response within the subsample. This was done by transferring the weights of non-respondent households to the nearest-neighbour respondent households in the subsample.
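The weight-transfer step can be illustrated as follows; the "score" used as a similarity measure here is invented, purely to make "nearest neighbour" concrete:

```python
# Sketch of the residual non-response adjustment: the weight of each
# non-responding subsample household is transferred to its nearest
# responding neighbour. The numeric similarity score is illustrative.

def transfer_weights(respondents, nonrespondents):
    """Each entry is {'score': similarity measure, 'weight': sampling weight}."""
    for nr in nonrespondents:
        nearest = min(respondents, key=lambda r: abs(r["score"] - nr["score"]))
        nearest["weight"] += nr["weight"]   # absorb the non-respondent's weight
    return respondents

resp = [{"score": 1.0, "weight": 3.0}, {"score": 4.0, "weight": 3.0}]
nonresp = [{"score": 3.5, "weight": 3.0}]
print(transfer_weights(resp, nonresp)[1]["weight"])  # 6.0
```

Note that the total weight is preserved: the non-respondents' weight is redistributed, not discarded.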
Lastly, the weights were calibrated against census population totals for geographic areas known as calibration areas. Weight calibration was performed so that the estimates for an NHS calibration area would be approximately equal to the census counts for that area, for a set of about 60 characteristics common to the NHS and the census. Calibration is a realignment of survey estimates to known population control totals by a minimal modification of the weights. In this case, the census provides a number of counts for various demographic, social and geographic characteristics of the population. These are used as the population controls. The sample survey weights from the NHS are adjusted so that the estimates from the survey match these known counts of the census. With the assurance that their known population composition is maintained, the resulting calibrated weights are then applied to all other variables and characteristics of the survey.
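A deliberately simplified, one-characteristic illustration of calibration is a ratio adjustment: within a calibration area, weights are scaled so the weighted estimate matches the census control total. The real NHS calibration adjusted roughly 60 characteristics simultaneously while keeping the weight changes minimal, which this sketch does not attempt:

```python
# One-variable ratio calibration, a simplified stand-in for the NHS method.
# Weights, indicator and control total are invented.

def calibrate(weights, indicator, control_total):
    """Scale weights so sum(w * x) equals the known census control total."""
    current = sum(w * x for w, x in zip(weights, indicator))
    factor = control_total / current
    return [w * factor for w in weights]

weights = [10.0, 10.0, 10.0]
is_male = [1, 0, 1]          # sample indicator for one census characteristic
census_males = 25.0          # known census count for the area (invented)

new_w = calibrate(weights, is_male, census_males)
print(sum(w * x for w, x in zip(new_w, is_male)))  # 25.0
```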
Nevertheless, there may be differences between the NHS estimates and the census counts for common characteristics. Certain factors explain these differences, in particular the size of the geographic area and the level of non-response. The smaller the population count in a geographic area, the greater the risk that the NHS estimates will differ from the census counts. This issue was also present with the 2006 Census long form, but it was less common because response rates were higher and the calibration method used retained the demographic characteristics of both small and large municipalities. As a guideline to users, an NHS population estimate or distribution that is not similar to the comparable census counts may indicate quality issues due to non-response. It is suggested that such geographies be collapsed to a higher level of dissemination. Estimates for Indian reserves that show a similar effect should be combined with those of other Indian reserves associated with the same Indian band, as the combined estimates are likely to be more reliable.
For additional information on the methodology of the NHS, refer to the National Household Survey User Guide, Catalogue no. 99-001-X2011001. Every effort is made to reduce errors in estimation, and the Census of Population plays a major role in ensuring the reliability of the NHS estimates.