Chapter 4 – Data processing
Data Operations Centre
Statistics Canada's Data Operations Centre (DOC) was the central reception and storage point for electronic and printed questionnaires. Electronic questionnaires were transmitted directly to the DOC's servers, and printed questionnaires were scanned and stored as images. After the quality of the image was confirmed, the data were captured by optical mark recognition (OMR) and intelligent character recognition (ICR). If the image quality was inadequate, the data were captured manually by an operator.
Coding, the next stage of data processing, was also carried out in the Data Operations Centre. All write-in responses were submitted to an automated coding system that assigned each response a numeric code using Statistics Canada reference files, code sets and standard classifications. When the system was unable to assign a code to a particular response, the response was coded manually by an operator. Coding was applied to the following variables: relationship to Person 1, place of birth, citizenship, non-official languages, home language, mother tongue, ethnic origin, population group, Indian band/First Nation, place of residence 1 year ago, place of residence 5 years ago, place of birth of parents, major field of study, location of study, language of work, industry, occupation and place of work.
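The automated-then-manual coding flow described above can be sketched as a lookup with a fallback queue. This is only an illustration: the function name `code_response`, the normalisation rule and the numeric codes are hypothetical and are not taken from Statistics Canada's reference files or coding system.

```python
def code_response(text, reference_file):
    """Return (code, needs_manual_review) for one write-in response."""
    # Assumed normalisation: lower-case and collapse repeated whitespace.
    key = " ".join(text.lower().split())
    code = reference_file.get(key)
    return code, code is None

# Hypothetical reference file; the codes are made up for illustration.
reference = {"civil engineering": 101, "nursing": 202}

auto_coded = {}      # responses the automated lookup could code
manual_queue = []    # responses left for an operator to code

for response in ["Nursing", " civil  Engineering ", "underwater basket weaving"]:
    code, needs_manual = code_response(response, reference)
    if needs_manual:
        manual_queue.append(response)
    else:
        auto_coded[response] = code
```

A production system would use far richer matching than an exact lookup; the point here is only the split between responses coded automatically and those routed to an operator.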
Data edit and non-response imputation
After the data capture, initial edit and coding operations are completed, the data are processed up to the final edit and imputation stage. The final edit detects invalid responses and inconsistencies, using rules determined by Statistics Canada's subject-matter analysts. Unanswered questions are also identified. Imputation then replaces the missing, invalid or inconsistent responses with plausible values. When carried out properly, imputation can improve data quality by replacing non-responses with plausible responses similar to the ones that the respondents would have given if they had answered the questions. It also has the advantage of producing a complete data set.
The nearest-neighbour method was used to impute NHS data. This method is widely used in the treatment of non-response. It replaces missing, invalid or inconsistent information about one respondent with values from another, 'similar' respondent, called the donor. The rules for identifying the respondent most similar to the non-respondent may vary with the variables to be imputed. Donor imputation methods have good properties and generally do not alter the distribution of the data, whereas many other imputation techniques have that drawback. Following nearest-neighbour imputation, the data are checked for consistency.
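As a rough illustration of donor imputation, the sketch below copies a missing value from the closest complete record. The matching variables, the unscaled squared-distance measure and the sample records are all assumptions made for this example; the actual NHS matching rules vary by variable and are not reproduced here.

```python
def impute_nearest_neighbour(records, target, match_vars):
    """Fill missing `target` values from the closest record that has one."""
    donors = [r for r in records if r[target] is not None]
    completed = []
    for r in records:
        if r[target] is None:
            # Assumed similarity rule: smallest squared distance on match_vars.
            donor = min(
                donors,
                key=lambda d: sum((d[v] - r[v]) ** 2 for v in match_vars),
            )
            completed.append({**r, target: donor[target], "imputed": True})
        else:
            completed.append({**r, "imputed": False})
    return completed

# Made-up records; None marks a missing response.
records = [
    {"age": 34, "household_size": 2, "occupation": "teacher"},
    {"age": 61, "household_size": 1, "occupation": "retired"},
    {"age": 36, "household_size": 2, "occupation": None},
]

completed = impute_nearest_neighbour(records, "occupation", ["age", "household_size"])
```

Because the imputed value is always one that a real respondent reported, the completed data stay within the range of observed responses, which is the property the paragraph above attributes to donor methods.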
Weighting
The final responses are weighted so that the data from the sample accurately represent the NHS's target population. The weighting process involves three steps: calculating sampling weights, adjusting the weights for the survey's total non-response, and calibrating the weights against census totals.
First, each sampled household is assigned an initial sampling weight equal to the inverse of its probability of being selected in the NHS sample. As noted in Section 3.2, about 3 of 10 households were selected in the sample, which yields an initial weight of just over 3 (10/3). Then the sampling weights are adjusted to reflect the selection of the subsample. As mentioned in Section 3.4, the subsample was selected from the set of households that had not responded to the NHS by mid-July 2011. It is important to note that at the end of these two weighting steps, some households have a weight of 1 because in some regions all households are selected in the NHS sample.
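The arithmetic of these two steps can be sketched as follows. The function name is hypothetical, and the subsampling rate of 1/2 is an invented value used only to show the mechanics; the text above does not state the actual rate. The sketch also assumes the subsample adjustment is the usual inverse of the subsample selection probability.

```python
from fractions import Fraction

def design_weight(p_sample, p_subsample=Fraction(1)):
    """Inverse of the overall selection probability (sample x subsample)."""
    return 1 / (p_sample * p_subsample)

# About 3 of 10 households selected: weight of 10/3, just over 3.
typical = design_weight(Fraction(3, 10))

# A region where every household is selected: weight of 1.
certainty = design_weight(Fraction(1))

# Hypothetical case: a non-respondent household kept in the subsample
# with an assumed probability of 1/2.
subsampled = design_weight(Fraction(3, 10), Fraction(1, 2))

print(typical, certainty, subsampled)
```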
Next, since a number of households in the subsample were still non-respondent at the end of collection operations, the sampling weight is adjusted for the survey's residual non-response. This is done by transferring the weights of non-respondent households to the nearest-neighbour respondent households. The latter are identified in a manner similar to the imputation process described in Section 4.2, using known variables for respondent and non-respondent households, including census variables and a few variables resulting from matches to administrative databases.
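The weight transfer described above might look like the following sketch, in which each non-respondent household gives its whole weight to its nearest respondent. The matching variables, the distance measure and the household records are assumptions for illustration, not the variables or rules actually used for the NHS.

```python
def transfer_nonrespondent_weights(households, match_vars):
    """Move each non-respondent's weight to its nearest respondent."""
    # Copy respondents so the input records are left unchanged.
    respondents = [dict(h) for h in households if h["responded"]]
    for h in households:
        if not h["responded"]:
            # Assumed similarity rule: smallest squared distance on match_vars.
            nearest = min(
                respondents,
                key=lambda r: sum((r[v] - h[v]) ** 2 for v in match_vars),
            )
            nearest["weight"] += h["weight"]
    return respondents

# Made-up households with known (census-type) variables for everyone.
households = [
    {"id": "A", "responded": True, "weight": 3.0, "size": 2, "age": 40},
    {"id": "B", "responded": True, "weight": 3.0, "size": 5, "age": 30},
    {"id": "C", "responded": False, "weight": 6.0, "size": 4, "age": 33},
]

adjusted = transfer_nonrespondent_weights(households, ["size", "age"])
```

The total weight is unchanged by the transfer (12.0 before and after in this example), so the respondents continue to represent the full sample.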
Lastly, the weights are calibrated against census totals at the level of geographic calibration areas. Those areas contain an average of about 2,300 dwellings or 5,600 people in the NHS target population. They are formed by grouping dissemination areas so that they are contiguous, have enough respondent households to make calibration easy to perform, and do not straddle census division boundaries or, wherever possible, census subdivision and census tract boundaries (see Footnote 1). Calibration is performed so that the estimates for an NHS calibration area are approximately equal to the census counts for that area, for a set of about 60 characteristics common to the NHS and the Census. The control totals used are for age, sex, marital/common-law status, dwelling structure, household size, family structure and language. They include the number of households and individuals in all the dissemination areas that make up the calibration area. It is important to note, however, that for a given area, a number of calibration totals are discarded on the basis of certain criteria to avoid reducing the general quality of the estimates.
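To give a feel for what calibration does, the sketch below adjusts weights until weighted counts match a set of control totals. It uses raking (iterative proportional fitting), a common calibration technique chosen here for simplicity; the text above does not say which calibration method was used for the NHS, and the characteristics, control totals and weights are invented. The sketch does not model the discarding of control totals mentioned above.

```python
def rake(weights, indicators, totals, iterations=20):
    """Scale weights so weighted counts match each control total in turn."""
    w = list(weights)
    for _ in range(iterations):
        for name, total in totals.items():
            flags = indicators[name]
            estimate = sum(wi for wi, flag in zip(w, flags) if flag)
            factor = total / estimate  # assumes a non-zero estimate
            w = [wi * factor if flag else wi for wi, flag in zip(w, flags)]
    return w

# Four hypothetical units, one per sex-by-tenure combination.
indicators = {
    "male":   [1, 1, 0, 0],
    "female": [0, 0, 1, 1],
    "owner":  [1, 0, 1, 0],
    "renter": [0, 1, 0, 1],
}
# Invented control totals standing in for census counts.
totals = {"male": 7, "female": 5, "owner": 8, "renter": 4}

calibrated = rake([3.0, 3.0, 3.0, 3.0], indicators, totals)

def weighted_count(name):
    return sum(w for w, flag in zip(calibrated, indicators[name]) if flag)
```

After raking, `weighted_count(name)` matches `totals[name]` for every characteristic, up to floating-point rounding.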
Nevertheless, there may be differences between the NHS estimates and the census counts for common characteristics. The smaller the geographic area is, the greater the risk that the NHS estimates will be different from the census counts. This problem was present with the 2006 Census long form, but it was less common because of the higher response rates and the small variation in these response rates across areas, for both small and large municipalities.
Users should pay close attention to the potential differences between the 2011 Census counts and the NHS estimates for common characteristics. Where there are differences, users should consider the 2011 Census counts to be of higher quality and give preference to them since they are not affected by the NHS's sampling variance or non-response error.
A detailed technical guide to NHS weighting will be available in early 2014. It will provide further details on the weighting and estimation process.
- Footnote 1
Note that the weights of NHS households that are selected with certainty are calibrated independently. They have their own calibration areas, which can straddle census division boundaries.