Statistics Canada

Archived Content

Information identified as archived is provided for reference, research or recordkeeping purposes. It is not subject to the Government of Canada Web Standards and has not been altered or updated since it was archived. Please contact us to request a format other than those available.

6.  Sampling bias

In this chapter, we will assess whether, following adjustments for non-response, the census sample is biased. This can be done by calculating the Z statistic

$$Z = \frac{\hat{X} - X}{\sqrt{\mathrm{Var}(\hat{X})}}$$

for short form characteristics such as 'Marital status = single,' where the census population count $X$ can be compared to the sample estimate $\hat{X}$ based on initial weights. In the Z statistic, the difference between the estimate and the population count is divided by the square root of the variance of the estimate. If the sampling process is random and unbiased, it can be shown that $Z$ will follow approximately a normal distribution with mean 0 and variance 1 (see Appendix C).
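The Z statistic described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `z_statistic` and the input values are hypothetical and are not taken from Table 6.1, and the variance is assumed to be supplied by the caller.

```python
import math


def z_statistic(estimate: float, population_count: float, variance: float) -> float:
    """Return (estimate - population count) / sqrt(variance of the estimate)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return (estimate - population_count) / math.sqrt(variance)


# Hypothetical inputs for illustration only (not census figures).
z = z_statistic(estimate=1_004_000.0, population_count=1_000_000.0, variance=1_000_000.0)
print(z)  # 4.0: the estimate lies four standard errors above the population count
```

A positive value indicates that the sample overestimates the population count (an upward bias), and a negative value indicates a downward bias.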

Table 6.1 and Chart 6.1 present Z statistics at the Canada level for the 2001 and 2006 censuses (along with the differences between the estimate based on initial weights and the population count) for 34 characteristics closely resembling the constraints which were applied in generating the final census weights (see Appendix B). If Z follows a normal distribution, the probability that |Z| is greater than 3 is approximately 0.0026 for one characteristic. This suggests that, on average, we would expect to see 0.0026 x 34 = 0.0884 of the 34 characteristics having |Z| greater than 3. However, according to Table 6.1, 22 of the 34 characteristics in 2006 have a Z statistic outside the range of –3 to 3. This provides strong evidence that the 2006 Census sample is biased. Similarly in 2001, 25 of the 32 characteristics were outside that range.
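The tail-probability argument above can be checked numerically. The sketch below uses only the Python standard library. The function names are hypothetical, and the binomial calculation assumes that the 34 characteristics are independent, which they are not (for example, the age groups sum to the total population), so the last figure is only a rough indication of how unlikely the observed count would be for an unbiased sample. The exact two-sided tail probability is about 0.0027, slightly above the rounded value used in the text, which gives an expected count near 0.09 and does not change the conclusion.

```python
import math


def two_sided_tail(threshold: float) -> float:
    """P(|Z| > threshold) for a standard normal Z."""
    return math.erfc(threshold / math.sqrt(2.0))


def binomial_upper_tail(n: int, k_min: int, p: float) -> float:
    """P(K >= k_min) for K ~ Binomial(n, p)."""
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


p_tail = two_sided_tail(3.0)          # chance that one characteristic has |Z| > 3
expected_count = 34 * p_tail          # expected number out of 34 characteristics
# Assumes independence between characteristics (an illustrative simplification).
prob_22_or_more = binomial_upper_tail(34, 22, p_tail)

print(f"P(|Z| > 3) = {p_tail:.4f}")
print(f"Expected count out of 34 = {expected_count:.3f}")
print(f"P(at least 22 of 34) = {prob_22_or_more:.3e}")
```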

Chart 6.1 shows that for many characteristics the Z statistic differs considerably between 2001 and 2006. Since Z is a random variable, some of these differences may not be statistically significant. A W statistic, which is defined in Appendix C, was calculated for each characteristic to determine whether or not the Z statistics from 2001 and 2006 were significantly different. The W statistic, its p-value, and the 17 characteristics with statistically significant differences in the Z values (those with p-values less than 0.05) are identified in Table 6.1 as well as in Chart 6.1.
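The W statistic itself is defined in Appendix C, which is not reproduced here. As an illustration of the general idea only, the sketch below assumes a simple form: if the two Z statistics are independent and each has variance 1, their difference divided by the square root of 2 is approximately standard normal when the underlying bias is the same in both censuses. This assumed form, the function name `w_test`, and the input values are hypothetical and may differ from the definition in Appendix C.

```python
import math


def w_test(z_a: float, z_b: float) -> tuple[float, float]:
    """Compare two Z statistics under an assumed form W = (z_a - z_b) / sqrt(2).

    Assumption (not from Appendix C): z_a and z_b are independent with unit variance.
    Returns the W value and its two-sided p-value under a standard normal.
    """
    w = (z_a - z_b) / math.sqrt(2.0)
    p_value = math.erfc(abs(w) / math.sqrt(2.0))
    return w, p_value


# Hypothetical Z values for one characteristic in two censuses.
w, p = w_test(z_a=-5.0, z_b=-1.0)
print(f"W = {w:.3f}, p-value = {p:.4f}")
print("significant at the 0.05 level" if p < 0.05 else "not significant at the 0.05 level")
```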

In Chart 6.1, it can be seen that the downward bias in the sample in 2006 increased significantly (as flagged by an asterisk) for males, males aged 15 and over, persons aged 15 to 19, single persons, 3-person households and 6+-person households, while the upward bias in the sample in 2006 increased significantly for 5-person households. In addition, the downward bias in the sample in 2006 decreased significantly for separated persons and 1-person households, while the upward bias in the sample in 2006 decreased significantly for females, females aged 15 and over, persons aged 10 to 14, married persons and 4-person households. Finally, the upward bias in the sample in 2006 changed significantly to a downward bias for the total population count, the age group 5 to 9 and the age group 40 to 44.

Chart 6.1 also shows a consistent downward bias in the 2001 and 2006 census samples of persons aged 20 to 39 and a consistent upward bias in the 2001 and 2006 census samples of persons aged 45 and over and for 2-person households.

Chart 6.1 also shows a very large upward bias in 2006 for single-detached dwellings and a somewhat smaller downward bias for apartments of less than 5 storeys.

Bias in the sample can originate from a variety of sources, including enumerator errors (e.g., not selecting the sample according to specifications), non-response bias (e.g., young adult males are less likely to complete a long questionnaire than a short questionnaire), response bias (e.g., respondents answering differently on Form 2B than they would respond on Form 2A), processing errors, and so on. 

The large biases in the 2006 sample for 5-person and 6+-person households were the result of reducing the number of persons on the long form from six to five because more space was required to allow the automated data capture of write-in responses. The number of persons on the short form remained at six. In 2006, some households with more than five persons that received a long form did not request a second long form and listed only five persons as living in the household. This caused the large increase in the upward bias in the sample for 5-person households and a corresponding large increase in the downward bias in the sample for 6+-person households. The weighting calibration process was only able to partially correct for these biases, and these biases also made it more difficult for the calibration to correct for other biases.

Another possible source of bias in the census sample was non-response. The percentage of households with no responses at the end of field operations was 2.8% in 2006 compared to 1.6% in 2001. After adjustments were made to the occupancy status by the Dwelling Classification Survey (see Section 2.7), the percentage of occupied dwellings with no responses was 3.5% in 2006 compared to 2.0% in 2001. In 2006, whole household imputation was used to impute for 96% of these total non-response households, with 18.6% of them becoming long forms. In 2001, whole household imputation was not done. In both 2001 and 2006, long forms with total non-response to the questions asked on a sample basis were converted to short forms. This process was called 'Document Conversion.' In 2006, 12,638 long-form households were converted to short forms. In 2001, 17,692 long-form households with some short-form responses but no long-form responses were converted to short forms, while 144,282 total non-response households (of which approximately 20% would be expected to have originally been long forms) became short forms. The much smaller number of long forms converted to short forms in 2006 was the result of most total non-response households being dealt with by whole household imputation.
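The scale of the change in document conversion can be made concrete with a little arithmetic on the figures quoted above. In the sketch below, the variable names are hypothetical and the 20% long-form share is the approximate expectation stated in the text, so the results are rough orders of magnitude rather than published counts.

```python
# Figures quoted in the text above.
converted_2006 = 12_638             # long forms converted to short forms in 2006
partial_response_2001 = 17_692      # 2001 long forms with short-form responses only
total_nonresponse_2001 = 144_282    # 2001 total non-response households made short forms
assumed_long_form_share = 0.20      # approximate share expected to have been long forms

# Long forms expected among the 2001 total non-response households.
expected_long_2001 = total_nonresponse_2001 * assumed_long_form_share

# Approximate number of long forms that ended up as short forms in 2001.
approx_converted_2001 = partial_response_2001 + expected_long_2001

ratio = approx_converted_2001 / converted_2006

print(f"Expected long forms among 2001 total non-response: {expected_long_2001:,.0f}")
print(f"Approximate long forms converted in 2001: {approx_converted_2001:,.0f}")
print(f"Ratio of 2001 to 2006 conversions: {ratio:.1f}")
```

Under these assumptions, roughly 46,500 long forms became short forms in 2001, between three and four times the 2006 figure.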

This change was made because the 2001 approach may have introduced significant biases into the sample. For example, in 2001 it was known that the percentage of single-detached dwellings that were total non-response households was half that of the population as a whole. See Section 7.2.2 for a more detailed discussion of the impact on sampling bias of the introduction of whole household imputation in 2006. The discussion in Section 7.2.2 casts some doubt on the utility of using the W statistic above to determine whether or not the Z statistics from 2001 and 2006 were significantly different. This is because many of the differences appear to be the result of the introduction of whole household imputation rather than because of sampling variability. The use of the W statistic to test for regional differences in the bias for 2006 below, however, is not affected by this concern.

A third possible source of bias comes from errors that were either made by the respondent or introduced by the data capture process. Some of the inconsistencies that resulted were detected and corrected by the edit and imputation process described in Section 2.8.

The geographic variation of the bias was also studied. The Z statistics for all 34 characteristics were calculated for the East, Quebec, Ontario and the West (including the three territories) regions in the same fashion as at the Canada level. The relative bias between these four regions is displayed for the 2006 and 2001 censuses in Chart 6.2 and Chart 6.3 respectively. Again, using the W statistic, regional differences which are statistically significant are flagged by placing the initials of the regions at either the bottom or the top of the chart. For example, WQ and OQ indicate that there is a significant difference in the bias between the West and Quebec as well as between Ontario and Quebec.

Comparing Chart 6.2 to Chart 6.3, it can be seen that there were more significant regional differences in 2006 than in 2001. For 2006, the downward bias for the total population count is much larger for the West and Ontario than for Quebec and the East. For females, there is a downward bias in the sample for the West and Ontario and an upward bias for Quebec and the East.

Section 7.2.2 and Chapter 8 will show that these population/estimate differences are often significantly reduced by calibration of the census weights. As a result, the inferences based on calibrated estimates should be more accurate.

Table 6.1  Population/estimate differences in 2006 and 2001 censuses based on initial weights

Chart 6.1  Z statistics for population/estimate differences based on initial weights, for Canada, 2006 and 2001 censuses

Chart 6.2  Regional Z statistics in 2006

Chart 6.3  Regional Z statistics in 2001
