Chapter 5 – Data quality assessment and indicators

In a sample survey there are two types of error: sampling error and non-sampling error. The former is present because when we estimate a characteristic, we are measuring only part of the population instead of the whole population. The latter covers all errors that are not related to sampling. This type of error is also present in the census. Sections 5.1 and 5.2 contain an overview of these types of error as they relate to the NHS.

Sampling error

The objective of the NHS is to produce estimates from a number of questions for a wide variety of geographies, ranging from very large areas (such as provinces and census metropolitan areas) to very small areas (such as neighbourhoods and municipalities), and for various population groups such as Aboriginal peoples and immigrants. These groups also vary in size, especially when cross-classified by geographic area. Such groupings are generally referred to as 'domains of interest'.

For any given domain of interest, on the assumption that the sampling is random, the sampling error depends on several parameters: population size, the number of survey respondents, the variability of the variables being measured, stratification and cluster sampling.

With a sampling rate of about 3 in 10 and a response rate of 68.6%, it is estimated that about 21% of the Canadian population participated in the NHS. Nevertheless, the quality of the domain estimates may vary appreciably, in particular because of the variation in response rates from domain to domain.
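The 21% figure follows from multiplying the sampling rate by the response rate. The short sketch below reproduces that arithmetic; it assumes "about 3 in 10" can be treated as exactly 0.30, and the function name is illustrative only.

```python
def participation_share(sampling_rate: float, response_rate: float) -> float:
    """Share of the population that took part: sampling rate times response rate."""
    return sampling_rate * response_rate


# Assumption: "about 3 in 10" is approximated as 0.30.
share = participation_share(0.30, 0.686)
print(f"{share:.1%}")  # 20.6%, which rounds to about 21%
```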

Non-sampling error

Besides sampling, a number of factors can cause errors in the survey's results. Respondents may misunderstand the questions and answer them inaccurately, and responses may be entered incorrectly during data capture and processing. These are examples of non-sampling errors, which were addressed at every stage of collection and processing to mitigate their impact.

In addition, in every self-administered voluntary survey, error due to non-response to the survey's variables makes up a substantial portion of the non-sampling error. A distinction is made between partial non-response (lack of response to one or some questions) and total non-response (lack of response to the survey because the household could not be reached or refused to participate). Total non-response is likely to bias the estimates based on the survey, because non-respondents tend to have different characteristics from respondents. As a result, there is a risk that the results will not be representative of the actual population.

Because the NHS response rate was 68.6% (see Section 3.5), that risk had to be taken into account. Statistics Canada conducted several studies and various simulations, before and after collection, to assess the risk and extent of the potential bias. A number of measures were taken to mitigate its effects.

Description of the NHS data quality assessment process and indicators

From the start of collection to approval for release, NHS data undergo many analyses, and a number of quality indicators are produced. In this assessment process, the indicators are analyzed so that the quality of the NHS estimates can be assessed and users can be informed of any potential limitations in the estimates. The main quality indicators produced and analyzed during the assessment are as follows:

Item non-response rates: By collection method, demographic characteristics such as age and sex, and respondents' area of residence.

Indicators of response quality: For example, the rates of invalid or uncodable responses, analyzed by collection method.

Global non-response rate: Combines household non-response and item non-response, and is weighted and produced for various geographies (see Section 6.3).

Indicators of non-response bias: Based on matching of data from the 2006 and 2011 censuses and the NHS sample, these indicators provide data on NHS respondents and non-respondents and measure the average discrepancy between NHS estimates and estimates produced with 2006 Census data (see Section 5.5).

Coefficients of variation (CVs): Used to measure the variability of estimates.
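To illustrate the last indicator, a coefficient of variation is the standard error of an estimate divided by the estimate itself. The sketch below is not the NHS variance estimator, which has to reflect the actual design and weighting. It assumes simple random sampling without replacement and an estimated proportion, and the function name and figures are hypothetical. It shows how the same proportion is estimated much less precisely in a small domain than in a large one.

```python
import math


def cv_of_proportion(p: float, n: int, N: int) -> float:
    """CV of an estimated proportion p from n respondents in a domain of size N.

    Assumption: simple random sampling without replacement, so the estimated
    variance is (1 - n/N) * p * (1 - p) / (n - 1).
    """
    finite_population_correction = 1 - n / N
    variance = finite_population_correction * p * (1 - p) / (n - 1)
    return math.sqrt(variance) / p


# Hypothetical domains with the same proportion and the same 20% respondent share.
large_domain_cv = cv_of_proportion(p=0.10, n=20_000, N=100_000)
small_domain_cv = cv_of_proportion(p=0.10, n=100, N=500)
print(f"large domain CV: {large_domain_cv:.3f}")
print(f"small domain CV: {small_domain_cv:.3f}")
```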

There are three main steps in the assessment process:

Verification of NHS data during collection and processing: This involves calculating the response quality and non-response indicators throughout the collection period. The objective is to detect possible irregularities and correct them during collection and edit and imputation.

Verification of data after edit and imputation: This involves calculating quality indicators for the entire data set and assessing the quality of imputed data. The objective is to ensure that edit and imputation have minimized potential biases while maintaining data consistency. For each NHS question, the key quality indicators produced and analyzed by subject-matter analysts are the imputation rate, the rate of corrected inconsistent responses, and a comparison of item response distributions before and after imputation.

Certification of final estimates: The final estimates are certified after weighting to ensure that the data are consistent and reliable. At this point, the final estimates are compared with various data sources. These comparisons help determine whether the NHS estimates are consistent and therefore of good quality. The key data sources used are estimates from other Statistics Canada surveys for which data based on common concepts are available (for example, the Labour Force Survey), data from previous censuses, and data from selected administrative records available to Statistics Canada (for example, the T1 file on family income and Citizenship and Immigration Canada's Longitudinal Immigration Database). Population projections, available for population subgroups (for example, projections for Aboriginal peoples), which are based on the 2006 Census and are produced with microsimulations, were also compared with the NHS estimates.

Certification of the final estimates is the last step in the validation process leading to recommendation for release of the data for each geography and domain of interest. Based on the analysis of quality indicators and the comparison of the NHS estimates with other data sources, the recommendation is for unconditional release, conditional release, or non-release for quality reasons. In the case of conditional release or non-release, appropriate notes and warnings are included in the products and provided to users.
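Two of the indicators named in the second step, the imputation rate and the comparison of item response distributions before and after imputation, can be sketched as follows. This is a minimal illustration with hypothetical data and function names, unweighted, and it is not the production procedure: missing responses are represented by None, and the shift is measured as the largest change in any category's share.

```python
from collections import Counter


def imputation_rate(reported):
    """Share of records whose response was missing and therefore imputed."""
    return sum(value is None for value in reported) / len(reported)


def distribution(values):
    """Relative frequency of each category, ignoring missing values."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def max_shift(before, after):
    """Largest absolute change in any category's share."""
    categories = set(before) | set(after)
    return max(abs(after.get(c, 0.0) - before.get(c, 0.0)) for c in categories)


# Hypothetical item: responses as reported, then after edit and imputation.
reported = ["A", "B", "A", None, "B", "A", None, "A"]
final = ["A", "B", "A", "A", "B", "A", "B", "A"]

rate = imputation_rate(reported)
shift = max_shift(distribution(reported), distribution(final))
print(f"imputation rate: {rate:.1%}, largest distribution shift: {shift:.3f}")
```

A large shift alongside a high imputation rate would flag the item for closer review by subject-matter analysts.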

For more details on the quality indicators and assessment results, please see the reference guides for the various domains of interest (see Appendix 2).

Comparability of the NHS estimates

Comparability of the NHS estimates and the 2006 Census

The content of the NHS is similar to that of the 2006 Census long questionnaire. However, a number of changes were made to some questions and sections of the questionnaire. For example, the NHS measures a new component of income (capital gains or losses) and child care and support expenses; the questions used to measure Aboriginal identity were altered slightly; and the universe for determining generational status was expanded to include the entire population, not just the population aged 15 and over. In addition, the unpaid work section was not asked in the 2011 NHS.

Any significant change in survey method or content can affect the comparability of the data over time, and that applies to the NHS as well. It is impossible to determine with certainty whether, and to what extent, differences in a variable are attributable to an actual change or to non-response bias. Consequently, at every stage of processing, verification and dissemination, considerable effort was made to produce data that are as precise as possible in their level of detail, and to ensure that the NHS's published estimates are of good quality in keeping with Statistics Canada standards.

Caution must be exercised when NHS estimates are compared with estimates produced from the 2006 Census long form, especially when the analysis involves small geographies. Users are asked to use the NHS's main quality indicator, the global non-response rate (see Section 6.3), in assessing the quality of the NHS estimates and determining the extent to which the estimates can be compared with the estimates from the 2006 Census long form. Users are also asked to read any quality notes that may be included in dissemination products.

Discrepancy between 2011 Census counts and 2011 NHS estimates

The final weights are selected so as to reduce or eliminate differences between the 2011 Census population counts and the NHS estimates. However, some discrepancies may persist because the weighting constraints sometimes have to be discarded. In addition, since the final weight adjustment is based on calibrated areas, some of which are made up of several small municipalities, there may be discrepancies between the NHS estimates and the census counts for small municipalities. The discrepancy between the population counts and the sample estimates is the difference between the NHS estimate and the 2011 Census count divided by the 2011 Census count.
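The discrepancy defined in the last sentence can be written as a one-line calculation. The sketch below uses hypothetical figures and an illustrative function name.

```python
def relative_discrepancy(nhs_estimate: float, census_count: float) -> float:
    """(NHS estimate - 2011 Census count) / 2011 Census count."""
    if census_count <= 0:
        raise ValueError("census_count must be positive")
    return (nhs_estimate - census_count) / census_count


# Hypothetical small municipality: census count of 1,000, NHS estimate of 1,030.
discrepancy = relative_discrepancy(1030, 1000)
print(f"{discrepancy:+.1%}")  # +3.0%
```

A positive value means the NHS estimate exceeds the census count, and a negative value means it falls short.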

The size of any discrepancy is an indication of the quality of the NHS estimates. For a given census subdivision (CSD) or any other geographic area, users are invited to compare the 2011 Census count with the NHS estimate for the same target population to get an idea of the quality of the NHS estimates. The larger the discrepancy is, the greater the risk of having poor-quality NHS estimates.

For CSDs with a population of 25,000 or more, the census count and the NHS estimate are practically identical. That is not always the case for smaller CSDs.

Comparisons of the 2011 Census population counts and the NHS population estimates at the CSD level for the same target population are presented in three figures in Appendix 3. Comparisons are provided for CSDs with a population between 5,000 and 25,000, CSDs with a population between 1,000 and 5,000, and CSDs with a population between 40 and 1,000. Each figure shows the ratio of the NHS population estimate to the 2011 Census population count. If the ratio is equal or close to 1, the NHS population estimate is equal or close to the 2011 Census population count. If the ratio is greater than 1, the NHS estimate is greater than the 2011 Census count, and if the ratio is less than 1, the NHS estimate is less than the census count. The farther the ratio is from 1, the greater the risk of having poor-quality NHS estimates.
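A user who wants to apply the same reading to their own list of CSDs could proceed along these lines. The CSD names and figures are hypothetical, and the treatment of boundary values (for example, whether exactly 5,000 falls in the lower or upper band) is an assumption, since the text does not specify it.

```python
def size_band(census_count: int) -> str:
    """Assign a CSD to a population size band (boundary handling is assumed)."""
    if census_count >= 25_000:
        return "25,000 or more"
    if census_count >= 5_000:
        return "5,000 to 25,000"
    if census_count >= 1_000:
        return "1,000 to 5,000"
    if census_count >= 40:
        return "40 to 1,000"
    return "under 40"


# Hypothetical CSDs: name -> (2011 Census count, NHS population estimate).
hypothetical_csds = {
    "CSD A": (31_000, 30_950),
    "CSD B": (7_200, 7_050),
    "CSD C": (450, 520),
}

rows = []
for name, (census, nhs) in hypothetical_csds.items():
    ratio = nhs / census  # ratio of the NHS estimate to the census count
    rows.append((name, size_band(census), ratio, abs(ratio - 1)))

# List the CSDs whose ratio is farthest from 1 first.
rows.sort(key=lambda row: row[3], reverse=True)
for name, band, ratio, distance in rows:
    print(f"{name}: band {band}, ratio {ratio:.3f}, distance from 1 = {distance:.3f}")
```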

An analysis of the three figures shows that for small CSDs, there can be large discrepancies between the 2011 Census population count and NHS population estimate. As explained in Section 4.3, those discrepancies are due to weighting, and as in any survey, they may be larger for small geographic areas. A similar analysis comparing the NHS estimates and the 2011 Census counts for common questions would also provide an idea of the quality of the NHS estimates.

Indicators of non-response bias

As noted in Section 3.1, the higher a survey's non-response is, the greater the risk of non-response bias. During collection, the purpose of non-response follow-up, especially the subsample follow-up, was to maximize the survey's response rate and control potential non-response bias due to the survey's voluntary nature.

To assess the quality of the NHS estimates, in addition to the usual procedures (see Section 5.3), indicators of non-response bias were calculated and analyzed.

The indicators were calculated using a data file matching the 2006 and 2011 censuses. By means of a complex matching method using surnames, addresses and birthdates, 73% of 2011 Census respondents were linked to their 2006 records. As a result, we have 2006 Census data (including data from the long form) for a large portion of the NHS sample, whether the household responded or not.

These data made it possible (1) to compare NHS respondents and non-respondents for various characteristics measured in 2006, and (2) to calculate and analyze bias indicators and assess the quality of the NHS estimates. However, these analyses have some limitations, due to the nature of the matching file. It was impossible to match the entire NHS sample to the 2006 Census, and indicators could only be calculated for large geographic areas such as the provinces and territories, census divisions and census metropolitan areas.
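The first use of the linked data, comparing respondents and non-respondents on a characteristic measured in 2006, can be illustrated with a simple relative difference. This is a hypothetical sketch, not Statistics Canada's actual indicator: it is unweighted, the records are invented, and the function name is illustrative. It compares the mean of a 2006 characteristic among NHS respondents with the mean over the whole linked sample.

```python
from statistics import mean


def bias_indicator(records):
    """Relative gap between respondents and the full linked sample.

    records: list of (responded_to_nhs, value_in_2006) pairs.
    Returns (respondent mean - full-sample mean) / full-sample mean.
    """
    all_values = [value for _, value in records]
    respondent_values = [value for responded, value in records if responded]
    full_mean = mean(all_values)
    return (mean(respondent_values) - full_mean) / full_mean


# Hypothetical linked records: 1 if the person had the characteristic in 2006.
linked_sample = [
    (True, 1), (True, 0), (True, 1),
    (False, 0), (False, 0), (True, 1),
]
indicator = bias_indicator(linked_sample)
print(f"relative difference: {indicator:+.0%}")
```

A value far from zero would suggest that respondents differ from the full sample on that characteristic, and therefore a higher risk of non-response bias for related estimates.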

It is important to keep in mind that these bias indicators are based on data from the previous census and not bias estimates calculated directly with 2011 NHS data. The indicators were used to assess the potential risk of bias for each geographic area. Analysis of these indicators and additional quality assessment analyses (see Section 5.3) provided assurance that the published NHS estimates meet Statistics Canada's quality standards. Notes are provided for variables and geographic areas for which some limitations on the quality of the NHS estimates must be taken into account.
