Statistics Canada

Data quality


Introduction

Statistics Canada, as a professional agency in charge of producing official statistics, has the responsibility to inform users of the concepts and methodology used in collecting and processing its data, the quality of the data it produces, and other features of the data that may affect their use or interpretation.

Data users must first be able to verify that the conceptual framework and definitions that would satisfy their particular data needs and uses are the same as, or sufficiently close to, those employed in collecting and processing the data. Users then need to be able to assess the degree to which errors in the data restrict the use of these data.

The measurement and assessment of data quality, however, are complex undertakings. There are several dimensions to the concept of quality, many potential sources of error and often no comprehensive measures of data quality. A rigid requirement for comprehensive data quality measurement for all Statistics Canada products would not be achievable given the present state of knowledge. Emphasis must, however, be placed on describing and quantifying the major elements of quality.

Errors in census data

The accuracy of a statistical estimate is a measure of how much the estimate differs from the correct or 'true' figure. Departures from true figures are known as errors. The term does not imply that anyone has made a mistake; rather, some degree of error is the inevitable result of decisions taken to control the cost of the census. This is an important point, since many kinds of errors can be anticipated and controlled by building special procedures into the census. The more resources put into these procedures, the tighter the control and the lower the degree of error in the data. However, there is a point at which the benefits of a further reduction in error are too minor to justify the expense.

The significance of error to the data user depends very much on the nature of the error, the intended use of the data and the level of detail involved. Some errors occur more or less at random and tend to cancel out when individual responses are aggregated for a sufficiently large group. For example, some people may overestimate their income, while others may underestimate it. If there is no systematic tendency for people to err in either direction, then overestimates by some individuals will more or less offset underestimates by others in the group. The larger the group, the closer the average reported income is likely to be to the true value. On the other hand, if many people forget a source of income, the result will be a general tendency to understate total income. In this case, the average reported income will be lower than the true average. Such systematic errors are a far more serious problem for most users than random errors: the bias they cause in the data persists no matter how large the group, and is very difficult to measure.
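
The contrast between random and systematic error can be illustrated with a small simulation. This is a minimal sketch only: the true income, the size of the random reporting error and the size of the bias are all hypothetical values chosen for illustration, and the function name is not part of any census system.

```python
import random
import statistics

def mean_reported_income(true_income, n, bias=0.0, noise_sd=5000.0, seed=0):
    """Average of n reported incomes, each equal to the true income
    plus a fixed bias plus a random reporting error (all hypothetical)."""
    rng = random.Random(seed)
    reports = [true_income + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return statistics.fmean(reports)

TRUE_INCOME = 40000.0  # hypothetical true average income

for n in (10, 1000, 100000):
    # Random error only: people over- and under-report with equal likelihood.
    unbiased = mean_reported_income(TRUE_INCOME, n)
    # Systematic error: everyone tends to forget about 2,000 of income.
    biased = mean_reported_income(TRUE_INCOME, n, bias=-2000.0)
    print(n, round(unbiased - TRUE_INCOME), round(biased - TRUE_INCOME))
```

As the group grows, the first difference shrinks toward zero, while the second stays close to the assumed bias of -2,000 however large the group becomes.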

Sources of error

Errors can arise from many sources, but can be grouped into a few broad categories: coverage errors, non-response errors, response errors, processing errors and sampling errors.

Coverage errors

The census attempts to count every Canadian resident on Census Day. Census staff make a list of all dwellings in each collection unit and a census questionnaire is delivered to each dwelling, i.e., either mailed out or dropped off by an enumerator. The householder is asked to list all usual residents of the dwelling by following the Step B guidelines on the questionnaire. Mistakes can occur in this task. Census staff may misjudge the location of the collection unit's boundaries and miss certain dwellings. A dwelling may be missed because it is inside what seems to be a single dwelling, or it is located on a road not marked on the collection unit map. The enumerator may fail to drop off a questionnaire at an occupied dwelling because it appears to be unoccupied.

Householders may misunderstand the Step B guidelines and not list all the usual residents of the dwelling; for example, a family member temporarily away from home at school or in a hospital could be left out. A family maintaining two residences could be missed at both because of confusion about where its members should be counted. Such situations could also lead to double-counting or 'overcoverage', which occurs when an individual is listed at two residences. This is less prevalent than 'undercoverage', which occurs when individuals or households are missed.

Non-response errors

Despite best efforts during census data collection, sometimes it is impossible to obtain a complete questionnaire from a household, even though the dwelling was identified as occupied and a questionnaire was delivered. The household members may be away over the entire census period or may refuse to complete the form. In most cases, the questionnaire is returned, but information is missing for some questions or individuals. Census interviewers edit the questionnaires and follow up to obtain missing information. Nevertheless, some non-response is inevitable and, although certain adjustments for missing data can be made during processing, some accuracy is lost as a result.

Response errors

A response may not be entirely accurate. The respondent may have misinterpreted the question or may not know the answer, especially if it is given for an absent household member. Occasionally, a response error may be caused by the enumerator when following up for a missing response, or when recording items such as the structural characteristics of a dwelling.

Processing errors

All questionnaires (paper and electronic) are channelled to the Data Processing Centre. Data from paper questionnaires are captured through optical mark and character recognition, or keyed in. Subsequently, write-ins are coded, automatically or manually, with the assistance of a computer. Data capture and coding mistakes can occur at this stage, despite the quality control methods. Following capture and coding, all the data undergo a series of computer checks to identify missing or inconsistent responses. Responses are created or 'imputed' for missing or unacceptable information, using answers from respondents who share similar characteristics such as age and sex. The computer cannot, of course, impute a correct response every time, but when results are tabulated for sufficiently large geographic areas or subgroups of the population, imputation errors will more or less cancel each other out.
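
The idea of imputing a missing response from a respondent with similar characteristics can be sketched as follows. This is an illustrative simplification, not the procedure used in census processing: the record layout, the field names and the rule of picking a random donor with the same age group and sex are all assumptions made for the example.

```python
import random

def impute_from_donors(records, field, seed=0):
    """Return a copy of `records` in which missing values of `field`
    are filled from a randomly chosen donor record that shares the same
    age group and sex. Records with no matching donor stay missing."""
    rng = random.Random(seed)

    # Build pools of observed values, keyed by the matching characteristics.
    donors = {}
    for r in records:
        if r[field] is not None:
            donors.setdefault((r["age_group"], r["sex"]), []).append(r[field])

    completed = []
    for r in records:
        r = dict(r, imputed=False)  # copy, so the input is left untouched
        if r[field] is None:
            pool = donors.get((r["age_group"], r["sex"]))
            if pool:
                r[field] = rng.choice(pool)
                r["imputed"] = True
        completed.append(r)
    return completed

# Hypothetical records; None marks a missing response.
records = [
    {"age_group": "25-34", "sex": "F", "income": 42000},
    {"age_group": "25-34", "sex": "F", "income": None},
    {"age_group": "65+", "sex": "M", "income": None},
]
for row in impute_from_donors(records, "income"):
    print(row)
```

The imputed value for any one record may well be wrong, which is why the flag is kept: it allows imputation rates to be reported alongside the results.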

Sampling errors

Some census questions are asked of all Canadian residents, but most of the cultural and economic information is obtained from a sample of one in five households. The information collected from these households is 'weighted' to produce estimates for the whole population. The simplest weighting procedure would be to multiply the results for the sampled households by five, since each household in the sample represents five households in the total population, but the actual weighting procedure, though similar in principle, is much more complex.
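
The simplest weighting procedure described above amounts to the following. The sample values are hypothetical, and the fixed weight of five is the simplification mentioned in the text rather than the actual census weighting procedure.

```python
def weighted_estimate(sample_values, weight=5):
    """Estimate a population total by giving every sampled household
    the same weight (five, for a one-in-five sample)."""
    return weight * sum(sample_values)

# Hypothetical sample: number of persons in each sampled household
# who reported a given characteristic.
sample = [1, 0, 2, 1]
print(weighted_estimate(sample))  # 5 * (1 + 0 + 2 + 1) = 20
```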

Naturally, the results of the weighted sample differ somewhat from the results that would have been obtained from the total population. The difference is known as 'sampling error'. The actual sampling error is, of course, unknown, but it is possible to calculate an 'average' value.

If several samples of the same size were selected using a random process, similar to that used in the actual census, the weighted results would tend to vary around the true result for the total population. The 'standard error' is a measure of the average size of this variation. Fortunately, it is not necessary to actually generate a number of samples to estimate the standard error for the census; it can be estimated from the single sample actually taken.
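
As an illustration of estimating a standard error from a single sample, the sketch below uses the textbook formula for a simple random sample drawn without replacement. This is an assumption made for the example: the census sample design and weighting are more complex, and the population size and sample data here are hypothetical.

```python
import math

def estimate_total_with_se(sample_flags, population_size):
    """Estimate how many units in the population have a characteristic,
    and the standard error of that estimate, from one simple random
    sample of 0/1 flags (1 = has the characteristic)."""
    n = len(sample_flags)
    p = sum(sample_flags) / n                 # sample proportion
    total = population_size * p               # weighted estimate of the total
    fpc = 1 - n / population_size             # finite population correction
    se = population_size * math.sqrt(fpc * p * (1 - p) / (n - 1))
    return total, se

# Hypothetical one-in-five sample: 100 of 500 households, 30 with the characteristic.
flags = [1] * 30 + [0] * 70
total, se = estimate_total_with_se(flags, population_size=500)
print(round(total), round(se, 1))
```

The standard error shrinks as the sample grows and falls to zero when every household is sampled, since the finite population correction is then zero.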

Data quality measurement

To allow data users to assess the impact of errors and to improve our own understanding of how and where errors occur, a number of data quality studies have been conducted for recent censuses. For the 2006 Census, special studies examine errors in coverage, sampling and content (i.e., non-response, response and processing).

Coverage errors

Three studies address coverage errors. The first is the Dwelling Classification Survey, in which a sample of dwellings listed by enumerators as 'unoccupied' or 'non-response' is revisited to establish how many of these residences were in fact occupied or unoccupied on Census Day, as well as the number of persons who were living in the occupied dwellings. Estimates are obtained of the total number of households and persons missed due to dwelling misclassification, and the census results are adjusted based on these.

The two remaining studies provide estimates of gross undercoverage and overcoverage, but are not the basis for adjustments of census results. The reverse record check estimates gross undercoverage by selecting a sample of people before the census collection activities, finding all addresses where they might have been enumerated, then checking the census questionnaires corresponding to these addresses to find out whether these people were enumerated in 2006. The sample was selected from 2001 Census returns, from birth and immigration registrations, from registrations of permit holders (student, work or minister [see 'Non-permanent resident' variable previously referred to]) and refugee claimants, and from people identified as missed in the 2001 reverse record check. Based on the data acquired for the selected persons, they are classified as enumerated, out of scope (i.e., died or emigrated prior to Census Day), or missed. This classification leads to estimates of the total number of persons missed during census enumeration.
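
The step from classification to an estimate of persons missed can be sketched as a weighted count. The sketch assumes that each sampled person carries a weight giving the number of people he or she represents. The weights, the labels and the rate calculated here are illustrative assumptions, not the estimator actually used for the reverse record check.

```python
from collections import defaultdict

VALID_CLASSES = {"enumerated", "out_of_scope", "missed"}

def estimate_missed(sample):
    """sample: iterable of (classification, weight) pairs.
    Returns the weighted number of persons missed and the share of
    in-scope persons (enumerated plus missed) who were missed."""
    totals = defaultdict(float)
    for classification, weight in sample:
        if classification not in VALID_CLASSES:
            raise ValueError(f"unknown classification: {classification}")
        totals[classification] += weight
    in_scope = totals["enumerated"] + totals["missed"]
    rate = totals["missed"] / in_scope if in_scope else 0.0
    return totals["missed"], rate

# Hypothetical sample of five persons, each representing 200 people.
sample = [
    ("enumerated", 200.0),
    ("enumerated", 200.0),
    ("missed", 200.0),
    ("out_of_scope", 200.0),
    ("enumerated", 200.0),
]
print(estimate_missed(sample))  # (200.0, 0.25)
```

Persons classified as out of scope are excluded from the denominator because they were not supposed to be enumerated on Census Day.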

The census also includes a study to measure gross overcoverage. The Overcoverage Study attempts to link all persons in the census database against each other by using direct matching and statistical matching techniques; the detected matches are classified into strata and a sample of matches within each stratum is verified against census questionnaire information to determine the frequency of double-counting. Estimates are obtained of the total number of overcovered persons during census enumeration.

The results of this study are used, along with the census population counts and the results of the reverse record check, in the Population Estimates Program.
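
The direct matching step of the Overcoverage Study can be pictured as grouping person records on an exact key and keeping the groups that appear at more than one dwelling. The sketch below is a simplified illustration: the matching key (name, date of birth and sex) and the field names are hypothetical, and statistical matching, stratification and verification against questionnaires are not shown.

```python
from collections import defaultdict

def find_candidate_duplicates(persons):
    """Group person records by an exact-match key and return the keys
    found at more than one dwelling, with the dwellings involved.
    These are candidates only and would still need verification."""
    dwellings_by_key = defaultdict(set)
    for p in persons:
        key = (p["name"].strip().lower(), p["birth_date"], p["sex"])
        dwellings_by_key[key].add(p["dwelling_id"])
    return {
        key: sorted(dwellings)
        for key, dwellings in dwellings_by_key.items()
        if len(dwellings) > 1
    }

# Hypothetical records: the same student is listed at home and at school.
persons = [
    {"name": "Ana Roy", "birth_date": "1987-03-02", "sex": "F", "dwelling_id": "D1"},
    {"name": "ana roy ", "birth_date": "1987-03-02", "sex": "F", "dwelling_id": "D7"},
    {"name": "Luc Roy", "birth_date": "1959-11-20", "sex": "M", "dwelling_id": "D1"},
]
print(find_candidate_duplicates(persons))
```

An exact key of this kind misses duplicates whose details were recorded differently at the two dwellings, which is one reason the study described above also uses statistical matching.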

Coverage error estimates will be available in the 4th quarter of 2009.

Content errors

A number of studies evaluate the quality of the data for each question. Response rates, edit failure rates, and a comparison of estimates before and after imputation are among the data quality measures used. Tabulations from the 2006 Census are also compared with corresponding data from past censuses, from other surveys, and from administrative sources. Detailed cross-tabulations are checked for consistency and accuracy. Some of these checks are conducted prior to the release of census data, in a process known as 'certification'; more detailed studies take longer.

Sampling errors

As mentioned earlier, it is possible to calculate standard errors for sample variables. In addition, studies evaluate sampling and weighting procedures.

Dissemination of data quality information

Census data quality information is disseminated in two ways. All census products include a section on data quality that examines sources of errors and provides cautionary notes for users. In some cases, estimates of the magnitude of errors are given, for example, estimates of sampling error. Information is also published in the 2006 Census Technical Reports series (available in the fall of 2009), which summarizes the results of data quality studies.
