Statistics Canada

8. Census Overcoverage Study

8.1 Introduction

8.2 Methodology

8.2.1 Step 1: Exact matching with administrative data

8.2.1.1 Administrative data files

8.2.1.2 Using names

8.2.1.3 Exact match

8.2.2 Step 2: Probabilistic match with the RDB

8.2.2.1 Using the Generalized Record Linkage System

8.2.2.2 Manual verification

8.3 Estimation of overcoverage

8.4 Types of overcoverage

8.1 Introduction

Following the 2001 Census of Population, the level of overcoverage due to duplication of individuals was measured by three studies, each one covering a part of the overcoverage: the Automated Match Study (AMS), the Collective Dwelling Study (CDS) and the Reverse Record Check (RRC). The introduction of names to the 2006 Census Response Database provides an opportunity to use name matching to measure overcoverage and therefore estimate overcoverage with a single study, the Census Overcoverage Study (COS). The COS is based on a series of automated exact and probabilistic matching operations and manual work. These matching operations also involve the use of various administrative data files. Therefore, the 2006 RRC measures just undercoverage and the CDS is no longer conducted as collective dwellings are covered by the COS.

8.2 Methodology

The methodology for estimating 2006 overcoverage was based on matching persons, while the Automated Match Study (AMS) (see Note 1) was based on matching households of persons. The 2006 Census Overcoverage Study (COS) took advantage of the fact that the 2006 Census Response Database (RDB) contained respondent names. For the first time, the names were captured and were available for computer processing. It was anticipated that the inclusion of names in measuring overcoverage would maximize the proportion of total overcoverage covered by automated matching methods. Since the RRC no longer measures overcoverage, the new methodology reduces coverage study costs associated with the collection of additional addresses by the RRC for overcoverage measurement. The COS also produces a more precise estimate without geographic restrictions such as those applied to the 2001 AMS. Persons who were living in collective dwellings and hence completed a Form 3A or 3B were in scope for the COS.

In principle, the RDB could have been matched to itself to detect duplicate enumerations. However, on a practical level, and for methodological considerations, the COS was conducted in two steps as outlined below. It should be noted that the RRC version of the 2006 Census Response Database (RRC RDB) was not the same as the database that the COS used, since some records excluded for the RRC did not need to be excluded for the COS.

8.2.1 Step 1: Exact matching with administrative data

The first step was based on exact matching procedures, and involved matching the RDB with a set of administrative data files representing a large portion of the census target population. It was expected that this process would directly identify cases of overcoverage. In particular, RDB records assigned to the same administrative record through 'many-to-one' matches were declared to be cases of overcoverage without further review, since they pointed to the same individual from the administrative data files.

8.2.1.1 Administrative data files

Since there is no single administrative data file covering the entire Canadian Census target population, it was necessary to combine several files, each one covering a different segment of the population, in order to carry out the COS. The aim was to maximize the coverage of the Canadian Census target population while avoiding duplication of individuals among the administrative data files.

The following administrative data files were used:

  • 2005 income tax records, supplemented with additional records for taxation years 2000 to 2004.
  • Birth files for Canadian citizens born between 1985 and 2003.
  • Immigration files for immigrants born outside Canada between 1985 and 2003, to cover children of immigrants not present in the birth files of Canadian citizens born between 1985 and 2003.
  • Immigration files for immigrants who arrived in Canada between 2004 and May 16, 2006 (Census Day), given that they would not be on the income tax file for 2005.
  • Non-permanent residents files.
  • Health care files from the Yukon Territory, the Northwest Territories and Nunavut.

The income tax files for 2000 to 2004 were included with the 2005 income tax file in order to improve the coverage of the Canadian census target population. The personal income tax records accounted for approximately 80% of all administrative records used at the first step. The health care files for the three territories were used to represent all persons living in the territories, whereas the other administrative data files, as listed above, were used to represent persons living in the provinces. As a variety of administrative data files were used, every effort was made to remove duplicates, so that the first step exact match would be effective.

8.2.1.2 Using names

Without the presence of names in the 2006 RDB, the new methodology for measuring person duplication would not have been developed. The names used in the RDB for matching purposes were taken from Step B of the census questionnaire, which contains a list of all reported members of the household. Family name(s) and given name(s) were included in the same 80‑character field. Respondents were asked to list their family name(s) first and then their given name(s). In order to use this field for matching, it was necessary to standardize the names and separate the 80 characters into a family name and a given name.

However, despite the instructions, not all respondents wrote their family names and given names in the correct order. Since this could lead to problems when matching with administrative data files, a strategy to separate names into family name and given name was developed to address this issue.

Family name(s) are separate from given name(s) in the administrative data files. This made it possible to compute the probability that a particular name is either used as a given name or a family name based on frequencies in the Canadian population. The name frequencies were broken down by sex and year of birth. This acknowledged that the use of a name as a given or family name may vary between males and females and over time. The name frequencies were then used to parse each name into the part most likely to be the given name(s) and the part most likely to be the family name(s). The same strategy was applied to the names from the RDB and to the names from the administrative sources. It was important that the name was parsed in the same manner on both files to ensure that the exact match was effective.
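The frequency-based parsing described above can be illustrated with a minimal sketch. The frequency tables, the function names and the scoring rule (a product of per-token probabilities) are all assumptions made for illustration; the actual COS tables were broken down by sex and year of birth, and the exact parsing rule is not given in this report.

```python
from typing import Dict, List, Tuple

# Hypothetical name frequencies (counts of use as given vs. family name).
# The real tables were broken down by sex and year of birth; this toy table is not.
GIVEN_FREQ: Dict[str, int] = {"MARIE": 9000, "JOHN": 8000, "MARTIN": 1500, "TREMBLAY": 5}
FAMILY_FREQ: Dict[str, int] = {"MARIE": 50, "JOHN": 40, "MARTIN": 6000, "TREMBLAY": 7000}


def p_given(token: str) -> float:
    """Estimated probability that a token is used as a given name."""
    g = GIVEN_FREQ.get(token, 0)
    f = FAMILY_FREQ.get(token, 0)
    # Unknown tokens are treated as equally likely to be either (an assumption).
    return 0.5 if g + f == 0 else g / (g + f)


def parse_name(raw: str) -> Tuple[List[str], List[str]]:
    """Split a combined name field into (family, given) token lists.

    Tries every split point in both orders (family-first and given-first)
    and keeps the split with the highest product of per-token probabilities.
    """
    tokens = raw.upper().split()
    best_score, best = -1.0, (tokens, [])
    for cut in range(1, len(tokens)):
        left, right = tokens[:cut], tokens[cut:]
        for family, given in ((left, right), (right, left)):
            score = 1.0
            for t in family:
                score *= 1.0 - p_given(t)
            for t in given:
                score *= p_given(t)
            if score > best_score:
                best_score, best = score, (family, given)
    return best
```

With these toy tables, `parse_name("Tremblay Marie")` and `parse_name("Marie Tremblay")` return the same split, which is the property the exact match relies on: a name written in the wrong order still parses identically on both files.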

8.2.1.3 Exact match

Since the goal of the exact match was to identify each individual rather than to find a number of suggested matches for each record in the RDB, it was necessary to take a very conservative approach and only consider overcovered cases where a high degree of certainty was achieved. The variables used for this process were name, sex and date of birth.

Overcoverage was identified when two or more RDB records matched to the same administrative record. For evaluation purposes, a sample of these overcoverage matches was manually verified, as well as a sample of the one-to-one cases. An adjustment to the estimate of overcoverage, based on the results of the verification sample, was done to account for false matches whereby two or three records had the same administrative record but did not represent the same individual.

A record in the RDB may have been a match for more than one administrative record, and vice‑versa, thus creating a many-to-many match. For example, this can occur when two individuals have the same name and date of birth. When two RDB records matched to two administrative records, it was assumed that this grouping contained two valid one-to-one matches. However, a sample of the two-to-two matches was taken to verify this assumption. Following this review, the two-to-two matches found to be overcoverage were weighted up and added to the total estimate of overcoverage coming from the first step. All other combinations of many‑to-many matches were manually verified and either classified as overcoverage or not. In this way, all of the many-to-many matches were resolved at the first step of the COS.
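As an illustration of how exact-match results can be sorted into the groupings discussed above, the following sketch builds connected groups of RDB and administrative records and labels each group by its shape. The identifiers and the function name are hypothetical, and labelling every group with more than one administrative record as many-to-many is an assumption of this sketch, not a documented COS rule.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def classify_matches(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[Tuple[Set[str], Set[str]]]]:
    """Group (rdb_id, admin_id) exact-match pairs into connected components
    and label each component by its shape."""
    rdb_to_admin = defaultdict(set)
    admin_to_rdb = defaultdict(set)
    for rdb_id, admin_id in pairs:
        rdb_to_admin[rdb_id].add(admin_id)
        admin_to_rdb[admin_id].add(rdb_id)

    result = {"one-to-one": [], "many-to-one": [], "many-to-many": []}
    seen_rdb = set()
    for start in rdb_to_admin:
        if start in seen_rdb:
            continue
        # Walk the bipartite match graph to collect one connected group.
        rdb_group, admin_group = {start}, set()
        frontier = [start]
        while frontier:
            r = frontier.pop()
            for a in rdb_to_admin[r]:
                if a not in admin_group:
                    admin_group.add(a)
                    for r2 in admin_to_rdb[a]:
                        if r2 not in rdb_group:
                            rdb_group.add(r2)
                            frontier.append(r2)
        seen_rdb |= rdb_group
        if len(admin_group) > 1:
            label = "many-to-many"   # needs sampling or manual review
        elif len(rdb_group) > 1:
            label = "many-to-one"    # declared overcoverage
        else:
            label = "one-to-one"
        result[label].append((rdb_group, admin_group))
    return result
```

In this sketch a two-to-two grouping lands in the many-to-many bucket; the report treats such groupings as two one-to-one matches by default and checks that assumption on a sample.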

Note that in the first step, for technical reasons, RDB records for the provinces were matched to provincial administrative records, and RDB records for the territories were matched to the records in the territorial health care files. Hence, cases of overcoverage between the provinces and the territories were missed at Step 1, but they were included in Step 2.

The exact match rate in Step 1 was 66.5%, which means 66.5% of RDB records were involved in a match with an administrative record. Among all the RDB records, we note that:

  • 64.68% of RDB records were part of a one-to-one match
  • 1.76% of RDB records were involved in a many-to-one match (case of overcoverage)
  • just 0.05% of RDB records were part of a many-to-many match
  • 33.52% have not been matched.

A total of 260,708 persons involved in multiple enumerations were identified in Step 1. Estimating the number of persons involved in multiple enumerations was done by assigning a weight to each enumeration. Two-to-one matches identified in Step 1, for example, represent one person who was enumerated twice. In order to estimate the number of persons involved in duplicate enumerations, each RDB record was given a weight of ½. The premise was that the usual residence of the person is equally likely to be that of the first enumeration as that of the second enumeration. Cases of overcoverage whereby the enumerations are in more than one province (interprovincial overcoverage) were of particular interest since each province is assigned an equal portion (½) of the total weight. Table 8.2.1.3 presents the total overcoverage in Step 1 for intraprovincial, intraterritorial, interprovincial and interterritorial pairs.
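The weighting of two-to-one matches can be sketched as follows. The function name and the input layout (one tuple of two province codes per duplicated person) are assumptions for illustration, and the sketch covers only the two-enumeration case described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def overcoverage_by_province(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Tally two-to-one matches by province.

    Each pair is (province of enumeration 1, province of enumeration 2)
    for one person enumerated twice. Each enumeration carries weight 1/2.
    """
    totals = defaultdict(lambda: {"intra": 0.0, "inter": 0.0})
    for prov_a, prov_b in pairs:
        kind = "intra" if prov_a == prov_b else "inter"
        totals[prov_a][kind] += 0.5
        totals[prov_b][kind] += 0.5
    return {p: dict(t) for p, t in totals.items()}
```

A pair enumerated twice in the same province contributes a full unit to that province, while a pair split across two provinces contributes ½ to each, so the provincial figures always sum to the number of duplicated persons.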

A total of 246,982 persons were overcovered within the same province or territory, and 13,726 between provinces or territories, for a total of 260,708 persons overcovered in Step 1. Only 5.3% of total overcoverage in Step 1 was interprovincial or territorial. The highest rates of interprovincial or interterritorial overcoverage were in the Atlantic provinces and Alberta. In percentage terms, interprovincial/interterritorial overcoverage was much smaller in the territories since Step 1 only applied to overcoverage between territories.

At this stage in the process, the RDB was split into two parts. Part A consisted of all RDB records that were matched to at least one administrative record, whether overcovered or not. Part B consisted of all RDB records that were not matched to an administrative record, as well as all territorial records. The latter was done to take into account provincial-territorial matches that were missed in Step 1. A probabilistic match was then done between Part B and the entire RDB to identify cases of overcoverage that were not identified in Step 1.

8.2.2 Step 2: Probabilistic match with the RDB

Step 2 of the COS is a probabilistic record linkage between RDB records that were not matched with an administrative record (Part B), about 10.2 million records, and the complete RDB (Part A + Part B) consisting of about 30.6 million records. Statistics Canada's Generalized Record Linkage System (GRLS) was used for this step.

8.2.2.1 Using the Generalized Record Linkage System

First, the rules governing the probabilistic match were established. Within the framework of GRLS, variables such as first name, last name, sex, date of birth, and some variables related to geography (listed in the next paragraph) were considered during the record linkage. GRLS outputs pairs of individuals, each with an associated weight that indicates the strength of the match. The higher the matching weight, the more likely the pair is a true match and therefore a case of overcoverage.

The Generalized Record Linkage System allows for variations in the spelling of names and variations in the agreement on date of birth. Geography was also considered in the linkage via the PRCDCU field (combination of province code, census division and collection unit), postal code and city (when postal code is missing). All the variables involved in the probabilistic record linkage were subject to one set of rules in a preliminary step, called the selection criteria, and to another set of rules applied for the actual record linkage. Frequency weights for all variables, except for sex (because males and females are in approximately the same proportion in the population), were also used within GRLS. Frequency weights allow for matches on more common values to be weighted less heavily than matches on less common values.

The standard Fellegi-Sunter (1969) approach is implemented in the GRLS. An upper threshold, S2, was established, above which matches were accepted as overcoverage without verification. The threshold S2 was set conservatively so as to minimize the probability of false matches above S2. A lower threshold, S1, below which matches were rejected without further review (i.e., no overcoverage), was also set so as to minimize the number of cases of overcoverage falling below S1.
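The following is a minimal sketch of a Fellegi-Sunter style weight and the three-way decision it supports. The field list, the m- and u-probabilities and the function names are hypothetical, and the sketch uses simple agree/disagree outcomes; it does not reproduce the GRLS rules, which also allow partial agreement on names and dates.

```python
import math
from typing import Dict

# Hypothetical m- and u-probabilities per field:
# m = P(field agrees | same person), u = P(field agrees | different persons).
FIELDS: Dict[str, Dict[str, float]] = {
    "family_name": {"m": 0.95, "u": 0.01},
    "given_name": {"m": 0.90, "u": 0.02},
    "birth_date": {"m": 0.97, "u": 0.001},
    "sex": {"m": 0.99, "u": 0.50},
}


def pair_weight(agreement: Dict[str, bool]) -> float:
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, probs in FIELDS.items():
        m, u = probs["m"], probs["u"]
        if agreement[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total


def classify(weight: float, s1: float, s2: float) -> str:
    """Three-way decision using lower threshold s1 and upper threshold s2."""
    if weight > s2:
        return "overcoverage"          # accepted without verification
    if weight < s1:
        return "no overcoverage"       # rejected without review
    return "manual verification"       # middle zone, subject to sampling
```

Frequency weights would replace the fixed u for a field with a value-specific one, so that agreement on a rare family name adds more weight than agreement on a common one.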

8.2.2.2 Manual verification

Due to time and resource constraints, it was impossible to verify all cases in the middle zone, i.e., pairs whose matching weight was between S1 and S2 (1.1 million pairs). Instead, a sample of these matches was selected.

The sampling method used was systematic sampling with selection probabilities Pi proportional to the size measure θi(1 − θi). Pairs were ordered by province or territory, sex and date of birth. θi is the GRLS matching weight standardized on the interval [0,1]. θi is correlated with the probability of being a true match (i.e., a case of overcoverage). Pairs with θi close to 0 or 1 had the lowest probability of being selected for manual verification. The total sample size was 19,802 pairs.

The standardized GRLS matching weight, θi, was determined as follows:

$$\theta_i = \frac{X_i - C}{D - C}$$

where $X_i$ is the GRLS matching weight for pair $i$, $C = S_1 - 1$ and $D = S_2 + 1$.

By design, S1 and S2 were not meant to be the boundary points of the interval [0,1]. This is why C is equal to S1 − 1 and D is equal to S2 + 1.
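The standardization and the resulting size measure can be sketched in a few lines. The function names are illustrative, and the threshold values used in the usage note are hypothetical since the actual S1 and S2 are not given in this report.

```python
def standardize(x: float, s1: float, s2: float) -> float:
    """theta = (x - C) / (D - C) with C = s1 - 1 and D = s2 + 1."""
    c = s1 - 1.0
    d = s2 + 1.0
    return (x - c) / (d - c)


def size_measure(theta: float) -> float:
    """Size measure theta * (1 - theta); largest at theta = 0.5."""
    return theta * (1.0 - theta)
```

With hypothetical thresholds S1 = 10 and S2 = 28, a pair with weight 10 gets θ = 0.05 and a pair with weight 28 gets θ = 0.95, so pairs sitting exactly on a threshold keep a small but non-zero size measure instead of dropping to zero.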

The sample was selected using the SURVEYSELECT procedure (PROC SURVEYSELECT) of the SAS statistical software. The first-order inclusion probabilities were calculated in SAS. However, due to time and resource constraints, the second-order inclusion probabilities, which were needed to calculate the variance estimates, were not determined. As a result, the variance estimate is only an approximation and overestimates the true variance. Estimation is discussed in more detail in the next section.

The selection probabilities for a sample design with probability proportional to size, using θi(1 − θi) as the size measure, were calculated as follows:

$$P_{ki} = \frac{\theta_{ki}\,(1 - \theta_{ki})}{\sum_{i \in k} \theta_{ki}\,(1 - \theta_{ki})}$$

where k represents the stratum and i represents the pair.

With the methodology outlined in this section, a pair whose weight was in the middle of the interval S1 to S2 had a greater chance of being verified. This is because these were the cases that we were more uncertain about. When the matching weight was close to S1, it was more likely not to be a case of overcoverage. In contrast, when the matching weight was close to S2, it was more likely to be a case of overcoverage. Therefore, there was no need to select a large sample near the end points of the interval to obtain good estimates.
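The selection step can be illustrated with a small systematic probability-proportional-to-size sketch for a single stratum. This is not the SAS procedure used for the COS; the function names are hypothetical and the sketch assumes that no pair has a size measure larger than the sampling interval.

```python
import random
from typing import List, Optional


def selection_probabilities(thetas: List[float]) -> List[float]:
    """P_i = theta_i (1 - theta_i) / sum_j theta_j (1 - theta_j) within one stratum."""
    sizes = [t * (1.0 - t) for t in thetas]
    total = sum(sizes)
    return [s / total for s in sizes]


def systematic_pps(thetas: List[float], n: int, rng: Optional[random.Random] = None) -> List[int]:
    """Draw n pair indices by systematic PPS from an ordered list.

    Assumes no pair has a size larger than the sampling interval.
    """
    rng = rng or random.Random()
    sizes = [t * (1.0 - t) for t in thetas]
    step = sum(sizes) / n                 # sampling interval on the cumulated sizes
    start = rng.random() * step           # random start in [0, step)
    targets = [start + k * step for k in range(n)]
    chosen, cum, j = [], 0.0, 0
    for i, s in enumerate(sizes):
        cum += s
        while j < n and targets[j] < cum:
            chosen.append(i)
            j += 1
    # Guard against floating-point rounding at the very end of the list.
    while j < n:
        chosen.append(len(sizes) - 1)
        j += 1
    return chosen
```

Because the list is walked in its sorted order, sorting pairs by province or territory, sex and date of birth before selection spreads the sample across those characteristics.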

A team of clerks examined information from the RDB to determine whether or not there was overcoverage. When necessary, they referred to census questionnaire images to verify the RDB data. Quality control samples were selected as part of the manual verification process to assess the quality of the coding. When the clerks were unsure about a case, it was referred to experts.

Table 8.2.2.2.1 provides estimates of total Step 2 overcoverage, overcoverage above S2 and overcoverage between S1 and S2 by province and territory. We note that most of the overcoverage comes from between S1 and S2. The total estimate is 235,946, of which 180,523 comes from between the thresholds and 55,423 comes from pairs above S2. The latter are pairs with a matching weight sufficiently high to be declared overcoverage without manual verification. Note that the coefficients of variation (CVs) are all under 10% (except for the Yukon Territory, with 11.54%). There is, of course, no variance associated with the overcoverage found above S2.

Table 8.2.2.2.2 provides the interprovincial, interterritorial, intraprovincial and intraterritorial overcoverage. At the national level, 3.7% of the total overcoverage is interprovincial or interterritorial. In percentage terms, the ratio of inter to intra overcoverage is higher in the territories. This is expected because overcoverage between the provinces and the territories was not measured in Step 1. For the provinces, the highest inter-to-intra ratios are, as in Step 1, in the Atlantic provinces and in Alberta.

8.3 Estimation of overcoverage

In 2006, overcoverage was measured primarily by the Census Overcoverage Study (COS). The total overcoverage estimate comprises individuals overcovered in Step 1 and those deemed overcovered during the probabilistic matching in Step 2. Individuals deemed overcovered in Step 2 whose matching weight was above the upper threshold S2 had a weight of 1. The weight of overcoverage cases identified from the sample between the lower threshold S1 and the upper threshold S2 was determined by the sample design.

To evaluate the COS, the Automated Match Study (AMS) was repeated in 2006. The COS estimates were compared to those of the AMS. The comparison revealed a bias in the COS estimates whereby some pairs identified in the AMS were not found in the COS frames. Since the AMS provided an estimate of overcoverage not included in the COS, the last step in estimating overcoverage was to account for this bias by using the AMS estimates to adjust the COS estimates. This step is discussed at the end of the section. More information on evaluation of the COS is in Section 10.2.

The variance of the estimate of total overcoverage comes primarily from the sample between the thresholds S1 and S2 in Step 2. Another portion is from samples of two-to-two cases in Step 1, and a very small portion is obtained from the samples used to adjust for false matches in Step 1. As with the 2001 AMS, overcoverage observed between two provinces or territories is divided equally between the provinces or territories in question. The same principle applies to other domains of estimation (for example, when the two records of a pair do not belong to the same age group).

Statistics Canada's StatMx software was used to calculate point estimates, as well as the Step 2 sample variance between S1 and S2. As explained in Section 8.2.2.2, the sample was selected with probability proportional to θi(1 − θi), where θi represents the standardized GRLS matching weight defined on [0,1]. Since StatMx cannot produce variances for a PPSWOR (probability proportional to size without replacement) design, a variance estimate for a PPSWR (probability proportional to size with replacement) design was used. Consequently, the variance is overestimated. As explained in Section 8.2.2.2, this approximation comes from not deriving the second-order inclusion probabilities.
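For readers unfamiliar with with-replacement variance estimation, the sketch below shows the textbook Hansen-Hurwitz estimator for a PPSWR design. It is offered as an illustration of the kind of approximation described above, not as the formula implemented in StatMx, which this report does not specify; the function name and inputs are hypothetical.

```python
from typing import List, Tuple


def hansen_hurwitz(y: List[float], p: List[float]) -> Tuple[float, float]:
    """Textbook PPS-with-replacement estimate of a total and its variance.

    y[i] is the value observed for sampled pair i (for example, 1 if the pair
    was verified as overcoverage and 0 otherwise) and p[i] is its single-draw
    selection probability.
    """
    n = len(y)
    z = [yi / pi for yi, pi in zip(y, p)]
    total = sum(z) / n
    variance = sum((zi - total) ** 2 for zi in z) / (n * (n - 1))
    return total, variance
```

The estimator needs only the single-draw probabilities, which is why it can be applied when the second-order inclusion probabilities of the actual without-replacement design are unavailable.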

Table 8.3.1 presents the total overcoverage estimates for Step 1 and Step 2.

Table 8.3.2 provides the total overcoverage estimate broken down into intraprovincial and interprovincial or territorial overcoverage. Some 4.5% of overcoverage is interprovincial or territorial. In Step 1, 5.3% of overcoverage is interprovincial or territorial, while the figure is 3.7% for Step 2. Ontario and Quebec have the least interprovincial overcoverage, while the territories, Atlantic provinces and Alberta have the most.
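The overall interprovincial or territorial share can be cross-checked from figures quoted earlier in this chapter. The sketch below assumes that the Step 1 and Step 2 totals simply add and uses the rounded 3.7% Step 2 share, so it is an approximate consistency check rather than a reproduction of Table 8.3.2.

```python
# Figures quoted in Sections 8.2.1.3 and 8.2.2.2.
step1_total, step1_inter = 260_708, 13_726
step2_total, step2_inter_share = 235_946, 0.037

# Approximate Step 2 interprovincial/territorial count from the rounded share.
step2_inter = step2_total * step2_inter_share

step1_share = step1_inter / step1_total
combined_total = step1_total + step2_total
combined_share = (step1_inter + step2_inter) / combined_total

print(f"Step 1 share:   {100 * step1_share:.1f}%")
print(f"Combined share: {100 * combined_share:.1f}%")
```

Under these assumptions the Step 1 share works out to about 5.3% and the combined share to about 4.5%.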

As described above, comparison of the COS estimates and the AMS estimates revealed a bias in the COS estimates. Consequently, an adjustment was made to the COS estimates using the AMS estimate of the overcoverage not covered by the COS. The adjusted estimates are the final estimates of total population overcoverage that appear in Section 1. Table 8.3.3 presents the overcoverage estimates before and after the AMS adjustment. The biggest increases are in Nunavut (6.11%) and Alberta (5.77%), while the smallest are in Quebec (2.56%) and the Yukon Territory (2.69%).

8.4 Types of overcoverage

In 2006, the possible types of overcoverage were examined for the first time. The most frequent types are described below.

Some 20% of COS overcoverage is from the 'consecutive/quasi consecutive identifier' category. This refers to overcoverage from two identical households with exactly the same address or in very close geographic proximity (and that therefore have similar household identifiers). Two households were considered identical if they contained the same people with the same demographic characteristics.

Another 20.5% of COS overcoverage is from the 'identical households: not consecutive/quasi consecutive' category. These are identical households that are not geographically close. A further 16.9% of COS overcoverage is from the 'child(ren) of parents living in separate households' category.

We then find 12.0% of COS overcoverage in the 'student/young adult who has recently left home' category, and 11.1% in the 'non-identical households: one household is included in another' category (whereby the members of one household can all be found in the other larger household).

Note:

  1. For a detailed description of the AMS methodology, see the 2001 Technical Report on Coverage Studies.
  2. Some of these weights were adjusted for false matches, as mentioned earlier in this section.
