Chapter 4 – Data quality practices
Table of contents
- Data quality measures
- Other methods of data quality suppression
- Calculation of order statistics
- Data quality rule for disseminating data for population aged 100 and over
- Data quality rule for disseminating data on same-sex and opposite-sex couples
The following section describes the methods used to restrict the dissemination of NHS data of unacceptable quality.
Data quality measures
Data quality indicators for tabulations based on place of residence geographies
Data quality indicators
Data quality indicators (commonly referred to as data quality flags) are attached to each place of residence standard geographic area disseminated. In the NHS database environments, each data quality indicator is a five-digit numeric field. On the database and in electronic products browsed via Beyond 20/20, these flags are displayed as a five-digit numeric code (example: 1 0 0 1 0). On the NHS website, partially enumerated areas are flagged to end users with symbols. The specific symbols used for the 2011 NHS are documented in Section Data quality and confidentiality table symbols.
Incompletely enumerated areas
In 2011, there were a total of 36 Indian reserves and Indian settlements that were 'incompletely enumerated' in the NHS. For these reserves or settlements, NHS enumeration was either not permitted or was interrupted before it could be completed, or was not possible because of natural events (specifically forest fires in Northern Ontario).
There are no data for incompletely enumerated Indian reserves and settlements on the NHS database. Higher-level geographic areas containing these areas are identified in the NHS products.
Although NHS data are not available for incompletely enumerated Indian reserves and settlements, the areas themselves are included as part of the standard geographic hierarchies on the NHS databases. Retrieval and tabulation software will retrieve these areas but with no data. For place of work geographies, these areas will be suppressed.
Partially enumerated areas
Any geographic area that contains an incompletely enumerated area is considered a partially enumerated area. Partially enumerated areas are flagged to end users as containing incompletely enumerated areas.
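The containment rule above can be sketched in a few lines. This is only an illustration: the function name, the `children` mapping and the area names are hypothetical, and the sketch assumes the geographic hierarchy is a tree (no cycles). The returned values follow the first digit of the data quality indicator described later in this section (0 = default, 1 = incompletely enumerated, 2 = contains an incompletely enumerated area).

```python
from typing import Dict, List, Set


def enumeration_flag(area: str,
                     children: Dict[str, List[str]],
                     incomplete: Set[str]) -> int:
    """Return 1 if the area is incompletely enumerated, 2 if it contains
    such an area at any lower level (partially enumerated), else 0."""
    if area in incomplete:
        return 1
    # Walk every lower-level area contained in this one.
    stack = list(children.get(area, []))
    while stack:
        node = stack.pop()
        if node in incomplete:
            return 2
        stack.extend(children.get(node, []))
    return 0


# Hypothetical hierarchy, for illustration only.
children = {
    "Province A": ["Division 1", "Division 2"],
    "Division 1": ["Reserve X"],
}
incomplete = {"Reserve X"}

for name in ["Reserve X", "Division 1", "Province A", "Division 2"]:
    print(name, enumeration_flag(name, children, incomplete))
```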
Global non-response rates
The global non-response rate (GNR) is an indicator of data quality that combines complete non-response and partial non-response to the survey. A smaller GNR indicates a lower risk of non-response bias, that is, a lower risk of inaccuracy. Global non-response rates are determined for each of the NHS geographic areas, and these areas are flagged on the database according to their non-response rate. Geographic areas with a global non-response rate higher than or equal to 50% are suppressed from standard data products, but their data are available as a custom request. Geographic areas with a global non-response rate lower than 50% are identified in tabulations, but not suppressed. In electronic products, both a numeric flag and the actual global non-response rate are provided.
|Digit position|Flag|Value|Description|
|1st (0XXXX)|Incomplete enumeration flag|0|Default|
| | |1|Incompletely enumerated Indian reserve or Indian settlement (suppressed)|
| | |2|Excludes National Household Survey data for one or more incompletely enumerated Indian reserves or Indian settlements|
|2nd (X0XXX)|Not applicable|0|Default|
|3rd (XX0XX)|Not applicable|0|Default|
|4th (XXX0X)|Data quality flag|0|Data quality index showing a global non-response rate lower than 50%|
| | |1|Data quality index showing a global non-response rate higher than or equal to 50% (suppressed)|
|5th (XXXX0)|Not applicable|0|Default|
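To make the layout of the five-digit code concrete, the following sketch decodes a flag such as `1 0 0 1 0` using the digit meanings from the table above. The function and dictionary names are hypothetical and are not part of any NHS product; only the digit positions and values come from the table.

```python
# Digit meanings taken from the table above; all names are illustrative.
INCOMPLETE_ENUMERATION = {
    "0": "default",
    "1": "incompletely enumerated Indian reserve or Indian settlement (suppressed)",
    "2": "excludes NHS data for one or more incompletely enumerated areas",
}
DATA_QUALITY = {
    "0": "global non-response rate lower than 50%",
    "1": "global non-response rate higher than or equal to 50% (suppressed)",
}


def decode_flag(code: str) -> dict:
    """Decode a five-digit data quality indicator, with or without spaces."""
    digits = code.replace(" ", "")
    if len(digits) != 5 or not digits.isdigit():
        raise ValueError(f"expected five digits, got {code!r}")
    first, fourth = digits[0], digits[3]
    if first not in INCOMPLETE_ENUMERATION or fourth not in DATA_QUALITY:
        raise ValueError(f"unexpected flag value in {code!r}")
    return {
        "incomplete_enumeration": INCOMPLETE_ENUMERATION[first],
        "data_quality": DATA_QUALITY[fourth],
        # Value 1 in either the 1st or the 4th position marks suppression.
        "suppressed": first == "1" or fourth == "1",
    }


print(decode_flag("1 0 0 1 0"))
print(decode_flag("20000"))
```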
Data quality indicators for tabulations based on place of work geographies
As indicated in Section Global non-response rates, global non-response rates (GNRs) are determined for each of the NHS geographic areas. Place of work (POW) geographic areas therefore have their own global non-response rates. POW GNRs are based on the population aged 15 years and over who worked at any time between January 2010 and May 2011 at a usual place of work or at home located in the specific place of work geographic area, whereas place of residence (POR) GNRs are based on the population residing in the area. Consequently, a place of work geographic area might have a different global non-response rate than its equivalent place of residence geographic area. For example, the global non-response rate for the place of work census subdivision of Toronto might not be the same as the global non-response rate for the place of residence census subdivision of Toronto.
POW GNRs, like POR GNRs, are estimates rather than absolute metrics, and both are subject to variability. However, POW GNRs are more variable than POR GNRs, for the same reason that POW population estimates are more variable than POR population estimates: POR estimates are calibrated to known populations enumerated through the census, whereas POW estimates are not calibrated to a known POW population.
As is the case for place of residence geographic areas, data for place of work geographic areas with a global non-response rate of 50% or above will be suppressed in standard products, but will be available as a custom request. However, it is important to note that in standard products, data might be available for a place of residence geographic area (if its global non-response rate is below 50%) but not for the equivalent place of work geographic area (if that area has a global non-response rate of 50% or above), and vice versa.
The data quality indicator for place of work uses the 4th digit of the five-digit numeric code.
|Digit position|Flag|Value|Description|
|4th (XXX0X)|Data quality flag|0|Data quality index showing a global non-response rate lower than 50%|
| | |1|Data quality index showing a global non-response rate higher than or equal to 50% (suppressed)|
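Because the 50% threshold is applied separately to place of residence and place of work GNRs, the same area can be available in one set of standard products and suppressed in the other. A minimal sketch follows; the function names and the GNR values are hypothetical, and only the threshold and its "higher than or equal to" comparison come from the text.

```python
SUPPRESSION_THRESHOLD = 50.0  # percent, as stated in this section


def gnr_flag(gnr_percent: float) -> int:
    """Return the 4th-digit data quality flag: 1 if suppressed, else 0."""
    return 1 if gnr_percent >= SUPPRESSION_THRESHOLD else 0


def standard_product_availability(por_gnr: float, pow_gnr: float) -> dict:
    """Report whether POR and POW data appear in standard products."""
    return {
        "place_of_residence": gnr_flag(por_gnr) == 0,
        "place_of_work": gnr_flag(pow_gnr) == 0,
    }


# Hypothetical rates for one area: POR data available, POW data suppressed.
print(standard_product_availability(por_gnr=42.3, pow_gnr=50.0))
```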
Other methods of data quality suppression
The methods of suppression mentioned to this point provide sufficient data quality suppression and identification for most NHS data products. However, in some products, the specifying area or production area may require that additional data quality suppression be performed. Examples of additional suppression could include increasing population thresholds or applying distribution or cell suppression. These are typically product-specific requirements and therefore are not part of the automated suppression systems. In all cases, some form of manual process is required.
The most common example of other methods of data quality suppression is distribution suppression. This occurs in selected standard income products, where an income distribution is suppressed when the total number of units (persons, families, households) within the distribution is less than 250. A variation of this procedure is applied to standard income products that feature only the number, median and average statistics for employment income or total income.
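The distribution suppression rule can be illustrated as follows. The text notes that this suppression involves a manual, product-specific process, so the function below is only a sketch of the rule itself; its name, the income band labels and the counts are hypothetical.

```python
from typing import Dict, Optional

MINIMUM_UNITS = 250  # threshold stated in the text


def suppress_income_distribution(
    distribution: Dict[str, int],
    minimum_units: int = MINIMUM_UNITS,
) -> Optional[Dict[str, int]]:
    """Return the distribution unchanged, or None when the total number of
    units (persons, families or households) is below the threshold."""
    total_units = sum(distribution.values())
    if total_units < minimum_units:
        return None  # the whole distribution is suppressed
    return dict(distribution)


# Hypothetical counts by income band.
small_area = {"Under $20,000": 60, "$20,000 to $59,999": 120, "$60,000 and over": 40}
large_area = {"Under $20,000": 90, "$20,000 to $59,999": 120, "$60,000 and over": 40}

print(suppress_income_distribution(small_area))  # total 220: suppressed
print(suppress_income_distribution(large_area))  # total 250: kept
```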
Calculation of order statistics
Medians, and more generally quantiles, are calculated using linear interpolation. The quantile interval (that is, the interval in which the value of the quantile is located) is determined using one of two methods, depending on the kind of values the statistical variable takes:
Variables that take values with decimals and any variables with dollar values
The quantile interval is constructed to ensure that relative errors made by using the linear interpolation are less than 0.78%. For example, if the true quantile is $30,000.00, the error made by using the built-in algorithm is less than $234.00.
Variables that take integer values that are not dollars
For these variables, the quantile interval is always of size 1. For example, if the true quantile is 23.46, the interpolation is applied to the interval [23, 24].
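For integer-valued variables, interpolation over an interval of size 1 can be sketched as below. The text does not spell out the exact formula, so this uses the textbook grouped-data interpolation as an assumption: the units observed at an integer value k are treated as spread evenly over [k, k + 1]. The function name and the counts are hypothetical; the counts were chosen so that the result falls in the interval [23, 24], as in the example above.

```python
from typing import Dict


def interpolated_quantile(counts: Dict[int, int], p: float) -> float:
    """Estimate the p-th quantile of an integer-valued variable by linear
    interpolation inside the size-1 interval that contains it.

    Assumption (not from the source): units observed at value k are
    spread uniformly over [k, k + 1].
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one unit")
    target = p * total  # cumulative count at which the quantile is reached
    cumulative = 0
    for value in sorted(counts):
        n = counts[value]
        if n > 0 and cumulative + n >= target:
            # Share of the interval [value, value + 1] needed to reach target.
            return value + (target - cumulative) / n
        cumulative += n
    raise RuntimeError("quantile interval not found")


# Hypothetical counts: the median lands in [23, 24], at about 23.46.
counts = {22: 27, 23: 50, 24: 23}
print(interpolated_quantile(counts, 0.5))
```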
Data quality rule for disseminating data for population aged 100 and over
Data for the population aged 100 years and over cannot be disseminated in single years of age. In standard data products, the population aged 100 years and over is grouped together. For custom requests that require a more detailed breakdown, the most detailed age breakdown that can be provided is as follows, and it can only be provided for 'Canada':
Total population 100 years and over
100 years to 104 years
105 years to 109 years
110 years and over
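As an illustration of the breakdown above, a single year of age can be mapped to the most detailed group that may be released. The function name is hypothetical, and the sketch does not enforce the restriction to 'Canada'.

```python
def centenarian_age_group(age: int) -> str:
    """Map an age of 100 or more to the most detailed releasable group."""
    if age < 100:
        raise ValueError("this rule covers ages 100 and over only")
    if age <= 104:
        return "100 years to 104 years"
    if age <= 109:
        return "105 years to 109 years"
    return "110 years and over"


for age in (100, 104, 105, 112):
    print(age, centenarian_age_group(age))
```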
Data quality rule for disseminating data on same-sex and opposite-sex couples
The questionnaires of the 2011 Census of Population and the 2011 National Household Survey introduced for the first time a specific response on household relationships to determine the number of same-sex married couples. Analysis of the data on same-sex married couples has shown that there may be an overestimation of this family type and marital status. The 2011 National Household Survey shows a total of 63,920 same-sex couples in Canada, of which 20,280 are married couples. The range of overestimation of both these estimates, at the national level, is between 0 and 3,800.
For levels of geography such as Canada, provinces, territories and census metropolitan areas (CMAs), estimates are generally higher, so the potential overestimation is expected to be small in relative terms; however, the data should still be interpreted with caution.
At lower levels of geography, the same potential overestimation could be relatively large, and not only should the data be interpreted with caution, but certain suppression rules restrict their publication. These rules apply to both the 2011 Census and the 2011 National Household Survey.
First, the breakdown of same-sex couples or opposite-sex couples by conjugal status, that is, whether they are married or living common law, cannot be disseminated for geographic areas other than Canada, provinces, territories and CMAs.
Second, data cannot be disseminated that identify either same-sex or opposite-sex couples (in total, married or living common law) of any area with a population of less than 5,000 (as measured in the 2011 NHS).
- All data may be disseminated for same-sex or opposite-sex couples for Canada, provinces, territories and census metropolitan areas (CMAs), although they should still be interpreted with caution.
- Data on same-sex couples and opposite-sex couples may be disseminated for other geographic areas if they have a population of 5,000 or more, provided that the breakdown by conjugal status (married, living common law) is not included.
- No data may be disseminated that identify any same-sex or opposite-sex couples for areas of population less than 5,000.
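The three rules summarized above can be expressed as a small decision function. This is one reading of the rules, not an official implementation: the function name, the level labels and the return values are hypothetical, and the sketch applies the population threshold first because the second rule refers to "any area".

```python
# Geography levels for which the conjugal-status breakdown may be released.
# The labels are illustrative.
HIGH_LEVEL_GEOGRAPHIES = {"Canada", "province", "territory", "CMA"}
POPULATION_THRESHOLD = 5000  # population as measured in the 2011 NHS


def couple_data_release(geography_level: str, population: int) -> str:
    """Return what may be disseminated about same-sex or opposite-sex couples:
    'none', 'totals_only' (no married / common-law breakdown) or 'all'."""
    if population < POPULATION_THRESHOLD:
        return "none"
    if geography_level in HIGH_LEVEL_GEOGRAPHIES:
        return "all"
    return "totals_only"


print(couple_data_release("CMA", 1_200_000))              # all
print(couple_data_release("census subdivision", 12_000))  # totals_only
print(couple_data_release("census subdivision", 4_999))   # none
```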