The standard small-area disease mapping methods used to assess spatial distributions of disease and/or mortality risks do not account for potential bias and uncertainty associated with population-at-risk estimates. As such, resulting small-area risk estimates may be inaccurate. The generic disease mapping framework consists of a Poisson regression of the observed incidence counts adjusting for local covariate values and the size of the population at risk (referred to as the offset), which is often treated as fixed and known (Waller and Gotway, 2004, Wakefield, 2007). In the United States, population-at-risk values are commonly derived from U.S. Census Bureau data products (the decennial census, the Population Estimates Program, and the American Community Survey) that report small-area population counts (United States Census Bureau, 2012, U.S. Census Bureau, 2018, Population Estimation Program, U.S. Census Bureau, 2019). Importantly, recent spatial mapping innovations have produced alternative demographic data sources, such as WorldPop, which provide high-spatial-resolution data on human population distributions to address current limitations of national censuses and health surveillance systems (WorldPop, 2020). Although each source reports population data for the same set of small areas, there are important distinctions in data collection and processing methodologies and in data availability, which yield population-at-risk estimates subject to varying types and degrees of error (Peterson et al., 2024, Nethery et al., 2021). For accurate small-area disease/mortality estimation, it is critical to: (1) develop a robust method to propagate the uncertainty associated with population-at-risk (offset) estimates, and (2) assess the bias introduced when offset-related uncertainty is ignored across different denominator data sources.
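In generic notation (illustrative symbols, not drawn verbatim from the cited works), the standard framework described above can be sketched as:

```latex
% Standard small-area disease mapping model (illustrative notation).
\begin{align*}
  y_i \mid \theta_i &\sim \mathrm{Poisson}(n_i \,\theta_i), \\
  \log \theta_i &= \beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta} + u_i,
\end{align*}
```

where $y_i$ is the observed count in area $i$, $n_i$ is the population at risk (the offset, conventionally treated as fixed and known), $\theta_i$ is the relative risk, $\mathbf{x}_i$ collects local covariates, and $u_i$ is a spatial random effect.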
In the United States, annual small-area (county- and census-tract-level) population counts are published by the United States Census Bureau (USCB) in the form of the decennial census, intercensal population estimates from the Population Estimates Program (PEP), and the American Community Survey (ACS) multi-year estimates (United States Census Bureau, 2012, U.S. Census Bureau, 2018, Population Estimation Program, U.S. Census Bureau, 2019). The decennial census is a comprehensive cross-sectional survey mandated every 10 years to count the entire U.S. population, which is accomplished through multiple modes of collection. Census counts do not suffer from sampling error, but do suffer from forms of non-sampling measurement error (i.e., duplications, erroneous enumerations, and omissions) (United States Census Bureau, 2012, Starsinic and Albright, 2001, U.S. Census Bureau, 2004, U.S. Census Bureau: Measures of Nonsampling Error, 2015). PEP-reported intercensal population estimates are derived from a cohort component model, which uses the last decennial census as a base population and projects population estimates forward using births, deaths, and net migration (Population Estimation Program, U.S. Census Bureau, 2019, Preston et al., 2000). As such, PEP-reported population estimates suffer from unknown non-sampling errors, including census-related errors and errors associated with the birth, death, and migration data. The USCB formally recommends the use of PEP or decennial counts as population estimates; however, PEP-reported population counts are not available for geographies smaller than the county, i.e., census tracts and block groups. The ACS is a complex rolling sample survey conducted annually, which collects approximately 3.5 million independent samples nationally (approximately 2.5% of the population) for each year within a 5-year time interval. The ACS reports small-area (county, census tract, and block group) population counts using data sampled over the 5-year interval, referred to as multi-year estimates.
Additionally, the ACS reports associated margins of error, which quantify the uncertainty (variability) due to sampling error across multiple years (U.S. Census Bureau, 2014a, U.S. Census Bureau, 2014b, U.S. Census Bureau, 2009, U.S. Census Bureau, 2018). For detailed information on the source-specific types and degrees of error across USCB data products, refer to Peterson et al. (2024).
Private companies and academic groups have begun to produce high-resolution gridded population estimates based on machine learning (ML) models that often combine census, remote sensing, land use, and other information to estimate population counts at smaller geographies in near real time. One of the most popular products of this nature is WorldPop (WP), which utilizes an open-source algorithm and provides yearly global high-resolution gridded population estimates. Advantages of WP include its near-real-time availability (estimates are available for the current year) and high spatial resolution (WorldPop, 2020, Nethery et al., 2021). Briefly, WP incorporates a combination of available remotely sensed and geospatial datasets (e.g., settlement locations, settlement extents, land cover, roads, building maps, health facility locations, satellite nightlights, vegetation, topography, and refugee camps) within a random forest model to generate gridded predictions of population density at ∼100 m spatial resolution across the globe. Gridded estimates are weighted and aggregated to produce small-area to regional-level estimates of population size and other demographic indicators (WorldPop, 2020, Stevens et al., 2015, Tatem et al., 2013). High-spatial-resolution population estimates derived from ML algorithms address some of the limitations of official population statistics; however, they do not adhere to the same validation and control measures as official population counts reported by the USCB. Additionally, these algorithms are often trained on census-reported population counts and therefore inherit the biases present in those counts. These notable and important differences in data collection, availability, and validation methodologies highlight the need to develop and assess data-source-specific mechanisms for propagating offset-related error within a disease mapping model.
There is a rich literature on measurement error methods to account for error in covariates using both classical and Berkson error models (Gustafson, 2004, Carroll et al., 2006). Carroll et al. (2006) highlighted the use of measurement error corrections in the context of Bayesian epidemiological studies. However, relatively little of this literature addresses measurement error in the context of spatial data (Huque et al., 2016, Huque et al., 2014, Li et al., 2009, Zhang et al., 2021a, Josey et al., 2023), and there is an absence of literature addressing measurement error associated with population-at-risk values. Li et al. (2009) examined the effects of covariate measurement error in spatial linear mixed models. Their findings showed that using error-prone covariates attenuates the estimated regression coefficients while inflating the variance components. The study emphasized that ignoring measurement error can result in biased parameter estimates and underscored the importance of adjusting for such errors in spatial modeling. Zhang et al. (2021a) assessed small-area risk estimates of fatal car crashes among teen drivers, accounting for offset errors by modeling a proxy denominator with a Bayesian Berkson error model. The authors compared models with and without measurement error adjustments. While the choice of model had little impact on the regression coefficients, the adjustments notably affected the intercept and time effect, attenuating the time effect toward the null. Best and Wakefield (1999) explored the impact of inaccuracies in population counts and case registration on cancer mapping studies in the United Kingdom. The study proposed model approaches to account for: (1) case under-ascertainment due to imperfect collection procedures used by disease registries, and (2) under-enumeration in denominator counts based on population data subject to error.
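As a point of reference, the two error mechanisms differ in the direction of conditioning. Writing $X$ for the true quantity, $W$ for its observed surrogate, and $U$ for a mean-zero error term independent of the conditioning variable, the standard formulations (e.g., Carroll et al., 2006) are:

```latex
% Classical vs. Berkson measurement error (standard formulations).
\begin{align*}
  \text{Classical:} \quad & W = X + U, \qquad U \perp X
    \quad (\text{observed} = \text{truth} + \text{error}), \\
  \text{Berkson:}   \quad & X = W + U, \qquad U \perp W
    \quad (\text{truth} = \text{observed} + \text{error}).
\end{align*}
```

The distinction matters because the two mechanisms induce different biases (e.g., classical covariate error attenuates regression coefficients) and call for different correction strategies.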
To incorporate population-at-risk uncertainty, population proportions were modeled using a multinomial-Dirichlet approach, and the authors assessed the sensitivity of results to these errors using both simulation studies and UK breast cancer data. Nethery et al. (2021) investigated the impact of using imperfect and temporally mismatched population estimates from the ACS and WP to generate real-time disease incidence rates for small areas in epidemiological research. The results showed that such estimates can introduce substantial bias, especially for race- or age-stratified populations, and particularly when the estimates deviate from the “ground truth” and are temporally misaligned. To our knowledge, no study has addressed how to incorporate offset uncertainty within a larger disease mapping model across multiple types of denominator data sources suffering from varying degrees and types of error.
In this paper we compare two approaches to propagating offset-related uncertainty within a Bayesian hierarchical disease mapping framework. The first approach, referred to as the Bayesian Spatial Berkson Error (BSBE) model, fuses the modified Besag-York-Mollié (BYM2) disease mapping model with Berkson measurement error methods. This approach assumes the true population at risk (offset) is unknown and is derived as a function of the reported population size plus associated error (Carroll et al., 2006, Gustafson, 2004, Huque et al., 2016, Huque et al., 2014, Besag et al., 1991, Riebler et al., 2016). In the second approach, referred to as the Bayesian Spatial Classical Error (BSCE) model, we extend the standard disease mapping model to include an additional data model (likelihood function) for the observed log-transformed population at risk using a classical error approach (Carroll et al., 2006). In contrast to the BSBE approach, this approach assumes the observed population at risk is derived as a function of the true population plus associated error. We compare the BSBE and BSCE approaches to a naive no-error (NE) method, which treats offset values as fixed. Additionally, we compare model fits across these three error models and the three data sources (PEP, ACS, and WP) to illustrate how different error mechanisms affect the bias and uncertainty of relative risk estimates. For PEP intercensal and WP population counts, no information on offset-related errors is available, so we incorporate offset uncertainty using a model-based, spatially structured error model. In contrast, ACS estimates provide direct information on the degree of sampling error associated with population counts, but do not report non-sampling errors (Population Estimation Program, U.S. Census Bureau, 2019, WorldPop, 2020, U.S. Census Bureau, 2018).
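Schematically, and in illustrative notation rather than the paper's exact specification (which appears in Section 2.2), the two approaches place the error term on opposite sides of the log-offset. Writing $n_i^{\mathrm{obs}}$ for the reported population and $n_i$ for the true (latent) population in area $i$:

```latex
% Illustrative sketch only; exact priors and error structures are
% those given in Section 2.2.
\begin{align*}
  \text{BSBE (Berkson):} \quad
    & \log n_i = \log n_i^{\mathrm{obs}} + \delta_i,
      \qquad \delta_i \sim \text{spatially structured error model}, \\
  \text{BSCE (classical):} \quad
    & \log n_i^{\mathrm{obs}} \sim
      \mathcal{N}\!\left(\log n_i,\ \sigma_i^{2}\right),
      \qquad \text{with a prior on } \log n_i,
\end{align*}
```

with the Poisson disease model then using the resulting $n_i$ as the offset in place of a fixed, known value. For ACS denominators, the reported margins of error can inform $\sigma_i^{2}$; for PEP and WP, the error scale must be modeled.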
We compare model results across the different denominators using both: (1) simulation studies to compare model performance between the BSBE, BSCE, and NE error models, and (2) an application of the three approaches to obtain age-stratified 2020 county-level estimates of opioid-related mortality risks and associated uncertainties for the 159 counties in the state of Georgia (GA). Opioid-related drug overdose mortality increased 4-fold between 1999 and 2017 (Abdalla and Galea, 2022). Additionally, the burden of opioid-related mortality is not equally shared across socio-demographic, economic, and geographic characteristics (Kline et al., 2021, Hepler et al., 2021). Accurate estimation of small-area risks of opioid mortality is essential to properly target harm reduction resources in effective and reliable ways. Previous studies have assessed small-area opioid-related mortality rates accounting for variability across space and time and for the characteristics and environments of people who use opioids (Sumetsky et al., 2021, Kline et al., 2021, Hepler et al., 2021, Kline and Hepler, 2021b, Rossen et al., 2014). A limitation of the current research in opioid mortality estimation is the lack of accounting for errors associated with denominator estimates. We illustrate the application of the three error models to obtain small-area (county-level) opioid mortality trends in GA accounting for data-source-specific uncertainty, highlighting notable differences in small-area results across error models and denominator sources.
The paper is organized as follows: Section 2.1 describes population-at-risk uncertainty across the different denominator sources (PEP, ACS, and WP). Section 2.2 summarizes the BSBE and BSCE approaches to incorporating population-at-risk uncertainty within a disease mapping model. Section 2.3 outlines the process used to assess the impact of offset uncertainty on small-area risk estimates using simulation studies. Section 2.4 describes the application to obtain age-stratified small-area opioid-related mortality risk estimates. Section 3 presents findings from the simulation and case studies. Lastly, Section 4 summarizes our findings and discusses the limitations and implications of our study.