We considered three different datasets that cover both data-rich and data-sparse, as well as low- and high-dimensional situations, thereby providing a broad range of settings for benchmarking covariate distribution models.
NHANES-3/-11. The US National Health and Nutrition Examination Survey (NHANES) contains a collection of demographic, physical, health-related and lifestyle-related variables from a representative sample of the US population [1]. The dataset distributed in the R package NHANES contains a resampled set of \(n = 10{,}000\) individuals from the 2009 to 2012 NHANES surveys, designed to counterbalance the oversampling of certain subpopulations [24]. From this dataset, we excluded individuals aged 80 years or above (for whom exact age is not given) and those with a diastolic blood pressure of 0 (likely a database error). Subsequently, we derived two NHANES analysis datasets by selecting a subset of covariates and removing duplicates and missing values (an R sketch of this derivation follows the dataset list):
NHANES-3 (data-rich low-dimensional setting). 6230 unique measurements of 3 continuous covariates: age, weight and height.
NHANES-11 (medium-sparse high-dimensional setting). 2133 unique measurements of 11 covariates, of which 10 are continuous (age, weight, height, heart rate, systolic blood pressure, diastolic blood pressure, testosterone concentrations, total cholesterol, urine volume, urine flow rate) and one is discrete (sex).
ORGANWT (very sparse high-dimensional setting). A dataset on organ weights of two different mouse strains was reported in [16]. From this dataset, we selected 6 continuous covariates (body weight and the 5 organ weights most relevant for PBPK modelling [12] that are available in the dataset, namely brain, heart, kidneys, liver, spleen) and 2 discrete variables (sex and strain), totalling 145 measurements of 8 covariates (after removing one animal in which brain weight was missing).
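For concreteness, the NHANES-3 derivation could look as follows in R. The column names are those of the NHANES package; the exact filtering is our reading of the description above, not the authors' code.

```r
# Sketch of the NHANES-3 derivation (column names from the NHANES package;
# the exact filtering is an assumption based on the description above)
library(NHANES)
data(NHANES)

# Exclude individuals aged 80+ (top-coded age) and those with DBP of 0
nh <- subset(NHANES, Age < 80 & (is.na(BPDiaAve) | BPDiaAve > 0))

# Select covariates, then remove missing values and duplicates
nhanes3 <- unique(na.omit(nh[, c("Age", "Weight", "Height")]))
```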
The sizes of the datasets are summarised in Table 1.
Table 1 Dimensions of the three datasets

General notation and scope

We consider the observed covariates \(\textbf{x} = (x^{(1)},\ldots ,x^{(n)})\) to be an i.i.d. sample from an (unknown) d-dimensional probability distribution with density p and (cumulative) distribution function F. For example, in dataset NHANES-3, \(x^{(i)}=(x^{(i)}_{1},x^{(i)}_{2},x^{(i)}_{3})\) is age, weight and height of the i-th observed individual (\(d=3\)). The aim of covariate distribution modelling is to estimate a surrogate model q which approximates p, in order to be able to simulate i.i.d. data \(\textbf{y}\) from this surrogate model. Alternatively, a method may directly provide such data \(\textbf{y}\).
Copula models

Here we give a short introduction to copula modelling for continuous distributions. Since the notation required for a general presentation of theory and estimation (especially for vine copulas) is complex, we refer to [5] for further details. Any multivariate density p can be uniquely decomposed into a product of marginals and their dependency structure (known as Sklar’s theorem),
$$\begin{aligned} p(x_1,\ldots ,x_d) = p_1(x_1)\cdot \ldots \cdot p_d(x_d)\cdot c\big (F_1(x_1),\ldots ,F_d(x_d)\big ), \end{aligned}$$
(1)
where \(p_i\) and \(F_i\) are the i-th marginal density and distribution functions, respectively, and where c is a copula density, supported on the unit cube. This decomposition allows the marginals and the dependency structure to be estimated in two separate steps.
First, the marginals are modelled, giving rise to estimated marginal distribution functions \(\hat{F}_1, \ldots , \hat{F}_d\). In this step, standard univariate distribution fitting methods can be used. Here, we employed univariate local polynomial kernel density estimators as implemented in R package kde1d [17]. If the margins are regular enough, parametric methods can be used as well.
Next, using the estimated marginals, the covariates are transformed to uniform scale via
$$\begin{aligned} x \mapsto \hat{F}(x) := \big (\hat{F}_1(x_1),\ldots ,\hat{F}_d(x_d)\big ). \end{aligned}$$
(2)
A copula model can then be estimated on the transformed data \(\textbf{u} = \hat{F}(\textbf{x})\). One possible approach consists of using multivariate copula models such as Gaussian copulas. While this approach is conceptually simple, it assumes the same type of dependency across all variables. A more flexible copula modelling approach consists of combining bivariate copula models in a pair copula decomposition of the multivariate copula. These copula models, termed vine copulas, use a graph-theoretical structure (vine) describing the conditional dependency structure of the pair copula decomposition. Modelling with vine copulas is supported by libraries such as rvinecopulib [18], which automates the selection of both the vine structure and the pair copulas, and allows for efficient simulation.
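To make the two-step procedure concrete, the following minimal R sketch fits kernel marginals with kde1d, transforms to uniform scale as in Eq. (2), fits a vine copula with rvinecopulib, and simulates from the surrogate. The synthetic data and the family_set choice are illustrative assumptions, not the study's exact settings.

```r
# Two-step copula workflow (illustrative sketch on synthetic data)
library(kde1d)        # univariate local polynomial kernel density estimation
library(rvinecopulib) # vine copula fitting and simulation

set.seed(1)
x <- cbind(age    = rnorm(500, 40, 12),
           weight = rnorm(500, 75, 15),
           height = rnorm(500, 170, 10))

# Step 1: fit the marginals and transform to uniform scale (Eq. 2)
marg <- apply(x, 2, kde1d)
u <- sapply(seq_len(ncol(x)), function(j) pkde1d(x[, j], marg[[j]]))

# Step 2: fit a vine copula on the transformed data
vc <- vinecop(u, family_set = "parametric")

# Simulate from the surrogate and back-transform to the original scale
u_new <- rvinecop(1000, vc)
y <- sapply(seq_len(ncol(x)), function(j) qkde1d(u_new[, j], marg[[j]]))
```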
Multiple imputation by chained equations (MICE)

MICE is, at its origin, a method for imputing missing values in a dataset. Also known as conditional distribution modelling, it aims at predicting a missing value given all other (non-missing) observations. Within MICE, different algorithms are available. We used the defaults in the R implementation in package mice, namely predictive mean matching [15] for continuous variables and logistic regression for categorical variables, for which robust performance has been reported [3]. To use this method for the simulation of covariate distributions, we used the method described in [29]: (i) if missing data are present, a single standard MICE step is used to fill the missing data; (ii) then, all observed covariates are again labeled as missing and drawn in turn according to MICE in order to obtain a sample.
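One way to obtain such a sample with the mice package is sketched below: all-missing rows are appended to the observed data and then imputed conditional on it. This appended-rows construction is our illustration in the same spirit as step (ii); for the exact procedure, see [29].

```r
# Simulating covariates via MICE (an illustrative construction; see [29]
# for the exact procedure)
library(mice)

simulate_mice <- function(data, m_sim = 1000) {
  # m_sim all-NA rows, preserving column types (numeric/factor)
  blank <- data[rep(NA_integer_, m_sim), ]
  augmented <- rbind(data, blank)
  # Impute the blank rows conditional on the observed ones
  # (mice defaults: pmm for continuous, logreg for binary variables)
  imp <- mice(augmented, m = 1, printFlag = FALSE)
  complete(imp)[seq(nrow(data) + 1, nrow(augmented)), ]
}
```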
Covariate distribution models

Six different covariate distribution models are considered in this article; they differ in how the marginals are modelled, in the choice of dependency structure, and in the use of parametric or nonparametric elements.
GaussDist. A multivariate Gaussian distribution (with density q) is fitted to the data \(\textbf{x}\) on the original scale. Hence, all marginals are also Gaussian. This method only applies to continuous covariates.
IndepCop. An independence copula is used for the transformed data \(\textbf{u}\). This means that marginals are modelled, but no dependency structure.
GaussCop. A multivariate Gaussian copula is fitted to the transformed data \(\textbf{u}\). This scenario combines GaussDist and IndepCop. It can be considered an optimized version of the “continuous method” by [31], in which the log-transform is replaced by a kernel-based transform.
ParVine. A vine copula is fitted to the transformed data \(\textbf{u}\). For each dependency element within the vine copula structure, the best bivariate copula is selected from a family of parametric copulas.
NonparVine. A vine copula is fitted to the transformed data \(\textbf{u}\). For each dependency element, a nonparametric bivariate copula model is constructed based on a transform kernel approach [19].
MICE. The MICE method is applied to the entire dataset, as described in [29].
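The four copula-based models could be selected in rvinecopulib roughly as follows; the mapping of GaussCop to a vine with all-Gaussian pair copulas is our approximation (such a vine is a Gaussian copula), and 'u' denotes the uniform-scale data from Eq. (2), e.g., as computed in the sketch above.

```r
# Possible rvinecopulib family_set choices for the copula-based models above
# (a sketch; the exact fitting settings of the study may differ)
library(rvinecopulib)

vc_indep <- vinecop(u, family_set = "indep")      # IndepCop
vc_gauss <- vinecop(u, family_set = "gaussian")   # GaussCop (all-Gaussian pairs)
vc_par   <- vinecop(u, family_set = "parametric") # ParVine
vc_np    <- vinecop(u, family_set = "tll")        # NonparVine (transform kernel)
# GaussDist needs no copula machinery: estimate colMeans(x) and cov(x) on the
# original scale, then simulate with MASS::mvrnorm
```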
Kullback–Leibler divergence estimation

For ease of notation, we assume here that all considered covariates are continuous. A combined continuous/discrete framework is described in Appendix C. We assume a surrogate model with density q (i.e., a multivariate Gaussian distribution or a copula model) has been estimated from the data \(\textbf{x}\). The KL divergence between p and q is then defined as
$$\begin{aligned} D_{\text {KL}}(p||q) := \int _{\mathbb {R}^d} \log \left( \frac{p(x)}{q(x)}\right) p(x)\,dx. \end{aligned}$$
(3)
To estimate \(D_{\text {KL}}(p||q)\), we used a nearest neighbour density-based method proposed in [33]. In their approach, \(D_{\text {KL}}(p||q)\) is estimated from two samples, \(\textbf{x}\) from p and \(\textbf{y}\) from q. While \(\textbf{x}\) is already available, a (large) sample \(\textbf{y} = (y^{(1)},\ldots ,y^{(m)})\) is simulated from the surrogate model q. For copula-based models, \(\textbf{y}\) is obtained by transforming a copula sample \(\textbf{v}\) back to the original scale, \(\textbf{y} = \hat{F}^{-1}(\textbf{v})\). Working with a second sample \(\textbf{y}\) rather than directly using the surrogate model density q allows for the use of a two-sample bias reduction method. It also allows a natural comparison to the MICE method, in which a sample \(\textbf{y}\) is obtained directly, without estimating a density first.
The construction of the two-sample bias-reduced nearest-neighbour KL divergence estimator is presented in Appendix A. It is implemented and documented in the R package kldest [8]. A benchmark (for different data dimensions and with uncertainty quantification) demonstrating robust performance of this estimator is presented on the website accompanying the R package kldest [9].
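On toy data, the estimation could be called as below. The function names are taken from the kldest documentation [8]; treat the exact interface as an assumption of this sketch.

```r
# Two-sample nearest-neighbour KL divergence estimation with kldest
# (toy Gaussian data; interface as documented for the package [8])
library(kldest)

set.seed(1)
X <- matrix(rnorm(2000), ncol = 2)               # sample from p (n = 1,000)
Y <- matrix(rnorm(20000, mean = 0.5), ncol = 2)  # large sample from q (m = 10,000)

kld_est_nn(X, Y)    # plain two-sample nearest-neighbour estimator
kld_est_brnn(X, Y)  # bias-reduced variant (Appendix A)
```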
While KL divergence is scale-invariant, KL divergence estimation might differ between data scales. In the main text, all results are shown on the original scale. An investigation of differences between original and uniform scale (where the copula models are fitted) is shown in Fig. 5 (Appendix D).
Uncertainty quantification

Uncertainty quantification of KL divergence estimators has not been addressed in the literature beyond the 1-D discrete case, and in particular, no asymptotic distributions of nearest neighbour-based KL divergence estimators are available. Also, nearest neighbour-based KL divergence estimation cannot deal with samples containing duplicates. This means that standard bootstrapping, which relies on resampling with replacement, cannot be used for uncertainty quantification either. Instead, we opted for using subsampling as described by [22]. A precise definition of the procedure is provided in Appendix B. Briefly, a large number s of subsamples of size \(b \ll n,m\) is drawn without replacement. Subsequently, the distribution of the KL divergence estimator is approximated by the subsample distribution, corrected for differences in sample size. We chose \(b = n^\) and large fixed \(s = 1000\) and \(m = 10{,}000\).
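The procedure can be sketched generically as follows (kldest packages a version of it). The \(\sqrt{n}\) rate used in the size correction is an assumption of this sketch; see Appendix B for the precise construction.

```r
# Generic subsampling CI for a two-sample estimator (minimal sketch;
# the sqrt(n) convergence rate used in the correction is an assumption)
subsample_ci <- function(X, Y, estimator, b, s = 1000, alpha = 0.05) {
  est <- estimator(X, Y)
  # Draw s subsamples of size b without replacement and re-estimate
  stats <- replicate(s, {
    estimator(X[sample(nrow(X), b), , drop = FALSE],
              Y[sample(nrow(Y), b), , drop = FALSE])
  })
  # Approximate the estimator's distribution by the size-corrected
  # subsample distribution (Politis & Romano [22])
  crit <- quantile(sqrt(b) * (stats - est), c(1 - alpha / 2, alpha / 2))
  unname(est - crit / sqrt(nrow(X)))  # c(lower, upper)
}
```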
Simulation framework

All simulations were performed in R version 4.2.2 [25]. The source code for reproducing the results, including the three datasets, is available on Zenodo [10]. The code takes a list of datasets as input and is designed to generalise easily to other datasets.