N-of-one differential gene expression without control samples using a deep generative model

The modelArchitecture and hyperparameters

The full model consists of a Gaussian mixture model as the parametrized distribution over latent space and a decoder as presented in [20]. We performed hyperparameter optimization on the latent dimension, number of hidden layers in the decoder, and number of Gaussian components. Representations were explored with dimensions between 10 and 100. We further tested various decoder architectures with two to four hidden layers of 200 to 10,000 hidden units. For the Gaussian mixture model, we tried different distribution complexities with 20 to 50 components. Each representation of a sample received a 50-dimensional vector initialized with zero. The decoder consists of a 50-dimensional input layer fed with the representations, two hidden layers, and an output layer with units corresponding to each of the genes in the data. The two hidden layers are of size 500 and 8000, respectively, immediately followed by a ReLU activation [31, 32]. The output values are transformed into negative binomial distributions over expression counts for each gene. The negative binomial is given by:

$$NB\;(k;\;m,\;r)\;=\;\frac)}^r\;)}^k$$

where k is the library size, m is the mean, and r the dispersion parameter. The decoder outputs are passed through ReLU activation and scaled with the sample gene expression mean and used as the mean (m) of the distributions. The trainable dispersion parameters (r) of the negative binomials are gene-specific and initialized with 2. The GMM consists of 45 mixture components. The priors are a mollified uniform distribution (the “softball” prior) with a spread of 7 and a sharpness of 10 for the means, a Gaussian with a mean of 1 (corresponds to a standard deviation of 0.1), and a standard deviation of 1 for the negative logarithmic diagonal covariance and a Dirichlet distribution with all alphas equal to five. For more details, please refer to [20].

Training

The model was trained for 200 epochs with a batch size of 256. The optimizer of choice was Adam [33] without weight decay and betas 0.5 and 0.9. Because the representations are updated every epoch, the decoder, representation, and GMM received distinct optimizer instances with learning rates 10−4, 0.01, and 0.01, respectively.

Evaluation

Representations for single test samples are optimized as described in [20] with all model parameters fixed. For a new datapoint, representations were initialized from the component means. This results in 45 representations per sample. These were optimized with the fixed model for 10 epochs, after which the best representation is selected for each sample. To perform differential expression analysis, we further optimized the selected representation for another 50 epochs.

Differential expression

We used the DGD to find differentially expressed genes for tumor samples. An optimal representation is found for each tumor sample as described above. These representations are interpreted as the closest-normal for each tumor—an in silico normal. In our setup, we want to test if counts for a given gene are significantly different between the tumor sample and the normal tissue output of the neural network. Let \(NB\;(_i,r_i)\) be the negative binomial distribution for gene \(i\) and \(_\) be the actual count in the tumor sample. We test the hypothesis that the observation from the tumor is from the in silico normal. Therefore, the p value for \(_\) is the sum of all counts with a probability lower than that of \(_\):

$$p\;\mathrm\;=\;\sum\nolimits_^K\;NB\;(k;\;m_i,\;r_i)\;I\;\lbrack NB\;(_i,\;r_i)\;\leq\;NB\;(x_i_i,\;r_i)\rbrack,$$

where K is the library size and I is an indicator function taking a value of 1 when the condition is fulfilled and otherwise 0. The above expression yields an exact p value for the negative binomial distributions. However, it requires summing over all read counts across genes. For the sake of efficiency, we therefore obtain an asymptotic p value by summing over an evenly spaced grid of 104 in the domain of \(K\).

DataData collection and processing

The raw gene count expression data from the Genotype-Tissue Expression (GTEx) and The Cancer Genome Atlas (TCGA) were downloaded from the Recount3 platform [22] (https://rna.recount.bio/) using the built-in R packages. Additionally, the sample metadata files were also acquired from Recount3 [22].

At the time of download (February 9, 2022) there were 31 different tissue types in GTEx and one group without tissue information, with a total of 19,214 individual samples (Additional File 2: Table S1). The 133 samples without tissue information were removed, and duplicates were dropped before further processing. We performed feature selection using the filterbyExpr [34] function with default parameters in R (edgeR version 3.34.1) to remove lowly expressed genes. We further reduced the set of genes to protein-coding genes according to the UCSC Genome Browser [35]. The gene list is available in our GitHub repository (see below). After preprocessing, the GTEx dataset contained 18,975 samples and 16,883 annotated protein-coding genes. We have deposited the gene list used to train our model in Zenodo with https://doi.org/10.5281/zenodo.10021626 [36], and it is released under the GNU license. The gene list can also be found in Zenodo with the link: https://zenodo.org/records/10021626 [36].

For training and evaluation, the samples were split into a train and a test set, with the test set representing roughly 10% of the data. Sampling was performed in a proportional stratified random manner, meaning that 10% of the samples were randomly chosen and assigned to the test set for each tissue. This procedure resulted in 17,072 train and 1903 test samples.

As for cancer data, The Cancer Genome Atlas (TCGA) Program provided 33 different cancer types. We selected the set of genes present in our processed GTEx dataset and separated the TCGA samples into a normal adjacent set (747 samples) and a tumor set (10,601 samples) following the metadata file.

TCGA tissue selection

For an assessment of the model’s capability to assign independent samples to the correct tissue clusters according to the training data, we selected 14 cancers from TCGA that fulfilled the following two conditions: (1) the tissue corresponding to the cancer is present in GTEx and (2) the tissue must have at least ten adjacent normal samples from each cancer type [37]. We also included the adrenal and brain tumor samples, even though they lack adjacent normal samples, to compare our result with the work of Vivian et al. [10]. This results in 6111 TCGA tumor samples and 624 TCGA adjacent normal samples (Additional File 3: Table S2).

In applying our differential expression analysis to cancer types other than breast cancer, we include TCGA cancer samples from the 31 cancer types overlapping with those found in DriverDBv3 [23].

TCGA breast cancer subset

To obtain a homogenous breast cancer (BC) dataset for testing, we curated the TCGA BC samples to only include primary tumors from women between 40 and 70 years of age. We excluded samples with low tumor cell percentage (defined as < 50%) or a high level of necrosis (defined as > 5%), in addition to samples from patients with known metastasis, stage IV or stage X tumors, or prior cancer diagnosis. For a full list of selection criteria and columns from metadata used for curation, see Additional File 5: Table S4. The number of available BC samples for analysis was reduced from 1256 to 381.

Cancer driver genes and PAM50 genes

We downloaded a list of cancer driver genes for each cancer type from DriverDBv3 [23], and we matched the symbol names in the database to the Ensembl IDs from recount3. The PAM50 gene set used for BC subtype classification (basal, luminal A, luminal B, and Her2-enriched) [24] was downloaded through the R-package genefu [27]. As our model testing and comparison with DEseq2 pertained to the expression profile of BC subtype tissue vs. normal tissue (i.e., not between subtypes), we filtered the PAM50 dataset to only include the genes which were specific to a single subtype or which could distinguish BC subtype(s) from normal tissue. We noted the expected directionality of each of the PAM50 genes (upregulated or downregulated) for a contrast. The selection of genes was based partly on literature [26, 28, 29] and partly on the robust normalized PAM50 scores [27]. The PAM50 gene set was reduced to 34 genes.

AnalysisEvaluation of tissue specificity

Clustering performance according to tissue type was evaluated based on the GMM probability densities for each sample’s representation. For this purpose, samples are assigned the GMM component that achieves the highest probability density for their inferred representations. We calculated the percentage of each tissue per component as the number of samples of a given tissue clustered in a given component divided by the total number of samples assigned to this component.

$$percentage\;=\;\frac\;x\;100$$

This is done independently for the train set (Fig. 2B) and test set (Additional File 1: Fig S2).

Matching TCGA to normal tissues

We use the TCGA data described above to evaluate the mapping of unseen and independent data onto the latent space. New representations for all 6111 TCGA tumor samples were found using the DGD trained on GTEx data. The resulting GMM probability densities for the TCGA representations are used to evaluate how well new samples match the correct tissues of the training representation. We define the “correct” tissue as the tissue that best represents a given GMM component. Our evaluation metric is the percentage of TCGA samples of a given tissue matched to the corresponding GMM component with respect to the total number of TCGA samples for this tissue.

$$\%\;TCGA\;in\;correct\;clusters\;=\;\frac\;x\;100$$

Bladder samples were evaluated differently due to the lack of a bladder-specific component in the normal model. Instead, we evaluated a correct match as TCGA and GTEx bladder samples were assigned to the same component(s).

Comparing GTEx and TCGA gene expression predictions

The total probability of a sample is the product of the probabilities across all 16,883 genes given by the model (also used in the model likelihood). We calculate the negative log-probability mass of each sample for three datasets: GTEx test, TCGA adjacent normal, and TCGA tumor. For this comparison, we use subsets containing ten tissues: adrenal, brain, breast, colon, kidney, liver, lung, prostate, stomach, and thyroid. We apply this analysis for a pan-tissue comparison based on the eight tissues only, because adrenal and brain are missing in the TCGA adjacent normal subset. For a fair comparison, we ensure equal samples for a given tissue across the three datasets. If all datasets have more than 20 samples for a given tissue, we randomly select 20 samples from each subset for that tissue.

Differential expression analysis in TCGA breast cancer

Differential expression analysis (DEA) performed by our model is compared to DESeq2 by the enrichment of cancer driver genes among DEGs.

For a general comparison, we perform 20 single-sample experiments using 1 random breast cancer sample (case) from the population of 40–50-year-old Caucasian females. This procedure leaves us with 166 samples. Genes that result in absolute log2-fold changes greater than one and adjusted p values below 0.01 are accepted as differentially expressed. The enrichment score is then given as the normalized number of DEGs belonging to the group of breast cancer driver genes or PAM50 genes, respectively.

$$ES\;=\;\frac\;\mathrm\;\mathrm\;\mathrm\;\ast\;16,883\;\mathrm g\mathrm e\mathrm n\mathrm e\mathrm s}\;\mathrm\;\ast\;\mathrm\;\mathrm\;\mathrm}$$

We perform a comparable DEA with DESeq2 using all 44 GTEx samples (control) under the same conditions (40–50-year-old females).

We also perform single-sample analyses of the four available breast cancer subtypes: basal-like (84 samples), HER2 (37 samples), luminal A (176 samples), and luminal B (84 samples), both for our model and DEseq2. We randomly choose one sample from each of the four subtypes as a case sample and used all GTEx breast tissue samples (40–70-year-old females, 143 samples in total) as controls in the DEseq2 method. The experiment is repeated 20 times for each subtype.

False positive analysis

In order to assess the false positive rates of the methods, we selected a random GTEx breast sample from the test set (42 samples) as a false case sample. We performed DEA with our model and DESeq2 to yield false positive DEGs for adjusted p values ranging from 0.01 to 0.1 (absolute log2-fold change greater than 1). We performed this 20 times and reported the resulting DEGs as false positives. As controls for DESeq2, we used all breast samples from the GTEx training set (440 samples) as controls. We also perform the analysis for DEseq2 by choosing five controls, which are randomly selected from the breast samples mentioned above.

For the enrichment analysis of Cancer Driver Genes for multiple cancer types, 31 different cancer types are used: adrenocortical carcinoma, bladder urothelial carcinoma, brain lower grade glioma, breast carcinoma, colon adenocarcinoma, kidney cancer (kidney chromophobe, kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma), liver hepatocellular carcinoma, lung adenocarcinoma, prostate adenocarcinoma, stomach adenocarcinoma, and thyroid carcinoma. We perform 20 single-sample experiments for each of the cancer types. For each cancer type, its respective cancer driver gene list was used in the enrichment score calculation.

Comments (0)

No login
gif