Understanding the phenotypic diversity among human populations is important in medicine and other life sciences. Genetic epidemiology evaluates phenotypic diversity by statistical models that combine genetic effects and environmental effects to identify the causal variants or genes of diseases. This has greatly contributed to the understanding of the genetic causes and mechanisms of disease.
In recent years, the field of genetic epidemiology has grown significantly due to the availability of genomics data. In particular, genome-wide association studies (GWAS) have identified many genetic variants that affect complex traits including diseases.[1] In addition to genomic information, information from other omics such as transcriptomics can also be used to analyze phenotypic diversity. Furthermore, in the past few years, the technology for measuring omics data at the single-cell level has made dramatic progress. The integration of genetic epidemiology methodology with single-cell omics data is becoming increasingly important. In this paper, we propose cell population-based frameworks and discuss the future of genetic epidemiology with single-cell omics data.
Model for explaining the variation of phenotype
Standard model
The model that expresses phenotypic diversity as a combination of genetic and environmental effects is the most basic model in genetic epidemiology (Standard Model: Figure 1A). Many genetic epidemiological studies, including GWAS, are based on this framework and use statistical models such as linear regression and contingency table tests to analyze the association of genetic factors and phenotype. This basic model expresses only the causal relationship from genetic factors to phenotypic diversity and does not include insight into molecular mechanisms.
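As a minimal sketch of the Standard Model, assuming an additive genotype coding (0, 1, or 2 copies of the minor allele) and a quantitative phenotype, a single-variant association test can be written as an ordinary least-squares regression. The function name `snp_association`, the allele frequency, the effect size, and the sample size are illustrative assumptions and do not come from any study cited here.

```python
import numpy as np

def snp_association(genotype, phenotype):
    """Regress phenotype on additive genotype; return (slope, t_statistic)."""
    g = np.asarray(genotype, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    n = g.size
    g_centered = g - g.mean()
    sxx = np.sum(g_centered ** 2)
    slope = np.sum(g_centered * (y - y.mean())) / sxx
    intercept = y.mean() - slope * g.mean()
    residuals = y - (intercept + slope * g)
    sigma2 = np.sum(residuals ** 2) / (n - 2)   # residual variance
    standard_error = np.sqrt(sigma2 / sxx)
    return slope, slope / standard_error

# Simulated data (hypothetical): phenotype = genetic effect + environmental effect.
rng = np.random.default_rng(0)
n = 2000
genotype = rng.binomial(2, 0.3, size=n)        # G: minor allele count
environment = rng.normal(0.0, 1.0, size=n)     # E: unobserved environmental term
phenotype = 0.5 * genotype + environment       # P = G + E

slope, t_stat = snp_association(genotype, phenotype)
print(f"estimated effect = {slope:.3f}, t = {t_stat:.1f}")
```

In a GWAS this test would be repeated for every variant, typically with covariates and a multiple-testing correction, both of which are omitted here for brevity.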
Figure 1. Model of genetic epidemiology. G and E represent Genetic and Environmental effects, respectively. (A) Standard Model. G and E directly generate phenotypic diversity. (B) Omics Model. G and E generate phenotypic diversity via the diversity of omics information. (C) Cell Population Model. G and E generate phenotypic diversity via the diversity of cell populations where each single cell has omics information. (D) Multi-Tissue Model. G and E generate phenotypic diversity via the diversity of multiple cell populations.
Omics model
Genetic factors influence the diversity of biomolecules such as RNA and proteins. Comprehensive biomolecular information is known as omics information, which is classified into genome, transcriptome, proteome, epigenome, or metabolome information.[2, 3] Genetic epidemiologists have actively studied phenotypic diversity through such omics information, which is not limited to genomic information.
The Omics Model shown in Figure 1B is a framework for combining genetic epidemiology with omics data. In this model, genetic and environmental effects contribute to phenotypic diversity via biomolecular information. To identify the genetic effects on pools of biomolecular information such as the transcriptome, proteome, metabolome, or epigenome (blue arrow in Figure 1B), single nucleotide polymorphisms (SNPs) associated with these biomolecules (expression quantitative trait loci (eQTL), protein QTL, methylation QTL, and metabolite QTL) are being actively investigated.[4, 5] For example, eQTL analysis identifies genetic variants that are associated with gene expression levels obtained from transcriptome data. The eQTLs identified in various tissues have been published in public databases.[6, 7] In addition, studies that examine the relationship between omics diversity and phenotypic diversity (red arrow in Figure 1B) constitute disease omics analysis. Studies to identify differentially expressed genes in diseased and healthy individuals are included in this category. Both types of study designs have been widely implemented in omics research projects.
Cell population model
A disease or complex phenotype of medical interest manifests at the tissue or individual level. It is not caused by just one particular cell but by abnormalities of an entire cell population in the relevant tissue. In fact, tissue samples used in omics analyses are composed of many cells, and each cell has different omics information. Breakthroughs in single-cell omics technology over the last few years have made it possible to acquire omics information at the single-cell level. Genetic epidemiological models can therefore be extended to single-cell omics studies.
We propose the Cell Population Model as a framework for genetic epidemiology with single-cell omics data (Figure 1C). This model expresses phenotypic diversity as cell population profile diversity. Each cell in the body has a different omics status from the others. In this model, genetic and environmental effects affect phenotypic diversity through the diversity of a cell population profile where each cell has omics information. This model is an extension of the Omics Model and is considered a natural biological expression of complex phenotypes.
While association studies between cell population profiles and phenotype are often performed to identify a cellular subset related to disease using cytometry data or single-cell RNA-seq data (red arrow in Figure 1C),[8-10] genetic epidemiological analyses based on such models have rarely been performed to date (blue arrow in Figure 1C). Previously, we performed the first GWAS on the diversity of lymphocyte populations in peripheral blood using a large-scale cytometry dataset based on this framework.[11] Although the analysis was performed with a relatively small sample size, SNPs associated with individual differences in the lymphocyte profile were successfully identified. In recent years, research to acquire cytometry data on a large scale has also become common.[12] Genetic epidemiological research under this model can be expected to bring new findings.
Multi-tissue model
The Cell Population Model can be extended to multiple tissues in the Multi-Tissue Model (Figure 1D). Under this framework, phenotypic diversity is understood as being generated by a combination of effects from cell population profiles of multiple related tissues. For systemic diseases involving multiple tissues, such models are a natural expression of the mechanism. Although genetic epidemiological studies using the Multi-Tissue Model have not been conducted, it is considered meaningful as a future genetic epidemiological model in the single-cell era.
Cell population profile as a distribution on Omics State Space
Omics state space
Each cell in a cell population has the biomolecular information of five omics layers: epigenome, transcriptome, proteome, metabolome, and somatic genome. Biomolecular information in the epigenome layer, such as DNA methylation, histone acetylation, and chromatin openness, can be quantified as signal values assigned to each position in the genome. The factors in the transcriptome layer are the expression levels of all genes in the human genome. The factors in the proteome layer are the expression levels of all proteins. More dimensions are required if cellular localization or chemical modifications such as phosphorylation of proteins are distinguished. The metabolome layer contains the abundance of all metabolites including lipids and low-molecular-weight compounds. Factors in the somatic genome layer are information about mutations or DNA damage that accumulate in the somatic genome and are distinct from the germline genome information inherited from parents. For example, cancer is a disease caused by an increase in the number of cells with abnormal somatic genomic information, and cancer genome analysis has been used to identify genes involved in the pathogenesis of the disease.[13, 14] In addition, considering the mitochondrial genome is beneficial for understanding the differences among cells. For example, a recent in vivo study in mice observed mitochondrial transfer between different types of cells, which is related to biological or pathological phenomena.[15, 16]
Because each cell has individual omics information, one cell can be represented as a single point in the state space where each biomolecule measurement value represents a coordinate axis. Here, we call this state space of the biomolecules of all the omics layers the “Omics State Space.” The function of the cell population depends on the profile of cells with different omics statuses.
Therefore, the cell population profile is characterized as a probability distribution in the Omics State Space. Since this distribution corresponds to the joint distribution of whole biomolecular measurements, including all gene expressions, protein expressions, mutations in the somatic genome, and epigenome modifications, it is a very high-dimensional distribution. Cells are not evenly observed in the Omics State Space, and most parts are sparse areas where no cells are observed at all. We define the cell population profile as the distribution in this Omics State Space.
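As a notational sketch of this definition (the symbols are introduced here for illustration only and are not part of the original formulation), let d be the total number of biomolecular measurements across all omics layers, let a cell be a point x in the Omics State Space, and let p_s denote the cell population profile of individual s. Treating every coordinate as real-valued is a simplifying assumption, since somatic mutations, for example, are discrete.

```latex
x = (x_1, \dots, x_d) \in \mathcal{X} \subseteq \mathbb{R}^{d},
\qquad
p_s : \mathcal{X} \to [0, \infty),
\qquad
\int_{\mathcal{X}} p_s(x)\, dx = 1
```

Under this notation, comparing individuals means comparing the distributions p_s rather than single measurement vectors.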
Experimental data measuring biomolecules to capture the properties of the cell population profile
Experimental data measuring biomolecules can be interpreted as capturing different parts of the distribution of the cell population profile in the Omics State Space. Because the distribution of cell population profiles is very high-dimensional and complex, there is no experimental technique that gives a complete picture. Existing biomolecular experimental data can be classified according to three perspectives with respect to the desired information: the target omics layer, bulk/single-cell, and candidate-based/comprehensive. For example, bulk and candidate-based approaches in the proteome layer include western blotting or ELISA. Immunocytochemistry is a single-cell level and candidate-based method primarily in the proteome layer, where the number of cells that can be measured is small but protein localization can be distinguished. Bulk and comprehensive approaches in the transcriptome layer include RNA-seq or DNA microarray. Methods for comprehensive measurements at the single-cell level in each layer have made rapid progress in the past few years.[15-22] Recent genomic assays can even detect mtDNA mutations at the single-cell level.[23]
In particular, single-cell data and bulk data differ in their data structure. The bulk measurement is an estimate of the mean value for a particular axis of the distribution in the Omics State Space. Since the mean value in a probability distribution is a representative and reasonable feature of the distribution, the bulk measurement value is a reasonable index for comparison among distributions when the cell populations are homogeneous. Single-cell data constitute a sample from the population distribution of the cell population profile where only some coordinate values are observable (Figure 2). While the shape information of the distribution is lost in bulk data, single-cell data can partially capture it. Single-cell data can therefore be used to identify and quantify heterogeneity and cellular subsets in the cell population profile.
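The difference between the two data structures can be illustrated with a small simulation. This is a sketch under stated assumptions: a single marker axis, Gaussian subsets, and hypothetical subset centers chosen so that two populations share the same mean but differ in shape.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 5000

# Population A (hypothetical): one homogeneous subset centered at 5 on a marker axis.
pop_a = rng.normal(5.0, 1.0, size=n_cells)

# Population B (hypothetical): two subsets centered at 2 and 8, mixed 50/50,
# so its mean along this axis is the same as that of population A.
is_high = rng.random(n_cells) < 0.5
pop_b = np.where(is_high,
                 rng.normal(8.0, 1.0, size=n_cells),
                 rng.normal(2.0, 1.0, size=n_cells))

# Bulk measurement: only the mean along the axis is observed.
bulk_a, bulk_b = pop_a.mean(), pop_b.mean()

# Single-cell measurement: the sample itself is observed, so shape features
# such as the spread or the fraction of cells in a region are available.
spread_a, spread_b = pop_a.std(), pop_b.std()
frac_mid_a = np.mean(np.abs(pop_a - 5.0) < 1.0)
frac_mid_b = np.mean(np.abs(pop_b - 5.0) < 1.0)

print(f"bulk means: A = {bulk_a:.2f}, B = {bulk_b:.2f}")
print(f"spread: A = {spread_a:.2f}, B = {spread_b:.2f}")
print(f"fraction of cells near 5: A = {frac_mid_a:.2f}, B = {frac_mid_b:.2f}")
```

The two bulk values are nearly indistinguishable, whereas the single-cell summaries separate the homogeneous population from the heterogeneous one.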
Figure 2. Omics State Space and single-cell data. The cell population profile is characterized as a high-dimensional probability distribution in the Omics State Space where each measurement value on the transcriptome, proteome, metabolome, somatic genome, or epigenome layer represents a coordinate axis. Single-cell data constitute a sample from this population distribution of the cell population profile where only some coordinate values are observable as markers.
The ability to acquire more biomolecule information simultaneously at the single-cell level will allow us to understand the shape of the cell population profile at higher resolution. In recent years, the ability to measure omics information in multiple layers simultaneously has been actively researched, and measurement techniques at the single-cell level have been developed.[24, 25]
Requirement of a cell population
In this section, we discuss important issues that arise when cell population profiles are considered as distributions, and the requirements that a cell population must meet.
When the cell population profile is viewed as a probability distribution, each data point is considered independent and the cell location information disappears. For this treatment to be valid, cells need to be able to move and mix freely within a cell population. This assumption holds well for peripheral blood cell populations. When blood cells are sampled from peripheral blood, each cell can be regarded as independent and randomly collected, and the single-cell data can be regarded as a statistical sample from the population distribution. However, in many anatomically defined tissues, it is not only necessary for cells to maintain their proper biomolecular expression state, but also for each cell to occupy its proper position in the tissue to maintain the tissue function. For example, tissue stem cells are maintained in a microenvironment called a niche.[26] Considering such cell populations as distributions would result in a loss of biological information.
In recent years, spatial omics technologies that simultaneously acquire positional and omics information have received much attention. For example, the spatial transcriptome can reveal transcriptome data while retaining spatial information in the tissue.[27] Such spatial information may be useful in determining the range of cell populations that can be treated as distributions and in compensating for the loss of positional information.
To extend the Cell Population Model to the Multi-Tissue Model, it is necessary to consider the interactions between the cell populations. Cell populations exchange information through physical interactions or cellular signaling. In reality, the diversity of some complex phenotypes is generated by many cell populations that make up an individual and their interactions.
Feature extraction of cell population profiles
In order to design genetic epidemiology studies based on a cell population-based framework, such as the Cell Population Model or Multi-Tissue Model, it is necessary to perform association analysis between the cell population profile and individual labels such as genotype or phenotype. Since the cell population profile is represented as a probability distribution on the Omics State Space, conventional methods of genetic epidemiology and omics data analysis cannot be directly used in this situation. The solution to apply these data analysis methods and conduct association analysis is to extract feature values from cell population profiles. In this section, we introduce three conventional ideas on feature extraction of cell population profiles: methods using bulk data, methods based on cellular subsets, and nonparametric methods.
The mean value of the distribution obtained from bulk data is one of the most commonly used features of cell population profiles. For example, bulk transcriptome data have contributed greatly to the identification of tissue-specific genes.[28] The identification of tissue-specific genes is a way to compare cell populations from multiple tissues to find transcriptome axes whose mean values differ significantly among the multiple tissue cell populations on the Omics State Space. In the medical science field, many searches for biomolecular markers using bulk data have been conducted.[29, 30]
Feature extraction based on cellular subsets is frequently done with single-cell data. Each cell in a cell population is a little different from the others, so no two cells are exactly the same. However, since cell populations are formed as cells proliferate and differentiate, there are cellular subsets with the same properties and functions in the cell population. Therefore, we can understand cellular function by classifying cells into subsets and annotating their functions. Since a cell population profile is a mixture distribution of cellular subsets, the proportion of each subset is also a valid feature of a cell population profile. Computational methods for clustering cells using single-cell data to identify cellular subsets are actively being studied by computational biologists.[31, 32]
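A minimal sketch of subset-based feature extraction is given here, assuming two markers, two well-separated subsets, and a plain k-means clustering as a stand-in for the clustering methods cited above. The helper names (`farthest_point_init`, `kmeans_labels`, `simulate_sample`) and all simulated values are hypothetical.

```python
import numpy as np

def farthest_point_init(cells, k):
    """Deterministic seeding: start at cell 0, then add the farthest cell."""
    centers = [cells[0]]
    for _ in range(k - 1):
        dist = np.min([np.sum((cells - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(cells[np.argmax(dist)])
    return np.array(centers)

def kmeans_labels(cells, k, n_iter=20):
    """Plain k-means; returns the subset label assigned to every cell."""
    centers = farthest_point_init(cells, k)
    for _ in range(n_iter):
        dist = ((cells[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = cells[labels == j].mean(axis=0)
    return labels

def simulate_sample(rng, n_cells, frac_a):
    """Simulated single-cell sample: a mixture of two subsets on two markers."""
    n_a = int(round(n_cells * frac_a))
    subset_a = rng.normal([0.0, 0.0], 1.0, size=(n_a, 2))
    subset_b = rng.normal([6.0, 6.0], 1.0, size=(n_cells - n_a, 2))
    return np.vstack([subset_a, subset_b])

rng = np.random.default_rng(2)
samples = [simulate_sample(rng, 1000, 0.8), simulate_sample(rng, 1000, 0.3)]

# Cluster the pooled cells so that subset definitions are shared across samples.
pooled = np.vstack(samples)
sample_id = np.repeat(np.arange(len(samples)), [len(s) for s in samples])
labels = kmeans_labels(pooled, k=2)

# Feature matrix: one row per sample, one column per subset proportion.
features = np.array([
    np.bincount(labels[sample_id == i], minlength=2) / np.sum(sample_id == i)
    for i in range(len(samples))
])
print(features)
```

Each row of `features` can then be used as an ordinary vector-valued trait in an association analysis with genotype or phenotype.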
Cellular subset-based feature extraction also loses information. One reason is that the results of feature extraction are affected by prior biological knowledge and assumptions about the pre-identified cellular subsets. Yet it is not known exactly how many cellular subsets there are in our body or how we should classify them, and novel subsets continue to be identified. Even data-driven classification using information science methods cannot eliminate such biases due to the assumptions made in the algorithms and statistical models. In addition, information about variability and diversity within cellular subsets is also lost.
Nonparametric feature extraction is another means to obtain feature values without assumptions about cellular subsets. A nonparametric statistical method models the probability distribution of the cell population profile without fixing the number of parameters in advance. In cytometry data analysis, a method using information theory-based dissimilarity quantification and multi-dimensional scaling (MDS) has been proposed.[33, 34] Here, the dissimilarity matrix among probability distributions is calculated by nonparametric density estimation, and MDS is applied to this dissimilarity matrix to obtain coordinates that reflect the dissimilarity relationship. Decomposition into Extended Exponential Family expresses the cell population profile distribution as an exponential family-like formula in a nonparametric manner, giving coordinates based on the inner-product matrix among the distributions.[35] The coordinates obtained by these procedures can be treated as data-driven feature values of the cell population profiles.
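The general idea of a dissimilarity matrix followed by MDS can be sketched as follows. This is not the procedure of the cited methods: it assumes a histogram as the nonparametric density estimate, the square root of the Jensen-Shannon divergence as the information theory-based dissimilarity, and classical (Torgerson) MDS, all of which are choices made here for illustration, as are the function names and simulated samples.

```python
import numpy as np

def histogram_density(values, edges):
    """Nonparametric density estimate on a fixed grid (with a small pseudocount)."""
    counts, _ = np.histogram(values, bins=edges)
    probs = counts + 1e-9
    return probs / probs.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m))
    kl_qm = np.sum(q * np.log(q / m))
    return 0.5 * kl_pm + 0.5 * kl_qm

def classical_mds(dissimilarity, n_dims=2):
    """Classical (Torgerson) MDS from a symmetric dissimilarity matrix."""
    n = dissimilarity.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dissimilarity ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))

rng = np.random.default_rng(3)
edges = np.linspace(-5.0, 12.0, 41)

# Four simulated samples: two with a low-marker profile, two with a shifted profile.
samples = [rng.normal(loc, 1.0, size=2000) for loc in (0.0, 0.2, 5.0, 5.2)]
densities = [histogram_density(s, edges) for s in samples]

n = len(densities)
dissimilarity = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        jsd = js_divergence(densities[i], densities[j])
        dissimilarity[i, j] = np.sqrt(max(jsd, 0.0))

# Each row is a data-driven feature vector for one sample.
coords = classical_mds(dissimilarity, n_dims=2)
print(coords)
```

Samples with similar distributions receive nearby coordinates, so the rows of `coords` can serve as feature values without any reference to predefined cellular subsets.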
The development of feature extraction methods that satisfy these requirements is a future challenge in data analysis for implementing genetic epidemiology models in the single-cell era. The advantage of cellular subset-based feature extraction is that the biological meaning of the obtained features is clear and easy to interpret. The advantage of nonparametric methods is that they can model cell population profiles without using prior assumptions about cellular subsets. However, nonparametric methods generally require larger sample sizes to perform robust analysis. Due to cost issues, it is often difficult to acquire single-cell data with very large sample sizes. While there are many methods to compare multiple samples in cytometry data, such methods are lacking for single-cell RNA-seq data in particular.[36] This remains a future task in single-cell data analysis.
Dynamics of cell population profiles
Research designs that combine genetic epidemiology and systems biology are useful in medical research.[37, 38] In genetic epidemiology, genetic factors are identified by analyzing the diversity of complex phenotypes such as diseases among individuals. However, understanding the dynamics of complex phenotypes within the same individual is also important for the control of diseases with a systems biology approach. For example, how the body responds to drug stimuli and the underlying molecular mechanisms are fundamental research topics in the medical sciences. The dynamics of biological phenomena can also be explained by a cell population-based framework.
The cell population profile changes over time in response to external stimuli to affect biological phenomena. For example, lymphocyte populations in the peripheral blood change after vaccination to cause an immune response. In addition, external signals trigger cellular responses such as proliferation and differentiation. Therefore, the state of a cell at a future time point depends on the current omics state and the external environment of the cell. This transition rule is defined by the biological pathway.
Because of each cell's dynamics, the overall distribution changes when a cell at a certain coordinate moves to another point depending on the type of stimulus. Cell death, division, or proliferation can also cause the change in distribution. This change is triggered by a change in the proteome layer, in which receptor proteins on the cell surface respond to an external substance and change their activity. The regulatory relationships between biomolecules determine where cells move from one point to another in the Omics State Space depending on the presence and type of stimulation.
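A highly simplified sketch of such dynamics is a discrete-time transition model over cellular subsets, in which the transition rule depends on whether a stimulus is present. The three subset names, the transition probabilities, and the stimulation schedule are hypothetical, and proliferation and cell death are ignored so that the proportions always sum to one.

```python
import numpy as np

# Rows: current subset, columns: next subset, in the order
# (naive, activated, memory). All probabilities are hypothetical.
TRANSITIONS = {
    "rest": np.array([[0.98, 0.01, 0.01],
                      [0.10, 0.80, 0.10],
                      [0.00, 0.02, 0.98]]),
    "stimulated": np.array([[0.70, 0.30, 0.00],
                            [0.00, 0.85, 0.15],
                            [0.00, 0.20, 0.80]]),
}

def simulate_profile(initial, schedule):
    """Propagate subset proportions through a sequence of conditions."""
    profile = np.asarray(initial, dtype=float)
    trajectory = [profile]
    for condition in schedule:
        # The next profile depends on the current profile and the environment.
        profile = profile @ TRANSITIONS[condition]
        trajectory.append(profile)
    return np.array(trajectory)

schedule = ["rest"] * 3 + ["stimulated"] * 5 + ["rest"] * 10
trajectory = simulate_profile([0.90, 0.02, 0.08], schedule)

for step, profile in enumerate(trajectory):
    print(step, np.round(profile, 3))
```

Each row of `trajectory` is the feature vector of the cell population profile at one time point, so the output is a time series of features for a single individual.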
The dynamics of such cell population profiles can also be analyzed by applying ordinary data analysis methods through feature extraction. The changes in the cell population profiles for each sample can be visualized and analyzed as the time series data of its features. Various data analysis methods are available to analyze time series data.[39] A cell population-based framework such as the Cell Population Model or Multi-Tissue Model provides an integrated approach to both genetic epidemiology and systems biology (Figure 3). This framework will be useful for investigating genetic effects that are condition-specific or related to dynamics. For example, recent research suggested that a variant of RNASEH2B is related to hemophagocytic lymphohistiocytosis (HLH) depending on the biological condition.[40]
Figure 3. Graphical abstract of cell population-based framework integrating genetic epidemiology and systems biology. In the context of genetic epidemiology, genetic effects influence phenotypic diversity through their impact on the probability distributions of various cell population profiles that make up an individual. In the context of systems biology, within an individual, responses to stimuli drive biological pathways and cause biological phenomena by altering their distribution. In both cases, extracting feature values from the distribution makes it possible to represent them in a statistical model.
CONCLUSION
In this perspective paper, we proposed a cell population-based framework for genetic epidemiology. In this framework, genetic diversity influences phenotypic diversity through the diversity of cell population profiles. Cell population profiles are high-dimensional distributions on the Omics State Space, and all biomolecular measurement data are used to obtain the properties of this distribution. To conduct genetic epidemiology in a cell population-based framework, feature extraction from cell population profiles is important from a data analysis standpoint. In addition, this framework can also be applied to represent the dynamics of cell population profiles, providing an integrated approach to genetic epidemiology and systems biology.
ACKNOWLEDGMENTS
This work was supported by a KAKENHI Grant-in-Aid from the Japan Society for the Promotion of Science (JSPS; grant numbers JP19J14816 and 21K21316), the Core Research for Evolutionary Science and Technology (CREST; grant numbers JPMJCR1502 and JPMJCR15G1), and the Japan Science and Technology Agency (JST; the AIP Challenge and JPMJCR21U2).
CONFLICT OF INTEREST
The authors declare that they have no conflict of interest.