We manually checked the sequencing data quality to ensure reliable estimation of the forensic efficiencies of the analyzed SNP loci in the Chinese Korean ethnic group. This process involved a thorough evaluation of the newly generated dataset, including both per-sample and per-marker examinations. We removed samples with mapping rates < 100% and excluded SNP loci with allele calling rates < 100% to generate a complete dataset for subsequent analyses. In total, four samples and 47 SNPs were excluded, leaving 157 samples and 1946 SNP loci (97.6% of the total SNP loci) for further forensic analyses. These 1946 SNP loci were distributed across the 22 autosomes of the human genome (Additional file 1: Fig. S1A).
To determine whether any participants in this study were biologically related, we performed a kinship analysis among the 157 Korean individuals. None of the 157 individuals shared close relatedness with each other. Consequently, all 157 unrelated Korean individuals were retained for subsequent forensic and population genetic analyses. The sequencing depth of these SNPs varied from 106.85 ± 39.62× to 20,376.01 ± 6654.35×, with most sequencing depths concentrated between 1000× and 5000× (Additional file 1: Fig. S1B). We also presented the SNPs with sequencing depths below 500× in a boxplot (Additional file 1: Fig. S1C). The results showed that the minimal sequencing depth for all 1946 SNP loci exceeded 50×. Furthermore, the histogram (Additional file 1: Fig. S1D) showed that the majority of heterozygous SNP loci had an average coverage ratio (ACR) ranging from 0.6538 to 0.9538. Overall, the results suggested that high-quality SNP data were generated for the Chinese Korean ethnic group with the new NGS-based panel, supporting the reliability of subsequent analyses.
Forensic performance of the 1946 SNP loci in the Chinese Korean ethnic group
In the HWE tests of the 1946 SNPs, 13 diallelic SNPs deviated from HWE after Bonferroni correction (p < 0.000025). The results of LD tests for pairwise SNP loci showed that only 14 pairs (14/1048575 = 0.0013%) of SNP loci deviated from linkage equilibrium in the Chinese Korean ethnic group.
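As a rough illustration of this kind of HWE screening, the sketch below applies a one-degree-of-freedom chi-square goodness-of-fit test to a diallelic SNP and compares the p-value with a Bonferroni threshold of 0.05/1946 (about 2.6 × 10⁻⁵). The text does not state which test was used (exact tests are also common), so the chi-square choice, the function name `hwe_chi2_pvalue`, and the genotype counts in the usage note are assumptions for illustration only.

```python
import math

N_SNPS = 1946                 # number of loci tested (from the text)
ALPHA = 0.05                  # assumed family-wise significance level
BONFERRONI = ALPHA / N_SNPS   # about 2.57e-5


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for HWE at a diallelic SNP.

    n_aa, n_ab, n_bb are observed genotype counts (hypothetical inputs).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def deviates_from_hwe(n_aa, n_ab, n_bb, threshold=BONFERRONI):
    """True when the p-value falls below the corrected threshold."""
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) < threshold
```

For example, with 157 individuals, hypothetical counts of 40/77/40 would pass the test, whereas 78/0/79 (no heterozygotes at all) would be flagged.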
The statistical parameters for the 1946 SNPs were computed to evaluate the panel's potential for forensic use in the Chinese Korean ethnic group. The 1946 SNP loci consisted of 1906 diallelic SNPs and 40 tri-allelic SNPs. We excluded 13 diallelic SNP loci that were completely homozygous or completely heterozygous in the Chinese Korean ethnic group before visualizing the statistical parameters of these SNP loci. Overall, these SNP loci displayed high polymorphism in the Chinese Korean ethnic group (Additional file 1: Fig. S2A). In the diallelic SNP set, PD values ranged from 0.0252 to 0.5016; GD values varied from 0.0126 to 0.5016; PM values were in the range of 0.3412 to 0.9748; Hobs values varied from 0.0064 to 0.9873; and PE values spanned from 0.0002 to 0.9744. In the tri-allelic SNP set, PD values ranged from 0.3042 to 0.8034; GD values varied from 0.1944 to 0.6602; PM values spanned from 0.1966 to 0.6958; Hobs values were in the range of 0.1465 to 0.9172; and PE values ranged from 0.0169 to 0.8307. Overall, the medians of PD, GD, PM, Hobs, and PE values were 0.5529, 0.4154, 0.4471, 0.4164, and 0.1368 in the diallelic SNP set and 0.6916, 0.5410, 0.3084, 0.5518, and 0.2601 in the tri-allelic SNP set, respectively, indicating that the tri-allelic SNP loci were genetically more polymorphic than the diallelic SNP loci.
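The per-locus parameters reported above can be sketched from genotype data as follows. This is a minimal illustration under stated assumptions, not the software used in the study: PM is taken as the sum of squared genotype frequencies, PD as 1 − PM, GD as Nei's unbiased gene diversity, and PE from the commonly used heterozygosity-based approximation h²(1 − 2hH²), where h is the observed heterozygosity and H = 1 − h. The function name and input layout are hypothetical.

```python
from collections import Counter


def forensic_parameters(genotypes):
    """Per-locus forensic parameters from a list of genotypes.

    Each genotype is a pair of allele labels, e.g. ("A", "G").
    Returns a dict with Hobs, GD, PM, PD and PE (formulas are assumptions
    described in the accompanying text).
    """
    n = len(genotypes)
    geno_counts = Counter(tuple(sorted(g)) for g in genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    n_alleles = 2 * n

    # Observed heterozygosity: share of individuals with two different alleles.
    h_obs = sum(c for g, c in geno_counts.items() if g[0] != g[1]) / n

    # Gene diversity with the small-sample correction 2n / (2n - 1).
    sum_p2 = sum((c / n_alleles) ** 2 for c in allele_counts.values())
    gd = n_alleles / (n_alleles - 1) * (1.0 - sum_p2)

    # Match probability and power of discrimination from genotype frequencies.
    pm = sum((c / n) ** 2 for c in geno_counts.values())
    pd = 1.0 - pm

    # Power of exclusion from observed heterozygosity (h) and homozygosity (H).
    hom = 1.0 - h_obs
    pe = h_obs ** 2 * (1.0 - 2.0 * h_obs * hom ** 2)

    return {"Hobs": h_obs, "GD": gd, "PM": pm, "PD": pd, "PE": pe}
```

For a hypothetical locus with 25 AA, 50 AG and 25 GG genotypes, this gives Hobs = 0.5, PM = 0.375, PD = 0.625 and PE = 0.1875.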
Additionally, we estimated the distribution patterns of 1-CPD and 1-CPE values while incrementally incorporating SNP loci in descending order. Generally, both 1-CPD and 1-CPE values gradually decreased as the number of SNP loci increased. The lowest 1-CPD and 1-CPE values for the 1946 SNPs in the Chinese Korean ethnic group were 3.76E-308 and 2.18E-130, respectively. These results indicated that this panel can be effectively used for individual identification and parentage testing (Additional file 1: Fig. S2B). Moreover, even when the successful detection rate of the SNP loci decreased to 10% (200/1993), the 1-CPD and 1-CPE values could still meet the thresholds required for individual identification and parentage testing. For more information on the forensic parameters of these SNPs, refer to Additional file 1: Table S2.
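The cumulative curves described above follow from multiplying per-locus complements: 1 − CPD is the product of (1 − PD) over loci, and 1 − CPE is the product of (1 − PE). Because a value such as 3.76E-308 sits close to the smallest normal double-precision number, a sketch of the calculation is safer in log space. The function below is an illustration under that assumption; its name and the ordering by per-locus value are not taken from the study.

```python
import math


def cumulative_log10_complement(values):
    """Running log10 of prod(1 - v) as loci are added in descending order.

    `values` holds per-locus PD (or PE) values, each strictly below 1.
    Element k of the result is log10(1 - CPD) (or log10(1 - CPE)) after
    the k + 1 most informative loci have been included.
    """
    ordered = sorted(values, reverse=True)   # most informative loci first
    running = 0.0
    curve = []
    for v in ordered:
        running += math.log10(1.0 - v)       # adding logs = multiplying complements
        curve.append(running)
    return curve
```

Each added locus contributes a non-positive term, so the curve can only fall or stay flat, matching the gradual decrease reported in the text.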
Furthermore, we conducted simulations for PC, FS, HS, and GG relationships based on the allele frequencies of the 1946 SNPs in the Chinese Korean ethnic group. Each simulation was repeated 2000 times, and we analyzed the LR for each type of kinship to determine their distribution patterns. As shown in Fig. 1A, significant differences were observed in the log10(LR) distributions between first-degree kinships (PC and FS) and second-degree kinships (HS and GG). For the first-degree kinships, the log10(LR) values for PC and FS ranged from 110.46 to 155.90 (mean ± standard deviation: 132.48 ± 6.90) and from 87.02 to 149.38 (117.31 ± 9.62), respectively. For the second-degree kinships, the log10(LR) values for HS and GG ranged from 10.65 to 41.95 (26.71 ± 4.66) and from 10.81 to 42.88 (26.71 ± 4.74), respectively. The distributions of log10(LR) values for FS, HS and GG kinships were entirely distinct from those of unrelated individuals in the simulations (Fig. 1B–D), demonstrating that the new NGS-based panel was potentially useful for identifying first-degree and second-degree kinships in the Chinese Korean ethnic group.
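To make the LR computation behind these simulations concrete, the following sketch evaluates a pairwise kinship LR (tested relationship versus unrelated) for one pair of genotypes, using the standard identity-by-descent (IBD) coefficients for each relationship. It assumes Hardy-Weinberg proportions, unlinked loci, no mutation, and alleles with non-zero population frequency; the function names and data layout are hypothetical, and the pedigree simulation itself is not reproduced here.

```python
import math

# (k0, k1, k2): probabilities of sharing 0, 1 or 2 alleles IBD at a locus.
IBD = {
    "PC": (0.0, 1.0, 0.0),
    "FS": (0.25, 0.5, 0.25),
    "HS": (0.5, 0.5, 0.0),
    "GG": (0.5, 0.5, 0.0),
}


def genotype_prob(genotype, freq):
    """Population probability of an unordered genotype under HWE."""
    a, b = genotype
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]


def prob_given_one_ibd(g2, g1, freq):
    """P(g2 | g1, exactly one allele IBD): one allele of g1 is passed on
    with probability 1/2 each, the other allele comes from the population."""
    total = 0.0
    for shared in g1:
        if shared in g2:
            a, b = g2
            other = b if a == shared else a
            total += 0.5 * freq[other]
    return total


def locus_lr(g1, g2, freq, relationship):
    """LR = k0 + k1 * P(g2 | g1, IBD=1) / P(g2) + k2 * [g1 == g2] / P(g2)."""
    k0, k1, k2 = IBD[relationship]
    g1, g2 = tuple(sorted(g1)), tuple(sorted(g2))
    p_g2 = genotype_prob(g2, freq)
    lr = k0 + k1 * prob_given_one_ibd(g2, g1, freq) / p_g2
    if g1 == g2:
        lr += k2 / p_g2
    return lr


def log10_lr(genotype_pairs, freqs, relationship):
    """Sum of per-locus log10(LR); -inf if any locus excludes the relationship."""
    total = 0.0
    for (g1, g2), freq in zip(genotype_pairs, freqs):
        lr = locus_lr(g1, g2, freq, relationship)
        if lr == 0.0:
            return float("-inf")
        total += math.log10(lr)
    return total
```

Under these assumptions HS and GG share the same IBD coefficients, so a pairwise LR cannot tell them apart at unlinked loci, which is in line with the nearly identical log10(LR) distributions reported for the two relationships.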
Fig. 1 Estimating the efficiencies of the 1946 SNP loci for kinship analyses by simulating different kinships. A Box plots displaying the distributions of Log10 (LR) for different kinships based on 1946 SNP loci. PC, parent–child pair; FS, full sibling pair; HS, half sibling pair; GG, grandparent-grandchild pair; B Log10 (LR) distributions for FS and unrelated individuals (UI); C Log10 (LR) distributions for HS and unrelated individuals (UI); D Log10 (LR) distributions for GG and unrelated individuals (UI)
Informativeness for assignment (In) statistics estimated for the 1706 SNPs
By incorporating the 75 reference populations into this study, we were able to conduct a comprehensive evaluation of the ancestry inference power of this panel. Initially, we visualized the allele frequencies of the 1706 overlapping SNPs across all 76 populations in a heatmap (Fig. 2A). These results revealed notable differentiation in the allele frequencies of specific SNP loci across populations, suggesting that these SNP loci could be useful for ancestry inference. Subsequently, we used the infocalc program to compute the informativeness for assignment (In) statistics, which measure the ancestry information content of these SNP loci in distinguishing among different populations. The In values were calculated at three levels. In_1 denoted the capacity of the SNP loci to differentiate among the overall 76 intercontinental populations. In_2 quantified the effectiveness of the SNP loci in distinguishing populations from East Asia, South Asia, Central South Asia and Middle East. In_3 specifically gauged their efficiency in distinguishing East Asian populations. We also generated scatter plots to visualize the forensic efficiencies of these SNP loci, with the X and Y axes showing the In and PD values of each SNP locus. As shown in Fig. 2B–D, while most SNP loci were more effective for individual identification, some also displayed potential for ancestry inference. In addition, the number of SNP loci with higher ancestry inference efficiency increased with the geographic separation of the groups being distinguished. To support future ancestry inference studies, we also provided the SNP loci with robust ancestry inference efficiencies (In > 0.1) in Additional file 1: Table S3.
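For readers unfamiliar with the In statistic, the sketch below computes it for a single locus from per-population allele frequencies, following the usual definition: the entropy of the average allele frequencies minus the average within-population entropy, with natural logarithms and equal weighting of populations. This is an independent illustration rather than the infocalc program itself; the function name and input layout are assumptions.

```python
import math


def informativeness_for_assignment(pop_freqs):
    """In for one locus.

    `pop_freqs` is a list with one dict per population, mapping allele
    labels to frequencies (each dict sums to 1). Alleles missing from a
    dict are treated as having frequency 0, and 0 * log(0) is taken as 0.
    """
    k = len(pop_freqs)
    alleles = set()
    for freqs in pop_freqs:
        alleles.update(freqs)

    total = 0.0
    for allele in alleles:
        p_ij = [freqs.get(allele, 0.0) for freqs in pop_freqs]
        p_j = sum(p_ij) / k                      # average frequency across populations
        if p_j > 0:
            total -= p_j * math.log(p_j)         # entropy of the pooled frequencies
        total += sum(p * math.log(p) for p in p_ij if p > 0) / k
    return total
```

The value is 0 when all populations have identical allele frequencies and reaches its maximum, ln K, when each of the K populations is fixed for a different allele.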
Fig. 2 Forensic applications of the 1706 SNP loci in individual identification and ancestry inference. A Heatmap of the allele frequencies for the 1706 SNP loci in the 76 populations; B Distribution of PD and In_1 values of the 1706 SNPs to distinguish among the eight intercontinental population groups; C Distribution of PD and In_2 values of the 1706 SNPs to distinguish among Asian populations; D Distribution of PD and In_3 values of the 1706 SNPs to distinguish among East Asian populations. In_1 denotes the efficiency of a SNP locus in distinguishing among the overall 76 intercontinental populations. In_2 denotes the efficiency of a SNP locus in distinguishing among East Asia, South Asia, Central South Asia and Middle East populations. In_3 denotes the efficiency of a SNP locus in distinguishing among East Asia populations
Population genetic structures revealed by FST estimations and ADMIXTURE analysis
The fixation index (FST) was used to estimate pairwise genetic differentiation among the overall 76 populations, encompassing the Chinese Korean ethnic group genotyped in this study and 75 worldwide populations selected from the 1KG and the HGDP. First, we computed the pairwise FST among different intercontinental populations and visualized the FST values using a heatmap. Then, we extracted the FST values between the Chinese Korean ethnic group and its reference populations from different continents to present the genetic relationships among these populations. We observed lower genetic differentiation among populations residing within the same continent and higher genetic differentiation between populations from different continents. The African populations were genetically the most distant from the other intercontinental populations (Fig. 3A). In this context, the Chinese Korean ethnic group exhibited the greatest genetic differentiation from African populations, as indicated by a mean FST of 0.2138. In contrast, the Chinese Korean ethnic group showed the least genetic divergence from the East Asian populations, with a mean FST of 0.0087. Meanwhile, the mean FST values between the Chinese Korean ethnic group and populations from South Asia, Central South Asia, America, Europe, Middle East and Oceania ranged from 0.0714 to 0.1637, lower than those observed for African populations but higher than those for East Asian populations (Fig. 3B). Among the East Asian reference populations, the Yakut and Lahu groups showed the greatest genetic differentiation from the Chinese Korean ethnic group.
In contrast, the Han populations from various regions (Northern Han, Southern Han, Beijing Han), Xibo, Tujia, Japanese, and Mongolian groups displayed the least genetic differentiation from the Chinese Korean ethnic group. Generally, the pairwise FST values increased with the magnitude of geographical separation. Detailed information on the pairwise FST values is provided in Additional file 1: Table S4.
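As a simple illustration of how a pairwise FST can be obtained from allele frequencies, the sketch below uses the classical heterozygosity-based form, FST = (HT − HS)/HT, combined across loci as a ratio of sums. The text does not say which estimator was applied in the study (sample-size-corrected estimators such as Weir and Cockerham's are common), so this choice, the function name, and the equal weighting of the two populations are assumptions.

```python
def fst_ratio_of_sums(p1, p2):
    """Pairwise FST for diallelic loci from two lists of allele frequencies.

    p1[i] and p2[i] are the frequencies of the same reference allele at
    locus i in populations 1 and 2. No sample-size correction is applied.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        # Mean expected heterozygosity within the two populations.
        h_s = (2 * a * (1 - a) + 2 * b * (1 - b)) / 2
        # Expected heterozygosity of the pooled population.
        p_bar = (a + b) / 2
        h_t = 2 * p_bar * (1 - p_bar)
        numerator += h_t - h_s
        denominator += h_t
    return numerator / denominator if denominator > 0 else 0.0
```

Identical frequencies give 0, and populations fixed for opposite alleles give 1; real data fall in between, as with the mean values of 0.0087 and 0.2138 quoted above.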
Fig. 3 Pairwise FST values among different populations. A Heat map of the FST values estimated among different intercontinental populations, including Africa, Europe, East Asia, South Asia, America, Central South Asia, Middle East and Oceania populations. The color ranges from red to blue, corresponding to FST values from low to high. B Bar plot of the FST values between the Chinese Korean ethnic group and the reference populations. Populations labeled with the same color are from the same continent; C ADMIXTURE results for K = 9. The genetic components of different populations are represented in different colors
The ADMIXTURE analysis was performed to estimate the genetic structure of the Chinese Korean ethnic group and compare it with that of the 75 reference populations. This analysis allowed us to better understand the ancestries of different populations by separating their genotype data into distinct components, assuming between two and ten ancestral populations (K = 2–10) when running the ADMIXTURE program. The lowest cross-validation (CV) error was observed at K = 9. As shown in Fig. 3C, the primary ancestral component found in Korean genomes corresponded to that of East Asian populations as K increased from two to ten, which supported the finding that the Chinese Korean ethnic group was closely related to East Asia populations. The results of the ADMIXTURE analyses for K values from two to ten are shown in Additional file 1: Fig. S3.
Principal component analysis
As anticipated from the FST results, the Chinese Korean ethnic group showed less genetic differentiation from East Asia populations than from non-East Asia populations. However, any potential population substructure within the East Asia populations remained unclear. To reveal the genetic relationships between the Chinese Korean ethnic group and the East Asia reference populations, we performed PCA. We tested the power of this panel for disclosing population structures in a stepwise manner, by generating population-level and individual-level PCA plots, respectively. The population-level PCA (Additional file 1: Fig. S4A) showed distinct population clusters concordant with their geographic locations, especially for Africa, Europe, South Asia and East Asia populations. More dispersed distributions were observed for America, Central South Asia, Oceania, and Middle East populations. When the PCA was restricted to East Asia populations (Additional file 1: Fig. S4B), it became easier to discern the genetic affiliations among the East Asian populations. Notably, the genetic structure of the Chinese Korean ethnic group showed a stronger resemblance to those of the Han Chinese from different regions, as well as the Tujia, Yi, and Japanese populations.
Next, we verified the ability of this panel to make ancestry inferences for individuals of unknown origin by individual-level PCA. We demonstrated that individuals of different intercontinental origins could be clustered according to their biogeographic locations or genetic structure similarities (Fig. 4A, B), consistent with the population-level PCA results. We also recreated the PCA plots for South Asia, Central South Asia, Middle East, and East Asia populations to further inspect the efficiency of this panel in distinguishing populations in closer geographic proximity. As indicated in Fig. 4C, D, four distinct population clusters were generated: South Asia, Central South Asia, Middle East, and East Asia. The Middle East and Central South Asia clusters were relatively distant from the East Asia and South Asia clusters, reaffirming that genetic similarity decreases as geographical separation increases. When the PCA was restricted to East Asia individuals, there was a distinct separation among the Han populations, the northern ethnic minorities, and the southern ethnic minorities (Fig. 4E, F), suggesting that genetic substructure may still exist among some geographically close populations. The results of individual-level PCA further supported the conclusion that this panel could be useful for ancestry inference, capable of distinguishing not only between distant populations but also among Han Chinese and the northern and southern minority groups in East Asia.
Fig. 4 Population genetic structures revealed by principal component analyses. Each dot represents a single individual and is colored according to its continental origin. A, B PCA of the overall individuals with the first three principal components (PC) involved, which reveals the population genetic structures of the intercontinental populations from eight major global regions; C, D PCA of the individuals from East Asia, South Asia, Central South Asia and Middle East with the first three PCs involved, which reveals the population genetic structures of these populations; E, F PCA of the individuals from East Asia with the first three PCs involved, which reveals the population genetic structures of East Asia populations. The individuals from East Asia are categorized into the Han populations, northern ethnic minorities and southern ethnic minorities
Gene flow among different intercontinental populations
Compared with previous reports, the larger sample size and number of SNP loci in this study allowed us to better examine admixture and recent gene flow between the Chinese Korean ethnic group and the reference populations through computation of the f statistics. As indicated in Fig. 5A, the outgroup-f3 statistics showed that the 74 populations (all except the Korean_C and Mbuti populations) could be separated into different genetic clusters based on pairwise shared genetic drift. The Chinese Korean ethnic group shared the most genetic drift with the East Asia populations, followed by South Asia, Central South Asia, Middle East, Europe, America, Oceania and Africa populations. Furthermore, we extracted the outgroup-f3 statistics calculated for the East Asian populations to determine the relationships between the Chinese Korean ethnic group and the other East Asian reference populations. As shown in Fig. 5B, the Northern Han, Miao, Daur, Beijing Han, Southern Han, Tujia, Japanese and Mongolian populations shared more genetic drift with the Chinese Korean ethnic group, consistent with the pairwise FST and PCA results. We also used the f4 statistics to assess the shared genetic ancestries or admixture among the East Asia populations. The f4 statistics showed significantly more genetic exchange between the Chinese Korean ethnic group and the Northern Han, Beijing Han, and Southern Han populations than with the Dai, Tu, Vietnam, Naxi, Hezhen, Yakut, and Cambodian populations (f4 < 0, Z > 2) (Fig. 5C–E).
However, when the Miao, Daur, Tujia, Japanese, Mongolian, She, Yi and Xibo populations were considered as the comparison populations, no significant genetic exchanges were observed between the Chinese Korean ethnic group and the Northern Han, Beijing Han, and Southern Han populations (f4 < 0, Z < 2), suggesting that the Chinese Korean ethnic group might share more genetic affinities with Han populations across different regions and with the Daur, Tujia, Japanese, Mongolian, Miao, She, Yi and Xibo populations.
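The f statistics used here are averages of products of allele-frequency differences. The sketch below shows the basic point estimates for an outgroup-f3 of the form f3(O; A, B) and an f4 of the form f4(A, B; C, D). It is a simplified illustration: it omits the finite-sample bias correction and the block-jackknife standard errors that are needed to obtain Z scores, and the function names are hypothetical rather than those of any particular software package.

```python
def outgroup_f3(outgroup, pop_a, pop_b):
    """Mean of (o - a) * (o - b) across loci.

    Larger values indicate more genetic drift shared by A and B since
    their split from the outgroup. Inputs are per-locus allele frequencies.
    """
    products = [(o - a) * (o - b) for o, a, b in zip(outgroup, pop_a, pop_b)]
    return sum(products) / len(products)


def f4(pop_a, pop_b, pop_c, pop_d):
    """Mean of (a - b) * (c - d) across loci.

    In a configuration such as f4(outgroup, target; X, Y), a negative value
    means the target's allele frequencies covary with X more than with Y.
    """
    products = [
        (a - b) * (c - d) for a, b, c, d in zip(pop_a, pop_b, pop_c, pop_d)
    ]
    return sum(products) / len(products)
```

With the configuration [Mbuti, Korean_C; X, Y], a negative f4 arises when Korean_C allele frequencies track those of X more closely than those of Y, which is the direction of the comparisons reported in the text.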
Fig. 5 Gene flow estimated among the Chinese Korean ethnic group and the reference populations. A Pairwise outgroup-f3 of the Chinese Korean ethnic group and the reference populations. The color gradient ranges from green to red, corresponding to outgroup-f3 values from low to high. The map shows the approximate geographic distribution for each population; B Pairwise outgroup-f3 of the Chinese Korean ethnic group and the East Asian reference populations; C–E Distribution of f4 values from Dstats tests under the model of [Mbuti, Korean_C; X, Y], where X represents the Han populations of different regions, Y represents different East Asian populations and the Mbuti population serves as an outgroup
Phylogeny reconstruction of the overall populations using diverse algorithms
It is widely acknowledged that modern humans originated from an African population, making it a useful starting point for analyzing evolutionary relationships and interpreting the genetic tree. We inferred unrooted phylogenies for all 76 populations based on both pairwise FST values (Fig. 6A) and the genotype data of the high-density SNP loci (Fig. 6B). The results indicated that populations from the same continent shared a subbranch in the population-level tree (Fig. 6A). The Chinese Korean ethnic group was positioned within the East Asia cluster. We could easily distinguish seven major population clusters, representing Africa, Europe, East Asia, South Asia, Middle East, America, and Oceania populations. In contrast, the Central South Asia populations were dispersed across the phylogenetic tree. However, the individual-level phylogenetic tree showed evidence of closer phylogenetic relationships between some Americans and Europeans, between Europeans and Middle East individuals, and between South Asians and Central South Asians (Fig. 6B).
Fig. 6 Phylogenetic relatedness of the Chinese Korean ethnic group and the reference populations. A Population-level maximum-likelihood phylogenetic tree; B Individual-level maximum-likelihood phylogenetic tree; C Population-level maximum likelihood tree and pairwise residuals for the phylogenies after accounting for four migration events. The scale bar shows the average standard error of the entries in the covariance matrix
We also applied a model based on population allele frequencies to account for variance introduced by secondary migration events and to test whether admixture was confounding the phylogeny. The maximum likelihood (ML) trees with one to seven admixture events (∆M = 1–7) were generated for 20 iterations with the TreeMix software. Based on the distributions of ∆M, the optimal number of admixture events for the Chinese Korean ethnic group and the reference populations was indicated to be ∆M = 4. The ML tree for the 76 populations with four admixture events (∆M = 4) is shown in Fig. 6C. These results showed that Africa, East Asia, Europe, and South Asia populations consistently clustered together, forming distinct branches in the tree. In contrast, the Central South Asia, Oceania, and Middle East populations did not exhibit a strong tendency to form distinct clusters. In this study, the gene flow events mainly occurred from African populations to American, European, and Oceanian populations. When up to seven migration events were included in the model, we further detected admixture from European to American populations. However, we did not detect direct admixture between the Chinese Korean ethnic group and the reference populations even after allowing for seven migration events. According to the phylogenetic reconstruction and admixture event estimation analyses, the Chinese Korean ethnic group was genetically more similar to the East Asia populations. In addition, the gene pool of the Chinese Korean ethnic group was relatively less influenced by contributions from other intercontinental origins. The results of the TreeMix analyses and pairwise residuals for one to seven migration events are shown in Additional file 1: Figs. S5 and S6.