We analyzed the genetic and phenotypic data for SPARK [27] and SSC [28], downloaded from the Simons Foundation Autism Research Initiative (SFARI) base. The analysis of SPARK and SSC data was reviewed and approved by institutional review board at the University of Pennsylvania (IRB protocol number: 825701). The iPSYCH study was approved by Danish Data Protection Agency and the Scientific Ethics Committee in Denmark. The study is part of a PhD dissertation [30].
1000 genomes projectWe used unrelated participants from the 1000 Genomes Project as the reference ancestry populations for the principal component analysis (PCA), specifically, the Utah Residents with Northern and Western European Ancestry from the United States (CEU), Yoruba from Ibadan, Nigeria (YRI), Han Chinese from Beijing, China (CHB), Japanese from Tokyo, Japan (JPT), Toscani from Italia (TSI), Finnish from Finland (FIN), British from England and Scotland (GBR), and Iberians from Spain (IBS) [29]. We kept autosomal SNPs with call rate \(\ge\) 95% and Hardy–Weinberg Equilibrium (HWE) \(\ge\) 1 × 10–5 in each population. We used CEU, YRI, CHB and JPT as the reference population for the first PCA to identify individuals with European ancestry. CEU, TSI, FIN, GBR and IBS were used as the reference population for the second PCA to better delineate European ancestry. We kept SNPs with minor allele frequency (MAF) \(\ge\) 1 × 10–2 in the reference populations (Fig. S1A).
SPARKGenotypingSPARK is an autism research initiative recruiting autistic probands and their families in the United States [27]. Participants were recruited from 32 clinical sites in the United States (Table S1) and were asked to complete a detailed questionnaire. The SPARK 201909 release (202002 update) includes 27,072 participants genotyped using Illumina Global Screening Array (GSA) v1 design using the Genome Reference Consortium Human Build 38 (GRCh38 human genome build).
Genotyping quality controlWe removed participants who withdrew from the study and participants with questionable phenotypes (including lower confidence in autism diagnosis and suspected confounders to autism diagnosis including medical complications; more in supplemental notes). We restricted our analysis to participants from families with both parents, the proband and at least one unaffected sibling. We removed participants with more than 5% missing genotypes, and related families (closer than 2nd degree) based on kinship coefficients estimated using Kinship-Based Inference for genome-wide association studies (GWAS) (KING) [31]. Autosomal bi-allelic SNPs with call rate \(\ge\) 95% were used in the analysis [5]. We restricted the analysis to SNPs that were common to both 1000 Genomes and SPARK. We excluded SNPs in regions of extended LD [32, 33], SNPs with MAF < 0.01 or HWE < 1 × 10−5. SNPs with greater than 5% Mendelian error rate were removed (Fig. S1B). We ended up with 322,042 SNPs in the analysis.
SSCGenotypingSSC is ascertained in a slightly different manner to SPARK. SSC is a collection of more than 2000 families who have only one autistic child in each family [28]. SSC families were recruited from 12 sites (Table S2). SSC participants were genotyped on one of three platforms: Illumina 1Mv1 (n = 1354), Illumina 1Mv3 (n = 4626), or Illumina Omni2.5 (n = 4240) on Homo sapiens genome assembly National Center for Biotechnology Information NCBI36. For SSC, all genotypes were mapped to GRCh38 using LiftOver [34].
Genotyping quality controlWe kept the SNPs that were common to all three of the SSC genotyping platforms and combined the SSC datasets. Participants who withdrew from the study were excluded. Family relationships were evaluated using KING by estimating kinship coefficients for all pairwise relationships [31]. Genotype patterns were consistent with the stated family relationships in all SSC families and no relationships of 2nd degree or closer were detected across families. We restricted the analysis to autosomal bi-allelic SNPs with call rate \(\ge\) 95% that were common to 1000 Genomes and SSC. We excluded SNPs in regions of extended LD [32, 33], SNPs with MAF < 0.01, SNPs with Mendelian error rate greater than 5%, and SNPs with HWE p-value < 1 × 10−5 (Fig. S1C). The total number of SNPs in the final analysis is 486,963.
Principal components analysis for ancestry estimationFirst, the continental ancestry for each SPARK and SSC participant was estimated. To do this, we used unrelated CEU, YRI, CHB, and JPT participants from the 1000 Genomes data. The PCA was performed using PLINK [35]. We used the first and second PCs to identify participants of European ancestry and removed all non-European participants from the analyses. Participants were assumed to be of European ancestry if their average PC1, and their average PC2 values were each both closer to that of the CEU participants than to that of the YRI and CHB/JPT participants (Fig. S2). We ended up with 1863 quartets and 420 trios in SSC, and 1586 quartets in SPARK. In cases where a family in SPARK has more than one unaffected sibling (n = 37), we prioritize selecting the sibling who shares the same sex as the proband, if available, and who is closest in age to the proband. We further excluded 11 SPARK families in which one or both parents had an autism diagnosis.
Next, we used 1000 Genomes participants of European ancestry (CEU, TSI, FIN, GBR, and IBS) to perform a second round PCA to better characterize the European ancestry in SPARK and SSC (Fig. S3). PC loadings from this round of PCA were used in the rest of the analyses. The absolute eigenvalues of PC1 from this round of PCA was used to identify ancestry-informative SNPs (SNPs that loaded the most heavily on |PC1|) in the intra-locus and inter-locus correlations analyses.
Autism and intelligence polygenic scoresWe used SNP effect sizes and standard errors estimated from an external autism GWAS with 19,870 autistic individuals and 39,078 controls from the Danish Integrative Psychiatric Research (iPSYCH) consortium [36] to calculate autism PGS in SPARK and SSC.
To calculate the intelligence PGS, we used a large scale intelligence GWAS summary statistics [37]. SNPs that passed genotyping quality control and were common to the autism or intelligence GWAS summary statistics, 1000 Genomes, and the target dataset (SPARK or SSC) were used in the analysis.
PGS were calculated using LDpred2 [38]. LDpred2 adjusts the effect sizes from GWAS summary statistics by conditioning on a genetic architecture prior (the heritability explained by the genotypes and the fraction of causal markers) and LD information from a reference panel. We used the parents in SPARK or SSC for the LD references. We ran LDpred2 genome-wide using the ‘auto’ option to let LDpred2 automatically estimate the sparsity, p, and the SNP heritability, h2, from the summary statistics. The correlation between SNPs were calculated in a window size of 3 cM. For autism (a binary trait) PGS, we use SDss denote the standard deviations derived from the summary statistics, which for a binary trait, SDss = \(\frac }}} \right)\sqrt } }}\), where \(n_ = \frac }}} \right. \kern-0pt} }} + }}} \right. \kern-0pt} }}}}\), \(se\left( }}} \right)\) is the standard error of the effect of variant j. SDtest denote the standard deviations of genotypes of participants in the study population (SDtest = \(\sqrt *\left( } \right)}\) where AFtest is the minor allele frequency of founders in the target population). As recommended by the authors of LDpred2, SNPs with SDss < 0.5·SDtest or SDss > 0.1 + SDtest or SDss < 0.1 or SDtest < 0.05 were removed (nSPARK = 60, nSSC = 14) [38]. Missing genotypes (< 5%) in SPARK and SSC were imputed with mean using snp_fastImputeSimple() function with “method = mean2” option in the bigsnpr [39] R package [40]. There were a total of 300,201 SNPs in SPARK and 475,058 SNPs in SSC included in the autism PGS calculation. There were a total of 307,058 SNPs in SPARK and 481,349 SNPs in SSC included in the intelligence PGS calculation. We used the polygenic transmission disequilibrium test (pTDT) [22] to evaluate if the polygenic influence of autism is over transmitted to autistic probands.
Cognitive impairment and intellectual disability in autistic probandsProbands in SPARK were assessed for cognitive impairment (CI), whereas probands in SSC were assessed for intellectual disability (ID). We divided families in SPARK and in SSC by whether the proband had cognitive impairment (in SPARK) or intellectual disability (in SSC) (autism w/ CI/ID) or not (autism w/o CI/ID) to evaluate if the degree of AM is different between these families. Criteria for likely cognitive impairment in SPARK were defined by nine variables related to the cognitive development of each proband (nautism w/CI family = 707, nautism w/o CI family = 867) (supplemental notes). In SSC, probands with full scale IQ < 70 were classified as having intellectual disability (nautism w/ID family = 659, nautism w/o ID family = 1618).
Correlations between spouses’ phenotypes, ancestry, and PGSThe Social Responsiveness Scale (SRS) adult version [18] obtained from an informant (mother reported on father and father reported on mother), and the self-reported Broad Autism Phenotype Questionnaire (BAPQ) [19] were available in SSC for parents. The results of these questionnaires were used as quantitative endo-phenotypes to better understand the genetic architecture of autism. The SRS adult version informant questionnaire measures core autistic traits on a continuous scale [18], and is made up of subscales which evaluate Awareness, Cognition, Mannerisms, Motivation, and Communication respectively [18]. The BAPQ self-report questionnaire measures the broader autism phenotype in three subscales: Aloof, Rigid, and Pragmatic Language [19]. The correlations between spouses’ measures of quantitative autistic traits in SPARK and SSC were evaluated using Spearman’s correlation coefficient [41]. The correlations between spouses’ genetic ancestry (the top two PCs from the PCA with 1000 Genomes participants of European ancestry), as well as autism and intelligence PGS (adjusted for age, sex, and the first 10 PCs from the PCA with 1000 Genomes participants of European ancestry) were evaluated using Pearson’s correlation coefficient. Spousal correlations in autism w/o CI/ID and autism w/ CI/ID families were compared using Fisher’s z-test with the cocor package [42] in R [40]. To adjust the significance level, we used the Bonferroni correction (divided the original Type I error rate, \(\alpha\) (0.05) by the number of tests (n = 336, including subgroup analyses, see below)).
To investigate to what extent the spousal correlations of quantitative autistic traits could be explained by autism and intelligence PGS, genetic ancestry PCs, and demographic variables including sex, age, and highest education (predictors), we first built univariate regression models with parents’ SRS or BAPQ total scores as the dependent variable, and one of the predictors as the independent variable. Then, we built a full model with all predictors as independent variables. We reported the adjusted R-squared as a measure of the proportion of variance in parents’ SRS or BAPQ total scores that was explained by the independent variable(s). We then took the residuals of SRS and BAPQ total scores from each model (with sex variable removed) and recalculated the parents’ correlations of the residuals.
Carriers of rare variants with large effectRare de novo protein truncating variants (PTVs) and copy number variants in SPARK and SSC probands included in this paper have been analyzed and reported previously [43]. We identified all probands with a de novo PTV or copy number deletion in 373 neurodevelopmental disorder (NDD) related genes reported in Fu et al. [43] as carriers (ncarrier-SPARK = 74, ncarrier-SSC = 88).
Subgroup analysesSSC is a simplex collection while SPARK includes both simplex (n = 1008) and multiplex families (n = 157; there are 410 families with unknown family type). There is also evidence supporting different genetic architectures of autism in males and females [44], and among individuals with and without a rare variant of large effect [23]. Therefore, we decided to compare the pattern of AM of autism w/ and w/o CI/ID in subgroups defined by above variables. To ensure adequate statistical power, we require each of the subgroups has a sample size of at least 85 (80% power to detect a correlation of 0.3 [14], assuming a Type I error rate of 0.05, Tables S3, S4). After examining the sample size, we compared the pattern of AM of autism w/ and w/o CI/ID in the following subgroups: simplex families in SPARK, as well as simplex families with male probands, simplex families with female probands, simplex families with probands without a rare de novo PTV or deletion in NDD genes, and simplex families with male probands without a rare de novo PTV or deletion in NDD genes in both SPARK and SSC. We also compared the pattern of AM between families with female probands and families with male probands in both SPARK and SSC.
Intra-locus correlations in SPARK and SSCWe first pruned SNPs using PLINK [35] with window size of 500 kb to remove SNPs with r2 > 0.1 from the SPARK dataset (now 90,621 SNPs). Then we used SNPRelate [45] in R [40] to randomly select SNPs that are at least 500 kb apart. This step was repeated 1000 times to create 1000 datasets, and each dataset contained approximately 3700–3800 SNPs. From the SNPs selected in each iteration, we identified the top 200 SNPs that loaded the heaviest on |PC1| and the bottom 200 SNPs that loaded the least on |PC1|. These top 200 SNPs are the SNPs that were most ancestry-informative for detecting population substructure, whereas the bottom 200 SNPs for |PC1| were less ancestry-informative and were used to serve as “controls”. These “control” SNPs were not neutral SNPs but were less ancestry-informative.
These steps were repeated in SSC quartets (nfamily = 1863). After pruning, there were 63,583 SNPs left in SSC. Each of the 1000 iterations of randomly selecting SNPs that were 500 kb apart ended up with each dataset containing approximately 3600–3700 SNPs.
We calculated the intra-locus correlation coefficient using Wright’s F statistic, for the more ancestrally informative (top 200) and less ancestrally informative (bottom 200) SNPs in SPARK and SSC respectively. Consider a single bi-allelic marker (SNP) with alleles A, and a. If the observed number of Aa heterozygotes is \(n_\) and the expected number of Aa heterozygotes assuming Hardy–Weinberg Equilibrium (HWE) is \(n_\), then Wright’s F is:
$$F = 1 - \frac }} }}$$
(1)
We compared the distribution of Wright’s F between more ancestrally informative (top 200) SNPs for |PC1| and less ancestrally informative (bottom 200) SNPs for |PC1| in fathers, mothers, and unaffected siblings separately. Standard deviation was computed based on the mean Wright’s F distribution of the 1000 iterations. Two sample t-test was used to compare the mean Wright’s F between more ancestrally informative SNPs and less ancestrally informative SNPs. If there is no intra-locus correlation, the distribution of the mean Wright’s F for the more ancestrally informative and less ancestrally informative SNPs should not be statistically significantly different from each other.
To evaluate intra-locus correlations with a larger set of SNPs, we repeated this analysis for top 1000 SNPs for |PC1| and bottom 1000 SNPs for |PC1|.
Inter-locus correlations in SPARK and SSCNext, we calculated the inter-locus correlation coefficient, using a linkage disequilibrium (LD) parameters D2 between two markers on different chromosomes. Consider two bi-allelic markers, the first SNP with alleles A and a; and the second SNP with alleles B and b. If the observed proportion of AB haplotypes is \(p_\), the observed proportion of the A allele is \(p_\), and the observed proportion of the B allele is \(p_\), then:
$$D^ = \left( - p_ p_ } \right)^$$
(2)
Haplotype frequencies were calculated using the Expectation–Maximization algorithm [46]. We calculated D2 between pairs of SNPs that are more ancestry-informative (top 200 SNPs for |PC1|) and between pairs of SNPs that are less ancestry-informative (bottom 200 SNPs for |PC1|) on different chromosomes in fathers, mothers, and unaffected siblings separately. D2 were calculated separately in fathers, mothers, and unaffected siblings because these values are sensitive to the sample size used [47]. Standard deviation was computed based on the mean D2 distributions of the 1000 iterations. Two sample t-test was used to compare the mean D2 for more ancestrally informative SNPs to less ancestrally informative SNPs. If there is no inter-locus correlation, the mean D2 values for the more ancestry-informative SNPs should not be different from those calculated for the less ancestry-informative SNPs.
To evaluate inter-locus correlations with a larger set of SNPs, we repeated this analysis for top 1000 SNPs for |PC1| and bottom 1000 SNPs for |PC1|.
Quantification of assortative mating on autismTo quantify assortative mating on autism, we estimated the correlation (\(\theta\)) between genetic predictors of autism from SNPs on odd chromosomes and even chromosomes [48]. Following the method developed in Yengo et al., in parents of SPARK and SSC, we first selected SNPs on odd and on even chromosomes. We then conducted PCA in PLINK [35] to get the top 20 PCs using LD pruned SNPs (r2 > 0.1, > 1 Mb apart) on odd (\(PCO\)) and on even (\(PCE\)) chromosomes. Autism PGS from SNPs on odd (\(S_\)) and even (\(S_\)) chromosomes were calculated with the iPSYCH autism GWAS [36] summary statistics with SNP effect sizes adjusted by LDpred2 [38]. We fit the following two regressions to test \(\theta\):
$$\begin S_ & = \theta S_ + PCE_ + \cdots + PCE_ \\ S_ & = \theta S_ + PCO_ + \cdots + PCO_ \\ \end$$
Comments (0)