The subjects used in this study were included in the Arao cohort study, which consists of community-dwelling elderly Japanese individuals in Arao city, Kumamoto prefecture, Japan. The cohort study is a part of The Japan Prospective Studies Collaboration for Aging and Dementia (JPSC-AD) [22]. We analyzed comprehensive SNP genotyping data and previously genotyped 5-HTTLPR information [14] of subjects who consented to genetic analysis and had no missing values (N = 1456, 39.3% male). All subjects were 65 years of age or older, with a mean and standard deviation (SD) of 74.3 and 6.6 years, respectively. The current study was approved by the ethics committee of Kumamoto University.
Genotyping with SNP microarray
Genotyping and imputation of this cohort were conducted as a part of the JPSC-AD study [22] and are described elsewhere (Furuta et al., in revision). In brief, an Illumina Japanese Screening Array (Illumina, San Diego, CA, USA), which allows genotyping of approximately 730,000 SNPs, was used for genotyping. Quality controls were performed at the sample level (e.g., gender discrepancy, sample call rate, close relatives, and outliers) and at the SNP level (e.g., Hardy-Weinberg equilibrium (HWE), call rate, monomorphic SNPs, minor allele count, and frequency difference with reference panels). The imputation was performed using Minimac4 v1.0.0 (https://github.com/statgen/Minimac4). The reference panels included 1037 whole genome sequences from BioBank Japan and 2504 publicly available whole genome sequences from the 1000 Genomes Project (Phase 3v5) [23].
Tag SNPs for the four major 5-HTTLPR alleles
All SNPs on the long arm of chromosome 17, where the SLC6A4 locus is located, were extracted from the SNP array data of the 1456 subjects. SNPs used for the 5-HTTLPR genotype imputation were further selected as follows: (1) biallelic, (2) minor allele frequency (MAF) > 0.01, and (3) P-value of the test for HWE > 0.001. Next, linkage disequilibrium (LD) pruning was conducted for the SNPs selected above under the following conditions: LD threshold, 0.8; window size, 500; and window increment, 50. After LD pruning, 30,242 SNPs remained. Linear regression analysis was conducted to select tag SNPs for the four major 5-HTTLPR alleles (i.e., S14a, L16a, L16c, and L16d) from the 30,242 SNPs using SNP & Variation Suite version 8.9.1 (https://www.goldenhelix.com/products/SNP_Variation/). For the linear regression analysis, the genotype of each SNP, the explanatory variable in the model, was coded as the number of minor alleles (i.e., 0, 1, or 2). The dependent variable was the number of focal 5-HTTLPR alleles (0, 1, or 2) possessed by each subject. Although using more SNPs can tag each 5-HTTLPR allele more precisely, a large number of SNPs is not practical for genotype imputation using PHASE software [24, 25]. Therefore, in this study, we attempted to select 30 or fewer tag SNPs. For each 5-HTTLPR allele, we first selected the single SNP with the lowest P-value from the F-test in the simple regression analysis. The F-test indicates whether the linear regression model provides a better fit to the data than a model without explanatory variables. Incidentally, in simple regression analysis, the P-value of the F-test is equal to the P-value of the t-test used to examine whether the slope is significantly different from zero. Then, after adding that SNP to the explanatory variables, the second SNP with the smallest P-value obtained from the F-test was selected. 
In this way, all SNPs identified up to that point were added to the explanatory variables, and the process of selecting the next SNP was repeated. When the P-value for the F-statistic exceeded 10⁻¹⁰, all SNPs selected up to that point, excluding the current SNP, were considered the tag SNPs for the focal 5-HTTLPR allele. Finally, 28 tag SNPs were selected for the four 5-HTTLPR alleles.
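The stepwise procedure described above can be illustrated with a minimal sketch. It is not the SNP & Variation Suite implementation: it assumes that, once at least one SNP is in the model, the F-test compares the model with the candidate SNP against the model without it (a partial F-test, which coincides with the overall F-test at the first step). The function names (`rss`, `f_sf`, `forward_select`) and the parameters `alpha` and `max_snps` are hypothetical and introduced only for illustration.

```python
import math

def _betacf(a, b, x, max_iter=500, eps=1e-14, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        even = m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))
        for term in (even, odd):
            d = 1.0 + term * d
            c = 1.0 + term / c
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def f_sf(f, df1, df2):
    """Upper-tail probability P(F > f) of an F(df1, df2) distribution."""
    if f <= 0.0:
        return 1.0
    return betainc(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f))

def rss(y, columns):
    """Residual sum of squares of OLS of y on an intercept plus `columns`.
    Returns None if the design matrix is numerically singular."""
    n = len(y)
    design = [[1.0] * n] + [[float(v) for v in col] for col in columns]
    k = len(design)
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    aug = [[sum(p * q for p, q in zip(design[i], design[j])) for j in range(k)]
           + [sum(p * q for p, q in zip(design[i], y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-9:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    fitted = [sum(b * design[j][i] for j, b in enumerate(beta)) for i in range(n)]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def forward_select(y, snps, alpha=1e-10, max_snps=30):
    """Greedy forward selection; `snps` maps SNP id -> minor-allele counts (0/1/2).
    `max_snps` is a hypothetical safety cap, not a rule taken from the text."""
    selected = []
    while len(selected) < max_snps:
        base = [snps[s] for s in selected]
        rss_reduced = rss(y, base)
        df2 = len(y) - len(selected) - 2  # residual df of the enlarged model
        best_id, best_p = None, None
        for snp_id, dosages in snps.items():
            if snp_id in selected:
                continue
            rss_full = rss(y, base + [dosages])
            if rss_full is None:
                continue  # skip monomorphic or collinear SNPs
            f_stat = (rss_reduced - rss_full) / (max(rss_full, 1e-300) / df2)
            p = f_sf(f_stat, 1, df2)
            if best_p is None or p < best_p:
                best_id, best_p = snp_id, p
        if best_id is None or best_p > alpha:
            break  # the current SNP is excluded and selection stops
        selected.append(best_id)
    return selected
```

Here `y` would hold the number of copies of the focal 5-HTTLPR allele per subject, and the function would be run once per allele (S14a, L16a, L16c, and L16d). The default `alpha` mirrors the 10⁻¹⁰ threshold stated in the text.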
5-HTTLPR imputation method in a multi-allelic manner
Using the 28 tag SNPs, genotype imputation of 5-HTTLPR was conducted using PHASE software version 2.1 (https://stephenslab.uchicago.edu/phase/download.html) [24, 25]. Because 16 5-HTTLPR alleles were found in our dataset, they were encoded as integers in the input file of the PHASE software. The prediction performance of genotype imputation for the four major 5-HTTLPR alleles was evaluated by four-fold cross-validation. Specifically, data from the 1456 subjects were divided into four subsets, each comprising 364 subjects. Each subset in turn was regarded as the test dataset (N = 364 without 5-HTTLPR genotype information), and the remaining three subsets were regarded as the training dataset (N = 1092 with 5-HTTLPR genotype information). Genotype imputation for 5-HTTLPR on the test dataset was thus performed four times, alternating the training and test datasets. After genotype imputation, the 5-HTTLPR alleles other than S14a, L16a, L16c, and L16d were regarded as X. Results from the four cross-validation runs were aggregated, and overall accuracy, recall, and precision were calculated for the aggregated data (N = 1456).
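The cross-validation bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the scripts used in the study: the random split with a fixed seed, the genotype-level definition of accuracy, and the allele-count definition of recall and precision are all assumptions, and the names `four_fold_indices`, `collapse`, and `evaluate` are hypothetical. The imputation step itself (PHASE) is not reproduced.

```python
import random
from collections import Counter

MAJOR_ALLELES = {"S14a", "L16a", "L16c", "L16d"}

def four_fold_indices(n, k=4, seed=0):
    """Split subject indices 0..n-1 into k equal folds (assumes n divisible by k)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # assumed: random assignment to folds
    size = n // k
    return [indices[i * size:(i + 1) * size] for i in range(k)]

def collapse(allele):
    """Map every allele other than the four major ones to 'X'."""
    return allele if allele in MAJOR_ALLELES else "X"

def evaluate(true_gts, pred_gts):
    """Aggregate accuracy, per-allele recall, and per-allele precision.

    Each genotype is an (allele1, allele2) pair; allele order is ignored.
    """
    correct = 0
    tp, n_true, n_pred = Counter(), Counter(), Counter()
    for true_gt, pred_gt in zip(true_gts, pred_gts):
        t = Counter(collapse(a) for a in true_gt)
        p = Counter(collapse(a) for a in pred_gt)
        correct += t == p        # genotype counted as correct if both alleles match
        n_true.update(t)
        n_pred.update(p)
        tp.update(t & p)         # per-allele minimum of true and predicted counts
    accuracy = correct / len(true_gts)
    recall = {a: tp[a] / n_true[a] for a in n_true}
    precision = {a: tp[a] / n_pred[a] for a in n_pred}
    return accuracy, recall, precision
```

With 1456 subjects, `four_fold_indices(1456)` yields four folds of 364 indices each. The predictions from the four test folds would be concatenated before a single call to `evaluate`, matching the aggregation over N = 1456 described in the text.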
Imputation using a previously reported method
A 5-HTTLPR imputation method for estimating the S or L allele was previously reported using the Family Transitions Project (FTP) and the Center for Antisocial Drug Dependence (CADD)/the Genetics of Antisocial Drug Dependence (GADD) datasets, which primarily comprise individuals of European descent [21]. In this method, 5-HTTLPR was treated as a single biallelic SNP at hg19 coordinates chr17: 28,564,497. We replicated this method using only subjects carrying the four major alleles (N = 1387), based on imputed SNPs in the same region as in the previous paper (hg19 coordinates chr17: 27,064,497 to 30,064,497). To manipulate the Variant Call Format (VCF) file containing the imputed SNP information of the analyzed subjects, we used BCFtools version 1.18 (https://github.com/samtools/bcftools). To convert our reference VCF file to a fixed M3VCF file for use with Minimac4, we used Minimac3 version 2.0.1 (https://github.com/Santy-8128/Minimac3). To convert the M3VCF file to an MVCF file and perform imputation, we used Minimac4 version 4.1.6 (https://github.com/statgen/Minimac4). We classified 5-HTTLPR genotypes into three groups (i.e., S/S, S/L, and L/L) and examined the accuracy of imputation with Minimac4 in each cross-validation trial. Results of the cross-validation trials were aggregated to calculate the overall accuracy (N = 1387).
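The region and the three-group classification used in this replication can be illustrated with a short sketch. It assumes that the first letter of each allele name (S14a, L16a, L16c, L16d) gives its S or L class, and the names `region`, `to_sl`, and `accuracy` are hypothetical. The Minimac4 imputation itself is not reproduced here.

```python
FOCAL_POS = 28_564_497  # hg19 chr17 position at which 5-HTTLPR is placed
FLANK = 1_500_000       # half-width implied by the reported interval

def region(pos=FOCAL_POS, flank=FLANK):
    """Return the (start, end) interval centred on the focal position."""
    return pos - flank, pos + flank

def to_sl(genotype):
    """Collapse a pair of allele names to 'S/S', 'S/L', or 'L/L'.

    Assumption: the first character of each allele name is its S/L class.
    """
    classes = sorted((allele[0] for allele in genotype), reverse=True)
    return "/".join(classes)

def accuracy(true_gts, pred_gts):
    """Fraction of subjects whose S/L genotype group is predicted correctly."""
    hits = sum(to_sl(t) == to_sl(p) for t, p in zip(true_gts, pred_gts))
    return hits / len(true_gts)
```

Calling `region()` returns the interval 27,064,497 to 30,064,497 stated in the text. Because `to_sl` reads only the first character, it accepts either full allele names or plain `"S"`/`"L"` calls.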
Data processing and statistical analysis
General data processing was performed using R packages and in-house scripts (R Foundation for Statistical Computing, Vienna, Austria, https://www.R-project.org/) or in a CentOS Linux release 7.5.1804 environment. Scripts are available upon request.