Patterns of cytonuclear discordance and divergence between subspecies of the scarlet macaw (Ara macao) in Central America

Samples, DNA extraction, and genome sequencing

To develop a large number of nuclear genetic markers for our phylogenetic investigations, we utilized next-generation sequencing data (Illumina). At the time of our study, the number of Ara genomes publicly available was limited, so we were required to generate our own datasets. To do this, we selected samples of both A. macao subspecies that were obtained from Central America. Two samples came from Laguna del Tigre National Park in Guatemala, and an additional four samples were from Costa Rica. These Costa Rican samples were captive parrots that were part of the Ara Project (https://macawrecoverynetwork.org/). Although these birds were assumed to derive from Costa Rica, the specific source locations were unknown as these parrots were either confiscated birds or pets that had been donated by their owners. Subspecific taxonomic designations for all six samples were previously determined based on microsatellite analysis (Hains 2015). The two Guatemalan samples were classified as A. m. cyanoptera, as was one of the Costa Rican samples (Table 1, Table S1). The other three Costa Rican samples were classified as A. a. macao. For outgroup comparisons, we also sequenced a single sample of the red-and-green macaw, A. chloropterus.

Table 1 Sample information, taxonomic designation, country of origin, sex, and relative diversity estimate (π) for all samples used in this study. The Guatemalan samples were both from Laguna del Tigre National Park. The Costa Rican samples were captive parrots whose specific source locations were unknown (see text for more details)

From each sample, we extracted DNA from whole blood using the DNeasy Blood & Tissue Kit (QIAGEN Inc.), following the manufacturer’s recommended protocol. Next, we sheared this extracted DNA with a Covaris ultrasonicator (Covaris, Woburn, MA). With the resulting fragments, we constructed standard 2 × 150 nucleotide libraries with barcoded adapters using the Illumina TruSeq Library Preparation kit following the standard protocol (Illumina, San Diego, CA). After library preparation we combined the barcoded samples in two separate pools and these multiplexed libraries were sequenced on an Illumina HiSeq X Ten at the New York Genome Center (NYGC).

Read mapping and variant calling

To our seven previously unpublished Illumina read sets (see above), we added comparable published data (raw Illumina reads) from a single Brazilian sample of A. m. macao (Seabury et al. 2013), as well as one sample each of the outgroups A. militaris and A. chloropterus (Hains et al. 2020). Independently for each of these 10 samples’ read data, we used the program Trim Galore (https://github.com/FelixKrueger/TrimGalore) to trim bases from read ends with a quality score (Q score) less than 20. Any subsequent read pair for which either read was less than 30 nucleotides long was then removed from the dataset. Next, we used the program BWA v.0.7.12 (Li 2013) with the ‘MEM’ algorithm to map our trimmed reads to the Ara macao reference genome, v.1.1 (GenBank Accession: GCA_000400695.1; Seabury et al. 2013). Following read mapping, we used the tool ‘AddOrReplaceReadGroups’ within the Picard toolkit v.1.119 (http://broadinstitute.github.io/picard/) to add read groups and sort the mapped reads. We then identified and marked read duplicates using the Picard tool ‘MarkDuplicates’. We realigned indels using the ‘IndelRealigner’ tool in the Genome Analysis Toolkit (‘GATK’) v.3.8.1 (McKenna et al. 2010), then for each sample, we called variant sites using GATK’s ‘HaplotypeCaller’ (specific flags: --emitRefConfidence GVCF, --variant_index_type LINEAR, --variant_index_parameter 128000 -rf BadCigar).

Once all samples were processed, we collectively genotyped them using GATK’s ‘GenotypeGVCFs’ tool, producing one multi-sample variant call format (VCF) file with all samples and identified (‘called’) genetic variants. We used GATK’s ‘SelectVariants’ tool to limit our dataset to just single nucleotide polymorphisms (SNPs), then used this same tool to remove variants with a quality by depth less than 6 (QD < 6.0), Fisher strand bias greater than 40 (FS > 40.0), mapping quality less than 59 (MQ < 59.0), mapping quality rank sum less than − 0.3 (MQRankSum < -0.3), read position rank sum less than − 2 (ReadPosRankSum < -2.0), and a strand odds ratio greater than 2 (SOR > 2.0). These filtering thresholds were determined based on the observed variant distributions for these parameters (Figure S2), and all were equal to, or more stringent than, the developer’s recommended cutoffs (DePristo et al. 2011).

Reference genome annotation

To categorize the segregating variants in our SNP dataset (e.g., ‘intronic’, ‘missense’, ‘non-synonymous’, etc.), we produced gene predictions for the Ara macao reference genome using the program BRAKER2 v.2.1.5 (Brůna et al. 2021), incorporating AUGUSTUS v.3.4.0 (Stanke et al. 2006). The program GenomeThreader v.1.7.1 (Gremme et al. 2005) was used within BRAKER2 to generate training gene structures based on protein sequences from the annotated parrot (Psittaciformes) Melopsittacus undulatus (“budgerigar”, GenBank Accession: GCA_012275295.1). We then annotated our filtered SNP VCF file using the program SnpEff v.4.3 (Cingolani et al. 2012), with a custom annotation database built using the A. macao reference genome and the general feature format (GFF) file generated with BRAKER2.

Phylogenetic analyses

We conducted two phylogenetic analyses, one utilizing genetic markers from the nuclear genome and the other using assembled, whole mitochondrial genomes. To establish our nuclear genetic markers, we first removed mitochondrial regions from our VCF dataset using the ‘SelectVariants’ tool in GATK v.4.2.5. We then used the ‘VariantFiltration’ and ‘SelectVariants’ tools to designate any variant within a sample with a depth of coverage less than 8× as ‘no call’. Next, we used the ‘SelectVariants’ tool to retain only genetic variants for which all samples were genotyped (flag: --max-nocall-number 0). We then used the program bcftools v.1.15 (Li 2011) to select all 4-fold synonymous (‘silent’), segregating sites as annotated in the A. macao reference genome (see above). Our choice to use only 4-fold synonymous sites was made to minimize variance in divergence rates across sites (Wright and Andolfatto 2008), and allow for the effective application of a single mutation model. For variants present on the same scaffold, we used a sliding window of 50 SNPs at 10 SNP increments between windows to thin SNPs if their pairwise squared correlation (r2) was greater than 50% (Novembre et al. 2008). This was done using PLINK v.1.90b6.6 (Purcell et al. 2007). Our final VCF for nuclear phylogenetic analysis consisted of 8,443 unlinked, 4-fold synonymous genetic markers.

We converted this filtered VCF to PHYLIP format using the vcf2phylip.py v.1.5 python script (https://zenodo.org/record/1257058#.YJL3ymZKi6t). We then used jModelTest v.2.1.10 (Guindon and Gascuel 2003; Darriba et al. 2012) with default settings to select the best-fit model of nucleotide substitution for this dataset based on AIC score. The model that was selected was the transversion model (‘TVM’, Posada 2003). Using this model, we carried out a maximum-likelihood phylogenetic analysis with PhyML v.3.1 (Guindon et al. 2010), applying 100 non-parametric bootstrap replicates to determine confidence values for the observed relationships between samples. The resulting phylogenetic tree was visualized with FigTree v.1.4.4 (Rambaut 2018).

For the mitochondrial phylogenetic analysis, we used the ‘FastaAlternateReferenceMaker’ tool in GATK along with our quality filtered VCF file (see above) to produce mitochondrial genomes for each sample (reference scaffold: CM002021.1). These mitochondrial genomes were then combined into a single, aligned FASTA file. To determine the best partitioning scheme and nucleotide substitution model for this data we used PartitionFinder v.2.1.1 (Lanfear et al. 2016), considering all models. The first, second, and third positions for each of the 13 mitochondrial coding regions were examined separately, and the small-sample size corrected version of the Akaike Information Criterion (AICc) was used to select the best partitioning scheme. With this scheme (Table S2), we conducted a maximum-likelihood (ML) phylogenetic analysis using IQtree v.1.6.12 (Nguyen et al. 2015). To determine support for each node we generated 1000 ultrafast bootstrap approximation (UFBoot) replicates (Hoang et al. 2018).

Sample clustering and admixture

To examine sample clustering, we retained likely ‘neutral’ sites from our filtered VCF dataset (sample-SNP read depth ≥ 8× and no missing data across all samples; see above). We defined neutral sites as those annotated by us in the A. macao genome as ‘intronic’ or ‘synonymous’. We used only likely neutral sites to reduce the possibility that past differential selection acting on protein structure or gene expression would obscure historical phylogeographic patterns (Wright and Andolfatto 2008).

After filtering to retain only likely neutral sites, we removed the A. militaris outgroup sample, along with any resulting non-segregating sites. We then filtered the remaining sites for linkage as described above. This left us with 59,028 segregating, neutral, genetic markers. With these markers, we used a principal component analysis (PCA) to investigate clustering among the A. chloropterus and A. macao samples. This PCA was carried out with the program PLINK v.1.90b6.6 (Purcell et al. 2007), and the results were visualized using R v.4.0.2 (R Core Team 2020).

We also wanted to examine sample clustering utilizing just the A. macao samples. From our VCF file of likely neutral SNPs, we removed the three outgroup samples (one A. militaris and two A. chloropterus), and subsequently any variants that were no longer segregating. We then filtered for linkage as described above. This left us with 43,487 neutral, segregating genetic markers that we then used to perform a second PCA.

We also used this dataset of 43,487 neutral, segregating genetic markers to examine population structure and ancestry proportions among our A. macao samples with a maximum likelihood approach. This was done using the program ADMIXTURE v.1.3.0 (Alexander et al. 2009; Alexander and Lange 2011). For this analysis, we examined potential sample clusters (K) from one to five. Each K value was run 20 independent times with different seed values used for each run. Across K values, means observed for the standard error of the 10-fold cross-validation (CV) error estimate were compared to identify the best supported number of clusters represented in the data. Smaller mean CV values support greater confidence in the number of clusters modeled (Alexander et al. 2015). We used the online version of CLUMPAK (http://clumpak.tau.ac.il/, Kopelman et al. 2015), with default settings to determine the mean q-matrix cluster assignment for each sample, at each K value.

Genetic diversity and genome-wide divergence

To examine relative genetic diversity within each sample/population as well as divergence between samples and populations, we first generated a nuclear genetic dataset using the ‘VariantFiltration’ and ‘SelectVariants’ tools to designate any variant within a sample with a depth of coverage less than 15× as ‘no call’. This more stringent filter for read depth relative to the earlier analyses described was done to improve confidence in the called homozygous/heterozygous state for each site within each sample. Read depths of at least 15× have been shown to be sufficient to accurately genotype segregating sites in genomic data with greater than 98% confidence (Song et al. 2016).

After depth filtering, we used the ‘SelectVariants’ tool in GATK to retain only SNPs for which all samples were genotyped. We also used the ‘SelectVariants’ tool to retain only biallelic variants, then selected only likely ‘neutral’ sites for the reasons described above (annotated as ‘intronic’ or ‘synonymous’). After filtering, we retained 2,571 biallelic, neutral, segregating sites. We used PLINK v.1.90b6.6 to convert each sample’s genotype at each site to a numeric value (0 or 2 = homozygous; 1 = heterozygous), using the ‘-recodeA’ function. The resulting file was manually edited to remove the header and the excess columns generated by PLINK (e.g., population, sex, phenotype, etc.). We then used a custom Perl script to determine nucleotide diversity (π) for each sample individually at each site (Formula 10.5, Nei 1987). The mean π value was calculated across all examined sites. Although these mean values were not absolute measures of genome-wide diversity for these samples, they did allow for relative comparisons between samples, populations, and taxa. We also assessed nucleotide differentiation between pairs of samples (dXY) for each site (Eq. 10.20, Nei 1987), as well as the net number of nucleotide substitutions per site after accounting for within sample π (dA; Eq. 10.21; Nei 1987). Means for these metrics (dXY & dA) were then calculated for each pairwise-sample comparison across all examined sites.

Patterns of incomplete lineage sorting and introgression

We first wanted to assess the extent of phylogenetic incongruence between South American A. m. macao, and each of the two Central American A. macao subspecies. Based on our previous findings (Schmidt et al. 2020), we postulated greater rates of post-divergence genetic exchange between South and Central American A. m. macao, relative to South American A. m. macao and Central American A. m. cyanoptera. To assess this hypothesis, we utilized the ABBA/BABA test to calculate Patterson’s D statistic (Green et al. 2010), using the dataset of 59,028 segregating, neutral genetic markers as in our interspecific clustering analysis (A. militaris excluded; see above). The D statistic is determined by comparing shared genetic variation between three focal taxa or populations and an outgroup, and determining whether phylogenetically informative sites agree with the primary phylogeny (‘AABB’ sites), or support one of the two possible alternative relationships (‘ABBA’ or ‘BABA’ sites). We used the default parameters of the ‘Dtrios’ option within the program Dsuite v.0.4r38 to estimate the relative frequencies of AABB, ABBA, and BABA sites (Malinsky et al. 2021). From these, we also used Dsuite to calculate the D statistic. A D statistic that statistically deviates from zero (‘0’) suggests genetic exchange after divergence (‘introgression’), whereas a D statistic that does not deviate from zero supports incomplete linage sorting (ILS) as the primary cause of phylogenetic incongruence. Dsuite was used to determine statistical significance via a jackknife approach, subsampling blocks of variants. A result was considered significant at p ≤ 0.05. For this analysis, the two A. chloropterus samples were used as the outgroup, and we compared genetic variation between the Brazilian A. m. macao sample (P3), the A. m. macao samples from Costa Rica (P2), and the A. m. cyanoptera samples from both Costa Rica and Guatemala combined (P1). Because variation in geographic distance could affect patterns of introgression (i.e., Guatemala is geographically further away from South America than is Costa Rica), we also performed a second analysis, with only the Costa Rican A. m. cyanoptera sample (Guatemalan A. m. cyanoptera excluded).

We additionally wanted to examine the extent of possible genetic exchange between the two subspecies in Central America. Here we postulated that if post-divergence genetic exchange has occurred between Central American A. m. macao and A. m. cyanoptera populations, then the Central American A. m. macao samples (all from Costa Rica) would share more alleles with the A. m. cyanoptera from Costa Rica then with the A. m. cyanoptera from Guatemala. This analysis was conducted using the same methodologies as previously described, but with the A. macao only dataset of 43,487 neutral, segregating genetic markers (see above).

Comments (0)

No login
gif