Transcriptome-driven discovery of novel candidate genes for human neurological disorders in the telomere-to-telomere genome assembly era

The release of the first complete sequence of a human genome by the T2T Consortium unlocked previously hidden genomic regions for genetic analyses and exposed previously misassembled sequences. To assess its potential for discovering novel genes associated with human neurological disorders, multiple publicly available RNA-seq datasets were analysed using the T2T-CHM13 assembly and compared with results obtained using the GRCh38 assembly. Through this approach, 336 candidate genes for eight neurological disorders that were not annotated in the GRCh38 assembly were discovered. The subsequent sections discuss the results for each of the disorders, with a particular focus on the implications of this study for future research. Because of the large number of novel genes discovered in this study, the discussion concentrates on the most significant findings and on functionally related gene groups. It is important to note that this comparative study relies solely on transcriptome data. Therefore, no final conclusions about the functions of these novel genes can be drawn. Consequently, the following sections should be viewed as potential directions for future experimental research aimed at characterising these findings functionally.

Alzheimer’s disease

Among the three novel genes identified in induced neurons derived from Alzheimer’s disease patients was the gene LOC124906857, which encodes phosphoglucomutase-like protein 5. Phosphoglucomutase 5 is predominantly found in adherens junctions, which are believed to be involved in blood–brain barrier (BBB) permeability regulation in Alzheimer’s disease [25]. Hence, the upregulation of LOC124906857 may lead to higher BBB permeability and an influx of neurotoxic plasma-derived components, cells, and pathogens [26]. This suggests a potential modulating role for LOC124906857 in the disease, which should be validated in a suitable experimental model system.

Amyotrophic lateral sclerosis

Novel candidate genes for amyotrophic lateral sclerosis (ALS) were revealed, which included eight rRNAs and two spliceosomal RNAs, all of which were downregulated. Protein synthesis [27] and spliceosomal deficiencies [28] have been linked to ALS pathology. The interruption of ribosomal translation results in the binding of the SURF complex to the exon junction, triggering mRNA decay [29]. SMG1 is one of the four SURF components, and the LOC124907829 gene, encoding the novel protein serine/threonine-protein kinase SMG1-like, is upregulated in motor neurons derived from ALS patients. This may lead to mRNA decay in ALS patient-derived motor neurons and presents a link to the nonsense-mediated mRNA decay reported by Xu et al. [30] in animal and cellular models of ALS. The differential expression of LOC124907841, which encodes bolA-like protein 2, points towards oxidative stress [31], a potential therapeutic target in ALS therapy [32]. To validate its relevance to ALS pathology, further investigation into the involvement of LOC124907841 in the clearance of reactive oxygen species is warranted.

Autism spectrum disorders

Of the two datasets originating from autism spectrum disorder (ASD) studies, the ASD/pb dataset had a sample size of only two individuals per group. Low statistical power must therefore be assumed for the ASD/pb dataset, and consequently only genes overlapping with the ASD/nsc dataset and highly significant DEGs are discussed. Among the common genes uncovered are three U2 spliceosomal RNA encoding genes. The study from which the ASD/pb dataset was derived describes dysregulation of RNA editing in the brains of autistic individuals [14], and it is reasonable to assume that these novel spliceosomal RNAs are involved. However, it is worth noting that the transcripts encoding those genes were upregulated in post-mortem brains but downregulated in neuronal stem cells (NSCs). This suggests dysregulation in their expression during development and adulthood. Errors in the analysis of the two datasets were ruled out, so the opposite direction of differential expression of novel genes between the two datasets is most likely an effect of the developmental stage of the sequenced tissue or cells. However, a sample mix-up in one of the studies cannot be excluded, so these results should be interpreted with caution.

Among the ten most significant novel genes observed in the ASD/pb dataset is LOC124905679, which was predicted to encode hornerin, a protein whose gene was previously linked to ASD at the genomic level [33]. Additionally, the tektin-4-like gene LOC124907385 was upregulated in the ASD/nsc dataset. A missense variant in TEKT4 was discovered in a family with PSMD12 haploinsufficiency, a neurodevelopmental disorder with autistic features.

Analysis of both ASD studies also identified two ribosomal RNAs, LOC124907244 and LOC124907457. Ten rRNAs were DE in both studies. Recently, the ribosomes have gained central importance in understanding the development of ASD [34]. Mutations in genes relevant for translation control, such as FMR1, TSC2, and PTEN, have shown high penetrance in ASD development [35]. Therefore, the ten novel rRNA encoding genes discovered here are strong candidates for being downstream effectors of these translation control genes.

Epilepsy

Regarding the novel candidate genes uncovered in the study presented here, epilepsy stands out as the most intriguing neurological disorder. The transcript of LOC124906791, which encodes a protein with a pore-like structure rich in beta strands (Fig. 4A), was significantly upregulated in four epilepsy datasets (E/dc, E/na, E/nc, and E/nopc). Given that epilepsy primarily involves disruptions in ion exchange homeostasis, this putative membrane pore-forming protein is of particular interest for future epilepsy research. Its upregulation in all four studies suggests that increased expression of the protein may lead to ion exchange with the extracellular space, a major trigger for epileptic seizures [36]. The role of pore-forming proteins in epilepsy has only recently come under discussion [37], and the evidence provided here positions LOC124906791 as a promising candidate for the development of inhibitory drugs.

Another novel protein, LOC124906582, was also significantly upregulated in four epilepsy datasets (E/na, E/nc, E/nopc, and E/pf). It is predicted to have a highly ordered structure rich in cross beta strands, with a tubular shape resembling the structure of fibrillar amyloid-β(1–42) [23]. AlphaFold predictions indicated that LOC124906582 can form dimeric, trimeric, and tetrameric multimers, suggesting a capacity to form aggregates. Evidence for the involvement of amyloidogenic proteins such as amyloid-β, α-synuclein, and tau in the development of late-onset epilepsy is accumulating [38, 39], which makes the newly discovered protein LOC124906582 a candidate for a hitherto unknown player in epilepsy development. In vitro co-aggregation studies with the aforementioned amyloidogenic proteins would therefore offer valuable evidence towards an experimental validation of this novel protein.

Glioma

The term glioma encompasses a group of cancer types that affect glial cells in the brain. Glioblastoma is a grade 4 type and the most aggressive form of glioma. This study unveiled 62 novel candidate genes for glioma and 7 novel candidate genes for glioblastoma. Among the top 20 novel genes associated with glioma, based on the lowest p-values, 17 are ncRNAs. Numerous studies have linked ncRNAs to glioma pathology, and they were connected to poor patient survival rates. Hence, they serve as predictive markers of disease progression [40]. The newly discovered ncRNAs should be correlated with disease phenotypes to enhance predictions of patient outcomes.

One of the top 20 DEGs is LOC124907389, which encodes the leucine-rich repeat transmembrane protein FLRT2. In breast cancer, FLRT2 has been identified as a tumor suppressor gene [41], and interestingly, LOC124907389 exhibited a twofold downregulation in patient-derived glioma tissue. This suggests that this novel FLRT2-like gene may also exert tumor suppressive functions in relation to glioma.

Multiple sclerosis

The analysis of three distinct multiple sclerosis (MS) datasets (post-mortem white matter lesions, CD4+ T cells, and CD19+ B cells) unveiled 146 novel potential disease-associated candidate genes. It is noteworthy that the MS/bl dataset included an exceptionally high number of cases (N = 72). The DE ncRNAs LOC124907785 and LOC124907382 were common across all datasets. These ncRNAs were upregulated in brain lesions but downregulated in immune cells, suggesting a significant role in MS pathology. A BLAST search revealed that all four annotated LOC124907785 transcripts exhibited a sequence identity ranging from 78 to 86% with multiple transcripts of LOC124906734, which encodes the protein translation initiation factor IF-2-like. This gene was notably downregulated in CD4+ cells and slightly downregulated in CD19+ cells. A homozygous missense variant in the EIF2B2 (eukaryotic translation initiation factor 2B subunit beta) gene has been identified as causative for early-onset vanishing white matter disease [42]. This finding strengthens the case for LOC124907785, a putative inhibitory ncRNA differentially expressed in all three MS datasets, as a strong candidate for a regulatory RNA with disease-modifying properties.

Additionally, four other ncRNAs were DE in both brain lesions and CD19+ B cells. Three of these ncRNAs (LOC124905722, LOC124906211, and LOC124907546) were upregulated in brain lesions but downregulated in immune cells, while one (LOC124907109) was downregulated in both datasets. ncRNAs have been a subject of intense research in recent MS studies [43,44,45,46]. The observed pattern of upregulation in brain lesions and downregulation in immune cells suggests a disturbed regulation of ncRNA expression in MS.
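The direction comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in the study: the function name classify_direction is hypothetical, and the +1/-1 values encode only the direction of change reported in the text (up or down), not measured log2 fold changes.

```python
from typing import Dict


def classify_direction(lfc_a: Dict[str, float], lfc_b: Dict[str, float]) -> Dict[str, str]:
    """Label each gene found in both datasets by whether its direction of change agrees."""
    labels = {}
    for gene in sorted(lfc_a.keys() & lfc_b.keys()):
        a, b = lfc_a[gene], lfc_b[gene]
        if a == 0 or b == 0:
            labels[gene] = "unchanged in one dataset"
        elif (a > 0) == (b > 0):
            labels[gene] = "concordant"
        else:
            labels[gene] = "discordant"
    return labels


# Signs only (+1 = upregulated, -1 = downregulated), as stated in the text.
# These are NOT measured fold changes.
brain_lesions = {
    "LOC124905722": +1.0,
    "LOC124906211": +1.0,
    "LOC124907546": +1.0,
    "LOC124907109": -1.0,
}
cd19_cells = {
    "LOC124905722": -1.0,
    "LOC124906211": -1.0,
    "LOC124907546": -1.0,
    "LOC124907109": -1.0,
}

pattern = classify_direction(brain_lesions, cd19_cells)
for gene, label in pattern.items():
    print(gene, label)
```

Applied to real DEG tables, the same comparison would flag every gene whose sign flips between tissue types, which is the pattern highlighted for three of the four ncRNAs.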

A total of 51 novel DEGs were discovered in MS patient brain lesions, more than 50% of which were highly upregulated rRNA-encoding genes (rDNA, avg. Log2 FC = 4.6). Since these data were produced with an rRNA removal step [22], and elevated rDNA levels were only detected in the 72 cases, contamination with rRNA can be excluded. Spurlock et al. reported elevated levels of misprocessed rRNA in mononuclear cells from individuals with relapsing remitting multiple sclerosis, attributing this to environmental factors rather than genetics [47]. Both findings therefore strongly suggest a significant role for ribosomes in MS pathology.
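To put the reported average Log2 FC of 4.6 on a linear scale, and to relate it to the "twofold" and "threefold" changes quoted elsewhere in this discussion, the conversion can be sketched as follows. The helper names are hypothetical. Note that exponentiating an average of log2 values gives a geometric mean fold change, not an arithmetic one.

```python
import math


def fold_change_from_log2(log2_fc: float) -> float:
    """Convert a log2 fold change to a linear fold change."""
    return 2.0 ** log2_fc


def log2_from_fold_change(fold: float, upregulated: bool = True) -> float:
    """Convert an n-fold change to a signed log2 fold change."""
    if fold <= 0:
        raise ValueError("fold change must be positive")
    value = math.log2(fold)
    return value if upregulated else -value


# Average Log2 FC reported for the rRNA-encoding genes in MS brain lesions.
rdna_fold = fold_change_from_log2(4.6)  # roughly 24-fold

# A twofold downregulation corresponds to a log2 fold change of -1.
twofold_down = log2_from_fold_change(2.0, upregulated=False)

print(f"Log2 FC 4.6 is about {rdna_fold:.1f}-fold")
print(f"Twofold downregulation is Log2 FC {twofold_down:.1f}")
```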

Mitochondrial DNA mutation m.3243 A > G

One of the most prevalent mitochondrial DNA (mtDNA) mutations is m.3243 A > G [48], which manifests with a broad spectrum of clinical features, including seizures, stroke-like episodes, hearing impairment, gastrointestinal disturbance, psychiatric involvement, ataxia [49], and neurodegeneration [50]. Among the newly discovered genes that were DE in patient-derived cell lines, 13 were ncRNAs and six were protein-coding. The novel LOC124907531 gene, encoding amyloid-beta A4 precursor protein-binding family A member 2-like, demonstrated a threefold upregulation. The protein product of the APBA2 gene interacts with the amyloid precursor protein (APP) and influences the proteolytic production of amyloid-β [51]. The neurodegenerative features of m.3243 A > G cases have been attributed to defects in nitric oxide metabolism and mtDNA-related mitochondrial respiration [50]. These are also features of Alzheimer’s disease and other neurodegenerative disorders [52]. Although the m.3243 A > G mutation could not be linked to cases of early-onset Alzheimer’s disease [53], an involvement of amyloid-β in m.3243 A > G linked neurodegeneration has not yet been ruled out and should be investigated.

Implications of the study

The data presented here indicate that about half of the 26 selected studies on neurological disorders have yielded novel candidate genes with differential expression for further study. Furthermore, on average 3.6% of the DEGs discovered with the GRCh38 genome assembly did not exhibit differential expression when reads were aligned to the T2T-CHM13 assembly. This suggests that prior analyses were hindered by putative false-positive DEGs caused by inaccuracies in the GRCh38 assembly. In light of these findings, it is highly recommended to re-map RNA-seq data from older studies to validate the integrity of the published data. Additionally, this approach can help identify whether any phenotype-associated genes are among the 1956 novel genes discovered by the T2T Consortium. The discovery of many DEGs presented here was only possible by unlocking previously inaccessible regions of the human genome. Notably, the highly repetitive epilepsy-associated protein-coding gene LOC124906791 was previously obscured by technical limitations, and the revelation of numerous rRNA-encoding transcripts upregulated in white matter lesions of MS patients was only made possible through the assembly of an additional 9.9 Mbp of rDNA regions in the T2T-CHM13 assembly. While all the datasets examined here were based on Illumina short reads, the use of long-read RNA sequencing techniques [54] in conjunction with the T2T-CHM13 assembly or even a human pangenome [55] promises significant improvements in the quality and depth of future human transcriptome analyses.
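For readers who wish to re-map older data, the comparison behind the two figures quoted above (DEGs lost on re-alignment and DEGs newly gained) reduces to set operations on the DEG lists from each assembly. The following is a minimal sketch under stated assumptions, not the analysis code of this study: the function names and the toy gene identifiers are hypothetical, and it assumes that gene identifiers have already been matched between the two annotations.

```python
from statistics import mean
from typing import Dict, Set, Tuple


def percent_not_reproduced(grch38_degs: Set[str], t2t_degs: Set[str]) -> float:
    """Percentage of GRCh38 DEGs that are no longer DE after alignment to T2T-CHM13."""
    if not grch38_degs:
        raise ValueError("GRCh38 DEG set is empty")
    lost = grch38_degs - t2t_degs
    return 100.0 * len(lost) / len(grch38_degs)


def novel_degs(t2t_degs: Set[str], grch38_annotated: Set[str]) -> Set[str]:
    """DEGs found with T2T-CHM13 that have no annotation in GRCh38."""
    return t2t_degs - grch38_annotated


# Toy data for two hypothetical studies: (GRCh38 DEGs, T2T-CHM13 DEGs).
studies: Dict[str, Tuple[Set[str], Set[str]]] = {
    "study_a": ({"G1", "G2", "G3", "G4"}, {"G1", "G2", "G3", "N1"}),
    "study_b": ({"G1", "G2"}, {"G1", "G2"}),
}

per_study = {name: percent_not_reproduced(g, t) for name, (g, t) in studies.items()}
average_lost = mean(per_study.values())
gained = novel_degs(studies["study_a"][1], {"G1", "G2", "G3", "G4"})

print(per_study, average_lost, gained)
```

Averaging the per-study percentages, as done here, weights every study equally regardless of how many DEGs it reported; pooling all DEGs before dividing would give a different figure.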
