VADEr: Vision Transformer-Inspired Framework for Polygenic Risk Reveals Underlying Genetic Heterogeneity in Prostate Cancer

Abstract

Polygenic risk scores (PRSs) serve as quantitative metrics of genetic liability for various conditions. A PRS is traditionally calculated as an effect-size-weighted sum of genotypes, a formulation that assumes conditional feature independence and overlooks the potential for complex interactions among genetic variants. Transformers, a class of deep learning architectures known for capturing dependencies between features, have demonstrated remarkable predictive power across domains. In this work, we introduce VADEr, a Vision Transformer (ViT)-inspired architecture that combines techniques from both natural language processing and computer vision to capture properties exhibited by genetic data and to model local and global interactions for genotype-to-phenotype prediction. Evaluating VADEr’s performance in predicting prostate cancer (PCa) risk, we found that VADEr outperformed all benchmark methods across a range of metrics, including accuracy, average precision, and Matthews correlation coefficient, demonstrating its effectiveness in the context of complex disease risk prediction. To illuminate the drivers of disease risk identified by VADEr, we formulated DARTH scores, an attention-based attribution metric that captures the personalized contribution of each genomic region. These scores revealed distinct genetic heterogeneity captured by VADEr, with drivers of predicted risk identified in key PCa risk regions including the HOXB13, TMPRSS2, and MSMB loci. DARTH scores also revealed germline predispositions for particular PCa molecular subtypes, including an association between the LMTK2 locus and the SPOP subtype, both implicated in the regulation of androgen receptor activity. Overall, by effectively capturing dependencies among genetic variants and providing interpretable insights, VADEr and DARTH scores offer a promising direction for advancing genotype-to-phenotype prediction, particularly in complex disease.
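
For readers unfamiliar with the traditional formulation referred to above, the following is a minimal sketch of an additive PRS as an effect-size-weighted sum of allele dosages. It is illustrative only and is not part of VADEr; the function name `polygenic_risk_score` and the numeric values are hypothetical, chosen to show that each variant contributes independently with no interaction terms.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float],
                         effect_sizes: Sequence[float]) -> float:
    """Additive PRS: sum over variants of (effect size * allele dosage).

    Each variant contributes independently; there are no interaction terms,
    which is the limitation the abstract points to.
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(beta * g for beta, g in zip(effect_sizes, dosages))


# Hypothetical values for three variants (not taken from the study):
# effect sizes, e.g. log odds ratios from a GWAS, and risk-allele counts (0, 1, or 2).
effect_sizes = [0.5, -0.25, 1.0]
dosages = [2, 1, 0]

score = polygenic_risk_score(dosages, effect_sizes)
print(score)
```

Because the score is a plain weighted sum, changing the dosage at one variant shifts the result by the same amount regardless of the genotype at any other variant.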

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This work was supported by an Emerging Leader Award from The Mark Foundation for Cancer Research (grant #18-022-ELA), NIH grant R01CA269919 to H.C., and infrastructure grant 2P41GM103504-11.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

North West Multi-centre Research Ethics Committee approved the UK Biobank. UK Biobank gave approval for this work (Project #37671).

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data and Code Availability

All data used in this study were obtained from public sources. Specifically, ELLIPSE genotypes and phenotypes were downloaded from dbGaP (study accession: phs001120.v1.p1). UKBB data were retrieved under project ID 37671. TCGA data were obtained from the TCGA Genomic Data Commons Data Portal. The complete codebase for this study, including the full VADEr model architecture and the tailored training procedures, is available on GitHub at https://github.com/jvtalwar/DARTH_VADEr.
