P-glycoprotein (P-gp) is a transmembrane protein widely involved in the absorption, distribution, metabolism, excretion, and toxicity (ADMET) of drugs in the human body. Accurate prediction of P-gp inhibitors and substrates is crucial for drug discovery and toxicological assessment. However, existing models rely on limited molecular information, which restricts their predictive performance. To overcome this challenge, we compiled an extensive dataset from public databases and the literature, consisting of 5943 P-gp inhibitors and 4018 substrates and notable for its size, quality, and structural uniqueness. In addition, we curated two external test sets to validate the model’s generalization capability. We then developed a multimodal contrastive learning framework, MC-PGP, for predicting P-gp inhibitors and substrates. This framework integrates three types of features, derived from Simplified Molecular Input Line Entry System (SMILES) sequences, molecular fingerprints, and molecular graphs, using an attention-based fusion strategy to generate a unified molecular representation. We also employed a graph contrastive learning approach that enhances structural representations by aligning local and global structures. Extensive experiments highlight the superior performance of MC-PGP, which improves the area under the receiver operating characteristic curve (AUC-ROC) by 9.82% and 10.62% on the external P-gp inhibitor and external P-gp substrate datasets, respectively, compared with 12 state-of-the-art methods. Furthermore, interpretability analysis of all three molecular feature types offers comprehensive and complementary insights, demonstrating that MC-PGP effectively identifies key functional groups involved in P-gp interactions. These chemically intuitive insights provide valuable guidance for the design and optimization of drug candidates.
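The attention-based fusion step described above can be illustrated with a minimal sketch. It is not the MC-PGP implementation: the function names `softmax` and `attention_fuse`, the `query` scoring vector, and the toy three-dimensional embeddings are all hypothetical, and the sketch assumes that each modality (SMILES, fingerprint, graph) has already been encoded into a vector of the same length. Each embedding is scored against the query, the scores are normalized with a softmax, and the fused representation is the weighted sum of the embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(embeddings, query):
    """Fuse same-length modality embeddings with softmax attention weights.

    `query` is a hypothetical scoring vector; in a trained model it
    would be a learned parameter rather than a fixed list.
    """
    # One scalar relevance score per modality (dot product with the query).
    scores = [sum(q * x for q, x in zip(query, emb)) for emb in embeddings]
    weights = softmax(scores)
    dim = len(query)
    # Weighted sum of the modality embeddings, dimension by dimension.
    fused = [
        sum(w * emb[i] for w, emb in zip(weights, embeddings))
        for i in range(dim)
    ]
    return fused, weights


# Toy embeddings standing in for the three encoders (illustrative values only).
smiles_emb = [0.2, 0.1, 0.4]
fingerprint_emb = [0.9, 0.3, 0.1]
graph_emb = [0.5, 0.5, 0.5]
query = [1.0, 0.0, -1.0]

fused, weights = attention_fuse([smiles_emb, fingerprint_emb, graph_emb], query)
print("attention weights:", weights)
print("fused representation:", fused)
```

Because the weights are non-negative and sum to one, the fused vector is a convex combination of the three modality embeddings, so each modality's weight can be read directly as its contribution to the unified representation.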