A comprehensive set of computational tools and software applications were employed to facilitate various aspects of the study. These tools and their respective functions are summarized in Table 1. These computational tools and software applications collectively played pivotal roles in enabling various aspects of the research, from compound structure visualization to molecular modeling, docking, and prediction of important pharmacological and pharmacokinetic properties, ultimately contributing to the comprehensive analysis and design of potential drug candidates.
Training setWe curated a dataset comprising 35 Pyrimidine derivatives, which had been previously reported in the literature [7, 8]. To facilitate further analyses, the chemical structures of these compounds were meticulously drawn using ChemDraw Professional 17.1 and saved in.mol format. The agonistic potency of these compounds against the GPR119 receptor was quantified in terms of EC50 values, expressed in nanomolar (nM) units. These EC50 values were then transformed into their negative logarithms (pEC50 = − LogEC50) to establish a linear relationship for subsequent 3D-QSAR and Pharmacophore model development. The activity data for the Pyrimidine derivatives are presented in Table 2. The Pyrimidine structure, along with its substitution sites, is depicted below (Fig. 1) for reference:
Table 2 Representation of structure and their activityFig. 1Representation of Pyrimidine structrure with substitution sites
These preparations and transformations of the dataset, as described above, provided the foundational data required for subsequent modeling and analysis of the Pyrimidine derivatives in the context of GPR119 receptor agonism.
Pharmacophore mappingThe International Union of Pure and Applied Chemistry (IUPAC) offers a precise definition of a pharmacophore, characterizing it as the collective set of both steric (relating to the spatial arrangement of atoms or molecules) and electronic (related to the distribution of electrons) characteristics. The ultimate goal of these interactions is to initiate and activate the biological response associated with that target structure [9,10,11].
Generation of pharmacophore hypothesisPharmacophore mapping is used to determine the unique feature essential for producing pharmacological activity. The pharmacophore hypothesis was generated by the PHASE module of Schrodinger Maestro v13.5 software. For the pharmacophore mapping, 35 compounds were selected [7, 8] from the datasets which was already reported in the literature, and the parameters were selected as how the active molecule binds to the receptor with a default box size of 1 Å and 2 Å minimum inter-site distance followed by the minimization and alignment of all 35 compounds. Where we choose six features such as hydrogen bond donor, hydrogen bond acceptor, aromatic ring, positive, negative, and hydrophobic. Based on the chemical features related to the structural similarities of all the 35 compounds the PHASE generated 20 possible hypothesizes (Tables 17 and 18) to explain how the active molecules bind to the receptor. In our study, the software module identified two recurring chemical features: hydrogen bond acceptor (A) and hydrophobic group (H). To precisely locate these features, an in-built distance calculation tool was employed to determine inter-featuric distances (Table 3), as illustrated in Fig. 2.
Table 3 Pharmacophoric features and their distance value in ÅFig. 2A Pharmacophoric features of AAAHH_1 with their inter-featuric distances map. Hydrogen bond acceptor- salmon, aromatic ring- orange & hydrophobic- green. B Pharmacophoric features mapped over the most potent molecule (Molecule.17)
3D-Quantitative structure–activity relationship (3D-QSAR)QSAR serves to establish correlations between the physicochemical characteristics of substances and their biological responses, whether observed In-vitro or In-vivo. Within the domain of 3D QSAR methods, diverse approaches such as Comparative Molecular Field Analysis (CoMFA), Comparative Molecular Similarity Index Analysis (CoMSIA), Atom-based 3D-QSAR, and Field-based QSAR are utilized. These methodologies enable the exploration of how various factors such as steric effects, electrostatic properties, the presence of hydrogen bond donors, acceptors, and hydrophobic interactions influence the observed biological activities of chemical entities. This process is further utilized for the development of novel GPR119 agonists as anti-diabetic drugs. We gathered datasets comprising known activity data associated with G-Protein Coupled Receptor 119 agonistic effects, quantified in terms of EC50 values measured in nanomolar units (nM). These datasets were compiled from research articles and served as the foundation for constructing 3D-QSAR models. The process involved in conducting the 3D-QSAR studies is delineated in Fig. 3, encompassing various essential steps and methodologies employed for model development and analysis [12,13,14,15,16,17,18,19,20,21].
Fig. 3Steps involved in the 3D-QSAR studies
Selection of series for 3D-QSAR studiesWe focused on Indolinylpyrimidine derivatives containing the Pyrimidine nucleus, which exhibited GPR119 agonist activity. We meticulously curated a dataset comprising 35 such Pyrimidine derivatives. The biological activity of these compounds was originally reported in the literature as EC50 values, expressed in nanomolar (nM) concentrations (Table 2). To facilitate our quantitative structure–activity relationship (QSAR) investigations, we systematically converted all these biological activity values (EC50) into molar units and subsequently transformed them into their negative logarithmic representation, denoted as pEC50 values. These pEC50 values were employed as the dependent variable in QSAR analysis, forming a crucial component of the study to establish meaningful relationships between chemical structures and biological activity [7, 8].
Generation of CoMFA and CoMSIA modelAll the structures in the dataset were sketched in ChemDraw 17.1 software and saved in.mol format. To get the best conformers of each molecule the Sybyl-X v2.1.1 software was used for energy minimization. The energy minimizations were done using the Tripos force field and the Powell gradient algorithm with a convergence criterion of 0.005 or 0.05 kcal/(mol*Å) and a maximum iteration count of 10,000. The partial atomic charges were calculated using the Gasteiger-Huckel method.
Reliability of CoMFA (Comparative Molecular Field Analysis) and CoMSIA (Comparative Molecular Similarity Indices Analysis) models by meticulously following a series of crucial steps. Initially, we conducted an initial geometry optimization using the Powell method, integrating the Tripos force field and Gasteriger-Huckel charge. This optimization phase comprised 10,000 iterations with a stringent energy convergence cutoff at 0.001 kcal/mol. Subsequently, a simulated annealing process was implemented, which began with heating the system to an initial temperature of 1000 Kelvin over 1000 femtoseconds, followed by a controlled cooling to 250 Kelvin within 1500 femtoseconds of annealing time. An exponential annealing function was employed and the process was iterated 100 times. For conformation analysis, temperatures ranging from 250 to 300 Kelvin were considered and hierarchical clustering was executed. The lowest energy conformers within each major cluster were chosen meticulously for further optimization, utilizing the same energy minimization technique as in the initial geometry optimization. To align all minimized conformers, the Sybyl X-2.1.1 software Distill Rigid alignment type was exclusively employed, as depicted in Fig. 4. These rigorous procedures were undertaken to construct robust CoMFA and CoMSIA models, with a specific emphasis on bioactive conformations and precise molecular alignment, aligning with the core objectives of our scientific investigation [12,13,14,15,16,17,18,19].
Fig. 4Representation of distill rigid alignment
Descriptors Calculation, 3D-QSAR Model development3D-QSAR models (CoMFA and CoMSIA) were developed using the ‘QSAR Tools’ module in the Sybyl-X v2.1.1 software package.
Dataset divisionA dataset of thirty-five Indolinylpyrimidine derivatives (Table 2), screened according to the same pharmacological protocol, was selected from the literature. All the compounds have been built, parameterized (Gasteiger-Huckel method) and energy minimized within MOE using MMFF94 forcefield. In every model generation, the compounds in the datasets were randomly divided into the training set (75% of the total number of compounds) for model generation, and the test set (25% of the total number of compounds) to verify the external predictive ability of the model (Table 4).
Table 4 Division of Training set and Test setCoMFA methodTo calculate the steric and electrostatic fields within the CoMFA method, Lennard–Jones and Coulombic potentials were utilized, respectively. To encapsulate the aligned molecules comprehensively, a 3D cubic lattice with a grid spacing of 2.0 Å in all directions was automatically generated. These grid points were produced using the Tripos force field and featured a sp3 carbon atom probe characterized by a Van der Waals radius of 1.52 Å and a charge of +1.00 (representing the default probe atom in Sybyl-X v2.1.1). For the assessment of various steric and electrostatic fields, a default energy cut-off of 30 kcal/mol was applied [12,13,14,15,16,17,18,19].
CoMSIA methodTo establishment of the CoMSIA model was executed using Sybyl X v2.1.1, with the maintenance of the default attenuation factor set at 0.3. This process entailed the introduction of a sp3-hybridized carbon atom carrying a +1.0-probe charge. Within the framework of the CoMSIA methodology, a comprehensive set of five descriptors was computed, encompassing steric, electrostatic, hydrophobic, as well as hydrogen bond donor and acceptor fields. The derivation of the CoMSIA model was accomplished through the application of the partial least square (PLS) technique. In this computational procedure, the CoMSIA fields functioned as independent variables, while the dependent variables consisted of the pEC50values associated with each compound in our analysis.
Internal validation and partial least squares (PLS) analysisWe employed the Partial Least Squares (PLS) approach, an extension of multiple regression analysis, on the training dataset. This analysis aimed to establish correlations between the QSAR models (CoMFA and CoMSIA) and biological activity. To assess the predictive performance of our models and identify the optimal number of components yielding the highest cross-validated r2 (r2cv), we utilized the leave-one-out (LOO) cross-validation method. In LOO cross-validation, one compound is systematically removed from the dataset, and a model is constructed using the remaining compounds. This model is then used to predict the activity of the omitted molecule. The process is repeated for each compound, allowing us to assess the overall predictive power of our model. The optimal number of components obtained from this cross-validation procedure was subsequently employed in the subsequent regression model, ensuring its reliability and predictive accuracy. Final CoMFA and CoMSIA models were generated using non-cross-validation PLS analysis (Table 5 and 6). To further assess the statistical confidence and robustness of the derived models, a 100-cycle bootstrap analysis was performed (Table 7). This procedure in which n random selections out of the original set of an object are performed several times (100 times). The mean correlation coefficient is represented as bootstrap r2 (r2boot) [12,13,14,15,16,17,18,19].
Table 5 Summary of CoMFA PLS Results of Indolinylpyrimidine DerivativesTable 6 Summary of CoMSIA PLS Results of Indolinylpyrimidine DerivativesTable 7 Various parameters of the QSAR modelModel developmentQSAR (CoMFA AND CoMSIA) models developedWe assembled a dataset comprising a total of 35 compounds retrieved from a previously known database. These compounds were collectively utilized to generate a Quantitative Structure–Activity Relationship (QSAR) model. The selection of the best-performing model was based on a comprehensive assessment of various statistical parameters, as detailed in Table 19. Consequently, the model demonstrating the most favorable statistical characteristics and predictive capabilities was chosen for further investigations and subsequent studies. A model can be considered predictive and statistically significant if the following conditions are satisfied:
$$\begin } - }\left( }} \right) > 0. \hfill \\ }\left( }} \right) > 0. \hfill \\ }\left( }} \right) < \, 0. \hfill \\ } > 0. \hfill \\ \end$$
Atom and field-based 3D-QSARTo generate the 3D-QSAR models the chosen congeneric series of 35 compounds (Table 2) have been used after performing alignment and minimization in Schrodinger Maestro v13.5. In the atom-based QSAR model atoms are dictated by van der Waals models where atoms are treated as a sphere and occupy the same regions of space. Atom-based 3D-QSAR model was developed from a set of aligned ligands to predict activities for other molecules. The splitting of compounds was set up as 68% compounds in the training set and the remaining 32% compounds in the test set in the case of Atom-Based 3D-QSAR, whereas 70% compounds in the training set and the remaining 30% compounds in the test set in case of field base 3D-QSAR. The clustering of compounds was done with a 4–5 PLS factor [20, 21]. Tables 20, 21, 22, 23 summarizes the different parameters of the QSAR model which are as follows:
Segment displayed the fraction due to each type (Atom based and Field base) in the QSAR model for each number of PLS factors used in the model (Tables 20, 21, 23). A scatter plot (Fig. 11 and 13) was generated by plotting actual activity vs. predicted activity to confirm minimal diversity in the biological activities between training set molecules (Table 8, 9).
Table 8 Actual and Predicted pEC50 and Residual values of Generated 3D-Atom Based ModelsTable 9 Actual and Predicted pEC50 and Residual values of Generated 3D-Field Based ModelsHomology modelingIn the context of the G-protein coupled receptor 119 (GPR119), a pivotal player in drug discovery, the predicament surrounding experimental structural determination is accentuated. Herein, homology modeling emerges as an indispensable technique for the design and evaluation of biological experiments. The significance of homology modeling in the case of the GPR119 protein cannot be overstated. The structural insights furnished by this modeling technique greatly inform the rational design of drug candidates and guide subsequent biological inquiries. The accuracy and fidelity of the predicted protein structure hold paramount importance. Parameters for validation, such as the Ramachandran plot, the Errat score, and other relevant metrics, are pivotal in gaging the geometric integrity of the protein model. Homology Modeling has been done by several tools to predict the better quality or validated GPR119 protein structure [22].
The rational for Homology modeling behind not using of 7XZ5 for the GPR119-Gs-LPC complex [23], which has a sequence length of 394 amino acids. However, upon conducting a thorough literature survey, we identified that our specific Homo sapiens GPR119 isoform consists of 335 amino acids and do not have crystal structure in the Protein Data Bank (PDB). Consequently, we choose to perform homology modeling to generate a structural model, considering the absence of crystallographic data for our target isoform. This approach allowed us to elucidate the unresolved components of the protein's structure and incorporate any necessary sequence modifications.
Homology modeling by modeller 10.2Comparative modeling with Modeller 10.2 module predicts protein structure in five sequential steps. The first step of modeling, searching for the 3D structure of proteins related to the target. For this, we have used the site (www.ncbi.nlm.nih.gov) and downloaded the GPR119 sequence in FASTA file format. The second step denotes the selection of templates for alignment by BLAST. Templates based on maximum identity related to the target sequence from a list of generated sequences. We choose five template structures for alignment (6H7J, 6KR8,6TKO, 7BZ2, and 7DHI). In the third step, we have aligned the target protein with the template. The fourth step is the model building step which after the Python script run, automatically calculates a 3D model of the target using an auto model class of the module (Table 10). The Fifth step is the evaluation step, in which the model is evaluated using DOPE (Discrete Optimized Protein Energy) and GA341 assessment score. If several models are calculated for the same target, the best model can be selected as a low DOPE score & high GA341 score as mentioned in Modeller v10.2 and also validated via Ramachandran plot and Errat Score (Table 11). Model qseq.B99990009.pdb was found to be best among total 10 models [24].
Table 10 Generated Models by Modeller 10.2Table 11 Validation of Generated Models by Modeller 10.2Homology modeling by I-TASSERA homology model of GPR119 was generated using the I-TASSER server, which relies on multiple threading templates for 3D model construction. The I-TASSER server selected the top 10 templates based on various criteria, including sequence identity and predicted secondary structure and solvent exposure (Table 12). Subsequently, the most promising models from the largest cluster of structures were chosen through the SPICKER program. To enhance the accuracy of the 3D structures, the Protein Preparation Workflow from the Schrödinger Maestro v13.5 package was employed for refinement. Energy minimization was performed using the OPLS 2005 force field, with a default setting of 0.3 Å for the root mean square deviation (RMSD). The quality and validity of the constructed homology model were assessed using the Ramachandran plot and Errat score, and it was determined that Model.5 exhibited the highest level of accuracy (Table 13) [25].
Table 12 Top 10 threading templates used by I-TASSERTable 13 Top 5 final models predicted by I-TASSERHomology modeling by robettaThe Robetta server, accessible at http://robetta.bakerlab.org, offers automated tools for the prediction and analysis of protein structures. In the context of structure prediction, input sequences are submitted to the server, where they are divided into potential domains, and structural models are then generated using comparative modeling techniques. The resulting models are detailed in Table 14. Additionally, the server allows the submission of experimental constraints data, such as information from NMR, Cryo-EM, and X-ray crystallography, which can aid in Rosetta structure determination. The quality and reliability of these models were assessed through the evaluation of Ramachandran plots and Errat Scores, revealing that Model.1 exhibited the highest level of accuracy [26].
Table 14 Models Predicted by Robetta webserverHomology modeling by swiss modelIn comparative modeling, a 3D protein model of a GPR119 receptor sequence is generated by: Input data: The target protein amino acid sequence is put in FASTA format from the NCBI database. Template search: Data provided in this step served as a query to search for evolutionary-related protein structures against the SWISS-MODEL template library SMTL. Template selection: The template search was completed, and templates were ranked according to the expected quality of the resulting models, as estimated by Global Model Quality Estimate (GMQE) so based on this 6ni3 and 7e32 PDB were chosen as templates for model building (Table 15). Model building: Each selected template (6ni3 and 7e32), a 3D protein model was automatically generated by first transferring conserved atom coordinates as defined by the target template alignment. Model quality estimation: To quantify modeling errors and give estimates on expected model accuracy by QMEAN score Ramachandran Plot and Errat score and Model.1 was found to be more accurate (Table 16) [27].
Table 15 PDB Hits Predicted by Swiss model webserverTable 16 Models Predicted by Swiss model webserversVirtual screening/molecular dockingOur research investigated into intermolecular interactions, specifically between a protein model representing GPR119 and small chemical molecules sourced from datasets, namely SwissSimilarity and the Zinc Database. Additionally, we examined interactions with chemically designed structures through molecular docking studies. This investigation aimed to shed light on the binding modes responsible for stimulating the protein. An essential prerequisite for conducting such studies is a high-resolution protein structure. However, in our project, the structural information for the GPR119 receptor was unavailable. Consequently, we adopted the approach of generating a homology-modeled structure for the GPR119 receptor. Subsequently, we employed this homology model to facilitate docking investigations, providing valuable insights into the interactions between the receptor and the chemical ligands. Molecular Docking studies were performed by the Glide (Grid-based Ligand Docking with Energetics) module of Schrodinger Maestro v13.5, favorable interactions between a GPR119 receptor structure obtained from the Homology modeling (Model.3) and one or more ligand molecules from Database structure or Designed Structures. Steps for molecular docking are discussed in the following section:
In protein preparation, the precision of Glide docking outcomes is contingent upon the integrity of the initial protein structures. It is imperative to utilize meticulously processed protein structures to optimize docking results. In this phase, a homology-modeled protein structure was seamlessly integrated into Maestro v13.5. Concomitantly, all solvent molecules, except those engaged in coordinating metal ions, were removed from the protein structure. After this, a cautious energy minimization procedure was applied. This process was governed by a user-defined root mean square deviation (RMSD) tolerance, meticulously preserving the spatial alignment of the protein structure concerning the input coordinates. This approach ensured the correct orientation of solvent molecules while also addressing steric clashes and hydrogen bonding irregularities.
In ligand preparation, it is of paramount importance that the docked dataset, encompassing compounds derived from Zinc and SwissSimilarity, accurately reflects the true structures of ligands as they would be situated within protein–ligand complexes to attain the most precise results. The Schrödinger LigPrep tool was employed, enabling the creation of high-fidelity, all-atom 3D structures for numerous drug-like compounds. The LigPrep procedure encompassed a series of meticulous steps, encompassing data conversion, structural diversification, elimination of extraneous structures, and structural optimization.
Sitemap generation was essential for comprehending protein active site characteristics, a crucial aspect in molecular docking studies where the specific location of binding sites remains ambiguous. The SiteMap algorithm was harnessed to systematically identify and evaluate binding sites, thus facilitating the prediction of their suitability for ligand binding. To enhance the accuracy of our study, five discrete Sitemap analyses were executed.
Before commencing ligand docking, the generation of receptor grids was a prerequisite. This necessitated the preparation of homology model structures featuring the correct bond ordering and formal charges. For grid generation, the OPLS 2005 force field was harnessed. Ultimately, for Glide ligand docking, a set of receptor grids and dataset structures, which had been precomputed, were employed [
Comments (0)