Aligning Large Language Models with Humans: A Comprehensive Survey of ChatGPT’s Aptitude in Pharmacology

3.1 Construction of the Pharmacology-LLM-Test-Set

A comprehensive and meticulously designed set of evaluation tasks is crucial for assessing, testing, and enhancing the potential and value of LLMs in specific domains. Constructing an integrative test set that covers a wide range of tasks in pharmacology not only tests the ability of LLMs to process complex pharmacological problems but also stimulates new research and application ideas, advancing the application of AI in drug discovery and development.

To better apply LLMs in pharmacological practice, we propose the ‘Pharmacology-LLM-test-set,’ a test set designed to evaluate the performance of general or specialized LLMs in pharmacology. This test set consists of three tasks: fact query, strategy summarization, and text generation (Table 1). Specifically,

1. Task I (fact query) assesses the LLM’s performance in querying basic pharmacological information. It includes 15 compounds across biomacromolecules, mid-sized molecules, and small molecules, covering five subtasks and 18 attributes such as chemical identifiers, MW, isotopic mass, bioavailability, surface area, pharmacokinetics, and drug-drug interactions (Table 1).

2. Task II (strategy summarization) evaluates the potential of LLMs in chemical structure optimization. We selected ten compounds, including buspirone, paroxetine, rilpivirine, 8-hydroxyquinoline, and paclitaxel (Taxol), for optimization. The focus was on three subtasks and four strategies related to metabolic stability, reducing liver toxicity, and others, as outlined in Table 1.

3. Task III (text generation) assesses the ability of LLMs to extract and summarize information from pharmacological texts, with summarizing limitations and summarizing trends as its two subtasks (Table 1).
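The three-task layout can be captured in a small data structure, which makes it easier to iterate over subtasks when scoring a model. The sketch below is illustrative only: the class name `Task`, the field names, and the subtask labels are assumptions (the Task I labels are taken from the headings of Sect. 3.2.2, the Task II labels from Sect. 3.3), not the schema of the published test set.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """One task of the test set (hypothetical schema for illustration)."""
    name: str
    n_compounds: Optional[int]  # None where the text gives no compound count
    subtasks: List[str] = field(default_factory=list)


TEST_SET = [
    Task("fact query", 15, [
        "chemical identifiers",
        "physicochemical properties",
        "pharmacological properties",
        "drug-target attributes",
        "drug-drug interactions",
    ]),
    Task("strategy summarization", 10, [
        "metabolic stability",
        "reduced toxicity",
        "enhanced water solubility",
    ]),
    Task("text generation", None, [
        "current limitations",
        "emerging trends",
    ]),
]

for task in TEST_SET:
    print(f"{task.name}: {len(task.subtasks)} subtasks")
```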

Table 1 Details of constructing the pharmacology large language model test set ‘Pharmacology-LLM-test-set’

Moreover, to further promote the widespread use and continuous improvement of the pharmacology-LLM-test-set, we have uploaded it to both Hugging Face (https://huggingface.co/datasets/zhangyingbo1984/Pharmacology-LLM-test-set) and GitHub (https://github.com/zyb1984/Pharmacology-LLM-test-set) platforms for easy access by other users. Additionally, based on this test set, the baseline ‘question-answer’ scenarios and scoring outcomes for GPT-3.5 and GPT-4 can be found in the document’s appendix.

3.2 Evaluation of Pharmacological Test Set Based on General LLMs

3.2.1 The Accessibility of ChatGPT in Pharmacological Test Set

Essential attribute evaluations, such as contextual consistency, semantic similarity, and consistency tests, are fundamental for assessing the capabilities of general LLMs like GPT-3.5, GPT-4, Llama2, Claude, PaLM, and specialized LLMs like DrugChat, DrugGPT, and Mol-Instructions in handling question-answering tasks [45]. Since LLMs do not require specialized knowledge or terminology for everyday conversations or text generation, general LLMs typically exhibit good contextual consistency and semantic similarity. However, in specialized fields, where executing question-answering tasks or generating text demands extensive professional knowledge or terminology, conducting basic attribute evaluations is the first step towards aligning human expectations with LLMs. In this study, we assessed the primary attributes of LLMs in the field of pharmacology using three fundamental attribute metrics: contextual consistency, semantic similarity, and consistency tests (including Cronbach’s alpha consistency based on Levenshtein Similarity and entity similarity).
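A consistency test of this kind can be sketched in a few lines. The text does not spell out how the score matrix behind Cronbach's alpha was built, so the layout used here (one row per question, one column per repeated answer, each cell a Levenshtein similarity to a reference answer) is an assumption, as are the function names and the toy scores.

```python
from statistics import variance


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cronbach_alpha(scores):
    """scores[q][r] is the score of repeat r for question q (assumed layout)."""
    k = len(scores[0])
    item_vars = [variance([row[r] for row in scores]) for r in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy scores for three questions, each asked three times (hypothetical values).
toy_scores = [
    [0.92, 0.90, 0.91],
    [0.55, 0.60, 0.58],
    [0.75, 0.78, 0.74],
]
print(f"alpha = {cronbach_alpha(toy_scores):.3f}")
```

Values of alpha close to 1 indicate that repeated answers to the same question score almost identically, which is the sense in which the section reports consistency.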

The evaluation results for contextual consistency, semantic similarity, and consistency tests indicate that ChatGPT demonstrates good human alignment capabilities. Specifically, the contextual consistency score is 4.25 ± 0.63, the semantic similarity score is 4.15 ± 0.79, and the Cronbach's alpha consistency based on Levenshtein Similarity or entity similarity is 0.990 (0.980–0.996) and 0.987 (0.983–0.991), respectively (Fig. 2, Table S1). Further comparisons of GPT-3.5 and GPT-4 across the three tasks reveal that GPT-4 outperforms GPT-3.5 in most tasks (Fig. 2, Table S1). However, for the text summarization task (Task III), the performance difference between GPT-3.5 and GPT-4 in contextual consistency is minimal, indicating that even GPT-3.5, as a well-trained LLM, can effectively understand pharmacological instructions issued by humans.

Fig. 2 The accessibility of ChatGPT in the pharmacological test set based on contextual consistency, semantic similarity, and consistency tests. a: The accessibility of ChatGPT in the pharmacological test set based on contextual consistency; b: The accessibility of ChatGPT in the pharmacological test set based on semantic similarity; c: The accessibility of ChatGPT in the pharmacological test set based on Levenshtein similarity consistency tests; d: The accessibility of ChatGPT in the pharmacological test set based on entity similarity consistency tests

3.2.2 The Accuracy of ChatGPT in the Drug Basic Information Query Tasks

3.2.2.1 The Accuracy of ChatGPT in the Drug Chemical Identifiers Information-Based Query Tasks

A chemical identifier is a unique symbol that identifies a compound in computer systems, and it plays a vital role in compound retrieval and chemoinformatics [46]. Standard chemical identifiers include chemical formulas, IUPAC identifiers, CAS identifiers, InChI identifiers, InChIKey identifiers, SMILES identifiers, and more. To standardize the basic information of the drugs it collects, the DrugBank database systematically records the chemical formula, IUPAC identifier, InChI identifier, InChIKey identifier, and SMILES identifier of each drug [27].

To assess the adaptability of ChatGPT in associating compound chemical identifiers, we conducted a ‘question and answer’ style query for the chemical identifiers of 15 drugs. The results indicated that, except for the chemical formula, ChatGPT (including both GPT-3.5 and GPT-4) could not provide adequate and accurate answers to the queried InChI, InChIKey, IUPAC name, and SMILES of the 15 drugs (Table 2, Fig. S2).

Table 2 The consistency performance of ChatGPT across various types of tasks in the drug chemical identifiers information-based query task

Among the drug identifiers that ChatGPT (including GPT-3.5 and GPT-4) could effectively answer, the average accuracy rate was 83.33 ± 37.90%, with GPT-3.5 achieving an accuracy rate of 86.67 ± 35.19% and GPT-4 achieving an accuracy rate of 80.00 ± 41.40%. Therefore, compared to GPT-3.5, GPT-4 did not exhibit a significant improvement in accuracy rate but demonstrated a downward trend. Upon examining the distribution of incorrect answers, they were found to be scattered. It is speculated that the reason for these incorrect answers may be associated with the frequency of drug molecular formulas in the training tasks of GPT-3.5 rather than the difficulty level of the queries (Table 2, Fig. S2).
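The reported means and standard deviations are what one would obtain if each of the 15 chemical-formula answers were scored as simply correct (1) or incorrect (0). The sketch below works under that assumption; the counts of correct answers (13 of 15 for GPT-3.5, 12 of 15 for GPT-4) are inferred from the percentages and are not stated in the text, and the function name is illustrative.

```python
from math import sqrt


def accuracy_mean_sd(n_correct: int, n_total: int):
    """Mean and sample SD (both in %) of n_total 0/1 scores with n_correct ones."""
    p = n_correct / n_total
    # For binary scores the sample variance reduces to p(1 - p) * n / (n - 1).
    sd = sqrt(p * (1 - p) * n_total / (n_total - 1))
    return 100 * p, 100 * sd


# Counts are inferred from the reported percentages (assumption).
for label, correct, total in [("GPT-3.5", 13, 15), ("GPT-4", 12, 15), ("overall", 25, 30)]:
    mean, sd = accuracy_mean_sd(correct, total)
    print(f"{label}: {mean:.2f} ± {sd:.2f}%")
```

Under these assumptions the output matches the figures quoted above (86.67 ± 35.19%, 80.00 ± 41.40%, and 83.33 ± 37.90%). The large standard deviations are therefore a property of binary scoring on a small sample rather than a sign of erratic behavior.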

Another critical issue that needs attention is the ‘knowledge hallucination’ of ChatGPT: when asked about the InChI identifier, InChIKey identifier, IUPAC identifier, and SMILES identifier of the 15 drugs, GPT-3.5 and GPT-4 did not explicitly state that they could not answer these specialized questions, but instead gave seemingly reasonable but wrong answers (details in Supplementary Data 1).

3.2.2.2 The Accuracy of ChatGPT in the Drug's Physicochemical Properties Query Task

The physicochemical properties of drugs significantly impact their absorption, distribution, metabolism, and excretion in the body. Hence, these properties also affect drug efficacy and pharmacodynamic characteristics. The physicochemical properties commonly used in pharmacology and chemoinformatics include MW, monoisotopic weight, logP, logD, bioavailability, and PSA [20, 21]. The DrugBank database keeps detailed records of the MW, monoisotopic weight, logP, and other physicochemical properties of the drugs it collects. To test the adaptability of ChatGPT to the physicochemical properties of drugs, we conducted a ‘question and answer’ style query on the physicochemical properties of the 15 query drugs. The results showed that, except for logP and PSA, for which ChatGPT (including GPT-3.5 and GPT-4) explicitly stated its inability to answer, ChatGPT effectively answered the other three types of physicochemical attributes (Fig. 3a).

Fig. 3 Investigating the potential of ChatGPT in the ‘question and answer’ task for drug physicochemical properties. a Overview of the capability and accuracy of ChatGPT in answering the physicochemical properties of query drugs; b The consistency of the predicted molecular weight (MW) by GPT-3.5 with the molecular weight records in the DrugBank database; c The consistency of the predicted molecular weight (MW) by GPT-4 with the molecular weight records in the DrugBank database; d The consistency of the predicted monoisotopic weight by GPT-3.5 with the monoisotopic weight records in the DrugBank database; e The consistency of the predicted monoisotopic weight by GPT-4 with the monoisotopic weight records in the DrugBank database

The accuracies of ChatGPT's predictions for MW, monoisotopic weight, and bioavailability were 60.00%, 50.00%, and 53.33%, respectively. For GPT-3.5, the accuracies for MW, monoisotopic weight, and bioavailability were 66.67%, 60.00%, and 60.00%, respectively, while for GPT-4 they were 53.33%, 40.00%, and 40.00%, respectively. Compared to GPT-3.5, GPT-4 exhibited no significant improvement in drug MW, monoisotopic weight, or bioavailability but displayed a downward trend. Analysis of the distribution of incorrect predictions revealed a scattered pattern without a high concentration in small- or large-molecule drugs (Fig. 3a). Analysis of error values for MW and monoisotopic weight showed an uneven distribution pattern, indicating that the errors were unrelated to the size of the molecule (Fig. 3b, c, d, and e).
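Unlike bioavailability, MW and monoisotopic weight follow deterministically from the chemical formula, which ChatGPT answered far more reliably. The sketch below illustrates the arithmetic for amobarbital (C11H18N2O3), one of the query drugs; the element tables cover only C, H, N, and O, the parser handles flat formulas only, and the function names are illustrative rather than part of the study's pipeline.

```python
import re

# Standard (average) atomic weights and most-abundant-isotope masses, in Da.
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def parse_formula(formula: str) -> dict:
    """Parse a flat formula such as 'C11H18N2O3' (no brackets or charges)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def mass(formula: str, table: dict) -> float:
    """Sum the per-element masses from `table` over the parsed formula."""
    return sum(table[element] * n for element, n in parse_formula(formula).items())


amobarbital = "C11H18N2O3"
print(f"MW: {mass(amobarbital, AVERAGE):.3f}")
print(f"Monoisotopic weight: {mass(amobarbital, MONOISOTOPIC):.4f}")
```

A check of this kind could flag answers in which a model returns a correct formula but a mass that is inconsistent with it.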

3.2.2.3 The Accuracy of ChatGPT in Pharmacological Properties of Drugs

Pharmacological properties of drugs, such as the mechanism of action, pharmacodynamics, and toxicity, play a crucial role in elucidating and determining the absorption, utilization, distribution, and metabolic patterns of drugs within the body [20, 21]. Understanding these properties is essential for identifying contraindications and determining dosage and administration frequency, and it greatly influences the medical application of drugs. In a ‘question and answer’ task focusing on the fundamental pharmacological properties, pharmacodynamics, mechanism of action, and toxicity of the 15 queried drugs, the accuracy rates were 93.00 ± 20.54%, 85.00 ± 3.27%, 88.67 ± 6.94%, and 95.00 ± 0.00%, respectively, higher than the prediction accuracy for other drug properties. The prediction accuracy rates for GPT-3.5 were 95.33 ± 13.56%, 85.00 ± 3.27%, 86.67 ± 8.59%, and 95.00 ± 0.00%, while for GPT-4, the rates were 90.67 ± 26.04%, 85.00 ± 3.27%, 90.67 ± 4.17%, and 95.00 ± 0.00% (Table 3, Fig. S3).

Table 3 The performance of ChatGPT in predicting drug pharmacological properties

In predicting basic pharmacological properties, the overall prediction accuracy of GPT-3.5 (95.33 ± 13.56%) was higher than that of GPT-4 (90.67 ± 26.04%). A detailed comparison of the prediction outcomes for each compound showed that the difference in predicting the basic pharmacological properties of dequalinium is the primary reason GPT-3.5 outperformed GPT-4. According to the DrugBank database, dequalinium is used in various over-the-counter products to treat mouth infections and inflammation, such as tonsillitis, pharyngitis, and gingivitis. It is also indicated for treating bacterial vaginosis in adult women aged < 55 years in the form of vaginal tablets. GPT-3.5 explicitly stated that dequalinium can be used as an antimicrobial and anti-inflammatory agent for treating different infections, whereas GPT-4 only mentioned that dequalinium can be used as an antimicrobial agent in lozenges or mouthwashes.

In the mechanism-of-action task, the overall prediction accuracy of GPT-4 (90.67 ± 4.17%) was higher than that of GPT-3.5 (86.67 ± 8.59%). Comparing the prediction performance for each compound, GPT-4 performed better than GPT-3.5 in predicting the mechanism-of-action properties of benzphetamine, dezocine, camostat, ropivacaine, montelukast, and silodosin (Table 3, Fig. S3).

According to the DrugBank database, benzphetamine is described as follows: ‘The mechanism of action of these drugs is not fully understood; however, it may be similar to that of amphetamines. Amphetamines stimulate norepinephrine and dopamine release in nerve endings in the lateral hypothalamic feeding center, decreasing appetite.’ This release is mediated by the binding of benzphetamine to centrally located adrenergic receptors. GPT-4 not only responded that benzphetamine could increase the release of norepinephrine in the brain (which can be used for short-term treatment of obesity), but also mentioned its similarity to amphetamines as a sympathomimetic amine. GPT-3.5, in contrast, acknowledged that benzphetamine reduces appetite and increases feelings of fullness by enhancing the release of norepinephrine, but did not mention its similarity to other drugs or its use in short-term obesity treatment (Table 3, Fig. S3). Similarly, in predicting the mechanism-of-action properties of dezocine, camostat, ropivacaine, montelukast, and silodosin, GPT-4 performed better than GPT-3.5 in specific details.

With regard to other pharmacological properties of drugs, such as pharmacodynamics and toxicity, GPT-3.5 and GPT-4 demonstrated similar and excellent performance, achieving accuracy rates of 85.00 ± 3.27% and 95.00 ± 0.00%, respectively. No significant differences were observed between GPT-3.5 and GPT-4 in these aspects (Table 3 and Fig. S3).

The analysis of ChatGPT's answers on fundamental pharmacological properties, drug action mechanisms, pharmacodynamics, and toxicity shows that ChatGPT has a distinct advantage in text-processing tasks over tasks that involve text-numerical association and text-text association.

3.2.2.4 The Accuracy of ChatGPT in Drug-Target Attribute Query Task

A drug's targets are the proteins on which the drug acts directly, for example as an antagonist, agonist, blocker, inhibitor, or modulator, and they are critical to the drug's mechanism of action [20, 21]. In the ‘question-answer’ tasks for the target properties of 15 drugs, GPT-3.5 and GPT-4 exhibited varying performances across different drugs. For drugs with a single target, such as dimetacrine, pefloxacin, ropivacaine, and apremilast, both GPT-3.5 and GPT-4 demonstrated good predictive performance, achieving 100% prediction accuracy (Fig. 4 and Supplementary Data 4). However, for drugs with two or more targets, except for dezocine, neither GPT-3.5 nor GPT-4 could accurately predict all the targets of the drugs.

Fig. 4 Exploring the potential of ChatGPT in ‘question-answer’-based tasks for drug-target attributes. a Overview of the predicted and actual targets of class A drugs (molecular weight < 255.24); b Overview of the predicted and actual targets of class B drugs (255.24 < molecular weight < 412.64); c Overview of the predicted and actual targets of class C drugs (molecular weight > 412.64)

For example, irbesartan, widely used to treat hypertensive patients with type 2 diabetes, relieves hypertension and reduces blood sugar levels. It has direct targets, including AGTR1 (Angiotensin II receptor type 1) and JUN (Jun proto-oncogene, AP-1 transcription factor subunit). However, GPT-3.5 and GPT-4 only recorded AGTR1 as the target for irbesartan, omitting JUN (Fig. 4). Similar issues were observed in the query tasks for silodosin, dequalinium, camostat, and other drugs (Fig. 4). GPT-4 may nevertheless be more accurate than GPT-3.5 in drug target prediction. For example, naltrexone, a medication used to manage alcohol or opioid dependence, blocks the effects of opioids in the brain to reduce cravings. It acts as an antagonist for OPRK1 (opioid receptor kappa 1) and as an agonist for OPRM1 (opioid receptor mu 1) and SIGMAR1 (sigma non-opioid intracellular receptor 1). GPT-3.5 only recorded the OPRM1 target, disregarding the other protein targets. In contrast, GPT-4 recorded the OPRM1 target and also correctly identified the primary target, OPRK1 (Fig. 4).

Another phenomenon observed in the target prediction task is ‘illusory knowledge construction’ or ‘knowledge hallucination.’ When predicting the action target of montelukast, both GPT-3.5 and GPT-4 not only failed to predict its inhibitory target ALOX5 (arachidonate 5-lipoxygenase) but also predicted the new targets LTB4R (leukotriene B4 receptor) and LTC4S (leukotriene C4 synthase) (Fig. 4). A search of the GeneCards database showed that LTB4R and LTC4S belong to cysteinyl leukotriene receptors. However, while LTC4S is a valid target of montelukast, LTB4R is a false target. The GeneCards database lists five drugs that can interact with the LTB4R receptor: three confirmed drugs (gamolenic acid, zafirlukast, and leukotriene B4) and two drugs that have only been demonstrated in experiments (cinalukast and morniflumate) [29].
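Missed and hallucinated targets can be separated by comparing the predicted and recorded target sets. The sketch below uses the irbesartan and naltrexone examples described above; the scoring rule (recall over recorded targets, precision over predicted targets) and the function name are assumptions for illustration and may differ from how accuracy was scored in the study.

```python
def compare_targets(predicted: set, actual: set) -> dict:
    """Compare a predicted target set with the recorded one."""
    hits = predicted & actual
    return {
        "recall": len(hits) / len(actual),
        "precision": len(hits) / len(predicted) if predicted else 0.0,
        "missed": sorted(actual - predicted),      # recorded but not predicted
        "spurious": sorted(predicted - actual),    # predicted but not recorded
    }


# Recorded targets and model answers as described in the text.
actual = {
    "irbesartan": {"AGTR1", "JUN"},
    "naltrexone": {"OPRK1", "OPRM1", "SIGMAR1"},
}
answers = {
    ("irbesartan", "GPT-3.5"): {"AGTR1"},
    ("irbesartan", "GPT-4"): {"AGTR1"},
    ("naltrexone", "GPT-3.5"): {"OPRM1"},
    ("naltrexone", "GPT-4"): {"OPRM1", "OPRK1"},
}

for (drug, model), predicted in answers.items():
    result = compare_targets(predicted, actual[drug])
    print(drug, model, f"recall={result['recall']:.2f}", "missed:", result["missed"])
```

A non-empty `spurious` list would correspond to the hallucinated targets discussed above, while `missed` captures omissions such as JUN for irbesartan.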

3.2.2.5 The Accuracy of ChatGPT in the Querying Tasks of Drug-Drug Interactions

Drug-drug interactions (DDIs) occur when two or more drugs are used in combination and can elicit various risks, including those associated with liver damage, elevated blood pressure, and lowered blood pressure. They are an essential factor influencing drug efficacy and safety and are also one of the critical issues affecting rational clinical drug use and post-marketing surveillance. DDIs have become a significant area of interest in pharmacology [47,48,49]. The DrugBank database provides detailed records of DDI risks. For amobarbital, nine types of DDIs have been documented, including risks of adverse effects, methemoglobinemia, hypotension, central nervous system depression, sedation, constipation, decreased therapeutic efficacy of amobarbital, decreased therapeutic efficacy of other drugs, and decreased metabolism rate of amobarbital (Fig. 5, Table S3).

Fig. 5 Investigating the potential of ChatGPT in drug-drug interaction ‘question-answer’-based tasks. The query drugs are represented by colored cells, and the numbers within the colored cells indicate the count of drug-drug interactions with specific items. For example, a value of eight signifies that there are 8 drug-drug interactions resulting in the mentioned side effect when combined with the respective query drug. Further details regarding the drug-drug interactions can be found in Supplementary Data 5

The analysis of the potential of ChatGPT in predicting DDIs reveals an overall prediction accuracy of 64.50 ± 21.27%, with GPT-3.5 achieving a prediction accuracy of 64.64 ± 0.00% and GPT-4 achieving a prediction accuracy of 64.33 ± 0.00% (Table S3). A comparison of ChatGPT's prediction results for large, medium, and small molecules shows that the performance is significantly better for medium- and small-molecule compounds than for large-molecule drugs. For instance, dequalinium, a large-molecule compound with an MW of 456.67, exhibits DDIs mainly related to risks of adverse effects, bleeding, viral infections, methemoglobinemia, hypotension, and decreased therapeutic efficacy of other drugs (Fig. 5, Table S3). Both GPT-3.5 and GPT-4 failed to produce precise predictions regarding dequalinium. However, they did emphasize the significance of disclosing all medications to healthcare professionals, especially in cases where dequalinium is predominantly administered topically.

Furthermore, we compared the predictive performance of GPT-4 with that of GPT-3.5 across the 15 drugs and observed that GPT-4 outperformed GPT-3.5. Specifically, GPT-4 demonstrated significantly better predictive performance for six drugs, namely amobarbital, butobarbital, chlorzoxazone, dezocine, irbesartan, and dimetacrine. However, in the case of montelukast and apremilast, GPT-3.5 exhibited better predictive performance than GPT-4 (Table S3).

Consider chlorzoxazone as an example. In the DrugBank database, there are four types of interactions between chlorzoxazone and other drugs: the risk of side effects with 103 drugs, the risk of CNS depressant effects with 22 drugs, the risk of sedative effects with one drug, and the risk of changing the rate of metabolism with 214 drugs. GPT-3.5 provided answers regarding the CNS depressant risk and adverse effects risk of drug-drug interactions for chlorzoxazone. GPT-4 not only covered these two risks but also addressed the risk of affecting the metabolism rate, stating that ‘as chlorzoxazone is primarily metabolized by the liver, drugs that can affect liver enzymes may affect the metabolism of chlorzoxazone. This could alter the drug's effectiveness or increase the risk of side effects.’ Similar phenomena were also observed for the other five drugs (Fig. 5, Table S3).
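The chlorzoxazone comparison can be quantified as coverage of the recorded interaction categories. The sketch below is a rough illustration, not the scoring procedure of the study: the category labels are shorthand for the four DrugBank interaction types listed above, and the second measure simply weights each category by its number of interacting drugs without checking whether the same drug appears in several categories.

```python
# Interaction categories for chlorzoxazone with the number of interacting
# drugs per category, as reported in the text (labels are shorthand).
CHLORZOXAZONE_DDI = {
    "adverse effects": 103,
    "CNS depression": 22,
    "sedation": 1,
    "altered metabolism": 214,
}


def coverage(mentioned: set, reference: dict):
    """Return (share of categories covered, share weighted by drug counts)."""
    covered = [category for category in reference if category in mentioned]
    by_category = len(covered) / len(reference)
    by_drug_count = sum(reference[c] for c in covered) / sum(reference.values())
    return by_category, by_drug_count


gpt35 = {"adverse effects", "CNS depression"}
gpt4 = {"adverse effects", "CNS depression", "altered metabolism"}

for name, mentioned in [("GPT-3.5", gpt35), ("GPT-4", gpt4)]:
    by_category, by_drug_count = coverage(mentioned, CHLORZOXAZONE_DDI)
    print(f"{name}: categories {by_category:.2f}, weighted {by_drug_count:.3f}")
```

Under this illustrative measure, adding the metabolism-related risk matters far more than the category count alone suggests, because that category accounts for the largest number of recorded interactions.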

3.2.3 Assessing the Potential of ChatGPT in Drug Structure Optimization Tasks

Compound structure optimization is crucial in enhancing the bioavailability of lead compounds or drug candidates, mitigating toxicity, improving metabolic stability, and optimizing pharmacodynamics [28,29,30,31]. To assess the potential of ChatGPT in compound structure optimization, we established ‘Improving metabolic activity,’ ‘Reducing hepatotoxicity,’ ‘Reducing cardiotoxicity,’ and ‘Increasing solubility’ as the primary optimization objectives. The findings indicate that ChatGPT (GPT-3.5 and GPT-4) offers only general ideas in drug structure optimization tasks. In other words, it can delineate common strategies employed in structure optimization, but it cannot devise comprehensive optimization plans for specific drugs (Fig. 6).

Fig. 6 Exploring ChatGPT's potential in drug structure optimization ‘question-answer’-based tasks. a Buspirone, paroxetine, and 8-chloro-4-(4-methylpiperazin-1-yl)benzofuro[3,2-d]pyrimidine are used as the compounds to be optimized for improving metabolic activity; b, c Amodiaquine and ibufenac are used as the compounds to be optimized for reducing hepatotoxicity; d, e 2-[[(2R)-4-(4-fluorophenyl)-2-methylpiperazin-1-yl]methyl]-7-methoxy-[1,2,4]triazolo[1,5-c]quinazolin-5-amine and N-(2,3-dihydro-[1,4]dioxino[2,3-c]pyridin-7-ylmethyl)-1-[2-(3-fluoro-6-methoxy-1,5-naphthyridin-4-yl)ethyl]piperidin-4-amine are used as the compounds to be optimized for lowering cardiac toxicity; f Rilpivirine, 8-hydroxy camptothecin, and taxol are used to investigate the potential of ChatGPT in devising optimization schemes that enhance water solubility

Metabolic activity optimization encompasses strategies aimed at enhancing the metabolic stability of compounds, prolonging drug action duration in the body, increasing exposure within the body, reducing compound clearance rates, and improving bioavailability. In drug structure optimization tasks targeting ‘improving metabolic activity,’ we selected buspirone, paroxetine, and 8-chloro-4-(4-methylpiperazin-1-yl)benzofuro[3,2-d]pyrimidine as the compounds to be optimized. GPT-3.5 suggests modification of susceptible functional groups, blocking metabolic sites, employing the prodrug approach, and utilizing metabolic stability prediction and modeling. However, it does not provide detailed operational procedures and optimization plans (Fig. 6). Similar issues were observed in the structure optimization of paroxetine and 8-chloro-4-(4-methylpiperazin-1-yl)benzofuro[3,2-d]pyrimidine.

For optimization tasks, including reducing hepatotoxicity, reducing cardiotoxicity, and increasing solubility, both GPT-3.5 and GPT-4 provide generalized answer schemes. For instance, when addressing the solubility improvement of Taxol, GPT-3.5 and GPT-4 propose optimization strategies such as the ‘prodrug approach,’ ‘formulation techniques,’ ‘structural modifications,’ and ‘combination with solubilizing agents.’ They also suggest methods such as complexation with cyclodextrins, nanoemulsion formulation, or encapsulation in liposomes or nanoparticles to enhance solubility by increasing drug dispersibility and effective surface area in water. However, they do not describe the execution difficulty, specific implementation methods, or successful case studies (Fig. 6).

In summary, ChatGPT demonstrates its generalizing ability in compound structure optimization tasks. It provides structural optimization strategies for improving compound activity but cannot offer effective plans and specific examples. Additionally, it fails to provide adequate literature and data support.

3.2.4 The Accuracy of ChatGPT in Systematically Summarizing and Inferring the Current Limitations and Emerging Trends in Pharmacology

Retrieval, comprehension, summarization, and reasoning abilities are vital in evaluating the capabilities of LLMs [6, 14, 50]. To determine ChatGPT's capability in text summarization, we evaluated its performance on ‘current limitations in pharmacological research’ and ‘future trends in pharmacological research.’

3.2.4.1 The Accuracy of ChatGPT in Systematically Summarizing the Current Limitations in Pharmacology

In three repeated inquiries into GPT-3.5 and GPT-4, 16 topics are identified as limitations in current pharmacological research. These topics include ‘limited predictability of preclinical models,’ ‘regulatory challenges,’ ‘limited availability of drug targets,’ ‘translational challenges,’ ‘limited access to human tissue samples,’ ‘limited understanding of disease mechanisms,’ and others. These topics receive an importance score of 80 or above (Fig. 7a and Table S4).

Fig. 7 ChatGPT's ability to systematically summarize and infer the current limitations and emerging trends in pharmacology. a Exploring the ability of ChatGPT to summarize the current limitations in pharmacology systematically. b Exploring the ability of ChatGPT to summarize and infer the emerging trends in pharmacology systematically

Among all the topics, GPT-3.5 identifies ‘lack of diversity in clinical trials,’ ‘high cost of drug development,’ ‘limited understanding of disease mechanisms,’ and ‘ethical concerns’ (5 times) as the most significant limitations in current pharmacology. For instance, GPT-3.5 highlights that current clinical trials are based on a minority of populations and do not represent a broader population, resulting in potential drug efficacy and safety variations across different patient populations. To address this limitation, conducting clinical trials in the broader population or ethnic group is suggested as a practical approach to improving efficacy and safety in current pharmacological research. The three reviewers concur that this topic is a limitation in current pharmacological research. However, they do not consider it the most significant limitation, assigning it an importance score of 86.67 ± 2.89 (Fig. 7a and Table S4).

The importance scores provided by the three reviewers are inconsistent with the number of recommendations made by ChatGPT. The three reviewers considered ‘limited understanding of disease mechanisms’ and ‘limited access to human tissue samples’ the most significant limitations in current pharmacological research, with an average score of 91.67 ± 2.89. However, GPT-3.5 recommended these two topics five times and once, respectively (Fig. 7a and Table S4), indicating a significant imbalance. GPT-3.5's emphasis on ‘limited understanding of disease mechanisms’ is possibly due to the frequent mention of this topic in the literature. On the other hand, ‘limited access to human tissue samples’ was recommended only once, which may be related to the relatively low frequency of reports in the literature. The three reviewers nevertheless gave it a very high importance score. We speculate that this is because most pharmacological studies urgently need human tissue samples, including live ones. Unfortunately, such samples are severely scarce, and related research is often restricted.
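The reviewer scores are reported as mean ± sample standard deviation over three reviewers. The sketch below shows the calculation; the individual scores are hypothetical, chosen only because they are consistent with the reported 86.67 ± 2.89 and 91.67 ± 2.89, and the text does not list the reviewers' actual values.

```python
from statistics import mean, stdev

# Hypothetical individual reviewer scores (the text reports only mean ± SD).
reviewer_scores = {
    "lack of diversity in clinical trials": [85, 85, 90],
    "limited understanding of disease mechanisms": [90, 90, 95],
}


def summarize(values):
    """Mean and sample standard deviation, rounded to two decimals."""
    return round(mean(values), 2), round(stdev(values), 2)


for topic, values in reviewer_scores.items():
    avg, sd = summarize(values)
    print(f"{topic}: {avg:.2f} ± {sd:.2f}")
```

With only three reviewers, a standard deviation of 2.89 corresponds to a single reviewer differing from the other two by five points, so small differences between topic means should be interpreted cautiously.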

3.2.4.2 The Accessibility and Accuracy of ChatGPT in Systematically Summarizing and Inferring the Emerging Trends in Pharmacology

In the three repeated inquiries to GPT-3.5 and GPT-4, 18 topics are identified as trends in future pharmacological research. These topics include ‘Artificial intelligence and machine learning,’ ‘drug repurposing,’ ‘precision medicine,’ ‘nanomedicine,’ ‘gene therapy and gene editing,’ ‘digital health,’ and others. Among them, the ‘Artificial Intelligence and Machine Learning’ topic and the ‘Drug Repurposing’ topic are considered to be hotspots for future pharmacology research, with each being recommended by ChatGPT six times (three times each by GPT-3.5 and GPT-4). However, based on the perspectives of the three reviewers, ‘digital health’ and ‘immunotherapy’ are considered the most important subjects for future research. Each topic receives a high importance score of 91.67 ± 2.89, making them the highest-scoring topics (Fig. 7b and Table S5).

‘Nanomedicine’ is the most controversial topic among all the covered topics. One reviewer argues that this topic remains a prominent issue in pharmacology, assigning it a high score of 90. However, the other two reviewers assigned importance scores below 80. Except for the ‘nanomedicine’ topic, all other topics received an importance score of more than 85 (Fig. 7b and Table S5).

3.3 Evaluation of Lead Compound Structure Optimization Tasks for LLMs Based on Specific Text RAG Mode

We constructed a transient LLM named PharmacologyGPT, using GPT-4 as the base LLM and the literature records of Liu et al. as the source for specific-text RAG. For the three optimization tasks on 10 compounds (metabolic stability, reduced toxicity, and enhanced water solubility), the results showed that PharmacologyGPT improved the predictive effectiveness compared to GPT-3.5 and GPT-4 without altering the prompt method. PharmacologyGPT provided answers for the lead compound optimization strategies reported in the literature and explained specific actionable plans (Fig. 8d). Additionally, basic attribute evaluations, such as contextual consistency (Fig. 8a) and semantic similarity (Fig. 8b), indicated that the RAG setup did not significantly affect the contextual consistency and semantic similarity of the LLM. Therefore, grounding LLMs in specific information, through RAG or fine-tuning, can substantially mitigate the hallucination issues of general LLMs when handling pharmacology tasks.

Fig. 8 Evaluating the capability of ChatGPT in lead compound optimization based on the RAG (Retrieval-Augmented Generation) model. a The accessibility of ChatGPT RAG in lead compound optimization based on contextual consistency; b The accessibility of ChatGPT RAG in lead compound optimization based on semantic similarity; c The accessibility of ChatGPT RAG in lead compound optimization based on Levenshtein similarity consistency tests and entity similarity consistency tests; d The accuracy of ChatGPT RAG in lead compound optimization based on expert scores

Furthermore, for the water solubility improvement optimization task of paclitaxel, the specific-text RAG-powered PharmacologyGPT demonstrated exceptional capability. PharmacologyGPT not only addressed the glycosylation prodrug and poly(ethylene glycol) (PEG) prodrug strategies recorded in the literature but also compared the effectiveness of both optimization strategies and identified the PEG prodrug strategy as the relatively more efficient optimization approach.

Our research results indicate that, under the RAG framework, even GPT-4 based on non-English text can significantly improve the accuracy of answering complex pharmacological questions (specifically, in this study, lead compound optimization tasks). A tracking analysis of the specific-text RAG data flow revealed that when a specific data source is supplied as the RAG resource, the system first segments the data and converts it into vector representations, which are then stored in a vector library. Subsequently, during question-answering tasks, the LLM retrieves relevant information from this library and uses the added context to enhance response accuracy.
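The data flow described above (segment, vectorize, store, retrieve, add context) can be sketched without any external service. The example below is a minimal illustration under strong simplifying assumptions: a bag-of-words count vector stands in for a learned embedding, an in-memory list stands in for the vector library, the two source passages are placeholder sentences rather than the records of Liu et al., and all function names are illustrative. It is not the implementation behind PharmacologyGPT.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40) -> list:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, store: list, k: int = 1) -> list:
    """Return the k stored chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context: list) -> str:
    """Prepend retrieved context to the question before it is sent to the LLM."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Context:\n{joined}\n\nQuestion: {question}"


# Placeholder passages standing in for the literature source (hypothetical text).
documents = [
    "PEG prodrugs and glycosylation prodrugs have been used to improve the water solubility of paclitaxel.",
    "Blocking metabolic soft spots can improve the metabolic stability of buspirone.",
]
# "Vector library": every chunk stored together with its vector.
store = [(piece, embed(piece)) for doc in documents for piece in chunk(doc)]

question = "How can the water solubility of paclitaxel be improved?"
print(build_prompt(question, retrieve(question, store, k=1)))
```

In a production setting the toy `embed` function would be replaced by a sentence-embedding model and the list by a vector database, but the order of operations stays the same.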
