Mental health, a critical component of overall well-being, is at the forefront of global health challenges []. In 2019, an estimated 970 million individuals worldwide experienced mental illness, accounting for 12.5% of the global population []. Anxiety and depression are among the most prevalent psychological conditions, affecting 301 million and 280 million individuals, respectively []. In addition, 40 million people experienced bipolar disorder, 24 million experienced schizophrenia, and 14 million experienced eating disorders []. These mental disorders collectively contribute to an estimated US $5 trillion in global economic losses annually []. Despite this staggering prevalence, many cases remain undetected or untreated, as the resources allocated to the diagnosis and treatment of mental illness fall far short of the burden these conditions place on society []. Globally, untreated mental illnesses affect 5% of the population in high-income countries and 19% of the population in low- and middle-income countries []. The COVID-19 pandemic has further exacerbated the challenges faced by mental health services worldwide [], as demand for these services increased while access decreased []. This escalating crisis underscores the urgent need for more innovative and accessible mental health care approaches.
Mental illness treatment encompasses a range of modalities, including medication, psychotherapy, support groups, hospitalization, and complementary and alternative medicine []. However, the societal stigma attached to mental illnesses often deters people from seeking appropriate care []. Influenced by the fear of judgment and concerns about costly, ineffective treatments [], many people with mental illness avoid or delay psychotherapy []. The COVID-19 crisis, like other global public health emergencies, has underscored the importance of digital tools, such as telemedicine and mobile apps, in delivering care during critical times []. In this evolving context, large language models (LLMs) present new possibilities for enhancing the delivery and effectiveness of mental health care.
Recent technological advancements have revealed some unique advantages of LLMs in mental health. These models, capable of processing and generating text akin to human communication, provide accessible support directly to users []. A study analyzing 2917 Reddit (Reddit, Inc) user reviews found that conversational agents (CAs) powered by LLMs are valued for their nonjudgmental listening and effective problem-solving advice. This aspect is particularly beneficial for individuals considered socially marginalized, as it enables them to be heard and understood without the need for direct social interaction []. Moreover, LLMs enhance the accessibility of mental health services, which are notably undersupplied globally []. Recent data reveals substantial delays in traditional mental health care delivery; 23% of individuals with mental illnesses report waiting for >12 weeks for face-to-face psychotherapy sessions [], with 12% waiting for >6 months and 6% waiting for >1 year []. In addition, 43% of adults with mental illness indicate that such long waits have exacerbated their conditions [].
Telemedicine, enhanced by LLMs, offers a practical alternative that expedites service delivery and could flatten traditional health care hierarchies []. This includes real-time counseling sessions through CAs that are not only cost-effective but also accessible anytime and from any location. By reducing the reliance on physical visits to traditional health care settings, telemedicine has the potential to decentralize access to medical expertise and diminish the hierarchical structures within the health care system []. Mental health chatbots developed using language models, such as Woebot [] and Wysa [], have been gaining recognition. Both chatbots follow the principles of cognitive behavioral therapy and are designed to equip users with self-help tools for managing their mental health issues []. In clinical practice, LLMs hold the potential to support the automatic assessment of therapists’ adherence to evidence-based practices and the development of systems that offer real-time feedback and support for patient homework between sessions []. These models also have the potential to provide feedback on psychotherapy or peer support sessions, which is especially beneficial for clinicians with less training and experience []. Currently, these applications are still in the proposal stage. Although promising, they are not yet widely used in routine clinical settings, and further evaluation of their feasibility and effectiveness is necessary.
The deployment of LLMs in mental health also poses several risks, particularly for groups considered vulnerable. Challenges such as inconsistencies in the content generated and the production of “hallucinatory” content may mislead or harm users [], raising serious ethical concerns. In response, authorities such as the World Health Organization have developed ethical guidelines for artificial intelligence (AI) research in health care, emphasizing the importance of data privacy; human oversight; and the principle that AI tools should augment, rather than replace, human practitioners []. These potential problems with LLMs in health care have gained considerable industry attention, underscoring the need for a comprehensive and responsible evaluation of LLMs’ applications in mental health. The following section will further explore the workings of LLMs and their potential applications in mental health and critically evaluate the opportunities and challenges they introduce.
LLMs in Mental Health
LLMs represent advancements in machine learning, characterized by their ability to understand and generate human-like text with high accuracy []. The efficacy of these models is typically evaluated using benchmarks designed to assess their linguistic fidelity and contextual relevance. Common metrics include Bilingual Evaluation Understudy (BLEU) for translation accuracy and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) for summarization tasks []. LLMs are distinguished by their scale, often encompassing billions of parameters, setting them apart from traditional language models []. This breakthrough is largely due to the transformer architecture, a deep neural network structure that uses the “self-attention” mechanism developed by Vaswani et al []. Self-attention allows LLMs to process information in parallel rather than sequentially, greatly enhancing speed and contextual understanding []. To clearly define the scope of this study concerning LLMs, we specify that an LLM must use the transformer architecture and contain a high number of parameters, traditionally at least 1 billion, to qualify as “large” []. This criterion encompasses models such as GPT (OpenAI) and Bidirectional Encoder Representations from Transformers (BERT; Google AI). Although the standard BERT model, with only 0.34 billion parameters [], does not meet the traditional criteria for “large,” its sophisticated bidirectional design and pivotal role in establishing new natural language processing (NLP) benchmarks justify its inclusion among notable LLMs []. The introduction of ChatGPT (OpenAI) in 2022 generated substantial public and academic interest in LLMs, underlining their transformative potential within the field of AI []. Other state-of-the-art LLMs include Large Language Model Meta AI (LLaMA; Meta AI) and Pathways Language Model (PaLM; Google AI), as illustrated in [-].
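To make the self-attention mechanism concrete before turning to the table of models, the following minimal NumPy sketch implements the scaled dot-product attention at the heart of the transformer, softmax(QK^T/√d_k)V; the matrices here are random toy inputs rather than learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # context-aware mixture of values

# Toy example: 4 tokens, 8-dimensional embeddings (random stand-ins for learned projections)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```

Because every row of the attention matrix is computed independently, all token positions can be processed in parallel, which is the property the paragraph above refers to.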
Table 1. Comparative analysis of large language models (LLMs) by parameter size and developer entity. Data were summarized with the latest models up to June 2024, with data for parameters and developers from GPT (OpenAI) to Large Language Model Meta AI (LLaMA; Meta AI) adapted from the study by Thirunavukarasu et al [].

Model name | Publication date | Parameters (billion) | Developer
Generative Pretrained Transformer (GPT) | June 2018 | 0.117 | OpenAI
Bidirectional Encoder Representations from Transformers (BERT) | October 2018 | 0.34 | Google
GPT-2 | January 2019 | 1.5 | OpenAI
Enhanced Representation through Knowledge Integration (ERNIE) | September 2019 | 0.114 | Baidu
Conditional Transformer Language Model (CTRL) | September 2019 | 1.63 | OpenAI
Megatron | September 2019 | 3.9 | NVIDIA
Bidirectional and Auto-Regressive Transformers (BART) | October 2019 | 0.374 | Meta
Turing Natural Language Generation (Turing-NLG) | January 2020 | 530 | Microsoft
GPT-3 | June 2020 | 175 | OpenAI
Vision Transformer (ViT) | October 2020 | 0.632 | Google
Inspired by artist Salvador Dalí and Pixar's WALL·E (DALL-E) | October 2020 | 1.2 | OpenAI
Swin Transformer | March 2021 | 0.197 | Microsoft
Wu Dao 2.0 | June 2021 | 1750 | Huawei
Jurassic-1 | August 2021 | 178 | AI21 Labs
Megatron-Turing Natural Language Generation (MT-NLG) | October 2021 | 530 | Microsoft & Nvidia
Claude | December 2021 | 52 | Anthropic
Generalist Language Model (GLAM) | December 2021 | 1200 | Google
ERNIE 3.0 | December 2021 | 260 | Baidu
Guided Language-to-Image Diffusion for Generation and Editing (GLIDE) | December 2021 | 3.5 | OpenAI
Gopher | December 2021 | 280 | DeepMind
Causal Masked Modeling 3 (CM3) | January 2022 | 13 | Meta
Language Model for Dialogue Applications (LaMDA) | January 2022 | 137 | Google
GPT-NeoX | February 2022 | 20 | EleutherAI
Chinchilla | March 2022 | 70 | DeepMind
GopherCite | March 2022 | 280 | DeepMind
DALL-E 2 | April 2022 | 3.5 | OpenAI
Flamingo | April 2022 | 80 | DeepMind
Pathways Language Model (PaLM) | April 2022 | 540 | Google
Gato | May 2022 | 1.2 | DeepMind
Open Pretrained Transformer (OPT) | May 2022 | 175 | Meta
Yet Another Language Model (YaLM) | June 2022 | 100 | Yandex
Minerva | June 2022 | 540 | Google
BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) | July 2022 | 175 | Hugging Face
Galactica | November 2022 | 120 | Meta
Alexa Teacher Model (Alexa TM) | November 2022 | 20 | Amazon
Large Language Model Meta AI (LLaMA) | February 2023 | 65 | Meta
GPT-4 | March 2023 | 1760 | OpenAI
Cerebras-GPT | March 2023 | 13 | Cerebras
Falcon | March 2023 | 40 | Technology Innovation Institute
Bloomberg Generative Pretrained Transformer (BloombergGPT) | March 2023 | 50 | Bloomberg
PanGu-2 | March 2023 | 1085 | Huawei
OpenAssistant | March 2023 | 17 | LAION
PaLM 2 | May 2023 | 340 | Google
Llama 2 | July 2023 | 70 | Meta
Falcon 180B | September 2023 | 180 | Technology Innovation Institute
Mistral 7B | September 2023 | 7.3 | Mistral
Claude 2.1 | November 2023 | 200 | Anthropic
Grok-1 | November 2023 | 314 | xAI
Mixtral 8x7B | December 2023 | 46.7 | Mistral
Phi-2 | December 2023 | 2.7 | EleutherAI
Gemma | February 2024 | 7 | Google
DBRX | March 2024 | 136 | Databricks
Llama 3 | April 2024 | 70 | Meta AI
Fugaku-LLM | May 2024 | 13 | Fujitsu, Tokyo Institute of Technology, etc
Nemotron-4 | June 2024 | 340 | Nvidia

LLMs are primarily designed to learn fundamental statistical patterns of language []. Initially, these models were used as the basis for fine-tuning task-specific models rather than training those models from scratch, offering a more resource-efficient approach []. This fine-tuning process involves adjusting a pretrained model to a specific task by further training it on a smaller, task-specific dataset []. However, developments in larger and more sophisticated models have reduced the need for extensive fine-tuning in some cases. Notably, some advanced LLMs can now effectively understand and execute tasks specified through natural language prompts without extensive task-specific fine-tuning [].
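As an illustration of this fine-tuning workflow, the sketch below adapts a pretrained BERT encoder to a binary text-classification task using the Hugging Face transformers and datasets libraries; the two example texts, their labels, and the output directory are placeholders rather than data from any of the reviewed studies.

```python
# Sketch only: fine-tuning a pretrained BERT encoder for binary text classification
# (e.g., at-risk vs. not at-risk posts). The texts and labels below are toy placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["I feel hopeless lately", "Had a great day with friends"]   # toy examples
labels = [1, 0]                                                      # 1 = at risk, 0 = not

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                           padding="max_length", max_length=64),
                      batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mh-screening",   # placeholder output path
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()   # further training on the small task-specific dataset
```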
Instruction fine-tuned models undergo additional training on pairs of user requests and appropriate responses. This training allows them to generalize across various complex tasks, such as sentiment analysis, which previously required explicit fine-tuning by researchers or developers []. A key part of the input to these models, such as ChatGPT and Gemini (Google AI), includes a system prompt, often hidden from the user, which guides the model on how to interpret and respond to user prompts. For example, it might direct the model to act as a helpful mental health assistant. In addition, “prompt engineering” has emerged as a crucial technique in optimizing model performance. Prompt engineering involves crafting input texts that guide the model to produce the desired output without additional training. For example, refining a prompt from “Tell me about current events in health care” to “Summarize today’s top news stories about technology in health care” provides the model with more specific guidance, which can enhance the relevance and accuracy of its responses []. While prompt engineering can be highly effective and reduce the need to retrain the model, it is important to be wary of “hallucinations,” a phenomenon where models confidently generate incorrect or irrelevant outputs []. This can be particularly challenging in high-accuracy scenarios, such as health care and medical applications [-]. Thus, while prompt engineering reduces the reliance on extensive fine-tuning, it underscores the need for thorough evaluation and testing to ensure the reliability of model outputs in sensitive applications.
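A minimal sketch of how a hidden system prompt and a refined user prompt are passed to an instruction-tuned chat model is shown below, using the OpenAI Python client; the model name is a placeholder and an API key is assumed to be available in the environment.

```python
# Illustrative only: a system prompt plus a refined user prompt sent to a chat model
# through the OpenAI Python client (model name is a placeholder, not a recommendation).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # System prompt: typically hidden from the user, steers the model's role
        {"role": "system", "content": "You are a helpful mental health assistant."},
        # Refined user prompt: more specific wording tends to yield more relevant answers
        {"role": "user",
         "content": "Summarize today's top news stories about technology in health care."},
    ],
)
print(response.choices[0].message.content)
```

Even with careful prompt design, the output should still be checked for the hallucinations described above before it is used in any health care context.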
The existing literature includes a review of the application of machine learning and NLP in mental health [], analyses of LLMs in medicine [], and a scoping review of LLMs in mental health. These studies have demonstrated the effectiveness of NLP for tasks such as text categorization and sentiment analysis [] and provided a broad overview of LLM applications in mental health []. However, a gap remains in systematically reviewing state-of-the-art LLMs in mental health, particularly in the comprehensive assessment of literature published since the introduction of the transformer architecture in 2017.
This systematic review addresses these gaps by providing a more in-depth analysis; evaluating the quality and applicability of studies; and exploring ethical challenges specific to LLMs, such as data privacy, interpretability, and clinical integration. Unlike previous reviews, this study excludes preprints, follows a rigorous search strategy with clear inclusion and exclusion criteria (using Cohen κ to assess the interreviewer agreement), and uses a detailed assessment of study quality and bias (using the Risk of Bias 2 tool) to ensure the reliability and reproducibility of the findings.
Guided by specific research questions, this systematic review critically assesses the use of LLMs in mental health, focusing on their applicability and efficacy in early screening, digital interventions, and clinical settings, as well as the methodologies and data sources used. The findings of this study highlight the potential of LLMs in enhancing mental health diagnostics and interventions while also identifying key challenges such as inconsistencies in model outputs and the lack of robust ethical guidelines. These insights suggest that, while LLMs hold promise, their use should be supervised by physicians, and they are not yet ready for widespread clinical implementation.
This systematic review followed the PRISMA (Preferred Reporting Items for Systematic Review and Meta-Analyses) guidelines []. The protocol was registered on PROSPERO (CRD42024508617). A PRISMA checklist is available in .
Search Strategies
The search was initiated on August 3, 2024, and completed on August 6, 2024, by 1 author (ZG). ZG systematically searched 5 databases: MEDLINE, IEEE Xplore, Scopus, JMIR, and ACM Digital Library, using the following search keywords: (mental health OR mental illness OR mental disorder OR psychiatry) AND (large language models). These keywords were applied consistently across each database to ensure a uniform search strategy. To conduct a comprehensive and precise search for relevant literature, strategies were tailored to each database. All metadata were searched in MEDLINE and IEEE Xplore, whereas the search in Scopus was confined to titles, abstracts, and keywords. In the JMIR database, the exact match feature was used to refine the results and enhance precision, and in the ACM Digital Library, the search covered the full text. The screening of all citations involved four steps:
1. Initial search: all relevant citations were imported into a Zotero (Corporation for Digital Scholarship) citation manager library.
2. Preliminary inclusion: citations were initially screened based on predefined inclusion criteria.
3. Duplicate removal: citations were consolidated into a single group, from which duplicates were eliminated.
4. Final inclusion: the remaining references were carefully evaluated against the inclusion criteria to determine their suitability.

Study Selection and Eligibility Criteria
All the articles that matched the search criteria were double screened by 2 independent reviewers (ZG and KL) to ensure that each article fell within the scope of LLMs in mental health. This process involved the removal of duplicates followed by a detailed manual evaluation of each article to confirm adherence to our predefined inclusion criteria, ensuring a comprehensive and focused review. To quantify the agreement level between the reviewers and ensure objectivity, interrater reliability was calculated using Cohen κ, with a score of 0.84 indicating a good level of agreement. In instances of disagreement, a third reviewer (AL) was consulted to achieve consensus.
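For readers unfamiliar with the statistic, the short sketch below shows how Cohen κ can be computed from two reviewers' include/exclude decisions using scikit-learn; the decision vectors are invented for illustration and are not the actual screening data.

```python
# Illustrative only: inter-rater agreement (Cohen kappa) for two reviewers'
# include/exclude decisions; the decision vectors below are made up.
from sklearn.metrics import cohen_kappa_score

reviewer_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # 1 = include, 0 = exclude
reviewer_b = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen kappa = {kappa:.2f}")   # values around 0.8 indicate strong agreement
```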
To assess the risk of bias, we used the Risk of Bias 2 tool, as recommended for Cochrane Reviews. The results have been visualized in . We thoroughly examined each study for potential biases that could impact the validity of the results. These included biases from the randomization process, deviations from intended interventions, missing outcome data, inaccuracies in outcome measurement, and selective reporting of results. This comprehensive assessment ensures the credibility of each study.
The criteria for selecting articles were as follows: we limited our search to English-language publications, focusing on articles published between January 1, 2017, and April 30, 2024. This timeframe was chosen considering the substantial developments in the field of LLMs in 2017, marked notably by the introduction of the transformer architecture, which has greatly influenced academic and public interest in this area.
In this review, original research articles with available full texts were selected, with a focus on the application of LLMs in mental health. To comply with the PRISMA guidelines, articles that had not been published in a peer-reviewed venue, including those only available on a preprint server, were excluded. Owing to the limited literature specifically addressing the mental health applications of LLMs, we included review articles to ensure a comprehensive perspective. The selection criteria focused on direct applications, expert evaluations, and ethical considerations related to the use of LLMs in mental health contexts, with the goal of providing a thorough analysis of this rapidly developing field.
Information Extraction
The data extraction process was jointly conducted by 2 reviewers (ZG and KL), focusing on examining the application scenarios, model architecture, data sources, methodologies used, and main outcomes from selected studies on LLMs in mental health.
Initially, we categorized each study to determine its main objectives and applications. The categorization process was conducted in 2 steps. First, after reviewing all the included articles, we grouped them into 3 primary categories: detection of mental health conditions and suicidal ideation through text, LLM use for mental health CAs, and other applications and evaluation of the LLMs in mental health. In the second step, we performed a more detailed categorization. After a thorough, in-depth reading of each article within these broad categories, we refined the classifications based on the specific goals of the studies. Following this, we summarized the main model architectures of the LLMs used and conducted a thorough examination of data sources, covering both public and private datasets. We noted that some review articles lacked detail on dataset content; therefore, we focused on providing comprehensive information on public datasets, including their origins and sample sizes. We also investigated the various methods used across different studies, including data collection strategies and analytic methodologies. We examined their comparative structures and statistical techniques to offer a clear understanding of how these methods are applied in practice.
Finally, we documented the main outcomes of each study, recording significant results and aligning them with relevant performance metrics and evaluation criteria. We used a narrative approach to synthesize the information, integrating and comparing results from various studies to highlight the efficacy and impact of LLMs in mental health, and providing quantitative data where applicable to underscore these findings. The results of this analysis are presented in Tables S1-S3 in [,-], each corresponding to 1 of the primary categories.
The PRISMA diagram of the systematic screening process can be seen in . Our initial search across 5 academic databases, namely, MEDLINE, IEEE Xplore, Scopus, JMIR, and ACM Digital Library, yielded 14,265 papers: 907 (6.36%) from MEDLINE, 102 (0.72%) from IEEE Xplore, 204 (1.43%) from Scopus, 211 (1.48%) from JMIR, and 12,841 (90.02%) from ACM Digital Library. After duplicate removal, 97.91% (13,967/14,265) of the papers were retained as unique records. Subsequent screening based on the predefined inclusion and exclusion criteria narrowed the selection to 0.29% (40/13,967) of these papers, which were included in this review. The reasons for the full-text exclusion of 61 papers can be found in .
In our review of the literature, we classified the included articles into 3 broad categories: detection of mental health conditions and suicidal ideation through text (15/40, 38%), LLMs’ use for mental health CAs (7/40, 18%), and the other applications and evaluation of the LLMs in mental health (18/40, 45%). The first category investigates the potential of LLMs for the early detection of mental illness and suicidal ideation via social media and other textual sources. Early screening is highlighted as essential for preventing the progression of mental disorders and mitigating more severe outcomes. The second category assesses LLM-supported CAs used as teletherapeutic interventions for mental health issues, such as loneliness, with a focus on evaluating their effectiveness and validity. The third category covers a broader range of LLM applications in mental health, including clinical uses such as decision support and therapy enhancement. It aims to assess the overall effectiveness, utility, and ethical considerations associated with LLMs in these settings. All selected articles are summarized in Tables S1-S3 in according to the 3 categories.
Mental Health Conditions and Suicidal Ideation Detection Through Text
Early intervention and screening are crucial in mitigating the global burden of mental health issues []. We examined the performance of LLMs in detecting mental health conditions and suicidal ideation through textual analysis. Of 40 articles, 6 (15%) assessed the efficacy of early screening for depression using LLMs [,,,,,], while another (1/40, 2%) simultaneously addressed both depression and anxiety []. One comprehensive study examined various psychiatric conditions, including depression, social anxiety, loneliness, anxiety, and other prevalent mental health issues []. Two (5%) of the 40 articles assessed and compared the ability of LLMs to perform sentiment and emotion analysis [,], and 5 (12%) articles focused on the capability of LLMs to analyze textual content for detecting suicidal ideation [,,,,]. Most studies (10/40, 25%) used BERT and its variants as one of the primary models [,,,,,,,,,], while GPT models were also commonly used (8/40, 20%) [,,,,,,,]. Most training data (10/40, 25%) comprised social media posts [,,,,,,,,,] from platforms such as Twitter (Twitter, Inc), Reddit, and Sina Weibo (Sina Corporation), covering languages such as English, Malay dialects, Chinese, and Portuguese. In addition, 5 (12%) of the 40 studies used datasets consisting of clinical transcripts and patient interviews [,,,,], providing deeper insights into LLM applications in clinical mental health settings.
In studies focusing on early screening for depression, direct comparison of results across investigations is challenging due to variations in datasets, training methods, and models. Nonetheless, substantial evidence supports the significant potential of LLMs in detecting depression from text-based data. For example, Danner et al [] conducted a comparative analysis using a convolutional neural network on the Distress Analysis Interview Corpus-Wizard of Oz dataset, achieving F1-scores of 0.53 and 0.59; however, their use of GPT-3.5 demonstrated superior performance, with an F1-score of 0.78. Another study involving the E-Distress Analysis Interview Corpus dataset (an extension of Distress Analysis Interview Corpus-Wizard of Oz) used the Robustly Optimized BERT Approach for Depression Detection to predict Patient Health Questionnaire-8 scores from textual data. This approach identified 3 levels of depression and achieved the lowest mean absolute error of 3.65 in Patient Health Questionnaire-8 scores [].
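The two evaluation set-ups described above, classification scored with F1 and Patient Health Questionnaire-8 score prediction scored with mean absolute error, can be reproduced with a few lines of scikit-learn; the label and score values below are invented for illustration and do not come from the cited studies.

```python
# Sketch of the two evaluation set-ups mentioned above (all values are illustrative):
# (1) depression classification scored with F1, (2) PHQ-8 regression scored with MAE.
from sklearn.metrics import f1_score, mean_absolute_error

y_true_cls = [1, 0, 1, 1, 0, 1]        # reference depression labels
y_pred_cls = [1, 0, 0, 1, 0, 1]        # model predictions
print("F1:", round(f1_score(y_true_cls, y_pred_cls), 2))

phq8_true = [4, 12, 18, 7]             # clinician-derived PHQ-8 scores
phq8_pred = [6, 10, 15, 9]             # model-predicted scores
print("MAE:", mean_absolute_error(phq8_true, phq8_pred))
```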
LLMs play an important role in sentiment analysis [,], which categorizes text into overall polarity classes, such as positive, neutral, negative, and occasionally mixed, and emotion classification, which assigns labels such as “joy,” “sadness,” “anger,” and “fear” []. These analyses enable the detection of emotional states and potential mental health issues from textual data, facilitating early intervention []. Stigall et al [] demonstrated the efficacy of these models, with their study showing that Emotion-aware BERT Tiny, a fine-tuned variant of BERT, achieved an accuracy of 93.14% in sentiment analysis and 85.46% in emotion analysis. This performance surpasses that of baseline models, including BERT-Base Cased and BERTTiny-Pretrained [], underscoring the advantages and validity of fine-tuning in enhancing model performance. LLMs have also demonstrated robust accuracy in detecting and classifying a range of mental health syndromes, including social anxiety, loneliness, and generalized anxiety. Vajre et al [] introduced PsychBERT, developed using a diverse training dataset from both social media texts and academic literature, which achieved an F1-score of 0.63, outperforming traditional deep learning approaches such as convolutional neural networks and long short-term memory networks, which recorded F1-scores of 0.57 and 0.51, respectively []. In research on detecting suicidal ideation using LLMs, Diniz et al [] showcased the high efficacy of the BERTimbau large model within a non-English (Portuguese) context, achieving an accuracy of 0.955, precision of 0.961, and an F1-score of 0.954. The assessment of the BERT model by Metzler et al [] found that it correctly identified 88.5% of tweets as suicidal or off-topic, performing comparably to human analysts and other leading models. However, Levkovich et al [] noted that while GPT-4 assessments of suicide risk closely aligned with those by mental health professionals, it overestimated suicidal ideation. These results underscore that while LLMs have the potential to identify tweets reflecting suicidal ideation with accuracy comparable to psychological professionals, extensive follow-up studies are required to establish their practical application in clinical settings.
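As a minimal illustration of this kind of sentiment labelling, the sketch below runs the Hugging Face text-classification pipeline with its default general-purpose sentiment checkpoint; the studies above instead used mental health-specific fine-tuned BERT variants, and the example posts are fabricated.

```python
# Sketch of sentiment labelling with an off-the-shelf fine-tuned encoder. The default
# checkpoint loaded here is a general-purpose sentiment model, not a clinical tool.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # loads the library's default sentiment model
posts = ["I can't stop worrying about everything.",
         "Finally went outside today and it felt great."]
for post, result in zip(posts, classifier(posts)):
    print(result["label"], round(result["score"], 3), "-", post)
```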
LLMs in Mental Health CAs
In the growing field of mental health digital support, the implementation of LLMs as CAs has exhibited both promising advantages [,,,] and significant challenges [,]. The studies by Ma et al [] and Heston [] demonstrate the effectiveness of CAs powered by LLMs in providing timely, nonjudgmental mental health support. This intervention is particularly important for those who lack ready access to a therapist due to constraints such as time, distance, and work, as well as for certain populations considered socially marginalized, such as older adults who experience chronic loneliness and a lack of companionship [,]. The qualitative analysis of user interactions on Reddit by Ma et al [] highlights that LLMs encourage users to speak up and boost their confidence by providing personalized and responsive interactions. In addition, VHope, a DialoGPT-enabled mental health CA, was evaluated by 3 experts who rated its responses as 67% relevant, 78% human-like, and 79% empathetic []. Another study found that, across 717 evaluations by 100 participants on 239 autism-specific questions, 46.86% of evaluators preferred the responses of the chief physicians, whereas 34.87% preferred the responses of GPT-4, and 18.27% favored the responses of Enhanced Representation through Knowledge Integration Bot (ERNIE Bot; version 2.2.3; Baidu, Inc). Moreover, ChatGPT (mean 3.64, 95% CI 3.57-3.71) outperformed physicians (mean 3.13, 95% CI 3.04-3.21) in terms of empathy [], indicating that LLM-powered CAs are not only effective but also acceptable to users. These findings highlight the potential for LLMs to complement mental health intervention systems and provide valuable medical guidance.
The development and implementation of a non-English CA for emotion capture and categorization was explored in a study by Zygadlo et al []. Faced with a scarcity of Polish datasets, the study adapted by translating an existing database of personal conversations from English into Polish, which decreased the accuracy in tasks from 90% in English to 80% in Polish []. While the performance remained commendable, it highlighted the challenges posed by the lack of robust datasets in languages other than English, impacting the effectiveness of CAs across different linguistic environments. However, findings by He et al [] suggest that the availability of language-specific datasets is not the sole determinant of CA performance. In their study, although ERNIE Bot was trained in Chinese and ChatGPT in English, ChatGPT demonstrated greater empathy for Chinese users []. This implies that factors beyond the training language and dataset availability, such as model architecture or training methodology, can also affect the empathetic responsiveness of LLMs, underscoring the complexity of human-AI interaction.
Meanwhile, the reliability of LLM-driven CAs in high-risk scenarios remains a concern [,]. An evaluation of 25 CAs found that in tests involving suicide scenarios, only 2 included suicide hotline referrals during the conversation []. This suggests that while these CAs can detect extreme emotions, few are equipped to take effective preventive measures. Furthermore, CAs often struggle with maintaining consistent communication due to limited memory capacity, leading to disruptions in conversation flow and negatively affecting user experience [].
Other Applications and Evaluation of the LLMs in Mental Health
ChatGPT has gained attention for its unparalleled ability to generate human-like text and analyze large amounts of textual data, attracting the interest of many researchers and practitioners []. Numerous evaluations of LLMs in mental health have focused on ChatGPT, exploring its utility across various scenarios such as clinical diagnosis [,,], treatment planning [,,], medication guidance [,,], patient management [], psychiatry examinations [], and psychology education [], among others [,,,].
Research has highlighted ChatGPT’s accuracy in diagnosing various psychiatric conditions [,,,]. For example, Franco D’Souza et al [] evaluated ChatGPT’s responses to 100 clinical psychiatric cases, awarding it an “A” rating in 61 cases, with no errors in the diagnoses of different psychiatric disorders and no unacceptable responses, underscoring ChatGPT’s expertise and interpretative capacity in psychiatry. Further supporting this, Schubert et al [] assessed the performance of ChatGPT 4.0 using neurology board-style examination questions, finding that it answered 85% of the questions correctly, surpassing the average human performance of 73.8%. Meanwhile, in a study of LLMs regarding the prognosis and long-term outcomes of depression, GPT-4, Claude (Anthropic), and Bard (Google AI) showed strong agreement with mental health professionals. They all recommended a combination of psychotherapy and antidepressant medication in every case []. This not only supports the reliability of LLMs for mental health assessment but also highlights their usefulness in providing valuable support and guidance for individuals seeking information or coping with mental illness.
However, the direct deployment of LLMs, such as ChatGPT, in clinical settings carries inherent risks. The outputs of LLMs are heavily influenced by prompt engineering, which can lead to inconsistencies that undermine clinical reliability [,-,]. For example, Farhat et al [] conducted a critical evaluation of ChatGPT’s ability to generate medication guidelines through detailed cross-questioning and noted that altering prompts substantially changed the responses. While ChatGPT typically provided helpful advice and recommended seeking expert consultation, it occasionally produced inappropriate medication suggestions. Perlis et al [] verified this, showing that GPT-4 Turbo suggested medications that were considered less efficient choices or contraindicated by experts in 12% of the cases. Moreover, LLMs often lack the necessary clinical judgment capabilities. This issue was highlighted in the study by Grabb [], which revealed that despite built-in safeguards, ChatGPT remains susceptible to generating extreme and potentially hazardous recommendations. A particularly alarming example was ChatGPT advising a patient with depression to engage in high-risk activities such as bungee jumping as a means of seeking pleasure []. These LLMs depend on prompt engineering [,,], which means their responses can vary widely depending on the wording and context of the prompts given. The system prompts, which are predefined instructions given to the model, and the prompts used by the experimental team, such as those in the study by Farhat et al [], guide the behavior of ChatGPT and similar LLMs. These prompts are designed to accommodate a variety of user requests within legal and ethical boundaries. However, while these boundaries are intended to ensure safe and appropriate responses, they often fail to align with the nuanced sensitivities required in psychological contexts. This mismatch underscores a significant deficiency in the clinical judgment and control of LLMs within sensitive mental health settings.
Further research into other LLMs in the mental health sector has shown a range of capabilities and limitations. For example, a study by Sezgin et al [] highlighted Language Model for Dialogue Applications’ (LaMDA’s) proficiency in managing complex inquiries about postpartum depression that require medical insight or nuanced understanding; however, they pointed out its challenges with straightforward, factual questions, such as “What are antidepressants?” []. Assessments of LLMs such as LLaMA-7B, ChatGLM-6B, and Alpaca, involving 50 interns specializing in mental illness, received favorable feedback regarding the fluency of these models in a clinical context, with scores >9.5 out of 10. However, the results also indicated that the responses of these LLMs often failed to address mental health issues adequately, demonstrated limited professionalism, and resulted in decreased usability []. Similarly, a study on psychiatrists’ perceptions of using LLMs such as Bard and Bing AI (Microsoft Corp) in mental health care revealed mixed feelings. While 40% of physicians indicated that they would use such LLMs to assist in answering clinical questions, some expressed serious concerns about their reliability, confidentiality, and potential to damage the patient-physician relationship [].
In the context of the wider prominence of LLMs in the literature [,,,,,,,], this study supports the assertion that interest in LLMs is growing in the field of mental health. indicates an increase in the number of mental health studies using LLMs, with a notable surge observed in 2023 following the introduction of ChatGPT in late 2022. Although we included articles only up to the end of April 2024, it is evident that the number of articles related to LLMs in the field of mental health continues to show a steady increase in 2024. This marks a substantial change in the discourse around LLMs, reflecting their broader acceptance and integration into various aspects of mental health research and practice. The progression from text analysis to a diverse range of applications highlights the academic community’s recognition of the multifaceted uses of LLMs. LLMs are increasingly used for complex psychological assessments, including early screening, diagnosis, and therapeutic interventions.
The findings of this study demonstrate that LLMs are highly effective in analyzing textual data to assess mental states and identify suicidal ideation [,,,,,,,,,,], although their categorization often tends to be binary [,,,,,,]. These LLMs possess extensive knowledge in the field of mental health and are capable of generating empathic responses that closely resemble human interactions [,,]. They show great potential for providing mental health interventions with improved prognoses [,,,,,], with the majority being recognized by psychologists for their appropriateness and accuracy [,,]. The careful and rational application of LLMs can enhance mental health care efficiently and at a lower cost, which is crucial in areas with limited health care capacity. However, there are currently no studies available that provide evaluative evidence to support the clinical use of LLMs.
Limitations

Limitations of Using LLMs in Mental Health
On the basis of the reviewed literature, the strengths and weaknesses of applying LLMs in mental health are summarized in .
LLMs have a broad range of applications in the mental health field. These models excel in user interaction, provide empathy and anonymity, and help reduce the stigma associated with mental illness [,], potentially encouraging more patients to participate in treatment. They also offer a convenient, personalized, and cost-effective way for individuals to access mental health services at any time and from any location, which can be particularly helpful for populations considered socially isolated, especially older adults [,,]. In addition, LLMs can help reduce the burden of care during times of severe health care resource shortages and patient overload, such as during the COVID-19 pandemic []. Although previous research has highlighted the potential of LLMs in mental health, it is evident that they are not yet ready for clinical use due to unresolved technical risks and ethical issues.
The use of LLMs in mental health, particularly those fine-tuned for specific tasks such as ChatGPT, reveals clear limitations. The effectiveness of these models heavily depends on the specificity of user-generated prompts. Inappropriate or imprecise prompts can disrupt the conversation’s flow and diminish the model’s effectiveness [,,,,]. Even small changes in the content or tone of prompts can sometimes lead to significant variations in responses, which can be particularly problematic in health care settings where interpretability and consistency are critical [,,]. Furthermore, LLMs lack clinical judgment and are not equipped to handle emergencies [,]. While they can generally capture extreme emotions and recognize scenarios requiring urgent action, such as suicide ideation [,,,,], they often fail to provide direct, practical measures, typically only advising users to seek professional help []. In addition, the inherent bias in LLM training data [,] can lead to the propagation of stereotypical, discriminatory, or biased viewpoints. This bias can also give rise to hallucinations, that is, LLMs producing erroneous or misleading information [,]. Furthermore, hallucinations may stem from overfitting the training data or a lack of context understanding []. Such inaccuracies can have serious consequences, such as providing incorrect medical information, reinforcing harmful stereotypes, or failing to recognize and appropriately respond to mental health crises []. For example, an LLM might reinforce a harmful belief held by a user, potentially exacerbating their mental health issues. It could also generate nonfactual, overly optimistic, or pessimistic medical advice, delaying appropriate professional intervention. These issues could undermine the integrity and fairness of social psychology [,,,].
Another critical concern is the “black box” nature of LLMs [,,]. This lack of interpretability complicates the application of LLMs in mental health, where trustworthiness and clarity are important. When conventional neural networks are described as black boxes, their training data, training procedures, and weights are at least known, even if their internal reasoning is opaque. With many newer LLMs, such as GPT-3.5 and GPT-4, researchers and practitioners typically access the models through web interfaces or application programming interfaces without complete knowledge of the training data, methods, or model updates. This situation compounds the traditional interpretability challenges of neural networks with the additional opacity of a “hidden” model.
Ethical concern is another significant challenge associated with applying LLMs in mental health. Debates are emerging around issues such as digital personhood, informed consent, the risk of manipulation, and the appropriateness of AI in mimicking human interactions [,,,,]. A primary ethical concern is the potential alteration of the traditional therapist-patient relationship. Individuals may struggle to fully grasp the advantages and disadvantages of LLM derivatives, often choosing these options for their lower cost or greater convenience. This shift could lead to an increased reliance on the emotional support provided by AI [], inadvertently positioning AI as the primary diagnostician and decision maker for mental health issues, thereby undermining trust in conventional health care settings. Moreover, therapists may become overly reliant on LLM-generated answers and use them in clinical decision-making, overlooking the complexities involved in clinical assessment. This reliance could compromise their professional judgment and reduce opportunities for in-depth engagement with patients [,,]. Furthermore, the increasingly technocratic nature of mental health care has the potential to depersonalize and dehumanize patients [], with decisions driven more by algorithms than by human insight and empathy. This can lead to decisions becoming mechanized, lacking empathy, and detached from ethics []. AI systems may fail to recognize or adequately interpret subtle and often nonverbal cues, such as the tone of voice, facial expressions, and the emotional weight behind words, which are critical in traditional therapeutic settings []. These cues are essential for comprehensively understanding a patient’s condition and providing empathetic care.
In addition, the current roles and accuracy of LLMs in mental health are limited. For instance, while LLMs can categorize a patient’s mood or symptoms, most of these categorizations are binary, such as depressed or not depressed [,]. This oversimplification can lead to misdiagnoses. Data security and user privacy in clinical settings are also of utmost concern [,,,,]. Although approximately 70% of psychiatrists believe that managing medical documents will be more efficient using LLMs, many still have concerns about their reliability and privacy [,,]. These concerns could have a devastating impact on patient privacy and undermine the trust between physicians and patients if confidential treatment records stored in LLM databases are compromised. Beyond the technical limitations of AI, the current lack of an industry-benchmarked ethical framework and accountability system hinders the true application of LLMs in clinical practice [].
Limitations of the Selected Articles
Several limitations were identified in the literature review. A significant issue is the age bias present in the social media data used for depression and mental health screening. Social media platforms tend to attract younger demographics, leading to an underrepresentation of older age groups []. Furthermore, most studies have focused on social media platforms, such as Twitter, primarily used by English-speaking populations, which may result in a lack of insight into mental health patterns in non–English-speaking regions. Our review included studies in Polish, Chinese, Portuguese, and Malay, all of which highlighted the significant limitations of LLMs caused by the availability and size of databases [,,,,]. For instance, due to the absence of a dedicated Polish-language mental health database, a Polish study had to rely on machine-translated English databases []. While the LLMs achieved 80% accuracy in categorizing emotions and moods in Polish, this is still lower than the 90% accuracy observed in the original English dataset. This discrepancy highlights that the accuracy of LLMs can be affected by the quality of the database.
Another limitation of this study is the low diversity of LLMs studied. Although we used “large language models” as keywords in our search phase, the vast majority of identified studies (39/40, 98%) included BERT and its variants or GPT models among the models studied. Therefore, this review provides only a limited picture of the variability expected in applicability between different LLMs. In addition, the rapid development of LLM technologies presents a limitation; this study can only reflect current advancements and may not encompass future advances or the full potential of LLMs. For instance, in tests involving psychologically relevant questions and answers, GPT-3.5 achieved an accuracy of 66.8%, while GPT-4.0 reached an accuracy of 85%, compared to the average human score of 73.8% []. Evaluating ChatGPT at different stages separately and comparing its performance to that of humans can lead to varied conclusions. In the assessment of prognosis and treatment planning for depression using LLMs, GPT-3.5 demonstrated a distinctly pessimistic prognosis that differed significantly from those of GPT-4, Claude, Bard, and mental health professionals []. Therefore, continuous monitoring and evaluation are essential to fully understand and effectively use the advancements in LLM technologies.
Opportunities and Future Work
Implementing technologies involving LLMs within the health care provision of real patients demands thorough and multifaceted evaluations. It is imperative that industry and researchers do not let rollout outpace the evidence required to establish safety and efficacy. At the level of the service provider, this includes providing explicit warnings to the public to discourage mistaking LLM functionality for clinical reliability. For example, GPT-4 introduced the ability to process and interpret image inputs within conversational contexts, leading OpenAI to issue an official warning that GPT-4 is not approved for analyzing specialized medical images such as computed tomography scans [].
A key challenge to address in LLM research is the tendency to produce incoherent text or hallucinations. Future efforts could focus on training LLMs specifically for mental health applications, using datasets with expert labeling to reduce bias and create specialized mental health lexicons [,,]. The creation of specialized datasets could take advantage of the customizable nature of LLMs, fostering the development of models that cater to the distinct needs of varied demographic groups. For instance, unlike models designed for health care professionals that assist in tasks such as data documentation, symptom analysis, medication management, and postoperative care, LLMs intended for patient interaction might be trained with an emphasis on empathy and comfortable dialogue.
Another critical concern is the problem of outdated training data in LLMs. Traditional LLMs, such as GPT-4 (with a training data cutoff of October 2023), rely on potentially outdated training data, limiting their ability to incorporate recent events or information. This can compromise the accuracy and relevance of their responses, leading to the generation of uninformative or incorrect answers, known as “hallucinations” []. Retrieval-augmented generation (RAG) technology offers a solution by retrieving facts from external knowledge bases, ensuring that LLMs use the most accurate and up-to-date information []. By searching for relevant information from numerous documents, RAG enhances the generation process with the most recent and contextually relevant content []. In addition, RAG incorporates evidence-based information, increasing the reliability and credibility of LLM responses [].
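A minimal sketch of the RAG pattern follows: a query is matched against a small external knowledge base with TF-IDF similarity, and the best-matching passage is prepended to the prompt before generation. The knowledge-base snippets are placeholders, and the commented-out llm.generate() call stands in for any LLM backend.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most relevant
# passage from a small external knowledge base and prepend it to the prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [                      # placeholder snippets, not real guidance
    "Updated clinical guidance on first-line treatments for generalized anxiety.",
    "Summary of recent evidence on digital CBT programs for depression.",
    "Crisis-line contact information and referral procedures.",
]
query = "What does current guidance recommend for generalized anxiety?"

vectorizer = TfidfVectorizer().fit(knowledge_base + [query])
doc_vecs = vectorizer.transform(knowledge_base)
query_vec = vectorizer.transform([query])
best = cosine_similarity(query_vec, doc_vecs).argmax()      # top-1 retrieval

prompt = (f"Answer using only the context below.\n"
          f"Context: {knowledge_base[best]}\n"
          f"Question: {query}")
# response = llm.generate(prompt)   # hypothetical call to any LLM backend
print(prompt)
```

Production RAG systems typically replace the TF-IDF step with dense embeddings and a vector store, but the retrieve-then-generate structure is the same.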
To further enhance the reliability of LLM content and minimize hallucinations, recent studies suggest adjusting model parameters, such as the “temperature” setting [-]. The temperature parameter influences the randomness and predictability of outputs []. Lowering the temperature typically results in more deterministic outputs, enhancing coherence and reducing irrelevant content []. However, this adjustment can also limit the model’s creativity and adaptability, potentially making it less effective in scenarios requiring diverse or nuanced responses. In mental therapy, where nuanced and sensitive responses are essential, maintaining an optimal balance is crucial. While a lower temperature can ensure accuracy, which is important for tasks such as clinical documentation, it may not suit therapeutic dialogues where personalized engagement is key. Low temperatures can lead to repetitive and impersonal responses, reducing patient engagement and therapeutic effectiveness. To mitigate these risks, regular updates of the model incorporating the latest therapeutic practices and clinical feedback are essential. Such updates could refine the model’s understanding and response mechanisms, ensuring it remains a safe and effective tool for mental health care. Nevertheless, determining the “optimal” temperature setting is challenging, primarily due to the variability in tasks and interaction contexts, which require different levels of creativity and precision.
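The sketch below illustrates the temperature parameter at decoding time using a small open model (GPT-2, purely as a stand-in for any generative LLM); lower values make sampling more deterministic, higher values more varied.

```python
# Sketch of the effect of the temperature parameter on decoding, using GPT-2
# as a small stand-in for any generative LLM (the prompt is illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Coping strategies for stress include", return_tensors="pt")

for temperature in (0.2, 1.0):   # lower = more deterministic, higher = more diverse
    torch.manual_seed(0)         # same seed so only temperature differs
    output = model.generate(**inputs, do_sample=True, temperature=temperature,
                            max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
    print(f"temperature={temperature}:",
          tokenizer.decode(output[0], skip_special_tokens=True))
```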
Data privacy is another important area of concern. Many LLMs, such as ChatGPT and Claude, involve sending data to third-party servers, which poses the risk of data leakage. Current studies have found that LLMs can be enhanced by privacy-enhancing techniques, such as zero-knowledge proofs, differential privacy, and federated learning []. In addition, privacy can be preserved by replacing identifying information in textual data with generic tokens. For example, sensitive details (eg, names, addresses, or credit card numbers) can be replaced with mask tokens before storage, helping to protect user data from unauthorized access []. This obfuscation technique ensures that sensitive user information is not stored directly, thereby enhancing data security.
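A minimal sketch of this masking step is shown below, replacing identifying information with generic tokens before text is stored or sent to a third-party model; real deployments would rely on a vetted de-identification tool rather than these illustrative regular expressions.

```python
# Sketch only: replacing identifying information with generic mask tokens.
# The regular expressions are illustrative and not exhaustive.
import re

def mask_pii(text: str) -> str:
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)                  # e-mail addresses
    text = re.sub(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b", "[CARD]", text)   # card numbers
    text = re.sub(r"\+?\d[\d\s().-]{7,}\d", "[PHONE]", text)                    # phone numbers
    return text

print(mask_pii("Contact me at jane.doe@example.com or +44 20 7946 0958."))
# -> Contact me at [EMAIL] or [PHONE].
```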
The lack of interpretability in LLM decision-making is another crucial area for future research on health care applications. Future research should examine the models’ architecture, training, and inferential processes for clearer understanding. Detailed documentation of training datasets, sharing of model architectures, and third-party audits would ideally form part of this undertaking. Investigating techniques such as attention mechanisms and modular architectures could illuminate aspects of neural network processing. The implementation of knowledge graphs might help in outlining logical relationships and facts []. In addition, another promising approach involves creating a dedicated embedding space during training, guided by an LLM. This space aligns with a causal graph and aids in identifying matches that approximate counterfactuals [].
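As one concrete, if partial, interpretability probe, the sketch below extracts attention weights from a BERT encoder to inspect which input tokens each token attends to most; attention offers only a limited window into model behavior and is not a full explanation of a model's decision.

```python
# Sketch of an attention-based interpretability probe on a BERT encoder.
# The input sentence is illustrative; attention is not a complete explanation.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("I have not slept properly in weeks", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_layer = outputs.attentions[-1][0]          # (heads, tokens, tokens) for the last layer
avg_attention = last_layer.mean(dim=0)          # average over attention heads
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, row in zip(tokens, avg_attention):
    top = row.argmax().item()
    print(f"{token:>10} attends most to {tokens[top]}")
```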
Before deploying LLMs in mental health settings, a comprehensive assessment of their reliability, safety, fairness, abuse resistance, and interpretability is required.