The dataset comprises 33 anonymized free-text breast cancer pathology reports sourced from Taipei Medical University Hospital; one of them is illustrated in Fig. 1.
Fig. 1 Screenshot demonstrating the free-text format of a pathology report (example from Taipei Medical University)
These reports, representative of a broad spectrum of breast cancer cases, were chosen to ensure a comprehensive analysis applicable to real-world clinical scenarios. The diversity of the dataset is crucial for the development of a robust AI model capable of accurately structuring and extracting information from free-text pathology reports. We acknowledge the concerns regarding the limited size of our dataset and its potential impact on the generalizability of our findings. To address these issues, we plan to explore opportunities to expand our dataset to include more diverse reports from multiple institutions. This will enable a more robust validation of our AI model’s capabilities across a broader clinical spectrum. Additionally, we will conduct statistical tests to assess the sufficiency of our sample size for capturing the complexity and variability of breast cancer pathology reports. These evaluations will be aligned with established research methodologies and best practices in the field to ensure the reliability and applicability of our findings.
Research Design and Prototype Development
This study employed a structured approach using Large Language Models (LLMs) to automate the extraction and structuring of information from the pathology reports. The primary algorithm used is the Generative Pre-trained Transformer (GPT), integrated into a custom-built Streamlit.io web application. Streamlit.io, an open-source platform, was chosen for its ability to facilitate rapid development of generative AI applications. The platform allowed us to work directly from our GitHub profile and to use an integrated development environment (IDE) through GitHub Codespaces, enabling coding in a browser without local IDE installations, as shown in Fig. 2. This setup supports coding and deploying directly on Streamlit.io's servers, streamlining the development process by eliminating the need for traditional deployment pipelines.
Fig. 2 Integrated Development Environment Interface
Prototype and Algorithm Integration
The user interface of the prototype is a straightforward single-page application (SPA), emphasizing simplicity and ease of use. Front-end development began by importing essential libraries such as streamlit for the app framework and pandas for data manipulation. The main function of the Streamlit app (main) encapsulates all functionality, including file uploading and data processing controls. An Excel file uploader was implemented to allow users to input their pathology reports, which are then loaded into a pandas DataFrame for data structuring. The code structure is displayed in Fig. 3.
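The uploader and DataFrame step described above might be sketched as follows. This is a minimal illustration and not the code shown in Fig. 3: the helper `missing_columns`, the column name `report_text`, and the page title are assumptions introduced here, since the text does not specify them.

```python
def missing_columns(columns, required=("report_text",)):
    """Return the required column names absent from the uploaded sheet.

    "report_text" is a hypothetical column name; the paper does not give one.
    """
    present = set(columns)
    return [name for name in required if name not in present]


def main():
    # Imported inside main() so the helper above can be used without Streamlit.
    import pandas as pd
    import streamlit as st

    st.title("Pathology report structuring")  # assumed page title
    uploaded = st.file_uploader("Upload pathology reports", type=["xlsx"])
    if uploaded is None:
        return  # nothing to process until a file is provided

    reports = pd.read_excel(uploaded)  # one report per row is assumed
    missing = missing_columns(reports.columns)
    if missing:
        st.error(f"Missing column(s): {', '.join(missing)}")
        return
    st.dataframe(reports)  # preview the uploaded reports
```

Adding a final `main()` call and saving the file as, say, `app.py` would let it be launched with `streamlit run app.py`.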
Fig. 3
The selection of GPT-3.5 over other LLMs such as BERT or BioBERT was driven by GPT-3.5's superior performance in understanding and generating complex language patterns, which is crucial for processing the nuanced language found in medical pathology reports. Comparative studies have shown that GPT-3.5 provides enhanced context capture and coherence in generating textual interpretations, which is vital for the accurate extraction and structuring of data from unstructured medical texts. These capabilities make GPT-3.5 particularly adept at handling the specialized vocabulary and varied syntactical structures prevalent in pathology reports.
The application calls the GPT-3.5 model through the OpenAI API. The API key is stored in environment variables and loaded securely using dotenv, as displayed in Fig. 4. Rate limits are managed to keep API performance reliable, using strategies such as extended pauses between requests. The application also includes error handling mechanisms to maintain operational continuity and data integrity.
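The key handling and pause-based rate limiting described above could look roughly like the sketch below. The helper names `load_api_key` and `call_with_pause`, the 20-second pause, and the three-attempt limit are assumptions made for illustration; the text states only that environment variables, dotenv, and extended pauses between requests were used.

```python
import os
import time


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment (hypothetical helper).

    With python-dotenv installed, calling load_dotenv() beforehand copies
    the entries of a local .env file into os.environ.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key


def call_with_pause(request, max_attempts=3, pause_seconds=20.0, sleep=time.sleep):
    """Call request() and pause before retrying if it raises.

    request is any zero-argument callable that performs one API call.
    The attempt count and pause length are assumed values.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except Exception as error:  # e.g. a rate-limit error from the API client
            last_error = error
            if attempt < max_attempts:
                sleep(pause_seconds)  # extended pause before the next request
    raise RuntimeError(f"request failed after {max_attempts} attempts") from last_error
```

Passing the API call as a callable keeps the retry logic independent of any particular client library version, and the `sleep` parameter allows the pause to be replaced in tests.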
Fig. 4
The effectiveness of GPT-3.5 in our application is significantly enhanced by meticulous prompt engineering, that is, the design and refinement of inputs to the model to optimize output accuracy for medical data extraction [10, 20]. Prompts are formulated for clarity and specificity and, through iterative refinement based on performance feedback, are tailored to the idiosyncratic expressions and terminology of breast cancer pathology, which contributes significantly to the high accuracy rates observed. This systematic approach is crucial given the complexities and nuances of medical data, and it supports accurate extraction of breast cancer-specific information according to ICCR protocols [21]. It also underscores the importance of model tuning and adaptation when applying generative AI in specialized domains.
In practice, our system uses a loop function to issue these prompts to the ChatGPT model via the OpenAI API. The function iterates over a sequence of structured prompts, each designed to extract a specific piece of medical information from the pathology reports. As each prompt is processed, the AI's response is evaluated for accuracy and relevance. If the response fails to meet our strict accuracy thresholds or appears incomplete, the prompt is adjusted dynamically and re-submitted. This adaptive process ensures that the extracted information is both precise and comprehensive, adhering to established ICCR protocols for data integrity and reliability. The iterative loop mirrors the methodology discussed in [22], which highlights the importance of tailored prompts in extracting structured medical information with generative AI models.
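The prompt loop described above could be sketched as follows. This is an illustration under stated assumptions and not the authors' implementation: the `ask` and `is_acceptable` callables, the wording of the adjusted prompt, the three-attempt limit, and the "not specified" fallback are all hypothetical stand-ins for the model call, the accuracy check, and the dynamic adjustment that the text mentions without detailing.

```python
def extract_fields(report_text, prompts, ask, is_acceptable, max_attempts=3):
    """Issue one prompt per field and re-submit when the answer is rejected.

    ask(prompt_text) -> str               sends one prompt to the model (assumed interface)
    is_acceptable(field, answer) -> bool  stands in for the accuracy check
    """
    results = {}
    for field, instruction in prompts.items():
        prompt = f"{instruction}\n\nReport:\n{report_text}"
        for _ in range(max_attempts):
            answer = ask(prompt).strip()
            if is_acceptable(field, answer):
                results[field] = answer
                break
            # Hypothetical adjustment: restate the instruction more firmly.
            prompt = (
                f"{instruction}\nAnswer with the requested value only.\n\n"
                f"Report:\n{report_text}"
            )
        else:
            # Assumed fallback once every attempt has been rejected.
            results[field] = "not specified"
    return results
```

In use, `ask` would wrap the API call and `prompts` would be a dictionary mapping field names to instructions; the result is one value per field, ready to be placed in a table row.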
Old Prompts:
prompt = (
    f"Check ER, PgR, and HER2 is positive or Not\n\n"
    f"Show the result in the following format: \n"
    f"ER: Positive/Negative\n"
    f"PgR: Positive/Negative\n"
    f"HER2: Positive/Negative\n"
    f"Identify the laterality of the specimen as well (right, left, or unspecified).\n\n"
    f"Laterality: Right/Left/Not specified\n\n"
    f"Extract the dimensions of the specimen.\n\n"
    f"Dimensions: Width Height Depth \n\n"
    f"Tumour Focality: Cannot be assessed/Single focus/Multiple foci\n\n"
    f"Determine the histological tumour type. If the tumour type is 'Mixed', specify the subtypes present."  # line cut off in the source
    f"Histological Tumour Type: Description\n\n"
    f"Assess the histological tumour grade and include any relevant scores or details if the score cannot."  # line cut off in the source
    f"Histological Tumour Grade: Description\n\n"
    f"Determine if carcinoma in situ is present and mention the type if applicable.\n\n"
    f"Carcinoma In Situ: Not identified/Present; if present, mention the type\n\n"
    f"Determine the presence and involvement of skin in the tumour extension.\n\n"
)
New Prompts:
"Size_of_foci": "Search the medical report for any mention of the 'Sizes of individual foci'. If a specific size or range is given, strictly provide that in a short sentence, such as '5 mm' or '3–5 mm'. If no sizes are mentioned, strictly respond with 'not specified'.",
"tumour_dimensions": "Check the medical report for tumor presence. If there is No residual invasive carcinoma, strictly respond with 'No residual invasive carcinoma'. If only microinvasion ≤ 1mm is mentioned, strictly respond with 'Only microinvasion present (≤ 1 mm)'. Otherwise, strictly respond with 'not specified'.",
"max_dimension_largest_focus": "Search the medical report for the maximum dimension of the largest invasive focus if it is greater than 1 mm. strictly Provide the exact measurement rounded to the nearest mm as a single word. If this information is not available, strictly respond with 'not specified'.",
"tumour_additional_dimensions": "Extract the additional dimensions of the largest invasive focus from the report, presented as 'length x width' in mm. If no additional dimensions are mentioned, strictly respond with 'not specified'.",
"tumour_maxWhole_dimensions": "Determine the maximum dimension of the entire tumor field from the report. strictly Provide this dimension in mm. If the dimension cannot be assessed or is not mentioned, strictly respond with 'not specified'.",
"specify dimensions": "Check the medical report to determine if it states that the tumor dimensions cannot be assessed. If 'Cannot be assessed' is mentioned, strictly respond with 'Cannot be assessed' and include any specific reason given in the report. If the report does not address the assessability of tumor dimensions at all, strictly respond with 'not specified'.",
User Interface and Output Validation
The Streamlit application serves as the interface where users can upload their pathology reports, as shown in Fig. 5, view the extracted data, and perform validations. Discrepancies can be corrected directly in the interface, enhancing the utility and accuracy of the application. The validated data is then available for download in Excel format for further analysis or archival, as displayed in Fig. 6.
Fig. 5
The architecture of the system facilitates a seamless flow from data input through processing to output, as shown in Fig. 7. The integration of Streamlit, GitHub Codespaces and the OpenAI API forms a robust framework that supports the extraction and structuring of data from pathology reports, transforming unstructured text into structured, analyzable formats. As Fig. 8 shows, the system architecture demonstrates not only the application's functionality but also its potential for scalability and adaptation to other types of medical data analysis.
Fig. 7
This methodological framework leverages advanced AI within a user-centric prototype to transform pathology report processing, enhancing accuracy, efficiency and interoperability in medical data analysis.