Authors
Iblal Rakha¹ and Noorhan Abbas², ¹Oxford University Hospitals NHS Foundation Trust, UK, ²University of Leeds, UK
Abstract
The NHS faces mounting pressures, resulting in workforce attrition and growing care backlogs. Pharmacy services, critical for ensuring medication safety and effectiveness, are often overlooked in digital innovation efforts. This pilot study investigates the potential of Large Language Models (LLMs) to alleviate pharmacy pressures by answering clinical pharmaceutical queries. Two retrieval techniques were evaluated: Vanilla Retrieval Augmented Generation (RAG) and Graph RAG, each supported by an external knowledge source designed specifically for this study. ChatGPT 4o without retrieval served as a control. Quantitative and qualitative evaluations were conducted, including expert human assessments of response accuracy, relevance, and safety. Results demonstrated that LLMs can generate high-quality responses. In expert evaluations, Vanilla RAG outperformed the other models, and even the human reference answers, on accuracy and risk. Graph RAG revealed challenges related to retrieval accuracy. Despite the promise of LLMs, hallucinations and the ambiguity surrounding LLM evaluation in healthcare remain key barriers to clinical deployment. This pilot study underscores the importance of robust evaluation frameworks to ensure the safe integration of LLMs into clinical workflows. However, regulatory bodies have yet to catch up with the rapid pace of LLM development. Guidelines are urgently needed to address transparency, explainability, data protection, and validation, to facilitate the safe and effective deployment of LLMs in clinical practice.
Keywords
Large Language Model Evaluation, Retrieval Augmented Generation, Clinical Question Answering, Knowledge Graphs, Healthcare Artificial Intelligence