CAN WE TRUST ChatGPT? A PILOT STUDY ABOUT THE ARTIFICIAL INTELLIGENCE'S RECOMMENDATIONS ON URINARY INCONTINENCE

Barbosa-Silva J1, Driusso P2, Ferreira E3, De Abreu R4

Research Type

Pure and Applied Science / Translational

Abstract Category

E-Health

Abstract 16
Interventional Studies
Scientific Podium Short Oral Session 2
Wednesday 23rd October 2024
08:52 - 09:00
Hall N105
Incontinence, Female, Outcomes Research Methods, Rehabilitation, Voiding Dysfunction
1. Hochschule Osnabrück, University of Applied Sciences, Osnabrück, Germany, 2. Federal University of São Carlos, São Carlos, Brazil, 3. School of Medicine, University of São Paulo, Brazil, 4. LUNEX ASBL Luxembourg Health & Sport Sciences Research Institute, Differdange, Luxembourg

Abstract

Hypothesis / aims of study
Although online health counselling has gained popularity among patients, there is growing concern among health professionals regarding the content available in the public domain, especially because low-quality content can negatively influence the physician-patient relationship.
Artificial intelligence (AI) models are now widely used by patients and healthcare professionals. Chat Generative Pre-trained Transformer (ChatGPT) is a variant of the GPT model developed by OpenAI. Results from previous studies suggest that the accuracy of the model in providing guidance on health-related topics is still controversial: ChatGPT's accuracy appears to vary across subspecialized domains, which directly affects its ability to generate recommendations on specific topics.
Some authors have highlighted the potential use of ChatGPT in the urogynecological field [1]; however, there is still no report on the ability and accuracy of the OpenAI model in addressing inquiries about urinary incontinence (UI) management. Urinary leakage is a symptom that directly interferes with women's quality of life and is also associated with morbidity, depression, and self-isolation. Moreover, it is already known that UI-related content available on free platforms (e.g., YouTube) is often incomprehensible and lacks actionable information, resulting in limited availability of high-quality content online. This may interfere with women's attitudes toward seeking treatment for this specific symptom.
Nowadays, it is crucial to evaluate how well AI-generated content addresses inquiries about health-related topics. Although health professionals cannot prevent patients from accessing different sources on the Internet, they should be aware of the varying quality of different platforms and models. Therefore, this study aims to investigate the potential use and accuracy of ChatGPT in addressing questions related to female UI, compared with well-established guidelines.
Study design, materials and methods
To conduct this cross-sectional study, we used the publicly accessible ChatGPT website (available from: https://chat.openai.com/chat). Ethical approval was not required. Two researchers (J.B.S.; R.M.A.) developed five questions related to UI management from the patients' point of view. Reference answers to the questions were extracted by the same two researchers from recommendations presented in guidelines and recent publications, such as those of the International Continence Society (ICS) and the International Urogynecological Association (IUGA) [2,3].
Questions were entered into the ChatGPT platform by one researcher (J.B.S.) on September 16th, 2023, using a browser in incognito mode. Each question was input individually, and all responses were recorded and extracted. Two experienced researchers (P.D. and E.A.F.), each with twenty years of clinical experience in the women's health field, independently graded and reviewed each response to test the performance of ChatGPT. The specialists were instructed to compare the proportion of correct responses from ChatGPT against the scientific evidence.
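For reproducibility, the querying step could also be scripted rather than performed through the web interface. The sketch below is a minimal illustration using the OpenAI Python SDK; the model name and the placeholder questions are our assumptions and do not reproduce the study's exact five questions or its browser-based procedure.

from openai import OpenAI

# Minimal sketch (hypothetical): the study itself queried ChatGPT through
# the public web interface in incognito mode, not through the API.
client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Placeholder patient-style questions; the study's exact wording is not shown.
questions = [
    "Can urinary incontinence be cured?",
    "What is the best treatment for stress urinary incontinence?",
]

responses = {}
for question in questions:
    # One question per request, mirroring the study's one-question-at-a-time input.
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; the study used the web version
        messages=[{"role": "user", "content": question}],
    )
    responses[question] = completion.choices[0].message.content

for question, answer in responses.items():
    print(f"Q: {question}\nA: {answer}\n")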
The accuracy of each response was evaluated using a Likert scale reported in previous studies, with scores ranging from one to six. The completeness of the answers was also assessed independently, using a three-category Likert scale: 1) Incomplete, addresses some aspects of the question, but significant parts are missing or incomplete; 2) Adequate, addresses all aspects of the question and provides the minimum amount of information required to be considered complete; 3) Comprehensive, addresses all aspects of the question and provides additional information or context beyond what was expected.
Any disagreement between the two specialists was resolved in a consensus meeting.
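To illustrate the double-rating workflow described above, the following is a minimal sketch assuming hypothetical question identifiers and scores (not study data): each specialist scores every answer on the two Likert scales, and any disagreement is flagged for the consensus meeting.

# Sketch of the independent double-rating step (all scores are placeholders).
# Accuracy is scored 1-6; completeness 1 (incomplete) to 3 (comprehensive).
ACCURACY_RANGE = range(1, 7)
COMPLETENESS_RANGE = range(1, 4)

ratings = {
    # question_id: {rater: (accuracy, completeness)}
    "Q1": {"rater_1": (3, 1), "rater_2": (4, 1)},
    "Q2": {"rater_1": (2, 2), "rater_2": (2, 2)},
}

def needs_consensus(per_rater_scores):
    """Flag a question when the two raters disagree on either scale."""
    first, second = per_rater_scores.values()
    return first != second

for question_id, per_rater_scores in ratings.items():
    for accuracy, completeness in per_rater_scores.values():
        # Guard against out-of-range scores before analysis.
        assert accuracy in ACCURACY_RANGE and completeness in COMPLETENESS_RANGE
    if needs_consensus(per_rater_scores):
        print(f"{question_id}: disagreement -> resolve in consensus meeting")
    else:
        print(f"{question_id}: raters agree")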
Results
Table 1 shows the answers generated by ChatGPT and the recommendations from specific guidelines, as well as the accuracy ratings reached by consensus among the specialists. The inquiry regarding the cure of urinary leakage (Q1) was classified as having a balanced percentage of correct and incorrect recommendations. Two questions, related to stress UI treatment and pelvic floor muscle training (Q2 and Q4, respectively), yielded more incorrect than correct recommendations.
The terminology used by ChatGPT was not consistent with the guidelines (e.g., "Kegel exercises" instead of "pelvic floor muscle training" (PFMT)). Moreover, some of the techniques the AI suggested for UI treatment were poorly described (e.g., "minimally invasive procedures" for stress UI). Some recommendations did not follow guideline indications (e.g., botulinum toxin injections and radiofrequency for treating UI, avoiding caffeine for treating stress UI, double voiding as a therapeutic technique for urgency UI, and stopping the flow of urine as an exercise to train the pelvic floor muscles). Nonetheless, some recommendations provided by ChatGPT may be considered complementary to PFMT for treating UI (e.g., biofeedback, electrical stimulation); however, the model did not mention the association between the different methods. In addition, ChatGPT generated a PFMT protocol without being asked for further details.
Regarding urgency UI, ChatGPT provided more adequate content than it did for stress UI. The inquiry about the treatment of urgency UI (Q3) was classified as having more correct than incorrect content. Moreover, the answers were considered correct by both specialists when focused on bladder training (Q5).
Figure 1 presents the individual assessments and the consensus among experts regarding the completeness of ChatGPT's answers. Three answers (Q2, Q3, Q5) were classified as adequate, as they provided the minimum information expected to be considered complete. The two questions related to the cure of urinary leakage (Q1) and pelvic floor muscle training (Q4) were classified as incomplete, with significant information missing.
Interpretation of results
Based on our preliminary analysis, the contribution of ChatGPT to evidence-based practice (EBP) in the management of UI is controversial. We identified inconsistencies in the answers provided by the model; however, ChatGPT was able to provide adequate recommendations for urgency UI management. Even so, it failed to address the full content expected in the answers: even when the content presented was correct, additional information could have been included.
The answers generated by ChatGPT are heterogeneous and mixed. Although patients could benefit from some of the model's answers, inaccurate content was embedded in them. This should concern health professionals, as it is known that the patient-professional interaction may be affected by content provided online. If a patient believes that methods discovered through individual internet searches can be used in isolation, health professionals may encounter resistance when suggesting alternative techniques that could be more effective for the patient's clinical condition.
Concluding message
Our findings showed inconsistency and heterogeneity in the accuracy of the answers generated by ChatGPT when compared with scientific guidelines. The content related to possible treatment options for stress UI and pelvic floor muscle training was considered inaccurate; however, the model's answers were considered adequate for topics related to urgency UI.
Regarding completeness, the model's answers were not fully formulated according to the content reported in existing guidelines, which raises a concern for healthcare professionals and the scientific community about the use of artificial intelligence in patient counselling.
Table 1. Accuracy of ChatGPT answers assessed by two different experts independently and their consensus.
Figure 1. The completeness of answers generated by ChatGPT, according to the specialists' assessment.
References
  1. Grünebaum A, Chervenak J, Pollet S, Katz A, Chervenak F. The exciting potential for ChatGPT in obstetrics and gynecology. American Journal of Obstetrics and Gynecology. 2023.
  2. Bø K, Frawley H, Haylen B, Abramov Y, et al. An International Urogynecological Association (IUGA)/International Continence Society (ICS) joint report on the terminology for the conservative and nonpharmacological management of female pelvic floor dysfunction. Neurourol Urodyn. 2017;36(2):221-44.
  3. Abrams P, Andersson K, Apostolidis A, Birder L, Bliss D, Brubaker L, et al. 6th International Consultation on Incontinence: Evaluation and treatment of urinary incontinence, pelvic organ prolapse and faecal incontinence. ICS Standards 2023: The International Consultation on Incontinence Algorithms. 2023.
Disclosures
Funding: None. Clinical Trial: No. Subjects: None.
Citation

Continence 12S (2024) 101358
DOI: 10.1016/j.cont.2024.101358
