Research Note | Open access
The performance of OpenAI ChatGPT-4 and Google Gemini in virology multiple-choice questions: a comparative analysis of English and Arabic responses
BMC Research Notes volume 17, Article number: 247 (2024)
Abstract
Objective
The integration of artificial intelligence (AI) in healthcare education is inevitable. Understanding the proficiency of generative AI in different languages to answer complex questions is crucial for educational purposes. The study objective was to compare the performance of ChatGPT-4 and Gemini in answering Virology multiple-choice questions (MCQs) in English and Arabic, while assessing the quality of the generated content. Both AI models’ responses to 40 Virology MCQs were assessed for correctness and quality based on the CLEAR tool designed for the evaluation of AI-generated content. The MCQs were classified into lower and higher cognitive categories based on the revised Bloom’s taxonomy. The study followed the METRICS checklist for the design and reporting of generative AI-based studies in healthcare.
Results
ChatGPT-4 and Gemini performed better in English than in Arabic, with ChatGPT-4 consistently surpassing Gemini in correctness and CLEAR scores. ChatGPT-4 led Gemini with 80% vs. 62.5% correctness in English, compared to 65% vs. 55% in Arabic. Both AI models performed better on lower cognitive level MCQs. Both ChatGPT-4 and Gemini exhibited potential in educational applications; nevertheless, their performance varied across languages, highlighting the importance of continued development to ensure effective AI integration in healthcare education globally.
Introduction
Arabic is the native language of over 400 million people and plays a major role in effective communication across most Middle Eastern and North African countries [1]. Nevertheless, English is the official language of teaching and learning in healthcare education in the majority of Arab countries [2,3,4]. Therefore, university students are challenged by a linguistic shift at the start of healthcare education [5, 6].
The emergence of generative artificial intelligence (AI) models can help bridge this linguistic challenge in healthcare for non-native English speakers [7,8,9,10]. However, the use of AI models should be preceded by a critical evaluation of the reliability of their generated content, especially in non-English languages, given the predominantly English training of large language models (LLMs) [11, 12]. If generative AI models are integrated into education, the bias toward English in LLM training could undermine efforts to achieve global educational equity [13,14,15].
The integration of generative AI models into various aspects of daily life has attracted growing interest [16,17,18]. This trend is particularly notable in healthcare, where AI models can increase operational efficiency and improve the quality of patient care and healthcare education [7, 18,19,20,21]. The potential benefits of AI in education are widely recognized; nevertheless, valid ethical concerns remain, alongside concerns regarding the inaccuracies reported in AI-generated content [7, 22,23,24].
The use of multiple-choice questions (MCQs) in healthcare education is recognized as a reliable method to evaluate students’ achievement of learning outcomes [25, 26]. A relevant approach to classifying MCQs is the revised Bloom’s taxonomy, which is based on cognitive functions ranging from basic knowledge recall to the application of knowledge in problem-solving and the systematic analysis of various concepts [27,28,29].
Since the integration of generative AI into various aspects of healthcare education appears inevitable, it is important to consider the strengths and limitations of this innovative technology [20, 30, 31]. This involves a thorough evaluation of the performance of widely used AI chatbots, such as ChatGPT and Gemini, in different educational contexts [32,33,34,35]. Recent studies investigated the capability of different AI models to pass exams in various domains, with wide variability in performance, as reviewed recently by Newton and Xiromeriti [36]. This variability can be attributed to several factors, such as the AI model used, the prompting approach, and, importantly, the language(s) used in prompting [36,37,38]. Such findings highlight the need for continued research to elucidate the determinants of AI models’ performance, thereby informing the refinement of AI algorithms for improved performance and, subsequently, improved utility in disciplines such as healthcare education [7, 39, 40].
Therefore, the current study aimed to compare the performance of two prominent AI models (ChatGPT-4 versus Gemini) in English and Arabic within the specialized field of Virology. The study hypothesis was that generative AI models perform better in English than in Arabic, based on the presumed higher quality of training data available in English and on the few reports describing this language disparity [41, 42]. Highlighting critical discrepancies in generative AI performance across languages can identify possible areas for improvement by AI developers.
Methods
Study design
The study utilized the METRICS checklist for the design and reporting of generative AI studies in healthcare [37]. The study was based on 40 randomly selected Virology MCQs used for testing medical students during the period 2017–2022. The MCQs were fully designed by the first author (M.S.), who holds a PhD in Clinical Virology, and were original, without any copyright issues.
The MCQs were classified based on the revised Bloom’s taxonomy into two cognitive levels: lower, comprising 20 “Remember” and “Understand” MCQs, and higher, comprising 20 “Apply” and “Analyze” MCQs, with classification based on a consensus between the first and senior authors. The MCQs were translated into Arabic by the first author and back-translated into English by the senior author, both of whom are bilingual in English and Arabic.
The study was approved by the institutional review board (IRB) at the Faculty of Pharmacy – Applied Science Private University (reference number: 2024-PHA-5).
Models of generative AI tested, settings, and testing time
Two generative AI models were selected for testing based on their relevance, popularity, and advanced capabilities. The two models were ChatGPT-4 (OpenAI, San Francisco, CA) [33], and Gemini (Google, Mountain View, CA) [32].
We did not use the “regenerate response” or “modify response” features and refrained from providing any feedback for the two models to avoid feedback bias. Testing was conducted during 17 February–2 March 2024.
Prompt and language specificity
The following exact prompt was used: “For the following virology MCQ, please select the single most appropriate answer with an explanation for the rationale behind selecting this choice and excluding the other choices”. All MCQs were presented independently one-by-one in English and one-by-one in Arabic.
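For illustration only, since the two models were tested through their chat interfaces [32, 33], the following minimal sketch shows how the same fixed-prompt, one-MCQ-per-request protocol could be scripted against the OpenAI Python API; the model identifier, client setup, and helper function are assumptions for the sketch, not part of the study.

```python
# Hypothetical scripted analogue of the study's prompting protocol; the study
# itself used the ChatGPT-4 and Gemini web interfaces.
from openai import OpenAI

PROMPT = (
    "For the following virology MCQ, please select the single most "
    "appropriate answer with an explanation for the rationale behind "
    "selecting this choice and excluding the other choices"
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_mcq(mcq_text: str) -> str:
    """Send one MCQ with the fixed study prompt; no feedback, no regeneration."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{mcq_text}"}],
    )
    return response.choices[0].message.content

# Each MCQ is submitted as an independent request (no shared chat history),
# mirroring the one-by-one presentation in English and then in Arabic.
```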
AI content evaluation approach and individual involvement in evaluation
First, we assessed the correctness of the responses against the MCQ answer keys. Then, a subjective evaluation of the AI-generated content was performed based on a modified version of the CLEAR tool [43]. This involved assessing the content on three dimensions: (1) completeness of the generated response; (2) accuracy, reflected by the absence of false knowledge and the content being evidence-based; and (3) appropriateness and relevance, with content being easy to understand, well organized, and free from irrelevant material (Additional file 1) [43]. Each dimension was evaluated using a 5-point Likert scale, with 1 indicating “poor” and 5 representing “excellent” [43]. To enhance the objectivity of the assessment, a predefined list of criteria pertinent to each MCQ was established through discussions between the first and senior authors. Subsequently, the content produced by the two models underwent independent evaluation by the two raters. The CLEAR score for each piece of content was calculated by averaging the scores across the three assessed dimensions, and the overall average CLEAR scores were then derived by averaging the scores assigned by the two raters.
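The scoring arithmetic can be summarized in a short sketch; the function and example ratings below are illustrative, not study data.

```python
# Each rater scores three CLEAR dimensions on a 1-5 Likert scale; the
# per-response CLEAR score is the mean of the three dimensions, and the
# overall average CLEAR score is the mean across the two raters.
from statistics import mean

def clear_score(completeness: int, accuracy: int, appropriateness: int) -> float:
    """Average of the three assessed CLEAR dimensions for one response."""
    return mean([completeness, accuracy, appropriateness])

# Hypothetical ratings of one AI-generated answer by the two raters:
rater1 = clear_score(completeness=5, accuracy=4, appropriateness=5)  # 4.67
rater2 = clear_score(completeness=4, accuracy=4, appropriateness=5)  # 4.33
overall_average = mean([rater1, rater2])
print(f"Average CLEAR score: {overall_average:.2f}")  # 4.50
```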
Statistical and data analyses
The statistical analysis was conducted using IBM SPSS Statistics Version 26.0 (Armonk, NY: IBM Corp). Associations between categorical variables were explored using the two-sided Fisher’s exact test (FET), while associations between the scale variable (CLEAR score) and categorical variables were assessed using the non-parametric Mann–Whitney U test (M-W). The Kolmogorov–Smirnov test was employed to confirm the non-normality of the scale variable.
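The same procedures can be reproduced outside SPSS; a minimal sketch using SciPy is shown below, where the CLEAR score lists are placeholders and only the 2×2 counts are taken from the Results section.

```python
# Equivalent tests in SciPy; data are placeholders except the 2x2 table,
# which uses counts reported in the Results section.
import numpy as np
from scipy import stats

# Two-sided Fisher's exact test on a 2x2 table of correct vs. incorrect counts
# (here, ChatGPT-4 in English vs. Arabic: 32/40 vs. 26/40 correct).
table = [[32, 8], [26, 14]]
odds_ratio, p_fet = stats.fisher_exact(table, alternative="two-sided")

# Mann-Whitney U test comparing CLEAR scores between two groups.
clear_english = [4.3, 5.0, 4.7, 3.7, 4.3]  # placeholder score lists
clear_arabic = [3.7, 4.3, 4.0, 3.3, 3.0]
u_stat, p_mw = stats.mannwhitneyu(clear_english, clear_arabic,
                                  alternative="two-sided")

# One-sample Kolmogorov-Smirnov test against a fitted normal distribution
# to check the normality of the scale variable.
scores = np.array(clear_english + clear_arabic)
ks_stat, p_ks = stats.kstest(scores, "norm", args=(scores.mean(), scores.std()))

print(f"FET P = {p_fet:.3f}; M-W P = {p_mw:.3f}; K-S P = {p_ks:.3f}")
```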
Results
General performance of ChatGPT-4 versus Gemini in English and Arabic
A higher percentage of correct responses was observed for both generative AI models in English compared to Arabic, albeit without statistical significance. For Gemini, the total number of correct responses in Arabic was 22/40 (55.0%), while the correct responses for the same MCQs in English were 25/40 (62.5%; P = .650, FET). A similar trend was observed for ChatGPT-4, with correct responses in Arabic at 26/40 (65.0%) compared to 32/40 (80.0%) in English (P = .210, FET).
ChatGPT-4 yielded a higher number of correct responses than Gemini in both Arabic (65.0% vs. 55.0%, P = .494, FET) and English (80.0% vs. 62.5%, P = .137, FET).
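As a cross-check, both model-versus-model comparisons can be recomputed from the reported counts; the following sketch, assuming SciPy’s two-sided Fisher’s exact test matches the SPSS procedure, should closely reproduce the quoted P values.

```python
from scipy.stats import fisher_exact

# Rows: ChatGPT-4 vs. Gemini; columns: correct vs. incorrect (of 40 MCQs each).
tables = {
    "Arabic": [[26, 14], [22, 18]],  # 65.0% vs. 55.0%; reported P = .494
    "English": [[32, 8], [25, 15]],  # 80.0% vs. 62.5%; reported P = .137
}

for language, table in tables.items():
    _, p = fisher_exact(table)  # two-sided by default
    print(f"{language}: P = {p:.3f}")
```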
Performance of ChatGPT-4 and Gemini based on revised Bloom’s categories
In the evaluation of Gemini and ChatGPT-4 performance across the two revised Bloom’s cognitive domains, the overall performance was consistently better in the lower cognitive domain (Table 1).
Within each revised Bloom’s domain, ChatGPT-4 marginally outscored Gemini: in the lower cognitive domain, ChatGPT-4 had 35 correct responses (87.5%) vs. 28 for Gemini (70.0%; P = .099, FET), while in the higher cognitive domain, ChatGPT-4 had 23 correct responses (57.5%) vs. 19 for Gemini (47.5%; P = .502, FET).
Performance of ChatGPT-4 versus Gemini based on the average CLEAR scores
Based on the average CLEAR scores, the performance of both generative AI models was superior in English compared to Arabic. For Gemini, the average CLEAR score was 3.48 ± 1.16 in Arabic compared to 4.00 ± 1.06 in English (P = .022, M-W, Fig. 1A). For ChatGPT-4, the average CLEAR score was 4.20 ± 0.74 in Arabic compared to 4.68 ± 0.57 in English (P = .001, M-W, Fig. 1B). In Arabic, Gemini received lower average CLEAR scores than ChatGPT-4 (3.48 ± 1.16 vs. 4.20 ± 0.74, P = .005, M-W, Fig. 1), and the same pattern was observed in English, with Gemini again scoring lower than ChatGPT-4 (4.00 ± 1.06 vs. 4.68 ± 0.57, P = .002, M-W, Fig. 1).
Table 2 illustrates the performance of the two AI models per CLEAR score component.
Discussion
The current study provided a comparative analysis of the abilities of the generative AI models Gemini and ChatGPT-4 to answer Virology MCQs in English and Arabic. The findings revealed potential limitations inherent in the current versions of these AI technologies, which should be addressed prior to their incorporation into healthcare education, especially for non-English speakers.
The results highlighted a discernible performance disparity between English and Arabic, with both AI models showing lower accuracy in Arabic. This finding can be attributed to challenges within LLMs’ processing capabilities, particularly for languages with complex grammatical structures or limited digital resources. The reduced accuracy in Arabic emphasizes the need for enriched training datasets that more comprehensively cover the linguistic diversity of global languages. This finding also aligns with growing evidence of lower performance of different generative AI models in non-English languages.
In line with our observations, Samaan et al. reported lower ChatGPT accuracy in Arabic compared to English for cirrhosis-related queries [42]. Similarly, Banimelhem and Amayreh reported suboptimal English-to-Arabic translation capabilities for ChatGPT [44]. Additionally, a recent study showed superior performance of four generative AI models in English compared to Arabic for infectious disease queries [45], while an earlier study showed inferior ChatGPT performance on general health queries in Arabic dialects [41]. Inferior performance of AI chatbots has also been reported in other non-English languages, including Chinese [46], Polish [47, 48], and Spanish [49].
Interestingly, the study findings showed that both Gemini and ChatGPT-4 struggled with higher cognitive MCQs, which require advanced critical thinking and problem-solving skills. This limitation of generative AI performance is particularly relevant in healthcare education, where the ability to apply knowledge creatively and critically is essential [50]. The observed limitation raises concerns about the current reliability of AI as an educational tool, as reported in the context of various AI chatbots [7, 20, 40, 51, 52]. Collectively, these results highlight critical areas for future development and improvement in AI training approaches.
Of note, this study highlighted the superior performance of ChatGPT-4 compared to Gemini in processing both Arabic and English across various cognitive levels, with a particularly pronounced advantage in addressing higher cognitive MCQs in Arabic. These findings might hint at OpenAI’s leading position in the development of LLMs, while also acknowledging the continued need for enhancements to improve performance in educational contexts [53].
Finally, this study showed the need for substantial improvements in generative AI training to enhance performance in non-English languages and in processing higher-order cognitive queries. Addressing these challenges can improve the quality of AI-generated content and ensure its reliability, rendering AI chatbots effective educational tools across diverse linguistic and cultural contexts.
In conclusion, the study findings delineated the capabilities and limitations of ChatGPT-4 and Gemini relevant to the future of AI-assisted education. The variations in performance observed between languages and cognitive categories highlight the need for continued research, development, and optimization of generative AI models. Special attention should be paid to enhancing the linguistic diversity and cognitive understanding capabilities of generative AI models to achieve global educational equity.
Limitations
The study limitations included the small number of MCQs, which restricted the scope of the performance evaluation. The subjective assessment of AI-generated content based on the CLEAR scores is another limitation, highlighting the need for caution in interpretation. Additionally, this study focused solely on Virology MCQs, which may limit the generalizability of the findings to other healthcare disciplines. Moreover, given the rapid evolution of LLMs, the results may not fully reflect the capabilities of the same generative AI models over time.
Data availability
The datasets used and analyzed for this study are available in the Open Science Framework using the direct web link: https://osf.io/hq48k/.
Abbreviations
- AI: Artificial intelligence
- CLEAR: Completeness of content, Lack of false information in the content, Evidence supporting the content, Appropriateness of the content, and Relevance
- FET: Two-sided Fisher’s exact test
- LLMs: Large language models
- MCQ: Multiple-choice question
- METRICS: Model, Evaluation, Timing, Range/Randomization, Individual factors, Count, and Specificity of prompts and language
- M-W: Mann–Whitney U test
References
1. UNESCO. World Arabic Language Day. Updated 18 December 2023. Accessed 7 March 2024. https://www.unesco.org/en/world-arabic-language-day
2. Alhamami M, Almelhi A. English or Arabic in healthcare education: perspectives of healthcare alumni, students, and instructors. J Multidiscip Healthc. 2021;14:2537–47. https://doi.org/10.2147/jmdh.S330579
3. Kaliyadan F, Thalamkandathil N, Parupalli SR, Amin TT, Balaha MH, Al Bu Ali WH. English language proficiency and academic performance: a study of a medical preparatory year program in Saudi Arabia. Avicenna J Med. 2015;5(4):140–4. https://doi.org/10.4103/2231-0770.165126
4. Alshareef M, Mobaireek O, Mohamud M, Alrajhi Z, Alhamdan A, Hamad B. Decision makers’ perspectives on the language of instruction in medicine in Saudi Arabia: a qualitative study. Health Professions Education. 2018;4(4):308–16. https://doi.org/10.1016/j.hpe.2018.03.006
5. Sabbour SM, Dewedar SA, Kandil SK. Language barriers in medical education and attitudes towards Arabization of medicine: student and staff perspectives. East Mediterr Health J. 2010;16(12):1263–71. https://doi.org/10.26719/2010.16.12.1263
6. Tayem Y, AlShammari A, Albalawi N, Shareef M. Language barriers to studying medicine in English: perceptions of final-year medical students at the Arabian Gulf University. East Mediterr Health J. 2020;26(2):233–8. https://doi.org/10.26719/2020.26.2.233
7. Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare (Basel). 2023;11(6):887. https://doi.org/10.3390/healthcare11060887
8. Hwang SI, Lim JS, Lee RW, et al. Is ChatGPT a fire of Prometheus for non-native English-speaking researchers in academic writing? Korean J Radiol. 2023;24(10):952–9. https://doi.org/10.3348/kjr.2023.0773
9. Teixeira da Silva JA. Can ChatGPT rescue or assist with language barriers in healthcare communication? Patient Educ Couns. 2023;115:107940. https://doi.org/10.1016/j.pec.2023.107940
10. Seetharaman R. Revolutionizing medical education: can ChatGPT boost subjective learning and expression? J Med Syst. 2023;47(1):61. https://doi.org/10.1007/s10916-023-01957-w
11. Nicholas G, Bhatia A. Lost in translation: large language models in non-English content analysis. arXiv preprint. 2023. https://doi.org/10.48550/arXiv.2306.07377
12. Lai VD, Ngo NT, Veyseh APB, et al. ChatGPT beyond English: towards a comprehensive evaluation of large language models in multilingual learning. arXiv preprint. 2023. https://doi.org/10.48550/arXiv.2304.05613
13. Gurevich E, El Hassan B, El Morr C. Equity within AI systems: what can health leaders expect? Healthc Manage Forum. 2023;36(2):119–24. https://doi.org/10.1177/08404704221125368
14. Holstein K, Doroudi S. Equity and artificial intelligence in education: will AIEd amplify or alleviate inequities in education? arXiv preprint. 2021. https://doi.org/10.48550/arXiv.2104.12920
15. Mijwil M, Abotaleb M, Guma ALI, Dhoska K. Assigning medical professionals: ChatGPT’s contributions to medical education and health prediction. Mesopotamian J Artif Intell Healthc. 2024;2024:76–83. https://doi.org/10.58496/MJAIH/2024/011
16. Patterns (N Y). 2023;4(1):100676. https://doi.org/10.1016/j.patter.2022.100676
17. Kocoń J, Cichecki I, Kaszyca O, et al. ChatGPT: Jack of all trades, master of none. Information Fusion. 2023;99:101861. https://doi.org/10.1016/j.inffus.2023.101861
18. Sallam M. Bibliometric top ten healthcare-related ChatGPT publications in the first ChatGPT anniversary. Narra J. 2024;4(2):e917. https://doi.org/10.52225/narra.v4i2.917
19. Alowais SA, Alghamdi SS, Alsuhebany N, et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ. 2023;23(1):689. https://doi.org/10.1186/s12909-023-04698-z
20. Sallam M, Salim NA, Barakat M, Al-Tammemi AB. ChatGPT applications in medical, dental, pharmacy, and public health education: a descriptive study highlighting the advantages and limitations. Narra J. 2023;3(1):e103. https://doi.org/10.52225/narra.v3i1.103
21. Yilmaz Muluk S, Olcucu N. The role of artificial intelligence in the primary prevention of common musculoskeletal diseases. Cureus. 2024;16(7):e65372. https://doi.org/10.7759/cureus.65372
22. Oniani D, Hilsman J, Peng Y, et al. Adopting and expanding ethical principles for generative artificial intelligence from military to healthcare. npj Digit Med. 2023;6(1):225. https://doi.org/10.1038/s41746-023-00965-x
23. Cappellani F, Card KR, Shields CL, Pulido JS, Haller JA. Reliability and accuracy of artificial intelligence ChatGPT in providing information on ophthalmic diseases and management to patients. Eye. 2024. https://doi.org/10.1038/s41433-023-02906-0
24. Emsley R. ChatGPT: these are not hallucinations – they’re fabrications and falsifications. Schizophrenia. 2023;9(1):52. https://doi.org/10.1038/s41537-023-00379-4
25. Kwon HJ, Chae SJ, Park JH. Educational implications of assessing learning outcomes with multiple choice questions and short essay questions. Korean J Med Educ. 2023;35(3):285–90. https://doi.org/10.3946/kjme.2023.266
26. Singh T. Principles of assessment in medical education. Jaypee Brothers Medical; 2021.
27. Stringer JK, Santen SA, Lee E, et al. Examining Bloom’s taxonomy in multiple choice questions: students’ approach to questions. Med Sci Educ. 2021;31(4):1311–7. https://doi.org/10.1007/s40670-021-01305-y
28. Bloom BS, Krathwohl DR. Taxonomy of educational objectives: the classification of educational goals. New York: Longmans, Green; 1956. p. 403.
29. Seaman M. Bloom’s taxonomy: its evolution, revision, and use in the field of education. Curriculum and Teaching Dialogue. 2011;13(1/2):29–131A.
30. Reddy S. Generative AI in healthcare: an implementation science informed translational path on application, integration and governance. Implement Sci. 2024;19(1):27. https://doi.org/10.1186/s13012-024-01357-9
31. Bharatha A, Ojeh N, Rabbi A, et al. Comparing the performance of ChatGPT-4 and medical students on MCQs at varied levels of Bloom’s taxonomy. Adv Med Educ Pract. 2024;15:393–400. https://doi.org/10.2147/AMEP.S457408
32. Google. Gemini. Accessed 5 March 2024. https://gemini.google.com/app
33. OpenAI. GPT-4. Accessed 5 March 2024. https://openai.com/
34. Rane N, Choudhary S, Rane J. Gemini versus ChatGPT: applications, performance, architecture, capabilities, and implementation. J Appl Artif Intell. 2024;5(1):69–93. https://doi.org/10.48185/jaai.v5i1.1052
35. Podder I, Pipil N, Dhabal A, Mondal S, Pienyii V, Mondal H. Evaluation of artificial intelligence-based chatbot responses to common dermatological queries. Jordan Med J. 2024;58(2):271–7. https://doi.org/10.35516/jmj.v58i2.2960
36. Newton P, Xiromeriti M. ChatGPT performance on multiple choice question examinations in higher education. A pragmatic scoping review. Assessment & Evaluation in Higher Education. 2024:1–18. https://doi.org/10.1080/02602938.2023.2299059
37. Sallam M, Barakat M, Sallam M. A preliminary checklist (METRICS) to standardize the design and reporting of studies on generative artificial intelligence-based models in health care education and practice: development study involving a literature review. Interact J Med Res. 2024;13:e54704. https://doi.org/10.2196/54704
38. Yilmaz Muluk S, Olcucu N. Comparative analysis of artificial intelligence platforms: ChatGPT-3.5 and Google Bard in identifying red flags of low back pain. Cureus. 2024;16(7):e63580. https://doi.org/10.7759/cureus.63580
39. Bandi A, Adapa PV, Kuchi YE. The power of generative AI: a review of requirements, models, input–output formats, evaluation metrics, and challenges. Future Internet. 2023;15(8):260. https://doi.org/10.3390/fi15080260
40. Sallam M, Al-Farajat A, Egger J. Envisioning the future of ChatGPT in healthcare: insights and recommendations from a systematic identification of influential research and a call for papers. Jordan Med J. 2024;58(1). https://doi.org/10.35516/jmj.v58i1.2285
41. Sallam M, Mousa D. Evaluating ChatGPT performance in Arabic dialects: a comparative study showing defects in responding to Jordanian and Tunisian general health prompts. Mesopotamian J Artif Intell Healthc. 2024;2024:1–7. https://doi.org/10.58496/MJAIH/2024/001
42. Samaan JS, Yeo YH, Ng WH, et al. ChatGPT’s ability to comprehend and answer cirrhosis related questions in Arabic. Arab J Gastroenterol. 2023;24(3):145–8. https://doi.org/10.1016/j.ajg.2023.08.001
43. Sallam M, Barakat M, Sallam M. Pilot testing of a tool to standardize the assessment of the quality of health information generated by artificial intelligence-based models. Cureus. 2023;15(11):e49373. https://doi.org/10.7759/cureus.49373
44. Banimelhem O, Amayreh W. Is ChatGPT a good English to Arabic machine translation tool? 2023:1–6.
45. Sallam M, Al-Mahzoum K, Alshuaib O, et al. Language discrepancies in the performance of generative artificial intelligence models: an examination of infectious disease queries in English and Arabic. BMC Infect Dis. 2024;24(1):799. https://doi.org/10.1186/s12879-024-09725-y
46. Liu X, Wu J, Shao A, et al. Uncovering language disparity of ChatGPT on retinal vascular disease classification: cross-sectional study. J Med Internet Res. 2024;26:e51926. https://doi.org/10.2196/51926
47. Rosoł M, Gąsior JS, Łaba J, Korzeniewski K, Młyńczak M. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination. Sci Rep. 2023;13(1):20512. https://doi.org/10.1038/s41598-023-46995-z
48. Siebielec J, Ordak M, Oskroba A, Dworakowska A, Bujalska-Zadrozny M. Assessment study of ChatGPT-3.5’s performance on the final Polish medical examination: accuracy in answering 980 questions. Healthcare. 2024;12(16):1637. https://doi.org/10.3390/healthcare12161637
49. Guillen-Grima F, Guillen-Aguinaga S, Guillen-Aguinaga L, et al. Evaluating the efficacy of ChatGPT in navigating the Spanish medical residency entrance examination (MIR): promising horizons for AI in clinical medicine. Clin Pract. 2023;13(6):1460–87. https://doi.org/10.3390/clinpract13060130
50. Sharples JM, Oxman AD, Mahtani KR, et al. Critical thinking in healthcare and education. BMJ. 2017;357:j2234. https://doi.org/10.1136/bmj.j2234
51. Michel-Villarreal R, Vilalta-Perdomo E, Salinas-Navarro DE, Thierry-Aguilera R, Gerardou FS. Challenges and opportunities of generative AI for higher education as explained by ChatGPT. Educ Sci. 2023;13(9):856. https://doi.org/10.3390/educsci13090856
52. Sallam M, Al-Salahat K. Below average ChatGPT performance in medical microbiology exam compared to university students. Front Educ. 2023;8:1333415. https://doi.org/10.3389/feduc.2023.1333415
53. Egger J, Sallam M, Luijten G, et al. Medical ChatGPT – a systematic meta-review. medRxiv. 2024:2024.04.02.24304716. https://doi.org/10.1101/2024.04.02.24304716
Funding
This research received no external funding.
Author information
Contributions
Conceptualization: M.S.; Data curation: M.S., K.A.-M., Rawan Ahmad Almutawaa, J.A.A., R.A.D., D.R.A., Reem Abdullah Almutairi, M.B.; Formal analysis: M.S., K.A.-M., Rawan Ahmad Almutawaa, J.A.A., R.A.D., D.R.A., Reem Abdullah Almutairi, M.B.; Investigation: M.S., K.A.-M., Rawan Ahmad Almutawaa, J.A.A., R.A.D., D.R.A., Reem Abdullah Almutairi, M.B.; Methodology: M.S., K.A.-M., Rawan Ahmad Almutawaa, J.A.A., R.A.D., D.R.A., Reem Abdullah Almutairi, M.B.; Visualization: M.S.; Project administration: M.S.; Supervision: M.S.; Writing - original draft: M.S.; Writing - review & editing: M.S., K.A.-M., Rawan Ahmad Almutawaa, J.A.A., R.A.D., D.R.A., Reem Abdullah Almutairi, M.B.; All authors contributed to the article and approved the submitted version.
Ethics declarations
Ethics approval and consent to participate
The study was approved by the institutional review board (IRB) at the Faculty of Pharmacy – Applied Science Private University (reference number: 2024-PHA-5). This study did not require informed consent from human participants as it involved no human data collection.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Additional file 1.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Sallam, M., Al-Mahzoum, K., Almutawaa, R.A. et al. The performance of OpenAI ChatGPT-4 and Google Gemini in virology multiple-choice questions: a comparative analysis of English and Arabic responses. BMC Res Notes 17, 247 (2024). https://doi.org/10.1186/s13104-024-06920-7
DOI: https://doi.org/10.1186/s13104-024-06920-7