A statistical investigation of AI tools’ accuracy in solving Algebra, Calculus, and Statistics problems: comparative analysis of ChatGPT and Gemini

Umar, Norazah and Ahmad, Nurhafizah and Othman, Jamal and Hamat, Muniroh (2026) A statistical investigation of AI tools’ accuracy in solving Algebra, Calculus, and Statistics problems: comparative analysis of ChatGPT and Gemini. Merging Lanes: Where E-Learning Diversity Meets Future Trends, 11. pp. 108-115. ISSN 978-629-98755-9-8

Official URL: https://appspenang.uitm.edu.my/sigcs/

Abstract

This study presents a descriptive comparative investigation of the mathematical accuracy of two widely used large language model (LLM) tools, ChatGPT and Gemini, across three core domains: Algebra, Calculus, and Statistics. The increasing adoption of generative AI in higher education has raised concerns about the reliability of AI-generated mathematical solutions, particularly when outputs appear coherent but contain hidden reasoning gaps. To examine domain-specific performance, both tools were tested using an identical prompt protocol, and only first responses were recorded to reflect typical student usage. Accuracy was evaluated using final-answer correctness and summarized using descriptive statistics, reported as the percentage of correct solutions by domain. Results indicate that both tools achieved consistently high accuracy across all domains, exceeding 88%. ChatGPT demonstrated higher accuracy in Algebra (97.22%) than Gemini (91.67%), suggesting stronger performance on symbolic manipulation and structured equation-based tasks. In contrast, Gemini achieved perfect accuracy in both Calculus and Statistics (100% each), outperforming ChatGPT in those domains (88.89% and 94.44%, respectively). These findings indicate that LLM effectiveness in mathematics is domain-dependent rather than uniform, with each system exhibiting distinct strengths. Overall, the study suggests that AI tools can serve as useful computational assistants in mathematics learning and practice, but domain sensitivity implies that outputs should be interpreted cautiously and verified, especially in formal assessment contexts. Future work should expand the problem set, incorporate step-validity scoring, and evaluate performance under reworded and out-of-distribution problem conditions to better assess reasoning robustness.
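The accuracy measure described in the abstract (percentage of correct final answers per domain) can be illustrated with a short sketch. The abstract does not state how many problems were used; the value of 36 problems per domain below is an assumption, chosen only because it is the smallest count that reproduces every reported percentage (for example, 35/36 ≈ 97.22%). The function names are hypothetical and this is not the authors' analysis code.

```python
# Percentages reported in the abstract, by tool and domain.
REPORTED = {
    "ChatGPT": {"Algebra": 97.22, "Calculus": 88.89, "Statistics": 94.44},
    "Gemini": {"Algebra": 91.67, "Calculus": 100.0, "Statistics": 100.0},
}

# ASSUMPTION: 36 problems per domain. The abstract does not give this number;
# it is merely the smallest count consistent with all reported percentages.
ASSUMED_N = 36


def accuracy_pct(correct: int, total: int) -> float:
    """Percentage of correct final answers, rounded to two decimals."""
    return round(100 * correct / total, 2)


def implied_correct(pct: float, total: int) -> int:
    """Count of correct answers that best matches a reported percentage."""
    return round(pct * total / 100)


if __name__ == "__main__":
    for tool, domains in REPORTED.items():
        for domain, pct in domains.items():
            k = implied_correct(pct, ASSUMED_N)
            recomputed = accuracy_pct(k, ASSUMED_N)
            print(f"{tool:8s} {domain:10s} {k}/{ASSUMED_N} = {recomputed:.2f}%")
```

Under this assumed count, a single additional error changes a domain score by about 2.78 percentage points, so the gaps between the two tools correspond to only a few problems each.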

Metadata

Item Type: Article
Creators:
Name | Email / ID Num.
Umar, Norazah | norazah191@uitm.edu.my
Ahmad, Nurhafizah | nurha9129@uitm.edu.my
Othman, Jamal | jamalothman@uitm.edu.my
Hamat, Muniroh | muniroh@uitm.edu.my
Contributors:
Contribution | Name | Email / ID Num.
Advisor | Abd Rahman, Nor Hanim | UNSPECIFIED
Chief Editor | Othman, Jamal | UNSPECIFIED
Subjects: Q Science > QA Mathematics > Evolutionary programming (Computer science). Genetic algorithms
Divisions: Universiti Teknologi MARA, Pulau Pinang > Permatang Pauh Campus
Journal or Publication Title: Merging Lanes: Where E-Learning Diversity Meets Future Trends
ISSN: 978-629-98755-9-8
Volume: 11
Page Range: pp. 108-115
Keywords: ChatGPT, Gemini, AI
Date: April 2026
URI: https://ir.uitm.edu.my/id/eprint/137356

Download: 137356.pdf (Text, 716kB)

ID Number: 137356
