Abstract
Intent classification for Malay-language queries remains challenging because of limited annotated datasets, severe class imbalance, and rich morphological variation. This research introduces ATISMalay, a validated Malay-language dataset constructed by machine-translating the ATIS benchmark and refining the output through bilingual expert evaluation, with Cohen’s Kappa confirming fair agreement. The dataset revealed structural limitations, prompting the development of ICDAMalay, an intent classification model based on BERT and BERT+CRF architectures. Performance comparisons and error analysis highlighted recurring misclassifications, especially in low-resource intent classes. To address these issues, a multi-level text data augmentation strategy was implemented during pre-processing and applied systematically at the character, word, and phrase levels. Eight augmented datasets were generated and evaluated using BLEU and BPRO metrics, with full-tiered augmentation improving accuracy by 95% over the benchmark and 9.99 over the ATISMalay baseline. The study’s unique contribution is a systematic, tiered augmentation framework tailored for low-resource languages. This research supports practical applications in Malay-language chatbots, e-government platforms, and educational tools, where accurate intent classification is essential. Future work may explore multi-intent classification, cross-domain scalability, and integration with open-domain conversational systems to further advance Malay NLP.
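As a rough illustration of what character-, word-, and phrase-level augmentation can look like, the following Python sketch applies one simple transformation per tier. It is not the thesis's implementation: the function names, the toy `SYNONYMS` and `PHRASES` tables, and the choice of adjacent-character swaps as the character-level operation are all assumptions made for this example.

```python
import random

# Hypothetical toy resources; the abstract does not say which lexicons were used.
SYNONYMS = {"tunjukkan": ["paparkan"]}   # word-level substitutions
PHRASES = {"saya mahu": "saya ingin"}    # phrase-level paraphrases


def char_level(text, rng):
    """Swap two adjacent inner characters of one word (a simulated typo)."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # first and last characters stay fixed
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def word_level(text, rng):
    """Replace each word found in the synonym table with one of its synonyms."""
    return " ".join(
        rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in text.split()
    )


def phrase_level(text, rng):
    """Replace known multi-word phrases (plain substring match; rng unused)."""
    for src, dst in PHRASES.items():
        text = text.replace(src, dst)
    return text


def augment(text, levels=("char", "word", "phrase"), seed=0):
    """Apply the selected tiers from coarse to fine so typos do not block lookups."""
    rng = random.Random(seed)
    tiers = {"phrase": phrase_level, "word": word_level, "char": char_level}
    for name in ("phrase", "word", "char"):
        if name in levels:
            text = tiers[name](text, rng)
    return text


if __name__ == "__main__":
    query = "saya mahu tunjukkan penerbangan ke Denver"
    for levels in [("word",), ("phrase",), ("char", "word", "phrase")]:
        print(levels, "->", augment(query, levels))
```

Each augmented query would keep the intent label of the query it was derived from, so rare intent classes gain extra training examples without new annotation.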
Metadata
| Item Type: | Thesis (Masters) |
|---|---|
| Creators: | Mat Zailan, Anis Syafiqah (Email / ID Num.: 2022698578) |
| Contributors: | Thesis advisor: Abdullah, Nur Atiqah Sia (Email / ID Num.: UNSPECIFIED) |
| Subjects: | P Language and Literature > P Philology. Linguistics; Q Science > QA Mathematics |
| Divisions: | Universiti Teknologi MARA, Shah Alam > Faculty of Computer and Mathematical Sciences |
| Programme: | Master of Science (Computer Science) |
| Keywords: | Intent classification, Natural Language Processing (NLP), Machine learning. |
| Date: | 2026 |
| URI: | https://ir.uitm.edu.my/id/eprint/135784 |
