Multi-level text data augmentation for Malay intent classification model

Mat Zailan, Anis Syafiqah (2026) Multi-level text data augmentation for Malay intent classification model. Masters thesis, Universiti Teknologi MARA (UiTM).

Abstract

Intent classification for Malay-language queries remains a challenge due to limited annotated datasets, severe class imbalance, and rich morphological variation. This research introduces ATISMalay, a validated Malay-language dataset constructed by machine-translating the ATIS benchmark and refining the output through bilingual expert evaluation, with Cohen’s Kappa confirming fair agreement. The dataset revealed structural limitations, prompting the development of ICDAMalay, an intent classification model based on BERT and BERT+CRF architectures. Performance comparisons and error analysis highlighted recurring misclassifications, especially in low-resource intent classes. To address these issues, a multi-level text data augmentation strategy was implemented during pre-processing and applied systematically at the character, word, and phrase levels. Eight augmented datasets were generated and evaluated using BLEU and BPRO metrics, with full-tiered augmentation improving accuracy by 95% over the benchmark and 9.99 over the ATISMalay baseline. The study’s unique contribution is a systematic, tiered augmentation framework tailored for low-resource languages. This research supports practical applications in Malay-language chatbots, e-government platforms, and educational tools, where accurate intent classification is essential. Future work may explore multi-intent classification, cross-domain scalability, and integration with open-domain conversational systems to further advance Malay NLP.
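To make the idea of augmenting at the character, word, and phrase levels concrete, the following is a minimal Python sketch. It is an illustration only and not the thesis's implementation: the abstract does not say which operations were used at each level, so the adjacent-character swap, the synonym lookup, and the phrase substitution below are assumptions, as are the function names and the tiny `SYNONYMS` and `PHRASES` tables.

```python
# Illustrative sketch only: the operations and lexicons below are assumptions,
# not the augmentation rules used in the thesis.

# Hypothetical word-level and phrase-level substitution tables.
SYNONYMS = {"tunjukkan": "paparkan"}
PHRASES = {"saya mahu": "saya hendak"}


def char_level(text, index):
    """Character level: swap the characters at index and index + 1 (a simulated typo)."""
    if index < 0 or index + 1 >= len(text):
        return text  # nothing to swap at this position
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def word_level(text, synonyms):
    """Word level: replace each whitespace-separated token that has a listed synonym."""
    return " ".join(synonyms.get(token, token) for token in text.split())


def phrase_level(text, phrases):
    """Phrase level: replace each listed multi-word phrase with its paraphrase."""
    for source, target in phrases.items():
        text = text.replace(source, target)
    return text


def augment(text):
    """Apply all three levels and return the distinct variants that differ from the input."""
    variants = {
        char_level(text, 1),
        word_level(text, SYNONYMS),
        phrase_level(text, PHRASES),
    }
    variants.discard(text)
    return sorted(variants)
```

For instance, `augment("saya mahu tunjukkan tiket")` would yield one variant per level, each keeping the original intent label. A real pipeline would more likely draw its substitutions from a curated Malay lexicon and choose positions at random.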

Metadata

Item Type: Thesis (Masters)
Creators: Mat Zailan, Anis Syafiqah (ID Num.: 2022698578)
Contributors: Thesis advisor: Abdullah, Nur Atiqah Sia (Email / ID Num.: UNSPECIFIED)
Subjects: P Language and Literature > P Philology. Linguistics
Q Science > QA Mathematics
Divisions: Universiti Teknologi MARA, Shah Alam > Faculty of Computer and Mathematical Sciences
Programme: Master of Science (Computer Science)
Keywords: Intent classification, Natural Language Processing (NLP), Machine learning.
Date: 2026
URI: https://ir.uitm.edu.my/id/eprint/135784

ID Number: 135784
