An application of Malay short-form word conversion using Levenshtein distance / Azilawati Azizan, NurAine Saidin, Nurkhairizan Khairudin & Rohana Ismail

Azilawati Azizan, NurAine Saidin, Nurkhairizan Khairudin and Rohana Ismail (2020) An application of Malay short-form word conversion using Levenshtein distance. Mathematical Sciences and Informatics Journal (MIJ), 1 (2). pp. 34-42. ISSN 2735-0703

Abstract

Formerly, short-form words were widely used in the field of journalism. Nowadays, however, they are widely used by many people, especially in online communication. These short-form words cause problems in data mining, especially in tasks that involve online text processing, because they lead to inaccurate text mining results. Yet only a few works have investigated the identification and conversion of Malay short-form words. Therefore, this work aims to develop an application that can identify Malay short-form words and convert them into their full words. To develop this application, the short-form rules need to be carefully examined. The formal rules from Dewan Bahasa & Pustaka (DBP) are used as the primary reference for generating the short-form word identification algorithm. For the conversion algorithm, Levenshtein Distance (LD) is used to measure similarity, and a rule-based technique is used to complement LD. As a result, 70.27% of the Malay short-form words were correctly converted into their full words. The conversion rate is quite promising, and this work can be further strengthened by incorporating more rules into the algorithm.
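To illustrate the similarity measure named in the abstract, the following is a minimal Python sketch of Levenshtein distance used to pick the closest full word for a short form. It is not the authors' implementation: the function names, the tiny `LEXICON`, and the sample short forms are assumptions chosen for illustration, and the DBP-based identification rules and the complementary rule-based step are not modelled here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


def closest_full_word(short_form: str, lexicon: list[str]) -> str:
    """Return the lexicon entry with the smallest edit distance to short_form."""
    return min(lexicon, key=lambda word: levenshtein(short_form, word))


# Hypothetical lexicon of full Malay words (illustration only, not from the paper).
LEXICON = ["yang", "dengan", "tidak", "untuk"]

for short in ["yg", "tdk", "utk"]:
    print(short, "->", closest_full_word(short, LEXICON))
```

In this sketch, ties in distance are resolved by lexicon order, which is arbitrary. That is one kind of case where additional rules, such as those the abstract describes as complementing LD, could help choose between candidates.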

Metadata

Item Type: Article
Creators (Email / ID Num.):
Azilawati Azizan (azila899@uitm.edu.my)
NurAine Saidin (2017412258@isiswa.uitm.edu.my)
Nurkhairizan Khairudin (nurkh098@uitm.edu.my)
Rohana Ismail (rohana@unisza.edu.my)
Subjects: P Language and Literature > P Philology. Linguistics > Language and education > Malaysia
Q Science > QA Mathematics > Evolutionary programming (Computer science). Genetic algorithms > Malaysia
Q Science > QA Mathematics > Philosophy > Mathematical logic > Constructive mathematics > Algorithms
Divisions: Universiti Teknologi MARA, Perak > Tapah Campus > Faculty of Computer and Mathematical Sciences
Journal or Publication Title: Mathematical Sciences and Informatics Journal (MIJ)
UiTM Journal Collections: UiTM Journal > Mathematical Sciences and Informatics Journal (MIJ)
ISSN: 2735-0703
Volume: 1
Number: 2
Page Range: pp. 34-42
Keywords: Malay short form word; Noisy text normalization; Levenshtein Distance; Rule-based
Date: November 2020
URI: https://ir.uitm.edu.my/id/eprint/38191

Download

Text: 38191.pdf (485kB)

ID Number

38191
