Building classification models from imbalanced fraud detection data / Terence Yong Koon Beh, Swee Chuan Tan and Hwee Theng Yeo

Swee, Chuan Tan (2014) Building classification models from imbalanced fraud detection data / Terence Yong Koon Beh, Swee Chuan Tan and Hwee Theng Yeo. Malaysian Journal of Computing, 2 (2). pp. 1-21. ISSN 2231-7473

Official URL: http://mjoc.uitm.edu.my/v2/

Abstract

Many real-world data sets exhibit imbalanced class distributions in which almost all instances belong to one class and far fewer belong to a smaller, yet usually more interesting, class. Building classification models from such imbalanced data sets is a relatively new challenge in the machine learning and data mining community because many traditional classification algorithms assume similar proportions of majority and minority classes. When the data is imbalanced, these algorithms generate models that achieve good classification accuracy for the majority class, but poor accuracy for the minority class. This paper reports our experience in applying data balancing techniques to develop a classifier for an imbalanced real-world fraud detection data set. We evaluated the models generated from seven classification algorithms with two simple data balancing techniques. Despite the many ideas proposed in the literature to tackle the class imbalance problem, our study shows that the simplest data balancing technique is all that is required to significantly improve the accuracy in identifying the primary class of interest (i.e., the minority class) for all seven algorithms tested. Our results also show that precision and recall are useful and effective measures for evaluating models created from artificially balanced data. Hence, we advise data mining practitioners to try simple data balancing first before exploring more sophisticated techniques to tackle the class imbalance problem.
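As a rough illustration of the ideas in the abstract, the sketch below balances a toy data set by random undersampling and computes precision and recall for the minority class. The abstract does not name the two balancing techniques studied, so random undersampling is an assumption made here for illustration only. The function names `undersample` and `precision_recall`, the 95/5 class split, and the "always legitimate" baseline are likewise hypothetical and are not taken from the paper.

```python
import random


def undersample(rows, labels, minority=1, seed=0):
    """Randomly drop majority-class rows until both classes are the same size.

    Assumption: random undersampling stands in for "simple data balancing";
    the abstract does not say which techniques the authors used.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    n_keep = min(len(minority_idx), len(majority_idx))
    kept = minority_idx + rng.sample(majority_idx, n_keep)
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (hypothetical): 95 legitimate records (0) and 5 fraudulent ones (1).
labels = [0] * 95 + [1] * 5
rows = list(range(100))

# A baseline that always predicts "legitimate" looks accurate overall
# but never finds a single fraud case.
always_legit = [0] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_legit)) / len(labels)
print(accuracy, precision_recall(labels, always_legit))

# After undersampling, both classes are equally represented.
balanced_rows, balanced_labels = undersample(rows, labels)
print(sum(balanced_labels), len(balanced_labels))
```

In a setup like this, balancing would normally be applied to the training data only, with precision and recall measured on held-out data that keeps the original class distribution.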

Item Type: Article
Creators: Swee, Chuan Tan (email: UNSPECIFIED)
Divisions: University Publication Centre (UPENA)
Journal or Publication Title: Malaysian Journal of Computing
ISSN: 2231-7473
Volume: 2
Number: 2
Page Range: pp. 1-21
Item ID: 12419
Uncontrolled Keywords: Imbalanced data, Machine Learning, Model Evaluation, Performance Measures
Last Modified: 12 Mar 2019 04:21
Depositing User: Staf Pendigitalan 1
URI: http://ir.uitm.edu.my/id/eprint/12419
