Handling highly imbalanced output class label: a case study on Fantasy Premier League (FPL) virtual player price changes prediction using machine learning / Muhammad Muhaimin Khamsan and Ruhaila Maskat

Khamsan, Muhammad Muhaimin and Maskat, Ruhaila (2019) Handling highly imbalanced output class label: a case study on Fantasy Premier League (FPL) virtual player price changes prediction using machine learning / Muhammad Muhaimin Khamsan and Ruhaila Maskat. Malaysian Journal of Computing (MJoC), 4 (2): 4. pp. 304-316. ISSN 2600-8238

Abstract

In practice, a balanced target class is rare. An imbalanced target class can, however, be handled by resampling the original dataset, either by oversampling (upsampling) or by undersampling (downsampling). A popular upsampling technique is the Synthetic Minority Over-sampling Technique (SMOTE), which enlarges a minority class by generating synthetic samples and assigning their class based on the K-Nearest Neighbours (K-NN). SMOTE upsampling can upsample at most one minority class at a time, so a multiclass dataset must undergo multilayer SMOTE to balance the class label distribution. This paper aims to find a suitable method for handling imbalanced classes, using a dataset of Fantasy Premier League (FPL) virtual players to predict price changes. The cleaned dataset has a highly imbalanced class distribution, in which the frequency of “Price Remain Unchanged (PRU)” is higher than that of “Price Fall (PF)” and “Price Rise (PR)”. The paper compares the baseline (original) dataset with the SMOTE-applied dataset under shuffled, linear and stratified sampling for the train-test split, using a deep learning algorithm. It also proposes a low standard deviation (of the distribution of true positives for each class label) as a criterion for finding the best method of handling imbalanced class labels. The results show that multilayer SMOTE applied until all classes have the same distribution, combined with stratified sampling for the train-test split, achieves a lower standard deviation (5.7873), high accuracy (80.06%) and a shorter execution runtime (1 minute 41 seconds) than the original highly imbalanced dataset.
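The two ideas in the abstract, multilayer SMOTE and the standard-deviation criterion, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, `k=5`, the fixed seed and the toy data are all assumptions made for illustration, the interpolation is a simplified SMOTE-style step written with the standard library only, and the criterion is interpreted here as the population standard deviation of per-class recall in percentage points (the abstract does not say whether a population or sample formula was used). The stratified train-test split is left out.

```python
import math
import random
from collections import Counter
from statistics import pstdev


def smote_one_class(X, y, label, target, k=5, rng=None):
    """Add synthetic points for one class until it has `target` samples."""
    rng = rng or random.Random(0)
    minority = [x for x, c in zip(X, y) if c == label]
    X_out, y_out = list(X), list(y)
    for _ in range(target - len(minority)):
        base = rng.choice(minority)
        # k nearest same-class neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        # The new point lies on the segment between base and the neighbour
        X_out.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
        y_out.append(label)
    return X_out, y_out


def multilayer_smote(X, y, k=5, seed=0):
    """One oversampling pass per class, until every class matches the largest."""
    rng = random.Random(seed)
    target = max(Counter(y).values())
    for label in sorted(set(y)):
        X, y = smote_one_class(X, y, label, target, k, rng)
    return X, y


def recall_spread(y_true, y_pred):
    """Population std. dev. of per-class recall, in percentage points."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return pstdev([100.0 * hits[c] / totals[c] for c in totals])


# Toy data (hypothetical): PRU dominates, PR and PF are rare
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (0.4, 0.2),
     (2.0, 2.0), (2.2, 2.1), (-2.0, 1.0), (-2.1, 1.3)]
y = ["PRU"] * 6 + ["PR"] * 2 + ["PF"] * 2
X_bal, y_bal = multilayer_smote(X, y)
print(Counter(y_bal))
```

After balancing, each of the three classes has six samples. A lower `recall_spread` means the classifier's correct predictions are spread more evenly across the classes, which is the property the proposed criterion rewards; a model that only predicts the majority class can score high accuracy but will have a large spread.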

Metadata

Item Type: Article
Creators (Email / ID Num.):
Khamsan, Muhammad Muhaimin (muhaiminkhamsan@gmail.com)
Maskat, Ruhaila (ruhaila@tmsk.uitm.edu.my)
Subjects: Q Science > QA Mathematics > Mathematical statistics. Probabilities > Data processing
Divisions: Universiti Teknologi MARA, Shah Alam > Faculty of Computer and Mathematical Sciences
Journal or Publication Title: Malaysian Journal of Computing (MJoC)
UiTM Journal Collections: UiTM Journal > Malaysian Journal of Computing (MJoC)
ISSN: 2600-8238
Volume: 4
Number: 2
Page Range: pp. 304-316
Keywords: Imbalanced class label; SMOTE upsampling; machine learning; price changes prediction
Date: December 2019
URI: https://ir.uitm.edu.my/id/eprint/61448

ID Number

61448