Distance-based undersampling for imbalance dataset: a comprehensive simulation study

Abd Rahman, Hezlin Aryani (2023) Distance-based undersampling for imbalance dataset: a comprehensive simulation study. PhD thesis, Universiti Teknologi MARA (UiTM).

Abstract

This thesis presents a simulation study of how class imbalance affects parameter estimation in binary logistic regression. It also assesses the effect of random oversampling (ROS), random undersampling (RUS), and distance-based undersampling (E-DBUS, iE-DBUS and M-DBUS) on binary logistic regression fitted to imbalanced data. The study obtained the thresholds of imbalance ratio and sample size at which the estimates are affected by imbalance. The study is motivated by three main factors. First, class imbalance normally affects the accuracy of predictive models, especially data mining and machine learning models; however, most classification studies focus on other classifiers and few on binary logistic regression. Second, few studies of imbalanced data use simulation, especially to visualize the effect of imbalance on a classifier, in this case binary logistic regression. Third, resampling strategies are a more straightforward approach to handling imbalanced data, yet they have been underrated, and various studies have suggested various resampling strategies to better handle imbalanced datasets. Distance-based sampling has shown a positive impact on imbalanced data; hence, this study seeks to demonstrate its positive effect on binary logistic regression. Simulation studies are useful for assessing and confirming effects on parameter estimation for binary logistic regression under various conditions. The first phase of this study covers the effect of different types of covariates, imbalance ratios and sample sizes on parameter estimation for the binary logistic regression model. Data were simulated for different sample sizes, types of covariates (continuous and categorical) and imbalance ratios. The simulation results show that the effect of imbalance is more prominent in smaller samples (n < 2000) and highly imbalanced data (IR < 10%). The effect diminishes as the sample size increases and the data become more balanced. The effect of imbalance was more dominant for categorical covariates than for continuous covariates. In Phase 2, the effects of ROS and RUS on parameter estimation of binary logistic regression were assessed for imbalanced datasets. The results show that ROS curbed the effect of imbalance better than RUS across all sample sizes, imbalance ratios and types of covariates (continuous, categorical, and a mixture of both), owing to the doubling of the sample size. However, random synthesis of observations is considered unfavourable, especially in statistics. Thus, in Phase 3, the simulation focused on RUS and the distance-based undersampling strategies in handling the effects of imbalanced datasets on parameter estimation of binary logistic regression with one continuous covariate. Comparing the results of Phases 1 to 3, distance-based undersampling, whether Euclidean (E-DBUS), Mahalanobis (M-DBUS) or improved Euclidean (iE-DBUS), was more reliable in curbing the effect of imbalance than ROS and RUS. Further, in Phase 4 (evaluation), the performance of the random resampling strategies and the three distance-based undersampling strategies (E-DBUS, iE-DBUS, and M-DBUS) was investigated on 14 benchmark datasets, comparing the accuracy, sensitivity and specificity of the binary logistic regression model. The results showed that M-DBUS performed best among the undersampling strategies, although its performance did not differ greatly from that of E-DBUS and iE-DBUS. This study contributes to the body of knowledge in statistics and predictive data analytics, especially in the area of imbalanced data handling.
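
The simulation design summarised in the abstract (Phases 1 and 2) can be illustrated with a minimal sketch. It is not the thesis code. It assumes one standard-normal covariate, a hypothetical true model with intercept -3 and slope 1 (which makes the event class rare), and textbook definitions of RUS and ROS. All function names are illustrative. The distance-based strategies (E-DBUS, iE-DBUS, M-DBUS) would replace the random choice of majority cases with a distance rule that the abstract does not specify, so they are not sketched here.

```python
import math
import random


def simulate(n, beta0, beta1, rng):
    """Draw n (x, y) pairs from a logistic model with one N(0, 1) covariate."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        data.append((x, 1 if rng.random() < p else 0))
    return data


def random_undersample(data, rng):
    """RUS: keep every minority case and an equal-sized random subset of the
    majority. Assumes y == 1 is the minority class."""
    minority = [d for d in data if d[1] == 1]
    majority = [d for d in data if d[1] == 0]
    return minority + rng.sample(majority, len(minority))


def random_oversample(data, rng):
    """ROS: keep every case and duplicate random minority cases until the
    classes are equal. Assumes y == 1 is the minority and is not empty."""
    minority = [d for d in data if d[1] == 1]
    majority = [d for d in data if d[1] == 0]
    return data + rng.choices(minority, k=len(majority) - len(minority))


def fit_logistic(data, iters=25):
    """Maximum-likelihood fit of (intercept, slope) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            z = max(-30.0, min(30.0, b0 + b1 * x))  # clamp to avoid overflow
            p = 1.0 / (1.0 + math.exp(-z))
            w = p * (1.0 - p)
            g0 += y - p            # score (gradient) terms
            g1 += (y - p) * x
            h00 += w               # information matrix terms
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:           # singular information matrix: stop
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


def run_demo(reps=20, n=500, beta0=-3.0, beta1=1.0, seed=1):
    """Average the slope estimate over replications for each strategy."""
    rng = random.Random(seed)
    totals = {"full": 0.0, "RUS": 0.0, "ROS": 0.0}
    for _ in range(reps):
        data = simulate(n, beta0, beta1, rng)
        totals["full"] += fit_logistic(data)[1]
        totals["RUS"] += fit_logistic(random_undersample(data, rng))[1]
        totals["ROS"] += fit_logistic(random_oversample(data, rng))[1]
    return {name: total / reps for name, total in totals.items()}


if __name__ == "__main__":
    for name, mean_slope in run_demo().items():
        print(f"{name}: mean slope estimate = {mean_slope:.3f} (true value 1.0)")
```

Varying `n` and `beta0` in `run_demo` changes the sample size and the imbalance ratio, which is the kind of grid a simulation study such as this one would explore; the values used here are placeholders, not the settings used in the thesis.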

Metadata

Item Type: Thesis (PhD)
Creators: Abd Rahman, Hezlin Aryani (Email / ID Num.: UNSPECIFIED)
Contributors: Thesis advisor: Wah, Yap Bee (Email / ID Num.: UNSPECIFIED)
Subjects: Q Science > QA Mathematics
Q Science > QA Mathematics > Analysis
Divisions: Universiti Teknologi MARA, Shah Alam > College of Computing, Informatics and Mathematics
Programme: Doctor of Philosophy (Statistics)
Keywords: Imbalanced dataset, Data level, Evaluation
Date: 2023
URI: https://ir.uitm.edu.my/id/eprint/122845

Download: 122845.pdf (Text, 23kB)

ID Number: 122845
