Abstract
This thesis presents a simulation study of the effect of class imbalance on parameter estimation in binary logistic regression. It also assesses the effect of random oversampling (ROS), random undersampling (RUS), and distance-based undersampling (E-DBUS, iE-DBUS and M-DBUS) on imbalanced data in binary logistic regression. The study identifies the thresholds of imbalance ratio and sample size at which estimation is affected by imbalance. The motivation behind this study rests on three main factors. First, class imbalance typically affects the accuracy of predictive models, especially data mining and machine learning models. However, most classification studies focus on other classifiers, and few on binary logistic regression. Second, few studies of imbalanced data use simulation, especially to visualize the effect of imbalance on a classifier, in this case binary logistic regression. Third, resampling is a more straightforward approach to handling imbalanced data. However, it has been underrated, and various studies have suggested various resampling strategies to better handle imbalanced datasets. Distance-based sampling has shown a positive impact on imbalanced data. Hence, demonstrating this positive effect for binary logistic regression is the aim of this study. Simulation studies are useful for assessing and confirming effects on parameter estimation for binary logistic regression under various conditions. The first phase of this study covers the effect of covariate type, imbalance ratio and sample size on parameter estimation for the binary logistic regression model. Data were simulated for different sample sizes, types of covariates (continuous and categorical) and imbalance ratios. The simulation results show that the effect of imbalance is more prominent in smaller samples (n < 2000) and in highly imbalanced data (IR < 10%).
The effect diminishes as the sample size increases and the data become more balanced. The effect of imbalance was more pronounced for categorical covariates than for continuous covariates. In Phase 2, the effects of ROS and RUS on parameter estimation for binary logistic regression were assessed for imbalanced datasets. The results show that ROS performed better than RUS in curbing the effect of imbalance across all sample sizes and imbalance ratios and for all covariate types (continuous, categorical, and a mixture of both), owing to the doubling of the sample size. However, random synthetisation of observations is considered unfavourable, especially in statistics. Thus, in Phase 3, the simulation focused on RUS and the distance-based undersampling strategies in handling the effects of imbalanced datasets on parameter estimation of binary logistic regression for one continuous covariate. Comparing the results of Phases 1 to 3, distance-based undersampling, whether Euclidean (E-DBUS), Mahalanobis (M-DBUS) or improved Euclidean (iE-DBUS), was more reliable in curbing the effect of imbalance than ROS and RUS. Further, in Phase 4 (evaluation), the performance of all the random strategies and the three distance-based undersampling strategies (E-DBUS, iE-DBUS, and M-DBUS) was investigated using 14 benchmark datasets, comparing the accuracy, sensitivity and specificity of the binary logistic regression model. The results showed that M-DBUS performed best among the undersampling strategies. However, its performance was not far from that of E-DBUS and iE-DBUS. This study contributes to the body of knowledge in statistics and predictive data analytics, especially in the area of imbalanced data handling.
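The kind of simulation described in Phases 1 and 2 can be illustrated with a minimal sketch. It is not the thesis's code: the single standard-normal covariate, the coefficient values `beta0 = -3.0` and `beta1 = 1.0`, the sample size, and all function names are assumptions chosen for illustration. The sketch uses plain duplication for ROS and plain deletion for RUS, and fits the model by Newton-Raphson. The distance-based variants are not shown because the abstract does not specify them.

```python
import numpy as np

def simulate(n, beta0, beta1, rng):
    """Draw one continuous covariate and a binary outcome from a logistic model."""
    x = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    y = rng.binomial(1, p)
    return x, y

def split_by_class(y):
    """Return (minority indices, majority indices) for a 0/1 outcome."""
    counts = np.bincount(y, minlength=2)
    minority = int(np.argmin(counts))
    return np.flatnonzero(y == minority), np.flatnonzero(y != minority)

def random_oversample(x, y, rng):
    """ROS: duplicate minority rows at random until both classes are equal in size."""
    idx_min, idx_maj = split_by_class(y)
    extra = rng.choice(idx_min, size=len(idx_maj) - len(idx_min), replace=True)
    keep = np.concatenate([idx_maj, idx_min, extra])
    return x[keep], y[keep]

def random_undersample(x, y, rng):
    """RUS: keep a random subset of majority rows equal in size to the minority class."""
    idx_min, idx_maj = split_by_class(y)
    kept = rng.choice(idx_maj, size=len(idx_min), replace=False)
    keep = np.concatenate([kept, idx_min])
    return x[keep], y[keep]

def fit_logistic(x, y, n_iter=25):
    """Maximum-likelihood fit of intercept and slope by Newton-Raphson."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)

# Hypothetical true parameters; a negative intercept makes y = 1 the rare class.
x, y = simulate(n=1000, beta0=-3.0, beta1=1.0, rng=rng)
x_ros, y_ros = random_oversample(x, y, rng)
x_rus, y_rus = random_undersample(x, y, rng)

estimates = {
    "original": fit_logistic(x, y),
    "ROS": fit_logistic(x_ros, y_ros),
    "RUS": fit_logistic(x_rus, y_rus),
}
for name, beta in estimates.items():
    print(name, beta)
```

Resampling on the outcome changes the observed event rate, so the intercept estimated from the resampled data is not directly comparable to the simulated `beta0`. A full study would repeat the draw many times per setting and summarise the bias and variability of the estimates.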
Metadata
| Item Type: | Thesis (PhD) |
|---|---|
| Creators: | Abd Rahman, Hezlin Aryani |
| Contributors: | Thesis advisor: Wah, Yap Bee |
| Subjects: | Q Science > QA Mathematics; Q Science > QA Mathematics > Analysis |
| Divisions: | Universiti Teknologi MARA, Shah Alam > College of Computing, Informatics and Mathematics |
| Programme: | Doctor of Philosophy (Statistics) |
| Keywords: | Imbalanced dataset, Data level, Evaluation |
| Date: | 2023 |
| URI: | https://ir.uitm.edu.my/id/eprint/122845 |
