Abstract
Sound analysis depends on good-quality data. Missing data, however, is a major problem in many areas of scientific research, including air quality studies. Missing values reduce prediction accuracy and bias the results of an analysis, which makes imputation methods that replace missing values with estimates important. A literature search found little discussion of suitable imputation methods for Single-Site Temporal Time-Dependent (SSTTD) multivariate air quality data sets, particularly those with long gap sequences of missing values. This study proposes several imputation methods based on empirical orthogonal functions (EOF) to fill that gap. EOF, sometimes called Principal Component Analysis (PCA), is a promising technique for estimating missing values. The existing EOF imputation method has a drawback, however: it centres the data matrix on the statistical mean before the EOF computation, and the mean is sensitive to extreme values. Air quality data sets often contain extreme observations caused by climatic variations and random processes, so the existing approach needs to be adapted before it is applied to them. The median and the trimmed mean therefore appear to be better choices for centring the matrix. This study investigates how well the methods estimate missing values in long gaps, focusing on PM10 air quality in SSTTD multivariate data sets in Malaysia. It compares the existing mean-centred EOF method (EOF-mean) with several proposed EOF-based methods: EOF based on the median (EOF-median), EOF based on the trimmed mean (EOF-trimmean) and the newly applied Regularized Expectation Maximization Principal Component Analysis (R-EMPCA).
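The idea of changing the centring statistic can be illustrated with a short Python sketch. It is not the thesis implementation: it assumes a common iterative truncated-SVD form of EOF imputation, and the function names (`centre`, `eof_impute`), the number of modes, the 10% trimming fraction and the stopping rule are all hypothetical choices made for illustration. R-EMPCA is not shown.

```python
import numpy as np

def centre(X, how="mean", trim=0.1):
    """Column-wise centre of X, ignoring NaNs (hypothetical helper)."""
    if how == "mean":
        return np.nanmean(X, axis=0)
    if how == "median":
        return np.nanmedian(X, axis=0)
    if how == "trimmean":
        out = np.empty(X.shape[1])
        for j in range(X.shape[1]):
            col = np.sort(X[~np.isnan(X[:, j]), j])
            k = int(trim * col.size)          # points cut from each tail
            out[j] = col[k:col.size - k].mean()
        return out
    raise ValueError(f"unknown centring: {how}")

def eof_impute(X, n_modes=1, how="mean", n_iter=200, tol=1e-8):
    """Fill NaNs in X (time x variables) by iterative truncated SVD.

    Assumed scheme: centre the matrix, start the gaps at zero anomaly,
    then repeatedly replace the gaps with a rank-n_modes reconstruction.
    """
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    c = centre(X, how)
    A = np.where(miss, 0.0, X - c)            # anomalies; gaps start at the centre
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        recon = (U[:, :n_modes] * s[:n_modes]) @ Vt[:n_modes]
        delta = np.max(np.abs(recon[miss] - A[miss])) if miss.any() else 0.0
        A[miss] = recon[miss]                 # observed entries are never changed
        if delta < tol:
            break
    return A + c                              # add the centre back
```

Calling `eof_impute(X, how="median")` or `eof_impute(X, how="trimmean")` changes only the centring step, which is the difference between the EOF-mean, EOF-median and EOF-trimmean variants as the abstract describes them.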
The study used real PM10 data from the Klang and Shah Alam air quality monitoring stations. The methods were evaluated by comparing the imputed values in artificially created missing-data sets with the true observed values in a complete reference data set. The artificial data sets were created from the reference data set at both study locations by removing values in several patterns, combining four percentages of missing data (5, 10, 20 and 30) with four long gap lengths (12, 24, 168 and 720 consecutive hourly points). Based on several performance indicators, including RMSE, MAE, R-squared and AI, R-EMPCA performed best, estimating the missing values with the highest accuracy, followed by EOF-trimmean. To improve the estimates further, they were smoothed using a B-spline Roughness Penalty (RP) approach, giving the proposed R-EMPCA-RP and EOF-trimmean-RP imputation methods. Applying the RP approach proved beneficial.
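The evaluation design described above can be sketched in Python as follows. This is a hedged illustration only: the abstract does not define the indicators, so R-squared is assumed here to be the squared Pearson correlation and AI is assumed to be Willmott's index of agreement. The helper names `make_gaps` and `scores` are hypothetical, and the gap placement is a simple random scheme, not the specific patterns used in the thesis.

```python
import numpy as np

def make_gaps(x, frac, gap_len, rng):
    """Copy x and blank out runs of gap_len consecutive points.

    The number of runs targets roughly `frac` of the series; runs may
    overlap, so the realised missing fraction can be somewhat lower.
    """
    x = np.asarray(x, dtype=float).copy()
    n_gaps = max(1, int(round(frac * x.size / gap_len)))
    for _ in range(n_gaps):
        start = rng.integers(0, x.size - gap_len + 1)
        x[start:start + gap_len] = np.nan
    return x

def scores(obs, pred):
    """Compare imputed values (pred) with the withheld true values (obs)."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = pred - obs
    obar = obs.mean()
    return {
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAE": np.mean(np.abs(err)),
        # Assumption: R-squared as squared Pearson correlation
        "R2": np.corrcoef(obs, pred)[0, 1] ** 2,
        # Assumption: AI as Willmott's index of agreement
        "AI": 1 - np.sum(err ** 2)
              / np.sum((np.abs(pred - obar) + np.abs(obs - obar)) ** 2),
    }
```

In use, a complete hourly series would be passed through `make_gaps` (for example with `frac=0.10` and `gap_len=24`), imputed, and then scored with `scores` on the blanked positions only, so that lower RMSE and MAE and higher R-squared and AI indicate a better imputation method.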
Metadata
Item Type: | Thesis (Masters) |
---|---|
Creators: | Muhammad Ghazali, Shamihah (ID Num. 2018489528) |
Contributors: | Thesis advisor: Shaadan, Norshaida (norshahida588@uitm.edu.my); Thesis advisor: Idrus, Zainura (email unspecified) |
Subjects: | Q Science > QA Mathematics > Mathematical statistics. Probabilities > Data processing; T Technology > TD Environmental technology. Sanitary engineering > Air pollution and its control > Indoor air pollution. Including indoor air quality |
Divisions: | Universiti Teknologi MARA, Shah Alam > Faculty of Computer and Mathematical Sciences |
Programme: | Master of Science (Statistics) |
Keywords: | Air quality, missing data, empirical orthogonal functions (EOF) |
Date: | 2022 |
URI: | https://ir.uitm.edu.my/id/eprint/66929 |