Integrating an enhanced hidden markov model with feature substitution for short-text e-commerce product classification

Muhammad Noor Mathivanan, Norsyela (2025) Integrating an enhanced hidden markov model with feature substitution for short-text e-commerce product classification. PhD thesis, Universiti Teknologi MARA (UiTM).

Abstract

Automatic product classification based on short-text data is essential for managing the vast volume of information generated on e-commerce platforms. As a subset of text classification, product classification assigns items to predefined categories. Within this domain, product title classification is challenging because titles are brief, terminology varies across sellers, and the text often contains noisy or redundant information. Accurate classification is crucial for enhancing the online shopping experience by improving organization, retrieval, and recommendation. Despite the rapid growth of data on e-commerce platforms, existing classification models continue to face challenges with accuracy and efficiency. This research addresses these challenges by leveraging Hidden Markov Models (HMMs), which capture sequential data through probabilistic modeling of hidden states and transitions. The study improves HMM performance in short-text product title classification through two key innovations: feature substitution using Latent Dirichlet Allocation (LDA) and weighted parameter estimation in the HMM. Traditional HMMs often degrade with complex data, and rigid emission parameters limit their adaptability. Feature substitution reduces sparsity and redundancy in the text, whereas weighted parameter estimation, integrated into the HMM framework, relaxes the rigidity of the emission parameters and improves adaptability to complex and diverse product data. This study proposes three methods to enhance HMM performance in product title classification.
Proposed method I (FS-LDA) substitutes semantically related features within the same product category to reduce sparsity and strengthen representation. Proposed method II adjusts emission parameters based on information from the training data, which allows the model to adapt more effectively to complex and imbalanced distributions. Proposed method III integrates FS-LDA with weighted parameter estimation in a unified framework, combining the advantages of both techniques to achieve improved classification outcomes. Experiments across five e-commerce datasets show significant improvements over traditional HMMs. Method III achieved over 95% accuracy and F1-scores above 90% in binary classification. In multiclass scenarios, F1-scores exceeded 70%, demonstrating robustness across categories. The proposed methods also outperformed Naïve Bayes and Support Vector Machines, particularly in short-text multiclass tasks. Beyond e-commerce, validation in spam filtering and occupational data mining confirmed substantial gains in accuracy and F1-scores. In conclusion, proposed method III is the most effective approach for enhancing HMM-based product title classification. The scope of this research is limited to short-text product title classification in e-commerce, and it contributes to the body of knowledge on text classification using HMMs. This study also provides a foundation for developing user-friendly tools, libraries, and documentation to facilitate the integration of enhanced HMM-based classifiers into existing e-commerce systems.
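
For readers unfamiliar with HMM-based text classification, the following is a minimal sketch of the generic baseline idea only: one discrete HMM per category, with a title assigned to the category whose model gives the highest forward-algorithm likelihood. It is not the thesis's implementation. The state names, probabilities, categories, and the `substitutions` dictionary are all hypothetical; the dictionary merely stands in for a feature-substitution step (the thesis derives its substitutions with LDA, which is not reproduced here), and the weighted parameter estimation of methods II and III is not shown.

```python
import math

def forward_log_likelihood(obs, start, trans, emit, floor=1e-6):
    """Log P(obs) under a discrete HMM, using the scaled forward algorithm.

    Assumes a non-empty token list. Tokens missing from a state's emission
    table receive a small floor probability (an illustrative choice).
    """
    states = list(start)
    alpha = {s: start[s] * emit[s].get(obs[0], floor) for s in states}
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate the normalised forward variables one step.
            alpha = {
                s: sum(alpha[r] * trans[r][s] for r in states)
                * emit[s].get(obs[t], floor)
                for s in states
            }
        scale = sum(alpha.values())
        log_lik += math.log(scale)  # the product of scales equals P(obs)
        alpha = {s: a / scale for s, a in alpha.items()}
    return log_lik

def classify(title, models, substitutions=None):
    """Return (best_label, scores) for a product title."""
    tokens = title.lower().split()
    if substitutions:
        # Hypothetical stand-in for feature substitution: map related
        # terms onto one shared feature to reduce sparsity.
        tokens = [substitutions.get(tok, tok) for tok in tokens]
    scores = {
        label: forward_log_likelihood(tokens, *params)
        for label, params in models.items()
    }
    return max(scores, key=scores.get), scores

# Toy parameters, invented purely for illustration.
start = {"brand": 0.8, "item": 0.2}
trans = {"brand": {"brand": 0.3, "item": 0.7},
         "item": {"brand": 0.1, "item": 0.9}}
models = {
    "electronics": (start, trans, {
        "brand": {"samsung": 0.6, "apple": 0.4},
        "item": {"phone": 0.6, "charger": 0.4}}),
    "fashion": (start, trans, {
        "brand": {"nike": 0.6, "adidas": 0.4},
        "item": {"shoe": 0.6, "shirt": 0.4}}),
}
substitutions = {"smartphone": "phone", "sneaker": "shoe"}

label, scores = classify("Samsung smartphone charger", models, substitutions)
print(label)
```

In this toy setup, "smartphone" would be unseen by both models without the substitution step; mapping it to "phone" lets the electronics model score the title with a learned emission probability rather than the floor value, which is the intuition behind reducing sparsity in short titles.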

Metadata

Item Type: Thesis (PhD)
Creators: Muhammad Noor Mathivanan, Norsyela (Email / ID Num.: 2017293402)
Contributors: Md. Ghani, Nor Azura (Contribution: Advisor; Email / ID Num.: UNSPECIFIED)
Subjects: H Social Sciences > HF Commerce; H Social Sciences > HF Commerce > Electronic commerce
Divisions: Universiti Teknologi MARA, Shah Alam > Faculty of Computer and Mathematical Sciences
Programme: Doctor of Philosophy (Statistics)
Keywords: Hidden Markov Model, HMM, Feature substitution, Short-text classification, E-commerce, Product taxonomy, Data sparsity, Semantic embeddings, Natural language processing
Date: November 2025
URI: https://ir.uitm.edu.my/id/eprint/137027

ID Number: 137027
