Abstract
Automatic product classification based on short-text data is essential for managing the vast information generated on e-commerce platforms. As a subset of text classification, product classification assigns items to predefined categories. Accurate classification is crucial for enhancing the online shopping experience by improving organization, retrieval, and recommendation. Within this domain, product title classification is challenging: titles are short, terminology varies across sellers, and noisy or redundant information complicates classification. As a result, despite the rapid growth of data on e-commerce platforms, existing classification models continue to face challenges with accuracy and efficiency. This research addresses these challenges by leveraging Hidden Markov Models (HMMs), which capture sequential data through probabilistic modeling of hidden states and transitions. Traditional HMMs often degrade with complex data, and rigid emission parameters limit their adaptability. The study improves HMM performance in short-text product title classification through two key innovations: feature substitution using Latent Dirichlet Allocation (LDA) and weighted parameter estimation in the HMM. Feature substitution reduces sparsity and redundancy in text, whereas weighted parameter estimation addresses the rigidity of emission parameters, making parameter learning more flexible and improving adaptability to complex and diverse product data. This study proposes three methods to enhance HMM performance in product title classification.
Proposed method I (FS-LDA) substitutes semantically related features within the same product category to reduce sparsity and strengthen representation. Proposed method II adjusts emission parameters based on information from the training data, which allows the model to adapt more effectively to complex and imbalanced distributions. Proposed method III integrates FS-LDA with weighted parameter estimation in a unified framework, combining the advantages of both techniques to achieve improved classification outcomes. Experiments across five e-commerce datasets show significant improvements over traditional HMMs. Method III achieved over 95% accuracy in binary classification and F1-Scores above 90%. In multiclass scenarios, F1-Scores exceeded 70%, demonstrating robustness across categories. The proposed methods also outperformed Naïve Bayes and Support Vector Machines, particularly in short-text multiclass tasks. Beyond e-commerce, validation in spam filtering and occupational data mining confirmed substantial gains in accuracy and F1-Scores. In conclusion, proposed method III is the most effective approach to enhancing HMM-based product title classification. The scope of this research is explicitly limited to short-text product title classification in e-commerce, and it contributes to the body of knowledge on text classification using HMMs. This study also provides a foundation for developing user-friendly tools, libraries, and documentation to facilitate the integration of enhanced HMM-based classifiers into existing e-commerce systems.
Metadata
| Item Type: | Thesis (PhD) |
|---|---|
| Creators: | Muhammad Noor Mathivanan, Norsyela (ID Num. 2017293402) |
| Contributors: | Advisor: Md. Ghani, Nor Azura (Email / ID Num.: UNSPECIFIED) |
| Subjects: | H Social Sciences > HF Commerce; H Social Sciences > HF Commerce > Electronic commerce |
| Divisions: | Universiti Teknologi MARA, Shah Alam > Faculty of Computer and Mathematical Sciences |
| Programme: | Doctor of Philosophy (Statistics) |
| Keywords: | Hidden Markov Model, HMM, Feature substitution, Short-text classification, E-commerce, Product taxonomy, Data sparsity, Semantic embeddings, Natural language processing |
| Date: | November 2025 |
| URI: | https://ir.uitm.edu.my/id/eprint/137027 |