A hybrid natural language processing method for interpretable ICD classification from electronic medical records clinical notes

Iqbal Basheer, Nurul Anis Balqis (2025) A hybrid natural language processing method for interpretable ICD classification from electronic medical records clinical notes. Masters thesis, Universiti Teknologi MARA (UiTM).

Abstract

Accurate interpretation of Electronic Medical Records (EMRs), especially clinical notes, is crucial for effective healthcare communication and good patient outcomes. The main challenges for hybrid Natural Language Processing (NLP) methods are integrating multiple techniques while maintaining contextual understanding, resolving ambiguous abbreviations, and reducing misinterpretation of clinical narratives. The dataset in this study consisted of cardiovascular-related clinical notes containing medical abbreviations, diagnoses, and discharge summaries. Before analysis, the data were preprocessed through text normalization, abbreviation extraction, and punctuation cleaning to ensure consistency and readiness for the model. This study addresses abbreviation ambiguity, diagnosis prediction, and International Classification of Diseases (ICD) classification using a hybrid NLP approach. The objectives are to extract and expand abbreviations, to develop a hybrid framework for diagnosis prediction and ICD mapping, and to evaluate the framework's performance. The methodology integrates three components: the Text-to-Text Transfer Transformer (T5) model, with enhanced inference that combines cosine similarity and beam search, for abbreviation expansion; MedBioClinicalBERT, an integration of BioClinicalBERT and MedBERT, for diagnosis prediction; and Semantic Role Labeling (SRL) for explainability. The enhanced elicitive inference achieved a BLEU score of 95.38% and a ROUGE-L score of 97.96% on abbreviation expansion. For diagnosis prediction, the hybrid input framework with MedBioClinicalBERT attained 90.00% accuracy, with precision, recall, and F1 scores of 0.9530, 0.9470, and 0.9000, respectively, outperforming BioClinicalBERT and MedBERT individually. Standardization to ICD-10 codes was refined using fuzzy matching to improve mapping accuracy. The hybrid NLP method as a whole achieved 94.89% precision, 94% recall, and a 95% F1 score.
Although limitations persist because of the multimodal nature of clinical notes and the cardiovascular-specific dataset, the proposed method demonstrates substantial improvements. Overall, this study highlights the effectiveness of combining hybrid NLP methods with advanced abbreviation expansion to enhance EMR interpretation and ICD-10 classification, paving the way for broader applications in medical text analysis.
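
As an illustration of the ICD-10 standardization step mentioned in the abstract, the following is a minimal sketch of fuzzy matching a predicted diagnosis string against ICD-10 descriptions. The abstract does not say which fuzzy matching algorithm or library the thesis uses, so Python's standard-library `difflib.SequenceMatcher` is used here as a stand-in. The three-entry lookup table, the function names `normalize` and `match_icd10`, and the 0.6 threshold are all assumptions made for this example, not details taken from the thesis.

```python
from difflib import SequenceMatcher

# Illustrative subset only (assumption): a real system would load the
# complete ICD-10 code list rather than three hand-picked entries.
ICD10_DESCRIPTIONS = {
    "I10": "Essential (primary) hypertension",
    "I21.9": "Acute myocardial infarction, unspecified",
    "I50.9": "Heart failure, unspecified",
}


def normalize(text):
    """Lowercase, drop commas, and collapse runs of whitespace."""
    return " ".join(text.lower().replace(",", " ").split())


def match_icd10(diagnosis, threshold=0.6):
    """Return (code, score) for the closest description.

    The code is None when the best similarity score is below `threshold`
    (a hypothetical cut-off chosen for this sketch).
    """
    query = normalize(diagnosis)
    best_code, best_score = None, 0.0
    for code, description in ICD10_DESCRIPTIONS.items():
        score = SequenceMatcher(None, query, normalize(description)).ratio()
        if score > best_score:
            best_code, best_score = code, score
    if best_score < threshold:
        return None, best_score
    return best_code, best_score


if __name__ == "__main__":
    for text in ["acute myocardial infarction", "Heart failure, unspecified"]:
        print(text, "->", match_icd10(text))
```

The threshold lets the matcher decline to assign a code when no description is close enough, which is one simple way to avoid forcing a mapping for out-of-vocabulary diagnoses.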

Metadata

Item Type: Thesis (Masters)
Creators: Iqbal Basheer, Nurul Anis Balqis (Email / ID Num.: UNSPECIFIED)
Contributors:
Thesis advisor: Nordin, Sharifalillah (Email / ID Num.: UNSPECIFIED)
Thesis advisor: Abdul Hamid, Nurzeatul Hamimah (Email / ID Num.: UNSPECIFIED)
Subjects: H Social Sciences > HD Industries. Land use. Labor; H Social Sciences > HD Industries. Land use. Labor > Service industries
Divisions: Universiti Teknologi MARA, Shah Alam > Faculty of Computer and Mathematical Sciences
Programme: Master of Science (Computer Science)
Keywords: Electronic Medical Records (EMR), Semantic Role Labeling (SRL), International Classification of Diseases (ICD)
Date: December 2025
URI: https://ir.uitm.edu.my/id/eprint/132629

Download

132629.pdf (Text, 15kB)

ID Number

132629
