Abstract
This project constructs a similarity thesaurus from a collection of text documents, in this case a collection of military operation documents. A similarity thesaurus is a grouping of words that share a similar meaning. A major problem in information retrieval is word mismatch, which arises when the words used in a query differ from the words used in the relevant documents. For example, a query may contain the word "money" while the related documents use "dollar" instead. With the help of a similarity thesaurus, the query can be expanded so that it also retrieves documents containing "dollar". The project used 195 text documents. The similarity thesaurus was first constructed from 100 of them; the remaining 95 documents were then added and the thesaurus was computed again. The thesaurus was constructed using the term-to-term relationship method, and a degree of confidence was introduced to indicate the reliability of each related term. Before the method was applied, the words in the documents were tokenized, and punctuation removal and stemming were then applied to every token. The results show that the size of the document collection and the collocation of words within the documents are variables that affect the similarity thesaurus: the more documents are used, the more reliable and accurate the thesaurus becomes. The results also show that the degree of confidence should be taken into account alongside the degree of similarity when grouping similar terms. Even if the degree of similarity between two terms is high, the similarity should not be accepted if the degree of confidence is too low.
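The pipeline described in the abstract (tokenization, punctuation removal, stemming, term-to-term scoring, and a confidence check) can be illustrated with a minimal sketch. The abstract does not give the formulas used in the thesis, so everything below is an assumption made for illustration: similarity is taken as the cosine of document co-occurrence, confidence as the fraction of documents in which both terms appear, and `naive_stem`, `build_thesaurus`, and `related_terms` are hypothetical names rather than the author's method.

```python
import math
import re
from collections import defaultdict


def naive_stem(token):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    # Tokenize on whitespace, remove punctuation, lowercase, then stem.
    tokens = (re.sub(r"[^\w]", "", t).lower() for t in text.split())
    return [naive_stem(t) for t in tokens if t]


def build_thesaurus(documents):
    # Map each term to the set of document indices that contain it.
    postings = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in preprocess(text):
            postings[term].add(doc_id)

    n_docs = len(documents)
    terms = sorted(postings)
    pairs = {}
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            shared = len(postings[a] & postings[b])
            if shared == 0:
                continue
            # Assumed similarity: cosine over binary document vectors.
            similarity = shared / math.sqrt(len(postings[a]) * len(postings[b]))
            # Assumed confidence: share of the collection supporting the pair.
            confidence = shared / n_docs
            pairs[(a, b)] = (similarity, confidence)
    return pairs


def related_terms(pairs, term, min_similarity, min_confidence):
    # Accept a related term only if BOTH thresholds are met.
    result = []
    for (a, b), (similarity, confidence) in pairs.items():
        if term in (a, b) and similarity >= min_similarity and confidence >= min_confidence:
            result.append(b if a == term else a)
    return sorted(result)


documents = [
    "The troops received money.",
    "Dollars and money were sent.",
    "Dollar reserves fell.",
]
pairs = build_thesaurus(documents)
```

In this toy collection, "the" and "troop" co-occur in only one document, so under these assumed measures they receive the maximum similarity while being supported by a single document. That is the situation the abstract warns about: a high degree of similarity with a low degree of confidence should not be accepted, and a larger collection gives the confidence measure more evidence to work with.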
Metadata
Item Type: | Thesis (Degree) |
---|---|
Creators: | Said, Suhaib (ID Num. 2010606178) |
Contributors: | Thesis advisor: Annamalai, Muthukkaruppan (Email / ID Num. UNSPECIFIED) |
Subjects: | Q Science > QA Mathematics > Instruments and machines > Electronic Computers. Computer Science > Database management |
Divisions: | Universiti Teknologi MARA, Shah Alam > Faculty of Computer and Mathematical Sciences |
Programme: | Bachelor of Computer Science (Hons) |
Keywords: | Thesaurus, military documents, collection, text documents |
Date: | 2012 |
URI: | https://ir.uitm.edu.my/id/eprint/109709 |