Faculty of ICT, Mahidol University
 


DOCUMENT CLUSTERING USING TOP TERM FREQUENCY SELECTION

 

TITLE: DOCUMENT CLUSTERING USING TOP TERM FREQUENCY SELECTION
AUTHOR: PARINYA SAEANG
DEGREE: MASTER OF SCIENCE PROGRAMME IN COMPUTER SCIENCE
FACULTY: FACULTY OF SCIENCE
ADVISOR: DAMRAS WONGSAWANG
CO-ADVISOR: CHOMTIP PORNPANOMCHAI
 
ABSTRACT
Document clustering, an information retrieval (IR) technique for grouping documents, relies on the similarity of the terms that documents contain. Clusters are formed so that documents in the same cluster are similar to one another and dissimilar to documents in other clusters. A large number of index terms, however, can cause problems for clustering. With too many index terms, documents may appear so similar that they cannot be separated into groups. A large number of index terms can also reduce the accuracy of the clustering results and increase processing time. Processing time depends mostly on matrix computations, which consume substantial resources and time and therefore constrain the clustering of large data sets. This thesis proposes an alternative approach to document clustering that focuses on improving term selection, called Top Term Frequency Selection (TTFS). The idea is to select the terms suitable for document clustering and to assign them suitable weights for calculating document similarity. A prototype of TTFS was developed and tested on well-known IR test collections such as CRAN, CISI, and TIME. The experimental results show that TTFS produces clustering results comparable to those obtained when all terms are used as index terms. The benefit of TTFS is higher clustering efficiency: because the matrix is smaller, computing over a subset of terms is more efficient than processing all terms. Furthermore, the proposed weight assignments are as helpful for clustering documents as the standard weight assignments. However, before TTFS can be applied in practice, the mechanism for sorting term weights prior to term selection should be explored and developed further to improve processing performance.
KEYWORDS: INFORMATION RETRIEVAL / DOCUMENT CLUSTERING / SIMILARITY MEASURE
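
The abstract does not spell out how TTFS ranks or weights terms, so the following is only a minimal sketch of the general idea: keep the most frequent terms and compute document similarity over that reduced vocabulary. The function names, the whitespace tokenizer, the use of raw term frequency as the weight, and cosine similarity as the measure are all assumptions made for illustration, not the thesis's actual method.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a document and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def select_top_terms(docs, k):
    """Return the k terms with the highest total frequency across all documents.

    Assumption: "top term frequency" is read here as corpus-wide raw frequency.
    """
    totals = Counter()
    for doc in docs:
        totals.update(tokenize(doc))
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def vectorize(doc, terms):
    """Build a raw term-frequency vector restricted to the selected terms."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in terms]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical): two aerodynamics-like documents and one library-like one.
docs = [
    "wing flow wing pressure",
    "flow pressure wing boundary",
    "library catalog library index",
]

terms = select_top_terms(docs, k=3)            # reduced vocabulary
vectors = [vectorize(d, terms) for d in docs]  # one short vector per document

sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

With only three of the seven distinct terms kept, the document-term matrix shrinks from 3 x 7 to 3 x 3, which is the kind of saving the abstract attributes to a smaller matrix. The first two documents still come out far more similar to each other than to the third. A clustering algorithm would then operate on these pairwise similarities.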

 


 
