Mahidol University Logo
Faculty of ICT, Mahidol University
 

Admissions

Printable Version

 

EFFICIENT DOCUMENT CLUSTERING USING SUFFIX ARRAY

 

TITLE EFFICIENT DOCUMENT CLUSTERING USING SUFFIX ARRAY
AUTHOR KOVIT KARAWATREEDECH
DEGREE MASTER OF SCIENCE PROGRAMME IN COMPUTER SCIENCE
FACULTY FACULTY OF SCIENCE
ADVISOR DAMRAS WONGSAWANG
CO-ADVISOR SUDSANGUAN NGAMSURIYAROJ
CHOMTIP PORNPANOMCHAI
 
ABSTRACT
Nowadays the Search Engine is popular among users in searching the information on the Internet. However, the results are sometimes unmatched with users’ need or placed in the latter pages of the document, and the searching wastes their time to arrive at. Such problems have been solved by an idea of clustering that is grouping documents having the same or similar content together. A Suffix Tree Clustering (STC), one of the most popular clustering techniques currently in use, is an efficient technique, which employs the Suffix Tree Algorithm in performing a string matching. The shortfall of the technique is that it requires an extensive memory to execute and due to the complexity of the implementation, some errors may occur. This thesis solves these technical problems by using Suffix Array instead of Suffix Tree. Similar to the Suffix Tree Clustering technique, the Suffix Array Clustering technique employs a Suffix Array Algorithm in doing a string matching. From our intensive study and program development for the Suffix Array Algorithm and Suffix Tree Algorithm, we found that the major advantages of the Suffix Array Algorithm over the Suffix Tree Algorithm were that the Suffix Array Algorithm is easier to implement and generally requires less memory. In addition, we compared the speed of execution of the two techniques and found that the Suffix Tree Clustering is faster when applied to a small document collection, but slower than, the Suffix Array Clustering when applied to a large document collection. The rationale is that the STC requires more memory space than SAC in maintaining its structure. Thus in the case of collection grows larger, the size of tree grows up faster then that of array. This results in more time taken in maintaining the structure when the tree structure goes beyond the capacity of memory. Hence, SAC is suitable for a large document collection or suitable to run on the computer system with very limited memory resource.
KEYWORD DOCUMENT CLUSTERING / SUFFIX TREE / SUFFIX ARRAY

 

Go to Top

 

ICT Building, Mahidol University, 999 Phuttamonthon 4 Road, Salaya, Nakhonpathom 73170 Tel. +66 02 441-0909 Fax. +66 02 849-6099
Mahidol University Computing Center, The Faculty of ICT, Mahidol University , Rama 6 Road, Rajathevi, Bangkok 10400 Tel. +66 02 354-4333 Fax. +66 02 354-7333