Faculty of ICT, Mahidol University
 


 

THE DYNAMIC PARALLEL ANT CRAWLER (DPAC) FOR THE THAI SEARCH ENGINE

 

TITLE THE DYNAMIC PARALLEL ANT CRAWLER (DPAC) FOR THE THAI SEARCH ENGINE
AUTHOR PARICHAT KANUENGHET
DEGREE MASTER OF SCIENCE PROGRAMME IN COMPUTER SCIENCE
FACULTY FACULTY OF SCIENCE
ADVISOR JARERNSRI L. MITRPANONT
CO-ADVISOR SUDSANGUAN NGAMSURIYAROJ
 
ABSTRACT
This research proposes a Dynamic Parallel Web Crawler that aims to improve the efficiency of Thai web search engines, both in the quality of the set of web pages retrieved and in storage optimization. The research framework consists of three main components. First, parallel crawling was applied so that the crawling task can be performed in parallel and dynamically. Second, an Ant Colony Algorithm was adopted to increase the speed of the crawler and to widen the area covered by the crawl. Because the algorithm was designed to support optimized search in a distributed environment, it is well suited to the Dynamic Web Crawler solution. Finally, an agent-based concept was used: Ant Crawler Agents were introduced, with each crawler agent responsible for several crawling tasks. This makes the management and representation of the crawling results more flexible and effective. To verify the research idea, a prototype of the proposed Dynamic Parallel Ant Crawler (DPAC) model was designed and developed using JSP and Servlet. In the experiment, a distributed system environment consisting of three servers was established to implement the core function of the parallel agent crawler. We evaluated the system in three parts: Retrieval Performance Evaluation, Result-based Evaluation and User Acceptance Test (UAT). The Retrieval Performance Evaluation compared the throughput of a single Ant Crawler System (ACS) with that of a parallel ACS running on three servers. We used 100 keywords to test the DPAC on both the single ACS and the parallel ACS. The results show that the parallel Ant Crawler yields a greater number of URLs relevant to the user’s requirements than the single Ant Crawler. In addition, for the same throughput, the parallel ACS reduced the crawling time significantly.
The Result-based Evaluation focused on the accuracy of the set of URLs generated by the DPAC system, and the results were compared with those of recognized commercial search engines. The results revealed that the DPAC generates a set of URLs that covers the given keywords, and ranks those URLs, as accurately and consistently as the commercial tools do. Using 1,000 keywords, the results showed that DPAC outperformed the other two search engines. Finally, the prototype was also assessed by means of the UAT in order to measure user satisfaction with our search engine. In a test with 800 keywords among 40 users, the user acceptance rate was 97.25%. Consequently, this approach offers another sound and proven solution for the research area of the Thai Search Engine.
KEYWORD DYNAMIC WEB CRAWLER / ANT ALGORITHM / PARALLEL CRAWLER / FOCUSED CRAWLER / THAI SEARCH ENGINE
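
The abstract does not give the details of how the Ant Colony Algorithm steers the crawler, so the following is only a minimal sketch of the general ant-colony idea applied to link selection, not the DPAC implementation (which was built with JSP and Servlet). The function names choose_link and update_pheromone, the parameters alpha, beta, reward and evaporation, and the relevance scores are all hypothetical and chosen for illustration.

```python
import random


def choose_link(candidates, pheromone, relevance, alpha=1.0, beta=2.0, rng=random):
    """Pick the next URL to crawl (hypothetical helper, not from the thesis).

    Each candidate is weighted by pheromone**alpha * relevance**beta, the
    usual ant-colony selection rule. URLs without a recorded trail start
    with a pheromone level of 1.0.
    """
    weights = [
        (pheromone.get(url, 1.0) ** alpha) * (relevance.get(url, 0.0) ** beta)
        for url in candidates
    ]
    if sum(weights) == 0:
        # No candidate looks relevant: fall back to a uniform choice.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


def update_pheromone(pheromone, visited, reward, evaporation=0.1):
    """Evaporate every trail, then deposit `reward` on the visited URLs."""
    for url in pheromone:
        pheromone[url] *= 1.0 - evaporation
    for url in visited:
        pheromone[url] = pheromone.get(url, 1.0) + reward


if __name__ == "__main__":
    # Assumed toy data: two candidate links with made-up relevance scores.
    trails = {"http://example.com/a": 1.0, "http://example.com/b": 1.0}
    scores = {"http://example.com/a": 0.9, "http://example.com/b": 0.2}
    picked = choose_link(list(trails), trails, scores)
    update_pheromone(trails, [picked], reward=0.5)
    print(picked, trails)
```

In a parallel setting such as the three-server experiment described above, each crawler agent could run this loop over its own share of URLs while sharing the pheromone table; how DPAC actually coordinates its agents is not stated in the abstract.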

 


 
