Faculty of ICT, Mahidol University
 

 

A HIGHLY EFFECTIVE SYSTEM FOR THAI AND ENGLISH PRINTED CHARACTER RECOGNITION BY WORD PREDICTION METHOD

 

TITLE A HIGHLY EFFECTIVE SYSTEM FOR THAI AND ENGLISH PRINTED CHARACTER RECOGNITION BY WORD PREDICTION METHOD
AUTHOR BUNTIDA SUVACHARAKULTON
DEGREE MASTER OF SCIENCE PROGRAMME IN COMPUTER SCIENCE (INTERNATIONAL PROGRAMME)
FACULTY FACULTY OF INFORMATION AND COMMUNICATION TECHNOLOGY
ADVISOR SUPACHAI TANGWONGSAN
CO-ADVISOR SUKANYA PHONGSUPHAP, CHOMTIP PORNPANOMCHAI, PANJAI TANTASANAWONG
 
ABSTRACT
This thesis proposes an optical character recognition model that uses word prediction for bilingual documents in Thai and English. The model is named BOCR-WP (Bilingual Optical Character Recognition with Word Prediction). BOCR-WP extends conventional OCR with two additional processes: language identification and word prediction. Language identification determines whether each image strip belongs to the Thai or the English script mode. Word prediction takes place after character recognition and is followed by character verification. The main idea is that, instead of recognizing each character individually as in the conventional method, the new approach tries to identify whole words, in either Thai or English, by using contextual analysis to predict probable words and then verifying them by template matching to obtain the correct one. The longer the matched word, the greater the gain in recognition speed. Finally, dictionary look-up is used to improve the accuracy of the final result of the whole recognition process. A series of experiments showed that BOCR-WP classified the script mode, Thai or English, with an average accuracy of 99.99%. The system also outperformed conventional OCR in speed, with an improvement of 28.85% in the best case, 22.20% on average, and 15.69% at minimum, while maintaining an accuracy of 100% in Thai and 99% in English on a set of 141 bilingual documents totaling 284,417 characters.
KEYWORD BILINGUAL OCR / N-GRAM / THAI CHARACTER RECOGNITION / WORD PREDICTION / WORD VERIFICATION
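
The predict-then-verify idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the 3x3 binary templates, the toy lexicon, the 0.8 threshold, and every function name here are hypothetical, and a simple prefix lookup stands in for the contextual (n-gram) analysis that the thesis uses to predict words. The sketch only shows one plausible reason a long verified word can save time: full recognition compares a glyph against every template, whereas verification compares it against the single template of the predicted character.

```python
from typing import Dict, List, Sequence, Tuple

Glyph = Tuple[int, ...]  # flattened binary bitmap of one segmented character

# Hypothetical 3x3 templates for a tiny alphabet (illustrative only).
TEMPLATES: Dict[str, Glyph] = {
    "c": (1, 1, 1, 1, 0, 0, 1, 1, 1),
    "a": (0, 1, 0, 1, 1, 1, 1, 0, 1),
    "t": (1, 1, 1, 0, 1, 0, 0, 1, 0),
    "r": (1, 1, 0, 1, 0, 0, 1, 0, 0),
    "s": (0, 1, 1, 0, 1, 0, 1, 1, 0),
}


def similarity(glyph: Glyph, template: Glyph) -> float:
    """Fraction of pixels on which the glyph and the template agree."""
    same = sum(1 for g, t in zip(glyph, template) if g == t)
    return same / len(template)


def recognize_char(glyph: Glyph) -> str:
    """Conventional path: compare the glyph against every template."""
    return max(TEMPLATES, key=lambda ch: similarity(glyph, TEMPLATES[ch]))


def predict_words(prefix: str, lexicon: Sequence[str]) -> List[str]:
    """Stand-in for contextual prediction: lexicon words that start with
    the recognized prefix, longest candidates first."""
    return sorted((w for w in lexicon if w.startswith(prefix)),
                  key=len, reverse=True)


def verify_word(word: str, glyphs: Sequence[Glyph],
                threshold: float = 0.8) -> bool:
    """Verification: one template comparison per predicted character."""
    if len(word) > len(glyphs):
        return False
    return all(similarity(g, TEMPLATES[ch]) >= threshold
               for ch, g in zip(word, glyphs))


def recognize_line(glyphs: Sequence[Glyph], lexicon: Sequence[str]) -> str:
    """Recognize one character, predict words from it, verify the rest."""
    out: List[str] = []
    i = 0
    while i < len(glyphs):
        first = recognize_char(glyphs[i])
        for word in predict_words(first, lexicon):
            if verify_word(word, glyphs[i:i + len(word)]):
                out.append(word)
                i += len(word)
                break
        else:
            # No predicted word verified: keep the single recognized character.
            out.append(first)
            i += 1
    return "".join(out)
```

For example, with the hypothetical lexicon `["cat", "cats", "car"]`, glyphs built from the templates for "cats" are accepted as one four-character word after a single full recognition step, while glyphs for "cart" fall back to "car" followed by a separately recognized "t". The sketch omits the language identification and dictionary look-up stages entirely.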

 

 


 

ICT Building, Mahidol University, 999 Phuttamonthon 4 Road, Salaya, Nakhonpathom 73170 Tel. +66 02 441-0909 Fax. +66 02 849-6099
Mahidol University Computing Center, The Faculty of ICT, Mahidol University , Rama 6 Road, Rajathevi, Bangkok 10400 Tel. +66 02 354-4333 Fax. +66 02 354-7333