Faculty of ICT, Mahidol University
 


A PROTOTYPE OF TEXT-TO-SPEECH FOR THAI BASED ON TIME DOMAIN PITCH-SYNCHRONOUS OVERLAP AND ADD

 

TITLE: A PROTOTYPE OF TEXT-TO-SPEECH FOR THAI BASED ON TIME DOMAIN PITCH-SYNCHRONOUS OVERLAP AND ADD
AUTHOR: NATTHAKIJ ANGSUBHAKORN
DEGREE: MASTER OF SCIENCE PROGRAMME IN COMPUTER SCIENCE
FACULTY: FACULTY OF SCIENCE
ADVISOR: ASSOC. PROF. SUPACHAI TANGWONGSAN
CO-ADVISOR: ASST. PROF. DAMRAS WONGSAWANG
 
ABSTRACT
This research presents a prototype Text-to-Speech system for Thai. Text-to-Speech (TTS), or speech synthesis, is an application that automatically generates human speech from input text. The application normalizes complex words, exception words, abbreviations, numbers, and symbols into a simple form. Speech synthesis is useful when vision is unavailable, as for blind users, or when the eyes are occupied with something else. It can also deliver information remotely, for example over the telephone.

The prototype is based on the Time Domain Pitch-Synchronous Overlap and Add (TD-PSOLA) method. The model consists of two major modules: text analysis and speech signal processing. The text analysis module decomposes the input text into phonetic unit description parameters, which comprise phonetic units and prosody information. These parameters are then processed by the speech signal processing module, which is based on the TD-PSOLA algorithm. The phonetic units are decomposed into a sequence of overlapping signals, extracted pitch-synchronously by multiplying the signal by a window function. The window is usually of the Hanning or Hamming type and is typically centered at a pitch-mark position. Pitch-mark positions can be obtained with a general pitch determination algorithm such as autocorrelation, the average magnitude difference function (AMDF), or cepstrum analysis. To raise the pitch of the speech, the interval between pitch-marks is decreased; to lower the pitch, the interval is increased. Duration is modified by repeating or omitting pitch-marked signals: signals are repeated to lengthen the speech and omitted to shorten it. As a result, the pitch contour of the signals can be modified to resemble the standard Thai tone contours, which consist of low, falling, high, and rising levels.
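The pitch and duration modifications described above can be illustrated with a minimal sketch. It is not the prototype's implementation: the function names `hann` and `td_psola`, the parameters `pitch_factor` and `duration_factor`, and the test signal are all assumptions made for illustration. The pitch-marks are simply placed one period apart on a synthetic 100 Hz sine instead of being found by a pitch determination algorithm, and at least two pitch-marks are assumed.

```python
import math

def hann(n):
    """Symmetric Hann window of odd length n, peaking at the centre sample."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def td_psola(x, marks, pitch_factor=1.0, duration_factor=1.0):
    """Illustrative TD-PSOLA sketch (hypothetical helper, not the thesis code).

    x               -- list of samples
    marks           -- increasing pitch-mark positions (sample indices), >= 2 entries
    pitch_factor    -- > 1 raises pitch (pitch-marks placed closer together)
    duration_factor -- > 1 lengthens speech (windowed segments are repeated)
    Returns the output samples and the synthesis pitch-mark positions.
    """
    out_len = int(round(len(x) * duration_factor))
    y = [0.0] * out_len
    synth_marks = []
    t = marks[0] * duration_factor          # first synthesis pitch-mark
    while t < out_len:
        # Map the synthesis time back to the input time axis and take the
        # nearest analysis pitch-mark; this repeats or omits segments.
        target = t / duration_factor
        i = min(range(len(marks)), key=lambda k: abs(marks[k] - target))
        # Local pitch period in samples around analysis mark i.
        if i + 1 < len(marks):
            p = marks[i + 1] - marks[i]
        else:
            p = marks[i] - marks[i - 1]
        lo = marks[i] - p                   # segment spans two periods
        c = int(round(t))                   # centre of the segment in the output
        inside_input = lo >= 0 and marks[i] + p < len(x)
        inside_output = c - p >= 0 and c + p < out_len
        if inside_input and inside_output:
            w = hann(2 * p + 1)             # window centred on the pitch-mark
            for k in range(2 * p + 1):
                y[c - p + k] += x[lo + k] * w[k]   # overlap and add
            synth_marks.append(c)
        # Spacing of synthesis marks sets the new pitch period.
        t += p / pitch_factor
    return y, synth_marks

# Assumed test signal: 1 s of a 100 Hz sine at 8 kHz, so one period is 80 samples.
fs, f0 = 8000, 100
x = [math.sin(2.0 * math.pi * f0 * n / fs) for n in range(fs)]
marks = list(range(80, 7920, 80))

y_same, m_same = td_psola(x, marks)                       # unchanged
y_high, m_high = td_psola(x, marks, pitch_factor=2.0)     # one octave higher
y_long, m_long = td_psola(x, marks, duration_factor=2.0)  # twice as long
```

With both factors left at 1.0, the Hann windows overlap so that they sum to one between the first and last pitch-mark, and the input is reproduced there. A time-varying `pitch_factor` would be needed to impose a tone contour; this sketch keeps it constant for brevity.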
The experiment consisted of generating a set of sentences and evaluating the overall quality of the output speech using the Mean Opinion Score (MOS) method. The aspects evaluated were pronunciation, distinctness, naturalness, and intelligibility. The evaluation was performed by 15 native Thai speakers in a controlled environment. The experimental results show that the speech generated by the prototype is considered intelligible and can be recognized as natural Thai pronunciation, as spoken by a native speaker. The prototype can produce most commonly used words that follow the CVC (Consonant-Vowel-Consonant) pattern.
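As an illustration of how a Mean Opinion Score is computed, the sketch below averages listener ratings for each of the four aspects. The conventional five-point MOS scale (1 = bad, 5 = excellent) is assumed, since the abstract does not state the scale used. The function name `mean_opinion_scores` is hypothetical, and the ratings are placeholders from three imaginary listeners, not data from the thesis.

```python
from statistics import mean

ASPECTS = ("pronunciation", "distinctness", "naturalness", "intelligibility")

def mean_opinion_scores(ratings):
    """Average the ratings for each aspect.

    ratings -- list of dicts, one per listener, mapping aspect -> score in 1..5
    """
    for listener in ratings:
        for aspect in ASPECTS:
            if not 1 <= listener[aspect] <= 5:
                raise ValueError(f"rating for {aspect} must be between 1 and 5")
    return {aspect: mean(r[aspect] for r in ratings) for aspect in ASPECTS}

# Placeholder ratings for illustration only (NOT the thesis results).
example_ratings = [
    {"pronunciation": 4, "distinctness": 3, "naturalness": 3, "intelligibility": 5},
    {"pronunciation": 5, "distinctness": 4, "naturalness": 2, "intelligibility": 4},
    {"pronunciation": 3, "distinctness": 5, "naturalness": 4, "intelligibility": 3},
]
scores = mean_opinion_scores(example_ratings)
```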
KEYWORDS: TEXT-TO-SPEECH / SPEECH SYNTHESIS / TD-PSOLA

 


 
