Faculty of ICT, Mahidol University
 


COMPRESSION OF DNA SEQUENCES

 

TITLE: COMPRESSION OF DNA SEQUENCES
AUTHOR: THEERACHAI LAOKULSANT
DEGREE: MASTER OF SCIENCE PROGRAMME IN COMPUTER SCIENCE
FACULTY: FACULTY OF SCIENCE
ADVISOR: DAMRAS WONGSAWANG
CO-ADVISOR: SUKANYA PHONGSUPHAP
 
ABSTRACT
Efforts to decode the genetic code of deoxyribonucleic acid (DNA) are advancing rapidly, and the 21st century is predicted to be the century of biotechnology. DNA sequences can be treated as texts over a four-letter alphabet: adenine (A), cytosine (C), guanine (G), and thymine (T). For complete genomes these texts can be very long. The human genome, for instance, contains about three billion characters spread over twenty-three pairs of chromosomes. Storing decoded DNA data therefore requires a huge amount of storage space, and data compression is one way to reduce it. Unfortunately, compressing DNA sequences appears to be a difficult task. At first glance the sequences look very much like random strings, and their regularities are well hidden. Classical text compression algorithms do not work well on DNA sequences.
This research proposes three approaches to DNA sequence compression, with the aim of using fewer than two bits of storage per character. The IbioCompress algorithm improves on Biocompress-2 by combining techniques from the LZ77 and LZ78 algorithms. It has three search matching methods, namely forward, reverse, and inverse search matching, none of which bounds the search (the sliding window has no size limit). These methods use two bits to encode literal words. IbioCompress can also encode with other methods, namely adaptive arithmetic order-1, order-2, or PPMZ-M. The TCompress algorithm transforms DNA sequence data from the ACGT format into several other formats, namely C4C, C3C, C2C, and C01, and then compresses the data (in ACGT, C4C, C3C, C2C, or C01 format) with good, popular compression programs such as Winzip, Pkzip, Gzip, PPMZ, BOA, and Rk. Finally, the IgenCompress algorithm improves on GenCompress by modifying the encoding of edit operations and by encoding literal words with adaptive arithmetic order-1 or PPMZ-M, comparing the two and keeping the better result.
As a result, the TCompress and IgenCompress algorithms use fewer than two bits of storage per character. With TCompress, DNA sequences in C4C format can also be compressed with classical text compression programs such as Winzip, Pkzip, and Gzip, while sequences in ACGT format achieve the best compression ratio with the PPMZ algorithm. In general, IgenCompress achieves a better compression ratio than TCompress, Biocompress-2, and GenCompress.
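The two-bits-per-character figure mentioned in the abstract is the cost of the plain fixed-width encoding of a four-letter alphabet, so it is the baseline any DNA compressor has to beat. The following is a minimal Python sketch of that baseline, together with a helper that measures the bits per base achieved by a general-purpose compressor. It is not the thesis's IbioCompress, TCompress, or IgenCompress; the function names (`pack_2bit`, `unpack_2bit`, `bits_per_base`, `zlib_bits_per_base`) and the A=0, C=1, G=2, T=3 mapping are assumptions chosen for illustration, and zlib stands in for Gzip-style compression.

```python
import zlib

# Assumed mapping for illustration: two bits per base.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack an ACGT string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpacking is uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Invert pack_2bit; `length` is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def bits_per_base(compressed_size_bytes: int, n_bases: int) -> float:
    """Storage cost in bits per base for a given compressed size."""
    return 8 * compressed_size_bytes / n_bases


def zlib_bits_per_base(seq: str) -> float:
    """Bits per base when the ACGT text is compressed with zlib (level 9)."""
    compressed = zlib.compress(seq.encode("ascii"), 9)
    return bits_per_base(len(compressed), len(seq))


if __name__ == "__main__":
    seq = "ACGTTGCAAC"
    packed = pack_2bit(seq)
    assert unpack_2bit(packed, len(seq)) == seq
    print(packed.hex())
    print(zlib_bits_per_base(seq * 100))
```

A compressor meets the stated goal on a given sequence when `bits_per_base` for its output is below 2.0. Note that the packed form needs the sequence length stored separately, because the last byte may be padded.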
KEYWORD: COMPRESSION / DNA COMPRESSION

 
