Wavelet Based Lossless DNA Sequence Compression for Faster Detection of Eukaryotic Protein Coding Regions

PP. 47-53


J.K. Meher 1,*, M.R. Panigrahi 2, G.N. Dash 3, P.K. Meher 4

1. Computer Science and Engineering, Vikash College of Engineering for Women, Bargarh, Odisha, India.

2. Chemical Engineering, Vikash College of Engineering for Women, Bargarh, Odisha, India.

3. School of Physics, Sambalpur University, Odisha, India.

4. Institute for Infocomm Research, Singapore.

* Corresponding author.

DOI: https://doi.org/10.5815/ijigsp.2012.07.05

Received: 5 Apr. 2012 / Revised: 10 May 2012 / Accepted: 16 Jun. 2012 / Published: 28 Jul. 2012

Index Terms

Discrete Wavelet Transform, Comb filter, Indicator sequence, Protein coding regions


Discriminating protein coding regions (exons) from noncoding regions (introns, or junk DNA) in eukaryotic cells is a computationally intensive task: DNA strings are very long, so the analysis requires considerable computation time. Furthermore, DNA sequences are inherently random yet contain vast redundancy, hidden regularities, long repeats, and complementary palindromes, and therefore cannot be compressed efficiently. The objective of this study is to present an integrated signal processing algorithm that considerably reduces the computational load by compressing the DNA sequence effectively, which in turn speeds up the search for coding regions. The algorithm is based on the Discrete Wavelet Transform (DWT), a fast and effective method for data compression, followed by a comb filter that predicts protein coding regions from their period-3 property. The algorithm is validated on standard datasets such as HMR195, Burset and Guigo, and KEGG.
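The ingredients named in the abstract (indicator sequences, a lossless wavelet step, and period-3 detection) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes binary indicator sequences, uses a one-level integer Haar transform as a stand-in for the paper's DWT, and measures period-3 strength with a single-frequency DFT instead of the paper's comb filter, whose design the abstract does not give. All function names are hypothetical.

```python
import cmath

def indicator_sequences(dna):
    """Map a DNA string to four binary indicator sequences, one per base."""
    dna = dna.upper()
    return {base: [1 if ch == base else 0 for ch in dna] for base in "ACGT"}

def haar_forward(x):
    """One level of the integer Haar transform (assumed stand-in for the DWT).

    Returns (approximation, detail). len(x) must be even. Integer
    arithmetic keeps the transform exactly reversible, i.e. lossless.
    """
    approx, detail = [], []
    for a, b in zip(x[0::2], x[1::2]):
        d = a - b
        approx.append(b + (d >> 1))  # equals floor((a + b) / 2)
        detail.append(d)
    return approx, detail

def haar_inverse(approx, detail):
    """Exactly invert haar_forward."""
    x = []
    for s, d in zip(approx, detail):
        b = s - (d >> 1)
        x.extend([b + d, b])
    return x

def period3_power(dna):
    """Sum over the four bases of the squared DFT magnitude at frequency 2*pi/3.

    A large value indicates the period-3 property associated with coding
    regions. Used here as an assumed substitute for a comb filter.
    """
    total = 0.0
    for seq in indicator_sequences(dna).values():
        acc = sum(v * cmath.exp(-2j * cmath.pi * n / 3) for n, v in enumerate(seq))
        total += abs(acc) ** 2
    return total
```

In this sketch a strongly period-3 string such as `"ATG" * 4` yields a much larger `period3_power` value than a string of one repeated base of the same length, and `haar_inverse(*haar_forward(x))` returns `x` unchanged for any even-length integer list.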

Cite This Paper

J.K. Meher, M.R. Panigrahi, G.N. Dash, P.K. Meher, "Wavelet Based Lossless DNA Sequence Compression for Faster Detection of Eukaryotic Protein Coding Regions", IJIGSP, vol. 4, no. 7, pp. 47-53, 2012. DOI: 10.5815/ijigsp.2012.07.05

