DNA sequence compression - Based on the normalized maximum likelihood model


Abstract:

Genomic data provide challenging problems that have been studied in a number of fields such as statistics, signal processing, information theory, and computer science. This article shows that the methodologies and tools recently developed in these fields for modeling signals and processes appear to be most promising for genomic research.
Published in: IEEE Signal Processing Magazine (Volume: 24, Issue: 1, January 2007)
Page(s): 47 - 53
Date of Publication: 31 January 2007


DNA Sequences: Random, Independent, or Dependent?

Genetic data represent the most important source of information about life: in an exquisite way, nature encodes information about the proteins forming an organism, about controlling biological pathways, about the variations that render any two individuals different, and probably about many other things as yet unknown. Deoxyribonucleic acid (DNA) is a polymer formed of two entwined helicoidal chains, also called strands. Each strand is a linked chain of the nucleotides adenine (A), cytosine (C), guanine (G), and thymine (T). Ignoring its inherent nontrivial three-dimensional structure, DNA can be represented as a one-dimensional chain of the four nucleotides, which can be written either in terms of the four symbols {A, C, G, T} or in terms of the numbers {0, 1, 2, 3} describing the bases.
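
As a small illustration of the one-dimensional representation described above, the following Python sketch converts a nucleotide string into integer labels and back, and packs the labels at two bits per base, the cost of storing four equally likely symbols without any modeling. The particular labeling (A to 0, C to 1, G to 2, T to 3) and the helper names `encode`, `decode`, and `pack` are assumptions made for this example; they are not taken from the article.

```python
# Illustrative labeling only: the order A, C, G, T -> 0, 1, 2, 3 is an
# assumption for this sketch, not a convention confirmed by the article.
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}
INT_TO_BASE = "ACGT"


def encode(seq):
    """Map a string over {A, C, G, T} to a list of integers in 0..3."""
    return [BASE_TO_INT[base] for base in seq.upper()]


def decode(codes):
    """Map a list of integers in 0..3 back to the nucleotide string."""
    return "".join(INT_TO_BASE[code] for code in codes)


def pack(codes):
    """Pack the labels at 2 bits per base, four bases per byte.

    The last byte is padded with zero bits on the right, so the original
    sequence length must be stored separately to unpack unambiguously.
    """
    out = bytearray()
    for i in range(0, len(codes), 4):
        chunk = codes[i:i + 4]
        byte = 0
        for code in chunk:
            byte = (byte << 2) | code
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    codes = encode("GATTACA")
    print(codes)
    print(decode(codes))
    print(pack(codes).hex())
```

Two bits per base is the reference point a DNA compressor has to improve on: any gain must come from a model that exploits statistical dependencies or repetitions in the sequence.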
