Signal Processing in Sequence Analysis: Advances in Eukaryotic Gene Prediction


Abstract:

Genomic sequence processing has been an active area of research for the past two decades and has increasingly attracted the attention of digital signal processing researchers in recent years. A challenging open problem in deoxyribonucleic acid (DNA) sequence analysis is maximizing the prediction accuracy of eukaryotic gene locations and thereby protein coding regions. In this paper, DNA symbolic-to-numeric representations are presented and compared with existing techniques in terms of relative accuracy for the gene and exon prediction problem. Novel signal processing-based gene and exon prediction methods are then evaluated together with existing approaches at a nucleotide level using the Burset/Guigo1996, HMR195, and GENSCAN standard genomic datasets. A new technique for the recognition of acceptor splice sites is then proposed, which combines signal processing-based gene and exon prediction methods with an existing data-driven statistical method. By comparison with the acceptor splice site detection method used in the gene-finding program GENSCAN, the proposed DSP-statistical hybrid technique reveals a consistent reduction in false positives at different levels of sensitivity, averaging a 43% reduction when evaluated on the GENSCAN test set.
Published in: IEEE Journal of Selected Topics in Signal Processing (Volume: 2, Issue: 3, June 2008)
Page(s): 310 - 321
Date of Publication: 24 June 2008

I. Introduction

Deoxyribonucleic acid (DNA), the material of heredity in most living organisms, consists of genic and intergenic regions, as shown in Fig. 1. In eukaryotes, genes are further divided into relatively small protein coding segments known as exons, interrupted by noncoding spacers known as introns. In eukaryotes such as humans, the intergenic and intronic regions often make up more than 95% of the genome. Codons (i.e., triplets of the four possible types of DNA nucleotides A, C, G, and T) in exons encode 20 amino acids and 3 terminator signals, known as stop codons (i.e., TAA, TAG, and TGA). The initial exon of a gene begins with the start codon “ATG.” Looking from the 5′ end of DNA (upstream) to its 3′ end (downstream), the exon-to-intron border is known as the donor splice site and consists of a consensus dinucleotide “GT” as the first two nucleotides of the intron, whereas the intron-to-exon border is known as the acceptor splice site, which consists of a consensus dinucleotide “AG” as the last two nucleotides of the intron. The accurate identification of genomic protein coding regions, along with the recognition of other signals and/or regions (shown in Fig. 1), would result in an ideal gene finding and annotation system.
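
The sequence signals named above (the start codon, the three stop codons, and the GT and AG consensus dinucleotides) can be illustrated with a short Python sketch. It is not one of the prediction methods evaluated in this paper: the function names and the toy sequence are assumptions made for illustration, and the code only enumerates candidate positions by exact string matching on the 5′-to-3′ strand.

```python
# Illustrative sketch only: enumerate candidate signal positions by exact matching.
# Function names and the toy sequence are hypothetical, not taken from the paper.

START_CODON = "ATG"
STOP_CODONS = ("TAA", "TAG", "TGA")
DONOR_CONSENSUS = "GT"     # first two nucleotides of an intron
ACCEPTOR_CONSENSUS = "AG"  # last two nucleotides of an intron


def find_motif(seq, motif):
    """Return 0-based start positions of every occurrence of motif (overlaps allowed)."""
    seq = seq.upper()
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]


def candidate_signals(seq):
    """Collect candidate positions for each signal in a 5'-to-3' DNA string."""
    stops = sorted(pos for codon in STOP_CODONS for pos in find_motif(seq, codon))
    return {
        "start_codons": find_motif(seq, START_CODON),
        "stop_codons": stops,
        "donor_sites": find_motif(seq, DONOR_CONSENSUS),
        "acceptor_sites": find_motif(seq, ACCEPTOR_CONSENSUS),
    }


if __name__ == "__main__":
    toy = "ATGGCCGTAAGTCCTTAGGCTTGA"  # made-up sequence, not real genomic data
    for name, positions in candidate_signals(toy).items():
        print(name, positions)
```

Because GT and AG occur by chance throughout any sequence, nearly all positions returned by such a scan are not true splice sites. This is why recognition methods must score the surrounding sequence context rather than rely on the consensus dinucleotide alone.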

