Identification of Protein Coding Regions Using the Modified Gabor-Wavelet Transform | IEEE Journals & Magazine | IEEE Xplore

Abstract:

An important topic in genomic sequence analysis is the identification of protein coding regions. In this context, several coding DNA model-independent methods based on the occurrence of specific patterns of nucleotides at coding regions have been proposed. Nonetheless, these methods have not been completely suitable due to their dependence on an empirically predefined window length required for a local analysis of a DNA region. We introduce a method based on a modified Gabor-wavelet transform (MGWT) for the identification of protein coding regions. This novel transform is tuned to analyze periodic signal components and presents the advantage of being independent of the window length. We compared the performance of the MGWT with other methods by using eukaryote data sets. The results show that MGWT outperforms all assessed model-independent methods with respect to identification accuracy. These results indicate that the source of at least part of the identification errors produced by the previous methods is the fixed working scale. The new method not only avoids this source of errors but also makes a tool available for detailed exploration of the nucleotide occurrence.
Page(s): 198 - 207
Date of Publication: 07 May 2008

PubMed ID: 18451429

1 Introduction

Recent advances in bioinformatics and genomic signal processing have generated much interest because they integrate the theory and methods of signal processing with the global understanding of the functional genomics of organisms [1], [2]. When a new organism is sequenced, it is desirable to obtain as much information as possible about it. A fundamental stage is the identification of protein coding regions in DNA sequences [3], [4]. The methods for identifying coding regions described in the literature may be grouped in different ways. Blanco and Guigó divide the methods into three approaches [5]: search by content, search by signal (also referred to as search by site), and search by similarity. See also [6] for a similar taxonomy. Search by content refers to methods that search for DNA segments with specific properties, such as the frequency of nucleotides, a nucleotide composition rich in G/C or A/T, codon composition, and CpG islands. Search-by-signal and search-by-similarity approaches, on the other hand, refer to methods that rely on a previously known database used to train a supervised classifier, such as Markov chains [7]. Guigó [8] suggested a slightly different taxonomy, dividing these methods into model-dependent and model-independent methods. Model-dependent methods are built upon a priori information, usually available from databases of genomic information on previously known organisms. Model-independent methods do not assume such a priori information. These differences explain why gene-finding programs are typically based on combinations of such techniques: model-dependent methods tend to be more precise because they rely on a priori information to train the classifiers [8]. Nevertheless, sequenced organisms may have coding regions that are not represented in the available databases. In such situations, model-independent methods complement the program's ability to detect the coding regions.
This paper introduces a new model-independent method based on the detection of periodic regions, as described in the following.
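
To make the window-length dependence of earlier model-independent methods concrete, the following is a minimal sketch of a fixed-window spectral measure. It is not the MGWT introduced in this paper. It assumes that the periodicity of interest is the three-base periodicity commonly associated with coding regions, which this excerpt does not state explicitly. The function names `indicator_sequences` and `period3_power`, the `window` parameter, and the normalization by the squared window length are illustrative choices, not taken from the paper.

```python
import cmath
import math

BASES = "ACGT"


def indicator_sequences(dna):
    """Map a DNA string to four binary indicator sequences, one per base."""
    dna = dna.upper()
    return {b: [1.0 if ch == b else 0.0 for ch in dna] for b in BASES}


def period3_power(dna, window):
    """Fixed-window spectral power at frequency 1/3 for each window start.

    Returns len(dna) - window + 1 values (an empty list if the sequence
    is shorter than the window). The window length is the empirically
    chosen parameter that fixed-scale methods depend on.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    indicators = indicator_sequences(dna)
    # Complex exponential at 1/3 cycles per base (assumed target period).
    kernel = [cmath.exp(-2j * math.pi * n / 3) for n in range(window)]
    powers = []
    for start in range(len(dna) - window + 1):
        total = 0.0
        for seq in indicators.values():
            coeff = sum(seq[start + n] * kernel[n] for n in range(window))
            total += abs(coeff) ** 2
        # Illustrative normalization so values are comparable across windows.
        powers.append(total / window ** 2)
    return powers


if __name__ == "__main__":
    periodic = "ATG" * 20  # toy sequence with a strict three-base repeat
    flat = "A" * 60        # toy sequence with no three-base periodicity
    print(max(period3_power(periodic, 30)))
    print(max(period3_power(flat, 30)))
```

In this sketch the result depends directly on `window`: a short window reacts to short periodic stretches but is noisy, while a long window smooths the measure and blurs region boundaries. This is the kind of fixed working scale that the abstract identifies as a source of identification errors.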
