AngClust: Angle Feature-Based Clustering for Short Time Series Gene Expression Profiles


Abstract:

When clustering gene expression, it is expected that correlation coefficients of genes in the same cluster are high and that gene ontology (GO) enrichment analysis of most clusters will be significant. However, existing clustering algorithms for short time series gene expression have limitations. To address this problem, we propose a novel clustering process based on angular features for short time series gene expression. Our method, named AngClust, uses angular features to indicate the change of trend in gene expression levels between two neighboring time points. The changes of angles across multiple time points reflect the change of trend of the overall expression levels. Such changes are used to measure whether the expression trends of different genes are similar. To obtain functionally significant clusters from the clustering results, we evaluated the number of genes in each cluster, the average correlation coefficient, the fluctuation, and their correlation with GO term enrichment. AngClust outperforms two other measures, Euclidean distance (ED) and dynamic time warping of correlation (DTW), on a yeast gene expression dataset. The ratio of GO- and pathway-term-enriched clusters obtained by AngClust is higher than or equal to that of STEM and TMixClust on human, mouse, and yeast gene expression time series.
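The abstract does not state the exact angle definition, so the following is only a minimal sketch of what an angle feature between neighboring time points could look like. It assumes the angle is the inclination of the line segment joining two consecutive expression values with a unit time step; the function names `angle_features`, `angle_changes`, and `angle_distance`, the parameter `dt`, and the mean-absolute-difference dissimilarity are all hypothetical and are not taken from the paper.

```python
import math

def angle_features(profile, dt=1.0):
    """Angle (in radians) of the segment joining each pair of neighboring time points.

    Assumption (not from the paper): angle = atan2(change in expression, time step).
    """
    return [math.atan2(b - a, dt) for a, b in zip(profile, profile[1:])]

def angle_changes(profile, dt=1.0):
    """Difference between consecutive angles, i.e. how the trend turns over time."""
    angles = angle_features(profile, dt)
    return [b - a for a, b in zip(angles, angles[1:])]

def angle_distance(p, q, dt=1.0):
    """Hypothetical dissimilarity: mean absolute difference of the angle features."""
    fp, fq = angle_features(p, dt), angle_features(q, dt)
    return sum(abs(x - y) for x, y in zip(fp, fq)) / len(fp)

# Made-up profiles with four time points each (illustrative values only).
rise_then_fall = [0.0, 1.0, 1.0, 0.0]
same_shape_shifted = [5.0, 6.0, 6.0, 5.0]   # same trend, higher baseline
opposite_trend = [1.0, 0.0, 0.0, 1.0]

print(angle_features(rise_then_fall))
print(angle_distance(rise_then_fall, same_shape_shifted))
print(angle_distance(rise_then_fall, opposite_trend))
```

Under these assumptions, two profiles with the same shape but different baselines have identical angle features, while profiles with opposite trends are far apart. That is the kind of trend similarity the abstract describes.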
Published in: IEEE/ACM Transactions on Computational Biology and Bioinformatics (Volume: 20, Issue: 2, 01 March-April 2023)
Page(s): 1574 - 1580
Date of Publication: 19 July 2022

PubMed ID: 35853049

1 Introduction

Time series of gene expression are extremely useful for investigating various kinds of biological processes, such as cell reproduction, development, and response to external stimuli [1], [2], [3]. Gene temporal expression data can be roughly divided into two categories: (i) short time series, which contain only a few time points (generally three to eight), and (ii) long time series, which contain more than eight time points [4]. It has been estimated that approximately 80% of gene expression time series data sets are short time series [5]. Most algorithms for analyzing time series are based on traditional clustering methods, such as hierarchical clustering, K-means, Bayesian networks, self-organizing maps, and others [2], [6], [7], [8], [9], [10], [11]. These methods can reveal some biological characteristics but do not consider the temporal nature of time series, because they generally do not account for temporal autocorrelation between adjacent time points. Recently, some progress has been reported in clustering time series of gene expression, such as continuous representation of expression profiles through hidden Markov models [5], but these approaches remain restricted to long time series data sets. On short time series data, these state-of-the-art algorithms tend to overfit because of the small number of sampling points.
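The point that traditional distance-based clustering ignores temporal order can be illustrated with a small sketch. It uses made-up expression values and two hypothetical helper functions, `euclidean` and `permute`, and shows that the Euclidean distance between two profiles is unchanged when the time points of both profiles are shuffled in the same way, so any method relying only on that distance cannot see the order of the measurements.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def permute(profile, order):
    """Reorder the time points of a profile according to `order`."""
    return [profile[i] for i in order]

# Two made-up short time series with five time points each (illustrative only).
gene_a = [0.0, 1.0, 2.0, 1.0, 0.0]
gene_b = [0.0, 0.5, 2.5, 1.5, 0.0]

# An arbitrary shuffle of the time axis, applied identically to both genes.
order = [3, 0, 4, 2, 1]

d_original = euclidean(gene_a, gene_b)
d_shuffled = euclidean(permute(gene_a, order), permute(gene_b, order))

# The two distances agree up to floating-point rounding.
print(d_original, d_shuffled)
```

Because the distance is a sum over time points, it treats each time point as an independent coordinate. The information carried by which measurement follows which is lost.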
