Stop Chasing Trends: Discovering High Order Models in Evolving Data


Abstract:

Many applications are driven by evolving data: patterns in Web traffic, program execution traces, network event logs, and the like are often non-stationary. Building prediction models for such evolving data is therefore an important and challenging task. Currently, most approaches work by "chasing trends": they keep learning or updating models from the evolving data and use these impromptu models for online prediction. In many cases this proves both costly and ineffective: much time is wasted on re-learning recurring concepts, yet the classifier may still remain one step behind the current trend. In this paper, we propose to mine high-order models in evolving data. More often than not, a data stream contains a limited number of concepts, or stable distributions, and the stream switches between them constantly. We mine all such concepts offline from a historical stream and build high-quality models for each of them. At run time, by combining historical concept-change patterns with cues provided by an online training stream, we find the most likely current concept and use its corresponding models to classify data in an unlabeled stream. The primary advantage of the high-order model approach is its high accuracy: experiments show that on benchmark datasets, the classification error of the high-order model is only a small fraction of that of the current best approaches. Another important benefit is that, unlike state-of-the-art approaches, ours does not require users to tune any parameters to achieve satisfactory results on streams of different characteristics.
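
The abstract describes two phases: an offline phase that mines stable concepts from a historical stream and trains models for each of them, and an online phase that combines historical concept-change patterns with labeled cues to select the most likely current concept. The sketch below is only an illustration of that idea under assumed simplifications, not the authors' algorithm: concept segments are taken as already mined, a decision tree stands in for the per-concept models, and simple transition counts play the role of the historical concept-change patterns. All function and variable names are hypothetical.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def build_high_order_models(segments):
    """Offline phase (sketch): segments is a list of (concept_id, X, y)
    tuples mined from the historical stream, in temporal order. Train one
    model per concept and count how often one concept switches to another."""
    models, transitions = {}, Counter()
    prev = None
    for cid, X, y in segments:
        if cid not in models:
            models[cid] = DecisionTreeClassifier().fit(X, y)
        if prev is not None and prev != cid:
            transitions[(prev, cid)] += 1
        prev = cid
    return models, transitions

def pick_concept(models, transitions, current, X_cues, y_cues):
    """Online phase (sketch): score each stored concept by its model's
    accuracy on the labeled cues, weighted by how often it historically
    follows the current concept, and return the best-scoring concept id."""
    best, best_score = None, -np.inf
    for cid, model in models.items():
        fit = model.score(X_cues, y_cues)          # fit to online cues
        prior = 1.0 if current is None else 1e-3 + transitions[(current, cid)]
        if fit * prior > best_score:
            best, best_score = cid, fit * prior
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def make(concept, n=200):
        # Two synthetic concepts: the label depends on a different feature.
        X = rng.normal(size=(n, 2))
        return X, (X[:, concept] > 0).astype(int)

    history = [(c, *make(c)) for c in (0, 1, 0, 1, 0)]
    models, trans = build_high_order_models(history)
    X_cues, y_cues = make(1, n=30)        # online cues drawn from concept 1
    cid = pick_concept(models, trans, None, X_cues, y_cues)
    X_test, y_test = make(1, n=100)
    print("picked concept", cid, "accuracy", models[cid].score(X_test, y_test))
```

In this reading, recurring concepts are never re-learned: once a concept's model is built offline, a switch back to that concept only requires recognizing it from the cues and the transition history.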
Date of Conference: 07-12 April 2008
Date Added to IEEE Xplore: 25 April 2008
Conference Location: Cancun, Mexico

I. Introduction

The primary task of data mining is to develop models based on existing data. In classification, the training data is usually fixed (for example, stored in a data warehouse), and the models, once trained from the stored data, can be applied to future data without much change. Thus, the knowledge discovery process can be regarded as consisting of two sequential phases: a training phase, in which models are learned from past data, and a testing phase, in which the models are applied to future data.
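
As a concrete illustration of this two-phase setup (a hedged sketch with hypothetical synthetic data, not taken from the paper), the snippet below trains a classifier once on stored data and then applies it unchanged to future data. It also shows why the setup breaks down when the underlying concept drifts, which is the situation the rest of the paper addresses.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Training phase: learn a model once from past (warehoused) data.
X_past = rng.normal(size=(500, 3))
y_past = (X_past[:, 0] > 0).astype(int)   # concept: label depends on feature 0
model = DecisionTreeClassifier().fit(X_past, y_past)

# Testing phase: the fixed model is applied to future data without change.
X_future = rng.normal(size=(500, 3))
y_stationary = (X_future[:, 0] > 0).astype(int)   # same concept as training
y_drifted = (X_future[:, 1] > 0).astype(int)      # concept has drifted
print("accuracy on stationary data:", model.score(X_future, y_stationary))
print("accuracy after drift:       ", model.score(X_future, y_drifted))
```

On the stationary future data the once-trained model remains accurate, while after the (simulated) drift its accuracy falls to roughly chance level.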

