Systematic data selection to mine concept-drifting data streams

Wei Fan

doi:10.1145/1014052.1014069

Publication

KDD 2004

Conference paper

Abstract

One major problem of existing methods for mining data streams is that they make ad hoc choices about how to combine the most recent data with some amount of old data when searching for a new hypothesis. The assumption is that the additional old data always helps produce a more accurate hypothesis than using the most recent data alone. We first criticize this notion and point out that using old data blindly is no better than "gambling"; in other words, it increases accuracy only if we are "lucky." We discuss and analyze the situations in which old data helps and what kind of old data helps. The practical difficulty of choosing the right examples from old data lies in the formidable cost of comparing different possibilities and models. This problem goes away if we have an algorithm efficient enough to compare all sensible choices at little extra cost. Based on this observation, we propose a simple, efficient, and accurate cross-validation decision tree ensemble method.

Date

22 Aug 2004

Authors

Wei Fan

IBM-affiliated at time of publication
