Is random model better? On its accuracy and efficiency
Abstract
Inductive learning searches for an optimal hypothesis that minimizes a given loss function. It is usually assumed that the simplest hypothesis that fits the data is the best approximation of an optimal hypothesis. Since finding the simplest hypothesis is NP-hard for most representations, we generally employ various heuristics to search for its closest match. Computing these heuristics incurs significant cost, making learning inefficient and unscalable for large datasets. At the same time, it remains questionable whether the simplest hypothesis is indeed the closest approximation of the optimal model. The recent success of combining multiple models, such as bagging, boosting and meta-learning, has greatly improved accuracy over the single simplest hypothesis, providing a strong argument against its optimality. However, computing these combined hypotheses incurs significantly higher cost. In this paper, we first argue that as long as the error of a hypothesis on each example is within a range dictated by a given loss function, it can still be optimal. Contrary to common belief, we propose a completely random decision tree algorithm that achieves much higher accuracy than the single best hypothesis and is comparable to boosted or bagged multiple best hypotheses. The advantage of multiple random trees is their training efficiency as well as their minimal memory requirements. © 2003 IEEE.
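
The following is a minimal sketch, not the authors' implementation, of the idea the abstract describes: an ensemble of completely random decision trees, where each tree picks split features and thresholds at random without inspecting the labels, leaves store the class counts of the training examples that reach them, and predictions average the leaf class distributions across trees. The class names, depth limit, and toy data below are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a completely random decision tree ensemble (assumed details, not the paper's code).
import random
from collections import Counter

class RandomTree:
    def __init__(self, max_depth=5, min_examples=2):
        self.max_depth = max_depth
        self.min_examples = min_examples

    def fit(self, X, y):
        self.root = self._build(X, y, depth=0)
        return self

    def _build(self, X, y, depth):
        # Stop when deep enough or too few examples; a leaf keeps class counts.
        if depth >= self.max_depth or len(X) < self.min_examples:
            return {"counts": Counter(y)}
        # Choose a split feature and threshold completely at random:
        # no information gain, no label inspection.
        feat = random.randrange(len(X[0]))
        values = [row[feat] for row in X]
        if min(values) == max(values):
            return {"counts": Counter(y)}
        thresh = random.uniform(min(values), max(values))
        left = [(row, label) for row, label in zip(X, y) if row[feat] <= thresh]
        right = [(row, label) for row, label in zip(X, y) if row[feat] > thresh]
        if not left or not right:
            return {"counts": Counter(y)}
        return {
            "feat": feat,
            "thresh": thresh,
            "left": self._build([r for r, _ in left], [l for _, l in left], depth + 1),
            "right": self._build([r for r, _ in right], [l for _, l in right], depth + 1),
        }

    def _leaf(self, node, row):
        while "counts" not in node:
            node = node["left"] if row[node["feat"]] <= node["thresh"] else node["right"]
        return node["counts"]

    def predict_proba(self, row, classes):
        counts = self._leaf(self.root, row)
        total = sum(counts.values()) or 1
        return [counts.get(c, 0) / total for c in classes]

def ensemble_predict(trees, row, classes):
    # Average posterior probability estimates over all random trees.
    probs = [t.predict_proba(row, classes) for t in trees]
    avg = [sum(p[i] for p in probs) / len(trees) for i in range(len(classes))]
    return classes[max(range(len(classes)), key=lambda i: avg[i])]

if __name__ == "__main__":
    # Toy data: two noisy clusters, purely for illustration.
    random.seed(0)
    X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(100)] + \
        [[random.gauss(3, 1), random.gauss(3, 1)] for _ in range(100)]
    y = [0] * 100 + [1] * 100
    trees = [RandomTree(max_depth=4).fit(X, y) for _ in range(10)]
    print(ensemble_predict(trees, [2.8, 3.1], classes=[0, 1]))  # expected: 1
```

Because no split-selection statistics are computed, training each tree costs only one pass over the data per level, which is the source of the efficiency and low memory footprint claimed above.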