A framework for local supervised dimensionality reduction of high dimensional data
Abstract
High dimensional data presents a challenge to the classification problem because of the difficulty in modeling the precise relationship between the large number of feature variables and the class variable. In such cases, it may be desirable to reduce the data to a small number of dimensions in order to improve the accuracy and effectiveness of the classification process. While data reduction is a well-studied problem in the unsupervised domain, it has not been explored as extensively for the supervised case. Existing techniques for supervised dimensionality reduction are too slow for practical use in the high dimensional case. These techniques try to find global discriminants in the data. However, the behavior of the data often varies considerably with data locality, and different subspaces may show better discrimination in different localities. This is an even more challenging task than the global discrimination problem because of the additional issue of data localization. In this paper, we propose the novel idea of supervised subspace sampling in order to create a reduced representation of the data for classification applications in an efficient and effective way. The method exploits the natural distribution of the different classes in order to sample the best subspaces for class discrimination. Because of its sampling approach, the procedure is extremely fast and scales almost linearly with both data set size and dimensionality.
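To make the general idea concrete, the sketch below illustrates one plausible reading of locality-specific supervised subspace sampling: random low-dimensional feature subsets are drawn within the neighborhood of a query point and scored by how well they separate the classes there. This is a minimal illustration only, not the algorithm developed in the paper; the Fisher-style scoring criterion, the function names, and all parameters (neighborhood size, subspace dimensionality, number of samples) are assumptions introduced for exposition.

```python
# Hypothetical sketch of local supervised subspace sampling.
# The scoring criterion and all parameters are illustrative assumptions,
# not the method proposed in the paper.
import numpy as np

def fisher_score(X_sub, y):
    """Ratio of between-class to within-class scatter on a feature subset."""
    overall_mean = X_sub.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(y):
        Xc = X_sub[y == c]
        mean_c = Xc.mean(axis=0)
        between += len(Xc) * np.sum((mean_c - overall_mean) ** 2)
        within += np.sum((Xc - mean_c) ** 2)
    return between / (within + 1e-12)

def sample_local_subspace(X, y, anchor, k_neighbors=50, subspace_dim=3,
                          n_samples=200, rng=None):
    """Draw random feature subsets within the neighborhood of `anchor`
    and return the subset with the highest class-discrimination score."""
    rng = np.random.default_rng(rng)
    # Restrict attention to the local neighborhood of the anchor point.
    dists = np.linalg.norm(X - anchor, axis=1)
    local = np.argsort(dists)[:k_neighbors]
    X_loc, y_loc = X[local], y[local]
    best_score, best_features = -np.inf, None
    for _ in range(n_samples):
        feats = rng.choice(X.shape[1], size=subspace_dim, replace=False)
        score = fisher_score(X_loc[:, feats], y_loc)
        if score > best_score:
            best_score, best_features = score, feats
    return best_features, best_score
```

Because each candidate subspace is evaluated only on a small neighborhood and scoring a subset is linear in the number of local points, this kind of sampling scheme avoids the expensive global eigen-decompositions of classical discriminant methods, which is consistent with the near-linear scaling claimed above.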