ALPOS: A machine learning approach for analyzing microblogging data
Abstract
With the development of Internet, the increasing volume of information posted on micro-blogging sites like Twitter necessitates the need for efficient information filtering. In conventional text classification problems, it is assumed that the feature vectors extracted from the available documents are sufficient to learn good classifiers. However, this conventional approach is not likely to work for Twitter due to the limited number of characters on each tweet. From a higher level, each tweet can be viewed as an abbreviated abstraction of a long document, and we only have a partial observation of this document. To solve the problem caused by the partial observations, we introduce a novel domain adaption/transfer learning approach called Assisted Learning for Partial Observation (ALPOS). The basic idea is to use a large number of multi-labeled examples (source domain) to improve the learning on the partial observations (target domain). In particular, we learn a hidden, higher-level abstraction space, which is meaningful for the multi-labeled examples in the source domain. This is done by simultaneously minimizing the document reconstruction error and the error in a classification model learned in the hidden space by using known labels from the source domain. The partial observations in the target space are then mapped to the same hidden space for recovery and classification. We compare the performance of this method with existing approaches on synthetic data and the well-known Reuters-21578 dataset. We also present experimental results on twitter classification. © 2010 IEEE.