Efficient sampling-based kernel mean matching
Abstract
Many real-world applications involve training and test data whose distributions differ but are related by a covariate shift, i.e., they share the same class-conditional distribution while their covariate distributions differ. Traditional data mining techniques struggle to learn a good predictive model in the presence of covariate shift. Recent studies address this challenge by weighting training instances according to the density ratio between the test and training data distributions. Kernel Mean Matching (KMM) is a well-known method for estimating this density ratio, but its time complexity is cubic in the size of the training data. KMM is therefore unsuitable for real-world applications, especially those where the predictive model must be updated periodically with large training data. We address this challenge by taking fixed-size samples from the training and test data, performing independent computations on these samples, and combining the results to obtain overall density ratio estimates. Our empirical evaluation demonstrates a large gain in execution time, while also achieving competitive accuracy on numerous benchmark datasets.
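To make the sample-and-combine idea concrete, the following is a minimal sketch and not the exact procedure evaluated in this work. It rests on several assumptions: an RBF kernel with a hypothetical bandwidth parameter gamma, a simplified KMM objective that keeps only the box constraint on the weights (the usual constraint on the sum of the weights is dropped), a partition of the training data into fixed-size blocks, and a combination step that averages the estimates obtained from several random test samples. The function names (rbf_kernel, kmm_weights, sampled_kmm_weights) and all parameter values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize


def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    d2 = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))


def kmm_weights(x_train, x_test, gamma=1.0, bound=10.0):
    """Simplified KMM: minimize 0.5 * b'Kb - kappa'b subject to 0 <= b <= bound.

    Assumption: the constraint on the sum of the weights is omitted for brevity.
    """
    n_tr, n_te = len(x_train), len(x_test)
    K = rbf_kernel(x_train, x_train, gamma)
    kappa = (n_tr / n_te) * rbf_kernel(x_train, x_test, gamma).sum(axis=1)

    def objective(beta):
        return 0.5 * beta @ K @ beta - kappa @ beta

    def gradient(beta):
        return K @ beta - kappa

    result = minimize(objective, np.ones(n_tr), jac=gradient,
                      method="L-BFGS-B", bounds=[(0.0, bound)] * n_tr)
    return result.x


def sampled_kmm_weights(x_train, x_test, sample_size=50, n_test_samples=5,
                        gamma=1.0, bound=10.0, seed=0):
    """Estimate one weight per training instance from small, independent problems."""
    rng = np.random.default_rng(seed)
    n_tr, n_te = len(x_train), len(x_test)
    weights = np.zeros(n_tr)
    order = rng.permutation(n_tr)
    # Assumption: training data is split into disjoint fixed-size blocks.
    for start in range(0, n_tr, sample_size):
        block = order[start:start + sample_size]
        estimates = []
        for _ in range(n_test_samples):
            # Each small problem sees a fresh fixed-size sample of the test data.
            test_idx = rng.choice(n_te, size=min(sample_size, n_te), replace=False)
            estimates.append(
                kmm_weights(x_train[block], x_test[test_idx], gamma, bound))
        # Assumption: per-sample estimates are combined by simple averaging.
        weights[block] = np.mean(estimates, axis=0)
    return weights
```

Under these assumptions, each small problem involves at most sample_size training points, so its cost does not grow with the full training set, and the small problems are independent of one another and could be run in parallel. A call such as sampled_kmm_weights(x_train, x_test, sample_size=20) returns one non-negative weight per training instance, which could then be passed as instance weights to a downstream learner.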