Privacy and Regression Model Preserved Learning
Abstract
Sensitive data such as medical records and business reports usually contain valuable information that can be used to build prediction models. However, building learning models directly on sensitive data might result in severe privacy and copyright issues. In this paper, we propose a novel matrix-completion-based framework that tackles two challenging issues simultaneously: i) handling missing and noisy sensitive data, and ii) preserving the privacy of the sensitive data during the learning process. In particular, the proposed framework masks the sensitive data while ensuring that the transformed data remain usable for training regression models. We show that the transformed data produced by the proposed framework satisfy two key properties: model preserving and privacy preserving. Model preserving guarantees that the linear regression model built from the masked data matches the regression model learned from the original data. Privacy preserving ensures that the original sensitive data cannot be recovered, since the transformation procedure is irreversible. Given these two characteristics, the transformed data can be safely released to any learner for building prediction models without revealing any private content. Our empirical studies on a synthesized dataset and multiple sensitive benchmark datasets verify our theoretical claims as well as the effectiveness of the proposed framework.
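The model-preserving property can be illustrated with a minimal sketch. The snippet below is a hypothetical illustration, not the paper's actual masking scheme: it uses a random orthogonal transform Q, which scrambles individual records while leaving the least-squares regression model unchanged, since (QX)^T(QX) = X^TX and (QX)^T(Qy) = X^Ty.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3

# "Sensitive" design matrix and responses (synthetic stand-ins).
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Random orthogonal matrix Q via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

# Released (masked) data: individual rows no longer correspond to records.
X_masked, y_masked = Q @ X, Q @ y

# Least-squares fits on original and masked data coincide.
w_orig = np.linalg.lstsq(X, y, rcond=None)[0]
w_masked = np.linalg.lstsq(X_masked, y_masked, rcond=None)[0]
print(np.allclose(w_orig, w_masked))  # True: the regression model is preserved
```

Note that this simple orthogonal masking, unlike the framework proposed in the paper, does not by itself handle missing entries or guarantee irreversibility; it only demonstrates why a suitably chosen transform can preserve the fitted model.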