Privacy preserving population stratification for collaborative genomic research
Abstract
Rapid improvements in genomic sequencing technology have led to a proliferation of locally collected genomic datasets. Given the sensitivity of genomic data, collaborative studies must preserve the privacy of the individuals involved. Before any collaborative research effort begins, however, the quality of the data needs to be assessed. An essential step of this quality control process is population stratification: identifying genetic differences among individuals that arise from the presence of subpopulations. A common method for grouping the genomes of individuals by ethnicity is principal component analysis (PCA). In this paper, we propose a framework to perform population stratification using PCA across multiple collaborators in a privacy-preserving way. In our proposed client/server scheme, clients (collaborators) send metadata about their research datasets, in the form of their local PCA outputs, to the server under local differential privacy. The server then aligns the local PCA results to identify the genetic differences among the collaborators' datasets. To make this alignment accurate, the server first trains a global PCA model on a publicly available genomic dataset that contains individuals from multiple populations, and the collaborators use this global PCA model to generate their local PCA outputs. Our results on real genomic data show that the proposed framework performs population stratification with high accuracy while preserving the privacy of the research participants.
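The client/server flow described above can be illustrated with a minimal sketch; it is not the implementation evaluated in the paper. It makes several assumptions that the abstract does not state: genotypes are synthetic 0/1/2 counts, each client clips its projections and adds Laplace noise to obtain local differential privacy, and the server compares per-client centroids in the shared PCA space as a simplified stand-in for the alignment step. All function names and parameter values (`fit_global_pca`, `perturb_ldp`, `epsilon`, `clip`, and so on) are hypothetical.

```python
import numpy as np


def fit_global_pca(reference, n_components=2):
    """Server: fit PCA on a public reference panel (individuals x SNPs)."""
    mean = reference.mean(axis=0)
    # Rows of vt are the principal axes of the centered reference matrix.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(genotypes, mean, components):
    """Client: map local genotypes into the shared (global) PCA space."""
    return (genotypes - mean) @ components.T


def perturb_ldp(projections, epsilon, clip, rng):
    """Client: clip each coordinate to [-clip, clip], then add Laplace noise.

    Assumed mechanism (not taken from the paper). Two clipped k-dimensional
    records differ by at most 2 * clip * k in L1 norm, so Laplace noise of
    scale 2 * clip * k / epsilon on every coordinate gives epsilon-local
    differential privacy for each individual's record.
    """
    clipped = np.clip(projections, -clip, clip)
    scale = 2.0 * clip * clipped.shape[1] / epsilon
    return clipped + rng.laplace(0.0, scale, size=clipped.shape)


def sample_genotypes(allele_freqs, n_individuals, rng):
    """Synthetic 0/1/2 genotypes drawn from per-SNP allele frequencies."""
    shape = (n_individuals, allele_freqs.size)
    return rng.binomial(2, allele_freqs, size=shape).astype(float)


def demo(seed=0, n_snps=200, epsilon=5.0, clip=10.0):
    """Two clients, each holding one synthetic population."""
    rng = np.random.default_rng(seed)
    freqs_a = rng.uniform(0.1, 0.9, n_snps)
    freqs_b = rng.uniform(0.1, 0.9, n_snps)

    # Server: the public reference panel contains both populations.
    reference = np.vstack([
        sample_genotypes(freqs_a, 300, rng),
        sample_genotypes(freqs_b, 300, rng),
    ])
    mean, components = fit_global_pca(reference)

    # Clients: project with the global model, perturb, and send to the server.
    centroids = []
    for freqs in (freqs_a, freqs_b):
        local = sample_genotypes(freqs, 2000, rng)
        noisy = perturb_ldp(project(local, mean, components), epsilon, clip, rng)
        # Server: averaging many noisy points cancels much of the noise.
        centroids.append(noisy.mean(axis=0))
    return components, np.array(centroids)
```

Calling `demo()` returns the global principal axes and one noisy centroid per client. Because both clients project with the same global model, their centroids live in the same coordinate system and can be compared directly, which is the motivation the abstract gives for training the PCA model on a public dataset.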