Synthetic Data Generation via Multidimensional Multiset Sum
Abstract
The US Decennial Census provides valuable data for researchers across a wild range of disciplines. In 2020, the Census Bureau changed its disclosure avoidance system to use differential privacy. However, research on the impacts of this change on both respondent privacy and data accuracy are limited because privacy-preserving mechanisms operate on the underlying data ("microdata") as opposed to the published aggregate statistics: Because the microdata are kept secret, outside researchers cannot directly run the algorithm. In this work, we provide tools to sample synthetic microdata from a distribution given published Census statistics. We formulate synthetic data generation in this context as a combinatorial optimization problem and develop new algorithms that improve upon existing techniques. While the problem we study is provably hard, we show empirically that our methods work well in practice, and we offer theoretical arguments to explain our performance.