Scalable algorithms at genomic resolution to fit LD distributions
Abstract
Reconstructing a population that matches a given LD (linkage disequilibrium) distribution is not straightforward, and the problem is compounded when the population must also match a MAF (minor allele frequency) distribution. Here we address the task of co-fitting multiple distributions at genomic resolution. The solution is based on incrementally scaling a fast (linear-time), non-generative algorithm, SimBA. Non-generative means that the algorithm does not generate the population by simulating evolution; instead, it directly builds the genomes in terms of polymorphic alleles that mimic the structure of the desired population. We present an incremental framework for scaling up the algorithm that remains both accurate and efficient. We demonstrate the efficacy of the algorithm on a variety of data sets, covering both human and plant data. Simulating populations that match summary distributions plays a critical role in in-silico hypothesis testing and optimization. For instance, in-silico breeding optimization in plants can model years or decades of experimentation and predict breeding outcomes in days, if not hours or minutes.
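To make the two target summaries concrete, the following is a minimal sketch of how MAF and pairwise LD could be measured on a simulated population. It is not part of SimBA. It assumes a biallelic haplotype matrix coded 0/1 and uses the common r^2 statistic as the LD measure, which is an assumption here since the abstract does not name one. The names `minor_allele_frequencies`, `ld_r2`, and `toy` are hypothetical.

```python
import numpy as np

def minor_allele_frequencies(haplotypes):
    """Per-marker MAF for an (n_haplotypes, n_markers) array of 0/1 alleles."""
    freq = haplotypes.mean(axis=0)          # frequency of the allele coded 1
    return np.minimum(freq, 1.0 - freq)     # the rarer allele's frequency

def ld_r2(haplotypes, j, k):
    """Pairwise LD between markers j and k, measured as r^2 (assumed measure)."""
    a = haplotypes[:, j]
    b = haplotypes[:, k]
    p_a, p_b = a.mean(), b.mean()
    d = (a * b).mean() - p_a * p_b          # D = P(AB) - P(A)P(B)
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined for a monomorphic marker; return 0.0 by convention here.
    return 0.0 if denom == 0 else float(d * d / denom)

# Toy population: 4 haplotypes, 3 markers (illustrative data only).
toy = np.array([
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
    [1, 1, 0],
])

maf = minor_allele_frequencies(toy)
r2_linked = ld_r2(toy, 0, 1)    # markers 0 and 1 are identical columns
r2_partial = ld_r2(toy, 0, 2)   # markers 0 and 2 are only partly correlated
```

A simulated population would be judged by how closely the histograms of such MAF and r^2 values follow the target distributions.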