Myriad: Parallel data generation on shared-nothing architectures
Abstract
The need for efficient data generation for testing and benchmarking newly developed massively parallel data processing systems has increased with the emergence of Big Data problems. As synthetic data model specifications evolve over time, the data generator programs implementing these models have to be adapted continuously, a task that often becomes more tedious as the set of model constraints grows. In this paper we present Myriad, a new parallel data generation toolkit. Data generators created with the toolkit can quickly produce very large datasets in a shared-nothing parallel execution environment, while at the same time preserving cross-partition dependencies, correlations, and distributions in the generated data. In addition, we report on our efforts towards a benchmark suite for large-scale parallel analysis systems that uses Myriad for the generation of OLAP-style relational datasets.

Categories and Subject Descriptors
D.2.5 [Software Engineering]: Testing and Debugging - testing tools

General Terms
Software Engineering, Testing and Debugging, Testing Tools, Scalable Data Generation

Copyright 2012 ACM.