Forecasting a Storm: Divining Optimal Configurations using Genetic Algorithms and Supervised Learning
Abstract
With the advent of Big Data platforms like Apache Storm, computations once deemed infeasible locally become possible at scale. However, doing so entails orchestrating powerful yet expensive clusters. With its focus on stream processing, Storm optimizes for low-latency and high throughput. However, to realize this goal and thereby maximize the utility of these clusters' resources, operators must execute these tasks under their optimal configurations. Yet, the search space for finding such configurations is so vast and time-consuming to explore so as to be effectively intractable due to issues like the temporal overhead of testing new candidate configurations, the sheer number of permutations of parameters within each configuration and their interdependence among each other. In order to efficiently cover the search space, we automate the process with genetic algorithms. Moreover, we fuse this technique not only with additional cluster information gleaned from JMX profiling and Storm performance data but also with classifiers constructed from training data from past executions of a plethora of Storm topologies. Utilizing a diverse set of Storm benchmark topologies as evaluation data, we show that the fully enhanced genetic algorithms can efficiently find configurations that perform on average 4.67x better than 'rules of thumb'-derived manual baselines. Moreover, we demonstrate that our fully refined classifiers enhance the GA throughput on average across the topologies by 22% while reducing search time by a factor of 6.47x.