Failure recovery in cooperative data stream analysis
Abstract
We present a failure recovery framework for System S, a large-scale stream data analysis environment. System S is intended to support multiple sites, each with its own local administration and goals. Nevertheless, these sites benefit from cooperating with each other, especially in the presence of failures. Our ultimate goal is to support automatic, timely failure recovery through cooperation among sites. We identify the unique challenges that arise in the context of System S and present our initial design work. In particular, we consider the backup selection problem, which specifies where failed jobs should be recovered, and formulate it as an optimization problem. We present an approximation algorithm together with empirical results obtained through simulations. Our numerical evaluations show that the proposed approximation algorithm is very efficient and effective when compared with optimal solutions. Its empirical performance ratio is promising and close to the theoretical limit of polynomial-time approximation for such a problem. © 2007 IEEE.