Comparative analysis of event tupling schemes
Abstract
Event logs provide an effective means of improving system availability. However, most faults produce many errors because a fault propagates in both the time and error detection domains. The ability to coalesce related events is therefore critical. The tupling heuristics developed at Carnegie-Mellon University provide one such methodology. We applied these heuristics to a new and larger set of data in order to evaluate the generality of the scheme and to extend the previous work. The extensions comprise a semantic understanding of why the rules work, an expanded statistical analysis, and a comprehensive sensitivity study of the effects of changes in the rules. The results show that tupling is a useful and general methodology. The sensitivity study identified refinements to the rules, and the high degree of skew in the tuple variables leads us to propose that the extreme percentiles be used as alarm thresholds for proactive fault management.
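
As a rough illustration of the two ideas summarized above, the following minimal sketch coalesces events that occur close together in time and then takes an extreme percentile of a tuple variable as an alarm threshold. It is not the Carnegie-Mellon rule set itself: the single gap rule, the function names `coalesce` and `percentile_threshold`, the parameter `max_gap`, the choice of tuple size as the variable, and the sample timestamps are all assumptions made for illustration.

```python
import math
from typing import List


def coalesce(timestamps: List[float], max_gap: float) -> List[List[float]]:
    """Group event timestamps into tuples using one assumed rule:
    an event joins the current tuple if it arrives within max_gap
    of the previous event; otherwise it starts a new tuple."""
    tuples: List[List[float]] = []
    for t in sorted(timestamps):
        if tuples and t - tuples[-1][-1] <= max_gap:
            tuples[-1].append(t)
        else:
            tuples.append([t])
    return tuples


def percentile_threshold(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a tuple variable (here, tuple size)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical event times in seconds, for illustration only.
events = [0.0, 1.0, 2.0, 50.0, 51.0, 200.0]
groups = coalesce(events, max_gap=5.0)
sizes = [float(len(g)) for g in groups]
alarm_level = percentile_threshold(sizes, 99.0)
```

In this sketch a tuple whose size reaches `alarm_level` would be flagged for attention. Both the gap and the percentile would need to be chosen from the data, which is the kind of question a sensitivity study addresses.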