Approach to selecting metrics for detecting performance problems in information systems
Abstract
Early detection of performance problems is essential to limit their scope and impact. Most commonly, performance problems are detected by applying threshold tests to a set of detection metrics. For example, suppose that disk utilization is a detection metric and its threshold value is 80%. An alarm is then raised whenever disk utilization exceeds 80%. Unfortunately, detection metrics are often selected in an ad hoc manner, which leads to false alarms, to problems going undetected until performance has degraded seriously, or to both. To address this situation, we construct rules for metric selection based on analytic comparisons of statistical power equations for five widely used metrics: departure counts (D), number in system (L), response times (R), service times (S), and utilizations (U). These rules are assessed in the context of performance problems in the CPU and paging sub-systems of a production computer system.
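
The threshold test in the disk utilization example can be illustrated with a minimal sketch. The function name threshold_alarm and the sample readings are hypothetical and chosen only for illustration; they are not taken from the system studied here. The sketch assumes utilization is reported as a fraction of capacity and that the alarm fires only on a strict exceedance of the threshold.

```python
def threshold_alarm(samples, threshold):
    """Return the indices of the samples that strictly exceed the threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


# Hypothetical disk utilization readings (fraction of capacity), one per interval.
disk_utilization = [0.42, 0.79, 0.80, 0.91, 0.65, 0.85]

# Threshold of 80%, as in the example in the abstract.
alarms = threshold_alarm(disk_utilization, threshold=0.80)
print(alarms)  # [3, 5]
```

A reading of exactly 80% does not raise an alarm under this strict comparison. The sketch shows only the mechanics of a single threshold test; it says nothing about which metric or threshold to choose, which is the question the selection rules address.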