Evaluating the effectiveness of information extraction in real-world storage management
Abstract
As storage deployments within enterprises continue to grow, there is an increasing need to simplify and automate their management. Existing automation tools rely on information extracted from raw device performance data, in the form of device models and workload patterns. This paper evaluates the effectiveness of applying such information extraction techniques to real-world data collected over a period of months from the data centers of two commercial enterprises. Real-world monitoring data poses several challenges that typically do not exist in controlled lab environments. Our analysis creates models using popular algorithms such as M5, CART, ARIMA and Fast Fourier Transform (FFT). The relative error in predicting device response time from real-world data is 40-45%; a similar experiment using data from a controlled lab environment has a relative error of approximately 25%. Bootstrapping models for the two commercial datasets took 245 minutes and 477 minutes respectively, which illustrates the need for mechanisms that deal effectively with large enterprise scales. We describe one such technique, which clusters devices with similar hardware configurations. With a cluster size of five devices, we were able to reduce the model creation time to 94 minutes and 138 minutes respectively. Finally, an interesting trade-off arises between model accuracy and the computation time required to refine the model. Maintaining an average relative error of 35% requires all data samples of the devices to be included in model refinement, a process that takes 5 hours; on the other hand, if the models are not entirely rebuilt, the relative error climbs to 53%. © 2008 IEEE.
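The accuracy figures above are stated as relative error in predicted device response time. The abstract does not give the exact formula, so the following is only a minimal sketch under an assumed definition: the mean of |predicted - observed| / observed over paired samples. The function name `mean_relative_error` and the sample values are hypothetical and are not taken from the paper's datasets.

```python
def mean_relative_error(predicted, observed):
    """Mean of |predicted - observed| / observed over paired samples.

    Assumed definition; the abstract does not specify the formula used.
    Samples whose observed value is not positive are skipped to avoid
    division by zero.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    pairs = [(p, o) for p, o in zip(predicted, observed) if o > 0]
    if not pairs:
        raise ValueError("no samples with a positive observed value")
    return sum(abs(p - o) / o for p, o in pairs) / len(pairs)


# Hypothetical response times in milliseconds, for illustration only.
observed_ms = [10.0, 20.0, 40.0]
predicted_ms = [12.0, 15.0, 50.0]

error = mean_relative_error(predicted_ms, observed_ms)
print(f"mean relative error: {error:.1%}")
```

Under this definition, a reported value of 40-45% would mean that predicted response times deviate from the measured ones by roughly that fraction on average.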