NetworkMD: Topology inference and failure diagnosis in the last mile
Abstract
Health monitoring, automated failure localization and diagnosis have all become critical to service providers of large distribution networks (e.g., digital cable and fiber-to-thehome), due to the increases in scale and complexity of their offered services. Existing automated failure diagnosis solutions typically assume complete knowledge of network topology, which in practice is rarely available. The solution presented in this paper - Network Management and Diagnosis (NetworkMD) - is an automated failure diagnosis system that can infer failure groups based on historical failure data, and optionally geographical information. The inferred failure groups mirror missing topologies, and can be used to localize failures, diagnose root causes of problems, and detect misconfiguration in known topologies. NetworkMD uses an unsupervised learning algorithm based on non-negative matrix factorization (NMF) to infer failure groups. Using cable network as the primary example, we demonstrate the effectiveness of NetworkMD in both simulated settings and real environment using data collected from a commercial network serving hundreds of thousands of customers via thousands of intermediate network devices. Copyright 2007 ACM.