PIXSAR: Incremental reclustering of augmented XML trees
Abstract
XML is one of the primary encoding schemes for data and knowledge. We investigate incremental physical data clustering in systems that store XML documents using a native format. We formulate the XML clustering problem as an augmented (with sibling edges) tree partitioning problem and propose the PIXSAR (Practical Incremental XML Sibling Augmented Reclustering) algorithm for incrementally clustering XML documents. We show the fundamental importance of workload-driven dynamically rearranging storage. PIXSAR incrementally executes reclustering operations on selected subgraphs of the global augmented document tree. The subgraphs are implied by significant changes in the workload. As the workload changes, PIXSAR incrementally djusts the XML data layout so as to better fit the workload. PIXSAR's main parameters are the radius, in pages, of the augmented portion to be reclustered and the way reclustering is triggered. We briefly explore some of the effects of indexes; a full treatment of indexes is the subject of another paper. We use an experimental data clustering system that includes a fast disk simulator and File System simulator for storing native XML data. We use a novel method for 'exporting' the Saxon query processor into our setting. Experimental results indicate that using PIXSAR significantly reduces the number of page faults (counting ALL page faults incurred while querying the document as well as maintenance operations) thereby resulting in improved query performance. Copyright 2008 ACM.