Storage Reduction Through Minimal Spanning Trees and Spanning Forests
Abstract
It is often possible to save storage space in a computer by storing only the differences among data items rather than the entire items. For example, suppose we have two records A and B. We should store all of A, then for B store a pointer to A and the differences between A and B. If A and B are similar, there will be few differences and storage space can be saved. In storing a large number of data items by this method, we must consider the problem of how to choose a reference item for each data item so as to minimize storage space. This paper explores the use of a minimal spanning tree as a solution to this problem. The more date items being stored, the more likely it is that this method will be superior to the straightforward method where all the data is stored in full. This is because a large data set is likely to contain many data items which are quite similar. It is shown that once the number of records exceeds a certain threshold, it is guaranteed that our method is better. Experiments were conducted on both artificial data and real data. The results are encouraging. For instance, if the number of variables is 6 and each variable can assume 3 possible distinct values, then as soon as the number of records exceeds 3 percent of all possible records, our minimal spanning tree method is already better than the conventional method. For a set of patient data, the minimal spanning tree method saves 73 percent of the memory space. We also give the criterion for breaking a long link on the minimal spanning tree to further reduce the memory space required. Copyright © 1977 by The Institute of Electrical and Electronics Engineers, Inc.