A Parallel Hash Join Algorithm for Managing Data Skew
Abstract
As relations grow larger and queries grow more complex, processing queries in parallel becomes more important. However, the presence of data skew can cause load imbalances among the various processors, significantly reducing the speedup available from conventional parallel join algorithms. The presence of even one large skew value can cause a processor to become overloaded. This paper presents a parallel hash join algorithm, based on the new concept of hierarchical hashing, to address the problem of data skew. The proposed algorithm splits the usual hash phase into a hash phase and an explicit transfer phase, and adds an extra scheduling phase between these two. During the scheduling phase, a heuristic optimization algorithm, using the output of the hash phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the hash partitions with the largest skew values and splits them as necessary, assigning each of them to an optimal number of processors. Assuming for concreteness a Zipf-like distribution of the values in the join column, a join phase which is CPU-bound, and a shared nothing environment, the algorithm is shown to achieve good join phase load balancing, and to be robust relative to the degree of data skew and the total number of processors. We compare the overall speedup due to this algorithm to some existing parallel hash join methods. The proposed method does considerably better in high skew situations. In situations of low skew, the performance of the proposed algorithm and the conventional methods are comparable, even when the additional scheduling phase is taken into account. Results under different assumptions and in other environments are similar. © 1993 IEEE