Optimizing software cache-coherent cluster architectures
Abstract
Software cache-coherent systems that use programmable protocol processors provide a flexible infrastructure for expanding a system in both size and function. However, this flexibility comes at a cost in performance. First, a software implementation of a protocol is inherently slower than a hardware implementation. Second, when multiple processors share a protocol processor, contention may substantially increase memory latency. In this paper, we study how the overhead of a software scheme can be reduced in the context of a shared-memory system consisting of SMP clusters. We examine several design choices, including hardware assists such as forwarding logic in the protocol processor and software hints in the form of explicit communication primitives. We conduct our experiments via trace-driven simulation and compare the execution of three programs from the SPLASH-2 suite. We find that small cluster sizes (up to 4 processors per node) work well for both hardware and software implementations. When forwarding logic is incorporated into the software scheme, its performance is competitive with that of the hardware scheme. When enhanced further with explicit communication primitives, the software scheme can perform even better than a pure hardware implementation. This advantage is particularly noticeable when network latency is high.