Hybrid checking for microarchitectural validation of microprocessor designs on acceleration platforms
Abstract
Software-based simulation provides a convenient environment for microprocessor design validation, where a number of complex software checkers are integrated with the simulated design to identify discrepancies between design and specification. Unfortunately, the performance of software-based simulation is vastly inadequate to achieve sufficient coverage for large microprocessor designs with complex microarchitectures. Hence, acceleration and emulation platforms are heavily deployed in the industry for high-performance validation. However, software checkers cannot be directly incorporated into such platforms, forcing designers to craft ad-hoc solutions. Adapting checking solutions for software simulation to acceleration platforms presents the following constraints: i) only a limited number of signals can be monitored per cycle for checking purposes so as to retain acceptable simulation performance, and ii) the overhead of the added checking logic must be minimal. In this work, we explore a novel solution to adapt software-based checkers for individual microarchitectural blocks to acceleration platforms, by leveraging a hybrid approach. Our solution exploits embedded logic and data tracing for post-simulation checking in a synergistic fashion to limit the associated overhead. Embedded logic can be used for synthesized local checkers as well as to compress the traced data and thus limit recording overhead. We analyze several trade-offs between checking accuracy and logic and recording overhead for different microarchitectural blocks of an out-of-order superscalar processor design. We aim to provide insights into how to adapt such software checkers to the acceleration environment using our hybrid approach.
We find that, by leveraging simple embedded checkers and data compressors (15-25% logic overhead), we can achieve excellent checking accuracy even when aggressively compressing the data for transfer (only 15-25 bits/cycle), and localize bugs up to 5,900 cycles sooner than an architectural-level checker. © 2013 IEEE.
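
As an illustration only, the hybrid flow summarized above can be sketched in a few lines of Python. This is not the implementation evaluated in this work: the XOR-fold compressor, the range-based local checker, the 16-bit signature width, and all names (`compress`, `local_check`, `run_accelerated`, `post_sim_check`) are assumptions chosen to keep the sketch small. The idea shown is that each cycle emits a one-bit local-checker verdict plus a short signature of the monitored signals, and a software pass after simulation compares the recorded signatures against those computed from a reference model.

```python
from typing import List, Optional, Sequence, Tuple

SIG_BITS = 16  # assumed per-cycle trace budget, in bits (hypothetical value)

Cycle = Tuple[int, ...]  # monitored signal values of one block in one cycle


def compress(signals: Cycle, bits: int = SIG_BITS) -> int:
    """XOR-fold the monitored values into a `bits`-wide signature.

    The fold is lossy: distinct inputs may alias to the same signature,
    which is the accuracy cost paid for a smaller trace.
    """
    mask = (1 << bits) - 1
    sig = 0
    for value in signals:
        if value < 0:
            raise ValueError("signal values must be non-negative")
        while value:
            sig ^= value & mask
            value >>= bits
    return sig


def local_check(signals: Cycle, limit: int) -> bool:
    """Stand-in for a synthesized local checker: all values below `limit`."""
    return all(s < limit for s in signals)


def run_accelerated(design: Sequence[Cycle], limit: int) -> List[Tuple[bool, int]]:
    """Model what the platform records: (local verdict, signature) per cycle."""
    return [(local_check(c, limit), compress(c)) for c in design]


def post_sim_check(trace: Sequence[Tuple[bool, int]],
                   golden: Sequence[Cycle]) -> Optional[int]:
    """Return the first cycle that fails either check, or None if clean."""
    for cycle, ((ok, sig), expected) in enumerate(zip(trace, golden)):
        if not ok or sig != compress(expected):
            return cycle
    return None


if __name__ == "__main__":
    golden = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
    design = [(1, 2, 3), (4, 5, 6), (7, 8, 10)]  # bug injected in cycle 2
    print(post_sim_check(run_accelerated(design, limit=256), golden))
```

A wider signature reduces aliasing at the cost of more recorded bits per cycle, while moving more of the check into `local_check` reduces traced data at the cost of additional embedded logic.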