Publication
ISSCC 2024
Conference paper

A Software-Assisted Peak Current Regulation Scheme to Improve Power-Limited Inference Performance in a 5nm AI SoC


Abstract

The rapid emergence of AI models, specifically large language models (LLMs) requiring large amounts of compute, drives the need for dedicated AI inference hardware. During deployment, compute utilization (and thus power consumption) can vary significantly across the layers of an AI model, token count, precision, and batch size [1]. Such wide variation, which may occur at fast time scales, poses unique challenges in optimizing performance within the system-level specifications for discrete accelerator cards, including not just average power consumption but also peak instantaneous current draw, which may require consideration of time constants down to the μs scale [2]. Prior current-limiting systems [2], [3], which use reactive schemes and often target general-purpose processors, may not be sufficient for AI workloads. This work leverages the predictable nature of AI workloads, which enables feed-forward compile-time software optimization, and proposes a new power management architecture that minimizes worst-case margins to realize the full potential of AI accelerators. In addition, because power consumption varies widely across card components under AI workloads, sensing current at the card level (rather than the chip level) provides more opportunity for optimization. A new software-assisted feed-forward current-limiting scheme is thus proposed in conjunction with PCIe-card-level closed-loop control to maximize performance under sub-ms peak current constraints.
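The combination of compile-time (feed-forward) and runtime (closed-loop) control described above can be illustrated with a minimal sketch. All names, current values, and the proportional-control gain below are illustrative assumptions, not details from the paper: the idea is that each layer carries a predicted peak current known at compile time, the scheduler derates layers predicted to exceed the card-level budget, and a simple feedback term trims any residual error reported by card-level current sensing.

```python
# Illustrative sketch (not the paper's implementation) of software-assisted
# feed-forward current limiting plus card-level closed-loop trimming.
# All constants and layer names are hypothetical.

PEAK_CURRENT_LIMIT_A = 55.0  # card-level peak-current budget (illustrative)


def feed_forward_throttle(predicted_current_a):
    """Compile-time derating factor: scale a layer's clock/utilization so
    its predicted peak current stays within the card-level budget."""
    if predicted_current_a <= PEAK_CURRENT_LIMIT_A:
        return 1.0  # layer fits within budget; run at full rate
    return PEAK_CURRENT_LIMIT_A / predicted_current_a


def closed_loop_trim(measured_current_a, throttle, gain=0.02):
    """Runtime proportional correction using card-level current sensing:
    reduce the throttle factor when measured current overshoots the budget."""
    error = measured_current_a - PEAK_CURRENT_LIMIT_A
    if error > 0:
        throttle = max(0.1, throttle - gain * error)
    return throttle


def schedule(layers):
    """layers: list of (name, predicted_peak_current_A) pairs.
    Returns per-layer feed-forward throttle factors chosen at compile time."""
    return {name: feed_forward_throttle(i) for name, i in layers}


# Example: one layer exceeds the budget and is derated; the others run freely.
plan = schedule([("attention", 70.0), ("mlp", 48.0), ("softmax", 20.0)])
```

In this sketch the feed-forward path removes the need for worst-case margins (only the layers predicted to overshoot are throttled, and only by the predicted amount), while the closed-loop path catches prediction error at runtime.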