Predicting LLM Inference Latency: A Roofline-Driven ML Method
Abstract
Recent advances in Large Language Models (LLMs) for Generative AI have significantly increased their popularity, resulting in a rapid proliferation of new closed- and open-source LLMs with frequent algorithmic updates. This further complicates application management, resource allocation, and scaling in cloud environments when targeting optimal inference latency. The typical run-and-measure approach to finding the optimal configuration thus becomes impractical due to the large combinatorial search space and the scarcity and cost of GPU resources, creating the need for predictive performance models. To address this, we propose a new LLM performance prediction model that can be leveraged for optimal cluster management. The novelty of our approach lies in combining an analytical Roofline Model (RLM), tailored to LLM inference and based on hardware characteristics, with regression models trained on historical data. More specifically, our approach calibrates the theoretical hardware performance given by the RLM with the inherent runtime overhead captured by the regression models, offering a more interpretable and accurate prediction method for cloud-based deployments. We validate our method on both the vLLM and Triton inference servers, demonstrating that combining the RLM with regression improves the $R^2$ value by $17\%$ and reduces the MSE by up to $81\%$ for vLLM compared to regression-only models.
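To make the core idea concrete, the sketch below illustrates one way a roofline lower bound, $\max(\text{FLOPs}/\text{peak FLOP/s},\ \text{bytes}/\text{bandwidth})$, can be calibrated with a regression model fitted on historical latency measurements. It is a minimal illustration rather than the paper's implementation: the hardware figures, the 7B-parameter model size, the synthetic measurements, and the feature set are assumptions introduced only for this example.

```python
# Minimal sketch of a roofline-plus-regression latency predictor.
# All hardware numbers, feature choices, and data are illustrative
# assumptions, not the paper's actual implementation.
import numpy as np
from sklearn.linear_model import LinearRegression


def roofline_latency_s(flops, bytes_moved, peak_flops, mem_bw_bytes_s):
    """Analytical lower bound: the step is limited by whichever of
    compute or memory traffic takes longer on the given hardware."""
    return max(flops / peak_flops, bytes_moved / mem_bw_bytes_s)


# Hypothetical accelerator: 300 TFLOP/s peak, 2 TB/s memory bandwidth.
PEAK_FLOPS = 300e12
MEM_BW = 2e12

# Synthetic historical runs: features are the roofline estimate plus
# request-level attributes (batch size, prompt length); the target is
# the "measured" end-to-end latency, which includes runtime overhead.
rng = np.random.default_rng(0)
batch = rng.integers(1, 33, size=200)
prompt_len = rng.integers(64, 2049, size=200)
flops = 2.0 * 7e9 * batch                                    # ~2 * params * batch per token (7B model)
bytes_moved = 2.0 * 7e9 * np.ones_like(batch, dtype=float)   # fp16 weight reads per token

rl = np.array([roofline_latency_s(f, b, PEAK_FLOPS, MEM_BW)
               for f, b in zip(flops, bytes_moved)])
measured = rl + 1e-3 + 2e-5 * batch + 1e-6 * prompt_len + rng.normal(0, 2e-4, 200)

# Regression learns the gap between the theoretical bound and reality.
X = np.column_stack([rl, batch, prompt_len])
model = LinearRegression().fit(X, measured)

# Calibrated prediction for a new request (batch of 8, 512-token prompt).
new_rl = roofline_latency_s(2.0 * 7e9 * 8, 2.0 * 7e9, PEAK_FLOPS, MEM_BW)
print(model.predict([[new_rl, 8, 512]]))
```

In this setup the roofline term keeps the prediction anchored to hardware capability and therefore interpretable, while the regression absorbs overheads (scheduling, batching, kernel launch) that the analytical model cannot see.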