How low-precision computing boosts efficiency — without hurting accuracy
IBM Research scientists are working on making AI computing more efficient by shrinking the size of numbers and co-designing hardware to match.
As artificial intelligence has grown out of the lab and into all areas of our lives, it’s become clear that we need to find new ways to make this technology do more with less computing power. For large language models to be viable for businesses, they can’t require hundreds of GPUs to run. One option is low-precision computing. Sometimes called approximate computing, low-precision arithmetic is well suited for AI applications whose computations don’t require a high degree of precision to produce accurate enough results. Its advantages include reduced compute and energy costs, as well as lower latency.
Developed over the past decade or so, and maturing in recent years, low-precision computing is quickly becoming an industry standard. As its utility and prominence grow, hardware needs to be co-designed to work in tandem with this approach. It’s no coincidence, then, that low-precision computing is a crucial element of the architecture in IBM’s family of AIUs, devices designed from the ground up to train and deploy AI models as efficiently as possible. Typical CPUs work in high precision, usually FP32 or FP64 floating-point arithmetic. FP32 or FP64 means numbers are represented with 32 or 64 bits. This high level of precision is well suited to calculations in mathematics, medicine, or engineering that require exacting tolerances.
For AI, though, this level of precision is often overkill. Not only that, but since AI involves massive amounts of computations with a high level of redundancy, the compute requirements for running large language models at FP32 or FP64 precision are monumental. These numbers can be quantized, or pared down to a limited bitwidth, without sacrificing accuracy.
Much like a back-of-the-envelope calculation is sufficient for calculating the tip at a restaurant, or a quick glance is all it takes for us to tell a rose apart from a daisy, 16-, 8-, or even 4-bit precision can provide the appropriate level of computational accuracy for today’s AI models. In fact, 16- and 8-bit compute are already industry-accepted standards and are deployed in production settings. At the same time, researchers must develop algorithms to control the effect of reduced precision, so errors don’t add up and derail model accuracy.
As the industry moves toward low-precision computing, scientists at IBM Research have been advancing techniques for low-precision training, and are currently working on incorporating them into how we train and run inference with LLMs, as well as how we design hardware for them.
Full-precision (or double-precision) computing usually refers to operations performed with 64-bit representations, says Mudhakar Srivatsa, a distinguished engineer at IBM Research and an expert on AI optimization. As you increase the number of bits, you increase both precision (decreasing the gap between two representable numbers) and range (increasing the minimum and maximum value of representable numbers). Computers can handle operations involving large, high-precision figures. Ask a modern computer to perform a single piece of arithmetic at 64 bits of precision, and the result will be near-instantaneous.
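To make the link between bit count, precision, and range concrete, here is a minimal sketch that derives the gap between neighboring values just above 1.0 and the largest finite value for three floating-point formats. The exponent and fraction bit counts are the standard IEEE 754 layouts, not figures from this article, and `describe` is a hypothetical helper name chosen for illustration.

```python
# Sketch: precision (gap between neighboring values near 1.0) and range
# (largest finite value) for three IEEE 754 binary formats.
# Assumption: standard IEEE 754 bit layouts, which the article does not state.
FORMATS = {
    "FP16": (5, 10),   # (exponent bits, fraction bits)
    "FP32": (8, 23),
    "FP64": (11, 52),
}

def describe(exp_bits, frac_bits):
    """Return (gap just above 1.0, largest finite value) for a format."""
    gap = 2.0 ** -frac_bits
    max_exp = 2 ** (exp_bits - 1) - 1
    largest = (2.0 - gap) * 2.0 ** max_exp
    return gap, largest

for name, (exp_bits, frac_bits) in FORMATS.items():
    gap, largest = describe(exp_bits, frac_bits)
    print(f"{name}: gap near 1.0 = {gap:.3e}, largest finite value = {largest:.3e}")
```

Each step down in bitwidth widens the gap between representable numbers and narrows the span of values that can be represented at all.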
But when it comes to AI training and inference, which involve trillions of operations, the computational requirements and latency of full-precision computing add up quickly. For example, given an input of length 1,000, generating 1,000 output tokens on a LLaMa3 8b model would require over 20 trillion operations. This number grows with the number of concurrent requests and the context length (size of inputs and outputs), with many emerging workloads such as agentic systems driving the need for 128,000-token or even 1 million-token context lengths. “To multiply two n-digit numbers, you are performing n² amount of work,” Srivatsa says.
Model weights for full-precision computing are also larger than low-precision weights, meaning each one takes up more memory space to store. Further, these weights (which can be tens to hundreds of gigabytes) must be moved from memory to the compute cores for every token generation. More bits of precision take quadratically more energy and silicon surface area: moving from 32 to 64 bits requires computational building blocks four times larger, while each halving of the bitwidth cuts energy and silicon requirements by a factor of four. AI researchers are looking to low-precision computing to save on these operations.
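The arithmetic behind these two effects can be sketched in a few lines. The example below assumes a round 8-billion-parameter model (an illustrative figure, not a measurement) and applies the quadratic scaling rule described above; `weight_gigabytes` and `relative_multiplier_cost` are hypothetical helper names.

```python
# Sketch: weight memory and relative multiplier cost versus bitwidth.
# Assumptions: a round 8-billion-parameter model, and multiplier energy/area
# that grows with the square of the bitwidth, as described in the text.
PARAMS = 8_000_000_000

def weight_gigabytes(bits, params=PARAMS):
    """Memory needed to store the weights, in GB (10**9 bytes)."""
    return params * bits / 8 / 1e9

def relative_multiplier_cost(bits, baseline_bits=32):
    """Multiplier energy/area relative to the baseline, assuming n^2 scaling."""
    return (bits / baseline_bits) ** 2

for bits in (64, 32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gigabytes(bits):5.1f} GB of weights, "
          f"{relative_multiplier_cost(bits):.4f}x the multiplier cost of 32-bit")
```

Memory shrinks linearly with bitwidth, while the assumed multiplier cost shrinks with its square, which is why small reductions in precision can pay off disproportionately in silicon.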
The levels that typically qualify as low-precision are 8-, 4-, and 2-bit, says Srivatsa. There are two main angles that low-precision quantization has taken, says IBM Research distinguished engineer Raghu Ganti: One is reducing AI model weights to 4 bits while keeping activations at 16 bits, and the other, referred to as FP8, reduces both model weights and activations to 8 bits.
The redundancy of LLMs’ computations creates a lot of noise tolerance, making low precision suitable for them. The final output of any given operation, and the actions taken based on that output, are often the same whether you compute in full or reduced precision, says IBM Research scientist Viji Srinivasan, an expert on the hardware involved in low-precision computing. “You can go from FP64 or FP32 down to FP16, and if I do my computations using that precision, that bitwidth for representing a number, I will lose both precision and range,” she says. “But if my resulting actions are the same as if I computed in higher precision, then I’m okay.”
To put a finer point on it, if the operation a*b is computed in 64, 32, or 16 bits, the values may have minor differences and could be considered errors. The higher bit representations will be more precise and have a more representable range for whatever values a and b are. In that sense, low-precision computing does not lead to an identical result.
Srivatsa provides the following example of how changing precision can change your results:
- Adding three numbers as (a + b) + c can give one result in 16-bit precision, but a different result in 32- or 64-bit
- However, adding them as a + (b + c) can give the same result in 16-, 32-, and 64-bit precision
- The way it should work is (a + b) + c = a + (b + c), a mathematical property known as associativity. At any finite-precision representation of floating-point numbers, this property is not guaranteed. It is just more obvious and perceivable at lower precisions.
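The loss of associativity can be reproduced with Python's standard library alone. The sketch below rounds every intermediate sum to 16, 32, or 64 bits using the `struct` module's half-, single-, and double-precision codes. The values of `a`, `b`, and `c` are chosen for illustration and are not the numbers from Srivatsa's example; `add` and `results` are hypothetical names.

```python
import struct

def fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def add(rnd, x, y):
    """Add two numbers and round the sum to the target precision."""
    return rnd(rnd(x) + rnd(y))

a, b, c = 2048.0, 1.0, 1.0  # illustrative values, not the article's

results = {}
for name, rnd in (("16-bit", fp16), ("32-bit", fp32), ("64-bit", float)):
    left = add(rnd, add(rnd, a, b), c)    # (a + b) + c
    right = add(rnd, a, add(rnd, b, c))   # a + (b + c)
    results[name] = (left, right)
    print(f"{name}: (a + b) + c = {left}, a + (b + c) = {right}")
```

In 16 bits, 2049 cannot be represented, so each addition of 1 to 2048 is rounded away and the first grouping yields 2048, while the second grouping adds 2 at once and yields 2050. In 32 and 64 bits both groupings yield 2050.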
Fortunately, with AI, you aren’t counting on just one neuron to get it right, Srivatsa explains. “You have thousands of neurons, and if one makes a mistake, many other neurons will capture the signal.” There’s redundancy in the system, to the point that if some are incorrect or partially incorrect, low-precision computing still gives an overall correct answer. “It’s like a majority vote: As long as 70%, 80% of them are doing a good job, I’m good,” Srivatsa says.
Plus, the exact product of an operation isn’t what we’re after anyway — what we’re after is the reaction based on that output, Srinivasan explains. “If the reaction remains the same whether I do this operation in high or low precision, then I’m better off doing it in low precision,” she says.
When Srinivasan says “better off,” she’s talking about the processor requirements. By reducing the amount of computation required for a given task, you can pack more multipliers, or model weights, into the same area. In turn, that reduced footprint means low-precision hardware requires less energy per operation and can perform more operations per clock cycle.
And the results speak for themselves. Training in FP8 can speed up the process by as much as 50% without sacrificing quality, Ganti says. “I think we should start seeing evidence of training moving to FP8 by the end of this year across the industry,” he predicts. “And full-blown adoption will probably come by the end of the first half of 2025.”
For floating-point numbers, reducing precision to 16 or 8 bits introduces errors. No matter what, shrinking bitwidth makes it impossible to represent finer precision and larger range. But there are mechanisms in AI applications used during both model training and inference to compensate for that quantization loss.
One of these options is mixed-precision operations, wherein floating-point multiplication, which accounts for over 90% of operations during training or inferencing, is performed at lower precision, while accumulation (required for matrix multiplication) is performed at a higher precision. This is usually combined with pre- or post-scaling of the operands going into matrix multiplication to obtain better output accuracy. A second option is quantization-aware training, which compensates for the precision lost during quantization. This method, performed during pre-training or fine-tuning, teaches a model the compensation factor it will need to use when scaling for low precision, with both the model weights and the non-linear functions known as activation functions. Because it’s done during training, this prepares the model to correct the outputs being produced during inference.
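Why the accumulator's precision matters can be shown with a toy dot product. This is a minimal sketch, not a description of any particular hardware: it emulates FP16 and FP32 rounding with the standard `struct` module, performs every multiply in FP16, and lets the caller pick the accumulator precision. The function `dot` and the all-ones inputs are hypothetical choices made for illustration.

```python
import struct

def fp16(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot(xs, ys, accumulate):
    """Dot product with FP16 multiplies and a caller-chosen accumulator precision."""
    total = 0.0
    for x, y in zip(xs, ys):
        product = fp16(fp16(x) * fp16(y))    # low-precision multiply
        total = accumulate(total + product)  # running sum rounded to the accumulator format
    return total

xs = [1.0] * 4096
ys = [1.0] * 4096

low = dot(xs, ys, accumulate=fp16)    # FP16 multiply, FP16 accumulate
mixed = dot(xs, ys, accumulate=fp32)  # FP16 multiply, FP32 accumulate
print("FP16 accumulation: ", low)
print("FP32 accumulation: ", mixed)
```

With an FP16 accumulator the running sum stalls at 2048, because adding 1 no longer changes the stored value, while the FP32 accumulator reaches the exact answer of 4096 even though every multiply was done in 16 bits.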
Inferencing using AI models is a game of watts and dollars per million tokens, says Srivatsa. “So if we can shrink the silicon footprint of a given operation from 4mm square to 1mm square, and if we can make that chip draw just one quarter of the power, that’s a huge win.” Low-precision computing is also enabling developers to deploy some LLMs right on laptops and consumer-grade GPUs, Ganti says, because they require fewer compute resources.
And while most discussion of low-precision AI computing centers around model weights and activation values, a model’s key-value or KV cache can benefit from reduced precision, too. The KV cache stores the attention keys and values computed for the tokens an LLM has already processed, and it can quickly grow to require even more memory than the model itself; in many production-grade inference servers the KV cache may even be 8x the model size. And since sequence lengths are getting longer all the time, especially with AI agents that produce and critique their own outputs, storing KV caches in low precision has the potential to seriously shrink the memory footprint of LLMs, Srivatsa says. “This is a high-priority problem that needs to be solved,” he notes.
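A rough size estimate shows why the KV cache is such a tempting target. The sketch below assumes a transformer with 32 layers, 8 key-value heads, and a head dimension of 128, values that roughly match publicly described 8-billion-parameter models but are assumptions here rather than figures from this article. `kv_cache_bytes` is a hypothetical helper.

```python
# Sketch: KV cache size for one sequence at different precisions.
# Assumptions (not from the article): 32 layers, 8 KV heads, head dimension 128.
def kv_cache_bytes(seq_len, bits, layers=32, kv_heads=8, head_dim=128, batch=1):
    """Bytes needed to cache keys and values for `seq_len` tokens per sequence."""
    values_per_token = 2 * layers * kv_heads * head_dim  # 2 = one key + one value
    return batch * seq_len * values_per_token * bits // 8

for bits in (16, 8, 4):
    gigabytes = kv_cache_bytes(128_000, bits) / 1e9
    print(f"{bits:>2}-bit KV cache at 128,000 tokens: {gigabytes:.1f} GB per sequence")
```

Under these assumptions the cache grows linearly with both context length and the number of concurrent sequences (`batch`), so halving its bitwidth halves a cost that is multiplied across every request being served.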
So far, we’ve talked mostly about floating point numbers, but low-precision AI computing also involves fixed-point representations, or integers. And in some cases, low-precision computing can also mean converting floating-point values to fixed-point values — from FP16 or FP32 to INT8, for example. Whether using fixed- or floating-point representations, low-precision inferencing can still be run on AI models trained in full precision, Srivatsa says.
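One common way to convert floating-point values to INT8 is symmetric per-tensor quantization, sketched below. The article does not say which scheme IBM uses, so treat this as one illustrative technique; the weight values are made up, and `quantize_int8` and `dequantize` are hypothetical names.

```python
# Sketch: symmetric per-tensor INT8 quantization (an illustrative scheme,
# not necessarily the one used in the work described here).
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.003, 0.9, -0.55]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q)
print("scale:    ", scale)
print("max error:", max_error)
```

Each value is stored in 8 bits plus one shared scale, and the round-trip error is bounded by half the scale. The smallest weight here rounds to zero, which is the kind of loss that model redundancy, or further training, has to absorb.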
One potential shortcoming of low-precision inference, Srinivasan notes, is in cases where you’re running a model with certain computations that are so sensitive to precision that reducing bitwidth can alter the actionable result. “You would be in a bind,” she says. “But there are ways to cure it.” Returning to training or fine-tuning will let you adjust model weights to account for this.
Low-precision computing is changing not just how we use processors, but also how we build them. IBM Research is working on designing low-precision hardware built from the ground up with AI in mind. Low-precision computing can work on existing processors, Srinivasan explains, but it doesn’t have as much advantage. “If you want to get all the benefits of area reduction and energy reduction, then you want to build hardware capable of doing lower precision arithmetic.” If the hardware is built for higher precision, running it in low precision is still engaging high-precision multipliers, so it isn’t going to offer the benefits of packing more compute into the same footprint or drawing less energy. “The area and power cost will be incurred even if the computations did not require higher precision,” says Srinivasan.
Srinivasan and other IBM Research scientists have been working on hardware that natively supports low-precision computing in the AIU family, including NorthPole, Spyre, and the analog chip prototypes. The Spyre accelerator, for example, supports multiplication and accumulation operations in FP16, FP8, INT8, and INT4, helping it achieve higher compute densities than other hardware typically used for model training — more horsepower with the same footprint.
While low precision is an important part of IBM’s AI computing strategy, it is only one piece of the picture, Srinivasan points out. The AIU family of hardware also benefits from advancements like on-chip memory, which eliminates the classical von Neumann bottleneck between memory and computing that tends to bog down AI training and inference. Flexibility, too, is a crucial component of these devices, which can run models in multiple different precisions depending on the specifics of their training. It’s also possible to build full-precision capabilities into chips alongside low-precision, so that if a model isn’t resilient to approximation, it can still be run on the same hardware. Researchers are working on various blends of precision that will make sense for different applications in the future.
The future of low-precision computing, Srivatsa says, may also include even lower precision. He and others are working on 2-bit precision, which includes only -1, 0, and 1. While LLMs use multiplication, reducing the bitwidth in this way turns the operations into only addition and subtraction, which are computationally cheap. It’s only in the early stages, but this is a direction of future research to see just how far low precision can go.
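The claim that weights of -1, 0, and 1 remove the need for multiplication can be checked directly. This is a minimal sketch with made-up weights and activations; `ternary_dot` is a hypothetical function name, and real implementations would operate on packed bit representations rather than Python lists.

```python
# Sketch: a dot product with ternary weights needs only addition and subtraction.
def ternary_dot(weights, activations):
    """Dot product with weights restricted to -1, 0, or 1, using no multiplications."""
    total = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            total += x
        elif w == -1:
            total -= x
        # w == 0 contributes nothing, so the activation is skipped
    return total

weights = [1, 0, -1, 1, -1]                # made-up ternary weights
activations = [0.5, 2.0, 1.5, -0.25, 3.0]  # made-up activations

result = ternary_dot(weights, activations)
reference = sum(w * x for w, x in zip(weights, activations))
print(result, reference)
```

Both computations give the same answer, but the ternary version never multiplies, which is where the potential savings in energy and silicon would come from.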