MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

UDC.coleccion: Research
UDC.conferenceTitle: ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming - PPoPP '25
UDC.endPage: 251
UDC.institutoCentro: CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación
UDC.startPage: 239
UDC.volume: 2025
dc.contributor.author: Frantar, Elias
dc.contributor.author: Castro, Roberto L.
dc.contributor.author: Chen, Jiale
dc.contributor.author: Hoefler, Torsten
dc.contributor.author: Alistarh, Dan
dc.date.accessioned: 2025-04-21T15:33:44Z
dc.date.available: 2025-04-21T15:33:44Z
dc.date.issued: 2025-02
dc.description: Presented at PPoPP '25: The 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Las Vegas, NV, USA, March 1–5, 2025.
dc.description.abstract: As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, model weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains a key open question whether speedups are also achievable in batched settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound while supporting the substantially increased compute requirements of batched workloads. In this paper, we resolve this question positively by introducing a new design for Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batch sizes up to 16-32 can be practically supported with close to the maximum (4×) quantization speedup, and larger batch sizes up to 64-128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN's near-optimal performance on individual LLM layers across different scenarios can also lead to significant end-to-end LLM inference speedups (of up to 2.8×) when integrated with the popular vLLM open-source serving engine. Finally, we show that MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.
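The abstract's premise — that 4-bit weight quantization cuts weight memory traffic roughly 4× versus 16-bit storage, which sets the (up to) 4× speedup ceiling for memory-bound inference — can be illustrated with a minimal, self-contained sketch. This is an illustrative numpy example of the general pack/unpack-and-rescale idea only, not MARLIN's CUDA implementation; the group size, zero-point convention, and helper names here are assumptions for illustration.

```python
import numpy as np

def pack_int4(q):
    # Pack pairs of 4-bit values (0..15) into single bytes: two weights per byte.
    q = q.astype(np.uint8)
    return ((q[..., 1::2] << 4) | q[..., 0::2]).astype(np.uint8)

def unpack_int4(packed):
    # Recover the interleaved 4-bit values from the packed bytes.
    lo = packed & 0x0F
    hi = packed >> 4
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = lo
    out[..., 1::2] = hi
    return out

# Quantize one 16-weight group with a single shared scale (zero-point 8).
rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
scale = np.abs(w).max() / 7.0                 # map weights into [-7, 7]
q = np.clip(np.round(w / scale) + 8, 0, 15)   # shift to unsigned [0, 15]
packed = pack_int4(q.reshape(1, -1))          # 16 weights -> 8 bytes (vs 32 bytes in FP16)
w_hat = (unpack_int4(packed).astype(np.float32) - 8) * scale  # dequantize
```

In a memory-bound GEMV, the kernel streams `packed` (plus one scale per group) instead of full-precision weights, so weight traffic drops by about 4×; reconstruction error per weight is bounded by half the scale.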
dc.description.sponsorship: This research was supported in part by generous grants from NVIDIA and Google.
dc.identifier.citation: Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh. 2025. MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP '25). Association for Computing Machinery, New York, NY, USA, 239–251. https://doi.org/10.1145/3710848.3710871
dc.identifier.doi: 10.1145/3710848.3710871
dc.identifier.isbn: 979-8-4007-1443-6
dc.identifier.uri: http://hdl.handle.net/2183/41820
dc.language.iso: eng
dc.publisher: Association for Computing Machinery
dc.relation.uri: https://doi.org/10.1145/3710848.3710871
dc.rights: Attribution 4.0 International
dc.rights.accessRights: open access
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/es/
dc.subject: Large language model (LLM) inference
dc.subject: GPU programming
dc.subject: Batch parallelism
dc.title: MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
dc.type: conference output
dc.type.hasVersion: VoR
dspace.entity.type: Publication
relation.isAuthorOfPublication: 9dbced89-f8fe-43fb-8b3d-cca5da284d32
relation.isAuthorOfPublication.latestForDiscovery: 9dbced89-f8fe-43fb-8b3d-cca5da284d32

Files

Original bundle

Name: Castro_RobertoL_2025_MARLIN.pdf
Size: 953.75 KB
Format: Adobe Portable Document Format