UCL Discovery

Mixture of Attentions For Speculative Decoding

Zimmer, M; Gritta, M; Lampouras, G; Ammar, HB; Wang, J; (2025) Mixture of Attentions For Speculative Decoding. In: 13th International Conference on Learning Representations (ICLR 2025). ICLR: Singapore. Green open access

PDF: 11328_Mixture_of_Attentions_Fo.pdf - Accepted Version (454kB)

Abstract

The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest decoding speeds. However, we identify several limitations of SD models, including a lack of on-policyness during training and partial observability. To address these shortcomings, we propose a more grounded architecture for small models by introducing a Mixture of Attentions for SD. Our novel architecture can be applied in two scenarios: a conventional single-device deployment, and a novel client-server deployment where the small model is hosted on a consumer device and the LLM on a server. In the single-device scenario, we demonstrate state-of-the-art speedups, improving EAGLE-2's decoding speed by 9.5% and its acceptance length by 25%. In the client-server setting, our experiments demonstrate: 1) state-of-the-art latencies with minimal calls to the server across different network conditions, and 2) that, in the event of a complete disconnection, our approach maintains higher accuracy than other SD methods and, unlike API calls to LLMs, can continue the generation process.
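The draft-and-verify mechanism the abstract builds on can be illustrated with a short sketch. The loop below is a generic greedy speculative-decoding baseline, not the paper's Mixture-of-Attentions architecture; `draft_model` and `target_model` are hypothetical Hugging Face-style causal LMs whose forward pass returns an object with a `.logits` field, and KV caching is omitted for brevity.

import torch

@torch.no_grad()
def speculative_decode(target_model, draft_model, input_ids,
                       k=4, max_new_tokens=64):
    """Greedy speculative decoding sketch: the small model drafts k
    tokens, the LLM verifies them in one parallel forward pass, and
    the longest agreeing prefix is kept."""
    ids = input_ids
    start_len = input_ids.shape[1]
    while ids.shape[1] < start_len + max_new_tokens:
        # 1) The small model drafts k future tokens autoregressively.
        draft = ids
        for _ in range(k):
            next_tok = draft_model(draft).logits[:, -1, :].argmax(-1, keepdim=True)
            draft = torch.cat([draft, next_tok], dim=1)
        proposed = draft[:, ids.shape[1]:]                           # (1, k)

        # 2) The LLM scores all k proposals in a single parallel pass.
        tgt_logits = target_model(draft).logits
        # Greedy target choice at every draft position, plus one bonus
        # token after the last proposal.
        tgt_tokens = tgt_logits[:, ids.shape[1] - 1:, :].argmax(-1)  # (1, k+1)

        # 3) Accept the longest prefix on which draft and target agree
        #    (the "acceptance length" the abstract reports).
        matches = (proposed == tgt_tokens[:, :k]).squeeze(0)
        n_accept = int(matches.int().cumprod(0).sum())

        # 4) Keep the accepted tokens plus the target's next token: a
        #    correction, or the bonus token when all k are accepted.
        ids = torch.cat(
            [ids, proposed[:, :n_accept], tgt_tokens[:, n_accept:n_accept + 1]],
            dim=1,
        )
    return ids

Because verification costs one LLM forward pass regardless of how many drafted tokens are accepted, a longer agreeing prefix directly reduces the number of LLM calls, which is why the 25% acceptance-length improvement reported above translates into faster decoding.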

Type: Proceedings paper
Title: Mixture of Attentions For Speculative Decoding
Event: ICLR 2025
Open access status: An open access version is available from UCL Discovery
Publisher version: https://openreview.net/forum?id=Rz0kozh3LE
Language: English
Additional information: This version is the version of record. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords: large language models, speculative decoding, EAGLE
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10212512
