➡ MQA addresses a common challenge faced by models with large context sizes during inference. Typically, increasing the context size leads to higher computational costs.
➡ Usually, incremental generation is used during inference. In this approach, tokens are fed into the network one at a time, and the K (keys) and V (values) are computed across all tokens observed so far. However, this method runs into trouble with lengthy inputs, since the per-step cost grows with the number of tokens already generated.
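The idea can be sketched in a few lines. This is an illustrative toy (names like `attend` and the dimensions are my own, not from any library): each new token attends over the K/V of every token seen so far, so the state grows with each step.

```python
import numpy as np

d = 8                                       # head dimension (illustrative)
rng = np.random.default_rng(0)

def attend(q, K, V):
    """Single-query attention over all keys/values seen so far."""
    scores = q @ K.T / np.sqrt(d)           # one score per past token
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                            # weighted sum of values, shape (d,)

tokens = rng.normal(size=(5, d))            # stand-in embeddings of 5 tokens
K_cache, V_cache, outputs = [], [], []
for x in tokens:                            # incremental generation: one token at a time
    K_cache.append(x)                       # K and V grow with every step,
    V_cache.append(x)                       # so per-step attention cost grows too
    outputs.append(attend(x, np.array(K_cache), np.array(V_cache)))

print(len(K_cache))                         # -> 5: one K/V entry per token so far
```

With a long context, this ever-growing K/V state is exactly what dominates memory traffic during decoding.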
➡ To improve latency and reduce computational overhead, various solutions have been introduced. Some are inference-time techniques that leave the model's architecture unchanged, such as KV caching (maintaining the computed keys and values across iterations) or batching multiple sequences together during inference.
➡ On the other hand, architecture-focused solutions such as MQA emerged in 2019. The technique has been adopted by several LLMs, including LLaMA 2, StarCoder (a model trained on over 80 programming languages), and Falcon.
➡ MQA brings significant improvements in throughput, allowing the system to process more data within the same timeframe, while reducing latency for faster response times. The primary objective is to reduce computation during inference.
➡ MQA vs MHA
In traditional Multi-Head Attention (MHA), the Q (query), K (key), and V (value) projections are each divided into multiple vectors, one per head, and each head performs the same attention procedure independently.
➡ In MQA, however, we no longer use multiple heads for the keys and values. We keep multiple heads only for the queries, and every query head attends to a single shared K and V, eliminating the need to split keys and values into per-head vectors.
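The difference between the two layouts can be sketched with plain NumPy. This is a minimal shape-level illustration (the "projections" here are stand-ins, not learned weights): MHA keeps per-head K/V tensors, while MQA shares a single K/V head across all query heads.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, seq = 4, 8, 6
d_model = n_heads * d_head
x = rng.normal(size=(seq, d_model))         # stand-in hidden states

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# --- MHA: Q, K and V are all split into per-head vectors ---
Q = x.reshape(seq, n_heads, d_head).transpose(1, 0, 2)   # (heads, seq, d_head)
K_mha, V_mha = Q.copy(), Q.copy()                        # stand-in projections
out_mha = softmax(Q @ K_mha.transpose(0, 2, 1) / np.sqrt(d_head)) @ V_mha

# --- MQA: per-head queries, but ONE shared K and V head ---
K_mqa = x[:, :d_head]                                    # (seq, d_head), single head
V_mqa = x[:, :d_head]
out_mqa = softmax(Q @ K_mqa.T / np.sqrt(d_head)) @ V_mqa # shared K/V broadcast over heads

# The K/V state shrinks by a factor of n_heads under MQA:
print(K_mha.size // K_mqa.size)                          # -> 4
```

Note that the query side is identical in both cases; only the K/V side collapses from `n_heads` copies to one.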
The key distinction between the two techniques is that MQA reads and writes far less data from memory.
➡ This has noteworthy implications for performance, particularly an increase in arithmetic intensity, which measures the degree to which data values loaded from memory are reused for computation.
➡ This reuse is particularly evident in MQA, where the shared K and V are reused across all query heads, whereas in MHA each head loads its own K and V for its calculations. Additionally, MQA reduces memory usage by shrinking the amount of KV-cache data stored between iterations of the inference process.
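A back-of-the-envelope calculation makes the KV-cache saving concrete. The model dimensions below are illustrative (loosely LLaMA-scale), not taken from any specific paper:

```python
# Illustrative KV-cache sizing, assuming fp16 (2 bytes per value)
n_layers, n_heads, d_head, seq, batch, bytes_fp16 = 32, 32, 128, 4096, 8, 2

def kv_cache_bytes(kv_heads):
    # 2x for K and V, stored per layer, per token, per batch element
    return 2 * n_layers * kv_heads * d_head * seq * batch * bytes_fp16

mha = kv_cache_bytes(n_heads)   # every head keeps its own K/V
mqa = kv_cache_bytes(1)         # one shared K/V head
print(mha // mqa)               # -> 32: the cache shrinks by a factor of n_heads
```

With these numbers, the MHA cache is ~137 GB versus ~4.3 GB for MQA, which is why the technique matters so much for long-context serving.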
These insights are based on a comprehensive Medium article; you can find the link at the end of this post, along with the official MQA paper reference.
Furthermore, if anyone wishes to contribute additional information or make modifications to what I have mentioned, please don't hesitate to do so in the comments section. Your input is welcome and appreciated.
Medium article
Official paper: https://arxiv.org/pdf/1911.02150v1.pdf