➡ MQA addresses a common challenge faced by models with large context sizes during inference. Typically, increasing the context size leads to higher computational costs.
➡ Usually, incremental generation is used during inference. In this approach, tokens are fed into the network one at a time, and the K (keys) and V (values) are computed across all tokens observed so far. However, this method runs into trouble with lengthy inputs, since the per-step cost grows with the number of tokens already generated.
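The idea can be sketched in a few lines. This is an illustrative toy (names like `attend` and the dimensions are my own, not from any library): each new token attends over the K/V of every token seen so far, so the state grows with each step.

```python
import numpy as np

d = 8                                       # head dimension (illustrative)
rng = np.random.default_rng(0)

def attend(q, K, V):
    """Single-query attention over all keys/values seen so far."""
    scores = q @ K.T / np.sqrt(d)           # one score per past token
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                            # weighted sum of values, shape (d,)

tokens = rng.normal(size=(5, d))            # stand-in embeddings of 5 tokens
K_cache, V_cache, outputs = [], [], []
for x in tokens:                            # incremental generation: one token at a time
    K_cache.append(x)                       # K and V grow with every step,
    V_cache.append(x)                       # so per-step attention cost grows too
    outputs.append(attend(x, np.array(K_cache), np.array(V_cache)))

print(len(K_cache))                         # -> 5: one K/V entry per token so far
```

With a long context, this ever-growing K/V state is exactly what dominates memory traffic during decoding.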
➡ To improve latency and reduce computational overhead, various solutions have been introduced. Some are inference-time techniques that leave the model's architecture unchanged, such as KV caching (maintaining the computed keys and values across iterations) or batching multiple sequences together during inference.
➡ On the other hand, architecture-focused solutions such as MQA emerged in 2019. The technique has been adopted by several LLMs, including LLaMA 2, StarCoder (a model trained on over 80 programming languages), and Falcon.
➡ MQA brings significant improvements in throughput, allowing the system to process more data within the same timeframe, while reducing latency for faster response times. The primary objective is to reduce computation during inference.
➡ MQA vs MHA
In traditional Multi-Head Attention (MHA), the Q (query), K (key), and V (value) projections are each divided into multiple vectors, one per head, and each head performs the same attention procedure independently.
➡ In MQA, however, we no longer use multiple heads for the keys and values. We keep multiple heads only for the queries, and every query head attends to a single shared K and V, eliminating the need to split keys and values into per-head vectors.
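The difference between the two layouts can be sketched with plain NumPy. This is a minimal shape-level illustration (the "projections" here are stand-ins, not learned weights): MHA keeps per-head K/V tensors, while MQA shares a single K/V head across all query heads.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, seq = 4, 8, 6
d_model = n_heads * d_head
x = rng.normal(size=(seq, d_model))         # stand-in hidden states

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# --- MHA: Q, K and V are all split into per-head vectors ---
Q = x.reshape(seq, n_heads, d_head).transpose(1, 0, 2)   # (heads, seq, d_head)
K_mha, V_mha = Q.copy(), Q.copy()                        # stand-in projections
out_mha = softmax(Q @ K_mha.transpose(0, 2, 1) / np.sqrt(d_head)) @ V_mha

# --- MQA: per-head queries, but ONE shared K and V head ---
K_mqa = x[:, :d_head]                                    # (seq, d_head), single head
V_mqa = x[:, :d_head]
out_mqa = softmax(Q @ K_mqa.T / np.sqrt(d_head)) @ V_mqa # shared K/V broadcast over heads

# The K/V state shrinks by a factor of n_heads under MQA:
print(K_mha.size // K_mqa.size)                          # -> 4
```

Note that the query side is identical in both cases; only the K/V side collapses from `n_heads` copies to one.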
The key distinction between the two techniques is that MQA reads and writes far less data from memory.
➡ This has noteworthy implications for performance, particularly an increase in arithmetic intensity, which measures the degree to which data values loaded from memory are reused for computation.
➡ This reuse is particularly evident in MQA, where the shared K and V are reused across all query heads, whereas in MHA each head loads its own K and V for its calculations. Additionally, MQA reduces memory usage by shrinking the amount of KV-cache data stored between iterations of the inference process.
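A back-of-the-envelope calculation makes the KV-cache saving concrete. The model dimensions below are illustrative (loosely LLaMA-scale), not taken from any specific paper:

```python
# Illustrative KV-cache sizing, assuming fp16 (2 bytes per value)
n_layers, n_heads, d_head, seq, batch, bytes_fp16 = 32, 32, 128, 4096, 8, 2

def kv_cache_bytes(kv_heads):
    # 2x for K and V, stored per layer, per token, per batch element
    return 2 * n_layers * kv_heads * d_head * seq * batch * bytes_fp16

mha = kv_cache_bytes(n_heads)   # every head keeps its own K/V
mqa = kv_cache_bytes(1)         # one shared K/V head
print(mha // mqa)               # -> 32: the cache shrinks by a factor of n_heads
```

With these numbers, the MHA cache is ~137 GB versus ~4.3 GB for MQA, which is why the technique matters so much for long-context serving.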
These insights are based on a comprehensive Medium article; you can find the link at the end of this post, along with the official MQA paper reference.
Furthermore, if anyone wishes to contribute additional information or make modifications to what I have mentioned, please don't hesitate to do so in the comments section. Your input is welcome and appreciated.
Medium article
Official paper: https://arxiv.org/pdf/1911.02150v1.pdf