What's Grouped-Query Attention (GQA)? A paper from Google Research
During autoregressive decoding with Transformer models, a major bottleneck is memory bandwidth: at every decoding step the model must load the decoder weights as well as all cached attention keys and values.
One way to reduce the overhead of loading keys and values is multi-query attention (MQA), which uses multiple query heads but only a single key/value head.
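To get a feel for why this matters, here is a minimal back-of-the-envelope sketch (not from the paper; the head counts, head dimension, and fp16 byte size below are illustrative assumptions) comparing the per-token, per-layer KV-cache size that must be read back at each decoding step:

```python
# Rough KV-cache size per token per layer:
# 2 (K and V) * num_kv_heads * head_dim * bytes_per_value.
# All dimensions below are illustrative assumptions, not figures from the GQA paper.

def kv_cache_bytes_per_token(num_kv_heads, head_dim=128, bytes_per_value=2):
    """Bytes of K/V cache stored (and re-read) per token per layer."""
    return 2 * num_kv_heads * head_dim * bytes_per_value

# Hypothetical model with 32 query heads:
print(kv_cache_bytes_per_token(num_kv_heads=32))  # MHA:        16384 bytes
print(kv_cache_bytes_per_token(num_kv_heads=1))   # MQA:          512 bytes
print(kv_cache_bytes_per_token(num_kv_heads=8))   # GQA (G=8):   4096 bytes
```

The cache (and therefore the bandwidth spent reloading it every step) scales with the number of key/value heads, which is exactly what MQA and GQA shrink.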
However, as the researchers note in the GQA paper, MQA has certain drawbacks: it can lead to quality degradation and training instability. Consequently, training distinct models optimised separately for quality and for inference is not a practical solution, as stated in the paper.
After all, the primary goal of MQA is to accelerate inference, so modifying the entire model architecture and training recipe just for that purpose is impractical.
The paper discusses two key concepts:
1) Uptraining of existing MHA checkpoints: The researchers propose a method to convert a pre-trained checkpoint that uses multi-head attention (MHA) into one that uses multi-query attention (MQA). The multiple key and value heads of the original MHA checkpoint are combined into a single key head and a single value head by mean pooling. This was found to work better than randomly initializing the key and value heads or selecting one head from the MHA checkpoint (a small sketch of this conversion appears after this list).
The converted checkpoint is then further pre-trained ("uptrained"), but only for a small fraction (α) of the original number of training steps, while following the same pre-training recipe used for the initial model (same data, learning rate schedule, optimizer, and other hyperparameters).
2) GQA (Grouped-Query Attention): The second method, GQA, strikes a balance between MHA and MQA. GQA partitions the query heads into G groups, with each group sharing a single key head and a single value head.
When converting a multi-head checkpoint to a GQA checkpoint, the key and value heads for each group are constructed by mean-pooling the original heads within that group. This provides a trade-off between the speed of MQA and the quality of MHA (see the attention sketch after this list).
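As a rough illustration of the checkpoint conversion described above, here is a minimal sketch in PyTorch (with hypothetical tensor shapes and function names, not the paper's actual code) of mean-pooling MHA key/value heads into groups. Setting num_groups=1 gives the MQA case, and num_groups equal to the number of heads recovers MHA:

```python
import torch

def mean_pool_kv_heads(kv: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Mean-pool MHA key (or value) heads into `num_groups` grouped heads.

    kv: tensor of shape (num_heads, head_dim, d_model) -- a hypothetical layout
        of the per-head K or V projection weights from an MHA checkpoint.
    Returns a tensor of shape (num_groups, head_dim, d_model).
    """
    num_heads = kv.shape[0]
    assert num_heads % num_groups == 0, "heads must divide evenly into groups"
    group_size = num_heads // num_groups
    # Split the heads into groups and average within each group (mean pooling).
    return kv.reshape(num_groups, group_size, *kv.shape[1:]).mean(dim=1)

# Example: 32 MHA key heads -> 8 GQA key heads, or 1 MQA key head.
mha_k = torch.randn(32, 128, 4096)
gqa_k = mean_pool_kv_heads(mha_k, num_groups=8)  # shape (8, 128, 4096)
mqa_k = mean_pool_kv_heads(mha_k, num_groups=1)  # shape (1, 128, 4096)
```

After this pooling step, the converted checkpoint still needs the short uptraining phase described above so the model can adapt to the new attention structure.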
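And here is a similarly hedged sketch of how the grouped attention itself can be computed, with each group of query heads attending over one shared key/value head (again, shapes and names are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, num_q_heads, seq_len, head_dim)
       k, v: (batch, num_kv_heads, seq_len, head_dim),
       with num_q_heads divisible by num_kv_heads (num_kv_heads = G groups)."""
    num_q_heads, num_kv_heads = q.shape[1], k.shape[1]
    group_size = num_q_heads // num_kv_heads
    # Each shared K/V head serves `group_size` query heads: expand along the head axis.
    # (A naive expansion for clarity; real implementations avoid materializing the copies.)
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # scaled dot-product attention
    return F.softmax(scores, dim=-1) @ v

# Example: 32 query heads grouped over 8 shared K/V heads (G = 8).
q = torch.randn(1, 32, 16, 128)
k = torch.randn(1, 8, 16, 128)
v = torch.randn(1, 8, 16, 128)
out = grouped_query_attention(q, k, v)  # shape (1, 32, 16, 128)
```

Only the 8 shared key/value heads need to be cached and reloaded during decoding, which is where the bandwidth savings over full MHA come from.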
For a visual comparison, see the image below, taken from the official GQA paper.
For more interesting information like this, don't hesitate to follow me! :)