𝐀𝐧𝐨𝐰π₯𝐞𝐝𝐠𝐞 𝐃𝐒𝐬𝐭𝐒π₯π₯𝐚𝐭𝐒𝐨𝐧 : The Art of Transferring Knowledge from Big LLMs to Smaller Ones.

Ali Issa
2 min read · Oct 2, 2023


When preparing a model for production, whether through training or fine-tuning, several critical questions must be addressed:

1. Inference Speed: How quickly the model can return results during inference. Faster models offer a better user experience, but achieving high inference speed may come at a cost elsewhere (accuracy, memory, or engineering effort).

2. Compute Budget: The computational resources available for serving the model constrain which optimizations are feasible, so determining this budget is crucial when preparing for production.

3. Model Performance vs. Resource Efficiency: Balancing the two usually involves trade-offs, such as sacrificing a little accuracy to gain faster inference or to reduce storage requirements.

Such optimization can be achieved through techniques like distillation, quantization, and pruning. In this post, we will focus on distillation.

𝐃𝐒𝐬𝐭𝐒π₯π₯𝐚𝐭𝐒𝐨𝐧 involves transferring knowledge from a larger teacher model to a smaller student model, allowing the latter to perform the same tasks as the larger model. Here’s how it’s typically done:


A) The teacher model (usually the one trained or fine-tuned in an earlier step) is frozen.

B) The student model learns to statistically mimic the teacher’s behavior. The mimicry signal can come from the final prediction layer (the logits) or from intermediate hidden layers, as in the sketch just below.
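As one concrete illustration of the hidden-layer variant, here is a minimal PyTorch sketch in the spirit of TinyBERT-style intermediate distillation. The dimensions, the learned projection, and the use of mean-squared error are assumptions for illustration, not the only way to do this:

```python
import torch
import torch.nn as nn

# Hypothetical hidden states: teacher width 768, student width 384.
teacher_hidden = torch.randn(8, 128, 768)                      # (batch, seq, d_teacher), from the frozen teacher
student_hidden = torch.randn(8, 128, 384, requires_grad=True)  # (batch, seq, d_student), from the student

# A learned projection maps the student's states into the teacher's space
# (only needed when the two hidden sizes differ).
proj = nn.Linear(384, 768)

# Push the projected student states toward the (detached) teacher states.
hidden_loss = nn.functional.mse_loss(proj(student_hidden), teacher_hidden.detach())
hidden_loss.backward()  # gradients flow into the student and the projection, never the teacher
```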

C) Training-data prompts used for fine-tuning the teacher model are passed to both the teacher and the student.

D) Knowledge Distillation: For each prompt from the training data, this process minimizes the difference between the teacher’s and the student’s completions. It is achieved through a loss function called the distillation loss, which compares the probability distributions the two models produce (a complete sketch appears after this walkthrough).

E) Temperature in Softmax: A temperature greater than 1 is applied to the softmax function in both models (the logits are divided by the temperature before the softmax is taken). Raising the temperature softens the output distribution, letting the student learn more of the teacher’s knowledge about how classes relate to each other and to the input data.

For example, if two classes have similar probabilities under a high temperature, it suggests that they are somewhat interchangeable or related in the context of the input data.
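Here is a tiny sketch of that effect, with invented logit values: dividing by a temperature T > 1 flattens the distribution, exposing which classes the teacher treats as related.

```python
import torch

def softmax_with_temperature(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Softmax over logits scaled by temperature T."""
    return torch.softmax(logits / T, dim=-1)

# Toy teacher logits for the classes [cat, dog, car].
logits = torch.tensor([4.0, 3.5, 0.5])

print(softmax_with_temperature(logits, T=1.0))  # ~[0.61, 0.37, 0.02] -- near one-hot
print(softmax_with_temperature(logits, T=4.0))  # ~[0.43, 0.38, 0.18] -- "cat" and "dog" now look related
```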

F) Output Labels: Several kinds of outputs are generated for each prompt:

1- Soft Labels: Teacher model’s output with a temperature greater than one in the softmax function, representing a probability distribution.

2- Soft Predictions: Student model’s output with a high-temperature softmax.

Distillation Loss: Computes the loss between the soft labels and the soft predictions.

3- Hard Predictions: The student’s probability distribution at temperature 1 (the standard, unmodified softmax). These are compared to the hard labels.

4- Hard Labels: Ground truth labels taken from the training dataset, used to train the student model to generate answers similar to the actual data.

Student Loss: Measures the difference between hard predictions and hard labels.

The distillation loss and the student loss are combined to update the student’s weights via backpropagation, as in the sketch below.
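Putting the pieces together, below is a minimal, self-contained PyTorch sketch of one distillation training step. The toy linear models, the temperature T, and the weighting factor alpha are assumptions for illustration; the T² factor on the distillation loss is the customary scaling that keeps its gradient magnitude comparable to the student loss.

```python
import torch
import torch.nn.functional as F

T = 4.0      # temperature > 1 for soft labels / soft predictions (assumed value)
alpha = 0.5  # weight between distillation loss and student loss (assumed value)

# Stand-ins for the real models: any modules producing class logits would do.
teacher = torch.nn.Linear(16, 10)
student = torch.nn.Linear(16, 10)
teacher.requires_grad_(False)              # A) the teacher is frozen

x = torch.randn(32, 16)                    # a batch of toy inputs
hard_labels = torch.randint(0, 10, (32,))  # ground-truth labels from the dataset

teacher_logits = teacher(x)
student_logits = student(x)

# Soft labels (teacher) and soft predictions (student): temperature-scaled.
soft_labels = F.softmax(teacher_logits / T, dim=-1)
soft_pred_log = F.log_softmax(student_logits / T, dim=-1)

# Distillation loss: KL divergence between the softened distributions.
distill_loss = F.kl_div(soft_pred_log, soft_labels, reduction="batchmean") * T**2

# Student loss: ordinary cross-entropy at T = 1 against the hard labels.
student_loss = F.cross_entropy(student_logits, hard_labels)

loss = alpha * distill_loss + (1 - alpha) * student_loss
loss.backward()  # only the student's weights receive gradients
```

In practice, alpha and T are hyperparameters tuned per task; a higher T spreads more of the teacher’s relational knowledge into the soft labels.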

Note: distillation tends to work well with encoder-only models such as BERT, and is harder to apply to autoregressive (decoder-only) models.

Please don’t hesitate to follow me for additional insights like these.
