Enhancing Large Language Models: The Power of Continued Pre-Training
What is continued pre-training?
Continued pre-training starts from the weights of a model that has already been pre-trained on other data, rather than from random initialization. This lets us build on the knowledge captured during the initial pre-training instead of starting over, making it an efficient way to keep models current and to adapt them to new domains without retraining from scratch. That matters for folding the latest information and insights into our models.
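For a concrete picture, here is a minimal sketch of continued pre-training using Hugging Face transformers and datasets. The checkpoint name ("gpt2"), the file "new_domain_corpus.txt", and all hyperparameters are illustrative placeholders, not details from the post or the paper.

```python
# Minimal sketch: start from existing pre-trained weights, keep the causal-LM
# objective, and train further on a new-domain corpus (all names are placeholders).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "gpt2"  # any already pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)  # reuse learned weights, not random init

# The new-domain corpus (D2) replaces the original pre-training data.
new_data = load_dataset("text", data_files={"train": "new_domain_corpus.txt"})["train"]
tokenized = new_data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-pretrain", num_train_epochs=1,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # continues pre-training on the new data
```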
Experiment
A recent paper highlighted three different approaches (sketched in code after this list):
1. Regular pretraining on an initial dataset (D1).
2. Continued pretraining by further training an already pretrained model on a new dataset (D2).
3. Retraining from scratch on the combined datasets (D1 + D2).
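To make the three regimes concrete, here is a toy PyTorch sketch. The pretrain helper, the tiny linear model, and the random tensors standing in for D1 and D2 are hypothetical simplifications, not the paper's actual setup.

```python
import copy
import torch
import torch.nn as nn

def pretrain(model: nn.Module, data: torch.Tensor, epochs: int = 1) -> nn.Module:
    """Toy stand-in for a pre-training loop (regression on random vectors)."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for batch in data.split(32):
            loss = ((model(batch) - batch) ** 2).mean()  # placeholder objective
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

D1 = torch.randn(1024, 16)  # stand-in for the original corpus
D2 = torch.randn(1024, 16)  # stand-in for the new corpus

# 1. Regular pre-training: random init, trained on D1 only.
model_1 = pretrain(nn.Linear(16, 16), D1)

# 2. Continued pre-training: start from model_1's weights, train further on D2.
model_2 = pretrain(copy.deepcopy(model_1), D2)

# 3. Retraining from scratch: random init, trained on D1 + D2 combined.
model_3 = pretrain(nn.Linear(16, 16), torch.cat([D1, D2]))
```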
Key findings
- Continued pre-training is 2× cheaper than full retraining while achieving comparable results.
- Adding back a small fraction (as little as 0.5-5%) of the original data helps prevent catastrophic forgetting, keeping past knowledge intact.
- Re-warming and re-decaying the learning rate: to maintain performance during continued pre-training, the learning rate is reintroduced with a warmup phase and then decayed, mirroring the original pre-training schedule. This keeps training stable and lets the model adapt efficiently to the new data (see the sketch after this list).
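Two of these findings translate directly into code. Below is a hedged PyTorch sketch of (a) a warmup-then-cosine learning-rate schedule for the continued phase and (b) mixing roughly 5% of the original corpus back into the new one. The step counts, base learning rate, and the mix_corpora helper are illustrative assumptions, not values from the paper.

```python
import math
import random
import torch

# (a) Re-warm, then re-decay the learning rate (step counts and LR are illustrative).
model = torch.nn.Linear(16, 16)  # placeholder for the model being continued
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
warmup_steps, total_steps, min_lr_ratio = 1_000, 100_000, 0.1

def warmup_then_cosine(step: int) -> float:
    """LR multiplier: linear re-warmup, then cosine re-decay toward min_lr_ratio."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr_ratio + 0.5 * (1 - min_lr_ratio) * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_cosine)
# Inside the training loop: optimizer.step(); scheduler.step()

# (b) Replay a small fraction of the original data to curb catastrophic forgetting.
def mix_corpora(new_docs: list[str], old_docs: list[str], replay_frac: float = 0.05) -> list[str]:
    """Blend roughly replay_frac (relative to the new set) of old documents into the mix."""
    n_old = min(int(replay_frac * len(new_docs)), len(old_docs))
    mixed = new_docs + random.sample(old_docs, n_old)
    random.shuffle(mixed)
    return mixed
```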
This summary was inspired by Sebastian Raschka, PhD, whose newsletter offers valuable insights presented in a clear, organised way. He also recently published a book titled Build a Large Language Model (From Scratch).
There's plenty to explore and learn, but taking small steps each day to absorb a bit of information can lead to significant progress over weeks and months.
If you like what you see, hit the follow button! You can also find me on LinkedIn, and we can follow each other there too.