Enhancing Large Language Models: The Power of Continued Pre-Training
What is continued pre-training?
Continued pre-training starts from the weights of a model that has already been pre-trained on other data, rather than from random initialization. This lets us build on the knowledge captured during the initial pre-training instead of starting over, making it an efficient way to keep models current and to adapt them to new domains without retraining from scratch. That matters for folding the latest information and insights into our models.
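For a concrete picture, here is a minimal sketch of continued pre-training using Hugging Face transformers and datasets. The checkpoint name ("gpt2"), the file "new_domain_corpus.txt", and all hyperparameters are illustrative placeholders, not details from the post or the paper.

```python
# Minimal sketch: start from existing pre-trained weights, keep the causal-LM
# objective, and train further on a new-domain corpus (all names are placeholders).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "gpt2"  # any already pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)  # reuse learned weights, not random init

# The new-domain corpus (D2) replaces the original pre-training data.
new_data = load_dataset("text", data_files={"train": "new_domain_corpus.txt"})["train"]
tokenized = new_data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-pretrain", num_train_epochs=1,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # continues pre-training on the new data
```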
Experiment
A recent paper highlighted three different approaches (sketched in code after this list):
1. Regular pretraining on an initial dataset (D1).
2. Continued pretraining by further training an already pretrained model on a new dataset (D2).
3. Retraining from scratch on the combined datasets (D1 + D2).
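To make the three regimes concrete, here is a toy PyTorch sketch. The pretrain helper, the tiny linear model, and the random tensors standing in for D1 and D2 are hypothetical simplifications, not the paper's actual setup.

```python
import copy
import torch
import torch.nn as nn

def pretrain(model: nn.Module, data: torch.Tensor, epochs: int = 1) -> nn.Module:
    """Toy stand-in for a pre-training loop (regression on random vectors)."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for batch in data.split(32):
            loss = ((model(batch) - batch) ** 2).mean()  # placeholder objective
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

D1 = torch.randn(1024, 16)  # stand-in for the original corpus
D2 = torch.randn(1024, 16)  # stand-in for the new corpus

# 1. Regular pre-training: random init, trained on D1 only.
model_1 = pretrain(nn.Linear(16, 16), D1)

# 2. Continued pre-training: start from model_1's weights, train further on D2.
model_2 = pretrain(copy.deepcopy(model_1), D2)

# 3. Retraining from scratch: random init, trained on D1 + D2 combined.
model_3 = pretrain(nn.Linear(16, 16), torch.cat([D1, D2]))
```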
Key findings
- Continued pre-training is 2× cheaper than full retraining while achieving comparable results.
- Adding back a small fraction (as little as 0.5-5%) of the original data helps prevent catastrophic forgetting, keeping past knowledge intact.
- Re-warming and re-decaying the learning rate: to maintain performance during continued pre-training, the learning rate is reintroduced with a warmup phase and then decayed, mirroring the original pre-training schedule. This keeps training stable and lets the model adapt efficiently to the new data (see the sketch after this list).
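Two of these findings translate directly into code. Below is a hedged PyTorch sketch of (a) a warmup-then-cosine learning-rate schedule for the continued phase and (b) mixing roughly 5% of the original corpus back into the new one. The step counts, base learning rate, and the mix_corpora helper are illustrative assumptions, not values from the paper.

```python
import math
import random
import torch

# (a) Re-warm, then re-decay the learning rate (step counts and LR are illustrative).
model = torch.nn.Linear(16, 16)  # placeholder for the model being continued
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
warmup_steps, total_steps, min_lr_ratio = 1_000, 100_000, 0.1

def warmup_then_cosine(step: int) -> float:
    """LR multiplier: linear re-warmup, then cosine re-decay toward min_lr_ratio."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr_ratio + 0.5 * (1 - min_lr_ratio) * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_cosine)
# Inside the training loop: optimizer.step(); scheduler.step()

# (b) Replay a small fraction of the original data to curb catastrophic forgetting.
def mix_corpora(new_docs: list[str], old_docs: list[str], replay_frac: float = 0.05) -> list[str]:
    """Blend roughly replay_frac (relative to the new set) of old documents into the mix."""
    n_old = min(int(replay_frac * len(new_docs)), len(old_docs))
    mixed = new_docs + random.sample(old_docs, n_old)
    random.shuffle(mixed)
    return mixed
```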
This summary was inspired by Sebastian Raschka, PhD, whose newsletter offers valuable insights presented in a clear, organised way. He also recently published a book titled Build a Large Language Model (From Scratch).
There's plenty to explore and learn, but taking small steps each day to absorb a bit of information can lead to significant progress over weeks and months.
If you like what you see, hit the follow button! You can also find me on LinkedIn, and we can follow each other there too.