Reproducing GPT-2 124M: Key Insights from Andrej Karpathy’s 4-Hour Deep Dive

Ali Issa
6 min read · Sep 10, 2024


This is a summary of Andrej Karpathy’s video about pre-training a GPT-2 124M parameter model from scratch. Feel free to check it out using this link.

General notes

  • Andrej Karpathy noted that the code published by OpenAI for GPT-2 was not designed for training and was written in TensorFlow. Preferring PyTorch, he utilized the Hugging Face implementation, available on GitHub.
  • GPT-2 is a modified version of the original Transformer, distinguished by the absence of an encoder and the encoder-decoder multi-head attention mechanism.

Architectural notes

  • Reshuffled layer normalization: layer norms are moved to the input of the attention and feed-forward sub-blocks (pre-norm), and an additional final layer norm is added after the last transformer block.
  • GELU activation: unlike ReLU, GELU has no hard cutoff at zero and lets small negative inputs produce non-zero outputs; it was adopted in both GPT-2 and BERT and empirically works better.
  • AdamW optimizer: preferred over SGD because it converges considerably faster on this workload.
  • Residual connection scaling: the initialization of the projections that feed the residual stream is scaled down so the activations’ standard deviation doesn’t grow with depth; the scale factor is 1/√N, where N is the number of residual connections (two per block). See the sketch below.
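
A minimal sketch of how this scaling can be applied at initialization, assuming the 12-layer GPT-2 124M configuration; the RESIDUAL_PROJ flag used to mark residual-stream projections is a hypothetical name, not the video’s exact code:

```python
import torch.nn as nn

n_layer = 12  # GPT-2 124M depth (assumption for this sketch)

def init_weights(module):
    if isinstance(module, nn.Linear):
        std = 0.02
        # Projections that feed the residual stream are scaled by
        # 1/sqrt(2 * n_layer): every block adds two residual connections
        # (attention + MLP), so N = 2 * n_layer.
        if getattr(module, "RESIDUAL_PROJ", False):
            std *= (2 * n_layer) ** -0.5
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model.apply(init_weights)  # after tagging the c_proj layers with RESIDUAL_PROJ = True
```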

Device management: models are moved with .to(device), while tensors require explicit reassignment (tensor = tensor.to(device)), as in the snippet below.
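
For example (a trivial sketch, not from the video):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(10, 10)
model.to(device)    # nn.Module: parameters are moved in place, so reassignment is optional

x = torch.randn(4, 10)
x = x.to(device)    # torch.Tensor: .to() returns a new tensor, so it must be reassigned
y = model(x)
```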

Optimizations

Hardware and Precision Optimizations

  • By default, the parameters and activations are stored in FP32 (32-bit floating point), which is more precision than training actually needs.
  • INT8 has uniformly spaced values, which makes it useful for inference but not for training, where the dynamic range of floating point is needed.

Tensor cores

  • A tensor core is an instruction in the A100 architecture (the GPU Andrej used for training) that performs a 4x4 matrix multiply-accumulate.
  • When a matrix multiplication is issued, we can also specify the intermediate precision used internally.
  • For a larger matrix multiplication, the hardware tiles the matrices into 4x4 blocks and multiplies them with tensor cores.
  • TF-32 reduces precision internally when performing the multiplication (the mantissa is truncated), but the inputs and outputs stay FP32, so the results are nearly the same and the change is transparent to the code (see the one-liner below).
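
In the video this is enabled with a single PyTorch call; the rest of the training script is omitted here:

```python
import torch

# Let FP32 matmuls use TF32 tensor cores on Ampere GPUs such as the A100:
# inputs and outputs stay FP32, only the internal multiply truncates the mantissa.
torch.set_float32_matmul_precision("high")
```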

60% utilization is considered good. Most GPUs are memory-bound; we can’t feed them data fast enough, so tensor cores are often idle due to data unavailability.

Lower precision allows faster access and more storage.

When a floating-point number is represented, the sign bit specifies whether the number is positive or negative, the exponent sets the range, and the mantissa (the precision bits) determines the accuracy.

  • BF16: same range as TF32 (and FP32) but fewer precision bits.
  • FP16: more precision bits than BF16 but a smaller range, which is why it requires gradient scalers.

Software Optimizations

torch.autocast

Used for mixed precision: it chooses which operations should be converted to BF16 and which should stay in FP32, depending on how precision-sensitive they are. This makes runs faster while mitigating precision issues.

For more info, see the PyTorch automatic mixed precision documentation.
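
A minimal sketch of how autocast is typically used, with a placeholder model and loss rather than the GPT-2 training loop:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 1024, device="cuda")

# Inside the context, autocast runs matmul-heavy ops in BF16 while keeping
# precision-sensitive ops (e.g. reductions, softmax) in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)
    loss = (y ** 2).mean()

loss.backward()        # the backward pass runs outside the autocast context
optimizer.step()
optimizer.zero_grad()
```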

torch.compile

While all calculations occur on the GPU, most of its memory resides in High Bandwidth Memory (HBM), with only a small amount of much faster memory on the GPU chip itself.

The chip contains Streaming Multiprocessors (SMs), which in turn contain the tensor cores that do the computation. Because on-chip memory is limited, data constantly travels between the chip and HBM, which is time-consuming. torch.compile makes code significantly faster by reducing Python overhead and GPU reads/writes: it analyzes the code ahead of time and optimizes whole chunks of it, instead of executing operation by operation through the Python interpreter.

It addresses this issue through techniques like kernel fusion, which allows larger chunks of data to be processed on-chip before sending results back to HBM, thereby reducing round trips and bandwidth usage.

This optimization is particularly important because GPU-to-memory transfers are time-intensive, and accessing disk data requires routing through the CPU, which is computationally expensive.
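
Using torch.compile is essentially one line; the toy model below just stands in for the GPT-2 module:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.GELU(),
    torch.nn.Linear(3072, 768),
).cuda()

# torch.compile traces the model, fuses kernels where it can, and removes
# Python-interpreter overhead. The first call is slow (compilation);
# subsequent calls reuse the compiled graph.
model = torch.compile(model)

x = torch.randn(8, 768, device="cuda")
y = model(x)
```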

Flash-attention

A fused attention kernel that torch.compile can’t discover on its own. It actually performs more FLOPs than the naive implementation, but it is much faster because it never materializes the large attention matrix in HBM: the matrix multiplications, mask, softmax, and dropout are fused into a single kernel.
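
In PyTorch this shows up as the fused scaled_dot_product_attention call; the shapes below match GPT-2 124M attention and are used here only for illustration:

```python
import torch
import torch.nn.functional as F

B, n_head, T, head_dim = 8, 12, 1024, 64
q = torch.randn(B, n_head, T, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused causal attention: the (T x T) attention matrix is never written to HBM.
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```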

Code optimization

  • Check for “ugly numbers”: CUDA kernels are written in terms of powers of 2, so rounding sizes up to numbers with many powers of 2 as factors can increase throughput. For example, padding the vocabulary size upward made the run faster.
  • In the code, the GPT-2 vocab size was increased from 50,257 to 50,304 for exactly this reason: CUDA uses block tiles whose sizes are powers of 2, and when a dimension doesn’t divide evenly into these tiles, extra boundary work is required, increasing execution time.
  • 50,304 is divisible by 2, 4, 8, 16, 32, and 64, which suits those tiles. The tokenizer will never produce the extra token indices, because it only knows the original 50,257 tokens.
  • Consequently, the prediction probabilities for the extra tokens are never used as targets. The network simply learns that these tokens should have very low probability, i.e. their logits are driven toward negative infinity, so that after the softmax exp(-inf) = 0 and they get probability zero. A padding helper is sketched below.
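
A tiny helper along these lines (a sketch; the video simply hardcodes 50,304) rounds the vocabulary up to the next multiple of 64:

```python
def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round the vocabulary size up to the nearest multiple (here 64)
    so that CUDA block tiles divide the dimension evenly."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab_size(50257))  # -> 50304
```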

Hyperparameter optimization

Adam optimizer

The Adam hyperparameters (betas of 0.9 and 0.95, eps of 1e-8) were taken from the GPT-3 paper, since the GPT-2 paper doesn’t specify them.

Learning rate scheduler

Cosine decay with linear warmup: the learning rate ramps up linearly for the first steps, then decays along a cosine curve down to a minimum value (about 10% of the peak in the video). A sketch is below.
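
A sketch of the schedule used for the FineWeb-Edu run in the video; treat the exact constants as illustrative:

```python
import math

max_lr = 6e-4          # peak learning rate (from the GPT-3 paper, as used in the video)
min_lr = max_lr * 0.1
warmup_steps = 715
max_steps = 19073

def get_lr(step: int) -> float:
    # 1) Linear warmup from 0 to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) After max_steps, stay at min_lr.
    if step > max_steps:
        return min_lr
    # 3) Cosine decay from max_lr down to min_lr in between.
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```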

Weight decay

Weight decay is not applied to biases, other one-dimensional tensors, or layer-norm parameters; it is only applied to the embeddings and the matmul weights. A sketch of the parameter grouping is below.
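
A sketch of that grouping, splitting parameters by tensor dimensionality; the function name and defaults here are illustrative, not the video’s exact code:

```python
import torch

def configure_optimizer(model, weight_decay=0.1, lr=6e-4):
    params = [p for p in model.parameters() if p.requires_grad]
    # 2D+ tensors (embeddings, linear weights) get weight decay;
    # 1D tensors (biases, layer-norm scales/shifts) do not.
    decay = [p for p in params if p.dim() >= 2]
    no_decay = [p for p in params if p.dim() < 2]
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    # Adam betas/eps follow the GPT-3 paper, as in the video.
    return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95), eps=1e-8)
```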

Gradient accumulation

A technique for handling large batch sizes in language-model training. Here’s an explanation with an example:

  • Batch size: when we say “0.5M”, we mean 0.5 × 10⁶ (500,000) tokens per batch.
  • Due to memory constraints, we can’t process such a large batch at once. Instead, we use gradient accumulation: we process multiple smaller micro-batches, accumulating their gradients, and only then perform a single parameter update.
  • The number of accumulation steps is grad_accum_steps = total_batch_size // (B * T), where B is the micro-batch size per GPU and T is the sequence length.
  • Example: with a micro-batch size of 16 and a sequence length of 1024, each forward-backward pass covers 16 × 1024 = 16,384 tokens. To reach ~0.5M tokens per update we accumulate for about 0.5M / (16 × 1024) ≈ 31 steps (the video uses exactly 2¹⁹ = 524,288 tokens, which gives 32 steps).
  • Important note on loss calculation: losses such as cross-entropy or MSE are means over a micro-batch, but gradients accumulated across micro-steps are summed. So divide each micro-step’s loss by the number of accumulation steps (or, equivalently, sum the losses and take the mean once at the end); otherwise you get a sum where a mean over the full batch is intended. See the loop below.
  • This technique allows us to effectively train with large batch sizes even when limited by GPU memory.
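
Putting the pieces together, a sketch of the accumulation loop; here model, optimizer, and train_loader are assumed to already exist (the GPT-2 module, its AdamW optimizer, and the data loader), and the model is assumed to return raw logits:

```python
import torch
import torch.nn.functional as F

total_batch_size = 524288            # ~0.5M tokens per optimizer step (2**19 in the video)
B, T = 16, 1024                      # micro-batch size and sequence length
grad_accum_steps = total_batch_size // (B * T)    # = 32

# 'model', 'optimizer', and 'train_loader' are assumed to exist; this loop
# only sketches the accumulation logic itself.
optimizer.zero_grad()
for micro_step in range(grad_accum_steps):
    x, y = train_loader.next_batch()              # token tensors of shape (B, T)
    x, y = x.to("cuda"), y.to("cuda")
    logits = model(x)                             # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    # cross_entropy is a mean over B*T tokens; dividing by grad_accum_steps makes
    # the summed gradients equal the gradient of the mean over the full batch.
    loss = loss / grad_accum_steps
    loss.backward()                               # gradients accumulate across micro-steps

torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip the global grad norm to 1.0
optimizer.step()
```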

Training Process

Distributed Data Parallel (DDP) was used for multi-GPU training: each GPU holds a replica of the model and processes a different shard of the data, and gradients are averaged across GPUs during the backward pass. A launch sketch is below.
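
A minimal sketch of the DDP setup; the Linear layer is a placeholder for the GPT-2 model:

```python
# Launch with, e.g.:  torchrun --standalone --nproc_per_node=8 train_gpt2.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(768, 768).to(local_rank)   # placeholder for the GPT-2 module
model = DDP(model, device_ids=[local_rank])
# From here on, each rank processes its own shard of the data, and backward()
# averages gradients across all GPUs automatically.
```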

Data Used

The dataset used is FineWeb-Edu, a high-quality educational subset of Common Crawl curated by Hugging Face.

Steps vs Epochs

A step refers to a single gradient update, while an epoch is a full pass through the entire dataset.

Model Evaluation

The evaluation metric is based on HellaSwag, a multiple-choice benchmark. Small language models like GPT-2 struggle to pick an answer directly when shown all the options at once, so instead a batch of 4 rows is built: each row is the shared context followed by one candidate ending. The model scores each row by the average per-token loss on the ending, and the option with the lowest loss is chosen (see the sketch below).
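
A sketch of that scoring, assuming the model returns raw logits and the caller provides the tokenized rows plus a mask marking the ending tokens:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_option(model, tokens, mask):
    """tokens: (4, T) context+ending per option; mask: (4, T), 1 on ending tokens.
    Returns the index of the option whose ending has the lowest average loss."""
    logits = model(tokens)                                  # (4, T, vocab_size)
    shift_logits = logits[:, :-1, :].contiguous()           # predict token t+1 from position t
    shift_tokens = tokens[:, 1:].contiguous()
    losses = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_tokens.view(-1),
        reduction="none",
    ).view(tokens.size(0), -1)
    shift_mask = mask[:, 1:].contiguous()                   # only count losses on ending tokens
    avg_loss = (losses * shift_mask).sum(dim=1) / shift_mask.sum(dim=1)
    return avg_loss.argmin().item()
```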

Additional notes taken from the video

- Gradient norm clipping: the global gradient norm is clipped (to 1.0 in the video, following GPT-3, as in the accumulation loop above) so that an occasional unlucky batch can’t produce a destabilizing gradient update.

- Document permutation: it’s beneficial to shuffle the documents each epoch so the model doesn’t repeatedly see them in the same order, which could introduce periodicity; the order of documents should not affect the model’s performance.

Certainly, much more information was covered in the 4-hour video, particularly regarding the code. Feel free to ask any questions or suggest any additional details you think would be valuable to include in the article.

If you like what you see, hit the follow button! You can also find me on LinkedIn, and we can follow each other there too. 😊
