
Building and Training Large Language Models (LLMs): A Stanford Lecture Summary


This is a summary of the Stanford CS229 Machine Learning lecture “Building Large Language Models (LLMs)” by Yann Dubois.

Introduction to Large Language Models (LLMs)

Large Language Models (LLMs), such as ChatGPT, Claude, Gemini, and LLaMA, have become fundamental tools in modern machine learning applications. Understanding how these models are built and trained involves several crucial components, including architecture, training algorithms, data, evaluation methods, and system considerations.

Essential Components in Training LLMs

When building an LLM, several critical aspects must be considered:

Architecture

Most LLMs use transformer-based architectures. It is crucial to understand how the model structure impacts training, evaluation, and practical use.

Data

Data quality and diversity significantly affect model performance. Typically, data comes from large-scale web crawls such as Common Crawl. Once obtained, the text must be extracted from HTML pages, including handling special cases like mathematical content.

Data processing includes:

  • Filtering undesirable content, such as harmful information and personally identifiable information (PII), which can often be identified using small classification models.
  • Deduplication: recurring headers and footers, or identical content served from different URLs, must be removed to avoid redundancy.
  • Heuristic filtering further refines the dataset by removing low-quality documents based on simple rules, for example eliminating documents that are too short or excessively long.
  • Model-based filtering is another effective method, where a small model predicts the likelihood that a webpage would be referenced by Wikipedia (a page referenced by Wikipedia is probably a good website); a rough sketch of these cleaning steps follows this list.
  • Data mixtures are also crucial, involving weighting different content types (such as code, books, entertainment) appropriately.
  • Finally, learning rate annealing is applied on high-quality data at the end of pre-training (it is a bit like overfitting on the highest-quality data).
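
To make these cleaning steps concrete, here is a minimal Python sketch of heuristic filtering and exact deduplication; the word-count thresholds and toy documents are made up, and a real pipeline would add the model-based quality filter and fuzzy deduplication on top of this.

```python
import hashlib

# Toy documents standing in for crawled web pages.
raw_docs = [
    "word " * 10,                # too short: dropped by the heuristic filter
    "A useful article. " * 100,  # long enough: kept
    "A useful article. " * 100,  # exact duplicate (e.g. same page at another URL): dropped
]

def heuristic_filter(doc: str, min_words: int = 50, max_words: int = 100_000) -> bool:
    """Keep documents whose word count falls within simple, made-up bounds."""
    n_words = len(doc.split())
    return min_words <= n_words <= max_words

def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates by hashing document contents."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = deduplicate([d for d in raw_docs if heuristic_filter(d)])
print(len(corpus))  # 1 document survives filtering and deduplication
```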

Tokenization

Tokenizers convert text into a sequence of tokens that models can process. A simple approach is to treat each word as a token, but this breaks down when there are typos or out-of-vocabulary (OOV) words, which the model wouldn’t recognize. At the other extreme, character-level tokenization handles typos and OOV issues better but results in very long sequences, which can be inefficient for training.

Subword-level tokenization, like Byte Pair Encoding (BPE), offers a balance between these extremes. BPE starts with individual characters and iteratively merges the most frequent pairs to form new tokens. You can control how many merges the tokenizer applies, effectively customizing the vocabulary size. This approach handles typos more gracefully than word-level tokenization and keeps sequence lengths shorter than character-level methods.

Tokenizers are trained independently from the LLM: by feeding them text, the algorithm learns which token pairs to merge based on frequency.
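
To make the merge procedure concrete, here is a minimal BPE training sketch over a toy word-frequency table; the corpus, the `</w>` end-of-word marker, and the number of merges are illustrative, not the setup of any production tokenizer.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite the vocabulary so the chosen pair becomes a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated characters with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10  # controls how large the final vocabulary grows
merges = []
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # learned merge rules, e.g. ('e', 's'), ('es', 't'), ...
```

Applying the learned merges in order to new text reproduces the same tokenization, which is why typos simply fall back to smaller subword pieces rather than an unknown token.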

Pre-tokenization is often performed beforehand, such as splitting text by spaces or punctuation, to improve consistency and efficiency.

Training Algorithm and Loss Function

The training process of an LLM consists of the following steps: a sentence is tokenized and passed through the model, which produces logits; the logits are converted into a probability distribution over the LLM’s vocabulary; the loss is then computed by comparing that distribution against the actual next token. The primary loss function for LLMs is typically the cross-entropy loss.
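
As a rough sketch of this objective in PyTorch, with random logits standing in for a real transformer so the example stays self-contained:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: a batch of 2 sequences, 8 tokens each, vocabulary of 1,000 tokens.
batch, seq_len, vocab_size = 2, 8, 1000
token_ids = torch.randint(0, vocab_size, (batch, seq_len))  # tokenized sentences
logits = torch.randn(batch, seq_len, vocab_size)            # stand-in for the model's output

# Predict token t+1 from positions up to t: shift predictions and targets by one.
shift_logits = logits[:, :-1, :]   # predictions for the next token at each position
shift_targets = token_ids[:, 1:]   # the actual next tokens

# Cross-entropy = softmax over the vocabulary + negative log-likelihood of the true token.
loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_targets.reshape(-1))
print(loss.item())  # a single scalar training loss
```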

Evaluation

Traditionally, LLMs were evaluated using perplexity, which measures how confidently a model predicts the next token. Perplexity values typically range between 1 and the size of the vocabulary, representing how many tokens the model is “hesitating” between on average.
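
Concretely, perplexity is the exponential of the average per-token cross-entropy loss; a tiny sketch with made-up loss values:

```python
import math

# Per-token cross-entropy losses (in nats) for one sequence (made-up numbers).
token_losses = [2.1, 1.7, 3.0, 0.9, 2.4]

avg_loss = sum(token_losses) / len(token_losses)
perplexity = math.exp(avg_loss)  # 1 = fully confident; vocabulary size = uniform guessing
print(round(perplexity, 1))      # about 7.5, i.e. "hesitating" between 7-8 tokens on average
```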

For more info about perplexity, feel free to check my recent post on LinkedIn.

However, perplexity has limitations. Without averaging, longer sequences accumulate more loss than shorter ones, which would make comparisons misleading; averaging the per-token loss removes this length dependence, but perplexity remains sensitive to the tokenizer and vocabulary size used. For instance, ChatGPT uses a vocabulary size of 50,000, whereas Gemini uses 10,000. These differences make perplexity unreliable for fair benchmarking across models.

As a result, modern evaluation has shifted away from perplexity, especially in academic benchmarks.

Evaluation challenges:

  • Prompt sensitivity and loss inconsistencies: Since pre-trained models are not instruction-following, evaluating tasks like multiple-choice questions requires special techniques. We can’t directly prompt these models to choose the correct answer because they haven’t been trained to follow instructions yet. Instead, one common approach is to use the model’s probability distribution to determine which option (A, B, C, or D) has the highest likelihood, with the full content of each choice included in the input (see the sketch after this list).
  • Training-test contamination: Companies know what is in their own training data, but the public lacks access to proprietary datasets. To detect potential data leakage in closed models, evaluators may check whether the model can reproduce test examples verbatim, which suggests it was trained on them.
  • Evaluation inconsistencies: Different LLM evaluators (like HELM and Harness) use varying strategies, which can result in inconsistent measurements across tools.
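
As a rough illustration of the likelihood-based multiple-choice trick mentioned above, the sketch below scores each option with a small causal language model. GPT-2 is used only as a freely available stand-in for a pre-trained base model, and the question and options are made up.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text: str) -> float:
    """Sum of log-probabilities the model assigns to each token in `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)  # predictions for tokens 2..T
    targets = ids[:, 1:]                                   # the tokens actually observed
    return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()

prompt = "Question: What is the capital of France?\nAnswer:"
options = {"A": " Paris", "B": " Rome", "C": " Berlin", "D": " Madrid"}

# Score the prompt plus the full text of each choice; pick the most likely continuation.
scores = {label: sequence_log_prob(prompt + text) for label, text in options.items()}
print(max(scores, key=scores.get))  # expected: "A"
```

In practice, scores are usually normalized by the number of tokens in each option so that longer answers are not penalized simply for being longer.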

Systems Considerations

Deploying LLMs in real-world applications involves handling large-scale computations, data pipeline efficiency, and optimized hardware usage.

Scaling Laws

Research by OpenAI has shown that increasing both the model size and the amount of training data typically leads to better performance.

Note that larger models tend not to overfit in the traditional sense.

This relationship is formalized through scaling laws, which estimate how performance improves as a function of model size, dataset size, and compute. These laws guide decisions like: given a certain GPU budget, which model architecture or size should be trained?

Historically, hyperparameter tuning was performed directly on large models: training a new one each day, for example, and selecting the best after weeks of experimentation.

Modern pipelines are more efficient: they tune hyperparameters on smaller models, fit scaling laws to the results, and then extrapolate the expected performance of larger models. This allows researchers to reserve most of the budget (e.g., 27 out of 30 days) for training the final large-scale model, using only a small portion (e.g., 3 days) for tuning.

An example of this approach is comparing LSTMs and Transformers. By training various sizes of each architecture and fitting scaling laws, researchers can determine which is more effective under a fixed compute budget.
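
Here is a hedged sketch of how such a comparison might be set up, with entirely made-up loss measurements: fit a power law to small runs in log-log space, then extrapolate both architectures to the full compute budget.

```python
import numpy as np

# Hypothetical (compute, validation loss) measurements from small training runs.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])      # FLOPs
loss_transformer = np.array([3.9, 3.5, 3.1, 2.8, 2.5])  # made-up numbers
loss_lstm = np.array([4.3, 4.0, 3.7, 3.5, 3.3])         # made-up numbers

def fit_power_law(c, losses):
    """Fit loss ~ a * compute^(-b) via linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(c), np.log(losses), 1)
    return np.exp(intercept), -slope  # returns (a, b)

for name, losses in [("transformer", loss_transformer), ("LSTM", loss_lstm)]:
    a, b = fit_power_law(compute, losses)
    predicted = a * (1e21) ** (-b)  # extrapolate to the full training budget
    print(f"{name}: predicted loss at 1e21 FLOPs = {predicted:.2f}")
```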

The Chinchilla paper further refined this understanding by fixing total compute and training models of different sizes on varying amounts of data. It found that, for compute-optimal training, models should be trained on around 20 tokens per parameter. When inference cost is also taken into account, it pays to train a smaller model on more data, pushing the ratio to roughly 150 tokens per parameter.
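
As a quick worked example of these ratios (the 70B-parameter model is just an illustration):

```python
params = 70e9  # a hypothetical 70B-parameter model

# Compute-optimal rule of thumb: ~20 tokens per parameter.
print(f"{20 * params:,.0f} training tokens")   # 1,400,000,000,000 (about 1.4T tokens)

# Accounting for inference cost favors smaller models trained longer: ~150 tokens per parameter.
print(f"{150 * params:,.0f} training tokens")  # 10,500,000,000,000 (about 10.5T tokens)
```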

Post-Training Alignment and Fine-Tuning

Raw language models typically do not follow instructions and can generate harmful or toxic content. Thus, alignment through supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) is essential.

Supervised Fine-Tuning (SFT)

Supervised fine-tuning is the first step in post-training, where a language model is trained to generate responses in a desired format using labeled input-output pairs. These labels can come from either human-annotated data or synthetically generated examples. While human-labeled data is high quality, it’s expensive and time-consuming to collect at scale. To address this, researchers use synthetic data generation: a small human-annotated seed set is used to prompt an existing LLM to produce large-scale instruction-following examples, the approach behind models like Alpaca 7B.

The goal of SFT is not to teach the model new knowledge but to fine-tune it to produce outputs that align with expected behavior and formatting, essentially behavior cloning. According to the LIMA paper, scaling up the amount of SFT data beyond a certain point doesn’t significantly improve performance, reinforcing the idea that relatively small datasets are sufficient for effective SFT.

One potential drawback is that SFT can introduce hallucinations, where the model confidently outputs incorrect information or harmful content.

This leads to the next step in post-training: Reinforcement Learning from Human Feedback (RLHF), which goes beyond behavior cloning by directly optimizing for human preferences rather than just reproducing human behavior.

Reinforcement Learning with Human Feedback (RLHF)

RLHF further refines the model by maximizing human preferences rather than simply mimicking labeled data. It involves training a reward model to evaluate model outputs based on human judgments, and then optimizing the LLM using methods like Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO).

Note that DPO is a simplified approach that directly uses labeled preference data without the need to train a separate reward model. In contrast, PPO requires a reward model, which is trained on human-annotated preferences (or synthetically generated comparisons), to guide the optimization process.
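
For illustration, here is a minimal sketch of the DPO objective; the log-probabilities are made up, whereas in practice they come from scoring each chosen and rejected response with the policy being trained and with a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a batch of preference pairs.
    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model."""
    chosen_margin = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the policy to prefer the chosen response over the rejected one.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Made-up log-probabilities for a batch of 3 preference pairs.
loss = dpo_loss(
    policy_chosen_logp=torch.tensor([-12.0, -9.5, -14.2]),
    policy_rejected_logp=torch.tensor([-11.0, -10.1, -13.0]),
    ref_chosen_logp=torch.tensor([-12.5, -9.8, -14.0]),
    ref_rejected_logp=torch.tensor([-11.2, -9.9, -13.5]),
)
print(loss.item())
```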

RLHF helps mitigate hallucinations and aligns outputs more closely with human expectations but is computationally expensive and challenging due to ethical considerations and annotator variability.

Evaluation Challenges in Post-Training

Evaluating fine-tuned models with standard metrics like perplexity or validation loss can be misleading since post-training methods optimize models differently than standard language modeling. Instead, comparative benchmarks such as Chatbot Arena (human-based comparisons) and AlpacaEval (LLM-based comparisons) have become popular.

Curious about the latest in AI and ML? Follow me on LinkedIn for fresh insights and updates!
