Stanford’s CS25: Lecture 2 Summary

Ali Issa
Jul 2, 2024


Lecture 2 of Stanford University’s CS25 V4 Transformers course was delivered by Jason Wei and Hyung Won Chung. Highly recommended to watch, and below are some notes that I took.


Why Do LLMs Work?

LLMs have revolutionised natural language processing and are now widely used for various tasks. Let’s delve into how they work and why they’re so effective.

Next-Word Prediction is Massive Multi-Task Learning

LLMs are trained to predict the next word in a sequence, a task known as "next-word prediction." After training on this objective alone, they gain the ability to perform many other tasks, such as translation, grammar correction (choosing the correct verb), math, and recalling world knowledge.

The key point is that in predicting even a single sentence, the model learns a multitude of tasks at once: it picks up world knowledge, learns where commas go, and handles grammar. Now imagine scaling this to a large corpus of data.
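
To make this concrete, here is a minimal sketch of the next-word prediction objective, assuming PyTorch (the lecture does not prescribe any framework): the model emits a distribution over the vocabulary at every position, and training minimizes cross-entropy against the token that actually comes next. Every "task" shows up inside this single loss.

```python
# Minimal sketch of the next-word prediction objective (PyTorch assumed;
# the model, vocabulary size, and batch here are illustrative placeholders).
import torch
import torch.nn.functional as F

vocab_size = 50_000
batch, seq_len = 2, 8

# Token ids for a batch of sentences (random stand-ins for real text).
tokens = torch.randint(0, vocab_size, (batch, seq_len))

# Any autoregressive LM maps tokens -> logits over the vocabulary at each position.
# Here a single embedding + linear layer stands in for a full Transformer.
embed = torch.nn.Embedding(vocab_size, 64)
lm_head = torch.nn.Linear(64, vocab_size)
logits = lm_head(embed(tokens))          # shape: (batch, seq_len, vocab_size)

# Next-word prediction: position t must predict token t+1.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)
loss = F.cross_entropy(pred, target)     # one scalar loss covering every implicit task
print(loss.item())
```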

Increasing Computation to Improve Loss

LLMs benefit from increased computation: more compute reliably lowers the loss (a rough sketch of this relationship follows the list below). Possible reasons include:

- Memorization: Unlike small language models (SLMs), LLMs can memorize vast amounts of data; they don't need to be selective about what they learn.

- Heuristics: SLMs tend to learn simple heuristics first; if they struggle with the basic tasks, they won't handle complex ones well. LLMs, by contrast, have enough capacity to handle complex tasks effectively.
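
The compute-loss relationship is often summarized as a power law of the form L(C) ≈ a · C^(-b). The lecture does not give this formula or any constants, so the sketch below uses made-up numbers purely to show the shape of the trend.

```python
# Illustrative power-law relationship between training compute and loss.
# The constants a, b and the compute values are invented for illustration;
# real scaling-law fits come from measured training runs.
import numpy as np

a, b = 10.0, 0.05                      # hypothetical fit parameters
compute = np.logspace(18, 24, 7)       # FLOPs, spanning several orders of magnitude

loss = a * compute ** (-b)             # L(C) = a * C^(-b)
for c, l in zip(compute, loss):
    print(f"compute={c:.1e} FLOPs -> loss~{l:.3f}")
```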

Individual Task Improvement and Overall Loss

When the overall loss decreases during training, not every individual task improves at the same rate: some tasks improve suddenly, while others progress smoothly. In the lecture, this behavior was illustrated with benchmarks like BIG-Bench, which contains tasks from many different categories.
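
One way to see why this is possible: if the overall loss averages over many tasks, it can fall smoothly even while one task's metric sits near chance and then jumps. The numbers below are invented solely to illustrate that pattern, not measurements from the lecture.

```python
# Toy illustration (invented numbers): the averaged loss falls smoothly
# while one individual task stays near chance and then jumps ("emerges").
model_scale = [1, 2, 4, 8, 16]                  # arbitrary scale units
overall_loss = [3.0, 2.6, 2.3, 2.1, 2.0]        # smooth, steady improvement
task_accuracy = [0.25, 0.26, 0.27, 0.60, 0.85]  # flat, then a sudden jump

for s, l, acc in zip(model_scale, overall_loss, task_accuracy):
    print(f"scale={s:>2}  overall_loss={l:.1f}  task_accuracy={acc:.2f}")
```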

U-Shaped Scaling

In the lecture, Jason demonstrated an interesting phenomenon on a specific task (e.g., repeating words): a tiny model got it right, a mid-sized model struggled, and a large model excelled. The mid-sized model's weakness was attributed to its unfamiliarity with the task. Inspecting the data helps explain model behavior, and if you plot performance against model size, the curve traces a U shape.

Encoder-Decoder, Decoder-Only and Encoder-Only Architectures

Encoder: Bidirectional self-attention allows any token to attend to any other token in the sentence.

Decoder: Causal self-attention attends only to tokens before a given token.

Cross-attention: Connects the encoder and decoder; the decoder attends to the encoder's final-layer outputs.

Encoder-decoder vs. decoder-only:
-> An encoder-decoder combines bidirectional attention (in the encoder) with causal attention (in the decoder).
-> A decoder-only model shares one set of parameters across input and output: a single stack processes both.
-> In an encoder-decoder, the decoder attends only to the encoder's last layer through cross-attention, whereas in a decoder-only model the input and output tokens attend to each other at every layer (see the mask sketch below).
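
To make the attention patterns above concrete, here is a minimal sketch of the masks involved, written in PyTorch (my own illustration; the names and shapes are assumptions, not lecture code).

```python
# Sketch of the masks behind the three attention patterns (PyTorch assumed;
# shapes and variable names are illustrative).
import torch

seq_len = 5

# Encoder: bidirectional self-attention -> any token may attend to any other token.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

# Decoder: causal self-attention -> token t may attend only to positions <= t.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

print(bidirectional_mask.int())
print(causal_mask.int())

# Cross-attention in an encoder-decoder: every decoder position may read from
# every position of the encoder's final-layer output.
enc_len, dec_len = 6, 5
cross_attention_mask = torch.ones(dec_len, enc_len).bool()
print(cross_attention_mask.int())
```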

If you like what you see, hit the follow button! You can also find me on LinkedIn, and we can follow each other there too. 😊
