Understand how large language models actually work — transformers, tokenization, prompting, sampling strategies, and the context window constraints that shape every LLM-powered system.
A large language model (LLM) is a neural network trained on massive amounts of text to predict what token comes next. That tiny capability — predicting the next token — turns out to be enough to power conversation, code generation, translation, summarization, and reasoning.
Every LLM, no matter how sophisticated, is fundamentally doing this loop:
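A minimal sketch of that loop follows. The lookup table `TOY_MODEL` and the helper `next_token_probs` are hypothetical stand-ins for a trained network, invented here purely for illustration; a real model computes the distribution from the whole context, not just the last token.

```python
import random

# Hypothetical stand-in for a trained model: maps the most recent token to a
# probability distribution over the next token. This table is an assumption
# made for illustration, not how a real LLM stores knowledge.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "dog": {"sat": 0.5, "<eos>": 0.5},
    "sat": {"<eos>": 1.0},
}


def next_token_probs(tokens):
    """Return a next-token distribution for the given context (toy version)."""
    return TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})


def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Repeat: predict a distribution, sample one token, append it."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)                  # 1. predict
        choices, weights = zip(*probs.items())
        token = rng.choices(choices, weights=weights)[0]  # 2. sample
        if token == "<eos>":                              # stop marker
            break
        tokens.append(token)                              # 3. append, repeat
    return tokens


print(generate(["the"]))
```

The important design point is that the output of each step becomes part of the input to the next one, which is why generation is sequential.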
Everything else — instruction following, tool use, reasoning — emerges from how this loop is trained and prompted.
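Modern LLMs use the transformer architecture (Vaswani et al., 2017). Its key innovation is self-attention: each token can look at the other tokens in the input and weigh how relevant each one is. In the decoder-only models typically used for text generation, a token attends only to itself and the tokens before it.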
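As a rough illustration of self-attention, the sketch below implements scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the function name `self_attention` is made up for this example, the query, key, and value vectors are passed in directly, and the learned projection matrices and multiple heads of a real transformer are left out.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values, causal=True):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # With a causal mask, token i only sees positions 0..i.
        visible = list(range(i + 1)) if causal else list(range(len(keys)))
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in visible
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the visible value vectors.
        out = [
            sum(w * values[j][k] for w, j in zip(weights, visible))
            for k in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to the tokens it is allowed to see; the rows grow longer because later tokens can see more of the sequence.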
| Stage | Data | Goal |
|---|---|---|
| Pretraining | Trillions of tokens of internet text | Learn general language + world knowledge |
| Fine-tuning | Curated task-specific examples | Specialize on a domain or behavior |
| RLHF / preference tuning | Human preferences | Align outputs with what people want |
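
To make the pretraining goal concrete: the standard objective for next-token prediction is cross-entropy, the average negative log-probability the model assigned to the token that actually came next. The sketch below assumes a hypothetical representation in which each prediction is a dict from token to probability; the function name `next_token_loss` is invented for this example.

```python
import math


def next_token_loss(predicted_probs, target_tokens):
    """Average negative log-likelihood of the true next tokens.

    predicted_probs: one dict per position, mapping token -> probability.
    target_tokens:   the token that actually came next at each position.
    """
    losses = [
        -math.log(probs[target])
        for probs, target in zip(predicted_probs, target_tokens)
    ]
    return sum(losses) / len(losses)


# A model that puts more probability on the true token gets a lower loss.
confident = [{"cat": 0.9, "dog": 0.1}]
unsure = [{"cat": 0.5, "dog": 0.5}]
print(next_token_loss(confident, ["cat"]) < next_token_loss(unsure, ["cat"]))
```

Training adjusts the parameters to push this number down across the training corpus.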
Parameters are the weights in the network. More parameters generally mean more capacity, but also more compute and memory. Rough modern landscape: small models (1-8B parameters) can run on a laptop, especially when quantized; medium models (30-70B) need one or more high-end GPUs; frontier models (100B+) typically run in cloud datacenters.
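A back-of-the-envelope calculation shows where these tiers come from. The sketch below estimates the memory needed just to hold the weights, as parameter count times bytes per parameter. The helper `weight_memory_gb` is hypothetical, and the figure is a lower bound: activations, the KV cache, and framework overhead come on top.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, dtype="fp16"):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for size in (7e9, 70e9):
    for dtype in ("fp16", "int4"):
        gb = weight_memory_gb(size, dtype)
        print(f"{size / 1e9:.0f}B params at {dtype}: {gb:.1f} GB")
```

Under these assumptions, a 7B model needs about 14 GB at 16-bit precision and a 70B model about 140 GB, which is why the larger tiers depend on quantization, multiple GPUs, or both.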