A Recipe Guide: How LLMs Are Made

By Robert & Claude
1/16/2026
6 month prep time

The step-by-step process behind modern AI

We run an AI training company in Birmingham, Alabama. One of the most common questions we get is: "How are these things even made?"

It's one of those modern mysteries—like "what was Chrysler thinking with the PT Cruiser?"

So here's our attempt: a recipe.

Not as a gimmick (well, maybe a little), but because it genuinely maps well. Building a large language model has ingredients and steps that have to happen in a particular order, and if you skip something or get the proportions wrong, the whole thing falls flat. It's not unlike my dad's biscuit recipe, except this one costs a hundred million dollars.

What You're Making

A large language model is a system that can read and generate text. Ask it a question, it answers. Give it a task, it tries to complete it. It can write emails, explain concepts, summarize documents, generate code, and hold conversations.

The best ones feel like talking to a knowledgeable colleague who's read everything and has infinite patience. They're not perfect—they make mistakes, they hallucinate facts, they have blind spots—but when they work, they're remarkably useful.

This is what you're building: a system that understands language well enough to be genuinely helpful.

Prep time: 6 months (mostly data collection and cleaning)
Cook time: Several weeks to several months
Total time: About a year, give or take
Yield: 1 large language model (serves millions)
Difficulty: Advanced

Equipment

  • Thousands of GPUs — Graphics Processing Units, the specialized chips that handle the math. CPUs (the regular chips in your computer) are too slow for this. GPUs can do many calculations in parallel, which is exactly what training requires. You'll need somewhere between 1,000 and 100,000 of them.
  • A data center with serious cooling — GPUs generate enormous heat when running at full capacity. Without industrial cooling, they'd overheat and shut down within minutes.
  • Engineers who don't mind waiting — Training runs take weeks or months. Someone needs to monitor the process, catch problems early, and make adjustments. It's less glamorous than it sounds.
  • Budget — Anywhere from a few million to over a hundred million dollars. The electricity bill alone for a large training run can exceed $10 million. This is not a side project.
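
To give those numbers a sense of scale, here's some back-of-the-envelope arithmetic. Every figure in it is an assumption we picked for illustration, not a number from any real training run, but the shape of the math is right: GPU count times time gives you GPU-hours, and GPU-hours times power draw times electricity rates gives you the power bill.

```python
# Back-of-the-envelope scale math. Every number here is an assumption chosen
# for illustration, not a figure from any real training run.
num_gpus = 50_000          # somewhere in the 1,000-100,000 range above
days_of_training = 90      # "several weeks to several months"

gpu_hours = num_gpus * days_of_training * 24
print(f"{gpu_hours:,} GPU-hours")                   # 108,000,000

kw_per_gpu = 1.2           # rough per-GPU draw once cooling and networking overhead are included
dollars_per_kwh = 0.08     # assumed industrial electricity rate

electricity_bill = gpu_hours * kw_per_gpu * dollars_per_kwh
print(f"${electricity_bill:,.0f} for electricity")  # a bit over $10 million
```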

Ingredients

For the base (pre-training):
  • Trillions of tokens of text — A "token" is roughly a word-chunk. The word "grandmother" might be two tokens: "grand" and "mother." You want text from everywhere: books, websites, scientific papers, code, forums. The more diverse and high-quality the text, the more capable the model. This is its entire education—everything it will know comes from this data.
  • A transformer architecture — The underlying structure that processes the text. Transformers work by paying "attention" to relationships between words, even when they're far apart in a sentence. This architecture is why modern AI can understand context and nuance that older systems missed entirely.
  • Compute — Measured in GPU-hours. For a serious model, you're looking at millions of GPU-hours. More compute generally means better results, up to a point—this is why companies keep building bigger data centers.
For the fine-tuning:
  • Instruction-response pairs — Thousands of examples showing a question or task, followed by a good response. Without these, the model knows language but doesn't know how to be helpful. These examples teach it the format of being an assistant.
  • Human annotators — People with good judgment who write and evaluate responses. Their taste shapes the model's behavior. Hire carefully.
For alignment:
  • A reward model — A separate, smaller model trained to predict what humans prefer. You build this by showing people two responses and asking "which is better?" thousands of times. The reward model learns to score responses the way humans would.
  • Preference data — Those pairs of responses with human judgments. It's expensive to collect (roughly $5 to $20 per comparison), which is why many teams now use AI to generate synthetic preferences at a fraction of the cost.
  • RLHF or DPO — Reinforcement Learning from Human Feedback, or Direct Preference Optimization. These are methods for adjusting the model to produce responses that score higher with the reward model. RLHF is more complex but can be more powerful; DPO is simpler and increasingly popular.
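
One ingredient up there deserves a closer look: the transformer's "attention" mechanism. The snippet below is a toy version of its core computation (scaled dot-product attention) in plain Python with NumPy. It leaves out everything that makes real models work at scale, such as learned projection matrices, multiple attention heads, and masking, but it shows the basic move: every token scores its relevance to every other token, then updates itself as a weighted blend of them.

```python
# A toy version of the "attention" idea inside a transformer, in plain NumPy.
# Real models add learned projection matrices, many attention heads, masking,
# and thousands of dimensions; this keeps only the core computation.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention: each token scores every other token,
    then updates itself as a weighted blend of their values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # how relevant is each token to each other token?
    weights = softmax(scores, axis=-1)        # turn scores into proportions that sum to 1
    return weights @ values                   # blend the values by those proportions

# Four tokens, each represented by an 8-number vector (made-up values).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
print(attention(tokens, tokens, tokens).shape)  # (4, 8): one updated vector per token
```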

Instructions

Step 1: Pre-training

This is where you build the foundation. It takes the longest, costs the most, and determines the model's ceiling. Everything else is refinement.

  1. Gather your data. Collect text from across the internet, license books, pull from code repositories. You want trillions of tokens. Quality matters enormously—the model can only know what it's seen, and it will absorb both the good and the bad.
  2. Clean it. Remove duplicates (they cause the model to memorize rather than generalize). Filter out low-quality content, spam, and harmful material. Fix encoding issues and formatting problems. This is tedious, unglamorous work, but skipping it creates problems that are nearly impossible to fix later.
  3. Tokenize. Break all the text into tokens using a consistent scheme. The model doesn't see words or characters—it sees these chunks. A good tokenizer balances vocabulary size against the ability to represent any text.
  4. Train the model to predict the next token. This is the core of the whole process. Show the model a sequence of tokens, ask "what comes next?" It guesses. You compare the guess to what actually came next. The difference becomes a signal that adjusts the model's internal weights, making it slightly better at predicting. Repeat this billions of times across trillions of tokens.

    Through this simple task—just predicting the next word—the model learns grammar, facts, reasoning patterns, writing styles, and even some common sense. It's learning the structure of human knowledge by learning to imitate it.
  5. Wait. This runs for weeks or months on thousands of GPUs. Watch the loss curve (a graph of the model's prediction error over time) to make sure it keeps falling and nothing has gone wrong. You can't rush this.

What you have at the end is a base model. It knows a lot, but it has no idea how to be helpful. Ask it a question and it might just continue the question, or ramble, or produce something bizarre. It's learned language, but not how to have a conversation.
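
If you're curious what that "predict the next token" loop looks like in code, here's a toy sketch in PyTorch. The model is a deliberately tiny stand-in (an embedding table and one linear layer instead of a real transformer), the "text" is random token IDs rather than trillions of real tokens, and all the engineering that spreads this across thousands of GPUs is missing. But the core loop of guess, compare, adjust is the same one a frontier model goes through billions of times.

```python
# A toy version of step 4's "predict the next token" loop, using PyTorch.
# Assumptions for the sketch: the text is already tokenized into integer IDs
# (here they're random), and the "model" is a tiny stand-in rather than a
# real transformer with billions of parameters.
import torch
import torch.nn as nn

vocab_size = 1000                      # real vocabularies run roughly 50k-200k tokens
model = nn.Sequential(                 # stand-in model: embed each token, then guess the next one
    nn.Embedding(vocab_size, 64),
    nn.Linear(64, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

token_ids = torch.randint(0, vocab_size, (1, 32))  # pretend tokenized text: (batch, sequence length)

for step in range(100):                # a real run streams trillions of fresh tokens instead of looping
    inputs = token_ids[:, :-1]         # the model sees everything up to position t...
    targets = token_ids[:, 1:]         # ...and is asked to predict the token at position t+1
    logits = model(inputs)             # a score for every vocabulary entry, at every position
    loss = nn.functional.cross_entropy(          # how wrong were the guesses?
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()                    # turn the error into adjustments...
    optimizer.step()                   # ...and nudge the weights slightly toward better guesses
```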

Step 2: Supervised Fine-Tuning (SFT)

Now you teach it how to actually respond to people.

  1. Create examples. Thousands of instruction-response pairs: "Explain photosynthesis simply" followed by a clear explanation. "Write a professional email declining a meeting" followed by a good email. "What's the capital of France?" followed by "Paris." You're demonstrating what a helpful assistant looks like.
  2. Fine-tune. Train the base model on these examples using the same next-token prediction. The difference is that now all the training data shows the pattern: user asks, assistant responds helpfully. The model learns this format.
  3. Evaluate. The model should now follow instructions and try to be helpful. It understands it's supposed to answer questions, not just continue text. But it might still occasionally produce responses that are wrong, offensive, or unhelpful.

This step takes the raw capability from pre-training and shapes it into something usable. Think of it as teaching manners.
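
To make "the format of being an assistant" concrete, here's a small sketch of how instruction-response pairs might be laid out before fine-tuning. The special markers are made up for illustration; every lab has its own chat template, but the idea is the same: the training text itself demonstrates the user-asks, assistant-answers pattern, and the objective is still plain next-token prediction.

```python
# A sketch of turning instruction-response pairs into fine-tuning text.
# The <|user|>, <|assistant|>, and <|end|> markers are made up for illustration;
# every lab uses its own chat template, but the shape is the same.
examples = [
    {"instruction": "Explain photosynthesis simply.",
     "response": "Plants use sunlight to turn water and carbon dioxide into food, releasing oxygen."},
    {"instruction": "What's the capital of France?",
     "response": "Paris."},
]

def to_training_text(example):
    # The training objective on this text is still plain next-token prediction.
    return (
        f"<|user|>\n{example['instruction']}\n"
        f"<|assistant|>\n{example['response']}<|end|>"
    )

for ex in examples:
    print(to_training_text(ex))
    print("---")
```

In practice, many teams also mask out the loss on the user's portion of the text, so the model is only graded on the assistant's reply.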

Step 3: Preference Tuning

This is what separates a decent model from one people actually want to use.

  1. Train a reward model. Show humans two responses to the same prompt, ask which is better, repeat thousands of times. Train a smaller model to predict those preferences. This reward model becomes your automated taste-tester—it can score any response for quality.
  2. Apply reinforcement learning. Let the main model generate responses. Score them with the reward model. Adjust the main model to produce responses that score higher. The model learns not just to respond, but to respond well.

    RLHF (Reinforcement Learning from Human Feedback) uses an algorithm called PPO (Proximal Policy Optimization) to make these adjustments carefully; you don't want the model to find shortcuts that score high with the reward model but aren't actually good. DPO (Direct Preference Optimization) skips the separate reward model and learns directly from the preference pairs, which is simpler but sometimes less flexible.
  3. Balance. You want helpful, harmless, and honest. Push too hard on being helpful and the model might make things up to please you. Push too hard on harmlessness and it becomes useless, refusing to engage with anything. This is more art than science.

Most teams today use a combination—some SFT, some DPO, some RLHF, sometimes multiple rounds of each. The recipe keeps evolving.
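
For the technically curious, here's roughly what DPO's "learn directly from the preference pairs" means, boiled down to its loss function in PyTorch. This is a simplified sketch: it assumes you've already computed log-probabilities for the preferred and rejected responses under both the model being trained and a frozen reference copy of it (the expensive part, omitted here), and the beta value is just a typical-looking placeholder.

```python
# A simplified sketch of the DPO loss in PyTorch. It assumes you've already
# summed up log-probabilities for the preferred ("chosen") and rejected responses
# under both the model being trained and a frozen reference copy of it; computing
# those is the expensive part and is omitted here. beta is a placeholder value.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pull the model toward preferred responses and away from rejected ones,
    without drifting too far from the reference model (that's what beta controls)."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Made-up log-probabilities for a batch of three preference pairs:
loss = dpo_loss(
    policy_logp_chosen=torch.tensor([-12.0, -9.5, -20.0]),
    policy_logp_rejected=torch.tensor([-11.0, -10.0, -19.0]),
    ref_logp_chosen=torch.tensor([-12.5, -9.8, -20.5]),
    ref_logp_rejected=torch.tensor([-11.2, -9.9, -19.2]),
)
print(loss)
```

The loss shrinks when the model puts relatively more probability on the preferred response than the reference copy does, and relatively less on the rejected one.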

Notes

  • Cost: Post-training alone (steps 2 and 3) can run $50 million or more for frontier models. The whole process can exceed $100 million. Llama 3.1 reportedly had a 200-person post-training team.
  • Synthetic data: Instead of paying humans for every preference rating, many teams now use AI to generate feedback at less than a penny per sample. The previous generation of models rates the current one. It works surprisingly well and has made research accessible to more teams.
  • This keeps changing. What we've described is the standard approach as of 2025. New techniques emerge constantly: reinforcement pre-training, Constitutional AI, model merging. The fundamentals stay the same: data, compute, feedback.
  • Scale matters. More data, more parameters, more compute generally means a better model. This is why companies are building data centers as fast as they can get permits and power.

Serving Suggestions

Once it's trained, you deploy it:

  • As an API — Developers send requests, get responses back. This is how most commercial models are accessed.
  • As a chat interface — A conversation UI for regular users.
  • Embedded in products — Inside email clients, code editors, search engines, anywhere text is involved.

The quality of what went in determines what comes out.
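
If you've never touched one of these APIs, here's roughly what the first option looks like from the developer's side. The URL, key, and model name below are placeholders, not any particular provider's real API, but most commercial LLM APIs follow a shape very close to this: send a list of chat messages over HTTP, get a generated reply back as JSON.

```python
# A hypothetical API call. The URL, key, and model name are placeholders, not any
# particular provider's real API, but most commercial LLM APIs look very close to this.
import requests

response = requests.post(
    "https://api.example.com/v1/chat/completions",      # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},    # placeholder credentials
    json={
        "model": "example-model",                        # placeholder model name
        "messages": [
            {"role": "user",
             "content": "Summarize this email in one sentence: ..."}
        ],
    },
    timeout=30,
)
print(response.json())                                   # the generated reply comes back as JSON
```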

Final Thoughts

That's how you make a large language model.

It's not magic. It's data, compute, and careful engineering. The model learns to predict text, then learns to follow instructions, then learns what humans actually prefer. Each step builds on the last.

Understanding this won't make you an AI researcher, but it'll help you make sense of what these tools can and can't do—and why they behave the way they do.

Y'all can do a lot more with AI when you know what's actually under the hood.
