Hey everyone! I just stumbled upon something absolutely wild that I had to share with you all. Andrej Karpathy (yes, THE Andrej Karpathy) just released nanochat, and honestly, it’s blowing my mind right now.
What’s the Big Deal?
Here’s the thing: you can now train your own ChatGPT-like AI assistant for about $100 and 4 hours of compute time. I know, I know – that sounds too good to be true. But stick with me here.
Karpathy basically took everything we know about training large language models and compressed it into just 8,000 lines of code. No bloated dependencies. No massive config files. Just clean, understandable code that does exactly what it says on the tin.
Let Me Break This Down
The project follows his signature minimalist philosophy (if you’ve played with nanoGPT, you know what I’m talking about). But this time, he’s covered the entire pipeline:
Custom tokenizer training (he literally rewrote it in Rust because Python was too slow)
Pre-training from scratch
Mid-training for chat adaptation
Supervised fine-tuning
Reinforcement learning (a simplified GRPO implementation)
A working web UI so you can actually chat with your creation
The Fast Track: Your 4-Hour Journey
Want to get started? Here’s literally all you need:
git clone git@github.com:karpathy/nanochat.git
cd nanochat
screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
Wait 4 hours (grab some coffee, binge a show, whatever), then fire up the web interface:
python -m scripts.chat_web
Boom. You’ve got yourself a conversational AI that can write stories, answer questions, and engage in dialogue.
What Can This Thing Actually Do?
Okay, so I was curious about the actual performance. Here’s what you get for your $100:
The Base Model (4 hours, ~$100):
560M parameters (20-layer Transformer)
MMLU score: 31% (random baseline is 25%)
GSM8K: 4.5% (can solve basic math problems)
Can write coherent stories and answer simple questions
Performance roughly comparable to GPT-2 Large
Want to Splurge? ($300 version, 12 hours):
26 layers instead of 20
Surpasses GPT-2 performance
More coherent conversations
Better reasoning capabilities
The Power User Edition ($1000, ~42 hours):
30 layers deep
MMLU: 40+ points
ARC-Easy: 70+ points
This is getting into actually useful territory
The Technical Bits (For the Nerds)
What really impressed me was how Karpathy handled the tokenizer. He trained a custom one with 65,536 vocabulary items (2^16) in just 1 minute on 2 billion characters. The compression ratio? 4.8 – better than GPT-2’s tokenizer, approaching GPT-4 territory in some aspects.
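If that compression figure feels abstract: it's just total characters divided by total tokens. Here's a tiny Python sketch of the calculation (the encode function is a placeholder for whatever your tokenizer exposes, not nanochat's actual API):
# Compression ratio = raw characters per emitted token.
# `encode` is a stand-in for your tokenizer's encode call.
def compression_ratio(encode, texts):
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens

# Toy demo with a whitespace "tokenizer", just to show the arithmetic:
docs = ["The quick brown fox jumps over the lazy dog."]
print(compression_ratio(lambda t: t.split(), docs))  # nanochat's BPE reportedly lands around 4.8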
The training itself follows Chinchilla scaling laws (there's a quick sanity check of the arithmetic after this list):
560M parameters × 20 tokens per parameter ≈ 11.2B training tokens
Total compute: ~4e19 FLOPs
Automatic learning rate scaling (1/√dim)
Muon optimizer for matrix parameters, AdamW for embeddings
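You can verify those figures with back-of-the-envelope math. The 6 × params × tokens approximation below is the standard rule of thumb for training compute, not a formula taken from the repo:
# Chinchilla-style sanity check for the speedrun model (numbers from this post).
params = 560e6                    # 560M parameters
tokens = params * 20              # rule of thumb: ~20 tokens per parameter
flops = 6 * params * tokens       # standard ~6·N·D estimate of training compute

print(f"tokens: {tokens:.3g}")    # ~1.12e10, i.e. 11.2B tokens
print(f"FLOPs:  {flops:.3g}")     # ~3.8e19, roughly the 4e19 quoted above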
The Evaluation Suite
Here’s something I really appreciate – the project includes a complete evaluation framework. You can benchmark your model on:
World Knowledge: ARC-Easy/Challenge, MMLU
Math: GSM8K (elementary school math problems)
Coding: HumanEval (Python programming tasks)
ChatCORE: A comprehensive metric combining everything
Run it with:
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i mid
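For context on how the multiple-choice benchmarks (MMLU, ARC) are typically scored: you score each candidate answer under the model and pick the most likely one. The snippet below is a generic illustration of that recipe, not nanochat's evaluation code, and score_option is a dummy stand-in for a real model call:
# Generic multiple-choice accuracy loop (MMLU/ARC style).
# score_option is a placeholder; a real harness would return the model's
# log-probability of the answer option given the question.
def score_option(question, option):
    return -len(option)  # dummy heuristic just so the sketch runs end to end

def accuracy(examples):
    correct = 0
    for question, options, gold_index in examples:
        pred = max(range(len(options)), key=lambda i: score_option(question, options[i]))
        correct += int(pred == gold_index)
    return correct / len(examples)

examples = [("What is 2 + 2?", ["4", "5", "22"], 0)]
print(accuracy(examples))  # 1.0 on this toy example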
The RL Magic
One of the coolest parts? The reinforcement learning implementation. Karpathy stripped away all the complexity and kept what works (a toy sketch of the resulting update follows below):
No trust regions (bye-bye reference model and KL regularization)
On-policy learning (no PPO ratio clipping)
GAPO-style normalization at the token level
Simple reward shifting
Despite the simplifications, results speak for themselves: GSM8K performance jumps from 4.5% to 7.6%.
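To make that concrete, here's a toy PyTorch sketch of what such a stripped-down policy-gradient update can look like: mean-shifted rewards as the advantage, a plain REINFORCE-style loss normalized over tokens, no ratio clipping, no KL term. This is my own illustration, not code from the repo (a real version would also mask out prompt and padding tokens):
import torch

# token_logprobs: (num_samples, num_tokens) log-probs of sampled completion tokens
# rewards:        (num_samples,) scalar reward per completion
def simple_pg_loss(token_logprobs, rewards):
    advantages = rewards - rewards.mean()               # mean shift only, no std division
    per_token = -advantages[:, None] * token_logprobs   # REINFORCE-style objective
    return per_token.sum() / token_logprobs.numel()     # normalize at the token level

token_logprobs = torch.randn(4, 16, requires_grad=True)  # stand-in for model outputs
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = simple_pg_loss(token_logprobs, rewards)
loss.backward()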
Running on Different Hardware
Don’t have 8×H100s lying around? (Who does, right?) The code is surprisingly flexible:
Single GPU: Remove torchrun and expect roughly 8× longer training time
Memory constrained? Adjust the batch size (see the sketch after this list):
--device_batch_size=16 # 40GB VRAM
--device_batch_size=8 # 20GB VRAM
--device_batch_size=4 # 10GB VRAM
A100 nodes: Fully compatible, slightly slower than H100s
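One note on that batch-size knob: as I understand it, the training script keeps the recipe intact by doing more gradient accumulation when device_batch_size shrinks, so the effective batch size stays the same. Here's a minimal, generic PyTorch sketch of that idea (not nanochat's actual training loop):
import torch

target_batch_size = 32      # the batch size the recipe was tuned for
device_batch_size = 8       # what fits in your GPU's memory
accum_steps = target_batch_size // device_batch_size  # 4 micro-batches per optimizer step

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(2):  # toy training loop
    optimizer.zero_grad()
    for _ in range(accum_steps):
        x = torch.randn(device_batch_size, 10)
        y = torch.randn(device_batch_size, 1)
        loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps  # scale so grads match one big batch
        loss.backward()  # gradients accumulate across micro-batches
    optimizer.step()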
Why This Matters
Look, I’ve seen a lot of AI projects that promise to make things “simple.” Most of them are either oversimplified toys or still require a PhD to understand. Nanochat is different.
Karpathy designed this with what he calls “maximum forkability” in mind:
Minimal cognitive complexity
No giant config objects
No model factories
No if-then-else monsters
You can actually understand what’s happening at each step. Want to experiment with a different tokenizer? Go for it. Different dataset? Easy. Tweak hyperparameters? Just change the values.
The Community is Already Buzzing
Since the release, people are already building on top of it. Chinmay Kak shared his nanosft project (single-file fine-tuning implementation). Others are experimenting with different tokenization algorithms and training regimes.
My Take
This is education disguised as a training framework. Yes, you can build a useful model with it. But more importantly, you can learn how these systems actually work without drowning in abstraction layers.
The fact that it’s MIT licensed means you can fork it, modify it, commercialize it – whatever you need. It’s a strong baseline that invites experimentation.
Want to Dive In?
Here are your starting points:
GitHub: github.com/karpathy/nanochat
Technical Discussion: Check the GitHub discussions
Community: Join the Discord channel
Fair warning: This is explicitly described as “far from done.” It’s a strong baseline, not a production-ready framework. But that’s exactly what makes it valuable – there’s room for you to make it your own.
Final Thoughts
We’re living in a time where training a conversational AI is no longer the exclusive domain of massive tech companies. For the price of a nice dinner for two, you can spin up your own language model and actually understand how it works.
That’s pretty incredible when you think about it.
Now if you’ll excuse me, I have some experiments to run. 🚀