A new GitHub project called NanoEuler implements a full GPT-2-scale transformer in pure C and CUDA with no external frameworks. The repo surfaced on Hacker News, where it drew 35 points and 8 comments.
Model: NanoEuler | Scale: GPT-2 | Language: C/CUDA | Dependencies: None | License: Not specified
What It Is and How It Works
NanoEuler contains a complete transformer implementation written from scratch. It includes embedding layers, multi-head attention, feed-forward blocks, layer normalization, and a basic tokenizer, all compiled directly to CUDA kernels.
The code uses neither PyTorch nor TensorFlow and needs no Python runtime. Every matrix multiplication and attention operation runs through hand-written CUDA kernels. This approach forces explicit memory management and kernel fusion decisions that frameworks normally hide.
Benchmarks and Reported Numbers
No formal speed or accuracy numbers appear in the repository yet. The project states that it reaches GPT-2-scale parameter counts, but it supplies no tokens-per-second figures and no perplexity scores on standard datasets.
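For a rough sense of what "GPT-2 scale" means in parameters, the arithmetic below counts the weights of a GPT-2 style model from its hyperparameters. It is a sketch: the struct and function names are hypothetical, the output projection is assumed to be tied to the token embedding (as in GPT-2), and the configuration NanoEuler actually uses is not stated in the source. With the published GPT-2 small settings (vocabulary 50257, context 1024, width 768, 12 layers) the count comes to about 124 million.

```c
#include <stdint.h>

/* Hypothetical config struct; field names are illustrative. */
typedef struct {
    uint64_t vocab;   /* vocabulary size */
    uint64_t ctx;     /* maximum context length */
    uint64_t d_model; /* hidden width */
    uint64_t n_layer; /* number of transformer blocks */
} gpt2_config;

/* Counts parameters assuming tied input/output embeddings and a 4x MLP. */
uint64_t gpt2_param_count(gpt2_config c)
{
    uint64_t d = c.d_model;
    uint64_t embeddings = c.vocab * d + c.ctx * d;        /* token + position */
    uint64_t attn = (d * 3 * d + 3 * d) + (d * d + d);    /* QKV + output proj */
    uint64_t mlp = (d * 4 * d + 4 * d) + (4 * d * d + d); /* up + down proj */
    uint64_t norms = 2 * (2 * d);                         /* two LayerNorms */
    uint64_t final_norm = 2 * d;                          /* gain + bias */
    return embeddings + c.n_layer * (attn + mlp + norms) + final_norm;
}
```

Calling `gpt2_param_count((gpt2_config){50257, 1024, 768, 12})` returns 124439808.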
Early Hacker News discussion focused on the educational value rather than production metrics. Commenters noted the absence of cuBLAS or other libraries as both a strength for learning and a likely performance limiter.
How to Try It
Clone the repository and build with a standard CUDA toolkit.
    git clone https://github.com/JustVugg/nanoeuler
    cd nanoeuler
    make
    ./nanoeuler generate --prompt "The quick brown"
The build requires only nvcc and a CUDA-capable GPU. No Python environment or additional packages are needed. Users can inspect every kernel in the src/ directory to modify attention or normalization logic directly.
Pros and Cons
Pros:
- Single-language codebase simplifies debugging of memory layout and kernel launches.
- Zero framework overhead removes hidden Python or autograd costs.

Cons:
- Limited to basic GPT-2 architecture; no rotary embeddings, grouped-query attention, or flash attention yet.
- Performance depends entirely on the author's hand-written kernels, which currently lack the optimizations present in mature libraries.
Alternatives and Comparisons
Several projects already run transformers in C or CUDA with varying levels of abstraction.
| Project | Language | Dependencies | GPT-2 Scale Ready | Optimization Level |
|---|---|---|---|---|
| NanoEuler | C/CUDA | None | Yes | Basic kernels |
| llama.cpp | C/C++ | None | Yes | AVX2, NEON, cuBLAS |
| tinygrad | Python | Minimal | Yes | Metal/CUDA backend |
| mlc-llm | C++/CUDA | TVM | Yes | High |
llama.cpp offers broader hardware support and more mature kernels. NanoEuler trades that maturity for a smaller, fully transparent code surface.
Who Should Use This
Researchers teaching low-level CUDA programming or students building a first transformer from scratch will find the minimal surface useful. Teams needing maximum tokens per second or multi-GPU scaling should start with llama.cpp or MLC-LLM instead.
Developers who want to experiment with custom attention variants without framework constraints can fork NanoEuler quickly.
Bottom Line and Verdict
NanoEuler demonstrates that a complete GPT-2-scale model can be written and run using only C and CUDA. Today it serves best as a learning artifact rather than a production inference engine.
The project lowers the barrier for anyone who wants to read and modify every line of a working language model without framework abstractions. Future kernel improvements could shift its position from educational example to practical minimal runtime.