An AI agent given control of a full Civilization match built a nuclear weapon yet still lost to the opponent. The experiment surfaced on Hacker News with 73 points and 85 comments.
What It Is / How It Works
The setup placed an LLM in the role of a Civilization player. The model received game state descriptions and issued actions through text prompts at each turn. It developed nuclear capability but failed to convert that advantage into victory conditions such as domination or science victory.
The agent operated without persistent memory across turns beyond the prompt window. Each decision relied on the current state summary plus prior context injected by the experimenter.
How to Try It
Replicate the test with an open-source LLM and a Civilization clone or API wrapper.
- Install the open-source Civilization clone Freeciv and its Python bindings.
- Connect the game state exporter to an LLM via the OpenAI-compatible endpoint.
- Feed turn summaries as system prompts and parse model outputs into valid game commands.
- Log every decision and final score for post-run analysis.
Early testers on the thread reported 30-45 minutes per full match on consumer hardware when using 7B-13B models.
Benchmarks / Specs / Numbers
The reported run ended with the AI reaching the Atomic Era but finishing second in score. No exact turn count or final point totals appear in the thread, yet commenters noted the model launched one nuke without securing a military win.
| Metric | AI Run Result | Human Baseline |
|---|---|---|
| Nuclear tech reached | Yes | Yes |
| Final ranking | 2nd | 1st (win) |
| Match length | ~180 turns | 120-200 turns |
Alternatives and Comparisons
Similar experiments exist with other strategy environments.
| Environment | Model Size | Nuclear Option | Win Rate Reported |
|---|---|---|---|
| Civilization LLM | 7-13B | Yes | 0% |
| AlphaStar (StarCraft) | 100M+ | No | 85% vs pros |
| OpenAI Five (Dota) | 100M+ | No | 99.9% vs humans |
The Civilization test stands out for using an unmodified consumer LLM rather than reinforcement learning agents trained for millions of games.
Who Should Use This
Researchers testing LLM planning limits in long-horizon games will find the setup useful. Skip the approach if the goal is competitive play; current models lack the consistent strategy needed to beat even mid-level human opponents.
Developers building game agents should combine this prompting method with external memory or tree search to improve results.
Bottom Line / Verdict
The experiment shows current LLMs can discover advanced technologies yet still fail at converting them into overall victory in complex strategy games.
The gap between capability demonstration and consistent performance remains the central takeaway for anyone running similar tests.

Top comments (0)