The discussion on Hacker News about the gap between open weights LLMs and closed source models drew 286 points and 218 comments. Participants examined concrete capability differences rather than abstract openness debates.
What the Gap Looks Like
Open weights models such as Llama 3.1 405B and Qwen 2.5 72B release full parameters for local or private deployment. Closed models like GPT-4o and Claude 3.5 Sonnet keep weights proprietary and deliver outputs only through APIs.
The gap appears most clearly in reasoning depth, long-context coherence, and instruction following. Commenters in the thread cited specific failure modes: open models lose accuracy on multi-step math and code-refactoring tasks that closed models complete more reliably.
Benchmark Numbers
Public leaderboards show the spread. On MMLU, Llama 3.1 405B scores 88.6 while GPT-4o reaches 88.7, a difference of 0.1 points. On GPQA, the same open model trails the two closed models by 2.6 to 5.0 points. On HumanEval, it trails by 3.0 points, the low end of the 3-8 point coding deficit seen across current open-weights releases.
| Benchmark | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| MMLU | 88.6 | 88.7 | 88.3 |
| GPQA | 51.1 | 56.1 | 53.7 |
| HumanEval | 89.0 | 92.0 | 92.0 |
These margins narrow when open models receive additional post-training or synthetic data, but the delta remains measurable on harder reasoning sets.
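The per-benchmark deltas can be recomputed directly from the table. The snippet below is a minimal sketch that copies the table values verbatim; the function name `gap_to_best_closed` is illustrative and not part of any library.

```python
# Scores copied from the benchmark table above (higher is better).
SCORES = {
    "MMLU": {"Llama 3.1 405B": 88.6, "GPT-4o": 88.7, "Claude 3.5 Sonnet": 88.3},
    "GPQA": {"Llama 3.1 405B": 51.1, "GPT-4o": 56.1, "Claude 3.5 Sonnet": 53.7},
    "HumanEval": {"Llama 3.1 405B": 89.0, "GPT-4o": 92.0, "Claude 3.5 Sonnet": 92.0},
}

OPEN_MODEL = "Llama 3.1 405B"


def gap_to_best_closed(scores, open_model=OPEN_MODEL):
    """Return {benchmark: best closed score minus the open model's score}."""
    gaps = {}
    for bench, row in scores.items():
        # Best score among every model in the row except the open one.
        best_closed = max(v for name, v in row.items() if name != open_model)
        # Round to one decimal to hide floating-point noise.
        gaps[bench] = round(best_closed - row[open_model], 1)
    return gaps


if __name__ == "__main__":
    for bench, gap in gap_to_best_closed(SCORES).items():
        print(f"{bench}: {gap:+.1f} points")
```

Measuring against the best closed score per benchmark gives the most pessimistic view of the open model. Comparing against a single closed model instead would shrink the GPQA figure.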
How to Test the Difference
Run both model classes on the same private dataset using identical prompts. EleutherAI's LM Evaluation Harness produces comparable scores across models, and for locally hosted open-weights models it runs without API rate limits.
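A same-prompts comparison can be sketched in a few lines before reaching for a full harness. The example below assumes exact-match scoring and uses stub functions in place of real models; `open_stub`, `closed_stub`, and the toy dataset are hypothetical placeholders, and in practice each callable would wrap a local inference server or a vendor API client.

```python
from typing import Callable, Dict, List, Tuple

ModelFn = Callable[[str], str]  # takes a prompt, returns the model's text output


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not scored as an error."""
    return " ".join(text.lower().split())


def exact_match_accuracy(model: ModelFn, dataset: List[Tuple[str, str]]) -> float:
    """Fraction of prompts whose normalized output equals the normalized reference."""
    if not dataset:
        raise ValueError("dataset is empty")
    hits = sum(normalize(model(prompt)) == normalize(ref) for prompt, ref in dataset)
    return hits / len(dataset)


def compare(models: Dict[str, ModelFn], dataset: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score every model on the identical list of (prompt, reference) pairs."""
    return {name: exact_match_accuracy(fn, dataset) for name, fn in models.items()}


# Toy private dataset (hypothetical): (prompt, reference answer).
DATASET = [
    ("2+2=", "4"),
    ("Capital of France?", "Paris"),
    ("3*3=", "9"),
    ("Reverse 'ab'", "ba"),
]

# Stub models (hypothetical): canned answers standing in for real inference calls.
_OPEN_ANSWERS = {"2+2=": "4", "Capital of France?": " paris ", "3*3=": "6", "Reverse 'ab'": "ba"}
_CLOSED_ANSWERS = {"2+2=": "4", "Capital of France?": "Paris", "3*3=": "9", "Reverse 'ab'": "ba"}


def open_stub(prompt: str) -> str:
    return _OPEN_ANSWERS[prompt]


def closed_stub(prompt: str) -> str:
    return _CLOSED_ANSWERS[prompt]


if __name__ == "__main__":
    print(compare({"open_stub": open_stub, "closed_stub": closed_stub}, DATASET))
```

Exact match is the strictest scorer and suits short factual or numeric answers. Free-form outputs such as refactored code usually need a task-specific check, for example running unit tests against the generated code.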
For production checks, measure latency and cost per token on a 10k-prompt sample. After hardware amortization, open-weights inference on 8xH100 nodes typically costs $1.80-$2.40 per million tokens, versus $2.50-$15.00 per million tokens for closed APIs, depending on the model tier.
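The self-hosted figure follows from two measurements: what the node costs per hour and how many tokens it sustains per second. The sketch below shows the arithmetic; the $24/hour node cost and the 3,000 tokens/second throughput are illustrative assumptions rather than measured values, and the `utilization` parameter is a hypothetical knob for idle capacity.

```python
def self_hosted_cost_per_million(
    node_cost_per_hour: float,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """Dollars per million tokens for a node billed by the hour.

    utilization is the fraction of each hour spent serving tokens (0 < u <= 1).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return node_cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Assumed inputs: amortized 8xH100 node at $24/hour, 3,000 tokens/second aggregate.
    full = self_hosted_cost_per_million(24.0, 3000.0)
    half = self_hosted_cost_per_million(24.0, 3000.0, utilization=0.5)
    print(f"fully utilized: ${full:.2f} per million tokens")
    print(f"50% utilized:   ${half:.2f} per million tokens")
```

Under these assumed inputs the fully utilized cost is about $2.22 per million tokens, inside the range quoted above. Halving utilization doubles the cost, so a lightly loaded node can end up more expensive per token than the cheaper closed APIs.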
Tradeoffs
Open weights give full control over data residency and fine-tuning. They also expose users to higher inference engineering costs and slower iteration on new capabilities.
Closed models supply immediate access to the highest scores and managed uptime. They remove hardware decisions but introduce usage limits and price changes outside developer control.
Who Should Choose Which
Teams that handle sensitive data or need custom fine-tunes benefit from open weights once the 70B+ class closes most of the benchmark gap. Startups that prioritize shipping features quickly with minimal ops overhead gain more from closed APIs until model size and price converge further.
Verdict
The measurable gap has shrunk to single-digit point margins on many academic benchmarks, yet closed models retain an edge on the hardest reasoning and agent tasks. Developers can close the remaining distance with targeted synthetic data and longer context windows, but only when inference hardware budgets allow.
