A GitHub repository called claude-real-video shows how to feed video to any LLM by extracting frames and generating text descriptions. The project surfaced in a Hacker News thread that reached 70 points and 19 comments.
Repo: claude-real-video | Core method: frame sampling + captioning | Target models: Claude, GPT, Llama, Mistral | License: MIT
How Claude-Real-Video Works
The script samples video at fixed intervals, sends each frame to a vision model for captioning, then concatenates the captions with timestamps into a single text prompt. The resulting text is passed to the target LLM exactly like any other document.
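The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's actual code: the function names (`sample_timestamps`, `format_timestamp`, `build_transcript`) and the `caption_at` callback are hypothetical, and the captioner is a stub standing in for a real vision-model call.

```python
from typing import Callable, List


def sample_timestamps(duration_s: float, interval_s: float) -> List[float]:
    """Return the times (in seconds) at which frames are sampled."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    times = []
    i = 0
    # Sample at 0, interval, 2*interval, ... strictly before the end of the video.
    while i * interval_s < duration_s:
        times.append(i * interval_s)
        i += 1
    return times


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def build_transcript(
    duration_s: float,
    interval_s: float,
    caption_at: Callable[[float], str],
) -> str:
    """Caption each sampled frame and join the captions with timestamps."""
    lines = [
        f"[{format_timestamp(t)}] {caption_at(t)}"
        for t in sample_timestamps(duration_s, interval_s)
    ]
    return "\n".join(lines)


def stub_caption(t: float) -> str:
    # Hypothetical stand-in: a real captioner would decode the frame at
    # time t and send it to a vision model.
    return f"placeholder caption for frame at {t:g}s"


print(build_transcript(10, 5, stub_caption))
# [00:00] placeholder caption for frame at 0s
# [00:05] placeholder caption for frame at 5s
```

Passing the captioner in as a callback keeps the sampling and formatting logic independent of whichever vision backend is used.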
No model weights are changed, so the approach also works with closed, API-only models that accept only text or images.
Setup Steps
Clone the repository and install the listed Python dependencies. Provide an input video path and choose a captioning backend such as GPT-4o-mini or a local BLIP-2 instance. Run the main script to produce a timestamped transcript file that can be copied into any chat interface.
A typical command sequence uses one line for sampling and one for caption generation. Output length scales with video duration and the chosen frame rate.
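To make the scaling concrete, here is a rough back-of-the-envelope estimate. The sampling interval (one frame every 5 seconds) and the caption length (35 tokens) are illustrative assumptions, not values taken from the repository, and `estimate_transcript_tokens` is a hypothetical helper.

```python
import math


def estimate_transcript_tokens(
    duration_s: float, interval_s: float, tokens_per_caption: int
) -> int:
    """Estimate transcript size as (number of sampled frames) x (tokens per caption)."""
    frames = math.ceil(duration_s / interval_s)
    return frames * tokens_per_caption


# Assumed values: a 5-minute clip, one frame every 5 s, ~35 tokens per caption.
print(estimate_transcript_tokens(300, 5, 35))  # 2100
```

Doubling either the duration or the sampling rate doubles the estimate, which is the linear scaling the text describes.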
Performance Numbers Reported
Early users on the thread report processing a 5-minute 1080p clip in 45–70 seconds on an RTX 3060 when using a 7B caption model. The final transcript for a clip of that length runs to roughly 1,800–2,400 tokens.
With paid vision APIs, cost grows linearly with video length. Local caption models remove per-token fees but require 8–12 GB of VRAM.
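As a worked example of the linear scaling, the sketch below extrapolates from the $0.01–0.04 per 5 minutes range listed in this article's comparison table. The helper name `scale_cost` and the 30-minute duration are illustrative assumptions; actual pricing depends on the vision API and the sampling rate.

```python
def scale_cost(duration_min: float, cost_per_5_min: float) -> float:
    """Scale a per-5-minute cost linearly to an arbitrary duration."""
    return duration_min / 5 * cost_per_5_min


# Extrapolate the article's 5-minute range to a 30-minute video.
low = scale_cost(30, 0.01)
high = scale_cost(30, 0.04)
print(f"${low:.2f} to ${high:.2f}")  # $0.06 to $0.24
```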
Comparison with Native Video Models
| Feature | claude-real-video | GPT-4o video | Gemini 1.5 Pro | Video-LLaMA 2 |
|---|---|---|---|---|
| Works with any LLM | Yes | No | No | No |
| Max video length | Unlimited (text) | 20 min | 1 hour | 10 min |
| Cost per 5 min | $0.01–0.04 | $0.15 | $0.08 | Free (local) |
| Requires vision API | Optional | Required | Required | No |
The text-based route trades visual fidelity for flexibility and freedom from video length limits.
Pros and Cons
Pros:
- Works with models that have no native video support
- No additional fine-tuning needed
- Output can be edited before feeding the LLM

Cons:
- Frame sampling loses motion details and fast actions
- Caption quality depends on the vision model chosen
- Adds latency compared with end-to-end video models
Who Should Use This
Developers who already have strong text-only workflows and need occasional video context benefit most. Researchers testing new LLMs on video benchmarks without waiting for native multimodal releases will also find it practical. Teams requiring precise motion analysis or real-time streaming should skip it and use dedicated video models instead.
Bottom line: claude-real-video gives any text-capable LLM immediate access to video content without waiting for new multimodal releases.
The method lowers the barrier for existing text pipelines while highlighting the remaining gap in native long-context video understanding.