A practical guide to using YouTube subtitles as training data for language models - from extraction to cleaning.
---
If you've ever tried to fine-tune a language model, you know the hardest part isn't the training - it's finding good data. Public datasets like Common Crawl are massive but noisy. Wikipedia is clean but limited in conversational tone. And most domain-specific corpora are either paywalled or too small to be useful. After months of experimenting, I found an underrated source that changed everything for me: YouTube subtitles.

YouTube hosts over 800 million videos, and a significant portion of them have subtitles - either manually uploaded by creators or auto-generated by Google's speech recognition. That's an enormous, continuously updated corpus of natural language, covering virtually every topic and language imaginable.

Here's how I extracted, cleaned, and structured 10,000+ transcripts into a usable dataset for LLM fine-tuning.
---
## Why YouTube Subtitles Work for AI Training

Before diving into the how, let's talk about the why. YouTube transcripts have several properties that make them uniquely valuable for language model training:

1. Natural conversational tone. Unlike Wikipedia or academic papers, YouTube transcripts capture how people actually talk. This is critical if you're building a model that needs to sound human.
2. Domain diversity. Want transcripts about quantum physics? Cooking? Legal advice? Software engineering? YouTube has it all, often explained at multiple levels of complexity.
3. Multilingual coverage. Many channels provide subtitles in multiple languages, making it possible to build parallel corpora for translation tasks.
4. Continuous updates. New content is uploaded every minute. Your dataset never goes stale.
5. Structured metadata. Each transcript comes with a video title, channel name, and timestamps - useful for building context-aware training examples.
---
## Step 1: Choosing Your Data Source Strategy
There are three main approaches to collecting YouTube transcripts at scale:
### Option A: YouTube Data API + youtube-transcript-api (Python)
The most common approach in tutorials. You use the YouTube Data API to search for videos, then a library like `youtube-transcript-api` to fetch subtitles:
```python
from youtube_transcript_api import YouTubeTranscriptApi

transcript = YouTubeTranscriptApi.get_transcript("video_id")
```
Pros: Free, programmable, good for small datasets.
Cons: Rate-limited, breaks frequently, requires API key management, and doesn't handle playlists or channels natively.
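Because individual fetches fail so often in practice (subtitles disabled, videos removed, rate limits), it helps to wrap them in a loop that records failures instead of crashing. A minimal sketch; the function name is mine, and `fetch` stands in for whatever per-video call you use (e.g. `YouTubeTranscriptApi.get_transcript`):

```python
def fetch_transcripts(video_ids, fetch):
    """Fetch transcripts for many videos, collecting failures instead of crashing.

    `fetch` is any callable that takes a video ID and returns a transcript.
    """
    results, failed = {}, []
    for vid in video_ids:
        try:
            results[vid] = fetch(vid)
        except Exception:
            # Subtitles disabled, video removed, rate limit hit, etc.
            failed.append(vid)
    return results, failed
```

Logging the `failed` list lets you retry later instead of losing an entire batch to one bad video.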
### Option B: yt-dlp with subtitle flags
yt-dlp can download subtitles alongside videos:
```shell
yt-dlp --write-sub --sub-lang en --skip-download "VIDEO_URL"
```
Pros: Reliable, handles many edge cases.
Cons: Designed for single videos. Scaling to thousands requires custom scripting.
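The custom scripting usually amounts to generating one yt-dlp invocation per URL and shelling out in a loop. A sketch under my own assumptions (helper names and the output filename template are illustrative choices):

```python
import subprocess

def ytdlp_subtitle_cmd(url, lang="en", out_dir="subs"):
    """Build a yt-dlp command that downloads only the subtitles for one URL."""
    return [
        "yt-dlp",
        "--write-sub", "--sub-lang", lang,
        "--skip-download",
        "-o", f"{out_dir}/%(id)s.%(ext)s",  # name files by video ID
        url,
    ]

def download_all_subtitles(urls, lang="en"):
    for url in urls:
        # check=False: keep going even if one video fails
        subprocess.run(ytdlp_subtitle_cmd(url, lang), check=False)
```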
### Option C: Dedicated subtitle extraction tools
This is the approach I ended up using. Tools like YTVidHub let you paste an entire playlist or channel URL and extract all subtitles in bulk - as SRT, VTT, or clean TXT files.
For my project, this was the fastest path. I needed transcripts from 50+ educational playlists, and doing it one video at a time wasn't practical. Being able to paste a playlist URL and get a ZIP file of all transcripts saved me days of work.
---
## Step 2: Choosing the Right Format

YouTube subtitles come in three main formats, and the choice matters for your pipeline:

| Format | Contains Timestamps | Best For |
| ------ | ------------------- | -------- |
| SRT | Yes (HH:MM:SS,ms) | Video editing, time-aligned training |
| VTT | Yes (HH:MM:SS.ms) | Web applications, HTML5 players |
| TXT | No | LLM training, text analysis |

For LLM fine-tuning, TXT is almost always what you want. Timestamps add noise that your model doesn't need. If you're doing speech recognition or time-aligned tasks, SRT gives you the timing data.

Here's what raw SRT looks like vs. clean TXT:

SRT (before):
```
1
00:00:01,000 --> 00:00:04,500
Welcome back to the channel everyone

2
00:00:04,500 --> 00:00:08,200
Today we're going to talk about transformer architectures
```
TXT (after):

```
Welcome back to the channel everyone. Today we're going to
talk about transformer architectures.
```
The difference in data quality is significant when you're processing thousands of files.
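If your tool only gives you SRT, the conversion shown above is mechanical enough to sketch with the stdlib (the function name is mine; restoring punctuation is out of scope here):

```python
import re

def srt_to_txt(srt: str) -> str:
    """Strip cue numbers and timestamp lines from SRT, keeping only the text."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit():
            continue  # blank separators and cue numbers
        if re.match(r'^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->', line):
            continue  # timestamp lines like 00:00:01,000 --> 00:00:04,500
        kept.append(line)
    return ' '.join(kept)
```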
---
## Step 3: Cleaning the Data

Raw transcripts - even in TXT format - need cleaning before they're useful for training. Here's my pipeline:

### 3.1 Remove auto-generated artifacts

YouTube's auto-generated subtitles often contain:

- `[Music]`, `[Applause]`, `[Laughter]` tags
- Repeated phrases from speech recognition errors
- Missing punctuation
```python
import re

def clean_transcript(text: str) -> str:
    # Remove bracket annotations like [Music] or [Applause]
    text = re.sub(r'\[.*?\]', '', text)
    # Remove duplicate consecutive lines
    lines = text.split('\n')
    cleaned = [lines[0]] if lines else []
    for line in lines[1:]:
        if line.strip() != cleaned[-1].strip():
            cleaned.append(line)
    return '\n'.join(cleaned).strip()
```
### 3.2 Filter by quality
Not all transcripts are equal. Auto-generated subtitles for fast-talking speakers or heavy-accent content tend to be low quality. I filter based on:
- Word count: Discard transcripts under 200 words (likely incomplete)
- Language detection: Use `langdetect` to verify the transcript matches the expected language
- Repetition ratio: If more than 30% of lines are duplicates, skip it
```python
from langdetect import detect

def is_quality_transcript(text: str, expected_lang: str = 'en') -> bool:
    words = text.split()
    if len(words) < 200:
        return False
    try:
        if detect(text) != expected_lang:
            return False
    except Exception:
        # langdetect raises on empty or non-linguistic input
        return False
    lines = [l.strip() for l in text.split('\n') if l.strip()]
    if len(lines) == 0:
        return False
    unique_ratio = len(set(lines)) / len(lines)
    if unique_ratio < 0.7:
        return False
    return True
```
### 3.3 Structure for training
For fine-tuning, I structure each transcript as a training example with metadata:
```json
{
  "source": "youtube",
  "video_title": "Understanding Transformer Architectures",
  "channel": "3Blue1Brown",
  "language": "en",
  "word_count": 4523,
  "text": "Welcome back to the channel everyone…"
}
```
This metadata lets you filter and weight examples during training. Educational content from established channels tends to produce better results than random vlogs.
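Writing these records out as JSONL - one JSON object per line - needs nothing beyond the stdlib. A sketch, with field names matching the example above (the helper names are mine):

```python
import json

def make_record(text, title, channel, lang='en'):
    """Package one cleaned transcript with its metadata."""
    return {
        "source": "youtube",
        "video_title": title,
        "channel": channel,
        "language": lang,
        "word_count": len(text.split()),
        "text": text,
    }

def write_jsonl(records, path):
    """Write one JSON object per line - the usual format for training data."""
    with open(path, 'w', encoding='utf-8') as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + '\n')
```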
---
## Step 4: Scaling to 10,000+ Transcripts

Here's where the approach matters. My target was 10,000 transcripts across five domains:
- Computer Science (2,000)
- Mathematics (2,000)
- Physics (2,000)
- Philosophy (2,000)
- General Science (2,000)

### Finding the right playlists

I curated a list of high-quality educational channels and their playlists:
- MIT OpenCourseWare (full lecture series)
- 3Blue1Brown (math visualizations)
- Veritasium (science explanations)
- Computerphile (CS topics)
- CrashCourse (general education)

Each channel has dozens of playlists, each containing 10–100+ videos. This is where bulk extraction becomes essential - processing 50 playlists one video at a time would take weeks.

### The extraction workflow
- Collect playlist URLs from target channels
- Extract all subtitles in TXT format (I used YTVidHub for bulk extraction)
- Run the cleaning pipeline
- Filter by quality metrics
- Structure into JSONL format

The entire pipeline from URL collection to final dataset took about 3 days, with most of the time spent on curation rather than extraction.
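Once extraction has produced a directory of .txt files, the remaining steps reduce to a small driver. A sketch under the assumption that the cleaning and quality functions are the ones from Step 3 (passed in as callables here so the driver stays generic):

```python
import json
from pathlib import Path

def build_dataset(txt_dir, out_path, clean, is_quality):
    """Clean and quality-filter a directory of transcripts,
    writing the survivors to a JSONL file. Returns the kept count."""
    kept = 0
    with open(out_path, 'w', encoding='utf-8') as out:
        for path in sorted(Path(txt_dir).glob('*.txt')):
            text = clean(path.read_text(encoding='utf-8'))
            if not is_quality(text):
                continue  # too short, wrong language, or too repetitive
            record = {"source": "youtube", "file": path.name,
                      "word_count": len(text.split()), "text": text}
            out.write(json.dumps(record, ensure_ascii=False) + '\n')
            kept += 1
    return kept
```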
---
## Step 5: Results and Lessons Learned

After processing, my final dataset contained:
- 10,847 transcripts (started with ~14,000, filtered out ~3,000)
- ~28 million words total
- Average transcript length: 2,580 words
- Languages: Primarily English, with ~500 Spanish transcripts

### What worked well
- Educational playlists are gold. Structured, well-spoken content with accurate subtitles.
- Bulk extraction saved enormous time. What would have taken weeks of scripting took hours.
- The quality filter was essential. About 22% of auto-generated transcripts were too noisy to use.

### What I'd do differently
- Start with manually-uploaded subtitles. They're significantly more accurate than auto-generated ones. Most educational channels upload their own.
- Include more diverse content. My dataset skewed heavily toward lecture-style content. Adding interviews, debates, and tutorials would improve conversational diversity.
- Build the pipeline incrementally. Don't try to download everything at once. Start with 100 transcripts, validate your cleaning pipeline, then scale.
---
## Final Thoughts

YouTube subtitles are one of the most accessible, diverse, and continuously updated text corpora available. For anyone building language models - whether for fine-tuning, RAG systems, or text analysis - they're a resource worth exploring.

The key is having an efficient extraction pipeline. Whether you use Python scripts, command-line tools, or dedicated platforms, the bottleneck is rarely the extraction itself - it's the curation and cleaning that determines your dataset quality.

If you're working on a similar project, I'd love to hear about your approach. What data sources have you found most useful for LLM training?
---
I'm a solo developer building tools for video content data extraction. If you're interested in bulk YouTube subtitle extraction, check out YTVidHub.