How I built a CLI tool that downloads YouTube videos, transcribes them with Whisper, and generates structured market intelligence using Claude — all with one command.
Mariusz Smenzyk
AI Developer ✨ MusicTech ✨ SportTech
I watch a lot of YouTube — competitor demos, conference talks, market analyses, investor presentations. The problem? Key insights get lost. I'd watch a 45-minute video, think "that was great," and a week later remember almost nothing specific.
So I built a tool that turns any YouTube video into a structured intelligence report — automatically. One CLI command: download, transcribe, analyze, display.
Watching YouTube videos for market research is like drinking from a firehose.
As a hardware startup founder building BeatBuddy — a swimming wearable — I consume hours of video content every week about wearable tech, sports science, embedded systems, and competitor products. I needed a system that could extract structured, actionable intelligence from all this content.
The idea is simple — chain together three mature tools:
YouTube URL → Audio (yt-dlp) → Transcript (Whisper) → Analysis (Claude) → Web Page (Nuxt)
One command runs the entire pipeline:
```shell
bb-dataroom video add "https://www.youtube.com/watch?v=VIDEO_ID"
```
Let me walk through each step.
yt-dlp is the gold standard for YouTube downloading. Rather than calling it as a subprocess, I use its Python API directly for cleaner error handling:
```python
import yt_dlp

opts = {
    "format": "bestaudio/best",
    "postprocessors": [{
        "key": "FFmpegExtractAudio",
        "preferredcodec": "wav",
    }],
    # Resample to 16 kHz mono at extraction time
    "postprocessor_args": ["-ar", "16000", "-ac", "1"],
}

with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download([url])
```
Key detail: we output 16 kHz mono WAV, which is exactly what Whisper expects. Downsampling at download time saves a conversion step and cuts file size by roughly 5x compared to 16-bit stereo at 44.1 kHz.
We also extract rich metadata (title, channel, duration, thumbnail, tags) in a separate call without downloading the video — useful for the frontend display later.
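That metadata-only pass can be sketched with yt-dlp's `extract_info` (the info-dict field names are real yt-dlp fields; the helper names and the field selection here are my own):

```python
def pick_metadata(info: dict) -> dict:
    """Keep only the fields the frontend needs from a yt-dlp info dict."""
    return {
        "title": info.get("title"),
        "channel": info.get("channel") or info.get("uploader"),
        "duration": info.get("duration"),    # seconds
        "thumbnail": info.get("thumbnail"),  # URL of the best thumbnail
        "tags": info.get("tags", []),
    }

def fetch_metadata(url: str) -> dict:
    import yt_dlp  # third-party: pip install yt-dlp
    # download=False returns the info dict without fetching any media
    with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
        info = ydl.extract_info(url, download=False)
    return pick_metadata(info)
```

Splitting the pure field selection from the network call keeps the mapping easy to adjust as the frontend evolves.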
If you're familiar with OpenAI's Whisper, you know the basics. Here we take it further:
```python
import whisper

model = whisper.load_model("base")
# result contains "text", "segments" (with timestamps), and "language"
result = model.transcribe("audio.wav", verbose=False)
```
Auto-language detection is crucial for a research tool. We don't hardcode a language — Whisper detects it automatically. The same command works for English conference talks, Polish market reports, or Japanese tech demos.
Transcripts are cached as JSON sidecar files with full segment timestamps. Re-running the pipeline on the same video skips transcription entirely — no wasted compute.
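The sidecar caching can be sketched like this (the paths and payload shape are illustrative; `text`, `segments`, and `language` are real keys in Whisper's result dict):

```python
import json
from pathlib import Path

def transcribe_cached(model, audio_path: Path) -> dict:
    """Transcribe once, then reuse the JSON sidecar on later runs."""
    cache = audio_path.with_suffix(".transcript.json")
    if cache.exists():
        return json.loads(cache.read_text())
    result = model.transcribe(str(audio_path), verbose=False)
    payload = {
        "text": result["text"],
        "language": result["language"],  # auto-detected by Whisper
        "segments": [
            {"start": s["start"], "end": s["end"], "text": s["text"]}
            for s in result["segments"]
        ],
    }
    cache.write_text(json.dumps(payload, ensure_ascii=False, indent=2))
    return payload
```

On a cache hit the model is never touched, so re-running the pipeline costs nothing at this step.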
This is where it gets interesting. We send the full transcript to Claude with a structured prompt that returns JSON:
```json
{
  "tl_dr": "2-3 sentence executive summary",
  "key_insights": [
    {
      "title": "Insight title",
      "description": "What was said",
      "relevance": "How it applies to our product"
    }
  ],
  "market_signals": [
    {
      "signal": "Trend or opportunity",
      "source": "Who mentioned it",
      "strength": "strong | moderate | weak"
    }
  ],
  "competitive_intel": [
    {
      "company_or_product": "Name",
      "detail": "What was mentioned",
      "threat_level": "high | medium | low | opportunity"
    }
  ],
  "action_items": [
    {
      "action": "What to do",
      "priority": "high | medium | low",
      "category": "product | marketing | technical | business"
    }
  ]
}
```
The analysis structure was designed for a startup context: insights are tied to product relevance, market signals carry a strength rating, and action items come with a priority and category.
I use claude-sonnet-4-5 by default — it's fast enough for single-video analysis while being thorough in extracting nuanced business intelligence. The prompt is designed to work with any video language since Whisper handles the translation implicitly.
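A sketch of that analysis step using the Anthropic Python SDK. The prompt wording and the `analyze_transcript` helper are illustrative, and the exact model identifier string may differ from what the SDK expects:

```python
import json
import os

def build_prompt(transcript: str) -> str:
    """Illustrative prompt; the real tool's prompt is more detailed."""
    return (
        "You are a market-intelligence analyst for a hardware startup.\n"
        "Analyze the transcript below and reply with ONLY a JSON object with "
        "keys: tl_dr, key_insights, market_signals, competitive_intel, "
        "action_items.\n\nTRANSCRIPT:\n" + transcript
    )

def analyze_transcript(transcript: str, model: str = "claude-sonnet-4-5") -> dict:
    import anthropic  # third-party: pip install anthropic
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    message = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": build_prompt(transcript)}],
    )
    # The model returns the JSON as text in the first content block
    return json.loads(message.content[0].text)
```

Asking for "ONLY a JSON object" and calling `json.loads` on the reply is the simplest pattern; production code would add validation and a retry on parse failure.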
The analysis is written as a Nuxt Content markdown file with YAML frontmatter, and the frontend renders it as part of our existing DataRoom application.
The Research section sits in the sidebar alongside our other knowledge base tools (chat archives, analytics, AI insights).
YouTube's auto-generated captions are decent for simple content, but they struggle with technical vocabulary, accents, and non-English material, so this pipeline transcribes with Whisper instead.
Whisper works best with uncompressed audio. MP3 compression introduces artifacts that can affect transcription accuracy, especially for quiet speech, heavy accents, or technical vocabulary. The WAV files are larger (~115 MB/hour) but we only keep them locally — they're excluded from git and deployment.
Each step has a different cost profile:
| Step | Time | Cost |
|---|---|---|
| Download | 10-30s | Free |
| Transcription | 1-5 min | Free (local) |
| LLM Analysis | 10-30s | ~$0.02-0.05 |
By caching at each step, re-running the pipeline (e.g., to try a different prompt or a different LLM model) only repeats the changed step. The `--force` flag bypasses all caches when needed.
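The step-skipping logic reduces to a small helper; this is a sketch of the pattern, not the tool's actual code:

```python
from pathlib import Path

def run_step(output: Path, produce, force: bool = False) -> Path:
    """Run a pipeline step only when its cached output is missing.

    `produce` writes the step's result to `output`; `force` mirrors --force.
    """
    if force or not output.exists():
        produce(output)
    return output
```

Each stage (download, transcribe, analyze) gets its own output file, so changing the prompt only re-runs the analysis stage while the audio and transcript caches are reused.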
For a small team, a CLI is the right level of automation.
Here's the asset structure for each analyzed video:
```
web/content/research/{slug}.md       # Nuxt Content page
web/public/research/{slug}/
├── audio.wav        # Whisper input (16 kHz mono)
├── thumbnail.jpg    # YouTube thumbnail
├── transcript.json  # Whisper output with segments
├── analysis.json    # LLM analysis cache
├── metadata.json    # Video metadata
└── screenshots/     # Manual screenshots
```
The generated markdown contains structured YAML frontmatter (video ID, channel, duration, tags, analysis date) that the Nuxt frontend uses for rendering cards, filtering, and navigation.
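Emitting that file can be sketched as follows (the frontmatter keys mirror the fields listed above, but the real tool's keys and layout may differ):

```python
import json
from pathlib import Path

def write_research_page(slug: str, meta: dict, analysis: dict, out_dir: Path) -> Path:
    """Write a Nuxt Content markdown file with YAML frontmatter."""
    frontmatter = "\n".join([
        "---",
        f"title: {json.dumps(meta['title'])}",   # JSON strings are valid YAML scalars
        f"videoId: {meta['video_id']}",
        f"channel: {json.dumps(meta['channel'])}",
        f"duration: {meta['duration']}",
        f"tags: {json.dumps(meta.get('tags', []))}",  # JSON list = YAML flow sequence
        f"analyzedAt: {meta['analyzed_at']}",
        "---",
    ])
    body = f"\n## TL;DR\n\n{analysis['tl_dr']}\n"
    page = out_dir / f"{slug}.md"
    page.write_text(frontmatter + "\n" + body)
    return page
```

Using `json.dumps` for the title and tags yields valid YAML without pulling in a YAML library, since JSON is a subset of YAML.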
Here's what it looks like in practice:
```shell
# Analyze a competitor demo
bb-dataroom video add "https://youtube.com/watch?v=VIDEO_ID"

# Take some screenshots while watching, drop them in
cp ~/Screenshots/*.png web/public/research/video-slug/screenshots/
bb-dataroom video index

# Start the dev server and browse
cd web && pnpm dev
```
The result is a searchable library of analyzed videos, organized by tags, with structured intelligence that the whole team can reference. No more "I think I saw something about this in a video last month."
There are a few ideas I'm considering for future iterations.
The gap between "watching a YouTube video" and "having actionable intelligence from it" is wider than most people realize. This pipeline bridges that gap with open-source tools and a structured LLM step.
Total development time was about 2 hours with Claude Code. The ROI is immediate — every analyzed video becomes a permanent, searchable knowledge asset instead of a fading memory.
If you're building a startup and consuming video content for research, something like this is worth building. The tools are mature, the code is straightforward, and the results compound over time.