Feb 19, 2026 - 10 MIN READ
Building a YouTube Video Research Pipeline with Whisper and Claude

How I built a CLI tool that downloads YouTube videos, transcribes them with Whisper, and generates structured market intelligence using Claude — all with one command.

Mariusz Smenzyk

AI Developer ✨ MusicTech ✨ SportTech

I watch a lot of YouTube — competitor demos, conference talks, market analyses, investor presentations. The problem? Key insights get lost. I'd watch a 45-minute video, think "that was great," and a week later remember almost nothing specific.

So I built a tool that turns any YouTube video into a structured intelligence report — automatically. One CLI command: download, transcribe, analyze, display.


The Problem

Watching YouTube videos for market research is like drinking from a firehose:

  • No searchability — you can't search across video content
  • No structure — insights are scattered across 30-60 minute recordings
  • No persistence — notes are informal, scattered, incomplete
  • No shareability — team members have to rewatch the same videos

As a hardware startup founder building BeatBuddy — a swimming wearable — I consume hours of video content every week about wearable tech, sports science, embedded systems, and competitor products. I needed a system that could extract structured, actionable intelligence from all this content.


The Solution: A Four-Step Pipeline

The idea is simple: chain together four mature tools.

YouTube URL → Audio (yt-dlp) → Transcript (Whisper) → Analysis (Claude) → Web Page (Nuxt)

One command runs the entire pipeline:

bb-dataroom video add "https://www.youtube.com/watch?v=VIDEO_ID"

Let me walk through each step.


Step 1: Download Audio with yt-dlp

yt-dlp is the gold standard for YouTube downloading. Rather than calling it as a subprocess, I use its Python API directly for cleaner error handling:

import yt_dlp

opts = {
    "format": "bestaudio/best",
    "postprocessors": [{
        "key": "FFmpegExtractAudio",   # re-encode the download to audio-only
        "preferredcodec": "wav",
    }],
    # 16 kHz mono: the sample rate and channel count Whisper expects
    "postprocessor_args": ["-ar", "16000", "-ac", "1"],
}

with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download([url])

Key detail: we output 16kHz mono WAV, which is exactly what Whisper expects. Downsampling at download time saves a conversion step and shrinks files roughly 5.5x compared to 16-bit stereo 44.1kHz (32 KB/s vs. 176 KB/s).

We also extract rich metadata (title, channel, duration, thumbnail, tags) in a separate call without downloading the video — useful for the frontend display later.
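That metadata call uses the same yt-dlp API with `download=False`. A minimal sketch (the field selection here is illustrative, not necessarily the tool's exact code):

```python
def pick_metadata(info: dict) -> dict:
    """Reduce yt-dlp's large info dict to just the fields the frontend needs."""
    fields = ("title", "channel", "duration", "thumbnail", "tags")
    return {k: info.get(k) for k in fields}

def fetch_metadata(url: str) -> dict:
    """Fetch video metadata only; download=False skips the media entirely."""
    import yt_dlp  # deferred import: only needed when actually fetching
    with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
        return pick_metadata(ydl.extract_info(url, download=False))
```

Because no media is downloaded, this call takes a second or two even for hour-long videos.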


Step 2: Transcribe with OpenAI Whisper

If you're familiar with OpenAI's Whisper, you know the basics. Here we take it further:

import whisper

model = whisper.load_model("base")          # "base" trades some accuracy for speed
result = model.transcribe("audio.wav", verbose=False)

Auto-language detection is crucial for a research tool. We don't hardcode a language — Whisper detects it automatically. The same command works for English conference talks, Polish market reports, or Japanese tech demos.

Transcripts are cached as JSON sidecar files with full segment timestamps. Re-running the pipeline on the same video skips transcription entirely — no wasted compute.
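Here's a sketch of that sidecar caching. The selected fields mirror Whisper's result dict (`language`, `text`, and per-segment `start`/`end`/`text`); the exact file layout in the real tool may differ:

```python
import json
from pathlib import Path

def transcribe_cached(model, audio_path: str, cache_path: str) -> dict:
    """Transcribe with Whisper, reusing a JSON sidecar cache when present."""
    cache = Path(cache_path)
    if cache.exists():
        return json.loads(cache.read_text())          # cache hit: no compute
    result = model.transcribe(audio_path, verbose=False)
    # Keep detected language, full text, and per-segment timestamps
    payload = {
        "language": result["language"],
        "text": result["text"],
        "segments": [
            {"start": s["start"], "end": s["end"], "text": s["text"]}
            for s in result["segments"]
        ],
    }
    cache.write_text(json.dumps(payload, ensure_ascii=False, indent=2))
    return payload
```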


Step 3: Analyze with Claude

This is where it gets interesting. We send the full transcript to Claude with a structured prompt that returns JSON:

{
  "tl_dr": "2-3 sentence executive summary",
  "key_insights": [
    {
      "title": "Insight title",
      "description": "What was said",
      "relevance": "How it applies to our product"
    }
  ],
  "market_signals": [
    {
      "signal": "Trend or opportunity",
      "source": "Who mentioned it",
      "strength": "strong | moderate | weak"
    }
  ],
  "competitive_intel": [
    {
      "company_or_product": "Name",
      "detail": "What was mentioned",
      "threat_level": "high | medium | low | opportunity"
    }
  ],
  "action_items": [
    {
      "action": "What to do",
      "priority": "high | medium | low",
      "category": "product | marketing | technical | business"
    }
  ]
}

The analysis structure was designed for a startup context:

  • Key Insights — the most important takeaways, with explicit relevance mapping to our product
  • Market Signals — trends and shifts with strength ratings
  • Competitive Intelligence — companies and products mentioned, with threat assessment
  • Action Items — prioritized by category

I use claude-sonnet-4-5 by default: fast enough for single-video analysis while being thorough in extracting nuanced business intelligence. The prompt works with any video language, since Whisper transcribes in the source language and Claude handles non-English transcripts directly.
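The API call itself is small. A minimal sketch using the `anthropic` Python SDK; the prompt text here is a hypothetical stand-in, not the tool's actual template:

```python
import json

ANALYSIS_PROMPT = """You are a market analyst for a hardware startup.
Read the transcript below and reply with ONLY a JSON object with the keys:
tl_dr, key_insights, market_signals, competitive_intel, action_items.

Transcript:
{transcript}"""

def analyze_transcript(transcript: str, model: str = "claude-sonnet-4-5") -> dict:
    """Send the transcript to Claude and parse the structured JSON reply."""
    import anthropic  # deferred so the module imports without the SDK installed
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user",
                   "content": ANALYSIS_PROMPT.format(transcript=transcript)}],
    )
    return json.loads(response.content[0].text)
```

In practice you'd also want to handle the model occasionally wrapping the JSON in markdown fences or prose; a retry with a stricter instruction usually suffices.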


Step 4: Generate a Browsable Web Page

The analysis is written as a Nuxt Content markdown file with YAML frontmatter. The frontend renders it as part of our existing DataRoom application with:

  • Embedded YouTube player — watch the original alongside the analysis
  • Structured sections — each analysis category gets its own visual section
  • Tag-based filtering — the LLM auto-generates relevant tags for cross-video search
  • Screenshot gallery — manually added screenshots for key slides/moments
  • Collapsible transcript — full timestamped transcript for reference

The Research section sits in the sidebar alongside our other knowledge base tools (chat archives, analytics, AI insights).


Architecture Decisions Worth Noting

Why Not Just Use YouTube's Auto-Captions?

YouTube's auto-generated captions are decent for simple content, but:

  1. They lack punctuation and proper formatting
  2. They're not available for all videos
  3. They can't be easily processed offline
  4. Whisper produces significantly better results for non-English content and technical jargon

Why WAV Instead of MP3?

Whisper works best with uncompressed audio. MP3 compression introduces artifacts that can affect transcription accuracy, especially for quiet speech, heavy accents, or technical vocabulary. The WAV files are larger (~115 MB/hour) but we only keep them locally — they're excluded from git and deployment.

Why Cache at Every Step?

Each step has a different cost profile:

Step            Time       Cost
Download        10-30 s    Free
Transcription   1-5 min    Free (local)
LLM Analysis    10-30 s    ~$0.02-0.05

By caching at each step, re-running the pipeline (e.g., to try a different prompt or a different LLM model) only repeats the changed step. The --force flag bypasses all caches when needed.
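The caching pattern itself is one small function. A sketch, assuming each step can be represented as a callable producing a text artifact (names are illustrative, not the tool's actual code):

```python
from pathlib import Path

def run_step(name: str, cache_file: Path, compute, force: bool = False) -> str:
    """Run one pipeline step, reusing its cached artifact unless forced."""
    if cache_file.exists() and not force:
        return cache_file.read_text()          # cache hit: skip the work
    result = compute()                          # cache miss (or --force): do the work
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(result)
    return result
```

Chaining `run_step` calls for download, transcription, and analysis gives exactly the re-run behavior described above: change the prompt, delete `analysis.json` (or pass `force=True` for that step), and only the LLM call repeats.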

Why a CLI Instead of a Web App?

For a small team, a CLI is the right level of automation:

  • Fast to build — no auth, no queue, no webhooks
  • Easy to extend — adding a new analysis dimension is just editing a prompt template
  • Scriptable — can be integrated into batch workflows
  • Local-first — Whisper runs on your machine, no data leaves your network

What the Output Looks Like

Here's the asset structure for each analyzed video:

web/content/research/{slug}.md              # Nuxt Content page
web/public/research/{slug}/
  ├── audio.wav                              # Whisper input (16kHz mono)
  ├── thumbnail.jpg                          # YouTube thumbnail
  ├── transcript.json                        # Whisper output with segments
  ├── analysis.json                          # LLM analysis cache
  ├── metadata.json                          # Video metadata
  └── screenshots/                           # Manual screenshots

The generated markdown contains structured YAML frontmatter (video ID, channel, duration, tags, analysis date) that the Nuxt frontend uses for rendering cards, filtering, and navigation.
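For illustration, that frontmatter might look something like this (field names here are hypothetical; the actual schema depends on the Nuxt components consuming it):

```yaml
---
title: "Competitor Demo: Example Wearable"
videoId: "VIDEO_ID"
channel: "Example Channel"
duration: 2710            # seconds
analyzedAt: "2026-02-19"
tags: ["wearables", "competitor-analysis", "swimming"]
---
```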


Real-World Workflow

Here's what it looks like in practice:

# Analyze a competitor demo
bb-dataroom video add "https://youtube.com/watch?v=VIDEO_ID"

# Take some screenshots while watching, drop them in
cp ~/Screenshots/*.png web/public/research/video-slug/screenshots/
bb-dataroom video index

# Start the dev server and browse
cd web && pnpm dev

The result is a searchable library of analyzed videos, organized by tags, with structured intelligence that the whole team can reference. No more "I think I saw something about this in a video last month."


What's Next

A few ideas I'm considering:

  • Batch processing — analyze a YouTube playlist in one go
  • Channel monitoring — watch specific channels and auto-analyze new uploads
  • Cross-video synthesis — identify patterns across multiple analyzed videos (the existing AI synthesis pipeline from our chat analysis could be adapted)
  • Podcast support — same pipeline for audio-only content via RSS feeds

Wrapping Up

The gap between "watching a YouTube video" and "having actionable intelligence from it" is wider than most people realize. This pipeline bridges that gap with open-source tools and a structured LLM step.

Total development time was about 2 hours with Claude Code. The ROI is immediate — every analyzed video becomes a permanent, searchable knowledge asset instead of a fading memory.

If you're building a startup and consuming video content for research, something like this is worth building. The tools are mature, the code is straightforward, and the results compound over time.


Tech Stack