When I started learning Deep Learning, I had a playlist with 50 hour-long lectures. I wanted to learn, but watching 50 hours of video felt overwhelming. That experience made me realize that as engineers, we can create tools to help manage all this content. In this article, I’ll show you how to build an AI system with Python that can summarize YouTube videos into notes.
AI System to Summarize YouTube Videos
We’re going to build a personal research assistant. This pipeline will pull subtitles from a video, break them into smaller pieces, and use a large language model to create clear notes for you.
Step 0: The Setup
First, let’s get our tools ready. We’ll use the Hugging Face transformers ecosystem, which is a standard for NLP. Here are the libraries you’ll need:
- youtube-transcript-api: To scrape the subtitles.
- transformers & accelerate: To run the AI model.
- sentencepiece: A tokenizer required by the T5 model family.
Run this command in your Google Colab notebook to install these libraries:
!pip install -U youtube-transcript-api transformers accelerate sentencepiece
Step 1: Extracting Transcripts
Data powers AI. Our first step is to get the text from the video. YouTube URLs can be messy, like shortened or mobile links, so we need a way to clean them and pull out the video_id:
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound
import re
def extract_video_id(url):
    """Extracts the video ID from different YouTube URL formats."""
    # Use a regex to hunt for the 11-character ID after 'v=' or 'youtu.be/'
    match = re.search(r"(?:v=|youtu\.be/)([a-zA-Z0-9_-]{11})", url)
    return match.group(1) if match else None

def get_transcript(video_id):
    """Fetch the transcript using the new instance-based API."""
    try:
        api = YouTubeTranscriptApi()
        # The .fetch method returns a list of subtitle snippets
        transcript = api.fetch(video_id)
        # Join the snippets into a single long string of text
        return " ".join([t.text for t in transcript])
    except TranscriptsDisabled:
        return "Error: Transcripts are disabled for this video."
    except NoTranscriptFound:
        return "Error: No transcript found for this video."
    except Exception as e:
        return f"Error: {str(e)}"
In practice, getting the data is most of the work. Notice the try-except blocks? That's called defensive programming: always assume that an external API, like YouTube's, might fail, and be ready to handle it gracefully.
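To see what the regex actually matches, you can exercise it on a few common URL shapes. This is a small self-contained sketch (the video ID below is just a placeholder value, not a video from this article):

```python
import re

def extract_video_id(url):
    """Extract the 11-character video ID from common YouTube URL shapes."""
    match = re.search(r"(?:v=|youtu\.be/)([a-zA-Z0-9_-]{11})", url)
    return match.group(1) if match else None

# URL shapes the regex should handle (placeholder ID):
print(extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))   # dQw4w9WgXcQ
print(extract_video_id("https://youtu.be/dQw4w9WgXcQ"))                  # dQw4w9WgXcQ
print(extract_video_id("https://m.youtube.com/watch?v=dQw4w9WgXcQ&t=42s"))  # dQw4w9WgXcQ
print(extract_video_id("https://example.com/not-a-video"))               # None
```

Desktop, shortened, and mobile links with extra query parameters all resolve to the same ID, while a non-YouTube URL falls through to None instead of raising.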
Step 2: Loading the Model
We will be using Flan-T5-Base from Google.
Why use T5? It’s an Encoder-Decoder model, so it’s great for tasks like summarization. The Flan version is fine-tuned to follow instructions, which helps it summarize better. Here’s how to load the model:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Check if we have a GPU (CUDA) available to speed things up
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "google/flan-t5-base"

# Load the tokenizer (translates text to token IDs)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model (the neural network) and move it to the GPU/CPU
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
You can run language models of this size on a CPU, but if you have an NVIDIA GPU, using cuda will typically be around ten times faster.
Step 3: Summarization Function
Now, let’s set up how the model processes the text. Instead of just calling model.generate, we’ll adjust some parameters, called hyperparameters, to control how creative and long the output is:
def summarize_chunk(text_chunk):
    # Give the model a specific instruction (prompt engineering)
    prompt = f"Summarize the following text clearly:\n{text_chunk}"
    # Convert the text to tensors (inputs)
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=1024
    ).to(device)
    # Generate the summary
    summary_ids = model.generate(
        **inputs,
        max_new_tokens=120,   # Max length of the summary
        num_beams=4,          # Track the 4 best paths (higher quality)
        length_penalty=1.0,   # Balance between short and long outputs
        early_stopping=True
    )
    # Decode the token IDs back into text
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
Notice num_beams=4. This enables Beam Search. Instead of greedily picking only the next most likely word, the model keeps four candidate sequences at each step and finally chooses the one with the highest overall probability. This helps prevent the model from generating nonsense.
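To make the idea concrete, here is a toy, pure-Python sketch of greedy decoding versus beam search. All the tokens and probabilities below are invented for illustration; a real decoder scores thousands of vocabulary entries at every step:

```python
# First-step token probabilities, and second-step probabilities
# conditioned on the first token (all numbers invented).
step1 = {"the": 0.5, "a": 0.4}
step2 = {
    "the": {"end": 0.2, "cat": 0.1},
    "a":   {"cat": 0.9, "end": 0.1},
}

def greedy():
    # Pick the single most likely token at each step
    t1 = max(step1, key=step1.get)
    t2 = max(step2[t1], key=step2[t1].get)
    return [t1, t2], step1[t1] * step2[t1][t2]

def beam(width):
    # Keep the `width` best first tokens, then score every continuation
    beams = sorted(step1.items(), key=lambda x: -x[1])[:width]
    candidates = [
        ([t1, t2], p1 * p2)
        for t1, p1 in beams
        for t2, p2 in step2[t1].items()
    ]
    return max(candidates, key=lambda c: c[1])

print(greedy())  # (['the', 'end'], 0.1)  -- greedy commits to 'the' too early
print(beam(2))   # ['a', 'cat'] with probability ~0.36 wins overall
```

Greedy decoding locks in "the" because it looks best locally, then gets stuck with weak continuations; keeping multiple beams lets the lower-ranked first token produce the higher-probability sequence overall. num_beams=4 does the same thing over the model's full vocabulary.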
Step 4: Chunking the Text
This step is key for long videos. If a transcript runs to 10,000 words, Flan-T5 can't process it in one pass: the input would blow past the model's context window and get truncated, losing most of the video. So we need to break it into smaller parts:
def chunk_text(text, chunk_size=1200):
    sentences = text.split(". ")
    chunks, current_chunk = [], ""
    for sentence in sentences:
        # Check if adding the next sentence would exceed our limit
        if len(current_chunk) + len(sentence) < chunk_size:
            current_chunk += sentence + ". "
        else:
            # If full, seal the chunk and start a new one
            chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
This is a simple sentence-based chunking strategy (production systems often add overlap between chunks, a sliding window, so context isn't lost at the boundaries). In real-world RAG systems, how you split your data can matter more than which model you choose.
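As a quick sanity check, you can run the chunker (restated here so the snippet is self-contained) on a synthetic transcript and confirm that every chunk stays within the size limit:

```python
def chunk_text(text, chunk_size=1200):
    # Same sentence-based chunker as above
    sentences = text.split(". ")
    chunks, current_chunk = [], ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) < chunk_size:
            current_chunk += sentence + ". "
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# A synthetic "transcript" of 200 short sentences
transcript = ". ".join(f"Sentence number {i} of the talk" for i in range(200)) + "."
chunks = chunk_text(transcript, chunk_size=300)

print(len(chunks))                                # several chunks, not one giant blob
print(all(len(c) < 302 for c in chunks))          # True: every chunk respects the limit
```

One caveat worth knowing: a single sentence longer than chunk_size would still produce an oversized chunk, since the function never splits inside a sentence.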
Step 5: Main Pipeline
Finally, let’s put everything together. This function checks the URL, gets the text, splits it up, and sends it through the AI process:
def generate_video_notes(video_url):
    print(f"\n🎬 Processing video: {video_url}")
    video_id = extract_video_id(video_url)
    if not video_id:
        print("Invalid YouTube URL.")
        return

    print("🎧 Fetching transcript...")
    transcript = get_transcript(video_id)
    if transcript.startswith("Error"):
        print(transcript)
        return

    print("🔪 Chunking transcript...")
    chunks = chunk_text(transcript)
    print(f" -> {len(chunks)} chunks created.")

    print("🧠 Generating AI notes...")
    notes = []
    # Loop through the chunks and summarize each one
    for i, chunk in enumerate(chunks):
        print(f" Summarizing chunk {i+1}/{len(chunks)}...")
        summary = summarize_chunk(chunk)
        notes.append(f"- {summary}")

    print("\n" + "="*50)
    print("📝 AI GENERATED NOTES")
    print("="*50)
    print("\n".join(notes))

if __name__ == "__main__":
    url = input("Paste YouTube URL: ")
    generate_video_notes(url)
🎬 Processing video: https://www.youtube.com/watch?v=KLfer0MES2w
🎧 Fetching transcript...
🔪 Chunking transcript...
-> 8 chunks created.
🧠 Generating AI notes...
Summarizing chunk 1/8...
Summarizing chunk 2/8...
Summarizing chunk 3/8...
Summarizing chunk 4/8...
Summarizing chunk 5/8...
Summarizing chunk 6/8...
Summarizing chunk 7/8...
Summarizing chunk 8/8...
==================================================
📝 AI GENERATED NOTES
==================================================
- India are the number-one seed for this T20 World Cup tournament. But no team has ever won a T20 World Cup playing at home. No team has ever defended a T20 World Cup title
- You'll need a bowler who's able to take a lot of wickets in a T20 game. You'll need a bowler who's able to take a lot of wickets.
- Bumrah's economy rate is very high.
- Bumra
- I'd like to say that I'm not sure if it's going to be an easy match or if it's going to be a hard match.
- It's up to whoever plays as wicket keeper to pick it up from there.
- It'll be a tough game for India to win at home, but I'm not going to go into too much detail.
Closing Thoughts
That’s how you can build an AI system with Python to summarize YouTube videos into notes. This is the starting point for automated content curation. You could even scale it up to process 1,000 news videos a day to spot market trends.
In the industry, this is called Orchestration. You’re not just training a model, you’re building the systems around it to make it actually useful.
If you found this article helpful, you can follow me on Instagram for daily AI tips and practical resources. You may also be interested in my latest book, Hands-On GenAI, LLMs & AI Agents, a step-by-step guide to prepare you for careers in today’s AI industry.