DEEVISum: How a lightweight AI model is making video summarization faster and smarter
Video summarization is a hot topic in AI right now, especially as platforms like TikTok and YouTube Shorts push creators toward shorter, more engaging content. But summarizing videos automatically is tough—it requires understanding not just what’s happening on screen, but also the context, emotion, and even speaker dynamics. Most existing AI models that do this well are massive, slow, and expensive to run. That’s where DEEVISum comes in.
What is DEEVISum?
Developed by researchers at IIT Bombay, DEEVISum (Distilled Early-Exit Vision-language model for Summarization) is a lightweight AI model designed to make video summarization faster and more efficient without sacrificing too much accuracy. The key innovation here is combining two techniques:
- Multi-Stage Knowledge Distillation (MSKD): Instead of just training a small model to mimic a big one (the usual approach), DEEVISum uses an intermediate "mentor" model to bridge the gap between a massive teacher model (like Google’s 28B-parameter PaLI-Gemma2) and a tiny student model (a 3B-parameter version). This helps the smaller model learn more effectively, improving performance by 1.33% over traditional distillation (a rough sketch of the idea follows this list).
- Early Exit (EE): Normally, an AI model runs an input through every layer of its network before making a prediction. Early Exit lets the model stop partway through if it is already confident in its answer, saving compute time. In tests, this reduced inference time by 21%, at the cost of a small drop in accuracy (1.3 F1 points). A toy version is also sketched below.
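To make the mentor idea concrete, here is a minimal PyTorch-style sketch of soft-label distillation run in two stages, teacher to mentor and then mentor to student. The function names, temperature, and loss form are illustrative assumptions about generic logits, not the paper's exact training recipe for its vision-language summarizer:

```python
import torch.nn.functional as F

def distill_losses(teacher_logits, mentor_logits, student_logits, temperature=2.0):
    """Soft-label distillation passed through an intermediate mentor.
    In practice the stages are usually run one after the other: first train
    the mentor against the frozen teacher, then the student against the
    frozen mentor."""
    T = temperature

    def soft_kd(student_out, teacher_out):
        # Standard Hinton-style KD: match softened output distributions.
        return F.kl_div(
            F.log_softmax(student_out / T, dim=-1),
            F.softmax(teacher_out.detach() / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)

    mentor_loss = soft_kd(mentor_logits, teacher_logits)    # stage 1: teacher -> mentor
    student_loss = soft_kd(student_logits, mentor_logits)   # stage 2: mentor -> student
    return mentor_loss, student_loss
```

The intuition is that a 28B-to-3B jump is too large for the student to follow directly, so the mid-sized mentor gives it a more learnable target.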
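And a toy version of early exit: attach a small prediction head after each layer and stop as soon as one of them is confident. The threshold, pooling, and head design here are assumptions for illustration, not DEEVISum's exact exit criterion:

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Toy encoder with a prediction head after every block; inference stops
    as soon as a head is confident enough (illustrative only)."""

    def __init__(self, dim=256, num_layers=6, num_classes=2, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
             for _ in range(num_layers)]
        )
        self.exit_heads = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(num_layers)]
        )
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):                                   # x: (batch, tokens, dim)
        for depth, (block, head) in enumerate(zip(self.blocks, self.exit_heads), start=1):
            x = block(x)
            probs = head(x.mean(dim=1)).softmax(dim=-1)     # pool tokens, then classify
            if probs.max(dim=-1).values.min() >= self.threshold:
                return probs, depth                         # confident: skip remaining layers
        return probs, depth                                 # never confident: ran all layers

model = EarlyExitEncoder().eval()
scores, layers_used = model(torch.randn(1, 16, 256))        # layers_used <= 6
```

The appeal is that easy inputs leave through an early head while hard ones use the full network, which is where the reported 21% inference-time saving comes from.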
Why does this matter?
Video summarization is a resource-heavy task. Big models like OpenAI’s GPT-4o or Google’s Gemini can do it well, but they’re expensive to run at scale. DEEVISum proves that with the right training tricks, smaller models can compete. On the TVSum dataset, DEEVISum (using the 3B-parameter PaLI-Gemma2 + MSKD) scored an F1 of 61.1—close to much larger models—while being far more efficient.
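For context on what an F1 of 61.1 measures: video summaries are typically scored by how much the machine-selected frames or segments overlap with human-annotated highlights. A minimal sketch, assuming binary per-frame selections (not necessarily the paper's exact evaluation code):

```python
import numpy as np

def keyshot_f1(pred, gt):
    """F1 overlap between a predicted and a reference binary keyshot selection.
    `pred` and `gt` are 0/1 arrays with one entry per frame (or per segment)."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    overlap = np.logical_and(pred, gt).sum()
    if overlap == 0:
        return 0.0
    precision = overlap / pred.sum()
    recall = overlap / gt.sum()
    return 2 * precision * recall / (precision + recall)

# Example: model keeps frames 2-5, annotator marked frames 3-6 as highlights.
print(round(keyshot_f1([0,0,1,1,1,1,0,0], [0,0,0,1,1,1,1,0]), 3))  # 0.75
```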
The role of audio and text
Another interesting aspect of DEEVISum is how it uses multi-modal prompts. Instead of just looking at video frames, it also processes:
- Titles and transcripts (for semantic context)
- Speaker diarization (who’s talking when)
- Emotion and gender cues from audio (to gauge tone)
This helps the model understand why certain moments matter—like when a speaker gets emotional or when dialogue shifts. In tests, adding audio-derived features (like emotion) boosted performance slightly, though speaker diarization sometimes introduced noise.
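To picture what such a multi-modal prompt might look like, here is a hypothetical assembly step that folds the title, transcript, diarization, and audio cues into one text block for the vision-language model. The `Segment` fields and the prompt template are illustrative assumptions, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float        # seconds
    end: float
    speaker: str        # from diarization, e.g. "SPEAKER_01"
    emotion: str        # from an audio emotion classifier, e.g. "excited"
    gender: str         # from an audio gender classifier
    text: str           # ASR transcript for this span

def build_prompt(title: str, segments: list[Segment]) -> str:
    """Assemble a text prompt that pairs each transcript span with its
    speaker, emotion, and gender cues (illustrative format only)."""
    lines = [f"Video title: {title}", "Transcript with audio cues:"]
    for s in segments:
        lines.append(
            f"[{s.start:06.1f}-{s.end:06.1f}] {s.speaker} ({s.gender}, {s.emotion}): {s.text}"
        )
    lines.append("Task: rate how important each segment is for a summary of this video.")
    return "\n".join(lines)
```

Folding the audio-derived cues into plain text like this lets a single vision-language model reason over them alongside the frames, without any architectural changes.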
The catch: Dataset limitations
One big hurdle for video summarization AI is dataset quality. Most benchmarks (like TVSum and SumMe) use short, simple videos with basic annotations. When the researchers tested DEEVISum on a more diverse set of videos (drama, sports, politics), it outperformed older models by a wide margin (41.1 F1 vs. ~20 for others), suggesting current benchmarks aren’t challenging enough.
The bottom line
DEEVISum shows that smaller, optimized models can handle complex tasks like video summarization—if trained smartly. For businesses, this could mean cheaper, faster AI tools for content repurposing, social media clipping, or even meeting summaries. The code and dataset are publicly available, so we’ll likely see more experiments building on this soon.
Key takeaway: Efficiency tricks like multi-stage distillation and early exits are making AI more practical for real-world video processing. The next frontier? Better datasets to push these models further.