VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR
posted on September 21, 2025


By Po-han

VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR

Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

TLDR: Annotation-free video summarization evaluation that boosts human decision-making accuracy by up to 61% and cuts response time by 75%!

Project Website | arXiv | Code | Dataset

Motivation

Many real-world tasks still require human oversight: a traffic officer sifting through dashcam footage, or a researcher screening long conference videos. Watching raw video is slow, and existing vision-language models (VLMs) often produce verbose, redundant captions that hinder efficiency. Current video-to-text evaluation methods depend on costly human annotations and ignore whether summaries actually help humans make decisions. We ask:

Can we evaluate and select video-to-text summaries by how well they support human decision-making, without any human annotations?

TLDR System Plot
VIBE Framework Overview

Contributions

We introduce VIBE (Video-to-text Information Bottleneck Evaluation), a novel framework that evaluates and selects VLM summaries without annotations or retraining.

VIBE Framework

VIBE Mechanism Overview
VIBE: Mechanism Overview

VIBE adapts the information bottleneck principle to video summarization, scoring each candidate summary along two axes:

- Grounding score: how well the summary is supported by the content of the video.
- Utility score: how useful the summary is for the downstream decision task.

By maximizing both, VIBE selects concise, task-relevant summaries without gold labels or retraining.
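To make this concrete, below is a minimal Python sketch of how annotation-free scoring and best-of-N selection could look. The `vlm_loglik` helper, its signature, and the simple additive combination weighted by `alpha` are illustrative assumptions, not the exact estimators used in the paper.

```python
# Minimal sketch of VIBE-style, annotation-free summary scoring and selection.
# `vlm_loglik(text, video=None, context=None)` is a hypothetical interface
# standing in for a VLM's log-likelihood of `text` given optional video and
# text context; the paper's actual estimators may differ.
from typing import Callable, Sequence

LogLik = Callable[..., float]  # vlm_loglik(text, video=None, context=None) -> float

def grounding_score(summary: str, video, vlm_loglik: LogLik) -> float:
    """Higher when conditioning on the video makes the summary more likely,
    i.e., the summary reflects what the video actually shows."""
    return vlm_loglik(summary, video=video) - vlm_loglik(summary)

def utility_score(summary: str, task_query: str, vlm_loglik: LogLik) -> float:
    """Higher when the summary makes the task query easier to resolve,
    i.e., the summary carries task-relevant information."""
    return vlm_loglik(task_query, context=summary) - vlm_loglik(task_query)

def select_summary(candidates: Sequence[str], video, task_query: str,
                   vlm_loglik: LogLik, alpha: float = 1.0) -> str:
    """Best-of-N selection: keep the candidate with the best trade-off
    between grounding and utility (no gold labels, no retraining)."""
    return max(
        candidates,
        key=lambda s: grounding_score(s, video, vlm_loglik)
                      + alpha * utility_score(s, task_query, vlm_loglik),
    )
```

Because both scores come from model likelihoods rather than reference summaries, no human annotation enters the loop.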

Results

VIBE Results Table
VIBE: Results Overview

We validate VIBE through user studies with 243 participants across three datasets.

LongVideoBench Results
Correlation between Accuracy, Grounding, and Utility Scores

Key findings:

- VIBE-selected summaries boost human decision-making accuracy by up to 61%.
- Participant response time is cut by 75%.
- VIBE's grounding and utility scores correlate with human task accuracy, supporting their use as annotation-free quality signals.

Impact

VIBE reframes video caption evaluation around human decision support. Unlike reference-based metrics, it scales to unseen data, works with black-box VLMs, and requires no human annotations. This makes VIBE a practical plug-in for improving video summarization in real-world settings, from scientific video search to public safety monitoring.