This post is a short index of the benchmarks that appear repeatedly in recent streaming-video and long-video VLM papers.

The main split is simple:

  • online streaming benchmarks test whether the model can answer while the video is still coming in;
  • offline long-video benchmarks test long-context video understanding, but usually assume the whole video is already available;
  • standard video QA benchmarks are useful for comparability, but they are not the real target of streaming-memory papers.
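
The online/offline distinction can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than any benchmark's actual evaluation code: the `Question` class, the `visible_frames` helper, and the frame timestamps are hypothetical names and values. The only point is that an online benchmark restricts the model to frames that arrived before the question timestamp, while an offline benchmark exposes the whole video.

```python
from dataclasses import dataclass


@dataclass
class Question:
    # Hypothetical container; real benchmarks define their own schemas.
    text: str
    timestamp: float  # seconds into the video when the question is asked


def visible_frames(frame_times, question, streaming):
    """Return the frame timestamps the model may use to answer."""
    if streaming:
        # Online setting: only frames that have already arrived.
        return [t for t in frame_times if t <= question.timestamp]
    # Offline setting: the whole video is assumed to be available.
    return list(frame_times)


# Made-up example: one frame per second, question asked at 2.5 s.
frame_times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
q = Question("Where did the person put the keys?", timestamp=2.5)

online = visible_frames(frame_times, q, streaming=True)
offline = visible_frames(frame_times, q, streaming=False)
```

In this toy setup `online` holds only the first three frames, which is why streaming benchmarks put pressure on how a model stores and retrieves old evidence instead of re-reading the full video.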

The newer VideoRAG papers add another emphasis:

  • whether retrieval can find sparse evidence in long videos;
  • whether graph or structured memory helps with multi-hop temporal reasoning;
  • whether the benchmark distinguishes simple perception from deeper contextual reasoning.

Online / Streaming Benchmarks

| Benchmark | What It Mainly Tests | Papers Using It |
|---|---|---|
| RVS-Ego | Streaming QA on egocentric videos with timestamped questions. Good for testing whether old visual evidence remains accessible. | ReKV, StreamMem, LiveVLM, StreamingTOM, rLiVS |
| RVS-Movie | Streaming QA on movie-style videos. More narrative and event-heavy than RVS-Ego. | ReKV, StreamMem, LiveVLM, StreamingTOM, rLiVS |
| StreamingBench | Broader streaming benchmark with real-time visual understanding, omni-source understanding, and contextual understanding subtasks. | StreamKV, LiveVLM, InfiniPot-V |
| StreamBench | Online multi-turn video QA with memory-heavy question types such as object search, long-term memory search, short-term memory search, and conversational interaction. | StreamChat; baselines include Video-online and Flash-VStream |

Offline Long-video Benchmarks

| Benchmark | What It Mainly Tests | Papers Using It |
|---|---|---|
| MLVU | Long-video multiple-choice understanding. Often used as a compact proxy for long-context video reasoning. | ReKV, StreamMem, LiveVLM, StreamingTOM, InfiniPot-V, AdaVideoRAG |
| Video-MME | General long-video multimodal understanding across short, medium, and long videos. Papers often report the no-subtitle setting. | StreamMem, LiveVLM, StreamingTOM, InfiniPot-V, AdaVideoRAG, ViG-RAG |
| EgoSchema | Long-range egocentric video reasoning. Useful for memory and temporal reasoning evaluation. | ReKV, StreamMem, StreamingTOM, InfiniPot-V |
| LongVideoBench | Long-video QA with stronger pressure on long-context multimodal reasoning. | LiveVLM, StreamingTOM, InfiniPot-V, ViG-RAG |
| HiVU | Hierarchical long-video benchmark for knowledge-rich videos. It separates questions into different reasoning levels, making it useful for adaptive VideoRAG evaluation. | AdaVideoRAG |
| LongerVideos | Long-form and multi-video benchmark used to test retrieval and reasoning over extended videos, especially for graph-RAG style methods. | ViG-RAG |
| ActivityNet-QA | Open-ended video QA with longer activity videos. | ReKV, StreamChat |
| QAEGO4D | Egocentric long-video QA. | ReKV |
| CG-Bench / CGBench | Clue-grounded long-video QA, useful for retrieval-heavy methods. | ReKV, rLiVS |
| MovieChat | Long movie/video understanding. | rLiVS |
| VS-Ego / VS-Movie | Offline long-video evaluation around egocentric and movie scenarios. | rLiVS |

Standard Video QA Benchmarks

| Benchmark | What It Mainly Tests | Papers Using It |
|---|---|---|
| NExT-QA / NextQA-valset | Shorter video QA and temporal reasoning. Often useful for token-selection ablations. | StreamChat, rLiVS |
| MSVD-QA | Open-ended QA on short web videos. | StreamChat |
| MSRVTT-QA | Open-ended QA on short web videos, broader than MSVD. | StreamChat |
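
Since the main use of an index like this is checking where two papers can be compared directly, the tables can also be kept as a small lookup structure. The sketch below is only an illustration: `BENCHMARK_PAPERS` copies a subset of the rows above, and `shared_benchmarks` is a hypothetical helper name, not something from any of the listed papers.

```python
# Subset of the tables above: benchmark name -> papers that report on it.
BENCHMARK_PAPERS = {
    "RVS-Ego": {"ReKV", "StreamMem", "LiveVLM", "StreamingTOM", "rLiVS"},
    "RVS-Movie": {"ReKV", "StreamMem", "LiveVLM", "StreamingTOM", "rLiVS"},
    "StreamingBench": {"StreamKV", "LiveVLM", "InfiniPot-V"},
    "MLVU": {"ReKV", "StreamMem", "LiveVLM", "StreamingTOM",
             "InfiniPot-V", "AdaVideoRAG"},
    "Video-MME": {"StreamMem", "LiveVLM", "StreamingTOM",
                  "InfiniPot-V", "AdaVideoRAG", "ViG-RAG"},
    "EgoSchema": {"ReKV", "StreamMem", "StreamingTOM", "InfiniPot-V"},
    "LongVideoBench": {"LiveVLM", "StreamingTOM", "InfiniPot-V", "ViG-RAG"},
}


def shared_benchmarks(paper_a, paper_b):
    """Return the benchmarks (sorted by name) that both papers use."""
    return sorted(
        name
        for name, papers in BENCHMARK_PAPERS.items()
        if paper_a in papers and paper_b in papers
    )


rekv_vs_infinipot = shared_benchmarks("ReKV", "InfiniPot-V")
videorag_overlap = shared_benchmarks("AdaVideoRAG", "ViG-RAG")
```

With this subset, ReKV and InfiniPot-V overlap only on offline benchmarks (EgoSchema and MLVU), and the two VideoRAG papers overlap only on Video-MME, so head-to-head comparisons between them are narrower than the paper lists alone suggest.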