AdaVideoRAG

Paper: AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding

Code: xzc-zju/AdaVideoRAG

Background: Long-video understanding is hard because the useful evidence is sparse, long-range, and often spread across multiple modalities: visual content, speech, scene text, and temporal relations. RAG is a natural fit here. Instead of feeding the whole video to the MLLM every time, the system can first build a searchable memory, retrieve relevant evidence, and then answer with a smaller context. But a fixed VideoRAG pipeline is not ideal. Easy questions may not need retrieval at all, while hard questions may need structured graph reasoning. ...

May 9, 2026 · 13 min

Benchmarks for Streaming Video Understanding

This post is a small index for the benchmarks that appear repeatedly in recent streaming video / long-video VLM papers. The main split is simple: online streaming benchmarks test whether the model can answer while the video is still coming in; offline long-video benchmarks test long-context video understanding, but usually assume the whole video is already available; standard video QA benchmarks are useful for comparability, but they are not the real target of streaming-memory papers. The newer VideoRAG papers add another emphasis: ...

May 2, 2026 · Updated May 10, 2026 · 2 min