Streaming

Benchmarks for Streaming Video Understanding

This post is a small index for the benchmarks that appear repeatedly in recent streaming video / long-video VLM papers. The main split is simple: online streaming benchmarks test whether the model can answer while the video is still coming in; offline long-video benchmarks test long-context video understanding, but usually assume the whole video is already available; standard video QA benchmarks are useful for comparability, but they are not the real target of streaming-memory papers. The newer VideoRAG papers add another emphasis: ...

StreamChat

Paper: Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
Code: hmxiong/StreamChat

Background

Most Video-LLMs are still awkward in a real streaming setting. Offline video QA usually assumes that the whole video is already available, the question is known before inference, and the interaction is single-turn. A streaming assistant has a different problem: video frames keep arriving; the user may ask questions at arbitrary timestamps; the system should remember previous conversation turns; and the answer should come back with low latency. This is close to the motivation of ReKV, Flash-VStream, LiveVLM, and rLiVS, but StreamChat chooses a different abstraction. ...

Long Streaming Video Understanding Pipeline

Related papers:
ReKV: Paper / Code
StreamKV: Paper / Code
InfiniPot-V: Paper / Code
StreamMem: Paper
LiveVLM: Paper / Code
StreamingTOM: Paper / Code
rLiVS: Paper / Code

Core Question

All of these papers are trying to solve the same systems problem: when a video stream keeps arriving, the user question is not known yet, and GPU memory is limited, how should a Video-LLM process the stream, compress memory, retrieve evidence, and generate an answer with low latency? The methods look different if we read them one by one: KV cache retrieval, semantic chunking, TaR / VaN, chat-template proxy queries, VSB, CTR, OQM, caption RAG, and so on. ...

StreamingTOM

Paper: StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Code: YIGE24/StreamingTOM

Background

Streaming video understanding has two constraints that offline video understanding does not really need to respect:
causality: the model cannot use future frames to decide how to compress current frames;
accumulation: tokens and KV cache keep growing as the video stream becomes longer.

Most recent training-free streaming methods mainly work on the post-LLM KV cache: ReKV stores historical KV blocks and retrieves relevant ones at question time; StreamKV improves the segmentation / compression / retrieval pipeline; InfiniPot-V and StreamMem keep a bounded KV memory with query-agnostic compression; LiveVLM combines query-agnostic KV compression with query-time retrieval. These methods are useful, but they still have one important blind spot: ...

LiveVLM

Paper: LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
Code: sjtu-zhao-lab/LiveVLM

Background

Online video understanding is harder than offline long-video QA. In the offline setting, the model usually receives a video and a question together. It can then sample, compress, or retrieve content with the query already known. In the online setting, the model has two separate phases:
encoding phase: video frames arrive continuously before any question appears;
response phase: when a user asks a question, the model should answer quickly from the already processed stream.

This creates three constraints at the same time: ...

StreamMem

Paper: StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding

Background

Streaming video understanding is hard because the model has to process frames as they arrive, without knowing how long the video will be, what future user questions will ask, or which past details will become important later. For long videos, the visual tokens and their KV cache keep growing over time. Even if a long-context MLLM can technically accept many tokens, storing and attending to all historical KV entries is still expensive. ...

rLiVS

Paper: Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
Code: vdorovatas/rLiVS

Background

Streaming video understanding is hard because the model must process incoming frames online, keep useful past information, and still answer questions with low latency. The brute-force solution is to put as many frames as possible into the context window, but this quickly becomes too expensive for long videos. Recent papers handle this in different ways. ReKV keeps rich visual memory in the form of KV cache and retrieves it later, but memory and latency are still significant. Goldfish stores only captions for each short clip, which is cheap, but clip-to-clip continuity can be weak. rLiVS tries to sit between these two directions: ...

InfiniPot-V

Paper: InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
Code: aiha-lab/InfiniPot-V

Background

Streaming video understanding is more constrained than offline long-video understanding. In offline settings, the model can see the whole video first, maybe even the user query first, and then decide how to compress tokens or KV cache. But in streaming settings, frames arrive continuously, future queries are unknown, memory is fixed, and the KV cache still grows roughly linearly with time. This is exactly the part that makes many existing KV compression methods awkward for real streaming scenarios. ...
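The "KV cache grows roughly linearly with time" point can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the model shape (28 layers, 4 KV heads, head dimension 128, fp16) and the frame settings (1 fps, 196 visual tokens per frame) are assumed values, not taken from InfiniPot-V or any of the other papers.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of an uncompressed KV cache in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def tokens_seen(seconds, fps, tokens_per_frame):
    """Number of visual tokens produced after `seconds` of streaming."""
    return int(seconds * fps * tokens_per_frame)


# Hypothetical decoder shape and sampling rate (assumptions for illustration only).
CONFIG = dict(num_layers=28, num_kv_heads=4, head_dim=128)

for minutes in (1, 10, 60):
    n = tokens_seen(minutes * 60, fps=1, tokens_per_frame=196)
    gib = kv_cache_bytes(n, **CONFIG) / 2**30
    print(f"{minutes:>3} min: {n:>7} tokens, {gib:6.2f} GiB of KV cache")
```

Because every factor except `num_tokens` is fixed by the model, doubling the stream length doubles the cache, which is why a fixed memory budget forces some form of compression or eviction.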

StreamKV

Paper: StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
Code: sou1p0wer/StreamKV

Background

Streaming video question-answering (StreamingVQA) requires a model to continuously process incoming video, preserve useful historical context, and answer questions online with low latency. ReKV showed that video QA can be reformulated as "retrieve relevant KV caches first, then answer with the retrieved KV". But it still has several weaknesses. It uses uniform segmentation, which may cut through semantic boundaries. It keeps essentially the whole historical visual context, so memory usage is still large. Its retrieval strategy is not flexible enough, especially when the useful information is distributed differently across layers.

Core Idea

StreamKV extends the ReKV line in two directions at the same time: ...

ReKV

Paper: Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Code: Becomebright/ReKV

Background

Streaming video question-answering (StreamingVQA) presents three challenges:
Efficient Video Encoding: we need to efficiently process incoming frames without access to future frames or frequent revisiting of distant past frames.
Video Context Preservation: models must preserve relevant information from earlier frames.
Real-Time Response: models must provide accurate answers with minimum delay.

Core Idea

Causal attention makes it possible to decouple video encoding from question answering: the keys and values of video tokens do not depend on a question that arrives later, so the KV cache can be computed ahead of time and reused during QA. ...
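The decoupling idea above (compute KV while the video streams in, then reuse it when a question arrives) can be sketched with a toy single-head attention layer in NumPy. This is not the ReKV implementation: the class name `VideoKVCache`, the per-token projections without any cross-frame context, and the block scoring by mean query / mean key similarity are all simplifying assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size (assumption)
# Random projection matrices standing in for one attention head.
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))


class VideoKVCache:
    """Toy per-block KV store (hypothetical; not the ReKV code)."""

    def __init__(self):
        self.keys, self.values = [], []

    def encode_block(self, frame_tokens):
        # Encoding phase: runs as frames arrive, before any question exists.
        self.keys.append(frame_tokens @ W_k)
        self.values.append(frame_tokens @ W_v)

    def answer_state(self, question_tokens, top_k=2):
        # Response phase: only the question is projected here; video KV is reused.
        q = question_tokens @ W_q
        # Score each stored block by mean-query / mean-key similarity (assumed heuristic).
        scores = [float(q.mean(axis=0) @ k.mean(axis=0)) for k in self.keys]
        picked = sorted(int(i) for i in np.argsort(scores)[-top_k:])
        K = np.concatenate([self.keys[i] for i in picked])
        V = np.concatenate([self.values[i] for i in picked])
        # Standard scaled dot-product attention over the retrieved blocks only.
        logits = q @ K.T / np.sqrt(d)
        weights = np.exp(logits - logits.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ V, picked


cache = VideoKVCache()
for _ in range(5):  # five incoming blocks of 8 visual tokens each
    cache.encode_block(rng.standard_normal((8, d)))

out, picked = cache.answer_state(rng.standard_normal((3, d)), top_k=2)
print(out.shape, picked)
```

The point of the sketch is the split of work: `encode_block` never sees the question, and `answer_state` never re-encodes video tokens, so response latency depends on the number of retrieved blocks rather than on the full stream length.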