Paper

MuKV

Paper: MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
Code: IMBALDY/MuKV
Background: Long streaming VideoQA has a simple but painful constraint: the video keeps arriving, while the future user questions are unknown. KV-cache methods such as ReKV make this setting more practical. Instead of recomputing historical video tokens when a question arrives, the model can prefill the video stream in advance, store the visual KV cache, retrieve the relevant cache blocks later, and answer with much lower online cost. ...
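The prefill, store, and retrieve flow described in this excerpt can be illustrated with a toy sketch. This is not the ReKV or MuKV implementation: the `KVBlock` and `KVStore` classes, the mean-key block summary, and the cosine ranking are all assumptions chosen to keep the example small, and real systems store per-layer key and value tensors rather than plain lists.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class KVBlock:
    """Toy stand-in for one stored block of visual KV cache (keys only)."""
    block_id: int
    keys: list[list[float]]  # one key vector per visual token

    def summary(self) -> list[float]:
        # Assumed block descriptor: the mean of the block's key vectors.
        dim = len(self.keys[0])
        return [sum(k[d] for k in self.keys) / len(self.keys) for d in range(dim)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class KVStore:
    blocks: list[KVBlock] = field(default_factory=list)

    def prefill(self, keys: list[list[float]]) -> None:
        """Streaming step: store a new block as video arrives, before any question."""
        self.blocks.append(KVBlock(len(self.blocks), keys))

    def retrieve(self, query: list[float], top_k: int = 2) -> list[int]:
        """Question-time step: rank stored blocks against the question vector."""
        ranked = sorted(
            self.blocks,
            key=lambda b: cosine(query, b.summary()),
            reverse=True,
        )
        return [b.block_id for b in ranked[:top_k]]


store = KVStore()
store.prefill([[1.0, 0.0], [0.9, 0.1]])  # block 0
store.prefill([[0.0, 1.0], [0.1, 0.9]])  # block 1
store.prefill([[0.7, 0.7]])              # block 2
selected = store.retrieve([1.0, 0.0], top_k=2)
print(selected)  # [0, 2]
```

Only the selected blocks would be handed to the model at answer time, which is where the lower online cost comes from in this simplified picture.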

ViG-RAG

Paper: ViG-RAG: Video-aware Graph Retrieval-Augmented Generation via Temporal and Semantic Hybrid Reasoning
PDF: AAAI Proceedings PDF
Code: AI-Researcher-Team/ViG-RAG
Background: Long-video RAG is harder than text RAG because video evidence is not just a list of documents. Useful information may be distributed across visual scenes, speech transcripts, entities and events, temporal order, and uncertain or noisy observations. If we simply split the video into independent chunks and retrieve by static text similarity, two problems appear: ...

AdaVideoRAG

Paper: AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
Code: xzc-zju/AdaVideoRAG
Background: Long-video understanding is hard because the useful evidence is sparse, long-range, and often spread across multiple modalities: visual content, speech, scene text, and temporal relations. RAG is a natural fit here. Instead of feeding the whole video to the MLLM every time, the system can first build a searchable memory, retrieve relevant evidence, and then answer with a smaller context. But a fixed VideoRAG pipeline is not ideal. Easy questions may not need retrieval at all, while hard questions may need structured graph reasoning. ...
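The idea that easy questions can skip retrieval while hard ones need graph reasoning amounts to a routing decision made before answering. The sketch below is only an illustration: the `Route` levels, the middle "flat retrieval" tier, and the keyword-based `estimate_difficulty` heuristic are all hypothetical placeholders, not the classifier or the pipeline used by AdaVideoRAG.

```python
from enum import Enum


class Route(Enum):
    NO_RETRIEVAL = "answer directly"
    FLAT_RETRIEVAL = "retrieve evidence chunks"
    GRAPH_RETRIEVAL = "retrieve chunks and reason over a graph"


def estimate_difficulty(question: str) -> int:
    """Placeholder heuristic (an assumption): count cue words that
    suggest temporal or multi-step reasoning."""
    cues = ("why", "before", "after", "relationship", "compare", "cause")
    words = question.lower().replace("?", " ").split()
    return sum(1 for cue in cues if cue in words)


def choose_route(question: str) -> Route:
    """Map the difficulty score to a retrieval strategy."""
    score = estimate_difficulty(question)
    if score == 0:
        return Route.NO_RETRIEVAL
    if score == 1:
        return Route.FLAT_RETRIEVAL
    return Route.GRAPH_RETRIEVAL


for q in (
    "What color is the car?",
    "What happened before the goal?",
    "Why did she leave after the call?",
):
    print(q, "->", choose_route(q).value)
```

In practice the difficulty estimate would come from a learned or prompted classifier rather than a keyword count; the point here is only that the cost of retrieval is paid per question, according to need.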

StreamChat

Paper: Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
Code: hmxiong/StreamChat
Background: Most Video-LLMs are still awkward in a real streaming setting. Offline video QA usually assumes that the whole video is already available, the question is known before inference, and the interaction is single-turn. But a streaming assistant has a different problem: video frames keep arriving, the user may ask questions at arbitrary timestamps, the system should remember previous conversation turns, and the answer should come back with low latency. This is close to the motivation of ReKV, Flash-VStream, LiveVLM, and rLiVS, but StreamChat chooses a different abstraction. ...

StreamingTOM

Paper: StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Code: YIGE24/StreamingTOM
Background: Streaming video understanding has two constraints that offline video understanding does not really need to respect: causality (the model cannot use future frames to decide how to compress current frames) and accumulation (tokens and the KV cache keep growing as the video stream becomes longer). Most recent training-free streaming methods mainly work on the post-LLM KV cache: ReKV stores historical KV blocks and retrieves relevant ones at question time; StreamKV improves the segmentation / compression / retrieval pipeline; InfiniPot-V and StreamMem keep a bounded KV memory with query-agnostic compression; and LiveVLM combines query-agnostic KV compression with query-time retrieval. These methods are useful, but they still have one important blind spot: ...

LiveVLM

Paper: LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
Code: sjtu-zhao-lab/LiveVLM
Background: Online video understanding is harder than offline long-video QA. In the offline setting, the model usually receives a video and a question together. It can then sample, compress, or retrieve content with the query already known. In the online setting, the model has two separate phases: an encoding phase, in which video frames arrive continuously before any question appears, and a response phase, in which the model should answer a user's question quickly from the already processed stream. This creates three constraints at the same time: ...

StreamMem

Paper: StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding
Background: Streaming video understanding is hard because the model has to process frames as they arrive, without knowing how long the video will be, what future user questions will ask, or which past details will become important later. For long videos, the visual tokens and their KV cache keep growing over time. Even if a long-context MLLM can technically accept many tokens, storing and attending to all historical KV entries is still expensive. ...

rLiVS

Paper: Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
Code: vdorovatas/rLiVS
Background: Streaming video understanding is hard because the model must process incoming frames online, keep useful past information, and still answer questions with low latency. The brute-force solution is to put as many frames as possible into the context window, but this quickly becomes too expensive for long videos. Recent papers handle this in different ways. ReKV keeps rich visual memory in the form of a KV cache and retrieves it later, but its memory footprint and latency are still significant. Goldfish stores only captions for each short clip, which is cheap, but clip-to-clip continuity can be weak. rLiVS tries to sit between these two directions: ...

InfiniPot-V

Paper: InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
Code: aiha-lab/InfiniPot-V
Background: Streaming video understanding is more constrained than offline long-video understanding. In offline settings, the model can see the whole video first, maybe even the user query first, and then decide how to compress tokens or KV cache. But in streaming settings, frames arrive continuously, future queries are unknown, and memory is fixed, yet the KV cache still grows roughly linearly with time. This is exactly the part that makes many existing KV compression methods awkward for real streaming scenarios. ...
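The linear growth mentioned in this excerpt is easy to make concrete with back-of-the-envelope arithmetic. The sketch below uses the standard accounting for an uncompressed KV cache (keys plus values, per layer, per KV head). The model configuration and the frame settings are hypothetical numbers picked for illustration, not figures from the InfiniPot-V paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of an uncompressed KV cache.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def stream_tokens(seconds, fps, tokens_per_frame):
    """Visual tokens accumulated after `seconds` of video."""
    return int(seconds * fps) * tokens_per_frame


# Hypothetical configuration (assumption, for illustration only).
config = dict(num_layers=28, num_kv_heads=4, head_dim=128)
per_token = kv_cache_bytes(1, **config)  # 57,344 bytes per visual token

one_minute = kv_cache_bytes(stream_tokens(60, fps=1, tokens_per_frame=196), **config)
one_hour = kv_cache_bytes(stream_tokens(3600, fps=1, tokens_per_frame=196), **config)

print(f"per token:  {per_token} bytes")
print(f"one minute: {one_minute / 2**30:.2f} GiB")
print(f"one hour:   {one_hour / 2**30:.2f} GiB")
```

Under these assumed settings an hour of video at 1 frame per second needs about 37.7 GiB of cache, exactly 60 times the one-minute figure, which is why a fixed memory budget forces some form of compression or eviction.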

StreamKV

Paper: StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
Code: sou1p0wer/StreamKV
Background: Streaming video question-answering (StreamingVQA) requires a model to continuously process incoming video, preserve useful historical context, and answer questions online with low latency. ReKV showed that video QA can be reformulated as a two-step process: first retrieve the relevant KV caches, then answer with the retrieved KV. But it still has several weaknesses. It uses uniform segmentation, which may cut through semantic boundaries. It keeps essentially the whole historical visual context, so memory usage is still large. Its retrieval strategy is not flexible enough, especially when the useful information is distributed differently across layers.
Core Idea: StreamKV extends the ReKV line in two directions at the same time: ...