StreamKV

Background Streaming video question-answering (StreamingVQA) requires a model to continuously process incoming video, preserve useful historical context, and answer questions online with low latency. ReKV showed that video QA can be reformulated as a two-step process: first retrieve the relevant KV caches, then answer using the retrieved KV. But ReKV still has several weaknesses: it uses uniform segmentation, which may cut through semantic boundaries; it keeps essentially the entire historical visual context, so memory usage remains large; and its retrieval strategy is not flexible enough, especially when the useful information is distributed differently across layers. Core Idea StreamKV extends the ReKV line in two directions at the same time: ...
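The retrieve-then-answer idea above can be sketched as segment-level KV retrieval: rank cached video segments by similarity to the question, then hand only the top-k segments' KV caches to the answering step. This is a minimal sketch, not ReKV's actual implementation; `retrieve_kv`, the segment summary embeddings, and the string KV placeholders are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_kv(question_emb, segments, top_k=2):
    """Rank cached segments by similarity to the question and
    return the KV caches of the top_k segments, in temporal order."""
    ranked = sorted(
        range(len(segments)),
        key=lambda i: cosine(question_emb, segments[i]["emb"]),
        reverse=True,
    )
    keep = sorted(ranked[:top_k])  # restore temporal order for the decoder
    return [segments[i]["kv"] for i in keep]

# Toy cache: 3 segments, each with a 2-d summary embedding and a KV handle.
segments = [
    {"emb": [1.0, 0.0], "kv": "kv_seg0"},
    {"emb": [0.0, 1.0], "kv": "kv_seg1"},
    {"emb": [0.7, 0.7], "kv": "kv_seg2"},
]
retrieve_kv([1.0, 0.2], segments, top_k=2)  # → ['kv_seg0', 'kv_seg2']
```

In a real system the segment embeddings would come from the vision encoder and the KV handles would point at per-layer key/value tensors, but the ranking-and-select logic is the same.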

April 22, 2026 · 8 min

ReKV

Background Consider the problem of streaming video question-answering (StreamingVQA); it presents three challenges. Efficient Video Encoding: incoming frames must be processed efficiently, without access to future frames and without frequently revisiting distant past frames. Video Context Preservation: the model must preserve relevant information from earlier frames. Real-Time Response: the model must provide accurate answers with minimal delay. Core Idea The structure of the attention computation makes it possible to decouple video encoding from question answering, so we can precompute the KV cache during encoding and reuse it at QA time. ...
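The decoupling described above can be illustrated with a toy cache: each frame is encoded exactly once as it arrives, and any number of later questions attend over the stored KV instead of re-encoding the video. This is a hedged sketch under assumed names; `StreamingCache`, `encode_segment`, and the string KV stand-ins are illustrative, not ReKV's API.

```python
def encode_segment(frames):
    """Stand-in for a transformer layer's KV projection of video tokens."""
    return [(f"k{f}", f"v{f}") for f in frames]

class StreamingCache:
    """Toy KV cache: ingest frames once, answer questions many times."""

    def __init__(self):
        self.kv = []

    def ingest(self, frames):
        # Encoding touches each incoming frame exactly once (streaming).
        self.kv.extend(encode_segment(frames))

    def answer(self, question):
        # QA reuses the cached KV instead of re-encoding the video.
        return f"{question} -> attends over {len(self.kv)} cached KV pairs"

cache = StreamingCache()
cache.ingest([0, 1, 2])   # frames arrive in a stream
cache.ingest([3, 4])
cache.answer("What happened first?")  # → '... attends over 5 cached KV pairs'
```

Because `answer` never calls `encode_segment`, the per-question cost depends only on attention over the cache, which is what makes low-latency online answering possible.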

April 22, 2026 · 3 min