Retrieval

MuKV

Paper: MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
Code: IMBALDY/MuKV
Background: Long streaming VideoQA has a simple but painful constraint: the video keeps arriving, while the future user questions are unknown. KV-cache methods such as ReKV make this setting more practical. Instead of recomputing historical video tokens when a question arrives, the model can prefill the video stream in advance, store the visual KV cache, retrieve the relevant cache blocks later, and answer with much lower online cost. ...
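The "retrieve the relevant cache blocks later" step can be sketched in a few lines. This is only a minimal illustration, not the MuKV or ReKV implementation: it assumes each cache block is summarized by the mean of its key vectors and scored by cosine similarity against a pooled question vector. The function name `retrieve_blocks`, the toy block layout, and the pooled `query` are all hypothetical.

```python
import numpy as np

def retrieve_blocks(block_keys, query, top_k=2):
    """Return indices of the top_k most query-relevant cache blocks, in temporal order.

    block_keys: list of arrays, each (tokens_in_block, dim); hypothetical cached keys.
    query: array (dim,); a hypothetical pooled question vector.
    """
    # Summarize each block by its mean key vector (an assumption of this sketch).
    summaries = np.stack([keys.mean(axis=0) for keys in block_keys])
    # Cosine similarity between each block summary and the query.
    summaries = summaries / np.linalg.norm(summaries, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = summaries @ q
    # Pick the top_k blocks by score, then restore temporal order.
    top = np.argsort(-scores)[:top_k]
    return sorted(top.tolist())

# Toy cache: 4 blocks of 3 tokens each; block i points along axis i.
dim = 4
blocks = [np.tile(np.eye(dim)[i], (3, 1)) for i in range(dim)]
# A query that mostly matches block 2 and partly matches block 0.
query = 0.9 * np.eye(dim)[2] + 0.5 * np.eye(dim)[0]

selected = retrieve_blocks(blocks, query, top_k=2)
```

Returning the selected indices in temporal order is a design choice of this sketch: it keeps the retrieved KV in the order the frames originally arrived before they are handed to the model.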

ViG-RAG

Paper: ViG-RAG: Video-aware Graph Retrieval-Augmented Generation via Temporal and Semantic Hybrid Reasoning
PDF: AAAI Proceedings PDF
Code: AI-Researcher-Team/ViG-RAG
Background: Long-video RAG is harder than text RAG because video evidence is not just a list of documents. Useful information may be distributed across: visual scenes; speech transcripts; entities and events; temporal order; uncertain or noisy observations. If we simply split the video into independent chunks and retrieve by static text similarity, two problems appear: ...

AdaVideoRAG

Paper: AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
Code: xzc-zju/AdaVideoRAG
Background: Long-video understanding is hard because the useful evidence is sparse, long-range, and often spread across multiple modalities: visual content; speech; scene text; temporal relations. RAG is a natural fit here. Instead of feeding the whole video to the MLLM every time, the system can first build a searchable memory, retrieve relevant evidence, and then answer with a smaller context. But a fixed VideoRAG pipeline is not ideal. Easy questions may not need retrieval at all, while hard questions may need structured graph reasoning. ...

Long Streaming Video Understanding Pipeline

Related papers:
ReKV: Paper / Code
StreamKV: Paper / Code
InfiniPot-V: Paper / Code
StreamMem: Paper
LiveVLM: Paper / Code
StreamingTOM: Paper / Code
rLiVS: Paper / Code
Core Question: All of these papers are trying to solve the same systems problem. When a video stream keeps arriving, the user question is not known yet, and GPU memory is limited, how should a Video-LLM process the stream, compress memory, retrieve evidence, and generate an answer with low latency? The methods look different if we read them one by one: KV cache retrieval, semantic chunking, TaR / VaN, chat-template proxy queries, VSB, CTR, OQM, caption RAG, and so on. ...

LiveVLM

Paper: LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
Code: sjtu-zhao-lab/LiveVLM
Background: Online video understanding is harder than offline long-video QA. In the offline setting, the model usually receives a video and a question together. It can then sample, compress, or retrieve content with the query already known. In the online setting, the model has two separate phases:
encoding phase: video frames arrive continuously before any question appears;
response phase: when a user asks a question, the model should answer quickly from the already processed stream.
This creates three constraints at the same time: ...

StreamKV

Paper: StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
Code: sou1p0wer/StreamKV
Background: Streaming video question-answering (StreamingVQA) requires a model to continuously process incoming video, preserve useful historical context, and answer questions online with low latency. ReKV showed that video QA can be reformulated as "retrieve the relevant KV caches first, then answer with the retrieved KV". But it still has several weaknesses:
It uses uniform segmentation, which may cut through semantic boundaries.
It keeps essentially the whole historical visual context, so memory usage is still large.
Its retrieval strategy is not flexible enough, especially when the useful information is distributed differently across layers.
Core Idea: StreamKV extends the ReKV line in two directions at the same time: ...

ReKV

Paper: Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Code: Becomebright/ReKV
Background: Streaming video question-answering (StreamingVQA) presents three challenges:
Efficient Video Encoding: we need to efficiently process incoming frames without access to future frames or frequent revisiting of distant past frames.
Video Context Preservation: models must preserve relevant information from earlier frames.
Real-Time Response: models must provide accurate answers with minimal delay.
Core Idea: Because of how attention is computed, video encoding can be decoupled from question answering. So we can precompute the video KV cache during encoding and reuse it at QA time. ...
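The decoupling can be checked numerically on a toy example. The sketch below is not ReKV's code: it is a single-head, single-layer causal attention in NumPy with random stand-in weights and token states (all names are hypothetical). It shows that answering from cached video keys and values gives the same question-token outputs as recomputing over the full [video; question] sequence. In a multi-layer causal decoder the same argument applies layer by layer, since video-token states never attend to the later question tokens.

```python
import numpy as np

def causal_attention(q, k, v, offset):
    """Single-head attention where query i may attend to keys 0..offset+i."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    n_q, n_k = scores.shape
    # Mask keys that lie in the future of each query position.
    future = np.arange(n_k)[None, :] > (np.arange(n_q)[:, None] + offset)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
dim = 8
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
video = rng.standard_normal((6, dim))     # stand-in for video token states
question = rng.standard_normal((3, dim))  # stand-in for question token states

# Encoding phase: compute and store the video keys/values once.
cached_k, cached_v = video @ w_k, video @ w_v

# QA phase: only question tokens are projected; video K/V come from the cache.
k = np.concatenate([cached_k, question @ w_k])
v = np.concatenate([cached_v, question @ w_v])
answer_states = causal_attention(question @ w_q, k, v, offset=len(video))

# Reference: recompute everything over the full [video; question] sequence.
full = np.concatenate([video, question])
reference = causal_attention(full @ w_q, full @ w_k, full @ w_v, offset=0)[len(video):]

matches_reference = np.allclose(answer_states, reference)
```

Here `matches_reference` compares the cached path against the full recomputation; the only work done at question time is projecting the three question tokens and attending over the stored cache.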