RAG | Yuxuan Tang

ViG-RAG

Paper: ViG-RAG: Video-aware Graph Retrieval-Augmented Generation via Temporal and Semantic Hybrid Reasoning
PDF: AAAI Proceedings PDF
Code: AI-Researcher-Team/ViG-RAG

Background: Long-video RAG is harder than text RAG because video evidence is not just a list of documents. Useful information may be distributed across: visual scenes; speech transcripts; entities and events; temporal order; uncertain or noisy observations. If we simply split the video into independent chunks and retrieve by static text similarity, two problems appear: ...

AdaVideoRAG

Paper: AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
Code: xzc-zju/AdaVideoRAG

Background: Long-video understanding is hard because the useful evidence is sparse, long-range, and often spread across multiple modalities: visual content; speech; scene text; temporal relations. RAG is a natural fit here. Instead of feeding the whole video to the MLLM every time, the system can first build a searchable memory, retrieve relevant evidence, and then answer with a smaller context. But a fixed VideoRAG pipeline is not ideal. Easy questions may not need retrieval at all, while hard questions may need structured graph reasoning. ...

StreamChat

Paper: Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
Code: hmxiong/StreamChat

Background: Most Video-LLMs are still awkward in a real streaming setting. Offline video QA usually assumes: the whole video is already available; the question is known before inference; the interaction is single-turn. But a streaming assistant has a different problem: video frames keep arriving; the user may ask questions at arbitrary timestamps; the system should remember previous conversation turns; the answer should come back with low latency. This is close to the motivation of ReKV, Flash-VStream, LiveVLM, and rLiVS, but StreamChat chooses a different abstraction. ...

rLiVS

Paper: Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
Code: vdorovatas/rLiVS

Background: Streaming video understanding is hard because the model must process incoming frames online, keep useful past information, and still answer questions with low latency. The brute-force solution is to put as many frames as possible into the context window, but this quickly becomes too expensive for long videos. Recent papers handle this in different ways: ReKV keeps rich visual memory in the form of KV cache and retrieves it later, but memory and latency are still significant. Goldfish stores only captions for each short clip, which is cheap, but clip-to-clip continuity can be weak. rLiVS tries to sit between these two directions: ...