HiVU | Yuxuan Tang

Paper: AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding

Code: xzc-zju/AdaVideoRAG

Background

Long-video understanding is hard because the useful evidence is sparse, long-range, and often spread across multiple modalities: visual content, speech, scene text, and temporal relations. RAG is a natural fit here. Instead of feeding the whole video to the MLLM every time, the system can first build a searchable memory, retrieve relevant evidence, and then answer with a smaller context. But a fixed VideoRAG pipeline is not ideal. Easy questions may not need retrieval at all, while hard questions may need structured graph reasoning. ...
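
To make the "adaptive" idea concrete, here is a minimal sketch of routing a question to a retrieval strategy based on how hard it looks. This is not the paper's classifier: the `Route` names, the cue-word lists, and `choose_route` are all hypothetical, and a real system would more plausibly use a learned or LLM-based difficulty classifier than keyword matching.

```python
from enum import Enum


class Route(Enum):
    """Hypothetical retrieval strategies, ordered from cheapest to most expensive."""
    NO_RETRIEVAL = "no_retrieval"        # answer directly from the model
    FLAT_RETRIEVAL = "flat_retrieval"    # look up a few relevant clips or transcripts
    GRAPH_RETRIEVAL = "graph_retrieval"  # reason over structured relations


# Assumed cue words, chosen only for illustration.
MULTI_HOP_CUES = ("why", "before", "after", "relationship", "compare")
LOOKUP_CUES = ("when", "where", "who", "which")


def choose_route(question: str) -> Route:
    """Pick a retrieval strategy from surface cues in the question."""
    words = question.lower().replace("?", " ").split()
    # Check the most expensive route first so multi-hop cues take priority.
    if any(cue in words for cue in MULTI_HOP_CUES):
        return Route.GRAPH_RETRIEVAL
    if any(cue in words for cue in LOOKUP_CUES):
        return Route.FLAT_RETRIEVAL
    return Route.NO_RETRIEVAL


if __name__ == "__main__":
    for q in (
        "Describe the opening scene.",
        "Who speaks first?",
        "Why does the speaker leave after the call?",
    ):
        print(q, "->", choose_route(q).value)
```

The point of the sketch is the control flow, not the heuristic: the router runs before any retrieval, so cheap questions skip the retrieval cost entirely and only the hard ones pay for graph-based reasoning.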