Video-LLM

InfiniPot-V

Paper: InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
Code: aiha-lab/InfiniPot-V
Background
Streaming video understanding is more constrained than offline long-video understanding. In offline settings, the model can see the whole video first, and sometimes the user query as well, before deciding how to compress tokens or the KV cache. In streaming settings, by contrast:
- frames arrive continuously;
- future queries are unknown;
- memory is fixed;
- the KV cache still grows roughly linearly with time.
These constraints are what make many existing KV compression methods awkward for real streaming scenarios. ...
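To make the "grows roughly linearly with time" point concrete, here is a back-of-the-envelope sketch of uncompressed KV cache size for a video stream. The model shape (28 layers, 4 KV heads, head dimension 128, 16-bit values), the 196 tokens per frame, and the 1 fps sampling rate are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers=28, num_kv_heads=4,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to store keys and values for num_tokens tokens.

    All default model dimensions are assumed, illustrative values.
    The leading factor of 2 accounts for storing both K and V.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


def streaming_kv_bytes(seconds, fps=1.0, tokens_per_frame=196, **model):
    """Uncompressed KV cache size after streaming for `seconds` seconds."""
    frames = int(seconds * fps)
    return kv_cache_bytes(frames * tokens_per_frame, **model)


if __name__ == "__main__":
    for minutes in (1, 10, 60):
        size = streaming_kv_bytes(minutes * 60)
        print(f"{minutes:>3} min -> {size / 2**30:.2f} GiB")
```

Under these assumed settings, one hour of video needs about 37.7 GiB for the KV cache alone, and doubling the stream length doubles that figure. A fixed memory budget therefore forces some form of compression or eviction while the stream is still running.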

StreamKV

Paper: StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
Code: sou1p0wer/StreamKV
Background
Streaming video question-answering (StreamingVQA) requires a model to continuously process incoming video, preserve useful historical context, and answer questions online with low latency. ReKV showed that video QA can be reformulated as a two-step process: first retrieve the relevant KV caches, then answer using the retrieved KV. But it still has several weaknesses:
- It uses uniform segmentation, which may cut through semantic boundaries.
- It keeps essentially the whole historical visual context, so memory usage is still large.
- Its retrieval strategy is not flexible enough, especially when the useful information is distributed differently across layers.
Core Idea
StreamKV extends the ReKV line of work in two directions at the same time: ...
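The "retrieve first, then answer" formulation with uniform segmentation can be sketched in a few lines. This is a generic toy illustration, not the actual ReKV or StreamKV algorithm: the mean-key segment summary, the dot-product score, and all function names are assumptions made for the example.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def uniform_segments(frame_keys, segment_len):
    """Split per-frame key vectors into fixed-length chunks (uniform segmentation)."""
    return [frame_keys[i:i + segment_len]
            for i in range(0, len(frame_keys), segment_len)]


def mean_vector(vectors):
    """Element-wise mean, used here as an assumed one-vector segment summary."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def retrieve_segments(segments, query, top_k):
    """Return indices of the top_k segments most similar to the query, in time order."""
    scored = [(dot(mean_vector(seg), query), idx)
              for idx, seg in enumerate(segments)]
    best = sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
    return sorted(idx for _, idx in best)


if __name__ == "__main__":
    # Three toy "scenes", two frames each, with 2-D key vectors.
    frame_keys = [[1.0, 0.0], [1.0, 0.0],
                  [0.0, 1.0], [0.0, 1.0],
                  [-1.0, 0.0], [-1.0, 0.0]]
    segments = uniform_segments(frame_keys, segment_len=2)
    print(retrieve_segments(segments, query=[0.0, 1.0], top_k=1))
```

With `segment_len=2` the chunks happen to line up with the toy scenes. With `segment_len=4` the first chunk would mix two scenes, which is the "cuts through semantic boundaries" weakness listed above.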

ReKV

Paper: Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Code: Becomebright/ReKV
Background
Streaming video question-answering (StreamingVQA) presents three challenges:
- Efficient Video Encoding: we need to efficiently process incoming frames without access to future frames or frequent revisiting of distant past frames.
- Video Context Preservation: models must preserve relevant information from earlier frames.
- Real-Time Response: models must provide accurate answers with minimal delay.
Core Idea
The attention computation makes it possible to decouple video encoding from question answering: the KV cache for the video can be computed ahead of time and then reused during QA. ...
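The decoupling described above can be illustrated with a toy single-head attention layer: keys and values for video tokens are computed once as frames arrive, and a later question only supplies a query vector. This is a minimal sketch under simplifying assumptions (one layer, one head, hand-picked 2x2 projection matrices); the class and function names are hypothetical and do not come from the ReKV code base.

```python
import math


def project(x, w):
    """Compute x @ w for a vector x and a row-major matrix w."""
    return [sum(xi * wij for xi, wij in zip(x, column)) for column in zip(*w)]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over cached keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


class VideoKVCache:
    """Stores K and V for video tokens so that questions can reuse them."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.keys, self.values = [], []

    def encode_frame(self, frame_tokens):
        # Encoding step: runs as frames arrive, before any question is known.
        for token in frame_tokens:
            self.keys.append(project(token, self.w_k))
            self.values.append(project(token, self.w_v))

    def answer(self, question_query):
        # QA step: no video token is re-encoded, only the cache is read.
        return attend(question_query, self.keys, self.values)


if __name__ == "__main__":
    cache = VideoKVCache(w_k=[[1.0, 0.0], [0.0, 1.0]],
                         w_v=[[0.5, 0.0], [0.0, 2.0]])
    cache.encode_frame([[1.0, 0.0], [0.0, 1.0]])
    cache.encode_frame([[1.0, 1.0]])
    print(cache.answer([0.0, 3.0]))
    print(cache.answer([3.0, 0.0]))  # a second question reuses the same cache
```

The design point is that `encode_frame` never needs the question, so the cost of encoding is paid once per frame, and each new question costs only one attention pass over the stored entries.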