StreamKV
Background

Streaming video question-answering (StreamingVQA) requires a model to continuously process incoming video, preserve useful historical context, and answer questions online with low latency.

ReKV showed that video QA can be reformulated as a two-step process: first retrieve the relevant KV caches, then answer using the retrieved KV. It still has several weaknesses:

- It uses uniform segmentation, which may cut through semantic boundaries.
- It keeps essentially the whole historical visual context, so memory usage remains large.
- Its retrieval strategy is not flexible enough, especially when the useful information is distributed differently across layers.

Core Idea

StreamKV extends the ReKV line in two directions at the same time: ...