StreamMem

Paper: StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding

Background

Streaming video understanding is hard because the model has to process frames as they arrive, without knowing how long the video will be, what future user questions will ask, or which past details will become important later.

For long videos, the visual tokens and their KV cache keep growing over time. Even if a long-context MLLM can technically accept many tokens, storing and attending to all historical KV entries is still expensive. ...
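To make the cost of an ever-growing KV cache concrete, here is a rough back-of-the-envelope sketch of how cache size scales with video length when nothing is evicted. Every number in it (layer count, KV heads, head dimension, 16-bit values, frame rate, tokens per frame) is a hypothetical placeholder chosen for illustration, not a figure from the paper or from any particular model.

```python
def kv_cache_bytes(num_tokens, num_layers=28, num_kv_heads=4,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to hold keys and values for num_tokens cached tokens.

    All defaults are assumed values for illustration only.
    """
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def tokens_after(seconds, fps=1.0, tokens_per_frame=196):
    """Visual tokens accumulated after `seconds` of video with no eviction.

    fps and tokens_per_frame are assumed values for illustration only.
    """
    return int(seconds * fps) * tokens_per_frame


if __name__ == "__main__":
    for minutes in (1, 10, 60):
        n = tokens_after(minutes * 60)
        gib = kv_cache_bytes(n) / 2**30
        print(f"{minutes:>3} min: {n:>7} tokens, {gib:.2f} GiB of KV cache")
```

The point of the sketch is only the shape of the growth: cache size is linear in the number of cached tokens, and the number of cached tokens is linear in video duration, so an unbounded stream implies unbounded memory unless some entries are compressed or dropped.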