Token Selection

Paper: Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
Code: vdorovatas/rLiVS

Background

Streaming video understanding is hard because the model must process incoming frames online, keep useful past information, and still answer questions with low latency. The brute-force solution is to put as many frames as possible into the context window, but this quickly becomes too expensive for long videos.

Recent papers handle this in different ways:

ReKV keeps rich visual memory in the form of a KV cache and retrieves it later, but its memory use and latency are still significant.

Goldfish stores only captions for each short clip, which is cheap, but clip-to-clip continuity can be weak.

rLiVS tries to sit between these two directions: ...