Token Compression

Paper: StreamingTOM: Streaming Token Compression for Efficient Video Understanding

Code: YIGE24/StreamingTOM

Background

Streaming video understanding has two constraints that offline video understanding does not really need to respect:

- causality: the model cannot use future frames to decide how to compress current frames;
- accumulation: tokens and KV cache keep growing as the video stream becomes longer.

Most recent training-free streaming methods mainly work on the post-LLM KV cache:

- ReKV stores historical KV blocks and retrieves relevant ones at question time;
- StreamKV improves the segmentation / compression / retrieval pipeline;
- InfiniPot-V and StreamMem keep a bounded KV memory with query-agnostic compression;
- LiveVLM combines query-agnostic KV compression with query-time retrieval.

These methods are useful, but they still have one important blind spot: ...