Background

Streaming video question-answering (StreamingVQA) presents three challenges:

  • Efficient Video Encoding: the model must process incoming frames efficiently, without access to future frames and without frequently revisiting distant past frames.
  • Video Context Preservation: the model must preserve relevant information from earlier frames.
  • Real-Time Response: the model must provide accurate answers with minimal delay.

Core Idea

The attention computation makes it possible to decouple video encoding from question answering: the video's KV cache can be pre-computed once during streaming and then reused for every question.
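This decoupling can be sketched as two independent loops (hypothetical class and method names, not ReKV's actual interfaces): a streaming side that keeps appending KV, and a QA side that reuses whatever has been cached so far.

```python
class StreamingQA:
    """Toy sketch of decoupled encoding and QA (assumed interface)."""

    def __init__(self):
        self.kv_store = []  # per-frame KV cache built during streaming

    def encode_frame(self, frame_kv):
        # Streaming side: runs continuously, independent of any question.
        self.kv_store.append(frame_kv)

    def answer(self, question):
        # QA side: reuses the pre-computed KV; no re-encoding of the video.
        return f"answer({question}) over {len(self.kv_store)} cached frames"

qa = StreamingQA()
for f in range(10):
    qa.encode_frame({"k": f, "v": f})
print(qa.answer("what happened?"))  # → answer(what happened?) over 10 cached frames
```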

Method

Video encoding via Sliding Window Attention

  • Full attention over an ever-growing frame stream is prohibitively expensive (the attention cost grows as $O(n^2)$ with the number of frames), so ReKV adopts sliding-window attention.
  • In this stage, the model processes the video chunk by chunk with sliding-window attention, producing a KV cache per layer.
  • Out-of-window KV caches are offloaded to CPU RAM or disk.
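The caching-plus-offloading step can be sketched as follows (a toy illustration with hypothetical names, not ReKV's actual code): KV entries inside the local window stay "on GPU", and the oldest entries are moved to a CPU-side store when they fall out of the window, instead of being discarded.

```python
class SlidingWindowKVCache:
    """Per-layer KV cache with a sliding window and offloading (sketch)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.in_window = []   # KV entries kept inside the local window ("GPU")
        self.offloaded = []   # out-of-window KV entries, offloaded in order ("RAM/disk")

    def append_chunk(self, chunk_kv):
        # Cache the new chunk's KV, then evict the oldest tokens past the window.
        self.in_window.extend(chunk_kv)
        while len(self.in_window) > self.window_size:
            self.offloaded.append(self.in_window.pop(0))

cache = SlidingWindowKVCache(window_size=4)
for chunk in range(3):                       # 3 chunks of 2 tokens each
    cache.append_chunk([("k", chunk), ("k", chunk)])
print(len(cache.in_window), len(cache.offloaded))  # → 4 2
```

Nothing is lost: the offloaded entries remain addressable for the retrieval stage described next.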

Retrieval

External Video KV-Cache Retrieval

  • ReKV can use an external CLIP-like model to retrieve relevant video KV caches.
  • The external model encodes the frames and the question into the same embedding space and computes cosine similarity between them.
  • The KV caches of the top-ranked frames are subsequently loaded onto the GPU for question answering.
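The ranking step amounts to a cosine-similarity top-k over frame embeddings. A minimal sketch, assuming the embeddings come from a CLIP-like encoder (the function name and shapes are illustrative):

```python
import numpy as np

def retrieve_topk(frame_embs, question_emb, k):
    """Rank frames by cosine similarity to the question embedding and return
    the indices of the top-k frames, whose KV blocks would then be loaded."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = question_emb / np.linalg.norm(question_emb)
    sims = f @ q                      # cosine similarity per frame
    return np.argsort(sims)[::-1][:k]  # indices of the k most similar frames

frames = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # toy frame embeddings
question = np.array([0.0, 1.0])                          # toy question embedding
print(retrieve_topk(frames, question, k=2))  # → [1 2]
```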

Internal Video KV-Cache Retrieval

  • Like external retrieval, internal retrieval operates at the level of video frames or blocks.
  • Here the VideoQA model itself does the retrieval: for each frame, the average of its key vectors is used as the representative frame vector (ReKV does not differentiate between attention heads; it concatenates them into a single vector). The question vector is computed the same way.
  • Frames are then ranked by the same cosine similarity as in external retrieval.
  • Note: internal retrieval allows different layers to retrieve different video blocks.
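The representative-vector computation can be sketched like this (the shapes are an assumption for illustration: `frame_keys` is `[tokens, heads, head_dim]` for one layer):

```python
import numpy as np

def frame_vector(frame_keys):
    """Average a frame's per-token key vectors into one representative vector,
    concatenating all attention heads first rather than treating them separately."""
    tokens, heads, head_dim = frame_keys.shape
    flat = frame_keys.reshape(tokens, heads * head_dim)  # concat heads per token
    return flat.mean(axis=0)                             # mean over the frame's tokens

keys = np.random.rand(5, 4, 16)  # toy: 5 tokens, 4 heads, head dim 16
vec = frame_vector(keys)
print(vec.shape)  # → (64,)
```

Because each layer has its own keys, running this per layer naturally yields per-layer frame vectors, which is what lets different layers retrieve different blocks.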

Question-Answering Using Retrieved KV

  • The retrieved KV caches serve as the attention context for generating the answer.

Position Encoding

I think this part is very important for other models as well.

  • ReKV's base models employ RoPE.
  • For question answering, however, ReKV does not preserve the original positions of the retrieved KV caches, since handling unseen distances between tokens presents significant challenges.
  • Instead, the retrieved tokens are treated as regular consecutive tokens.
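Concretely, the retrieved blocks are re-indexed as one contiguous sequence, with the question tokens continuing right after. A minimal sketch of that position re-assignment (function name and arguments are hypothetical):

```python
import numpy as np

def reassign_positions(retrieved_block_lens, question_len):
    """Instead of keeping each retrieved KV block's original video positions,
    treat the blocks as one consecutive token sequence, followed by the question."""
    total = sum(retrieved_block_lens)
    kv_positions = np.arange(total)                       # 0 .. total-1
    q_positions = np.arange(total, total + question_len)  # continues consecutively
    return kv_positions, q_positions

kv_pos, q_pos = reassign_positions([3, 2], question_len=4)
print(kv_pos.tolist(), q_pos.tolist())  # → [0, 1, 2, 3, 4] [5, 6, 7, 8]
```

This keeps all relative distances within the RoPE range the model saw during training, at the cost of discarding the true temporal gaps between blocks.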

Remark: Position encoding is getting more and more complicated in later models (such as Qwen3-VL). I wonder whether this trick still works on newer models.

Experiments

Benchmark and Metrics

  • $\text{MLVU}_{dev-mc}$: multiple-choice subset of the MLVU-dev benchmark. The evaluation metric is Accuracy.
  • $\text{QAEGO4D}_{test\text{-}mc}$: A multiple-choice benchmark for long egocentric video question answering. Evaluated by Accuracy.
  • EgoSchema: A long-video multiple-choice benchmark that stresses long-range temporal understanding. Evaluated by Accuracy.
  • ActivityNet-QA: An open-ended video question answering benchmark for long-term spatiotemporal reasoning. Evaluated by answer accuracy / score.
  • RVS-Ego and RVS-Movie: Streaming VideoQA benchmarks with timestamped open-ended questions, used to evaluate performance in real-time video understanding.
  • CGBench$_{mc}$: A multiple-choice benchmark for clue-grounded question answering in long videos, suitable for testing retrieval quality.

Implementation Details

  • LLaVA-OV-0.5B and LLaVA-OV-7B; NVIDIA A100 (80GB) GPU; FP16; 0.5 FPS frame sampling; local window = 15K tokens; SigLIP-SO400M as the external retriever

Others

  • Encoding speeds are high, with LLaVA-OV-7B achieving 11 FPS.
  • Memory: 18.8GB/h with LLaVA-OV-7B

Limitations

  • High memory usage for long videos (the KV cache grows as $O(n)$)
  • A fixed block size may break video continuity
  • The number of retrieved frames is fixed
  • Streaming VideoQA benchmarks are scarce
  • Sliding-window attention may limit the model's capacity