LiveVLM

Paper: LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

Code: sjtu-zhao-lab/LiveVLM

Background

Online video understanding is harder than offline long-video QA. In the offline setting, the model usually receives a video and a question together. It can then sample, compress, or retrieve content with the query already known. In the online setting, the model has two separate phases:

- encoding phase: video frames arrive continuously before any question appears;
- response phase: when a user asks a question, the model should answer quickly from the already processed stream.

This creates three constraints at the same time: ...