📚 Coursework

Real-time Video LLM

Real-time Video LLM moves the mental model beyond sampler and memory design to 'Streaming Pipeline + Adaptive Processing'. From VideoLLM-online's EOS-based stream alignment and Flash-VStream's STAR memory hierarchy to vLLM's continuous batching, the course covers the design and operation of real-time video inference systems that process incoming data within a latency budget.

Advanced · 11 chapters · Python

Curriculum


🎯The Real-time Video LLM Problem

Real-time Video LLM is a systems problem: the question is neither 'which frames to pick' nor 'how to remember', but 'how to process a never-ending stream within a latency budget'.

Mental model shift: sampler → memory → streaming pipeline
Real-time benchmarks (VStream-Bench, OVO-Bench, StreamingBench, ETBench)
VideoLLM-online and EOS-based stream alignment

📡Streaming Input Pipeline

The input pipeline that carries an incoming video stream to the model is a systems problem in which backpressure, frame ordering, and drop policy must be designed explicitly.

RTSP/WebRTC/camera capture patterns
Producer-consumer queue and backpressure
Circular buffer vs growing queue
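
As a minimal sketch of the circular-buffer option, assuming a drop-oldest policy: a bounded buffer can discard stale frames instead of growing without limit. The class name `FrameBuffer` and its `dropped` counter are hypothetical illustrations, not part of any library.

```python
from collections import deque
import threading


class FrameBuffer:
    """Bounded frame buffer with a drop-oldest policy (hypothetical helper)."""

    def __init__(self, capacity: int):
        self._frames = deque(maxlen=capacity)
        self._lock = threading.Lock()
        self.dropped = 0  # number of frames evicted before being consumed

    def put(self, frame) -> None:
        with self._lock:
            if len(self._frames) == self._frames.maxlen:
                # A full deque with maxlen evicts its oldest entry on append.
                self.dropped += 1
            self._frames.append(frame)

    def get(self):
        """Return the oldest buffered frame, or None if the buffer is empty."""
        with self._lock:
            return self._frames.popleft() if self._frames else None


# Producer pushes 5 frames into a buffer that holds only 3.
buf = FrameBuffer(capacity=3)
for frame_id in range(5):
    buf.put(frame_id)
```

A growing queue would keep all five frames and let latency accumulate; the bounded version trades completeness for freshness and makes the loss visible through the drop counter.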

🎬Adaptive Frame Sampling under Latency Budget

In real time, frame sampling is not a matter of picking the most informative frames but a causal, online decision to process only as much as the budget allows.

Offline vs causal vs online sampling
Fixed-interval, event-driven, and query-aware sampling
EOS-based sampling in VideoLLM-online
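
One way to picture a causal, budget-driven decision is the sketch below. It is a simple heuristic chosen for illustration, not the EOS-based method of VideoLLM-online: a frame is accepted only if the previous accepted frame is estimated to have finished processing. `BudgetSampler`, `initial_cost_ms`, and `alpha` are hypothetical names.

```python
class BudgetSampler:
    """Causal sampler: decisions use only past timestamps and past costs."""

    def __init__(self, initial_cost_ms: float, alpha: float = 0.5):
        self.cost_ms = initial_cost_ms   # running estimate of per-frame cost
        self.alpha = alpha               # smoothing factor for the estimate
        self.busy_until = float("-inf")  # time the pipeline is expected to be free

    def should_process(self, ts_ms: float) -> bool:
        if ts_ms < self.busy_until:
            return False                 # pipeline still busy: skip this frame
        self.busy_until = ts_ms + self.cost_ms
        return True

    def record_cost(self, observed_ms: float) -> None:
        # Exponential moving average of the observed processing cost.
        self.cost_ms = (1 - self.alpha) * self.cost_ms + self.alpha * observed_ms


# Frames arrive every 33 ms while processing is assumed to take 100 ms.
sampler = BudgetSampler(initial_cost_ms=100)
accepted = [i for i in range(10) if sampler.should_process(i * 33)]
```

The sampler never looks ahead, so it can run on a live stream; feeding measured costs back through `record_cost` lets the sampling rate follow the actual load.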

🪟Sliding Window & Streaming Context

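Streaming context management resembles hour-scale memory management, but two constraints make it decisively different: nothing can be pre-processed, and context expires while the stream is still running. That is why a tiered memory such as STAR comes close to the right answer.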

Token budget management for the sliding window
Fixed N vs adaptive window
Memory tiers: Spatial / Temporal / Abstract / Retrieved (Flash-VStream STAR)
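
The token-budget view of a sliding window can be sketched as follows, assuming each frame contributes a known number of visual tokens and the oldest frames expire first. `TokenWindow` is a hypothetical name, and the eviction rule is the simplest possible one rather than a tiered scheme like STAR.

```python
from collections import deque


class TokenWindow:
    """Sliding window that keeps total visual tokens under a fixed budget."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.frames = deque()  # entries are (frame_id, n_tokens)
        self.total = 0

    def push(self, frame_id: int, n_tokens: int) -> list:
        """Add a frame and return the ids of frames evicted to make room."""
        self.frames.append((frame_id, n_tokens))
        self.total += n_tokens
        evicted = []
        # Expire the oldest frames until the budget holds; always keep the newest.
        while self.total > self.max_tokens and len(self.frames) > 1:
            old_id, old_tokens = self.frames.popleft()
            self.total -= old_tokens
            evicted.append(old_id)
        return evicted
```

With a budget of 500 tokens and 196 tokens per frame, the third `push` evicts the first frame. A tiered memory would compress or summarize the expired frame instead of discarding it.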

⚡Visual Encoder Optimization

In a real-time video LLM the vision encoder is a hotspot that consumes 30-60% of the latency budget. Optimizations such as token reduction, quantization, and micro-batching turn the problem of force-fitting the encoder into one of systematic budget allocation.

CLIP/SigLIP encoder latency analysis
Token reduction: TokenLearner, patch dropout
Distillation: TinyCLIP

🗜️Token Compression for Streaming

Offline compression (HiCo, offline TokenMerge) rests on the assumption that the whole sequence is visible at once. Streaming adds a constraint: compression must happen incrementally over time, and the budget must be reallocated every time a new frame arrives. This chapter rereads Flash-VStream's STAR memory hierarchy, online TokenMerge variants, and cluster-based summarization from a streaming perspective.

The essential difference between offline and streaming compression
Flash-VStream STAR memory hierarchy (Spatial / Temporal / Abstract / Retrieved)
Online TokenMerge / incremental clustering

🚀KV Cache & Continuous Batching for Real-time

The last stage of the streaming pipeline is LLM inference. Here the KV cache strategy (prefix cache vs sliding window), continuous batching (vLLM, SGLang, TensorRT-LLM), and speculative decoding all converge on one question: how to handle the asymmetry between prefill and decode latency. Once you recognize that prefill happens on every frame in a real-time video LLM, the intent behind inference-side decisions becomes clear.

Prefill vs decode latency asymmetry
KV cache layout and memory footprint
Prefix cache (shared prompt) for streaming

📊End-to-End Latency Profiling

The latency of a real-time video LLM breaks down into six stages (capture → decode → sample → encode → reduce → LLM) that end in the response. Measure p50/p95/p99 separately for each stage and use Amdahl's law to find the bottleneck. Setting the SLA as 'overall p99' is the lazy answer; only a per-stage budget breakdown makes reproducible debugging possible in production.

6-stage latency breakdown for real-time video LLM
p50 vs p95 vs p99: statistical meaning and product implications
Amdahl's law for system bottlenecks
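
The per-stage measurement and the Amdahl's law check can be sketched in a few lines. The helper names are hypothetical, and the percentile uses the nearest-rank definition, which is one of several common conventions.

```python
import math


def percentile(samples, q: int):
    """Nearest-rank percentile: smallest sample with at least q% of data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def amdahl_speedup(fraction: float, stage_speedup: float) -> float:
    """Overall speedup when a stage taking `fraction` of total time gets `stage_speedup`x faster."""
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)


# Illustrative numbers only: if the encoder takes half of end-to-end latency,
# making it 2x faster improves the whole pipeline by only about 1.33x.
overall = amdahl_speedup(fraction=0.5, stage_speedup=2.0)
```

Running `percentile` on each stage's samples separately shows which stage owns the tail, and `amdahl_speedup` bounds how much optimizing that stage can help end to end.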

🏗️Producer-Consumer Pipeline Architecture

The essence of streaming pipeline architecture lies in three decisions: how to group the six stages (async/sync, GPU/CPU separation), how to operate the queues between stages (backpressure, drop policy), and how to handle failure modes (queue saturation, GPU starvation, head-of-line blocking). The choice among Ray, asyncio, and multiprocessing follows from these decisions.

Producer-consumer pattern in streaming pipeline
Async vs sync stage separation
Isolation of GPU and CPU stages
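
A minimal sketch of the producer-consumer pattern with blocking backpressure, assuming asyncio is the chosen runtime: a bounded `asyncio.Queue` makes the producer wait whenever the consumer falls behind. The function names and the `None` sentinel are illustrative assumptions, and `asyncio.sleep(0)` stands in for real encode and LLM work.

```python
import asyncio


async def producer(queue: asyncio.Queue, n_frames: int) -> None:
    for frame_id in range(n_frames):
        # put() suspends while the queue is full: this is the backpressure.
        await queue.put(frame_id)
    await queue.put(None)  # sentinel telling the consumer to stop


async def consumer(queue: asyncio.Queue, results: list) -> None:
    while True:
        frame_id = await queue.get()
        if frame_id is None:
            break
        await asyncio.sleep(0)  # placeholder for downstream processing
        results.append(frame_id)


async def run_pipeline(n_frames: int = 5, maxsize: int = 2) -> list:
    queue = asyncio.Queue(maxsize=maxsize)
    results = []
    await asyncio.gather(producer(queue, n_frames), consumer(queue, results))
    return results


processed = asyncio.run(run_pipeline())
```

Blocking backpressure preserves every frame and their order at the cost of latency. For live streams where freshness matters more, the bounded queue is usually paired with an explicit drop policy instead.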

🌐Production Deployment & Monitoring

Production for a real-time video LLM runs on three core metrics: latency histogram, drop rate, and GPU utilization. A/B comparisons gain reliability through same-frame replay or shadow traffic, and canary rollouts isolate deployment risk. The chapter covers graceful degradation policy under overload ('drop frame vs drop quality'), cost-aware autoscaling, and the operating patterns of real systems such as Gemini Live, GPT-4o Realtime, and Twelve Labs Marengo.

Production observability: latency histogram, drop rate, GPU utilization
Same-frame replay and shadow traffic A/B testing
Canary rollout and traffic split

🌟Architecting the Next-Generation Real-time VLM

Taking a hard look at the real limits of SOTA as of June 2026, you design the next-generation system that addresses them: Cascade architecture, Chunk-level temporal encoder, Structured scene-graph memory, and the hardest problem of all, 'Timing of Speech'.

The real limits of current SOTA
Cascade Architecture (per-stage latency budget)
Chunk-level Temporal Encoder (V-JEPA-2)