📚 Coursework

Temporal Grounding for Video VLMs (2026)

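Until 2024, temporal grounding meant DETR-based boundary regression; in 2025-2026 it was redefined as a generation problem in which the VLM emits timestamps directly. The course covers six key papers, Time-R1 (NeurIPS 2025), UniTime (NeurIPS 2025), VideoMind (ICLR 2026), MeCo (ICLR 2026), VideoITG (CVPR 2026 Highlight), and TimeLens (CVPR 2026), and ends with a survey of 12 new paper ideas and the data they need.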

Advanced · 10 chapters · Python

Curriculum

10 chapters

🎯What is Temporal Grounding

What Is Temporal Grounding?

Temporal grounding is the task of finding, given a video V and a natural-language query q, the time interval [t_start, t_end] of the event that q describes. Through 2024 it was a proposal- or DETR-based boundary regression problem; in 2025–2026 it was recast as a verbal generation problem in which the VLM emits timestamps directly as tokens. The emblem of this paradigm shift is Time-R1 (arXiv:2503.13377), which scored R1@0.3 = 78.1 zero-shot on Charades-STA, surpassing the 74.5 of the SFT baseline VideoChat-Flash.

Task family: TSG vs MR vs TAL · Formal problem definition (V, q → [t_s, t_e]) · Eval metrics: R@N@IoU, mIoU, mAP, HIT@1
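The metrics listed above all build on temporal IoU between a predicted and a ground-truth interval. The following is a minimal sketch, assuming one prediction per query and spans given in seconds; the function names are illustrative and not taken from any benchmark toolkit.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (t_start, t_end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    if pred[1] <= pred[0] or gt[1] <= gt[0]:
        return 0.0  # degenerate or inverted span counts as a miss
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union


def recall_at_1(preds: Sequence[Span], gts: Sequence[Span], thresh: float) -> float:
    """R1@thresh: fraction of queries whose top-1 span reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)


def mean_iou(preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """mIoU: average temporal IoU over all queries."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


preds = [(2.0, 8.0), (10.0, 20.0)]
gts = [(4.0, 10.0), (30.0, 40.0)]
print(recall_at_1(preds, gts, 0.5), mean_iou(preds, gts))
```

In this toy case the first prediction overlaps its target with IoU 0.5 and the second misses entirely, so R1@0.5 and mIoU disagree about how good the model is, which is why results are usually reported at several thresholds.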

📏Benchmark Landscape & 7 Biases

The benchmark landscape and seven biases

Temporal grounding benchmarks span a roughly 1,000× range of time scales, from 30-second clips in Charades-STA to 9-hour videos in ExtremeWhenBench. Benchmarks that share the same task name measure entirely different abilities, and seven known biases (caption-only prior, word shortcut, missing negative annotations, localization-description entanglement, discrete granularity, train/test leak, long-form scarcity) mean that in-domain scores alone end up measuring dataset hacks rather than genuine grounding.

11 benchmarks landscape · SOTA scores per benchmark (2026-06) · Otani et al. prior-only baseline
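A prior-only baseline never looks at the video or the query; it just predicts the span that fits the annotation distribution best. The sketch below is in that spirit but is not the exact procedure of Otani et al.: the grid search, the normalization of spans to [0, 1], and all function names are assumptions made for illustration.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def fit_prior_span(train_spans: Sequence[Span], grid: int = 10) -> Span:
    """Pick the normalized span with the highest mean tIoU over training annotations.

    train_spans are (start / duration, end / duration), so every value is in [0, 1].
    """
    best, best_score = (0.0, 1.0), -1.0
    for i in range(grid):
        for j in range(i + 1, grid + 1):
            cand = (i / grid, j / grid)
            score = sum(temporal_iou(cand, s) for s in train_spans) / len(train_spans)
            if score > best_score:
                best, best_score = cand, score
    return best


def predict_blind(prior: Span, duration: float) -> Span:
    """Same answer for every query: the prior span scaled to the video length."""
    return (prior[0] * duration, prior[1] * duration)


prior = fit_prior_span([(0.0, 0.3), (0.0, 0.4), (0.1, 0.4), (0.5, 0.9)])
print(predict_blind(prior, duration=30.0))
```

If a learned model barely beats this blind predictor on a benchmark, the benchmark score says more about the annotation distribution than about grounding ability.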

🏛️Pre-VLM Foundations — The DETR Era

The pre-VLM era: from 2D-TAN to DETR

Before 2024, temporal grounding SOTA converged on two families: (1) the 2D-TAN family, which pairs a proposal grid with a ranking head, and (2) the set-prediction Transformer family that began with Moment-DETR and evolved through QD-DETR, CG-DETR, and UniVTG. Both plateaued at around 60 R1@0.5 on Charades-STA. The cause was structural: language understanding was tied to a frozen encoder, and the head, a span regressor, had a closed output space. This ceiling is what motivates the paradigm shift in chapter 4, where the VLM outputs timestamps directly through verbal generation.

2D-TAN / MS-2D-TAN proposal-based grounding · Moment-DETR (M-DETR): the Transformer decoder's first entry · QD-DETR query-dependent video representation

🧠VLM-as-Grounder — Timestamp Generation

VLM-as-Grounder: the era of emitting timestamps directly

Starting in 2025, temporal grounding was redefined from a boundary regression problem into a generation problem that emits timestamps as text tokens. UniTime (NeurIPS 2025, arXiv:2506.18883) handled Charades-STA, ActivityNet, TACoS, and QVHighlights simultaneously with a single generative MLLM. MeCo (ICLR 2026, arXiv:2503.09027) avoids writing timestamps at all: using structural tokens plus contrastive grounding, it reached mAP = 45.3 and HIT@1 = 75.1 on QVHighlights, overtaking M-DETR, UMT, QD-DETR, CG-DETR, and UniVTG.

Verbal timestamp generation paradigm · UniTime (NeurIPS 2025): universal generative MLLM grounder · MeCo (ICLR 2026): timestamp-free semantic-oriented approach
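When a model emits timestamps as text, the evaluation harness has to turn that text back into an interval. Below is a minimal sketch of that step. The output phrasing ("from 12.5 to 30.0 seconds") and the regular expression are assumptions for illustration; each model has its own prompt template and answer format.

```python
import re
from typing import Optional, Tuple

Span = Tuple[float, float]

# Hypothetical pattern: "<number> [unit] to|-|and <number>".
_SPAN_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:seconds|sec|s)?\s*(?:to|-|and)\s*(\d+(?:\.\d+)?)"
)


def parse_span(text: str, duration: float) -> Optional[Span]:
    """Extract the first (start, end) pair from generated text.

    Returns None when nothing parses or the span is empty after clamping
    to [0, duration], so the caller can score it as a miss.
    """
    m = _SPAN_RE.search(text)
    if m is None:
        return None
    start = max(0.0, float(m.group(1)))
    end = min(duration, float(m.group(2)))
    if end <= start:
        return None
    return (start, end)


print(parse_span("The event happens from 12.5 to 30.0 seconds.", duration=60.0))
print(parse_span("I cannot find this event.", duration=60.0))
```

Treating unparseable output as an explicit miss, rather than silently skipping it, keeps scores comparable between models that follow the format and models that do not.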

🎮RL Fine-tuning Era — Time-R1, TempSamp-R1, VideoTemp-o3, TimeLens

The RL fine-tuning era: GRPO and verifiable rewards

Since the second half of 2025, SOTA in temporal grounding has been held not by SFT but by GRPO-based RL post-training. Time-R1 (NeurIPS 2025, arXiv:2503.13377) RL fine-tuned Qwen2.5-VL-7B on just 2.5K TimeRFT samples and reached zero-shot Charades-STA R1@0.5 = 60.8, overtaking large-scale SFT baselines such as VideoChat-Flash, VideoMind, and TimeSuite in one step. The key weapon is a verifiable reward, tIoU plus a format reward: the answer can be scored with a single line of arithmetic rather than an LLM judge, and that is what makes RL practical. But a model that maximizes IoU alone learns a reward hack: it abandons semantic alignment and inflates the length of the predicted span. The penalty-aware IoU of VideoTemp-o3 (arXiv:2602.07801) and the thinking-free RLVR of TimeLens (CVPR 2026) are two design patterns that respond to this trap.

Why RL suits grounding: IoU is a verifiable, discrete, cheap reward · The GRPO algorithm and how it differs from PPO · Time-R1 (NeurIPS 2025)
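The verifiable reward and the group-relative baseline can be sketched in a few lines. This is an illustration under stated assumptions, not the reward used by Time-R1 or any other paper: the `<answer>` tag format, the `format_bonus` weight, and the function names are all hypothetical. The advantage step follows the general GRPO idea of normalizing rewards within a group of completions sampled for the same prompt, which stands in for the learned value baseline that PPO uses.

```python
import re
import statistics
from typing import List, Tuple

Span = Tuple[float, float]

# Hypothetical answer template; real training recipes define their own.
ANSWER_RE = re.compile(r"<answer>\s*(\d+(?:\.\d+)?)\s*to\s*(\d+(?:\.\d+)?)\s*</answer>")


def temporal_iou(pred: Span, gt: Span) -> float:
    if pred[1] <= pred[0] or gt[1] <= gt[0]:
        return 0.0
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union


def grounding_reward(completion: str, gt: Span, format_bonus: float = 0.5) -> float:
    """tIoU plus a format bonus; zero when the answer tag is missing."""
    m = ANSWER_RE.search(completion)
    if m is None:
        return 0.0
    pred = (float(m.group(1)), float(m.group(2)))
    return temporal_iou(pred, gt) + format_bonus


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: standardize rewards within one prompt's samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


gt = (10.0, 20.0)
group = [
    "<answer>10 to 20</answer>",   # exact match
    "<answer>0 to 100</answer>",   # whole-video guess still earns IoU 0.1
    "somewhere in the middle",     # no parseable answer
]
rewards = [grounding_reward(c, gt) for c in group]
print(rewards, group_advantages(rewards))
```

The second completion shows why IoU alone invites reward hacking: covering the whole video always yields a nonzero reward, whatever the query says.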

🕵️Agentic Search for Long-Form Grounding

Agentic search: finding a moment in an hour-long video

Finding a 30-second moment in an hour-long video is no longer a grounding problem but a **search problem**. ExtremeWhenBench (arXiv:2606.12300) showed that a monolithic Video-LLM reaches mIoU 0.110, CLIP retrieval alone reaches 0.269, and a retrieve-then-ground hybrid reaches 0.354, roughly 3.2× the monolithic score, and it quantified that **85% of failures are search failures and only 11% are localization failures**. This chapter covers why a single-pass VLM collapses at hour scale and how three designs each address the problem: the Chain-of-LoRA of VideoMind (ICLR 2026), the agentic loop of AVI (arXiv:2511.14446), and the retrieve-zoom-verify pattern of Deep Video Discovery (arXiv:2505.18079).

Token budget explosion in hour-scale grounding · Temporal dispersion · VideoMind (ICLR 2026): Chain-of-LoRA 4-role agent
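The retrieve-then-ground idea splits the work into a cheap search over windows and an expensive localization inside the winner. The sketch below shows only that control flow. `score_window` (for example a CLIP-style similarity) and `ground_in_window` (a short-clip grounder) are hypothetical callables supplied by the caller, and the window and stride defaults are arbitrary; none of this reproduces a specific system from the chapter.

```python
from typing import Callable, Iterator, List, Tuple

Span = Tuple[float, float]


def sliding_windows(duration: float, window: float, stride: float) -> Iterator[Span]:
    """Overlapping windows that cover [0, duration]."""
    t = 0.0
    while t < duration:
        yield (t, min(t + window, duration))
        t += stride


def retrieve_then_ground(
    duration: float,
    score_window: Callable[[Span], float],    # assumed: query-to-window relevance
    ground_in_window: Callable[[Span], Span],  # assumed: returns a span relative to window start
    window: float = 60.0,
    stride: float = 30.0,
    top_k: int = 1,
) -> List[Span]:
    """Search first, localize second; returns spans in global video time."""
    ranked = sorted(sliding_windows(duration, window, stride), key=score_window, reverse=True)
    results = []
    for w_start, w_end in ranked[:top_k]:
        local_start, local_end = ground_in_window((w_start, w_end))
        results.append((w_start + local_start, min(w_start + local_end, w_end)))
    return results


# Toy stand-ins: the "event" sits at 1810-1835 s of a one-hour video.
target = (1810.0, 1835.0)
overlap = lambda w: max(0.0, min(w[1], target[1]) - max(w[0], target[0]))
oracle = lambda w: (max(target[0], w[0]) - w[0], min(target[1], w[1]) - w[0])
print(retrieve_then_ground(3600.0, overlap, oracle))
```

With this split, the expensive grounder only ever sees one minute of video, so its token budget stays constant no matter how long the source video is; the search stage carries the cost of scale.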

📡Streaming + Online Grounding

Streaming + online grounding: deciding without future frames

In June 2026, StreamingHarness (arXiv:2606.08615) set a new streaming SOTA with a 61.4% narration win rate, but that number rests on three constraints: **vision-only, 1 FPS, and causal**. Together with the KV-cache compression of CacheFlow (arXiv:2511.13644) and the live architecture of LiveVLM (arXiv:2505.15269), it recasts online grounding not as making an offline model faster but as an essentially different problem: one decision per second with no lookahead.

Mathematical definition of the online setting · StreamingHarness: 61.4% narration win rate · CacheFlow: KV-cache compression
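The causal constraint means that the decision at second t may depend only on frames up to t. Here is a minimal sketch of such a decision loop, assuming a hypothetical per-second relevance score in [0, 1] from some upstream model; the hysteresis thresholds `on` and `off` are illustrative values, not taken from any of the cited systems.

```python
from typing import Iterable, Iterator, Tuple


def online_ground(scores: Iterable[float], on: float = 0.6, off: float = 0.4) -> Iterator[Tuple[int, int]]:
    """Emit (t_start, t_end) in seconds as soon as each event closes.

    One decision per incoming score, no lookahead: an event opens when the
    score rises to `on` and closes when it drops below `off`.
    """
    start = None
    t = -1
    for t, s in enumerate(scores):
        if start is None and s >= on:
            start = t
        elif start is not None and s < off:
            yield (start, t)
            start = None
    if start is not None:
        # The stream ended while an event was still open.
        yield (start, t + 1)


stream = [0.1, 0.2, 0.7, 0.8, 0.5, 0.3, 0.1, 0.9]
print(list(online_ground(stream)))
```

The two thresholds keep a single noisy frame from opening or closing an event, which matters more online than offline because a wrong decision cannot be revised once emitted.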

🔌Plug-and-Play: VideoITG and the Empty Field

The plug-and-play gap: why VideoITG stands alone

Of the six major 2026 temporal grounding papers, only one is truly plug-and-play, working without touching the downstream Video-LLM: VideoITG (NVIDIA, CVPR 2026 Highlight, arXiv:2507.13353). VideoITG is a frame-selector module: it uniformly samples 512 frames, picks the Top-K by instruction-conditioned scoring, and hands them unchanged to an arbitrary Video-LLM. It was trained on top of VideoITG-40K and the VidThinker automatic annotation pipeline. By contrast, VideoMind, MeCo, Time-R1, UniTime, and TimeLens all require fine-tuning the base model.

VideoITG (CVPR 2026 Highlight) architecture and the VidThinker pipeline · The plug-and-play interface contract · Why VideoMind, MeCo, Time-R1, UniTime, and TimeLens require fine-tuning
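The plug-and-play contract is small: the selector takes a video and an instruction and returns frame indices, and the downstream model is never modified. The sketch below shows only the sample-then-select step. The scores are assumed to come from some instruction-conditioned scorer, which is not implemented here, and the function names are illustrative rather than VideoITG's API.

```python
from typing import List, Sequence


def uniform_indices(num_frames: int, n: int = 512) -> List[int]:
    """Uniformly sample up to n frame indices from a video of num_frames frames."""
    if num_frames <= n:
        return list(range(num_frames))
    return [(i * num_frames) // n for i in range(n)]


def select_top_k(indices: Sequence[int], scores: Sequence[float], k: int) -> List[int]:
    """Keep the k highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(indices)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(indices[i] for i in ranked)


candidates = uniform_indices(num_frames=1024, n=512)
# Placeholder scores; a real selector would score each frame against the instruction.
scores = [1.0 if idx in (100, 500, 900) else 0.0 for idx in candidates]
print(select_top_k(candidates, scores, k=3))
```

Returning the selection in temporal order matters because the downstream Video-LLM sees the frames as an ordinary, shorter video and relies on their order.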

🛡️Trust: Hallucination, Faithfulness, Abstention

Untrustworthy grounders: hallucination, faithfulness, abstention

CounterVid and DIQ-H revealed that SOTA grounding models scoring above 70 on Charades-STA R1@0.5 produce false-positive hallucinations on 30-50% of queries about events that do not exist. Step-Level Faithfulness research shows that the faithfulness score of the reasoning process predicts OOD generalization better than in-domain accuracy does, and VideoTemp-o3's penalty-aware IoU addresses RL reward hacking, reaching NextGQA mIoU 33.4 and Acc 76.4%. A trustworthy temporal grounding system requires a fundamental redesign that confronts hallucination head-on along all three axes: benchmarks, training, and evaluation.

Three failure modes: hallucination, unfaithful rationale, inability to abstain · CounterVid and DIQ-H: quantifying the reliability gap · Step-Level Faithfulness: even a correct answer can be wrong
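Abstention and hallucination rate can be made concrete with a toy evaluation. The sketch assumes the grounder exposes some confidence value alongside its span, which many models do not provide directly; the threshold and function names are hypothetical, and this is not the protocol of CounterVid or DIQ-H.

```python
from typing import List, Optional, Tuple

Span = Tuple[float, float]


def answer_or_abstain(span: Span, confidence: float, threshold: float = 0.5) -> Optional[Span]:
    """Return the span only when confidence clears the threshold; None means abstain."""
    return span if confidence >= threshold else None


def false_positive_rate(outputs_on_absent_events: List[Optional[Span]]) -> float:
    """Fraction of queries about events that do not occur which still received a span."""
    if not outputs_on_absent_events:
        return 0.0
    answered = sum(o is not None for o in outputs_on_absent_events)
    return answered / len(outputs_on_absent_events)


# Four queries about events that never happen, with made-up confidences.
outputs = [answer_or_abstain((1.0, 2.0), c) for c in (0.9, 0.2, 0.6, 0.1)]
print(false_positive_rate(outputs))
```

A grounder with no abstain path scores 1.0 on this metric by construction, which is why a benchmark made only of answerable queries cannot expose the problem.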

🚀Novel Research Directions — 12 Paper Ideas + Data + Feasibility

Where to write the next paper: 12 ideas, data, and feasibility

The white space in 2026 temporal grounding research falls into four clusters: Hour-scale & Streaming, Trust & Reliability, Compositional & Causal, and Low-cost Annotation & Ego. New benchmarks (ExtremeWhenBench, Streaming-Train-248K, and others) have just been released in each cluster, which removes the strongest reviewer objection: the absence of a baseline. A feasibility scorer that quantifies compute, novelty, and reviewer risk for the 12 ideas helps decide which idea to start on first.

The 2026 white-space map: 12 open problems and their background · Hour-scale & Streaming: StreamGround, RetGround-Agent, MemGround · Trust & Reliability: AbstainGround, FaithGround