Post by Kaggle

Most of the world's information isn't neatly packaged into text files or code repositories; it exists in messy, unstructured formats like video and audio. That makes long-context video comprehension extremely important. But as model context windows expand to millions of tokens, how do we measure true comprehension versus lucky shortcuts like sparse frame sampling?

Today, we are excited to add 1H-VideoQA to Kaggle Benchmarks. Originally developed by Google DeepMind Staff Research Scientist Antoine Y in 2024 and now updated with the latest SOTA models, 1H-VideoQA is a curated benchmark designed to evaluate frontier models on long-context video understanding and temporal episodic reasoning.

Why it matters

Traditional video benchmarks let models "cheat" with sparse frame sampling: performance often saturates after just 16 frames, which doesn't measure real long-context reasoning. 1H-VideoQA acts as a multimodal needle-in-a-haystack: models must process raw video frames as native tokens to locate seconds-long events hidden inside 40–90-minute YouTube videos. Because answering requires timeline synthesis across the full video, accuracy scales logarithmically with frame density (denser sampling, better answers), which is evidence that the benchmark is measuring genuine processing, not lucky guesses.

How the frontier ranks

🥇 Gemini 3.5 Flash: 80.2%
🥈 Gemini 3 Flash Preview: 79.2%
🥉 Gemini 2.5 Pro: 78.9%

Read the technical report, download the dataset, and check out the live leaderboard: https://lnkd.in/gt9wuh4t
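
For a rough intuition on why sparse frame sampling struggles with seconds-long events in hour-long videos, here is a minimal Python sketch. It is not the benchmark's evaluation code: the function names, the uniform midpoint sampling scheme, and the 60-minute / 5-second figures are all assumptions chosen for illustration.

```python
# Illustrative sketch only: names and the sampling scheme are hypothetical,
# not taken from 1H-VideoQA or any model's actual video pipeline.

def sample_timestamps(duration_s: float, num_frames: int) -> list[float]:
    """Uniformly sample frames: one at the midpoint of each equal segment."""
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


def event_is_seen(timestamps: list[float], start_s: float, end_s: float) -> bool:
    """True if at least one sampled frame falls inside the event window."""
    return any(start_s <= t <= end_s for t in timestamps)


def hit_probability(duration_s: float, num_frames: int, event_len_s: float) -> float:
    """Approximate chance that a randomly placed event contains a sampled
    frame (ignores edge effects at the start and end of the video)."""
    step = duration_s / num_frames
    return min(1.0, event_len_s / step)


if __name__ == "__main__":
    duration = 60 * 60          # assumed 60-minute video, in seconds
    event = (1000.0, 1005.0)    # assumed 5-second event

    for frames in (16, 256, 3600):
        ts = sample_timestamps(duration, frames)
        seen = event_is_seen(ts, *event)
        p = hit_probability(duration, frames, event[1] - event[0])
        print(f"{frames:>5} frames: event seen={seen}, approx hit chance={p:.3f}")
```

Under these assumptions, 16 frames over an hour leaves a gap of 225 seconds between samples, so a 5-second event is caught only about 2% of the time, while sampling at one frame per second always catches it. This only illustrates the coverage side of the argument; it does not model how a given system actually answers questions.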