AI-generated video is getting scarily good. Sora, Veo, Kling — the latest crop of models can produce photorealistic clips that fool most humans on first glance. But look closer and the seams show: objects morph, shadows point the wrong way, a paddle passes straight through a boat hull. These are artifacts — the fingerprint of a generative model that hasn't quite learned the physics of the real world.
The authors ask a deceptively simple question: can today's multimodal AI models — the same ones that ace visual QA and write image captions — see these artifacts? They build Artifact-Bench, a benchmark with 1,350 videos, 30 artifact types arranged in a three-level taxonomy, and three progressively harder tasks: "Is this AI?", "Which of these two looks more real?", and "Name the specific artifacts." Think of it as an eye exam for AI vision systems.
The headline result is brutal. The best model, Gemini 3.1 Pro, scores 47.5 out of 100 overall. Human experts score 87.7, a gap of about 40 points. On the hardest task, naming the specific artifact, every model scores below 10%. And scaling up model size, or turning on "thinking" mode, barely moves the needle. The pattern suggests the models aren't really perceiving artifacts; they appear to be guessing based on semantic shortcuts.
Video generation models have made stunning progress. Veo 3, Kling 2.5, HunyuanVideo — each release produces clips closer to photorealism. Yet something is always slightly off. A hand with six fingers. A shadow that doesn't match the light source. A football that splits into two and merges back. Why?
The paper frames the problem precisely. Unlike traditional video quality degradation (compression artifacts, sensor noise, motion blur), artifacts in AI-generated videos arise from limitations of the generative model itself. The model fails to maintain:
Before you can test whether a model sees artifacts, you need to catalog them. The authors build a three-level hierarchical taxonomy of realism artifacts — 3 top-level domains, ~13 failure families, and 30 fine-grained artifact types.
Artifact-Bench doesn't just ask "can you tell real from fake?" It probes three progressively harder capabilities: binary classification, relative realism judgment, and fine-grained artifact diagnosis. Each task is stratified into three difficulty levels (L1–L3).
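To make the three-task, three-level layout concrete, here is a minimal sketch of how per-task, per-difficulty scores could be tallied. The `Result` record, the task labels ("binary", "pairwise", "diagnosis"), and the use of plain accuracy are assumptions for illustration; the paper's actual data format and metrics may differ, and the sample results are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded model answer (hypothetical record layout)."""
    task: str      # "binary", "pairwise", or "diagnosis" (assumed labels)
    level: str     # "L1", "L2", or "L3"
    correct: bool  # whether the model's answer matched the ground truth


def accuracy_by_task_and_level(results):
    """Return {(task, level): accuracy in percent} for the given results."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        key = (r.task, r.level)
        totals[key] += 1
        hits[key] += int(r.correct)
    return {key: 100.0 * hits[key] / totals[key] for key in totals}


# Made-up results, only to show the shape of the output.
results = [
    Result("binary", "L1", True),
    Result("binary", "L1", False),
    Result("binary", "L3", True),
    Result("diagnosis", "L3", False),
]
scores = accuracy_by_task_and_level(results)
```

Keying the scores by `(task, level)` keeps the stratification explicit, so a per-task curve across L1 to L3 can be read off directly.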
Nineteen multimodal large language models (MLLMs) enter. Two are proprietary (Gemini 3.1 Pro, Gemini 3 Flash), fourteen are open-source general-purpose models, and three are specialised for detecting AI-generated content (AIGC). The results are sobering.
Here is the most striking finding in the paper: as difficulty increases from L1 to L3, human performance declines smoothly and monotonically. Model performance does not. Models fluctuate irregularly — sometimes scoring higher on harder samples. This reveals that models aren't perceiving artifacts the way humans do.
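The contrast between a smooth decline and an irregular one can be checked mechanically. The sketch below tests whether scores ordered from L1 to L3 never increase; the function name and the sample numbers are illustrative assumptions, not values from the paper.

```python
def declines_monotonically(scores):
    """Return True if no score is higher than the one before it.

    `scores` is ordered from easiest to hardest level, e.g. [L1, L2, L3].
    Ties count as a (non-strict) decline.
    """
    return all(later <= earlier for earlier, later in zip(scores, scores[1:]))


# Illustrative numbers only, not taken from the paper.
human_like = [90.0, 85.0, 80.0]   # drops at every step
model_like = [40.0, 35.0, 42.0]   # higher on L3 than on L2

human_ok = declines_monotonically(human_like)
model_ok = declines_monotonically(model_like)
```

With these made-up inputs, `human_ok` is `True` and `model_ok` is `False`, which mirrors the qualitative pattern described above: a model that scores higher on harder samples fails the check.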
The knee-jerk response to any AI limitation is "just scale up." More parameters, more data, more compute. Artifact-Bench tests this directly with model families that span multiple sizes, and with "thinking" variants that add explicit chain-of-thought reasoning. The result: scaling barely helps, and thinking mode sometimes hurts.
Artifact-Bench doesn't just expose weaknesses — it maps exactly where future work is needed. The paper ends with a clear call to action for the next generation of multimodal models.
Paper: arXiv:2605.18984 · Code: GitHub
Built as an interactive companion. All data drawn directly from the paper.