An Interactive Reading of

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing AI-Generated Video Artifacts

The paper, in plain English

AI-generated video is getting scarily good. Sora, Veo, Kling — the latest crop of models can produce photorealistic clips that fool most humans on first glance. But look closer and the seams show: objects morph, shadows point the wrong way, a paddle passes straight through a boat hull. These are artifacts — the fingerprint of a generative model that hasn't quite learned the physics of the real world.

The authors ask a deceptively simple question: can today's multimodal AI models — the same ones that ace visual QA and write image captions — see these artifacts? They build Artifact-Bench, a benchmark with 1,350 videos, 30 artifact types arranged in a three-level taxonomy, and three progressively harder tasks: "Is this AI?", "Which of these two looks more real?", and "Name the specific artifacts." Think of it as an eye exam for AI vision systems.

The headline result is brutal. The best model, Gemini 3.1 Pro, scores 47.5 out of 100 overall; human experts score 87.7. On the hardest task, naming the specific artifacts, every model scores below 10%. And scaling up model size, or turning on "thinking" mode, barely moves the needle. The pattern suggests the models aren't really perceiving artifacts; they're guessing from semantic shortcuts.

I
30-Artifact Taxonomy
A three-level hierarchy — Surface, Structural, Temporal-Semantic — covering everything from flickering noise to causality violations.
II
Three-Task Framework
Binary classification, pairwise realism comparison, and fine-grained artifact identification — progressively probing model capabilities.
III
The Human Gap
Humans: 87.7. Best AI: 47.5. Models don't follow the human difficulty curve, revealing they rely on shortcuts, not genuine artifact perception.
Chapter 1

Why AI Videos Still Look Wrong

Video generation models have made stunning progress. Veo 3, Kling 2.5, HunyuanVideo — each release produces clips closer to photorealism. Yet something is always slightly off. A hand with six fingers. A shadow that doesn't match the light source. A football that splits into two and merges back. Why?

The paper frames the problem precisely. Unlike traditional video quality degradation (compression artifacts, sensor noise, motion blur), artifacts in AI-generated videos arise from limitations of the generative model itself. The model fails to maintain:

🎨
Visual Fidelity
Colours, textures, and exposure should stay consistent frame-to-frame. In AI video, they flicker, over-smooth, or shift.
🏗
Object Structure
A human hand should have five fingers. A face should have two eyes. Objects shouldn't merge, split, or morph unexpectedly.
Temporal Coherence
Motion should follow physical laws across frames. Objects shouldn't teleport, appear from nothing, or violate causality.
The key insight: artifact-based detection offers a more principled pathway for identifying AI-generated content than semantic or style-based cues. As generative models improve in visual fidelity, the artifacts become subtler — but they remain the most reliable fingerprint of synthetic origin.
Chapter 2

A Map of Glitches

Before you can test whether a model sees artifacts, you need to catalog them. The authors build a three-level hierarchical taxonomy of realism artifacts — 3 top-level domains, ~13 failure families, and 30 fine-grained artifact types.

Diagnostic, not mutually exclusive. The taxonomy is designed to be diagnostic rather than strictly mutually exclusive: a single video may contain multiple co-occurring artifacts, such as a structural deformation accompanied by temporal inconsistency. This multi-label design enables a more faithful evaluation than forcing a single label onto each video.
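To make the multi-label idea concrete, here is a minimal sketch of how one video's annotation might span several domains at once. The three domain names come from this page; the artifact labels, their assignment to domains, and the helper `group_by_domain` are illustrative assumptions, not the paper's actual 30 types or its code.

```python
from collections import defaultdict

# Hypothetical slice of the taxonomy. Domain names are from the page;
# the artifact labels are illustrative stand-ins, not the paper's list.
TAXONOMY = {
    "Surface": ["flicker", "over_smoothing", "colour_shift"],
    "Structural": ["extra_fingers", "object_merge", "object_split"],
    "Temporal-Semantic": ["teleportation", "spontaneous_appearance", "causality_violation"],
}

# Reverse index: artifact label -> top-level domain.
DOMAIN_OF = {label: domain for domain, labels in TAXONOMY.items() for label in labels}


def group_by_domain(labels):
    """Group one video's multi-label annotation by top-level domain."""
    grouped = defaultdict(list)
    for label in labels:
        grouped[DOMAIN_OF[label]].append(label)
    return dict(grouped)


# One video, three co-occurring artifacts across all three domains.
annotation = ["extra_fingers", "teleportation", "flicker"]
print(group_by_domain(annotation))
```

Because labels are stored as a list per video rather than a single class, nothing stops a clip from carrying a structural and a temporal label together, which is the property the paragraph above describes.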
Chapter 3

Three Tasks, Progressive Depth

Artifact-Bench doesn't just ask "can you tell real from fake?" It probes three progressively harder capabilities: binary classification, relative realism judgment, and fine-grained artifact diagnosis. Each task is stratified into three difficulty levels (L1–L3).

Task 1: RVAC
Real vs. AI-Generated Video Classification. Given one video, determine if it is synthetic. Binary yes/no. 500 QA pairs across L1–L3.
Task 2: PVRC
Pairwise Video Realism Comparison. Given two AI videos, pick the more realistic one. Tests relative judgment. 250 QA pairs.
Task 3: AID
Artifact Identification. Given an AI video, identify all specific artifacts from 6 candidates. Multi-label. 350 QA pairs. The hardest task.
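A quick arithmetic check shows how the task sizes relate to the benchmark's 1,350 videos. The QA-pair counts are the ones stated above; the assumption that each QA pair uses its own videos (one for RVAC and AID, two for PVRC, with no reuse across tasks) is mine, chosen because it is one reading consistent with the headline number.

```python
# QA-pair counts as stated for each task on this page.
QA_PAIRS = {"RVAC": 500, "PVRC": 250, "AID": 350}

# Assumption: videos shown per QA pair. PVRC compares two clips,
# the other tasks present a single clip, and no clip is reused.
VIDEOS_PER_PAIR = {"RVAC": 1, "PVRC": 2, "AID": 1}


def total_videos(qa_pairs, videos_per_pair):
    """Total clips needed if every QA pair uses distinct videos."""
    return sum(qa_pairs[task] * videos_per_pair[task] for task in qa_pairs)


total_qa = sum(QA_PAIRS.values())
print(total_qa)                                  # 1100
print(total_videos(QA_PAIRS, VIDEOS_PER_PAIR))   # 1350
```

Under that assumption, 1,100 QA pairs account for exactly 1,350 clips, matching the figure quoted in the summary.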
The random baseline problem. RVAC and PVRC are both binary tasks — random guessing yields ~50%. Yet most models fail to consistently surpass this baseline, especially at higher difficulty levels. The AID task is even worse: with 6 options and multiple correct answers, even informed guessing yields ~10% — and that's roughly where the best models land.
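The baselines mentioned above can be worked out by hand. The sketch below assumes AID is scored by exact match on the full label set, which this page does not confirm; the paper's actual metric may differ. It compares a guesser who picks any non-empty subset of the 6 candidates with one who somehow knows how many labels are correct.

```python
from math import comb


def binary_chance():
    """Chance accuracy for a balanced two-way choice (RVAC, PVRC)."""
    return 0.5


def exact_match_chance_uniform(n_candidates):
    """Pick a uniformly random non-empty subset; score only exact matches."""
    return 1 / (2 ** n_candidates - 1)


def exact_match_chance_known_k(n_candidates, k):
    """Guesser knows exactly k labels are correct and picks k at random."""
    return 1 / comb(n_candidates, k)


print(f"{binary_chance():.3f}")                    # 0.500
print(f"{exact_match_chance_uniform(6):.3f}")      # 0.016
for k in (1, 2, 3):
    print(k, f"{exact_match_chance_known_k(6, k):.3f}")
# 1 0.167
# 2 0.067
# 3 0.050
```

Under these assumptions, a blind guesser lands near 1.6%, while a guesser who knows the number of correct labels lands between 5% and 17%, which puts single-digit model scores in the territory of guessing rather than perception.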
Chapter 4

The Model Arena

Nineteen MLLMs enter. Two are proprietary (Gemini 3.1 Pro, Gemini 3 Flash), fourteen are open-source general-purpose, and three are specialised for AIGC detection. The results are sobering.

The specialist paradox. VideoVeritas 8B, purpose-built for AIGC detection, is competitive on RVAC (68.2, against a 74.0 RVAC average for Gemini 3.1 Pro). But on PVRC it scores 53.1, and on AID just 7.8. Specialisation helps with the easiest task but doesn't transfer to harder reasoning.
Chapter 5

The Human Gap

Here is the most striking finding in the paper: as difficulty increases from L1 to L3, human performance declines smoothly and monotonically. Model performance does not. Models fluctuate irregularly — sometimes scoring higher on harder samples. This reveals that models aren't perceiving artifacts the way humans do.
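The "smooth and monotonic" claim is easy to state as a check. The sketch below is an illustration only: the function name is my own and the two score lists are made-up placeholders shaped like the pattern described, not numbers from the paper.

```python
def declines_monotonically(scores):
    """True if each difficulty level scores no higher than the one before."""
    return all(later <= earlier for earlier, later in zip(scores, scores[1:]))


# Placeholder accuracies for L1, L2, L3. Illustrative only, NOT paper data.
human_like = [95.0, 88.0, 80.0]   # falls steadily as difficulty rises
model_like = [52.0, 47.0, 55.0]   # dips, then scores higher on the hardest level

print(declines_monotonically(human_like))  # True
print(declines_monotonically(model_like))  # False
```

A curve that fails this check on harder samples is what you would expect from a system whose answers are driven by something other than how visible the artifact is.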

The implication for RLHF. If MLLMs can't reliably judge video realism, they can't serve as reward providers for reinforcement learning. Inaccurate reward signals would encourage models to optimise toward superficial statistical patterns rather than genuinely improving perceptual realism — a misalignment that compounds with each training iteration.
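A back-of-the-envelope calculation shows how little signal a near-chance judge carries. This is a toy model of my own, not the paper's analysis: assume the judge hands out +1 when it prefers the truly more realistic clip and -1 otherwise, and is right with probability equal to its pairwise accuracy. The 0.531 below is the 53.1 PVRC score quoted for VideoVeritas 8B in Chapter 4.

```python
def expected_reward(accuracy):
    """Expected +1/-1 reward from a judge that is right with probability `accuracy`."""
    return accuracy * 1 + (1 - accuracy) * -1


print(round(expected_reward(0.5), 3))    # 0.0   (coin flip: no usable signal)
print(round(expected_reward(0.531), 3))  # 0.062 (53.1% pairwise accuracy)
```

In this toy setting a judge at 53.1% delivers about 6% of the signal a perfect judge would, with the rest behaving like noise that the trained model is free to fit.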
Chapter 6

Why Bigger Isn't Better

The knee-jerk response to any AI limitation is "just scale up." More parameters, more data, more compute. Artifact-Bench tests this directly with model families that span multiple sizes, and with "thinking" variants that add explicit chain-of-thought reasoning. The result: scaling barely helps, and thinking mode sometimes hurts.

The diagnosis. Artifact-aware evaluation demands fine-grained perceptual sensitivity to subtle spatial-temporal inconsistencies — not high-level semantics or world knowledge. Scaling parameters and adding generic reasoning may improve linguistic coherence, but it doesn't enhance the model's ability to faithfully perceive artifacts.
Chapter 7

The Road Ahead

Artifact-Bench doesn't just expose weaknesses — it maps exactly where future work is needed. The paper ends with a clear call to action for the next generation of multimodal models.

This radar chart shows the capability profile of each model group across all task-difficulty combinations.
The bottom line. Artifact-Bench reveals that current MLLMs lack genuine artifact-aware perception. They rely on superficial semantic cues and dataset biases rather than the fine-grained, temporal-spatial reasoning that humans apply naturally. Building models that can truly see AI-generated artifacts — not just guess about them — remains an open and urgent challenge.

Paper: arXiv:2605.18984 · Code: GitHub
Built as an interactive companion. All data drawn directly from the paper.