An Interactive Reading of

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing AI-Generated Video Artifacts

Yuqi Tang, Yang Shi, Zhuoran Zhang, Qixun Wang et al.
HKUST(GZ) · PKU · Kling Team · May 2026 · arXiv:2605.18984

The paper, in plain English

AI-generated video is getting scarily good. Sora, Veo, Kling — the latest crop of models can produce photorealistic clips that fool most humans on first glance. But look closer and the seams show: objects morph, shadows point the wrong way, a paddle passes straight through a boat hull. These are artifacts — the fingerprint of a generative model that hasn't quite learned the physics of the real world.

The authors ask a deceptively simple question: can today's multimodal AI models — the same ones that ace visual QA and write image captions — see these artifacts? They build Artifact-Bench, a benchmark with 1,350 videos, 30 artifact types arranged in a three-level taxonomy, and three progressively harder tasks: "Is this AI?", "Which of these two looks more real?", and "Name the specific artifacts." Think of it as an eye exam for AI vision systems.

The headline result is brutal. The best model, Gemini 3.1 Pro, scores 47.5 out of 100 overall. Human experts score 87.7. On the hardest task, naming the specific artifact, every model scores below 10%. And scaling up model size, or turning on "thinking" mode, barely moves the needle. The pattern suggests the models aren't genuinely perceiving artifacts; they're leaning on semantic shortcuts.

I

30-Artifact Taxonomy

A three-level hierarchy — Surface, Structural, Temporal-Semantic — covering everything from flickering noise to causality violations.

II

Three-Task Framework

Binary classification, pairwise realism comparison, and fine-grained artifact identification — progressively probing model capabilities.

III

The Human Gap

Humans: 87.7. Best AI: 47.5. Models don't follow the human difficulty curve, revealing they rely on shortcuts, not genuine artifact perception.

~ 20 minutes · 7 chapters · 5 interactive visualisations

Chapter 1

Why AI Videos Still Look Wrong

Video generation models have made stunning progress. Veo 3, Kling 2.5, HunyuanVideo — each release produces clips closer to photorealism. Yet something is always slightly off. A hand with six fingers. A shadow that doesn't match the light source. A football that splits into two and merges back. Why?

In plain English

Imagine watching a magic trick filmed by a smartphone. The magician saws a person in half, and you accept it — you know it's a trick. But what if the person's legs morphed into the table while the saw was still moving? That's not illusion; that's a rendering bug. AI-generated videos are full of these bugs.

The root cause is that generative models learn statistics of pixels, not physics. They've seen millions of videos of people walking, so they know what walking looks like on average. But they've never internalised Newton's laws, object permanence, or the way light refracts through glass. When the model encounters an edge case — two objects colliding, a mirror reflecting a scene — it falls back on statistical patterns that violate real-world constraints.

These violations are called artifacts, and they're the central subject of this paper. Artifact-Bench asks: can the AI models that are supposed to understand video also detect these glitches?

The paper frames the problem precisely. Unlike traditional video quality degradation (compression artifacts, sensor noise, motion blur), artifacts in AI-generated videos arise from limitations of the generative model itself. The model fails to maintain:

🎨

Visual Fidelity

Colours, textures, and exposure should stay consistent frame-to-frame. In AI video, they flicker, over-smooth, or shift.

🏗

Object Structure

A human hand should have five fingers. A face should have two eyes. Objects shouldn't merge, split, or morph unexpectedly.

⏱

Temporal Coherence

Motion should follow physical laws across frames. Objects shouldn't teleport, appear from nothing, or violate causality.

The key insight: artifact-based detection offers a more principled pathway for identifying AI-generated content than semantic or style-based cues. As generative models improve in visual fidelity, the artifacts become subtler — but they remain the most reliable fingerprint of synthetic origin.


Chapter 2

A Map of Glitches

Before you can test whether a model sees artifacts, you need to catalog them. The authors build a three-level hierarchical taxonomy of realism artifacts — 3 top-level domains, ~13 failure families, and 30 fine-grained artifact types.

In plain English

Think of this taxonomy like a medical diagnostic manual. A doctor doesn't just say "you're sick" — they locate the problem: respiratory system, lower airway, bronchitis. Similarly, Artifact-Bench doesn't just flag "this video looks fake." It asks: is the problem in the surface appearance, the object structure, or the temporal flow? And then, specifically, which artifact?

The three tiers work like zooming in. Surface Artifacts are things you can see in a single frame: wrong colours, blurred textures, camera shake that doesn't match the scene. Structural Defects require understanding what objects should look like: a hand with too many fingers, a building that bends. Temporal-Semantic Violations only appear across frames: a person swallowing utensils, a ball that duplicates, consequences without causes.

[Interactive tree: the full three-level hierarchy of domains, failure families, and artifact types.]

Diagnostic, not exhaustive. The taxonomy is designed to be diagnostic rather than strictly mutually exclusive. A single video may contain multiple co-occurring artifacts — a structural deformation accompanied by temporal inconsistency. This multi-label design enables a more faithful evaluation.
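One way to picture the multi-label design is a lookup from artifact type to top-level domain, plus a set of labels per video. The sketch below is illustrative only: the three domain names come from the taxonomy above, but the artifact-type identifiers and the example annotation are hypothetical stand-ins loosely based on examples in the text, not the paper's actual label set.

```python
from collections import defaultdict

# Top-level domains named in the taxonomy.
DOMAINS = ("Surface", "Structural", "Temporal-Semantic")

# Hypothetical artifact-type identifiers (assumption, for illustration only).
# The real benchmark defines 30 types grouped into roughly 13 families.
TYPE_TO_DOMAIN = {
    "colour_flicker": "Surface",
    "over_smoothed_texture": "Surface",
    "extra_fingers": "Structural",
    "object_merging": "Structural",
    "object_duplication": "Temporal-Semantic",
    "causality_violation": "Temporal-Semantic",
}


def group_by_domain(labels):
    """Group one video's artifact labels by their top-level domain."""
    grouped = defaultdict(set)
    for label in labels:
        grouped[TYPE_TO_DOMAIN[label]].add(label)
    return dict(grouped)


# A single video may carry several co-occurring labels (multi-label).
video_labels = {"extra_fingers", "object_duplication"}
by_domain = group_by_domain(video_labels)
print(sorted(by_domain))
```

Storing labels as a set rather than a single value is what lets one video be tagged with a structural defect and a temporal violation at the same time.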


Chapter 3

Three Tasks, Progressive Depth

Artifact-Bench doesn't just ask "can you tell real from fake?" It probes three progressively harder capabilities: binary classification, relative realism judgment, and fine-grained artifact diagnosis. Each task is stratified into three difficulty levels (L1–L3).

In plain English

Think of it like a driving test. Task 1 is the written exam: "Is this an AI video? Yes or no." Straightforward, but you'd be surprised how many models fail even this. Task 2 is a side-by-side comparison: "Here are two AI videos. Which looks more realistic?" This tests relative judgment, not just binary detection. Task 3 is the road test: "Tell me exactly what's wrong with this video." You have to pick the right artifacts from six options — and multiple answers can be correct.

Each task comes in three flavours. L1 videos have obvious artifacts — easy to spot. L2 is the middle ground. L3 videos are high-realism — the artifacts are subtle and require careful inspection. This stratification reveals whether a model's perception scales with difficulty the way human perception does.

Task 1: RVAC

Real vs. AI-Generated Video Classification. Given one video, determine if it is synthetic. Binary yes/no. 500 QA pairs across L1–L3.

Task 2: PVRC

Pairwise Video Realism Comparison. Given two AI videos, pick the more realistic one. Tests relative judgment. 250 QA pairs.

Task 3: AID

Artifact Identification. Given an AI video, identify all specific artifacts from 6 candidates. Multi-label. 350 QA pairs. The hardest task.
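The task sizes above can be tallied in a few lines. This is a sketch under two assumptions that the text does not state outright: each PVRC question shows two videos while the other tasks show one, and no video is reused across questions or tasks. Under those assumptions the totals line up with the 1,350 videos quoted earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    qa_pairs: int
    videos_per_question: int  # assumption: PVRC shows 2 videos, others 1


TASKS = [
    Task("RVAC", 500, 1),
    Task("PVRC", 250, 2),
    Task("AID", 350, 1),
]

total_qa = sum(t.qa_pairs for t in TASKS)
# Assumes no video is shared between questions or tasks.
total_videos = sum(t.qa_pairs * t.videos_per_question for t in TASKS)

print(total_qa, total_videos)  # 1100 1350
```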

[Interactive chart: results by task and minimum difficulty level (L1–L3).]

The random baseline problem. RVAC and PVRC are both binary tasks — random guessing yields ~50%. Yet most models fail to consistently surpass this baseline, especially at higher difficulty levels. The AID task is even worse: with 6 options and multiple correct answers, even informed guessing yields ~10% — and that's roughly where the best models land.
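To make the chance levels concrete, here is a minimal sketch of uninformed guessing. The AID scoring rule is not spelled out here, so the code assumes strict exact-match scoring and a guesser that picks a uniformly random non-empty subset of the six options; both are assumptions. Under that stricter reading, chance comes out lower than the ~10% quoted above, which would be the case if the benchmark uses a different scoring rule or a smarter guessing strategy.

```python
from fractions import Fraction


def binary_chance():
    """Chance accuracy for a two-way choice (RVAC, PVRC)."""
    return Fraction(1, 2)


def exact_match_chance(num_options):
    """Chance of an exact match when guessing a uniformly random
    non-empty subset of the options (assumed scoring rule)."""
    return Fraction(1, 2 ** num_options - 1)


binary = binary_chance()
aid = exact_match_chance(6)
print(float(binary), round(float(aid), 4))  # 0.5 0.0159
```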


Chapter 4

The Model Arena

Nineteen MLLMs enter. Two are proprietary (Gemini 3.1 Pro, Gemini 3 Flash), fourteen are open-source general-purpose, and three are specialised for AIGC detection. The results are sobering.

In plain English

Imagine lining up nineteen contestants for a vision test. Some are generalists (they've trained on everything), some are specialists (they've been fine-tuned to spot fake videos). You put 1,100 questions to each contestant, asking them to classify, compare, and diagnose video clips. The specialist who studied the hardest, VideoVeritas 8B, comes in second with 46.0. The best generalist, Gemini 3.1 Pro, leads at 47.5. But the human experts sitting next to them? 87.7.

That gap, roughly 40 points between the best AI and a competent human, is the core finding. These models can partly tell real from fake at the easiest level, but they have almost no ability to explain why a video looks wrong. On the diagnosis task (AID), every single model scores below 10% average accuracy.

[Interactive chart: model-by-model comparison, filterable by model group.]

The specialist paradox. VideoVeritas 8B, purpose-built for AIGC detection, posts a strong RVAC accuracy (68.2), though on the figures quoted here that is still below Gemini 3.1 Pro (74.0 RVAC avg). On PVRC it scores 53.1, and on AID just 7.8. Specialisation helps with the easiest task but doesn't transfer to harder reasoning.


Chapter 5

The Human Gap

Here is the most striking finding in the paper: as difficulty increases from L1 to L3, human performance declines smoothly and monotonically. Model performance does not. Models fluctuate irregularly — sometimes scoring higher on harder samples. This reveals that models aren't perceiving artifacts the way humans do.

In plain English

Imagine a spelling bee where the words get progressively harder. A good speller's accuracy drops smoothly: 95% on easy words, 80% on medium, 65% on hard. That's the human curve on Artifact-Bench: 93.6 → 86.4 → 80.3 from L1 to L3 across tasks.

Now imagine a contestant who scores 50% on easy words, 65% on medium, and 45% on hard — all over the place. You wouldn't conclude they're a good speller having a slightly off day. You'd conclude they're guessing. That's exactly what's happening with the MLLMs. Their performance doesn't track difficulty because they're not perceiving the artifacts — they're picking up on superficial statistical shortcuts.
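The "does accuracy track difficulty?" test in the analogy boils down to a monotonicity check. The sketch below applies it to the human scores quoted above, read as L1 through L3, and to the made-up erratic speller; the function name is hypothetical and this is not the paper's analysis code.

```python
def is_monotonic_decreasing(scores):
    """True if every score is no higher than the one before it."""
    return all(later <= earlier for earlier, later in zip(scores, scores[1:]))


human_curve = [93.6, 86.4, 80.3]    # human scores quoted above, read as L1 -> L3
erratic_curve = [50.0, 65.0, 45.0]  # the hypothetical speller from the analogy

human_ok = is_monotonic_decreasing(human_curve)
erratic_ok = is_monotonic_decreasing(erratic_curve)
print(human_ok, erratic_ok)  # True False
```

A curve that fails this check on an ordered difficulty scale is a hint that whatever drives the scores is not the same thing that makes the samples hard for people.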

Across difficulty levels, the human curve is always smooth and downward. The model curves are jagged.

[Interactive chart: human and model accuracy against difficulty level, per task.]

The implication for RLHF. If MLLMs can't reliably judge video realism, they can't serve as reward providers for reinforcement learning. Inaccurate reward signals would encourage models to optimise toward superficial statistical patterns rather than genuinely improving perceptual realism — a misalignment that compounds with each training iteration.
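A rough sense of how noisy such a reward signal would be can be had from simple arithmetic. The sketch below treats a judge's pairwise accuracy as the rate at which it labels preference pairs correctly, and plugs in the 53.1 PVRC score quoted earlier for VideoVeritas 8B. It assumes errors are independent and that PVRC accuracy transfers directly to reward labelling; neither is established by the text, and the 10,000-pair batch is a hypothetical figure.

```python
def preference_noise(judge_accuracy, num_pairs):
    """Expected label noise when a judge with the given pairwise accuracy
    labels preference pairs. Assumes independent errors (a simplification)."""
    error_rate = 1.0 - judge_accuracy
    expected_wrong = error_rate * num_pairs
    # Net signal: fraction correct minus fraction wrong; 0 means coin flipping.
    net_signal = judge_accuracy - error_rate
    return expected_wrong, net_signal


# 0.531 is the PVRC accuracy quoted earlier; 10_000 pairs is hypothetical.
wrong, signal = preference_noise(0.531, 10_000)
print(round(wrong), round(signal, 3))  # 4690 0.062
```

On these assumptions nearly half the preference labels would point the wrong way, leaving only a thin margin of usable signal for the optimiser to follow.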


Chapter 6

Why Bigger Isn't Better

The knee-jerk response to any AI limitation is "just scale up." More parameters, more data, more compute. Artifact-Bench tests this directly with model families that span multiple sizes, and with "thinking" variants that add explicit chain-of-thought reasoning. The result: scaling barely helps, and thinking mode sometimes hurts.

In plain English

You'd think that a 38-billion-parameter model would be better at spotting video glitches than an 8-billion one. Same family, same training recipe, just bigger. Not so. InternVL3.5 38B scores 34.7 total. InternVL3.5 8B scores 34.5. The difference is noise.

What about "thinking", giving the model time to reason step by step? Qwen3-VL 8B-Thinking scores 33.3. The non-thinking Qwen3-VL 8B-Instruct scores 36.0. The thinking variant is 2.7 points worse. The extra compute appears to go into linguistic coherence, not into actually perceiving the artifact.

The analogy: giving a colourblind person a bigger dictionary doesn't help them see red. Artifact detection requires a different kind of visual processing — fine-grained spatial sensitivity and temporal coherence tracking — that current architectures simply don't have.
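The comparisons above are easier to weigh side by side. This small sketch uses only the scores quoted in this chapter plus the human score of 87.7, and computes the gain from scaling, the gain from thinking mode, and the remaining distance to humans for the best of these four models.

```python
# Total scores quoted in this chapter.
scores = {
    "InternVL3.5 8B": 34.5,
    "InternVL3.5 38B": 34.7,
    "Qwen3-VL 8B-Instruct": 36.0,
    "Qwen3-VL 8B-Thinking": 33.3,
}
HUMAN = 87.7  # human expert score quoted earlier

# Rounded to one decimal to avoid floating-point residue.
scaling_gain = round(scores["InternVL3.5 38B"] - scores["InternVL3.5 8B"], 1)
thinking_gain = round(
    scores["Qwen3-VL 8B-Thinking"] - scores["Qwen3-VL 8B-Instruct"], 1
)
gap_to_human = round(HUMAN - max(scores.values()), 1)

print(scaling_gain, thinking_gain, gap_to_human)  # 0.2 -2.7 51.7
```

Roughly five times the parameters buys 0.2 points, thinking mode costs 2.7, and either change is tiny next to the 51.7 points separating these models from human experts.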

[Interactive chart: scores by model size and by thinking vs. instruct variant, per model family.]

The diagnosis. Artifact-aware evaluation demands fine-grained perceptual sensitivity to subtle spatial-temporal inconsistencies — not high-level semantics or world knowledge. Scaling parameters and adding generic reasoning may improve linguistic coherence, but it doesn't enhance the model's ability to faithfully perceive artifacts.


Chapter 7

The Road Ahead

Artifact-Bench doesn't just expose weaknesses — it maps exactly where future work is needed. The paper ends with a clear call to action for the next generation of multimodal models.

In plain English

If you're building video generation models — Sora, Veo, Kling — you need a reliable way to evaluate whether your outputs are improving. Right now, the standard approach is to use MLLMs as automated judges. Artifact-Bench says: those judges are unreliable. They can't tell you why a video looks wrong, and their "realism" judgments don't match what humans actually perceive.

This matters because many training pipelines (reinforcement learning from human feedback, preference optimisation) rely on these models to supply reward signals. If the reward signal is noisy and misaligned, the model optimises for the wrong thing: superficial patterns that fool the judge rather than genuine realism.

The fix will require fundamentally better fine-grained perception and temporal-spatial modelling in future MLLMs — not just bigger models, but ones that actually track object motion, structure, and physics across frames.

[Radar chart: capability profile of each model group across all task-difficulty combinations.]

The bottom line. Artifact-Bench reveals that current MLLMs lack genuine artifact-aware perception. They rely on superficial semantic cues and dataset biases rather than the fine-grained, temporal-spatial reasoning that humans apply naturally. Building models that can truly see AI-generated artifacts — not just guess about them — remains an open and urgent challenge.

Paper: arXiv:2605.18984 · Code: GitHub
Built as an interactive companion. All data drawn directly from the paper.
