An Interactive Reading of

OpenCompass: A Universal Evaluation Platform for Large Language Models

OpenCompass Team, Shanghai AI Laboratory
May 2026 · arXiv:2605.19276

The paper, in plain English

Every week a new large language model ships with claims about how smart it is. But how do you actually test that? Running one model through one benchmark is easy. Running forty models through a hundred benchmarks, each with different scoring rules, different prompt formats, and different post-processing steps, is a logistical nightmare. Researchers end up spending more time gluing pipelines together than doing science.

OpenCompass is Shanghai AI Lab's answer: an open-source platform that treats LLM evaluation as an engineering problem. You declare which models and which datasets in a config file, and the system takes care of everything else — splitting the work across GPUs, routing each benchmark to the right evaluator (rule-based, LLM-as-a-Judge, or a cost-saving cascade of both), and spitting out a unified scorecard. The design borrows from distributed systems thinking: a Partitioner chops the workload into independent atomic tasks, and a Runner schedules them in parallel across local machines or cloud clusters.

The result is a platform that supports 100+ benchmark datasets spanning knowledge, reasoning, math, science, code, and long-context tasks, and has produced an academic leaderboard with scores for every major model from GPT-5 to open-source 4B-parameter models. The top model (Gemini-3-Pro) averages 81.32 across six headline benchmarks — but no single model wins every category, which is precisely why you need a system like this.

I

Partition & Parallelize

A Partitioner decomposes the full model×dataset grid into atomic tasks; a Runner distributes them across GPU clusters — cutting wall-clock time from days to hours.

II

Three Evaluator Modes

Rule-based scoring for clean answers, LLM-as-a-Judge for open-ended tasks, and a Cascade that combines both — filtering easy samples through cheap rules before calling the expensive judge.

III

100+ Benchmarks, One Scorecard

From MMLU and HLE to LiveCodeBench and RULER, OpenCompass unifies knowledge, reasoning, math, science, language, code, and long-text evaluation under a single configuration.

Chapter 1

The Evaluation Problem

A new large language model ships every week. Everyone claims it is better. But better at what, exactly — and who decides?

In plain English

Imagine you run a consumer-testing lab. A manufacturer sends you a new phone and says "it's the best." You need to test battery life, screen brightness, drop resistance, camera quality, and twenty other attributes — each with its own equipment and its own measurement protocol. Now imagine forty phone manufacturers sending you phones every month, and each attribute test requires a different machine that takes hours to run. That is roughly the situation LLM researchers face today.

The problem is not that benchmarks do not exist — MMLU, HLE, GPQA, and dozens more are widely used. The problem is fragmentation: each benchmark has its own prompt format, its own answer extraction logic, and its own scoring metric. Running one model through one benchmark is a script. Running forty models through a hundred benchmarks is a distributed systems problem disguised as a measurement task.

OpenCompass treats evaluation as an engineering discipline. The platform's core insight is that all evaluation tasks follow the same three-phase pipeline: preprocess data, run inference, compute scores. Standardise the interfaces between those phases and the whole thing becomes composable.

The paper identifies three specific pain points that make LLM evaluation arduous:

1. Inconsistent evaluation methods. Some benchmarks use exact-match scoring; others use ROUGE; still others use a second LLM to grade responses. There is no single "evaluate" button.

2. Prompt sensitivity. A model's score on the same benchmark can swing by several percentage points depending on whether you phrase the question as "Choose the correct answer:" or "Which of the following is true?" Standardised prompt templates matter.

3. Fragmented preprocessing. Each benchmark requires its own data-loading logic, field alignment, and output post-processing. Researchers re-implement the same boilerplate from scratch every time.

OpenCompass addresses all three by providing a unified pipeline with modular, swappable components for each phase. The platform supports both objective evaluation (multiple-choice, math, code — tasks with definitive gold answers, scored by Accuracy, Exact Match, F1) and subjective evaluation (open-ended generation, creative writing — scored by an LLM acting as a judge across dimensions like coherence, relevance, and instruction adherence).

The evaluated objects fall into two categories: Base Models (pretrained only, with text-continuation capabilities, evaluated via perplexity-based methods) and Chat Models (fine-tuned with SFT or RLHF, evaluated via generation-based approaches).

The quiet insight is that LLM evaluation is no longer a measurement problem — it is a distributed systems problem. The bottleneck is not "how do we score a model?" but "how do we score forty models on a hundred benchmarks without our infrastructure collapsing?"

The five-component architecture →

Chapter 2

The Five-Layer Architecture

OpenCompass decomposes evaluation into five modular components connected by a four-stage workflow pipeline.

In plain English

Think of a restaurant kitchen during dinner rush. The Configuration System is the menu and the order ticket — it tells everyone what needs to happen. The Partitioner is the head chef who breaks down the orders into individual tasks: "chop these vegetables, grill that steak, plate the dessert." The Runner is the kitchen manager who assigns tasks to available stations. The Tasks are the line cooks doing the actual work. And the Summarizer is the expeditor who collects finished plates, checks quality, and sends them out with the right labels.

The key design choice is component decoupling: none of the five layers knows or cares how the others work internally. The Partitioner does not need to know whether the Runner sends jobs to a local GPU or a cloud cluster. The Task does not care how many other Tasks are running in parallel. This modularity is what lets OpenCompass scale from a single developer's laptop to a multi-GPU cluster without changing the configuration.

The architecture comprises five core components arranged in a layered logical structure:

Configuration System — Parses and instantiates heterogeneous inputs (Python config files or CLI arguments) into a unified evaluation configuration object. Built on MMEngine, it performs consistency checks and manages dataset/model configurations.

Partitioner — Maps the evaluation workload to the Cartesian product of model-dataset combinations, then splits each combination into independent atomic subtasks based on strategies like number of workers, dataset size, or naive equal partitioning.

Runner — The adaptive abstraction layer between task logic and cluster infrastructure. Supports local environments, Alibaba Cloud DLC, Volcengine Cloud, and more. Handles job submission, lifecycle management, retry logic, and log collection.

Tasks — Two atomic task types: OpenICLInferTask (runs model inference) and OpenICLEvalTask (computes evaluation scores). Each carries its complete execution context independently.

Summarizer — Aggregates results, computes composite metrics, and produces customisable visualisation reports with dataset-level groupings.

Click any component to see how it works in detail.

Click a component above

Select any box in the architecture diagram to explore its role in the evaluation pipeline.

The four-stage workflow — Configure → Infer → Evaluate → Visualise — mirrors the standard ML experiment lifecycle. The difference is that OpenCompass makes each stage independently configurable and parallelisable, turning a sequential slog into a concurrent pipeline.

How configurations and prompts work →

Chapter 3

Configuration & Prompt Design

Every evaluation begins with a config file. The prompt you feed the model determines the score you get — so OpenCompass makes prompt construction a first-class citizen.

The Configuration System, built on OpenICL, supports three prompt construction mechanisms:

Few-shot prompts use the Retriever module to select contextual examples from an indexed dataset. These examples are concatenated with the test sample to form a complete prompt. The retrieval strategy is configurable — you can select examples by similarity, randomly, or from fixed positions.

Zero-shot prompts are constructed via the Prompt Template module — just the task description and test sample, with no additional examples. Suitable for models with strong zero-shot generalisation capabilities (modern chat models).

Rich prompt structures map to ChatML templates, enabling multi-turn contextual prompts with system messages, user turns, and assistant turns.

A complete dataset configuration requires three parts: data loading (how to read the raw dataset), inference configuration (prompt template + retrieval strategy), and evaluation configuration (which metric to compute, how to extract answers).

Prompt Length Decomposition

$$L_{\text{prompt}} = L_{\text{system}} + k \cdot L_{\text{example}} + L_{\text{question}}$$

where $L$ denotes token length, $k$ is the number of few-shot examples, $L_{\text{system}}$ is the system instruction, $L_{\text{example}}$ is the average example length, and $L_{\text{question}}$ is the test question. More examples help accuracy but increase inference cost linearly.

Drag the sliders to see how prompt strategy affects estimated inference cost and context budget.

Number of few-shot examples (k) 5

Avg. example length (tokens) 200

Question length (tokens) 100

The paper emphasises that prompt sensitivity is one of the three core pain points in LLM evaluation. Two teams can evaluate the same model on the same benchmark and get different scores simply because they used different prompt templates. OpenCompass addresses this by standardising prompt construction in the config — making results reproducible across teams and runs.

How partitioning slashes evaluation time →

Chapter 4

Partitioning & Parallelism

Evaluating a single model on a hundred benchmarks takes days. OpenCompass turns that serial slog into a parallel sprint.

In plain English

You have 40 model variants and 100 benchmarks. Each model-benchmark pair is an independent test — Model A's score on MMLU does not affect Model B's score on HellaSwag. That means you can run all 4,000 tests in parallel if you have enough machines. This is the same insight behind distributed build systems like Bazel or MapReduce: decompose the work into independent chunks, distribute them, and aggregate the results.

OpenCompass takes the Cartesian product of your model list and dataset list — every model paired with every dataset — then chops each pair into bite-sized subtasks that a single GPU can handle. The Partitioner decides how to split; the Runner decides where to run.

Drag the sliders below to see how parallelism scales. The key insight: doubling workers nearly halves wall-clock time, but with diminishing returns as coordination overhead grows.

Evaluation Task Complexity

$$T_{\text{total}} = \frac{N_{\text{models}} \times N_{\text{datasets}} \times N_{\text{samples}}}{W} \times t_{\text{sample}} + t_{\text{overhead}} \times \lceil \frac{N_{\text{models}} \times N_{\text{datasets}}}{W} \rceil$$

where $W$ is the number of parallel workers, $t_{\text{sample}}$ is the per-sample inference time, and $t_{\text{overhead}}$ is the fixed cost of task dispatch and result collection per batch.

Drag the sliders to see how parallelism scales evaluation wall-clock time.

Number of models 10

Number of datasets 20

Samples per dataset 1000

Seconds per sample 2.0s

The Partitioner supports three splitting strategies: naive equal partition (split tasks evenly), num-worker partition (split to match available workers), and size-based partition (split by dataset size). Each produces a structured task list that the Runner then distributes. The key is that subtasks are independent — no data dependencies between them means near-linear scaling.

How evaluators work →

Chapter 5

Three Evaluator Strategies

Not all answers can be graded the same way. OpenCompass provides three evaluator modes — and a clever cascade that combines them to save money.

In plain English

Imagine grading an exam. Multiple-choice questions are trivial: compare the student's answer to the answer key. But essay questions need a human reader — someone who judges coherence, originality, and logic. Now imagine you have 10,000 exams and the human reader charges $0.01 per essay. You'd want to auto-grade everything you can and only send the ambiguous ones to the human. That is exactly what the Cascade Evaluator does.

OpenCompass offers three grading strategies. Rule-based evaluators are cheap and deterministic — they use regex, exact match, or NLP metrics like BLEU and ROUGE. LLM-as-a-Judge evaluators use a powerful LLM to grade open-ended responses on multiple dimensions. And the Cascade evaluator combines both: rules handle the easy cases, and the LLM judge handles the hard ones, dramatically reducing cost while preserving accuracy.

Use the simulation below to see how the cascade's cost savings depend on the rule evaluator's coverage and accuracy. The green region is "free" (handled by rules); the red region is "expensive" (sent to the LLM judge).

Rule-Based Evaluator implements lightweight objective evaluation using predefined rules or classic NLP metrics. Three subtypes: Option Extraction (regex-based multiple-choice parsing), Content Regex (entity extraction with MathEvaluator for LaTeX), and Classic NLP Metrics (BLEU, ROUGE, AUC-ROC, F1).

LLM-as-a-Judge Evaluator invokes a second LLM as the Judge Model, reusing inference-task logic with structured scoring prompts. Used for complex objective scenarios (answers too nuanced for rules) and subjective scenarios (open-ended generation rated on relevance, fluency, logic, innovativeness).

Cascade Evaluator operates in two modes. In Cascaded Mode, rules pre-filter: correct samples are cheaply accepted; uncertain samples go to the LLM judge. In Parallel Mode, both evaluators run on every sample; a sample is correct if either evaluator says so (higher tolerance, higher cost).

Drag sliders to see how the Cascade evaluator saves LLM judge calls.

Cascade Mode

Parallel Mode

Rule coverage (%) 70%

Rule accuracy (%) 90%

Total samples 2000

The cascade is particularly elegant in Cascaded Mode: if your rule evaluator covers 70% of samples at 90% accuracy, you only send 30% of samples to the expensive LLM judge. That is a 70% cost reduction with minimal accuracy loss. The trade-off is tunable via coverage and accuracy thresholds in the config.

Explore the benchmark landscape →

Chapter 6

The Benchmark Landscape

OpenCompass supports 100+ datasets across eight domains. Here is what each domain measures — and why the hardest benchmarks humbled every model.

Select models to compare their profiles across benchmark domains.

Model A

Model B

Notice that even the best models (Gemini-3-Pro at 81.32 average) do not dominate every dimension. The HLE benchmark (Humanity's Last Exam) topped out at 37.98 for the best model — barely above random guessing on some sub-tasks. That gap between "best" and "perfect" is where the next generation of models will compete.

The full model leaderboard →

Chapter 7

The Leaderboard

Forty-two models, six benchmarks, one scorecard. Click any column header to sort and discover where each model excels — or fails.

Click column headers to sort. Hover over cells for details.

The spread from top to bottom is massive: Gemini-3-Pro averages 81.32 while Gemma-3-27B-it averages 42.09. But the more interesting story is in the columns: the best model on HLE (37.98) is not the best model on AIME 2025 (96.04). Specialisation is real, and it is why single-number summaries are dangerous.

What comes next for OpenCompass →

Chapter 8

What Comes Next

The current OpenCompass is single-modal and serial between stages. The road ahead points toward pipeline parallelism and multimodal evaluation.

The paper identifies two specific future directions:

Pipeline Parallelism Enhancement. The current workflow requires that all inference tasks for a dataset complete before evaluation begins. The planned improvement maintains intra-dataset seriality but achieves cross-dataset pipeline parallelism: after Dataset A's inference completes and its evaluation begins, Dataset B's inference starts simultaneously.

Dialogue Template Expansion. Currently limited to single-modal (text) evaluation, OpenCompass plans to extend the ChatML format to support both multimodal evaluation (images, audio) and multi-turn dialogue evaluation — critical for assessing models that are increasingly multimodal.

These improvements reflect a broader trend in the field: evaluation infrastructure is catching up to model capabilities. As models become more general, evaluation systems must become more comprehensive, more efficient, and more flexible. OpenCompass is positioned as the open-source backbone for that effort.

Drag the slider to compare serial vs. pipeline execution schedules.

Number of datasets 6

Inference time per dataset (s) 60s

Evaluation time per dataset (s) 20s

The shift from serial to pipeline execution is not just an engineering convenience — it is a capability multiplier. Pipeline parallelism can reduce total evaluation time by up to $1 / (1 + r)$ where $r$ is the ratio of evaluation time to inference time. For typical workloads, that means a 30–50% speedup with zero additional hardware.

End of interactive reading.

Read the original paper on arXiv · Browse the code on GitHub