An Interactive Reading of

OpenCompass: A Universal Evaluation Platform for Large Language Models

The paper, in plain English

Every week a new large language model ships with claims about how smart it is. But how do you actually test that? Running one model through one benchmark is easy. Running forty models through a hundred benchmarks, each with different scoring rules, different prompt formats, and different post-processing steps, is a logistical nightmare. Researchers end up spending more time gluing pipelines together than doing science.

OpenCompass is Shanghai AI Lab's answer: an open-source platform that treats LLM evaluation as an engineering problem. You declare which models and which datasets in a config file, and the system takes care of everything else — splitting the work across GPUs, routing each benchmark to the right evaluator (rule-based, LLM-as-a-Judge, or a cost-saving cascade of both), and spitting out a unified scorecard. The design borrows from distributed systems thinking: a Partitioner chops the workload into independent atomic tasks, and a Runner schedules them in parallel across local machines or cloud clusters.

The result is a platform that supports 100+ benchmark datasets spanning knowledge, reasoning, math, science, code, and long-context tasks, and has produced an academic leaderboard with scores for every major model from GPT-5 to open-source 4B-parameter models. The top model (Gemini-3-Pro) averages 81.32 across six headline benchmarks — but no single model wins every category, which is precisely why you need a system like this.

I
Partition & Parallelize
A Partitioner decomposes the full model×dataset grid into atomic tasks; a Runner distributes them across GPU clusters — cutting wall-clock time from days to hours.
II
Three Evaluator Modes
Rule-based scoring for clean answers, LLM-as-a-Judge for open-ended tasks, and a Cascade that combines both — filtering easy samples through cheap rules before calling the expensive judge.
III
100+ Benchmarks, One Scorecard
From MMLU and HLE to LiveCodeBench and RULER, OpenCompass unifies knowledge, reasoning, math, science, language, code, and long-text evaluation under a single configuration.
Chapter 1

The Evaluation Problem

A new large language model ships every week. Everyone claims it is better. But better at what, exactly — and who decides?

The paper identifies three specific pain points that make LLM evaluation arduous:

1. Inconsistent evaluation methods. Some benchmarks use exact-match scoring; others use ROUGE; still others use a second LLM to grade responses. There is no single "evaluate" button.

2. Prompt sensitivity. A model's score on the same benchmark can swing by several percentage points depending on whether you phrase the question as "Choose the correct answer:" or "Which of the following is true?" Standardised prompt templates matter.

3. Fragmented preprocessing. Each benchmark requires its own data-loading logic, field alignment, and output post-processing. Researchers re-implement the same boilerplate from scratch every time.

OpenCompass addresses all three by providing a unified pipeline with modular, swappable components for each phase. The platform supports both objective evaluation (multiple-choice, math, code — tasks with definitive gold answers, scored by Accuracy, Exact Match, F1) and subjective evaluation (open-ended generation, creative writing — scored by an LLM acting as a judge across dimensions like coherence, relevance, and instruction adherence).

The evaluated objects fall into two categories: Base Models (pretrained only, with text-continuation capabilities, evaluated via perplexity-based methods) and Chat Models (fine-tuned with SFT or RLHF, evaluated via generation-based approaches).

The quiet insight is that LLM evaluation is no longer a measurement problem — it is a distributed systems problem. The bottleneck is not "how do we score a model?" but "how do we score forty models on a hundred benchmarks without our infrastructure collapsing?"

The five-component architecture
Chapter 2

The Five-Layer Architecture

OpenCompass decomposes evaluation into five modular components connected by a four-stage workflow pipeline.

The architecture comprises five core components arranged in a layered logical structure:

Configuration System — Parses and instantiates heterogeneous inputs (Python config files or CLI arguments) into a unified evaluation configuration object. Built on MMEngine, it performs consistency checks and manages dataset/model configurations.

Partitioner — Maps the evaluation workload to the Cartesian product of model-dataset combinations, then splits each combination into independent atomic subtasks based on strategies like number of workers, dataset size, or naive equal partitioning.

Runner — The adaptive abstraction layer between task logic and cluster infrastructure. Supports local environments, Alibaba Cloud DLC, Volcengine Cloud, and more. Handles job submission, lifecycle management, retry logic, and log collection.

Tasks — Two atomic task types: OpenICLInferTask (runs model inference) and OpenICLEvalTask (computes evaluation scores). Each carries its complete execution context independently.

Summarizer — Aggregates results, computes composite metrics, and produces customisable visualisation reports with dataset-level groupings.

Click any component to see how it works in detail.

Click a component above

Select any box in the architecture diagram to explore its role in the evaluation pipeline.

The four-stage workflow — Configure → Infer → Evaluate → Visualise — mirrors the standard ML experiment lifecycle. The difference is that OpenCompass makes each stage independently configurable and parallelisable, turning a sequential slog into a concurrent pipeline.

How configurations and prompts work
Chapter 3

Configuration & Prompt Design

Every evaluation begins with a config file. The prompt you feed the model determines the score you get — so OpenCompass makes prompt construction a first-class citizen.

The Configuration System, built on OpenICL, supports three prompt construction mechanisms:

Few-shot prompts use the Retriever module to select contextual examples from an indexed dataset. These examples are concatenated with the test sample to form a complete prompt. The retrieval strategy is configurable — you can select examples by similarity, randomly, or from fixed positions.

Zero-shot prompts are constructed via the Prompt Template module — just the task description and test sample, with no additional examples. Suitable for models with strong zero-shot generalisation capabilities (modern chat models).

Rich prompt structures map to ChatML templates, enabling multi-turn contextual prompts with system messages, user turns, and assistant turns.

A complete dataset configuration requires three parts: data loading (how to read the raw dataset), inference configuration (prompt template + retrieval strategy), and evaluation configuration (which metric to compute, how to extract answers).

Prompt Length Decomposition
$$L_{\text{prompt}} = L_{\text{system}} + k \cdot L_{\text{example}} + L_{\text{question}}$$
where $L$ denotes token length, $k$ is the number of few-shot examples, $L_{\text{system}}$ is the system instruction, $L_{\text{example}}$ is the average example length, and $L_{\text{question}}$ is the test question. More examples help accuracy but increase inference cost linearly.
Drag the sliders to see how prompt strategy affects estimated inference cost and context budget.
5
200
100

The paper emphasises that prompt sensitivity is one of the three core pain points in LLM evaluation. Two teams can evaluate the same model on the same benchmark and get different scores simply because they used different prompt templates. OpenCompass addresses this by standardising prompt construction in the config — making results reproducible across teams and runs.

How partitioning slashes evaluation time
Chapter 4

Partitioning & Parallelism

Evaluating a single model on a hundred benchmarks takes days. OpenCompass turns that serial slog into a parallel sprint.

Evaluation Task Complexity
$$T_{\text{total}} = \frac{N_{\text{models}} \times N_{\text{datasets}} \times N_{\text{samples}}}{W} \times t_{\text{sample}} + t_{\text{overhead}} \times \lceil \frac{N_{\text{models}} \times N_{\text{datasets}}}{W} \rceil$$
where $W$ is the number of parallel workers, $t_{\text{sample}}$ is the per-sample inference time, and $t_{\text{overhead}}$ is the fixed cost of task dispatch and result collection per batch.
Drag the sliders to see how parallelism scales evaluation wall-clock time.
10
20
1000
2.0s

The Partitioner supports three splitting strategies: naive equal partition (split tasks evenly), num-worker partition (split to match available workers), and size-based partition (split by dataset size). Each produces a structured task list that the Runner then distributes. The key is that subtasks are independent — no data dependencies between them means near-linear scaling.

How evaluators work
Chapter 5

Three Evaluator Strategies

Not all answers can be graded the same way. OpenCompass provides three evaluator modes — and a clever cascade that combines them to save money.

Rule-Based Evaluator implements lightweight objective evaluation using predefined rules or classic NLP metrics. Three subtypes: Option Extraction (regex-based multiple-choice parsing), Content Regex (entity extraction with MathEvaluator for LaTeX), and Classic NLP Metrics (BLEU, ROUGE, AUC-ROC, F1).

LLM-as-a-Judge Evaluator invokes a second LLM as the Judge Model, reusing inference-task logic with structured scoring prompts. Used for complex objective scenarios (answers too nuanced for rules) and subjective scenarios (open-ended generation rated on relevance, fluency, logic, innovativeness).

Cascade Evaluator operates in two modes. In Cascaded Mode, rules pre-filter: correct samples are cheaply accepted; uncertain samples go to the LLM judge. In Parallel Mode, both evaluators run on every sample; a sample is correct if either evaluator says so (higher tolerance, higher cost).

Drag sliders to see how the Cascade evaluator saves LLM judge calls.
Cascade Mode
Parallel Mode
70%
90%
2000

The cascade is particularly elegant in Cascaded Mode: if your rule evaluator covers 70% of samples at 90% accuracy, you only send 30% of samples to the expensive LLM judge. That is a 70% cost reduction with minimal accuracy loss. The trade-off is tunable via coverage and accuracy thresholds in the config.

Explore the benchmark landscape
Chapter 6

The Benchmark Landscape

OpenCompass supports 100+ datasets across eight domains. Here is what each domain measures — and why the hardest benchmarks humbled every model.

Select models to compare their profiles across benchmark domains.

Notice that even the best models (Gemini-3-Pro at 81.32 average) do not dominate every dimension. The HLE benchmark (Humanity's Last Exam) topped out at 37.98 for the best model — barely above random guessing on some sub-tasks. That gap between "best" and "perfect" is where the next generation of models will compete.

The full model leaderboard
Chapter 7

The Leaderboard

Forty-two models, six benchmarks, one scorecard. Click any column header to sort and discover where each model excels — or fails.

Click column headers to sort. Hover over cells for details.

The spread from top to bottom is massive: Gemini-3-Pro averages 81.32 while Gemma-3-27B-it averages 42.09. But the more interesting story is in the columns: the best model on HLE (37.98) is not the best model on AIME 2025 (96.04). Specialisation is real, and it is why single-number summaries are dangerous.

What comes next for OpenCompass
Chapter 8

What Comes Next

The current OpenCompass is single-modal and serial between stages. The road ahead points toward pipeline parallelism and multimodal evaluation.

The paper identifies two specific future directions:

Pipeline Parallelism Enhancement. The current workflow requires that all inference tasks for a dataset complete before evaluation begins. The planned improvement maintains intra-dataset seriality but achieves cross-dataset pipeline parallelism: after Dataset A's inference completes and its evaluation begins, Dataset B's inference starts simultaneously.

Dialogue Template Expansion. Currently limited to single-modal (text) evaluation, OpenCompass plans to extend the ChatML format to support both multimodal evaluation (images, audio) and multi-turn dialogue evaluation — critical for assessing models that are increasingly multimodal.

These improvements reflect a broader trend in the field: evaluation infrastructure is catching up to model capabilities. As models become more general, evaluation systems must become more comprehensive, more efficient, and more flexible. OpenCompass is positioned as the open-source backbone for that effort.

Drag the slider to compare serial vs. pipeline execution schedules.
6
60s
20s

The shift from serial to pipeline execution is not just an engineering convenience — it is a capability multiplier. Pipeline parallelism can reduce total evaluation time by up to $1 / (1 + r)$ where $r$ is the ratio of evaluation time to inference time. For typical workloads, that means a 30–50% speedup with zero additional hardware.

End of interactive reading.

Read the original paper on arXiv  ·  Browse the code on GitHub