An Interactive Reading of

CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

The paper, in plain English

Every smartphone owner has taken a photo that looked better in the viewfinder than it did in the gallery. The problem isn't the sensor — it's the composition. Professional photographers don't just point and shoot; they scan the scene for leading lines, rule-of-thirds intersections, and vanishing points before they press the shutter. Existing automatic cropping tools miss this: they either follow rigid hand-crafted rules, chase the brightest salient region, or copy the crop from a vaguely similar stock photo.

CROP treats cropping as a reasoning problem. It feeds the image to a vision-language model and asks it to think like a photographer: first identify the compositional elements (leading lines, symmetry axes, subject placement), then propose candidate crops guided by those elements, and finally pick the best one. A second training stage, Direct Preference Optimization, teaches the model which crops human experts actually prefer — not just which one is "correct," but which one is more beautiful.

The result: 86.2% ACC1/5 on the GAICD benchmark (vs. 85.4% for the next-best method), an IoU of 0.871 on FLMS, and a 79.2% preference rate over the previous best in a user study, all from a 7-billion-parameter model fine-tuned on a single RTX 4090.

I. Compositional Reasoning Pipeline
A three-stage "analysis-proposal-decision" process that deconstructs the image into compositional elements before proposing crops, mirroring how a professional photographer actually works.

II. Visual Enhancement
Overlays bounding boxes and guide lines directly onto the image, forcing the VLM to attend to spatial structure rather than just semantic content.

III. Expert Preference Alignment
A two-stage training framework (supervised fine-tuning followed by DPO) that teaches the model to rank crops the way human experts do.

Chapter 1

Why Cropping Is Hard

Three decades of automatic cropping have produced three paradigms — and each fails for a different reason. Understanding those failures is the first step to building something better.

The paper identifies three dominant paradigms in automatic image cropping: hand-crafted rules, saliency-driven methods, and retrieval-based methods. Each has distinct limitations, illustrated in the paper's Figure 1.

[Interactive chart: Performance Across Paradigms. Compares representative methods from each paradigm.]

Why this matters
The best retrieval method (ProCrop) and the best saliency method (CAGR) still fall short of expert-level cropping. Even GPT-5 in zero-shot mode achieves only 26.9% ACC1/5 — suggesting that raw model scale alone cannot substitute for structured reasoning about composition.
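For readers unfamiliar with the metric, ACC1/5 on GAICD is commonly defined as the fraction of images whose top-1 predicted crop falls among the five candidates with the highest annotated MOS. The sketch below assumes that definition; the function name, crop ids, and scores are hypothetical and only illustrate the bookkeeping.

```python
from typing import Dict, List


def acc_1_of_n(mos_per_image: List[Dict[str, float]],
               predicted: List[str],
               n: int = 5) -> float:
    """Fraction of images whose predicted crop is among the n highest-MOS candidates."""
    if not predicted or len(mos_per_image) != len(predicted):
        raise ValueError("need exactly one prediction per image")
    hits = 0
    for mos, choice in zip(mos_per_image, predicted):
        # Rank candidate crop ids by annotated MOS, best first.
        top_n = sorted(mos, key=mos.get, reverse=True)[:n]
        hits += choice in top_n
    return hits / len(predicted)


# Toy data: two images with six candidate crops each (ids and scores are made up).
toy_mos = [
    {"a": 4.3, "b": 4.1, "c": 3.9, "d": 3.5, "e": 3.2, "f": 2.0},
    {"a": 2.1, "b": 4.0, "c": 3.8, "d": 3.7, "e": 3.6, "f": 3.0},
]
toy_pred = ["b", "a"]  # the first pick is in the top 5, the second is not
score = acc_1_of_n(toy_mos, toy_pred)
print(score)  # 0.5
```

Under this definition a model does not have to find the single best crop to score a hit, which is why ACC1/5 is more forgiving than strict top-1 accuracy.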
Chapter 2

The Photographer's Toolkit

Before CROP can reason about composition, it needs a vocabulary. The paper defines ten compositional elements that photographers use to create well-balanced images.

$$T_{\text{comp}} = \Phi_{\text{VLM}}\!\bigl(E_{\text{vis}}(I_{\text{ori}}),\; P_{\text{comp}}\bigr)$$

Equation (1) defines the composition analysis stage. The visual encoder $E_{\text{vis}}$ extracts features from the original image $I_{\text{ori}}$, and the VLM processes these features with a composition prompt $P_{\text{comp}}$ to produce a set of detected elements:

$T_{\text{comp}} = \{(e_k, b_k)\}_{k=1}^{K}$, where each element consists of a categorical label $e_k$ and positional coordinates $b_k$.

The ten elements are: rule of thirds, center, golden ratio, horizontal, symmetric, diagonal, curved, vertical, triangle, and vanishing point — following the classification in the CADB dataset.
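One way to picture $T_{\text{comp}}$ is as a list of typed records. The sketch below is an illustration only: the class names, the record layout, and the `(x1, y1, x2, y2)` coordinate format are assumptions, not the paper's data format. Only the ten labels come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Element(Enum):
    """The ten compositional element labels listed in the text."""
    RULE_OF_THIRDS = "rule of thirds"
    CENTER = "center"
    GOLDEN_RATIO = "golden ratio"
    HORIZONTAL = "horizontal"
    SYMMETRIC = "symmetric"
    DIAGONAL = "diagonal"
    CURVED = "curved"
    VERTICAL = "vertical"
    TRIANGLE = "triangle"
    VANISHING_POINT = "vanishing point"


@dataclass(frozen=True)
class Detection:
    label: Element             # e_k: categorical label
    coords: Tuple[float, ...]  # b_k: assumed here to be (x1, y1, x2, y2)


def parse_detections(raw: List[dict]) -> List[Detection]:
    """Convert hypothetical VLM output records into typed detections."""
    return [Detection(Element(r["label"].lower()), tuple(r["coords"])) for r in raw]


# Hypothetical model output for one image.
t_comp = parse_detections([
    {"label": "Rule of Thirds", "coords": [120, 80, 360, 400]},
    {"label": "Diagonal", "coords": [0, 480, 640, 0]},
])
print([d.label.name for d in t_comp])
```

Using an enum means an unexpected label from the model raises a `ValueError` at parse time instead of flowing silently into the proposal stage.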

[Interactive figure: The Ten Compositional Elements. Shows how each element shapes the photographer's crop.]

Why this matters
These ten elements aren't arbitrary — they come from decades of photography pedagogy (Freeman, 2017; Prakel, 2020). By forcing the model to detect them explicitly, CROP creates an interpretable intermediate representation: you can inspect why the model proposed a particular crop by looking at which elements it identified.
Chapter 3

Think Like a Photographer

The Compositional Reasoning Pipeline breaks cropping into three stages: analyze the scene, propose candidate crops, and make a final aesthetic decision.

$$\underbrace{T_{\text{comp}} = \Phi_{\text{VLM}}(E_{\text{vis}}(I_{\text{ori}}), P_{\text{comp}})}_{\text{Analysis}} \;\;\xrightarrow{\;I_{\text{comp}} = V(I_{\text{ori}}, T_{\text{comp}})\;}\;\; \underbrace{C_{\text{cand}} = \Phi_{\text{VLM}}(E_{\text{vis}}(I_{\text{comp}}), T_{\text{comp}}, P_{\text{crop}})}_{\text{Proposal}} \;\;\longrightarrow\;\; \underbrace{C_{\text{final}} = \Phi_{\text{VLM}}(C_{\text{cand}}, P_{\text{aes}})}_{\text{Decision}}$$

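The pipeline chains four operations in sequence: composition analysis, visual enhancement (the overlay $V$ that turns $I_{\text{ori}}$ into $I_{\text{comp}}$), crop proposal, and the final aesthetic decision. The three reasoning stages are the VLM calls; the enhancement step sits between the first two.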
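The chain in the equation above can be sketched as plain function composition. This is a structural illustration, not the authors' code: the four callables stand in for the VLM calls and the overlay function, and the `fake_*` helpers are hypothetical stubs so the example runs without a model.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), format assumed
Elements = List[Tuple[str, Box]]  # T_comp as (e_k, b_k) pairs


def run_pipeline(image: object,
                 analyze: Callable[[object], Elements],
                 enhance: Callable[[object, Elements], object],
                 propose: Callable[[object, Elements], List[Box]],
                 decide: Callable[[List[Box]], Box]) -> Box:
    t_comp = analyze(image)           # Analysis: Phi_VLM(E_vis(I_ori), P_comp)
    i_comp = enhance(image, t_comp)   # Enhancement: V(I_ori, T_comp)
    c_cand = propose(i_comp, t_comp)  # Proposal: Phi_VLM(E_vis(I_comp), T_comp, P_crop)
    if not c_cand:
        raise ValueError("proposal stage returned no candidate crops")
    return decide(c_cand)             # Decision: Phi_VLM(C_cand, P_aes)


# Hypothetical stand-ins for the model-backed stages.
def fake_analyze(image: object) -> Elements:
    return [("center", (40, 30, 60, 70))]


def fake_enhance(image: object, t_comp: Elements) -> object:
    return {"image": image, "overlays": t_comp}


def fake_propose(i_comp: object, t_comp: Elements) -> List[Box]:
    return [(20, 10, 80, 90), (30, 20, 70, 80)]


def fake_decide(c_cand: List[Box]) -> Box:
    # Placeholder rule: keep the largest candidate by area.
    return max(c_cand, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


final_crop = run_pipeline("photo.jpg", fake_analyze, fake_enhance,
                          fake_propose, fake_decide)
print(final_crop)  # (20, 10, 80, 90)
```

Keeping each stage behind its own callable mirrors the point made in the text: every intermediate output (`t_comp`, `i_comp`, `c_cand`) is a named value that can be logged and inspected.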

[Interactive figure: The Pipeline, Step by Step. Shows each stage's role and how information flows through the pipeline.]

Why this matters
Previous methods treat cropping as a single-shot prediction. By decomposing it into three reasoning stages, CROP gives the model a place to "show its work" — each intermediate output can be inspected, debugged, and improved independently.
Chapter 4

Making the Model See

Vision-language models are good at recognizing what an image shows, but they pay far less attention to where things are. Visual enhancement forces spatial structure into the model's attention.

$$I_{\text{comp}} = V(I_{\text{ori}},\; T_{\text{comp}})$$

Equation (2) defines the visual enhancement step. The function $V(\cdot)$ overlays graphical elements derived from the composition analysis $T_{\text{comp}}$, namely bounding boxes and guide lines, onto the original image.

The paper quantifies the model's attention to different token types (Table 1). Semantic tokens $e_k$ receive normalized attention of 0.416 for "Center," while coordinate tokens $b_k$ receive only 0.103 — a 4× gap. Visual enhancement closes this gap by converting spatial coordinates into visual features the model naturally processes.
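To make the idea of turning coordinates into pixels concrete, here is a minimal pure-Python sketch of an overlay function in the spirit of $V(\cdot)$. It is an assumption-laden stand-in: the image is a nested list of grayscale values, boxes are `(x1, y1, x2, y2)` in pixel coordinates, and the function names are hypothetical. A real implementation would most likely draw on an actual image with an imaging library.

```python
from typing import Iterable, List, Tuple

Image = List[List[int]]          # grayscale rows; stand-in for a real image
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), inclusive pixel coordinates


def overlay_boxes(image: Image, boxes: Iterable[Box], value: int = 255) -> Image:
    """Return a copy of the image with each box outline drawn on it."""
    out = [row[:] for row in image]
    h, w = len(out), len(out[0])
    for x1, y1, x2, y2 in boxes:
        # Normalize corner order and clamp to the image bounds.
        x1, x2 = max(0, min(x1, x2)), min(w - 1, max(x1, x2))
        y1, y2 = max(0, min(y1, y2)), min(h - 1, max(y1, y2))
        if x1 > x2 or y1 > y2:
            continue  # box lies entirely outside the image
        for x in range(x1, x2 + 1):
            out[y1][x] = value
            out[y2][x] = value
        for y in range(y1, y2 + 1):
            out[y][x1] = value
            out[y][x2] = value
    return out


def overlay_guides(image: Image, columns: Iterable[int] = (),
                   rows: Iterable[int] = (), value: int = 255) -> Image:
    """Return a copy of the image with full-length vertical and horizontal guide lines."""
    out = [row[:] for row in image]
    h, w = len(out), len(out[0])
    for x in columns:
        if 0 <= x < w:
            for y in range(h):
                out[y][x] = value
    for y in rows:
        if 0 <= y < h:
            out[y] = [value] * w
    return out


blank = [[0] * 8 for _ in range(6)]
enhanced = overlay_boxes(blank, [(1, 1, 5, 4)])
for row in enhanced:
    print("".join("#" if px else "." for px in row))
```

The overlay returns a new image and leaves the original untouched, so the unmodified $I_{\text{ori}}$ remains available for the final crop.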

[Interactive chart: Attention: Semantics vs. Coordinates. Bars show how much attention the VLM assigns to semantic labels vs. spatial coordinates, with and without visual enhancement.]

Why this matters
Visual enhancement is a lightweight intervention — no architectural changes, no extra training — yet it consistently improves performance across all metrics. The ablation (Table 3, C4 vs. C3) shows it adds roughly 0.6–1.0 points on ACC1/5. It works because it exploits a known blind spot of VLMs: their bias toward semantics over spatial structure.
Chapter 5

From Imitation to Preference

Supervised fine-tuning teaches the model what experts chose. Direct Preference Optimization teaches it why one crop ranks above another.

$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y)}\left[\sum_{t=1}^{|y|} \log \Phi_\theta(y_t \mid x, y_{<t})\right]$$
$$\mathcal{L}_{\text{DPO}}(\theta;\; \Phi_{\text{ref}}) = -\mathbb{E}_{(x,\, y_w,\, y_l)}\Bigl[\log \sigma\!\bigl(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\bigr)\Bigr]$$
$$\hat{r}_\theta(x, y) = \beta \log \frac{\Phi_\theta(y \mid x)}{\Phi_{\text{ref}}(y \mid x)}$$

The training framework has two stages:

  • Stage 1, Imitation Learning (SFT): Minimizes the cross-entropy loss (Eq. 5) over the dataset $\mathcal{D}_{\text{SFT}}$. Each sample is a quadruple $(I_i, C_i, P_i, y_i)$ where $y_i$ is the crop with the highest Mean Opinion Score (MOS). The model learns to predict the expert's chosen crop.
  • Stage 2, Preference Alignment (DPO): Uses the SFT model as the reference $\Phi_{\text{ref}}$ and constructs a preference dataset $\mathcal{D}_{\text{DPO}}$ with winning responses $y_w$ (highest MOS) and losing responses $y_l$ (lower MOS). The loss (Eq. 6) encourages the model to raise the probability of $y_w$ relative to the reference while lowering that of $y_l$. The implicit reward (Eq. 7) is scaled by $\beta$, which trades off the strength of the preference signal against staying close to the reference model, and hence against generative diversity.
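The two losses can be checked numerically on a single example. The sketch below works from sequence-level log-probabilities, which is an assumption about how the quantities would be supplied; the function names are hypothetical and no batching or gradient computation is shown.

```python
import math
from typing import Sequence


def sft_loss(token_logprobs: Sequence[float]) -> float:
    """Eq. 5 for one sample: negative sum of per-token log-probabilities."""
    return -sum(token_logprobs)


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """Eq. 7: beta * log(Phi_theta(y|x) / Phi_ref(y|x)), using sequence log-probs."""
    return beta * (logp_policy - logp_ref)


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.2) -> float:
    """Eq. 6 for one preference pair (y_w preferred over y_l)."""
    margin = (implicit_reward(logp_w, ref_logp_w, beta)
              - implicit_reward(logp_l, ref_logp_l, beta))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# At the start of DPO the policy equals the reference, so the margin is zero
# and the loss is log(2) regardless of beta.
print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
# If the policy raises y_w and lowers y_l relative to the reference, the loss falls.
print(dpo_loss(-4.0, -7.0, -5.0, -6.0))
```

Because only differences from the reference enter the margin, a policy that raises the probability of both crops equally gains nothing. The loss only drops when the winner moves up relative to the loser.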

Key training details: the base model is Qwen2.5-VL-7B, fine-tuned with LoRA ($r = 16$, $\alpha = 32$). SFT uses learning rate $1 \times 10^{-4}$; DPO uses $1 \times 10^{-5}$ with $\beta = 0.2$.
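For reference, the reported settings can be gathered into one small configuration object. The class and field names below are hypothetical and not tied to any training library; only the values come from the text. The `scaling` property reflects the standard LoRA convention of multiplying the low-rank update by $\alpha / r$.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoraSettings:
    rank: int = 16   # r
    alpha: int = 32  # alpha

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.alpha / self.rank


@dataclass(frozen=True)
class StageSettings:
    name: str
    learning_rate: float
    beta: Optional[float] = None  # only meaningful for the DPO stage


BASE_MODEL = "Qwen2.5-VL-7B"
LORA = LoraSettings()
STAGES = (
    StageSettings("sft", learning_rate=1e-4),
    StageSettings("dpo", learning_rate=1e-5, beta=0.2),
)

for stage in STAGES:
    print(stage.name, stage.learning_rate, stage.beta)
print("LoRA scaling:", LORA.scaling)  # 2.0
```

The DPO stage runs at a learning rate ten times smaller than SFT, which fits its role as a refinement of an already fine-tuned model.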

[Interactive figure: DPO: How Preference Strength Shapes Training. β controls the divergence between the policy model and the reference model.]

At β = 0.20: SFT IoU (FLMS) 0.822; DPO IoU (FLMS) 0.871; DPO improvement +5.96%.
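The improvement figure is the relative gain of the DPO model's IoU over the SFT model's, which can be checked directly from the two values shown:

$$\frac{\text{IoU}_{\text{DPO}} - \text{IoU}_{\text{SFT}}}{\text{IoU}_{\text{SFT}}} = \frac{0.871 - 0.822}{0.822} = \frac{0.049}{0.822} \approx 0.0596 = 5.96\%$$

In absolute terms the gain is 0.049 IoU.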
Why this matters
The gap between SFT and DPO isn't incremental — it's qualitative. SFT treats each crop as right or wrong. DPO teaches the model to understand degree of quality: why a 4.3-MOS crop is better than a 4.1-MOS crop. This comparative reasoning is what makes expert judgment hard to automate, and DPO directly targets it.
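A rough sketch of how MOS annotations could be turned into DPO preference pairs follows. It uses the rule stated earlier (the winner is the highest-MOS crop, the losers are lower-MOS crops). The function name, the crop ids, and the `min_gap` filter are assumptions added for illustration and are not described in the paper.

```python
from typing import Dict, List, Tuple


def build_preference_pairs(mos: Dict[str, float],
                           min_gap: float = 0.0) -> List[Tuple[str, str]]:
    """Pair the highest-MOS crop (y_w) with each crop scoring lower by more than min_gap (y_l)."""
    if not mos:
        return []
    winner = max(mos, key=mos.get)
    return [(winner, crop) for crop, score in mos.items()
            if mos[winner] - score > min_gap]


# Made-up scores for three candidate crops of one image.
scores = {"crop_a": 4.3, "crop_b": 4.1, "crop_c": 2.8}
pairs = build_preference_pairs(scores)
print(pairs)  # [('crop_a', 'crop_b'), ('crop_a', 'crop_c')]
```

The close pair (4.3 vs. 4.1) is exactly the kind of comparison SFT never sees, since SFT only ever trains on the single best crop.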
Chapter 6

Tuning the System

Every component matters — but some matter more than others. The ablation study and sensitivity analysis reveal where the gains come from and how to set the knobs.

[Interactive chart: Ablation: What Each Component Adds.]

[Interactive chart: Sensitivity to Hyperparameters. IoU on the FLMS dataset as each parameter varies. Solid line = DPO model; dashed line = SFT-only.]
Why this matters
DPO doesn't just improve peak performance — it dramatically increases stability. As temperature rises from 0.1 to 1.2, the SFT model's IoU drops from 0.858 to below 0.78. The DPO model stays above 0.85 across the entire range. For real deployment, where you can't guarantee perfect inference settings, this robustness may matter more than the headline numbers.
Chapter 7

The Verdict

Numbers on benchmarks tell one story; human eyes tell another. Both agree: CROP produces crops that look better.

[Interactive chart: Head-to-Head on Three Benchmarks. Shows all major methods across the GAICD, FLMS, and FCDB datasets.]
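The IoU figures reported for FLMS compare a predicted crop rectangle against an annotated one. A minimal sketch of the standard intersection-over-union computation for axis-aligned boxes, assuming an `(x1, y1, x2, y2)` format with `x1 < x2` and `y1 < y2` (the function names are hypothetical):

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), x1 < x2 and y1 < y2


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


expert = (0.0, 0.0, 100.0, 100.0)
predicted = (10.0, 10.0, 100.0, 100.0)
print(iou(expert, predicted))  # 0.81
```

In this toy case, trimming 10 pixels from two sides of a 100×100 crop already lowers IoU to 0.81.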

[Interactive chart: User Study: Human Preference. Each bar shows the preference rate when participants compared CROP against the baseline method.]

Why this matters
The user study confirms what metrics can only approximate: CROP's crops are visibly better, not just metrically better. A 79.2% preference rate means that nearly 4 out of 5 times, human viewers chose CROP over the previous state of the art. The remaining limitation is computational cost, since the 7B model requires a dedicated GPU, but as lightweight VLMs improve, this bottleneck should shrink.
"The proposed approach fully exploits the VLM's capability for aesthetic understanding and reasoning — deconstructing a complex and subjective aesthetic problem into an analysis-proposal-decision process."
86.2% ACC1/5 on GAICD
0.871 IoU on FLMS
79.2% User preference over GAIC
Read the paper
Zhitong Dong, Chao Li, Jie Yu, Hao Chen. "CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference." arXiv:2605.12545, May 2026.
arxiv.org/abs/2605.12545