An Interactive Reading of

CROP: Expert-Aligned Image Cropping via
Compositional Reasoning and
Optimizing Preference

Zhitong Dong, Chao Li, Jie Yu, Hao Chen
Southeast University · Alibaba Group · May 2026 · arXiv:2605.12545

The paper, in plain English

Every smartphone owner has taken a photo that looked better in the viewfinder than it did in the gallery. The problem isn't the sensor; it's the composition. Professional photographers don't just point and shoot; they scan the scene for leading lines, rule-of-thirds intersections, and vanishing points before they press the shutter. Existing automatic cropping tools miss this: they follow rigid hand-crafted rules, chase the most salient region, or copy the crop from a vaguely similar stock photo.

CROP treats cropping as a reasoning problem. It feeds the image to a vision-language model and asks it to think like a photographer: first identify the compositional elements (leading lines, symmetry axes, subject placement), then propose candidate crops guided by those elements, and finally pick the best one. A second training stage, Direct Preference Optimization, teaches the model which crops human experts actually prefer — not just which one is "correct," but which one is more beautiful.

The result: 86.2% ACC1/5 on the GAICD benchmark (vs. 85.4% for the next-best method), an IoU of 0.871 on FLMS, and a 79.2% preference rate over the previous best in a user study, all from a 7-billion-parameter model fine-tuned on a single RTX 4090.

I

Compositional Reasoning Pipeline

A three-stage "analysis-proposal-decision" process that deconstructs the image into compositional elements before proposing crops — mirroring how a professional photographer actually works.

II

Visual Enhancement

Overlays bounding boxes and guide lines directly onto the image, forcing the VLM to attend to spatial structure rather than just semantic content.

III

Expert Preference Alignment

A two-stage training framework — supervised fine-tuning followed by DPO — that teaches the model to rank crops the way human experts do.

Chapter 1

Why Cropping Is Hard

Three decades of automatic cropping have produced three paradigms — and each fails for a different reason. Understanding those failures is the first step to building something better.

In plain English

Imagine you're a photo editor at a magazine, handed a thousand vacation snapshots and told to pick the best crop for each. You could follow a checklist ("put the subject in the center"), track the brightest spot, or search a stock-photo database for something similar. That's roughly what the three generations of automatic cropping algorithms do — and why they all fall short.

Hand-crafted rules like "rule of thirds" are rigid. Saliency detectors get confused when two subjects compete for attention. Retrieval systems match the topic of the photo (people on grass) but miss the composition (a diagonal leading line that the crop should preserve).

CROP starts from a different premise: instead of shortcuts, make the model reason about composition the way a photographer does — step by step.

The paper identifies three dominant paradigms in automatic image cropping, each with distinct limitations shown in their Figure 1:

Hand-crafted features (e.g., Smartcrop, 2014): Use rigid rules based on edges, faces, and foreground detection. These generalize poorly to complex scenes and produce compositions lacking aesthetic appeal.
Saliency prediction (e.g., CACNet, 2021): Focuses on detecting visually salient regions but struggles with compositional trade-offs. When multiple focal points compete (for example, a woman and a band in the background), the model cannot make a clear decision.
Retrieval-based (e.g., Cropper, 2025): Retrieves visually similar images as references. However, cosine-similarity retrieval captures semantic proximity rather than compositional structure, so the retrieved references can misguide the crop.

Performance Across Paradigms

The chart compares representative methods from each paradigm.

Why this matters

The best retrieval method (ProCrop) and the best saliency method (CAGR) still fall short of expert-level cropping. Even GPT-5 in zero-shot mode achieves only 26.9% ACC1/5 — suggesting that raw model scale alone cannot substitute for structured reasoning about composition.
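To make the ACC1/5 figure concrete, here is a minimal sketch of an ACC K/N style metric, assuming the common GAICD convention in which a prediction counts as a hit when the model's top-K crop is among the N annotated crops with the highest MOS. The function name, the input layout, and the idea that predictions are already mapped to annotated candidate indices are assumptions for illustration; this article does not describe how the paper performs that matching.

```python
from typing import Sequence


def acc_k_over_n(
    predicted_ranks: Sequence[Sequence[int]],
    mos_scores: Sequence[Sequence[float]],
    k: int = 1,
    n: int = 5,
) -> float:
    """Fraction of the model's top-k crops that fall in the top-n crops by MOS.

    predicted_ranks[i] lists candidate indices for image i, best first.
    mos_scores[i][j] is the annotated MOS of candidate j for image i.
    (Hypothetical input layout, chosen for illustration.)
    """
    hits = 0
    total = 0
    for ranks, scores in zip(predicted_ranks, mos_scores):
        # Indices of the n candidates with the highest annotated MOS.
        by_mos = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        top_n = set(by_mos[:n])
        for j in ranks[:k]:
            hits += int(j in top_n)
            total += 1
    return hits / total if total else 0.0
```

With `k=1` and `n=5` this reduces to asking, per image, whether the single chosen crop is one of the five best-rated candidates.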


Chapter 2

The Photographer's Toolkit

Before CROP can reason about composition, it needs a vocabulary. The paper defines ten compositional elements that photographers use to create well-balanced images.

$$T_{\text{comp}} = \Phi_{\text{VLM}}\!\bigl(E_{\text{vis}}(I_{\text{ori}}),\; P_{\text{comp}}\bigr)$$

Equation (1) defines the composition analysis stage. The visual encoder $E_{\text{vis}}$ extracts features from the original image $I_{\text{ori}}$, and the VLM processes these features with a composition prompt $P_{\text{comp}}$ to produce a set of detected elements:

$T_{\text{comp}} = \{(e_k, b_k)\}_{k=1}^{K}$, where each element consists of a categorical label $e_k$ and positional coordinates $b_k$.

The ten elements are: rule of thirds, center, golden ratio, horizontal, symmetric, diagonal, curved, vertical, triangle, and vanishing point — following the classification in the CADB dataset.
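As a rough illustration of the set $T_{\text{comp}} = \{(e_k, b_k)\}$, the sketch below parses a JSON string into labelled elements and discards labels outside the ten-element vocabulary. The JSON keys `label` and `coords`, the class name, and the normalization of labels to lowercase are assumptions; the article only says the output is structured JSON with category labels and coordinates.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple

# The ten elements listed in the text.
ELEMENT_LABELS = {
    "rule of thirds", "center", "golden ratio", "horizontal", "symmetric",
    "diagonal", "curved", "vertical", "triangle", "vanishing point",
}


@dataclass(frozen=True)
class CompElement:
    label: str                  # e_k, the categorical label
    coords: Tuple[float, ...]   # b_k, positional coordinates


def parse_composition(raw_json: str) -> List[CompElement]:
    """Parse a (hypothetical) VLM response into validated elements."""
    elements = []
    for item in json.loads(raw_json):
        label = item["label"].strip().lower()
        if label not in ELEMENT_LABELS:
            continue  # ignore anything outside the ten-element vocabulary
        coords = tuple(float(v) for v in item["coords"])
        elements.append(CompElement(label, coords))
    return elements
```

Validating against a fixed vocabulary is one simple way to keep the intermediate representation inspectable, since every surviving label maps to a known compositional concept.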

The Ten Compositional Elements

An interactive panel shows how each element shapes the photographer's crop.

Why this matters

These ten elements aren't arbitrary — they come from decades of photography pedagogy (Freeman, 2017; Prakel, 2020). By forcing the model to detect them explicitly, CROP creates an interpretable intermediate representation: you can inspect why the model proposed a particular crop by looking at which elements it identified.


Chapter 3

Think Like a Photographer

The Compositional Reasoning Pipeline breaks cropping into three stages: analyze the scene, propose candidate crops, and make a final aesthetic decision.

$$\underbrace{T_{\text{comp}} = \Phi_{\text{VLM}}(E_{\text{vis}}(I_{\text{ori}}), P_{\text{comp}})}_{\text{Analysis}} \;\;\xrightarrow{\;I_{\text{comp}} = V(I_{\text{ori}}, T_{\text{comp}})\;}\;\; \underbrace{C_{\text{cand}} = \Phi_{\text{VLM}}(E_{\text{vis}}(I_{\text{comp}}), T_{\text{comp}}, P_{\text{crop}})}_{\text{Proposal}} \;\;\longrightarrow\;\; \underbrace{C_{\text{final}} = \Phi_{\text{VLM}}(C_{\text{cand}}, P_{\text{aes}})}_{\text{Decision}}$$

The pipeline chains four operations in sequence, namely the three reasoning stages plus a visual enhancement step between analysis and proposal:

Composition Analysis (Eq. 1): The VLM identifies compositional elements and returns them as structured JSON with category labels and bounding box coordinates.
Visual Enhancement (Eq. 2): A visualization function $V(\cdot)$ overlays graphical annotations — bounding boxes for subject-placement elements, guide lines for layout elements — onto the original image.
Cropping Proposal (Eq. 3): Using both the enhanced image and the text analysis, the VLM generates $N$ candidate crops.
Aesthetic Decision (Eq. 4): The VLM evaluates all candidates on balance, focus, and visual appeal, selecting the single best crop.
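The four operations above can be sketched as plain function composition. This is a minimal sketch, not the paper's code: each callable stands in for a VLM call or a drawing routine, the `(x1, y1, x2, y2)` box format is an assumption, and the `_toy_*` functions exist only so the example runs without a model.

```python
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]     # (x1, y1, x2, y2), assumed format
Element = Tuple[str, Tuple[float, ...]]     # (e_k, b_k)


def run_crop_pipeline(
    image: Any,
    analyze: Callable[[Any], List[Element]],
    enhance: Callable[[Any, List[Element]], Any],
    propose: Callable[[Any, List[Element], int], List[Box]],
    decide: Callable[[List[Box]], Box],
    n_candidates: int,
) -> Dict[str, Any]:
    """Run analysis -> enhancement -> proposal -> decision, keeping every
    intermediate output so each stage can be inspected on its own."""
    t_comp = analyze(image)                         # Eq. 1: composition analysis
    i_comp = enhance(image, t_comp)                 # Eq. 2: visual enhancement
    c_cand = propose(i_comp, t_comp, n_candidates)  # Eq. 3: cropping proposal
    c_final = decide(c_cand)                        # Eq. 4: aesthetic decision
    return {"t_comp": t_comp, "i_comp": i_comp, "c_cand": c_cand, "c_final": c_final}


# Toy stand-ins (hypothetical; they do not imitate the model's behavior).
def _toy_analyze(image: Any) -> List[Element]:
    return [("center", (0.3, 0.3, 0.7, 0.7))]


def _toy_enhance(image: Any, t_comp: List[Element]) -> Any:
    return {"image": image, "overlays": t_comp}


def _toy_propose(i_comp: Any, t_comp: List[Element], n: int) -> List[Box]:
    return [(0.05 * k, 0.05 * k, 1.0 - 0.05 * k, 1.0 - 0.05 * k) for k in range(n)]


def _toy_decide(c_cand: List[Box]) -> Box:
    return c_cand[len(c_cand) // 2]


trace = run_crop_pipeline(
    "photo.jpg", _toy_analyze, _toy_enhance, _toy_propose, _toy_decide, n_candidates=3
)
```

Returning the full trace rather than only the final crop mirrors the point of the decomposition: every intermediate output stays available for debugging.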

The Pipeline, Step by Step

An interactive diagram walks through each stage's role and how information flows through the pipeline.

Why this matters

Previous methods treat cropping as a single-shot prediction. By decomposing it into three reasoning stages, CROP gives the model a place to "show its work" — each intermediate output can be inspected, debugged, and improved independently.


Chapter 4

Making the Model See

Vision-language models understand what an image shows, but they pay far less attention to where things are. Visual enhancement forces spatial structure into the model's attention.

In plain English

Suppose you describe a room to someone over the phone: "There's a sofa against the left wall and a lamp in the right corner." They'll remember "sofa" and "lamp" but probably forget which was on the left and which on the right. Vision-language models do the same thing — they attend strongly to semantic concepts ("sofa," "lamp") but barely register spatial coordinates.

CROP's fix is clever: instead of relying on text alone, it draws the compositional elements directly onto the image. Bounding boxes highlight subject placement; guide lines reveal structural alignment. The model can no longer ignore spatial cues because they're painted into the pixels it processes.

The attention analysis in the paper quantifies this: after visual enhancement, attention to coordinate tokens jumps from 0.024 to meaningful levels.

$$I_{\text{comp}} = V(I_{\text{ori}},\; T_{\text{comp}})$$

Equation (2) defines the visual enhancement step. The function $V(\cdot)$ overlays graphical elements derived from the composition analysis $T_{\text{comp}}$ onto the original image:

Subject-placement elements (rule of thirds, center, golden ratio): Overlay bounding boxes highlighting relative positions.
Layout elements (vertical, horizontal, diagonal, curved): Draw guide lines revealing the scene's overall structure.
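One way to picture the function $V(\cdot)$ is as a mapping from detected elements to drawing primitives. The sketch below is an assumption-laden illustration: it takes coordinates normalized to [0, 1] as flat `x, y` sequences, emits rectangles for the subject-placement elements and polylines for the layout elements, and skips the remaining element types because the article does not say how they are drawn. Actual rendering onto pixels (for example with an imaging library) is left out.

```python
from typing import List, Sequence, Tuple

# Groupings taken from the two categories described in the text.
BOX_ELEMENTS = {"rule of thirds", "center", "golden ratio"}
LINE_ELEMENTS = {"vertical", "horizontal", "diagonal", "curved"}

Point = Tuple[int, int]
Primitive = Tuple[str, List[Point]]


def to_pixels(coords: Sequence[float], width: int, height: int) -> List[Point]:
    """Convert a flat normalized (x, y, x, y, ...) sequence to pixel points."""
    xs, ys = coords[0::2], coords[1::2]
    return [(round(x * (width - 1)), round(y * (height - 1))) for x, y in zip(xs, ys)]


def overlay_primitives(
    elements: Sequence[Tuple[str, Sequence[float]]], width: int, height: int
) -> List[Primitive]:
    """Turn (label, coords) pairs into ("rect" | "line", points) primitives."""
    primitives: List[Primitive] = []
    for label, coords in elements:
        points = to_pixels(coords, width, height)
        if label in BOX_ELEMENTS and len(points) == 2:
            primitives.append(("rect", points))   # two opposite corners
        elif label in LINE_ELEMENTS and len(points) >= 2:
            primitives.append(("line", points))   # polyline; curves need >2 points
        # Other labels are skipped in this sketch.
    return primitives
```

Keeping the geometry separate from the rendering makes it easy to check that the overlay lands where the analysis said it should before any pixels are touched.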

The paper quantifies the model's attention to different token types (Table 1). Semantic tokens $e_k$ receive normalized attention of 0.416 for "Center," while coordinate tokens $b_k$ receive only 0.103, a roughly 4× gap. Visual enhancement targets this gap by converting spatial coordinates into visual features the model naturally processes.
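As a quick check of the quoted gap for "Center," using only the two attention values given above:

$$\frac{0.416}{0.103} \approx 4.04$$

So the semantic label draws about four times the attention of its coordinates.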

Attention: Semantics vs. Coordinates

The bars show how much attention the VLM assigns to semantic labels vs. spatial coordinates, with and without visual enhancement.

Why this matters

Visual enhancement is a lightweight intervention — no architectural changes, no extra training — yet it consistently improves performance across all metrics. The ablation (Table 3, C4 vs. C3) shows it adds roughly 0.6–1.0 points on ACC1/5. It works because it exploits a known blind spot of VLMs: their bias toward semantics over spatial structure.


Chapter 5

From Imitation to Preference

Supervised fine-tuning teaches the model what experts chose. Direct Preference Optimization teaches it why one crop ranks above another.

In plain English

Imagine training a food critic. In phase one, you show them dishes that top chefs loved — "this risotto got five stars." They learn to recognize great food, but they can't explain why risotto A beats risotto B when both look similar. That's supervised fine-tuning (SFT): it teaches the model to imitate expert choices.

In phase two, you present pairs of dishes and ask: "Which would the Michelin inspector prefer?" By comparing, the critic learns the relative judgment that underpins expert taste. That's Direct Preference Optimization (DPO). It doesn't need a separate reward model or reinforcement learning — it directly optimizes the model to increase the probability of preferred outputs while decreasing the probability of rejected ones.

The parameter β sets the preference strength, which shapes the training landscape.

$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y)}\left[\sum_{t=1}^{|y|} \log \Phi_\theta(y_t \mid x, y_{<t})\right]$$

$$\mathcal{L}_{\text{DPO}}(\theta;\; \Phi_{\text{ref}}) = -\mathbb{E}_{(x,\, y_w,\, y_l)}\Bigl[\log \sigma\!\bigl(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\bigr)\Bigr]$$

$$\hat{r}_\theta(x, y) = \beta \log \frac{\Phi_\theta(y \mid x)}{\Phi_{\text{ref}}(y \mid x)}$$
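The DPO loss and implicit reward above can be evaluated for a single preference pair with a few lines of arithmetic. This is a minimal numeric sketch, assuming the sequence log-probabilities $\log \Phi_\theta(y \mid x)$ and $\log \Phi_{\text{ref}}(y \mid x)$ have already been computed; the function names are hypothetical, and the expectation over the dataset is omitted.

```python
import math


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """r_hat = beta * log(policy(y|x) / ref(y|x)), given sequence log-probs."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(
    logp_w: float,
    logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.2,
) -> float:
    """-log(sigmoid(r_hat(y_w) - r_hat(y_l))) for one (x, y_w, y_l) triple."""
    margin = implicit_reward(logp_w, ref_logp_w, beta) - implicit_reward(
        logp_l, ref_logp_l, beta
    )
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, both implicit rewards are zero and the loss is $\log 2$; raising the winner's log-probability relative to the reference pushes the loss below that value.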

The training framework has two stages:

Stage 1, Imitation Learning (SFT): Minimizes the cross-entropy loss (Eq. 5) over the dataset $\mathcal{D}_{\text{SFT}}$. Each sample is a quadruple $(I_i, C_i, P_i, y_i)$ where $y_i$ is the crop with the highest Mean Opinion Score (MOS). The model learns to predict the expert's chosen crop.
Stage 2, Preference Alignment (DPO): Uses the SFT model as the reference $\Phi_{\text{ref}}$ and constructs a preference dataset $\mathcal{D}_{\text{DPO}}$ of winning responses $y_w$ (highest MOS) and losing responses $y_l$ (lower MOS). The loss (Eq. 6) encourages the model to raise the probability of $y_w$ and lower that of $y_l$, each measured relative to the reference. The implicit reward (Eq. 7) is scaled by $\beta$, which trades off the strength of the preference signal against generative diversity.
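The preference dataset described in Stage 2 can be pictured with a small pairing routine. This is a sketch under assumptions: it pairs the single highest-MOS crop with every strictly lower-rated crop for the same image, whereas the paper may sample pairs differently or apply a MOS margin, and the function name and tuple layout are hypothetical.

```python
from typing import List, Tuple


def build_preference_pairs(candidates: List[Tuple[str, float]]) -> List[Tuple[str, str]]:
    """Build (y_w, y_l) pairs from (response, MOS) tuples for one image.

    y_w is the highest-MOS response; y_l ranges over responses with a
    strictly lower MOS, so ties with the winner produce no pair.
    """
    if len(candidates) < 2:
        return []
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    winner, best_mos = ranked[0]
    return [(winner, loser) for loser, mos in ranked[1:] if mos < best_mos]
```

For example, candidates rated 4.3, 4.1, and 4.3 yield a single pair, because the second 4.3 crop is not strictly worse than the winner.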

Key training details: the base model is Qwen2.5-VL-7B, fine-tuned with LoRA ($r = 16$, $\alpha = 32$). SFT uses learning rate $1 \times 10^{-4}$; DPO uses $1 \times 10^{-5}$ with $\beta = 0.2$.

DPO: How Preference Strength Shapes Training

The slider controls β, which governs the divergence between the policy model and the reference model. At β = 0.20, the value used in training:

SFT IoU (FLMS) 0.822 · DPO IoU (FLMS) 0.871 · DPO improvement +5.96%
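The +5.96% figure is a relative gain over the SFT baseline rather than a percentage-point difference, as the two IoU values confirm:

$$\frac{0.871 - 0.822}{0.822} = \frac{0.049}{0.822} \approx 0.0596 = 5.96\%$$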

Why this matters

The gap between SFT and DPO isn't incremental — it's qualitative. SFT treats each crop as right or wrong. DPO teaches the model to understand degree of quality: why a 4.3-MOS crop is better than a 4.1-MOS crop. This comparative reasoning is what makes expert judgment hard to automate, and DPO directly targets it.


Chapter 6

Tuning the System

Every component matters — but some matter more than others. The ablation study and sensitivity analysis reveal where the gains come from and how to set the knobs.

Ablation: What Each Component Adds

A bar chart reports the metric values for each ablation configuration.

Sensitivity to Hyperparameters

The plot shows how each parameter affects IoU on the FLMS dataset (solid line = DPO model; dashed line = SFT-only). Settings shown: β (DPO coefficient) 0.20 · Temperature 0.50 · Top-p 0.95

Why this matters

DPO doesn't just improve peak performance — it dramatically increases stability. As temperature rises from 0.1 to 1.2, the SFT model's IoU drops from 0.858 to below 0.78. The DPO model stays above 0.85 across the entire range. For real deployment, where you can't guarantee perfect inference settings, this robustness may matter more than the headline numbers.
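For readers unfamiliar with the two decoding knobs, the sketch below shows the standard mechanics of temperature scaling and top-p (nucleus) filtering on a toy next-token distribution. It illustrates generic sampling behavior, not the paper's decoding code; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature; higher temperature flattens the result."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: Sequence[float], top_p: float) -> List[float]:
    """Keep the smallest set of most-likely tokens whose mass reaches top_p,
    zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set()
    cumulative = 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]
```

At a low temperature nearly all probability sits on the top token, while at a high temperature unlikely tokens (here, off-target coordinates) get sampled more often, which is the regime where the text reports the SFT-only model degrading.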


Chapter 7

The Verdict

Numbers on benchmarks tell one story; human eyes tell another. Both agree: CROP produces crops that look better.

Head-to-Head on Three Benchmarks

The chart shows all major methods across the GAICD, FLMS, and FCDB datasets.
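Since IoU is one of the headline metrics, here is a minimal sketch of intersection-over-union for two crop rectangles. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` corner format with `x1 < x2` and `y1 < y2`; the benchmark's exact evaluation protocol is not described in this article.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

An IoU of 1.0 means the predicted crop matches the reference exactly, and 0.0 means the two do not overlap at all.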

User Study: Human Preference

Each bar shows the preference rate when participants compared CROP against the baseline method.

Why this matters

The user study confirms what metrics can only approximate: CROP's crops are visibly better, not just metrically better. A 79.2% preference rate means that nearly 4 out of 5 times, human viewers chose CROP over the previous state of the art. The remaining limitation is computational cost, since the 7B model requires a dedicated GPU, but this bottleneck should shrink as lightweight VLMs improve.


"The proposed approach fully exploits the VLM's capability for aesthetic understanding and reasoning — deconstructing a complex and subjective aesthetic problem into an analysis-proposal-decision process."

86.2% ACC1/5 on GAICD

0.871 IoU on FLMS

79.2% User preference over GAIC

Read the paper

Zhitong Dong, Chao Li, Jie Yu, Hao Chen. "CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference." arXiv:2605.12545, May 2026.
arxiv.org/abs/2605.12545
