Every smartphone owner has taken a photo that looked better in the viewfinder than it did in the gallery. The problem isn't the sensor: it's the composition. Professional photographers don't just point and shoot; they scan the scene for leading lines, rule-of-thirds intersections, and vanishing points before they press the shutter. Existing automatic cropping tools miss this: they follow rigid hand-crafted rules, chase the brightest salient region, or copy the crop from a vaguely similar stock photo.
CROP treats cropping as a reasoning problem. It feeds the image to a vision-language model and asks it to think like a photographer: first identify the compositional elements (leading lines, symmetry axes, subject placement), then propose candidate crops guided by those elements, and finally pick the best one. A second training stage, Direct Preference Optimization, teaches the model which crops human experts actually prefer — not just which one is "correct," but which one is more beautiful.
The result: 86.2% top-1 accuracy on the GAICD benchmark (vs. 85.4% for the next-best method), an IoU of 0.871 on FLMS, and a 79.2% preference rate over the previous best in a user study — all from a 7-billion-parameter model fine-tuned on a single RTX 4090.
Three decades of automatic cropping have produced three paradigms — and each fails for a different reason. Understanding those failures is the first step to building something better.
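The paper identifies three dominant paradigms in automatic image cropping, each with distinct limitations (shown in their Figure 1): methods built on rigid hand-crafted rules, methods that crop around the most salient region, and methods that borrow the crop from a similar reference image.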
Chart: representative methods from each paradigm, compared.
Before CROP can reason about composition, it needs a vocabulary. The paper defines ten compositional elements that photographers use to create well-balanced images.
Equation (1) defines the composition analysis stage. The visual encoder $E_{\text{vis}}$ extracts features from the original image $I_{\text{ori}}$, and the VLM processes these features with a composition prompt $P_{\text{comp}}$ to produce a set of detected elements:
$T_{\text{comp}} = \{(e_k, b_k)\}_{k=1}^{K}$, where each element consists of a categorical label $e_k$ and positional coordinates $b_k$.
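Read as a single expression, the description above suggests the following form for Equation (1). The operator name $\mathrm{VLM}(\cdot)$ is an assumption made here for illustration; the paper's exact notation may differ.

$$
T_{\text{comp}} = \mathrm{VLM}\big(E_{\text{vis}}(I_{\text{ori}}),\; P_{\text{comp}}\big) = \{(e_k, b_k)\}_{k=1}^{K}
$$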
The ten elements are: rule of thirds, center, golden ratio, horizontal, symmetric, diagonal, curved, vertical, triangle, and vanishing point — following the classification in the CADB dataset.
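To make the vocabulary concrete, here is a minimal Python sketch of how the set $T_{\text{comp}}$ could be represented. The class names, the `parse_detection` helper, and the coordinate layout are all hypothetical; only the ten labels come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence, Tuple


class Element(Enum):
    """The ten compositional element labels listed above."""
    RULE_OF_THIRDS = "rule of thirds"
    CENTER = "center"
    GOLDEN_RATIO = "golden ratio"
    HORIZONTAL = "horizontal"
    SYMMETRIC = "symmetric"
    DIAGONAL = "diagonal"
    CURVED = "curved"
    VERTICAL = "vertical"
    TRIANGLE = "triangle"
    VANISHING_POINT = "vanishing point"


@dataclass(frozen=True)
class DetectedElement:
    label: Element              # categorical label e_k
    coords: Tuple[float, ...]   # positional coordinates b_k (layout is an assumption)


def parse_detection(label: str, coords: Sequence[float]) -> DetectedElement:
    """Validate one raw (label, coords) pair; raises ValueError on an unknown label."""
    return DetectedElement(Element(label.strip().lower()),
                           tuple(float(c) for c in coords))


# A toy T_comp with K = 2 detected elements (values are made up).
t_comp = [
    parse_detection("Vanishing point", [0.52, 0.41]),
    parse_detection("diagonal", [0.0, 1.0, 0.52, 0.41]),
]
```

Using an enum means a label outside the ten-element vocabulary is rejected at parse time rather than flowing silently into later stages.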
The Compositional Reasoning Pipeline breaks cropping into three stages: analyze the scene, propose candidate crops, and make a final aesthetic decision.
The pipeline consists of four operations connected in sequence: composition analysis, visual enhancement, crop proposal, and aesthetic decision.
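One way to picture the chain is as plain function composition. The sketch below is only an illustration: the function name, the four callables, the crop-box format, and the exact ordering of the visual enhancement step are assumptions, not the paper's implementation.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); assumed crop format


def crop_pipeline(image, analyze: Callable, enhance: Callable,
                  propose: Callable, decide: Callable) -> Box:
    """Chain four operations; every callable is a hypothetical stand-in."""
    elements = analyze(image)                 # composition analysis -> T_comp
    enhanced = enhance(image, elements)       # visual enhancement overlay
    candidates = propose(enhanced, elements)  # candidate crops
    if not candidates:
        raise ValueError("proposal stage returned no candidate crops")
    return decide(enhanced, candidates)       # final aesthetic decision


# Toy stand-ins so the sketch runs end to end.
result = crop_pipeline(
    image="photo.jpg",
    analyze=lambda img: [("center", (0.5, 0.5))],
    enhance=lambda img, els: img,
    propose=lambda img, els: [(0.0, 0.0, 1.0, 1.0), (0.1, 0.1, 0.9, 0.9)],
    decide=lambda img, cands: cands[-1],
)
```

Passing the stages in as callables keeps the sketch independent of any particular model or drawing library.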
Vision-language models understand what an image shows, but they ignore where things are. Visual enhancement forces spatial structure into the model's attention.
Equation (2) defines the visual enhancement step. The function $V(\cdot)$ overlays graphical elements derived from the composition analysis $T_{\text{comp}}$ onto the original image:
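From that description, Equation (2) plausibly takes the form below. The output symbol $I_{\text{enh}}$ is a name assumed here for the enhanced image; the text does not state which symbol the paper uses.

$$
I_{\text{enh}} = V\big(I_{\text{ori}},\; T_{\text{comp}}\big)
$$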
The paper quantifies the model's attention to different token types (Table 1). Semantic tokens $e_k$ receive normalized attention of 0.416 for "Center," while coordinate tokens $b_k$ receive only 0.103 — a 4× gap. Visual enhancement closes this gap by converting spatial coordinates into visual features the model naturally processes.
Chart: attention the VLM assigns to semantic labels vs. spatial coordinates, with and without visual enhancement.
Supervised fine-tuning teaches the model what experts chose. Direct Preference Optimization teaches it why one crop ranks above another.
The training framework has two stages: supervised fine-tuning (SFT), followed by Direct Preference Optimization (DPO).
Key training details: the base model is Qwen2.5-VL-7B, fine-tuned with LoRA ($r = 16$, $\alpha = 32$). SFT uses learning rate $1 \times 10^{-4}$; DPO uses $1 \times 10^{-5}$ with $\beta = 0.2$.
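The hyperparameters above can be collected into a small configuration sketch. The class and field names are hypothetical and are not tied to any training library; only the values come from the text. The `scaling` property uses the standard LoRA convention that the low-rank update is multiplied by $\alpha / r$.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoRASettings:
    rank: int = 16    # r
    alpha: int = 32   # alpha

    @property
    def scaling(self) -> float:
        """Standard LoRA scaling factor alpha / r applied to the low-rank update."""
        return self.alpha / self.rank


@dataclass(frozen=True)
class StageSettings:
    name: str
    learning_rate: float
    beta: Optional[float] = None  # only meaningful for the DPO stage


BASE_MODEL = "Qwen2.5-VL-7B"
LORA = LoRASettings()
STAGES = (
    StageSettings("sft", learning_rate=1e-4),
    StageSettings("dpo", learning_rate=1e-5, beta=0.2),
)
```

With $r = 16$ and $\alpha = 32$ the scaling factor is 2, and the DPO stage runs at one tenth of the SFT learning rate.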
The parameter $\beta$ controls how far the policy model is allowed to diverge from the reference model.
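To show where $\beta$ enters, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. The function and argument names are illustrative; the inputs are assumed to be summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") crop responses under the policy and the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.2) -> float:
    """Standard DPO loss -log(sigmoid(beta * margin)) for one preference pair."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp        # log-ratio, preferred
    rejected_gain = policy_rejected_logp - ref_rejected_logp  # log-ratio, dispreferred
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# The policy raised the preferred crop and lowered the dispreferred one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Because $\beta$ multiplies the log-ratio margin, a larger value makes the same shift away from the reference count for more in the loss.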
Every component matters — but some matter more than others. The ablation study and sensitivity analysis reveal where the gains come from and how to set the knobs.
Chart: metric values for each ablation configuration.
Chart: effect of each parameter on IoU on the FLMS dataset. Solid line: DPO model; dashed line: SFT-only.
Numbers on benchmarks tell one story; human eyes tell another. Both agree: CROP produces crops that look better.
Chart: all major methods compared across the GAICD, FLMS, and FCDB datasets.
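IoU (intersection over union) measures how much a predicted crop overlaps a reference crop. A minimal sketch, assuming axis-aligned boxes given as `(x1, y1, x2, y2)`; the function name is illustrative and this is not the paper's evaluation code.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2 and y1 < y2


def box_iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes offset by one unit share a 1x1 overlap: IoU = 1 / 7.
overlap = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

An IoU of 1.0 means the two crops coincide exactly, and 0.0 means they do not overlap at all.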
Chart: preference rate when participants compared CROP against each baseline method.