An Interactive Reading of

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

Hoang-Loc La, Truong-Thanh Le, Amir Taherkordi & Phuong Hoai Ha
UiT The Arctic University of Norway · University of Oslo · June 2026 · arXiv:2606.07819
The paper, in plain English

Deploying a large language model on a phone or a single GPU is like trying to park a freight train in a compact spot. The model may be powerful, but its memory footprint and computational demands make it impractical for resource-constrained hardware. The two standard compression tricks — pruning (removing redundant parameters) and quantization (using fewer bits per number) — are usually applied in isolation, and each is typically optimised layer-by-layer. That means no one is watching how errors pile up as data flows through the whole network.

This paper introduces TOGA (Train Once, Get All), a framework that treats pruning and quantization as a single, joint optimisation problem. A small neural network called a hypernetwork learns which channels to keep and what precision to assign to each, not by looking at each layer in isolation but by directly minimising the model's actual language-modelling loss. The result is a compressed model that is both smaller and more accurate than what you get from the conventional sequential pipeline.

At ultra-low precisions (1–3 bits), TOGA reduces perplexity by up to 21% compared to the best existing methods on WikiText-2. In mainstream mixed-precision settings (4-bit/8-bit), it delivers up to 2× faster prefill and a 6.5× reduction in peak memory during decoding relative to FP16, and up to 30% faster inference than state-of-the-art semi-structured pruning methods, while also improving accuracy on reasoning benchmarks.

I

Binary-Mask Quantization

A single binary mask per layer decides which weight channels get high precision and which get low precision. The mask is learned, not hand-crafted.

II

Joint Pruning + Quantization

Structural pruning and mixed-precision quantization are optimised simultaneously in a unified search space, allowing the two to adapt to each other.

III

End-to-End Hypernetwork

A hypernetwork trained on the global language-modelling loss discovers optimal pruning and quantization policies, avoiding greedy layer-wise shortcuts.

Chapter 1

The Compression Problem

Large language models are memory-hungry. A 7-billion-parameter model in FP16 needs 14 GB just to load its weights. Quantization and pruning are the two levers — but applying them independently leaves accuracy on the table.
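
The 14 GB figure is just parameter count times bytes per parameter. Here is a minimal sketch of that arithmetic. The helper `weight_memory_gb` is hypothetical, 1 GB is taken as 10^9 bytes, and the estimate covers weights only: it ignores activations, the KV cache, and quantization metadata such as scale factors.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes); ignores scales and activations."""
    return num_params * bits_per_param / 8 / 1e9


# A 7-billion-parameter model at three storage precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(7e9, bits):.1f} GB")
```

Halving the bit-width halves the weight footprint, which is why quantization is such an attractive lever before any pruning is applied.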

Think of an LLM as a vast warehouse of knowledge. Each ‘parameter’ is a dial that has been carefully tuned during training. Quantization is like rounding every dial to the nearest notch on a coarser scale — you save space, but you lose fine-grained control. Pruning is like ripping out entire aisles of the warehouse that seem underused.

The problem is that most existing methods tune the dials aisle by aisle, ignoring how the whole warehouse works together. A small error in aisle 3 might not matter much, but by the time you get to aisle 30, those small errors have compounded into something that derails the entire operation. TOGA’s insight is to step back and look at the whole warehouse at once.

Drag the sliders below to see how uniform quantization degrades a model’s output quality — and how even a small fraction of high-precision channels can rescue it.

Weight quantization (uniform)
$$Q(W, b) = \text{clamp}\!\left(\left\lfloor \frac{W}{s} \right\rceil, -2^{b-1}, 2^{b-1}-1\right) \cdot s$$

where $s$ is the per-channel scale factor and $b$ is the bit-width. Lower $b$ means coarser approximations.
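
To make the formula concrete, here is a small sketch of $Q(W, b)$ applied to one channel. The formula does not say how $s$ is chosen, so the absmax rule below (largest weight maps to the top integer level) is an assumption, as are the function name and the toy weights.

```python
def quantize_channel(weights, bits):
    """Uniform symmetric quantization of one channel: clamp(round(w / s)) * s."""
    if bits < 2:
        raise ValueError("this sketch needs bits >= 2 so the positive range is non-empty")
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    # Assumption: absmax scaling, so the largest-magnitude weight maps to q_max.
    scale = max_abs / q_max if max_abs > 0 else 1.0
    # Python's round() breaks exact ties to the nearest even integer.
    return [min(max(round(w / scale), q_min), q_max) * scale for w in weights]


channel = [0.9, -0.5, 0.1, 0.0]
for bits in (8, 4, 2):
    dequantized = quantize_channel(channel, bits)
    worst = max(abs(w - q) for w, q in zip(channel, dequantized))
    print(f"b={bits}: max reconstruction error = {worst:.4f}")
```

On this toy channel the worst-case error grows as $b$ shrinks, since the gap between neighbouring grid points is $s$ and $s$ gets larger when fewer levels are available.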

Quantization error is not uniform across layers. The first and last layers of a transformer are far more sensitive than the middle ones — a fact that TOGA exploits through adaptive bit-width allocation.

Chapter 2

Mixed-Precision Quantization

Not all weights are created equal. Some channels carry critical information; others tolerate aggressive compression. The key question: how do you decide which is which — without inspecting every layer in isolation?

Imagine packing a suitcase for a trip. Your passport and wallet need a dedicated, secure pocket (high precision). Your socks can be stuffed into any remaining crevice (low precision). The art is in knowing which items go where — and that allocation differs depending on whether you’re heading to a business meeting or a camping trip.

TOGA replaces the hand-crafted packing rules with a learned strategy. A binary mask M marks each channel as salient (1) or non-salient (0). Salient channels get quantized to higher precision (e.g., 8-bit), while the rest get compressed more aggressively (e.g., 4-bit). The mask itself is learned by a hypernetwork optimising the end-to-end loss.

Use the sliders below to see how the fraction of salient channels affects reconstruction quality — and how the optimal fraction varies from layer to layer.

Equation 8 — Mixed-precision quantization
$$F_{\text{quant}}(W, M) = Q_h(W) \odot M + Q_l(W) \odot (1 - M)$$

where $Q_h$ is the high-precision quantizer (e.g., INT8), $Q_l$ is the low-precision quantizer (e.g., INT4), and $\odot$ denotes element-wise multiplication. The mask $M$ determines which channels receive which precision.
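
Equation 8 can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: each row of `W` is treated as one channel, the quantizer uses an assumed absmax scale, and all names and values are made up for the example.

```python
def quantize_channel(weights, bits):
    """Uniform symmetric quantizer with an assumed absmax scale (bits >= 2)."""
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    return [min(max(round(w / scale), q_min), q_max) * scale for w in weights]


def mixed_precision_quantize(W, M, high_bits=8, low_bits=4):
    """F_quant(W, M) = Q_h(W) * M + Q_l(W) * (1 - M), one mask bit per channel."""
    out = []
    for channel, m in zip(W, M):
        q_high = quantize_channel(channel, high_bits)
        q_low = quantize_channel(channel, low_bits)
        # The mask bit selects the high- or low-precision version of the channel.
        out.append([m * h + (1 - m) * l for h, l in zip(q_high, q_low)])
    return out


W = [[0.9, -0.5, 0.1],    # channel 0: marked salient
     [0.3, 0.2, -0.7]]    # channel 1: non-salient
M = [1, 0]
for row in mixed_precision_quantize(W, M):
    print([round(v, 4) for v in row])
```

Because $M$ is binary, the sum never blends the two quantizers: each channel ends up entirely in INT8 or entirely in INT4.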

Prior methods like Atom fix the salient fraction at a uniform percentage across all layers. TOGA’s learned masks allocate more high-precision channels to early and late layers — exactly where the model is most sensitive — and fewer to the redundant middle.
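
A quick way to see why the allocation matters: the average bit-width of a layer with salient fraction $f$ is $f \cdot b_h + (1 - f) \cdot b_l$. The sketch below compares a uniform allocation with an end-heavy one on a 4-layer toy model. The fractions are invented for illustration (they are not from the paper), and all layers are assumed to be the same size.

```python
def average_bits(salient_fraction, high_bits=8, low_bits=4):
    """Mean bits per weight when a fraction of channels uses high precision."""
    return salient_fraction * high_bits + (1 - salient_fraction) * low_bits


# Illustrative salient fractions for a 4-layer toy model (not from the paper).
uniform = [0.25, 0.25, 0.25, 0.25]
adaptive = [0.40, 0.10, 0.10, 0.40]    # more high precision at the two ends

for name, fractions in (("uniform", uniform), ("adaptive", adaptive)):
    per_layer = [average_bits(f) for f in fractions]
    mean_bits = sum(per_layer) / len(per_layer)
    print(name, [round(b, 2) for b in per_layer], "mean:", round(mean_bits, 2))
```

Both allocations average 5 bits per weight, so they cost the same memory. The only difference is where the high-precision channels are spent.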

Chapter 3

Structural Pruning

Pruning removes entire channels from the network, producing smaller, denser matrices that run fast on real hardware. But which channels can you safely cut?

Think of a symphony orchestra. Every musician plays a part, but if you listen carefully, some sections carry the melody while others fill in background texture. Pruning is like sending home the sections you can afford to lose — but you’d better be right about which ones are expendable.

Structural pruning uses binary masks on the input and output dimensions of each weight matrix. A mask value of 0 removes that entire channel; 1 keeps it. The pruning mask is not a heuristic — it is learned end-to-end by the same hypernetwork that learns the quantization masks.

Drag the sparsity slider below to see how removing channels affects the effective model size and the output quality.

Equation 3 — Structured pruning
$$F_{\text{prune}}(W, P_{\text{in}}, P_{\text{out}}) = \text{diag}(P_{\text{in}})^\top \, W \, \text{diag}(P_{\text{out}}) = P_{\text{in}}^\top \, W \, P_{\text{out}}$$

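where $P_{\text{in}} \in \{0,1\}^{d_{\text{in}}}$ and $P_{\text{out}} \in \{0,1\}^{d_{\text{out}}}$ are binary masks for the input and output dimensions of $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$. The shorthand $P_{\text{in}}^\top W P_{\text{out}}$ on the right stands for the diagonal-matrix product on the left, not a vector-matrix-vector product. Rows and columns whose mask entry is 0 are removed entirely, yielding a smaller dense matrix.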
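
In code, multiplying by the diagonal masks zeroes rows and columns, and dropping those zeroed rows and columns is what produces the smaller dense matrix. A minimal sketch, assuming rows of `W` index input channels and columns index output channels (the function name and toy values are hypothetical):

```python
def structural_prune(W, p_in, p_out):
    """Keep row i if p_in[i] == 1 and column j if p_out[j] == 1; drop the rest."""
    return [
        [w for w, keep_col in zip(row, p_out) if keep_col]
        for row, keep_row in zip(W, p_in) if keep_row
    ]


W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0]]     # d_in = 3, d_out = 4
p_in = [1, 0, 1]                   # drop input channel 1
p_out = [1, 1, 0, 1]               # drop output channel 2

pruned = structural_prune(W, p_in, p_out)
print(pruned)                              # [[1.0, 2.0, 4.0], [9.0, 10.0, 12.0]]
print(len(pruned), "x", len(pruned[0]))    # 2 x 3
```

The output is an ordinary 2 × 3 matrix with no zeros to skip, so a standard dense matrix multiply can use it directly.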

Unlike unstructured pruning (which zeros individual weights and produces sparse matrices that are slow on GPUs), structural pruning removes entire rows and columns. The result is a genuinely smaller model that runs faster on standard hardware.

Chapter 4

The Joint Framework

Pruning and quantization are not independent operations. The order matters, and so does the interaction between the two. TOGA combines them in a single, unified search space.

Consider renovating a house. You could first tear down all the walls you don’t need (pruning), then repaint what remains (quantization). Or you could repaint first, then tear down walls. The final result — and the mess you make along the way — depends on the order.

TOGA uses the prune-then-quantize order, which prior work shows consistently yields lower perplexity. The pruned weight matrix is then quantized using the binary-mask scheme. Crucially, both the pruning masks and the quantization masks are optimised jointly by a single hypernetwork.

Explore the interactive below to see how different combinations of sparsity and bit-widths trade off against compression ratio and perplexity.

Equation 10 — Joint pruning + quantization
$$F_{\text{quant}}(P_{\text{in}}^\top W P_{\text{out}}, M) = Q_h(P_{\text{in}}^\top W P_{\text{out}}) \odot M + Q_l(P_{\text{in}}^\top W P_{\text{out}}) \odot (1 - M)$$

The weight matrix is first pruned by input/output masks $P_{\text{in}}$, $P_{\text{out}}$, then the surviving weights are quantized with the mixed-precision mask $M$.

At a compression ratio of ~0.103, TOGA can flexibly choose between 45% sparsity with W3A3 precision or 59% sparsity with W4A4 — the same budget, very different architectures. Joint optimisation finds the best trade-off automatically.
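
One reading that reproduces the ~0.103 figure is: compression ratio = (1 − sparsity) × weight bits / 16, the fraction of weights kept times their bit-width relative to FP16. This formula is an assumption inferred from the two configurations quoted above, not a definition taken from the paper, and it ignores activations and the overhead of masks and scale factors.

```python
def compression_ratio(sparsity, weight_bits, baseline_bits=16):
    """Assumed ratio: surviving weights times their bit-width, relative to FP16."""
    return (1 - sparsity) * weight_bits / baseline_bits


for sparsity, bits in ((0.45, 3), (0.59, 4)):
    ratio = compression_ratio(sparsity, bits)
    print(f"sparsity={sparsity:.0%}, {bits}-bit weights: ratio = {ratio:.4f}")
```

Under that assumption both configurations land at roughly 0.103, which is the sense in which they share a budget while differing in shape.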

Chapter 5

Hypernetwork Search

The binary masks that govern pruning and quantization are too numerous to search exhaustively. A hypernetwork learns to produce them, guided by the end-to-end language-modelling loss and a budget regulariser.

Imagine a conductor who has never seen the full score but can instruct each section of the orchestra by reading a simplified lead sheet. The hypernetwork plays this conductor role: it takes in the global state of the model and outputs binary decisions for every layer — prune or keep, high-precision or low-precision.

To keep the compressed model within a target memory budget, the training loss includes a regularisation term that penalises deviations from the desired budget. The hypernetwork is trained with Gumbel-Softmax and Straight-Through Estimators to produce discrete binary masks while remaining differentiable.
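
The Gumbel-Softmax plus straight-through trick can be sketched without an autograd library. The version below uses the two-class (binary) form, where the difference of two Gumbel samples acts as logistic noise on a single logit. The paper's exact parameterisation is not given here, so the function names, temperature, and logits are all assumptions.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)    # keep both logs finite
    return -math.log(-math.log(u))


def relaxed_binary_mask(logits, tau, rng):
    """Return (soft, hard, surrogate_grad) lists, one entry per mask logit."""
    soft, hard, grad = [], [], []
    for logit in logits:
        # Difference of two Gumbel samples is logistic noise (two-class case).
        noise = sample_gumbel(rng) - sample_gumbel(rng)
        y = 1.0 / (1.0 + math.exp(-(logit + noise) / tau))
        soft.append(y)
        hard.append(1 if y > 0.5 else 0)     # forward pass: discrete mask bit
        grad.append(y * (1.0 - y) / tau)     # backward pass: d(soft)/d(logit)
    return soft, hard, grad


rng = random.Random(0)
soft, hard, grad = relaxed_binary_mask([4.0, -4.0, 0.0], tau=0.5, rng=rng)
print("hard mask:", hard)
print("surrogate gradients:", [round(g, 4) for g in grad])
```

The forward pass sees only the 0/1 values in `hard`, while the gradient that flows back to each logit is the derivative of the smooth `soft` value. In an autograd framework the same idea is usually written as `hard + soft - stop_gradient(soft)`.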

Adjust the target budget below and watch how the regularisation penalty changes — and how the model balances accuracy against compression.

Equation 7 — Training objective
$$\min_\theta \; \mathcal{L}_{\text{CE}}(X, W, S) + \lambda \, R(b, B(S))$$
Equation 6 — Budget regularisation
$$R(b, B(S)) = \log\!\left(\frac{\max(b, \, B(S))}{\min(b, \, B(S))}\right)$$

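where $\theta$ denotes the hypernetwork's parameters, $\mathcal{L}_{\text{CE}}$ is the cross-entropy language-modelling loss on inputs $X$ with weights $W$ under configuration $S$, $B(S)$ is the expected budget induced by $S$ (e.g., effective memory footprint), $b$ is the target budget (not the bit-width $b$ of the uniform quantizer in Chapter 1), and $\lambda$ controls the regularisation strength.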
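
Equations 6 and 7 translate almost directly into code. In this sketch the cross-entropy value, the budgets, and $\lambda$ are placeholder numbers chosen for illustration, and budgets are assumed to be strictly positive.

```python
import math


def budget_regulariser(target_budget, expected_budget):
    """R(b, B(S)) = log(max(b, B(S)) / min(b, B(S))); both budgets must be > 0."""
    larger = max(target_budget, expected_budget)
    smaller = min(target_budget, expected_budget)
    return math.log(larger / smaller)


def training_objective(ce_loss, target_budget, expected_budget, lam):
    """Equation 7: cross-entropy plus the weighted budget penalty."""
    return ce_loss + lam * budget_regulariser(target_budget, expected_budget)


target = 0.5                                   # hypothetical target budget b
for expected in (0.25, 0.4, 0.5, 0.6, 1.0):    # hypothetical values of B(S)
    penalty = budget_regulariser(target, expected)
    total = training_objective(2.0, target, expected, lam=1.0)
    print(f"B(S)={expected:.2f}: R={penalty:.3f}, objective={total:.3f}")
```

The penalty is zero only when $B(S)$ equals $b$ and otherwise grows with the log of the ratio between the two, so $\lambda$ sets how much language-modelling loss the search will trade for staying on budget.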

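The budget regulariser $R$ equals $|\log B(S) - \log b|$, so it is symmetric in the ratio between the two budgets: overshooting the target by a factor of two costs exactly as much as undershooting it by a factor of two. It vanishes only when $B(S) = b$, which pulls the hypernetwork toward configurations that sit right at the target budget instead of exceeding it or leaving capacity unused.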

Chapter 6

Results at Ultra-Low Bits

At 1–3 bits, most quantization methods destroy model quality. TOGA’s global, adaptive approach holds up where others collapse.

Think of ultra-low-bit quantization like trying to play a concerto on a toy piano. Most methods can only approximate a handful of notes, and the result sounds like noise. TOGA is like having a skilled musician who knows exactly which notes matter most and plays those perfectly, using the toy piano’s limited keys for everything else.

On Llama-2-7B quantized to an average of 3.2 bits, TOGA achieves a perplexity of 7.30 on WikiText-2 — compared to ResQ’s 7.35, Atom’s 12.12, and SpinQuant’s 438 (complete collapse). The advantage comes from adaptive salient-channel allocation guided by the global loss.

The charts below show head-to-head perplexity comparisons from the paper’s Table 1 across multiple models and precision formats.

SpinQuant, a strong uniform-precision baseline, produces total perplexity collapse at 3 bits (438 on Llama-2-7B). Mixed-precision is not optional at ultra-low bits — it is the difference between a working model and numerical garbage.

Chapter 7

Real-World Performance

Compression is meaningless if the resulting model is slower in practice. TOGA’s structured approach produces dense matrices that actually run faster on real hardware.

It is one thing to compress a model on paper; it is another to make it run faster on a GPU. Unstructured pruning produces sparse matrices that most GPU kernels cannot exploit. Semi-structured pruning (NVIDIA’s 2:4 pattern) helps, but is inflexible — you are locked to exactly 50% sparsity.
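
To see why 2:4 is locked to 50%, here is a sketch that enforces the pattern on one row: in every group of four consecutive weights, exactly two are zeroed. Picking the two smallest magnitudes is one common heuristic and is used here only for illustration; it is not necessarily how the baselines in the paper choose which weights to drop.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude weights in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.4, 0.05, -0.3, 0.8, 0.2, -0.6]
sparse = prune_2_of_4(row)
print(sparse)                             # [0.9, 0.0, 0.4, 0.0, 0.0, 0.8, 0.0, -0.6]
print(sparse.count(0.0) / len(sparse))    # 0.5
```

The pattern fixes how many weights survive per group, so there is no knob for 30% or 60% sparsity. Structural pruning, by contrast, can remove any number of whole channels.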

TOGA’s structural pruning produces genuinely smaller, dense matrices. Combined with custom CUDA kernels for mixed-precision GEMM operations, the result is a model that is not just smaller but actually faster. On an NVIDIA L40 GPU, TOGA achieves up to 2× prefill speedup and 6.5× peak memory reduction versus FP16.

Explore the performance charts below to see how prefill latency and peak memory scale with batch size.

The FP16 baseline runs out of memory at batch size 12. TOGA handles batch size 16 comfortably — with 6.5× less peak memory. That is the difference between running inference and getting an OOM error.

Chapter 8

Lessons & Limits

TOGA’s adaptive allocation reveals a clear pattern: early and late layers matter most. But the framework has real limitations, particularly for very large models.

If you map which channels TOGA marks as ‘salient’ across all layers of Llama-2-7B, a striking pattern emerges. The first 16 transformer blocks and the final few blocks receive the most high-precision channels. The middle layers — roughly layers 17 to 28 — get far fewer. This aligns with independent research showing that removing early or late layers causes far more damage than removing middle ones.

This is not something a fixed-threshold method like Atom or ResQ can capture, because they apply the same salient fraction to every layer. TOGA’s learned, adaptive allocation is doing something qualitatively different — it is discovering layer importance from the data.

The main limitation: the hypernetwork must load the entire LLM into GPU memory during training. With 80 GB of VRAM, the current implementation caps out at ~32B-parameter models. Scaling to 70B+ will require distributed training or offloading techniques.

TOGA’s salient-channel distribution is an emergent property of the optimisation — no one told the hypernetwork to protect early layers. It discovered this known principle of transformer architecture purely from minimising the end-to-end loss.

TOGA demonstrates that joint, globally-optimised compression can outperform the sequential, layer-wise pipelines that dominate current practice on the models and benchmarks reported here. The framework is a proof of concept that the same hypernetwork paradigm that works for structured pruning can be extended to mixed-precision quantization, and that the two techniques are more powerful together than either is alone.