The "quality control" for AI: Why we should compress large language models as easily as images

How language models can be flexibly downsized without retraining, and why a simple quality slider brings efficiency and control to AI.
by Patrick Putzky

User-friendly solutions already exist for reducing the storage requirements of images and videos. The principle is well known: if the file is too large for an email attachment or smartphone storage, we slightly reduce the quality. A simple slider allows us to find the optimal balance between file size and visual quality that meets our requirements in a matter of seconds.

But when you move from the world of media to modern AI models, this ease suddenly disappears. Large language models (LLMs) often require enormous amounts of storage space. Until now, making them smaller and more efficient has not been a flexible process but a technical one-way street.

Our research team at Merantix Momentum asked itself: Why can't we compress AI models as intuitively as image files?

The answer to this question is our new method, ACIP (Any Compression via Iterative Pruning), which we recently published in the journal Transactions on Machine Learning Research.

The problem: compression as a one-way street

Until now, downsizing an AI model has been more like rigid manufacturing than flexible adaptation. Developers have to decide in advance: "I want to reduce the model to exactly 50% of its size." This triggers a complex calculation process.

If you ultimately determine that the model has lost too much accuracy and that 60% of the size would have been ideal, there is no way back. The entire process must start over from the beginning. There is no way to dynamically find the "sweet spot" between memory requirements and performance.

The solution: calculate once, choose freely

With ACIP, we are fundamentally changing this dynamic. Our method decouples the complex analysis from the actual selection of the model size.

You can think of it as a detailed map that our algorithm creates once. We call this a "score map." This map records which parameters in the neural network are crucial for the model's knowledge and which are negligible.

Once this map has been calculated, we give control back to the users. We are essentially building the slider familiar from image editing for language models. What makes this special is that the model does not need to be retrained. The adjustment to the desired size happens almost immediately.
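
To make this concrete, here is a minimal sketch in PyTorch. It is not our ACIP implementation: the score here is simply the weight magnitude, and the helper names are invented for illustration. What it captures is the decoupling itself, with the score map computed once and any target size reached afterwards by a cheap thresholding step.

```python
# Minimal sketch of "calculate once, choose freely".
# NOT the published ACIP algorithm: ACIP derives its scores via iterative
# pruning; here, plain weight magnitude stands in as the importance score.
import copy

import torch
import torch.nn as nn

def compute_score_map(model: nn.Module) -> dict[str, torch.Tensor]:
    """The expensive step, run exactly once per model."""
    return {name: p.detach().abs() for name, p in model.named_parameters()}

def compress(model: nn.Module, scores: dict[str, torch.Tensor],
             keep_ratio: float) -> nn.Module:
    """The 'slider': keep only the top-scoring fraction of weights.
    No retraining and no recomputation of the score map."""
    pruned = copy.deepcopy(model)
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(flat.numel() * keep_ratio))
    threshold = flat.kthvalue(flat.numel() - k + 1).values  # k-th largest
    with torch.no_grad():
        for name, param in pruned.named_parameters():
            param.mul_((scores[name] >= threshold).to(param.dtype))
    return pruned

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
scores = compute_score_map(model)                 # computed once ...
half = compress(model, scores, keep_ratio=0.5)    # ... then move the slider
sixty = compress(model, scores, keep_ratio=0.6)   # to any size, instantly
```

In this toy, weights are merely zeroed out, which saves no memory by itself; the full method removes whole structural components, so the model genuinely shrinks.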

How does it work?

We use mathematical methods to evaluate the model's structure globally, iteratively identifying those connections within the architecture that are least relevant to the overall result.

Unlike previous methods, which start a separate calculation for each target size, we sort all connections in the model according to their importance. The result is a model that knows exactly what information it can do without when storage space is limited, much like experienced editors who shorten a text without losing the core of the message.
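
The toy example below, again purely illustrative rather than our actual implementation, shows why a single iterative pass suffices. Connections are removed one at a time, weakest first, and the order in which they fall accumulates into a global ranking; any target size afterwards is just a prefix of that list.

```python
# Toy illustration of iterative pruning producing one global ranking.
# We rank the entries of a single small matrix; in the real method, the
# scores would be updated as the model shrinks, so the loop would not
# reduce to a simple sort as it does here.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # stand-in for one layer's connections

remaining = {(i, j) for i in range(8) for j in range(8)}
removal_order = []                   # filled from least to most important

while remaining:
    # Score all remaining connections and remove the weakest one.
    weakest = min(remaining, key=lambda ij: abs(W[ij]))
    removal_order.append(weakest)
    remaining.remove(weakest)

def mask_for(keep_ratio: float) -> np.ndarray:
    """Any target size is a prefix of the same ranking; no new pass needed."""
    mask = np.ones_like(W)
    for ij in removal_order[: int(W.size * (1.0 - keep_ratio))]:
        mask[ij] = 0.0
    return mask

print((mask_for(0.5) != 0).mean())   # ~0.5: half the connections kept
print((mask_for(0.6) != 0).mean())   # ~0.6: same ranking, new slider value
```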

Why is this important?

This flexibility is a crucial step toward making the use of LLMs more efficient.

  1. Storage efficiency: Companies no longer need to maintain dozens of versions of a model for different hardware environments. A single run is sufficient to derive any size.
  2. Adaptability: Users can decide for themselves how much memory they want to allocate, without any technical hurdles.
  3. Predictability: Developers can immediately see the compression level at which a model's performance significantly declines, rather than being left in the dark (see the sketch after this list).
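
As a small illustration of this predictability, with magnitude scores and reconstruction error standing in for the real quantities, the same precomputed ranking can be swept across compression levels to trace out the size-versus-quality curve in a single loop:

```python
# Hypothetical sketch: sweep the slider over one precomputed ranking and
# watch quality degrade gradually. The "model" is a random linear map and
# its reconstruction error stands in for language-model quality.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(128, 128))           # the "model"
X = rng.normal(size=(128, 256))           # the "evaluation data"
y_ref = W @ X

order = np.argsort(np.abs(W), axis=None)  # score map: computed once

for keep in (0.9, 0.7, 0.5, 0.3, 0.1):
    w = W.copy().ravel()
    w[order[: int(W.size * (1.0 - keep))]] = 0.0  # drop the weakest prefix
    err = np.linalg.norm(w.reshape(W.shape) @ X - y_ref) / np.linalg.norm(y_ref)
    print(f"keep {keep:4.0%}  relative error {err:.3f}")
```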

We have successfully applied ACIP to well-known open models such as LLaMA, Mistral, and Qwen. The results show that we can massively reduce the models' memory requirements, with a smooth and predictable performance curve.
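
For readers who want to experiment themselves: the snippet below sketches what loading and resizing an ACIP model could look like through the Hugging Face transformers library. The repository name and the pruning method shown here are illustrative assumptions, not guaranteed identifiers, so please consult the ACIP project page for the authoritative usage.

```python
# Sketch of loading an ACIP model from the Hugging Face Hub and moving the
# size slider. Model id and method name are illustrative assumptions; see
# the ACIP project page for the exact identifiers.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "merantix-momentum/acip_llama2_7b",  # hypothetical model id
    trust_remote_code=True,              # loads the custom pruning code
)

# Compress to 60% of the original size: no retraining involved.
model.prune_model_by_score(size_ratio=0.6)  # hypothetical method name

# Changed your mind? The score map is retained, so just move the slider.
model.prune_model_by_score(size_ratio=0.5)
```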

Learn more

We believe that efficiency is the key to making generative AI widely usable. ACIP is our contribution to making these tools more manageable and resource-efficient.
