The "quality control" for AI: Why we should compress large language models as easily as images

User-friendly solutions already exist for reducing the storage requirements of images and videos. The principle is well known: if the file is too large for an email attachment or smartphone storage, we slightly reduce the quality. A simple slider allows us to find the optimal balance between file size and visual quality that meets our requirements in a matter of seconds.
But when you move from the world of media to modern AI models, this ease suddenly disappears. Large language models (LLMs) often require enormous amounts of storage space. Making them smaller and more efficient has not been a flexible process so far, but rather a technical dead end.
Our research team at Merantix Momentum asked itself: Why can't we compress AI models as intuitively as image files?
The answer to this question is our new method ACIP (Any Compression via Iterative Pruning), which we recently published in the journal Transactions on Machine Learning Research.
The problem: compression as a one-way street
Until now, downsizing an AI model has been more like rigid manufacturing than flexible adaptation. Developers have to decide in advance: "I want to reduce the model to exactly 50% of its size." This triggers a complex calculation process.
If you ultimately determine that the model has lost too much accuracy and that 60% of the size would have been ideal, there is no way back. The entire process must start over from the beginning. There is no way to dynamically find the "sweet spot" between memory requirements and performance.
The solution: calculate once, choose freely
With ACIP, we are fundamentally changing this dynamic. Our method decouples the complex analysis from the actual selection of the model size.
You can think of it as a detailed map that our algorithm creates once. We call this a "score map." This map records which parameters in the neural network are crucial for the model's knowledge and which are negligible.
Once this map has been calculated, we give control back to the users. We are essentially building the slider familiar from image editing for language models. What makes this special is that the model does not need to be retrained. The adjustment to the desired size happens almost immediately.
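As a toy illustration of this decoupling (not the actual ACIP implementation), here is a sketch in which simple weight magnitudes stand in for ACIP's iteratively learned scores: the score map is computed once, and any target size can then be selected instantly, like moving a slider.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a few weight matrices.
model = {name: rng.normal(size=(64, 64)) for name in ("layer1", "layer2", "layer3")}

# Step 1 (done once): build a score map. Absolute weight magnitude is
# used here as a simple stand-in for ACIP's learned importance scores.
score_map = {name: np.abs(w) for name, w in model.items()}

def compress(model, score_map, keep_ratio):
    """Keep only the globally top-scoring `keep_ratio` fraction of parameters."""
    all_scores = np.concatenate([s.ravel() for s in score_map.values()])
    threshold = np.quantile(all_scores, 1.0 - keep_ratio)
    return {name: np.where(score_map[name] >= threshold, w, 0.0)
            for name, w in model.items()}

# Step 2 (instant, repeatable): choose any size without recomputing the map.
half = compress(model, score_map, keep_ratio=0.5)
tiny = compress(model, score_map, keep_ratio=0.1)
```

The expensive part happens exactly once; every subsequent call to `compress` is just a threshold and a mask, which is why no retraining is needed between sizes.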
How does it work?
We use mathematical methods to evaluate the structures of the model globally. We iteratively identify those connections within the architecture that are least relevant to the overall result.
Unlike previous methods, which start a separate calculation for each target size, we sort all connections in the model once according to their importance. The result is a model that knows exactly which information it can do without when storage space is limited, similar to an experienced editor who shortens a text without losing the core of the message.
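The contrast with per-target methods can be sketched as follows (again with magnitude as a hypothetical proxy for ACIP's importance scores): the connections are ranked once, and every target size is simply a different prefix of that ranking.

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(size=1000)   # stand-in for a model's connections

# Rank all connections once, least important first (here: by magnitude,
# a simple proxy for the learned importance scores).
ranking = np.argsort(np.abs(weights))

def prune_to(weights, ranking, target_ratio):
    """Zero out the least important connections so that only
    `target_ratio` of them survive; reuses the same global ranking."""
    n_drop = int(len(weights) * (1.0 - target_ratio))
    pruned = weights.copy()
    pruned[ranking[:n_drop]] = 0.0
    return pruned

# Any number of target sizes, no recomputation in between.
models = {ratio: prune_to(weights, ranking, ratio) for ratio in (0.7, 0.5, 0.3)}
```

A per-target method would have to rerun its full analysis for each of the three ratios; here the ranking is shared and only the cutoff moves.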
Why is this important?
This flexibility is a crucial step toward making the use of LLMs more efficient.
- Storage efficiency: Companies no longer need to maintain dozens of versions of a model for different hardware environments. A single run is sufficient to derive any size.
- Adaptability: Users can decide for themselves how much memory they want to allocate, without facing technical hurdles.
- Predictability: Developers can immediately see the compression level at which a model's performance significantly declines, rather than being left in the dark.
We have successfully applied ACIP to well-known open models such as Llama, Mistral, and Qwen. The results show that we can massively reduce the models' memory requirements while maintaining a smooth and predictable performance curve.
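To make the "predictable performance curve" concrete, here is a hedged sketch for a single linear layer, using magnitude pruning and output reconstruction error as stand-ins for ACIP's scores and real LLM benchmarks.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(128, 128))          # toy weight matrix
x = rng.normal(size=(128, 32))           # toy input batch
y_full = W @ x                           # reference output, uncompressed

scores = np.abs(W)                       # magnitude as a proxy score

# Sweep the compression "slider" and record how the output degrades.
errors = []
for keep in (0.9, 0.7, 0.5, 0.3, 0.1):
    threshold = np.quantile(scores, 1.0 - keep)
    W_pruned = np.where(scores >= threshold, W, 0.0)
    rel_err = np.linalg.norm(W_pruned @ x - y_full) / np.linalg.norm(y_full)
    errors.append(rel_err)
    print(f"keep {keep:.0%}: relative output error {rel_err:.3f}")
```

The error rises smoothly as more parameters are removed, which is the kind of curve that lets a developer read off the sweet spot directly; ACIP's published results are measured on actual LLM benchmarks rather than this toy proxy.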
Learn more
We believe that efficiency is the key to making generative AI widely usable. ACIP is our contribution to making these tools more manageable and resource-efficient.
- See the models on Hugging Face: Hugging Face Collection
- Project page: acip.merantix-momentum.com
- Paper: Transactions on Machine Learning Research