The "quality control" for AI: Why we should compress large language models as easily as images

How language models can be flexibly downsized without retraining, and why a simple quality slider brings efficiency and control to AI.
by Patrick Putzky

User-friendly solutions already exist for reducing the storage requirements of images and videos. The principle is well known: if the file is too large for an email attachment or smartphone storage, we slightly reduce the quality. A simple slider allows us to find the optimal balance between file size and visual quality that meets our requirements in a matter of seconds.

But when you move from the world of media to modern AI models, this ease suddenly disappears. Large language models (LLMs) often require enormous amounts of storage space. Until now, making them smaller and more efficient has not been a flexible process but a technical one-way street.

Our research team at Merantix Momentum asked itself: Why can't we compress AI models as intuitively as image files?

The answer to this question is our new method, ACIP (Any Compression via Iterative Pruning), which we recently published in the journal Transactions on Machine Learning Research.

The problem: compression as a one-way street

Until now, downsizing an AI model has been more like rigid manufacturing than flexible adaptation. Developers have to decide in advance: "I want to reduce the model to exactly 50% of its size." This triggers a complex calculation process.

If you ultimately determine that the model has lost too much accuracy and that 60% of the size would have been ideal, there is no way back. The entire process must start over from the beginning. There is no way to dynamically find the "sweet spot" between memory requirements and performance.

The solution: calculate once, choose freely

With ACIP, we are fundamentally changing this dynamic. Our method decouples the complex analysis from the actual selection of the model size.

You can think of it as a detailed map that our algorithm creates once. We call this a "score map." This map records which parameters in the neural network are crucial for the model's knowledge and which are negligible.

Once this map has been calculated, we give control back to the users. We are essentially building the slider familiar from image editing for language models. What makes this special is that the model does not need to be retrained. The adjustment to the desired size happens almost immediately.
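
To make this concrete, here is a minimal sketch in PyTorch. It is not our ACIP implementation: the score here is simply the weight magnitude, and the helper names are invented for illustration. What it captures is the decoupling itself, with the score map computed once and any target size reached afterwards by a cheap thresholding step.

```python
# Minimal sketch of "calculate once, choose freely".
# NOT the published ACIP algorithm: ACIP derives its scores via iterative
# pruning; here, plain weight magnitude stands in as the importance score.
import copy

import torch
import torch.nn as nn

def compute_score_map(model: nn.Module) -> dict[str, torch.Tensor]:
    """The expensive step, run exactly once per model."""
    return {name: p.detach().abs() for name, p in model.named_parameters()}

def compress(model: nn.Module, scores: dict[str, torch.Tensor],
             keep_ratio: float) -> nn.Module:
    """The 'slider': keep only the top-scoring fraction of weights.
    No retraining and no recomputation of the score map."""
    pruned = copy.deepcopy(model)
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(flat.numel() * keep_ratio))
    threshold = flat.kthvalue(flat.numel() - k + 1).values  # k-th largest
    with torch.no_grad():
        for name, param in pruned.named_parameters():
            param.mul_((scores[name] >= threshold).to(param.dtype))
    return pruned

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
scores = compute_score_map(model)                 # computed once ...
half = compress(model, scores, keep_ratio=0.5)    # ... then move the slider
sixty = compress(model, scores, keep_ratio=0.6)   # to any size, instantly
```

In this toy, weights are merely zeroed out, which saves no memory by itself; the full method removes whole structural components, so the model genuinely shrinks.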

How does it work?

We use mathematical methods to evaluate the model's structure globally, iteratively identifying those connections within the architecture that are least relevant to the overall result.

Unlike previous methods, which start a separate calculation for each target size, we sort all connections in the model according to their importance. The result is a model that knows exactly what information it can do without when storage space is limited, much like experienced editors who shorten a text without losing the core of the message.
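
The toy example below, again purely illustrative rather than our actual implementation, shows why a single iterative pass suffices. Connections are removed one at a time, weakest first, and the order in which they fall accumulates into a global ranking; any target size afterwards is just a prefix of that list.

```python
# Toy illustration of iterative pruning producing one global ranking.
# We rank the entries of a single small matrix; in the real method, the
# scores would be updated as the model shrinks, so the loop would not
# reduce to a simple sort as it does here.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # stand-in for one layer's connections

remaining = {(i, j) for i in range(8) for j in range(8)}
removal_order = []                   # filled from least to most important

while remaining:
    # Score all remaining connections and remove the weakest one.
    weakest = min(remaining, key=lambda ij: abs(W[ij]))
    removal_order.append(weakest)
    remaining.remove(weakest)

def mask_for(keep_ratio: float) -> np.ndarray:
    """Any target size is a prefix of the same ranking; no new pass needed."""
    mask = np.ones_like(W)
    for ij in removal_order[: int(W.size * (1.0 - keep_ratio))]:
        mask[ij] = 0.0
    return mask

print((mask_for(0.5) != 0).mean())   # ~0.5: half the connections kept
print((mask_for(0.6) != 0).mean())   # ~0.6: same ranking, new slider value
```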

Why is this important?

This flexibility is a crucial step toward making the use of LLMs more efficient.

  1. Storage efficiency: Companies no longer need to maintain dozens of versions of a model for different hardware environments. A single run is sufficient to derive any size.
  2. Adaptability: Users can decide for themselves how much memory they want to allocate, without any technical hurdles.
  3. Predictability: Developers can immediately see the compression level at which a model's performance significantly declines, rather than being left in the dark (see the sketch after this list).
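
As a small illustration of this predictability, with magnitude scores and reconstruction error standing in for the real quantities, the same precomputed ranking can be swept across compression levels to trace out the size-versus-quality curve in a single loop:

```python
# Hypothetical sketch: sweep the slider over one precomputed ranking and
# watch quality degrade gradually. The "model" is a random linear map and
# its reconstruction error stands in for language-model quality.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(128, 128))           # the "model"
X = rng.normal(size=(128, 256))           # the "evaluation data"
y_ref = W @ X

order = np.argsort(np.abs(W), axis=None)  # score map: computed once

for keep in (0.9, 0.7, 0.5, 0.3, 0.1):
    w = W.copy().ravel()
    w[order[: int(W.size * (1.0 - keep))]] = 0.0  # drop the weakest prefix
    err = np.linalg.norm(w.reshape(W.shape) @ X - y_ref) / np.linalg.norm(y_ref)
    print(f"keep {keep:4.0%}  relative error {err:.3f}")
```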

We have successfully applied ACIP to well-known open models such as LLaMA, Mistral, and Qwen. The results show that we can massively reduce the models' memory requirements, with a smooth and predictable performance curve.
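
For readers who want to experiment themselves: the snippet below sketches what loading and resizing an ACIP model could look like through the Hugging Face transformers library. The repository name and the pruning method shown here are illustrative assumptions, not guaranteed identifiers, so please consult the ACIP project page for the authoritative usage.

```python
# Sketch of loading an ACIP model from the Hugging Face Hub and moving the
# size slider. Model id and method name are illustrative assumptions; see
# the ACIP project page for the exact identifiers.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "merantix-momentum/acip_llama2_7b",  # hypothetical model id
    trust_remote_code=True,              # loads the custom pruning code
)

# Compress to 60% of the original size: no retraining involved.
model.prune_model_by_score(size_ratio=0.6)  # hypothetical method name

# Changed your mind? The score map is retained, so just move the slider.
model.prune_model_by_score(size_ratio=0.5)
```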

Learn more

We believe that efficiency is the key to making generative AI widely usable. ACIP is our contribution to making these tools more manageable and resource-efficient.
