Text Models

Local text generation uses llama.cpp. GGUF is a popular file format for running LLMs locally, and it originates from the llama.cpp project.

Vision Support

Don't Forget

Want info about something specific on your screen? Press Windows+Shift+S to capture the area of interest, then paste the image into your widget (Ctrl+V on the active widget window).

What Is LLM Quantization?

LLM quantization is a technique used to make large language models (LLMs) smaller, faster, and cheaper to run.

The Idea in Simple Terms

LLMs normally store numbers (weights) using high-precision formats like 32-bit or 16-bit floating point. Quantization reduces this precision (for example to 8-bit or 4-bit numbers) while keeping the model’s behavior mostly the same.
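The idea can be sketched in a few lines. This is a minimal, illustrative example of symmetric per-tensor quantization to 8-bit integers; real GGUF schemes (the K-quants and I-quants in the table below) quantize small blocks of weights with per-block scales, but the principle is the same.

```python
# Minimal sketch of weight quantization: map float weights to int8
# using a single scale factor (symmetric, per-tensor quantization).

def quantize_int8(weights):
    """Return (int8 values, scale) for a list of float weights."""
    scale = max(abs(w) for w in weights) / 127  # largest weight maps to +/-127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.03, 0.25]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value is close to, but not exactly, the original:
# that small rounding error is the "accuracy loss" of quantization.
```

Storing an int8 plus a shared scale instead of a float32 per weight is where the roughly 4x size reduction comes from.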

Why Quantization Matters

  • 🚀 Faster inference – models run quicker
  • 💾 Lower memory usage – fits on smaller machines
  • 💰 Reduced costs – less compute and energy required
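The memory savings are easy to estimate with back-of-the-envelope arithmetic: file size is roughly parameters times bits per weight divided by 8. The 7B figure below is just an example; real GGUF files are slightly larger due to metadata and mixed-precision layers.

```python
# Rough memory estimate: size ~= parameters * bits_per_weight / 8.
# Approximate only; real GGUF files add metadata and keep some
# layers at higher precision.

def model_size_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # in GB

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{model_size_gb(7, bits):.1f} GB")
```

So a 7B model drops from about 14 GB at 16-bit to about 3.5 GB at 4-bit, which is the difference between needing a workstation GPU and fitting on a typical consumer card.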

The Trade-off

  • ✅ Big gains in speed and efficiency
  • ⚠️ Risk of accuracy loss

Common Quantization Levels (GGUF)

GGUF supports many quantization types; here are the most important ones:

| GGUF Quant | Bits    | When to Use                                   |
| ---------- | ------- | --------------------------------------------- |
| IQ2_S      | ~2-bit  | Extremely low memory, heavy quality loss      |
| IQ3_S      | ~3-bit  | Very compact, basic but usable                |
| Q4_K_M     | 4-bit   | Default choice for most users                 |
| IQ4_XS     | ~4-bit  | Smaller than Q4_K_M, similar quality, slower  |
| Q5_K_M     | 5-bit   | Higher quality, still efficient               |
| Q6_K       | 6-bit   | Near full precision                           |
| Q8_0       | 8-bit   | Maximum quality, large                        |
| F16+       | 16-bit+ | No quantization                               |

What should I run?

Go with Q4_K_M (or IQ4_XS if you are low on VRAM) or Q5_K_M; they offer a good trade-off between size and quality. Go much lower and the model gets noticeably dumber, while Q8_0 is almost indistinguishable from the original. Prefer quants that were created using an importance matrix (iMatrix).
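This advice can be turned into a tiny rule of thumb. The helper below is hypothetical (not part of llama.cpp): it estimates the file size for each recommended quant from approximate bits-per-weight figures and picks the highest-quality one that fits in your VRAM, leaving headroom for the KV cache and runtime overhead.

```python
# Hypothetical helper applying the guidance above: prefer Q5_K_M,
# fall back to Q4_K_M, then IQ4_XS for low-VRAM setups.
# Bits-per-weight values are approximations for estimation only.

QUANTS = [("Q5_K_M", 5.5), ("Q4_K_M", 4.8), ("IQ4_XS", 4.3)]  # (name, approx bits/weight)

def pick_quant(params_billion, vram_gb, overhead_gb=1.5):
    """Return the highest-quality quant whose estimated size fits in VRAM."""
    for name, bits in QUANTS:
        size_gb = params_billion * bits / 8  # billions of params * bits -> GB
        if size_gb + overhead_gb <= vram_gb:
            return name
    return None  # consider a smaller model rather than going below ~4-bit
```

For example, a 7B model on an 8 GB card lands on Q5_K_M, while the same model on a 6 GB card drops to Q4_K_M.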

What Are iMatrix Quants?

iMatrix (importance matrix) quants use real activation data to guide quantization. Instead of guessing which weights matter, the model is run on sample prompts to measure which parts are most important. That information is stored in an importance matrix and used to preserve critical weights during quantization.
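Conceptually, the importance measurement looks like the sketch below. This is not llama.cpp's actual implementation, just an illustration of the idea: accumulate squared activations per input channel over calibration prompts, so channels that consistently see large inputs (where weight error would be amplified most) are flagged as important.

```python
# Conceptual sketch of building an importance score per input channel
# from calibration activations (not llama.cpp's real imatrix code).

def importance_from_activations(activation_batches):
    """activation_batches: list of activation vectors observed at one layer's input."""
    n = len(activation_batches[0])
    importance = [0.0] * n
    for act in activation_batches:
        for i, a in enumerate(act):
            importance[i] += a * a  # squared activation: big inputs amplify weight error
    return [v / len(activation_batches) for v in importance]

# Channel 1 consistently sees large activations, so its weights
# would be given the most precision during quantization.
acts = [[0.1, 2.0, 0.05], [0.2, 1.5, 0.0]]
imp = importance_from_activations(acts)
```

The quantizer then spends its limited precision budget where these scores are highest, which is why iMatrix quants usually outperform plain quants at the same bit width.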