Text Models
Local text generation uses llama.cpp. GGUF, a popular file format for running LLMs locally, originates from the llama.cpp project.
Vision Support
Want info on something specific on your screen? Use the Windows+Shift+S shortcut to capture a screenshot of the area of interest, then paste the image into your widget (Ctrl+V on the active Widget Window).
What Is LLM Quantization?
LLM quantization is a technique used to make large language models (LLMs) smaller, faster, and cheaper to run.
The Idea in Simple Terms
LLMs normally store numbers (weights) using high-precision formats like 32-bit or 16-bit floating point. Quantization reduces this precision (for example to 8-bit or 4-bit numbers) while keeping the model’s behavior mostly the same.
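The precision reduction above can be sketched in a few lines. This toy example uses simple symmetric 8-bit quantization with one scale per block; the actual GGUF formats (Q4_K, Q8_0, ...) are more elaborate block-based schemes, so treat this only as an illustration of the core idea.

```python
import numpy as np

# A fake block of float32 weights, roughly the scale of real LLM weights.
rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=256).astype(np.float32)

# Quantize: map floats to int8 using a single scale for the block.
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)   # stored as 8-bit integers
dequant = q.astype(np.float32) * scale          # reconstructed for compute

# The round-trip error is bounded by half the scale step.
err = float(np.abs(weights - dequant).max())
print(f"max abs error: {err:.6f} (scale step: {scale:.6f})")
```

Storage drops from 32 bits to 8 bits per weight (plus one scale per block), while the reconstructed values stay close to the originals.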
Why Quantization Matters
- 🚀 Faster inference – models run quicker
- 💾 Lower memory usage – fits on smaller machines
- 💰 Reduced costs – less compute and energy required
The Trade-off
- ✅ Big gains in speed and efficiency
- ⚠️ Risk of accuracy loss
Common Quantization Levels (GGUF)
GGUF supports many quantization types; here are the most important ones:
| GGUF Quant | Bits | When to Use |
|---|---|---|
| IQ2_S | ~2-bit | Extremely low memory, heavy quality loss |
| IQ3_S | ~3-bit | Very compact, basic but usable |
| Q4_K_M | 4-bit | Default choice for most users |
| IQ4_XS | ~4-bit | Smaller than Q4_K_M, similar quality, slower |
| Q5_K_M | 5-bit | Higher quality, still efficient |
| Q6_K | 6-bit | Near full precision |
| Q8_0 | 8-bit | Maximum quality, large |
| F16+ | 16-bit+ | No quantization |
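A quick way to reason about the table is file size: weight memory is roughly parameter count times bits per weight. The bits-per-weight figures below are approximate effective averages (the K-quants mix bit widths across tensors), and runtime overhead such as the KV cache is not included, so this is an estimate only.

```python
# Approximate effective bits per weight for common GGUF quant types.
BITS_PER_WEIGHT = {
    "IQ2_S": 2.5, "IQ3_S": 3.4, "IQ4_XS": 4.25, "Q4_K_M": 4.85,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

def weight_size_gb(n_params_billion: float, quant: str) -> float:
    """Rough size of the weight file in gigabytes (weights only)."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params_billion * 1e9 * bits / 8 / 1e9

for quant in ("Q4_K_M", "Q5_K_M", "Q8_0", "F16"):
    print(f"7B @ {quant}: ~{weight_size_gb(7, quant):.1f} GB")
```

For example, a 7B model lands around 4 GB at Q4_K_M versus 14 GB at F16, which is why 4-bit and 5-bit quants are the usual sweet spot for consumer GPUs.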
What should I run?
Go with Q4_K_M (or IQ4_XS if low on VRAM) or Q5_K_M; they offer a good trade-off between size and quality. Go lower and the model gets noticeably dumber. Q8_0 should be almost as good as the original. Prefer quants that were created using an iMatrix.
What Are iMatrix Quants?
iMatrix (importance matrix) quants use real activation data to guide quantization. Instead of guessing which weights matter, the model is run on sample prompts to measure which parts are most important. That information is stored in an importance matrix and used to preserve critical weights during quantization.
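The idea can be sketched as importance-weighted error minimization. This is an illustrative toy, not llama.cpp's actual algorithm: per-weight importance is taken as the mean squared activation from sample data, and the quantization scale is tuned to minimize the weighted error rather than the plain error.

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(0, 0.02, size=256).astype(np.float32)
# Activations gathered by running the model on sample prompts (faked here).
acts = rng.normal(0, 1.0, size=(64, 256)).astype(np.float32)

# Diagonal importance matrix: weights multiplied by large activations
# contribute more to the output, so errors there hurt more.
importance = (acts ** 2).mean(axis=0)

def weighted_err(scale: float) -> float:
    q = np.clip(np.round(weights / scale), -8, 7)  # 4-bit signed range
    return float((importance * (weights - q * scale) ** 2).sum())

# Search candidate scales and keep the one with the least weighted error.
base = np.abs(weights).max() / 7.0
best = min((base * f for f in np.linspace(0.7, 1.3, 25)), key=weighted_err)
print(f"naive err: {weighted_err(base):.3e}, imatrix-tuned: {weighted_err(best):.3e}")
```

The tuned scale never does worse than the naive one on the weighted objective, which is why iMatrix quants tend to preserve quality better at the same bit width.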