Text Models

Local text generation uses llama.cpp. GGUF is a popular file format for running LLMs locally, and it originates from the llama.cpp project.

Vision Support

Don't Forget

Want info about something specific on your screen? Press Windows+Shift+S to capture the area of interest, then paste the image into your widget (Ctrl+V on the active widget window).

What Is LLM Quantization?

LLM quantization is a technique used to make large language models (LLMs) smaller, faster, and cheaper to run.

The Idea in Simple Terms

LLMs normally store numbers (weights) using high-precision formats like 32-bit or 16-bit floating point. Quantization reduces this precision (for example to 8-bit or 4-bit numbers) while keeping the model’s behavior mostly the same.
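The idea can be sketched in a few lines. This is a minimal, illustrative example of symmetric per-tensor quantization to 8-bit integers; real GGUF schemes (the K-quants and I-quants in the table below) quantize small blocks of weights with per-block scales, but the principle is the same.

```python
# Minimal sketch of weight quantization: map float weights to int8
# using a single scale factor (symmetric, per-tensor quantization).

def quantize_int8(weights):
    """Return (int8 values, scale) for a list of float weights."""
    scale = max(abs(w) for w in weights) / 127  # largest weight maps to +/-127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.03, 0.25]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value is close to, but not exactly, the original:
# that small rounding error is the "accuracy loss" of quantization.
```

Storing an int8 plus a shared scale instead of a float32 per weight is where the roughly 4x size reduction comes from.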

Why Quantization Matters

  • 🚀 Faster inference – models run quicker
  • 💾 Lower memory usage – fits on smaller machines
  • 💰 Reduced costs – less compute and energy required
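The memory savings are easy to estimate with back-of-the-envelope arithmetic: file size is roughly parameters times bits per weight divided by 8. The 7B figure below is just an example; real GGUF files are slightly larger due to metadata and mixed-precision layers.

```python
# Rough memory estimate: size ~= parameters * bits_per_weight / 8.
# Approximate only; real GGUF files add metadata and keep some
# layers at higher precision.

def model_size_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # in GB

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{model_size_gb(7, bits):.1f} GB")
```

So a 7B model drops from about 14 GB at 16-bit to about 3.5 GB at 4-bit, which is the difference between needing a workstation GPU and fitting on a typical consumer card.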

The Trade-off

  • ✅ Big gains in speed and efficiency
  • ⚠️ Risk of accuracy loss

Common Quantization Levels (GGUF)

GGUF supports many quantization types; here are the most important ones:

| GGUF Quant | Bits    | When to Use                                   |
| ---------- | ------- | --------------------------------------------- |
| IQ2_S      | ~2-bit  | Extremely low memory, heavy quality loss      |
| IQ3_S      | ~3-bit  | Very compact, basic but usable                |
| Q4_K_M     | 4-bit   | Default choice for most users                 |
| IQ4_XS     | ~4-bit  | Smaller than Q4_K_M, similar quality, slower  |
| Q5_K_M     | 5-bit   | Higher quality, still efficient               |
| Q6_K       | 6-bit   | Near full precision                           |
| Q8_0       | 8-bit   | Maximum quality, large                        |
| F16+       | 16-bit+ | No quantization                               |

What should I run?

Go with Q4_K_M (or IQ4_XS if you are low on VRAM) or Q5_K_M; they offer a good trade-off between size and quality. Go much lower and the model gets noticeably dumber, while Q8_0 is almost indistinguishable from the original. Prefer quants that were created using an importance matrix (iMatrix).
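This advice can be turned into a tiny rule of thumb. The helper below is hypothetical (not part of llama.cpp): it estimates the file size for each recommended quant from approximate bits-per-weight figures and picks the highest-quality one that fits in your VRAM, leaving headroom for the KV cache and runtime overhead.

```python
# Hypothetical helper applying the guidance above: prefer Q5_K_M,
# fall back to Q4_K_M, then IQ4_XS for low-VRAM setups.
# Bits-per-weight values are approximations for estimation only.

QUANTS = [("Q5_K_M", 5.5), ("Q4_K_M", 4.8), ("IQ4_XS", 4.3)]  # (name, approx bits/weight)

def pick_quant(params_billion, vram_gb, overhead_gb=1.5):
    """Return the highest-quality quant whose estimated size fits in VRAM."""
    for name, bits in QUANTS:
        size_gb = params_billion * bits / 8  # billions of params * bits -> GB
        if size_gb + overhead_gb <= vram_gb:
            return name
    return None  # consider a smaller model rather than going below ~4-bit
```

For example, a 7B model on an 8 GB card lands on Q5_K_M, while the same model on a 6 GB card drops to Q4_K_M.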

What Are iMatrix Quants?

iMatrix (importance matrix) quants use real activation data to guide quantization. Instead of guessing which weights matter, the model is run on sample prompts to measure which parts are most important. That information is stored in an importance matrix and used to preserve critical weights during quantization.
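Conceptually, the importance measurement looks like the sketch below. This is not llama.cpp's actual implementation, just an illustration of the idea: accumulate squared activations per input channel over calibration prompts, so channels that consistently see large inputs (where weight error would be amplified most) are flagged as important.

```python
# Conceptual sketch of building an importance score per input channel
# from calibration activations (not llama.cpp's real imatrix code).

def importance_from_activations(activation_batches):
    """activation_batches: list of activation vectors observed at one layer's input."""
    n = len(activation_batches[0])
    importance = [0.0] * n
    for act in activation_batches:
        for i, a in enumerate(act):
            importance[i] += a * a  # squared activation: big inputs amplify weight error
    return [v / len(activation_batches) for v in importance]

# Channel 1 consistently sees large activations, so its weights
# would be given the most precision during quantization.
acts = [[0.1, 2.0, 0.05], [0.2, 1.5, 0.0]]
imp = importance_from_activations(acts)
```

The quantizer then spends its limited precision budget where these scores are highest, which is why iMatrix quants usually outperform plain quants at the same bit width.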