# Model Choice
Always review the [Unsloth model notebook](https://unsloth.ai/docs/get-started/unsloth-notebooks) for your choice.
## GPU Choice
Ensure your GPU can support fused kernels:
```python
import torch
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # (major, minor)
```
Aim for a compute capability major of at least 7, ideally 8 or higher. Ampere and newer GPUs (major ≥ 8, e.g. RTX 30xx/40xx, A100, H100, L40, Blackwell) get the best Triton/fused-kernel path; 7.x GPUs (V100, T4, RTX 20xx) still work but are slower.
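As a sketch of how you might branch on that capability tuple (the helper `pick_training_dtype` is our own illustration, not an Unsloth API):

```python
def pick_training_dtype(major: int, minor: int) -> str:
    """Map CUDA compute capability to a sensible training dtype.

    Ampere and newer (major >= 8) support bfloat16; 7.x GPUs
    (V100, T4, RTX 20xx) fall back to float16.
    """
    if major >= 8:
        return "bfloat16"
    if major == 7:
        return "float16"
    raise RuntimeError(f"Compute capability {major}.{minor} is too old for fused kernels")

# Usage: pick_training_dtype(*torch.cuda.get_device_capability(0))
```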
## Language Models
Instruct models are base models that have been further trained (instruction-tuned) to follow prompts, making them usable without any fine-tuning. These models, including the GGUF builds commonly available, respond effectively to prompts right out of the box.
Unsloth recommends starting with **Instruct models**, as they allow direct fine-tuning with conversational chat templates (ChatML, ShareGPT, etc.) and require less data than **Base models** (which use Alpaca-, Vicuna-style formats). Learn more about the differences between [instruct and base models here](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide/what-model-should-i-use#instruct-or-base-model).
If you pick a model with `-unsloth-bnb-4bit` quants, it will use slightly more memory than a standard 4-bit BitsAndBytes (`bnb`) setup, but with significantly higher accuracy.
|**Model Size**|**Precision**|**Min. VRAM (Training)**|**Suitable GPU**|**Notes**|
|---|---|---|---|---|
|**1B - 2B**|4-bit QLoRA|**~3 - 4 GB**|RTX 3060 (8GB)|Extremely fast; great for edge devices.|
|**3B** (Llama 3.2)|4-bit QLoRA|**~3.5 - 5 GB**|Colab T4 (16GB)|Very comfortable on free Colab tier.|
|**7B - 9B** (Llama 3.1)|4-bit QLoRA|**~6 - 8 GB**|Colab T4 / RTX 3080|The "Sweet Spot" for most users.|
|**7B - 9B**|16-bit LoRA|**~14 - 16 GB**|Colab T4 (Tight)|Minimal accuracy loss vs. 4-bit.|
|**14B** (Phi-4)|4-bit QLoRA|**~12 - 14 GB**|Colab T4 / RTX 3090|Fits on Colab, but keep context <4096.|
|**24B** (Mistral)|4-bit QLoRA|**~22 - 24 GB**|RTX 3090 / 4090|Requires 24GB consumer/pro GPU.|
|**32B** (QwQ / Qwen)|4-bit QLoRA|**~28 - 32 GB**|A6000 / A100|Hard to fit on a single 24GB card.|
|**70B** (Llama 3.3)|4-bit QLoRA|**~48 - 54 GB**|A100 (80GB)|Unsloth allows 89k context on 80GB VRAM.|
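The dominant term in the table above is the 4-bit weight footprint itself, roughly 0.5 bytes per parameter. A back-of-envelope helper (our own approximation; LoRA adapters, optimizer state, and activations add several GB on top):

```python
def qlora_weight_gb(params_billion: float) -> float:
    """Approximate 4-bit weight footprint in GB (~0.5 bytes/param).

    Rough lower bound only: LoRA adapters, optimizer state, and
    activations add several GB on top, as the table above shows.
    """
    return params_billion * 0.5

# e.g. a 7B model: qlora_weight_gb(7) -> 3.5 GB of weights,
# versus the ~6-8 GB total the table lists for training.
```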
Two efficient Llama and Qwen models in this format are:
- `unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit`
- `unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit`
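A minimal loading sketch for one of these checkpoints (`max_seq_length=2048` is an arbitrary choice for illustration):

```python
# Keyword arguments for FastLanguageModel.from_pretrained; the model name
# is one of the 4-bit checkpoints listed above.
load_kwargs = dict(
    model_name="unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit",
    max_seq_length=2048,   # illustrative; size to your data
    load_in_4bit=True,     # QLoRA path
)

try:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(**load_kwargs)
except ImportError:
    pass  # unsloth not installed; kwargs shown for illustration
```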
## Quantization Levels
| Format | VRAM use | Accuracy | Use case |
| --------- | -------- | -------- | ------------------------------ |
| 16-bit | highest | best | full fine-tuning, small models |
| 4-bit bnb | low | good | QLoRA on consumer GPUs |
| GGUF | varies | good | inference only (llama.cpp) |
Do not confuse training quantization (4-bit QLoRA) with inference formats (GGUF): you fine-tune in 4-bit, then export to GGUF for deployment.
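For the export step, Unsloth exposes `save_pretrained_gguf`; a hedged sketch (assumes a fine-tuned `model` and `tokenizer` already in scope, and uses `q4_k_m` as one common llama.cpp quantization):

```python
quantization_method = "q4_k_m"  # a common llama.cpp quant; pick per deployment target

try:
    # assumes `model` and `tokenizer` come from a prior
    # FastLanguageModel.from_pretrained + fine-tuning run
    model.save_pretrained_gguf("gguf_model", tokenizer,
                               quantization_method=quantization_method)
except NameError:
    pass  # no fine-tuned model in scope; call shown for illustration
```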
When you pass `load_in_4bit=True` to `FastLanguageModel.from_pretrained`, Unsloth will typically map to a `…-bnb-4bit` variant or expect one to exist (e.g. `unsloth/llama-2-13b-bnb-4bit`).
If no 4-bit variant exists for that base model (or you mix `fast_inference=True` with an incompatible checkpoint), you may see errors or unexpected VRAM usage, so choose a model family that Unsloth explicitly supports in 4-bit (Llama 2/3, Qwen2.5, etc.).
**Sanity check:** VRAM usage should be much lower than FP16 for the same model (roughly 60–80% savings); if usage is close to FP16, you likely aren't on the 4-bit fused path.
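To turn that check into numbers, compare measured usage against the FP16 weight footprint (~2 bytes per parameter); the helper below is our own sketch:

```python
def vram_savings(fp16_gb: float, measured_gb: float) -> float:
    """Fraction of FP16 VRAM saved by the 4-bit path."""
    return 1.0 - measured_gb / fp16_gb

# Example: a 7B model is ~14 GB in FP16. If torch.cuda.memory_allocated(0)
# reports ~4.5 GB after loading in 4-bit, savings are about 0.68, inside
# the expected 60-80% band. Savings near zero mean you are not in 4-bit.
```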