# Model Choice

Always review the [Unsloth model notebook](https://unsloth.ai/docs/get-started/unsloth-notebooks) for your choice.

## GPU Choice

Ensure your GPU can support fused kernels:

```python
import torch

print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # (major, minor)
```

The major version should be at least 7, ideally 8 or higher. On Ampere and newer GPUs (major ≥ 8, e.g. RTX 30xx/40xx, A100, H100, L40, Blackwell) you get the best Triton/fused-kernel path; older 7.x cards (V100, T4, RTX 20xx) still work, but more slowly.

## Language Models

Instruct models have already been instruction-tuned on top of a base model, making them ready to use without any further fine-tuning. These models, including GGUFs and others commonly available, are optimized for direct usage and respond effectively to prompts right out of the box.

Unsloth recommends starting with **Instruct models**: they allow direct fine-tuning using conversational chat templates (ChatML, ShareGPT, etc.) and require less data than **Base models** (which use Alpaca, Vicuna, etc.). Learn more about the differences between [instruct and base models here](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide/what-model-should-i-use#instruct-or-base-model).

If you pick a model with `-unsloth-bnb-4bit` quants, it uses slightly more memory than a standard 4-bit BitsAndBytes (`bnb`) setup, but at significantly higher accuracy.

|**Model Size**|**Precision**|**Min. VRAM (Training)**|**Suitable GPU**|**Notes**|
|---|---|---|---|---|
|**1B - 2B**|4-bit QLoRA|**~3 - 4 GB**|RTX 3060 (8GB)|Extremely fast; great for edge devices.|
|**3B** (Llama 3.2)|4-bit QLoRA|**~3.5 - 5 GB**|Colab T4 (16GB)|Very comfortable on free Colab tier.|
|**7B - 9B** (Llama 3.1)|4-bit QLoRA|**~6 - 8 GB**|Colab T4 / RTX 3080|The "Sweet Spot" for most users.|
|**7B - 9B**|16-bit LoRA|**~14 - 16 GB**|Colab T4 (Tight)|Minimal accuracy loss vs. 4-bit.|
|**14B** (Phi-4)|4-bit QLoRA|**~12 - 14 GB**|Colab T4 / RTX 3090|Fits on Colab, but keep context <4096.|
|**24B** (Mistral)|4-bit QLoRA|**~22 - 24 GB**|RTX 3090 / 4090|Requires 24GB consumer/pro GPU.|
|**32B** (QwQ / Qwen)|4-bit QLoRA|**~28 - 32 GB**|A6000 / A100|Hard to fit on a single 24GB card.|
|**70B** (Llama 3.3)|4-bit QLoRA|**~48 - 54 GB**|A100 (80GB)|Unsloth allows 89k context on 80GB VRAM.|

Very efficient Llama and Qwen models are:

- `unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit`
- `unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit`

## Quantization Levels

| Format    | VRAM use | Accuracy | Use case                       |
| --------- | -------- | -------- | ------------------------------ |
| 16-bit    | highest  | best     | full fine-tuning, small models |
| 4-bit bnb | low      | good     | QLoRA on consumer GPUs         |
| GGUF      | varies   | good     | inference only (llama.cpp)     |

Do not confuse training quantization (4-bit/QLoRA) with inference formats (GGUF). You fine-tune in 4-bit, then export to GGUF for deployment.

When you pass `load_in_4bit=True` to `FastLanguageModel.from_pretrained`, Unsloth will typically map to a `…-bnb-4bit` variant or expect one to exist (e.g. `unsloth/llama-2-13b-bnb-4bit`). If no 4-bit variant exists for that base model (or you mix `fast_inference=True` with an incompatible checkpoint), you may see errors or unexpected VRAM usage, so choose a model family that Unsloth explicitly supports in 4-bit (Llama 2/3, Qwen2.5, etc.).

**Sanity check:** VRAM usage should be much lower than FP16 for the same model (roughly 60–80% savings); if usage is close to FP16, you likely aren't on the 4-bit fused path.
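The compute-capability check from the GPU section can be wrapped in a small gate; a minimal sketch (the tier names and messages here are illustrative, not an Unsloth API — only the numeric thresholds come from the text above):

```python
def kernel_tier(major: int, minor: int) -> str:
    """Classify a CUDA compute capability for fused-kernel support.

    Illustrative helper, not part of Unsloth: maps the (major, minor)
    tuple from torch.cuda.get_device_capability() to the rough tiers
    described in the GPU Choice section.
    """
    if major >= 8:   # Ampere, Ada, Hopper, Blackwell
        return "best: full Triton/fused-kernel path"
    if major == 7:   # Volta (7.0), Turing (7.5): supported but slower
        return "ok: works, but slower than Ampere+"
    return "unsupported: compute capability below 7.0"

# Example: a Colab T4 reports capability (7, 5)
print(kernel_tier(7, 5))
```

In practice you would feed it `torch.cuda.get_device_capability(0)` directly, e.g. `kernel_tier(*torch.cuda.get_device_capability(0))`.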
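As a rough cross-check of the VRAM table, you can estimate 4-bit training memory from parameter count. This is back-of-envelope only: the multipliers below are assumptions for illustration, not Unsloth's internal accounting, and real usage also depends heavily on context length and batch size:

```python
def estimate_qlora_vram_gb(params_billion: float) -> tuple[float, float]:
    """Back-of-envelope VRAM estimate for 4-bit QLoRA training.

    Assumptions (illustrative, not from Unsloth):
    - 4-bit weights: ~0.5 bytes/param, plus ~10% for quantization constants
    - training overhead (LoRA adapters, optimizer state, activations):
      roughly an extra 80% on top of the weights at modest context lengths
    Returns (weights_gb, approx_training_gb).
    """
    weights_gb = params_billion * 0.5 * 1.10   # quantized weights + constants
    training_gb = weights_gb * 1.8             # + adapters/optimizer/activations
    return round(weights_gb, 1), round(training_gb, 1)

# A 7B model: ~3.9 GB of weights, ~6.9 GB while training,
# in line with the ~6 - 8 GB row in the table above.
print(estimate_qlora_vram_gb(7.0))
```

Note the heuristic drifts at the extremes (the 70B row in the table benefits from additional Unsloth optimizations), so treat the table as authoritative and this as a sanity check only.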
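The 60–80% savings figure in the sanity check follows directly from bytes per parameter; a quick illustration (the quantization-constant overhead fraction is an assumption):

```python
def weight_memory_savings(overhead_frac: float = 0.10) -> float:
    """Fraction of weight memory saved by 4-bit vs FP16 storage.

    FP16 stores 2 bytes/param; 4-bit stores 0.5 bytes/param plus a small
    overhead for quantization constants (overhead_frac is an assumed value).
    """
    fp16_bytes = 2.0
    four_bit_bytes = 0.5 * (1.0 + overhead_frac)
    return 1.0 - four_bit_bytes / fp16_bytes

# ~72% savings on weights alone; activations and optimizer state pull the
# end-to-end number toward the 60-80% range quoted in the sanity check.
print(f"{weight_memory_savings():.0%}")
```

If your observed VRAM savings fall well below this range, that is the signal from the sanity check that the 4-bit path is not actually active.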