# Model Loading

## FastLanguageModel.from_pretrained

By default this loads a model for the LoRA process and loads weights in 16-bit (`load_in_16bit`).

```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Thinking-2507",
    max_seq_length = 2048,
    load_in_4bit = True,
)
```

### `max_seq_length`

Sets the context window for model training/inference. Unsloth can set `max_seq_length` above the model's original capacity because it uses RoPE scaling, supporting context windows up to 4x longer. Unsloth computes a scaling factor and adjusts how positions are mapped into RoPE's frequency space so that all positions fit into the frequency range the model was trained on.

Because the maximum sequence length has a huge impact on training cost, keep it small (512 or 1024) during initial exploration.

### `dtype`

Controls the compute dtype. Pass `None` for auto-detection (recommended). Explicit options: `torch.bfloat16` (Ampere+), `torch.float16`.

### `load_in_4bit`

Turns on QLoRA by using the bitsandbytes 4-bit runtime quantization engine.

**Sanity check:** VRAM usage should be much lower than FP16 for the same model (roughly 60-80% savings); if usage is close to FP16, you are likely not on the 4-bit fused path.

### `full_finetuning`

Enables full fine-tuning. Can be done in 8-bit or 4-bit quantized modes; by default, models are loaded in 16-bit mode.

### `fast_inference=False`

Disables vLLM, for example when you want to run GRPO on a model that has no vLLM support.

---

## FastLanguageModel.get_peft_model

Wraps the model with low-rank adapters for Hugging Face's parameter-efficient fine-tuning (HF PEFT). It creates the `peft.LoraConfig` for you, injects the LoRA layers into the model, and freezes the base model (by setting `requires_grad = False` in PyTorch).
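The freeze-and-inject step can be sketched without any framework; the toy `Param` class below stands in for PyTorch parameters (illustrative only, not the real PEFT internals):

```python
class Param:
    """Toy stand-in for a PyTorch parameter."""
    def __init__(self, name):
        self.name = name
        self.requires_grad = True  # trainable by default

def inject_lora(base_params):
    """Freeze the base model, then add trainable low-rank A/B pairs."""
    for p in base_params:
        p.requires_grad = False    # freeze: base weights get no gradients
    adapters = []
    for p in base_params:
        adapters += [Param(p.name + ".lora_A"), Param(p.name + ".lora_B")]
    return adapters                # only the adapters receive gradients

base = [Param("q_proj.weight"), Param("v_proj.weight")]
lora = inject_lora(base)
assert all(not p.requires_grad for p in base)
assert all(p.requires_grad for p in lora)
```

After this step, the optimizer only ever sees the adapter parameters, which is why LoRA training fits in far less memory than full fine-tuning.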
Unlike plain PEFT, Unsloth also makes your model more efficient to train because it:

- patches attention kernels into fused kernels (more efficient)
- applies memory-efficient gradient logic
- enables faster backprop kernels
- ensures compatibility with 4-bit / QLoRA models
- optionally activates gradient checkpointing optimized for (Q)LoRA.

### `r`

The rank parameter (smaller vs. more expressive adapters).

| r     | Reason                |
| ----- | --------------------- |
| 4     | extremely lightweight |
| 8     | common small adapter  |
| 16    | typical default       |
| 32-64 | high-capacity LoRA    |

### `lora_alpha`

Alpha is divided by the rank to scale the actual LoRA update:

$W' = W + \frac{\alpha}{r} B A$

Without this scaling, increasing the rank would increase the update magnitude; LoRA update scaling keeps the update magnitude stable independent of rank. Typical settings use `lora_alpha = 2 * r` for stronger updates; setting it equal to `r` yields a scaling factor of exactly 1.

### `use_rslora`

Enables rsLoRA, which scales by `1/sqrt(r)` instead of `1/r`. Recommended for high-rank adapters (r >= 32), as it stabilizes training and often improves quality.

### `lora_dropout`

Dropout applied to the LoRA adapter path only. Purpose: regularization to prevent adapter overfitting.

| value | meaning                 |
| ----- | ----------------------- |
| 0     | common default          |
| 0.05  | mild regularization     |
| 0.1   | stronger regularization |

### `bias`

Controls whether bias parameters are trainable.

| option        | meaning                         |
| ------------- | ------------------------------- |
| `"none"`      | biases frozen                   |
| `"lora_only"` | biases in LoRA layers trainable |
| `"all"`       | all biases trainable            |

Typical default: `bias="none"`. Training biases increases the parameter count and often gives little benefit.

### `use_gradient_checkpointing`

Unsloth-specific convenience option (value: `True` or `"unsloth"`).
Enables activation checkpointing so that:

- fewer activations are stored
- recomputation is used during the backward pass

Effect:

- much lower VRAM usage
- slightly slower training

### `target_modules`

Defines which linear layers receive LoRA adapters (fewer modules means smaller adapters). The typical transformer attention layers are `"q_proj", "k_proj", "v_proj", "o_proj"`. Adapting these should be sufficient for making a model more concise, adopting a specific writing voice, or following JSON formats more reliably, for example.

```python
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
]
```

Sometimes you also want to adapt the MLP layers: `"gate_proj", "up_proj", "down_proj"`. Consider adding those when adapting to a very specific domain (medicine, legal, science) with new terminology or jargon; it makes it more likely the model will start internalizing new facts (for better or worse).

```python
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # attention
    "gate_proj", "up_proj", "down_proj",     # MLP
]
```

Check the [Unsloth model notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks) for the best recommendation.

### `random_state`

Seed used when initializing the LoRA matrices. Important for reproducibility.
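The `lora_alpha` and `use_rslora` behavior described above reduces to a single multiplier on the low-rank update; a minimal sketch (the `lora_scale` helper is illustrative, not an Unsloth API):

```python
import math

def lora_scale(alpha, r, rslora=False):
    """Multiplier applied to the low-rank update B @ A.

    Standard LoRA uses alpha / r; rsLoRA uses alpha / sqrt(r).
    """
    return alpha / math.sqrt(r) if rslora else alpha / r

# lora_alpha = 2 * r gives the same update strength at every rank:
for r in (8, 16, 32, 64):
    assert lora_scale(2 * r, r) == 2.0

# With a fixed alpha, standard scaling shrinks quickly as rank grows,
# which is why rsLoRA is recommended for r >= 32:
print(lora_scale(32, 64))               # 0.5
print(lora_scale(32, 64, rslora=True))  # 4.0
```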
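As a rough sizing guide for `target_modules`, each adapted module adds `r * (d_in + d_out)` trainable parameters, so the module list directly controls adapter size. A back-of-envelope sketch with hypothetical per-layer projection shapes (the dimensions below are illustrative, not taken from any specific model):

```python
def lora_params(target_modules, shapes, r):
    """Trainable LoRA parameters per layer: r * (d_in + d_out) per module."""
    return sum(r * (d_in + d_out) for m in target_modules
               for d_in, d_out in [shapes[m]])

# Hypothetical (d_in, d_out) shapes for one transformer layer:
shapes = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024),
    "v_proj": (4096, 1024), "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336), "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

attention = ["q_proj", "k_proj", "v_proj", "o_proj"]
with_mlp = attention + ["gate_proj", "up_proj", "down_proj"]

print(lora_params(attention, shapes, r=16))  # 425984
print(lora_params(with_mlp, shapes, r=16))   # 1310720, roughly 3x larger
```

The MLP projections are much wider than the attention ones, so adding them roughly triples the adapter size at the same rank.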