# Training
## Trainer (and Trainer Config)
Create a `trainer` from the model, tokenizer, dataset, and configs.
The trainers come directly from the Hugging Face Transformer Reinforcement Learning library, `trl`.
Typical choices:
- [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer) and [`SFTConfig`](https://huggingface.co/docs/trl/sft_trainer#model-initialization) - standard fine-tuning
- [`GRPOTrainer`](https://huggingface.co/docs/trl/grpo_trainer) and [`GRPOConfig`](https://huggingface.co/docs/trl/grpo_trainer#speed-up-training-with-vllm-powered-generation) - to add reasoning to your model with a [reward function](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide#reward-functions-verifiers) (consider [using vLLM to speed up generation](https://huggingface.co/docs/trl/grpo_trainer#speed-up-training-with-vllm-powered-generation) by installing `trl[vllm]`)
- [`DPOTrainer`](https://huggingface.co/docs/trl/dpo_trainer) and [`DPOConfig`](https://huggingface.co/docs/trl/dpo_trainer#compatibility-and-constraints) - fast training/testing with a [preference dataset](https://huggingface.co/docs/trl/dataset_formats#preference).
Again, review the relevant [Unsloth model notebook](https://unsloth.ai/docs/get-started/unsloth-notebooks) for the optimal training setup.
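As a sketch only (not runnable on its own): assuming `model`, `tokenizer`, and `dataset` were created in the earlier model- and data-loading steps, wiring them into an `SFTTrainer` typically looks like the following. The exact keyword names can differ between `trl` versions (newer releases accept `processing_class` in place of `tokenizer`), so check your notebook's version.

```python
import torch
from trl import SFTTrainer, SFTConfig

# `model`, `tokenizer`, and `dataset` are assumed to exist from earlier steps.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        # Pick mixed precision based on hardware support (Ampere+ has bf16).
        bf16=torch.cuda.is_bf16_supported(),
        fp16=not torch.cuda.is_bf16_supported(),
        output_dir="outputs",
    ),
)
trainer_stats = trainer.train()
```

The same pattern applies to `GRPOTrainer`/`GRPOConfig` and `DPOTrainer`/`DPOConfig`, with their own extra arguments (reward functions, preference datasets).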
## trainer.train() Key Parameters
- `per_device_train_batch_size = 2` - Increase for better GPU utilization, but beware that padding variable-length samples can slow training. Prefer increasing `gradient_accumulation_steps` instead for smoother training.
- `gradient_accumulation_steps = 4` - Simulates a larger batch size without increasing memory usage.
- `max_steps = 60` - Caps the run at 60 steps, useful for quick experiments. For full runs, replace with `num_train_epochs = 1` (1-3 epochs recommended to avoid overfitting).
- `learning_rate = 2e-4` - Lower for slower but more precise fine-tuning. Try values like `1e-4`, `5e-5`, or `2e-5`. [Optimize the learning rate](Optimize%20the%20learning%20rate.md) by looking at the loss across different rates.
- `warmup_ratio = 0.03` - Fraction of steps for linear LR warm-up. Prevents instability at the start of training.
- `fp16` / `bf16` - Enable mixed-precision training. Use `bf16=True` on Ampere+ GPUs (RTX 30xx+), `fp16=True` otherwise. Reduces VRAM and speeds up training.
- `packing = True` (SFTTrainer only) - Packs multiple short samples into one sequence to maximize GPU utilization. Essential for datasets with variable-length short samples.
- To speed up evaluation, you can reduce the evaluation dataset size or evaluate less frequently by raising `eval_steps` (e.g. `eval_steps = 100`).
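Since raising `gradient_accumulation_steps` substitutes for a larger `per_device_train_batch_size`, the quantity that stays constant is the *effective* batch size the optimizer sees per update. A small illustration (the `num_gpus = 1` value is my assumption for single-GPU training):

```python
# Values mirror the defaults discussed above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: single-GPU training

# Gradients are accumulated over several small forward/backward passes, so one
# optimizer step behaves as if it had seen this many samples at once:
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 8
```

So `2 x 4` and `4 x 2` train with the same effective batch size of 8, but the former holds fewer activations in memory at once.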
## Memory Stats
Show current memory stats:
```python
import torch

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
```
Show final memory and time stats:
```python
# Uses `start_gpu_memory` and `max_memory` from the block above;
# `trainer_stats` is the object returned by trainer.train().
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
```