# Training
## Trainer (and Trainer Config)
Create a `trainer` from the model, tokenizer, dataset, and configs.
The trainers come directly from the Hugging Face Transformer Reinforcement Learning library, `trl`.
Typical choices:
- [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer) and [`SFTConfig`](https://huggingface.co/docs/trl/sft_trainer#model-initialization) - standard fine-tuning
- [`GRPOTrainer`](https://huggingface.co/docs/trl/grpo_trainer) and [`GRPOConfig`](https://huggingface.co/docs/trl/grpo_trainer#speed-up-training-with-vllm-powered-generation) - to add reasoning to your model with a [reward function](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide#reward-functions-verifiers) (consider [using vLLM to speed up generation](https://huggingface.co/docs/trl/grpo_trainer#speed-up-training-with-vllm-powered-generation) by installing `trl[vllm]`)
- [`DPOTrainer`](https://huggingface.co/docs/trl/dpo_trainer) and [`DPOConfig`](https://huggingface.co/docs/trl/dpo_trainer#compatibility-and-constraints) - fast training/testing with a [preference dataset](https://huggingface.co/docs/trl/dataset_formats#preference).
Again, review the relevant [Unsloth model notebook](https://unsloth.ai/docs/get-started/unsloth-notebooks) for the optimal training setup.
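As a sketch only (not runnable on its own): assuming `model`, `tokenizer`, and `dataset` were created in the earlier model- and data-loading steps, wiring them into an `SFTTrainer` typically looks like the following. The exact keyword names can differ between `trl` versions (newer releases accept `processing_class` in place of `tokenizer`), so check your notebook's version.

```python
import torch
from trl import SFTTrainer, SFTConfig

# `model`, `tokenizer`, and `dataset` are assumed to exist from earlier steps.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        # Pick mixed precision based on hardware support (Ampere+ has bf16).
        bf16=torch.cuda.is_bf16_supported(),
        fp16=not torch.cuda.is_bf16_supported(),
        output_dir="outputs",
    ),
)
trainer_stats = trainer.train()
```

The same pattern applies to `GRPOTrainer`/`GRPOConfig` and `DPOTrainer`/`DPOConfig`, with their own extra arguments (reward functions, preference datasets).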
## trainer.train() Key Parameters
- `per_device_train_batch_size = 2` - Increase for better GPU utilization, but beware that padding variable-length samples can slow training. Prefer increasing `gradient_accumulation_steps` instead for smoother training.
- `gradient_accumulation_steps = 4` - Simulates a larger batch size without increasing memory usage.
- `max_steps = 60` - Caps the run at 60 steps, useful for quick experiments. For full runs, replace with `num_train_epochs = 1` (1-3 epochs recommended to avoid overfitting).
- `learning_rate = 2e-4` - Lower for slower but more precise fine-tuning. Try values like `1e-4`, `5e-5`, or `2e-5`. [Optimize the learning rate](Optimize%20the%20learning%20rate.md) by looking at the loss across different rates.
- `warmup_ratio = 0.03` - Fraction of steps for linear LR warm-up. Prevents instability at the start of training.
- `fp16` / `bf16` - Enable mixed-precision training. Use `bf16=True` on Ampere+ GPUs (RTX 30xx+), `fp16=True` otherwise. Reduces VRAM and speeds up training.
- `packing = True` (SFTTrainer only) - Packs multiple short samples into one sequence to maximize GPU utilization. Essential for datasets with variable-length short samples.
- To speed up evaluation, you can reduce the evaluation dataset size or evaluate less frequently by raising `eval_steps` (e.g. `eval_steps = 100`).
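Since raising `gradient_accumulation_steps` substitutes for a larger `per_device_train_batch_size`, the quantity that stays constant is the *effective* batch size the optimizer sees per update. A small illustration (the `num_gpus = 1` value is my assumption for single-GPU training):

```python
# Values mirror the defaults discussed above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: single-GPU training

# Gradients are accumulated over several small forward/backward passes, so one
# optimizer step behaves as if it had seen this many samples at once:
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 8
```

So `2 x 4` and `4 x 2` train with the same effective batch size of 8, but the former holds fewer activations in memory at once.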
## Memory Stats
Show current memory stats:
```python
import torch

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
```
Show final memory and time stats:
```python
# Uses `start_gpu_memory` and `max_memory` from the block above;
# `trainer_stats` is the object returned by trainer.train().
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
```