# Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

* Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
* UWash and Google Research
* ACL 2023

## Method

Our paradigm has two simple steps:

![](Distilling%20Step-by-Step%20Overview.png)

* First, given an LLM and an unlabeled dataset, we prompt the LLM to generate output labels along with rationales that justify those labels (see the prompt-construction sketch at the end of this note).
    * Each prompt is a triplet $(x_p, r_p, y_p)$, where $x_p$ is an example input, $y_p$ is its corresponding label, and $r_p$ is a user-provided rationale that explains why $x_p$ can be categorized as $y_p$.
    * With the demonstrations seen in the prompt $p$, the LLM can mimic the triplet demonstrations to generate the rationale $\hat{r}_i$ and output $\hat{y}_i$ for each unlabeled input $x_i$.
    * We require users to produce a few example demonstrations (∼10-shot for all tasks) in order to use few-shot CoT prompting.
    * We utilize Chain-of-Thought (CoT) prompting (Wei et al., 2022) to elicit and extract rationales from LLMs.
        * Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models.
* Second, we leverage these rationales, in addition to the task labels, to train smaller downstream models (see the multi-task training sketch at the end of this note).
    * We first describe the current framework for learning task-specific models. With this framework in place, we extend it to incorporate rationales into the training process.
    * **Multi-task learning with rationales.**
        * To create a more explicit connection between the $x_i$'s and the $\hat{y}_i$'s, we use the extracted rationales $\hat{r}_i$ as additional supervision.
        * Instead of using rationales as additional model inputs, we frame learning with rationales as a multi-task problem.
        * We prepend "task prefixes" ([label], [rationale]) to the input examples and train the smaller model to output $\hat{y}_i$ when [label] is provided and to produce $\hat{r}_i$ when [rationale] is provided (Raffel et al., 2020).
            * Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
        * Our proposed multi-task training framework consistently leads to better performance than treating rationale and label prediction as a single task. Single-task training can at times lead to worse performance than standard fine-tuning.
    * Task-specific distillation (Hinton et al., 2015; Tang et al., 2019) uses LLM teachers to generate noisy pseudo training labels $\hat{y}_i$ in place of $y_i$ (Wang et al., 2021).
        * Peifeng Wang, Aaron Chan, Filip Ilievski, Muhao Chen, and Xiang Ren. 2022a. Pinto: Faithful language reasoning using prompt-generated rationales.

## Experiments

![](Distilling%20Step-by-Step%20training%20set%20size%20evaluation%20vs%20fine-tuning%20and%20standard%20distillation.png)

![](Distilling%20Step-by-Step%20model%20size%20evaluation%20vs%20fine-tuning%20and%20standard%20distillation.png)

# Mindmap

![](Distilling%20Step-by-Step!%20Outperforming%20Larger%20Language%20Models%20with%20Less%20Training%20Data%20and%20Smaller%20Model%20Sizes.pdf)
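A minimal sketch of step 1 (eliciting rationales with few-shot CoT prompting). The demonstration text, the commented-out `llm.complete` call, and the "So the answer is" cue phrase are illustrative assumptions, not the paper's exact prompt template.

```python
def build_cot_prompt(demonstrations, new_input):
    """Assemble a few-shot CoT prompt from (x_p, r_p, y_p) triplets."""
    parts = []
    for x_p, r_p, y_p in demonstrations:
        # Each demonstration shows the rationale before the final label.
        parts.append(f"Q: {x_p}\nA: {r_p} So the answer is {y_p}.")
    parts.append(f"Q: {new_input}\nA:")
    return "\n\n".join(parts)


def parse_rationale_and_label(completion):
    """Split an LLM completion into (rationale_hat, label_hat) on the cue phrase."""
    rationale, _, label = completion.partition("So the answer is")
    return rationale.strip(), label.strip().rstrip(".")


# Hypothetical demonstration in the style of an NLI task.
demos = [
    ("A person on a horse jumps over a broken down airplane. "
     "Can we conclude the person is outdoors?",
     "A horse and a broken down airplane are typically found outdoors.",
     "yes"),
]
prompt = build_cot_prompt(demos, "A soccer game with multiple males playing. "
                                 "Can we conclude some men are playing a sport?")

# completion = llm.complete(prompt)  # hypothetical LLM API call
completion = ("Soccer is a sport, and multiple males playing implies some men. "
              "So the answer is yes.")
r_hat, y_hat = parse_rationale_and_label(completion)  # rationale r̂_i and label ŷ_i
```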
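A minimal sketch of step 2 (multi-task fine-tuning of the smaller model with [label] and [rationale] task prefixes), assuming a Hugging Face T5 checkpoint. The `t5-small` checkpoint and the `rationale_weight` value are assumptions; the paper trains larger T5 variants and treats the rationale-loss weight as a hyperparameter.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")  # assumed small student model
model = T5ForConditionalGeneration.from_pretrained("t5-small")


def multitask_loss(x, y_hat, r_hat, rationale_weight=1.0):
    """L = L_label + lambda * L_rationale, each a standard seq2seq cross-entropy."""
    def seq2seq_loss(prefix, target):
        enc = tokenizer(prefix + x, return_tensors="pt", truncation=True)
        labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
        return model(**enc, labels=labels).loss

    label_loss = seq2seq_loss("[label] ", y_hat)          # predict the LLM label ŷ_i
    rationale_loss = seq2seq_loss("[rationale] ", r_hat)  # predict the LLM rationale r̂_i
    return label_loss + rationale_weight * rationale_loss


loss = multitask_loss(
    x="A soccer game with multiple males playing. "
      "Can we conclude some men are playing a sport?",
    y_hat="yes",
    r_hat="Soccer is a sport, and multiple males playing implies some men.",
)
loss.backward()  # one optimization step per (x_i, ŷ_i, r̂_i) triple
```

Because the rationale is only a training-time signal under this framing, inference uses the [label] prefix alone, so the small model does not need to generate rationales at test time.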