# Large Language Models Are Human-Level Prompt Engineers
- Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba
- University of Toronto
- ICLR 2023
## Introduction
### Automatic Prompt Engineer (APE) workflow
* We propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection
* Extensive experiments show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 24/24 Instruction Induction tasks and 17/21 curated BIG-Bench tasks.
* Automatic Prompt Engineer (APE) automatically generates instructions for a task specified via output demonstrations: it generates several instruction candidates, either via direct inference or a recursive process based on semantic similarity, executes them using the target model, and selects the most appropriate instruction based on computed evaluation scores.
* APE is able to surpass human performance when using the InstructGPT model
## Methodology
* APE first proposes a few candidate prompts, and then filters/refines the candidate set according to a chosen score function, ultimately choosing the instruction with the highest score.
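* A minimal sketch of this propose-score-select loop follows; `propose_llm` and `score` are hypothetical stand-ins for the proposal model and the chosen score function, and the proposal template is paraphrased rather than the paper's verbatim prompt:

```python
# Minimal sketch of the APE loop: propose instruction candidates from
# input-output demonstrations, score each one, keep the best.

def ape(demos, propose_llm, score, n_candidates=50):
    # 1. Propose: ask an LLM to guess the instruction behind the demos.
    proposal_prompt = (
        "I gave a friend an instruction. Based on it, they produced "
        "the following input-output pairs:\n"
        + "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
        + "\nThe instruction was:"
    )
    candidates = [propose_llm(proposal_prompt) for _ in range(n_candidates)]

    # 2./3. Score and select: keep the highest-scoring instruction.
    return max(candidates, key=lambda inst: score(inst, demos))
```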
### Reverse Mode Generation
* Although the "forward" model works out of the box for most pretrained LLMs, translating P(ρ | D_train, f(ρ) is high) into words requires custom engineering across different tasks. This is because instructions are typically found at the beginning of passages, while the "forward" model only generates text from left to right, which requires the instruction to be predicted at the end of the prompt.
* We consider "reverse" mode generation, which instead uses an LLM with infilling capabilities (a sketch contrasting the two modes follows this list)
* LLM with infilling capabilities—e.g., T5 (Raffel et al., 2020), GLM (Du et al., 2022), and InsertGPT (Bavarian et al., 2022)
* Infilling LLMs
* Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022.
* Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.
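* To make the contrast concrete, a sketch of illustrative forward vs. reverse proposal templates (paraphrased, not the paper's verbatim prompts; the antonym demos are illustrative):

```python
# Illustrative forward vs. reverse proposal templates (paraphrased).
demos_text = "Input: prove\nOutput: disprove\nInput: on\nOutput: off"

# Forward mode: the instruction must be generated at the *end* of the
# prompt, since a standard LLM only completes text left to right.
forward_prompt = (
    f"Here are some input-output pairs:\n{demos_text}\n"
    "The instruction was:"
)

# Reverse mode: an infilling model (e.g., InsertGPT) fills a blank
# *before* the demonstrations, where instructions naturally occur.
reverse_prompt = (
    "Instruction: [INSERT]\n"
    f"Here are some input-output pairs:\n{demos_text}"
)
```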
### Iterative Monte Carlo Search
* Instead of only sampling from the initial proposal, we explore the search space locally around the current best candidates, generating new instructions that are more likely to be successful.
* We call this variant iterative APE. At each stage, we evaluate a set of instructions and filter out candidates with low scores. Then, an LLM is asked to generate new instructions similar to those with high scores (see the sketch after this list). We provide the prompt used for resampling in Figure 3.
* Although this approach improves the overall quality of the proposal set U, the highest-scoring instruction tends to remain the same across stages. We conclude that iterative generation provides marginal improvement over the relative simplicity and effectiveness of the generative process described in Subsection 3.1.
* Therefore, we use APE without iterative search by default unless otherwise stated.
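* A minimal sketch of one filter-and-resample stage, assuming hypothetical `score` and `paraphrase_llm` helpers for the score function and the resampling LLM (the filtering fraction and resample count are illustrative):

```python
# One stage of iterative APE (sketch). `paraphrase_llm` is a hypothetical
# helper that asks an LLM for a variation of a given instruction;
# `score` is the chosen score function (e.g., execution accuracy).

def iterative_stage(candidates, demos, score, paraphrase_llm,
                    keep_frac=0.2, resamples_per_survivor=5):
    # Filter: keep only the highest-scoring instructions.
    ranked = sorted(candidates, key=lambda c: score(c, demos), reverse=True)
    survivors = ranked[: max(1, int(len(ranked) * keep_frac))]

    # Resample: ask the LLM for new instructions similar to the survivors.
    new_candidates = [paraphrase_llm(s)
                      for s in survivors
                      for _ in range(resamples_per_survivor)]
    return survivors + new_candidates
```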
### Score Function
#### Execution accuracy
* evaluating the quality of an instruction ρ using the execution accuracy metric proposed by Honovich et al. (2022)
* Or Honovich, Uri Shaham, Samuel R Bowman, and Omer Levy. Instruction induction: From few examples to natural language task descriptions. arXiv preprint arXiv:2205.10782, 2022.
* execution accuracy is simply defined as the 0-1 loss, f(ρ, Q, A) = 1[M([ρ; Q]) = A], where M is the target model and [ρ; Q] denotes the instruction ρ concatenated with the query Q (a code sketch follows this list)
* execution accuracy aligns better with the test performance across the tasks. Thus, we choose it as our default metric unless otherwise stated
* in some cases, f may instead be an order-invariant set matching loss, as described in Appendix A of Honovich et al. (2022)
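* As a concrete sketch, 0-1 execution accuracy over a set of demonstrations might be computed as follows; `model` is a hypothetical callable wrapping the target LLM, and the prompt layout is illustrative:

```python
# Sketch of 0-1 execution accuracy: an instruction scores 1 on a pair
# (Q, A) iff the target model M, prompted with [instruction; Q], outputs A.

def execution_accuracy(instruction, demos, model):
    hits = 0
    for question, answer in demos:
        prediction = model(f"{instruction}\n\nInput: {question}\nOutput:")
        hits += int(prediction.strip() == answer.strip())
    return hits / len(demos)
```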
#### Efficient score estimation
* Estimating the score over the entire training dataset for all instruction candidates can be expensive. To reduce the computation cost, we adopt a filtering scheme where a promising candidate receives more computation resources while a low-quality candidate receives less.
* We first evaluate all candidates with a small subset of the training dataset. For the candidates with a score greater than a certain threshold, we sample and evaluate a new non-overlapping subset from the training dataset to update the moving average of the score.
* we repeat this process until a small set of candidates is left, which are evaluated on the entire training dataset.
* This adaptive filtering scheme significantly improves computation efficiency: the full evaluation cost is paid only for high-quality candidates, while the cost for low-quality candidates is drastically reduced (sketched below).
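* A sketch of this multi-round filtering; the batch sizes and score threshold below are illustrative assumptions, not the paper's settings:

```python
import random

# Sketch of adaptive filtering: every candidate is first scored on a small
# batch; only candidates whose running mean clears the threshold graduate
# to the next, larger, non-overlapping batch.

def filter_candidates(candidates, dataset, score_on,
                      batch_sizes=(20, 50, 100), threshold=0.5):
    pool = random.sample(dataset, len(dataset))  # shuffled copy
    stats = {c: (0.0, 0) for c in candidates}    # candidate -> (mean, n seen)
    offset = 0
    for size in batch_sizes:
        batch = pool[offset:offset + size]
        offset += size
        survivors = []
        for c in candidates:
            mean, n = stats[c]
            s = score_on(c, batch)  # mean score of c on this batch
            # Update the moving average with the new non-overlapping batch.
            mean = (mean * n + s * len(batch)) / (n + len(batch))
            stats[c] = (mean, n + len(batch))
            if mean >= threshold:
                survivors.append(c)
        candidates = survivors
    # The few remaining candidates would then be scored on the full set.
    return candidates
```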
## Experiments
* Our experiments show that APE can find prompts that improve task performance, performing on par with or even better than those authored by humans.
* APE also often discovers insightful prompting tricks that can be successfully transferred to new tasks.
* APE with InstructGPT outperforms human-engineered prompts, obtaining an interquartile mean (IQM) of 0.810 vs. humans' 0.749.
* One of the most influential recent findings in prompt engineering was the discovery (Kojima et al., 2022) that LLMs can be made to produce chain-of-thought reasoning simply by prepending "Let's think step by step." to the beginning of the LLM's response.
* We use APE to find a prompt starting with "Let's" that maximizes the likelihood of these correct reasoning steps (a sketch of this search follows this list).
* We believe this general workflow represents a common use case for APE, where prompt engineers use APE to optimize parts of their existing templates to improve performance.
* APE produces the prompt “Let’s work this out in a step by step way to be sure we have the right answer.”
* This generated prompt further improves performance from 78.7 to 82.0 on MultiArith and from 40.7 to 43.0 on GSM8K
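* A sketch of that search, assuming a hypothetical `logprob(prompt, continuation)` helper that returns the log-likelihood the target model assigns to `continuation`:

```python
# Sketch of optimizing the zero-shot chain-of-thought trigger: each
# candidate phrase starting with "Let's" is scored by the total
# log-likelihood of the gold step-by-step solutions under the model.

def best_cot_trigger(candidates, problems, logprob):
    def total_loglik(trigger):
        return sum(logprob(f"Q: {question}\nA: {trigger}", reasoning)
                   for question, reasoning in problems)
    return max(candidates, key=total_loglik)

candidates = [
    "Let's think step by step.",
    "Let's work this out in a step by step way to be "
    "sure we have the right answer.",
]
```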
### Instruction Induction
* We compare our method against two baselines: human prompt engineers (Human) and the model-generated instruction algorithm proposed by Honovich et al. (2022). This algorithm can be thought of as a greedy version of APE, without a search and selection process; thus, we refer to it as "Greedy".
## Quantitative Analysis
* How does the proposal quality change as we increase the model size?
* larger models tend to produce better proposal distributions than smaller ones, as do the models that were fine-tuned to follow human instructions
* Figure 7 (Left) shows a monotonically increasing trend with diminishing returns; human-level performance is achieved with 64 instruction samples
* we choose 50 as our default sample size.
* How transferable are the generated instructions?
* We investigate whether APE can be used to steer a model that was not involved in the instruction generation and selection process. As shown in Figure 17, there is a significant performance drop when we use instructions generated with InstructGPT to steer the GPT-3 model, and vice versa.
* This suggests that alignment between the scoring model and the execution model is crucial: instructions generated by InstructGPT work best for InstructGPT itself but do not transfer well to a different model like GPT-3.
## Cost Analysis
* APE instructions are context condensers
* APE instructions reduce the number of prompt tokens by up to an order of magnitude compared to in-context learning
## Conclusion
* We automate the prompt engineering process by formulating it as a black-box optimization problem, which we propose to solve using efficient search algorithms guided by LLMs. Our method achieves human-level performance on various tasks with minimal human input.
* This work builds the foundation to control and steer generative artificial intelligence.