We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs. Instead of directly adjusting LLMs, our method employs a small tunable policy model (e.g., T5) to generate an auxiliary directional stimulus prompt for each input instance. These directional stimulus prompts act as nuanced, instance-specific hints and clues to guide LLMs in generating desired outcomes, such as including specific keywords in the generated summary. Our approach sidesteps the challenges of direct LLM tuning by optimizing the policy model to explore directional stimulus prompts that align LLMs with desired behaviors. The policy model can be optimized through 1) supervised fine-tuning using labeled data and 2) reinforcement learning from offline or online rewards based on the LLM’s output. We assess our method across summarization, dialogue response generation, and chain-of-thought reasoning tasks. Our experiments demonstrate that the framework consistently improves LLMs’ (e.g., ChatGPT, Codex, InstructGPT) performance on these supervised tasks using minimal labeled data. Notably, using just 80 dialogues on the MultiWOZ dataset, our approach enhances ChatGPT’s performance by an impressive 41.4%, matching or surpassing some fully supervised state-of-the-art models. Additionally, the instance-specific chain-of-thought prompt generated by our approach improves InstructGPT’s reasoning accuracy compared to human-crafted or automatically generated prompts. The code and data are publicly available.
We utilize a relatively small and tunable LM (e.g., T5) as the policy model to generate the directional stimulus prompt for each input query. This approach enables us to sidestep the direct optimization of black-box LLMs by optimizing the small tunable policy model instead. We first train the policy model through supervised fine-tuning (SFT) on a small collection of labeled data. After supervised fine-tuning, we further optimize the policy model with reinforcement learning (RL) to explore better directional stimulus prompts. During RL training, we maximize a reward defined by downstream performance measures, or any other measure, of the LLM’s output conditioned on the stimulus generated by the policy model.
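As a concrete illustration, the sketch below follows this two-stage recipe for the summarization setting. It assumes a Hugging Face T5 policy model, hypothetical helpers `call_llm` (a wrapper around the black-box LLM API) and `rouge_l` (a scalar reward), and uses a simplified REINFORCE-style update rather than the NLPO algorithm used in our experiments; it is a sketch of the idea, not the exact implementation.

```python
# Minimal sketch of the two-stage optimization of the policy model.
# Hypothetical helpers: `call_llm` queries the black-box LLM (e.g., ChatGPT),
# `rouge_l` returns a scalar reward. The RL update is a simplified
# REINFORCE-style surrogate; the actual training uses the NLPO algorithm.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
policy = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

def sft_step(article: str, pseudo_stimulus: str):
    """Stage 1: supervised fine-tuning on (input, pseudo-stimulus) pairs."""
    inputs = tokenizer(article, return_tensors="pt", truncation=True)
    labels = tokenizer(pseudo_stimulus, return_tensors="pt", truncation=True).input_ids
    loss = policy(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def rl_step(article: str, reference_summary: str):
    """Stage 2: sample a stimulus, query the LLM, and reward-weight its log-likelihood."""
    inputs = tokenizer(article, return_tensors="pt", truncation=True)
    sampled = policy.generate(**inputs, do_sample=True, max_new_tokens=32)
    stimulus = tokenizer.decode(sampled[0], skip_special_tokens=True)
    prompt = (f"Article: {article}\n"
              f"Hint (keywords): {stimulus}\n"
              "Summarize the article, reflecting the hint keywords.")
    summary = call_llm(prompt)                    # black-box LLM, never updated
    reward = rouge_l(summary, reference_summary)  # downstream metric as reward
    labels = sampled[:, 1:]                       # drop the decoder start token
    nll = policy(**inputs, labels=labels).loss    # negative log-likelihood of the sample
    (reward * nll).backward()                     # reward-weighted policy gradient
    optimizer.step()
    optimizer.zero_grad()
```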
We evaluate the performance of ChatGPT with standard prompting and with our approach DSP, trained with SFT alone or with SFT followed by RL (SFT+RL), on varying sizes of training data, and present the results in Figure 3. All the evaluation scores improve with our proposed DSP compared with standard prompting. Specifically, the supervised fine-tuned policy model generates the stimulus that effectively guides ChatGPT to generate summaries closely aligned with the reference summaries, leading to improved benchmark performance. Furthermore, additional fine-tuning of the policy model with RL yields further gains, indicating the effectiveness of RL in exploring a better directional stimulus that maximizes the reward. The performance improvement becomes more significant as the size of the training data increases. Despite using a small collection of only 1,000 to 4,000 samples to keep API usage costs low, DSP consistently improves ChatGPT’s ROUGE, BLEU, and METEOR scores by 1-2 points, even though ChatGPT already achieves strong performance with standard prompting.
However, due to the discrepancy between the semantic-based metric BERTScore and the overlap-based metric ROUGE, which is used as the reward, the improvement in BERTScore after RL training is relatively less pronounced. Figure 4 presents the change of the training reward and the ROUGE-1 score on the validation set during training on 1,000 samples. We can see that the performance closely tracks the training reward, and that training is relatively stable using the NLPO algorithm.
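For reference, the automatic metrics discussed above can be computed roughly as follows with the Hugging Face evaluate library; this is only a sketch, and the exact metric implementations and settings used in our experiments may differ.

```python
# Sketch of computing the automatic metrics discussed above with the Hugging Face
# `evaluate` library; exact metric variants and settings in the paper may differ.
from evaluate import load

rouge, bleu, meteor, bertscore = (load(m) for m in ("rouge", "bleu", "meteor", "bertscore"))

def score(predictions: list[str], references: list[str]) -> dict:
    # One reference string per prediction.
    out = dict(rouge.compute(predictions=predictions, references=references))
    out["bleu"] = bleu.compute(predictions=predictions,
                               references=[[r] for r in references])["bleu"]
    out["meteor"] = meteor.compute(predictions=predictions,
                                   references=references)["meteor"]
    bs = bertscore.compute(predictions=predictions, references=references, lang="en")
    out["bertscore_f1"] = sum(bs["f1"]) / len(bs["f1"])
    return out
```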
To gain a better understanding of the generated summaries guided by keywords, we employ GPT-4 to evaluate the summaries. The results are shown in Figure 5. We find that GPT-4 produces reasonable and detailed explanations of its assessments. On our test set of 500 samples, DSP-generated summaries were favored 255 times (51.0%), summaries generated with standard prompting were favored 222 times (44.4%), and a tie was observed in 23 cases (4.6%).
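A sketch of how such a pairwise GPT-4 judgment might be obtained is shown below; it assumes the OpenAI chat completions client, and the actual evaluation prompt used for Figure 5 may differ.

```python
# Sketch of a pairwise GPT-4 preference judgment between two summaries.
# Assumes the OpenAI Python client; the evaluation prompt used in the paper may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(article: str, summary_a: str, summary_b: str) -> str:
    prompt = (
        "Given the article below, decide which summary is better and briefly explain why.\n\n"
        f"Article: {article}\n\n"
        f"Summary A: {summary_a}\n\n"
        f"Summary B: {summary_b}\n\n"
        "Answer with 'A', 'B', or 'Tie', followed by your explanation."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```

In practice, the order of the two summaries would typically be randomized across samples to mitigate position bias in the judge.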
Table 1 summarizes the overall performance comparison,
from which we obtain the following observations:
(1) Our approach DSP significantly improves the
success and inform rates of Codex and ChatGPT, indicating that they better understand the scenario
and generate appropriate responses that help users in completing their tasks.
(2) There is no
improvement in the corpus-level BLEU score, possibly because the LLMs generate responses with
different speaking styles and vocabulary since they do not see oracle system responses. Nevertheless,
the high success and inform rates demonstrate the usefulness of our approach in delivering helpful and
reliable responses.
(3) RL training encourages the policy model to explore stimulus that the LLM prefers, whereas supervised fine-tuning alone may merely generate stimulus closely aligned with the pseudo-labeled data, which is not necessarily optimal.
(4) Our approach achieves notable
success with only 80 dialogues, surpassing several fully trained TOD models, particularly in terms
of Success and Inform rates.
As can be seen in Table 2, InstructGPT’s performance varies significantly across different task-specific prompts. Compared to the 14 task-specific human-designed prompts, DSP enhances performance with instance-specific prompts; it also outperforms the prompt discovered by the APE approach. Supervised fine-tuning of the policy model alone, on the dataset comprising the 14 human-designed prompts, does not lead to peak performance. After fine-tuning with RL, the policy model is encouraged to explore better instance-specific trigger prompts, further improving performance.
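To make this concrete, the sketch below shows one plausible way an instance-specific trigger prompt from the policy model could be spliced into a reasoning query; `generate_trigger` and `call_llm` are hypothetical wrappers around the policy model and InstructGPT, and the exact prompt format used in our experiments may differ.

```python
# Sketch of using an instance-specific chain-of-thought trigger prompt.
# `generate_trigger` (policy model) and `call_llm` (InstructGPT) are hypothetical wrappers;
# the exact prompt format used in the paper may differ.
def answer_with_cot(question: str) -> str:
    trigger = generate_trigger(question)  # an instance-tailored trigger, in the spirit of
                                          # hand-crafted triggers such as "Let's think step by step."
    prompt = f"Q: {question}\nA: {trigger}"
    return call_llm(prompt)
```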
In this paper, we introduce Directional Stimulus Prompting (DSP), a new prompting framework to provide black-box LLMs with fine-grained and instance-specific guidance toward the desired outputs. We use a tunable policy model to generate the directional stimulus to provide such guidance and convert the optimization of black-box LLMs to that of the policy model. Experimental results demonstrate the effectiveness of our approach. DSP not only enables better control and guidance for black-box LLMs, but also effectively utilizes labeled data. Furthermore, the generated stimulus provides valuable insights and interpretations of LLMs’ behaviors. In this work, we use heuristically selected or annotated pseudo-stimulus data for supervised fine-tuning of the policy model. For future work, we hope to explore the possibility of using a “machine language” between the policy model and the LLMs that might not be intuitively preferred by humans but can better convey guidance information, as well as other forms of directional stimulus beyond text.
@misc{li2023guiding,
  title={Guiding Large Language Models via Directional Stimulus Prompting},
  author={Zekun Li and Baolin Peng and Pengcheng He and Michel Galley and Jianfeng Gao and Xifeng Yan},
  year={2023},
  eprint={2302.11520},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}