We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs. Instead of directly adjusting LLMs, our method employs a small tunable policy model (e.g., T5) to generate an auxiliary directional stimulus prompt for each input instance. These directional stimulus prompts act as nuanced, instance-specific hints and clues to guide LLMs in generating desired outcomes, such as including specific keywords in the generated summary. Our approach sidesteps the challenges of direct LLM tuning by optimizing the policy model to explore directional stimulus prompts that align LLMs with desired behaviors. The policy model can be optimized through 1) supervised fine-tuning using labeled data and 2) reinforcement learning from offline or online rewards based on the LLM’s output. We assess our method across summarization, dialogue response generation, and chain-of-thought reasoning tasks. Our experiments demonstrate that the framework consistently improves LLMs’ (e.g., ChatGPT, Codex, InstructGPT) performance on these supervised tasks using minimal labeled data. Notably, using just 80 dialogues on the MultiWOZ dataset, our approach enhances ChatGPT’s performance by an impressive 41.4%, matching or surpassing some fully supervised state-of-the-art models. Additionally, the instance-specific chain-of-thought prompt generated by our approach improves InstructGPT’s reasoning accuracy compared to human-crafted or automatically generated prompts. The code and data are publicly available.
We utilize a relatively small and tunable LM (e.g., T5) as the policy model to generate the directional stimulus prompt for each input query. This approach enables us to sidestep the direct optimization of black-box LLMs by optimizing the small tunable policy model instead. We first train the policy model through supervised fine-tuning (SFT) on a small collection of labeled data. After supervised fine-tuning, we further optimize the policy model with reinforcement learning (RL) to explore better directional stimulus prompts. During RL training, we maximize a reward defined by downstream performance measures, or any other measure, of the LLM’s output conditioned on the stimulus generated by the policy model.
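As a concrete illustration, the sketch below follows this two-stage recipe for the summarization setting. It assumes a Hugging Face T5 policy model, hypothetical helpers `call_llm` (a wrapper around the black-box LLM API) and `rouge_l` (a scalar reward), and uses a simplified REINFORCE-style update rather than the NLPO algorithm used in our experiments; it is a sketch of the idea, not the exact implementation.

```python
# Minimal sketch of the two-stage optimization of the policy model.
# Hypothetical helpers: `call_llm` queries the black-box LLM (e.g., ChatGPT),
# `rouge_l` returns a scalar reward. The RL update is a simplified
# REINFORCE-style surrogate; the actual training uses the NLPO algorithm.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
policy = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

def sft_step(article: str, pseudo_stimulus: str):
    """Stage 1: supervised fine-tuning on (input, pseudo-stimulus) pairs."""
    inputs = tokenizer(article, return_tensors="pt", truncation=True)
    labels = tokenizer(pseudo_stimulus, return_tensors="pt", truncation=True).input_ids
    loss = policy(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def rl_step(article: str, reference_summary: str):
    """Stage 2: sample a stimulus, query the LLM, and reward-weight its log-likelihood."""
    inputs = tokenizer(article, return_tensors="pt", truncation=True)
    sampled = policy.generate(**inputs, do_sample=True, max_new_tokens=32)
    stimulus = tokenizer.decode(sampled[0], skip_special_tokens=True)
    prompt = (f"Article: {article}\n"
              f"Hint (keywords): {stimulus}\n"
              "Summarize the article, reflecting the hint keywords.")
    summary = call_llm(prompt)                    # black-box LLM, never updated
    reward = rouge_l(summary, reference_summary)  # downstream metric as reward
    labels = sampled[:, 1:]                       # drop the decoder start token
    nll = policy(**inputs, labels=labels).loss    # negative log-likelihood of the sample
    (reward * nll).backward()                     # reward-weighted policy gradient
    optimizer.step()
    optimizer.zero_grad()
```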
We evaluate the performance of ChatGPT with standard prompting and with our approach DSP, trained with SFT alone or with SFT followed by RL (SFT+RL), on varying sizes of training data, and present the results in Figure 3. All the evaluation scores improve with our proposed DSP compared with standard prompting. Specifically, the supervised fine-tuned policy model generates the stimulus that effectively guides ChatGPT to generate summaries closely aligned with the reference summaries, leading to improved benchmark performance. Furthermore, additional fine-tuning of the policy model with RL yields further gains, indicating the effectiveness of RL in exploring a better directional stimulus that maximizes the reward. The performance improvement becomes more significant as the size of the training data increases. Despite using a small collection of only 1,000 to 4,000 samples to keep API usage costs low, DSP consistently improves ChatGPT’s ROUGE, BLEU, and METEOR scores by 1-2 points, even though ChatGPT already achieves strong performance with standard prompting.
However, due to the discrepancy between the semantic-based metric BERTScore and the overlap-based metric ROUGE, which is used as the reward, the improvement in BERTScore after RL training is relatively less pronounced. Figure 4 presents the change of the training reward and the ROUGE-1 score on the validation set during training on 1,000 samples. We can see that the performance closely tracks the training reward, and that training is relatively stable using the NLPO algorithm.
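For reference, the automatic metrics discussed above can be computed roughly as follows with the Hugging Face evaluate library; this is only a sketch, and the exact metric implementations and settings used in our experiments may differ.

```python
# Sketch of computing the automatic metrics discussed above with the Hugging Face
# `evaluate` library; exact metric variants and settings in the paper may differ.
from evaluate import load

rouge, bleu, meteor, bertscore = (load(m) for m in ("rouge", "bleu", "meteor", "bertscore"))

def score(predictions: list[str], references: list[str]) -> dict:
    # One reference string per prediction.
    out = dict(rouge.compute(predictions=predictions, references=references))
    out["bleu"] = bleu.compute(predictions=predictions,
                               references=[[r] for r in references])["bleu"]
    out["meteor"] = meteor.compute(predictions=predictions,
                                   references=references)["meteor"]
    bs = bertscore.compute(predictions=predictions, references=references, lang="en")
    out["bertscore_f1"] = sum(bs["f1"]) / len(bs["f1"])
    return out
```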
To gain a better understanding of the generated summaries guided by keywords, we employ GPT-4 to evaluate the summaries. The results are shown in Figure 5. We find that GPT-4 produces reasonable and detailed explanations of its assessments. On our test set of 500 samples, DSP-generated summaries were favored 255 times (51.0%), summaries generated with standard prompting were favored 222 times (44.4%), and a tie was observed in 23 cases (4.6%).
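A sketch of how such a pairwise GPT-4 judgment might be obtained is shown below; it assumes the OpenAI chat completions client, and the actual evaluation prompt used for Figure 5 may differ.

```python
# Sketch of a pairwise GPT-4 preference judgment between two summaries.
# Assumes the OpenAI Python client; the evaluation prompt used in the paper may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(article: str, summary_a: str, summary_b: str) -> str:
    prompt = (
        "Given the article below, decide which summary is better and briefly explain why.\n\n"
        f"Article: {article}\n\n"
        f"Summary A: {summary_a}\n\n"
        f"Summary B: {summary_b}\n\n"
        "Answer with 'A', 'B', or 'Tie', followed by your explanation."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```

In practice, the order of the two summaries would typically be randomized across samples to mitigate position bias in the judge.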
Table 1 summarizes the overall performance comparison,
from which we obtain the following observations:
(1) Our approach DSP significantly improves the
success and inform rates of Codex and ChatGPT, indicating that they better understand the scenario
and generate appropriate responses that help users in completing their tasks.
(2) There is no
improvement in the corpus-level BLEU score, possibly because the LLMs generate responses with
different speaking styles and vocabulary since they do not see oracle system responses. Nevertheless,
the high success and inform rates demonstrate the usefulness of our approach in delivering helpful and
reliable responses.
(3) RL training encourages the policy model to explore stimulus that the LLM prefers, whereas supervised fine-tuning alone may merely generate stimulus closely aligned with the pseudo-labeled data, which is not necessarily optimal.
(4) Our approach achieves notable
success with only 80 dialogues, surpassing several fully trained TOD models, particularly in terms
of Success and Inform rates.
As can be seen in Table 2, InstructGPT’s performance varies significantly across different task-specific prompts. Compared to the 14 task-specific human-designed prompts, DSP enhances performance with instance-specific prompts; it also outperforms the prompt discovered by the APE approach. Supervised fine-tuning of the policy model alone, on the dataset comprising the 14 human-designed prompts, does not lead to peak performance. After fine-tuning with RL, the policy model is encouraged to explore better instance-specific trigger prompts, further improving performance.
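To make this concrete, the sketch below shows one plausible way an instance-specific trigger prompt from the policy model could be spliced into a reasoning query; `generate_trigger` and `call_llm` are hypothetical wrappers around the policy model and InstructGPT, and the exact prompt format used in our experiments may differ.

```python
# Sketch of using an instance-specific chain-of-thought trigger prompt.
# `generate_trigger` (policy model) and `call_llm` (InstructGPT) are hypothetical wrappers;
# the exact prompt format used in the paper may differ.
def answer_with_cot(question: str) -> str:
    trigger = generate_trigger(question)  # an instance-tailored trigger, in the spirit of
                                          # hand-crafted triggers such as "Let's think step by step."
    prompt = f"Q: {question}\nA: {trigger}"
    return call_llm(prompt)
```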
In this paper, we introduce Directional Stimulus Prompting (DSP), a new prompting framework to provide black-box LLMs with fine-grained and instance-specific guidance toward the desired outputs. We use a tunable policy model to generate the directional stimulus to provide such guidance and convert the optimization of black-box LLMs to that of the policy model. Experimental results demonstrate the effectiveness of our approach. DSP not only enables better control and guidance for black-box LLMs, but also effectively utilizes labeled data. Furthermore, the generated stimulus provides valuable insights and interpretations of LLMs’ behaviors. In this work, we use heuristically selected or annotated pseudo-stimulus data for supervised fine-tuning of the policy model. For future work, we hope to explore the possibility of using a “machine language” between the policy model and the LLMs that might not be intuitively preferred by humans but can better convey guidance information, as well as other forms of directional stimulus beyond text.
@misc{li2023guiding,
  title={Guiding Large Language Models via Directional Stimulus Prompting},
  author={Zekun Li and Baolin Peng and Pengcheng He and Michel Galley and Jianfeng Gao and Xifeng Yan},
  year={2023},
  eprint={2302.11520},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}