The idea behind the generated knowledge approach is to ask the LLM to generate potentially useful information about a given question/prompt before generating a final response.
For example, say you want to write a small blog post about some animal, like Spirit bears. Before asking the LLM to write the blog post, you can ask it to generate some facts about Spirit bears. This will help the LLM write a more informative blog post. There are two simple ways of going about this.
The first approach asks the LLM to generate knowledge then write the post, all with a single prompt.
Generate 4 facts about the Kermode bear, then use these facts to write a short blog post using the information:
The Kermode bear, also known as the spirit bear, is a rare subspecies of the American black bear that is found in the coastal regions of British Columbia, Canada.
1. The Kermode bear is a unique subspecies of the American black bear, with approximately 10% of the population having white fur.
2. The Kermode bear is a symbol of the First Nations people of British Columbia, and is seen as a spirit animal.
3. The Kermode bear is an important part of the ecosystem, helping to disperse seeds and maintain the balance of the forest.
4. The Kermode bear is a protected species, and its population is monitored closely by the provincial government.
The Kermode bear is a unique and special animal that is found in the coastal regions of British Columbia, Canada. Approximately 10% of the population of this subspecies of the American black bear have white fur, making them a rare sight. The Kermode bear is a symbol of the First Nations people of British Columbia, and is seen as a spirit animal. It is an important part of the ecosystem, helping to disperse seeds and maintain the balance of the forest. The Kermode bear is a protected species, and its population is monitored closely by the provincial government. It is a beautiful and majestic animal, and it is important that we do our part to protect and conserve this species.
The second approach splits the task across two prompts. First, we prompt the LLM to generate facts about the bear:
Generate 10 facts about the Kermode bear:
1. The Kermode bear, also known as the Spirit Bear, is a rare subspecies of the American black bear found in British Columbia, Canada.
2. The Kermode bear has a unique white or cream-colored coat, which is caused by a recessive gene.
3. The Kermode bear is a symbol of hope and renewal for the First Nations people of British Columbia.
4. The Kermode bear is a protected species and is listed as a species of special concern by the Committee on the Status of Endangered Wildlife in Canada.
Then, we feed that information into another prompt to write the blog post:
1. The Kermode bear, also known as the Spirit Bear, is a rare subspecies of the American black bear found in British Columbia, Canada.
2. The Kermode bear has a unique white or cream-colored coat, which is caused by a recessive gene.
3. The Kermode bear is a symbol of hope and renewal for the First Nations people of British Columbia.
4. The Kermode bear is a protected species and is listed as a species of special concern by the Committee on the Status of Endangered Wildlife in Canada.
Use the above facts to write a one paragraph blog post about the Kermode bear:
The Kermode bear, also known as the Spirit Bear, is a rare subspecies of the American black bear found in British Columbia, Canada. This unique bear has a white or cream-colored coat, caused by a recessive gene, and is a symbol of hope and renewal for the First Nations people of British Columbia. The Kermode bear is a protected species and is listed as a species of special concern by the Committee on the Status of Endangered Wildlife in Canada, making it an important part of the region's biodiversity.
I recommend this second approach since it can produce longer content more reliably.
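The two-prompt flow can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: `llm` is a hypothetical callable that takes a prompt string and returns the completion text, and `fake_llm` is a canned stand-in so the example runs without a model. Swap in whatever client you actually use.

```python
from typing import Callable


def generate_then_write(llm: Callable[[str], str], subject: str, n_facts: int = 4) -> str:
    """Two-step generated knowledge: ask for facts, then ask for a post built on them."""
    # Step 1: knowledge generation.
    facts = llm(f"Generate {n_facts} facts about {subject}:").strip()

    # Step 2: knowledge integration. The facts sit above the writing instruction,
    # mirroring the second prompt shown in the text.
    writing_prompt = (
        f"{facts}\n"
        f"Use the above facts to write a one paragraph blog post about {subject}:"
    )
    return llm(writing_prompt)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (assumption, for illustration only)."""
    if prompt.startswith("Generate"):
        return "1. The Kermode bear is found in British Columbia, Canada."
    # Echo the first line of the prompt so the data flow is visible.
    return "A post based on: " + prompt.splitlines()[0]


post = generate_then_write(fake_llm, "the Kermode bear")
print(post)
```

Keeping the two calls separate also gives you a chance to inspect or edit the generated facts before they reach the writing prompt.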
The generated knowledge approach was originally introduced for a different task: answering difficult questions. Consider a question about Congo and South Africa that GPT-3 answers incorrectly.
If we first ask the LLM to generate facts about Congo and South Africa, we can then use that information to answer the question correctly. In theory, this is similar to CoT prompting, since we are effectively getting the LLM to generate intermediate reasoning in the form of related facts.
Let's start with the first step, knowledge generation. We can ask the LLM to generate facts about Congo and South Africa:
Next, let's use that knowledge to answer the question correctly. This is the knowledge integration step!
Although the use case above resembles the way generated knowledge was originally introduced, it is not exactly the same. The content below covers the more technical setting in which the approach was introduced. It follows the same two-step pattern we saw above: knowledge generation, then knowledge integration.
In the knowledge generation step, the LLM is asked to generate a set of facts about the question. The LLM is prompted in few-shot fashion as seen below. M different completions are generated using this same prompt (similar to the self-consistency approach).
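The knowledge generation step can be sketched as follows. This is a sketch under assumptions: the `Input:` / `Knowledge:` layout is one plausible few-shot format, `sample` is a hypothetical callable that returns one sampled completion per call, and dropping empty or repeated statements is a choice made here for tidiness.

```python
from typing import Callable, List, Tuple


def build_knowledge_prompt(demos: List[Tuple[str, str]], question: str) -> str:
    """Build a few-shot prompt from (question, knowledge) demonstrations."""
    blocks = [f"Input: {q}\nKnowledge: {k}" for q, k in demos]
    # The final block is left open so the model completes the knowledge statement.
    blocks.append(f"Input: {question}\nKnowledge:")
    return "\n\n".join(blocks)


def generate_knowledge(
    sample: Callable[[str], str],
    demos: List[Tuple[str, str]],
    question: str,
    m: int,
) -> List[str]:
    """Sample the same prompt m times and collect distinct, non-empty statements."""
    prompt = build_knowledge_prompt(demos, question)
    statements: List[str] = []
    for _ in range(m):
        statement = sample(prompt).strip()
        if statement and statement not in statements:
            statements.append(statement)
    return statements
```

For sampling to produce M different completions, `sample` needs to use a non-zero temperature or a similar source of randomness. With greedy decoding every call would return the same statement.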
Generated Knowledge Example (Liu et al.)
Next, we generate "knowledge augmented" questions and prompt the LLM with them to get final answers. The easiest way to understand this is to go through an example.
Let's assume we are attempting to answer the question "Most Kangaroos have <mask> limbs". Assume that the knowledge generation step produced two knowledge statements (M=2):
Knowledge 1: Kangaroos are marsupials that live in Australia.
Knowledge 2: Kangaroos are marsupials that have 5 limbs.
Now, we concatenate each knowledge with the question to generate knowledge augmented questions:
Knowledge Augmented Question 1: Most Kangaroos have <mask> limbs. Kangaroos are marsupials that live in Australia.
Knowledge Augmented Question 2: Most Kangaroos have <mask> limbs. Kangaroos are marsupials that have 5 limbs.
We then prompt the LLM with these knowledge augmented questions and get the final answer proposals:
Answer 1: 4
Answer 2: 5
We select the answer with the highest probability as the final answer. The highest probability could be the softmax probability of the answer token, or the log probability of the answer token(s).
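The selection step can be sketched like this. It is a sketch only: `score` is a hypothetical callable that returns the model's answer to a knowledge augmented question together with that answer's log probability, and the log probabilities in the demo are made up for illustration.

```python
from typing import Callable, List, Tuple


def select_answer(
    score: Callable[[str], Tuple[str, float]],
    question: str,
    knowledge_statements: List[str],
) -> Tuple[str, str, float]:
    """Return (answer, selected knowledge, log probability) for the best proposal."""
    if not knowledge_statements:
        raise ValueError("need at least one knowledge statement")

    candidates = []
    for knowledge in knowledge_statements:
        # Knowledge augmented question: the question followed by one statement.
        answer, logprob = score(f"{question} {knowledge}")
        candidates.append((logprob, answer, knowledge))

    # Keep the proposal the model is most confident in.
    logprob, answer, knowledge = max(candidates, key=lambda c: c[0])
    return answer, knowledge, logprob


question = "Most Kangaroos have <mask> limbs."
k1 = "Kangaroos are marsupials that live in Australia."
k2 = "Kangaroos are marsupials that have 5 limbs."

# Made-up scores standing in for a real model (assumption, for illustration only).
fake_scores = {
    f"{question} {k1}": ("4", -0.2),
    f"{question} {k2}": ("5", -1.5),
}

answer, selected, logprob = select_answer(fake_scores.__getitem__, question, [k1, k2])
print(answer, "|", selected)
```

With these invented scores the first proposal wins. A real model could just as well be more confident in the second one, in which case the incorrect statement would drive the final answer.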
The recitation-augmented approach is very similar to generated knowledge (basically the same). However, it is much less complex than the formal implementation of generated knowledge.
The idea here is to few-shot prompt the LLM to generate information and answer in the same step. Reciting (generating) knowledge and answering the question in a single step is the main difference from the generated knowledge approach.
To reiterate, this approach prompts the model with multiple (question, recitation, answer) exemplars, then asks the question. The authors note that this approach can be combined with self-consistency or multiple completion paths.
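A recitation-augmented prompt, combined with a simple majority vote over several completion paths, might look like the sketch below. The `Question:` / `Recitation:` / `Answer:` labels, the `sample` callable, and the answer parsing are all assumptions made for illustration, not the authors' exact format.

```python
from collections import Counter
from typing import Callable, List, Tuple

Exemplar = Tuple[str, str, str]  # (question, recitation, answer)


def build_recitation_prompt(exemplars: List[Exemplar], question: str) -> str:
    """Few-shot prompt that asks the model to recite, then answer, in one completion."""
    blocks = [f"Question: {q}\nRecitation: {r}\nAnswer: {a}" for q, r, a in exemplars]
    blocks.append(f"Question: {question}\nRecitation:")
    return "\n\n".join(blocks)


def parse_answer(completion: str) -> str:
    """Return the text after the last 'Answer:' marker, or '' if there is none."""
    _, marker, tail = completion.rpartition("Answer:")
    return tail.strip() if marker else ""


def answer_with_voting(
    sample: Callable[[str], str],
    exemplars: List[Exemplar],
    question: str,
    n_paths: int = 5,
) -> str:
    """Sample several recite-and-answer completions and return the most common answer."""
    prompt = build_recitation_prompt(exemplars, question)
    answers = [parse_answer(sample(prompt)) for _ in range(n_paths)]
    answers = [a for a in answers if a]  # drop completions with no parsable answer
    if not answers:
        raise ValueError("no completion contained an answer")
    return Counter(answers).most_common(1)[0][0]
```

Because the recitation and the answer come back in one completion, there is no separate integration prompt. The only post-processing is pulling the answer out of each sampled path.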
Generated knowledge shows improvements on various commonsense datasets.
The knowledge corresponding to the selected answer is called the selected knowledge.
In practice, you could take the most frequently occurring answer as the final one.
Sander Schulhoff is the Founder of Learn Prompting and an ML Researcher at the University of Maryland. He created the first open-source Prompt Engineering guide, reaching 3M+ people and teaching them to use tools like ChatGPT. Sander also led a team behind Prompt Report, the most comprehensive study of prompting ever done, co-authored with researchers from the University of Maryland, OpenAI, Microsoft, Google, Princeton, Stanford, and other leading institutions. This 76-page survey analyzed 1,500+ academic papers and covered 200+ prompting techniques.