Is Role Prompting Effective?

September 16th, 2024

7 minutes

🟢 Easy reading level

We recently posted a tweet stating that role prompting doesn’t work. The response was overwhelming—it went viral and sparked significant debate.

Here’s where I stand now:

Role prompting does not reliably boost accuracy for tasks like reasoning or trivia in state-of-the-art (SOTA) models like GPT-4 or Claude 3.5. However, role prompting can still enhance open-ended tasks like content creation by controlling tone and style. It may have been more impactful in earlier models, such as GPT-3, and perhaps offers marginal benefits with ChatGPT.

In this blog post, I’ll walk you through my updated thoughts:

  • A definition of role prompting
  • An analysis of where it works and where it doesn't
  • What the empirical data tells us

What is Role Prompting?

Role prompting, or role-based prompting, is a technique where you assign a persona or role to a language model like ChatGPT to guide its tone, style, and behavior. This method is commonly used to make AI outputs more tailored and engaging for specific tasks, such as creative writing, technical explanations, or roleplay scenarios.

Here’s a classic example of role-based prompting in action:

Prompt:

You are a [food critic]. Write a review of an Italian restaurant.

In this prompt, the role is "food critic". In general, roles can range from occupational roles like "doctor" or "historian" to interpersonal roles like "friend" or "mentor." The idea is to shape the AI’s response to align with the assigned persona.
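In practice, the role usually goes into the system message of a chat API call (though prepending it to the user message, as in the example above, works too). Here is a minimal sketch using the OpenAI Python SDK; the model name and role text are illustrative, not prescribed by this post:

# Minimal sketch of role prompting via a system message, using the
# OpenAI Python SDK. Model name and role text are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "You are a food critic."},
        {"role": "user", "content": "Write a review of an Italian restaurant."},
    ],
)
print(response.choices[0].message.content)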

Examples of Role Prompting

  1. Creative Writing:

Prompt:

You are a pirate. Write a short story about a treasure hunt.

  2. Technical Explanation:

Prompt:

You are a data scientist. Explain the concept of overfitting in machine learning.

  3. Customer Support:

Prompt:

You are a customer service representative. Write an empathetic email to a customer experiencing a delayed shipment.

Role prompting leverages the language model’s vast training data to mimic the tone and style associated with different roles. This makes it particularly useful for creative and open-ended tasks.

Does Role Prompting Work?

In our tweet, we said that role prompting does not work. In reality, its effectiveness depends on the task. Let's break it down:

1. Role Prompting for Creative and Open-Ended Tasks

Role prompting excels in tasks requiring creativity or style. By assigning a persona, you can make AI outputs more engaging and contextually appropriate.

Example: Writing a Pirate-Themed Paragraph

Without role prompting:

Prompt:

Write a paragraph about sailing.

AI Output:

Sailing is the art of navigating a boat or ship using sails harnessing the power of the wind. It combines skill, knowledge, and experience in understanding wind patterns, currents, and weather conditions. Sailors must adeptly handle ropes, sails, and the tiller or wheel to maneuver their vessel efficiently and safely. Whether racing in regattas, cruising leisurely along coastlines, or crossing vast oceans, sailing offers a unique connection to nature and a sense of freedom. It demands both physical agility and mental acuity, making it a challenging yet rewarding pursuit for enthusiasts around the world.

With role prompting:

Prompt:

Write a paragraph about sailing and speak like a pirate.

AI Output:

Arrr, matey! Sailin' the high seas be a grand adventure, it be! With the wind in yer sails and the salt air in yer lungs, ye'll be chartin' courses through treacherous waters and plunderin' booty from distant lands. Keep a weather eye out fer storms and the dreaded kraken, or ye might find yerself in Davy Jones' locker! Hoist the Jolly Roger, man the capstan, and let the waves be yer guide, for the life of a sailor be one of freedom and peril on the boundless blue!

As you can see, role prompting adds flair and specificity, making it ideal for storytelling, dialogue, or tone-sensitive outputs.

2. Role Prompting for Accuracy

Role prompting is often believed to enhance performance on tasks requiring reasoning, problem-solving, or trivia knowledge. The assumption is that assigning a relevant role, such as "Math Professor" or "Geographer," can help the AI model focus its responses, leading to better accuracy. Let’s explore this idea in detail.

Hypothesis: Can Role Prompting Improve Accuracy?

Consider the following example where the AI is prompted as a math professor:

Prompt:

You are a math professor.

Solve the following problem:

What are the roots of x^2 + 44 - 200x?

I have personally found in the past that adding the role "Geographer" could improve performance on questions relating to African country sizes (on GPT-3.5).

However, I have suspected for some time that role prompts don't really work, particularly with newer models.

Empirical Benchmarking: Testing Role Prompts

As part of writing The Prompt Report, we conducted an extensive study to evaluate the effectiveness of role prompting. The experiment tested 12 role prompts and 4 other prompting techniques on 2,000 MMLU questions using GPT-4-turbo.

Notably, 2-shot Chain-of-Thought prompting consistently outperformed role prompts for reasoning tasks. This supports my belief that role prompting doesn’t reliably improve accuracy in SOTA models.
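If you want to run a comparison like this yourself, the core loop is simple. Below is a rough sketch, assuming the OpenAI Python SDK; the roles, sample question, model name, and naive answer extraction are placeholders, not the actual Prompt Report harness:

# Rough sketch of a role-prompt accuracy benchmark. The roles,
# questions, and answer extraction are placeholders for illustration.
from openai import OpenAI

client = OpenAI()

ROLES = {
    "baseline": None,
    "math professor": "You are a math professor.",
    "geographer": "You are a geographer.",
}

# Each item: (multiple-choice question text, correct letter).
QUESTIONS = [
    ("Which planet is largest?\nA. Earth\nB. Jupiter\nC. Mars\nD. Venus\n"
     "Answer with a single letter.", "B"),
]

def accuracy(role_text):
    correct = 0
    for question, answer in QUESTIONS:
        messages = []
        if role_text:
            messages.append({"role": "system", "content": role_text})
        messages.append({"role": "user", "content": question})
        reply = client.chat.completions.create(
            model="gpt-4-turbo", messages=messages, temperature=0
        )
        # Naive extraction: take the first character of the reply.
        if reply.choices[0].message.content.strip()[:1].upper() == answer:
            correct += 1
    return correct / len(QUESTIONS)

for name, role_text in ROLES.items():
    print(f"{name}: {accuracy(role_text):.1%}")

A real benchmark would use a proper MMLU question set and more robust answer parsing, but the structure is the same: hold everything constant except the role, and compare accuracy.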

The "Idiot" vs. "Genius" Test

More notably, I created two roles that I expected to produce very different results: an "Idiot" prompt and a "Genius" prompt:

Idiot Prompt:

You are intellectually challenged, lacking problem-solving skills, prone to errors, and struggle with basic concepts.

You have a limited understanding of complex subjects and cannot think straight. You can't solve problems well, in fact, you can't solve them at all.

You are a terrible, dumb, stupid, and idiotic person. You fail at everything you do. You are a nobody and can't do anything correctly.

Genius Prompt:

You are a genius level Ivy league Professor. Your work is of the highest grade.

You always think out your problem solving steps in incredible detail. You always get problems correct and never make mistakes. You can also break any problem into its constituent parts in the most intelligent way possible.

Nothing gets past you. You are omniscient, omnipotent, and omnipresent. You are a mathematical God.

The "Idiot" prompt outperformed the "Genius" prompt by 2.2 percentage points. In fact, the "Genius" prompt was the worst performing prompt overall.

This experiment gave me further confidence that role prompting doesn't improve accuracy. That result, combined with anecdotal experience using roles on pre- and post-ChatGPT models, led to the position stated at the beginning of this article.

Future Work

If I were back in my NLP lab, here is the approximate experiment I would run to determine whether role prompting works, and to what degree. I’d focus on the following:

  • Expanded Testing Across Models: Compare role prompting on pre-ChatGPT models like GPT-3 with SOTA models like GPT-4, Claude 2, and Claude 3.5. Include open-source models such as LLaMA 2 and LLaMA 3.
  • Task-Specific Benchmarks: Use datasets like MMLU and MMLU Pro to evaluate role prompts across diverse tasks.
  • Combined Techniques: Experiment with hybrid prompts, combining roles with methods like Chain-of-Thought reasoning or few-shot examples (a sketch of one such hybrid follows this list).
  • Controlled Ablation Studies: Remove variables like role phrasing to isolate the effects of role prompting from other factors.
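To make the "Combined Techniques" item concrete, here is one way a hybrid prompt might look: a role stacked on top of 2-shot Chain-of-Thought exemplars. The role and exemplars below are hypothetical, not drawn from our experiments:

# Hypothetical hybrid prompt: a role plus 2-shot Chain-of-Thought
# exemplars. The role and exemplars are illustrative only.
HYBRID_PROMPT = """You are a math professor.

Q: A train travels 60 miles in 1.5 hours. What is its average speed?
A: Speed is distance divided by time: 60 / 1.5 = 40. The answer is 40 mph.

Q: If 3x + 6 = 21, what is x?
A: Subtract 6 from both sides: 3x = 15. Divide by 3: x = 5. The answer is 5.

Q: {question}
A: Let's think step by step."""

print(HYBRID_PROMPT.format(question="What are the roots of x^2 + 44 - 200x?"))

An ablation would then compare this against the same prompt with the role line removed, isolating the role's contribution from that of the exemplars.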
Note:

If you’re interested in collaborating on or funding this research, contact me at [email protected].

Further Reading

[Figure] A scatterplot of model performance when prompted with different roles. Source: "When 'A Helpful Assistant' Is Not Really Helpful".

It shows the performance of different role prompts on two LLMs across 2,457 MMLU questions. I now believe that the decimal values are accuracy scores on MMLU. If this is the case, I don't believe there are any statistically significant differences between roles, even though the authors note that "adding interpersonal roles in prompts consistently improves the models' performance over a range of questions". Note that their experiments were conducted on pre-ChatGPT models, so even if they do show performance improvements, those may not transfer to newer models.

  • For open-ended generation, take a look at this paper.
  • Also take a look at MMLU Pro, a new benchmark.
  • This paper seems to analyze role prompting to some degree (even though they don't explicitly mention it).
  • Ethan Mollick's blog discusses the above paper.

Conclusion

My position is as stated at the beginning of the article. As a broad generalization, I don't think that role prompting works for accuracy-based tasks on recent models.

If you disagree, please ping @learnprompting on Twitter or email me at [email protected].

We thank everyone who gave emphatic feedback on the original post, on both sides of the issue; that feedback helped shape this blog post.

Sander Schulhoff

Sander Schulhoff is the Founder of Learn Prompting and an ML Researcher at the University of Maryland. He created the first open-source Prompt Engineering guide, reaching 3M+ people and teaching them to use tools like ChatGPT. Sander also led the team behind The Prompt Report, the most comprehensive study of prompting ever done, co-authored with researchers from the University of Maryland, OpenAI, Microsoft, Google, Princeton, Stanford, and other leading institutions. This 76-page survey analyzed 1,500+ academic papers and covered 200+ prompting techniques.


© 2024 Learn Prompting. All rights reserved.