
🟢 Multiple Choice Questions

Let's use GPT to solve an LSAT[1] question!

Below is an example LSAT question. Consider how you would answer it, as well as your reasoning.

John of Worcester, an English monk, recorded the sighting, on December 8, 1128, of two unusually large sunspots. Five days later a brilliant aurora borealis (northern lights) was observed in southern Korea. Sunspot activity is typically followed by the appearance of an aurora borealis, after a span of time that averages five days. Thus, the Korean sighting helps to confirm John of Worcester's sighting. Which one of the following, if true, most strengthens the argument?

a) An aurora borealis can sometimes occur even when there has been no significant sunspot activity in the previous week.
b) Chinese sources recorded the sighting of sunspots more than 1000 years before John of Worcester did.
c) Only heavy sunspot activity could have resulted in an aurora borealis viewable at a latitude as low as that of Korea.
d) Because it is impossible to view sunspots with the naked eye under typical daylight conditions, the sighting recorded by John of Worcester would have taken place under unusual weather conditions such as fog or thin clouds.
e) John of Worcester's account included a drawing of the sunspots, which could be the earliest illustration of sunspot activity.
The correct answer is ...
c) Only heavy sunspot activity could have resulted in an aurora borealis viewable at a latitude as low as that of Korea.

Try pasting the problem into the demo below:

Why is my answer different?
Your answer could differ because of:

1) Updates to the underlying model, GPT-3.
2) Randomness in the text generation process. We can make the output more consistent by setting temperature to 0.
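To see why a temperature of 0 makes output more consistent, here is a minimal sketch of temperature-scaled sampling. The scores in `logits` are made up for illustration and `sample_token` is a hypothetical helper, not part of any model API; the sketch only shows the general idea that temperature 0 reduces to always picking the highest-scoring option.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token from a dict of token -> score using temperature sampling."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(logits, key=logits.get)
    # Softmax over temperature-scaled scores (shifted by the max for stability).
    top = max(logits.values())
    weights = {tok: math.exp((score - top) / temperature)
               for tok, score in logits.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Hypothetical scores for the five answer letters; not real model output.
logits = {"a": 1.0, "b": 0.2, "c": 1.3, "d": 0.9, "e": 0.1}
rng = random.Random(0)

greedy = {sample_token(logits, 0, rng) for _ in range(20)}
sampled = [sample_token(logits, 1.0, rng) for _ in range(20)]

print(greedy)   # a single letter, repeated on every call
print(sampled)  # a mix of letters that varies with the random seed
```

With temperature 0 the same letter comes back on every call, while at temperature 1.0 lower-scoring letters are picked some of the time.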

The model failed. Does that mean the model is incapable of answering this type of question? Not necessarily. We will dive into techniques that we can use to improve model results.

The Magic Phrase​

The standard prompt we used above gives little insight into the “reasoning” behind GPT's output. We can try adding the phrase “Let's explain step by step” to the end of the prompt, like so:

...
e) John of Worcester's account included a drawing of the sunspots, which could be the earliest illustration of sunspot activity.

Let’s explain step by step

This phrase will increase the verbosity of the model. You might get an output like this:

Info: Notice how the model reasons through the problem step by step.

The specific term for this behavior is Chain of Thought: the model sequentially generates statements to reach an answer. This is similar to the concept of System 2 thinking (from Thinking, Fast and Slow): the model defaults to System 1 thinking, but can chain System 1 steps to arrive at a more methodical answer.
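As a rough illustration, the standard prompt and the step-by-step variant can be assembled with a small helper. `build_prompt` is a hypothetical function written for this sketch, and the choices are shortened placeholders standing in for the full LSAT text.

```python
def build_prompt(question, choices, suffix=""):
    """Format a multiple choice question, optionally ending with a cue phrase."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    lines = [question, ""]
    for letter, choice in zip(letters, choices):
        lines.append(f"{letter}) {choice}")
    if suffix:
        # The cue phrase goes last so the model continues from it.
        lines.extend(["", suffix])
    return "\n".join(lines)

# Placeholder text; substitute the full LSAT passage and answer choices.
question = "Which one of the following, if true, most strengthens the argument?"
choices = ["First choice", "Second choice", "Third choice"]

standard_prompt = build_prompt(question, choices)
cot_prompt = build_prompt(question, choices, suffix="Let's explain step by step")
print(cot_prompt)
```

The only difference between the two prompts is the trailing phrase, which makes it easy to compare the model's output with and without it.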

Improvements​

Here are some variations on our basic prompt for multiple choice questions:

Reorder Question Items​

We can reorder the items in the question:

...
a) John of Worcester's account included a drawing of the sunspots, which could be the earliest illustration of sunspot activity.
b) Because it is impossible to view sunspots with the naked eye under typical daylight conditions, the sighting recorded by John of Worcester would have taken place under unusual weather conditions such as fog or thin clouds.
...
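When reordering by program, it helps to keep track of which letter the correct answer ends up under. The sketch below is one way to do that; `reorder_choices` is a hypothetical helper, the short labels stand in for the full answer texts, and the seed is arbitrary.

```python
import random

def reorder_choices(choices, correct_index, rng):
    """Shuffle the choices and return (lettered lines, new letter of the correct one)."""
    order = list(range(len(choices)))
    rng.shuffle(order)  # order[pos] is the original index now shown at position pos
    letters = "abcdefghijklmnopqrstuvwxyz"
    lines = [f"{letters[pos]}) {choices[src]}" for pos, src in enumerate(order)]
    new_correct_letter = letters[order.index(correct_index)]
    return lines, new_correct_letter

# Short labels standing in for the five LSAT answer choices (originally a to e).
choices = [
    "aurora without sunspots",
    "earlier Chinese records",
    "low-latitude aurora needs heavy sunspot activity",
    "fog or thin clouds",
    "drawing of the sunspots",
]

# The correct answer was originally c), which is index 2.
lines, answer = reorder_choices(choices, correct_index=2, rng=random.Random(1))
print("\n".join(lines))
print("correct answer is now:", answer)
```

Running the same question under several orderings and checking whether the model still lands on the correct choice gives a rough sense of how sensitive it is to position.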

Reword the Question​

Recall the original prompt was this:

Which one of the following, if true, most strengthens the argument?

We can change the prompt to this:

Identify each choice as strengthens, weakens or doesn't impact the argument.

This gives further insight into how the model evaluates each answer choice.

Add Additional Context​

Here is an example of a problem which can be easily solved by using Bayes' theorem:

Consider two medical tests, A and B, for a virus. Test A is 90% effective at recognizing the virus when it is
present, but has a 5% false positive rate (indicating that the virus is present, when it is not). Test B is 95%
effective at recognizing the virus, but has a 10% false positive rate. The two tests use independent methods
of identifying the virus. The virus is carried by 2% of all people.
(a) Say that a person is tested for the virus using only Test A. What is the probability that the person
is really carrying the virus given that Test A came back positive? (2 points)
(b) Say that a person is tested for the virus using only Test B. What is the probability that the person
is really carrying the virus given that Test B came back positive? (2 points)
(c) Say that a person is tested for the virus using both tests. What is the probability that the person is
really carrying the virus given that both tests came back positive? (2 points)

Let's try this with GPT:

The output is incorrect!

If we add a bit of context, like so:

...
Let's explain step by step. The formula for bayes is

The model will then use the right formula, Bayes' theorem, which is correct!
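For reference, the expected answers can be worked out by hand with Bayes' theorem. In this sketch, $V$ means the person carries the virus, and $A^+$ and $B^+$ mean the corresponding test came back positive (notation introduced here, not in the original problem). Part (c) assumes the two tests are conditionally independent given the person's true status, which is one reading of "independent methods".

```latex
\begin{align*}
\text{(a)}\quad P(V \mid A^+)
  &= \frac{P(A^+ \mid V)\,P(V)}{P(A^+ \mid V)\,P(V) + P(A^+ \mid \neg V)\,P(\neg V)} \\
  &= \frac{0.90 \times 0.02}{0.90 \times 0.02 + 0.05 \times 0.98}
   = \frac{0.018}{0.067} \approx 0.269 \\[6pt]
\text{(b)}\quad P(V \mid B^+)
  &= \frac{0.95 \times 0.02}{0.95 \times 0.02 + 0.10 \times 0.98}
   = \frac{0.019}{0.117} \approx 0.162 \\[6pt]
\text{(c)}\quad P(V \mid A^+, B^+)
  &= \frac{0.90 \times 0.95 \times 0.02}{0.90 \times 0.95 \times 0.02 + 0.05 \times 0.10 \times 0.98}
   = \frac{0.0171}{0.0220} \approx 0.777
\end{align*}
```

These values give something concrete to compare the model's output against.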

Danger: GPT models don't perform arithmetic operations well. You might notice that while the written expression is correct, the computed number is not.

Try adding the phrase “Give the expression as answer, not a number” to stop the model from computing the result itself.

You may be interested in MRKL[2], the paradigm of combining GPT with external tools like calculators, to solve this problem.
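As a small illustration of the calculator side of that idea, the sketch below evaluates an arithmetic expression outside the model. This is not the MRKL implementation; `evaluate` is a hypothetical helper built on Python's standard `ast` module, and `expr` is what a model might return for part (a) of the problem above when asked for an expression rather than a number.

```python
import ast
import operator

# Only basic arithmetic is allowed; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def evaluate(expression):
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval"))

# Assumed model output for part (a): an expression, not a number.
expr = "(0.9 * 0.02) / (0.9 * 0.02 + 0.05 * 0.98)"
result = evaluate(expr)
print(round(result, 3))  # → 0.269
```

Walking the syntax tree instead of calling `eval` means that model output containing function calls or names raises an error rather than being executed.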

Written by zeyuzhao.


  1. The LSAT (Law School Admission Test) is a standardized test used by law schools in the United States to assess the critical thinking and analytical reasoning skills of prospective students.
  2. Karpas, E., Abend, O., Belinkov, Y., Lenz, B., Lieber, O., Ratner, N., Shoham, Y., Bata, H., Levine, Y., Leyton-Brown, K., Muhlgay, D., Rozen, N., Schwartz, E., Shachaf, G., Shalev-Shwartz, S., Shashua, A., & Tenenholtz, M. (2022). MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning.