Instructing large language models (LLMs) to express their reasoning before answering, known as Chain-of-Thought (CoT) prompting^{1}, can improve response quality and accuracy. By breaking down the reasoning step-by-step, the model is guided toward more thoughtful and precise outcomes. However, research shows that models don’t always follow the reasoning they present when reaching their final answer.
In other words, these models can be “unfaithful” to their own logic: the reasoning they present doesn’t always reflect the actual process behind the conclusion. This raises concerns about interpretability, because users cannot trust or verify the model’s decision-making from a reasoning chain that may not match how the answer was actually reached.
Here's an example where the final answer is unrelated to the reasoning chain produced by the LM:
Q: John plans to sell all his toys and use the money to buy video games. He has 13 lego sets and he sells them for $15 each. He ends up buying 8 videogames for $20 each and has $5 left. How many lego sets does he still have?
A: He started with 13 lego sets. He sold them for $15 each. So he must have gotten 13 * $15 = $195 from selling them. He used this money to buy 8 video games for $20 each. This means he spent 8 * $20 = $160 on the video games. He had $5 left, so in total he must have had $195 + $5 = $200. The answer is 0. ❌
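For reference, the arithmetic the question calls for can be checked in a few lines. This is a sketch added for illustration, not part of the original example; the variable names are made up here, and it assumes John sold only as many sets as needed to cover his spending plus the $5 left over.

```python
# Known quantities from the question.
sets_owned = 13
price_per_set = 15
games_bought = 8
price_per_game = 20
money_left = 5

# Money John must have raised: what he spent plus what he has left.
money_raised = games_bought * price_per_game + money_left  # 160 + 5 = 165

# Sets sold to raise that amount, and sets he still owns.
sets_sold = money_raised // price_per_set  # 165 / 15 = 11
sets_remaining = sets_owned - sets_sold

print(sets_remaining)  # prints 2
```

Under this reading John still has 2 sets, a value that neither the chain's intermediate $200 nor its final answer of 0 corresponds to.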
Faithful Chain-of-Thought (CoT) Reasoning^{2} is a reasoning framework designed to generate trustworthy reasoning chains by ensuring that the final answers are directly derived from the reasoning process. It involves two key stages:
Step 1. Translation: The language model (LM) converts the natural language (NL) query into a reasoning chain, which combines both Natural Language and Symbolic Language (SL). The NL component breaks down the complex problem into simpler subproblems, which may depend on one another, each addressed using a task-specific SL (such as Python, Datalog, or PDDL).
Step 2. Problem Solving: The generated reasoning chain is then executed using a deterministic solver, like a Python interpreter or PDDL planner, to produce the final answer. Since the reasoning chain is actually executed to derive the answer, it provides a faithful explanation of the process, making this approach far more interpretable than standard CoT methods. An added benefit of this transparency is improved correctness.
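The two stages can be sketched roughly as follows. This is a minimal illustration, not the framework's actual implementation: `solve`, `fake_translate`, and the `answer` variable convention are hypothetical names introduced here, and the LM call is replaced by a hard-coded stand-in so the snippet runs on its own.

```python
from typing import Callable


def solve(query: str, translate: Callable[[str], str], answer_var: str = "answer"):
    """Run the two stages: translate the query, then execute the result."""
    # Stage 1 (Translation): an LM would turn the query into NL comments + Python.
    reasoning_chain = translate(query)

    # Stage 2 (Problem Solving): a deterministic solver (here the Python
    # interpreter) executes the chain; the answer is read from its namespace.
    namespace: dict = {}
    exec(reasoning_chain, namespace)  # only run code you trust or have sandboxed
    return reasoning_chain, namespace[answer_var]


def fake_translate(query: str) -> str:
    """Hypothetical stand-in for the LM call; returns a hard-coded chain."""
    return (
        "# 1. How many pens does Mia start with? (independent)\n"
        "n_start = 4\n"
        "# 2. How many pens does she buy? (independent)\n"
        "n_bought = 6\n"
        "# 3. Final answer: how many pens does she have? (depends on 1, 2)\n"
        "answer = n_start + n_bought\n"
    )


chain, result = solve(
    "Mia has 4 pens and buys 6 more. How many pens does she have?",
    fake_translate,
)
print(result)  # prints 10
```

Because the answer is whatever the executed chain computes, the chain cannot disagree with the answer the way the free-text chain in the earlier example does.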
The example below demonstrates how you can use Faithful CoT to solve the following math word problem:
Daniel has 17 apples. Rosy gives Daniel 5 oranges and in return Daniel gives her 3 apples. How many apples does Daniel have now?
First, translate the problem into a reasoning chain containing natural language (NL) and symbolic language (SL). You can employ a Few-Shot Prompting approach to do so. In the example below, we translate the query into a reasoning chain that consists of natural language comments and Python code.
# 1. How many apples does Daniel have in the beginning? (independent, support: ["Daniel has 17 apples"])
n_apples_begin = 17
# 2. How many apples does Daniel give to Rosy? (independent, support: ["Daniel gives her 3 apples"])
n_apples_given = 3
# 3. Final answer: How many apples does Daniel have now? (depends on 1, 2)
n_apples_final = n_apples_begin - n_apples_given
n_apples_final
In the example, we see that the natural language component of the reasoning chain, written as Python comments, consists of three types of information: the subquestions the problem is broken into (such as "How many apples does Daniel have in the beginning?"), the dependencies between them ("independent" or "depends on 1, 2"), and the rationales that quote the part of the problem supporting each step (the "support" field).
Next, execute the generated symbolic language code to obtain the final answer. Since the output of the translation stage is Python code, we use a Python interpreter to execute it and get the final result: 14. You can verify the answer by running the code in an online Python interpreter.
### OUTPUT
------
>>> 14
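The annotations in the comments of the reasoning chain above are regular enough to be read back programmatically. The sketch below is one way to do that, assuming every comment follows the layout seen in this example; the regular expression and the `parse_steps` helper are illustrative and not part of the Faithful CoT framework.

```python
import json
import re

CHAIN = '''\
# 1. How many apples does Daniel have in the beginning? (independent, support: ["Daniel has 17 apples"])
n_apples_begin = 17
# 2. How many apples does Daniel give to Rosy? (independent, support: ["Daniel gives her 3 apples"])
n_apples_given = 3
# 3. Final answer: How many apples does Daniel have now? (depends on 1, 2)
n_apples_final = n_apples_begin - n_apples_given
'''

# Assumed layout: "# <id>. <subquestion> (<dependency>[, support: <JSON list>])"
STEP = re.compile(
    r"^#\s*(\d+)\.\s*(.*?)\s*"
    r"\((independent|depends on [\d,\s]+)"
    r"(?:,\s*support:\s*(\[.*\]))?\)\s*$"
)


def parse_steps(chain: str) -> list[dict]:
    """Extract subquestion, dependencies, and rationale from each comment."""
    steps = []
    for line in chain.splitlines():
        match = STEP.match(line)
        if match is None:
            continue  # a code line, not an NL comment
        step_id, question, dependency, support = match.groups()
        steps.append({
            "id": int(step_id),
            "subquestion": question,
            # "independent" contains no digits, so this yields an empty list.
            "depends_on": [int(n) for n in re.findall(r"\d+", dependency)],
            "support": json.loads(support) if support else [],
        })
    return steps


steps = parse_steps(CHAIN)
for step in steps:
    print(step["id"], step["depends_on"], step["support"])
```

A structure like this makes it easy to check, for instance, that every step only depends on steps that come before it.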
Comparison of Faithful CoT with CoT and Least-to-Most (LtM) prompting^{2}
Human evaluation results of reasoning chains produced by Faithful CoT^{2}
Impact of different NL components on accuracy^{2}
The ablation results show that, except on the CLUTRR dataset, removing the natural language component (No NL) doesn't result in any significant accuracy drop compared to the full prompt.
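To make the "No NL" condition concrete, the sketch below strips the comment lines from the apple example's reasoning chain, leaving only the symbolic part. It assumes that "No NL" means keeping just the code; `strip_nl` is a hypothetical helper written for this illustration.

```python
FULL_CHAIN = """\
# 1. How many apples does Daniel have in the beginning? (independent, support: ["Daniel has 17 apples"])
n_apples_begin = 17
# 2. How many apples does Daniel give to Rosy? (independent, support: ["Daniel gives her 3 apples"])
n_apples_given = 3
# 3. Final answer: How many apples does Daniel have now? (depends on 1, 2)
n_apples_final = n_apples_begin - n_apples_given
"""


def strip_nl(chain: str) -> str:
    """Drop comment-only lines, keeping the symbolic (Python) part."""
    return "\n".join(
        line for line in chain.splitlines() if not line.lstrip().startswith("#")
    )


no_nl_chain = strip_nl(FULL_CHAIN)
print(no_nl_chain)
```

The interpreter ignores comments either way, so stripping them never changes what a given chain computes; what the ablation measures is whether the model still writes correct code when its exemplars lack the comments.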
Faithful CoT guarantees a faithful reasoning chain for the generated final answer by decomposing the answer generation into two distinct stages: translation and problem solving. Additionally, empirical results from the ablation study suggest that the framework is robust to the choice of exemplars, and improving the model's interpretability doesn't come at the expense of overall performance.