Revisiting Inference-Time Techniques for LLM Reasoning. How Many ‘r’s are in ‘arrividerci’?

Alex Honchar
11 min read · Feb 3, 2025


Hello, everyone. In this quick article, I would like to reflect on the first lecture of the Advanced Large Language Model Agents course currently taught at Berkeley, which covers inference-time techniques for large language model reasoning.

Here you’ll find a couple of experiments. They are deliberately done with a somewhat weaker model, gpt-4o-mini. The goal is not to test a model that is already trained for inference-time reasoning, like o1 or r1, but to test different tricks and techniques for reasoning at inference time and to see whether they work or not.

I also assume a setting with no oracle, meaning there is no feedback, so we cannot really benchmark the answer we’re getting from the LLM. This is another limitation relevant to real-world applications.

1. Introduction to basic prompting techniques

The first approach is relatively straightforward — leveraging the token budget to generate a single correct answer. This follows the typical “let’s think step by step” methodology common in prompt engineering. It’s essentially about implementing those widely-shared internet techniques that make LLMs appear to reason through problems. The core mechanism relies on generating additional tokens, where the combination of pseudo-reasoning tokens and the final response aims to maximize accuracy.

“One-liner” solutions ❌

I chose the letter-counting problem because it is very simple, and vanilla LLMs are known to count letters poorly. There was the famous “strawberry” example, and I decided to take “arrividerci” as a very similar case, just to avoid reusing the same word. And as you can see, if we just ask a vanilla LLM to answer this question, it fails.

Vanilla prompt, “step-by-step” prompt examples — obvious mistakes
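For reference, the ground truth for this toy task is trivial to compute outside the model with plain string methods — deterministic counting is exactly what a vanilla LLM struggles to reproduce:

```python
# Deterministic letter counting: the ground truth the LLM keeps missing.
word = "arrividerci"
count = word.count("r")
print(count)  # 3
```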

Task decompositions ✅❌

The course covers a couple of approaches for asking the LLM to decompose the task first: analogical prompting and least-to-most prompting. In both, you first ask the LLM to decompose the task and only then combine the answer out of this reasoning trace. In my experiments this works most of the time, but it depends on the random seed, so it is not deterministic and you cannot rely on the correctness. In the examples I shared, though, it worked.

Analogical prompting, least-to-most prompting examples — can work well depending on the random seed
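As a sketch (my own wording of the prompts, not the course’s exact templates), the least-to-most pattern can be wired as two pure prompt-building functions, with two chained chat-completion calls between them:

```python
# Least-to-most sketch: prompt builders only; the model calls happen between them.
# The exact phrasings here are my own assumptions, not the course's templates.

def decomposition_prompt(question: str) -> str:
    """Stage 1: ask the model to break the task down without solving it."""
    return (
        f"Question: {question}\n"
        "Break this question into simpler sub-problems. "
        "List them without solving anything yet."
    )

def solve_prompt(question: str, decomposition: str) -> str:
    """Stage 2: ask the model to solve the sub-problems and combine them."""
    return (
        f"Question: {question}\n"
        f"Sub-problems:\n{decomposition}\n"
        "Solve each sub-problem in order, then combine the results "
        "into a final answer."
    )
```

The first prompt goes to the model, its reply is fed into `solve_prompt` as the decomposition, and the second reply is taken as the final answer.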

I also attempted to replicate the more elaborate self-discover prompt structure in ChatGPT, and as demonstrated, despite implementing a complex reasoning structure, the model still fails to solve such a basic problem.

Attempting to replicate self-discover prompt structure in chatgpt interface

2. Search and selection from multiple candidates

We can already see where the issues are coming from. Given their non-deterministic nature, we should not limit LLMs to generating only one solution. What we might want to do instead is generate multiple branches, some of which will be able to recover from mistakes and compensate for hallucinations. The challenge is how to select the best response from multiple candidates. Let’s test it.

Self-consistency ✅

One approach is self-consistency, and it’s very straightforward. You sample multiple answers to the same question and then do majority voting on the final result; for more open-ended questions, where a directly comparable final result is not available, you use an LLM voter over the reasoning traces and choose the most consistent one.

from openai import OpenAI

client = OpenAI()
question = "How many 'r's are in the word 'arrividerci'?"

def generate_responses(prompt, model="gpt-4o-mini", num_samples=3, temperature=1.0):
    """Sample several independent step-by-step answers to the same prompt."""
    enhanced_prompt = f"{prompt}\nThink step by step:\n"
    return [
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": enhanced_prompt}],
            temperature=temperature,
        ).choices[0].message.content.strip()
        for _ in range(num_samples)
    ]

def send_responses_to_llm_voter(question, responses, model="gpt-4o-mini", temperature=0.0):
    """Ask the model to do majority voting over the sampled responses."""
    prompt = (
        "Here are multiple responses to a question:\n\n"
        f"{question}\n\n"
        + "\n".join(f"{i+1}. {resp}" for i, resp in enumerate(responses))
        + "\nDo majority voting and return the answer."
    )
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    ).choices[0].message.content

samples = generate_responses(question, num_samples=5)
final_answer = send_responses_to_llm_voter(question, samples)

for i, sample in enumerate(samples):
    print(f"Sample {i+1}: {sample}")
print("-" * 100)
print(f"Final Answer (Self-Consistency with LLM Voter): {final_answer}")

And as you can see, in this case we are able to get the right answer to our arrividerci question.

Final Answer (Self-Consistency with LLM Voter): To determine the number of 'r's in the word "arrividerci" using majority voting from the provided responses:

1. Response 1: **3 'r's**
2. Response 2: **3 'r's**
3. Response 3: **2 'r's**
4. Response 4: **3 'r's**
5. Response 5: **3 'r's**

Now, let's tally the votes:
- **3 'r's**: 4 votes
- **2 'r's**: 1 vote

The majority of responses indicate that there are **3 'r's** in the word "arrividerci."

Thus, the final answer is **3 'r's**.
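When the final answers are short and directly comparable, as they are here, the majority vote itself does not need an LLM voter at all; a deterministic tally is cheaper and exact (the LLM voter remains useful for open-ended answers):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer; ties resolve to the first one seen."""
    return Counter(answers).most_common(1)[0][0]

# The five sampled counts from the run above.
votes = ["3", "3", "2", "3", "3"]
print(majority_vote(votes))  # 3
```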

Tree of thoughts ✅

Another approach, well-established in computer science, involves building a tree of thoughts. Rather than sampling just one layer of answers, this method selects the most promising responses and generates new branches of answers based on these selections. This process can be repeated both breadth-wise and depth-wise through the tree structure.

# Assumes the same OpenAI client as above: client = OpenAI()

def generate_thoughts(original_question, thoughts, model="gpt-4o-mini", num_samples=3, temperature=1.0):
    """Expand the current node: sample several continuations of the reasoning."""
    responses = []
    enhanced_prompt = f"""Here is my question: {original_question}
Here are the thoughts I have so far: {thoughts}
What is the right answer to the question?
What are the next possible steps to answer the original question?
"""
    for _ in range(num_samples):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": enhanced_prompt}],
            temperature=temperature,
        )
        responses.append(response.choices[0].message.content.strip())
    return responses

def evaluate_thoughts(thoughts, model="gpt-4o-mini", temperature=0.0):
    """Score the candidate thoughts and keep the most promising one."""
    evaluation_prompt = (
        "Here are different possible thoughts for answering a question:\n\n"
        + "\n".join(f"{i+1}. {thought}" for i, thought in enumerate(thoughts))
        + "\n\nWhich thought is the most promising for answering the question? "
        "Respond with the number and only the number."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": evaluation_prompt}],
        temperature=temperature,
    )
    # Parse inside the try block so a non-numeric reply is also caught;
    # fall back to the first thought instead of returning a bare index.
    try:
        best_thought_index = int(response.choices[0].message.content.strip()) - 1
        return thoughts[best_thought_index]
    except (ValueError, IndexError):
        return thoughts[0]

current_state = "No thoughts"
for i in range(2):  # tree depth
    thoughts = generate_thoughts(question, current_state, num_samples=3)
    best_thought = evaluate_thoughts(thoughts)
    current_state = best_thought
print("\nFinal Solution:", current_state)

This sampling-based approach typically yields correct answers, and the evaluation step provides an additional layer of reliability, since some individual samples may still be incorrect.

Final Solution: The answer to your original question is indeed correct: the word "arrividerci" contains 3 'r' letters.

### Step-by-step breakdown of the analysis:

1. **Identify the word**: The word we are examining is "arrividerci."

2. **Break down the letters**: We analyze each letter of the word:
- a (1st position)
- r (2nd position)
- r (3rd position)
- i (4th position)
- v (5th position)
- i (6th position)
- d (7th position)
- e (8th position)
- r (9th position)
- c (10th position)
- i (11th position)

3. **Count the 'r' letters**:
- There are 'r's in the 2nd, 3rd, and 9th positions:
- This gives us a total of 3 'r' letters.

4. **Conclusion**: You confirmed that there are 3 'r' letters in "arrividerci."

### Next Possible Steps:

- **Explore more words**: You could analyze other words with similar patterns to see how many specific letters they contain. This could help reinforce your counting skills.

- **Investigate word origin or meaning**: Look into the etymology of "arrividerci," which is an Italian word meaning "goodbye." This can deepen your understanding of language and culture.

- **Practice with variations**: Create similar words or phrases in different languages and practice identifying letters within them. This could make for an interesting language exercise.

- **Letter frequency analysis**: You could expand your analysis to look at the frequency of all letters in a word or text, which is useful in various fields such as linguistics, cryptography, and handwriting analysis.

- **Engage with multiple-choice questions**: Consider creating flashcards or exercises that challenge your understanding of letter counts in various terms.

By following these steps and suggestions, you can further your exploration of language and improve your skills in counting and analysis.

3. Iterative self-improvement

While sampling techniques can help mitigate some LLM errors, these experiments are still missing a crucial component we discussed earlier — the absence of an oracle. In many real-world scenarios, we often cannot obtain ground-truth feedback, yet we still need some form of reflection mechanism or feedback loop. This brings us to our next approach: inference-time self-improvement, where LLMs iteratively enhance their own responses for a given task.

Reflexion and Self-Refine ✅❓

Two influential papers, Reflexion and Self-Refine, explore fundamental approaches for enabling LLMs to reflect on and potentially correct their own responses. To demonstrate these techniques, I conducted an experiment using two distinct prompts: first, a basic prompt without structured thinking or improvements, followed by an iterative refinement approach incorporating step-by-step reasoning based on the initial (presumably incorrect) response.

# Assumes the same OpenAI client as above: client = OpenAI()

def self_refine(prompt, iterations=3, model="gpt-4o-mini"):
    """Self-refinement through iterative self-feedback loops."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    ).choices[0].message.content.strip()

    print(f"Initial response: {response}")

    for i in range(iterations):
        print(f"Iteration {i + 1}")

        # Critique step: ask the model to re-check its own answer
        # in a narrow domain (math, counting, reasoning).
        feedback = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": f"""
My original question: {prompt}

Here is my response:
{response}

Re-think step-by-step the math, counting, reasoning and provide feedback for improvement only in this domain.
"""}],
            temperature=1.0,
        ).choices[0].message.content.strip()

        print(f"Feedback: {feedback}")

        # Refinement step: rewrite the answer using the critique.
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": f"""
My original question: {prompt}

Here is my response:
{response}

Feedback for improvement:
{feedback}

Now improve the response accordingly.
"""}],
            temperature=1.0,
        ).choices[0].message.content.strip()

    return response

# Example usage
final_refined_answer = self_refine(question, iterations=3)
print("\nFinal Refined Answer:\n", final_refined_answer)

The experimental results revealed two key insights:

  • First, a well-crafted reflection prompt can effectively enable self-correction of errors, particularly when targeting specific domains like mathematical operations, counting tasks, and logical reasoning.
  • However, poorly constructed self-refine and self-reflection prompts typically fail to achieve meaningful improvements in response quality.
Initial response: The word 'arrividerci' contains 2 'r's.
Iteration 1
Feedback: Let's break down the word 'arrividerci' step-by-step to accurately count the number of 'r's:

1. **Identify all letters in 'arrividerci':**
- The letters are: a, r, r, i, v, i, d, e, r, c, i.

2. **Count the occurrences of 'r':**
- Review each letter one by one:
- a (not 'r')
- r (count 1)
- r (count 2)
- i (not 'r')
- v (not 'r')
- i (not 'r')
- d (not 'r')
- e (not 'r')
- r (count 3)
- c (not 'r')
- i (not 'r')

3. **Total count of 'r's:**
- From this breakdown, we can see that there are actually 3 'r's in the word 'arrividerci' (the letters at positions 2, 3, and 9).

### Feedback for Improvement:
- **Double-checking:** Always consider re-checking your count, especially if the word may appear to have a certain number of letters at first glance.
- **Methodical Approach:** When counting occurrences of specific letters, it can be helpful to write them out or systematically go through the word to avoid missing any repetitions.
- **Visual Cues:** Using a highlighting or marking method when counting letters could improve accuracy and help avoid mistakes.

In conclusion, the correct count is 3 'r's in 'arrividerci'.
Iteration 2
Feedback: Your response to the question about counting the 'r's in the word 'arrividerci' is well-structured and clear. Here’s a step-by-step breakdown of your method, along with some feedback for improvement:

1. **Letter Identification**:
- You correctly identified all the letters in 'arrividerci', which is the essential first step. Listing the letters is helpful for clarity.

2. **Counting Occurrences**:
- Your method of reviewing each letter one by one is effective. It allows for accurate tracking of the counts without missing any occurrences.

3. **Total Count**:
- You arrived at the correct total of **3 'r's**, which is a critical outcome in responding to the question.

### Feedback for Improvement:
- **Simplify Counting**: Instead of listing each letter and noting whether or not it is an 'r', a straightforward approach could be to simply iterate through the string and maintain a running total. For example, using a list comprehension or a loop could simplify this process.
- **Use of Regular Expressions**: For those familiar with programming, using a regular expression to find all occurrences of 'r' may be a quicker and more elegant solution.
- **Visual Representation**: If you were presenting this verbally or in a teaching scenario, consider using visual aids (like highlighting or bolding) to emphasize the 'r's as you count.
- **Confirmation Step**: After counting, reiterating the positions and confirming the total could add even more clarity. For example, stating, "The 'r's are found at positions 2, 3, and 9, confirming a total of 3."

### Conclusion:
Overall, your answer is correct and methodical. The critiques above are minor but could enhance clarity and efficiency in similar tasks in the future.

Final Refined Answer:
Certainly! Here’s an improved response that incorporates the feedback and suggestions for clarity and efficiency:

---

**Question:** How many 'r's are in the word 'arrividerci'?

**Answer:**

To accurately determine the number of 'r's in the word 'arrividerci', let’s analyze the word step-by-step.

1. **Identify the letters in 'arrividerci':**
The letters are: a, r, r, i, v, i, d, e, r, c, i.

2. **Count the occurrences of 'r':**
As we go through the letters:
- a (not 'r')
- r (count 1)
- r (count 2)
- i (not 'r')
- v (not 'r')
- i (not 'r')
- d (not 'r')
- e (not 'r')
- r (count 3)
- c (not 'r')
- i (not 'r')

3. **Total count of 'r's:**
By tallying the counts, we can confirm that there are a total of **3 'r's** in the word 'arrividerci'. The 'r's are located at positions 2, 3, and 9 in the word.

### Tips for Efficient Counting:
- **Systematic Approach:** When counting letters, you can iterate through the string in a single pass using a simple loop, which maintains a running total for easier tracking.
- **Regular Expressions:** If you're programming, consider using a regular expression for finding all occurrences of 'r'. For example, in Python, you might use `len(re.findall('r', 'arrividerci'))` to get the count quickly.
- **Visual Recognition:** When reviewing similar words in a teaching environment, visually highlighting or bolding the 'r's can help emphasize their presence and make counting easier.

In conclusion, there are **3 'r's** in the word 'arrividerci'. Your counting method is clear and effective, and with a few slight improvements, it can be even more efficient and easy to understand.

---

This improved version maintains the thoroughness of the counting method while integrating suggestions for simplification and clarity.
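Outside the model, the regex one-liner the refined answer suggests does agree with the manual count, and the claimed positions check out as well:

```python
import re

word = "arrividerci"
# Count the 'r's with a regex, as the refined answer suggests.
print(len(re.findall("r", word)))  # 3
# 1-based positions of every 'r' in the word.
print([i + 1 for i, ch in enumerate(word) if ch == "r"])  # [2, 3, 9]
```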

Conclusions

Based on these experiments and lecture findings, the fundamental nature of large language models remains consistent — they are machine learning models trained on data with an inherently probabilistic nature. While various prompting techniques can elicit more detailed outputs and potential self-corrections, we still must contend with the probabilistic distribution of responses. This explains why sampling approaches utilizing tree-of-thought methods and self-consistency prove most effective, even when individual reasoning samples contain errors.

Looking ahead, two main approaches are emerging for handling such errors:

  1. Building systematic frameworks (agentic architectures) where we deliberately break down processes into discrete steps, incorporating various tools, memory systems, and potentially multiple language models to minimize prompting and sampling randomness.
  2. Moving away from inference-time tricks toward teaching language models how to reason inherently — as demonstrated by OpenAI o3, DeepSeek r1, and other recent models.

The Berkeley course will explore these topics further, providing ample opportunities for hands-on experimentation and learning in systematic problem-solving approaches.
