Embracing System 2 Thinking in LLMs

Why Better Prompts Aren’t Enough

Charlie Koster
5 min read · Jan 25, 2025

Imagine someone tosses a ball to you unexpectedly. Your System 1 response kicks in: you instinctively react. With that sudden visual input, electrical signals race through your brain, and your hands automatically move to keep the ball from hitting you, if not to catch it.

Now contrast that with someone saying, “Hey, I’m about to toss you this ball.” This time, your System 2 thinking takes over — you plan, anticipate, and position yourself for the catch. You visualize the arc of the throw and deliberately place your hands to maximize your chances of success. You are thinking in the traditional sense of the word.

When you send a prompt to an LLM, you are invoking an analogous System 1 response. The LLM takes your prompt and, in a single shot, gives you its spontaneous reaction.

Sometimes these responses are great; the LLM catches the ball, so to speak. But we’ve all seen instances where it goes less than well: hallucinations, inaccuracies, logical inconsistencies. We’re left wondering, “What was this LLM thinking?”

The truth is, the LLM is thinking as little as you are thinking when a random ball unexpectedly flies in your direction. Which is to say, there is no “thought”, only automatic reaction: a quick signal through a network of neurons, real or artificial, trained to react from past experiences.

But you can make LLMs think!

System 1 Tunnel Vision

An early anti-pattern I’m noticing in how people use LLMs is an unwitting over-focus on System 1 tactics when the problem calls for a System 2 approach.

We seem to be wired with a bias that a single invocation of an LLM should be sufficient to get the results we want, and we employ a few strategies to try to force that reality.

  • Prompt engineering has quickly become a necessary skill to improve LLM output, while prompt over-engineering quickly leads to frustrating results. LLMs can only handle so much nuance in their instructions.
  • RAG (Retrieval-Augmented Generation) — a fancy term that simply means “insert relevant contextual data” — is a must-have strategy when a bit of extra data compounds the efficacy of the output (a minimal sketch follows this list). Keep in mind that multi-step, heavily nuanced problems may not be completely solved by RAG.
  • Fine-tuning models can be a good last resort when the above two strategies are insufficient. Again, because we’re in System 1 territory, fine-tuning will be inferior for complex problem solving or reasoning.
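
To make the RAG bullet above concrete, here is a minimal sketch of what “insert relevant contextual data” looks like in practice. It assumes the OpenAI Python SDK; retrieve_passages is a hypothetical placeholder for whatever retrieval you already have (vector search, keyword search, etc.).

# Minimal RAG-style sketch: retrieve context, then insert it into a single prompt.
from openai import OpenAI

client = OpenAI()

def retrieve_passages(question: str) -> list[str]:
    # Hypothetical helper: look up relevant documents from your own store.
    return ["...relevant excerpt 1...", "...relevant excerpt 2..."]

def answer_with_rag(question: str) -> str:
    context = "\n\n".join(retrieve_passages(question))
    prompt = (
        "Use only the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

Note that this is still a single LLM invocation: System 1 with better context, not System 2.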

To be fair, all of these are valid strategies for improving results on System 1 problems. It is important, however, to carefully determine whether the problem you’re attempting to solve with an LLM is really one that a thoughtless, one-shot prompt can handle, or whether it requires reasoning, nuance, or multi-step logic. If the latter, then System 2 thinking is likely to yield better results.

System 2 Thinking

System 2 thinking involves any combination of the following:

  • Breaking the problem into distinct sequential or parallel steps
  • Iteration — piping output from an LLM as input to another LLM
  • Shared context across steps

Notably, System 2 thinking is characterized by sending more than just a single prompt into an LLM.
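
As a rough sketch of what that looks like in code (again assuming the OpenAI Python SDK and gpt-4o-mini; the helper and prompts here are illustrative), piping output from one call into the next is straightforward:

from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    # One System 1 invocation: a single prompt in, a single response out.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

document = "...some long policy text..."

# Step 1: a focused sub-task instead of one enormous prompt.
outline = ask(f"List the primary sections of this document:\n\n{document}")

# Step 2: the previous output becomes shared context for the next step.
review = ask(f"For each of these sections, note anything unusual:\n\n{outline}")

Each call is still a System 1 reaction; the System 2 behavior emerges from sequencing the calls and carrying context forward.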

Example: Terms & Privacy Analyzer

An international company recently released a new model, Deepseek, that is said to rival OpenAI’s state-of-the-art models; however, criticisms of Deepseek have been peppered throughout AI forums this week. Let’s dig deeper and analyze their Terms and Privacy policies to understand why.

Rather than having an extended conversation with an LLM like ChatGPT — what would effectively be System 2 thinking with a human in the middle — let’s build a simple LLM agent that allows us to reuse this type of System 2 approach in the future.

I want the following thought process to occur:

  • Extract the primary sections of a text document
  • Traverse each section, extract notable insights, and double-check that nothing important was missed
  • Create a short summary and list the insights

(Figure: Simulating System 2 thinking by piping output from one LLM to another multiple times)

Step 1: Extract Sections

Given the following text:
{text}

1) Identify and split it into primary sections.
2) Return sections in valid JSON format as an array of objects. Each object should have:
- "section_title": a short optional title or a placeholder if none is clear
- "section_text": the text belonging to that section

Example JSON format:
[
{{
"section_title": "Introduction",
"section_text": "Some introduction text here..."
}},
{{
"section_title": "Details",
"section_text": "Some details text here..."
}}
]

If the text doesn't naturally split, just return a single array element with the entire text.
Only output valid JSON.
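
Wiring the Step 1 prompt up might look like the sketch below. It reuses the ask helper from earlier; SECTION_PROMPT stands in for the full prompt text above (the doubled braces in the prompt exist so Python’s str.format leaves the example JSON intact), and the defensive fallback is my own addition, not part of the original source.

import json

SECTION_PROMPT = """Given the following text:
{text}

1) Identify and split it into primary sections.
...(rest of the Step 1 prompt shown above)...
Only output valid JSON."""

def extract_sections(text: str) -> list[dict]:
    raw = ask(SECTION_PROMPT.format(text=text))
    try:
        # The prompt asks for valid JSON, but defensive parsing is still wise.
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to treating the whole document as one section.
        return [{"section_title": "Document", "section_text": text}]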

Step 2a: Extract insights

The following section is:
{section}

Extract any unusual, concerning, or unethical points from this section as bullet points.
If there are none, output "None":

Step 2b: Double check nothing was missed

These are the extracted insights so far from all sections:
{combined_insights_text}

Reflect on whether any important or concerning points might have been missed.
If there are additional insights, provide them as bullet points.
If there are no additional insights, just output "No additional insights".

Step 3: Summarize and list

Summarize the following text into one or two paragraphs:
{ev.original_text}
These are the extracted insights from the text:
{ev.insights}

Return them as bullet points, one per line, filtering out any empty or trivial bullet points:

(Figure: analyzer output for Deepseek’s Privacy policy)
(Figure: analyzer output for Deepseek’s Terms of service)

Full source code employing System 2 thinking in LlamaIndex
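
That embedded source isn’t reproduced here, but the sketch below shows roughly how the three steps could be wired together with LlamaIndex’s Workflow API. The event names, field names, and abbreviated prompt constants are illustrative rather than lifted from the original code; the prompts themselves are the ones shown in Steps 1 through 3.

import json

from llama_index.core.workflow import Event, StartEvent, StopEvent, Workflow, step
from llama_index.llms.openai import OpenAI

# Abbreviated stand-ins for the full prompts shown above.
SECTION_PROMPT = "Given the following text:\n{text}\n\nSplit it into primary sections... Only output valid JSON."
INSIGHT_PROMPT = "The following section is:\n{section}\n\nExtract any unusual, concerning, or unethical points as bullet points. If there are none, output \"None\":"
REFLECT_PROMPT = "These are the extracted insights so far from all sections:\n{combined_insights_text}\n\nReflect on whether any important points might have been missed..."
SUMMARY_PROMPT = "Summarize the following text into one or two paragraphs:\n{original_text}\nThese are the extracted insights from the text:\n{insights}\n\nReturn them as bullet points, one per line..."

class SectionsEvent(Event):
    original_text: str
    sections: list

class InsightsEvent(Event):
    original_text: str
    insights: str

class PolicyAnalyzer(Workflow):
    llm = OpenAI(model="gpt-4o-mini")

    @step
    async def extract_sections(self, ev: StartEvent) -> SectionsEvent:
        # Step 1: split the document into sections, returned as JSON.
        raw = await self.llm.acomplete(SECTION_PROMPT.format(text=ev.text))
        return SectionsEvent(original_text=ev.text, sections=json.loads(raw.text))

    @step
    async def gather_insights(self, ev: SectionsEvent) -> InsightsEvent:
        # Step 2a: extract insights per section; Step 2b: reflect on the whole set.
        bullets = []
        for section in ev.sections:
            result = await self.llm.acomplete(
                INSIGHT_PROMPT.format(section=section["section_text"])
            )
            if result.text.strip().lower() != "none":
                bullets.append(result.text)
        combined = "\n".join(bullets)
        extra = await self.llm.acomplete(
            REFLECT_PROMPT.format(combined_insights_text=combined)
        )
        if "no additional insights" not in extra.text.lower():
            combined += "\n" + extra.text
        return InsightsEvent(original_text=ev.original_text, insights=combined)

    @step
    async def summarize(self, ev: InsightsEvent) -> StopEvent:
        # Step 3: final summary plus the filtered list of insights.
        summary = await self.llm.acomplete(
            SUMMARY_PROMPT.format(original_text=ev.original_text, insights=ev.insights)
        )
        return StopEvent(result=summary.text)

# Usage (inside an async context):
#   result = await PolicyAnalyzer(timeout=300).run(text=policy_text)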

By splitting what could be one enormous prompt into smaller, well-defined prompts, we are able to effectively mimic System 2 thinking. Several iterative thoughts were combined to produce a much more useful response.

Furthermore, this approach can enable you to use cheaper models to obtain similar results. The above example uses gpt-4o-mini, which wouldn’t be able to provide insights as deep or as accurate if this analysis were performed in a single well-engineered prompt.

Dedicated reasoning models would likely yield similarly useful results; however, Deepseek aside, those models have historically been very expensive. Additionally, when those models pause to think, what do you suppose they are doing behind the scenes? They are employing chain-of-thought and System 2 thinking!

Conclusion

The shift from single-pass prompt engineering to multi-step, iterative workflows reflects a broader transition from System 1 to System 2 thinking in how we work with LLMs. While crafting the perfect prompt can be useful for well-defined tasks, complex problems often demand a more thoughtful, structured approach.

System 2 thinking is not just a tool for solving immediate challenges — it’s a stepping stone toward more advanced, autonomous workflows powered by LLM agents. By breaking tasks into smaller, manageable steps and leveraging iterative refinement, System 2 enables intelligent, agent-driven systems capable of handling multi-faceted problems with minimal human intervention.

As we look ahead to 2025, the role of LLM agents will become increasingly important. These agents, designed to think in deliberate, multi-step ways, are poised to redefine and disrupt how we approach problem-solving and decision-making. Paying close attention to their evolution will be critical for anyone aiming to stay ahead in the AI landscape.
