AI’s getting smarter. Adaptive reasoning ditches one-size-fits-all thinking to save time and money and to cut down on mistakes.

2025 has been the year of reasoning models. OpenAI released o1 and Google released Gemini 2.0 Flash Thinking in December 2024. DeepSeek R1, an open-source reasoning model, hit the market in January 2025. Anthropic, Google, Alibaba and xAI have followed suit, each releasing reasoning models throughout 2025.
Reliance on these newly released reasoning models by growth-stage and enterprise B2B companies has grown in step. According to the “AI, Applied Benchmarks Report” (research conducted by Georgian and NewtonX), 18% of IT/product/engineering leaders surveyed in March 2025 reported using DeepSeek R1 in production systems, and 35% said they were using OpenAI o1 or o1-mini.
Furthermore, 43% of respondents said that high model training and production costs impeded their progress toward productization, suggesting that companies may still be trying to understand how best to employ reasoning models at scale in production systems.
Most systems that employ reasoning today rely on static reasoning: every input gets the same model, the same prompt and the same depth of reasoning, leading to inefficiency and wasted time and money. A trivial query might get over-processed, driving up cost and latency. A complex, high-stakes task might be underserved, leading to risky errors.
In my view, the next frontier in production-ready reasoning is adaptive reasoning: AI systems that allocate just the right amount of reasoning per input, balancing accuracy, cost and latency in real time. For CIOs, adaptive reasoning may be a new operating model for how enterprise AI systems should be designed, deployed and scaled.
What are reasoning language models (RLMs)?
Reasoning language models are language models that can generate a thinking process. They start with a question, produce reasoning steps and arrive at an answer. RLMs can move beyond simply mapping an input to an output; they can actively engage in a multi-step decision-making process.
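To make the shape of that process concrete, here is a minimal sketch in Python of the question → reasoning steps → answer structure an RLM produces. The class and field names are illustrative, not any vendor’s API.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningTrace:
    """Illustrative container for an RLM interaction: a question, the
    intermediate reasoning steps the model generates and the final answer."""
    question: str
    steps: list[str] = field(default_factory=list)
    answer: str = ""

# Example: the kind of trace an RLM might emit for a simple question.
trace = ReasoningTrace(
    question="A train travels 120 km in 1.5 hours. What is its average speed?",
    steps=[
        "Average speed = distance / time.",
        "120 km / 1.5 h = 80 km/h.",
    ],
    answer="80 km/h",
)
print(trace.answer)
```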
Reasoning in this sense is both challenging (requiring control over multi-step logical processes) and fundamental (critical for complex problem solving, instruction-following and factual accuracy).
What is static reasoning, and what are some of the ways it can waste time and money?
RLMs often operate statically:
- The same model is used for every query
- The same prompt format is applied, regardless of the complexity of a given task
- The same reasoning depth is executed for every input
- The same tools are used
This static approach means the system doesn’t adapt to the input or task at hand: an easy task gets the same heavy reasoning chain as a complex one.
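As a rough illustration, a static pipeline looks something like the sketch below: one hard-coded configuration reused for every input. The model name, token budget and call_model() helper are placeholders I’ve made up for illustration, not a real API.

```python
# A minimal sketch of static reasoning: one fixed configuration applied to
# every input, regardless of difficulty. Names and values are illustrative.
STATIC_CONFIG = {
    "model": "large-reasoning-model",   # always the biggest model
    "prompt": "Think step by step...",  # same prompt template for everything
    "max_reasoning_tokens": 8192,       # same depth for a FAQ lookup or a forensic analysis
    "tools": ["search", "code"],        # same tools, whether needed or not
}

def call_model(query: str, model: str, prompt: str, max_reasoning_tokens: int, tools: list) -> str:
    """Stand-in for a real model call; returns a placeholder answer."""
    return f"[{model}, {max_reasoning_tokens} reasoning tokens] answer to: {query}"

def handle_query_static(query: str) -> str:
    # Every query, trivial or complex, pays the full reasoning cost.
    return call_model(query, **STATIC_CONFIG)

print(handle_query_static("What are your support hours?"))
```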
In my experience testing RLMs, I’ve observed several common failure modes, each of which I’ll examine in detail below:
- Premature anchoring: wrong first guess gets rationalized.
- Overthinking: wastes compute, sometimes lowers accuracy.
- Instruction-following failures: reasoning doesn’t guarantee the model will follow constraints.
Premature anchoring
Premature anchoring occurs when a model forms an early hypothesis in response to an input and then anchors on it. The model then applies a reasoning process such as chain-of-thought (CoT) reasoning to prepare its response, building the chain on top of that early hypothesis.
But what if the initial hypothesis is false or biased? In my experiments, RLMs often ignore or devalue later evidence contradicting their initial hypotheses. Instead, they tend to use CoT to rationalize their initial hypotheses.
In my view, models that are prematurely anchoring are not reasoning; they are rationalizing. This is a view supported by recent research on CoT reasoning and internal bias in reasoning models.
Overthinking
Overthinking occurs when a model generates unnecessarily long reasoning chains. For example, a model rationalizes an early wrong guess or applies extended reasoning to a simple input. Overthinking wastes compute and slows responses without improving outcomes.
Research on reasoning length in LLMs shows that, for a given model, most questions have a minimum reasoning token threshold (“token complexity”): the smallest CoT length needed to solve them. Once reasoning exceeds this threshold, the likelihood of a correct answer rises sharply. But further increases in length usually deliver diminishing returns — higher cost and latency with little or no additional accuracy.
In one experiment, researchers tested 31 prompts for Claude 3.5 Sonnet on a single question. The five prompts that produced fewer tokens than the threshold all failed, while 24 of the 26 prompts above the threshold succeeded. Across benchmarks, results consistently showed that accuracy is driven primarily by crossing this threshold rather than by piling on extra tokens. At scale, this implies that significantly lengthening reasoning chains is unlikely to yield meaningful accuracy gains once the threshold is met, suggesting wasted reasoning tokens, time and money.
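To show how that threshold idea might be applied in practice, the sketch below caps the reasoning-token budget slightly above an estimated per-task threshold instead of letting reasoning run long. The threshold values and the safety margin are illustrative assumptions, not figures from the cited research.

```python
# A minimal sketch of budgeting reasoning tokens around a per-task
# "token complexity" threshold. All numbers here are illustrative.
TOKEN_THRESHOLDS = {        # estimated minimum CoT length by task type
    "faq_lookup": 0,        # no reasoning needed
    "math_word_problem": 600,
    "multi_hop_analysis": 2500,
}

def reasoning_budget(task_type: str, margin: float = 1.25) -> int:
    """Cap reasoning slightly above the estimated threshold; tokens beyond
    that point mostly add cost and latency, not accuracy."""
    threshold = TOKEN_THRESHOLDS.get(task_type, 2500)  # default to the deepest budget when unsure
    return int(threshold * margin)

print(reasoning_budget("math_word_problem"))  # 750
```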
Instruction-following
Sometimes reasoning makes models worse at following instructions. Instead of sticking to given instructions (like a required format or word limit), they drift off, adding extra details or ignoring constraints.
Research into instruction-following pitfalls found that longer chains of thought don’t improve instruction-following; in fact, they often hurt it. The best results appear to come from selectively using reasoning only when it’s likely to help. A smart router that knows when to reason and when not to can keep models accurate while still following the rules.
The shift from static to adaptive reasoning
Imagine for a moment that you run a help desk. You hire an expert, Mia, to handle all incoming requests. Mia usually gets things right, but she’s slow and treats every task with the same level of effort, no matter how simple or complex.
Before long, the backlog grows. So you hire a second expert, Lea. Lea is faster and cheaper, but only reliable for less complex cases.
You decide to take a more active role in triaging what work goes to Mia and what work goes to Lea based on their track record, skills, and the job at hand. In other words, you will reason based on each input to your help desk and divert work to the appropriate person, thereby managing the tradeoff between quality/accuracy and cost/latency.
Furthermore, you empower Mia to adjust how she works (i.e., how much she thinks, how she plans out her tasks to address the input, how long she spends on a task, etc.) so that she doesn’t approach every input with the same level of complexity.
This Mia/Lea analogy is meant to illustrate the process of adaptive reasoning and routing-driven reasoning in large language models (LLMs) and agentic systems.
Organizations may choose to implement RLMs the way you might treat Mia — they may provide all problems to their best reasoning model and hope for high accuracy. As we’ve identified, the problem with this approach is that RLMs are locked into a single configuration for the entire task — fixed depth, prompt, tools and model choice.
Adaptive reasoning, on the other hand, treats all levers (reasoning depth, tokens used, prompts, tool calls and even which model to use) as dynamic parameters that change in real time based on the input’s difficulty, uncertainty and context.
The key to adaptive reasoning in production is routing—the decision-making process that determines how each input should be handled before it reaches the model. An effective router calibrates resources to the problem at hand, allocating just enough reasoning to balance accuracy, latency and cost.
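A minimal router, in the spirit of the Mia/Lea example, can be sketched in a few lines of Python. The keyword heuristic, threshold and model names below are illustrative assumptions; in production, the difficulty estimate is more often a small classifier trained on labeled examples.

```python
# A minimal routing sketch: a cheap difficulty estimate decides which model
# handles each input. The heuristic and model names are placeholders.
HARD_SIGNALS = ("why", "compare", "attribute", "root cause", "multi-step")

def estimate_difficulty(query: str) -> float:
    """Crude proxy: longer queries with analytical keywords score higher."""
    score = min(len(query.split()) / 50, 1.0)
    score += 0.3 * sum(kw in query.lower() for kw in HARD_SIGNALS)
    return min(score, 1.0)

def route(query: str, threshold: float = 0.5) -> str:
    """Return which model should handle the query."""
    return "large-reasoning-model" if estimate_difficulty(query) >= threshold else "small-fast-model"

print(route("Reset my password"))                                              # small-fast-model
print(route("Compare these three incidents and attribute the likely actor"))  # large-reasoning-model
```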
Implementing adaptive reasoning systems at scale requires evaluation loops
Adaptive reasoning and routing have the potential to be a scalable efficiency lever for organizations. GPT-5’s router is an early example of a broader shift from static to adaptive reasoning, though its generalist design has struggled. In my experience, a model that has domain-specific knowledge and tooling may perform better than a generalist model.
Furthermore, organizations may be able to build router-based reasoning systems using a small model or a classifier alongside domain-specific tooling and prompts optimized for tool use. The router scores each input on attributes like difficulty, uncertainty or domain, then maps it to the optimal configuration (reasoning depth, tokens used, prompts, tool calls and even which model to use).
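Continuing the earlier sketch, the router’s scores can be mapped to a full configuration rather than just a model choice. The tier boundaries, token budgets, tool lists and model names below are illustrative assumptions, not a recommended production setup.

```python
# A minimal sketch of mapping scored attributes to a reasoning configuration:
# depth, token budget, tools and model. All tiers and values are illustrative.
def select_config(difficulty: float, uncertainty: float, domain: str) -> dict:
    """Map router scores to reasoning depth, token budget, tools and model."""
    if difficulty < 0.3 and uncertainty < 0.3:
        return {"model": "small-fast-model", "reasoning": "off",
                "max_reasoning_tokens": 0, "tools": []}
    if difficulty < 0.7:
        return {"model": "mid-tier-model", "reasoning": "brief",
                "max_reasoning_tokens": 1000, "tools": ["search"]}
    # Hard or uncertain inputs get full depth and domain-specific tooling.
    return {"model": "large-reasoning-model", "reasoning": "extended",
            "max_reasoning_tokens": 6000,
            "tools": ["search", "code", f"{domain}_knowledge_base"]}

print(select_config(difficulty=0.8, uncertainty=0.6, domain="security"))
```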
There are real-world use cases for reasoning systems. For example, threat actor attribution (TAA) identifies the individuals, groups or organizations responsible for a cyberattack or malicious activity. TAA requires that a model form a hypothesis, compare candidate actors and select the most likely threat actor.
A router-based reasoning system may work well for the TAA use case because:
- Some inputs may be simple (e.g., matching known patterns) and handled by smaller models or rules.
- Others may demand deeper reasoning (e.g., synthesizing fragmented signals across domains) that requires a more powerful model.
If a system misroutes a hard case to a weak model, attribution will fail, and given the high stakes in cybersecurity, that failure may have severe consequences.
For adaptive reasoning systems to work reliably at scale, routing decisions can’t be a black box. Over time, the only way to keep adaptive reasoning systems reliable is to evaluate performance continuously, not just when things break.
Evaluation (sometimes shortened as “eval” or “evals”) isn’t just about debugging where systems are failing; it is about building a feedback loop that keeps systems aligned as data, models and user expectations shift.
In traditional machine learning, that loop follows a familiar sequence: train → validate → test. For agentic systems, the evaluation development loop (a minimal sketch in code follows the list) uses:
- Eval/test sets to shape and validate system behavior.
- Meta-evaluation for LLM judges, since the evaluator is itself a model.
- Architecture-level evaluation to assess prompts, planners, tool-use strategies and reasoning depth.
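Here is a minimal sketch of what such a loop might look like for routing decisions: replay a labeled eval set through the router and track accuracy and cost per tier. The eval-set format, exact-match grading and cost table are simplifying assumptions; in practice the grader is often an LLM judge that itself needs meta-evaluation, as noted above.

```python
# A minimal evaluation loop for routing decisions. The record format,
# grading rule and cost figures are illustrative simplifications.
from collections import defaultdict

def evaluate_router(eval_set, route_fn, answer_fn, cost_per_call):
    """eval_set: list of {'query': str, 'expected': str} records."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "cost": 0.0})
    for record in eval_set:
        tier = route_fn(record["query"])
        answer = answer_fn(tier, record["query"])
        stats[tier]["n"] += 1
        stats[tier]["correct"] += int(answer.strip() == record["expected"].strip())
        stats[tier]["cost"] += cost_per_call[tier]
    return {tier: {"accuracy": s["correct"] / s["n"], "avg_cost": s["cost"] / s["n"]}
            for tier, s in stats.items() if s["n"]}

# Re-run this on every model, prompt or router change, and on a schedule,
# so regressions surface before users see them.
```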
The key is that evals aren’t an afterthought. They’re a design tool. An effective evaluation loop surfaces gaps before they show up in user experience.
Adaptive reasoning won’t succeed at scale without evals, and implementing it at scale may require dedicated evaluation infrastructure. Organizations that treat evals as core design, not just debugging, may be able to deploy reasoning faster and cheaper.
Summary for CIOs
Reasoning is quickly becoming the backbone of enterprise AI, but static systems waste time and money and invite risk. Without adaptive routing, costs and latency rise; without evaluation loops, trust erodes.
For CIOs, the takeaway is clear: make adaptivity and evaluation core design principles. Treat routing as an operating and efficiency lever to balance accuracy, cost and latency, and embed evaluation loops as a permanent safeguard.
Those who build this foundation now may be able to scale AI faster, cheaper and with greater reliability than competitors.
This article is published as part of the Foundry Expert Contributor Network.