The AI landscape is evolving at breakneck speed. We’ve grown accustomed to chat models like GPT-4o and Claude 3.5 Sonnet—masters of fast, intuitive responses. They write emails, crack jokes, and explain quantum physics in plain English. But when faced with debugging a 500-line script or optimizing a supply chain, their limits become clear. These models excel at conversational tasks but struggle with problems that demand slow, deliberate reasoning.
With the recent release of DeepSeek’s R1 and OpenAI’s o1/o3 models, we are witnessing the rise of reasoning LLMs—a new class of AI designed not just to chat but to think. These models take a methodical, step-by-step approach to tackle complex math problems, debug code, and analyze medical data with precision. They represent a shift from “autocomplete on steroids” to deliberate problem-solving machines.
Let’s explore how these reasoning models work, why they’re different, and how they’re reshaping AI’s role in solving the toughest challenges.
Understanding Reasoning LLMs
To grasp why reasoning models like DeepSeek R1 and OpenAI’s o1/o3 are revolutionary, we need to start with the foundation: next-word prediction, the engine powering most chat models today.
The Power (and Limits) of Next-Word Prediction
Models like GPT-4o and Claude 3.5 Sonnet are trained to predict the next token (a word or subword) in a sequence. This simple objective forces them to learn grammar, world knowledge, and even basic reasoning—all at once.
This approach scales remarkably well. As models grow larger and datasets expand, they unlock emergent abilities—skills like translation or arithmetic that weren’t explicitly programmed. GPT-3, for instance, couldn’t reliably solve math problems, but GPT-4 improved significantly.
But there’s a catch: next-word prediction is System 1 thinking—fast, intuitive, but shallow. It uses the same computational effort for easy tasks (“What’s 2+2?”) and hard ones (“Prove Fermat’s Last Theorem”). When faced with complex reasoning, it often guesses plausibly rather than calculating methodically.
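To make next-word prediction concrete, here’s a minimal sketch of greedy next-token decoding using the Hugging Face transformers library (the model and prompt are just illustrative):

```python
# Minimal sketch of next-token prediction: the model repeatedly picks
# the single most likely next token and appends it to the sequence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative small model
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(5):                                   # generate 5 tokens, one at a time
    logits = model(input_ids).logits                 # scores over the whole vocabulary
    next_id = logits[0, -1].argmax()                 # greedy pick: most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Every response—from a joke to a physics explanation—is produced this way, one token at a time.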
The Chain-of-Thought Workaround
To address this limitation, researchers introduced the concept of chain-of-thought (CoT) prompting in 2022. This technique leverages prompt engineering to guide models into explicitly revealing their intermediary reasoning steps or internal thoughts, nudging them toward System 2 thinking—slower, more deliberate, and logical reasoning.
By generating intermediate tokens (like algebraic steps), the model “stores” partial solutions, mimicking human problem-solving.
This hack improved performance on math, coding, and logic tasks—but it was still a band-aid. Models weren’t trained to reason this way; they were just prompted to.
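To make the difference tangible, here’s a minimal sketch of what CoT prompting changes—nothing but the prompt itself (the wording is illustrative):

```python
# Standard prompt: the model jumps straight to an answer.
standard_prompt = "Solve 3x + 5 = 20. Answer:"

# Chain-of-thought prompt: the model is nudged to emit intermediate steps first.
cot_prompt = (
    "Solve 3x + 5 = 20.\n"
    "Think step by step: isolate the term with x, then divide, "
    "and only then state the final answer."
)
# With the CoT prompt the model typically writes something like
# "3x = 20 - 5 = 15, so x = 15 / 3 = 5" before giving the answer —
# those intermediate tokens act as scratch space for partial results.
```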
The New Paradigm: Reinforcement Learning Meets Chain-of-Thought
Reasoning LLMs like DeepSeek R1 and OpenAI’s o1/o3 go further. Instead of relying on prompting tricks, they’re trained from the ground up using reinforcement learning over chain-of-thought trajectories.
Let's break down how it works.
First, let’s introduce Reinforcement Learning (RL) for those who might be unfamiliar with it. It’s a machine learning paradigm where an agent learns to make decisions through trial and error, receiving rewards for good actions and penalties for bad ones. Think of it like training a dog—you reinforce the behaviors you want to encourage with rewards.
In the world of conversational AI, RL plays a pivotal role. You’ve likely come across terms like RLHF (Reinforcement Learning from Human Feedback) and RLAIF (Reinforcement Learning from AI Feedback). These techniques fine-tune models like GPT-4o or Claude to:
- Align outputs with human preferences (e.g., polite, helpful).
- Avoid harmful or nonsensical responses.
For example, RLHF might nudge a model toward drafting concise, well-structured email responses while steering it away from rambling or off-topic content. However, it’s important to note that these techniques focus on style and safety, not necessarily on improving reasoning correctness.
How RL Transforms Reasoning Models
With reasoning LLMs, RL is applied differently. Instead of aligning with human preferences, it’s used to train models to generate verified, correct solutions. Here’s how:
1. Training Data with Ground Truth:
   - Datasets contain problems with explicitly correct answers (e.g., solved math equations, bug-free code).
   - Example: a Python script that calculates Fibonacci numbers, verified by unit tests.
2. Generate Multiple Trajectories:
   - For each problem, the model produces dozens of potential solutions (trajectories), including intermediate reasoning steps.
   - Example: 10 different ways to derive the solution to “Solve 3x + 5 = 20”, some correct, some flawed.
3. Grader/Verifier:
   - A verifier (e.g., a code executor or equation solver) checks each trajectory.
   - Correct solutions earn a reward; incorrect ones earn nothing or a penalty.
4. Policy Optimization:
   - The goal of RL is to learn the best policy, so the model’s weights are adjusted to favor high-reward trajectories.
   - Over time, the model internalizes the patterns that lead to verified answers, reinforcing correct reasoning.
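Here’s a highly simplified sketch of that loop; the sampler, verifier, and update function are hypothetical placeholders, not any lab’s actual training code:

```python
# Simplified RL-over-chain-of-thought training step.
# sample_trajectories, verify, and reinforce are hypothetical placeholders.
def rl_training_step(model, problem, ground_truth, n_samples=16):
    # 1. Sample several reasoning trajectories (CoT steps + final answer).
    trajectories = sample_trajectories(model, problem, n=n_samples)

    # 2. Score each trajectory with an automatic verifier
    #    (a code runner, equation solver, etc., depending on the domain).
    rewards = [
        1.0 if verify(traj.final_answer, ground_truth) else 0.0
        for traj in trajectories
    ]

    # 3. Policy optimization: make high-reward trajectories more likely
    #    and low-reward ones less likely (PPO/GRPO-style in practice).
    reinforce(model, trajectories, rewards)
```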
The Process in Action
Let’s break down a training step for a coding problem:
- Input: “Write a Python function to calculate Fibonacci numbers.”
- Model Generates: 20 potential solutions with different algorithms (recursive, iterative).
- Grader Tests: each solution is run against unit tests (e.g., fib(5) must return 5).
- Reward: only bug-free code that passes the tests gets a reward.
- Policy Update: the model learns to prioritize efficient, test-passing code.
This method, described in OpenAI’s Learning to Reason with LLMs blog post and Nathan Lambert’s breakdown, turns the model into a self-improving problem-solver.
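As a toy illustration of the grading step, a verifier for the Fibonacci problem could be as simple as running each candidate against unit tests; the exec-based harness below is purely illustrative (real pipelines sandbox untrusted code):

```python
# Toy grader: reward = 1.0 only if the candidate's fib() passes every unit test.
UNIT_TESTS = [(0, 0), (1, 1), (5, 5), (10, 55)]      # (input, expected value)

def grade(candidate_source: str) -> float:
    namespace = {}
    try:
        exec(candidate_source, namespace)             # load the candidate's fib()
        fib = namespace["fib"]
        passed = all(fib(n) == expected for n, expected in UNIT_TESTS)
    except Exception:
        return 0.0                                    # crashing code earns no reward
    return 1.0 if passed else 0.0
```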
A New Scaling Law: Quality Over Quantity
In the past 2 years, LLM progress followed a predictable formula:
bigger models + more data = better performance
But this paradigm is hitting diminishing returns. Training data is becoming scarce (e.g., high-quality text is finite), and doubling model size no longer guarantees proportional gains.
Reasoning models like OpenAI’s o1 and DeepSeek R1 rewrite this playbook. Their performance scales not with raw data volume, but with how compute is allocated to verify reasoning quality. This introduces a new scaling law: performance improves with the compute devoted to generating and verifying reasoning—both during training and at test time.
What the Data Shows
- As OpenAI’s o1 training results show, reasoning models improve steadily as more compute is allocated to verifying trajectories. Unlike traditional training (where gains plateau), iterative refinement of reasoning paths yields continuous improvements.
- Additionally, at test time, allowing models like DeepSeek R1 to use more tokens for “thinking” (e.g., generating longer chains of thought) directly boosts accuracy. This mirrors human problem-solving—more time spent deliberating often leads to better solutions.
Why This Matters
Smaller Models, Bigger Impact: A 7B-parameter reasoning model can outperform a 70B-parameter generalist on niche tasks (e.g., medical coding analysis) by focusing compute on verification, not memorization.
Escape the Data Bottleneck: Instead of scraping the internet for more text, these models extract value from smaller, high-quality datasets (e.g., verified math problems).
Flexible Compute Tradeoffs: Users can “dial up” accuracy by allowing more reasoning steps (tokens) or “dial down” for faster, cheaper outputs.
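For example, OpenAI’s o-series exposes this dial directly; here’s a sketch assuming the reasoning_effort parameter in the official Python SDK (check the current API docs for exact names and supported models):

```python
# Sketch: trading cost/latency for accuracy by dialing reasoning effort up or down.
# Assumes OpenAI's Python SDK and the reasoning_effort parameter for o-series models.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",   # "low" = faster and cheaper, "high" = more thinking tokens
    messages=[{"role": "user", "content": "Prove that the sum of two odd numbers is even."}],
)
print(response.choices[0].message.content)
```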
The Future of Reasoning Models
Today’s reasoning models are still in their infancy, yet they’re already achieving what took traditional LLMs years to master. Benchmarks like GPQA—a grueling test of complex, “Google-proof” reasoning—are being saturated 10x faster than older models could manage.
And this is just the beginning!
With the recent surge of DeepSeek R1 and a growing wave of open-source reasoning models built on bases like Llama and Mistral, more researchers and developers are diving into this field. This collective momentum promises rapid advancements—not in years, but potentially within months. The era of deliberate, step-by-step reasoning has just begun, and its potential to reshape problem-solving is boundless.
How to Use Reasoning LLMs
(Inspired by Latent Space’s “Missing Manual” for OpenAI o1)
Reasoning models like OpenAI’s o1 or DeepSeek R1 demand a fundamentally different prompting approach than chat-first models like Claude or GPT-4o. Here’s how to use them effectively:
1. Provide Extensive Context
Think of the model as a new hire. It won’t ask for more details, so you need to give it everything upfront. Share as much context as possible—even more than you think is necessary. For example:
- Explain what you’ve already tried and why it didn’t work.
- Include relevant data, like database schemas or company-specific terms.
- Describe the problem space in detail.
The more context you provide, the better the model can deliver accurate results without unnecessary back-and-forth.
2. Focus on the Outcome, Not the Process
Instead of telling the model how to solve a problem, focus on what you want as the final result. Be specific about your goals:
- Do you need a complete file, a list of options, or a detailed explanation?
- Let the model handle the reasoning and planning—it’s designed to do that for you.
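Putting both tips together, a prompt to a reasoning model might look like this (the project details are invented purely for illustration):

```python
# Illustrative prompt structure: heavy upfront context, outcome-focused ask.
prompt = """
Context:
- Flask API backed by PostgreSQL; the `orders` table schema is pasted below: ...
- The /report endpoint times out for accounts with more than 100k orders.
- Already tried: an index on `created_at` (no improvement) and raising the timeout.

Outcome I want:
- A complete, drop-in replacement for report_service.py that fixes the timeout.
- A short list of tradeoffs of the approach you chose.
"""
```

Notice there are no instructions on how to fix the problem—the model is left to plan the approach itself.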
3. Understand Their Strengths and Limitations
✅ What They Can Do
Generate Complete Files: They can one-shot entire files (or multiple files) with minimal errors, following your existing patterns.
Explain Complex Concepts: They excel at breaking down difficult topics with clear examples, almost like writing an article.
Provide Structured Outputs: They can create detailed comparisons, pros/cons lists, or multiple plans for architectural decisions.
Hallucinate Less: They are generally more accurate, especially with niche tasks like bespoke query languages or medical differential diagnoses.
❌ What They Can’t Do (Yet)
Write in Specific Styles: They tend to default to an academic or corporate tone, struggling to adapt to unique voices.
Build Entire Applications: While great at one-shotting features, they require significant iteration to create a full SaaS or complex app.
By following these tips, you can unlock the full potential of reasoning models and achieve faster, more accurate results.
Practical Use Cases
Reasoning models excel at tasks that require deep thinking and structured problem-solving. Here are some of their applications:
Coding
As mentioned earlier, these models can generate multiple code files in a single step with minimal errors. For example, Cole Medin demonstrated how DeepSeek R1 was used within Bolt.diy to build a complete chat interface for an AI agent, delivering outstanding results compared to Claude 3.5 Sonnet on the first attempt.
Planning in Workflows and Agents
Reasoning models are highly effective in upfront planning for agent-based workflows, outlining detailed steps for execution by traditional agents.
Unify used OpenAI’s o1 model with LangGraph to build AI agents for account qualification; the model generated detailed, step-by-step plans and identified potential challenges. It effectively expanded on user queries, even with minimal input, making it ideal for strategic planning and further research tasks.
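A minimal sketch of that planner/executor pattern is shown below; the function names and prompt are hypothetical placeholders, not Unify’s actual LangGraph code:

```python
# Planner/executor sketch: a reasoning model drafts the plan up front,
# then a cheaper chat model or tool-calling agent executes each step.
# plan_with_reasoning_model and execute_step are hypothetical placeholders.
def run_workflow(task: str) -> list[str]:
    plan = plan_with_reasoning_model(
        f"Task: {task}\n"
        "Return a numbered plan with concrete, verifiable steps, "
        "and call out any risks or missing information."
    )

    results = []
    for step in plan.steps:            # each step is handled by a traditional agent
        results.append(execute_step(step))
    return results
```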
Deep Research and Reflection
As research assistants, these models excel at analyzing academic papers, summarizing key findings, and generating comprehensive reports. They are also highly effective at reflecting on and analyzing search results or large sets of documents, enabling them to extract critical insights or identify overlooked details.
LangChain has published an excellent tutorial on how to set up your own local research assistant using DeepSeek-R1, making it easier than ever to leverage these powerful tools for academic and research purposes.
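If you just want a small local taste of that workflow, something like the following works, assuming Ollama is running with a pulled deepseek-r1 model and the ollama Python package installed (the prompt and snippets are illustrative):

```python
# Minimal local reflection step: ask a locally served DeepSeek-R1
# to digest a batch of search results. Requires `ollama pull deepseek-r1`.
import ollama

search_results = ["...snippet 1...", "...snippet 2...", "...snippet 3..."]

response = ollama.chat(
    model="deepseek-r1",
    messages=[{
        "role": "user",
        "content": "Summarize the key findings, flag contradictions, and list open questions:\n\n"
                   + "\n\n".join(search_results),
    }],
)
print(response["message"]["content"])
```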
Data Analysis
Advanced reasoning models have proven to be invaluable for analyzing complex datasets, including those in highly specialized fields like medicine and finance. These models uncover patterns, extract actionable insights, and streamline decision-making processes.
- Enhancing Financial and Media Analysis Using OpenAI’s o1 Model in AiReportPro: explores how the o1 model transforms financial and media analysis, empowering businesses with real-time intelligence for better strategic decisions.
- A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?: investigates the potential of OpenAI’s o1 model in medical applications, assessing its capability to support healthcare professionals through diagnostic and predictive analytics.
Newsfeed Analysis
Reasoning models are powerful tools for analyzing news and social media, identifying trends, and surfacing relevant information. For example, the Firecrawl team developed the o1 Trend Finder, a tool powered by the o1 model. It efficiently filters critical insights from news feeds, providing users with targeted updates that matter most.
Conclusion
We’ve seen how reasoning LLMs like DeepSeek R1 and OpenAI o1/o3 are flipping the script. No longer just "fast talkers," these models slow down, break problems into steps, and verify solutions like meticulous problem-solvers. By combining chain-of-thought logic with reinforcement learning, they’re tackling tasks that once stumped traditional chatbots—debugging code, cracking math proofs, and untangling complex data.
This isn’t just an upgrade—it’s a fundamental shift! Reasoning models prove that AI can do more than mimic human conversation; it can mirror our deliberate, step-by-step thinking. But here’s the kicker:
Does this mean we’re edging closer to AGI?
While they’re still far from human-like general intelligence, their ability to learn verified reasoning patterns hints at a future where AI doesn’t just answer—it understands.
So, where do you think we’re headed with ever smarter LLMs? Share your thoughts below! 🌟🤔