While reading and taking notes on the DeepSeek-R1 paper, I decided to tidy up my notes and share them with people who are excited about current developments in the AI world. Let's dive in.
Link to the article
What makes DeepSeek R1 Special?
The LLM industry was primarily led by private companies until this model emerged. All the best-performing models in recent benchmarks, such as OpenAI's o1, Anthropic's Claude 3.5, and Google's Gemini, were developed by private companies with massive budgets. This raised concerns about a potential monopoly over such a powerful tool. DeepSeek R1, however, delivered a state-of-the-art (SOTA) open-source alternative, matching the proprietary leaders on reasoning benchmarks while openly publishing the model weights and describing the training recipe.
We can analyze every AI advancement through three pillars: the mathematical model, compute power, and data.
Deepseek has made significant progress in all three pillars. However, for this blog, I will not cover the general news but will instead focus on improvements in the model architecture.
Contributions of R1
We can summarize the contributions of R1 in two points:
1 - Post-Training: Large-Scale Reinforcement Learning on the Base Model
2 - Distillation: Smaller Models Can Be Powerful Too
What is Reasoning?
Reasoning in AI refers to the ability of a model to break down complex problems into smaller, more manageable steps and solve them systematically. In the context of language models like DeepSeek R1, reasoning capabilities allow the model to "show its work" by explicitly demonstrating the thought process that leads to its final answer.
Interestingly, when we ask language models to first explain the steps to solve a problem before actually solving it, they tend to produce better outputs; this is the core idea behind chain-of-thought prompting.
Usage of Reinforcement Learning in Reasoning
During the development of the R0 model (called DeepSeek-R1-Zero in the paper), the predecessor of R1, the DeepSeek team used DeepSeek V3 as the base model. The architecture of V3 is similar to other LLMs on the market. To give this model a reasoning mechanism, the team used reinforcement learning, which eliminated the need to generate large amounts of training data for supervised learning. Specifically, they employed the GRPO (Group Relative Policy Optimization) algorithm.
The formula below is the GRPO objective, whose purpose is to maximize the advantage A_i. For each question, a group of outputs is sampled from the current (old) policy, each output's advantage is measured relative to the group's average reward, and the policy is nudged toward the above-average outputs. While doing so, we prefer small update steps for the sake of stability.
$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)} \left[ \frac{1}{G}\sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i,\ \text{clip}\!\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\!\left( \pi_\theta \,\|\, \pi_{ref} \right) \right) \right]$$

where the group-relative advantage is computed from the rewards of the sampled group:

$$A_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}$$
In this formula:
$$\beta\, \mathbb{D}_{KL}\!\left( \pi_\theta \,\|\, \pi_{ref} \right)$$
A penalty term that discourages the new policy from deviating too far from a chosen reference policy by penalizing their KL divergence.
$$\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}$$
This ratio compares how likely the new policy is to pick a certain output relative to how likely the old policy was to pick that same output. If the ratio is greater than 1, the new policy chooses it more often than before; if it is less than 1, the new policy chooses it less often.
$$\text{clip}\!\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i$$
This term ensures that you do not get rewarded for “moving too far” from the old policy if the ratio would otherwise give a large (but potentially destructive) update.
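To make the objective less abstract, here is a minimal PyTorch sketch of a GRPO-style update for a single question and a group of sampled outputs. It is a simplification (per-sequence log-probabilities instead of per-token terms, toy numbers, my own function and variable names), not DeepSeek's actual training code.

```python
import torch

def grpo_objective(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, beta=0.04):
    """Sketch of the GRPO objective for one question and a group of G sampled outputs.

    logp_new: log pi_theta(o_i | q)      -- current policy, requires grad
    logp_old: log pi_theta_old(o_i | q)  -- fixed, from the sampling policy
    logp_ref: log pi_ref(o_i | q)        -- fixed, from the frozen reference model
    rewards : scalar reward r_i for each sampled output
    """
    # Group-relative advantage: normalize rewards within the sampled group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Probability ratio between the new and the old policy.
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate: no extra reward for moving too far from the old policy.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped)

    # KL penalty toward the reference policy (the estimator used in the paper:
    # pi_ref/pi_theta - log(pi_ref/pi_theta) - 1).
    ratio_ref = torch.exp(logp_ref - logp_new)
    kl = ratio_ref - (logp_ref - logp_new) - 1

    return (surrogate - beta * kl).mean()  # maximize this (minimize its negative)


# Toy usage: a group of G = 4 sampled answers to one question.
logp_old = torch.tensor([-4.0, -3.5, -5.0, -4.2])
logp_ref = torch.tensor([-4.1, -3.6, -5.1, -4.3])
logp_new = logp_old.clone().requires_grad_(True)
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])  # e.g., 1 if the answer was correct

loss = -grpo_objective(logp_new, logp_old, logp_ref, rewards)
loss.backward()
```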
Reward Function
The hardest part of using reinforcement learning is defining a reward function, since it serves as the roadmap for the final model. The R1 team used two types of rewards.
1 - Accuracy Rewards: This is generated based on the correctness of answers to exact-answer questions, such as coding, logic, and mathematics. This value can only be generated if the question has an evaluable answer. For example, it is not possible to generate a reward for a question like "How does the moon appear tonight?"
2 - Format Rewards: This is generated by checking whether the response is correctly formatted. The correct format is:
<think>REASONING RESPONSE</think> <answer>ANSWER RESPONSE</answer>
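The paper does not ship the reward code, but since both rewards are rule-based they are easy to sketch. Below is a toy Python version of the two signals; the function names, the regex, and the equal weighting are my own assumptions.

```python
import re

# Response must look like: <think>...</think> <answer>...</answer>
THINK_ANSWER = re.compile(r"^<think>.+</think>\s*<answer>.+</answer>$", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response follows the <think>/<answer> template, else 0.0."""
    return 1.0 if THINK_ANSWER.match(response.strip()) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """1.0 if the extracted answer matches the reference (only for verifiable questions)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

# Toy usage for a math question with an exact answer.
response = "<think>7 * 6 = 42</think> <answer>42</answer>"
total_reward = accuracy_reward(response, "42") + format_reward(response)
```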
DeepSeek R1
Even though R0 gave promising results, there were three problems in the model:
1 - Although the model was performing well on questions with exact solutions, due to the lack of supervised training, it did not perform well on general questions like "How do you make a pizza?"
2 - Sometimes the model mixed languages, producing poorly readable outputs. Resolving this required supervised fine-tuning on clean, consistent data.
3 - The model was capable of producing dangerous responses to questions like "How to make an atomic bomb?" and generated outputs that were not aligned with China's political perspective. The final model therefore had alignment constraints, and with them bias, built in.
The training process was completed in four steps:
1 - Cold Start
To address the above issues, they first applied supervised training. The cold-start data, a few thousand long chain-of-thought examples, was generated by prompting DeepSeek R0 and other models (possibly OpenAI models :D), and the outputs were then edited by humans. After training on this data, the model was ready for reinforcement learning.
2 - Reasoning-oriented Reinforcement Learning
The RL algorithm is very similar to the one used in R0, as explained above. The only difference is that they introduced a third reward focused on language consistency. The purpose of this reward is to penalize the model when it mixes languages in its responses, which often occurs when the prompt includes words from different languages.
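The paper measures this reward roughly as the proportion of target-language words in the chain of thought. Here is a toy heuristic sketch of that idea; the ASCII-letter check standing in for "English" is my own simplification, and a real system would use proper language identification.

```python
import re

def language_consistency_reward(cot: str) -> float:
    """Toy heuristic: fraction of whitespace-separated tokens that look like English words.

    Approximates "target-language words" with an ASCII-letter check; this is a
    stand-in, not the actual metric used by DeepSeek.
    """
    tokens = re.findall(r"\S+", cot)
    if not tokens:
        return 0.0
    english_like = sum(bool(re.fullmatch(r"[A-Za-z][A-Za-z'\-]*", t)) for t in tokens)
    return english_like / len(tokens)

# A fully English chain of thought scores close to 1.0; mixed-language output scores lower.
reward = language_consistency_reward("First compute the area and then divide it by two")
```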
3 - Rejection Sampling and Supervised Fine-Tuning
Unlike the initial cold-start data, which primarily focused on reasoning, this stage incorporated data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks.
The data used for fine-tuning was divided into two categories:
- a - Reasoning Data (600k samples): Generated via rejection sampling, where many candidate responses were sampled from the RL checkpoint and only the correct, well-formatted ones were kept (see the sketch after this list).
- b - Non-Reasoning Data (200k samples): For tasks such as writing, factual QA, self-cognition, and translation, they adopted the DeepSeek-V3 pipeline and reused portions of the SFT dataset from DeepSeek-V3.
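Rejection sampling itself is simple: sample many candidates, keep only the ones a checker accepts, and reuse them as supervised data. A minimal sketch follows, where `generate` and `verify` are hypothetical stand-ins for the RL checkpoint and the rule-based checker.

```python
def rejection_sample(prompt, generate, verify, num_samples=16):
    """Sketch of rejection sampling for building the reasoning SFT set.

    generate(prompt) -> one candidate response (stand-in for the RL checkpoint)
    verify(prompt, response) -> True if the answer is correct and well formatted
    Only accepted samples are kept as (prompt, response) training pairs.
    """
    kept = []
    for _ in range(num_samples):
        response = generate(prompt)
        if verify(prompt, response):
            kept.append((prompt, response))
    return kept

# Toy usage with stand-in callables.
samples = rejection_sample(
    "What is 6 * 7?",
    generate=lambda p: "<think>6 * 7 = 42</think> <answer>42</answer>",
    verify=lambda p, r: "<answer>42</answer>" in r,
)
```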
4 - Reinforcement Learning for all Scenarios
To further align the model with human preferences, they implemented a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. If the model’s response was not produced in the desired manner, it was given a negative reward signal.
Distillation: Empower Small Models with Reasoning Capability
As stated above, the main challenge of supervised training is gathering training data. As a solution, the DeepSeek team generated training data with the far more capable R1 model and used it to fine-tune smaller open models (Qwen and Llama variants). The idea proved highly effective.
This process involved training smaller models to mimic the reasoning patterns and outputs of the larger R1 model. Through this approach, they successfully created more compact models that maintained much of the reasoning power while being more practical for deployment.
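Mechanically, this kind of distillation is just supervised fine-tuning on the teacher's outputs. Below is a minimal sketch with Hugging Face Transformers; the student model name, the data, and the hyperparameters are placeholders, not DeepSeek's actual pipeline (which fine-tuned Qwen and Llama models on roughly 800k R1-generated samples).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any small open causal LM can play the student role here.
student_name = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# Teacher-generated (prompt, response) pairs, i.e. outputs sampled from the R1 teacher.
pairs = [
    ("What is 6 * 7?", "<think>6 * 7 = 42</think> <answer>42</answer>"),
]

student.train()
for prompt, teacher_response in pairs:
    text = prompt + "\n" + teacher_response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal-LM loss on the teacher's tokens: the student learns to imitate.
    # In practice the prompt tokens would be masked out of the loss; omitted for brevity.
    outputs = student(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```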
I hope you find the summary helpful. See you in the next writing.
Never lose the child within you—dream, wonder, and stay joyful.