This time I changed the task, combining three datasets: MMLU-Pro, MELD, and MATH, all under a single shared system prompt. The model is still Qwen2.5-0.5B-Instruct (small enough, and no cold start needed).
```python
PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed with <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think> <answer> answer here </answer>"
)
```
There are two reward functions. The first is format_reward, which, as the name suggests, checks whether the output follows the required <think> </think> <answer> </answer> format; the second is accuracy_reward, which checks whether the answer is correct.
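The two rewards can be sketched as follows. This is a minimal illustration, not the exact training code: the function names, signatures, and the exact-string-match accuracy check are assumptions.

```python
import re

# Matches the required layout: a <think> block followed by an <answer> block.
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the output matches <think>...</think> <answer>...</answer>, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the text inside the <answer> tags equals the reference answer."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold.strip() else 0.0
```

Note that format_reward is easy to maximize with a short regex-satisfying template, while accuracy_reward requires actually solving the problem, which is relevant to the behavior described below.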
The training logs clearly show that format_reward jumped sharply at one point (could this be called an emergent phenomenon?), while accuracy_reward kept oscillating widely with no clear upward trend. This suggests the model raised its score merely by imitating the required format, not by meaningfully improving task performance through thinking before answering.
I tested one problem, and the output shows the model is only crudely imitating the format rather than actually learning to think:
```
<think>
First, we need to understand what's happening when we "double" a number. Doubling a number means multiplying it by 2, so if we have a number x, doubling it will be represented as 2x.
</think>
<answer>Let's denote the number as x. According to the problem, doubling the number and then increasing it by 5 gives 15.
According to our formula, it would look like this:
\[ 2x + 5 = 15 \]
Now we need to solve for x by first subtracting 5 from both sides of the equation:
\[ 2x + 5 - 5 = 15 - 5 \]
\[ 2x = 10 \]
Afterwards, we divide both sides by 2 to solve for x:
\[ \frac{2x}{2} = \frac{10}{2} \]
\[ x = 5 \]
Therefore, the number in question is 5.</answer>
```
Looking into this, there are probably two causes: (1) the model is too small; (2) the reward functions are poorly designed.