This time I changed the task, combining three datasets: MMLU-Pro, MELD, and MATH, all under a single shared system prompt. The model is still Qwen2.5-0.5B-Instruct (small enough, and no cold start needed).
```python
PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed with <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think> <answer> answer here </answer>"
)
```
There are two reward functions. The first is format_reward, which, as the name suggests, checks whether the output follows the required <think> </think> <answer> </answer> format; the second is accuracy_reward, which checks whether the answer is correct.
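The two rewards can be sketched as follows. This is a minimal illustration, not the exact training code: the function names, signatures, and the exact-string-match accuracy check are assumptions.

```python
import re

# Matches the required layout: a <think> block followed by an <answer> block.
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the output matches <think>...</think> <answer>...</answer>, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the text inside the <answer> tags equals the reference answer."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold.strip() else 0.0
```

Note that format_reward is easy to maximize with a short regex-satisfying template, while accuracy_reward requires actually solving the problem, which is relevant to the behavior described below.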
The training logs clearly show that format_reward jumped sharply at one point (could this be called an emergent phenomenon?), while accuracy_reward kept oscillating widely with no clear upward trend. This suggests the model raised its score merely by imitating the required format, not by meaningfully improving task performance through thinking before answering.
I tested one problem, and the output shows the model is only crudely imitating the format rather than actually learning to think:
```
<think>
First, we need to understand what's happening when we "double" a number. Doubling a number means multiplying it by 2, so if we have a number x, doubling it will be represented as 2x.
</think>
<answer>Let's denote the number as x. According to the problem, doubling the number and then increasing it by 5 gives 15.
According to our formula, it would look like this:
\[ 2x + 5 = 15 \]
Now we need to solve for x by first subtracting 5 from both sides of the equation:
\[ 2x + 5 - 5 = 15 - 5 \]
\[ 2x = 10 \]
Afterwards, we divide both sides by 2 to solve for x:
\[ \frac{2x}{2} = \frac{10}{2} \]
\[ x = 5 \]
Therefore, the number in question is 5.</answer>
```
Looking into this, there are probably two causes: (1) the model is too small; (2) the reward functions are poorly designed.