Maxim Saplin

DeepSeek-V3: Laziness vs Eagerness

In late 2023, people complained about GPT-4 Turbo's laziness - the model often didn't complete tasks. Later, OpenAI stated they had fixed this issue in the newer gpt-4-0125-preview. With DeepSeek-V3 (not to be confused with R1), I observed the opposite behavior - you can get more than you ask for.

A recent example: I asked Aider (configured with deepseek-chat) to add a small feature to a web app. Besides the changes related to the task, the model made additional, unrequested modifications. It corrected a few spelling issues in the HTML, which seemed nice, although unrelated to the request. However, it also modified an unrelated piece of JavaScript, breaking an older feature. This made me realize I would prefer the lazy model I was used to over this eager new one full of surprises.

With DeepSeek-V3, I think it goes a bit further. The model can be characterized as hectic and sporadic. Stress testing it in LLM Chess highlights how brittle the model is: it makes mistakes in completions at a high rate and frequently breaks the game loop. On average, it stays in the game for 58 moves, compared to 190 moves for gpt-4o-2024-11-20. It is also far more verbose than other SOTA models (247 tokens per 1,000 moves vs. 51 tokens for gpt-4o-2024-11-20).
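To make the failure mode concrete, here is a minimal sketch of the kind of game-loop harness LLM Chess implies: count how many moves a model survives before a malformed or illegal reply breaks the loop. The `ask_model` stub is a hypothetical stand-in for a real chat-completion call; the actual benchmark's prompts, opponents, and retry rules differ.

```python
# Minimal durability game loop, in the spirit of LLM Chess:
# how many moves does the model survive before breaking protocol?
import random
import chess  # pip install python-chess


def ask_model(board: chess.Board) -> str:
    """Hypothetical stand-in for an LLM call returning a UCI move string."""
    return random.choice([m.uci() for m in board.legal_moves])


def moves_survived(max_moves: int = 200) -> int:
    board = chess.Board()
    for n in range(max_moves):
        reply = ask_model(board).strip()
        try:
            move = chess.Move.from_uci(reply)  # malformed reply raises here
        except ValueError:
            return n  # model broke the move format
        if move not in board.legal_moves:
            return n  # model hallucinated an illegal move
        board.push(move)
        if board.is_game_over():
            break
    return n + 1


print(moves_survived())
```

A brittle model ends this loop early with malformed or illegal moves; a durable one plays out the game.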

DeepSeek-V3 might be a good chat model - i.e., when a human interacts with a chatbot directly, carefully reads the replies, and doesn't care much about a few details being off. Some might find the model's eagerness useful in coding, assuming every change is carefully inspected (though how many would actually spend their brain cycles reviewing the outputs rather than copy-pasting right away?).

OpenAI and Anthropic have found a better balance between laziness and eagerness. Their models demonstrate better steerability and stability. I would be cautious about applying DeepSeek's models in a larger-scale, multi-user environment with LLMs integrated into complex workflows (prompt chains, agentic flows, outputs consumed by third parties, etc.).

Engineering teams working with LLM integrations know the challenge: you build the product, test it internally, and everything seems fine. Then you ship, and with many more people using it, you discover how creative users can be in their interactions with the LLM. The tested prompts and models start failing at an alarming rate in production. Model "durability" - its resilience to tricky combinations of input tokens, which can unexpectedly provoke hallucinations, break instructions, etc. - becomes far more important when scaling.
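One mitigation that helps at scale is sketched below: validate every completion against a strict output contract before it enters the downstream workflow, and retry a bounded number of times instead of trusting the first reply. The `call_llm` stub and the `REQUIRED_KEYS` contract are hypothetical placeholders, not any particular product's API.

```python
# A sketch of a durability guardrail for LLM outputs feeding a workflow:
# parse, check against a schema, and retry instead of trusting blindly.
import json

REQUIRED_KEYS = {"action", "target"}  # assumed contract with downstream code


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion client call."""
    raise NotImplementedError


def robust_completion(prompt: str, max_retries: int = 3) -> dict:
    last_error = None
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = str(exc)  # malformed JSON: the model broke format
            continue
        if isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys():
            return parsed  # output matches the contract
        last_error = f"missing keys in: {raw!r}"
    raise RuntimeError(f"failed schema check after {max_retries} tries: {last_error}")
```

A model that frequently trips this kind of check burns retries and latency, which is exactly where an eager, sporadic model becomes expensive in production.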
