In-depth Review of Cursor: Revolutionary Productivity Tool or Overhyped Toy?

Cursor has been trending recently, to the point where the programmers around me have stopped discussing Hebei Huahua, LOL, Black Myth: Wukong, and the like, and instead talk about Cursor's assisted-programming capabilities wherever they gather. Content platforms are also churning out related tutorial videos at an astonishing pace:

After trying it for a while, my first impression was genuinely good: it helps me solve many basic problems and tangibly improves development efficiency. Some particularly memorable features include:

  • Enhanced context indexing capabilities brought by features like Codebase Indexing and @symbol, which greatly improve the quality of code generated by LLM;
  • The Cursor Composer feature provides a highly focused programming panel that makes cross-file editing and development much easier than the throwaway chat mode of earlier GPT-style products, and fits professional developers' modular programming habits better.

However, I believe that, at least at the current stage, these products can only be positioned as "assisted programming". They greatly improve efficiency, but they remain auxiliary tools in the programming activity - assistants, colloquially speaking - while human intelligence, the programmer, stays firmly in charge, like the shopkeeper directing the help.

Below, I'll list several points analyzing what tasks Cursor can do well, what it can't do well, and more importantly, how individuals should adapt to this era of rapid AI development.

The Essence of Cursor

Without a doubt, Cursor can solve many basic problems, and I've already gotten used to using it instead of VS Code for daily work. It's very useful, but not mysterious: essentially it's a traditional IDE equipped with good enough interaction design and LLM capabilities, which is what lets it surpass traditional IDEs. On the interaction side, it supplements VS Code with:

  • Better context referencing capabilities tailored for LLM scenarios (i.e., @codebase/@files and other symbol directives);
  • A Composer panel suitable for complex programming, where you can maintain a longer conversation with LLM and support multi-file editing, allowing continuous communication and code generation to complete complex tasks;
  • Nearly frictionless auto-completion that supports multi-line edits, which is very useful for things like renaming variables;
  • It even thoughtfully lets you summon the LLM panel inside the terminal, for command generation and command-line error handling.

These interactive innovations, while not technically complex (and even somewhat flawed), succeed as product design: they both respect developer habits and feed the LLM more contextual information, which improves model output. A particularly commendable point is that these interactions stay consistent with traditional IDE interaction logic, so professional programmers need minimal learning to get started and the integration feels almost seamless.

Beyond interaction, Cursor hasn't invested too much effort in developing its own large model (which shows some restraint), but instead provides relatively streamlined model switching functionality, allowing users to switch between mainstream LLMs in various contexts according to their preferences in the configuration panel:

I think this is a very smart product decision. It avoids pouring limited resources into uncertain large-model development, freeing the team to polish the product experience itself, and it gives users enough freedom to stick with their preferred models. Although it cedes some commercial upside, it gains higher user acceptance, making it easier to win the trust of seed users and stand out in the current market.

As an ordinary developer, I don't think we need to over-mystify this software. As mentioned above, it's essentially a traditional IDE equipped with good enough interaction design and LLM capabilities, so you can simply treat it as an enhanced VS Code: it does everything VS Code does, with many naturally integrated LLM interactions layered on top. Even with zero configuration, features like auto-completion already deliver a big boost to programming experience and efficiency, which I think is well worth the $20 monthly fee.

Cursor's Suitable Scenarios

From my usage experience, I think Cursor is particularly well-suited for the following programming scenarios.

1. Q&A

First is the most basic Q&A scenario. I believe the vast majority of programmers can't know everything, and during development, they always need to look up various materials. In the past, I had to use search engines like Google/Baidu to find relevant articles, then carefully read multiple articles myself and summarize to find solutions.

After LLM supported internet connectivity, I started trying to consult technical questions using platforms like Claude, ChatGPT, Perplexity, etc. These tools can automatically extract keywords from my questions, search the internet, and most importantly - summarize and analyze the search results, then return a relatively concise answer. Ideally, I no longer need to spend time during intense work carefully reading and understanding these articles, allowing better maintenance of flow state.

However, the reliability of the answers these platforms give is not very high, mainly because it's hard to provide sufficiently detailed context in a simple dialog box - you can't paste an entire repository in there, right? The various context symbols (@xxx) Cursor provides fill this gap nicely. For example, you can use the @Codebase symbol to index the entire repository (provided Codebase Indexing is enabled) and ask the LLM for an analysis:

This context-indexing capability essentially works by embedding the entire repository into a vector database and handing the relevant pieces to the LLM. That lets it go beyond ordinary public-knowledge Q&A and understand private-domain code remarkably well - something very hard to achieve with GPT, Claude, Doubao, and the like on their own.
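Conceptually, the pipeline looks roughly like the sketch below. This is only an illustration of the general idea - chunk the repository, embed the chunks, retrieve the most similar ones for the prompt - not Cursor's actual implementation, and the `embed` function here is a toy stand-in for a real embedding model:

```typescript
// Illustrative sketch of repository-level retrieval (not Cursor's real implementation).
type Chunk = { file: string; text: string; vector: number[] };

// Toy "embedding": a character-frequency vector. A real system would call an embedding model here.
function embed(text: string): number[] {
  const v = new Array(128).fill(0);
  for (const ch of text) v[ch.charCodeAt(0) % 128] += 1;
  return v;
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b) || 1);
}

// Indexing: split every file into chunks and embed each chunk once.
function indexRepo(files: { path: string; source: string }[]): Chunk[] {
  return files.flatMap(({ path, source }) =>
    (source.match(/[\s\S]{1,500}/g) ?? []).map((text) => ({ file: path, text, vector: embed(text) }))
  );
}

// Querying: embed the question and hand the top-k most similar chunks to the LLM as context.
function retrieve(index: Chunk[], question: string, k = 5): Chunk[] {
  const q = embed(question);
  return [...index]
    .sort((a, b) => cosine(b.vector, q) - cosine(a.vector, q))
    .slice(0, k);
}
```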

Furthermore, based on these features, Cursor can also very efficiently help me analyze and summarize the underlying principles of various projects:

This is a very, very useful feature. In the past, when taking over an old project or learning some open source projects, I had to understand the project's code line by line, but now I can skip many details and let Cursor help me understand the repository code faster.

2. Generating Unit Tests

This is very powerful: Cursor can generate tests incrementally and keep improving coverage by producing more test cases on demand.

Many LLM tools and platforms can generate unit tests, but Cursor performs better. By comparison, tools like GPT have an overly simple interaction model - you can only copy code snippets into a dialog box and ask for tests - which causes two problems. First, missing context: the model struggles to infer how downstream modules are implemented, which hurts generation quality. Second, incremental updates are hard: once you've generated a set of unit tests for a module and later iterations make them outdated, it's difficult to get GPT to adjust the tests only for the code that changed.

Cursor clearly solves these problems. As mentioned earlier, it vectorizes the entire repository, so it can freely index every file in it. When generating unit tests, it can feed both the target module and its upstream/downstream modules to the LLM - and the more complete the context, the more precise the generated results tend to be.

A more concrete example: when I use GPT-style tools to generate unit tests, the lack of Vitest-related training data means that no matter how I tweak the prompt, I usually get Jest-based test code back, while I prefer Vitest - so each time I had to copy the code out and make a pile of adjustments before it could run in the repository. With Cursor, given an appropriate prompt, it returns Vitest-based results most of the time, and the adjustment cost is much lower. My go-to prompt for this simply asks for Vitest explicitly and points at the target module.
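With a prompt in that spirit and a hypothetical `src/formatPrice.ts` module (invented here purely for illustration), the generated test file typically has this shape:

```typescript
// Hypothetical generated file: tests/formatPrice.test.ts
// Assumes src/formatPrice.ts exports formatPrice(cents: number, currency?: string): string.
import { describe, expect, it } from "vitest";
import { formatPrice } from "../src/formatPrice";

describe("formatPrice", () => {
  it("formats whole-dollar amounts in USD by default", () => {
    expect(formatPrice(1500)).toBe("$15.00");
  });

  it("supports an explicit currency code", () => {
    expect(formatPrice(1500, "EUR")).toBe("€15.00");
  });

  it("throws on non-finite input", () => {
    expect(() => formatPrice(NaN)).toThrow();
  });
});
```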

Secondly, anyone who writes unit tests regularly knows the feeling: writing tests for a module from scratch isn't particularly hard - the troublesome part is subsequent adjustment and maintenance, because you must deeply understand how incremental code affects the old logic and adjust the existing tests accordingly. With tools like GPT, an incremental scenario usually meant pasting all the modified code into the dialog box and regenerating every test, which threw away all the fine-tuning done on the existing tests and forced me to repeat that adjustment work.

With Cursor, this has improved a lot: we can ask it to generate tests specifically for the changed content, for example with prompts like:

Another more practical trick is that Cursor also supports referencing Git commits through the @ symbol, so you can also generate tests for specific commits, for example with prompts like:

In practice, though, incrementally generated tests are of somewhat lower quality than full regeneration - they easily lose context or add redundant tests - so there is still some debugging cost, but it's still far more efficient than using GPT.

3. Language Conversion

In daily development we sometimes need to convert existing code between languages, such as rewriting an old JavaScript project in TypeScript. A simple prompt like "rewrite to TypeScript" is enough:

In the example above, Cursor extracts the correct parameter types from the function body and correctly infers the return type. It's not perfect, but it's far more efficient than doing the work by hand, and even when the conversion has problems, tsc will flag them quickly - so if you face similar scenarios, I highly recommend letting Cursor do the conversion first.
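As a minimal illustration of what such a conversion looks like (this function is invented, not the one from my example above):

```typescript
// Before (JavaScript):
// function sumBy(items, getValue) {
//   return items.reduce((total, item) => total + getValue(item), 0);
// }

// After: the kind of TypeScript the conversion tends to produce, with parameter
// and return types inferred from how the values are used in the function body.
function sumBy<T>(items: T[], getValue: (item: T) => number): number {
  return items.reduce((total, item) => total + getValue(item), 0);
}

// Usage: sumBy(cartItems, (item) => item.priceInCents)
```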

Cursor can even smoothly implement cross-language conversions, such as from JavaScript to Rust:

Note that in these scenarios, usually only simple syntax conversions are completed - the underlying API calls still need manual adjustment to run correctly.

PS: There are two other very useful scenarios - generating type declarations from a JSON sample, and generating them from a JSON Schema. Both have proven very accurate in my testing.
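For instance, pasting a JSON sample like the one below (made up for this example) and asking for a type description typically yields a faithful set of interfaces:

```typescript
// Input JSON (hypothetical):
// { "id": 42, "name": "ada", "tags": ["frontend", "ai"], "profile": { "city": "Paris", "verified": true } }

// Typical generated type description:
interface UserProfile {
  city: string;
  verified: boolean;
}

interface User {
  id: number;
  name: string;
  tags: string[];
  profile: UserProfile;
}
```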

4. Implementing V0 Versions

Cursor's Composer panel is very suitable for implementing V0 versions of features, for example:

From a professional programmer's perspective, the generated result may be rough and not something you can ship - sometimes it won't even run - but it still saves a lot of time on initializing the project, creating files, writing scaffolding code, and so on. From there, you just keep fine-tuning the V0 code according to your own habits, and usually that's enough to finish a complete feature.

Additionally, Cursor seems to have strong learning capabilities: during use it appears to keep observing your coding habits, and the code it generates gets closer and closer to your style, so the cost of adjusting it keeps dropping.

One more point: although writing prompts takes some time up front, the process itself forces you to clarify your own thinking - processes, design, constraints - instead of jumping straight into coding, which actually makes it easier to get things right.

5. Solving Medium-Complexity Problems

Under the hood, Cursor has a fairly typical Agent architecture - it doesn't simply call the LLM and apply the result directly. In features like Composer and the inline prompt, it performs basic task analysis, subtask decomposition, and planning based on your instruction, then combines rich context data to execute several kinds of subtasks and merges them into the final result.
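To make that concrete, here is a deliberately simplified sketch of such a plan-then-execute loop. It is not Cursor's actual code - just the general shape of an agent that decomposes an instruction, runs each subtask with retrieved context, and merges the results (`callLLM` is a stub standing in for a real model call):

```typescript
interface Subtask {
  description: string;
  contextFiles: string[]; // files pulled from the codebase index for this step
}

// Stub standing in for a real LLM API call.
async function callLLM(prompt: string): Promise<string> {
  return `// result for: ${prompt.slice(0, 60)}...`;
}

async function runAgent(
  instruction: string,
  retrieveContext: (task: string) => string[]
): Promise<string> {
  // 1. Plan: ask the model to break the instruction into smaller steps.
  const plan = await callLLM(`Break this task into subtasks:\n${instruction}`);
  const subtasks: Subtask[] = plan
    .split("\n")
    .filter(Boolean)
    .map((description) => ({ description, contextFiles: retrieveContext(description) }));

  // 2. Execute: run each subtask with its own retrieved context.
  const partials: string[] = [];
  for (const task of subtasks) {
    const prompt = `Context:\n${task.contextFiles.join("\n")}\n\nTask: ${task.description}`;
    partials.push(await callLLM(prompt));
  }

  // 3. Merge: combine the partial results into one proposed change set.
  return partials.join("\n\n");
}
```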

This execution model may amplify the uncertainty that comes with LLM randomness, but it also gives Cursor a preliminary form of structured thinking: it can already break a complex problem into several smaller ones and solve them step by step, which is enough for problems of medium complexity. Everyday tasks like building pages, writing CLI tools, setting up project scaffolding, or implementing ordinary algorithms are all manageable. For concrete demos you can search the various content platforms yourself - plenty of people are hyping this software, so I won't elaborate.

Worth mentioning: even while running all these tasks, Cursor's performance is still impressive, especially Tab completion - in my experience it guesses the result within seconds with high accuracy, noticeably better than other products on the market.

However, even so, Cursor still struggles to directly solve complex problems, such as developing a mature shopping mini-program or making an ordering system, etc., due to several factors:

  1. Natural language expressiveness is insufficient, making it difficult to clearly describe all requirement details in limited prompts;
  2. The number of tokens LLM can generate in a single interaction is limited, often unable to meet the code volume needed for complex requirements;

However, Cursor offers a good answer for such scenarios: Composer. The feature itself is fairly involved, but in short, you can manually break a complex task into several smaller ones and issue instructions in the Composer panel over multiple rounds. Cursor generates multiple files accordingly, and those files reference and consume one another, which makes it well suited to developing complex features.

Cursor's Shortcomings

Objectively speaking, LLMs are indeed very powerful and open up huge possibilities for every content-producing industry; calling them a new technological singularity is not an exaggeration. But they are not yet enough to fully replace humans - their underlying mechanism already caps their ability at roughly, or slightly above, the average level of publicly available human knowledge.

The same goes for Cursor: it greatly improves development efficiency, but it has plenty of defects and shortcomings, and for now it cannot smoothly replace humans in building production systems of real complexity. It delivers plenty of pleasant surprises, but the randomness in LLM reasoning means it can never fully satisfy engineering requirements such as functionality, performance, maintainability, technical taste, and forward/backward compatibility. I believe it will remain an "assisted programming" tool, with human intelligence as the subject of the coding process.

1. Randomness

LLMs are very powerful, but at heart they do probability calculation - guessing the most likely "correct" continuation from the context and the model's own parameters. Because it is guessing, it can't be as rigorous as logical computation: a slight change in context or prompt can trigger a butterfly effect and produce a completely different result. Models will keep iterating - becoming more stable, faster, with longer context - but I don't believe this fundamental randomness will disappear, and randomness means risk.

  • Implement Fibonacci sequence

  • Implement, Fibonacci sequence

Just adding a "," in the prompt causes the generated result to change significantly
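For instance, one run may come back recursive and another iterative: both are "correct", yet they differ in interface, complexity, and edge-case behavior. The two snippets below illustrate the kind of divergence I mean (they are not literal Cursor outputs):

```typescript
// Run A: naive recursive version - returns the nth term, exponential time, no input validation.
function fibA(n: number): number {
  return n < 2 ? n : fibA(n - 1) + fibA(n - 2);
}

// Run B: iterative version - returns the first `count` terms as an array instead.
function fibB(count: number): number[] {
  const seq = [0, 1];
  for (let i = 2; i < count; i++) seq.push(seq[i - 1] + seq[i - 2]);
  return seq.slice(0, Math.max(count, 0));
}
```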

Human intelligence is also highly random, but one big difference is that most people have strong self-reflection and logical reasoning abilities - they can review their own work and find potential problems, which keeps the risk of randomness under control; the software engineering of the past few decades was, after all, built by human intelligence. LLMs lack this capacity for self-reflection and correction: they can only guarantee accuracy at some probability level and cannot promise the next answer will be right. And unlike articles or images, software has very low fault tolerance - any tiny error can crash the whole application or cause subtle, unexpected bugs.

The upshot of this randomness is that although an LLM can churn out reasonably "qualified" code at great speed, in a rigorous business environment engineers still have to spend considerable time debugging that code to make sure it works with the rest of the system. Most of the time they also have to fine-tune many details to meet the functional requirements, spend time understanding the code, and review whether the output respects the overall architectural and style constraints so it stays maintainable in the long run. In short, at the current stage a lot of human supervision is still needed, using technical and non-technical means alike to guarantee code quality.

2. Lack of Domain Knowledge

There are plenty of articles and videos hyping Cursor as if it has surpassed and replaced humans, conjuring complex applications like magic and achieving "not writing a single line of code, letting AI work for me". From my actual experience over these months, we are still quite far from that state - first because LLMs lack domain knowledge and struggle with domain-specific problems, and second because they lack structured logical thinking and cannot break a complex task into subtasks and solve it step by step on their own.

During my usage I had a very strong feeling: even when I follow the usual prompt-engineering advice and spend a lot of time writing carefully, most of the time it's still very hard to describe requirements precisely in natural language alone (we're simply more used to expressing them in code), so the LLM struggles to generate code that truly matches expectations. This is fundamentally a problem of expressiveness. Even within the same team, product and development people who share the same domain background still need carefully prepared documents, multiple meetings, and repeated alignment from several angles to agree on requirement details - let alone a large model that doesn't really understand domain-specific knowledge. Put more concretely, the LLM very likely doesn't know what certain domain-specific terms mean, nor does it have the relevant historical context, so the two sides talk past each other: developers cannot express all that context in a limited prompt, there are too many information black holes for the LLM, the two sides never fully align, and the business requirement doesn't get built smoothly.

To give a specific example: "When a user who hasn't logged in via OAuth clicks the registration button, pop up the user-manual dialog and send a behavior-tracking event to Platform B." Here OAuth, registration, dialog, and tracking are all open knowledge that an LLM understands well from its training data, but:

  • What exactly is the "user manual"?
  • What is "Platform B"? How exactly to report tracking?
  • What should the tracking content include?
  • Etc.

These are all private-domain knowledge specific to a particular team, full of details. Unless you can supply this information via RAG, fine-tuning, or similar methods, the LLM cannot know how the task should actually be implemented, and will most likely produce something like this, which cannot be used directly:
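Something in this spirit - the open-knowledge parts come out fine, while every private detail is guessed or stubbed (all names below are invented):

```typescript
// Stubs the model has to assume exist; in the real codebase these would have to
// map onto specific internal dialog components and tracking SDKs.
function showDialog(options: { title: string; content: string }): void {
  console.log("open dialog:", options.title);
}

async function onRegisterClick(user: { isOAuthLoggedIn: boolean }): Promise<void> {
  if (!user.isOAuthLoggedIn) {
    // Which dialog? Which copy? The model has no way to know, so it invents one.
    showDialog({ title: "User Manual", content: "..." });

    // "Platform B" means nothing to the model, so it guesses an endpoint and a
    // payload shape that almost certainly do not match the real tracking spec.
    await fetch("https://platform-b.example.com/track", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ event: "register_click", timestamp: Date.now() }),
    });
  }
}
```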

3. Lacks Creativity

LLMs have another fairly large problem: a lack of creativity. This is easy to understand from first principles - an LLM collects vast amounts of public material and uses deep learning, in particular the Transformer architecture, to capture complex patterns and semantic relationships in language, producing a model that understands and generates natural language as accurately as possible.

Put simply, an LLM is like a retrieval tool with extremely high performance and a great deal of built-in knowledge, preloaded with massive amounts of public internet material. The result is that when your question is covered by existing material, it answers very well; once you step outside its retrieval range, performance drops sharply and it may even produce so-called "hallucinations".

Of course, there is a mature mitigation for this: Retrieval-Augmented Generation (RAG), which can be loosely understood as giving the LLM extra knowledge through a vector database. At run time the model retrieves that knowledge and derives answers closer to the specific domain, which makes LLMs far more applicable to concrete business domains.
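In code, the augmentation step itself is simple - prepend the retrieved snippets to the prompt. Reusing the shape of the `retrieve` sketch from earlier (still illustrative, not any particular framework's API):

```typescript
// Build a retrieval-augmented prompt from the top-k retrieved chunks.
function buildAugmentedPrompt(
  question: string,
  snippets: { file: string; text: string }[]
): string {
  const context = snippets.map((s, i) => `[${i + 1}] ${s.file}\n${s.text}`).join("\n\n");
  return [
    "Answer using the project context below. If the context is insufficient, say so.",
    "--- context ---",
    context,
    "--- question ---",
    question,
  ].join("\n");
}
```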

But for deeper problems, even a RAG architecture probably won't help. For a little-known framework, for example, there isn't much discussion material online and you may not be able to supply the relevant knowledge yourself, so the LLM will struggle to give accurate answers. Fundamentally, an LLM only does probabilistic inference; it doesn't possess complex logical reasoning and cannot derive new knowledge from new material - it lacks the creativity of human intelligence.

To make it concrete, consider the bugs you hit while programming. If someone has already researched the issue and written up the cause and the fix somewhere on the internet, the LLM performs very well and gives you the final answer directly. If it's a framework issue with no related material, the LLM probably can't help - you'll have to dig into the framework and find the answer yourself. And if the problem lies in your own business system's code, the LLM is essentially powerless and can't give a valuable answer. Complex, specific problems still require human intelligence to step in.

Best Practices

1. Frequent Code Commits

Remember, LLM output is always random, and the correctness of the next run is never guaranteed. During use, make a habit of committing code frequently so you can roll back when something goes wrong. My personal workflow is roughly:

  • After generating code in any of Cursor's panels, do a quick code review first, fixing basic issues such as variable names and loop structure;
  • Debug it in the context of the running application, and commit as soon as correctness is preliminarily verified;
  • Review the code design, focusing on function inputs/outputs, module imports/exports, and branching/loop logic. At this stage I usually keep asking the LLM to optimize the code along my own ideas, or make small edits myself, iterating and committing repeatedly - this step usually takes the most time;
  • Once the code is reasonably stable, let the LLM generate unit tests. This phase also takes several iterations and commits until the tests pass and coverage meets the bar;
  • When both the feature code and the tests are ready, open a PR for merging.

In this process, any given LLM call may produce unexpected results, but as long as the output basically meets the bar I commit immediately; if problems surface later, I can revert at any time. This does produce a lot of commits, but you can tidy the history with rebase before merging.

2. Value Code Review

Second, I recommend putting more time and energy into code review, because Cursor-generated code may be locally optimal yet miss the global optimum - it lacks the more abstract, global architectural picture. It may, for example, re-implement existing components or break established architectural conventions. Over time such code can erode maintainability and readability until the project can no longer be iterated on (to be fair, humans cause the same problems).

So it's well worth building a rigorous code review culture. Ideally, Cursor generates as much of the code as possible, replacing as much manual coding as it can, while human intelligence is responsible for verifying and reviewing that code, ensuring the result is correct and maintainable over the long term.

3. Better Engineering Systems

As mentioned earlier, the results Cursor - or rather the LLM - generates are random and cannot be guaranteed correct every time, and every change can affect the stability of existing code. If all of these changes had to be verified by hand, testing costs would spiral out of control and the testing phase would become the new efficiency bottleneck.

A better approach is to invest in a more robust engineering system (Cursor can help with this part too): use automated tooling to take over basic quality checks - introduce unit and end-to-end tests for runtime quality, add TypeScript type checking and ESLint style checking to CI/CD, and so on. The key is to verify the quality of AI-generated output continuously, at lower cost, and respond to change more nimbly. For concrete strategies, see my earlier articles:

  • "Frontend Engineering Series One: Preface"
  • "Frontend Engineering Series Two: Coding Efficiency"

No need to repeat here.

4. Use AI-Friendly Tech Stacks

When using Cursor or similar assisted-programming tools, try to choose AI-friendly tech stacks. Taking frontend as an example: Tailwind beats hand-written CSS/Less; TypeScript beats JavaScript or CoffeeScript; MVVM frameworks like React or Vue beat raw JS + HTML + CSS; GraphQL beats RESTful APIs; and so on.

Why do these differences between stacks arise? The key is that an LLM generates content by probabilistic inference, and the quality of the result depends on many factors: model quality, training corpus, completeness of context, prompts, and more. For the specific task of assisted code generation: the better the LLM understands the stack, the better the output; the more focused the code structure, the less noise during inference and the better the result; and the fewer special-case designs a business implementation needs - the more it follows general rules - the less the LLM has to understand, and usually the better it does.

Based on these dimensions, I personally summarized several simple rules that can be used to help evaluate whether a tech stack is more suitable for AIGC scenarios, including:

  • Community Activity: the more active the community, the more users there are, and the richer the technical discussion and material - which means the more complete the information the LLM can index during training or at run time, and the easier it is to derive good results. Suppose you hit a very specific and thorny problem at work. If someone has hit it before and written up the root cause and the fix in an article, and the LLM can retrieve that article, it can derive a solution from it; if no such material exists online, then - given that LLMs lack complex logical reasoning - it probably cannot offer an effective solution.

  • Structuredness: the more structured and modular a stack is, the more focused its information representation and the easier it is for the LLM to reason correctly. Atomic CSS (such as Tailwind) is a good example. Native CSS expresses visual effects through individual property key-value pairs, while atomic CSS expresses whole rule sets through class names, which concentrates the information and makes it easier for AI to understand. Worse, because of the cascade, an element's style under native CSS may be affected by global rules, ancestor elements, and multiple selectors - for the LLM this means the relevant information is scattered across many corners of the project, and it has to consume far more context to derive the right result. With an atomic CSS framework, most style information is concentrated in the element's class list: highly focused information, lower reasoning cost, more reliable results.

  • General Rules Better Than Specialized Design: the more general a stack's design rules, the easier it is for the LLM to understand, and the better it suits AIGC scenarios. GraphQL versus RESTful is a clear case. GraphQL offers one general language for describing entities and their relationships, expressive enough for most routine operations - storing, querying, deleting, updating data - so the LLM only needs to learn that one set of rules plus the specific entities and relationships of the business domain to write all kinds of data-access logic. A RESTful API, by contrast, centers on entities: beyond the basic operations, complex data requirements usually call for purpose-built endpoints designed around practicality and performance, and those special cases look overly specific to the LLM, raising context complexity and noise and making correct answers harder to derive.

  • Automated Quality Inspection: the stronger a stack's quality-checking tools, the earlier and more comprehensively problems are caught, the better the risks of LLM randomness can be contained, and the better the stack suits AIGC scenarios. TypeScript versus JavaScript is the obvious example: the former's type system catches many mismatches during static analysis, so even if the LLM generates type-incorrect code, the problem is found before anything runs and the cost of fixing it is much lower (see the sketch after this list).
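A small example of that last rule: if the model slips on a field name, `tsc` flags it before anything runs, whereas the JavaScript version only misbehaves at runtime (the `Order` type is invented for illustration):

```typescript
interface Order {
  id: string;
  totalCents: number;
}

function describeOrder(order: Order): string {
  // A plausible LLM slip: the field is `totalCents`, not `total`.
  // In JavaScript this silently yields "Order ...: $NaN" at runtime;
  // in TypeScript, tsc rejects it before the code ever runs:
  //   error TS2339: Property 'total' does not exist on type 'Order'.
  return `Order ${order.id}: $${(order.total / 100).toFixed(2)}`;
}
```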

Of course, these rules are just my personal view, and they will surely be amended as large models evolve. That isn't the important part. What matters is this: when making technology choices in the future, don't decide purely on personal or team preference - weigh how well a stack suits AI, perhaps even make that the primary criterion, and prefer AI-friendly technologies and tools so that the LLM can assist development more accurately, integrate more fully into daily work, and raise both individual and team efficiency.

5. Lower Expectations

Yes, you read the title right: lower your expectations! As I've said several times, an LLM is not magic - it hallucinates, has knowledge gaps, is random, and is not good at complex problems. In my experience it usually falls well short of what I hoped for, and I still have to get involved in the details of the code.

For example, generated unit tests often won't run as-is and need various fixes and fine-tuning; generated documentation may miss important content, requiring prompt adjustments and multiple retries before it meets expectations; and generated code, even with Cursor's context-referencing features, frequently gets details wrong and fails to execute. Given how LLMs work under the hood, none of this is surprising.

So lower your expectations of Cursor and other LLM tools: they are not magic and cannot fully replace human intelligence. For now you still have to play the director, carefully guiding Cursor toward your problem and carefully reading and understanding both the existing repository code and the code the LLM generates. You also still need solid technical skills of your own, understanding the underlying details well enough to spot and fix problems in time. To some extent, Cursor is like a knife - how sharp it is depends on your own skill.

Afterword

A few closing words. After these months of use, Cursor may not be perfect, but it genuinely takes many basic, repetitive tasks off my plate, letting me focus on higher-level concerns like the business and the architecture rather than every last detail; it offers genuinely useful hints when I hit a knowledge blind spot; and it handles a lot of repetitive, low-level work for me. This is not a toy but a production tool that truly improves productivity - the best AI-assisted coding tool I've used - and I strongly recommend giving it a try.

Still, Cursor is only an auxiliary production tool - human intelligence is the core. Tools can make you faster, stronger, and better, but the important premise is "you"; don't put the cart before the horse. Later I plan to write another article about my thoughts on AI: how programmers should coexist with it in this era, who might be replaced, and how to avoid being replaced. Interested readers, remember to follow.
