The cover image is copyright Fabian Oefner, from the Disintegrating II series. This is one of my favorite cars, the Ford GT40. In the early days of CQRS, a web search for "cqrs" would auto-correct to "cars". The old CQRS blog humorously bears the subtitle: Did you mean cars?
I had a great opportunity to implement a system using some tactics that I had been researching and playing with in my spare time. Fast forward to today: I've faced a few unforeseen problems and learned a few lessons from them. This post addresses a specific piece: request/reply APIs.
Overview
I use a tactic called Command/Query Responsibility Segregation (CQRS). For those not familiar with it, here is a quick summary of the API operations espoused by CQRS.
| | Returns data | Makes changes |
| --- | --- | --- |
| Query | ✔️ | ❌ |
| Command | ❌ | ✔️ |
Why use this pattern? I like it for a couple of reasons. As a consumer of the API, I never have to worry that asking a question will have unintended consequences. Conversely, I know exactly which API calls make changes to the system. There is no ambiguity. This makes the API easy to reason about. But historically, this pattern evolved because read and write concerns are often very different. And trying to create a unified interface to do both has the typical problems of serving two masters. The single interface becomes progressively more confusing over time for either purpose. Given enough time it is likely to form a cargo cult. "Why are we updating this field? We don't use it." Response: "I don't know, but keep doing it or something might break."
📦🛐
So at its heart, CQRS is just a specific application of Separation of Concerns, aka good organization practices. Now that we've made introductions for the pattern, I'll go over some of our implementation details and lessons learned.
Messaging
I consider every Query or Command to be a message. Which means that any client system can represent these as ordinary data (classes or structs) with no methods, then transmit them easily in wire formats such as JSON or Cap'n Proto or whatever else. Every message also has a name -- usually just the class/struct name -- which uniquely identifies it within the API. Such as SearchCustomers (a query) or DeactivateCourse (a command). Names are used to identify which operation was requested, then match it up with a message parser and a handling function. Security authorization can be as simple as keeping a list of which users are allowed to send which message names, then checking that list before processing any user's message. 🤘🤘
If you are familiar with RPC, you can also look at Messaging as a superset of that pattern, with the message name being the "procedure" and the message contents being the procedure arguments.
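To make this concrete, here is a minimal TypeScript sketch of messages as plain data. The message names come from the examples above; every field is made up for illustration.

```typescript
// Messages are plain data with no methods. The type name doubles as
// the message name on the wire. (Field names are hypothetical.)
interface SearchCustomers {
  nameContains: string;
  page: number;
}

interface DeactivateCourse {
  courseId: string;
  reason: string;
}

// A wire envelope pairs a message name with its JSON payload so the
// server can look up the matching parser and handler.
interface Envelope {
  name: string;     // e.g. "SearchCustomers"
  payload: unknown; // parsed JSON body
}
```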
The Operations
It may seem obvious how Commands and Queries should work. But there are some nuances that I discovered.
Query
Well, queries are generally like you would expect. In particular, we handle them like this (a code sketch follows the list):
- API listens for /query/[Query Name]
- Verify that the user has [Query Name] permission
- Deserialize the query message
- Pass the query message off to its handler function, which will:
  - Validate the query message
  - Load and transform data from the database
  - Serialize and return the data
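Here is a rough sketch of that pipeline. I'm assuming an Express-style server purely for illustration; the handler registry and permission check are simplified stand-ins.

```typescript
import express from "express";

type QueryHandler = (payload: unknown) => Promise<unknown>;

// Registry mapping query names to handler functions.
const queryHandlers: Record<string, QueryHandler> = {
  SearchCustomers: async (payload) => {
    // validate the message, then load and transform data...
    return { results: [] }; // placeholder result
  },
};

// Simplified stand-in for the per-user message-name allow-list.
function userMaySend(userId: string, messageName: string): boolean {
  return true;
}

const app = express();
app.use(express.json());

app.post("/query/:name", async (req, res) => {
  const name = req.params.name;
  const handler = queryHandlers[name];
  if (!handler) {
    res.status(404).json({ error: "Unknown query" });
    return;
  }
  if (!userMaySend(req.header("x-user-id") ?? "", name)) {
    res.status(403).json({ error: "Not permitted" });
    return;
  }
  res.json(await handler(req.body)); // serialize and return the data
});
```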
We tend to create queries that are tailored for specific pages or to answer common questions. It feels like we follow an inverse DRY rule here. If I need a query for a page, I might use an existing query. But only if I do not have to change the existing query. If changes are needed, then it means the new page has a slightly different responsibility even though it displays most of the same data. So I will make a new query instead.
Command
The purpose of a command is to perform some business operation on the system. In practice, we noticed a distinction based on whether the command needs to change one entity or multiple entities📌. How you handle multiple-entity changes is important for architectural reasons.
📌 Entity
Entity in this case means something that is a logical unit. In heavily normalized tables, an entity might include a parent and any descendants of 1-to-many relationships. In DDD terminology, you might call this an aggregate. In event sourcing, this is an event stream.
Scalability
You could execute multiple-entity changes in a single transaction to achieve all-or-nothing semantics. This approach is nice to work with in code, but it limits scalability. To be involved in a transaction, all affected entities have to be located on the same database node. If they are on different nodes, then a distributed transaction is required (if the database even supports one). And as load increases, distributed transactions get progressively slower. Cross-entity transactions are a valid approach for internal business applications (or any application which is not likely to outgrow a single database node). But for publicly available internet services, perhaps not.
A more scale-friendly approach is to only use single-entity commands to make changes. When a use case requires changes to multiple entities, use a meta-command which makes no changes itself, but instead orchestrates and runs single-entity commands. I call the single-entity commands "basic commands", and the multiple-entity ones "workflows".
⚠️ These are not back-end workflows
Workflow commands are a convenience for the front-end. They typically consist of individual actions the user could take themselves through the UI. But instead of making the user go to multiple pages, we present a single form and roll up all the data needed into a workflow command. These are best effort and time-boxed (due to being request/reply), so failure typically just results in an incomplete workflow which the user can retry or fix the remaining items individually. These workflows are not meant to replace back-end processes or to provide robust handling of failure cases.
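Here is a sketch of what a workflow command handler might look like, using the Record Training example from the next section. The step names and fields are hypothetical; the point is that the workflow only orchestrates basic commands and reports per-step results.

```typescript
// A workflow command: makes no changes itself, only orchestrates
// basic (single-entity) commands. Field names are hypothetical.
interface RecordTraining {
  trainerId: string;
  newCourse?: { courseId: string; title: string };
  attendeeIds: string[];
}

type StepResult = { step: string; ok: boolean; error?: string };

// Dispatch a basic command to its handler (details elided).
async function runBasic(name: string, payload: unknown): Promise<StepResult> {
  try {
    // ...invoke the basic command's handler here...
    return { step: name, ok: true };
  } catch (e) {
    return { step: name, ok: false, error: String(e) };
  }
}

async function handleRecordTraining(cmd: RecordTraining): Promise<StepResult[]> {
  const results: StepResult[] = [];
  if (cmd.newCourse) {
    results.push(await runBasic("CreateTrainerCourse", cmd.newCourse));
  }
  for (const attendeeId of cmd.attendeeIds) {
    results.push(await runBasic("RecordAttendance", { attendeeId }));
  }
  return results; // best effort: failures are reported for retry
}
```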
Client-side workflows
You could implement a Workflow on the client side -- have the UI orchestrate all the necessary basic commands. However, I choose to make them API-side for one main reason: clarity (of security especially). I'll illustrate with a real example from our system. We have a Trainer role. This role is not allowed to Create Courses. However, they can Record Training they provided to employees. Part of the Record Training use case may include creating a new course with limited options. By executing the Record Training use case as an API workflow, it can be expressed as a single granular permission. "Trainers can Record Training but not Create Courses." As in, one box is checked on the permissions UI, but the other isn't.
To do the same thing from the client side, we would need to add a basic command: Create Trainer Course. Then admin users would have to be informed: "To give someone permission to record training, you have to check Create Trainer Course and Permission X and Permission Y." So client-side workflows like this are a documentation/end-user-training burden. We could also create a fake command just for permission purposes, which maps to the required basic commands. This would instead burden devs with extra stuff to keep updated. I don't like either of these outcomes, so I prefer API-side workflows.
Update 11 Sep 2021
For running batches of the same kind of commands, we have used some client-side workflows. The client makes a list of commands. Then sends them, either one-at-a-time or in parallel. Then marks them off as success responses come back. This also makes it easy for the client to retry only commands that fail.
The downside: this approach is "chatty" -- it requires a round-trip communication to the server for every command. This increases server/network load versus the "bursty" server-side workflow. Additionally, users with high network latency will see one-at-a-time client workflows slow to a crawl. Ask me how I know.
The server will be doing the same amount of work per command whether you send them separately or as a burst. However, the server also uses CPU and memory to send and receive network requests. So more communication means fewer resources available for back-end work. How much less depends on the kinds of offloading supported by the server hardware.
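A sketch of that client-side batching, assuming a /command/[Command Name] endpoint symmetric to the /query/ route above (that route shape is my assumption, not a requirement):

```typescript
// Send a batch of commands in parallel and collect the failures so
// the client can retry only those. (Route shape is an assumption.)
async function sendBatch(commands: { name: string; payload: unknown }[]) {
  const failed: { name: string; payload: unknown }[] = [];
  await Promise.all(
    commands.map(async (cmd) => {
      const res = await fetch(`/command/${cmd.name}`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(cmd.payload),
      });
      if (!res.ok) failed.push(cmd); // mark for retry
    })
  );
  return failed; // e.g. retry later: await sendBatch(failed)
}
```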
Client workflows can make sense if you use them judiciously.
Guiding Principles
There are some very common questions people ask when implementing CQRS APIs. I will list the principles I have come to follow as headings, then detail the common questions behind them.
Returning errors is different from returning data.
A popular misunderstanding is that commands should return nothing at all. This stems all the way back from the CQS pattern (which CQRS just extends). That pattern was applied to an object and its methods inside specific languages, many of which use exceptions as the error propagation strategy. A "command" method was especially noted by the fact that it returns void. So the notion was born that commands return nothing. However, it is implied that an error will throw an exception, which is really just a different return path.
So the truth of the matter is that commands do return something. They return meta-information about the operation itself (whether it succeeded or failed and why). This is very different from returning business data, which is the job of Queries.
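In code, that meta-information might be modeled as a small result type, something like this (the exact shape is up to you):

```typescript
// Commands return meta-information about the operation, never
// business data. A discriminated union makes both paths explicit.
type CommandResult =
  | { outcome: "success" }
  | { outcome: "failure"; errorCode: string; message: string };

// Example failure a client could match on:
// { outcome: "failure", errorCode: "EntityExists",
//   message: "This entity already exists." }
```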
It is okay for commands to succeed without making changes.
Commands can make 0 or more changes. In other words, "making changes" is the purpose of a command, not the required outcome. So it is entirely valid for a command to run successfully but result in nothing changed.
We have cases like this where we compare an entity before and after running a command. If they are exactly the same, then we choose to make 0 changes and return successfully.
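A tiny sketch of that pattern (the command and in-memory storage are made up for illustration):

```typescript
type Profile = { userId: string; displayName: string };
const profiles = new Map<string, Profile>(); // stand-in for a database

// Renaming to the name a profile already has is a successful no-op.
async function handleRenameProfile(cmd: Profile) {
  const before = profiles.get(cmd.userId);
  if (!before) {
    return { outcome: "failure", errorCode: "NotFound", message: "No such profile" };
  }
  if (before.displayName === cmd.displayName) {
    return { outcome: "success" }; // 0 changes made, still a success
  }
  profiles.set(cmd.userId, { ...before, displayName: cmd.displayName });
  return { outcome: "success" };
}
```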
It is okay for command handling code to call queries.
A lot of questions come up based on the misconception that CQRS principles should apply to the insides as well as the outsides of an API. Specifically there are a lot of questions about whether or not it is ok for command handling code to run a query. Instinctively, this seems like a violation of CQRS principles. But CQRS only makes recommendations about the external surface area of the API. The insides of a command are implementation details on which it holds no opinion other than "makes changes".
So feel free to run queries to grab some information needed to make decisions within a command. One caution though. It is common in more advanced scenarios that the query data may have come from a cache or otherwise may be lagging behind what is "current". (Often referred to as Eventual Consistency.) In this case, you must consider the effect of stale data from the query on the decisions your command makes. See more on that here. It could be that slightly stale data won't matter, as is normally the case with configuration data. Example: when a user changes configuration data, they expect that some things happened under the old configuration, but future things will happen under the new configuration. They will probably not notice or care that a user slipped in an operation under the old config during the few hundred milliseconds of eventual consistency after they made the change. They will just assume the operation was executed before their change.
Auto-incrementing IDs should not be the primary ID.
A common objection to commands not returning data is: I need to return the auto-generated ID. Auto-increment IDs are very convenient, but they have significant trade-offs. They don't scale for one thing, and they have security concerns for another. But let's ignore that for a moment and focus on a common usage issue: retries.
Scenario
A user fills out a form to create a new entity and hits Submit. The request times out.
Auto-increment adventure
If an auto-increment field is your only ID, your app has no way of knowing whether the request succeeded. The remedies to this situation typically depend on user awareness and participation.
If the user just hits Submit again (very likely), but the previous request did create the entity despite the timeout, then there are now two copies of the same entity with different IDs. To properly clean up, the user should now search for duplicates and remove the redundant entity (highly unlikely).
Alternatively, after a timeout, the user could search for their maybe-created entity, and if they fail to find it, come back and fill out the form again. This scenario is not likely in my experience. Maybe it could happen if you invest in training to get users accustomed to thinking this way.
You could add in external systems of duplicate checking, such as keeping a memory of seen operations and their results. But there is a better way...
Pre-generated ID adventure
An ID was generated (or requested from the server) when the form was loaded, before the user even started typing anything.
After the user is informed of the request timeout, she just hits Submit again. The UI sends off the exact same request as before, with the same pre-generated ID. In the best case, it succeeds as normal. In the worst case, the API responds with: "This entity already exists." And if the UI can identify this specific error, it can just pretend it succeeded as normal. This adventure results in a better user experience and no chance of duplication.
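Client-side, the retry logic can be as simple as treating that specific error as success. The route and error code below are assumptions carried over from the earlier sketches:

```typescript
// Resubmitting is safe because the ID was pre-generated: the worst
// case is "entity already exists", meaning a prior attempt worked.
async function submitCreate(courseId: string, form: Record<string, unknown>) {
  const res = await fetch("/command/CreateCourse", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ courseId, ...form }),
  });
  if (res.ok) return true;
  const err = await res.json().catch(() => ({}));
  return err.errorCode === "EntityExists"; // pretend it succeeded as normal
}
```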
Our strategy
We tend to use UUIDs for all identification purposes. They are easy to generate on many platforms. They defy trend analysis. Most of our creation forms have to run a query anyway (for example, to get drop-down list data), so we just include a new UUID in the results too.
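For example, a creation form's query handler might look roughly like this (the query name and drop-down data are made up):

```typescript
import { randomUUID } from "node:crypto";

async function loadCategories(): Promise<string[]> {
  return ["Safety", "Compliance"]; // stand-in for a database read
}

// The query that populates a creation form also returns a fresh UUID,
// so the client holds the entity's ID before the user starts typing.
async function handleGetCourseForm() {
  return {
    categories: await loadCategories(), // drop-down list data
    newCourseId: randomUUID(),          // pre-generated entity ID
  };
}
```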
Update 11 Sep 2021
The above works well with internal APIs that we consume ourselves. But as we have gotten into external APIs, we are considering a different strategy. Especially for operations which create new entities (e.g. Create Order). We cannot trust external clients to provide a unique ID matching our constraints. Or even to regurgitate an ID they got in a previous query.
Instead, we are looking at using the client-provided ID as reference data. When an entity is created, we will generate our own ID for it. But the client's ID will be attached to the entity and indexed for lookup. The client can use its own ID to call our API, but the ID that we depend on internally still meets our standards.
Another scenario is where the client does not have its own ID but uses ours. The approach still works if slightly modified. The client provides a request ID. Once an operation completes, the client can ask for information about the created entity (including our ID) using the request ID.
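A sketch of the first variation, with the external client's ID kept as indexed reference data (names and storage are made up):

```typescript
import { randomUUID } from "node:crypto";

type Order = { id: string; clientOrderId: string };
const byClientId = new Map<string, Order>(); // stand-in for an indexed column

// We generate the internal ID ourselves; the client's ID is attached
// as reference data. Resubmitting the same client ID is a no-op.
async function handleCreateOrder(cmd: { clientOrderId: string }) {
  if (byClientId.has(cmd.clientOrderId)) {
    return { outcome: "success" }; // already created on a prior attempt
  }
  const order: Order = { id: randomUUID(), clientOrderId: cmd.clientOrderId };
  byClientId.set(cmd.clientOrderId, order);
  return { outcome: "success" };
}
```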
Conclusion
Commands are the gatekeepers of change. Queries are the library of knowledge. That's CQRS. I have found that this pattern leads me in the right directions. It is also a versatile pattern. It doesn't care if your deployment surface area is monolithic or micro. You can even split commands and queries into their own separate services to scale out read loads separately from write loads.
But bear in mind that this is just one piece in a larger system, not a tool for every job. The CQRS pattern works well at the border of a back-end system, interfacing with client applications. As with any pattern, it will only be useful when applied in the right situation.
/∞
Top comments (16)
"Commands are the gatekeepers of change. Queries are the library of knowledge." Just a beautiful summation of CQRS. Would've loved more code samples, but otherwise amazing article.
Thanks!
Regarding code samples, is there anything you were specifically interested in seeing? I may post a follow-up if there is interest. It is my intention to one day publish a template for our style of API (which also includes event sourcing). I hope there is nothing especially ground-breaking in there. Maybe just organization patterns that people had not considered. (It may just be the way my brain works, but I find organization to be the hardest part of every solution.)
No sweat! I think these situations through in code, so examples always help me digest. Even if I already understand the situation, like in this case, I still enjoy watching someone else's perspective. Plus when I saw a notif for a post by you, I was like "dayummmmmm time to see some of that sweet Speakman f#!"
Great article. It was a pleasure to read. I like the translation of Entity into the DDD context.
One question about the Pre-generated ID adventure paragraph. You wrote:
So my question is: why not generate this ID on the server side after the submission of the form? You've chosen a different strategy, and that is fine by me because it solved your problem :). I'm just curious whether you considered ID generation just before sending a command to the external server? Or was it only a thin JavaScript client, and it would be over-engineering to do that?
Thanks for the comment. :)
Regarding ID generation, I might not be understanding your questions entirely, but I will try to answer.
We always generate UUIDs on the server currently. I have trust issues with some browser implementations of random in JS. However, for other kinds of apps I would be comfortable with generating UUIDs on the client.
We generate the ID when the form is loaded. For web clients, we normally request data from the server for creation forms anyway -- to populate drop-down lists, for example. So the query response from the server will include a newly generated UUID for convenience.
The client knowing the ID before the form is submitted is valuable for a couple of reasons. Retries without duplicates were mentioned in the article. But also when the operation succeeds, the client already has the ID to be able to provide access links to the user (or query a hypermedia endpoint with the ID, if that's your thing). This frees the client from having to interpret the success of the command -- it's basically just a "yes" or "no, and here's why". Generating on the server after submission does not have these benefits.
We have done ID generation (well, fetching an ID from the server) just before submitting the creation command. That was our first try at it. We didn't like it because it was a little more awkward. For example, handling a retry is slightly different from a first try. Also, if there was a network congestion problem near the time of submission, it would likely affect fetching an ID as well, so that's an extra error step to handle. These problems would be easily tamed if we were generating the ID locally. But at that point, the difference between generating the ID when the form loads vs just before submission is negligible (with UUIDs anyway). For me it makes more sense to do it up front so the form submission process has fewer decision branches.
Let me know if you have any more questions or I missed something.
Edit: I meant to add that we also use secondary IDs. These are human-friendly (and in our particular app, usually human-entered) IDs. Items can be searched by secondary IDs.
Thanks a lot for the explanation. You answered my question 100%. I'll do my best to ask a clearer question next time :)
I've heard that in DDD there could be a dedicated service which is responsible for ID creation.
Thanks a lot for your reply.
No worries. It seems I did understand your questions after all. :)
There are many strategies for ID generation, including ones where an ID server is employed. With UUIDs, there is no extra infrastructure needed. However, (although it is extremely unlikely) it is still possible to have UUID collisions. For me the trade-off is well worth making, but it depends on the situation.
Yes, you did :)
"It is okay for command handling code to call queries."
I think this assertion is not always true. When there is eventual consistency between the command side and the query side, for some use cases it is not acceptable for the command side to use stale information to make decisions.
If you had a strong consistency requirement for some bit of information, then you probably would not make the view on the information eventually consistent anyway. So it still works out fine to call queries from the command side.
When first approaching CQRS/ES, I thought the same way. I devised plans to keep some fully-consistent state on the command side, separate from eventually consistent views. However, after having it in production a bit, it turned out I was concerning myself (and wasting time) for no reason. Because life and business are eventually consistent (everyone is not immediately informed of everything that happens). So in particularly important areas of the business, they will already have procedures to deal with this reality. If they don't, then it probably doesn't really matter. You shoot yourself in the foot asking the customer deeply technical questions like "Full consistency?" when they don't understand the trade-offs. (They are always going to say yes to this question, even if they don't need it.) In any case, modeling the software after the way the business actually works finds the right path.
"If you had a strong consistency requirement for some bit of information, then you probably would not make the view on the information eventually consistent anyway. So it still works out fine to call queries from the command side."
I don't think this statement covers all possibilities. It is not possible to affirm that whenever a system decision requires "strong consistency", any display of the same data requires the same. I think this is a point of attention when deciding to use the query side from the command side: understand whether the information is "eventually consistent" and whether that is acceptable for the use case.
"Because life and business is eventually consistent... In any case, modeling the software after the way the business actually works finds the right path."
I think that's a weak argument. Systems are not only made to mimic real-life and business processes, but to improve them.
Anyway, I am only saying that in some cases the statement "It is okay for command handling code to call queries." is not true. Sometimes a "special view" without eventual consistency must be constructed before the command handler can call the queries.
Thanks for your comments.
I think there has been a misunderstanding. The phrase "is okay" does not mean "always and only ever use this". Neither is the word "probably" a universal absolute. The text underneath the "is okay" explains that people intuitively view this as a taboo, but it is permissible. I think the reason for the taboo comes from the combination of other patterns such as DDD, ES, and of course EC. But even there it is still permissible to query certain kinds of information from the command side.
Your objection primarily seems to be the specific case of Querying eventually consistent data from the Command side. And in fact, this is still a perfectly valid way (among other possibilities) to get information. I will give you our use cases for example. Configuration data (to control a process) and set validation (unique keys) are our only queries from the command side (each use case also reads from an event stream). For our case, it makes no difference if the configuration data is eventually consistent. When the user makes a configuration change, they are expecting that some processes were executing before they made the change, and some will be executed after. They probably won't notice or care if milliseconds of eventual consistency allowed someone to start a process with the old configuration. Set validation (unique key constraint) is ideally fully consistent, but for us the risk of violation is low enough (they are user-entered with front-end checks and API-side checks) that we are okay with spending admin time to fix it if it ever happens. In our situation it is fine to use these queries with EC. Use your best judgement for your situation.
Here is Greg Young's 8-year-old post on the subject.
Hi Kasey, sorry for being late, this article was bookmarked ages ago and I got around it only now :-)
Great explanation. I have a few questions...
- Isn't CQRS also a generalization of REST's concepts around separation between read and write methods? GET queries the system, the other ones alter it. Same goes for GraphQL's separation between queries and mutations.
- You didn't talk about it in this intro, but the generalization made by CQRS seems a boon for tracing, logging, and repeatability. If queries and commands are correctly separated, in theory you could save the state of the system in a "log file".
- Have you considered using KSUIDs instead of UUIDs?
- I like the idea of the client knowing the ID of the object in advance; unfortunately, many frameworks people use default to primary IDs or database-created UUIDs. This is easily solvable though.
Hey rhymes! No worries about being late. Hopefully the ideas I have written here will be relevant for more than just a couple of months. So I won't consider comments late until an obviously better idea comes along. :)
I can definitely see the parallel you mention with REST and GraphQL. It has occurred to me as well. However, I have trouble putting these under the umbrella of CQRS. I think the main difficulty is that those patterns seem to be very entity-focused. In my mind CQRS is more like modeling an API as an Actor listening for messages, plus adapting the CQS pattern from Bertrand Meyer to those messages. Perhaps unintentionally, REST and GraphQL seem to intuitively lead devs to organize read and write concerns together under a single structure (an entity) and then exposing that implementation detail to the outside. Despite the semantic advantage from "GET does not make changes", this organization practice still impedes other benefits I will mention below.
Meyer's Command-Query Separation pattern was originally meant to refer to methods on objects, but it has distinct benefits for operations at the API level. As a dev maintaining the API, I find a lot of value in the division of query and command. It helps me to organize read concerns (such as full text indexes, search lists, reports) more appropriately and separately from the write concerns (domain model). For example, I end up using completely different code paths for the query and command sides. It is especially nice to isolate the command side to just be: "Here is the request. Go consult whatever data you need, then tell me your decision." Then let other parties be concerned with saving that decision into specific formats (e.g. SQL), updating full-text search, and so forth.
It also offers architectural flexibility -- you can choose to split read and write sides up into different services to address drastically different performance/distribution characteristics. For example, you could have a highly available, centrally-located write service but use geo-replicated Elastic Search and data replicas, each with their own local query service. Or you could do the inverse if that fit your problem space: fully distributed writes with centralized read aggregations. (For example, distributed sensor networks.) Whereas it seems at odds with an entity-based organization to say that GETs should go to a different service than PUT/POST/DELETE. I have only a passing familiarity with GraphQL, so I cannot offer a direct comparison there... only that it still seems to be about entities. (And Datomic too for that matter.)
Regarding auditability. CQRS sort of assumes messaging (Command and Query messages). The great thing about messages is that they are just data. So they can be logged, filtered, routed, stored for later, etc.; whatever you need. However, there is a danger in expecting the saved Commands to reproduce the current state of the system. There is a strong parallel to saving the SQL statements that you ran, then attempting to regenerate the database with them. Assuming you ignore rejected SQL statements, columns and relationships can still change over time, so old statements may break or work differently over the life of the system. This is why SQL itself does not store statements as the source of truth. Instead, it uses the transaction log for that purpose. If you saved the SQL statements as they happened, you may not be able to rebuild the database. But if you had the whole transaction log, you could.
So if you want to be able to rebuild the state of the system from the audit log, this is where event sourcing comes into prominence. Commands are the requests that were made of the system, like SQL statements. (Some systems also log these for regression testing; to ensure that the same command generates the same events.) But events are the actual changes made to the system, like the SQL transaction log. Command behavior can evolve over time, but events actually happened, so they do not change and must always be handled properly as code changes. Consider this business conversation: "We decided that after X date, users who signup will not be considered founders anymore. But we still must provide founder features for those who signed up before X." The signup events that happen now are different from before, but the old ones still matter.
Regarding KSUIDs. That was the first time I heard of them. Thanks for the link; I like to learn about new things.
There would be some challenges for me to use them currently. The tools (languages, libraries, databases) I use have support for UUIDs, but do not yet support KSUIDs. And so far, I don't really have a use case which needs the creation date integrated into the ID. (I have date as a separate field if I need to order by that.) I could see it improving write performance because indexes wouldn't have to be reordered so much on insertion vs random UUIDs. I will keep KSUIDs in mind in case a situation arises that fits. Thanks for mentioning them!
Don't get me started on frameworks. 🤐
Thanks for the really detailed explanation.
I dig the explicit separation; frameworks don't go much past MVC in their recommendations.
What I like about KSUIDs is the fact that they are sortable, so as you say, they should have less impact on insertion. I haven't measured it though.
"Commands are the gatekeepers of change. Queries are the library of knowledge."
Absolutely love this comment Kasey! And a fantastic all-round article.