Every once in a while, new Pullpo users ask us to provide more individual metrics in the product. The thing is: which individual metrics? And why?
Input to Output to Outcome
After more than 100 conversations with managers and engineering leaders, we found it useful to divide metrics into three sets:
- Input metrics, or, in the SPACE framework, activity metrics. This is what developers do and spend their time on: number of PRs, commits, code review comments, time spent in meetings, code review turnaround time, PR size, amount of documentation written, number of LoCs… These are the easy ones to measure at the individual level (a short sketch after this list shows how a couple of them might be computed).
- Output metrics. These start to be team-level metrics. Here we have the DORA metrics: MTTR (mean time to recovery), CFR (change failure rate), lead time, deployment frequency… and similar ones like uptime, number of incidents, support tickets, and API performance.
- Outcome metrics. These are business metrics, the ones that really matter: churn, ARR, CAC… These metrics reflect business impact.
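To make the first two sets concrete, here is a minimal sketch of how a couple of these metrics might be computed. The PR records and field names below are hypothetical, not any particular API's schema.

```python
from datetime import datetime
from statistics import median

# Hypothetical merged-PR records (illustrative fields only).
prs = [
    {"author": "alice", "opened": datetime(2024, 5, 1, 9),  "merged": datetime(2024, 5, 1, 15), "lines_changed": 120},
    {"author": "bob",   "opened": datetime(2024, 5, 1, 10), "merged": datetime(2024, 5, 3, 10), "lines_changed": 840},
    {"author": "alice", "opened": datetime(2024, 5, 2, 11), "merged": datetime(2024, 5, 2, 16), "lines_changed": 60},
]

# Input (activity) metric: # PRs, trivially attributable to one person.
pr_count = {}
for pr in prs:
    pr_count[pr["author"]] = pr_count.get(pr["author"], 0) + 1

# Output metric: PR lead time (opened -> merged), a team-level property.
lead_times_h = [(pr["merged"] - pr["opened"]).total_seconds() / 3600 for pr in prs]

print(pr_count)              # {'alice': 2, 'bob': 1}
print(median(lead_times_h))  # 6.0 hours
```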
The metrics that actually matter are output metrics and, especially, outcome metrics. So the question is: can we tie individual metrics to output and outcome metrics?
Tying individual metrics to business metrics.
The classic example: the sales team.
They have all kinds of metrics to measure individual productivity, like # calls made, % call-to-demo, % demo-to-close… and they work great. Why? Because they can easily link individual metrics to outcome metrics and, most importantly, the other way around: they can easily know the contribution of each team member to outcome metrics like MRR.
They can clearly measure the performance of each individual. Perfect.
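As a toy illustration (all numbers hypothetical), each SDR's input metrics multiply straight through to an attributable share of the outcome:

```python
# Hypothetical per-SDR funnel metrics.
sdrs = {
    "sdr_a": {"calls": 200, "call_to_demo": 0.10, "demo_to_close": 0.25},
    "sdr_b": {"calls": 150, "call_to_demo": 0.20, "demo_to_close": 0.30},
}
AVG_DEAL_MRR = 500  # assumed average MRR per closed deal

for name, s in sdrs.items():
    closes = s["calls"] * s["call_to_demo"] * s["demo_to_close"]
    # The individual's contribution to the outcome metric (MRR) is directly attributable.
    print(name, f"closes={closes:.0f}", f"mrr_contribution=${closes * AVG_DEAL_MRR:.0f}")

# sdr_a closes=5 mrr_contribution=$2500
# sdr_b closes=9 mrr_contribution=$4500
```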
What happens in engineering? We can link input metrics to output metrics, and output metrics to outcome metrics. For example, we know that a good PR size improves cycle time (small PRs lead to faster and better code reviews), and a short cycle time lets the company ship faster. That means more features to promote, which makes it easier to win new clients (lower CAC, more ARR) and can also reduce churn.
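If you want to check that forward link on your own data, a rough sketch (hypothetical numbers below) is to bucket PRs by size and compare cycle times:

```python
from statistics import median

# Hypothetical (lines_changed, cycle_time_hours) pairs per merged PR.
prs = [(40, 4), (80, 6), (120, 10), (400, 30), (900, 55), (1500, 80)]

small = [hours for size, hours in prs if size <= 200]
large = [hours for size, hours in prs if size > 200]

print("median cycle time, small PRs:", median(small), "h")  # 6 h
print("median cycle time, large PRs:", median(large), "h")  # 55 h
```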
Ok, now try to do it the other way around: what's the impact on CAC, ARR, and churn of a developer creating a PR of the right size? Good luck.
Here is an example of tying input metrics to output metrics to outcome metrics, also inspired by this post by Iccha Sethi.
Comparing individual metrics.
Ok, so we cannot cleanly tie outcome metrics back to input metrics. Can we at least compare individuals using input metrics?
Let's take the sales team example again. Imagine that we couldn't link outcomes to inputs; we could at least compare input metrics between SDRs, right? The more calls they make (quantity) and the better their close rate (quality), the better they are performing. Even if we couldn't tie outcomes to inputs, even if we couldn't know each SDR's contribution to MRR, we could still clearly rank SDRs' performance using input metrics alone.
Ok, let's try that with engineering. So... what are quantity and quality in this case?
Quantity could be # PRs, # commits, # words of docs written, # code review comments... but how do we measure quality for those things?
In code, there are different definitions of what quality means. I really like the TRUCE framework for software quality. In a nutshell, it analyzes code quality across five areas (with different priorities depending on the stage of the business):
- Timely delivery of features.
- Robustness of code (reliable, tested, secure, scales, etc.).
- User needs are met (meets user, customer, stakeholder requirements/needs).
- Collaboration is enabled (readable, documented code to facilitate collaboration).
- Evolvable design.
Here comes the trick: each of those things depends on more than one person.
- Timely delivery: this also depends on reviewers (and, in bigger companies, on many more people).
- Robustness, evolvability, and readability: these depend on an existing, irregular codebase, with tech debt scattered across files written by many other developers.
- Meets user requirements: the entire product team decides what to build.
So individual metrics are bad?
No. They are just numbers. Data is data. It's not good or bad, it is what it is.
If a manager exclusively uses activity metrics to compare individual developer productivity, then that is bad. But it's not the metrics' fault.
Can individual metrics be useful?
That's a better question. And the answer is yes: they are useful, especially for detecting outliers in the team.
Maybe you cannot compare developers' individual productivity by # PRs or # commits, but it is useful to know if someone hasn't created a PR or commit in the last two weeks.
In that case, your job as a leader is to combine that information with the actual context of what's happening (maybe they are pairing, on vacation, or temporarily working in another repo), and to make decisions taking everything into account.
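To make that concrete, here is a minimal sketch of the idea (hypothetical data; not how any particular product implements it): flagging inactivity is just a cutoff check over each developer's most recent activity.

```python
from datetime import datetime, timedelta

# Hypothetical timestamp of the most recent PR/commit per developer.
last_activity = {
    "alice": datetime(2024, 5, 20),
    "bob":   datetime(2024, 4, 28),
    "carol": datetime(2024, 5, 21),
}

today = datetime(2024, 5, 22)
cutoff = today - timedelta(days=14)

# Flag anyone with no PR or commit in the last two weeks.
outliers = [dev for dev, ts in last_activity.items() if ts < cutoff]
print(outliers)  # ['bob'] -- a prompt to go check context, not a verdict
```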
This is why Pullpo detects activity outliers automatically, so that managers are aware when something is happening on their team ;)
Final thoughts and summary.
- We can tie input to output and output to outcome.
- However, unlike teams like sales, we cannot measure how much an individual engineer contributed to outcomes.
- Most of the time, it doesn't even make sense to compare developers based on input metrics. But input metrics are useful for detecting outliers.
- I say most of the time because I think there may be some exceptions. For example, review turnaround time can be a better comparative metric, allowing managers to go a little further than just detecting outliers. But not much further, or Goodhart's law could apply.
- I also believe that, with AI, comparative metrics could become a solvable problem, but the solution would have to be perfect. Otherwise, the "solution" would be a bigger problem than the one it set out to fix.