The consulting firm came up with a methodology they claim can measure software developer productivity. But that measurement comes at a high price – and we offer a more sensible approach.
When I was learning photography, I heard a story about a photography teacher who divided his class into two groups: one group was graded on the quality of their best photos, and the other group was graded on how many photos they produced. At the end of the semester, the professor looked at which group produced the best work, and the pattern was clear: the majority of the highest-quality photos came from the group asked to produce more work.
As an analogy, this is imperfect: new students are at a different point on the learning curve than professionals and likely benefit more from applying skills in new scenarios. But in my career, I've found that there's some truth there.
I'm a relatively new manager at a small company. I do my best to evaluate my team based on impact, but I also privately measure output, and there is an obvious correlation between the two. We're dealing with a novel product in an emerging market, and it's rarely clear which initiatives will be most impactful before we get prototypes into customer hands. It's unsurprising that the engineers on the team landing more changes have more hits and drive more customer acquisition and retention.
I conceptually believe that there are places where engineers game output metrics and produce "busy work" with little value, but in my (admittedly limited) experience, I haven't seen much of that. I try to be aware of incentives; I don't tell the team I'm tracking output, to avoid encouraging that type of work. Maybe this is the luxury of a small, process-averse company.
I'm genuinely curious to hear from others who have experience in cultures where outcomes and impact don't track effort and output. As our company grows, I'll have some say in how engineers are evaluated, and I want to make sure we're being thoughtful.
I call this the 50 pounds of pots story, as I heard it told about a pottery teacher. I summarize it as: if you know what to do, do it. If not, figure out what to do. That "figuring out" process seems hard to incentivize.
John: I really like your perspective here.
I find there is a fine line between setting targets with these automated systems versus "setting the pace" on a team, and then either helping/coaching those who are below the pace, or dealing with it in other ways.
Of course we know that output matters. A less experienced engineer *should* aim to get a PR done every single day (or more!) as they will both do better work and grow faster. And a hands-on engineering manager would both ensure there is space to make this happen (e.g. meetings don't slice up the day) and keep a pulse on it.
But there is a big difference between giving the manager leeway to build a high-performing team, vs mandating this from above. E.g. when I was a manager, there were weeks when some of my most productive engineers did not do much coding, thanks to e.g. onboarding/recruitment/kicking off a major project.
I find that signals like PRs, code reviews, design docs etc. are helpful when people onboard - while they "pick up the pace" - and also when you spot issues with what they ship (or do not ship).
I'm not sure you can put all of this into a framework that visualizes it for non-hands-on managers without them doing a lot of damage with raw data that misses all this context.
With a hands-on/technical manager, I trust all this data. Heck: as a manager, I didn't need to get such reports, as I reviewed PRs and code reviews most weeks, and picked up on patterns that felt off to me.
I often talk to people about the idea of teaching someone a sport, like swimming or skiing.
Early on, you need to be specific - put your leg there, put your arm there. I.e. you are telling them how to produce a specific output.
Case in point... take lots of photos.
But as you move from beginner to mastery, the why starts to matter - and you focus less on the specifics and more on the outcome.
The teacher of a brand-new skier tells them very different things from the coach of an Olympic-winning team.
If we want to create high performing teams, we need to adjust based on where our team is at currently.
A junior engineer gets a lot of value from the mantra of 'frequent PRs', but a good, well-performing team should not be measured by that same yardstick.
The sales org in my company actually measured the salespeople on activity (in addition to various impact metrics) a few years ago. The reasoning was that if you don't talk to the customer (in a meeting or by phone), no order will happen, and you won't even learn anything. They saw a correlation between effort, output and impact.
It did have the side effect that some customers complained about too much activity :) - and it isn't used anymore, but I think it was useful to develop our sales at that point in time.
One thing I always think about when reading about productivity is that "productivity as a measurement is a good thing" seems to be deeply ingrained and accepted as correct by folks, but I have to question how true it actually is. Let's take this particular measurement of productivity:
> For example, customer support can be measured by the number of tickets closed (an outcome), the time it takes to close a ticket (also an outcome,) and by customer satisfaction scores (impact.)
Most people agree that working customer support is a soul-crushing, terrible job, and that these metrics negatively impact their ability to serve customers by incentivizing negative behaviours (copy/pasting answers without fully reading questions, closing difficult calls to prioritize easier ones, etc.). While these customer support metrics generally look good to C-levels, I do wonder just how much better things would be for customers and workers if this fetish for measurement was put aside for a more holistic approach to outcomes and impact.
Unlikely, of course, and probably a utopian ideal, but it's something I always find myself thinking about.
Seems like software companies make a whole bunch of money without all this measurement. IOW if you have a good business, you don't need it. If you have a bad business, measurement will make it worse.
This makes me think of Zappos, and how they won a whole industry by throwing away the conventional wisdom on quantity over quality for customer support.
What I do not know is whether you can do what Zappos did without a founder who deeply believes in all of this, has the team's back, and creates a culture where people push themselves to delight customers (and have the means to do so).
One older article about Zappos: https://hbr.org/2010/07/how-i-did-it-zapposs-ceo-on-going-to-extremes-for-customers
I think you hit the proverbial nail on the head here. The issue with eschewing conventional wisdom is that you have to have leadership who truly believe in the chosen direction and are willing to ride out any rough patches and naysayers. There will be constant whispers in their ears (especially from their VCs, if they're taking VC money) along the lines of 'Sure, this is working, but it could be better...' or even 'See that dip there? That was a sign this does not scale...' - there's a lot of pressure on executives to follow the herd and do what others do. We train MBAs to do it (regardless of what those programs claim), and while many of the foundational business books talk about pursuing excellence in your own way, the vast majority of them are "How To Do What Toyota Did In Five Easy Steps." Or said another way:
In place of deep personal conviction about how to be a productive, profitable, and customer-focused company, executives will choose to parrot conventional wisdom as correct. And with increasing demands on everyone to be more 'productive,' fewer people will have time to question the paradigm.
Thanks for the great insight. I would like some further clarification as to how the three DORA metrics "deployment frequency", "lead time for changes", and "mean time to recover", are categorized as OUTCOME metrics. They seem to line up much more with OUTPUTS based on your example outputs (eg: "feature is in production"), but maybe I am misunderstanding. Clarification would be appreciated!
I suggest you try to explain them both ways and see what you come up with. There isn’t a right answer, just more or less helpful ways to think.
I believe this also ties into the different engineering practices at Meta and at AWS, as you wrote. AWS has extensive software tests, whereas Meta has several staging environments to test business metrics. It may be a good idea to link to them.
Yicong: I'm not sure I fully follow. Yes, they do these, and so have additional layers to validate that stuff works and/or measure the impact. Do you mean that it could be helpful to mention that e.g. Meta is already measuring this business impact where they can?
Indeed, in the sense that Meta and AWS are both linking to business metrics (outcome), but what they do is different because they generate profit differently; this pair of examples shows there is no one-size-fits-all, in contrast to McKinsey's framework. Oh... I forgot to mention, they were from your blog, something about engineering culture maybe.
Good article.
I agree with the idea of measuring and metrics. As part of my job on the exec team, I discuss metrics with Sales, Marketing and Recruitment, and ask them about theirs. And good startups are usually metric-focused in these areas.
My challenge is not whether we should measure engineering (I think we should), but whether we can do it the right way.
I sometimes quote Einstein: 'Not everything that can be counted counts, and not everything that counts can be counted'
- which feels like it sums up our challenge.
I had landed on the idea of shipping customer-facing value per x (per week for a startup, a number per month for an enterprise) as well.
It's far from ideal, but it is a decent proxy for helping to ensure the teams are focused on impact, and for talking at the exec team level.
@Chris: nice to hear our thinking converging.
One thing we've been talking about with Kent is how the "what you measure for a decent outcome" is likely very different based on the "stage" of a product: whether it is "pioneer," "settler" or "town planner" stage (as per Wardley Maps): https://orghacking.com/pioneers-settlers-town-planners-wardley-9dcd3709cde7
I suspect for a team looking to get to MVP this is very different than a mature team optimizing a cash cow product.
Thanks for posting this. My last company had a good culture where we tried to separate rewarding for impact from promotion.
Impact was tied to a quarterly bonus. Promotion was based on behavior (i.e. at each level there is a rubric of expected behaviors). So in order to be promoted to Staff Engineer, one needs to demonstrate mentorship or organization-wide impact, and folks look for ways to demonstrate that.
The process was pretty heavyweight, so many new managers who were not used to it hated it. The company then grew too fast, and a new HR regime killed the whole thing; now we just do what every other company does. Sigh...
I worked in FinTech, where performance attribution is so heavily quantified that the "too abstract to measure" argument certainly falls on deaf ears. This is really one of the notoriously hard things we struggle with, and it requires tech leaders to carefully manage expectations.
A few years ago, I undertook an effort to look at pull request metrics as a potential proxy for productivity. One of the major things I stressed before presenting the numbers was that this analysis could only be done on the condition that we did not try to attribute "individual performance" to the results or try to identify "target numbers". Stats were strictly used in service of a larger desired goal: identifying inefficiencies and potential improvements to team mechanics, and they were tailored to teams as a whole. While individual stats were tracked, they were anonymized and heavily caveated before being presented to any senior audience, to preempt the exact kind of pitfalls you described.
At least a few positive results did come from the exercise. One year, it helped us identify actionable issues such as insufficient timezone coverage between offices. Another year, very asymmetric PR numbers helped us proactively identify silent burnout in one of the top performers, which might otherwise have gone unnoticed.
But again, it is difficult to tie this to a bottom-line number like "revenue" as easily as sales can. It helped us primarily in other ways:
1. Identifying potential tangential problems before they became bigger problems.
2. Cultural reinforcement that the desired metric was important and noticed (but not a hard performance target).
3. Demonstrating a standard of care that the "engineering dept knows what's happening on the ground", i.e. measure what we can, useful or not.
And finally, in cases where pushing a metric was a soft target, the measurements were ones where engineers and management were fully aligned; both accepted the potential for behavior modification but viewed it as a desirable side effect, again leaning on "culture reinforcement". In other words, intentional gamification.
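To make the "team-level, anonymized" approach described above concrete, here is a minimal Python sketch, assuming PR records have already been exported from the source-control system. The field names, the sha256-based anonymization, and the "review_hours" turnaround measure are illustrative assumptions, not the commenter's actual tooling.

```python
import hashlib
from collections import defaultdict
from statistics import median

# Hypothetical PR records exported from source control; field names are illustrative.
prs = [
    {"author": "alice", "team": "payments", "review_hours": 6.0},
    {"author": "bob",   "team": "payments", "review_hours": 30.0},
    {"author": "carol", "team": "risk",     "review_hours": 4.5},
]

def anonymize(author: str) -> str:
    """Replace author names with a short, stable hash before any report leaves the team."""
    return hashlib.sha256(author.encode()).hexdigest()[:8]

# Aggregate per team, not per person: PR count and median review turnaround.
by_team = defaultdict(list)
for pr in prs:
    by_team[pr["team"]].append(pr)

for team, team_prs in by_team.items():
    turnaround = median(p["review_hours"] for p in team_prs)
    print(f"{team}: {len(team_prs)} PRs, median review turnaround {turnaround:.1f}h")

# Individual counts stay anonymized and are only used to spot outliers
# (e.g. possible burnout), never as performance targets.
anon_counts = defaultdict(int)
for pr in prs:
    anon_counts[anonymize(pr["author"])] += 1
print(dict(anon_counts))
```

The point of the sketch is the separation of concerns: team-level aggregates are what get reported upward, while individual data is hashed and only inspected for anomalies.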
The thing is, with taking snapshots of how PRs progress over time: I've seen this happen at other places as well, e.g. Uber. The first few times, this metric works, as there is no "gaming" it. E.g. for a long time at Uber, director+ folks had access to this data, and when COVID started, they shared at the company all-hands how the # of PRs and PRs per engineer went up in the few months after WFH started! It was very interesting.
And I am also with you on how, when you are aware of the behavior-modifying effect... this could be ok? Like it's clear that if you measure PR frequency or PRs per week/month etc., you get smaller PRs. Which could be a goal in some places, and it's exactly what happened at e.g. Uber (and why the CI system was overloaded initially).
This reminds me of e.g. what YouTube rewards: total watch time of the video. This leads creators to optimize the thumbnail (to get the quantity) and then the watch % of the video (for the quality). The more they "game" the system, the better it is for YouTube, thanks to being able to sell more ads.
Thanks Gergely for the article on pioneers-settlers etc.
I had not applied that lens.
Your comment about suspecting differences feels spot on, given my experience in different organisations along that spectrum.
There are a lot of different companies, with different products, that have engineering organizations. The relationship between engineering and the business can differ from one company to the next. However, the sales and recruiting organizations can still use similar metrics.
It is an interesting call-out, though, if a CEO questions the impact of their engineering organization. A company like Chick-fil-A would approach it far differently than Uber would, and far differently than Salesforce. Engineering's productivity and impact are different in each of those examples (with respect to the core business).
Proofreading:
(1) The graphic and text are inconsistent in this passage:
> What about the recruitment team’s main metric: the number of heads filled? It’s also categorized as “impact:”
The graphic shows it in "Outcome".
(2) Confusing (at least to me)
> Neither Kent nor I have seen accountable teams within tech companies which are not measured by outcome and impact
One way to avoid this whole conflict is to work for a company whose CEO came from tech (god help those who work for a company whose CEO used to be a CFO)...