The consulting firm came up with a methodology they claim can measure software developer productivity. But that measurement comes at a high price – and we offer a more sensible approach.
When I was learning photography, I heard a story about a photography teacher who divided his class into two groups: one group was graded on the quality of their best photos, and the other group was graded on how many photos they produced. At the end of the semester, the teacher looked at which group produced the best work, and the pattern was clear: the majority of the highest-quality photos came from the group asked to produce more work.
As an analogy, this is imperfect: new students are at a different point on the learning curve than professionals and likely benefit more from applying skills in new scenarios. But in my career, I've found that there's some truth there.
I'm a relatively new manager at a small company. I do my best to evaluate my team based on impact, but I also privately measure output, and there is an obvious correlation between the two. We're dealing with a novel product in an emerging market, and it's rarely clear which initiatives will be most impactful before we get prototypes into customer hands. It's unsurprising that the engineers on the team landing more changes have more hits and drive more customer acquisition and retention.
I conceptually believe that there are places where engineers are gaming output metrics and producing "busy work" with little value, but in my (admittedly limited) experience, I haven't seen much of that. I try to be aware of incentives; I don't tell the team I'm tracking output, to avoid encouraging that type of work. Maybe this is the luxury of a small, process-averse company.
I'm genuinely curious to hear from others who have experience in cultures where outcomes and impact don't track effort and output. As our company grows, I'll have some say in how engineers are evaluated, and I want to make sure we're being thoughtful.
One thing I always think about when reading about productivity is that "productivity as a measurement is a good thing" seems to be a deeply ingrained belief that folks take as correct, but I have to question how true it actually is. Let's take this particular measurement of productivity:
> For example, customer support can be measured by the number of tickets closed (an outcome), the time it takes to close a ticket (also an outcome,) and by customer satisfaction scores (impact.)
Most people agree that working customer support is a soul-crushing, terrible job, and that these metrics hurt agents' ability to serve customers by incentivizing negative behaviours (copy/pasting answers without fully reading questions, closing difficult calls to prioritize easier ones, etc.). While these metrics generally look good to C-levels, I do wonder how much better off customers and workers would be if this fetish for measurement were set aside for a more holistic approach to outcomes and impact.
Unlikely, of course, and probably a utopian ideal, but it's something I always find myself thinking about.
I believe this also ties into the different engineering practices at Meta and at AWS that you wrote about. AWS has extensive software tests, whereas Meta has several staging environments to test business metrics. It may be a good idea to link to them.
I agree with the idea of measuring and metrics. As part of my job on the exec team, I discuss and ask Sales, Marketing, and Recruitment about their metrics. Good startups are usually metric-focused in these areas.
My challenge is not whether we should measure engineering (I think we should), but whether we can do it the right way.
I sometimes quote Einstein: 'Not everything that can be counted counts, and not everything that counts can be counted'
- which feels like it sums up our challenge.
I had landed on the idea of shipping customer-facing value per unit of time (per week for a startup, per month for an enterprise) as well.
It's far from ideal, but it is a decent proxy for helping to ensure the teams are focused on impact, and for talking about it at the exec team level.
Thanks for posting this. My last company had a good culture where we tried to separate rewarding impact from promotion.
Impact was tied to a quarterly bonus. Promotion was based on behavior (i.e. at each level there is a rubric of expected behavior: to be promoted to Staff Eng, one needs to demonstrate mentorship or organization-wide impact, so folks look for ways to demonstrate that).
The process was pretty heavyweight, so many new managers who weren't used to it hated it. Then the company grew too fast, a new HR regime came in and killed the whole thing, and now we just do what every other company does. Sigh...
Thanks for the great insight. I would like some further clarification as to how the three DORA metrics "deployment frequency", "lead time for changes", and "mean time to recover" are categorized as OUTCOME metrics. They seem to line up much more with OUTPUTS based on your example outputs (e.g. "feature is in production"), but maybe I am misunderstanding. Clarification would be appreciated!
Thanks Gergely for the article on pioneers-settlers etc.
I had not applied that lens.
Your comment about suspecting differences feels spot on, given my experience in different organisations along that spectrum.
I worked in FinTech, where performance attribution is so heavily quantified that the "too abstract to measure" argument certainly falls on deaf ears. This is really one of the notoriously hard things we struggle with, and it requires tech leaders to carefully manage expectations.
A few years ago, I undertook an effort to look at pull request metrics as a potential proxy for productivity. One of the major things I stressed before presenting the numbers was that the analysis could only be done on the condition that we not try to attribute "individual performance" to the results or identify "target numbers". Stats were strictly used in service of a larger goal: identifying inefficiencies and potential improvements to team mechanics, and they were tailored to teams as a whole. While individual stats were tracked, they were anonymized and heavily caveated before being presented to any senior audience, to preempt the exact kind of pitfalls you described.
At least a few positive results came from the exercise. One year, it helped us identify actionable issues such as insufficient timezone coverage between offices. Another year, very asymmetric PR numbers helped us proactively spot silent burnout in one of our top performers that might otherwise have gone unnoticed.
But again, it's difficult to tie this to a bottom-line number like "revenue" as easily as sales can. It helped us primarily in other ways:
1. Identifying potential tangential problems before they became bigger problems.
2. Culturally reinforcing that the desired metric was important and noticed (but with no hard performance target).
3. Demonstrating a standard of care, i.e. that the "engineering dept knows what's happening on the ground": measure what we can, useful or not.
And finally, in cases where pushing a metric was a soft target, the measurements were ones where engineers and management were fully aligned; both accepted the potential for behavior modification but viewed it as a desirable side effect, again leaning on "culture reinforcement". In other words, intentional gamification.
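For readers curious what that anonymized, team-level aggregation might look like in practice, here is a minimal sketch. The data shape, field names, and teams are all illustrative assumptions, not the commenter's actual tooling; the point is only that individual names never appear in the reported output.

```python
# Sketch: aggregate PR stats to the team level so no individual
# performance numbers reach a senior audience. All records here are
# hypothetical examples.
from collections import defaultdict
from statistics import median

# One record per merged pull request (illustrative data).
prs = [
    {"author": "alice", "team": "payments", "review_hours": 5},
    {"author": "bob",   "team": "payments", "review_hours": 30},
    {"author": "carol", "team": "risk",     "review_hours": 2},
    {"author": "alice", "team": "payments", "review_hours": 8},
]

def team_summary(prs):
    """Return per-team PR count and median review time.

    Author names are deliberately dropped, so the output can be
    shared without attributing numbers to individuals.
    """
    by_team = defaultdict(list)
    for pr in prs:
        by_team[pr["team"]].append(pr["review_hours"])
    return {
        team: {"pr_count": len(hours), "median_review_hours": median(hours)}
        for team, hours in by_team.items()
    }

print(team_summary(prs))
# For this sample data: payments has 3 PRs (median review 8h),
# risk has 1 PR (median review 2h).
```

Medians rather than means keep a single long-running PR from skewing a small team's numbers, which matters when the goal is spotting trends (like the timezone gaps or burnout signals mentioned above) rather than ranking people.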
One way to avoid this whole conflict is to work for a company whose CEO came from tech (god help those who work for a company whose CEO used to be a CFO)...