Tech Debt, Incidents, and On-Call – The New Stack
I was recently chatting with the leader of a cloud and platform operations team who was looking to improve how the organization manages incident response. Like many organizations, they were trying to take a “build and run” approach. This is sometimes called “full-service ownership.” Whatever the term, the approach refers to software development teams taking responsibility for ensuring that the code they write also works well in production.
Dormain is Vice President of Product Marketing and Developer Relations at PagerDuty. Prior to PagerDuty, she led product marketing and content strategy for VMware Tanzu and held similar positions at Pivotal and Riverbed Technology. She also spent more than five years as a technology investment analyst, closely following enterprise infrastructure software companies and industry trends. Dormain holds a bachelor’s degree in history from the University of California, Los Angeles.
Naturally, I asked whether the software development teams take on-call rotations to support their code in production. After a deep sigh and an “it’s complicated” response, he said something really insightful: yes, software development teams are often in the escalation path for an incident, but the organization had decided to create a site reliability engineering team to handle on-call duties.
Why? Why not fully live the values of “build and run”? His answer reflected a divide I had never heard expressed before, but it made perfect sense. Although developers know the code best, they aren’t always the most helpful when something goes wrong, because the two groups want different things. While operations teams want service restored as soon as possible, development teams want to eradicate the underlying problem.
On the surface, these goals appear similar. But imagine you’re at the grocery store checkout and your credit card is declined. You have been living so tightly from paycheck to paycheck that your last credit card payment bounced. All you need to do to restore the card and check out is pay the minimum balance. Call it $25. But the real, fundamental problem is that you are buried in debt, carrying a balance and falling behind. Paying off the balance costs $25,000. Doing so not only frees up your credit card but also eliminates a source of high-interest debt on your personal balance sheet.
Coming up with the $25 is relatively easy. For the operations team, this might mean something like rebooting a system. The service is quickly brought back online and the incident is “resolved.” The underlying cause, however, still persists, and it looms as a future incident waiting to happen. Finding $25,000 is a longer and more complicated undertaking, and there’s still food to put on the table once you’ve finished checking out at the grocery store.
Who is right?
Both views have a point. But the right thing to do is solve the $25 problem first, then quickly solve the $25,000 problem. After all, wouldn’t you find the fastest way to get through the grocery line and get dinner on the table first? But the challenge is that we rarely find the time to revisit the $25,000 problem. So we face the same problem a week later on our grocery run. It’s exhausting and demoralizing.
How do we make the time to unwind tech debt?
First, a definition: I prefer to think of all code as tech debt. Why? Because all code will require servicing or maintenance at some point. At a minimum, security updates for libraries with vulnerabilities are inevitable. Just look at the Log4j vulnerability from late 2021, which required urgent and widespread maintenance across many organizations.
Rather than debating whether the code is debt or not, the better question is how easily you can maintain that code. If the code is easy to change and update, you’re in a much better position to fix that $25,000 problem soon after something goes wrong. But that still doesn’t answer the question of when you do this work.
The answer to this question could, ironically, come down to involving developers in this “build and run” responsibility. In her talk at PagerDuty Summit 2022, Charity Majors suggested using on-call rotation time for tech debt work. An on-call shift is already a poor time for developers to work on features, since they could be interrupted at any moment by an incident requiring their attention. But it is the perfect time to dig into what caused the incidents.
Using on-call time to “pay off” tech debt serves several purposes. First, it should reduce recurring incidents that were resolved without addressing the root cause. Second, as the title of Majors’ talk, “On Call Doesn’t Have to Suck,” implies, this makes on-call rotations more attractive. As this cloud and platform operations leader observed, developers want to address the root cause. Ultimately, no one wants to be woken up by a problem, especially one they already know exists. That is pain that could have been avoided.
Check out the talk by Charity Majors, co-founder and CTO of Honeycomb, for more on using on-call rotations to focus on tech debt, along with other tips for improving the on-call experience.
Feature image via Pixabay