3 Reasons to Ditch Flat On-Call Structures For Escalation Layers
- Yuan Cheng
What’s A Flat On-Call Structure?
A flat on-call structure is an on-call schedule in which multiple people are notified of the same incident at the same time. There are no escalations, no hierarchy, no layers.
Problems With Flat On-Call
- It’s an untargeted cry for help. When you need immediate assistance, crying HELP to the masses is a great way to disturb everybody, but a horrible way to get effective help quickly. We’re all too familiar with that lengthy pause where everyone wonders “Are they talking to me?” Untargeted alerts suffer the same delay.
- It’s an unnecessary tax on employee morale. Flat on-call structures might work okay if you’re paying your employees for off-duty on-call hours and you truly need all hands on deck (e.g. firefighters). But if your teams are trying to rest and recuperate for the next day, flat on-call structures quickly sap employee energy and decrease workplace efficiency.
An All Too Familiar Example
An outage occurs at 3:00am, and 5 people get woken up. The unfortunate “nice one” decides to bite the bullet and respond. The other four try to go back to sleep, but knowing there’s an active outage in flight, they’re unlikely to have much luck at that.
The next day, you’ve got all 5 team members operating at reduced capacity. Five team members! Not to mention, if you keep this up over time, people will naturally start to slide towards the path of least resistance: letting someone else take the call. You can’t load balance across a flat structure, and even perceived shirking starts to drive a wedge between team members.
It’s not long before response times are lagging, and your most committed employees are starting to feel like they’re getting the short end of the stick.
3 Reasons to Ditch Flat On-Call
- It’s exhausting and inefficient. Why would you notify 5 team members when one would suffice?
- It’s near impossible to load balance. Unbalanced teams always become disgruntled teams.
- It’s slow. With 5x the eyes on the alert, you’d think the response would be 5x faster. But if you’re always sending out untargeted alerts, people eventually end up waiting around to see if someone else is going to respond. Unless you’ve got other incentive programs in place, the lack of clear accountability can actually slow things down.
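So what’s the solution? Use escalation layers: notify one primary on-call first, and pass the alert to the next layer only if it goes unacknowledged.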
Benefits of Escalation Layers
- It’s efficient, accountable, and measurable. Notify only the people responsible and hold them accountable. Track metrics like Mean Time to Respond (MTTR).
- It’s easy to load balance. With escalation layers, a different team member can be assigned as primary on-call each week, so the role is shared across the team instead of falling on one person.
- It’s faster. With delegated responsibility and accountability, your team won’t be burnt out and people will be willing to take the call when it comes.
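To make the weekly rotation concrete, here is a minimal Python sketch. The roster names, the `ROTATION_START` date, the three-layer split (primary, secondary, everyone else), and the `layers_for` helper are all assumptions made for illustration; none of them come from a particular on-call tool.

```python
from datetime import date

# Hypothetical roster and rotation start; neither comes from the article.
TEAM = ["ana", "ben", "chloe", "dev", "eli"]
ROTATION_START = date(2024, 1, 1)  # a Monday, chosen arbitrarily


def layers_for(day: date, team: list[str] = TEAM) -> list[list[str]]:
    """Return the escalation layers in effect on `day`.

    Layer 1 is the primary, layer 2 the secondary, layer 3 everyone else.
    The roster shifts by one position each week, so the primary role
    cycles through the whole team.
    """
    weeks_elapsed = (day - ROTATION_START).days // 7
    shift = weeks_elapsed % len(team)
    rotated = team[shift:] + team[:shift]
    return [rotated[:1], rotated[1:2], rotated[2:]]


if __name__ == "__main__":
    # Print who holds the first two layers for five consecutive weeks.
    for week in range(5):
        day = date.fromordinal(ROTATION_START.toordinal() + 7 * week)
        primary, secondary, rest = layers_for(day)
        print(day, "primary:", primary[0], "secondary:", secondary[0])
```

Over any five consecutive weeks, each of the five team members is primary exactly once, which is the load balancing a flat structure can’t give you.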
A Better Example Using Escalation Layers
An outage occurs at 3:00am, and 1 person (the primary on-call) gets woken up. The other four team members are still asleep and actually won’t find out about the outage until they get to work the next morning. The next day, you’ve got 1 team member who’s tired, but regarded as a hero by the other four for fixing the outage. 😍
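The same scenario can be sketched in a few lines of Python. This is a rough model, not how any specific paging product works: the 10-minute `ESCALATION_DELAY`, the layer names, and the helper functions `paged_before_ack` and `mean_time_to_respond` are all hypothetical.

```python
from datetime import timedelta

# Assumed wait before an unacknowledged alert moves to the next layer.
ESCALATION_DELAY = timedelta(minutes=10)


def paged_before_ack(layers: list[list[str]], ack_after: timedelta) -> list[str]:
    """List everyone paged before the alert was acknowledged.

    Layer 0 is paged immediately; layer i is paged i * ESCALATION_DELAY
    later, but only if nobody has acknowledged by then.
    """
    paged: list[str] = []
    for i, layer in enumerate(layers):
        if i > 0 and i * ESCALATION_DELAY >= ack_after:
            break
        paged.extend(layer)
    return paged


def mean_time_to_respond(ack_times: list[timedelta]) -> timedelta:
    """Average time between the first page and the acknowledgement."""
    return sum(ack_times, timedelta()) / len(ack_times)


if __name__ == "__main__":
    # Hypothetical five-person team split into three layers.
    layers = [["primary"], ["secondary"], ["third", "fourth", "fifth"]]

    # The 3:00am outage: the primary acknowledges after 4 minutes.
    print(paged_before_ack(layers, timedelta(minutes=4)))   # ['primary']

    # If the primary sleeps through it, the secondary is paged at +10 minutes.
    print(paged_before_ack(layers, timedelta(minutes=14)))  # ['primary', 'secondary']

    # Mean time to respond across three made-up incidents.
    print(mean_time_to_respond([timedelta(minutes=m) for m in (4, 14, 6)]))  # 0:08:00
```

In the common case only one person is woken up, and the rest of the team is disturbed only when the layer ahead of them fails to respond in time.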