How not to send service outage status updates

The emotion for this blog post is "buzz buzz".

Intuit, like many companies, makes use of the quite useful Statuspage service to notify their clients when interruptions to one or more of their services have been reported and/or identified. In general, this is a great thing. In my case, QuickBooks Online (one such service) is used by one of my clients to serve her customers, so I am able to quickly ascertain whether a problem one of those customers is having is related to her app or to the upstream service (i.e., QuickBooks Online in this case).

Statuspage makes it easy for someone like me to subscribe and get e-mail or phone alerts when there are such reported outages. I can also follow along as the status is updated. The usual cycle goes like this: an incident is first reported, and if the vendor determines that there does, in fact, seem to be some kind of problem affecting their customers, they update their status page, which shares the time and indicates that they’re “currently looking into it”. If they discover that service has been impacted to any significant degree, the status for a particular service may change from green to yellow, indicating that some service degradation may have occurred.

As their investigation progresses, they will update the status, either after a certain amount of time or when there is some change in the situation. It is considered good practice to keep customers aware of an ongoing investigation if it drags on, even if no significant change has occurred.

If a truly service-impacting issue has been identified, one that has affected most or all of their customers, they may change the status even further, going from yellow to red. It is also assumed, at this time, that the issue has become a top priority for the vendor as (one would hope!) they would be focusing their efforts on restoring service to their customers as quickly as possible.

After some time, and perhaps more updates, if the vendor feels they have solved the issue completely, they will change the status back to green, indicating that they are confident service has been restored back to their clients. Sometimes a brief period of yellow will be set prior to green if the restoration of service may take some time to affect all clients. Vendors may also note that they are “continuing to monitor” the issue to ensure a quick response should anything happen.
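For the code-minded, the cycle described above can be sketched as a tiny state machine. This is only my own illustration of the green/yellow/red progression, not anything Statuspage or Intuit actually runs; the names `Status`, `ALLOWED`, and `advance` are made up for this post, and the rule that green must pass through yellow before reaching red is an assumption taken from the "usual cycle" above.

```python
from enum import Enum


class Status(Enum):
    GREEN = "operational"
    YELLOW = "degraded"
    RED = "major outage"


# Transitions in the "usual cycle" described above (an assumption for
# illustration, not a rule enforced by any real status service).
ALLOWED = {
    Status.GREEN: {Status.YELLOW},
    Status.YELLOW: {Status.RED, Status.GREEN},
    Status.RED: {Status.YELLOW, Status.GREEN},
}


def advance(current: Status, new: Status) -> Status:
    """Return the new status, rejecting jumps outside the usual cycle."""
    if new == current:
        return current  # no change, so nothing to announce
    if new not in ALLOWED[current]:
        raise ValueError(f"unexpected jump: {current.name} -> {new.name}")
    return new
```

The detail worth noticing is the `new == current` branch: when nothing has changed, there is nothing new to tell anyone.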

All of this is well and dandy, and quite suitable to both keep a wide portion of customers informed in near-realtime of the status of their service as well as to track historical service quality, all of which Statuspage does fairly well.

However, “all according to plan” or “as expected” hardly makes suitable blogging material. No, we are gathered here today because Intuit done goofed. Today’s caption is basically me when I woke up this morning. Ask my wife.

There are, like, 100 more you cannot see in this screenshot. And my phone is STILL buzzing as I type this…
Actual screenshot of the Intuit Developers status page.

Allow myself to quote…myself.

Statuspage makes it easy for someone like me to subscribe and get e-mail or phone alerts when there are such reported outages.

Basil Gohar ca. 5 minutes ago

Statuspage makes it easy, and perhaps it’s so easy that it can be automatically triggered based on…some condition. So if you make that condition blindly trigger the update, you can get…a lot of updates. So many updates. So many updates that one starts to wonder what one is supposed to do with all of these updates…(other than vent by writing a blog post).

I would guess that somewhere in Intuit’s infrastructure there’s a bit of code that runs based on some service-level condition or some reporting mechanism and fires off an API call to Statuspage to update their status. The problem is, this mechanism appears to be stateless and will fire off regardless of whether or not it has fired off anytime recently. I also don’t know if Statuspage itself has some kind of rate-limiting feature (one would imagine it would), but from the looks of it, the rate of messages is close to one every 10 seconds, with some breaks deviating from that rate.
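To be clear, I have no idea what Intuit's code actually looks like. But here is a minimal sketch of the kind of state I would want such a mechanism to keep: remember the last status posted and when, and only post again if the status changed or a minimum interval has passed. Everything here is hypothetical, including the `DebouncedNotifier` name, the `send` callable (a stand-in for whatever actually posts the update; I am deliberately not guessing at Statuspage's real API), and the 15-minute default.

```python
import time
from typing import Callable, Optional


class DebouncedNotifier:
    """Post an update only on a status change or after a minimum interval."""

    def __init__(
        self,
        send: Callable[[str], None],  # hypothetical: posts the update somewhere
        min_interval: float = 900.0,  # seconds between repeats (assumed value)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._min_interval = min_interval
        self._clock = clock
        self._last_status: Optional[str] = None
        self._last_sent: Optional[float] = None

    def report(self, status: str) -> bool:
        """Return True if an update was actually sent."""
        now = self._clock()
        changed = status != self._last_status
        stale = (
            self._last_sent is None
            or now - self._last_sent >= self._min_interval
        )
        if not (changed or stale):
            return False  # same status, sent recently: stay quiet
        self._send(status)
        self._last_status = status
        self._last_sent = now
        return True
```

With something like this in front of the API call, a health check that fires every 10 seconds with the same "degraded" result would produce one update up front and one reminder every 15 minutes, rather than a phone that will not stop buzzing.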

Let this be a lesson to everyone. We have more tools than ever in our hands to make developing, running, and maintaining client-facing services better and more robust than ever. However, as is always the case in software development and online services, while we should hope for the best, we have to plan for the absolute worst. If you haven’t given your QA team or developer Ricky access to try to break your service before you deploy it, then you’re just asking for it to break in a far more embarrassing way – one that can cost real face and real dollars in the long run.
