The Google outage and Single Points of Mistake
The recent Google outage had lots of people agitated: Can we trust Google Apps? Can we trust Gmail? Can we trust cloud computing in general if Google can’t keep the lights on?
I, for one, am not overly concerned. Google hasn’t been down often, and when it has been down, the outage has been brief. Here’s more information about this most recent Google outage. I’ve been around enterprise systems long enough to know that internal enterprise systems fail with surprising frequency. What’s more, it’s the rare corporate IT department that has truly world-class engineering teams and processes. The reason, I believe, is simple: they can’t afford it, and arguably shouldn’t. Some financial services firms I’ve worked with have come close, but most other firms are just getting by.
It makes sense. Even though most companies are reliant on IT systems, it’s not what they “do for a living”. IT will always be a cost center for them, however ambitious their CIOs are. And what do you do with cost centers? You trim them razor thin. In the IT space, this almost always means you reduce quality in terms of ability to change, performance, and reliability.
Which is one reason I believe that for the majority of businesses, some form of IT outsourcing to a cloud-based provider is inevitable. Cloud providers can achieve economies of scale (with both people and computer systems) that even the largest corporate IT department can only dream about.
But that’s only half of what I wanted to write about today. Google has said there was a failure in one of their traffic routing systems. I don’t know whether that system made the faulty routing decision on its own, or whether a human was in the loop. I know that Google tries to automate as much of its engineering as possible, but I’m sure that humans are still involved at several points.
So, a question arises: Can a system be considered highly available if humans are in the mix? My answer is: yes! However, just as with any other truly critical system (e.g. weapons systems, medical instruments, etc.), the computer system must be designed to understand and accept the fallibility of human operators.
A few years ago, my team and I were designing an administrative tool for a software deployment system we were building. The tool gave a user with administrative privileges over specific applications within the system a way to control which end-users (paying customers) got which version of each application.
Our system supported the notion of different levels of quality for a release (e.g. alpha, beta, production, etc.). One could publish different versions to the system, and “target” each version to a different set of end-users (e.g. alpha testers, beta testers, production users, etc.).
So there was, by design, lots of flaky software in the system. The system simply assumed that the “targeter” knew what s/he was doing, and left it at that.
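To make that concrete, here’s a minimal sketch of what such a “trusting” targeting call might look like. The names and types here are my own illustration, not the actual product code:

```python
# A minimal sketch of the "trusting" model, using hypothetical names;
# this is not the actual product code described in this post.
from dataclasses import dataclass
from enum import Enum


class Quality(Enum):
    ALPHA = "alpha"
    BETA = "beta"
    PRODUCTION = "production"


@dataclass
class Release:
    version: str
    quality: Quality


def target(release: Release, users: list[str]) -> dict:
    # No validation at all: the system assumes the targeter knows what
    # s/he is doing and delivers exactly what was asked for.
    return {"release": release.version, "users": users}
```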
The danger here, of course, is that the targeter DOESN’T know what s/he is doing, or that s/he makes an honest mistake and sends alpha or beta code to a production user (eek!). The system would not have enough information to double-check the targeter’s settings, and would slavishly obey. And, of course, such a mistake would also look bad for my team because “our system did it”.
So we come to the topic I wanted to discuss: the notion of “Single Point of Mistake”. I’m thinking of this as a user-interface corollary to the systems concept of “Single Point of Failure”. To briefly recap, in order to build highly reliable systems, one of the things you have to do is to eliminate any single point of failure (e.g. don’t run your server software on only one server).
High availability needs to reach into the user-interface/interaction domain as well. In the scenario I described above, the system itself can be absolutely bulletproof, but a simple honest flub by a user of the system can wreck it. In this case, it’s important that the system be built with checks in place to prevent this kind of mistake, or at least alert the mistake-maker to the impending doom.
The solution we chose involved marking applications (an application is a set of releases) as being either for External customers or Internal customers. Then, each release is marked as Alpha, Beta, or Production. Finally, each application has 3 user groups (you guessed it): Alpha, Beta, and Production. You set up your production users in the Production user group for an application. Then, the system will actively prevent any non-Production releases from being sent to any Production users - even if you specifically target that release to a Production user.
Of course, an administrator can still screw things up by placing a Production user in a Beta group, for instance. But they then also have to manually target that user for the release to be sent. So, while the mistake can still be made, it requires TWO goofs, not just one. So we’ve eliminated the “Single Point of Mistake” and made our system even more “effectively reliable”.
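Here’s a rough sketch of how that guardrail might look in code. Again, the names and data model are illustrative assumptions rather than the real system:

```python
# A rough sketch of the guardrail, under assumed names and structure;
# the real system's data model was certainly different.
from dataclasses import dataclass, field
from enum import Enum


class Quality(Enum):
    ALPHA = "alpha"
    BETA = "beta"
    PRODUCTION = "production"


@dataclass
class Application:
    name: str
    # One user group per quality level (Alpha, Beta, Production).
    groups: dict[Quality, set[str]] = field(
        default_factory=lambda: {q: set() for q in Quality}
    )


@dataclass
class Release:
    app: Application
    version: str
    quality: Quality


def target(release: Release, users: set[str]) -> set[str]:
    """Return the users who will actually receive the release."""
    if release.quality is not Quality.PRODUCTION:
        # The guardrail: non-Production releases are never delivered to
        # Production users, even if the targeter explicitly asked for it.
        production_users = release.app.groups[Quality.PRODUCTION]
        return users - production_users
    return users


# Usage: a beta release explicitly targeted at a production user is
# filtered out; only the beta tester receives it.
app = Application("reporting")
app.groups[Quality.PRODUCTION].add("alice@example.com")
app.groups[Quality.BETA].add("bob@example.com")

beta = Release(app, "2.0-beta1", Quality.BETA)
print(target(beta, {"alice@example.com", "bob@example.com"}))
# -> {'bob@example.com'}
```

The point of the sketch is simply that the check lives in the system itself, not in the administrator’s head: getting a beta build to a Production user now takes two deliberate actions, not one slip.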
As always, feedback is welcome!