In my previous job, I spent a fair amount of time with enterprise customers who were, more often than not, moving from proprietary Unix to Linux. Coming from Unix, these customers were often heavy users of resource-based (or service-based) High Availability (HA) products such as HP Serviceguard, Sun Cluster and Veritas Cluster Server. When moving to Linux, they wanted feature parity with their traditional Unix HA technologies, and they ran very enterprise-level workloads like databases, CRM and so on. More often than not, they would end up running Red Hat Cluster Suite (High Availability Add-On) or SUSE High Availability Extension, both of which are paid add-ons for these distributions where you pay an additional support premium on top of your normal subscription. These two stacks are of course open source, so you can go and download pacemaker, corosync and other tools and use them to keep your own services highly available at no cost. There are also proprietary products available for Linux, like SIOS LifeKeeper and HP Serviceguard for Linux. The latter is back from the dead due to popular demand.
I use the term resource-based in the previous paragraph to describe the nature of the HA solution: install the software, then install your application and configure an active/passive failover environment across multiple servers. When the service fails on one node, the resource management and infrastructure layers detect the failure and move the service resources (i.e. your application and its dependencies, like a floating IP) across to the second node, with a small window of downtime. The applications you run on top of such clustering layers are often not built with high availability in mind. Rather, the application itself is blissfully unaware that it’s in a highly available configuration; instead, the data and configuration are mirrored with underlying technologies like DRBD. This is where you want a really good cluster resource manager, like pacemaker, and the important infrastructure layer, such as corosync and OpenAIS, to handle things in bullet-proof fashion. These technologies have good mission-critical references too: Deutsche Flugsicherung (German Air Traffic Control), for example, run their systems on top of SLES High Availability Extension. I’ll refer to this as service & infrastructure HA.
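To make the active/passive idea a bit more concrete, here is a minimal toy sketch in Python. It is not how pacemaker or corosync work internally; the `Node` and `ActivePassiveCluster` names, the `monitor` method and the resource names are all made up for illustration. It only shows the basic shape: a resource group lives on one node, and a monitoring pass moves the whole group to the standby when the active node is unhealthy.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster member; 'healthy' stands in for real health checks."""
    name: str
    healthy: bool = True
    resources: list = field(default_factory=list)


class ActivePassiveCluster:
    """Toy two-node cluster: one node runs the resource group, the other waits."""

    def __init__(self, active, passive, resource_group):
        self.active = active
        self.passive = passive
        # The resource group moves as a unit (application plus its dependencies).
        self.active.resources = list(resource_group)

    def monitor(self):
        """One monitoring pass. Returns True if a failover happened."""
        if self.active.healthy:
            return False
        if not self.passive.healthy:
            raise RuntimeError("no healthy node left to run the resources")
        # Move the whole group to the standby, then swap roles.
        self.passive.resources = self.active.resources
        self.active.resources = []
        self.active, self.passive = self.passive, self.active
        return True


node_a = Node("node-a")
node_b = Node("node-b")
cluster = ActivePassiveCluster(node_a, node_b, ["floating-ip", "database"])

node_a.healthy = False           # simulate a failure on the active node
failed_over = cluster.monitor()  # the application itself knows nothing about this
print(cluster.active.name, cluster.active.resources)
```

Note that the "application" here is just a string in a list: it plays no part in the failover decision, which is the point of service & infrastructure HA. A real stack also has to deal with fencing, quorum and data replication, none of which this sketch attempts.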
Then we have another way of achieving HA. This is what some term in-built HA, whereby the application itself has high availability features built into it. A widely known example would be Oracle’s RAC product. It’s UBER expensive but differentiates itself from a simple active/passive Oracle DB running on top of one of the Linux HA stacks. If your wallet permits, it allows you to scale out in large active/active database configurations. This can be extremely useful for performance reasons whilst also achieving resilience. These features are really useful for data warehousing and other demanding use cases where money is no object, like, I dunno, utility billing or inland revenue and tax collection (har!). Oh, did I mention cost? You’re paying for a fair bit of development time and also features; you can do cool stuff that you just can’t do with the “normal” clustering products on Linux (and Oracle will ensure this stays the case).
With the release of Eucalyptus 3.0, we see Eucalyptus deliver the industry’s first cloud platform with in-built HA. This is a big thing and something that enterprise customers need. Whilst the notions and requirements of high availability in the cloud may differ depending on who you talk to, a customer will often require that their cloud be highly available to safeguard service level agreements (SLAs) with users.
Furthermore, with 3.1, we’ll be enhancing flexibility further to give customers and users the option to run supported hybrid-HA topologies. What do I mean by this? Well, the Cloud Controller (frontend) and Walrus (S3) components can have redundant spares while the Cluster Controller and Storage Controller don’t. This saves on hardware costs and complexity for some. It also unlocks an interesting possibility in terms of your SLAs. So, why would you want to do this? Well, if an availability zone goes down, this might not bother you; users could just use another one for the time being until service is restored. In the meantime you’d like to ensure users can still interact with the cloud and use a different availability zone, recovering their work from snapshots or data from bukkits (S3). Without the Cloud Controller and Walrus, this wouldn’t be possible and this would be a much bigger issue for you; users wouldn’t be able to do *anything* and there goes your service.
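The reasoning behind a hybrid-HA topology can be sketched as a small availability model. This is purely illustrative Python and has nothing to do with the real Eucalyptus code or its configuration: the `Component` class, the `usable_zones` function and the zone names are assumptions of mine. The front-end components get a spare, the per-zone components don’t, and we ask which zones users can still reach after some failures.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical service with an optional redundant spare."""
    name: str
    primary_up: bool = True
    has_spare: bool = False
    spare_up: bool = True

    def available(self):
        # A component survives a primary failure only if it has a working spare.
        return self.primary_up or (self.has_spare and self.spare_up)


def usable_zones(frontend, zones):
    """Return the zones users can still reach, or [] if the front end is down."""
    if not all(c.available() for c in frontend):
        return []
    return [zone for zone, comps in zones.items()
            if all(c.available() for c in comps)]


# Hybrid-HA: spares for the Cloud Controller and Walrus only.
frontend = [Component("CLC", has_spare=True), Component("Walrus", has_spare=True)]
zones = {
    "zone-a": [Component("CC-a"), Component("SC-a")],
    "zone-b": [Component("CC-b"), Component("SC-b")],
}

frontend[0].primary_up = False          # CLC primary dies, its spare takes over
zones["zone-a"][0].primary_up = False   # CC in zone-a dies, no spare, zone is lost
print(usable_zones(frontend, zones))    # zone-b is still reachable
```

If the Cloud Controller’s spare were also down, `usable_zones` would return an empty list even though zone-b itself is perfectly healthy, which is the "users can’t do anything" scenario described above.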
Eucalyptus 3.0 HA lays the groundwork for flexible high availability in a truly distributed cloud platform. It’s one to watch 😉