Introduction to EDBP
I had to come up with an acronym. This is particularly important when you’re in the cloud business; I think it comes somewhere before the business plan and just after the beer.
Did you know that EUCALYPTUS itself is an acronym for Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems?
I’m trying to make a conscious effort to blog more, and hopefully I can impart some useful information to readers about deploying Eucalyptus in production. It’s about time we spoke a little bit more on production usage since we have plenty of customers running Eucalyptus in this manner. We have a fairly expansive customer list available here and that certainly doesn’t show all of them.
I’ll be starting to write a series of blog posts about optimising Eucalyptus deployments and various architectural considerations that those Solution Architects amongst you will want to think about. These will also make it into solution briefs, presentations and knowledge base articles, I’m sure.
I’m also VERY interested to hear of suggested topics too, along with any feedback. Even if this is from a poor reader who’s landed on the wrong page 😉
Onto some high level thoughts on planning your deployment …
Make sure you are monitoring as many infrastructure components in the cloud platform as possible, from switches and storage arrays all the way to physical servers (CLC, Walrus, SC, CC and NCs) and the instances themselves. The latter may involve installing agents into the images you serve in your cloud. This is another area which can sometimes be overlooked; it always surprises me when I come across even big companies that don’t have a readily available monitoring solution.
Consider using tools like Nagios, Zenoss or Ganglia to monitor the platform components and provide alerting based on thresholds. We have a wiki here with a checklist of useful things to monitor with such a tool, and some folks have already been writing Nagios scripts (and collectd bits).
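To make "alerting based on thresholds" a bit more concrete, here is a minimal sketch of a Nagios-style check written in Python. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plugin convention; the function names, the choice of disk usage as the metric and the 80/90 percent thresholds are my own assumptions for illustration, not anything from the wiki checklist.

```python
import shutil

# Nagios plugin convention: the process exit code carries the status.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage onto a status code (thresholds are illustrative)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status, message) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    percent = 100.0 * usage.used / usage.total
    status = classify(percent, warn, crit)
    return status, f"DISK {LABELS[status]} - {path} is {percent:.1f}% used"


def main():
    """Print one status line and return the exit code for the monitoring tool."""
    status, message = check_disk("/")
    print(message)
    return status
```

A real plugin would finish with `sys.exit(main())` so that the monitoring tool sees the status as the exit code. The same shape works for anything you can turn into a number and compare against a warning and a critical threshold.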
Optimising your cloud platform is important and the level of optimisation required really depends on your usage of the platform. Is this just an evaluation environment? Is this a small dev/test platform with very modest capacity and with a small number of users? If so, you probably don’t need to spend too much time optimising; you just want to evaluate the platform or provide some simple instances for Jenkins builds, right? If you’re looking to deploy for real-world production usage, then make sure you know the following things, or at least have an educated idea on them:
– What workloads are my users running in the cloud? Inside the instances, what software is being used?
– What are the usage profiles of these users? Does load peak at certain times? Do users use S3 much? Do they use EBS volumes heavily?
If you can’t get the above, use proactive means to closely monitor what users are really doing on your cloud, such as monitoring instance performance.
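As a rough illustration of turning monitoring data into a usage profile, the sketch below averages a metric per hour of the day and picks out the peak. The sample data is made up, and the function names are hypothetical; in practice the samples would come from whatever your monitoring tool records.

```python
from collections import defaultdict
from datetime import datetime


def hourly_profile(samples):
    """Average a metric per hour of day.

    `samples` is an iterable of (datetime, value) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for stamp, value in samples:
        totals[stamp.hour] += value
        counts[stamp.hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(totals)}


def peak_hour(profile):
    """Return the hour of day with the highest average value."""
    return max(profile, key=profile.get)


# Made-up sample data: (timestamp, number of running instances).
samples = [
    (datetime(2013, 5, 6, 9, 0), 12),
    (datetime(2013, 5, 6, 14, 0), 40),
    (datetime(2013, 5, 7, 9, 0), 16),
    (datetime(2013, 5, 7, 14, 0), 44),
    (datetime(2013, 5, 7, 23, 0), 3),
]

profile = hourly_profile(samples)
print(profile)                        # {9: 14.0, 14: 42.0, 23: 3.0}
print("peak hour:", peak_hour(profile))  # peak hour: 14
```

The same grouping works for EBS I/O, S3 requests or anything else you sample over time, which is usually enough to answer "does load peak at certain times?".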
I won’t go into the obvious, like various levels of acceptance testing for your deployment, but knowing the workload can be very important when designing and tweaking the solution. If you can understand how your users will want to use the cloud then you can stay one step ahead of them. This knowledge can help you identify weak points in your architecture early on and before they impact users.
As an example, perhaps users spend most of their time hitting the ephemeral disk stores (i.e. local disk) on the node controllers (NCs) hardest, since the workload is transient. In that case, you’ll know that you might want to upgrade the NC disks to SSDs, move to RAID1+0 and perhaps implement cgroups to limit the ability of instances to monopolise I/O, along with various other tweaks to caches and schedulers.
The previous paragraph dovetails nicely into this next topic. Capacity management is important and it doesn’t magically go away with cloud. Many users assume that they don’t need to worry about this anymore. After all, the cloud does everything whilst it brings you ice cold beer by the poolside, right? Being a cloud admin doesn’t involve much work; according to some, I think, you’re already out of a job, so why are you reading this post?
Perhaps you’re now used to running stuff in Amazon’s public cloud, but just because you don’t have to worry too much about infrastructure problems doesn’t mean Amazon isn’t always thinking about them. The same principle applies to a private cloud like Eucalyptus: it orchestrates infrastructure to a certain degree and thus is naturally still dependent on what it has to orchestrate. If your storage network is 100 Mbps and you have 100 users all trying to run EBS-backed instances (pretty much boot from iSCSI) concurrently, the performance is going to be pretty diabolical or completely non-existent depending on your iSCSI timeouts 😉 No amount of cloud is going to get you out of that one.
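The back-of-envelope arithmetic behind that example is worth spelling out. This sketch assumes the link is shared perfectly evenly and ignores protocol overhead, so real numbers would be worse; the 1000 MB volume size is a hypothetical figure picked purely for illustration.

```python
def per_user_throughput_mb_s(link_mbps, users):
    """Even share of a link in megabytes per second (8 bits per byte)."""
    return link_mbps / 8 / users


def transfer_seconds(size_mb, throughput_mb_s):
    """Time to move `size_mb` megabytes at the given throughput."""
    return size_mb / throughput_mb_s


# 100 Mbps storage network shared by 100 concurrent users.
share = per_user_throughput_mb_s(100, 100)
print(share)    # 0.125 MB/s per user

# Hypothetical 1000 MB root volume read over that share.
seconds = transfer_seconds(1000, share)
print(seconds)  # 8000.0 seconds, a little over two hours
```

At an eighth of a megabyte per second each, an instance booting over iSCSI is likely to hit a timeout long before it finishes reading its root volume.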
As the cloud administrator, your traditional sysadmin role might have changed but now you need to learn the art of masking underlying infrastructure from the users. You need to be thinking about Quality-of-Service (QoS) too and implementing some method of guaranteeing the user experience. Ok, guarantee might be too strong a word as it’s arguably impossible, but aim for consistency.
On a practical note, a good way to simulate usage may be to use Eutester to spin up instances and perform various functional tests against the cloud whilst also running load generation. This would stress everything from the services themselves all the way down to the infrastructure layer, and it can help uncover the relationships between these layers. Alternatively, perhaps you just use the Jenkins EC2 plugin to run a load of instances and something like Ansible to orchestrate some benchmarking suites whilst you sip coffee and wait for the alarm as your datacentre catches fire.
Make sure that you have paid attention to the above points and explored them in more detail. Aim to do this during the design and test phase, before you ramp up your cloud usage or take it into some form of production. Common pitfalls typically sit in the categories above but there are many more considerations to think of. Make sure your staff, processes and infrastructure are ready for a cloud platform and that you have the capability to operate it effectively and proactively.
With the next blog I’ll talk a little bit about practical tuning of the storage controller and optimising the default configuration.