EBS Architecture (EDBP)

So, the second post in this series, and now a look at Eucalyptus Elastic Block Storage (EBS) and the Storage Controller (SC), the component which handles it.

What does it do?

The Storage Controller sits at the same layer as the Cluster Controller (CC); each Eucalyptus Availability Zone (AZ) or Cluster will have its own CC and SC.  Within that AZ, the SC provides EBS (Elastic Block Store) functionality via iSCSI (AoE is no longer used from 3.0 onwards) using the Linux target framework (tgt).  If you’re a subscription-paying customer you can also use the SAN adapter to have the SC talk directly to your NetApp, Dell (and, with 3.2, EMC) array, in which case the Node Controllers (NCs) will also talk directly to the storage array.

EBS is core to EC2; it’s a feature which the vast majority of compute users will use. It provides the capability for users to store persistent data (and snapshot that data to Walrus/S3 for long-term storage).  With Eucalyptus 3.0 users can now utilise EBS-backed instances, which are essentially boot-from-iSCSI virtual machines: they use an EBS volume for their root filesystem.

This post is a more in-depth look at best practices around storage controller configuration.

How does it work?

This is pretty well explained on the wiki page here, but I’ll summarise it briefly for the benefit of readers.

An EBS volume is created in a number of steps, starting with a message sent from the client to the CLC (Cloud Controller).  To create the volume, the SC then performs the following steps (a rough command-level sketch follows the list):

  1. The SC creates a volume file in /var/lib/eucalyptus/volumes, named per the volume ID
  2. This file is attached to a free loopback device
  3. A logical volume is created on top of this loopback
  4. An iSCSI target is created, along with a new LUN backed by this logical volume
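For the curious, here’s a minimal shell sketch of roughly what those four steps look like at the command level.  This is illustrative only: the volume ID, size and IQN prefix are assumptions for the example, not the SC’s exact internals.

VOL=vol-ABC123
SIZE_GB=5
# 1. Sparse volume file, named per the volume ID
dd if=/dev/zero of=/var/lib/eucalyptus/volumes/${VOL} bs=1M count=0 seek=$((SIZE_GB * 1024))
# 2. Attach it to a free loopback device
LOOP=$(losetup -f --show /var/lib/eucalyptus/volumes/${VOL})
# 3. Build the LVM stack on top of the loopback
pvcreate ${LOOP}
vgcreate vg-${VOL} ${LOOP}
lvcreate -n lv-${VOL} -l 100%FREE vg-${VOL}
# 4. Export an iSCSI target with a LUN backed by the logical volume
tgtadm --lld iscsi --mode target --op new --tid 1 --targetname iqn.2009-06.com.eucalyptus:store-${VOL}
tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --backing-store /dev/vg-${VOL}/lv-${VOL}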

Then, when a user wishes to attach the EBS volume to a running instance, the NC on which the instance resides logs in to the iSCSI target and passes the block device from the LUN through to the virtual machine, based on an XML definition file, with a “virsh attach-device”.
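To make that concrete, a hedged sketch of the kind of libvirt disk definition involved; the by-path device, IQN and instance ID below are made up for illustration:

<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/disk/by-path/ip-10.107.1.5:3260-iscsi-iqn.2009-06.com.eucalyptus:store-vol-ABC123-lun-1'/>
  <target dev='vdb' bus='virtio'/>
</disk>

# attach the device definition to the running domain
virsh attach-device i-4A6F2B91 /tmp/vol-ABC123.xml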

The SC also facilitates point-in-time snapshots of EBS volumes.  This involves the SC copying the EBS volume to S3 for long-term persistence. From a snapshot, users can register boot-from-EBS (bfEBS) images and create new volumes.

During the snapshot process the SC does the following (again, a rough sketch follows the list):

  1. Creates a new raw disk image file in /var/lib/eucalyptus/volumes
  2. Adds this file as a physical volume (pvcreate)
  3. Extends the volume group of the EBS volume over this new physical volume (vgextend)
  4. Creates a new logical volume and dd’s the contents of the EBS volume into it
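Again, a shell-level sketch of those steps with hypothetical volume and snapshot IDs; the SC drives this internally, so treat it as illustration rather than the exact implementation:

SNAP=snap-DEF456
VOL=vol-ABC123
# 1. New raw disk image file to hold the snapshot
dd if=/dev/zero of=/var/lib/eucalyptus/volumes/${SNAP} bs=1M count=0 seek=5120
SNAPLOOP=$(losetup -f --show /var/lib/eucalyptus/volumes/${SNAP})
# 2 & 3. Add it as a physical volume and extend the volume's VG over it
pvcreate ${SNAPLOOP}
vgextend vg-${VOL} ${SNAPLOOP}
# 4. New logical volume allocated on the snapshot PV; copy the volume into it
lvcreate -n lv-${SNAP} -l 100%PVS vg-${VOL} ${SNAPLOOP}
dd if=/dev/vg-${VOL}/lv-${VOL} of=/dev/vg-${VOL}/lv-${SNAP} bs=1M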

After the copy is complete, the SC then transfers the contents of this snapshot up to S3/Walrus.

How should I architect EBS storage?

It’s quite common for the storage controller (without the SAN adapter) to be the deciding factor in whether a deployment scales out to multiple clusters.  In part this is down to the usage profile of EBS and the architectural design most users follow.  When designing the storage aspect of the cloud there are a number of key areas on which to focus.

Usage Profile

What is the usage profile for the cloud?  Are EBS volumes used at all?  Are EBS volumes a key infrastructure component of workloads? How many users are on the cloud? How many concurrent volume attachments can you expect?

These are all valid questions when designing the storage architecture.  Concurrent and heavy use of EBS volumes may dictate very different backend requirements to a cloud where EBS volumes are only lightly used.  Make sure you test at the planned or envisaged scale.

Component Topology

Always keep the SC on a separate host from the Cluster Controller (CC) or any other Eucalyptus component if you can.  This has the potential to dramatically improve performance for even the smallest deployments.

Disk Subsystem & Filesystem

With the overlay storage backend, volumes and snapshots are stored in a flat directory structure in /var/lib/eucalyptus/volumes.

You need to make sure you choose a fast disk subsystem for the storage controller.  If you’re using local disks, consider RAID levels with some form of striping, such as RAID 1+0, on as many spindles as possible.  If you’re backing /var/lib/eucalyptus/volumes with some form of networked storage, avoid NFS and CIFS; choose iSCSI or Fibre Channel where possible for the best performance under high utilisation across large numbers of active EBS volumes.  For ultimate performance, consider SSDs and SSD arrays.  If you’re planning on using a remote backing store such as iSCSI, consider optimising with jumbo frames and, if the NIC supports it, iSCSI TCP offloading.
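If you do go down the jumbo frames route, it’s a small change on the storage interfaces; a hedged RHEL-style example, where eth2 is an assumed dedicated storage NIC and the switch ports must match the MTU end-to-end:

ip link set dev eth2 mtu 9000
echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-eth2   # persist across reboots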

With 3.2 the DASManager storage backend is now open source.  Unlike the typical overlay backend, the DASManager directly carves up the presented block storage with LVM, circumventing the requirement for loopbacks and thus removing the limit of 256 volumes + snapshots*.  When not using the SAN adapter for NetApp, Dell or EMC arrays, the DASManager should be the preferred choice for performance and additional scalability.

As noted in the wiki referred to previously, the SC has been tested primarily with ext4, although any POSIX-compliant filesystem should be fine.

* In Eucalyptus 3.0 and 3.1 the SC would loopback-mount all volumes, even inactive ones, and so the 256-loopback limit in Linux capped the total number of volumes and snapshots.  With 3.2+ the overlay storage backend ensures that only active volumes are loopback-mounted, so users can now have up to 256 in-use volumes or snapshots.
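If you hit the loopback ceiling, the loop module’s max_loop parameter is the knob to check; a hedged RHEL-era example (max_loop is a genuine module parameter, the value here is illustrative):

echo "options loop max_loop=256" > /etc/modprobe.d/eucalyptus-loop.conf
# reload the loop module, or reboot, for this to take effect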

Network

Don’t cross the streams!

The best possible scenario is to move all EBS traffic onto its own network segment, adding additional interfaces to both your NCs and the SC and then registering the SC on the interface you wish to use for the storage traffic.  This ensures that storage traffic and instance data traffic are segregated.  It should be considered a necessity if you really must have the CC and SC sharing a host.
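By way of illustration, registration in Eucalyptus 3.x is done with euca_conf; something along these lines, where the partition name, IP and component name are entirely hypothetical and the host IP is the one on the storage segment:

euca_conf --register-sc --partition cluster01 --host 10.107.1.5 --component sc-cluster01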

Host System Tuning

The box on which the SC is running should have as much memory as possible, plenty for pagecache usage (write-caching).  If the inbound I/O from initiators cannot be written to disk fast enough, the pagecache is going to be important. Monitor the virtual memory subsystem at all times, using something like Ganglia, Nagios, collectl or collectd.

For RHEL hosts, use tuned profiles to apply some generic tweaks.  For the SC, enterprise-storage is probably the most effective: it adjusts vm.dirty_ratio upwards (the percentage of memory which may be dirty before writing processes are forced to flush to disk), sets the deadline I/O scheduler and enables transparent hugepage support.
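Applying the profile is a one-liner on RHEL 6; a quick example, assuming the tuned package and profile names as shipped with RHEL 6:

yum install -y tuned
tuned-adm profile enterprise-storage
tuned-adm active   # confirm which profile is now applied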

Consider the cache layers in the chain from initiator to the SC; these can give misleading results during testing.  For example, writes from the instance will (by default, unless cache=none is set) hit the host pagecache, followed by the tgt cache and pagecache on the SC, followed by any cache layer backing /var/lib/eucalyptus/volumes.  The instance itself may therefore see very misleading performance figures, particularly for disk writes.  Test the chain from initiators to SC under stress conditions.

iSCSI Tuning

By default, iscsid is quite aggressive with its timeouts.  On a congested network the last thing a user wants is the initiator logging out of a session.  If bfEBS is being used, it’s probably a good idea to back off on some of the timeouts; consider changing the following:

node.conn[0].timeo.noop_out_interval = 0 <- stops the "ping" between initiator and target
node.conn[0].timeo.noop_out_timeout = 0 <- disables timing out an operation when the above "ping" fails
node.session.timeo.replacement_timeout = 7200 <- sets the session replacement timeout high; if the entire root of the OS is running on the iSCSI volume (bfEBS), be lazy about giving up on the connection
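These settings belong in /etc/iscsi/iscsid.conf on the initiator side (the NCs) so that new sessions pick them up.  For node records that have already been discovered, something like the following should update them in place; the target IQN here is hypothetical:

iscsiadm -m node -T iqn.2009-06.com.eucalyptus:store-vol-ABC123 -o update -n node.session.timeo.replacement_timeout -v 7200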

Expectations & Remediation

Maybe this section should come first, but expectations are key here. If you are going to install Eucalyptus on low-end server hardware, with slow disks and a slow network, then don’t expect miracles in terms of concurrent EBS and bfEBS usage.  Performance may suck; YMMV.

On the other hand, perhaps you have optimised as best you can but short of getting new hardware and a new network, there is nothing more you can do to improve EBS performance in your current architecture.  At this point, consider using the following cloud properties to limit EBS volume sizes:

<partition_name>.storage.maxtotalvolumesizeingb <- sets the maximum aggregate size in GB of all EBS volumes in the partition / cluster
<partition_name>.storage.maxvolumesizeingb <- sets the maximum size in GB of a single volume in the partition / cluster
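These are set with euca-modify-property; a hedged example, where the partition name "cluster01" and the limits are purely illustrative values rather than recommendations:

euca-modify-property -p cluster01.storage.maxtotalvolumesizeingb=500
euca-modify-property -p cluster01.storage.maxvolumesizeingb=50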

Furthermore, utilise EIAM (Eucalyptus Identity and Access Management) quotas to mask weak points in your platform architecture by placing restrictions on the size and number of volumes users can have.  This is an easy way to limit “abuse” of the storage platform.  You can find some sample quotas here.

Wrap-up

Following on from my first post on design, the key here is to nail the requirements and usage profile before designing the EBS architecture and always, always monitor.

If you have any comments or requests, please reply to this post and I’ll try to address them 🙂


Eucalyptus Deployment Best Practices (EDBP): Planning

Introduction to EDBP

I had to come up with an acronym.  This is particularly important when you’re in the cloud business; I think it comes somewhere before the business plan and just after the beer.

Did you know that EUCALYPTUS itself is an acronym for Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems?

I’m trying to make a conscious effort to blog more; hopefully I can impart some useful information to readers about deploying Eucalyptus in production.  It’s about time we spoke a little bit more about production usage, since we have plenty of customers running Eucalyptus in this manner. We have a fairly expansive customer list available here, and that certainly doesn’t show all of them.

I’ll be starting to write a series of blog posts about optimising Eucalyptus deployments and the various architectural considerations that those Solution Architects amongst you will want to think about.  These will also make it into solution briefs, presentations and knowledge base articles, I’m sure.

I’m also VERY interested to hear suggested topics, along with any feedback.  Even if this is from a poor reader who’s landed on the wrong page 😉

Onto some high level thoughts on planning your deployment …

Monitoring

Make sure you are monitoring as many infrastructure components in the cloud platform as possible, from switches and storage arrays all the way to the physical servers (CLC, Walrus, SC, CC and NCs) and the instances themselves.  The latter may involve installing agents into the images you serve in your cloud.  This is an area which can sometimes be overlooked; it always surprises me when I come across even big companies that don’t have a readily available monitoring solution.

Consider using tools like Nagios, Zenoss or Ganglia to monitor the platform components and provide alerting based on thresholds.  We have a wiki here with a checklist of useful things to monitor with such a tool, and some folks have already been writing Nagios scripts (and collectd bits).

Understand Usage

Optimising your cloud platform is important, and the level of optimisation required really depends on your usage of the platform.  Is this just an evaluation environment?  Is this a small dev/test platform with very modest capacity and a small number of users?  If so, you probably don’t need to spend too much time optimising; you just want to evaluate the platform or provide some simple instances for Jenkins builds, right?  If you’re looking to deploy for real-world production usage, then make sure you know the following things, or at least have an educated idea about them:

– What workloads are my users running in the cloud?  Inside the instances, what software is being used?

– What are the usage profiles of these users? Does load peak at certain times? Do users use S3 much? Do they use EBS volumes heavily?

If you can’t get answers to the above, use proactive means to closely monitor what users are really doing on your cloud, such as monitoring instance performance.

I won’t go into the obvious, like the various levels of acceptance testing for your deployment, but knowing the workload can be very important when designing and tweaking the solution.  If you can understand how your users will want to use the cloud then you can stay one step ahead of them.  This knowledge can help you identify weak points in your architecture early on, before they impact users.

As an example: perhaps users spend most of their time hitting the ephemeral disk stores (i.e. local disk) hardest on the node controllers (NCs), since the workload is transient.  In that case, you’ll know you might want to upgrade the NC disks to SSDs, move to RAID 1+0 and perhaps implement cgroups to limit the ability of instances to monopolise I/O (sketched below), along with various other tweaks to caches and schedulers.
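As a taster, a hedged cgroup v1 sketch (RHEL 6 style) of capping instance write bandwidth with the blkio controller; the device numbers, the limit and the assumption that instances run as qemu-kvm processes are all illustrative:

mkdir -p /cgroup/blkio/instances
echo "8:0 52428800" > /cgroup/blkio/instances/blkio.throttle.write_bps_device   # ~50 MB/s on /dev/sda
for pid in $(pidof qemu-kvm); do echo ${pid} > /cgroup/blkio/instances/tasks; done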

Capacity Management

The previous paragraph dovetails nicely into this next topic. Capacity management is important, and it doesn’t magically go away with cloud.  With many users the assumption is that they don’t need to worry about this any more; after all, the cloud does everything whilst it brings you ice cold beer by the poolside, right?  Being a cloud admin doesn’t involve much work; according to some you’re already out of a job, so why are you reading this post?

Perhaps you’re now used to running stuff in Amazon’s public cloud, but just because you don’t have to worry too much about infrastructure problems, it doesn’t mean Amazon isn’t always thinking about them.  The same principle applies to a private cloud like Eucalyptus: it orchestrates infrastructure to a certain degree and thus is naturally still dependent on what it has to orchestrate with.  If your storage network is 100Mbps and you have 100 users all trying to run EBS-backed instances (boot from iSCSI, pretty much) concurrently, the performance is going to be pretty diabolical or completely non-existent, depending on your iSCSI timeouts 😉  No amount of cloud is going to get you out of that one.

As the cloud administrator, your traditional sysadmin role might have changed, but now you need to learn the art of masking the underlying infrastructure from the users.  You need to be thinking about Quality of Service (QoS) too, and implementing some method of guaranteeing the user experience.  OK, guarantee might be too strong a word, as it’s arguably impossible, but aim for consistency.

On a practical note, a good way to simulate usage may be to use Eutester to spin up instances and perform various functional tests against the cloud whilst also running load generation; this stresses the services themselves all the way down to the infrastructure layer and can help uncover the relationships between these layers.  Alternatively, perhaps you just use the Jenkins EC2 plugin to run a load of instances and something like Ansible to orchestrate some benchmarking suites whilst you sip coffee and wait for the alarm as your datacentre catches fire.

Going Pro

Make sure that you have paid attention to the above points and explored them in more detail.  Aim to do this during the design and test phase, before you ramp up your cloud usage or take it into some form of production.  Common pitfalls typically sit in the categories above, but there are many more considerations to think about.  Make sure your staff, processes and infrastructure are ready for a cloud platform and that you have the capability to operate it effectively and proactively.

With the next blog I’ll talk a little bit about practical tuning of the storage controller and optimising the default configuration.