EBS Architecture (EDBP)

So, this is the second post in this series, and this time a look at Eucalyptus Elastic Block Storage (EBS) and the Eucalyptus Storage Controller (SC), the component which provides it.

What does it do?

The Storage Controller sits at the same layer as the Cluster Controller (CC); each Eucalyptus Availability Zone (AZ), or Cluster, has its own CC and SC.  Within that AZ, the SC provides EBS (Elastic Block Store) functionality via iSCSI (AoE is no longer used from 3.0 onwards) using the Linux Target Framework (tgt).  If you’re a subscription-paying customer you can also use the SAN adapter to have the Eucalyptus SC talk directly to your NetApp, Dell (and, with 3.2, EMC) array, in which case the Node Controllers (NCs) will also talk directly to your storage array.

EBS is core to EC2; it’s a feature which the vast majority of compute users will use. It provides the capability for users to store persistent data (and snapshot that data to Walrus/S3 for long-term storage).  With Eucalyptus 3.0 users can now utilise EBS-backed instances, which are essentially boot-from-iSCSI virtual machines: they use an EBS volume for their root filesystem.

This post is a more in-depth look at best practices around storage controller configuration.

How does it work?

This is pretty well explained on the wiki page here, but I’ll summarise it briefly for the benefit of readers.

An EBS volume is created in a number of steps, starting with a message sent from the client to the CLC (Cloud Controller).  To create the EBS volume, the SC then performs the following steps (roughly as sketched in the commands below):

  1. SC creates a volume file in /var/lib/eucalyptus/volumes named per the volume ID
  2. This file is then attached to a free loopback device
  3. A logical volume is created on top of this loopback
  4. An iSCSI target is created along with a new LUN with the backing store of this logical volume
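
With the default overlay backend this all boils down to standard Linux plumbing.  The sketch below is roughly what the equivalent commands would look like for a hypothetical 10GB volume; all of the names are made up and the SC drives tgt and LVM itself rather than running a shell script, so treat this as an illustration only.

  # sparse 10GB backing file named after the volume ID
  dd if=/dev/zero of=/var/lib/eucalyptus/volumes/vol-12345678 bs=1M count=0 seek=10240
  # attach it to the first free loopback device
  losetup -f --show /var/lib/eucalyptus/volumes/vol-12345678   # e.g. returns /dev/loop0
  # build a logical volume on top of the loopback
  pvcreate /dev/loop0
  vgcreate vg-vol-12345678 /dev/loop0
  lvcreate -n lv-vol-12345678 -l 100%FREE vg-vol-12345678
  # export the LV as a LUN via tgt
  tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2009-06.com.eucalyptus:vol-12345678
  tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/vg-vol-12345678/lv-vol-12345678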

Then, when a user wishes to attach this EBS volume to a running instance, the NC on which the instance resides will attempt to log in to this iSCSI target and pass the block device from the LUN through to the virtual machine, based on an XML definition file, with a “virsh attach-device”.
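
On the NC side the moving parts are the standard open-iscsi initiator tools plus libvirt.  Something along these lines, where the IP, IQN, device path and instance name are illustrative only:

  # discover and log in to the target exported by the SC
  iscsiadm -m discovery -t sendtargets -p 192.168.10.2:3260
  iscsiadm -m node -T iqn.2009-06.com.eucalyptus:vol-12345678 -p 192.168.10.2:3260 --login

  # attach.xml - hands the resulting block device through to the guest as /dev/vdb
  <disk type='block' device='disk'>
    <driver name='qemu' type='raw'/>
    <source dev='/dev/disk/by-path/ip-192.168.10.2:3260-iscsi-iqn.2009-06.com.eucalyptus:vol-12345678-lun-1'/>
    <target dev='vdb' bus='virtio'/>
  </disk>

  virsh attach-device i-ABCD1234 attach.xml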

The SC also facilitates point-in-time snapshots of EBS volumes.  This involves the SC copying the EBS volume to S3 for long-term persistence. From this snapshot, users can register boot-from-EBS  (bfEBS) images and create new volumes.

During the snapshot process the SC does the following:

  1. Creates a new raw disk image file in /var/lib/eucalyptus/volumes
  2. Adds this as a physical volume to the volume group of the EBS volume
  3. Extends the volume group over this new physical volume
  4. Creates a new logical volume and dd's the contents of the EBS volume into this LV

After the copy is complete, the SC will then transfer the contents of the EBS volume up to S3/Walrus.
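
In rough command-line terms the snapshot steps above look something like the following; again the names are invented and the SC does the real work internally:

  # new sparse backing file for the snapshot, loopback-mounted just like a volume
  dd if=/dev/zero of=/var/lib/eucalyptus/volumes/snap-87654321 bs=1M count=0 seek=10240
  losetup -f --show /var/lib/eucalyptus/volumes/snap-87654321   # e.g. returns /dev/loop1
  # grow the volume's VG over the new physical volume
  pvcreate /dev/loop1
  vgextend vg-vol-12345678 /dev/loop1
  # carve out an LV for the snapshot and copy the volume's contents into it
  lvcreate -n lv-snap-87654321 -l 100%FREE vg-vol-12345678
  dd if=/dev/vg-vol-12345678/lv-vol-12345678 of=/dev/vg-vol-12345678/lv-snap-87654321 bs=1M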

How should I architect EBS storage?

It’s quite common for the storage controller (without the SAN adapter) to be the primary factor in deciding whether to scale out with multiple clusters in a deployment.  In part this is down to the usage profile of EBS and the architectural design most users follow.  When designing the storage aspect of the cloud there are a number of key areas on which to focus.

Usage Profile

What is the usage profile for the cloud?  Are EBS volumes used at all?  Are EBS volumes a key infrastructure component of workloads? How many users are on the cloud? How many concurrent volume attachments can you expect?

These are all valid questions when designing the storage architecture.  Concurrent and heavy use of EBS volumes may dictate very different backend requirements to a cloud where EBS volumes are only lightly used.  Make sure you test at the planned or envisaged scale.
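
Testing doesn’t need to be elaborate.  Something like fio, run against the volume store on the SC and against attached volumes from inside instances, gives a reasonable approximation of concurrent EBS load; the job parameters below are arbitrary, so scale them to your expected workload:

  # simulate a handful of busy EBS volumes doing small random writes
  fio --name=ebs-sim --directory=/var/lib/eucalyptus/volumes \
      --rw=randwrite --bs=4k --size=2G --numjobs=8 \
      --ioengine=libaio --direct=1 --group_reporting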

Component Topology

Always keep the SC on a separate host from the Cluster Controller (CC) or any other Eucalyptus component if you can.  This has the potential to dramatically improve performance for even the smallest deployments.

Disk Subsystem & Filesystem

With the overlay storage backend, volumes and snapshots are stored in a flat directory structure in /var/lib/eucalyptus/volumes.

You need to make sure you choose a fast disk subsystem for the storage controller.  If you’re using local disks, consider RAID levels with some form of striping, such as RAID 1+0, across as many spindles as possible.  If you’re backing /var/lib/eucalyptus/volumes with some form of networked storage, avoid NFS and CIFS.  Choose iSCSI or Fibre Channel where possible for best performance under high utilisation across large numbers of active EBS volumes.  For ultimate performance, consider SSDs and SSD arrays.  If you’re planning on using a remote backing store, like iSCSI, consider optimising with jumbo frames and iSCSI TCP offloading on the NIC, if supported.
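
For the jumbo frames piece, assuming a dedicated storage-facing NIC (eth1 here is an assumption) and a switch configured to match, the sketch is simply:

  # bump the MTU on the storage interface (on RHEL add MTU=9000 to ifcfg-eth1 to persist it)
  ip link set dev eth1 mtu 9000
  # verify end-to-end with a non-fragmenting ping from the NC to the SC
  ping -M do -s 8972 192.168.10.2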

With 3.2 the DASManager storage backend is now open source.  Unlike the typical overlay backend, the DASManager directly carves up the presented block storage with LVM, circumventing the requirement for loopback devices and thus removing the limit of 256 volumes + snapshots*.  When not using the SAN adapter for NetApp, Dell or EMC arrays, the DASManager should be the preferred choice for performance and additional scalability.
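
The backend is selected via cloud properties.  From memory the relevant 3.2 properties look like the below, but do verify the exact names with euca-describe-properties on your version before relying on them:

  # illustrative only - check "euca-describe-properties | grep storage" for the real names
  euca-modify-property -p mycluster.storage.blockstoragemanager=das
  euca-modify-property -p mycluster.storage.dasdevice=/dev/sdb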

As noted in the wiki page referred to previously, the SC has been tested primarily with ext4, although any POSIX-compliant filesystem should be fine.

* In Eucalyptus 3.0 and 3.1 the SC would loopback-mount every volume, active or not, and so a limit of 256 volumes + snapshots was imposed (Linux provides 256 loopback devices by default).  With 3.2+ the overlay storage backend ensures that only active volumes are loopback-mounted, so users can now have up to 256 in-use volumes or snapshots.

Network

Don’t cross the streams!

The best possible scenario is to move all EBS traffic onto its own network segment, adding additional interfaces to both your NCs and the SC and then registering the SC on the interface you wish to use for the storage traffic.  This will ensure that storage and data traffic are segregated.  Consider it a necessity if you really must have the CC and SC sharing a host.
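
As an example, with an SC that has a second interface on a dedicated storage VLAN, you’d register the SC against that interface’s address rather than the management address.  The euca_conf flags have shifted a little between 3.x releases, so treat this as illustrative and check euca_conf --help:

  # register the SC using its storage-network IP (partition, IP and component name are made up)
  euca_conf --register-sc --partition cluster01 --host 192.168.10.2 --component sc01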

Host System Tuning

The box on which the SC is running should have as much memory as possible, with plenty spare for pagecache usage (write caching).  If the inbound I/O from initiators cannot be written to disk fast enough, the pagecache is going to be important. Monitor the virtual memory subsystem at all times, using something like Ganglia, Nagios, collectl or collectd.
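
You don’t need anything fancy to spot trouble while volumes are under load; even the basics will tell you whether the pagecache is filling faster than the disks can drain it:

  vmstat 5                                  # watch the bi/bo, wa and free columns
  grep -E 'Dirty|Writeback' /proc/meminfo   # data sitting in pagecache waiting to be flushed
  iostat -xm 5                              # per-device utilisation and await times (sysstat package)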

For RHEL hosts, use tuned profiles to apply some generic tweaks.  For the SC, enterprise-storage is probably the most effective; it adjusts vm.dirty_ratio upwards (the percentage of memory that may be dirty before processes doing writes are forced to flush to disk), sets the deadline I/O scheduler and enables transparent hugepage support.
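
On RHEL 6 applying the profile is a one-liner; the sysctl line just shows how you’d nudge vm.dirty_ratio by hand if you’re not using tuned (the value is illustrative, not a recommendation):

  yum install -y tuned
  tuned-adm profile enterprise-storage
  # or tweak individual knobs yourself, e.g.
  sysctl -w vm.dirty_ratio=40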

Consider the cache layers in the chain from initiator to the SC.  These can give misleading results during testing.  For example, writes from the instance will (by default, unless cache=none is set) hit the host pagecache, followed by the tgt cache on the SC as well as the SC’s pagecache, followed by any cache layer for the backing of /var/lib/eucalyptus/volumes. So the instance itself may see very misleading performance figures, particularly for disk writes.  Test the chain from initiators to SC under stress conditions.
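
If you want to take the NC’s pagecache out of the equation when benchmarking from inside an instance, the libvirt disk definition is where that’s controlled; an illustrative snippet (not necessarily what Eucalyptus generates verbatim) would be:

  <!-- bypass the host pagecache for this disk -->
  <driver name='qemu' type='raw' cache='none'/>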

iSCSI Tuning

By default, iscsid is quite aggressive with its timeouts.  On a congested network the last thing a user wants is the initiator logging out of a session.  If bfEBS is being used, it’s probably a good idea to back off on some of the timeouts; consider changing the following:

node.conn[0].timeo.noop_out_interval = 0 <- this stops the “ping” between initiator and target
node.conn[0].timeo.noop_out_timeout = 0 <- this disables the action of timing out an operation from the above “ping”
node.session.timeo.replacement_timeout = 7200 <- this sets the connection timeout high; if you’re running the entire root of the OS on an iSCSI volume (bfEBS), be lazy about giving up on a session
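
These are node settings in /etc/iscsi/iscsid.conf on the NCs and apply to targets discovered after the change.  For node records that already exist you can update them in place with iscsiadm, for example (the IQN is illustrative):

  # update an existing node record, then log out/in for it to take effect
  iscsiadm -m node -T iqn.2009-06.com.eucalyptus:vol-12345678 \
           -o update -n node.session.timeo.replacement_timeout -v 7200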

Expectations & Remediation

Maybe this section should have come first, but expectations are key here. If you are going to install Eucalyptus on low-end server hardware, with slow disks and network, then don’t expect miracles in terms of concurrent EBS and bfEBS usage.  Performance may suck, YMMV.

On the other hand, perhaps you have optimised as best you can but short of getting new hardware and a new network, there is nothing more you can do to improve EBS performance in your current architecture.  At this point, consider using the following cloud properties to limit EBS volume sizes:

<partition_name>.storage.maxtotalvolumesizeingb <- sets the maximum total size (in GB) of all EBS volumes in the partition / cluster
<partition_name>.storage.maxvolumesizeingb <- sets the maximum size (in GB) of a single volume in the partition / cluster
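
For example, to cap individual volumes at 50GB and the combined total at 500GB in a partition named cluster01 (the name and numbers are purely illustrative):

  euca-modify-property -p cluster01.storage.maxvolumesizeingb=50
  euca-modify-property -p cluster01.storage.maxtotalvolumesizeingb=500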

Furthermore, utilise EIAM (Eucalyptus Identity and Access Management) quotas to mask weak points in your platform architecture by placing restrictions on the sizes and numbers of volumes users can have.  This is an easy way to limit “abuse” of the storage platform.  You can find some sample quotas here.

Wrap-up

Following on from my first post on design, the key here is to nail the requirements and usage profile before designing the EBS architecture and always, always monitor.

If you have any comments or requests, please reply to this post and I’ll try to address them 🙂

HA iSCSI and the Storage Controller

HA with Open-iSCSI:

Further to my general musings about HA, good chum Harold Spencer and I have been working on a way for users who don’t have the SAN adapter to achieve high availability for the Storage Controller component of Eucalyptus.  The Storage Controller is the Eucalyptus component which handles EBS (Elastic Block Store) volumes for instances.

Let me give some background here…

With 3.0 we’re still following our old business model; this is effectively an Enterprise Edition for us.  This is all changing with the upcoming 3.1 release, as our good friend and colleague Greg outlines here.  This is a really big deal for Eucalyptus and also very exciting: we’re going back to our unified code-base.  The only things which won’t be open are a couple of proprietary subscription-only modules; one for VMware and one for interfacing with enterprise-class storage arrays (aka SANs). In regards to HA, those using Eucalyptus without a subscription can use HA for the Cloud Controller components (read my last blog entry on why that’s important) but won’t currently be able to use HA with the Storage Controller, since this can only be achieved with an enterprise-class storage array and the subscription-only SAN adapter.  By default, without the SAN adapter, Eucalyptus uses the Linux SCSI target daemon (tgt) to handle EBS volumes.

We still think our users will want SC HA using open-iSCSI, and we want to ensure that users who don’t have a SAN (or don’t need the performance one brings), or who don’t pay our salaries, can still have some kind of high-availability experience at the storage level, using some of the service and infrastructure HA platforms out there for Linux.  After all, running an iSCSI target daemon as a resource on top of a Linux clustering solution is nothing new and of course works very well.  In our testing we’ve been using pacemaker + corosync and running tgtd as a resource on top of a DRBD-backed logical volume across a two-node cluster.  It just works: you can fail over LUNs whilst clients are writing data to them, with only a brief pause at the client before disk operations continue on the migrated LUN.  Data integrity is handled by the proper failover of resources and the excellent DRBD solution, courtesy of the great guys at Linbit.
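
To give a flavour of what’s involved, a crm configuration along these lines is the general shape of it.  The resource names, DRBD resource r0, IP and IQN are all made up, and the parameters will need adapting to your environment; treat it as a sketch of the approach rather than a drop-in config:

  crm configure primitive p_drbd_ebs ocf:linbit:drbd params drbd_resource=r0 op monitor interval=30s
  crm configure ms ms_drbd_ebs p_drbd_ebs meta master-max=1 clone-max=2 notify=true
  crm configure primitive p_ip_ebs ocf:heartbeat:IPaddr2 params ip=192.168.10.10 cidr_netmask=24
  crm configure primitive p_target ocf:heartbeat:iSCSITarget params implementation=tgt iqn=iqn.2012-06.com.example:ebs.ha tid=1
  crm configure group g_ebs p_ip_ebs p_target
  crm configure colocation col_ebs_on_drbd inf: g_ebs ms_drbd_ebs:Master
  crm configure order ord_drbd_before_ebs inf: ms_drbd_ebs:promote g_ebs:start

Add an ocf:heartbeat:iSCSILogicalUnit resource backed by the DRBD device to the group for the LUN itself, and the failover behaviour described above falls out of the colocation and ordering constraints.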

You’ll want to check out Harold’s blog here for some more details on the work we’ve been doing to try to integrate this.