EBS Architecture (EDBP)


This is the second post in the series, and this time we take a look at Eucalyptus Elastic Block Storage (EBS) and the Eucalyptus Storage Controller (SC), the component which provides it.

What does it do?

The Storage Controller sits at the same layer as the Cluster Controller (CC); each Eucalyptus Availability Zone (AZ), or Cluster, has its own CC and SC.  Within that AZ, the SC provides Elastic Block Store (EBS) functionality over iSCSI (AoE is no longer used from 3.0 onwards) using the Linux Target Framework (tgt).  If you’re a subscription-paying customer you can also use the SAN adapter to have the Eucalyptus SC talk directly to your NetApp, Dell or (with 3.2) EMC array, in which case the Node Controllers (NCs) also talk directly to your storage array.

EBS is core to EC2; it’s a feature which the vast majority of compute users will use. It provides the capability for users to store persistent data (and snapshot that data to Walrus/S3 for long-term storage).  With Eucalyptus 3.0 users can now utilise EBS-backed instances, which are essentially boot-from-iSCSI virtual machines that use an EBS volume for their root filesystem.

This post is a more in-depth look at best practices around storage controller configuration.

How does it work?

This is pretty well explained on the wiki page here, but I’ll summarise it briefly for the benefit of readers.

Creating an EBS volume involves a number of steps, starting with a request sent from the client to the Cloud Controller (CLC).  To create the EBS volume, the SC (with the default overlay backend) then performs the following steps, approximated in the sketch after this list:

  1. SC creates a volume file in /var/lib/eucalyptus/volumes named per the volume ID
  2. This file is then attached to a free loopback device
  3. A logical volume is created on top of this loopback
  4. An iSCSI target is created along with a new LUN with the backing store of this logical volume

Then, when a user wishes to attach this EBS volume to a running instance, the NC on which the instance resides logs in to the iSCSI target and passes the block device from the LUN through to the virtual machine, using an XML device definition and a “virsh attach-device”.
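
For illustration, the device XML handed to “virsh attach-device” looks something like the snippet below.  The device path, target device and domain name are invented for this example; the real definition file is generated by the NC:

  <disk type='block' device='disk'>
    <driver name='qemu' type='raw'/>
    <source dev='/dev/disk/by-path/ip-10.0.1.5:3260-iscsi-iqn.2009-06.com.example:vol-ABC123-lun-1'/>
    <target dev='vdb' bus='virtio'/>
  </disk>

  virsh attach-device <instance-domain> attach-vol-ABC123.xml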

The SC also facilitates point-in-time snapshots of EBS volumes.  This involves the SC copying the EBS volume to S3 for long-term persistence. From this snapshot, users can register boot-from-EBS (bfEBS) images and create new volumes.

During the snapshot process the SC does the following (sketched below):

  1. Creates a new raw disk image file in /var/lib/eucalyptus/volumes
  2. Adds this as a physical volume to the volume group of the EBS volume
  3. Extends the volume group over this new physical volume
  4. Creates a new logical volume and dd’s the contents of the EBS volume into this LV
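
Again purely as an illustrative sketch (snapshot ID, device names and sizes invented, and run internally by the SC), those steps map roughly onto the following commands:

  # 1. new raw disk image for the snapshot
  dd if=/dev/zero of=/var/lib/eucalyptus/volumes/snap-XYZ789 bs=1M count=0 seek=10240
  losetup /dev/loop1 /var/lib/eucalyptus/volumes/snap-XYZ789
  # 2 + 3. add it as a physical volume and extend the volume's VG over it
  pvcreate /dev/loop1
  vgextend vg-vol-ABC123 /dev/loop1
  # 4. create a snapshot LV on the new PV and copy the volume contents into it
  lvcreate -n lv-snap-XYZ789 -l 100%PVS vg-vol-ABC123 /dev/loop1
  dd if=/dev/vg-vol-ABC123/lv-vol-ABC123 of=/dev/vg-vol-ABC123/lv-snap-XYZ789 bs=1M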

After the copy is complete, the SC will then transfer the contents of the EBS volume up to S3/Walrus.

How should I architect EBS storage?

It’s quite common for the storage controller (without the SAN adapter) to be the primary factor in deciding whether to scale a deployment out to multiple clusters.  In part this is down to the usage profile of EBS and the architectural design most users follow.  When designing the storage aspect of the cloud there are a number of key areas on which to focus.

Usage Profile

What is the usage profile for the cloud?  Are EBS volumes used at all?  Are EBS volumes a key infrastructure component of workloads? How many users are on the cloud? How many concurrent volume attachments can you expect?

These are all valid questions when designing the storage architecture.  Concurrent and heavy use of EBS volumes may dictate very different backend requirements from those of a cloud where EBS volumes are only lightly used.  Make sure you test at the planned or envisaged scale.

Component Topology

Always keep the SC on a separate host from the Cluster Controller (CC) or any other Eucalyptus component if you can.  This has the potential to dramatically improve performance for even the smallest deployments.

Disk Subsystem & Filesystem

With the overlay storage backend, volumes and snapshots are stored in a flat directory structure in /var/lib/eucalyptus/volumes.

You need to make sure you choose a fast disk subsystem for the storage controller.  If you’re using local disks, consider RAID levels with some form of striping, such as RAID 1+0, across as many spindles as possible.  If you’re backing /var/lib/eucalyptus/volumes with some form of networked storage, avoid NFS and CIFS; choose iSCSI or Fibre Channel where possible for the best performance under high utilisation across large numbers of active EBS volumes.  For ultimate performance, consider SSDs and SSD arrays.  If you’re planning on using a remote backing store such as iSCSI, consider optimising with jumbo frames and iSCSI TCP offload on the NIC, if supported.
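
If you do go down the dedicated iSCSI network route, jumbo frames are a quick win, provided every hop (NIC, switch, target) supports an MTU of 9000.  A minimal example on RHEL, assuming eth2 is the storage interface (interface name and target IP are placeholders):

  # immediate change
  ip link set dev eth2 mtu 9000
  # persist across reboots
  echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-eth2
  # verify end-to-end with a non-fragmenting ping (8972 = 9000 minus 28 bytes of IP/ICMP headers)
  ping -M do -s 8972 <iscsi-target-ip>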

With 3.2 the DASManager storage backend is now open source.  Unlike the typical overlay backend, the DASManager directly carves up the presented block storage with LVM, circumventing the need for loopback devices and thus removing the ~256 volume + snapshot limit*.  When not using the SAN adapter for NetApp, Dell or EMC arrays, the DASManager should be the preferred choice for performance and additional scalability.

As noted in the wiki referred to previously, the SC has been tested primarily with ext4, although any POSIX-compliant filesystem should be fine.

* In Eucalyptus 3.0 and 3.1 the SC kept every volume attached to a loopback device, even inactive ones, so the Linux limit of 256 loopback devices capped the total number of volumes and snapshots.  With 3.2+ the overlay storage backend only attaches loopbacks for active volumes, so users can now have up to 256 in-use volumes or snapshots.
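
You can check how close you are to that ceiling on the SC with losetup; on kernels where the loop driver's device count is fixed at module load time, it can be raised via the max_loop parameter (the modprobe.d file name below is just an example, and whether this applies depends on how loop is built on your kernel):

  # how many loopback devices are currently attached
  losetup -a | wc -l
  # raise the ceiling for the loop module (takes effect after module reload or reboot)
  echo "options loop max_loop=256" > /etc/modprobe.d/loop.conf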

Network

Don’t cross the streams!

The best possible scenario is to move all EBS traffic onto its own network segment, adding additional interfaces to both your NCs and the SC and then registering the SC on the interface you wish to use for the storage traffic.  This ensures that storage traffic and instance data traffic are segregated.  It should be considered a necessity if you really must have the CC and SC sharing a host.

Host System Tuning

The box on which the SC is running should have as much memory as possible, with plenty left over for pagecache usage (write caching).  If the inbound I/O from initiators cannot be written to disk fast enough, the pagecache is going to be important. Monitor the virtual memory subsystem at all times, using something like Ganglia, Nagios, collectl or collectd.
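
As a quick spot check alongside the monitoring systems mentioned above, watching writeback pressure from the shell is often enough to tell whether the SC is keeping up (the 5-second interval is arbitrary):

  # the 'wa' (I/O wait) and 'bo' (blocks out) columns are the ones to watch
  vmstat 5
  # amount of dirty pagecache data still waiting to be flushed to disk
  grep -E 'Dirty|Writeback' /proc/meminfo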

For RHEL hosts, use tuned profiles to apply some generic tweaks.  For the SC, enterprise-storage is probably the most effective; it raises vm.dirty_ratio (the percentage of memory that may fill with dirty pages before processes generating writes are forced to flush them to disk), sets the deadline I/O scheduler and enables transparent hugepage support.
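
Applying the profile on RHEL 6 is a one-liner (package and profile names as shipped with RHEL; check tuned-adm list on your release):

  yum install -y tuned
  tuned-adm profile enterprise-storage
  tuned-adm active   # confirm the profile took effect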

Consider the cache layers in the chain from initiator to the SC; these can give misleading results during testing.  For example, writes from the instance will (by default, unless cache=none) hit the host pagecache, followed by the tgt cache and pagecache on the SC, followed by any cache layer backing /var/lib/eucalyptus/volumes. The instance itself may therefore see very misleading performance figures, particularly for disk writes.  Test the chain from initiators to SC under stress conditions.

iSCSI Tuning

By default iscsid is quite aggressive with its timeouts.  On a congested network the last thing a user wants is the initiator logging out of a session.  If bfEBS is being used, it’s probably a good idea to back off on some of the timeouts; consider changing the following:

node.conn[0].timeo.noop_out_interval = 0   # stops the "ping" between initiator and target
node.conn[0].timeo.noop_out_timeout = 0    # disables timing out an operation based on the above "ping"
node.session.timeo.replacement_timeout = 7200   # sets the connection replacement timeout high; if the entire root of the OS is on the iSCSI volume (bfEBS), be lazy about giving up on the session
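
These settings live in /etc/iscsi/iscsid.conf on the node controllers and apply to newly discovered targets; for node records that already exist you can update them with iscsiadm, for example (the IQN is a placeholder):

  iscsiadm -m node -T iqn.2009-06.com.example:vol-ABC123 \
    -o update -n node.session.timeo.replacement_timeout -v 7200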

Expectations & Remediation

Maybe this section should come first, but expectations are key here. If you are going to install Eucalyptus on low-end server hardware with slow disks and a slow network, then don’t expect miracles in terms of concurrent EBS and bfEBS usage.  Performance may suck; YMMV.

On the other hand, perhaps you have optimised as best you can but short of getting new hardware and a new network, there is nothing more you can do to improve EBS performance in your current architecture.  At this point, consider using the following cloud properties to limit EBS volume sizes:

<partition_name>.storage.maxtotalvolumesizeingb   # sets the maximum total EBS volume size in the partition / cluster
<partition_name>.storage.maxvolumesizeingb   # sets the maximum per-volume size in the partition / cluster
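
These are normal cloud properties, so they are set with euca-modify-property from wherever you have admin credentials sourced; for example, assuming a partition named cluster01 (the sizes here are arbitrary):

  euca-modify-property -p cluster01.storage.maxvolumesizeingb=50
  euca-modify-property -p cluster01.storage.maxtotalvolumesizeingb=500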

Furthermore, utilise EIAM (Eucalyptus Identity and Access Management) quotas to mask weak points in your platform architecture by placing restrictions on the size and number of volumes users can have.  This is an easy way to limit “abuse” of the storage platform.  You can find some sample quotas here.

Wrap-up

Following on from my first post on design, the key here is to nail the requirements and usage profile before designing the EBS architecture and always, always monitor.

If you have any comments or requests, please reply to this post and I’ll try to address them :)
