After the five-minute video introduction in part 1, let's now take a closer look at the ScaleIO architecture. Your data center contains lots of servers, many of which contain local disks. From a capital perspective, you have already paid for this local storage, which is spinning continuously and consuming precious power and cooling resources…
ScaleIO is a software-only solution that uses existing hosts’ local disks and LAN to realize a virtual SAN that has all the benefits of external storage—but at a fraction of the cost and the complexity. ScaleIO turns existing local internal storage into internal shared block storage that is comparable to or better than the more expensive external shared block storage.
The lightweight ScaleIO software components are installed on the application hosts and inter-communicate via a standard LAN to handle the application I/O requests sent to ScaleIO block volumes. An extremely efficient decentralized block I/O flow, combined with a distributed, sliced volume layout, results in a massively parallel I/O system that can scale to hundreds or even thousands of nodes.
ScaleIO was designed and implemented with enterprise-grade resilience as a must-have. Furthermore, the software features efficient distributed auto-healing processes that overcome media and node failures without requiring administrator involvement.
Dynamic and elastic, ScaleIO enables administrators to add or remove nodes and capacity on the fly. The software immediately responds to the changes, rebalancing the storage distribution and achieving a layout that optimally suits the new configuration.
Because ScaleIO is hardware agnostic, the software works efficiently with various types of disks (magnetic disks, solid-state disks, and flash PCI Express (PCIe) cards), networks, and hosts. It can be installed easily in an existing infrastructure as well as in greenfield configurations.
Management and Monitoring
Managing a ScaleIO deployment is easy. Everything from installation and configuration to monitoring and upgrades is simple and straightforward. Anyone who manages the data center can fully administer the deployment, without specialized training or vendor certification. The complexity of storage administration is completely eliminated. The screens shown here are all that is needed to monitor the ScaleIO system. There is also a simple CLI for configuration and various system actions. Because the system manages itself and takes all the necessary remedial actions when a failure occurs, including re-optimization, there is no need for operator intervention when various events occur. However, ScaleIO features a “call home” capability, which alerts the administrator should an event occur. The admin can then respond to the event (if necessary) even outside of business hours. The administrator can also follow the system's operations and monitor their progress. For example, rebuild and rebalance operations can be monitored via the dashboard while they are executing.
In large data centers, many applications are deployed, many different requirements exist, and operational parameters are dynamic, changing frequently and without much notice. Whether you are a service provider delivering hosted infrastructure as a service or your IT department delivers infrastructure as a service to functional units within your organization, ScaleIO offers a set of features that gives you complete control over performance, capacity, and data location. Protection domains allow you to isolate specific servers and data sets. This can be done at the granularity of a single customer, so that each customer can be under a different SLA. Storage pools can be used for further data segregation and tiering. For example, data that is accessed very frequently can be stored in a flash-only storage pool for the lowest latency, while less frequently accessed data can be stored in a low-cost, high-capacity pool of spinning disks. With ScaleIO, you can limit the amount of performance (IOPS or bandwidth) that selected customers can consume. The limiter allows resource distribution to be imposed and regulated to prevent application “hogging” scenarios. Lightweight data-at-rest encryption can be used to provide added security for sensitive customer data. Explicit mapping from volumes to servers determines what data can be accessed from each server and provides extra isolation and granularity with regard to location management.
For both enterprises and service providers, these features enhance system control and manageability—ensuring that quality of service is met.
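To make these constructs a bit more concrete, here is a minimal illustrative sketch in Python (the class and field names are hypothetical, not ScaleIO's actual API) of how protection domains, storage pools, and per-customer limits relate to one another:

from dataclasses import dataclass, field

@dataclass
class StoragePool:
    """A set of devices used for segregation and tiering (e.g. flash-only vs. spinning disks)."""
    name: str
    media: str          # "flash" or "hdd"
    capacity_gb: int

@dataclass
class ProtectionDomain:
    """Isolates a group of servers and the storage pools (and data sets) they serve."""
    name: str
    pools: list = field(default_factory=list)

@dataclass
class ClientLimits:
    """Per-customer caps on IOPS and bandwidth to prevent 'hogging' scenarios."""
    max_iops: int
    max_bandwidth_mbps: int

# One customer, one protection domain, with a hot flash pool and a cold HDD pool.
tenant_a = ProtectionDomain("tenant-a", pools=[
    StoragePool("hot", media="flash", capacity_gb=2_000),
    StoragePool("cold", media="hdd", capacity_gb=50_000),
])
tenant_a_limits = ClientLimits(max_iops=20_000, max_bandwidth_mbps=400)
print(tenant_a.name, [p.name for p in tenant_a.pools], tenant_a_limits)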
The ScaleIO Data Client (SDC)
The SDC is a block device driver that exposes ScaleIO shared block volumes to applications. The SDC runs locally on any application server that requires access to the block storage volumes. The blocks that the SDC exposes can come from anywhere within the ScaleIO global virtual SAN, so the local application can issue an I/O request and the SDC fulfills it regardless of where the particular blocks reside. The SDC communicates with other nodes (beyond its own local server) over a TCP/IP-based protocol, so it is fully routable. TCP/IP is ubiquitous and is supported on any network, so data center LANs are naturally supported.
In the figure below you can see the I/O flow. The application issues an I/O, which flows through the file system and volume manager, but instead of accessing the local storage on the server (via the block device driver), it is passed to the SDC (denoted as ‘C’ in the figure). The SDC knows where the relevant block resides in the larger system and directs the I/O to its destination (either locally or on another server within the ScaleIO cluster). The SDC is the only ScaleIO component that applications “see” in the data path. Note that in a bare-metal configuration, the SDC is always implemented as an OS (kernel) component. In virtualized environments, it is typically implemented as a hypervisor element or as an independent VM.
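The following simplified Python sketch (the chunk size, map contents, and server names are assumptions for illustration, not the actual SDC implementation) shows the essence of what the SDC does: translate a volume offset into the chunk that holds it and forward the I/O to the server that owns that chunk:

CHUNK_SIZE = 1 * 1024 * 1024   # assume 1 MB chunks, as in the layout described later

# Hypothetical mapping an SDC might hold: (volume, chunk index) -> owning SDS.
chunk_map = {
    ("vol1", 0): "sds-04",
    ("vol1", 1): "sds-17",
}

def route_io(volume, byte_offset):
    """Return the SDS that owns the chunk containing this offset (single-chunk I/O)."""
    chunk_index = byte_offset // CHUNK_SIZE
    return chunk_map[(volume, chunk_index)]

# A 4 KB read at offset 1.5 MB of vol1 falls in chunk 1 and is sent to sds-17.
print(route_io("vol1", 1_572_864))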
The ScaleIO Data Server (SDS)
The SDS owns the local storage that it contributes to the ScaleIO storage pools. An instance of the SDS runs on every server that contributes some or all of its local storage space (HDDs, SSDs, or PCIe flash cards) to the aggregated pool of storage within the ScaleIO virtual SAN. Local storage may be disks, disk partitions, or even files. The role of the SDS is to actually perform the I/O operations requested by an SDC running on the local server or on another server within the cluster.
In the figure below you can see the I/O flow. A request, originating at one of the cluster’s SDCs, arrives at the SDS over the ScaleIO protocol. The SDS uses the block device driver of the native local media to fulfill the request and returns the results. An SDS always “talks” to the local storage (the DAS) on the server it runs on. Note that an SDS can run on the same server as an SDC or be decoupled from it; the two components are independent of each other.
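On the SDS side, the essence of the work is plain local block I/O. The sketch below (a plain-file stand-in for a raw device; the paths and function names are illustrative only) shows that an SDS simply reads or writes its own direct-attached storage on behalf of a remote SDC:

def serve_read(device_path, offset, length):
    """Serve a read that arrived from an SDC by going straight to the local device
    (a disk, a partition, or even a plain file acting as the SDS's storage)."""
    with open(device_path, "rb") as dev:
        dev.seek(offset)
        return dev.read(length)

def serve_write(device_path, offset, data):
    """Serve a write against the local device and return the number of bytes written."""
    with open(device_path, "r+b") as dev:
        dev.seek(offset)
        return dev.write(data)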
The Metadata Manager (MDM)
ScaleIO’s control component is known as the metadata manager, or MDM. The MDM serves as the monitoring and configuration agent.
The MDM holds the cluster-wide mapping information and is responsible for decisions regarding migration, rebuilds, and all system-related functions. The ScaleIO monitoring dashboard communicates with the MDM to retrieve system information for display.
The MDM is not on the ScaleIO data path; reads and writes never traverse the MDM. The MDM may communicate with other ScaleIO components within the cluster in order to perform system maintenance and management operations, but never to perform data operations. This means that the MDM does not represent a bottleneck for data operations and is never an issue as the overall cluster scales. The MDM consumes only resources that are not needed by applications or data-path activities; it does not preempt users' operations and has no impact on the overall cluster's performance and bandwidth.
To support high availability, three instances of the MDM can be run on different servers; this is known as the MDM cluster. An MDM may run on servers that also run SDCs and/or SDSs, or on a separate server. During installation, the user decides where the MDM instances reside. If the primary MDM fails (due to a host crash, for example), another MDM takes over and functions as primary until the original MDM is recovered. The third instance is usually used both for HA and as a tie-breaker in case of conflicts.
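The following toy Python sketch (the election rule and instance names are simplifications, not the real MDM protocol) illustrates why the third instance matters: a new primary is chosen only when a majority of the three MDM instances is still reachable, so a single failure never splits the cluster:

from typing import Optional

def elect_primary(mdm_status) -> Optional[str]:
    """Pick a primary only if a majority (2 of 3) of MDM instances is reachable."""
    alive = sorted(name for name, up in mdm_status.items() if up)
    if len(alive) < 2:
        return None      # no quorum: management pauses (the data path is unaffected)
    return alive[0]      # deterministic choice among the surviving instances

# The primary's host crashed; the secondary and the tie-breaker still form a majority.
print(elect_primary({"mdm-primary": False, "mdm-secondary": True, "mdm-tiebreaker": True}))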
Let’s take a look at the deployment options. It is common practice to install both an SDC and an SDS on the same server. This way applications and storage share the same compute resources. This slide shows such a fully converged configuration, where every server runs both an SDC and an SDS. All servers can have applications running on them, performing I/O operations via the local SDC. All servers contribute some or all of their local storage to the ScaleIO system via their local SDS. Components communicate over the LAN.
In some situations, an SDS can be separated from an SDC and installed on a different server. ScaleIO does not have any requirements in regard to deploying SDCs and SDSs on the same or different servers. Whatever the preference of the administrator is, ScaleIO works with it transparently and smoothly. Shown here is a two-layer configuration. A group of servers is running SDCs and another distinct group is running SDSs. The applications that run on the first group of servers make I/O requests to their local SDC. The second group, running SDSs, contributes the servers’ local storage to the virtual SAN. The first and second groups communicate over the LAN. In a way, this deployment is similar to a traditional external storage system. Applications run in one layer, while storage is in another layer.
At any moment, a whole new group of servers can be added as SDS servers to extend the capacity of the system as a whole. ScaleIO automatically rearranges the data, optimizing and rebalancing the data in the background without any downtime. This deployment can easily grow to thousands of nodes.
In VMware environments, ScaleIO uses a model that is similar to a virtual storage appliance (VSA), which is called ScaleIO VM, or SVM. This is a dedicated VM in each ESX host that contains both the SDS and the SDC. The VMs in that host can access the storage as depicted—to the hypervisor, then to the SVM, and from the SVM to the local storage. All the SVMs are connected, so this allows any VM in any ESX host to access any SDS in the system, as in a physical environment.
Physical and Other Non-VMware Environments
Non-VMware environments, including Citrix XenServer, Linux KVM, and Microsoft Hyper-V, are treated identically to physical environments. Both the SDS and the SDC sit inside the hypervisor; nothing is installed at the guest layer. Since ScaleIO is installed in the hypervisor, it does not depend on the guest operating system, so there is only one build to maintain and test. Installation is easy, too, as there is only one location in which to install ScaleIO.
Volumes, Mirroring and Data Protection
Now let’s look at the distributed data layout scheme of ScaleIO volumes. This scheme is designed to maximize protection and optimize performance. On the left in the figure below, you see data volume 1 in grey and data volume 2 in blue. On the right, you see a 100-node SDS cluster.
A single volume is divided into chunks of reasonably small size, say 1 MB. These chunks will be scattered (striped) on physical disks throughout the cluster, in a balanced and random manner.
Once the volume is provisioned, the chunks of volume 1 are spread throughout the cluster randomly and evenly. Volume 2 is treated similarly.
Note that the figure shows a partial layout; ideally, the chunks are spread over all 100 servers. It is important to understand that ScaleIO volume chunks are not the same as data blocks. I/O operations are done at the block level: if an application writes 4 KB of data, only 4 KB is written, not a full 1 MB chunk. The same goes for read operations; only the required data is read.
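A rough Python sketch of this layout idea (purely illustrative: the real system balances placement by capacity and topology rather than picking servers at random) might look like this:

import random

CHUNK_SIZE_MB = 1
SERVERS = [f"SDS{i}" for i in range(1, 101)]    # a 100-node SDS cluster

def layout_volume(volume_size_mb, servers, seed=0):
    """Scatter a volume's 1 MB chunks across the cluster."""
    rng = random.Random(seed)
    return {chunk: rng.choice(servers) for chunk in range(volume_size_mb // CHUNK_SIZE_MB)}

layout = layout_volume(8192, SERVERS)           # an 8 GB volume becomes 8,192 chunks
print(len(set(layout.values())), "servers hold at least one chunk of this volume")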
Now let me explain ScaleIO’s two-copy mesh mirroring. For simplicity, I will illustrate it with volume 2, which has only five chunks: A, B, C, D, and E. The chunks are initially stored on servers as shown. In order to protect the volume data, we need to create redundant copies of those chunks, so we end up with two copies of each chunk. It is important that we never store two copies of the same chunk on the same physical server.
The copies have been made. Now, chunk A resides on two servers: SDS2 and SDS4. Similarly, all other chunks’ copies are created and stored on servers different from their first copy. Note that no server holds a complete mirror of another server. The ScaleIO mirroring scheme is referred to as mesh mirroring, meaning the volume is mirrored at the chunk level and is “meshed” throughout the cluster. This is one of the factors in enhancing overall data protection and cluster resilience. A volume never fails in full and rebuilding a particular damaged chunk (or chunks) is fast and efficient, as it is done simultaneously by multiple servers. When a server fails (or is removed from the cluster), its chunks are spread over the whole cluster and rebuilding is shared among all the servers.
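In code terms, the placement rule is simply “two copies, two different servers.” Here is a minimal Python sketch of that constraint (again illustrative only, not the actual placement algorithm):

import random

def place_mirrored_chunk(chunk_id, servers, seed=None):
    """Choose two *different* servers for the two copies of a chunk (mesh mirroring)."""
    rng = random.Random(seed)
    primary, secondary = rng.sample(servers, 2)   # sample() guarantees distinct servers
    return chunk_id, (primary, secondary)

servers = [f"SDS{i}" for i in range(1, 7)]
print(place_mirrored_chunk("A", servers, seed=1))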
Let’s take a look at a server failure scenario. SDS1 presently stores chunks E and B from Volume 2 and chunk F from Volume 1.
If SDS1 crashes, ScaleIO needs to rebuild these chunks, so chunks E, B and F are copied to other servers. This is done by copying the mirrors. The mirrored chunk of E is copied from SDS3 to SDS4, the mirrored chunk of B is copied from SDS6 to SDS100, and the mirrored chunk of F is copied from SDS2 to SDS5. This process is called forward rebuild. It is a many-to-many copy operation. By the end of the forward rebuild operation, the system is again fully protected and optimized. No matter what, no two copies of the same chunk are allowed to reside on the same server. Clearly, this rebuild process is much lighter-weight and faster than having to serially copy an entire server to another. Note that while this operation is in progress, all the data is still accessible to applications. For the chunks of SDS1, the mirrors are still available and are used. Users experience no outage or delays. ScaleIO always reserves space on servers for failure cases, when rebuilds are going to occupy new chunk space on disks. This is a configurable parameter (i.e., how much storage capacity to allocate as reserve).
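A simplified Python sketch of the forward-rebuild idea is shown below (illustrative only; the real process is throttled, parallel, and capacity-aware). Each chunk that lost a copy gets a new copy made from its surviving mirror, placed on a server that does not already hold that chunk:

import random

def forward_rebuild(placement, failed_server, servers, seed=0):
    """Re-create every copy lost with failed_server from its surviving mirror,
    keeping the two copies of each chunk on distinct servers (many-to-many copy)."""
    rng = random.Random(seed)
    for chunk, (a, b) in placement.items():
        if failed_server in (a, b):
            survivor = b if a == failed_server else a
            candidates = [s for s in servers if s not in (survivor, failed_server)]
            placement[chunk] = (survivor, rng.choice(candidates))
    return placement

servers = ["SDS1", "SDS2", "SDS3", "SDS4", "SDS5", "SDS6", "SDS100"]
placement = {"E": ("SDS1", "SDS3"), "B": ("SDS1", "SDS6"), "F": ("SDS2", "SDS1")}
print(forward_rebuild(placement, "SDS1", servers))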
A snapshot is a volume that is a copy of another volume. Snapshots take no time to create (or remove); they are instantaneous. Snapshots do not consume much space initially because they are thinly provisioned. You can create snapshots of snapshots, any number of them.
The figure below shows a volume, V1, a snapshot of that volume, S111, and a snapshot of snapshot S111, S121. Snapshot volumes, which are fully functional, can be mapped to SDCs just like any other volumes. A complete genealogy of a volume and its snapshots is known as a VTree. Any number of VTrees can be created in the system. When you create a snapshot of several volumes (or snapshots) at once, a consistency group that contains all the volumes in that operation is created and named. Consistency groups are created automatically when a snapshot command is issued for several volumes, and operations (for example, delete volume) may be performed on an entire consistency group.
The figure also shows two genealogies of volumes, V1 and V2. V1 is being used to create VTree1 of snapshots. V2 is used to create the VTree2 genealogy.
At some point, a command is issued to create two snapshots, one of V1 and the other of V2. Because this is a single snapshot command, the two newly created snapshots, S112 (of V1) and S211 (of V2), are grouped together in the C1 consistency group.
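The parent-child relationships described above are easy to picture as a small tree structure. Here is an illustrative Python sketch (hypothetical names, not the ScaleIO API) of a VTree and of grouping snapshots taken in a single command into a consistency group:

from dataclasses import dataclass, field

@dataclass
class Volume:
    """A volume or snapshot; a snapshot is thin and simply records its parent."""
    name: str
    parent: "Volume" = None
    children: list = field(default_factory=list)

    def snapshot(self, name):
        snap = Volume(name, parent=self)
        self.children.append(snap)
        return snap

def snapshot_group(pairs):
    """Snapshot several volumes in one command; the results form a consistency group."""
    return [vol.snapshot(name) for vol, name in pairs]

v1, v2 = Volume("V1"), Volume("V2")
s111 = v1.snapshot("S111")                         # VTree1 grows under V1
s121 = s111.snapshot("S121")                       # a snapshot of a snapshot
c1 = snapshot_group([(v1, "S112"), (v2, "S211")])  # consistency group C1
print([s.name for s in c1])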
Software-Only—but as Resilient as a Hardware Array
ScaleIO was developed by a team of storage veterans who had previously developed “built-like-a-tank” external storage subsystems. Such traditional subsystems typically combine system software with commodity hardware—which is comparable to application servers’ hardware—to provide enterprise-grade resilience. With its contemporary architecture, ScaleIO provides similar enterprise-grade, no-compromise resilience by running the storage software directly on the application servers.
Designed for extensive fault tolerance and high availability, ScaleIO handles all types of failures, including failures of media, connectivity, and nodes, as well as software interruptions. No single point of failure can interrupt the ScaleIO I/O service, and in many cases ScaleIO can overcome multiple points of failure as well.