Performance Tuning your vSAN / VxRail Environment

Recently, I have been getting a lot of questions about how to build vSAN for high-performing workloads. This article covers the factors that can impact performance: hardware differences, slack space, vSAN namespaces, objects, and components, Raid-1 vs Raid-5/6, and stripe width. Please keep in mind that I am not covering dedup/compression overhead and its impact on performance. At the time of this writing, some of the information is historic, but some of it applies to vSAN 6.7U1.

**The below is primarily focused on all-flash vSAN/VxRail nodes, where the cache tier is 100% write cache. The price points of hybrid and all flash are so close nowadays that over the long term all flash ends up being more economical.

**The below recommendations operate under the assumption that you keep 30% of storage free at all times for slack space. vSAN rebalances the components across the cluster whenever the consumption on a single capacity device reaches 80 percent or more. The rebalance operation might impact the performance of applications.

**The below recommendations are my own personal opinion, having designed, implemented, and administered vSAN/VxRail nodes in multiple enterprise environments across multiple countries.

Not all drives are created equal
I should start by stating that if you go down the VxRail path then you are fixed into a limited number of options for your all flash cache and capacity tier. However, if you go down the vSAN/vSAN ready node path then you have more options.
In today’s world you have SATA, SAS, PCI-e, and NVMe drives to choose from.

  • NVMe is pound for pound the heavyweight champion of performance and would make the ideal cache drive.
  • NVMe drives are typically the more expensive drive.
  • PCIe devices generally have faster performance than SSD devices.
  • The maximum capacity available for PCIe devices is generally greater than the maximum capacity currently listed for SSD devices.
  • PCIe devices generally have a higher cost than SSD devices.
  • My personal best practice: Make your cache tier as write intensive as possible while making your capacity tier as read intensive as possible.

Below you will find a link to all the server and drive manufacturers that are rated for vSAN:

How big should my cache be?
Sizing your vSAN cache so that your virtual machines' working set resides completely in cache is going to give the best possible performance. Granted, this can be difficult, because determining your working set can be like hitting a moving target. When in doubt, run an environmental tool for a week to gather your peak workloads and add 15% on top of that.
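To make the "peak plus 15%" rule of thumb concrete, here is a minimal sketch. The function name and the sample numbers are made up for illustration; the only inputs taken from this article are the week of peak measurements and the 15% headroom.

```python
def recommended_cache_gb(peak_working_set_samples_gb, headroom=0.15):
    """Return the highest observed working set plus headroom (15% by default)."""
    if not peak_working_set_samples_gb:
        raise ValueError("need at least one measurement")
    peak = max(peak_working_set_samples_gb)
    return peak * (1 + headroom)


# Hypothetical daily peaks (GB) gathered over one week of monitoring.
daily_peaks_gb = [310, 295, 342, 400, 388, 361, 379]
print(round(recommended_cache_gb(daily_peaks_gb), 1))
```

With a 400GB peak, this lands at roughly 460GB of cache to plan for.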

Namespace, Object, Component, & Stripe Width
Namespace: When you provision a virtual machine on a vSAN datastore, vSAN creates a set of objects comprised of multiple components for each virtual disk. It also creates the VM home namespace, which is a container object that stores all metadata files of your virtual machine.

vSAN is an object datastore with a mostly flat hierarchy of objects and containers (folders). Think of a tree and branches architecture.  Items that make up a virtual machine are represented by objects. These are the most prevalent object types you will find on a vSAN datastore:

• VM Home, which contains virtual machine configuration files and logs, e.g., VMX file

• Virtual machine swap

• Virtual disk (VMDK)

• Delta disk (snapshot)

• Performance database

Each object consists of one or more components. The number of components that make up an object depends primarily on two things: the size of the object and the storage policy assigned to it. The maximum size of a component is 255GB. If an object is larger than 255GB, it is split up into multiple components.

Example: 800GB virtual disk is split into 4 components.
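The arithmetic behind that example is just a ceiling division by the 255GB component limit. A minimal sketch (the function name is my own, and it ignores any extra splitting vSAN may choose to do):

```python
import math

MAX_COMPONENT_GB = 255  # maximum component size stated above


def components_for_size(object_gb):
    """Smallest number of components needed to hold an object of this size."""
    return math.ceil(object_gb / MAX_COMPONENT_GB)


print(components_for_size(800))  # 4
```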

vSAN will break down a large component into smaller components in certain cases to help balance capacity consumption across disks, optimize rebuild and resynchronize activities, and improve overall efficiency in the environment.

Raid-1 mirror example: We have a Raid-1 mirror with failures to tolerate set to 1 and a disk stripe width of 1. The earlier 800GB object breaks down into 4 components, which are mirrored to another host for a total of 8 components.

**A 9th component, called a witness, is placed on a third host to act as a tie breaker.
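Putting the mirror example into numbers: 4 components per replica, 2 replicas, plus the witness. The sketch below assumes a single witness, as in the example above; in a real cluster the witness count can differ, so treat the parameter as an assumption.

```python
import math


def raid1_component_count(object_gb, ftt=1, max_component_gb=255, witnesses=1):
    """Count components for a Raid-1 object with stripe width 1.

    witnesses=1 mirrors the example in this article and is an assumption.
    """
    per_replica = math.ceil(object_gb / max_component_gb)
    replicas = ftt + 1  # FTT=1 means two copies of the data
    return per_replica * replicas + witnesses


print(raid1_component_count(800))  # 9
```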

Raid-5 example with failures to tolerate set to 1: The 800GB object will now consist of 4 components: 3 data and 1 parity. These components are distributed across the four hosts in our cluster.

Each storage object is deployed on vSAN as a RAID tree, and each branch of the tree is a component. For instance, if I chose to deploy a VMDK with a stripe width of two, then a RAID-0 stripe would be configured across two disks for this virtual machine disk. The VMDK would be the object, and each of the stripes would be a component of that object.

Deeper Dive into Stripe Width:

I have heard some say, “the higher the disk stripe width the better,” which is rarely the case. All writes go through the cache drive's write buffer (600GB, and still 600GB even if you have a 1.6TB drive), so increasing the stripe width may or may not increase performance. The one scenario where you will see improved performance is when you have to destage a lot of data from the cache tier into the capacity tier.

**Maximum disk stripe width is 12; minimum is 1.

**To see write buffer utilization in 6.7U1, go to Monitor > vSAN > Performance > Disks. Prior to 6.7U1 it was Monitor > Performance > vSAN-Disk Group on a particular ESXi host. Remember, once the free write buffer drops below 70%, the cache tier will begin to destage data.
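As a rough sketch of the two numbers above (the 600GB write buffer cap and the 70% free threshold), the check below tells you whether a cache device would be destaging. The function names are hypothetical, and the buffered amount is something you would read off the performance charts yourself.

```python
WRITE_BUFFER_CAP_GB = 600      # write buffer cap per cache device, as stated above
DESTAGE_FREE_THRESHOLD = 0.70  # destaging starts once free buffer drops below 70%


def write_buffer_gb(cache_device_gb):
    """Usable write buffer: the device size, capped at 600GB."""
    return min(cache_device_gb, WRITE_BUFFER_CAP_GB)


def is_destaging(cache_device_gb, buffered_gb):
    """True when the free share of the write buffer is below the threshold."""
    buffer_gb = write_buffer_gb(cache_device_gb)
    free_fraction = (buffer_gb - buffered_gb) / buffer_gb
    return free_fraction < DESTAGE_FREE_THRESHOLD


print(write_buffer_gb(1600))     # 600
print(is_destaging(1600, 150))   # False (75% free)
print(is_destaging(1600, 200))   # True (about 67% free)
```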

My personal opinion is that SW=1 is more than sufficient for most workloads in your environment. However, if you are experiencing a write cache destaging issue, then increasing the SW will help. If you are going to increase the SW, I recommend starting with 2 because this will give you Raid 1/0 performance characteristics.

Raid-1 vs Raid-5/6
Raid-1 is a mirror replica of the original object. Since Raid-1 only mirrors the data, with no parity calculation overhead, it is known as the performance Raid. Raid-5/6, however, reduce the space required to protect the data, which is why they are known as the capacity efficiency Raid levels.

The space efficiency benefits of Raid-5/6 come at the price of the amplification of I/O operations.

Write operations are amplified because the parity fragments need to be updated every time data is written.
In the general case, a write operation is smaller than the size of a RAID stripe, so one way to perform the update is to:

  • read the part of the fragment that needs to be modified by the write operation;
  • read the relevant parts of the old parity/syndrome fragments to re-calculate their values (need both old and new values to do that);
  • combine the old values with the (new) data from the write operation to calculate new parity/syndrome value;
  • write the new data;
  • write the new parity/syndrome value.
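The steps above can be sketched with plain XOR parity, which is the textbook Raid-5 scheme. This is a generic illustration, not vSAN's actual implementation, and it does not cover the second syndrome that Raid-6 maintains (that one uses different math). All names are my own.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(old_data, old_parity, new_data):
    """Read-modify-write: new parity = old parity XOR old data XOR new data.

    The caller has already read old_data and old_parity (2 reads);
    the returned pair is what gets written back (2 writes).
    """
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)
    return new_data, new_parity


# Three data fragments and their parity.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\xf0"
parity = xor_bytes(xor_bytes(d0, d1), d2)

# Overwrite only d1; d0 and d2 are never read.
d1, parity = small_write(d1, parity, b"\xaa\x55")

# The updated parity matches a full recalculation across the stripe.
assert parity == xor_bytes(xor_bytes(d0, d1), d2)
```

The point of the read-modify-write is that only the fragment being changed and the parity are touched, rather than the whole stripe.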

A typical write on Raid-5 therefore requires 2 reads and 2 writes on storage. For Raid-6, the numbers are 3 reads and 3 writes.

Since vSAN is a distributed storage solution, the amplification also means additional network traffic for write operations.
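To get a feel for what that amplification means at scale, here is a small sketch that multiplies guest writes by the per-write back-end I/O counts given above. The Raid-1 row is my own addition and assumes an FTT=1 mirror (one write per replica, no reads); the function and table names are hypothetical.

```python
# (reads, writes) issued on the back end for each small guest write.
BACKEND_IO_PER_WRITE = {
    "RAID-1": (0, 2),  # assumption: FTT=1 mirror, one write per replica
    "RAID-5": (2, 2),  # from the text above
    "RAID-6": (3, 3),  # from the text above
}


def backend_ios(policy, guest_writes):
    """Return (reads, writes) generated on the back end for guest_writes."""
    reads, writes = BACKEND_IO_PER_WRITE[policy]
    return guest_writes * reads, guest_writes * writes


for policy in BACKEND_IO_PER_WRITE:
    reads, writes = backend_ios(policy, 1000)
    print(f"{policy}: {reads} reads, {writes} writes")
```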

Examples of fault tolerance methods, including changing stripe width:

[Figure: how an object is made up of components]

[Figure: Raid-1 mirror with a stripe width of 1, the default vSAN storage policy]

[Figure: Raid-1 mirror with SW=2]

[Figure: what happens when an object is larger than 255GB]
*Even if the stripe width is set to 1, vSAN will still break larger objects down.

[Figure: Raid-5 with SW=1 policy]

[Figure: Raid-5 with SW=2 policy]

[Figure: Raid-6 with SW=1 policy]

[Figure: Raid-6 with SW=2 policy]

As you can see, moving between Raid-1, Raid-5, and Raid-6 has different impacts. In addition, increasing the stripe width can create a lot more components for vSAN to move.
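A quick way to see how the component count grows is to multiply the base layout by the stripe width. The sketch below is a simplification under stated assumptions: Raid-1 at FTT=1 has 2 replicas, Raid-5 is 3 data + 1 parity (as in the earlier example), and Raid-6 is assumed to be 4 data + 2 parity. It ignores witnesses and the extra splitting of objects larger than 255GB, so treat the result as a lower bound.

```python
# Base number of components per object at stripe width 1.
BASE_COMPONENTS = {
    "RAID-1": 2,  # two mirror replicas at FTT=1
    "RAID-5": 4,  # 3 data + 1 parity
    "RAID-6": 6,  # assumption: 4 data + 2 parity
}


def min_components(policy, stripe_width=1):
    """Lower bound on components: each base component is striped SW ways."""
    if not 1 <= stripe_width <= 12:
        raise ValueError("stripe width must be between 1 and 12")
    return BASE_COMPONENTS[policy] * stripe_width


for policy in BASE_COMPONENTS:
    print(policy, min_components(policy, 1), min_components(policy, 2))
```

Going from SW=1 to SW=2 doubles the count for every policy, which is the extra work vSAN has to do during rebuilds and rebalances.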

There are plenty of considerations to account for when deciding how to put together a vSAN cluster and set up storage policies. Luckily, vSAN is an object-based distributed storage solution, which means you can have a different policy per object in your vSAN environment. Below you will find some of the personal best practices I apply to my environments.

Personal best practices:

  • If you are putting mostly high-performing workloads on the vSAN cluster, then do not enable dedup/compression.
  • If you have a highly transactional VM, then set a Raid-1 with SW=1 policy for best performance.
  • If you need to increase stripe width, then start with the value 2; this will give you the characteristics of a Raid 1/0.
  • If you have a VM that is not high performing and you wish to save capacity, then apply a Raid-5/6 with SW=1 policy.
  • If you have a VM with multiple VMDKs where some are high performing while others are not, apply Raid-1 with SW=1 to the VMDKs generating the higher performance and a Raid-5/6 policy to the VMDKs that are lower performance.
  • If you have a VM that is having performance issues even with a Raid-1 with SW=1 policy and you determine there is a cache to capacity destaging issue, then start with a Raid-1 SW=2 policy and see if that helps the destaging issue. If not, add more cache drives to help spread out the workloads.
  • Prefer NVMe for cache, followed by PCI-e, then SSD.
  • Finally, always honor slack space in your vSAN cluster and reserve 30% at all times. vSAN rebalances the components across the cluster whenever the consumption on a single capacity device reaches 80 percent or more. The rebalance operation might impact the performance of applications.
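The slack space rule is easy to check with a few lines of arithmetic. This is a minimal sketch with made-up capacity numbers and device names; the 30% and 80% figures are the ones from this article, and where you get the usage numbers from is up to you.

```python
SLACK_TARGET_FREE = 0.30           # keep 30% of the cluster free
REBALANCE_DEVICE_THRESHOLD = 0.80  # rebalance kicks in at 80% on a single device


def slack_check(capacity_gb, used_gb, device_used_fractions):
    """Summarize slack space and list devices at or above the rebalance threshold."""
    free_fraction = (capacity_gb - used_gb) / capacity_gb
    hot_devices = sorted(
        name
        for name, used in device_used_fractions.items()
        if used >= REBALANCE_DEVICE_THRESHOLD
    )
    return {
        "free_fraction": free_fraction,
        "slack_ok": free_fraction >= SLACK_TARGET_FREE,
        "devices_at_rebalance_threshold": hot_devices,
    }


# Hypothetical cluster: 100TB raw, 65TB used, one capacity disk running hot.
report = slack_check(100_000, 65_000, {"disk-a": 0.62, "disk-b": 0.81})
print(report)
```

Note that a cluster can meet the 30% target overall and still have a single device over 80%, which is enough to trigger a rebalance.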
