The single best strategy for maintaining continuous uptime starts with mirroring identical copies of data between two places, ideally some distance from each other. How far to stretch it? Well, there are some practical limits governed by the speed of light, but I’ll touch on that shortly.
One of our healthcare customers operates active-active clusters between midtown Manhattan and Secaucus, New Jersey, across the Hudson River. Not very far as the crow flies, but far enough that Superstorm Sandy affected only one of the locations, and the western data center took over for the duration of the storm.
As this example illustrates, simply clustering servers and splitting them apart doesn’t do it, especially if the shared storage array on which they rely is susceptible to a site failure. Lose access to the data and everything goes down – that’s true no matter how bulletproof you believe that storage system to be. Little things like water, fire, construction crews and technician errors prove that over and over again.
An ideal solution combines VMware vSphere High Availability (HA) Clusters with either DataCore SANsymphony™ software-defined storage or DataCore™ Hyper-converged Virtual SAN products; both are equally well proven to keep the lights on when chaos conspires against you.
DataCore SANsymphony is best suited when your vSphere servers depend on an external SAN and the clusters access it via iSCSI or Fibre Channel. If, on the other hand, you’re rolling out a hyper-converged configuration and choosing to keep the physical storage inside the servers, then, you guessed it, the Hyper-converged Virtual SAN makes the most sense.
VMware recognized the value of this combination recently by listing DataCore’s products as vSphere Metro Cluster Certified. You can read more at: https://www.vmware.com/resources/compatibility/vcl/partnersupport.php#DatacoreSoftwareCorporation.
The best advice that I can give is “Instead of hoping that things won’t break, expect them to go terribly wrong and design with that in mind.” That principle will save you a lot of embarrassment and keep your company out of those unsavory morning headlines.
By the way, curious why they’re called metro clusters? It has to do with the magic number governing the maximum separation between nodes in a vSphere cluster. Some like to put it in terms of miles or kilometers spanning a metropolitan area, but it’s more accurately expressed as round trip time (RTT) for data transmissions between nodes. Up to 5 milliseconds of RTT and the cluster behaves as if it were all in one location. Beyond that, things start to time out waiting on inter-site coordination messages to be acknowledged.
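To get a feel for how the speed of light turns an RTT budget into a distance, here is a rough back-of-the-envelope sketch in Python. It rests on an assumption that is not from the vSphere documentation: that light in optical fiber travels at roughly two thirds of its vacuum speed. The function names and the 20 km sample path are made up for illustration, and the calculation counts propagation delay only.

```python
# Back-of-the-envelope link between fiber distance and round trip time (RTT).
# Assumption: signals in optical fiber travel at about 0.67 times the vacuum
# speed of light. The exact factor depends on the fiber.

SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_VELOCITY_FACTOR = 0.67  # assumed typical value, not a measured one


def propagation_rtt_ms(fiber_km: float) -> float:
    """RTT in milliseconds caused by propagation delay alone."""
    one_way_s = fiber_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_VELOCITY_FACTOR)
    return 2 * one_way_s * 1000


def max_fiber_km(rtt_budget_ms: float) -> float:
    """Longest fiber path whose propagation delay fits within the RTT budget."""
    one_way_s = (rtt_budget_ms / 1000) / 2
    return one_way_s * SPEED_OF_LIGHT_KM_PER_S * FIBER_VELOCITY_FACTOR


if __name__ == "__main__":
    # Theoretical ceiling for a 5 ms RTT budget (propagation only).
    print(f"Ceiling for 5 ms RTT: {max_fiber_km(5.0):.0f} km of fiber")
    # Hypothetical 20 km fiber path between two nearby sites.
    print(f"20 km path: {propagation_rtt_ms(20.0):.2f} ms RTT")
```

Under these assumptions the ceiling works out to roughly 500 km of fiber. Treat that as an upper bound rather than a design target: switches, protocol overhead, and fiber routes that rarely run in a straight line all consume part of the 5 ms, so practical separations are usually much shorter.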