Bullet-Proofing Storage!
I still cringe when I remember the moment I heard the news…”The storage crashed!”
It was four years ago (2010) that I lived through a large-scale outage caused by a SAN failure, one that, according to the experts, “should have never happened.” But it did.
For 7 straight days, my team and I orchestrated the recovery of nearly 500 virtual servers that vanished when the storage crashed. Needless to say, it occurred during peak business hours.
And hour after hour as we brought servers back online, I had the dreaded task of sending email updates to developers, business partners, VPs, and C-level staff.
They all kept asking me the same question…how did this happen?
A LIFE-changing event.
Let me tell you, when something like this happens, you’re not the same afterward.
When it’s over, you walk away a different person. And no matter what any vendor [or zealot] says about a cool SAN, server, or network device, you still can’t completely get it out of your head. You’ve experienced firsthand that nothing is perfect.
Re-Thinking Storage Architecture
Since the experience I just shared, many new storage solutions have come to market. Names such as Pure Storage, Nimble, and Tintri come to mind.
They all advertise new concepts for handling scale and performance. But recently, the solution that has captured my attention, for a more practical reason, is VSAN.
Why? Because it spreads the failure domain out almost as wide as you want, and it can be built on lower-cost server hardware. The benefit is that all your VMs are no longer sitting in one aggregate of spindles.
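To put some rough numbers behind the “smaller failure domain” argument, here’s a quick back-of-the-envelope sketch in Python. The figures (500 VMs, a 10-host cluster, two copies of each VM’s data) are hypothetical, and the model is deliberately simplified (it ignores witness components, capacity, and rebuild behavior); it only shows how spreading copies shrinks the blast radius compared to one big array.

```python
import random

# Back-of-the-envelope model: hypothetical numbers, not a real design.
VMS = 500      # roughly the number of servers we lost in the outage above
HOSTS = 10     # hypothetical cluster size
COPIES = 2     # two copies of each VM's data, i.e. tolerate one host failure
TRIALS = 2000  # Monte Carlo trials

def avg_vms_down(failed_hosts: int) -> float:
    """Average number of VMs that lose *every* copy of their data when
    `failed_hosts` hosts die at once. Simplified: copies are placed on
    random distinct hosts; witnesses, capacity, and rebuilds are ignored."""
    total_down = 0
    for _ in range(TRIALS):
        failed = set(random.sample(range(HOSTS), failed_hosts))
        for _vm in range(VMS):
            copies = set(random.sample(range(HOSTS), COPIES))
            if copies <= failed:  # every copy sat on a failed host
                total_down += 1
    return total_down / TRIALS

# One big SAN: a single array failure takes down everything.
print(f"single array fails : {VMS} VMs down")
# Spread-out cluster: one host failure downs nothing; two concurrent
# failures only hit VMs whose copies both landed on those two hosts.
print(f"1 host fails       : {avg_vms_down(1):.1f} VMs down on average")
print(f"2 hosts fail       : {avg_vms_down(2):.1f} VMs down on average")
```

With one array, a single failure takes out all 500 VMs; with copies spread across the cluster, losing one host takes out nothing, and even two simultaneous host failures only touch the handful of VMs whose copies both happened to land on those hosts.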
But let me be clear! I am not naive; there are still pros and cons to using VSAN that will have to be worked out.
I know it won’t be easy to change how IT thinks about storage, i.e. SAN, NAS, NFS, FC, iSCSI.
We love BIG storage! But face it, there’s too much risk when all the data and VMs are in one place, and the cost only goes up when you replicate and snapshot to yet more SANs.
Big is out, Wide is in.
How do we move away from a BIG storage architecture with hundreds or thousands of spindles? I say with smaller, wider storage pools running directly on ESXi. For me, it’s a no-brainer because I see the value.
Recently I met with one of VMware’s leading experts on the topic, Rawlison Rivera, VCDX, and we went deep into the nuts and bolts of how VSAN works. Yes, his pitch was more about software-defined storage, but my interest was more in the practical application: smaller failure domains.
Big storage SANs will always have their place, but spreading your applications across multiple ESXi hosts with multiple copies really interests me. Sure, snapshots and replication do this too, but unless you have a proven and tested disaster recovery strategy (which we thought we had when our storage crashed), you’re just banking on [HOPE].
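For reference, here’s how the copy math works out with VSAN-style mirroring, as I understand the documentation: tolerating N host failures means keeping N+1 full copies of each object (plus small witness components), and the minimum supported cluster size is 2N+1 hosts. A tiny sketch, with the helper name being my own:

```python
def mirroring_footprint(failures_to_tolerate: int) -> dict:
    """Back-of-the-envelope numbers for VSAN-style mirroring.

    Tolerating N host failures means N+1 full data copies per object
    (plus small witness components, ignored here) and a documented
    minimum cluster size of 2N + 1 hosts."""
    n = failures_to_tolerate
    return {
        "data_copies": n + 1,
        "min_hosts": 2 * n + 1,
        "usable_capacity": f"about {100 // (n + 1)}% of raw",
    }

for ftt in (1, 2, 3):
    print(f"failures to tolerate = {ftt}: {mirroring_footprint(ftt)}")
```

The trade-off is plain: every extra failure you want to survive costs another full copy of the data, but the copies live on ordinary server hardware spread across the cluster rather than inside one array.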
I’ve learned my lesson.
As a tempered OPS manager, I don’t hope anymore. I test. I make sure we are not making assumptions. My thinking since the SAN failure has changed. I want smaller failure domains with less impact on the business when something goes BANG!
Do your own research and you’ll find there are many solutions on the market for spreading fault domains to reduce the impact of hardware failures. But for me, I’m listening to what my GUT tells me, and to what VMware has to say about VSAN. Why? Because my thinking about storage has changed.
This is obviously only one use case, and for now VSAN will only provide shared storage for VMware products. I’m guessing this might change in the future. As for how to architect storage using VSAN, there’s much to come, and books on how to design VSAN are yet to be written, according to a quick check on Amazon.
The other option is to use a smaller data storage server for each workload. That’s what we’ll cover next time.
Wrap Up:
I’ve briefly covered what has changed my mind about BIG storage solutions. I also covered what intrigues me about VSAN and why.
I see a new trend coming, which is why VMware is getting ahead of it with its own storage solution. Why? Because the risk is too high running all your VMs from one big pool of storage, and it’s our responsibility to protect the business when technology fails. Spreading the risk out onto smaller fault domains makes sense.
What I took away from the experience that opened this post was eye-opening: a reality check that even a million-dollar storage platform can fail! And when it happens, it isn’t the vendor who rebuilds…