I wish this information had been available 10 years ago when I started working with virtualization, but then again, like many reading this, I thought I was an expert and didn’t need it.
Now, I see myself as a student because of how fast virtualization changes.
Consider the Big Picture to Avoid These Virtualization Pitfalls
1. First and most importantly, look at the big picture for why you are implementing virtualization. Most managers look solely at VMware, XenServer, Hyper-V, or any other virtual server product for ROI (return on investment). Bad way to make IT decisions! Look at the big picture. How will virtualization affect everything and everyone it makes contact with? For example, how will your storage be affected when you start sharing it with hungry VMs, and will the I/O hold up? How will your system administrator handle the new responsibilities? How will you handle the users when they start complaining that everything is slow because you didn’t consider I/O and new responsibilities?
2. Once you have a big-picture view of what you want to do with virtualization, accept that you’re probably still missing a few things that you will only learn along the way. Treat these challenges as unavoidable growing pains. Virtual environments change constantly as they grow and get upgraded. First, there’s the experimental ESX, free ESXi, Hyper-V, or XenServer host that gets you started. Then, when the experimental hosts fill up, there’s the small farm of host servers that gets landed when you actually start purchasing new hardware and full infrastructure licenses. Beware of the “wow, we can virtualize everything” period that happens somewhere between 50 and 200 virtual servers. At this point, everything seems to work fine because you haven’t yet saturated your SAN I/O or your host memory and CPUs. But then comes VM number 201 (201 is a relative number; it could be more or less depending on a number of factors), where panic is unavoidable if you haven’t prepared properly. That’s why you need to read the rest of this post.
Virtualization Backup Strategy
3. Now that considerations 1 and 2 are out of the way, which are mainly there to make IT managers think, I’ll get to the good stuff. Have a backup strategy from the beginning that is designed for backing up VM images. Don’t rely on the legacy backup software you use for physical tower servers. Yes, NetBackup or whatever you run can still do agent-based file backups inside a VM; that’s a no-brainer. But how are you going to do a full system restore? Unless it’s just data, your backup administrator will still be trying to solve that riddle 5 hours after starting the system restore. You want a good solution that provides image-level backup. Options include VCB (VMware Consolidated Backup), vRanger Pro, Veeam Backup, and Avamar. These are all backup tools with specific support for virtualization. Avamar can work on any type of virtual environment, including Sun containers.
What Are Your Storage Limits (IOPS)?
4. Know your storage limits. Capacity is just one part of the storage requirement. The other part is I/O, measured in IOPS (input/output operations per second). VMs have different I/O needs. One hungry database or SharePoint VM on a LUN that shares its disk parity group with other LUNs can cause performance problems across every LUN in that group. The best way I have found to avoid this is to design your storage with the biggest I/O pool available. I/O begins at the disk: as a rule of thumb, a 15K disk delivers roughly 200 IOPS and a 10K disk roughly 150 IOPS, whereas a SATA disk delivers only 30 – 50 IOPS. Do the math. Which is better? After capacity and I/O comes the pathing, which needs to be manually configured to split I/O across multiple paths to the SAN/NAS cache. I’ve seen million-dollar equipment brought to its knees because this stuff was overlooked. It’s usually not the equipment (HP, NetApp, EMC) that causes the problem; it’s the configuration. Whether you plan to use FC, NFS, or iSCSI, this is important for your storage administrator to consider. Otherwise, you will be playing VM storage Tetris, and I guarantee you will lose.
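If you want to actually do the math, here is a minimal Python sketch. It assumes the rule-of-thumb per-disk figures quoted above (with SATA taken as 40, the midpoint of the 30 – 50 range) and simply adds up raw disk IOPS. It ignores RAID write penalties, cache, and pathing, so real usable numbers will be lower. The function names are mine, made up for illustration.

```python
# Rule-of-thumb per-disk IOPS from the post; these are estimates, not measurements.
# "sata" uses 40, an assumed midpoint of the quoted 30-50 range.
IOPS_PER_DISK = {"15k": 200, "10k": 150, "sata": 40}


def raw_pool_iops(disk_type: str, disk_count: int) -> int:
    """Sum of per-disk IOPS for a pool, ignoring RAID penalty and cache."""
    return IOPS_PER_DISK[disk_type] * disk_count


def disks_needed(disk_type: str, target_iops: int) -> int:
    """Smallest number of disks whose raw IOPS meets the target."""
    per_disk = IOPS_PER_DISK[disk_type]
    return -(-target_iops // per_disk)  # ceiling division


if __name__ == "__main__":
    for disk_type in IOPS_PER_DISK:
        print(
            disk_type,
            "12 disks give", raw_pool_iops(disk_type, 12), "raw IOPS;",
            "3000 IOPS needs", disks_needed(disk_type, 3000), "disks",
        )
```

Under these assumptions, hitting a 3,000 IOPS target takes 15 of the 15K disks, 20 of the 10K disks, or 75 SATA disks, which is the point of the comparison.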
5. This one goes hand in hand with number 4: VM template configuration. If you have a huge pool of I/O, you may never find out that your template configuration is poor. VM configuration is important and easy to overlook, and most people find out how important it is when the I/O runs out. I’ve read this best practice on many blogs: “put data and OS, and even swap files on separate LUNs.” I agree that it’s a good practice, but I take it a step further and add a criterion: “separate LUNs on separate disk parity groups.” Here’s why: ten 15K disks will give you roughly 1,500 IOPS in total, shared by every LUN carved out of them. Depending on the size of the drives, you may have several LUNs of 200 to 500 GB (each with 10 – 20 I/O-hungry VMs) that all share the same IOPS. Splitting data, OS, and swap onto more spindles gives you more IOPS and possibly an alternate path to the second storage processor (active/active), or more cache assigned to another FC or NIC port. Make sure datastore names include what the LUN is for (data, OS, or swap) and its odd or even disk parity group (data goes on odd, OS goes on even).
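To put rough numbers on the sharing problem, here is a small Python sketch. It divides one parity group’s IOPS evenly among the VMs on its LUNs, which is a simplification since real VMs never share evenly. It also checks datastore names against a naming scheme I made up for illustration, `<role>-pg<NN>-<label>`, where I read “odd” and “even” as the parity group number. Neither the scheme nor the function names come from any product.

```python
import re

# Hypothetical naming scheme: "<role>-pg<NN>-<label>", e.g. "data-pg01-sql".
NAME_PATTERN = re.compile(r"^(data|os|swap)-pg(\d+)-\w+$")


def iops_per_vm(group_iops: int, luns: int, vms_per_lun: int) -> float:
    """Average IOPS per VM when every LUN shares one parity group."""
    return group_iops / (luns * vms_per_lun)


def datastore_name_ok(name: str) -> bool:
    """Check the role and the odd/even parity-group rule in a datastore name."""
    match = NAME_PATTERN.match(name.lower())
    if match is None:
        return False
    role, group = match.group(1), int(match.group(2))
    if role == "data":
        return group % 2 == 1  # data goes on odd parity groups
    if role == "os":
        return group % 2 == 0  # OS goes on even parity groups
    return True  # swap: the post gives no odd/even rule for it


if __name__ == "__main__":
    # One 1,500 IOPS parity group carved into 3 LUNs with 15 VMs each.
    print(round(iops_per_vm(1500, 3, 15), 1))
    for name in ("data-pg01-sql", "os-pg02-win", "data-pg02-sql", "lun7"):
        print(name, datastore_name_ok(name))
```

With 3 LUNs of 15 VMs each on a single 1,500 IOPS parity group, each VM averages about 33 IOPS, which is in the range quoted for a single SATA disk. That is why the parity group, not just the LUN, is the unit to plan around.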
Practice Good House Cleaning
6. Clean up your messes. Don’t leave old proof-of-concept (POC) VMs or equipment running after the POC is done. Nothing is harder than cleaning up a VM environment 2 years after everyone on the original project team has left and your inventory has grown to 500 VMs. This is the first place to look when you hit your host and storage limits. Out of 500 VMs, you can bet there are at least 50 zombie VMs idly running and using up precious resources. Then there’s the clean-up of zombie VM folders: VMs that were improperly deleted, with their files left on the datastore (you know, the VM you said you’d delete later, and that was 2 years ago). Clean-up also helps control “sprawl.” Sprawl is a fancy word for out of control.
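A zombie hunt can start from whatever inventory export you already have. The Python sketch below is only an illustration: the `VmRecord` fields, the 2% CPU threshold, and the 90-day idle window are all assumptions of mine, not values from any hypervisor’s API. It flags powered-on VMs that look idle, and it lists datastore folders that no registered VM claims.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class VmRecord:
    """One row of a hypothetical inventory export."""
    name: str
    powered_on: bool
    avg_cpu_percent: float  # average over your review window
    last_activity: date     # last login or change, however you track it


def zombie_candidates(vms, today, max_cpu=2.0, idle_days=90):
    """Names of powered-on VMs with low CPU and no recent activity."""
    return [
        vm.name
        for vm in vms
        if vm.powered_on
        and vm.avg_cpu_percent <= max_cpu
        and (today - vm.last_activity).days >= idle_days
    ]


def orphaned_folders(datastore_folders, registered_vm_names):
    """Datastore folders that do not belong to any registered VM."""
    return sorted(set(datastore_folders) - set(registered_vm_names))


if __name__ == "__main__":
    inventory = [
        VmRecord("poc-crm", True, 0.5, date(2009, 1, 15)),
        VmRecord("sql01", True, 35.0, date(2010, 5, 30)),
        VmRecord("old-test", False, 0.0, date(2008, 3, 1)),
    ]
    print(zombie_candidates(inventory, today=date(2010, 6, 1)))
    print(orphaned_folders(["poc-crm", "sql01", "poc-old"], ["poc-crm", "sql01"]))
```

Treat the output as a list of candidates to review with the VM owners, not as a list of VMs to delete.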
Repeat of Backup Strategy
7. You probably didn’t hear me the first time, so I’m saying “Backups” again. I’m repeating this to make sure you have a backup solution that backs up the complete VM image. It’s no easy task to change the backup process 2 years and 500 VMs later, so make sure you do this right from the start.
What Are Your Infrastructure Standards?
8. Establish standards for your environment. For example: all hosts will run the certified version of ESX, or whatever hypervisor you use. Once you allow old hosts to stay around after you have started building new hosts on the current ESX version, it won’t be long before your virtual infrastructure is fragmented. Remember, virtualization is evolving almost daily, and new features are added with each new version of ESX and Hyper-V. Live migration wasn’t available in the original Hyper-V release, but it works on R2. However, it doesn’t work from R1 to R2 or from R2 to R1. Get all those R1 hosts upgraded to R2 so they are all the same and live migration works. Keeping the standard isn’t easy because VM administrators are also system administrators: they have to land the servers, configure the hosts, deploy the VMs, and configure the VMs. It’s the same people doing both jobs, and in some cases they are the storage and network administrators, too. Make sure you have enough staff to maintain your standards. I’ve known more than a few overworked, underpaid, and misunderstood VM administrators in my time.
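Checking a standard like this is easy to automate once you have a list of hosts and their versions. The Python sketch below assumes a hand-maintained dictionary of host name to hypervisor version. The host names, version strings, and function names are invented for illustration, and how you collect the data depends on your platform.

```python
from collections import defaultdict


def hosts_by_version(inventory):
    """Group host names by the hypervisor version they run."""
    groups = defaultdict(list)
    for host, version in sorted(inventory.items()):
        groups[version].append(host)
    return dict(groups)


def off_standard(inventory, certified_version):
    """Hosts that are not on the certified version."""
    return sorted(
        host for host, version in inventory.items()
        if version != certified_version
    )


if __name__ == "__main__":
    # Hypothetical inventory: host name -> hypervisor version.
    inventory = {
        "hv01": "Hyper-V R2",
        "hv02": "Hyper-V R1",
        "hv03": "Hyper-V R2",
    }
    print(hosts_by_version(inventory))
    print(off_standard(inventory, certified_version="Hyper-V R2"))
```

Any host that shows up in the second list is one that cannot take part in live migration with the rest of the farm, which makes it a good place to start the upgrade work.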
Documentation (Runbook)
9. I hate this one as much as any true IT professional, but someone has to keep doing the job if you leave for a better-paying job somewhere else. Make sure you keep good documentation. Cool Visio diagrams of everything are nice for management, if they’re required, but even more important for day-to-day support staff are “How To” documents. How to land and provision a host (hardware and hypervisor). How to deploy a VM. How to add disk space to the “C” drive of a VM. How to P2V a system. How to properly request more storage. How to decommission a VM. How to schedule a VM backup. How to recover a VM from a backup. Also, keep the “How To” documents up to date. You need a new “How To” for each version of ESX because the versions are not the same; customizing the swap volume, for example, is different on 2.5, 3.0, and 3.5. Hyper-V and XenServer have their own little tweaks as well.
Keep Your Tools Simple (Less is more)
10. Don’t buy every tool out there thinking it’s going to fix everything I have spent the last 2 hours writing about. Listen to what I am saying. Listen to your support staff. Be careful when you listen to vendors who want to sell you something, because there is no silver bullet for poor planning. And while we’re on the subject of vendors, any consultant’s recommendation that comes with a direct connection to an equipment vendor should also be scrutinized. I’ve seen the best SAN money can buy collapse under 25 VMs because it was used haphazardly (VM storage Tetris). Many of the problems I have warned you about can be avoided if you plan. Read numbers 1 and 2 again until this makes sense. To the VM administrators who are fighting the daily battles: I feel your pain, because most of what I have written about is already occurring in your virtual environment. To all the bright-eyed new IT managers and system administrators who are licking their chops because they are finally getting a budget to start virtualizing, I warn you and say, “Consider the big picture and plan, plan, plan!”
Conclusion
Hopefully, this post has been helpful. Items I did not cover include monitoring VMs and host servers, disaster recovery (DR) of virtual environments, capacity planning, forecasting, and hardware (server, network, and storage) brands and types. These can be topics for the next 10-biggie list. My final note is that “Backups” will challenge traditional thinking, so heed my warnings.