Managing Virtual Infrastructure
To begin, this post has nothing to do with the nuts and bolts of installing, configuring or supporting VMware or any other virtual infrastructure.
It’s about managing virtual infrastructure and building a solid team of admins and engineers to keep it alive.
Back in 2009, I wrote 10 Biggies to Help Managers and Admins Avoid Virtualization Pit-Falls, and since then not much has changed for the better. Managing virtual infrastructure is not for the faint of heart. It takes creativity, determination and a kick-ass team…
For the IT manager who has just been handed the job of managing virtual infrastructure, it's a great opportunity. Then the honeymoon ends and it becomes a lot of work, especially if you're taking over after there were problems…
6 Keys to Managing Virtual Infrastructure Like a Pro!
1. Detailed Audit
Start with a detailed audit of all the hardware and software the VI is built from (also known as currency). I mean the servers the ESX/ESXi hosts are running on (make, model, warranty dates), the storage provisioned to them (make, model, design), how the network is designed (1/10 Gb, fiber, flex, Xsigo, FCoE, copper, bandwidth, firewalls), and the versions of ESX/ESXi and vCenter. Part of the "Big" picture!
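If vCenter is your source of truth, a short script can seed the audit spreadsheet for you. Below is a minimal sketch using the open-source pyVmomi SDK; the vCenter address and credentials are placeholders, and warranty dates will still have to come from your hardware vendor.

```python
# Minimal audit sketch using pyVmomi (the open-source vSphere SDK for Python).
# The hostname and credentials below are placeholders -- adjust for your vCenter.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()           # lab-only: skip cert validation
si = SmartConnect(host="vcenter.example.com",
                  user="audit@vsphere.local", pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        hw = host.summary.hardware               # server make/model, CPU, memory
        prod = host.summary.config.product       # ESXi version and build
        print(f"{host.name}: {hw.vendor} {hw.model}, "
              f"{hw.numCpuCores} cores, {hw.memorySize // 2**30} GB RAM, "
              f"{prod.fullName}")
    view.Destroy()
finally:
    Disconnect(si)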
2. Health Check
While collecting all the information for your audit, you'll also need to find out what's going on: what's working and what's not. You may have multiple data centers with different cluster designs that perform differently because of how they were set up. They may use the same hardware and ESX version, but because of the way networking or storage is set up they behave totally differently. Another part of the "Big" picture.
Note: If you have the budget, VMware provides a Professional Service to do this.
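If the budget isn't there, you can still script a rough first pass yourself. The sketch below reuses the pyVmomi connection from the audit sketch above and simply flags clusters whose hosts disagree on ESXi build or hardware model; it's a drift signal, not a full health check.

```python
# Rough drift check: flag clusters whose hosts disagree on ESXi build or model.
# Assumes `si` is an existing pyVmomi connection (see the audit sketch above).
from pyVmomi import vim

def cluster_drift(si):
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        builds = {h.summary.config.product.fullName for h in cluster.host}
        models = {h.summary.hardware.model for h in cluster.host}
        if len(builds) > 1 or len(models) > 1:
            print(f"{cluster.name}: mixed builds {builds} / mixed models {models}")
    view.Destroy()
```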
3. Team Dynamics
Not every System Admin is cut out to be a virtualization engineer, because there's more to building a solid virtual infrastructure than building servers. In my opinion, a virtualization engineer is someone who knows storage, networks, security, hardware, Windows/Linux, and scripting. Check out my blog on How to Apply for a VMware Engineer Job for what to look for when staffing your team.
A good team has a lead designer who knows the environment and can define clear requirements, and installers who can install and configure ESX/ESXi and vCenter according to the design. These two roles can be the same person and focus mainly on the back end. Then there are the front-end folks who handle the day-to-day operations of creating templates, provisioning VMs, customizing VMs, P2Vs, and patching; they should also handle installing and updating VMware Tools. (OS management and support should be handled by SysAdmins.)
Most importantly, you'll need directors or executive leadership above you who will let you do the job right and who understand that virtual infrastructure is not server technology; it's infrastructure.
Good team dynamics provide a wide set of job roles, so virtual infrastructure can be handled as a convergence of technologies and treated as a data center, not as a server. Far too often VI is treated like servers, and that causes issues for the person managing the VI and the team.
4. Road Map and Remediation
Once you have your audit, health check and team, you'll need a remediation strategy and a road map for future growth. My suggestion for remediation is to standardize all your hardware, ESX installs, and storage and network designs. Try to keep your building blocks as standard configurations.
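One lightweight way to hold the line on a building-block standard is to write the standard down as data and diff reality against it. Here's a plain-Python sketch; the baseline values are made up for illustration.

```python
# Hypothetical building-block baseline: the values are examples, not recommendations.
BASELINE = {
    "esxi_build": "VMware ESXi 5.1.0 build-799733",
    "nic_speed_gbps": 10,
    "ntp_servers": ["ntp1.example.com", "ntp2.example.com"],
}

def deviations(host_facts):
    """Compare one host's audited facts against the baseline and list mismatches."""
    return [f"{key}: expected {expected!r}, found {host_facts.get(key)!r}"
            for key, expected in BASELINE.items()
            if host_facts.get(key) != expected]

# Example: facts gathered during the audit for a single host
print(deviations({"esxi_build": "VMware ESXi 5.0.0 build-469512",
                  "nic_speed_gbps": 1,
                  "ntp_servers": ["ntp1.example.com", "ntp2.example.com"]}))
```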
Avoid letting every Tom, Dick and Mary deviate from Best Practices because (s)he wants to try something new, even if they are the smartest on your team. New things should be planned, tested, and then added to the road map, not just thrown into production ad hoc.
I'd bet a dollar that most completed health checks and audits will turn up a slew of odd storage, hardware and configurations from someone's test or POC environment that became production and then got forgotten. I've done, and been involved in, enough POCs in the last 15 years to know how quickly someone's POC can turn into production.
Somewhere above I mentioned a road map. In a nutshell, a road map is a life-cycle management plan for removing the old stuff and bringing in the new. As I stated, you will need determination to stay on track with your road map. Don't be too aggressive here, because as environments grow it takes a long time to properly phase the old out and the new in, not to mention you will be juggling this with everything else going on. One more thing: don't forget to celebrate these accomplishments and report them to your boss. Achieving a milestone is important news.
5. Performance and Health Monitoring (Check Engine Lights)
I like to think of problem areas as check engine lights. Do not ignore check engine lights or your VMs will crash. Correction: most likely lots of VMs will crash at the same time, and services will be offline until the issues are fixed. Then the blame game starts and someone will say to you, "I knew about that problem, but I didn't have time to fix it because I was too busy supporting projects." At this point you will go crazy!
Whether you use 3rd-party tools or custom scripts to monitor performance and health, a good monitoring strategy is an absolute must-have. Nothing is worse for a VI manager than 2 AM calls and support tickets about poor VM performance. You and your team need to stay ahead of problems, or else they will spread like wildfire and impact all the VMs running in your vDC (virtual data center).
The problem I've found is that most tools try to do too much and are usually bad at reporting what is important. Alert and information overload about every VM, vDisk, vCPU, memory, ESX host or cluster will soon get ignored (out of the 5,000 emails sent, which one is important, right?). Not to mention how much effort it takes to customize some tools to work right or give you what you need. Also, a lot of tools require a full-time resource from your team to focus on fixing all the issues they find, and believe me, even the healthiest environment will generate loads of alerts.
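Whatever tool or script you end up with, the principle is the same: surface only what crosses a threshold you actually care about. As an illustration, here's a small pyVmomi sketch that reports only datastores below a chosen free-space percentage instead of mailing you the status of everything; the 15% threshold is just an example.

```python
# Report only datastores below a free-space threshold -- everything else stays quiet.
# Assumes `si` is an existing pyVmomi connection; 15% is an arbitrary example threshold.
from pyVmomi import vim

def low_space_datastores(si, min_free_pct=15):
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)
    for ds in view.view:
        s = ds.summary
        free_pct = 100.0 * s.freeSpace / s.capacity if s.capacity else 0.0
        if free_pct < min_free_pct:
            print(f"ALERT {s.name}: {free_pct:.1f}% free "
                  f"({s.freeSpace // 2**30} GB of {s.capacity // 2**30} GB)")
    view.Destroy()
```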
6. Capacity Management
Building too much capacity or not enough can be costly. The trick here is to build a set, known amount of capacity and then allow that capacity to be consumed. Then, on a set schedule, rebuild the capacity. Trying to build capacity "Just in Time" will always keep you behind the curve and reactive; building a pool of capacity is more like having "Capacity as a Service" and lets you control when you build and how much you need to rebuild. Both work, just one is reactive and the other is proactive. This item is a huge topic of its own and I will try to cover it another day.
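To make the "pool of capacity" idea concrete, here's a back-of-the-envelope sketch in Python. Every number in it is invented, and it deliberately ignores that adding a block also grows the pool; the point is the pattern of rebuilding against a headroom floor rather than scrambling after the pool runs dry.

```python
# Back-of-the-envelope capacity pool check. Every number here is an invented example.
import math

POOL_GB = 40_960          # total memory in the pool (hypothetical: 10 hosts x 4 TB)
HEADROOM_FLOOR = 0.25     # rebuild once less than 25% of the pool remains free
BLOCK_GB = 4_096          # one standard building block (hypothetical: one 4 TB host)

def blocks_to_add(consumed_gb):
    """Return how many building blocks to add once free space drops below the floor."""
    free = POOL_GB - consumed_gb
    if free / POOL_GB >= HEADROOM_FLOOR:
        return 0                                   # still above the floor: nothing to do
    shortfall = POOL_GB * HEADROOM_FLOOR - free    # GB needed to get back to the floor
    return math.ceil(shortfall / BLOCK_GB)

print(blocks_to_add(consumed_gb=33_000))           # -> 1 block for these example numbers
```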
Summary
To summarize what I've written, Managing Virtual Infrastructure 101 is about managing virtual infrastructure from a broad scope with a planned set of practices, rather than from an ever-evolving, dysfunctional sprawl of hardware, storage, network, VMs, and guest OSes thrown together as required.
Learn to see the big picture and avoid anything temporary because once it’s in production it will not be easy to undo.
Note: this is for large environments with multiple team members; smaller shops may have one or two people doing everything from landing hardware to installing networks. The two are different and have their own sets of challenges. This blog focuses on the larger shop with dedicated teams.