Managers! Is it time for a VMware health check to make sure your virtualization investment is not wasted?
Improve Uptime by Keeping vSphere in Shape
I know infrastructure budgets are tight, and I also know how difficult it can be to balance CAPEX and OPEX from month to month.
Over the years, I’ve had the pleasure of creating SLA Uptime reports as well as calculating monthly staff and environment utilization down to the kilowatt.
This is why a regular VMware health check is important: it helps keep VMware cost of ownership low by exposing waste and providing opportunities to take proactive action.
Barring some of the finer details, here’s what I look for:
1. Start by going vZombie Hunting!
The biggest culprit of a poorly utilized vSphere environment is VM sprawl.
What is sprawl?
Sprawl is when virtual machines get deployed for a project and then they are left unused, or when old services get upgraded with new virtual servers…
…but the old VMs never get decommissioned. Or when the service gets shut down but the servers never get turned off …
Stop VM Sprawl Before It’s Too Late!
When sprawl happens you end up with valuable resources burning cycles on storage, servers, and network hardware for nothing. They are vZombies (BTW – I started using this term years ago) and perfect candidates for decommissioning.
Hunting vZombie servers isn’t easy unless you have a tool such as VMware OpsManager or VKernel.
The other way is to create a custom VMware health check PowerShell script for checking and logging when there is no CPU, memory, or network traffic on a VM.
Normally a flatline is a good indicator of a zombie.
Once you track down the vZombies and check with the service owner to get the go-ahead, turn them off, back them up, and delete these VMs from your vCenter inventory. (Note: follow the standard decommissioning process)
Also, don’t forget that some server IP addresses have firewall rules and VIPs associated with them, so clean those up, too!
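Here is a minimal sketch of the flatline check described above, written in Python rather than PowerShell so it can run anywhere. It assumes you have already exported per-VM samples for CPU, active memory, and network traffic from your monitoring tool. The metric names, thresholds, and VM names are all hypothetical and should be tuned for your environment.

```python
from statistics import mean

# Hypothetical thresholds: a VM "flatlines" when the average of EVERY
# metric stays below its limit over the sampling window.
FLATLINE_THRESHOLDS = {
    "cpu_mhz": 50.0,
    "mem_active_mb": 64.0,
    "net_kbps": 5.0,
}


def is_flatline(samples, thresholds=FLATLINE_THRESHOLDS):
    """Return True when every tracked metric averages below its threshold."""
    for metric, limit in thresholds.items():
        values = samples.get(metric, [])
        if not values:
            # No data is not proof of a zombie, so do not flag the VM.
            return False
        if mean(values) >= limit:
            return False
    return True


def find_zombie_candidates(vm_samples, thresholds=FLATLINE_THRESHOLDS):
    """Return the sorted names of VMs whose metrics all flatline."""
    return sorted(
        name for name, samples in vm_samples.items()
        if is_flatline(samples, thresholds)
    )


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    vm_samples = {
        "old-web01": {
            "cpu_mhz": [10, 12, 9],
            "mem_active_mb": [20, 22, 21],
            "net_kbps": [0.5, 0.2, 0.1],
        },
        "prod-db01": {
            "cpu_mhz": [1800, 2100, 1950],
            "mem_active_mb": [12000, 12500, 11900],
            "net_kbps": [900, 1200, 1100],
        },
    }
    for name in find_zombie_candidates(vm_samples):
        print(f"Zombie candidate, confirm with the service owner: {name}")
```

The output is only a list of candidates. A quiet VM may be a standby or a month-end batch server, so the check with the service owner still applies.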
2. Retire Old Server Hardware that is OOW and EOL
No VMware health check would be complete without retiring Out of Warranty (OOW) and End of Life (EOL) blade or rack server hardware.
Read my lips, this hardware is wasting your ESXi licenses because you cannot get enough memory or CPU cores in these systems to leverage your “per socket” ESX license efficiently.
Old servers are inefficient for VMware, and server hardware that is OOW and EOL should be retired ASAP!
For example: One loaded HP 380G8 or Dell R420 can handle more memory and CPU cores than 4 to 6 old servers and will still use only 2 ESX licenses. Consolidating onto new servers also reduces rack U, uses less power and cooling, lowers the port count on switches, lowers warranty renewals, reduces management overhead because there are fewer physical servers, and lowers downtime from failures of tired junk.
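To put rough numbers on the licensing point, here is a small Python sketch of the arithmetic. The socket counts and memory sizes are hypothetical placeholders, not vendor specifications, and the sketch assumes one license per populated CPU socket.

```python
def licenses_needed(hosts):
    """Per-socket licensing: one license for each populated CPU socket."""
    return sum(host["sockets"] for host in hosts)


def memory_per_license(hosts):
    """Total host memory (GB) divided by the licenses those hosts consume."""
    return sum(host["memory_gb"] for host in hosts) / licenses_needed(hosts)


# Hypothetical inventory: five old two-socket hosts vs. one new two-socket host.
old_hosts = [{"name": f"old-{i}", "sockets": 2, "memory_gb": 32} for i in range(1, 6)]
new_hosts = [{"name": "new-1", "sockets": 2, "memory_gb": 384}]

if __name__ == "__main__":
    for label, hosts in (("old", old_hosts), ("new", new_hosts)):
        print(
            f"{label}: {licenses_needed(hosts)} licenses, "
            f"{memory_per_license(hosts):.0f} GB of memory per license"
        )
```

With these made-up figures, the old hosts consume 10 licenses for 160 GB of memory, while the single new host consumes 2 licenses for 384 GB. Plug in your own inventory to see how much memory each license actually buys you.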
Another best practice is to physically dispose of OOW and EOL servers so your admins can’t put them back into service.
I’ve seen too many junk servers pulled from the boneyard and put back into service because they were available.
I repeat…Old servers are inefficient for VMware!
If it’s not on the current VMware HCL it should be disposed of…
3. Standardize Configuration for Good VMware Health
A little bit of HP EVA, and a little bit of NetApp, and a little bit of local disk might make for a good song lyric, but they add up to a vSphere that is hard to manage, optimize and keep efficient.
And the same goes for mix-and-match servers of all makes and models. Mismatched memory and CPU configurations in the same cluster are a no-no!
Some servers with 32GB, others with 64GB, and even others with 192GB all in the same vSphere ESX cluster… this is a recipe for data loss and poor uptime.
A good best practice to follow is taking inventory of your equipment and enforcing standardization of hardware configurations.
This is key to optimizing your VMware investment because one-off environments are trouble and need to be on the VMware health check report so they can be dealt with!
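The inventory step can be sketched as a short Python check that groups hosts by cluster and flags any cluster containing more than one hardware configuration. The field names (`cluster`, `cpu_model`, `memory_gb`) and the sample hosts are hypothetical, and the sketch assumes the inventory has already been exported from vCenter.

```python
from collections import defaultdict


def find_mixed_clusters(hosts):
    """Map each non-standard cluster to its distinct (cpu_model, memory_gb) configs."""
    configs = defaultdict(set)
    for host in hosts:
        configs[host["cluster"]].add((host["cpu_model"], host["memory_gb"]))
    # A standardized cluster has exactly one configuration.
    return {
        cluster: sorted(found)
        for cluster, found in configs.items()
        if len(found) > 1
    }


if __name__ == "__main__":
    # Made-up inventory for illustration only.
    inventory = [
        {"name": "esx01", "cluster": "Prod-A", "cpu_model": "CPU-X", "memory_gb": 32},
        {"name": "esx02", "cluster": "Prod-A", "cpu_model": "CPU-X", "memory_gb": 64},
        {"name": "esx03", "cluster": "Prod-A", "cpu_model": "CPU-Y", "memory_gb": 192},
        {"name": "esx04", "cluster": "Prod-B", "cpu_model": "CPU-Y", "memory_gb": 192},
        {"name": "esx05", "cluster": "Prod-B", "cpu_model": "CPU-Y", "memory_gb": 192},
    ]
    for cluster, found in find_mixed_clusters(inventory).items():
        print(f"{cluster} is not standardized: {found}")
```

Any cluster this prints belongs on the health check report as a one-off environment.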
4. Report Bloated Virtual Servers with Too Much Memory, CPU, and Disk
Finally, a thorough report will include resource waste such as VMs that were created with too much memory, CPU, and disk space (aka over-provisioned).
Overusing valuable vSphere resources is common in some vSphere environments because engineers and developers are used to ordering servers based on physical criteria. This is because they have never been shown the proof their servers are only using 20% of the resources they have provisioned (another reason for a good tool).
The unfortunate thing here – depending on the scale – is you may not be able to clean up existing systems because too much work may be involved. But you can start to trim back resources on newly deployed VMs.
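To show owners the proof, a report only needs to compare what each VM uses against what it was given. Below is a minimal Python sketch that assumes peak usage figures have already been collected. The field names, the sample VMs, and the 20% cutoff are hypothetical choices for illustration.

```python
def utilization(used, provisioned):
    """Fraction of a provisioned resource that is actually used."""
    return used / provisioned if provisioned else 0.0


def find_overprovisioned(vms, threshold=0.20):
    """Return (vm_name, [resources]) for resources used at or below the threshold."""
    report = []
    for vm in vms:
        ratios = {
            "cpu": utilization(vm["cpu_peak_mhz"], vm["cpu_provisioned_mhz"]),
            "memory": utilization(vm["mem_peak_gb"], vm["mem_provisioned_gb"]),
            "disk": utilization(vm["disk_used_gb"], vm["disk_provisioned_gb"]),
        }
        wasted = [name for name, ratio in ratios.items() if ratio <= threshold]
        if wasted:
            report.append((vm["name"], wasted))
    return report


if __name__ == "__main__":
    # Made-up usage data for illustration only.
    vms = [
        {"name": "app01", "cpu_peak_mhz": 400, "cpu_provisioned_mhz": 4000,
         "mem_peak_gb": 2, "mem_provisioned_gb": 16,
         "disk_used_gb": 30, "disk_provisioned_gb": 200},
        {"name": "db01", "cpu_peak_mhz": 3000, "cpu_provisioned_mhz": 4000,
         "mem_peak_gb": 28, "mem_provisioned_gb": 32,
         "disk_used_gb": 400, "disk_provisioned_gb": 500},
    ]
    for name, wasted in find_overprovisioned(vms):
        print(f"{name} looks over-provisioned on: {', '.join(wasted)}")
```

Using peak rather than average usage is a deliberately cautious choice here, so a VM is only flagged when it never comes close to what it was given.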
Small, Medium, Large Virtual Servers
A good best practice is to define a few standard VM sizes, such as small, medium, and large, each with a fixed memory, CPU, and disk configuration.
This will also make capacity management easier since now you have a set block that you can calculate capacity from.
This is not uncommon and most cloud services use standard sizes for their VMs.
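As a rough illustration of capacity planning with fixed blocks, the Python sketch below works out how many VMs of each size fit into the free resources of a cluster. The size definitions and free-capacity numbers are hypothetical, and the sketch ignores vCPU overcommit and HA headroom for simplicity.

```python
# Hypothetical standard sizes; pick values that suit your own workloads.
VM_SIZES = {
    "small": {"vcpu": 1, "memory_gb": 2, "disk_gb": 40},
    "medium": {"vcpu": 2, "memory_gb": 4, "disk_gb": 80},
    "large": {"vcpu": 4, "memory_gb": 8, "disk_gb": 160},
}


def vms_that_fit(free, size):
    """How many VMs of one size fit; the scarcest resource sets the limit."""
    spec = VM_SIZES[size]
    return min(free[resource] // amount for resource, amount in spec.items())


if __name__ == "__main__":
    # Made-up free capacity for one cluster.
    free_capacity = {"vcpu": 64, "memory_gb": 256, "disk_gb": 4000}
    for size in VM_SIZES:
        print(f"{size}: room for {vms_that_fit(free_capacity, size)} more VMs")
```

Because every VM is one of a few known blocks, answering "how many more can we host?" becomes simple division instead of a per-server estimate.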
5. Take Action ASAP…
A good VMware health check documents and lists all offenders of these best practices. Once the report is completed you will want an action plan that road maps:
- How zombie VMs are to be decommissioned to get rid of sprawl and reclaim resources.
- How old hardware will be replaced and disposed of to get rid of inefficient server hardware and improve TCO and uptime.
- How storage, servers, and network systems will be standardized and consolidated to reduce overall CAPEX and OPEX and make them easier to manage.
- How VM configurations will be standardized and made more efficient.
In Conclusion:
Overall, this VMware health check focuses on cleanup tasks, standardization, and consolidation that will make your vSphere more efficient and help increase the return on your investment and reduce downtime.
There are other tasks you can add to the check such as:
- VMware Tools and firmware updating
- ESXi upgrading
But I listed key areas that should be included in your standard VMware health check report.
VMware Health Check Scripts and Services:
- Order a VMware Professional Service Engagement
- Old ESX Health Check Script
- Another Script with Report
Do you have a recommendation to add?
Thanks for the non-technical health check list. I’ve been trying to find something like this but all the health check links are for scripts that check a million things.
-LA