VMware vSphere Troubleshooting Guide for Beginners

Troubleshooting VMware

Why Is My App Running Slow On VMware? (Book Two)

12 Months After The Virtualization Project…

Frustrated? You’re not the only one suffering from a slow application. I feel your pain and have spent days on bridge calls troubleshooting application problems.

Yes, I know what’s at stake when a Sev 1 incident takes down your key business application. In fact, I’ve been on too many calls at 2 A.M., and it’s time for this madness to stop!

The Think Service First series is a guide for helping new IT managers get started. These eBooks take the guesswork out of running a busy infrastructure team.

Written by an experienced vSphere cloud manager, each book in the series covers a different topic, offering years of insight from someone who worked through the difficult problems of growing a cluster of 30 hosts into a robust cloud with more than 600 ESXi hosts and nearly 7,000 virtual machines.

Vbeginners.com

VMware vSphere Troubleshooting Guide for Beginners – Joe Sanchez

The second book in the series, “Why is my app running slow on VMware?” picks up where the first book, “VCP for Hire,” left off. It’s a dedicated guide for managers who are dealing with vSphere performance problems that are causing tier-one applications to run slow.

Using a systematic diagnostic process of elimination, Joe takes you through a series of questions about the network, storage, vSphere, and VM stack in search of the root cause, or causes. Each list of questions is designed to narrow down the field of possibilities until the cause of the VM performance problem is found.

In “Why is My App Running Slow on VMware,” Joe covers four common problems that cause application slowness on virtual servers running on VMware vSphere.

To set the right expectations: because the platform is so broad and touches just about every type of hardware used for building infrastructure, this is not a deep technical how-to guide. That said, it is a light at the end of the tunnel, pointing you in the right direction and showing you where to look to solve your application performance problems – if, in fact, your application problem is due to poor virtual server performance.

Think Service First and let “Why is My App Running Slow on VMware?” be the first vSphere troubleshooting guide you turn to for help.

VMware vSphere Troubleshooting Guide

Table of Contents

  • Introduction
  • Chapter One – Frustrated (Sev 1, Slow Application)
  • Chapter Two – Network Strangulation (Hypertension, Database Disconnects)
  • Chapter Three – Storage Nausea (High Latency, Slow App)
  • Chapter Four – vHoudgePoudge (Your Baby is Ugly, Unstable Apps)
  • Chapter Five – VM Chaos (Best Efforts, No Standards)
  • Chapter Six – Remediation (Fix It, Capacity Management)
  • About the Author (Who is Joe)

Introduction

Epiphany, now I get it.

Wow, it’s taken me over a decade to get here.

All these years it was always right in front of me.

Do you know what I am talking about…do you see it?

It’s right there, in front of you too.

Let me explain.

When I started my IT career in the 90s, like many IT professionals I was taught to approach problems with the best technical solution possible, using the least amount of work.

But over the years I have learned the best technical solution isn’t always the most useful or user-friendly solution, nor is it the easiest. Thus, the reason I am writing this series of “Think Service First” eBooks.

Think about what I am going to tell you next.

Way back when VMware first came along with GSX and ESX there wasn’t talk of “The Cloud.”

It’s true.

The problem VMware was solving was server utilization and ROI. Back then, nearly 80% of memory and CPU resources were wasted on physical servers crowding up data centers.

Think about my last statement for a moment. The problem was utilization, not a need for a cloud.

OK, let’s move on.

So with this in mind, I learned, like many other system engineers, how to cram as many VMs as I could on a host.

Like I said, VM server quantity came “first,” and quality – such as how many users could log into the application and actually work – came next.

Why?

I already explained why, because utilization was the problem we were solving.

As I recall, the beginning was rocky for many admins and it turned into a DEV versus IT conflict because developers started complaining about slow VMs. But who cared then if there were complaints? Our mission was to increase ROI. Optimization was the goal and we were achieving it.

We became the CPU and memory police.

There were many wins and losses as we tightened the reins on resources, and the battle raged on. And over the years many of us admins became masters at loading up ESXi hosts with masses of VMs.

I recall one of my first blog posts in 2007 was about how many VMs could be handled per physical CPU and it’s still one of my most popular posts.

So here I am today, many hundreds of P2Vs later and nearly a thousand ESXi host deployments under my belt. Blink, the light has finally come on. Epiphany, SERVICE comes first!

About the “Think Service First” Series

Actually, the series started with VCP for Hire (2012), which is about how to determine who the best talent is for your virtualization team before you hire them. And the next books in the series will pick up where book one left off.

Each new release will dive into the many lessons I have learned over the last seven years as a VMware Engineer and Operations/Infrastructure Manager.

Why is My App Running Slow on VMware is book two in the Think Service First Series.

Enjoy!

 Chapter One – Frustrated (Sev 1, Slow Application)

Frustrated? You’re not the only organization suffering from a slow application. And I can relate because I know your technical teams have tried everything to figure out what’s causing slowness (been there, done that many times).

Now, after days of troubleshooting, they’re absolutely sure it’s not the application or database causing the problem, so it must be the VMs (been there, too).

Does this sound familiar?

It’s a pretty common story nowadays, especially after years of fast-paced innovation projects to get data centers virtualized.

I guess we can say, “Now we have a legacy virtualization problem.”

Wow!

We didn’t expect to hear those words so soon because didn’t we just migrate everything to VMware or the cloud 12 months ago? And now I’m telling you we have issues because of it!

Not everyone has this problem – but, yes – “Houston, we have a problem.”

Here’s how it probably happened.

Since 2007, we’ve been involved in virtualization projects and they all seem to follow the same pattern: build a brand new environment and then start P-2-V-ing everything not tied down. (P2V means converting a physical server to a virtual server.)

The goal was to virtualize 100% of bare-metal servers as quickly as possible and reduce hardware and other costs associated with running a physical infrastructure.

Well, in the frenzy a new problem was created. The problem is called “over-subscription,” and it is caused by too much of a good thing. Basically, it results from over-leveraging, sharing the same compute resources with too many subscribers or consumers.

How did we create these new problems?

Well, think about it.

We had a bunch of old servers running on dedicated physical hardware, and then we built a big pool of shared hardware (some new and some re-purposed), and then we migrated (P2V’d) everything to a vSphere (VMware Cloud). Maybe this project even had a cool name like Phoenix, Genesis, or Hammer?

Sure, in the beginning, it was great and everyone was virtualizing. But now the fact remains: 12-24 months later you have everything competing for the same storage IO, the same network bandwidth, and the same compute power.

Did you calculate capacity for 12 – 24 months of network, storage, and server hardware growth? Probably at first, but you’ve greatly exceeded your estimations. And sprawl consumed all your capacity months ago.

Many businesses, in the frenzy to save money since 2008, have been leveraging their EXISTING hardware to the gills. And now that the economy is better, they are suffering from slowness and downtime on their most important applications because nobody has been properly managing capacity.

And now there’s a new frenzy going on to solve capacity problems…maybe you’ve already joined in and you’re planning to move to the public cloud?

Or at least some of your more important applications, right? I’ve seen it and heard it before, too.

Here’s the truth.

The same thing will happen in the cloud in 12 months that is happening now in your data center. There’s no free lunch and cloud vendors are banking on “Sprawl” so they can eat you alive on fees. Think cell service fees and it will put you in the right mindset.

Why will this happen? I’ll tell you: because you haven’t fixed the problem.

The problems you are dealing with are a result of bad planning, bad design, and/or shortsightedness.

As the saying goes, “If we only knew then, what we know now.”

And – what – have we learned? We learned to leverage, and then we over-leveraged, which is now causing application slowness.

So here’s what you should do about your frustration.

First, get a grip!

Slow down.

Stop the frenzy.

And for a minute stop listening to vendors and listen to me.

Before you rush out and buy VMs on Amazon or another cloud and make the problem worse, consider fixing the internal problem.

In the next short chapters, I’m going to share with you important insights I’ve learned over the last seven years as I dealt with the same problems.

There’s no sales pitch because I’m not trying to sell you anything new (except maybe a $3 eBook). I’m just offering some good-old common sense from lessons I learned the hard way about networks, storage, vSphere, and VMs.

So are you ready to think outside the box and be objective?

In each of the next chapters, there will be a list of questions. Think objectively and ponder them. As you read them, make a list of your own action items you know need checking. Each chapter will end with a time of retrospection. My goal is to put you in the ballpark and lead you to the root cause of your performance problem, as well as provide insight into other problems that may be brewing beneath the surface.

Here we go.

 Chapter Two – Network Strangulation (Hypertension, Database Disconnects)

Strangulation means to choke. Is your application problem a result of the connection from the application to the database being choked?

If you are 100% sure it’s not a database problem (indexes, queries, or configuration) or an application problem (code or configuration), then let’s start at the core of the stack and see what we find.

Let’s assume for now your network is choking and is causing your application to disconnect from the database. For the record, this is a very common problem where virtualization is in use.

Why is it choking?

Let’s find out. (Don’t forget to make your own list.)

  1. Is “data” and “maintenance” traffic on the same network switches (it doesn’t matter if they are on different VLANs)?
  2. Have you checked for database backups, snapshots, and replication jobs copying blocks or files from a source to a destination over the same network hardware while the problem is happening? Check to see how long these maintenance jobs are running and when.
  3. Do you have hundreds of VM back-ups happening via file copy or snapshots and going over the same network hardware as your data traffic? Check to see how long these jobs are running and when.
  4. Are automatic storage vMotions (storage-DRS), migrations, or server cloning jobs going on over the same network switches with data? Check to see how long they are running and how often.
  5. Do you have your network configuration for storage IO such as iSCSI and/or NFS using the same hardware as the data network? This is not as big a problem on 10Gb equipment, but it is still worth checking.
  6. Do you have mismatched network settings from ESXi NICs to switch ports (a very common problem)? Check your MTU configuration for consistency from the host, to the switch, and to the storage. Do not mix 1500 MTU with 9000 MTU (jumbo frames). If you have VLANs trunked for data traffic on the same NICs and ports, then only use 1500 MTU settings. (See the sketch after this list for a quick way to dump these settings.)
  7. Are you adding more ESXi hosts to already loaded network equipment just because there are open ports on the switch? An open port does not mean you have the capacity, especially if you are still on 1Gb. Set thresholds and add more equipment when a threshold is breached.

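If you want a quick way to check question six across the whole environment instead of clicking through every host, a short script against vCenter can dump the MTU settings in one pass. Here is a minimal sketch using the open-source pyVmomi SDK; the vCenter address and credentials are placeholders, it skips certificate checks for brevity, and it only covers standard vSwitches (distributed switches are exposed separately).

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # Placeholders: point these at your own vCenter. The unverified SSL
    # context is a lab convenience only.
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="changeme",
                      sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)

    for host in view.view:
        net = host.config.network
        print(f"\n{host.name}")
        # Standard vSwitches only; distributed switches live in net.proxySwitch.
        for vsw in net.vswitch:
            print(f"  vSwitch {vsw.name}: MTU {vsw.mtu}")
        # vmkernel adapters (vmk0, vmk1, ...) with their configured MTU and IP.
        for vnic in net.vnic:
            print(f"  {vnic.device} on '{vnic.portgroup}': "
                  f"MTU {vnic.spec.mtu}, IP {vnic.spec.ip.ipAddress}")

    view.Destroy()
    Disconnect(si)

Compare the output against your physical switch ports and storage targets; any host where a 9000 MTU vmkernel adapter sits behind a 1500 MTU switch port (or the reverse) is a prime suspect.
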
Each of these questions could lead you to the root cause of your application problem. And you may find more than one of these problems existing at the same time. Or you may have thought of something else. Write it down and let’s keep going.

Let’s take a minute now for retrospection.

First, you are not alone.

These issues are prevalent and they’re easy to fix. Normally they result from unplanned growth and a lack of capacity planning for network utilization.

They can happen when people are rushing to get something done, and when there is no (or very little) inter-team collaboration. For example, a VMware admin asks for VLANs and IPs but doesn’t give any details on the traffic type, such as data, iSCSI, NFS, or frame size.

Or when someone decides to start a new data backup job over the network without finding out what the existing network utilization is and doing further due diligence to figure out the impact.

There can be multiple solutions for the same network problem, but start with the practical ones first. And furthermore, start with solutions that do not cost anything. That’s right, zero dollars. I told you I’m not trying to sell you anything.

Begin by getting maintenance traffic off the data network and freeing up bandwidth. This can be done by adding more switches, upgrading to bigger pipes, or as I said, re-arranging existing hardware to save on costly new equipment.

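One zero-dollar way to see how tangled things are is to inventory what actually shares each vSwitch and its uplinks. The rough pyVmomi sketch below (same placeholder connection details as the earlier sketch) prints, per host and per standard vSwitch, the physical uplinks, the vmkernel services bound to it (management, vMotion, provisioning, and so on), and how many VM port groups ride the same uplinks; a vSwitch that carries both is a place where maintenance and data traffic are fighting for bandwidth.

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)

    for host in view.view:
        net = host.config.network
        pnic_names = {p.key: p.device for p in net.pnic}          # uplink key -> vmnicX
        pg_to_vsw = {pg.spec.name: pg.spec.vswitchName for pg in net.portgroup}
        vmk_pgs = {v.portgroup for v in net.vnic}                  # port groups used by vmkernel

        # Which services (management, vmotion, ...) are bound to which vmk adapter.
        services = {}
        vnm = host.config.virtualNicManagerInfo
        for nc in (vnm.netConfig if vnm else []):
            key_to_dev = {c.key: c.device for c in (nc.candidateVnic or [])}
            for key in (nc.selectedVnic or []):
                services.setdefault(key_to_dev.get(key, key), []).append(nc.nicType)

        print(f"\n{host.name}")
        for vsw in net.vswitch:                                    # standard vSwitches only
            uplinks = [pnic_names.get(k, k) for k in (vsw.pnic or [])]
            vmks = [f"{v.device}[{','.join(services.get(v.device, []))}]"
                    for v in net.vnic if pg_to_vsw.get(v.portgroup) == vsw.name]
            vm_pgs = [pg.spec.name for pg in net.portgroup
                      if pg.spec.vswitchName == vsw.name and pg.spec.name not in vmk_pgs]
            print(f"  {vsw.name}: uplinks={uplinks} vmkernel={vmks} "
                  f"VM port groups={len(vm_pgs)}")

    view.Destroy()
    Disconnect(si)

This is only an inventory, not a traffic measurement, but it tells you quickly whether vMotion, management, and backup vmkernel ports have anywhere else to go.
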
If your hardware is already out of warranty and your team has been doing everything possible to squeeze out more capacity, then you need to bite the bullet and refresh. And depending on the severity, this could be costly, especially if you’re a Cisco house.

Obviously, if this is the case you will have salespeople lining up to sell you 7K switches and UCS servers but do not let them work you into a frenzy. If all you need is a few switches, don’t let them sell you a new data center. And don’t let them sidetrack you with free trials before you buy hardware.

Now, what if you find you don’t need equipment but you still have poor performance? Then obviously something is wrong with the network design or configuration. Maybe it’s time to bring in someone to assist with the engineering? A fresh set of eyes can often spot issues without getting defensive. This is why third-party annual health checks are crucial.

This brings us to the end of this chapter.

How’s your list coming?

Hang in there because we’re just warming up. The best is yet to come. Now, let’s talk about another potential problem, storage nausea.

 Chapter Three – Storage Nausea (High Latency, Slow App)

Nausea happens when things are out of balance. Is your application slow because you have too many VMs on the same storage LUN?

Maybe 12 – 24 months ago, when you purchased the best storage available for your new cloud, you had hundreds of terabytes to spare. But since then it has run into a wall, because now you have hundreds of virtual servers consuming the space and IOPS. And there are plans for more VMs.

Now let’s assume storage is the problem and follow the same regimen of questioning. (Don’t forget to jot down your ideas.)

  1. Is your application performance problem just a symptom of a bigger problem?
  2. Has storage become a single point of failure for hundreds of VMs?
  3. Do you know how many VMs are loaded on your SAN, NAS, or DAS? (See the sketch after this list for a quick way to count them per datastore.)
  4. How big is the risk of a storage failure?
  5. How many business-critical VMs with huge databases are processing millions of customer transactions and are sharing the same IOPS with test or POC VMs?
  6. Or maybe you have bunches of VMs with huge read-only file repositories consuming tier-one (SSD) storage, instead of slow disks?
  7. Are your applications fighting for IOPS and space with DEV or UAT VMs?
  8. Is your storage getting hammered non-stop, 24/7/365, without any routine maintenance?
  9. And worse yet, is all of the fabric still on a 1Gb NFS or iSCSI storage network, instead of 10Gb, FCoE, or FC?

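To answer questions two and three without guessing, you can ask vCenter directly how many VMs sit on each datastore and how much space is left. Here is a minimal pyVmomi sketch (the vCenter address and credentials are placeholders):

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)

    TB = 1024 ** 4
    # Busiest datastores first: each one is both a latency hot spot and a
    # single point of failure for every VM it carries.
    for ds in sorted(view.view, key=lambda d: len(d.vm), reverse=True):
        s = ds.summary
        print(f"{s.name}: {len(ds.vm)} VMs, "
              f"{s.capacity / TB:.1f} TB capacity, {s.freeSpace / TB:.1f} TB free")

    view.Destroy()
    Disconnect(si)

If one LUN is carrying hundreds of VMs while others sit half empty, you have found both your latency problem and your biggest risk.
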
As a consultant, I’ve seen it over and over. I call it “storage Tetris.”

Let’s have another moment of retrospection.

Twelve to thirty-six months ago, nobody planned for this to happen. It happened while we grew the business, while new products were developed, and while innovation was occurring.

Sure, you may have purchased the largest spindle count possible for your brand new SAN or NAS, and then even carved it out based on best practices. But since then you’ve turned on thin-provisioning and de-duping and loaded too many VMs on it. End result – latency (and slow applications).

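If you want to see how far thin provisioning has let you over-commit, compare what is provisioned against what each datastore can actually hold. The short pyVmomi sketch below (placeholder connection details again) uses the usual vSphere arithmetic, provisioned space = capacity - free + uncommitted, and flags anything past 100%:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)

    for ds in view.view:
        s = ds.summary
        # Provisioned = space in use plus what thin disks could still grow into.
        provisioned = s.capacity - s.freeSpace + (s.uncommitted or 0)
        ratio = provisioned / s.capacity if s.capacity else 0
        flag = "  <-- over-committed" if ratio > 1.0 else ""
        print(f"{s.name}: provisioned {ratio:.0%} of capacity{flag}")

    view.Destroy()
    Disconnect(si)
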
And that’s not all. To complicate the issue more, in the frenzy to build more VMs for new projects, the storage firmware (or software) has gotten behind on updates. This lack of upkeep has resulted in quirky things going on with VMs and storage space. For example, you can’t add the new SSD disks you purchased, which have been sitting around for months unused.

Why?

Because you probably can’t afford a planned maintenance outage to move 500 VMs elsewhere, while the firmware updates happen to fix these issues.

We just didn’t plan for this type of stuff months ago, and the vendors didn’t tell us it would happen. These arrays are active/active, so why do we need to take an outage, right? Because of a software bug, that’s why.

Here’s the good news.

I am not saying buy more storage. Nor am I saying buy another type of storage.

What I am saying is you have a storage problem that needs to be fixed which is causing slow application performance. And with so many different types of storage being used for datastores, it would be insane for me to try to solve your storage problem without knowing more about it.

Is the picture getting clear, yet?

We’ve created a massive single point of failure and it’s also causing major performance problems.

Write this down. You need to spread this load out before a catastrophe happens and causes you to lose hundreds of VMs and their data.

And to bear more bad news, chances are if your storage is in the state I just described, then your backups are in a state of chaos, too.

Why?

You guessed it: because backup jobs are normally managed by the storage staff.

A storage failure would put you out of business for weeks or months while the VMs were rebuilt from scratch (due to no backups).

Resolving the bigger issue with storage will most likely fix any application performance issues related to storage. But it will take time and planning while “storage Tetris” is carefully played.

How’s your list of action items coming? So far what I have covered is fixable. But now it’s going to get really interesting because I’m going to call your baby UGLY!

 Chapter Four – vHoudgePoudge (Your Baby is Ugly, Unstable Apps)

Critiquing someone’s vSphere is the same as calling their baby ugly, so here it goes: “You have a badly configured or engineered vSphere.” There, I said it – your baby is ugly.

Go ahead and get defensive and tell me I’m full of it and where I can go…then let’s face the truth.

Your vSphere has evolved and has been, at best, a best effort to keep up with the fast pace of the last couple of years.

I get it. Don’t forget I’ve been doing this for a while.

We always have too much going on and nobody has really spent the time needed to ensure consistency and best practices were being followed.

If the server hardware powers up someone slaps ESXi on it and joins it to vCenter.

But let’s reflect for a moment and see where the real issue began.

Let’s go back to the first free release of ESX set up in your data center because I actually think VMware and other vendors seed this by giving away free software.

Huh?

Do you really want to know?

OK, I’ll tell you: free software spawns the first Franken-host, running on junky hardware pulled from the boneyard. And I will bet you that in many cases it’s hosting tier-one applications. And yes, maybe even the one you are currently having problems with.

Can you see it in your mind’s eye? It’s the server in the corner nobody touches and you cannot reboot or update. Yes, that one.

This scenario has been repeating over and over for years.

How does this happen? Here’s my guess.

Most IT managers in their right mind will not pay $2,000 for an ESXi Enterprise Plus license and then put it on junk hardware. But they will let someone download the free version and put it on junk (of course it’s only a POC), which then finds its way into production, and that’s how it becomes an application problem.

Think hard about the scary truth I’m about to tell you because it happens.

Somewhere – possibly in your secure data center – you have business-critical applications or databases running on a free version of ESXi, built on an out-of-warranty server, and ready to die. (Write that down.)

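If you would rather confirm it than guess, you can audit the suspects. A free-licensed ESXi box usually is not joined to vCenter, so the hedged pyVmomi sketch below connects to each standalone host directly (the hostnames, user, and password are placeholders) and prints the build, the hardware it is running on, and the license edition in use.

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    SUSPECT_HOSTS = ["esx-corner-box.example.com"]   # hypothetical standalone hosts

    for name in SUSPECT_HOSTS:
        si = SmartConnect(host=name, user="root", pwd="changeme",
                          sslContext=ssl._create_unverified_context())
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True)
        host = view.view[0]                    # a direct connection sees one host
        hw = host.summary.hardware
        print(f"{name}: {host.summary.config.product.fullName}")
        print(f"  hardware: {hw.vendor} {hw.model}")
        # On a direct host connection this lists the host's own license.
        for lic in content.licenseManager.licenses:
            print(f"  license: {lic.name} ({lic.editionKey})")
        view.Destroy()
        Disconnect(si)

Anything reporting a free or evaluation edition on hardware you no longer recognize belongs on your remediation list.
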
And if history is any predictor of what will happen, the Franken-host will die at the worst possible time – likely during your next big product release!

Let’s get back on track and continue talking about how the baby got ugly.


 Read the rest of this story…

Why is My App Running Slow on VMware

12 Months after the Virtualization Project

Joe Sanchez, VCP

You can read the rest of the story by clicking the link below.

Available Now

Click Here To Finish Reading Troubleshooting Guide