TROUBLESHOOTING SQL SERVER WAIT STATS

Introduction: Three Secrets for Effective Troubleshooting with Wait Statistics on Microsoft SQL Server

By Kevin Kline (Twitter: @kekline | Blog: kekline.com)

There are many approaches to troubleshooting, tuning, and optimizing SQL Server performance.

  • A Windows system administrator might choose to use the Windows Event Logs, Task Manager, or powerful SysInternals tools, like Process Explorer, to find out as much as they can about what is happening within their Windows Server.
  • A .NET developer might open a client-side trace, enable a debugging switch in their program, or enable verbose logging on an important piece of middleware so they can determine how their programs are behaving when connected to a SQL Server back-end.
  • A line-of-business manager might simply watch the response times of their application, whether it be a client-app or website, and then call in a consultant when things seem too slow.
  • Even a full-time, dedicated SQL Server professional has many different options to choose from and methodologies to apply when SQL Server seems “off.”

But which approach is optimal? This white paper will show you. The good news is that SQL Server is a very mature product, with a long history of community-oriented staff and independent experts who happily share their best practices for troubleshooting, tuning, and optimizing SQL Server performance. The result is a useful and effective template for performance troubleshooting that you can use again and again whenever SQL Server performance is called into question.

The optimal approach to SQL Server performance troubleshooting is a tried-and-true methodology utilizing wait statistics, performance monitor queues, and other relevant performance metrics.

Waits and Queues

Wait statistics, or simply wait stats, were first introduced as a fully fledged feature in SQL Server 2005 and explained in the white paper entitled Performance Tuning with Waits and Queues, written by Tom Davidson. (An early, rudimentary version of wait stats, called umsstats, was made available in a SQL Server 2000 service pack.) Wait statistics were available on other major database platforms, such as Oracle, IBM DB2, and Sybase, before they were available on SQL Server, but they have never been the only performance troubleshooting tool in the DBA's toolkit.

Since its inception, SQL Server has provided another set of metrics, broadly known as queues, to show the internal behavior of SQL Server. (Queues were your only method of performance troubleshooting prior to SQL Server 2005.) The word "queues" in this context is an aggregate term that encompasses server-side traces, Profiler traces, Performance Monitor object counters, WMI counters, and other relevant metrics such as those revealed by the DBCC SQLPERF command.
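To make the distinction concrete, here is a minimal sketch (the specific counter names are chosen only for illustration) showing that many of these queue-style PerfMon counters can also be read from inside SQL Server itself:

    -- A minimal sketch: a few "queue"-style PerfMon counters exposed through the
    -- sys.dm_os_performance_counters DMV (available in SQL Server 2005 and later).
    SELECT [object_name], counter_name, instance_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE counter_name IN (N'Batch Requests/sec', N'Page life expectancy', N'User Connections');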

 


Optimal Performance Troubleshooting is Like Breakfast Cereal

In a nutshell, the optimal approach to performance troubleshooting is a tried-and-true methodology blending all valuable sources of system information, including waits, queues, and other non-SQL Server performance information.

SQL Server performance troubleshooting that relies on a waits-only approach is problematic. It is a lot like what cereal advertisers (especially those whose cereals are composed mostly of sugar) always say in their advertisements – "Part of a complete breakfast!" Another way to read that statement is "incomplete or unhealthy by itself." The same is true of SQL Server performance troubleshooting when you lean too heavily on a single metric or a single set of instrumentation. Sooner or later, you will expend a great deal of time and energy going down a dead end, hitting a wall, or following a false lead – to your great peril.

Here is the good news: wait stat analysis provides an excellent return on the time invested in performance troubleshooting because it provides easy visibility into the areas where SQL Server spends time waiting to do work. It is especially good when used in combination with other tools and metrics. In other words, it’s “part of a complete troubleshooting breakfast!”

Wait Statistics Explained

SQL Server wait stats are, at their highest conceptual level, grouped into two broad categories: signal waits and resource waits. A signal wait is accumulated by processes running on SQL Server which are waiting for a CPU to become available (so called because the process has been "signaled" that its resource is available and is now waiting only for CPU time). A resource wait is accumulated by processes running on SQL Server which are waiting for a specific resource to become available, such as waiting for the release of a lock on a specific record.

The promise of wait stats is to quickly identify where the most pressing bottlenecks are building within SQL Server.

Resource waits are further broken down into hundreds of specific categories (the exact number varies by version of SQL Server). These categories, called wait types, enable you to determine a more specific bottleneck on a SQL Server, such as schema locking issues, plan cache memory pressures, or backup I/O problems, instead of just the generic locking, memory, or I/O issues you might identify through other methods. The promise of wait stat analysis is that you now have an excellent way to quickly identify where the strongest pressures are building up on SQL Server's specific resources.
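As a minimal sketch of how these two categories surface in the DMVs (illustrative, not a tuning prescription), sys.dm_os_wait_stats reports total wait time and signal wait time per wait type, and the resource portion is the difference between the two:

    -- A minimal sketch: split each wait type into its resource and signal components.
    -- wait_time_ms includes signal_wait_time_ms, so the resource wait is the difference.
    SELECT wait_type,
           waiting_tasks_count,
           wait_time_ms - signal_wait_time_ms AS resource_wait_ms,
           signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;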

In this white paper, we will walk through the optimal performance troubleshooting process for a variety of scenarios and illustrate both the strengths and weaknesses of using a waits-only approach. As Microsoft says in their definitive white paper on the subject: 

“Performance tuning of applications and solutions has been around for many years. The performance of SQL Server database applications should be evaluated from several different perspectives. Each perspective tells a different section of the complete performance story. Together they paint a detailed performance picture of the whole and also cross validate observations in each specific perspective.”

Like many memes evangelized on the Internet, waits have undergone a somewhat startling transformation in recent years. What was introduced as an additional arrow in your quiver for performance troubleshooting has been subverted into the only method used by many practitioners and even by many commercially available tools. While a waits-only approach can be useful for assessing overall SQL Server performance, it should not be the only method used for performance troubleshooting. When a waits-only approach is used for troubleshooting, it can be inaccurate and, at times, misleading, which can result in wasted time and serious errors.

Three Surprising Failures In Waits-Only Performance Troubleshooting

Performance troubleshooting is predicated upon getting the most actionable information as quickly as possible. Because of this “easier said than done” requirement, performance troubleshooting SQL Server commonly fails for one or more of these three reasons:
    1. Wait statistics do not help troubleshoot non-SQL Server issues.
    2. There are plenty of situations where wait statistics are not relevant or are not actionable.
    3. Poor understanding of specific wait statistics leads many to conduct “knee-jerk” performance troubleshooting, resulting in faulty or misinformed corrective actions.

Compare these three reasons for failure to the approaches we have personally witnessed, introduced in the very first paragraph of this white paper.

In those examples, a Windows system admin, using only Windows Event Logs and Task Manager, has no visibility inside of SQL Server’s processes. A .NET developer, using only client-side traces and debugging switches, can see neither the deep internals of SQL Server nor the Windows Server itself. And of course, the business user, who only knows that their interaction with the application is slow, actually has no clue at all as to what is happening anywhere else in the organization’s infrastructure and, many times, not even on their own workstation.

Using a waits-only approach can lead to all three major problems of performance troubleshooting!

When performance troubleshooting SQL Server, you must collect both appropriate information and actionable information. You will need information found both inside and outside of SQL Server itself. And you will also need overall contextual and historical information to determine what is truly normal or abnormal behavior for your server.

An Example of a Flawed Waits-only Process

Wait stats values, shown by sys.dm_os_wait_stats and other Dynamic Management Views (DMVs), accumulate over time and offer no granular, process-level identification of the root cause.
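A minimal sketch of one common workaround (the one-minute window is an arbitrary choice) is to snapshot the DMV twice and diff the results, so you at least see what waited during a known interval, even though the output is still server-wide rather than per process:

    -- A minimal sketch: sys.dm_os_wait_stats accumulates since instance startup (or since a
    -- manual clear with DBCC SQLPERF (N'sys.dm_os_wait_stats', CLEAR)), so snapshot it twice
    -- and diff the snapshots to see what actually waited during a specific window.
    SELECT wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count
    INTO #waits_t0
    FROM sys.dm_os_wait_stats;

    WAITFOR DELAY '00:01:00';   -- observation window; adjust as needed

    SELECT t1.wait_type,
           t1.wait_time_ms        - t0.wait_time_ms        AS wait_ms_delta,
           t1.waiting_tasks_count - t0.waiting_tasks_count AS waits_delta
    FROM sys.dm_os_wait_stats AS t1
    JOIN #waits_t0 AS t0 ON t0.wait_type = t1.wait_type
    WHERE t1.wait_time_ms > t0.wait_time_ms
    ORDER BY wait_ms_delta DESC;

    DROP TABLE #waits_t0;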

To use a medical metaphor for waits-only analysis: high wait stat values are, in many cases, the symptom of an illness, not the actual diagnosis of the illness. A doctor would not declare "a high fever" (i.e., the wait stat) to be the diagnosis. Instead, they would note that the patient has a high fever, recognize that it could stem from a wide variety of potential illnesses, and then, rightly, proceed with more tests and investigation to find the root cause.

Waits analysis can reveal the symptoms of a SQL Server performance problem, but it rarely points directly to the root cause of the problem.

Here is a real-world example of how waits-only troubleshooting can miss the mark on providing enough information and truly actionable information:

A small regional bank, a customer of ours, had a small IT team that vastly preferred Microsoft Access for all of its home-grown applications. By default, a Microsoft Access front-end retrieves data from a SQL Server back-end by asking for each row it needs, waiting for an acknowledgement, and then asking for the next row, rather than requesting them all in a single cached batch. This painful process is sometimes referred to as RBAR (Row-By-Agonizing-Row).

As a result, the SQL Server back-end accumulates a very large amount of ASYNC_NETWORK_IO waits as it sends one row to the client, waits for an acknowledgement from the client, then sends the next. The IT team was alarmed to see ASYNC_NETWORK_IO waits climb higher with each new branch office they migrated to their new application. Taking the wait stat at face value without assessing any other performance metrics, they began to assume that the SQL Server simply needed a faster NIC or, at worst, improved networking infrastructure. When they analyzed the NIC traffic load by looking at PerfMon counters, however, they saw that the SQL Server and the network itself still had a vast amount of untapped headroom. This was simply a case of bad application design (the default row-by-row behavior was never disabled) that at first glance appeared to be a problem with the SQL Server.
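Had the team wanted to dig one level deeper inside SQL Server before blaming the network, a minimal sketch like the following (illustrative only) would have shown which client hosts and applications were accumulating the ASYNC_NETWORK_IO waits at that moment:

    -- A minimal sketch: sessions currently waiting on ASYNC_NETWORK_IO, i.e. SQL Server
    -- is waiting for the client to consume rows it has already sent.
    SELECT s.session_id, s.host_name, s.program_name, wt.wait_duration_ms
    FROM sys.dm_os_waiting_tasks AS wt
    JOIN sys.dm_exec_sessions AS s ON s.session_id = wt.session_id
    WHERE wt.wait_type = N'ASYNC_NETWORK_IO'
    ORDER BY wt.wait_duration_ms DESC;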

This white paper presents the steps to follow to achieve an optimal performance troubleshooting experience on SQL Server.

Consider yourself warned. This is just one of a multitude of examples in which you might face significant misdiagnoses using a waits-only approach to SQL Server performance troubleshooting.

In the sections that follow, we discuss a variety of important and common shortcomings to consider in waits-only performance troubleshooting. At the same time, we will present the step-by-step process to follow to achieve an optimal performance troubleshooting experience.

Surprise #1: Wait Stats Do Not Help with Problems Outside of SQL Server, a.k.a. "The Blame Game"

You just returned to your desk from taking a quick lunch break. As you take your seat, you see that you have a number of messages waiting for you on the corporate voice mail system. You can also see that there are some email messages with the urgent marker glaring you in the face. And now that you focus, you hear the footfalls of someone rapidly approaching. The program manager for one of the business units pops their head around your cubicle wall and says, "What is wrong with your database? The users are driving me crazy with complaints!"

When performance troubleshooting SQL Server, you want as much actionable information as quickly as possible. “Actionable” is the key word.

While this may not have happened to you (yet), most DBAs in medium or large IT organizations have had this experience many times. This sort of complaint is so common as to be routine for DBAs interacting with their business customers. The customer cannot reach the database, so clearly the database is at fault! In fact, DBAs are so used to being the first person in IT blamed for an application problem that they often develop coping mechanisms to shield themselves from what might turn out to be someone else’s technical issue.

Experienced DBAs are so used to getting the blame for failures elsewhere in the corporate infrastructure that they begin to figure out ways to “prove” the problem is caused by a separate silo of the IT organization.

However, any experienced DBA will point out that there are a multitude of other places within the corporate infrastructure where the user might encounter a problem, with the database being merely one of many symptoms of that problem. Consequently, your first order of business in performance troubleshooting is to determine whether it is even your job to be the person troubleshooting the problem!

And since so many problems originate outside of SQL Server, wait stats have no chance of detecting these issues.

Such common non-SQL Server issues might include:

  • An issue with a domain controller
  • Failover to a less capable backup domain controller
  • DNS server problems
  • A WAN or IP routing issue
  • A problem in the underlying Windows OS
  • Resource consumption issues on other applications or services running on the same server as the SQL Server, such as memory leaks
  • A SAN or HBA problem

These kinds of blame games are very common. They are so common that many DBAs begin to anticipate them and, after enough negative experiences, build out contingencies for failures elsewhere in the corporate infrastructure.

Networking or OS configuration problems might frequently appear to the user as if their favorite database application is actually at fault, especially since they may not use corporate infrastructure for any other purpose. And, in some cases, DBAs are the first person that users call, not because they suspect it is a database problem, but because the DBAs themselves are the most responsive and knowledgeable about troubleshooting issues like this.

Using SQL Sentry: Correlation of Waits and Queues is Built In!

Refer below to Figure 1 where you can see an example of the Performance Advisor Dashboard, which has integrated correlations of SQL Server wait statistics with all of the other pertinent Performance Monitor (PerfMon) queues.

Keep in mind that all values on the left side of the Dashboard are for Windows Server, with built-in overlays showing the specific resource consumption of SQL Server, while those on the right side are strictly for SQL Server. Highlighting a spike in wait stats in the middle-right pane of the Dashboard automatically highlights all of the other charts for the same time period, so that you can correlate SQL Server wait stats with all of the other activity occurring on that instance of SQL Server and on the Windows Server. This makes it very easy to assess whether a problem is entirely contained within SQL Server, within Windows Server, or is in fact a combination of the two.

For example, assume that using a waits-only approach, you see in Figure 1 that your server has an unexpected spike in network-related waits. Is this a problem? Maybe – but you can only make a correct determination by comparing the wait statistics values to the actual PerfMon counter objects which track network utilization.


Figure 1: Correlating Wait Statistics to PerfMon Queues

Looking at the data in Figure 1, we can see that there is indeed a spike in Network waits (the light colored portion highlighted in the Waits chart on the right side of the screen). However, if you follow the red arrows from the SQL Server Waits on the right side to the Network PerfMon counter values on the upper left portion of the Dashboard, you can see that the percentage-based values for the two NICs are still less than 1% busy. In other words, the NICs are not busy in the least!

Surprise #2: The Wait Stats are Irrelevant

An interesting aspect of wait statistics is that, even on quiet servers, you will still accumulate wait statistics of various types. Some waits will, by design, accumulate a minute of wait time every minute – this is because some waits just represent wait time, and do not indicate any type of performance problem at all.

Furthermore, a SQL Server performing only a light workload will also accumulate wait statistics. If you only look at wait stats, you might think that your top wait stat needs to be fixed pronto! But there will always be a top wait. Assuming that the top wait always needs to be "dealt with" can become a time-consuming endeavor for inexperienced DBAs and system administrators. Imagine you take a look at a lightly used SQL Server hosting a SharePoint installation. Its top wait stats relate to LCK (locking) and FTS (full-text search). Does that mean we need to get busy tuning this server? Not really – it is not very busy after all.

Wait statistics must be assessed in the context of overall server performance. Even a SQL Server with plenty of spare computing capacity will exhibit a wide variety of wait statistics.

In other words, SQL Servers will always have wait stats in a ranked list from biggest wait stat to smallest. But just because a wait stat is a server's biggest does not mean that it actually needs attention or remediation. For example, on SQL Server 2014, you may safely ignore a wide variety of wait stats because they measure the natural, internal management of the platform. A few examples of wait stats which are safe to ignore include:

  • those with names beginning with ‘BROKER%’, like ‘BROKER_TRACE_STOP’, since SQL Server makes heavy use of Service Broker internally.
  • those whose names include ‘%SLEEP%’, like ‘LAZYWRITER_SLEEP’, since the sleeping wait stats accumulate the total time spent waiting for certain recurring sleeping tasks to wake up.
  • those containing ‘XTP_WAIT’, ‘SQLTRACE’, and ‘QUEUE’, since wait stats like these all manage internal processes.

The examples above are by no means an exhaustive list. Refer to Glenn Berry’s SQL Server Diagnostic Queries for an example query with a representative list of wait stats which you can safely ignore.
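As a minimal sketch of that idea (the exclusion list below is deliberately abbreviated; use Glenn Berry's far more complete list in practice), you can simply filter the benign wait types out before ranking:

    -- A minimal sketch: rank waits while filtering out a few benign, internal wait types.
    SELECT wait_type, wait_time_ms, waiting_tasks_count
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT LIKE N'BROKER%'
      AND wait_type NOT LIKE N'%SLEEP%'
      AND wait_type NOT LIKE N'%QUEUE%'
      AND wait_type NOT LIKE N'%SQLTRACE%'
      AND wait_type NOT LIKE N'%XTP_WAIT%'
    ORDER BY wait_time_ms DESC;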

Surprise #3: When Good Wait Statistics Go Bad: Knee-Jerk Performance Troubleshooting

Contributed by Paul Randal and reprinted with permission from SQL Performance

Paul Randal is the most respected and authoritative independent technical expert in the SQL Server world. I first met Paul in the summer of 2005, during a visit to the Microsoft campus, when I was serving as the president of the Professional Association for SQL Server and Paul was the lead program manager for the SQL Server Storage Engine team. Paul's writings and videos are must-have resources for any serious student of Microsoft SQL Server. In an interesting twist, Paul is married to Kimberly Tripp, another of the industry's top experts and the instructor of my first SQL Server class way back at SQL Server version 6.0. All of the content of Surprise #3, with the exception of the CXPACKET waits section, was written by Paul.

Performance troubleshooting is an art and a science. The art comes from experience (and learning from others’ experiences) and the science comes from well-known guidelines about what to do in which scenarios.

Or at least that is what I like to think, and teach.

In reality, many DBAs and developers out there practice what I call ‘knee-jerk performance troubleshooting’. This commonly happens when a performance problem has reached the critical stage with, for instance, queries timing out, processes running slowly or failing, users disgruntled, and management wanting answers and action fast!

Knee-jerk performance troubleshooting happens when the practitioner jumps to the conclusion that the most prevalent symptom is the root-cause of the problem.

The ‘knee-jerk’ comes from doing some superficial analysis of the problem and jumping to the conclusion (really, it is grasping at a straw) that the most prevalent symptom must be the root cause and trying to address that, to no or little avail, often using misguided or downright incorrect advice found online. This leads to a lot of frustration and wasted time, and often leads to wasted money when the organization decides to try to throw hardware at the problem by upgrading the server and/or the I/O subsystem, only to find the problem is still there, or reappears pretty quickly.

Wait statistics analysis is one of the areas where it is easiest to knee-jerk, and I am going to talk about a few of the common wait types and the mistakes that people make around them. There is not enough scope to go into great depth about what to do in each case, but I will give you enough information to point you in the right direction.

LCK_M_XX Waits

Most people assume that if locking waits are the most prevalent, then it must be some kind of blocking problem that is the issue. Often it is, such as a lack of a suitable nonclustered index causing a table scan in REPEATABLE_READ or SERIALIZABLE isolation levels that escalates to an S table lock. (And as a quick hint, if you think you never use SERIALIZABLE, you do if you use distributed transactions – everything is converted to SERIALIZABLE under the covers, which can lead to unexpected blocking and deadlocks.)

However, it is often the case that blocking is being caused by something else. Under the default READ_COMMITTED isolation level, locks covering changes are held until the transaction commits, and will block reads and other updates to the same row(s). If anything prevents a transaction from committing, then blocking could show up.

For instance, if the database is synchronously mirrored, then the transaction cannot commit and release its locks until the log records have been sent across to the mirror and written to the mirror’s log drive. If the network is severely congested, or there is massive I/O contention on the mirror, then this could seriously delay the mirroring operation, and cause the transaction to take much longer to commit. This would look like blocking, but the root cause is a resource contention, which has to do with mirroring.

For locking waits, unless the cause is obvious from looking at the query plan or the lock resource (e.g. a table-level lock indicating lock escalation, or the isolation level), follow the blocking chain (using a script that walks the blocking_session_id column in sys.dm_exec_requests) all the way to the thread and locked resources at the head of the blocking chain. That is what will point you toward the root cause.
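A minimal sketch of that walk (this is not Paul's script; note that an idle head blocker with an open transaction will not appear in sys.dm_exec_requests at all) looks like this:

    -- A minimal sketch: list blocked requests plus the sessions doing the blocking.
    -- The head of the chain is a session that blocks others but is not itself blocked.
    SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_resource, r.command
    FROM sys.dm_exec_requests AS r
    WHERE r.blocking_session_id <> 0
       OR r.session_id IN (SELECT blocking_session_id
                           FROM sys.dm_exec_requests
                           WHERE blocking_session_id <> 0);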

ASYNC_NETWORK_IO Waits

The name of this one causes lots of confusion. Which word do you focus on? NETWORK. Yet the cause of this wait type usually has nothing to do with the network. It should really be called WAITING_FOR_APP_ACK(nowledgment), or something similar, as that is exactly what is happening: SQL Server has sent some data to a client and is waiting for the client to acknowledge that it has consumed the data.

One of my favorite demos to do when teaching about wait statistics is to run a query that returns a large result set in Management Studio and watch the server rack up ASYNC_NETWORK_IO waits. There is clearly no network involved – it is just SSMS taking a long time to reply to SQL Server. It is doing what is known as RBAR (Row-By-Agonizing-Row), where only one row at a time is pulled from the results and processed, instead of caching all the results and then immediately replying to SQL Server and proceeding to process the cached rows.
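A minimal sketch of that kind of demo (any sufficiently large result set will do; the cross join below is just a convenient way to generate one) is:

    -- A minimal sketch: run this in SSMS with grid results and watch ASYNC_NETWORK_IO
    -- accumulate in sys.dm_os_wait_stats while the client slowly consumes the rows.
    SELECT a.name, a.object_id, b.name AS name2
    FROM sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b;   -- deliberately huge result set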

This is the major cause of ASYNC_NETWORK_IO waits – poor application design. Occasionally, it is the network, but that is rare in my experience.

OLEDB Waits

The common knee-jerk reaction here is to equate this wait type with linked servers. However, this wait type became more common when SQL Server 2005 shipped, because 2005 contained a raft of new DMVs, and DMVs mostly use OLE DB under the covers. Before looking for linked server problems, I would check whether a monitoring tool is running DMVs constantly against the server.

If you do have linked servers, continue troubleshooting by going to the linked server and looking at the wait statistics there to see what the most prevalent issue is, and then continue the same analysis.

One other thing that can cause OLEDB waits is DBCC CHECKDB (and related commands). It uses an OLE DB rowset to communicate information between its Query Processor and Storage Engine subsystems.
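A minimal sketch for checking what is behind OLEDB waits right now (illustrative only) is to pull the statement text for the sessions currently reporting that wait type:

    -- A minimal sketch: statements currently waiting on OLEDB (could be linked servers,
    -- a monitoring tool polling DMVs, or DBCC CHECKDB).
    SELECT wt.session_id, wt.wait_duration_ms, st.text AS statement_text
    FROM sys.dm_os_waiting_tasks AS wt
    JOIN sys.dm_exec_requests AS r ON r.session_id = wt.session_id
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
    WHERE wt.wait_type = N'OLEDB';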

SOS_SCHEDULER_YIELD Waits

When SOS_SCHEDULER_YIELD is the most prevalent on a server, it is common to see sustained, high CPU usage. The knee-jerk reaction here is that the server must be under CPU pressure, or that a spinlock is the problem.

We need a bit of background here to understand these two reactions.

Thread Scheduling

Thread scheduling in SQL Server is managed by SQL Server itself, not by Windows (i.e. it is non-preemptive). The SQL OS portion of the Storage Engine provides scheduling functionality, and threads transition from running on a Processor (thread state RUNNING), to waiting on the Waiter List for a resource to become available (state SUSPENDED), to sitting on the Runnable Queue once the resource becomes available (state RUNNABLE), waiting to reach the top of the queue and get back onto the Processor (state RUNNING again). I have capitalized Processor, Waiter List, and Runnable Queue to identify them as parts of a scheduler.

Whenever a thread needs a resource that it cannot immediately acquire, it becomes suspended and sits on the Waiter List until it is told (signaled) that its resource is available. The time spent on the Waiter List is the resource wait time and the time spent on the Runnable Queue is the signal wait time. Together they combine to be the overall wait time. SQL OS keeps track of the overall wait time and the signal wait time, so we have to do some math on the output from sys.dm_os_wait_stats to derive the resource wait time (see Paul's wait statistics script on SQLskills.com).

The Waiter List is unordered (any thread on it can be signaled at any time and move to the Runnable Queue) and the Runnable Queue is First-In-First-Out (FIFO) almost 100% of the time. The only exception to the Runnable Queue being FIFO is where multiple Resource Governor workload groups have been configured in the same resource pool and they have different priorities relative to each other. I have never seen this used successfully in production so I will not discuss it further.

There is another reason why a thread may need to move off the Processor – it exhausts its quantum. The thread quantum in SQL OS is fixed at 4 milliseconds. The thread itself is responsible for determining that its quantum has been exhausted (by calling helper routines in SQL OS) and voluntarily giving up the processor (known as yielding). When this occurs, the thread moves directly to the bottom of the Runnable Queue, as there is nothing for it to wait for. SQL OS must register a wait type for this transition off the Processor though, and registers SOS_SCHEDULER_YIELD.

This behavior is often mistaken for CPU pressure, but it is not – it is just sustained CPU usage. CPU pressure, and recognizing it, is a whole other topic for a future post. As far as this article is concerned, as long as the average signal wait time is low (roughly 0 to 0.2 ms), it is a pretty safe bet that CPU pressure is not an issue.
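A minimal sketch of that check (server-wide averages only, so treat it as a rough indicator rather than a verdict) is:

    -- A minimal sketch: average signal wait per wait, i.e. time spent on the Runnable
    -- Queue waiting for CPU. Values consistently near zero suggest CPU pressure is unlikely.
    SELECT SUM(signal_wait_time_ms) * 1.0
             / NULLIF(SUM(waiting_tasks_count), 0) AS avg_signal_wait_ms,
           SUM(signal_wait_time_ms) * 100.0
             / NULLIF(SUM(wait_time_ms), 0)        AS signal_wait_pct
    FROM sys.dm_os_wait_stats;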

Spinlocks

A spinlock is a very low-level synchronization primitive that is used to provide thread-safe access to data structures in SQL Server that are extremely hot (very volatile and accessed and changed incredibly frequently by multiple threads). Examples of such structures are the buffer free list in each portion of the buffer pool and the proportional-fill weightings array for the data files in a filegroup.

When a thread needs to acquire a spinlock, it looks to see if the spinlock is free and if so immediately acquires it (using an interlocked assembly-language primitive like ‘test bit clear and set’). If the spinlock cannot be acquired, the thread immediately tries to acquire it again, and again, and again, for up to a thousand iterations, until it backs off and goes to the bottom of the Runnable Queue, registering an SOS_SCHEDULER_YIELD wait.
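If you do suspect spinlocks, a minimal sketch for a first look (SQL Server 2008 and later expose this DMV; the call-stack analysis Paul describes below is still required for proof) is:

    -- A minimal sketch: spinlock collision and backoff counts; a single spinlock type
    -- with very high backoffs is a hint (not proof) that deeper investigation is warranted.
    SELECT name, collisions, spins, spins_per_collision, backoffs
    FROM sys.dm_os_spinlock_stats
    ORDER BY backoffs DESC;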

SOS_SCHEDULER_YIELD Causes

There are two main causes for SOS_SCHEDULER_YIELD: a thread exhausting its scheduling quantum and a thread failing to acquire a spinlock, and heavily recurring instances of either cause can lead to SOS_SCHEDULER_YIELD being the most prevalent wait along with high CPU usage.

Nearly 100% of the time, the first case is what is happening. It is extremely rare for a spinlock to be the cause of high CPU and high SOS_SCHEDULER_YIELD waits. The only way to prove that a spinlock is or is not the cause is to capture SQL Server call stacks when that wait type occurs, using Extended Events and debug symbols from Microsoft. (Paul has a blog post that describes and shows how to perform that investigation, and also a great whitepaper about spinlocks and spinlock investigations that is worth reading if you are interested in that depth of internals.)

For the case of quantum exhaustion, that is not the root cause. It is a further symptom. Now, we need to consider why a thread may be exhausting its quantum repeatedly.

A thread can only exhaust its quantum when it can continue processing SQL Server code for 4ms without needing a resource that another thread owns – no waiting for locks, page latches, data file pages to be read from disk, memory allocations, file growths, logging, or the myriad other resources that a thread might need.

The most common piece of code where quantum exhaustion racks up large amounts of SOS_SCHEDULER_YIELD waits is the scan of an index or table where all the necessary data file pages are in memory and there is no contention for access to those pages. That is what I encourage you to look for in query plans when you see SOS_SCHEDULER_YIELD as the top wait type – large and/or repeated index/table scans.

This does not mean I am saying that large scans are bad, as it could be that the most efficient way to process your workload is through a scan. However, if the SOS_SCHEDULER_YIELD waits are new and unusual, and are caused by large scans, you should investigate why the query plans are using scans. Maybe someone dropped a critical nonclustered index, or statistics are out-of-date and so an incorrect query plan was chosen, or maybe an unusual parameter value was passed to a stored procedure and the query plan called for a scan, or a code change occurred without supporting index additions.
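A minimal sketch for finding those scan-heavy, CPU-hungry statements (illustrative; total_worker_time is reported in microseconds) is:

    -- A minimal sketch: the top CPU consumers from the plan cache; follow up by examining
    -- their query plans for large or repeated index/table scans.
    SELECT TOP (10)
           qs.total_worker_time / 1000 AS total_cpu_ms,
           qs.execution_count,
           SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                     ((CASE qs.statement_end_offset
                            WHEN -1 THEN DATALENGTH(st.text)
                            ELSE qs.statement_end_offset
                       END - qs.statement_start_offset) / 2) + 1) AS statement_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_worker_time DESC;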

SOS_SCHEDULER_YIELD Summary

Just as with other wait types, understanding exactly what SOS_SCHEDULER_YIELD means is key to understanding how to troubleshoot it, and whether the behavior is expected because of the workload being processed.

When you have a performance problem, take the time to understand the data you have, and perform further investigations to help narrow down the root cause of the problem. Do not just grasp at whatever seems to be the top wait statistic and follow the first piece of advice you come across online (unless it is from a well-known and reputable source) or you likely will not solve your problem, and may even make it worse.

PAGEIOLATCH_SH

Continuing on the theme of wait stats which can induce a knee-jerk troubleshooting reaction from the less experienced, I will begin by saying that there are really multiple kinds of PAGEIOLATCH waits, which I signify collectively with the XX at the end. The most common examples are:

  • PAGEIOLATCH_SH – (SHare) waiting for a data file page to be brought from disk into the buffer pool so its contents can be read.
  • PAGEIOLATCH_EX or PAGEIOLATCH_UP – (EXclusive or UPdate) waiting for a data file page to be brought from disk into the buffer pool so its contents can be modified.

Of these, the most common type, by far, is PAGEIOLATCH_SH.

When this wait type is the most prevalent on a server, the knee-jerk reaction is to conclude that the I/O subsystem must have a problem and so that's where investigations should be focused.

The first thing to do is to compare the PAGEIOLATCH_SH wait count and duration against your baseline. If the volume of waits is more or less the same, but the duration of each read wait has become much longer, then I'd be concerned about an I/O subsystem problem, such as:

  • A misconfiguration/malfunction at the I/O subsystem level
  • Network latency
  • Another I/O workload causing contention with our workload
  • Configuration of synchronous I/O-subsystem replication/mirroring

In my experience, the pattern is often that the number of PAGEIOLATCH_SH waits has increased substantially from the baseline (normal) amount and the wait duration has also increased (i.e. the time for a read I/O has increased), because the large number of reads overloads the I/O subsystem. This isn't an I/O subsystem problem – this is SQL Server driving more I/Os than it should be. The focus now needs to switch to SQL Server to identify the cause of the extra I/Os.
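A minimal sketch that helps separate the two situations (per-file averages since startup, so compare against your baseline rather than reading them in isolation) is:

    -- A minimal sketch: per-file read volume and average read latency. Rising latency with a
    -- flat read count points at the I/O subsystem; rising reads and rising latency together
    -- usually mean SQL Server is driving more I/O than it should.
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_latency_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
    ORDER BY avg_read_latency_ms DESC;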

Causes of Large Numbers of Read I/Os

SQL Server has two types of reads: logical I/Os and physical I/Os. When the Access Methods portion of the Storage Engine needs to access a page, it asks the Buffer Pool for a pointer to the page in memory (called a logical I/O) and the Buffer Pool checks through its metadata to see if that page is already in memory.

If the page is in memory, the Buffer Pool gives the Access Methods the pointer, and the I/O remains a logical I/O. If the page is not in memory, the Buffer Pool issues a "real" I/O (called a physical I/O) and the thread has to wait for it to complete – incurring a PAGEIOLATCH_XX wait. Once the I/O completes and the pointer is available, the thread is notified and can continue running.

In an ideal world, your entire workload would fit in memory and so once the buffer pool has "warmed up" and holds all the workload, no more reads are required, only writes of updated data. It is not an ideal world though, and most of you do not have that luxury, so some reads are inevitable. As long as the number of reads stays around your baseline amount, there is no problem.

When a large number of reads are required suddenly and unexpectedly, that is a sign that there is a significant change in either the workload, the amount of buffer pool memory available for storing in-memory copies of pages, or both.
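A minimal sketch for spotting such a change from inside SQL Server (note that the per-second counters in this DMV are stored cumulatively, so take deltas between samples) is:

    -- A minimal sketch: Buffer Manager counters that move when the buffer pool shrinks or
    -- the read workload grows. 'Page reads/sec' and 'Lazy writes/sec' are cumulative here.
    SELECT counter_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE [object_name] LIKE N'%Buffer Manager%'
      AND counter_name IN (N'Page life expectancy', N'Page reads/sec', N'Lazy writes/sec');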

Here are some possible root causes (not an exhaustive list):

  • External Windows memory pressure on SQL Server causing the memory manager to reduce the buffer pool size
  • Plan cache bloat causing extra memory to be borrowed from the buffer pool
  • A query plan doing a table/clustered index scan (instead of an index seek) because of:
    • a workload volume increase
    • a parameter sniffing problem
    • a required nonclustered index that was dropped or changed
    • an implicit conversion

One pattern to look for that would suggest a table/clustered index scan being the cause is also seeing a large number of CXPACKET waits along with the PAGEIOLATCH_SH waits. This is a common pattern that indicates large, parallel table/clustered index scans occurring.

In all cases, you could look at what query plan is causing the PAGEIOLATCH_SH waits using sys.dm_os_waiting_tasks and other DMVs, and you can get the Transact-SQL code to do that in my blog post located at SQLskills.com. On the other hand, SQL Sentry Performance Advisor can help you identify the culprit without getting your hands dirty.

Example Workflow with Performance Advisor and Plan Explorer

In a simple (obviously contrived) example, let's assume that I am on a client system using SQL Sentry's suite of tools and see a spike in I/O waits in the dashboard view of Performance Advisor, as shown below in Figure 2:


Figure 2: Spotting a spike in I/O waits in SQL Sentry Performance Advisor

I decide to investigate by right-clicking a selected time interval around the time of the spike, then jumping over to the Top SQL view, which is going to show me the most expensive queries that have executed, as shown in Figure 3:


Figure 3: Highlighting a time range and navigating to Top SQL

In this view, I can see which long-running or high I/O queries were running at the time the spike occurred, and then choose to drill in to their query plans. In this case, as shown below in Figure 4, there is just one long-running query, which ran for nearly a minute:


Figure 4: Viewing a long-running query in Top SQL tab of Performance Advisor

If I look at the plan in the SQL Sentry client or open it in SQL Sentry Plan Explorer, I immediately see multiple problems. The number of reads required to return 7 rows seems far too high, the delta between estimated and actual rows is large, and the plan shows an index scan occurring where I would have expected a seek, as shown in Figure 5:


Figure 5: Seeing implicit conversion warnings in the query plan

The cause of all this is highlighted in the warning on the SELECT operator: It's an implicit conversion!

Implicit conversions are an insidious problem caused by a mismatch between the search predicate data type and the data type of the column being searched, or a calculation being performed on the table column rather than the search predicate. In either case, SQL Server cannot use an index seek on the table column and must use a scan instead.
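As a minimal sketch (the table and column names here are hypothetical), a classic data type mismatch looks like this:

    -- A minimal sketch with hypothetical names: comparing a VARCHAR column to an NVARCHAR
    -- parameter forces a CONVERT_IMPLICIT on the column, so an index on AccountNumber
    -- cannot be sought and the optimizer falls back to a scan.
    DECLARE @acct NVARCHAR(20) = N'AB-123456';
    SELECT AccountNumber, CustomerName
    FROM dbo.Accounts                 -- hypothetical table
    WHERE AccountNumber = @acct;      -- AccountNumber is VARCHAR(20): scan, not seek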

This can crop up in seemingly innocent code, and a common example is using a date calculation. If you have a table that stores customers' birth dates, and you want to perform a calculation to see how many customers are 21 years old or over today, you might write code like this:

                WHERE DATEADD (YEAR, 21, [MyTable].[BirthDate]) <= '1/20/2017 12:00:00 AM';
                

With this code, the calculation is on the table column and so an index seek cannot be used, resulting in an unseekable expression (technically known as a non-SARGable expression) and a table/clustered index scan. This can be solved by moving the calculation to the other side of the operator:

                WHERE [MyTable].[BirthDate] <= DATEADD (YEAR, -21, '1/20/2017 12:00:00 AM');
                

PAGEIOLATCH Summary

Do not fall into the trap of thinking that excessive PAGEIOLATCH_XX waits are caused by the I/O subsystem. In my experience, they're usually caused by something to do with SQL Server and that is where I would start troubleshooting.

CXPACKET Waits

Perhaps the most misunderstood of all wait stats is the CXPACKET wait. The term refers to “Class eXchange Packet”, that is, a “packet” of rows that is exchanged between two parallel threads, consisting of a producer thread and a consumer thread, within a single process. To make a long story short, CXPACKET waits are a naturally occurring wait stat on any SQL Server utilizing parallelized queries. Whenever you see a high CXPACKET wait stat, think parallelism. Not only is CXPACKET a completely natural wait stat, it may very well be a good indicator since parallelized queries often outperform their serialized brethren.

In an example described by Adam Machanic on sqlblog.com, we can see a rather simple-looking query that reliably produces a high number of CXPACKET waits, shown below in Figure 6:


Figure 6: An example query from Adam Machanic reliably producing CXPACKET waits

Many DBAs instinctively react to high CXPACKET waits with a knee-jerk response of disabling or throttling parallelism, either by tinkering with the MAXDOP (i.e. maximum degree of parallelism) or Cost Threshold for Parallelism settings. Instead, you should first consider the overall performance of the server.

As more and more servers these days ship with a large number of CPU cores, SQL Server seeks to distribute an ever increasing amount of its workload in parallel across multiple CPUs simultaneously. There’s a simple calculation to keep in mind:

SQL Server + multiple CPUs = Parallelism = CXPACKET waits

Parallelism, when you think about it in light of the calculation shown above, is neither good nor bad. It is simply normal operational behavior.

Consider, for example, that all parallelized queries contain more than one thread. There is a coordinator thread that handles all of the gather and partition stream activities. It will always accumulate CXPACKET waits while performing this activity. Depending upon the number of CPUs that the query is spread across, additional CXPACKET waits may accumulate, especially if one or more threads execute their portion of the workload very quickly while other threads (perhaps lumbering along on a different, slower I/O subsystem) are taking a few seconds longer to complete. In that case, all the waiting threads will also accumulate CXPACKET waits, even though the true reason they are waiting is a disparity in I/O subsystem speeds.
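Before touching any parallelism settings, a minimal sketch (illustrative only) for seeing which statements are actually parked on CXPACKET at a given moment is:

    -- A minimal sketch: requests currently waiting on CXPACKET and the statements they
    -- are running; inspect their plans and I/O before reaching for MAXDOP changes.
    SELECT r.session_id, r.wait_type, r.wait_time AS wait_ms, st.text AS statement_text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
    WHERE r.wait_type = N'CXPACKET'
    ORDER BY r.wait_time DESC;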

So, what is the appropriate response to a server with high CXPACKET waits? As in the earlier section illustrated by Figure 2, there is no better way to know whether there is a problem than to look at wait stats for a specific period of time and then compare those specific wait stats against what is normal for that time. By comparing against the baseline values in Performance Advisor for wait stats, you can quickly see whether the overall wait stat values are abnormal. Then, after determining that there is an issue, you can perform deeper analysis to determine the other issues in play that are making the workload different from normal.

Upon further research, you may find one or more root conditions that manifest themselves as higher CXPACKET waits, such as:

  • A new set of ad hoc queries are blocking resources needed by parallelized queries.
  • A long-running parallelized query runs across multiple database files which reside on different file systems of different speeds.
  • Alternatively, one of the underlying file systems has to share I/O with another application, another database, or even another instance of SQL Server, thus conflicting with one of the threads in a parallelized query.
  • A change in index statistics and histograms that alters the otherwise normal behavior of parallelized queries, causing one or more of the parallel threads to slow down relative to the other parallel threads of the query.

For more information on doing root cause analysis on parallelized queries, refer to the Additional Resources section at the end of this white paper.

A common problem for alerting systems is that they do not provide holistic performance information across the entire Microsoft stack where the SQL Server instance is installed.

Additional Resources

Paul Randal and the team at SQLskills.com have published a number of useful resources relating to wait statistics.

Brent Ozar has a fun and lucid write-up on wait stats and performance troubleshooting in an article entitled The Ozar Family Tradition of Performance Monitoring.

The seminal white paper from the Microsoft SQL Server Customer Advisory Team (SQLCAT), known as the Waits and Queues Whitepaper.

For deeper explanations of specific wait types, the wait types library at SQLskills.com is an excellent reference.

A couple of free and useful tools you should consider adding to your toolkit include:

  • sp_WhoIsActive – this awesome free replacement for sp_who, written by Adam Machanic, shows what is happening inside your SQL Server in brilliant detail, including the queries, wait statistics, and locking/blocking issues.
  • Dr. DMV’s Diagnostic Queries – Glenn Berry’s outstanding collection of diagnostic queries, for many versions of SQL Server, updated for each new service pack and major cumulative update.

Conclusion

This white paper discussed the most common problems facing IT teams who focus their performance troubleshooting activities on wait statistics. By focusing only on wait statistics, DBAs and performance engineers may waste valuable time chasing down misguided paths and following red herring clues. The most common issues, which we discussed in detail, include:

  • Failure to detect or identify problems with an origin outside of SQL Server.
  • Lack of correlation to system and database error messages.
  • Uncorrelated wait statistics, which in turn lead to “knee-jerk troubleshooting,” including:
    • LCK_M_xxx waits
    • ASYNC_NETWORK_IO waits
    • OLEDB waits
    • SOS_SCHEDULER_YIELD waits
    • CXPACKET waits

Where other scripted solutions and performance monitoring systems are limited by an over-reliance on wait statistics, SQL Sentry Performance Advisor and Event Manager, available in a bundle known as the Power Suite, are designed from the ground up to handle these scenarios quickly, directly, and easily. The Power Suite also makes it easy to scale out performance troubleshooting throughout your IT enterprise: you can create alerts to foresee problems, build your own baselines so the system knows what is "normal" for your workload, and use vivid visualizations for quick, drill-down troubleshooting, root cause detection, diagnostics, and resolution.
