Poison Wait Detected

Whenever SQL Server needs to wait while it’s executing queries, it tracks that time as a wait statistic. We can query the waits tables to get overall bottleneck data over time.

Some waits are what we call poison: any occurrence of them means your SQL Server may feel unusable while this wait type is happening. You might only have a few minutes of this wait type, out of hours or days of uptime, but a few minutes of inaccessibility sure seems like a total outage to your end users.

We check sys.dm_os_wait_stats looking for significant cumulative wait times for the following wait types:

CMEMTHREAD – we’ve been seeing a lot of this in servers with a lot of CPU cores that still have their parallelism settings set at their defaults. To fix this issue:

  • Read my post about setting Cost Threshold for Parallelism and MAXDOP at reasonable defaults – which typically end up at Cost Threshold for Parallelism at 50, and then MAXDOP at around 8 or less depending on your hardware config.
  • Look for queries reading a lot of data – run sp_BlitzCache @SortOrder = ‘reads’, and you’re probably going to find queries going parallel to read a lot of data that’s cached in RAM. Try tuning those queries, or tuning the indexes they use.
  • After that, if you’re still having the problem, see Microsoft’s blog post on running SQL Servers with >8 cores per NUMA node. If you suspect that CMEMTHREAD is your server’s largest problem, and you meet the symptoms described in that KB article, open up a support ticket with Microsoft to be safe rather than just enabling this trace flag. It’s only $500, and they can give you a double check confirmation that this trace flag makes sense for you.

IO_QUEUE_LIMIT – in Azure SQL DB, this means your database is getting throttled.

IO_RETRY – a read or write failed due to insufficient resources, and we’re waiting for a retry.

LOG_RATE_GOVERNOR – in Azure SQL DB, this means your database is getting throttled. Your delete/update/insert work simply can’t go any faster due to the limits on your instance size. Before you spend more on a larger instance, read this post.

POOL_LOG_RATE_GOVERNOR – see LOG_RATE_GOVERNOR.

PREEMPTIVE_DEBUG – someone probably accidentally hit the DEBUG button in SSMS rather than Execute.

RESMGR_THROTTLED – in Azure SQL DB, this means a new request has come in, but it’s throttled based on the GROUP_MAX_REQUESTS setting.

RESOURCE_SEMAPHORE means SQL Server ran out of available memory to run queries, and queries had to wait on available memory before they could even start. You can learn more about this in Query Plans: Memory Grants and High Row Estimates, and the RESOURCE_SEMAPHORE training video.

RESOURCE_SEMAPHORE_QUERY_COMPILE is a lot like RESOURCE_SEMAPHORE, but it means SQL Server didn’t even have enough memory to compile a query plan. This is usually only seen in two situations:

  • Underpowered servers: think 8-16GB RAM for heavy production workloads, or
  • Really complex queries with dozens or hundreds of joins, like I describe in this post about ugly queries and this post as well – and you can hit these even on powerful servers, like 256GB+ memory. Finding the queries causing this is spectacularly hard, though – SQL Server doesn’t make it easy to analyze queries en masse looking for the highest compilation times. Solving it is usually more of a strategic thing: can we simplify our queries overall?

SE_REPL* – in Azure SQL DB, we’re waiting for the secondary replicas to catch up.

THREADPOOL means SQL Server ran out of worker threads, and new queries thought the SQL Server was frozen solid. During any occurrence of THREADPOOL, your SQL Server will feel like it’s locked up solid – although trickily, your CPU usage might be near zero. You can learn more about this in the THREADPOOL training video (or here if your subscription started 2020 or before.)

1 Comment.