Unexplained High CPU usage on Hyper-V host and Guests

I have a client with two identical Hyper-V servers running almost identical VMs. Out of the blue, one of the servers started showing high CPU utilization: the host was bouncing between 35% and 50%, and the guests were at 99%. I turned off the guests and rebooted the server with no change; the host was still at 35-50%. I made sure any unnecessary hardware was disabled or disconnected, again with no change. Experimenting with one of the guest machines, I noticed that system interrupts would sometimes show 99% CPU utilization, go away for a bit, and then whichever process was active would take over the 99%. After seeing that, I wanted to look into system interrupts on each host machine and compare them.

In the past I had used KernView on 32-bit machines; however, it does not work on modern 64-bit machines. After some digging around on the internet, it turns out KernRate works on 64-bit machines and is included in the Windows Driver Kit (WDK) 7, found here: http://www.microsoft.com/en-us/download/details.aspx?id=11800. If you choose the default install path, the files can be found in C:\WinDDK\7600.16385.1\Tools\Other\amd64.

I wanted to log the output and have it run for a fixed time for comparison. After looking through the help files, I settled on the command 'kernrate -s 30 -yo filename.txt', which gave me a 30-second sample and wrote it to a file with the chosen name in the same path. I ran the command on both the host that was not having issues and the one that was. To save space in this post, I will cut to the interesting parts of the resulting log files.

Server specs (both servers are the same):
Dell 320
32 GB RAM
Intel E5-2420 CPU (6 hyper-threaded cores)
Windows Server 2012 with the Hyper-V role installed

Server with issues:

Results for Kernel Mode:
-----------------------------

OutputResults: KernelModuleCount = 147

Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime   276703 hits, 10002 events per hit --------

Module        Hits      msec     %Total   Events/Sec
NTOSKRNL      138197    30074    49 %     45961508
HAL           126880    30074    45 %     42197704
WIN32K        7230      30026    2 %      2408394
NTFS          1030      30055    0 %      342773

Server without issues:

Results for Kernel Mode:
-----------------------------

OutputResults: KernelModuleCount = 145

Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime   289145 hits, 10009 events per hit --------

Module        Hits      msec     %Total   Events/Sec
NTOSKRNL      244130    29999    84 %     81452620
HAL           41760     29999    14 %     13932992
WIN32K        1650      29999    0 %      550513
IPMIDRV       625       30000    0 %      208520

I noticed right away that on the server with issues, 45% of the kernel-mode hits were in the HAL, compared with 14% on the server without issues. HAL is short for Hardware Abstraction Layer, the piece of the operating system that allows other parts of the operating system to interact with the physical hardware of the computer. Modern versions of Windows automatically select the HAL based on the processor type, but I still verified that both servers were using the same one. Again I disabled any unnecessary hardware, turned the guest machines off, and updated drivers, running KernRate between each step, all with very similar results.
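As a quick sanity check on those percentages, here is a minimal Python sketch that recomputes the HAL share from the hit counts in the two tables above. The dictionary layout and the function names (`module_share`, `hal_shares`) are my own for illustration; they are not part of KernRate, and the script does not parse the actual log file.

```python
# Hit counts copied by hand from the two KernRate tables in this post.
RUNS = {
    "server with issues": {"total": 276703, "NTOSKRNL": 138197, "HAL": 126880},
    "server without issues": {"total": 289145, "NTOSKRNL": 244130, "HAL": 41760},
}


def module_share(hits, total):
    """Return a module's share of all kernel-mode hits, as a percentage."""
    return 100.0 * hits / total


def hal_shares(runs):
    """Map each run name to the HAL's percentage of kernel-mode hits."""
    return {name: module_share(run["HAL"], run["total"]) for name, run in runs.items()}


if __name__ == "__main__":
    for name, share in hal_shares(RUNS).items():
        print(f"{name}: HAL = {share:.1f}% of kernel-mode hits")
```

KernRate appears to truncate the percentage to a whole number, which is why the log shows 45 % and 14 % rather than rounded values.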

After testing many configurations, drivers, and multiple reboots, I was frustrated at the hours lost and the lack of understanding of why this was occurring. I had one last resort before declaring a bad CPU or motherboard and calling Dell for warranty service: I upgraded the BIOS and rebooted. I had left all disabled devices disabled and the guest machines off in order to limit the changes. A few minutes after rebooting, I logged back in and opened Task Manager to a pleasant 10% CPU utilization. I re-enabled all devices and turned the guests back on. Everything seemed nice and fast, including the guest performance. I ran KernRate again to see if there was any difference in the results.

After BIOS update on bad machine:

OutputResults: KernelModuleCount = 144

Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime   341514 hits, 10009 events per hit --------

Module        Hits      msec     %Total   Events/Sec
NTOSKRNL      332831    29999    97 %     111047217
HAL           6673      29999    1 %      2226409
IPMIDRV       835       29999    0 %      278593
NTFS          395       29999    0 %      131789

Wow, that is quite a difference from the previous result, and even better than the machine that was seemingly working well. I am going to schedule a window of time to do the BIOS update on the second machine and see whether it achieves a similar result. As with any updates or changes, please back up your data and double-check that your BIOS update is for the correct machine, as a BIOS update can go south and leave your machine unable to boot.

Server 2012 RDS licensing problem

On Server 2012, when setting up RDS, the Licensing Diagnoser will still show the server as not licensed even though the licensing manager shows licenses. This will affect users when the licensing grace period (120 days for RDS) runs out. To fix this, the licensing mode and the license server need to be set in the local computer policy or in group policy.

Local Computer Policy -> Computer Configuration -> Administrative Templates -> Windows Components -> Remote Desktop Services -> Remote Desktop Session Host -> Licensing

Use the specified Remote Desktop license servers = (the name of your license server)
Set the Remote Desktop licensing mode = Per User or Per Device, depending on licenses bought

SonicWALL intermittent connection issues

Some older SonicWall routers, or newer ones with an imported configuration, have default NAT rules listed for “WAN Primary Subnet”. These rules cause the SonicWall to respond to ARP queries for the entire subnet of the WAN interface, even for IPs the ISP has not assigned to the client. If you see these rules, disable them and flush the ARP cache to help prevent problems with the internet connection. This article helped resolve the problem once the ISP pointed it out: http://serverfault.com/questions/294817/how-can-i-stop-my-sonicwall-tz-210-sonicos-enhanced-5-5-1-0-5o-from-responding

Oddly, even though the SonicWall answers the ISP router's ARP queries, it does not put these entries in its own ARP table. The only way to see them is to have the ISP check the ARP table on their connected router.

In this case the client only had .198-.200 assigned to them, but the SonicWall was responding to ARP for the entire usable block of .194-.201:

PEMTK82#sh ip arp | inc 98.XXX.XXX
Internet 98.XXX.XXX.177 - 0014.f1eb.3bd9 ARPA Bundle1
Internet 98.XXX.XXX.185 0 0006.b13a.a2ca ARPA Bundle1
Internet 98.XXX.XXX.193 - 0014.f1eb.3bd9 ARPA Bundle1
Internet 98.XXX.XXX.194 51 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.195 53 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.196 45 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.197 222 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.198 0 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.199 0 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.200 10 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.201 67 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.202 0 0012.1ebd.99a8 ARPA Bundle1
Internet 98.XXX.XXX.203 6 000b.8660.4b74 ARPA Bundle1
Internet 98.XXX.XXX.204 134 0025.614a.af00 ARPA Bundle1
Internet 98.XXX.XXX.205 2 a0f3.c1c3.f5a3 ARPA Bundle1
Internet 98.XXX.XXX.206 255 0017.c54e.b575 ARPA Bundle1
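If the ISP sends you a dump like the one above, a few lines of Python can pick out the addresses the firewall is answering for but should not be. This is only a sketch: the function name `unexpected_arp_entries` is made up for illustration, the sample is a subset of the (masked) output above, and it assumes the c0ea.e458 MAC prefix belongs to the SonicWall and that .198-.200 are the only assigned addresses.

```python
# A subset of the ISP router's ARP output from this post (addresses masked).
ARP_OUTPUT = """\
Internet 98.XXX.XXX.193 - 0014.f1eb.3bd9 ARPA Bundle1
Internet 98.XXX.XXX.194 51 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.195 53 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.196 45 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.197 222 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.198 0 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.199 0 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.200 10 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.201 67 c0ea.e458.XXXX ARPA Bundle1
Internet 98.XXX.XXX.202 0 0012.1ebd.99a8 ARPA Bundle1
"""


def unexpected_arp_entries(arp_text, mac_prefix, assigned_octets):
    """Return addresses answered by mac_prefix whose last octet is not assigned."""
    extra = []
    for line in arp_text.splitlines():
        fields = line.split()
        # Expected columns: Protocol, Address, Age, Hardware Addr, Type, Interface
        if len(fields) != 6 or fields[0] != "Internet":
            continue
        address, mac = fields[1], fields[3]
        last_octet = int(address.rsplit(".", 1)[1])
        if mac.lower().startswith(mac_prefix) and last_octet not in assigned_octets:
            extra.append(address)
    return extra


if __name__ == "__main__":
    # Assumption: c0ea.e458 is the SonicWall, and only .198-.200 are assigned.
    for address in unexpected_arp_entries(ARP_OUTPUT, "c0ea.e458", range(198, 201)):
        print("SonicWall is answering ARP for unassigned address", address)
```

With the sample above it flags .194 through .197 and .201, which matches what the ISP saw.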

To disable the rules, uncheck the highlighted boxes, then go to the ARP page and clear the cache.

Server 2012 Multi-Path I/O with SMB3

For those of you who have not tested this feature yet, I have to say it is pretty wonderful. However, I have also run into some issues with it. The quick and somewhat obvious gotcha is that it only works if the server and client both support SMB3.

In case you don't know what it does, here is a brief and dumbed-down version of Multi-Path I/O (Microsoft's name for this SMB3 feature is SMB Multichannel). It allows multiple network cards to handle the SMB3 connections. By doing this it can spread the load across multiple network cards, in effect increasing bandwidth and reliability. This is not the same as NIC teaming: every port has its own IP on the network, and only SMB3 traffic benefits from it.

The issue I had was that even though there were five 1 Gb NICs on the server, the aggregate bandwidth for transfers was topping out at about 1 Gb/s. One NIC was a built-in Atheros and the other four were ports on a single Intel PRO/1000 PCI-X card. I could disable all but one NIC and get the same bandwidth as with all five running. It was distributing the load: about 190 Mb/s per NIC when all five were online, or about 950 Mb/s for just one. Unfortunately this Intel NIC did not have newer drivers and would only work with the built-in Windows drivers, but replacing it with a PCI-Express card did allow me to break the 1 Gb/s barrier, reaching 1.5 Gb/s with only two NICs. I will update this when I get more NICs installed. I suspect the issue was either the driver or some limitation of the PCI-X NIC.
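The arithmetic behind that conclusion is simple enough to write down. The sketch below uses only the figures from this post; the helper name `aggregate_mbps` is made up for illustration.

```python
def aggregate_mbps(per_nic_mbps, nic_count):
    """Total throughput when every NIC carries the same share of the load."""
    return per_nic_mbps * nic_count


# Figures observed on the server with the PCI-X card.
all_five = aggregate_mbps(190, 5)    # five NICs at about 190 Mb/s each
single_nic = aggregate_mbps(950, 1)  # one NIC on its own at about 950 Mb/s

# A ratio near 1.0 means the extra NICs only split the load without adding bandwidth.
speedup = all_five / single_nic

if __name__ == "__main__":
    print(f"5 NICs: {all_five} Mb/s, 1 NIC: {single_nic} Mb/s, speedup: {speedup:.2f}x")
```

The load really was being spread across the NICs, but the total never rose above what one port could deliver, which is why I suspected a bottleneck shared by all the ports rather than a problem with the feature itself.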

This is a good result:

[Screenshot: multipathio smb3]

SBS 2003 server lockups

My remote management software was alerting me that a client's SBS 2003 server was experiencing 100% CPU utilization, would randomly stop responding to input, and was even showing offline. I was eventually able to RDP in and started to look for a cause. Event Viewer would not open and most input produced no results, but I was able to open Task Manager and found that 4 processes were consuming all of the CPU between them. Ending svchost.exe did not free up any cycles; the other processes, which are protected, just went up higher in utilization.

After waiting 20 minutes for the server to respond to a forced reboot command, I looked through the event log and found the likely culprit. Shortly after the backup started there were errors like “lsass (432) Shadow copy 371 time-out (70000 ms).” for the lsass.exe, ntfrs.exe and tcpsvcs.exe processes. This is due to some VSS issues in Server 2003. To be safe, I reset the VSS writers using a batch file with the following commands:

cd /d %windir%\system32
net stop vss
net stop swprv
regsvr32 ole32.dll
regsvr32 oleaut32.dll
regsvr32 vss_ps.dll
vssvc /register
regsvr32 /i swprv.dll
regsvr32 /i eventcls.dll
regsvr32 es.dll
regsvr32 stdprov.dll
regsvr32 vssui.dll
regsvr32 msxml.dll
regsvr32 msxml3.dll
regsvr32 msxml4.dll
net start vss
net start swprv

After looking further into the issue, I found a KB article, http://support.microsoft.com/kb/826936, which provides a hotfix for time-out issues of essential services during VSS copies. Remember to make sure you have a good backup before installing any hotfixes, as they are not always well tested. If you can, try the hotfix in a test environment first to be safe.