To help others that might find themselves in a similar situation I am posting this odd experience we had with a BizTalk environment during the fall of 2011.
We had a pretty standard setup with good hardware to back it up all the way, set up after best practices. We were using the BizTalk Benchmark Wizard (BBW) to benchmark our environment and were comming up short at around 70 msg/s.
We should have had values that were around 900 msg/s. Overall, from scrutinizing the performance logs using Performance Analysis of Logs (PAL) as well as our own best judgement, we at first couldn’t find anything alarming. Processor, Memory, disk, network etc. All good. We also ran things like the BizTalk Best Practice Analyzer (BizTalk BPA), the MessageBoxViewer tool (MBV), the Monitor BizTalk Server SQL Server Agent job, but it all came back looking good. The environment just seemed… slow.
As it turns out the processor was especially interesting knowing what turned out to be the final finding. The processors (two of them per server each of them with 6 cores per processor) was on an average very low, but as it turns out there was one process that was taking the equivalent of 1 full core of power (its Process % Processor time was at 100), but since it didn’t stay on one core it was hard to spot the problem. PAL doesn’t have an alert for this, and finding the one process and performance counter among all of them is not so easy.
The process was the “HP Insight Server Agents” (cqmgserv.exe). The theory goes that as it was failing, recovering and retrying it was pumping the machine full of events and clogging up the underlying bus.
The closest we got to a match in the form of a support document from HP was this. Once the service was disabled the tests ran as expected att around 1000 msg/s. Later the service was updated to a newer version and started again without causing the same issues.
The purpose of this post is not to lay the blame on HP’s door but instead to enlighten readers that similar situations can occur and to highlight the value of a tool like BBW, since without it this exception would have likely never got caught and this server would have gone into production delivering much less value on the investment than it should.