Wed Jun 18 14:19:32 CEST 2014

Hope this helps someone else.

I came in today and discovered that one of our check_mk Windows agents was
giving 'tcp connection refused'.  The check_mk_agent service in CMK itself
was showing CRIT, and all other related checks on that Windows host were
stale.  No notifications had been sent out - still have to dig into why.

Here is my rough troubleshooting flow:

   - Run manual check
   - Restart Windows agent
   - Telnet host 6556 from my workstation (should normally work) - it works
   - Check the port from the OMD server - a few ways to test - port is
   - We are running 1.11 (updated earlier this week for cmk BI features),
   so perhaps the agent needs updating
   - Updated agent from 1.2.4p2 to 1.2.4p3, no change
   - Noticed the service is not stopping correctly, have to kill the process
   - Double-checked the configuration, cleaned out extraneous stuff
   - Lots of googling later...
   - Ran netstat -anb | find /i "6556" on the troubled Windows box
   - I see 'CLOSE_WAIT' a number of times
   - Restart the service (kill, start), see LISTENING
   - Run manual check, still timing out
   - CLOSE_WAIT showing up again
   - Rebooted the Windows server (cuz, ya know)
   - No change
   - Started the check_mk_agent service, then from cmd: check_mk_agent.exe
   - It was hanging on a particular check
   - Commented out the related checks
   - Everything returned to normal
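
A note on the CLOSE_WAIT entries: that state means the remote side has
closed the connection but the local process has not yet closed its own
socket, which fits an agent stuck waiting on a hanging check.  As a rough
sketch only (Python, chosen just for illustration), the netstat output
could be tallied per state like this.  The function name count_states and
the sample lines are made up for the example; real netstat output may be
laid out differently depending on Windows version and locale.

```python
from collections import Counter


def count_states(netstat_output, port=6556):
    """Count TCP connection states for sockets bound to the given local port.

    Assumes the usual 'netstat -an' column order:
    protocol, local address, foreign address, state.
    """
    states = Counter()
    suffix = ":" + str(port)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Lines with fewer than 4 fields (headers, the executable names
        # printed by -b, blank lines) are skipped.
        if len(fields) >= 4 and fields[0] == "TCP" and fields[1].endswith(suffix):
            states[fields[3]] += 1
    return states


# Hypothetical sample, not captured from the real host.
SAMPLE = """
  TCP    0.0.0.0:6556      0.0.0.0:0           LISTENING
  TCP    192.0.2.10:6556   192.0.2.20:51234    CLOSE_WAIT
  TCP    192.0.2.10:6556   192.0.2.20:51240    CLOSE_WAIT
  TCP    192.0.2.10:3389   192.0.2.30:60000    ESTABLISHED
"""

if __name__ == "__main__":
    for state, count in count_states(SAMPLE).items():
        print(state, count)
```

A CLOSE_WAIT count that keeps growing between manual checks would point at
the agent process rather than at the network.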

The check was a 'cscript script.vbs' that normally outputs appropriate
Nagios-readable service data.  The host runs about 30 of these checks, and
all but a few work fine.  Those few were getting 'server does not exist'
errors because a VPN tunnel had crashed and not come back up.

Still not certain if this is a cmk agent bug, or if we just need to put
better error handling into our vbs code.  (it's vbs because legacy and time)
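
One possible shape for that error handling, sketched in Python rather than
vbs purely for illustration: run each check under a hard timeout and print
a fallback line if it hangs, so one stuck check cannot block the whole
agent run.  The name run_check, the timeout value and the fallback text
are all assumptions made for this example, not anything from the agent or
from our scripts.

```python
import subprocess
import sys


def run_check(cmd, timeout_s, fallback):
    """Run cmd and return its stdout, or `fallback` if it hangs or cannot start."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return fallback
    except OSError:
        # The command could not be started at all (e.g. not found).
        return fallback
    return result.stdout


if __name__ == "__main__":
    # Stand-in for a real check command such as ['cscript', 'script.vbs'];
    # here a short Python one-liner is used so the sketch runs anywhere.
    cmd = [sys.executable, "-c", "print('check output')"]
    print(run_check(cmd, timeout_s=10, fallback="check timed out"), end="")
```

Note that only the direct child is killed on timeout; anything that child
spawned itself may need separate cleanup.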

