[Check_mk (english)] EXT: Re: EXT: Re: Agent Not responding

Karl Otterbein karl.otterbein at jet.com
Tue Feb 5 20:36:06 CET 2019


The bonding issue was not the problem: I was able to bring down the bond and go directly to eth0 without trouble, but the issue persists.
I can't find anything that correlates with the issue. I'm down to capturing pcaps during agent discovery, where I see truncated data making it back but the payload stops abruptly. This is from running cmk -vvII 10.mgd.host.0 from 10.CMK.serv.0.
Packets from the tcpdump capture:
1 0.000000 10.CMK.serv.0 10.mgd.host.0 TCP 74 54996 → 6556 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=2752921687 TSecr=0 WS=128
2 0.013476 10.mgd.host.0 10.CMK.serv.0 TCP 74 6556 → 54996 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1354 SACK_PERM=1 TSval=751580 TSecr=2752921687 WS=128
3 0.013517 10.CMK.serv.0 10.mgd.host.0 TCP 66 54996 → 6556 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=2752921700 TSecr=751580
4 0.036207 10.mgd.host.0 10.CMK.serv.0 TCP 81 6556 → 54996 [PSH, ACK] Seq=1 Ack=1 Win=29056 Len=15 TSval=751586 TSecr=2752921700
5 0.036231 10.CMK.serv.0 10.mgd.host.0 TCP 66 54996 → 6556 [ACK] Seq=1 Ack=16 Win=29312 Len=0 TSval=2752921723 TSecr=751586
6 0.051618 10.mgd.host.0 10.CMK.serv.0 TCP 1414 [TCP Previous segment not captured] 6556 → 54996 [PSH, ACK] Seq=12472 Ack=1 Win=29056 Len=1348 TSval=751589 TSecr=2752921723
7 0.051628 10.CMK.serv.0 10.mgd.host.0 TCP 78 [TCP Window Update] 54996 → 6556 [ACK] Seq=1 Ack=16 Win=32128 Len=0 TSval=2752921738 TSecr=751586 SLE=12472 SRE=13820
---ICMP PINGS from CMK from packets 8-26---
27 49.500910 10.CMK.serv.0 10.mgd.host.0 TCP 78 54996 → 6556 [FIN, ACK] Seq=1 Ack=16 Win=32128 Len=0 TSval=2752971188 TSecr=751586 SLE=12472 SRE=13820
28 49.519048 10.mgd.host.0 10.CMK.serv.0 TCP 66 [TCP Previous segment not captured] 6556 → 54996 [ACK] Seq=15204 Ack=2 Win=29056 Len=0 TSval=763956 TSecr=2752971188
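
For reference, the size of the hole in the stream can be read straight off the capture above. The sketch below is only arithmetic on the values shown in packets 6 and 7 (Ack=16, SLE=12472, SRE=13820, Len=1348); the function and variable names are made up for illustration, and the segment count rests on the stated assumption about segment size.

```python
import math

# Values copied from the capture (relative sequence numbers).
ACKED_UP_TO = 16         # packet 7: Ack=16, the next byte the server expects
SACK_LEFT_EDGE = 12472   # packet 7: SLE, first byte of the block that did arrive
SACK_RIGHT_EDGE = 13820  # packet 7: SRE, one past the last byte of that block
SEGMENT_LEN = 1348       # packet 6: Len of the one large segment that was seen


def stream_gap(acked_up_to, sack_left_edge):
    """Number of bytes missing between the cumulative ACK and the SACK block."""
    return sack_left_edge - acked_up_to


missing = stream_gap(ACKED_UP_TO, SACK_LEFT_EDGE)
sacked = SACK_RIGHT_EDGE - SACK_LEFT_EDGE
# Assumption: the lost data was sent in segments no larger than packet 6.
min_lost_segments = math.ceil(missing / SEGMENT_LEN)

print(f"missing bytes: {missing}")
print(f"bytes in SACK block: {sacked}")
print(f"segments lost (at least): {min_lost_segments}")
```

By this arithmetic roughly 12 KB of agent output never reached the CMK server, while the first 15 bytes and one later segment did.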

Packet 6 payload shows the agent sending back partial discovery which is abruptly truncated.

     tmpfs          65536         0     65536       0% /var/lib/docker/containers/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/mounts/shm
overlay                       overlay    936179076 184998696 751180380      20% /var/lib/docker/overlay2/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/merged
overlay                       overlay    936179076 184998696 751180380      20% /var/lib/docker/overlay2/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/merged
shm                           tmpfs          65536         0     65536       0% /var/lib/docker/containers/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/mounts/shm
overlay                       overlay    936179076 184998696 751180380      20% /var/lib/docker/overlay2/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/merged
tmpfs                         tmpfs       39486652         0  39486652       0% /run/user/1112
overlay                       overlay    936179076 184998696 751180380      20% /var/lib/docker/overlay2/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/merged
overlay                       overlay    936179076 184998696 751180380      20% /var/lib/docker/overlay2/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/merged
<<<df>>>
[df_inodes_start]

Final packets

Source is the CMK server --> Managed node

[SEQ/ACK analysis]
    [iRTT: 0.013517000 seconds]
    [TCP Analysis Flags]
        [Expert Info (Warning/Sequence): Previous segment(s) not captured (common at capture start)]
            [Previous segment(s) not captured (common at capture start)]
            [Severity level: Warning]
            [Group: Sequence]


Return from Managed node --> CMK server:

[SEQ/ACK analysis]
    [This is an ACK to the segment in frame: 27]
    [The RTT to ACK the segment was: 0.018138000 seconds]
    [iRTT: 0.013517000 seconds]
    [TCP Analysis Flags]
        [Expert Info (Warning/Sequence): Previous segment(s) not captured (common at capture start)]
            [Previous segment(s) not captured (common at capture start)]
            [Severity level: Warning]
            [Group: Sequence]
[Timestamps]
    [Time since first frame in this TCP stream: 49.500910000 seconds]
    [Time since previous frame in this TCP stream: 49.449282000 seconds]

I'm getting deep into the weeds now. If this were a firewall issue I would expect no connectivity at all, but it's almost as if xinetd just stops sending, and the CMK server keeps waiting for the remaining payload so it never fails out. The last sequence is likely from me pressing Ctrl-C on the discovery at the command line.
Any thoughts are much appreciated!

K

From: checkmk-en <checkmk-en-bounces at lists.mathias-kettner.de> on behalf of Karl Otterbein via checkmk-en <checkmk-en at lists.mathias-kettner.de>
Reply-To: Karl Otterbein <karl.otterbein at jet.com>
Date: Tuesday, February 5, 2019 at 11:32 AM
To: Ezequiel Tolstanov <ezequiel at atdt.com.ar>
Cc: "checkmk-en at lists.mathias-kettner.de" <checkmk-en at lists.mathias-kettner.de>
Subject: EXT: Re: [Check_mk (english)] EXT: Re: EXT: Re: Agent Not responding

No, it is a direct connect pipe with no VPN. I have, however, discovered that the bond interface is dropping packets, so I am going to find out whether something is physically wrong with the host; it looks to be related.

Thank you all, I’ll reply if I find a cause with the bond, but appreciate everyone’s input!

K

On Feb 5, 2019, at 11:13 AM, Ezequiel Tolstanov <ezequiel at atdt.com.ar> wrote:
Hi,

Is the connection between the Server and the Remote Host going through a VPN tunnel? If so, maybe you are facing an MSS (Maximum Segment Size) issue.
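
As a rough illustration of the MSS arithmetic behind this suggestion: for IPv4 without IP or TCP options, the MSS is the MTU minus 20 bytes of IP header and 20 bytes of TCP header. The helper names below are hypothetical, and the 1400-byte tunnel MTU is an invented example value, not something measured in this thread.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def mtu_for_mss(mss):
    """Smallest IPv4 MTU that can carry a full segment of the given MSS."""
    return mss + IPV4_HEADER + TCP_HEADER


print(mss_for_mtu(1500))  # plain Ethernet MTU
print(mss_for_mtu(1400))  # hypothetical tunnel MTU
```

If a tunnel or middlebox lowers the usable MTU without the endpoints learning about it, full-size segments can be dropped while small ones pass, which would look like a connection that opens fine and then stalls.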

Best regards.

On Tue, Feb 5, 2019 at 13:02, Karl Otterbein via checkmk-en (<checkmk-en at lists.mathias-kettner.de>) wrote:

Sorry, sent by mistake without finishing:

I wonder if the following I receive when running cmk -vvII may cause an issue:

 [agent] Connecting via TCP to 10.51.169.15:6556 (5.0s timeout)

I wonder if a timeout is being hit while waiting. This is a remote host in the cloud reaching a co-lo in a different DC. Is there a way to raise that TCP timeout?

It doesn't make much sense because it's a 10GB direct connect.

Full output:

+ FETCHING DATA
 [agent] No persisted sections loaded
 [agent] Not using cache (Does not exist)
 [agent] Execute data source
 [agent] Connecting via TCP to 10.51.169.15:6556 (5.0s timeout)
 [agent] Reading data from agent
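
One way to tell a connect timeout apart from a stalled read, independent of cmk itself, is to talk to the agent port directly. This is a minimal sketch using Python's standard socket module; read_agent_output and its parameters are hypothetical names, and the 60-second read timeout is an arbitrary choice rather than a Checkmk setting.

```python
import socket


def read_agent_output(host, port=6556, connect_timeout=5.0, read_timeout=60.0):
    """Read from the agent until it closes the connection or goes silent.

    Returns (data, complete). complete is True if the peer closed the
    connection and False if no data arrived within read_timeout seconds.
    A failure to connect raises OSError, so it cannot be confused with a stall.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=connect_timeout) as sock:
        # From here on the timeout applies to each recv() call, not the connect.
        sock.settimeout(read_timeout)
        while True:
            try:
                data = sock.recv(4096)
            except socket.timeout:
                return b"".join(chunks), False  # stalled mid-stream
            if not data:
                return b"".join(chunks), True   # agent finished and closed
            chunks.append(data)


# Example call (not executed here):
# data, complete = read_agent_output("10.51.169.15")
# print(len(data), "bytes received, complete:", complete)
```

A result with complete set to False and only a little data would point at the stream stalling after the connection is established, not at the 5.0s connect timeout.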
--
Sent from Hiri<https://www.hiri.com/>


On 2019-02-05 10:58:33.309644-05:00 Karl Otterbein wrote:
> Does the agent run through to completion on the local host?
> telnet localhost 6556
> and/or
> /usr/bin/check_mk_agent (or wherever the agent lives)

Yes, the agent run completes with both of the commands above.

> Also, is the agent version equivalent to or older than the monitoring server's? Having a newer agent version could cause some conflicts.

They are the same: both are 1.5.0p11.

I wonder if the following I receive when running cmk -vvII may cause an issue:



On 2019-02-05 10:52:19-05:00 Paul Dott wrote:
Does the agent run through to completion on the local host?
telnet localhost 6556
and/or
/usr/bin/check_mk_agent (or wherever the agent lives)
Also, is the agent version equivalent to or older than the monitoring server's? Having a newer agent version could cause some conflicts.

On Tue, Feb 5, 2019 at 7:40 AM Karl Otterbein via checkmk-en <checkmk-en at lists.mathias-kettner.de> wrote:
Thanks, Robert, for the quick reply.

The command works properly:

df -PTlk
Filesystem                    Type     1024-blocks      Used Available Capacity Mounted on
udev                          devtmpfs   197408056         0 197408056       0% /dev
tmpfs                         tmpfs       39486656    183692  39302964       1% /run
... (truncated)

K



On 2019-02-05 10:37:07-05:00 checkmk-en wrote:

What the agent does at this stage is

 df -PTlk

Try running this command on the managed node and see what happens.

Regards,
Robert



On 05.02.2019 16:32, Karl Otterbein via checkmk-en wrote:



> telnet 10.x.x.x 6556
> Trying 10.x.x.x...
> Connected to 10.x.x.x.
> Escape character is '^]'.
> <<<check_mk>>>
> Version: 1.5.0p11
> AgentOS: linux
> Hostname: xxxxxx01
> AgentDirectory: /etc/check_mk
> DataDirectory: /var/lib/check_mk_agent
> SpoolDirectory: /var/lib/check_mk_agent/spool
> PluginsDirectory: /usr/lib/check_mk_agent/plugins
> LocalDirectory: /usr/lib/check_mk_agent/local
> OnlyFrom:
> <<<df>>>
>
> but this output is after re-installing the agent and restarting xinetd,
> and as you can see the output was completely truncated after <<<df>>>,
> where it just hangs until I break the session.
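
To pin down where the output stops, it may help to list the section headers that actually arrived. The sketch below assumes only what the output above shows, namely that each agent section starts with a line of the form <<<name>>>; the function name is made up for illustration.

```python
import re

# A section header is a line consisting of <<<name>>> and nothing else.
SECTION_HEADER = re.compile(r"^<<<([^<>]+)>>>\s*$", re.MULTILINE)


def sections_seen(agent_output):
    """Return the section names in the order they appear in the agent output."""
    return SECTION_HEADER.findall(agent_output)


# Shortened copy of the truncated telnet output quoted above.
sample = """<<<check_mk>>>
Version: 1.5.0p11
AgentOS: linux
<<<df>>>
"""

seen = sections_seen(sample)
print(seen)
print("last section started:", seen[-1] if seen else None)
```

Comparing this list between a local agent run and a remote one would show whether the remote output always stops in the same section.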



_______________________________________________
checkmk-en mailing list
checkmk-en at lists.mathias-kettner.de
Manage your subscription or unsubscribe
https://lists.mathias-kettner.de/cgi-bin/mailman/listinfo/checkmk-en