[Check_mk (english)] EXT: Re: EXT: Re: EXT: Re: Agent Not responding

Robert Dahlem Robert.Dahlem at gmx.net
Tue Feb 5 23:36:42 CET 2019


It is pretty obvious you have a packet loss problem, which has been
pointed out before. Let's compare both dumps.

Packet 1 (at the client, "CMK.serv") in the first dump corresponds to
packet 13 (at the server, "mgd.host") in the second.

1  -> 13, SYN,Seq=0. This is: "ring"
	we see MSS=1460 from the client,
	but only MSS=1396 reaches the server.
	Someone is "MSS clamping" us.
2 <-  14, SYN,ACK,Seq=0. This is: "yes, hello"
	MSS=1460 from the server,
	but only MSS=1354 reaches the client.
	Someone is "MSS clamping" us, but not equally
	for both directions.
3  -> 15, ACK,Seq=1. This is: "thank you for taking my call"
4 <-  16, PSH,ACK,Seq=1,Len=15. This is: "here comes your data"
	(the 15 bytes are most likely the <<<check_mk>>> header)
5  ->     ACK,Seq=1,Len=0.
	Interesting: at this point the server side shows something
	else. The two dumps are not of the same connection (note
	the different client ports, 54996 vs. 56232); they were
	taken at different times.
  <-  17, ACK,Len=2768
	Packet too big, it never reaches the client.
6 <-     PSH,ACK,Seq=12472,Len=1348

See how the client never receives a packet with more than 1348 bytes of payload?
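
You can watch the advertised MSS values yourself while the check runs.
A filter like this on both machines shows only the SYN packets, TCP
options included (the interface name eth0 is an assumption):

    tcpdump -nn -i eth0 'tcp port 6556 and tcp[tcpflags] & tcp-syn != 0'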

Because of packet 1->13 the server thinks it can send segments with up
to 1396 bytes of payload. In fact, the network does not transport more
than 1354 bytes. Segments bigger than that ought to be fragmented or
rejected with an ICMP error; instead the network silently drops them.
That is exactly what happens to the server's 1384-byte segments and
their endless retransmissions in your second dump.
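
You can confirm the real limit with do-not-fragment pings from the
managed node towards the client. ping adds 28 bytes of IP/ICMP headers
to its payload, so -s 1366 produces the 1394-byte packet that matches
MSS=1354, while -s 1408 produces the 1436-byte packet that matches
MSS=1396 (sizes assume plain IPv4 over Ethernet, no extra encapsulation
on the hosts themselves):

    ping -M do -s 1366 10.CMK.serv.0   # should get replies
    ping -M do -s 1408 10.CMK.serv.0   # should silently disappear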

Reduce the MTU on the server interface from 1500 to 1300 (MSS is
MTU-40). You will see that the communication works.
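
On a reasonably current Linux that would be something like this (eth0
is an assumption, use whatever the server interface is really called):

    ip link set dev eth0 mtu 1300

and "ip link set dev eth0 mtu 1500" later to put it back.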

Then raise it to 1500 again and make packet dumps of the communication
on both sides at the same time. Give those to the network guys. They
have an MSS clamping bug: traffic from the client to the server should
be clamped down to MSS=1354, not MSS=1396.
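
For the simultaneous dumps, something along these lines on both
machines will do (again, the interface name is an assumption):

    tcpdump -i eth0 -s 0 -w /tmp/cmk-6556.pcap tcp port 6556

And if the network guys do their clamping with iptables, the corrected
rule would look roughly like this (table and chain are assumptions; the
MSS value is what your dumps call for):

    iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
        -j TCPMSS --set-mss 1354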

Regards,
Robert


On 05.02.2019 22:04, Karl Otterbein via checkmk-en wrote:
> 
> A little more:  I did a capture on the managed node, and the following
> sequence is what occurred- it may be a window sizing issue as previously
> brought to my attention, but this is an area I have not played with:
> 
> "No.","Time","Source","Destination","Protocol","Length","Info"
> "13","1.993100","10.CMK.HOST.0","10.MGD.HOST.0","TCP","74","56232  >
>  6556 [SYN] Seq=0 Win=29200 Len=0 MSS=1396 SACK_PERM=1 TSval=2758210085
> TSecr=0 WS=128"
> "14","1.993212","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","74","6556  >
>  56232 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1
> TSval=2073679 TSecr=2758210085 WS=128"
> "15","2.006338","10.CMK.HOST.0","10.MGD.HOST.0","TCP","66","56232  >
>  6556 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=2758210098 TSecr=2073679"
> "16","2.034297","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","81","6556  >
>  56232 [PSH, ACK] Seq=1 Ack=1 Win=29056 Len=15 TSval=2073690
> TSecr=2758210098"
> "17","2.048122",""10.MGD.HOST.0",""10.CMK.HOST.0","TCP","2834","6556  >
>  56232 [ACK] Seq=16 Ack=1 Win=29056 Len=2768 TSval=2073693 TSecr=2758210098"
> "18","2.048139","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","1450","6556  >
>  56232 [ACK] Seq=2784 Ack=1 Win=29056 Len=1384 TSval=2073693
> TSecr=2758210098"
> "19","2.048165","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","2834","6556  >
>  56232 [ACK] Seq=4168 Ack=1 Win=29056 Len=2768 TSval=2073693
> TSecr=2758210098"
> "20","2.048175","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","1450","6556  >
>  56232 [ACK] Seq=6936 Ack=1 Win=29056 Len=1384 TSval=2073693
> TSecr=2758210098"
> "21","2.048195","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","2834","6556  >
>  56232 [ACK] Seq=8320 Ack=1 Win=29056 Len=2768 TSval=2073693
> TSecr=2758210098"
> "22","2.049683","10.CMK.HOST.0","10.MGD.HOST.0","TCP","66","56232  >
>  6556 [ACK] Seq=1 Ack=16 Win=29312 Len=0 TSval=2758210141 TSecr=2073690"
> "23","2.049728","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","2798","6556  >
>  56232 [PSH, ACK] Seq=11088 Ack=1 Win=29056 Len=2732 TSval=2073694
> TSecr=2758210141"
> "24","2.051458","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","1450","6556  >
>  56232 [ACK] Seq=13820 Ack=1 Win=29056 Len=1384 TSval=2073694
> TSecr=2758210141"
> "25","2.063603","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","78","[TCP Window
> Update] 56232  >  6556 [ACK] Seq=1 Ack=16 Win=32128 Len=0
> TSval=2758210155 TSecr=2073690 SLE=12472 SRE=1
> 3820"
> "26","2.063703","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","1450","[TCP
> Out-Of-Order] 6556  >  56232 [ACK] Seq=16 Ack=1 Win=29056 Len=1384
> TSval=2073697 TSecr=2758210155"
> "27","2.277525","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","1450","[TCP
> Retransmission] 6556  >  56232 [ACK] Seq=16 Ack=1 Win=29056 Len=1384
> TSval=2073751 TSecr=2758210155"
> "28","2.709520","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","1450","[TCP
> Retransmission] 6556  >  56232 [ACK] Seq=16 Ack=1 Win=29056 Len=1384
> TSval=2073859 TSecr=2758210155"
> "29","3.573634","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","1450","[TCP
> Retransmission] 6556  >  56232 [ACK] Seq=16 Ack=1 Win=29056 Len=1384
> TSval=2074075 TSecr=2758210155"
> "30","5.305548","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","1450","[TCP
> Retransmission] 6556  >  56232 [ACK] Seq=16 Ack=1 Win=29056 Len=1384
> TSval=2074508 TSecr=2758210155"
> "31","8.769619","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","1450","[TCP
> Retransmission] 6556  >  56232 [ACK] Seq=16 Ack=1 Win=29056 Len=1384
> TSval=2075374 TSecr=2758210155"
> "32","13.641886","10.CMK.HOST.0","10.MGD.HOST.0","TCP","78","56232  >
>  6556 [FIN, ACK] Seq=1 Ack=16 Win=32128 Len=0 TSval=2758221733
> TSecr=2073690 SLE=12472 SRE=13820"
> "33","13.645589","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","66","6556  >
>  56232 [ACK] Seq=15204 Ack=2 Win=29056 Len=0 TSval=2076593 TSecr=2758221733"
> "34","15.689597","10.MGD.HOST.0",""10.CMK.HOST.0","TCP","1450","[TCP
> Retransmission] 6556  >  56232 [ACK] Seq=16 Ack=2 Win=29056 Len=1384
> TSval=2077104 TSecr=2758221733"
> -- 
> Sent from Hiri <https://www.hiri.com/>
>  
> 
> On 2019-02-05 14:36:28-05:00 checkmk-en wrote:
> 
>     Bonding issue was not the problem- was able to bring down the bond
>     and go directly to eth0 with no problem, but the issue persists.
> 
>     Nothing I can find to correlate the issue.  I'm down to capturing
>     pcaps on the agent discovery, where I see the truncated data making
>     it back but the payload stops abruptly-  running cmk -vvII
>     10.mgd.host.0 from 10.CMK.serv.0
> 
>     Packets associated from tcpdump:
> 
>     1 0.000000 10.CMK.serv.0 10.mgd.host.0 TCP 74 54996 → 6556 [SYN]
>     Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=2752921687 TSecr=0
>     WS=128
> 
>     2 0.013476 10.mgd.host.0 10.CMK.serv.0 TCP 74 6556 → 54996 [SYN,
>     ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1354 SACK_PERM=1 TSval=751580
>     TSecr=2752921687 WS=128
> 
>     3 0.013517 10.CMK.serv.0 10.mgd.host.0 TCP 66 54996 → 6556 [ACK]
>     Seq=1 Ack=1 Win=29312 Len=0 TSval=2752921700 TSecr=751580
> 
>     4 0.036207 10.mgd.host.0 10.CMK.serv.0 TCP 81 6556 → 54996 [PSH,
>     ACK] Seq=1 Ack=1 Win=29056 Len=15 TSval=751586 TSecr=2752921700
> 
>     5 0.036231 10.CMK.serv.0 10.mgd.host.0 TCP 66 54996 → 6556 [ACK]
>     Seq=1 Ack=16 Win=29312 Len=0 TSval=2752921723 TSecr=751586
> 
>     6 0.051618 10.mgd.host.0 10.CMK.serv.0 TCP 1414 [TCP Previous
>     segment not captured] 6556 → 54996 [PSH, ACK] Seq=12472 Ack=1
>     Win=29056 Len=1348 TSval=751589 TSecr=2752921723
> 
>     7 0.051628 10.CMK.serv.0 10.mgd.host.0 TCP 78 [TCP Window Update]
>     54996 → 6556 [ACK] Seq=1 Ack=16 Win=32128 Len=0 TSval=2752921738
>     TSecr=751586 SLE=12472 SRE=13820
> 
>     ---ICMP PINGS from CMK from packets 8-26---
> 
>     27 49.500910 10.CMK.serv.0 10.mgd.host.0 TCP 78 54996 → 6556 [FIN,
>     ACK] Seq=1 Ack=16 Win=32128 Len=0 TSval=2752971188 TSecr=751586
>     SLE=12472 SRE=13820
> 
>     28 49.519048 10.mgd.host.0 10.CMK.serv.0 TCP 66 [TCP Previous
>     segment not captured] 6556 → 54996 [ACK] Seq=15204 Ack=2 Win=29056
>     Len=0 TSval=763956 TSecr=2752971188
> 
>      
> 
>     Packet 6 payload shows the agent sending back partial discovery
>     which is abruptly truncated.
> 
>      
> 
>     wå¤@{     tmpfs          65536         0     65536       0%
> 
>     /var/lib/docker/containers/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/mounts/shm
> 
>     overlay                       overlay    936179076 184998696
>     751180380      20%
>     /var/lib/docker/overlay2/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/merged
> 
>     overlay                       overlay    936179076 184998696
>     751180380      20%
>     /var/lib/docker/overlay2/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/merged
> 
>     shm                           tmpfs          65536         0    
>     65536       0%
>     /var/lib/docker/containers/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/mounts/shm
> 
>     overlay                       overlay    936179076 184998696
>     751180380      20%
>     /var/lib/docker/overlay2/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/merged
> 
>     tmpfs                         tmpfs       39486652         0 
>     39486652       0% /run/user/1112
> 
>     overlay                       overlay    936179076 184998696
>     751180380      20%
>     /var/lib/docker/overlay2/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/merged
> 
>     overlay                       overlay    936179076 184998696
>     751180380      20%
>     /var/lib/docker/overlay2/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/merged
> 
>     <<<df>>>
> 
>     [df_inodes_start]
> 
>      
> 
>     Final packets
> 
>      
> 
>     Source is the CMK server --> Managed node
> 
>      
> 
>     [SEQ/ACK analysis]
> 
>         [iRTT: 0.013517000 seconds]
> 
>         [TCP Analysis Flags]
> 
>             [Expert Info (Warning/Sequence): Previous segment(s) not
>     captured (common at capture start)]
> 
>                 [Previous segment(s) not captured (common at capture start)]
> 
>                 [Severity level: Warning]
> 
>                 [Group: Sequence]
> 
>      
> 
>      
> 
>     Return from Managed node --> CMK server:
> 
>      
> 
>     [SEQ/ACK analysis]
> 
>         [This is an ACK to the segment in frame: 27]
> 
>         [The RTT to ACK the segment was: 0.018138000 seconds]
> 
>         [iRTT: 0.013517000 seconds]
> 
>         [TCP Analysis Flags]
> 
>             [Expert Info (Warning/Sequence): Previous segment(s) not
>     captured (common at capture start)]
> 
>                 [Previous segment(s) not captured (common at capture start)]
> 
>                 [Severity level: Warning]
> 
>                 [Group: Sequence]
> 
>     [Timestamps]
> 
>         [Time since first frame in this TCP stream: 49.500910000 seconds]
> 
>         [Time since previous frame in this TCP stream: 49.449282000 seconds]
> 
>      
> 
>     Getting deep in the weeds now. If this were a FW issue I would
>     expect no ability to connect at all, but it's almost as if xinetd
>     just stops sending and the CMK server keeps waiting for the
>     remaining payload, so it never fails out.  The last sequence is
>     likely from me ctrl-c'ing the discovery from the command line.
> 
>     much appreciated for any thoughts!
> 
>      
> 
>     K
> 
>      
> 
>     *From: *checkmk-en <checkmk-en-bounces at lists.mathias-kettner.de> on
>     behalf of Karl Otterbein via checkmk-en
>     <checkmk-en at lists.mathias-kettner.de>
>     *Reply-To: *Karl Otterbein <karl.otterbein at jet.com>
>     *Date: *Tuesday, February 5, 2019 at 11:32 AM
>     *To: *Ezequiel Tolstanov <ezequiel at atdt.com.ar>
>     *Cc: *"checkmk-en at lists.mathias-kettner.de"
>     <checkmk-en at lists.mathias-kettner.de>
>     *Subject: *EXT: Re: [Check_mk (english)] EXT: Re: EXT: Re: Agent Not
>     responding
> 
>      
> 
>     No, it is a direct connect pipe with no VPN.  I have however
>     discovered that the bond interface is dropping packets so I am going
>     to go figure out if there is something physically wrong with the
>     host- looks to be related.
> 
>      
> 
>     Thank you all, I’ll reply if I find a cause with the bond, but
>     appreciate everyone’s input!
> 
>      
> 
>     K
> 
>      
> 
>     On Feb 5, 2019, at 11:13 AM, Ezequiel Tolstanov
>     <ezequiel at atdt.com.ar <mailto:ezequiel at atdt.com.ar>> wrote:
> 
>         Hi,
> 
>          
> 
>         Is the connection between the Server and the Remote Host going
>         through a VPN tunnel? If so, maybe you are facing an MSS
>         (Maximum Segment Size) issue.
> 
>          
> 
>         Best regards.
> 
>          
> 
>         El mar., 5 feb. 2019 a las 13:02, Karl Otterbein via checkmk-en
>         (<checkmk-en at lists.mathias-kettner.de
>         <mailto:checkmk-en at lists.mathias-kettner.de>>) escribió:
> 
>              
> 
>             sorry- sent by mistake without finishing:
> 
>             I wonder if the following I receive when running cmk -vvII
>             may cause an issue:
> 
>              [agent] Connecting via TCP to 10.51.169.15:6556 (5.0s timeout)
> 
>             I wonder if there is a timeout waiting- this is a remote
>             host in the cloud reaching to a co-lo in a different DC- is
>             there a means to up that TCP timeout?
> 
>             It doesn't make much sense because it's a 10 Gbit direct connect.
> 
>             Full output:
> 
>             + FETCHING DATA
> 
>              [agent] No persisted sections loaded
> 
>              [agent] Not using cache (Does not exist)
> 
>              [agent] Execute data source
> 
>              [agent] Connecting via TCP to 10.51.169.15:6556 (5.0s timeout)
> 
>              [agent] Reading data from agent
> 
>             -- 
>             Sent from Hiri <https://www.hiri.com/>
> 
>              
> 
>             On 2019-02-05 10:58:33.309644-05:00 Karl Otterbein wrote:
> 
>                 Does the agent run through completion on the local host?
>                  - yes- I'm able to complete through the agent run with
>                 both commands below...
> 
>                 telnet localhost 6556
> 
>                 and/or
> 
>                 /usr/bin/check_mk_agent (or wherever the agent lives)
> 
>                 Also, is the agent version equal to or lower than
>                 the monitoring server's? Having a newer agent version
>                 could cause some conflicts.  They are the same -
>                 both 1.5.0p11.
> 
>                 I wonder if the following I receive when running cmk
>                 -vvII may cause an issue:
>                  
> 
>                 -- 
>                 Sent from Hiri <https://www.hiri.com/>
> 
>                  
> 
>                 On 2019-02-05 10:52:19-05:00 Paul Dott wrote:
> 
>                     Does the agent run through completion on the local
>                     host?
> 
>                     telnet localhost 6556
> 
>                     and/or
> 
>                     /usr/bin/check_mk_agent (or wherever the agent lives)
> 
>                     Also, is the agent version equal to or lower
>                     than the monitoring server's? Having a newer
>                     agent version could cause some conflicts.
> 
>                      
> 
>                     On Tue, Feb 5, 2019 at 7:40 AM Karl Otterbein via
>                     checkmk-en <checkmk-en at lists.mathias-kettner.de
>                     <mailto:checkmk-en at lists.mathias-kettner.de>> wrote:
> 
>                         Thanks Robert for the quick reply-
> 
>                         works properly:
> 
>                         df -PTlk
> 
>                         Filesystem                    Type    
>                         1024-blocks      Used Available Capacity Mounted on
> 
>                         udev                          devtmpfs  
>                         197408056         0 197408056       0% /dev
> 
>                         tmpfs                         tmpfs      
>                         39486656    183692  39302964       1% /run
> 
>                         ... (truncated)
> 
>                          
> 
>                         K
>                          
> 
>                         -- 
>                         Sent from Hiri <https://www.hiri.com/>
> 
>                          
> 
>                         On 2019-02-05 10:37:07-05:00 checkmk-en wrote:
> 
>                             What the agent does at this stage is
> 
>                              df -PTlk
> 
>                              
> 
>                             Try running this command on the managed node and see what happens.
> 
>                              
> 
>                             Regards,
> 
>                             Robert
> 
>                              
> 
>                             On 05.02.2019 16:32, Karl Otterbein via checkmk-en wrote:
> 
>                              
> 
>                             > telnet 10.x.x.x 6556
> 
>                             > Trying 10.x.x.x...
> 
>                             > Connected to 10.x.x.x.
> 
>                             > Escape character is '^]'.
> 
>                             > <<<check_mk>>>
> 
>                             > Version: 1.5.0p11
> 
>                             > AgentOS: linux
> 
>                             > Hostname: xxxxxx01
> 
>                             > AgentDirectory: /etc/check_mk
> 
>                             > DataDirectory: /var/lib/check_mk_agent
> 
>                             > SpoolDirectory: /var/lib/check_mk_agent/spool
> 
>                             > PluginsDirectory: /usr/lib/check_mk_agent/plugins
> 
>                             > LocalDirectory: /usr/lib/check_mk_agent/local
> 
>                             > OnlyFrom: 
> 
>                             > <<<df>>>
> 
>                             > 
> 
>                             > but this output is after re-installing the agent and restarting xinetd,
> 
>                             > and as you can see the output was completely truncated after <<<df>>>,
> 
>                             > where it just hangs until I break the session.
> 
>                              
> 
>                             _______________________________________________
> 
>                             checkmk-en mailing list
> 
>                             checkmk-en at lists.mathias-kettner.de <mailto:checkmk-en at lists.mathias-kettner.de>
> 
>                             Manage your subscription or unsubscribe
> 
>                             https://lists.mathias-kettner.de/cgi-bin/mailman/listinfo/checkmk-en
> 
>                         _______________________________________________
>                         checkmk-en mailing list
>                         checkmk-en at lists.mathias-kettner.de
>                         <mailto:checkmk-en at lists.mathias-kettner.de>
>                         Manage your subscription or unsubscribe
>                         https://lists.mathias-kettner.de/cgi-bin/mailman/listinfo/checkmk-en
> 
>             _______________________________________________
>             checkmk-en mailing list
>             checkmk-en at lists.mathias-kettner.de
>             <mailto:checkmk-en at lists.mathias-kettner.de>
>             Manage your subscription or unsubscribe
>             https://lists.mathias-kettner.de/cgi-bin/mailman/listinfo/checkmk-en
> 
> 
> _______________________________________________
> checkmk-en mailing list
> checkmk-en at lists.mathias-kettner.de
> Manage your subscription or unsubscribe
> https://lists.mathias-kettner.de/cgi-bin/mailman/listinfo/checkmk-en
> 

