#871 Dead s2sout connection hinders connectivity to affected domain for hours

Reporter Jonas Wielicki
Owner MattJ
Created
Updated
Stars ★★★ (4)
Tags
  • Priority-Medium
  • Type-Defect
  • Milestone-0.9
  • Status-Fixed
  1. Jonas Wielicki on

    What steps will reproduce the problem? 1. Have an s2sout TCP connection which is "unauthed"(?) and which doesn’t appear to be received by the peer anymore and which doesn’t get ReSeTted by the peer. (sorry, I don’t have logs from when this situation occurred. It should be reproducable by simply dropping packets on the TCP connection, for example with https://github.com/horazont/aioxmpp/blob/devel/utils/mutetcp.py ) What is the expected output? What do you see instead? Following scenario: my domain (conference.zombofant.net), peer domain (foo.example). s2sout from conference.zombofant.net to foo.example is in the described dead condition. I expect prosody to close that s2sout at some point (e.g. after 5 minutes or so). However, the following happens: 1. foo.example opens connection to conference.zombofant.net; prosody sees new s2sin 2. prosody attempts dialback verification over dead s2sout 3. s2sin times out Some logs: Mar 28 04:30:23 s2sin11681a0 debug Incoming s2s connection Mar 28 04:30:23 s2sin11681a0 debug Incoming s2s received <stream:stream xmlns='http://etherx.jabber.org/streams' to='conference.zombofant.net' version='1.0' from='foo.example'> Mar 28 04:30:23 mod_s2s debug sending: <?xml version='1.0'?> Mar 28 04:30:23 mod_s2s debug sending: <stream:stream xmlns:db='jabber:server:dialback' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' from='conference.zombofant.net' id='1980de8a-01ee-420d-b5d8-045707bea6b5' to='foo.example' xml:lang='en' xmlns='jabber:server'> Mar 28 04:30:23 mod_s2s debug Sending stream features: <stream:features><dialback xmlns='urn:xmpp:features:dialback'/><starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'/></stream:features> Mar 28 04:30:23 mod_s2s debug sending: <stream:features> Mar 28 04:30:23 s2sin11681a0 debug Received[s2sin_unauthed]: <starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'> Mar 28 04:30:23 mod_s2s debug sending: <proceed xmlns='urn:ietf:params:xml:ns:xmpp-tls'> Mar 28 04:30:23 socket debug server.lua: we need to do tls, but delaying until send buffer empty Mar 28 04:30:23 s2sin11681a0 debug TLS negotiation started for s2sin_unauthed... Mar 28 04:30:23 socket debug server.lua: attempting to start tls on tcp{client}: 0x2f3f718 Mar 28 04:30:23 socket debug server.lua: ssl handshake done Mar 28 04:30:23 s2sin11681a0 debug Incoming s2s received <stream:stream xmlns='http://etherx.jabber.org/streams' to='conference.zombofant.net' version='1.0' from='foo.example'> Mar 28 04:30:23 s2sin11681a0 debug certificate chain validation result: invalid Mar 28 04:30:23 s2sin11681a0 debug certificate error(s) at depth 1: self signed certificate in certificate chain Mar 28 04:30:23 mod_s2s debug sending: <?xml version='1.0'?> Mar 28 04:30:23 mod_s2s debug sending: <stream:stream xmlns:db='jabber:server:dialback' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' from='conference.zombofant.net' id='b0c5e727-7349-4ca9-8cad-8ffea75f4355' to='foo.example' xml:lang='en' xmlns='jabber:server'> Mar 28 04:30:23 mod_s2s debug Sending stream features: <stream:features><dialback xmlns='urn:xmpp:features:dialback'/></stream:features> Mar 28 04:30:23 mod_s2s debug sending: <stream:features> Mar 28 04:30:23 s2sin11681a0 debug Received[s2sin_unauthed]: <result xmlns='jabber:server:dialback' to='conference.zombofant.net' from='foo.example'> Mar 28 04:30:23 s2sin11681a0 debug asking foo.example if key 0a7a03e73269dd1b64e673c3fb4647750c7855b538da511f80e563d57e0bc8ba belongs to them Mar 28 04:30:23 s2sout28ea980 debug trying to send over unauthed s2sout to foo.example Mar 28 04:30:23 s2sout28ea980 debug stanza [db:verify] queued Mar 28 04:31:53 s2sin11681a0 debug Destroying incomplete session foo.example->conference.zombofant.net due to inactivity Mar 28 04:31:53 s2sin11681a0 debug Disconnecting (unknown host)[s2sin_unauthed], <stream:error> is: connection-timeout Mar 28 04:31:53 mod_s2s debug sending: <stream:error> Mar 28 04:31:53 mod_s2s debug sending: </stream:stream> Mar 28 04:31:53 s2sin11681a0 info incoming s2s stream foo.example->conference.zombofant.net closed: connection-timeout Mar 28 04:31:53 s2sin11681a0 debug Destroying incoming session foo.example->conference.zombofant.net: connection-timeout Mar 28 04:31:53 s2sin11681a0 debug s2s disconnected: nil->nil (connection closed) What version of the product are you using? On what operating system? root@xmpp-public:~# apt show prosody Package: prosody Version: 0.9.7-2+deb8u3 Please provide any additional information below. I think that the following mitigations are not sufficient: * mod_pinger: c2s only * mod_s2s_idle_timeout: may not work depending on the rate connections from the other side are attempted, as the s2sout timeout would be based on when the last db:verify was sent * mod_s2s_keepalive may not work because it doesn’t care whether ping replies are received and I’m not sure it would even try to send pings because according to the logs the s2sout is "unauthed" ("trying to send over unauthed s2sout to foo.example").

  2. Jonas Wielicki on

    FWIW, I resolved the situation using s2s:close on admin_telnet. So I unfortunately cannot provide information such as "was the remote side sending ACKs on the TCP" or "how full was the TCP buffer". I should’ve thought about that before killing the s2s connection.

  3. Zash on

    More information would indeed have been useful, especially regarding the early life of s2sout28ea980 I've seen something similar on a different server, possibly related to DNS failures.

    Changes
    • tags Status-NeedInfo Milestone-0.9
  4. Martin on

    I'm not sure but I might experience the same problem: - I am running prosody on a shared hoster (uberspace.de, toast not involved as I used `./configure --prefix=${HOME} && make && make install` - I was in conversations@conference.siacs.eu - I updated prosody 0.10 to hg:8ea0484871e8 I restarted my server (19:25) - prosody.log tells me s2s to conference.siacs.eu was successful but I can't connect to conversations@conference.siacs.eu - At 21:21 I restarted again to no avail ``` prosodyctl about Prosody hg:8ea0484871e8 # Prosody directories Data directory: /home/md85/var/lib/prosody Config directory: /home/md85/etc/prosody Source directory: /home/md85/lib/prosody Plugin directories: /home/md85/opt/prosody-modules/ - prosody-modules rev: a6d19aea9b73 /home/md85/lib/prosody/modules/ # Lua environment Lua version: Lua 5.1 Lua module search paths: /home/md85/lib/prosody/?.lua /home/md85/.luarocks/share/lua/5.1/?.lua /home/md85/.luarocks/share/lua/5.1/?/init.lua /usr/share/lua/5.1/?.lua /usr/share/lua/5.1/?/init.lua /home/md85/.luarocks/share/lua/5.1/?.lua /home/md85/.luarocks/share/lua/5.1/?/init.lua /usr/lib64/lua/5.1/?.lua /usr/lib64/lua/5.1/?/init.lua /home/md85/.luarocks/share/lua/5.1/?.lua /home/md85/.luarocks/share/lua/5.1/?/init.lua Lua C module search paths: /home/md85/lib/prosody/?.so /home/md85/.luarocks/lib/lua/5.1/?.so /usr/lib/lua/5.1/?.so /home/md85/.luarocks/lib/lua/5.1/?.so /usr/lib64/lua/5.1/?.so /usr/lib64/lua/5.1/loadall.so /home/md85/.luarocks/lib/lua/5.1/?.so LuaRocks: Installed (2.1.2) # Lua module versions lfs: LuaFileSystem 1.6.3 libevent: 2.1.8-stable luaevent: 0.4.4 lxp: LuaExpat 1.3.0 socket: LuaSocket 3.0-rc1 ssl: 0.6 ``` ``` grep siacs.eu ~/var/lib/prosody/prosody.log Aug 28 17:38:08 s2sout5c4f760 info Outgoing s2s connection mdosch.de->conference.siacs.eu complete Aug 28 17:38:17 mdosch.de:saslauth info Accepting SASL EXTERNAL identity from conference.siacs.eu Aug 28 17:38:17 s2sin12124230 info Incoming s2s connection conference.siacs.eu->mdosch.de complete Aug 30 00:02:39 s2sout5c4f760 info Outgoing s2s stream mdosch.de->conference.siacs.eu closed: Received SIGTERM Aug 30 00:02:39 s2sin12124230 info Incoming s2s stream conference.siacs.eu->mdosch.de closed: Received SIGTERM Aug 30 00:03:01 s2sout766cf40 info Outgoing s2s connection mdosch.de->conference.siacs.eu complete Aug 30 00:03:02 mdosch.de:saslauth info Accepting SASL EXTERNAL identity from conference.siacs.eu Aug 30 00:03:02 s2sin7a41090 info Incoming s2s connection conference.siacs.eu->mdosch.de complete Sep 04 19:25:21 s2sout766cf40 info Outgoing s2s stream mdosch.de->conference.siacs.eu closed: Received SIGTERM Sep 04 19:25:21 s2sin7a41090 info Incoming s2s stream conference.siacs.eu->mdosch.de closed: Received SIGTERM Sep 04 19:25:36 s2sout869e2d0 info Outgoing s2s connection mdosch.de->conference.siacs.eu complete Sep 04 21:21:38 s2sout869e2d0 info Outgoing s2s stream mdosch.de->conference.siacs.eu closed: Received SIGTERM Sep 04 21:21:59 s2sout7969bc0 info Outgoing s2s connection mdosch.de->conference.siacs.eu complete ``` As you mentioned a possible DNS issue I remembered I had some DNS stuff in my error log (but without ever experiencing problems) I had a look but there was no recent activity: ``` grep DNS ~/var/lib/prosody/prosody.err Jul 29 17:12:15 adns error Exhausted all 1 configured DNS servers, next lookup will try 127.0.0.1 again Jul 31 17:45:52 adns error Exhausted all 1 configured DNS servers, next lookup will try 127.0.0.1 again Aug 07 19:23:33 adns error Exhausted all 1 configured DNS servers, next lookup will try 127.0.0.1 again Aug 11 23:00:51 adns error Exhausted all 1 configured DNS servers, next lookup will try 127.0.0.1 again Aug 13 15:29:09 adns error Exhausted all 1 configured DNS servers, next lookup will try 127.0.0.1 again Aug 15 14:57:09 adns error Exhausted all 1 configured DNS servers, next lookup will try 127.0.0.1 again Aug 18 16:41:14 adns error Exhausted all 1 configured DNS servers, next lookup will try 127.0.0.1 again Aug 19 14:30:18 adns error Exhausted all 1 configured DNS servers, next lookup will try 127.0.0.1 again ``` Also there were no errors logged today: ``` tail -n1 ~/var/lib/prosody/prosody.err Aug 19 14:30:18 adns error Exhausted all 1 configured DNS servers, next lookup will try 127.0.0.1 again ```

  5. Zash on

    Could someone try to reproduce this after https://hg.prosody.im/trunk/rev/756b8821007a to see if it helped?

  6. Jonas Schäfer on

    I haven’t noticed the issue for quite a while now. Then again, I also hadn’t noticed it for quite a while before initially reporting it, unfortunately.

  7. MattJ on

    Closing this. Rationale: This was originally reported against 0.9.7 (released in 2014). That branch is now beyond obsolete. There have been no recent reports of this issue that I'm aware of. Other bugs that may have been responsible for this, by not firing timers that would clean up the session, are #460 and #1572 (both fixed) (from the original report here, it looks like it was quite possibly using lua-event affected by #460). Finally, the connection logic has been replaced by net.connect in trunk.

    Changes
    • owner MattJ
    • tags Status-Fixed

New comment

Not published. Used for spam prevention and optional update notifications.