#536 Notify the admin before reaching a socket descriptor limit
When prosody runs out of file descriptors, bad things happen, and it is not easy to debug them.
It would be awesome if prosody could send a message to the configured admins' JID(s) when a given threshold of the maximum process limit (e.g. 66%) has been reached, so the admin can increase the limits without time pressure. Ideally, this message should only be sent once (per prosody start and resource).
It would also be nice to have some output from prosodyctl (about / status) regarding the configured limits; however, this is probably hard to achieve before the daemon has started.
It really would, but unfortunately there is no simple way to do this (I have attempted it before).
The problem is that there's no (straightforward) way to find out how many descriptors we currently hold open. It doesn't map 1:1 to the number of connections, and some modules could easily make it a 1:2 ratio without us knowing about the extra fds being used.
The only accurate solution I have seen suggested is to loop through all possible fds and test which ones are open. Obviously not very efficient...
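For illustration, the brute-force probe described above might look roughly like the following. This is only a sketch in Python (not Prosody's own code), it assumes a Unix-like system where the `resource` module is available, and the names `count_open_fds` and `MAX_PROBE` are made up for this example.

```python
import os
import resource

# Hypothetical safety cap so the loop stays bounded when the limit is huge
# or unlimited; fds above this number are simply not counted.
MAX_PROBE = 65536

def count_open_fds():
    """Count open descriptors by probing every fd number below the soft limit.

    Returns (open_count, soft_limit).
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY or soft_limit > MAX_PROBE:
        probe_upto = MAX_PROBE
    else:
        probe_upto = soft_limit

    open_count = 0
    for fd in range(probe_upto):
        try:
            os.fstat(fd)  # fails with EBADF if fd is not open
        except OSError:
            continue
        open_count += 1
    return open_count, soft_limit
```

Each call costs one syscall per possible fd, which is why this is only reasonable as an occasional check and not something to run on every new connection.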
We could take the chaos monkey approach, modify the environment to fail to open files more often and then make sure it gets dealt with in a graceful manner.
Currently there are probably many dark code paths that rarely see the light of execution which may cause troubles for lonely threads in the rare cases they venture beyond the well trodden paths.
@MattJ Maybe this is naive, but I would consider the largest fd held by the program to be a good approximation. POSIX requires that the smallest possible fd is assigned to new connections, so fragmentation is not an issue regarding reaching the limit.
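A rough sketch of this heuristic, in Python for illustration only: the first function demonstrates the lowest-free-number behaviour that POSIX specifies for `open()`, and the second treats the highest fd held as an estimate of usage. The function names and the 0.66 default are assumptions taken from the example threshold in the report, not anything Prosody provides.

```python
import os

def demo_lowest_fd_reuse():
    """Show that a freed fd number is handed out again before any higher one.

    Assumes nothing else in the process opens or closes descriptors while
    this runs (e.g. no other threads).
    """
    a = os.open(os.devnull, os.O_RDONLY)
    b = os.open(os.devnull, os.O_RDONLY)
    c = os.open(os.devnull, os.O_RDONLY)
    os.close(b)                            # leave a hole below c
    d = os.open(os.devnull, os.O_RDONLY)   # lowest free number is b's old slot
    for fd in (a, c, d):
        os.close(fd)
    return b, d

def over_threshold(held_fds, soft_limit, threshold=0.66):
    """Estimate usage from the highest fd held and compare it to the threshold."""
    if not held_fds:
        return False
    estimated_in_use = max(held_fds) + 1   # fds are numbered from 0
    return estimated_in_use >= threshold * soft_limit
```

The estimate can only overstate real usage, never understate it, which is what makes it a safe (if pessimistic) trigger for a warning.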
@Zash yeah, the server is not really handling this situation very well. It fails to open all kinds of data files and seems to be busy-looping. While individual errors cause XMPP errors to be generated (e.g. failing to open offline storage will send back an error for a message), it would be great to stop accepting new clients / opening s2s links if we are above a threshold, so we can at least properly handle the existing users.
I guess that's a fair enough compromise. I think the worst case scenario would be: client 1023 (ignoring for now whatever margin we want to give ourselves) connects, we enter 'lockdown' because we know we're almost at the limit. We deny new connections, alert the admin, etc.
But then over the course of the day, client 1023 stays connected while many other people disconnect (not an uncommon pattern in servers serving lots of users across different timezones). Client 1023 could stay connected for as long as they want, and no other users would be allowed to connect even if they were the only user on the server.
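The worst case described above can be reproduced with a small simulation. This is a toy model in Python, not real socket code: it assumes a limit of 1024, three descriptors reserved for stdin/stdout/stderr, and exactly one fd per client, and the function names are invented for the example.

```python
def lowest_free(used):
    """Return the lowest fd number not in use, mimicking POSIX allocation."""
    fd = 0
    while fd in used:
        fd += 1
    return fd

def simulate_lockdown(limit=1024, reserved=3):
    """Fill the fd table, then disconnect every client except the last one.

    Returns (highest_fd, open_count) as seen after the disconnects.
    """
    used = set(range(reserved))          # stdin, stdout, stderr
    while len(used) < limit:             # clients connect until the table is full
        used.add(lowest_free(used))
    last_client = max(used)              # fd limit - 1, e.g. 1023

    for fd in range(reserved, last_client):
        used.discard(fd)                 # everyone but the last client leaves

    return max(used), len(used)
```

With the defaults, the highest fd stays at 1023 while only 4 descriptors are actually open, so a check based purely on the highest fd would keep the server in lockdown.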
Maybe you could also expose the number of used descriptors from whatever select/epoll/... API you are using under the hood?
Right, but that brings us back to the original issue. There's no syscall to get the number of open fds, and only counting connections isn't enough. We have modules that open a file for every connection, e.g. for logging purposes. In that case we would hit the limit long before we predicted, because the number of fds in use is actually twice the usual.