#460 Prosody trunk fails to execute module timers after running some time (a.k.a. THE timer problem)
Reporter
Ge0rG
Owner
MattJ
Created
Updated
Stars
★★ (3)
Tags
Priority-Medium
Type-Defect
Status-Fixed
Ge0rG
on
How to reproduce: Run prosody-trunk from the prosody debian repository (tested with 1nightly541-1~wheezy and some earlier versions), and have either mod_bosh or mod_smacks (or both) enabled.
After some time, prosody will fail to execute timer code (mod_pinger, mod_smacks timeouts, ...).
waqas suggested to run the following from the admin console:
>local h,now=timer.add_task.h,timer.reschedule.get_time();for i,v in ipairs(h.priorities) do local cb,param=h.items[h.ids[i]],timer.add_task.params[h.ids[i]]; if(tostring(cb)=="function(moduleapi.lua:366)") then cb="api:"..tostring(param.callback); elseif(tostring(cb)=="function(watchdog.lua:12)") then cb="watchdog:"..tostring(cb.watchdog.callback) end print(i,v,v - now, cb); end
Currently it reports >1500 events in the queue with negative wait times:
...
| 1514 1422293771.4827 -8874.7103559971 function(mod_s2s.lua:557)
| 1515 1422294227.0253 -8419.1677281857 function(mod_smacks.lua:262)
| 1516 1422295940.906 -6705.2870650291 function(mod_smacks.lua:262)
| 1517 1422297073.3799 -5572.8131351471 function(mod_smacks.lua:262)
| 1518 1422298475.1318 -4171.0612580776 function(mod_smacks.lua:262)
| 1519 1422298191.1517 -4455.0413451195 function(dns.lua:775)
| 1520 1422298191.1523 -4455.0407102108 function(dns.lua:775)
| 1521 1422298191.1561 -4455.0369620323 function(mod_manifesto.lua:70)
| 1522 1422298777.1561 -3869.036935091 watchdog:function(mod_pinger.lua:14)
| 1523 1422298266.1904 -4380.0026640892 function(mod_s2s.lua:557)
| 1524 1422298176.9323 -4469.2607090473 api:function(mod_smacks.lua:155)
| 1525 1422298269.9654 -4376.2276530266 function(mod_s2s.lua:557)
| 1526 1422299816.5679 -2829.6250941753 function(mod_s2s.lua:557)
| Result: nil
Zash
on
My attempt at fixing this:
(includes some overly verbose debug logging that would be stripped before commiting)
diff --git a/util/timer.lua b/util/timer.lua
--- a/util/timer.lua
+++ b/util/timer.lua
@@ -14,11 +14,21 @@
local debug_traceback = debug.traceback;
local tostring = tostring;
local xpcall = xpcall;
+local os_date = os.date;
+
+local function datefmt(t)
+ if os_date("%F") == os_date("%F", t) then
+ return os_date("%T.%%03d", t):format(t%1*1000);
+ end
+ return os_date("%F %T", t);
+end
local _ENV = nil;
local _add_task = server.add_task;
+local _server_timer;
+local _active_timers = 0;
local h = indexedbheap.create();
local params = {};
local next_time = nil;
@@ -29,23 +39,44 @@
local peek;
while true do
peek = h:peek();
- if peek == nil or peek > now then break; end
+ if peek == nil or peek > now then
+ log("debug", "No more pending timers");
+ break;
+ end
local _;
_, _callback, _id = h:pop();
_now = now;
_param = params[_id];
params[_id] = nil;
- --item(now, id, _param); -- FIXME pcall
+ log("debug", "_callback: %s", tostring(_param and _param.callback or _callback));
local success, err = xpcall(_call, _traceback_handler);
if success and type(err) == "number" then
h:insert(_callback, err + now, _id); -- re-add
params[_id] = _param;
end
end
+
+ if peek ~= nil and _active_timers > 1 and peek == next_time then
+ log("debug", "Another timer is scheduled for %s already", datefmt(next_time));
+ peek = nil;
+ else
next_time = peek;
- if peek ~= nil then
+ end
+
+ if peek then
+ log("debug", "Wait until %s", datefmt(peek));
return peek - now;
end
+ if not next_time and _active_timers == 0 then
+ log("debug", "No more timers");
+ if _server_timer then
+ _server_timer:close();
+ _server_timer = nil;
+ end
+ else
+ _active_timers = _active_timers - 1;
+ log("debug", "Not renewing, %d active timers left", _active_timers);
+ end
end
local function add_task(delay, callback, param)
local current_time = get_time();
@@ -55,7 +86,14 @@
params[id] = param;
if next_time == nil or event_time < next_time then
next_time = event_time;
- _add_task(next_time - current_time, _on_timer);
+ log("debug", "server.add_task(%s, _on_timer)", datefmt(next_time));
+ if _server_timer then
+ _server_timer:close();
+ _server_timer = nil;
+ else
+ _active_timers = _active_timers + 1;
+ end
+ _server_timer = _add_task(next_time - current_time, _on_timer);
end
return id;
end
*yell* sorry, the issue still persists.
After only three minutes of uptime:
timer:info()
| -- util.timer
| 2016-06-29 12:06:51 function(mod_statistics.lua:104)
| 2016-06-29 12:07:06 function(mod_s2s.lua:584)
| 2016-06-29 12:07:07 function(mod_s2s.lua:584)
| 2016-06-29 12:07:40 function(mod_s2s.lua:584)
| 2016-06-29 12:16:11 function(watchdog.lua:12)
| 2016-06-29 12:07:42 function(mod_s2s.lua:584)
| 2016-06-29 12:07:42 function(mod_s2s.lua:584)
| 2016-06-29 12:15:09 function(watchdog.lua:12)
| 2016-06-29 12:07:40 function(mod_s2s.lua:584)
| 1 libevent callbacks
| OK: Next event at 2016-06-29 12:06:51 (in -18.169477s)
Zash
on
Changes
tags Status-Accepted
Ge0rG
on
Just a small reminder that this is probably related to how lua-event (0.4.3 tested) handles the difference between NULL events and events with time=0. The following patch fixed THE timer bug for me:
https://op-co.de/tmp/lua-event-callback.patch
Zash
on
Can I close this then?
Ge0rG
on
I'd like to keep this open until there is a new lua-event version and prosody's dependency has been updated accordingly.
This is a major packaging issue. Debian and Ubuntu are currently providing a 0.4.3 package (without the fix), but that package reports itself as 0.4.4 (which is what shows up in 'prosodyctl about').
I too would like to keep this issue open until we can resolve this sensibly, possibly by offering lua-event from our repo.
The fixed Debian package for lua-event was included in buster and stretch-backports.
A related issue with similar symptoms (but less likely to trigger) was #1572 (0.11, fixed).
I've not seen any other recent reports of timers not firing, so I'm going to close this now. Thanks to everyone who helped with debugging!
How to reproduce: Run prosody-trunk from the prosody debian repository (tested with 1nightly541-1~wheezy and some earlier versions), and have either mod_bosh or mod_smacks (or both) enabled. After some time, prosody will fail to execute timer code (mod_pinger, mod_smacks timeouts, ...). waqas suggested to run the following from the admin console: >local h,now=timer.add_task.h,timer.reschedule.get_time();for i,v in ipairs(h.priorities) do local cb,param=h.items[h.ids[i]],timer.add_task.params[h.ids[i]]; if(tostring(cb)=="function(moduleapi.lua:366)") then cb="api:"..tostring(param.callback); elseif(tostring(cb)=="function(watchdog.lua:12)") then cb="watchdog:"..tostring(cb.watchdog.callback) end print(i,v,v - now, cb); end Currently it reports >1500 events in the queue with negative wait times: ... | 1514 1422293771.4827 -8874.7103559971 function(mod_s2s.lua:557) | 1515 1422294227.0253 -8419.1677281857 function(mod_smacks.lua:262) | 1516 1422295940.906 -6705.2870650291 function(mod_smacks.lua:262) | 1517 1422297073.3799 -5572.8131351471 function(mod_smacks.lua:262) | 1518 1422298475.1318 -4171.0612580776 function(mod_smacks.lua:262) | 1519 1422298191.1517 -4455.0413451195 function(dns.lua:775) | 1520 1422298191.1523 -4455.0407102108 function(dns.lua:775) | 1521 1422298191.1561 -4455.0369620323 function(mod_manifesto.lua:70) | 1522 1422298777.1561 -3869.036935091 watchdog:function(mod_pinger.lua:14) | 1523 1422298266.1904 -4380.0026640892 function(mod_s2s.lua:557) | 1524 1422298176.9323 -4469.2607090473 api:function(mod_smacks.lua:155) | 1525 1422298269.9654 -4376.2276530266 function(mod_s2s.lua:557) | 1526 1422299816.5679 -2829.6250941753 function(mod_s2s.lua:557) | Result: nil
My attempt at fixing this: (includes some overly verbose debug logging that would be stripped before commiting) diff --git a/util/timer.lua b/util/timer.lua --- a/util/timer.lua +++ b/util/timer.lua @@ -14,11 +14,21 @@ local debug_traceback = debug.traceback; local tostring = tostring; local xpcall = xpcall; +local os_date = os.date; + +local function datefmt(t) + if os_date("%F") == os_date("%F", t) then + return os_date("%T.%%03d", t):format(t%1*1000); + end + return os_date("%F %T", t); +end local _ENV = nil; local _add_task = server.add_task; +local _server_timer; +local _active_timers = 0; local h = indexedbheap.create(); local params = {}; local next_time = nil; @@ -29,23 +39,44 @@ local peek; while true do peek = h:peek(); - if peek == nil or peek > now then break; end + if peek == nil or peek > now then + log("debug", "No more pending timers"); + break; + end local _; _, _callback, _id = h:pop(); _now = now; _param = params[_id]; params[_id] = nil; - --item(now, id, _param); -- FIXME pcall + log("debug", "_callback: %s", tostring(_param and _param.callback or _callback)); local success, err = xpcall(_call, _traceback_handler); if success and type(err) == "number" then h:insert(_callback, err + now, _id); -- re-add params[_id] = _param; end end + + if peek ~= nil and _active_timers > 1 and peek == next_time then + log("debug", "Another timer is scheduled for %s already", datefmt(next_time)); + peek = nil; + else next_time = peek; - if peek ~= nil then + end + + if peek then + log("debug", "Wait until %s", datefmt(peek)); return peek - now; end + if not next_time and _active_timers == 0 then + log("debug", "No more timers"); + if _server_timer then + _server_timer:close(); + _server_timer = nil; + end + else + _active_timers = _active_timers - 1; + log("debug", "Not renewing, %d active timers left", _active_timers); + end end local function add_task(delay, callback, param) local current_time = get_time(); @@ -55,7 +86,14 @@ params[id] = param; if next_time == nil or event_time < next_time then next_time = event_time; - _add_task(next_time - current_time, _on_timer); + log("debug", "server.add_task(%s, _on_timer)", datefmt(next_time)); + if _server_timer then + _server_timer:close(); + _server_timer = nil; + else + _active_timers = _active_timers + 1; + end + _server_timer = _add_task(next_time - current_time, _on_timer); end return id; end
This might be fixed with http://hg.prosody.im/trunk/rev/f5fd2c5cdf28 (and the couple of commits just before it), yell if the issue persists.
Changes*yell* sorry, the issue still persists. After only three minutes of uptime: timer:info() | -- util.timer | 2016-06-29 12:06:51 function(mod_statistics.lua:104) | 2016-06-29 12:07:06 function(mod_s2s.lua:584) | 2016-06-29 12:07:07 function(mod_s2s.lua:584) | 2016-06-29 12:07:40 function(mod_s2s.lua:584) | 2016-06-29 12:16:11 function(watchdog.lua:12) | 2016-06-29 12:07:42 function(mod_s2s.lua:584) | 2016-06-29 12:07:42 function(mod_s2s.lua:584) | 2016-06-29 12:15:09 function(watchdog.lua:12) | 2016-06-29 12:07:40 function(mod_s2s.lua:584) | 1 libevent callbacks | OK: Next event at 2016-06-29 12:06:51 (in -18.169477s)
Just a small reminder that this is probably related to how lua-event (0.4.3 tested) handles the difference between NULL events and events with time=0. The following patch fixed THE timer bug for me: https://op-co.de/tmp/lua-event-callback.patch
Can I close this then?
I'd like to keep this open until there is a new lua-event version and prosody's dependency has been updated accordingly.
https://github.com/harningt/luaevent/commit/e9efea74d2eb8f3b1cb7ee2d3d0c83a88524ac2e Looks like this should be in the 0.4.4 release.
So this is now a package repo issue?
ChangesZashMattJThis is a major packaging issue. Debian and Ubuntu are currently providing a 0.4.3 package (without the fix), but that package reports itself as 0.4.4 (which is what shows up in 'prosodyctl about'). I too would like to keep this issue open until we can resolve this sensibly, possibly by offering lua-event from our repo.
Relevant bug report in Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=889743
The fixed Debian package for lua-event was included in buster and stretch-backports. A related issue with similar symptoms (but less likely to trigger) was #1572 (0.11, fixed). I've not seen any other recent reports of timers not firing, so I'm going to close this now. Thanks to everyone who helped with debugging!
Changes