Add heartbeat capability #94
1528dfb to 301f385
This mostly LGTM! I think it needs a couple tests?
Also, should we open this as a PR upstream? This seems like a useful feature for anyone who is experiencing process hangs and/or deadlocks, not just for us, yeah?
My main remaining question is that this should probably also be opened upstream, right?
Yes, I think we should open this upstream, but we should work out all the kinks beforehand. There might be tweaks we apply to this to make it more useful/effective over the next weeks.
Makes sense. Please file an internal tracking issue then, so we don't forget, and put that in place of the first checkbox. And attend to the second checkbox. Assuming those are addressed and you add a bunch of comments, this LGTM I think!
301f385 to 91e35d1
More comments added. Issue to upstream this created and linked. Please review the sleep duration computation I've changed in these lines? Previously, the code unsafely assumed that Also added a reporting interval backoff as suggested.
91e35d1 to 120ff86
I spent some time trying to figure out how to test this. The challenge is that heartbeats are time-driven, and timing-sensitive tests tend to be flaky in CI. Would love any ideas.
The standard way to deal with time-sensitive approaches is to mock out the timing element, and step the system forward in the tests. There are basically two or three main variants that I've seen on that:
I'm not sure which of those would be most applicable in this situation. Notably, Julia's codebase does not employ any of these in its current test suites, so the infrastructure isn't there to easily build upon it.
One way to mock out the timing would be: instead of calling
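A minimal sketch of the mock-the-clock idea discussed above, in C. Everything named here (`hb_monitor_t`, `fake_clock_t`, `hb_missed`, and so on) is hypothetical and not taken from this PR; the assumption is only that the code under test reads time through an injectable function rather than calling the system clock directly, so a test can step time forward without sleeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical time source: production code would pass a function that
   reads a real monotonic clock; tests pass one backed by a counter. */
typedef uint64_t (*now_ns_fn)(void *ctx);

/* Illustrative monitor state (not the PR's data structure). */
typedef struct {
    now_ns_fn now_ns;      /* injected time source */
    void *ctx;             /* opaque state handed to the time source */
    uint64_t last_beat_ns; /* time of the most recent heartbeat */
    uint64_t interval_ns;  /* maximum allowed gap between heartbeats */
} hb_monitor_t;

/* Fake clock for tests: time only moves when the test says so. */
typedef struct {
    uint64_t t_ns;
} fake_clock_t;

static uint64_t fake_now_ns(void *ctx)
{
    return ((fake_clock_t *)ctx)->t_ns;
}

static void fake_advance(fake_clock_t *c, uint64_t delta_ns)
{
    c->t_ns += delta_ns;
}

static void hb_init(hb_monitor_t *m, now_ns_fn now, void *ctx,
                    uint64_t interval_ns)
{
    assert(now != NULL && interval_ns > 0);
    m->now_ns = now;
    m->ctx = ctx;
    m->interval_ns = interval_ns;
    m->last_beat_ns = now(ctx);
}

/* Record a heartbeat at the current (injected) time. */
static void hb_beat(hb_monitor_t *m)
{
    m->last_beat_ns = m->now_ns(m->ctx);
}

/* True when more than one interval has elapsed since the last heartbeat. */
static bool hb_missed(const hb_monitor_t *m)
{
    return m->now_ns(m->ctx) - m->last_beat_ns > m->interval_ns;
}
```

A test would then call `fake_advance` to cross the interval boundary and check `hb_missed` before and after, with no real sleeping involved. Whether this shape fits the heartbeat thread's actual wait loop is an open question.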
As discussed, I've changed the configuration and behavior of the heartbeat thread; see here. Additionally, I've hardened the interface so that two consecutive Here's the log of the
Would love to try and get this complete and merged tomorrow.
Really neat to hear you're seeing this in action!
Is this because thread 1 isn't able to drive the event loop while it's running the do_bench function?
LGTM, thanks Kiran! 🎉
Can you make sure all the open threads are resolved before merging? Otherwise LGTM
Presence is controlled by a build-time option. Start a separate thread which simply sleeps. When heartbeats are enabled, this thread wakes up at specified intervals to verify that user code is heartbeating as requested and if not, prints task backtraces. Also fixes the call to `maxthreadid` in `generate_precompile.jl`.
c055069 to 87c9bc3
I've ported the PR to all +RAI branches
PR Description
Telemetry for Julia's threads.
This capability is build-time enabled by setting `JL_HEARTBEAT_THREAD` in `src/options.h`, which causes Julia to start a heartbeat thread that quickly goes to sleep.
Control heartbeats with `jl_heartbeat_enable(heartbeat_s, n_reports, reset_reporting_after_s)`. Heartbeats are disabled if `heartbeat_s <= 0`. Otherwise, you must call `jl_heartbeat()` at least one time every `heartbeat_s`. If this does not happen, the heartbeat thread will call `jl_print_task_backtraces(0)`. If `jl_heartbeat()` continues to not be called, we double the reporting interval up to a maximum of 60 times the `heartbeat_s` and report up to `n_reports` times altogether. Once `jl_heartbeat()` is called again, the number of reports is reset immediately but we only reset the reporting interval back to `heartbeat_s` after `reset_reporting_after_s` of steady `jl_heartbeat()`s are observed.
Checklist
Requirements for merging:
[ ] I have opened an issue or PR upstream on JuliaLang/julia:port-to-*
labels that don't apply.
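For reference, the reporting policy from the PR description (double the interval on each miss up to 60 times `heartbeat_s`, report at most `n_reports` times, reset the report count as soon as heartbeats resume, reset the interval only after `reset_reporting_after_s` of steady heartbeats) can be modeled as a small state machine. This is a sketch, not the PR's code: the struct, the function names, and the choice to keep doubling the interval even after the report budget is exhausted are all assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative policy state; field names mirror the parameters of
   jl_heartbeat_enable, the rest is hypothetical. */
typedef struct {
    int heartbeat_s;             /* expected heartbeat period */
    int n_reports;               /* max reports per missed-heartbeat episode */
    int reset_reporting_after_s; /* steady time needed to reset the interval */
    int interval_s;              /* current reporting interval */
    int reports_made;            /* reports printed in the current episode */
    int steady_s;                /* seconds of steady heartbeats observed */
} hb_policy_t;

static void hb_policy_init(hb_policy_t *p, int heartbeat_s, int n_reports,
                           int reset_reporting_after_s)
{
    assert(p != NULL);
    /* Guard the 60x cap against integer overflow. */
    assert(heartbeat_s > 0 && heartbeat_s <= INT_MAX / 60);
    p->heartbeat_s = heartbeat_s;
    p->n_reports = n_reports;
    p->reset_reporting_after_s = reset_reporting_after_s;
    p->interval_s = heartbeat_s;
    p->reports_made = 0;
    p->steady_s = 0;
}

/* The monitor woke up and saw no heartbeat. Returns true if a report
   (task backtraces in the PR) should be printed. Assumption: the interval
   doubles on every miss, whether or not a report is printed. */
static bool hb_policy_on_miss(hb_policy_t *p)
{
    int cap = 60 * p->heartbeat_s;
    bool report = p->reports_made < p->n_reports;
    if (report)
        p->reports_made++;
    p->steady_s = 0;
    p->interval_s = (p->interval_s > cap / 2) ? cap : p->interval_s * 2;
    return report;
}

/* The monitor woke up and saw a heartbeat; elapsed_s is the time covered
   since the previous check. */
static void hb_policy_on_beat(hb_policy_t *p, int elapsed_s)
{
    p->reports_made = 0; /* report budget resets immediately */
    p->steady_s += elapsed_s;
    if (p->steady_s >= p->reset_reporting_after_s)
        p->interval_s = p->heartbeat_s; /* interval resets only when steady */
}
```

Keeping the policy free of any clock or thread calls is a deliberate choice in this sketch: it makes the backoff arithmetic checkable in isolation, which sidesteps the flaky-timing concern raised in the review.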