Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add heartbeat capability #94

Merged
merged 1 commit into from
Oct 30, 2023
Merged

Add heartbeat capability #94

merged 1 commit into from
Oct 30, 2023

Conversation

kpamnany
Copy link
Collaborator

@kpamnany kpamnany commented Oct 7, 2023

PR Description

Telemetry for Julia's threads.

This capability is build-time enabled by setting JL_HEARTBEAT_THREAD in src/options.h, which causes Julia to start a heartbeat thread that quickly goes to sleep.

Control heartbeats with jl_heartbeat_enable(heartbeat_s, n_reports, reset_reporting_after_s). Heartbeats are disabled if heartbeat_s <= 0. Otherwise, you must call jl_heartbeat() at least one time every heartbeat_s. If this does not happen, the heartbeat thread will call jl_print_task_backtraces(0). If jl_heartbeat() continues to not be called, we double the reporting interval up to a maximum of 60 times the heartbeat_s and report up to n_reports times altogether. Once jl_heartbeat() is called again, the number of reports is reset immediately but we only reset the reporting interval back to heartbeat_s after reset_reporting_after_s of steady jl_heartbeat()s are observed.

Checklist

Requirements for merging:

@github-actions github-actions bot added port-to-v1.10 This change should apply to Julia v1.10 builds port-to-master This change should apply to all future Julia builds port-to-v1.9 This change should apply to Julia v1.9 builds labels Oct 7, 2023
@kpamnany kpamnany force-pushed the kp-heartbeat-thread branch 2 times, most recently from 1528dfb to 301f385 Compare October 13, 2023 19:25
@kpamnany kpamnany requested review from a team and tveldhui October 13, 2023 22:08
src/threading.c Outdated Show resolved Hide resolved
src/threading.c Outdated Show resolved Hide resolved
Copy link
Member

@NHDaly NHDaly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly LGTM! I think it needs a couple tests?

Also, should we open this as a PR upstream? This seems like a useful feature for anyone who is experiencing process hangs and/or deadlocks, not just for us, yeah?

@NHDaly
Copy link
Member

NHDaly commented Oct 18, 2023

My main remaining question is that this should probably also be opened upstream, right?
Also, what happened to the PR checkboxes? Can you add those back?

@kpamnany
Copy link
Collaborator Author

Yes, I think we should open this upstream, but we should work out all the kinks beforehand. There might be tweaks we apply to this to make it more useful/effective over the next weeks.

@NHDaly
Copy link
Member

NHDaly commented Oct 18, 2023

Makes sense. Please file an internal tracking issue then, so we don't forget, and put that in place of the first checkbox. And attend to the second checkbox.

Assuming those are addressed and you add a bunch of comments, this LGTM i think!

@kpamnany kpamnany force-pushed the kp-heartbeat-thread branch from 301f385 to 91e35d1 Compare October 22, 2023 16:53
@kpamnany
Copy link
Collaborator Author

kpamnany commented Oct 22, 2023

More comments added. Issue to upstream this created and linked.

Please review the sleep duration computation I've changed in these lines? Previously, the code unsafely assumed that tchb < 1e9. This assumption is now removed.

Also added a reporting interval backoff as suggested.

@kpamnany kpamnany requested a review from NHDaly October 22, 2023 16:59
@kpamnany kpamnany force-pushed the kp-heartbeat-thread branch from 91e35d1 to 120ff86 Compare October 23, 2023 17:50
@kpamnany
Copy link
Collaborator Author

I spent some time trying to figure out how to test this. The challenge is that heartbeats are time driven and all timing sensitive tests are flaky in CI. Would love any ideas.

@NHDaly
Copy link
Member

NHDaly commented Oct 23, 2023

I spent some time trying to figure out how to test this. The challenge is that heartbeats are time driven and all timing sensitive tests are flaky in CI. Would love any ideas.

The standard way to deal with time-sensitive approaches is to mock out the timing element, and step the system forward in the tests.

There are basically two or three main variants that I've seen on that:

  • mockable / test-injectable timing system
    • At google, any class that wanted to anything based on time would depend on some Timing or Clock interface, and so to instantiate one, you'd have pass in a Time object. In production, this would be a normal Time implementation, but in tests, you could use a MockTime instance, which lets you control the time from your tests.
  • Insert mockable "pause points" specifically for testing, and have the tests override those mocks.
    • That's what we do in RAI. You can look at the v2-api tests for an example of this, where we have mockable pause points in the transaction processing, and we use a channel to set up synchronization between the client and server through those mocked pause points.
  • (variant on the above) If your system already has some synchronization interaction, mock out those to drive the tests.
    • So if the system waits for a client before proceeding at some points, you can mock out the client and control the system that way.

I'm not sure which of those would be most applicable in this situation. Notably, Julia's codebase does not employ any of these in its current test suites, so the infrastructure isn't there to easily build upon it.

@NHDaly
Copy link
Member

NHDaly commented Oct 23, 2023

One way to mock out the timing would be: instead of calling jl_hrtime() directly, you could call it through a global function pointer which defaults to jl_hrtime. Then the tests could update that global function pointer to point to a mocked timing function, which calls a julia callback instead, and from there you can now do whatever you want to step time forward.

@kpamnany
Copy link
Collaborator Author

As discussed, I've changed the configuration and behavior of the heartbeat thread; see here.

Additionally, I've hardened the interface so that two consecutive jl_heartbeat_enable() calls don't break things.

Here's the log of the benchmark.micro.query_evaluation job with the heartbeat thread enabled; there are quite a few heartbeat losses -- task backtraces printed. There seem to be only two types:

  • do_bench runs on thread 1 and occupies it completely; an artifact of how the benchmark is written. But it isn't clear why the second interactive thread doesn't pick up and run the server metrics task; it seems to be asleep.
  • A thread is compiling (deep in LLVM) and numerous other threads are blocked on the codegen_lock.

Would love to try and get this complete and merged tomorrow.

@NHDaly
Copy link
Member

NHDaly commented Oct 30, 2023

Really neat to hear you're seeing this in action!

  • But it isn't clear why the second interactive thread doesn't pick up and run the server metrics task; it seems to be asleep

Is this because thread 1 isn't able to drive the event loop while it's running the do_bench function?

Copy link
Member

@NHDaly NHDaly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Kiran! 🎉
Can you make sure all the open threads are resolved before merging? Otherwise LGTM

Presence is controlled by a build-time option. Start a separate
thread which simply sleeps. When heartbeats are enabled, this
thread wakes up at specified intervals to verify that user code
is heartbeating as requested and if not, prints task backtraces.

Also fixes the call to `maxthreadid` in `generate_precompile.jl`.
@kpamnany kpamnany force-pushed the kp-heartbeat-thread branch from c055069 to 87c9bc3 Compare October 30, 2023 18:26
@kpamnany kpamnany merged commit 9cd802f into v1.9.2+RAI Oct 30, 2023
1 check passed
@kpamnany kpamnany deleted the kp-heartbeat-thread branch October 30, 2023 20:23
@Drvi Drvi removed port-to-v1.10 This change should apply to Julia v1.10 builds port-to-master This change should apply to all future Julia builds port-to-v1.9 This change should apply to Julia v1.9 builds labels Nov 2, 2023
@Drvi
Copy link
Member

Drvi commented Nov 2, 2023

I've ported the PR to all +RAI branches

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants