Add low-level debugging #256
Overall, looks good. We may have a much simpler way to add node-level logs without tinkering with Ray tasks.
LGTM. I think this will be a lot of logs. For a standard 100-node cluster running for 2 hours, we will print 720 * 3200 * 72 bytes ≈ 165 MB of additional logs. Hence, we should be able to disable this feature in prod if required (maybe as part of the scaling params).
CW costs about $0.50/GB collected, so thankfully that's only about an extra 8 cents per job. Definitely still something to keep in mind, though.
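For anyone who wants to re-run the numbers above with different cluster sizes or prices, here is a rough sketch of the arithmetic. The three factors are taken as quoted in the review (their individual meanings are not spelled out there), the result is assumed to be in bytes, and a decimal GB (10^9 bytes) is assumed for billing. The function and constant names are made up for illustration.

```python
from math import prod

# Factors quoted in the review comment; what each one stands for is not
# stated there, so they are only multiplied together here.
FACTORS = (720, 3200, 72)

# Ingestion price quoted in the reply, in USD per GB (assumed decimal GB).
PRICE_PER_GB_USD = 0.50


def estimate_log_bytes(factors=FACTORS):
    """Return the estimated extra log volume in bytes."""
    return prod(factors)


def estimate_cost_usd(total_bytes, price_per_gb=PRICE_PER_GB_USD):
    """Return the estimated ingestion cost in USD for total_bytes of logs."""
    return total_bytes / 1e9 * price_per_gb


total = estimate_log_bytes()
print(f"{total / 1e6:.1f} MB of logs, ${estimate_cost_usd(total):.3f} per job")
```

Swapping in a different tuple of factors or a different price gives a quick upper bound before deciding whether the logs need to be off by default.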
This change adds critical low-level debugging logs, most importantly node-level memory utilization. This change will help greatly in catching OOM failures that otherwise go undetected.
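As a rough illustration of the kind of node-level memory log described here, and of the on/off switch requested in review, the sketch below computes utilization from `/proc/meminfo`-style text and logs it only when enabled. This is not the code in this PR: the `LOW_LEVEL_DEBUG_LOGS` environment variable, the logger name, and all function names are hypothetical, and it assumes a Linux node that reports `MemTotal` and `MemAvailable`.

```python
import logging
import os

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("node_debug")


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value_in_kB} dict."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_utilization(text):
    """Return the fraction of memory in use (0.0 to 1.0)."""
    info = parse_meminfo(text)
    return 1.0 - info["MemAvailable"] / info["MemTotal"]


def log_node_memory(text, enabled=None):
    """Log node memory utilization unless the feature is disabled.

    When `enabled` is None, the hypothetical LOW_LEVEL_DEBUG_LOGS
    environment variable decides; setting it to "0" turns the log off.
    """
    if enabled is None:
        enabled = os.environ.get("LOW_LEVEL_DEBUG_LOGS", "1") != "0"
    if not enabled:
        return None
    used = memory_utilization(text)
    logger.info("node memory utilization: %.1f%%", used * 100)
    return used
```

On a Linux node this could be called as `log_node_memory(open("/proc/meminfo").read())`. Passing the text in, rather than reading the file inside the function, keeps the parsing testable off-cluster.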
Testing: