-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add recovery strategy for error waiting for container: unexpected EOF
#703
Conversation
@nuclearsandwich I have a draft of the process here. I need some insight on how to test this... I have a separate PR that I think should go first to this one, that adds the necessary plugins for this to work. |
Can you link / reference that here. As discussed in the Infrastructure meeting we're blocked on Jenkins plugin upgrades generally but that's an important item to address in the near future.
I'm in favor of just hand-hacking the test_ci_linux job to display the error string and exit, rather than adding a separate test job just for this.
This script is generally run by ROS 2 devs locally after merging a PR and making a change. Running it requires Jenkins admin configuration and a configured python environment. There are details and links to further details in the readme. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a first pass review in addition to the general feedback and attempts to answer the raised questions.
create_job(os_name, 'test_ci_recovery_' + os_name, 'ci_job.xml.em', { | ||
'cmake_build_type': 'None', | ||
'with_eof_recovery': True | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in #703 (comment) I'm in favor of hand-hacking test_ci_linux
rather than creating a one-time job for testing.
throw ex | ||
} | ||
} | ||
println('# END SECTION: NVIDIA MISMATCH RECOVERY') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it was copy-pasted from the build.osrfoundation.org recovery without updating the section output.
I'm also not sure that this script needs to be wrapped in a script since the Groovy DSL isn't used.
@[if script_file]@ | ||
<source class="hudson.plugins.groovy.FileSystemScriptSource"> | ||
<scriptFile>@ESCAPE(script_file)</scriptFile> | ||
</source> | ||
@[end if]@ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this snippet copied from ros_buildfarm? Is there an anticipated use case for this branch of the snippet?
Closing this since we have tracked the actual error and this strategy is not the appropriate solution |
Description
This PR adds the functionality for recovering the agent by restarting it when an specific error occurs on the log. In particular the case that was applied is
error waiting for container: unexpected EOF
.