
error waiting for container: unexpected EOF failing builds #702

Closed
claraberendsen opened this issue May 15, 2023 · 12 comments

@claraberendsen
Contributor

Description

We have been encountering the error `error waiting for container: unexpected EOF` frequently on ci.ros2.org. It seems to be an issue with memory availability for the container.

Example build

https://ci.ros2.org/view/nightly/job/nightly_linux_debug/2664/console

Statistics on the issue for the first two weeks of May:

nightly_linux_debug: 20% of builds
nightly_linux-aarch64_release: 13.33% of builds
nightly_linux-aarch64_repeated: 13.33% of builds
nightly_linux_release: 7.14% of builds

@Crola1702
Contributor

Crola1702 commented May 19, 2023

Container EOF error statistics

15 days

| job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
| --- | --- | --- | --- | --- | --- |
| nightly_linux_debug | 2023-05-19 | 2023-05-10 | 16 | 5 | 31.25 |
| nightly_linux_foxy_coverage | 2023-05-18 | 2023-05-07 | 16 | 2 | 12.5 |
| nightly_linux-rhel_repeated | 2023-05-18 | 2023-05-06 | 21 | 5 | 23.81 |
| nightly_linux-rhel_release | 2023-05-18 | 2023-05-12 | 15 | 4 | 26.67 |
| nightly_linux-rhel_debug | 2023-05-18 | 2023-05-06 | 15 | 3 | 20.0 |
| Humble cyclonedds | 2023-05-18 | 2023-05-09 | 6 | 2 | 33.33 |
| Humble connext | 2023-05-18 | 2023-05-06 | 5 | 2 | 40.0 |
| Foxy connext | 2023-05-14 | 2023-05-14 | 5 | 1 | 20.0 |
| Rolling cyclonedds | 2023-05-13 | 2023-05-13 | 17 | 1 | 5.88 |
| Rolling connext | 2023-05-13 | 2023-05-13 | 16 | 1 | 6.25 |
| nightly_linux_release | 2023-05-12 | 2023-05-12 | 24 | 1 | 4.17 |
| nightly_linux-aarch64_release | 2023-05-11 | 2023-05-10 | 15 | 2 | 13.33 |
| nightly_linux-aarch64_repeated | 2023-05-10 | 2023-05-10 | 15 | 1 | 6.67 |

5 days

| job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
| --- | --- | --- | --- | --- | --- |
| nightly_linux_debug | 2023-05-19 | 2023-05-18 | 6 | 2 | 33.33 |
| nightly_linux_foxy_coverage | 2023-05-18 | 2023-05-18 | 6 | 1 | 16.67 |
| nightly_linux-rhel_repeated | 2023-05-18 | 2023-05-17 | 6 | 2 | 33.33 |
| nightly_linux-rhel_release | 2023-05-18 | 2023-05-16 | 5 | 3 | 60.0 |
| nightly_linux-rhel_debug | 2023-05-18 | 2023-05-18 | 5 | 1 | 20.0 |
| Humble cyclonedds | 2023-05-18 | 2023-05-18 | 2 | 1 | 50.0 |
| Humble connext | 2023-05-18 | 2023-05-18 | 2 | 1 | 50.0 |
| Foxy connext | 2023-05-14 | 2023-05-14 | 2 | 1 | 50.0 |
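For clarity on how the last column is derived, a minimal sketch of the calculation (failure_count over build_count, expressed as a percentage; the function name is my own):

```python
def failure_percentage(failure_count: int, build_count: int) -> float:
    """Failure rate as a percentage, rounded to two decimals as in the tables above."""
    return round(100.0 * failure_count / build_count, 2)

# Spot-check against the 15-day table:
print(failure_percentage(5, 16))   # nightly_linux_debug -> 31.25
print(failure_percentage(4, 15))   # nightly_linux-rhel_release -> 26.67
```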

@claraberendsen
Contributor Author

Today two aarch64 nightlies failed with this issue: nightly_linux-aarch64_debug #2384 and nightly_linux-aarch64_repeated #2371. Since the agent was not destroyed afterwards, the logs were preserved and a detailed investigation was possible.

Investigation

1. Rule out the OOM killer stopping the Docker container.
The command `journalctl -k` was run to check whether the OS OOM killer had caused the Docker daemon or the container process to die. No indication of OOM issues was found in the log.
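The check above can be sketched as a small filter over the `journalctl -k` output. This is a hypothetical helper, not part of the investigation tooling; the marker strings are the usual kernel OOM-killer phrases and may need adjusting per kernel version:

```python
# Hypothetical helper: scan kernel log lines (e.g. from `journalctl -k`)
# for the usual OOM-killer phrases. Marker strings are assumptions.
OOM_MARKERS = ("Out of memory", "oom-kill", "invoked oom-killer")

def find_oom_events(kernel_log_lines):
    """Return the kernel log lines that suggest the OOM killer fired."""
    return [line for line in kernel_log_lines
            if any(marker in line for marker in OOM_MARKERS)]
```

In this investigation the equivalent search came back empty, which is what ruled out the OOM killer.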

2. Check the docker.service logs for a better indication of what was happening.
Running `journalctl -u docker.service` shows in the Docker logs that systemd was causing the Docker daemon to restart. See the following lines:
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.277386638Z" level=info msg="Processing signal 'terminated'"
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.781644906Z" level=info msg="Daemon shutdown complete"
And then this line:
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802888332Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby
Similar logs can be found on the other agent where the nightly ran.
Note that on each agent the timing of these logs is close to when the error happens in the builds. This is a good indication that the errors are caused by the restart of the Docker service.

Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.277386638Z" level=info msg="Processing signal 'terminated'"
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.430955206Z" level=info msg="ignoring event" container=88acb4d47b1e53cb20be3fa2c2c3c43472b51f9fe05b9d81a4e7a4aacf46ad8a module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 06 06:44:13 ip-10-0-1-221 systemd[1]: run-docker-netns-d1893231c907.mount: Succeeded.
Jun 06 06:44:13 ip-10-0-1-221 systemd[1]: var-lib-docker-overlay2-1c135b999632ae831e47372d4458c53f1bf556a3062789f60394c15a953ed313-merged.mount: Succeeded.
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.781644906Z" level=info msg="Daemon shutdown complete"
Jun 06 06:44:13 ip-10-0-1-221 systemd[1]: docker.service: Succeeded.
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.834629993Z" level=info msg="Starting up"
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.835519557Z" level=info msg="detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf"
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.839813954Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.839843632Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.839870011Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.839881745Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.843368324Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.843398814Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.843419565Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.843431356Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.083075793Z" level=warning msg="Your kernel does not support CPU realtime scheduler"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.083108146Z" level=warning msg="Your kernel does not support cgroup blkio weight"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.083115251Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.083282143Z" level=info msg="Loading containers: start."
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.220468635Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.551039037Z" level=info msg="Loading containers: done."
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.588748768Z" level=info msg="Docker daemon" commit="20.10.21-0ubuntu1~20.04.2" graphdriver(s)=overlay2 version=20.10.21
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.589147012Z" level=info msg="Daemon has completed initialization"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.609178145Z" level=info msg="API listen on /run/docker.sock"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.976374499Z" level=info msg="Processing signal 'terminated'"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.977224089Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.977467411Z" level=info msg="Daemon shutdown complete"
Jun 06 06:44:14 ip-10-0-1-221 systemd[1]: docker.service: Succeeded.
Jun 06 06:44:14 ip-10-0-1-221 systemd[1]: docker.socket: Succeeded.
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.285510408Z" level=info msg="Starting up"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.286470569Z" level=info msg="detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.287077320Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.287100057Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.287121070Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.287131712Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.288953180Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.288979888Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.288998989Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.289011009Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.362675376Z" level=warning msg="Your kernel does not support CPU realtime scheduler"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.362707343Z" level=warning msg="Your kernel does not support cgroup blkio weight"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.362715433Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.362883835Z" level=info msg="Loading containers: start."
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.495188699Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.535984330Z" level=info msg="Loading containers: done."
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.555383161Z" level=info msg="Docker daemon" commit="20.10.21-0ubuntu1~20.04.2" graphdriver(s)=overlay2 version=20.10.21
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.555446997Z" level=info msg="Daemon has completed initialization"
Jun 06 06:44:15 ip-10-0-1-221 systemd[1]: var-lib-docker-overlay2-check\x2doverlayfs\x2dsupport223685642-merged.mount: Succeeded.
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.580350166Z" level=info msg="API listen on /run/docker.sock"
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802841899Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=moby
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802888332Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802922719Z" level=info msg="blockingPicker: the picked transport is not ready, loop back to repick" module=grpc
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802963933Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=plugins.moby
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802999568Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=plugins.moby
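The timing correlation above can be checked mechanically. A minimal sketch (my own helper, not part of the investigation tooling) that extracts the timestamps at which dockerd received SIGTERM from `journalctl -u docker.service` output, so they can be compared against the build failure times; the year is an assumption, since journalctl omits it in this format:

```python
from datetime import datetime

def docker_restart_times(log_lines, year=2023):
    """Timestamps at which dockerd logged 'Processing signal terminated'."""
    times = []
    for line in log_lines:
        if "Processing signal 'terminated'" in line:
            # journalctl lines start with e.g. "Jun 06 06:44:13 <host> ..."
            stamp = " ".join(line.split()[:3])
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times
```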

3. Check the journalctl logs in general.
Executing `journalctl` around the time of the failures, we can see the following a couple of minutes beforehand:

Jun 06 06:25:01 ip-10-0-1-221 CRON[428020]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
Jun 06 06:25:01 ip-10-0-1-221 CRON[428018]: pam_unix(cron:session): session closed for user root
Jun 06 06:34:19 ip-10-0-1-221 systemd[1]: Starting Daily apt upgrade and clean activities...
Jun 06 06:35:56 ip-10-0-1-221 dbus-daemon[592]: [system] Activating via systemd: service name='org.freedesktop.PackageKit' unit='packagekit.service' requested by ':1.28' (uid=0 pid=444913 comm="/usr/bin/gdbus call --system --dest org.freedeskto" label="unconfined")

This triggers a restart of most daemons by systemd. The "Daily apt upgrade" log entry indicates that an unattended-upgrade is being run.
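For reference, one common mitigation on Ubuntu for this kind of unattended-upgrade timing conflict is a systemd drop-in that pins the apt upgrade timer to a window when no nightlies run. This is illustrative only and not necessarily what the eventual fix does; the calendar value is a placeholder:

```ini
# /etc/systemd/system/apt-daily-upgrade.timer.d/override.conf
# (created with: sudo systemctl edit apt-daily-upgrade.timer)
[Timer]
# Clear the default schedule, then pin the upgrade to a quiet window (UTC).
OnCalendar=
OnCalendar=*-*-* 15:00
RandomizedDelaySec=0
```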

Conclusion

It would seem that the latest occurrences of `error waiting for container: unexpected EOF` are related to the timing of the unattended-upgrades, and not to an OOM or CPU issue as initially considered.
Even though we cannot rule out that an OOM issue may be causing some of them, there is a periodicity to the failures: they mostly affect nightlies, and at around the same time of day.
As an example, nightly_linux-aarch64_debug failed on two different agents but at similar times of day in its last two runs, #2383 and #2384.

Further investigation / Not answered questions

  • Why does it fail on normal agents but not on the 2xlarge instances? One avenue of research is to investigate how the execution time for unattended-upgrades is configured.

@claraberendsen
Contributor Author

Today another nightly failed with this error: nightly_linux-rhel_debug.
In the logs we see the time of failure:

06:52:51 time="2023-06-07T06:52:52Z" level=error msg="error waiting for container: unexpected EOF"

If we ssh into the instance, we see that the unattended upgrade ran during that time as well:

Jun 07 06:40:39 ip-10-0-1-235 systemd[1]: Starting Daily apt upgrade and clean activities...
Jun 07 06:53:35 ip-10-0-1-235 systemd[1]: Finished Daily apt upgrade and clean activities.

@claraberendsen
Contributor Author

Replication Build Status

A test job was run on an agent and the docker service was restarted with sudo systemctl restart docker.

@claraberendsen
Contributor Author

aarch64 builds have been green since the weekend.

@Blast545
Contributor

Waiting for: https://github.com/osrf/chef-osrf/pull/204

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-infrastructure-updates-from-june-2023/32258/1

@wep21

wep21 commented Aug 25, 2023

I have encountered a similar issue that says `error waiting for container: unexpected EOF`:
https://build.ros2.org/job/Hbin_uJ64__aws_sdk_cpp_vendor__ubuntu_jammy_amd64__binary/16/display/redirect
https://build.ros2.org/job/Rbin_ujv8_uJv8__aws_sdk_cpp_vendor__ubuntu_jammy_arm64__binary/11/display/redirect

@claraberendsen
Contributor Author

claraberendsen commented Aug 28, 2023

@wep21 thanks for reporting! The PR with the fix got lost among other work, but given that the error is appearing again, I have re-prioritized it and added it to my backlog for this week.
This, however, is most likely unrelated to the general `unexpected EOF` error.

@Crola1702 Crola1702 self-assigned this Sep 12, 2023
@Crola1702
Contributor

Crola1702 commented Sep 12, 2023

Fixing this issue in https://github.com/osrf/chef-osrf/pull/212

@Crola1702
Contributor

This is not completely closed, as we are still missing the new images for build.ros2.org.

@Crola1702 Crola1702 reopened this Oct 10, 2023
@Crola1702
Contributor

Finally fixed by: https://github.com/osrf/osrf-terraform/pull/147
