Python and Go init/sidecar containers being OOMKilled with default settings #3479
Comments
Hello folks: @jcreixell, @iblancasa, @swiatekm, I have a very similar case with the nodejs auto-instrumentation. It looks like that container has 7226 files, weighing ~50MB in total, to copy through the emptyDir. I ran that cp command inside the container with a memory limit, and I can reliably reproduce the issue when memory is limited to only 32MB.
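To compare a payload like the one described above against a candidate memory limit, the file count and total size of the directory being copied can be measured first. The following is a minimal sketch, not part of the operator; `tree_footprint` and `exceeds_limit` are hypothetical helper names, and the directory to measure is whatever path the init container copies from.

```python
import os
from typing import Tuple


def tree_footprint(root: str) -> Tuple[int, int]:
    """Return (file_count, total_bytes) for regular files under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a link is not counted as a second copy.
            if os.path.islink(path):
                continue
            file_count += 1
            total_bytes += os.path.getsize(path)
    return file_count, total_bytes


def exceeds_limit(total_bytes: int, limit_mib: int) -> bool:
    """True when the payload alone is larger than the memory limit in MiB."""
    return total_bytes > limit_mib * 1024 * 1024
```

With the numbers reported above (~50MB of files), `exceeds_limit` would be true for a 32 MiB limit and false for a 128 MiB limit. This only says whether the payload could fit in the page cache under the limit, not whether the container will actually be killed.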
I can confirm it gets killed at 32M. More annoyingly, it seems to also sporadically get killed even at 128M. We should really get to the bottom of this. In order to resolve this, we should:
@kamsobon if you'd like to help out with this, any of the points above is a good starting point.
@swiatekm If I understand the controversy around the k8s approach to counting memory correctly (working set = RSS + active page cache), we should accept that the copy takes more memory in the page cache. It was suggested to accommodate that in the container's limit, while the container's request could actually be smaller. How about using the following limits:
That "extra" 86MB should always be there for cache memory:
How I was testing it:
What else I checked:
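For readers following the accounting argument in this comment, here is a rough sketch of how a working-set figure can be derived from cgroup data. It assumes the commonly described cAdvisor-style rule (usage minus inactive file cache), which is approximately the "RSS + active page cache" formulation above; the function names are hypothetical, and the input is the text of a cgroup `memory.stat` file plus a usage value read separately.

```python
from typing import Dict


def parse_memory_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines from a cgroup memory.stat file."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_bytes: int, stats: Dict[str, int]) -> int:
    """Usage minus inactive file cache, clamped at zero.

    Assumption: cgroup v1 exposes 'total_inactive_file' and cgroup v2
    exposes 'inactive_file'; the v1 key is preferred when both exist.
    """
    inactive = stats.get("total_inactive_file", stats.get("inactive_file", 0))
    return max(usage_bytes - inactive, 0)
```

Under this rule, page cache that the kernel still considers active counts toward the working set, which is why a file copy can push the reported figure well above the process's own RSS.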
Tracking is one thing, but I was under the impression that the cgroup controller would force caches to be dropped if the total usage went above the group limit. I can easily memory map a 4 GB file inside a container with a 100 Mi memory limit, and not get killed for it. So I'm not convinced the cache is actually the root cause of the problem.
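The memory-mapping experiment mentioned above can be approximated with a small script run inside a memory-limited container. This is a sketch under assumptions, not the exact test the commenter ran: `touch_pages` is a hypothetical name, and the file path and size are up to whoever runs it.

```python
import mmap


def touch_pages(path: str) -> int:
    """Map a file read-only, read one byte per page, return pages touched.

    Reading through the mapping pulls the file into the page cache, so
    running this under a tight cgroup limit shows whether cache pages are
    reclaimed instead of triggering an OOM kill.
    Note: mmap raises ValueError for an empty file.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            pages = 0
            for offset in range(0, len(mapped), mmap.PAGESIZE):
                _ = mapped[offset]  # fault the page in
                pages += 1
            return pages
```

A read-only mapping produces clean, reclaimable cache pages, whereas copying files into an emptyDir produces dirty pages that must be written back before they can be dropped, so the two experiments are not necessarily equivalent.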
Did the size of the SDKs maybe increase during the last release? I don't think we had a specific process to determine the limits. They were just tested on a few different setups and never caused issues. #1571 (comment)
So I did some experimentation of my own 🙂 My results are the following: GKE
@pkoutsovasilis that's very interesting, thanks a lot for investigating! Disk speed being different between environments would explain why we don't see these problems in e2e tests. As a mitigation, we could bump the init container memory limit and write a unit test that compares our default value to the size of the instrumentation image. WDYT @open-telemetry/operator-approvers? Independently, we could try to figure out whether this is a problem with kubelet defaults or a bug in the kernel.
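The check proposed above, comparing a default memory limit to the size of the instrumentation payload, could look roughly like the sketch below. It is illustrative only: `parse_quantity` and `limit_covers_payload` are hypothetical helpers (not the operator's API), only integer quantities with a few common suffixes are handled, and the `headroom` factor is an assumed safety margin.

```python
# Multipliers for a subset of Kubernetes-style quantity suffixes.
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,  # binary
    "k": 1000, "M": 1000**2, "G": 1000**3,     # decimal
}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '128Mi' or '128M' to bytes."""
    quantity = quantity.strip()
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


def limit_covers_payload(limit: str, payload_bytes: int,
                         headroom: float = 1.0) -> bool:
    """True when the limit is at least headroom times the payload size."""
    return parse_quantity(limit) >= payload_bytes * headroom
```

A unit test built on this idea would fail whenever an instrumentation image grows past the configured default, which would flag the problem before users hit OOM kills.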
Component(s)
auto-instrumentation
Describe the issue you're reporting
The init/sidecar containers are being killed by Kubernetes because their memory usage exceeds the limits. This happens with Python and Go.
I am able to consistently reproduce the Go OOMs with both k3d and kind:
Interestingly, the Python OOMs only happen with k3d, not kind.
I am using a standard setup following the README. For testing, I use the rolldice applications with their manual instrumentation code stripped out.