Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python and Go init/sidecar containers being OOMKilled with default settings #3479

Open
jcreixell opened this issue Nov 20, 2024 · 8 comments
Labels
discuss-at-sig This issue or PR should be discussed at the next SIG meeting

Comments

@jcreixell
Copy link
Contributor

jcreixell commented Nov 20, 2024

Component(s)

auto-instrumentation

Describe the issue you're reporting

The init/sidecar containers are being killed by kubernetes due to memory usage above the limits. This happens with python and go.

I am able to consistently reproduce the go OOMs with both k3d and kind:

kind

Interestingly, the python OOMs only happen with k3d, not kind.

Screenshot from 2024-11-20 18-33-10

Screenshot from 2024-11-20 18-33-55

I am using a standard setup following the README. For testing, I use the rolldice applications with their manual instrumentation code stripped out.

@kamsobon
Copy link

kamsobon commented Dec 3, 2024

Hello folks: @jcreixell , @iblancasa , @swiatekm

I have very similar case for nodejs auto-instrumentation.
Init container randomly dies because of OOM kill. Memory thresholds are mich bigger then for python/Go.

It looks like that container has 7226 files that weights ~50MB to copy through the emptydir.

I tried to run that cp command inside the container and limit memory and I can reliably reproduce the issue when memory is limited to 32MB only.
I'm not sure what to do next.

docker run --rm --memory=32m -it \
  --name test-cp-memory \
  open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.52.1 \
  /bin/sh

@swiatekm
Copy link
Contributor

swiatekm commented Dec 3, 2024

I can confirm it gets killed at 32M. More annoyingly, it seems to also sporadically get killed even at 128M. We should really get to the bottom of this. In order to resolve this, we should:

  1. Ensure there's an e2e test that catches it, as per Investigate why e2e tests did not catch OOMKilled errors #3484.
  2. Increase the requests until we consistently succeed.
  3. Properly investigate why this happens, exactly.
  4. Reconsider dropping busybox and using an external cp binary. See Don't require cp in instrumentation images #3231.

@kamsobon if you'd like to help out with this, any of the points above are a good starting point.

@swiatekm swiatekm added discuss-at-sig This issue or PR should be discussed at the next SIG meeting and removed needs triage labels Dec 3, 2024
@pkoutsovasilis
Copy link

pkoutsovasilis commented Dec 5, 2024

hey 👋 since by copying files you touch a lot of files I expect the page cache to be increasing, can this mean that we are hitting this issue? 🙂 (Specifically the wrapped issue in the previous link is this one)

@kamsobon
Copy link

kamsobon commented Dec 12, 2024

@swiatekm If I get the controversy around k8s approach to counting memory (working set = RSS + active page cache), we should accept it takes more memory in page cache.

It was suggested to accommodate that in container's limit, while container's request might be actually smaller.
IMHO it is reasonable as this (copying OTel agent files to main container) is one-time action and any memory that will go to page cache is easily reclaimable (I was forcing reclaiming cache memory with sync; echo 3 > /proc/sys/vm/drop_caches and that always goes down).

How about using following limits:

  • Request: 64MB
  • Limit: 150MB (I was unable to go beyond 140MB when doing cp over & over again)

That "extra" 86MB should be always there for cache memory:

  1. It is hard to believe anyone will use Operator on very small machines
  2. Main container most likely will request more than 150MB (if going with OTel operator)

How I was testing it:

  1. I run a docker container
docker run --rm --memory=256m -it \
  --name test-cp-memory \
  ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.52.1 \
  /bin/sh
  1. I run docker stats to keep tracking memory usage
  2. I run another container to force dropping page cache (I'm on macOS)
docker run --rm --name drop-cache -it --privileged alpine sh
  1. I run copy command in test-cp-memory container
cp -r /autoinstrumentation /attempt-1
  1. I observed docker stats. I was repeating steps 4/5 if necessary
  2. I run command to clean cache in drop-cache container
sync; echo 3 > /proc/sys/vm/drop_caches

What else did I checked:

  1. Using alpine:latest and cp. It seems to be allocating less memory for single copy - ~38MB. Conrary to busybox where in single run I was getting 38-100MB.
  2. Checking rsync from alpine. For that I made custom image
FROM ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest as instrumentation
FROM alpine:latest

RUN apk add --no-cache rsync
COPY --from=instrumentation /autoinstrumentation /autoinstrumentation

@swiatekm
Copy link
Contributor

swiatekm commented Dec 13, 2024

@swiatekm If I get the controversy around k8s approach to counting memory (working set = RSS + active page cache), we should accept it takes more memory in page cache.

Tracking is one thing, but I was under the impression that the cgroup controller would force caches to be dropped if the total usage went above the group limit. I can easily memory map a 4 GB file inside a container with a 100 Mi memory limit, and not get killed for it. So I'm not convinced the cache is actually the root cause of the problem.

@frzifus
Copy link
Member

frzifus commented Dec 19, 2024

Did the size of the sdks maybe increase during the last release? I think we didnt have a specific process to determine the limits. It was just tested on a few different setups and never caused issues. #1571 (comment)

@pkoutsovasilis
Copy link

So I did some experimentation of my own 🙂

My results are the following:

GKE v1.30.5-gke.1443001 Kernel Linux ubuntu 6.1.100+ #1 SMP PREEMPT_DYNAMIC Sun Sep 29 16:26:42 UTC 2024 x86_64 Linux

I can reproduce 100% of the times getting my pod of with a 512Mi memory limit OOM Killed by just a running a dd if=/dev/zero of=testfile bs=1M count=4096

[627142.755247] dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=936
[627142.765838] CPU: 1 PID: 3656524 Comm: dd Not tainted 6.1.100+ #1
[627142.772031] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
[627142.781437] Call Trace:
[627142.784076]  <TASK>
[627142.786374]  dump_stack_lvl+0x4a/0x70
[627142.790237]  dump_header+0x52/0x250
[627142.793916]  oom_kill_process+0x10a/0x220
[627142.798112]  out_of_memory+0x3dc/0x5c0
[627142.802048]  ? mem_cgroup_iter+0x1c6/0x240
[627142.806333]  try_charge_memcg+0x827/0xa90
[627142.810531]  charge_memcg+0x3f/0x1f0
[627142.814291]  __mem_cgroup_charge+0x2b/0x80
[627142.818574]  __filemap_add_folio+0x31a/0x3c0
[627142.823044]  ? workingset_activation+0x150/0x150
[627142.827863]  __filemap_get_folio+0x27a/0x360
[627142.832338]  pagecache_get_page+0x19/0x70
[627142.836548]  ext4_da_write_begin+0x1be/0x2f0
[627142.841019]  generic_perform_write+0xa5/0x240
[627142.845563]  ext4_buffered_write_iter+0xf0/0x190
[627142.850370]  do_iter_write+0x1d9/0x2f0
[627142.854304]  ovl_write_iter+0x310/0x4c0
[627142.858328]  vfs_write+0x2e2/0x3d0
[627142.861917]  ksys_write+0x71/0xe0
[627142.865418]  do_syscall_64+0x78/0xa0
[627142.869180]  ? clear_bhb_loop+0x55/0xb0
[627142.873203]  ? clear_bhb_loop+0x55/0xb0
[627142.877231]  ? clear_bhb_loop+0x55/0xb0
[627142.881253]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[627142.886497] RIP: 0033:0x7933842ed347
[627142.890269] Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 ea d8 ff ff 41 54 b8 02 00 00 00 55 48 89 f5 be 00 88 08 00
[627142.909220] RSP: 002b:00007fff802eeec8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[627142.916974] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007933842ed347
[627142.924308] RDX: 0000000000100000 RSI: 0000793384185000 RDI: 0000000000000001
[627142.931628] RBP: 0000793384329b28 R08: 0000000000000000 R09: 0000000000000000
[627142.938960] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000100000
[627142.946280] R13: 0000793384329b5c R14: 0000793384185000 R15: 0000000000000001
[627142.953603]  </TASK>
[627142.956033] memory: usage 524288kB, limit 524288kB, failcnt 10279
[627142.962383] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[627142.968670] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda5d0b648_9191_4e4f_a119_e801e1f3321c.slice:
[627142.968732] anon 1396736
[627142.968732] file 518225920
[627142.968732] kernel 17248256
[627142.968732] kernel_stack 49152
[627142.968732] pagetables 126976
[627142.968732] sec_pagetables 0
[627142.968732] percpu 13504
[627142.968732] sock 0
[627142.968732] vmalloc 0
[627142.968732] shmem 0
[627142.968732] file_mapped 0
[627142.968732] file_dirty 518225920
[627142.968732] file_writeback 0
[627142.968732] swapcached 0
[627142.968732] anon_thp 0
[627142.968732] file_thp 0
[627142.968732] shmem_thp 0
[627142.968732] inactive_anon 1355776
[627142.968732] active_anon 40960
[627142.968732] inactive_file 501448704
[627142.968732] active_file 16777216
[627142.968732] unevictable 0
[627142.968732] slab_reclaimable 16726464
[627142.968732] slab_unreclaimable 244320
[627142.968732] slab 16970784
[627142.968732] workingset_refault_anon 0
[627142.968732] workingset_refault_file 0
[627142.968732] workingset_activate_anon 0
[627143.072410] Tasks state (memory values in pages):
[627143.077381] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[627143.086239] [1946000] 65535 1946000      257        1    28672        0          -998 pause
[627143.094839] [3643567]     0 3643567      406        1    45056        0           936 sleep
[627143.103440] [3656220]     0 3656220      427      279    49152        0           936 sh
[627143.111749] [3656524]     0 3656524      662      467    49152        0           936 dd
[627143.120057] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-05ce829cc56264e5f481a18052fb8c8671667a644e58b0372ac2e3a296a8536c.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda5d0b648_9191_4e4f_a119_e801e1f3321c.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda5d0b648_9191_4e4f_a119_e801e1f3321c.slice/cri-containerd-05ce829cc56264e5f481a18052fb8c8671667a644e58b0372ac2e3a296a8536c.scope,task=dd,pid=3656524,uid=0
[627143.165982] Memory cgroup out of memory: Killed process 3656524 (dd) total-vm:2648kB, anon-rss:964kB, file-rss:904kB, shmem-rss:0kB, UID:0 pgtables:48kB oom_score_adj:936
[627143.181494] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda5d0b648_9191_4e4f_a119_e801e1f3321c.slice/cri-containerd-05ce829cc56264e5f481a18052fb8c8671667a644e58b0372ac2e3a296a8536c.scope are going to be killed due to memory.oom.group set
[627143.205059] Memory cgroup out of memory: Killed process 3643567 (sleep) total-vm:1624kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:44kB oom_score_adj:936
[627143.220435] Memory cgroup out of memory: Killed process 3656220 (sh) total-vm:1708kB, anon-rss:196kB, file-rss:920kB, shmem-rss:0kB, UID:0 pgtables:48kB oom_score_adj:936
[627143.235920] Memory cgroup out of memory: Killed process 3656524 (dd) total-vm:2648kB, anon-rss:964kB, file-rss:904kB, shmem-rss:0kB, UID:0 pgtables:48kB oom_score_adj:936

so what can I tell from the above dmesg messages is the following; pagedisk cache is not used only for tracking but it actually triggers the direct reclaim mode try_charge_memcg (this is somewhat different from the async reclaim mode which is triggered by echo 3 > /proc/sys/vm/drop_caches). However, for some reason the kernel says no memory can be reclaimed thus bb oom_kill_process. Now repeating the same experiment, but having set manually the memory.high, of the container cgroup, at 80% of the memory.max (512Mi) makes dd finish just fine without any OOM killing.

Docker version 27.3.1 Kernel 6.8.0-50-generic #51-Ubuntu SMP PREEMPT_DYNAMIC Sat Nov 9 18:03:35 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

A container with a even a lower memory limit of 256mb has no issue running dd if=/dev/zero of=testfile bs=1M count=4096. But this is not the whole truth, when I monitor the cgroup memory.events of my container I could see that I was getting rapid increases in the memory.max event. This means that still the page cache counts as consumed memory, but the direct page reclaim is successful here. PS: I did validate that my local docker doesn't tune the cgroup of the container differently than GKE, the have same memory.{min,low,high,max,...} settings.

I can only think of two possible explanations for the above behavioural differences:

  • The disk of the GKE node is too slow and the dirty pages are not written yet by the kernel when my dd triggers direct page reclaim. Thus the kernel says OOM. On the contrary my local container enjoys a really fast, almost idle, NVMe.
  • There might be a kernel bug, less probable but still possible (e.g. this one).

Some ideas about possible mitigations:

  • Increase the memory limit of the init container as high as the accumulated size of copied files (hoping this is <256MB)
  • Periodically invoke echo 3 > /proc/sys/vm/drop_caches to cause kernel IO to take a "breath" and try to flush as many as possible dirty pages
  • Although not sure it supports init containers, it would be cool if we could utilise this alpha feature of k8s. However, as of now, this requires a change in kubelet configuration, thus, I think it's not suitable.

@swiatekm
Copy link
Contributor

swiatekm commented Dec 30, 2024

@pkoutsovasilis that's very interesting, thanks a lot for investigating! Disk speed being different between environments would explain why we don't see these problems in e2e tests. As a mitigation, we could bump the init container memory limit and write a unit test that compares our default value to the size of the instrumentation image. WDYT @open-telemetry/operator-approvers ?

Independently, we could try to figure out whether this is a problem with kubelet defaults or a bug in the kernel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss-at-sig This issue or PR should be discussed at the next SIG meeting
Projects
None yet
Development

No branches or pull requests

5 participants