Add a delay between killing teamd processes #3325

saiarcot895 · 2024-10-14T01:13:49Z

What I did

When killing 10 or more teamd processes, add a delay of 0.1 seconds after every 10 kill signals/processes. This is because the LAG scale tests (ecmp/inner_hashing/test_inner_hashing_lag.py in sonic-mgmt) may create 100 LAGs, and when they are all destroyed, some of those LAGs may fail to be properly destroyed, leaving stale port channels behind. This seems to be because the netlink socket buffers on which the teamd processes receive notifications fill up with events from the other port channels/interfaces going down.

Why I did it

As a workaround, add some delays in killing the teamd processes so that the netlink buffers don't fill up and cause messages to be dropped.

This delay was chosen arbitrarily, and it seems to work well with 100 LAGs on a KVM. It can probably be made a bit more aggressive if needed (e.g. 0.05 seconds every 20 processes).
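The batching described above can be sketched roughly as follows. This is a minimal illustration and not the actual change in cfgmgr/teammgr.cpp: `terminateInBatches` and its `send_term` callback are hypothetical names, and the defaults are simply the values mentioned in this description.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>
#include <sys/types.h>

// Hypothetical helper, not the real teammgrd code.
// Calls send_term(pid) for every pid and sleeps for `delay` after every
// `batch_size` signals, so the resulting netlink events arrive in smaller
// bursts. Returns the number of pauses taken.
std::size_t terminateInBatches(
    const std::vector<pid_t>& pids,
    const std::function<void(pid_t)>& send_term,
    std::size_t batch_size = 10,
    std::chrono::milliseconds delay = std::chrono::milliseconds(100))
{
    assert(batch_size > 0);

    std::size_t sent = 0;
    std::size_t pauses = 0;
    for (pid_t pid : pids) {
        send_term(pid);
        // Pause once a full batch of signals has gone out.
        if (++sent % batch_size == 0) {
            std::this_thread::sleep_for(delay);
            ++pauses;
        }
    }
    return pauses;
}
```

In real use the callback would be something like `[](pid_t p) { kill(p, SIGTERM); }` (with `<signal.h>` included). Passing it in as a parameter keeps the pacing logic testable without sending signals to real processes.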

How I verified it

On a KVM testbed with t0-116 topology with a bit more than 100 LAGs, stop teamd using sudo systemctl stop teamd, and verify that all of the LAGs were deleted, and there were no messages from the kernel similar to the following:

Oct 12 21:33:03 vlab-04 kernel: PortChannel41 (unregistering): Failed to send options change via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel17 (unregistering): Failed to send options change via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel22: Failed to send options change via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel22: Failed to send port change of device Ethernet136 via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel22: Port device Ethernet136 removed
Oct 12 21:33:03 vlab-04 kernel: PortChannel43: Failed to send options change via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel43: Failed to send port change of device Ethernet174 via netlink (err -105)
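On most Linux architectures errno 105 is ENOBUFS ("No buffer space available"), which fits the full-buffer explanation. A small sketch for counting such lines in a captured kernel log is shown below; `countNetlinkSendFailures` is a hypothetical helper written for this description, not part of sonic-swss or sonic-mgmt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// 105 is ENOBUFS in the generic Linux errno table (assumption: the target
// is not one of the few architectures that number it differently).
constexpr int kLinuxEnobufs = 105;

// Hypothetical helper: counts log lines that report a failed netlink send
// with the given errno. `err` is the positive errno value; the kernel
// prints it negated, e.g. "(err -105)".
std::size_t countNetlinkSendFailures(const std::vector<std::string>& lines,
                                     int err = kLinuxEnobufs)
{
    assert(err > 0);
    const std::string needle = "via netlink (err -" + std::to_string(err) + ")";

    std::size_t count = 0;
    for (const std::string& line : lines) {
        if (line.find(needle) != std::string::npos) {
            ++count;
        }
    }
    return count;
}
```

After `sudo systemctl stop teamd`, a count of zero over the relevant part of the kernel log is the condition this verification step is looking for.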

Details if related

Partial fix for sonic-net/sonic-buildimage#19310.

saiarcot895 · 2024-10-22T01:20:05Z

/azpw run

mssonicbld · 2024-10-22T01:20:07Z

/AzurePipelines run

azure-pipelines · 2024-10-22T01:20:17Z

Azure Pipelines successfully started running 1 pipeline(s).

saiarcot895 · 2024-10-31T20:57:05Z

Comparing the time needed to send SIGTERM to the teamd processes before and after this change, it appears that the time is roughly the same for about 70 LAGs, as tested on a physical device.

Before:

2024 Oct 31 20:33:56.338480 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel1 pid 26
2024 Oct 31 20:33:56.345896 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel10 pid 35
2024 Oct 31 20:33:56.349179 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel101 pid 43
2024 Oct 31 20:33:56.354644 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel102 pid 51
2024 Oct 31 20:33:56.360318 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel103 pid 59
2024 Oct 31 20:33:56.400292 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel104 pid 67
2024 Oct 31 20:33:56.400309 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel11 pid 75
2024 Oct 31 20:33:56.400867 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel12 pid 83
2024 Oct 31 20:33:56.401188 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel13 pid 91
2024 Oct 31 20:33:56.401362 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel14 pid 100
2024 Oct 31 20:33:56.402117 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel15 pid 109
2024 Oct 31 20:33:56.411100 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel16 pid 117
2024 Oct 31 20:33:56.411989 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel17 pid 125
2024 Oct 31 20:33:56.441357 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel18 pid 133
2024 Oct 31 20:33:56.486524 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel19 pid 141
2024 Oct 31 20:33:56.486781 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel2 pid 149
2024 Oct 31 20:33:56.486951 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel20 pid 157
2024 Oct 31 20:33:56.487095 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel21 pid 165
2024 Oct 31 20:33:56.487985 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel22 pid 173
2024 Oct 31 20:33:56.487985 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel23 pid 181
2024 Oct 31 20:33:56.488143 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel24 pid 189
2024 Oct 31 20:33:56.491583 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel25 pid 197
2024 Oct 31 20:33:56.498010 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel26 pid 205
2024 Oct 31 20:33:56.501587 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel27 pid 213
2024 Oct 31 20:33:56.504982 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel28 pid 221
2024 Oct 31 20:33:56.560632 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel29 pid 229
2024 Oct 31 20:33:56.604924 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel3 pid 237
2024 Oct 31 20:33:56.604950 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel30 pid 245
2024 Oct 31 20:33:56.604974 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel31 pid 253
2024 Oct 31 20:33:56.605128 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel32 pid 261
2024 Oct 31 20:33:56.608329 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel33 pid 269
2024 Oct 31 20:33:56.646533 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel34 pid 277
2024 Oct 31 20:33:56.651903 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel35 pid 285
2024 Oct 31 20:33:56.656102 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel36 pid 293
2024 Oct 31 20:33:56.660620 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel37 pid 301
2024 Oct 31 20:33:56.677031 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel38 pid 309
2024 Oct 31 20:33:56.679521 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel39 pid 317
2024 Oct 31 20:33:56.685786 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel4 pid 325
2024 Oct 31 20:33:56.689406 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel40 pid 333
2024 Oct 31 20:33:56.692990 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel41 pid 341
2024 Oct 31 20:33:56.795228 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel42 pid 349
2024 Oct 31 20:33:56.802910 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel43 pid 357
2024 Oct 31 20:33:56.809630 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel44 pid 365
2024 Oct 31 20:33:56.843699 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel45 pid 373
2024 Oct 31 20:33:56.881881 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel46 pid 381
2024 Oct 31 20:33:56.897540 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel47 pid 389
2024 Oct 31 20:33:56.935467 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel48 pid 397
2024 Oct 31 20:33:56.937797 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel49 pid 405
2024 Oct 31 20:33:56.942456 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel5 pid 413
2024 Oct 31 20:33:56.943508 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel50 pid 421
2024 Oct 31 20:33:56.945620 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel51 pid 429
2024 Oct 31 20:33:56.968744 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel52 pid 437
2024 Oct 31 20:33:56.969017 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel53 pid 445
2024 Oct 31 20:33:56.969215 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel54 pid 453
2024 Oct 31 20:33:56.969315 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel55 pid 461
2024 Oct 31 20:33:56.972646 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel56 pid 469
2024 Oct 31 20:33:56.973366 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel57 pid 477
2024 Oct 31 20:33:56.974690 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel58 pid 485
2024 Oct 31 20:33:56.975190 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel59 pid 493
2024 Oct 31 20:33:56.975761 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel6 pid 501
2024 Oct 31 20:33:57.013460 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel60 pid 509
2024 Oct 31 20:33:57.017120 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel61 pid 517
2024 Oct 31 20:33:57.020771 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel62 pid 525
2024 Oct 31 20:33:57.024415 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel63 pid 533
2024 Oct 31 20:33:57.028257 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel64 pid 541
2024 Oct 31 20:33:57.034454 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel65 pid 549
2024 Oct 31 20:33:57.035288 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel66 pid 557
2024 Oct 31 20:33:57.039678 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel67 pid 565
2024 Oct 31 20:33:57.051380 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel68 pid 573
2024 Oct 31 20:33:57.060741 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel69 pid 581

After:

2024 Oct 31 20:42:29.550813 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel1 pid 27
2024 Oct 31 20:42:29.550813 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel10 pid 35
2024 Oct 31 20:42:29.550813 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel101 pid 43
2024 Oct 31 20:42:29.550861 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel102 pid 51
2024 Oct 31 20:42:29.550861 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel103 pid 59
2024 Oct 31 20:42:29.550885 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel104 pid 68
2024 Oct 31 20:42:29.550907 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel11 pid 77
2024 Oct 31 20:42:29.550907 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel12 pid 85
2024 Oct 31 20:42:29.550946 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel13 pid 93
2024 Oct 31 20:42:29.653453 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel14 pid 101
2024 Oct 31 20:42:29.653453 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel15 pid 109
2024 Oct 31 20:42:29.653453 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel16 pid 117
2024 Oct 31 20:42:29.653453 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel17 pid 125
2024 Oct 31 20:42:29.653453 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel18 pid 133
2024 Oct 31 20:42:29.653453 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel19 pid 141
2024 Oct 31 20:42:29.653453 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel2 pid 149
2024 Oct 31 20:42:29.653453 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel20 pid 157
2024 Oct 31 20:42:29.653453 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel21 pid 165
2024 Oct 31 20:42:29.653453 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel22 pid 173
2024 Oct 31 20:42:29.767222 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel23 pid 181
2024 Oct 31 20:42:29.767674 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel24 pid 189
2024 Oct 31 20:42:29.767744 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel25 pid 197
2024 Oct 31 20:42:29.767805 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel26 pid 205
2024 Oct 31 20:42:29.767866 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel27 pid 213
2024 Oct 31 20:42:29.767946 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel28 pid 221
2024 Oct 31 20:42:29.768005 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel29 pid 229
2024 Oct 31 20:42:29.768067 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel3 pid 237
2024 Oct 31 20:42:29.768125 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel30 pid 245
2024 Oct 31 20:42:29.768182 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel31 pid 253
2024 Oct 31 20:42:29.858922 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel32 pid 261
2024 Oct 31 20:42:29.858922 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel33 pid 269
2024 Oct 31 20:42:29.858931 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel34 pid 277
2024 Oct 31 20:42:29.858938 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel35 pid 285
2024 Oct 31 20:42:29.858938 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel36 pid 293
2024 Oct 31 20:42:29.858947 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel37 pid 301
2024 Oct 31 20:42:29.858954 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel38 pid 309
2024 Oct 31 20:42:29.858954 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel39 pid 317
2024 Oct 31 20:42:29.858965 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel4 pid 325
2024 Oct 31 20:42:29.858965 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel40 pid 333
2024 Oct 31 20:42:29.964442 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel41 pid 341
2024 Oct 31 20:42:29.964474 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel42 pid 349
2024 Oct 31 20:42:29.964499 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel43 pid 357
2024 Oct 31 20:42:29.964523 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel44 pid 365
2024 Oct 31 20:42:29.964548 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel45 pid 373
2024 Oct 31 20:42:29.964574 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel46 pid 381
2024 Oct 31 20:42:29.964599 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel47 pid 389
2024 Oct 31 20:42:29.964625 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel48 pid 397
2024 Oct 31 20:42:29.964651 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel49 pid 405
2024 Oct 31 20:42:29.964676 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel5 pid 413
2024 Oct 31 20:42:30.115429 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel50 pid 421
2024 Oct 31 20:42:30.115523 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel51 pid 429
2024 Oct 31 20:42:30.115592 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel52 pid 437
2024 Oct 31 20:42:30.115660 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel53 pid 445
2024 Oct 31 20:42:30.115725 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel54 pid 453
2024 Oct 31 20:42:30.115980 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel55 pid 461
2024 Oct 31 20:42:30.116046 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel56 pid 469
2024 Oct 31 20:42:30.116113 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel57 pid 477
2024 Oct 31 20:42:30.116178 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel58 pid 485
2024 Oct 31 20:42:30.116550 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel59 pid 493
2024 Oct 31 20:42:30.167335 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel6 pid 501
2024 Oct 31 20:42:30.167365 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel60 pid 509
2024 Oct 31 20:42:30.167365 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel61 pid 517
2024 Oct 31 20:42:30.167375 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel62 pid 525
2024 Oct 31 20:42:30.167375 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel63 pid 533
2024 Oct 31 20:42:30.167384 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel64 pid 541
2024 Oct 31 20:42:30.167384 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel65 pid 549
2024 Oct 31 20:42:30.167415 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel66 pid 557
2024 Oct 31 20:42:30.167459 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel67 pid 565
2024 Oct 31 20:42:30.167469 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel68 pid 573
2024 Oct 31 20:42:30.274461 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel69 pid 581

In both cases, for 70 LAGs, it took about 0.6-0.7 seconds to send SIGTERM to the teamd processes, but the distribution of SIGTERMs sent is different. However, on this device, there are still some netlink messages getting dropped resulting in the cleanup not being complete.
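The elapsed time can be checked directly from the syslog timestamps. The sketch below assumes the line layout shown in the excerpts above (year, month, day, then `HH:MM:SS.ffffff` with six fractional digits) and that both lines are from the same day; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <sstream>
#include <string>

// Hypothetical helper: returns the time of day, in microseconds since
// midnight, from a line such as
// "2024 Oct 31 20:33:56.338480 host NOTICE ...".
long long timeOfDayMicros(const std::string& line)
{
    std::istringstream in(line);
    std::string year, month, day, clock;
    in >> year >> month >> day >> clock;  // the 4th field is HH:MM:SS.ffffff
    assert(!in.fail());

    int h = 0, m = 0, s = 0;
    long long us = 0;
    const int parsed =
        std::sscanf(clock.c_str(), "%d:%d:%d.%6lld", &h, &m, &s, &us);
    assert(parsed == 4);
    (void)parsed;  // silence the unused warning when asserts are disabled

    return ((h * 60LL + m) * 60LL + s) * 1000000LL + us;
}

// Microseconds between two log lines from the same day.
long long elapsedMicros(const std::string& first, const std::string& last)
{
    return timeOfDayMicros(last) - timeOfDayMicros(first);
}
```

Applied to the first and last lines of each excerpt above, this gives 722261 µs before the change and 723648 µs after it, so the overall duration is essentially unchanged and only the spacing of the signals differs.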

dgsudharsan · 2024-11-11T02:50:27Z

@saiarcot895 Can you please run your changes with test_po_cleanup and test_po_cleanup_after_reload? We are noticing these tests statically fail with your changes

dgsudharsan · 2024-11-20T14:51:58Z

@saiarcot895 Do we have an updated fix?

saiarcot895 · 2024-12-02T17:23:33Z

Updated syslog after changes:

2024 Dec  2 07:41:43.760139 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel14 pid 102
2024 Dec  2 07:41:43.760166 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel15 pid 110
2024 Dec  2 07:41:43.760189 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel16 pid 118
2024 Dec  2 07:41:43.760213 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel17 pid 126
2024 Dec  2 07:41:43.760237 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel18 pid 134
2024 Dec  2 07:41:43.760262 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel19 pid 142
2024 Dec  2 07:41:43.760291 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel2 pid 150
2024 Dec  2 07:41:43.760316 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel20 pid 158
2024 Dec  2 07:41:43.760343 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel21 pid 166
2024 Dec  2 07:41:43.760368 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel22 pid 174
...
2024 Dec  2 07:41:43.981285 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel23 pid 182
2024 Dec  2 07:41:43.981778 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel24 pid 190
2024 Dec  2 07:41:43.982313 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel25 pid 198
2024 Dec  2 07:41:43.982413 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel26 pid 206
2024 Dec  2 07:41:43.982544 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel27 pid 214
2024 Dec  2 07:41:43.982598 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel28 pid 222
2024 Dec  2 07:41:43.982652 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel29 pid 230
2024 Dec  2 07:41:43.982706 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel3 pid 238
2024 Dec  2 07:41:43.982759 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel30 pid 246
2024 Dec  2 07:41:43.982812 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel31 pid 254
...
2024 Dec  2 07:41:44.191547 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel32 pid 262
2024 Dec  2 07:41:44.191547 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel33 pid 270
2024 Dec  2 07:41:44.191555 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel34 pid 278
2024 Dec  2 07:41:44.191555 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel35 pid 286
2024 Dec  2 07:41:44.191574 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel36 pid 294
2024 Dec  2 07:41:44.191574 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel37 pid 302
2024 Dec  2 07:41:44.191574 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel38 pid 310
2024 Dec  2 07:41:44.191584 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel39 pid 318
2024 Dec  2 07:41:44.191584 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel4 pid 326
2024 Dec  2 07:41:44.191595 str2-7260cx3-acs-9 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Sent SIGTERM to port channel PortChannel40 pid 334
...

The SIGTERM signal to the teamd supervisord was sent at 07:41:40, and the container exited at 07:41:46.

The teamd container is also configured with a 60-second termination wait time (instead of the 10-second wait time), meaning docker will wait 60 seconds for the container to go down before forcefully killing it. A 6-second shutdown here is therefore fine.

mssonicbld · 2024-12-15T04:19:11Z

/azp run

azure-pipelines · 2024-12-15T04:19:22Z

Azure Pipelines successfully started running 1 pipeline(s).

judyjoseph · 2024-12-15T04:19:22Z

cfgmgr/teammgr.cpp

        pid_t pid;
+        if (++sleepCounter % 10 == 0) {
+            // Sleep for 100 milliseconds so as to not overwhelm the netlink


Do we need to change the comment? Is it 100ms or 200ms?

Also, instead of a one-shot 200ms delay, can we add 10ms for each delete? The issue we have seen before is that when we delete a large number of port channels, a few stale ones remain.

Here you are also switching the call from the shell/cmd approach to the system_call() approach, which will make it execute faster.

I felt the older shell/cmd way of checking for the pid and killing it introduced some delay, which helped in the cleanup of port channels.

Tried a 10ms delay, and this seems to work well, at least on Arista 7260cx3.

Signed-off-by: Saikrishna Arcot <[email protected]>
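To compare the two pacing schemes discussed in this review thread, the total sleep each one adds can be estimated with simple arithmetic. The sketch below is only an illustration with hypothetical function names: it counts the sleeps alone, not the time spent sending signals, and it assumes one pause per full batch in the batched scheme and one pause per process in the per-kill scheme.

```cpp
#include <cassert>
#include <cstddef>

// Total sleep, in milliseconds, when pausing `pause_ms` after every
// `batch_size` signals (one pause per full batch).
long long batchedSleepMs(std::size_t processes, std::size_t batch_size,
                         long long pause_ms)
{
    assert(batch_size > 0);
    return static_cast<long long>(processes / batch_size) * pause_ms;
}

// Total sleep, in milliseconds, when pausing `pause_ms` after every signal.
long long perKillSleepMs(std::size_t processes, long long pause_ms)
{
    return static_cast<long long>(processes) * pause_ms;
}
```

For 100 port channels, 200 ms after every 10 kills adds 2000 ms of sleep in total, while 10 ms after every kill adds 1000 ms and spreads the signals evenly instead of sending them in bursts of ten.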

mssonicbld · 2025-01-06T21:04:34Z

/azp run

azure-pipelines · 2025-01-06T21:04:45Z

Azure Pipelines successfully started running 1 pipeline(s).

saiarcot895 requested a review from judyjoseph as a code owner October 14, 2024 01:13

dgsudharsan added the Request for 202405 Branch label Oct 16, 2024

saiarcot895 added 2 commits October 21, 2024 17:45

Update LAG removal code to use the same logic as cleaning up all LAGs

f4fd3ab

Signed-off-by: Saikrishna Arcot <[email protected]>

Update tests to test LAG cleanup and to test with the new code

7b6fc53

This requires overriding some libc functions and capturing information about kill signals sent or intercepting file open operations. Signed-off-by: Saikrishna Arcot <[email protected]>

saiarcot895 requested a review from prsunny as a code owner October 22, 2024 00:47

Merge remote-tracking branch 'origin/master' into teamd-delay-kill

27f6d3c

saiarcot895 and others added 3 commits October 22, 2024 15:59

Merge remote-tracking branch 'origin/master' into teamd-delay-kill

bdd47c7

Add more tests to cover more cases

c5d84cf

Signed-off-by: Saikrishna Arcot <[email protected]>

Merge branch 'master' into teamd-delay-kill

1dd20a0

dgsudharsan added 2 commits November 4, 2024 07:59

Merge branch 'master' into teamd-delay-kill

8f71480

Merge branch 'master' into teamd-delay-kill

f39d60f

saiarcot895 added 3 commits November 13, 2024 14:16

Wait 200ms instead of 100ms, and fix teamd wait code

e7ce08d

100ms might not be enough on slow systems for the teamd shutdown sequence to actually be staggered. Signed-off-by: Saikrishna Arcot <[email protected]>

Merge remote-tracking branch 'refs/remotes/personal/teamd-delay-kill'…

ce1fdae

… into teamd-delay-kill

Merge remote-tracking branch 'origin/master' into teamd-delay-kill

726f800

saiarcot895 added 2 commits November 25, 2024 12:07

Merge remote-tracking branch 'origin/master' into teamd-delay-kill

d2ccbb3

Merge branch 'master' into teamd-delay-kill

181bdb9

saiarcot895 and others added 2 commits December 2, 2024 09:23

Merge branch 'master' into teamd-delay-kill

4d99bf4

Merge branch 'master' into teamd-delay-kill

6472d33

dprital added the Request for 202411 Branch label Dec 10, 2024

saiarcot895 and others added 2 commits December 10, 2024 11:43

Merge branch 'master' into teamd-delay-kill

fff96a3

Merge branch 'master' into teamd-delay-kill

cd38a76

judyjoseph reviewed Dec 15, 2024

saiarcot895 added 2 commits January 6, 2025 13:03

Try 10ms sleep

9f6ab64

Signed-off-by: Saikrishna Arcot <[email protected]>

Merge remote-tracking branch 'origin/master' into teamd-delay-kill

9aaf399
