[Fleet] fix bulk actions timing out sometimes #205735

juliaElastic · 2025-01-07T13:43:16Z

Summary

Closes https://github.com/elastic/ingest-dev/issues/4346

Update: changed the implementation to run the first attempt of bulk action execution in the task too.

[2025-01-08T11:10:54.139+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: f8658178-cb1e-485d-9d2f-ad60ccc37bc9, total agents:10001
actionParams {
  kuery: '( fleet-agents.policy_id : ("d3448ae1-9e55-485e-b74c-83471cb06977")) and (status:online or (status:error or status:degraded) or (status:updating or status:unenrolling or status:enrolling) or status:offline)',
  revoke: false,
  force: undefined,
  batchSize: 10000,
  showInactive: true,
  spaceId: 'default',
  total: 10001,
  actionId: 'f8658178-cb1e-485d-9d2f-ad60ccc37bc9'
}
retryParams {
  pitId: 'gIuaBAEPLmZsZWV0LWFnZW50cy03FmRQN3pDT1gzUnFpNFkwdHJFdzJvbncAARZ5N2R5SnVSelN2bWNMang1THNfVkNRAAEAAAAAAAFpZhZiZnpybjZYSFRRNklYNlBqWHFzVU1nAAEWZFA3ekNPWDNScWk0WTB0ckV3Mm9udwAA',
  retryCount: 1
}
[2025-01-08T11:10:54.154+01:00][INFO ][plugins.fleet] Scheduling task fleet:unenroll_action:retry:f8658178-cb1e-485d-9d2f-ad60ccc37bc9
[2025-01-08T11:10:54.154+01:00][DEBUG][plugins.fleet] [Bulk Agent Unenroll API] Unenroll agents in 647ms
[2025-01-08T11:10:55.772+01:00][WARN ][http.server.Kibana] Event loop utilization for /julia/api/fleet/agents exceeded threshold of 250ms (351ms out of 633ms) and 15% (55%) 
[2025-01-08T11:10:57.235+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2025-01-08T11:10:57.235+01:00][DEBUG][plugins.fleet] Running task fleet:unenroll_action:retry:f8658178-cb1e-485d-9d2f-ad60ccc37bc9
[2025-01-08T11:10:57.236+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2025-01-08T11:10:57.251+01:00][INFO ][plugins.fleet] Scheduling task fleet:unenroll_action:retry:check:f8658178-cb1e-485d-9d2f-ad60ccc37bc9
[2025-01-08T11:10:57.251+01:00][DEBUG][plugins.fleet] kuery: ( fleet-agents.policy_id : ("d3448ae1-9e55-485e-b74c-83471cb06977")) and (status:online or (status:error or status:degraded) or (status:updating or status:unenrolling or status:enrolling) or status:offline)
[2025-01-08T11:10:57.781+01:00][DEBUG][plugins.fleet] Action not found
[2025-01-08T11:10:58.891+01:00][DEBUG][plugins.fleet] Secrets storage requirements already met, turned on in settings
[2025-01-08T11:10:59.414+01:00][WARN ][http.server.Kibana] Event loop utilization for /julia/api/fleet/agent_status exceeded threshold of 250ms (294ms out of 348ms) and 15% (85%) 
[2025-01-08T11:10:59.806+01:00][WARN ][http.server.Kibana] Event loop utilization for /julia/api/fleet/agents exceeded threshold of 250ms (504ms out of 743ms) and 15% (68%) 
[2025-01-08T11:11:01.532+01:00][INFO ][plugins.fleet] Removing task fleet:unenroll_action:retry:check:f8658178-cb1e-485d-9d2f-ad60ccc37bc9
actionParams {
  kuery: '( fleet-agents.policy_id : ("d3448ae1-9e55-485e-b74c-83471cb06977")) and (status:online or (status:error or status:degraded) or (status:updating or status:unenrolling or status:enrolling) or status:offline)',
  revoke: false,
  batchSize: 10000,
  showInactive: true,
  spaceId: 'default',
  total: 10001,
  actionId: 'f8658178-cb1e-485d-9d2f-ad60ccc37bc9'
}
retryParams {
  pitId: 'gIuaBAEPLmZsZWV0LWFnZW50cy03FmRQN3pDT1gzUnFpNFkwdHJFdzJvbncAARZ5N2R5SnVSelN2bWNMang1THNfVkNRAAEAAAAAAAFpZhZiZnpybjZYSFRRNklYNlBqWHFzVU1nAAEWZFA3ekNPWDNScWk0WTB0ckV3Mm9udwAA',
  retryCount: 1,
  taskId: 'fleet:unenroll_action:retry:f8658178-cb1e-485d-9d2f-ad60ccc37bc9',
  searchAfter: [ 1736331016589, 'online-98', 119770 ]
}
[2025-01-08T11:11:01.564+01:00][INFO ][plugins.fleet] Scheduling task fleet:unenroll_action:retry:check:f8658178-cb1e-485d-9d2f-ad60ccc37bc9
[2025

Old description

Bulk actions supposed to run async in a kibana task, and the API to return quickly with an action id.
This was implemented here and unintentionally broke when a tslint rule was introduced here, effectively letting the async code finish before the API returns. This results in the API timing out sometimes when there are many agents.

Tested by creating 100k agent docs with the create_agents script and bulk unenroll agents.

Logs before the change:

[2025-01-07T14:38:04.467+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 2a57ac7a-0c1b-4a08-8709-6ccb683db995, total agents:100000
[2025-01-07T14:38:04.482+01:00][INFO ][plugins.fleet] Scheduling task fleet:unenroll_action:retry:check:2a57ac7a-0c1b-4a08-8709-6ccb683db995
[2025-01-07T14:38:04.482+01:00][DEBUG][plugins.fleet] kuery: status:* AND (fleet-agents.policy_id : ("d3448ae1-9e55-485e-b74c-83471cb06977"))
...
[2025-01-07T14:39:00.264+01:00][INFO ][plugins.fleet] Removing task fleet:unenroll_action:retry:check:2a57ac7a-0c1b-4a08-8709-6ccb683db995
[2025-01-07T14:39:00.290+01:00][INFO ][plugins.fleet] Scheduling task fleet:unenroll_action:retry:check:2a57ac7a-0c1b-4a08-8709-6ccb683db995
[2025-01-07T14:39:00.293+01:00][INFO ][plugins.fleet] processed 100000 agents, took 55811ms
[2025-01-07T14:39:00.293+01:00][INFO ][plugins.fleet] Removing task fleet:unenroll_action:retry:check:2a57ac7a-0c1b-4a08-8709-6ccb683db995
[2025-01-07T14:39:00.304+01:00][DEBUG][plugins.fleet] [Bulk Agent Unenroll API] Unenroll agents in 56027ms

Logs after the change:

[2025-01-07T14:42:10.178+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: c3e12928-bbfe-4731-bdc2-c47bcac727a7, total agents:100000
[2025-01-07T14:42:10.194+01:00][INFO ][plugins.fleet] Scheduling task fleet:unenroll_action:retry:check:c3e12928-bbfe-4731-bdc2-c47bcac727a7
[2025-01-07T14:42:10.195+01:00][DEBUG][plugins.fleet] kuery: status:* AND (fleet-agents.policy_id : ("d3448ae1-9e55-485e-b74c-83471cb06977"))
[2025-01-07T14:42:10.195+01:00][DEBUG][plugins.fleet] [Bulk Agent Unenroll API] Unenroll agents in 196ms
[2025-01-07T14:43:00.762+01:00][INFO ][plugins.fleet] processed 100000 agents, took 50567ms
[2025-01-07T14:43:00.762+01:00][INFO ][plugins.fleet] Removing task fleet:unenroll_action:retry:check:c3e12928-bbfe-4731-bdc2-c47bcac727a7

elasticmachine · 2025-01-07T13:43:23Z

Pinging @elastic/fleet (Team:Fleet)

nchaulet

@juliaElastic instead of breaking the promise chain what do you think of always running those bulk actions in a background task? (it will make things more robust, and more resilient to kibana being halted)

If I understand it correctly right now we try to run it and run the errors in a background task correct?

juliaElastic · 2025-01-07T14:07:31Z

@juliaElastic instead of breaking the promise chain what do you think of always running those bulk actions in a background task? (it will make things more robust, and more resilient to kibana being halted)

If I understand it correctly right now we try to run it and run the errors in a background task correct?

Yes currently the first attempt is made outside of the task, and the check/retry is done in the async task. We could probably refactor to move the first attempt to the async task too. I'll look at it, it looks like a bigger refactor.

nchaulet

code LGTM 🚀 does any of our observability-perf will have to be updated to poll results, or they are already doing it?

juliaElastic · 2025-01-08T14:40:57Z

code LGTM 🚀 does any of our observability-perf will have to be updated to poll results, or they are already doing it?

It already polls, should be okay.

elasticmachine · 2025-01-08T17:34:53Z

💔 Build Failed

Buildkite Build
Commit: eed4746

Failed CI Steps

Test Failures

[job] [logs] FTR Configs #28 / Agents fleet_request_diagnostics /agents/bulk_request_diagnostics should work for multiple agents by kuery in batches async
[job] [logs] FTR Configs #28 / Agents fleet_upgrade_agent multiple agents should bulk upgrade multiple agents by kuery in batches async

Metrics [docs]

✅ unchanged

History

cc @juliaElastic

juliaElastic added 2 commits January 7, 2025 14:34

fix bulk actions

52aea09

fix comment

ced1f3e

juliaElastic added release_note:fix backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Jan 7, 2025

juliaElastic self-assigned this Jan 7, 2025

juliaElastic requested a review from a team as a code owner January 7, 2025 13:43

botelastic bot added the Team:Fleet Team label for Observability Data Collection Fleet team label Jan 7, 2025

nchaulet reviewed Jan 7, 2025

View reviewed changes

juliaElastic added 2 commits January 7, 2025 15:10

fix linting

a8f47f5

run bulk action fully in async task

24586c4

juliaElastic requested a review from nchaulet January 8, 2025 10:25

juliaElastic added 2 commits January 8, 2025 12:48

fix unit test

71d2786

fix api test

77eb57c

nchaulet approved these changes Jan 8, 2025

View reviewed changes

fix api test

35f50ac

fix api test

eed4746

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Fleet] fix bulk actions timing out sometimes #205735

[Fleet] fix bulk actions timing out sometimes #205735

juliaElastic commented Jan 7, 2025 •

edited

Loading

elasticmachine commented Jan 7, 2025

nchaulet left a comment

juliaElastic commented Jan 7, 2025 •

edited

Loading

nchaulet left a comment

juliaElastic commented Jan 8, 2025

elasticmachine commented Jan 8, 2025 •

edited

Loading

[Fleet] fix bulk actions timing out sometimes #205735

Are you sure you want to change the base?

[Fleet] fix bulk actions timing out sometimes #205735

Conversation

juliaElastic commented Jan 7, 2025 • edited Loading

Summary

Old description

elasticmachine commented Jan 7, 2025

nchaulet left a comment

Choose a reason for hiding this comment

juliaElastic commented Jan 7, 2025 • edited Loading

nchaulet left a comment

Choose a reason for hiding this comment

juliaElastic commented Jan 8, 2025

elasticmachine commented Jan 8, 2025 • edited Loading

💔 Build Failed

Failed CI Steps

Test Failures

Metrics [docs]

History

juliaElastic commented Jan 7, 2025 •

edited

Loading

juliaElastic commented Jan 7, 2025 •

edited

Loading

elasticmachine commented Jan 8, 2025 •

edited

Loading