Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet] fix bulk actions timing out sometimes #205735

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

juliaElastic
Copy link
Contributor

@juliaElastic juliaElastic commented Jan 7, 2025

Summary

Closes https://github.com/elastic/ingest-dev/issues/4346

Update: changed the implementation to run the first attempt of bulk action execution in the task too.

[2025-01-08T11:10:54.139+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: f8658178-cb1e-485d-9d2f-ad60ccc37bc9, total agents:10001
actionParams {
  kuery: '( fleet-agents.policy_id : ("d3448ae1-9e55-485e-b74c-83471cb06977")) and (status:online or (status:error or status:degraded) or (status:updating or status:unenrolling or status:enrolling) or status:offline)',
  revoke: false,
  force: undefined,
  batchSize: 10000,
  showInactive: true,
  spaceId: 'default',
  total: 10001,
  actionId: 'f8658178-cb1e-485d-9d2f-ad60ccc37bc9'
}
retryParams {
  pitId: 'gIuaBAEPLmZsZWV0LWFnZW50cy03FmRQN3pDT1gzUnFpNFkwdHJFdzJvbncAARZ5N2R5SnVSelN2bWNMang1THNfVkNRAAEAAAAAAAFpZhZiZnpybjZYSFRRNklYNlBqWHFzVU1nAAEWZFA3ekNPWDNScWk0WTB0ckV3Mm9udwAA',
  retryCount: 1
}
[2025-01-08T11:10:54.154+01:00][INFO ][plugins.fleet] Scheduling task fleet:unenroll_action:retry:f8658178-cb1e-485d-9d2f-ad60ccc37bc9
[2025-01-08T11:10:54.154+01:00][DEBUG][plugins.fleet] [Bulk Agent Unenroll API] Unenroll agents in 647ms
[2025-01-08T11:10:55.772+01:00][WARN ][http.server.Kibana] Event loop utilization for /julia/api/fleet/agents exceeded threshold of 250ms (351ms out of 633ms) and 15% (55%) 
[2025-01-08T11:10:57.235+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2025-01-08T11:10:57.235+01:00][DEBUG][plugins.fleet] Running task fleet:unenroll_action:retry:f8658178-cb1e-485d-9d2f-ad60ccc37bc9
[2025-01-08T11:10:57.236+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2025-01-08T11:10:57.251+01:00][INFO ][plugins.fleet] Scheduling task fleet:unenroll_action:retry:check:f8658178-cb1e-485d-9d2f-ad60ccc37bc9
[2025-01-08T11:10:57.251+01:00][DEBUG][plugins.fleet] kuery: ( fleet-agents.policy_id : ("d3448ae1-9e55-485e-b74c-83471cb06977")) and (status:online or (status:error or status:degraded) or (status:updating or status:unenrolling or status:enrolling) or status:offline)
[2025-01-08T11:10:57.781+01:00][DEBUG][plugins.fleet] Action not found
[2025-01-08T11:10:58.891+01:00][DEBUG][plugins.fleet] Secrets storage requirements already met, turned on in settings
[2025-01-08T11:10:59.414+01:00][WARN ][http.server.Kibana] Event loop utilization for /julia/api/fleet/agent_status exceeded threshold of 250ms (294ms out of 348ms) and 15% (85%) 
[2025-01-08T11:10:59.806+01:00][WARN ][http.server.Kibana] Event loop utilization for /julia/api/fleet/agents exceeded threshold of 250ms (504ms out of 743ms) and 15% (68%) 
[2025-01-08T11:11:01.532+01:00][INFO ][plugins.fleet] Removing task fleet:unenroll_action:retry:check:f8658178-cb1e-485d-9d2f-ad60ccc37bc9
actionParams {
  kuery: '( fleet-agents.policy_id : ("d3448ae1-9e55-485e-b74c-83471cb06977")) and (status:online or (status:error or status:degraded) or (status:updating or status:unenrolling or status:enrolling) or status:offline)',
  revoke: false,
  batchSize: 10000,
  showInactive: true,
  spaceId: 'default',
  total: 10001,
  actionId: 'f8658178-cb1e-485d-9d2f-ad60ccc37bc9'
}
retryParams {
  pitId: 'gIuaBAEPLmZsZWV0LWFnZW50cy03FmRQN3pDT1gzUnFpNFkwdHJFdzJvbncAARZ5N2R5SnVSelN2bWNMang1THNfVkNRAAEAAAAAAAFpZhZiZnpybjZYSFRRNklYNlBqWHFzVU1nAAEWZFA3ekNPWDNScWk0WTB0ckV3Mm9udwAA',
  retryCount: 1,
  taskId: 'fleet:unenroll_action:retry:f8658178-cb1e-485d-9d2f-ad60ccc37bc9',
  searchAfter: [ 1736331016589, 'online-98', 119770 ]
}
[2025-01-08T11:11:01.564+01:00][INFO ][plugins.fleet] Scheduling task fleet:unenroll_action:retry:check:f8658178-cb1e-485d-9d2f-ad60ccc37bc9
[2025

Old description

Bulk actions supposed to run async in a kibana task, and the API to return quickly with an action id.
This was implemented here and unintentionally broke when a tslint rule was introduced here, effectively letting the async code finish before the API returns. This results in the API timing out sometimes when there are many agents.

Tested by creating 100k agent docs with the create_agents script and bulk unenroll agents.

Logs before the change:

[2025-01-07T14:38:04.467+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 2a57ac7a-0c1b-4a08-8709-6ccb683db995, total agents:100000
[2025-01-07T14:38:04.482+01:00][INFO ][plugins.fleet] Scheduling task fleet:unenroll_action:retry:check:2a57ac7a-0c1b-4a08-8709-6ccb683db995
[2025-01-07T14:38:04.482+01:00][DEBUG][plugins.fleet] kuery: status:* AND (fleet-agents.policy_id : ("d3448ae1-9e55-485e-b74c-83471cb06977"))
...
[2025-01-07T14:39:00.264+01:00][INFO ][plugins.fleet] Removing task fleet:unenroll_action:retry:check:2a57ac7a-0c1b-4a08-8709-6ccb683db995
[2025-01-07T14:39:00.290+01:00][INFO ][plugins.fleet] Scheduling task fleet:unenroll_action:retry:check:2a57ac7a-0c1b-4a08-8709-6ccb683db995
[2025-01-07T14:39:00.293+01:00][INFO ][plugins.fleet] processed 100000 agents, took 55811ms
[2025-01-07T14:39:00.293+01:00][INFO ][plugins.fleet] Removing task fleet:unenroll_action:retry:check:2a57ac7a-0c1b-4a08-8709-6ccb683db995
[2025-01-07T14:39:00.304+01:00][DEBUG][plugins.fleet] [Bulk Agent Unenroll API] Unenroll agents in 56027ms

Logs after the change:

[2025-01-07T14:42:10.178+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: c3e12928-bbfe-4731-bdc2-c47bcac727a7, total agents:100000
[2025-01-07T14:42:10.194+01:00][INFO ][plugins.fleet] Scheduling task fleet:unenroll_action:retry:check:c3e12928-bbfe-4731-bdc2-c47bcac727a7
[2025-01-07T14:42:10.195+01:00][DEBUG][plugins.fleet] kuery: status:* AND (fleet-agents.policy_id : ("d3448ae1-9e55-485e-b74c-83471cb06977"))
[2025-01-07T14:42:10.195+01:00][DEBUG][plugins.fleet] [Bulk Agent Unenroll API] Unenroll agents in 196ms
[2025-01-07T14:43:00.762+01:00][INFO ][plugins.fleet] processed 100000 agents, took 50567ms
[2025-01-07T14:43:00.762+01:00][INFO ][plugins.fleet] Removing task fleet:unenroll_action:retry:check:c3e12928-bbfe-4731-bdc2-c47bcac727a7

@juliaElastic juliaElastic added release_note:fix backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Jan 7, 2025
@juliaElastic juliaElastic self-assigned this Jan 7, 2025
@juliaElastic juliaElastic requested a review from a team as a code owner January 7, 2025 13:43
@botelastic botelastic bot added the Team:Fleet Team label for Observability Data Collection Fleet team label Jan 7, 2025
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

Copy link
Member

@nchaulet nchaulet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juliaElastic instead of breaking the promise chain what do you think of always running those bulk actions in a background task? (it will make things more robust, and more resilient to kibana being halted)

If I understand it correctly right now we try to run it and run the errors in a background task correct?

@juliaElastic
Copy link
Contributor Author

juliaElastic commented Jan 7, 2025

@juliaElastic instead of breaking the promise chain what do you think of always running those bulk actions in a background task? (it will make things more robust, and more resilient to kibana being halted)

If I understand it correctly right now we try to run it and run the errors in a background task correct?

Yes currently the first attempt is made outside of the task, and the check/retry is done in the async task. We could probably refactor to move the first attempt to the async task too. I'll look at it, it looks like a bigger refactor.

@juliaElastic juliaElastic requested a review from nchaulet January 8, 2025 10:25
Copy link
Member

@nchaulet nchaulet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM 🚀 does any of our observability-perf will have to be updated to poll results, or they are already doing it?

@juliaElastic
Copy link
Contributor Author

code LGTM 🚀 does any of our observability-perf will have to be updated to poll results, or they are already doing it?

It already polls, should be okay.

@elasticmachine
Copy link
Contributor

elasticmachine commented Jan 8, 2025

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #28 / Agents fleet_request_diagnostics /agents/bulk_request_diagnostics should work for multiple agents by kuery in batches async
  • [job] [logs] FTR Configs #28 / Agents fleet_upgrade_agent multiple agents should bulk upgrade multiple agents by kuery in batches async

Metrics [docs]

✅ unchanged

History

cc @juliaElastic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) release_note:fix Team:Fleet Team label for Observability Data Collection Fleet team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants