[Fleet] fix bulk actions timing out sometimes #205735
base: main
Conversation
Pinging @elastic/fleet (Team:Fleet)
@juliaElastic instead of breaking the promise chain, what do you think of always running those bulk actions in a background task? (It would make things more robust and more resilient to Kibana being halted.)
If I understand it correctly, right now we try to run it and retry the errors in a background task, correct?
Yes, currently the first attempt is made outside of the task, and the check/retry is done in the async task. We could probably refactor to move the first attempt into the async task too. I'll look at it; it looks like a bigger refactor.
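The refactor discussed above might look roughly like the sketch below. This is illustrative only: the handler function, task type, and id format are hypothetical stand-ins, not the actual Fleet code; only the Task Manager `schedule` contract is real.

```ts
import { randomUUID } from 'crypto';
import type { TaskManagerStartContract } from '@kbn/task-manager-plugin/server';

// Hypothetical request handler: it only schedules the work and returns the
// action id immediately, so the HTTP request cannot time out waiting for
// agents to be processed.
async function bulkUnenrollHandler(
  taskManager: TaskManagerStartContract,
  agentIds: string[]
): Promise<{ actionId: string }> {
  const actionId = randomUUID();

  // Both the first execution attempt and the check/retry logic run inside
  // the task runner registered for this (assumed) task type.
  await taskManager.schedule({
    id: `fleet:bulk-unenroll:${actionId}`,
    taskType: 'fleet:bulk-action', // assumed task type for this sketch
    params: { actionId, agentIds },
    state: {},
  });

  // Callers poll for completion using the returned action id.
  return { actionId };
}
```

Running the work entirely in a task also means Task Manager can pick it up again after a Kibana restart, which is the resilience benefit mentioned above.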
code LGTM 🚀 will any of our observability-perf tests have to be updated to poll results, or are they already doing that?
It already polls, so it should be okay.
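For reference, a minimal polling loop might look like the sketch below. It assumes Fleet's action status endpoint (`GET /api/fleet/agents/action_status`) and its response shape; the field names and status values may differ between versions.

```ts
// Poll Fleet's action status endpoint until the given action finishes.
// Endpoint path, response shape, and status values are assumptions here.
async function waitForBulkAction(kibanaUrl: string, actionId: string): Promise<void> {
  for (let attempt = 0; attempt < 60; attempt++) {
    const res = await fetch(`${kibanaUrl}/api/fleet/agents/action_status`, {
      headers: { 'kbn-xsrf': 'true' },
    });
    const body = await res.json();
    const action = body.items?.find((item: { actionId: string }) => item.actionId === actionId);
    if (action && action.status !== 'IN_PROGRESS') {
      return; // terminal state reached, e.g. COMPLETE or FAILED
    }
    await new Promise((resolve) => setTimeout(resolve, 5000)); // wait 5s between polls
  }
  throw new Error(`Timed out waiting for bulk action ${actionId}`);
}
```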
💔 Build Failed
Summary
Closes https://github.com/elastic/ingest-dev/issues/4346
Update: changed the implementation to run the first attempt of bulk action execution in the task too.
Old description
Bulk actions are supposed to run async in a Kibana task, with the API returning quickly with an action id.
This was implemented here and unintentionally broke when a tslint rule was introduced here, effectively making the request wait for the async code to finish before the API returns. As a result, the API sometimes times out when there are many agents.
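The regression pattern is easy to show in isolation. The sketch below is a simplified illustration, not the actual Fleet code: awaiting a promise that was intended to be fire-and-forget turns a fast handler into one that blocks for the whole bulk operation.

```ts
// Hypothetical stand-in for the Fleet bulk action runner.
async function runBulkAction(actionId: string): Promise<void> {
  // ...process every agent for this action; can take minutes for 100k agents
}

// Intended behavior: detach the promise and return the action id immediately.
function handleRequestFast(actionId: string): { actionId: string } {
  runBulkAction(actionId).catch((err) => console.error('bulk action failed', err));
  return { actionId };
}

// After the lint-driven change: the handler awaits the whole bulk action,
// so with many agents the HTTP request can exceed its timeout.
async function handleRequestSlow(actionId: string): Promise<{ actionId: string }> {
  await runBulkAction(actionId);
  return { actionId };
}
```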
Tested by creating 100k agent docs with the create_agents script and bulk unenrolling agents.
Logs before the change:
Logs after the change: