Include healthcheck logic for helper scripts running as sidecars #1842
base: alpha
Conversation
FWIW Here is my preview network pool, and cncli containers showing healthy once the script was copied in and healthcheck interval was reached:
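For reference, one way to view the reported health state (the original screenshot/output is not reproduced here; container names are illustrative):

```
# List containers with their status; healthy ones show "(healthy)" once the healthcheck passes
docker ps --format 'table {{.Names}}\t{{.Status}}'

# Or inspect a single container's health state directly
docker inspect --format '{{.State.Health.Status}}' <container-name>
```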
Looks good!
- cp the script into the cncli sync, validate, leaderlog, pt-send-slots and pt-send-tip containers
- Execute the script with docker exec
- Monitor the containers until the healthcheck interval occurs and confirm they are marked healthy
- Adjusted RETRIES
- Adjusted CPU_THRESHOLD
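As a concrete illustration of the first two steps (the container name is an example, the script path is taken from the PR description):

```
# Copy the updated script into a running sidecar, then run it once by hand
docker cp healthcheck.sh cncli-sync:/home/guild/.scripts/healthcheck.sh
docker exec cncli-sync /home/guild/.scripts/healthcheck.sh && echo healthy || echo unhealthy
```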
Further testing... I was able to test with higher CPU load after deleting the cncli db and re-syncing. Result:
Line 44 of healthcheck.sh: this seems to fix it...
With the above change, when CPU load is higher than CPU_THRESHOLD, this is the result:
Yeah, there are rare instances where the cncli percentage can be high, but this tends to be when resyncing the entire db and/or a cncli init is running. Occasionally, if there is an issue with the node process itself, like it getting stuck in chainsync/blockfetch and never completing, I have also seen cncli reach a high percentage, but otherwise it's quite rare to see it increase. I figured with mithril-signer or db-sync, it might be more useful.
@adamsthws Feel free to submit suggestions to adjust the ... Thanks for the testing.
Testing revealed that setting RETRIES=0 results in the script exiting 1 without running the loop... it would be preferable to run the loop once when RETRIES=0. Suggestion: modify the loop condition to handle RETRIES=0 by changing line 39 to the following:
Or...
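The exact snippets from the comment above are not preserved here; as a hypothetical sketch of the idea, a loop condition that always executes at least once, even when RETRIES=0, could look like this (the helper name is a placeholder, not the actual healthcheck.sh function):

```
# Run the check at least once; with RETRIES=0 the loop body still executes one time.
for ((i = 0; i <= RETRIES; i++)); do
  cpu_below_threshold && exit 0   # healthy as soon as CPU drops below CPU_THRESHOLD
  sleep 3                         # delay between retries
done
exit 1                            # still above CPU_THRESHOLD after all attempts
```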
I started thinking about a cncli-specific check. The following function is an idea to check cncli status...
Perhaps it would be improved further by also checking whether the sync is incrementing, so the healthcheck doesn't fail during an initial sync. How would you feel about adding me as a commit co-author if you decide to use this?
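The function from the comment is not reproduced here; a rough sketch of the idea, assuming cncli's status subcommand is available in the container and reports "ok" once the database is close to the tip (db path and genesis file variables are placeholders):

```
check_cncli() {
  command -v cncli >/dev/null 2>&1 || return 1   # binary missing -> unhealthy
  # Report healthy only if the cncli database is considered in sync
  if cncli status --db "${CNCLI_DB}" \
       --byron-genesis "${BYRON_GENESIS_JSON}" \
       --shelley-genesis "${SHELLEY_GENESIS_JSON}" 2>/dev/null | grep -q '"ok"'; then
    return 0
  fi
  return 1
}
```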
Description
Enhances the healthcheck.sh script to work for checking permissions on sidecar containers (helper scripts) via the ENTRYPOINT_PROCESS.
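As a rough sketch of the idea (the check function names and case values below are placeholders, not the actual healthcheck.sh internals), the script can branch on ENTRYPOINT_PROCESS to decide which check applies:

```
# Hypothetical dispatch on ENTRYPOINT_PROCESS; the real script's cases may differ.
case "${ENTRYPOINT_PROCESS:-cnode.sh}" in
  cncli.sh)          check_cncli ;;            # sync/validate/leaderlog/pt-send-* sidecars
  mithril-signer.sh) check_process mithril-signer ;;
  *)                 check_node ;;             # default: the cardano-node container itself
esac
```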
Where should the reviewer start?
/home/guild/.scripts/
Testing different CPU usage values:
- Set CPU_THRESHOLD (default: 80%) to the value above which you want a container to be marked unhealthy.
Testing a different number of retries (internal to the healthcheck.sh script):
- Set RETRIES (default: 20) to the number of retries you want to perform while CPU usage is above the CPU_THRESHOLD value before exiting non-zero.
- There is currently a 3 second delay between checks, so 20 retries results in up to 60 seconds before the healthcheck will exit as unhealthy due to CPU load.
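For a quicker manual test of these values, the defaults can be overridden for a single run (container name and values are illustrative):

```
# One-off run of the healthcheck with lower limits; exit code 0 means healthy
docker exec -e CPU_THRESHOLD=50 -e RETRIES=5 <container> /home/guild/.scripts/healthcheck.sh
echo $?
```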
Testing different healthcheck values (external to healthcheck.sh script).
The current HEALTHCHECK of the container image is:
Reducing the start period and interval to something more appropriate for the sidecar script will result in a much shorter period to determine the sidecar container's health.
Make sure to keep RETRIES * 3 (seconds) below the container healthcheck timeout, to avoid the container being marked unhealthy before the script can return during periods of high CPU load.
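The image's actual HEALTHCHECK line is not reproduced above; as an illustrative run-time override for a sidecar, the timings can be tightened when starting the container (values are examples only, chosen so the timeout stays above RETRIES * 3 seconds):

```
docker run -d \
  --health-cmd '/home/guild/.scripts/healthcheck.sh' \
  --health-interval 30s \
  --health-start-period 60s \
  --health-timeout 70s \
  --health-retries 3 \
  <sidecar-image>
```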
Motivation and context
Issue #1841
Which issue it fixes?
Closes #1841
How has this been tested?
- docker cp the script into preview network cncli sync, validate and leaderlog containers, waiting until the interval runs the script
- docker exec to confirm it reports healthy
Additional Details
There is a SLEEPING_SCRIPTS array, used so that validate and leaderlog still check for the cncli binary but a sleep period is not considered unhealthy. Not 100% sure this is the best approach, but with sleep periods being variable I felt it was likely an acceptable middle ground.
Please do not hesitate to suggest an alternative approach to handling healthchecks for sleeping sidecars if you think you have an improvement.
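A minimal sketch of how the SLEEPING_SCRIPTS idea can work (array contents and the variable compared against are illustrative; the actual implementation in healthcheck.sh may differ):

```
SLEEPING_SCRIPTS=("validate" "leaderlog")

# For scripts that legitimately sleep between runs, only verify the cncli binary
# exists instead of treating idle periods as unhealthy.
for sleeper in "${SLEEPING_SCRIPTS[@]}"; do
  if [[ "${ENTRYPOINT_PROCESS}" == *"${sleeper}"* ]]; then
    command -v cncli >/dev/null 2>&1 && exit 0 || exit 1
  fi
done
```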
@adamsthws, could you please copy this into your sidecar containers (and your pool) and report back any results? I am marking this as a draft PR for the time being until testing is completed, after which, if things look good, I will mark it for review and get feedback from others.
Thanks