"Ternary check mode" for more lightweight checks #1
Comments
The discrete mode is great if all you care about is uptime/downtime (as you say). But you lose the RTT (as you say). Conversely, if all you care about is response time and not uptime/downtime, then you could selectively fetch only every other filename that you receive in the list operation, or every Nth file, or some other similar scheme to ensure you are pulling from all of the regions equally. There seem to be quite attractive tradeoffs you can make if you only care about one of Another way to think about it is to combine these approaches...write a |
Another approach is writing a single file per instance per day. When a check runs, you pull down the latest version, update it with the new value, then re-upload the file, overwriting the old one. If you are worried about race conditions (not an issue if you are using the "built in" cron behavior), you could write a new file each time and only pull down the latest file on the client side. This would reduce the number of files pulled down to N (with N being the number of instances running the checker), which would be an immense reduction.
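A rough sketch of the per-instance, per-day idea, in case it helps the discussion. The `store` dict stands in for the storage bucket, and the key layout (`<instance>/<day>.json`) and the `append_check` helper are made up for illustration, not anything checkup does today. Note that the read-modify-write here is not atomic, which is the race condition mentioned above.

```python
import json

def append_check(store: dict, instance: str, day: str, result: dict) -> str:
    """Append one check result to the single daily file for an instance.

    `store` is a plain dict standing in for object storage (assumption).
    """
    key = f"{instance}/{day}.json"            # hypothetical key layout
    results = json.loads(store.get(key, "[]"))  # pull down the latest version
    results.append(result)                      # update with the new value
    store[key] = json.dumps(results)            # re-upload, overwriting the old file
    return key

store = {}
append_check(store, "us-east", "2016-08-01", {"ts": 1, "rtt_ms": 120})
append_check(store, "us-east", "2016-08-01", {"ts": 2, "rtt_ms": 95})
append_check(store, "eu-west", "2016-08-01", {"ts": 1, "rtt_ms": 210})
print(len(store))  # one file per instance for the day
```

With this layout the status page would fetch one file per instance for a one-day window, regardless of how often checks run.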
Per-day files would work great if the timeframe is day (or a low number of days). Let's say you want a timeframe of the last year–there would be lots of heavy lifting. How about a command, say, |
Just jotting my thoughts down into an issue for discussion...
If you do 1 check every 10 minutes and your status page shows the last 24 hours of checks, the browser downloads 144 check files to render the status page. This isn't too bad, but if you distribute your checks across multiple instances, you multiply the number of check files by the number of instances you distribute your checks to. And if you want finer granularity in the reporting, you have to produce check files more frequently.
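To put numbers on that, here is a quick back-of-the-envelope calculation. The function name and parameters are just for illustration.

```python
def check_files_to_download(interval_minutes: int, window_hours: int,
                            instances: int = 1) -> int:
    """Number of check files the status page fetches for the given window."""
    checks_per_instance = (window_hours * 60) // interval_minutes
    return checks_per_instance * instances

print(check_files_to_download(10, 24))      # 144: one instance, every 10 minutes
print(check_files_to_download(10, 24, 3))   # 432: same schedule on 3 instances
print(check_files_to_download(1, 24, 3))    # 4320: every minute on 3 instances
```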
One way to alleviate this volume is to introduce an alternate mode of producing checks: a "ternary" or "discrete" mode (for lack of better words) that only reports `healthy`, `degraded`, or `down`. A `healthy` status is assumed unless a file exists to report `degraded` or `down`. Assuming an endpoint is usually healthy, this would drastically reduce the number of check files produced. Checks could be run every minute on multiple instances, if desired, and if the endpoint is reliably up, no check files would have to be downloaded.
You do lose the RTT (response time) value, so the graphs will report "Up", "Degraded", or "Down" instead of a number. But if the service is only down for 5 minutes, you'd only have to download ~5 check files, so the status pages load much faster and you use less storage.
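Roughly, the status page side of this mode could look like the sketch below. It assumes checks land on fixed, aligned time slots and that the only files present are the non-healthy reports; `build_timeline` and the shape of `reports` are hypothetical.

```python
from datetime import datetime, timedelta

def build_timeline(start, end, interval, reports):
    """Fill every check slot in [start, end), assuming healthy by default.

    `reports` maps a slot's datetime to "degraded" or "down" (assumption:
    one entry per check file that was actually written).
    """
    timeline = []
    t = start
    while t < end:
        timeline.append((t, reports.get(t, "healthy")))
        t += interval
    return timeline

start = datetime(2016, 8, 1, 0, 0)
reports = {
    start + timedelta(minutes=3): "down",
    start + timedelta(minutes=4): "down",
    start + timedelta(minutes=5): "degraded",
}
timeline = build_timeline(start, start + timedelta(minutes=10),
                          timedelta(minutes=1), reports)
for slot, status in timeline:
    print(slot.strftime("%H:%M"), status)
```

Ten one-minute slots are rendered here from only three report files. One thing this sketch cannot distinguish is "healthy" from "the checker did not run", which any real design would need to address.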
@sqs also had the terrible, wonderful, no good, really great idea of encoding the results of the checks directly into the filenames on S3. 😄 That would allow us to download most of the results in just one or a few requests for file listings...
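A sketch of what the filename idea might look like. The naming scheme (`<timestamp>-<status>-<rtt_ms>.json`) is invented for illustration and is not an existing checkup format; the point is only that a bucket listing alone would carry enough to render the page.

```python
from collections import Counter

def encode_name(timestamp: int, status: str, rtt_ms: int) -> str:
    """Pack a check result into a filename (hypothetical scheme)."""
    return f"{timestamp}-{status}-{rtt_ms}.json"

def decode_name(name: str):
    """Recover (timestamp, status, rtt_ms) from a filename alone."""
    stem = name.rsplit(".", 1)[0]
    ts, status, rtt = stem.split("-")   # status must not contain "-"
    return int(ts), status, int(rtt)

# Pretend this came back from a single "list files" request.
listing = [
    encode_name(1470000000, "healthy", 120),
    encode_name(1470000600, "healthy", 135),
    encode_name(1470001200, "down", 0),
]
results = [decode_name(n) for n in listing]
print(Counter(status for _, status, _ in results))
```

No check file bodies are downloaded at all here; the trade-off is that anything beyond what fits in a filename would still need a separate fetch.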
Anyway, it's too early to tell yet how people will be using this and if this mode will be in demand. This change would definitely be a paradigm shift so lots of code changes would be required, I think, unless there's a clever way for the checkup workers and the status page to mutually agree what the mode is from the results of the checks. (Would rather make the mode implicit than requiring explicit configuration. Going for the "just works" ideal.)