CheckMate

CheckMate is a service monitoring tool written in Go that provides real-time health checks and metrics for infrastructure. It supports multiple protocols, customizable rules, and Prometheus integration.

DISCLAIMER: This is a personal project under heavy development. It is not feature complete, secure, or tested, and it is not meant to be used in a production environment.

Features

Core Features

Multi-protocol support (TCP, HTTP, HTTPS with cert validation, SMTP, DNS)
Hierarchical configuration (Sites → Groups → Hosts → Checks)
High availability monitoring with configurable modes
Configurable check intervals per service
Prometheus metrics integration
Simple rule-based monitoring with custom conditions
Flexible notification system
Service tagging system
TLS certificate expiration monitoring

High Availability Monitoring

Groups support two monitoring modes that can be configured at different levels:

All Mode (Default)
- Group is considered "up" if any host is responding
- Rules only trigger when all hosts are down
- Suitable for redundant services where one available host is sufficient

Any Mode
- Group monitoring tracks all hosts individually
- Rules trigger when any host goes down
- Suitable for services where each host's availability is critical
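The difference between the two modes can be sketched in a few lines of Go. This is a minimal illustration, not CheckMate's actual code: the `RuleMode` type, the `groupIsDown` function, and the treatment of an empty group are assumptions made for the example.

```go
package main

import "fmt"

// RuleMode is a hypothetical type for the two group monitoring modes.
type RuleMode string

const (
	ModeAll RuleMode = "all" // rules fire only when every host is down
	ModeAny RuleMode = "any" // rules fire as soon as one host is down
)

// groupIsDown reports whether a group counts as down for rule evaluation.
// hostsUp holds one entry per host, true when the host responded.
func groupIsDown(mode RuleMode, hostsUp []bool) bool {
	down := 0
	for _, up := range hostsUp {
		if !up {
			down++
		}
	}
	if mode == ModeAny {
		return down > 0
	}
	// "all" mode: every host must be down. Assumption: an empty group
	// is never treated as down.
	return len(hostsUp) > 0 && down == len(hostsUp)
}

func main() {
	hosts := []bool{true, false, false}
	fmt.Println(groupIsDown(ModeAll, hosts)) // false: one host still responds
	fmt.Println(groupIsDown(ModeAny, hosts)) // true: at least one host is down
}
```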

Rule modes can be configured at three levels (in order of precedence):

Check level - Overrides group settings for specific checks
Group level - Default for all checks in the group
Default - Falls back to "all" mode if not specified
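The precedence order above amounts to a simple fallback chain. The sketch below assumes an unset mode is represented by an empty string; `resolveRuleMode` is a hypothetical helper name, not a function from the CheckMate codebase.

```go
package main

import "fmt"

// resolveRuleMode picks the effective mode: check level first, then group
// level, then the documented default of "all". An empty string means "not set".
func resolveRuleMode(checkMode, groupMode string) string {
	if checkMode != "" {
		return checkMode
	}
	if groupMode != "" {
		return groupMode
	}
	return "all"
}

func main() {
	fmt.Println(resolveRuleMode("any", "all")) // any (check overrides group)
	fmt.Println(resolveRuleMode("", "any"))    // any (group setting applies)
	fmt.Println(resolveRuleMode("", ""))       // all (default)
}
```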

Configuration

Site Configuration

monitorSite: Name of the monitoring instance
sites: List of infrastructure sites to monitor
- name: Site identifier
- tags: Site-level tags
- groups: List of service groups

Group Configuration

name: Group identifier
tags: Group-level tags (combined with site tags)
hosts: List of hosts to monitor
- host: Hostname or IP
- tags: Host-specific tags
checks: Service checks applied to all hosts
- port: Port number
- protocol: TCP, HTTP, HTTPS, SMTP, or DNS
- interval: Check frequency (e.g., "30s", "1m")
- tags: Check-specific tags
- ruleMode: Override group's rule mode
- verifyCert: Enable certificate checking
ruleMode: Group-level rule mode ("all" or "any")
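To make the hierarchy concrete, here is a rough Go sketch of how the site and group keys could map onto types. The struct definitions and the `mergeTags` helper are hypothetical, the host address is a placeholder, and combining host tags on top of site and group tags is an assumption; only the key names and the "30s" interval format come from the lists above.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical types mirroring the documented hierarchy; field names follow
// the configuration keys, but the real structs may differ.
type Check struct {
	Port       int
	Protocol   string
	Interval   string
	Tags       []string
	RuleMode   string
	VerifyCert bool
}

type Host struct {
	Host string
	Tags []string
}

type Group struct {
	Name     string
	Tags     []string
	Hosts    []Host
	Checks   []Check
	RuleMode string
}

type Site struct {
	Name   string
	Tags   []string
	Groups []Group
}

// mergeTags concatenates tag lists in order and drops duplicates.
func mergeTags(lists ...[]string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, list := range lists {
		for _, tag := range list {
			if !seen[tag] {
				seen[tag] = true
				out = append(out, tag)
			}
		}
	}
	return out
}

func main() {
	site := Site{
		Name: "mars-lab",
		Tags: []string{"prod"},
		Groups: []Group{{
			Name:   "api",
			Tags:   []string{"api", "prod"},
			Hosts:  []Host{{Host: "10.0.0.1", Tags: []string{"primary"}}},
			Checks: []Check{{Port: 443, Protocol: "HTTPS", Interval: "30s", VerifyCert: true}},
		}},
	}
	g := site.Groups[0]
	fmt.Println(mergeTags(site.Tags, g.Tags, g.Hosts[0].Tags)) // [prod api primary]

	// Interval strings such as "30s" or "1m" use Go's duration syntax.
	interval, err := time.ParseDuration(g.Checks[0].Interval)
	fmt.Println(interval, err) // 30s <nil>
}
```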

Rule Configuration

name: Rule identifier
condition: Expression using downtime and responseTime variables
tags: Tags to match against groups
notifications: Notification types to use
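As an illustration of how a rule might be applied, the sketch below pairs tag matching with a condition over `downtime` and `responseTime`. Everything here is an assumption: the `Rule` type, the "at least one shared tag" matching policy, the thresholds, and the use of a Go function in place of the expression string, whose real syntax is not described in this README.

```go
package main

import (
	"fmt"
	"time"
)

// Rule is a hypothetical in-memory form of a rule entry. The real tool reads
// the condition from an expression string; a Go function stands in for it here.
type Rule struct {
	Name      string
	Tags      []string
	Condition func(downtime, responseTime time.Duration) bool
}

// appliesTo assumes a rule matches a group when they share at least one tag.
func (r Rule) appliesTo(groupTags []string) bool {
	for _, rt := range r.Tags {
		for _, gt := range groupTags {
			if rt == gt {
				return true
			}
		}
	}
	return false
}

func main() {
	rule := Rule{
		Name: "slow_or_down",
		Tags: []string{"prod"},
		// Illustrative thresholds: down for over 5 minutes, or slower than 2 seconds.
		Condition: func(downtime, responseTime time.Duration) bool {
			return downtime > 5*time.Minute || responseTime > 2*time.Second
		},
	}
	if rule.appliesTo([]string{"api", "prod"}) {
		fmt.Println(rule.Condition(0, 3*time.Second)) // true: response time exceeds 2s
	}
}
```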

Notification Configuration

type: Notification type (currently only "log"; more types are planned)

Certificate Rule Configuration

name: Rule identifier
minDaysValidity: Number of days before expiration to trigger alert
tags: Tags to match against groups/checks
notifications: Notification types to use
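A certificate rule boils down to comparing the remaining validity against `minDaysValidity`. The sketch below is an assumption about how that comparison might look; the helper names are hypothetical, and whether the real check uses a strict or inclusive comparison is not stated here. In practice the expiry timestamp would come from the `NotAfter` field of a parsed `x509.Certificate`.

```go
package main

import (
	"fmt"
	"time"
)

// daysUntilExpiry returns the number of days between now and the
// certificate's expiry time, truncated toward zero.
func daysUntilExpiry(notAfter, now time.Time) int {
	return int(notAfter.Sub(now).Hours() / 24)
}

// certRuleTriggers reports whether a certificate with daysLeft days of
// validity falls below the rule's minDaysValidity threshold.
// Assumption: the comparison is strict.
func certRuleTriggers(daysLeft, minDaysValidity int) bool {
	return daysLeft < minDaysValidity
}

func main() {
	now := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	notAfter := now.AddDate(0, 0, 20) // certificate expires in 20 days
	left := daysUntilExpiry(notAfter, now)
	fmt.Println(left, certRuleTriggers(left, 30)) // 20 true
}
```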

Metrics

Prometheus Integration

Node metrics for hosts and groups
Edge metrics for relationships
Response time histograms
Success/failure counters
Host availability tracking

All metrics are exposed on :9100/metrics with the checkmate_ prefix.

Core Metrics

checkmate_check_success: Service availability (1 = up, 0 = down)
checkmate_check_latency_milliseconds: Response time in milliseconds
checkmate_check_latency_milliseconds_histogram: Response time distribution
checkmate_hosts_up: Number of hosts up in a group (per port/protocol)
checkmate_hosts_total: Total number of hosts in a group (per port/protocol)
checkmate_cert_expiry_days: Days until certificate expiration

Graph Visualization Metrics (In Development)

Note: These metrics are designed for Grafana's Node Graph visualization and are still subject to change.

checkmate_node_info: Node information for graph visualization
- Labels: id, type (site/group/host), name, tags, port, protocol
- Values: 1 for active nodes, 0 for inactive
checkmate_edge_info: Edge information with latency
- Labels: source, target, type, metric, port, protocol
- Values: latency in milliseconds

Example Prometheus queries:

# Filter checks by site
checkmate_check_success{site="mars-lab"}

# Average response time for production APIs
avg(checkmate_check_latency_milliseconds{tags=~".*prod.*", tags=~".*api.*"})

# 95th percentile latency by site
histogram_quantile(0.95, sum(rate(checkmate_check_latency_milliseconds_histogram_bucket[5m])) by (le, site))

# Host availability ratio per group
sum(checkmate_hosts_up) by (id) / sum(checkmate_hosts_total) by (id)

# Graph Visualization (In Development)
checkmate_node_info{type="host", port="443", protocol="HTTPS"}
avg(checkmate_edge_info{type="contains", metric="latency"}) by (source, target, port, protocol)

Grafana Node Graph Setup (In Development)

To visualize your infrastructure in Grafana's Node Graph:

Create a new Node Graph panel
Configure the Node Query:
```
checkmate_node_info
```
Configure the Edge Query:
```
checkmate_edge_info{metric="latency"}
```
Set transformations:
- Nodes: Use 'id' for node ID, 'type' for node class
- Edges: Use 'source' and 'target' for connections

Note: Graph visualization features are still changing, and the query and configuration interface may change with them.

Health Checks

CheckMate provides Kubernetes-compatible health check endpoints:

/health/live - Liveness probe
- Returns 200 OK when the service is running
/health/ready - Readiness probe
- Returns 200 OK when ready to receive traffic
- Returns 503 Service Unavailable during initialization

All health check endpoints are served on port 9100 alongside metrics.
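The probe behaviour described above can be sketched with the standard library alone. This is an illustrative outline, not CheckMate's implementation: the `Health` type, its methods, and the `probe` helper are hypothetical, and the example exercises the handlers in-process with `httptest` instead of listening on port 9100.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Health is a hypothetical holder for the readiness flag.
type Health struct {
	ready atomic.Bool
}

// SetReady marks initialization as finished (true) or pending (false).
func (h *Health) SetReady(v bool) { h.ready.Store(v) }

// Mux returns a handler serving the liveness and readiness endpoints.
func (h *Health) Mux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // the process is running
	})
	mux.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
		if !h.ready.Load() {
			w.WriteHeader(http.StatusServiceUnavailable) // still initializing
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe sends an in-process GET request and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := &Health{}
	mux := h.Mux()
	fmt.Println(probe(mux, "/health/live"), probe(mux, "/health/ready")) // 200 503
	h.SetReady(true)
	fmt.Println(probe(mux, "/health/ready")) // 200
}
```

A real server would pass the same mux to `http.ListenAndServe(":9100", mux)` and call `SetReady(true)` once startup work has completed.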

Roadmap

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Development

Prerequisites

Go 1.21 or higher
air for live reloading (optional)

Live Reloading

For development with automatic rebuilding on code changes:

Install Air:

go install github.com/air-verse/air@latest

Run with Air:

air
