OpenGSLB Metrics Reference
OpenGSLB exposes Prometheus metrics for monitoring and observability. This document describes all available metrics and how to configure them.
Configuration
Enable the metrics endpoint in your configuration:
metrics:
  enabled: true
  address: ":9090" # Default port
Metrics are served at http://<address>/metrics in Prometheus text format.
A health check endpoint is also available at http://<address>/health.
Metrics Reference
DNS Metrics
opengslb_dns_queries_total
Type: Counter
Labels: domain, type, status
Total number of DNS queries received.
| Label | Description |
|---|---|
| domain | The queried domain name |
| type | DNS query type (A, AAAA, etc.) |
| status | Response status (e.g., success, nxdomain) |
Example:
opengslb_dns_queries_total{domain="app.example.com",type="A",status="success"} 1542
opengslb_dns_queries_total{domain="app.example.com",type="AAAA",status="success"} 523
opengslb_dns_queries_total{domain="unknown.com",type="A",status="nxdomain"} 12
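The sample lines above follow the Prometheus text exposition format: a metric name, an optional label set in braces, and a value. As a rough illustration of reading them outside Prometheus, the sketch below parses such lines and sums opengslb_dns_queries_total by status. The helper names (parse_sample, total_by_status) are hypothetical, and the parser assumes no trailing timestamps and does not unescape label values, so it is a sketch rather than a complete parser.

```python
import re

# One sample line: metric name, optional {label="value",...}, then a numeric value.
# Assumption: no trailing timestamp and no escape sequences that need unescaping.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


def total_by_status(text):
    """Sum opengslb_dns_queries_total across domains and types, keyed by status."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        name, labels, value = parse_sample(line)
        if name == "opengslb_dns_queries_total":
            status = labels.get("status", "")
            totals[status] = totals.get(status, 0.0) + value
    return totals


# The three example samples from this section.
SAMPLE_TEXT = """\
opengslb_dns_queries_total{domain="app.example.com",type="A",status="success"} 1542
opengslb_dns_queries_total{domain="app.example.com",type="AAAA",status="success"} 523
opengslb_dns_queries_total{domain="unknown.com",type="A",status="nxdomain"} 12
"""

print(total_by_status(SAMPLE_TEXT))
```

In practice Prometheus performs this aggregation itself; the sketch is only meant to make the line format concrete.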
opengslb_dns_query_duration_seconds
Type: Histogram
Labels: domain, status
DNS query processing duration in seconds.
Buckets: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1
Example:
opengslb_dns_query_duration_seconds_bucket{domain="app.example.com",status="success",le="0.001"} 1200
opengslb_dns_query_duration_seconds_sum{domain="app.example.com",status="success"} 0.892
opengslb_dns_query_duration_seconds_count{domain="app.example.com",status="success"} 1542
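A histogram exposes cumulative bucket counts plus a running sum and count. Dividing the example's sum by its count (0.892 / 1542) gives a mean of roughly 0.58 ms per query. The sketch below shows that calculation and a simplified version of the linear interpolation that histogram_quantile() applies to cumulative buckets. The function names are hypothetical, and the bucket counts in the usage lines other than le="0.001" are invented for illustration.

```python
import math


def mean_seconds(total_sum, total_count):
    """Average observation: the histogram's _sum divided by its _count."""
    return total_sum / total_count if total_count else float("nan")


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    Values are assumed to be spread evenly inside each bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank lies beyond the largest finite bucket (or the bucket is
                # empty): report the previous upper bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts; only le=0.001 (1200) and the total (1542)
# come from the example above.
EXAMPLE_BUCKETS = [
    (0.0001, 100), (0.0005, 800), (0.001, 1200), (0.005, 1500),
    (0.01, 1530), (0.025, 1540), (0.05, 1542), (0.1, 1542),
    (float("inf"), 1542),
]

print(mean_seconds(0.892, 1542))
print(estimate_quantile(0.99, EXAMPLE_BUCKETS))
```

Because the estimate interpolates within a bucket, its precision is limited by the bucket boundaries listed above.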
Health Check Metrics
opengslb_health_check_results_total
Type: Counter
Labels: region, server, result
Total number of health check results.
| Label | Description |
|---|---|
| region | Region name |
| server | Server address and port (e.g., 10.0.1.10:80) |
| result | Check result (e.g., healthy, unhealthy) |
Example:
opengslb_health_check_results_total{region="us-east-1",server="10.0.1.10:80",result="healthy"} 4521
opengslb_health_check_results_total{region="us-east-1",server="10.0.1.10:80",result="unhealthy"} 3
opengslb_health_check_duration_seconds
Type: Histogram
Labels: region, server
Health check duration in seconds.
Buckets: 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5
Example:
opengslb_health_check_duration_seconds_bucket{region="us-east-1",server="10.0.1.10:80",le="0.1"} 4500
opengslb_health_check_duration_seconds_sum{region="us-east-1",server="10.0.1.10:80"} 135.6
opengslb_health_check_duration_seconds_count{region="us-east-1",server="10.0.1.10:80"} 4524
opengslb_healthy_servers
Type: Gauge
Labels: region
Current number of healthy servers per region.
Example:
opengslb_healthy_servers{region="us-east-1"} 3
opengslb_healthy_servers{region="us-west-2"} 2
Routing Metrics
opengslb_routing_decisions_total
Type: Counter
Labels: domain, algorithm, server
Total number of routing decisions made.
| Label | Description |
|---|---|
| domain | The domain being routed |
| algorithm | Routing algorithm used (e.g., round-robin, failover) |
| server | Selected server address and port |
Example:
opengslb_routing_decisions_total{domain="app.example.com",algorithm="round-robin",server="10.0.1.10:80"} 512
opengslb_routing_decisions_total{domain="app.example.com",algorithm="round-robin",server="10.0.1.11:80"} 510
opengslb_routing_decisions_total{domain="critical.example.com",algorithm="failover",server="10.0.2.10:80"} 1000
Configuration Metrics
opengslb_config_reloads_total
Type: Counter
Labels: result
Total number of configuration reload attempts.
| Label | Description |
|---|---|
| result | Reload result (e.g., success, failure) |
Example:
opengslb_config_reloads_total{result="success"} 5
opengslb_config_reloads_total{result="failure"} 1
Use Cases:
Track reload activity
Alert on failed reloads
Correlate reloads with behavior changes
opengslb_config_reload_timestamp_seconds
Type: Gauge
Unix timestamp of the last successful configuration reload.
Example:
opengslb_config_reload_timestamp_seconds 1701504615
Use Cases:
Verify reload was applied
Track time since last reload
Correlate with deployment events
Application Metrics
opengslb_app_info
Type: Gauge
Labels: version
Application version information, carried in the version label. The value is always 1.
Example:
opengslb_app_info{version="1.0.0"} 1
opengslb_config_load_timestamp_seconds
Type: Gauge
Unix timestamp of the initial configuration load at startup.
Example:
opengslb_config_load_timestamp_seconds 1701504000
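Both timestamp gauges hold Unix seconds, so they can be compared directly. The sketch below converts the example values into a readable time and a "seconds since reload" figure. The function name and the returned keys are hypothetical, and it assumes that a reload timestamp later than the load timestamp means a reload has succeeded since startup; what the reload gauge reports before the first reload is not specified here.

```python
from datetime import datetime, timezone


def describe_reload(load_ts, reload_ts, now_ts):
    """Summarize reload state from the two timestamp gauges (Unix seconds)."""
    return {
        "last_reload_utc": datetime.fromtimestamp(reload_ts, tz=timezone.utc).isoformat(),
        "seconds_since_reload": now_ts - reload_ts,
        # Assumption: a reload timestamp after the startup load means at least
        # one successful reload has happened since the process started.
        "reloaded_since_start": reload_ts > load_ts,
    }


# Example values from this document; now_ts is an arbitrary "current time".
print(describe_reload(load_ts=1701504000, reload_ts=1701504615, now_ts=1701505000))
```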
opengslb_configured_domains
Type: Gauge
Number of configured domains.
Example:
opengslb_configured_domains 5
opengslb_configured_servers
Type: Gauge
Total number of configured servers across all regions.
Example:
opengslb_configured_servers 12
Prometheus Configuration
Add OpenGSLB to your Prometheus scrape configuration:
scrape_configs:
  - job_name: 'opengslb'
    static_configs:
      - targets: ['opengslb-host:9090']
    scrape_interval: 15s
Example Queries
Query Rate
rate(opengslb_dns_queries_total[5m])
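To make the rate() query above concrete, the sketch below computes a per-second rate from a series of counter samples, treating any decrease as a counter reset (for example after a restart). This is a simplified model: the real rate() function also extrapolates to the edges of the range window, which this sketch omits. The function name is hypothetical.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp_seconds, counter_value) samples.

    Samples must be ordered by time. A drop in value is treated as a counter
    reset, so the new value counts as the increase since the reset.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        increase += value if value < previous else value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Three scrapes 15 seconds apart, then the same window with a restart in between.
print(simple_rate([(0, 100), (15, 130), (30, 160)]))
print(simple_rate([(0, 100), (15, 5), (30, 35)]))
```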
Query Latency (p99)
histogram_quantile(0.99, rate(opengslb_dns_query_duration_seconds_bucket[5m]))
Error Rate
sum(rate(opengslb_dns_queries_total{status!="success"}[5m]))
/
sum(rate(opengslb_dns_queries_total[5m]))
Healthy Server Ratio
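# Sum across regions first: the per-region gauge and the unlabeled total would otherwise not match
sum without (region) (opengslb_healthy_servers) / opengslb_configured_servers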
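The ratio compares a per-region gauge with a single unlabeled total, so the per-region values have to be added together before dividing. A minimal sketch of that arithmetic, using the example values from this document (3 and 2 healthy servers, 12 configured); the function name is hypothetical:

```python
def healthy_ratio(healthy_by_region, configured_total):
    """Fraction of configured servers that are currently healthy."""
    if configured_total <= 0:
        return None  # nothing configured, so the ratio is undefined
    return sum(healthy_by_region.values()) / configured_total


print(healthy_ratio({"us-east-1": 3, "us-west-2": 2}, 12))
```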
Health Check Failure Rate
rate(opengslb_health_check_results_total{result="unhealthy"}[5m])
Configuration Reload Success Rate
sum(rate(opengslb_config_reloads_total{result="success"}[1h]))
/
sum(rate(opengslb_config_reloads_total[1h]))
Time Since Last Reload
time() - opengslb_config_reload_timestamp_seconds
Routing Distribution by Algorithm
sum by (algorithm) (rate(opengslb_routing_decisions_total[5m]))
Failover Events (Traffic to Non-Primary)
# Rate of failover routing decisions per server; sustained traffic to a non-primary server indicates failover
rate(opengslb_routing_decisions_total{algorithm="failover"}[5m])
Alerting Examples
High Error Rate
- alert: OpenGSLBHighErrorRate
  expr: |
    sum(rate(opengslb_dns_queries_total{status!="success"}[5m]))
    /
    sum(rate(opengslb_dns_queries_total[5m])) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "OpenGSLB error rate above 5%"
No Healthy Servers
- alert: OpenGSLBNoHealthyServers
  expr: opengslb_healthy_servers == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "No healthy servers in region {{ $labels.region }}"
High Query Latency
- alert: OpenGSLBHighLatency
  expr: |
    histogram_quantile(0.99, rate(opengslb_dns_query_duration_seconds_bucket[5m])) > 0.01
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "OpenGSLB p99 latency above 10ms"
Configuration Reload Failed
- alert: OpenGSLBConfigReloadFailed
  expr: increase(opengslb_config_reloads_total{result="failure"}[5m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: "OpenGSLB configuration reload failed"
    description: "A configuration reload attempt failed. Check logs for details."
Failover Active
- alert: OpenGSLBFailoverActive
  expr: |
    opengslb_healthy_servers{region="primary"} == 0
    and ignoring (region) opengslb_healthy_servers{region="secondary"} > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "OpenGSLB failover active - primary region has no healthy servers"
Low Healthy Server Ratio
- alert: OpenGSLBLowHealthyRatio
  expr: |
    sum without (region) (opengslb_healthy_servers) / opengslb_configured_servers < 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Less than 50% of servers are healthy"
Overwatch Metrics (ADR-015)
These metrics are only available in Overwatch mode.
opengslb_overwatch_backends_total
Type: Gauge
Total number of backends in the registry.
Example:
opengslb_overwatch_backends_total 24
opengslb_overwatch_backends_healthy
Type: Gauge
Number of backends with healthy effective status.
Example:
opengslb_overwatch_backends_healthy 22
opengslb_overwatch_agents_registered
Type: Gauge
Number of unique agents currently registered.
Example:
opengslb_overwatch_agents_registered 8
opengslb_overwatch_stale_agents
Type: Gauge
Number of backends marked as stale (no recent heartbeat).
Example:
opengslb_overwatch_stale_agents 2
opengslb_overwatch_overrides_active
Type: Gauge
Number of active manual overrides.
Example:
opengslb_overwatch_overrides_active 1
opengslb_overwatch_validation_total
Type: Counter
Labels: service, result
External validation results.
| Label | Description |
|---|---|
| service | Service name |
| result | Validation result (e.g., healthy, unhealthy) |
Example:
opengslb_overwatch_validation_total{service="web-service",result="healthy"} 450
opengslb_overwatch_validation_total{service="web-service",result="unhealthy"} 12
opengslb_overwatch_veto_total
Type: Counter
Labels: service, reason
Veto events where Overwatch overrode agent health claims.
| Label | Description |
|---|---|
| service | Service name |
| reason | Veto reason (e.g., validation_unhealthy) |
Example:
opengslb_overwatch_veto_total{service="web-service",reason="validation_unhealthy"} 5
Gossip Metrics
opengslb_gossip_messages_received_total
Type: Counter
Labels: type
Total gossip messages received by type.
Example:
opengslb_gossip_messages_received_total{type="heartbeat"} 4521
opengslb_gossip_messages_received_total{type="predictive"} 12
opengslb_gossip_override_operations_total
Type: Counter
Labels: operation
Override operations via API.
| Label | Description |
|---|---|
| operation | Operation type (e.g., set, clear) |
Example:
opengslb_gossip_override_operations_total{operation="set"} 5
opengslb_gossip_override_operations_total{operation="clear"} 3
opengslb_gossip_decryption_failures_total
Type: Counter
Total gossip message decryption failures.
Example:
opengslb_gossip_decryption_failures_total 3
Use Cases:
Monitor for encryption key mismatches
Detect potential security issues with gossip communication
Geolocation Routing Metrics (Sprint 6)
opengslb_routing_geo_decisions_total
Type: Counter
Labels: domain, country, continent, region
Total geolocation routing decisions by location.
| Label | Description |
|---|---|
| domain | The domain being routed |
| country | ISO country code (e.g., “US”, “GB”) |
| continent | Continent code (e.g., “NA”, “EU”) |
| region | Selected region name |
Example:
opengslb_routing_geo_decisions_total{domain="app.example.com",country="US",continent="NA",region="us-east-1"} 1542
opengslb_routing_geo_decisions_total{domain="app.example.com",country="GB",continent="EU",region="eu-west-1"} 523
opengslb_routing_geo_fallback_total
Type: Counter
Labels: domain, reason
Total geolocation routing fallbacks by reason.
| Label | Description |
|---|---|
| domain | The domain being routed |
| reason | Fallback reason (e.g., no_servers_in_region, lookup_failed) |
Example:
opengslb_routing_geo_fallback_total{domain="app.example.com",reason="no_servers_in_region"} 12
opengslb_routing_geo_fallback_total{domain="app.example.com",reason="lookup_failed"} 5
opengslb_routing_geo_custom_hits_total
Type: Counter
Labels: domain, region, cidr
Total custom CIDR mapping matches in geolocation routing.
| Label | Description |
|---|---|
| domain | The domain being routed |
| region | Matched region from custom mapping |
| cidr | The matched CIDR range |
Example:
opengslb_routing_geo_custom_hits_total{domain="app.example.com",region="us-chicago",cidr="10.1.0.0/16"} 450
opengslb_routing_geo_custom_hits_total{domain="app.example.com",region="us-dallas",cidr="10.2.0.0/16"} 230
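A custom CIDR mapping associates client address ranges with regions, and each match increments this counter with the matched region and cidr labels. The sketch below illustrates the general idea with Python's standard ipaddress module. It is not OpenGSLB's implementation: the function name, the mapping structure, and the choice of longest-prefix matching when ranges overlap are all assumptions, and the two ranges are taken from the example above.

```python
import ipaddress

# Hypothetical mapping of CIDR range -> region, mirroring the example labels.
CUSTOM_MAPPINGS = {
    "10.1.0.0/16": "us-chicago",
    "10.2.0.0/16": "us-dallas",
}


def match_custom_region(client_ip, mappings):
    """Return (region, cidr) for the most specific matching range, or None."""
    address = ipaddress.ip_address(client_ip)
    best = None
    for cidr, region in mappings.items():
        network = ipaddress.ip_network(cidr)
        if address.version != network.version or address not in network:
            continue
        # Assumption: prefer the longest prefix when several ranges match.
        if best is None or network.prefixlen > best[0].prefixlen:
            best = (network, region)
    return None if best is None else (best[1], str(best[0]))


print(match_custom_region("10.1.42.7", CUSTOM_MAPPINGS))
print(match_custom_region("192.0.2.1", CUSTOM_MAPPINGS))
```

A None result corresponds to the case where no custom range applies and routing would have to rely on another lookup.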
Latency Routing Metrics (Sprint 6)
opengslb_routing_latency_selected_ms
Type: Gauge
Labels: domain, server
Smoothed latency in milliseconds of the selected server for latency-based routing.
Example:
opengslb_routing_latency_selected_ms{domain="perf-critical.example.com",server="10.0.1.10:8080"} 45.5
opengslb_routing_latency_rejected_total
Type: Counter
Labels: domain, server, reason
Total servers rejected due to latency threshold or insufficient data.
| Label | Description |
|---|---|
| domain | The domain being routed |
| server | Server address (host:port) |
| reason | Rejection reason (e.g., above_threshold, no_data) |
Example:
opengslb_routing_latency_rejected_total{domain="perf-critical.example.com",server="10.0.2.10:8080",reason="above_threshold"} 5
opengslb_routing_latency_rejected_total{domain="perf-critical.example.com",server="10.0.3.10:8080",reason="no_data"} 12
opengslb_routing_latency_fallback_total
Type: Counter
Labels: domain, reason
Total fallbacks to round-robin when latency data is unavailable.
| Label | Description |
|---|---|
| domain | The domain being routed |
| reason | Fallback reason (e.g., no_latency_data) |
Example:
opengslb_routing_latency_fallback_total{domain="perf-critical.example.com",reason="no_latency_data"} 3
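The three latency routing metrics above describe one decision: pick a server by latency, count the servers that were rejected, and count a fallback when no server qualifies. The sketch below shows one way such a decision could be structured so that each outcome maps to a metric. It is an illustration only, not OpenGSLB's routing code: the function name, the max_latency_ms and min_samples parameters, and the input structure are hypothetical, while the reason strings come from the examples above.

```python
def select_by_latency(servers, max_latency_ms, min_samples=1):
    """Pick the lowest-latency eligible server.

    `servers` maps address -> (smoothed_latency_ms or None, sample_count).
    Returns (selected_address or None, {address: rejection_reason}).
    A None selection means the caller should fall back, e.g. to round-robin.
    """
    rejected = {}
    candidates = {}
    for address, (latency_ms, samples) in servers.items():
        if latency_ms is None or samples < min_samples:
            rejected[address] = "no_data"
        elif latency_ms > max_latency_ms:
            rejected[address] = "above_threshold"
        else:
            candidates[address] = latency_ms
    if not candidates:
        return None, rejected
    return min(candidates, key=candidates.get), rejected


# Hypothetical measurements for three servers.
SERVERS = {
    "10.0.1.10:8080": (45.5, 150),
    "10.0.2.10:8080": (250.0, 80),
    "10.0.3.10:8080": (None, 0),
}

print(select_by_latency(SERVERS, max_latency_ms=200))
```

In this model each entry in the rejected mapping would increment opengslb_routing_latency_rejected_total, and a None selection would increment opengslb_routing_latency_fallback_total.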
opengslb_backend_smoothed_latency_ms
Type: Gauge
Labels: service, address
Current smoothed (EMA) latency in milliseconds for each backend.
Example:
opengslb_backend_smoothed_latency_ms{service="myapp",address="10.0.1.10:8080"} 45.5
opengslb_backend_smoothed_latency_ms{service="myapp",address="10.0.1.11:8080"} 52.3
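An exponential moving average (EMA) blends each new sample with the previous smoothed value, so a single slow probe shifts the gauge only partially. The sketch below shows the standard EMA update rule. The smoothing factor alpha=0.3 and the function name are assumptions for illustration; the factor OpenGSLB actually uses is not stated in this document.

```python
def ema_update(previous, sample, alpha=0.3):
    """Return the new smoothed value after observing `sample`.

    `alpha` (0 < alpha <= 1) is a hypothetical smoothing factor: higher values
    react faster to new samples, lower values smooth more heavily.
    """
    if previous is None:
        return sample  # the first sample seeds the average
    return alpha * sample + (1 - alpha) * previous


# Feed a few latency samples (ms) through the update rule.
smoothed = None
for sample_ms in [40.0, 42.0, 120.0, 41.0]:
    smoothed = ema_update(smoothed, sample_ms)
print(smoothed)
```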
opengslb_backend_latency_samples
Type: Gauge
Labels: service, address
Number of latency samples collected for each backend.
Example:
opengslb_backend_latency_samples{service="myapp",address="10.0.1.10:8080"} 150
Per-Agent Connectivity Metrics (Sprint 6)
opengslb_agent_connected
Type: Gauge
Labels: agent_id, region
Agent connection status (1=connected, 0=disconnected).
Example:
opengslb_agent_connected{agent_id="agent-1",region="us-east-1"} 1
opengslb_agent_connected{agent_id="agent-2",region="eu-west-1"} 0
opengslb_agent_heartbeat_age_seconds
Type: Gauge
Labels: agent_id
Seconds since last heartbeat per agent.
Example:
opengslb_agent_heartbeat_age_seconds{agent_id="agent-1"} 5.2
opengslb_agent_heartbeat_age_seconds{agent_id="agent-2"} 45.8
opengslb_agent_backends_registered_per_agent
Type: Gauge
Labels: agent_id
Number of backends registered by each agent.
Example:
opengslb_agent_backends_registered_per_agent{agent_id="agent-1"} 4
opengslb_agent_backends_registered_per_agent{agent_id="agent-2"} 2
opengslb_agent_stale_events_total
Type: Counter
Labels: agent_id
Total stale events per agent.
Example:
opengslb_agent_stale_events_total{agent_id="agent-1"} 2
Override Metrics with Service Granularity (Sprint 6)
opengslb_overrides_active
Type: Gauge
Labels: service
Number of active overrides per service.
Example:
opengslb_overrides_active{service="myapp"} 1
opengslb_overrides_active{service="otherapp"} 0
opengslb_overrides_changes_total
Type: Counter
Labels: service, action
Total override changes by service and action.
| Label | Description |
|---|---|
| service | Service name |
| action | Action type (e.g., set, clear) |
Example:
opengslb_overrides_changes_total{service="myapp",action="set"} 5
opengslb_overrides_changes_total{service="myapp",action="clear"} 3
Enhanced DNSSEC Metrics (Sprint 6)
opengslb_dnssec_signatures_total
Type: Counter
Labels: zone
Total DNSSEC signatures generated per zone.
Example:
opengslb_dnssec_signatures_total{zone="gslb.example.com"} 15420
opengslb_dnssec_key_age_by_zone_seconds
Type: Gauge
Labels: zone, key_tag
Age of DNSSEC signing keys in seconds, per zone and key tag.
| Label | Description |
|---|---|
| zone | DNS zone name |
| key_tag | DNSSEC key tag identifier |
Example:
opengslb_dnssec_key_age_by_zone_seconds{zone="gslb.example.com",key_tag="12345"} 86400
Overwatch Alerting Examples
No Registered Agents
- alert: OpenGSLBNoAgents
  expr: opengslb_overwatch_agents_registered == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No agents registered with Overwatch"
High Stale Backend Count
- alert: OpenGSLBHighStaleBackends
  expr: |
    opengslb_overwatch_stale_agents / opengslb_overwatch_backends_total > 0.2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "More than 20% of backends are stale"
Validation Disagreement Rate
- alert: OpenGSLBHighVetoRate
  expr: |
    rate(opengslb_overwatch_veto_total[5m]) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High rate of validation vetoes - agent health claims being overridden"
Sprint 6 Example Queries
Geolocation Traffic Distribution
# Traffic distribution by region
sum by (region) (rate(opengslb_routing_geo_decisions_total[5m]))
# Traffic from custom CIDR mappings
sum by (region, cidr) (rate(opengslb_routing_geo_custom_hits_total[5m]))
# Geolocation fallback rate
sum(rate(opengslb_routing_geo_fallback_total[5m])) / sum(rate(opengslb_routing_geo_decisions_total[5m]))
Latency Routing Analysis
# Average selected latency by domain
avg by (domain) (opengslb_routing_latency_selected_ms)
# Servers frequently rejected due to high latency
topk(5, sum by (server) (rate(opengslb_routing_latency_rejected_total{reason="above_threshold"}[1h])))
# Latency routing fallback rate
sum by (domain) (rate(opengslb_routing_latency_fallback_total[5m])) / sum by (domain) (rate(opengslb_routing_decisions_total{algorithm="latency"}[5m]))
Agent Health Monitoring
# Agents not connected
opengslb_agent_connected == 0
# Agents with stale heartbeats (>30s)
opengslb_agent_heartbeat_age_seconds > 30
# Stale events by agent
rate(opengslb_agent_stale_events_total[1h])
Override Activity
# Current override count by service
opengslb_overrides_active
# Override change rate
sum by (service, action) (rate(opengslb_overrides_changes_total[1h]))
Sprint 6 Alerting Examples
High Geolocation Fallback Rate
- alert: OpenGSLBHighGeoFallbackRate
  expr: |
    sum(rate(opengslb_routing_geo_fallback_total[5m])) /
    sum(rate(opengslb_routing_geo_decisions_total[5m])) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Geolocation routing fallback rate above 10%"
    description: "Many geolocation lookups are failing or falling back to default."
Agent Heartbeat Stale
- alert: OpenGSLBAgentStale
  expr: opengslb_agent_heartbeat_age_seconds > 60
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Agent {{ $labels.agent_id }} heartbeat stale"
    description: "No heartbeat received from agent for over 60 seconds."
High Latency Server Selection
- alert: OpenGSLBHighLatencySelected
  expr: opengslb_routing_latency_selected_ms > 200
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Selected servers have high latency for {{ $labels.domain }}"
    description: "Latency-based routing is selecting servers with >200ms latency."
Gossip Decryption Failures
- alert: OpenGSLBGossipDecryptionFailures
  expr: increase(opengslb_gossip_decryption_failures_total[5m]) > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Gossip message decryption failures detected"
    description: "This may indicate encryption key mismatches between nodes."
Metric Cardinality
Be aware of metric cardinality when configuring monitoring:
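| Metric | Cardinality Factors |
|---|---|
| opengslb_dns_queries_total | domains × query_types × status |
| opengslb_routing_decisions_total | domains × algorithms × servers |
| opengslb_health_check_results_total | regions × servers × results |
| opengslb_routing_geo_decisions_total | domains × countries × continents × regions |
| opengslb_routing_geo_custom_hits_total | domains × regions × cidrs |
| opengslb_routing_latency_rejected_total | domains × servers × reasons |
| opengslb_agent_connected | agents × regions |
| opengslb_overrides_changes_total | services × actions |
| opengslb_dnssec_key_age_by_zone_seconds | zones × key_tags |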
For large deployments with many domains or servers, consider:
Aggregating by region instead of individual servers
Using recording rules to pre-aggregate high-cardinality metrics
Limiting label values in Prometheus configuration
Monitoring the cardinality of the geolocation metrics, which grows with the number of country/continent combinations
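The cardinality factors listed above multiply, which gives a quick upper bound on the number of time series a metric can produce. The sketch below computes that bound for a hypothetical deployment; the function name and all counts are invented for illustration. The product is a worst case: for the geolocation counter the real number is lower, since each country belongs to a single continent.

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical deployment sizes, not measurements.
dns_queries = {"domains": 5, "query_types": 2, "status": 3}
geo_decisions = {"domains": 5, "countries": 50, "continents": 6, "regions": 4}

print(max_series(dns_queries))
print(max_series(geo_decisions))
```

Comparing such bounds across metrics can help decide where a recording rule or a coarser label is worth the effort.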