OpenGSLB Troubleshooting Guide
This guide covers common issues and their solutions for Agent and Overwatch mode deployments.
DNS Issues
DNS Queries Return SERVFAIL
Symptoms:
dig returns status: SERVFAIL
No IP addresses in DNS response
Possible Causes:
All backend servers unhealthy
# Check server health status
curl http://localhost:8080/api/v1/health/servers | jq '.servers[] | select(.healthy == false)'
Solution: Verify backend servers are running and health check endpoints are accessible.
Health checks not yet completed
# Check readiness
curl http://localhost:8080/api/v1/ready
Solution: Wait for initial health checks to complete (typically 5-10 seconds after startup).
Configuration validation error
# Check logs for validation errors
journalctl -u opengslb | grep -i "validation\|error"
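Instead of waiting a fixed time after startup, the readiness check can be polled until it passes. The following is a minimal sketch: `retry_until` is a hypothetical helper, not part of OpenGSLB, and the usage line assumes the readiness endpoint returns a non-2xx status until the initial health checks have completed.

```shell
# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only when another attempt remains
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (assumption: the endpoint answers non-2xx until ready, so curl -f fails):
# retry_until 15 1 curl -fsS -o /dev/null http://localhost:8080/api/v1/ready
```

With 15 attempts one second apart, this covers the typical 5-10 second startup window with some margin.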
DNS Queries Return NXDOMAIN
Symptoms:
dig returns status: NXDOMAIN
Domain exists in configuration but does not resolve
Possible Causes:
Domain not configured
# List configured domains
curl http://localhost:8080/api/v1/health/servers | jq '.servers[].domain' | sort -u
Domain name mismatch (trailing dot, case sensitivity)
DNS queries include a trailing dot: app.example.com.
Configuration should not include a trailing dot: app.example.com
No servers in domain’s regions
Check that the domain references valid regions with servers
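To rule out a trailing-dot or letter-case mismatch, compare the queried name and the configured name after normalizing both. This is a small sketch: `normalize_domain` and `same_domain` are hypothetical helpers, and how OpenGSLB itself compares names is not shown here.

```shell
# Hypothetical helper: lowercase a DNS name and strip one trailing dot.
normalize_domain() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//'
}

# Succeeds when two names are equal after normalization.
same_domain() {
  [ "$(normalize_domain "$1")" = "$(normalize_domain "$2")" ]
}

# Usage:
# same_domain "App.Example.com." "app.example.com" && echo "names match"
```

If the names match only after normalization, adjust the configured name to the lowercase form without a trailing dot.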
High DNS Query Latency
Symptoms:
DNS responses take >100ms
Timeouts on client side
Possible Causes:
Health check blocking - DNS handler waiting for health status
Router algorithm inefficiency - Complex routing taking too long
Resource exhaustion - CPU/memory issues
Solutions:
# Check DNS query duration metrics
curl http://localhost:9090/metrics | grep opengslb_dns_query_duration
# Check system resources
top -p $(pgrep opengslb)
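Latency can also be measured from the client side by reading the "Query time" line that dig prints. In this sketch, `query_time_ms` and `is_slow` are hypothetical helpers, and the server address and port in the usage line are placeholders for your `dns.listen_address`.

```shell
# Hypothetical helper: extract the query time in milliseconds from dig output on stdin.
# dig prints a line of the form ";; Query time: 12 msec".
query_time_ms() {
  awk '/Query time:/ { print $4 }'
}

# Succeeds when the given time exceeds the threshold (default 100 ms).
is_slow() {
  [ "$1" -gt "${2:-100}" ]
}

# Usage (address and port are assumptions):
# ms=$(dig @127.0.0.1 -p 53 app.example.com | query_time_ms)
# is_slow "$ms" && echo "slow response: ${ms} ms"
```

Running the query several times helps separate a consistently slow server from a single slow response.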
Health Check Issues
All Servers Marked Unhealthy
Symptoms:
All servers show healthy: false
DNS returns SERVFAIL
Diagnostic Steps:
Verify backend connectivity:
# From OpenGSLB host
curl -v http://backend-ip:port/health
Check health check configuration:
health_check:
  type: http
  path: /health    # Correct path?
  timeout: 5s      # Sufficient timeout?
Review health check logs:
journalctl -u opengslb | grep -i "health check failed"
Health Check Flapping
Symptoms:
Server rapidly alternates between healthy/unhealthy
Frequent log entries about state changes
Possible Causes:
Timeout too aggressive
health_check:
  timeout: 10s  # Increase from default
Thresholds too low
health_check:
  failure_threshold: 3  # Require 3 failures before marking unhealthy
  success_threshold: 2  # Require 2 successes before marking healthy
Backend intermittently failing - Check backend logs
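To see why higher thresholds reduce flapping, the sketch below counts state changes for a sequence of check results under a generic consecutive-count model. `count_state_changes` is a hypothetical helper, and the assumption that OpenGSLB counts consecutive results in exactly this way is not confirmed by this guide.

```shell
# count_state_changes FAILURE_THRESHOLD SUCCESS_THRESHOLD RESULT...
# Hypothetical model: each RESULT is "ok" or "fail"; the server starts healthy.
# Prints how many times the state flips between healthy and unhealthy.
count_state_changes() {
  failure_threshold=$1
  success_threshold=$2
  shift 2
  state=healthy
  fails=0
  oks=0
  changes=0
  for result in "$@"; do
    # Track consecutive results; any opposite result resets the other counter
    if [ "$result" = ok ]; then
      oks=$((oks + 1)); fails=0
    else
      fails=$((fails + 1)); oks=0
    fi
    if [ "$state" = healthy ] && [ "$fails" -ge "$failure_threshold" ]; then
      state=unhealthy; changes=$((changes + 1))
    elif [ "$state" = unhealthy ] && [ "$oks" -ge "$success_threshold" ]; then
      state=healthy; changes=$((changes + 1))
    fi
  done
  echo "$changes"
}

# Usage: an alternating backend under two threshold settings
# count_state_changes 1 1 ok fail ok fail ok
# count_state_changes 3 2 ok fail ok fail ok
```

With thresholds of 1 and 1, every isolated failure flips the state. With 3 and 2, the same alternating sequence never reaches three consecutive failures, so the server stays healthy.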
TCP Health Checks Failing
Symptoms:
TCP health checks report connection refused
Backend service is running
Possible Causes:
Firewall blocking connections
# Test connectivity
nc -zv backend-ip port
Service not listening on expected interface
# On backend server
ss -tlnp | grep :port
Connection limit reached on backend
Configuration Issues
Configuration Reload Failed
Symptoms:
SIGHUP sent but configuration unchanged
Log shows “configuration reload failed”
Solutions:
Validate configuration before reload:
opengslb --config /etc/opengslb/config.yaml --validate
Check for syntax errors:
# YAML syntax validation
python3 -c "import yaml; yaml.safe_load(open('/etc/opengslb/config.yaml'))"
Review reload logs:
journalctl -u opengslb | grep -i reload
Configuration Changes Not Taking Effect
Symptoms:
Configuration file updated but behavior unchanged
Solutions:
Send SIGHUP to reload:
sudo systemctl reload opengslb
# or
sudo kill -SIGHUP $(pgrep opengslb)
Some changes require restart:
dns.listen_address
mode (agent/overwatch)
overwatch.gossip.bind_address
agent.gossip.overwatch_nodes
Verify reload was successful:
curl http://localhost:9090/metrics | grep opengslb_config_reload
Agent Mode Issues
Agent Not Sending Heartbeats
Symptoms:
Agent logs show no heartbeat activity
Overwatch shows backends as stale
Possible Causes:
Gossip not configured
agent:
  gossip:
    encryption_key: "your-32-byte-base64-key"
    overwatch_nodes:
      - "overwatch-1.internal:7946"
Network connectivity to Overwatch
# Test gossip port connectivity
nc -zv overwatch-ip 7946
Encryption key mismatch
Agent and Overwatch must use identical encryption_key values
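A malformed key is easy to miss by eye. The sketch below prints how many bytes a base64 key decodes to. `key_byte_length` is a hypothetical helper, and the expectation of 32 bytes is inferred from the placeholder in the example configuration rather than from a documented requirement.

```shell
# Hypothetical helper: print the number of bytes a base64 string decodes to.
# Uses "base64 -d" as provided by GNU coreutils.
key_byte_length() {
  printf '%s' "$1" | base64 -d | wc -c | tr -d ' '
}

# Usage (assumption: the key should decode to 32 bytes):
# [ "$(key_byte_length "$ENCRYPTION_KEY")" = "32" ] || echo "unexpected key length"
```

Running the same check on the agent and on Overwatch, then comparing the key values themselves over a secure channel, helps confirm that both sides hold the same well-formed key.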
Solutions:
# Check agent gossip metrics
curl http://localhost:9090/metrics | grep opengslb_gossip
# Review agent logs
journalctl -u opengslb | grep -i "gossip\|heartbeat"
Agent Health Checks Not Running
Symptoms:
No health check metrics for agent backends
Backend status unknown
Possible Causes:
Backends not configured
agent:
  backends:
    - service: "my-service"
      address: "127.0.0.1"
      port: 8080
      health_check:
        type: http
        path: /health
Health check interval too long
agent:
  backends:
    - service: "my-service"
      health_check:
        interval: 10s  # Default may be longer
Solutions:
# Check health check status
curl http://localhost:9090/metrics | grep opengslb_health_check
# Review agent backend status
journalctl -u opengslb | grep -i "backend\|health"
Agent Certificate Issues
Symptoms:
Agent fails to start with certificate errors
Gossip connection rejected
Possible Causes:
Certificate paths incorrect
agent:
  identity:
    cert_path: /var/lib/opengslb/agent.crt
    key_path: /var/lib/opengslb/agent.key
Certificate permissions
# Check file permissions
ls -la /var/lib/opengslb/agent.*
# Fix permissions
chmod 600 /var/lib/opengslb/agent.key
chmod 644 /var/lib/opengslb/agent.crt
Certificate not trusted by Overwatch
Ensure agent tokens are configured on Overwatch
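The permission check can be scripted so that it fails loudly when a mode is wrong. This is a sketch only: `has_mode` is a hypothetical helper and relies on GNU `stat -c`, which is not available on BSD or macOS.

```shell
# Hypothetical helper: succeeds when FILE has exactly the given octal mode.
# Relies on GNU stat ("stat -c '%a'").
has_mode() {
  [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Usage:
# has_mode /var/lib/opengslb/agent.key 600 || echo "agent.key should be mode 600"
# has_mode /var/lib/opengslb/agent.crt 644 || echo "agent.crt should be mode 644"
```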
Overwatch Mode Issues
Backends Not Registering
Symptoms:
Overwatch shows no backends
Agent heartbeats not received
Possible Causes:
Gossip port not accessible
# Check gossip listener
ss -tlnp | grep 7946
# Test from agent
nc -zv overwatch-ip 7946
Agent tokens not configured
overwatch:
  agent_tokens:
    my-service: "service-token-here"
Encryption key mismatch
All agents and Overwatch nodes must use the same key
Solutions:
# Check Overwatch backend registry
curl http://localhost:8080/api/v1/overwatch/backends | jq .
# Review gossip logs
journalctl -u opengslb | grep -i gossip
External Validation Disagreements
Symptoms:
Overwatch validation disagrees with agent claims
Backend marked unhealthy despite agent claiming healthy
This is expected behavior. Per ADR-015, Overwatch validation ALWAYS wins over agent claims.
Investigate disagreements:
# Check validation status
curl http://localhost:8080/api/v1/overwatch/backends | jq '.backends[] | select(.validation_healthy != .agent_healthy)'
# Check validation metrics
curl http://localhost:9090/metrics | grep opengslb_overwatch_validation
# Review disagreement logs
journalctl -u opengslb | grep -i "disagrees with agent"
Possible causes of disagreement:
Network path differences - Overwatch can’t reach backend that agent can
Different health check configuration - Agent and Overwatch using different paths/ports
Intermittent failures - Backend flapping, caught at different times
Backends Going Stale
Symptoms:
Backends marked with stale status
agent_last_seen timestamp is old
Possible Causes:
Agent stopped or crashed
# Check agent status on backend server
systemctl status opengslb
Network partition between agent and Overwatch
# Test connectivity from agent to Overwatch
nc -zv overwatch-ip 7946
Stale threshold too aggressive
overwatch:
  stale:
    threshold: 30s      # Time before marking stale
    remove_after: 5m    # Time before removing
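The two settings can be read as cut-offs on the time since the agent was last seen. The sketch below classifies a backend by that age. `classify_backend` is a hypothetical helper; both the inclusive boundaries and the assumption that `remove_after` is measured from the last heartbeat (not from the moment the backend became stale) are illustrative.

```shell
# classify_backend AGE_SECONDS [THRESHOLD_SECONDS] [REMOVE_AFTER_SECONDS]
# Hypothetical model; defaults mirror the example values (30s and 5m).
classify_backend() {
  age=$1
  threshold=${2:-30}
  remove_after=${3:-300}
  if [ "$age" -ge "$remove_after" ]; then
    echo removed
  elif [ "$age" -ge "$threshold" ]; then
    echo stale
  else
    echo fresh
  fi
}

# Usage: an agent last seen 45 seconds ago
# classify_backend 45       # stale with a 30s threshold
# classify_backend 45 60    # fresh with a 60s threshold
```

If backends go stale during brief network hiccups, raising the threshold above the longest expected heartbeat gap is the usual adjustment.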
Solutions:
# Check stale backends
curl "http://localhost:8080/api/v1/overwatch/backends?status=stale" | jq .
# Check stale metrics
curl http://localhost:9090/metrics | grep opengslb_overwatch_stale
Manual Override Not Working
Symptoms:
Override API returns success but backend status unchanged
Override not persisting
Diagnostic Steps:
Verify override was set:
curl http://localhost:8080/api/v1/overwatch/backends | jq '.backends[] | select(.override_status != null)'
Check override takes precedence:
Override > Validation > Staleness > Agent claim
Backend must not be stale for override to show in effective status
Review persistence:
# Check if bbolt store is working
journalctl -u opengslb | grep -i "store\|persist"
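The precedence order listed above can be sketched as a simple chain of checks. `effective_status` is a hypothetical helper that models only the stated ordering (Override > Validation > Staleness > Agent claim). It does not model the separate note that a stale backend does not show its override in the effective status, and it is not OpenGSLB's actual implementation.

```shell
# effective_status OVERRIDE VALIDATION STALE AGENT_CLAIM
# Hypothetical model of the stated precedence order.
# OVERRIDE and VALIDATION are "healthy", "unhealthy", or "" (not set).
# STALE is "true" or "false". AGENT_CLAIM is "healthy" or "unhealthy".
effective_status() {
  override=$1
  validation=$2
  stale=$3
  agent=$4
  if [ -n "$override" ]; then
    echo "$override"
  elif [ -n "$validation" ]; then
    echo "$validation"
  elif [ "$stale" = true ]; then
    echo stale
  else
    echo "$agent"
  fi
}

# Usage: override forces healthy even though validation failed
# effective_status healthy unhealthy false unhealthy
```

When an override appears to have no effect, walking through the inputs in this order shows which source is actually deciding the status.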
Setting an override:
# Force backend healthy
curl -X POST http://localhost:8080/api/v1/overwatch/backends/my-service/10.0.1.10/80/override \
-H "Content-Type: application/json" \
-H "X-User: admin" \
-d '{"healthy": true, "reason": "maintenance bypass"}'
# Clear override
curl -X DELETE http://localhost:8080/api/v1/overwatch/backends/my-service/10.0.1.10/80/override
Performance Issues
High Memory Usage
Symptoms:
Memory usage >500MB for small configurations
OOM kills
Solutions:
Check for goroutine leaks:
curl "http://localhost:9090/debug/pprof/goroutine?debug=2"
Reduce health check frequency:
health_check:
  interval: 30s  # Increase from default
Limit concurrent checks:
Consider reducing number of monitored servers
High CPU Usage
Symptoms:
CPU >50% under normal load
Slow DNS responses
Solutions:
Profile the application:
curl "http://localhost:9090/debug/pprof/profile?seconds=30" > cpu.prof
go tool pprof cpu.prof
Check for excessive logging:
logging:
  level: info  # Avoid "debug" in production
Review routing algorithm:
Weighted routing is slightly more CPU-intensive than round-robin
Logging and Debugging
Enable Debug Logging
Temporarily increase log verbosity:
logging:
  level: debug
  format: json
Or via environment variable:
OPENGSLB_LOG_LEVEL=debug opengslb --config config.yaml
View Real-Time Logs
# systemd
journalctl -u opengslb -f
# Docker
docker logs -f opengslb
# Direct
./opengslb --config config.yaml 2>&1 | tee opengslb.log
Export Diagnostic Information
For support requests, gather:
# System info
uname -a
cat /etc/os-release
# OpenGSLB version
opengslb --version
# Configuration (sanitize secrets)
grep -vE 'key|token' /etc/opengslb/config.yaml
# Metrics snapshot
curl http://localhost:9090/metrics > metrics.txt
# Recent logs
journalctl -u opengslb --since "1 hour ago" > logs.txt
# Overwatch status (if applicable)
curl http://localhost:8080/api/v1/overwatch/backends > backends.json
curl http://localhost:8080/api/v1/overwatch/stats > stats.json
curl http://localhost:8080/api/v1/health/servers > health-status.json
Getting Help
If you cannot resolve an issue:
Search existing issues: https://github.com/LoganRossUS/OpenGSLB/issues
Create a new issue with diagnostic information above
Community support: Discussions
For commercial support: licensing@opengslb.org