Scenario: All Backends Unhealthy
Symptoms
Alert: OpenGSLBNoHealthyServers
DNS queries return SERVFAIL (if return_last_healthy: false)
DNS queries return stale IPs (if return_last_healthy: true)
Metric: opengslb_overwatch_backends_healthy == 0
User reports: Cannot reach application
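To confirm the DNS symptom from a client, the response code can be read from the header line that dig prints. The following is a minimal sketch, not part of OpenGSLB: the helper name dns_status, the record name myapp.example.com, and the server address 127.0.0.1 are placeholders assumed for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the DNS response code (NOERROR, SERVFAIL, ...)
# from dig output supplied on stdin.
dns_status() {
  sed -n 's/^;; ->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p'
}

# Usage (record name and server address are placeholders):
#   dig @127.0.0.1 myapp.example.com A | dns_status
```

A result of SERVFAIL matches the return_last_healthy: false case. NOERROR with answers while the metric reads 0 suggests stale IPs are being served.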
Impact
Severity: SEV1
User Impact: Complete service unavailability (unless return_last_healthy: true)
Duration: Until at least one backend recovers
Diagnosis
Step 1: Verify All Backends Are Indeed Unhealthy
# List all backends with health status
opengslb-cli servers --api http://localhost:9090
# Check API directly
curl http://localhost:9090/api/v1/overwatch/backends | jq '.backends[] | {service, address, effective_status}'
Step 2: Determine Why Backends Are Unhealthy
Check the source of unhealthy status:
# Get detailed backend info
curl http://localhost:9090/api/v1/overwatch/backends | jq '.backends[] | {
  service,
  address,
  agent_healthy,
  validation_healthy,
  validation_error,
  override_status
}'
Possible causes:
| Source | Indicator | Likely Cause |
|---|---|---|
| | Agent reporting unhealthy | Backend service down |
| | Overwatch validation failed | Network issue or service down |
| | Manual override | Someone marked unhealthy |
| All stale | | All agents disconnected |
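The mapping from the fields returned by the Step 2 query to a likely cause can be sketched as a small shell function. This is an illustration only. The helper name likely_cause is hypothetical, and the value formats are assumptions the source does not confirm: agent_healthy and validation_healthy are taken to be true or false, and override_status is taken to be null when no override is set.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map Step 2 fields to a likely cause.
# Assumed value formats (not confirmed by the source):
#   $1 agent_healthy       "true" or "false"
#   $2 validation_healthy  "true" or "false"
#   $3 override_status     "null" (or omitted) when no override is set
likely_cause() {
  local agent_healthy="$1" validation_healthy="$2" override_status="${3:-null}"
  if [ "$override_status" != "null" ]; then
    echo "manual override set: check who marked the backend"
  elif [ "$agent_healthy" = "false" ]; then
    echo "agent reports unhealthy: backend service likely down"
  elif [ "$validation_healthy" = "false" ]; then
    echo "Overwatch validation failed: network issue or service down"
  else
    echo "no unhealthy source found: check for stale agents"
  fi
}

# Usage:
#   likely_cause false true null
```

The override check comes first on the assumption that a manual override takes precedence over agent and validation results.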
Step 3: Check Backend Services Directly
# Test backends directly (bypass Overwatch)
for ip in 10.0.1.10 10.0.1.11 10.0.1.12; do
  echo "=== $ip ==="
  # -m 5 caps each request at 5 seconds; \n keeps each status code on its own line
  curl -s -m 5 -o /dev/null -w "%{http_code}\n" "http://${ip}:8080/health"
done
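The status codes printed by the loop can be interpreted with a small helper. This is a sketch under two assumptions: the helper name classify_http_code is hypothetical, and the /health endpoint is assumed to answer with a 2xx code when the backend is healthy. curl prints 000 when no HTTP response was received at all, for example when the connection is refused or times out.

```bash
#!/usr/bin/env bash
# Hypothetical helper: interpret the code printed by curl -w "%{http_code}".
classify_http_code() {
  case "$1" in
    2??) echo "healthy" ;;      # assumption: any 2xx means healthy
    000) echo "unreachable" ;;  # curl received no HTTP response
    *)   echo "unhealthy" ;;    # the service answered, but with an error status
  esac
}

# Usage, with one of the example IPs from the loop:
#   code="$(curl -s -m 5 -o /dev/null -w '%{http_code}' http://10.0.1.10:8080/health)"
#   echo "10.0.1.10: $(classify_http_code "$code")"
```

An "unreachable" result points toward Step 4 (network), while "unhealthy" points toward the backend service itself.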
Step 4: Check Network Connectivity
# From Overwatch to backends
for ip in 10.0.1.10 10.0.1.11 10.0.1.12; do
  echo "=== $ip ==="
  # -w 5 gives up after 5 seconds instead of hanging on a dropped connection
  nc -zv -w 5 "$ip" 8080
done
Step 5: Check Agent Status
# Are agents sending heartbeats?
curl http://localhost:9090/api/v1/overwatch/stats | jq '{
  active_agents,
  stale_backends
}'
# List agents
curl http://localhost:9090/api/v1/overwatch/agents | jq '.agents[].agent_id'
Resolution
If Backends Are Actually Down
This is expected behavior: Overwatch is correctly reporting the backends as unhealthy.
Fix the backend services (application team responsibility).
Once the backends are healthy again, Overwatch will detect the recovery automatically.
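While waiting for recovery, the backends endpoint can be polled instead of checked by hand. The following is a minimal sketch: wait_until and any_backend_healthy are hypothetical helper names, and the check assumes a recovered backend reports effective_status "healthy", which the source does not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Sleep between attempts, but not after the last one
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Assumption: a recovered backend has effective_status "healthy".
any_backend_healthy() {
  curl -s http://localhost:9090/api/v1/overwatch/backends |
    jq -e 'any(.backends[]; .effective_status == "healthy")' >/dev/null
}

# Usage: poll every 10 seconds, up to 30 attempts.
#   wait_until 30 10 any_backend_healthy && echo "at least one backend recovered"
```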
If Network Partition (Overwatch Can’t Reach Backends)
# Check firewall rules
sudo iptables -L -n
# Check routing
ip route get 10.0.1.10
# Check from Overwatch directly
curl http://10.0.1.10:8080/health
Fix network issue to restore connectivity.
If Overrides Are Blocking
# List active overrides
opengslb-cli overrides list --api http://localhost:9090
# Clear overrides if appropriate
opengslb-cli overrides clear myapp 10.0.1.10:8080 --api http://localhost:9090
If All Agents Disconnected
# Check gossip connectivity
nc -zv overwatch 7946
# On agent servers, check agent status
journalctl -u opengslb-agent -n 50
# Check encryption key matches
# (Compare between agent and overwatch configs)
Emergency: Force Traffic to Specific Backend
If you know a backend is healthy but validation fails:
# Override to force healthy
opengslb-cli overrides set myapp 10.0.1.10:8080 \
  --healthy=true \
  --reason="Emergency override - backend verified healthy manually" \
  --api http://localhost:9090
Warning: This bypasses health checking. Use only in emergencies.
Emergency: Enable return_last_healthy
If DNS must return something:
# In config
dns:
  return_last_healthy: true  # Return last known healthy IPs when all unhealthy

# Reload Overwatch to apply the change
sudo systemctl reload opengslb-overwatch
Prevention
Multiple backends: Always have N+1 capacity
Multi-region: Distribute backends across regions
Health check tuning: Ensure thresholds aren’t too aggressive
Monitoring: Alert before all backends are unhealthy
Graceful degradation: Use return_last_healthy: true if appropriate
Alerts to Add
- alert: OpenGSLBFewHealthyBackends
  expr: opengslb_overwatch_backends_healthy < 2
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Only {{ $value }} healthy backends remaining"
- alert: OpenGSLBNoHealthyBackends
  expr: opengslb_overwatch_backends_healthy == 0
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "No healthy backends - service unavailable"