Scenario: Overwatch Down
Symptoms
Alert: OverwatchDown
DNS queries to a specific Overwatch time out
Metrics endpoint not responding
systemd service failed/stopped
Container not running (Docker deployments)
Impact
Severity: SEV2 (single node), SEV1 (all nodes)
User Impact:
Single Overwatch: Clients retry against the other Overwatches
All Overwatches: DNS service unavailable
HA Note: With proper client configuration, a single Overwatch failure has minimal impact
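When checking the failover behaviour by hand, it can help to mimic what a well-configured client does: try each Overwatch in turn and use the first one that answers. The following is a minimal sketch of that idea, not OpenGSLB's client logic. The function names `probe` and `first_responding` and the host names in the usage line are illustrative assumptions. The probe relies on `dig` exiting non-zero when no server replies.

```shell
# probe SERVER NAME: succeeds if SERVER returns any DNS reply for NAME.
# dig exits with status 9 when no reply arrives, so the exit code is enough.
probe() {
  dig +time=2 +tries=1 "@$1" "$2" >/dev/null 2>&1
}

# first_responding NAME SERVER...: print the first server that answers.
# (Hypothetical helper for manual checks, not part of OpenGSLB.)
first_responding() {
  name="$1"
  shift
  for server in "$@"; do
    if probe "$server" "$name"; then
      echo "$server"
      return 0
    fi
  done
  echo "no Overwatch answered for $name" >&2
  return 1
}
```

Usage: `first_responding myapp.gslb.example.com overwatch-1 overwatch-2 overwatch-3`. A non-zero exit status corresponds to the "All Overwatches" case.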
Diagnosis
Step 1: Check Overwatch Status
# Check service status
sudo systemctl status opengslb-overwatch
# Check if process is running (the [o] keeps grep from matching itself)
ps aux | grep '[o]pengslb'
# Check listening ports
sudo ss -tulnp | grep -E ':(53|7946|9090|9091) '
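To see at a glance which of the expected ports are not listening, the `ss` output can be filtered per port. This is a sketch under the assumption that the input is `ss -tuln` output, where the fifth column is the local address and port. `missing_ports` is a hypothetical helper name.

```shell
# missing_ports PORT...: read `ss -tuln` output on stdin and print every
# PORT that has no socket bound to it. Matching on the end of the local
# address column avoids false hits such as 5353 matching 53.
missing_ports() {
  listing=$(cat)
  for port in "$@"; do
    printf '%s\n' "$listing" |
      awk -v p="$port" '$5 ~ (":" p "$") { found = 1 } END { exit !found }' ||
      echo "$port"
  done
}
```

Usage: `ss -tuln | missing_ports 53 7946 9090 9091`. Empty output means every listed port has a socket bound.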
Step 2: Check System Resources
# Disk space
df -h
# Memory
free -m
# CPU load
uptime
# System logs
dmesg | tail -50
Step 3: Check Overwatch Logs
# Recent logs
journalctl -u opengslb-overwatch -n 200 --no-pager
# Errors only
journalctl -u opengslb-overwatch -p err --since "1 hour ago"
# Follow logs during restart
journalctl -u opengslb-overwatch -f
Step 4: Test DNS Manually
# On the Overwatch server
dig @127.0.0.1 myapp.gslb.example.com
# From remote client
dig @overwatch-ip myapp.gslb.example.com
Step 5: Check Other Overwatches (HA)
for ow in overwatch-{1,2,3}; do
  echo "=== $ow ==="
  ssh "$ow" "sudo systemctl status opengslb-overwatch --no-pager"
done
Common Causes and Solutions
Cause 1: Service Crashed
Check for crash:
journalctl -u opengslb-overwatch | grep -E "(panic|fatal|killed)"
Restart service:
sudo systemctl restart opengslb-overwatch
If crashes recur, check the logs for the root cause and file an issue.
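To judge whether crashes are recurring, it may help to count the crash indicators over a time window instead of reading raw log lines. The sketch below reuses the three keywords from the grep above. `crash_summary` is a hypothetical helper, and the 24-hour window in the usage line is an arbitrary choice.

```shell
# crash_summary: read log lines on stdin and print, for each keyword,
# how many lines contain it (case-sensitive, same keywords as the grep above).
crash_summary() {
  logs=$(cat)
  for kw in panic fatal killed; do
    count=$(printf '%s\n' "$logs" | grep -c -- "$kw")
    printf '%s=%s\n' "$kw" "$count"
  done
}
```

Usage: `journalctl -u opengslb-overwatch --since "24 hours ago" --no-pager | crash_summary`. On reasonably recent systemd versions, `systemctl show opengslb-overwatch -p NRestarts` additionally reports how often systemd restarted the unit automatically.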
Cause 2: Configuration Error
After a configuration change, the service may fail to start:
# Validate configuration
opengslb --config=/etc/opengslb/overwatch.yaml --validate
# Check last config change
ls -la /etc/opengslb/overwatch.yaml
git -C /etc/opengslb log --oneline -5 # If version controlled
Fix the configuration and restart the service.
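One way to avoid restarting into a broken configuration is to gate the restart on the validation step. This sketch assumes that `opengslb --validate` exits non-zero when the file is invalid, which this runbook does not state explicitly. `validate_then_restart` is a hypothetical wrapper name.

```shell
# validate_then_restart [CONFIG]: restart the service only if the config
# validates. ASSUMPTION: `opengslb --validate` signals errors via exit status.
validate_then_restart() {
  cfg="${1:-/etc/opengslb/overwatch.yaml}"
  if opengslb --config="$cfg" --validate; then
    sudo systemctl restart opengslb-overwatch
  else
    echo "configuration $cfg failed validation; not restarting" >&2
    return 1
  fi
}
```

Usage: `validate_then_restart` for the default path, or `validate_then_restart /path/to/candidate.yaml` to test a candidate file.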
Cause 3: Port Conflict
Another process is using port 53:
sudo lsof -i :53
sudo ss -tulnp | grep :53
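Stop the conflicting service (often systemd-resolved). If /etc/resolv.conf points at its 127.0.0.53 stub listener, point it at a reachable resolver first, or the host itself loses name resolution: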
sudo systemctl stop systemd-resolved
sudo systemctl disable systemd-resolved
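Before stopping anything, it is worth confirming which process actually holds the port. The sketch below extracts the process name from `ss -tulnp` output, assuming the usual `users:(("name",pid=...,fd=...))` process column. `port_owner` is a hypothetical helper name.

```shell
# port_owner PORT: read `ss -tulnp` output on stdin and print the name of
# each process bound to PORT (column 5 is the local address:port).
port_owner() {
  awk -v p="$1" '
    $5 ~ (":" p "$") && match($0, /users:\(\("[^"]+"/) {
      # Skip the 9-character prefix users:((" and drop the closing quote.
      print substr($0, RSTART + 9, RLENGTH - 10)
    }' | sort -u
}
```

Usage: `sudo ss -tulnp | port_owner 53`. Note that `ss` shortens process names to 15 characters, so systemd-resolved shows up as `systemd-resolve`.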
Cause 4: Out of Memory
dmesg | grep -i "out of memory"
journalctl -k | grep -i oom
Solutions:
Increase server memory
Increase container memory limits
Check for memory leaks (file issue if suspected)
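To confirm that the OOM killer hit the Overwatch process and not something else on the host, the kernel messages can be reduced to a per-process count. This sketch assumes the common kernel message form `Killed process <pid> (<name>)`. `oom_victims` is a hypothetical helper name.

```shell
# oom_victims: read kernel log lines on stdin and print "name=count" for
# every process the OOM killer terminated, most frequent first.
oom_victims() {
  sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' |
    sort | uniq -c | sort -rn |
    awk '{ n = $1; $1 = ""; sub(/^ +/, ""); print $0 "=" n }'
}
```

Usage: `journalctl -k --no-pager | oom_victims`. If `opengslb` is absent from the output, memory pressure from another process is the more likely cause.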
Cause 5: Disk Full
df -h /var/lib/opengslb
Clean up:
# Remove old backups
find /var/lib/opengslb -name "*.backup" -mtime +7 -delete
# Clear old logs
journalctl --vacuum-time=7d
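To decide whether cleanup is needed at all, the `df` output can be checked against a threshold. This is a sketch that assumes POSIX-format output (`df -P`), where column 5 is the used percentage and column 6 the mount point. `disk_over` and the 90% threshold are illustrative choices.

```shell
# disk_over LIMIT: read `df -P` output on stdin and print the mount point
# and usage of every filesystem at or above LIMIT percent.
disk_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6 " " use "%"
  }'
}
```

Usage: `df -P /var/lib/opengslb | disk_over 90`. Running the `find` command above with `-print` in place of `-delete` first shows what would be removed.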
Cause 6: Certificate/Key Issues
DNSSEC or TLS issues:
journalctl -u opengslb-overwatch | grep -i "certificate\|key\|tls"
# Check DNSSEC keys
ls -la /var/lib/opengslb/dnssec/
Cause 7: Network Interface Down
ip addr show
ip link show
Bring the interface up (replace eth0 with the affected interface):
sudo ip link set eth0 up
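On hosts with many interfaces, the brief output mode of `ip` makes it easier to spot the ones that are down. The sketch below assumes `ip -br link show` output, where the second column is the operational state. `down_links` is a hypothetical helper name.

```shell
# down_links: read `ip -br link show` output on stdin and print the name
# of every interface whose state is DOWN. A trailing "@parent" suffix
# (as shown for veth or VLAN devices) is stripped from the name.
down_links() {
  awk '$2 == "DOWN" { sub(/@.*/, "", $1); print $1 }'
}
```

Usage: `ip -br link show | down_links`.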
Recovery Steps
Step 1: Restart the Service
sudo systemctl restart opengslb-overwatch
Step 2: Verify Service Started
sudo systemctl status opengslb-overwatch
curl http://localhost:9090/api/v1/ready
Step 3: Verify DNS Working
dig @localhost myapp.gslb.example.com +short
Step 4: Verify Gossip Receiving
# Check for agent connections
curl http://localhost:9090/api/v1/overwatch/stats | jq .active_agents
Step 5: Verify DNSSEC (if enabled)
curl http://localhost:9090/api/v1/dnssec/status | jq .enabled
dig @localhost myapp.gslb.example.com +dnssec
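Right after a restart the readiness endpoint may not answer yet, so a single `curl` can give a false negative. The sketch below polls the readiness URL from Step 2 until it succeeds or a retry budget runs out. `wait_ready`, the default of 10 attempts, and the 2-second delay are illustrative assumptions.

```shell
# wait_ready [URL] [ATTEMPTS] [DELAY]: poll URL until curl reports success
# (-f turns HTTP error statuses into a non-zero exit) or attempts run out.
wait_ready() {
  url="${1:-http://localhost:9090/api/v1/ready}"
  attempts="${2:-10}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

Usage: `sudo systemctl restart opengslb-overwatch && wait_ready`. Continue with the DNS and gossip checks only once it reports ready.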
Docker Recovery
Check Container Status
docker ps -a | grep opengslb
docker logs opengslb-overwatch
Restart Container
docker restart opengslb-overwatch
Recreate Container
docker stop opengslb-overwatch
docker rm opengslb-overwatch
docker run -d \
  --name opengslb-overwatch \
  -p 53:53/udp -p 53:53/tcp \
  -p 7946:7946 \
  -p 9090:9090 -p 9091:9091 \
  -v "$(pwd)/config/overwatch.yaml:/etc/opengslb/config.yaml:ro" \
  -v opengslb-data:/var/lib/opengslb \
  ghcr.io/loganrossus/opengslb:latest
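Whether to restart or recreate depends on why the container stopped. The sketch below interprets three fields from `docker inspect` (`State.Status`, `State.ExitCode`, `State.OOMKilled`). `explain_state` and its messages are hypothetical and only suggest a next step.

```shell
# explain_state STATUS EXIT_CODE OOM_KILLED: suggest a next step based on
# the container state fields reported by `docker inspect`.
explain_state() {
  if [ "$3" = "true" ]; then
    echo "oom-killed: raise the memory limit before restarting"
  elif [ "$1" = "running" ]; then
    echo "running: no action needed"
  elif [ "$2" = "0" ]; then
    echo "stopped cleanly: docker start is enough"
  else
    echo "exited with code $2: check docker logs before restarting"
  fi
}
```

Usage: `explain_state $(docker inspect -f '{{.State.Status}} {{.State.ExitCode}} {{.State.OOMKilled}}' opengslb-overwatch)`.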
All Overwatches Down
Emergency Procedure
Start any single Overwatch first
sudo systemctl start opengslb-overwatch
Update DNS clients to point to the working Overwatch
Update resolv.conf
Update DNS forwarding
Temporary: hardcode IP
Bring up remaining Overwatches
for host in overwatch-{2,3}; do ssh "$host" "sudo systemctl start opengslb-overwatch"; done
Restore normal DNS configuration
Disaster Recovery
If all Overwatches are lost, see Backup and Restore.
Prevention
High Availability: Deploy multiple Overwatches
Resource Monitoring: Alert on disk, memory, CPU
Health Checks: Monitor liveness and readiness endpoints
Graceful Degradation: Configure return_last_healthy
Automated Recovery: Use systemd restart policies
Systemd Restart Policy
# In /etc/systemd/system/opengslb-overwatch.service
[Service]
Restart=on-failure
RestartSec=5
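Instead of editing the unit file in place, the same two settings can be kept in a systemd drop-in so that package upgrades do not overwrite them. The sketch below writes such a drop-in. The file name `restart.conf` and the helper name `write_restart_dropin` are assumptions, while the directory follows the standard `<unit>.d` convention.

```shell
# write_restart_dropin [DIR]: write the restart policy shown above into a
# drop-in file under DIR (defaults to the unit's standard drop-in directory).
write_restart_dropin() {
  dir="${1:-/etc/systemd/system/opengslb-overwatch.service.d}"
  mkdir -p "$dir" || return 1
  cat > "$dir/restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=5
EOF
}
```

Usage (as root): `write_restart_dropin && systemctl daemon-reload`. The reload is needed for systemd to pick up the new drop-in, and `systemctl show opengslb-overwatch -p Restart` confirms the effective value.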
Alerts
- alert: OverwatchDown
  expr: up{job="opengslb-overwatch"} == 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Overwatch {{ $labels.instance }} is down"
- alert: AllOverwatchesDown
  expr: sum(up{job="opengslb-overwatch"}) == 0
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "All Overwatch nodes are down - DNS unavailable"
- alert: OverwatchHighMemory
  expr: process_resident_memory_bytes{job="opengslb-overwatch"} > 1e9
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Overwatch using more than 1GB memory"
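The OverwatchHighMemory condition can also be spot-checked by hand against a raw metrics scrape, which is useful when Prometheus itself is unavailable. The sketch below reads Prometheus text-format metrics on stdin and applies the same 1e9-byte threshold. `rss_over_limit` is a hypothetical helper, and the metrics URL is left as a placeholder because this runbook does not state the port and path.

```shell
# rss_over_limit [LIMIT_BYTES]: read Prometheus text-format metrics on
# stdin; succeed only if process_resident_memory_bytes exceeds the limit.
# A missing metric counts as "not over the limit".
rss_over_limit() {
  awk -v limit="${1:-1000000000}" '
    $1 == "process_resident_memory_bytes" {
      found = 1
      over = ($2 + 0 > limit + 0)
    }
    END { exit (found && over) ? 0 : 1 }'
}
```

Usage: `curl -s "$METRICS_URL" | rss_over_limit && echo "resident memory above 1 GB"`, where `METRICS_URL` is a placeholder for the node's metrics endpoint.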