OpenGSLB Troubleshooting Guide

This guide covers common issues and their solutions for Agent and Overwatch mode deployments.

Table of Contents

DNS Issues
Health Check Issues
Configuration Issues
Agent Mode Issues
Overwatch Mode Issues
Performance Issues
Logging and Debugging

DNS Issues

DNS Queries Return SERVFAIL

Symptoms:

dig returns status: SERVFAIL
No IP addresses in DNS response

Possible Causes:

All backend servers unhealthy

# Check server health status
curl http://localhost:8080/api/v1/health/servers | jq '.servers[] | select(.healthy == false)'

Solution: Verify backend servers are running and health check endpoints are accessible.

Health checks not yet completed
```
# Check readiness
curl http://localhost:8080/api/v1/ready
```
Solution: Wait for initial health checks to complete (typically 5-10 seconds after startup).

Configuration validation error

# Check logs for validation errors
journalctl -u opengslb | grep -i "validation\|error"

DNS Queries Return NXDOMAIN

Symptoms:

dig returns status: NXDOMAIN
Domain exists in configuration but not resolved

Possible Causes:

Domain not configured

# List configured domains
curl http://localhost:8080/api/v1/health/servers | jq '.servers[].domain' | sort -u

Domain name mismatch (trailing dot, case sensitivity)
- DNS queries include trailing dot: app.example.com.
- Configuration should not include trailing dot: app.example.com
No servers in domain’s regions
- Check that the domain references valid regions with servers

High DNS Query Latency

Symptoms:

DNS responses take >100ms
Timeouts on client side

Possible Causes:

Health check blocking - DNS handler waiting for health status
Router algorithm inefficiency - Complex routing taking too long
Resource exhaustion - CPU/memory issues

Solutions:

# Check DNS query duration metrics
curl http://localhost:9090/metrics | grep opengslb_dns_query_duration

# Check system resources
top -p $(pgrep opengslb)

Health Check Issues

All Servers Marked Unhealthy

Symptoms:

All servers show healthy: false
DNS returns SERVFAIL

Diagnostic Steps:

Verify backend connectivity:

# From OpenGSLB host
curl -v http://backend-ip:port/health

Check health check configuration:

health_check:
  type: http
  path: /health  # Correct path?
  timeout: 5s    # Sufficient timeout?

Review health check logs:

journalctl -u opengslb | grep -i "health check failed"

Health Check Flapping

Symptoms:

Server rapidly alternates between healthy/unhealthy
Frequent log entries about state changes

Possible Causes:

Timeout too aggressive

health_check:
  timeout: 10s  # Increase from default

Thresholds too low

health_check:
  failure_threshold: 3  # Require 3 failures before marking unhealthy
  success_threshold: 2  # Require 2 successes before marking healthy

Backend intermittently failing - Check backend logs

TCP Health Checks Failing

Symptoms:

TCP health checks report connection refused
Backend service is running

Possible Causes:

Firewall blocking connections

# Test connectivity
nc -zv backend-ip port

Service not listening on expected interface

# On backend server
ss -tlnp | grep :port

Connection limit reached on backend

Configuration Issues

Configuration Reload Failed

Symptoms:

SIGHUP sent but configuration unchanged
Log shows “configuration reload failed”

Solutions:

Validate configuration before reload:

opengslb --config /etc/opengslb/config.yaml --validate

Check for syntax errors:

# YAML syntax validation
python3 -c "import yaml; yaml.safe_load(open('/etc/opengslb/config.yaml'))"

Review reload logs:

journalctl -u opengslb | grep -i reload

Configuration Changes Not Taking Effect

Symptoms:

Configuration file updated but behavior unchanged

Solutions:

Send SIGHUP to reload:

sudo systemctl reload opengslb
# or
sudo kill -SIGHUP $(pgrep opengslb)

Some changes require restart:
- dns.listen_address
- mode (agent/overwatch)
- overwatch.gossip.bind_address
- agent.gossip.overwatch_nodes

Verify reload was successful:

curl http://localhost:9090/metrics | grep opengslb_config_reload

Agent Mode Issues

Agent Not Sending Heartbeats

Symptoms:

Agent logs show no heartbeat activity
Overwatch shows backends as stale

Possible Causes:

Gossip not configured

agent:
  gossip:
    encryption_key: "your-32-byte-base64-key"
    overwatch_nodes:
      - "overwatch-1.internal:7946"

Network connectivity to Overwatch

# Test gossip port connectivity
nc -zv overwatch-ip 7946

Encryption key mismatch
- Agent and Overwatch must use identical encryption_key

Solutions:

# Check agent gossip metrics
curl http://localhost:9090/metrics | grep opengslb_gossip

# Review agent logs
journalctl -u opengslb | grep -i "gossip\|heartbeat"

Agent Health Checks Not Running

Symptoms:

No health check metrics for agent backends
Backend status unknown

Possible Causes:

Backends not configured

agent:
  backends:
    - service: "my-service"
      address: "127.0.0.1"
      port: 8080
      health_check:
        type: http
        path: /health

Health check interval too long

agent:
  backends:
    - service: "my-service"
      health_check:
        interval: 10s  # Default may be longer

Solutions:

# Check health check status
curl http://localhost:9090/metrics | grep opengslb_health_check

# Review agent backend status
journalctl -u opengslb | grep -i "backend\|health"

Agent Certificate Issues

Symptoms:

Agent fails to start with certificate errors
Gossip connection rejected

Possible Causes:

Certificate paths incorrect

agent:
  identity:
    cert_path: /var/lib/opengslb/agent.crt
    key_path: /var/lib/opengslb/agent.key

Certificate permissions

# Check file permissions
ls -la /var/lib/opengslb/agent.*

# Fix permissions
chmod 600 /var/lib/opengslb/agent.key
chmod 644 /var/lib/opengslb/agent.crt

Certificate not trusted by Overwatch
- Ensure agent tokens are configured on Overwatch

Overwatch Mode Issues

Backends Not Registering

Symptoms:

Overwatch shows no backends
Agent heartbeats not received

Possible Causes:

Gossip port not accessible

# Check gossip listener
ss -tlnp | grep 7946

# Test from agent
nc -zv overwatch-ip 7946

Agent tokens not configured

overwatch:
  agent_tokens:
    my-service: "service-token-here"

Encryption key mismatch
- All agents and Overwatch nodes must use the same key

Solutions:

# Check Overwatch backend registry
curl http://localhost:8080/api/v1/overwatch/backends | jq

# Review gossip logs
journalctl -u opengslb | grep -i gossip

External Validation Disagreements

Symptoms:

Overwatch validation disagrees with agent claims
Backend marked unhealthy despite agent claiming healthy

This is expected behavior. Per ADR-015, Overwatch validation ALWAYS wins over agent claims.

Investigate disagreements:

# Check validation status
curl http://localhost:8080/api/v1/overwatch/backends | jq '.backends[] | select(.validation_healthy != .agent_healthy)'

# Check validation metrics
curl http://localhost:9090/metrics | grep opengslb_overwatch_validation

# Review disagreement logs
journalctl -u opengslb | grep -i "disagrees with agent"

Possible causes of disagreement:

Network path differences - Overwatch can’t reach backend that agent can
Different health check configuration - Agent and Overwatch using different paths/ports
Intermittent failures - Backend flapping, caught at different times

Backends Going Stale

Symptoms:

Backends marked as stale status
agent_last_seen timestamp is old

Possible Causes:

Agent stopped or crashed

# Check agent status on backend server
systemctl status opengslb

Network partition between agent and Overwatch

# Test connectivity from agent to Overwatch
nc -zv overwatch-ip 7946

Stale threshold too aggressive

overwatch:
  stale:
    threshold: 30s      # Time before marking stale
    remove_after: 5m    # Time before removing

Solutions:

# Check stale backends
curl http://localhost:8080/api/v1/overwatch/backends?status=stale | jq

# Check stale metrics
curl http://localhost:9090/metrics | grep opengslb_overwatch_stale

Manual Override Not Working

Symptoms:

Override API returns success but backend status unchanged
Override not persisting

Diagnostic Steps:

Verify override was set:

curl http://localhost:8080/api/v1/overwatch/backends | jq '.backends[] | select(.override_status != null)'

Check override takes precedence:
- Override > Validation > Staleness > Agent claim
- Backend must not be stale for override to show in effective status

Review persistence:

# Check if bbolt store is working
journalctl -u opengslb | grep -i "store\|persist"

Setting an override:

# Force backend healthy
curl -X POST http://localhost:8080/api/v1/overwatch/backends/my-service/10.0.1.10/80/override \
  -H "Content-Type: application/json" \
  -H "X-User: admin" \
  -d '{"healthy": true, "reason": "maintenance bypass"}'

# Clear override
curl -X DELETE http://localhost:8080/api/v1/overwatch/backends/my-service/10.0.1.10/80/override

Performance Issues

High Memory Usage

Symptoms:

Memory usage >500MB for small configurations
OOM kills

Solutions:

Check for goroutine leaks:

curl http://localhost:9090/debug/pprof/goroutine?debug=2

Reduce health check frequency:

health_check:
  interval: 30s  # Increase from default

Limit concurrent checks:
- Consider reducing number of monitored servers

High CPU Usage

Symptoms:

CPU >50% under normal load
Slow DNS responses

Solutions:

Profile the application:

curl http://localhost:9090/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof cpu.prof

Check for excessive logging:

logging:
  level: info  # Avoid "debug" in production

Review routing algorithm:
- Weighted routing is slightly more CPU-intensive than round-robin

Logging and Debugging

Enable Debug Logging

Temporarily increase log verbosity:

logging:
  level: debug
  format: json

Or via environment variable:

OPENGSLB_LOG_LEVEL=debug opengslb --config config.yaml

View Real-Time Logs

# systemd
journalctl -u opengslb -f

# Docker
docker logs -f opengslb

# Direct
./opengslb --config config.yaml 2>&1 | tee opengslb.log

Export Diagnostic Information

For support requests, gather:

# System info
uname -a
cat /etc/os-release

# OpenGSLB version
opengslb --version

# Configuration (sanitize secrets)
cat /etc/opengslb/config.yaml | grep -v key | grep -v token

# Metrics snapshot
curl http://localhost:9090/metrics > metrics.txt

# Recent logs
journalctl -u opengslb --since "1 hour ago" > logs.txt

# Overwatch status (if applicable)
curl http://localhost:8080/api/v1/overwatch/backends > backends.json
curl http://localhost:8080/api/v1/overwatch/stats > stats.json
curl http://localhost:8080/api/v1/health/servers > health-status.json

Getting Help

If you cannot resolve an issue:

Search existing issues: https://github.com/LoganRossUS/OpenGSLB/issues
Create a new issue with diagnostic information above
Community support: Discussions

For commercial support: licensing@opengslb.org