Agent-Overwatch Deployment Guide
This guide covers deploying OpenGSLB using the agent-overwatch architecture introduced in Sprint 5.
Architecture Overview
The agent-overwatch model consists of two components:
Agent: Runs on application servers, monitors local health, gossips state to Overwatch nodes
Overwatch: Runs adjacent to DNS infrastructure, validates health claims, serves authoritative DNS
Key Principles
No VIPs required: DNS clients fail over automatically across the multiple nameservers listed in resolv.conf
No cluster coordination: Each Overwatch operates independently
Security by default: Mandatory gossip encryption, trust-on-first-use (TOFU) authentication, DNSSEC enabled
Overwatch always wins: External validation overrides agent health claims
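As a rough illustration of the last principle, the sketch below models how an Overwatch might combine an agent's health claim with its own validation result. It is written in Python for brevity and is not OpenGSLB's implementation; the function name effective_health and the fallback to the agent's claim when no validation result exists yet are assumptions.

```python
from typing import Optional


def effective_health(agent_claim: bool, validation_result: Optional[bool]) -> bool:
    """Return the health state that would be used for DNS answers.

    agent_claim:       what the agent gossiped about its backend.
    validation_result: outcome of the Overwatch's own external check,
                       or None if no check has completed yet (assumption).
    """
    if validation_result is not None:
        # External validation overrides the agent's claim in both directions.
        return validation_result
    # Assumption: with no validation result, the agent's claim is used as-is.
    return agent_claim


if __name__ == "__main__":
    print(effective_health(agent_claim=True, validation_result=False))
    print(effective_health(agent_claim=False, validation_result=True))
    print(effective_health(agent_claim=True, validation_result=None))
```

Under this model an agent that reports healthy is still withdrawn when the Overwatch's own check fails, which matches the principle stated above.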
Prerequisites
Go 1.21+ (for building from source)
Network connectivity between agents and Overwatches (port 7946 for gossip)
DNS port access (port 53 or custom) for Overwatch nodes
Deployment Patterns
Pattern 1: Simple (1 Overwatch, N Agents)
┌─────────────────────────────────────────────────────────┐
│                      DNS Clients                        │
│                           │                             │
│                           ▼                             │
│                      ┌──────────┐                       │
│                      │Overwatch │                       │
│                      │ 10.0.1.53│                       │
│                      └────┬─────┘                       │
│                           │ Gossip                      │
│               ┌───────────┼───────────┐                 │
│               ▼           ▼           ▼                 │
│           ┌────────┐  ┌────────┐  ┌────────┐            │
│           │ Agent  │  │ Agent  │  │ Agent  │            │
│           │ + App  │  │ + App  │  │ + App  │            │
│           └────────┘  └────────┘  └────────┘            │
└─────────────────────────────────────────────────────────┘
Pattern 2: High Availability (Multiple Independent Overwatches)
┌───────────────────────────────────────────────────────────────────┐
│   DNS Clients (resolv.conf with multiple nameservers)             │
│              │            │            │                          │
│              ▼            ▼            ▼                          │
│        ┌──────────┐ ┌──────────┐ ┌──────────┐                     │
│        │Overwatch1│ │Overwatch2│ │Overwatch3│                     │
│        │10.0.1.53 │ │10.0.1.54 │ │10.0.1.55 │                     │
│        └─────┬────┘ └─────┬────┘ └─────┬────┘                     │
│              │            │            │                          │
│              └────────────┼────────────┘                          │
│                           │ Gossip (all agents → all overwatches) │
│               ┌───────────┼───────────┐                           │
│               ▼           ▼           ▼                           │
│           ┌────────┐  ┌────────┐  ┌────────┐                      │
│           │ Agent  │  │ Agent  │  │ Agent  │                      │
│           │ + App  │  │ + App  │  │ + App  │                      │
│           └────────┘  └────────┘  └────────┘                      │
└───────────────────────────────────────────────────────────────────┘
Step-by-Step Deployment
Step 2: Deploy Overwatch Nodes
Create /etc/opengslb/overwatch.yaml:
mode: overwatch
identity:
  node_id: overwatch-us-east-1
  region: us-east
dns:
  listen_address: "0.0.0.0:53"
  zones:
    - gslb.example.com
  default_ttl: 30
dnssec:
  enabled: true
  key_sync:
    peers:
      - "https://overwatch-2.internal:9090"
      - "https://overwatch-3.internal:9090"
    poll_interval: "1h"
# Service tokens - agents must present matching token
agent_tokens:
  myapp: "${MYAPP_TOKEN}"
  otherapp: "${OTHERAPP_TOKEN}"
gossip:
  bind_address: "0.0.0.0:7946"
  encryption_key: "${GOSSIP_KEY}"  # REQUIRED
validation:
  enabled: true
  check_interval: 30s
  check_timeout: 5s
stale:
  threshold: 30s     # Mark stale after 30s no heartbeat
  remove_after: 5m   # Remove backend after 5m stale
api:
  address: "127.0.0.1:8080"  # Localhost only by default; change for remote access
  allowed_networks:
    - 10.0.0.0/8
    - 192.168.0.0/16
metrics:
  enabled: true
  address: "0.0.0.0:9090"
data_dir: /var/lib/opengslb
logging:
  level: info
  format: json
Start Overwatch:
# Build from source
go build -o opengslb ./cmd/opengslb
# Run in the foreground (wrap this command in a systemd unit to run it as a service)
./opengslb --config /etc/opengslb/overwatch.yaml
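To make the stale settings in overwatch.yaml concrete, here is a minimal Python sketch of the lifecycle they describe. It is a hypothetical model, not OpenGSLB code: the state names, the function backend_state, and the assumption that remove_after starts counting once the backend becomes stale (per the config comment "Remove backend after 5m stale") are illustrative.

```python
from datetime import timedelta

STALE_THRESHOLD = timedelta(seconds=30)  # stale.threshold
REMOVE_AFTER = timedelta(minutes=5)      # stale.remove_after


def backend_state(since_last_heartbeat: timedelta) -> str:
    """Classify a backend by the time elapsed since its last heartbeat.

    Assumption: a backend is stale from the threshold onward and is removed
    once it has been stale for REMOVE_AFTER.
    """
    if since_last_heartbeat < STALE_THRESHOLD:
        return "fresh"
    if since_last_heartbeat < STALE_THRESHOLD + REMOVE_AFTER:
        return "stale"
    return "removed"


if __name__ == "__main__":
    for seconds in (10, 30, 300, 330):
        print(seconds, backend_state(timedelta(seconds=seconds)))
```

With the agent defaults shown in this guide (heartbeat interval 10s, missed_threshold 3), three missed heartbeats take about as long as the 30s stale threshold, so the two settings appear to be chosen to line up.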
Step 3: Deploy Agents
Create /etc/opengslb/agent.yaml on each application server:
mode: agent
identity:
  service_token: "${MYAPP_TOKEN}"
  region: us-east
  # Certificate auto-generated on first start at /var/lib/opengslb/
backends:
  - service: myapp
    address: 127.0.0.1
    port: 8080
    weight: 100
    health_check:
      type: http
      path: /health
      interval: 5s
      timeout: 2s
      failure_threshold: 3
      success_threshold: 2
predictive:
  enabled: true
  cpu_threshold: 85
  memory_threshold: 90
  error_rate_threshold: 5
  check_interval: 10s
gossip:
  encryption_key: "${GOSSIP_KEY}"  # Must match Overwatch
  overwatch_nodes:
    - overwatch-1.internal:7946
    - overwatch-2.internal:7946
    - overwatch-3.internal:7946
heartbeat:
  interval: 10s
  missed_threshold: 3
data_dir: /var/lib/opengslb
logging:
  level: info
  format: json
metrics:
  enabled: true
  address: "127.0.0.1:9100"  # Local only for agent metrics
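The predictive settings in this configuration can be read as per-signal limits. The Python sketch below shows one plausible interpretation and should be treated as an assumption throughout: this guide does not state the units (percentages are assumed here), whether a single exceeded limit is enough to raise the predictive signal, or how OpenGSLB reacts to it. The names PredictiveThresholds and exceeded_signals are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PredictiveThresholds:
    cpu_threshold: float = 85.0         # predictive.cpu_threshold (assumed percent)
    memory_threshold: float = 90.0      # predictive.memory_threshold (assumed percent)
    error_rate_threshold: float = 5.0   # predictive.error_rate_threshold (assumed percent)


def exceeded_signals(cpu: float, memory: float, error_rate: float,
                     limits: Optional[PredictiveThresholds] = None) -> List[str]:
    """Return the names of the readings that are above their configured limit."""
    limits = limits or PredictiveThresholds()
    readings = {
        "cpu": (cpu, limits.cpu_threshold),
        "memory": (memory, limits.memory_threshold),
        "error_rate": (error_rate, limits.error_rate_threshold),
    }
    return [name for name, (value, limit) in readings.items() if value > limit]


if __name__ == "__main__":
    print(exceeded_signals(cpu=92.0, memory=40.0, error_rate=1.0))
```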
Start Agent:
./opengslb --config /etc/opengslb/agent.yaml
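The failure_threshold and success_threshold values in the agent's health_check block suggest hysteresis: a backend flips state only after several probes agree. The Python sketch below illustrates that idea under the assumption that the thresholds count consecutive results; it is not taken from OpenGSLB's source, and the class name ThresholdTracker is hypothetical.

```python
class ThresholdTracker:
    """Track probe results and flip state only after consecutive agreement."""

    def __init__(self, failure_threshold: int = 3, success_threshold: int = 2,
                 healthy: bool = True) -> None:
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = healthy
        self._failures = 0   # consecutive failed probes
        self._successes = 0  # consecutive successful probes

    def observe(self, ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


if __name__ == "__main__":
    tracker = ThresholdTracker()
    for result in (False, False, False, True, True):
        print(result, tracker.observe(result))
```

With the values from the example config, a 5s probe interval and a failure threshold of 3 would mean roughly 15 seconds of consecutive failures before the agent reports the backend as unhealthy, if the consecutive-count assumption holds.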
Step 4: Configure DNS Clients
Configure client /etc/resolv.conf:
nameserver 10.0.1.53
nameserver 10.0.1.54
nameserver 10.0.1.55
options timeout:2 attempts:3
Alternatively, on corporate networks, configure your existing DNS server to forward the GSLB zone:
BIND example (named.conf):
zone "gslb.example.com" {
    type forward;
    forward only;
    forwarders {
        10.0.1.53;
        10.0.1.54;
        10.0.1.55;
    };
};
Multi-Backend Agent Configuration
An agent can register multiple backends (services):
mode: agent
identity:
  service_token: "${TOKEN}"
  region: us-east
backends:
  - service: web
    address: 127.0.0.1
    port: 8080
    weight: 100
    health_check:
      type: http
      path: /health
      interval: 5s
      timeout: 2s
  - service: api
    address: 127.0.0.1
    port: 9090
    weight: 100
    health_check:
      type: http
      path: /api/health
      interval: 5s
      timeout: 2s
  - service: grpc
    address: 127.0.0.1
    port: 50051
    weight: 100
    health_check:
      type: tcp
      interval: 10s
      timeout: 3s
gossip:
  encryption_key: "${GOSSIP_KEY}"
  overwatch_nodes:
    - overwatch-1.internal:7946
External Override API
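External tools, such as a monitoring system, can override a backend's health state through the Overwatch API. The examples assume api.address has been changed from its localhost-only default so that overwatch:8080 is reachable from the calling host: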
# Mark backend unhealthy
curl -X PUT http://overwatch:8080/api/v1/overrides/myapp/10.0.1.10 \
  -H "Content-Type: application/json" \
  -d '{"healthy": false, "reason": "High latency from CloudWatch"}'
# Clear override
curl -X DELETE http://overwatch:8080/api/v1/overrides/myapp/10.0.1.10
# List all overrides
curl http://overwatch:8080/api/v1/overrides
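For tools that call the override API from code rather than curl, the same requests can be built with Python's standard library. This is a minimal sketch that mirrors the curl commands above; the helper names override_request and clear_override_request are hypothetical, and the endpoint paths and JSON fields are copied from the examples rather than from an API reference.

```python
import json
from urllib.request import Request


def override_request(base_url: str, service: str, address: str,
                     healthy: bool, reason: str) -> Request:
    """Build the PUT request that sets an override for one backend."""
    url = f"{base_url}/api/v1/overrides/{service}/{address}"
    body = json.dumps({"healthy": healthy, "reason": reason}).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


def clear_override_request(base_url: str, service: str, address: str) -> Request:
    """Build the DELETE request that clears an override."""
    return Request(f"{base_url}/api/v1/overrides/{service}/{address}",
                   method="DELETE")


if __name__ == "__main__":
    req = override_request("http://overwatch:8080", "myapp", "10.0.1.10",
                           healthy=False, reason="High latency from CloudWatch")
    print(req.get_method(), req.full_url)
```

To send a request, pass it to urllib.request.urlopen(req, timeout=5) from a host that is permitted by the Overwatch's allowed_networks setting.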
DNSSEC Configuration
DNSSEC is enabled by default. To get DS records for parent zone delegation:
curl http://overwatch:8080/api/v1/dnssec/ds
Response:
{
  "zone": "gslb.example.com",
  "ds_records": [
    {
      "key_tag": 12345,
      "algorithm": 13,
      "digest_type": 2,
      "digest": "abc123...",
      "ds_record": "gslb.example.com. IN DS 12345 13 2 abc123..."
    }
  ]
}
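When handing DS records to the parent zone operator, only the ds_record strings are usually needed. The Python sketch below extracts them from a response shaped like the one above; the function name ds_lines is hypothetical, and the field names are taken from the example response, not from an API specification.

```python
import json
from typing import List


def ds_lines(response_text: str) -> List[str]:
    """Return the ready-to-publish DS record lines from the API response."""
    doc = json.loads(response_text)
    return [record["ds_record"] for record in doc.get("ds_records", [])]


# Sample input copied from the example response; the digest is a placeholder.
SAMPLE_RESPONSE = """
{
  "zone": "gslb.example.com",
  "ds_records": [
    {
      "key_tag": 12345,
      "algorithm": 13,
      "digest_type": 2,
      "digest": "abc123...",
      "ds_record": "gslb.example.com. IN DS 12345 13 2 abc123..."
    }
  ]
}
"""

if __name__ == "__main__":
    for line in ds_lines(SAMPLE_RESPONSE):
        print(line)
```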
To disable DNSSEC (not recommended):
dnssec:
  enabled: false
  security_acknowledgment: "I understand that disabling DNSSEC allows DNS spoofing attacks"
Monitoring
Prometheus Metrics
Agent metrics (port 9100):
opengslb_agent_backends_registered - Number of backends registered
opengslb_agent_heartbeats_sent_total - Heartbeats sent
opengslb_agent_heartbeat_failures_total - Failed heartbeats
opengslb_predictive_bleeding - Predictive health signal
Overwatch metrics (port 9090):
opengslb_overwatch_agents_registered - Registered agents
opengslb_overwatch_backends_total - Total backends
opengslb_overwatch_backends_healthy - Healthy backends
opengslb_overwatch_stale_agents_total - Stale agents
opengslb_overwatch_validation_checks_total - Validation checks
opengslb_dns_queries_total - DNS queries served
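For quick checks without a Prometheus server, the metrics endpoint output can be parsed directly. The Python sketch below reads Prometheus text-format lines and computes the share of healthy backends. It is a deliberately simple parser that assumes no timestamps on sample lines and sums values across label sets; the function names are hypothetical and the sample values are invented for illustration.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse 'name value' or 'name{labels} value' lines into a dict."""
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE/comment lines
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        values[name] = values.get(name, 0.0) + float(value)
    return values


def healthy_ratio(metrics: Dict[str, float]) -> float:
    """Return healthy backends divided by total backends (0.0 if none)."""
    total = metrics.get("opengslb_overwatch_backends_total", 0.0)
    if total == 0:
        return 0.0
    return metrics.get("opengslb_overwatch_backends_healthy", 0.0) / total


# Sample scrape output; the numbers are made up for this example.
SAMPLE = """
# sample values invented for illustration
opengslb_overwatch_backends_total 4
opengslb_overwatch_backends_healthy 3
"""

if __name__ == "__main__":
    print(healthy_ratio(parse_metrics(SAMPLE)))
```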
Alerting Examples
# Prometheus alerting rules
groups:
  - name: opengslb
    rules:
      - alert: HighStaleAgents
        expr: opengslb_overwatch_stale_agents_total > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agents are stale"
      - alert: LowHealthyBackends
        expr: opengslb_overwatch_backends_healthy < 2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Less than 2 healthy backends"
Troubleshooting
Agent not registering
Check gossip connectivity:
nc -zv overwatch-1.internal 7946
Verify encryption key matches between agent and Overwatch
Check service token matches agent_tokens in Overwatch config
Check agent logs:
journalctl -u opengslb-agent -f
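Where nc is not installed, the connectivity check from the first step can be approximated in Python. Like nc -zv without the -u flag, this sketch only probes TCP; it says nothing about UDP reachability or about whether the encryption key and tokens match. The function name tcp_reachable is hypothetical.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hostname and port taken from the examples in this guide.
    print(tcp_reachable("overwatch-1.internal", 7946))
```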
Backend marked stale
Check that the agent is running and sending heartbeats
Check heartbeat metrics:
opengslb_agent_heartbeats_sent_total
Check network connectivity between agent and Overwatch
Overwatch external validation may recover stale backends if the service is actually healthy
DNS not resolving
Verify Overwatch is serving DNS:
dig @overwatch-1.internal myapp.gslb.example.com
Check registered backends:
curl http://overwatch:8080/api/v1/backends
Check healthy backends:
curl http://overwatch:8080/api/v1/backends/healthy
DNSSEC validation failing
Verify DS records are published in parent zone
Check key sync between Overwatches:
curl http://overwatch:8080/api/v1/dnssec/sync/status
Security Checklist
Gossip encryption key is securely stored and rotated periodically
Service tokens are unique per application
API endpoints are IP-restricted
DNSSEC is enabled
Agent certificates are stored with appropriate permissions
Overwatch nodes are in a private network
Metrics endpoints are not publicly exposed
Migration from Legacy Mode
If migrating from --mode=standalone (Sprint 3 and earlier):
Your existing configuration with regions and servers still works
For dynamic registration, deploy agents alongside your applications
Overwatches will serve backends from both static config and agent registration
Gradually migrate to agent-based registration for full features
Document Version: 1.0
Last Updated: December 2025