Agent-Overwatch Deployment Guide

This guide covers deploying OpenGSLB using the agent-overwatch architecture introduced in Sprint 5.

Architecture Overview

The agent-overwatch model consists of two components:

  1. Agent: Runs on application servers, monitors local health, gossips state to Overwatch nodes

  2. Overwatch: Runs adjacent to DNS infrastructure, validates health claims, serves authoritative DNS

Key Principles

  • No VIPs required: DNS clients list several nameservers in resolv.conf and retry against the next one automatically

  • No cluster coordination: Each Overwatch operates independently

  • Security by default: Mandatory gossip encryption, TOFU authentication, DNSSEC enabled

  • Overwatch always wins: External validation overrides agent health claims

Prerequisites

  • Go 1.21+ (for building from source)

  • Network connectivity between agents and Overwatches (port 7946 for gossip)

  • DNS port access (port 53 or custom) for Overwatch nodes

Deployment Patterns

Pattern 1: Simple (1 Overwatch, N Agents)

┌─────────────────────────────────────────────────────────┐
│                   DNS Clients                           │
│                        │                                │
│                        ▼                                │
│                   ┌──────────┐                          │
│                   │Overwatch │                          │
│                   │ 10.0.1.53│                          │
│                   └────┬─────┘                          │
│                        │ Gossip                         │
│            ┌───────────┼───────────┐                    │
│            ▼           ▼           ▼                    │
│       ┌────────┐  ┌────────┐  ┌────────┐                │
│       │ Agent  │  │ Agent  │  │ Agent  │                │
│       │ + App  │  │ + App  │  │ + App  │                │
│       └────────┘  └────────┘  └────────┘                │
└─────────────────────────────────────────────────────────┘

Pattern 2: High Availability (Multiple Independent Overwatches)

┌─────────────────────────────────────────────────────────────┐
│  DNS Clients (resolv.conf with multiple nameservers)        │
│        │            │            │                          │
│        ▼            ▼            ▼                          │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐                     │
│  │Overwatch1│ │Overwatch2│ │Overwatch3│                     │
│  │10.0.1.53 │ │10.0.1.54 │ │10.0.1.55 │                     │
│  └─────┬────┘ └─────┬────┘ └─────┬────┘                     │
│        │            │            │                          │
│        └────────────┼────────────┘                          │
│                     │ Gossip (all agents → all overwatches) │
│         ┌───────────┼───────────┐                           │
│         ▼           ▼           ▼                           │
│    ┌────────┐  ┌────────┐  ┌────────┐                       │
│    │ Agent  │  │ Agent  │  │ Agent  │                       │
│    │ + App  │  │ + App  │  │ + App  │                       │
│    └────────┘  └────────┘  └────────┘                       │
└─────────────────────────────────────────────────────────────┘

Step-by-Step Deployment

Step 1: Generate Shared Secrets

# Generate gossip encryption key (32 bytes, base64 encoded)
GOSSIP_KEY=$(openssl rand -base64 32)
echo "Gossip Key: $GOSSIP_KEY"

# Generate service token for each application
MYAPP_TOKEN=$(openssl rand -base64 32)
echo "MyApp Token: $MYAPP_TOKEN"
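
The same secrets can also be produced and sanity-checked from Python. The sketch below mirrors `openssl rand -base64 32` and confirms that a key decodes to exactly 32 bytes. The helper names are illustrative, and whether OpenGSLB itself rejects keys of another length is an assumption rather than something this guide states.

```python
import base64
import binascii
import secrets


def generate_key(num_bytes: int = 32) -> str:
    """Return num_bytes of random data, base64 encoded (like `openssl rand -base64 32`)."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


def decoded_length(key_b64: str) -> int:
    """Return the decoded byte length of a base64 key, or -1 if it is not valid base64."""
    try:
        return len(base64.b64decode(key_b64, validate=True))
    except (binascii.Error, ValueError):
        return -1


if __name__ == "__main__":
    gossip_key = generate_key()
    print("Gossip key decodes to", decoded_length(gossip_key), "bytes")
```

Running the check before distributing a key catches copy-and-paste truncation early, which would otherwise show up only as agents failing to join.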

Step 2: Deploy Overwatch Nodes

Create /etc/opengslb/overwatch.yaml:

mode: overwatch

identity:
  node_id: overwatch-us-east-1
  region: us-east

dns:
  listen_address: "0.0.0.0:53"
  zones:
    - gslb.example.com
  default_ttl: 30

dnssec:
  enabled: true
  key_sync:
    peers:
      - "https://overwatch-2.internal:9090"
      - "https://overwatch-3.internal:9090"
    poll_interval: "1h"

# Service tokens - agents must present matching token
agent_tokens:
  myapp: "${MYAPP_TOKEN}"
  otherapp: "${OTHERAPP_TOKEN}"

gossip:
  bind_address: "0.0.0.0:7946"
  encryption_key: "${GOSSIP_KEY}"  # REQUIRED

validation:
  enabled: true
  check_interval: 30s
  check_timeout: 5s

stale:
  threshold: 30s      # Mark stale after 30s without a heartbeat
  remove_after: 5m    # Remove backend after 5m stale

api:
  address: "127.0.0.1:8080"  # Localhost only by default; change for remote access
  allowed_networks:
    - 10.0.0.0/8
    - 192.168.0.0/16

metrics:
  enabled: true
  address: "0.0.0.0:9090"

data_dir: /var/lib/opengslb

logging:
  level: info
  format: json
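
The configuration above refers to the secrets from Step 1 through `${MYAPP_TOKEN}`, `${OTHERAPP_TOKEN}`, and `${GOSSIP_KEY}` placeholders. This guide does not say how those placeholders are expanded, so the following is only a sketch of one option: rendering the file in deployment tooling before it is written to disk. The function name `render_config` is hypothetical and not part of OpenGSLB.

```python
import re
from string import Template

# Matches ${NAME} placeholders such as ${GOSSIP_KEY}.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def render_config(template_text: str, env: dict) -> str:
    """Substitute ${NAME} placeholders, refusing to render if any are unset or empty."""
    names = set(PLACEHOLDER.findall(template_text))
    missing = sorted(name for name in names if not env.get(name))
    if missing:
        raise KeyError("unset variables: " + ", ".join(missing))
    return Template(template_text).substitute(env)


if __name__ == "__main__":
    sample = 'gossip:\n  encryption_key: "${GOSSIP_KEY}"\n'
    print(render_config(sample, {"GOSSIP_KEY": "example-key"}))
```

Failing on an unset variable is deliberate: an empty encryption key or token is easier to diagnose at render time than as a failed registration later.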

Start Overwatch:

# Build from source
go build -o opengslb ./cmd/opengslb

# Run in the foreground (for production, run this command from a systemd unit)
./opengslb --config /etc/opengslb/overwatch.yaml

Step 3: Deploy Agents

Create /etc/opengslb/agent.yaml on each application server:

mode: agent

identity:
  service_token: "${MYAPP_TOKEN}"
  region: us-east
  # Certificate auto-generated on first start at /var/lib/opengslb/

backends:
  - service: myapp
    address: 127.0.0.1
    port: 8080
    weight: 100
    health_check:
      type: http
      path: /health
      interval: 5s
      timeout: 2s
      failure_threshold: 3
      success_threshold: 2

predictive:
  enabled: true
  cpu_threshold: 85
  memory_threshold: 90
  error_rate_threshold: 5
  check_interval: 10s

gossip:
  encryption_key: "${GOSSIP_KEY}"  # Must match Overwatch
  overwatch_nodes:
    - overwatch-1.internal:7946
    - overwatch-2.internal:7946
    - overwatch-3.internal:7946

heartbeat:
  interval: 10s
  missed_threshold: 3

data_dir: /var/lib/opengslb

logging:
  level: info
  format: json

metrics:
  enabled: true
  address: "127.0.0.1:9100"  # Local only for agent metrics
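
The `failure_threshold` and `success_threshold` settings in the health check above keep a single flaky probe from flipping a backend's state. The sketch below assumes the common convention that both thresholds count consecutive results; the guide does not confirm that OpenGSLB counts them this way, and the class name is illustrative.

```python
from dataclasses import dataclass


@dataclass
class ThresholdTracker:
    """Flip health state only after enough consecutive results in the other direction."""
    failure_threshold: int = 3
    success_threshold: int = 2
    healthy: bool = True
    _streak: int = 0  # consecutive results that disagree with the current state

    def observe(self, check_passed: bool) -> bool:
        """Record one check result and return the resulting health state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with the current state; reset
            return self.healthy
        self._streak += 1
        needed = self.success_threshold if check_passed else self.failure_threshold
        if self._streak >= needed:
            self.healthy = check_passed
            self._streak = 0
        return self.healthy


if __name__ == "__main__":
    tracker = ThresholdTracker()
    print([tracker.observe(ok) for ok in [False, False, False, True, True]])
```

Under that assumption, with `interval: 5s` and `failure_threshold: 3`, a backend that starts failing is reported unhealthy after roughly three check intervals.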

Start Agent:

./opengslb --config /etc/opengslb/agent.yaml

Step 4: Configure DNS Clients

Configure client /etc/resolv.conf:

nameserver 10.0.1.53
nameserver 10.0.1.54
nameserver 10.0.1.55
options timeout:2 attempts:3
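
To see what `timeout:2 attempts:3` means for failover, the sketch below uses the simple model described in resolv.conf(5): the resolver waits `timeout` seconds on a nameserver before moving to the next, and cycles through the list up to `attempts` times. Real resolver implementations may adjust timeouts between passes, so treat the numbers as rough estimates.

```python
def worst_case_lookup_seconds(nameservers: int, timeout: int, attempts: int) -> int:
    """Upper bound when no nameserver answers: every attempt waits on every server."""
    return nameservers * timeout * attempts


def delay_before_healthy_server(dead_servers_first: int, timeout: int) -> int:
    """Seconds lost on the first pass if the first N listed nameservers are down."""
    return dead_servers_first * timeout


if __name__ == "__main__":
    print("all Overwatches down:", worst_case_lookup_seconds(3, 2, 3), "s")
    print("first Overwatch down:", delay_before_healthy_server(1, 2), "s")
```

With the three nameservers listed above, a single failed Overwatch costs a client about one timeout per lookup under this model.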

Alternatively, on corporate networks, configure your existing DNS server to forward the GSLB zone to the Overwatch nodes:

BIND example (named.conf):

zone "gslb.example.com" {
    type forward;
    forward only;
    forwarders {
        10.0.1.53;
        10.0.1.54;
        10.0.1.55;
    };
};

Multi-Backend Agent Configuration

An agent can register multiple backends (services):

mode: agent

identity:
  service_token: "${TOKEN}"
  region: us-east

backends:
  - service: web
    address: 127.0.0.1
    port: 8080
    weight: 100
    health_check:
      type: http
      path: /health
      interval: 5s
      timeout: 2s

  - service: api
    address: 127.0.0.1
    port: 9090
    weight: 100
    health_check:
      type: http
      path: /api/health
      interval: 5s
      timeout: 2s

  - service: grpc
    address: 127.0.0.1
    port: 50051
    weight: 100
    health_check:
      type: tcp
      interval: 10s
      timeout: 3s

gossip:
  encryption_key: "${GOSSIP_KEY}"
  overwatch_nodes:
    - overwatch-1.internal:7946

Health Authority Hierarchy

Overwatch uses a priority-based health determination:

Priority      Source                 Description
1 (highest)   Manual Override        Via API, persists until cleared
2             External Tool          CloudWatch, Watcher integration
3             Overwatch Validation   External health check by Overwatch
4 (lowest)    Agent Claim            Agent’s local health check

Key behavior: Overwatch validation ALWAYS wins over agent claims, so an agent that misreports its health cannot keep an unhealthy backend in DNS responses.
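
The hierarchy above can be read as "the highest-priority source that has an opinion decides". The following is a minimal sketch of that rule, not OpenGSLB's actual implementation; the `Source` enum and `effective_health` function are illustrative names.

```python
from enum import IntEnum
from typing import Optional


class Source(IntEnum):
    """Lower value means higher priority, matching the hierarchy above."""
    MANUAL_OVERRIDE = 1
    EXTERNAL_TOOL = 2
    OVERWATCH_VALIDATION = 3
    AGENT_CLAIM = 4


def effective_health(signals: dict) -> Optional[bool]:
    """Return the verdict of the highest-priority source that has an opinion.

    `signals` maps Source -> True/False; sources with no opinion are absent.
    Returns None when no source has reported anything.
    """
    if not signals:
        return None
    return signals[min(signals)]


if __name__ == "__main__":
    # The agent claims healthy, but Overwatch validation disagrees.
    print(effective_health({Source.AGENT_CLAIM: True,
                            Source.OVERWATCH_VALIDATION: False}))
```

In the usage example the agent's claim is ignored because Overwatch validation ranks above it, which is the key behavior described above.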

Stale Backend Recovery

If an agent stops sending heartbeats but the backend service is still healthy:

  1. The backend is marked stale after stale.threshold (default: 30s) without a heartbeat

  2. Overwatch external validation continues checking stale backends

  3. If validation succeeds, the backend is recovered to healthy status

  4. The backend is removed only after it has been stale for stale.remove_after (default: 5m)
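
The recovery steps above can be sketched as a small classification function. With the agent settings shown earlier (a heartbeat every 10s and missed_threshold: 3), three missed heartbeats span the same 30s as stale.threshold. The sketch assumes that the remove_after clock starts when the backend becomes stale and that a passing validation keeps the backend in service; neither detail is spelled out in this guide, and the function name is illustrative.

```python
STALE_THRESHOLD = 30.0   # seconds; stale.threshold from overwatch.yaml
REMOVE_AFTER = 300.0     # seconds; stale.remove_after from overwatch.yaml


def stale_backend_state(seconds_since_heartbeat: float, validation_ok: bool) -> str:
    """Classify a backend whose agent may have gone quiet."""
    if seconds_since_heartbeat < STALE_THRESHOLD:
        return "fresh"       # heartbeats are current; nothing to recover
    if validation_ok:
        return "healthy"     # step 3: external validation recovered it
    if seconds_since_heartbeat - STALE_THRESHOLD >= REMOVE_AFTER:
        return "removed"     # step 4: stale for the full remove_after window
    return "stale"           # steps 1-2: stale, validation still being attempted


if __name__ == "__main__":
    for age, ok in [(10, False), (45, True), (45, False), (330, False)]:
        print(age, ok, stale_backend_state(age, ok))
```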

External Override API

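External tools can override health state through the Overwatch API. The examples below assume the API is reachable from the caller: the sample configuration binds it to 127.0.0.1:8080, so remote callers need a non-loopback api address and a source IP within allowed_networks.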

# Mark backend unhealthy
curl -X PUT http://overwatch:8080/api/v1/overrides/myapp/10.0.1.10 \
  -H "Content-Type: application/json" \
  -d '{"healthy": false, "reason": "High latency from CloudWatch"}'

# Clear override
curl -X DELETE http://overwatch:8080/api/v1/overrides/myapp/10.0.1.10

# List all overrides
curl http://overwatch:8080/api/v1/overrides
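
For an integration that sets overrides programmatically, the curl calls above translate directly into standard-library Python. This sketch only builds the requests; the endpoint paths and JSON fields are copied from the curl examples, while the function names and the `BASE_URL` constant are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://overwatch:8080"  # host and port taken from the curl examples


def override_request(service: str, address: str, healthy: bool,
                     reason: str) -> urllib.request.Request:
    """Build the PUT request that sets an override (the first curl call above)."""
    body = json.dumps({"healthy": healthy, "reason": reason}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/overrides/{service}/{address}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def clear_request(service: str, address: str) -> urllib.request.Request:
    """Build the DELETE request that clears an override."""
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/overrides/{service}/{address}", method="DELETE"
    )


if __name__ == "__main__":
    req = override_request("myapp", "10.0.1.10", False, "High latency from CloudWatch")
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req, timeout=5)
```

Keeping request construction separate from sending makes the integration easy to test without a running Overwatch.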

DNSSEC Configuration

DNSSEC is enabled by default. To get DS records for parent zone delegation:

curl http://overwatch:8080/api/v1/dnssec/ds

Response:

{
  "zone": "gslb.example.com",
  "ds_records": [
    {
      "key_tag": 12345,
      "algorithm": 13,
      "digest_type": 2,
      "digest": "abc123...",
      "ds_record": "gslb.example.com. IN DS 12345 13 2 abc123..."
    }
  ]
}
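
When handing DS records to a registrar or parent-zone operator, it can help to extract the ready-made record strings and label the numeric fields. The sketch below works on the response shape shown above; the lookup tables are small subsets of the IANA registries, and the function names are illustrative.

```python
import json

# Subsets of the IANA registries, enough to label the sample response.
ALGORITHMS = {8: "RSASHA256", 13: "ECDSAP256SHA256", 15: "ED25519"}
DIGEST_TYPES = {1: "SHA-1", 2: "SHA-256", 4: "SHA-384"}


def ds_records(response_text: str) -> list:
    """Return the ready-to-publish DS record strings from the API response."""
    payload = json.loads(response_text)
    return [entry["ds_record"] for entry in payload.get("ds_records", [])]


def describe(entry: dict) -> str:
    """Summarise one ds_records entry in human-readable form."""
    algorithm = ALGORITHMS.get(entry["algorithm"], "unknown")
    digest = DIGEST_TYPES.get(entry["digest_type"], "unknown")
    return f"key_tag={entry['key_tag']} algorithm={algorithm} digest={digest}"


if __name__ == "__main__":
    sample = json.dumps({
        "zone": "gslb.example.com",
        "ds_records": [{
            "key_tag": 12345, "algorithm": 13, "digest_type": 2,
            "digest": "abc123...",
            "ds_record": "gslb.example.com. IN DS 12345 13 2 abc123...",
        }],
    })
    print(ds_records(sample))
    print(describe(json.loads(sample)["ds_records"][0]))
```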

To disable DNSSEC (not recommended):

dnssec:
  enabled: false
  security_acknowledgment: "I understand that disabling DNSSEC allows DNS spoofing attacks"

Monitoring

Prometheus Metrics

Agent metrics (port 9100):

  • opengslb_agent_backends_registered - Number of backends registered

  • opengslb_agent_heartbeats_sent_total - Heartbeats sent

  • opengslb_agent_heartbeat_failures_total - Failed heartbeats

  • opengslb_predictive_bleeding - Predictive health signal

Overwatch metrics (port 9090):

  • opengslb_overwatch_agents_registered - Registered agents

  • opengslb_overwatch_backends_total - Total backends

  • opengslb_overwatch_backends_healthy - Healthy backends

  • opengslb_overwatch_stale_agents_total - Stale agents

  • opengslb_overwatch_validation_checks_total - Validation checks

  • opengslb_dns_queries_total - DNS queries served
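
For a quick check without a full Prometheus setup, the metrics endpoint output can be parsed directly. The sketch below reads the text exposition format and computes the share of healthy backends from two of the metrics listed above. It assumes those metrics are exposed as unlabelled series, which this guide does not specify, and the function names are illustrative.

```python
from typing import Optional


def parse_samples(exposition_text: str) -> dict:
    """Parse unlabelled samples from Prometheus text exposition output."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue  # skip blanks, HELP/TYPE comments, and labelled series
        parts = line.split()
        if len(parts) >= 2:
            samples[parts[0]] = float(parts[1])
    return samples


def healthy_ratio(samples: dict) -> Optional[float]:
    """Return healthy/total backends, or None when no backends are known."""
    total = samples.get("opengslb_overwatch_backends_total", 0.0)
    if total == 0:
        return None
    return samples.get("opengslb_overwatch_backends_healthy", 0.0) / total


if __name__ == "__main__":
    text = (
        "# TYPE opengslb_overwatch_backends_total gauge\n"
        "opengslb_overwatch_backends_total 4\n"
        "opengslb_overwatch_backends_healthy 3\n"
    )
    print(healthy_ratio(parse_samples(text)))
```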

Alerting Examples

# Prometheus alerting rules
groups:
  - name: opengslb
    rules:
      - alert: HighStaleAgents
        expr: opengslb_overwatch_stale_agents_total > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agents are stale"

      - alert: LowHealthyBackends
        expr: opengslb_overwatch_backends_healthy < 2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fewer than 2 healthy backends"

Troubleshooting

Agent not registering

  1. Check gossip connectivity:

    nc -zv overwatch-1.internal 7946
    
  2. Verify encryption key matches between agent and Overwatch

  3. Check service token matches agent_tokens in Overwatch config

  4. Check agent logs:

    journalctl -u opengslb-agent -f
    

Backend marked stale

  1. Check that the agent is running and sending heartbeats

  2. Check heartbeat metrics: opengslb_agent_heartbeats_sent_total

  3. Check network connectivity between agent and Overwatch

  4. Overwatch external validation may recover stale backends if the service is actually healthy

DNS not resolving

  1. Verify Overwatch is serving DNS:

    dig @overwatch-1.internal myapp.gslb.example.com
    
  2. Check registered backends:

    curl http://overwatch:8080/api/v1/backends
    
  3. Check healthy backends:

    curl http://overwatch:8080/api/v1/backends/healthy
    

DNSSEC validation failing

  1. Verify DS records are published in parent zone

  2. Check key sync between Overwatches:

    curl http://overwatch:8080/api/v1/dnssec/sync/status
    

Security Checklist

  • Gossip encryption key is securely stored and rotated periodically

  • Service tokens are unique per application

  • API endpoints are IP-restricted

  • DNSSEC is enabled

  • Agent certificates are stored with appropriate permissions

  • Overwatch nodes are on a private network

  • Metrics endpoints are not publicly exposed

Migration from Legacy Mode

If migrating from --mode=standalone (Sprint 3 and earlier):

  1. Your existing configuration with regions and servers still works

  2. For dynamic registration, deploy agents alongside your applications

  3. Overwatches will serve backends from both static config and agent registration

  4. Gradually migrate to agent-based registration for full features


Document Version: 1.0
Last Updated: December 2025