Configuration Reference

OpenGSLB is configured via a YAML file. By default, it looks for /etc/opengslb/config.yaml, but you can specify a different path with the --config flag.

Configuration File Security

OpenGSLB enforces strict file permissions on the configuration file. The config file must not be world-readable (no “other” read permission).

# Correct permissions (owner read/write, group read)
chmod 640 /etc/opengslb/config.yaml

# Or more restrictive (owner only)
chmod 600 /etc/opengslb/config.yaml

If the file has insecure permissions, OpenGSLB will refuse to start and display an error message.
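
The same rule can be checked before deploying a config file. The sketch below is a minimal Python illustration, assuming the check looks only at the "other" read bit as described above; `is_world_readable` is a hypothetical helper, not part of OpenGSLB.

```python
import os
import stat


def is_world_readable(path):
    """Return True if the 'other' read permission bit is set on path."""
    mode = os.stat(path).st_mode
    # S_IROTH is the read bit for users who are neither owner nor group.
    return bool(mode & stat.S_IROTH)
```

A file with mode 640 or 600 passes this check; a file with mode 644 does not.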

Runtime Mode (ADR-015)

OpenGSLB operates in one of two modes:

Mode	Description
`overwatch`	DNS-serving, health-validating authority node. Receives agent heartbeats, validates health claims, serves DNS.
`agent`	Health-reporting agent on application servers. Monitors local backends, gossips status to Overwatch nodes.

# Set runtime mode
mode: overwatch  # or "agent"

If mode is not specified, OpenGSLB defaults to overwatch mode.

Agent Mode Configuration

Agent mode runs on application servers to monitor local backends and report health to Overwatch nodes.

mode: agent

agent:
  identity:
    service_token: "pre-shared-token-for-auth"
    region: "us-east-1"
    cert_path: /var/lib/opengslb/agent.crt
    key_path: /var/lib/opengslb/agent.key

  backends:
    - service: "web-service"
      address: "127.0.0.1"
      port: 8080
      weight: 100
      health_check:
        type: http
        interval: 10s
        timeout: 5s
        path: /health
        failure_threshold: 3
        success_threshold: 2

  gossip:
    encryption_key: "base64-encoded-32-byte-key"
    overwatch_nodes:
      - "overwatch-1.internal:7946"
      - "overwatch-2.internal:7946"

  heartbeat:
    interval: 10s
    missed_threshold: 3

  predictive:
    enabled: true
    cpu:
      threshold: 80
      bleed_duration: 30s
    memory:
      threshold: 85
      bleed_duration: 30s
    error_rate:
      threshold: 5
      window: 60s
      bleed_duration: 30s

Agent Identity Settings

Field	Type	Default	Description
`service_token`	string	Required	Pre-shared token for initial authentication with Overwatch
`region`	string	Required	Geographic region this agent belongs to
`cert_path`	string	`/var/lib/opengslb/agent.crt`	Path to store/load agent certificate
`key_path`	string	`/var/lib/opengslb/agent.key`	Path to store/load agent private key

Agent Backend Settings

Field	Type	Default	Description
`service`	string	Required	Service name (maps to DNS domain)
`address`	string	Required	Backend server IP address
`port`	integer	Required	Backend server port
`weight`	integer	`100`	Routing weight (1-1000)
`health_check`	object	Required	Health check configuration

Agent Gossip Settings

Field	Type	Default	Description
`encryption_key`	string	Required	Base64-encoded 32-byte encryption key
`overwatch_nodes`	list	Required	List of Overwatch gossip addresses

Generate an encryption key with:

openssl rand -base64 32

Agent Heartbeat Settings

Field	Type	Default	Description
`interval`	duration	`10s`	Time between heartbeat messages
`missed_threshold`	integer	`3`	Missed heartbeats before deregistration

Agent Predictive Health Settings

Predictive health allows agents to signal impending failures before they impact traffic.

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable predictive health monitoring
`cpu.threshold`	float	`80`	CPU usage percentage to trigger bleed
`cpu.bleed_duration`	duration	`30s`	Duration to gradually drain traffic
`memory.threshold`	float	`85`	Memory usage percentage to trigger bleed
`memory.bleed_duration`	duration	`30s`	Duration to gradually drain traffic
`error_rate.threshold`	float	`5`	Error rate percentage to trigger bleed
`error_rate.window`	duration	`60s`	Window for error rate calculation
`error_rate.bleed_duration`	duration	`30s`	Duration to gradually drain traffic

Agent Latency Learning Settings (ADR-017)

Passive latency learning allows agents to collect real client-to-backend TCP RTT data and report it to Overwatch for intelligent routing. This captures actual client experience rather than Overwatch-to-backend latency.

agent:
  latency_learning:
    enabled: true
    poll_interval: 10s
    min_connection_age: 5s
    ipv4_prefix: 24
    ipv6_prefix: 48
    ewma_alpha: 0.3
    max_subnets: 100000
    subnet_ttl: 168h
    min_samples: 5
    report_interval: 30s

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable passive latency learning
`poll_interval`	duration	`10s`	How often to poll OS for TCP connection RTT data
`min_connection_age`	duration	`5s`	Minimum connection age before collecting RTT (new connections have unstable RTT)
`ipv4_prefix`	integer	`24`	IPv4 subnet prefix for aggregation (e.g., /24 groups all 10.0.1.x together)
`ipv6_prefix`	integer	`48`	IPv6 subnet prefix for aggregation
`ewma_alpha`	float	`0.3`	EWMA smoothing factor (0-1). Higher = more responsive to recent samples
`max_subnets`	integer	`100000`	Maximum subnets to track (prevents unbounded memory growth)
`subnet_ttl`	duration	`168h`	How long to keep subnet entries without updates (7 days default)
`min_samples`	integer	`5`	Minimum samples before reporting a subnet’s latency
`report_interval`	duration	`30s`	How often to send latency reports to Overwatch via gossip
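
To make the interaction of `ipv4_prefix`, `ipv6_prefix`, `ewma_alpha`, and `min_samples` concrete, here is a rough Python sketch. It assumes the conventional EWMA update (alpha times the new sample plus one minus alpha times the previous average) and that the first sample seeds the average; `subnet_key` and `SubnetLatency` are hypothetical names, not the agent's actual implementation.

```python
import ipaddress


def subnet_key(client_ip, ipv4_prefix=24, ipv6_prefix=48):
    """Map a client IP to the subnet used for aggregation."""
    addr = ipaddress.ip_address(client_ip)
    prefix = ipv4_prefix if addr.version == 4 else ipv6_prefix
    # strict=False masks off the host bits instead of raising an error.
    return str(ipaddress.ip_network(f"{client_ip}/{prefix}", strict=False))


class SubnetLatency:
    """Tracks a smoothed RTT and a sample count per client subnet."""

    def __init__(self, alpha=0.3, min_samples=5):
        self.alpha = alpha
        self.min_samples = min_samples
        self.entries = {}  # subnet -> (smoothed_rtt_ms, sample_count)

    def observe(self, client_ip, rtt_ms):
        key = subnet_key(client_ip)
        if key not in self.entries:
            # Assumption: the first sample seeds the average.
            self.entries[key] = (rtt_ms, 1)
            return
        average, count = self.entries[key]
        smoothed = self.alpha * rtt_ms + (1 - self.alpha) * average
        self.entries[key] = (smoothed, count + 1)

    def reportable(self):
        """Subnets with enough samples to be included in a report."""
        return {
            subnet: average
            for subnet, (average, count) in self.entries.items()
            if count >= self.min_samples
        }
```

With the default /24 prefix, clients 10.0.1.5 and 10.0.1.200 land in the same entry, 10.0.1.0/24.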

Requirements:

Linux: CAP_NET_ADMIN capability or root privileges
Windows: Administrator privileges (uses GetPerTcpConnectionEStats API)

Grant capability on Linux:

sudo setcap cap_net_admin+ep /usr/local/bin/opengslb

Overwatch Mode Configuration

Overwatch mode serves DNS and validates health claims from agents.

mode: overwatch

overwatch:
  identity:
    node_id: "overwatch-1"
    region: "us-east-1"

  agent_tokens:
    web-service: "pre-shared-token-for-web-service"
    api-service: "pre-shared-token-for-api-service"

  gossip:
    bind_address: "0.0.0.0:7946"
    encryption_key: "base64-encoded-32-byte-key"
    probe_interval: 1s
    probe_timeout: 500ms
    gossip_interval: 200ms

  validation:
    enabled: true
    check_interval: 30s
    check_timeout: 5s

  stale:
    threshold: 30s
    remove_after: 5m

  data_dir: /var/lib/opengslb

  dnssec:
    enabled: true
    algorithm: ECDSAP256SHA256

Overwatch Identity Settings

Field	Type	Default	Description
`node_id`	string	hostname	Unique identifier for this Overwatch node
`region`	string	(empty)	Geographic region this Overwatch serves

Overwatch Agent Tokens

Map of service names to authentication tokens. Agents must provide matching tokens to register.

overwatch:
  agent_tokens:
    web-service: "token-for-web"
    api-service: "token-for-api"

Overwatch Gossip Settings

Field	Type	Default	Description
`bind_address`	string	`0.0.0.0:7946`	Address to listen for agent gossip
`encryption_key`	string	Required	Base64-encoded 32-byte key (must match agents)
`probe_interval`	duration	`1s`	Interval between failure probes
`probe_timeout`	duration	`500ms`	Timeout for a single probe
`gossip_interval`	duration	`200ms`	Interval between gossip messages

Overwatch Validation Settings

External validation allows Overwatch to independently verify agent health claims.

Field	Type	Default	Description
`enabled`	boolean	`true`	Enable external health validation
`check_interval`	duration	`30s`	Frequency of validation checks
`check_timeout`	duration	`5s`	Timeout for validation checks

Important: Per ADR-015, Overwatch validation ALWAYS wins over agent claims. This prevents agents from falsely claiming healthy status.

Overwatch Stale Settings

Configure when backends are considered stale (no recent heartbeat from agent).

Field	Type	Default	Description
`threshold`	duration	`30s`	Time without heartbeat before marking stale
`remove_after`	duration	`5m`	Time after which stale backends are removed
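
The two timers can be read as a simple classification of each backend by heartbeat age. The Python sketch below assumes both `threshold` and `remove_after` are measured from the last received heartbeat, which this reference does not state explicitly; `backend_state` is a hypothetical helper.

```python
def backend_state(seconds_since_heartbeat, threshold=30.0, remove_after=300.0):
    """Classify a backend by the age of its last heartbeat."""
    # Assumption: both timers start at the last heartbeat.
    if seconds_since_heartbeat >= remove_after:
        return "removed"
    if seconds_since_heartbeat >= threshold:
        return "stale"
    return "fresh"
```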

Overwatch Data Directory

Field	Type	Default	Description
`data_dir`	string	`/var/lib/opengslb`	Directory for persistent data (bbolt database)

Overwatch DNSSEC Settings

Field	Type	Default	Description
`enabled`	boolean	`true`	Enable DNSSEC signing
`security_acknowledgment`	string	(empty)	Required if disabling DNSSEC
`algorithm`	string	`ECDSAP256SHA256`	DNSSEC signing algorithm

Configuration Sections

DNS Configuration

Controls the DNS server behavior.

dns:
  listen_address: ":53"
  default_ttl: 60
  return_last_healthy: false

Field	Type	Default	Description
`listen_address`	string	`:53`	Address and port to listen on. Format: `ip:port` or `:port` for all interfaces.
`default_ttl`	integer	`60`	Default TTL (seconds) for DNS responses. Clients cache responses for this duration.
`return_last_healthy`	boolean	`false`	When all servers are unhealthy: `false` returns SERVFAIL, `true` returns the last known healthy IP.

Notes:

Lower TTL = faster failover but higher DNS query volume
Binding to port 53 requires root privileges or, on Linux, the CAP_NET_BIND_SERVICE capability. Use a high port (e.g., :5353) for non-root operation
return_last_healthy: true enables “limp mode” - degraded service instead of complete failure

Logging Configuration

Controls log output format and verbosity.

logging:
  level: info
  format: json

Field	Type	Default	Description
`level`	string	`info`	Log level: `debug`, `info`, `warn`, `error`
`format`	string	`json`	Output format: `json` (structured) or `text` (human-readable)

Format recommendations:

Use json for production deployments with log aggregation (ELK, Splunk, Loki)
Use text for development and debugging

Metrics Configuration

Controls Prometheus metrics exposure.

metrics:
  enabled: true
  address: ":9090"

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable/disable the metrics HTTP endpoint
`address`	string	`:9090`	Address and port for the metrics server

When enabled, metrics are served at http://<address>/metrics and a health check endpoint at http://<address>/health.

Regions Configuration

Defines geographic regions/data centers and their backend servers.

regions:
  - name: us-east-1
    servers:
      - address: 10.0.1.10
        port: 80
        weight: 100
        service: "app.example.com"  # REQUIRED since v1.1.0
      - address: 10.0.1.11
        port: 80
        weight: 100
        service: "app.example.com"  # REQUIRED since v1.1.0
    health_check:
      type: http
      interval: 30s
      timeout: 5s
      path: /health
      failure_threshold: 3
      success_threshold: 2

Region Fields

Field	Type	Required	Description
`name`	string	Yes	Unique identifier for the region
`servers`	list	Yes	List of backend servers in this region
`health_check`	object	Yes	Health check configuration for servers in this region

Server Fields

Field	Type	Default	Description
`service`	string	Required	v1.1.0+: Domain/service this server belongs to (must match a configured domain name)
`address`	string	Required	IP address of the backend server
`port`	integer	`80`	Port number for health checks
`weight`	integer	`100`	Server weight for weighted routing (1-1000)
`host`	string	(empty)	Hostname for HTTPS health checks (for TLS SNI and certificate validation)

BREAKING CHANGE (v1.1.0): The service field is now required for all servers. This enables the unified server architecture where static, agent-registered, and API-registered servers all use the same validation system. The service field specifies which domain/service the server belongs to.

Health Check Fields

Field	Type	Default	Description
`type`	string	`http`	Check type: `http`, `https`, or `tcp`
`interval`	duration	`30s`	Time between health checks
`timeout`	duration	`5s`	Timeout for each check (must be < interval)
`path`	string	`/health`	HTTP/HTTPS path to check
`host`	string	(empty)	Host header for HTTPS checks (for TLS SNI and certificate validation)
`failure_threshold`	integer	`3`	Consecutive failures before marking unhealthy (1-10)
`success_threshold`	integer	`2`	Consecutive successes before marking healthy (1-10)

Health check behavior:

HTTP/HTTPS checks expect a 2xx response code
TCP checks only verify successful TCP connection (no data exchange)
A server starts as healthy and requires failure_threshold consecutive failures to become unhealthy
An unhealthy server requires success_threshold consecutive successes to become healthy again
For HTTPS checks with IP addresses, use host to set the Host header for TLS certificate validation
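
The threshold rules above amount to a small state machine. The following Python sketch illustrates that behavior under the stated rules (start healthy, flip only after enough consecutive contrary results); `HealthTracker` is a hypothetical class written for this example, not OpenGSLB's own code.

```python
class HealthTracker:
    """Applies failure_threshold and success_threshold to check results."""

    def __init__(self, failure_threshold=3, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = True  # servers start as healthy
        self._streak = 0     # consecutive results contradicting the current state

    def record(self, check_passed):
        """Record one check result and return the resulting health state."""
        if check_passed == self.healthy:
            # The result agrees with the current state, so any streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.failure_threshold if self.healthy else self.success_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy
```

With the defaults, three failed checks in a row mark the server unhealthy, and two passing checks in a row bring it back. A single passing check in between resets the failure count.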

When to use TCP checks:

Services without HTTP endpoints (databases, caches, custom protocols)
Quick connectivity verification without application-level validation
Services where the health endpoint isn’t exposed

Domains Configuration

Defines which domains OpenGSLB responds to and how traffic is routed.

domains:
  - name: app.example.com
    routing_algorithm: round-robin
    regions:
      - us-east-1
      - us-west-2
    ttl: 30

Field	Type	Default	Description
`name`	string	Required	Fully qualified domain name to respond to
`routing_algorithm`	string	`round-robin`	Algorithm: `round-robin`, `weighted`, `failover`, `geolocation`, `latency`, `learned_latency`
`regions`	list	Required	List of region names to route traffic to
`ttl`	integer	Uses `dns.default_ttl`	TTL for this domain’s responses (overrides default)

Notes:

Domain names are matched exactly (no wildcard support currently)
Queries for unconfigured domains receive NXDOMAIN
All servers from all listed regions form the candidate pool for routing

Duration Format

Duration fields accept Go duration strings:

30s - 30 seconds
5m - 5 minutes
1h - 1 hour
500ms - 500 milliseconds
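
For tooling that needs to read these values outside of Go, a simplified parser can be sketched in Python. This is an approximation of Go's duration syntax, not a full reimplementation: it handles unsigned values with the units ns, us, ms, s, m, and h, including compound forms such as 1h30m. `parse_duration` is a hypothetical helper.

```python
import re

_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
# Two-letter units come first so that "ms" is not read as "m" followed by "s".
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|ms|s|m|h)")


def parse_duration(text):
    """Convert a duration string such as '30s' or '1h30m' to seconds."""
    if not text:
        raise ValueError("empty duration")
    position, total = 0, 0.0
    while position < len(text):
        match = _TOKEN.match(text, position)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    return total
```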

Complete Example

# OpenGSLB Configuration
# /etc/opengslb/config.yaml

dns:
  listen_address: ":53"
  default_ttl: 60
  return_last_healthy: false

logging:
  level: info
  format: json

metrics:
  enabled: true
  address: ":9090"

regions:
  - name: us-east-1
    servers:
      - address: 10.0.1.10
        port: 80
        weight: 100
      - address: 10.0.1.11
        port: 80
        weight: 100
    health_check:
      type: http
      interval: 30s
      timeout: 5s
      path: /health
      failure_threshold: 3
      success_threshold: 2

  - name: us-west-2
    servers:
      - address: 10.0.2.10
        port: 80
        weight: 150
      - address: 10.0.2.11
        port: 80
        weight: 100
    health_check:
      type: http
      interval: 30s
      timeout: 5s
      path: /health
      failure_threshold: 3
      success_threshold: 2

  - name: eu-west-1
    servers:
      - address: 10.0.3.10
        port: 8080
        weight: 100
    health_check:
      type: tcp
      interval: 15s
      timeout: 3s

domains:
  - name: app.example.com
    routing_algorithm: round-robin
    regions:
      - us-east-1
      - us-west-2
    ttl: 30

  - name: api.example.com
    routing_algorithm: round-robin
    regions:
      - us-east-1
      - us-west-2
      - eu-west-1
    ttl: 60

  - name: static.example.com
    routing_algorithm: round-robin
    regions:
      - us-west-2
    ttl: 300

Example Configurations

Single Region (Development)

Minimal configuration for development or single-datacenter deployments:

dns:
  listen_address: ":5353"
  default_ttl: 30

logging:
  level: debug
  format: text

regions:
  - name: local
    servers:
      - address: 127.0.0.1
        port: 8080
        service: "myapp.local"
      - address: 127.0.0.1
        port: 8081
        service: "myapp.local"
    health_check:
      type: http
      interval: 10s
      timeout: 2s
      path: /health
      failure_threshold: 2
      success_threshold: 1

domains:
  - name: myapp.local
    routing_algorithm: round-robin
    regions:
      - local

Multi-Region (Production)

Production configuration with multiple regions and strict health checking:

dns:
  listen_address: ":53"
  default_ttl: 60
  return_last_healthy: false

logging:
  level: info
  format: json

metrics:
  enabled: true
  address: ":9090"

regions:
  - name: us-east-1
    servers:
      - address: 10.0.1.10
        port: 443
        service: "api.mycompany.com"
      - address: 10.0.1.11
        port: 443
        service: "api.mycompany.com"
      - address: 10.0.1.12
        port: 443
        service: "api.mycompany.com"
    health_check:
      type: https
      interval: 15s
      timeout: 3s
      path: /healthz
      failure_threshold: 3
      success_threshold: 2

  - name: us-west-2
    servers:
      - address: 10.0.2.10
        port: 443
        service: "api.mycompany.com"
      - address: 10.0.2.11
        port: 443
        service: "api.mycompany.com"
    health_check:
      type: https
      interval: 15s
      timeout: 3s
      path: /healthz
      failure_threshold: 3
      success_threshold: 2

  - name: eu-central-1
    servers:
      - address: 10.0.3.10
        port: 443
        service: "api.mycompany.com"
      - address: 10.0.3.11
        port: 443
        service: "api.mycompany.com"
    health_check:
      type: https
      interval: 15s
      timeout: 3s
      path: /healthz
      failure_threshold: 3
      success_threshold: 2

domains:
  - name: api.mycompany.com
    routing_algorithm: round-robin
    regions:
      - us-east-1
      - us-west-2
      - eu-central-1
    ttl: 30

High-Availability with Fast Failover

Configuration optimized for rapid failover detection:

dns:
  listen_address: ":53"
  default_ttl: 10  # Short TTL for fast client updates

regions:
  - name: primary
    servers:
      - address: 10.0.1.10
        port: 80
        service: "critical-app.example.com"
      - address: 10.0.1.11
        port: 80
        service: "critical-app.example.com"
    health_check:
      type: http
      interval: 5s          # Check every 5 seconds
      timeout: 2s
      path: /health
      failure_threshold: 2  # Mark unhealthy after 2 consecutive failures (~10 seconds)
      success_threshold: 1  # Recover immediately on success

domains:
  - name: critical-app.example.com
    routing_algorithm: round-robin
    regions:
      - primary
    ttl: 5  # Very short TTL

Command Line Options

./opengslb [options]

Options:
  --config string    Path to configuration file (default "/etc/opengslb/config.yaml")
  --version          Show version information and exit

Environment Variables

Currently, OpenGSLB does not support environment variable substitution in configuration files. All values must be specified directly in the YAML file.

Validation

OpenGSLB validates configuration on startup and will fail with descriptive error messages for:

Missing required fields
Invalid duration formats
Invalid port numbers
Invalid log levels or formats
Domains referencing non-existent regions
Timeout >= interval for health checks
Out-of-range threshold values
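
A few of these checks can be illustrated with a short Python sketch. It covers only three rules from the list (timeout versus interval, threshold ranges, and region references) and assumes durations have already been parsed into seconds; the `validate` function and its error wording are hypothetical and will not match OpenGSLB's real messages.

```python
def validate(config):
    """Return a list of error strings; an empty list means the config passed."""
    errors = []
    regions = config.get("regions", [])
    region_names = {region["name"] for region in regions}

    for region in regions:
        check = region.get("health_check", {})
        # Durations are assumed to be pre-parsed into seconds.
        if check.get("timeout", 5) >= check.get("interval", 30):
            errors.append(f"region {region['name']}: timeout must be less than interval")
        for field, default in (("failure_threshold", 3), ("success_threshold", 2)):
            if not 1 <= check.get(field, default) <= 10:
                errors.append(f"region {region['name']}: {field} must be between 1 and 10")

    for domain in config.get("domains", []):
        for name in domain.get("regions", []):
            if name not in region_names:
                errors.append(f"domain {domain['name']}: unknown region {name}")
    return errors
```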

Weighted Routing

Weighted routing distributes traffic proportionally based on server weights. Servers with higher weights receive more traffic.

Configuration

domains:
  - name: app.example.com
    routing_algorithm: weighted
    regions:
      - my-region

How It Works

Traffic distribution is proportional to server weights:

Server	Weight	Traffic Share
server1	150	50%
server2	100	33%
server3	50	17%

The algorithm uses weighted random selection. On each DNS query, a server is randomly selected with probability proportional to its weight. Over many queries, the distribution matches the weight ratios.

Weight Behavior

Weight > 0: Server participates in selection with given weight
Weight = 0: Server is excluded from selection (useful for soft-disabling)
Unhealthy servers: Excluded regardless of weight
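
Weighted random selection with these exclusion rules can be sketched in a few lines of Python. This is an illustrative version of the general technique, not OpenGSLB's implementation; `pick_weighted` and the tuple layout are assumptions made for the example.

```python
import random


def pick_weighted(servers, rng=random):
    """Pick an address from (address, weight, healthy) tuples, or None."""
    # Unhealthy servers and servers with weight 0 never become candidates.
    candidates = [(addr, weight) for addr, weight, healthy in servers
                  if healthy and weight > 0]
    if not candidates:
        return None
    total = sum(weight for _, weight in candidates)
    point = rng.random() * total  # uniform in [0, total)
    cumulative = 0
    for addr, weight in candidates:
        cumulative += weight
        if point < cumulative:
            return addr
    return candidates[-1][0]  # defensive fallback for floating-point edge cases
```

Each server owns a slice of the range from 0 to the total weight, sized by its weight, so a server with weight 150 out of 300 is chosen about half the time over many queries.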

Use Cases

Capacity-based distribution: Route more traffic to higher-capacity servers
Gradual migrations: Shift traffic by adjusting weights over time
Cost optimization: Send less traffic to more expensive regions

Comparison with Round-Robin

Aspect	Round-Robin	Weighted
Distribution	Equal	Proportional to weight
Server weights	Ignored	Respected
Predictability	Deterministic rotation	Probabilistic
Use case	Homogeneous servers	Heterogeneous capacity

Example: Gradual Traffic Shift

To gradually shift traffic from old to new servers:

# Week 1: 90% old, 10% new
servers:
  - address: "10.0.1.10"  # old
    weight: 90
  - address: "10.0.2.10"  # new
    weight: 10

# Week 2: 50% old, 50% new
servers:
  - address: "10.0.1.10"
    weight: 50
  - address: "10.0.2.10"
    weight: 50

# Week 3: 10% old, 90% new
servers:
  - address: "10.0.1.10"
    weight: 10
  - address: "10.0.2.10"
    weight: 90

Failover (Active/Standby) Routing

Failover routing directs all traffic to the highest-priority healthy server. When that server becomes unhealthy, traffic automatically fails over to the next server in priority order.

Configuration

domains:
  - name: critical-app.example.com
    routing_algorithm: failover
    regions:
      - my-region

How It Works

Server priority is determined by the order in the configuration file:

servers:
  - address: "10.0.1.10"  # Priority 1 (Primary)
  - address: "10.0.1.11"  # Priority 2 (Secondary)
  - address: "10.0.1.12"  # Priority 3 (Tertiary)

The routing behavior:

Primary	Secondary	Tertiary	Traffic Goes To
✅ Healthy	✅ Healthy	✅ Healthy	Primary
❌ Unhealthy	✅ Healthy	✅ Healthy	Secondary
❌ Unhealthy	❌ Unhealthy	✅ Healthy	Tertiary
❌ Unhealthy	❌ Unhealthy	❌ Unhealthy	SERVFAIL

Return-to-Primary Behavior

When a higher-priority server recovers, traffic automatically returns to it. This is the default and expected behavior for most disaster recovery scenarios.

Example timeline:

T=0: Primary healthy → traffic to Primary
T=5: Primary fails health checks → traffic to Secondary
T=10: Primary recovers → traffic returns to Primary
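
Both the routing table and the return-to-primary behavior follow from one stateless rule: answer with the first healthy server in configuration order. A minimal Python sketch of that rule, with the hypothetical helper `pick_failover`:

```python
def pick_failover(servers):
    """Return the first healthy address from (address, healthy) pairs.

    The list must be in configuration order. None means no server is
    healthy, which corresponds to a SERVFAIL response.
    """
    for address, healthy in servers:
        if healthy:
            return address
    return None
```

Because the selection is re-evaluated from the top of the list on every query, traffic returns to the primary as soon as it is marked healthy again.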

Use Cases

Disaster Recovery: Primary datacenter with hot standby
Maintenance Windows: Graceful failover during updates
Cost Optimization: Use expensive standby only when needed
Regulatory Compliance: Ensure traffic stays in primary region when possible

Comparison with Other Algorithms

Aspect	Round-Robin	Weighted	Failover
Traffic pattern	Distributed	Proportional	Single server
Predictability	Rotates	Probabilistic	Deterministic
Failover	Automatic	Automatic	Automatic
Recovery	N/A	N/A	Returns to primary
Use case	Load distribution	Capacity-based	DR/Active-standby

Health Check Recommendations

For failover routing, consider:

Short intervals (10-15s): Detect failures quickly
Low failure threshold (2-3): Fail over promptly
Higher success threshold (3-5): Avoid flapping on recovery
Short DNS TTL (15-30s): Clients update quickly after failover

health_check:
  interval: 10s
  failure_threshold: 2   # Fail fast
  success_threshold: 3   # Recover carefully

domains:
  - name: app.example.com
    ttl: 15  # Short TTL for failover scenarios
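
These settings can be combined into a rough upper bound on how long clients may keep reaching a failed server. The Python sketch below is a back-of-the-envelope estimate, assuming checks run on a fixed interval and that clients honor the TTL; `worst_case_failover_seconds` is a hypothetical helper, and real behavior also depends on resolver caching.

```python
def worst_case_failover_seconds(interval, timeout, failure_threshold, ttl):
    """Estimate the longest time before clients stop using a failed server."""
    # Each failing check can start up to one interval after the previous one,
    # and the last failing check may take the full timeout to give up.
    detection = interval * failure_threshold + timeout
    # Clients may keep using a cached answer for up to one full TTL.
    return detection + ttl
```

With the values above and the default 5s timeout, the estimate is 10 × 2 + 5 + 15 = 40 seconds.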

Monitoring Failover Events

Monitor these metrics to track failover:

opengslb_routing_decisions_total{algorithm="failover",server="..."} - Which server is receiving traffic
opengslb_health_check_results_total{result="unhealthy"} - Health check failures

A spike in traffic to the secondary server indicates a failover event.

Geolocation Routing

Geolocation routing directs traffic to servers based on the client’s geographic location. OpenGSLB uses MaxMind GeoIP2/GeoLite2 databases to resolve client IP addresses to geographic regions.

Configuration

domains:
  - name: app.example.com
    routing_algorithm: geolocation
    regions:
      - us-east-1
      - eu-west-1
      - ap-southeast-1

geolocation:
  database_path: "/var/lib/opengslb/geoip/GeoLite2-Country.mmdb"
  default_region: us-east-1
  ecs_enabled: true
  custom_mappings:
    - cidr: "10.0.0.0/8"
      region: us-east-1
    - cidr: "172.16.0.0/12"
      region: eu-west-1

Geolocation Settings

Field	Type	Default	Description
`database_path`	string	Required	Path to MaxMind GeoIP2/GeoLite2 database file (.mmdb)
`default_region`	string	Required	Fallback region when geolocation lookup fails
`ecs_enabled`	boolean	`true`	Enable EDNS Client Subnet support for accurate client location
`custom_mappings`	list	(empty)	Custom CIDR-to-region mappings

Custom CIDR Mappings

Custom mappings override GeoIP lookups for specific IP ranges. This is useful for:

Internal networks that should route to specific regions
Known customer IP ranges with preferred regions
Overriding incorrect GeoIP data

custom_mappings:
  - cidr: "10.0.0.0/8"        # Internal US network
    region: us-east-1
  - cidr: "192.168.0.0/16"    # Internal EU network
    region: eu-west-1
  - cidr: "203.0.113.0/24"    # Customer's APAC network
    region: ap-southeast-1

Custom mappings use longest-prefix matching—the most specific CIDR match wins.
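
Longest-prefix matching can be illustrated with Python's standard `ipaddress` module. This sketch shows the general technique with a linear scan; `match_custom_mapping` is a hypothetical helper, and a production implementation would more likely use a trie.

```python
import ipaddress


def match_custom_mapping(client_ip, mappings):
    """Return the region of the most specific matching CIDR, or None.

    mappings is a list of (cidr, region) pairs.
    """
    addr = ipaddress.ip_address(client_ip)
    best = None  # (prefix_length, region) of the best match so far
    for cidr, region in mappings:
        network = ipaddress.ip_network(cidr)
        if addr.version == network.version and addr in network:
            if best is None or network.prefixlen > best[0]:
                best = (network.prefixlen, region)
    return best[1] if best else None
```

If both 10.0.0.0/8 and 10.20.0.0/16 are mapped, a client at 10.20.3.4 matches both, and the /16 entry wins because it is more specific.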

Region Configuration for Geolocation

Regions must specify which countries or continents they serve:

regions:
  - name: us-east-1
    countries: ["US", "CA", "MX"]
    continents: ["NA", "SA"]
    servers:
      - address: "10.0.1.10"
        port: 8080

  - name: eu-west-1
    countries: ["GB", "DE", "FR", "NL", "BE"]
    continents: ["EU"]
    servers:
      - address: "10.0.2.10"
        port: 8080

  - name: ap-southeast-1
    countries: ["SG", "MY", "TH", "VN", "ID"]
    continents: ["AS", "OC"]
    servers:
      - address: "10.0.3.10"
        port: 8080

Field	Type	Description
`countries`	list	ISO 3166-1 alpha-2 country codes served by this region
`continents`	list	Continent codes: AF, AN, AS, EU, NA, OC, SA
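
One plausible way to resolve a client's location to a region from these fields is sketched below in Python. The precedence (country match first, then continent, then `default_region`) is an assumption for illustration and is not confirmed by this reference; `resolve_region` is a hypothetical helper.

```python
def resolve_region(country, continent, regions, default_region):
    """Map a client's country and continent codes to a region name."""
    # Assumption: a country match is more specific and is tried first.
    for region in regions:
        if country in region.get("countries", []):
            return region["name"]
    for region in regions:
        if continent in region.get("continents", []):
            return region["name"]
    return default_region
```

Under this assumption and the example regions above, a client in Brazil (BR, SA) has no country match but resolves to us-east-1 through its continent, while a client in South Africa (ZA, AF) falls through to the default region.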

EDNS Client Subnet (ECS) Support

When ecs_enabled: true, OpenGSLB extracts client location from ECS information in DNS queries. This provides more accurate geolocation when queries come from recursive resolvers (like Google DNS or Cloudflare) that include client subnet data.

GeoIP Database Setup

Download a MaxMind GeoLite2 database:

# Register at maxmind.com for a free license key
# Download GeoLite2-Country.mmdb

mkdir -p /var/lib/opengslb/geoip
mv GeoLite2-Country.mmdb /var/lib/opengslb/geoip/
chown opengslb:opengslb /var/lib/opengslb/geoip/GeoLite2-Country.mmdb

For production deployments, automate database updates using MaxMind’s geoipupdate tool. See the GeoIP maintenance runbook in the operations documentation.

Monitoring Geolocation Routing

Monitor these metrics:

opengslb_geo_routing_decision{country="...",continent="...",region="..."} - Routing decisions by location
opengslb_geo_fallback{reason="..."} - Fallback events and reasons
opengslb_geo_custom_mapping_hit{region="..."} - Custom CIDR mapping matches

Latency-Based Routing

Latency-based routing directs traffic to the server with the lowest measured latency. This algorithm continuously measures latency during health checks and uses exponential moving average (EMA) smoothing to prevent routing flapping.

Configuration

domains:
  - name: app.example.com
    routing_algorithm: latency
    regions:
      - us-east-1
      - us-west-2

latency_config:
  smoothing_factor: 0.3
  max_latency_ms: 500
  min_samples: 3

Latency Settings

Field	Type	Default	Description
`smoothing_factor`	float	`0.3`	EMA smoothing factor (0.0-1.0). Higher = more responsive, lower = more stable
`max_latency_ms`	integer	`500`	Maximum acceptable latency in milliseconds. Servers exceeding this are excluded
`min_samples`	integer	`3`	Minimum latency samples required before using server for routing

How It Works

Latency measurement: During each health check, OpenGSLB measures the TCP connection time to the backend
EMA smoothing: Latency values are smoothed using exponential moving average to prevent routing flapping from transient spikes
Server selection: The server with the lowest smoothed latency is selected
Threshold enforcement: Servers with latency exceeding max_latency_ms are excluded from selection
Automatic fallback: Falls back to round-robin when insufficient latency data is available
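
The selection and threshold steps can be sketched as a filter followed by a minimum. The Python below is illustrative only: `pick_lowest_latency` and the dictionary layout are hypothetical, and it assumes that returning None is how the caller learns it should use the round-robin fallback.

```python
def pick_lowest_latency(servers, max_latency_ms=500, min_samples=3):
    """Return the address with the lowest smoothed latency, or None.

    servers is a list of dicts with 'address', 'latency_ms' (already
    smoothed), and 'samples'. None signals that no server qualifies.
    """
    eligible = [
        server for server in servers
        if server["samples"] >= min_samples
        and server["latency_ms"] <= max_latency_ms
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda server: server["latency_ms"])["address"]
```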

Smoothing Factor

The smoothing factor controls how responsive the latency calculation is to new measurements:

Factor	Behavior
`0.1`	Very stable, slow to react to changes
`0.3`	Balanced (default)
`0.5`	Moderate responsiveness
`0.8`	Highly responsive, may flap on spikes

Formula: new_latency = (smoothing_factor * measured) + ((1 - smoothing_factor) * previous)
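
The formula translates directly into code. A minimal Python version, where the function name `ema` is chosen for this example:

```python
def ema(previous, measured, smoothing_factor=0.3):
    """Apply one exponential-moving-average step to a latency value."""
    return smoothing_factor * measured + (1 - smoothing_factor) * previous
```

For a server that has been steady at 50 ms and then records a single 250 ms spike, the smoothed value becomes 70 ms with a factor of 0.1, 110 ms with 0.3, and 210 ms with 0.8. This is why higher factors react faster but are more likely to flap.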

Use Cases

Global deployments: Route users to the fastest regional server
Multi-cloud: Route to the cloud provider with best current performance
Hybrid deployments: Balance between on-premises and cloud based on network conditions

Example: Multi-Region Latency Routing

dns:
  default_ttl: 30

regions:
  - name: us-east-1
    servers:
      - address: "10.0.1.10"
        port: 8080
        service: "api.example.com"
      - address: "10.0.1.11"
        port: 8080
        service: "api.example.com"
    health_check:
      type: http
      interval: 10s
      timeout: 5s
      path: /health

  - name: us-west-2
    servers:
      - address: "10.0.2.10"
        port: 8080
        service: "api.example.com"
    health_check:
      type: http
      interval: 10s
      timeout: 5s
      path: /health

  - name: eu-west-1
    servers:
      - address: "10.0.3.10"
        port: 8080
        service: "api.example.com"
    health_check:
      type: http
      interval: 10s
      timeout: 5s
      path: /health

domains:
  - name: api.example.com
    routing_algorithm: latency
    regions:
      - us-east-1
      - us-west-2
      - eu-west-1
    ttl: 30

latency_config:
  smoothing_factor: 0.3
  max_latency_ms: 200
  min_samples: 5

Monitoring Latency Routing

Monitor these metrics:

opengslb_latency_routing_decision{server="...",latency_ms="..."} - Selected server and its latency
opengslb_latency_rejection{server="...",reason="..."} - Servers excluded due to high latency or insufficient samples
opengslb_health_check_latency_seconds{server="..."} - Raw health check latency measurements

Combining with Geolocation

For optimal performance, consider using geolocation routing with latency as a secondary factor. Configure regions geographically, and latency routing will select the fastest server within the client’s region.

Learned Latency Routing (ADR-017)

Learned latency routing uses passive TCP RTT data collected by agents to route clients to the backend with the lowest measured latency. Unlike standard latency routing (which measures Overwatch-to-backend latency), this captures the actual client-to-backend experience.

How It Differs from Standard Latency Routing

Aspect	Standard Latency	Learned Latency
What’s measured	Overwatch → Backend	Client → Backend
Measurement method	Active health check probes	Passive TCP RTT from OS
Accuracy	Overwatch’s perspective	Client’s actual experience
Data source	Overwatch only	Agent gossip
Cold start	Falls back to round-robin	Falls back to geolocation

Configuration

domains:
  - name: app.example.com
    routing_algorithm: learned_latency
    regions:
      - us-east
      - us-west
      - eu-west
      - ap-southeast
    ttl: 60
    latency_config:
      max_latency_ms: 300
      min_samples: 5

Important: Learned latency routing requires agents with latency_learning.enabled: true to collect and gossip RTT data.

Learned Latency Settings

Field	Type	Default	Description
`max_latency_ms`	integer	`500`	Exclude backends with latency above this threshold
`min_samples`	integer	`3`	Minimum samples required before using learned data for a subnet

How It Works

Agents collect TCP RTT: When clients connect to backends, agents read TCP connection RTT from the OS kernel
Subnet aggregation: RTT samples are aggregated by client subnet (default /24 for IPv4)
Gossip to Overwatch: Agents periodically send latency reports to all Overwatch nodes
DNS routing: When a query arrives, Overwatch looks up learned latency for that client’s subnet and selects the lowest-latency backend
Cold start fallback: If no learned data exists for a subnet, falls back to geolocation routing

Viewing Learned Latency Data

Query the Overwatch API to see collected latency data:

curl http://localhost:9090/api/v1/overwatch/latency | jq .

Example response:

{
  "entries": [
    {
      "subnet": "10.1.2.0/24",
      "domain": "app.example.com",
      "region": "eu-west",
      "rtt_ms": 85,
      "samples": 150,
      "last_updated": "2025-12-19T10:05:00Z"
    }
  ]
}
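
A response in this shape can be used to reproduce the routing decision offline. The Python sketch below assumes the response always follows the example above and applies the `max_latency_ms` and `min_samples` filters described earlier; `best_region` is a hypothetical helper, and returning None stands in for the cold start fallback.

```python
import json


def best_region(response_text, subnet, domain, max_latency_ms=500, min_samples=3):
    """Return the lowest-latency region for a subnet and domain, or None."""
    entries = json.loads(response_text)["entries"]
    eligible = [
        entry for entry in entries
        if entry["subnet"] == subnet
        and entry["domain"] == domain
        and entry["samples"] >= min_samples
        and entry["rtt_ms"] <= max_latency_ms
    ]
    if not eligible:
        return None  # no usable learned data for this subnet
    return min(eligible, key=lambda entry: entry["rtt_ms"])["region"]
```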

Use Cases

True client optimization: Route based on actual client experience, not Overwatch-side measurements
CDN-like behavior: Automatically route clients to their lowest-latency backend
Multi-cloud arbitrage: Discover which cloud provider is fastest for each client subnet
ISP-aware routing: Different ISPs may have different latency to your backends

Example: Full Learned Latency Deployment

Overwatch configuration:

mode: overwatch

domains:
  - name: app.example.com
    routing_algorithm: learned_latency
    regions:
      - us-east
      - eu-west
      - ap-southeast
    latency_config:
      max_latency_ms: 300
      min_samples: 5

overwatch:
  geolocation:
    database_path: /var/lib/opengslb/GeoLite2-Country.mmdb
    default_region: us-east

Agent configuration:

mode: agent

agent:
  backends:
    - service: "app.example.com"
      address: "127.0.0.1"
      port: 80
      weight: 100

  latency_learning:
    enabled: true
    poll_interval: 10s
    min_connection_age: 5s
    report_interval: 30s

Configuration Hot-Reload

OpenGSLB supports reloading configuration without restarting the service. This allows you to add/remove domains and servers, change routing algorithms, and update health check settings with zero downtime.

Triggering a Reload

Send SIGHUP to the OpenGSLB process:

# Find the process ID
pgrep opengslb

# Send SIGHUP
kill -HUP $(pgrep opengslb)

# Or in one command
pkill -HUP opengslb

What Can Be Reloaded

Setting	Hot-Reload	Notes
Domains	✅ Yes	Add, remove, or modify domains
Servers	✅ Yes	Add, remove, or modify servers
Regions	✅ Yes	Add, remove, or modify regions
Health check settings	✅ Yes	Interval, timeout, thresholds
Routing algorithm	✅ Yes	Change algorithm for domains
DNS TTL	✅ Yes	Per-domain or default TTL
DNS listen address	❌ No	Requires restart
Metrics port	❌ No	Requires restart

Reload Behavior

Validation first: The new configuration is fully validated before any changes are applied
Atomic swap: Changes are applied atomically, so a partially applied configuration is never live
Health state preserved: Existing servers retain their health state during reload
No query disruption: In-flight DNS queries are not affected

Reload Process

When you send SIGHUP:

OpenGSLB reads and validates the configuration file
If validation fails, the old configuration stays in effect and the error is logged
If validation succeeds:
- DNS registry is updated with new domains
- Health checks are started for new servers
- Health checks are stopped for removed servers
- Router is updated if algorithm changed
Success/failure is logged and recorded in metrics
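The validate-then-swap pattern described above can be sketched as follows. This is a simplified illustration, not OpenGSLB's code: ConfigHolder, diff_servers, and the load/validate callables are hypothetical names introduced for the example.

```python
import threading


class ConfigHolder:
    """Holds the active configuration and swaps it only after validation."""

    def __init__(self, config):
        self._lock = threading.Lock()
        self._config = config

    def current(self):
        with self._lock:
            return self._config

    def reload(self, load, validate):
        """Load and validate a candidate; keep the old config on any error."""
        try:
            candidate = load()
            validate(candidate)
        except Exception as exc:  # the old configuration stays active
            return False, str(exc)
        with self._lock:
            self._config = candidate  # single reference swap
        return True, None


def diff_servers(old, new):
    """Return (added, removed): servers whose health checks start or stop."""
    old_set, new_set = set(old), set(new)
    return sorted(new_set - old_set), sorted(old_set - new_set)
```

Because readers only ever see the old or the new object, a failed validation leaves the running configuration untouched.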

Monitoring Reloads

Check reload metrics in Prometheus:

# Total reload attempts by result
opengslb_config_reloads_total{result="success"}
opengslb_config_reloads_total{result="failure"}

# Timestamp of last successful reload
opengslb_config_reload_timestamp_seconds
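If you scrape the metrics output directly rather than through Prometheus, the same values can be read with a small parser. The sketch below handles only simple "name value" lines without timestamps, which is an assumption about the output. The helper names are hypothetical, and any numbers you feed it in testing are illustrative rather than real OpenGSLB output.

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format lines into {series: float}."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        series, _, value = line.rpartition(" ")
        values[series] = float(value)
    return values


def reload_failure_ratio(values):
    """Fraction of reload attempts that failed (0.0 if none recorded)."""
    ok = values.get('opengslb_config_reloads_total{result="success"}', 0.0)
    failed = values.get('opengslb_config_reloads_total{result="failure"}', 0.0)
    total = ok + failed
    return failed / total if total else 0.0


def seconds_since_last_reload(values, now):
    """Age of the last successful reload, or None if never recorded."""
    stamp = values.get("opengslb_config_reload_timestamp_seconds")
    return None if stamp is None else now - stamp
```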

Logs

Successful reload:

level=INFO msg="received SIGHUP, reloading configuration"
level=INFO msg="reloading configuration" old_domains=2 new_domains=3 old_regions=1 new_regions=2
level=INFO msg="health manager reconfigured" added=2 removed=0 updated=0 total=5
level=INFO msg="configuration reload complete" domains=3 servers=5
level=INFO msg="configuration reloaded successfully"

Failed reload (invalid config):

level=INFO msg="received SIGHUP, reloading configuration"
level=ERROR msg="configuration reload failed" error="failed to load configuration: validation error: ..."
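Log lines in this key=value form are easy to check from a script, for example in a deployment pipeline that needs to confirm a reload succeeded. The sketch below uses Python's shlex to split quoted values and matches only the msg strings shown above. The helper names are hypothetical, and shlex is a convenience that will reject lines containing unbalanced quotes.

```python
import shlex


def parse_log_line(line):
    """Split a key=value log line into a dict, honoring quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def reload_outcome(lines):
    """Return 'success', 'failure', or None based on the last reload message."""
    outcome = None
    for line in lines:
        msg = parse_log_line(line).get("msg")
        if msg == "configuration reloaded successfully":
            outcome = "success"
        elif msg == "configuration reload failed":
            outcome = "failure"
    return outcome
```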

Best Practices

Validate before reload: Test your config changes with opengslb-cli config validate --config /path/to/new/config.yaml (see Validating Configuration) or in a staging environment
Use version control: Keep your configuration in git to track changes and enable rollback
Monitor after reload: Watch metrics and logs after reloading to confirm expected behavior
Gradual changes: Make incremental config changes rather than large rewrites
Backup config: Keep a known-good configuration file as backup

Example: Adding a New Server

Original config:

regions:
  - name: primary
    servers:
      - address: "10.0.1.10"
        port: 80

Updated config:

regions:
  - name: primary
    servers:
      - address: "10.0.1.10"
        port: 80
      - address: "10.0.1.11"  # New server
        port: 80

Reload:

kill -HUP $(pgrep opengslb)

The new server will immediately begin health checks and be added to rotation once healthy.

Example: Changing Routing Algorithm

domains:
  - name: app.example.com
    routing_algorithm: weighted  # Changed from round-robin

After reload, traffic distribution will change to respect server weights.

Multi-File Configuration (Includes)

For large deployments with many domains managed by different teams, OpenGSLB supports splitting configuration across multiple files. This enables team-based configuration management while maintaining centralized infrastructure settings.

Basic Usage

Use the includes directive in your main configuration to include additional files:

# /etc/opengslb/config.yaml
mode: overwatch

dns:
  listen_address: ":53"
  zones:
    - gslb.example.com

includes:
  - regions/*.yaml          # Load all region files
  - domains/**/*.yaml       # Recursively load domain files
  - tokens.yaml             # Load agent tokens

Glob Patterns

The includes directive supports glob patterns:

Pattern	Matches
`*.yaml`	All YAML files in the current directory
`regions/*.yaml`	All YAML files in the regions/ subdirectory
`domains/**/*.yaml`	All YAML files recursively under domains/
`tokens.yaml`	Specific file

Patterns are relative to the main configuration file’s directory.
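To preview which files a set of patterns would pick up, you can expand them relative to the main file's directory. The sketch below uses Python's glob module, whose matching rules are an approximation and may differ from OpenGSLB's. In particular, Python's ** also matches zero directories, so domains/**/*.yaml would match a file placed directly in domains/. The function name resolve_includes and the sorted ordering are assumptions.

```python
import glob
import os


def resolve_includes(main_config_path, patterns):
    """Expand include patterns relative to the main config's directory."""
    base = os.path.dirname(os.path.abspath(main_config_path))
    resolved = []
    for pattern in patterns:
        matches = glob.glob(os.path.join(base, pattern), recursive=True)
        for path in sorted(matches):
            # Keep regular files only, and list each file once.
            if os.path.isfile(path) and path not in resolved:
                resolved.append(path)
    return [os.path.relpath(path, base) for path in resolved]
```

For example, resolve_includes("/etc/opengslb/config.yaml", ["regions/*.yaml", "domains/**/*.yaml", "tokens.yaml"]) lists the matching files as paths relative to /etc/opengslb/.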

Merge Semantics

When multiple files are loaded, content is merged according to these rules:

Field	Merge Behavior
`regions`	Arrays are concatenated
`domains`	Arrays are concatenated
`agent_tokens`	Maps are merged (later values override)
`agent.backends`	Arrays are concatenated
`geolocation.custom_mappings`	Arrays are concatenated
All other fields	Taken from the main file only (includes cannot override)
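The merge rules in the table can be modeled with a short function. This is a sketch under assumptions, not OpenGSLB's merge code: the key paths (for example, that agent_tokens lives under overwatch, as in the tokens file example) and the function names are chosen for illustration, and geolocation.custom_mappings is omitted for brevity.

```python
import copy

# Key paths are assumptions for this sketch.
LIST_PATHS = [("regions",), ("domains",), ("agent", "backends")]
MAP_PATHS = [("overwatch", "agent_tokens")]


def _lookup(doc, path):
    """Return the value at a nested key path, or None if absent."""
    for key in path:
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def _parent(doc, path):
    """Return (creating if needed) the dict that holds the last key in path."""
    for key in path[:-1]:
        doc = doc.setdefault(key, {})
    return doc


def merge_include(main, included):
    """Merge one included document into a copy of the main document."""
    merged = copy.deepcopy(main)
    for path in LIST_PATHS:  # arrays are concatenated
        extra = _lookup(included, path)
        if extra:
            holder = _parent(merged, path)
            holder[path[-1]] = list(holder.get(path[-1]) or []) + list(extra)
    for path in MAP_PATHS:  # maps are merged, later values override
        extra = _lookup(included, path)
        if extra:
            holder = _parent(merged, path)
            holder[path[-1]] = {**(holder.get(path[-1]) or {}), **extra}
    return merged  # all other fields from the include are ignored
```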

Example Directory Structure

/etc/opengslb/
├── config.yaml              # Main configuration
├── regions/
│   ├── us-east.yaml         # US East region
│   ├── us-west.yaml         # US West region
│   └── eu-west.yaml         # EU West region
├── domains/
│   ├── team-a/
│   │   └── app.yaml         # Team A's application domain
│   └── team-b/
│       └── api.yaml         # Team B's API domain
└── tokens.yaml              # Agent authentication tokens

Example Files

Main config (config.yaml):

mode: overwatch

dns:
  listen_address: ":53"
  zones:
    - gslb.example.com
  default_ttl: 30

overwatch:
  gossip:
    encryption_key: "YOUR_KEY_HERE"
  dnssec:
    enabled: true

logging:
  level: info
  format: json

includes:
  - regions/*.yaml
  - domains/**/*.yaml
  - tokens.yaml

Region file (regions/us-east.yaml):

regions:
  - name: us-east-1
    countries: ["US", "CA", "MX"]
    continents: ["NA", "SA"]
    servers:
      - address: "10.0.1.10"
        port: 8080
      - address: "10.0.1.11"
        port: 8080
    health_check:
      type: http
      path: /health
      interval: 30s

Domain file (domains/team-a/app.yaml):

domains:
  - name: app.gslb.example.com
    routing_algorithm: round-robin
    regions:
      - us-east-1
      - us-west-2
    ttl: 30

Tokens file (tokens.yaml):

overwatch:
  agent_tokens:
    team-a-app: "secret-token-for-team-a"
    team-b-api: "secret-token-for-team-b"

Error Handling

OpenGSLB provides clear error messages with file context:

Duplicate region name:

regions/backup.yaml: duplicate region name "us-east-1"

Circular include:

circular include detected: config.yaml -> base.yaml -> config.yaml

Permission error:

regions/insecure.yaml: permission check failed: file is world-writable, which is a security risk

Security

Included files undergo the same permission checks as the main config
World-writable files are rejected
Maximum include depth is 10 levels (prevents infinite recursion)
Circular includes are detected and rejected
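Before deploying, you can check file modes yourself with a few lines of Python. The sketch below flags both world-readable and world-writable files, on the assumption that included files are held to the same rule as the main config (not world-readable) in addition to the world-writable check listed above. The function name is hypothetical and the check is POSIX-only.

```python
import os
import stat


def check_permissions(path):
    """Return a list of permission problems for a config file (POSIX modes)."""
    mode = os.stat(path).st_mode
    problems = []
    if mode & stat.S_IROTH:
        problems.append("world-readable")
    if mode & stat.S_IWOTH:
        problems.append("world-writable")
    return problems
```

A file with mode 640 or 600 yields an empty list, while mode 644 is reported as world-readable.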

Hot-Reload with Includes

When you send SIGHUP, all included files are re-read along with the main configuration. This means:

Changes to any included file take effect on reload
New files matching glob patterns are automatically included
Removed files are no longer included

# Reload after editing any config file
kill -HUP $(pgrep opengslb)

Nested Includes

Included files can themselves contain includes directives:

# base.yaml
includes:
  - regions/*.yaml

This allows for modular configuration hierarchies, but be careful not to create circular dependencies.
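Cycle and depth checks of this kind are commonly implemented by tracking the chain of files currently being loaded. The following sketch shows the idea on an in-memory map of file names. The function name, the includes_of structure, and the exact way depth is counted are assumptions, not OpenGSLB internals.

```python
MAX_INCLUDE_DEPTH = 10  # limit stated in the Security section


def walk_includes(name, includes_of, chain=()):
    """Return files in load order; raise ValueError on a cycle or deep nesting.

    includes_of maps a file name to the list of files it includes
    (a hypothetical stand-in for reading each file's includes directive).
    """
    if name in chain:
        trail = " -> ".join(chain + (name,))
        raise ValueError("circular include detected: " + trail)
    if len(chain) >= MAX_INCLUDE_DEPTH:
        raise ValueError(f"include depth exceeds {MAX_INCLUDE_DEPTH}")
    order = [name]
    for child in includes_of.get(name, []):
        order.extend(walk_includes(child, includes_of, chain + (name,)))
    return order
```

Because the chain holds only the current path from the main file, the same file may be included from two different branches without being reported as a cycle.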

Best Practices

Separate by responsibility: Keep infrastructure settings in the main file, let teams manage their domains
Use descriptive directories: domains/team-a/ is clearer than domains/a/
Document ownership: Add comments indicating who manages each file
Secure sensitive files: Keep tokens in a separate file with restrictive permissions
Version control: Track all configuration files in git for audit trail

Validating Configuration

Validate your multi-file configuration before deploying:

opengslb-cli config validate --config /etc/opengslb/config.yaml

This will load all included files and report any validation errors.

IPv6 Support

OpenGSLB supports both IPv4 and IPv6 addresses for backend servers. The DNS server automatically handles A (IPv4) and AAAA (IPv6) queries, returning only addresses of the appropriate family.

Configuration

Simply configure servers with IPv6 addresses:

regions:
  - name: us-east
    servers:
      - address: "10.0.1.10"      # IPv4
        port: 80
        weight: 100
      - address: "10.0.1.11"      # IPv4
        port: 80
        weight: 100
      - address: "2001:db8::1"    # IPv6
        port: 80
        weight: 100
      - address: "2001:db8::2"    # IPv6
        port: 80
        weight: 100
    health_check:
      type: http
      interval: 30s
      timeout: 5s
      path: /health

Query Behavior

Query Type	Servers Considered	Response
A (IPv4)	Only IPv4 servers	A record with IPv4 address
AAAA (IPv6)	Only IPv6 servers	AAAA record with IPv6 address

Mixed Environments

In environments with both IPv4 and IPv6 servers:

A queries return only IPv4 addresses
AAAA queries return only IPv6 addresses
Each address family is load-balanced independently
Health checks work for both IPv4 and IPv6 endpoints

IPv4-Only or IPv6-Only Domains

If a domain only has servers of one address family:

Queries for the available family return addresses normally
Queries for the unavailable family return NOERROR with an empty answer section

This is standard DNS behavior indicating the domain exists but has no records of the requested type.
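The per-family filtering can be illustrated with Python's ipaddress module. This is a sketch of the behavior described here, not OpenGSLB's code, and the function names are hypothetical. It also treats IPv4-mapped IPv6 addresses as IPv4, as stated in the Notes for this section.

```python
import ipaddress


def address_family(address):
    """Return 4 or 6; IPv4-mapped IPv6 addresses count as IPv4."""
    ip = ipaddress.ip_address(address)
    if ip.version == 6 and ip.ipv4_mapped is not None:
        return 4
    return ip.version


def candidates_for(qtype, servers):
    """Return the server addresses eligible to answer an A or AAAA query."""
    wanted = {"A": 4, "AAAA": 6}[qtype]
    return [s for s in servers if address_family(s) == wanted]
```

An empty result corresponds to the NOERROR response with an empty answer section described above.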

Example: Dual-Stack Configuration

regions:
  - name: primary-dc
    servers:
      # IPv4 servers
      - address: "192.168.1.10"
        port: 443
        weight: 100
      - address: "192.168.1.11"
        port: 443
        weight: 100
      # IPv6 servers
      - address: "2001:db8:1::10"
        port: 443
        weight: 100
      - address: "2001:db8:1::11"
        port: 443
        weight: 100
    health_check:
      type: http
      interval: 15s
      timeout: 3s
      path: /health

domains:
  - name: app.example.com
    routing_algorithm: round-robin
    regions:
      - primary-dc
    ttl: 30

Testing IPv6

# Query for IPv4 address
dig @localhost -p 15353 app.example.com A +short
# Returns: 192.168.1.10 (or .11)

# Query for IPv6 address
dig @localhost -p 15353 app.example.com AAAA +short
# Returns: 2001:db8:1::10 (or ::11)

Health Checks for IPv6

Health checks work identically for IPv6 servers. The health check URL is constructed using the IPv6 address in bracket notation:

http://[2001:db8:1::10]:443/health

TCP health checks connect to the IPv6 address directly.
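The bracket rule is easy to get wrong when building URLs by hand. A minimal sketch of the construction, with a hypothetical function name and default arguments chosen for illustration:

```python
import ipaddress


def health_check_url(address, port, path="/health", scheme="http"):
    """Build a health check URL, bracketing IPv6 literals."""
    ip = ipaddress.ip_address(address)
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"{scheme}://{host}:{port}{path}"


print(health_check_url("2001:db8:1::10", 443))
print(health_check_url("192.168.1.10", 443))
```

Without the brackets, the colons in the IPv6 address would be ambiguous with the port separator.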

Notes

IPv4-mapped IPv6 addresses (e.g., ::ffff:192.168.1.1) are treated as IPv4
Ensure your network infrastructure supports IPv6 if configuring IPv6 servers
Health checks must be reachable via the configured address family