Configuration Reference
OpenGSLB is configured via a YAML file. By default, it looks for /etc/opengslb/config.yaml, but you can specify a different path with the --config flag.
Configuration File Security
OpenGSLB enforces strict file permissions on the configuration file. The config file must not be world-readable (no “other” read permission).
# Correct permissions (owner read/write, group read)
chmod 640 /etc/opengslb/config.yaml
# Or more restrictive (owner only)
chmod 600 /etc/opengslb/config.yaml
If the file has insecure permissions, OpenGSLB will refuse to start and display an error message.
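The permission rule can be checked ahead of time with a few lines of Python. This is a minimal sketch of the "no other-read bit" test described above, not OpenGSLB's own startup check; `is_world_readable` is a hypothetical helper name.

```python
import os
import stat


def is_world_readable(path: str) -> bool:
    """Return True if the 'other' read bit is set on the file at path."""
    mode = os.stat(path).st_mode
    # stat.S_IROTH is the read permission bit for "other" (0o004).
    return bool(mode & stat.S_IROTH)
```

A file with mode 640 or 600 passes this check; a file with mode 644 does not.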
Runtime Mode (ADR-015)
OpenGSLB operates in one of two modes:
| Mode | Description |
|---|---|
| overwatch | DNS-serving, health-validating authority node. Receives agent heartbeats, validates health claims, serves DNS. |
| agent | Health-reporting agent on application servers. Monitors local backends, gossips status to Overwatch nodes. |
# Set runtime mode
mode: overwatch # or "agent"
If mode is not specified, OpenGSLB defaults to overwatch mode.
Agent Mode Configuration
Agent mode runs on application servers to monitor local backends and report health to Overwatch nodes.
mode: agent
agent:
  identity:
    service_token: "pre-shared-token-for-auth"
    region: "us-east-1"
    cert_path: /var/lib/opengslb/agent.crt
    key_path: /var/lib/opengslb/agent.key
  backends:
    - service: "web-service"
      address: "127.0.0.1"
      port: 8080
      weight: 100
      health_check:
        type: http
        interval: 10s
        timeout: 5s
        path: /health
        failure_threshold: 3
        success_threshold: 2
  gossip:
    encryption_key: "base64-encoded-32-byte-key"
    overwatch_nodes:
      - "overwatch-1.internal:7946"
      - "overwatch-2.internal:7946"
  heartbeat:
    interval: 10s
    missed_threshold: 3
  predictive:
    enabled: true
    cpu:
      threshold: 80
      bleed_duration: 30s
    memory:
      threshold: 85
      bleed_duration: 30s
    error_rate:
      threshold: 5
      window: 60s
      bleed_duration: 30s
Agent Identity Settings
| Field | Type | Default | Description |
|---|---|---|---|
| service_token | string | Required | Pre-shared token for initial authentication with Overwatch |
| region | string | Required | Geographic region this agent belongs to |
| cert_path | string | | Path to store/load agent certificate |
| key_path | string | | Path to store/load agent private key |
Agent Backend Settings
| Field | Type | Default | Description |
|---|---|---|---|
| service | string | Required | Service name (maps to DNS domain) |
| address | string | Required | Backend server IP address |
| port | integer | Required | Backend server port |
| weight | integer | | Routing weight (1-1000) |
| health_check | object | Required | Health check configuration |
Agent Gossip Settings
| Field | Type | Default | Description |
|---|---|---|---|
| encryption_key | string | Required | 32-byte base64-encoded encryption key |
| overwatch_nodes | list | Required | List of Overwatch gossip addresses |
Generate an encryption key with:
openssl rand -base64 32
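Before placing a key in the configuration, its decoded length can be verified. The sketch below is an illustration only; `is_valid_gossip_key` is a hypothetical helper and does not reflect how OpenGSLB itself validates the key.

```python
import base64


def is_valid_gossip_key(encoded: str) -> bool:
    """Return True if encoded is valid base64 that decodes to exactly 32 bytes."""
    try:
        # validate=True rejects characters outside the base64 alphabet.
        raw = base64.b64decode(encoded, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return len(raw) == 32
```

The placeholder string used in the examples ("base64-encoded-32-byte-key") fails this check, which is a reminder to replace it with real `openssl` output.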
Agent Heartbeat Settings
| Field | Type | Default | Description |
|---|---|---|---|
| interval | duration | | Time between heartbeat messages |
| missed_threshold | integer | | Missed heartbeats before deregistration |
Agent Predictive Health Settings
Predictive health allows agents to signal impending failures before they impact traffic.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | | Enable predictive health monitoring |
| cpu.threshold | float | | CPU usage percentage to trigger bleed |
| cpu.bleed_duration | duration | | Duration to gradually drain traffic |
| memory.threshold | float | | Memory usage percentage to trigger bleed |
| memory.bleed_duration | duration | | Duration to gradually drain traffic |
| error_rate.threshold | float | | Error rate percentage to trigger bleed |
| error_rate.window | duration | | Window for error rate calculation |
| error_rate.bleed_duration | duration | | Duration to gradually drain traffic |
Agent Latency Learning Settings (ADR-017)
Passive latency learning allows agents to collect real client-to-backend TCP RTT data and report it to Overwatch for intelligent routing. This captures actual client experience rather than Overwatch-to-backend latency.
agent:
  latency_learning:
    enabled: true
    poll_interval: 10s
    min_connection_age: 5s
    ipv4_prefix: 24
    ipv6_prefix: 48
    ewma_alpha: 0.3
    max_subnets: 100000
    subnet_ttl: 168h
    min_samples: 5
    report_interval: 30s
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | | Enable passive latency learning |
| poll_interval | duration | | How often to poll OS for TCP connection RTT data |
| min_connection_age | duration | | Minimum connection age before collecting RTT (new connections have unstable RTT) |
| ipv4_prefix | integer | | IPv4 subnet prefix for aggregation (e.g., /24 groups all 10.0.1.x together) |
| ipv6_prefix | integer | | IPv6 subnet prefix for aggregation |
| ewma_alpha | float | | EWMA smoothing factor (0-1). Higher = more responsive to recent samples |
| max_subnets | integer | | Maximum subnets to track (prevents unbounded memory growth) |
| subnet_ttl | duration | | How long to keep subnet entries without updates (7 days default) |
| min_samples | integer | | Minimum samples before reporting a subnet’s latency |
| report_interval | duration | | How often to send latency reports to Overwatch via gossip |
Requirements:
Linux: CAP_NET_ADMIN capability or root privileges
Windows: Administrator privileges (uses GetPerTcpConnectionEStats API)
Grant capability on Linux:
sudo setcap cap_net_admin+ep /usr/local/bin/opengslb
Overwatch Mode Configuration
Overwatch mode serves DNS and validates health claims from agents.
mode: overwatch
overwatch:
  identity:
    node_id: "overwatch-1"
    region: "us-east-1"
  agent_tokens:
    web-service: "pre-shared-token-for-web-service"
    api-service: "pre-shared-token-for-api-service"
  gossip:
    bind_address: "0.0.0.0:7946"
    encryption_key: "base64-encoded-32-byte-key"
    probe_interval: 1s
    probe_timeout: 500ms
    gossip_interval: 200ms
  validation:
    enabled: true
    check_interval: 30s
    check_timeout: 5s
  stale:
    threshold: 30s
    remove_after: 5m
  data_dir: /var/lib/opengslb
  dnssec:
    enabled: true
    algorithm: ECDSAP256SHA256
Overwatch Identity Settings
| Field | Type | Default | Description |
|---|---|---|---|
| node_id | string | hostname | Unique identifier for this Overwatch node |
| region | string | (empty) | Geographic region this Overwatch serves |
Overwatch Agent Tokens
Map of service names to authentication tokens. Agents must provide matching tokens to register.
overwatch:
  agent_tokens:
    web-service: "token-for-web"
    api-service: "token-for-api"
Overwatch Gossip Settings
| Field | Type | Default | Description |
|---|---|---|---|
| bind_address | string | | Address to listen for agent gossip |
| encryption_key | string | Required | 32-byte base64-encoded key (must match agents) |
| probe_interval | duration | | Interval between failure probes |
| probe_timeout | duration | | Timeout for a single probe |
| gossip_interval | duration | | Interval between gossip messages |
Overwatch Validation Settings
External validation allows Overwatch to independently verify agent health claims.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | | Enable external health validation |
| check_interval | duration | | Frequency of validation checks |
| check_timeout | duration | | Timeout for validation checks |
Important: Per ADR-015, Overwatch validation ALWAYS wins over agent claims. This prevents agents from falsely claiming healthy status.
Overwatch Stale Settings
Configure when backends are considered stale (no recent heartbeat from agent).
| Field | Type | Default | Description |
|---|---|---|---|
| threshold | duration | | Time without heartbeat before marking stale |
| remove_after | duration | | Time after which stale backends are removed |
Overwatch Data Directory
| Field | Type | Default | Description |
|---|---|---|---|
| data_dir | string | | Directory for persistent data (bbolt database) |
Overwatch DNSSEC Settings
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | | Enable DNSSEC signing |
| | string | (empty) | Required if disabling DNSSEC |
| algorithm | string | | DNSSEC signing algorithm |
Configuration Sections
DNS Configuration
Controls the DNS server behavior.
dns:
  listen_address: ":53"
  default_ttl: 60
  return_last_healthy: false
| Field | Type | Default | Description |
|---|---|---|---|
| listen_address | string | | Address and port to listen on. Format: |
| default_ttl | integer | | Default TTL (seconds) for DNS responses. Clients cache responses for this duration. |
| return_last_healthy | boolean | | When all servers are unhealthy: |
Notes:
Lower TTL = faster failover but higher DNS query volume
Port 53 requires root privileges. Use a high port (e.g., :5353) for non-root operation
return_last_healthy: true enables “limp mode” - degraded service instead of complete failure
Logging Configuration
Controls log output format and verbosity.
logging:
  level: info
  format: json
| Field | Type | Default | Description |
|---|---|---|---|
| level | string | | Log level: |
| format | string | | Output format: |
Format recommendations:
Use json for production deployments with log aggregation (ELK, Splunk, Loki)
Use text for development and debugging
Metrics Configuration
Controls Prometheus metrics exposure.
metrics:
  enabled: true
  address: ":9090"
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | | Enable/disable the metrics HTTP endpoint |
| address | string | | Address and port for the metrics server |
When enabled, metrics are available at http://<address>/metrics and a health check at http://<address>/health.
Regions Configuration
Defines geographic regions/data centers and their backend servers.
regions:
  - name: us-east-1
    servers:
      - address: 10.0.1.10
        port: 80
        weight: 100
        service: "app.example.com" # REQUIRED in v1.1.0
      - address: 10.0.1.11
        port: 80
        weight: 100
        service: "app.example.com" # REQUIRED in v1.1.0
    health_check:
      type: http
      interval: 30s
      timeout: 5s
      path: /health
      failure_threshold: 3
      success_threshold: 2
Region Fields
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique identifier for the region |
| servers | list | Yes | List of backend servers in this region |
| health_check | object | Yes | Health check configuration for servers in this region |
Server Fields
| Field | Type | Default | Description |
|---|---|---|---|
| service | string | Required | v1.1.0+: Domain/service this server belongs to (must match a configured domain name) |
| address | string | Required | IP address of the backend server |
| port | integer | | Port number for health checks |
| weight | integer | | Server weight for weighted routing (1-1000) |
| | string | (empty) | Hostname for HTTPS health checks (for TLS SNI and certificate validation) |
BREAKING CHANGE (v1.1.0): The service field is now required for all servers. This enables the unified server architecture where static, agent-registered, and API-registered servers all use the same validation system. The service field specifies which domain/service the server belongs to.
Health Check Fields
| Field | Type | Default | Description |
|---|---|---|---|
| type | string | | Check type: |
| interval | duration | | Time between health checks |
| timeout | duration | | Timeout for each check (must be < interval) |
| path | string | | HTTP/HTTPS path to check |
| host | string | (empty) | Host header for HTTPS checks (for TLS SNI and certificate validation) |
| failure_threshold | integer | | Consecutive failures before marking unhealthy (1-10) |
| success_threshold | integer | | Consecutive successes before marking healthy (1-10) |
Health check behavior:
HTTP/HTTPS checks expect a 2xx response code
TCP checks only verify successful TCP connection (no data exchange)
A server starts as healthy and requires failure_threshold consecutive failures to become unhealthy
An unhealthy server requires success_threshold consecutive successes to become healthy again
For HTTPS checks with IP addresses, use host to set the Host header for TLS certificate validation
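The threshold behavior above amounts to a small state machine. The following Python sketch illustrates it under the stated rules (start healthy, flip only after N consecutive results). The `HealthState` class and its default values are hypothetical and taken from the examples in this document, not from OpenGSLB's source.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    failure_threshold: int = 3  # example value, not a documented default
    success_threshold: int = 2  # example value, not a documented default
    healthy: bool = True        # a server starts as healthy
    failures: int = 0           # consecutive failed checks
    successes: int = 0          # consecutive passed checks

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return the resulting health status."""
        if check_passed:
            self.failures = 0
            self.successes += 1
            if not self.healthy and self.successes >= self.success_threshold:
                self.healthy = True
        else:
            self.successes = 0
            self.failures += 1
            if self.healthy and self.failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy
```

Because a single passing check resets the failure counter, an intermittent failure does not take a server out of rotation unless the failures are consecutive.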
When to use TCP checks:
Services without HTTP endpoints (databases, caches, custom protocols)
Quick connectivity verification without application-level validation
Services where the health endpoint isn’t exposed
Domains Configuration
Defines which domains OpenGSLB responds to and how traffic is routed.
domains:
  - name: app.example.com
    routing_algorithm: round-robin
    regions:
      - us-east-1
      - us-west-2
    ttl: 30
| Field | Type | Default | Description |
|---|---|---|---|
| name | string | Required | Fully qualified domain name to respond to |
| routing_algorithm | string | | Algorithm: |
| regions | list | Required | List of region names to route traffic to |
| ttl | integer | Uses | TTL for this domain’s responses (overrides default) |
Notes:
Domain names are matched exactly (no wildcard support currently)
Queries for unconfigured domains receive NXDOMAIN
All servers from all listed regions form the candidate pool for routing
Duration Format
Duration fields accept Go duration strings:
30s - 30 seconds
5m - 5 minutes
1h - 1 hour
500ms - 500 milliseconds
Complete Example
# OpenGSLB Configuration
# /etc/opengslb/config.yaml
dns:
  listen_address: ":53"
  default_ttl: 60
  return_last_healthy: false
logging:
  level: info
  format: json
metrics:
  enabled: true
  address: ":9090"
regions:
  - name: us-east-1
    servers:
      - address: 10.0.1.10
        port: 80
        weight: 100
      - address: 10.0.1.11
        port: 80
        weight: 100
    health_check:
      type: http
      interval: 30s
      timeout: 5s
      path: /health
      failure_threshold: 3
      success_threshold: 2
  - name: us-west-2
    servers:
      - address: 10.0.2.10
        port: 80
        weight: 150
      - address: 10.0.2.11
        port: 80
        weight: 100
    health_check:
      type: http
      interval: 30s
      timeout: 5s
      path: /health
      failure_threshold: 3
      success_threshold: 2
  - name: eu-west-1
    servers:
      - address: 10.0.3.10
        port: 8080
        weight: 100
    health_check:
      type: tcp
      interval: 15s
      timeout: 3s
domains:
  - name: app.example.com
    routing_algorithm: round-robin
    regions:
      - us-east-1
      - us-west-2
    ttl: 30
  - name: api.example.com
    routing_algorithm: round-robin
    regions:
      - us-east-1
      - us-west-2
      - eu-west-1
    ttl: 60
  - name: static.example.com
    routing_algorithm: round-robin
    regions:
      - us-west-2
    ttl: 300
Example Configurations
Single Region (Development)
Minimal configuration for development or single-datacenter deployments:
dns:
  listen_address: ":5353"
  default_ttl: 30
logging:
  level: debug
  format: text
regions:
  - name: local
    servers:
      - address: 127.0.0.1
        port: 8080
      - address: 127.0.0.1
        port: 8081
    health_check:
      type: http
      interval: 10s
      timeout: 2s
      path: /health
      failure_threshold: 2
      success_threshold: 1
domains:
  - name: myapp.local
    routing_algorithm: round-robin
    regions:
      - local
Multi-Region (Production)
Production configuration with multiple regions and strict health checking:
dns:
  listen_address: ":53"
  default_ttl: 60
  return_last_healthy: false
logging:
  level: info
  format: json
metrics:
  enabled: true
  address: ":9090"
regions:
  - name: us-east-1
    servers:
      - address: 10.0.1.10
        port: 443
      - address: 10.0.1.11
        port: 443
      - address: 10.0.1.12
        port: 443
    health_check:
      type: https
      interval: 15s
      timeout: 3s
      path: /healthz
      failure_threshold: 3
      success_threshold: 2
  - name: us-west-2
    servers:
      - address: 10.0.2.10
        port: 443
      - address: 10.0.2.11
        port: 443
    health_check:
      type: https
      interval: 15s
      timeout: 3s
      path: /healthz
      failure_threshold: 3
      success_threshold: 2
  - name: eu-central-1
    servers:
      - address: 10.0.3.10
        port: 443
      - address: 10.0.3.11
        port: 443
    health_check:
      type: https
      interval: 15s
      timeout: 3s
      path: /healthz
      failure_threshold: 3
      success_threshold: 2
domains:
  - name: api.mycompany.com
    routing_algorithm: round-robin
    regions:
      - us-east-1
      - us-west-2
      - eu-central-1
    ttl: 30
High-Availability with Fast Failover
Configuration optimized for rapid failover detection:
dns:
  listen_address: ":53"
  default_ttl: 10 # Short TTL for fast client updates
regions:
  - name: primary
    servers:
      - address: 10.0.1.10
        port: 80
      - address: 10.0.1.11
        port: 80
    health_check:
      type: http
      interval: 5s # Check every 5 seconds
      timeout: 2s
      path: /health
      failure_threshold: 2 # Mark unhealthy after 10 seconds
      success_threshold: 1 # Recover immediately on success
domains:
  - name: critical-app.example.com
    routing_algorithm: round-robin
    regions:
      - primary
    ttl: 5 # Very short TTL
Command Line Options
./opengslb [options]
Options:
--config string Path to configuration file (default "/etc/opengslb/config.yaml")
--version Show version information and exit
Environment Variables
Currently, OpenGSLB does not support environment variable substitution in configuration files. All values must be specified directly in the YAML file.
Validation
OpenGSLB validates configuration on startup and will fail with descriptive error messages for:
Missing required fields
Invalid duration formats
Invalid port numbers
Invalid log levels or formats
Domains referencing non-existent regions
Timeout >= interval for health checks
Out-of-range threshold values
Weighted Routing
Weighted routing distributes traffic proportionally based on server weights. Servers with higher weights receive more traffic.
Configuration
domains:
  - name: app.example.com
    routing_algorithm: weighted
    regions:
      - my-region
How It Works
Traffic distribution is proportional to server weights:
| Server | Weight | Traffic Share |
|---|---|---|
| server1 | 150 | 50% |
| server2 | 100 | 33% |
| server3 | 50 | 17% |
The algorithm uses weighted random selection. On each DNS query, a server is randomly selected with probability proportional to its weight. Over many queries, the distribution matches the weight ratios.
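Weighted random selection can be sketched in a few lines of Python. This is an illustration of the general technique under the rules described here (probability proportional to weight, zero-weight servers skipped), not OpenGSLB's implementation. `pick_weighted` is a hypothetical name, and health filtering is assumed to have happened before the call.

```python
import random
from typing import Optional, Sequence, Tuple


def pick_weighted(servers: Sequence[Tuple[str, int]],
                  rng: random.Random) -> Optional[str]:
    """Pick one server name with probability proportional to its weight."""
    # Servers with weight 0 never participate in selection.
    candidates = [(name, weight) for name, weight in servers if weight > 0]
    total = sum(weight for _, weight in candidates)
    if total == 0:
        return None
    # Draw a point in [0, total) and walk the cumulative weights.
    point = rng.random() * total
    cumulative = 0
    for name, weight in candidates:
        cumulative += weight
        if point < cumulative:
            return name
    return candidates[-1][0]  # guard against floating-point edge cases
```

With the weights from the table (150, 100, 50), the total is 300, so the expected shares are 150/300, 100/300, and 50/300. Any single query can still land on any server; only the long-run distribution matches the ratios.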
Weight Behavior
Weight > 0: Server participates in selection with given weight
Weight = 0: Server is excluded from selection (useful for soft-disabling)
Unhealthy servers: Excluded regardless of weight
Use Cases
Capacity-based distribution: Route more traffic to higher-capacity servers
Gradual migrations: Shift traffic by adjusting weights over time
Cost optimization: Send less traffic to more expensive regions
Comparison with Round-Robin
| Aspect | Round-Robin | Weighted |
|---|---|---|
| Distribution | Equal | Proportional to weight |
| Server weights | Ignored | Respected |
| Predictability | Deterministic rotation | Probabilistic |
| Use case | Homogeneous servers | Heterogeneous capacity |
Example: Gradual Traffic Shift
To gradually shift traffic from old to new servers:
# Week 1: 90% old, 10% new
servers:
  - address: "10.0.1.10" # old
    weight: 90
  - address: "10.0.2.10" # new
    weight: 10
# Week 2: 50% old, 50% new
servers:
  - address: "10.0.1.10"
    weight: 50
  - address: "10.0.2.10"
    weight: 50
# Week 3: 10% old, 90% new
servers:
  - address: "10.0.1.10"
    weight: 10
  - address: "10.0.2.10"
    weight: 90
Failover (Active/Standby) Routing
Failover routing directs all traffic to the highest-priority healthy server. When that server becomes unhealthy, traffic automatically fails over to the next server in priority order.
Configuration
domains:
  - name: critical-app.example.com
    routing_algorithm: failover
    regions:
      - my-region
How It Works
Server priority is determined by the order in the configuration file:
servers:
  - address: "10.0.1.10" # Priority 1 (Primary)
  - address: "10.0.1.11" # Priority 2 (Secondary)
  - address: "10.0.1.12" # Priority 3 (Tertiary)
The routing behavior:
| Primary | Secondary | Tertiary | Traffic Goes To |
|---|---|---|---|
| ✅ Healthy | ✅ Healthy | ✅ Healthy | Primary |
| ❌ Unhealthy | ✅ Healthy | ✅ Healthy | Secondary |
| ❌ Unhealthy | ❌ Unhealthy | ✅ Healthy | Tertiary |
| ❌ Unhealthy | ❌ Unhealthy | ❌ Unhealthy | SERVFAIL |
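The table reduces to "first healthy server in configuration order". A minimal Python sketch of that rule follows; `pick_failover` is a hypothetical helper, and returning `None` stands in for the SERVFAIL case.

```python
from typing import Optional, Sequence, Tuple


def pick_failover(servers: Sequence[Tuple[str, bool]]) -> Optional[str]:
    """Return the first healthy address in priority (configuration) order.

    Each entry is (address, is_healthy). None means no server is healthy,
    which corresponds to the SERVFAIL row in the table.
    """
    for address, is_healthy in servers:
        if is_healthy:
            return address
    return None
```

Because the list is re-evaluated from the top on every call, a recovered primary is selected again as soon as it is marked healthy.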
Return-to-Primary Behavior
When a higher-priority server recovers, traffic automatically returns to it. This is the default and expected behavior for most disaster recovery scenarios.
Example timeline:
T=0: Primary healthy → traffic to Primary
T=5: Primary fails health checks → traffic to Secondary
T=10: Primary recovers → traffic returns to Primary
Use Cases
Disaster Recovery: Primary datacenter with hot standby
Maintenance Windows: Graceful failover during updates
Cost Optimization: Use expensive standby only when needed
Regulatory Compliance: Ensure traffic stays in primary region when possible
Comparison with Other Algorithms
| Aspect | Round-Robin | Weighted | Failover |
|---|---|---|---|
| Traffic pattern | Distributed | Proportional | Single server |
| Predictability | Rotates | Probabilistic | Deterministic |
| Failover | Automatic | Automatic | Automatic |
| Recovery | N/A | N/A | Returns to primary |
| Use case | Load distribution | Capacity-based | DR/Active-standby |
Health Check Recommendations
For failover routing, consider:
Short intervals (10-15s): Detect failures quickly
Low failure threshold (2-3): Fail over promptly
Higher success threshold (3-5): Avoid flapping on recovery
Short DNS TTL (15-30s): Clients update quickly after failover
health_check:
  interval: 10s
  failure_threshold: 2 # Fail fast
  success_threshold: 3 # Recover carefully
domains:
  - name: app.example.com
    ttl: 15 # Short TTL for failover scenarios
Monitoring Failover Events
Monitor these metrics to track failover:
opengslb_routing_decisions_total{algorithm="failover",server="..."} - Which server is receiving traffic
opengslb_health_check_results_total{result="unhealthy"} - Health check failures
A spike in traffic to the secondary server indicates a failover event.
Geolocation Routing
Geolocation routing directs traffic to servers based on the client’s geographic location. OpenGSLB uses MaxMind GeoIP2/GeoLite2 databases to resolve client IP addresses to geographic regions.
Configuration
domains:
  - name: app.example.com
    routing_algorithm: geolocation
    regions:
      - us-east-1
      - eu-west-1
      - ap-southeast-1
geolocation:
  database_path: "/var/lib/opengslb/geoip/GeoLite2-Country.mmdb"
  default_region: us-east-1
  ecs_enabled: true
  custom_mappings:
    - cidr: "10.0.0.0/8"
      region: us-east-1
    - cidr: "172.16.0.0/12"
      region: eu-west-1
Geolocation Settings
| Field | Type | Default | Description |
|---|---|---|---|
| database_path | string | Required | Path to MaxMind GeoIP2/GeoLite2 database file (.mmdb) |
| default_region | string | Required | Fallback region when geolocation lookup fails |
| ecs_enabled | boolean | | Enable EDNS Client Subnet support for accurate client location |
| custom_mappings | list | (empty) | Custom CIDR-to-region mappings |
Custom CIDR Mappings
Custom mappings override GeoIP lookups for specific IP ranges. This is useful for:
Internal networks that should route to specific regions
Known customer IP ranges with preferred regions
Overriding incorrect GeoIP data
custom_mappings:
  - cidr: "10.0.0.0/8" # Internal US network
    region: us-east-1
  - cidr: "192.168.0.0/16" # Internal EU network
    region: eu-west-1
  - cidr: "203.0.113.0/24" # Customer's APAC network
    region: ap-southeast-1
Custom mappings use longest-prefix matching—the most specific CIDR match wins.
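Longest-prefix matching matters when two mappings overlap. The Python sketch below shows the idea with the standard `ipaddress` module; `match_custom_mapping` is a hypothetical helper and the linear scan is for illustration only (the actual lookup structure is not described in this document).

```python
import ipaddress
from typing import Optional, Sequence, Tuple


def match_custom_mapping(client_ip: str,
                         mappings: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Return the region of the most specific CIDR containing client_ip."""
    ip = ipaddress.ip_address(client_ip)
    best_prefix = -1
    best_region = None
    for cidr, region in mappings:
        network = ipaddress.ip_network(cidr)
        # Skip IPv4/IPv6 mismatches, then keep the longest matching prefix.
        if ip.version == network.version and ip in network:
            if network.prefixlen > best_prefix:
                best_prefix = network.prefixlen
                best_region = region
    return best_region
```

For example, with both 10.0.0.0/8 and 10.1.0.0/16 configured, a client at 10.1.2.3 matches both, and the /16 entry wins because it is more specific. A `None` result means no custom mapping applies, so the GeoIP lookup would be used instead.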
Region Configuration for Geolocation
Regions must specify which countries or continents they serve:
regions:
  - name: us-east-1
    countries: ["US", "CA", "MX"]
    continents: ["NA", "SA"]
    servers:
      - address: "10.0.1.10"
        port: 8080
  - name: eu-west-1
    countries: ["GB", "DE", "FR", "NL", "BE"]
    continents: ["EU"]
    servers:
      - address: "10.0.2.10"
        port: 8080
  - name: ap-southeast-1
    countries: ["SG", "MY", "TH", "VN", "ID"]
    continents: ["AS", "OC"]
    servers:
      - address: "10.0.3.10"
        port: 8080
| Field | Type | Description |
|---|---|---|
| countries | list | ISO 3166-1 alpha-2 country codes served by this region |
| continents | list | Continent codes: AF, AN, AS, EU, NA, OC, SA |
EDNS Client Subnet (ECS) Support
When ecs_enabled: true, OpenGSLB extracts client location from ECS information in DNS queries. This provides more accurate geolocation when queries come from recursive resolvers (like Google DNS or Cloudflare) that include client subnet data.
GeoIP Database Setup
Download a MaxMind GeoLite2 database:
# Register at maxmind.com for a free license key
# Download GeoLite2-Country.mmdb
mkdir -p /var/lib/opengslb/geoip
mv GeoLite2-Country.mmdb /var/lib/opengslb/geoip/
chown opengslb:opengslb /var/lib/opengslb/geoip/GeoLite2-Country.mmdb
For production deployments, automate database updates using MaxMind’s geoipupdate tool. See the GeoIP maintenance runbook in the operations documentation.
Monitoring Geolocation Routing
Monitor these metrics:
opengslb_geo_routing_decision{country="...",continent="...",region="..."} - Routing decisions by location
opengslb_geo_fallback{reason="..."} - Fallback events and reasons
opengslb_geo_custom_mapping_hit{region="..."} - Custom CIDR mapping matches
Latency-Based Routing
Latency-based routing directs traffic to the server with the lowest measured latency. This algorithm continuously measures latency during health checks and uses exponential moving average (EMA) smoothing to prevent routing flapping.
Configuration
domains:
  - name: app.example.com
    routing_algorithm: latency
    regions:
      - us-east-1
      - us-west-2
    latency_config:
      smoothing_factor: 0.3
      max_latency_ms: 500
      min_samples: 3
Latency Settings
| Field | Type | Default | Description |
|---|---|---|---|
| smoothing_factor | float | | EMA smoothing factor (0.0-1.0). Higher = more responsive, lower = more stable |
| max_latency_ms | integer | | Maximum acceptable latency in milliseconds. Servers exceeding this are excluded |
| min_samples | integer | | Minimum latency samples required before using server for routing |
How It Works
Latency measurement: During each health check, OpenGSLB measures the TCP connection time to the backend
EMA smoothing: Latency values are smoothed using exponential moving average to prevent routing flapping from transient spikes
Server selection: The server with the lowest smoothed latency is selected
Threshold enforcement: Servers with latency exceeding max_latency_ms are excluded from selection
Automatic fallback: Falls back to round-robin when insufficient latency data is available
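The selection and threshold steps above can be sketched as a filter followed by a minimum. This Python example is an illustration, not the shipped algorithm. `select_lowest_latency` is a hypothetical helper, and returning `None` when nothing qualifies (so the caller can fall back to round-robin) is an assumption about how the fallback is wired.

```python
from typing import Dict, Optional, Tuple


def select_lowest_latency(stats: Dict[str, Tuple[float, int]],
                          max_latency_ms: float,
                          min_samples: int) -> Optional[str]:
    """Pick the server with the lowest smoothed latency.

    stats maps server -> (smoothed_latency_ms, sample_count).
    Servers with too few samples or latency above max_latency_ms are skipped.
    """
    eligible = {
        server: latency
        for server, (latency, samples) in stats.items()
        if samples >= min_samples and latency <= max_latency_ms
    }
    if not eligible:
        return None  # assumed signal for the caller to use round-robin
    return min(eligible, key=eligible.get)
```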
Smoothing Factor
The smoothing factor controls how responsive the latency calculation is to new measurements:
| Factor | Behavior |
|---|---|
| | Very stable, slow to react to changes |
| | Balanced (default) |
| | Moderate responsiveness |
| | Highly responsive, may flap on spikes |
Formula: new_latency = (smoothing_factor * measured) + ((1 - smoothing_factor) * previous)
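A short worked example makes the effect of the factor concrete. The Python function below is a direct transcription of the formula; the input values (a 100 ms baseline and a 500 ms spike) are made up for illustration.

```python
def ema_update(previous: float, measured: float, smoothing_factor: float) -> float:
    """Apply one EMA step: blend the new measurement with the previous value."""
    return smoothing_factor * measured + (1 - smoothing_factor) * previous


# With smoothing_factor = 0.3, a single 500 ms spike on a 100 ms baseline
# moves the smoothed value to 0.3 * 500 + 0.7 * 100, i.e. about 220 ms.
after_spike = ema_update(previous=100.0, measured=500.0, smoothing_factor=0.3)

# If the next measurement is back at 100 ms, the value decays toward the baseline.
after_recovery = ema_update(previous=after_spike, measured=100.0, smoothing_factor=0.3)
```

A single outlier therefore shifts the smoothed latency by only a fraction of the spike, which is what keeps routing from flapping.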
Use Cases
Global deployments: Route users to the fastest regional server
Multi-cloud: Route to the cloud provider with best current performance
Hybrid deployments: Balance between on-premises and cloud based on network conditions
Example: Multi-Region Latency Routing
dns:
  default_ttl: 30
regions:
  - name: us-east-1
    servers:
      - address: "10.0.1.10"
        port: 8080
      - address: "10.0.1.11"
        port: 8080
    health_check:
      type: http
      interval: 10s
      timeout: 5s
      path: /health
  - name: us-west-2
    servers:
      - address: "10.0.2.10"
        port: 8080
    health_check:
      type: http
      interval: 10s
      timeout: 5s
      path: /health
  - name: eu-west-1
    servers:
      - address: "10.0.3.10"
        port: 8080
    health_check:
      type: http
      interval: 10s
      timeout: 5s
      path: /health
domains:
  - name: api.example.com
    routing_algorithm: latency
    regions:
      - us-east-1
      - us-west-2
      - eu-west-1
    ttl: 30
    latency_config:
      smoothing_factor: 0.3
      max_latency_ms: 200
      min_samples: 5
Monitoring Latency Routing
Monitor these metrics:
opengslb_latency_routing_decision{server="...",latency_ms="..."} - Selected server and its latency
opengslb_latency_rejection{server="...",reason="..."} - Servers excluded due to high latency or insufficient samples
opengslb_health_check_latency_seconds{server="..."} - Raw health check latency measurements
Combining with Geolocation
For optimal performance, consider using geolocation routing with latency as a secondary factor. Configure regions geographically, and latency routing will select the fastest server within the client’s region.
Learned Latency Routing (ADR-017)
Learned latency routing uses passive TCP RTT data collected by agents to route clients to the backend with the lowest measured latency. Unlike standard latency routing (which measures Overwatch-to-backend latency), this captures the actual client-to-backend experience.
How It Differs from Standard Latency Routing
| Aspect | Standard Latency | Learned Latency |
|---|---|---|
| What’s measured | Overwatch → Backend | Client → Backend |
| Measurement method | Active health check probes | Passive TCP RTT from OS |
| Accuracy | Proxy’s perspective | Client’s actual experience |
| Data source | Overwatch only | Agent gossip |
| Cold start | Falls back to round-robin | Falls back to geolocation |
Configuration
domains:
  - name: app.example.com
    routing_algorithm: learned_latency
    regions:
      - us-east
      - us-west
      - eu-west
      - ap-southeast
    ttl: 60
    latency_config:
      max_latency_ms: 300
      min_samples: 5
Important: Learned latency routing requires agents with latency_learning.enabled: true to collect and gossip RTT data.
Learned Latency Settings
| Field | Type | Default | Description |
|---|---|---|---|
| max_latency_ms | integer | | Exclude backends with latency above this threshold |
| min_samples | integer | | Minimum samples required before using learned data for a subnet |
How It Works
Agents collect TCP RTT: When clients connect to backends, agents read TCP connection RTT from the OS kernel
Subnet aggregation: RTT samples are aggregated by client subnet (default /24 for IPv4)
Gossip to Overwatch: Agents periodically send latency reports to all Overwatch nodes
DNS routing: When a query arrives, Overwatch looks up learned latency for that client’s subnet and selects the lowest-latency backend
Cold start fallback: If no learned data exists for a subnet, falls back to geolocation routing
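The subnet aggregation step can be illustrated with Python's `ipaddress` module. This sketch only shows how a client address maps to its aggregation key; `subnet_key` is a hypothetical helper, and the prefix lengths 24 and 48 are the values from the agent example earlier in this document.

```python
import ipaddress


def subnet_key(client_ip: str, ipv4_prefix: int = 24, ipv6_prefix: int = 48) -> str:
    """Map a client address to the subnet used as its aggregation key."""
    ip = ipaddress.ip_address(client_ip)
    prefix = ipv4_prefix if ip.version == 4 else ipv6_prefix
    # strict=False masks off the host bits instead of raising an error.
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))
```

With a /24 prefix, every client in 10.0.1.x shares one latency entry, which bounds memory use and lets a new client benefit from RTT samples collected from its neighbors.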
Viewing Learned Latency Data
Query the Overwatch API to see collected latency data:
curl http://localhost:9090/api/v1/overwatch/latency | jq .
Example response:
{
  "entries": [
    {
      "subnet": "10.1.2.0/24",
      "domain": "app.example.com",
      "region": "eu-west",
      "rtt_ms": 85,
      "samples": 150,
      "last_updated": "2025-12-19T10:05:00Z"
    }
  ]
}
Use Cases
True client optimization: Route based on actual client experience, not proxy measurements
CDN-like behavior: Automatically route clients to their lowest-latency backend
Multi-cloud arbitrage: Discover which cloud provider is fastest for each client subnet
ISP-aware routing: Different ISPs may have different latency to your backends
Example: Full Learned Latency Deployment
Overwatch configuration:
mode: overwatch
domains:
  - name: app.example.com
    routing_algorithm: learned_latency
    regions:
      - us-east
      - eu-west
      - ap-southeast
    latency_config:
      max_latency_ms: 300
      min_samples: 5
overwatch:
  geolocation:
    database_path: /var/lib/opengslb/GeoLite2-Country.mmdb
    default_region: us-east
Agent configuration:
mode: agent
agent:
  backends:
    - service: "app.example.com"
      address: "127.0.0.1"
      port: 80
      weight: 100
  latency_learning:
    enabled: true
    poll_interval: 10s
    min_connection_age: 5s
    report_interval: 30s
Configuration Hot-Reload
OpenGSLB supports reloading configuration without restarting the service. This allows you to add/remove domains and servers, change routing algorithms, and update health check settings with zero downtime.
Triggering a Reload
Send SIGHUP to the OpenGSLB process:
# Find the process ID
pgrep opengslb
# Send SIGHUP
kill -HUP $(pgrep opengslb)
# Or in one command
pkill -HUP opengslb
What Can Be Reloaded
| Setting | Hot-Reload | Notes |
|---|---|---|
| Domains | ✅ Yes | Add, remove, or modify domains |
| Servers | ✅ Yes | Add, remove, or modify servers |
| Regions | ✅ Yes | Add, remove, or modify regions |
| Health check settings | ✅ Yes | Interval, timeout, thresholds |
| Routing algorithm | ✅ Yes | Change algorithm for domains |
| DNS TTL | ✅ Yes | Per-domain or default TTL |
| DNS listen address | ❌ No | Requires restart |
| Metrics port | ❌ No | Requires restart |
Reload Behavior
Validation first: The new configuration is fully validated before any changes are applied
Atomic swap: Changes are applied atomically—partial updates don’t happen
Health state preserved: Existing servers retain their health state during reload
No query disruption: In-flight DNS queries are not affected
Reload Process
When you send SIGHUP:
OpenGSLB reads and validates the configuration file
If validation fails, the old configuration continues (error logged)
If validation succeeds:
  DNS registry is updated with new domains
  Health checks are started for new servers
  Health checks are stopped for removed servers
  Router is updated if algorithm changed
Success/failure is logged and recorded in metrics
Monitoring Reloads
Check reload metrics in Prometheus:
# Total reload attempts by result
opengslb_config_reloads_total{result="success"}
opengslb_config_reloads_total{result="failure"}
# Timestamp of last successful reload
opengslb_config_reload_timestamp_seconds
Logs
Successful reload:
level=INFO msg="received SIGHUP, reloading configuration"
level=INFO msg="reloading configuration" old_domains=2 new_domains=3 old_regions=1 new_regions=2
level=INFO msg="health manager reconfigured" added=2 removed=0 updated=0 total=5
level=INFO msg="configuration reload complete" domains=3 servers=5
level=INFO msg="configuration reloaded successfully"
Failed reload (invalid config):
level=INFO msg="received SIGHUP, reloading configuration"
level=ERROR msg="configuration reload failed" error="failed to load configuration: validation error: ..."
Best Practices
Validate before reload: Test your config changes with opengslb --config /path/to/new/config.yaml --validate (if available) or in a staging environment
Use version control: Keep your configuration in git to track changes and enable rollback
Monitor after reload: Watch metrics and logs after reloading to confirm expected behavior
Gradual changes: Make incremental config changes rather than large rewrites
Backup config: Keep a known-good configuration file as backup
Example: Adding a New Server
Original config:
regions:
  - name: primary
    servers:
      - address: "10.0.1.10"
        port: 80
Updated config:
regions:
  - name: primary
    servers:
      - address: "10.0.1.10"
        port: 80
      - address: "10.0.1.11" # New server
        port: 80
Reload:
kill -HUP $(pgrep opengslb)
The new server will immediately begin health checks and be added to rotation once healthy.
Example: Changing Routing Algorithm
domains:
  - name: app.example.com
    routing_algorithm: weighted # Changed from round-robin
After reload, traffic distribution will change to respect server weights.
Multi-File Configuration (Includes)
For large deployments with many domains managed by different teams, OpenGSLB supports splitting configuration across multiple files. This enables team-based configuration management while maintaining centralized infrastructure settings.
Basic Usage
Use the includes directive in your main configuration to include additional files:
# /etc/opengslb/config.yaml
mode: overwatch
dns:
  listen_address: ":53"
  zones:
    - gslb.example.com
includes:
  - regions/*.yaml # Load all region files
  - domains/**/*.yaml # Recursively load domain files
  - tokens.yaml # Load agent tokens
Glob Patterns
The includes directive supports glob patterns:
| Pattern | Matches |
|---|---|
| *.yaml | All YAML files in the current directory |
| regions/*.yaml | All YAML files in the regions/ subdirectory |
| domains/**/*.yaml | All YAML files recursively under domains/ |
| tokens.yaml | Specific file |
Patterns are relative to the main configuration file’s directory.
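To preview which files a set of include patterns would pick up, you can expand them relative to the main file's directory. The Python sketch below uses the standard `glob` module. It assumes OpenGSLB's `**` matches zero or more directory levels, as Python's does with `recursive=True`; the function name `resolve_includes` is hypothetical.

```python
import glob
import os


def resolve_includes(main_config_path, patterns):
    """Expand include patterns relative to the main config file's directory."""
    base_dir = os.path.dirname(os.path.abspath(main_config_path))
    resolved = []
    for pattern in patterns:
        # Sort matches so the result is deterministic within each pattern.
        matches = sorted(glob.glob(os.path.join(base_dir, pattern), recursive=True))
        for path in matches:
            if path not in resolved:  # a file matched twice is listed once
                resolved.append(path)
    return resolved
```

For example, `resolve_includes("/etc/opengslb/config.yaml", ["regions/*.yaml", "domains/**/*.yaml", "tokens.yaml"])` returns absolute paths in pattern order.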
Merge Semantics
When multiple files are loaded, content is merged according to these rules:
Field |
Merge Behavior |
|---|---|
|
Arrays are concatenated |
|
Arrays are concatenated |
|
Maps are merged (later values override) |
|
Arrays are concatenated |
|
Arrays are concatenated |
Other scalars |
Only from main file (includes cannot override) |
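The three merge rules can be sketched in Python as follows. The caller passes in which fields are treated as arrays and which as maps, so the sketch does not assert which real OpenGSLB fields fall into each group; the field names in the usage example (`regions`, `domains`, `tokens`) are illustrative assumptions.

```python
def merge_configs(main, includes, list_fields, map_fields):
    """Merge included configs into the main config.

    list_fields: top-level keys whose arrays are concatenated
    map_fields:  top-level keys whose maps are merged (later values override)
    Any other key is taken from the main file only.
    """
    merged = dict(main)
    for field in list_fields:
        merged[field] = list(main.get(field, []))
    for field in map_fields:
        merged[field] = dict(main.get(field, {}))
    for included in includes:
        for key, value in included.items():
            if key in list_fields:
                merged[key].extend(value)   # arrays are concatenated
            elif key in map_fields:
                merged[key].update(value)   # later files override earlier ones
            # Scalars in included files are ignored: the main file wins.
    return merged
```

For instance, with `list_fields={"regions", "domains"}` and `map_fields={"tokens"}`, an included file that sets `mode: agent` does not change the `mode` of the merged result.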
Example Directory Structure
/etc/opengslb/
├── config.yaml              # Main configuration
├── regions/
│   ├── us-east.yaml         # US East region
│   ├── us-west.yaml         # US West region
│   └── eu-west.yaml         # EU West region
├── domains/
│   ├── team-a/
│   │   └── app.yaml         # Team A's application domain
│   └── team-b/
│       └── api.yaml         # Team B's API domain
└── tokens.yaml              # Agent authentication tokens
Example Files
Main config (config.yaml):
mode: overwatch
dns:
  listen_address: ":53"
  zones:
    - gslb.example.com
  default_ttl: 30
overwatch:
  gossip:
    encryption_key: "YOUR_KEY_HERE"
  dnssec:
    enabled: true
logging:
  level: info
  format: json
includes:
  - regions/*.yaml
  - domains/**/*.yaml
  - tokens.yaml
Region file (regions/us-east.yaml):
regions:
  - name: us-east-1
    countries: ["US", "CA", "MX"]
    continents: ["NA", "SA"]
    servers:
      - address: "10.0.1.10"
        port: 8080
      - address: "10.0.1.11"
        port: 8080
    health_check:
      type: http
      path: /health
      interval: 30s
Domain file (domains/team-a/app.yaml):
domains:
  - name: app.gslb.example.com
    routing_algorithm: round-robin
    regions:
      - us-east-1
      - us-west-2
    ttl: 30
Tokens file (tokens.yaml):
overwatch:
  agent_tokens:
    team-a-app: "secret-token-for-team-a"
    team-b-api: "secret-token-for-team-b"
Error Handling
OpenGSLB provides clear error messages with file context:
Duplicate region name:
regions/backup.yaml: duplicate region name "us-east-1"
Circular include:
circular include detected: config.yaml -> base.yaml -> config.yaml
Permission error:
regions/insecure.yaml: permission check failed: file is world-writable, which is a security risk
Security
Included files undergo the same permission checks as the main config
World-writable files are rejected
Maximum include depth is 10 levels (prevents infinite recursion)
Circular includes are detected and rejected
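Before deploying, you can run a similar permission check yourself. The Python sketch below flags files that are world-writable or world-readable, which are the two conditions this document says OpenGSLB rejects. It only inspects POSIX mode bits and is not OpenGSLB's own check; `check_permissions` is a hypothetical name.

```python
import os
import stat


def check_permissions(path):
    """Return a list of permission problems for a config file (empty if none)."""
    mode = os.stat(path).st_mode
    problems = []
    if mode & stat.S_IWOTH:
        problems.append("file is world-writable")
    if mode & stat.S_IROTH:
        problems.append("file is world-readable")
    return problems
```

A file with mode 640 or 600 produces an empty list, while 644 is reported as world-readable.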
Hot-Reload with Includes
When you send SIGHUP, all included files are re-read along with the main configuration. This means:
Changes to any included file take effect on reload
New files matching glob patterns are automatically included
Removed files are no longer included
# Reload after editing any config file
kill -HUP $(pgrep opengslb)
Nested Includes
Included files can themselves contain includes directives:
# base.yaml
includes:
  - regions/*.yaml
This allows for modular configuration hierarchies, but be careful not to create circular dependencies.
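Cycle detection and the depth limit can both be enforced by tracking the chain of files currently being loaded. The Python sketch below walks a mapping of file name to included file names. The mapping, the function name `walk_includes`, and the choice to count the main file as depth 0 are assumptions for illustration, so OpenGSLB may count levels differently.

```python
MAX_INCLUDE_DEPTH = 10  # limit stated in the Security section


def walk_includes(name, includes_of, chain=()):
    """Return files in load order, rejecting cycles and excessive nesting.

    includes_of: dict mapping a file name to the list of files it includes
    chain:       files currently being loaded, outermost first
    """
    if name in chain:
        raise ValueError("circular include detected: " + " -> ".join(chain + (name,)))
    if len(chain) > MAX_INCLUDE_DEPTH:
        raise ValueError(f"include depth exceeds {MAX_INCLUDE_DEPTH} levels at {name}")
    loaded = [name]
    for child in includes_of.get(name, []):
        loaded.extend(walk_includes(child, includes_of, chain + (name,)))
    return loaded
```

With `{"config.yaml": ["base.yaml"], "base.yaml": ["config.yaml"]}` the walk raises an error whose message matches the circular include example in the Error Handling section.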
Best Practices
Separate by responsibility: Keep infrastructure settings in the main file, let teams manage their domains
Use descriptive directories: domains/team-a/ is clearer than domains/a/
Document ownership: Add comments indicating who manages each file
Secure sensitive files: Keep tokens in a separate file with restrictive permissions
Version control: Track all configuration files in git for audit trail
Validating Configuration
Validate your multi-file configuration before deploying:
opengslb-cli config validate --config /etc/opengslb/config.yaml
This will load all included files and report any validation errors.
IPv6 Support
OpenGSLB supports both IPv4 and IPv6 addresses for backend servers. The DNS server automatically handles A (IPv4) and AAAA (IPv6) queries, returning only addresses of the appropriate family.
Configuration
Simply configure servers with IPv6 addresses:
regions:
  - name: us-east
    servers:
      - address: "10.0.1.10" # IPv4
        port: 80
        weight: 100
      - address: "10.0.1.11" # IPv4
        port: 80
        weight: 100
      - address: "2001:db8::1" # IPv6
        port: 80
        weight: 100
      - address: "2001:db8::2" # IPv6
        port: 80
        weight: 100
    health_check:
      type: http
      interval: 30s
      timeout: 5s
      path: /health
Query Behavior
| Query Type | Servers Considered | Response |
|---|---|---|
| A (IPv4) | Only IPv4 servers | A record with IPv4 address |
| AAAA (IPv6) | Only IPv6 servers | AAAA record with IPv6 address |
Mixed Environments
In environments with both IPv4 and IPv6 servers:
A queries return only IPv4 addresses
AAAA queries return only IPv6 addresses
Each address family is load-balanced independently
Health checks work for both IPv4 and IPv6 endpoints
IPv4-Only or IPv6-Only Domains
If a domain only has servers of one address family:
Queries for the available family return addresses normally
Queries for the unavailable family return NOERROR with an empty answer section
This is standard DNS behavior indicating the domain exists but has no records of the requested type.
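The per-family filtering described above can be sketched with Python's standard `ipaddress` module. This is an illustration rather than OpenGSLB's code, and the function names are hypothetical. IPv4-mapped IPv6 addresses are classified as IPv4, matching the note at the end of this section.

```python
import ipaddress


def address_family(address):
    """Return 4 or 6 for a server address; IPv4-mapped IPv6 counts as IPv4."""
    ip = ipaddress.ip_address(address)
    if ip.version == 6 and ip.ipv4_mapped is not None:
        return 4
    return ip.version


def candidates(servers, qtype):
    """Return the server addresses eligible to answer an A or AAAA query."""
    wanted = {"A": 4, "AAAA": 6}[qtype]
    return [s for s in servers if address_family(s) == wanted]
```

An empty list from `candidates` corresponds to the NOERROR response with an empty answer section: the domain exists, but it has no servers of the requested family.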
Example: Dual-Stack Configuration
regions:
  - name: primary-dc
    servers:
      # IPv4 servers
      - address: "192.168.1.10"
        port: 443
        weight: 100
      - address: "192.168.1.11"
        port: 443
        weight: 100
      # IPv6 servers
      - address: "2001:db8:1::10"
        port: 443
        weight: 100
      - address: "2001:db8:1::11"
        port: 443
        weight: 100
    health_check:
      type: http
      interval: 15s
      timeout: 3s
      path: /health
domains:
  - name: app.example.com
    routing_algorithm: round-robin
    regions:
      - primary-dc
    ttl: 30
Testing IPv6
# Query for IPv4 address
dig @localhost -p 15353 app.example.com A +short
# Returns: 192.168.1.10 (or .11)
# Query for IPv6 address
dig @localhost -p 15353 app.example.com AAAA +short
# Returns: 2001:db8:1::10 (or ::11)
Health Checks for IPv6
Health checks work identically for IPv6 servers. The health check URL is constructed using the IPv6 address in bracket notation:
http://[2001:db8:1::10]:443/health
TCP health checks connect to the IPv6 address directly.
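The bracket notation matters because an unbracketed IPv6 address would be ambiguous next to the port separator. A minimal Python sketch of the URL construction follows; the function name and its default arguments are assumptions for illustration.

```python
import ipaddress


def health_check_url(address, port, path="/health", scheme="http"):
    """Build a health check URL, wrapping IPv6 addresses in brackets."""
    ip = ipaddress.ip_address(address)
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"{scheme}://{host}:{port}{path}"
```

Calling `health_check_url("2001:db8:1::10", 443)` produces the URL shown above, while an IPv4 address is left unbracketed.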
Notes
IPv4-mapped IPv6 addresses (e.g., ::ffff:192.168.1.1) are treated as IPv4
Ensure your network infrastructure supports IPv6 if configuring IPv6 servers
Health checks must be reachable via the configured address family