Architecture Decisions
This document records significant architectural decisions made during OpenGSLB development. Each decision includes context, rationale, and consequences to help future contributors understand why the system is designed the way it is.
Note: ADRs marked with ⚠️ SUPERSEDED have been replaced by newer decisions but are retained for historical context.
ADR-001: Use Go for Implementation
Status: Accepted Date: 2024-11-01
Context
Need to choose a programming language for GSLB implementation.
Decision
Use Go (Golang) as the primary language.
Rationale
Excellent performance for network services
Strong concurrency support for handling multiple health checks
Rich standard library for DNS and HTTP operations
Good ecosystem for building network infrastructure tools
Easy deployment (single binary)
Cross-platform compilation (Linux, Windows)
Consequences
Positive:
Single binary deployment simplifies operations
Strong typing catches errors at compile time
Excellent performance for network workloads
Negative:
Team needs Go expertise
ADR-002: DNS-Based Load Balancing Approach
Status: Accepted Date: 2024-11-01
Context
Need to choose between DNS-based, Anycast, or proxy-based GSLB.
Decision
Implement DNS-based GSLB that returns different IP addresses based on routing logic.
Rationale
DNS-based approach is widely compatible
Lower operational complexity than Anycast
More efficient than proxy-based (no single point of failure for data plane)
Clients cache DNS responses, reducing load on GSLB system
No network team involvement required (unlike BGP/Anycast)
Consequences
Positive:
Works with any client that supports DNS
No infrastructure changes required
Scales naturally through DNS caching
Negative:
TTL affects failover speed
Clients must respect DNS TTL
Cannot handle session persistence at DNS level
⚠️ ADR-003: Health Check Architecture
Status: SUPERSEDED by ADR-015 Date: 2024-11-01
This decision has been superseded. See ADR-015 for the current agent-overwatch architecture.
ADR-004: Configuration via YAML Files
Status: Accepted (Amended by ADR-015) Date: 2024-11-01
Context
Need configuration format for regions, servers, and policies.
Decision
Use YAML files for configuration with hot-reload support.
Rationale
Human-readable and easy to version control
Well-supported in Go ecosystem
Can be validated before deployment
Supports complex nested structures
Consequences
Positive:
Easy to read and edit
Git-friendly for change tracking
Schema validation catches errors before deployment
Negative:
Need schema validation implementation
File watching required for hot-reload
Secrets should use environment variable overrides
Amendment (ADR-015): YAML defines structural configuration. Runtime overrides stored in embedded KV store.
ADR-005: Pluggable Routing Algorithms
Status: Accepted Date: 2024-11-01
Context
Different use cases require different routing strategies.
Decision
Implement a strategy pattern for routing algorithms with a pluggable interface.
Supported Algorithms:
Round-robin
Weighted
Failover (active/standby)
Geolocation (GeoIP-based)
Latency-based
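A minimal sketch of what such a pluggable interface could look like. The `Router` interface, the `Server` type, and the `NewRouter` lookup are assumptions made for illustration and are not taken from the OpenGSLB codebase. Only round-robin and failover are shown; the remaining algorithms would plug in the same way.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// Server is a hypothetical, minimal view of a backend as the router sees it.
type Server struct {
	Address string
	Weight  int
}

// Router is an assumed shape for the pluggable interface: it receives
// servers that are already known to be healthy and picks exactly one.
type Router interface {
	Select(healthy []Server) (Server, error)
}

// ErrNoServers is returned when the candidate list is empty.
var ErrNoServers = errors.New("no healthy servers")

// RoundRobin cycles through the candidates in order.
type RoundRobin struct{ next atomic.Uint64 }

func (r *RoundRobin) Select(healthy []Server) (Server, error) {
	if len(healthy) == 0 {
		return Server{}, ErrNoServers
	}
	i := r.next.Add(1) - 1
	return healthy[i%uint64(len(healthy))], nil
}

// Failover always returns the first candidate (active). A standby is only
// chosen when earlier entries were filtered out as unhealthy upstream.
type Failover struct{}

func (Failover) Select(healthy []Server) (Server, error) {
	if len(healthy) == 0 {
		return Server{}, ErrNoServers
	}
	return healthy[0], nil
}

// routers maps an algorithm name to a constructor, so adding an algorithm
// means adding one entry rather than editing the selection logic.
var routers = map[string]func() Router{
	"round-robin": func() Router { return &RoundRobin{} },
	"failover":    func() Router { return Failover{} },
}

// NewRouter looks up an algorithm by name, e.g. per domain.
func NewRouter(name string) (Router, error) {
	ctor, ok := routers[name]
	if !ok {
		return nil, errors.New("unknown routing algorithm: " + name)
	}
	return ctor(), nil
}
```

Keeping the selection behind one small interface is what allows each algorithm to be tested in isolation with a fixed list of servers.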
Rationale
Flexibility to add new algorithms without core changes
Easy to test algorithms in isolation
Can switch algorithms per domain/service
Consequences
Positive:
New algorithms can be added without modifying existing code
Each algorithm is independently testable
Per-domain algorithm selection provides flexibility
Negative:
Need clear interface definition
Algorithm selection logic adds complexity
Each algorithm requires documentation
ADR-006: Prometheus for Metrics
Status: Accepted Date: 2024-11-01
Context
Need observability into GSLB operations and decisions.
Decision
Expose Prometheus metrics for all key operations.
Rationale
Industry standard for metrics
Excellent Go client library
Easy integration with Grafana
Pull-based model reduces GSLB dependencies
Consequences
Positive:
Standard tooling works out of the box
Rich ecosystem of dashboards and alerting
No push infrastructure required
Negative:
Metrics endpoint needs security (IP allowlist)
Must implement metric cardinality limits
Configurable bind address needed to avoid port collisions
⚠️ ADR-007: Separate Control and Data Planes
Status: SUPERSEDED by ADR-015 Date: 2024-11-01
This decision has been superseded. See ADR-015 for the current architecture where Overwatch nodes serve both roles independently.
ADR-008: TTL-Based Failover Strategy
Status: Accepted Date: 2024-11-01
Context
DNS caching affects failover speed.
Decision
Use configurable TTLs (default 30-60 seconds) for DNS responses, with health-check-based updates.
Rationale
Balance between failover speed and DNS query load
Clients will update within reasonable timeframe
Health checks can update more frequently than TTL
Reduces impact of stale DNS caches
TTL Guidelines:
| TTL | Use Case | Trade-off |
|---|---|---|
| < 5s | Not recommended | High query volume, resolver issues |
| 5-15s | Critical services | Aggressive, fast failover |
| 30-60s | Most deployments | Balanced, recommended |
| > 60s | Stable services | Conservative, slower failover |
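To make the trade-off concrete, the worst-case time a well-behaved client keeps using a failed backend can be estimated as detection time plus one full TTL. The sketch below assumes a hypothetical health-check interval and failure threshold; neither value is specified in this document.

```go
package main

import "time"

// worstCaseFailover estimates how long a client may keep using a failed
// backend: the time needed to detect the failure plus one full TTL of
// resolver caching. checkInterval and failuresToMarkDown are hypothetical
// health-check settings, not values taken from this document.
func worstCaseFailover(ttl, checkInterval time.Duration, failuresToMarkDown int) time.Duration {
	detection := checkInterval * time.Duration(failuresToMarkDown)
	return detection + ttl
}
```

For example, a 30s TTL with a 5s check interval and three consecutive failures gives 45s. Clients that cache longer than the TTL will exceed this estimate.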
Consequences
Positive:
Configurable per deployment needs
Health checks provide faster-than-TTL updates
Reasonable failover times for most use cases
Negative:
Higher DNS query volume with lower TTLs
Some clients cache longer than TTL
Need monitoring of DNS query rates
ADR-009: Unhealthy Server Response Strategy
Status: Accepted Date: 2024-11-01
Context
When all backend servers for a domain are unhealthy, the GSLB must decide how to respond to DNS queries.
Decision
Default to returning SERVFAIL, with a configurable option to return the last known good IP address.
dns:
  return_last_healthy: false  # Default: return SERVFAIL when all unhealthy
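The decision can be sketched as a small function. The function name and parameters below are illustrative assumptions; only the SERVFAIL default and the `return_last_healthy` option come from this ADR.

```go
package main

// Standard DNS response codes: NOERROR is 0 and SERVFAIL is 2.
const (
	rcodeSuccess  = 0
	rcodeServFail = 2
)

// answer decides what to return for one domain. healthy is the current list
// of healthy backend IPs; lastGood is the most recent IP served while at
// least one backend was healthy ("" if there is none yet). All names here
// are hypothetical.
func answer(healthy []string, lastGood string, returnLastHealthy bool) (ip string, rcode int) {
	if len(healthy) > 0 {
		// Real code would delegate the choice to the configured router.
		return healthy[0], rcodeSuccess
	}
	if returnLastHealthy && lastGood != "" {
		// "Limp mode": a stale answer that may or may not be reachable.
		return lastGood, rcodeSuccess
	}
	return "", rcodeServFail
}
```

With the default `return_last_healthy: false`, an empty healthy list always yields SERVFAIL, which is what surfaces the outage to monitoring.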
Rationale
SERVFAIL is the correct DNS response when the server cannot provide an authoritative answer
Some operators prefer degraded service over no service (“limp mode”)
Making it configurable allows operators to choose based on their requirements
Default to SERVFAIL as it’s more honest and helps surface issues quickly
Consequences
Positive:
Honest failure signaling by default
Configurable for “limp mode” when needed
Clear operational semantics
Negative:
Must maintain last-known-good state per domain
Operators must explicitly opt-in to stale responses
Monitoring should alert when serving stale responses
ADR-010: DNS Library Selection
Status: Accepted Date: 2024-12-01
Context
Need a DNS library for protocol handling.
Decision
Use github.com/miekg/dns v1.x.
Rationale
Industry standard (15,000+ importers including CoreDNS/Kubernetes)
Active maintenance with security updates
Stable API suitable for our A/AAAA record needs
Consequences
Positive:
Battle-tested in production at scale
Comprehensive DNS protocol support
Active community and maintenance
Negative:
External dependency (mitigated by stability and reputation)
ADR-011: Router Terminology for Server Selection
Status: Accepted Date: 2024-12-01
Context
OpenGSLB is an authoritative DNS server that returns A records pointing to backend servers. It does not route network traffic.
Decision
Use “Router” to describe the server selection component, with clear documentation that this refers to DNS response routing (selecting which IP to return), not network traffic routing.
Rationale
The Router does NOT:
Handle network traffic
Proxy requests
Manage connections to backends
The Router ONLY:
Receives a pre-filtered list of healthy servers
Selects one server based on its algorithm
Returns the selected server for inclusion in the DNS response
Consequences
Positive:
Clear terminology within the codebase
Consistent with industry GSLB terminology
Negative:
May confuse users expecting network routing
Requires clear documentation
⚠️ ADR-012: Distributed Agent Architecture & HA Control Plane
Status: SUPERSEDED by ADR-015 Date: 2025-04-01
This decision has been superseded. The Raft-based cluster mode has been replaced by the simpler agent-overwatch architecture. See ADR-015.
⚠️ ADR-013: Hybrid Configuration & KV Store Strategy
Status: SUPERSEDED by ADR-015 Date: 2025-04-01
This decision has been superseded. KV store design revised in ADR-015 for the agent-overwatch model.
⚠️ ADR-014: Runtime Mode Semantics
Status: SUPERSEDED by ADR-015 Date: 2025-04-01
This decision has been superseded. Runtime modes redefined in ADR-015 (agent/overwatch instead of standalone/cluster).
ADR-015: Agent-Overwatch Architecture
Status: Accepted Date: 2025-12-10 Supersedes: ADR-003, ADR-007, ADR-012, ADR-013, ADR-014
Context
Previous iterations of OpenGSLB explored Raft consensus, VRRP for VIP failover, and anycast-based architectures. These approaches introduced operational complexity:
Raft consensus: Required odd-numbered node clusters, added latency for leader election, didn’t solve the VIP problem
VRRP/Anycast: Required network team involvement (BGP configuration) for each deployment
Cluster mode: Created coordination overhead without proportional benefit
The fundamental insight: DNS clients already have built-in redundancy. When configured with multiple nameservers, clients automatically retry on failure. This eliminates the need for complex VIP failover mechanisms.
Decision
OpenGSLB adopts a simplified two-component architecture:
Agent: Runs on application servers, monitors local health, gossips state to Overwatch nodes
Overwatch: Runs adjacent to or on DNS infrastructure, validates health claims, serves authoritative DNS
Key Simplifications:
No Raft consensus (removed)
No VRRP (removed)
No VIP management (removed)
No cluster coordination (removed)
Overwatch nodes are fully independent
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────┐
│                                 CLIENTS                                 │
│  resolv.conf:                                                           │
│    nameserver 10.0.1.53 ──┐                                             │
│    nameserver 10.0.1.54 ──┼──► Overwatch nodes (any of them)            │
│    nameserver 10.0.1.55 ──┘    Client retries on failure                │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
┌────────────────────────────────────┴────────────────────────────────────┐
│                             OVERWATCH NODES                             │
│                     (Independent, no coordination)                      │
│                                                                         │
│     Overwatch-1               Overwatch-2               Overwatch-3     │
│   ┌─────────────┐           ┌─────────────┐           ┌─────────────┐   │
│   │ DNS Server  │           │ DNS Server  │           │ DNS Server  │   │
│   │ Validator   │           │ Validator   │           │ Validator   │   │
│   │ GeoIP DB    │           │ GeoIP DB    │           │ GeoIP DB    │   │
│   │ KV Store    │           │ KV Store    │           │ KV Store    │   │
│   └─────────────┘           └─────────────┘           └─────────────┘   │
│          │                         │                         │          │
│          └─────────────────────────┼─────────────────────────┘          │
│                        DNSSEC Key Sync (minimal)                        │
└─────────────────────────────────────────────────────────────────────────┘
                                     ▲
                                     │ Gossip (encrypted, authenticated)
┌────────────────────────────────────┴────────────────────────────────────┐
│                                 AGENTS                                  │
│                        (on application servers)                         │
│                                                                         │
│     Agent         Agent         Agent         Agent         Agent       │
│    ┌──────┐      ┌──────┐      ┌──────┐      ┌──────┐      ┌──────┐     │
│    │ App  │      │ App  │      │ App  │      │ App  │      │ App  │     │
│    └──────┘      └──────┘      └──────┘      └──────┘      └──────┘     │
│              Agents gossip to ALL Overwatch nodes globally              │
└─────────────────────────────────────────────────────────────────────────┘
Component Specifications
Agent Mode (--mode=agent):
| Aspect | Specification |
|---|---|
| Purpose | Local health monitoring, predictive failure detection |
| Deployment | On application servers, one agent per server |
| Backends | Can register multiple backends per agent |
| Health Checks | HTTP, HTTPS, TCP (configurable) |
| Predictive Signals | CPU, memory, error rate thresholds |
| Gossip | Publishes to ALL configured Overwatch nodes globally |
| DNS | Does not serve DNS |
| Heartbeat | Configurable interval (explicit keepalive) |
Overwatch Mode (--mode=overwatch):
| Aspect | Specification |
|---|---|
| Purpose | DNS authority, external health validation, veto power |
| Deployment | Adjacent to or on existing DNS infrastructure |
| DNS Zones | Authoritative for configured GSLB zones |
| Routing | Round-robin, weighted, failover, geolocation, latency-based |
| Validation | External health checks to all backends (configurable interval) |
| Veto Power | Overwatch external check always wins over agent claims |
| Independence | No coordination with other Overwatch nodes (except DNSSEC keys) |
| GeoIP | Local MaxMind database on each node |
Trust Model
Agent Identity: Two-factor authentication:
Pre-shared service token (configured in YAML)
TOFU certificate pinning (agent generates cert, Overwatch pins on first valid connection)
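The two factors can be sketched as follows. This is an illustration under assumptions: the `PinStore` type, its method names, and the choice of a SHA-256 fingerprint over the raw certificate bytes are hypothetical, not OpenGSLB's actual implementation.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"errors"
	"sync"
)

var (
	ErrBadToken     = errors.New("service token mismatch")
	ErrCertMismatch = errors.New("certificate does not match pinned fingerprint")
)

// PinStore is a hypothetical Overwatch-side store of agent certificate pins.
type PinStore struct {
	mu    sync.Mutex
	token []byte
	pins  map[string][sha256.Size]byte // agent ID -> SHA-256 of the certificate
}

func NewPinStore(serviceToken string) *PinStore {
	return &PinStore{
		token: []byte(serviceToken),
		pins:  make(map[string][sha256.Size]byte),
	}
}

// Verify checks both factors. The first connection that presents a valid
// token pins the certificate; later connections must present the same one.
func (p *PinStore) Verify(agentID, token string, certDER []byte) error {
	// Factor 1: pre-shared service token, compared in constant time.
	if subtle.ConstantTimeCompare([]byte(token), p.token) != 1 {
		return ErrBadToken
	}
	// Factor 2: certificate fingerprint, pinned on first valid use.
	fp := sha256.Sum256(certDER)
	p.mu.Lock()
	defer p.mu.Unlock()
	pinned, seen := p.pins[agentID]
	if !seen {
		p.pins[agentID] = fp
		return nil
	}
	if pinned != fp {
		return ErrCertMismatch
	}
	return nil
}
```

Note that a connection with a wrong token never creates a pin, so an attacker without the token cannot claim an agent identity first.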
Gossip Security (MANDATORY):
| Feature | Status | Notes |
|---|---|---|
| Encryption | Required | AES-256 via memberlist |
| Authentication | Required | Pre-shared key |
| Opt-out | Not allowed | Startup fails without encryption key |
Health Authority Hierarchy (highest to lowest):
Human Override (via API)
External Tool Override (via API)
Overwatch External Validation
Agent Health Claim
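One way to express this hierarchy in code is as an ordered enum where the highest-ranked source present decides the outcome. The type and function names below are assumptions for illustration only.

```go
package main

// Authority ranks a health source; higher values win.
type Authority int

const (
	AgentClaim Authority = iota
	OverwatchValidation
	ExternalToolOverride
	HumanOverride
)

// Verdict is one source's opinion about a backend (hypothetical type).
type Verdict struct {
	Source  Authority
	Healthy bool
}

// effective returns the verdict from the highest-ranked source present.
// The second return value is false when no source has reported yet.
func effective(verdicts []Verdict) (Verdict, bool) {
	if len(verdicts) == 0 {
		return Verdict{}, false
	}
	best := verdicts[0]
	for _, v := range verdicts[1:] {
		if v.Source > best.Source {
			best = v
		}
	}
	return best, true
}
```

For example, an agent claiming healthy is overruled by a failed Overwatch validation, which in turn is overruled by a human override through the API.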
Rationale
DNS clients already have built-in redundancy via multiple nameservers
Eliminates operational complexity of Raft/VRRP/VIP management
Each Overwatch is independently deployable
No network team involvement required
Security-first with mandatory encryption
Consequences
Positive:
Dramatically simpler architecture (no Raft, no VRRP, no VIPs)
Leverages existing DNS client redundancy
Each Overwatch is independently deployable
Works on Linux and Windows (including Domain Controllers)
Cloud-agnostic
Negative (Mitigated):
Overwatches may have slightly different views → acceptable for DNS (eventually consistent)
DNSSEC key sync requires minimal Overwatch communication → simple API polling
Client-side failover adds ~2s on Overwatch failure → standard DNS behavior
ADR-016: Unified Server Registration and Service-to-Domain Mapping
Status: Accepted Date: 2025-12-18 Breaking Change: Requires OpenGSLB 1.1.0+
Context
Prior to v1.1.0, OpenGSLB had two parallel and disconnected tracking systems for backend servers:
Static Servers (Config-based): Defined in regions[].servers[], used by DNS registry
Agent-Registered Servers (Dynamic): Registered via gossip, tracked in Backend Registry, never used for DNS responses
This created fundamental problems:
Problem 1: No service-to-domain mapping for static servers. All servers in a region were included in all domains using that region.
Problem 2: Agent-registered servers were stored but never used for DNS responses.
Problem 3: Two separate health tracking systems existed in parallel.
Decision
Unify server registration with three registration methods feeding a single source of truth:
Static Configuration (YAML file)
Agent Self-Registration (gossip heartbeat)
API Registration (HTTP POST)
All methods register servers into a unified Backend Registry that feeds the DNS Registry.
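A minimal sketch of such a registry, assuming hypothetical `Registry`, `Backend`, and `Source` types. It shows the two properties this ADR relies on: every registration path writes to the same store, and a missing service is rejected.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// Source records which registration method produced a backend.
type Source string

const (
	SourceStatic Source = "static" // YAML configuration
	SourceAgent  Source = "agent"  // gossip heartbeat
	SourceAPI    Source = "api"    // HTTP POST
)

// Backend is a hypothetical registry entry; field names are illustrative.
type Backend struct {
	Service string // domain this server answers for, e.g. webapp.example.com
	Address string
	Port    int
	Weight  int
	Source  Source
}

// Registry is the single store that every registration path writes to.
type Registry struct {
	mu       sync.RWMutex
	backends map[string]Backend
}

func NewRegistry() *Registry {
	return &Registry{backends: make(map[string]Backend)}
}

// Register adds or updates a backend. The service field is mandatory.
func (r *Registry) Register(b Backend) error {
	if b.Service == "" {
		return errors.New("backend " + b.Address + ": service is required")
	}
	key := fmt.Sprintf("%s|%s:%d", b.Service, b.Address, b.Port)
	r.mu.Lock()
	defer r.mu.Unlock()
	r.backends[key] = b
	return nil
}

// ForService returns every backend registered for a service, regardless of
// how it was registered, sorted by address and port for stable output.
func (r *Registry) ForService(service string) []Backend {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []Backend
	for _, b := range r.backends {
		if b.Service == service {
			out = append(out, b)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Address != out[j].Address {
			return out[i].Address < out[j].Address
		}
		return out[i].Port < out[j].Port
	})
	return out
}
```

The DNS side would only ever call something like `ForService`, so it does not need to know whether a server came from YAML, an agent, or the API.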
Architecture
Before v1.1.0 (Two Parallel Worlds):
Config File  → DNS Registry     → DNS Handler (static only)
Agent Gossip → Backend Registry → API (never used for DNS)
After v1.1.0 (Unified):
Config File  ─┐
Agent Gossip ─┼─► Backend Registry ─► DNS Registry ─► DNS Handler
API POST     ─┘          │
                         └─► Validator (external validation)
Implementation
Required service field on static servers:
# Before (v1.0.x) - NO LONGER VALID
regions:
  - name: us-east
    servers:
      - address: 10.0.1.10
        port: 8080
# After (v1.1.0) - REQUIRED
regions:
  - name: us-east
    servers:
      - address: 10.0.1.10
        port: 8080
        weight: 100
        service: webapp.example.com  # REQUIRED
Server CRUD API:
| Endpoint | Method | Purpose |
|---|---|---|
| | GET | List servers |
| | POST | Add server |
| | PATCH | Update weight |
| | DELETE | Remove server |
Rationale
Explicit service-to-domain mapping prevents misconfiguration
Unified architecture eliminates parallel tracking systems
API-driven operations enable dynamic server management
All registration methods feed the same validation pipeline
Consequences
Positive:
Single source of truth for all servers
Explicit service mapping prevents errors
Agent-registered servers appear in DNS responses
Full CRUD via API
Negative (Mitigated):
Breaking change requires service field → clear error messages and migration guide
Config verbosity increases → necessary for correctness
ADR-017: Passive Latency Learning via OS TCP Statistics
Status: Accepted Date: 2025-12-19 Related: ADR-015 (Agent-Overwatch Architecture)
Context
Latency-based routing requires knowing network latency between clients and backends. At DNS resolution time, we only know the source IP (often a resolver) and optionally EDNS Client Subnet.
How competitors solve this:
| Vendor | Approach | Limitation |
|---|---|---|
| F5 GTM | Active LDNS probing | LDNS ≠ client location; probes blocked |
| AWS Route53 | Pre-computed database | Only works for AWS regions |
| Cloudflare | Edge PoP measurement | Requires 330+ global PoPs |
| Citrix NetScaler | LDNS probing chain | Same LDNS limitations |
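Critical finding: No GSLB product measures actual client-to-backend latency. All rely on proxy measurements (resolver probing, pre-computed regional data, or edge PoP measurement) rather than the client's real path.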
The opportunity: OpenGSLB agents run on application servers. The OS already tracks TCP RTT for congestion control. We can read this data to learn actual client latencies.
Decision
Implement passive latency learning using OS-native TCP statistics only.
Linux: Netlink INET_DIAG (tcp_info.tcpi_rtt) - requires CAP_NET_ADMIN
Windows: GetPerTcpConnectionEStats API - requires Administrator
Approaches rejected:
| Approach | Reason |
|---|---|
| Application SDK | Requires code changes; doesn’t work for COTS |
| eBPF | CAP_BPF allows kernel code execution; catastrophic if compromised |
| Packet capture | CAP_NET_RAW required; CPU overhead |
| Network TAP | Infrastructure dependency |
Data Flow
1. Client connects to application (normal traffic)
2. OS tracks RTT for congestion control (always happens)
3. Agent polls OS for connection RTT (every 10s)
4. Agent aggregates by subnet: 203.0.113.0/24 → 45ms average
5. Agent gossips aggregated data to Overwatch (every 30s)
6. Overwatch uses learned data for routing decisions
Privacy Protection
Individual client IPs never leave the agent. All data aggregated to subnets:
| Protocol | Aggregation | Addresses per Bucket |
|---|---|---|
| IPv4 | /24 | 256 |
| IPv6 | /48 | ~2^80 |
Configuration
# Agent configuration
latency_learning:
  enabled: true
  poll_interval: 10s
  ipv4_prefix: 24
  ipv6_prefix: 48
  min_connection_age: 5s
  max_subnets: 100000
  subnet_ttl: 168h
  min_samples: 5
  report_interval: 30s
  ewma_alpha: 0.3
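The aggregation step can be sketched with the standard library alone. The prefix lengths, `min_samples`, and `ewma_alpha` values mirror the configuration above; the `Aggregator` type and its methods are hypothetical, and reading RTT from the OS (INET_DIAG or GetPerTcpConnectionEStats) is deliberately left out.

```go
package main

import "net/netip"

// subnetKey maps a client IP to its aggregation bucket so that individual
// addresses never need to leave the agent.
func subnetKey(ip string, v4Bits, v6Bits int) (string, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", err
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	bits := v6Bits
	if addr.Is4() {
		bits = v4Bits
	}
	prefix, err := addr.Prefix(bits)
	if err != nil {
		return "", err
	}
	return prefix.String(), nil
}

// Aggregator keeps one smoothed RTT per subnet. Type and field names are
// illustrative, not OpenGSLB's actual types.
type Aggregator struct {
	Alpha          float64 // EWMA weight given to the newest sample
	V4Bits, V6Bits int
	rttMs          map[string]float64
	samples        map[string]int
}

func NewAggregator(alpha float64, v4Bits, v6Bits int) *Aggregator {
	return &Aggregator{
		Alpha: alpha, V4Bits: v4Bits, V6Bits: v6Bits,
		rttMs:   make(map[string]float64),
		samples: make(map[string]int),
	}
}

// Observe folds one connection's RTT into its subnet's moving average.
func (a *Aggregator) Observe(clientIP string, rttMs float64) error {
	key, err := subnetKey(clientIP, a.V4Bits, a.V6Bits)
	if err != nil {
		return err
	}
	if a.samples[key] == 0 {
		a.rttMs[key] = rttMs // first sample seeds the average
	} else {
		a.rttMs[key] = a.Alpha*rttMs + (1-a.Alpha)*a.rttMs[key]
	}
	a.samples[key]++
	return nil
}

// Report returns only subnets with at least minSamples observations.
func (a *Aggregator) Report(minSamples int) map[string]float64 {
	out := make(map[string]float64)
	for key, n := range a.samples {
		if n >= minSamples {
			out[key] = a.rttMs[key]
		}
	}
	return out
}
```

With `Alpha` set to 0.3, a subnet whose first sample is 40ms and second is 50ms reports roughly 43ms, so a single outlier moves the estimate only partway.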
Security Analysis
| Platform | Capability | Risk Level |
|---|---|---|
| Linux | CAP_NET_ADMIN | Low (read-only diagnostics) |
| Windows | Administrator | Low (standard for services) |
CAP_NET_ADMIN does NOT allow:
Kernel code execution (unlike eBPF)
Packet content capture (unlike CAP_NET_RAW)
Memory access or privilege escalation
Rationale
Unique capability: No other GSLB learns from actual client connections
Zero application changes: Works with any software
Minimal overhead: Polling existing kernel structures every 10s
Safe privileges: Read-only network diagnostics
Consequences
Positive:
Real client latency data (unique in market)
Works with commercial off-the-shelf software
Minimal CPU impact
Graceful degradation to geo-routing if collection fails
Cross-platform (Linux and Windows)
Negative (Mitigated):
Requires elevated privileges → read-only, well-understood scope
Cold start period → falls back to geo-inference
macOS/BSD not supported → Linux and Windows cover enterprise deployments
ADR-018: Anycast Node Discovery (Optional)
Status: Proposed Date: 2025-12-20 Related: ADR-015 (Agent-Overwatch Architecture)
Context
ADR-015 established that agents gossip to Overwatch nodes, but it assumes agents are statically configured with all Overwatch addresses:
agent:
  gossip:
    overwatch_nodes:
      - "overwatch-1.internal:7946"
      - "overwatch-2.internal:7946"
The scalability problem: When deploying a new Overwatch node, operators must update the configuration of every agent. For deployments with hundreds of agents across multiple regions, this creates significant operational burden.
The split-brain risk: Currently, each Overwatch node operates independently with its own view of registered backends. If agents only connect to a subset of Overwatches:
Overwatch-1 may know about agents A, B, C
Overwatch-2 may know about agents D, E, F
DNS responses differ based on which Overwatch receives the query
ADR-015 mitigates this by requiring agents to gossip to ALL Overwatch nodes. But this requires agents to know about all Overwatches upfront, which doesn’t scale.
Decision
Introduce optional anycast-based discovery with overwatch peering. Operators can choose between:
Static Configuration (existing, default): Agents explicitly list all Overwatch nodes
Anycast Discovery (new, optional): Agents discover Overwatches via anycast VIP, Overwatches sync state via peering
Architecture: Anycast Discovery Mode
        ┌─────────────────────────────────────────────┐
        │           Overwatch Peering Mesh            │
        │      (memberlist gossip between peers)      │
        │                                             │
        │       ┌──────────┐      ┌──────────┐        │
        │       │Overwatch1│◀────▶│Overwatch2│        │
        │       └────▲─────┘      └─────▲────┘        │
        │            │                  │             │
        │            └────────┬─────────┘             │
        │                     │                       │
        │               ┌─────▼─────┐                 │
        │               │Overwatch3 │ (newly deployed)│
        │               └───────────┘                 │
        └──────────────────────┬──────────────────────┘
                               │
                   Anycast VIP │ 10.255.0.1:7946
        ┌──────────────────────┴──────────────────────┐
        │          BGP Anycast Advertisement          │
        └──────────────────────┬──────────────────────┘
                               │
     ┌─────────────────────────┼─────────────────────────┐
     │                         │                         │
┌────▼────┐              ┌─────▼────┐              ┌─────▼────┐
│ Agent A │              │ Agent B  │              │ Agent C  │
│ (US-E)  │              │ (US-W)   │              │  (EU)    │
└─────────┘              └──────────┘              └──────────┘
How it works:
Agent Discovery: Agent connects to anycast VIP, routes to nearest Overwatch
Join & Learn: Upon joining, agent receives list of all Overwatch peers
Multi-Connect: Agent establishes gossip with ALL Overwatches (not just anycast target)
Overwatch Peering: Overwatches gossip backend registry state between themselves
New Overwatch: Joins peer mesh, receives state from existing peers, advertises anycast
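The agent side of these steps can be sketched as follows. `JoinFunc` is a hypothetical stand-in for the gossip join handshake (contact one address, learn the peer list); the real handshake is not shown. The fallback handling mirrors the optional `fallback_nodes` setting.

```go
package main

import (
	"errors"
	"sort"
)

// ErrNoOverwatch is returned when no discovery address yields a peer list.
var ErrNoOverwatch = errors.New("no Overwatch reachable for discovery")

// JoinFunc is a hypothetical stand-in for the gossip join handshake: it
// contacts one address and returns the Overwatch peers that node knows about.
type JoinFunc func(addr string) ([]string, error)

// discoverOverwatches tries the anycast address first, then each fallback
// node, and returns the de-duplicated, sorted peer list from the first
// address that answers. The agent would then gossip to every returned peer,
// not only to the node that answered.
func discoverOverwatches(join JoinFunc, anycast string, fallback []string) ([]string, error) {
	candidates := append([]string{anycast}, fallback...)
	for _, addr := range candidates {
		if addr == "" {
			continue
		}
		peers, err := join(addr)
		if err != nil || len(peers) == 0 {
			continue // unreachable or empty answer: try the next candidate
		}
		seen := make(map[string]bool, len(peers))
		var out []string
		for _, p := range peers {
			if !seen[p] {
				seen[p] = true
				out = append(out, p)
			}
		}
		sort.Strings(out)
		return out, nil
	}
	return nil, ErrNoOverwatch
}
```

Passing the join step in as a function keeps the discovery logic testable without any BGP or anycast infrastructure.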
Important Limitation: Overwatches cannot use anycast for peer discovery. Since each Overwatch advertises the anycast VIP, connecting to it would route to themselves. Overwatches must be configured with at least one bootstrap_peer (static address). However, once joined, gossip propagates new peers to all existing Overwatches automatically.
| Component | Discovery Method | Manual Config on Add |
|---|---|---|
| Agents | Anycast VIP | None (zero-touch) |
| Overwatches | Bootstrap peers | New node only (one peer address) |
Component Specifications
Agent Discovery Configuration (optional):
agent:
  gossip:
    # Option 1: Static (existing behavior, remains default)
    overwatch_nodes:
      - "overwatch-1.internal:7946"
      - "overwatch-2.internal:7946"
    # Option 2: Anycast discovery (new, optional)
    discovery:
      enabled: true
      anycast_address: "10.255.0.1:7946"
      # Optional fallback if anycast unreachable
      fallback_nodes:
        - "overwatch-1.internal:7946"
Overwatch Peering Configuration (optional):
overwatch:
  peering:
    enabled: true
    # Bootstrap peers (at least one required for new nodes)
    # Existing nodes discover each other via gossip
    bootstrap_peers:
      - "overwatch-1.internal:7947"
      - "overwatch-2.internal:7947"
    # Separate port for peer-to-peer gossip (distinct from agent gossip)
    bind_address: "0.0.0.0:7947"
    # What state to sync between overwatches
    sync:
      backend_registry: true  # Backend health and registration
      latency_data: true      # Passive latency learning data (ADR-017)
      dnssec_keys: true       # DNSSEC key material
Overwatch Peering Protocol
Overwatches form a separate memberlist cluster (port 7947) distinct from the agent gossip cluster (port 7946):
| Cluster | Port | Members | Purpose |
|---|---|---|---|
| Agent Gossip | 7946 | Agents + Overwatches | Health updates, registration |
| Peer Gossip | 7947 | Overwatches only | State synchronization |
Synchronized State:
| Data | Sync Method | Consistency |
|---|---|---|
| Backend Registry | Full-state CRDT merge | Eventually consistent |
| Health Status | Last-writer-wins by timestamp | Eventually consistent |
| Latency Data | EWMA merge (weighted average) | Eventually consistent |
| DNSSEC Keys | Existing peer sync (ADR-015) | Strongly consistent |
Conflict Resolution: Backend registry uses last-seen-wins with agent authority. If Overwatch-1 and Overwatch-2 have different health status for the same backend:
Compare AgentLastSeen timestamps
More recent timestamp wins
Overwatch external validation can override agent claims (per ADR-015 trust hierarchy)
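These rules can be sketched as a merge function plus a health accessor. Only the `AgentLastSeen` name comes from this ADR; the `BackendState` type and its other fields are assumptions made for illustration.

```go
package main

import "time"

// BackendState is a hypothetical per-backend record held by one Overwatch.
type BackendState struct {
	AgentHealthy    bool      // what the agent last claimed
	AgentLastSeen   time.Time // when that claim was received
	ValidationKnown bool      // an external Overwatch check result exists
	ValidationOK    bool      // result of that external check
}

// merge resolves a conflict between two peers' records for the same
// backend: the record with the more recent AgentLastSeen wins. On a tie
// the first argument is kept, which makes the result deterministic.
func merge(a, b BackendState) BackendState {
	if b.AgentLastSeen.After(a.AgentLastSeen) {
		return b
	}
	return a
}

// Healthy applies the trust hierarchy: an external validation result, when
// present, overrides the agent's claim.
func (s BackendState) Healthy() bool {
	if s.ValidationKnown {
		return s.ValidationOK
	}
	return s.AgentHealthy
}
```

Because the winner depends only on the timestamps, two Overwatches that exchange the same pair of records converge on the same result regardless of argument order, except when the timestamps are exactly equal.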
Discovery Flow
┌─────────┐                    ┌────────────┐                  ┌────────────┐
│  Agent  │                    │ Overwatch1 │                  │ Overwatch2 │
└────┬────┘                    └─────┬──────┘                  └─────┬──────┘
     │                               │                               │
     │ 1. Connect to anycast VIP     │                               │
     │    (routed to nearest)        │                               │
     │──────────────────────────────▶│                               │
     │                               │                               │
     │ 2. Join memberlist cluster    │                               │
     │◀─────────────────────────────▶│                               │
     │                               │                               │
     │ 3. Receive peer list          │                               │
     │    [Overwatch1, Overwatch2]   │                               │
     │◀──────────────────────────────│                               │
     │                               │                               │
     │ 4. Connect to all peers       │                               │
     │──────────────────────────────────────────────────────────────▶│
     │                               │                               │
     │ 5. Gossip to all Overwatches  │                               │
     │──────────────────────────────▶│◀──────────────────────────────│
     │                               │                               │
New Overwatch Deployment Flow
┌────────────┐                    ┌────────────┐                    ┌────────────┐
│ Overwatch3 │                    │ Overwatch1 │                    │ Overwatch2 │
│   (new)    │                    │ (existing) │                    │ (existing) │
└─────┬──────┘                    └─────┬──────┘                    └─────┬──────┘
      │                                 │                                 │
      │ 1. Join peer mesh via bootstrap │                                 │
      │────────────────────────────────▶│                                 │
      │                                 │                                 │
      │ 2. Receive full backend registry│                                 │
      │◀────────────────────────────────│                                 │
      │                                 │                                 │
      │ 3. Receive latency data         │                                 │
      │◀────────────────────────────────│                                 │
      │                                 │                                 │
      │ 4. Start advertising anycast    │                                 │
      │ ═════════════════════════════════════════════════════════════════ │
      │                                 │                                 │
      │ 5. Agents discover via anycast  │                                 │
      │◀════════════════════════════════│═════════════════════════════════│
      │                                 │                                 │
Operator Decision Matrix
| Deployment Size | Overwatch Changes | Recommended Mode |
|---|---|---|
| Small (1-2 OW, <20 agents) | Rare | Static configuration |
| Medium (2-5 OW, 20-100 agents) | Occasional | Static or Anycast |
| Large (5+ OW, 100+ agents) | Frequent | Anycast discovery |
| Multi-region with dynamic scaling | Common | Anycast discovery |
Network Requirements (Anycast Mode)
Operators choosing anycast discovery must configure:
Anycast VIP: A single IP address advertised by all Overwatches
BGP Configuration: Each Overwatch advertises the anycast prefix
Health-Based Withdrawal: Overwatch stops advertising if unhealthy
Example BGP setup (operator responsibility):
# Each Overwatch runs a BGP daemon (BIRD, FRR, etc.)
# Advertises anycast prefix when healthy
# Withdraws on failure (health check integration)
Combined Anycast: DNS + Discovery
The same anycast VIP can serve both DNS queries and agent discovery on different ports:
| Service | Port | Protocol | Purpose |
|---|---|---|---|
| DNS | 53 | UDP/TCP | Client DNS queries |
| Gossip | 7946 | TCP/UDP | Agent discovery and health updates |
| Peer Sync | 7947 | TCP/UDP | Overwatch-to-Overwatch state sync |
                      Anycast VIP: 10.255.0.1
                 ┌────────────────────────────────┐
                 │ :53   - DNS queries            │
                 │ :7946 - Agent gossip           │
                 └───────────────┬────────────────┘
                                 │
     ┌───────────────────────────┼───────────────────────────┐
     │                           │                           │
     ▼                           ▼                           ▼
┌─────────┐                 ┌─────────┐                 ┌─────────┐
│  OW-1   │◀───────────────▶│  OW-2   │◀───────────────▶│  OW-3   │
│  :7947  │   Peer Gossip   │  :7947  │   Peer Gossip   │  :7947  │
└─────────┘                 └─────────┘                 └─────────┘
Why combining works: All routing algorithms (geo, latency, weighted, etc.) make decisions based on the client’s source IP, which is preserved in DNS queries regardless of anycast routing. The anycast path only determines which Overwatch processes the request, not the routing decision outcome.
Operator benefit: Single VIP to manage, single BGP advertisement, unified health-based withdrawal. Clients configure one nameserver address instead of multiple.
Rationale
Optional complexity: Small deployments keep static config simplicity
Scalable operations: Large deployments avoid config sprawl
Consistent state: Overwatch peering ensures all nodes have complete view
Graceful migration: Can run mixed mode during transition
Leverages existing infra: Uses memberlist (already proven in agent gossip)
Consequences
Positive:
Zero agent config changes when adding Overwatches (anycast mode)
Consistent backend registry across all Overwatches
New Overwatches immediately operational after peer sync
Backward compatible (static mode unchanged)
Negative (Mitigated):
Anycast requires BGP configuration → operator choice, documented requirements
Additional network port (7947) for peer gossip → configurable, optional
Eventual consistency window during sync → acceptable for DNS (per ADR-015)
More complex failure modes → comprehensive monitoring and documentation
Testing Considerations
⚠️ BLOCKER: This ADR remains in Proposed status until a valid testing strategy is established.
The Challenge: Anycast requires BGP peering with network infrastructure. There is no router to peer with in unit tests or standard integration tests.
| Component | Testability | Notes |
|---|---|---|
| Overwatch Peering | ✅ Testable | Memberlist gossip between overwatches, no network dependency |
| Agent Discovery Logic | ✅ Testable | Code path for anycast vs static, mockable |
| Backend Registry Sync | ✅ Testable | CRDT merge logic, pure functions |
| BGP Advertisement | ❌ Not testable | Requires real network infrastructure |
| Anycast Routing | ❌ Not testable | Requires BGP-capable routers |
| Health-Based Withdrawal | ❌ Not testable | Requires BGP daemon integration |
Potential Testing Approaches (to be evaluated):
Containerized BGP Lab: Use GoBGP or FRR in containers to simulate a minimal anycast environment
Pro: Realistic BGP behavior
Con: Complex setup, slow tests, flaky in CI
Split Testing Strategy:
Unit/integration tests for peering and discovery logic (testable parts)
Manual acceptance tests for anycast behavior (documented runbook)
Con: No automated coverage of anycast path
Mock Network Layer: Abstract BGP health signaling behind an interface
Pro: Enables unit testing of advertisement logic
Con: Doesn’t validate actual BGP behavior
Dedicated Test Environment: Physical or cloud-based lab with real routers
Pro: Full end-to-end validation
Con: Cost, maintenance, not suitable for CI
Decision Required: Before implementation, establish which testing approach(es) will be used and document in a testing plan.
Migration Path
Phase 1: Enable overwatch peering (no agent changes)
# All overwatches
overwatch:
  peering:
    enabled: true
    bootstrap_peers: ["overwatch-1:7947", "overwatch-2:7947"]
Phase 2: Configure anycast infrastructure (network team)
Phase 3: Update agents to use discovery (gradual rollout)
agent:
  gossip:
    discovery:
      enabled: true
      anycast_address: "10.255.0.1:7946"
      fallback_nodes: ["overwatch-1:7946"]  # Keep during transition
Phase 4: Remove static overwatch_nodes from agent configs
Document History
| Date | ADR | Change |
|---|---|---|
| 2024-11 | 001-008 | Initial architecture decisions |
| 2024-12 | 009-011 | Sprint 2/3 decisions |
| 2025-04 | 012-014 | Distributed architecture (Raft-based) |
| 2025-12-10 | 015 | Agent-Overwatch architecture (supersedes 003, 007, 012-014) |
| 2025-12-18 | 016 | Unified server registration |
| 2025-12-19 | 017 | Passive latency learning |
| 2025-12-20 | 018 | Anycast node discovery (optional) |