OpenGSLB Project Progress
Current Sprint: Sprint 6 - Production Readiness ✅
Sprint Goal: Implement intelligent routing features (geolocation, latency-based), enhanced observability, operational tooling, and comprehensive documentation for production deployments
Completed
Sprint 1: Foundation ✅
GitHub repository with branch protection
CI/CD pipeline (Go 1.21/1.22 matrix, golangci-lint)
Docker image builds to ghcr.io
Integration test environment (docker-compose)
Development environment documentation
Makefile and developer tooling
Pre-commit hooks
Sprint 2: Core Features ✅
Configuration Schema & Loader (YAML, validation, defaults)
DNS Server Foundation (miekg/dns, UDP/TCP, A records)
Health Check Framework (HTTP, thresholds, status tracking)
Round-Robin Routing Algorithm
Component Integration (graceful shutdown, lifecycle management)
Observability Foundation (slog logging, Prometheus metrics)
Documentation & Examples
Sprint 3: Advanced Features ✅
Story 1: Weighted Routing Algorithm ✅
Weighted random selection based on server weights
Weight 0 excludes server from selection
Unhealthy servers excluded regardless of weight
Statistical distribution matches weight ratios
Thread-safe implementation
Unit tests verify proportional distribution
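The selection rules above can be sketched in a few lines of Go. This is a minimal illustration, not the project's router: the `Server` type, its field names, and `pickWeighted` are hypothetical, and the random source is passed in as a function so the behavior can be checked deterministically.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Server is a hypothetical backend record; the field names are illustrative.
type Server struct {
	Address string
	Weight  int
	Healthy bool
}

// pickWeighted returns one eligible server, chosen with probability
// proportional to its weight. Unhealthy servers and servers with a weight
// of 0 or less are never returned. intn must return a value in [0, n).
// The second result is false when no server is eligible.
func pickWeighted(servers []Server, intn func(n int) int) (Server, bool) {
	total := 0
	for _, s := range servers {
		if s.Healthy && s.Weight > 0 {
			total += s.Weight
		}
	}
	if total == 0 {
		return Server{}, false
	}
	// Draw a point in [0, total) and walk the cumulative weights.
	n := intn(total)
	for _, s := range servers {
		if !s.Healthy || s.Weight <= 0 {
			continue
		}
		if n < s.Weight {
			return s, true
		}
		n -= s.Weight
	}
	return Server{}, false // not reached when intn honors its contract
}

func main() {
	servers := []Server{
		{Address: "10.0.0.1", Weight: 3, Healthy: true},
		{Address: "10.0.0.2", Weight: 1, Healthy: true},
		{Address: "10.0.0.3", Weight: 0, Healthy: true},  // excluded: weight 0
		{Address: "10.0.0.4", Weight: 5, Healthy: false}, // excluded: unhealthy
	}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		// The top-level math/rand functions are safe for concurrent use.
		if s, ok := pickWeighted(servers, rand.Intn); ok {
			counts[s.Address]++
		}
	}
	fmt.Println(counts) // roughly a 3:1 split between the first two servers
}
```

The function holds no shared state, so it is safe to call from multiple goroutines as long as the supplied random source is.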
Story 2: Active/Standby Routing Algorithm ✅
Failover router selects first healthy server in priority order
Automatic failover when primary becomes unhealthy
Automatic return-to-primary when it recovers
Supports multiple fallback levels
Clear logging of failover events
Unit tests for failover and recovery scenarios
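A rough sketch of the failover rule follows. It assumes the server list is already sorted by priority (primary first); the `Server` type and `pickFailover` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Server is a hypothetical backend record; the slice order is assumed to
// be the priority order (index 0 is the primary).
type Server struct {
	Address string
	Healthy bool
}

// pickFailover returns the first healthy server in priority order.
// Because it re-evaluates from the top on every call, traffic returns to
// the primary as soon as it is healthy again. The second result is false
// when every server is unhealthy.
func pickFailover(servers []Server) (Server, bool) {
	for _, s := range servers {
		if s.Healthy {
			return s, true
		}
	}
	return Server{}, false
}

func main() {
	pool := []Server{
		{Address: "primary", Healthy: true},
		{Address: "standby-1", Healthy: true},
		{Address: "standby-2", Healthy: true},
	}

	s, _ := pickFailover(pool)
	fmt.Println("selected:", s.Address)

	pool[0].Healthy = false // primary fails: the first standby takes over
	s, _ = pickFailover(pool)
	fmt.Println("selected:", s.Address)

	pool[0].Healthy = true // primary recovers: selection returns to it
	s, _ = pickFailover(pool)
	fmt.Println("selected:", s.Address)
}
```

Keeping the function stateless means failover and return-to-primary fall out of the same loop; a real router would additionally log the transition when the selected server changes.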
Story 3: TCP Health Check Implementation ✅
TCP health check (connection-only verification)
Configurable timeout
type: tcp configuration support
Unit tests with mock TCP servers
Story 4: Configuration Hot-Reload (SIGHUP) ✅
SIGHUP triggers configuration reload
New configuration validated before applying
Invalid configuration rejected with error log
Domains can be added/removed
Servers can be added/removed from regions
Health checks start/stop for changed servers
Reload events logged
Metrics track reload attempts and success/failure
In-flight DNS queries not disrupted
Story 5: AAAA Record Support ✅
AAAA queries return IPv6 addresses
Servers can be configured with IPv6 addresses
Mixed IPv4/IPv6 server pools supported
A query returns only IPv4, AAAA returns only IPv6
Unit tests for AAAA handling
Story 6: Health Check Status API Endpoint ✅
GET /api/v1/health/servers returns JSON with all server statuses
GET /api/v1/health/regions returns aggregated region health
GET /api/v1/ready for readiness probes
GET /api/v1/live for liveness probes
IP-based access control (allowed_networks)
Security-first design with localhost-only default
Documentation includes endpoint details
Story 7: Integration Test Suite Enhancement ✅
Integration test for weighted routing distribution
Integration test for failover behavior
Integration test for TCP health checks
Integration test for configuration reload (SIGHUP)
Integration test for AAAA records
Integration test for Health API
Tests run in CI pipeline
Manual test script covers all Sprint 3 features
Story 8: Documentation Updates ✅
Weighted routing documented with examples
Active/Standby (failover) routing documented with examples
TCP health checks documented
Hot reload documented with operational guidance
AAAA/IPv6 records documented
Health status API documented
API security/hardening guide created
PROGRESS.md updated
Sprint 4: Distributed Agent Architecture ⚠️ SUPERSEDED
Sprint 4 implemented a Raft-based cluster architecture. After operational analysis, this was superseded by Sprint 5’s simpler agent-overwatch model. See ADR-015.
Sprint 5: Agent-Overwatch Architecture ✅
Story 1: Remove Raft and Cluster Infrastructure ✅
Removed --mode=cluster (replaced by multiple independent Overwatches)
Removed Raft consensus code
Removed leader election logic
Updated --mode flag to accept only agent or overwatch
Updated DNS handler (no LeaderChecker needed)
Kept hashicorp/memberlist for gossip
Story 2: Refactor Agent Mode ✅
Agent supports multiple backends per instance
Each backend has independent health check configuration
HeartbeatSender with configurable interval
Service token authentication (pre-shared)
Self-signed certificate generation for TOFU (identity.go)
Predictive health signals per backend (predictor.go)
Agent does NOT serve DNS (enforced)
BackendManager handles multi-backend health checks
Monitor collects system metrics (CPU, memory, error rate)
Graceful shutdown with deregistration
Comprehensive unit tests
Story 3: Implement Overwatch Mode ✅
--mode=overwatch starts DNS server on configured address
Registry receives and processes agent gossip messages
Maintains backend registry from agent registrations
Validator performs external health validation (configurable interval)
Overwatch validation ALWAYS wins over agent claims (ADR-015)
Independent operation (no Overwatch-to-Overwatch coordination for health)
API for backend status, overrides, DNSSEC
KV store for state persistence (bbolt)
Unit tests for Overwatch functionality
Story 4: Mandatory Gossip Security ✅
Gossip encryption key REQUIRED in configuration
Startup fails with clear error if key missing
AES-256 encryption via memberlist
Key must be exactly 32 bytes (base64 encoded in config)
Key validation on startup
Documentation for key generation
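The key requirements above can be illustrated with a small generate-and-validate sketch. The function names are hypothetical and the error messages are made up for the example; only the rules (key required, base64 encoded, exactly 32 bytes once decoded) come from the list above. memberlist chooses the AES variant from the key length, so enforcing 32 bytes is what yields AES-256.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
)

// gossipKeyLen is the decoded key size required for AES-256.
const gossipKeyLen = 32

// generateGossipKey returns a fresh random key, base64 encoded so it can
// be pasted into a configuration file.
func generateGossipKey() (string, error) {
	key := make([]byte, gossipKeyLen)
	if _, err := rand.Read(key); err != nil {
		return "", fmt.Errorf("generating gossip key: %w", err)
	}
	return base64.StdEncoding.EncodeToString(key), nil
}

// decodeGossipKey validates a configured key at startup. It rejects a
// missing key, invalid base64, and any decoded length other than 32 bytes.
func decodeGossipKey(encoded string) ([]byte, error) {
	if encoded == "" {
		return nil, errors.New("gossip encryption key is required")
	}
	key, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, fmt.Errorf("gossip encryption key is not valid base64: %w", err)
	}
	if len(key) != gossipKeyLen {
		return nil, fmt.Errorf("gossip encryption key must be %d bytes, got %d", gossipKeyLen, len(key))
	}
	return key, nil
}

func main() {
	encoded, err := generateGossipKey()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	key, err := decodeGossipKey(encoded)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("decoded key length:", len(key))

	// A missing key fails validation with a clear message.
	_, err = decodeGossipKey("")
	fmt.Println("empty key:", err)
}
```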
Story 5: Agent Identity and TOFU ✅
Agent generates self-signed certificate on first start
Certificate stored locally (configurable path)
Service token sent with first connection
Overwatch validates token, pins certificate
Subsequent connections authenticated by pinned cert
Pinned certs stored in Overwatch KV store
Certificate rotation mechanism
Revocation via API (delete pinned cert)
Unit tests for identity flow
Story 6: External Override API ✅
PUT /api/v1/overrides/{service}/{address} sets override
DELETE /api/v1/overrides/{service}/{address} clears override
GET /api/v1/overrides lists all active overrides
Override includes: healthy (bool), reason (string), source (string)
Overrides stored in registry
API handlers with IP allowlist
Unit tests for API endpoints
Story 7: DNSSEC Foundation ✅
DNSSEC enabled by default
Disabling requires explicit security acknowledgment
Key generation on first start
DS record exposed via API
Key stored in KV store
Unit tests for DNSSEC signing
Story 8: DNSSEC Key Sync ✅
Overwatches poll peers for DNSSEC keys
Configurable poll interval
Newest key wins (by timestamp)
Key sync is ONLY inter-Overwatch communication
Failed sync doesn’t prevent DNS serving
Sync status visible in metrics/API
Story 9: Heartbeat and Stale Backend Detection ✅
Agents send explicit heartbeat at configurable interval
Heartbeat message is lightweight (no full backend state unless changed)
Overwatch tracks last heartbeat per agent (AgentLastSeen)
Backends marked stale after N missed heartbeats
Stale backends removed from DNS rotation (GetHealthyBackends)
Overwatch external check can recover stale backend if actually healthy
Metrics for heartbeat status (OverwatchStaleAgentsTotal, OverwatchAgentHeartbeatAge)
Comprehensive unit tests for heartbeat and stale detection logic
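As a sketch of the stale rule, the check below treats an agent as stale once more than N heartbeat intervals have elapsed since its last heartbeat. The function name, the parameters, and the strict greater-than comparison are assumptions made for illustration; the project's actual threshold logic may differ.

```go
package main

import (
	"fmt"
	"time"
)

// isStale reports whether an agent should be considered stale.
// sinceLast is the time elapsed since the agent's last heartbeat,
// interval is the configured heartbeat interval, and maxMissed is the
// number of consecutive heartbeats that may be missed (N).
// Non-positive settings disable stale detection in this sketch.
func isStale(sinceLast, interval time.Duration, maxMissed int) bool {
	if interval <= 0 || maxMissed <= 0 {
		return false
	}
	return sinceLast > time.Duration(maxMissed)*interval
}

func main() {
	interval := 10 * time.Second
	for _, elapsed := range []time.Duration{5 * time.Second, 25 * time.Second, 31 * time.Second} {
		fmt.Printf("last heartbeat %v ago, stale=%v\n", elapsed, isStale(elapsed, interval, 3))
	}
}
```

In the model described above, a stale result only removes the backend from rotation based on agent silence; an Overwatch external check that finds the backend healthy can still restore it.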
Story 10: Integration Testing and Documentation ✅
Unit tests for agent-overwatch registration flow
Unit tests for multi-backend agent
Unit tests for Overwatch external validation veto
Unit tests for override API affects backend status
Unit tests for heartbeat and stale detection
Unit tests for DNSSEC signing
Unit tests for health authority hierarchy
Updated ARCHITECTURE_DECISIONS.md with ADR-015
Updated PROGRESS.md
Full integration tests for multiple independent Overwatches
Agent failover integration test
Deployment guide for agent-overwatch model
Sprint 6: Production Readiness ✅
Story 1: Geolocation Routing ✅
MaxMind GeoIP2/GeoLite2 database integration
Country and continent-level geographic resolution
Custom CIDR-to-region mappings with longest-prefix matching
EDNS Client Subnet (ECS) support for accurate client location
Configurable default region fallback
GeoRouter implementation with region-based server selection
API endpoint for geolocation testing (/api/v1/geo/lookup)
Unit and integration tests for geolocation routing
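Longest-prefix matching for the custom CIDR-to-region mappings can be sketched with the standard `net/netip` package. The `CIDRTable` type, its constructor, and the region names are hypothetical; the linear scan is chosen for brevity, and a large mapping table would more likely use a prefix trie.

```go
package main

import (
	"fmt"
	"net/netip"
)

// cidrRegion pairs one configured prefix with its region name.
type cidrRegion struct {
	prefix netip.Prefix
	region string
}

// CIDRTable is a hypothetical lookup table for custom CIDR mappings.
type CIDRTable struct {
	entries []cidrRegion
}

// NewCIDRTable parses a CIDR-to-region map, failing on any invalid CIDR.
func NewCIDRTable(mappings map[string]string) (*CIDRTable, error) {
	table := &CIDRTable{}
	for cidr, region := range mappings {
		p, err := netip.ParsePrefix(cidr)
		if err != nil {
			return nil, fmt.Errorf("invalid CIDR %q: %w", cidr, err)
		}
		table.entries = append(table.entries, cidrRegion{prefix: p.Masked(), region: region})
	}
	return table, nil
}

// Lookup returns the region of the most specific prefix containing ip.
// The second result is false when ip is unparsable or no prefix matches,
// in which case a caller would fall through to GeoIP or the default region.
func (c *CIDRTable) Lookup(ip string) (string, bool) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", false
	}
	addr = addr.Unmap() // treat IPv4-mapped IPv6 addresses as IPv4
	bestBits := -1
	region := ""
	for _, e := range c.entries {
		if e.prefix.Contains(addr) && e.prefix.Bits() > bestBits {
			bestBits = e.prefix.Bits()
			region = e.region
		}
	}
	return region, bestBits >= 0
}

func main() {
	table, err := NewCIDRTable(map[string]string{
		"10.0.0.0/8":    "us-east",
		"10.1.0.0/16":   "us-west",
		"2001:db8::/32": "eu-west",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, ip := range []string{"10.1.2.3", "10.2.0.1", "2001:db8::1", "192.0.2.1"} {
		region, ok := table.Lookup(ip)
		fmt.Printf("%s -> %q (matched=%v)\n", ip, region, ok)
	}
}
```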
Story 2: Latency-Based Routing ✅
Continuous latency measurement during health checks
Exponential moving average (EMA) smoothing to prevent flapping
Configurable maximum latency threshold (default: 500ms)
Minimum samples requirement before using latency data
Automatic fallback to round-robin when insufficient data
Sub-millisecond precision latency tracking
LatencyRouter implementation with lowest-latency selection
Unit and integration tests for latency routing
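The smoothing and selection rules above can be sketched as follows. The `latencyTracker` type, the smoothing factor `alpha`, and `pickLowest` are hypothetical; in particular, skipping servers whose smoothed latency exceeds the maximum threshold is an assumption about how the threshold is applied.

```go
package main

import (
	"fmt"
	"time"
)

// latencyTracker keeps an exponential moving average (EMA) of observed
// health-check latencies for one server.
type latencyTracker struct {
	alpha      float64 // smoothing factor in (0, 1]; higher reacts faster
	minSamples int     // samples required before the value is trusted
	ema        float64 // smoothed latency in nanoseconds
	samples    int
}

// Observe folds one measurement into the average:
// ema = alpha*sample + (1-alpha)*ema. The first sample seeds the average.
func (l *latencyTracker) Observe(d time.Duration) {
	x := float64(d)
	if l.samples == 0 {
		l.ema = x
	} else {
		l.ema = l.alpha*x + (1-l.alpha)*l.ema
	}
	l.samples++
}

// Value returns the smoothed latency, or false until minSamples is reached.
func (l *latencyTracker) Value() (time.Duration, bool) {
	if l.samples < l.minSamples {
		return 0, false
	}
	return time.Duration(l.ema), true
}

// pickLowest returns the address with the lowest smoothed latency among
// servers that have enough samples and are within maxLatency. A false
// result signals that the caller should fall back to round-robin.
func pickLowest(trackers map[string]*latencyTracker, maxLatency time.Duration) (string, bool) {
	best := ""
	var bestLatency time.Duration
	found := false
	for addr, tr := range trackers {
		v, ok := tr.Value()
		if !ok || v > maxLatency {
			continue
		}
		if !found || v < bestLatency {
			best, bestLatency, found = addr, v, true
		}
	}
	return best, found
}

func main() {
	near := &latencyTracker{alpha: 0.3, minSamples: 3}
	far := &latencyTracker{alpha: 0.3, minSamples: 3}
	for _, ms := range []int{20, 22, 250, 21} { // one spike, damped by the EMA
		near.Observe(time.Duration(ms) * time.Millisecond)
	}
	for _, ms := range []int{120, 118, 125, 119} {
		far.Observe(time.Duration(ms) * time.Millisecond)
	}
	trackers := map[string]*latencyTracker{"10.0.0.1": near, "10.0.0.2": far}
	if addr, ok := pickLowest(trackers, 500*time.Millisecond); ok {
		fmt.Println("selected:", addr)
	} else {
		fmt.Println("insufficient data, falling back to round-robin")
	}
}
```

Storing the average as a float of nanoseconds keeps sub-millisecond precision, and the single spike in the example shifts the average without immediately flipping the selection.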
Story 3: CLI Management Tool ✅
opengslb-cli command-line tool for operations
status command for overall system health
servers command with filtering by service/region
overrides command for managing manual overrides
geo test command for testing geolocation lookups
dnssec commands for key management
Configuration validation command
Table and JSON output formats
Comprehensive CLI documentation
Story 4: Multi-File Configuration Includes ✅
includes directive for splitting config across files
Glob pattern matching (config.d/*.yaml)
Environment variable expansion (${VAR} syntax)
Layered configuration merging (arrays concatenated, maps merged)
Circular include detection
Maximum include depth enforcement (10 levels)
Security: permission checks on all included files
Clear error messages with file:line context
Story 5: Comprehensive Operational Runbooks ✅
Overwatch deployment runbook with production examples
Agent deployment guide with multi-backend configuration
GeoIP database update procedures and automation
HA setup guide for multi-Overwatch deployments
Incident response playbooks for common scenarios
Backup and restore procedures
Upgrade procedures with rollback guidance
Story 6: Enhanced Observability Metrics ✅
Geolocation routing metrics (opengslb_geo_routing_decision, opengslb_geo_fallback)
Custom CIDR hit metrics (opengslb_geo_custom_mapping_hit)
Latency routing metrics (opengslb_latency_routing_decision, opengslb_latency_rejection)
Per-agent metrics (opengslb_overwatch_agent_heartbeat_age, opengslb_overwatch_agent_backends)
Override metrics with service labels (opengslb_overwatch_backend_override)
Enhanced DNSSEC metrics (opengslb_dnssec_key_age)
Gossip decryption failure counter
Prometheus alerting examples in documentation
Story 7: Integration Tests and Documentation Polish ✅
Integration tests for geolocation routing
Integration tests for latency routing
Integration tests for CLI tools
Documentation review and consistency updates
Configuration reference updates for new features
Troubleshooting guide updates
Metrics
Code Coverage (Sprint 5)
pkg/agent: ~90%
pkg/overwatch: ~88%
pkg/config: 92%
pkg/dns: 87%
pkg/health: 90%
pkg/routing: 93%
pkg/metrics: 85%
Overall: ~89%
Test Results
Unit tests: All passing (162 tests)
Integration tests: Existing tests passing
Architecture Decisions Made
| ADR | Title | Sprint |
|---|---|---|
| ADR-001 | Use Go for Implementation | 1 |
| ADR-002 | DNS-Based Load Balancing Approach | 1 |
| ADR-003 | ⚠️ Health Check Architecture | 1 (superseded by ADR-015) |
| ADR-004 | Configuration via YAML Files | 1 |
| ADR-005 | Pluggable Routing Algorithms | 1 |
| ADR-006 | Prometheus for Metrics | 2 |
| ADR-007 | ⚠️ Separate Control and Data Planes | 2 (superseded by ADR-015) |
| ADR-008 | TTL-Based Failover Strategy | 2 |
| ADR-009 | Unhealthy Server Response Strategy | 2 |
| ADR-010 | DNS Library Selection (miekg/dns) | 2 |
| ADR-011 | Router Terminology Clarification | 2 |
| ADR-012 | ⚠️ Distributed Agent Architecture | 4 (superseded by ADR-015) |
| ADR-013 | ⚠️ Hybrid Configuration & KV Store | 4 (superseded by ADR-015) |
| ADR-014 | ⚠️ Runtime Mode Semantics | 4 (superseded by ADR-015) |
| ADR-015 | Agent-Overwatch Architecture | 5 |
Feature Summary
Routing Algorithms
| Algorithm | Description | Status |
|---|---|---|
| Round-Robin | Equal distribution across healthy servers | ✅ Complete |
| Weighted | Proportional distribution by server weight | ✅ Complete |
| Failover | Priority-based active/standby | ✅ Complete |
| Geolocation | Route by client IP location (GeoIP2) | ✅ Complete |
| Latency-Based | Route to lowest-latency server (EMA smoothed) | ✅ Complete |
Health Checks
| Type | Description | Status |
|---|---|---|
| HTTP | GET request, expect 2xx | ✅ Complete |
| HTTPS | TLS-enabled HTTP check | ✅ Complete |
| TCP | Connection-only verification | ✅ Complete |
DNS Features
| Feature | Status |
|---|---|
| A Records (IPv4) | ✅ Complete |
| AAAA Records (IPv6) | ✅ Complete |
| UDP Transport | ✅ Complete |
| TCP Transport | ✅ Complete |
| Configurable TTL | ✅ Complete |
| NXDOMAIN for unknown | ✅ Complete |
| SERVFAIL when all unhealthy | ✅ Complete |
| DNSSEC Signing | ✅ Complete |
Agent-Overwatch Architecture
| Component | Status |
|---|---|
| Agent Mode | ✅ Complete |
| Multi-backend support | ✅ Complete |
| Heartbeat mechanism | ✅ Complete |
| Identity/TOFU | ✅ Complete |
| Predictive health | ✅ Complete |
| Overwatch Mode | ✅ Complete |
| Backend registry | ✅ Complete |
| External validation | ✅ Complete |
| Health authority hierarchy | ✅ Complete |
| Stale detection with recovery | ✅ Complete |
| Override API | ✅ Complete |
| DNSSEC key sync | ✅ Complete |
Operations
| Feature | Status |
|---|---|
| Structured Logging (JSON/text) | ✅ Complete |
| Prometheus Metrics | ✅ Complete |
| Hot Reload (SIGHUP) | ✅ Complete |
| Health Status API | ✅ Complete |
| Docker Deployment | ✅ Complete |
| Graceful Shutdown | ✅ Complete |
| Mandatory Gossip Encryption | ✅ Complete |
Known Issues / Technical Debt
Low Priority
CNAME record support not yet implemented
Web UI dashboard not yet implemented
Future Enhancements
Windows service support validation
Performance benchmarks for agent-overwatch architecture
Grafana dashboard templates (community contribution welcome)
Sprint 7 Preview (Future)
Based on roadmap, future sprints may focus on:
CNAME record support
Grafana dashboard templates
Web UI for configuration management
Windows service support
Additional routing algorithms (e.g., session affinity)
Documentation Index
| Document | Description |
|---|---|
| README.md | Project overview and quick start |
| | Full configuration reference |
| | REST API reference |
| | Prometheus metrics reference |
| | Docker deployment guide |
| | Testing guide |
| | Common issues and solutions |
| | Design decisions |
| | API security guide |
| | Gossip protocol documentation |
| CONTRIBUTING.md | Development setup and workflow |
Project Milestones
| Milestone | Status | Date |
|---|---|---|
| Sprint 1: Infrastructure | ✅ Complete | Nov 2025 |
| Sprint 2: Core Features | ✅ Complete | Nov 2025 |
| Sprint 3: Advanced Features | ✅ Complete | Dec 2025 |
| Sprint 4: Distributed Architecture | ⚠️ Superseded | Dec 2025 |
| Sprint 5: Agent-Overwatch Architecture | ✅ Complete | Dec 2025 |
| Sprint 6: Production Readiness | ✅ Complete | Dec 2025 |
| Sprint 7: Future Enhancements | 🔲 Planned | TBD |
Last Updated: December 2025
Version: 0.6.0
Sprint Master: Logan Ross
Product Owner: Logan Ross