OpenGSLB Project Progress

Current Sprint: Sprint 6 - Production Readiness ✅

Sprint Goal: Implement intelligent routing features (geolocation and latency-based routing), enhanced observability, operational tooling, and comprehensive documentation for production deployments.

Completed

Sprint 1: Foundation ✅

  • GitHub repository with branch protection

  • CI/CD pipeline (Go 1.21/1.22 matrix, golangci-lint)

  • Docker image builds to ghcr.io

  • Integration test environment (docker-compose)

  • Development environment documentation

  • Makefile and developer tooling

  • Pre-commit hooks

Sprint 2: Core Features ✅

  • Configuration Schema & Loader (YAML, validation, defaults)

  • DNS Server Foundation (miekg/dns, UDP/TCP, A records)

  • Health Check Framework (HTTP, thresholds, status tracking)

  • Round-Robin Routing Algorithm

  • Component Integration (graceful shutdown, lifecycle management)

  • Observability Foundation (slog logging, Prometheus metrics)

  • Documentation & Examples

Sprint 3: Advanced Features ✅

Story 1: Weighted Routing Algorithm ✅

  • Weighted random selection based on server weights

  • Weight 0 excludes server from selection

  • Unhealthy servers excluded regardless of weight

  • Statistical distribution matches weight ratios

  • Thread-safe implementation

  • Unit tests verify proportional distribution
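
The selection rules above can be sketched as follows. This is a minimal illustration, not the project's router: the `Server` type, the `pickWeighted` function, and the injected `intn` random source are all hypothetical names chosen for the example. Injecting the random source keeps the function itself free of shared state; a real implementation would still need a concurrency-safe source.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Server is a hypothetical stand-in for a configured backend.
type Server struct {
	Address string
	Weight  int
	Healthy bool
}

// pickWeighted chooses a server with probability proportional to its weight.
// Servers that are unhealthy or have a weight of 0 are never returned.
// intn must return a uniform value in [0, n), like (*rand.Rand).Intn.
func pickWeighted(servers []Server, intn func(n int) int) (Server, bool) {
	total := 0
	for _, s := range servers {
		if s.Healthy && s.Weight > 0 {
			total += s.Weight
		}
	}
	if total == 0 {
		return Server{}, false // nothing eligible
	}
	n := intn(total)
	for _, s := range servers {
		if !s.Healthy || s.Weight <= 0 {
			continue
		}
		if n < s.Weight {
			return s, true
		}
		n -= s.Weight
	}
	return Server{}, false // not reached when intn honours its contract
}

func main() {
	servers := []Server{
		{Address: "10.0.0.1", Weight: 3, Healthy: true},
		{Address: "10.0.0.2", Weight: 1, Healthy: true},
		{Address: "10.0.0.3", Weight: 0, Healthy: true},  // excluded: weight 0
		{Address: "10.0.0.4", Weight: 5, Healthy: false}, // excluded: unhealthy
	}
	rng := rand.New(rand.NewSource(1))
	counts := map[string]int{}
	for i := 0; i < 4000; i++ {
		if s, ok := pickWeighted(servers, rng.Intn); ok {
			counts[s.Address]++
		}
	}
	fmt.Println(counts) // the first two servers should appear at roughly 3:1
}
```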

Story 2: Active/Standby Routing Algorithm ✅

  • Failover router selects first healthy server in priority order

  • Automatic failover when primary becomes unhealthy

  • Automatic return-to-primary when it recovers

  • Supports multiple fallback levels

  • Clear logging of failover events

  • Unit tests for failover and recovery scenarios
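
A minimal sketch of the failover rule, assuming the server slice is already sorted in priority order (primary first). The `Server` type and `pickFailover` function are hypothetical names for this example.

```go
package main

import "fmt"

// Server is a hypothetical stand-in for a configured backend.
type Server struct {
	Address string
	Healthy bool
}

// pickFailover returns the first healthy server. The slice is assumed to be
// in priority order: primary first, then each fallback level.
func pickFailover(servers []Server) (Server, bool) {
	for _, s := range servers {
		if s.Healthy {
			return s, true
		}
	}
	return Server{}, false // every level is unhealthy
}

func main() {
	pool := []Server{
		{Address: "10.0.0.1", Healthy: true}, // primary
		{Address: "10.0.1.1", Healthy: true}, // first fallback
		{Address: "10.0.2.1", Healthy: true}, // second fallback
	}
	s, _ := pickFailover(pool)
	fmt.Println("normal:   ", s.Address)

	pool[0].Healthy = false
	s, _ = pickFailover(pool)
	fmt.Println("failover: ", s.Address)

	pool[0].Healthy = true
	s, _ = pickFailover(pool)
	fmt.Println("recovered:", s.Address)
}
```

Because the choice is recomputed from current health on every call and no "active server" state is kept, return-to-primary falls out of the same loop that performs failover.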

Story 3: TCP Health Check Implementation ✅

  • TCP health check (connection-only verification)

  • Configurable timeout

  • type: tcp configuration support

  • Unit tests with mock TCP servers
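
A connection-only check with a configurable timeout can be sketched with the standard library alone. The `checkTCP` function and the `startLoopbackListener` demo helper are hypothetical names; the project's own checker may be structured differently.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// checkTCP succeeds if a TCP connection to addr is established within timeout.
// No data is sent or read: this is connection-only verification.
func checkTCP(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return fmt.Errorf("tcp check %s: %w", addr, err)
	}
	return conn.Close()
}

// startLoopbackListener is a demo helper that opens a listener on a free
// loopback port and returns its address plus a function that closes it.
func startLoopbackListener() (addr string, stop func(), err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	return ln.Addr().String(), func() { ln.Close() }, nil
}

func main() {
	addr, stop, err := startLoopbackListener()
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	fmt.Println("listener up, healthy:  ", checkTCP(addr, time.Second) == nil)
	stop()
	fmt.Println("listener down, healthy:", checkTCP(addr, time.Second) == nil)
}
```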

Story 4: Configuration Hot-Reload (SIGHUP) ✅

  • SIGHUP triggers configuration reload

  • New configuration validated before applying

  • Invalid configuration rejected with error log

  • Domains can be added/removed

  • Servers can be added/removed from regions

  • Health checks start/stop for changed servers

  • Reload events logged

  • Metrics track reload attempts and success/failure

  • In-flight DNS queries not disrupted
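
The validate-before-apply behaviour can be sketched as below. This is an illustration under stated assumptions: the reduced `Config` type, its `Validate` rule, and the `reload` and `watch` functions are hypothetical, and an atomic pointer swap is one possible way to avoid disturbing in-flight readers, not necessarily the project's mechanism.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// Config is a hypothetical, heavily reduced configuration.
type Config struct {
	Domains []string
}

// Validate applies a single illustrative rule.
func (c *Config) Validate() error {
	if len(c.Domains) == 0 {
		return errors.New("at least one domain is required")
	}
	return nil
}

// current holds the active configuration; readers call current.Load().
var current atomic.Pointer[Config]

// reload validates the candidate before swapping it in, so a bad file
// leaves the running configuration untouched.
func reload(load func() (*Config, error)) error {
	next, err := load()
	if err != nil {
		return fmt.Errorf("load: %w", err)
	}
	if err := next.Validate(); err != nil {
		return fmt.Errorf("validate: %w", err)
	}
	current.Store(next)
	return nil
}

// watch applies one reload per received signal until the channel is closed.
func watch(hup <-chan os.Signal, load func() (*Config, error)) {
	for range hup {
		if err := reload(load); err != nil {
			fmt.Println("reload rejected:", err)
			continue
		}
		fmt.Println("reload applied:", current.Load().Domains)
	}
}

func main() {
	load := func() (*Config, error) {
		return &Config{Domains: []string{"app.example.com"}}, nil
	}
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP) // the real wiring: kill -HUP <pid>
	// A long-running server would run watch(hup, load) in a goroutine.
	// To keep this demo finite, stop delivery, inject one synthetic SIGHUP
	// (unless a real one is already buffered), and close the channel.
	signal.Stop(hup)
	select {
	case hup <- syscall.SIGHUP:
	default:
	}
	close(hup)
	watch(hup, load)
}
```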

Story 5: AAAA Record Support ✅

  • AAAA queries return IPv6 addresses

  • Servers can be configured with IPv6 addresses

  • Mixed IPv4/IPv6 server pools supported

  • A query returns only IPv4, AAAA returns only IPv6

  • Unit tests for AAAA handling
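
Splitting a mixed pool by address family can be sketched as follows. The `filterByFamily` function is a hypothetical helper for this example; addresses are taken as plain strings for brevity.

```go
package main

import (
	"fmt"
	"net"
)

// filterByFamily keeps only the addresses matching the query type:
// IPv4 for A queries (wantIPv6 == false), IPv6 for AAAA queries.
// Strings that do not parse as IP addresses are skipped.
func filterByFamily(addrs []string, wantIPv6 bool) []string {
	var out []string
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip == nil {
			continue
		}
		// To4 returns nil for genuine IPv6 addresses. IPv4-mapped IPv6
		// addresses (::ffff:a.b.c.d) count as IPv4 in this sketch.
		isIPv6 := ip.To4() == nil
		if isIPv6 == wantIPv6 {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	pool := []string{"192.0.2.10", "2001:db8::10", "198.51.100.7"}
	fmt.Println("A:   ", filterByFamily(pool, false))
	fmt.Println("AAAA:", filterByFamily(pool, true))
}
```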

Story 6: Health Check Status API Endpoint ✅

  • GET /api/v1/health/servers returns JSON with all server statuses

  • GET /api/v1/health/regions returns aggregated region health

  • GET /api/v1/ready for readiness probes

  • GET /api/v1/live for liveness probes

  • IP-based access control (allowed_networks)

  • Security-first design with localhost-only default

  • Documentation includes endpoint details
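
The allowlist check with a localhost-only default might look like the sketch below. The function names and the exact default prefixes (`127.0.0.0/8` and `::1/128`) are assumptions for the example; only the `allowed_networks` setting and the localhost-only default come from the list above.

```go
package main

import (
	"fmt"
	"net/netip"
)

// defaultAllowed is an assumed localhost-only default.
var defaultAllowed = []string{"127.0.0.0/8", "::1/128"}

// parseAllowed converts CIDR strings into prefixes. An empty list falls
// back to the localhost-only default.
func parseAllowed(cidrs []string) ([]netip.Prefix, error) {
	if len(cidrs) == 0 {
		cidrs = defaultAllowed
	}
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, fmt.Errorf("allowed_networks: %w", err)
		}
		out = append(out, p.Masked())
	}
	return out, nil
}

// isAllowed reports whether remoteAddr ("ip:port", as found in
// http.Request.RemoteAddr) falls inside any allowed prefix.
func isAllowed(remoteAddr string, allowed []netip.Prefix) bool {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return false // unparseable peers are rejected
	}
	addr := ap.Addr().Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	for _, p := range allowed {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	allowed, err := parseAllowed(nil)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, peer := range []string{"127.0.0.1:40000", "[::1]:40000", "10.1.2.3:40000"} {
		fmt.Printf("%-18s allowed=%v\n", peer, isAllowed(peer, allowed))
	}
}
```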

Story 7: Integration Test Suite Enhancement ✅

  • Integration test for weighted routing distribution

  • Integration test for failover behavior

  • Integration test for TCP health checks

  • Integration test for configuration reload (SIGHUP)

  • Integration test for AAAA records

  • Integration test for Health API

  • Tests run in CI pipeline

  • Manual test script covers all Sprint 3 features

Story 8: Documentation Updates ✅

  • Weighted routing documented with examples

  • Active/Standby (failover) routing documented with examples

  • TCP health checks documented

  • Hot reload documented with operational guidance

  • AAAA/IPv6 records documented

  • Health status API documented

  • API security/hardening guide created

  • PROGRESS.md updated

Sprint 4: Distributed Agent Architecture ⚠️ SUPERSEDED

Sprint 4 implemented a Raft-based cluster architecture. After operational analysis, this was superseded by Sprint 5’s simpler agent-overwatch model. See ADR-015.

Sprint 5: Agent-Overwatch Architecture ✅

Story 1: Remove Raft and Cluster Infrastructure ✅

  • Removed --mode=cluster (replaced by multiple independent Overwatches)

  • Removed Raft consensus code

  • Removed leader election logic

  • Updated --mode flag to accept only agent or overwatch

  • Updated DNS handler (no LeaderChecker needed)

  • Kept hashicorp/memberlist for gossip

Story 2: Refactor Agent Mode ✅

  • Agent supports multiple backends per instance

  • Each backend has independent health check configuration

  • HeartbeatSender with configurable interval

  • Service token authentication (pre-shared)

  • Self-signed certificate generation for TOFU (identity.go)

  • Predictive health signals per backend (predictor.go)

  • Agent does NOT serve DNS (enforced)

  • BackendManager handles multi-backend health checks

  • Monitor collects system metrics (CPU, memory, error rate)

  • Graceful shutdown with deregistration

  • Comprehensive unit tests

Story 3: Implement Overwatch Mode ✅

  • --mode=overwatch starts DNS server on configured address

  • Registry receives and processes agent gossip messages

  • Maintains backend registry from agent registrations

  • Validator performs external health validation (configurable interval)

  • Overwatch validation ALWAYS wins over agent claims (ADR-015)

  • Independent operation (no Overwatch-to-Overwatch coordination for health)

  • API for backend status, overrides, DNSSEC

  • KV store for state persistence (bbolt)

  • Unit tests for Overwatch functionality

Story 4: Mandatory Gossip Security ✅

  • Gossip encryption key REQUIRED in configuration

  • Startup fails with clear error if key missing

  • AES-256 encryption via memberlist

  • Key must be exactly 32 bytes (base64 encoded in config)

  • Key validation on startup

  • Documentation for key generation
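
Key generation and startup validation can be sketched as below. The function names are hypothetical, and standard (padded) base64 is an assumption; the list above only states that the key is base64 encoded and must decode to exactly 32 bytes.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
)

const gossipKeyLen = 32 // bytes, as required for AES-256

// newGossipKey returns a fresh random key in its base64 config form.
func newGossipKey() (string, error) {
	key := make([]byte, gossipKeyLen)
	if _, err := rand.Read(key); err != nil {
		return "", fmt.Errorf("generate gossip key: %w", err)
	}
	return base64.StdEncoding.EncodeToString(key), nil
}

// decodeGossipKey validates the configured value the way a startup check
// might: present, valid base64, and exactly 32 bytes once decoded.
func decodeGossipKey(encoded string) ([]byte, error) {
	if encoded == "" {
		return nil, errors.New("gossip encryption key is required")
	}
	key, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, fmt.Errorf("gossip encryption key is not valid base64: %w", err)
	}
	if len(key) != gossipKeyLen {
		return nil, fmt.Errorf("gossip encryption key must be %d bytes, got %d", gossipKeyLen, len(key))
	}
	return key, nil
}

func main() {
	encoded, err := newGossipKey()
	if err != nil {
		fmt.Println(err)
		return
	}
	key, err := decodeGossipKey(encoded)
	fmt.Println("generated key accepted:", err == nil, "length:", len(key))

	_, err = decodeGossipKey("c2hvcnQ=") // base64 of "short"
	fmt.Println("short key:", err)
}
```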

Story 5: Agent Identity and TOFU ✅

  • Agent generates self-signed certificate on first start

  • Certificate stored locally (configurable path)

  • Service token sent with first connection

  • Overwatch validates token, pins certificate

  • Subsequent connections authenticated by pinned cert

  • Pinned certs stored in Overwatch KV store

  • Certificate rotation mechanism

  • Revocation via API (delete pinned cert)

  • Unit tests for identity flow
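
The trust-on-first-use flow above can be sketched with an in-memory pin store. Everything here is illustrative: `PinStore` stands in for the Overwatch KV store, pinning a SHA-256 fingerprint of the certificate is an assumption about what "pins certificate" means, and certificate rotation is left out.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"errors"
	"fmt"
	"sync"
)

// PinStore is a hypothetical in-memory stand-in for the Overwatch KV store.
type PinStore struct {
	mu           sync.Mutex
	serviceToken string
	pins         map[string][sha256.Size]byte // agent ID -> certificate fingerprint
}

func NewPinStore(serviceToken string) *PinStore {
	return &PinStore{serviceToken: serviceToken, pins: make(map[string][sha256.Size]byte)}
}

// Authenticate applies trust-on-first-use. An agent with a pin must present
// the pinned certificate; an agent without one must present the service
// token, after which its certificate is pinned.
func (s *PinStore) Authenticate(agentID, token string, certDER []byte) error {
	fp := sha256.Sum256(certDER)
	s.mu.Lock()
	defer s.mu.Unlock()
	if pinned, ok := s.pins[agentID]; ok {
		if subtle.ConstantTimeCompare(pinned[:], fp[:]) != 1 {
			return errors.New("certificate does not match pinned certificate")
		}
		return nil
	}
	if subtle.ConstantTimeCompare([]byte(token), []byte(s.serviceToken)) != 1 {
		return errors.New("invalid service token")
	}
	s.pins[agentID] = fp
	return nil
}

// Revoke deletes the pin, forcing the agent back through first contact.
func (s *PinStore) Revoke(agentID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.pins, agentID)
}

func main() {
	store := NewPinStore("example-token")
	cert := []byte("placeholder for the agent's certificate DER bytes")
	fmt.Println("first contact:", store.Authenticate("agent-1", "example-token", cert))
	fmt.Println("reconnect:    ", store.Authenticate("agent-1", "", cert))
	fmt.Println("other cert:   ", store.Authenticate("agent-1", "example-token", []byte("other")))
	store.Revoke("agent-1")
	fmt.Println("after revoke: ", store.Authenticate("agent-1", "", cert))
}
```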

Story 6: External Override API ✅

  • PUT /api/v1/overrides/{service}/{address} sets override

  • DELETE /api/v1/overrides/{service}/{address} clears override

  • GET /api/v1/overrides lists all active overrides

  • Override includes: healthy (bool), reason (string), source (string)

  • Overrides stored in registry

  • API handlers with IP allowlist

  • Unit tests for API endpoints

Story 7: DNSSEC Foundation ✅

  • DNSSEC enabled by default

  • Disabling requires explicit security acknowledgment

  • Key generation on first start

  • DS record exposed via API

  • Key stored in KV store

  • Unit tests for DNSSEC signing

Story 8: DNSSEC Key Sync ✅

  • Overwatches poll peers for DNSSEC keys

  • Configurable poll interval

  • Newest key wins (by timestamp)

  • Key sync is the ONLY inter-Overwatch communication

  • Failed sync doesn’t prevent DNS serving

  • Sync status visible in metrics/API

Story 9: Heartbeat and Stale Backend Detection ✅

  • Agents send explicit heartbeat at configurable interval

  • Heartbeat message is lightweight (no full backend state unless changed)

  • Overwatch tracks last heartbeat per agent (AgentLastSeen)

  • Backends marked stale after N missed heartbeats

  • Stale backends removed from DNS rotation (GetHealthyBackends)

  • Overwatch external check can recover stale backend if actually healthy

  • Metrics for heartbeat status (OverwatchStaleAgentsTotal, OverwatchAgentHeartbeatAge)

  • Comprehensive unit tests for heartbeat and stale detection logic
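
Stale detection reduces to comparing the age of the last heartbeat against N intervals. The sketch below assumes an interval of 10 seconds and N = 3 purely for illustration; `Tracker` and its methods are hypothetical names, and the type is not safe for concurrent use as written.

```go
package main

import (
	"fmt"
	"time"
)

const (
	heartbeatInterval = 10 * time.Second // assumed value for the example
	missedThreshold   = 3                // the "N" in "N missed heartbeats"
)

// demoStart is a fixed reference time so the example is deterministic.
var demoStart = time.Date(2025, time.December, 1, 12, 0, 0, 0, time.UTC)

// Tracker records when each agent was last heard from.
type Tracker struct {
	lastSeen map[string]time.Time
}

func NewTracker() *Tracker {
	return &Tracker{lastSeen: make(map[string]time.Time)}
}

// Heartbeat records a heartbeat from agentID received at the given time.
func (t *Tracker) Heartbeat(agentID string, at time.Time) {
	t.lastSeen[agentID] = at
}

// Stale reports whether more than missedThreshold intervals have elapsed
// since the agent's last heartbeat. Agents never seen are treated as stale.
func (t *Tracker) Stale(agentID string, now time.Time) bool {
	last, ok := t.lastSeen[agentID]
	if !ok {
		return true
	}
	return now.Sub(last) > missedThreshold*heartbeatInterval
}

func main() {
	tr := NewTracker()
	tr.Heartbeat("agent-1", demoStart)
	for _, elapsed := range []time.Duration{5 * time.Second, 30 * time.Second, 31 * time.Second} {
		fmt.Printf("after %v: stale=%v\n", elapsed, tr.Stale("agent-1", demoStart.Add(elapsed)))
	}
}
```

In the model described above, a stale result would only remove the backend from rotation until the Overwatch's own external check says otherwise; that recovery path is not part of this sketch.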

Story 10: Integration Testing and Documentation ✅

  • Unit tests for agent-overwatch registration flow

  • Unit tests for multi-backend agent

  • Unit tests for Overwatch external validation veto

  • Unit tests verifying that the override API affects backend status

  • Unit tests for heartbeat and stale detection

  • Unit tests for DNSSEC signing

  • Unit tests for health authority hierarchy

  • Updated ARCHITECTURE_DECISIONS.md with ADR-015

  • Updated PROGRESS.md

  • Full integration tests for multiple independent Overwatches

  • Agent failover integration test

  • Deployment guide for agent-overwatch model

Sprint 6: Production Readiness ✅

Story 1: Geolocation Routing ✅

  • MaxMind GeoIP2/GeoLite2 database integration

  • Country and continent-level geographic resolution

  • Custom CIDR-to-region mappings with longest-prefix matching

  • EDNS Client Subnet (ECS) support for accurate client location

  • Configurable default region fallback

  • GeoRouter implementation with region-based server selection

  • API endpoint for geolocation testing (/api/v1/geo/lookup)

  • Unit and integration tests for geolocation routing
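
Longest-prefix matching over custom CIDR mappings, with a default-region fallback, can be sketched as below. `Table`, `NewTable`, and `Lookup` are hypothetical names, and the linear scan is chosen for clarity; the GeoIP database lookup that would sit between a custom-mapping miss and the default region is omitted.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Table maps custom CIDR prefixes to region names.
type Table struct {
	prefixes      map[netip.Prefix]string
	defaultRegion string
}

// NewTable parses the CIDR strings and rejects any that are malformed.
func NewTable(cidrToRegion map[string]string, defaultRegion string) (*Table, error) {
	t := &Table{prefixes: make(map[netip.Prefix]string), defaultRegion: defaultRegion}
	for cidr, region := range cidrToRegion {
		p, err := netip.ParsePrefix(cidr)
		if err != nil {
			return nil, fmt.Errorf("custom mapping %q: %w", cidr, err)
		}
		t.prefixes[p.Masked()] = region
	}
	return t, nil
}

// Lookup returns the region of the most specific (longest) prefix that
// contains ip, or the default region if nothing matches or ip is invalid.
func (t *Table) Lookup(ip string) string {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return t.defaultRegion
	}
	addr = addr.Unmap()
	region, bestBits := t.defaultRegion, -1
	for p, r := range t.prefixes {
		if p.Contains(addr) && p.Bits() > bestBits {
			region, bestBits = r, p.Bits()
		}
	}
	return region
}

func main() {
	table, err := NewTable(map[string]string{
		"10.0.0.0/8":    "us-east",
		"10.1.0.0/16":   "us-west",
		"2001:db8::/32": "eu-west",
	}, "default")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, ip := range []string{"10.1.2.3", "10.2.0.1", "2001:db8::1", "192.0.2.1"} {
		fmt.Printf("%-12s -> %s\n", ip, table.Lookup(ip))
	}
}
```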

Story 2: Latency-Based Routing ✅

  • Continuous latency measurement during health checks

  • Exponential moving average (EMA) smoothing to prevent flapping

  • Configurable maximum latency threshold (default: 500ms)

  • Minimum samples requirement before using latency data

  • Automatic fallback to round-robin when insufficient data

  • Sub-millisecond precision latency tracking

  • LatencyRouter implementation with lowest-latency selection

  • Unit and integration tests for latency routing
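
The smoothing and selection rules above can be sketched as follows. The update used is the standard EMA form, new = alpha * sample + (1 - alpha) * old, seeded with the first sample. The `EMA` type, `pickLowest`, the smoothing factor, and the use of float64 milliseconds are assumptions for the example; only the 500ms default threshold comes from the list above.

```go
package main

import "fmt"

// EMA keeps an exponential moving average of latency samples.
type EMA struct {
	alpha   float64 // smoothing factor in (0, 1]; higher reacts faster
	value   float64 // smoothed latency in milliseconds
	samples int
}

// Observe folds one latency sample (milliseconds) into the average.
func (e *EMA) Observe(ms float64) {
	if e.samples == 0 {
		e.value = ms // seed with the first sample
	} else {
		e.value = e.alpha*ms + (1-e.alpha)*e.value
	}
	e.samples++
}

// pickLowest returns the address with the lowest smoothed latency among
// servers that have at least minSamples samples and are at or below maxMs.
// ok is false when no server qualifies, in which case the caller would
// fall back to another algorithm such as round-robin.
func pickLowest(stats map[string]*EMA, minSamples int, maxMs float64) (addr string, ok bool) {
	for a, e := range stats {
		if e.samples < minSamples || e.value > maxMs {
			continue
		}
		if !ok || e.value < stats[addr].value {
			addr, ok = a, true
		}
	}
	return addr, ok
}

func main() {
	stats := map[string]*EMA{
		"10.0.0.1": {alpha: 0.3},
		"10.0.0.2": {alpha: 0.3},
	}
	for _, ms := range []float64{20, 22, 250, 21} { // one spike on the first server
		stats["10.0.0.1"].Observe(ms)
	}
	for _, ms := range []float64{40, 41, 39, 40} {
		stats["10.0.0.2"].Observe(ms)
	}
	addr, ok := pickLowest(stats, 3, 500)
	fmt.Printf("selected=%s ok=%v\n", addr, ok)
	fmt.Printf("smoothed: %.1fms vs %.1fms\n", stats["10.0.0.1"].value, stats["10.0.0.2"].value)
}
```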

Story 3: CLI Management Tool ✅

  • opengslb-cli command-line tool for operations

  • status command for overall system health

  • servers command with filtering by service/region

  • overrides command for managing manual overrides

  • geo test command for testing geolocation lookups

  • dnssec commands for key management

  • Configuration validation command

  • Table and JSON output formats

  • Comprehensive CLI documentation

Story 4: Multi-File Configuration Includes ✅

  • includes directive for splitting config across files

  • Glob pattern matching (config.d/*.yaml)

  • Environment variable expansion (${VAR} syntax)

  • Layered configuration merging (arrays concatenated, maps merged)

  • Circular include detection

  • Maximum include depth enforcement (10 levels)

  • Security: permission checks on all included files

  • Clear error messages with file:line context
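
Variable expansion, circular include detection, and the depth limit can be sketched over an in-memory include graph. The names below are hypothetical, real file reading and glob expansion are omitted, and counting the root file as depth 0 is an assumption about how "10 levels" is measured. Note that `os.Expand` also expands bare `$VAR` references, which may be broader than the `${VAR}` syntax listed above.

```go
package main

import (
	"fmt"
	"os"
)

const maxIncludeDepth = 10

// expandVars replaces ${VAR} references in s using lookup.
func expandVars(s string, lookup func(string) string) string {
	return os.Expand(s, lookup)
}

// resolveIncludes walks the include graph depth-first from root and returns
// files in load order. includes maps each file to the files it includes.
// It fails on a circular include or when nesting exceeds maxIncludeDepth.
func resolveIncludes(root string, includes map[string][]string) ([]string, error) {
	var order []string
	onPath := map[string]bool{} // files on the current include chain
	var visit func(file string, depth int) error
	visit = func(file string, depth int) error {
		if depth > maxIncludeDepth {
			return fmt.Errorf("%s: include depth exceeds %d", file, maxIncludeDepth)
		}
		if onPath[file] {
			return fmt.Errorf("%s: circular include detected", file)
		}
		onPath[file] = true
		defer delete(onPath, file)
		order = append(order, file)
		for _, inc := range includes[file] {
			if err := visit(inc, depth+1); err != nil {
				return err
			}
		}
		return nil
	}
	if err := visit(root, 0); err != nil {
		return nil, err
	}
	return order, nil
}

func main() {
	fmt.Println(expandVars("listen_address: ${DNS_LISTEN}", func(name string) string {
		if name == "DNS_LISTEN" {
			return "0.0.0.0:53"
		}
		return os.Getenv(name)
	}))

	order, err := resolveIncludes("main.yaml", map[string][]string{
		"main.yaml":             {"config.d/regions.yaml", "config.d/domains.yaml"},
		"config.d/domains.yaml": {"config.d/shared.yaml"},
	})
	fmt.Println(order, err)

	_, err = resolveIncludes("a.yaml", map[string][]string{
		"a.yaml": {"b.yaml"},
		"b.yaml": {"a.yaml"},
	})
	fmt.Println(err)
}
```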

Story 5: Comprehensive Operational Runbooks ✅

  • Overwatch deployment runbook with production examples

  • Agent deployment guide with multi-backend configuration

  • GeoIP database update procedures and automation

  • HA setup guide for multi-Overwatch deployments

  • Incident response playbooks for common scenarios

  • Backup and restore procedures

  • Upgrade procedures with rollback guidance

Story 6: Enhanced Observability Metrics ✅

  • Geolocation routing metrics (opengslb_geo_routing_decision, opengslb_geo_fallback)

  • Custom CIDR hit metrics (opengslb_geo_custom_mapping_hit)

  • Latency routing metrics (opengslb_latency_routing_decision, opengslb_latency_rejection)

  • Per-agent metrics (opengslb_overwatch_agent_heartbeat_age, opengslb_overwatch_agent_backends)

  • Override metrics with service labels (opengslb_overwatch_backend_override)

  • Enhanced DNSSEC metrics (opengslb_dnssec_key_age)

  • Gossip decryption failure counter

  • Prometheus alerting examples in documentation

Story 7: Integration Tests and Documentation Polish ✅

  • Integration tests for geolocation routing

  • Integration tests for latency routing

  • Integration tests for CLI tools

  • Documentation review and consistency updates

  • Configuration reference updates for new features

  • Troubleshooting guide updates

Metrics

Code Coverage (Sprint 5)

  • pkg/agent: ~90%

  • pkg/overwatch: ~88%

  • pkg/config: 92%

  • pkg/dns: 87%

  • pkg/health: 90%

  • pkg/routing: 93%

  • pkg/metrics: 85%

  • Overall: ~89%

Test Results

  • Unit tests: All passing (162 tests)

  • Integration tests: Existing tests passing

Architecture Decisions Made

| ADR | Title | Sprint |
|---|---|---|
| ADR-001 | Use Go for Implementation | 1 |
| ADR-002 | DNS-Based Load Balancing Approach | 1 |
| ADR-003 | ⚠️ Health Check Architecture | 1 (superseded by ADR-015) |
| ADR-004 | Configuration via YAML Files | 1 |
| ADR-005 | Pluggable Routing Algorithms | 1 |
| ADR-006 | Prometheus for Metrics | 2 |
| ADR-007 | ⚠️ Separate Control and Data Planes | 2 (superseded by ADR-015) |
| ADR-008 | TTL-Based Failover Strategy | 2 |
| ADR-009 | Unhealthy Server Response Strategy | 2 |
| ADR-010 | DNS Library Selection (miekg/dns) | 2 |
| ADR-011 | Router Terminology Clarification | 2 |
| ADR-012 | ⚠️ Distributed Agent Architecture | 4 (superseded by ADR-015) |
| ADR-013 | ⚠️ Hybrid Configuration & KV Store | 4 (superseded by ADR-015) |
| ADR-014 | ⚠️ Runtime Mode Semantics | 4 (superseded by ADR-015) |
| ADR-015 | Agent-Overwatch Architecture | 5 |

Feature Summary

Routing Algorithms

| Algorithm | Description | Status |
|---|---|---|
| Round-Robin | Equal distribution across healthy servers | ✅ Complete |
| Weighted | Proportional distribution by server weight | ✅ Complete |
| Failover | Priority-based active/standby | ✅ Complete |
| Geolocation | Route by client IP location (GeoIP2) | ✅ Complete |
| Latency-Based | Route to lowest-latency server (EMA smoothed) | ✅ Complete |

Health Checks

| Type | Description | Status |
|---|---|---|
| HTTP | GET request, expect 2xx | ✅ Complete |
| HTTPS | TLS-enabled HTTP check | ✅ Complete |
| TCP | Connection-only verification | ✅ Complete |

DNS Features

| Feature | Status |
|---|---|
| A Records (IPv4) | ✅ Complete |
| AAAA Records (IPv6) | ✅ Complete |
| UDP Transport | ✅ Complete |
| TCP Transport | ✅ Complete |
| Configurable TTL | ✅ Complete |
| NXDOMAIN for unknown | ✅ Complete |
| SERVFAIL when all unhealthy | ✅ Complete |
| DNSSEC Signing | ✅ Complete |

Agent-Overwatch Architecture

| Component | Status |
|---|---|
| Agent Mode | ✅ Complete |
| Multi-backend support | ✅ Complete |
| Heartbeat mechanism | ✅ Complete |
| Identity/TOFU | ✅ Complete |
| Predictive health | ✅ Complete |
| Overwatch Mode | ✅ Complete |
| Backend registry | ✅ Complete |
| External validation | ✅ Complete |
| Health authority hierarchy | ✅ Complete |
| Stale detection with recovery | ✅ Complete |
| Override API | ✅ Complete |
| DNSSEC key sync | ✅ Complete |

Operations

| Feature | Status |
|---|---|
| Structured Logging (JSON/text) | ✅ Complete |
| Prometheus Metrics | ✅ Complete |
| Hot Reload (SIGHUP) | ✅ Complete |
| Health Status API | ✅ Complete |
| Docker Deployment | ✅ Complete |
| Graceful Shutdown | ✅ Complete |
| Mandatory Gossip Encryption | ✅ Complete |

Known Issues / Technical Debt

Low Priority

  • CNAME record support not yet implemented

  • Web UI dashboard not yet implemented

Future Enhancements

  • Windows service support validation

  • Performance benchmarks for agent-overwatch architecture

  • Grafana dashboard templates (community contribution welcome)

Sprint 7 Preview (Future)

Based on the roadmap, future sprints may focus on:

  • CNAME record support

  • Grafana dashboard templates

  • Web UI for configuration management

  • Windows service support

  • Additional routing algorithms (e.g., session affinity)

Documentation Index

| Document | Description |
|---|---|
| README.md | Project overview and quick start |
| docs/configuration.md | Full configuration reference |
| docs/api.md | REST API reference |
| docs/metrics.md | Prometheus metrics reference |
| docs/docker.md | Docker deployment guide |
| docs/testing.md | Testing guide |
| docs/troubleshooting.md | Common issues and solutions |
| docs/ARCHITECTURE_DECISIONS.md | Design decisions |
| docs/security/api-hardening.md | API security guide |
| docs/gossip.md | Gossip protocol documentation |
| CONTRIBUTING.md | Development setup and workflow |

Project Milestones

| Milestone | Status | Date |
|---|---|---|
| Sprint 1: Infrastructure | ✅ Complete | Nov 2025 |
| Sprint 2: Core Features | ✅ Complete | Nov 2025 |
| Sprint 3: Advanced Features | ✅ Complete | Dec 2025 |
| Sprint 4: Distributed Architecture | ⚠️ Superseded | Dec 2025 |
| Sprint 5: Agent-Overwatch Architecture | ✅ Complete | Dec 2025 |
| Sprint 6: Production Readiness | ✅ Complete | Dec 2025 |
| Sprint 7: Future Enhancements | 🔲 Planned | TBD |


Last Updated: December 2025

Version: 0.6.0

Sprint Master: Logan Ross

Product Owner: Logan Ross