Demo 5: Predictive Health Detection
“We knew it was going to fail before it did.”
This demo showcases OpenGSLB’s core differentiator: predictive health monitoring that detects problems before they impact users.
What You’ll Learn
Predictive health monitoring configuration
CPU, memory, and error rate thresholds
Chaos engineering with injectable failures
Grafana dashboards for visualization
Proactive vs reactive health detection
The Problem with Traditional GSLB
Traditional GSLB:
App crashes → Health check fails → DNS updated → Users see errors (30-60s)
OpenGSLB:
CPU spikes → Agent predicts failure → Traffic drains → App crashes → Zero user impact
OpenGSLB is predictive from the inside (agents know trouble is coming) while remaining reactive from the outside (overwatch validates and can veto).
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ DEMO ENVIRONMENT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ DNS CLIENT │ │
│ │ (dig / client container) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ OVERWATCH │ │
│ │ (10.50.0.10:53) │ │
│ │ │ │
│ │ • Receives agent gossip (including predictive signals) │ │
│ │ • Performs external health validation │ │
│ │ • Serves authoritative DNS │ │
│ │ • Exposes API on :8080, metrics on :9090 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Gossip (AES-256 encrypted) │
│ │ │
│ ┌────────────────────┬───────────┴───────────┬────────────────────┐ │
│ │ │ │ │ │
│ ▼ ▼ ▼ │ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ BACKEND-1 │ │ BACKEND-2 │ │ BACKEND-3 │ │ │
│ │ (stable) │ │ (stable) │ │ (CHAOS) │ │ │
│ │ │ │ │ │ │ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ │
│ │ │ Agent │ │ │ │ Agent │ │ │ │ Agent │ │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ │
│ │ │ Demo App │ │ │ │ Demo App │ │ │ │ Demo App │ │◄── Chaos │ │
│ │ │ :8080 │ │ │ │ :8080 │ │ │ │ :8080 │ │ Target│ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ 10.50.0.21 10.50.0.22 10.50.0.23 │ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PROMETHEUS + GRAFANA │ │
│ │ (Metrics visualization) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Quick Start
Prerequisites
Docker and Docker Compose
Go 1.22+ (for building OpenGSLB)
digcommand (for DNS queries)curlandjq(for API calls)
Start the Demo
cd demos/demo-5-predictive-health
# This builds and starts everything
./scripts/start-demo.sh
Access Points
Service |
URL/Command |
|---|---|
Grafana Dashboard |
http://localhost:3000 |
Prometheus |
http://localhost:9092 |
Overwatch API |
http://localhost:8080/api/v1/health/servers |
DNS Query |
|
Client SSH |
|
Chaos API |
http://localhost:8083/chaos/status |
Run the Guided Demo
ssh root@localhost -p 2222
# Password: demo
# Inside the container:
./demo.sh
Demo Script
Act 1: Baseline
Verify all backends are healthy:
# Check health via API
curl -s http://localhost:8080/api/v1/health/servers | jq '.servers[] | {address, healthy}'
# DNS returns all 3 backends
dig @localhost -p 5354 app.demo.local +short
Act 2: Trigger CPU Spike
Inject chaos on backend-3:
# From host machine
curl -X POST "http://localhost:8083/chaos/cpu?intensity=85&duration=60s"
# Or from client container
./chaos.sh cpu 60s 85
Watch the Grafana dashboard - you’ll see:
CPU spike on backend-3
Predictive signal triggers
Backend begins draining
Act 3: Traffic Shifts
DNS automatically excludes backend-3:
# DNS now returns only backend-1 and backend-2
dig @localhost -p 5354 app.demo.local +short
# But health check STILL PASSES!
curl -s http://localhost:8083/health
# Returns: {"status":"healthy"...}
# Overwatch shows backend-3 as draining
curl -s http://localhost:8080/api/v1/health/servers | jq '.servers[] | select(.address | contains("10.50.0.23"))'
Important
The backend’s health check is still passing, but we’re proactively draining because the agent predicted trouble. This is the key insight of predictive health.
Act 4: Overwatch Validates
Trigger actual errors to validate the prediction:
curl -X POST "http://localhost:8083/chaos/errors?rate=100&duration=30s"
Now Overwatch’s external validation confirms the failure. The agent’s prediction was correct.
Act 5: Recovery
Stop all chaos:
curl -X POST "http://localhost:8083/chaos/stop"
# Wait 15 seconds for recovery
sleep 15
# All 3 backends should be back
dig @localhost -p 5354 app.demo.local +short
Chaos Injection API
The demo app on backend-3 (exposed at http://localhost:8083) provides chaos injection endpoints:
Endpoint |
Method |
Description |
|---|---|---|
|
POST |
Trigger CPU spike |
|
POST |
Trigger memory pressure |
|
POST |
Inject HTTP 500 errors |
|
POST |
Add response latency |
|
POST |
Stop all chaos |
|
GET |
View current chaos state |
Parameters
CPU Spike
curl -X POST "http://localhost:8083/chaos/cpu?duration=60s&intensity=85"
# intensity: 1-100 (% CPU to consume)
# duration: how long to run
Memory Pressure
curl -X POST "http://localhost:8083/chaos/memory?duration=60s&amount=500"
# amount: MB to allocate
# duration: how long to hold
Error Injection
curl -X POST "http://localhost:8083/chaos/errors?duration=60s&rate=50"
# rate: 1-100 (% of /health requests that return 500)
# duration: how long to inject errors
Latency Injection
curl -X POST "http://localhost:8083/chaos/latency?duration=60s&latency=500"
# latency: milliseconds to add to all requests
# duration: how long to inject latency
Key Talking Points
“Why is this better than traditional health checks?”
Traditional GSLB waits for failure. Three failed health checks at 10-second intervals = 30 seconds of users hitting a degraded server. OpenGSLB’s agent sees the warning signs—CPU spiking, memory filling, error rates climbing—and starts draining traffic before the crash. Zero user impact.
“Can’t the agent lie?”
Great question. The agent can claim anything, but Overwatch always validates externally. If the agent says “healthy” but Overwatch’s check sees 500 errors, Overwatch wins. If the agent cries wolf with bleed signals, Overwatch’s external check can override. Trust but verify.
“What if the Overwatch goes down?”
DNS clients have built-in redundancy. Configure multiple Overwatches in resolv.conf, clients automatically retry. No VRRP, no Raft, no complexity. DNS has been doing failover for 40 years.
Predictive Health Configuration
The agent configuration enables predictive monitoring:
agent:
predictive:
enabled: true
check_interval: 5s
cpu:
threshold: 80 # Bleed when CPU exceeds 80%
bleed_duration: 30s
memory:
threshold: 85 # Bleed when memory exceeds 85%
bleed_duration: 30s
error_rate:
threshold: 5 # Bleed when error rate exceeds 5/min
window: 60s
bleed_duration: 30s
Troubleshooting
DNS queries timeout
Make sure Overwatch is running:
docker logs overwatch
Backends not registering
Check agent logs:
docker logs backend-3
Chaos not working
Verify backend-3 is accessible:
curl http://localhost:8083/chaos/status
Cleanup
./scripts/cleanup.sh
This stops and removes all containers.
Summary
Demo 5 showcases OpenGSLB’s most powerful feature: predictive health detection. By running agents alongside your applications, OpenGSLB can:
Detect problems early - Before health checks fail
Drain traffic proactively - Zero user impact
Validate predictions - Overwatch confirms agent signals
Recover automatically - Traffic returns when issues resolve
This completes the OpenGSLB demo series. You’ve seen:
Basic DNS load balancing (Demo 1)
Agent-based architecture (Demo 2)
Latency-based routing (Demo 3)
Geographic routing (Demo 4)
Predictive health (Demo 5)