# Demo 5: Predictive Health Detection

> "We knew it was going to fail before it did."

This demo showcases OpenGSLB's core differentiator: **predictive health monitoring** that detects problems *before* they impact users.

## What You'll Learn

- Predictive health monitoring configuration
- CPU, memory, and error rate thresholds
- Chaos engineering with injectable failures
- Grafana dashboards for visualization
- Proactive vs reactive health detection

## The Problem with Traditional GSLB

```
Traditional GSLB:
  App crashes → Health check fails → DNS updated → Users see errors (30-60s)

OpenGSLB:
  CPU spikes → Agent predicts failure → Traffic drains → App crashes → Zero user impact
```

OpenGSLB is **predictive from the inside** (agents know trouble is coming) while remaining **reactive from the outside** (overwatch validates and can veto).

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                           DEMO ENVIRONMENT                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                        DNS CLIENT                                │   │
│   │                     (dig / client container)                     │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                    │                                     │
│                                    ▼                                     │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                        OVERWATCH                                 │   │
│   │                      (10.50.0.10:53)                             │   │
│   │                                                                  │   │
│   │  • Receives agent gossip (including predictive signals)         │   │
│   │  • Performs external health validation                          │   │
│   │  • Serves authoritative DNS                                     │   │
│   │  • Exposes API on :8080, metrics on :9090                       │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                    ▲                                     │
│                                    │ Gossip (AES-256 encrypted)          │
│                                    │                                     │
│   ┌────────────────────┬───────────┴───────────┬────────────────────┐   │
│   │                    │                       │                    │   │
│   ▼                    ▼                       ▼                    │   │
│ ┌──────────────┐  ┌──────────────┐       ┌──────────────┐          │   │
│ │  BACKEND-1   │  │  BACKEND-2   │       │  BACKEND-3   │          │   │
│ │   (stable)   │  │   (stable)   │       │   (CHAOS)    │          │   │
│ │              │  │              │       │              │          │   │
│ │ ┌──────────┐ │  │ ┌──────────┐ │       │ ┌──────────┐ │          │   │
│ │ │  Agent   │ │  │ │  Agent   │ │       │ │  Agent   │ │          │   │
│ │ └──────────┘ │  │ └──────────┘ │       │ └──────────┘ │          │   │
│ │ ┌──────────┐ │  │ ┌──────────┐ │       │ ┌──────────┐ │          │   │
│ │ │ Demo App │ │  │ │ Demo App │ │       │ │ Demo App │ │◄── Chaos │   │
│ │ │  :8080   │ │  │ │  :8080   │ │       │ │  :8080   │ │    Target│   │
│ │ └──────────┘ │  │ └──────────┘ │       │ └──────────┘ │          │   │
│ └──────────────┘  └──────────────┘       └──────────────┘          │   │
│   10.50.0.21        10.50.0.22              10.50.0.23              │   │
│                                                                          │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                      PROMETHEUS + GRAFANA                        │   │
│   │                    (Metrics visualization)                       │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
```

## Quick Start

### Prerequisites

- Docker and Docker Compose
- Go 1.22+ (for building OpenGSLB)
- `dig` command (for DNS queries)
- `curl` and `jq` (for API calls)

### Start the Demo

```bash
cd demos/demo-5-predictive-health

# This builds and starts everything
./scripts/start-demo.sh
```

### Access Points

| Service | URL/Command |
|---------|-------------|
| Grafana Dashboard | http://localhost:3000 |
| Prometheus | http://localhost:9092 |
| Overwatch API | http://localhost:8080/api/v1/health/servers |
| DNS Query | `dig @localhost -p 5354 app.demo.local +short` |
| Client SSH | `ssh root@localhost -p 2222` (password: demo) |
| Chaos API | http://localhost:8083/chaos/status |

### Run the Guided Demo

```bash
ssh root@localhost -p 2222
# Password: demo

# Inside the container:
./demo.sh
```

## Demo Script

### Act 1: Baseline

Verify all backends are healthy:

```bash
# Check health via API
curl -s http://localhost:8080/api/v1/health/servers | jq '.servers[] | {address, healthy}'

# DNS returns all 3 backends
dig @localhost -p 5354 app.demo.local +short
```

### Act 2: Trigger CPU Spike

Inject chaos on backend-3:

```bash
# From host machine
curl -X POST "http://localhost:8083/chaos/cpu?intensity=85&duration=60s"

# Or from client container
./chaos.sh cpu 60s 85
```

Watch the Grafana dashboard. You should see:
- CPU spike on backend-3
- Predictive signal triggers
- Backend begins draining

### Act 3: Traffic Shifts

DNS automatically excludes backend-3:

```bash
# DNS now returns only backend-1 and backend-2
dig @localhost -p 5354 app.demo.local +short

# But health check STILL PASSES!
curl -s http://localhost:8083/health
# Returns: {"status":"healthy"...}

# Overwatch shows backend-3 as draining
curl -s http://localhost:8080/api/v1/health/servers | jq '.servers[] | select(.address | contains("10.50.0.23"))'
```

:::{important}
The backend's health check is still passing, but we're proactively draining because the agent predicted trouble. This is the key insight of predictive health.
:::
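
To put a number on how quickly the drain happens, a small polling loop can replace repeated manual `dig` calls. This is a minimal sketch, not part of the demo scripts: `has_address` and `wait_for_drain` are names invented here, and the elapsed time is only approximate because it counts one-second sleeps.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for timing the drain; not shipped with the demo.

# has_address IP: succeeds if IP appears as a whole line on stdin.
has_address() {
  grep -qxF "$1"
}

# wait_for_drain IP TIMEOUT: polls DNS about once per second until IP is no
# longer among the answers, or gives up after roughly TIMEOUT seconds.
wait_for_drain() {
  local ip=$1 timeout=$2 elapsed=0 answers
  while [ "$elapsed" -lt "$timeout" ]; do
    answers=$(dig @localhost -p 5354 app.demo.local +short) || answers=""
    # Require at least one answer so a failed query is not mistaken for a drain.
    if [ -n "$answers" ] && ! printf '%s\n' "$answers" | has_address "$ip"; then
      echo "drained after ~${elapsed}s"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "still in DNS after ${timeout}s" >&2
  return 1
}

# Usage, right after triggering the CPU spike:
#   wait_for_drain 10.50.0.23 60
```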

### Act 4: Overwatch Validates

Trigger actual errors to validate the prediction:

```bash
curl -X POST "http://localhost:8083/chaos/errors?rate=100&duration=30s"
```

Now Overwatch's external validation confirms the failure. The agent's prediction was correct.

### Act 5: Recovery

Stop all chaos:

```bash
curl -X POST "http://localhost:8083/chaos/stop"

# Wait 15 seconds for recovery
sleep 15

# All 3 backends should be back
dig @localhost -p 5354 app.demo.local +short
```
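
A fixed `sleep` can be too short or needlessly long depending on the machine. As an alternative, the sketch below polls until a condition holds. Both `retry_until` and `all_back` are hypothetical helpers written for this illustration, and `all_back` assumes that `dig +short` prints exactly one line per backend.

```bash
#!/usr/bin/env bash
# retry_until TIMEOUT CMD [ARGS...]: runs CMD about once per second until it
# succeeds; fails after roughly TIMEOUT seconds. Hypothetical helper.
retry_until() {
  local timeout=$1 elapsed=0
  shift
  until "$@"; do
    elapsed=$((elapsed + 1))
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
  done
}

# all_back: succeeds when DNS returns exactly three non-empty lines again.
all_back() {
  [ "$(dig @localhost -p 5354 app.demo.local +short | grep -c .)" -eq 3 ]
}

# Usage, after stopping chaos:
#   retry_until 60 all_back && echo "recovered" || echo "not recovered in 60s"
```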

## Chaos Injection API

The demo app on backend-3 (exposed at http://localhost:8083) provides chaos injection endpoints:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/chaos/cpu` | POST | Trigger CPU spike |
| `/chaos/memory` | POST | Trigger memory pressure |
| `/chaos/errors` | POST | Inject HTTP 500 errors |
| `/chaos/latency` | POST | Add response latency |
| `/chaos/stop` | POST | Stop all chaos |
| `/chaos/status` | GET | View current chaos state |

### Parameters

**CPU Spike**
```bash
curl -X POST "http://localhost:8083/chaos/cpu?duration=60s&intensity=85"
# intensity: 1-100 (% CPU to consume)
# duration: how long to run
```

**Memory Pressure**
```bash
curl -X POST "http://localhost:8083/chaos/memory?duration=60s&amount=500"
# amount: MB to allocate
# duration: how long to hold
```

**Error Injection**
```bash
curl -X POST "http://localhost:8083/chaos/errors?duration=60s&rate=50"
# rate: 1-100 (% of /health requests that return 500)
# duration: how long to inject errors
```

**Latency Injection**
```bash
curl -X POST "http://localhost:8083/chaos/latency?duration=60s&latency=500"
# latency: milliseconds to add to all requests
# duration: how long to inject latency
```
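
When scripting several chaos runs, it can help to build the URLs in one place and reject out-of-range values before they reach the API. The wrapper below is a sketch under assumptions: `chaos_url` and `CHAOS_BASE` are names made up here, the 1-100 range is enforced only for the two percentage parameters documented above, and the non-negative-integer check on `amount` and `latency` is a guess at what the endpoints accept.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the chaos endpoints listed above.
CHAOS_BASE=http://localhost:8083

# chaos_url KIND DURATION VALUE: prints the request URL for one chaos type,
# or returns non-zero if the type is unknown or the value is invalid.
chaos_url() {
  local kind=$1 duration=$2 value=$3 param
  case $kind in
    cpu)     param=intensity ;;
    memory)  param=amount ;;
    errors)  param=rate ;;
    latency) param=latency ;;
    *) echo "unknown chaos type: $kind" >&2; return 1 ;;
  esac
  # Reject empty or non-numeric values.
  case $value in
    ''|*[!0-9]*) echo "$param must be a non-negative integer" >&2; return 1 ;;
  esac
  # intensity and rate are percentages in the range 1-100.
  if [ "$param" = intensity ] || [ "$param" = rate ]; then
    if [ "$value" -lt 1 ] || [ "$value" -gt 100 ]; then
      echo "$param must be between 1 and 100" >&2
      return 1
    fi
  fi
  echo "$CHAOS_BASE/chaos/$kind?duration=$duration&$param=$value"
}

# Usage:
#   url=$(chaos_url cpu 60s 85) && curl -X POST "$url"
```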

## Key Talking Points

### "Why is this better than traditional health checks?"

Traditional GSLB waits for failure. If it takes three failed health checks at 10-second intervals to mark a backend down, users hit a degraded server for about 30 seconds. OpenGSLB's agent sees the warning signs (CPU spiking, memory filling, error rates climbing) and starts draining traffic before the crash. Zero user impact.
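
The arithmetic behind that 30-second figure can be written out as a back-of-the-envelope sketch. The `ttl` term is an assumption added here, not something the demo specifies: it stands for resolvers that keep serving a cached answer after the record changes. A 30-second TTL is one way to arrive at the upper end of the 30-60s range in the timeline above.

```bash
#!/usr/bin/env bash
# exposure_seconds FAILURES INTERVAL TTL: rough upper bound, in seconds, on
# how long clients may still reach a failed backend under reactive checks.
# TTL (DNS cache lifetime) is an assumption added for this sketch.
exposure_seconds() {
  local failures=$1 interval=$2 ttl=$3
  echo $(( failures * interval + ttl ))
}

exposure_seconds 3 10 0    # detection only: prints 30
exposure_seconds 3 10 30   # plus a hypothetical 30 s record TTL: prints 60
```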

### "Can't the agent lie?"

Great question. The agent can claim anything, but Overwatch always validates externally. If the agent says "healthy" but Overwatch's check sees 500 errors, Overwatch wins. If the agent cries wolf with bleed signals, Overwatch's external check can override. Trust but verify.
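
The precedence described here can be written as a small decision function. This is an illustrative model only, not OpenGSLB's actual logic: the function name, the state names, and the `veto` switch are all assumptions. With the veto off, it matches what Act 3 shows, where a bleed signal drains the backend even though the external check still passes.

```bash
#!/usr/bin/env bash
# effective_state AGENT EXTERNAL [VETO]  (hypothetical model, not real code)
#   AGENT:    healthy | bleed   what the agent reports
#   EXTERNAL: pass | fail       result of Overwatch's own check
#   VETO:     yes | no          assumed policy switch, default no
effective_state() {
  local agent=$1 external=$2 veto=${3:-no}
  if [ "$external" = fail ]; then
    echo unhealthy    # external evidence wins over any agent claim
  elif [ "$agent" = bleed ] && [ "$veto" = no ]; then
    echo draining     # prediction honored while external checks still pass
  else
    echo healthy      # agent reports healthy, or its bleed signal was vetoed
  fi
}

# Usage:
#   effective_state healthy fail      # agent says fine, Overwatch sees errors
#   effective_state bleed pass        # the Act 3 situation
#   effective_state bleed pass yes    # Overwatch overrides a false alarm
```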

### "What if Overwatch goes down?"

DNS clients have built-in redundancy. List multiple Overwatches in `resolv.conf` and clients retry the next one automatically. No VRRP, no Raft, no complexity. DNS has been doing failover for 40 years.
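
As a sketch of what that client configuration might look like, the helper below prints a `resolv.conf` body with one `nameserver` line per Overwatch. `make_resolv_conf` is a name invented here, and the second address in the usage line is made up, since this demo runs a single Overwatch. The `timeout` and `attempts` values are illustrative choices for quick failover, not recommendations from OpenGSLB. Note that the glibc resolver reads at most three `nameserver` entries.

```bash
#!/usr/bin/env bash
# make_resolv_conf IP...: prints a resolv.conf body listing each Overwatch.
# Hypothetical helper; review the output before installing it anywhere.
make_resolv_conf() {
  local ip
  for ip in "$@"; do
    echo "nameserver $ip"
  done
  # Wait 1 s per server before moving on; make 2 passes through the list.
  echo "options timeout:1 attempts:2"
}

# Usage (10.50.0.11 is a made-up second Overwatch):
#   make_resolv_conf 10.50.0.10 10.50.0.11
```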

## Predictive Health Configuration

The agent configuration enables predictive monitoring:

```yaml
agent:
  predictive:
    enabled: true
    check_interval: 5s
    cpu:
      threshold: 80        # Bleed when CPU exceeds 80%
      bleed_duration: 30s
    memory:
      threshold: 85        # Bleed when memory exceeds 85%
      bleed_duration: 30s
    error_rate:
      threshold: 5         # Bleed when error rate exceeds 5/min
      window: 60s
      bleed_duration: 30s
```
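
To make the threshold semantics concrete, the sketch below evaluates one sample against the three limits from this config. It rests on assumptions about the agent's behavior: that "exceeds" means strictly greater than, and that any single exceeded threshold is enough to start bleeding. It does not model `check_interval`, `window`, or `bleed_duration`, and `bleed_reasons` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Thresholds copied from the config above.
CPU_THRESHOLD=80   # percent
MEM_THRESHOLD=85   # percent
ERR_THRESHOLD=5    # errors per window

# bleed_reasons CPU MEM ERRORS: prints one line per exceeded threshold and
# nothing if the sample is within limits. Inputs are whole numbers.
bleed_reasons() {
  local cpu=$1 mem=$2 errs=$3
  if [ "$cpu" -gt "$CPU_THRESHOLD" ]; then
    echo "cpu ${cpu}% > ${CPU_THRESHOLD}%"
  fi
  if [ "$mem" -gt "$MEM_THRESHOLD" ]; then
    echo "memory ${mem}% > ${MEM_THRESHOLD}%"
  fi
  if [ "$errs" -gt "$ERR_THRESHOLD" ]; then
    echo "errors ${errs} > ${ERR_THRESHOLD}"
  fi
  return 0
}

# Usage: the Act 2 spike (85% CPU) crosses the CPU threshold only.
#   bleed_reasons 85 40 0
```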

## Troubleshooting

### DNS queries time out

Make sure Overwatch is running:
```bash
docker logs overwatch
```

### Backends not registering

Check agent logs:
```bash
docker logs backend-3
```

### Chaos not working

Verify backend-3 is accessible:
```bash
curl http://localhost:8083/chaos/status
```

## Cleanup

```bash
./scripts/cleanup.sh
```

This stops and removes all containers.

## Summary

Demo 5 showcases OpenGSLB's most powerful feature: **predictive health detection**. By running agents alongside your applications, OpenGSLB can:

1. **Detect problems early** - Before health checks fail
2. **Drain traffic proactively** - Zero user impact
3. **Validate predictions** - Overwatch confirms agent signals
4. **Recover automatically** - Traffic returns when issues resolve

This completes the OpenGSLB demo series. You've seen:
- Basic DNS load balancing (Demo 1)
- Agent-based architecture (Demo 2)
- Latency-based routing (Demo 3)
- Geographic routing (Demo 4)
- Predictive health (Demo 5)