# Capacity Planning Guide

This document provides guidance for sizing OpenGSLB deployments based on workload requirements.

## Overview

OpenGSLB capacity planning involves sizing:

1. **Overwatches**: DNS query throughput, backend count
2. **Agents**: Number of backends monitored
3. **Network**: Gossip traffic, DNS traffic
4. **Storage**: KV store, DNSSEC keys, logs

## Sizing Guidelines

### Overwatch Sizing

#### Resource Requirements by Scale

| Scale | Backends | DNS QPS | CPU | Memory | Disk |
|-------|----------|---------|-----|--------|------|
| Small | < 50 | < 1,000 | 2 cores | 512 MB | 1 GB |
| Medium | 50-200 | 1,000-10,000 | 4 cores | 1 GB | 5 GB |
| Large | 200-1,000 | 10,000-50,000 | 8 cores | 2 GB | 10 GB |
| XLarge | 1,000+ | 50,000+ | 16 cores | 4 GB | 20 GB |

#### Factors Affecting Overwatch Resources

| Factor | Impact | Mitigation |
|--------|--------|------------|
| DNS query rate | CPU, network | Horizontal scaling (more Overwatches) |
| Number of backends | Memory, gossip traffic | Increase memory |
| DNSSEC enabled | CPU for signing | Consider faster algorithm or more CPU |
| Validation enabled | CPU, network | Adjust validation interval |
| GeoIP lookups | CPU, memory | GeoIP database ~5MB in memory |

### Agent Sizing

#### Resource Requirements

| Backends per Agent | CPU | Memory | Notes |
|--------------------|-----|--------|-------|
| 1-5 | 0.1 core | 32 MB | Typical deployment |
| 5-20 | 0.25 core | 64 MB | Multi-service host |
| 20-50 | 0.5 core | 128 MB | High-density |

#### Factors Affecting Agent Resources

| Factor | Impact | Mitigation |
|--------|--------|------------|
| Number of backends | CPU, memory | Linear scaling |
| Health check frequency | CPU, network | Adjust interval |
| Predictive health | CPU | Can be disabled |
| Gossip traffic | Network | Minimal impact |

### Network Requirements

#### Gossip Traffic

Gossip traffic is lightweight:

| Component | Traffic Pattern | Estimated Bandwidth |
|-----------|-----------------|---------------------|
| Heartbeat | Every 10s per backend | ~100 bytes/heartbeat |
| Health update | On status change | ~200 bytes/update |
| Probe | Periodic | ~50 bytes/probe |

**Estimate per agent**: ~1 KB/minute typical, ~5 KB/minute during changes

#### DNS Traffic

| Query Size | Response Size | Typical |
|------------|---------------|---------|
| A record | 40-60 bytes | 100-200 bytes |
| With DNSSEC | 40-60 bytes | 500-1000 bytes |

**Bandwidth calculation**: `QPS × avg_response_size × 8`

Example: 10,000 QPS × 200 bytes × 8 = 16 Mbps

### Storage Requirements

#### Overwatch Storage

| Component | Size | Growth |
|-----------|------|--------|
| Binary | ~15 MB | Per version |
| Configuration | ~10 KB | Stable |
| GeoIP database | ~5 MB | Monthly updates |
| KV store (bbolt) | ~1 MB base + 1 KB/backend | Linear with backends |
| DNSSEC keys | ~10 KB | Per zone |
| Logs | Variable | Configure rotation |

#### Agent Storage

| Component | Size | Growth |
|-----------|------|--------|
| Binary | ~15 MB | Per version |
| Configuration | ~5 KB | Stable |
| Certificate | ~2 KB | Stable |
| Logs | Variable | Configure rotation |

## Scaling Strategies

### Vertical Scaling (Scale Up)

Increase resources on existing nodes:

| Scenario | Solution |
|----------|----------|
| High CPU usage | Add cores, optimize validation interval |
| High memory usage | Add RAM, reduce log retention |
| High disk I/O | Use SSD, reduce logging |

### Horizontal Scaling (Scale Out)

Add more nodes:

| Scenario | Solution |
|----------|----------|
| DNS capacity limit | Add more Overwatches |
| Geographic distribution | Deploy Overwatches per region |
| Agent count limit | More Overwatches (gossip load sharing) |

### Scaling Formulas

#### Overwatches Needed

```
min_overwatches = ceil(peak_qps / qps_per_overwatch)
recommended_overwatches = min_overwatches + 1  # For redundancy
```

Where `qps_per_overwatch` ≈ 50,000 for typical hardware.

#### Memory Estimation

```
base_memory = 256 MB
per_backend_memory = 2 KB
geoip_memory = 5 MB
buffer = 1.5x

total_memory = (base_memory + (backends × per_backend_memory) + geoip_memory) × buffer
```

## Deployment Configurations

### Small Deployment (Startup/Dev)

```
Overwatches: 1-2
Agents: 1-10
Hardware per Overwatch:
  - 2 vCPU
  - 512 MB RAM
  - 10 GB disk
  - Shared/burstable OK
```

### Medium Deployment (SMB/Production)

```
Overwatches: 3
Agents: 10-50
Hardware per Overwatch:
  - 4 vCPU
  - 1 GB RAM
  - 20 GB SSD
  - Dedicated instances
```

### Large Deployment (Enterprise)

```
Overwatches: 5-10 (geo-distributed)
Agents: 50-500
Hardware per Overwatch:
  - 8 vCPU
  - 2 GB RAM
  - 50 GB SSD
  - Dedicated instances
  - 10 Gbps networking
```

### Cloud Instance Recommendations

#### AWS

| Scale | Instance Type | Notes |
|-------|---------------|-------|
| Small | t3.small | Burstable OK for dev |
| Medium | m5.large | Production baseline |
| Large | m5.xlarge | High throughput |
| XLarge | c5.2xlarge | CPU optimized |

#### GCP

| Scale | Machine Type | Notes |
|-------|--------------|-------|
| Small | e2-small | Cost-effective |
| Medium | n2-standard-2 | Balanced |
| Large | n2-standard-4 | Production |
| XLarge | c2-standard-8 | Compute optimized |

#### Azure

| Scale | VM Size | Notes |
|-------|---------|-------|
| Small | B2s | Burstable |
| Medium | D2s_v3 | General purpose |
| Large | D4s_v3 | Production |
| XLarge | F8s_v2 | Compute optimized |

## Capacity Monitoring

### Key Metrics to Watch

```promql
# CPU utilization
rate(process_cpu_seconds_total{job="opengslb"}[5m])

# Memory usage
process_resident_memory_bytes{job="opengslb"}

# DNS query rate
rate(opengslb_dns_queries_total[5m])

# Backend count
opengslb_overwatch_backends_total

# Query latency
histogram_quantile(0.99, rate(opengslb_dns_query_duration_seconds_bucket[5m]))
```

### Capacity Alerts

```yaml
groups:
  - name: opengslb-capacity
    rules:
      - alert: HighCPUUsage
        expr: rate(process_cpu_seconds_total{job="opengslb"}[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Overwatch CPU usage above 80%"

      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes{job="opengslb"} / process_virtual_memory_max_bytes > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Overwatch memory usage above 80%"

      - alert: HighQueryRate
        expr: rate(opengslb_dns_queries_total[5m]) > 40000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Approaching query rate capacity"

      - alert: HighQueryLatency
        expr: histogram_quantile(0.99, rate(opengslb_dns_query_duration_seconds_bucket[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 query latency above 10ms"
```

## Growth Planning

### Capacity Planning Process

1. **Baseline current usage**
   - Measure current QPS, backend count, resource usage

2. **Estimate growth**
   - Project based on business growth
   - Typical: 20-50% annual growth

3. **Plan headroom**
   - Maintain 30-50% headroom for spikes
   - Plan capacity 6-12 months ahead

4. **Review quarterly**
   - Adjust based on actual growth
   - Update projections

### Example Capacity Plan

```
Current (Q1 2025):
  - DNS QPS: 5,000
  - Backends: 50
  - Overwatches: 3

Projected (Q4 2025, 50% growth):
  - DNS QPS: 7,500
  - Backends: 75
  - Overwatches: 3 (sufficient)

Projected (Q4 2026, 100% total):
  - DNS QPS: 10,000
  - Backends: 100
  - Overwatches: 3 (may need 4th for headroom)

Action items:
  - Q3 2025: Upgrade Overwatch instances to next tier
  - Q2 2026: Deploy 4th Overwatch for geographic expansion
```

## Related Documentation

- [Benchmarks](./benchmarks.md)
- [HA Setup Guide](../deployment/ha-setup.md)
- [Overwatch Deployment](../deployment/overwatch.md)

---

**Document Version**: 1.0
**Last Updated**: December 2025