Capacity Planning Guide

This document provides guidance for sizing OpenGSLB deployments based on workload requirements.

Overview

OpenGSLB capacity planning involves sizing:

Overwatches: DNS query throughput, backend count
Agents: Number of backends monitored
Network: Gossip traffic, DNS traffic
Storage: KV store, DNSSEC keys, logs

Sizing Guidelines

Overwatch Sizing

Resource Requirements by Scale

Scale	Backends	DNS QPS	CPU	Memory	Disk
Small	< 50	< 1,000	2 cores	512 MB	1 GB
Medium	50-200	1,000-10,000	4 cores	1 GB	5 GB
Large	200-1,000	10,000-50,000	8 cores	2 GB	10 GB
XLarge	1,000+	50,000+	16 cores	4 GB	20 GB

Factors Affecting Overwatch Resources

Factor	Impact	Mitigation
DNS query rate	CPU, network	Horizontal scaling (more Overwatches)
Number of backends	Memory, gossip traffic	Increase memory
DNSSEC enabled	CPU for signing	Consider faster algorithm or more CPU
Validation enabled	CPU, network	Adjust validation interval
GeoIP lookups	CPU, memory	GeoIP database ~5MB in memory

Agent Sizing

Resource Requirements

Backends per Agent	CPU	Memory	Notes
1-5	0.1 core	32 MB	Typical deployment
5-20	0.25 core	64 MB	Multi-service host
20-50	0.5 core	128 MB	High-density

Factors Affecting Agent Resources

Factor	Impact	Mitigation
Number of backends	CPU, memory	Linear scaling
Health check frequency	CPU, network	Adjust interval
Predictive health	CPU	Can be disabled
Gossip traffic	Network	Minimal impact

Network Requirements

Gossip Traffic

Gossip traffic is lightweight:

Component	Traffic Pattern	Estimated Bandwidth
Heartbeat	Every 10s per backend	~100 bytes/heartbeat
Health update	On status change	~200 bytes/update
Probe	Periodic	~50 bytes/probe

Estimate per agent: ~1 KB/minute typical, ~5 KB/minute during changes

DNS Traffic

Query Size	Response Size	Typical
A record	40-60 bytes	100-200 bytes
With DNSSEC	40-60 bytes	500-1000 bytes

Bandwidth calculation: QPS × avg_response_size × 8

Example: 10,000 QPS × 200 bytes × 8 = 16 Mbps

Storage Requirements

Overwatch Storage

Component	Size	Growth
Binary	~15 MB	Per version
Configuration	~10 KB	Stable
GeoIP database	~5 MB	Monthly updates
KV store (bbolt)	~1 MB base + 1 KB/backend	Linear with backends
DNSSEC keys	~10 KB	Per zone
Logs	Variable	Configure rotation

Agent Storage

Component	Size	Growth
Binary	~15 MB	Per version
Configuration	~5 KB	Stable
Certificate	~2 KB	Stable
Logs	Variable	Configure rotation

Scaling Strategies

Vertical Scaling (Scale Up)

Increase resources on existing nodes:

Scenario	Solution
High CPU usage	Add cores, optimize validation interval
High memory usage	Add RAM, reduce log retention
High disk I/O	Use SSD, reduce logging

Horizontal Scaling (Scale Out)

Add more nodes:

Scenario	Solution
DNS capacity limit	Add more Overwatches
Geographic distribution	Deploy Overwatches per region
Agent count limit	More Overwatches (gossip load sharing)

Scaling Formulas

Overwatches Needed

min_overwatches = ceil(peak_qps / qps_per_overwatch)
recommended_overwatches = min_overwatches + 1  # For redundancy

Where qps_per_overwatch ≈ 50,000 for typical hardware.

Memory Estimation

base_memory = 256 MB
per_backend_memory = 2 KB
geoip_memory = 5 MB
buffer = 1.5x

total_memory = (base_memory + (backends × per_backend_memory) + geoip_memory) × buffer

Deployment Configurations

Small Deployment (Startup/Dev)

Overwatches: 1-2
Agents: 1-10
Hardware per Overwatch:
  - 2 vCPU
  - 512 MB RAM
  - 10 GB disk
  - Shared/burstable OK

Medium Deployment (SMB/Production)

Overwatches: 3
Agents: 10-50
Hardware per Overwatch:
  - 4 vCPU
  - 1 GB RAM
  - 20 GB SSD
  - Dedicated instances

Large Deployment (Enterprise)

Overwatches: 5-10 (geo-distributed)
Agents: 50-500
Hardware per Overwatch:
  - 8 vCPU
  - 2 GB RAM
  - 50 GB SSD
  - Dedicated instances
  - 10 Gbps networking

Cloud Instance Recommendations

AWS

Scale	Instance Type	Notes
Small	t3.small	Burstable OK for dev
Medium	m5.large	Production baseline
Large	m5.xlarge	High throughput
XLarge	c5.2xlarge	CPU optimized

GCP

Scale	Machine Type	Notes
Small	e2-small	Cost-effective
Medium	n2-standard-2	Balanced
Large	n2-standard-4	Production
XLarge	c2-standard-8	Compute optimized

Azure

Scale	VM Size	Notes
Small	B2s	Burstable
Medium	D2s_v3	General purpose
Large	D4s_v3	Production
XLarge	F8s_v2	Compute optimized

Capacity Monitoring

Key Metrics to Watch

# CPU utilization
rate(process_cpu_seconds_total{job="opengslb"}[5m])

# Memory usage
process_resident_memory_bytes{job="opengslb"}

# DNS query rate
rate(opengslb_dns_queries_total[5m])

# Backend count
opengslb_overwatch_backends_total

# Query latency
histogram_quantile(0.99, rate(opengslb_dns_query_duration_seconds_bucket[5m]))

Capacity Alerts

groups:
  - name: opengslb-capacity
    rules:
      - alert: HighCPUUsage
        expr: rate(process_cpu_seconds_total{job="opengslb"}[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Overwatch CPU usage above 80%"

      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes{job="opengslb"} / process_virtual_memory_max_bytes > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Overwatch memory usage above 80%"

      - alert: HighQueryRate
        expr: rate(opengslb_dns_queries_total[5m]) > 40000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Approaching query rate capacity"

      - alert: HighQueryLatency
        expr: histogram_quantile(0.99, rate(opengslb_dns_query_duration_seconds_bucket[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 query latency above 10ms"

Growth Planning

Capacity Planning Process

Baseline current usage
- Measure current QPS, backend count, resource usage
Estimate growth
- Project based on business growth
- Typical: 20-50% annual growth
Plan headroom
- Maintain 30-50% headroom for spikes
- Plan capacity 6-12 months ahead
Review quarterly
- Adjust based on actual growth
- Update projections

Example Capacity Plan

Current (Q1 2025):
  - DNS QPS: 5,000
  - Backends: 50
  - Overwatches: 3

Projected (Q4 2025, 50% growth):
  - DNS QPS: 7,500
  - Backends: 75
  - Overwatches: 3 (sufficient)

Projected (Q4 2026, 100% total):
  - DNS QPS: 10,000
  - Backends: 100
  - Overwatches: 3 (may need 4th for headroom)

Action items:
  - Q3 2025: Upgrade Overwatch instances to next tier
  - Q2 2026: Deploy 4th Overwatch for geographic expansion