Capacity Planning Guide
This document provides guidance for sizing OpenGSLB deployments based on workload requirements.
Overview
OpenGSLB capacity planning involves sizing:
Overwatches: DNS query throughput, backend count
Agents: Number of backends monitored
Network: Gossip traffic, DNS traffic
Storage: KV store, DNSSEC keys, logs
Sizing Guidelines
Overwatch Sizing
Resource Requirements by Scale
Scale |
Backends |
DNS QPS |
CPU |
Memory |
Disk |
|---|---|---|---|---|---|
Small |
< 50 |
< 1,000 |
2 cores |
512 MB |
1 GB |
Medium |
50-200 |
1,000-10,000 |
4 cores |
1 GB |
5 GB |
Large |
200-1,000 |
10,000-50,000 |
8 cores |
2 GB |
10 GB |
XLarge |
1,000+ |
50,000+ |
16 cores |
4 GB |
20 GB |
Factors Affecting Overwatch Resources
Factor |
Impact |
Mitigation |
|---|---|---|
DNS query rate |
CPU, network |
Horizontal scaling (more Overwatches) |
Number of backends |
Memory, gossip traffic |
Increase memory |
DNSSEC enabled |
CPU for signing |
Consider faster algorithm or more CPU |
Validation enabled |
CPU, network |
Adjust validation interval |
GeoIP lookups |
CPU, memory |
GeoIP database ~5MB in memory |
Agent Sizing
Resource Requirements
Backends per Agent |
CPU |
Memory |
Notes |
|---|---|---|---|
1-5 |
0.1 core |
32 MB |
Typical deployment |
5-20 |
0.25 core |
64 MB |
Multi-service host |
20-50 |
0.5 core |
128 MB |
High-density |
Factors Affecting Agent Resources
Factor |
Impact |
Mitigation |
|---|---|---|
Number of backends |
CPU, memory |
Linear scaling |
Health check frequency |
CPU, network |
Adjust interval |
Predictive health |
CPU |
Can be disabled |
Gossip traffic |
Network |
Minimal impact |
Network Requirements
Gossip Traffic
Gossip traffic is lightweight:
Component |
Traffic Pattern |
Estimated Bandwidth |
|---|---|---|
Heartbeat |
Every 10s per backend |
~100 bytes/heartbeat |
Health update |
On status change |
~200 bytes/update |
Probe |
Periodic |
~50 bytes/probe |
Estimate per agent: ~1 KB/minute typical, ~5 KB/minute during changes
DNS Traffic
Query Size |
Response Size |
Typical |
|---|---|---|
A record |
40-60 bytes |
100-200 bytes |
With DNSSEC |
40-60 bytes |
500-1000 bytes |
Bandwidth calculation: QPS × avg_response_size × 8
Example: 10,000 QPS × 200 bytes × 8 = 16 Mbps
Storage Requirements
Overwatch Storage
Component |
Size |
Growth |
|---|---|---|
Binary |
~15 MB |
Per version |
Configuration |
~10 KB |
Stable |
GeoIP database |
~5 MB |
Monthly updates |
KV store (bbolt) |
~1 MB base + 1 KB/backend |
Linear with backends |
DNSSEC keys |
~10 KB |
Per zone |
Logs |
Variable |
Configure rotation |
Agent Storage
Component |
Size |
Growth |
|---|---|---|
Binary |
~15 MB |
Per version |
Configuration |
~5 KB |
Stable |
Certificate |
~2 KB |
Stable |
Logs |
Variable |
Configure rotation |
Scaling Strategies
Vertical Scaling (Scale Up)
Increase resources on existing nodes:
Scenario |
Solution |
|---|---|
High CPU usage |
Add cores, optimize validation interval |
High memory usage |
Add RAM, reduce log retention |
High disk I/O |
Use SSD, reduce logging |
Horizontal Scaling (Scale Out)
Add more nodes:
Scenario |
Solution |
|---|---|
DNS capacity limit |
Add more Overwatches |
Geographic distribution |
Deploy Overwatches per region |
Agent count limit |
More Overwatches (gossip load sharing) |
Scaling Formulas
Overwatches Needed
min_overwatches = ceil(peak_qps / qps_per_overwatch)
recommended_overwatches = min_overwatches + 1 # For redundancy
Where qps_per_overwatch ≈ 50,000 for typical hardware.
Memory Estimation
base_memory = 256 MB
per_backend_memory = 2 KB
geoip_memory = 5 MB
buffer = 1.5x
total_memory = (base_memory + (backends × per_backend_memory) + geoip_memory) × buffer
Deployment Configurations
Small Deployment (Startup/Dev)
Overwatches: 1-2
Agents: 1-10
Hardware per Overwatch:
- 2 vCPU
- 512 MB RAM
- 10 GB disk
- Shared/burstable OK
Medium Deployment (SMB/Production)
Overwatches: 3
Agents: 10-50
Hardware per Overwatch:
- 4 vCPU
- 1 GB RAM
- 20 GB SSD
- Dedicated instances
Large Deployment (Enterprise)
Overwatches: 5-10 (geo-distributed)
Agents: 50-500
Hardware per Overwatch:
- 8 vCPU
- 2 GB RAM
- 50 GB SSD
- Dedicated instances
- 10 Gbps networking
Cloud Instance Recommendations
AWS
Scale |
Instance Type |
Notes |
|---|---|---|
Small |
t3.small |
Burstable OK for dev |
Medium |
m5.large |
Production baseline |
Large |
m5.xlarge |
High throughput |
XLarge |
c5.2xlarge |
CPU optimized |
GCP
Scale |
Machine Type |
Notes |
|---|---|---|
Small |
e2-small |
Cost-effective |
Medium |
n2-standard-2 |
Balanced |
Large |
n2-standard-4 |
Production |
XLarge |
c2-standard-8 |
Compute optimized |
Azure
Scale |
VM Size |
Notes |
|---|---|---|
Small |
B2s |
Burstable |
Medium |
D2s_v3 |
General purpose |
Large |
D4s_v3 |
Production |
XLarge |
F8s_v2 |
Compute optimized |
Capacity Monitoring
Key Metrics to Watch
# CPU utilization
rate(process_cpu_seconds_total{job="opengslb"}[5m])
# Memory usage
process_resident_memory_bytes{job="opengslb"}
# DNS query rate
rate(opengslb_dns_queries_total[5m])
# Backend count
opengslb_overwatch_backends_total
# Query latency
histogram_quantile(0.99, rate(opengslb_dns_query_duration_seconds_bucket[5m]))
Capacity Alerts
groups:
- name: opengslb-capacity
rules:
- alert: HighCPUUsage
expr: rate(process_cpu_seconds_total{job="opengslb"}[5m]) > 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Overwatch CPU usage above 80%"
- alert: HighMemoryUsage
expr: process_resident_memory_bytes{job="opengslb"} / process_virtual_memory_max_bytes > 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Overwatch memory usage above 80%"
- alert: HighQueryRate
expr: rate(opengslb_dns_queries_total[5m]) > 40000
for: 5m
labels:
severity: warning
annotations:
summary: "Approaching query rate capacity"
- alert: HighQueryLatency
expr: histogram_quantile(0.99, rate(opengslb_dns_query_duration_seconds_bucket[5m])) > 0.01
for: 5m
labels:
severity: warning
annotations:
summary: "P99 query latency above 10ms"
Growth Planning
Capacity Planning Process
Baseline current usage
Measure current QPS, backend count, resource usage
Estimate growth
Project based on business growth
Typical: 20-50% annual growth
Plan headroom
Maintain 30-50% headroom for spikes
Plan capacity 6-12 months ahead
Review quarterly
Adjust based on actual growth
Update projections
Example Capacity Plan
Current (Q1 2025):
- DNS QPS: 5,000
- Backends: 50
- Overwatches: 3
Projected (Q4 2025, 50% growth):
- DNS QPS: 7,500
- Backends: 75
- Overwatches: 3 (sufficient)
Projected (Q4 2026, 100% total):
- DNS QPS: 10,000
- Backends: 100
- Overwatches: 3 (may need 4th for headroom)
Action items:
- Q3 2025: Upgrade Overwatch instances to next tier
- Q2 2026: Deploy 4th Overwatch for geographic expansion