Gossip Protocol
OpenGSLB uses the gossip protocol for communication between Agents and Overwatch nodes. This document describes the gossip architecture, message types, and configuration.
Overview
The gossip protocol is built on hashicorp/memberlist, providing:
Fast event propagation: Health updates reach Overwatch within 500ms
Encrypted communication: Required AES-256 encryption for gossip traffic
Failure detection: SWIM-based protocol detects agent failures quickly
Heartbeat mechanism: Agents send periodic heartbeats to maintain registration
Architecture (ADR-015)
    ┌──────────────────────────────────────────────────────────┐
    │             Overwatch Nodes (DNS Authority)              │
    │                                                          │
    │      ┌──────────────┐              ┌──────────────┐      │
    │      │ Overwatch-1  │              │ Overwatch-2  │      │
    │      │              │              │              │      │
    │      │ ┌──────────┐ │              │ ┌──────────┐ │      │
    │      │ │  Gossip  │ │              │ │  Gossip  │ │      │
    │      │ │ Receiver │ │              │ │ Receiver │ │      │
    │      │ └────▲─────┘ │              │ └────▲─────┘ │      │
    │      └──────┼───────┘              └──────┼───────┘      │
    │             │                             │              │
    └─────────────┼─────────────────────────────┼──────────────┘
                  │                             │
                  │ Gossip Messages (Encrypted) │
       ┌──────────┴─────┬────────────────┬──────┴─────────┐
       ▼                ▼                ▼                ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Agent-1   │  │   Agent-2   │  │   Agent-3   │  │   Agent-4   │
│ (App Server)│  │ (App Server)│  │ (App Server)│  │ (App Server)│
│             │  │             │  │             │  │             │
│ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │
│ │ Health  │ │  │ │ Health  │ │  │ │ Health  │ │  │ │ Health  │ │
│ │ Monitor │ │  │ │ Monitor │ │  │ │ Monitor │ │  │ │ Monitor │ │
│ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │
└─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘
       │                │                │                │
       ▼                ▼                ▼                ▼
 [Backend Svc]    [Backend Svc]    [Backend Svc]    [Backend Svc]
Communication Flow
Agents run on application servers alongside backends
Agents monitor local backend health
Agents send heartbeat messages to Overwatch nodes via gossip
Overwatch receives heartbeats and maintains backend registry
Overwatch optionally validates health claims independently
There is no Agent-to-Agent communication. Each Agent connects directly to Overwatch nodes.
Message Types
Heartbeat (heartbeat)
Sent periodically by agents to register backends and report health status.
{
  "type": "heartbeat",
  "agent_id": "agent-abc123",
  "timestamp": "2025-04-08T10:30:00Z",
  "payload": {
    "agent_id": "agent-abc123",
    "region": "us-east-1",
    "fingerprint": "sha256:...",
    "backends": [
      {
        "service": "web-service",
        "address": "10.0.1.10",
        "port": 80,
        "weight": 100,
        "healthy": true,
        "latency_ms": 45,
        "last_check": "2025-04-08T10:29:55Z"
      }
    ]
  }
}
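As a rough illustration of the envelope shown above, the following sketch assembles a heartbeat as a Python dictionary and serializes it to JSON. The function name `build_heartbeat` is hypothetical, and JSON is used only because the example above is written that way; the actual wire encoding is not described here.

```python
import json
from datetime import datetime, timezone


def build_heartbeat(agent_id, region, fingerprint, backends, now=None):
    """Assemble a heartbeat envelope shaped like the documented example.

    Hypothetical helper: the field names come from the example message,
    everything else is an assumption made for illustration.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "type": "heartbeat",
        "agent_id": agent_id,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "payload": {
            # The example repeats agent_id inside the payload, so we do too.
            "agent_id": agent_id,
            "region": region,
            "fingerprint": fingerprint,
            "backends": backends,
        },
    }


backend = {
    "service": "web-service",
    "address": "10.0.1.10",
    "port": 80,
    "weight": 100,
    "healthy": True,
    "latency_ms": 45,
    "last_check": "2025-04-08T10:29:55Z",
}

message = build_heartbeat(
    "agent-abc123",
    "us-east-1",
    "sha256:...",
    [backend],
    now=datetime(2025, 4, 8, 10, 30, tzinfo=timezone.utc),
)
encoded = json.dumps(message)
```

Passing `now` explicitly keeps the sketch deterministic; an agent would normally let it default to the current UTC time.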
Predictive Signal (predictive)
Sent when an agent predicts an impending failure based on resource metrics.
{
  "type": "predictive",
  "agent_id": "agent-abc123",
  "timestamp": "2025-04-08T10:30:00Z",
  "payload": {
    "agent_id": "agent-abc123",
    "service": "web-service",
    "address": "10.0.1.10",
    "port": 80,
    "signal": "bleed",
    "reason": "cpu_high",
    "value": 92.5,
    "threshold": 90.0,
    "bleed_weight": 50
  }
}
Signal types:
bleed: Gradual degradation, reduce traffic weight
drain: Prepare for shutdown, stop accepting new traffic
recovered: Signal cleared, return to normal operation
Reason codes:
cpu_high: CPU utilization above threshold
memory_pressure: Memory usage above threshold
error_rate: Error rate above threshold
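One way an agent might decide when to emit a signal is sketched below. The rule in `next_signal` (send bleed once when a metric crosses its threshold, send recovered once when it drops back) is an assumption for illustration, not the documented agent logic; `next_signal` and `predictive_message` are hypothetical names, and only the signal types, reason codes and field names are taken from the text above.

```python
SIGNALS = {"bleed", "drain", "recovered"}
REASONS = {"cpu_high", "memory_pressure", "error_rate"}


def next_signal(value, threshold, active):
    """Return the signal to send next, or None if nothing changed.

    Assumed rule: 'bleed' when the metric first exceeds the threshold,
    'recovered' when it falls back while a signal is still active.
    """
    if value > threshold and not active:
        return "bleed"
    if value <= threshold and active:
        return "recovered"
    return None


def predictive_message(agent_id, service, address, port, signal, reason,
                       value, threshold, bleed_weight, timestamp):
    """Build a predictive message with the fields from the example above."""
    if signal not in SIGNALS:
        raise ValueError(f"unknown signal: {signal!r}")
    if reason not in REASONS:
        raise ValueError(f"unknown reason: {reason!r}")
    return {
        "type": "predictive",
        "agent_id": agent_id,
        "timestamp": timestamp,
        "payload": {
            "agent_id": agent_id,
            "service": service,
            "address": address,
            "port": port,
            "signal": signal,
            "reason": reason,
            "value": value,
            "threshold": threshold,
            "bleed_weight": bleed_weight,
        },
    }


signal = next_signal(value=92.5, threshold=90.0, active=False)
msg = predictive_message(
    "agent-abc123", "web-service", "10.0.1.10", 80,
    signal, "cpu_high", 92.5, 90.0, 50, "2025-04-08T10:30:00Z",
)
```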
Deregister (deregister)
Sent by an agent during graceful shutdown to remove its backends from the registry.
{
  "type": "deregister",
  "agent_id": "agent-abc123",
  "timestamp": "2025-04-08T10:30:00Z",
  "payload": {
    "agent_id": "agent-abc123",
    "reason": "shutdown"
  }
}
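To show how the three message types relate on the receiving side, here is a minimal sketch of a registry that dispatches on the `type` field. The `Registry` class and its storage layout are hypothetical and say nothing about how Overwatch is actually implemented; the messages are assumed to arrive as JSON text shaped like the examples above.

```python
import json


class Registry:
    """Toy in-memory registry keyed by agent_id (illustrative only)."""

    def __init__(self):
        self.backends = {}  # agent_id -> list of backend dicts
        self.signals = {}   # (agent_id, service, address, port) -> payload

    def handle(self, raw):
        msg = json.loads(raw)
        kind = msg["type"]
        payload = msg["payload"]
        agent_id = payload["agent_id"]

        if kind == "heartbeat":
            # A heartbeat both registers the backends and refreshes health.
            self.backends[agent_id] = payload["backends"]
        elif kind == "predictive":
            key = (agent_id, payload["service"],
                   payload["address"], payload["port"])
            if payload["signal"] == "recovered":
                self.signals.pop(key, None)
            else:
                self.signals[key] = payload
        elif kind == "deregister":
            # Graceful shutdown: forget the agent and any pending signals.
            self.backends.pop(agent_id, None)
            self.signals = {k: v for k, v in self.signals.items()
                            if k[0] != agent_id}
        else:
            raise ValueError(f"unknown message type: {kind!r}")


registry = Registry()
registry.handle(json.dumps({
    "type": "heartbeat",
    "agent_id": "agent-abc123",
    "timestamp": "2025-04-08T10:30:00Z",
    "payload": {
        "agent_id": "agent-abc123",
        "region": "us-east-1",
        "fingerprint": "sha256:...",
        "backends": [{"service": "web-service", "address": "10.0.1.10",
                      "port": 80, "weight": 100, "healthy": True}],
    },
}))
```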
Configuration
Agent Gossip Configuration
mode: agent
agent:
  gossip:
    # Required: 32-byte base64-encoded encryption key
    encryption_key: "xK7dQm9pR8vLnM3wYhA2cE5fG6jN1sU4tB0oZiXeHrI="
    # Required: Overwatch nodes to connect to
    overwatch_nodes:
      - "overwatch-1.internal:7946"
      - "overwatch-2.internal:7946"
Overwatch Gossip Configuration
mode: overwatch
overwatch:
  gossip:
    # Address to bind for receiving gossip
    bind_address: "0.0.0.0:7946"
    # Required: Must match agent encryption key
    encryption_key: "xK7dQm9pR8vLnM3wYhA2cE5fG6jN1sU4tB0oZiXeHrI="
    # Failure detection timing
    probe_interval: 1s
    probe_timeout: 500ms
    # Gossip message timing
    gossip_interval: 200ms
Encryption (Mandatory)
OpenGSLB requires encrypted gossip communication. Encryption cannot be disabled because it is a security-critical feature that protects against:
Man-in-the-middle attacks: Prevents attackers from intercepting or modifying health updates
Unauthorized agents: Only nodes with the correct key can participate in the cluster
Data tampering: Ensures integrity of backend health information
Key Requirements
Algorithm: AES-256 encryption via memberlist
Key length: Exactly 32 bytes (256 bits)
Format: Base64-encoded in configuration files
Scope: Same key must be used by all Agents and Overwatch nodes
Generating an Encryption Key
# Recommended: Generate a secure 32-byte key using OpenSSL
openssl rand -base64 32
# Example output: xK7dQm9pR8vLnM3wYhA2cE5fG6jN1sU4tB0oZiXeHrI=
Alternative methods:
# Using /dev/urandom (Linux/macOS)
head -c 32 /dev/urandom | base64
# Using Python
python3 -c "import secrets, base64; print(base64.b64encode(secrets.token_bytes(32)).decode())"
Key Distribution
Important: All Agents and Overwatch nodes must use the same encryption key.
Recommended practices:
Generate the key once on a secure system
Distribute via secure configuration management (e.g., Vault, AWS Secrets Manager, SOPS)
Never commit keys to version control
Rotate keys periodically by deploying new keys to all nodes before removing old keys
Startup Validation
OpenGSLB validates the encryption key at startup. If the key is missing or invalid, startup will fail with a clear error:
ERROR: gossip.encryption_key is required. OpenGSLB requires encrypted gossip communication.
Generate a key with: openssl rand -base64 32
ERROR: gossip.encryption_key must be exactly 32 bytes (got 16).
Ensure you're using a 256-bit key. Generate with: openssl rand -base64 32
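The same checks can be run before deployment, for example in a CI step. The sketch below mirrors the two failure cases above (missing key, wrong length) and adds a base64 check; `validate_encryption_key` is a hypothetical helper, not part of OpenGSLB, and its messages only approximate the real ones.

```python
import base64
import binascii


def validate_encryption_key(encoded):
    """Return the raw key bytes, or raise ValueError describing the problem."""
    if not encoded:
        raise ValueError("gossip.encryption_key is required")
    try:
        # validate=True rejects characters outside the base64 alphabet.
        raw = base64.b64decode(encoded, validate=True)
    except binascii.Error as exc:
        raise ValueError("gossip.encryption_key is not valid base64") from exc
    if len(raw) != 32:
        raise ValueError(
            f"gossip.encryption_key must be exactly 32 bytes (got {len(raw)})"
        )
    return raw


key = validate_encryption_key("xK7dQm9pR8vLnM3wYhA2cE5fG6jN1sU4tB0oZiXeHrI=")
```

A key produced with `openssl rand -base64 16` would fail the length check, which corresponds to the second error shown above.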
Metrics
Gossip exposes the following Prometheus metrics:
Agent Metrics
| Metric | Type | Description |
|---|---|---|
|  | Counter | Heartbeats sent to Overwatch |
| opengslb_gossip_heartbeat_failures_total | Counter | Failed heartbeat attempts |
|  | Counter | Predictive signals sent |
|  | Gauge | Connected Overwatch nodes |
Overwatch Metrics
| Metric | Type | Description |
|---|---|---|
|  | Counter | Messages received by type |
|  | Counter | Heartbeats received from agents |
|  | Gauge | Currently registered agents |
|  | Counter | Message processing errors |
Heartbeat Behavior
Interval and Timeout
agent:
  heartbeat:
    interval: 10s          # Send heartbeat every 10 seconds
    missed_threshold: 3    # Deregistered after 3 missed heartbeats (30s)
Staleness Detection
Overwatch marks backends as stale based on heartbeat activity:
overwatch:
  stale:
    threshold: 30s       # Mark stale after 30s without heartbeat
    remove_after: 5m     # Remove backend after 5m stale
Status progression:
Healthy/Unhealthy: Recent heartbeat from agent
Stale: No heartbeat within stale.threshold
Removed: No heartbeat within stale.remove_after
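The progression can be read as a function of the time since the last heartbeat. The sketch below assumes that both `stale.threshold` and `stale.remove_after` are measured from the last heartbeat, as the list above suggests, and that reaching a boundary exactly counts as crossing it; `backend_status` is a hypothetical helper, not the Overwatch implementation.

```python
from datetime import timedelta

STALE_THRESHOLD = timedelta(seconds=30)  # stale.threshold from the example
REMOVE_AFTER = timedelta(minutes=5)      # stale.remove_after from the example


def backend_status(since_last_heartbeat, healthy,
                   stale_threshold=STALE_THRESHOLD,
                   remove_after=REMOVE_AFTER):
    """Classify a backend by how long ago its agent last sent a heartbeat."""
    if since_last_heartbeat >= remove_after:
        return "removed"
    if since_last_heartbeat >= stale_threshold:
        return "stale"
    # Recent heartbeat: trust the health value the agent reported.
    return "healthy" if healthy else "unhealthy"


recent = backend_status(timedelta(seconds=10), healthy=True)
quiet = backend_status(timedelta(seconds=45), healthy=True)
gone = backend_status(timedelta(minutes=6), healthy=True)
```

With the example values, a backend that was healthy at its last heartbeat becomes stale after 30 seconds of silence and is removed after 5 minutes.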
Troubleshooting
Agent Cannot Connect to Overwatch
Check that firewall rules allow both TCP and UDP on the gossip port (default: 7946)
Verify Overwatch bind address is reachable from agent
Test connectivity:
# From agent server
nc -zv overwatch-ip 7946
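Where `nc` is not installed, the same TCP check can be approximated with a few lines of Python. Like `nc -zv` without `-u`, this only confirms that the TCP port accepts connections; it does not prove that UDP traffic on the gossip port gets through. The helper name `tcp_port_open` is hypothetical.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call from an agent server, using the placeholder host from above:
#   tcp_port_open("overwatch-ip", 7946)
```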
Encryption Key Mismatch
If agents can’t communicate with Overwatch:
WARN gossip: failed to decode gossip message error="cipher: message authentication failed"
Ensure all nodes use the same encryption_key value.
Backends Going Stale
Check agent process is running:
systemctl status opengslb
Check agent metrics:
curl http://localhost:9090/metrics | grep gossip
Verify network connectivity to Overwatch
Review agent logs:
journalctl -u opengslb | grep gossip
High Heartbeat Failures
Check the opengslb_gossip_heartbeat_failures_total metric. If it keeps increasing, common causes are:
Network connectivity issues between agent and Overwatch
Overwatch node is down or unreachable
Encryption key mismatch
Overwatch gossip port not listening
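For a quick check without a full monitoring stack, two scrapes of the agent's metrics endpoint can be compared directly. The sketch below parses the Prometheus text format by hand and reports whether the failure counter grew between scrapes. The sample scrape contents are invented for illustration, and whether the real metric carries labels is not stated here, so the parser simply sums every series with the given name.

```python
FAILURES = "opengslb_gossip_heartbeat_failures_total"


def counter_total(metrics_text, name):
    """Sum all samples of a counter in Prometheus text format, or None."""
    total = None
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            metric, _, rest = line.partition("{")
            rest = rest.rpartition("}")[2]  # text after the label set
        else:
            metric, _, rest = line.partition(" ")
        if metric != name:
            continue
        fields = rest.split()
        if fields:
            total = (total or 0.0) + float(fields[0])
    return total


def failures_increased(previous_scrape, current_scrape, name=FAILURES):
    """True if the counter is higher in the current scrape than before."""
    before = counter_total(previous_scrape, name) or 0.0
    after = counter_total(current_scrape, name) or 0.0
    return after > before


# Invented sample scrapes, one minute apart.
first = "# TYPE opengslb_gossip_heartbeat_failures_total counter\n" \
        "opengslb_gossip_heartbeat_failures_total 3\n"
second = "# TYPE opengslb_gossip_heartbeat_failures_total counter\n" \
         "opengslb_gossip_heartbeat_failures_total 7\n"
rising = failures_increased(first, second)
```

A counter that resets to zero after an agent restart would read as "not increased" here; a production alert would normally rely on the monitoring system's own rate functions instead.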
Best Practices
Always use encryption: Gossip encryption is mandatory in every deployment and cannot be disabled
Multiple Overwatch nodes: Configure agents to connect to multiple Overwatch nodes for redundancy
Monitor heartbeat metrics: Alert if opengslb_gossip_heartbeat_failures_total increases
Tune stale thresholds: Balance between quick detection and avoiding false positives
Separate networks: Consider using a management network for gossip traffic