Gossip Protocol

OpenGSLB uses a gossip protocol for communication between Agents and Overwatch nodes. This document describes the gossip architecture, message types, configuration, metrics, and troubleshooting.

Overview

The gossip protocol is built on hashicorp/memberlist, providing:

  • Fast event propagation: Health updates reach Overwatch within 500ms

  • Encrypted communication: Required AES-256 encryption for gossip traffic

  • Failure detection: SWIM-based protocol detects agent failures quickly

  • Heartbeat mechanism: Agents send periodic heartbeats to maintain registration

Architecture (ADR-015)

                    ┌─────────────────────────────────────────────────────────┐
                    │              Overwatch Nodes (DNS Authority)             │
                    │                                                          │
                    │   ┌──────────────┐         ┌──────────────┐             │
                    │   │ Overwatch-1  │         │ Overwatch-2  │             │
                    │   │              │         │              │             │
                    │   │ ┌──────────┐ │         │ ┌──────────┐ │             │
                    │   │ │  Gossip  │ │         │ │  Gossip  │ │             │
                    │   │ │ Receiver │ │         │ │ Receiver │ │             │
                    │   │ └────▲─────┘ │         │ └────▲─────┘ │             │
                    │   └──────┼───────┘         └──────┼───────┘             │
                    │          │                        │                      │
                    └──────────┼────────────────────────┼──────────────────────┘
                               │                        │
         ┌─────────────────────┼────────────────────────┼─────────────────────┐
         │                     │                        │                      │
         │            Gossip Messages (Encrypted)                             │
         │                     │                        │                      │
         │    ┌────────────────┼────────────────────────┼────────────────┐    │
         │    │                │                        │                │    │
         │    ▼                ▼                        ▼                ▼    │
    ┌─────────────┐    ┌─────────────┐          ┌─────────────┐   ┌─────────────┐
    │  Agent-1    │    │  Agent-2    │          │  Agent-3    │   │  Agent-4    │
    │ (App Server)│    │ (App Server)│          │ (App Server)│   │ (App Server)│
    │             │    │             │          │             │   │             │
    │ ┌─────────┐ │    │ ┌─────────┐ │          │ ┌─────────┐ │   │ ┌─────────┐ │
    │ │ Health  │ │    │ │ Health  │ │          │ │ Health  │ │   │ │ Health  │ │
    │ │ Monitor │ │    │ │ Monitor │ │          │ │ Monitor │ │   │ │ Monitor │ │
    │ └─────────┘ │    │ └─────────┘ │          │ └─────────┘ │   │ └─────────┘ │
    └─────────────┘    └─────────────┘          └─────────────┘   └─────────────┘
         │                   │                        │                  │
         ▼                   ▼                        ▼                  ▼
    [Backend Svc]      [Backend Svc]            [Backend Svc]      [Backend Svc]

Communication Flow

  1. Agents run on application servers alongside backends

  2. Agents monitor local backend health

  3. Agents send heartbeat messages to Overwatch nodes via gossip

  4. Overwatch receives heartbeats and maintains the backend registry

  5. Overwatch optionally validates health claims independently

There is no Agent-to-Agent communication. Each Agent connects directly to Overwatch nodes.

Message Types

Heartbeat (heartbeat)

Sent periodically by agents to register backends and report health status.

{
  "type": "heartbeat",
  "agent_id": "agent-abc123",
  "timestamp": "2025-04-08T10:30:00Z",
  "payload": {
    "agent_id": "agent-abc123",
    "region": "us-east-1",
    "fingerprint": "sha256:...",
    "backends": [
      {
        "service": "web-service",
        "address": "10.0.1.10",
        "port": 80,
        "weight": 100,
        "healthy": true,
        "latency_ms": 45,
        "last_check": "2025-04-08T10:29:55Z"
      }
    ]
  }
}
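As an illustration, a message with this shape could be assembled as shown below. This is a sketch in Python, not the agent's actual code: the helper name `build_heartbeat` is hypothetical, and only the field names are taken from the example above.

```python
import json
from datetime import datetime, timezone


def build_heartbeat(agent_id, region, fingerprint, backends, now=None):
    """Return a heartbeat message as a JSON string (hypothetical helper).

    `now` should be a timezone-aware datetime; it defaults to the current UTC time.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    message = {
        "type": "heartbeat",
        "agent_id": agent_id,
        # UTC timestamp in the same style as the example above
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "payload": {
            "agent_id": agent_id,
            "region": region,
            "fingerprint": fingerprint,
            "backends": backends,
        },
    }
    return json.dumps(message)


backend = {
    "service": "web-service",
    "address": "10.0.1.10",
    "port": 80,
    "weight": 100,
    "healthy": True,
    "latency_ms": 45,
    "last_check": "2025-04-08T10:29:55Z",
}
encoded = build_heartbeat(
    "agent-abc123",
    "us-east-1",
    "sha256:...",
    [backend],
    now=datetime(2025, 4, 8, 10, 30, tzinfo=timezone.utc),
)
```

Note that `agent_id` appears both at the top level and inside `payload`, mirroring the example message.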

Predictive Signal (predictive)

Sent when an agent predicts an impending failure based on resource metrics.

{
  "type": "predictive",
  "agent_id": "agent-abc123",
  "timestamp": "2025-04-08T10:30:00Z",
  "payload": {
    "agent_id": "agent-abc123",
    "service": "web-service",
    "address": "10.0.1.10",
    "port": 80,
    "signal": "bleed",
    "reason": "cpu_high",
    "value": 92.5,
    "threshold": 90.0,
    "bleed_weight": 50
  }
}

Signal types:

  • bleed: Gradual degradation, reduce traffic weight

  • drain: Prepare for shutdown, stop accepting new traffic

  • recovered: Signal cleared, return to normal operation

Reason codes:

  • cpu_high: CPU utilization above threshold

  • memory_pressure: Memory usage above threshold

  • error_rate: Error rate above threshold
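One way a consumer might translate a predictive signal into an effective traffic weight is sketched below. The mapping is an assumption inferred from the signal descriptions above (bleed applies `bleed_weight`, drain drops the weight to zero, recovered restores the configured weight). The function name `effective_weight` is hypothetical, and OpenGSLB's actual routing logic may differ.

```python
KNOWN_SIGNALS = {"bleed", "drain", "recovered"}


def effective_weight(configured_weight, payload):
    """Return the traffic weight to apply after a predictive signal (sketch)."""
    signal = payload["signal"]
    if signal not in KNOWN_SIGNALS:
        raise ValueError(f"unknown predictive signal: {signal!r}")
    if signal == "bleed":
        # Assumption: bleeding never raises the weight above the configured value.
        return min(configured_weight, payload.get("bleed_weight", configured_weight))
    if signal == "drain":
        # Assumption: a draining backend receives no new traffic.
        return 0
    # "recovered": return to normal operation.
    return configured_weight
```

For the example message above, `effective_weight(100, payload)` would yield the `bleed_weight` of 50.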

Deregister (deregister)

Sent by an agent during graceful shutdown to remove its backends from the registry.

{
  "type": "deregister",
  "agent_id": "agent-abc123",
  "timestamp": "2025-04-08T10:30:00Z",
  "payload": {
    "agent_id": "agent-abc123",
    "reason": "shutdown"
  }
}
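Taken together, the three message types can be routed on their `type` field. The following is a minimal sketch of such a dispatcher, assuming an in-memory registry keyed by `agent_id`; the `Registry` class and its attributes are hypothetical and do not describe Overwatch's real data structures.

```python
import json


class Registry:
    """Toy in-memory backend registry keyed by agent_id (illustrative only)."""

    def __init__(self):
        self.backends = {}  # agent_id -> list of backend dicts from the last heartbeat
        self.signals = {}   # agent_id -> most recent predictive payload

    def handle(self, raw):
        """Decode one gossip message and update the registry; return its type."""
        message = json.loads(raw)
        kind = message.get("type")
        agent_id = message["agent_id"]
        if kind == "heartbeat":
            # A heartbeat both registers the agent and refreshes its backends.
            self.backends[agent_id] = message["payload"]["backends"]
        elif kind == "predictive":
            self.signals[agent_id] = message["payload"]
        elif kind == "deregister":
            # Graceful shutdown: drop everything known about the agent.
            self.backends.pop(agent_id, None)
            self.signals.pop(agent_id, None)
        else:
            raise ValueError(f"unknown message type: {kind!r}")
        return kind
```

Replacing the whole backend list on each heartbeat keeps the sketch simple; a real registry would also need to track heartbeat timestamps for staleness detection.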

Configuration

Agent Gossip Configuration

mode: agent

agent:
  gossip:
    # Required: 32-byte base64-encoded encryption key
    encryption_key: "xK7dQm9pR8vLnM3wYhA2cE5fG6jN1sU4tB0oZiXeHrI="

    # Required: Overwatch nodes to connect to
    overwatch_nodes:
      - "overwatch-1.internal:7946"
      - "overwatch-2.internal:7946"

Overwatch Gossip Configuration

mode: overwatch

overwatch:
  gossip:
    # Address to bind for receiving gossip
    bind_address: "0.0.0.0:7946"

    # Required: Must match agent encryption key
    encryption_key: "xK7dQm9pR8vLnM3wYhA2cE5fG6jN1sU4tB0oZiXeHrI="

    # Failure detection timing
    probe_interval: 1s
    probe_timeout: 500ms

    # Gossip message timing
    gossip_interval: 200ms

Encryption (Mandatory)

OpenGSLB requires encrypted gossip communication. There is no option to disable encryption - this is a security-critical feature that protects against:

  • Man-in-the-middle attacks: Prevents attackers from intercepting or modifying health updates

  • Unauthorized agents: Only nodes with the correct key can participate in the cluster

  • Data tampering: Ensures integrity of backend health information

Key Requirements

  • Algorithm: AES-256 encryption via memberlist

  • Key length: Exactly 32 bytes (256 bits)

  • Format: Base64-encoded in configuration files

  • Scope: Same key must be used by all Agents and Overwatch nodes

Generating an Encryption Key

# Recommended: Generate a secure 32-byte key using OpenSSL
openssl rand -base64 32

# Example output: xK7dQm9pR8vLnM3wYhA2cE5fG6jN1sU4tB0oZiXeHrI=

Alternative methods:

# Using /dev/urandom (Linux/macOS)
head -c 32 /dev/urandom | base64

# Using Python
python3 -c "import secrets, base64; print(base64.b64encode(secrets.token_bytes(32)).decode())"

Key Distribution

Important: All Agents and Overwatch nodes must use the same encryption key.

Recommended practices:

  1. Generate the key once on a secure system

  2. Distribute via secure configuration management (e.g., Vault, AWS Secrets Manager, SOPS)

  3. Never commit keys to version control

  4. Rotate keys periodically by deploying new keys to all nodes before removing old keys

Startup Validation

OpenGSLB validates the encryption key at startup. If the key is missing or does not decode to exactly 32 bytes, startup fails with one of the following errors:

ERROR: gossip.encryption_key is required. OpenGSLB requires encrypted gossip communication.
       Generate a key with: openssl rand -base64 32
ERROR: gossip.encryption_key must be exactly 32 bytes (got 16).
       Ensure you're using a 256-bit key. Generate with: openssl rand -base64 32
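The same checks can be run before deployment, for example in a CI step. The sketch below is not OpenGSLB's validation code: the function name `validate_encryption_key` is hypothetical and the error wording only approximates the messages above. It decodes the base64 value and verifies that it yields exactly 32 bytes.

```python
import base64
import binascii

KEY_BYTES = 32  # AES-256 requires a 256-bit key


def validate_encryption_key(encoded):
    """Return the raw key bytes, or raise ValueError describing the problem."""
    if not encoded:
        raise ValueError("gossip.encryption_key is required")
    try:
        # validate=True rejects characters outside the base64 alphabet
        raw = base64.b64decode(encoded, validate=True)
    except binascii.Error as exc:
        raise ValueError("gossip.encryption_key is not valid base64") from exc
    if len(raw) != KEY_BYTES:
        raise ValueError(
            f"gossip.encryption_key must be exactly {KEY_BYTES} bytes (got {len(raw)})"
        )
    return raw
```

A key produced by `openssl rand -base64 16` would fail the length check with "got 16", matching the second error shown above.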

Metrics

Gossip exposes the following Prometheus metrics:

Agent Metrics

Metric                                      Type     Description
------------------------------------------  -------  ----------------------------
opengslb_gossip_heartbeats_sent_total       Counter  Heartbeats sent to Overwatch
opengslb_gossip_heartbeat_failures_total    Counter  Failed heartbeat attempts
opengslb_gossip_predictive_signals_total    Counter  Predictive signals sent
opengslb_gossip_connected_overwatches       Gauge    Connected Overwatch nodes

Overwatch Metrics

Metric                                           Type     Description
-----------------------------------------------  -------  -------------------------------
opengslb_gossip_messages_received_total          Counter  Messages received by type
opengslb_gossip_heartbeats_received_total        Counter  Heartbeats received from agents
opengslb_gossip_agents_registered                Gauge    Currently registered agents
opengslb_gossip_message_processing_errors_total  Counter  Message processing errors

Heartbeat Behavior

Interval and Timeout

agent:
  heartbeat:
    interval: 10s         # Send heartbeat every 10 seconds
    missed_threshold: 3   # Deregistered after 3 missed heartbeats (30s)

Staleness Detection

Overwatch marks backends as stale based on heartbeat activity:

overwatch:
  stale:
    threshold: 30s      # Mark stale after 30s without heartbeat
    remove_after: 5m    # Remove backend after 5m stale

Status progression:

  1. Healthy/Unhealthy: A heartbeat arrived within stale.threshold; the status reflects the health the agent reported

  2. Stale: No heartbeat within stale.threshold

  3. Removed: No heartbeat within stale.remove_after
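The progression can be expressed as a small classification function. This sketch assumes both thresholds are measured from the last received heartbeat, as the list above states, and uses the example values of 30s and 5m as defaults. The function name `backend_status` is hypothetical.

```python
from datetime import timedelta


def backend_status(
    since_last_heartbeat,
    reported_healthy,
    stale_threshold=timedelta(seconds=30),
    remove_after=timedelta(minutes=5),
):
    """Classify a backend from the time elapsed since its last heartbeat (sketch)."""
    if since_last_heartbeat >= remove_after:
        return "removed"
    if since_last_heartbeat >= stale_threshold:
        return "stale"
    # Recent heartbeat: trust the health the agent reported.
    return "healthy" if reported_healthy else "unhealthy"
```

With a 10s heartbeat interval, three consecutive missed heartbeats correspond to the 30s stale threshold used here.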

Troubleshooting

Agent Cannot Connect to Overwatch

  1. Check that firewall rules allow both TCP and UDP on the gossip port (default: 7946)

  2. Verify that the Overwatch bind address is reachable from the agent

  3. Test connectivity:

# From the agent server (nc -z checks TCP only; gossip also needs UDP on this port)
nc -zv overwatch-ip 7946

Encryption Key Mismatch

If agents cannot communicate with Overwatch, look for an authentication failure like this in the logs:

WARN gossip: failed to decode gossip message error="cipher: message authentication failed"

Ensure all nodes use the same encryption_key value.

Backends Going Stale

  1. Check agent process is running: systemctl status opengslb

  2. Check agent metrics: curl http://localhost:9090/metrics | grep gossip

  3. Verify network connectivity to Overwatch

  4. Review agent logs: journalctl -u opengslb | grep gossip
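The metrics check in step 2 can also be scripted. The sketch below parses Prometheus text exposition output and keeps only the gossip metrics. It assumes samples carry no trailing timestamp, and the `SAMPLE` text uses made-up values for illustration. The function name `gossip_metrics` is hypothetical.

```python
SAMPLE = """\
# TYPE opengslb_gossip_heartbeats_sent_total counter
opengslb_gossip_heartbeats_sent_total 120
opengslb_gossip_heartbeat_failures_total 3
opengslb_gossip_connected_overwatches 2
go_goroutines 14
"""


def gossip_metrics(exposition_text):
    """Return {metric_name: value} for gossip metrics, summing across label sets."""
    totals = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field (assumes no timestamp).
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0].strip()
        if "gossip" not in name:
            continue
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


metrics = gossip_metrics(SAMPLE)
```

A value of zero for `opengslb_gossip_connected_overwatches`, or a failure counter that grows between two scrapes, points to the connectivity checks in step 3.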

High Heartbeat Failures

If the opengslb_gossip_heartbeat_failures_total metric keeps increasing, check for these common causes:

  1. Network connectivity issues between agent and Overwatch

  2. Overwatch node is down or unreachable

  3. Encryption key mismatch

  4. Overwatch gossip port not listening

Best Practices

  1. Always use encryption: Gossip encryption is mandatory in every environment, so provision keys before deploying

  2. Multiple Overwatch nodes: Configure agents to connect to multiple Overwatch nodes for redundancy

  3. Monitor heartbeat metrics: Alert if opengslb_gossip_heartbeat_failures_total increases

  4. Tune stale thresholds: Balance between quick detection and avoiding false positives

  5. Separate networks: Consider using a management network for gossip traffic

See Also