Demo 6: Passive Latency Learning (ADR-017)

Level: Advanced Time: 30-45 minutes (Azure deployment) Prerequisites: Azure subscription, Terraform, SSH key

Overview

This demo showcases OpenGSLB’s passive latency learning feature - a unique capability that learns real client-to-backend latency by reading TCP RTT (Round-Trip Time) data directly from the operating system.

Unlike active latency probing (which measures Overwatch-to-backend latency), passive learning captures the actual latency experienced by your clients. This data is aggregated by client subnet and gossiped to Overwatch nodes for intelligent routing decisions.

What Makes This Different

Approach	What It Measures	Accuracy
Active Latency (Demo 3)	Overwatch → Backend	Proxy’s perspective
Passive Learning (This Demo)	Client → Backend	Client’s actual experience

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│  Client (10.1.2.x)                                                  │
│       │                                                              │
│       │ HTTP Request                                                 │
│       ▼                                                              │
│  ┌─────────────┐         ┌─────────────┐         ┌─────────────┐   │
│  │ Backend EU  │◄────────│   Agent     │────────►│  Overwatch  │   │
│  │ (nginx)     │  TCP    │ (RTT: 85ms) │  Gossip │  (DNS)      │   │
│  └─────────────┘  conn   └─────────────┘         └─────────────┘   │
│                                                          │          │
│  Agent reads TCP_INFO from kernel:                       │          │
│  - tcpi_rtt (smoothed RTT in microseconds)              │          │
│  - Aggregates by /24 subnet                             ▼          │
│  - Reports to Overwatch via gossip            DNS Response:        │
│                                               "Use EU backend       │
│                                                (lowest latency)"    │
└─────────────────────────────────────────────────────────────────────┘

What You’ll Learn

How agents collect TCP RTT data from the OS
How latency is aggregated by client subnet
How Overwatch uses learned latency for routing
Cold-start fallback to geolocation routing
Cross-platform support (Linux + Windows)

Prerequisites

Azure Subscription with permissions to create VMs
Terraform >= 1.0 installed
SSH Key for VM access
Azure CLI authenticated (az login)

Quick Start

# Clone the repository
git clone https://github.com/LoganRossUS/OpenGSLB.git
cd OpenGSLB/demos/demo-6-advanced-passive-latency-learning/terraform

# Configure deployment
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your SSH key path

# Deploy infrastructure (~2 minutes)
terraform init
terraform apply -var="windows_admin_password=YourComplexPassword123!"

# Wait for cloud-init to complete, then validate
ssh azureuser@$(terraform output -raw traffic_eastus_public_ip)
test-cluster

Infrastructure Overview

The demo deploys across 3 Azure regions:

Region	VMs	Purpose
East US	Overwatch, Traffic Generator	DNS authority, simulated clients
West Europe	Linux Backend, Windows Backend	Application servers with agents
Southeast Asia	Linux Backend, Traffic Generator	APAC region backend + clients

Expected Latencies

From	To	Expected RTT
East US	West Europe	~80-100ms
East US	Southeast Asia	~200-250ms
Southeast Asia	Southeast Asia	~1-5ms

Step-by-Step Walkthrough

Step 1: Verify Deployment

SSH to the traffic generator and run validation:

ssh azureuser@<traffic_eastus_public_ip>

# Run cluster validation
test-cluster

# Expected output:
# ✓ Overwatch DNS responding
# ✓ All agents connected
# ✓ Health checks passing

Step 2: Generate Traffic

Generate sustained traffic to populate the latency table:

# Generate 5 requests/second for 5 minutes
generate-traffic 5 300

# This creates TCP connections to all backends
# Agents read RTT from each connection

Step 3: View Learned Latency Data

Query the Overwatch API to see collected latency data:

curl http://10.1.1.10:8080/api/v1/overwatch/latency | jq .

Expected output:

{
  "entries": [
    {
      "subnet": "10.1.2.0/24",
      "domain": "app.demo.local",
      "region": "eu-west",
      "rtt_ms": 85,
      "samples": 150,
      "last_updated": "2025-12-19T10:05:00Z"
    },
    {
      "subnet": "10.1.2.0/24",
      "domain": "app.demo.local",
      "region": "ap-southeast",
      "rtt_ms": 220,
      "samples": 150,
      "last_updated": "2025-12-19T10:05:00Z"
    }
  ]
}

Step 4: Test Latency-Based Routing

Query DNS and verify the lowest-latency backend is selected:

# From East US traffic generator
dig @10.1.1.10 app.demo.local A +short

# Expected: West Europe backend IP (lower latency than Singapore)

Check Overwatch logs for routing decision:

ssh azureuser@<overwatch_public_ip>
journalctl -u opengslb | grep "routing decision"

Step 5: Test Cold-Start Fallback

When no latency data exists, Overwatch falls back to geolocation:

# Restart Overwatch (clears latency table)
sudo systemctl restart opengslb

# Immediately query DNS
dig @10.1.1.10 app.demo.local A +short

# Check logs - should show geo_fallback
journalctl -u opengslb | grep "geo_fallback"

Step 6: Compare Regions

Traffic from different regions should route to different backends:

# From East US (connects to EU - lower latency)
ssh azureuser@<traffic_eastus_public_ip>
dig @10.1.1.10 app.demo.local +short

# From Singapore (connects to Singapore - local)
ssh azureuser@<traffic_singapore_public_ip>
dig @10.1.1.10 app.demo.local +short

Configuration Deep-Dive

Agent Latency Learning Config

agent:
  latency_learning:
    enabled: true
    poll_interval: 10s        # How often to read TCP stats
    min_connection_age: 5s    # Ignore new connections
    ipv4_prefix: 24           # Aggregate by /24
    ipv6_prefix: 48           # Aggregate by /48
    ewma_alpha: 0.3           # Smoothing factor
    max_subnets: 100000       # Memory limit
    subnet_ttl: 168h          # 7-day retention
    min_samples: 5            # Min samples before reporting
    report_interval: 30s      # Gossip frequency

Domain Configuration for Learned Latency

domains:
  - name: app.demo.local
    routing_algorithm: learned_latency  # Use passive learning
    regions:
      - eu-west
      - ap-southeast
    latency_config:
      max_latency_ms: 300     # Exclude high-latency backends
      min_samples: 5          # Require sufficient data

How It Works

1. TCP RTT Collection (Agent)

On Linux, agents read /proc/net/tcp and use getsockopt(TCP_INFO):

struct tcp_info info;
getsockopt(sock, IPPROTO_TCP, TCP_INFO, &info, &len);
// info.tcpi_rtt contains smoothed RTT in microseconds

On Windows, agents use GetPerTcpConnectionEStats():

# Requires Administrator privileges
GetPerTcpConnectionEStats -State EstablishedConnections

2. Subnet Aggregation

RTT samples are aggregated by client subnet using EWMA:

new_rtt = α × sample + (1 - α) × old_rtt

With α = 0.3, recent samples have moderate influence while maintaining stability.

3. Gossip to Overwatch

Agents periodically send latency reports:

{
  "type": "latency_report",
  "agent_id": "backend-eu-west",
  "entries": [
    {"subnet": "10.1.2.0/24", "rtt_ms": 85, "samples": 50}
  ]
}

4. Routing Decision

When a DNS query arrives, Overwatch:

Extracts client IP (or ECS subnet)
Looks up learned latency for each backend
Selects the backend with lowest RTT
Falls back to geolocation if no data exists

Troubleshooting

No Latency Data Appearing

# Check agent logs
journalctl -u opengslb | grep "latency"

# Verify CAP_NET_ADMIN on Linux
getcap /usr/local/bin/opengslb
# Should show: cap_net_admin+ep

# Verify connections exist
ss -tn | grep ESTAB

Unexpected Routing

# Check what Overwatch sees
curl http://10.1.1.10:8080/api/v1/overwatch/latency | jq .

# Verify domain config
curl http://10.1.1.10:8080/api/v1/domains | jq .

Windows Agent Issues

# Check if running as Administrator
([Security.Principal.WindowsPrincipal] [Security.Principal.WindowsIdentity]::GetCurrent()).IsInRole([Security.Principal.WindowsBuiltInRole] "Administrator")

# Check agent logs
Get-Content C:\opengslb\logs\agent.log | Select-String "latency"

Cleanup

# Destroy all Azure resources
cd terraform
terraform destroy

# Or delete the resource group directly
az group delete --name rg-opengslb-latency-test --yes

Cost Estimate

Resource	Monthly Cost
5x Linux VMs (B2s)	~$75
1x Windows VM (B2s)	~$25
VNet Peering	~$20
Total	~$120

Tip: Deallocate VMs when not testing to reduce costs.

Next Steps

Review ADR-017: Passive Latency Learning for design rationale
Explore the Configuration Reference for all latency_learning options
Try Demo 3: Latency Routing for comparison with active probing

Key Takeaways

Passive learning captures real client latency - not just Overwatch-to-backend
Subnet aggregation prevents unbounded memory growth
Cold-start fallback ensures routing works before data is collected
Cross-platform support - Linux (netlink) and Windows (GetPerTcpConnectionEStats)
No client changes required - works with existing TCP connections