Demo 6: Passive Latency Learning (ADR-017)
Level: Advanced
Time: 30-45 minutes (Azure deployment)
Prerequisites: Azure subscription, Terraform, SSH key
Overview
This demo showcases OpenGSLB’s passive latency learning feature: agents learn real client-to-backend latency by reading TCP RTT (round-trip time) data directly from the operating system.
Unlike active latency probing (which measures Overwatch-to-backend latency), passive learning captures the actual latency experienced by your clients. This data is aggregated by client subnet and gossiped to Overwatch nodes for intelligent routing decisions.
What Makes This Different
| Approach | What It Measures | Accuracy |
|---|---|---|
| Active Latency (Demo 3) | Overwatch → Backend | Proxy’s perspective |
| Passive Learning (This Demo) | Client → Backend | Client’s actual experience |
Architecture
┌─────────────────────────────────────────────────────────────────────┐
│  Client (10.1.2.x)                                                  │
│         │                                                           │
│         │ HTTP Request                                              │
│         ▼                                                           │
│  ┌─────────────┐         ┌─────────────┐         ┌─────────────┐    │
│  │ Backend EU  │◄────────│    Agent    │────────►│  Overwatch  │    │
│  │   (nginx)   │   TCP   │ (RTT: 85ms) │ Gossip  │    (DNS)    │    │
│  └─────────────┘  conn   └─────────────┘         └─────────────┘    │
│                                                         │           │
│  Agent reads TCP_INFO from kernel:                      │           │
│  - tcpi_rtt (smoothed RTT in microseconds)              │           │
│  - Aggregates by /24 subnet                             ▼           │
│  - Reports to Overwatch via gossip                DNS Response:     │
│                                                  "Use EU backend    │
│                                                   (lowest latency)" │
└─────────────────────────────────────────────────────────────────────┘
What You’ll Learn
How agents collect TCP RTT data from the OS
How latency is aggregated by client subnet
How Overwatch uses learned latency for routing
Cold-start fallback to geolocation routing
Cross-platform support (Linux + Windows)
Prerequisites
Azure Subscription with permissions to create VMs
Terraform >= 1.0 installed
SSH Key for VM access
Azure CLI authenticated (az login)
Quick Start
# Clone the repository
git clone https://github.com/LoganRossUS/OpenGSLB.git
cd OpenGSLB/demos/demo-6-advanced-passive-latency-learning/terraform
# Configure deployment
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your SSH key path
# Deploy infrastructure (~2 minutes)
terraform init
terraform apply -var="windows_admin_password=YourComplexPassword123!"
# Wait for cloud-init to complete, then validate
ssh azureuser@$(terraform output -raw traffic_eastus_public_ip)
test-cluster
Infrastructure Overview
The demo deploys across 3 Azure regions:
| Region | VMs | Purpose |
|---|---|---|
| East US | Overwatch, Traffic Generator | DNS authority, simulated clients |
| West Europe | Linux Backend, Windows Backend | Application servers with agents |
| Southeast Asia | Linux Backend, Traffic Generator | APAC region backend + clients |
Expected Latencies
| From | To | Expected RTT |
|---|---|---|
| East US | West Europe | ~80-100ms |
| East US | Southeast Asia | ~200-250ms |
| Southeast Asia | Southeast Asia | ~1-5ms |
Step-by-Step Walkthrough
Step 1: Verify Deployment
SSH to the traffic generator and run validation:
ssh azureuser@<traffic_eastus_public_ip>
# Run cluster validation
test-cluster
# Expected output:
# ✓ Overwatch DNS responding
# ✓ All agents connected
# ✓ Health checks passing
Step 2: Generate Traffic
Generate sustained traffic to populate the latency table:
# Generate 5 requests/second for 5 minutes
generate-traffic 5 300
# This creates TCP connections to all backends
# Agents read RTT from each connection
Step 3: View Learned Latency Data
Query the Overwatch API to see collected latency data:
curl http://10.1.1.10:8080/api/v1/overwatch/latency | jq .
Expected output:
{
"entries": [
{
"subnet": "10.1.2.0/24",
"domain": "app.demo.local",
"region": "eu-west",
"rtt_ms": 85,
"samples": 150,
"last_updated": "2025-12-19T10:05:00Z"
},
{
"subnet": "10.1.2.0/24",
"domain": "app.demo.local",
"region": "ap-southeast",
"rtt_ms": 220,
"samples": 150,
"last_updated": "2025-12-19T10:05:00Z"
}
]
}
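To make the shape of this response concrete, here is a minimal Python sketch that groups the entries by subnet and domain and picks the lowest-RTT region for each pair. It assumes only the JSON shape shown above; the function name `best_region_per_subnet` is illustrative and not part of OpenGSLB.

```python
import json
from typing import Dict, Tuple

# Sample payload copied from the expected output above.
SAMPLE = """
{
  "entries": [
    {"subnet": "10.1.2.0/24", "domain": "app.demo.local", "region": "eu-west",
     "rtt_ms": 85, "samples": 150, "last_updated": "2025-12-19T10:05:00Z"},
    {"subnet": "10.1.2.0/24", "domain": "app.demo.local", "region": "ap-southeast",
     "rtt_ms": 220, "samples": 150, "last_updated": "2025-12-19T10:05:00Z"}
  ]
}
"""


def best_region_per_subnet(payload: dict) -> Dict[Tuple[str, str], dict]:
    """Return the lowest-RTT entry for each (subnet, domain) pair."""
    best: Dict[Tuple[str, str], dict] = {}
    for entry in payload.get("entries", []):
        key = (entry["subnet"], entry["domain"])
        if key not in best or entry["rtt_ms"] < best[key]["rtt_ms"]:
            best[key] = entry
    return best


if __name__ == "__main__":
    for (subnet, domain), entry in best_region_per_subnet(json.loads(SAMPLE)).items():
        print(f"{subnet} {domain} -> {entry['region']} ({entry['rtt_ms']} ms)")
```

In practice you would pipe the curl output into a script like this instead of using the embedded sample.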
Step 4: Test Latency-Based Routing
Query DNS and verify the lowest-latency backend is selected:
# From East US traffic generator
dig @10.1.1.10 app.demo.local A +short
# Expected: West Europe backend IP (lower latency than Singapore)
Check Overwatch logs for routing decision:
ssh azureuser@<overwatch_public_ip>
journalctl -u opengslb | grep "routing decision"
Step 5: Test Cold-Start Fallback
When no latency data exists, Overwatch falls back to geolocation:
# Restart Overwatch (clears latency table)
sudo systemctl restart opengslb
# Immediately query DNS
dig @10.1.1.10 app.demo.local A +short
# Check logs - should show geo_fallback
journalctl -u opengslb | grep "geo_fallback"
Step 6: Compare Regions
Traffic from different regions should route to different backends:
# From East US (connects to EU - lower latency)
ssh azureuser@<traffic_eastus_public_ip>
dig @10.1.1.10 app.demo.local +short
# From Singapore (connects to Singapore - local)
ssh azureuser@<traffic_singapore_public_ip>
dig @10.1.1.10 app.demo.local +short
Configuration Deep-Dive
Agent Latency Learning Config
agent:
latency_learning:
enabled: true
poll_interval: 10s # How often to read TCP stats
min_connection_age: 5s # Ignore new connections
ipv4_prefix: 24 # Aggregate by /24
ipv6_prefix: 48 # Aggregate by /48
ewma_alpha: 0.3 # Smoothing factor
max_subnets: 100000 # Memory limit
subnet_ttl: 168h # 7-day retention
min_samples: 5 # Min samples before reporting
report_interval: 30s # Gossip frequency
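The `ipv4_prefix` and `ipv6_prefix` settings determine which aggregation bucket a client address falls into. The following Python sketch shows one way to derive that bucket key; it is an illustration under the prefix lengths above, not the agent's actual code, and `subnet_key` is a hypothetical helper name.

```python
import ipaddress

IPV4_PREFIX = 24  # mirrors ipv4_prefix in the config above
IPV6_PREFIX = 48  # mirrors ipv6_prefix in the config above


def subnet_key(client_ip: str,
               v4_prefix: int = IPV4_PREFIX,
               v6_prefix: int = IPV6_PREFIX) -> str:
    """Map a client address to the subnet it is aggregated under."""
    addr = ipaddress.ip_address(client_ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False masks off the host bits instead of raising an error.
    return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))


if __name__ == "__main__":
    print(subnet_key("10.1.2.57"))            # 10.1.2.0/24
    print(subnet_key("2001:db8:abcd:12::1"))  # 2001:db8:abcd::/48
```

Every client in 10.1.2.0/24 shares one entry, which is what keeps the table bounded by `max_subnets` rather than by the number of clients.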
Domain Configuration for Learned Latency
domains:
- name: app.demo.local
routing_algorithm: learned_latency # Use passive learning
regions:
- eu-west
- ap-southeast
latency_config:
max_latency_ms: 300 # Exclude high-latency backends
min_samples: 5 # Require sufficient data
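The two `latency_config` thresholds act as filters on the candidate regions before the lowest RTT is chosen. A minimal Python sketch of that filtering follows, assuming both comparisons are inclusive (the document does not say); `RegionLatency` and `eligible_regions` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegionLatency:
    region: str
    rtt_ms: float
    samples: int


def eligible_regions(candidates: List[RegionLatency],
                     max_latency_ms: float = 300,
                     min_samples: int = 5) -> List[RegionLatency]:
    """Keep regions with enough samples and an RTT at or below the ceiling.

    Assumption: both bounds are inclusive.
    """
    return [
        c for c in candidates
        if c.samples >= min_samples and c.rtt_ms <= max_latency_ms
    ]


if __name__ == "__main__":
    candidates = [
        RegionLatency("eu-west", 85, 150),
        RegionLatency("ap-southeast", 220, 150),
        RegionLatency("too-slow", 350, 40),  # hypothetical: above max_latency_ms
        RegionLatency("too-new", 20, 2),     # hypothetical: below min_samples
    ]
    print([c.region for c in eligible_regions(candidates)])
```

Note that a region with a very low RTT but too few samples is excluded, which avoids routing on a single lucky measurement.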
How It Works
1. TCP RTT Collection (Agent)
On Linux, agents read /proc/net/tcp and use getsockopt(TCP_INFO):
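#include <netinet/in.h>   // IPPROTO_TCP
#include <netinet/tcp.h>  // struct tcp_info, TCP_INFO
struct tcp_info info;
socklen_t len = sizeof(info);  // in: buffer size, out: bytes written
getsockopt(sock, IPPROTO_TCP, TCP_INFO, &info, &len);
// info.tcpi_rtt contains smoothed RTT in microseconds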
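The same kernel field can be read from Python on Linux, which is handy for checking values by hand. This sketch assumes the standard Linux `struct tcp_info` layout, where `tcpi_rtt` sits at byte offset 68; the helper names are hypothetical. `getsockopt` only works on a socket the calling process owns, so this shows the data source rather than how the agent observes another process's connections.

```python
import socket
import struct

# Linux struct tcp_info: 8 bytes of one-byte fields and bitfields, then
# 15 unsigned 32-bit fields (tcpi_rto ... tcpi_rcv_ssthresh) precede tcpi_rtt.
TCPI_RTT_OFFSET = 8 + 15 * 4  # = 68
TCP_INFO_BUF_LEN = 104        # enough to cover tcpi_rtt on any modern kernel


def parse_tcpi_rtt_us(buf: bytes) -> int:
    """Extract tcpi_rtt (smoothed RTT, microseconds) from a raw tcp_info buffer."""
    (rtt_us,) = struct.unpack_from("=I", buf, TCPI_RTT_OFFSET)
    return rtt_us


def read_rtt_us(sock: socket.socket) -> int:
    """Read the smoothed RTT of a connected TCP socket (Linux only)."""
    buf = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, TCP_INFO_BUF_LEN)
    return parse_tcpi_rtt_us(buf)
```

Dividing the result by 1000 gives milliseconds, the unit used in the latency table.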
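On Windows, agents use the Win32 API GetPerTcpConnectionEStats() (iphlpapi.dll), which requires Administrator privileges. It is a C API rather than a PowerShell cmdlet; to see the established connections on the host, run:
# List established TCP connections
Get-NetTCPConnection -State Established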
2. Subnet Aggregation
RTT samples are aggregated by client subnet using EWMA:
new_rtt = α × sample + (1 - α) × old_rtt
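With α = 0.3, each new sample contributes 30% of the updated estimate and the previous estimate contributes 70%, so the value tracks recent changes without overreacting to a single outlier.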
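As a worked example of the formula, the Python sketch below applies the update step by step. Seeding the average with the first sample is an assumption, since the document does not say how the first value is initialized, and `ewma_update` is a hypothetical name.

```python
from typing import Optional


def ewma_update(old_rtt: Optional[float], sample: float, alpha: float = 0.3) -> float:
    """Apply new_rtt = alpha * sample + (1 - alpha) * old_rtt.

    Assumption: the first sample seeds the average when no prior value exists.
    """
    if old_rtt is None:
        return float(sample)
    return alpha * sample + (1 - alpha) * old_rtt


if __name__ == "__main__":
    rtt = None
    # Steady 100 ms, one 200 ms spike, then back to 100 ms.
    for sample in (100, 200, 100):
        rtt = ewma_update(rtt, sample)
        print(f"sample={sample} ms -> estimate={rtt:.1f} ms")
```

Starting from 100 ms, a single 200 ms spike moves the estimate to 130 ms, and the next 100 ms sample pulls it back to 121 ms.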
3. Gossip to Overwatch
Agents periodically send latency reports:
{
"type": "latency_report",
"agent_id": "backend-eu-west",
"entries": [
{"subnet": "10.1.2.0/24", "rtt_ms": 85, "samples": 50}
]
}
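The report above can be assembled from the agent's per-subnet state. Here is a minimal Python sketch that builds a message of that shape and applies the `min_samples` threshold before reporting. The function name, the in-memory layout of `subnet_stats`, and rounding `rtt_ms` to whole milliseconds are assumptions made for illustration.

```python
import json
from typing import Dict


def build_latency_report(agent_id: str,
                         subnet_stats: Dict[str, dict],
                         min_samples: int = 5) -> dict:
    """Build a latency_report message, skipping subnets with too few samples."""
    entries = [
        {
            "subnet": subnet,
            "rtt_ms": round(stats["rtt_ms"]),  # assumption: whole milliseconds
            "samples": stats["samples"],
        }
        for subnet, stats in sorted(subnet_stats.items())
        if stats["samples"] >= min_samples
    ]
    return {"type": "latency_report", "agent_id": agent_id, "entries": entries}


if __name__ == "__main__":
    stats = {
        "10.1.2.0/24": {"rtt_ms": 85.2, "samples": 50},
        "10.9.9.0/24": {"rtt_ms": 40.0, "samples": 2},  # hypothetical, below threshold
    }
    print(json.dumps(build_latency_report("backend-eu-west", stats), indent=2))
```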
4. Routing Decision
When a DNS query arrives, Overwatch:
Extracts client IP (or ECS subnet)
Looks up learned latency for each backend
Selects the backend with lowest RTT
Falls back to geolocation if no data exists
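The four steps above can be sketched in Python as follows. This is a simplified, IPv4-only illustration rather than Overwatch's implementation: the table layout, the `select_region` name, and the returned reason strings are assumptions, and an ECS subnet is assumed to match a table key exactly.

```python
import ipaddress
from typing import Dict, Optional, Tuple

# subnet -> region -> learned RTT in milliseconds
LatencyTable = Dict[str, Dict[str, float]]


def select_region(client_ip: str,
                  table: LatencyTable,
                  geo_region: str,
                  ecs_subnet: Optional[str] = None,
                  v4_prefix: int = 24) -> Tuple[str, str]:
    """Return (region, reason) for a DNS query."""
    # Step 1: prefer the ECS subnet when the resolver supplies one.
    if ecs_subnet:
        key = str(ipaddress.ip_network(ecs_subnet, strict=False))
    else:
        key = str(ipaddress.ip_network(f"{client_ip}/{v4_prefix}", strict=False))
    # Step 2: look up learned latency for each backend region.
    rtts = table.get(key)
    # Step 4: no data yet (cold start), so use the geolocation answer.
    if not rtts:
        return geo_region, "geo_fallback"
    # Step 3: pick the region with the lowest learned RTT.
    return min(rtts, key=rtts.get), "learned_latency"


if __name__ == "__main__":
    table = {"10.1.2.0/24": {"eu-west": 85, "ap-southeast": 220}}
    print(select_region("10.1.2.57", table, geo_region="ap-southeast"))
    print(select_region("10.9.9.9", table, geo_region="ap-southeast"))
```

The first query resolves from learned data; the second comes from a subnet with no entries and therefore takes the fallback path.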
Troubleshooting
No Latency Data Appearing
# Check agent logs
journalctl -u opengslb | grep "latency"
# Verify CAP_NET_ADMIN on Linux
getcap /usr/local/bin/opengslb
# Should show: cap_net_admin+ep (newer libcap versions print cap_net_admin=ep)
# Verify connections exist
ss -tn | grep ESTAB
Unexpected Routing
# Check what Overwatch sees
curl http://10.1.1.10:8080/api/v1/overwatch/latency | jq .
# Verify domain config
curl http://10.1.1.10:8080/api/v1/domains | jq .
Windows Agent Issues
# Check if running as Administrator
([Security.Principal.WindowsPrincipal] [Security.Principal.WindowsIdentity]::GetCurrent()).IsInRole([Security.Principal.WindowsBuiltInRole] "Administrator")
# Check agent logs
Get-Content C:\opengslb\logs\agent.log | Select-String "latency"
Cleanup
# Destroy all Azure resources
# Run from the demo directory (skip the cd if you are already in terraform/)
cd terraform
terraform destroy
# Or delete the resource group directly
az group delete --name rg-opengslb-latency-test --yes
Cost Estimate
| Resource | Monthly Cost |
|---|---|
| 5x Linux VMs (B2s) | ~$75 |
| 1x Windows VM (B2s) | ~$25 |
| VNet Peering | ~$20 |
| Total | ~$120 |
Tip: Deallocate VMs when not testing to reduce costs.
Next Steps
Review ADR-017: Passive Latency Learning for design rationale
Explore the Configuration Reference for all latency_learning options
Try Demo 3: Latency Routing for comparison with active probing
Key Takeaways
Passive learning captures real client latency, not just Overwatch-to-backend latency
Subnet aggregation prevents unbounded memory growth
Cold-start fallback ensures routing works before data is collected
Cross-platform support: Linux (netlink) and Windows (GetPerTcpConnectionEStats)
No client changes required: works with existing TCP connections