# Demo 6: Passive Latency Learning (ADR-017)

**Level**: Advanced

**Time**: 30-45 minutes (Azure deployment)

**Prerequisites**: Azure subscription, Terraform, SSH key

## Overview

This demo showcases OpenGSLB's **passive latency learning** feature, which learns real client-to-backend latency by reading TCP RTT (Round-Trip Time) data directly from the operating system.

Unlike active latency probing (which measures Overwatch-to-backend latency), passive learning captures the latency your clients actually experience. Agents aggregate this data by client subnet and gossip it to Overwatch nodes, which use it for routing decisions.

### What Makes This Different

| Approach | What It Measures | Perspective |
|----------|------------------|-------------|
| Active Latency (Demo 3) | Overwatch → Backend | Overwatch's vantage point (a proxy for client latency) |
| **Passive Learning (This Demo)** | **Client → Backend** | **Client's actual experience** |

### Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│      Client (10.1.2.x)                                              │
│         │                                                           │
│         │ HTTP Request                                              │
│         ▼                                                           │
│  ┌─────────────┐         ┌─────────────┐         ┌─────────────┐    │
│  │ Backend EU  │◄────────│    Agent    │────────►│  Overwatch  │    │
│  │   (nginx)   │   TCP   │ (RTT: 85ms) │ Gossip  │    (DNS)    │    │
│  └─────────────┘  conn   └─────────────┘         └─────────────┘    │
│                                                         │           │
│  Agent reads TCP_INFO from kernel:                      │           │
│  - tcpi_rtt (smoothed RTT in microseconds)              │           │
│  - Aggregates by /24 subnet                             ▼           │
│  - Reports to Overwatch via gossip                DNS Response:     │
│                                                  "Use EU backend    │
│                                                   (lowest latency)" │
└─────────────────────────────────────────────────────────────────────┘
```

## What You'll Learn

1. How agents collect TCP RTT data from the OS
2. How latency is aggregated by client subnet
3. How Overwatch uses learned latency for routing
4. Cold-start fallback to geolocation routing
5. Cross-platform support (Linux + Windows)

## Prerequisites

- **Azure Subscription** with permissions to create VMs
- **Terraform** >= 1.0 installed
- **SSH Key** for VM access
- **Azure CLI** authenticated (`az login`)

## Quick Start

```bash
# Clone the repository
git clone https://github.com/LoganRossUS/OpenGSLB.git
cd OpenGSLB/demos/demo-6-advanced-passive-latency-learning/terraform

# Configure deployment
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your SSH key path

# Deploy infrastructure (~2 minutes)
terraform init
terraform apply -var='windows_admin_password=YourComplexPassword123!'

# Wait for cloud-init to complete, then validate
ssh azureuser@$(terraform output -raw traffic_eastus_public_ip) test-cluster
```

## Infrastructure Overview

The demo deploys across 3 Azure regions:

| Region | VMs | Purpose |
|--------|-----|---------|
| **East US** | Overwatch, Traffic Generator | DNS authority, simulated clients |
| **West Europe** | Linux Backend, Windows Backend | Application servers with agents |
| **Southeast Asia** | Linux Backend, Traffic Generator | APAC region backend + clients |

### Expected Latencies

| From | To | Expected RTT |
|------|-----|--------------|
| East US | West Europe | ~80-100ms |
| East US | Southeast Asia | ~200-250ms |
| Southeast Asia | Southeast Asia | ~1-5ms |

## Step-by-Step Walkthrough

### Step 1: Verify Deployment

SSH to the traffic generator and run validation:

```bash
ssh azureuser@

# Run cluster validation
test-cluster

# Expected output:
# ✓ Overwatch DNS responding
# ✓ All agents connected
# ✓ Health checks passing
```

### Step 2: Generate Traffic

Generate sustained traffic to populate the latency table:

```bash
# Generate 5 requests/second for 5 minutes
generate-traffic 5 300

# This creates TCP connections to all backends
# Agents read RTT from each connection
```

### Step 3: View Learned Latency Data

Query the Overwatch API to see collected latency data:

```bash
curl http://10.1.1.10:8080/api/v1/overwatch/latency | jq .
```

Expected output:

```json
{
  "entries": [
    {
      "subnet": "10.1.2.0/24",
      "domain": "app.demo.local",
      "region": "eu-west",
      "rtt_ms": 85,
      "samples": 150,
      "last_updated": "2025-12-19T10:05:00Z"
    },
    {
      "subnet": "10.1.2.0/24",
      "domain": "app.demo.local",
      "region": "ap-southeast",
      "rtt_ms": 220,
      "samples": 150,
      "last_updated": "2025-12-19T10:05:00Z"
    }
  ]
}
```

### Step 4: Test Latency-Based Routing

Query DNS and verify that the lowest-latency backend is selected:

```bash
# From East US traffic generator
dig @10.1.1.10 app.demo.local A +short

# Expected: West Europe backend IP (lower latency than Singapore)
```

Check the Overwatch logs for the routing decision:

```bash
ssh azureuser@
journalctl -u opengslb | grep "routing decision"
```

### Step 5: Test Cold-Start Fallback

When no latency data exists, Overwatch falls back to geolocation:

```bash
# Restart Overwatch (clears latency table)
sudo systemctl restart opengslb

# Immediately query DNS
dig @10.1.1.10 app.demo.local A +short

# Check logs - should show geo_fallback
journalctl -u opengslb | grep "geo_fallback"
```

### Step 6: Compare Regions

Traffic from different regions should route to different backends:

```bash
# From East US (connects to EU - lower latency)
ssh azureuser@
dig @10.1.1.10 app.demo.local +short

# From Singapore (connects to Singapore - local)
ssh azureuser@
dig @10.1.1.10 app.demo.local +short
```

## Configuration Deep-Dive

### Agent Latency Learning Config

```yaml
agent:
  latency_learning:
    enabled: true
    poll_interval: 10s        # How often to read TCP stats
    min_connection_age: 5s    # Ignore new connections
    ipv4_prefix: 24           # Aggregate by /24
    ipv6_prefix: 48           # Aggregate by /48
    ewma_alpha: 0.3           # Smoothing factor
    max_subnets: 100000       # Memory limit
    subnet_ttl: 168h          # 7-day retention
    min_samples: 5            # Min samples before reporting
    report_interval: 30s      # Gossip frequency
```

### Domain Configuration for Learned Latency

```yaml
domains:
  - name: app.demo.local
    routing_algorithm: learned_latency  # Use passive learning
    regions:
      - eu-west
      - ap-southeast
    latency_config:
      max_latency_ms: 300     # Exclude high-latency backends
      min_samples: 5          # Require sufficient data
```

## How It Works

### 1. TCP RTT Collection (Agent)

On Linux, agents read `/proc/net/tcp` and use `getsockopt(TCP_INFO)`:

```c
struct tcp_info info;             // declared in <netinet/tcp.h>
socklen_t len = sizeof(info);     // in: buffer size, out: bytes written
getsockopt(sock, IPPROTO_TCP, TCP_INFO, &info, &len);  // sock: a connected TCP socket
// info.tcpi_rtt contains smoothed RTT in microseconds
```

On Windows, agents use `GetPerTcpConnectionEStats()`:

```powershell
# Requires Administrator privileges
# GetPerTcpConnectionEStats is a Win32 API (iphlpapi.dll), not a cmdlet.
# The established connections it would be called on can be listed with:
Get-NetTCPConnection -State Established
```

### 2. Subnet Aggregation

RTT samples are aggregated by client subnet using an exponentially weighted moving average (EWMA):

```
new_rtt = α × sample + (1 - α) × old_rtt
```

With α = 0.3, each new sample contributes 30% of the updated value and the previous estimate contributes 70%, so the estimate follows real changes while damping one-off spikes.

### 3. Gossip to Overwatch

Agents periodically send latency reports:

```json
{
  "type": "latency_report",
  "agent_id": "backend-eu-west",
  "entries": [
    {"subnet": "10.1.2.0/24", "rtt_ms": 85, "samples": 50}
  ]
}
```

### 4. Routing Decision

When a DNS query arrives, Overwatch:

1. Extracts the client IP (or ECS subnet)
2. Looks up the learned latency for each backend
3. Selects the backend with the lowest RTT
4. Falls back to geolocation if no data exists

## Troubleshooting

### No Latency Data Appearing

```bash
# Check agent logs
journalctl -u opengslb | grep "latency"

# Verify CAP_NET_ADMIN on Linux
getcap /usr/local/bin/opengslb
# Should show cap_net_admin with the ep flags
# (printed as cap_net_admin+ep or cap_net_admin=ep, depending on the libcap version)

# Verify connections exist
ss -tn | grep ESTAB
```

### Unexpected Routing

```bash
# Check what Overwatch sees
curl http://10.1.1.10:8080/api/v1/overwatch/latency | jq .

# Verify domain config
curl http://10.1.1.10:8080/api/v1/domains | jq .
```

### Windows Agent Issues

```powershell
# Check if running as Administrator
([Security.Principal.WindowsPrincipal] [Security.Principal.WindowsIdentity]::GetCurrent()).IsInRole([Security.Principal.WindowsBuiltInRole] "Administrator")

# Check agent logs
Get-Content C:\opengslb\logs\agent.log | Select-String "latency"
```

## Cleanup

```bash
# Destroy all Azure resources
cd terraform
terraform destroy

# Or delete the resource group directly
az group delete --name rg-opengslb-latency-test --yes
```

## Cost Estimate

| Resource | Monthly Cost |
|----------|-------------|
| 5x Linux VMs (B2s) | ~$75 |
| 1x Windows VM (B2s) | ~$25 |
| VNet Peering | ~$20 |
| **Total** | **~$120** |

**Tip**: Deallocate VMs when not testing to reduce costs.

## Next Steps

- Review [ADR-017: Passive Latency Learning](../ARCHITECTURE_DECISIONS.md#adr-017-passive-latency-learning-via-os-tcp-statistics) for design rationale
- Explore the [Configuration Reference](../configuration.md) for all latency_learning options
- Try [Demo 3: Latency Routing](demo-3-latency-routing.md) for comparison with active probing

## Key Takeaways

1. **Passive learning captures real client latency**, not just Overwatch-to-backend latency
2. **Subnet aggregation** prevents unbounded memory growth
3. **Cold-start fallback** ensures routing works before data is collected
4. **Cross-platform support**: Linux (netlink) and Windows (GetPerTcpConnectionEStats)
5. **No client changes required**: works with existing TCP connections
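
## Appendix: Aggregation and Routing Sketch

The subnet aggregation and routing decision described under "How It Works" can be illustrated with a short Python sketch. This is not OpenGSLB's implementation: `subnet_key`, `LatencyTable`, `pick_region`, and the `geo_default` argument (which stands in for a real geolocation lookup) are hypothetical names chosen for illustration. Only the parameter values (α = 0.3, /24 and /48 prefixes, `min_samples: 5`, `max_latency_ms: 300`) are taken from the configuration shown in this demo.

```python
import ipaddress

ALPHA = 0.3            # ewma_alpha from the agent config
MIN_SAMPLES = 5        # min_samples from the domain config
MAX_LATENCY_MS = 300   # max_latency_ms from the domain config


def subnet_key(client_ip, ipv4_prefix=24, ipv6_prefix=48):
    """Map a client address to its aggregation subnet (/24 for IPv4, /48 for IPv6)."""
    addr = ipaddress.ip_address(client_ip)
    prefix = ipv4_prefix if addr.version == 4 else ipv6_prefix
    # strict=False masks off the host bits instead of raising an error
    return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))


class LatencyTable:
    """Hypothetical per-(subnet, region) EWMA table, for illustration only."""

    def __init__(self, alpha=ALPHA):
        self.alpha = alpha
        self.entries = {}  # (subnet, region) -> [rtt_ms, sample_count]

    def observe(self, client_ip, region, rtt_ms):
        """Fold one RTT sample into the EWMA for the client's subnet."""
        key = (subnet_key(client_ip), region)
        entry = self.entries.get(key)
        if entry is None:
            # Assumption: the first sample seeds the average directly
            self.entries[key] = [float(rtt_ms), 1]
        else:
            # new_rtt = alpha * sample + (1 - alpha) * old_rtt
            entry[0] = self.alpha * rtt_ms + (1 - self.alpha) * entry[0]
            entry[1] += 1

    def pick_region(self, client_ip, regions, geo_default):
        """Return (region, reason); fall back to geo_default without usable data."""
        subnet = subnet_key(client_ip)
        candidates = []
        for region in regions:
            entry = self.entries.get((subnet, region))
            if entry and entry[1] >= MIN_SAMPLES and entry[0] <= MAX_LATENCY_MS:
                candidates.append((entry[0], region))
        if not candidates:
            return geo_default, "geo_fallback"
        return min(candidates)[1], "learned_latency"


# Usage: two clients in 10.1.2.0/24 contribute samples for both regions
table = LatencyTable()
for _ in range(5):
    table.observe("10.1.2.17", "eu-west", 85)
    table.observe("10.1.2.200", "ap-southeast", 220)

regions = ["eu-west", "ap-southeast"]
print(table.pick_region("10.1.2.99", regions, "eu-west"))      # learned data exists
print(table.pick_region("10.9.9.9", regions, "ap-southeast"))  # unknown subnet
```

A third client in the same /24 (`10.1.2.99`) benefits from samples collected for its neighbours, which is the point of aggregating by subnet. A client from a subnet with no entries takes the cold-start path and receives the geolocation default.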