Incident Response Playbook

This document provides a general framework for responding to OpenGSLB incidents.

Incident Severity Levels

Level	Description	Response Time	Examples
SEV1	Complete service outage	Immediate	All Overwatches down, DNS not resolving
SEV2	Partial outage or degraded service	15 minutes	Single region down, high error rate
SEV3	Minor issues, limited impact	1 hour	Single backend unhealthy, config warning
SEV4	Informational, no user impact	Next business day	Metric anomaly, minor alert

General Response Framework

Phase 1: Detection and Assessment (0-5 minutes)

Acknowledge the alert
- Note alert time and source
- Assign incident owner
Initial assessment
- What is the user impact?
- How many users/services affected?
- Is this a known issue?

Gather basic information

# Quick health check
opengslb-cli status --api http://overwatch:9090

# Check all Overwatches
for host in overwatch-{1,2,3}; do
    echo "=== $host ==="
    # --max-time keeps the loop from hanging on an unreachable host
    curl -s --max-time 5 "http://${host}:9090/api/v1/ready"
    echo
done

# Test DNS
dig @overwatch-1 myapp.gslb.example.com
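
The per-host readiness checks above can be condensed into a count that feeds the severity decision. The following is a minimal sketch, not part of OpenGSLB: `count_ready` and `suggest_severity` are hypothetical helper names, the sketch assumes `/api/v1/ready` returns a 2xx status only when the node is ready, and treating any partial Overwatch outage as SEV2 is an assumption layered on top of the severity table.

```shell
# Count how many of the given Overwatch hosts report ready.
# Assumption: /api/v1/ready answers with a 2xx status only when the node is ready.
count_ready() {
    local ready=0 host
    for host in "$@"; do
        if curl -sf --max-time 5 -o /dev/null "http://${host}:9090/api/v1/ready"; then
            ready=$((ready + 1))
        fi
    done
    echo "$ready"
}

# Map ready/total counts onto the severity table.
# Assumption: any partial Overwatch outage is treated as SEV2.
suggest_severity() {
    local ready=$1 total=$2
    if [ "$ready" -eq 0 ]; then
        echo "SEV1"
    elif [ "$ready" -lt "$total" ]; then
        echo "SEV2"
    else
        echo "OK"
    fi
}
```

For example, `suggest_severity "$(count_ready overwatch-1 overwatch-2 overwatch-3)" 3` prints a suggested level. Treat the result as a starting point for triage, not a replacement for judging user impact.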

Phase 2: Triage and Communication (5-15 minutes)

Determine severity level
- Use the criteria in the Incident Severity Levels table
- Escalate if SEV1/SEV2
Notify stakeholders
- Internal status channel
- On-call escalation if needed
- Customer communication for SEV1/SEV2
Document initial findings
- Symptoms observed
- Timeline of events
- Initial hypothesis
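
Recording findings as they happen makes the post-incident timeline much easier to reconstruct. The sketch below appends UTC-timestamped entries in the same `- HH:MM - event` shape used by the review template. `log_event` and the `INCIDENT_LOG` variable are hypothetical names chosen for illustration, not OpenGSLB features.

```shell
# Assumed location for the notes file; override INCIDENT_LOG as needed.
INCIDENT_LOG="${INCIDENT_LOG:-./incident-notes.md}"

# Append one timeline entry, stamped with the current UTC time.
log_event() {
    printf '%s %s - %s\n' '-' "$(date -u +%H:%M)" "$*" >> "$INCIDENT_LOG"
}

# Example usage:
# log_event "Alert fired: DNS error rate high"
# log_event "Hypothesis: recent config reload"
```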

Phase 3: Mitigation (15-60 minutes)

Apply temporary fix if available
- Traffic redirect
- Override unhealthy backends
- Roll back if a recent change caused the issue

Monitor mitigation effectiveness

# Watch backend status, refreshing every 5 seconds
watch -n5 'opengslb-cli servers --api http://overwatch:9090'

Continue investigation
- Root cause analysis
- Collect logs and metrics
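
Logs and status output are easier to analyze later if they are captured while the incident is still live. The following is one possible approach, assuming a local evidence directory is acceptable. `capture` and `EVIDENCE_DIR` are hypothetical names, and the commands in the usage comments are the ones already listed in this playbook.

```shell
# Assumed location for collected output; override EVIDENCE_DIR as needed.
EVIDENCE_DIR="${EVIDENCE_DIR:-./incident-evidence}"

# Run a command and save its stdout and stderr to <EVIDENCE_DIR>/<name>.txt.
capture() {
    local name=$1
    shift
    mkdir -p "$EVIDENCE_DIR"
    # Keep going if the command fails; the failure itself is worth recording.
    "$@" > "${EVIDENCE_DIR}/${name}.txt" 2>&1 \
        || echo "exit status: $?" >> "${EVIDENCE_DIR}/${name}.txt"
}

# Example usage:
# capture errors  journalctl -u opengslb-overwatch -p err --since "1 hour ago" --no-pager
# capture servers opengslb-cli servers --api http://localhost:9090
```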

Phase 4: Resolution (Variable)

Implement permanent fix
- Configuration change
- Code fix deployment
- Infrastructure repair
Verify fix
- Functional testing
- Monitor for recurrence
Stand down
- Clear incident status
- Notify stakeholders
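
The "Monitor for recurrence" step under Verify fix can be made concrete by requiring several consecutive successful checks before standing down. This is a sketch under stated assumptions: `require_consecutive_ok` is a hypothetical helper, and the give-up limit of five times the required count is an arbitrary choice.

```shell
# Succeed once a check command passes N times in a row; any failure resets the streak.
# Usage: require_consecutive_ok <needed> <interval_seconds> <command> [args...]
require_consecutive_ok() {
    local needed=$1 interval=$2
    shift 2
    local streak=0 attempts=0
    local max_attempts=$((needed * 5))   # assumed give-up limit
    while [ "$attempts" -lt "$max_attempts" ]; do
        attempts=$((attempts + 1))
        if "$@" > /dev/null 2>&1; then
            streak=$((streak + 1))
        else
            streak=0
        fi
        if [ "$streak" -ge "$needed" ]; then
            return 0
        fi
        sleep "$interval"
    done
    return 1
}

# Example: require 5 non-empty DNS answers, 10 seconds apart.
# dig exits 0 even when the answer section is empty, so test the output instead:
# require_consecutive_ok 5 10 sh -c 'test -n "$(dig @overwatch-1 myapp.gslb.example.com +short)"'
```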

Phase 5: Post-Incident (Within 48 hours)

Conduct post-mortem
- What happened?
- Why did it happen?
- How was it detected?
- How was it resolved?
- How do we prevent it?
Create action items
- Immediate fixes
- Long-term improvements
- Monitoring enhancements
Update documentation
- Runbooks
- Alert thresholds
- Architecture documentation

Quick Reference Commands

Health Assessment

# Overall status
opengslb-cli status --api http://localhost:9090

# Backend health
opengslb-cli servers --api http://localhost:9090

# Domain configuration
opengslb-cli domains --api http://localhost:9090

# Check overrides
opengslb-cli overrides list --api http://localhost:9090

DNS Testing

# Query all Overwatches
for ow in 10.0.1.{53,54,55}; do
    echo "=== $ow ==="
    dig @$ow myapp.gslb.example.com +short
done

# DNSSEC check: +dnssec requests RRSIG records with the answer; it does not validate them
dig @overwatch myapp.gslb.example.com +dnssec

# Query with a specific client subnet via EDNS Client Subnet (for geo testing)
dig @overwatch myapp.gslb.example.com +subnet=8.8.8.8/32

Log Analysis

# Recent errors
journalctl -u opengslb-overwatch -p err --since "1 hour ago"

# Follow logs
journalctl -u opengslb-overwatch -f

# Search the last hour for specific patterns (case-insensitive)
journalctl -u opengslb-overwatch --since "1 hour ago" | grep -iE "error|fail|timeout"

Emergency Actions

# Mark backend unhealthy (immediate traffic diversion)
opengslb-cli overrides set myapp 10.0.1.10:8080 \
    --healthy=false \
    --reason="Emergency override during incident" \
    --api http://localhost:9090

# Clear override (restore traffic)
opengslb-cli overrides clear myapp 10.0.1.10:8080 \
    --api http://localhost:9090

# Force validation
curl -X POST http://localhost:9090/api/v1/overwatch/validate

# Reload configuration
sudo systemctl reload opengslb-overwatch

Metrics Queries

# Current query rate
sum(rate(opengslb_dns_queries_total[5m]))

# Error rate
sum(rate(opengslb_dns_queries_total{status!="success"}[5m])) / sum(rate(opengslb_dns_queries_total[5m]))

# Healthy backends
opengslb_overwatch_backends_healthy

# Stale agents
opengslb_overwatch_stale_agents
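
When no dashboard is at hand, the queries above can be run from a terminal through the Prometheus HTTP API. This is a minimal sketch: `prom_query` is a hypothetical helper, and the `PROM_URL` default is a placeholder for wherever Prometheus runs in your environment.

```shell
# Assumption: Prometheus is reachable at PROM_URL; the default here is a placeholder.
PROM_URL="${PROM_URL:-http://prometheus:9090}"

# Run an instant PromQL query and print the raw JSON response.
prom_query() {
    curl -sG --max-time 10 "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Example usage:
# prom_query 'sum(rate(opengslb_dns_queries_total[5m]))'
# prom_query 'opengslb_overwatch_stale_agents'
```

Using `--data-urlencode` avoids hand-escaping the braces, quotes, and brackets that PromQL expressions contain.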

Incident Response Contacts

Role	Contact	Escalation Path
On-Call Engineer	[Your contact info]	Page via PagerDuty
Platform Lead	[Contact]	Phone for SEV1
Security	[Contact]	For security incidents

Common Scenarios

For detailed procedures on specific incidents, see the scenario-specific runbooks.

Runbook Template

Use this template for documenting new scenarios:

# [Scenario Name]

## Symptoms
- What alerts fire?
- What do users experience?

## Impact
- Severity level
- Affected services/users

## Diagnosis
- Commands to run
- What to look for

## Resolution
1. Step-by-step fix

## Prevention
- How to avoid recurrence

## Related
- Links to relevant docs

Post-Incident Review Template

# Incident Review: [Title]

**Date**: YYYY-MM-DD
**Duration**: X hours
**Severity**: SEVN
**Owner**: [Name]

## Summary
Brief description of what happened.

## Timeline
- HH:MM - Alert fired
- HH:MM - Acknowledged
- HH:MM - Root cause identified
- HH:MM - Fix applied
- HH:MM - Incident resolved

## Root Cause
Detailed explanation of why it happened.

## Impact
- Services affected
- Users affected
- Duration

## Resolution
What was done to fix it.

## Action Items
| Item | Owner | Due Date | Status |
|------|-------|----------|--------|
| Fix X | Person | Date | Open |

## Lessons Learned
- What went well?
- What could be improved?
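
The Duration field in the review template can be derived from the first and last Timeline entries. The helper below is an illustrative sketch: `incident_duration` is a hypothetical name, and it assumes `HH:MM` inputs and an incident that crosses midnight at most once.

```shell
# Compute the elapsed time between two HH:MM timestamps.
incident_duration() {
    local start=$1 end=$2
    local s_h=${start%%:*} s_m=${start##*:}
    local e_h=${end%%:*} e_m=${end##*:}
    # Strip one leading zero so values such as 08 and 09 are not parsed as octal.
    local start_min=$(( ${s_h#0} * 60 + ${s_m#0} ))
    local end_min=$(( ${e_h#0} * 60 + ${e_m#0} ))
    local diff=$(( end_min - start_min ))
    # A negative difference means the incident ran past midnight.
    if [ "$diff" -lt 0 ]; then
        diff=$(( diff + 1440 ))
    fi
    echo "$(( diff / 60 ))h $(( diff % 60 ))m"
}

# Example usage:
# incident_duration 09:12 11:47
```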