Incident Response Playbook

This document provides a general framework for responding to OpenGSLB incidents.

Incident Severity Levels

Level

Description

Response Time

Examples

SEV1

Complete service outage

Immediate

All Overwatches down, DNS not resolving

SEV2

Partial outage or degraded service

15 minutes

Single region down, high error rate

SEV3

Minor issues, limited impact

1 hour

Single backend unhealthy, config warning

SEV4

Informational, no user impact

Next business day

Metric anomaly, minor alert

General Response Framework

Phase 1: Detection and Assessment (0-5 minutes)

  1. Acknowledge the alert

    • Note alert time and source

    • Assign incident owner

  2. Initial assessment

    • What is the user impact?

    • How many users/services affected?

    • Is this a known issue?

  3. Gather basic information

    # Quick health check
    opengslb-cli status --api http://overwatch:9090
    
    # Check all Overwatches
    for host in overwatch-{1,2,3}; do
        echo "=== $host ==="
        curl -s http://${host}:9090/api/v1/ready
    done
    
    # Test DNS
    dig @overwatch-1 myapp.gslb.example.com
    

Phase 2: Triage and Communication (5-15 minutes)

  1. Determine severity level

    • Use criteria above

    • Escalate if SEV1/SEV2

  2. Notify stakeholders

    • Internal status channel

    • On-call escalation if needed

    • Customer communication for SEV1/SEV2

  3. Document initial findings

    • Symptoms observed

    • Timeline of events

    • Initial hypothesis

Phase 3: Mitigation (15-60 minutes)

  1. Apply temporary fix if available

    • Traffic redirect

    • Override unhealthy backends

    • Rollback if recent change caused issue

  2. Monitor mitigation effectiveness

    # Watch key metrics
    watch -n5 'opengslb-cli servers --api http://overwatch:9090'
    
  3. Continue investigation

    • Root cause analysis

    • Collect logs and metrics

Phase 4: Resolution (Variable)

  1. Implement permanent fix

    • Configuration change

    • Code fix deployment

    • Infrastructure repair

  2. Verify fix

    • Functional testing

    • Monitor for recurrence

  3. Stand down

    • Clear incident status

    • Notify stakeholders

Phase 5: Post-Incident (Within 48 hours)

  1. Conduct post-mortem

    • What happened?

    • Why did it happen?

    • How was it detected?

    • How was it resolved?

    • How do we prevent it?

  2. Create action items

    • Immediate fixes

    • Long-term improvements

    • Monitoring enhancements

  3. Update documentation

    • Runbooks

    • Alert thresholds

    • Architecture documentation

Quick Reference Commands

Health Assessment

# Overall status
opengslb-cli status --api http://localhost:9090

# Backend health
opengslb-cli servers --api http://localhost:9090

# Domain configuration
opengslb-cli domains --api http://localhost:9090

# Check overrides
opengslb-cli overrides list --api http://localhost:9090

DNS Testing

# Query all Overwatches
for ow in 10.0.1.{53,54,55}; do
    echo "=== $ow ==="
    dig @$ow myapp.gslb.example.com +short
done

# DNSSEC validation
dig @overwatch myapp.gslb.example.com +dnssec

# Query with specific client IP (for geo testing)
dig @overwatch myapp.gslb.example.com +subnet=8.8.8.8/32

Log Analysis

# Recent errors
journalctl -u opengslb-overwatch -p err --since "1 hour ago"

# Follow logs
journalctl -u opengslb-overwatch -f

# Search for specific patterns
journalctl -u opengslb-overwatch | grep -E "(error|fail|timeout)"

Emergency Actions

# Mark backend unhealthy (immediate traffic diversion)
opengslb-cli overrides set myapp 10.0.1.10:8080 \
    --healthy=false \
    --reason="Emergency override during incident" \
    --api http://localhost:9090

# Clear override (restore traffic)
opengslb-cli overrides clear myapp 10.0.1.10:8080 \
    --api http://localhost:9090

# Force validation
curl -X POST http://localhost:9090/api/v1/overwatch/validate

# Reload configuration
sudo systemctl reload opengslb-overwatch

Metrics Queries

# Current query rate
sum(rate(opengslb_dns_queries_total[5m]))

# Error rate
sum(rate(opengslb_dns_queries_total{status!="success"}[5m])) / sum(rate(opengslb_dns_queries_total[5m]))

# Healthy backends
opengslb_overwatch_backends_healthy

# Stale agents
opengslb_overwatch_stale_agents

Incident Response Contacts

Role

Contact

Escalation Path

On-Call Engineer

[Your contact info]

Page via PagerDuty

Platform Lead

[Contact]

Phone for SEV1

Security

[Contact]

For security incidents

Common Scenarios

For detailed procedures on specific incidents, see:

Runbook Template

Use this template for documenting new scenarios:

# [Scenario Name]

## Symptoms
- What alerts fire?
- What do users experience?

## Impact
- Severity level
- Affected services/users

## Diagnosis
- Commands to run
- What to look for

## Resolution
1. Step-by-step fix

## Prevention
- How to avoid recurrence

## Related
- Links to relevant docs

Post-Incident Review Template

# Incident Review: [Title]

**Date**: YYYY-MM-DD
**Duration**: X hours
**Severity**: SEVN
**Owner**: [Name]

## Summary
Brief description of what happened.

## Timeline
- HH:MM - Alert fired
- HH:MM - Acknowledged
- HH:MM - Root cause identified
- HH:MM - Fix applied
- HH:MM - Incident resolved

## Root Cause
Detailed explanation of why it happened.

## Impact
- Services affected
- Users affected
- Duration

## Resolution
What was done to fix it.

## Action Items
| Item | Owner | Due Date | Status |
|------|-------|----------|--------|
| Fix X | Person | Date | Open |

## Lessons Learned
- What went well?
- What could be improved?