Chapter 12. Operations & Maintenance

Day-2 operations, monitoring, incident response, and lifecycle management for the authorization system


Day-2 operations, the ongoing management of the authorization system after initial deployment, make up the longest and most labor-intensive phase of the system lifecycle. A well-designed authorization system that is poorly operated will gradually degrade in both security posture and performance as roles proliferate, stale bindings pile up, and monitoring gaps go undetected. This chapter provides the operational procedures, monitoring specifications, incident response playbooks, and maintenance schedules required to keep the authorization system secure, performant, and compliant over its operational lifetime.

12.1 Operations Monitoring Dashboard

Figure 12.1: Authorization System 24/7 Operations Center. SOC monitoring setup showing PDP Engine at 99.99% uptime, Redis Cache hit rate 94%, API Gateway latency 12ms p95, and Audit Pipeline at 0 lag, with real-time Grafana metrics, an alert management console, and analyst workstations for continuous monitoring.

12.2 Key Operational Metrics & SLOs

The authorization system must be monitored against a defined set of Service Level Objectives (SLOs) that reflect the system's operational health from both a performance and security perspective. The following table defines the key operational metrics, their SLO targets, alert thresholds, and the on-call response priority for each metric breach.

| Metric | SLO Target | Warning Threshold | Critical Threshold | On-Call Priority |
| --- | --- | --- | --- | --- |
| PDP Availability | 99.99% per month | <99.99% (any 5-min window) | <99.9% (any 5-min window) | P1 (immediate page) |
| Decision Latency p99 | <50ms at normal load | >50ms p99 for 2 consecutive minutes | >100ms p99 for 1 minute | P2 (15-min response) |
| Cache Hit Rate | >90% | <90% for 5 consecutive minutes | <70% for 2 consecutive minutes | P2 (15-min response) |
| Audit Event Lag | <60 seconds end-to-end | >60s lag for 3 consecutive minutes | >5 min lag or consumer stopped | P1 (immediate page) |
| Error Rate | <0.1% of decisions | >0.1% error rate for 2 minutes | >1% error rate for 1 minute | P1 (immediate page) |
| IdP Sync Lag | <15 minutes | >15 min since last successful sync | >60 min since last successful sync | P2 (15-min response) |
| Orphan Account Count | 0 accounts | Any orphan account detected | Orphan account active >48 hours | P2 (15-min response) |
| SoD Violation Count | 0 active violations | Any new SoD violation detected | SoD violation unresolved >24 hours | P1 (immediate page) |
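
The sustained-duration thresholds above ("for 2 consecutive minutes", "for 5 consecutive minutes") can be illustrated with a small evaluator. This is a minimal sketch, not the system's actual alerting configuration: it assumes one sample per minute, and the names `Rule`, `sustained`, and `classify` are hypothetical. In practice this logic would more likely live in the alerting rules of the monitoring stack.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Rule:
    """A threshold that must be breached for N consecutive samples."""
    breached: Callable[[float], bool]  # True when a single sample breaches
    sustained_minutes: int             # assumes one sample per minute


def sustained(samples: Sequence[float], rule: Rule) -> bool:
    """True if the most recent `sustained_minutes` samples all breach."""
    n = rule.sustained_minutes
    if len(samples) < n:
        return False  # not enough history to call it sustained
    return all(rule.breached(s) for s in samples[-n:])


def classify(samples: Sequence[float], warning: Rule, critical: Rule) -> str:
    """Return 'critical', 'warning', or 'ok'; critical takes precedence."""
    if sustained(samples, critical):
        return "critical"
    if sustained(samples, warning):
        return "warning"
    return "ok"


# Thresholds copied from the table: decision latency p99 in milliseconds.
LATENCY_WARNING = Rule(lambda ms: ms > 50, sustained_minutes=2)
LATENCY_CRITICAL = Rule(lambda ms: ms > 100, sustained_minutes=1)

# Thresholds copied from the table: cache hit rate as a fraction.
CACHE_WARNING = Rule(lambda rate: rate < 0.90, sustained_minutes=5)
CACHE_CRITICAL = Rule(lambda rate: rate < 0.70, sustained_minutes=2)

if __name__ == "__main__":
    print(classify([40, 55, 60], LATENCY_WARNING, LATENCY_CRITICAL))
    print(classify([0.95, 0.65, 0.60], CACHE_WARNING, CACHE_CRITICAL))
```

Checking the critical rule first means a short, severe breach is never masked by the longer warning window.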

12.3 Maintenance Schedule

Regular maintenance activities are essential for preventing the gradual degradation of authorization system security and performance. The following maintenance schedule defines the recurring activities, their frequency, responsible team, and estimated effort. All maintenance activities should be tracked in the ITSM platform and completed within the defined maintenance windows.

| Maintenance Activity | Frequency | Responsible Team | Est. Effort | Notes |
| --- | --- | --- | --- | --- |
| Periodic Access Review | Quarterly (or monthly for privileged roles) | IAM Team + Business Owners | 2–5 days (depends on user count) | Use the Access Review Planner calculator (Chapter 9) to estimate effort |
| Role Catalog Cleanup | Semi-annually | IAM Team | 1–2 days | Remove unused roles; merge duplicate roles; update role descriptions |
| Security Patch Application | Monthly (critical patches: within 72h) | Platform Engineering | 2–4 hours per patch cycle | Test in staging before production; maintain rollback plan |
| Penetration Test | Annually (or after major changes) | Security Team + External Pentesters | 1–2 weeks | Include authorization-specific test cases; remediate all Critical/High findings |
| Disaster Recovery Drill | Semi-annually | Platform Engineering + IAM Team | 4–8 hours | Test full failover; verify RTO/RPO; update runbooks based on drill findings |
| Compliance Control Review | Annually (or per audit cycle) | Security & Compliance Team | 3–5 days | Map controls to current compliance requirements; identify gaps; update control evidence |
| Audit Log Retention Review | Annually | Security Team | 1 day | Verify retention meets compliance requirements; archive or purge expired logs per policy |
| Certificate Renewal | Before expiry (cert-manager automates; manual check quarterly) | Platform Engineering | 1–2 hours | Monitor certificate expiry; automate renewal with cert-manager; alert 30 days before expiry |
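
The quarterly manual certificate check can be backed by a simple script that applies the 30-day alert window from the schedule. The sketch below is illustrative only: it assumes the expiry dates have already been collected from wherever certificates are stored, and the function name `certs_needing_renewal` and the certificate names are hypothetical. It does not call cert-manager.

```python
from datetime import datetime, timedelta, timezone

# Alert window taken from the maintenance schedule: 30 days before expiry.
ALERT_WINDOW = timedelta(days=30)


def certs_needing_renewal(
    expiries: dict[str, datetime], now: datetime
) -> list[str]:
    """Return names of certificates that expire within the alert window.

    Certificates that have already expired are included as well, since
    they also need attention.
    """
    return sorted(
        name
        for name, not_after in expiries.items()
        if not_after - now <= ALERT_WINDOW
    )


if __name__ == "__main__":
    now = datetime(2025, 1, 1, tzinfo=timezone.utc)
    # Hypothetical inventory; real values would come from the cert store.
    inventory = {
        "pdp-tls": now + timedelta(days=10),
        "gateway-tls": now + timedelta(days=90),
        "redis-tls": now - timedelta(days=1),
    }
    for name in certs_needing_renewal(inventory, now):
        print(f"renew soon: {name}")
```

Passing `now` explicitly rather than reading the clock inside the function keeps the check deterministic and easy to test.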

12.4 Incident Response Playbooks

| Incident Type | Priority | Immediate Actions (0–15 min) | Investigation (15–60 min) | Resolution & Post-Incident |
| --- | --- | --- | --- | --- |
| PDP Engine Outage | P1 | Page on-call; check pod status; verify Redis connectivity; check recent deployments | Review pod logs; check resource limits; verify network policies; check dependency health | Restart pods if OOMKilled; rollback if recent deployment; post-incident review within 48h |
| Privilege Escalation Detected | P1 | Immediately revoke affected user's sessions and bindings; notify CISO; preserve evidence | Trace escalation path in audit logs; identify attack vector; assess scope of unauthorized access | Patch vulnerability; conduct forensic review; notify affected parties per breach policy; update controls |
| Audit Log Gap Detected | P1 | Alert security team; check Kafka consumer status; verify audit pipeline health | Determine gap duration and scope; check for tampering indicators; review pipeline logs | Restore pipeline; document gap for compliance; assess regulatory notification requirements |
| High Latency (SLA Breach) | P2 | Check PDP metrics; verify Redis hit rate; check for traffic spike | Profile slow decisions; check for policy complexity increase; review recent changes | Scale PDP if needed; optimize policies; tune cache TTL; update capacity plan |

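
For the "Audit Log Gap Detected" playbook, the investigation step of determining gap duration and scope can be approximated by scanning event timestamps for unusually long silences. This is a rough sketch under stated assumptions: it assumes audit events normally arrive continuously, so a silence longer than some threshold is suspicious, and the function name `find_gaps` and the 5-minute threshold in the example are illustrative choices rather than values prescribed by the playbook.

```python
from datetime import datetime, timedelta


def find_gaps(
    timestamps: list[datetime], max_silence: timedelta
) -> list[tuple[datetime, datetime, timedelta]]:
    """Return (last_event_before, first_event_after, duration) per gap.

    A gap is any pair of consecutive events that are further apart than
    `max_silence`. Input does not need to be sorted.
    """
    ordered = sorted(timestamps)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        silence = curr - prev
        if silence > max_silence:
            gaps.append((prev, curr, silence))
    return gaps


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, 12, 0)
    # Hypothetical event times: a silence between 12:01 and 12:09.
    events = [
        t0,
        t0 + timedelta(minutes=1),
        t0 + timedelta(minutes=9),
        t0 + timedelta(minutes=10),
    ]
    for start, end, duration in find_gaps(events, timedelta(minutes=5)):
        print(f"gap from {start} to {end} ({duration})")
```

A timestamp-based scan cannot distinguish a quiet period from lost events. If the audit pipeline assigns sequence numbers, checking for missing sequence numbers would be a stronger signal.
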
Continuous Improvement: The authorization system should be treated as a living system that evolves with the organization's security requirements. Establish a quarterly review cadence to assess whether the current permission model still aligns with the organization's risk appetite, compliance obligations, and operational needs. Key inputs to this review include access review findings, incident post-mortems, compliance audit results, and feedback from business owners and end users. Use the findings to drive incremental improvements to the permission model, role catalog, and operational procedures.