Chapter 12. Operations & Maintenance

Day-2 operations, monitoring, incident response, and lifecycle management for the authorization system

Day-2 operations, the ongoing management of the authorization system after initial deployment, are the longest and most demanding phase of the system lifecycle. A well-designed authorization system that is poorly operated will gradually degrade in both security posture and performance as roles proliferate, stale bindings accumulate, and monitoring gaps go undetected. This chapter provides the operational procedures, monitoring specifications, incident response playbooks, and maintenance schedules required to keep the authorization system secure, performant, and compliant over its operational lifetime.

12.1 Operations Monitoring Dashboard

Figure 12.1: Authorization System 24/7 Operations Center — SOC monitoring setup showing PDP Engine at 99.99% uptime, Redis Cache hit rate 94%, API Gateway latency 12ms p95, Audit Pipeline at 0 lag, with real-time Grafana metrics, alert management console, and analyst workstations for continuous monitoring

12.2 Key Operational Metrics & SLOs

The authorization system must be monitored against a defined set of Service Level Objectives (SLOs) that reflect the system's operational health from both a performance and security perspective. The following table defines the key operational metrics, their SLO targets, alert thresholds, and the on-call response priority for each metric breach.

Metric	SLO Target	Warning Threshold	Critical Threshold	On-Call Priority
PDP Availability	99.99% per month	<99.99% (any 5-min window)	<99.9% (any 5-min window)	P1 — Immediate page
Decision Latency p99	<50ms at normal load	>50ms p99 for 2 consecutive minutes	>100ms p99 for 1 minute	P2 — 15-min response
Cache Hit Rate	>90%	<90% for 5 consecutive minutes	<70% for 2 consecutive minutes	P2 — 15-min response
Audit Event Lag	<60 seconds end-to-end	>60s lag for 3 consecutive minutes	>5 min lag or consumer stopped	P1 — Immediate page
Error Rate	<0.1% of decisions	>0.1% error rate for 2 minutes	>1% error rate for 1 minute	P1 — Immediate page
IdP Sync Lag	<15 minutes	>15 min since last successful sync	>60 min since last successful sync	P2 — 15-min response
Orphan Account Count	0 accounts	Any orphan account detected	Orphan account active >48 hours	P2 — 15-min response
SoD Violation Count	0 active violations	Any new SoD violation detected	SoD violation unresolved >24 hours	P1 — Immediate page
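Most thresholds in the table combine a value condition with a duration ("for N consecutive minutes"). The following is a minimal sketch of how such a rule could be evaluated, assuming one metric sample per minute with the newest sample last. The names `Rule`, `fires`, and `classify` are hypothetical and do not belong to any particular monitoring product; in practice this logic would usually live in the alerting system itself.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Rule:
    """One threshold cell from the SLO table (hypothetical structure)."""
    breached: Callable[[float], bool]  # True when a single sample violates the threshold
    sustained_minutes: int             # consecutive per-minute samples required to fire


def fires(rule: Rule, samples: Sequence[float]) -> bool:
    """Return True if the most recent `sustained_minutes` samples all breach."""
    n = rule.sustained_minutes
    if len(samples) < n:
        return False  # not enough history to claim a sustained breach
    return all(rule.breached(value) for value in samples[-n:])


def classify(samples: Sequence[float], warning: Rule, critical: Rule) -> str:
    """Critical takes precedence over warning; otherwise the metric is ok."""
    if fires(critical, samples):
        return "critical"
    if fires(warning, samples):
        return "warning"
    return "ok"


# Cache Hit Rate row: warning <90% for 5 minutes, critical <70% for 2 minutes.
CACHE_WARNING = Rule(breached=lambda v: v < 90.0, sustained_minutes=5)
CACHE_CRITICAL = Rule(breached=lambda v: v < 70.0, sustained_minutes=2)

if __name__ == "__main__":
    recent_hit_rates = [95.0, 89.0, 88.0, 87.0, 86.0, 85.0]  # illustrative data
    print(classify(recent_hit_rates, CACHE_WARNING, CACHE_CRITICAL))
```

Requiring every sample in the window to breach keeps a single noisy minute from paging the on-call engineer, which matches the "consecutive minutes" wording of the thresholds.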

12.3 Maintenance Schedule

Regular maintenance activities are essential for preventing the gradual degradation of authorization system security and performance. The following maintenance schedule defines the recurring activities, their frequency, responsible team, and estimated effort. All maintenance activities should be tracked in the ITSM platform and completed within the defined maintenance windows.

Maintenance Activity	Frequency	Responsible Team	Est. Effort	Notes
Periodic Access Review	Quarterly (or monthly for privileged roles)	IAM Team + Business Owners	2–5 days (depends on user count)	Use Access Review Planner calculator (Chapter 9) to estimate effort
Role Catalog Cleanup	Semi-annually	IAM Team	1–2 days	Remove unused roles; merge duplicate roles; update role descriptions
Security Patch Application	Monthly (critical patches: within 72h)	Platform Engineering	2–4 hours per patch cycle	Test in staging before production; maintain rollback plan
Penetration Test	Annually (or after major changes)	Security Team + External Pentesters	1–2 weeks	Include authorization-specific test cases; remediate all Critical/High findings
Disaster Recovery Drill	Semi-annually	Platform Engineering + IAM Team	4–8 hours	Test full failover; verify RTO/RPO; update runbooks based on drill findings
Compliance Control Review	Annually (or per audit cycle)	Security & Compliance Team	3–5 days	Map controls to current compliance requirements; identify gaps; update control evidence
Audit Log Retention Review	Annually	Security Team	1 day	Verify retention meets compliance requirements; archive or purge expired logs per policy
Certificate Renewal	Before expiry (automated by cert-manager; manual check quarterly)	Platform Engineering	1–2 hours	Monitor certificate expiry; alert 30 days before expiry; keep renewal automated with cert-manager
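The 30-day certificate alert in the schedule above can be illustrated with a small check for the quarterly manual review. This is a sketch under stated assumptions: the expiry timestamps are supplied by the caller (how they are obtained from the cluster is out of scope), and the function name and certificate names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

ALERT_WINDOW = timedelta(days=30)  # "alert 30 days before expiry"


def expiring_certificates(
    not_after_by_name: Dict[str, datetime],
    now: datetime,
    window: timedelta = ALERT_WINDOW,
) -> List[Tuple[str, int]]:
    """Return (name, whole days remaining) for certificates inside the alert window.

    Already-expired certificates are included with a negative day count.
    Results are sorted so the most urgent certificate comes first.
    """
    due = []
    for name, not_after in not_after_by_name.items():
        remaining = not_after - now
        if remaining <= window:
            due.append((name, remaining.days))
    return sorted(due, key=lambda item: item[1])


if __name__ == "__main__":
    utc = timezone.utc
    # Hypothetical certificate names and expiry dates, for illustration only.
    certs = {
        "pdp-tls": datetime(2025, 1, 20, tzinfo=utc),
        "gateway-tls": datetime(2025, 6, 1, tzinfo=utc),
    }
    for name, days in expiring_certificates(certs, now=datetime(2025, 1, 1, tzinfo=utc)):
        print(f"{name}: {days} days remaining")
```

Passing `now` explicitly rather than reading the clock inside the function keeps the check deterministic and easy to test.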

12.4 Incident Response Playbooks

Incident Type	Priority	Immediate Actions (0–15 min)	Investigation (15–60 min)	Resolution & Post-Incident
PDP Engine Outage	P1	Page on-call; check pod status; verify Redis connectivity; check recent deployments	Review pod logs; check resource limits; verify network policies; check dependency health	Restart pods if OOMKilled; rollback if recent deployment; post-incident review within 48h
Privilege Escalation Detected	P1	Immediately revoke affected user's sessions and bindings; notify CISO; preserve evidence	Trace escalation path in audit logs; identify attack vector; assess scope of unauthorized access	Patch vulnerability; conduct forensic review; notify affected parties per breach policy; update controls
Audit Log Gap Detected	P1	Alert security team; check Kafka consumer status; verify audit pipeline health	Determine gap duration and scope; check for tampering indicators; review pipeline logs	Restore pipeline; document gap for compliance; assess regulatory notification requirements
High Latency (SLO Breach)	P2	Check PDP metrics; verify Redis cache hit rate; check for traffic spike	Profile slow decisions; check for policy complexity increase; review recent changes	Scale PDP if needed; optimize policies; tune cache TTL; update capacity plan
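For the Audit Log Gap Detected playbook above, the "determine gap duration and scope" step can be sketched as a scan over event timestamps. This sketch assumes the audit pipeline emits at least one event or heartbeat per `max_gap` interval during normal operation, so that silence longer than that is suspicious. The 60-second default and the function name `find_gaps` are illustrative assumptions, not values or APIs taken from the audit pipeline.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Gap = Tuple[datetime, datetime, timedelta]  # (last event before gap, first event after, duration)


def find_gaps(
    timestamps: Iterable[datetime],
    max_gap: timedelta = timedelta(seconds=60),  # illustrative default
) -> List[Gap]:
    """Return every interval between consecutive events that exceeds `max_gap`."""
    ordered = sorted(timestamps)  # tolerate out-of-order delivery
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        delta = current - previous
        if delta > max_gap:
            gaps.append((previous, current, delta))
    return gaps


if __name__ == "__main__":
    base = datetime(2025, 1, 1, 12, 0, 0)
    # Illustrative event times, in seconds after `base`.
    events = [base + timedelta(seconds=s) for s in (0, 30, 60, 400, 420)]
    for start, end, duration in find_gaps(events):
        print(f"gap from {start} to {end} ({duration.total_seconds():.0f}s)")
```

The start and end of each reported gap give the time range to examine for tampering indicators and to record in the compliance documentation.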

Continuous Improvement: The authorization system should be treated as a living system that evolves with the organization's security requirements. Establish a quarterly review cadence to assess whether the current permission model still aligns with the organization's risk appetite, compliance obligations, and operational needs. Key inputs to this review include access review findings, incident post-mortems, compliance audit results, and feedback from business owners and end users. Use the findings to drive incremental improvements to the permission model, role catalog, and operational procedures.