The Ultimate Kubernetes Production Checklist: What 90% of Teams Miss Before Go-Live
Production is unforgiving.
The manifests are tidy, the cluster is “green,” CI is pushing clean images, and dashboards look calm. Yet most rescue engagements we take on share a root cause: teams assume Kubernetes defaults are safe. They’re not. This checklist is a field guide—what we look for during readiness reviews and what we’ve seen prevent late-night incidents across fintech, retail, healthcare, and SaaS. If your org can check off everything here, you’re not just running Kubernetes; you’re running it well.
1) Cluster Baseline & Architecture
Managed vs. self-managed control planes
- Default to managed: GKE, EKS, AKS. You get patched control planes, tight cloud integration, and fewer pager moments.
- Self-managed only when required: Compliance hard requirements, air-gap, or exotic networking. Budget time for etcd care and upgrades.
Version strategy
- Stay within two minor versions of the latest release. Schedule quarterly reviews for deprecations.
- Pin versions for nodes and control planes; never be surprised by an auto-upgrade.
Configuration as code
Keep a cluster configuration repo (Terraform, eksctl, gcloud CLI, or Pulumi). If your cluster can’t be rebuilt from Git, it’s not production-friendly.
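For example, a minimal eksctl spec keeps the pinned versions in Git — the cluster name, region, and node sizes below are placeholders:

```yaml
# cluster.yaml — illustrative eksctl spec; name, region, and sizes are placeholders
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: payments-prod   # hypothetical cluster name
  region: us-east-1
  version: "1.29"       # pinned control-plane minor version — no surprise upgrades
managedNodeGroups:
  - name: general
    instanceType: m6i.large
    minSize: 3
    maxSize: 12
```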
Common failure: mixing click-ops and code, then losing track of what’s authoritative. Choose one source of truth and enforce it. 🧠
2) Security Hardening (Non-Negotiable)
Identity, policy, isolation
- RBAC: Lock down service accounts; no cluster-admin by default, ever.
- Namespaces with quotas: Each team/environment gets its own sandbox with LimitRanges and ResourceQuotas (see the sketch after this list).
- Pod Security Standards or Kyverno: Block privileged pods, host network/IPC, and root FS writes.
- NetworkPolicies: Start with default deny; open only what you need.
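To illustrate the quotas bullet above, here’s a minimal sketch of per-namespace guardrails; the payments namespace and the numbers are placeholders to size against your workloads:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments        # hypothetical team namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a pod omits requests
        cpu: 100m
        memory: 128Mi
      default:               # applied when a pod omits limits
        cpu: 500m
        memory: 512Mi
```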
Supply chain & secrets
- Sign and scan images (Cosign + Trivy). Fail the build on high-severity CVEs with reachable code paths.
- Use External Secrets / cloud KMS; don’t store secrets in ConfigMaps (sketch after this list).
- Require private registries and restrict egress to registries your org controls.
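For the External Secrets point, a hedged sketch using the External Secrets Operator — the store name and remote key path are assumptions about your setup:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-db
  namespace: payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: cloud-secrets        # assumed ClusterSecretStore backed by your KMS/manager
    kind: ClusterSecretStore
  target:
    name: payments-db          # the Kubernetes Secret the operator creates
  data:
    - secretKey: password
      remoteRef:
        key: prod/payments/db  # assumed path in the backing store
        property: password
```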
Controversial take: If you deploy apps in the default namespace, you’re signaling you don’t prioritize security. Create purpose-built namespaces or expect drift.
3) Networking & Traffic Management
Ingress and service patterns
- Use a cloud-native ingress: GCLB (GKE), ALB (EKS), AGIC (AKS). Favor managed TLS and WAF.
- Prefer internal load balancing for east-west and private APIs; put public ingress at the edge.
- Document DNS conventions: in-cluster names follow service.namespace.svc.cluster.local; use a subdomain per environment and wildcard certs where appropriate.
Readiness, liveness, and retries
- Never route traffic to a pod until its readinessProbe passes; tune thresholds to application startup (JIT warmup, caches). See the probe sketch after this list.
- Right-size timeouts and retries at the edge; too many retries amplify outages.
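A probe fragment showing the tuning knobs; the /healthz endpoint and timings are illustrative, not defaults to copy:

```yaml
# container spec fragment — path and timings are placeholders
readinessProbe:
  httpGet:
    path: /healthz           # assumed health endpoint
    port: 8080
  initialDelaySeconds: 10    # allow JIT warmup / cache priming
  periodSeconds: 5
  failureThreshold: 3        # ~15s of failures before the pod leaves endpoints
```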
Network segmentation
- Baseline: a deny-all NetworkPolicy plus explicit egress. Permit only required DNS, registries, and upstream APIs (a minimal example follows this list).
- Audit with periodic “can I reach X?” jobs. If you can’t prove isolation, you don’t have it.
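A minimal deny-all baseline, assuming kube-dns lives in kube-system; the namespace is a placeholder:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: payments               # hypothetical namespace
spec:
  podSelector: {}                   # selects every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
  egress:
    - to:                           # allow DNS to kube-dns only; add registries/APIs explicitly
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
```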
4) Observability That Actually Helps
Dashboards are not alerting systems. Alerts must be actionable, owned, and routable.
Golden signals per service
- Latency: p50/p95/p99. Tie SLOs to user expectations, not platform defaults.
- Traffic: QPS/RPS, concurrency, queue depth.
- Errors: 4xx/5xx rates, gRPC status codes.
- Saturation: CPU throttling, memory pressure, file descriptors, connection pools.
Recommended stack
- Metrics: Prometheus + Grafana (or Managed Prometheus).
- Logs: Loki or Elastic with retention tiers; ship structured logs.
- Tracing: OpenTelemetry + Jaeger/Tempo. Sample generously on errors.
- Alerting: PagerDuty/Opsgenie with ownership labels (e.g., team=payments); a rule sketch follows this list.
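To make ownership concrete, a Prometheus rule sketch — the metric names and the 1% threshold are placeholders for your own SLOs:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-slo
  namespace: monitoring
spec:
  groups:
    - name: payments.rules
      rules:
        - alert: HighErrorRate       # hypothetical alert; tune expr to your SLO
          expr: |
            sum(rate(http_requests_total{job="payments",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="payments"}[5m])) > 0.01
          for: 10m
          labels:
            severity: page
            team: payments           # routes the page to the owning on-call
          annotations:
            summary: "payments 5xx rate above 1% for 10m"
```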
High-impact tip: create a single “First Responder” dashboard that answers three questions within 3 minutes: what broke, where, and who owns it.
5) Storage & Stateful Workloads
Data durability first
- Use managed disks via StorageClasses. Pick IOPS-tuned classes (gp3, Premium SSD, Balanced PD); see the example after this list.
- Plan for zone failure: multi-AZ node pools, replication at the data layer.
- Backups: Velero, Kasten, or cloud snapshots with immutable retention.
- Avoid hostPath and node-pinned state; they break scheduling and recovery.
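On EKS, for instance, an IOPS-tuned class might look like this sketch (the iops and throughput figures are placeholders to size against your workload):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-gp3
provisioner: ebs.csi.aws.com              # AWS EBS CSI driver
parameters:
  type: gp3
  iops: "6000"                            # placeholder
  throughput: "250"                       # MiB/s, placeholder
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer   # bind the volume in the pod's AZ
```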
Practical guidance: unless you have mature data ops (backups, PITR, DR playbooks), keep core databases on managed services (Cloud SQL, RDS, Cosmos DB). Move to in-cluster only when you can prove parity.
6) CI/CD & Progressive Delivery
Git as the truth
- Adopt GitOps (Argo CD or Flux). Everything declarative; drift is a failed build. A minimal Application sketch follows this list.
- Images tagged with the git SHA plus build metadata; never use latest.
- Use promotion branches (dev → stage → prod) with policy checks in between.
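A minimal Argo CD Application sketch, assuming a per-environment path layout in Git; the repo URL and paths are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy.git   # placeholder repo
    targetRevision: main
    path: envs/prod/payments                         # promotion happens via Git
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # orphaned objects are reconciled away
      selfHeal: true   # manual drift is reverted to Git
```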
Safer rollouts
- Canary or Blue/Green with automatic rollback on SLO violations.
- Service meshes (Istio/Linkerd) for traffic splits and telemetry, only if you’ll operate them well.
- Run smoke tests inside the cluster as Jobs against service endpoints.
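An in-cluster smoke test can be as small as a curl Job against the service DNS name — the endpoint and image tag below are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: payments-smoke
  namespace: payments
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: smoke
          image: curlimages/curl:8.7.1   # pinned utility image
          # -f makes curl exit non-zero on HTTP errors, failing the Job
          args: ["-fsS", "http://payments.payments.svc.cluster.local:8080/healthz"]  # assumed endpoint
```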
Secret sauce: empower developers with “fast feedback, safe guardrails.” Shorten merge-to-prod while raising quality gates. 🚀
7) Resilience, Autoscaling & Pod Health
Health & budgets
- Define livenessProbe and readinessProbe per container; no copy-paste defaults.
- PDBs: ensure voluntary disruptions don’t take capacity below safe thresholds (a minimal example follows this list).
- Requests/limits: right-size with real traffic; watch for CPU throttling and OOM kills.
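A minimal PDB sketch for the budgets bullet; minAvailable: 2 is a placeholder you should derive from real capacity math:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb
  namespace: payments
spec:
  minAvailable: 2        # placeholder; voluntary evictions never drop below this
  selector:
    matchLabels:
      app: payments
```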
Autoscaling the right thing
- HPA on meaningful metrics (queue depth, concurrency) rather than CPU alone; see the sketch after this list.
- Cluster Autoscaler enabled, with buffer capacity for burst.
- Schedule chaos experiments (kube-monkey, LitmusChaos) during business hours with on-call awareness.
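A sketch of scaling on queue depth instead of CPU, assuming a custom metrics adapter already exposes the metric; names and targets are placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: queue_depth        # assumed metric served by a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "30"       # placeholder per-pod target
```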
Common trap: autoscaling tuned on synthetic load that doesn’t match production. Calibrate with real traces and peak events.
8) Backups, DR & Region Strategy
Test what you ship
- Quarterly restore drills measured for RTO/RPO. Publish results internally.
- Backup etcd (if self-managed) and cluster state; validate restores to a sandbox cluster (a schedule sketch follows this list).
- Document runbooks as code and plain text. People reach for PDFs when they’re panicking; give them one page that works.
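If you run Velero, a nightly schedule might look like this sketch (scope and retention are placeholders) — and remember, the drill is the restore, not the backup:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-state
  namespace: velero
spec:
  schedule: "0 3 * * *"         # 03:00 daily, cron syntax
  template:
    includedNamespaces: ["*"]   # placeholder scope; narrow per data classification
    ttl: 720h                   # 30-day retention
```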
Reality check: You don’t have DR until someone who didn’t write the system can follow the runbook and bring it back.
9) Governance, Cost & Lifecycle
Guardrails that scale
- OPA/Gatekeeper or Kyverno policies enforced in CI and admission (see the policy sketch after this list).
- Label every resource with owner, cost center, environment, and data classification.
- Automate cleanup of orphaned load balancers, PVs, and snapshots.
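As one example of an admission guardrail, a Kyverno sketch enforcing the labeling bullet above; the label keys are placeholders for your own taxonomy:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-owner-labels
spec:
  validationFailureAction: Enforce   # reject non-compliant resources at admission
  rules:
    - name: check-labels
      match:
        any:
          - resources:
              kinds: ["Deployment", "StatefulSet"]
      validate:
        message: "owner, cost-center, and env labels are required"
        pattern:
          metadata:
            labels:
              owner: "?*"            # any non-empty value
              cost-center: "?*"
              env: "?*"
```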
Make costs visible
- Adopt OpenCost/Kubecost and show $ per service per environment.
- Right-size nodes, consider Spot/Preemptible for stateless tiers with disruption budgets.
Often missed: image bloat. Trim base images and remove unused layers; it reduces cold start times and egress.
10) Operational Readiness & Culture
Runbooks, drills, and reviews
- Monthly operational review: SLO adherence, incidents, cost, toil metrics.
- Runbooks per service: startup, shutdown, scale, rollback, failover.
- Access hygiene: break-glass accounts, short-lived tokens, audited kubectl.
Day-2 excellence
- Track API deprecations every release; upgrade early to avoid big-bang pain.
- Keep a living “production contract” per service: limits, readiness, SLOs, dependencies.
Blunt truth: YAML doesn’t rot; expectations do. If nobody owns the production contract, reliability becomes “someone else’s problem.” ⚠️
Printable Checklist
Here’s a condensed list you can copy into an issue template:
- Cluster on managed control plane; versions pinned and within 2 minors.
- Infra as code repo; rebuild documented and tested.
- RBAC least privilege; namespaces per team; quotas/limits defined.
- Pod Security (PSS/Kyverno); deny privileged; restrict host mounts.
- Signed and scanned images; private registry egress only.
- Ingress with managed TLS/WAF; NetworkPolicies default-deny.
- Readiness/liveness tuned; retries/timeouts set at edge.
- Golden signals dashboards; actionable alerts to on-call.
- StorageClasses with IOPS tiers; backups verified via restore.
- GitOps; progressive delivery; canary/blue-green with rollback.
- PDBs; right-sized requests/limits; HPA on meaningful metrics.
- DR drills; RTO/RPO measured; runbooks tested by non-authors.
- Policy enforcement (OPA/Kyverno); cost visibility per service.
- Monthly ops review; production contract maintained.
Work With Us
If this checklist surfaced gaps, that’s good news—you can fix them before your next incident. DevOps City runs production readiness assessments, GitOps enablement, and Kubernetes operations workshops tailored to your stack. Want a pragmatic roadmap you can ship within a quarter? Let’s talk. 🙂