The Ultimate Kubernetes Production Checklist: What 90% of Teams Miss Before Go-Live
Production is unforgiving.
The manifests are tidy, the cluster is “green,” CI is pushing clean images, and dashboards look calm. Yet most rescue engagements we take on share a root cause: teams assume Kubernetes defaults are safe. They’re not. This checklist is a field guide—what we look for during readiness reviews and what we’ve seen prevent late-night incidents across fintech, retail, healthcare, and SaaS. If your org can check off everything here, you’re not just running Kubernetes; you’re running it well.
1) Cluster Baseline & Architecture
Managed vs. self-managed control planes
- Default to managed: GKE, EKS, AKS. You get patched control planes, tight cloud integration, and fewer pager moments.
- Self-managed only when required: Compliance hard requirements, air-gap, or exotic networking. Budget time for etcd care and upgrades.
Version strategy
- Stay within two minor versions of the latest release. Schedule quarterly reviews for deprecations.
- Pin versions for nodes and control planes; never be surprised by an auto-upgrade.
Configuration as code
Keep a cluster configuration repo (Terraform, eksctl, gcloud CLI, or Pulumi). If your cluster can’t be rebuilt from Git, it’s not production-friendly.
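For example, a minimal eksctl spec keeps the pinned versions in Git — the cluster name, region, and node sizes below are placeholders:

```yaml
# cluster.yaml — illustrative eksctl spec; name, region, and sizes are placeholders
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: payments-prod   # hypothetical cluster name
  region: us-east-1
  version: "1.29"       # pinned control-plane minor version — no surprise upgrades
managedNodeGroups:
  - name: general
    instanceType: m6i.large
    minSize: 3
    maxSize: 12
```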
Common failure: mixing click-ops and code, then losing track of what’s authoritative. Choose one source of truth and enforce it. 🧠
2) Security Hardening (Non-Negotiable)
Identity, policy, isolation
- RBAC: Lock down service accounts; no cluster-admin by default, ever.
- Namespaces with quotas: Each team/environment gets its own sandbox with LimitRanges and ResourceQuotas (see the sketch after this list).
- Pod Security Standards or Kyverno: Block privileged pods, host network/IPC, and root FS writes.
- NetworkPolicies: Start with default deny; open only what you need.
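To illustrate the quotas bullet above, here’s a minimal sketch of per-namespace guardrails; the payments namespace and the numbers are placeholders to size against your workloads:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments        # hypothetical team namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a pod omits requests
        cpu: 100m
        memory: 128Mi
      default:               # applied when a pod omits limits
        cpu: 500m
        memory: 512Mi
```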
Supply chain & secrets
- Sign and scan images (Cosign + Trivy). Fail the build on high-severity CVEs with reachable code paths.
- Use External Secrets / cloud KMS; don’t store secrets in ConfigMaps (sketch after this list).
- Require private registries and restrict egress to registries your org controls.
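For the External Secrets point, a hedged sketch using the External Secrets Operator — the store name and remote key path are assumptions about your setup:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-db
  namespace: payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: cloud-secrets        # assumed ClusterSecretStore backed by your KMS/manager
    kind: ClusterSecretStore
  target:
    name: payments-db          # the Kubernetes Secret the operator creates
  data:
    - secretKey: password
      remoteRef:
        key: prod/payments/db  # assumed path in the backing store
        property: password
```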
Controversial take: If you deploy apps in the default namespace, you’re signaling you don’t prioritize security. Create purpose-built namespaces or expect drift.
3) Networking & Traffic Management
Ingress and service patterns
- Use a cloud-native ingress: GCLB (GKE), ALB (EKS), AGIC (AKS). Favor managed TLS and WAF.
- Prefer internal load balancing for east-west and private APIs; put public ingress at the edge.
- Document DNS conventions: in-cluster names follow service.namespace.svc.cluster.local; use a subdomain per environment and wildcard certs where appropriate.
Readiness, liveness, and retries
- Never route traffic to a pod until its readinessProbe passes; tune thresholds to application startup (JIT warmup, caches). See the probe sketch after this list.
- Right-size timeouts and retries at the edge; too many retries amplify outages.
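A probe fragment showing the tuning knobs; the /healthz endpoint and timings are illustrative, not defaults to copy:

```yaml
# container spec fragment — path and timings are placeholders
readinessProbe:
  httpGet:
    path: /healthz           # assumed health endpoint
    port: 8080
  initialDelaySeconds: 10    # allow JIT warmup / cache priming
  periodSeconds: 5
  failureThreshold: 3        # ~15s of failures before the pod leaves endpoints
```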
Network segmentation
- Baseline: a deny-all NetworkPolicy plus explicit egress. Permit only required DNS, registries, and upstream APIs (a minimal example follows this list).
- Audit with periodic “can I reach X?” jobs. If you can’t prove isolation, you don’t have it.
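A minimal deny-all baseline, assuming kube-dns lives in kube-system; the namespace is a placeholder:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: payments               # hypothetical namespace
spec:
  podSelector: {}                   # selects every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
  egress:
    - to:                           # allow DNS to kube-dns only; add registries/APIs explicitly
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
```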
4) Observability That Actually Helps
Dashboards are not alerting systems. Alerts must be actionable, owned, and routable.
Golden signals per service
- Latency: p50/p95/p99. Tie SLOs to user expectations, not platform defaults.
- Traffic: QPS/RPS, concurrency, queue depth.
- Errors: 4xx/5xx rates, gRPC status codes.
- Saturation: CPU throttling, memory pressure, file descriptors, connection pools.
Recommended stack
- Metrics: Prometheus + Grafana (or Managed Prometheus).
- Logs: Loki or Elastic with retention tiers; ship structured logs.
- Tracing: OpenTelemetry + Jaeger/Tempo. Sample generously on errors.
- Alerting: PagerDuty/Opsgenie with ownership labels (e.g., team=payments); a rule sketch follows this list.
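To make ownership concrete, a Prometheus rule sketch — the metric names and the 1% threshold are placeholders for your own SLOs:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-slo
  namespace: monitoring
spec:
  groups:
    - name: payments.rules
      rules:
        - alert: HighErrorRate       # hypothetical alert; tune expr to your SLO
          expr: |
            sum(rate(http_requests_total{job="payments",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="payments"}[5m])) > 0.01
          for: 10m
          labels:
            severity: page
            team: payments           # routes the page to the owning on-call
          annotations:
            summary: "payments 5xx rate above 1% for 10m"
```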
High-impact tip: create a single “First Responder” dashboard that answers three questions within 3 minutes: what broke, where, and who owns it.
5) Storage & Stateful Workloads
Data durability first
- Use managed disks via StorageClasses. Pick IOPS-tuned classes (gp3, Premium SSD, Balanced PD); see the example after this list.
- Plan for zone failure: multi-AZ node pools, replication at the data layer.
- Backups: Velero, Kasten, or cloud snapshots with immutable retention.
- Avoid hostPath and node-pinned state; they break scheduling and recovery.
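On EKS, for instance, an IOPS-tuned class might look like this sketch (the iops and throughput figures are placeholders to size against your workload):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-gp3
provisioner: ebs.csi.aws.com              # AWS EBS CSI driver
parameters:
  type: gp3
  iops: "6000"                            # placeholder
  throughput: "250"                       # MiB/s, placeholder
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer   # bind the volume in the pod's AZ
```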
Practical guidance: unless you have mature data ops (backups, PITR, DR playbooks), keep core databases on managed services (Cloud SQL, RDS, Cosmos DB). Move to in-cluster only when you can prove parity.
6) CI/CD & Progressive Delivery
Git as the truth
- Adopt GitOps (Argo CD or Flux). Everything declarative; drift is a failed build. A minimal Application sketch follows this list.
- Images tagged with the git SHA plus build metadata; never use latest.
- Use promotion branches (dev → stage → prod) with policy checks in between.
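A minimal Argo CD Application sketch, assuming a per-environment path layout in Git; the repo URL and paths are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy.git   # placeholder repo
    targetRevision: main
    path: envs/prod/payments                         # promotion happens via Git
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # orphaned objects are reconciled away
      selfHeal: true   # manual drift is reverted to Git
```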
Safer rollouts
- Canary or Blue/Green with automatic rollback on SLO violations.
- Service meshes (Istio/Linkerd) for traffic splits and telemetry, only if you’ll operate them well.
- Run smoke tests inside the cluster as Jobs against service endpoints.
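An in-cluster smoke test can be as small as a curl Job against the service DNS name — the endpoint and image tag below are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: payments-smoke
  namespace: payments
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: smoke
          image: curlimages/curl:8.7.1   # pinned utility image
          # -f makes curl exit non-zero on HTTP errors, failing the Job
          args: ["-fsS", "http://payments.payments.svc.cluster.local:8080/healthz"]  # assumed endpoint
```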
Secret sauce: empower developers with “fast feedback, safe guardrails.” Shorten merge-to-prod while raising quality gates. 🚀
7) Resilience, Autoscaling & Pod Health
Health & budgets
- Define livenessProbe and readinessProbe per container; no copy-paste defaults.
- PDBs: ensure voluntary disruptions don’t take capacity below safe thresholds (a minimal example follows this list).
- Requests/limits: right-size with real traffic; watch for CPU throttling and OOM kills.
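A minimal PDB sketch for the budgets bullet; minAvailable: 2 is a placeholder you should derive from real capacity math:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb
  namespace: payments
spec:
  minAvailable: 2        # placeholder; voluntary evictions never drop below this
  selector:
    matchLabels:
      app: payments
```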
Autoscaling the right thing
- HPA on meaningful metrics (queue depth, concurrency) rather than CPU alone; see the sketch after this list.
- Cluster Autoscaler enabled, with buffer capacity for burst.
- Schedule chaos experiments (kube-monkey, LitmusChaos) during business hours with on-call awareness.
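A sketch of scaling on queue depth instead of CPU, assuming a custom metrics adapter already exposes the metric; names and targets are placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: queue_depth        # assumed metric served by a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "30"       # placeholder per-pod target
```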
Common trap: autoscaling tuned on synthetic load that doesn’t match production. Calibrate with real traces and peak events.
8) Backups, DR & Region Strategy
Test what you ship
- Quarterly restore drills measured for RTO/RPO. Publish results internally.
- Backup etcd (if self-managed) and cluster state; validate restores to a sandbox cluster (a schedule sketch follows this list).
- Document runbooks as code and plain text. People reach for PDFs when they’re panicking; give them one page that works.
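If you run Velero, a nightly schedule might look like this sketch (scope and retention are placeholders) — and remember, the drill is the restore, not the backup:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-state
  namespace: velero
spec:
  schedule: "0 3 * * *"         # 03:00 daily, cron syntax
  template:
    includedNamespaces: ["*"]   # placeholder scope; narrow per data classification
    ttl: 720h                   # 30-day retention
```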
Reality check: You don’t have DR until someone who didn’t write the system can follow the runbook and bring it back.
9) Governance, Cost & Lifecycle
Guardrails that scale
- OPA/Gatekeeper or Kyverno policies enforced in CI and admission (see the policy sketch after this list).
- Label every resource with owner, cost center, environment, and data classification.
- Automate cleanup of orphaned load balancers, PVs, and snapshots.
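As one example of an admission guardrail, a Kyverno sketch enforcing the labeling bullet above; the label keys are placeholders for your own taxonomy:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-owner-labels
spec:
  validationFailureAction: Enforce   # reject non-compliant resources at admission
  rules:
    - name: check-labels
      match:
        any:
          - resources:
              kinds: ["Deployment", "StatefulSet"]
      validate:
        message: "owner, cost-center, and env labels are required"
        pattern:
          metadata:
            labels:
              owner: "?*"            # any non-empty value
              cost-center: "?*"
              env: "?*"
```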
Make costs visible
- Adopt OpenCost/Kubecost and show $ per service per environment.
- Right-size nodes, consider Spot/Preemptible for stateless tiers with disruption budgets.
Often missed: image bloat. Trim base images and remove unused layers; it reduces cold start times and egress.
10) Operational Readiness & Culture
Runbooks, drills, and reviews
- Monthly operational review: SLO adherence, incidents, cost, toil metrics.
- Runbooks per service: startup, shutdown, scale, rollback, failover.
- Access hygiene: break-glass accounts, short-lived tokens, audited kubectl.
Day-2 excellence
- Track API deprecations every release; upgrade early to avoid big-bang pain.
- Keep a living “production contract” per service: limits, readiness, SLOs, dependencies.
Blunt truth: YAML doesn’t rot; expectations do. If nobody owns the production contract, reliability becomes “someone else’s problem.” ⚠️
Printable Checklist
Here’s a condensed list you can copy into an issue template:
- Cluster on managed control plane; versions pinned and within 2 minors.
- Infra as code repo; rebuild documented and tested.
- RBAC least privilege; namespaces per team; quotas/limits defined.
- Pod Security (PSS/Kyverno); deny privileged; restrict host mounts.
- Signed and scanned images; private registry egress only.
- Ingress with managed TLS/WAF; NetworkPolicies default-deny.
- Readiness/liveness tuned; retries/timeouts set at edge.
- Golden signals dashboards; actionable alerts to on-call.
- StorageClasses with IOPS tiers; backups verified via restore.
- GitOps; progressive delivery; canary/blue-green with rollback.
- PDBs; right-sized requests/limits; HPA on meaningful metrics.
- DR drills; RTO/RPO measured; runbooks tested by non-authors.
- Policy enforcement (OPA/Kyverno); cost visibility per service.
- Monthly ops review; production contract maintained.
Work With Us
If this checklist surfaced gaps, that’s good news—you can fix them before your next incident. DevOps City runs production readiness assessments, GitOps enablement, and Kubernetes operations workshops tailored to your stack. Want a pragmatic roadmap you can ship within a quarter? Let’s talk. 🙂