Do You Actually Need Kubernetes?
Evaluating whether the overhead of Kubernetes is worth it for your team, and what the migration actually looks like.

Three months of evaluation. Every "Kubernetes for beginners" article on the internet, seemingly. Four vendor demos. More architecture meetings than I care to count. End result: a partial migration. Not everything. Just the pieces where the tradeoff made sense.
Wanted to walk through the decision-making because I suspect a lot of small-to-mid teams are asking the same question and getting sales pitches where they need honest assessments.
The Problem We Were Solving
API server, React frontend, PostgreSQL, Redis, background job worker. All running on a single 8-core VPS with Docker Compose. Worked fine most of the time. Problems appeared during traffic spikes: product launches, marketing campaigns, sudden user surges.
When traffic spiked, someone had to SSH into the server, check resource usage, spin up another container or increase limits. During business hours: manageable. At 2 AM: someone's phone rings and they drag themselves out of bed to manually scale things.
We also couldn't do zero-downtime deploys. Every new API version meant a 3-5 second window where active connections dropped. Most users experienced a failed request that succeeded on retry. Tolerable. But it made everyone nervous about deploying during peak hours.
Two concrete problems: manual scaling and deploy interruptions. Everything else was secondary.
The manual scaling was worse than it sounds on paper. One Saturday morning during a flash sale, our API response time went from 200ms to 8 seconds in about ten minutes. By the time the on-call person woke up, checked Slack, SSH'd in, and diagnosed the bottleneck, we'd lost maybe forty minutes of traffic at degraded performance. Customers were hitting refresh, generating more load, making the situation worse. The fix was bumping container resource limits and restarting, which took another five minutes. An hour of bad experience because nobody was awake to push a button.
With predictable traffic patterns, you can overprovision and eat the cost. Our traffic was not predictable. Marketing would run a campaign without telling engineering (this happened twice), and we'd find out when the alerts fired. The VPS was either too small for spikes or too expensive for quiet periods. There was no middle ground without automation.
What Kubernetes Offers
You describe what you want the system to look like. Kubernetes makes reality match the description. "Run 3 copies of this API server. If any crash, start replacements."
apiVersion: apps/v1
kind: Deployment
metadata:
  name: core-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: core-api
  template:
    metadata:
      labels:
        app: core-api
    spec:
      containers:
      - name: app
        image: internal-registry/core-api:v2.1
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 15
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 10
The livenessProbe checks whether the container is alive; if it fails, Kubernetes kills the container and starts a fresh one. The readinessProbe checks whether it's ready for traffic; if it fails, the container keeps running but gets removed from the load balancer until it recovers.
Appealing. Container crashes from an unhandled exception or memory leak? Restarted automatically. No pager. No SSH.
Auto-scaling was the other draw:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: core-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: core-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
When average CPU across pods exceeds 70%, Kubernetes spins up more replicas; when it drops back, it scales down. Launch event with 10x traffic? Handled without touching a keyboard. At 3 AM when nobody's on the site? Two replicas instead of paying for peak capacity around the clock.
We did the math. Right-sizing with auto-scaling would save roughly 30% on hosting compared to provisioning for peak and leaving it there.
Rolling deployments solved the other problem. Kubernetes starts new pods with the updated version, waits for readiness probes to pass, and gradually shifts traffic away from old pods. There's no point where zero healthy pods are handling requests. Tested in staging with continuous load: not a single dropped connection during deploys. After months of carefully timing deploys to avoid peak hours, that was a relief.
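The rollout behavior is tunable through the Deployment's strategy field. A minimal sketch of a conservative setting (the field names are standard Kubernetes; the values are just one reasonable choice, not what we necessarily ran):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # start at most one extra pod during the rollout
      maxUnavailable: 0  # never drop below the desired replica count
```

With maxUnavailable set to 0, Kubernetes only terminates an old pod after a new one passes its readiness probe, which is what makes the zero-downtime behavior possible.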
The Downsides We Discovered
The learning curve was steep. The team had strong Docker skills, but Kubernetes is different territory. Services, Ingress, ConfigMaps, Secrets, Namespaces, RBAC: a lot to learn before deploying even a simple app. We budgeted two weeks for the team to get comfortable. It took closer to five before people felt confident making changes without worrying about breaking something.
Debugging is harder. Docker Compose: docker logs <container>, see the output. Kubernetes: pod might have been rescheduled to a different node, or killed and replaced, and the logs from the old pod are gone unless you set up log aggregation (which is its own project). Spent a week setting up Loki and Grafana just for searchable logs. A week of work that didn't exist in the Docker Compose setup.
YAML fatigue. An application that takes 20 lines in docker-compose.yml easily requires 150+ lines of Kubernetes YAML across multiple files: Deployment, Service, Ingress, HPA, ConfigMap, maybe a PersistentVolumeClaim. Helm and Kustomize help manage this, but they're additional complexity on top of the already complex base.
Local development gets awkward. Can't easily run a full Kubernetes cluster on a developer laptop. minikube and kind exist but they're resource-hungry and the experience isn't as smooth as docker compose up. We kept Docker Compose for local development and Kubernetes only for staging and production. Local environment doesn't perfectly match production. The thing Docker was supposed to fix. Somewhat ironic.
Config drift. Someone runs kubectl apply manually to fix a midnight production issue and forgets to commit the change to git. Cluster state and source code are now out of sync. We mitigated this with ArgoCD for GitOps: the cluster syncs from a git repository, and manual changes get overwritten. But that's yet another tool to set up and maintain.
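An ArgoCD Application is itself just another manifest. A sketch (the repo URL and path are hypothetical stand-ins), with selfHeal enabled so manual kubectl changes get reverted automatically:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: core-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/infra.git  # hypothetical repo
    targetRevision: main
    path: k8s/core-api                             # hypothetical path
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # delete resources that were removed from git
      selfHeal: true  # revert drift from manual kubectl changes
```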
The Alternatives We Considered
Before committing to Kubernetes, we looked at simpler options. Worth mentioning because jumping straight to K8s isn't the only path.
AWS ECS with Fargate was the most tempting alternative. Managed container orchestration without running your own cluster. You define task definitions (similar to Docker Compose services), set up an Application Load Balancer, configure auto-scaling rules, and AWS handles the rest. No nodes to manage. No control plane to worry about. Pricing is per-vCPU and per-GB of memory, so you pay for what you use.
We built a proof of concept. It worked. The auto-scaling was simple to configure โ a few CloudWatch alarms and scaling policies. Zero-downtime deploys came almost for free through the rolling update strategy in the service definition. If we'd gone this route, we probably would have been up and running in a week instead of five.
Why we didn't: vendor lock-in was the concern. ECS task definitions are AWS-specific. The networking model, the IAM integration, the service discovery: all tied to AWS. Our team was already talking about potentially moving cloud providers in the next year or two. Kubernetes felt more portable, probably. Whether that portability concern was justified or premature is something I still think about. In practice, we haven't moved providers, and the Kubernetes-specific tooling we've adopted (ArgoCD, GKE-specific ingress) has created its own form of lock-in anyway.
Docker Swarm was dismissed quickly. It's simpler than Kubernetes and built into Docker, which sounded appealing. But the ecosystem is much smaller, the community has largely moved on, and the features we needed (fine-grained auto-scaling, sophisticated health checks, solid service mesh options if we ever needed them) aren't as mature. It felt like investing in a shrinking platform.
Just scripting it: someone on the team suggested writing custom scaling scripts that would monitor CPU via the cloud provider's API and spin up/down VPS instances with Docker containers. Scrappy and cheap. We prototyped it on a weekend. It handled basic scale-up fine, but scale-down was tricky (you need to drain connections before killing an instance), and the health check / restart logic was basically reimplementing what Kubernetes does, but worse and maintained by us. We dropped it after the prototype.
What We Actually Chose
Didn't go all-in. Migrated the stateless services (the API server and frontend) to a managed Kubernetes cluster on GKE. These benefit most from auto-scaling and zero-downtime deploys.
Kept PostgreSQL and Redis on managed cloud services outside Kubernetes. For context on how we think about the broader infrastructure picture, I wrote about cloud-native architecture and the tradeoffs of adopting incrementally versus going all-in. Running databases inside Kubernetes is possible but adds significant operational complexity: persistent volume management, backup strategies accounting for pod rescheduling, replication across nodes. General advice from people who've done it: "don't, unless there's a specific reason." We didn't have one.
The background worker was borderline. Migrated it because it's stateless and we wanted scaling tied to queue depth. But it would have been fine as a Docker container on a dedicated VPS. The migration added some complexity without a huge corresponding benefit for that particular component.
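Scaling on queue depth rather than CPU requires an external metric, which means running a metrics adapter (prometheus-adapter, KEDA, or similar) alongside the HPA. A sketch, assuming the adapter exposes a hypothetical queue_depth metric; the deployment name and numbers are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: job-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: job-worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: queue_depth    # hypothetical; must be served by a metrics adapter
      target:
        type: AverageValue
        averageValue: "30"   # aim for roughly 30 queued jobs per worker pod
```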
Six Months Later
Auto-scaling has been great. Surprise traffic spike last month (a social media post went viral and linked to our app) and the API scaled from 2 pods to 11 without anyone doing anything. Old setup: that would have been a crash followed by a frantic SSH session.
Zero-downtime deploys: great too. Deploy multiple times a day now. Nobody checks the clock first.
Operational overhead is real, though. More time on infrastructure tooling in the past six months than the previous two years combined. Monitoring setup, log aggregation, Helm charts, networking issues, Ingress controller quirks. Time that didn't go toward product features.
Would I recommend Kubernetes for a small team? It depends. A single application with predictable traffic: a VPS with Docker Compose is simpler, cheaper, good enough. Unpredictable scaling needs, multiple services, deploy reliability requirements: the Kubernetes complexity starts paying for itself.
We're in the second category. Just barely. A year ago, we probably weren't. The tipping point was the late-night pager incidents and the deploy anxiety. Without those concrete pain points, I'm not sure we would have migrated. Not sure we should have.
What I'd Do Differently
Start with the networking earlier. Kubernetes networking (Services, Ingress, DNS resolution between pods) is conceptually simple but has enough gotchas that we burned several days on it during the migration. One example: we configured an Ingress resource expecting it to handle TLS termination, but the default Ingress controller on GKE routes traffic differently from what the docs for generic Kubernetes describe, so we had to add annotations specific to the GKE ingress class. Another: internal service-to-service communication required creating Service resources with the right port mappings, and a typo in a port number caused intermittent connection failures that were hard to trace because the error messages were vague.
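The port mapping that bit us lives in the Service manifest. A minimal sketch matching the Deployment earlier in this post: port is what other services connect to, targetPort must match what the container actually listens on, and the selector must match the pod labels exactly.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: core-api
spec:
  selector:
    app: core-api      # must match the pod template's labels exactly
  ports:
  - name: http
    port: 80           # port other services use: http://core-api/
    targetPort: 3000   # containerPort the app listens on; a typo here fails quietly
```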
I'd also invest in a proper staging environment from day one instead of doing the initial testing against a "staging namespace" in the same cluster. Namespaces provide isolation but they share the same nodes and cluster resources. A bad configuration in staging shouldn't be able to affect production, and with shared clusters, the blast radius is bigger than you'd want.
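If you do share a cluster, a ResourceQuota at least caps how much a namespace can consume. It's a partial mitigation, not real isolation; the limits below are illustrative, not what we ran:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "4"     # staging can't request more than 4 cores total
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```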
The Helm chart situation needs attention too. We started with raw YAML manifests for simplicity. Within two months, the duplication between staging and production configs was painful: the same Deployment spec copied twice with different replica counts and image tags. Helm templating would have prevented that. We're migrating to Helm now, but starting with it would have been smarter.
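The Helm fix is mechanical: one template, one values file per environment. A sketch with hypothetical values (the commented template lines show how a Deployment field would reference them):

```yaml
# values-staging.yaml (hypothetical)
replicaCount: 1
image:
  tag: v2.1-rc1

# values-production.yaml (hypothetical)
# replicaCount: 3
# image:
#   tag: v2.1

# templates/deployment.yaml excerpt:
#   replicas: {{ .Values.replicaCount }}
#   image: internal-registry/core-api:{{ .Values.image.tag }}
```

Then `helm upgrade --install core-api ./chart -f values-production.yaml` picks the right file at deploy time.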
And I'd be more careful about resource requests and limits. We set generous CPU and memory limits early on because we weren't sure what our services needed. Three months later, we noticed we were paying for a lot of allocated-but-unused capacity. Right-sizing required running the services under production load and analyzing actual usage patterns โ something that takes time and attention that's easy to deprioritize.
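One way to take some guesswork out of right-sizing is the Vertical Pod Autoscaler in recommendation-only mode: it watches actual usage and suggests requests without changing anything. It's a separate component you have to install in the cluster, so this is a sketch of the approach, not something the original setup included:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: core-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: core-api
  updatePolicy:
    updateMode: "Off"   # recommend only; read suggestions via kubectl describe vpa
```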
Secrets Management โ The Thing We Got Wrong First
Worth calling out specifically because we made a mistake that's common and avoidable. When we first set up the cluster, we stored application secrets (database passwords, API keys, JWT secrets) as Kubernetes Secrets: base64-encoded values in YAML files committed to our GitOps repo. This is the default approach in every tutorial.
The problem: base64 is encoding, not encryption. Anyone with read access to the git repo could decode every secret in about two seconds. We were storing production database credentials in plain sight, dressed up as something secure.
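To make the point concrete, here's how little effort "decrypting" a Secret value takes. The encoded string below is a made-up example, not a real credential:

```shell
# Kubernetes Secret values are base64-encoded, nothing more.
# Anyone who can read the manifest recovers the plaintext in one command:
echo 'c3VwZXItc2VjcmV0' | base64 --decode
# prints: super-secret
```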
Switched to Sealed Secrets after a security review flagged it. The controller runs in the cluster and decrypts secrets at deploy time. The encrypted versions are safe to commit to git; you can't decrypt them without the private key that lives only inside the cluster. Setup took about half a day. The peace of mind was immediate.
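The workflow, roughly: write a normal Secret manifest locally, run it through the kubeseal CLI, and commit only the sealed output. The result looks like this; the placeholder stands in for the real encrypted blob kubeseal emits, and the names are illustrative:

```yaml
# Generated with: kubeseal --format yaml < secret.yaml > sealed-secret.yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  encryptedData:
    password: <encrypted-blob-from-kubeseal>  # safe to commit; only the in-cluster key decrypts it
```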
For teams with more infrastructure budget, external secret stores like AWS Secrets Manager or HashiCorp Vault are the better long-term answer. They centralize secret rotation and audit logging in ways that Sealed Secrets can't match. But for a small team that just needs secrets out of plain text in a git repo, Sealed Secrets is the fastest path to something reasonable.
One more thing: rotate your secrets. We set up automatic rotation for database passwords on a 90-day cycle after the security review. Before that, the production database password hadn't changed in fourteen months. Not great.
The Gravitational Pull
The ecosystem around Kubernetes has a pull to it. Once you adopt it, you keep discovering tools that integrate with it: service meshes, policy engines, progressive delivery controllers, cost optimization platforms, security scanners. Each one is individually reasonable. Collectively, they add up to a lot of moving parts, each with its own configuration, upgrade cycle, and failure modes.
Actively resisting that pull. We don't need Istio. Don't need Linkerd. Don't need a service mesh at all with our current number of services. We need auto-scaling and rolling deploys, and we have those. Everything else waits until there's a real, measured problem that justifies the additional tooling.
The temptation is real, though. Every Kubernetes blog post and conference talk shows off the full stack: monitoring with Prometheus and Grafana, tracing with Jaeger, log aggregation with Loki, GitOps with ArgoCD, secrets management with Vault, policy enforcement with OPA. It's an impressive ecosystem. It's also a full-time job to maintain if you adopt all of it at once. We adopted ArgoCD (necessary for GitOps) and Loki+Grafana (necessary for log visibility) and stopped there. Two tools on top of Kubernetes itself. That's already more infrastructure than we had before.
One more thing that caught us off guard: Kubernetes version upgrades. GKE manages the control plane upgrades, but node pool upgrades still require our attention. Each Kubernetes minor version is supported for about 14 months. Fall behind and your cluster runs on an unsupported version with no security patches. We've scheduled quarterly upgrade checks: test the new version in staging, review the changelog for breaking changes, upgrade production. Not glamorous work. Necessary work.
The honest summary: Kubernetes solved the two problems we needed solved. It also introduced a new category of problems: infrastructure complexity, longer onboarding for new developers, another system to keep updated and secured. Net positive for our team at our current scale. But I wouldn't adopt it without specific pain points driving the decision. "Everyone uses Kubernetes" isn't a pain point. "Our on-call got paged three times last month for manual scaling" is.
Further Resources
- Kubernetes Official Documentation – the comprehensive reference for concepts, workloads, networking, storage, and cluster administration.
- CNCF Cloud Native Landscape – an interactive map of the entire cloud-native ecosystem maintained by the Cloud Native Computing Foundation, showing how tools like Helm, ArgoCD, and Prometheus fit together.
- Kubernetes the Hard Way (GitHub) – Kelsey Hightower's tutorial for setting up Kubernetes from scratch, which builds deep understanding of how the components actually work.
Written by
Anurag Sinha
Full-stack developer specializing in React, Next.js, cloud infrastructure, and AI. Writing about web development, DevOps, and the tools I actually use in production.