Case Study - Cloud Exit: From €200K/yr Cloud Bills to Owned Infrastructure
How we helped a growing SaaS platform reclaim control of its infrastructure — migrating from managed cloud Kubernetes to a self-operated, on-premises cluster with predictable costs and full data sovereignty.
- Client: B2B SaaS Platform
- Year:
- Service: Infrastructure Architecture & Cloud Repatriation

The Problem with "Just Use the Cloud"
The cloud is genuinely convenient — until the bill arrives. This client started like most SaaS companies do: shipping fast on managed services, focused entirely on product. Managed Kubernetes made sense early on — offload the complexity, iterate quickly, validate the market. It worked. The product gained traction, customers signed up, and the engineering team stayed lean.
But as they scaled, the cloud bill scaled faster. From €50K in year one to €120K in year two to nearly €200K by year three. The application architecture hadn't fundamentally changed. The infrastructure had just gotten more expensive. Managed Kubernetes across two regions. Object storage priced by egress. Load balancers charged per hour. The line items multiplied, but the value didn't.
The team wasn't deeply locked into proprietary services — most workloads ran on standard Kubernetes with open-source tooling. The database was PostgreSQL. Storage was S3-compatible object storage. This wasn't a cloud-native architecture problem; it was a cloud economics problem. They were paying managed-service premiums for infrastructure they could operate themselves.
Beyond cost, there were structural concerns. Customer data spread across provider regions with inconsistent residency guarantees. GDPR compliance was met on paper, but the operational reality — limited audit trails, opaque data flows, shared tenancy — made the compliance team uncomfortable. And there was a harder question: what happens when a cloud provider reprices or deprecates a service?
The CTO had read Basecamp's cloud exit writing. It resonated. But they needed help designing, deploying, and migrating a production workload without dropping packets.
What We Did
Capacity Planning
Before touching any hardware, we mapped their existing workloads: pod resource requests versus actual consumption, traffic patterns, storage IOPS requirements, and realistic three-year growth projections. Cloud bills are a poor proxy for compute needs — managed services carry significant overhead and encourage overprovisioning. The actual workload fit comfortably within a cluster costing a fraction of the managed-cloud bill.
We profiled their production Kubernetes cluster for two weeks: CPU and memory usage per service, storage I/O patterns, network egress volumes, and daily/weekly traffic cycles. Peak load was predictable. Most services ran well below their requested resources. The data showed they were paying for capacity they didn't use.
We sized a 5-node cluster around measured peak load with 40% headroom for growth. EPYC processors, high-density RAM, and NVMe drives per node feeding Ceph directly. 51TB usable storage with 3-way replication. Enough for current needs, expandable without re-architecting.
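As a sanity check on that figure: with Ceph's 3-way replication, usable capacity is raw capacity divided by three. Assuming roughly 30.7TB of raw NVMe per node (the per-node figure wasn't stated; the totals imply it), the arithmetic works out as:

```latex
\text{usable} \approx \frac{\text{raw}}{\text{replicas}} = \frac{5 \times 30.7\,\text{TB}}{3} \approx 51\,\text{TB}
```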
Hardware Procurement & Deployment
The hardware decision was about reliability, not specs. Enterprise servers with full vendor support coverage — next-business-day onsite. Hardware failure at 3am is not a software problem; it needs a truck roll, not a Slack message.
We specified redundant power, out-of-band management, and dedicated network separation: management, VM traffic, Ceph storage, and live migration each on their own isolated paths. No single point of failure at the network layer. The hardware was racked and cabled by the colocation partner. We validated the physical layer before touching a config file.
Proxmox Cluster
Proxmox was the virtualisation layer. Not because it's fashionable — it isn't — but because it's operationally honest. No vendor lock-in, solid live migration, Ceph integration that works, and a community that documents failure modes rather than hiding them. The client's team would need to operate this for years. They needed something they could understand and fix.
Five nodes, Ceph across all of them, VLAN separation throughout. The cluster was live and stable before Kubernetes entered the picture.
Kubernetes Cluster Design & Deployment
We deployed Kubernetes as VMs on top of Proxmox, not on bare metal directly. This was a deliberate architectural choice: it preserves live migration for node maintenance without disrupting workloads, and separates infrastructure concerns from application concerns. The platform team manages Kubernetes. The infrastructure team manages Proxmox and Ceph. Clean operational boundaries.
The Kubernetes cluster mirrored their cloud setup's logical structure to simplify migration: same namespaces, same RBAC patterns, same service naming conventions. We wanted workload manifests to be portable with minimal changes.
Networking: MetalLB for LoadBalancer IPs with a dedicated VLAN. Cilium as CNI — chosen for its observability (Hubble), network policy enforcement, and eBPF performance. We replicated their cloud network policies directly into Cilium NetworkPolicy resources.
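A minimal sketch of that networking setup — the pool name, namespace labels, and address range are placeholders, not the client's actual values:

```yaml
# MetalLB pool on the dedicated load-balancer VLAN (addresses illustrative).
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lb-vlan-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.20.30.10-10.20.30.50
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lb-vlan-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - lb-vlan-pool
---
# A cloud security-group rule translated into a CiliumNetworkPolicy:
# only labelled API pods may reach the worker pods' service port.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-api-to-workers
  namespace: prod
spec:
  endpointSelector:
    matchLabels:
      app: worker
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: api
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
```

Because Cilium policies select on pod identity rather than IP ranges, rules of this shape survive pod rescheduling and cluster moves unchanged.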
GitOps delivery: ArgoCD synchronized from the same Git repositories they used for cloud deployments. Same CI/CD pipelines, same container images. Only the cluster API endpoint changed.
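In practice, each ArgoCD Application needed only its destination updated — a sketch with placeholder repo and cluster URLs:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deployments.git  # same repo as in the cloud
    targetRevision: main
    path: apps/api
  destination:
    server: https://k8s.onprem.example.com:6443  # the only field that changed
    namespace: prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```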
Storage: Ceph RBD for block storage (StatefulSets, databases), CephFS for shared filesystem access (legacy apps expecting NFS semantics). Storage classes configured to match cloud equivalents by name — PVCs migrated without manifest changes.
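One way to make the name-matching work — a sketch assuming the ceph-csi RBD driver, with an illustrative cloud-style class name and placeholder cluster ID, provisioner secret references omitted for brevity:

```yaml
# Ceph RBD StorageClass published under the cloud provider's class name
# so existing PVCs bind without manifest changes ("gp2" is illustrative).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <ceph-cluster-id>   # placeholder
  pool: kubernetes
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true
```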
Migration Strategy & Execution
The migration followed a blue-green pattern designed to minimize risk. The new cluster ran in parallel with the cloud environment for six weeks while we validated each service layer.
We categorized workloads by criticality and dependency chains, then migrated in waves:
Wave 1: Internal services — monitoring, logging, internal dashboards. Low user impact, high operational signal. These ran dual-stack (cloud + on-prem) for two weeks while we validated observability, storage performance, and incident response workflows.
Wave 2: Background workers — async job processors, scheduled tasks, batch operations. These services tolerate transient failures and gave us confidence in Ceph storage under real load patterns. We ran A/B traffic splits via queue routing to compare performance and reliability between environments.
Wave 3: API services — the customer-facing application layer. DNS TTLs pre-staged to 60 seconds two weeks prior. Each service migrated individually over weekends, monitored for 48 hours before proceeding to the next. Traffic shifted gradually: 10% → 50% → 100% over each service's migration window.
Wave 4: Stateful workloads — PostgreSQL, Redis, object storage. Replicated data in advance, then cut over during scheduled maintenance windows with full backups and rollback plans. Database replication lag monitored continuously; cutover only proceeded when lag was under 5 seconds.
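The lag gate on the Wave 4 cutover can be expressed as a Prometheus rule — a sketch assuming postgres_exporter's `pg_replication_lag` metric (reported in seconds) is being scraped:

```yaml
groups:
  - name: migration-cutover
    rules:
      - alert: ReplicationLagTooHighForCutover
        expr: pg_replication_lag > 5    # hold cutover while standby lags > 5s
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Standby replication lag above 5s — hold the cutover"
```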
Throughout the migration, both environments ran with synchronized deployments. A rollback was a DNS change away. The final cutover was a non-event — most users never noticed.
Security & Compliance
Network segmentation from day one: management traffic never touches workload traffic. VPN for all remote access. No public SSH. Certificates managed internally with short rotation cycles.
Data residency became concrete: all customer data in a single, known physical location. The compliance team got an actual data processing register they could stand behind — not a spreadsheet of cloud regions with asterisks.
Monitoring
Prometheus and Grafana across the full stack — Proxmox nodes, Kubernetes, storage, network. Alert routing via PagerDuty. The client now knows when a disk is degrading before Ceph marks it failed, and when a node is under memory pressure before a pod gets OOM-killed. With managed cloud, they had dashboards with no context. Now they have signal.
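Early-warning rules of that kind might look like this — a sketch where `ceph_health_status` comes from the Ceph mgr Prometheus module, the node metrics from node_exporter, and the thresholds are illustrative:

```yaml
groups:
  - name: capacity-signals
    rules:
      - alert: CephHealthDegraded
        expr: ceph_health_status > 0    # 0 = HEALTH_OK in the mgr Prometheus module
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph reports degraded health — check OSDs before failure"
      - alert: NodeMemoryPressure
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Under 10% memory available on {{ $labels.instance }}"
```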
The Result
- Infrastructure cost reduction: ~50%
- Dropped packets during migration: zero
- Cluster size: 5 nodes, 120 cores / 2.5TB RAM
- Usable distributed storage (Ceph, 3× replication): 51TB
- Workloads migrated successfully: 100%
The migration completed without customer-facing incidents. Production workloads that ran on cloud-managed Kubernetes now run on infrastructure the team owns and operates. Response times improved — local storage latency is measurably better than cloud block volumes, and network hops between services dropped from cross-AZ to same-rack.
The OpEx-to-CapEx shift changed the financial conversation. Cloud spend was a recurring line item that grew with usage and resisted forecasting. Owned hardware is a capital investment that amortises over five years, with predictable costs for colocation, power, and maintenance. The total cost of ownership is substantially lower — and the client owns the asset at the end.
More practically: the team operates infrastructure they understand and control. When something breaks, they know how to fix it — no waiting on support tickets. That operational confidence is worth more than any single cost line. They can now forecast infrastructure costs three years out, something usage-based cloud billing never allowed.
What We Used
- Proxmox VE
- Ceph
- Kubernetes
- Cilium
- MetalLB
- ArgoCD
- Prometheus
- Grafana