Beyond Nomad: Escaping the Cloud Tax and Building a Sovereign Kubernetes Stack
The Allure of the Simple Stack
For years, my philosophy on infrastructure was simple: keep the moving parts to an absolute minimum.
If you audited my infrastructure-as-code repository two years ago, you would have seen a beautifully lean, highly optimized environment. I used Terragrunt to orchestrate DigitalOcean droplets in the ams3 region. I provisioned s-4vcpu-8gb instances running Debian 11, attached simple 10GB ext4 block volumes, and used Ansible to configure the underlying system.
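To make the shape of that old stack concrete, the Terragrunt-managed module looked roughly like this. This is an illustrative sketch, not lifted from the real repository; resource names and the volume size are placeholders:

```hcl
# Hypothetical sketch of the old setup: a DigitalOcean droplet plus an
# attached block volume, as it might appear in the Terraform module.
resource "digitalocean_droplet" "nomad_node" {
  name   = "nomad-node-01"
  region = "ams3"
  size   = "s-4vcpu-8gb"
  image  = "debian-11-x64"
}

resource "digitalocean_volume" "app_data" {
  name                    = "app-data"
  region                  = "ams3"
  size                    = 10
  initial_filesystem_type = "ext4"
}

resource "digitalocean_volume_attachment" "app_data" {
  droplet_id = digitalocean_droplet.nomad_node.id
  volume_id  = digitalocean_volume.app_data.id
}
```

Terragrunt then stamped this module out per environment, and Ansible took over from first boot.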
At the heart of this setup was HashiCorp Nomad.
Nomad was the perfect orchestrator for a lean consultancy. It was a single, elegant binary. It didn't require a dedicated platform team to keep the control plane alive. It scheduled jobs, managed containers, and got out of the way.
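The appeal is easiest to see in a job spec. A minimal, illustrative service job (the job and image names here are hypothetical) is a single self-contained file:

```hcl
# Illustrative Nomad job: one file, one binary, no control-plane ceremony.
job "web" {
  datacenters = ["ams3"]
  type        = "service"

  group "app" {
    count = 2

    network {
      port "http" {
        to = 80
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "nginx:1.25"
        ports = ["http"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
```

One `nomad job run web.nomad` and the scheduler handled the rest.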
To complement this lean setup, I offloaded observability entirely. I piped Prometheus metrics, Loki logs, and Tempo traces directly into a managed Grafana Cloud account.
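Wiring this up was a few lines of configuration per signal. As a sketch, the Prometheus side looked something like the following `remote_write` block (the endpoint URL follows Grafana Cloud's pattern but is a placeholder here, as are the credentials):

```yaml
# Illustrative Prometheus remote_write config shipping metrics to a
# managed Grafana Cloud endpoint. URL and credentials are placeholders.
remote_write:
  - url: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
    basic_auth:
      username: "123456"                      # Grafana Cloud instance ID (placeholder)
      password: "${GRAFANA_CLOUD_API_KEY}"    # injected from the environment
```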
It was a fully codified, zero-friction machine. It worked flawlessly—until the business requirements outgrew the architecture. Today, that entire stack has been dismantled. I have migrated my production workloads to a 13-node bare-metal Kubernetes cluster on Hetzner, powered by Talos Linux, Rook-Ceph, and a fully self-hosted Grafana observability stack.
This is the story of why I traded the "simplicity" of Nomad and managed cloud services for a Sovereign Kubernetes architecture.
Catalyst 1: The BSL License and Vendor Lock-in
The first crack in the foundation wasn't technical; it was legal and strategic.
In August 2023, HashiCorp altered the open-source landscape by shifting its core products—including Nomad and Terraform—from the Mozilla Public License (MPL) to the Business Source License (BSL).
At the time, I was designing a managed hosting service for my clients, effectively intending to package and sell isolated access to a robust Nomad cluster. Under the new BSL terms, offering a managed service built directly on top of Nomad fell into a legal grey area: the license is designed specifically to protect HashiCorp from platforms doing exactly what I intended to do, unless I paid for expensive enterprise licensing.
I realized I was building my business on rented land. You cannot offer true "Sovereign Cloud" solutions to European clients when the core orchestration engine is subject to sudden, restrictive licensing changes by a US corporation.
We immediately pivoted our IaC from Terraform to OpenTofu, and the search for a truly open-source orchestrator began. Kubernetes was the only logical answer.
Catalyst 2: The Stateful High-Availability Wall
Stateless applications are easy. It’s stateful data that breaks architectures.
One of my primary clients runs a legacy Apache/PHP application. Because of how the application is architected, it requires shared, persistent read/write access to a specific directory on disk.
In my DigitalOcean/Nomad environment, I was provisioning standard DigitalOcean Block Storage volumes and formatting them as ext4. The fatal flaw is that standard cloud block storage is ReadWriteOnce (RWO): a volume can only be mounted to a single node at a time.
If I wanted to achieve High Availability (HA) for this application, I was blocked. I couldn't spin up multiple instances of the PHP app across different Nomad worker nodes because they couldn't share the same block volume. If the underlying nomad-node-01 droplet failed, the entire application went down until the volume could be detached and reattached elsewhere.
To fix this, I needed ReadWriteMany (RWX) distributed storage.
While you can force CSI plugins into Nomad to handle distributed storage, the integration often feels fragile. Kubernetes, however, has Rook-Ceph. By moving to Kubernetes, I was able to take raw NVMe drives across multiple physical servers and pool them into a highly resilient, distributed Ceph storage cluster.
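The difference shows up right at the manifest level. With a CephFS-backed storage class from Rook installed, an RWX claim is a few lines of YAML, and any number of pods across any nodes can mount the same volume. The storage class name below is the common Rook example name; yours may differ:

```yaml
# An RWX PersistentVolumeClaim backed by Rook-Ceph's CephFS.
# "rook-cephfs" is the conventional example class name from the Rook docs.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-app-data
spec:
  accessModes:
    - ReadWriteMany            # the mode plain cloud block storage cannot offer
  resources:
    requests:
      storage: 20Gi
  storageClassName: rook-cephfs  # CephFS class; RBD-backed classes remain RWO
```

Every replica of the legacy PHP app simply mounts `shared-app-data`, regardless of which worker node it lands on.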
Suddenly, legacy applications could scale horizontally across worker nodes, sharing state effortlessly. True High Availability was finally unlocked.
Catalyst 3: The SaaS Observability Trap
When you are small, managed SaaS products feel like a superpower. As you scale, they become a financial trap.
Initially, my Terragrunt code integrated directly with Grafana Cloud. It was incredibly convenient to push Prometheus metrics and Loki logs to their eu-west-0 endpoints. But as my cluster grew and the metrics and log volume increased, I hit the usage limits of Grafana Cloud's standard tiers rapidly.
To keep the visibility I needed, my monthly observability bill was set to skyrocket. I was paying a premium simply for the convenience of someone else hosting my time-series databases.
By migrating to a Sovereign K8s stack, I reclaimed my observability. I deployed the entire LGTM stack (Loki, Grafana, Tempo, Mimir) natively inside my cluster. Because I was no longer constrained by SaaS pricing tiers, I could retain higher-resolution metrics and longer log histories, giving me better debugging capabilities at a fraction of the cost.
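Self-hosting also turns retention into a config knob rather than a pricing tier. As a sketch, assuming a Loki deployed via the official Helm chart, extending log history looks something like this (the values are illustrative, not my production numbers):

```yaml
# Illustrative Loki config fragment: retention is a setting you own,
# not a SaaS tier you pay to unlock.
limits_config:
  retention_period: 2160h      # keep 90 days of logs
compactor:
  retention_enabled: true      # let the compactor enforce the retention period
```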
The Economics: Beating the "Cloud Tax"
Achieving this level of High Availability and running a heavy observability stack requires serious compute power. If I attempted to build this in AWS, Azure, or DigitalOcean, the monthly infrastructure bill would have destroyed the profit margins.
Consider the math:
In my old DigitalOcean setup, a mid-tier s-4vcpu-8gb droplet (4 Cores, 8GB RAM) cost $48 per month.
By pivoting to European bare-metal using Hetzner’s server auctions, the economics completely inverted. Today, I am utilizing dedicated servers featuring Intel Core i7-7700 processors (4 cores / 8 threads) and a massive 64GB of RAM, all for between 29€ and 31€ per month.
For around two-thirds of the price of a cloud droplet, I am getting 8x the memory, dedicated physical CPU cores, and direct access to the bare metal. Because the hardware is so affordable, I was able to massively over-provision.
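The gap is starkest when you normalize by memory. A quick back-of-the-envelope calculation, assuming a rough €1 ≈ $1.07 conversion and the €30/month mid-point of the auction price range:

```python
# Back-of-the-envelope: price per GB of RAM per month, DigitalOcean vs Hetzner.
# The exchange rate and auction price are rough assumptions, not live data.
EUR_TO_USD = 1.07

do_price_usd, do_ram_gb = 48.0, 8                      # s-4vcpu-8gb droplet
hetzner_price_usd, hetzner_ram_gb = 30.0 * EUR_TO_USD, 64  # auction server

do_per_gb = do_price_usd / do_ram_gb                   # $6.00 per GB
hetzner_per_gb = hetzner_price_usd / hetzner_ram_gb    # ~$0.50 per GB

print(f"DigitalOcean: ${do_per_gb:.2f}/GB, Hetzner: ${hetzner_per_gb:.2f}/GB")
print(f"Roughly {do_per_gb / hetzner_per_gb:.0f}x cheaper per GB of RAM")
```

Roughly an order of magnitude per gigabyte of RAM, before you even count the dedicated cores and NVMe.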
My current Sovereign K8s cluster consists of 13 dedicated nodes:
- 3 Control Planes (For true K8s API HA)
- 3 Storage Nodes (Dedicated to Rook-Ceph utilizing local NVMe disks)
- 4 Worker Nodes (For application workloads)
- 3 Monitoring Nodes (Dedicated to the self-hosted Grafana LGTM stack)
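This role separation is enforced in Kubernetes with node labels and taints, so Ceph OSDs and the monitoring stack never compete with application workloads. As a sketch of the end state for a storage node (the label and taint keys are my own convention, not Rook defaults, and in practice they are applied via `kubectl taint` or the node's machine config rather than a hand-written Node manifest):

```yaml
# Illustrative view of a dedicated storage node: the taint repels all
# pods except those (like Rook-Ceph's) carrying a matching toleration.
apiVersion: v1
kind: Node
metadata:
  name: storage-01
  labels:
    node-role.kubernetes.io/storage: "true"
spec:
  taints:
    - key: storage-node
      value: "true"
      effect: NoSchedule
```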
Building a 13-node, HA cluster with this much memory and NVMe storage on a public cloud would cost thousands of dollars a month. I am running it for the cost of a few premium cloud VMs.
The OS Paradigm Shift: Enter Talos Linux
You might be thinking: Managing 13 physical Linux servers sounds like a sysadmin’s nightmare. If I had continued using Debian 11 and Ansible to manually patch the OS, configure container runtimes, and secure SSH access across 13 bare-metal machines, the operational overhead would have ruined the financial savings.
To make bare-metal Kubernetes viable, the Operating System had to disappear. That is why I adopted Talos Linux.
Talos is an immutable, API-driven operating system designed exclusively for Kubernetes. There is no SSH. There is no bash shell. The entire OS is managed via an API, which integrates flawlessly with my OpenTofu infrastructure-as-code pipelines.
From a security and maintenance standpoint, it is a revelation. If a worker node acts up, I don't log in to debug it. I send an API command to wipe the machine and reboot it from a pristine image. This "Zero-Touch" approach means managing 13 physical machines with Talos requires strictly less operational effort than managing 3 cloud VMs with traditional Linux distributions.
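That wipe-and-rebuild flow is a single API call (`talosctl reset --reboot --nodes <node-ip>`), and the node's desired state lives entirely in a declarative machine config. A heavily trimmed, illustrative worker config looks like this; real configs are generated with `talosctl gen config` and carry certificates and join tokens, so every value below is a placeholder:

```yaml
# Heavily trimmed, illustrative Talos worker machine config.
# All values are placeholders; real configs are machine-generated.
version: v1alpha1
machine:
  type: worker
  token: "<join-token>"
  install:
    disk: /dev/nvme0n1                            # wiped and reimaged on reset
    image: ghcr.io/siderolabs/installer:v1.7.0    # pin your actual version
cluster:
  controlPlane:
    endpoint: https://kube.example.com:6443
```

Feed that config to a booted machine and it converges; feed it to a wiped machine and you get an identical node back. That is the whole maintenance model.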
Conclusion: The Strategic Value of Independence
Trading the simplicity of a single Nomad binary for a 13-node Kubernetes cluster sounds counterintuitive—until you evaluate the outcomes.
By making this migration, I achieved three massive strategic victories:
- Uncompromising Reliability: I eliminated Single Points of Failure (SPOFs) for stateful applications using Rook-Ceph.
- Financial Predictability: I escaped the escalating costs of Public Cloud compute and SaaS observability tools, multiplying my cluster's power while slashing the bill.
- Total Sovereignty: By building on European physical hardware with 100% Free and Open Source Software (Talos, K8s, Ceph, OpenTofu), my infrastructure is completely immune to the US Cloud Act, arbitrary vendor price hikes, and proprietary licensing traps.
The cloud was supposed to make infrastructure easier. For many scaling businesses, it has simply made it more expensive and restrictive.
Stop renting your reliability. Start owning your stack.
What's Next? This post is the first in a series detailing my migration to a Sovereign Kubernetes stack. In the upcoming articles, I will break down the exact technical implementation—starting with how I use IaC to provision bare-metal servers on Hetzner and bootstrap Talos Linux from scratch. Stay tuned.
Technologies
Kubernetes, Nomad, bare-metal, Talos, Ceph, Grafana