Sketching my ideal Kubernetes cluster
Summary: Building a Kubernetes infrastructure with GitOps, Kustomize and Flux2. With: nginx and cert-manager for SSL and automatic certificate renewal, a mix of SOPS and Sealed Secrets for safe secret storage in git repositories, the CrunchyData Postgres operator for managing Postgres instances and their backups, a Prometheus / Grafana observability stack for monitoring, and multiple environment clusters (dev, staging, prod).
Why Kubernetes?
Kubernetes is a buzzword that has been around for a few years. And while I had used it from a “developer” perspective (ie, deploying applications, managing small services), I felt that I had never really designed a full-fledged production environment. When I needed to make a choice for hosting the Xapted infrastructure and all its supporting systems, it was an opportunity to learn by doing.
Another point is that while PaaS solutions are very easy to deploy, there is a lot of Infrastructure as Clicks (the opposite of Infrastructure as Code), which makes replication and configuration tracking difficult. Many times deployments involve going through tutorials and documentation which leave no trace other than the operator’s memory.
Starting point: GitOps
Coming from the Ansible world, keeping the infrastructure state in a repository is something I not only got used to, but fell in love with. It’s a great way to debug issues because all applied changes come from a single source of truth, and it has the benefit of turning the git repository into a traceable infrastructure changelog. So from the beginning this was one of my main requirements: to keep a single source of truth for the cluster at all times, enabling me to clone the entire infrastructure with a few commands.
The practice of using Git as a single source of truth for declarative infrastructure and applications is called GitOps, a term coined in 2017 specifically in the context of Kubernetes and Kubernetes infrastructure management.
Enter Flux
Flux takes GitOps to the next level. When installed, it defines a set of CRDs (Custom Resource Definitions) and services in your Kubernetes cluster that automate the deployment of infrastructure defined in git repositories. You can, for example, register git repositories and helm charts as sources of your infrastructure, and ask Flux to apply all the changes that happen to those repositories or helm charts.
# Defining a GitHub repository as a source
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/stefanprodan/podinfo
  ref:
    branch: master
# Defining a Kustomization that checks every 5 minutes whether
# the cluster is in sync with the repository, and applies changes
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 5m0s
  path: ./kustomize
  prune: true
  sourceRef:
    kind: GitRepository
    name: podinfo
  validation: client
This way, every time a pull request gets merged to the main branch (in my case), Flux will poll your git repository for changes, reconcile changes to the resources, and apply them. It will even clean up resources that are no longer specified.
Note: To connect your repository to your cluster you will need to “bootstrap” your Flux setup.
Kustomize
Flux + Kustomize are also a great combo to create segregated, synchronized cross-cloud clusters. Since version 1.14, kubectl supports applying Kustomize resources, which means no additional installations are needed.
kubectl apply -k infrastructure/staging/prometheus
Kustomize is a purely declarative way of extending and replacing Kubernetes resources and values. The most common pattern is specifying shared resources in base (ie resources that work the same across all clusters), and extending resources or overriding values for each of the cluster environments.
Creating additional permission objects like cluster-admin roles for GKE becomes as easy as adding a resource in the respective infrastructure folder, and adding it to the kustomization.yml file, along with all the inherited base resources.
├── apps
│ ├── base
│ ├── production
│ └── staging
├── infrastructure
│ ├── base
│ ├── production
│ └── staging
└── clusters
├── production
└── staging
# infrastructure/production/kustomization.yml
# Creating RBAC admin for a GKE cluster,
# and inheriting all resources from base
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- rbac-admin.yml
- ../base
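For completeness, here is a minimal sketch of what that rbac-admin.yml could contain. The binding name and user account are placeholders, not from the original setup:

```yaml
# infrastructure/production/rbac-admin.yml (sketch; user is a placeholder)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gke-cluster-admin-binding
subjects:
# The GCP account that should receive admin rights
- kind: User
  name: admin@example.com
  apiGroup: rbac.authorization.k8s.io
roleRef:
  # Built-in role granting full control over every cluster resource
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
```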
Secrets & GitOps
Of course, unencrypted secrets should never be in a git repository, and this poses a problem for the “single source of truth in git”.
You can create secrets and keep their state in the cluster (and nowhere else), or use a third-party service (like HashiCorp Vault) to sync the secrets to your cluster. But this would be a deviation from the “single source of truth” of GitOps. And even when using a third-party secret management solution, authenticating to that service might itself require secrets that need storage somewhere.
There are two main philosophies of GitOps secret management: decryption client-side (SOPS), and decryption cluster-side (Sealed Secrets).
SOPS - github link
SOPS is a way to encrypt secrets locally. You (and your team) hold a key locally that you use to encrypt secrets before you commit them into your git repository. SOPS is nice because it does not encrypt the YAML keys of the Kubernetes resource, only their values, which makes secrets easy to manage, edit and debug without having to decrypt the entire file.
However, you need to decrypt the secrets before applying them to your cluster. This also means that Flux needs access to a key in order to decrypt them and deploy things automatically.
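Flux supports this natively: a Kustomization can declare a decryption section pointing at a Kubernetes secret holding the SOPS private key. A minimal sketch, with illustrative resource names:

```yaml
# A Flux Kustomization that decrypts SOPS-encrypted manifests
# before applying them (names are illustrative)
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/staging
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops
    secretRef:
      # Kubernetes secret holding the private GPG/age key
      name: sops-keys
```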
Sealed Secrets - github link
Sealed Secrets takes a slightly different approach. When you install Sealed Secrets, it creates a master key in your cluster. You can then encrypt secrets using kubeseal, which fetches the public part of that master key from the cluster. This means that if you can’t reach your cluster to fetch the key (eg. you’re offline), you won’t be able to encrypt or decrypt secrets.
# Sealing a secret
$ cat gitlab-registry-secret-plain.yml | kubeseal -o yaml
Unlike SOPS, secrets “sealed” with kubeseal are a bit more opaque for debugging, and harder to edit. A newer --raw feature of kubeseal makes it relatively easier to replace or add secrets, but it still has some friction.
# Encrypting a raw secret
$ echo "hello" | kubeseal --raw --name my_secret_name --namespace my_namespace
Another disadvantage of Sealed Secrets is that it generates a new key for each cluster. This means that if you recreate your cluster from scratch, your old secrets are no longer accessible to the new cluster.
The biggest advantage of Sealed Secrets is that users can update metadata and apply sealed secrets without needing access to the secrets themselves (remember, unlike Sealed Secrets, SOPS has to decrypt the secrets before applying them). This allows a cluster administrator to give developers access to the SealedSecret custom resources, but not the Secret resources themselves (which contain the secrets in plain base64).
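To make the distinction concrete, a SealedSecret resource looks roughly like this. The ciphertext below is a truncated placeholder, not real kubeseal output:

```yaml
# A SealedSecret as produced by kubeseal; only the controller
# in the cluster can decrypt encryptedData back into a Secret
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: gitlab-registry-secret
  namespace: my-namespace
spec:
  encryptedData:
    # opaque ciphertext (placeholder, truncated)
    .dockerconfigjson: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
  template:
    # Metadata for the Secret that gets created in the cluster
    metadata:
      name: gitlab-registry-secret
      namespace: my-namespace
    type: kubernetes.io/dockerconfigjson
```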
Hybrid approach - Using both
I first came across this approach through xUnholy | Raspbernetes.
SOPS can also be used to store backups of the Sealed Secrets master key (in the repository, or using key management services like KMS that integrate with SOPS). This has the advantage of allowing re-creation of a cluster from scratch at any time (as the key is also stored in the repository).
NOTE: By default, Sealed Secrets rotates the sealing key every 30 days. Rotation means that new secrets are encrypted with the new key, while old secrets are left unchanged. You will need to back up the new sealing key every time it is rotated, and to restore the cluster you need to add all previous keys as well as the new one (otherwise you will lose access to the older secrets).
Nginx and Let’s Encrypt
Ingresses
To receive and route requests, Kubernetes uses a resource type called Ingress. However, Ingresses need to be implemented by an Ingress controller of your choice. Kubernetes officially supports the AWS, GCE, and nginx ingress controllers.
The AWS and GCE Ingress controllers create and manage external load balancers (managed by Amazon and Google), while nginx handles requests arriving at a single external load balancer through an nginx reverse proxy service within the cluster. Because GCE and AWS external ingresses come with added provider cost, nginx is a cheaper alternative. If your main concern is scalability, GCE/AWS are virtually infinitely scalable and not bound by the resources of your cluster, but because we’re building a small cluster, I chose nginx.
SSL and Let’s Encrypt
To ensure secure connections to services in the cluster (ie ensure that no one can intercept and read data sent between the client and the services) we need a certificate issued by a trusted authority to encrypt the data. Previously these certificates could be relatively expensive, but Let’s Encrypt has made it incredibly accessible (ie free) to get them.
Projects like cert-manager (not to be confused with certbot, Let’s Encrypt’s standalone client) have automated the process of requesting and renewing certificates, and can be configured for Kubernetes.
The setup, to simplify things, is to have a cert-manager ClusterIssuer, which issues certificates to all the services in the cluster, regardless of their namespace.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: cluster-certificate-issuer
spec:
  acme:
    # You must replace this email address with your own.
    # Let's Encrypt will use this to contact you about expiring
    # certificates, and issues related to your account.
    email: YOUR_EMAIL@EMAIL.COM
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      # Secret resource that will be used to store the account's private key.
      name: main-issuer-account-key
    # Add a single challenge solver, HTTP01 using nginx
    solvers:
    - http01:
        ingress:
          class: nginx
To make connections to a service encrypted, all we need to do is add a tls section to its Ingress, listing the service’s hosts and the secret in which the certificates will be stored.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service-webserver-ingress
  namespace: my-namespace
  annotations:
    # Tell cert-manager which issuer should provide the certificate
    cert-manager.io/cluster-issuer: cluster-certificate-issuer
spec:
  ...
  tls:
  - hosts:
    - my_service.xapted.com
    secretName: my-service-cert
This wasn’t all without trouble. The nginx ingress is sometimes cloud-provider specific, and cert-manager + nginx has some known issues related to nginx and the PROXY protocol. If you’re using DigitalOcean, be sure to follow their guides closely regarding their custom annotations (eg service.beta.kubernetes.io/do-loadbalancer-enable-proxy-protocol: 'true') and read more about it here
Databases and a death wish
Kubernetes and containerisation are the promised land, as long as you stay in the realm of statelessness. Things get more complicated with stateful applications: scaling them and upgrading them. You don’t want to lose persistent volumes during an upgrade, or end up with inconsistent data across multiple Postgres instances.
Hosted databases are a real option, but using cloud proxies as sidecars, or managing complex permissions and secrets, has always been for me a stain in a beautiful picture. Connecting over the internet to hosted databases (like Aiven or DigitalOcean) is a bit cleaner, but with each service and new database the bill racks up.
The simplest solution for hosting a database in Kubernetes safely is using StatefulSets. But that doesn’t include backups or graceful scalability (each StatefulSet replica has its own VolumeClaim, which needs to be kept in sync with the other replicas if you want to ensure consistency). Cue Postgres operators.
Postgres operators are custom resources that include services for managing, backing up and replicating Postgres instances in a cluster (eg. via Patroni). They are the closest you get to having a managed database in your Kubernetes cluster. Here are some of them:
- Zalando Postgres Operator
- CrunchyData Postgres Operator
- KubeDB
From an initial inspection, the Zalando operator seemed to be the right choice, with significant history, GitHub stars, and a neat web UI as an extra. However, as time went on, it started to show how much it is tailored to a large organization. Features like teams, which are hard to avoid, become an annoyance, and the web UI, which looks great at first, becomes far less relevant.
KubeDB was immediately out of the race since they recently changed their licenses and now require a license key even for the open source version of their platform.
The CrunchyData operator, for its simplicity, seems to be a great choice overall, with Zalando a close second for more complex use cases and large teams.
# A CrunchyData postgres cluster with backups stored in Kubernetes
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: my-db-name
  namespace: my-application-namespace
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-ha:centos8-13.4-0
  postgresVersion: 13
  instances:
  - name: instance1
    dataVolumeClaimSpec:
      accessModes:
      - "ReadWriteOnce"
      resources:
        requests:
          storage: 1Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-2
      repoHost:
        dedicated: {}
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 1Gi
Using the database becomes as easy as attaching the secrets created by the Postgres operator to your container environment. It’s worth noting that your application might crash at first, while the secrets are still being created by the initializing database cluster.
env:
- name: POSTGRES_HOST
  valueFrom: { secretKeyRef: { name: my-db-name-pguser-my-db-name, key: host } }
- name: POSTGRES_PORT
  valueFrom: { secretKeyRef: { name: my-db-name-pguser-my-db-name, key: port } }
- name: POSTGRES_DB
  valueFrom: { secretKeyRef: { name: my-db-name-pguser-my-db-name, key: dbname } }
- name: POSTGRES_USER
  valueFrom: { secretKeyRef: { name: my-db-name-pguser-my-db-name, key: user } }
- name: POSTGRES_PASSWORD
  valueFrom: { secretKeyRef: { name: my-db-name-pguser-my-db-name, key: password } }
Monitoring
Visibility into what’s happening inside your cluster is important. You want to know that your services are properly resourced (eg. that they are not dangerously close to their memory limits), and that your cluster has enough resources to schedule all your pods. Prometheus and Grafana are a very well-known combo for monitoring services.
One of the reasons why Flux is so great for GitOps becomes painfully apparent when installing the Prometheus monitoring stack: YAML files. Lots of YAML files.
As you can see in the kube-prometheus project, there are a LOT of Kubernetes YAML resources. The de facto method of installing applications like these in GitOps without Flux would be to copy all the manifests, or the Helm template if available, to your own repo. That is a lot of YAML files to track when doing updates.
This can be easily fixed with Flux by simply creating a Kustomization resource that points to the official repo folder containing the Kustomization resource. But as I discuss later, I really don’t feel comfortable doing this in production :)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://github.com/prometheus-operator/kube-prometheus/tree/main/manifests
A better approach (and still on my To Do) is to create a HelmRelease Flux resource pointing to a helm repository. This handles the deployment without you having to make a hard copy of all the YAML resources in your own Git repository. Nevertheless, applying the monitoring YAML manifests of Kube-Prometheus “just works”.
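As a sketch of that approach (the chart name, version and namespaces here are assumptions, not something I have deployed in this setup), a HelmRepository source plus a HelmRelease would look something like:

```yaml
# Register the chart repository as a Flux source
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: prometheus-community
  namespace: flux-system
spec:
  interval: 1h
  url: https://prometheus-community.github.io/helm-charts
---
# Let Flux install and upgrade the chart like any other resource
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  interval: 30m
  chart:
    spec:
      chart: kube-prometheus-stack
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
```

Overrides that would otherwise go into values.yaml can be set inline under spec.values, keeping everything in the same Git repository.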
As a small positive note, the Prometheus exporters integrate nicely with many tools like Lens, which will display the entire cluster’s metrics right in your local environment.
Resource optimization
Unless you have really deep pockets, you might not want to use the default resource allocations from the official helm charts and Kustomize manifests. A clean cluster with application defaults will be mostly idle, with lots of memory and CPU reserved. In the worst case your cluster will have unused reserved resources, and you’ll need to scale your nodes to ensure all deployments can be scheduled (if you run out of non-reserved memory, your cluster won’t be able to schedule your pods for lack of resources).
If you’re planning a big cluster, that’s probably not an issue. But if you want to start small, tweaking the requests resources in the manifests (using Helm values.yaml or Kustomize resources) is a good idea.
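As an illustration of the Kustomize route (the deployment and container names below are hypothetical), a strategic-merge patch in a cluster overlay can lower the defaults:

```yaml
# infrastructure/staging/monitoring/resource-adjustments.yaml (hypothetical)
# A strategic-merge patch lowering the default requests of one deployment;
# referenced from the overlay's kustomization.yaml via patchesStrategicMerge
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  template:
    spec:
      containers:
      - name: grafana
        resources:
          requests:
            cpu: 50m
            memory: 128Mi
```

Kustomize merges this fragment over the base manifest, so only the requests change while everything else is inherited.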
The Prometheus/Grafana monitoring stack is the perfect tool for this job, allowing you to see resource allocation per namespace, node and workload. Ideally you want to reach a state where, while idle, every deployment sits just slightly below 100% of its resource requests. That way you’re not reserving unused resources that could be used to schedule other pods.
Here we can see that the entire monitoring namespace is using memory very efficiently: in an idle state it is using 90% of the requests. However, it seems that I’m requesting way too much CPU, as only 34% of the requested (ie reserved) resources are being used.
I also used a really nice project called kube-resource-report, which gives a simplified graphical view of your resource usage, along with cost estimations and resource configuration suggestions for each deployment.
Last remarks
This has been an extremely interesting endeavour. It changed my perception of Kubernetes away from being a “one size fits all” solution towards viewing it more as a framework, especially given the number of CRDs I ended up with.
My main requirement of having a “single source of truth” has been fulfilled, and I have successfully recreated the staging cluster from scratch (including backed-up master keys) without major problems. Deploying changes to the architecture is also as easy as making a commit to the repository (main gets deployed to staging, and stable gets deployed to production).
Continuous deployment is still hard, since many projects I want to deploy live in separate repositories. I went for a mono-repo structure for the infrastructure, which means I still need to manually update an image tag every time a project has a new version, or restart a deployment using the latest image.
There are, however, Flux tools that automatically scan docker registries or git repositories for new images and replace image tags. That’s also on my to-do list.
A couple of things I was not happy with along the way:
- It’s still incredibly difficult to manage secrets in Kubernetes in a GitOps way. Editing secrets when you have access to the master keys shouldn’t be this hard.
- Managing resource dependencies (ie. applying one resource only when its CRDs or dependent resources are applied) is still cumbersome. Flux and kubectl do allow for health checks, but it forces you to split Kustomize resources into multiple Flux resources (eg. separate the Prometheus CRDs from the Prometheus resources, and make Prometheus depend on the CRD Kustomization resource).
- The two Kustomization types are a really silly thing. Long story short, there is an official Kustomization Kubernetes resource, and a Flux2 Kustomization resource from a CRD. It would be really nice if Flux2 renamed their Kustomization resource to KustomizationRelease, or similar.
- Cert-manager + nginx was surprisingly one of the hardest things to work with. Some issues included certificates not being created / duplicate certificate requests, and issues with DigitalOcean and other cloud providers related to a Kubernetes upstream issue with the PROXY protocol. Many of the resolutions were “just delete resource X”, and many issue trackers were closed without reason, which gets frustrating after a couple of instances.
- I’m still not entirely confident in pointing Flux Kustomization resources directly at project repositories. Many projects provide an installation example, but you clearly shouldn’t point your production resources at those Kustomization resources (eg the CrunchyData Postgres operator). It would be nice if projects started keeping references to maintained Kubernetes/Kustomize manifests that can be used in production, just like Helm charts.
- Flux does not allow you to create cross-resource dependencies (eg a Kustomization with a HelmRelease dependency). This is a significant pain, and can end up in a Russian doll of Kustomize/Helm resource encapsulation. In my case I had to create a Flux Kustomization (that depends on Sealed Secrets) that points to a folder containing a Flux HelmRelease resource. This is a known issue/request
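For reference, the health-check and dependency workaround mentioned above looks roughly like this in practice. The resource names and paths are illustrative, not from my repository:

```yaml
# Apply the monitoring resources only after the CRD Kustomization
# is ready, and gate readiness on the operator deployment
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: monitoring
  namespace: flux-system
spec:
  # Wait for a sibling Kustomization that applies only the CRDs
  dependsOn:
  - name: monitoring-crds
  interval: 10m
  path: ./infrastructure/base/monitoring
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  # Consider this Kustomization healthy only once the operator is up
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: prometheus-operator
    namespace: monitoring
```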
Appendix 1 - The final setup structure
│# (FLUX2 RESOURCES)
│
├── clusters
│ ├── common
│ │ ├── infrastructure
│ │ │ ├── cert-manager.yaml
│ │ │ ├── ingress-nginx.yaml
│ │ │ ├── kustomization.yaml
│ │ │ ├── monitoring.yaml
│ │ │ ├── postgres-operator.yaml
│ │ │ └── sealed-secrets.yaml
│ │ └── kustomization.yaml
│ │
│ ├── production
│ │ ├── flux-system
│ │ │ ├── gotk-components.yaml
│ │ │ ├── gotk-sync.yaml
│ │ │ └── kustomization.yaml
│ │ ├── projects
│ │ │ ├── project_1.yaml
│ │ │ ├── ...
│ │ │ └── kustomization.yaml
│ │ │── tools
│ │ │ ├── internal_tool_1.yaml
│ │ │ ├── ...
│ │ │ └── kustomization.yaml
│ │ └── kustomization.yaml (includes clusters/common)
│ │
│ └── staging
│ └── (SIMILAR TO PRODUCTION)
│
│# (KUBERNETES / KUSTOMIZE RESOURCES)
│
├── infrastructure
│ ├── cert-manager
│ │ ├── cert-manager.yaml
│ │ ├── kustomization.yaml
│ │ ├── resource-adjustements.yaml
│ ├── ingress-nginx
│ │ ├── ...
│ │ ...
│ └── kustomization.yaml
│
├── projects
│ ├── base
│ │ ├── kustomization.yaml
│ │ └── project_1
│ │ ├── config.yaml
│ │ ├── database.yaml
│ │ ├── kustomization.yaml
│ │ ├── namespace.yaml
│ │ └── project_1_deployment.yaml
│ ├── production
│ │ ├── kustomization.yaml
│ │ └── project_1
│ │ ├── kustomization.yaml
│ │ └── project_1-prod-secrets.yaml
│ └── staging
│ ├── kustomization.yaml
│ └── project_1
│ ├── ingress.yaml
│ ├── kustomization.yaml
│ └── project_1-staging-secrets.yaml
├── tools
│ ├── base
│ │ ├── kustomization.yaml
│ │ └── internal_tool_1
│ │ ├── config.yaml
│ │ ├── database.yaml
│ │ ├── kustomization.yaml
│ │ ├── namespace.yaml
│ │ └── internal_tool_1_deployment.yaml
│ ├── production
│ │ ├── kustomization.yaml
│ │ └── internal_tool_1
│ │ ├── kustomization.yaml
│ │ └── internal_tool_1-prod-secrets.yaml
│ └── staging
│ ├── kustomization.yaml
│ └── internal_tool_1
│ ├── ingress.yaml
│ ├── kustomization.yaml
│ └── internal_tool_1-staging-secrets.yaml