Avatar
Tomas started his career as a PHP developer. After graduating, he worked at British Telecom as head of the Web Services department in Argentina. After that, he went to IBM, where he wore many technical hats: DBA, Sysdamin, and DevOps. He's now an independent consultant and writer. He loves to learn and to teach about technology. In his free time, he likes reading, sailing, and board gaming.

First released in 1997, PostgreSQL is in its twenties—quite something in the fast-paced software industry. Despite its age, PostgreSQL is the second most-used relational database system in the world, and has won (for the second consecutive time) the DBMS of the Year Award. How has it managed to stay relevant during all that time? And how does it fit in the present cloud-native era?

Databases on the Cloud

There are several ways of running databases in the cloud:

  • Self-hosted: The only alternative before the cloud existed, self-hosted, on-premises is the most predominant method for running a database. Spin up a server and install the software. Database administrators favor self-hosted installation since it gives them full control and produces the best possibilities for optimization.
  • Managed databases: Also known as database-as-a-service (DBaaS), it is the "no-ops" choice. The cloud provider takes on the administrative burden. It has no startup cost and a pay-as-you-go billing model, making it a popular choice for startups, although they risk vendor lock-in—migrating away from DBaaS can be difficult and expensive.
  • Orchestration: The middle, "some-ops" way. An abstraction layer decouples the database software from the underlying infrastructure. It promises the best of both worlds: full control of the databases without the burden of supporting infrastructure.

Most of the time, databases are the Achilles heel of a system—keeping performance up to par gets more challenging as databases grow. We’re living in the age of information and there is a never ending tide of data pooling in our databases.

We can throw hardware at the problem—this is called vertical scaling—and it works, up to a point. Asymptotically, however, we reach a point of diminishing returns where more hardware just doesn’t cut it. Before that happens, it’s time to start thinking about horizontal scaling— spreading the workload across more machines.

We now have a distributed computing problem in our hands. Most database systems have a replication mechanism to keep data synchronized across multiple machines. PostgreSQL is no exception:

  • Log shipping periodically updates read-only secondary copies of the databases.
  • Streaming replication copies data from a primary database to multiple secondaries in real-time. At any point, any of the secondary databases can take the role of the primary to recover from failures.
  • Multi-master replication is similar to streaming replication, but writes can be distributed across multiple master databases and then relayed to the rest in the network. Postgres does not support this method natively, but through several third-party solutions like Pgpool-II.

Postgres on Docker

Docker is a packaging solution for applications. Docker creates isolated environments called containers where software can run without interference or library conflicts. For instance, we can get a Postgres database from a single command:

$ docker run -e POSTGRES_PASSWORD=mysecretpassword -d postgres

With Docker, developers can standardize and streamline the way applications are run. They can run the same components and libraries across all environments.

By default, Docker doesn’t provide persistence. Changes are lost after a container stops. We can keep the data around by mapping a volume:

$ docker create volume pgdata
$ docker run -v pgdata:/var/lib/postgresql/data -d postgres

Or by directly mapping a directory on the host:

$ docker run -v /pgdata:/var/lib/postgresql/data -d postgres

Kubernetes for Databases

No matter how easy Docker makes it to install and run software, it still doesn’t offer any help in the way of managing a distributed database system.

Kubernetes is Google’s open-source solution for running distributed workloads. Its main job is to manage containers. Kubernetes adds several abstractions on top of containers:

  • A Pod is a group of one or more containers that are guaranteed to run on the same machine, sharing resources, and an IP address.
  • A ReplicaSet represents a set of identical pods. A ReplicaSet keeps track of pods running at any given time.
  • A Deployment manages ReplicaSets. It compares the actual state with the desired state and makes the changes required. It provides scaling and rolling updates.

Kubernetes has been widely adopted for running applications. For databases, however, the story has been different. Kubernetes was initially planned to run only stateless applications, and for many years running a DB on Kubernetes was considered infeasible or even dangerous.

Now developers who have adopted Kubernetes for their stateless applications are beginning to recognize the value of treating the database in the same way:

  • Flexibility: developers can run databases right next to their applications, taking responsibility away from database and system admins, tearing down the legacy technology silos.
  • Infrastructure as code: Kubernetes provides a declarative model to describe the entire system. Developers define the whole stack in manifest files that can be checked into source control.
  • Separation of concerns: each team takes ownership of their applications and databases.
  • Scalability: gives the possibility of easily adding more databases on demand.
  • No downtime upgrades: with Kubernetes rolling updates.
  • Cloud agnostic: Kubernetes is widely supported, making it easy to switch clouds or adopt hybrid architectures.
  • Freedom: to run any database, beyond the ones usually offered as DBaaS.

Kubernetes provides two resources to handle stateful workloads:

  • PersistentVolumes represent abstract storage volumes that can be assigned to pods.
  • StatefulSets are a specialized set of pods that guarantee their uniqueness. A StatefulSet works like a Deployment, but it maintains the identity of each of its pods. They are not interchangeable. Each pod has a persistent name and identifier that is preserved after restarts, a vital property for setting up a reliable database replication.

Kubernetes is a declarative configuration system. All we need to do is to tell it what we want and it will find the best way of achieving it.

For instance, you can create an initial PostgreSQL database with the following manifest:

# statefulset.yml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-database
spec:
  selector:
    matchLabels:
      app: postgres-database
  serviceName: postgres-service
  replicas: 1
  template:
    metadata:
      labels:
        app: postgres-database
    spec:
      containers:
        - name: postgres-database
          image: postgres
          volumeMounts:
            - name: postgres-disk
              mountPath: /var/lib/postgresql/data
          env:
            - name: POSTGRES_PASSWORD
              value: mysecretpassword
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
  volumeClaimTemplates:
    - metadata:
        name: postgres-disk
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

This StatefulSet contains a single pod with a dedicated 10 GB persistent disk. To complete the setup, you’ll need a service to forward incoming connections to the database. For instance, the following manifest defines a companion load-balancing service:

# service.yml
apiVersion: v1
kind: Service
metadata:
  name: postgres-loadbalancer
spec:
  selector:
    app: postgres-database
  type: LoadBalancer
  ports:
    - port: 5432
      targetPort: 5432

The database should be up and running in a few minutes after applying both files using Kubernetes kubectl admin tool:

$ kubectl apply -f statefulset.yml
$ kubectl apply -f service.yml

Extending Kubernetes

Databases, regardless of where they run, will always require backups, regular maintenance, and some automation to handle scaling and database replication. Fortunately, Kubernetes can be extended with operators. Operators add application-specific controls to manage complex operations on behalf of the user.

Multiple database solutions work with or on top of Kubernetes:

  • Patroni is a template for running highly available PostgreSQL on Kubernetes.
  • postgres-operator and CrunchyData PostgreSQL are open-source operators for PostgreSQL that feature rolling updates, volume and replication management, and automated backups.
  • KubeDB is another operator and a convenient command-line tool.
  • Kasten a data-management layer for Kubernetes with automated backups and disaster recovery.
  • KubeDirector is a custom controller for Kubernetes for running stateful applications.
  • Crunchy Containers is a suite of Docker images of rapid PostgreSQL deployment. It supports various orchestrators, including Kubernetes and ships with various administration tools and Helm charts for automated deployments.

Conclusion

Depending on your use case, running your databases on Kubernetes might be a great idea. Orchestration is not a magic bullet that will solve all your problems, but it is becoming a tempting option to consider, especially if you already are using it for your other applications and services.

How to work with us

  • Contact us to set up a call.
  • We will analyze your needs and recommend a content contract solution.
  • Sign on with ContentLab.
  • We deliver topic-curated, deeply technical content to you.

To get started, complete the form to the right to schedule a call with us.

Send this to a friend