First released in 1997, PostgreSQL is in its twenties—quite something in the fast-paced software industry. Despite its age, PostgreSQL is the second most-used relational database system in the world, and has won (for the second consecutive time) the DBMS of the Year Award. How has it managed to stay relevant during all that time? And how does it fit in the present cloud-native era?
Databases on the Cloud
There are several ways of running databases in the cloud:
- Self-hosted: The only alternative before the cloud existed, self-hosted, on-premises is the most predominant method for running a database. Spin up a server and install the software. Database administrators favor self-hosted installation since it gives them full control and produces the best possibilities for optimization.
- Managed databases: Also known as database-as-a-service (DBaaS), it is the "no-ops" choice. The cloud provider takes on the administrative burden. It has no startup cost and a pay-as-you-go billing model, making it a popular choice for startups, although they risk vendor lock-in—migrating away from DBaaS can be difficult and expensive.
- Orchestration: The middle, "some-ops" way. An abstraction layer decouples the database software from the underlying infrastructure. It promises the best of both worlds: full control of the databases without the burden of supporting infrastructure.
Most of the time, databases are the Achilles heel of a system—keeping performance up to par gets more challenging as databases grow. We’re living in the age of information and there is a never ending tide of data pooling in our databases.
We can throw hardware at the problem—this is called vertical scaling—and it works, up to a point. Asymptotically, however, we reach a point of diminishing returns where more hardware just doesn’t cut it. Before that happens, it’s time to start thinking about horizontal scaling— spreading the workload across more machines.
We now have a distributed computing problem in our hands. Most database systems have a replication mechanism to keep data synchronized across multiple machines. PostgreSQL is no exception:
- Log shipping periodically updates read-only secondary copies of the databases.
- Streaming replication copies data from a primary database to multiple secondaries in real-time. At any point, any of the secondary databases can take the role of the primary to recover from failures.
- Multi-master replication is similar to streaming replication, but writes can be distributed across multiple master databases and then relayed to the rest in the network. Postgres does not support this method natively, but through several third-party solutions like Pgpool-II.
Postgres on Docker
Docker is a packaging solution for applications. Docker creates isolated environments called containers where software can run without interference or library conflicts. For instance, we can get a Postgres database from a single command:
$ docker run -e POSTGRES_PASSWORD=mysecretpassword -d postgres
With Docker, developers can standardize and streamline the way applications are run. They can run the same components and libraries across all environments.
By default, Docker doesn’t provide persistence. Changes are lost after a container stops. We can keep the data around by mapping a volume:
$ docker create volume pgdata $ docker run -v pgdata:/var/lib/postgresql/data -d postgres
Or by directly mapping a directory on the host:
$ docker run -v /pgdata:/var/lib/postgresql/data -d postgres
Kubernetes for Databases
No matter how easy Docker makes it to install and run software, it still doesn’t offer any help in the way of managing a distributed database system.
Kubernetes is Google’s open-source solution for running distributed workloads. Its main job is to manage containers. Kubernetes adds several abstractions on top of containers:
- A Pod is a group of one or more containers that are guaranteed to run on the same machine, sharing resources, and an IP address.
- A ReplicaSet represents a set of identical pods. A ReplicaSet keeps track of pods running at any given time.
- A Deployment manages ReplicaSets. It compares the actual state with the desired state and makes the changes required. It provides scaling and rolling updates.
Kubernetes has been widely adopted for running applications. For databases, however, the story has been different. Kubernetes was initially planned to run only stateless applications, and for many years running a DB on Kubernetes was considered infeasible or even dangerous.
Now developers who have adopted Kubernetes for their stateless applications are beginning to recognize the value of treating the database in the same way:
- Flexibility: developers can run databases right next to their applications, taking responsibility away from database and system admins, tearing down the legacy technology silos.
- Infrastructure as code: Kubernetes provides a declarative model to describe the entire system. Developers define the whole stack in manifest files that can be checked into source control.
- Separation of concerns: each team takes ownership of their applications and databases.
- Scalability: gives the possibility of easily adding more databases on demand.
- No downtime upgrades: with Kubernetes rolling updates.
- Cloud agnostic: Kubernetes is widely supported, making it easy to switch clouds or adopt hybrid architectures.
- Freedom: to run any database, beyond the ones usually offered as DBaaS.
Kubernetes provides two resources to handle stateful workloads:
- PersistentVolumes represent abstract storage volumes that can be assigned to pods.
- StatefulSets are a specialized set of pods that guarantee their uniqueness. A StatefulSet works like a Deployment, but it maintains the identity of each of its pods. They are not interchangeable. Each pod has a persistent name and identifier that is preserved after restarts, a vital property for setting up a reliable database replication.
Kubernetes is a declarative configuration system. All we need to do is to tell it what we want and it will find the best way of achieving it.
For instance, you can create an initial PostgreSQL database with the following manifest:
# statefulset.yml apiVersion: apps/v1 kind: StatefulSet metadata: name: postgres-database spec: selector: matchLabels: app: postgres-database serviceName: postgres-service replicas: 1 template: metadata: labels: app: postgres-database spec: containers: - name: postgres-database image: postgres volumeMounts: - name: postgres-disk mountPath: /var/lib/postgresql/data env: - name: POSTGRES_PASSWORD value: mysecretpassword - name: PGDATA value: /var/lib/postgresql/data/pgdata volumeClaimTemplates: - metadata: name: postgres-disk spec: accessModes: ["ReadWriteOnce"] resources: requests: storage: 10Gi
This StatefulSet contains a single pod with a dedicated 10 GB persistent disk. To complete the setup, you’ll need a service to forward incoming connections to the database. For instance, the following manifest defines a companion load-balancing service:
# service.yml apiVersion: v1 kind: Service metadata: name: postgres-loadbalancer spec: selector: app: postgres-database type: LoadBalancer ports: - port: 5432 targetPort: 5432
The database should be up and running in a few minutes after applying both files using Kubernetes kubectl admin tool:
$ kubectl apply -f statefulset.yml $ kubectl apply -f service.yml
Databases, regardless of where they run, will always require backups, regular maintenance, and some automation to handle scaling and database replication. Fortunately, Kubernetes can be extended with operators. Operators add application-specific controls to manage complex operations on behalf of the user.
Multiple database solutions work with or on top of Kubernetes:
- Patroni is a template for running highly available PostgreSQL on Kubernetes.
- postgres-operator and CrunchyData PostgreSQL are open-source operators for PostgreSQL that feature rolling updates, volume and replication management, and automated backups.
- KubeDB is another operator and a convenient command-line tool.
- Kasten a data-management layer for Kubernetes with automated backups and disaster recovery.
- KubeDirector is a custom controller for Kubernetes for running stateful applications.
- Crunchy Containers is a suite of Docker images of rapid PostgreSQL deployment. It supports various orchestrators, including Kubernetes and ships with various administration tools and Helm charts for automated deployments.
Depending on your use case, running your databases on Kubernetes might be a great idea. Orchestration is not a magic bullet that will solve all your problems, but it is becoming a tempting option to consider, especially if you already are using it for your other applications and services.