In this article, I want to share the experience I gained, with all its pain and challenges, supporting business growth and solving other database-related problems in production. When your system starts with an empty database and the number of users and requests is low, everything is easy, and it is the right time to prepare for future high loads and for great success in business terms. My approach to systems and infrastructure engineering is to focus on being ready to support business growth and success, not only on current operations. You start your business from small numbers, then grow 2x, 4x, 10x, and at some point the growth skyrockets, and you either handle it or lose. It's a challenge, and no one wants to lose.
Here is my experience building and maintaining databases to support business growth. There will be three major parts: the beginning (setting up to grow), from 2x to 10x (the basic steps of optimization), and from 10x to skyrocketing (complex optimizations and architecture for a fast-growing business).
Every big cloud, be it AWS, Google, or Azure, has its own perfectly managed Postgres solution with plenty of nice docs on how to implement it according to the rules of a well-architected framework; you just pay for it ;) and live with its restrictions. The key benefit of running Postgres in Kubernetes on-premise is nearly unlimited IOPS: current SSDs and NVMe drives provide 100,000-120,000 IOPS, while AWS RDS limits you to 16,000 per instance. They have dedicated IO too, but it is costly, with prices starting from the monthly salary of a senior DevOps ;). The second great benefit: Postgres on dedicated hardware is 5 to 10 times cheaper than a managed solution. So, I vote for Postgres and Kubernetes to save some cash and invest it in business growth and/or in my pocket.
To build a good and reliable solution, there is nothing to invent; we are just going to follow the best practices. Best practices are pretty much the same across the IT industry and suitable for any cloud platform or on-premise deployment. The AWS Well-Architected Framework and the Google Cloud Architecture Framework are good examples, just read them. At the start, we are going to run Postgres on a small/medium instance size, but in a safe, reliable, and ready-to-grow configuration.
There are two major ways to deploy Postgres to Kubernetes. The first is using Helm charts like bitnami/postgresql or Serge's postgresql-single, which has some benefits over the Bitnami chart. The second way is to use a Kubernetes operator. There are a couple of them; here we will talk about the Zalando Postgres Operator. The Helm way is better when you do not plan to run more than one or two Postgres clusters per Kubernetes cluster. If you realize you will run more, then the operator way is for you, with all its automation and management advantages.
Here is an example PostgreSQL cluster manifest for the Zalando Postgres Operator, pg-prod.yaml.
This example was tested under intensive transaction processing with query rates of up to 3000-4000 per second and a database size of up to 350GB (data and indexes). It is enough to start most projects with thousands of active users: medium-volume webshops, news and community websites, medium-sized gaming or gambling, etc. We assume that the operator is already set up, along with an S3 bucket with encryption and access control for backups and WAL-log storage.
Let’s review the config.
Please open the manifest in a new tab or window and let's have a look.
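For readers who don't want to switch tabs, here is a trimmed sketch of the overall shape of such a manifest. The names, namespace, version, and volume size are illustrative; the full pg-prod.yaml in the repo linked at the end is the source of truth.
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-pg-prod            # the operator expects the teamId as the name prefix
  namespace: databases          # illustrative namespace
spec:
  teamId: "acid"
  numberOfInstances: 2          # primary + standby managed by Patroni
  volume:
    size: 200Gi                 # persistent volume per instance, illustrative
  users:
    app_user: []                # application role; credentials land in a Kubernetes Secret
  databases:
    app_db: app_user            # database name: owner role
  postgresql:
    version: "15"               # illustrative, use the version from pg-prod.yaml
    parameters: {}              # all the settings reviewed below go into this map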
We want maximum availability with minimum downtime, so numberOfInstances: 2. This will create a classic Postgres primary-secondary pair managed by the Zalando Postgres Operator and Patroni under the hood.
The top lines in the config are about name, version, etc. Let's look at the Postgres config part.
First, I prefer autovacuum enabled for small and medium-sized deployments. On the one hand, autovacuum keeps tables in good condition for effective data searches and manipulation; on the other hand, at this scale it does not noticeably affect overall Postgres performance. For huge, high-load deployments you can disable autovacuum; I am going to discuss that in future articles.
autovacuum_analyze_scale_factor: "0.1"
autovacuum_vacuum_scale_factor: "0.2"
The second, and very important, part is logging. With these settings we keep an eye on slow queries, lock waits, and temporary files, which heavily affect overall Postgres performance, so all issues can be detected and fixed quickly. I will give detailed explanations of each type of issue and how to fix it in the next articles. For detailed migration control, data protection, and security reasons, we also want to log all DDL (data definition language) queries.
log_destination: "stderr"
log_connections: "off"
log_disconnections: "off"
log_min_duration_statement: "100ms"
log_statement: "ddl"
log_lock_waits: "on"
log_temp_files: "0"
The next part is connection limits.
max_connections: "150"
superuser_reserved_connections: "5"
It's simple: Postgres spends a lot of resources on managing connections. For production use, we must limit the number of connections to keep CPU/memory consumption in check and preserve a few reserved connections for the superuser for maintenance, recovery, and troubleshooting.
max_standby_archive_delay: "900s"
max_standby_streaming_delay: "900s"
wal_level: "logical"
max_wal_senders: "4"
max_replication_slots: "4"
Our cluster is built on the physical (streaming) replication protocol. We set wal_level to logical, which records everything the replica level does plus what is needed for logical decoding, so streaming replication works now and logical replication stays available for the future. The max_wal_senders and max_replication_slots values account for one or two standby replicas, a base backup (if you use it), and one or two reserved slots for future physical replication, for example, replicas in another region or for data analytics. We also must set limits on the delay between the secondary and the primary. Some queries can cause significant replication lag, but we want the replica to hold the most recent data. With these parameters set, the standby will cancel queries that conflict with replay once the delay is exceeded, reducing its workload so it can catch up with the primary. There is also a Prometheus metric to track the replica lag, pg_replication_lag. Here is an example of a Grafana expression to watch the replica lag; adding a >= threshold to it gives you the alert expression.
sum by(kubernetes_pod_name) (pg_replication_lag{source="$cluster",kubernetes_pod_name="$instance"})
The parallelism settings below work more or less well for an instance with 2 to 10 CPU cores; for the actual values, refer to the resources section of the manifest (a sketch follows the settings below).
max_worker_processes: "16"
max_parallel_workers: "8"
max_parallel_workers_per_gather: "2"
max_parallel_maintenance_workers: "2"
The memory and other settings are as follows. The work_mem and maintenance_work_mem values are good for orders/financial data processing, where most rows are numbers or relatively small varchars, text, or JSON. If your workload and most of your data are big texts, JSONs, or blobs, consider increasing these values, and increase the memory requests/limits in the resources section too.
max_wal_size: "4GB"
min_wal_size: "2GB"
wal_keep_size: "2GB"
effective_cache_size: "4GB"
shared_buffers: "2GB"
work_mem: "64MB"
maintenance_work_mem: "256MB"
# we're on ssd
effective_io_concurrency: "100"
random_page_cost: "1.1"
# enable the extensions, pg_stat_statements is a must-have for production
shared_preload_libraries: "pg_stat_statements,pg_cron,pg_trgm,pgcrypto,pg_stat_kcache"
track_io_timing: "on"
pg_stat_statements.max: "1000"
pg_stat_statements.track: "all"
cron.database_name: "postgres"
synchronous_commit: "local"
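For reference, the resources section of the manifest that the parallelism and memory settings above refer to looks roughly like this; the numbers are illustrative and should track your node sizes and the shared_buffers/work_mem values you picked:
resources:
  requests:
    cpu: "2"          # matches the 2-10 core assumption above
    memory: 6Gi       # room for shared_buffers plus per-connection work_mem
  limits:
    cpu: "4"
    memory: 8Gi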
Very important podAnnotations protect our Postgres pods from eviction and enable monitoring:
podAnnotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  prometheus.io/port: "9187"
  prometheus.io/scrape: "true"
  prometheus.io/path: "/metrics"
To protect production from connection storms and improve connection management, enable the connection pooler (PgBouncer).
enableConnectionPooler: true
enableReplicaConnectionPooler: true
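With these two flags the operator deploys PgBouncer in front of both the primary and the replica and exposes them as separate Kubernetes services (typically cluster-name-pooler and cluster-name-pooler-repl). If the defaults are not enough, the pooler can be tuned with an optional connectionPooler block; a sketch with illustrative values:
connectionPooler:
  numberOfInstances: 2          # two PgBouncer pods for redundancy
  mode: "transaction"           # transaction pooling suits short OLTP queries
  resources:
    requests:
      cpu: 250m
      memory: 100Mi
    limits:
      cpu: "1"
      memory: 200Mi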
For backups, I chose the logical backup method supported by the operator: I wanted the operator to take care of backups with its own supplied tooling; otherwise, you have to set up your own backup solution.
enableLogicalBackup: true
logicalBackupSchedule: "30 10 * * *"
Now check the additionalVolumes and sidecars sections. We are going to run two sidecars: promtail, to ship the logs to our Loki service, and postgres-exporter, for advanced monitoring. Their configs live in ConfigMaps supplied with the manifest. The promtail config is pretty straightforward. Let's check the postgres-monitoring-queries ConfigMap. We run the pg_stat_statements set of metrics for the query analysis dashboard, to manage query efficiency and gather the basic data for future query optimizations. The pg_replication_lag metric is also here; it is a very important metric for managing data quality in replicas. The third metric, pg_postmaster_start_time_seconds, is good for Postgres uptime/restart management and alerting.
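As an illustration of what lives in that ConfigMap, here is a minimal sketch of the replication lag entry in the standard postgres-exporter custom-queries format; the actual queries in the repo may differ in detail:
pg_replication:
  query: "SELECT CASE WHEN NOT pg_is_in_recovery() THEN 0 ELSE GREATEST(0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))) END AS lag"
  master: true
  metrics:
    - lag:
        usage: "GAUGE"
        description: "Replication lag behind primary in seconds"
The exporter concatenates the key and the column name, which is how the pg_replication_lag metric used in the dashboards and alerts gets its name.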
Under the hood, the Zalando Postgres Operator takes care of many things and helps a lot with routine maintenance. In the operator configuration, we set up S3 storage for backups and the WAL-log archive, backup options, health checks, timeouts, etc. After deployment, we will have WAL logs and backups shipped to S3, and the operator will take care of their retention. The operator will also gracefully restart Postgres instances and poolers for reconfiguration or in case of failures.
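A sketch of the relevant operator-level settings, assuming the ConfigMap-based operator configuration; the bucket names are illustrative and credentials are normally provided via IAM roles or a secret:
# excerpt from the postgres-operator ConfigMap
wal_s3_bucket: "my-company-pg-wal"                  # WAL archive and base backups (WAL-G/WAL-E)
logical_backup_s3_bucket: "my-company-pg-backups"   # destination for the scheduled logical backups
logical_backup_schedule: "30 10 * * *"              # default schedule, can be overridden per cluster
logical_backup_s3_sse: "AES256"                     # server-side encryption for backup objects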
Now a few words about monitoring and alerting. With Prometheus or VictoriaMetrics installed, our Postgres metrics will be scraped. For Grafana, I would recommend setting up the kubernetes-pods, pg-monitoring, pgbouncer, and pg-replication-lag dashboards. It would also be great to set up pg-query-overview.json and pg-query-drilldown.json. These dashboards were originally built by Percona for its PMM and then quickly ported by me to pure Grafana and VictoriaMetrics/Prometheus, so their configuration could be tricky and take time. But you will get a cool picture of pg_stat_statements_calls, the most time-consuming queries, the most frequently called queries, etc., and you will get insights about queries from query analytics: timings, buffers, reads, et cetera. And you don't need to set up any third-party Postgres monitoring tools like pgBadger, SolarWinds Postgres analytics, and others.
The alert set we use:
- an alert that PostgreSQL is running: pg_up != 1 or no data,
- an alert on pg_replication_lag > max_standby_archive_delay, or any threshold you consider reasonable (see the rule sketch below),
- alerts like PersistentVolume is filling up from the default alert set,
- alerts like High CPU iowait from node-exporter to control IO,
- a set of basic Kubernetes alerts like PodCPULimit, ContainerCrashLooping, KubePodNotReady, etc.
In addition, I would recommend referring to the Awesome Prometheus Alerts project to get ideas and ready-made rules for monitoring your infrastructure.
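As an illustration, the first two alerts could be expressed as a PrometheusRule (or the equivalent VMRule for VictoriaMetrics); a sketch, with the threshold matching the 900s standby delay configured above:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: postgres-alerts
spec:
  groups:
    - name: postgres
      rules:
        - alert: PostgresDown
          expr: pg_up != 1 or absent(pg_up)   # not running, or no data at all
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "PostgreSQL is down or the exporter reports no data"
        - alert: PostgresReplicationLagHigh
          expr: pg_replication_lag > 900      # matches max_standby_archive_delay
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Replica lag exceeds max_standby_archive_delay (900s)"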
Conclusions
So, in the end, after applying the PostgreSQL manifest we have up and running:
- the highly available Postgres cluster with primary and standby nodes,
- the encrypted WAL-logs storage and backup storage in the S3 bucket with auth and access control,
- pg-bouncer for primary with SSL encryption,
- pg-bouncer for replica with SSL encryption,
- Kubernetes service for direct connection to primary with SSL (cluster-name.namespace.svc),
- Kubernetes service for direct connection to replica with SSL (cluster-name-repl.namespace.svc),
- provisioned primary database,
- provisioned application users with creds in secrets,
- all stuff runs on special nodes (node groups), defined by nodeAffinity,
- logging subsystem is set up and allows control of slow queries, temporary files, data definition queries, and default Postgres logs,
- all basic Postgres metrics are provided by postgres-exporter with enhanced query and replication lag monitoring.
With VictoriaMetrics/Prometheus and Grafana, we also have a full monitoring and alerting solution to manage and maintain our databases. As your business grows, the workload will grow too, and this setup gives you the information to support that growth, detect failures, and tune your Postgres instance, database, tables, and queries. By most measures, this setup of Postgres and monitoring tools complies with the best practices of the 'well-architected frameworks' and is ready to grow up to 1-2-3 million users.
Check the Postgres manifests for the Zalando operator in my GitHub repo: https://github.com/petrushinvs/databases-in-prod/tree/main/postgres/kubernetes
How to solve slow application responses caused by slow queries, Postgres performance degradation, replica lag, and other issues is covered in the next article, Part 2 - 2x to 10x growth, basic steps of query and database optimizations.
Good luck.