High availability is often sold as a checkbox, but in practice it is a set of architectural trade-offs between cost, complexity, and recovery objectives. A well-designed HA system hides failures from users by providing redundant paths for every component that can break. This article walks through the architecture patterns that deliver minimal downtime, from active-passive pairs to multi-zone clusters with Ceph-backed storage, and shows how to match each pattern to your recovery point and recovery time targets.
Defining the Goal: RPO and RTO
Before choosing an HA pattern, pin down two numbers:
- Recovery Point Objective (RPO): how much data loss is acceptable, measured in time. Zero RPO requires synchronous replication.
- Recovery Time Objective (RTO): how long a failover may take. Sub-minute RTO requires automated orchestration, not manual runbooks.
An e-commerce checkout typically requires RPO under 10 seconds and RTO under 30 seconds. An internal wiki might tolerate an RPO of 15 minutes and an RTO of 30 minutes. These two numbers dictate whether you need active-active, active-passive, or warm standby.
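As a rough illustration of that decision, the mapping from targets to patterns can be sketched as a simple heuristic. The thresholds below are assumptions chosen to match the examples above, not fixed rules; a real decision also weighs cost, team experience, and data-store capabilities.

```python
def choose_pattern(rpo_seconds: float, rto_seconds: float) -> str:
    """Illustrative heuristic only: map recovery targets to an HA pattern."""
    if rpo_seconds == 0 or rto_seconds < 5:
        # Zero data loss or near-instant recovery: no failover step allowed.
        return "active-active"
    if rpo_seconds <= 10 and rto_seconds <= 60:
        # Synchronous replication plus automated VIP failover covers this.
        return "active-passive"
    # Lax targets: async replication with scripted promotion is enough.
    return "warm standby"

print(choose_pattern(0, 1))       # active-active
print(choose_pattern(5, 30))      # active-passive (the checkout example)
print(choose_pattern(900, 1800))  # warm standby (the internal wiki example)
```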
Active-Passive Architectures
In active-passive, one node serves all traffic while a second node waits, synchronised via replication. On failure, a virtual IP (VRRP via Keepalived) or DNS record moves to the standby. This pattern is simple, predictable, and sufficient for most stateful workloads that do not need more than one node's compute.
Typical building blocks:
- Keepalived with VRRP for virtual IP failover.
- PostgreSQL streaming replication with synchronous_commit=remote_write for near-zero RPO.
- DRBD or LINSTOR for block-level replication when the application is not replication-aware.
- Pacemaker + Corosync for orchestrated cluster logic.
Expected RTO with VRRP and healthcheck-driven failover: 5 to 15 seconds. Expected RPO with synchronous replication: near zero.
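A minimal keepalived sketch for the virtual-IP side of such a pair might look like the following. The interface name, VIP, and health-check script path are placeholders to adapt to your environment:

```
vrrp_script chk_app {
    script "/usr/local/bin/check_app.sh"   # placeholder application health check
    interval 2
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0                 # placeholder interface
    virtual_router_id 51
    priority 150                   # the standby uses a lower priority, e.g. 100
    advert_int 1
    virtual_ipaddress {
        10.0.1.10/24               # the shared VIP clients connect to
    }
    track_script {
        chk_app
    }
}
```

Because the check script is tracked, the VIP moves not only when the node dies but also when the application stops answering, which is what keeps RTO in the 5-to-15-second range quoted above.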
Active-Active Architectures
Active-active means multiple nodes serve traffic simultaneously behind a load balancer. Failure of any node reduces capacity but does not cause downtime. Session state must live outside the nodes (Redis, database, or sticky-session load balancer with fallback). Data stores must support multi-writer topologies: Galera for MariaDB, Patroni with PostgreSQL logical replication, or managed multi-AZ clusters.
| Property | Active-Passive | Active-Active |
|---|---|---|
| Typical RTO | 5 to 60 seconds | Near zero (no failover step; surviving nodes keep serving) |
| Capacity under partial failure | 100 percent after failover | (N-1)/N |
| Stateful complexity | Low | High |
| Cost at small scale | Lower | Higher |
| Write conflicts | None | Must be resolved |
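For the data layer, a Galera-based multi-writer setup needs a handful of settings in every node's server configuration. This is a sketch with placeholder addresses and paths; the provider path and SST method vary by distribution:

```
[mysqld]
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2          # required for Galera's parallel applying

wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name=ha_cluster
wsrep_cluster_address="gcomm://10.0.1.11,10.0.1.12,10.0.1.13"
wsrep_node_address=10.0.1.11        # this node's own address
wsrep_sst_method=mariabackup
```

Note that Galera uses certification-based replication, so write conflicts between nodes surface as deadlock errors the application must retry; that is the "must be resolved" cost in the table above.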
Load Balancing Layer
A proper HA load balancer is itself redundant. Two HAProxy or NGINX nodes run behind a shared virtual IP, managed by Keepalived or an anycast BGP announcement. Health checks should test real application behaviour, not just TCP connectivity. Use the check directive with HTTP requests against a /healthz endpoint that touches the database, queue, and cache, so the load balancer removes a node as soon as it is degraded, not only when it is dead.
```
backend app_backend
    option httpchk GET /healthz
    http-check expect status 200
    server app1 10.0.1.11:8080 check inter 2s fall 2 rise 2
    server app2 10.0.1.12:8080 check inter 2s fall 2 rise 2
    server app3 10.0.1.13:8080 check inter 2s fall 2 rise 2
```
Shared Storage with Ceph
Ceph provides distributed object, block, and filesystem storage with no single point of failure. A three-node Ceph cluster using CRUSH-mapped replicas tolerates the loss of any one node without data loss or downtime. Ceph RBD (RADOS Block Device) is the canonical choice for VM disks in HA clusters, and CephFS is ideal for shared application state like media uploads or shared logs.
For production Ceph clusters, plan for at least three monitor nodes, three OSD hosts, and dedicated 10 Gbps networks for both public and cluster replication traffic. Separate journal devices (NVMe) from data devices when using HDD; for all-NVMe clusters, single-device OSDs are fine. Our high-availability hosting deployments use Ceph RBD as the default storage backend.
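Creating a 3-way replicated RBD pool on such a cluster can be sketched with a few commands. Pool name, placement-group count, and image size here are illustrative, not recommendations:

```
# Create a replicated pool for VM disks (PG count depends on cluster size)
ceph osd pool create vms 128 128 replicated
ceph osd pool set vms size 3         # three replicas across failure domains
ceph osd pool set vms min_size 2     # keep serving I/O with one replica down
ceph osd pool application enable vms rbd
rbd create vms/vm-disk-0 --size 20480   # 20 GiB image, size given in MiB
```

The size/min_size pair is what delivers the one-node-loss tolerance: with min_size 2, the pool stays writable while one replica is down, and CRUSH re-replicates in the background once the node returns or is replaced.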
Multi-Zone and Multi-Region
A cluster confined to one rack cannot survive a rack power event. A cluster confined to one data center cannot survive a facility fire or a regional fibre cut. Multi-zone deployments place cluster members across separate failure domains inside the same region, which gives sub-5 ms replication latency while eliminating single-facility risk. Multi-region goes further by replicating across regions, which adds 20 to 80 ms of latency but removes geographic risk.
Our data center footprint supports multi-zone clusters today and multi-region DR patterns with automated failover.
Failover Orchestration
Automated failover needs a quorum-aware orchestrator to prevent split-brain. Pacemaker + Corosync with fencing (STONITH) remains the gold standard for VM-based clusters. For Kubernetes, the control plane quorum combined with pod disruption budgets and pod anti-affinity rules gives similar guarantees. The failure mode you must always design against is network partition: two healthy nodes that cannot see each other will both try to become primary and corrupt data unless fencing is in place.
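The core rule every quorum-aware orchestrator applies is simple: a partition may act as primary only if it can reach a strict majority of all votes. A minimal sketch of that rule:

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition may continue serving only with a strict majority of votes."""
    return reachable_votes > total_votes // 2

# In a 3-node cluster, a 2-node partition keeps quorum; the isolated node stops.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False
# With an even node count, a clean 2/4 split leaves NEITHER side with quorum,
# which is why odd-sized voting memberships (or a tiebreaker) are preferred.
print(has_quorum(2, 4))  # False
```

The strict majority is what makes split-brain impossible: two disjoint partitions can never both hold more than half the votes, so at most one side keeps writing.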
Testing the System You Actually Built
An untested HA system is not HA. Schedule quarterly game days where you:
- Kill the active node and measure RTO.
- Disconnect the replication network and observe fencing behaviour.
- Fill the disk on the primary and watch health checks remove it.
- Fail a whole zone and measure application-level recovery.
Document every finding and fix the weakest link before the next test.
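Measuring RTO during a game day is worth automating so the number is comparable quarter to quarter. A minimal sketch, assuming you supply your own health probe (the stub below stands in for a real HTTP check against the VIP):

```python
import time

def measure_rto(is_healthy, poll_interval=0.5, timeout=120.0):
    """Call immediately after injecting the failure. Polls a health probe and
    returns seconds until the first healthy response, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if is_healthy():
            return time.monotonic() - start
        time.sleep(poll_interval)
    return None

# Usage with a stubbed probe that "recovers" on the third poll:
calls = {"n": 0}
def stub_probe():
    calls["n"] += 1
    return calls["n"] >= 3

rto = measure_rto(stub_probe, poll_interval=0.01)
print(rto is not None and rto < 1.0)  # True
```

In a real game day the probe would hit the same /healthz endpoint the load balancer uses, so the measured RTO reflects what users actually experience.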
Choosing the Right Starting Point
For most teams, an active-passive pair on managed cloud servers with PostgreSQL streaming replication and Keepalived VRRP is the right first step. Graduate to active-active or multi-zone when your RTO/RPO targets justify the extra complexity. Contact us for an HA architecture review tailored to your workload.
Published by the MassiveGRID team, specialists in high-availability hosting, Ceph storage, and multi-zone cluster design.