Possible disaster scenarios

Failure of:	Machine affected	HA system response
PSQL database service	Primary	Failover, automatic service reboot. Replication stops in the meantime.
	Secondary	Automatic service reboot. Replication stops in the meantime.
	Monitor	Automatic service reboot. Replication continues but no failover possible in the meantime.
Server shutdown	Primary	Failover. Replication stops, waits for a signal from secondary.
	Secondary	Replication on primary stops. Waits for a signal from secondary.
	Monitor	The primary database is still usable. Primary and secondary nodes wait for a connection to monitor. Replication continues.
	All	Database unavailable, no replication, no failover possible.
Server reboot	Primary	Failover, automatic service reboot on startup.
	Secondary	Automatic service reboot. Replication stops in the meantime.
	Monitor	Automatic service reboot. Replication continues but no failover possible in the meantime.
	All	Database unavailable, no replication, no failover possible.
`pg_autoctl` corrupted and/or deleted	All	Database unavailable, no replication, no failover possible.

Controlled switchover

Note

In a controlled switchover, the sequence of events can be organized to avoid data loss and keep downtime to a minimum. Because the HA cluster described here uses synchronous replication, triggering a manual failover doesn’t risk data loss. The monitor server tracks the health of the current primary at the time the failover is triggered, and drives the failover accordingly.

Triggering a controlled switchover works the same way as the manual failover described above.

Recovery

Database service failure

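If the PostgreSQL database service fails on one of the machines, the system automatically restarts the affected service. If the primary or secondary is affected, replication is unavailable until the restart completes. If the monitor is affected, replication continues but no failover is possible in the meantime.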

Server shutdown

If any of the component machines is shut down, a manual restart is required. The failover processes start automatically with the machine and reinitialize the connections. If only the monitor server is affected, the primary database remains usable and replication continues, but no failover is possible until the monitor is available again.

Server reboot

The failover system is configured to restart automatically with the server, so no manual intervention is required. If only the monitor server is affected, replication continues but no failover can be triggered until the monitor is available again.

`pg_autoctl` setup failure

On the current primary database machine, start PostgreSQL directly:

/usr/pgsql-12/bin/postgres -D /var/lib/pgsql/[node-?] -p [port]

Edit the preferences.cfg file for Central and change the database connection line so that it uses the following connection string:

postgres://[node-?]:[port]/mmsuite?target_session_attrs=read-write

Restart Central:

systemctl restart mmcentral
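
The connection string above has two placeholders, so a small helper can reduce typing mistakes when filling them in. The following is a minimal sketch, not part of the product: the function name `central_conn_string` is hypothetical, and it assumes the first placeholder is a host name that Central can resolve.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the Central connection string for a host and port.
# The database name (mmsuite) and target_session_attrs=read-write are taken
# from the connection string shown above; everything else is an assumption.
central_conn_string() {
    local host="$1" port="$2"
    if [ -z "$host" ]; then
        echo "host is required" >&2
        return 1
    fi
    case "$port" in
        ''|*[!0-9]*)
            # Reject an empty port or one containing non-digit characters.
            echo "port must be numeric" >&2
            return 1
            ;;
    esac
    printf 'postgres://%s:%s/mmsuite?target_session_attrs=read-write\n' "$host" "$port"
}

# Example call (host and port are placeholders):
#   central_conn_string db1.example.com 5432
```

The printed value can then be pasted into preferences.cfg before restarting Central.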

Complete shutdown

If the startup scripts are correct on all of the machines, manually booting the machines in the correct order (1. monitor; 2. primary; 3. secondary) is enough to reinitialize the cluster. After boot, run `ps -ef | grep monitor` (or primary/secondary) on each machine to verify that the `pg_autoctl` process is running.
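
A plain `ps -ef | grep monitor` can also match the grep process itself, which makes the check easy to misread. The sketch below is one possible stricter variant: it reads `ps -ef` output on standard input and succeeds only when a line contains both `pg_autoctl` and the given name. The function name `autoctl_running` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ps -ef` output on stdin and succeeds only if
# some line contains both the literal text "pg_autoctl" and the given name.
autoctl_running() {
    local role="$1"
    # grep -F matches literal strings. Each of the helper's own grep processes
    # carries only one of the two strings on its command line, so for names
    # such as "monitor" the helper does not match itself.
    grep -F -- 'pg_autoctl' | grep -F -- "$role" > /dev/null
}

# Usage on a live machine (the name is whatever appears in the pgdata path):
#   ps -ef | autoctl_running monitor && echo "pg_autoctl is running"
```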

If something’s not working, or you’d like to manually restart the services to recover, follow these steps.

Note

You can create bash scripts of each step to execute instead of manually running through them.

Start the monitor machine:

sudo su - postgres
export PATH="$PATH:/usr/pgsql-12/bin"
pg_autoctl run --pgdata ./[monitor]/

Start the primary machine:

sudo su - postgres
export PATH="$PATH:/usr/pgsql-12/bin"
pg_autoctl run --pgdata ./[node-1]/

If an error message states an instance is already running, remove the referenced file:

rm /tmp/pg_autoctl/var/lib/pgsql/[node-1]/pg_autoctl.pid

And re-run the application:

pg_autoctl run --pgdata ./[node-1]/
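
Before removing the PID file, it can help to confirm that no process is still using it. The following is a minimal sketch under two assumptions: the first line of the PID file holds the process ID, and the check runs as the same user that owns the process (postgres in the steps above). The function name `pid_file_is_stale` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) only when the PID file exists and the
# process it names is no longer alive, meaning the file is stale.
# Assumption: the first line of the PID file holds the process ID.
pid_file_is_stale() {
    local pid_file="$1" pid=""
    [ -f "$pid_file" ] || return 1        # no file, nothing to clean up
    # Read the first line; tolerate a missing trailing newline.
    read -r pid < "$pid_file" || [ -n "$pid" ] || return 1
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;          # not a plain number: leave it alone
    esac
    # kill -0 sends no signal; it only reports whether the process can be
    # signalled, so run this as the user that owns the process.
    if kill -0 "$pid" 2>/dev/null; then
        return 1                          # still running: do not delete
    fi
    return 0
}

# Usage (path as given in the error message):
#   pid_file_is_stale "$pid_file" && rm "$pid_file"
```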

Start the secondary machine(s):

sudo su - postgres
export PATH="$PATH:/usr/pgsql-12/bin"
pg_autoctl run --pgdata ./[node-2]/
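
As the note above suggests, the start steps can be wrapped in a script. The following is a minimal sketch of one such wrapper, meant to be run as the postgres user on each machine in the order monitor, primary, secondary. The function name `start_node` and the `PG_AUTOCTL` override variable are hypothetical; the override exists only so the command can be replaced for a dry run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the manual start steps; run it as the postgres user.
# PG_AUTOCTL is an assumed override hook, not a pg_autoctl setting.
start_node() {
    local pgdata="$1"
    if [ -z "$pgdata" ]; then
        echo "usage: start_node <pgdata-directory>" >&2
        return 2
    fi
    # Same PATH extension as in the manual steps.
    export PATH="$PATH:/usr/pgsql-12/bin"
    # Runs in the foreground, exactly like the manual command.
    "${PG_AUTOCTL:-pg_autoctl}" run --pgdata "$pgdata"
}

# Dry run that only prints the arguments instead of starting anything:
#   PG_AUTOCTL=echo start_node ./node-1/
```

Pass the same data directory used in the manual steps for the machine in question.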