
1. High Availability Overview

Foreword

The High Availability Cluster lets you define highly available virtual machines. Any VM marked as production is considered HA once you start the Hoster API server in HA mode (i.e. hoster api start --ha-mode).

In simple words: if a virtual machine is configured as a production VM and its physical host fails, the VM will be automatically restarted on one of the remaining cluster nodes, provided that node holds the latest available ZFS snapshot for that specific virtual machine.

If you are planning to use HA, we highly recommend production-grade server hardware with no single point of failure. This should include disk-level redundancy (at least RAID-Z1), redundant power supplies, UPS systems, network switches with dual power supplies, etc.

It is essential to use redundant network connections for cluster communication. This can be as simple as running additional, direct node-to-node network cables (in small, 3-node clusters), or as complex as a spine-leaf architecture with multiple bonded interfaces. Otherwise a simple switch reboot (or power loss on the switch) can lock up all HA nodes and bring down the whole cluster.

Overview

Hoster's high availability is based on a Raft-like algorithm with 3 candidate nodes that are described in the ha_config.json file and are essentially static. These 3 candidates agree on who is the manager of the cluster based on process start time: the API server process that was started first becomes the manager. Worker nodes, on the other hand, are dynamic; the list of workers is kept in RAM on every candidate, in case that candidate needs to take over the cluster and become the manager.
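
To make the election rule concrete, here is a minimal Go sketch; the type and field names below are illustrative assumptions, not Hoster's actual data structures:

package main

import (
    "fmt"
    "time"
)

// Candidate mirrors the kind of static entry ha_config.json might describe.
// The fields are hypothetical and only illustrate the election rule.
type Candidate struct {
    Hostname  string
    StartTime time.Time // when this candidate's API server process was started
}

// electManager returns the candidate whose API server process started first.
func electManager(candidates []Candidate) Candidate {
    manager := candidates[0]
    for _, c := range candidates[1:] {
        if c.StartTime.Before(manager.StartTime) {
            manager = c
        }
    }
    return manager
}

func main() {
    now := time.Now()
    candidates := []Candidate{
        {Hostname: "hoster01", StartTime: now.Add(-3 * time.Minute)},
        {Hostname: "hoster02", StartTime: now.Add(-10 * time.Minute)}, // started first
        {Hostname: "hoster03", StartTime: now.Add(-1 * time.Minute)},
    }
    fmt.Println("manager:", electManager(candidates).Hostname) // manager: hoster02
}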

To activate HA mode, start the API server with the --ha-mode flag:

hoster api start --ha-mode --ha-debug

--ha-debug lets you troubleshoot or test HA mode before going to production. It simply logs every action (to /var/log/messages) instead of applying it immediately. For example:

EMERG: candidatesRegistered has gone below 2, initiating self fencing

The line above signals that the node in question can't reach the other 2 candidate nodes; without the --ha-debug flag it would have rebooted itself. But because we are in debug mode, the API process logs what just happened and exits immediately to simulate a node failure.
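
Conceptually, the debug behaviour is just a guard around the fencing action. A hypothetical Go sketch (the function and flag names are assumptions, not Hoster's source code):

package main

import (
    "log"
    "os"
)

// selfFence is called when fewer than 2 candidates are reachable.
// In debug mode it only logs and exits to simulate a node failure;
// in real HA mode the host would be rebooted at this point.
func selfFence(haDebug bool) {
    log.Println("EMERG: candidatesRegistered has gone below 2, initiating self fencing")
    if haDebug {
        os.Exit(1) // simulate the node going away, but leave the host running
    }
    // A real deployment would trigger a host reboot here.
    log.Println("rebooting host")
}

func main() {
    selfFence(true) // pretend we were started with --ha-debug
}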

You can use tail -f /var/log/messages to watch the HA cluster changes and actions live.

HA Process Watchdog

By executing hoster api status you can check whether the API server has been started in HA mode:

🟢 API Server is running as PID 50056
🟢 HA Watchdog service is running as PID 53078

️🤖 HA is running in DEBUG mode
🔶 BE CAREFUL! This system is running as a part of the HA. Changes applied here, may affect other cluster members.

From this output it's clear that we are running in High Availability mode, but you've probably noticed by now that there is another process running (apart from the API server itself) called ha_watchdog.

The HA Watchdog's responsibility is to track the API server process and make sure it is running (processes can crash unexpectedly). If it detects any issue at all, it reboots the host it's running on to prevent HA cluster inconsistencies (VMs from this host may already have been started on another host, because the manager has marked this node offline and migrated the workload to other cluster members).

The same goes for the API server itself: it constantly checks that the ha_watchdog process is running, and if the watchdog is not running it will reboot the host. So the process communication and monitoring is essentially bi-directional.
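
A rough sketch of that bi-directional monitoring, assuming a pgrep-based check and hypothetical process names (the real implementation may track processes differently):

package main

import (
    "log"
    "os/exec"
    "time"
)

// processIsRunning is a hypothetical helper that checks for a named process
// using pgrep; the real Hoster code may track PIDs differently.
func processIsRunning(name string) bool {
    return exec.Command("pgrep", "-x", name).Run() == nil
}

// watchPeer illustrates the mutual monitoring: each side checks its peer
// periodically and would trigger a host reboot if the peer disappears.
func watchPeer(peer string) {
    for range time.Tick(5 * time.Second) { // interval is an assumption
        if !processIsRunning(peer) {
            log.Printf("%s is not running, rebooting the host to stay consistent with the cluster", peer)
            // exec.Command("reboot").Run() would go here in a real deployment
            return
        }
    }
}

func main() {
    // The API server watches ha_watchdog, and ha_watchdog watches the API server.
    go watchPeer("ha_watchdog")
    watchPeer("hoster-api") // process name is an assumption for illustration
}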

Transport Layer

HTTP was chosen as the communication protocol between HA nodes at this initial stage of development, for its ease of use and tooling availability (no need to re-invent the wheel, at least for now).

Every server in the cluster (be it a worker or a candidate) uses an HTTP-based REST API to communicate with the cluster candidates. There are 2 main endpoints involved in this process: /api/v1/ha/register and /api/v1/ha/ping.

The endpoints are pretty much self-explanatory, but to dive a little deeper: ping is executed from every cluster member to all 3 candidates, every 2 seconds. register is executed only once, when the API server process starts, but each node needs to register with all 3 candidate nodes separately. Every candidate node keeps track of its own connections to the other members, but only the elected manager has the power to apply cluster actions (like a VM failover, for example).
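
As a rough illustration of the ping side, here is a minimal Go sketch; only the endpoint path and the 2-second interval come from the description above, while the addresses, port, HTTP method and response handling are assumptions:

package main

import (
    "log"
    "net/http"
    "time"
)

// candidates would normally come from ha_config.json;
// the addresses and port here are placeholders for illustration only.
var candidates = []string{
    "http://10.0.0.1:3000",
    "http://10.0.0.2:3000",
    "http://10.0.0.3:3000",
}

func main() {
    client := &http.Client{Timeout: time.Second}
    for range time.Tick(2 * time.Second) {
        for _, c := range candidates {
            // Any cluster member pings every candidate; only the elected
            // manager acts on missed pings.
            resp, err := client.Get(c + "/api/v1/ha/ping")
            if err != nil {
                log.Printf("ping to %s failed: %v", c, err)
                continue
            }
            resp.Body.Close()
        }
    }
}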

In case of a failure, the manager waits for the grace period specified in the node's ha_config.json file. Once that time has passed and the node hasn't come back online, the manager iterates over all nodes in the cluster, collecting VM information and the available ZFS snapshots for each VM that belonged to the failed node, and starts each such VM on a different node in the cluster.

Please keep in mind that data consistency is our priority, and protection against overprovisioning is ignored entirely in favour of consistency. As long as a node has the latest VM/ZFS snapshot available in the whole cluster, the manager will attempt to start the VM on that specific node. You can influence how VMs fail over by using different replication strategies. Try not to over-provision the servers that may need to take on more work in case of a failure.
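
To make the placement rule concrete, here is a minimal Go sketch of "pick the node that holds the newest snapshot for the VM"; the types and values are hypothetical:

package main

import (
    "fmt"
    "time"
)

// SnapshotInfo is a hypothetical record of a VM replica found on a node.
type SnapshotInfo struct {
    Node     string
    VM       string
    Snapshot time.Time // creation time of the latest ZFS snapshot on that node
}

// pickFailoverNode returns the node holding the newest snapshot for vm,
// ignoring how loaded that node already is (consistency over balance).
func pickFailoverNode(vm string, replicas []SnapshotInfo) (string, bool) {
    var best SnapshotInfo
    found := false
    for _, r := range replicas {
        if r.VM != vm {
            continue
        }
        if !found || r.Snapshot.After(best.Snapshot) {
            best, found = r, true
        }
    }
    return best.Node, found
}

func main() {
    now := time.Now()
    replicas := []SnapshotInfo{
        {Node: "hoster02", VM: "prod-db01", Snapshot: now.Add(-30 * time.Minute)},
        {Node: "hoster03", VM: "prod-db01", Snapshot: now.Add(-5 * time.Minute)}, // newest
    }
    if node, ok := pickFailoverNode("prod-db01", replicas); ok {
        fmt.Println("starting prod-db01 on", node) // -> hoster03
    }
}

Note how the selection ignores the target node's current load, which is exactly why the replication strategy and sensible capacity planning matter.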