This is a two-part article. In this first part I will share the steps to configure OpenStack High Availability (HA) between two controllers; in the second part I will cover configuring HAProxy and moving the Keystone service endpoints behind the load balancer. If you bring up controller and compute nodes using TripleO, the controllers are automatically configured as a Pacemaker cluster. But if you build your OpenStack setup manually, using Packstack or DevStack or by creating all the databases and services by hand, then you will have to configure the cluster between the controllers yourself to achieve OpenStack High Availability (HA).

Configure OpenStack High Availability (HA)
For the sake of this article I brought up two controller nodes using Packstack on two separate CentOS 7 virtual machines running on Oracle VirtualBox, installed on my Linux server. After Packstack completes successfully, you will find a keystonerc_admin file in the home folder of the root user.
Installing the Pacemaker resource manager
Since we will configure OpenStack High Availability using Pacemaker and Corosync, we first need to install the RPMs required for the cluster setup. Pacemaker will manage the VIPs that we will later use with HAProxy to make the web services highly available.
Install pacemaker on both controller nodes:
[root@controller2 ~]# yum install -y pcs fence-agents-all
[root@controller1 ~]# yum install -y pcs fence-agents-all
Verify that the software installed correctly by running the following command:
[root@controller1 ~]# rpm -q pcs
pcs-0.9.162-5.el7.centos.2.x86_64
[root@controller2 ~]# rpm -q pcs
pcs-0.9.162-5.el7.centos.2.x86_64
Next, add rules to the firewall to allow cluster traffic:
[root@controller1 ~]# firewall-cmd --permanent --add-service=high-availability
success
[root@controller1 ~]# firewall-cmd --reload
success
[root@controller2 ~]# firewall-cmd --permanent --add-service=high-availability
success
[root@controller2 ~]# firewall-cmd --reload
success
If you run into any problems during testing, you might want to disable the firewall and SELinux entirely until you have everything working. This may create significant security issues and should not be performed on machines that will be exposed to the outside world, but may be appropriate during development and testing on a protected host.
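If you do go down that route on a protected test host, the sketch below shows the relevant commands. It is dry-run by default (my own convention for this article, since these commands require root and weaken the host's security posture); set DRY_RUN=0 as root to actually apply them.

```shell
# Sketch only, for protected test hosts: temporarily relax firewalld and
# SELinux. With DRY_RUN=1 (the default here) the commands are printed
# instead of executed.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run systemctl stop firewalld     # stop the firewall for this boot only
run setenforce 0                 # SELinux permissive until next reboot
```

Remember to re-enable both once your cluster is working.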
The installed packages will create a hacluster user with a disabled
password. While this is fine for running pcs commands locally, the
account needs a login password in order to perform such tasks as syncing
the corosync configuration, or starting and stopping the cluster on
other nodes.
Set the password for the Pacemaker cluster on each controller node using the following command:
[root@controller1 ~]# passwd hacluster
Changing password for user hacluster.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
[root@controller2 ~]# passwd hacluster
Changing password for user hacluster.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
Start and enable the pcsd daemon on each node:
[root@controller1 ~]# systemctl start pcsd.service
[root@controller1 ~]# systemctl enable pcsd.service
Created symlink from /etc/systemd/system/multi-user.target.wants/pcsd.service to /usr/lib/systemd/system/pcsd.service.
[root@controller2 ~]# systemctl start pcsd.service
[root@controller2 ~]# systemctl enable pcsd.service
Created symlink from /etc/systemd/system/multi-user.target.wants/pcsd.service to /usr/lib/systemd/system/pcsd.service.
Configure Corosync
To configure OpenStack High Availability we need to configure Corosync on both nodes. Use pcs cluster auth to authenticate as the hacluster user:
[root@controller1 ~]# pcs cluster auth controller1 controller2
Username: hacluster
Password:
controller2: Authorized
controller1: Authorized
[root@controller2 ~]# pcs cluster auth controller1 controller2
Username: hacluster
Password:
controller2: Authorized
controller1: Authorized
Finally, run the following command on the first node to create the cluster and start it. Here our cluster name will be openstack:
[root@controller1 ~]# pcs cluster setup --start --name openstack controller1 controller2
Destroying cluster on nodes: controller1, controller2...
controller1: Stopping Cluster (pacemaker)...
controller2: Stopping Cluster (pacemaker)...
controller1: Successfully destroyed cluster
controller2: Successfully destroyed cluster
Sending 'pacemaker_remote authkey' to 'controller1', 'controller2'
controller1: successful distribution of the file 'pacemaker_remote authkey'
controller2: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
controller1: Succeeded
controller2: Succeeded
Starting cluster on nodes: controller1, controller2...
controller1: Starting Cluster...
controller2: Starting Cluster...
Synchronizing pcsd certificates on nodes controller1, controller2...
controller2: Success
controller1: Success
Restarting pcsd on the nodes in order to reload the certificates...
controller2: Success
controller1: Success
Enable the pacemaker and corosync services on both controllers so they start automatically on boot:
[root@controller1 ~]# systemctl enable pacemaker
Created symlink from /etc/systemd/system/multi-user.target.wants/pacemaker.service to /usr/lib/systemd/system/pacemaker.service.
[root@controller1 ~]# systemctl enable corosync
Created symlink from /etc/systemd/system/multi-user.target.wants/corosync.service to /usr/lib/systemd/system/corosync.service.
[root@controller2 ~]# systemctl enable corosync
Created symlink from /etc/systemd/system/multi-user.target.wants/corosync.service to /usr/lib/systemd/system/corosync.service.
[root@controller2 ~]# systemctl enable pacemaker
Created symlink from /etc/systemd/system/multi-user.target.wants/pacemaker.service to /usr/lib/systemd/system/pacemaker.service.
Validate the cluster using pacemaker
Verify that the cluster started successfully using the following command on both the nodes:
[root@controller1 ~]# pcs status
Cluster name: openstack
Stack: corosync
Current DC: controller2 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Tue Oct 16 11:51:13 2018
Last change: Tue Oct 16 11:50:51 2018 by root via cibadmin on controller1
2 nodes configured
0 resources configured
Online: [ controller1 controller2 ]
No resources
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@controller2 ~]# pcs status
Cluster name: openstack
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: controller2 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Mon Oct 15 17:04:29 2018
Last change: Mon Oct 15 16:49:09 2018 by hacluster via crmd on controller2
2 nodes configured
0 resources configured
Online: [ controller1 controller2 ]
No resources
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
How to start the Cluster
Now that corosync is configured, it is time to start the cluster. The
command below will start corosync and pacemaker on both nodes in the
cluster. If you are issuing the start command from a different node than
the one you ran the pcs cluster auth command on earlier, you must
authenticate on the current node you are logged into before you will be
allowed to start the cluster.
[root@controller1 ~]# pcs cluster start --all
An alternative to using the pcs cluster start --all command is to
issue either of the below command sequences on each node in the cluster
separately:
[root@controller1 ~]# pcs cluster start
Starting Cluster...
or
[root@controller1 ~]# systemctl start corosync.service
[root@controller1 ~]# systemctl start pacemaker.service
Verify Corosync Installation
First, use corosync-cfgtool to check whether cluster communication is
happy:
[root@controller2 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id = 192.168.122.22
status = ring 0 active with no faults
So all looks normal with our fixed IP address (not a 127.0.0.x loopback
address) listed as the id, and no faults for the status.
If you see something different, you might want to start by checking the
node’s network, firewall and SELinux configurations.
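This check is easy to script as well. Below is a minimal sketch; the helper name check_ring_status is my own, and it simply greps the corosync-cfgtool -s output for the fault-free status line shown above:

```shell
# Report whether every ring in `corosync-cfgtool -s` output is fault-free.
check_ring_status() {
  # $1 is the captured output of `corosync-cfgtool -s`
  if printf '%s\n' "$1" | grep -q "no faults"; then
    echo "OK: all rings healthy"
  else
    echo "WARNING: ring faults detected"
    return 1
  fi
}

# Usage (on a cluster node):
#   check_ring_status "$(corosync-cfgtool -s)"
```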
Next, check the membership and quorum APIs:
[root@controller2 ~]# corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.122.20)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.122.22)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
Check the status of the corosync service:
[root@controller2 ~]# pcs status corosync
Membership information
----------------------
Nodeid Votes Name
1 1 controller1
2 1 controller2 (local)
You should see both nodes have joined the cluster.
Repeat the same steps on the other controller to validate the corosync services.
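If you want to automate this membership check, a minimal sketch follows. The helper name expect_members is my own; it counts the numeric "Nodeid Votes Name" rows in the pcs status corosync output and compares them against the number of nodes you expect:

```shell
# Count joined members in `pcs status corosync` output and compare
# against the expected node count.
expect_members() {
  local expected="$1" output="$2" joined
  # Membership rows start with a numeric node ID followed by a vote count.
  joined=$(printf '%s\n' "$output" | grep -cE '^[[:space:]]*[0-9]+[[:space:]]+[0-9]+[[:space:]]+')
  if [ "$joined" -eq "$expected" ]; then
    echo "OK: $joined of $expected nodes joined"
  else
    echo "WARNING: only $joined of $expected nodes joined"
    return 1
  fi
}

# Usage (on a cluster node):
#   expect_members 2 "$(pcs status corosync)"
```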
Verify the cluster configuration
Before we make any changes, it’s a good idea to check the validity of the configuration.
[root@controller1 ~]# crm_verify -L -V
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
As you can see, the tool has found some errors.
In order to guarantee the safety of your data, fencing (also called STONITH) is enabled by default. However, the cluster also knows when no STONITH configuration has been supplied and reports this as a problem, since the cluster will not be able to make progress if a situation requiring node fencing arises.
We will disable this feature for now and configure it later. To disable STONITH, set the stonith-enabled cluster option to false. Cluster properties are cluster-wide, so setting it from one node is enough; here I ran the commands on both controllers only to verify the configuration from each:
[root@controller1 ~]# pcs property set stonith-enabled=false
[root@controller1 ~]# crm_verify -L
[root@controller2 ~]# pcs property set stonith-enabled=false
[root@controller2 ~]# crm_verify -L
With the new cluster option set, the configuration is now valid.
stonith-enabled=false is completely inappropriate for a production cluster. It tells the cluster to simply pretend that failed nodes are safely powered off. Some vendors will refuse to support clusters that have STONITH disabled.
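If you script your cluster setup, it can be handy to assert that a property really has the value you set. A minimal sketch follows; the helper name assert_property is my own, and it parses the "name: value" lines printed by pcs property:

```shell
# Check that a cluster property has the expected value by parsing the
# "name: value" lines of `pcs property` output.
assert_property() {
  local name="$1" want="$2" output="$3" got
  got=$(printf '%s\n' "$output" | sed -n "s/^[[:space:]]*$name:[[:space:]]*//p")
  if [ "$got" = "$want" ]; then
    echo "OK: $name=$got"
  else
    echo "WARNING: $name is '$got', expected '$want'"
    return 1
  fi
}

# Usage (on a cluster node):
#   assert_property stonith-enabled false "$(pcs property)"
```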
I will continue configuring OpenStack High Availability in the second part of this article, where I will share the steps to configure HAProxy, manage it as a cluster resource, and move the OpenStack API endpoints behind the cluster load balancer.


