Linux HA Cluster

Various notes on building a Linux High Availability (HA) cluster.

Corosync/Pacemaker

Corosync and Pacemaker make it easy to build an HA cluster that shares a Virtual IP (VIP) between nodes.

Sync time between servers

  • Install NTP and set the timezone
  dpkg-reconfigure tzdata
  apt-get update
  apt-get -y install ntp
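
To confirm time sync is working after the install, a quick check (timedatectl is part of systemd; ntpq ships with the ntp package, and the selected peer is marked with a *):

  timedatectl
  ntpq -p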

Configure Firewall
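
No rules are listed here, but Corosync's totem traffic between the two nodes has to be allowed through. As a sketch, assuming ufw and the udpu transport/mcastport 5405 used in the config below, something like the following on server 1 (with the mirror-image rule on server 2) should do it:

  ufw allow from <server_2_private_IP_address> to any port 5404:5406 proto udp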

Install Corosync/Pacemaker

  apt-get install pacemaker corosync

Note: you can install the optional pcs tool for controlling Pacemaker with

  apt-get install pcs
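
For example, once the cluster is up, pcs status gives roughly the same view as the crm status checks used below:

  pcs status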

Configure Authorization Key for two servers

  • Generate the key on server 1 (main server)
  apt-get install haveged
  corosync-keygen
  • Copy key to server 2
  scp /etc/corosync/authkey <username>@<server2>:/etc/corosync
  • Change permissions for file on server 2
  chown root: /etc/corosync/authkey
  chmod 400 /etc/corosync/authkey
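
A quick sanity check on server 2: the key should be owned by root and readable only by root (-r--------):

  ls -l /etc/corosync/authkey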

Configure Corosync cluster

  • On server 1 add the following to /etc/corosync/corosync.conf
totem {
  version: 2
  cluster_name: lbcluster
  transport: udpu
  interface {
    ringnumber: 0
    bindnetaddr: <private_binding_IP_address>
    broadcast: yes
    mcastport: 5405
  }
}

quorum {
  provider: corosync_votequorum
  two_node: 1
}

nodelist {
  node {
    ring0_addr: <server_1_private_IP_address>
    name: primary
    nodeid: 1
  }
  node {
    ring0_addr: <server_2_private_IP_address>
    name: secondary
    nodeid: 2
  }
}

logging {
  to_logfile: yes
  logfile: /var/log/corosync/corosync.log
  to_syslog: yes
  timestamp: on
}

Make sure to update the <private_binding_IP_address>, <server_1_private_IP_address>, and <server_2_private_IP_address> sections in the configuration above.

  • Copy config to server 2
  scp /etc/corosync/corosync.conf <username>@<server2>:/etc/corosync

Enable and Run Corosync

Do all of the following on server 1 and server 2:

  • Make directory and config file
  mkdir -p /etc/corosync/service.d
  vim /etc/corosync/service.d/pcmk
service {
  name: pacemaker
  ver: 1
}
  • Edit /etc/default/corosync. Add/change START= to START=yes
  • Enable and start corosync on both servers with
  systemctl enable corosync
  systemctl restart corosync
  • Check to make sure everything worked
  corosync-cmapctl | grep members
  
  runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
  runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(server_A_private_IP_address)
  runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
  runtime.totem.pg.mrp.srp.members.1.status (str) = joined
  runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
  runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(server_B_private_IP_address)
  runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
  runtime.totem.pg.mrp.srp.members.2.status (str) = joined
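
As an additional check, corosync-cfgtool (installed with corosync) prints the ring status; on each node it should report ring 0 as active with no faults:

  corosync-cfgtool -s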

Enable and Start Pacemaker

  • Enable with
  systemctl enable pacemaker
  • Start
  systemctl start pacemaker
  • Check status
  crm status
  Last updated: Sun Sep 17 15:49:24 2017          
  Last change: Tue Sep 12 09:04:23 2017 by root via crm_attribute on secondary
  Stack: corosync
  Current DC: primary (version 1.1.14-70404b0) - partition with quorum
  2 nodes and 0 resources configured
  Online: [ primary secondary ]

Configure Pacemaker & add Virtual IP

  • Run on server 1:
  crm configure property stonith-enabled=false
  crm configure property no-quorum-policy=ignore
  • Add Virtual IP
  crm configure primitive virtual_public_ip \
  ocf:heartbeat:IPaddr2 params ip="1.0.0.3" \
  cidr_netmask="32" op monitor interval="10s" \
  meta migration-threshold="2" failure-timeout="60s" resource-stickiness="100"

Notes:

  1. Change ip="1.0.0.3" to the IP address you will use. Both servers must already have IP addresses on the same network assigned to their interfaces.
  2. If adding an additional address, you will need to change the primitive's name from virtual_public_ip to something like virtual_public_ip2 (see the sketch at the end of this section).
  • Check status
  crm status
  Last updated: Sun Sep 17 15:49:24 2017          
  Last change: Tue Sep 12 09:04:23 2017 by root via crm_attribute on secondary
  Stack: corosync
  Current DC: primary (version 1.1.14-70404b0) - partition with quorum
  2 nodes and 1 resource configured
  Online: [ primary secondary ]
  Full list of resources:
  virtual_public_ip   (ocf::heartbeat:IPaddr2):    Started primary
  • Verify Virtual IP is running on server 1:
  ip -4 addr ls

You should see something similar to:

  2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      inet 1.0.0.1/24 brd 1.0.0.255 scope global eth0
         valid_lft forever preferred_lft forever
      inet 1.0.0.3/32 brd 1.0.0.255 scope global eth0
         valid_lft forever preferred_lft forever
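
As mentioned in note 2 above, an additional address is just another primitive under a different name. A sketch, assuming a hypothetical second address 1.0.0.4 (adjust ip= and the primitive name to suit):

  crm configure primitive virtual_public_ip2 \
  ocf:heartbeat:IPaddr2 params ip="1.0.0.4" \
  cidr_netmask="32" op monitor interval="10s" \
  meta migration-threshold="2" failure-timeout="60s" resource-stickiness="100"

If the two addresses should always stay on the same node, a colocation constraint (crm configure colocation) would also be needed; that is outside the scope of these notes.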

Testing

  • Simulate server 1 going down. Run the following on server 1:
  crm node standby primary
  • Login to server 2 and check the status:
  crm status
  Last updated: Sun Sep 17 15:49:24 2017          
  Last change: Tue Sep 12 09:04:23 2017 by root via crm_attribute on secondary
  Stack: corosync
  Current DC: primary (version 1.1.14-70404b0) - partition with quorum
  2 nodes and 1 resource configured
  Node primary: standby
  Online: [ secondary ]
  Full list of resources:
  virtual_public_ip   (ocf::heartbeat:IPaddr2):    Started secondary
  • Check server 2's IP to make sure Virtual IP is present:
  ip -4 addr ls
  2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      inet 1.0.0.2/24 brd 1.0.0.255 scope global eth0
         valid_lft forever preferred_lft forever
      inet 1.0.0.3/32 brd 1.0.0.255 scope global eth0
        valid_lft forever preferred_lft forever
  • Bring server 1 online again
  crm node online primary
  • You can force the fail-over by running on server 2
  crm node standby secondary
  crm node online secondary
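
To see whether a fail-over actually drops traffic, it can help to watch the VIP from a third machine while running the standby/online commands above. A minimal sketch, assuming the 1.0.0.3 VIP used earlier:

  while true; do ping -c 1 -W 1 1.0.0.3 > /dev/null && echo "$(date +%T) up" || echo "$(date +%T) DOWN"; sleep 1; done

Alternatively, crm_mon on either node gives a continuously updating view of the crm status output shown above.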