Thu, 30 Oct 2008

4 node GFS+IPVSADM cluster with Ubuntu Linux

The Linux and open source world has a vast array of technologies to fit our needs. In these pages, I will describe the steps to build a 4 node GFS cluster with LVS. Unfortunately, there are not many howtos or guides for setting up similar environments.
GFS is developed by Red Hat, and GFS configuration is most easily done using the GUI tools in Red Hat Enterprise Linux. However, there are many reasons to deploy Debian/Ubuntu instead of Red Hat (I will not discuss them here). This setup has been tested on the following hardware:


Our environment will end up having the following features:




::READ HERE

Fun, right? OK, I will summarize the MSA 2000. It can be configured using the serial console or the web management GUI. I recommend using the serial console just to set the IP address and then continuing with its nice web GUI. I also recommend reading up on the basics of volumes (physical and logical) and RAID systems. I chose RAID6 because I need to maximize disk space while maintaining safety. RAID5 is not the best option because it cannot survive 2 broken disks and its performance is awful when a single disk fails. With RAID6 we can withstand 2 disk failures and still have acceptable performance. If performance is more important than storage size, RAID 0+1 is more suitable. HP delivers servers with hardware RAID capability, and you will most probably have two mirrored local disks. Leave those as they are; mirroring will not have a negative effect on performance and you will gain important redundancy against our old, crappy hard disk technology.
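To put rough numbers on the space trade-off between these RAID levels, here is a small sketch. The function name raid_usable and the 12 x 300 GB shelf are made up for illustration; the formulas are the usual ones (one disk of parity for RAID5, two for RAID6, half the disks for a mirrored layout).

```shell
#!/bin/sh
# Usable capacity for identical disks (result is in the same unit as SIZE).
# Usage: raid_usable LEVEL DISKS SIZE
raid_usable() {
    level=$1; disks=$2; size=$3
    case "$level" in
        5)     echo $(( (disks - 1) * size )) ;;  # one disk's worth of parity
        6)     echo $(( (disks - 2) * size )) ;;  # two disks' worth of parity
        01|10) echo $(( disks / 2 * size )) ;;    # every block is stored twice
        *)     echo "unsupported level: $level" >&2; return 1 ;;
    esac
}

# Hypothetical shelf: 12 disks of 300 GB each.
for level in 5 6 01; do
    echo "RAID $level: $(raid_usable "$level" 12 300) GB usable"
done
```

With these assumed numbers RAID6 gives up only one more disk than RAID5, while RAID 0+1 gives up half of the shelf.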

  1. Download Ubuntu Hardy Heron 8.04 64-bit server edition and burn the image. Boot the system from the CD, verify the contents of the CD-ROM and start the installation. Make a base installation with minimal packages; we can continue with updated packages from the repositories.
    If you plan to add new local disks to the hosts, you can choose an LVM managed disk during the partitioning step. If you do not, you can just select ext3 for the local partition. Some might say that you should make separate partitions for /boot, /var, /home or /usr. Here are some basic rules:
    • Do not make /boot a separate partition unless you have a really good reason.
    • Do not make /var a separate partition. Some applications (like mysql) put database files under /var/lib (or you may have a mail queue under /var/spool/). In that case, put that specific directory on another partition, not the whole of /var.
    • You should make /home a separate partition if you intend to use home directories a lot.
      The point is to be able to use one single backup file to restore all the necessary files in case of a disaster. After more than 10 years, I can say that I see no advantage in having a separate /var or /usr, but I have faced many problems with them.
  2. Set up the network parameters during the installation. Provide an HTTP proxy if necessary. After the installation, open /etc/apt/sources.list:
    vi /etc/apt/sources.list
    Change the contents of the file to this:
    deb http://us.archive.ubuntu.com/ubuntu/ hardy main restricted universe multiverse
    deb-src http://us.archive.ubuntu.com/ubuntu/ hardy main restricted universe multiverse
    deb http://security.ubuntu.com/ubuntu/ hardy-security restricted main multiverse universe
    deb-src http://security.ubuntu.com/ubuntu/ hardy-security restricted main multiverse universe
    deb http://us.archive.ubuntu.com/ubuntu/ hardy-updates restricted main multiverse universe
    deb-src http://us.archive.ubuntu.com/ubuntu/ hardy-updates restricted main multiverse universe
    This makes sure that the system fetches the updated and security packages. You can also add hardy-proposed and hardy-backports to get more recent packages. However, once you have the system up and running, do not install updates other than security updates unless necessary. Update the package cache and upgrade:
    apt-get update
    apt-get upgrade
  3. Now run tasksel and select OpenSSH server. If you want a desktop environment, also select Ubuntu Desktop. Tasksel will install all the required packages. I will not focus on the installation of web server or database packages; that depends on your needs and taste, so I am moving on to other things.
  4. Set up a serial console for remote access. ilo2 requires an additional license to provide remote console functionality via the browser. However, you can still have a serial console using the virtual serial port.
    First we need to create a serial port listener. Since Ubuntu uses upstart as the init replacement, we will not use old inittab. Open /etc/event.d/ttyS0 and write:
    # ttyS0 - getty
    start on runlevel 2
    start on runlevel 3
    start on runlevel 4
    start on runlevel 5
    stop on runlevel 0
    stop on runlevel 1
    stop on runlevel 6
    respawn
    exec /sbin/getty -L 115200 ttyS0 vt102
    Let root log in from the serial console: open /etc/securetty and add ttyS0.
    Now edit /boot/grub/menu.lst to tell grub to use ttyS0 during the boot phase:
    serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1
    terminal --timeout=10 serial console
    And lastly, append the following to the end of each kernel line in /boot/grub/menu.lst:
    console=ttyS0 console=tty0
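It is easy to miss one of the kernel lines (the recovery entry, for example). A quick check along these lines can help; the function name missing_serial_console and the demo menu entries are made up for illustration.

```shell
#!/bin/sh
# Print every kernel line of a GRUB legacy menu file that lacks console=ttyS0.
# Prints nothing when all kernel lines carry the option.
missing_serial_console() {
    grep -E '^[[:space:]]*kernel[[:space:]]' "$1" | grep -v 'console=ttyS0'
}

# Demo on a throwaway file with two hypothetical entries.
demo=$(mktemp)
cat > "$demo" <<'EOF'
title  Ubuntu 8.04
kernel /boot/vmlinuz root=/dev/sda1 ro console=ttyS0 console=tty0
title  Ubuntu 8.04 (recovery mode)
kernel /boot/vmlinuz root=/dev/sda1 ro single
EOF
missing_serial_console "$demo"
rm -f "$demo"
```

On a real node you would run missing_serial_console /boot/grub/menu.lst and fix whatever it prints.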
  5. Install multipath-tools. Ubuntu kernels already include support for multipathed I/O. This allows us to connect to the storage array over two different physical paths, so we can have redundancy for failover or distribute the I/O load between the paths. My /etc/multipath.conf looks like this:
    defaults {
        udev_dir /dev
        user_friendly_names yes
    }
    blacklist {
        devnode "cciss"
        devnode "fd"
        devnode "hd"
        devnode "md"
        devnode "sr"
        devnode "scd"
        devnode "st"
        devnode "ram"
        devnode "raw"
        devnode "loop"
    }
    multipaths {
        multipath {
            wwid 3600c0ff000d56f978446734801000000
            alias san1
        }
    }
    devices {
        device {
            vendor "HP"
            model "MSA2*"
            path_grouping_policy multibus
            getuid_callout "/lib/udev/scsi_id -g -u -s /block/%n"
            selector "round-robin 0"
            rr_weight uniform
            prio_callout "/bin/true"
            path_checker tur
            hardware_handler "0"
            failback immediate
            no_path_retry 12
            rr_min_io 100
        }
    }
    
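    The blacklist section keeps devices that we do not want multipathed (local disks, CD-ROM drives and so on) out of the picture. More examples for this section are included in the default config file. The listing above sets path_grouping_policy to multibus, which puts all paths in one group and distributes the load between them. My own path_grouping_policy is failover, which keeps one path per group and only switches when the active path fails; the output below reflects that. Start the multipathing daemon by running /etc/init.d/multipath-tools start. Then run multipath -ll (double L). You should see something like this: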
    root@node1:~# multipath -ll
    san1 (3600c0ff000d56f978446734801000000) dm-0 HP      ,MSA2212fc    
    [size=820G][features=0][hwhandler=0]
    \_ round-robin 0 [prio=0][active]
      \_ 0:0:0:0 sda        8:0   [active][ready]
    \_ round-robin 0 [prio=0][enabled]
      \_ 1:0:0:0 sdb        8:16  [active][ready]
    
    If you see only one path, then your fibre switches may need configuration or you may have problems in cabling. All nodes should be able to see the device. Now we can mount the disk on one of the nodes but we must not use it on the other nodes concurrently, for now.
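If you want to check the number of paths on every node without reading the output by eye, a sketch like the following could do it. The function name count_paths is mine, and the pattern simply assumes that path lines carry a host:channel:target:lun tuple followed by an sd device name, as in the output above.

```shell
#!/bin/sh
# Count path lines in `multipath -ll` output read from stdin.
count_paths() {
    grep -cE '[0-9]+:[0-9]+:[0-9]+:[0-9]+[[:space:]]+sd[a-z]+' || true
}

# The sample output shown above, fed in as a string.
sample='san1 (3600c0ff000d56f978446734801000000) dm-0 HP      ,MSA2212fc
[size=820G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
  \_ 0:0:0:0 sda        8:0   [active][ready]
\_ round-robin 0 [prio=0][enabled]
  \_ 1:0:0:0 sdb        8:16  [active][ready]'

paths=$(printf '%s\n' "$sample" | count_paths)
if [ "$paths" -lt 2 ]; then
    echo "only $paths path(s) visible; check switch configuration and cabling" >&2
else
    echo "$paths paths visible"
fi
```

On a live node, replace the sample with multipath -ll | count_paths.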
  6. You should be able to see the disk on the storage array by now. You should see as many physical disks as you created in the management GUI of the MSA 2000.
    Note: For kernel versions < 2.6.19, the qla2xxx drivers include failover capability. However, that driver does not work with newer kernels due to serious changes in kernel structs. Newer versions of the driver omit this capability and recommend using multipath-tools. I recommend sticking with the newest kernels.
    • Run cfdisk against the shared disk
    • Create an LVM partition. The size is up to you, I will continue with one gigantic partition
    • Create another partition for cluster quorum disk. ~10MB is enough and the type can be Linux
    • Save and exit. (You may need to reboot or run partprobe to let the kernel see the changes directly)
    • First create a physical volume. We will put our volume groups in it.
      pvcreate /dev/mapper/san1-part1
    • Create and activate a volume group. We will put our logical volumes in it
      vgcreate vgname /dev/mapper/san1-part1
      vgchange -a y vgname
    • Create logical volume using all available space
      vgdisplay vgname|grep "Total PE"
      lvcreate -l usetotalpehere vgname -n lvname
    • Create your favourite file system on the new logical volume. The device path combines the volume group and logical volume names:
      mkfs /dev/mapper/vgname-lvname
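Instead of copying the "Total PE" number by hand, the value can be picked out of the vgdisplay output. This is only a sketch: the function name total_pe and the vgdisplay excerpt are illustrative, and the script prints the lvcreate command rather than running it.

```shell
#!/bin/sh
# Extract the "Total PE" value from `vgdisplay` output read on stdin.
total_pe() {
    awk '/Total PE/ { print $3 }'
}

# Hypothetical vgdisplay excerpt; your numbers will differ.
sample='  VG Name               vgname
  PE Size               4.00 MB
  Total PE              209919
  Alloc PE / Size       0 / 0'

pe=$(printf '%s\n' "$sample" | total_pe)
echo "lvcreate -l $pe vgname -n lvname"
```

On a real node the same idea becomes lvcreate -l "$(vgdisplay vgname | total_pe)" vgname -n lvname.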
  7. Now we will configure the GFS2 cluster. First, we have to install the required packages on all the nodes.
    apt-get install gfs2-tools gfs-tools cman clvm
    Add the following modules to /etc/modules:
    loop
    lp
    rtc
    fuse
    gfs
    lock_dlm
    ip_vs_ftp
    ip_vs_lblc
    ip_vs_lc
    ip_vs_wlc
    ip_vs_lblcr
    ip_vs_nq
    ip_vs_wrr
    ip_vs_sh
    ip_vs_dh
    ip_vs_sed
    ip_vs_rr
    bonding
    
    Run modprobe against each of these modules to load them into the kernel.
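Typing modprobe for each module gets boring quickly. A small loop over /etc/modules can do it; the function names below are mine, and the sketch only prints the commands unless RUN=1 is set. It ignores any module options that follow the name on a line.

```shell
#!/bin/sh
# Print the module names from a file in /etc/modules format,
# skipping blank lines and # comments.
list_modules() {
    sed -e 's/#.*//' -e 's/[[:space:]]*$//' -e '/^[[:space:]]*$/d' "$1" |
        awk '{ print $1 }'
}

# Dry run by default: prints the commands. Set RUN=1 (as root) to execute them.
load_modules() {
    list_modules "$1" | while read -r mod; do
        if [ "${RUN:-0}" = 1 ]; then
            modprobe "$mod" || echo "failed to load $mod" >&2
        else
            echo "modprobe $mod"
        fi
    done
}

if [ -r /etc/modules ]; then
    load_modules /etc/modules
fi
```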
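    Create the GFS2 (or GFS, whichever you want; GFS2 is still experimental) volume. The -j parameter defines the number of journals; every node must have one. We have to use the Distributed Lock Manager (lock_dlm) for lock management, and with lock_dlm the lock table name must also be given with -t in the form clustername:fsname. The cluster name must match the one in cluster.conf (mycluster here); sharedfs below is only a placeholder for a name of your choice.
    mkfs.gfs2 -p lock_dlm -t mycluster:sharedfs -j 4 /dev/mapper/vgname-lvname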
    Set up /etc/lvm2/lvm.conf by defining the following values:
    locking_type = 2
    fallback_to_clustered_locking = 0
    fallback_to_local_locking = 0
    locking_library = "liblvm2clusterlock.so"
    We should never fall back to another locking mechanism; otherwise we can harm data integrity.
    Now we have to create our quorum disk. Basically, all nodes access the quorum disk frequently to write timestamped "I am here" messages. If no alive message is seen from a node for a specified time, that node is considered dead and is fenced from the storage. The quorum disk is extremely important for data consistency. It must reside on a physical partition and all nodes must have access to it. The quorum disk cannot be on an LVM volume because accessing clustered LVM requires cluster membership. Initialize the quorum disk:
    mkqdisk /dev/mapper/san1-part2
    Open /etc/cluster/cluster.conf and define the GFS cluster. Ideally, every node should have at least one working fencing method. Otherwise, it might not be possible to prevent data corruption. The best fencing methods are the power related ones. Since I only have an ilo2 port, I send the poweroff signal through it. It is also possible to run scripts which log in to the fibre switch and shut down a specific port.
    <?xml version="1.0"?>
    <cluster name="mycluster" config_version="1">
      <cman expected_votes="5" two_node="0" cluster_id="777">
      </cman>
      <clusternodes>
        <clusternode name="node1" nodeid="1" votes="1">
          <fence>
            <method name="ilo">
              <device name="ilo_node1"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="node2" nodeid="2" votes="1">
          <fence>
            <method name="ilo">
              <device name="ilo_node2"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="node3" nodeid="3" votes="1">
          <fence>
            <method name="ilo">
              <device name="ilo_node3"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="node4" nodeid="4" votes="1">
          <fence>
            <method name="ilo">
              <device name="ilo_node4"/>
            </method>
          </fence>
        </clusternode>
      </clusternodes>
      <quorumd device="/dev/mapper/san1-part2" votes="4" stop_cman="1">
        <heuristic program="/bin/true" score="1" interval="2" tko="3"/>
      </quorumd>
      <fence_daemon post_join_delay="5" clean_start="1"/>
      <fencedevices>
        <fencedevice name="ilo_node1" agent="fence_ilo" ipaddr="192.168.25.223" login="Administrator" passwd="1234" action="off"/>
        <fencedevice name="ilo_node2" agent="fence_ilo" ipaddr="192.168.25.224" login="Administrator" passwd="1234" action="off"/>
        <fencedevice name="ilo_node3" agent="fence_ilo" ipaddr="192.168.25.221" login="Administrator" passwd="1234" action="off"/>
        <fencedevice name="ilo_node4" agent="fence_ilo" ipaddr="192.168.25.222" login="Administrator" passwd="1234" action="off"/>
      </fencedevices>
    </cluster>
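The four clusternode stanzas in this configuration differ only in the node name and id, so they are easy to generate instead of copy and paste. A sketch, assuming the ilo_<nodename> naming for fence devices used in the configuration; the function name clusternode is made up.

```shell
#!/bin/sh
# Print one <clusternode> stanza for the given node name and node id.
clusternode() {
    name=$1; id=$2
    cat <<EOF
    <clusternode name="$name" nodeid="$id" votes="1">
      <fence>
        <method name="ilo">
          <device name="ilo_$name"/>
        </method>
      </fence>
    </clusternode>
EOF
}

id=1
for n in node1 node2 node3 node4; do
    clusternode "$n" "$id"
    id=$((id + 1))
done
```

The output can be pasted between the clusternodes tags. Remember to raise config_version whenever you change the file on a running cluster.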
  8. Set up bonding for the ethernet interfaces.
    Normally, I recommend having 4 ethernet interfaces for maximum reliability. Since we have 4 nodes, we cannot just use crossover cables or serial cables for the cluster interconnect. We have to connect through a switch, and switches can fail at any time. Therefore, two of the four interfaces must be bonded in failover mode to make sure that cluster traffic survives even if one link or switch fails. This is extremely vital. The cluster interconnect not only carries heartbeat messages but also provides transport for the distributed lock manager. If the distributed lock manager loses connectivity, the node will lose its access to the global storage.
    Bonding setup is extremely simple. Just configure /etc/network/interfaces file like this:
    auto bond0 
    iface bond0 inet static
    	address x.y.z.t
    	netmask a.b.c.d
    	gateway q.w.e.r
    	slaves eth0 eth1
    
    Of course, you will need another similar section for bond1.
    Later, you can check the status in /proc/net/bonding/bond0:
    Ethernet Channel Bonding Driver: v3.2.3 (December 6, 2007)
    
    Bonding Mode: fault-tolerance (active-backup) (fail_over_mac)
    Primary Slave: None
    Currently Active Slave: eth0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 100
    Down Delay (ms): 100
    
    Slave Interface: eth0
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: 00:1e:0b:d3:f2:34
    
    Slave Interface: eth1
    MII Status: down
    Link Failure Count: 0
    Permanent HW addr: 00:1e:0b:d3:f2:32
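A bond with one dead slave keeps working, so nobody notices until the second link fails too. The status file is easy to parse; here is a sketch (the function name bond_report is mine) that prints the active slave and every slave whose MII status is not up.

```shell
#!/bin/sh
# Read /proc/net/bonding/<bond> content on stdin and print "active <iface>"
# plus one "down <iface>" line per slave whose link is not up.
bond_report() {
    awk '
        /^Currently Active Slave:/ { print "active", $4 }
        /^Slave Interface:/        { slave = $3 }
        # The first "MII Status" line belongs to the bond itself; skip it.
        /^MII Status:/ && slave != "" && $3 != "up" { print "down", slave }
    '
}

if [ -r /proc/net/bonding/bond0 ]; then
    bond_report < /proc/net/bonding/bond0
fi
```

For the status shown above this reports eth0 as active and eth1 as down, which is worth hooking into whatever monitoring you use.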
    
  9. Reboot all nodes. If everything is correct and there are no network issues, all nodes should join the cluster. Cluster membership can be checked by running cman_tool nodes.
  10. Now we can continue with setting up heartbeat for the web and SQL services. Before this, all IP addresses should be planned and entered into /etc/hosts on all nodes.
  11. Set up the Linux IP cluster
    • Install the required packages for the IP cluster:
      apt-get install heartbeat2 ipvsadm ldirectord
    • Tune the kernel parameters for cluster operation and reboot all nodes. Note that if any of the interfaces named here is missing, you will probably have issues in the ipvs cluster.
      net.ipv4.conf.default.rp_filter=1
      net.ipv4.conf.all.rp_filter=1
      net.ipv4.ip_forward=1
      net.ipv4.conf.all.arp_ignore = 1
      net.ipv4.conf.eth0.arp_ignore = 1
      net.ipv4.conf.eth1.arp_ignore = 1
      net.ipv4.conf.all.arp_announce = 2
      net.ipv4.conf.eth0.arp_announce = 2
      net.ipv4.conf.eth1.arp_announce = 2
      net.ipv4.conf.bond0.proxy_arp = 1
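Since a missing interface causes trouble here, it may be worth checking that every interface named in these keys really exists. A sketch, assuming the settings live in /etc/sysctl.conf (adjust to wherever you keep them); the function names are mine, and interface names containing a dot are not handled.

```shell
#!/bin/sh
# List interface names referenced by net.ipv4.conf.<iface>.* keys on stdin,
# ignoring the pseudo-interfaces "all" and "default".
sysctl_ifaces() {
    sed -n 's/^[[:space:]]*net\.ipv4\.conf\.\([^.]*\)\..*/\1/p' |
        grep -vE '^(all|default)$' | sort -u
}

# Print the referenced interfaces that do not exist on this host.
missing_ifaces() {
    sysctl_ifaces | while read -r ifc; do
        [ -e "/sys/class/net/$ifc" ] || echo "$ifc"
    done
}

if [ -r /etc/sysctl.conf ]; then
    missing_ifaces < /etc/sysctl.conf
fi
```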
      
    • Edit /etc/ha.d/ha.cf. Three lines are the most important in this file:
      ucast bond0 x.y.z.t
      node web1
      node web2
      
      The ucast line defines the interface used for heartbeat checks and the destination IP address. The node lines define the hostnames of the heartbeat nodes.
    • Edit /etc/ha.d/haresources and define the ldirectord module for the cluster IP. The line must be the same on both nodes of the cluster. For example, the following line must be used on both web1 and web2:
      web1	ldirectord::ldirectord.cf LVSSyncDaemonSwap::master IPaddr2::clusteripaddress/netmaskslashnotation/bond0/broadcastaddress
      
    • Edit /etc/ha.d/ldirectord.cf. Here we define two real servers with different weights. The ipvsadm manual explains all the load balancing algorithms in detail. In this setup, we create a pure active/passive balancing using the sed scheduler. The checktype is only connect, but it is also possible to send HTTP requests and check the return values.
      checktimeout=10
      checkinterval=1
      autoreload=yes
      logfile="local0"
      
      virtual	= tomcat:8080
      	service = http
      	real = web1:8080 gate 65535
      	real = web2:8080 gate 1
      	checktype = connect
      	scheduler = sed
      	protocol = tcp
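Why do these weights give an active/passive setup? As described in the ipvsadm manual, the sed (shortest expected delay) scheduler sends a new connection to the server with the smallest (Ci + 1) / Ui, where Ci is the number of connections and Ui is the weight. The following sketch replays that comparison for two servers; the function name sed_pick is mine and ties go to the first server, which is a simplification.

```shell
#!/bin/sh
# Usage: sed_pick C1 W1 C2 W2 -> prints 1 or 2, the server that wins.
sed_pick() {
    c1=$1; w1=$2; c2=$3; w2=$4
    # Compare (c1+1)/w1 with (c2+1)/w2, cross-multiplied to stay in integers.
    if [ $(( (c1 + 1) * w2 )) -le $(( (c2 + 1) * w1 )) ]; then
        echo 1
    else
        echo 2
    fi
}

sed_pick 0     65535 0 1   # idle cluster
sed_pick 1000  65535 0 1   # web1 busy
sed_pick 65535 65535 0 1   # web1 holding 65535 connections
```

Only the last call picks server 2. In other words, web2 does not receive traffic until web1 already holds 65535 connections, or until ldirectord takes web1 out of rotation because the connect check failed.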
      
  12. Create a new OpenVZ based virtual server. You can check here for a template creation howto if you do not already have one.

posted at: 11:30 | path: /Linux