Installing CentOS On Cluster Via NFS


---

CentOS install/config of multiple Babylon Cluster nodes

Purpose: you already have the Babylon Cluster and you just want to add some new nodes in a small amount of time. All these nodes have the same hardware configuration and the same purpose, so except for the hostname and the IP address, their install parameters will be exactly the same. It is possible and desirable to automate as many steps in the install/config process as possible.

Another purpose: this can be used to quickly reinstall a node, or just install one new node.

This is a write-up of how I did it when we added 14 nodes to the cluster in Feb-March 2007. This is not the most elegant solution (that would involve PXE and DHCP and other things), but it still works decently well.

Once I actually polished the procedure (which required 5 installs with varying degrees of automation to get right... ugh), I was able to install the last 9 nodes in 48 minutes: 5.3 minutes per node! That's measured from putting in the boot disk to getting the CentOS login screen on a ready OS. It does not count the hours required to pull and label all the cables, which was the rate-limiting step, or configuring users and SGE - although the latter could have been automated by putting it in the post-install section of the kickstart file.

---

NB: necessary files

All the files used/mentioned in this install (except the CentOS images) are in sheridan:/root/ and sheridan:/root/new-node-config/. There is also a copy of them (except for the secure stuff) in lorien:/home/users/avu/new-node-config/ because that dir is backed up.

Prepare for the installation

SSH into lorien, the NFS server.

Download CentOS images and make boot disk

Download to /home/tmp/centos/ :

  • CentOS-4.4-x86_64-bin1of4.iso
  • CentOS-4.4-x86_64-bin2of4.iso
  • CentOS-4.4-x86_64-bin3of4.iso
  • CentOS-4.4-x86_64-bin4of4.iso
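
Before burning anything, it is worth verifying the downloads (a sanity check not in the original procedure; it assumes you also grabbed the md5sum.txt published alongside the ISOs):

cd /home/tmp/centos/
md5sum -c md5sum.txt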

SSH into some machine that can burn CDs (e.g. kosh) and has the NFS mounted.

Make the boot disk (NB: the boot disk ISO is actually in a subdir of the first CentOS install disk, so we have to mount it first):

cd /tmp/
mkdir centos
# copy the first install ISO from the NFS to local disk
cp /nfs/tmp/centos/CentOS-4.4-x86_64-bin1of4.iso .
su
# loop-mount the ISO; the boot image lives in images/ inside it
mount -o loop -t iso9660 CentOS-4.4-x86_64-bin1of4.iso centos/
# -dev points at our burner; cdrecord -scanbus lists the addresses on yours
cdrecord -v -dev=ATA:1,1,0 -data centos/images/boot.iso
umount centos/
rm CentOS-4.4-x86_64-bin1of4.iso
rmdir centos/

Make list of new nodes

Put it in a file called hostnames.new. We will use this file a lot (see the "finish config" section). Here's the one for this install:

cat ~/hostnames.new

bester
talia
lyta
byron
lochley
zack
theo
neroon
cartagia
refa
natoth
draal
zathras
ngrath

Prepare a kickstart file for each node

The kickstart file contains the installation configuration, so we don't have to provide it interactively during the install, which speeds things up greatly.

All the files will be exactly the same, except for the hostname and IP fields for eth1 (at this time, we are not using eth0, but that may change). We'll use a base kickstart file (explained here) and then create copies of it, one for each new node.
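
For orientation, the node-specific part of the base file is the network directive, which looks something like this (a sketch of the anaconda-ks.cfg line, using our static network parameters; the two placeholders get substituted per node below):

network --device eth1 --bootproto static --ip NODE_STATIC_IP --netmask 255.255.255.0 --gateway 192.168.0.1 --nameserver 128.32.136.12 --hostname NODE_HOSTNAME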

Note that the fields NODE_HOSTNAME and NODE_STATIC_IP in anaconda-ks.cfg must be replaced with node-specific info. This can be done using this fun one-liner:

for i in $(cat ~/hostnames.new) ; do \
	cat anaconda-ks.cfg | sed s/NODE_HOSTNAME/$i/g | sed s/NODE_STATIC_IP/$(grep -w $i /etc/hosts | awk '{print $1}')/g > ks-$i.cfg ; \
done

NB: this only works if your /etc/hosts file has all the node hostnames and static IPs (the -w makes grep match the hostname as a whole word, so one hostname that is a substring of another entry won't break the substitution).
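
For example, /etc/hosts should contain lines like these (the IPs here are made up for illustration):

192.168.0.101	bester
192.168.0.102	talia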

We now have these kickstart files:

ks-bester.cfg
ks-byron.cfg
ks-cartagia.cfg
ks-draal.cfg
ks-lochley.cfg
ks-lyta.cfg
ks-natoth.cfg
ks-neroon.cfg
ks-ngrath.cfg
ks-refa.cfg
ks-talia.cfg
ks-theo.cfg
ks-zack.cfg
ks-zathras.cfg

Install CentOS on each node

Insert the boot disk into the machine on which you want to install CentOS and reboot it.

At the prompt asking for the installation mode (the one with the boot: prompt), enter:

linux ks=nfs:192.168.0.13:/home/tmp/ks-HOSTNAME.cfg

(NB: that's the NFS server's IP address in there)

This will load the node-specific kickstart file from the NFS. It will contain answers to most configuration questions, except the ones below.

You will be asked the Networking Device. Select eth1 (at the time of this writing, we are only using eth1).

Because the node needs an IP before it can contact the NFS server, you have to enter one. We will make this the IP that we actually want for the node long-term.

First you will wait about half a minute while the node tries to get a dynamic IP for eth1. Let it fail; you will then come to the TCP/IP screen, where you enter a static IP. Hit the space bar to disable DHCP (unless you have a DHCP server, which at the time of this writing is NOT the case). Enter the static IP (192.168.0.xxx), netmask (255.255.255.0), gateway (192.168.0.1), and nameserver (128.32.136.12).

Annoyingly, you will also be asked to do the same thing for eth0, although we're not using it. Enter a bogus IP and netmask for it (e.g. 192.168.0.200); the kickstart file sets eth0 to be disabled at boot, so what you enter won't matter. Also, the kickstart file's post-install section will remove the fake IP and netmask from the config afterwards (the file /etc/sysconfig/network-scripts/ifcfg-eth0 will be cleansed). But enter the same gateway and nameserver as for eth1.
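
For reference, the cleanup in the post-install section might look something like the following (a sketch, not the actual %post from our kickstart file, which lives in sheridan:/root/):

%post
# (sketch) rewrite ifcfg-eth0 so the bogus install-time settings are gone
# and the interface stays down at boot
cat > /etc/sysconfig/network-scripts/ifcfg-eth0 <<EOF
DEVICE=eth0
ONBOOT=no
BOOTPROTO=none
EOF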

Now you will be asked if you want to erase and repartition the hard drive. Say yes. You won't have to provide partition info, because the kickstart file contains it.

From this point on, the installation will be completely automated. You can even remove the boot disk and move on to the next machine! All the remaining info will be obtained by the Anaconda installer from the NFS.

Finish configuring each node

We still need to do some things on each node before it's ready to be used, like:

  • mount the NFS
  • set up the users and groups
  • configure SGE

This is easily done using ssh-agent and command-line loops. To start:

ssh sheridan
su -
eval `ssh-agent -s`
ssh-add
# enter the extremely complicated passphrase

Unless you have DNS set up, the first step is to copy the hosts file to all new nodes:

for i in $(cat ~/hostnames.new) ; do scp /etc/hosts $i:/etc/ ; done

Set up the NFS

for i in $(cat ~/hostnames.new) ; do cat add-nfs.bat | ssh $i 'eval `cat -`' ; done

The contents of add-nfs.bat are:

mkdir /mnt/nfs ;
ln -s /mnt/nfs /nfs ;
echo lorien:/home /mnt/nfs nfs defaults 0 0 >> /etc/fstab ;
mount -a

NB: commands must be semicolon-delimited, since the backticks after eval will eat the newlines.
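
To verify the mount on all nodes at once (just a sanity check, not part of the original procedure):

for i in $(cat ~/hostnames.new) ; do echo $i ; ssh $i 'df -h /nfs' ; done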

Add users and groups

Yes, this is an atrocious hack: passes, groups, and shadow contain the relevant tails of /etc/passwd, /etc/group, and /etc/shadow, respectively. Of course, you should make sure you store these files somewhere safe (e.g. /root/new-node-config/)!
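
If you have not yet made these tail files, they can be produced along these lines (a sketch; the 20 assumes your site-local accounts are the last 20 lines of each file, so adjust the count to match your setup):

tail -n 20 /etc/passwd > /root/new-node-config/passes
tail -n 20 /etc/group > /root/new-node-config/groups
tail -n 20 /etc/shadow > /root/new-node-config/shadow
chmod 600 /root/new-node-config/shadow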

cd /root/new-node-config/
for i in $(cat ~/hostnames.new) ; do cat passes | ssh $i 'cat - >> /etc/passwd' ; done
for i in $(cat ~/hostnames.new) ; do cat groups | ssh $i 'cat - >> /etc/group' ; done
for i in $(cat ~/hostnames.new) ; do cat shadow | ssh $i 'cat - >> /etc/shadow' ; done

NB: this only works for users with NFS home dirs, because they don't have to be created.

Set up NTP so that the clocks are synchronized

for i in $(cat ~/hostnames.new) ; do \
	ssh $i 'echo "* * * * * date >> /var/log/ntpdate.log 2>&1 ; /usr/sbin/ntpdate 128.138.140.44 >> /var/log/ntpdate.log 2>&1" | crontab -' ; \
done
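
After a minute or two, a quick check that the cron job took and the clocks agree (again, just a sanity check):

for i in $(cat ~/hostnames.new) ; do ssh $i 'hostname ; date ; tail -1 /var/log/ntpdate.log' ; done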

Configure SGE

You can consult the Sun Grid Engine and How To Administer Sun Grid Engine pages, but I will give you The Short Version.

Register all the new nodes as exec hosts on the master host (sheridan at the time of this writing). If you have a shadow master (which we don't), presumably those steps will have to be done there as well.

On the master host, as the sgeadmin user, do this for each new node:

qconf -ae
# replace the word "template" with the new node hostname
qconf -ah NEW_NODE_HOSTNAME

Note: if you want, you can use qconf -Ae NODE_CONFIG_FILE to load this info from a config file on disk, but the interactive route above is faster than preparing a config file for each node (or maybe not, if you use the Power of Perl one-liners).
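
For the scripted route, a loop along these lines should work (a sketch: it assumes exec-host.tmpl is a saved exec host template, e.g. the output of qconf -se for an existing exec host with the hostname field changed to the literal word template):

for i in $(cat ~/hostnames.new) ; do \
	sed s/template/$i/ exec-host.tmpl > eh-$i.conf ; \
	qconf -Ae eh-$i.conf ; \
	qconf -ah $i ; \
done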

Now, add the new nodes to the SGE queue you want them to be in:

qconf -mq DESIRED_QUEUE

You will add the hostnames to the list in hostlist and, for each node, the number of CPUs it offers to SGE to the list in slots (using the format [HOSTNAME=NUM_OF_CPUS]).
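
For example, the relevant lines of the queue config might end up looking like this (hostnames from this install; the CPU counts are illustrative):

hostlist              bester talia lyta
slots                 1,[bester=2],[talia=2],[lyta=2]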

Now, to install on the exec hosts. As before, this is easy using command-line loops:

for i in $(cat ~/hostnames.new) ; do cat add-sge.bat | ssh $i 'eval `cat -`' ; done

where add-sge.bat is the following script:

mkdir -p /opt/sge ;
cd /opt/sge/ ;
cp /nfs/tmp/sge-6.0u7-* . ;
cp /nfs/tmp/default.tar.gz . ;
tar xvfz sge-6.0u7-common.tar.gz ;
tar xvfz sge-6.0u7-bin-lx24-amd64.tar.gz ;
tar xvfz default.tar.gz ;
chown -R sgeadmin:sgeadmin /opt/sge ;

cat /nfs/tmp/services.tail >> /etc/services ;
iptables -I RH-Firewall-1-INPUT 3 -p tcp --dport 536 -j ACCEPT ;
iptables -I RH-Firewall-1-INPUT 3 -p tcp --dport 537 -j ACCEPT ;
service iptables save ;

. default/common/settings.sh ;
./inst_sge -x -noremote -auto /nfs/tmp/babylon_configuration.conf ;

service sgeexecd stop ;
service sgeexecd start ;
service sgeexecd stop ;
service sgeexecd start ;

NB: as before, notice that commands must be semicolon-delimited, as eval will eat newlines.

CAVEAT: make sure that each machine starts the sge_execd daemon AFTER REBOOT! There is something flaky about the install that might prevent it from doing that all the time (or at least the first few times). The complete description of the problem is here (the Sun Grid Engine install page, "Issues with rebooting the node" section).
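
After rebooting the nodes, a quick way to check them all at once (a sanity check, assuming the ssh-agent session from before is still live):

for i in $(cat ~/hostnames.new) ; do echo $i ; ssh $i 'pgrep sge_execd || echo sge_execd NOT running' ; done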

---

-- Created by: Andrew Uzilov on 05 Mar 2007