Purpose: you already have the
BabylonCluster and you just want to add some new nodes quickly.
All these nodes have the same hardware configuration and the same purpose, so except for the
hostname and the
IP address, their install parameters will be exactly the same.
It is possible and desirable to automate as many steps in the install/config process as possible.
Another purpose: this can be used to quickly reinstall a node, or just install one new node.
This is a write-up of how I did it when we added 14 nodes to the cluster in Feb-March 2007.
This is not the most elegant solution (
that would involve PXE and DHCP and other things), but it still works decently well.
Once I had actually polished the procedure (which took 5 installs with varying degrees of automation to get right... ugh), I was able to install the last 9 nodes in 48 minutes:
5.3 minutes per node! (That is from putting in the boot disk to getting the CentOS login screen on a ready OS.
It does not count the hours required to pull and label all the cables, which was the rate-limiting step, or the time spent configuring users and SGE, although that configuration
could have been automated by
putting it in the post-install section of the kickstart file.)
NB: necessary files
All the files used/mentioned in this install (except the
CentOS images) are in
sheridan:/root/ and
sheridan:/root/new-node-config/.
There is also a copy of them (except for the secure stuff) in
lorien:/home/users/avu/new-node-config/ because that dir is backed up.
Prepare for the installation
SSH into
lorien, the NFS server.
Download CentOS images and make boot disk
Download to
/home/tmp/centos/ :
- CentOS-4.4-x86_64-bin1of4.iso
- CentOS-4.4-x86_64-bin2of4.iso
- CentOS-4.4-x86_64-bin3of4.iso
- CentOS-4.4-x86_64-bin4of4.iso
SSH into some machine that can burn CDs (e.g.
kosh) and has the NFS mounted.
Make the boot disk (
NB: the boot disk ISO is actually in a subdir of the first CentOS install disk, so we have to mount it first):
cd /tmp/
mkdir centos
cp /nfs/tmp/centos/CentOS-4.4-x86_64-bin1of4.iso .
su
mount -o loop -t iso9660 CentOS-4.4-x86_64-bin1of4.iso centos/
cdrecord -v -dev=ATA:1,1,0 -data centos/images/boot.iso
umount centos/
rm CentOS-4.4-x86_64-bin1of4.iso
rmdir centos/
Make list of new nodes
Put it in a file called
hostnames.new.
We will use this file a lot (see the "finish config" section).
Here's the one for this install:
cat ~/hostnames.new
bester
talia
lyta
byron
lochley
zack
theo
neroon
cartagia
refa
natoth
draal
zathras
ngrath
Prepare a kickstart file for each node
The kickstart file contains the installation configuration information, so we don't have to provide it while doing the install, thus speeding things greatly.
All the files will be exactly the same, except for the hostname and IP fields for
eth1 (at this time, we are not using
eth0, but that may change).
We'll use a
base kickstart file (explained here) and then create copies of it, one specifically for each new node.
Note that the fields
NODE_HOSTNAME and
NODE_STATIC_IP in
anaconda-ks.cfg must be replaced with node-specific info.
This can be done using this fun one-liner:
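# grep -w matches the hostname as a whole word; awk keeps the IP from the first matching line only
for i in $(cat ~/hostnames.new) ; do \
sed -e "s/NODE_HOSTNAME/$i/g" -e "s/NODE_STATIC_IP/$(grep -w $i /etc/hosts | awk '{print $1; exit}')/g" anaconda-ks.cfg > ks-$i.cfg ; \
done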
NB: this only works if your
/etc/hosts file has all the node hostnames and static IPs.
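Since the one-liner depends on the hosts file being complete, it may be worth checking that first. The following is a minimal sketch, not part of the original procedure; the function name check_hosts is made up for illustration, and it assumes each node should appear on exactly one non-comment line of the hosts file.

```shell
# check_hosts HOSTNAMES_FILE HOSTS_FILE   (hypothetical helper)
# Prints every hostname that does not appear on exactly one line of
# HOSTS_FILE and returns non-zero if any such hostname was found.
check_hosts() {
    names_file=$1
    hosts_file=$2
    rc=0
    for h in $(cat "$names_file"); do
        # Count non-comment lines where the hostname is a whole field after the IP.
        n=$(awk -v h="$h" '!/^#/ { for (f = 2; f <= NF; f++) if ($f == h) { c++; break } }
                           END { print c + 0 }' "$hosts_file")
        if [ "$n" -ne 1 ]; then
            echo "$h: $n matching lines in $hosts_file"
            rc=1
        fi
    done
    return $rc
}

# Usage, with the paths used on this page:
# check_hosts ~/hostnames.new /etc/hosts
```

If it prints nothing, every node in the list has a single entry and the substitution should produce one IP per kickstart file.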
We now have these kickstart files:
ks-bester.cfg
ks-byron.cfg
ks-cartagia.cfg
ks-draal.cfg
ks-lochley.cfg
ks-lyta.cfg
ks-natoth.cfg
ks-neroon.cfg
ks-ngrath.cfg
ks-refa.cfg
ks-talia.cfg
ks-theo.cfg
ks-zack.cfg
ks-zathras.cfg
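Before booting any node, a quick check that no placeholder survived in the generated files can save a failed install. This is only a sketch; check_kickstarts is a hypothetical name, and it checks nothing beyond the two placeholder strings named above.

```shell
# check_kickstarts DIR   (hypothetical helper)
# Lists ks-*.cfg files in DIR that still contain NODE_HOSTNAME or
# NODE_STATIC_IP; returns non-zero if any are found.
check_kickstarts() {
    dir=$1
    rc=0
    for f in "$dir"/ks-*.cfg; do
        [ -e "$f" ] || continue          # no ks-*.cfg files at all
        if grep -q -e 'NODE_HOSTNAME' -e 'NODE_STATIC_IP' "$f"; then
            echo "$f: unreplaced placeholder"
            rc=1
        fi
    done
    return $rc
}

# Usage, run in the directory holding the generated files:
# check_kickstarts .
```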
Install CentOS on each node
Insert the boot disk into the machine on which you want to install CentOS and reboot it.
At the prompt asking the installation mode (the one with the
boot: prompt), enter:
linux ks=nfs:192.168.0.13:/home/tmp/ks-HOSTNAME.cfg
(
NB: that's the NFS server's IP address in there)
This will load the node-specific kickstart file from the NFS.
It will contain answers to most configuration questions, except the ones below.
You will be asked to choose the
Networking Device.
Select
eth1 (at the time of this writing, we are only using
eth1).
Because the node needs an IP before it can contact the NFS server, you have to enter one.
We will make this the IP that we actually want for the node long-term.
You will first wait for half a minute while the node tries to get a
Dynamic IP for
eth1.
Let it fail, then we will come to the
TCP/IP screen where we will enter a static IP.
Hit
Space Bar to disable DHCP (unless you have a DHCP server, which at the time of this writing is NOT the case).
Enter the static IP (
192.168.0.xxx), netmask (
255.255.255.0), gateway (
192.168.0.1), and nameserver (
128.32.136.12).
Annoyingly, you will also be asked to do the same thing for
eth0, although we're not using it.
Enter a bogus IP and netmask for it (e.g.
192.168.0.200); the kickstart file will set
eth0 to be disabled at boot, so what you enter won't matter.
Also, the kickstart file's post-install section will remove the fake IP and the netmask from the config afterwards (the file
/etc/sysconfig/network-scripts/ifcfg-eth0 will be cleansed).
But, enter the same gateway and nameserver as for
eth1.
Now you will be asked if you want to erase and repartition the hard drive.
Say
yes.
You won't have to provide partition info, because the kickstart file contains it.
From this point on, the installation will be
completely automated.
You can even remove the boot disk and move on to the next machine!
All the remaining info will be obtained by the Anaconda installer from the NFS.
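When walking from machine to machine, it can help to have the per-node values printed out beforehand. The sketch below prints, for each node, the line to type at the boot: prompt and the static IP to enter on the TCP/IP screen. The function name print_install_sheet and the two variables are made up for illustration, and it assumes the hostname is the second field of its line in the hosts file.

```shell
NFS_IP=192.168.0.13     # NFS server address, as in the boot line above
KS_DIR=/home/tmp        # directory on the NFS server holding ks-HOSTNAME.cfg

# print_install_sheet HOSTNAMES_FILE HOSTS_FILE   (hypothetical helper)
print_install_sheet() {
    for h in $(cat "$1"); do
        # Assumption: lines look like "IP HOSTNAME ..."; take the first match.
        ip=$(awk -v h="$h" '$2 == h { print $1; exit }' "$2")
        echo "$h"
        echo "  boot: linux ks=nfs:$NFS_IP:$KS_DIR/ks-$h.cfg"
        echo "  eth1 static IP: ${ip:-NOT FOUND}"
    done
}

# Usage:
# print_install_sheet ~/hostnames.new /etc/hosts
```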
Finish configuring each node
We still need to do some things on each node before it's ready to be used, like:
- mount the NFS
- set up the users and groups
- configure SGE
This is easily done using
ssh-agent and command-line loops.
To start:
ssh sheridan
su -
eval `ssh-agent -s`
ssh-add
# enter the extremely complicated passphrase
Unless you have DNS set up, the first step is to copy the
hosts file to all new nodes:
for i in $(cat ~/hostnames.new) ; do scp /etc/hosts $i:/etc/ ; done
Set up the NFS
for i in $(cat ~/hostnames.new) ; do cat add-nfs.bat | ssh $i 'eval `cat -`' ; done
The contents of
add-nfs.bat are:
mkdir /mnt/nfs ;
ln -s /mnt/nfs /nfs ;
echo lorien:/home /mnt/nfs nfs defaults 0 0 >> /etc/fstab ;
mount -a
NB: commands must be semicolon-delimited, since the backticks after
eval will eat the newlines.
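One thing to watch: running the loop a second time on the same node appends a second copy of the fstab line. A guarded append avoids that. This is a sketch of the idea rather than something used in the original install, and append_once is a hypothetical name.

```shell
# append_once LINE FILE   (hypothetical helper)
# Appends LINE to FILE only if FILE does not already contain that exact line.
append_once() {
    line=$1
    file=$2
    # -x: match the whole line, -F: fixed string (no regex), -q: no output
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Usage on a node:
# append_once 'lorien:/home /mnt/nfs nfs defaults 0 0' /etc/fstab
```

Inside a semicolon-delimited .bat file, the same guard can be written on one line as grep -qxF 'LINE' /etc/fstab || echo 'LINE' >> /etc/fstab ; with LINE replaced by the fstab entry.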
Add users and groups
Yes, this is an atrocious hack:
passes,
groups, and
shadow contain the relevant tails of
/etc/passwd,
/etc/group, and
/etc/shadow, respectively.
Of course, you should make sure you store these files somewhere
safe (e.g.
/root/new-node-config/)!
cd /root/new-node-config/
for i in $(cat ~/hostnames.new) ; do cat passes | ssh $i 'cat - >> /etc/passwd' ; done
for i in $(cat ~/hostnames.new) ; do cat groups | ssh $i 'cat - >> /etc/group' ; done
for i in $(cat ~/hostnames.new) ; do cat shadow | ssh $i 'cat - >> /etc/shadow' ; done
NB: this only works for users with NFS home dirs, because they don't have to be created.
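Because the loops above append blindly, a name that already exists on a node would end up listed twice. The sketch below reports such clashes before anything is appended. The function name find_conflicts is made up for illustration, and it compares only the first colon-separated field (the user or group name), not numeric IDs.

```shell
# find_conflicts TAIL_FILE TARGET_FILE   (hypothetical helper)
# Prints names from TAIL_FILE that already exist in TARGET_FILE;
# returns non-zero if any are found.
find_conflicts() {
    # First pass (TARGET_FILE) records existing names; second pass
    # (TAIL_FILE) prints the ones that were already seen.
    dups=$(awk -F: 'FILENAME == ARGV[1] { seen[$1] = 1; next }
                    ($1 in seen) { print $1 }' "$2" "$1")
    if [ -n "$dups" ]; then
        echo "$dups"
        return 1
    fi
    return 0
}

# Usage, fetching each node's passwd file first:
# for i in $(cat ~/hostnames.new) ; do
#     ssh $i 'cat /etc/passwd' > /tmp/passwd.$i
#     find_conflicts passes /tmp/passwd.$i || echo "conflict on $i"
# done
```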
Set up NTP so that the clocks are synchronized
for i in $(cat ~/hostnames.new) ; do \
ssh $i 'echo "* * * * * date >> /var/log/ntpdate.log 2>&1 ; /usr/sbin/ntpdate 128.138.140.44 >> /var/log/ntpdate.log 2>&1" | crontab' ; \
done
Configure SGE
You can consult the
SunGridEngine and
HowToAdministerSunGridEngine pages, but I will give you The Short Version.
Register all the new nodes as exec hosts on the master host (
sheridan at the time of this writing).
If you have a shadow master (which we don't), presumably these steps will have to be done there as well.
On the master host, as the
sgeadmin user, do this
for each new node:
qconf -ae
# replace the word "template" with the new node hostname
qconf -ah NEW_NODE_HOSTNAME
Note: if you want, you can use
qconf -Ae NODE_CONFIG_FILE to load this info from a config file on disk, but this is faster than preparing the config files for each node (or maybe not, if you use the Power of Perl 1-liners).
Now, add the new nodes to the SGE queue you want them to be in:
qconf -mq DESIRED_QUEUE
You will add the hostnames to the list in
hostlist and the number of CPUs that SGE may use on each node to the list in
slots (using the format
[HOSTNAME=NUM_OF_CPUs], one bracketed entry per node, comma-separated).
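Typing fourteen slots entries by hand is error-prone, so the text to paste into the editor can be generated from hostnames.new. The sketch below assumes SGE's per-host override syntax of one [HOSTNAME=NUM] entry per node, separated by commas, and that every new node has the same number of CPUs. The function name slots_entries is hypothetical, and the CPU count in the usage line is a placeholder, not a value from this install.

```shell
# slots_entries HOSTNAMES_FILE CPUS_PER_NODE   (hypothetical helper)
# Prints one line of comma-separated [HOSTNAME=CPUS] entries.
slots_entries() {
    sep=""
    for h in $(cat "$1"); do
        printf '%s[%s=%s]' "$sep" "$h" "$2"
        sep=","
    done
    printf '\n'
}

# Usage (replace 4 with the real CPU count of the new nodes):
# slots_entries ~/hostnames.new 4
```

The output is meant to be appended, after a comma, to whatever the slots line already contains.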
Now, to install on the exec hosts.
As before, this is easy using command-line loops:
for i in $(cat ~/hostnames.new) ; do cat add-sge.bat | ssh $i ' eval `cat -`' ; done
where
add-sge.bat is the following script:
mkdir -p /opt/sge ;
cd /opt/sge/ ;
cp /nfs/tmp/sge-6.0u7-* . ;
cp /nfs/tmp/default.tar.gz . ;
tar xvfz sge-6.0u7-common.tar.gz ;
tar xvfz sge-6.0u7-bin-lx24-amd64.tar.gz ;
tar xvfz default.tar.gz ;
chown -R sgeadmin:sgeadmin /opt/sge ;
cat /nfs/tmp/services.tail >> /etc/services ;
iptables -I RH-Firewall-1-INPUT 3 -p tcp --dport 536 -j ACCEPT ;
iptables -I RH-Firewall-1-INPUT 3 -p tcp --dport 537 -j ACCEPT ;
service iptables save ;
. default/common/settings.sh ;
./inst_sge -x -noremote -auto /nfs/tmp/babylon_configuration.conf ;
service sgeexecd stop ;
service sgeexecd start ;
service sgeexecd stop ;
service sgeexecd start ;
NB: as before, notice that commands must be semicolon-delimited, as
eval will eat newlines.
CAVEAT: make sure that each machine starts the
sge_execd daemon
AFTER REBOOT!
There is something flaky about the install that might prevent it from doing that all the time (or at least the first few times).
The complete description of the problem is
here (the
SunGridEngine install page, "Issues with rebooting the node" section).
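After rebooting the new nodes, a loop in the same style as the ones above can report where the daemon did not come up. This is a sketch that assumes the ssh-agent setup from the "finish config" section is still active; check_execd is a hypothetical name.

```shell
# check_execd HOSTNAMES_FILE   (hypothetical helper)
# Prints the nodes on which no sge_execd process is found; returns
# non-zero if any node is missing it. An unreachable node is reported
# the same way, since ssh itself fails.
check_execd() {
    rc=0
    for h in $(cat "$1"); do
        # -n stops ssh from reading stdin; pgrep -x matches the exact process name
        if ! ssh -n "$h" 'pgrep -x sge_execd > /dev/null'; then
            echo "$h: sge_execd not running"
            rc=1
        fi
    done
    return $rc
}

# Usage:
# check_execd ~/hostnames.new
```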
-- Created by:
AndrewUzilov on 05 Mar 2007