HolmesLab computing cluster (aka babylon)
Compiling stuff on the cluster
TODO: update this!
- 23 nodes for general-purpose computation (dual-CPU 64-bit AMD Opterons, 2.2GHz, 2GB RAM, 80GB HDD)
- 1 node for high-memory computation (quad-CPU 64-bit AMD Opteron, 2.2GHz, 32GB RAM, 300GB HDD)
- 1 node for RAID (dual-CPU 64-bit AMD Opterons, 2.0GHz, 2GB RAM, 2.4TB 8-disk RAID-5 - more details on ClusterRAID)
- 2 lightweight nodes for routing, intrusion detection, job control, etc. (single-CPU 64-bit AMD Opteron, 2.2GHz, 512MB RAM, 80GB HDD)
- 1 webserver for AJAX GBrowse demo and development
- 64-bit CentOS 4.2/4.3/4.4 (kernel 2.6.x) operating system for all nodes
- 2 gigabit switches
- Aten KVM-over-IP
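As a rough sanity check on the RAID node's figures, here is a minimal Python sketch of the standard RAID-5 capacity rule (one disk's worth of space is consumed by parity). It assumes that the 2.4TB quoted above is the raw total across 8 equal disks; this page does not say whether that number is raw or usable, so treat the result as illustrative only. The function name raid5_usable_gb is made up for this example.

```python
def raid5_usable_gb(disks: int, disk_size_gb: float) -> float:
    """Usable capacity of a RAID-5 array of equal-sized disks.

    RAID-5 spreads one disk's worth of parity across the array,
    so usable space is (disks - 1) * disk size.
    """
    if disks < 3:
        raise ValueError("RAID-5 needs at least 3 disks")
    return (disks - 1) * disk_size_gb


# Assumption (not stated on this page): 2.4TB is the raw total of 8 equal disks.
raw_total_gb = 2400
disk_count = 8
per_disk_gb = raw_total_gb / disk_count  # 300 GB per disk under that assumption

print(raid5_usable_gb(disk_count, per_disk_gb))  # usable GB under that assumption
```

If 2.4TB is instead the usable figure, the same rule run in reverse gives the per-disk size as 2400 / 7 GB.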
SunGridEngine, BioPerl libraries, GD, MySQL, etc.
DART, Rfam, PANDIT, 12fly ...
- v1.5.1rc3 on old compute nodes (ivanova, kosh, etc.)
- none installed yet on new compute nodes (bester, etc.)
- v1.5.2_102 on sheridan and lorien
- v1.5.2_101 on sinclair (genome.biowiki.org)
I tried to install gcc 4.1.1 with Java AWT support, which failed because of
problems with gtk+. So I tried to install gtk+ from source, which failed
because of problems with glib, atk, cairo, and pango. I was able to resolve
the first two and install cairo without X support (or at least I *think*
that's what I did: I ran it with the --disable-xlib option, and it wouldn't
work otherwise). But pango killed me: it said -lX11 could not be found because
X11 was incompatible. That is the same problem as with cairo, except not as
easily avoidable. It might be solvable by digging deeper into the X11
dependencies, but I give up at this point; graphics stuff is nasty stuff.
So in the end, as of today, the following are installed from source and up to
date: atk, cairo (without xlib), glib, libpng, and binutils.
But, no AWT for Java. Sorry...
This node locked up... hard.
The power light was on, but networking was dead, the screen was black, and pushing the power/reset buttons had no effect (even the CD tray wouldn't open).
Had to reboot it by literally pulling the plug, no other choice left.
It came back up fine and started running SGE jobs right away, and you can log into it - everything looks fine from the shell.
But the keyboard worked only sporadically, and the mouse did not work at all.
Fiddling the KVM cable and rebooting both the node and the KVM switch brought everything back to normal.
This kind of hardware weirdness should be noted.
This node might be headed for trouble and should be investigated more thoroughly by a real sys admin.
Networking dead, KVM-over-IP shows black screen.
Looks like this node is fubared again, will reboot it soon.
However, this is the second offense.
(Back from the datacenter...)
Yep, same problem as last time - had to pull the plug to restart.
Need to call FineTec on this one.
Failed today while running the memory-hungry
Networking, etc. dead.
Logging in with KVM-over-IP showed "I/O failure to sector N of hda" (or some such) looping over and over.
Rebooting the node fixed the problem, but this looks worrying.
Is the hard drive going bad?
kosh experienced a drive failure the previous night; the symptoms were noted by Andrew.
On further investigation, the drive had been remounted read-only, with large parts of the filesystem already corrupted.
Brought in binaries from other systems to run diagnostics; waiting until after 08-15 to rebuild/reboot.
The system now refuses to initiate ssh identification, and remote access via KVM is not working.
- draal (drive failure)
- byron (drive failure)
- londo (won't start)
- vir (won't start)
The nodes in the cluster are all named after characters from the science fiction television show Babylon 5.