Cluster Maintenance Log
I'm planning for this to serve as a centralized place to log problems with the cluster and their solutions, as well as general maintenance issues. Feel free to post entries for cluster-related issues. The idea is that if we're having a problem with a node, we can search this page for a rough history and see whether anything similar has happened before.
Mitch and I went down to the colo. 5 nodes were down - franklin, talia, morden, marcus and bester. Marcus was actually powered off, which was strange, seeing as it had gone down a couple weeks back in the middle of processing a cluster job - I would have expected it to just be locked up. All five came back up after a hard reboot. Marcus and bester rejoined the SGE queue automatically. Franklin, talia and morden had previously been removed from the queue; will add franklin and talia back today. Morden is still exhibiting weird behavior - couldn't ping it, couldn't ssh in, couldn't mount NFS from the KVM, and processes were locking up when executing basic commands like vim or ls. Not entirely sure what to make of this; will need to go back down at some point to diagnose and/or blow it away and reinstall. Hopefully not a hardware issue?
Mapped out KVM:
- 1 - sinclair
- 2 - garibaldi
- 3 - sheridan
- 4 - lorien
- 5 - kosh
- 6 - zack
- 7 - theo
- 8 - neroon
- 9 - morden (I freed up this cable, can use as open/diagnostic slot)
- 10 - refa
- 11 - natoth
- 12 - draal
- 13 - franklin
- 14 - talia
- 15 - open, no cables at colo
- 16 - open, no cables at colo
We also wired up the cm4148 serial console box, but it's currently unconfigured. It seems to have automatically been assigned an IP address, but doesn't accept logins yet. Will need to go back at some point; can use two of our rj45(f)->serial(m) converters with the cables there to configure it from one of the cluster nodes, maybe? Or just bring a laptop with a serial port. -LEB
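For reference, configuring from a cluster node should just be a matter of pointing a terminal emulator at the node's serial port once the converter is attached. Something like (device name and baud rate are assumptions - the cm4148 console likely defaults to 9600 8N1, check the manual):

    screen /dev/ttyS0 9600

or, if minicom is installed:

    minicom -D /dev/ttyS0 -b 9600

Adjust /dev/ttyS0 to whichever port the adapter ends up on.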
Added talia and franklin back in w/ qconf -mq regular_nodes.q. Should keep an eye on these two, along with marcus and bester - I have a suspicion these lockups have something to do with SGE or the jobs running through it, and not the nodes themselves. However, if these particular nodes repeatedly go down, we should probably start worrying about them. -LEB
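For anyone who hasn't done this before: qconf -mq regular_nodes.q opens the queue configuration in $EDITOR, and adding or removing a node is just a matter of editing the hostlist attribute. Roughly (host names here are illustrative; the exact attribute layout can differ between SGE versions):

    qname       regular_nodes.q
    hostlist    franklin talia marcus bester

Save and quit, and SGE picks up the change.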
Kosh's file system went read-only at some point over the weekend. Was able to log in, but mount and shutdown commands gave a bus error. I'm guessing this is a result of an SGE job misbehaving. Removed it from the queue using qconf -mq, then went down to the colo and rebooted; it seemed to come up fine, and I could create files in /tmp again. Added it back to the queue, and it's running jobs again with no errors so far.
Related: Jeremy had noted that kosh experienced disk failure in august of 2007 on the main cluster page, but there's no follow up as to how(/if?) this was resolved. Anyone know what the deal was? -LEB
Draal went down today while running a handalign alignment job, can't ssh in:
[lbarquist@lorien ~]$ ssh draal
ssh: connect to host draal port 22: Connection refused
Oddly, it responds to pings, which suggests the kernel's network stack is still up but nothing is listening on port 22 - so sshd has presumably died, or the box is partially wedged. Need to go down to the colo in the next few days to diagnose. Hopefully it just needs a hard reboot. Might be a good machine to dig through SGE logs on, since we know approximately when it went down, and nothing else has been done to it since then. Hopefully we can confirm it's a hiccup with SGE jobs taking up too much memory or some such; at this point I seriously doubt it has anything to do with the systems themselves.
Also, Mitch got a serial adapter to configure the cm4148. We can possibly take care of both of these issues in one trip. -LEB
3/30/2009 - All nodes went into E state this weekend - still not entirely sure why. Maybe some wonky jobs? Reset with qmod -c '*'. Byron went back into E state - looks like its filesystem is screwed up? ls-ing /opt gives
ls: reading directory /opt/: Input/output error
Removed it from the queue - should we try reinstalling the OS at some point? Don't know whether this is a hardware or software problem right now.
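For next time, the SGE incantations for error states, as far as I can tell (qmod -c '*' is from the entry above; -explain is documented in the qstat man page):

    qstat -f -explain E      (show queues in E state and why)
    qmod -c '*'              (clear the error state on all queues)
    qmod -c regular_nodes.q  (or clear just one queue)

Worth running qstat -f -explain E before clearing, so we have some record of what actually put the queues into E state.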
-- Lars Barquist - 20 Jun 2008