How To Administer Sun Grid Engine
- 1 How to Administer Sun Grid Engine
- 1.1 Miscellaneous/General Notes
- 1.2 How To...
- 1.3 Why Won't My Job Run Correctly? (aka How To Troubleshoot/Diagnose Problems)
- 1.4 SGE "To Do" List
- 1.5 Comments (questions, notes, suggestions, etc.)
This is the page for how to administer Sun Grid Engine jobs, hosts, queues, users, policies, etc. on the Babylon Cluster. For other info on SGE, visit:
- Sun Grid Engine (installation guide and some notes on our setup)
- How To Use Sun Grid Engine (basic usage guide)
- some How To pages from the SGE project site, with useful stuff, particularly:
- this file, which is just man sge_intro - it's a good overview of all the SGE commands and what man files to consult for more info
- thorough, complete (more or less) SGE documentation from Sun (warning: not a casual read)
Please leave comments (notes, questions, etc.) below, or edit the wiki as appropriate.
N.B.: a lot of configuration commands (specifically, qconf) will bring up a configuration file for editing using your default text editor. On the cluster, that editor is Vim. Vim will never die! So if you don't know it, please learn at least how to change and save files without shooting yourself in the foot before you proceed! It only takes a few minutes.
Commands to know and love
qconf - the essential configuration command for everything
qmod - good for clearing error states on jobs and queues, enabling queues, disabling queues... and so on.
qstat and qacct - job accounting commands for current and completed jobs, respectively... useful for diagnosing what's wrong
qping - a really useful command for troubleshooting that, oddly, is not covered in man sge_intro! Why not?
But what about QMON?
QMON is a GUI for SGE management. Because I personally hate GUIs (since the mouse is an atrocious instrument for the lazy that's destroying America, and the colorful interface is making the children hyperactive and reducing their attention spans, thus leading me to believe that GUIs were crafted by the right hand of Lucifer himself), you will not find any QMON info here.* Read the 200+ page manual (see SGE documentation from Sun) for QMON info.
* (If you could not digest that sentence in one fell swoop, clearly you are suffering from a GUI-induced attention deficit problem.)
Our current queues
regular_nodes.q - regular nodes/hosts (each node/host has 2GB RAM shared by two 2.2GHz Opteron CPUs) - 46 CPUs total
himem_nodes.q - hi-mem nodes/hosts (currently just one node/host, which has 32GB RAM shared by four 2.2GHz Opteron CPUs) - 4 CPUs total
Current exec node configuration
hostname              <host name>
load_scaling          NONE
complex_values        NONE
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
Pretty sweet, isn't it? Clearly we're not tapping even a fraction of the config options available in SGE... but someday, we probably will.
check the status of all the jobs that finished running, or had problems and exited, EVER
On the master node (or submit node? not sure... for us they're the same node, sheridan) run:
$ qacct -j
then you should pipe that to something like tail -n <number of lines> because you only want the end of the massive log that it will spit out.
Alternately, you can use:
$ qacct -j <job number>
There are many other useful options, such as getting all jobs by a specific user, etc. See man qacct for them.
If the job is still running, use qstat instead of qacct above to get the same info.
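To avoid reading the whole log by eye, the records can be filtered down to the jobs that did not finish cleanly. Here is a minimal sketch, assuming that qacct -j prints one "name value" pair per line (including jobnumber, failed, and exit_status) and separates records with a line of "=" characters. The function name failed_jobs is made up for illustration.

```shell
# failed_jobs: read `qacct -j` output on stdin and print one line per job
# whose "failed" or "exit_status" field is non-zero.
# (Hypothetical helper; the record layout is an assumption, see above.)
failed_jobs() {
    awk '
        # Print the record collected so far if it looks bad, then reset.
        function report() {
            if (job != "" && (failed != "0" || status != "0"))
                print "job " job " failed=" failed " exit_status=" status
            job = ""; failed = "0"; status = "0"
        }
        BEGIN               { job = ""; failed = "0"; status = "0" }
        /^=+$/              { report(); next }   # record separator
        $1 == "jobnumber"   { job = $2 }
        $1 == "failed"      { failed = $2 }
        $1 == "exit_status" { status = $2 }
        END                 { report() }
    '
}
```

Typical use would be something like: qacct -j | failed_jobs | tail -n 20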
add a new queue
On the master node, run:
$ qconf -aq
This brings up a template queue configuration file for the new queue, probably using Vim. Edit it to your liking and save it, then exit. Usually, you will just leave everything as the template gave it to you, except:
- change "qname" to your queue name
- add the hosts you want in the queue to "hostlist"
- for "slots", put in a comma-delimited list of entries in the format:
[<hostname>=<number of jobs>]
indicating the maximum number of jobs allowed to execute simultaneously on each host (usually that's just the number of CPUs on that host), plus a leading default value (the "1" below) that applies to any host in "hostlist" that has no bracketed entry of its own, e.g.:
slots 1,[ivanova=2],[franklin=2],[marcus=2], \
      [delenn=2],[londo=2],[vir=2],[gkar=2],[lennier=2], \
      [morden=2]
- put "/bin/bash" for "shell"
See man queue_conf for more details on what all the parameters in the queue configuration file mean.
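Typing the "slots" entries by hand gets error-prone once there are more than a few hosts. A small sketch that builds the line from host=count pairs follows; make_slots_line is a hypothetical helper, not an SGE command, and its first argument is the leading value that goes before the bracketed entries.

```shell
# make_slots_line LEADING HOST=N [HOST=N ...]
# Prints a "slots" line for a queue configuration file.
# (Hypothetical helper for illustration; it does not talk to SGE at all.)
make_slots_line() {
    line="slots $1"
    shift
    for pair in "$@"; do
        # Each host entry is wrapped in brackets and comma-separated.
        line="$line,[$pair]"
    done
    printf '%s\n' "$line"
}
```

For example, make_slots_line 1 ivanova=2 franklin=2 prints a line you can paste into the editor that qconf -aq opens.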
change configuration of an existing queue
$ qconf -mq <queue name>
add a new exec host
On the MASTER HOST, add/register the new host as an exec host:
$ qconf -ae
Edit the template to reflect the current configuration for exec hosts. Currently, this means just add the correct host name, save, and exit.
For some ungodly reason, the host must also be added/registered as an administrative host, which can be done with:
$ qconf -ah <host name>
Now we must add this host to queues we want it to be in:
$ qconf -mq DESIRED_QUEUE
Add the exec host name to the "hostlist" field, and add an entry to the "slots" field giving the number of CPUs that SGE may use on that host.
Disclaimer: the following step is only useful if you have more than one host group... which we don't, nor do we explicitly use host groups (as far as I know) for anything. So you can probably omit this... I followed it just in case.
Add the host to the "@allhosts" host group:
$ qconf -mhgrp @allhosts
then add the name of the new host, save, and exit.
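If you would rather not go through the editor for every host, most of the master-side steps above can probably be done non-interactively. The sketch below only prints the commands so you can review them first. It assumes your SGE version supports qconf -aattr (check man qconf before trusting it), and print_add_host_cmds is a made-up helper name. The qconf -ae step stays interactive and is not covered.

```shell
# print_add_host_cmds HOST CPUS QUEUE
# Prints (does NOT run) qconf commands for registering HOST as an admin
# host, adding it to QUEUE with CPUS slots, and adding it to @allhosts.
# (Hypothetical helper; assumes `qconf -aattr` is available.)
print_add_host_cmds() {
    host=$1 cpus=$2 queue=$3
    printf 'qconf -ah %s\n' "$host"
    printf 'qconf -aattr queue hostlist %s %s\n' "$host" "$queue"
    printf "qconf -aattr queue slots '[%s=%s]' %s\n" "$host" "$cpus" "$queue"
    printf 'qconf -aattr hostgroup hostlist %s @allhosts\n' "$host"
}
```

Run print_add_host_cmds kosh 2 regular_nodes.q, read the output, and paste the lines you want into a shell on the master host.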
Now, configure the exec host ON THE ACTUAL EXEC HOST/NODE. Note that if you are installing more than a couple of exec hosts, it might make sense to use a configuration file and do it as described in Step 3 of Sun Grid Engine. The procedure described here is the manual, interactive one, suitable for adding a small number of hosts.
Unzip and untar the installation files into your directory of choice (for us, it is /opt/sge/). Add the port numbers for SGE to /etc/services, e.g.:
# SUN GRID ENGINE
sge_qmaster 536/tcp # for Sun Grid Engine (SGE) qmaster daemon
sge_execd 537/tcp # for Sun Grid Engine (SGE) exec daemon
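A quick way to confirm that both entries made it into the file is sketched below. missing_sge_services is a hypothetical helper, and it assumes each entry starts at the beginning of a line with the service name followed by whitespace.

```shell
# missing_sge_services [FILE]
# Prints the SGE service names that have no entry in FILE
# (default: /etc/services); prints nothing if both are present.
# (Hypothetical helper for illustration.)
missing_sge_services() {
    file=${1:-/etc/services}
    for name in sge_qmaster sge_execd; do
        # An entry starts with the service name followed by whitespace.
        grep -q "^$name[[:space:]]" "$file" || printf '%s\n' "$name"
    done
}
```

No output from missing_sge_services means both services are registered on that host.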
Now perform the most idiotic step of the entire installation, and the step which the SGE documentation seems to make no mention of: you must copy your cell directory (for us, it is /opt/sge/default/, because our cell name is "default") from the master host to your exec host (i.e. copy the contents of /opt/sge/default/ on the master to /opt/sge/default/ on the exec host). If you don't, you will get an error during the interactive install telling you that the master node has not been installed. If there is a proper way to do this that someone knows, please enlighten me.
Set the SGE_ROOT environment variable to wherever you put the SGE files, e.g.:
$ export SGE_ROOT=/opt/sge/
Then start the exec host installer from $SGE_ROOT (in a standard SGE distribution, this is the install_execd script), which will go into the interactive install program.
Accept all defaults, except specify the local spool directory (for us, it is /opt/sge/default/spool/, but can really be anything that sgeadmin is allowed to write to... I think...). One more non-default: if you already added this host to the appropriate queues on the master, as described above, say no to adding a "default queue instance for this host."
At the end of the interactive installation, SGE will ask you to run the settings script. For our purposes, ignore the syntax it gives you, and run this instead:
$ . $SGE_ROOT/default/common/settings.sh
You should now be able to run qstat and similar commands on the new exec node to verify that it is working.
remove an exec host
Examples shown here are to remove the exec host sinclair.
Delete the host from whatever queue it was in (e.g. the "regular_nodes.q" queue):
$ qconf -mq regular_nodes.q
Delete the host from its host group (e.g. the "@allhosts" group):
$ qconf -mhgrp @allhosts
Remove host from exec host list:
$ qconf -de sinclair
Remove from configuration list:
$ qconf -dconf sinclair
Done... try flooding the queues with simple test jobs to make sure they don't get scheduled on the supposedly deleted host.
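One way to do the flooding is sketched below. Each job just runs hostname and writes the result to its own output file in the current directory. submit_test_jobs is a made-up name, and the QSUB variable is a hypothetical override that lets you substitute another command for a dry run.

```shell
# submit_test_jobs COUNT
# Submits COUNT tiny jobs that each write the name of the host they ran on
# to hosttest_<i>.out in the current directory.
# QSUB may be set to another command for a dry run (hypothetical override).
submit_test_jobs() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        # qsub reads the job script from stdin when no script file is given.
        echo 'hostname' | ${QSUB:-qsub} -N "hosttest_$i" -cwd -j y -o "hosttest_$i.out"
        i=$((i + 1))
    done
}
```

After submit_test_jobs 100 has drained from the queue, grep -l sinclair hosttest_*.out should print nothing if the host is really gone.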
Why Won't My Job Run Correctly? (aka How To Troubleshoot/Diagnose Problems)
Does your job show "Eqw" or "qw" state when you run qstat, and just sit there refusing to run? Get more info on what's wrong with it using:
$ qstat -j <job number>
Does your job actually get dispatched and run (that is, qstat no longer shows it - because it was sent to an exec host, ran, and exited), but something else isn't working right? Get more info on what's wrong with it using:
$ qacct -j <job number>
(see especially the "failed" and "exit_status" lines)
If any of the above have an "access denied" message in them, it's probably a permissions problem. Your user account does not have the privileges to read from/write to where you told it (this happens with the -e and -o options to qsub often). So, check to make sure you do. Try, for example, to SSH into the node on which the job is trying to run (or just any node) and make sure that you can actually read from/write to the desired directories from there. While you're at it, just run the job manually from that node, see if it runs - maybe there's some library it needs that the particular node is missing.
To avoid permissions problems, cd into the directory on the NFS where you want your job to run, and submit from there using qsub -cwd to make sure it runs in that same directory on all the nodes.
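The "can I actually write there?" check is easy to script once you are logged into a node. A minimal sketch follows; check_job_dirs is a hypothetical helper, and you would pass it whatever directories you give to qsub with -o and -e.

```shell
# check_job_dirs DIR [DIR ...]
# Prints each directory that is missing or not writable by the current
# user; returns non-zero if any problem was found.
# (Hypothetical helper for illustration.)
check_job_dirs() {
    status=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            printf 'missing: %s\n' "$dir"; status=1
        elif [ ! -w "$dir" ]; then
            printf 'not writable: %s\n' "$dir"; status=1
        fi
    done
    return "$status"
}
```

Run it as the same user that submits the job, since root can write almost anywhere and will hide the problem.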
Not a permissions problem? Well, maybe the nodes or the queues are unreachable. Check with qstat -f, which lists every queue instance along with its state.
If the "state" column in qstat -f has a big E, that host or queue is in an error state due to... well, something. Sometimes an error just occurs and marks the whole queue as "bad", which blocks all jobs from running in that queue, even though there is nothing otherwise wrong with it. Use qmod -c <queue list> to clear the error state for a queue.
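Picking out the broken queues by eye gets tedious with many hosts. The sketch below assumes the usual qstat -f layout, where each queue line has the form name@host, queue type, slot usage, load average, architecture, and then an optional states column. error_queues is a made-up helper name.

```shell
# error_queues: read `qstat -f` output on stdin and print the name of every
# queue instance whose states column (assumed to be field 6) contains "E".
# (Hypothetical helper; the column layout is an assumption, see above.)
error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}
```

Use it as qstat -f | error_queues, and hand the names it prints to qmod -c once you know why they went into the error state.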
Maybe that's not the problem, though. Maybe there is some network problem preventing the SGE master from communicating with the exec hosts, such as routing problems or a firewall misconfiguration. You can troubleshoot these things with qping, which will test whether the SGE processes on the master node and the exec nodes can communicate.
N.B.: remember, the execd process on the exec node is responsible for establishing a TCP/IP connection to the qmaster process on the master node, not the other way around. The execd processes basically "phone home". So you have to run qping from the exec nodes, not the master node!
Syntax example (I am running this on an exec node, and sheridan is the SGE master):
$ qping sheridan 536 qmaster 1
where 536 is the port that qmaster is listening on, and 1 simply means that I am trying to reach a daemon. Can't reach it? Make sure your firewall has a hole on that port, that the routing is correct, that you can ping using the good old ping command, that the qmaster process is actually up, and so on.
Of course, you could ping the exec nodes from the master node, too, e.g. I can see if I can reach exec node kosh like this:
$ qping kosh 537 execd 1
but why would you do such a crazy thing? execd is responsible for reaching qmaster, not the other way around.
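If you can never remember which port is which, you can look it up from /etc/services instead of hard-coding it. A small sketch, assuming the ports are registered there as name, whitespace, port/tcp; sge_port is a hypothetical helper.

```shell
# sge_port NAME [FILE]
# Prints the TCP port registered for service NAME in FILE
# (default: /etc/services).
# (Hypothetical helper for illustration.)
sge_port() {
    awk -v name="$1" \
        '$1 == name && $2 ~ /\/tcp$/ { sub(/\/tcp$/, "", $2); print $2; exit }' \
        "${2:-/etc/services}"
}
```

On an exec node this lets you write qping sheridan "$(sge_port sge_qmaster)" qmaster 1 and not care what the port number is.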
If the above checks out, check the messages log in /var/log/sge_messages on the submit and/or master node (on our Babylon Cluster, they're both the node sheridan):
$ tail /var/log/sge_messages
Personally, I like running:
$ tail -f /var/log/sge_messages
before I submit the job, and then submit a job in a different window. The -f option will update the tail of the file as it grows, so you can see the message log change "live" as your job executes and see what's happening as things take place.
(Note that the above is actually a symbolic link I put in to the messages log in the qmaster spool directory, i.e. /opt/sge/default/spool/qmaster/messages.)
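The messages log gets noisy, so it can help to show only the serious entries. The sketch below assumes each log line has the form timestamp|component|host|level|message, with a level of E for errors and C for critical problems. Check a few lines of your own log first, because I have not confirmed that layout for every SGE version. sge_problems is a made-up name.

```shell
# sge_problems: read an SGE messages log on stdin and print only the
# entries whose level field (assumed to be the 4th "|"-separated field)
# is E or C.
# (Hypothetical helper; the log layout is an assumption, see above.)
sge_problems() {
    awk -F'|' '$4 == "E" || $4 == "C"'
}
```

For example: tail -n 500 /var/log/sge_messages | sge_problems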
One thing that commonly goes wrong is permissions. Make sure that the user that submitted the job using qsub actually has the permissions to write error, output, and other files to the paths you specified.
For even more precise troubleshooting... maybe the problem is unique to some node(s) or some queue(s)? To pin it down, try to run the job only on some specific node or queue:
$ qsub -l hostname=<node/host name> <other job params>
$ qsub -l qname=<queue name> <other job params>
Maybe you should also try to SSH into the problem nodes directly and run the job locally from there, as your own user, and see if you can get any more detail on why it fails.
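To try the same job on every suspect node in one go, you can loop over the hosts and pin one copy to each. A sketch, where submit_per_host is a made-up helper and QSUB is a hypothetical override for dry runs:

```shell
# submit_per_host SCRIPT HOST [HOST ...]
# Submits SCRIPT once per host, pinning each copy with -l hostname=<host>.
# QSUB may be set to another command for a dry run (hypothetical override).
submit_per_host() {
    script=$1
    shift
    for host in "$@"; do
        # The job name records which host the copy was pinned to.
        ${QSUB:-qsub} -cwd -N "pin_$host" -l hostname="$host" "$script"
    done
}
```

After submit_per_host myjob.sh ivanova franklin kosh, compare the output files (or qacct -j for each job) to see which hosts misbehave. The script name here is only a placeholder.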
If all else fails...
Sometimes, the SGE master host will become so FUBARed that we have to resort to brute, traumatizing force to fix it. The following solution is equivalent to fixing a wristwatch with a bulldozer, but seems to cause more good than harm (although I can't guarantee that it doesn't cause long-term harm in favor of a short-term solution).
Basically, you wipe the database that keeps track of SGE jobs on the master host, taking any problem "stuck" jobs with it. (At least that's what I think this does...)
I've found this useful when:
- You submit >10,000 jobs to SGE, which eats so many system resources that the jobs cannot get dispatched to exec hosts, and you start getting the "failed receiving gdi request" error on something as simple as qstat. You can't use qdel to wipe the jobs due to the same error.
- A job is stuck in the r state (and if you try to delete it, the dr state) despite the fact that the exec host is not running the job, nor even aware of it. This can happen if you reboot a stuck/unresponsive exec host.
ssh sheridan
su -
service sgemaster stop
cd /opt/sge/default/
mv spooldb spooldb.fubared
mkdir spooldb
cp spooldb.fubared/sge spooldb/
chown -R sgeadmin:sgeadmin spooldb
service sgemaster start
Wipe spooldb.fubared when you are confident that you won't need its contents again.
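Since this is a bulldozer, a guard rail or two does not hurt. The sketch below wraps only the mv/mkdir/cp part of the procedure and refuses to run if an earlier spooldb.fubared is still lying around. reset_spooldb is a hypothetical helper, not part of SGE. Stopping and starting sgemaster and fixing ownership with chown are still up to you.

```shell
# reset_spooldb CELL_DIR
# Moves CELL_DIR/spooldb aside to CELL_DIR/spooldb.fubared and recreates
# spooldb containing only the "sge" file, mirroring the mv/mkdir/cp steps.
# Does NOT stop/start sgemaster or chown anything.
# (Hypothetical helper for illustration.)
reset_spooldb() {
    cell=$1
    if [ ! -d "$cell/spooldb" ]; then
        echo "no spooldb directory in $cell" >&2
        return 1
    fi
    if [ -e "$cell/spooldb.fubared" ]; then
        echo "$cell/spooldb.fubared already exists, refusing to overwrite" >&2
        return 1
    fi
    mv "$cell/spooldb" "$cell/spooldb.fubared" &&
    mkdir "$cell/spooldb" &&
    cp "$cell/spooldb.fubared/sge" "$cell/spooldb/"
}
```

As root on sheridan, that would be: service sgemaster stop, then reset_spooldb /opt/sge/default, then chown -R sgeadmin:sgeadmin /opt/sge/default/spooldb, then service sgemaster start.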
SGE "To Do" List
NB: this is now summarized under the SGE subsection of Sys Admin Tasks, and this page will probably cease being maintained.
We are tapping only a fraction of SGE's features, but as I learn the system more, the pages (Sun Grid Engine, How To Use Sun Grid Engine, and How To Administer Sun Grid Engine) will grow. Some particular things to look at are:
- improving the "how to" page (How To Use Sun Grid Engine)
- adding spare machines in the lab as an SGE queue
- scheduling and spooling optimizations
- setting up user notification e-mails so that users are notified when their jobs encounter problems (would be very, very useful)
- policy management (making sharing cluster resources more fair, not just on a "first come, first served" basis as it is now)
- installing a shadow master host
- making the Macs in the lab submit and administrative hosts (so people don't have to log into sheridan all the time, they can just submit jobs from the Macs directly)
Of course any help in figuring stuff out is appreciated...
Comments (questions, notes, suggestions, etc.)
-- Started by: Andrew Uzilov - 08 Apr 2006