Sys Admin Tasks

System Administrator job requirements

Brief job description

The Holmes lab seeks an experienced system administrator to help us build & maintain infrastructure for computational biology research (specifically high-throughput genome analysis including detection & phylogeny of noncoding RNAs, transposons & other genetic elements).

In addition to "typical" sysadmin tasks involving maintenance of intranet & internet services (basic flavors of email, HTTP, DNS, source code control, etc.), the job will call for the development & application of custom software for the comparative analysis of whole genomes. This will include, e.g., adaptation of automated build tools for the construction of bioinformatics pipelines; use of probabilistic modeling software for profiling genomic features; and assisting the development of "Web 2.0" tools for genomic databases (wikis, blogs, forums, Ajax, etc.).

Required experience

  • significant experience administering Linux systems (Mac OS X, BSD, and other Unix flavors a plus)
  • scripting/programming experience
  • excellent knowledge and understanding of computer networking (especially TCP/IP) and network security
  • experience maintaining a parallel compute cluster (Sun Grid Engine, LSF, OpenPBS or other queueing system preferred)
  • scientific computing background desirable (bioinformatics/computational biology background especially so)
  • specific skills/experience with 3 or more of:
    • Unix system administration:
      • Linux administration: CentOS, Fedora, and/or Red Hat; building RPMs
      • Apache HTTP Server administration
      • CVS, Subversion, and/or other version control systems (e.g. darcs)
      • DNS servers (djbdns preferred)
      • Mail services: SMTP, IMAP, webmail
      • NFS and/or other networked Unix file systems
    • Unix programming & applications:
      • Perl programming
      • GNU Make
      • shell scripting
      • autoconf
      • SQL database administration (MySQL preferred)
      • PHP programming a plus; MediaWiki hacking a double-plus

Ongoing responsibilities

The system administrator will:

  • Administer the following servers & Internet services:
    • OpenSSH on all machines
    • Apache/TWiki installation on biowiki.org (alias www.biowiki.org)
    • CVS, viewCVS and Subversion services on cvs.biowiki.org
    • djbdns services on ns1.biowiki.org and ns2.biowiki.org
    • Mail on mail.biowiki.org (SMTP, IMAP and (later) SquirrelMail)
  • Proactively monitor network security
    • Scan security mailing lists such as BugTraq, SANS, etc., for vulnerabilities specific to the Internet services listed above
    • Update/patch active services (OpenSSH/OpenSSL/zlib, Apache...) immediately following security advisories
    • Establish and maintain a system for intrusion detection (traffic sniffing, port-scanning, logging, etc.)
  • Ensure data integrity
    • Maintain backup of all important data to the tape backup system (TSMBackupSystem)
    • Carry out regular data restoration tests to ensure the system is working as anticipated
  • Provide documentation of system administration activities
  • Provide consultation and support
    • Provide expert advice on expanding the lab's computational capabilities
    • Assist lab members with technical matters (troubleshooting computing systems)
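
One possible way to carry out the data restoration tests listed above is to restore a directory from backup into a scratch location and compare file checksums against the live copy. The following is only a minimal sketch under that assumption: it does not talk to the tape backup system at all, and the function names (file_digest, tree_digests, compare_restore) are hypothetical, not part of any existing lab tooling.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each regular file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_restore(original_root, restored_root):
    """Return (missing, extra, changed) lists of relative paths, each sorted.

    missing: present in the original but absent from the restore
    extra:   present in the restore but absent from the original
    changed: present in both, but with different contents
    """
    orig = tree_digests(original_root)
    rest = tree_digests(restored_root)
    missing = sorted(set(orig) - set(rest))
    extra = sorted(set(rest) - set(orig))
    changed = sorted(k for k in orig.keys() & rest.keys() if orig[k] != rest[k])
    return missing, extra, changed
```

A restore test would pass when all three lists come back empty. Files that legitimately changed between the backup and the test will show up as "changed", so a test like this is probably best run against data that is known to be static.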

Specific administrative tasks

The following is a sketch of a "to-do" list, roughly organized by task area & by priority.

TODO: these should be moved to RT. Please strike out items that get moved.

IH: since we will not be using this page as a "to do" list any more, it's OK to just delete items once they're placed on the Request Tracker. Alternatively just annotate "ticketed" items using the interwiki syntax Ticket:ID.

Cluster hardware & systems software

  • Medium priority
    • Optimize performance of the Cluster NFS
      • finish the Cluster NFS Benchmarks
      • explore dedicating a racked switch to NFS traffic on the cluster
      • many options could be tweaked (rsize/wsize, wdelay vs. no_wdelay, tcp vs. udp, compression or not)
    • Improve the Cluster RAID
      • investigate increasing disk size, RAID configuration
    • Set up remote monitoring of critical system events (e-mail alerts, etc.)
      • e.g. RAID disk failure, NFS failure, suspicious network activity, webserver failure, high temperature, etc.
    • Tidy the computation cluster rack wiring
      • airflow could be improved
      • investment in neatness pays off in the future
      • note that this will require cluster downtime
  • Lower priority
    • (Ticket:7 ??) Run yum updates on all the cluster nodes, leaving binaries compiled from source untouched
    • Make lorien mount NFS on loopback properly
      • will need to reboot the NFS server to identify the exact issue, so requires cluster downtime
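
For the NFS tuning item above, it may help to enumerate the option combinations before benchmarking them. The sketch below is an illustration only: the candidate values are assumptions rather than recommendations, rsize and wsize are tied together to keep the grid small, and the function name benchmark_grid is hypothetical. Note that rsize/wsize and the transport protocol are client-side mount options, whereas wdelay/no_wdelay are server-side export options, so the two are kept separate. Compression is left out because it is not a standard NFS mount or export option.

```python
from itertools import product

# Client-side mount options (see nfs(5)). The candidate values below are
# illustrative assumptions, not measured or recommended settings.
BLOCK_SIZES = [8192, 32768]
PROTOCOLS = ["tcp", "udp"]

# Server-side export options (see exports(5)).
EXPORT_DELAY = ["wdelay", "no_wdelay"]


def benchmark_grid(block_sizes=BLOCK_SIZES, protocols=PROTOCOLS,
                   export_delay=EXPORT_DELAY):
    """Yield one dict per client/server configuration to benchmark."""
    for size, proto, delay in product(block_sizes, protocols, export_delay):
        yield {
            # Simplification: read and write sizes are varied together.
            "mount_opts": f"rsize={size},wsize={size},proto={proto}",
            "export_opts": delay,
        }


if __name__ == "__main__":
    for cfg in benchmark_grid():
        print(cfg["export_opts"], cfg["mount_opts"])
```

With the values shown, this gives eight configurations. Changing an export option requires re-exporting on the server, so it may be more convenient to group benchmark runs by export option.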

Cluster application software

  • High priority
    • Set up an internal SQL database server
      • to be used as a central database to store genomic pipeline data, etc.
      • database overhead may be high enough to warrant this being a dedicated machine at the colo center
  • Medium priority
    • Update GCC to 4.x on all nodes
    • Install "caption" LaTeX package on cluster nodes regularly used by lab members
  • Lower priority

Cluster queueing software (Sun Grid Engine)

Remember to cross these off the Sun Grid Engine To Do list as they get done.

  • High priority
    • Optimize job submission/spooling
      • there is an approx. 1-2 minute delay between a job's submission and the start of its execution, which could be the rate-limiting step for thousands of small jobs (this assertion should be verified: I think it may only happen when no jobs have been submitted for a while; I have not observed long queues being delayed in this way, and in any case I think it's more like 10-20 seconds than 1-2 minutes - IH)
  • Medium priority
    • Set up user notification e-mails, so that users can be notified when their jobs encounter problems, complete, etc.
  • Lower priority
    • (Ticket:33) Add any spare boxes around the lab to SGE queues
    • (Ticket:32, sort of) Add submit host capability to lab user machines
    • Figure out how checkpointing works (useful for long jobs)
    • Set up shadow master on cluster
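
The disagreement above about the job start delay (1-2 minutes vs. 10-20 seconds) could be settled by measuring it. The sketch below assumes that submission and start times for a batch of jobs have already been collected as epoch seconds (for example from the queueing system's accounting records); how they are collected is not shown, and the function names start_delays and summarize are hypothetical.

```python
from statistics import median


def start_delays(jobs):
    """Return the start delay in seconds for each (submit, start) pair.

    jobs is an iterable of (submit_epoch, start_epoch) tuples.
    """
    delays = []
    for submitted, started in jobs:
        if started < submitted:
            raise ValueError("job started before it was submitted")
        delays.append(started - submitted)
    return delays


def summarize(delays):
    """Return min, median and max delay, or None if no jobs were given."""
    if not delays:
        return None
    return {"min": min(delays), "median": median(delays), "max": max(delays)}
```

Comparing the summary for a batch submitted to an idle queue against one submitted while other jobs are already queued would show whether the delay only appears after a period of inactivity.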

Internet services

  • High priority
    • Migrate services from lab servers to better/more stable servers at the colo center
      • this includes DNS, CVS/viewCVS/SVN, web, e-mail services
    • Move our SourceForge repositories to our local CVS servers
      • (Ticket:30) install several related services such as cvs-notify
  • Medium priority
    • Set up a mailing list server (e.g. Majordomo) on mail.biowiki.org
    • Set up LDAP, PAM or other centralized user directory
  • Lower priority
    • transfer all lab mailing lists to our own mailserver
      • make sure to back them up, also
    • SMTP server (mail.biowiki.org):
      • Create lab_member_name@biowiki.org e-mail accounts for lab members

Software and hardware on lab machines

  • High priority
    • Supervise move of lab computing equipment to our new location in Stanley Hall
  • Medium priority
  • Lower priority
    • Organize loose cabling around the lab

Networking and security

  • High priority
    • Remove unnecessary "gateway node -> head node" login chain
      • e.g. set up a transparent firewall between the Internet and the head node
    • Set up networking infrastructure in Stanley Hall for the lab move
  • Medium priority
    • Set up the Cisco FWSM at the colo center
  • Lower priority

Research projects

Research tasks will depend on current activities of the lab, but are likely to relate to one or more of the ongoing projects in the lab. These are listed here to give some indication of the probable nature of the work, although this list is non-binding and subject to change...

---

-- Created by: Andrew Uzilov on 08 Mar 2007