Backups Using Rsync

From Biowiki

Using rsync to back up your data

Everyone knows that the contents of our Cluster NFS, which include everyone's home directories on the cluster, are backed up daily at 9:30PM (see TSMBackupSystem). But what about the contents of our lab machines?

Well, the short answer is that they're not. It is up to you to back up the important contents of machines that you are responsible for (for most people, it is just their own computer) to the cluster NFS, such as to your home directory. Once the data is on the NFS, I guarantee it will get backed up to tape by our trusty and reliable automated daily backup.

But how to most easily do this? The tool of choice I recommend is rsync, which is a program for copying files. Unlike a conventional copy, rsync can use an algorithm that computes the difference between the source and destination files and transmits only that difference.

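So, if you are doing daily backups of large and/or numerous files that are mostly static, you're not going to waste bandwidth and CPU cycles by daily copying data that doesn't change: files whose size and modification time match the copy at the destination are skipped entirely. (One caveat: when the destination is a locally mounted path, such as our NFS mount, rsync defaults to --whole-file and copies each changed file in full; the difference algorithm is on by default only when rsync talks to a remote host.) Because we are copying over the NFS, saving bandwidth is a good idea.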

How to use rsync to back up your home directory on Mac OS X (version 10.3/Panther, presumably applies to 10.4/Tiger)

Set up the backup cron job:

$ crontab -e

which will drop you into your default text editor. Use it to add a line such as:

30 0 * * * rsync --exclude-from=/Users/avu/backup-excludes.list --progress --stats --verbose --archive /Users/avu/ /mnt/nfs/users/avu/backups/angel/ > /Users/avu/backup.log 2>/Users/avu/backup.err

then save and exit. This will make the cron daemon run the specified rsync command at the designated time. For example, the above runs it every day of every month (for every weekday) at the 0th hour and the 30th minute - i.e., daily at 12:30AM. For more on crontab syntax, see man 5 crontab.
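As a quick reference, the five time fields of a crontab line read left to right as sketched below. The command path is a placeholder, and the second entry is a hypothetical alternative schedule shown only for contrast.

```
# field order: minute hour day-of-month month day-of-week command
# Daily at 12:30AM, the schedule used above:
30 0 * * * /path/to/backup-command
# Hypothetical alternative: 3:15AM on Sundays only (day-of-week 0 is Sunday):
15 3 * * 0 /path/to/backup-command
```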

I recommend scheduling your backup daily, but not at the same time as the cluster NFS automatic daily backup - do it at least a couple of hours before or a couple of hours after. Also, the first backup will probably be the most time-consuming, so it is best to carry it out on the command line first, just to test it out before adding it to cron.

The above sample rsync command will recursively back up the contents of /Users/avu/ to /mnt/nfs/users/avu/backups/angel/ (angel is the name of my machine). Note that its output to standard out gets redirected to backup.log, and standard error gets redirected to backup.err.
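The redirection part can be tried on its own without running a backup. In this small sketch, fake_backup is a hypothetical stand-in for rsync and the log directory is a temporary sandbox, not your real home directory.

```shell
# Hypothetical stand-in for rsync: one line to standard out, one to standard error.
fake_backup() {
    echo "sent 3 files"
    echo "warning: 1 file vanished" >&2
}

logdir=$(mktemp -d)   # sandbox directory for the two log files

# > captures standard out, 2> captures standard error, each in its own file.
fake_backup > "$logdir/backup.log" 2> "$logdir/backup.err"

cat "$logdir/backup.log"   # prints: sent 3 files
cat "$logdir/backup.err"   # prints: warning: 1 file vanished
```

Since > truncates the file each time, every nightly run overwrites the previous log; >> would append instead.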

But what do all those options do?

--exclude-from=/Users/avu/backup-excludes.list

Specifies a file containing a list of patterns describing files and directories that should not get backed up. Note that:

  • they're patterns, so we can use wildcards;
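  • the patterns are matched against paths relative to the "root" of your backup, i.e. the source directory (in my example, it is /Users/avu/); a pattern that starts with / is anchored at that root, while a pattern without a leading / matches at any depth.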

For example, here are the contents of my backup-excludes.list (note that these paths are relative to my home directory, e.g. /Users/avu/tmp/ is just tmp/ here):

.Trash/
Desktop/
Library/Caches/
Music/
tmp/
backup.log
backup.err

You can also specify an include pattern list using --include-from=<FILEPATH/NAME>.

Lastly, you can also specify an explicit list of files to include using --files-from=<FILEPATH/NAME> (see man rsync for more info).

--progress --stats --verbose

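--progress shows a progress meter for each file, --stats prints a summary of the transfer at the end, and --verbose lists the files being transferred. All of this is useful to log.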

--archive

Turns on a bunch of options useful for doing backups, such as copying recursively (necessary for copying directories and their contents) and preserving timestamps, permissions and ownership of files and directories transferred.

See man rsync for more options and info.

Method that avoids relying on the NFS to be mounted, for the really paranoid...

TODO: explain how to set up passwordless SSH login

Short synopsis: run a cron job on lorien, the NFS server, that logs into your lab machine and copies your backup to the cluster daily. See the warning below!

WARNING: the chain of trust should always be such that if a breach occurs outside the cluster, the cluster is not compromised. In other words, the cluster machine should have an ssh key for the machine to be backed up, and not the other way around.
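A rough sketch of what such a cron job on lorien might look like, assuming key-based SSH login from the cluster to the lab machine already works. That the host name angel resolves from lorien, the user name, the destination path as seen on lorien, and the location of the exclude list are all assumptions; adjust them to your setup.

```
# crontab on lorien (hypothetical): pull /Users/avu/ from the lab machine over SSH at 12:30AM.
# The exclude list is read on lorien, the machine running rsync, so it must exist there.
30 0 * * * rsync --archive --rsh=ssh --exclude-from=/mnt/nfs/users/avu/backup-excludes.list avu@angel:/Users/avu/ /mnt/nfs/users/avu/backups/angel/
```

Because the connection is opened from the cluster side, the lab machine never needs credentials for the cluster, which matches the chain of trust described in the warning.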

---

-- Created by: Andrew Uzilov on 20 Jun 2006