April 22, 2003

rsync: one of the wonders of the CS world

One of the pleasures of studying computer science is coming upon a really clever way of doing something. I've been feeling that pleasure lately about the rsync algorithm.

Andrew Tridgell invented the rsync algorithm. I'll quote from his excellent paper (co-authored with Paul Mackerras) to describe the problem:

Imagine you have two files, A and B, and you wish to update B to be the same as A. The obvious method is to copy A onto B.

Now imagine that the two files are on machines connected by a slow communications link, for example a dial up IP link. If A is large, copying A onto B will be slow. To make it faster you could compress A before sending it, but that will usually only gain a factor of 2 to 4.

Now assume that A and B are quite similar, perhaps both derived from the same original file. To really speed things up you would need to take advantage of this similarity. A common method is to send just the differences between A and B down the link and then use this list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of differences between two files rely on being able to read both files. Thus they require that both files are available beforehand at one end of the link. If they are not both available on the same machine, these algorithms cannot be used (once you had copied the file over, you wouldn't need the differences). This is the problem that rsync addresses.

This may sound a bit abstract, but it's a problem that occurs all over:

  • Keeping large software mirrors in sync
  • Incremental backups
  • Web caches (consider a university/company where everyone is loading up CNN.com: the home page is almost (but not quite) the same for each person.)
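The trick, roughly, is this: the receiver chops its old copy of the file into fixed-size blocks and sends back a cheap "weak" checksum plus a strong checksum for each block; the sender then slides a window over the new file, using the weak checksum to spot candidate blocks the receiver already has, and ships only the bytes in between. Here's a minimal sketch of that idea in Python. To be clear, this is my own illustration, not rsync's actual code: the block size is an arbitrary choice, the paper used MD4 where I've substituted MD5 (it ships with Python), and where real rsync updates the weak checksum in O(1) as the window slides one byte, this sketch naively recomputes it.

    import hashlib

    BLOCK_SIZE = 700  # arbitrary choice for this sketch

    def weak_checksum(data):
        # The paper's weak checksum, a relative of Adler-32: two 16-bit sums.
        a = sum(data) % 65536
        b = sum((len(data) - i) * x for i, x in enumerate(data)) % 65536
        return (b << 16) | a

    def strong_checksum(data):
        # The paper used MD4; MD5 stands in here since it's in the stdlib.
        return hashlib.md5(data).digest()

    def block_signatures(b_data):
        # Receiver side: checksum every block of the old file B.
        sigs = {}
        for off in range(0, len(b_data), BLOCK_SIZE):
            block = b_data[off:off + BLOCK_SIZE]
            sigs.setdefault(weak_checksum(block), []).append(
                (off // BLOCK_SIZE, strong_checksum(block)))
        return sigs

    def delta(a_data, sigs):
        # Sender side: walk the new file A, emitting ('block', i) when the
        # receiver already holds block i, and ('literal', bytes) otherwise.
        out, lit, pos = [], bytearray(), 0
        while pos < len(a_data):
            window = a_data[pos:pos + BLOCK_SIZE]
            match = None
            for idx, strong in sigs.get(weak_checksum(window), []):
                if strong == strong_checksum(window):  # confirm the weak hit
                    match = idx
                    break
            if match is not None:
                if lit:
                    out.append(('literal', bytes(lit)))
                    lit = bytearray()
                out.append(('block', match))
                pos += BLOCK_SIZE
            else:
                lit.append(a_data[pos])  # no block match: slide one byte
                pos += 1
        if lit:
            out.append(('literal', bytes(lit)))
        return out

With these two pieces, delta(a_data, block_signatures(b_data)) is exactly the short list of instructions that has to cross the slow link; the receiver rebuilds A by replaying it against the blocks of B it already has.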

Eric Raymond famously opined that "Every good work of software starts by scratching a developer's personal itch." That's certainly true in Andrew Tridgell's case: Andrew lives in Australia. He describes Internet connectivity from there as a "very high latency, very low bandwidth link... a typical Internet link, at least if you're in Australia. So, a piece of wet string, a really pathetic link." The latter description comes from a transcript of an absolutely splendid talk Andrew gave at the Ottawa Linux Symposium in 2000.

The rsync algorithm is embodied in the rsync tool, which is available under Unix and Windows (if you're using Cygwin). I've recently been playing with rsync to back up the core files from my home Linux box to another machine. The basic idea is to first make a copy of the files on the other machine, and then run rsync periodically to keep up with any changes.

The most striking example occurred when I uploaded some files to a machine I have access to out on the Internet. I have a DSL connection; upload speeds max out at about 200k bits/sec. (Downloads go at about 1 mbit.) When I did the initial copy, it took me about 6.5 hours to upload 1.5 gigabytes of data. After another day or so, I used rsync to update. I had changed some rather large email files in the interim, and rsync's stats claimed that 33 files had changed. If I'd had to recopy those files, I would have transferred 176 megabytes. But rsync was able to sync up those files by transferring only 2.4 megabytes of data, and did the update in 4 minutes 44 seconds. Wonderful!
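If you want to try the same experiment, the invocation is nothing exotic. The paths below are made up, but the flags are real rsync options: -a preserves permissions, ownership, and timestamps, -z compresses the data on the wire, and --stats prints the transfer summary I quoted above.

    rsync -avz --stats /home/me/mail/ remotehost:/backup/mail/

(For the record, 176 megabytes down to 2.4 megabytes is a factor of roughly 73 less data over the link.)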

If you're interested in these sorts of things, Tridgell's rsync paper is excellent. The talk is great fun as well; Tridgell leads you down some of the blind alleys he went down while trying to develop rsync.