Comparing methods of downloading software from Netlib

Netlib is a repository of mathematical software such as BLAS, LAPACK and, most importantly =), the AMPL Solver Library (ASL). The software can be retrieved over a number of protocols and, as I found out, the download speed can vary greatly. For example, it took me 10 minutes to retrieve the entire ASL by FTP using wget, while other options can reduce this time to tens of seconds. This is why I decided to do this small comparison of different methods of retrieving software from Netlib.

So the slowest way is to retrieve individual files by FTP:

$ time wget --recursive ftp://netlib.org/ampl
...
Total wall clock time: 10m 9s
Downloaded: 880 files, 45M in 2m 26s (317 KB/s)

real 10m8.855s
user 0m0.804s
sys 0m3.704s

It turned out that a much faster method is to use HTTP instead of FTP. This was a complete surprise to me, considering that FTP was designed specifically for file transfer. Maybe it's just an issue with wget? If you have any ideas, please let me know in the comment section below.

$ time wget --recursive --include-directories=ampl \
    http://www.netlib.org/ampl
...
Total wall clock time: 1m 59s
Downloaded: 714 files, 41M in 30s (1.37 MB/s)

real 1m58.986s
user 0m0.476s
sys 0m2.856s

The --include-directories=ampl option ensures that only the contents of the ampl directory are downloaded and that files in other locations referenced from HTML pages are ignored.
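
A roughly equivalent way to restrict the crawl, which I haven't timed here, is wget's --no-parent option, which stops wget from ascending above the starting directory:

$ wget --recursive --no-parent http://www.netlib.org/ampl/

Note the trailing slash: without it wget treats ampl as a file rather than a directory, and --no-parent then restricts the crawl to the whole site root instead.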

However, there is an even faster method, which relies on Netlib's ability to provide whole directories as compressed files:

$ time wget ftp://netlib.org/ampl.tar.gz
...
real 0m49.365s
user 0m0.320s
sys 0m1.908s
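
One thing the timing above leaves out is unpacking the archive, which is a quick, standard tar step:

$ tar xzf ampl.tar.gz

If you don't need the .tar.gz itself, you can also pipe the download straight into tar and skip the intermediate file (untimed here):

$ wget -qO- ftp://netlib.org/ampl.tar.gz | tar xzf -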

Rsync is yet another method, almost as fast as downloading compressed directories by FTP:

$ time rsync -avz netlib.org::netlib/ampl .
...
real 0m53.467s
user 0m0.860s
sys 0m2.512s
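
A nice property of rsync is that re-running the same command later transfers only the files that have changed, so keeping a local copy of the directory up to date is cheap. Adding rsync's standard --delete option also removes local files that no longer exist on the server, turning the copy into an exact mirror (a sketch I haven't timed):

$ rsync -avz --delete netlib.org::netlib/ampl .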

Putting it all together:

The conclusion is simple: using rsync or retrieving compressed directories by FTP are by far the fastest methods of downloading from Netlib. Recursive HTTP download is more than twice as slow, and recursive FTP is painfully slow.

Update: I found a possible explanation for the poor performance of recursive FTP here:

Retrieving a single file from an FTP server involves an unbelievable number of back-and-forth handshaking steps.
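
One way to work around those per-file round trips is to run several FTP transfers in parallel, as the discussion in the comments below suggests. For example, lftp's mirror command can download a directory over multiple concurrent connections; this is just a sketch which I haven't timed against Netlib:

$ lftp -e 'mirror --parallel=4 /ampl ampl; quit' netlib.org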

Comments

Paul Rubin:
I think the link in your update accounts for this. Most of the steps listed there are done once per session, but I had forgotten that each file is transferred on a separate socket, with additional handshakes. That also suggests, though, that if your FTP client and the server can agree on enough simultaneous data channels, FTP might be faster than HTTP.
Victor Zverovich:
I don't think the problem is the FTP service being overloaded, because I tried it at different times with almost identical results. Besides, FTP performs fine when retrieving archives. I think the main issue is the number of round trips required to retrieve each file combined with the large number of files (~900 in the ampl directory). Unfortunately, my FTP client doesn't support recursive mget. I've tried FileZilla (see the updated chart) and it was roughly twice as fast as recursive FTP with wget. I think this is because it uses two concurrent transfers by default. Still much slower than rsync or getting a gzipped tarball by FTP.
Paul Rubin:
I too am surprised by your results. Maybe the FTP service was more heavily loaded than the web service? Also, did you try command-line FTP (with mget) or a client such as FileZilla?

Last modified on 2012-06-25