Maximizing rsync throughput in 2 easy commands

rsync is an excellent tool on Linux for copying files between systems. However, it doesn't yet have the ability to run multiple copy processes in parallel. This means that if you are limited by the speed at which you can read filesystem metadata (ie list the files), or you have a very fast network connection and lots of cores on both servers, you can significantly speed up copying by running several processes in parallel. For example, one process can copy files at perhaps 50MB/sec, but with a 16-core server, a 10gbps network connection and a fast SSD array you can copy data at 1GB/sec (gigabytes). Here's how:

Firstly, you need to get ssh set up so you can connect between the machines without using a password. Even if you are copying between two remote systems and you use ssh-agent key forwarding (which I highly recommend), the agent can become a significant bottleneck when many connections are opened at once, so it's best to generate a new dedicated key on the source system:

ssh-keygen -f rsync

Hit enter when it prompts for a passphrase so that the key is generated without a password protecting it. This will create two files: rsync, which is your private key, and rsync.pub, which you want to add to the authorized_keys file on the remote host using something like:

ssh-copy-id -i rsync.pub user@remote_host

You should then be able to ssh without needing a password by doing:

ssh -i rsync user@remote_host

Next, we need to go to the remote host and allow lots of ssh sessions to be opened at once; open up /etc/ssh/sshd_config on remote_host and append or change these lines:

MaxSessions 100
MaxStartups 100
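
These limits only take effect once sshd re-reads its configuration, so reload it after saving; on most systemd-based distributions that's something like:

sudo systemctl reload sshd

(The service may be called ssh rather than sshd on Debian and Ubuntu.)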

Now, on your source host, export the following environment variable to ensure that rsync uses the ssh key you just created (note the export; without it, child processes such as rsync won't see the variable):

export RSYNC_RSH="ssh -i rsync"
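
If you'd rather not rely on an environment variable, the same thing can be passed per invocation with rsync's -e (a.k.a. --rsh) option; for example, any of the rsync commands below could be written as:

rsync -e "ssh -i rsync" -za /local/path/ user@remote_host:/remote/path/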

Now for the good stuff: first we need to mirror the directory structure but not the files (the --include='*/' --exclude='*' pair matches every directory while skipping everything else):

rsync -za --include='*/' --exclude='*' /local/path/ user@remote_host:/remote/path/
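
If you want to sanity-check the include/exclude trick first, rsync's dry-run flag (-n) combined with -v lists what would be created without touching anything:

rsync -zanv --include='*/' --exclude='*' /local/path/ user@remote_host:/remote/path/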

And now we can do the copy in parallel (you might need to install GNU parallel first, using something like apt-get install parallel):

cd /local/path/; find -L . -type f | parallel -j 30 rsync -za {} user@remote_host:/remote/path/{}

# To exclude some files append as many of these as you want to the find command: \! -name file_to_exclude

This will copy 30 files in parallel, using compression. Experiment with the exact number, but 1.5 times the number of cores in your boxes should be enough. You can monitor the disk bandwidth with iostat -mx or the network throughput with a tool like iptraf. One of those, or the CPU usage, should now be saturated, and your copy should be going as fast as is physically possible. You can re-run this afterwards to synchronise even quicker than a normal rsync, however you won't be able to use it to delete files (each rsync invocation only sees a single file, so options like --delete can't work).
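
If you'd rather derive the job count from the hardware instead of hard-coding 30, here's a sketch using nproc and the 1.5x-cores rule of thumb above:

cd /local/path/; find -L . -type f | parallel -j $(( $(nproc) * 3 / 2 )) rsync -za {} user@remote_host:/remote/path/{}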

4 thoughts on “Maximizing rsync throughput in 2 easy commands”

  1. I tested both versions shown here:
    1) date >>startime ; find -L . -type f | parallel -j 30 rsync -za {} /target/{} ; date >>startime

    as well as :
    2) date >>startime ; find -L * -type f | parallel -j 10 rsync -akrxvHR {} /target/ ; date >>startime

    Both versions took significantly longer than just a straight “single threaded” rsync -a:
    3) date >>startime ; rsync -a /source/ /target ; date >>startime

    Looks to me like the issue is that each file is handled individually by a new fork of rsync, which is slower than having rsync work through the list it builds when running "normally".

    Results:
    (test volume: 910 MB / 6332 files)
    1)
    Mon Mar 30 14:52:29 EDT 2020
    Mon Mar 30 14:54:23 EDT 2020
    2)
    Mon Mar 30 14:55:38 EDT 2020
    Mon Mar 30 14:57:22 EDT 2020
    3)
    Mon Mar 30 14:58:51 EDT 2020
    Mon Mar 30 14:59:04 EDT 2020

    1. For doing this within the same filesystem, especially if it's all on a single disk as you seem to be doing, I wouldn't expect any performance improvement.

      The places we see the most speed-up is:
      1) high-latency, high-throughput network links (eg you cannot max out 100gbps with a single TCP stream handled by one core)
      2) large RAID arrays where there are many underlying disks which can be accessed in parallel

      I would also probably not do it with -type f, but rather set each rsync process going with -r on a different subdirectory, using find args like -maxdepth 2 -mindepth 2 -type d.
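
      A minimal sketch of that per-directory approach, assuming the same paths as in the article (-R makes rsync preserve each directory's relative path under the destination):

      cd /local/path/; find . -mindepth 2 -maxdepth 2 -type d | parallel -j 16 rsync -zaR {} user@remote_host:/remote/path/

      Note that files sitting above that depth won't be picked up, so a final plain rsync pass is still worthwhile.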
