rsync is an excellent tool for copying files between Linux systems. However, it doesn’t yet have the ability to run multiple copy processes in parallel. That means that if you are limited by the speed at which you can read filesystem metadata (i.e. list the files), or you have a very fast network connection and lots of cores on both servers, you can significantly speed up copying files by running several rsync processes in parallel. For example, one process can copy files at perhaps 50MB/sec, whereas with a 16-core server, a 10Gbps network connection and a fast SSD array you can copy data at 1GB/sec (gigabytes). Here’s how:
Firstly, you need to set up ssh so you can connect between the machines without a password. Even if you are copying between two remote systems and you use ssh-agent key forwarding (which I highly recommend), the agent can become a significant bottleneck, so it’s best to generate a new key on the source system:
ssh-keygen -f rsync
Hit enter when it prompts for a passphrase so that the key is generated without needing a password to open it. This will create two files: rsync, which is your private key, and rsync.pub, which you want to add to your authorized keys on the remote host using something like:
ssh-copy-id -i rsync.pub user@remote_host
You should then be able to ssh without needing a password by doing:
ssh -i rsync user@remote_host
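If you would rather script the key generation than press enter at the prompt, something like the following should work. This is only a sketch: -N '' asks ssh-keygen for an empty passphrase and -q keeps it quiet, while the scratch directory and the ed25519 key type are my own assumptions rather than anything the steps above require.

```shell
# Scratch directory so the example doesn't touch your real keys (assumption).
keydir=$(mktemp -d)

# -N '' sets an empty passphrase, so there is no prompt to answer.
# -t ed25519 is an assumed key type; drop it to use your system default.
ssh-keygen -q -t ed25519 -N '' -f "$keydir/rsync"

# Two files should now exist: the private key and the public key.
ls "$keydir"
```

Point -f at wherever you actually want the key to live once you are happy with it.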
Next, we need to go to the remote host and allow lots of ssh sessions to be opened at once. Open up /etc/ssh/sshd_config on remote_host and append or change these lines, then restart sshd so that the new limits take effect:
MaxSessions 100
MaxStartups 100
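Now, on your source host, run the following command from the directory containing the key to ensure that rsync uses the ssh key you just created. The variable has to be exported so that the rsync processes we launch can see it, and the key path has to be absolute because we change directory before copying:
export RSYNC_RSH="ssh -i $PWD/rsync"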
Now for the good stuff – first we need to mirror the directory structure but not the files:
rsync -za --include='*/' --exclude='*' /local/path/ user@remote_host:/remote/path/
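To convince yourself that those two filters really copy directories and nothing else, you can try them on a throwaway tree first. The sketch below is purely local, and the directory and file names are made up for the demonstration.

```shell
# Throwaway source and destination trees (hypothetical names).
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/a/b" "$src/c"
touch "$src/top.txt" "$src/a/one.txt" "$src/a/b/two.txt"

# Same filters as above: include every directory, exclude everything else.
rsync -a --include='*/' --exclude='*' "$src/" "$dst/"

# Lists what arrived: the mirrored directories, with no regular files.
find "$dst" -mindepth 1 | sort
```

The include rule has to come before the exclude rule, because rsync applies the first filter that matches each path.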
And now we can do the copy in parallel (you might need to install the parallel command using something like apt-get install parallel):
cd /local/path/ && find -L . -type f | parallel -j 30 rsync -za {} user@remote_host:/remote/path/{}
# To exclude some files, append as many of these as you want to the find command: \! -name file_to_exclude
This will copy 30 files in parallel, using compression. Play with the exact number, but 1.5 times the number of cores in your boxes should be enough. You can monitor the disk bandwidth with iostat -mx or the network throughput with a tool like iptraf. One of those, or the CPU usage, should now be saturated, and your copy should be going as fast as is physically possible. You can re-run this afterwards to synchronise even quicker than a normal rsync; however, you won’t be able to use it to delete files.
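Two refinements you might find useful, offered as a sketch rather than a drop-in replacement: working out the job count from the number of cores, and passing file names NUL-delimited so that names containing spaces survive. I've used xargs -P here instead of parallel purely because it is usually already installed. The echo in front of rsync makes this a dry run that only prints the commands, and the scratch directory and file names are made up.

```shell
# Roughly 1.5 jobs per core, with a floor of one job.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
jobs=$(( cores * 3 / 2 ))
if [ "$jobs" -lt 1 ]; then jobs=1; fi

# Scratch tree standing in for /local/path (hypothetical contents).
src=$(mktemp -d)
mkdir -p "$src/a/b"
touch "$src/top.txt" "$src/a/one file.txt" "$src/a/b/two.txt"
plan="$src.plan"   # kept outside the tree so find doesn't list it

# -print0 and -0 delimit names with NUL bytes; -P runs $jobs commands at once.
# Remove "echo" to run rsync for real instead of printing the commands.
cd "$src" && find -L . -type f -print0 |
  xargs -0 -P "$jobs" -I{} echo rsync -za {} "user@remote_host:/remote/path/{}" |
  sort > "$plan"

cat "$plan"
```

Once the printed commands look right, swap the scratch tree for your real source directory and drop the echo.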