Karol Stasiak's Blog

Coding from an elevator.

Incremental synchronization of ZIP files in bandwidth-constrained environments

| Comments

Suppose you have two ZIP archives on two different machines and you want to synchronize their contents with minimal network traffic, without much regard for the CPU usage, and you know that the target machine already contains the older version of the archive, which is mostly identical to the new version. For example, you have to upload a fat JAR with the new build over a metered connection, and the server literally sits in a garage at the other end of the country.

If the archive is small, uploading it in its entirety looks simple enough, but if it grows into dozens of megabytes, of which 90% is non-changing 3rd party code, it quickly becomes a waste of both time and bandwidth.

Time is money. And depending on your ISP, bandwidth can be money too.

Long story short, I’m going to assume that the target machine runs Linux, and the source machine runs either Windows or Linux. I am going to use rsync and fuse-zip.

Why those two? Because fuse-zip can convert a ZIP archive into a writeable filesystem, and rsync can sync a writeable filesystem with another. Details below.

Task 1 – Mount the archives

Create a mountpoint on the target machine, I’ll use /tmp/z. Then mount your archive using fuse-zip:

$ sudo fuse-zip -o allow_other,rw,uid=`id -u`,gid=`id -g` /stuff/target.zip /tmp/z

To test if it works, you create a file in the mounted filesystem. It would appear in the ZIP file after you unmount it.

$ echo TEST > /tmp/z/TEST_FILE_PLEASE_IGNORE
$ sudo umount /tmp/z
$ unzip -l /stuff/target.zip | grep TEST_FILE_PLEASE_IGNORE

Do not modify the ZIP archive while it’s being mounted and do not interrupt the unmounting process if the archive was mounted for writing, as this can corrupt it.

If your source machine runs Linux, you can do the same, but mount the archive read-only:

$ sudo fuse-zip -o allow_other,ro,uid=`id -u`,gid=`id -g` /stuff/source.zip /tmp/z

You should experiment a bit if sudo and usd/gid are necessary.

If your souce machine runs Windows, you can use Pismo File Mount to mount the archive as a drive and assign it a letter. It’s a GUI program, I haven’t found a scriptable alternative yet.

Task 2 – Synchronize the mounted filesystems

On the source machine:

$ rsync --delete -zrucv -e ssh /tmp/z/ remoteuser@targethost:/tmp/z/

Explanation for the parameters:

  • -c so that the files will be compared using checksums instead of modification dates. We do not care about nor trust the dates here, what matters is content.

  • -z for the compression. Do not use the flag if the files themselves won’t compress well.

  • -r for recursive synchronization.

  • -u so only modified files are uploaded.

  • --delete so the files that are absent in the source archive will be deleted from the target archive. We want the target archive to have the same content as the source one, not more.

  • -v stands for verbose. Used once, makes rsync list all the files as they are being sent. You can skip this option, or repeat it several times, depending on your preference.

  • -e ssh picks the shell used to synchronize the files. SSH is the recommended one for most cases.

Note that the paths end with slashes, so rsync synchronizes the contents of the directories, not the directories themselves.

If the source mashine is running Windows, you can use Cygwinhttps://www.cygwin.com/, just install appropriate packages (you’ll need rsync and most likely also openssh). If you used Pismo File Mount to mount the ZIP to e.g. J: drive, use /cygdrive/j/ as your source directory.

Task 3 – Unmount the filesystems

On Linux (both source and target machine), do

$ sudo umount /tmp/z

You have to unmount the target archive so the changes take place, and the source archive so the next time you mount it, it would be current again.

Putting it all together

Here is a small script that can be used to synchronize two ZIP archives:

#!/bin/bash

# put all the necessary values here
REMOTEUSER=...
REMOTEHOST=...
SOURCEZIP=...
TARGETZIP=...

OS=`uname -o`
if [ "$OS" = Cygwin ] ; then
	# depends on where you mount it; assuming here the J: drive
	P=/cygdrive/j
else
	P=/tmp/z
	mkdir -p "$P"
	sudo umount "$P"
	sudo fuse-zip -o allow_other,uid=`id -u`,gid=`id -g`,ro "$SOURCEZIP" "$P"
fi

ssh $REMOTEUSER@$REMOTEHOST sudo jar_mount_rw
rsync --delete -zrucv -e "ssh" "$P/" $REMOTEUSER@$REMOTEHOST:/tmp/z/
ssh $REMOTEUSER@$REMOTEHOST sudo umount /tmp/z

if [ "$OS" = Cygwin ] ; then
	echo 'Finished. Unmount the J: drive.'
else
	PID="`pgrep fuse-zip`"
	if [ -z "$PID" ] ; then
		exit
	else
		sudo umount "$P"
	fi
fi

It requires this script to be present as jar_mount_rw on the target machine:

#!/bin/bash

# put all the necessary values here
TARGETZIP=...

umount /tmp/z
mkdir -p /tmp/z
fuse-zip -o allow_other,rw,uid=`id -u`,gid=`id -g` "$TARGETZIP" /tmp/z

Results

I don’t have any hard data, but the overall effect is that a 40 MB JAR archive with only few files modified in it gets synchronized in few seconds, while it could take several minutes on slower connections to upload it whole.

Comments