Rsync: Combining Includes and Excludes
I back up my computer to a local disk (well, several redundant local disks)
using rsync. (I don't particularly trust cloud providers,
and in any case our internet connection is very slow, especially for upload,
so waiting hours while the entire contents of my disk upload isn't appealing.)
To save space and time, I have a script that includes a list of files and directories I don't need to back up: browser cache directories, object files, build directories, generated files like thumbnails, large video files, downloaded source, and so on.
I also have a list of files I do want to back up even though they'd otherwise be excluded. For instance, I sometimes have local changes in my GIMP source directory, outsrc/gimp-master/gimp/, even though most of outsrc doesn't need to be backed up. Or /blog/tags/build in my local mirror of the shallowsky website, even though I have a rule that says directories named build shouldn't usually be backed up.
I've been using rsync's --include and --exclude to handle this.
But I discovered yesterday that I'd been using them wrong, and some
things I thought were getting backed up, weren't.
It took some reading and experimenting before I figured out how
these rsync flags actually work — which doesn't seem to be
well explained anywhere.
First Rule Wins
Let's start with rule number one: in a long list of rsync --include and --exclude rules, the first rule wins. That's stated in the manual and also quoted in most pages that come up in a web search. But that's not quite as simple as it sounds.
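To make "first rule wins" concrete, here's a minimal sketch of the idea in Python. It is not rsync's real matcher: fnmatch stands in for rsync's pattern syntax (the two differ, for example in how * treats slashes), and first_match and the file names are made up for illustration.

```python
from fnmatch import fnmatch

def first_match(path, rules):
    """Return the action ('+' include, '-' exclude) of the first matching rule."""
    for action, pattern in rules:
        if fnmatch(path, pattern):
            return action
    # When no rule matches at all, rsync includes the file.
    return "+"

# Hypothetical rules: keep one particular object file, skip all the others.
rules = [("+", "notes.o"), ("-", "*.o")]

for name in ["notes.o", "main.o", "main.c"]:
    print(name, first_match(name, rules))
```

Swapping the two rules would exclude notes.o as well, because *.o would then be the first pattern to match it.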
Excludes
Excludes are easy. "--exclude=*.o" skips any file whose extension is .o.
--exclude=outsrc will skip the directory outsrc,
which means rsync will never see any file under that directory.
I didn't have any misunderstanding there.
Well, maybe one. I was using --exclude pattern rather
than the syntax the manual and most web discussions now use,
--exclude=pattern. The form I was using seemed to be
working ... maybe it's an older form ...
but in my experimenting I found a few cases where it didn't
work while the other form, with the equals sign, did work.
I didn't pursue this to figure out when it makes a difference or why; I
just rewrote everything to use the preferred equals form throughout my script.
Includes
Includes are a lot trickier.
Let's say I'm excluding outsrc, but I do want to back up
outsrc/hexchat to preserve the local changes I made (hexchat
has some key bindings hard-wired that can only be changed by recompiling
the source). First rule wins, right? So I just need this:
--include=outsrc/hexchat --exclude=outsrc/
Nope, that doesn't work: nothing is copied. Rsync tests each directory against the rules as it walks down the tree, so it checks outsrc before it sees anything inside. The include for outsrc/hexchat doesn't match outsrc itself, the exclude does, and rsync skips the whole directory without ever reaching hexchat or outsrc/hexchat/COPYING.
But rsync has a special pattern, ***, for "this directory and
everything under it." So all we need is this, right?
--include=outsrc/hexchat/*** --exclude=outsrc/
Nope again: nothing is copied. It seems to me from what I've read that
this should work, but the first thing rsync tests is still outsrc itself,
and the only rule that matches it is the exclude. In practice, rsync
needs --include rules for every component of the path:
--include=outsrc/ --include=outsrc/hexchat/*** --exclude=outsrc/
Except it doesn't make any sense to have both
--include=outsrc/ and --exclude=outsrc/,
does it? Now outsrc/ won't actually be excluded any more,
because it's explicitly included in an earlier rule.
What actually worked was:
--include=outsrc/ --include=outsrc/hexchat/*** --exclude=outsrc/***
Rsync is allowed to descend into directory outsrc/ (the trailing slash
only limits the rule to matching directories, so it isn't essential here),
but it's not allowed to copy any files or directories under outsrc,
except that outsrc/hexchat/ and every file inside it are allowed
(unless excluded by some other pattern, like --exclude="*.o").
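The rule sets above can be compared with a small simulation. This is only a toy model under stated assumptions: patterns are treated as literal paths anchored at the top of the transfer, the only special forms handled are a trailing / and a trailing /***, and the tree, matches and copied are hypothetical names, not anything rsync provides.

```python
def matches(pattern, path, is_dir):
    """Toy matcher: literal paths, plus trailing '/' and trailing '/***'."""
    if pattern.endswith("/***"):
        base = pattern[:-4]
        return path == base or path.startswith(base + "/")
    if pattern.endswith("/"):
        return is_dir and path == pattern[:-1]
    return path == pattern

def copied(tree, rules):
    """Walk the tree top-down; children of an excluded directory are never tested."""
    kept = set()
    for path, is_dir in sorted(tree):
        parent = path.rpartition("/")[0]
        if parent and parent not in kept:
            continue  # the parent was excluded, so nothing below it is visited
        action = next((a for a, p in rules if matches(p, path, is_dir)), "+")
        if action == "+":
            kept.add(path)
    return sorted(kept)

# A made-up tree of (path, is_directory) pairs.
tree = [
    ("notes.txt", False),
    ("outsrc", True),
    ("outsrc/hexchat", True),
    ("outsrc/hexchat/COPYING", False),
    ("outsrc/other", True),
    ("outsrc/other/file.c", False),
]

attempts = {
    "include dir only": [("+", "outsrc/hexchat"), ("-", "outsrc/")],
    "include with ***": [("+", "outsrc/hexchat/***"), ("-", "outsrc/")],
    "parent included too": [("+", "outsrc/"), ("+", "outsrc/hexchat/***"),
                            ("-", "outsrc/***")],
}
for name, rules in attempts.items():
    print(name, "->", copied(tree, rules))
```

In this model the first two attempts keep only notes.txt, because outsrc is excluded before anything inside it is tested. Only the third keeps outsrc/hexchat and its contents while dropping outsrc/other.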
If you have a deeper hierarchy, like if you want to match
outsrc/gimp-master/gimp,
you need an --include for every level:
--include=outsrc/ --include=outsrc/gimp-master/ --include=outsrc/gimp-master/gimp/***
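Since it's easy to get a rule list like this subtly wrong, one way to check it before trusting it with a real backup is rsync's --dry-run option, which lists what would be transferred without copying anything. Here's a sketch that just assembles such a command. The helper name dry_run_command and the source and destination paths are placeholders, and the final exclude is the outsrc/*** rule from earlier.

```python
import shlex

def dry_run_command(rules, source, dest):
    """Build an rsync command that only reports what it would copy."""
    return ["rsync", "-av", "--dry-run", *rules, source, dest]

rules = [
    "--include=outsrc/",
    "--include=outsrc/gimp-master/",
    "--include=outsrc/gimp-master/gimp/***",
    "--exclude=outsrc/***",
]

# Placeholder paths; substitute a real source and a scratch destination.
cmd = dry_run_command(rules, "/home/me/", "/tmp/rsync-test/")

# Print a shell-quoted version that can be pasted into a terminal,
# or pass cmd to subprocess.run() to execute it directly.
print(shlex.join(cmd))
```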
Python can help with that
Since I don't want to have to go through writing all those rules every time I add an included directory to my backup script, I wanted the backup script to be able to take a directory and autogenerate the include rules.
Except that turned out to be hard with a shell script, so I followed my
rule of "if you've been struggling with a shell script for more than half
an hour, it's time to rewrite it into Python".
Assuming includes and excludes are lists of file or directory paths,
here's code to generate a set of rsync flags:
import os
import subprocess
import time

cludesflags = []
included = set()

# Generate include rules for each path component in each path
for inc in includes:
    ipath = ""
    # Get a version of inc without leading or trailing slashes
    stripinc = inc.strip('/')
    for component in stripinc.split('/'):
        if not component:
            continue
        if ipath:
            ipath = '/'.join([ipath, component])
        else:
            ipath = component
        if ipath == stripinc:
            # Last component: include the directory and everything under it
            cludesflags.append(f"--include={ipath}/***")
            break
        elif ipath in included:
            # This parent directory already has an include rule
            continue
        else:
            included.add(ipath)
            cludesflags.append(f"--include={ipath}/")

# Excludes list is much simpler
# (fullexcludes isn't defined in this excerpt; it's assumed to be the
# final list of exclude patterns built from excludes)
for ex in fullexcludes:
    cludesflags.append(f"--exclude={ex}")

[ ... ]

rsyncargs = ["sudo", "rsync", "-av", "--delete", "--delete-excluded",
             *cludesflags, os.getenv("HOME"), backupdest]
print(rsyncargs)
time.sleep(3)
subprocess.call(rsyncargs)
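To sanity-check a generator like this without running a backup, the include half can be pulled into a function and fed the paths used in this article. This is a compact restatement rather than the script itself, and include_flags is a name invented for this sketch.

```python
def include_flags(includes):
    """Return rsync --include flags covering every parent of each path."""
    flags = []
    seen = set()
    for inc in includes:
        parts = [p for p in inc.strip("/").split("/") if p]
        if not parts:
            continue
        # One plain include per parent directory, emitted only once.
        for depth in range(1, len(parts)):
            parent = "/".join(parts[:depth])
            if parent not in seen:
                seen.add(parent)
                flags.append(f"--include={parent}/")
        # The target itself, plus everything beneath it.
        full = "/".join(parts)
        flags.append(f"--include={full}/***")
    return flags

flags = include_flags(["outsrc/hexchat", "outsrc/gimp-master/gimp/"])
for flag in flags:
    print(flag)
```

For these two paths it prints --include=outsrc/ once, then the hexchat/*** rule, then --include=outsrc/gimp-master/ and the gimp/*** rule. All of them still need to come before the excludes on the rsync command line.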
[ 16:11 Mar 19, 2023 | linux/cmdline ]