Rsync: Combining Includes and Excludes (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sun, 19 Mar 2023

Rsync: Combining Includes and Excludes

I back up my computer to a local disk (well, several redundant local disks) using rsync. (I don't particularly trust cloud providers, and in any case our internet connection is very slow, especially for upload, so waiting hours while the entire contents of my disk uploads isn't appealing.)

To save space and time, I have a script that includes a list of files and directories I don't need to back up: browser cache directories, object files, build directories, generated files like thumbnails, large video files, downloaded source, and so on.

I also have a list of files I do want to back up even though they'd otherwise be excluded. For instance, I sometimes have local changes in my GIMP source directory, outsrc/gimp-master/gimp/, even though most of outsrc doesn't need to be backed up. Or /blog/tags/build in my local mirror of the shallowsky website, even though I have a rule that says directories named build shouldn't usually be backed up.

I've been using rsync's --include and --exclude to handle this. But I discovered yesterday that I'd been using them wrong, and some things I thought were getting backed up, weren't. It took some reading and experimenting before I figured out how these rsync flags actually work — which doesn't seem to be well explained anywhere.

First Rule Wins

Let's start with rule number one: in a long list of rsync --include and --exclude rules, the first rule wins. That's stated in the manual and also quoted in most pages that come up in a web search. But that's not quite as simple as it sounds.

Excludes

Excludes are easy. "--exclude=*.o" skips any file whose extension is .o. --exclude=outsrc will skip the directory outsrc, which means rsync will never see any file under that directory. I didn't have any misunderstanding there.

Well, maybe one. I was using --exclude pattern rather than the syntax the manual and most web discussions now use, --exclude=pattern. The form I was using seemed to be working ... maybe it's an older form ... but in my experimenting I found a few cases where it didn't work while the other form, with the equals sign, did work. I didn't pursue this to figure out when it makes a difference or why; I just rewrote everything to use the preferred equals form throughout my script.

Includes

Includes are a lot trickier. Let's say I'm excluding outsrc, but I do want to back up outsrc/hexchat to preserve the local changes I made (hexchat has some key bindings hard-wired that can only be changed by recompiling the source). First rule wins, right? So I just need this: --include=outsrc/hexchat --exclude=outsrc/

Nope, that doesn't work: nothing is copied. Rsync checks each directory against the rules before descending into it, starting from the top. The include rule names outsrc/hexchat, not outsrc, so the first rule that matches outsrc is the exclude, and rsync never gets as far as hexchat or outsrc/hexchat/COPYING.

But rsync has a special pattern, ***, for "this directory and everything under it." So all we need is --include=outsrc/hexchat/*** --exclude=outsrc/ , right?

Nope again: nothing is copied. From what I'd read, it seemed that this should work, but outsrc/ itself doesn't match outsrc/hexchat/***, so the exclude is still the first rule that matches it, and rsync skips the directory without ever looking inside. In practice, rsync needs --include rules for every component of the path: --include=outsrc/ --include=outsrc/hexchat/*** --exclude=outsrc/.

Except it doesn't make any sense to have both --include=outsrc/ and --exclude=outsrc/, does it? Now outsrc/ won't actually be excluded any more, because it's explicitly included in an earlier rule.

What actually worked was:

--include=outsrc/ --include=outsrc/hexchat/*** --exclude=outsrc/***

Rsync is allowed to descend into directory outsrc/ (I don't think it's important whether or not the trailing slash is there), but it's not allowed to copy any files or directories under outsrc, except that outsrc/hexchat/ and every file inside it are allowed (unless excluded by some other pattern, like --exclude="*.o").
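
Here's a toy model in Python of the behavior described above: rules are tried in order, the first match decides, and a directory has to be included before anything under it is even considered. This is only a sketch, not rsync's real matcher. It assumes patterns are anchored at the top of the transfer and it understands just the pattern shapes used in this post (a plain path, dir/ and dir/***). The function names are made up for illustration.

```python
def matches(pattern, path, is_dir):
    """Toy matcher: handles only 'a/b', 'a/b/' and 'a/b/***'."""
    if pattern.endswith("/***"):
        base = pattern[:-4]
        # The directory itself, or anything underneath it
        return (is_dir and path == base) or path.startswith(base + "/")
    if pattern.endswith("/"):
        # A trailing slash matches directories only
        return is_dir and path == pattern[:-1]
    return path == pattern


def decide(rules, path, is_dir):
    """First matching rule wins; with no match, the path is included."""
    for action, pattern in rules:
        if matches(pattern, path, is_dir):
            return action == "include"
    return True


def is_transferred(rules, filepath):
    """Walk down the path; an excluded parent hides everything below it."""
    parts = filepath.split("/")
    for depth in range(1, len(parts)):
        if not decide(rules, "/".join(parts[:depth]), is_dir=True):
            return False
    return decide(rules, filepath, is_dir=False)


attempt1 = [("include", "outsrc/hexchat"), ("exclude", "outsrc/")]
attempt2 = [("include", "outsrc/hexchat/***"), ("exclude", "outsrc/")]
attempt3 = [("include", "outsrc/"), ("include", "outsrc/hexchat/***"),
            ("exclude", "outsrc/")]
working = [("include", "outsrc/"), ("include", "outsrc/hexchat/***"),
           ("exclude", "outsrc/***")]

wanted = "outsrc/hexchat/COPYING"
unwanted = "outsrc/other/file.c"

print(is_transferred(attempt1, wanted))    # False: outsrc/ is excluded first
print(is_transferred(attempt2, wanted))    # False: same reason
print(is_transferred(attempt3, unwanted))  # True: outsrc is no longer excluded
print(is_transferred(working, wanted))     # True
print(is_transferred(working, unwanted))   # False
```

In this model the first two attempts fail for the same reason (the parent directory is excluded before hexchat is ever visited), and the third copies everything under outsrc, because the only exclude matches a directory that an earlier rule has already included.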

If you have a deeper hierarchy, like if you want to match outsrc/gimp-master/gimp, you need an --include for every level: --include=outsrc/ --include=outsrc/gimp-master/ --include=outsrc/gimp-master/gimp/***

Python can help with that

Since I don't want to have to go through writing all those rules every time I add an included directory to my backup script, I wanted the backup script to be able to take a directory and autogenerate the include rules.

Except that turned out to be hard with a shell script, so I followed my rule of "if you've been struggling with a shell script for more than half an hour, it's time to rewrite it into Python". Assuming includes and excludes are lists of file or directory paths, here's code to generate a set of rsync flags:

import os
import subprocess
import time

cludesflags = []
included = set()

# Generate include rules for each path component in each path
for inc in includes:
    ipath = ""
    # Get a version of inc without leading or trailing slashes
    stripinc = inc.strip('/')
    for component in stripinc.split('/'):
        if not component:
            continue
        if ipath:
            ipath = '/'.join([ipath, component])
        else:
            ipath = component
        if ipath == stripinc:
            cludesflags.append(f"--include={ipath}/***")
            break
        elif ipath in included:
            continue
        else:
            included.add(ipath)
            cludesflags.append(f"--include={ipath}/")

# Excludes list is much simpler
for ex in excludes:
    cludesflags.append(f"--exclude={ex}")

[ ... ]

rsyncargs = ["sudo", "rsync", "-av", "--delete", "--delete-excluded",
             *cludesflags, os.getenv("HOME"), backupdest]
print(rsyncargs)
time.sleep(3)
subprocess.call(rsyncargs)
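
To check the generated rules without running rsync, the include loop can be wrapped in a function and printed. This is a sketch of the same idea rather than the exact code from my script: the function name include_flags and the sample paths are just for illustration.

```python
def include_flags(includes):
    """Return rsync --include flags covering every level of each path."""
    flags = []
    seen = set()    # parent directories that already have a rule
    for inc in includes:
        # Split into components, ignoring leading/trailing/double slashes
        parts = [p for p in inc.split("/") if p]
        if not parts:
            continue
        # One "dir/" rule per parent level, so rsync can descend into it
        for depth in range(1, len(parts)):
            parent = "/".join(parts[:depth])
            if parent not in seen:
                seen.add(parent)
                flags.append(f"--include={parent}/")
        # The target itself and everything under it
        full = "/".join(parts)
        flags.append(f"--include={full}/***")
    return flags


for flag in include_flags(["outsrc/gimp-master/gimp", "outsrc/hexchat/"]):
    print(flag)
# --include=outsrc/
# --include=outsrc/gimp-master/
# --include=outsrc/gimp-master/gimp/***
# --include=outsrc/hexchat/***
```

These flags have to come before the excludes on the command line, and the matching exclude needs to be outsrc/*** rather than outsrc/, for the reasons worked out above.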

[ 16:11 Mar 19, 2023    More linux/cmdline | permalink to this entry | ]
