thewisenerd's blog – the order of files in your ext4 filesystem does not matter

the title is a cheeky reference to something at the front page of the orange site today¹.

i don’t want to be misleading here; glob order in bash is “alphanumeric”-ish.

this is more about documenting a wierd bug we encountered recently after a node image patch update, which in-turn caused a multi-hour outage since we could not get ahead of it in time.

the wildcard here, is not a glob, since the thing is not running in a bash shell. the actual argument value the JVM receives is "/jars/*", and in turn decides to be helpful, and expand the wildcard anyway².

tl;dr

the rest of the blog is about how i went about trying to figure out how we got here.

1. red herring - buildah squashing layers by default

with a recent overhaul a couple years ago of our CI/CD setup (from Jenkins to GHA), and a couple other reasons, we switched from docker to buildah for building the container images, and we noticed that some of the buildah built images would not start up.

we were copying the files into the container image in a specific order, to possibly save bandwidth with ‘shared’ layers. this involved copying jars in a specific order of “volatility” (bottom-to-top);

the intent was that layers 3 and 4 would “rarely” change, and surely this should help a bunch with bandwidth given the SHAs would be consistent.

upon investigation, we identified that a critical configuration file was being read from a jar that was not “specific project jar”, and this in-turn was causing the application startup to misbehave.

as an immediate fix, the corresponding fix was made, adding --layers to the buildah bud step.

little did we realize that this didn’t do much; but it seemed to fix the issue of the configuration file being picked up from the incorrect jar; and we went our merry way.

2. red herring - overlayfs layer order

this was an understanding gap, where i had always been under the impression that a readdir over an overlayfs, then the iteration order would follow the order of the overlayfs directory stacking order.

sidenote: containers use overlayfs and basically “stack” the image layers on top of one-another; and they use something called “white-outs” to handle deletion. one of the reasons why, if there’s a lot of white-outs, it’s terrible from a perf standpoint, and squashing layers is supposedly better for performance, or something, idk.

performing the most basic of tests, made me realize this is not the case. the only guarantee is that an “upper” layer’s files will override the “lower” layer’s files, but nothing about the iteration order of overlayfs layers.

uname -r
# 6.1.0-18-amd64

mkdir l0 l1 l2 work merged
for d in l0 l1 l2; do for file in $(seq 10 12); do touch $d/$d-$file; done; done
sudo mount -t overlay overlay -o lowerdir=./l0:./l1,upperdir=./l2,workdir=./work ./merged/
ls -1U ./merged/ # list unsorted, basically think readdir

my previous understanding, or at least the understanding as per the “fix” we put in the previous section, was that the order expected here is l2,l1,l0,

safe to say, while overlayfs guarantees that upper directories’ files shall overwrite the lower directories files, it does not guarantee that the directory traversal order shall be the same way.

around here, I validated that a segment of ls -1U on the merged “overlayfs” folder matched the ls -1U order on a “lower” directory on the underlying ” ext4”, and decided to focus my efforts on figuring out what was happening there.

3. getting side-tracked with layer extraction

thinking that the layer extraction logic could’ve changed (why??), i got to trying to get the exact tar blobs from the image.

fetching the tar blob, and replicating the containerd layer extraction logic⁴ (just the golang native "archive/tar" bits), the subsequent ls -1U output was basically the same, when running on the same nodepool.

was the inode order different? nope. sequential inodes, per order in the tar archive.

4. the oh “f-(sync)” moment

thinking, what if fsync is re-ordering blocks when flushing unwritten blocks to disk, i enabled tracing but the logs were too much.

echo 1 > /sys/kernel/debug/tracing/events/ext4/enable
cat /sys/kernel/debug/tracing/trace # too noisy,

all i had were log lines, which i did not understand, and i was not able to effectively filter the log lines for my specific operations (because shared root disk). not in the mood to sit and figure out ebpf, i decided to create a loopback device and | grep for that specific loopback device instead.

cat /sys/kernel/debug/tracing/trace | grep -v 'dev 8,1'

running the same tar extraction golang program inside the loopback device, the ls -1U order turned out different. wtf. re-ran the extraction in another folder, the order was the same;

5. hex-editing block image files

with debugfs disk.img, and running stats, there were just two possibly changing parameters, the “Filesystem UUID”, and a “Directory Hash Seed”.

the file system UUID could be easily specified at mkfs.ext4 with the -U parameter,

mkfs.ext4 -U {uuid} disk2.img

but alas, running the tar extraction test, on two ext4 partitions with the same UUID, still had different ls -1U order.

so, deciding to go after the “Directory Hash Seed” next, I realized there was no easy way to set this parameter with mkfs.ext4, so finding the offset of the directory hash seed and hex-editing it in the block file was “the only way forward”.

this was “accomplished” with a pretty dumb combination of xxd, grep -ob and printf | dd.

the ext4 header blocks also have a crc checksum, which debugfs cribs about; but it also gives you the “expected” value, so removing that “error” is just another hex edit away.

mounting the newly modified disk image, and re-running the tar extraction test, the order matched!

6. closing thoughts?

I had perused the ext4 readdir implementation⁵ somewhere when dealing with overlayfs delegating readdir to the underlying filesystem, but reading is reading, and reading is lossy.

ext4 has this thingy called “h-tree indexing” and that is something that needs to be specifically enabled, and as far as i’d checked, did NOT have them enabled.

i was assuming that is_dx_dir would exit pretty much immediately, but upon closer examination (after hex-editing block image files, ofc), i realize that is_dx_dir and ext4_dx_dir are pretty much the happy-path, since the is_dx_dir impl is “exclude-specific” and not “include-specific”.

static int ext4_readdir(struct file *file, struct dir_context *ctx)
{
    // ...
    if (is_dx_dir(inode)) {
        err = ext4_dx_readdir(file, ctx);
        if (err != ERR_BAD_DX_DIR)
            return err;

        // ...
    }
    // ...

/**
 * is_dx_dir() - check if a directory is using htree indexing
 * @inode: directory inode
 *
 * Check if the given dir-inode refers to an htree-indexed directory
 * (or a directory which could potentially get converted to use htree
 * indexing).
 *
 * Return 1 if it is a dx dir, 0 if not
 */
static int is_dx_dir(struct inode *inode)
{
    struct super_block *sb = inode->i_sb;

    if (ext4_has_feature_dir_index(inode->i_sb) &&
        ((ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) ||
         ((inode->i_size >> sb->s_blocksize_bits) == 1) ||
         ext4_has_inline_data(inode)))
        return 1;

    return 0;
}

actually having a debugger to step through the kernel functions would’ve been helpful, but that’s an adventure for another day.

6.1. wait, what actually broke though?

we had three Bouncy Castle “provider” dependencies, which were on a single overlayfs layer.

there was a client library that needed a Bouncy Castle “provider” with a version “jdk15”+ as the client initialization used specific properties from a class, and those properties were only available in “jdk15”+.

up until the node image update, we “fortunately” had node images with directory hash seeds ordering “jdk15” or “jdk18” before “jdk14”.

after the node image patch update, the directory seed caused “jdk14” to be hashed with a value, causing it to come up earlier than “jdk15” or “jdk18” in readdir.

and this caused an uncaught “NoSuchFieldError” in an initializer thread, causing the client initialization to “get stuck”. newer pods thus, could not initialize.