2025-04-06
the title is a cheeky reference to something at the front page of the orange site today¹.
i don’t want to be misleading here; glob order in bash is “alphanumeric”-ish.
this is more about documenting a weird bug we encountered recently after a node image patch update, which in turn caused a multi-hour outage since we could not get ahead of it in time.
we have JVM workloads in production, with dockerfiles that look like this:
CMD ["java", "-cp", "/jars/*", "-server", ..., "com.acmecorp.app.Application"]
the wildcard here is not a glob, since the thing is not running in a bash shell. the actual argument value the JVM receives is "/jars/*", and the JVM in turn decides to be helpful and expand the wildcard anyway².
on posix systems, this expansion happens to use the readdir syscall³. overlayfs delegates readdir to the underlying filesystem (ext4). ext4 readdir optimizes by caching the entries of a directory in a “hashed b-tree” with a specific “directory hash seed”.
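to make “readdir order” concrete, here’s a minimal go sketch (not the JVM’s actual expansion code; /jars is just the path from the dockerfile above) that lists a directory the way the expansion sees it: whatever order the filesystem hands back, no sorting.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// open the directory and read entry names without sorting them;
	// this is roughly what `ls -1U` shows, and the order in which the
	// JVM's wildcard expansion will see the jars.
	dir, err := os.Open("/jars") // placeholder path from the dockerfile above
	if err != nil {
		log.Fatal(err)
	}
	defer dir.Close()

	// Readdirnames(-1) returns all names in directory (readdir) order;
	// note that os.ReadDir would sort them, which is not what we want here.
	names, err := dir.Readdirnames(-1)
	if err != nil {
		log.Fatal(err)
	}
	for _, name := range names {
		fmt.Println(name)
	}
}
```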
the rest of the blog is about how i went about trying to figure out how we got here.
with an overhaul of our CI/CD setup a couple of years ago (from Jenkins to GHA), and a couple of other reasons, we switched from docker to buildah for building the container images, and we noticed that some of the buildah-built images would not start up.
we were copying the files into the container image in a specific order, to possibly save bandwidth with ‘shared’ layers. this involved copying the jars in a specific order of “volatility” (bottom-to-top); the intent was that layers 3 and 4 would “rarely” change, and surely this should help a bunch with bandwidth given the layer SHAs would stay consistent.
upon investigation, we identified that a critical configuration file was being read from a jar that was not the “specific project jar”, and this in turn was causing the application startup to misbehave.
as an immediate fix, the corresponding change was made, adding --layers to the buildah bud step.
“buildah by default does not cache layers during building of the image, and hence ends up squashing the layers. When using wildcard to set the classpath for java, the order of listing the jars changes and hence causes jars other than the project jar to have higher priority in the class path.
This PR implements usage of the --layers flag, which re-enables caching of layers and fixes the issues with classpath jar priorities.”
little did we realize that this didn’t do much: it seemed to fix the issue of the configuration file being picked up from the incorrect jar, and we went our merry way.
this was an understanding gap: i had always been under the impression that for a readdir over an overlayfs, the iteration order would follow the overlayfs directory stacking order.
sidenote: containers use overlayfs and basically “stack” the image layers on top of one another, using something called “white-outs” to handle deletion. this is one of the reasons why having a lot of white-outs is terrible from a perf standpoint, and why squashing layers is supposedly better for performance, or something, idk.
performing the most basic of tests made me realize this is not the case. the only guarantee is that an “upper” layer’s files will override a “lower” layer’s files; there is no guarantee about the iteration order across overlayfs layers.
this can be easily proven by the following example:
uname -r
# 6.1.0-18-amd64
mkdir l0 l1 l2 work merged
for d in l0 l1 l2; do for file in $(seq 10 12); do touch $d/$d-$file; done; done
sudo mount -t overlay overlay -o lowerdir=./l0:./l1,upperdir=./l2,workdir=./work ./merged/
ls -1U ./merged/ # list unsorted, basically think readdir
my previous understanding, or at least the understanding as per the “fix” we put in the previous section, was that the order expected here is l2,l1,l0; but the actual output of ls -1U ./merged/ is l1,l2,l0:
l1-11
l1-10
l1-12
l2-10
l2-12
l2-11
l0-10
l0-12
l0-11
safe to say, while overlayfs guarantees that an upper directory’s files shall overwrite a lower directory’s files, it does not guarantee that the directory traversal order will follow the same stacking order.
around here, I validated that a segment of ls -1U on the merged “overlayfs” folder matched the ls -1U order on a “lower” directory on the underlying “ext4”, and decided to focus my efforts on figuring out what was happening there.
thinking that the layer extraction logic could’ve changed (why??), i got to trying to get the exact tar blobs from the image. this involved /v2/{repo}/manifests/{version}, .fsLayers[].blobSum, and fetching the tar blobs.
after fetching the tar blob and replicating the containerd layer extraction logic⁴ (just the golang native "archive/tar" bits), the subsequent ls -1U output was basically the same, when running on the same nodepool.
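the replication was roughly along these lines; a stripped-down sketch (no whiteouts, links, modes or ownership handled here, and layer.tar / ./extracted are placeholder paths), just the "archive/tar" loop writing entries out in archive order.

```go
package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
	"path/filepath"
)

// extract writes out regular files and directories from a layer tarball in
// archive order, roughly mirroring the "archive/tar" portion of the layer
// extraction (whiteouts, links, modes and ownership are deliberately ignored).
func extract(tarPath, dest string) error {
	f, err := os.Open(tarPath)
	if err != nil {
		return err
	}
	defer f.Close()

	tr := tar.NewReader(f)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil // end of archive
		}
		if err != nil {
			return err
		}
		target := filepath.Join(dest, hdr.Name)
		switch hdr.Typeflag {
		case tar.TypeDir:
			if err := os.MkdirAll(target, 0o755); err != nil {
				return err
			}
		case tar.TypeReg:
			if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
				return err
			}
			out, err := os.Create(target)
			if err != nil {
				return err
			}
			if _, err := io.Copy(out, tr); err != nil {
				out.Close()
				return err
			}
			out.Close()
		}
	}
}

func main() {
	// placeholder paths for the layer blob and the extraction target
	if err := extract("layer.tar", "./extracted"); err != nil {
		log.Fatal(err)
	}
}
```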
was the inode order different? nope. sequential inodes, per order in the tar archive.
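the inode check itself is basically ls -1Ui; in go it looks roughly like this (the directory path is again a placeholder).

```go
package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open("./extracted/jars") // placeholder: the extracted layer dir
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// (*os.File).ReadDir keeps directory (readdir) order; os.ReadDir would sort.
	entries, err := f.ReadDir(-1)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			log.Fatal(err)
		}
		// on linux, Sys() is a *syscall.Stat_t, which carries the inode number
		st := info.Sys().(*syscall.Stat_t)
		fmt.Println(st.Ino, e.Name())
	}
}
```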
thinking “what if fsync is re-ordering blocks when flushing unwritten blocks to disk?”, i enabled ext4 tracing, but the logs were too much.
echo 1 > /sys/kernel/debug/tracing/events/ext4/enable
cat /sys/kernel/debug/tracing/trace # too noisy
all i had were log lines, which i did not understand, and i was not able to effectively filter the log lines for my specific operations (because shared root disk). not in the mood to sit and figure out ebpf, i decided to create a loopback device and | grep for that specific loopback device instead.
cat /sys/kernel/debug/tracing/trace | grep -v 'dev 8,1'
running the same tar extraction golang program inside the loopback device, the ls -1U order turned out different. wtf. re-ran the extraction in another folder: the order was the same; creating another loopback device, the ls -1U order changed yet again.
so, within a filesystem, the ls -1U order is consistent after extraction.
with debugfs disk.img, and running stats, there were just two possibly changing parameters: the “Filesystem UUID”, and a “Directory Hash Seed”.
the filesystem UUID could be easily specified at mkfs.ext4 time with the -U parameter,
mkfs.ext4 -U {uuid} disk2.img
but alas, running the tar extraction test on two ext4 partitions with the same UUID still resulted in a different ls -1U order.
so, deciding to go after the “Directory Hash Seed” next, I realized there was no easy way to set this parameter with mkfs.ext4, so finding the offset of the directory hash seed and hex-editing it in the block file was “the only way forward”.
this was “accomplished” with a pretty dumb combination of xxd, grep -ob and printf | dd.
the ext4 header blocks also have a crc checksum, which debugfs cribs about; but it also gives you the “expected” value, so removing that “error” is just another hex edit away.
mounting the newly modified disk image, and re-running the tar extraction test, the order matched!
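for the record, the same edit can be done without the xxd / dd gymnastics; here’s a hedged go sketch (not what i actually ran), assuming the standard ext4 layout (primary superblock at byte offset 1024, s_hash_seed being the 16 bytes at offset 0xEC within it), which copies the seed from one image into another and still leaves the superblock checksum for debugfs to crib about.

```go
package main

import (
	"log"
	"os"
)

const (
	superblockOffset = 1024 // primary ext4 superblock lives 1024 bytes into the device
	hashSeedOffset   = 0xEC // s_hash_seed within the superblock (4 x little-endian u32), assuming the standard layout
	hashSeedLen      = 16
)

// copySeed copies the directory hash seed from a reference image into a
// target image. the superblock checksum still needs to be fixed up
// afterwards, as described above (debugfs prints the "expected" value).
func copySeed(refPath, targetPath string) error {
	ref, err := os.Open(refPath)
	if err != nil {
		return err
	}
	defer ref.Close()

	seed := make([]byte, hashSeedLen)
	if _, err := ref.ReadAt(seed, superblockOffset+hashSeedOffset); err != nil {
		return err
	}

	target, err := os.OpenFile(targetPath, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer target.Close()

	_, err = target.WriteAt(seed, superblockOffset+hashSeedOffset)
	return err
}

func main() {
	// disk.img / disk2.img are the image names used earlier in the post;
	// which one is the source and which the target is purely illustrative.
	if err := copySeed("disk.img", "disk2.img"); err != nil {
		log.Fatal(err)
	}
}
```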
I had perused the ext4 readdir implementation⁵ somewhere when dealing with overlayfs delegating readdir to the underlying filesystem, but reading is reading, and reading is lossy.
ext4 has this thingy called “h-tree indexing”, which is something that needs to be specifically enabled, and as far as i’d checked, we did NOT have it enabled.
i was assuming that is_dx_dir would exit pretty much immediately, but upon closer examination (after hex-editing block image files, ofc), i realized that is_dx_dir and ext4_dx_readdir are pretty much the happy path, since the is_dx_dir impl is “exclude-specific” and not “include-specific”.
static int ext4_readdir(struct file *file, struct dir_context *ctx)
{
    // ...
    if (is_dx_dir(inode)) {
        err = ext4_dx_readdir(file, ctx);
        if (err != ERR_BAD_DX_DIR)
            return err;
        // ...
    }
    // ...

/**
 * is_dx_dir() - check if a directory is using htree indexing
 * @inode: directory inode
 *
 * Check if the given dir-inode refers to an htree-indexed directory
 * (or a directory which could potentially get converted to use htree
 * indexing).
 *
 * Return 1 if it is a dx dir, 0 if not
 */
static int is_dx_dir(struct inode *inode)
{
    struct super_block *sb = inode->i_sb;

    if (ext4_has_feature_dir_index(inode->i_sb) &&
        ((ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) ||
         ((inode->i_size >> sb->s_blocksize_bits) == 1) ||
         ext4_has_inline_data(inode)))
        return 1;

    return 0;
}
actually having a debugger to step through the kernel functions would’ve been helpful, but that’s an adventure for another day.
we had three Bouncy Castle “provider” dependencies, which were on a single overlayfs layer.
bcprov-jdk14-1.38.jar
bcprov-jdk15on-1.55.jar
bcprov-jdk18on-1.75.jar
there was a client library that needed a Bouncy Castle “provider” versioned “jdk15”+, as the client initialization used specific properties from a class, and those properties were only available in “jdk15”+.
up until the node image update, we “fortunately” had node images with directory hash seeds ordering “jdk15” or “jdk18” before “jdk14”.
after the node image patch update, the directory hash seed caused “jdk14” to hash to a value that made it come up earlier than “jdk15” or “jdk18” in readdir.
and this caused an uncaught “NoSuchFieldError” in an initializer thread, causing the client initialization to “get stuck”. newer pods thus could not initialize.
bye now.