The kernel blame game

| categories: kernel, fedora

Fedora is a system of many individual pieces, each of which can affect the others. The kernel is usually fairly self-contained, or at least it tries to be. This is a story of two issues triggered by two different kernel changes that show the kernel is not an entity unto itself.

Major kernel version upgrades usually bring out issues and 4.4 was no different. A user filed a bug about a change in behavior: disk partitions on the hard disk were now being mounted automatically. This behavior was unwanted as those partitions previously required root to mount. The kernel is responsible for setting up device nodes but ultimately userspace is responsible for making the system calls to actually mount filesystems past a certain point. The system logs were clearly showing udisksd was making the request to mount those partitions. This really didn't look like a kernel issue so I suggested some other userspace setting had changed the automounting options. The reporter downgraded other userspace packages that had been installed at the same time but this showed no change. Installing just the kernel was enough to change the behavior so something in the kernel must have changed.

Debugging often involves working backwards to figure out how we got somewhere. Working backwards here:

  • Partitions are being mounted
  • The logs are showing the mount requests coming from udisksd
  • udisksd is triggering the mount requests
  • udisksd uses ????? to decide what to mount

I bounced this problem off of my teammates and they suggested the partition options could be coming up differently based on the kernel. So what partition options was udisks reading to figure out what to mount? The udisks documentation wasn't very useful. It gave great examples of how to change the automounting behavior but it didn't answer the question of how it sets the defaults. I eventually found the udisksctl dump command which shows what state udisks has for all the partitions it knows about. This was really insightful: in the non-working kernel udisks was marking the parititions as automount and removable which explained the behavior. It didn't explain why it was making those decisions though. Looking at the code is my preferred method for understanding some behaviors. The udisks code was less enlightening than I would have liked. All I could tell was that some call was returning that the disk was removable. On a whim, I decided to look through the kernel commit history and found a commit which matched the exact behavior and a revert confirmed this was the breaking patch. The ultimate issue was that marking the drive as hotplug capable was enough to get the drive marked as removable. The original patch author provided a fix to not mark hotplug ports as removable.

During the course of discussion on bugzilla, a second person was reporting the same issue. The fix did not work for this reporter though so we started a different bug for tracking. In this case, the behavior and bug was different: the kernel would boot but eventually dracut would spit out a message that it can't find the lvm partitions. A good starting point for issues where root partitions won't mount is to drop into the dracut shell. This lets you run shell commands to collect system information. Ultimately though someone else reported the same issue in parallel and found the issue before any debugging took place. In 4.4, the mpt2sas driver was replaced with the mpt3sas driver. The kernel has alias to take care of loading mpt3sas when modprobe mpt2sas happens. The mpt3sas module needs to be present for this to work though. Each time a new kernel is installed, dracut uses the set of existing loaded modules to figure out what to put in the initramfs. (Live CDs have a huge number of kernel modules in the initramfs to be able to run on almost any platform and give a base for installing the kernel). dracut didn't detect the name change so the mpt3sas module wasn't present in the initramfs, preventing the file system from being mounted. The workaround is fairly straight forward: put the mpt3sas module into the initramfs. Once it is in place, dracut will correctly add it to all future initramfs that are generated. Ultimately though, dracut needs to properly follow module aliases for modules which are no longer present.

Lessons here:

  • The kernel is not always right. Kernel changes need to be aware of userspace assumptions and not break them. The kernel community usually tries very hard not to introduce regressions but they do sometimes happen.
  • The kernel is not alway wrong. Sometimes a kernel change can expose a limitation in another program which needs to be fixed.
  • Knowing debugging techniques outside the kernel is necessary for kernel developers, or at least Fedora kernel maintainers.
  • Sometimes you get really really lucky in finding an answer to your problem
  • Early testing of updates in bodhi and good bug reporting helps to find issues fast and get them fixed quicker. Both of these issues were reported, and in the once case fixed, by the 2nd stable release for 4.4.