Fantastic kernel patches and where to find them

| categories: fedora

I've griped before about kernel development being scattered and spread about. A quick grep of MAINTAINERS shows over 200 git trees and even more mailing lists. Today's discussion is a partial enumeration of some common mailing lists, git trees and patchwork instances. You can certainly find some of this in the MAINTAINERS file.

  • LKML. The main mailing list. This is the one everyone thinks of when they think 'kernel'. Really though, it mostly serves as an archive of everything at this point. I do not recommend e-mailing just LKML with no other lists or people. Sometimes you'll get a response but think of it more as writing to your blog that has 10 followers you've never met, 7 of which are bots. Or your twitter. There is a patchwork instance and various mail archives out there. I haven't found one I actually like as much as GMANE unfortunately. The closest corresponding git tree is the master where all releases happen.

  • The stable mailing list. This is where patches go to be picked up for stable releases. The stable releases have a set of rules for how patches are picked up. Most important is that a patch must be in Linus' tree before it will be applied to stable. Greg KH is the main stable maintainer. He does a fantastic job of taking care of the large number of patches that come in. In general, if a patch is properly tagged for stable, it will show up eventually (a sample tag is shown after this list). There is a tree for his queue of patches to be applied, along with stable git trees.

  • Linux -next. This is the closest thing to an integration tree right now. The goal is to find merge conflicts and bugs before they hit Linus' tree. All the work of merging trees is handled manually. Typically subsystem maintainers have a branch that's designated for -next which gets pulled in on a daily basis. Running -next is not usually recommended for anything more than "does this fix your problem" unless you are willing to actively report bugs. Running -next and learning how to report bugs is a great way to get involved though. There's a tree with tags per day.

  • The -mm tree. This gets its name from memory management but really it's Andrew Morton's queue. Lots of odd fixes end up getting queued through here. Officially, this gets maintained with quilt. The tree that feeds into -next, "mmotm" (mm of the moment), is available as a patch series. If you just want the memory management part of the tree, there's a tree available for that.

  • Networking. netdev is the primary mailing list which covers everything from core networking infrastructure to drivers. And there's even a patchwork instance too! David Miller is the top level networking maintainer and has a tree for all your networking needs. He has a separate -next tree. One thing to keep in mind is that networking patches are sent to stable in batches and not just tagged and picked up by Greg KH. This sometimes means a larger gap between when a patch lands in Linus' branch and when it gets into a stable release.

  • Fedora tree. Most of the git trees listed above are "source git/src-git" trees, meaning it's the actual source code. Fedora officially distributes everything in "pkg-git" form. If you look at the official Fedora kernel repository, you'll see it contains a bunch of patches and support files. This is similar to the -mm tree and the -stable queue. Josh Boyer (Fedora kernel maintainer emeritus) has some scripts to take the Fedora pkg-git and put it on kernel.org. This gets updated automatically with each build.

  • DRM. This is for anything and everything related to graphics. Most everything is hosted at freedesktop.org, including the mailing list. Recently, DRM has switched to a group maintainer model (Daniel Vetter has written about some of this philosophy before). Ultimately though, all the patches will come through the main DRM git repo. There's a DRM -tip for -next like testing of all the latest graphics work. Graphics maintainers may occasionally request you test that tree if you have graphics problems. There's also a patchwork instance.
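As a reference for the stable item above, tagging a patch for stable is just a line in the commit message's tag block; the optional "# 4.14+" suffix tells the stable maintainers which releases it applies to. The subject, description and names here are made up for illustration:

    foo: fix a NULL pointer dereference in foo_probe()

    <description of the bug and the fix>

    Cc: stable@vger.kernel.org # 4.14+
    Signed-off-by: Jane Developer <jane@example.com>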


Some notes on recent random numbers

| categories: fedora

By now people may have seen complaints of boot slowdown on newer kernels. I want to explain a little more about what's going on and why Fedora seems to be particularly hard hit.

The Linux kernel has a random number generator in drivers/char/random.c. This provides several interfaces for random numbers to the system. There are two main interfaces for random numbers: /dev/random and /dev/urandom. /dev/random is designed to be "secure", meaning it is sufficiently random that it can be used for things like cryptography keys. /dev/urandom is "random" in the sense that most humans won't detect a pattern but sufficient mathematical analysis might find a weakness.
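As a concrete illustration of the device interface (a minimal sketch, not anything from the kernel itself), a program that just needs "good enough" random bytes typically reads them from /dev/urandom:

    /* Minimal sketch: read 16 random bytes from /dev/urandom.
     * Reading /dev/random instead may block if the kernel thinks
     * it doesn't have enough entropy. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            unsigned char buf[16];
            FILE *f = fopen("/dev/urandom", "rb");

            if (!f) {
                    perror("fopen /dev/urandom");
                    return EXIT_FAILURE;
            }
            if (fread(buf, 1, sizeof(buf), f) != sizeof(buf)) {
                    fprintf(stderr, "short read\n");
                    fclose(f);
                    return EXIT_FAILURE;
            }
            fclose(f);

            for (size_t i = 0; i < sizeof(buf); i++)
                    printf("%02x", buf[i]);
            printf("\n");
            return 0;
    }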

Random number generators rely on entropy to work properly. You can't just make up entropy, the system has to get it from somewhere. At boot the kernel assumes it has no entropy and relies on various parts of the system (interrupts, timer ticks etc.) to give it entropy. /dev/random is supposed to block if there is not sufficient entropy in the system (entropy is a finite resource and it can be drained). Google Project Zero recently discovered several flaws in the Linux RNG, among them that the RNG was marked as being available for cryptographically secure generation earlier than it should have been. They provided patches to fix this which were applied by the RNG maintainer.

And then people started seeing issues, mostly a lot of messages about crng_init. It turns out, there were a lot of places in the kernel that were trying to get random numbers early in the boot process which weren't as random as the callers might expect. Fedora had a particularly nasty problem where the compose machines were getting stuck. Trying to get more logs from the systemd journal didn't help. Eventually, after some debugging with the infrastructure team (and the help of sendkey alt-sysrq-t in the qemu monitor window), we were able to see that init was blocked on the getrandom system call waiting for secure entropy. Interestingly enough, systemd only made non-blocking (insecure) random calls in its own code.
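To make that distinction concrete, here's roughly the difference between the blocking and non-blocking getrandom() modes (a sketch assuming the glibc sys/random.h wrapper):

    /* Sketch of the two getrandom() modes discussed above. With flags == 0
     * the call blocks until the kernel's CRNG is initialized; with
     * GRND_NONBLOCK it fails with EAGAIN instead of waiting. */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/random.h>

    int main(void)
    {
            unsigned char buf[16];
            ssize_t n;

            /* Non-blocking: roughly what systemd's own calls do. */
            n = getrandom(buf, sizeof(buf), GRND_NONBLOCK);
            if (n < 0 && errno == EAGAIN)
                    printf("CRNG not ready yet, caller has to cope\n");

            /* Blocking: what init ended up stuck waiting on. */
            n = getrandom(buf, sizeof(buf), 0);
            if (n == (ssize_t)sizeof(buf))
                    printf("got cryptographically secure bytes\n");

            return 0;
    }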

I was lucky I could re-build kernels to reproduce the issue, so I decided to experiment a bit and return something unexpected from getrandom (-ENOMEM). This gave me an error message that (luckily) uniquely mapped to gcrypt. systemd links against gcrypt for some features, such as calculating an HMAC for the journal entries. None of that involved random numbers at bootup though, so it didn't explain why things were getting stuck. After some more back and forth, Patrick Uiterwijk found a patch that gcrypt was carrying: if FIPS mode is enabled, the cryptographic system is initialized at constructor time (i.e. when the library gets loaded by systemd). It turns out, the default images ship with dracut-fips, which puts gcrypt into FIPS mode. So the very first time systemd went to open the journal to write something, it would load gcrypt, which would attempt to initialize the random number system. (Fun fact, it also looks like the default in systemd is to write to the journal before the command line options are parsed, so even adding an option to not write to the journal didn't help this case. I might be wrong here though.)
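The "constructor time" detail is the key mechanism: a library can register code that runs as soon as it is loaded, before the caller asks it to do anything. A minimal sketch of that mechanism (this is not gcrypt's actual code, just the general shape):

    /* Sketch of library-constructor initialization, illustrating why merely
     * loading a library can trigger work such as seeding an RNG. */
    #include <stdio.h>

    __attribute__((constructor))
    static void library_init(void)
    {
            /* In the gcrypt/FIPS case described above, the equivalent of this
             * function ended up asking the kernel for secure random numbers,
             * which blocked early in boot. */
            fprintf(stderr, "library loaded, running init before main()\n");
    }

    int main(void)
    {
            printf("main() runs only after all constructors finish\n");
            return 0;
    }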

Despite the fact that these patches have some side effects, they do fix a real issue and can't just be permanently reverted. So what to do? One easy answer is to give the system more entropy. On virtualized systems, this can be provided by the CONFIG_HW_RANDOM_VIRTIO option. Part of the fix also involves making sure userspace isn't actually trying to rely on secure random number generation too early, since randomness is hard to come by early in boot. At least for now in Fedora, we've temporarily reverted the random series on stable (F27/F28) releases. The plan is to continue working with upstream and userspace developers to find a workable solution and bring back the patches when things are fixed.
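For the "give the system more entropy" approach on virtual machines, the host can expose a virtio RNG device to the guest, which the guest-side CONFIG_HW_RANDOM_VIRTIO driver then feeds into the kernel's entropy pool. A rough sketch of what that looks like with plain QEMU and with libvirt (treat the exact syntax as illustrative):

    # QEMU command line: expose a virtio RNG device to the guest
    # (the guest kernel needs CONFIG_HW_RANDOM_VIRTIO for the driver side)
    qemu-system-x86_64 ... -device virtio-rng-pci

    # Rough libvirt domain XML equivalent:
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
    </rng>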


LSF/MM 2018

| categories: fedora

Wheee LSF/MM. Highlights.

  • There were a couple of sessions related to everyone's favorite mitigation, PTI. Unsurprisingly, PTI has an impact on I/O heavy workloads because it makes system calls more expensive. Minimizing system calls has always been good performance advice, and it only becomes more important with PTI. Also important is to make sure features like the vDSO are used since that helps to mitigate the system call cost. There was some discussion about whether we need new system calls that take vectors to help with TLB flushing costs (e.g. multiple madvise calls will require a flush each time).

  • Ted Ts'o gave a session on fs-verity. This is a file system feature for file integrity that's mostly been focused on Android. The functionality is similar to what IMA wants to provide but focuses on immutable files. Looks promising.

  • Speaking of IMA, Mimi Zohar gave a presentation on IMA with a focus on file system topics. The goal of IMA is to provide cryptographic verification of files with keys tied to the TPM. The nature of this means it ends up touching file system internals and some things it perhaps shouldn't (e.g. file system magic numbers). There's been good progress towards making IMA more acceptable to everyone.

  • Igor Stoppa talked about protectable dynamic memory (called pmalloc). The goal is to allow read only protection of memory that can't be statically allocated. This is a patch series I've been reviewing/following for some time now. Overall, feedback seemed promising for it to get merged soon.

  • I talked briefly about CMA (my primary reason for attending). CMA relies on alignment to the pageblock size since it is tied to migration. On arm64 with 64K pages, the pageblock size gets bumped to 512MB, which is a bit much (a quick back-of-the-envelope version of that arithmetic follows this list). I discussed some approaches to loosening that requirement. The two big options are to either make the pageblock size smaller when THP isn't in use, or to let CMA exist as a sub-pageblock region (some patches were recently merged by Joonsoo Kim to make this much easier). Both are plausible, now all I need to do is write the code (alas).

  • Matthew Wilcox talked about struct page. Kernel documentation has been notoriously lacking for internal APIs and structures. A struct page exists for each page of memory on the system, which means it needs to be as compact as possible. The end result is a difficult to understand structure. Matthew proposed re-arranging the structure to better reflect the actual usage and make it clear which fields outside users could actually use. It seemed to be well received so I expect to see it on the mailing list sometime.

  • I got a chance to chat with some Red Hat people about dm-vdo. I had seen this discussed internally somewhat before but the hallway track provided a much better explanation to me of both the details of the code and some of the potential pitfalls. The hallway track is always great.
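For the CMA item above, the 512MB number falls out of the page table geometry once huge pages are in the picture; roughly:

    4K pages : 4096 / 8 bytes per PTE = 512 entries per table
               512 entries * 4K       = 2MB PMD-level huge page
    64K pages: 65536 / 8              = 8192 entries per table
               8192 entries * 64K     = 512MB PMD-level huge page

Since the pageblock size follows the huge page size (with THP/hugepages enabled) and CMA regions must be pageblock-aligned, a 64K-page arm64 kernel ends up needing 512MB alignment where a 4K-page kernel only needs 2MB.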

LWN already has some coverage up but watch there for much better details on all the sessions.


Kbuild tricks

| categories: fedora

Several of the tasks I've worked on recently have involved looking at some of the kernel's build infrastructure. This is all fairly well documented which makes it nice to work with.

The kernel automatically generates some files at build time. This is mostly set up to be transparent to developers unless they go looking for them. The majority of these files are headers at include/generated. A good example of something which needs to be generated is the #define representing the kernel version (e.g. 4.15.12). The header file include/generated/bounds.h contains #defines for several enum constants calculated at build time. Cleverly, most of these files are only actually replaced if the generated output changes, to avoid unnecessary recompiles. Most of this work is handled by the filechk macro.
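The filechk pattern is easy to reuse: you define a filechk_<name> command that writes the file's contents to stdout, and the shared macro (it lives in scripts/Kbuild.include) handles the compare-and-replace. A hypothetical sketch, with made-up names and recipe lines tab-indented as make requires:

    # Generator: everything echoed here becomes the file's contents.
    define filechk_sample.h
            echo '#define SAMPLE_VALUE 42'
    endef

    # filechk writes to a temporary file, compares it against the existing
    # header, and only moves it into place when the contents changed.
    include/generated/sample.h: FORCE
            $(call filechk,sample.h)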

The C preprocessor is typically used on C files, as one might obviously expect. It's not actually limited to C files though. Each architecture has to define a linker script which meets the architectural requirements. The linker language is common across architectures so it's beneficial to have common definitions for typical sections such as initcalls and rodata. There's a global rule to run the pre-processor on any .lds.S file. Devicetree files also get preprocessed which avoids a lot of copy and pasting of numerical defines.
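As a sketch of why this is handy, an architecture's vmlinux.lds.S can be built mostly from shared macros in include/asm-generic/vmlinux.lds.h, with the preprocessor doing the expansion. This fragment is hypothetical, not any real arch's script:

    /* arch/<arch>/kernel/vmlinux.lds.S -- preprocessed into vmlinux.lds */
    #include <asm-generic/vmlinux.lds.h>
    #include <asm/page.h>

    SECTIONS
    {
            .text : { *(.text) }

            /* Shared macros expand to the standard layout, e.g. the
             * read-only data and initcall tables mentioned above. */
            RO_DATA(PAGE_SIZE)
            .init.data : { INIT_CALLS }
    }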

The compiler flags are typically set in the top level Makefile and named as you might expect (CFLAGS, CXXFLAGS etc.). The process of building the kernel requires building a number of smaller programs. The C flags for these programs are controlled by a different set of variables (HOSTCFLAGS). It sounds incredibly obvious, but I've lost time from my day trying to figure out why options set in CFLAGS weren't being picked up by the host compiler. For more fun, it's possible to use environment variables to set different flags for compiling built-in vs. module code. The moral of the story is know what you're setting.
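A few of those knobs as they'd appear on the make command line; exact behavior varies a bit by kernel version, so treat this as a sketch (the defines in the last example are made up, just to show the split):

    # Extra flags appended when compiling target (kernel) code:
    make KCFLAGS="-Wextra"

    # Flags for the host programs built along the way (fixdep, kconfig, ...);
    # in some kernel versions this replaces the defaults rather than appending:
    make HOSTCFLAGS="-O1 -g"

    # Separate extra flags for built-in code vs. modules:
    make CFLAGS_KERNEL="-DBUILTIN_ONLY" CFLAGS_MODULE="-DMODULE_ONLY"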

Debugging build infrastructure isn't always pleasant but the kernel build system isn't too bad overall. I'm at least beginning to understand more parts of it as I find increasingly more obscure things to modify.


Fun with gcc plugins

| categories: fedora

One piece of infrastructure that's come in as part of the Kernel Self Protection Project (KSPP) is support for gcc plugins. I touched on this briefly in my DevConf talk but I wanted to discuss a few more of the 'practicalities' of dealing with compiler plugins.

At an incredibly abstract level, a compiler transforms a program from some form A to another form A'. Your A might be C or C++ and you expect A' to be a binary file you can run. Modern compilers like gcc produce the final result by transforming your program in several passes, so you end up with A to A' to A'' to A''' etc. The gcc plugin architecture allows you to hook in at various points to make changes to the intermediate state of the program. gcc has a number of internal representations, so depending on where you are hooking you may need to use a different representation.

Kernel development gets a (not undeserved) reputation for being poorly documented and difficult to get into. To write even a self-contained kernel module requires some knowledge about the rest of the code base. If you have some familiarity with the code base it makes things much easier. I've found compiler plugins to be similarly difficult. I'm not working with the gcc code base on a regular basis so figuring out how to do something practical with the internal structures feels like an uphill battle. I played around with writing a toy plugin to look at the representation and it took me forever to figure out how to get the root of the tree so I could do something as simple as call walk_tree. Once I figured that out, I spent more time figuring out how to actually do a switch on the node to see what type it was. Basically, I'm a beginner in an unfamiliar code base so it takes me a while to do anything.
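For anyone starting from the same place, a minimal sketch of the kind of toy plugin described above. This is from memory of the plugin API (built with g++ against gcc's plugin headers), so details may be off for your gcc version:

    #include "gcc-plugin.h"
    #include "plugin-version.h"
    #include "tree.h"

    int plugin_is_GPL_compatible;   /* required, or gcc refuses to load the plugin */

    /* Called once per parsed function body via PLUGIN_PRE_GENERICIZE;
     * gcc_data is the function declaration tree. */
    static void handle_function(void *gcc_data, void *user_data)
    {
            tree fndecl = (tree)gcc_data;

            switch (TREE_CODE(fndecl)) {
            case FUNCTION_DECL:
                    /* The body is reachable via DECL_SAVED_TREE(fndecl),
                     * which is what you'd hand to walk_tree(). */
                    break;
            default:
                    break;
            }
    }

    int plugin_init(struct plugin_name_args *info,
                    struct plugin_gcc_version *version)
    {
            if (!plugin_default_version_check(version, &gcc_version))
                    return 1;

            register_callback(info->base_name, PLUGIN_PRE_GENERICIZE,
                              handle_function, NULL);
            return 0;
    }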

Continuing the parallels between kernels and compilers, the internal ABI of gcc may change between versions, similar to how the kernel provides no stable internal ABI. If you want to support multiple compiler versions in your plugin, this results in an explosion of #ifdef VERSION >= BLAH all throughout the code. Arguably, external kernel modules have the same problem, but I'd argue the problem is slightly worse for compiler plugins: kernel modules can be built and shipped for particular kernel versions, but it's harder to require specific compiler versions.
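The version guards end up looking something like this; BUILDING_GCC_VERSION comes from gcc's own headers, while the two helpers are made up purely to show the shape of the pattern:

    /* hypothetical helpers, just to illustrate the version-guard pattern */
    #if BUILDING_GCC_VERSION >= 8000
            handle_node_the_new_way(node);
    #else
            handle_node_the_old_way(node);
    #endif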

With all this talk about how hard it is to use compiler plugins, there might be some questions about whether it's really worth it to support them at all. My useless answer is "it depends" and "isn't that the ultimate question of any feature". If you have a plugin that can eliminate bug classes, is it worth the maintenance burden? I say yes. One long term option is to get features merged into the main gcc trunk so they don't have to be carried as plugins. Some of the tweaks are kernel specific though, so we're probably stuck carrying the more useful plugins. There is interest in new compiler flags and features so we'll have to see what happens in the future.

