The opinion nobody asked for

| categories: hottakes

Tired and cranky is probably not the best time to write a blog post but here it goes anyway. The business world seems to love to talk about 'authenticity' and 'bringing your whole self' to work. That applies to what you talk about, especially if you are a Woman in Tech(tm). If you are passionate about talking about diversity and those issues because it's part of who you are, go for it. If you'd rather not talk about those issues because it's not who you are, that's okay. Most important is to know why you are making your choices. It's okay to change your mind too.

None of this is an excuse to not care at all though. The real trick is to figure out how you work best. Maybe you're better at talking to others privately. Maybe you boost up other people who do want to share their story. Maybe you spend your time organizing. There are lots of ways to make the tech industry better and most important is learning how to do it in the way that works for you. We can all do better.


Fantastic kernel patches and where to find them

| categories: fedora

I've griped before about kernel development being scattered and spread about. A quick grep of MAINTAINERS shows over 200 git trees and even more mailing lists. Today's discussion is a partial enumeration of some common mailing lists, git trees and patchwork instances. You can certainly find some of this in the MAINTAINERS file.

  • LKML. The main mailing list. This is the one everyone thinks of when they think 'kernel'. Really though, it mostly serves as an archive of everything at this point. I do not recommend e-mailing just LKML with no other lists or people. Sometimes you'll get a response but think of it more as writing to your blog that has 10 followers you've never met, 7 of which are bots. Or your twitter. There is a patchwork instance and various mail archives out there. I haven't found one I actually like as much as GMANE unfortunately. The closest corresponding git tree is Linus' master tree, where all releases happen.

  • The stable mailing list. This is where patches go to be picked up for stable releases. The stable releases have a set of rules for how patches are picked up. Most important is that the patch must be in Linus' tree before it will be applied to stable. Greg KH is the main stable maintainer. He does a fantastic job of taking care of the large number of patches that come in. In general, if a patch is properly tagged for stable, yes, it will show up eventually. There is a tree for his queue of patches to be applied along with stable git trees.

  • Linux -next. This is the closest thing to an integration tree right now. The goal is to find merge conflicts and bugs before they hit Linus' tree. All the work of merging trees is handled manually. Typically subsystem maintainers have a branch that's designated for -next which gets pulled in on a daily basis. Running -next is not usually recommended for anything more than "does this fix your problem" unless you are willing to actively report bugs. Running -next and learning how to report bugs is a great way to get involved though. There's a tree with tags per day.

  • The -mm tree. This gets its name from memory management but really it's Andrew Morton's queue. Lots of odd fixes end up getting queued through here. Officially, this gets maintained with quilt. The tree for -next "mmotm" (mm of the moment) is available as a series. If you just want the memory management part of the tree, there's a tree available for that.

  • Networking. netdev is the primary mailing list which covers everything from core networking infrastructure to drivers. And there's even a patchwork instance too! David Miller is the top level networking maintainer and has a tree for all your networking needs. He has a separate -next tree. One thing to keep in mind is that networking patches are sent to stable in batches and not just tagged and picked up by Greg KH. This sometimes means a larger gap between when a patch lands in Linus' branch and when it gets into a stable release.

  • Fedora tree. Most of the git trees listed above are "source git/src-git" trees, meaning it's the actual source code. Fedora officially distributes everything in "pkg-git" form. If you look at the official Fedora kernel repository, you'll see it contains a bunch of patches and support files. This is similar to the -mm and -stable-queue. Josh Boyer (Fedora kernel maintainer emeritus) has some scripts to take the Fedora pkg-git and put it on kernel.org. This gets updated automatically with each build.

  • DRM. This is for anything and everything related to graphics. Most everything is hosted at freedesktop.org, including the mailing list. Recently, DRM has switched to a group maintainer model (Daniel Vetter has written about some of this philosophy before). Ultimately though, all the patches will come through the main DRM git repo. There's a DRM -tip for -next-like testing of all the latest graphics work. Graphics maintainers may occasionally request you test that tree if you have graphics problems. There's also a patchwork instance.
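
For the curious, the MAINTAINERS grep mentioned at the top of this post can be sketched in a few lines of Python. This is a rough illustration and not an official tool: it assumes the usual `T:` (SCM tree) and `L:` (mailing list) line prefixes, and the sample entries and example.org addresses below are made up.

```python
def count_trees_and_lists(text):
    """Count unique git trees (T: lines) and mailing lists (L: lines)."""
    trees, lists = set(), set()
    for line in text.splitlines():
        fields = line[2:].split()
        if not fields:
            continue
        if line.startswith("T:") and fields[0] == "git":
            # T: lines can also name quilt or hg trees; only count git ones.
            trees.add(" ".join(fields[1:]))
        elif line.startswith("L:"):
            # Keep just the address, dropping notes like "(moderated ...)".
            lists.add(fields[0])
    return len(trees), len(lists)


# Made-up sample in the MAINTAINERS layout (hypothetical entries).
SAMPLE = (
    "HYPOTHETICAL DRIVER A\n"
    "M:\tSome Maintainer <someone@example.org>\n"
    "L:\tlist-a@example.org\n"
    "T:\tgit git://example.org/a.git\n"
    "F:\tdrivers/a/\n"
    "\n"
    "HYPOTHETICAL DRIVER B\n"
    "L:\tlist-a@example.org\n"
    "L:\tlist-b@example.org (moderated for non-subscribers)\n"
    "T:\tgit git://example.org/b.git\n"
    "T:\tquilt https://example.org/b-series/\n"
)

if __name__ == "__main__":
    n_trees, n_lists = count_trees_and_lists(SAMPLE)
    print(f"{n_trees} git trees, {n_lists} mailing lists")
```

To try it on a real checkout, pass the contents of the MAINTAINERS file instead of `SAMPLE`.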


Some notes on recent random numbers

| categories: fedora

By now people may have seen complaints of boot slowdown on newer kernels. I want to explain a little more about what's going on and why Fedora seems to be particularly hard hit.

The Linux kernel has a random number generator in drivers/char/random.c. This provides several interfaces for random numbers to the system. There are two main interfaces for random numbers: /dev/random and /dev/urandom. /dev/random is designed to be "secure", meaning it is sufficiently random that it can be used for things like cryptography keys. /dev/urandom is "random" in the sense that most humans won't detect a pattern but sufficient mathematical analysis might find a weakness.

Random number generators rely on entropy to work properly. You can't just make up entropy; the system has to get it from somewhere. At boot the kernel assumes it has no entropy and relies on various parts of the system (interrupts, timer ticks etc.) to give it entropy. /dev/random is supposed to block if there is not sufficient entropy in the system (entropy is a finite resource and it can be drained). Google Project Zero recently discovered several flaws in the Linux RNG, among them that the RNG was marked as being available for cryptographically secure generation earlier than it should have been. They provided patches to fix this which were applied by the RNG maintainer.

And then people started seeing issues, mostly a lot of messages about crng_init. It turns out, there were a lot of places in the kernel that were trying to get random numbers early in the kernel boot process that weren't as random as they might expect. Fedora had a particularly nasty problem where the compose machines were getting stuck. Trying to get more logs from the systemd journal didn't help. Eventually after some debugging with the infrastructure team (and the help of sendkey alt-sysrq-t in the qemu monitor window), we were able to see that init was blocked on the getrandom system call for secure entropy. Interestingly enough, systemd only made non-blocking (insecure) random calls in its code.
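
To make the blocking vs. non-blocking distinction concrete, here is a minimal sketch using Python's `os.getrandom` wrapper around the same system call. It is only an illustration of the flag, not what systemd or gcrypt actually do; the helper name `try_getrandom` is made up. Called with no flags, getrandom sleeps until the pool is initialized, which is where init was stuck. With `GRND_NONBLOCK` it fails right away instead.

```python
import os


def try_getrandom(nbytes=16):
    """Return random bytes, or None if the kernel pool is not ready yet."""
    if not hasattr(os, "getrandom"):
        raise NotImplementedError("os.getrandom needs Linux and Python 3.6+")
    try:
        # GRND_NONBLOCK: fail immediately instead of sleeping until the
        # kernel considers its random pool initialized.
        return os.getrandom(nbytes, os.GRND_NONBLOCK)
    except BlockingIOError:
        return None


if __name__ == "__main__":
    data = try_getrandom()
    print("pool not ready yet" if data is None else data.hex())
```

On a system that has been up for a while this will just print some hex. The interesting case is very early in boot, where the call returns None instead of hanging.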

I was lucky I could re-build kernels to reproduce the issue, so I decided to experiment a bit and return something unexpected from getrandom (-ENOMEM). This gave me an error message that (luckily) uniquely mapped to gcrypt. systemd links against gcrypt for some features, such as calculating an HMAC for the journal entries. None of that involved random numbers at bootup though so it didn't explain why things were getting stuck. After some more back and forth, Patrick Uiterwijk found a patch that gcrypt was carrying. If FIPS mode is enabled, the cryptographic system is initialized at constructor time (i.e. when it gets loaded by systemd). It turns out, the default images ship with dracut-fips which will put gcrypt into FIPS mode. So the very first time systemd went to open the journal to write something, it would load gcrypt which would attempt to initialize the random number system. (Fun fact, it also looks like the default in systemd is to do a write to a journal before the commandline options are parsed. So even adding an option to not write to the journal didn't help this case. I might be wrong here though?)

Despite the fact that these patches have some side-effects, they do fix a real issue and can't exactly just be permanently reverted. So what to do? One easy answer is to give the system more entropy. On virtualized systems, this can be provided by the CONFIG_HW_RANDOM_VIRTIO option. Part of the fix also involves making sure userspace isn't actually trying to rely on secure random number generation too early since randomness is hard to come by early in boot. At least for now in Fedora, we've temporarily reverted the random series on stable (F27/F28) releases. The plan is to continue working with upstream and userspace developers to find a workable solution and bring back the patches when things are fixed.
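
If you want to check whether a guest actually has an entropy source, a quick sketch like the following may help. It assumes the standard procfs and sysfs locations (`/proc/sys/kernel/random/entropy_avail` and `/sys/class/misc/hw_random/rng_current`); the helper names are made up, and a missing file is simply reported as None.

```python
from pathlib import Path

ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")
RNG_CURRENT = Path("/sys/class/misc/hw_random/rng_current")


def read_text_or_none(path):
    """Return the stripped contents of path, or None if it can't be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def rng_status(entropy_path=ENTROPY_AVAIL, rng_path=RNG_CURRENT):
    """Report the kernel's entropy estimate and the active hardware RNG."""
    entropy = read_text_or_none(entropy_path)
    return {
        "entropy_avail": int(entropy) if entropy is not None else None,
        "hw_rng": read_text_or_none(rng_path),
    }


if __name__ == "__main__":
    print(rng_status())
```

The paths are parameters so the function can be pointed at test files instead of the live system.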


LSF/MM 2018

| categories: fedora

Wheee LSF/MM. Highlights.

  • There were a couple of sessions related to everyone's favorite mitigation, PTI. Unsurprisingly, PTI has an impact on I/O heavy workloads because it makes system calls more expensive. Minimizing system calls has always been good performance advice, and it only becomes more important with PTI. Also important is to make sure features like the vDSO are used since that helps to mitigate the system call cost. There was some discussion about whether we need new system calls that take vectors to help with TLB flushing costs (e.g. multiple madvise calls will require a flush each time).

  • Ted Ts'o gave a session on fs-verity. This is a file system feature for file integrity that's mostly been focused on Android. The functionality is similar to what IMA wants to provide but focuses on immutable files. Looks promising.

  • Speaking of IMA, Mimi Zohar gave a presentation on IMA with a focus on file system topics. The goal of IMA is to provide cryptographic verification of files with keys tied to the TPM. The nature of this means it ends up touching file system internals and some things it perhaps shouldn't (e.g. file system magic numbers). There's been good progress towards making IMA more acceptable to everyone.

  • Igor Stoppa talked about protectable dynamic memory (called pmalloc). The goal is to allow read only protection of memory that can't be statically allocated. This is a patch series I've been reviewing/following for some time now. Overall, feedback seemed promising for it to get merged soon.

  • I talked briefly about CMA (my primary reason for attending). CMA relies on alignment to pageblock size since it is tied to migration. On arm64 with 64K pages, the pageblock size gets bumped to 512MB which is a bit much. I discussed some approaches to loosening that requirement. The two big options are either just make the pageblock size smaller if we aren't using THP or just let CMA exist as a subpageblock region (some patches were recently merged by Joonsoo Kim to make this much easier). Both are plausible, now all I need to do is write the code (alas).

  • Matthew Wilcox talked about struct page. Kernel documentation has been notoriously lacking for internal APIs and structures. A struct page exists for each page of memory on the system which means it needs to be as compact as possible. The end result is a difficult to understand structure. Matthew proposed rearranging the structure to better clarify the actual usage and make it clear what fields of the structure outside users could actually use. It seemed to be well received so I expect to see it on the mailing list sometime.

  • I got a chance to chat with some Red Hat people about dm-vdo. I had seen this discussed internally somewhat before but the hallway track provided a much better explanation to me of both the details of the code and some of the potential pitfalls. The hallway track is always great.
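
The 512MB figure from the CMA item falls out of some simple arithmetic. The sketch below assumes the pageblock covers one PMD-level huge page and that page table entries are 8 bytes, so a table that fills one page holds page_size / 8 entries. The function name is made up for illustration.

```python
def pageblock_bytes(page_shift, pte_bytes=8):
    """Size of one PMD-level block for a given base page size (as a shift)."""
    # A page table occupies one page, so it holds page_size / pte_bytes entries.
    entries_per_table = (1 << page_shift) // pte_bytes
    # Number of address bits translated by one table level.
    bits_per_level = entries_per_table.bit_length() - 1
    # One PMD entry maps a full last-level table worth of pages.
    pmd_shift = page_shift + bits_per_level
    return 1 << pmd_shift


if __name__ == "__main__":
    for name, shift in (("4K", 12), ("16K", 14), ("64K", 16)):
        print(f"{name} pages: {pageblock_bytes(shift) >> 20} MB")
```

With 4K pages this works out to 2MB. With 64K pages each table holds 8192 entries, so the same level covers 64K * 8192 = 512MB.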

LWN already has some coverage up but watch there for much better details on all the sessions.
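
As a footnote to the PTI item, here is a small sketch of what "take a vector instead of making several calls" looks like from userspace. It uses the existing writev call through Python's `os.writev` as a stand-in. The vectored calls discussed in the session were about madvise and did not exist at the time, so this only shows the general idea. The helper names are made up.

```python
import os


def write_individually(fd, chunks):
    """One write(2) per chunk: the kernel is entered len(chunks) times."""
    return sum(os.write(fd, chunk) for chunk in chunks)


def write_vectored(fd, chunks):
    """One writev(2) for all chunks: the kernel is entered once."""
    return os.writev(fd, chunks)


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    chunks = [b"one ", b"two ", b"three\n"]
    written = write_vectored(write_end, chunks)
    print(written, os.read(read_end, written))
    os.close(read_end)
    os.close(write_end)
```

Both helpers move the same bytes. The difference is how many times the process crosses into the kernel, which is exactly the cost PTI makes more expensive.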


Kbuild tricks

| categories: fedora

Several of the tasks I've worked on recently have involved looking at some of the kernel's build infrastructure. This is all fairly well documented which makes it nice to work with.

The kernel automatically generates some files at build time. This is mostly set up to be transparent to developers unless they are looking for them. The majority of these files are headers at include/generated. A good example of something which needs to be generated is the #define representing the kernel version (e.g. 4.15.12). The header file include/generated/bounds.h contains #defines for several enum constants calculated at build time. Cleverly, most of these files are only actually replaced if the generated output changes to avoid unnecessary recompiles. Most of this work is handled by the filechk macro.
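
The "only replace if the output changed" trick is worth stealing for other build systems. Below is a minimal Python sketch of the idea. It is not the actual filechk implementation (which lives in the Kbuild makefiles), and the function name `write_if_changed` is made up.

```python
import os
import tempfile


def write_if_changed(path, content):
    """Write content to path only if it differs. Return True if replaced."""
    try:
        with open(path, "r") as existing:
            if existing.read() == content:
                # Identical output: leave the file (and its mtime) alone so
                # nothing that depends on it gets rebuilt.
                return False
    except FileNotFoundError:
        pass
    # Write to a temporary file in the same directory, then rename it over
    # the target so readers never see a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(content)
    os.replace(tmp_path, path)
    return True
```

Because an unchanged file keeps its old timestamp, make sees nothing newer than the objects that include it and skips the rebuild.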

The C preprocessor is typically used on C files, as one might obviously expect. It's not actually limited to C files though. Each architecture has to define a linker script which meets the architectural requirements. The linker language is common across architectures so it's beneficial to have common definitions for typical sections such as initcalls and rodata. There's a global rule to run the pre-processor on any .lds.S file. Devicetree files also get preprocessed which avoids a lot of copy and pasting of numerical defines.

The compiler flags are typically set in the top level Makefile and named as you might expect (CFLAGS, CXXFLAGS etc.). The process of building the kernel requires building a number of smaller programs. The C flags for these programs are controlled by a different set of variables (HOSTCFLAGS). It sounds incredibly obvious but I've lost time from my day trying to figure out why options set in CFLAGS weren't being picked up by the host compiler. For more fun, it's possible to use environment variables to set different flags for compiling built-in vs. module code. The moral of the story is know what you're setting.

Debugging build infrastructure isn't always pleasant but the kernel build system isn't too bad overall. I'm at least beginning to understand more parts of it as I find increasingly more obscure things to modify.

