There will always be hardware bugs

By now everyone has seen the latest exploits, Meltdown and Spectre, complete with logos and full academic papers. The gist is that side channel attacks on CPUs are now actually plausible instead of mostly theoretical. LWN (subscribe!) has a good collection of posts about the actual technical details and mitigations. Because this involves hardware and not just software, the fixes get more complicated.

In my previous job, I worked on kernels for mobile phones. This involved working with new hardware. I love working with hardware, but one thing you learn pretty quickly is that hardware will have bugs. Sometimes the hardware team has already found them and will give you a workaround. Other times you spend weeks chasing weird crashes and going back and forth with the hardware team. One of the challenges of working across teams in any area is communicating your own domain expertise and listening to others' expertise. There can be a lot of “well how about we just…” and talking past each other.

Once upon a time, some hardware was not working the way we expected and we were talking to the hardware team. They were having trouble reproducing the behavior we saw on our complex Android stack, so we ran a series of experiments on our setups. Much of the actual work was figuring out how to take the requests from the hardware team and translate them into something reasonable for the kernel (e.g. where does “after each TLB flush” apply?). Sometimes the experiments weren’t actually feasible because of how the kernel was written.

If you are lucky (or unlucky, depending on your view), you may find a hardware bug. The question then becomes what to do. There may, again, be back and forth about what’s actually an acceptable workaround. “Just run this sequence of code sometimes” may sound simple to the hardware team but might be impractical to actually implement in the kernel. The performance penalties can be high if part of the microarchitecture needs to be turned off. Sometimes the answer turns out to be “pretty please don’t run this sequence of code which should never get generated by a reasonable compiler”. Obviously, if an issue has security implications you may need to just take the performance hit, but not implementing a workaround can be a valid decision.
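To make that trade-off a bit more concrete, here is a minimal sketch of a revision-gated errata table, loosely in the spirit of how kernels decide which workarounds to turn on. Everything in it is hypothetical: the erratum IDs, the revision numbers, the slowdown figures and the `max_slowdown_pct` threshold are all invented for illustration and don't describe any real processor or any real kernel's policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Erratum:
    """One hypothetical hardware bug and the cost of working around it."""
    ident: str            # made-up erratum ID
    first_rev: int        # first affected hardware revision (inclusive)
    last_rev: int         # last affected hardware revision (inclusive)
    security: bool        # does the bug have security implications?
    slowdown_pct: float   # rough cost of enabling the workaround


# Invented entries; a real table would come from the hardware team.
ERRATA = [
    Erratum("HYPO-001", first_rev=1, last_rev=2, security=True, slowdown_pct=12.0),
    Erratum("HYPO-002", first_rev=1, last_rev=3, security=False, slowdown_pct=0.5),
    Erratum("HYPO-003", first_rev=2, last_rev=2, security=False, slowdown_pct=20.0),
]


def workarounds_to_enable(cpu_rev, max_slowdown_pct=5.0):
    """Return the IDs of the workarounds worth enabling on this revision.

    Security fixes are always applied, whatever they cost. Anything else
    is applied only when it is cheap enough, so an expensive workaround
    for a non-security bug gets skipped.
    """
    enabled = []
    for e in ERRATA:
        if not (e.first_rev <= cpu_rev <= e.last_rev):
            continue  # this revision is not affected
        if e.security or e.slowdown_pct <= max_slowdown_pct:
            enabled.append(e.ident)
    return enabled
```

With this made-up table, revision 2 is affected by all three errata, but `workarounds_to_enable(2)` skips HYPO-003: it is expensive and has no security impact, which is exactly the "valid decision not to implement a workaround" case.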

Part of the discussion around all this has been a call for more open source hardware. This is absolutely a worthwhile goal. Most processors support adjusting various microarchitecture features. This is mostly for verification purposes, but it’s also useful if there’s a need to disable a feature such as a prefetcher or branch predictor. The microarchitecture is usually considered proprietary, and as such it’s next to impossible to figure out how to make changes without consulting the hardware team. An open source hardware design would allow for better insight into the microarchitecture. What most people miss about open hardware is that you still have all the problems of hardware. Unless you’re running on an FPGA, you can’t just drop in a new hardware revision immediately. You’re still going to have to implement software workarounds. The value of open hardware comes from freedom of licensing, not freedom from bugs.
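For a rough idea of what "adjusting a microarchitecture feature" looks like from the software side, here is a small sketch of the usual read-modify-write on a control register. The register is modelled as a plain integer, and the bit positions and names (`PREFETCH_DISABLE`, `BRANCH_PREDICT_DISABLE`) are assumptions made up for this example. On real hardware the layout is vendor specific, usually undocumented, and only reachable from privileged code, which is why you end up asking the hardware team.

```python
# Hypothetical bit positions in a made-up auxiliary control register.
PREFETCH_DISABLE = 1 << 3
BRANCH_PREDICT_DISABLE = 1 << 7


def set_feature_disabled(reg_value, feature_bit, disabled):
    """Read-modify-write: change one disable bit, leave the others alone."""
    if disabled:
        return reg_value | feature_bit
    return reg_value & ~feature_bit


def is_feature_disabled(reg_value, feature_bit):
    """Report whether the given disable bit is currently set."""
    return bool(reg_value & feature_bit)
```

The mechanics are trivial. The hard part is knowing which bit to flip, what else it affects, and how much performance it costs, and none of that is visible without insight into the design.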

Calling all this an “Intelocolypse” is deeply unfair, as basically all modern processors from multiple vendors were affected here. The flaw is fundamental to how most implementations handle speculative execution. It’s certainly possible for each vendor/architecture to provide its own workaround, but because of the severity here, there are proposals to fix this in generic kernel code. As has been mentioned though, many of the fixes are still under review, so we’ll have to see what happens. A big shout-out to all the hardware and software developers who spent time coming up with proposals.