I recently implemented a fun little feature for systemd: inspired by MacOS' "target disk mode", a tiny tool called systemd-storagetm, that exposes all local block devices as NVMe-TCP devices, as they pop up. The idea is that if available in your initrd you can just boot into that (instead of into your full OS), and can access your disks via NVMe-TCP (in case you wonder what that is: it's the new hot shit for exposing block devices over the network, kinda like iSCSI, NBD, …, but cool).
@mjg59 hmm, so the hibernation thing, what i never grokked: isn't it entirely sufficient to just bind encryption to a secret locked to PCRs 0…7? that's a very simple policy that just says: if you want to unlock the hibernation image you have to boot the same system with all the same components, including the kernel?
I mean, usually the PCR brittleness issue is what makes policies like that unrealistic in reality, but in a simple hibernation/thawing cycle that shouldn't be as much of a problem.
@mjg59 then attach a quote of the same pcrs to the hibernation image. And some nonce that binds image and quote together. That would mean the same boot path is necessary to generate and use a hibernation image.
@mjg59 here's another idea: generate the hibernation encryption key already during early boot (i.e. any time before you allow userspace to do the first TPM interaction), keep it in mem (i.e. kernel keyring or so). Bind the key to PCR 0..7 + PCR 9. Then measure some separator into PCR 9, to make the key unretrievable by userspace.
Now you have the guarantee that userspace cannot retrieve the key, nor gen one anymore, but as long as the same boot path is booted the key will be avail to the kernel.
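To see why the separator trick works, here's a toy simulation in Python. It is only a sketch: no real TPM is involved, `policy_digest` is a simplified stand-in for a real TPM2 PCR policy, and the boot components and the separator string are made up for illustration. The one thing taken from actual TPM behaviour is the extend operation (new PCR value = hash of old value plus the measured digest).

```python
import hashlib
from typing import Dict, List, Tuple

SELECTION = list(range(8)) + [9]  # PCRs 0..7 plus PCR 9


def extend(pcr: bytes, data: bytes) -> bytes:
    """TPM-style extend: new = SHA256(old || SHA256(data))."""
    return hashlib.sha256(pcr + hashlib.sha256(data).digest()).digest()


def boot(measurements: List[Tuple[int, bytes]]) -> Dict[int, bytes]:
    """Replay a boot path into a fresh PCR bank (toy model: all PCRs start zeroed)."""
    pcrs = {i: bytes(32) for i in range(24)}
    for index, data in measurements:
        pcrs[index] = extend(pcrs[index], data)
    return pcrs


def policy_digest(pcrs: Dict[int, bytes], selection: List[int]) -> bytes:
    """Simplified stand-in for a PCR policy: hash over the selected PCR values."""
    h = hashlib.sha256()
    for i in selection:
        h.update(pcrs[i])
    return h.digest()


# Hypothetical boot path, purely for illustration.
BOOT_PATH = [(0, b"firmware"), (4, b"boot loader"),
             (7, b"secure boot policy"), (9, b"initrd")]

# Early boot: the kernel binds the key to the current PCR state.
pcrs = boot(BOOT_PATH)
sealed_policy = policy_digest(pcrs, SELECTION)

# Before userspace gets TPM access, the kernel measures a separator into PCR 9.
pcrs[9] = extend(pcrs[9], b"separator")
userspace_policy = policy_digest(pcrs, SELECTION)

# Next boot with the same components: the early-boot state matches again.
next_boot_policy = policy_digest(boot(BOOT_PATH), SELECTION)

print(userspace_policy == sealed_policy)  # False: userspace can't satisfy the policy
print(next_boot_policy == sealed_policy)  # True: same boot path, kernel can unseal
```

Since an extend can't be undone, userspace has no way to get PCR 9 back to its pre-separator value without a reboot.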
@mjg59 and afaics you don't even need any quotes or so in this case: the kernel can get the encryption key, userspace couldn't possibly, whatever it does.
this appears really simple and minimalistic to me. And given that kernel already measures stuff to PCR 9 it's not even licking a cookie that wasn't licked before by the kernel.
I mean, if I understood your talk correctly your more recent ideas were built around "fake PCRs", i.e. "hybrid" nv indexes? those make NV changes too, hence you already were almost there, accepting using NV as storage.
Linux could just pick some fixed nv index for this (and maybe a kernel param to change it). It's rather unlikely that people intend to "multi-boot" two hibernation images and expect things to just work.
@mjg59 hmm, when i read that in the spec my understanding was that the nv indexes are always allocated persistently, but the contents might get reset on reboot.
@mjg59 why? I mean, if you want to store the encryption key in the swap file then yeah this complicates things a bit. But as mentioned i think for this a fixed allocated nvindex is fine, because (at least that's my assumption) one active hibernation image per system is enough.
@mjg59 but if you don't want to use an nvindex, that's fine too: just outsource the problem to userspace, i.e. next to /sys/power/resume and /sys/power/resume_offset add /sys/power/resume_key, and you can read the wrapped, marshalled key from there or write it there. Then, userspace would just write that whenever it writes the devnum and offset anyway, it's just one more thing it needs to figure out somehow. In systemd we'd for example write the key to the ESP or so and pick it up from there.
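Roughly what the userspace side of that could look like, as a Python sketch. Note that `resume_key` is purely the hypothetical attribute proposed above, it doesn't exist in current kernels; the hex encoding of the wrapped key and the write order (offset and key first, device number last) are assumptions of this sketch too. The demo runs against a temporary directory rather than the real /sys/power.

```python
import tempfile
from pathlib import Path


def write_resume_config(power_dir: Path, devnum: str, offset: int,
                        wrapped_key: bytes) -> None:
    """Write the resume location plus the wrapped key below power_dir."""
    (power_dir / "resume_offset").write_text(f"{offset}\n")
    # Hypothetical attribute from the proposal; the hex format is an assumption.
    (power_dir / "resume_key").write_text(wrapped_key.hex() + "\n")
    # Device number in "major:minor" form, written last.
    (power_dir / "resume").write_text(f"{devnum}\n")


# Demo against a scratch directory standing in for /sys/power.
with tempfile.TemporaryDirectory() as tmp:
    power = Path(tmp)
    write_resume_config(power, "254:1", 2048, b"\x01\x02\x03")
    written = {p.name: p.read_text().strip() for p in power.iterdir()}

print(written)
```

Where the wrapped key comes from (ESP, LUKS metadata, kernel cmdline) is left entirely to the caller here.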
@mjg59 but this key could also be written to the LUKS metadata (it's nicely extensible), so that it is closely associated with the volume it is primarily used for.
or people could even embed it into the kernel cmdline if they like. endless possibilities...
that all said, I still think nvindex would be the obvious choice, so much simpler and more robust.
@mjg59 BTW, one addendum: instead of dealing with PCRs and stuff I think a much simpler and nicer approach is to initialize/read the NV index in the kernel, before allowing userspace to access it, and setting TPMA_NV_READLOCKED|TPMA_NV_WRITELOCKED on it. This means the key can be read/written exactly once until the system is reset. That makes things super simple: during early boot in the kernel read it/reinitialize it, and then you can be sure later stuff cannot fuck with it anymore.
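Here's the lock-until-reset idea as a tiny Python model. It's a sketch, not TPM code: the `NvIndex` class and `early_boot` helper are invented for illustration, and the model assumes the index contents survive a reset while the read/write locks get cleared by it.

```python
class LockedError(Exception):
    """Raised when a locked NV index is accessed."""


class NvIndex:
    """Toy model of an NV index that can be locked until the next reset."""

    def __init__(self, data: bytes = b"") -> None:
        self._data = data
        self._read_locked = False
        self._write_locked = False

    def read(self) -> bytes:
        if self._read_locked:
            raise LockedError("read locked until reset")
        return self._data

    def write(self, data: bytes) -> None:
        if self._write_locked:
            raise LockedError("write locked until reset")
        self._data = data

    def lock(self) -> None:
        # Models both lock bits being set at once.
        self._read_locked = self._write_locked = True

    def reset(self) -> None:
        # Assumption: contents persist across reset, the locks do not.
        self._read_locked = self._write_locked = False


def early_boot(nv: NvIndex, fresh_key: bytes) -> bytes:
    """Kernel side: fetch the old key, store a new one, then lock the index."""
    old_key = nv.read()
    nv.write(fresh_key)
    nv.lock()
    return old_key


nv = NvIndex(b"key-from-previous-boot")
old_key = early_boot(nv, b"key-for-this-boot")

# Later, userspace tries its luck and is refused.
try:
    nv.read()
    userspace_blocked = False
except LockedError:
    userspace_blocked = True

# After a reset, the next early boot can read the key again.
nv.reset()
after_reset = nv.read()

print(old_key, userspace_blocked, after_reset)
```

The property that matters is that everything between `lock()` and `reset()` is refused, no matter who asks.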
So for my prediction stuff I want the measurements, not the results so much. i.e. I want to be able to recognize stuff in the event log, and for that it's crucial to know what was measured there as individual hashes, not just the result.
And I want all PCRs that firmware might influence. Which is a lot, basically 0…6.
(And for the SecureBoot policy update it would make sense to have the same data for PCR 7 btw).
My favourite format would be a subset of the TCG JSON-CEL.
@hughsie @bluca systemd git will now generate measurement logs in JSON-CEL, for the various things it now measures. Hence to me it would be most natural to just make fwupd provide JSON-CEL style expected measurement logs for each firmware update + secureboot policy update.
I say JSON-CEL "style" instead of proper JSON-CEL, because the "recnum" field should not be included, since that depends on extension order across PCRs, which is irrelevant. Hence simply don't do recnum, and good.
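A minimal sketch of what "JSON-CEL style without recnum" means in practice, in Python. The record layout (`pcr`, `digests` with `hashAlg`/`digest`) follows my reading of the CEL JSON encoding and should be treated as an assumption; the records themselves are placeholders, with digests computed from dummy strings rather than real firmware measurements.

```python
import hashlib
import json
from typing import Any, Dict, List


def strip_recnum(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the records with the order-dependent "recnum" dropped."""
    return [{k: v for k, v in rec.items() if k != "recnum"} for rec in records]


def placeholder_digest(blob: bytes) -> str:
    """Digest of a dummy blob; stands in for a real measurement."""
    return hashlib.sha256(blob).hexdigest()


# Placeholder records, not taken from any real event log.
example_log = [
    {"recnum": 0, "pcr": 0,
     "digests": [{"hashAlg": "sha256",
                  "digest": placeholder_digest(b"dummy firmware blob")}]},
    {"recnum": 1, "pcr": 7,
     "digests": [{"hashAlg": "sha256",
                  "digest": placeholder_digest(b"dummy secureboot policy")}]},
]

stripped = strip_recnum(example_log)
print(json.dumps(stripped, indent=2))
```

What's left per record is just the PCR index and the individual hashes, which is all that's needed to recognize an entry in the event log.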
This whole mess just makes me think we should try harder to kick suid/fcaps out of general purpose Linux distributions. The whole concept is fundamentally backwards, and one of the major weaknesses of traditional UNIX I am sure. The idea behind suid/fcaps of first granting the privileges, inheriting some major, uncontrolled part of the execution environment/resource context/security context and then expecting the binary to securely gate its misuse is just a major mistake: https://www.openwall.com/lists/oss-security/2023/10/03/2
@xexaxo CAP_SYS_NICE is the wrong tool for the job. use RLIMIT_NICE instead.
I guess we could just raise that level for all sessions logging in via pam_systemd, so that they maybe can acquire nice level -5 without any complications if they like
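To make the RLIMIT_NICE arithmetic concrete: per getrlimit(2), the lowest nice value a process may set is 20 minus the soft limit, so allowing nice level -5 means a limit of 25. A small Python sketch (the helper names are mine, and it assumes Linux, where `resource.RLIMIT_NICE` is available):

```python
import resource


def nice_floor(rlimit_nice: int) -> int:
    """Lowest nice value reachable under the given RLIMIT_NICE soft limit."""
    if rlimit_nice == resource.RLIM_INFINITY:
        return -20
    # Ceiling is 20 - limit, clamped to the valid nice range -20..19.
    return max(-20, min(19, 20 - rlimit_nice))


def rlimit_for_nice(nice: int) -> int:
    """RLIMIT_NICE value needed so that the given nice level is allowed."""
    return 20 - nice


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NICE)
    print(f"current soft limit {soft}: nice down to {nice_floor(soft)} allowed")
    print(f"limit needed for nice -5: {rlimit_for_nice(-5)}")
```

Unlike CAP_SYS_NICE this doesn't hand out anything beyond the right to lower the nice level down to that floor.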
UNIX access controls suck though, since they control access to objects, not operations. And they are incompatible with potentially interactive authentication. Both of these things are what Polkit brings to the table: you authenticate actions, and you can allow them to require re-authentication by a user, interactively.
@dalias @jamesh @mariusor and what's even worse: they are permanent. file ownership/ACL entries are persistently made on some inode, and there's no scheme to clean that up again (unless some – brittle – logic cleans this up manually). Moreover if you have access to an inode you basically have access to it forever, just by keeping an fd open.
UNIX access control works for simple, relatively static, non-interactive cases, but Polkit is precisely supposed to fill the gap where that's not enough.
@suqdiq there are plenty of special-purpose cases where people did this (I mean, the NNP option in systemd's configuration file exists precisely for those cases, and was added as a result of a request from one). However, what I am asking for here is that a generic Linux distro goes for this, and kills suid/fcaps for good.
@xexaxo i dont really grok why just making the gpu prio follow the nice level isnt enough. I mean in absence of detailed io sched and cpu sched configuration both schedulers follow the generic nice level too. So why not do the same for gpu scheduling?
Or from the opposite PoV, when would you want a process that gets prio on the gpu but is scheduled with low prio on cpu & io? At least on the typical desktop cases it sounds unnecessary to set more than a single generic nice level to me.