When you press the power button of your computer, a myriad and one hardware and software components spring into action at once. Memory must be probed, CPU features enumerated, peripherals initialized, and operating systems discovered, all before you see those familiar kernel printk log lines appear on the console. This part of the boot process is arcane and legacy-ridden. It is characterized by write-once code that is seldom looked at in the software industry outside of hardware vendors and BIOS authors, even though it is run by virtually every single computer on Earth. It should not be surprising that this is the case: once a piece of x86 assembly was written that initializes the 64-bit long mode of modern Intel CPUs, why would anyone want to change that code, ever?
Today, however, the advent of Confidential Computing, and VM-based isolation in particular, puts us in an awkward position. Confidential Computing (CC) provides us with never-before-seen security guarantees enabling use cases that simply weren’t possible before. However, it also assumes an attacker with extreme capabilities. To be precise, the threat model of such technologies is so strong that the hardware, or rather, the hypervisor acting on behalf of the hardware, cannot be trusted anymore! To defend against such a powerful adversary, all we can do is dust off the manuals and rewrite the ancient boot code that has been working fine for so many years.
All the firmware/BIOS code in use today was written with the assumption that the hardware and the hypervisor are cooperating with the software to orchestrate a successful boot. But with the Confidential Computing attacker model, this assumption does not hold anymore. If an attacker can take control of the hypervisor then they can report fake devices, invalid memory locations, even raise exceptions that are completely unrelated to the instruction being executed. Such manipulations open a wide spectrum of attack vectors allowing the attacker to divert the computation’s control flow, and with it compromise all security properties that Confidential Computing is supposed to provide.
In the following, we will describe our journey into the realm of AMD SEV-SNP, the first semi-complete, hardware-isolated, and VM-based CC technology to date (although Intel’s TDX is also around the corner). We will explain how we arrived at a hardened VM boot process, starting from a modified version of James Bottomley’s GRUB/OVMF-based boot method, and ending with a minimal qboot-based firmware that has first-class support for SEV-SNP and, crucially, does not rely on any input from the hypervisor (outside of the secure GHCB-based protocol for virtio-vsock). Note that this article is heavily technical and assumes a cursory understanding of what a boot process looks like, as well as basic security concepts. However, we will try to introduce Confidential Computing terminology as we need it.
But first, what is the point of Confidential Computing anyway? You may have heard various buzzwords and slogans around the concept, but the core idea is pretty simple: Confidential Computing gives you a way to prove that a remote machine is doing what you think it’s doing, even if that remote machine has been hacked. The two main properties required for this are:
- Isolation: encryption and integrity checking of memory that the computation is operating on.
- Attestation: cryptographic signature of hardware vendor proving that the computation has been isolated.
When you write a piece of Python script and run it locally on your computer, you can be relatively sure that the result you see printed on the screen actually corresponds to the script you have written. This is because you have full control over the machine, and you trust that all software and hardware components that orchestrate the printing of that result are working correctly. Confidential Computing allows you to extend this knowledge to remote computation.
We must make a disclaimer here that “Confidential Computing” actually has many definitions, including ones that do not necessitate the use of special hardware. The core idea behind it is this concept of “remotely trustable compute”, and there are also many efforts aiming to implement it with mathematical cryptographic primitives. Hardware-isolation-based implementations are also commonly referred to as Trusted Execution Environments or TEEs, and in this article, we will be using the two terms synonymously. For a solid introduction to the current TEE landscape, see this article.
On the shoulders of SGX
The first widely available and mature technology that enables Confidential Computing is Intel’s Software Guard eXtensions. This technology ensures that the data being operated on by the isolated piece of code (the enclave) is never present in plain-text form in DRAM: essentially all data gets encrypted/decrypted as it leaves/enters the CPU package. The hardware also provides integrity and freshness guarantees over said data, so we have the extremely strong confidentiality+integrity+freshness trio on memory as a basic building block. SGX also provides attestation capabilities to prove the isolation property to a remote party: the best way to think about this is as an authenticated Diffie-Hellman key exchange with the SGX-isolated code itself, where attestation = authentication of the isolated code’s hash. Furthermore SGX provides another useful primitive, sealing: this is a key derivation process rooted in per-chip unique entropy, which can be exactly reproduced by a specific enclave even after reboots. With these sealing keys we can persist sensitive data in a durable way, as long as we can retain access to the CPU itself.
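To make the sealing property a bit more tangible, here is a toy sketch in Python. It is not how SGX derives keys internally; the function name, the `chip_secret` value, and the use of HMAC-SHA256 are assumptions made purely for illustration. The only point is that a key derived from a per-chip secret and the enclave's code hash is reproducible for the same enclave on the same chip, and different for anything else.

```python
import hashlib
import hmac


def derive_sealing_key(chip_secret: bytes, enclave_measurement: bytes) -> bytes:
    """Toy derivation: bind a key to a chip-unique secret and an enclave's code hash."""
    return hmac.new(chip_secret, b"seal" + enclave_measurement, hashlib.sha256).digest()


# Hypothetical values standing in for per-chip entropy and enclave code hashes.
chip_secret = b"\x01" * 32
other_chip_secret = b"\x02" * 32
enclave_a = hashlib.sha256(b"enclave A code").digest()
enclave_b = hashlib.sha256(b"enclave B code").digest()

# The same enclave on the same chip reproduces the same key, e.g. across reboots.
key_before_reboot = derive_sealing_key(chip_secret, enclave_a)
key_after_reboot = derive_sealing_key(chip_secret, enclave_a)

# A different enclave, or the same enclave on another chip, gets a different key.
key_other_enclave = derive_sealing_key(chip_secret, enclave_b)
key_other_chip = derive_sealing_key(other_chip_secret, enclave_a)
```

Data encrypted under `key_before_reboot` can therefore only be recovered by the same code running on the same CPU, which is why access to the chip itself has to be retained.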
So with SGX we have strong isolation guarantees, attestation capabilities, and encrypted data at rest... why do we need AMD SEV and Intel TDX again? The answer is painfully obvious to anyone who has ever tried to actually use SGX: the programming model. One of SGX’s most appealing properties from a security perspective is also its most appalling one from a usability point of view: the operating system is treated as malicious. In practice this means that any syscall interrupt coming from the isolated code will immediately exit the enclave context. Intel made this choice because syscalls like reading or writing to a file descriptor are inherently unsafe from a Confidential Computing standpoint. The choice of not allowing these calls means that the enclave developer has to be explicit about what data they allow to flow to and from the untrusted part of the software. However, this also means that approximately zero pre-existing applications work inside SGX (and the same goes for programming languages, including C because of its widespread dependency on libc), so everything must be written from scratch. The lack of syscalls also means that there is no fork(), or any kind of OS-assisted process management capability. SGX effectively gives us process-level single address space isolation, reminiscent of the old DOS days. Good luck running Python FFI code in SGX that runs ld to find a shared object on the filesystem, to be then loaded and executed dynamically...
Needless to say, this has not exactly helped adoption. There have been various efforts to bridge this usability gap like SCONE, Graphene, Fortanix containers, LKL-SGX and many others. The two main approaches were either to effectively reintroduce syscalls through the enclave-host ABI, or in the case of LKL to completely embed a library OS into the enclave. These approaches are highly involved, have restricted functionality, and in the case of syscall-bridging they (re)introduce a new dimension for attacks. Furthermore, they increase the size of the so-called TCB. TCB stands for Trusted Computing Base, and it encompasses all software and hardware components that are assumed to be trusted when reasoning about the security of an enclave. A general security design principle is to keep the TCB as small as possible: the larger it is, the harder it becomes to audit and the more potential there is for vulnerabilities.
Instead of committing to one of these technologies, we decided to bite the bullet and go for the from-scratch approach where we have full control over the security guarantees. (As a side note, this decision resulted in rolling our own distributed SQL engine inside SGX, which perhaps deserves its own article). We knew that a much more usable alternative was around the corner.
The most natural solution to the usability issue is of course using an already existing abstraction: virtualization (AMD-V/Intel VT-x)! If we could somehow protect a full kernel instead of just a single process, then suddenly everything becomes much easier. There is nothing restricting the compute from making syscalls, as the kernel itself serving those syscalls would be part of the isolated code. One important trade-off that must be mentioned here is that with the inclusion of the operating system the secured code surface (the TCB) is also magnified, which can cause build reproducibility issues and widens the attack surface. Enabling secure virtualization is exactly what AMD and Intel have been tackling for the past years in the forms of SEV and TDX, and what we have been waiting for, as it allows us to enable basically any computation we want.
So what about SGX? Is it made obsolete by these VM-based approaches? The answer is no, and in fact Intel’s TDX relies on SGX to implement some components. The way to think about it is that SGX simply has different security and usability trade-offs than the VM-based approaches. For example, AMD SEV relies on a separate coprocessor (the PSP) and a custom firmware to implement early encryption, measurement and attestation functionality. This is a new component to attack, and in fact it has been the main focus of SEV-related security research to date. SGX simply does not have such a component. So in general the rule of thumb is: in terms of security SGX is king in the TEE world so it’s the recommended tech if you are writing an application from scratch, but if you want to run pre-existing software then use one of the VM approaches.
Dipping the toe: measured boot with GRUB
Our first tinkering project that involved AMD SEV-SNP was to explore James Bottomley’s boot approach which utilizes OVMF and GRUB. OVMF is the widely adopted firmware for virtual machines, and has extensive support for UEFI, including the full glory of a rescue shell, a graphical menu, everything boot-related. AMD and Intel have been working on OVMF and QEMU/KVM patches to add support for their respective technologies, so it was a natural starting point.
The original project by James focused on booting an encrypted operating system, and most of the custom code he added concerned the handling of this encryption key. In our case however, we don’t really care about the encryption of the OS disk. What we do care about is the integrity of the boot process and the OS - that the exact known version of the code has been loaded and the startup has not been tampered with. As long as we have code integrity, protecting the actual code contents doesn’t really matter, as secrets will be provisioned dynamically to encrypted memory once the VM has been attested.
When you power on a regular bare metal machine, there is a piece of firmware/BIOS installed on the motherboard that is responsible for early initialization and configuration of the machine. As part of this initialization the firmware looks for different devices which may contain an operating system - this can be a disk or e.g. a USB device. The BIOS then looks at the partition table of said devices (MBR or GPT), and figures out whether any of the partitions have the “bootable” flag set. It then selects one of these devices and hands control to the bootloader (like GRUB) installed on the partition. However, when booting a secure VM, we need to modify some of these steps.
The basis of all security guarantees of CC software is the measurement of the TCB: while the initial code image is loaded into the secure environment, it is hashed, and this hash can later be queried by the software to assert that the right code has been loaded. However, if we simply replicated the regular boot process when launching a secure VM, this measurement would only cover the firmware part itself! The loading of the bootloader and operating system from an external device must also be secured, otherwise we may load malicious software and we would be none the wiser. To this end, instead of loading the bootloader from disk, James burned GRUB into the firmware itself, which means that the measurement of the firmware covered the bootloader as well. However, the operating system itself was still not measured, only encrypted/decrypted. So our first project was to extend this by measuring the OS disk itself that GRUB loads - we added a --hash flag to the chainload GRUB command that causes the chainloaded OS disk to be measured. The GRUB configuration then included this flag with the OS disk hash, which in turn was burned into the firmware, which was in turn what got hashed, so we gained integrity of the loaded operating system.
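The chain of measurements described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual GRUB patch: the function names, the blob contents, and the exact syntax of the configuration line are made up, and SHA-256 stands in for whatever hash the real components use.

```python
import hashlib
import hmac


def measure(blob: bytes) -> str:
    """Return the hex digest used as the 'measurement' of a blob."""
    return hashlib.sha256(blob).hexdigest()


def load_measured(blob: bytes, expected_hex: str) -> bytes:
    """Refuse to hand over control unless the blob matches the pinned hash."""
    if not hmac.compare_digest(measure(blob), expected_hex):
        raise ValueError("measurement mismatch, refusing to boot")
    return blob


# Hypothetical OS disk and a made-up config line pinning its hash.
os_disk = b"pretend OS disk contents"
grub_cfg = f"chainload --hash {measure(os_disk)} (hd0)".encode()

# The config is embedded in the firmware, so the launch measurement of the
# firmware transitively covers the OS disk as well.
firmware = b"firmware and bootloader code" + grub_cfg
launch_measurement = measure(firmware)

booted = load_measured(os_disk, measure(os_disk))
```

Swapping the OS disk either fails the check at boot, or requires a different pinned hash, which changes the firmware blob and therefore the attested launch measurement.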
Although this worked fine, we also found that it is very easy to divert the boot control flow. OVMF and GRUB were written with a semi-interactive use-case in mind: they have rescue shells, they accept user input, and they query a lot of things from the hypervisor/hardware. This makes them rife with potential exploits. If an attacker can trigger a rescue shell (e.g. by replying to a device probe with a non-existent floppy disk device that returns errors), they basically can take over the “secure” VM completely, and hide malicious behaviour while attestation still checks out. So our biggest takeaway from this project was that we cannot really use GRUB as part of the boot process, and OVMF also seems to have too many features and has not been written defensively against a malicious hypervisor.
51,669 kvm:kvm_mmio
204,562 kvm:kvm_pio
(Will explain later... The suspense is real!)
Reducing the attack surface: pure OVMF boot
When we learned about the security issue with boot methods utilizing existing software, we raised it during the Linux Plumbers Conference dedicated to confidential VM technology. This sparked a lively discussion, but more importantly it marked the beginning of a collaboration that would ultimately produce a hardened boot process for AMD SEV-SNP VMs, which was our goal.
The first step was to again take OVMF, and initiate a direct boot of Linux without the involvement of a bootloader. To do this we still had to measure the OS, and we decided to go with a “monolithic blob” approach for the kernel measurement. When you launch a VM with e.g. QEMU, you can specify the initrd and the kernel cmdline as separate command line arguments. If you do so, QEMU will make these blobs available to the guest firmware using a custom protocol called fw_cfg, which can be used for communication between QEMU and the guest VM. The firmware (e.g. SeaBIOS) can then take these blobs and arrange them in memory to initiate a successful Linux boot. In order to simplify the measurement process, instead of handling these blobs separately we baked them into a single kernel image using kernel config options such as CONFIG_INITRAMFS_SOURCE. This way we only needed to check a single hash in the guest (OVMF) when loading the kernel from the hypervisor.
Additionally, we tried to remove as much of the hypervisor-firmware communication and as many extraneous features from OVMF as possible. This included virtio and disk support (note: virtio would be supported by the launched kernel though), console/shell support, PCI, SCSI and TPM support, suspend/resume, pflash probes and SMBIOS. To eliminate data exchange regarding the number of vCPUs and the size of memory, we decided to fix these values inside the firmware. In practice this means that the VMs created using this boot method cannot be dynamically sized: the measurement of the VM covers how many vCPUs and how much memory the VM will use. To facilitate burning in these parameters and also for burning in the loaded kernel’s hash, we added a custom OVMF metadata section to the firmware that contains these values. After building the firmware, the metadata at this specific offset can be modified by a tool to set the parameters as we want (and with it, change the measurement). Furthermore, we minimized the kernel itself by starting from a very small kernel configuration called tinyconfig, and slowly adding features required to boot a minimal system with virtio VSOCK support, and features needed to run our actual payloads. To illustrate, a stock Ubuntu 20.04 LTS distribution from Azure uses 2562 kernel configuration options (=y) and the bzImage is 14MB, whereas a minimal non-debug kernel we use that has support for running docker containers uses 340 options and the bzImage is only 2.9MB.
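A post-build patching tool of the kind described above could look roughly like the following Python sketch. The record layout, the `BOOTMETA` magic, and the function names are hypothetical and chosen only for this example; the real metadata section format is not shown in this article.

```python
import hashlib
import struct

# Hypothetical record: magic, vCPU count, memory size in bytes, SHA-256 of the kernel.
METADATA = struct.Struct("<8sIQ32s")
MAGIC = b"BOOTMETA"


def patch_metadata(firmware: bytes, offset: int, vcpus: int, mem_bytes: int, kernel: bytes) -> bytes:
    """Return a copy of the firmware with the metadata record at `offset` filled in."""
    if firmware[offset:offset + len(MAGIC)] != MAGIC:
        raise ValueError("no metadata section at this offset")
    record = METADATA.pack(MAGIC, vcpus, mem_bytes, hashlib.sha256(kernel).digest())
    return firmware[:offset] + record + firmware[offset + METADATA.size:]


def read_metadata(firmware: bytes, offset: int):
    """Read back (vcpus, mem_bytes, kernel_hash) the way the firmware would at boot."""
    _magic, vcpus, mem_bytes, kernel_hash = METADATA.unpack_from(firmware, offset)
    return vcpus, mem_bytes, kernel_hash


# A stand-in firmware image with an empty metadata record at offset 64.
blank = b"\x90" * 64 + METADATA.pack(MAGIC, 0, 0, bytes(32)) + b"\x90" * 64
kernel = b"pretend bzImage bytes"

two_cpus = patch_metadata(blank, 64, vcpus=2, mem_bytes=1 << 30, kernel=kernel)
four_cpus = patch_metadata(blank, 64, vcpus=4, mem_bytes=2 << 30, kernel=kernel)
```

Because the record lives inside the measured firmware image, `two_cpus` and `four_cpus` hash to different values: the VM size becomes part of what is attested, exactly as intended.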
792 kvm:kvm_mmio
74,379 kvm:kvm_pio
(The numbers are going down.. is that a good thing? 🤔)
Back to the drawing board: qboot-based AMD SEV-SNP boot
One of the most promising minimal x86 firmware candidates was qboot. It has a very small codebase, supports QEMU’s fw_cfg, and basically directly boots the kernel without doing many extraneous roundtrips to the hypervisor. There are several downsides, however. Qboot only supports a 32-bit Linux boot, and we need a 64-bit boot because we need access to the C-bit in the page table entries, which controls whether a page is encrypted or shared with the hypervisor. A shared page is needed to create the GHCB, which is the SEV way of securely communicating with the hypervisor. Moreover, qboot does not support other AMD SEV concepts like the PVALIDATE routine that validates the protected memory region, or the VC exception handler required for safely handling hypervisor interrupts, both of which are vital to SEV’s proper functioning. It also doesn’t have crypto primitives like SHA256, which we need in order to measure data coming from the hypervisor. Furthermore, QEMU looks specifically for OVMF metadata to determine how to load the VM into SEV, and qboot naturally doesn’t have this metadata.
To evaluate whether it’s worth investing time into a project exploring qboot, we wanted to get some concrete numbers on how much guest-hypervisor communication we can eliminate by switching to it from OVMF. To this end we used tracepoints in KVM to count IO and MMIO requests during an unmodified qboot boot, and compared it with numbers from our OVMF boot. These are the numbers you’ve been seeing at the end of each subsection, and what we’ve been using during the project as a rough measure of potential attack points during early boot. The tracepoints were counted using perf stat -e kvm:kvm_mmio -e kvm:kvm_pio, and our initial measurements showed that qboot has a much smaller attack surface than OVMF.
1,595 kvm:kvm_mmio
300 kvm:kvm_pio
(Initial measurements of IO and MMIO requests with an unmodified qboot)
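For anyone who wants to track the same metric, a small helper can turn the counter lines into numbers that are easy to compare between firmware builds. The sketch below is our own illustration; the function names are made up, and it only assumes that each counter appears on a line as a count followed by the tracepoint name, as in the figures quoted in this article.

```python
import re

# A count with optional thousands separators, followed by a kvm tracepoint name.
COUNTER_LINE = re.compile(r"^\s*([\d,]+)\s+(kvm:\w+)")


def parse_perf_counts(text: str) -> dict:
    """Extract {tracepoint: count} from perf-stat-style output, skipping other lines."""
    counts = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1).replace(",", ""))
    return counts


def total_exits(counts: dict) -> int:
    """Total hypervisor round trips, used here as a rough attack-surface proxy."""
    return sum(counts.values())


# Counter values as quoted in this article.
pure_ovmf = parse_perf_counts("792 kvm:kvm_mmio\n74,379 kvm:kvm_pio")
plain_qboot = parse_perf_counts("1,595 kvm:kvm_mmio\n300 kvm:kvm_pio")

print(total_exits(pure_ovmf), total_exits(plain_qboot))
```

The sum is of course a crude proxy, since a single well-placed malicious reply is enough for an attack, but it is a convenient number to watch go down.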
We followed a methodical approach here. First we added OVMF-like metadata to the firmware image to cater for QEMU, which expects it for SEV support. Then we made qboot boot Linux first in 32-bit mode (by default it uses the 16-bit entry point), then we made it support a 64-bit boot, for which we had to add an early CPUID call to get the SEV-specific data needed to set up the initial page table. Then we added SEV-SNP support, in particular the GHCB protocol utilizing the VC exception handler, and the PVALIDATE routine validating the protected memory range.
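To illustrate why the C-bit forces a 64-bit boot, here is a small Python sketch of how a page table entry is assembled. On AMD CPUs the position of the C-bit is reported in the low six bits of EBX for CPUID leaf 0x8000001F, and a set C-bit means the page is encrypted. The helper names and the sample register value below are made up for the example, and this is not the firmware's actual code (which is not written in Python).

```python
PRESENT = 1 << 0   # page table entry flag: page is present
WRITABLE = 1 << 1  # page table entry flag: page is writable


def c_bit_position(cpuid_8000001f_ebx: int) -> int:
    """The low six bits of EBX give the bit position of the C-bit in a page table entry."""
    return cpuid_8000001f_ebx & 0x3F


def make_pte(phys_addr: int, c_bit: int, encrypted: bool) -> int:
    """Build a 64-bit entry for a 4 KiB page; the C-bit is set only for private pages."""
    if phys_addr & 0xFFF:
        raise ValueError("page must be 4 KiB aligned")
    entry = phys_addr | PRESENT | WRITABLE
    if encrypted:
        entry |= 1 << c_bit
    return entry


ebx = 0x0000_0173                 # hypothetical register value; low six bits are 51
c_bit = c_bit_position(ebx)

private_page = make_pte(0x200000, c_bit, encrypted=True)
shared_ghcb_page = make_pte(0x201000, c_bit, encrypted=False)
```

With the C-bit sitting at a position like 51, it simply does not exist in a 32-bit page table entry, so a firmware that wants to mark individual pages as private or shared needs the 64-bit entry format.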
Finally, we instrumented all IN calls returning data from the hypervisor. By hashing and asserting these inputs we could explore what kind of data the firmware is using for the boot, and we could slowly try to either eliminate these inputs, or burn in their measurements so that they cannot be tampered with. All of these calls were using QEMU’s
We found the following hypervisor inputs:
- SMBIOS: This is data exposed by the BIOS (hypervisor) about the PC, like motherboard and CPU information. We could simply remove this from the code altogether as it’s not strictly needed for a boot.
- kernel, kernel cmdline, initrd: This is the operating system itself to be loaded by the firmware. The way we eliminated this input was by burning all of these into the firmware blob itself (so we don’t even need to load it and measure it). The only tricky part here was ensuring that the kernel actually fit the maximum size of the firmware. The kernel we embedded was minimal from the get-go: for security purposes we based it on the tinyconfig configuration as explained earlier. But we also had to limit the initrd, which is a bit trickier as the workload compute can be a lot larger (think docker image sizes..). We solved this by introducing a second boot stage. The initial stage 1 initrd contains only a statically compiled busybox, and once it starts it loads and measures a stage 2 initrd through virtio VSOCK, which is then mounted and chrooted. This way the final combined firmware fit into less than 4MB, which incidentally also made VM boots very fast.
- MPTABLE: data regarding number of vCPUs and related interrupt settings to build the CPU table. To eliminate this input we decided to fix the number of vCPUs during build time and burn this number into the metadata as before. The firmware then uses this number to manually construct the right structures for each vCPU.
- e820 map: the memory map that describes what physical memory regions are available to the kernel. We pursued an approach similar to MPTABLE and burned in the size of used guest memory into the metadata, and then used this to manually construct the e820 map.
- ACPI table: this is where we get the PCI device tree from, needed by the guest kernel for PCI-based virtio. We did not eliminate this input fully at this point, as it is quite elaborate to manually construct these tables. Instead we decided to keep this input and measure the blob against a burned-in hash. To make the table blob auditable and reproducible, during the build of the firmware we ran QEMU in simulation (non-KVM) mode, where it produced the exact same table as with KVM, and then used that table to calculate the hash of the expected table. Because of this, QEMU became part of the “audit surface” of the firmware (temporarily, see next section).
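As an example of what manually constructing such a structure from a burned-in value means, the following Python sketch builds an e820 map from a fixed memory size. Each e820 entry is a 64-bit base address, a 64-bit length, and a 32-bit type. The specific layout chosen here (conventional memory up to 0x9FC00, a reserved range up to 1 MiB, and one usable range above that) is a simplifying assumption for illustration; it ignores, for example, the hole below 4 GiB that larger guests need, and it is not the firmware's actual code.

```python
import struct

E820_RAM = 1        # usable memory
E820_RESERVED = 2   # reserved, not available to the kernel
E820_ENTRY = struct.Struct("<QQI")  # base, length, type: 20 bytes per entry


def build_e820(mem_bytes: int) -> list:
    """Build (base, length, type) entries from a memory size fixed at build time."""
    low_end = 0x9FC00   # assumed end of conventional memory
    one_mib = 0x100000
    if mem_bytes <= one_mib:
        raise ValueError("guest memory must be larger than 1 MiB")
    return [
        (0, low_end, E820_RAM),
        (low_end, one_mib - low_end, E820_RESERVED),
        (one_mib, mem_bytes - one_mib, E820_RAM),
    ]


def pack_e820(entries: list) -> bytes:
    """Serialize the entries into the packed form handed to the kernel."""
    return b"".join(E820_ENTRY.pack(*entry) for entry in entries)


burned_in_mem_bytes = 512 << 20  # hypothetical value read from the firmware metadata
entries = build_e820(burned_in_mem_bytes)
table = pack_e820(entries)
```

Since the only input is a constant covered by the launch measurement, the hypervisor has no say in what memory map the kernel sees.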
By the end of this part of the project we ended up with a working SNP-enlightened firmware capable of booting 64bit Linux, and which only loaded the ACPI table from the hypervisor, and even that input was measured!
0 kvm:kvm_mmio
49 kvm:kvm_pio
Getting rid of ACPI: microvm-based boot
Needless to say, we were not satisfied with this ACPI workaround: running QEMU as part of the build process and burning in the ACPI table’s measurement just felt off and unnecessary. So we went back to the very reason why we need the ACPI table in the first place: it’s to pass the device tree, which in turn was needed because we were using PCI-based virtio.
There are two implementations of virtio: PCI-based and MMIO-based. QEMU supports both, but with different chipset (“machine”) types. MMIO-based virtio in particular requires the use of the microvm machine type. This machine type was inspired by the Firecracker project and caters specifically for lightweight VMs in cloud environments with little to no device dependencies. If we could make the firmware work with the microvm machine type, we would not need PCI support at all!
The only hurdle we had to tackle to make this work was the way QEMU loads the firmware. The currently upstreamed QEMU SEV patches make use of the -pflash argument to pass the firmware blob (this is because the main development effort is around supporting OVMF), which then gets measured by the secure co-processor (PSP) as it’s loaded. However, the microvm machine type has no support for -pflash; instead it relies on -bios to pass the firmware. This meant two things. First, we had to extend QEMU to support loading SEV VMs using -bios (which meant we forked QEMU with a surprisingly small patch). Second, we had to address the fact that -bios loads the firmware pages into ROM, and therefore we cannot use that firmware memory for writes. To address this, we added a relocation routine during early initialization where the firmware code relocates into RAM and jumps into itself.
By switching to microvm, not only did we remove the last IN calls, we could also remove all the ACPI and fw_cfg code, reducing the firmware code further.
But I’ll just let the numbers do the talking:
0 kvm:kvm_mmio
0 kvm:kvm_pio
No input. Zero. Nada. Well, until the kernel starts. Incidentally because the guest does not rely on any input from the hypervisor, this also opens up the possibility of switching to a completely different hypervisor any time we want (as long as that hypervisor supports OVMF-style metadata for SEV).
Confidential Computing is here to stay, and it’s maturing by the day. With new VM-isolation based technologies the barrier for entry is being lowered significantly, as we can finally secure existing already-written software without too much trouble. Existing software however has not been written with a malicious host in mind, and we must be diligent in eliminating potential attack vectors that are left over from “pre-CC” days.
Needless to say, securing the integrity of early boot is only one step in the hardening process. The most complex software in the guest VM is the kernel itself, and securing it requires significant effort not only from Confidential Computing developers, but also from Linux maintainers, as the kind of changes required reach deep into the bowels of kernel code. For now we have taken a conservative approach with the kernel by using an absolutely minimal configuration required to run the specific workloads we want, but even that can be open to potential attacks, and supporting a more fully fledged operating system is a lot more work. There is an ongoing effort by Intel to harden the Linux kernel for Confidential Computing, which will hopefully gain more and more traction over time.
In this article we described how we hardened AMD SEV-SNP early boot up to the point of handing control to the Linux kernel. The ideas and design will most likely translate to Intel TDX as well, but that is a future bridge to cross. We are planning to open source this work, together with build tooling to produce a -bios firmware blob and the patched QEMU. If you have any questions, comments or suggestions on the approach we’ve taken and the design, please reach out at email@example.com; it’s greatly appreciated.
Also, if you got to this point and you’re interested in this kind of work or how to secure the higher-level compute workload, we are hiring! At Decentriq we use Confidential Computing to enable highly security- and privacy-sensitive compute, enabling cross-company and cross-industry data collaborations that simply weren’t possible before.
We would like to give special thanks to a certain unnamed individual who engineered the lion’s share of this project. They explicitly asked to remain anonymous, but without them this work would not have been possible.