Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

May 16, 2022

Gerd Hoffmann

edk2 quickstart for virtualization

Here is a quickstart for everyone who wants (or needs) to deal with edk2 firmware, with a focus on virtual machine firmware. The article assumes you are using a linux machine with gcc.

Building firmware for VMs

To build edk2 you need to have a bunch of tools installed. A compiler and make are required of course, but also iasl, nasm and libuuid. So install them first (package names are for centos/fedora).

dnf install -y make gcc binutils iasl nasm libuuid-devel

If you want to cross-build arm firmware on an x86 machine you also need cross compilers. While you are at it, also set the environment variables needed to make the build system use the cross compilers:

dnf install -y gcc-aarch64-linux-gnu gcc-arm-linux-gnu
export GCC5_AARCH64_PREFIX="aarch64-linux-gnu-"
export GCC5_ARM_PREFIX="arm-linux-gnu-"

Next clone the tianocore/edk2 repository and also fetch the git submodules.

git clone https://github.com/tianocore/edk2.git
cd edk2
git submodule update --init

The edksetup script will prepare the build environment for you. The script must be sourced because it sets some environment variables (WORKSPACE being the most important one). This needs to be done only once (as long as you keep the shell with the configured environment variables open).

source edksetup.sh

Next step is building the BaseTools (also needed only once):

make -C BaseTools

Note: Currently (April 2022) BaseTools are being rewritten in Python, so most likely this step will no longer be needed at some point in the future.

Finally the build (for x64 qemu) can be kicked off:

build -t GCC5 -a X64 -p OvmfPkg/OvmfPkgX64.dsc

The firmware volumes built can be found in Build/OvmfX64/DEBUG_GCC5/FV.

Building the aarch64 firmware instead:

build -t GCC5 -a AARCH64 -p ArmVirtPkg/ArmVirtQemu.dsc

The build results land in Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV.

Qemu expects the aarch64 firmware images to be 64M in size. Because of that the firmware images can't be used as-is; some padding is needed to create an image which can be used for pflash:

dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
dd of="QEMU_VARS-pflash.raw" if="/dev/zero" bs=1M count=64
dd of="QEMU_VARS-pflash.raw" if="QEMU_VARS.fd" conv=notrunc

There are a bunch of compile time options, typically enabled using -D NAME or -D NAME=TRUE. Options which are enabled by default can be turned off using -D NAME=FALSE. Available options are defined in the *.dsc files referenced by the build command. So a feature-complete build looks more like this:

build -t GCC5 -a X64 -p OvmfPkg/OvmfPkgX64.dsc \
    -D FD_SIZE_4MB \
    -D NETWORK_IP6_ENABLE \
    -D NETWORK_HTTP_BOOT_ENABLE \
    -D NETWORK_TLS_ENABLE \
    -D TPM2_ENABLE

Secure boot support (on x64) requires SMM mode. Well, it builds and works without SMM, but it's not secure then. Without SMM nothing prevents the guest OS from writing directly to flash, bypassing the firmware, so protected UEFI variables are not actually protected.

Also, with SMM enabled, suspend (S3) support only works if parts of the firmware (PEI specifically, see below for details) run in 32-bit mode. So the secure boot variant must be compiled this way:

build -t GCC5 -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc \
    -D FD_SIZE_4MB \
    -D SECURE_BOOT_ENABLE \
    -D SMM_REQUIRE \
    [ ... add network + tpm + other options as needed ... ]

The FD_SIZE_4MB option creates a larger firmware image, being 4MB instead of 2MB (default) in size, offering more space for both code and vars. The RHEL/CentOS builds use that. The Fedora builds are 2MB in size, for historical reasons.

If you need 32-bit firmware builds for some reason, here is how to do it:

build -t GCC5 -a ARM -p ArmVirtPkg/ArmVirtQemu.dsc
build -t GCC5 -a IA32 -p OvmfPkg/OvmfPkgIa32.dsc

The build results will be in Build/ArmVirtQemu-ARM/DEBUG_GCC5/FV and Build/OvmfIa32/DEBUG_GCC5/FV.

Booting fresh firmware builds

The x86 firmware builds create three different images:

OVMF_VARS.fd
This is the firmware volume for persistent UEFI variables, i.e. where the firmware stores all configuration (boot entries and boot order, secure boot keys, ...). Typically this is used as a template for an empty variable store and each VM gets its own private copy (see the sketch after this list); libvirt for example stores them in /var/lib/libvirt/qemu/nvram.
OVMF_CODE.fd
This is the firmware volume with the code. Separating this from VARS (a) allows for easy firmware updates, and (b) allows the code to be mapped read-only into the guest.
OVMF.fd
The all-in-one image with both CODE and VARS. This can be loaded as ROM using -bios, with two drawbacks: (a) UEFI variables are not persistent, and (b) it does not work for SMM_REQUIRE=TRUE builds.
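
Since OVMF_VARS.fd is only a template, each VM should get its own writable copy before its first boot. A minimal sketch (my-vm-vars.fd is just an example name):

cp ${WORKSPACE}/Build/OvmfX64/DEBUG_GCC5/FV/OVMF_VARS.fd my-vm-vars.fd   # per-VM writable variable store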

qemu handles pflash storage as block devices, so we have to create block devices for the firmware images:

CODE=${WORKSPACE}/Build/OvmfX64/DEBUG_GCC5/FV/OVMF_CODE.fd
VARS=${WORKSPACE}/Build/OvmfX64/DEBUG_GCC5/FV/OVMF_VARS.fd
qemu-system-x86_64 \
  -blockdev node-name=code,driver=file,filename=${CODE},read-only=on \
  -blockdev node-name=vars,driver=file,filename=${VARS},snapshot=on \
  -machine q35,pflash0=code,pflash1=vars \
  [ ... ]

Here is the arm version of that (using the padded files created using dd, see above):

CODE=${WORKSPACE}/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV/QEMU_EFI-pflash.raw
VARS=${WORKSPACE}/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV/QEMU_VARS-pflash.raw
qemu-system-aarch64 \
  -blockdev node-name=code,driver=file,filename=${CODE},read-only=on \
  -blockdev node-name=vars,driver=file,filename=${VARS},snapshot=on \
  -machine virt,pflash0=code,pflash1=vars \
  [ ... ]

Source code structure

The core edk2 repo holds a number of packages; each package has its own toplevel directory. Here are the most interesting ones:

OvmfPkg
This holds both the x64-specific code (i.e. OVMF itself) and virtualization-specific code shared by all architectures (virtio drivers).
ArmVirtPkg
Arm specific virtual machine support code.
MdePkg, MdeModulePkg
Most core code is here (PCI support, USB support, generic services and drivers, ...).
PcAtChipsetPkg
Some Intel architecture drivers and libs.
ArmPkg, ArmPlatformPkg
Common Arm architecture support code.
CryptoPkg, NetworkPkg, FatPkg, CpuPkg, ...
As the names of the packages already suggest: Crypto support (using openssl), Network support (including network boot), FAT Filesystem driver, ...

Firmware boot phases

The firmware modules in the edk2 repo are often named after the boot phase they are running in. Most drivers are named SomeThingDxe for example.

ResetVector
This is where code execution starts after a machine reset. The code will do the bare minimum needed to enter SEC. On x64 the most important step is the transition from 16-bit real mode to 32-bit mode or 64-bit long mode.
SEC (Security)
This code typically loads and uncompresses the code for PEI and DXE. On physical hardware SEC often lives in ROM memory and can not be updated. The PEI and DXE firmware volumes are loaded from (updateable) flash.
With OVMF both the SEC firmware volume and the compressed volume holding the PEI and DXE code are part of the OVMF_CODE image and will simply be mapped into guest memory.
PEI (Pre-EFI Initialization)
Platform initialization is done here. Initialize the chipset. Not much to do here in virtual machines, other than loading the e820 memory map (via fw_cfg) from qemu on x64, or getting the memory map from the device tree on aarch64. The virtual hardware is ready-to-go without much extra preparation.
PEIMs (PEI Modules) can implement functionality which must be executed before entering the DXE phase. This includes security-sensitive things like initializing SMM mode and locking down flash memory.
DXE (Driver Execution Environment)
When PEI is done it hands over control to the full EFI environment contained in the DXE firmware volume. Most code is here: all kinds of drivers, the firmware setup EFI app, ...
Strictly speaking this isn't only one phase. The code for all phases after PEI is part of the DXE firmware volume though.

by Gerd Hoffmann at May 16, 2022 10:00 PM

May 13, 2022

Thomas Huth

Improved KVM virtualization with RHEL 9 on IBM Z

This week, Red Hat Enterprise Linux 9 has been announced, which will also bring us lots of new stuff for our beloved mainframe.

First, compared with RHEL 8, a lot of generic packages have been updated, of course. For example, RHEL 9 on IBM Z comes with:

  • Linux kernel 5.14
  • glibc 2.34
  • gcc 11.2
  • clang 13.0
  • binutils 2.35
  • s390utils 2.19

And of course all of these have been thoroughly tested during the past months, which is also the reason why RHEL sometimes does not ship the very latest bleeding edge versions of the upstream projects – thorough testing needs some time. But you can be sure that Red Hat also backported lots of selected upstream fixes and improvements (e.g. for the kernel) to their downstream packages, so this is very up-to-date and stable software.

The new KVM virtualization stack

The first big news is: There is no need anymore to install the separate virt:av (“Advanced Virtualization”) module to get the latest and greatest virtualization features on IBM Z. Everything is packaged along with the main RHEL distribution for easier installation now and will be kept up-to-date there, with important new features like virtio-fs enabled by default. And of course, as with the latest releases of RHEL 8, there is no limit of 4 guests anymore, so you don’t have to worry about the number of supported KVM guests (as long as your hardware can handle them).

The versions that will be shipped with RHEL 9.0 are:

  • QEMU 6.2.0
  • libvirt 8.0.0
  • libguestfs 1.46.1
  • virt-install 3.2.0
  • libslirp 4.4.0

To answer what is maybe the most important question: Yes, this will also support the brand new IBM z16 mainframe. Basic support for this new generation has already been added to QEMU 6.1.0 and kernel 5.14, and additional z16 features have been enabled by default in QEMU 6.2.0.

Another great new change is that it is now possible to configure mediated devices directly with the virtualization CLI tools on IBM Z. You can now add vfio-ap and vfio-ccw mediated devices to your KVM guests using virt-install or virt-xml. With virt-install, you can also create a VM that uses an existing DASD mediated device as its primary disk.
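
As an aside (this is not the virt-install/virt-xml syntax from the announcement, just an alternative illustration using plain libvirt), an already created vfio-ccw mediated device could be attached to a guest roughly like this; the guest name and UUID are placeholders:

# describe the existing mediated device and attach it persistently (placeholders throughout)
cat > mdev-hostdev.xml <<'EOF'
<hostdev mode='subsystem' type='mdev' model='vfio-ccw'>
  <source>
    <address uuid='11111111-2222-3333-4444-555555555555'/>
  </source>
</hostdev>
EOF
virsh attach-device my-guest mdev-hostdev.xml --config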

Additionally, many small performance improvements (like the specification exception interpretation feature) and bug fixes have been backported to the RHEL 9 kernel and the userspace tools to give you a great virtualization experience with RHEL 9.

One more thing that is worth mentioning (though it is not specific to IBM Z), and which you might have noticed already by clicking on the links in the previous paragraphs: there is another big change in RHEL 9. The development of the upcoming minor RHEL 9 releases (i.e. 9.1, 9.2, etc.) is now done in public via the CentOS Stream repositories. That means you can not only peek at the work that will be integrated in the next 9.y release, you can now even directly participate in the development of these next releases if you like! Isn’t that cool?

Anyway, no matter whether you are planning to participate or just want to use the software, please enjoy the new KVM virtualization stack on the mainframe!

May 13, 2022 03:45 PM

April 29, 2022

Stefan Hajnoczi

Debugging Flatpak applications

Flatpak is a way to distribute applications on Linux. Its container-style approach allows applications to run across Linux distributions. This means native packages (rpm, deb, etc) are not needed and it's relatively easy to get your app to Linux users with fewer worries about distro compatibility. This makes life a lot easier for developers and is also convenient for users.

I've run popular applications like OBS Studio as flatpaks and even publish my own on Flathub, a popular hosting site for applications. Today I figured out how to debug flatpaks, which requires some extra steps that I'll share below so I don't forget them myself!

Bonus Tip: Testing local flatpaks

If you're building a flatpak of your own application it's handy to use the dir sources type in the manifest to compile your application's source code from a local directory instead of a git tag or tarball URL. This way you can make changes to the source code and test them quickly inside Flatpak.

Put something along these lines in the manifest's modules object where /home/user/my-app is your local directory with your app's source code:

{
    "name": "my-app",
    "sources": [
        {
            "type": "dir",
            "path": "/home/user/my-app"
        }
    ],
    ...
}

Building and installing apps with debuginfo

flatpak-builder(1) automatically creates a separate .Debug extension for your flatpak that contains your application's debuginfo. You'll need the .Debug extension if you want proper backtraces and source level debugging. At the time of writing the Flatpak documentation did not mention how to install the locally-built .Debug extension. Here is how:

$ flatpak-builder --user --force-clean --install build my.org.app.json
$ flatpak install --user --reinstall --assumeyes "$(pwd)/.flatpak-builder/cache" my.org.app.Debug

It might be a good idea to install debuginfo for the system libraries in your SDK too in case it's not already installed:

$ flatpak install org.kde.Sdk.Debug # or your runtime's SDK

Running applications for debugging

There is a flatpak(1) option that launches the application with the SDK instead of the Runtime:

$ flatpak run --user --devel my.org.app

The SDK contains development tools whereas the Runtime just has the files needed to run applications.

It can also be handy to launch a shell so you can control the launch of your app and maybe use gdb or strace:

$ flatpak run --user --devel --command=sh my.org.app
[📦 my.org.app ~]$ gdb /app/bin/my-app

Working with core dumps

If your application crashes it will dump core like any other process. However, existing ways of inspecting core dumps like coredumpctl(1) are not fully functional because the process ran inside namespaces and debuginfo is located inside flatpaks instead of the usual system-wide /usr/lib/debug location. coredumpctl(1), gdb, etc aren't Flatpak-aware and need extra help.

Use the flatpak-coredumpctl wrapper to launch gdb:

$ flatpak-coredumpctl -m <PID> my.org.app

You can get the PID from the list printed by coredumpctl(1).
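
If you are not sure which PID to pass, you can list the recorded core dumps first (my-app is a placeholder for your program name):

$ coredumpctl list my-app   # shows time, PID and signal for each recorded crash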

Conclusion

This article showed how to install locally-built .Debug extensions and inspect core dumps when using Flatpak. I hope that over time these manual steps will become unnecessary as flatpak-builder(1) and coredumpctl(1) are extended to automatically install .Debug extensions and handle Flatpak core dumps. For now it just takes a few extra commands compared to debugging regular applications.

by Stefan Hajnoczi at April 29, 2022 05:41 PM

April 26, 2022

KVM on Z

Ubuntu 22.04 released

Canonical released a new LTS (Long Term Support) version of its Ubuntu server offering: Ubuntu Server 22.04!
It ships

  • Linux kernel 5.15
  • QEMU v6.2
  • libvirt v8.0
See the release notes here, and the blog entry at Canonical with Z-specific highlights here.

by Stefan Raspl (noreply@blogger.com) at April 26, 2022 10:57 AM

Ubuntu 21.10 released

Ubuntu Server 21.10 is out!
It ships

  • Linux kernel 5.13 (including, among other things, the features described here and here)
  • QEMU v6.0
  • libvirt v7.6
See the release notes here, and the blog entry at Canonical with Z-specific highlights here.

by Stefan Raspl (noreply@blogger.com) at April 26, 2022 10:56 AM

April 21, 2022

KVM on Z

Howto: Verifying Secure Execution Host Key Documents

TL;DR

Using genprotimg to build an IBM Secure Execution for Linux image may fail after CA certificate reissue in April 2022.
If this happens you can work around it by using this script to verify the validity of the host key document, and then run genprotimg with the --no-verify option.

The certificates used to verify Host Key Documents for IBM Secure Execution for Linux are about to be renewed in April 2022. As a result, the genprotimg tool contained in your Linux distribution may report a verification failure and refuse to build a secure image. Patches for RHEL 8 as well as for SLES 15 and Ubuntu 20.04 are underway, so the issue can eventually be fixed by updating your Linux installation.
Until then, it is possible to work around it as follows:
  1. Download the check_hostkeydoc script at https://github.com/ibm-s390-linux/s390-tools/blob/master/genprotimg/samples/check_hostkeydoc.
  2. Run
    check_hostkeydoc <your host key document> \
        ibm-z-host-key-signing.crt -c DigiCertCA.crt
     
  3. If the previous step succeeded, it's safe to
    run genprotimg --no-verify -k <your host key document> \
        -o <output file> -i <kernel file> -r <ramdisk file> \
        -p <parameter file>
More information about IBM Secure Execution for Linux can be found here.

by Stefan Raspl (noreply@blogger.com) at April 21, 2022 02:28 PM

April 19, 2022

QEMU project

QEMU version 7.0.0 released

We’d like to announce the availability of the QEMU 7.0.0 release. This release contains 2500+ commits from 225 authors.

You can grab the tarball from our download page. The full list of changes is available in the Wiki.

Highlights include:

  • ACPI: support for logging guest events via ACPI ERST interface
  • virtiofs: improved security label support
  • block: improved flexibility for fleecing backups, including support for non-qcow2 images
  • ARM: ‘virt’ board support for virtio-mem-pci, specifying guest CPU topology, and enabling PAuth when using KVM/hvf
  • ARM: ‘xlnx-versal-virt’ board support for PMC SLCR and emulating the OSPI flash memory controller
  • ARM: ‘xlnx-zynqmp’ now models the CRF and APU control
  • HPPA: support for up to 16 vCPUs, improved graphics driver for HP-UX VDE/CDE environments, setting SCSI boot order, and a number of other new features
  • OpenRISC: ‘sim’ board support for up to 4 cores, loading an external initrd image, and automatically generating a device tree for the boot kernel
  • PowerPC: ‘pseries’ emulation support for running guests as a nested KVM hypervisor, and new support for spapr-nvdimm device
  • PowerPC: ‘powernv’ emulation improvements for XIVE and PHB 3/4, and new support for XIVE2 and PHB5
  • RISC-V: support for KVM
  • RISC-V: support for ratified 1.0 Vector extension, as well as Zve64f, Zve32f, Zfhmin, Zfh, zfinx, zdinx, and zhinx{min} extensions.
  • RISC-V: ‘spike’ machine support for OpenSBI binary loading
  • RISC-V: ‘virt’ machine support for 32 cores, and AIA support.
  • s390x: support for “Miscellaneous-Instruction-Extensions Facility 3” (a z15 extension)
  • x86: Support for Intel AMX
  • and lots more…

Thank you to everyone involved!

April 19, 2022 04:04 PM

April 07, 2022

KVM on Z

IBM z16 announced!

Today, IBM announced the new IBM z16, with a planned availability date of May 31.

See here for the press release, and here for the official homepage. For further details, including a list of supported Linux distributions, see Eberhard's blog here.

And for a more hands-on tour of the new box, check out this video.

by Stefan Raspl (noreply@blogger.com) at April 07, 2022 02:57 PM

March 31, 2022

KVM on Z

Documentation: Solution Assurance

The solution assurance team started to publish solution setups, recommendations, and step-by-step guidelines for a broad range of topics, for example:

  • High availability clustering
  • IBM Cloud Infrastructure Center
  • CPUMF
  • kdump
and more!

You can access the materials here, but don't forget to check back periodically: There is more to come!

by Stefan Raspl (noreply@blogger.com) at March 31, 2022 09:34 PM

March 29, 2022

KVM on Z

Documentation: SAP Application Server on KVM

This new publication aims to provide practical insights for running real-world workloads on KVM on IBM Z. From the abstract:

The SAP on IBM Z Performance team, in Poughkeepsie, NY, conducted a series of measurements to assess the performance cost of implementing a KVM environment to host SAP application servers. The tests used SAP (SBS 9.0) core banking workloads, with a Db2 database having 100 million banking accounts, which are comparable to some of the largest banks in the world. Tests were conducted that used both banking workload types, Account Settlement (batch) and Day Posting, which simulates online transactional processing (OLTP). They were executed on an IBM z15 with 16 and 32 Integrated Facility for Linux (IFL) processor configurations, that used various degrees of virtualization.

The document is available here.

by Stefan Raspl (noreply@blogger.com) at March 29, 2022 04:20 PM

March 07, 2022

QEMU project

Apply for a QEMU Google Summer of Code internship

We have great news to share: QEMU has been accepted as a Google Summer of Code 2022 organization! Google Summer of Code is an open source internship program offering paid remote work opportunities for contributing to open source. The internship runs from June 13th to September 12th.

Now is the chance to get involved in QEMU development! The QEMU community has put together a list of project ideas here.

Google has dropped the requirement that you need to be enrolled in a higher education course. We’re excited to work with a wider range of contributors this year! For details on the new eligibility requirements, see here.

You can submit your application from April 4th to 19th.

GSoC interns work together with their mentors, experienced QEMU contributors who support their interns in their projects. Code developed during the internship is submitted through the same open source development process that all QEMU contributions follow. This gives interns experience with contributing to open source software. Some interns then choose to pursue a career in open source software after completing their internship.

If you have questions about applying for QEMU GSoC, please email Stefan Hajnoczi or ask on the #qemu-gsoc IRC channel.

March 07, 2022 01:30 PM

Stefan Hajnoczi

vhost-user is coming to non-Linux hosts!

Sergio Lopez sent a QEMU patch series and vhost-user protocol specification update that maps vhost-user to non-Linux POSIX host operating systems. This is great news because vhost-user has become a popular way to develop emulated devices in any programming language that execute as separate processes with their own security sandboxing. Until now they have only been available on Linux hosts.

At the moment the BSD and macOS implementation is slower than the Linux implementation because the KVM ioeventfd and irqfd primitives are unavailable on those operating systems. Instead POSIX pipes are used and the VMM (QEMU) needs to act as a forwarder for MMIO/PIO accesses and interrupt injections. On Linux the kvm.ko kernel module has direct support for this, bypassing the VMM process and achieving higher efficiency. However, similar mechanisms could be added to non-KVM virtualization drivers in the future.

This means that vhost-user devices can now start to support multiple host operating systems and I'm sure they will be used in new ways that no one thought about before.

by Stefan Hajnoczi at March 07, 2022 09:37 AM

February 28, 2022

Gerd Hoffmann

Introducing ovmfctl

New project: Tools for ovmf (and armvirt) firmware volumes. It's written in python and can be installed with a simple pip3 install ovmfctl. The project is hosted at gitlab.

ovmfdump

Usage: ovmfdump --input file.fd.

It's a debugging tool which just prints the structure and content of firmware volumes.

ovmfctl

This is a tool to print and modify variable store volumes. Main focus has been on certificate handling so far.

Enrolling certificates for secure boot support in virtual machines has been a rather painful process. It's handled by EnrollDefaultKeys.efi, which needs to be started inside a virtual machine to enroll the certificates and enable secure boot mode.

With ovmfctl it is dead simple:

ovmfctl --input /usr/share/edk2/ovmf/OVMF_VARS.fd \
        --enroll-redhat \
        --secure-boot \
        --output file.fd

This enrolls the Red Hat Secure Boot certificate which is used by Fedora, CentOS and RHEL as platform key. The usual Microsoft certificates are added to the certificate database too, so Windows guests and shim.efi work as expected.

If you want more fine-grained control you can use the --set-pk, --add-kek, --add-db and --add-mok switches instead. The --enroll-redhat switch above is actually just a shortcut for:

--set-pk  a0baa8a3-041d-48a8-bc87-c36d121b5e3d RedHatSecureBootPKKEKkey1.pem \
--add-kek a0baa8a3-041d-48a8-bc87-c36d121b5e3d RedHatSecureBootPKKEKkey1.pem \
--add-kek 77fa9abd-0359-4d32-bd60-28f4e78f784b MicrosoftCorporationKEKCA2011.pem \
--add-db  77fa9abd-0359-4d32-bd60-28f4e78f784b MicrosoftWindowsProductionPCA2011.pem \
--add-db  77fa9abd-0359-4d32-bd60-28f4e78f784b MicrosoftCorporationUEFICA2011.pem \

If you just want the variable store to be printed, use ovmfctl --input file.fd --print. Add --hexdump for more details.

Extract all certificates: ovmfctl --input file.fd --extract-certs.

Try ovmfctl --help for a complete list of command line switches. Note that input and output file can be identical for in-place updates.

That's it. Enjoy!

by Gerd Hoffmann at February 28, 2022 11:00 PM

February 15, 2022

QEMU project

QEMU welcomes Outreachy internship applicants

QEMU is offering open source internships in Outreachy’s May-August 2022 round. You can submit your application until February 25th 2022 if you want to contribute to QEMU in a remote work internship this summer.

Outreachy internships are extended to people who are subject to systemic bias and underrepresentation in the technical industry where they are living. For details on applying, please see the Outreachy website. If you are not eligible, don’t worry, QEMU is also applying to participate in Google Summer of Code again and we hope to share news about additional internships later this year.

Outreachy interns work together with their mentors, experienced QEMU contributors who support their interns in their projects. Code developed during the internship is submitted via the same open source development process that all QEMU code follows. This gives interns experience with contributing to open source software. Some interns then choose to pursue a career in open source software after completing their internship.

Now is the chance to get involved in QEMU development!

If you have questions about applying for QEMU Outreachy, please email Stefan Hajnoczi or ask on the #qemu-gsoc IRC channel.

February 15, 2022 01:30 PM

February 04, 2022

Stefan Hajnoczi

Speaking at FOSDEM '22 about "What's coming in VIRTIO 1.2"

I will give a talk titled What's coming in VIRTIO 1.2: New virtual I/O devices and features on Saturday, February 5th 2022 at 10:00 CET at the FOSDEM virtual conference (it's free and there is no registration!). The 9 new device types will be covered, as well as some of the other features that have been added to the upcoming 1.2 release of the VIRTIO specification. I hope to see you there and if you miss it there will be slides and video available afterwards.

by Stefan Hajnoczi at February 04, 2022 08:04 PM

January 21, 2022

Cornelia Huck

QEMU machine types and compatibility (part 2)

In the first part of this article, I talked about how you can use versioned machine types to ensure compatibility. But the more interesting part is how this actually works under the covers.

Device properties, and making them compatible

QEMU devices often come with a list of properties that influence how the device is created and how it operates. Typically, authors try to come up with reasonable default values, which may be overridden if desired. However, the idea of what is considered reasonable may change over time, and a newer QEMU may provide a different default value for a property.

If you want to migrate a guest from an older QEMU machine to a more recent QEMU, you obviously need to use the default values from that older QEMU machine as well. For that, QEMU uses arrays of GlobalProperty structures.

If you take a look at hw/core/machine.c, you will notice several arrays named hw_compat_<major>_<minor>. These contain triplets specifying (from right to left) the default value for a certain property for a certain device. The arrays are designed to be included by the compat machine for <major>.<minor>, thus specifying a default value for that machine version and older. (More on this later in this article.)

For example, QEMU 5.2 changed the default number of virtio queues defined for virtio-blk and virtio-scsi devices: prior to 5.1, one queue would be present if no other value had been specified; with 5.2, the default number of queues would align with the number of vcpus for virtio-pci. Therefore, hw_compat_5_1 contains the following lines:

{ "virtio-blk-device", "num-queues", "1"},
{ "virtio-scsi-device", "num_queues", "1"},

(and some corresponding lines for vhost.) This makes sure that any virtio-blk or virtio-scsi device on a -5.1 or older machine type will have one virtio queue by default. Note that this holds true for all virtio-blk and virtio-scsi devices, regardless of which transport they are using; for transports like ccw where nothing changed with 5.2, this simply does not make any difference.

Generally, statements for all devices can go into the hw_compat_ arrays; if a device is not present or even not available at all for the machine that is started, the statement will simply not take any effect.

x86 considerations

For the x86 machine types (pc-i440fx and pc-q35), pc_compat_<major>_<minor> arrays are defined in hw/i386/pc.c, mostly covering properties for x86 cpus, but also some other x86-specific devices.

Per-machine changes

Some incompatible changes are not happening at the device property level, so the compat properties approach cannot be used. Instead, the individual machines need to take care of those changes.

For example, in QEMU 6.2 the smp parsing code started to prefer cores over sockets instead of preferring sockets. Therefore, all 6.1 compat machines have code like

m->smp_props.prefer_sockets = true;

to set prefer_sockets to true in the MachineClass. (Note that the m68k virt machine does not support smp, and therefore does not need that statement.)

Machines also sometimes need to configure associated capabilities in a compatible way. For example, the s390x cpu models may gain new feature flags in newer QEMU releases; when using a compat machine, those new flags need to be off in the cpu models that are used by default.

Inheritance

Compat machines for older machine types need the compatibility changes for newer machine types as well as some changes on top. Typically, this is done by the MachineState or MachineClass initializing functions for version n-1 calling the respective initializing functions for version n. As all new compatibility changes are added for the latest versioned machine type, changes are propagated down the whole stack of versions.

All machine types for version n include the hw_compat_<n> array (and the pc_compat_<n> array for x86), unless they are the latest version (which does not need any compat handling yet.) The older compat property arrays are included via the inheritance mechanism.

Putting it all together

QEMU currently supports versioned machine types for x86 (pc-i440fx, pc-q35), arm (virt), aarch64 (virt), s390x (s390-ccw-virtio), ppc64 (pseries), and m68k (virt). At the beginning of each development cycle, new (empty) arrays of compat properties for the last version are added and wired up in the machine types for that last version, new versions of each of these machines are added to the code, and the defaults switched to them (well, that’s the goal.) After that, the framework for adding incompatible changes is in place.

If you find that these changes have not yet been made when you plan to make an incompatible change, it is important that you add the new machine types first.

New and incompatible device properties

If you plan to change the default value of a device property, or add a new property with a default value that will cause guest-observable changes, you need to add an entry that preserves the old value (or sets a value that does not change the behaviour) to the compat property array for the last version. In general (non-x86 specific change), that means adding it to the hw_compat_ array, and all machine types will use it automatically.

Take care to use the right device for specifying the property; for example, there is often some confusion when dealing with virtio devices. If you e.g. modify a virtio-blk property (as in the example above), you need to add a statement for virtio-blk-device and not for virtio-blk-pci, or virtio-blk instances using the ccw or mmio transports would be left out. If, on the other hand, you modify a property only for virtio-blk devices using the pci transport, you need to add a statement for virtio-blk-pci. Similar considerations apply to other devices inheriting from base types.

Per-machine changes

If you change a non-device default characteristic, you need to add a compatibility statement for the machine types for the last version in their instance (or class) init functions. The hardest part here is making sure that all relevant machine types get the update.

For example, if you add a change in the s390x cpu models, it is easy to see that you only need to modify the code for the s390-ccw-virtio machine. For other changes, every versioned machine needs the change. And there are cases like the prefer_sockets change mentioned above, that apply to any machine type that supports smp.

I hope that these explanations help a bit with understanding how machine type compatibility works, and where to add your own changes.

by Cornelia Huck at January 21, 2022 11:30 AM

January 05, 2022

Cornelia Huck

QEMU machine types and compatibility

If you want to migrate a guest initially started on an older QEMU version to a newer version of QEMU, you need to make sure that the two machines are actually compatible with each other. Once you exclude things like devices that cannot be migrated at all and make sure both QEMU invocations actually create the same virtual hardware, this basically boils down to using compatible machines.

Versioned machine types

If you simply want to create a machine without any consideration regarding migration compatibility, you will usually do something like

qemu-system-ppc64 -machine pseries (...)

This will create a machine of the pseries type. But in this case, pseries is actually an alias to the latest version of this machine type; for 6.2, this would be pseries-6.2. You can find out which machine types are versioned (and which machine types actually exist for a given binary) via -machine ?:

$ qemu-system-ppc64 -machine ?
Supported machines are:
40p                  IBM RS/6000 7020 (40p)
bamboo               bamboo
g3beige              Heathrow based PowerMAC
mac99                Mac99 based PowerMAC
mpc8544ds            mpc8544ds
none                 empty machine
pegasos2             Genesi/bPlan Pegasos II
powernv10            IBM PowerNV (Non-Virtualized) POWER10
powernv8             IBM PowerNV (Non-Virtualized) POWER8
powernv              IBM PowerNV (Non-Virtualized) POWER9 (alias of powernv9)
powernv9             IBM PowerNV (Non-Virtualized) POWER9
ppce500              generic paravirt e500 platform
pseries-2.1          pSeries Logical Partition (PAPR compliant)
pseries-2.10         pSeries Logical Partition (PAPR compliant)
pseries-2.11         pSeries Logical Partition (PAPR compliant)
pseries-2.12         pSeries Logical Partition (PAPR compliant)
pseries-2.12-sxxm    pSeries Logical Partition (PAPR compliant)
pseries-2.2          pSeries Logical Partition (PAPR compliant)
pseries-2.3          pSeries Logical Partition (PAPR compliant)
pseries-2.4          pSeries Logical Partition (PAPR compliant)
pseries-2.5          pSeries Logical Partition (PAPR compliant)
pseries-2.6          pSeries Logical Partition (PAPR compliant)
pseries-2.7          pSeries Logical Partition (PAPR compliant)
pseries-2.8          pSeries Logical Partition (PAPR compliant)
pseries-2.9          pSeries Logical Partition (PAPR compliant)
pseries-3.0          pSeries Logical Partition (PAPR compliant)
pseries-3.1          pSeries Logical Partition (PAPR compliant)
pseries-4.0          pSeries Logical Partition (PAPR compliant)
pseries-4.1          pSeries Logical Partition (PAPR compliant)
pseries-4.2          pSeries Logical Partition (PAPR compliant)
pseries-5.0          pSeries Logical Partition (PAPR compliant)
pseries-5.1          pSeries Logical Partition (PAPR compliant)
pseries-5.2          pSeries Logical Partition (PAPR compliant)
pseries-6.0          pSeries Logical Partition (PAPR compliant)
pseries-6.1          pSeries Logical Partition (PAPR compliant)
pseries              pSeries Logical Partition (PAPR compliant) (alias of pseries-6.2)
pseries-6.2          pSeries Logical Partition (PAPR compliant) (default)
ref405ep             ref405ep
sam460ex             aCube Sam460ex
taihu                taihu
virtex-ml507         Xilinx Virtex ML507 reference design

As you can see, there are various pseries-x.y machine types for older versions; these are designed to present a configuration that is compatible with a default machine that was created with an older QEMU version. For example, if you wanted to migrate a guest running on a pseries machine that was created using QEMU 5.1, the receiving QEMU would need to be started with

qemu-system-ppc64 -machine pseries-5.1 (...)

Supported machine types

Note: the following applies to upstream QEMU. Distributions may support different versioned machine types in their builds.

This list is as of QEMU 6.2; new versioned machine types may be added in the future, and sometimes old ones deprecated and removed. The machine types for the next QEMU release are usually introduced early in the release cycle (at least, that is the goal…)

arm, aarch64

The virt machine type supports versions since 2.6.

m68k

The virt machine type supports versions since 6.0.

ppc64

The pseries machine type supports versions since 2.1.

s390x

The s390-ccw-virtio machine type supports versions since 2.4.

i386, x86_64

The pc-i440fx machine type supports versions since 1.4 (there used to be even older ones, but they have been removed), while the pc-q35 machine type supports versions since 2.4.

There’s an additional thing to consider here: the pc machine type alias points (as of QEMU 6.2) to the latest pc-i440fx machine type; if you want the latest pc-q35 machine type instead, you have to use q35.

How to use this

If you want to simply fire up a QEMU instance and shut it down again without wanting to migrate it anywhere, you can stick to the default machine type. However, if you might want to migrate the machine later, it is probably a good idea to specify a versioned machine type explicitly, so that you don’t have to remember which QEMU version you started it with.

Or just use management software like libvirt, which will do the machine type expansion to the latest version for you automatically, so you don’t have to worry about it later.
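
For example, a quick way to see which versioned machine type libvirt recorded for an existing guest is to look at the domain XML, where the <type> element contains something like machine='pc-q35-6.2' (the guest name is a placeholder):

$ virsh dumpxml my-guest | grep machine=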

This concludes the usage part of compatible machine types; a follow-up post will look at how this is actually implemented.

by Cornelia Huck at January 05, 2022 11:30 AM

December 14, 2021

QEMU project

QEMU version 6.2.0 released

We’d like to announce the availability of the QEMU 6.2.0 release. This release contains 2300+ commits from 189 authors.

You can grab the tarball from our download page. The full list of changes is available in the Wiki.

Highlights include:

  • virtio-mem: guest memory dumps are now fully supported, along with pre-copy/post-copy migration and background guest snapshots
  • QMP: support for new DEVICE_UNPLUG_GUEST_ERROR to detect guest-reported hotplug failures
  • TCG: improvements to TCG plugin argument syntax, and multi-core support for cache plugin
  • 68k: improved support for Apple’s NuBus, including ability to load declaration ROMs, and slot IRQ support
  • ARM: macOS hosts with Apple Silicon CPUs now support ‘hvf’ accelerator for AArch64 guests
  • ARM: emulation support for Fujitsu A64FX processor model
  • ARM: emulation support for kudo-mbc machine type
  • ARM: M-profile MVE extension is now supported for Cortex-M55
  • ARM: ‘virt’ machine now supports an emulated ITS (Interrupt Translation Service) and supports more than 123 CPUs in emulation mode
  • ARM: xlnx-zcu102 and xlnx-versal-virt machines now support BBRAM and eFUSE devices
  • PowerPC: improved POWER10 support for the ‘powernv’ machine type
  • PowerPC: initial support for POWER10 DD2.0 CPU model
  • PowerPC: support for FORM2 PAPR NUMA descriptions for ‘pseries’ machine type
  • RISC-V: support for Zb[abcs] instruction set extensions
  • RISC-V: support for vhost-user and numa mem options across all boards
  • RISC-V: SiFive PWM support
  • x86: support for new Snowridge-v4 CPU model
  • x86: guest support for Intel SGX
  • x86: AMD SEV guests now support measurement of kernel binary when doing direct kernel boot (not using a bootloader)
  • and lots more…

Thank you to everyone involved!

December 14, 2021 09:32 PM

December 08, 2021

Stefan Hajnoczi

How to add debuginfo to perf(1)

Sometimes it's necessary to add debuginfo so perf-report(1) and similar commands can display human-readable function names instead of raw addresses. For instance, if a program was built from source and stripped of symbols when installing into /usr/local/bin/ then perf(1) does not have the symbol information available.

perf(1) maintains a cache of debuginfos keyed by the build-id (also known as .note.gnu.build-id) that uniquely identifies executables and shared objects on Linux. perf.data files contain the build-ids involved when the data was recorded. This allows perf-report(1) and similar commands to look up the required debuginfo from the build-id cache for address to function name translation.

If perf-report(1) displays raw addresses instead of human-readable function names, then we need to get the debuginfo for the build-ids in the perf.data file and add it to the build-id cache. You can show the build-ids required by a perf.data file with perf-buildid-list(1):


$ perf buildid-list # reads ./perf.data by default
b022da126fad1e0a287a6a25016f6c7c996e68c9 /lib/modules/5.14.11-200.fc34.x86_64/kernel/arch/x86/kvm/kvm-intel.ko.xz
f8aa9d9bf047e67b76f22426ad4af310f9b0325a /lib/modules/5.14.11-200.fc34.x86_64/kernel/arch/x86/kvm/kvm.ko.xz
6740f24c4733268d03b41f9483282297dde6b286 [vdso]

Your build-id cache may be missing debuginfo or have limited debuginfo with less symbol information than you need. For example, if data was collected from a stripped /usr/local/bin/my-program executable and you now want to update the build-id cache with the executable that contains full debuginfo, use the perf-buildid-cache(1) command:


$ perf buildid-cache --update=path/to/my-program-with-symbols

There is also an --add=path/to/debuginfo option for adding new build-ids that are not yet in the cache.

Now perf-report(1) and similar tools will display human-readable function names from path/to/my-program-with-symbols instead of the stripped /usr/local/bin/my-program executable. If that doesn't work, verify that the build-ids in my-program-with-symbols and my-program match.
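
One way to compare the two is to read the .note.gnu.build-id note from both files (the paths are placeholders):

$ readelf -n path/to/my-program-with-symbols | grep "Build ID"
$ readelf -n /usr/local/bin/my-program | grep "Build ID"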

by Stefan Hajnoczi at December 08, 2021 03:03 PM

November 25, 2021

KVM on Z

RHEL 8.5 AV Released

RHEL 8.5 Advanced Virtualization (AV) is out! See the official announcement and the release notes.

KVM is supported via Advanced Virtualization, and provides

  • QEMU v6.0, supporting virtio-fs on IBM Z
  • libvirt v7.6

Furthermore, RHEL 8.5 AV now supports persisting mediated devices.

For a detailed list of Linux on Z-specific changes, also see this blog entry at Red Hat.

IBM-specific documentation for Red Hat Enterprise Linux 8.5 is available at IBM Documentation here (in particular: Device Drivers, Features and Commands on Red Hat Enterprise Linux 8.5).
See here on how to enable AV in RHEL 8 installs.
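
For reference, enabling AV roughly comes down to switching the virt module to the Advanced Virtualization stream; a minimal sketch, assuming the stream is named virt:av (check what your release actually offers first):

dnf module list virt          # list available virt module streams
dnf module reset -y virt      # drop the default stream
dnf module enable -y virt:av  # stream name assumed, see above
dnf update -y                 # pull in the AV versions of qemu-kvm, libvirt, etc.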

by Stefan Raspl (noreply@blogger.com) at November 25, 2021 11:32 AM

November 23, 2021

KVM on Z

New Community: Compass L

Do you know Compass L yet...? This community offers a great opportunity to interact with other users, developers and architects of Linux and KVM on IBM Z!

For further information, see the flyer below, or head over right away and join the community here.


by Stefan Raspl (noreply@blogger.com) at November 23, 2021 11:46 AM

November 21, 2021

Gerd Hoffmann

processing patch mails with b4 and notmuch

This blog post describes my mail setup, with a focus on how I handle patch email. Let's start with a general mail overview. Not going too deep into the details here, the internet has plenty of documentation and configuration tutorials.

Outgoing mail

Most of my machines have a local postfix configured for outgoing mail. My workstation and my laptop forward all mail (over vpn) to the company internal email server. All I need for this to work is a relayhost line in /etc/postfix/main.cf:

relayhost = [smtp.corp.redhat.com]

Most unix utilities (including git send-email) try to send mails using /usr/sbin/sendmail by default. This tool will place the mail in the postfix queue for processing. The name of the binary is a convention dating back to the days where sendmail was the one and only unix mail processing daemon.

Incoming mail

All my mail is synced to local maildir storage. I'm using offlineimap for the job. Plenty of other tools exist, isync is another popular choice.

Local mail storage has the advantage that reading mail is faster, especially in case you have a slow internet link. Local mail storage also allows you to easily index and search all your mail with notmuch.
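
Indexing and querying the maildir with notmuch then boils down to something like this (the search terms are just an example):

notmuch new                                     # index newly fetched mail
notmuch search 'tag:unread and subject:PATCH'   # query the index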

Filtering mail

I'm using server side filtering. The major advantage is that I always have the same view on all my mail. I can use a mail client on my workstation, the web interface or a mobile phone. Doesn't matter, I always see the same folder structure.

Reading mail

All modern email clients should be able to use maildir folders. I'm using neomutt. I also have used thunderbird and evolution in the past. All working fine.

The reason I use neomutt is that it is simply faster than GUI-based mailers, which matters when you have to handle a lot of email. It is also very easy to hook up scripts, which is very useful when it comes to patch processing.

Outgoing patches

I'm using git send-email for the simple cases and git-publish for the more complex ones. Where "simple" typically means single changes (not a patch series) where it is unlikely that I have to send another version addressing review comments.
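
For such a simple single-patch case this is usually a one-liner (the list address is just an example):

git send-email --to=qemu-devel@nongnu.org -1   # format and send the most recent commit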

git publish keeps track of the revisions you have sent by storing a git tag in your repo. It also stores the cover letter and the list of people Cc'ed on the patch, so sending out a new revision of a patch series is much easier than with plain git send-email.

git publish also features config profiles. This is helpful for larger projects where different subsystems use different mailing lists (and possibly different development branches too).

Incoming patches

So, here comes the more interesting part: Hooking scripts into neomutt for patch processing. Let's start with the config (~/.muttrc) snippet:

# patch processing
bind	index,pager	p	noop			# default: print
macro	index,pager	pa	"<pipe-entry>~/.mutt/bin/patch-apply.sh<enter>"
macro	index,pager	pl	"<pipe-entry>~/.mutt/bin/patch-lore.sh<enter>"

First I map the 'p' key to noop (instead of print which is the default configuration), which allows using two-key combinations starting with 'p' for patch processing. Then 'pa' is configured to run my patch-apply.sh script, and 'pl' runs patch-lore.sh.

Let's have a look at the patch-apply.sh script which applies a single patch:

#!/bin/sh

# store patch
file="$(mktemp ${TMPDIR-/tmp}/mutt-patch-apply-XXXXXXXX)"
trap "rm -f $file" EXIT
cat > "$file"

# find project
source ~/.mutt/bin/patch-find-project.sh
if test "$project" = ""; then
        echo "ERROR: can't figure project"
        exit 1
fi

# go!
clear
cd $HOME/projects/$project
branch=$(git rev-parse --abbrev-ref HEAD)

clear
echo "#"
echo "# try applying patch to $project, branch $branch"
echo "#"

if git am --message-id --3way --ignore-whitespace --whitespace=fix "$file"; then
        echo "#"
        echo "# OK"
        echo "#"
else
        echo "# FAILED, cleaning up"
        cp -v .git/rebase-apply/patch patch-apply-failed.diff
        cp -v "$file" patch-apply-failed.mail
        git am --abort
        git reset --hard
fi

The mail is passed to the script on stdin, so the first thing the script does is to store that mail in a temporary file. Next it tries to figure out which project the patch is for. The logic for that is in a separate file so other scripts can share it, see below. Finally it tries to apply the patch using git am. In case of a failure both the decoded patch and the complete email are stored before cleaning up and exiting.

Now for patch-find-project.sh. This script snippet tries to figure out the project by checking which mailing list the mail was sent to:

#!/bin/sh
if test "$PATCH_PROJECT" != ""; then
        project="$PATCH_PROJECT"
elif grep -q -e "devel@edk2.groups.io" "$file"; then
        project="edk2"
elif grep -q -e "qemu-devel@nongnu.org" "$file"; then
        project="qemu"
# [ ... more checks snipped ... ]
fi
if test "$project" = ""; then
        echo "Can't figure project automatically."
        echo "Use env var PATCH_PROJECT to specify."
fi

The PATCH_PROJECT environment variable can be used to override the autodetect logic if needed.

Last script is patch-lore.sh. That one tries to apply a complete patch series, with the help of the b4 tool. b4 makes patch series management an order of magnitude simpler. It will find the latest revision of a patch series, bring the patches into the correct order, pick up tags (Reviewed-by, Tested-by etc.) from replies, check signatures and more.

#!/bin/sh

# store patch
file="$(mktemp ${TMPDIR-/tmp}/mutt-patch-queue-XXXXXXXX)"
trap "rm -f $file" EXIT
cat > "$file"

# find project
source ~/.mutt/bin/patch-find-project.sh
if test "$project" = ""; then
	echo "ERROR: can't figure project"
	exit 1
fi

# find msgid
msgid=$(grep -i -e "^message-id:" "$file" | head -n 1 \
	| sed -e 's/.*<//' -e 's/>.*//')

# go!
clear
cd $HOME/projects/$project
branch=$(git rev-parse --abbrev-ref HEAD)

clear
echo "#"
echo "# try queuing patch (series) for $project, branch $branch"
echo "#"
echo "# msgid: $msgid"
echo "#"

# create work dir
WORK="${TMPDIR-/tmp}/${0##*/}-$$"
mkdir "$WORK" || exit 1
trap 'rm -rf $file "$WORK"' EXIT

echo "# fetching from lore ..."
echo "#"
b4 am	--outdir "$WORK" \
	--apply-cover-trailers \
	--sloppy-trailers \
	$msgid || exit 1

count=$(ls $WORK/*.mbx 2>/dev/null | wc -l)
if test "$count" = "0"; then
	echo "#"
	echo "# got nothing, trying notmuch instead ..."
	echo "#"
	echo "# update db ..."
	notmuch new
	echo "# find thread ..."
	notmuch show \
		--format=mbox \
		--entire-thread=true \
		id:$msgid > $WORK/notmuch.thread
	echo "# process mails ..."
	b4 am	--outdir "$WORK" \
		--apply-cover-trailers \
		--sloppy-trailers \
		--use-local-mbox $WORK/notmuch.thread \
		$msgid || exit 1
	count=$(ls $WORK/*.mbx 2>/dev/null | wc -l)
fi

echo "#"
echo "# got $count patches, trying to apply ..."
echo "#"
if git am -m -3 $WORK/*.mbx; then
	echo "#"
	echo "# OK"
	echo "#"
else
	echo "# FAILED, cleaning up"
	git am --abort
	git reset --hard
fi

The first part (store mail, find project) of the script is the same as patch-apply.sh. Then the script extracts the message id of the mail passed in and feeds that into b4. b4 will try to find the email thread on lore.kernel.org. In case this doesn't return results the script will query notmuch for the email thread instead and feed that into b4 using the --use-local-mbox switch.

Finally it tries to apply the complete patch series prepared by b4 with git am.

So, with all that in place applying a patch series is just two key strokes in neomutt. Well, almost. I still need a terminal on the side which I use to make sure the correct branch is checked out, to run build tests etc.

by Gerd Hoffmann at November 21, 2021 11:00 PM

November 09, 2021

Stefan Hajnoczi

Peer-to-peer applications with Urbit

This article gives an overview of the architecture of Urbit applications. I spent a weekend trying out Urbit, reading documentation, and digging through the source code. I'm always on the lookout for the next milestone system that will change the internet and computing landscape. In particular, I find decentralized and peer-to-peer systems interesting because I have a sense that the internet is not quite right. It could be better if only someone could figure out how to do that and make it mainstream.

Urbit is an operating system and network designed to give users control by running applications on personal servers instead of on centralized servers operated by the application creators. This means data is stored on personal servers and is not immediately accessible to application creators. Both the Urbit operating system and network run on top of existing computing infrastructure. It's not a baremetal operating system (it runs under Linux, macOS, and Windows) or a new Layer 3 network protocol (it uses UDP). If you want to know more there is an overview of Urbit here.

The operating function

The Urbit kernel, Arvo, is a single-function operating system in the sense of purely functional programming. The operating system function takes the previous state and input events and produces the next state and output events. This means that the state of the system can be saved after each invocation. If there is a power failure or the system otherwise stops execution it's easy to resume it later from the last state.

Urbit has a virtual machine and runtime that supports this programming environment. The low-level language is called Nock and the higher-level language is called Hoon. I haven't investigated them in detail, but they appear to support deterministic purely functional programming with I/O and other side-effects kept outside via monads and passing around inputs like the current time.

Applications

Applications, also called agents, follow the purely functional model where they produce the next state as their result. Agents expose their services in three ways:

  1. Peek, a read-only query that fetches data without changing state.
  2. Poke, a command and response similar to a Remote Procedure Call (RPC).
  3. Subscriptions, a stream of updates that may be delivered over time until the subscription is closed.

For example, an application that keeps a counter can define a poke interface for incrementing the counter and a peek interface for querying its value. A subscription can be used to receive an update whenever the counter changes.

Urbit supports peeks, pokes, and subscriptions over the network. This is how applications running on different personal servers can communicate. If we want to replicate a remote counter we can subscribe to it and then poke our local counter every time an update is received. This replication model leads to the store/hook/view architecture, a way of splitting applications into components that support local state, remote replication, and a user interface. In our counter example the store would be the counter, the hook would be the code that replicates remote counters, and the view would provide any logic needed for the user interface to control the counter.

Interacting with the outside world

User interfaces for applications are typically implemented in Landscape, a web-based user interface for interacting with Urbit from your browser. The user interface can be a modern JavaScript application that communicates with the agent running inside Urbit via the HTTP JSON API. This API supports peeks, pokes, and subscriptions. In other words, the application's backend is implemented as an Urbit agent while the frontend is a regular client-side web application.

Of course there are also APIs for data storage, networking, HTTP, etc. For example, the weather widget in Landscape fetches the weather from a web service using an HTTP request.

Urbit also supports peer discovery so you can resolve the funny IDs like ~bitbet-bolbel and establish connections to remote Urbit instances. The IDs are allocated hierarchically and ultimately registered on the Ethereum blockchain.

Criticisms

Keep in mind I only spent a weekend investigating Urbit so I don't understand the full system and could be wrong about what I have described. Also, I've spent a lot of time and am therefore invested in Linux and conventional programming environments. Feedback from the Urbit community is welcome, just send me an email or message me on IRC or Matrix.

The application and network model is intended for personal servers. I don't think people want personal servers. It's been tried before by Sandstorm, FreedomBox, and various projects without mainstream success. I think a more interesting model for the powerful devices we run today is one without any "server" at all. Instead of having an always-on server that is hosted somewhere, apps should be able to replicate and sync directly between a laptop and a phone. Having the option to run a personal server for always-on services like chat rooms or file hosting is nice, but many things don't need this. I wish Urbit was less focussed on personal servers and more on apps that replicate and sync directly between each other.

Urbit is obfuscated by the most extreme not invented here (NIH) syndrome I have ever seen. I tried to keep the terminology at a minimum in this article, so it might not be obvious unless you dive into the documentation or source code yourself. Not only is most of it a reinvention of existing stuff but it also uses new terminology for everything. It was disappointing to find that what first appeared like an alien system that might hold interesting discoveries was just a quirky reimplementation of existing concepts.

It will be difficult for Urbit to catch on as a platform since it has no common terminology with existing programming environments. If you want to write an app for Urbit using the Hoon programming language you'll have to wade through a lot of NIH at every level of the stack (programming language, operating system, APIs). There is an argument that reinventing everything allows the whole system to be small and self-contained, but in practice that's not true since Landscape apps are JavaScript web applications. They drag in the entire conventional computing environment that Urbit was supposed to replace. I wonder if the same kind of system can be built on top of a browser plus Deno with WebRTC for the server side, reusing existing technology that is actively developed by teams much larger than Urbit. It seems like a waste because Urbit doesn't really innovate in VMs, programming languages, etc., yet the NIH stuff needs to be maintained.

Finally, for a system that is very much exposed to the network, I didn't see a strong discipline or much support for writing secure applications. The degree of network transparency that Urbit agents have also means that they present an attack surface. I would have expected the documentation and APIs/tooling to steer developers in a direction where it's hard to make mistakes. My impression is that a lot of the attack surface in agents is hand coded and security issues could become commonplace when Urbit gains more apps written by a wider developer community.

Despite this criticism I really enjoyed playing with Urbit. It is a cool rabbit hole to go down.

Conclusion

Urbit applications boil down to a relatively familiar interface similar to what can be done with gRPC: command/response, querying data, and subscriptions. The Urbit network allows applications to talk to each other directly in a peer-to-peer fashion. Users run apps on personal servers instead of centralized servers operated by the application creators (like Twitter, Facebook, etc). If Urbit can attract enough early adopters then it could become an interesting operating system and ecosystem that overcomes some of the issues of today's centralized internet. If you're wondering, I think it's worth spending a weekend exploring Urbit!

by Unknown (noreply@blogger.com) at November 09, 2021 01:54 PM

November 01, 2021

Cornelia Huck

Blog update

I have moved my blog to a new location and done some other changes at the same time.

  • This blog is now generated via Jekyll (a huge thank you to the authors here!) This makes posts easier to write for me (especially when formatting command output and similar), and gets rid of intrusive scripts as on the Blogger platform.
  • This blog’s title is now “KVM, QEMU, and more.” Observant readers may have noticed that I dropped the “Big Iron” part. I may still post s390x-specific content, but in general, I plan to write more about topics that are not architecture-specific.

And yes, I plan to actually post something new this year ;)

by Cornelia Huck at November 01, 2021 11:00 PM

October 15, 2021

Stefan Hajnoczi

A new approach to usermode networking with passt

There is a new project called passt that Stefano Brivio has been working on to implement usermode networking, the magic that forwards network packets between QEMU guests and the host network.

passt is designed as a replacement for QEMU's --netdev user (also known as slirp), a feature that is commonly used but not really considered production-ready. What passt improves on is security and performance, finally making usermode networking production-ready. That's all you need to know to try it out but I thought the internals of how passt works are interesting, so this article explains the approach.

Why usermode networking is necessary

Guests send and receive Ethernet frames through emulated network interface cards like virtio-net. Those packets need to be injected into the host network but popular operating systems don't provide an API for sending and receiving Ethernet frames because that poses a security risk (spoofing) or could simply interfere with other applications on the host.

Actually that's not quite true, operating systems do provide specialized APIs for injecting Ethernet frames but they come with limitations. For example, the Linux tun/tap driver requires additional network configuration steps as well as administrator privileges. Sometimes it's not possible to take advantage of tap due to these limitations and we really need a solution for unprivileged users. That's what usermode networking is about.
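For illustration, a typical tap setup needs privileged steps like the following (the interface and bridge names are just examples; every command requires root or CAP_NET_ADMIN):

# create a tap device owned by the current user (requires root/CAP_NET_ADMIN)
sudo ip tuntap add dev tap0 mode tap user "$USER"
sudo ip link set tap0 up
# attach it to an existing bridge so packets reach the host network
sudo ip link set tap0 master br0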

Transmuting Ethernet frames to Socket API calls

Since an unprivileged user cannot inject Ethernet frames into the host network, we have to make do with the POSIX Sockets API that is available to unprivileged users. Each Ethernet frame sent by the guest needs to be converted into equivalent Sockets API calls on the host so that the desired effect is achieved even though we weren't able to transmit the original Ethernet frame byte-for-byte. Incoming packets from the external network need to be received via the Sockets API and repackaged into Ethernet frames that the guest network interface card can receive.

In networking parlance this conversion between Ethernet frames and Sockets API calls is a Layer 2 (Data Link Layer)/Layer 4 (Transport Layer) conversion. The Ethernet frames have additional packet headers including the Ethernet header, IP header, and the TCP/UDP header that the Sockets API calls don't include. Careful use of the Sockets API makes it possible to synthesize Ethernet frames that are similar enough to the original ones that the guest can communicate successfully.

For the most part this conversion requires parsing and building, respectively, packet headers in a straightforward way. The TCP protocol makes things more interesting though because a TCP connection involves non-trivial state that is normally handled by the TCP/IP stack. For example, data sent over a TCP connection might arrive out of order or some chunks may have been dropped. There is also a state machine for the TCP connection lifecycle including its famous three-way handshake. This means TCP connections must be carefully tracked so that these low-level protocol features can be simulated correctly.

How passt works

Passt runs as an unprivileged host userspace process that is connected to QEMU through --netdev socket, a way to transfer Ethernet frames from QEMU's network interface card emulation to another process like passt. When passt reads an Ethernet frame from the guest, say one carrying a UDP message, it sends the data onwards through an equivalent AF_INET SOCK_DGRAM socket on the host. It also keeps the socket open so replies can be read on the host and then packaged into Ethernet frames that are written to the guest. The effect of this is that guest network communication appears like it's coming from the passt process on the host and integrates nicely into host networking.

How TCP works is a bit more interesting. Since TCP connections require acknowledgement messages for reliable delivery, passt uses the recvmmsg(2) MSG_PEEK flag to fetch data while keeping it queued in the host network stack's rcvbuf until the guest acknowledges it. This avoids extra buffer management code in passt and is part of its strategy of implementing only a subset of TCP. There is no need to duplicate the full TCP/IP stack since the host and guest already have them, but achieving this requires detailed knowledge of TCP so that passt can maintain just enough state.

Incoming connections are handled by port forwarding. This means passt can bind to port 2222 on the host and forward connections to port 22 inside the guest. This is very useful for usermode networking since the user may not have permission to bind to low-numbered ports on the host or there might already be host services listening on those ports. If you don't want to mess with port forwarding you can use passt's all mode, which simply listens on all non-ephemeral ports (basically a brute force approach).

A few basic network protocols are necessary for network communication: ARP, ICMP, DHCP, DNS, and IPv6 services. Passt offers these because the guest cannot talk to those services on the host network directly. They can be disabled when the guest has knowledge of the network configuration and doesn't need them.

Why passt is unique

Thanks to running in a separate process from QEMU and by taking a minimalist approach, passt is able to tighten security. Its seccomp filters are stronger than anything the larger QEMU process could do. The code is clean and specifically designed for security and simplicity. I think writing passt in C was a missed opportunity. Some users may rule it out entirely for this reason. Userspace parsing of untrusted network packets should be done in a memory-safe programming language nowadays. Nevertheless, it's a step up from slirp, which has a poor track record of security issues and runs as part of the QEMU process.

I'm excited to see how passt performs in comparison to slirp. passt uses techniques like recvmmsg(2)/sendmmsg(2) to batch message transfer and reads multiple Ethernet frames from the guest in a single syscall to amortize the cost of syscalls over multiple packets. There is no dynamic memory allocation and packet headers are pre-populated to minimize the number of CPU cycles spent in the data path. While this is promising, QEMU's --netdev socket isn't the fastest (packets first take a trip through QEMU's net subsystem queues), but still a good trade-off between performance and simplicity/security. Based on reading the code, I think passt will be faster than slirp but I haven't benchmarked it myself.
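If you want to compare the two yourself once passt is packaged for your distribution, a simple throughput test is easy to set up (iperf3 is my suggestion here, not something passt ships; 10.0.2.2 is the host address with the default slirp configuration and needs adjusting for other setups):

# on the host
iperf3 -s
# inside the guest, pointing at the host as seen from the guest
iperf3 -c 10.0.2.2 -t 30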

There is another mode that passt supports for containers instead of virtualization. Although it's not relevant to QEMU, this so-called pasta mode is a cool feature for container networking. In this mode pasta connects a network namespace with the outside world (init namespace) through a tap device. This might become passt's killer feature, because the same software can be used for both virtualization and containers, so why bother investing in two separate solutions?

Conclusion

Passt is a promising replacement for slirp (on Linux hosts at least). It looks like there will finally be a production-ready usermode networking feature for QEMU that is fast and secure. Passt's functionality is generic enough that other projects besides QEMU will be able to use it, which is great because this kind of networking code is non-trivial to develop. I look forward to passt becoming available for use with QEMU in Linux distributions soon!

by Unknown (noreply@blogger.com) at October 15, 2021 07:00 PM

October 13, 2021

KVM on Z

Documentation Update: KVM Virtual Server Management

Intended for KVM virtual server administrators, the "Linux on Z and LinuxONE - KVM Virtual Server Management" book illustrates how to set up, configure, and operate Linux on KVM instances and their virtual devices running on the KVM host and IBM Z hardware.

This major update includes libvirt commands and XML elements for managing the lifecycle of VFIO mediated devices, performance tuning tips, and a simplified method for configuring a virtual server for IBM Secure Execution for Linux.

by Stefan Raspl (noreply@blogger.com) at October 13, 2021 09:47 AM

September 16, 2021

Stefan Hajnoczi

KVM Forum 2021 Highlights

Here are highlights from this year's KVM Forum conference, the yearly event around QEMU, KVM, and related communities like VIRTIO and rust-vmm.

Video recordings will be posted soon. In the meantime, here are short summaries of what I learnt. You can find slides for many of these talks in the links below.

vDPA

vDPA is a Linux driver framework for developing hybrid hardware/software VIRTIO devices.

In Hyperscale vDPA, Jason Wang covered ways to create fine-grained virtual devices for virtual machines and containers using vDPA. This means offering a way to slice up a physical device into many virtual devices with some aspects of the virtual device handled in hardware and others in software. This is the direction that networking and accelerator devices are heading in and is actively being discussed in the VIRTIO community. Many different approaches are possible and Jason's talk enumerates some of them:

  • An interface for management commands (a virtio-pci capability, a management virtqueue)
  • DMA isolation (PCIe PASID, a platform-independent device MMU)
  • More than 2048 MSI-X interrupts (virtio-pci capability for VIRTIO-specific MSI-X tables)

Another new vDPA project was presented by Yongji Xie. VDUSE - vDPA Device in Userspace showed how vDPA devices can be implemented in userspace. Although this is roughly the same use case as vhost-user, it has the unique advantage of allowing containers and bare metal to attach devices. An untrusted userspace process emulates the device and the host kernel can either present a vhost device to virtual machines or attach to the userspace device. This is a neat way to develop software devices that can also benefit container workloads.

Stefano Garzarella covered the new unified virtio-blk storage stack in vdpa-blk: Unified Hardware and Software Offload for virtio-blk. The goal is to support hardware virtio-blk devices, an optimized host kernel software device, and still offer QEMU block layer features like qcow2 images. This allows the fast path to go directly to hardware or an optimized in-kernel device while software storage features can still be used when desired via a slow path.

vfio-user

VFIO User - Using VFIO as the IPC Protocol in Multi-process QEMU focussed on the new out-of-process device interface that John Johnson, Jagannathan Raman, and Elena Ufimtseva have been working on together with others. This new protocol allows PCI (and perhaps other busses in the future) devices to be implemented as separate processes. QEMU communicates with the device over a UNIX domain socket. The approach is similar to vhost-user except the protocol messages are based on the Linux VFIO ioctl interface instead of the vhost ioctls.

While vhost-user has been in use for a number of years for VIRTIO-based devices, vfio-user now makes it possible to implement non-VIRTIO devices as separate processes. There were several other talks about vfio-user at KVM Forum 2021 that you can also check out:

virtiofs

In Towards High-availability for Virtio-fs, Jiachen Zheng and Yongji Xie explained how they extended virtiofs to handle crash recovery and live updates. These features are challenging for any program with a lot of state because care must be taken to maintain a consistent snapshot to resume from in the case of a restart. They tackled this by storing Linux file handles and a journal in a shm file. This required some changes to QEMU's virtiofsd: data structures that are suitable for storing in shm, and a journal that makes it possible to provide idempotency for operations like mkdir that would otherwise fail if replayed.

Virtual IOMMUs

Suravee Suthikulpanit and Wei Huang gave a talk titled Analysis of AMD HW-assisted vIOMMU Implementation and Performance. AMD is working on a hardware implementation of a virtual IOMMU that allows guests to specify DMA permissions for guest memory. This functionality is important for VFIO device assignment within guests, for example. Although it can be done in software via emulation of real IOMMUs or the virtio-iommu device that was designed specifically for virtual machines, implementing the vIOMMU in real hardware has performance advantages. One interesting feature of the hardware-assisted vIOMMU is that it natively supports encrypted memory for AMD SEV-SNP guests, something that is slow and clumsy to do in software.

by Unknown (noreply@blogger.com) at September 16, 2021 04:17 PM

September 06, 2021

Gerd Hoffmann

Advanced network booting for virtual machines

Network booting is cool. Once you have set up everything you can stop juggling iso images in your virtual machine configs. Instead you just kick off a network boot and pick whatever you want to install from the boot menu delivered by the boot server.

This article is not about the basics of setting up a boot server. The internet has tons of tutorials on how to install a tftp server and how to boot your favorite OS from tftp. This article will focus on configuring network boot for libvirt-managed virtual machines.

Before we get started ...

The config file snippets are examples from my home network: home.kraxel.org is the local domain and 192.168.2.14 is the machine acting as boot server here. You have to replace those to match your setup of course. The same is true for the boot file names.

The default libvirt network uses 192.168.122.0/24. In case you use that unmodified these addresses will work fine for you and in fact they should already be in your libvirt network configuration. If you have changed the default libvirt network I expect you know what you have to do 😎.

Step one: very basic netboot setup

That is pretty simple. libvirt has support for that, so all you have to do is add a bootp tag with the ip address of your tftp server and the boot file name to the network config.

<network>
  [ ... ]
  <ip address='192.168.122.1' netmask='255.255.255.0'>                                        
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254'/>                                    
      <bootp file='pxelinux.0' server='192.168.2.14'/>
    </dhcp>
  </ip>
</network>

You can edit the network configuration using virsh net-edit name. The default libvirt network is simply named default. The network needs a restart to apply any changes (virsh net-destroy name; virsh net-start name).
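For the default network that boils down to:

virsh net-edit default
virsh net-destroy default
virsh net-start default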

That was easy, right? Well, maybe not. In case this is not working for you try running modprobe nf_nat_tftp. tftp uses udp, which means there are no connections at ip level, so the kernel has to look into the tftp packets to figure out how to route them correctly for a masqueraded network. The nf_nat_tftp kernel module does exactly that.

Note: Recent libvirt versions seem to take care to load nf_nat_tftp if needed, so there is a chance this works out-of-the-box for you.
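If you do need to load the module manually, you can also make that persistent across reboots (the file name is just a suggestion):

modprobe nf_nat_tftp
lsmod | grep nf_nat_tftp
echo nf_nat_tftp > /etc/modules-load.d/nf_nat_tftp.conf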

Nevertheless, that leads straight to the question: do we actually need tftp?

Step two: replace tftp with http

As you might have guessed the answer is no.

The ipxe boot roms support booting from http, by simply specifying a URL instead of a filename as bootfile. This was never formally specified though, so unfortunately you can't expect this to work with every boot rom. For qemu-powered virtual machines this isn't a problem at all because the qemu boot roms are built from ipxe. With physical machines you might have to jump through some extra hoops to chainload ipxe (not covered here).

The easiest way to get this going is to install apache on your tftp boot server, then configure a virtual host with the tftproot as document root. You can do so by dropping a snippet like this into /etc/httpd/conf.d/:

<Directory "/var/lib/tftpboot">
        Options Indexes FollowSymLinks
        AllowOverride None
        Require all granted
</Directory>
<VirtualHost *:80>
        ServerName boot.home.kraxel.org
        DocumentRoot /var/lib/tftpboot
</VirtualHost>

Enabling indexes is not needed for boot server functionality, but it might be handy if you want to access the boot server with your web browser for troubleshooting.

Using the tftproot as document root has the advantage that the paths are identical for both tftp and http boot, so your pxelinux and grub configuration files should continue to work unmodified.
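On a centos/fedora box, installing the web server and checking that the boot file is reachable could look like this (host name taken from the example setup above):

dnf install -y httpd
systemctl enable --now httpd
curl -I http://boot.home.kraxel.org/pxelinux.0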

Now you can go edit your libvirt network config and replace the bootp configuration with this:

<bootp file='http://boot.home.kraxel.org/pxelinux.0'/>

Done. Don't forget to restart the network to apply the changes. Booting should be noticeably faster now (especially when fetching larger initrds), and any NAT traversal problems should be gone too.

Extra tip for lazy people

When using http you can boot from pretty much any server on the internet, so there is no need to set up your own. You can use for example the boot server provided by netboot.xyz with a large collection of operating systems available as live systems and for install. Here is the bootp snippet for this:

<bootp file='http://boot.netboot.xyz/ipxe/netboot.xyz.lkrn'/>

In most cases you probably want to have a local boot server for faster installs. But for a one-time test install of a new distro this might be handier than downloading the install iso.

Step three: what about UEFI?

For EFI guests pxelinux.0 is pretty much useless, so we must do something else for them. The first question is: how do we figure out that an EFI guest is asking for a boot file? Let's have a look at the dhcp request, BIOS guest first. Captured using tcpdump -i virbr0 -v port bootps:

[ ... ] 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:89:32:47 [ ... ]
	  Client-Ethernet-Address 52:54:00:89:32:47 (oui Unknown)
	  Vendor-rfc1048 Extensions
            [ ... ]
	    ARCH Option 93, length 2: 0
	    Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"

Now a request from a (x64) EFI guest:

[ ... ] 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:89:32:47 [ ... ]
	  Client-Ethernet-Address 52:54:00:89:32:47 (oui Unknown)
	  Vendor-rfc1048 Extensions
            [ ... ]
	    ARCH Option 93, length 2: 7
	    Vendor-Class Option 60, length 32: "PXEClient:Arch:00007:UNDI:003001"

See? The EFI guest uses arch 7 instead of 0, in both option 93 and option 60. So we will use that.

Unfortunately libvirt has no direct support for that. But libvirt uses dnsmasq as dhcp (and dns) server for the virtual networks. dnsmasq has support for this, and starting with libvirt version 5.6.0 it is possible to specify any dnsmasq config option in your libvirt network configuration using the dnsmasq xml namespace.

dnsmasq uses the concept of tags to implement this. Requests can be tagged using matches, and configuration directives can be applied to requests with certain tags. So, here is what it looks like, using the efi-x64-pxe tag for x64 efi guests and /arch-x86_64/grubx64.efi as bootfile.

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  [ ... ]
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254'/>
      <bootp file='http://boot.home.kraxel.org/pxelinux.0'/>
    </dhcp>
  </ip>
  <dnsmasq:options>
    <dnsmasq:option value='#'/>
    <dnsmasq:option value='dhcp-match=set:efi-x64-pxe,option:client-arch,7'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-x64-pxe,/arch-x86_64/grubx64.efi,,192.168.2.14'/>
  </dnsmasq:options>
</network>

dnsmasq uses '#' for comments, and it is here only to visually separate entries a bit. It will also be in the dnsmasq config files created by libvirt (in /var/lib/libvirt/dnsmasq/).

Step four: Can UEFI guests use http too?

Sure. You might have already noticed that the UEFI boot manager has both UEFI PXEv4 and UEFI HTTPv4 entries. Here is what happens when you pick the latter:

[ ... ] 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:89:32:47 [ ... ]
	  Client-Ethernet-Address 52:54:00:89:32:47 (oui Unknown)
	  Vendor-rfc1048 Extensions
            [ ... ]
	    ARCH Option 93, length 2: 16
	    Vendor-Class Option 60, length 33: "HTTPClient:Arch:00016:UNDI:003001"

It's arch 16 now. Also option 60 starts with HTTPClient instead of PXEClient. So we can simply add another arch match to identify http clients.

Another detail we need to take care of is that the UEFI http boot client expects a reply with option 60 set to HTTPClient, otherwise the reply will be ignored. So we need to take care of that too, using dhcp-option-force. Here we go, using tag efi-x64-http for http clients:

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  [ ... ]
  <dnsmasq:options>
    <dnsmasq:option value='#'/>
    <dnsmasq:option value='dhcp-match=set:efi-x64-pxe,option:client-arch,7'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-x64-pxe,/arch-x86_64/grubx64.efi,,192.168.2.14'/>
    <dnsmasq:option value='#'/>
    <dnsmasq:option value='dhcp-match=set:efi-x64-http,option:client-arch,16'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-x64-http,http://boot.home.kraxel.org/arch-x86_64/grubx64.efi'/>
    <dnsmasq:option value='dhcp-option-force=tag:efi-x64-http,60,HTTPClient'/>
  </dnsmasq:options>
</network>

Extra tip for lazy people, now with UEFI

Complete example, defining a new libvirt network named netboot.xyz. You can store that in some file, then use virsh net-define file to create the network.

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  <name>netboot.xyz</name>
  <forward mode='nat'/>
  <bridge name='netboot0' stp='on' delay='0'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.123.10' end='192.168.123.99'/>
      <bootp file='http://boot.netboot.xyz/ipxe/netboot.xyz.lkrn'/>
    </dhcp>
  </ip>
  <dnsmasq:options>
    <dnsmasq:option value='dhcp-match=set:efi-x64-http,option:client-arch,16'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-x64-http,http://boot.netboot.xyz/ipxe/netboot.xyz.efi'/>
    <dnsmasq:option value='dhcp-option-force=tag:efi-x64-http,60,HTTPClient'/>
  </dnsmasq:options>
</network>

Then, in your guest domain configuration, use <source network='netboot.xyz'/> to use the new network. With this both BIOS and UEFI guests can netboot from netboot.xyz. With UEFI you have to take care to pick the UEFI HTTPv4 entry from the firmware boot menu.
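The complete workflow, assuming the XML above was saved as netboot-xyz.xml (the file name is just an example), looks like this:

virsh net-define netboot-xyz.xml
virsh net-autostart netboot.xyz
virsh net-start netboot.xyz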

Step five: architecture experiments

There is a world beyond x86. The arch field specifies not only the firmware type (bios vs. uefi) and the boot protocol (pxe vs. http), but also the cpu architecture. Here are the ones relevant for qemu:

Code   Architecture
0x00   BIOS pxeboot (both i386 and x86_64)
0x06   EFI pxeboot, IA32 (i386)
0x07   EFI pxeboot, X64 (x86_64)
0x0a   EFI pxeboot, ARM (v7)
0x0b   EFI pxeboot, AA64 (v8 / aarch64)
0x12   powerpc64
0x16   EFI httpboot, X64
0x18   EFI httpboot, ARM
0x19   EFI httpboot, AA64
0x31   s390x

So, if you want to play with arm or powerpc without owning such a machine you can let qemu emulate it with tcg. If you want to netboot it -- no problem, just add a few more lines to your network configuration. Here is an example for aarch64:

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  [ ... ]
  <dnsmasq:options>
    [ ... ]
    <dnsmasq:option value='#'/>
    <dnsmasq:option value='dhcp-match=set:efi-aa64-pxe,option:client-arch,b'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-aa64-pxe,/arch-aarch64/grubaa64.efi,,192.168.2.14'/>
    <dnsmasq:option value='#'/>
    <dnsmasq:option value='dhcp-match=set:efi-aa64-http,option:client-arch,19'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-aa64-http,http://boot.home.kraxel.org/arch-aarch64/grubaa64.efi'/>
    <dnsmasq:option value='dhcp-option-force=tag:efi-aa64-http,60,HTTPClient'/>
  </dnsmasq:options>
</network>

In case you are wondering why I place the grub binaries in subdirectories: grub tries to fetch the config file from the same directory, so that way I get per-arch config files and they are named /arch-aarch64/grub.cfg, /arch-x86_64/grub.cfg and so on. A nice side effect is that the toplevel directory is a bit less cluttered with files.
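Putting it all together, the tftproot layout used throughout this article looks roughly like this (plus whatever else pxelinux and grub need):

/var/lib/tftpboot/
    pxelinux.0
    arch-x86_64/
        grubx64.efi
        grub.cfg
    arch-aarch64/
        grubaa64.efi
        grub.cfg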

And beyond libvirt?

Well, the fundamental idea doesn't change. Look at the arch option, then send different replies depending on what you find there. With other dhcp servers the syntax is different, but the pattern is the same. Here is a sample snippet for the isc dhcp server shipped with most linux distributions:

option arch code 93 = unsigned integer 16;

subnet 192.168.2.0 netmask 255.255.255.0 {
        [ ... ]

        if (option arch = 00:16) {
                option vendor-class-identifier "HTTPClient";
                filename "http://boot.home.kraxel.org/arch-x86_64/grubx64.efi";
        } else if (option arch = 00:07) {
                next-server 192.168.2.14;
                filename "/arch-x86_64/grubx64.efi";
        } else {
                next-server 192.168.2.14;
                filename "/pxelinux.0";
        }
}

by Gerd Hoffmann at September 06, 2021 10:00 PM

QEMU project

Presenting guest images as raw image files with FUSE

Sometimes, there is a VM disk image whose contents you want to manipulate without booting the VM. One way of doing this is to use libguestfs, which can boot a minimal Linux VM to provide the host with secure access to the disk’s contents. For example, guestmount allows you to mount a guest filesystem on the host, without requiring root rights.

However, maybe you cannot or do not want to use libguestfs, e.g. because you do not have KVM available in your environment, and so it becomes too slow; or because you do not want to go through a guest OS, but want to access the raw image data directly on the host, with minimal overhead.

Note: Guest images can generally be arbitrarily modified by VM guests. If you have an image to which an untrusted guest had write access at some point, you must treat any data and metadata on this image as potentially having been modified in a malicious manner. Parsing anything must be done carefully and with caution. Note that many existing tools are not careful in this regard, for example, filesystem drivers generally deliberately do not have protection against maliciously corrupted filesystems. This is why in contrast accessing an image through libguestfs is considered secure, because the actual access happens in a libvirt-managed VM guest.

From this point, we assume you are aware of the security caveats and still want to access and manipulate image data on the host.

Now, unless your image is already in raw format, you will be faced with the problem of getting it into raw format. The tools that you might want to use for image manipulation generally only work on raw images (because that is how block device files appear), like:

  • dd to just copy data to and from given offsets,
  • parted to manipulate the partition table,
  • kpartx to present all partitions as block devices,
  • mount to access filesystems’ contents.

So if you want to use such tools on image files e.g. in QEMU’s qcow2 format, you will need to translate them into raw images first, for example by:

  • Exporting the image file with qemu-nbd -c as an NBD block device file,
  • Converting between image formats using qemu-img convert,
  • Accessing the image from a guest, where it appears as a normal block device.

Unfortunately, none of these methods is perfect: qemu-nbd -c generally requires root rights; converting to a temporary raw copy requires additional disk space and the conversion process takes time; and accessing the image from a guest is basically what libguestfs does (i.e., if that is what you want, then you should probably use libguestfs).
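For comparison, the qemu-nbd route mentioned above typically looks something like this (the device and image names are examples, and it needs root as well as the nbd kernel module):

$ sudo modprobe nbd
$ sudo qemu-nbd -c /dev/nbd0 foo.qcow2
$ sudo kpartx -av /dev/nbd0
# ... work on the partitions under /dev/mapper/ ...
$ sudo kpartx -d /dev/nbd0
$ sudo qemu-nbd -d /dev/nbd0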

As of QEMU 6.0, there is another method, namely FUSE block exports. Conceptually, these are rather similar to using qemu-nbd -c, but they do not require root rights.

Note: FUSE block exports are a feature that can be enabled or disabled during the build process with --enable-fuse or --disable-fuse, respectively; omitting either configure option will enable the feature if and only if libfuse3 is present. It is possible that the QEMU build you are using does not have FUSE block export support, because it was not compiled in.

FUSE (Filesystem in Userspace) is a technology to let userspace processes provide filesystem drivers. For example, sshfs is a program that allows mounting remote directories from a machine accessible via SSH.
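For example, mounting a remote directory with sshfs and unmounting it again looks like this (host and paths are placeholders):

$ sshfs user@example.com:/remote/dir ~/mnt
$ ls ~/mnt
$ fusermount -u ~/mnt    # or fusermount3 -u, depending on the FUSE version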

QEMU can use FUSE to make a virtual block device appear as a normal file on the host, so that tools like kpartx can interact with it regardless of the image format, like in the following example:

$ qemu-img create -f raw foo.img 20G
Formatting 'foo.img', fmt=raw size=21474836480

$ parted -s foo.img \
    'mklabel msdos' \
    'mkpart primary ext4 2048s 100%'

$ qemu-img convert -p -f raw -O qcow2 foo.img foo.qcow2 && rm foo.img
    (100.00/100%)

$ file foo.qcow2
foo.qcow2: QEMU QCOW2 Image (v3), 21474836480 bytes

$ sudo kpartx -l foo.qcow2

$ qemu-storage-daemon \
    --blockdev node-name=prot-node,driver=file,filename=foo.qcow2 \
    --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
    --export \
    type=fuse,id=exp0,node-name=fmt-node,mountpoint=foo.qcow2,writable=on \
    &
[1] 200495

$ file foo.qcow2
foo.qcow2: DOS/MBR boot sector; partition 1 : ID=0x83, start-CHS (0x10,0,1),
end-CHS (0x3ff,3,32), startsector 2048, 41940992 sectors

$ sudo kpartx -av foo.qcow2
add map loop0p1 (254:0): 0 41940992 linear 7:0 2048

In this example, we create a partition on a newly created raw image. We then convert this raw image to qcow2 and discard the original. Because a tool like kpartx cannot parse the qcow2 format, it reports no partitions to be present in foo.qcow2.

Using the QEMU storage daemon, we then create a FUSE export for the image that apparently turns it into a raw image, which makes the content and thus the partitions visible to file and kpartx. Now, we can use kpartx to access the partition in foo.qcow2 under /dev/mapper/loop0p1.

So how does this work? How can the QEMU storage daemon make a qcow2 image appear as a raw image?

File mounts

To transparently translate a file into a different format, like we did above, we make use of two little-known facts about filesystems and the VFS on Linux. The first one of these we can explain immediately; for the second one we will need some more information about how FUSE exports work, so that secret will be lifted later (down in the “Mounting an image on itself” section).

Here is the first secret: Filesystems do not need to have a root directory. They only need a root node. A regular file is a node, so a filesystem that only consists of a single regular file is perfectly valid.

Note that this is not about filesystems with just a single file in their root directory, but about filesystems that really do not have a root directory.

Conceptually, every filesystem is a tree, and mounting works by replacing one subtree of the global VFS tree by the mounted filesystem’s tree. Normally, a filesystem’s root node is a directory, like in the following example:

Regular filesystem: Root directory is mounted to a directory mount point
Fig. 1: Mounting a regular filesystem with a directory as its root node

Here, the directory /foo and its content (the files /foo/a and /foo/b) are shadowed by the new filesystem (showing /foo/x and /foo/y).

Note that a filesystem’s root node generally has no name. After mounting, the filesystem’s root directory’s name is determined by the original name of the mount point. (“/” is not a name. It specifically is a directory without a name.)

Because a tree does not need to have multiple nodes but may consist of just a single leaf, a filesystem with a file for its root node works just as well, though:

Mounting a file root node to a regular file mount point
Fig. 2: Mounting a filesystem with a regular (unnamed) file as its root node

Here, FS B only consists of a single node, a regular file with no name. (As above, a filesystem’s root node is generally unnamed.) Consequently, the mount point for it must also be a regular file (/foo/a in our example), and just like before, the content of /foo/a is shadowed, and when opening it, one will instead see the contents of FS B’s unnamed root node.

QEMU block exports

Before we can see what FUSE exports are and how they work, we should explore QEMU block exports in general.

QEMU allows exporting block nodes via various protocols (as of 6.0: NBD, vhost-user, FUSE). A block node is an element of QEMU’s block graph (see e.g. Managing the New Block Layer, a talk given at KVM Forum 2017), which can for example be attached to guest devices. Here is a very simple example:

Block graph: image file <-> file node (label: prot-node) <-> qcow2 node (label: fmt-node) <-> virtio-blk guest device
Fig. 3: A simple block graph for attaching a qcow2 image to a virtio-blk guest device

This is the simplest example for a block graph that connects a virtio-blk guest device to a qcow2 image file. The file block driver, instanced in the form of a block node named prot-node, accesses the actual file and provides the node above it access to the raw content. This node above, named fmt-node, is handled by the qcow2 block driver, which is capable of interpreting the qcow2 format. Parents of this node will therefore see the actual content of the virtual disk that is represented by the qcow2 image. There is only one parent here, which is the virtio-blk guest device, which will thus see the virtual disk.

The command line to achieve the above could look something like this:

$ qemu-system-x86_64 \
    -blockdev node-name=prot-node,driver=file,filename=$image_path \
    -blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
    -device virtio-blk,drive=fmt-node,share-rw=on

Besides attaching guest devices to block nodes, you can also export them for users outside of qemu, for example via NBD. Say you have a QMP channel open for the QEMU instance above, then you could do this:

{
    "execute": "nbd-server-start",
    "arguments": {
        "addr": {
            "type": "inet",
            "data": {
                "host": "localhost",
                "port": "10809"
            }
        }
    }
}
{
    "execute": "block-export-add",
    "arguments": {
        "type": "nbd",
        "id": "exp0",
        "node-name": "fmt-node",
        "name": "guest-disk",
        "writable": true
    }
}

This opens an NBD server on localhost:10809, which exports fmt-node (under the NBD export name guest-disk). The block graph looks as follows:

Same block graph as fig. 3, but with an NBD server attached to fmt-node
Fig. 4: Block graph extended by an NBD server

NBD clients connecting to this server will see the raw disk as seen by the guest – we have exported the guest disk:

$ qemu-img info nbd://localhost/guest-disk
image: nbd://localhost:10809/guest-disk
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: unavailable

QEMU storage daemon

If you are not running a guest, and so do not need guest devices, but all you want is to use the QEMU block layer (for example to interpret the qcow2 format) and export nodes from the block graph, then you can use the more lightweight QEMU storage daemon instead of a full-blown QEMU process:

$ qemu-storage-daemon \
    --blockdev node-name=prot-node,driver=file,filename=$image_path \
    --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
    --nbd-server addr.type=inet,addr.host=localhost,addr.port=10809 \
    --export \
    type=nbd,id=exp0,node-name=fmt-node,name=guest-disk,writable=on

Which creates the following block graph:

Block graph: image file <-> file node (label: prot-node) <-> qcow2 node (label: fmt-node) <-> NBD server
Fig. 5: Exporting a qcow2 image over NBD

FUSE block exports

Besides NBD exports, QEMU also supports vhost-user and FUSE exports. FUSE block exports make QEMU become a FUSE driver that provides a filesystem that consists of only a single node, namely a regular file that has the raw contents of the exported block node. QEMU will automatically mount this filesystem on a given existing regular file (which acts as the mount point, as described in the “File mounts” section).

Thus, FUSE exports can be used like this:

$ touch mount-point

$ qemu-storage-daemon \
  --blockdev node-name=prot-node,driver=file,filename=$image_path \
  --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
  --export \
  type=fuse,id=exp0,node-name=fmt-node,mountpoint=mount-point,writable=on

The mount point now appears as the raw VM disk that is stored in the qcow2 image:

$ qemu-img info mount-point
image: mount-point
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: 196 KiB

And mount tells us that this is indeed its own filesystem:

$ mount | grep mount-point
/dev/fuse on /tmp/mount-point type fuse (rw,nosuid,nodev,relatime,user_id=1000,
group_id=100,default_permissions,allow_other,max_read=67108864)

The block graph looks like this:

Block graph: image file <-> file node (label: prot-node) <-> qcow2 node (label: fmt-node) <-> FUSE server <-> exported file
Fig. 6: Exporting a qcow2 image over FUSE

Closing the storage daemon (e.g. with Ctrl-C) automatically unmounts the export, turning the mount point back into an empty normal file:

$ mount | grep -c mount-point
0

$ qemu-img info mount-point
image: mount-point
file format: raw
virtual size: 0 B (0 bytes)
disk size: 0 B

Mounting an image on itself

So far, we have seen what FUSE exports are, how they work, and how they can be used. However, in the very first example in this blog post, we did not export the raw image on some empty regular file that just serves as a mount point – no, we turned the original qcow2 image itself into a raw image.

How does that work?

What happens to the old tree under a mount point?

Mounting a filesystem only shadows the mount point’s original content, it does not remove it. The original content can no longer be looked up via its (absolute) path, but it is still there, much like a file that has been unlinked but is still open in some process. Here is an example:

First, create some file in some directory, and have some process keep it open:

$ mkdir foo

$ echo 'Is anyone there?' > foo/bar

$ irb
irb(main):001:0> f = File.open('foo/bar', 'r+')
=> #<File:foo/bar>
irb(main):002:0> ^Z
[1]  + 35494 suspended  irb

Next, mount something on the directory:

$ sudo mount -t tmpfs tmpfs foo

The file cannot be found anymore (because foo’s content is shadowed by the mounted filesystem), but the process that kept it open can still read from it and write to it:

$ ls foo

$ cat foo/bar
cat: foo/bar: No such file or directory

$ fg
f.read
irb(main):002:0> f.read
=> "Is anyone there?\n"
irb(main):003:0> f.puts('Hello from the shadows!')
=> nil
irb(main):004:0> exit

$ ls foo

$ cat foo/bar
cat: foo/bar: No such file or directory

Unmounting the filesystem lets us see our file again, with its updated content:

$ sudo umount foo

$ ls foo
bar

$ cat foo/bar
Is anyone there?
Hello from the shadows!

Letting a FUSE export shadow its image file

The same principle applies to file mounts: The original inode is shadowed (along with its content), but it is still there for any process that opened it before the mount occurred. Because QEMU (or the storage daemon) opens the image file before mounting the FUSE export, you can specify an image’s path as the mount point for its corresponding export:

$ qemu-img create -f qcow2 foo.qcow2 20G
Formatting 'foo.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off
 compression_type=zlib size=21474836480 lazy_refcounts=off refcount_bits=16

$ qemu-img info foo.qcow2
image: foo.qcow2
file format: qcow2
virtual size: 20 GiB (21474836480 bytes)
disk size: 196 KiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false

$ qemu-storage-daemon --blockdev \
   node-name=node0,driver=qcow2,file.driver=file,file.filename=foo.qcow2 \
   --export \
   type=fuse,id=node0-export,node-name=node0,mountpoint=foo.qcow2,writable=on &
[1] 40843

$ qemu-img info foo.qcow2
image: foo.qcow2
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: 196 KiB

$ kill %1
[1]  + 40843 done       qemu-storage-daemon --blockdev  --export

In graph form, that looks like this:

Two graphs: First, foo.qcow2 is opened by QEMU; second, a FUSE server exports the raw disk under foo.qcow2, thus shadowing the original foo.qcow2
Fig. 7: Exporting a qcow2 image via FUSE on its own path

QEMU (or the storage daemon in this case) keeps the original (qcow2) file open, and so it keeps access to it, even after the mount. However, any other process that opens the image by name (i.e. open("foo.qcow2")) will open the raw disk image exported by QEMU. Therefore, it looks like the qcow2 image is in raw format now.

qemu-fuse-disk-export.py

Because the QEMU storage daemon command line tends to become kind of long, I’ve written a script to facilitate the process: qemu-fuse-disk-export.py (direct download link). This script automatically detects the image format, and its --daemonize option allows safe use in scripts, where it is important that the process blocks until the export is fully set up.

Using qemu-fuse-disk-export.py, the above example looks like this:

$ qemu-img info foo.qcow2 | grep 'file format'
file format: qcow2

$ qemu-fuse-disk-export.py foo.qcow2 &
[1] 13339
All exports set up, ^C to revert

$ qemu-img info foo.qcow2 | grep 'file format'
file format: raw

$ kill -SIGINT %1
[1]  + 13339 done       qemu-fuse-disk-export.py foo.qcow2

$ qemu-img info foo.qcow2 | grep 'file format'
file format: qcow2

Or, with --daemonize/-d:

$ qemu-img info foo.qcow2 | grep 'file format'
file format: qcow2

$ qemu-fuse-disk-export.py -dp qfde.pid foo.qcow2

$ qemu-img info foo.qcow2 | grep 'file format'
file format: raw

$ kill -SIGINT $(cat qfde.pid)

$ qemu-img info foo.qcow2 | grep 'file format'
file format: qcow2

Bringing it all together

Now we know how to make disk images in any format understood by QEMU appear as raw images. We can thus run any application on them that works with such raw disk images:

$ qemu-fuse-disk-export.py \
    -dp qfde.pid \
    Arch-Linux-x86_64-basic-20210711.28787.qcow2

$ parted Arch-Linux-x86_64-basic-20210711.28787.qcow2 p
WARNING: You are not superuser.  Watch out for permissions.
Model:  (file)
Disk /tmp/Arch-Linux-x86_64-basic-20210711.28787.qcow2: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name  Flags
 1      1049kB  2097kB  1049kB                     bios_grub
 2      2097kB  42.9GB  42.9GB  btrfs

$ sudo kpartx -av Arch-Linux-x86_64-basic-20210711.28787.qcow2
add map loop0p1 (254:0): 0 2048 linear 7:0 2048
add map loop0p2 (254:1): 0 83881951 linear 7:0 4096

$ sudo mount /dev/mapper/loop0p2 /mnt/tmp

$ ls /mnt/tmp
bin   boot  dev  etc  home  lib  lib64  mnt  opt  proc  root  run  sbin  srv
swap  sys   tmp  usr  var

$ echo 'Hello, qcow2 image!' > /mnt/tmp/home/arch/hello

$ sudo umount /mnt/tmp

$ sudo kpartx -d Arch-Linux-x86_64-basic-20210711.28787.qcow2
loop deleted : /dev/loop0

$ kill -SIGINT $(cat qfde.pid)

And launching the image, in the guest we see:

[arch@archlinux ~] cat hello
Hello, qcow2 image!

A note on allow_other

In the example presented in the above section, we access the exported image with a different user than the one who exported it (to be specific, we export it as a normal user, and then access it as root). This does not work prior to QEMU 6.1:

$ qemu-fuse-disk-export.py -dp qfde.pid foo.qcow2

$ sudo stat foo.qcow2
stat: cannot statx 'foo.qcow2': Permission denied

QEMU 6.1 has introduced support for FUSE’s allow_other mount option. Without that option, only the user who exported the image has access to it. By default, if the system allows for non-root users to add allow_other to FUSE mount options, QEMU will add it, and otherwise omit it. It does so by simply attempting to mount the export with allow_other first, and if that fails, it will try again without. (You can also force the behavior with the allow_other=(on|off|auto) export parameter.)

Non-root users can pass allow_other if and only if /etc/fuse.conf contains the user_allow_other option.
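So if cross-user access like in the example above fails, checking for that option is a good first step (this modifies a system-wide configuration file, so only do it if that is acceptable for your setup):

$ grep user_allow_other /etc/fuse.conf || \
    echo user_allow_other | sudo tee -a /etc/fuse.conf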

Conclusion

As shown in this blog post, FUSE block exports are a relatively simple way to access images in any format understood by QEMU as if they were raw images. Any tool that can manipulate raw disk images can thus manipulate images in any format, simply by having the QEMU storage daemon provide a translation layer. By mounting the FUSE export on the original image path, this translation layer will effectively be invisible, and the original image will look like it is in raw format, so it can directly be accessed by those tools.

The current main disadvantage of FUSE exports is that they offer relatively bad performance. That should be fine as long as your use case is just light manipulation of some VM images, like manually modifying some files on them. However, we did not yet really try to optimize performance, so if more serious use cases appear that would require better performance, we can try.

by Hanna Reitz at September 06, 2021 06:30 PM

August 24, 2021

QEMU project

QEMU version 6.1.0 released

We’d like to announce the availability of the QEMU 6.1.0 release. This release contains 3000+ commits from 221 authors.

You can grab the tarball from our download page. The full list of changes is available in the Wiki.
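If you prefer to build from source, the usual steps apply (the target list and parallelism below are just examples):

wget https://download.qemu.org/qemu-6.1.0.tar.xz
tar xf qemu-6.1.0.tar.xz
cd qemu-6.1.0
./configure --target-list=x86_64-softmmu
make -j$(nproc)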

Highlights include:

  • block: support for changing block node options after creation via ‘blockdev-reopen’ QMP command
  • Crypto: more performant backend recommendations and improved documentation
  • I2C: emulation support for I2C muxes (pca9546, pca9548) and PMBus
  • TCG Plugins: now enabled by default, with new execlog and cache modelling plugins.
  • ARM: new board support for Aspeed (rainier-bmc, quanta-q7l1), npcm7xx (quanta-gbs-bmc), and Cortex-M3 (stm32vldiscovery) based machines
  • ARM: Aspeed support of Hash and Crypto Engine
  • ARM: emulation support for SVE2 (including bfloat16), integer matrix multiply accumulate operations, TLB invalidate in Outer Shareable domain, TLB range invalidate, and more.
  • PowerPC: pseries: support for detecting hotplug failures in newer guests
  • PowerPC: pseries: increased maximum CPU count
  • PowerPC: pseries: emulation support for some POWER10 prefixed instructions
  • PowerPC: new board support for Genesi/bPlan Pegasos II (pegasos2)
  • RISC-V: updates to OpenTitan platform support, including OpenTitan timer
  • RISC-V: support for virtio-vga
  • RISC-V: documentation improvements and general code cleanups/fixes
  • s390: emulation support for the vector-enhancements facility
  • s390: support for gen16 CPU models
  • x86: new Intel CPU model versions with support for XSAVES instruction
  • x86: added ACPI based PCI hotplug support for Q35 machine (now the default)
  • x86: improvements to emulation of AMD virtualization extensions
  • and lots more…

Thank you to everyone involved!

August 24, 2021 08:22 PM

Powered by Planet!
Last updated: June 25, 2022 07:06 AM