Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

Subscriptions

Planet Feeds

March 24, 2023

QEMU project

Preparing a consistent Python environment

Building QEMU is a complex task, split across several programs. the configure script finds the host and cross compilers that are needed to build emulators and firmware; Meson prepares the build environment for the emulators; finally, Make and Ninja actually perform the build, and in some cases they run tests as well.

In addition to compiling C code, many build steps run tools and scripts which are mostly written in the Python language. These include processing the emulator configuration, code generators for tracepoints and QAPI, extensions for the Sphinx documentation tool, and the Avocado testing framework. The Meson build system itself is written in Python, too.

Some of these tools are run through the python3 executable, while others are invoked directly as sphinx-build or meson, and this can create inconsistencies. For example, QEMU’s configure script checks for a minimum version of Python and rejects too-old interpreters. However, what would happen if code run by Sphinx used a different version?

This situation has been largely hypothetical until recently; QEMU’s Python code is already tested with a wide range of versions of the interpreter, and it would not be a huge issue if Sphinx used a different version of Python as long as both of them were supported. This will change in version 8.1 of QEMU, which will bump the minimum supported version of Python from 3.6 to 3.8. While all the distros that QEMU supports have a recent-enough interpreter, the default on RHEL8 and SLES15 is still version 3.6, and that is what all binaries in /usr/bin use unconditionally.

As of QEMU 8.0, even if configure is told to use /usr/bin/python3.8 for the build, QEMU’s custom Sphinx extensions would still run under Python 3.6. configure does separately check that Sphinx is executing with a new enough Python version, but it would be nice if there were a more generic way to prepare a consistent Python environment.

This post will explain how QEMU 8.1 will ensure that a single interpreter is used for the whole of the build process. Getting there will require some familiarity with Python packaging, so let’s start with virtual environments.

Virtual environments

It is surprisingly hard to find what Python interpreter a given script will use. You can try to parse the first line of the script, which will be something like #! /usr/bin/python3, but there is no guarantee of success. For example, on some version of Homebrew /usr/bin/meson will be a wrapper script like:

#!/bin/bash
PYTHONPATH="/usr/local/Cellar/meson/0.55.0/lib/python3.8/site-packages" \
  exec "/usr/local/Cellar/meson/0.55.0/libexec/bin/meson" "$@"

The file with the Python shebang line will be hidden somewhere in /usr/local/Cellar. Therefore, performing some kind of check on the files in /usr/bin is ruled out. QEMU needs to set up a consistent environment on its own.

If a user who is building QEMU wanted to do so, the simplest way would be to use Python virtual environments. A virtual environment takes an existing Python installation but gives it a local set of Python packages. It also has its own bin directory; place it at the beginning of your PATH and you will be able to control the Python interpreter for scripts that begin with #! /usr/bin/env python3.

Furthermore, when packages are installed into the virtual environment with pip, they always refer to the Python interpreter that was used to create the environment. Virtual environments mostly solve the consistency problem at the cost of an extra pip install step to put QEMU’s build dependencies into the environment.

Unfortunately, this extra step has a substantial downside. Even though the virtual environment can optionally refer to the base installation’s installed packages, pip will always install packages from scratch into the virtual environment. For all Linux distributions except RHEL8 and SLES15 this is unnecessary, and users would be happy to build QEMU using the versions of Meson and Sphinx included in the distribution.

Even worse, pip install will access the Python package index (PyPI) over the Internet, which is often impossible on build machines that are sealed from the outside world. Automated installation of PyPI dependencies may actually be a welcome feature, but it must also remain strictly optional.

In other words, the ideal solution would use a non-isolated virtual environment, to be able to use system packages provided by Linux distributions; but it would also ensure that scripts (sphinx-build, meson, avocado) are placed into bin just like pip install does.

Distribution packages

When it comes to packages, Python surely makes an effort to be confusing. The fundamental unit for importing code into a Python program is called a package; for example os and sys are two examples of a package. However, a program or library that is distributed on PyPI consists of many such “import packages”: that’s because while pip is usually said to be a “package installer” for Python, more precisely it installs “distribution packages”.

To add to the confusion, the term “distribution package” is often shortened to either “package” or “distribution”. And finally, the metadata of the distribution package remains available even after installation, so “distributions” include things that are already installed (and are not being distributed anywhere).

All this matters because distribution metadata will be the key to building the perfect virtual environment. If you look at the content of bin/meson in a virtual environment, after installing the package with pip, this is what you find:

#!/home/pbonzini/my-venv/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from mesonbuild.mesonmain import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

This looks a lot like automatically generated code, and in fact it is; the only parts that vary are the from mesonbuild.mesonmain import main import, and the invocation of the main() function on the last line. pip creates this invocation script based on the setup.cfg file in Meson’s source code, more specifically based on the following stanza:

[options.entry_points]
console_scripts =
  meson = mesonbuild.mesonmain:main

Similar declarations exist in Sphinx, Avocado and so on, and accessing their content is easy via importlib.metadata (available in Python 3.8+):

$ python3
>>> from importlib.metadata import distribution
>>> distribution('meson').entry_points
[EntryPoint(name='meson', value='mesonbuild.mesonmain:main', group='console_scripts')]

importlib looks up the metadata in the running Python interpreter’s search path; if Meson is installed under another interpreter’s site-packages directory, it will not be found:

$ python3.8
>>> from importlib.metadata import distribution
>>> distribution('meson').entry_points
Traceback (most recent call last):
...
importlib.metadata.PackageNotFoundError: meson

So finally we have a plan! configure can build a non-isolated virtual environment, use importlib to check that the required packages exist in the base installation, and create scripts in bin that point to the right Python interpreter. Then, it can optionally use pip install to install the missing packages.

While this process includes a certain amount of specialized logic, Python provides a customizable venv module to create virtual environments. The custom steps can be performed by subclassing venv.EnvBuilder.

This will provide the same experience as QEMU 8.0, except that there will be no need for the --meson and --sphinx-build options to the configure script. The path to the Python interpreter is enough to set up all Python programs used during the build.

There is only one thing left to fix…

Nesting virtual environments

Remember how we started with a user that creates her own virtual environment before building QEMU? Well, this would not work anymore, because virtual environments cannot be nested. As soon as configure creates its own virtual environment, the packages installed by the user are not available anymore.

Fortunately, the “appearance” of a nested virtual environment is easy to emulate. Detecting whether python3 runs in a virtual environment is as easy as checking sys.prefix != sys.base_prefix; if it is, we need to retrieve the parent virtual environments site-packages directory:

>>> import sysconfig
>>> sysconfig.get_path('purelib')
'/home/pbonzini/my-venv/lib/python3.11/site-packages'

and write it to a .pth file in the lib directory of the new virtual environment. The following demo shows how a distribution package in the parent virtual environment will be available in the child as well:

A small detail is that configure’s new virtual environment should mirror the isolation setting of the parent. An isolated venv can be detected because sys.base_prefix in site.PREFIXES is false.

Conclusion

Right now, QEMU only makes a minimal attempt at ensuring consistency of the Python environment; Meson is always run using the interpreter that was passed to the configure script with --python or $PYTHON, but that’s it. Once the above technique will be implemented in QEMU 8.1, there will be no difference in the build experience, but configuration will be easier and a wider set of invalid build environments will be detected. We will merge these checks before dropping support for Python 3.6, so that users on older enterprise distributions will have a smooth transition.

March 24, 2023 09:00 AM

March 22, 2023

Stefan Hajnoczi

How to debug stuck VIRTIO devices in QEMU

Every once in a while a bug comes along where a guest hangs while communicating with a QEMU VIRTIO device. In this blog post I'll share some debugging approaches that can help QEMU developers who are trying to understand why a VIRTIO device is stuck.

There are a number of reasons why communication with a VIRTIO device might cease, so it helps to identify the nature of the hang:

  • Did the QEMU device see the requests that the guest driver submitted?
  • Did the QEMU device complete the request?
  • Did the guest driver see the requests that the device completed?

The case I will talk about is when QEMU itself is still responsive (the QMP/HMP monitor works) and the guest may or may not be responsive.

Finding requests that are stuck

There is a QEMU monitor command to inspect virtqueues called x-query-virtio-queue-status (QMP) and info virtio-queue-status (HMP). This is a quick way to extract information about a virtqueue from QEMU.

This command allows us to answer the question of whether the QEMU device completed its requests. The shadow_avail_idx and used_idx values in the output are the Available Ring index and Used Ring index, respectively. When they are equal the device has completed all requests. When they are not equal there are still requests in flight and the request must be stuck inside QEMU.

Here is a little more background on the index values. Remember that VIRTIO Split Virtqueues have an Available Ring index and a Used Ring index. The Available Ring index is incremented by the driver whenever it submits a request. The Used Ring index is incremented by the device whenever it completes a request. If the Available Ring index is equal to the Used Ring index then all requests have been completed.

Note that shadow_avail_idx is not the vring Available Ring index in guest RAM but just the last cached copy that the device saw. That means we cannot tell if there are new requests that the device hasn't seen yet. We need to take another approach to figure that out.

Finding requests that the device has not seen yet

Maybe the device has not seen new requests recently and this is why the guest is stuck. That can happen if the device is not receiving Buffer Available Notifications properly (normally this is done by reading a virtqueue kick ioeventfd, also known as a host notifier in QEMU).

We cannot use QEMU monitor commands here, but attaching the GDB debugger to QEMU will allow us to peak at the Available Ring index in guest RAM. The following GDB Python script loads the Available Ring index for a given VirtQueue:


$ cat avail-idx.py
import gdb

# ADDRESS is the address of a VirtQueue struct
vq = gdb.Value(ADDRESS).cast(gdb.lookup_type('VirtQueue').pointer())
avail_idx = vq['vring']['caches']['avail']['ptr'].cast(uint16_type.pointer())[1]
if avail_idx != vq['shadow_avail_idx']:
print('Device has not seen all available buffers: avail_idx {} shadow_avail_idx {} in {}'.format(avail_idx, vq['shadow_avail_idx'], vq.dereference()))

You can run the script using the source avail-idx.py GDB command. Finding the address of the virtqueue depends on the type of device that you are debugging.

Finding completions that the guest has not seen

If requests are not stuck inside QEMU and the device has seen the latest request, then the guest driver might have missed the Used Buffer Notification from the device (normally an interrupt handler or polling loop inside the guest detects completed requests).

In VIRTIO the driver's current index in the Used Ring is not visible to the device. This means we have no general way of knowing whether the driver has seen completions. However, there is a cool trick for modern devices that have the VIRTIO_RING_F_EVENT_IDX feature enabled.

The trick is that the Linux VIRTIO driver code updates the Used Event Index every time a completed request is popped from the virtqueue. So if we look at the Used Event Index we know the driver's index into the Used Ring and can find out whether it has seen request completions.

The following GDB Python script loads the Used Event Index for a given VirtQueue:


$ cat used-event-idx.py
import gdb

# ADDRESS is the address of a VirtQueue struct
vq = gdb.Value(ADDRESS).cast(gdb.lookup_type('VirtQueue').pointer())
used_event = vq['vring']['caches']['avail']['ptr'].cast(uint16_type.pointer())[2 + vq['vring']['num']]
if used_event != vq['used_idx']:
print('Driver has not seen all used buffers: used_event {} used_idx {} in {}'.format(used_event, vq['used_idx'], vq.dereference()))

You can run the script using the source avail-idx.py GDB command. Finding the address of the virtqueue depends on the type of device that you are debugging.

Conclusion

I hope this helps anyone who has to debug a VIRTIO device that seems to have gotten stuck.

by Unknown (noreply@blogger.com) at March 22, 2023 07:39 PM

March 17, 2023

KVM on Z

KVM Upstream Roundup 2022

The KVM on Z community has been quite busy in 2022, delivering a number of features. Here is a look back at a selection of significant new features in recent months:
  • Support for IBM z16 and LinuxONE Emperor 4
    • Allows guest exploitation of new hardware features and live migration for the latest machine generation.
    • Requires QEMU 6.1 and Linux kernel 5.14 or later.
  • Support of long kernel command lines
    • Kernel command lines up to 64 KB length, e.g. allows to specify plenty of I/O devices.
    • Requires QEMU 7.0, s390-tools 2.20 or later.
  • Remote attestation for Secure Execution
    • Cryptographic evidence of workload authenticity and integrity facilitates integration into common Confidential Computing frameworks.
    • Requires QEMU 7.1 and Linux kernel 5.19 or later.
  • Encrypted dump for Secure Execution
    • Enhances problem determination capabilities while not compromising the security of secure KVM guests.
    • Requires QEMU 7.1, Linux 5.19, and s390-tools 2.22 or later.
  • Persistent configuration for vfio-ap 
    • The s390-tools command zdev can now be used to persist Crypto passthrough configurations.
    • Requires s390-tools 2.22 or later.
  • Interpretive vfio-pci support for ISM
    • Allows pass-through of ISM devices to KVM guests, enabling high-bandwith and low-latency network communication using SMC-D.
    • Requires QEMU 7.2 and Linux kernel 6.2 or later.

by Viktor Mihajlovski (noreply@blogger.com) at March 17, 2023 02:05 PM

March 09, 2023

Daniel Berrange

make-tiny-image.py: creating tiny initrds for testing QEMU or Linux kernel/userspace behaviour

As a virtualization developer a significant amount of time is spent in understanding and debugging the behaviour and interaction of QEMU and the guest kernel/userspace code. As such my development machines have a variety of guest OS installations that get booted for various tasks. Some tasks, however, require a repeated cycle of QEMU code changes, or QEMU config changes, followed by guest testing. Waiting for an OS to boot can quickly become a significant time sink affecting productivity and lead to frustration. What is needed is a very low overhead way to accomplish simple testing tasks without an OS getting in the way.

Enter ‘make-tiny-image.py‘ tool for creating minimal initrd images.

If invoked with no arguments, this tool will create an initrd containing nothing more than busybox. The “init” program will be a script that creates a few device nodes, mounts proc/sysfs and then runs the busybox ‘sh’ binary to provide an interactive shell. This is intended to be used as follows

$ ./make-tiny-image.py
tiny-initrd.img
6.0.8-300.fc37.x86_64

$ qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -accel kvm -m 1000 -display none -serial stdio
~ # uname  -a
Linux (none) 6.0.8-300.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 11 15:09:04 UTC 2022 x86_64 x86_64 x86_64 Linux
~ # uptime
 15:05:42 up 0 min,  load average: 0.00, 0.00, 0.00
~ # free
              total        used        free      shared  buff/cache   available
Mem:         961832       38056      911264        1388       12512      845600
Swap:             0           0           0
~ # df
Filesystem           1K-blocks      Used Available Use% Mounted on
none                    480916         0    480916   0% /dev
~ # ls
bin   dev   init  proc  root  sys   usr
~ # <Ctrl+D>
[   23.841282] reboot: Power down

When I say “low overhead”, just how low are we talking about ? With KVM, it takes less than a second to bring up the shell. Testing with emulation is where this really shines. Booting a full Fedora OS with QEMU emulation is slow enough that you don’t want to do it at all frequently. With this tiny initrd, it’ll take a little under 4 seconds to boot to the interactive shell. Much slower than KVM, but fast enough you’ll be fine repeating this all day long, largely unaffected by the (lack of) speed relative to KVM.

The make-tiny-image.py tool will create the initrd such that it drops you into a shell, but it can be told to run another command instead. This is how I tested the overheads mentioned above

$ ./make-tiny-image.py --run poweroff
tiny-initrd.img
6.0.8-300.fc37.x86_64

$ time qemu-system-x86_64 \
     -kernel /boot/vmlinuz-$(uname -r) \
     -initrd tiny-initrd.img \
     -append 'console=ttyS0 quiet' \
     -m 1000 -display none -serial stdio -accel kvm
[    0.561174] reboot: Power down

real	0m0.828s
user	0m0.613s
sys	0m0.093s
$ time qemu-system-x86_64 \
     -kernel /boot/vmlinuz-$(uname -r) \
     -initrd tiny-initrd.img \
     -append 'console=ttyS0 quiet' \
     -m 1000 -display none -serial stdio -accel tcg
[    2.741983] reboot: Power down

real	0m3.774s
user	0m3.626s
sys	0m0.174s

As a more useful real world example, I wanted to test the effect of changing the QEMU CPU configuration against KVM and QEMU, by comparing at the guest /proc/cpuinfo.

$ ./make-tiny-image.py --run 'cat /proc/cpuinfo'
tiny-initrd.img
6.0.8-300.fc37.x86_64

$ qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -m 1000 -display none -serial stdio -accel tcg -cpu max | grep '^flags'
flags		: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
                  cmov pat pse36 clflush acpi mmx fxsr sse sse2 ss syscall 
                  nx mmxext pdpe1gb rdtscp lm 3dnowext 3dnow rep_good nopl 
                  cpuid extd_apicid pni pclmulqdq monitor ssse3 cx16 sse4_1 
                  sse4_2 movbe popcnt aes xsave rdrand hypervisor lahf_lm 
                  svm cr8_legacy abm sse4a 3dnowprefetch vmmcall fsgsbase 
                  bmi1 smep bmi2 erms mpx adx smap clflushopt clwb xsaveopt 
                  xgetbv1 arat npt vgif umip pku ospke la57

$ qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -m 1000 -display none -serial stdio -accel kvm -cpu max | grep '^flags'
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
                  cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx 
                  pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good 
                  nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx 
                  ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe 
                  popcnt tsc_deadline_timer aes xsave avx f16c rdrand 
                  hypervisor lahf_lm abm 3dnowprefetch cpuid_fault 
                  invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced 
                  tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase 
                  tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid mpx 
                  rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 
                  xsaves arat umip sgx_lc md_clear arch_capabilities

NB, with the list of flags above, I’ve manually line wrapped the output for saner presentation in this blog rather than have one giant long line.

These examples have relied on tools provided by busybox, but we’re not limited by that. It is possible to tell it to copy in arbitrary extra binaries from the host OS by just listing their name. If it is a dynamically linked ELF binary, it’ll follow the ELF header dependencies, pulling in any shared libraries needed.

$ ./make-tiny-image.py hwloc-info lstopo-no-graphics
tiny-initrd.img
6.0.8-300.fc37.x86_64
Copy bin /usr/bin/hwloc-info -> /tmp/make-tiny-imagexu_mqd99/bin/hwloc-info
Copy bin /usr/bin/lstopo-no-graphics -> /tmp/make-tiny-imagexu_mqd99/bin/lstopo-no-graphics
Copy lib /lib64/libhwloc.so.15 -> /tmp/make-tiny-imagexu_mqd99/lib64/libhwloc.so.15
Copy lib /lib64/libc.so.6 -> /tmp/make-tiny-imagexu_mqd99/lib64/libc.so.6
Copy lib /lib64/libm.so.6 -> /tmp/make-tiny-imagexu_mqd99/lib64/libm.so.6
Copy lib /lib64/ld-linux-x86-64.so.2 -> /tmp/make-tiny-imagexu_mqd99/lib64/ld-linux-x86-64.so.2
Copy lib /lib64/libtinfo.so.6 -> /tmp/make-tiny-imagexu_mqd99/lib64/libtinfo.so.6

$ qemu-system-x86_64 -kernel /boot/vmlinuz-$(uname -r) -initrd tiny-initrd.img -append 'console=ttyS0 quiet' -m 1000 -display none -serial stdio -accel kvm 
~ # hwloc-info 
depth 0:           1 Machine (type #0)
 depth 1:          1 Package (type #1)
  depth 2:         1 L3Cache (type #6)
   depth 3:        1 L2Cache (type #5)
    depth 4:       1 L1dCache (type #4)
     depth 5:      1 L1iCache (type #9)
      depth 6:     1 Core (type #2)
       depth 7:    1 PU (type #3)
Special depth -3:  1 NUMANode (type #13)
Special depth -4:  1 Bridge (type #14)
Special depth -5:  3 PCIDev (type #15)
Special depth -6:  1 OSDev (type #16)
Special depth -7:  1 Misc (type #17)

~ # lstopo-no-graphics 
Machine (939MB total)
  Package L#0
    NUMANode L#0 (P#0 939MB)
    L3 L#0 (16MB) + L2 L#0 (4096KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
  HostBridge
    PCI 00:01.1 (IDE)
      Block "sr0"
    PCI 00:02.0 (VGA)
    PCI 00:03.0 (Ethernet)
  Misc(MemoryModule)

An obvious limitation is that if the binary/library requires certain data files, those will not be present in the initrd. It isn’t attempting to do anything clever like query the corresponding RPM file list and copy those. This tool is meant to be simple and fast and keep out of your way. If certain data files are critical for testing though, the --copy argument can be used. The copied files will be put at the same path inside the initrd as found on the host

$ ./make-tiny-image.py --copy /etc/redhat-release 
tiny-initrd.img
6.0.8-300.fc37.x86_64
Copy extra /etc/redhat-release -> /tmp/make-tiny-imageicj1tvq4/etc/redhat-release

$ qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -m 1000 -display none -serial stdio -accel kvm 
~ # cat /etc/redhat-release 
Fedora release 37 (Thirty Seven)

What if the problem being tested requires using some kernel modules ? That’s covered too with the --kmod argument, which will copy in the modules listed, along with their dependencies and the insmod command itself. As an example of its utility, I used this recently to debug a regression in support for the iTCO watchdog in Linux kernels

$ ./make-tiny-image.py --kmod lpc_ich --kmod iTCO_wdt  --kmod i2c_i801 
tiny-initrd.img
6.0.8-300.fc37.x86_64
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/mfd/lpc_ich.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/lpc_ich.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/watchdog/iTCO_wdt.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/iTCO_wdt.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/watchdog/iTCO_vendor_support.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/iTCO_vendor_support.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/mfd/intel_pmc_bxt.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/intel_pmc_bxt.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/i2c/busses/i2c-i801.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/i2c-i801.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/i2c/i2c-smbus.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/i2c-smbus.ko.xz
Copy bin /usr/sbin/insmod -> /tmp/make-tiny-image63td8wbl/bin/insmod
Copy lib /lib64/libzstd.so.1 -> /tmp/make-tiny-image63td8wbl/lib64/libzstd.so.1
Copy lib /lib64/liblzma.so.5 -> /tmp/make-tiny-image63td8wbl/lib64/liblzma.so.5
Copy lib /lib64/libz.so.1 -> /tmp/make-tiny-image63td8wbl/lib64/libz.so.1
Copy lib /lib64/libcrypto.so.3 -> /tmp/make-tiny-image63td8wbl/lib64/libcrypto.so.3
Copy lib /lib64/libgcc_s.so.1 -> /tmp/make-tiny-image63td8wbl/lib64/libgcc_s.so.1
Copy lib /lib64/libc.so.6 -> /tmp/make-tiny-image63td8wbl/lib64/libc.so.6
Copy lib /lib64/ld-linux-x86-64.so.2 -> /tmp/make-tiny-image63td8wbl/lib64/ld-linux-x86-64.so.2

$ ~/src/virt/qemu/build/qemu-system-x86_64 -kernel /boot/vmlinuz-$(uname -r) -initrd tiny-initrd.img -append 'console=ttyS0 quiet' -m 1000 -display none -serial stdio -accel kvm  -M q35 -global ICH9-LPC.noreboot=false -watchdog-action poweroff -trace ich9* -trace tco*
ich9_cc_read addr=0x3410 val=0x20 len=4
ich9_cc_write addr=0x3410 val=0x0 len=4
ich9_cc_read addr=0x3410 val=0x0 len=4
ich9_cc_read addr=0x3410 val=0x0 len=4
ich9_cc_write addr=0x3410 val=0x20 len=4
ich9_cc_read addr=0x3410 val=0x20 len=4
tco_io_write addr=0x4 val=0x8
tco_io_write addr=0x6 val=0x2
tco_io_write addr=0x6 val=0x4
tco_io_read addr=0x8 val=0x0
tco_io_read addr=0x12 val=0x4
tco_io_write addr=0x12 val=0x32
tco_io_read addr=0x12 val=0x32
tco_io_write addr=0x0 val=0x1
tco_timer_reload ticks=50 (30000 ms)
~ # mknod /dev/watchdog0 c 10 130
~ # cat /dev/watchdog0
tco_io_write addr=0x0 val=0x1
tco_timer_reload ticks=50 (30000 ms)
cat: read error: Invalid argument
[   11.052062] watchdog: watchdog0: watchdog did not stop!
tco_io_write addr=0x0 val=0x1
tco_timer_reload ticks=50 (30000 ms)
~ # tco_timer_expired timeouts_no=0 no_reboot=0/1
tco_timer_reload ticks=50 (30000 ms)
tco_timer_expired timeouts_no=1 no_reboot=0/1
tco_timer_reload ticks=50 (30000 ms)
tco_timer_expired timeouts_no=0 no_reboot=0/1
tco_timer_reload ticks=50 (30000 ms)

The Linux regression had accidentally left the watchdog with the ‘no reboot’ bit set, so it would never trigger the action, which we diagnosed from seeing repeated QEMU trace events for tco_timer_expired after triggering the watchdog in the guest. This was quicky fixed by the Linux maintainers.

In spite of being such a simple and crude script, with many, many, many unhandled edge cases, it has proved remarkably useful at enabling low overhead debugging of QEMU/Linux guest behaviour.

by Daniel Berrange at March 09, 2023 03:54 PM

March 08, 2023

QEMU project

KVM Forum 2023: Call for presentations

KVM Forum is an annual event that presents a rare opportunity for KVM and QEMU developers and users to discuss the state of Linux virtualization technology and plan for the challenges ahead. Sessions include updates on the state of the KVM virtualization stack, planning for the future, and many opportunities for attendees to collaborate.

This year’s event will be held in Brno, Czech Republic on June 14-15, 2023. It will be in-person only and will be held right before the DevConf.CZ open source community conference.

June 14 will be at least partly dedicated to a hackathon or “day of BoFs”. This will provide time for people to get together and discuss strategic decisions, as well as other topics that are best solved within smaller groups.

Call for presentations

We encourage you to submit presentations via the KVM Forum CfP page. Suggested topics include:

  • Scalability and Optimization
  • Hardening and security
  • Confidential computing
  • Testing
  • KVM and the Linux Kernel:
    • New Features and Ports
    • Device Passthrough: VFIO, mdev, vDPA
    • Network Virtualization
    • Virtio and vhost
  • Virtual Machine Monitors and Management:
    • VMM Implementation: APIs, Live Migration, Performance Tuning, etc.
    • Multi-process VMMs: vhost-user, vfio-user, QEMU Storage Daemon
    • QEMU without KVM: Hypervisor.framework and other hypervisors
    • Managing KVM: Libvirt, KubeVirt, Kata Containers
  • Emulation:
    • New Devices, Boards and Architectures
    • CPU Emulation and Binary Translation

The deadline for submitting presentations is April 2, 2023 - 11:59 PM PDT. Accepted speakers will be notified on April 17, 2023.

Attending KVM Forum

Admission to KVM Forum and DevConf.CZ is free. However, registration is required and the number of attendees is limited by the space available at the venue.

The DevConf.CZ program will feature technical talks on a variety of topics, including cloud and virtualization infrastructure—so make sure to register for DevConf.CZ as well if you would like to attend.

Both conferences are committed to fostering an open and welcoming environment for everybody. Participants are expected to abide by the Devconf.cz code of conduct and media policy.

March 08, 2023 12:45 PM

February 23, 2023

QEMU project

Announcing QEMU Google Summer of Code and Outreachy 2023 internships

QEMU is participating in Google Summer of Code and Outreachy again this year! Google Summer of Code and Outreachy are open source internship programs that offer paid remote work opportunities for contributing to open source. Internships generally run May through August, so if you have time and want to experience open source development, read on to find out how you can apply.

Each intern is paired with one or more mentors, experienced QEMU contributors who support them during the internship. Code developed by the intern is submitted through the same open source development process that all QEMU contributions follow. This gives interns experience with contributing to open source software. Some interns then choose to pursue a career in open source software after completing their internship.

Find out if you are eligible

Information on who can apply is here for Google Summer of Code and here for Outreachy. Note that Outreachy initial applications ended on February 6th so only those who have been accepted into Outreachy can apply for QEMU Outreachy internships.

Select a project idea

Look through the the list of QEMU project ideas and see if there is something you are interested in working on. Once you have found a project idea you want to apply for, email the mentor for that project idea to ask any questions you may have and discuss the idea further.

Submit your application

You can apply for Google Summer of Code from March 20th to April 4th and apply for Outreachy from March 6th to April 3rd.

Good luck with your applications!

If you have questions about applying for QEMU GSoC or Outreachy, please email Stefan Hajnoczi or ask on the #qemu-gsoc IRC channel.

February 23, 2023 01:30 PM

February 20, 2023

Stefan Hajnoczi

Writing a C library in Rust

I started working on libblkio in 2020 with the goal of creating a high-performance block I/O library. The internals are written in Rust while the library exposes a public C API for easy integration into existing applications. Most languages have a way to call C APIs, often called a Foreign Function Interface (FFI). It's the most universal way to call into code written in different languages within the same program. The choice of building a C API was a deliberate one in order to make it easy to create bindings in many programming languages. However, writing a library in Rust that exposes a C API is relatively rare (librsvg is the main example I can think of), so I wanted to share what I learnt from this project.

Calling Rust code from C

Rust has good support for making functions callable from C. The documentation on calling Rust code from C covers the basics. Here is the Rust implementation of void blkioq_set_completion_fd_enabled(struct blkioq *q, bool enable) from libblkio:


#[no_mangle]
pub extern "C" fn blkioq_set_completion_fd_enabled(q: &mut Blkioq, enable: bool) {
q.set_completion_fd_enabled(enable);
}

A C program just needs a function prototype for blkioq_set_completion_fd_enabled() and can call it directly like a C function.

What's really nice is that most primitive Rust types can be passed between languages without special conversion code in Rust. That means the function can accept arguments and return values that map naturally from Rust to C. In the code snippet above you can see that the Rust bool argument can be used without explicit conversion.

C pointers are converted to Rust pointers or references automatically by the compiler. If you want them to be nullable, just wrap them in Rust Option and the C NULL value becomes Rust None while a non-NULL pointer becomes Some. This makes it a breeze to pass data between Rust and C. In the example above, the Rust &mut Blkioq argument is a C struct blkioq *.

Rust structs also map to C nicely when they are declared with repr(C). The Rust compiler lays out the struct in memory so that its representation is compatible with the equivalent C struct.

Limitations of Rust FFI

It's not all roses though. There are fundamental differences between Rust and C that make FFI challenging. Not all language constructs are supported by FFI and some that are require manual work.

Rust generics and dynamically sized types (DST) cannot be used in extern "C" function signatures. Generics require that Rust compiler to generate code, which does not make sense in a C API because there is no Rust compiler involved. DSTs have no mapping to C and so they need to be wrapped in something that can be expressed in C, like a struct. DSTs include trait objects, so you cannot directly pass trait objects across the C/Rust language boundary.

Two extremes in library design

The limitations of FFI raise the question of how to design the library. The first extreme is to use the lowest common denominator language features supported by FFI. In the worst case this means writing C in Rust with frequent use of unsafe (because pointers and unpacked DSTs are passed around). This is obviously a bad approach because it foregoes the safety and expressiveness benefits of Rust. I think few human programmers would follow this approach although code generators or translators might output Rust code of this sort.

The other extreme is to forget about C and focus on writing an idiomatic Rust crate and then build a C API afterwards. Although this sounds nice, it's not entirely a good idea either because of the FFI limitations I mentioned. The Rust crate might be impossible to express as a C API and require significant glue code and possibly performance sacrifices if it values cannot be passed across language boundaries efficiently.

Lessons learnt

When I started libblkio I thought primarily in terms of the C API. Although the FFI code was kept isolated and the rest of the codebase was written in acceptably nice Rust, the main mistake was that I didn't think of what the native Rust crate API should look like. Only thinking of the C API meant that some of the key design decisions were suboptimal for a native Rust crate. Later on, when we began experimenting with a native Rust crate, it became clear where assumptions from the unsafe C API had crept in. It is hard to change them now, although Alberto Faria has done great work in revamping the codebase for a natural Rust API.

I erred too much on the side of the C API. In the future I would try to stay closer to the middle or slightly towards the native Rust API (but not to the extreme). That approach is most likely to end up with code that presents an efficient C API while still implementing it in idiomatic Rust. Overall, implementing a C library API in Rust was a success. I would continue to do this instead of writing new libraries in C because Rust's language features are more attractive than C's.

by Unknown (noreply@blogger.com) at February 20, 2023 03:24 PM

Video and slides available for "vhost-user-blk: a fast userspace block I/O interface"

At FOSDEM '23 I gave a talk about vhost-user-blk and its use as a userspace block I/O interface. The video and slides are now available here. Enjoy!

by Unknown (noreply@blogger.com) at February 20, 2023 12:47 PM

January 27, 2023

Stefan Hajnoczi

Speaking at FOSDEM '23 about "vhost-user-blk: A fast userspace block I/O interface"

vhost-user-blk has connected hypervisors to software-defined storage since around 2017, but it was mainly seen as virtualization technology. Did you know that vhost-user-blk is not specific to virtual machines? I think it's time to use it more generally as a userspace block I/O interface because it's fast, unprivileged, and avoids exposing kernel attack surfaces.

My LWN.net article about Accessing QEMU storage features without a VM already hinted at this, but now it's time to focus on what vhost-user-blk is and why it's easy to integrate into your applications. libblkio is a simple and familiar block I/O API with vhost-user-blk support. You can connect to existing SPDK-based software-defined storage applications, qemu-storage-daemon, and other vhost-user-blk back-ends.

Come see my FOSDEM '23 talk about vhost-user-blk as a fast userspace block I/O interface live on Saturday Feb 4 2023, 11:15 CET. It will be streamed on the FOSDEM website and recordings will be available later. Slides are available here.

by Unknown (noreply@blogger.com) at January 27, 2023 08:46 PM

January 26, 2023

KVM on Z

Client Webcast: Why virtualization is still highly used in the era of Containers and Cloud

 Wednesday, February 1st, 11:00 AM - 12:00 PM ET
"Open to Clients, Business Partners, IBMers, IT Architects, Systems Admins, etc."

  • Register here: https://ibm.webex.com/ibm/onstage/g.php?MTID=e6e21968a924c2f1ed2689f345676caf3
    After registering, you will receive a confirmation email containing information about joining the webinar.
  • With the new workloads running in containers and running in clouds, the question is if virtualization is still important and how it relates with these workloads. The session will illustrate where and why virtualization is still important in the different deployment models.
  • Presented by Wilhelm Mild, IBM Executive IT Architect - Integration Architectures for Containers, Mobile, IBM Z and Linux

by Stefan Raspl (noreply@blogger.com) at January 26, 2023 10:05 AM

December 14, 2022

QEMU project

QEMU version 7.2.0 released

We’d like to announce the availability of the QEMU 7.2.0 release. This release contains 1800+ commits from 205 authors.

You can grab the tarball from our download page. The full list of changes are available in the Wiki.

Highlights include:

  • ARM: emulation support for the following CPU features: Enhanced Translation Synchronization, PMU Extensions v3.5, Guest Translation Granule size, Hardware management of access flag/dirty bit state, and Preventing EL0 access to halves of address maps
  • ARM: emulation support for Cortex-A35 CPUs
  • LoongArch: support for fw_cfg DMA functionality, memory hotplug, and TPM device emulation
  • OpenRISC: support for multi-threaded TCG, stability improvements, and new ‘virt’ machine type for CI/device testing.
  • RISC-V: ‘virt’ machine support for booting S-mode firmware from pflash, and general device tree improvements
  • s390x: support for Message-Security-Assist Extension 5 (RNG via PRNO instruction), SHA-512 via KIMD/KLMD instructions, and enhanced zPCI interpretation support for KVM guests
  • x86: TCG performance improvements, including SSE
  • x86: TCG support for AVX, AVX2, F16C, FMA3, and VAES instructions
  • x86: KVM support for “notify vmexit” mechanism to prevent processor bugs from hanging whole system
  • LUKS block device headers are validated more strictly, creating LUKS images is supported on macOS
  • Memory backends now support NUMA-awareness when preallocating memory
  • and lots more…

Thank you to everyone involved!

December 14, 2022 11:53 PM

December 13, 2022

Thomas Huth

How to migrate bug tickets from Launchpad to Gitlab

For a long time, the QEMU project hosted its git repository on their own server and used Launchpad for tracking bugs. The self-hosting of the git repository caused some troubles, so the project switched the main repository to Gitlab in January 2021. That change of course also triggered the question whether the bug tracking could be moved from Launchpad to Gitlab, too. This would provide a better integration of the bug tracking with the git repository, and also has the advantage that more QEMU developers have a Gitlab account than a Launchpad account. But after some discussions it was clear that there was the desire to not simply leave the opened bug tickets at Launchpad behind, so for being able to switch, those tickets needed to be migrated to the Gitlab issue tracker instead.

Fortunately, there are APIs for both, Launchpad and Gitlab, so although I was a complete Python newbie, I was indeed able to build a little script that transfers bug tickets from Launchpad to Gitlab. I recently found the script on my hard disk again, and I thought it might maybe be helpful for other people in the same situation, so here it is:

#!/usr/bin/env python3

import argparse
import os
import re
import sys
import time
import gitlab
import textwrap

from launchpadlib.launchpad import Launchpad
import lazr.restfulclient.errors

parser = argparse.ArgumentParser(description=
                                 "Copy bugs from Launchpad to Gitlab")
parser.add_argument('-l',
                    '--lp-project-name',
                    dest='lp_project_name',
                    help='The Launchpad project name.')
parser.add_argument('-g',
                    '--gl-project-id',
                    dest='gl_project_id',
                    help='The Gitlab project ID.')
parser.add_argument('--verbose', '-v',
                    help='Enable debug logging.',
                    action="store_true")
parser.add_argument('--open', '-o',
                    dest='open_url',
                    help='Open URLs in browser.',
                    action="store_true")
parser.add_argument('--anonymous', '-a',
                    help='Use anonymous login to launchpad (no updates!)',
                    action="store_true")
parser.add_argument('--search-text', '-s',
                    dest='search_text',
                    help='Look up bugs by searching for text.')
parser.add_argument('--reporter', '-r',
                    dest='reporter',
                    help='Look up bugs from the given reporter only.')
parser.add_argument('-b',
                    '--batch-size',
                    dest='batch_size',
                    default=20,
                    type=int,
                    help='The maximum amount of bug tickets to handle.')
args = parser.parse_args()


def get_launchpad():
    cache_dir = os.path.expanduser("~/.launchpadlib/cache/")
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir, 0o700)

    def no_credential():
        print("ERROR: Can't proceed without Launchpad credential.")
        sys.exit()

    if args.anonymous:
        launchpad = Launchpad.login_anonymously(args.lp_project_name +
                                                '-bugs',
                                                'production', cache_dir)
    else:
        launchpad = Launchpad.login_with(args.lp_project_name + '-bugs',
                                         'production',
                                         cache_dir,
                                         credential_save_failed=no_credential)
    return launchpad

def convert_tags(tags):
    convtab = {
        "cve": "Security",
        "disk": "Storage",
        "documentation": "Documentation",
        "ethernet": "Networking",
        "feature-request": "kind::Feature Request",
        "linux": "os: Linux",
        "macos": "os: macOS",
        "security": "Security",
        "test": "Tests",
        "tests": "Tests",
    }

    labels = []
    for tag in tags:
        label = convtab.get(tag)
        if label:
            labels.append(label)
    return labels

def show_bug_task(bug_task):
    print('*** %s - %s' % (bug_task.bug.web_link,
                           str(bug_task.bug.title)[0:44] + "..."))
    if args.verbose:
        print('### Description: %s' % bug_task.bug.description)
        print('### Tags: %s' % bug_task.bug.tags)
        print('### Status: %s' % bug_task.status)
        print('### Assignee: %s' % bug_task.assignee)
        print('### Owner: %s' % bug_task.owner)
        for attachment in bug_task.bug.attachments:
            print('#### Attachment: %s (%s)'
                  % (attachment.data_link, attachment.title))
            #print(sorted(attachment.lp_attributes))
        for message in bug_task.bug.messages:
            print('#### Message: %s' % message.content)

def mark_lp_bug_moved(bug_task, new_url):
    subject = "Moved bug report"
    comment = """
This is an automated cleanup. This bug report has been moved to the
new bug tracker on gitlab.com and thus gets marked as 'expired' now.
Please continue with the discussion here:

 %s
""" % new_url

    bug_task.status = "Expired"
    bug_task.assignee = None
    try:
        bug_task.lp_save()
        bug_task.bug.newMessage(subject=subject, content=comment)
        if args.verbose:
            printf(" ... expired LP bug report %s" % bug_task.web_link)
    except lazr.restfulclient.errors.ServerError as e:
        print("ERROR: Timeout while saving LP bug update! (%s)" % e, end='')
    except Exception as e:
        print("ERROR: Failed to save LP bug update! (%s)" % e, end='')

def preptext(txt):
    txtwrapper = textwrap.TextWrapper(replace_whitespace = False,
                                      break_long_words = False,
                                      drop_whitespace = True, width = 74)
    outtxt = ""
    for line in txt.split("\n"):
        outtxt += txtwrapper.fill(line) + "\n"
    outtxt = outtxt.replace("-", "&#45;")
    outtxt = outtxt.replace("<", "&lt;")
    outtxt = outtxt.replace(">", "&gt;")
    return outtxt

def transfer_to_gitlab(launchpad, project, bug_task):
    bug = bug_task.bug
    desc = "This bug has been copied automatically from: " \
           + bug_task.web_link \
           + "<br/>\nReported by '[" + bug.owner.display_name \
           + "](https://launchpad.net/~" + bug.owner.name + ")' "
    desc += "on " \
           + bug.date_created.date().isoformat() + " :\n\n" \
           + "<pre>" + preptext(bug.description) + "</pre>\n"
    issue = project.issues.create({'title': bug.title, 'description': desc},
                                  retry_transient_errors = True)
    for msg in bug.messages:
        has_attachment = False
        attachtxt = "\n**Attachments:**\n\n"
        for attachment in bug_task.bug.attachments:
            if attachment.message == msg:
                has_attachment = True
                attachtxt += "* [" + attachment.title + "](" \
                             + attachment.data_link + ")\n"
        note = "Comment from '[" + msg.owner.display_name \
            + "](" + msg.owner.web_link + ")' on Launchpad (" \
            + msg.date_created.date().isoformat() + "):\n"
        if msg == bug.messages[0] or not msg.content.strip():
            if not has_attachment:
                continue
        else:
            note += "\n<pre>" + preptext(msg.content) + "</pre>\n"
        if has_attachment:
            note += attachtxt
        issue.notes.create({'body': note}, retry_transient_errors = True)
        time.sleep(0.2)  # To avoid "spamming"
    labels = convert_tags(bug.tags)
    labels.append("Launchpad")
    issue.labels = labels
    issue.save(retry_transient_errors = True)
    print("    ==> %s" % issue.web_url)
    if not args.anonymous:
        mark_lp_bug_moved(bug_task, issue.web_url)
    if args.open_url:
        os.system("xdg-open " + issue.web_url)


def main():
    print("LP2GL", args)

    if not args.lp_project_name:
        print("Please specify a Launchpad project name (with -l)")
        return

    launchpad = get_launchpad()
    lp_project = launchpad.projects[args.lp_project_name]
    if args.reporter:
        bug_tasks = lp_project.searchTasks(
            status=["New", "Confirmed", "Triaged"],
            bug_reporter="https://api.launchpad.net/1.0/~" + args.reporter,
            omit_duplicates=True,
            order_by="datecreated")
    elif args.search_text:
        bug_tasks = lp_project.searchTasks(
            status=["New", "Confirmed", "Triaged", "In Progress"],
            search_text=args.search_text,
            omit_duplicates=True,
            order_by="datecreated")
    else:
        bug_tasks = lp_project.searchTasks(
            status=["New", "Confirmed", "Triaged"],
            omit_duplicates=True,
            order_by="datecreated")

    if args.gl_project_id:
        try:
            priv_token = os.environ['GITLAB_PRIVATE_TOKEN']
        except Exception as e:
            print("Please set the GITLAB_PRIVATE_TOKEN env variable!")
            return

        gl = gitlab.Gitlab('https://gitlab.com', private_token=priv_token)
        gl.auth()
        project = gl.projects.get(args.gl_project_id)
    else:
        print("Provide a Gitlab project ID to transfer the bugs ('-g')")

    batch_size = args.batch_size
    for bug_task in bug_tasks:
        if batch_size < 1 :
            break
        owner = bug_task.owner.name
        if args.open_url:
            os.system("xdg-open " + bug_task.bug.web_link)
        show_bug_task(bug_task)
        if args.gl_project_id:
            time.sleep(2)  # To avoid "spamming"
            transfer_to_gitlab(launchpad, project, bug_task)
        batch_size -= 1

    print("All done.")

if __name__ == '__main__':
    main()

You need to specify at least a Launchpad project name with the -l parameter (for example -l qemu-kvm), and for simple initial tests it might be good to use -a for an anonymous Launchpad login, too (the Launchpad ticket won’t be updated in that case). Without further parameters, this will just list the tickets in the Launchpad project that are still opened.

To transfer tickets to a Gitlab issue tracker, you need to specify the Gitlab project ID with the -g parameter (which can be found on the main page of your project on Gitlab) and provide a Gitlab access token for the API via the GITLAB_PRIVATE_TOKEN environment variable.

Anyway, if you want to use the script, I recommend to test it with anonymous access for Launchpad (i.e. with the -a parameter) and a dummy project on Gitlab first (which you can just delete afterwards). This way you can get a basic understanding and impression of the script first, before you use it for the final transfer of your bug tickets.

December 13, 2022 01:45 PM

December 12, 2022

Stefan Hajnoczi

Building a git forge using git apps

The two previous blog posts about why git forges are von Neumann machines and the Radicle peer-to-peer git forge explored models for git forges. In this final post I want to cover yet another model that draws from the previous ones but has its own unique twist.

Peer-to-peer git apps

I previously showed how applications can be built on centralized git forges using CI/CD functionality for executing code, webhooks for interacting with the outside world, and disjoint branches for storing data.

A more elegant architecture is a peer-to-peer one where instead of many clients and one server there are just peers. Each peer has full access to the data. There is no client/server application code split, instead each peer runs an application for itself.

First, this makes it easier to move the data to new hosting infrastructure or fork a project since all data resides in the git repository. Merge requests, issues, wikis, and even the app settings are all stored in the git repo itself.

Second, this gives more power to the users who can process data however they want without being limited by the server's API. All peers are on equal footing and users don't need permission to alter applications, because they run locally.

Finally, it is easier to develop a local application than a client/server application. Being able to open a file and tweak the code is immediate and less hassle than testing and deploying a server-side application.

Internet peer-to-peer systems typically still require some central point for bootstrapping and this is no exception. A publicly-accessible git repository is still needed so that peers can fetch and push changes. However, in this model the git server does not run application code but "git apps" like merge requests, issue trackers, wikis, etc can still be implemented. Here is how it works...

The anti-application server

The git server is not allowed to run application code in our model, so apps like merge requests won't be processing data on the server side. However, the repository does need some primitives to make peer-to-peer git apps possible. These primitives are access control policies for refs and directories/files.

Peers run applications locally and the git server is "dumb" with the sole job of enforcing access control. You can imagine this like a multi-user UNIX machine where users have access to a shared directory. UNIX file permissions determine how processes can access the data. By choosing permissions carefully, multiple users can collaborate in the shared directory in a safe and controlled manner.

This is an anti-application server because no application code runs on the server side. The server is just a git repository that stores data and enforces access control on git push.

Access control

Repositories that accept push requests need a pre-receive hook (see githooks(5)) that checks incoming requests against the access control policy. If the request complies with the access control policy then the git push is accepted. Otherwise the git push is rejected and changes are not made to the git repository.

The first type of access control is on git refs. Git refs are the namespace where branches and tags are stored in a git repository. If a regular expression matches the ref and the operation type (create, fast-forward, force, delete) then it is allowed. For example, this policy rule allows any user to push to refs/heads/foo but force pushes and deletion are not allowed:

anyone create,fast-forward ^heads/foo$

The operations available on refs include:

OperationDescription
create-branchPush a new branch that doesn't exist yet
create-tagPush a new tag that doesn't exist yet
fast-forwardPush a commit that is a descendent of the current commit
forcePush a commit or tag replacing the previous ref
deleteDelete a ref

What's more interesting is that $user_id is expanded to the git push user's identifier so we can write rules to limit access to per-user ref namespaces:

anyone create-branch,fast-forward,force,delete ^heads/$user_id/.*$

This would allow Alice to push her own branches but Alice could not push to Bob's branches.

We have covered how to define access control policies on refs. Access control policies are also needed on branches so that multiple users can modify the same branch in a controlled and safe manner. The syntax is similar but the policy applies to changes made by commits to directories/files (what git calls a tree). The following allows users to create files in a directory but not delete or modify them (somewhat similar to the UNIX restricted deletion or "sticky" bit on world-writable directories):

anyone create-file ^shared-dir/.*$

The operations available on branches include:

OperationDescription
create-directoryCreate a new directory
create-fileCreate a new file
create-symlinkCreate a symlink
modifyChange an existing file or symlink
delete-fileDelete a file
...

$user_id expansion is also available for branch access control. Here the user can create, modify, and delete files in a per-user directory:

anyone create-file,modify,delete-file ^$user_id/.*$

User IDs

You might be wondering how user identifiers work. Git supports GPG-signed push requests with git push --signed. We can use the GPG key ID as the user identifier, eliminating the need for centralized user accounts. Remember that the GPG key ID is based on the public key. Key pairs are randomly generated and it is improbable that the same key will be generated by two different users. That said, GPG key ID uniqueness has been weak in the past when the default size was 32 bits. Git explicitly enables long 64-bit GPG key IDs but I wonder if collisions could be a problem. Maybe an ID with more bits based on the public key should be used instead, but for now let's assume the GPG key ID is unique.

The downside of this approach is that user IDs are not human-friendly. Git apps can allow the user to assign aliases to avoid displaying raw user IDs. Doing this automatically either requires an external ID issuer like confirming email address ownership, which is tedious for new users, or by storing a registry of usernames in the git repo, which means a first-come-first-server policy for username allocation and possible conflicts when merging from two repositories that don't share history. Due to these challenges I think it makes sense to use raw GPG key IDs at the data storage level and make them prettier at the user interface level.

The GPG key ID approach works well for desktop clients but not for web clients. The web application (even if implemently on the client side) would need access to the private key so it can push to the git repository. Users should not trust remotely hosted web applications with their private keys. Maybe there is a standard Web API that can help but I'm not aware one. More thought is needed here.

The pre-receive git hook checks that signature verification passed and has access to the GPG key ID in the GIT_PUSH_CERT_KEY environment variable. Then the access control policy can be checked.

Access control is a git app

Access control is the first and most fundamental git app. The access control policies that were described above are stored as files in the apps/access-control branch in the repository. Pushes to that branch are also subject to access control checks. Here is the branch's initial layout:


branches/ - access control policies for branches
owner.conf
groups/ - group definitions (see below)
...
refs/ - access control policies for refs
owner.conf

The default branches/owner.conf access control policy is as follows:

owner create-file,create-directory,modify,delete ^.*$

The default refs/owner.conf access control policy is as follows:

owner create-branch,create-tag,fast-foward,force,delete ^.*$

This gives the owner the ability to push refs and modify branches as they wish. The owner can grant other users access by pushing additional access control policy files or changing exsting files on the apps/access-control branch.

Each access control policy file in refs/ or branches/ is processed in turn. If no access control rule matches the operation then the entire git push is rejected.

Groups can be defined to alias one or more user identifiers. This avoids duplicating access control rules when more than one user should have the same access. There are two automatic groups: owner contains just the user who owns the git repository and anyone is the group of all users.

This completes the description of the access control app. Now let's look at how other functionality is built on top of this.

The merge requests app

A merge requests app can be built on top of this model. The refs access control policy is as follows:

# The data branch contains the titles, comments, etc
anyone modify ^apps/merge-reqs/data$

# Each merge request revision is pushed as a tag in a per-user namespace
anyone create-tag ^apps/merge-reqs/$user_id/[0-9]+-v[0-9]+$

The branch access control policy is:

# Merge requests are per-user and numbered
anyone create-directory ^merge-reqs/$user_id/[0-9]+$

# Title string
anyone create-file,modify ^merge-reqs/$user_id/[0-9]+/title$

# Labels (open, needs-review, etc) work like this:
#
# merge-reqs/<user-id>/<merge-req-num>/labels/
# needs-review -> /labels/needs-review
# ...
# labels/
# needs-review/
# <user-id>/
# <merge-req-num> -> /merge-reqs/<user-id>/<merge-req-num>
# ...
# ...
# ...
#
# This directory and symlink layout makes it possible to enumerate labels for a
# given merge request and to enumerate merge requests for a given label.
#
# Both the merge request author and maintainers can add/remove labels to/from a
# merge request.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/labels$
anyone create-symlink,delete ^merge-reqs/$user_id/[0-9]+/labels/.*$
maintainers create-symlink,delete ^merge-reqs/[^/]+/[0-9]+/labels/.*$
maintainers create-directory ^labels/[^/]+$
anyone create-symlink,delete ^labels/[^/]+/$user_id/[0-9]+$
maintainers create-symlink,delete ^labels/[^/]+/[^/]+/[0-9]+$

# Comments are stored as individual files in per-user directories. Each file
# contains a timestamp and the contents of the comment. The timestamp can be
# used to sort comments chronologically.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments$
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments/$user_id$
anyone create-file,modify ^merge-reqs/[^/]+/[0-9]+/comments/$user_id/[0-9]+$

When a user creates a merge request they provide a title, an initial comment, apply labels, and push a v1 tag for review and merging. Other users can comment by adding files into the merge request's per-user comments directory. Labels can be added and removed by changing symlinks in the labels directories.

The user can publish a new revision of the merge request by pushing a v2 tag and adding a comment describing the changes. Once the maintainers are satisfied they merge the final revision tag into the relevant branch (e.g. "main") and relabel the merge request from open/needs-review to closed/merged.

This workflow can be implemented by a tool that performs the necessary git operations so users do not need to understand the git app's internal data layout. Users just need to interact with the tool that displays merge requests, allows commenting, provides searches, etc. A natural way to implement this tool is as a git alias so it integrates alongside git's built-in commands.

One issue with this approach is that it uses the file system as a database. Performance and scalability are likely to be worst than using a database or application-specific file format. However, the reason for this approach is that it allows the access control app to enforce a policy that ensures users cannot modify or delete other user's data without running application-specific code on the server and while keeping everything stored in a git repository.

An example where this approach performs poorly is for full-text search. The application would need to search all title and comment files for a string. There is no index for efficient lookups. However, if applications find that git-grep(1) does not perform well they can maintain their own index and cache files locally.

I hope that this has shown how git apps can be built without application code running on the server.

Continuous integration bots

Now that we have the merge requests app it's time to think how a continuous integration service could interface with it. The goal is to run tests on each revision of a merge request and report failures so the author of the merge request can rectify the situation.

A CI bot watches the repository for changes. In particular, it needs to watch for tags created with the ref name apps/merge-reqs/[^/]+/[0-9]+-v[0-9]+.

When a new tag is found the CI bot checks it out and runs tests. The results of the tests are posted as a comment by creating a file in merge-regs/<user-id>/>merge-req-num>/comments/<ci-bot-user-id>/0 on the apps/merge-reqs/data branch. A ci-pass or ci-fail label can also be applied to the merge request so that the CI status can be easily queried by users and tools.

Going further

There are many loose ends. How can non-git users participate on issue trackers and wikis? It might be possible to implement a full peer as a client-side web application using isomorphic-git, a JavaScript git implementation. As mentioned above, the GPG key ID approach is not very browser-friendly because it requires revealing the private key to the web page and since keys are user identifiers using temporary keys does not work well.

The data model does not allow efficient queries. A full copy of the data is necessary in order to query it. That's acceptable for local applications because they can maintain their own indexes and are expected to keep the data for a long period of time. It works less well for short-lived web page sessions like a casual user filing a new bug on the issue tracker.

The git push --signed technique is not the only option. Git also supports signed commits and signed tags. The difference between signed pushes and signed tags/commits is significant. The signed push approach only validates the access control policy when the repository is changed and leaves no audit log for future reference. The signed commit/tag approach keeps the signatures in the git history. Signed commits/tags can be propagated in a peer-to-peer network and each peer can validate the access control policy itself. While signed commits/tags apply the access control policy to each object in the repository, signed pushes apply the access control policy to each change made to the repository. The difference is that it's easy to rebase and include work from different authors with signed pushes. Signed commits/tags require re-signing for rebasing and each commit is validated against its signature, which may be different from the user who is making the push request.

There are a lot of interesting approaches and trade-offs to explore here. This model we've discussed fits closely with how I've seen developers use git in open source projects. It is designed around a "main" repository/server that contributors push their code to. But each clone of the repository has all the data and can be published as a new "main" repository, if necessary.

Although these ideas are unfinished I decided to write them up with the knowledge that I probably won't implement them myself. QEMU is moving to GitLab with a traditional centralized git forge. I don't think this is the right time to develop this idea and try to convince the QEMU community to use it. For projects that have fewer infrastructure requirements it would give their contributors more power than being confined to a centralized git forge.

I hope this was an interesting read for anyone thinking about git forges and building git apps.

by Unknown (noreply@blogger.com) at December 12, 2022 02:14 PM

November 21, 2022

KVM on Z

New Release: RHEL 9.1 on IBM zSystems

After releasing RHEL 8.7 a week before, Red Hat now published RHEL 9.1, see the press release here! It ships, among others:

  • QEMU v7.0
  • libvirt v8.5

Note that RHEL9.1 is NOT an EUS (Extended Update Support) release, so it will go out of support with the GA of RHEL 9.2. For details, please see the "Red Hat Enterprise Linux Life Cycle" here.



by Stefan Raspl (noreply@blogger.com) at November 21, 2022 04:06 PM

IBM Cloud Infrastructure Center 1.1.6 with several KVM on IBM zSystems enhancements

The new version if IBM Cloud Infrastructure Center is available and has several improvements for KVM on IBM zSystems:

    • Consistency group and group snapshots
    • Security group on Red Hat KVM
    • Hybrid hypervisor support
    • Virtual machine backup and restore on KVM
    • Performance enhancement in Day2 operations
    • Visualization of network related components


by Christian Bornträger (noreply@blogger.com) at November 21, 2022 12:43 PM

November 18, 2022

Stefan Hajnoczi

LWN article on "Accessing QEMU storage features without a VM"

At KVM Forum 2022 Kevin Wolf and Stefano Garzarella gave a talk on qemu-storage-daemon, a way to get QEMU's storage functionality without running a VM. It's great for accessing disk images, basically taking the older qemu-nbd to the next level. The cool thing is this makes QEMU's software-defined storage functionality - block devices with snapshots, incremental backup, image file formats, etc - available to other programs. Backup and forensics tools as well as other types of programs can take advantage of qemu-storage-daemon.

Here is the full article about Accessing QEMU storage features without a VM. Enjoy!

by Unknown (noreply@blogger.com) at November 18, 2022 08:44 PM

KVM on Z

Documentation: Red Hat High Availability

Our solution assurance team published a new paper, providing guidance together with hints and tricks and practical examples to help you configure and use the Red Hat Enterprise Linux High Availability Add-on (Red Hat HA).

You can access the paper here.

by Stefan Raspl (noreply@blogger.com) at November 18, 2022 01:48 PM

November 17, 2022

QEMU project

Introduction to Zoned Storage Emulation

This summer I worked on adding Zoned Block Device (ZBD) support to virtio-blk as part of the Outreachy internship program. QEMU hasn’t directly supported ZBDs before so this article explains how they work and why QEMU needed to be extended.

Zoned block devices

Zoned block devices (ZBDs) are divided into regions called zones that can only be written sequentially. By only allowing sequential writes, SSD write amplification can be reduced by eliminating the need for a Flash Translation Layer, and potentially lead to higher throughput and increased capacity. Providing a new storage software stack, zoned storage concepts are standardized as ZBC (SCSI standard), ZAC (ATA standard), and ZNS (NVMe). Meanwhile, the virtio protocol for block devices(virtio-blk) should also be aware of ZBDs instead of taking them as regular block devices. It should be able to pass such devices through to the guest. An overview of necessary work is as follows:

  1. Virtio protocol: extend virtio-blk protocol with main zoned storage concept, Dmitry Fomichev
  2. Linux: implement the virtio specification extensions, Dmitry Fomichev
  3. QEMU: add zoned storage APIs to the block layer, Sam Li
  4. QEMU: implement zoned storage support in virtio-blk emulation, Sam Li

Once the QEMU and Linux patches have been merged it will be possible to expose a virtio-blk ZBD to the guest like this:

-blockdev node-name=drive0,driver=zoned_host_device,filename=/path/to/zbd,cache.direct=on \
-device virtio-blk-pci,drive=drive0 \

And then we can perform zoned block commands on that device in the guest os.

# blkzone report /dev/vda
start: 0x000000000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000020000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000040000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000060000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000080000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x0000a0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x0000c0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x0000e0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000100000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000120000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000140000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000160000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]

Zoned emulation

Currently, QEMU can support zoned devices by virtio-scsi or PCI device passthrough. It needs to specify the device type it is talking to. Whereas storage controller emulation uses block layer APIs instead of directly accessing disk images. Extending virtio-blk emulation avoids code duplication and simplify the support by hiding the device types under a unified zoned storage interface, simplifying VM deployment for different types of zoned devices. Virtio-blk can also be implemented in hardware. If those devices wish to follow the zoned storage model then the virtio-blk specification needs to natively support zoned storage. With such support, individual NVMe namespaces or anything that is a zoned Linux block device can be exposed to the guest without passing through a full device.

For zoned storage emulation, zoned storage APIs support three zoned models (conventional, host-managed, host-aware) , four zone management commands (Report Zone, Open Zone, Close Zone, Finish Zone), and Append Zone. The QEMU block layer has a BlockDriverState graph that propagates device information inside block layer. File-posix driver is the lowest level within the graph where zoned storage APIs reside.

After receiving the block driver states, Virtio-blk emulation recognizes zoned devices and sends the zoned feature bit to guest. Then the guest can see the zoned device in the host. When the guest executes zoned operations, virtio-blk driver issues corresponding requests that will be captured by viritio-blk device inside QEMU. Afterwards, virtio-blk device sends the requests to file-posix driver which will perform zoned operations using Linux ioctls.

Unlike zone management operations, Linux doesn’t have a user API to issue zone append requests to zoned devices from user space. With the help of write pointer emulation tracking locations of write pointer of each zone, QEMU block layer can perform append writes by modifying regular writes. Write pointer locks guarantee the execution of requests. Upon failure it must not update the write pointer location which is only got updated when the request is successfully finished.

Problems can always be solved with right mind and right tools. A good approach to avoid pitfalls of programs is test-driven. In the beginning, users like qemu-io commands utility can invoke new block layer APIs. Moving towards to guest, existing tools like blktests, zonefs-tools, and fio are introduced for broader testing. Depending on the size of the zoned device, some tests may take long enough time to finish. Besides, tracing is also a good tool for spotting bugs. QEMU tracking tools and blktrace monitors block layer IO, providing detailed information to analysis.

Starting the journey with open source

As a student interested in computer science, I am enthusiastic about making real applications and fortunate to find the opportunity in this summer. I have a wonderful experience with QEMU where I get chance to work with experienced engineers and meet peers sharing same interests. It is a good starting point for me to continue my search on storage systems and open source projects.

Public communication, reaching out to people and admitting to failures used to be hard for me. Those feelings had faded away as I put more effort to this project over time. For people may having the same trouble as me, it might be useful to focus on the tasks ahead of you instead of worrying about the consequences of rejections from others.

Finally, I would like to thank Stefan Hajnoczi, Damien Le Moal, Dmitry Fomichev, and Hannes Reinecke for mentoring me - they have guided me through this project with patience and expertise, when I hit obstacles on design or implementations, and introduced a fun and vibrant open source world for me. Also thank QEMU community and Outreachy for organizing this program.

Conclusion

The current status for this project is waiting for virtio specifications extension and Linux driver support patches got accepted. And the up-to-date patch series of zoned device support welcome any new comments.

The next step for zoned storage emulation in QEMU is to enable full zoned emulation through virtio-blk. Adding support on top of a regular file, it allows developers accessing a zoned device environment without real zoned storage hardwares. Furthermore, virtio-scsi may need to add full emulation support to complete the zoned storage picture in QEMU. QEMU NVMe ZNS emulation can also use new block layer APIs to attach real zoned storage if the emulation is used in production in future.

by Sam Li at November 17, 2022 12:00 AM

November 13, 2022

KVM on Z

New Release: RHEL 8.7 on IBM zSystems

RHEL 8.7 is out, see the blog entry here! It ships, among others:

  • QEMU v6.2, supporting virtio-fs on IBM Z
  • libvirt v8.0

For a detailed list of Linux on Z-specific changes, see the release notes.

by Stefan Raspl (noreply@blogger.com) at November 13, 2022 10:42 PM

November 10, 2022

Stefan Hajnoczi

Using qemu-img to access vhost-user-blk storage

vhost-user-blk is a high-performance storage protocol that connects virtual machines to software-defined storage like SPDK or qemu-storage-daemon. Until now, tool support for vhost-user-blk has been lacking. Accessing vhost-user-blk devices involved running a virtual machine, which requires more setup than one would like.

QEMU 7.2 adds vhost-user-blk support to the qemu-img tool. This is possible thanks to libblkio, a library that other programs besides QEMU can use too.

Check for vhost-user-blk support in your installed qemu-img version like this (if it says 0 then you need to update qemu-img or compile it from source with libblkio enabled):

$ qemu-img --help | grep virtio-blk-vhost-user | wc -l
1

You can copy a raw disk image file into a vhost-user-blk device like this:


$ qemu-img convert \
--target-image-opts \
-n \
test.img \
driver=virtio-blk-vhost-user,path=/tmp/vhost-user-blk.sock,cache.direct=on

The contents of the vhost-user-blk device can be saved as a qcow2 image file like this:

$ qemu-img convert \
--image-opts \
-O qcow2 \
driver=virtio-blk-vhost-user,path=/tmp/vhost-user-blk.sock,cache.direct=on out.qcow2

The size of the virtual disk can be read:

$ qemu-img info \
--image-opts \
driver=virtio-blk-vhost-user,path=/tmp/vhost-user-blk.sock,cache.direct=on
image: json:{"driver": "virtio-blk-vhost-user"}
file format: virtio-blk-vhost-user
virtual size: 4 GiB (4294967296 bytes)
disk size: unavailable

Other qemu-img sub-commands like bench and dd are also available for quickly accessing the vhost-user-blk device without running a virtual machine:


$ qemu-img bench \
--image-opts \
driver=virtio-blk-vhost-user,path=/tmp/vhost-user-blk.sock,cache.direct=on
Sending 75000 read requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 4096)
Run completed in 1.443 seconds.

Being able to access vhost-user-blk devices from qemu-img makes vhost-user-blk a little easier to work with.

by Unknown (noreply@blogger.com) at November 10, 2022 07:35 PM

October 25, 2022

KVM on Z

New Release: Ubuntu 22.10

Canonical released a new version of their Ubuntu server offering Ubuntu Server 22.10!

One of the highlights is the addition of a new feature providing Secure Execution attestation.

See the announcement on the mailing list here, and the blog entry at Canonical with Z-specific highlights here.

by Stefan Raspl (noreply@blogger.com) at October 25, 2022 06:55 PM

October 19, 2022

KVM on Z

RHEL 8.6: RHEL-AV is no longer necessary for the latest version of KVM

by Christian Bornträger (noreply@blogger.com) at October 19, 2022 01:48 PM

August 30, 2022

QEMU project

QEMU version 7.1.0 released

We’d like to announce the availability of the QEMU 7.1.0 release. This release contains 2800+ commits from 238 authors.

You can grab the tarball from our download page. The full list of changes are available in the Wiki.

Highlights include:

  • Live migration: support for zero-copy-send on Linux
  • QMP: new options for exporting NBD images with dirty bitmaps via ‘block-export-add’ command
  • QMP: new ‘query-stats’ and ‘query-stats-schema’ commands for retrieving statistics from various QEMU subsystems
  • QEMU guest agent: improved Solaris support, new commands ‘guest-get-diskstats’/’guest-get-cpustats’, ‘guest-get-disks’ now reports NVMe SMART information, and ‘guest-get-fsinfo’ now reports NVMe bus-type
  • ARM: emulation support for new machine types: Aspeed AST1030 SoC, Qaulcomm, and fby35 (AST2600 / AST1030)
  • ARM: emulation support for Cortex-A76 and Neoverse-N1 CPUs
  • ARM: emulation support for Scalable Matrix Extensions, cache speculation control, RAS, and many other CPU extensions
  • ARM: ‘virt’ board now supports emulation of GICv4.0
  • HPPA: new SeaBIOS v6 firmware with support for PS/2 keyboard in boot menu when running with GTK UI, improved serial port emulation, and additional STI text fonts
  • LoongArch: initial support for LoongArch64 architecture, Loongson 3A5000 multiprocessor SoC, and the Loongson 7A1000 host bridge
  • MIPS: Nios2 board (-machine 10m50-ghrd) now support Vectored Interrupt Controller, shadow register sets, and improved exception handling
  • OpenRISC: ‘or1k-sim’ machine now support 4 16550A UART serial devices instead of 1
  • RISC-V: new ISA extensions with support for privileged spec version 1.12.0, software access to MIP SEIP, Sdtrig extension, vector extension improvements, native debug, PMU improvements, and many other features and miscellaneous fixes/improvements
  • RISC-V: ‘virt’ board now supports TPM
  • RISC-V: ‘OpenTitan’ board now supports Ibex SPI
  • s390x: emulation support for s390x Vector-Enhancements Facility 2
  • s390x: s390-ccw BIOS now supports booting from drives with non-512 sector sizes
  • x86: virtualization support for architectural LBRs
  • Xtensa: support for lx106 core and cache testing opcodes
  • and lots more…

Thank you to everyone involved!

August 30, 2022 11:00 PM

July 19, 2022

Gerd Hoffmann

edk2 and firmware packaging

Firmware autobuilder goes EOL

Some people already noticed and asked questions. So guess I better write things down in my blog so I don't have to answer the questions over and over again, and I hope to also clarify some things on distro firmware builds.

So, yes, the jenkins autobuilder creating the firmware repository at https://www.kraxel.org/repos/jenkins/ has been shutdown yesterday (Jul 19th 2020). The repository will stay online for the time being, so your establish workflows will not instantly break. But the repository will not get updates any more, so it is wise to start looking for alternatives now.

The obvious primary choice would be to just use the firmware builds provided by your distribution. I'll cover edk2 only, which seems to be the by far most popular use, even thought here are also builds for other firmware projects.

RHEL / Fedora edk2 firmware builds

Given I'm quite familier with the RHEL / Fedora world I can give some advise here. The edk2-ovmf package comes with multiple images for the firmware code and the varstore template which allow for various combinations. The most important ones are:

OVMF_CODE.secboot.fd and OVMF_VARS.secboot.fd
Run the secure-boot capable firmware build with secure boot enabled. The varstore has the microsoft secure boot keys enrolled and secure boot enabled.
Requires q35. Requires smm mode support (which is enabled by default these days).
OVMF_CODE.secboot.fd and OVMF_VARS.fd
Run the secure-boot capable firmware build with secure boot disabled. The varstore is blank.
Requires q35 and smm mode support too.
OVMF_CODE.fd and OVMF_VARS.fd
Run the firmware build without secure boot support with the blank varstore.
Works with both q35 and pc machine types. Only available on Fedora.

Configure libvirt domains for UEFI

The classic way to setup this in libvirt looks like this:

<domain type='kvm'>
[ ... ]
  <os>
    <type arch='x86_64' machine='q35'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.secboot.fd</loader>
    <nvram template='/usr/share/OVMF/OVMF_VARS.fd'/>
  </os>

To make this easier the firmware builds come with json files describing the capabilities and requirements. You can find these files in /usr/share/qemu/firmware/. libvirt can use them to automatically find suitable firmware images, so you don't have to write the firmware image paths into the domain configuration. You can simply use this instead:

<domain type='kvm'>
[ ... ]
  <os firmware='efi'>
    <type arch='x86_64' machine='q35'>hvm</type>
  </os>

libvirt also allows to ask for specific firmware features. If you don't want use secure boot for example you can ask for the blank varstore template (no secure boot keys enrolled) this way:

<domain type='kvm'>
[ ... ]
  <os firmware='efi'>
    <type arch='x86_64' machine='q35'>hvm</type>
    <firmware>
      <feature name='enrolled-keys' enabled='no' />
    </firmware>
  </os>

In case you change the configuration of an existing virtual machine you might (depending on the kind of change) have to run virsh start --reset-nvram domain once to to start over with a fresh copy of the varstore template.

But why shutdown the autobuilder?

The world has moved forward. UEFI isn't a niche use case any more. Linux distributions all provide good packages theys days. The edk2 project got good CI coverage (years ago it was my autobuilder raising the flag when a commit broke the gcc build). The edk2 project got a regular release process distros can (and do) follow.

All in all the effort to maintain the autobuilder doesn't look justified any more.

by Gerd Hoffmann at July 19, 2022 10:00 PM

KVM on Z

Persistent configuration of crypto passthrough

Are you using CryptoExpress cards with KVM on IBM zSystems or LinuxONE? Sebastian Mitterle has a very good overview on how to make crypto device passthrough persistent.

http://learningbytesting.mathume.com/2022/07/persistent-crypto-device-passthrough-on.html

by Christian Bornträger (noreply@blogger.com) at July 19, 2022 11:43 AM

Daniel Berrange

Trying sd-boot and unified kernel images in a KVM virtual machine

A recent thread on the Fedora development list about unified kernel images co-incided with work I’m involved in wrt confidential computing (AMD SEV[-SNP], Intel TDX, etc). In exploring the different options for booting virtual machines in a confidential computing environment, one of the problems that keeps coming up is that of validating the boot measurements of the initrd and kernel command line. The initrd is currently generated on the fly at the time the kernel is installed on a host, while the command line typically contains host specific UUIDs for filesystems or LUKS volumes. Before even dealing with those problems, grub2‘s support for TPMs causes pain due to its need to measure every single grub.conf configuration line that is executed into a PCR. Even with the most minimal grub.conf using autodiscovery based on the boot loader spec, the grub.conf boot measurements are horribly cumbersome to deal with.

With this in mind, in working on confidential virtualization, we’re exploring options for simplifying the boot process by eliminating any per-host variable measurements. A promising way of achieving this is to make use of sd-boot instead of grub2, and using unified kernel images pre-built and signed by the OS vendor. I don’t have enough familiarity with this area of Linux, so I’ve been spending time trying out the different options available to better understand their operation. What follows is a short description of how i took an existing Fedora 36 virtual machine and converted it to sd-boot with a unified kernel image.

First of all, I’m assuming that the virtual machine has been installed using UEFI (EDK2’s OOVMF build) as the firmware, rather than legacy BIOS (aka SeaBIOS). This is not the default with virt-manager/virt-install, but an opt-in is possible at time of provisioning the guest. Similarly it is possible to opt-in to adding a virtual TPM to the guest, for the purpose of receiving boot measurements. Latest upstream code for virt-manager/virt-install will always add a vTPM if UEFI is requested.

Assuming UEFI + vTPM are enabled for the guest, the default Fedora / RHEL setup will also result in SecureBoot being enabled in the guest. This is good in general, but the sd-boot shipped in Fedora is not currently signed. Thus for (current) testing, either disable SecureBoot, or manually sign the sd-boot binary with a local key and enroll that key with UEFI. SecureBoot isn’t immediately important, so the quickest option is disabling SecureBoot with the following libvirt guest XML config setup:

<os firmware='efi'>
  <type arch='x86_64' machine='pc-q35-6.2'>hvm</type>
  <firmware>
    <feature enabled='no' name='secure-boot'/>
  </firmware>
  <loader secure='no'/>
  <boot dev='hd'/>
</os>

The next time the guest is cold-booted, the ‘--reset-nvram‘ flag needs to be passed to ‘virsh start‘ to make it throwaway the existing SecureBoot enabled NVRAM and replace it with one disabling SecureBoot.

$ virsh start --reset-nvram fedora36test

Inside the guest, surprisingly, there were only two steps required, installing ‘sd-boot’ to the EFI partition, and building the unified kernel images. Installing ‘sd-boot’ will disable the use of grub, so don’t reboot after this first step, until the kernels are setup:

$ bootctl install
Created "/boot/efi/EFI/systemd".
Created "/boot/efi/loader".
Created "/boot/efi/loader/entries".
Created "/boot/efi/EFI/Linux".
Copied "/usr/lib/systemd/boot/efi/systemd-bootx64.efi" to "/boot/efi/EFI/systemd/systemd-bootx64.efi".
Copied "/usr/lib/systemd/boot/efi/systemd-bootx64.efi" to "/boot/efi/EFI/BOOT/BOOTX64.EFI".
Updated /etc/machine-info with KERNEL_INSTALL_LAYOUT=bls
Random seed file /boot/efi/loader/random-seed successfully written (512 bytes).
Not installing system token, since we are running in a virtualized environment.
Created EFI boot entry "Linux Boot Manager".

While the ‘/boot/efi/loader‘ directory could be populated with config files specifying kernel/initrd/cmdline to boot, the desire is to be able to demonstrate booting with zero host local configuration. So the next step is to build and install the unified kernel image. The Arch Linux wiki has a comprehensive guide, but the easiest option for Fedora appears to be to use dracut with its ‘--uefi‘ flag

$ for i in /boot/vmlinuz-*x86_64
do
   kver=${i#/boot/vmlinuz-}
   echo "Generating $kver"
   dracut  --uefi --kver $kver --kernel-cmdline "root=UUID=5fd49e99-6297-4880-92ef-bc31aef6d2f0 ro rd.luks.uuid=luks-6806c81d-4169-4e7a-9bbc-c7bf65cabcb2 rhgb quiet"
done
Generating 5.17.13-300.fc36.x86_64
Generating 5.17.5-300.fc36.x86_64

The observant will notice the ‘–kernel-cmdline’ argument refers to install specific UUIDs for the LUKS volume and root filesystem. This ultimately needs to be eliminated too, which would require configuring the guest disk image to comply with the discoverable partitions spec. That is beyond the scope of my current exercise of merely demonstrating use of sd-boot and unified kernels. It ought to be possible to write a kickstart file to automate creation of a suitable cloud image though.

At this point the VM is rebooted, and watching the graphical console confirms that the grub menu has disappeared and display output goes straight from the UEFI splash screen into Linux. There’s no menu shown by sd-boot by default, but if desired this can be enabled by editing /boot/efi/loader/loader.conf to uncomment the line timeout 3, at which point it will show the kernel version selection at boot.

If following this scheme, bear in mind that nothing is wired up to handle this during kernel updates. The kernel RPM triggers will continue to setup grub.conf and generate standalone initrds. IOW don’t try this on a VM that you care about. I assume there’s some set of commands I could use to uninstall sd-boot and switch back to grub, but I’ve not bothered to figure this out.

Overall this exercise was suprisingly simple and painless. The whole idea of using a drastically simplified boot loader instead of grub, along with pre-built unified kernel images, feels like it has alot of promise, especially in the context of virtual machines where the set of possible boot hardware variants is small and well understood.

by Daniel Berrange at July 19, 2022 10:26 AM

July 18, 2022

Cornelia Huck

VIRTIO 1.2 is out!

A new version of the virtio specification has been released! As it has been three years after the 1.1 release, quite a lot of changes have accumulated. I have attempted to list some of them below; for details, you are invited to check out the spec :)

There are already some changes queued for 1.3; let’s hope it won’t take us three years again before the next release ;)

New device types in 1.2

Several new device types have been added.

  • virtio-pmem: persistent memory device; useful to avoid a separate page cache in the guest
  • virtio-fs: access a file system; kind of the spritual successor to the never officially standardized virtio-9p
  • virtio-rpmb: a tamper-resistant and anti-replay storage device
  • virtio-iommu: can both be a proxy for a physical IOMMU, or act as a virtual IOMMU
  • virtio-snd: a sound card supporting input and output PCM streams
  • virtio-mem: provides a memory region in guest physical address space; useful to implement memory hot(un)plugging
  • virtio-i2c: a virtual I2C adapter
  • virtio-scmi: implements the Arm System Control and Management Interface, for things like sensors etc.
  • virtio-gpio: a virtual GPIO device to manage named I/O lines

New features for existing device types

Enhancements have been added to some already existing device types.

  • virtio-blk
    • multiqueue support
    • lifetime metrics
    • secure erase
  • virtio-net
    • support for the guest providing the exact header length
    • receive-side scaling
    • per-packet hash reporting
    • per-virtqueue driver notifications
    • UDP segmentation offload
  • virtio-gpu
    • support for 3D commands
    • resource sharing
    • blob resources
    • context initialization
  • virtio-balloon
    • free page hints
    • page poisoning
    • free page reporting
  • virtio-vsock
    • seqpacket sockets

Features not specific to a device type

Some general enhancements include:

  • support for vendor-specific PCI capabilities
  • support for sharing resources between devices
  • support for resetting individual virtqueues

by Cornelia Huck at July 18, 2022 07:45 AM

July 05, 2022

Alberto Garcia

Running the Steam Deck’s OS in a virtual machine using QEMU

SteamOS desktop

Introduction

The Steam Deck is a handheld gaming computer that runs a Linux-based operating system called SteamOS. The machine comes with SteamOS 3 (code name “holo”), which is in turn based on Arch Linux.

Although there is no SteamOS 3 installer for a generic PC (yet), it is very easy to install on a virtual machine using QEMU. This post explains how to do it.

The goal of this VM is not to play games (you can already install Steam on your computer after all) but to use SteamOS in desktop mode. The Gamescope mode (the console-like interface you normally see when you use the machine) requires additional development to make it work with QEMU and will not work with these instructions.

A SteamOS VM can be useful for debugging, development, and generally playing and tinkering with the OS without risking breaking the Steam Deck.

Running the SteamOS desktop in a virtual machine only requires QEMU and the OVMF UEFI firmware and should work in any relatively recent distribution. In this post I’m using QEMU directly, but you can also use virt-manager or some other tool if you prefer, we’re emulating a standard x86_64 machine here.

General concepts

SteamOS is a single-user operating system and it uses an A/B partition scheme, which means that there are two sets of partitions and two copies of the operating system. The root filesystem is read-only and system updates happen on the partition set that is not active. This allows for safer updates, among other things.

There is one single /home partition, shared by both partition sets. It contains the games, user files, and anything that the user wants to install there.

Although the user can trivially become root, make the root filesystem read-write and install or change anything (the pacman package manager is available), this is not recommended because

  • it increases the chances of breaking the OS, and
  • any changes will disappear with the next OS update.

A simple way for the user to install additional software that survives OS updates and doesn’t touch the root filesystem is Flatpak. It comes preinstalled with the OS and is integrated with the KDE Discover app.

Preparing all the necessary files

The first thing that we need is the installer. For that we have to download the Steam Deck recovery image from here: https://store.steampowered.com/steamos/download/?ver=steamdeck&snr=

Once the file has been downloaded, we can uncompress it and we’ll get a raw disk image called steamdeck-recovery-4.img (the number may vary).

Note that the recovery image is already SteamOS (just not the most up-to-date version). If you simply want to have a quick look you can play a bit with it and skip the installation step. In this case I recommend that you extend the image before using it, for example with ‘truncate -s 64G steamdeck-recovery-4.img‘ or, better, create a qcow2 overlay file and leave the original raw image unmodified: ‘qemu-img create -f qcow2 -F raw -b steamdeck-recovery-4.img steamdeck-recovery-extended.qcow2 64G

But here we want to perform the actual installation, so we need a destination image. Let’s create one:

$ qemu-img create -f qcow2 steamos.qcow2 64G

Installing SteamOS

Now that we have all files we can start the virtual machine:

$ qemu-system-x86_64 -enable-kvm -smp cores=4 -m 8G \
    -device usb-ehci -device usb-tablet \
    -device intel-hda -device hda-duplex \
    -device VGA,xres=1280,yres=800 \
    -drive if=pflash,format=raw,readonly=on,file=/usr/share/ovmf/OVMF.fd \
    -drive if=virtio,file=steamdeck-recovery-4.img,driver=raw \
    -device nvme,drive=drive0,serial=badbeef \
    -drive if=none,id=drive0,file=steamos.qcow2

Note that we’re emulating an NVMe drive for steamos.qcow2 because that’s what the installer script expects. This is not strictly necessary but it makes things a bit easier. If you don’t want to do that you’ll have to edit ~/tools/repair_device.sh and change DISK and DISK_SUFFIX.

SteamOS installer shortcuts

Once the system has booted we’ll see a KDE Plasma session with a few tools on the desktop. If we select “Reimage Steam Deck” and click “Proceed” on the confirmation dialog then SteamOS will be installed on the destination drive. This process should not take a long time.

Now, once the operation finishes a new confirmation dialog will ask if we want to reboot the Steam Deck, but here we have to choose “Cancel”. We cannot use the new image yet because it would try to boot into the Gamescope session, which won’t work, so we need to change the default desktop session.

SteamOS comes with a helper script that allows us to enter a chroot after automatically mounting all SteamOS partitions, so let’s open a Konsole and make the Plasma session the default one in both partition sets:

$ sudo steamos-chroot --disk /dev/nvme0n1 --partset A
# steamos-readonly disable
# echo '[Autologin]' > /etc/sddm.conf.d/zz-steamos-autologin.conf
# echo 'Session=plasma.desktop' >> /etc/sddm.conf.d/zz-steamos-autologin.conf
# steamos-readonly enable
# exit

$ sudo steamos-chroot --disk /dev/nvme0n1 --partset B
# steamos-readonly disable
# echo '[Autologin]' > /etc/sddm.conf.d/zz-steamos-autologin.conf
# echo 'Session=plasma.desktop' >> /etc/sddm.conf.d/zz-steamos-autologin.conf
# steamos-readonly enable
# exit

After this we can shut down the virtual machine. Our new SteamOS drive is ready to be used. We can discard the recovery image now if we want.

Booting SteamOS and first steps

To boot SteamOS we can use a QEMU line similar to the one used during the installation. This time we’re not emulating an NVMe drive because it’s no longer necessary.

$ cp /usr/share/OVMF/OVMF_VARS.fd .
$ qemu-system-x86_64 -enable-kvm -smp cores=4 -m 8G \
   -device usb-ehci -device usb-tablet \
   -device intel-hda -device hda-duplex \
   -device VGA,xres=1280,yres=800 \
   -drive if=pflash,format=raw,readonly=on,file=/usr/share/ovmf/OVMF.fd \
   -drive if=pflash,format=raw,file=OVMF_VARS.fd \
   -drive if=virtio,file=steamos.qcow2 \
   -device virtio-net-pci,netdev=net0 \
   -netdev user,id=net0,hostfwd=tcp::2222-:22

(the last two lines redirect tcp port 2222 to port 22 of the guest to be able to SSH into the VM. If you don’t want to do that you can omit them)

If everything went fine, you should see KDE Plasma again, this time with a desktop icon to launch Steam and another one to “Return to Gaming Mode” (which we should not use because it won’t work). See the screenshot that opens this post.

Congratulations, you’re running SteamOS now. Here are some things that you probably want to do:

  • (optional) Change the keyboard layout in the system settings (the default one is US English)
  • Set the password for the deck user: run ‘passwd‘ on a terminal
  • Enable / start the SSH server: ‘sudo systemctl enable sshd‘ and/or ‘sudo systemctl start sshd‘.
  • SSH into the machine: ‘ssh -p 2222 deck@localhost

Updating the OS to the latest version

The Steam Deck recovery image doesn’t install the most recent version of SteamOS, so now we should probably do a software update.

  • First of all ensure that you’re giving enought RAM to the VM (in my examples I run QEMU with -m 8G). The OS update might fail if you use less.
  • (optional) Change the OS branch if you want to try the beta release: ‘sudo steamos-select-branch beta‘ (or main, if you want the bleeding edge)
  • Check the currently installed version in /etc/os-release (see the BUILD_ID variable)
  • Check the available version: ‘steamos-update check
  • Download and install the software update: ‘steamos-update

Note: if the last step fails after reaching 100% with a post-install handler error then go to Connections in the system settings, rename Wired Connection 1 to something else (anything, the name doesn’t matter), click Apply and run steamos-update again. This works around a bug in the update process. Recent images fix this and this workaround is not necessary with them.

As we did with the recovery image, before rebooting we should ensure that the new update boots into the Plasma session, otherwise it won’t work:

$ sudo steamos-chroot --partset other
# steamos-readonly disable
# echo '[Autologin]' > /etc/sddm.conf.d/zz-steamos-autologin.conf
# echo 'Session=plasma.desktop' >> /etc/sddm.conf.d/zz-steamos-autologin.conf
# steamos-readonly enable
# exit

After this we can restart the system.

If everything went fine we should be running the latest SteamOS release. Enjoy!

Reporting bugs

SteamOS is under active development. If you find problems or want to request improvements please go to the SteamOS community tracker.

Edit 06 Jul 2022: Small fixes, mention how to install the OS without using NVMe.

by berto at July 05, 2022 07:11 PM

June 30, 2022

Stefan Hajnoczi

Comparing VIRTIO, NVMe, and io_uring queue designs

Queues and their implementation using shared memory ring buffers are a standard tool for communicating with I/O devices and between CPUs. Although ring buffers are widely used, there is no standard memory layout and it's interesting to compare the differences between designs. When defining libblkio's APIs, I surveyed the ring buffer designs in VIRTIO, NVMe, and io_uring. This article examines some of the differences between the ring buffers and queue semantics in VIRTIO, NVMe, and io_uring.

Ring buffer basics

A ring buffer is a circular array where new elements are written or produced on one side and read or consumed on the other side. Often terms such as head and tail or reader and writer are used to describe the array indices at which the next element is accessed. When the end of the array is reached, one moves back to the start of the array. The empty and full conditions are special states that must be checked to avoid underflow and overflow.

VIRTIO, NVMe, and io_uring all use single producer, single consumer shared memory ring buffers. This allows a CPU and an I/O device or two CPUs to communicate across a region of memory to which both sides have access.

Embedding data in descriptors

At a minimum a ring buffer element, or descriptor, contains the memory address and size of a data buffer:

OffsetTypeName
0x0u64buf
0x8u64len

In a storage device the data buffer contains a request structure with information about the I/O request (logical block address, number of sectors, etc). In order to process a request, the device first loads the descriptor and then loads the request structure described by the descriptor. Performing two loads is sub-optimal and it would be faster to fetch the request structure in a single load.

Embedding the data buffer in the descriptor is a technique that reduces the number of loads. The descriptor layout looks like this:

OffsetTypeName
0x0u64remainder_buf
0x8u64remainder_len
0x10...request structure

The descriptor is extended to make room for the data. If the size of the data varies and is sometimes too large for a descriptor, then the remainder is put into an external buffer. The common case will only require a single load but larger variable-sized buffers can still be handled with 2 loads as before.

VIRTIO does not embed data in descriptors due to its layered design. The data buffers are defined by the device type (net, blk, etc) and virtqueue descriptors are one layer below device types. They have no knowledge of the data buffer layout and therefore cannot embed data.

NVMe embeds the request structure into the Submission Queue Entry. The Command Dword 10, 11, 12, 13, 14, and 15 fields contain the request data and their meaning depends on the Opcode (request type). I/O buffers are still external and described by Physical Region Pages (PRPs) or Scatter Gather Lists (SGLs).

io_uring's struct io_uring_sqe embeds the request structure. Only I/O buffer(s) need to be external as their size varies, would be too large for the ring buffer, and typically zero-copy is desired due to the size of the data.

It seems that VIRTIO could learn from NVMe and io_uring. Instead of having small 16-byte descriptors, it could embed part of the data buffer into the descriptor so that devices need to perform fewer loads during request processing. The 12-byte struct virtio_net_hdr and 16-byte struct virtio_blk_req request headers would fit into a new 32-byte descriptor layout. I have not prototyped and benchmarked this optimization, so I don't know how effective it is.

Descriptor chaining vs external descriptors

I/O requests often include variable size I/O buffers that require scatter-gather lists similar to POSIX struct iovec arrays. Long arrays don't fit into a descriptor so descriptors have fields that point to an external array of descriptors.

Another technique for scatter-gather lists is to chain descriptors together within the ring buffer instead of relying on memory external to the ring buffer. When descriptor chaining is used, I/O requests that don't fit into a single descriptor can occupy multiple descriptors.

Advantages of chaining are better cache locality when a sequence of descriptors is used and no need to allocate separate per-request external descriptor memory.

A consequence of descriptor chaining is that the maximum queue size, or queue depth, becomes variable. It is not possible to guarantee space for specific number of I/O requests because the available number of descriptors depends on the chain size of requests placed into the ring buffer.

VIRTIO supports descriptor chaining although drivers usually forego it when VIRTIO_F_RING_INDIRECT_DESC is available.

NVMe and io_uring do not support descriptor chaining, instead relying on embedded and external descriptors.

Limits on in-flight requests

The maximum number of in-flight requests depends on the ring buffer design. Designs where descriptors are occupied from submission until completion prevent descriptor reuse for other requests while the current request is in flight.

An alternative design is where the device processes submitted descriptors and they are considered free again as soon as the device has looked at them. This approach is natural when separate submission and completion queues are used and there is no relationship between the two descriptor rings.

VIRTIO requests occupy descriptors for the duration of their lifetime, at least in the Split Virtqueue format. Therefore the number of in-flight requests is influenced by the descriptor table size.

NVMe has separate Submission Queues and Completion Queues, but its design still limits the number of in-flight requests to the queue size. The Completion Queue Entry's SQ Head Pointer (SQHD) field precludes having more requests in flight than the Submission Queue size because the field would no longer be unique. Additionally, the driver has no way of detecting Submission Queue Head changes, so it only knows there is space for more submissions when completions occur.

io_uring has independent submission (SQ) and completions queues (CQ) with support for more in-flight requests than the ring buffer size. When there are more in-flight requests than CQ capacity, it's possible to overflow the CQ. io_uring has a backlog mechanism for this case, although the intention is for applications to properly size queues to avoid hitting the backlog often.

Conclusion

VIRTIO, NVMe, and io_uring have slightly different takes on queue design. The semantics and performance vary due to these differences. VIRTIO lacks data embedding inside descriptors. io_uring supports more in-flight requests than the queue size. NVMe and io_uring rely on external descriptors with no ability to chain descriptors.

by Unknown (noreply@blogger.com) at June 30, 2022 01:22 PM

May 16, 2022

Gerd Hoffmann

edk2 quickstart for virtualization

Here is a quickstart for everyone who wants (or needs to) deal with edk2 firmware, with a focus on virtual machine firmware. The article assumes you are using a linux machine with gcc.

Building firmware for VMs

To build edk2 you need to have a bunch of tools installed. An compiler and the make are required of course, but also iasl, nasm and libuuid. So install them first (package names are for centos/fedora).

dnf install -y make gcc binutils iasl nasm libuuid-devel

If you want cross-build arm firmware on a x86 machine you also need cross compilers. While being at also set the environment variables needed to make the build system use the cross compilers:

dnf install -y gcc-aarch64-linux-gnu gcc-arm-linux-gnu
export GCC5_AARCH64_PREFIX="aarch64-linux-gnu-"
export GCC5_ARM_PREFIX="arm-linux-gnu-"

Next clone the tiaocore/edk2 repository and also fetch the git submodules.

git clone https://github.com/tianocore/edk2.git
cd edk2
git submodule update --init

The edksetup script will prepare the build environment for you. The script must be sourced because it sets some environment variables (WORKSPACE being the most important one). This must be done only once (as long as you keep the shell with the configured environment variables open).

source edksetup.sh

Next step is building the BaseTools (also needed only once):

make -C BaseTools

Note: Currently (April 2022) BaseTools are being rewritten in Python, so most likely this step will not be needed any more at some point in the future.

Finally the build (for x64 qemu) can be kicked off:

build -t GCC5 -a X64 -p OvmfPkg/OvmfPkgX64.dsc

The firmware volumes built can be found in Build/OvmfX64/DEBUG_GCC5/FV.

Building the aarch64 firmware instead:

build -t GCC5 -a AARCH64 -p ArmVirtPkg/ArmVirtQemu.dsc

The build results land in Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV.

Qemu expects the aarch64 firmware images being 64M im size. The firmware images can't be used as-is because of that, some padding is needed to create an image which can be used for pflash:

dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
dd of="QEMU_VARS-pflash.raw" if="/dev/zero" bs=1M count=64
dd of="QEMU_VARS-pflash.raw" if="QEMU_VARS.fd" conv=notrunc

There are a bunch of compile time options, typically enabled using -D NAME or -D NAME=TRUE. Options which are enabled by default can be turned off using -D NAME=FALSE. Available options are defined in the *.dsc files referenced by the build command. So a feature-complete build looks more like this:

build -t GCC5 -a X64 -p OvmfPkg/OvmfPkgX64.dsc \
    -D FD_SIZE_4MB \
    -D NETWORK_IP6_ENABLE \
    -D NETWORK_HTTP_BOOT_ENABLE \
    -D NETWORK_TLS_ENABLE \
    -D TPM2_ENABLE

Secure boot support (on x64) requires SMM mode. Well, it builds and works without SMM, but it's not secure then. Without SMM nothing prevents the guest OS writing directly to flash, bypassing the firmware, so protected UEFI variables are not actually protected.

Also suspend (S3) support works with enabled SMM only in case parts of the firmware (PEI specifically, see below for details) run in 32bit mode. So the secure boot variant must be compiled this way:

build -t GCC5 -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc \
    -D FD_SIZE_4MB \
    -D SECURE_BOOT_ENABLE \
    -D SMM_REQUIRE \
    [ ... add network + tpm + other options as needed ... ]

The FD_SIZE_4MB option creates a larger firmware image, being 4MB instead of 2MB (default) in size, offering more space for both code and vars. The RHEL/CentOS builds use that. The Fedora builds are 2MB in size, for historical reasons.

If you need 32-bit firmware builds for some reason, here is how to do it:

build -t GCC5 -a ARM -p ArmVirtPkg/ArmVirtQemu.dsc
build -t GCC5 -a IA32 -p OvmfPkg/OvmfPkgIa32.dsc

The build results will be in in Build/ArmVirtQemu-ARM/DEBUG_GCC5/FV and Build/OvmfIa32/DEBUG_GCC5/FV

Booting fresh firmware builds

The x86 firmware builds create three different images:

OVMF_VARS.fd
This is the firmware volume for persistent UEFI variables, i.e. where the firmware stores all configuration (boot entries and boot order, secure boot keys, ...). Typically this is used as template for an empty variable store and each VM gets its own private copy, libvirt for example stores them in /var/lib/libvirt/qemu/nvram.
OVMF_CODE.fd
This is the firmware volume with the code. Separating this from VARS does (a) allow for easy firmware updates, and (b) allows to map the code read-only into the guest.
OVMF.fd
The all-in-one image with both CODE and VARS. This can be loaded as ROM using -bios, with two drawbacks: (a) UEFI variables are not persistent, and (b) it does not work for SMM_REQUIRE=TRUE builds.

qemu handles pflash storage as block devices, so we have to create block devices for the firmware images:

CODE=${WORKSPACE}/Build/OvmfX64/DEBUG_GCC5/FV/OVMF_CODE.fd
VARS=${WORKSPACE}/Build/OvmfX64/DEBUG_GCC5/FV/OVMF_VARS.fd
qemu-system-x86_64 \
  -blockdev node-name=code,driver=file,filename=${CODE},read-only=on \
  -blockdev node-name=vars,driver=file,filename=${VARS},snapshot=on \
  -machine q35,pflash0=code,pflash1=vars \
  [ ... ]

Here is the arm version of that (using the padded files created using dd, see above):

CODE=${WORKSPACE}/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV/QEMU_EFI-pflash.raw
VARS=${WORKSPACE}/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV/QEMU_VARS-pflash.raw
qemu-system-aarch64 \
  -blockdev node-name=code,driver=file,filename=${CODE},read-only=on \
  -blockdev node-name=vars,driver=file,filename=${VARS},snapshot=on \
  -machine virt,pflash0=code,pflash1=vars \
  [ ... ]

Source code structure

The core edk2 repo holds a number of packages, each package has its own toplevel directory. Here are the most interesting ones:

OvmfPkg
This holds both the x64-specific code (i.e. OVMF itself) and virtualization-specific code shared by all architectures (virtio drivers).
ArmVirtPkg
Arm specific virtual machine support code.
MdePkg, MdeModulePkg
Most core code is here (PCI support, USB support, generic services and drivers, ...).
PcAtChipsetPkg
Some Intel architecture drivers and libs.
ArmPkg, ArmPlatformPkg
Common Arm architecture support code.
CryptoPkg, NetworkPkg, FatPkg, CpuPkg, ...
As the names of the packages already suggest: Crypto support (using openssl), Network support (including network boot), FAT Filesystem driver, ...

Firmware boot phases

The firmware modules in the edk2 repo often named after the boot phase they are running in. Most drivers are named SomeThingDxe for example.

ResetVector
This is where code execution starts after a machine reset. The code will do the bare minimum needed to enter SEC. On x64 the most important step is the transition from 16-bit real mode to 32-bit mode or 64bit long mode.
SEC (Security)
This code typically loads and uncompresses the code for PEI and SEC. On physical hardware SEC often lives in ROM memory and can not be updated. The PEI and DXE firmware volumes are loaded from (updateable) flash.
With OVMF both SEC firmware volume and the compressed volume holding PXE and DXE code are part of the OVMF_CODE image and will simply be mapped into guest memory.
PEI (Pre-EFI Initialization)
Platform Initialization is done here. Initialize the chipset. Not much to do here in virtual machines, other than loading the x64 e820 memory map (via fw_cfg) from qemu, or get the memory map from the device tree (on aarch64). The virtual hardware is ready-to-go without much extra preaparation.
PEIMs (PEI Modules) can implement functionality which must be executed before entering the DXE phase. This includes security-sensitive things like initializing SMM mode and locking down flash memory.
DXE (Driver Execution Environment)
When PEI is done it hands over control to the full EFI environment contained in the DXE firmware volume. Most code is here. All kinds of drivers. the firmware setup efi app, ...
Strictly speaking this isn't only one phase. The code for all phases after PEI is part of the DXE firmware volume though.

Useful Links

by Gerd Hoffmann at May 16, 2022 10:00 PM

Powered by Planet!
Last updated: March 26, 2023 09:05 AMEdit this page