Several changes have been made since my previous post on the topic, so after some discussions I decided to write a post about them.
There are improvements, fixes and even issues with the BSA specification.
SBSA Reference Platform (“sbsa-ref” for short) is now at version 0.3. Note that this is an internal number; the machine name is still the same.
The first bump added GIC data to the (minimalistic) device tree so firmware can configure it without using any magic numbers (as it did before).
The second update added GIC ITS (Interrupt Translation Service) support, which means that we can have MSI-X interrupts and a complex PCI Express setup.
The third time we said goodbye to the USB 2.0 (EHCI) host controller. It never worked and only generated kernel warnings. An XHCI (USB 3) controller is used instead now. EDK2 enablement is still a work in progress.
Most of versioning updates involved firmware changes. Information about hardware details gets passed from virtual hardware level to operating system via standard defined ways:
This way we were able to get rid of part of “magic numbers” from firmware components.
We can use the Neoverse V1 CPU core now. It uses the Arm v8.4 architecture and brings SVE and a bunch of other interesting features. You may need to update Trusted Firmware to make use of it.
QEMU got Arm Cortex-A710 CPU core support. It is the first Arm v9.0 core there. Due to its 2^40 address space we cannot use it for sbsa-ref, but it prepares the code for Neoverse N2/V2 cores.
SBSA Reference Platform passes most of BSA ACS tests from PCI Express module:
*** Starting PCIe tests ***
Operating System View:
801 : Check ECAM Presence : Result: PASS
802 : PE - ECAM Region accessibility check : Result: PASS
803 : All EP/Sw under RP in same ECAM Region : Result: PASS
804 : Check RootPort NP Memory Access : Result: PASS
805 : Check RootPort P Memory Access : Result: PASS
806 : Legacy int must be SPI & lvl-sensitive
Checkpoint -- 2 : Result: SKIPPED
808 : Check all 1's for out of range : Result: PASS
809 : Vendor specfic data are PCIe compliant : Result: PASS
811 : Check RP Byte Enable Rules : Result: PASS
817 : Check Direct Transl P2P Support
Checkpoint -- 1 : Result: SKIPPED
818 : Check RP Adv Error Report
Checkpoint -- 1 : Result: SKIPPED
819 : RP must suprt ACS if P2P Txn are allow
Checkpoint -- 1 : Result: SKIPPED
820 : Type 0/1 common config rule : Result: PASS
821 : Type 0 config header rules : Result: PASS
822 : Check Type 1 config header rules
BDF 0x400 : SLT attribute mismatch: 0xFF020100 instead of 0x20100
BDF 0x500 : SLT attribute mismatch: 0xFF030300 instead of 0x30300
BDF 0x600 : SLT attribute mismatch: 0xFF040400 instead of 0x40400
BDF 0x700 : SLT attribute mismatch: 0xFF050500 instead of 0x50500
BDF 0x800 : SLT attribute mismatch: 0xFF060600 instead of 0x60600
BDF 0x900 : SLT attribute mismatch: 0xFF080700 instead of 0x80700
BDF 0x10000 : SLT attribute mismatch: 0xFF020201 instead of 0x20201
Failed on PE - 0
Checkpoint -- 7 : Result: FAIL
824 : Device capabilities reg rule : Result: PASS
825 : Device Control register rule : Result: PASS
826 : Device cap 2 register rules : Result: PASS
830 : Check Cmd Reg memory space enable
BDF 400 MSE functionality failure
Failed on PE - 0
Checkpoint -- 1 : Result: FAIL
831 : Check Type0/1 BIST Register rule : Result: PASS
832 : Check HDR CapPtr Register rule : Result: PASS
833 : Check Max payload size supported : Result: PASS
835 : Check Function level reset : Result: PASS
836 : Check ARI forwarding enable rule : Result: PASS
837 : Check Config Txn for RP in HB : Result: PASS
838 : Check all RP in HB is in same ECAM : Result: PASS
839 : Check MSI support for PCIe dev : Result: PASS
840 : PCIe RC,PE - Same Inr Shareable Domain : Result: PASS
841 : NP type-1 PCIe supp 32-bit only
NP type-1 pcie is not 32-bit mem type
Failed on PE - 0
Checkpoint -- 1 : Result: FAIL
842 : PASID support atleast 16 bits
Checkpoint -- 3 : Result: SKIPPED
One or more PCIe tests failed or were skipped.
-------------------------------------------------------
Total Tests run = 30 Tests Passed = 22 Tests Failed = 3
-------------------------------------------------------
As you see some of them require work.
I reported the problem with test 822 to QEMU developers and it turned out to be a bug there. I got a patch from Michael S. Tsirkin (one of the QEMU PCI maintainers) and it made the test pass. I hope it will be merged soon.
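To see what test 822 complains about, it helps to split the reported value into bytes. The sketch below assumes the numbers in the log are the full 32-bit register at offset 0x18 of the Type 1 header (primary, secondary and subordinate bus numbers plus the Secondary Latency Timer in the top byte); the function name is mine, not something from BSA ACS.

```python
def decode_bus_register(value: int) -> dict:
    """Split the 32-bit register at offset 0x18 of a PCI Type 1 header."""
    return {
        "primary_bus": value & 0xFF,
        "secondary_bus": (value >> 8) & 0xFF,
        "subordinate_bus": (value >> 16) & 0xFF,
        # PCI Express expects this field to read as zero.
        "secondary_latency_timer": (value >> 24) & 0xFF,
    }


# Values taken from the first mismatch line in the log above.
reported = decode_bus_register(0xFF020100)
expected = decode_bus_register(0x00020100)

for name in reported:
    marker = "" if reported[name] == expected[name] else "  <-- mismatch"
    print(f"{name}: {reported[name]:#04x}{marker}")
```

Under that assumption the bus numbers agree and only the top byte (the latency timer) differs, which matches the "SLT attribute mismatch" wording.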
I wonder how many SBSA physical platforms will use one of those. Probably none, but my testing setup has one.
And it makes test 841 fail. This time the problem requires more discussion because the BSA specification says (chapter E.2 PCI Express Memory Space):
When PCI Express memory space is mapped as normal memory, the system must support unaligned accesses to that region. PCI Type 1 headers, used in PCI-to-PCI bridges, and therefore in root ports and switches, have to be programmed with the address space resources claimed by the given bridge. For non-prefetchable (NP) memory, Type 1 headers only support 32-bit addresses. This implies that endpoints on the other end of a PCI-to-PCI bridge only support 32-bit NP BARs.
On the other hand we have PCI Express Base Specification Revision 6.0 which, in chapter 7.5.1.2.1, says that a BAR can be either 32 or 64 bits wide:
Base Address registers that map into Memory Space can be 32 bits or 64 bits wide (to support mapping into a 64-bit address space) with bit 0 hardwired to 0b. For Memory Base Address registers, bits 2 and 1 have an encoded meaning as shown in Table 7-9. Bit 3 should be set to 1b if the data is prefetchable and set to 0b otherwise. A Function is permitted to mark a range as prefetchable if there are no side effects on reads, the Function returns all bytes on reads regardless of the byte enables, and host bridges can merge processor writes into this range without causing errors. Bits 3-0 are read-only.
Table 7-9 Memory Base Address Register Bits 2:1 Encoding
Bits 2:1(b)  Meaning
00           Base register is 32 bits wide and can be mapped anywhere in the 32 address bit Memory Space.
01           Reserved
10           Base register is 64 bits wide and can be mapped anywhere in the 64 address bit Memory Space.
11           Reserved
And the pcie-pci-bridge device in QEMU uses a 64-bit BAR.
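The encoding quoted above is easy to check by hand. Here is a minimal sketch that decodes the low bits of a memory BAR following Table 7-9; the function name and the sample values are made up for illustration.

```python
def decode_memory_bar(low_dword: int) -> dict:
    """Decode bits 3:0 of a memory Base Address Register."""
    if low_dword & 0x1:
        raise ValueError("bit 0 is set: this is an I/O BAR, not a memory BAR")
    widths = {0b00: 32, 0b10: 64}  # 01 and 11 are reserved
    encoding = (low_dword >> 1) & 0b11
    return {
        "width_bits": widths.get(encoding),  # None for reserved encodings
        "prefetchable": bool(low_dword & 0x8),
    }


# Hypothetical register values, chosen only to exercise each case.
print(decode_memory_bar(0xFE000000))  # 32-bit, non-prefetchable
print(decode_memory_bar(0xFE000004))  # 64-bit, non-prefetchable
print(decode_memory_bar(0xFE00000C))  # 64-bit, prefetchable
```

The second case (64-bit and non-prefetchable) is the combination that test 841 rejects.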
I opened a support ticket for it at Arm. We will see how it ends.
The Arm v8.1 architecture brought the Virtualization Host Extensions (VHE for short) and, with them, one more timer: the non-secure EL2 virtual timer.
BSA ACS checks for it and we were failing:
226 : Check NS EL2-Virt timer PPI Assignment START
NS EL2 Virtual timer interrupt 28 not received
Failed on PE - 0
B_PPI_02
Checkpoint -- 4 : Result: FAIL
END
It turned out that everything needed to make it pass was already present in QEMU, except the code to enable it for our platform. Two lines of code were enough.
After I sent my small patch, Leif Lindholm extracted the timer definitions into a separate include file and cleaned up the code around it to make it easier to compare QEMU code with the BSA specification.
The result? The test passes:
226 : Check NS EL2-Virt timer PPI Assignment START
Received vir el2 interrupt
B_PPI_02
: Result: PASS
END
SBSA Reference Platform in QEMU gets better and better with time. We can emulate more complex systems, and information about hardware details gets passed from the virtual hardware level to the operating system in standard, defined ways.
We still have test failures, but fewer than in the past.
I hacked up a prototype multi-player game in just static HTML/JS files. The game runs in players' browsers without the need for a centralized game server. This peer-to-peer model - getting rid of the server - is something I've been interested in for a long time. I finally discovered a way to make it work without hosting my own server or relying on a hosted service that requires API keys, accounts, or payments. That missing piece came in the form of nostr, a decentralized network protocol that I'll get into later.
Recently p2panda and Veilid were released. They are decentralized application frameworks. Neither has the exact properties I like, but that spurred me to work on a prototype game that shows the direction that I find promising for decentralized applications.
Most distributed applications today are built on a centralized client-server model. Applications are not a single program, but two programs. A client application on the user's device communicates with a server application on the application owner's machines. The way it works is pretty simple: the server holds the data and the client sends commands to operate on the data.
The centralized client-server model is kind of a drag because you need to develop two separate programs and maintain a server so that the application remains online at all times. Non-technical users can't really host the application themselves. It costs money to run the server. If the application owner decides to pull the plug on the server then users cannot use the application anymore. Bad practices of locking in, hoarding, and selling user data as well as monitoring and manipulating user behavior are commonplace because the server controls access to user data.
Peer-to-peer applications solve many of these issues. The advantages are roughly:
This needs to work for web, mobile, and desktop applications because people switch between these three environments all the time. It would be impractical if the solution does not support all environments. The web is the most restrictive environment, mostly for security reasons. Many technologies are not available on the web, including networking APIs that desktop peer-to-peer applications tend to rely on. But if a solution works on the web, then mobile and desktop applications are likely to be able to use the same technology and interoperate with web applications.
Luckily the web environment has one technology that can be used to build peer-to-peer applications: WebRTC. Implementations are available for mobile and desktop environments as well. WebRTC's DataChannels can be thought of as network connections that transfer messages between two devices. They are the primitive for communicating in a peer-to-peer application in place of the HTTPS, TCP, or UDP connections that most existing applications use today.
Unfortunately WebRTC is not fully peer-to-peer because it relies on a "signaling server" for connection establishment. The signaling server exchanges connectivity information so that a peer-to-peer connection can be negotiated. This negotiation process does not always succeed, by the way, so in some cases it is not possible to create a peer-to-peer connection. I have no solution for that without hosting servers.
The crux of using WebRTC is that a signaling server is needed, but we don't want to host one for each application. Over the years I've investigated existing peer-to-peer networks like Tor and WebTorrent to see if they could act as the signaling server. I didn't find one that is usable from the web environment (it's too restricted) until now.
It turns out that nostr, originally designed for social network applications but now being used for a bunch of different applications, is web-friendly and could act as a WebRTC signaling server quite easily. In my prototype I abused the encrypted direct message (NIP-04) feature for WebRTC signaling. It works but has the downside that the nostr relay wastes storage because there is no need to preserve the messages. That can be fixed by assigning an "ephemeral kind" so the relay knows it can discard messages after delivery.
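To make the "ephemeral kind" idea concrete, here is a rough sketch of what an unsigned signaling event could look like. It follows my reading of NIP-01 (the event id is the SHA-256 of the compact JSON serialization, and kinds 20000-29999 are ephemeral). The kind number 25050 and the function name are hypothetical, and the sketch deliberately skips encrypting the payload and signing the event, both of which a real client needs.

```python
import hashlib
import json
import time

EPHEMERAL_KINDS = range(20000, 30000)  # relays need not store these
SIGNALING_KIND = 25050  # hypothetical kind number picked for this sketch


def unsigned_signaling_event(sender_pubkey, recipient_pubkey, payload, created_at=None):
    """Build a nostr event (without a signature) addressed to one peer."""
    created_at = int(time.time()) if created_at is None else created_at
    tags = [["p", recipient_pubkey]]
    # NIP-01 serialization: JSON array, no extra whitespace, UTF-8.
    serialized = json.dumps(
        [0, sender_pubkey, created_at, SIGNALING_KIND, tags, payload],
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return {
        "id": hashlib.sha256(serialized.encode("utf-8")).hexdigest(),
        "pubkey": sender_pubkey,
        "created_at": created_at,
        "kind": SIGNALING_KIND,
        "tags": tags,
        "content": payload,  # assumed to be already encrypted by the caller
    }


# Placeholder keys; real ones are 32-byte public keys in lowercase hex.
event = unsigned_signaling_event("a" * 64, "b" * 64, "<encrypted SDP offer>", created_at=1700000000)
print(event["kind"], event["id"])
```

The payload would carry the WebRTC offer, answer, or ICE candidates; the relay only has to deliver it once.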
(Another option is to build a free public WebRTC signaling service. Its design would be remarkably close to the nostr protocol, so I decided not to reinvent the wheel. If anyone wants to create a public service, let me know and I can share ideas and research.)
Once connectivity has been established via WebRTC, it's up to the application to decide how to communicate. It could be a custom protocol like the JSON messages that my prototype uses, it could be the nostr protocol, it could be HTTP, or literally anything.
Here is how my game prototype works:
In order to connect apps, a user must share a public key with the other user. The public key allows the other user to connect. In my prototype the player hosting the game gets a URL that can be shared with the other player. When the other player visits the URL they will join the game because the public key is embedded in the URL. The role of the public key is similar to the idea behind INET256's "stable addresses derived from public keys".
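As an illustration of the invite URL, here is a small sketch that puts the host's public key into the URL fragment and reads it back. Using the fragment is my assumption for this example (it keeps the key out of requests sent to the static file host); the helper names and the example.org URL are made up, and the prototype may embed the key differently.

```python
from urllib.parse import urlsplit, urlunsplit

HEX_DIGITS = set("0123456789abcdef")


def make_invite_url(base_url: str, host_pubkey_hex: str) -> str:
    """Return base_url with the host's public key as the fragment."""
    parts = urlsplit(base_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, host_pubkey_hex))


def pubkey_from_invite_url(url: str) -> str:
    """Extract and sanity-check the public key embedded in an invite URL."""
    fragment = urlsplit(url).fragment
    if len(fragment) != 64 or not set(fragment) <= HEX_DIGITS:
        raise ValueError("invite URL does not contain a 32-byte hex public key")
    return fragment


host_key = "ab" * 32  # placeholder, not a real key
invite = make_invite_url("https://example.org/game/", host_key)
print(invite)
print(pubkey_from_invite_url(invite) == host_key)
```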
When devices go offline it is no longer possible to connect to them. This is not a problem for short-lived use cases like playing a game of chess or synchronizing the state of an RSS reader application between a phone and a laptop. For long-lived use cases like a discussion forum or a team chat there are two options: a fully peer-to-peer replicated and eventually consistent data model or a traditional centralized server hosted on a supernode. Both of these options are possible.
You can try out my prototype in your web browser. It's a 2-player tic-tac-toe game: https://gitlab.com/stefanha/tic-tac-toe-p2p/. If the game does not start, try it again (sorry, I hacked it up in a weekend and it's not perfect).
If you want to discuss or share other peer-to-peer application approaches, see my contact details here.
We’d like to announce the availability of the QEMU 8.1.0 release. This release contains 2900+ commits from 250 authors.
You can grab the tarball from our download page. The full list of changes is available in the changelog.
Highlights include:
Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!
Interested in the latest news on Linux on IBM Z & LinuxONE? Come and meet us at the 2023 IBM TechXchange EMEA Client Workshop for Linux on IBM Z and LinuxONE on September 19-20 in Ehningen, Germany!
Register here.
Second quarter of 2023 was quite productive in terms of new Linux distribution releases and KVM related features shipped in there. Here they are, in chronological order.
A couple of weeks ago, Red Hat Enterprise Linux 9.2 and Red Hat Enterprise Linux 8.8 were released, so it is time to look at the new features with regard to KVM virtualization on IBM Z systems.
The KVM code in the 5.14-based kernel of RHEL 9.2 has been refreshed to the state of the upstream 6.0 kernel.
Additionally, many packages from the virtualization stack have been rebased in RHEL 9.2. The following versions are now available:
Speaking of libslirp, a new alternative to the “slirp” user mode networking called passt has been added in RHEL 9.2 for the first time and can be used by installing the “passt” package and adjusting the XML definition of your guest accordingly. “passt” should provide more performance than “slirp” and was designed with security in mind.
Beside the generic new features that are available thanks to the rebased packages in RHEL 9.2, there are also some cool new IBM Z-specific features which have been explicitly backported to the RHEL 9.2 and 8.8 code base:
When running secure guests it is of course normally not possible to dump the guest’s memory from the host (e.g. with virsh dump --memory-only) since the memory pages of the guest are not available to the host system anymore.
However, in some cases (e.g. when debugging a misbehaving or crashing kernel in the guest), the owner of the guest VM still might want to get a dump of the guest memory – just without providing it in clear text to the administrator of the host system. With RHEL 9.2 and 8.8, this is now possible on the new z16 mainframe. Please see the related documentation from IBM to learn how to set up such a dump.
vfio-ap crypto adapters can now be hotplugged to guests during runtime, too, which brings you more flexibility, without the need to shut down your guests to change their configurations.
The kernel code in RHEL 9.2 and 8.8 can now enable a new firmware/hardware feature of the recent IBM Z machines that can speed up the performance of passthrough PCI devices (more events can be handled within the guest, without intervention of the KVM hypervisor). Additionally, this now also makes it possible to pass ISM PCI devices through to KVM guests (which was not possible before).
QEMU has emulation of several machines. One of them is “sbsa-ref”, which stands for SBSA Reference Platform. In simpler words: the Arm server.
In the past I worked on it when my help was needed. We have CI jobs which run some tests (SBSA ACS, BSA ACS) and do some checks to see where we are with SBSA compliance.
One day there was a discussion that we need a way to recognize variants of “sbsa-ref” in some sane way. The idea was to get rid of most of the hardcoded values and provide a way to pass data from QEMU up to firmware.
We started by adding “platform version major/minor” fields to the DeviceTree, starting with “0.0” as the value. For some time nothing changed here, as some of the people working on SBSA Reference Platform changed jobs and others worked on other parts of it.
Note that this is different from other QEMU targets. We do not go the “sbsa-ref-8.0”, “sbsa-ref-8.1” way, as this would add maintenance work without any gain for us.
During the last Linaro Connect we had some discussion on how we want to proceed, and some more afterwards (as not everyone got there because of UK visa issues).
The plan is simple:
After setting the plan I created a bunch of Jira tickets and started writing code. Some changes were new, some were adapted from our work-in-progress ones.
Trusted Firmware (TF-A) reads DeviceTree from QEMU and provides platform version (PV from now on) up to firmware via SMC. EDK2 reads it and does nothing (as expected).
Firmware knows which platform version we are on, so it can do something about it. So we bump the value in QEMU and provide the Arm GIC addresses via additional SMC calls.
TF-A uses those values instead of hardcoded ones to initialize GIC. Then EDK2 does the same.
If such firmware boots on an older QEMU then the hardcoded values are used and the machine is still operational.
Here things start to be more interesting. We add Interrupt Translation Service support to GIC, which means we have LPI, MSI(-X) etc. In other words: we have normal, working PCI Express with root ports, proper interrupts etc.
From the code side it is like the previous step: QEMU adds the address to DT, TF-A reads it and provides it via SMC to EDK2.
If such firmware boots on an older QEMU then ITS is not initialized, as it was not present in a PV 0.0 system.
Normal PCI Express is present, so let's get rid of the hardcoded values. The steps and behaviour are similar to the above.
At this step we have a normal, working PCI Express structure. So let's get rid of some platform devices and replace them with expansion cards:
We can use “ich9-ahci” card instead of former and “qemu-xhci” for latter one.
This step is EDK2 only as we do not touch those parts in TF-A. No real code yet as it needs adding some conditions to existing ASL code so operating system will not get information in DSDT table.
Again: if booted on lower PV then hardcoded values are used.
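The fallback rule used in all the steps above can be sketched in a few lines. This is only an illustration of the decision logic, not firmware code: I assume here that the GIC step corresponds to PV 0.1 and the ITS step to PV 0.2, the addresses are placeholders not taken from the firmware sources, and the query callback stands in for the SMC calls.

```python
from typing import Callable, Dict, Optional, Tuple

# Placeholder values standing in for the addresses hardcoded in firmware.
HARDCODED = {"gicd_base": 0x40060000, "gicr_base": 0x40080000}


def platform_config(
    version: Tuple[int, int], query: Callable[[str], int]
) -> Dict[str, Optional[int]]:
    """Pick hardware details based on the platform version (major, minor)."""
    config: Dict[str, Optional[int]] = dict(HARDCODED)
    config["its_base"] = None  # no ITS on a PV 0.0 system
    if version >= (0, 1):
        # Newer platform: ask instead of trusting hardcoded values.
        config["gicd_base"] = query("gicd_base")
        config["gicr_base"] = query("gicr_base")
    if version >= (0, 2):
        config["its_base"] = query("its_base")
    return config


# Made-up values that a newer QEMU could hand over.
qemu_values = {"gicd_base": 0x50000000, "gicr_base": 0x50020000, "its_base": 0x50040000}
old = platform_config((0, 0), qemu_values.__getitem__)
new = platform_config((0, 2), qemu_values.__getitem__)
print(old)
print(new)
```

On the old platform version nothing is queried at all, which is what keeps new firmware bootable on an old QEMU.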
Recently some additional changes to “sbsa-ref” were merged.
We exchanged graphics card from basic VGA one on legacy PCI bus to Bochs one (which uses PCI Express). From firmware or OS view not much changed as both were supported already.
The other change was the default processor. QEMU 8.0 brought emulation of the Arm Neoverse-N1 CPU. It has been enabled in TF-A for a while, so we switched to use it by default (instead of the ancient Cortex-A57). With the move from Arm v8.0 to v8.2 we got several CPU features and functionalities.
The above steps are cleanup preparing “sbsa-ref” for future work. We want to be able to change hardware definition more. For example to select exact GIC model (like GIC-600) instead of generic “arm-gic-v3” one.
SBSA Reference Platform is system where most of expansion is expected to happen by adding PCI Express cards.
RoCE interfaces may lose their IP settings due to an unexpected change of the network interface name.
The RoCE Express adapters can lose their IP settings due to an unexpected change of the network interface name if both of the following conditions are met:
To work around this problem:
Create the file /etc/systemd/network/98-rhel87-s390x.link with the following content:
[Match]
Architecture=s390x
KernelCommandLine=!net.naming-scheme=rhel-8.7
[Link]
NamePolicy=kernel database slot path
AlternativeNamesPolicy=database slot path
MACAddressPolicy=persistent
After rebooting the system for the changes to take effect, you can safely upgrade to RHEL 8.7 or later.
Note:
The 1.3.0 release of the libblkio high-performance block device I/O library is out. libblkio provides an API that abstracts various storage interfaces that are efficient but costly to integrate into applications including io_uring, NVMe uring_cmd, virtio-blk-pci, vdpa-blk, and more. Switching between them is very easy using libblkio and gives you flexibility to target both kernel and userspace storage interfaces in your application.
Linux packaging work has progressed over the past few months. Arch Linux, Fedora, and CentOS Stream now carry libblkio packages and more will come in the future. This makes it easier to use libblkio in your applications because you don't need to compile it from source yourself.
In this release the vdpa-blk storage interface support has been improved. vdpa-blk is a virtio-blk-based storage interface designed for hardware implementation, typically on Data Processing Unit (DPU) PCIe adapters. Applications can use vdpa-blk to talk directly to the hardware from userspace. This approach can be used either as part of a hypervisor like QEMU or simply to accelerate I/O-bound userspace applications. QEMU uses libblkio to make vdpa-blk devices available to guests.
The downloads and release notes are available here.
Canonical released a new version of their Ubuntu server offering: Ubuntu Server 23.04!
See the announcement on the mailing list here, and the blog entry at Canonical with Z-specific highlights here.
We’d like to announce the availability of the QEMU 8.0.0 release. This release contains 2800+ commits from 238 authors.
You can grab the tarball from our download page. The full list of changes is available in the changelog.
Highlights include:
Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!
Today we announced new IBM z16 and IBM LinuxONE single frame and rack mount options. That’s right – a mainframe in a rack mount! The unique design provides more flexibility and choice for clients to choose the best fit for their business, whether they are a start-up or a major enterprise.
See the official press release here.
Also, see Ian Cutress from TechTechPotato giving us a visit on site in Böblingen, Germany, to check out our hardware here.
And participate in our virtual launch events as follows:
Building QEMU is a complex task, split across several programs: the configure script finds the host and cross compilers that are needed to build emulators and firmware; Meson prepares the build environment for the emulators; finally, Make and Ninja actually perform the build, and in some cases they run tests as well.
In addition to compiling C code, many build steps run tools and scripts which are mostly written in the Python language. These include processing the emulator configuration, code generators for tracepoints and QAPI, extensions for the Sphinx documentation tool, and the Avocado testing framework. The Meson build system itself is written in Python, too.
Some of these tools are run through the python3 executable, while others are invoked directly as sphinx-build or meson, and this can create inconsistencies. For example, QEMU’s configure script checks for a minimum version of Python and rejects too-old interpreters. However, what would happen if code run by Sphinx used a different version?
This situation has been largely hypothetical until recently; QEMU’s
Python code is already tested with a wide range of versions of the
interpreter, and it would not be a huge issue if Sphinx used a different
version of Python as long as both of them were supported. This will
change in version 8.1 of QEMU, which will bump the minimum supported
version of Python from 3.6 to 3.8. While all the distros that QEMU
supports have a recent-enough interpreter, the default on RHEL8 and
SLES15 is still version 3.6, and that is what all binaries in /usr/bin
use unconditionally.
As of QEMU 8.0, even if configure is told to use /usr/bin/python3.8 for the build, QEMU’s custom Sphinx extensions would still run under Python 3.6. The configure script does separately check that Sphinx is executing with a new enough Python version, but it would be nice if there were a more generic way to prepare a consistent Python environment.
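To illustrate the kind of check discussed here, the sketch below asks a given interpreter for its own version instead of assuming it matches the one running the check. This is not how configure does it (configure is a shell script); the helper names are made up and the 3.8 minimum simply mirrors the planned bump mentioned above.

```python
import subprocess
import sys

MINIMUM = (3, 8)


def interpreter_version(executable: str) -> tuple:
    """Return (major, minor) as reported by the interpreter itself."""
    result = subprocess.run(
        [executable, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"],
        check=True,
        capture_output=True,
        text=True,
    )
    major, minor = result.stdout.strip().split(".")
    return int(major), int(minor)


def is_new_enough(executable: str, minimum: tuple = MINIMUM) -> bool:
    return interpreter_version(executable) >= minimum


# Check the interpreter that is running this example.
print(sys.executable, interpreter_version(sys.executable), is_new_enough(sys.executable))
```

The catch described in the text is that a tool such as sphinx-build may run under a different executable than the one checked this way.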
This post will explain how QEMU 8.1 will ensure that a single interpreter is used for the whole of the build process. Getting there will require some familiarity with Python packaging, so let’s start with virtual environments.
It is surprisingly hard to find what Python interpreter a given script will use. You can try to parse the first line of the script, which will be something like #! /usr/bin/python3, but there is no guarantee of success. For example, on some versions of Homebrew /usr/bin/meson will be a wrapper script like:
#!/bin/bash
PYTHONPATH="/usr/local/Cellar/meson/0.55.0/lib/python3.8/site-packages" \
exec "/usr/local/Cellar/meson/0.55.0/libexec/bin/meson" "$@"
The file with the Python shebang line will be hidden somewhere in /usr/local/Cellar. Therefore, performing some kind of check on the files in /usr/bin is ruled out. QEMU needs to set up a consistent environment on its own.
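A quick sketch shows why parsing the first line is not enough. The helper below (a made-up name) returns the words of the shebang line; when applied to a shortened copy of the wrapper above it finds only bash and learns nothing about the Python interpreter behind it.

```python
import os
import tempfile


def shebang_words(path):
    """Return the command and arguments from a script's #! line, or None."""
    with open(path, "rb") as f:
        first_line = f.readline()
    if not first_line.startswith(b"#!"):
        return None
    return first_line[2:].decode("utf-8", errors="replace").split()


# A shortened copy of the wrapper script shown above.
wrapper = b'#!/bin/bash\nexec "/usr/local/Cellar/meson/0.55.0/libexec/bin/meson" "$@"\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "meson")
    with open(path, "wb") as f:
        f.write(wrapper)
    found = shebang_words(path)

print(found)  # the wrapper only reveals bash, not the Python interpreter
```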
If a user who is building QEMU wanted to do so, the simplest way would be to use Python virtual environments. A virtual environment takes an existing Python installation but gives it a local set of Python packages. It also has its own bin directory; place it at the beginning of your PATH and you will be able to control the Python interpreter for scripts that begin with #! /usr/bin/env python3.
Furthermore, when packages are installed into the virtual environment with pip, they always refer to the Python interpreter that was used to create the environment. Virtual environments mostly solve the consistency problem at the cost of an extra pip install step to put QEMU’s build dependencies into the environment.
Unfortunately, this extra step has a substantial downside. Even though the virtual environment can optionally refer to the base installation’s installed packages, pip will always install packages from scratch into the virtual environment. For all Linux distributions except RHEL8 and SLES15 this is unnecessary, and users would be happy to build QEMU using the versions of Meson and Sphinx included in the distribution.
Even worse, pip install will access the Python package index (PyPI) over the Internet, which is often impossible on build machines that are sealed from the outside world. Automated installation of PyPI dependencies may actually be a welcome feature, but it must also remain strictly optional.
In other words, the ideal solution would use a non-isolated virtual environment, to be able to use system packages provided by Linux distributions; but it would also ensure that scripts (sphinx-build, meson, avocado) are placed into bin just like pip install does.
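The first half of that wish, a non-isolated environment, is already available from the standard library. Here is a minimal sketch using the venv module; the function name is made up, and this is only an approximation of what a build script could do. It does not address the second half (placing scripts into bin).

```python
import os
import venv
from pathlib import Path


def create_shared_venv(env_dir) -> Path:
    """Create a venv that can also see the base installation's packages."""
    builder = venv.EnvBuilder(
        system_site_packages=True,      # non-isolated: system packages stay visible
        with_pip=False,                 # nothing is downloaded or installed
        symlinks=(os.name != "nt"),     # same default as "python3 -m venv" on POSIX
    )
    builder.create(env_dir)
    return Path(env_dir) / "pyvenv.cfg"
```

After create_shared_venv("/some/build/dir/pyvenv") the returned pyvenv.cfg should contain the line "include-system-site-packages = true", which is what makes distribution-provided Meson or Sphinx importable from inside the environment.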
When it comes to packages, Python surely makes an effort to be confusing. The fundamental unit for importing code into a Python program is called a package; for example, os and sys are two such packages. However, a program or library that is distributed on PyPI consists of many such “import packages”: that’s because while pip is usually said to be a “package installer” for Python, more precisely it installs “distribution packages”.
To add to the confusion, the term “distribution package” is often shortened to either “package” or “distribution”. And finally, the metadata of the distribution package remains available even after installation, so “distributions” include things that are already installed (and are not being distributed anywhere).
All this matters because distribution metadata will be the key to building the perfect virtual environment. If you look at the content of bin/meson in a virtual environment, after installing the package with pip, this is what you find:
#!/home/pbonzini/my-venv/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from mesonbuild.mesonmain import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())
This looks a lot like automatically generated code, and in fact it is; the only parts that vary are the from mesonbuild.mesonmain import main import, and the invocation of the main() function on the last line.
pip creates this invocation script based on the setup.cfg file in Meson’s source code, more specifically based on the following stanza:
[options.entry_points]
console_scripts =
    meson = mesonbuild.mesonmain:main
Similar declarations exist in Sphinx, Avocado and so on, and accessing their content is easy via importlib.metadata (available in Python 3.8+):
$ python3
>>> from importlib.metadata import distribution
>>> distribution('meson').entry_points
[EntryPoint(name='meson', value='mesonbuild.mesonmain:main', group='console_scripts')]
importlib looks up the metadata in the running Python interpreter’s search path; if Meson is installed under another interpreter’s site-packages directory, it will not be found:
$ python3.8
>>> from importlib.metadata import distribution
>>> distribution('meson').entry_points
Traceback (most recent call last):
...
importlib.metadata.PackageNotFoundError: meson
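Both outcomes shown above can be wrapped in one small helper. This is a sketch with a made-up function name; it returns the console_scripts entry points of a distribution, or None when the running interpreter cannot see the distribution at all.

```python
from importlib.metadata import PackageNotFoundError, distribution


def console_scripts(dist_name: str):
    """Map script names to 'module:function' strings, or return None."""
    try:
        dist = distribution(dist_name)
    except PackageNotFoundError:
        # Not installed for *this* interpreter, even if another one has it.
        return None
    return {
        ep.name: ep.value
        for ep in dist.entry_points
        if ep.group == "console_scripts"
    }


for name in ("meson", "sphinx"):
    print(name, console_scripts(name))
```

What this prints depends on which distributions are installed for the interpreter running it, which is exactly the point of the check.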
So finally we have a plan! configure can build a non-isolated virtual environment, use importlib to check that the required packages exist in the base installation, and create scripts in bin that point to the right Python interpreter. Then, it can optionally use pip install to install the missing packages.
While this process includes a certain amount of specialized logic, Python provides a customizable venv module to create virtual environments. The custom steps can be performed by subclassing venv.EnvBuilder.
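Here is a rough sketch of what such a subclass could look like. It is not QEMU's actual implementation: the class name, the script template and the entry_points argument are all made up for illustration, and it only handles POSIX-style scripts. The post_setup() hook is the documented extension point that EnvBuilder calls once the environment has been created.

```python
import os
import venv

SCRIPT_TEMPLATE = """#!{python}
import sys
from {module} import {func}
if __name__ == '__main__':
    sys.exit({func}())
"""


class ScriptVenvBuilder(venv.EnvBuilder):
    """Create a venv and generate launcher scripts in its bin directory."""

    def __init__(self, entry_points, **kwargs):
        super().__init__(**kwargs)
        self.entry_points = entry_points  # e.g. {"meson": "mesonbuild.mesonmain:main"}
        self.written = []

    def post_setup(self, context):
        for name, value in self.entry_points.items():
            module, func = value.split(":")
            path = os.path.join(context.bin_path, name)
            with open(path, "w", encoding="utf-8") as f:
                # context.env_exe is the interpreter inside the new venv.
                f.write(SCRIPT_TEMPLATE.format(python=context.env_exe, module=module, func=func))
            os.chmod(path, 0o755)
            self.written.append(path)
```

A caller would combine this with the metadata lookup: find the console_scripts of each required distribution in the base installation, then pass them as entry_points together with system_site_packages=True.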
This will provide the same experience as QEMU 8.0, except that there will be no need for the --meson and --sphinx-build options to the configure script. The path to the Python interpreter is enough to set up all Python programs used during the build.
There is only one thing left to fix…
Remember how we started with a user that creates her own virtual environment before building QEMU? Well, this would not work anymore, because virtual environments cannot be nested. As soon as configure creates its own virtual environment, the packages installed by the user are not available anymore.
Fortunately, the “appearance” of a nested virtual environment is easy to emulate. Detecting whether python3 runs in a virtual environment is as easy as checking sys.prefix != sys.base_prefix; if it is, we need to retrieve the parent virtual environment’s site-packages directory:
>>> import sysconfig
>>> sysconfig.get_path('purelib')
'/home/pbonzini/my-venv/lib/python3.11/site-packages'
and write it to a .pth file in the lib directory of the new virtual environment. The following demo shows how a distribution package in the parent virtual environment will be available in the child as well:
A small detail is that configure’s new virtual environment should mirror the isolation setting of the parent. An isolated venv can be detected because sys.base_prefix in site.PREFIXES is false.
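The checks described in this section fit in a few lines. The sketch below uses made-up function names and a hypothetical file name (nested.pth), and it takes the child's site-packages directory as a parameter rather than computing it, so treat it as an outline of the idea rather than the code that will land in QEMU.

```python
import os
import site
import sys
import sysconfig


def in_virtual_env() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def parent_is_isolated() -> bool:
    """True when the running venv hides the base installation's packages."""
    return sys.base_prefix not in site.PREFIXES


def write_parent_pth(child_site_packages: str, name: str = "nested.pth"):
    """Make the parent venv's packages importable from a child venv."""
    if not in_virtual_env():
        return None  # nothing to inherit from
    parent_site_packages = sysconfig.get_path("purelib")
    pth_path = os.path.join(child_site_packages, name)
    with open(pth_path, "w", encoding="utf-8") as f:
        f.write(parent_site_packages + "\n")
    return pth_path


print("virtual env:", in_virtual_env(), "isolated:", parent_is_isolated())
```

Python's site module reads .pth files from site-packages directories at startup and appends each listed directory to sys.path, which is what makes the parent's packages show up in the child.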
Right now, QEMU only makes a minimal attempt at ensuring consistency of the Python environment; Meson is always run using the interpreter that was passed to the configure script with --python or $PYTHON, but that’s it. Once the above technique is implemented in QEMU 8.1, there will be no difference in the build experience, but configuration will be easier and a wider set of invalid build environments will be detected. We will merge these checks before dropping support for Python 3.6, so that users on older enterprise distributions will have a smooth transition.
Every once in a while a bug comes along where a guest hangs while communicating with a QEMU VIRTIO device. In this blog post I'll share some debugging approaches that can help QEMU developers who are trying to understand why a VIRTIO device is stuck.
There are a number of reasons why communication with a VIRTIO device might cease, so it helps to identify the nature of the hang:
The case I will talk about is when QEMU itself is still responsive (the QMP/HMP monitor works) and the guest may or may not be responsive.
There is a QEMU monitor command to inspect virtqueues called x-query-virtio-queue-status (QMP) and info virtio-queue-status (HMP). This is a quick way to extract information about a virtqueue from QEMU.
This command allows us to answer the question of whether the QEMU device completed its requests. The shadow_avail_idx and used_idx values in the output are the Available Ring index and Used Ring index, respectively. When they are equal the device has completed all requests. When they are not equal there are still requests in flight and those requests must be stuck inside QEMU.
Here is a little more background on the index values. Remember that VIRTIO Split Virtqueues have an Available Ring index and a Used Ring index. The Available Ring index is incremented by the driver whenever it submits a request. The Used Ring index is incremented by the device whenever it completes a request. If the Available Ring index is equal to the Used Ring index then all requests have been completed.
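To make the comparison concrete, here is a small sketch of the arithmetic. Both indices are free-running 16-bit counters, so the difference has to be taken modulo 65536. The function name is made up for this illustration.

```python
def requests_in_flight(avail_idx, used_idx):
    """Number of requests submitted by the driver but not yet completed.

    Both indices wrap around at 2**16, so mask the difference.
    """
    return (avail_idx - used_idx) & 0xFFFF


if __name__ == "__main__":
    print(requests_in_flight(10, 10))    # nothing outstanding
    print(requests_in_flight(10, 7))     # three requests outstanding
    print(requests_in_flight(3, 65534))  # avail index has wrapped around
```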
Note that shadow_avail_idx is not the vring Available Ring index in guest RAM but just the last cached copy that the device saw. That means we cannot tell if there are new requests that the device hasn't seen yet. We need to take another approach to figure that out.
Maybe the device has not seen new requests recently and this is why the guest is stuck. That can happen if the device is not receiving Buffer Available Notifications properly (normally this is done by reading a virtqueue kick ioeventfd, also known as a host notifier in QEMU).
We cannot use QEMU monitor commands here, but attaching the GDB debugger to QEMU will allow us to peek at the Available Ring index in guest RAM. The following GDB Python script loads the Available Ring index for a given VirtQueue:
$ cat avail-idx.py
import gdb

# ADDRESS is the address of a VirtQueue struct
vq = gdb.Value(ADDRESS).cast(gdb.lookup_type('VirtQueue').pointer())
# The ring fields are 16-bit values
uint16_type = gdb.lookup_type('uint16_t')
# Element 1 of the Available Ring is its idx field (element 0 is flags)
avail_idx = vq['vring']['caches']['avail']['ptr'].cast(uint16_type.pointer())[1]
if avail_idx != vq['shadow_avail_idx']:
    print('Device has not seen all available buffers: avail_idx {} shadow_avail_idx {} in {}'.format(avail_idx, vq['shadow_avail_idx'], vq.dereference()))
You can run the script using the source avail-idx.py GDB command. Finding the address of the virtqueue depends on the type of device that you are debugging.
If requests are not stuck inside QEMU and the device has seen the latest request, then the guest driver might have missed the Used Buffer Notification from the device (normally an interrupt handler or polling loop inside the guest detects completed requests).
In VIRTIO the driver's current index in the Used Ring is not visible to the device. This means we have no general way of knowing whether the driver has seen completions. However, there is a cool trick for modern devices that have the VIRTIO_RING_F_EVENT_IDX feature enabled.
The trick is that the Linux VIRTIO driver code updates the Used Event Index every time a completed request is popped from the virtqueue. So if we look at the Used Event Index we know the driver's index into the Used Ring and can find out whether it has seen request completions.
The following GDB Python script loads the Used Event Index for a given VirtQueue:
$ cat used-event-idx.py
import gdb

# ADDRESS is the address of a VirtQueue struct
vq = gdb.Value(ADDRESS).cast(gdb.lookup_type('VirtQueue').pointer())
# The ring fields are 16-bit values
uint16_type = gdb.lookup_type('uint16_t')
# The Used Event Index follows flags, idx and the num ring entries
used_event = vq['vring']['caches']['avail']['ptr'].cast(uint16_type.pointer())[2 + vq['vring']['num']]
if used_event != vq['used_idx']:
    print('Driver has not seen all used buffers: used_event {} used_idx {} in {}'.format(used_event, vq['used_idx'], vq.dereference()))
You can run the script using the source used-event-idx.py GDB command. Finding the address of the virtqueue depends on the type of device that you are debugging.
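Both GDB scripts index into the Available Ring as an array of 16-bit values: element 1 is the index and element 2 + num is the Used Event Index. The sketch below spells out the same layout as byte offsets and reads the fields from a fabricated buffer. It assumes little-endian fields, and the helper names are invented for this example.

```python
import struct


def avail_ring_offsets(queue_size):
    """Byte offsets of the fields in a split virtqueue Available Ring."""
    return {
        "flags": 0,
        "idx": 2,
        "ring": 4,
        "used_event": 4 + 2 * queue_size,
    }


def read_le16(buf, offset):
    # Assumption: fields are little-endian 16-bit values
    return struct.unpack_from("<H", buf, offset)[0]


def build_avail_ring(queue_size, idx, used_event):
    # Fabricated ring contents, only for demonstrating the offsets
    ring = bytearray(4 + 2 * queue_size + 2)
    struct.pack_into("<H", ring, 2, idx)
    struct.pack_into("<H", ring, 4 + 2 * queue_size, used_event)
    return bytes(ring)


if __name__ == "__main__":
    offsets = avail_ring_offsets(4)
    buf = build_avail_ring(4, idx=7, used_event=5)
    print(read_le16(buf, offsets["idx"]), read_le16(buf, offsets["used_event"]))
```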
I hope this helps anyone who has to debug a VIRTIO device that seems to have gotten stuck.
As a virtualization developer, I spend a significant amount of time understanding and debugging the behaviour and interaction of QEMU and the guest kernel/userspace code. As such my development machines have a variety of guest OS installations that get booted for various tasks. Some tasks, however, require a repeated cycle of QEMU code changes, or QEMU config changes, followed by guest testing. Waiting for an OS to boot can quickly become a significant time sink affecting productivity and lead to frustration. What is needed is a very low overhead way to accomplish simple testing tasks without an OS getting in the way.
Enter the ‘make-tiny-image.py’ tool for creating minimal initrd images.
If invoked with no arguments, this tool will create an initrd containing nothing more than busybox. The “init” program will be a script that creates a few device nodes, mounts proc/sysfs and then runs the busybox ‘sh’ binary to provide an interactive shell. This is intended to be used as follows
$ ./make-tiny-image.py
tiny-initrd.img
6.0.8-300.fc37.x86_64
$ qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -accel kvm -m 1000 -display none -serial stdio
~ # uname -a
Linux (none) 6.0.8-300.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 11 15:09:04 UTC 2022 x86_64 x86_64 x86_64 Linux
~ # uptime
 15:05:42 up 0 min, load average: 0.00, 0.00, 0.00
~ # free
        total    used    free  shared  buff/cache  available
Mem:   961832   38056  911264    1388       12512     845600
Swap:       0       0       0
~ # df
Filesystem  1K-blocks  Used  Available  Use%  Mounted on
none           480916     0     480916    0%  /dev
~ # ls
bin dev init proc root sys usr
~ # <Ctrl+D>
[ 23.841282] reboot: Power down
When I say “low overhead”, just how low are we talking about? With KVM, it takes less than a second to bring up the shell. Testing with emulation is where this really shines. Booting a full Fedora OS with QEMU emulation is slow enough that you don’t want to do it at all frequently. With this tiny initrd, it’ll take a little under 4 seconds to boot to the interactive shell. Much slower than KVM, but fast enough you’ll be fine repeating this all day long, largely unaffected by the (lack of) speed relative to KVM.
The make-tiny-image.py tool will create the initrd such that it drops you into a shell, but it can be told to run another command instead. This is how I tested the overheads mentioned above:
$ ./make-tiny-image.py --run poweroff
tiny-initrd.img
6.0.8-300.fc37.x86_64
$ time qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -m 1000 -display none -serial stdio -accel kvm
[ 0.561174] reboot: Power down

real 0m0.828s
user 0m0.613s
sys 0m0.093s
$ time qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -m 1000 -display none -serial stdio -accel tcg
[ 2.741983] reboot: Power down

real 0m3.774s
user 0m3.626s
sys 0m0.174s
As a more useful real world example, I wanted to test the effect of changing the QEMU CPU configuration against KVM and QEMU, by comparing the guest /proc/cpuinfo.
$ ./make-tiny-image.py --run 'cat /proc/cpuinfo'
tiny-initrd.img
6.0.8-300.fc37.x86_64
$ qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -m 1000 -display none -serial stdio -accel tcg -cpu max | grep '^flags'
flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca
        cmov pat pse36 clflush acpi mmx fxsr sse sse2 ss syscall
        nx mmxext pdpe1gb rdtscp lm 3dnowext 3dnow rep_good nopl
        cpuid extd_apicid pni pclmulqdq monitor ssse3 cx16 sse4_1
        sse4_2 movbe popcnt aes xsave rdrand hypervisor lahf_lm
        svm cr8_legacy abm sse4a 3dnowprefetch vmmcall fsgsbase
        bmi1 smep bmi2 erms mpx adx smap clflushopt clwb xsaveopt
        xgetbv1 arat npt vgif umip pku ospke la57
$ qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -m 1000 -display none -serial stdio -accel kvm -cpu max | grep '^flags'
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
        cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx
        pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl
        xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma
        cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
        tsc_deadline_timer aes xsave avx f16c rdrand hypervisor
        lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd
        ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority
        ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2
        erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec
        xgetbv1 xsaves arat umip sgx_lc md_clear arch_capabilities
NB, with the list of flags above, I’ve manually line wrapped the output for saner presentation in this blog rather than have one giant long line.
These examples have relied on tools provided by busybox, but we’re not limited by that. It is possible to tell it to copy in arbitrary extra binaries from the host OS by just listing their name. If it is a dynamically linked ELF binary, it’ll follow the ELF header dependencies, pulling in any shared libraries needed.
$ ./make-tiny-image.py hwloc-info lstopo-no-graphics
tiny-initrd.img
6.0.8-300.fc37.x86_64
Copy bin /usr/bin/hwloc-info -> /tmp/make-tiny-imagexu_mqd99/bin/hwloc-info
Copy bin /usr/bin/lstopo-no-graphics -> /tmp/make-tiny-imagexu_mqd99/bin/lstopo-no-graphics
Copy lib /lib64/libhwloc.so.15 -> /tmp/make-tiny-imagexu_mqd99/lib64/libhwloc.so.15
Copy lib /lib64/libc.so.6 -> /tmp/make-tiny-imagexu_mqd99/lib64/libc.so.6
Copy lib /lib64/libm.so.6 -> /tmp/make-tiny-imagexu_mqd99/lib64/libm.so.6
Copy lib /lib64/ld-linux-x86-64.so.2 -> /tmp/make-tiny-imagexu_mqd99/lib64/ld-linux-x86-64.so.2
Copy lib /lib64/libtinfo.so.6 -> /tmp/make-tiny-imagexu_mqd99/lib64/libtinfo.so.6
$ qemu-system-x86_64 -kernel /boot/vmlinuz-$(uname -r) -initrd tiny-initrd.img -append 'console=ttyS0 quiet' -m 1000 -display none -serial stdio -accel kvm
~ # hwloc-info
depth 0: 1 Machine (type #0)
depth 1: 1 Package (type #1)
depth 2: 1 L3Cache (type #6)
depth 3: 1 L2Cache (type #5)
depth 4: 1 L1dCache (type #4)
depth 5: 1 L1iCache (type #9)
depth 6: 1 Core (type #2)
depth 7: 1 PU (type #3)
Special depth -3: 1 NUMANode (type #13)
Special depth -4: 1 Bridge (type #14)
Special depth -5: 3 PCIDev (type #15)
Special depth -6: 1 OSDev (type #16)
Special depth -7: 1 Misc (type #17)
~ # lstopo-no-graphics
Machine (939MB total)
  Package L#0
    NUMANode L#0 (P#0 939MB)
    L3 L#0 (16MB) + L2 L#0 (4096KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
  HostBridge
    PCI 00:01.1 (IDE)
      Block "sr0"
    PCI 00:02.0 (VGA)
    PCI 00:03.0 (Ethernet)
  Misc(MemoryModule)
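How the tool discovers shared libraries is not shown in this post. One common way to follow a dynamically linked binary's dependencies is to parse the output of ldd, and the sketch below illustrates that idea on a fabricated sample. The function name and the sample text are assumptions made for this example, not part of make-tiny-image.py.

```python
import re

# Matches "name => /path (0xaddr)" as well as "/path (0xaddr)"
_LDD_LINE = re.compile(r"^\s*(?:\S+\s+=>\s+)?(/\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def parse_ldd_output(text):
    """Return the absolute library paths listed in ldd-style output."""
    libs = []
    for line in text.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            libs.append(match.group(1))
    return libs


# Fabricated sample in the usual ldd format
SAMPLE = """\
\tlinux-vdso.so.1 (0x00007ffc00001000)
\tlibhwloc.so.15 => /lib64/libhwloc.so.15 (0x00007f0000001000)
\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000002000)
\t/lib64/ld-linux-x86-64.so.2 (0x00007f0000003000)
"""

if __name__ == "__main__":
    for lib in parse_ldd_output(SAMPLE):
        print(lib)
```

Entries without a filesystem path, such as the vDSO, are skipped because there is nothing to copy for them.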
An obvious limitation is that if the binary/library requires certain data files, those will not be present in the initrd. It isn’t attempting to do anything clever like query the corresponding RPM file list and copy those. This tool is meant to be simple and fast and keep out of your way. If certain data files are critical for testing though, the --copy argument can be used. The copied files will be put at the same path inside the initrd as found on the host:
$ ./make-tiny-image.py --copy /etc/redhat-release
tiny-initrd.img
6.0.8-300.fc37.x86_64
Copy extra /etc/redhat-release -> /tmp/make-tiny-imageicj1tvq4/etc/redhat-release
$ qemu-system-x86_64 \
    -kernel /boot/vmlinuz-$(uname -r) \
    -initrd tiny-initrd.img \
    -append 'console=ttyS0 quiet' \
    -m 1000 -display none -serial stdio -accel kvm
~ # cat /etc/redhat-release
Fedora release 37 (Thirty Seven)
What if the problem being tested requires using some kernel modules? That’s covered too with the --kmod argument, which will copy in the modules listed, along with their dependencies and the insmod command itself. As an example of its utility, I used this recently to debug a regression in support for the iTCO watchdog in Linux kernels:
$ ./make-tiny-image.py --kmod lpc_ich --kmod iTCO_wdt --kmod i2c_i801
tiny-initrd.img
6.0.8-300.fc37.x86_64
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/mfd/lpc_ich.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/lpc_ich.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/watchdog/iTCO_wdt.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/iTCO_wdt.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/watchdog/iTCO_vendor_support.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/iTCO_vendor_support.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/mfd/intel_pmc_bxt.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/intel_pmc_bxt.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/i2c/busses/i2c-i801.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/i2c-i801.ko.xz
Copy kmod /lib/modules/6.0.8-300.fc37.x86_64/kernel/drivers/i2c/i2c-smbus.ko.xz -> /tmp/make-tiny-image63td8wbl/lib/modules/i2c-smbus.ko.xz
Copy bin /usr/sbin/insmod -> /tmp/make-tiny-image63td8wbl/bin/insmod
Copy lib /lib64/libzstd.so.1 -> /tmp/make-tiny-image63td8wbl/lib64/libzstd.so.1
Copy lib /lib64/liblzma.so.5 -> /tmp/make-tiny-image63td8wbl/lib64/liblzma.so.5
Copy lib /lib64/libz.so.1 -> /tmp/make-tiny-image63td8wbl/lib64/libz.so.1
Copy lib /lib64/libcrypto.so.3 -> /tmp/make-tiny-image63td8wbl/lib64/libcrypto.so.3
Copy lib /lib64/libgcc_s.so.1 -> /tmp/make-tiny-image63td8wbl/lib64/libgcc_s.so.1
Copy lib /lib64/libc.so.6 -> /tmp/make-tiny-image63td8wbl/lib64/libc.so.6
Copy lib /lib64/ld-linux-x86-64.so.2 -> /tmp/make-tiny-image63td8wbl/lib64/ld-linux-x86-64.so.2
$ ~/src/virt/qemu/build/qemu-system-x86_64 -kernel /boot/vmlinuz-$(uname -r) -initrd tiny-initrd.img -append 'console=ttyS0 quiet' -m 1000 -display none -serial stdio -accel kvm -M q35 -global ICH9-LPC.noreboot=false -watchdog-action poweroff -trace ich9* -trace tco*
ich9_cc_read addr=0x3410 val=0x20 len=4
ich9_cc_write addr=0x3410 val=0x0 len=4
ich9_cc_read addr=0x3410 val=0x0 len=4
ich9_cc_read addr=0x3410 val=0x0 len=4
ich9_cc_write addr=0x3410 val=0x20 len=4
ich9_cc_read addr=0x3410 val=0x20 len=4
tco_io_write addr=0x4 val=0x8
tco_io_write addr=0x6 val=0x2
tco_io_write addr=0x6 val=0x4
tco_io_read addr=0x8 val=0x0
tco_io_read addr=0x12 val=0x4
tco_io_write addr=0x12 val=0x32
tco_io_read addr=0x12 val=0x32
tco_io_write addr=0x0 val=0x1
tco_timer_reload ticks=50 (30000 ms)
~ # mknod /dev/watchdog0 c 10 130
~ # cat /dev/watchdog0
tco_io_write addr=0x0 val=0x1
tco_timer_reload ticks=50 (30000 ms)
cat: read error: Invalid argument
[ 11.052062] watchdog: watchdog0: watchdog did not stop!
tco_io_write addr=0x0 val=0x1
tco_timer_reload ticks=50 (30000 ms)
~ # tco_timer_expired timeouts_no=0 no_reboot=0/1
tco_timer_reload ticks=50 (30000 ms)
tco_timer_expired timeouts_no=1 no_reboot=0/1
tco_timer_reload ticks=50 (30000 ms)
tco_timer_expired timeouts_no=0 no_reboot=0/1
tco_timer_reload ticks=50 (30000 ms)
The Linux regression had accidentally left the watchdog with the ‘no reboot’ bit set, so it would never trigger the action, which we diagnosed from seeing repeated QEMU trace events for tco_timer_expired after triggering the watchdog in the guest. This was quickly fixed by the Linux maintainers.
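The post does not show how --kmod works out module dependencies. As a hedged illustration of one way to do it, the sketch below parses text in the format of the kernel's modules.dep file (one "module: dependencies" entry per line) and returns a load order with dependencies first. The function names and the sample data are made up for this example and do not reflect the real dependency lists of these modules.

```python
def parse_modules_dep(text):
    """Map each module path to the list of module paths it depends on."""
    deps = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        module, _, rest = line.partition(":")
        deps[module.strip()] = rest.split()
    return deps


def load_order(deps, wanted):
    """Return `wanted` and its dependencies, dependencies first."""
    order = []
    seen = set()

    def visit(module):
        if module in seen:
            return
        seen.add(module)
        for dep in deps.get(module, []):
            visit(dep)
        order.append(module)

    visit(wanted)
    return order


# Made-up sample, not the real dependency data for these modules
SAMPLE_DEP = """\
kernel/drivers/watchdog/iTCO_wdt.ko.xz: kernel/drivers/watchdog/iTCO_vendor_support.ko.xz
kernel/drivers/watchdog/iTCO_vendor_support.ko.xz:
kernel/drivers/i2c/busses/i2c-i801.ko.xz: kernel/drivers/i2c/i2c-smbus.ko.xz
kernel/drivers/i2c/i2c-smbus.ko.xz:
"""

if __name__ == "__main__":
    table = parse_modules_dep(SAMPLE_DEP)
    for path in load_order(table, "kernel/drivers/watchdog/iTCO_wdt.ko.xz"):
        print(path)
```

Listing dependencies first matters because insmod, unlike modprobe, does not resolve them on its own.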
In spite of being such a simple and crude script, with many, many, many unhandled edge cases, it has proved remarkably useful at enabling low overhead debugging of QEMU/Linux guest behaviour.
KVM Forum is an annual event that presents a rare opportunity for KVM and QEMU developers and users to discuss the state of Linux virtualization technology and plan for the challenges ahead. Sessions include updates on the state of the KVM virtualization stack, planning for the future, and many opportunities for attendees to collaborate.
This year’s event will be held in Brno, Czech Republic on June 14-15, 2023. It will be in-person only and will be held right before the DevConf.CZ open source community conference.
June 14 will be at least partly dedicated to a hackathon or “day of BoFs”. This will provide time for people to get together and discuss strategic decisions, as well as other topics that are best solved within smaller groups.
We encourage you to submit presentations via the KVM Forum CfP page. Suggested topics include:
The deadline for submitting presentations is April 2, 2023 - 11:59 PM PDT. Accepted speakers will be notified on April 17, 2023.
Admission to KVM Forum and DevConf.CZ is free. However, registration is required and the number of attendees is limited by the space available at the venue.
The DevConf.CZ program will feature technical talks on a variety of topics, including cloud and virtualization infrastructure—so make sure to register for DevConf.CZ as well if you would like to attend.
Both conferences are committed to fostering an open and welcoming environment for everybody. Participants are expected to abide by the Devconf.cz code of conduct and media policy.
QEMU is participating in Google Summer of Code and Outreachy again this year! Google Summer of Code and Outreachy are open source internship programs that offer paid remote work opportunities for contributing to open source. Internships generally run May through August, so if you have time and want to experience open source development, read on to find out how you can apply.
Each intern is paired with one or more mentors, experienced QEMU contributors who support them during the internship. Code developed by the intern is submitted through the same open source development process that all QEMU contributions follow. This gives interns experience with contributing to open source software. Some interns then choose to pursue a career in open source software after completing their internship.
Information on who can apply is here for Google Summer of Code and here for Outreachy. Note that Outreachy initial applications ended on February 6th so only those who have been accepted into Outreachy can apply for QEMU Outreachy internships.
Look through the list of QEMU project ideas and see if there is something you are interested in working on. Once you have found a project idea you want to apply for, email the mentor for that project idea to ask any questions you may have and discuss the idea further.
You can apply for Google Summer of Code from March 20th to April 4th and apply for Outreachy from March 6th to April 3rd.
Good luck with your applications!
If you have questions about applying for QEMU GSoC or Outreachy, please email Stefan Hajnoczi or ask on the #qemu-gsoc IRC channel.
I started working on libblkio in 2020 with the goal of creating a high-performance block I/O library. The internals are written in Rust while the library exposes a public C API for easy integration into existing applications. Most languages have a way to call C APIs, often called a Foreign Function Interface (FFI). It's the most universal way to call into code written in different languages within the same program. The choice of building a C API was a deliberate one in order to make it easy to create bindings in many programming languages. However, writing a library in Rust that exposes a C API is relatively rare (librsvg is the main example I can think of), so I wanted to share what I learnt from this project.
Rust has good support for making functions callable from C. The documentation on calling Rust code from C covers the basics. Here is the Rust implementation of void blkioq_set_completion_fd_enabled(struct blkioq *q, bool enable) from libblkio:
#[no_mangle]
pub extern "C" fn blkioq_set_completion_fd_enabled(q: &mut Blkioq, enable: bool) {
    q.set_completion_fd_enabled(enable);
}
A C program just needs a function prototype for blkioq_set_completion_fd_enabled() and can call it directly like a C function.
What's really nice is that most primitive Rust types can be passed between languages without special conversion code in Rust. That means the function can accept arguments and return values that map naturally from Rust to C. In the code snippet above you can see that the Rust bool argument can be used without explicit conversion.
C pointers are converted to Rust pointers or references automatically by the compiler. If you want them to be nullable, just wrap them in Rust Option and the C NULL value becomes Rust None while a non-NULL pointer becomes Some. This makes it a breeze to pass data between Rust and C. In the example above, the Rust &mut Blkioq argument is a C struct blkioq *.
Rust structs also map to C nicely when they are declared with repr(C). The Rust compiler lays out the struct in memory so that its representation is compatible with the equivalent C struct.
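To see what this buys a binding author, here is a small sketch from the consumer side using Python's ctypes, one of the FFI mechanisms mentioned earlier. The struct is hypothetical and not part of libblkio; it only shows that a C-compatible layout can be described field by field, and that a NULL pointer shows up as a false value, much like None on the Rust side.

```python
import ctypes


class Example(ctypes.Structure):
    # Hypothetical struct, meant to match:
    #   Rust: #[repr(C)] struct Example { flag: bool, value: u64 }
    #   C:    struct example { bool flag; uint64_t value; };
    _fields_ = [
        ("flag", ctypes.c_bool),
        ("value", ctypes.c_uint64),
    ]


def describe_layout():
    """Report where ctypes places each field under the platform C ABI."""
    return {
        "flag_offset": Example.flag.offset,
        "value_offset": Example.value.offset,
        "size": ctypes.sizeof(Example),
    }


def is_null(pointer):
    # ctypes pointers evaluate to False when they are NULL
    return not bool(pointer)


if __name__ == "__main__":
    print(describe_layout())
    print(is_null(ctypes.POINTER(Example)()))
```

The padding between flag and value is inserted by the C ABI rules, which is exactly what repr(C) asks the Rust compiler to follow.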
It's not all roses though. There are fundamental differences between Rust and C that make FFI challenging. Not all language constructs are supported by FFI and some that are require manual work.
Rust generics and dynamically sized types (DST) cannot be used in extern "C" function signatures. Generics require the Rust compiler to generate code, which does not make sense in a C API because there is no Rust compiler involved. DSTs have no mapping to C and so they need to be wrapped in something that can be expressed in C, like a struct. DSTs include trait objects, so you cannot directly pass trait objects across the C/Rust language boundary.
The limitations of FFI raise the question of how to design the library. The first extreme is to use the lowest common denominator language features supported by FFI. In the worst case this means writing C in Rust with frequent use of unsafe (because pointers and unpacked DSTs are passed around). This is obviously a bad approach because it forgoes the safety and expressiveness benefits of Rust. I think few human programmers would follow this approach although code generators or translators might output Rust code of this sort.
The other extreme is to forget about C and focus on writing an idiomatic Rust crate and then build a C API afterwards. Although this sounds nice, it's not entirely a good idea either because of the FFI limitations I mentioned. The Rust crate might be impossible to express as a C API and require significant glue code and possibly performance sacrifices if values cannot be passed across language boundaries efficiently.
When I started libblkio I thought primarily in terms of the C API. Although the FFI code was kept isolated and the rest of the codebase was written in acceptably nice Rust, the main mistake was that I didn't think of what the native Rust crate API should look like. Only thinking of the C API meant that some of the key design decisions were suboptimal for a native Rust crate. Later on, when we began experimenting with a native Rust crate, it became clear where assumptions from the unsafe C API had crept in. It is hard to change them now, although Alberto Faria has done great work in revamping the codebase for a natural Rust API.
I erred too much on the side of the C API. In the future I would try to stay closer to the middle or slightly towards the native Rust API (but not to the extreme). That approach is most likely to end up with code that presents an efficient C API while still implementing it in idiomatic Rust. Overall, implementing a C library API in Rust was a success. I would continue to do this instead of writing new libraries in C because Rust's language features are more attractive than C's.
At FOSDEM '23 I gave a talk about vhost-user-blk and its use as a userspace block I/O interface. The video and slides are now available here. Enjoy!
vhost-user-blk has connected hypervisors to software-defined storage since around 2017, but it was mainly seen as virtualization technology. Did you know that vhost-user-blk is not specific to virtual machines? I think it's time to use it more generally as a userspace block I/O interface because it's fast, unprivileged, and avoids exposing kernel attack surfaces.
My LWN.net article about Accessing QEMU storage features without a VM already hinted at this, but now it's time to focus on what vhost-user-blk is and why it's easy to integrate into your applications. libblkio is a simple and familiar block I/O API with vhost-user-blk support. You can connect to existing SPDK-based software-defined storage applications, qemu-storage-daemon, and other vhost-user-blk back-ends.
Come see my FOSDEM '23 talk about vhost-user-blk as a fast userspace block I/O interface live on Saturday Feb 4 2023, 11:15 CET. It will be streamed on the FOSDEM website and recordings will be available later. Slides are available here.
Wednesday, February 1st, 11:00 AM - 12:00 PM ET
"Open to Clients, Business Partners, IBMers, IT Architects, Systems Admins, etc."
We’d like to announce the availability of the QEMU 7.2.0 release. This release contains 1800+ commits from 205 authors.
You can grab the tarball from our download page. The full list of changes is available in the Wiki.
Highlights include:
Thank you to everyone involved!
For a long time, the QEMU project hosted its git repository on its own server and used Launchpad for tracking bugs. The self-hosting of the git repository caused some trouble, so the project switched the main repository to Gitlab in January 2021. That change of course also triggered the question of whether the bug tracking could be moved from Launchpad to Gitlab, too. This would provide a better integration of the bug tracking with the git repository, and also has the advantage that more QEMU developers have a Gitlab account than a Launchpad account. But after some discussions it was clear that there was the desire to not simply leave the open bug tickets at Launchpad behind, so to be able to switch, those tickets needed to be migrated to the Gitlab issue tracker instead.
Fortunately, there are APIs for both Launchpad and Gitlab, so although I was a complete Python newbie, I was indeed able to build a little script that transfers bug tickets from Launchpad to Gitlab. I recently found the script on my hard disk again, and I thought it might be helpful for other people in the same situation, so here it is:
#!/usr/bin/env python3

import argparse
import os
import re
import sys
import time
import gitlab
import textwrap

from launchpadlib.launchpad import Launchpad
import lazr.restfulclient.errors

parser = argparse.ArgumentParser(description=
                                 "Copy bugs from Launchpad to Gitlab")
parser.add_argument('-l',
                    '--lp-project-name',
                    dest='lp_project_name',
                    help='The Launchpad project name.')
parser.add_argument('-g',
                    '--gl-project-id',
                    dest='gl_project_id',
                    help='The Gitlab project ID.')
parser.add_argument('--verbose', '-v',
                    help='Enable debug logging.',
                    action="store_true")
parser.add_argument('--open', '-o',
                    dest='open_url',
                    help='Open URLs in browser.',
                    action="store_true")
parser.add_argument('--anonymous', '-a',
                    help='Use anonymous login to launchpad (no updates!)',
                    action="store_true")
parser.add_argument('--search-text', '-s',
                    dest='search_text',
                    help='Look up bugs by searching for text.')
parser.add_argument('--reporter', '-r',
                    dest='reporter',
                    help='Look up bugs from the given reporter only.')
parser.add_argument('-b',
                    '--batch-size',
                    dest='batch_size',
                    default=20,
                    type=int,
                    help='The maximum amount of bug tickets to handle.')
args = parser.parse_args()

def get_launchpad():
    cache_dir = os.path.expanduser("~/.launchpadlib/cache/")
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir, 0o700)

    def no_credential():
        print("ERROR: Can't proceed without Launchpad credential.")
        sys.exit()

    if args.anonymous:
        launchpad = Launchpad.login_anonymously(args.lp_project_name +
                                                '-bugs',
                                                'production', cache_dir)
    else:
        launchpad = Launchpad.login_with(args.lp_project_name + '-bugs',
                                         'production',
                                         cache_dir,
                                         credential_save_failed=no_credential)
    return launchpad

def convert_tags(tags):
    convtab = {
        "cve": "Security",
        "disk": "Storage",
        "documentation": "Documentation",
        "ethernet": "Networking",
        "feature-request": "kind::Feature Request",
        "linux": "os: Linux",
        "macos": "os: macOS",
        "security": "Security",
        "test": "Tests",
        "tests": "Tests",
    }
    labels = []
    for tag in tags:
        label = convtab.get(tag)
        if label:
            labels.append(label)
    return labels

def show_bug_task(bug_task):
    print('*** %s - %s' % (bug_task.bug.web_link,
                           str(bug_task.bug.title)[0:44] + "..."))
    if args.verbose:
        print('### Description: %s' % bug_task.bug.description)
        print('### Tags: %s' % bug_task.bug.tags)
        print('### Status: %s' % bug_task.status)
        print('### Assignee: %s' % bug_task.assignee)
        print('### Owner: %s' % bug_task.owner)
        for attachment in bug_task.bug.attachments:
            print('#### Attachment: %s (%s)'
                  % (attachment.data_link, attachment.title))
            #print(sorted(attachment.lp_attributes))
        for message in bug_task.bug.messages:
            print('#### Message: %s' % message.content)

def mark_lp_bug_moved(bug_task, new_url):
    subject = "Moved bug report"
    comment = """
This is an automated cleanup. This bug report has been moved to the
new bug tracker on gitlab.com and thus gets marked as 'expired' now.
Please continue with the discussion here:
%s
""" % new_url
    bug_task.status = "Expired"
    bug_task.assignee = None
    try:
        bug_task.lp_save()
        bug_task.bug.newMessage(subject=subject, content=comment)
        if args.verbose:
            print(" ... expired LP bug report %s" % bug_task.web_link)
    except lazr.restfulclient.errors.ServerError as e:
        print("ERROR: Timeout while saving LP bug update! (%s)" % e, end='')
    except Exception as e:
        print("ERROR: Failed to save LP bug update! (%s)" % e, end='')

def preptext(txt):
    txtwrapper = textwrap.TextWrapper(replace_whitespace = False,
                                      break_long_words = False,
                                      drop_whitespace = True, width = 74)
    outtxt = ""
    for line in txt.split("\n"):
        outtxt += txtwrapper.fill(line) + "\n"
    # Escape characters that would otherwise be interpreted as markup
    outtxt = outtxt.replace("-", "&#45;")
    outtxt = outtxt.replace("<", "&lt;")
    outtxt = outtxt.replace(">", "&gt;")
    return outtxt

def transfer_to_gitlab(launchpad, project, bug_task):
    bug = bug_task.bug
    desc = "This bug has been copied automatically from: " \
           + bug_task.web_link \
           + "<br/>\nReported by '[" + bug.owner.display_name \
           + "](https://launchpad.net/~" + bug.owner.name + ")' "
    desc += "on " \
            + bug.date_created.date().isoformat() + " :\n\n" \
            + "<pre>" + preptext(bug.description) + "</pre>\n"
    issue = project.issues.create({'title': bug.title, 'description': desc},
                                  retry_transient_errors = True)
    for msg in bug.messages:
        has_attachment = False
        attachtxt = "\n**Attachments:**\n\n"
        for attachment in bug_task.bug.attachments:
            if attachment.message == msg:
                has_attachment = True
                attachtxt += "* [" + attachment.title + "](" \
                             + attachment.data_link + ")\n"
        note = "Comment from '[" + msg.owner.display_name \
               + "](" + msg.owner.web_link + ")' on Launchpad (" \
               + msg.date_created.date().isoformat() + "):\n"
        if msg == bug.messages[0] or not msg.content.strip():
            if not has_attachment:
                continue
        else:
            note += "\n<pre>" + preptext(msg.content) + "</pre>\n"
        if has_attachment:
            note += attachtxt
        issue.notes.create({'body': note}, retry_transient_errors = True)
        time.sleep(0.2) # To avoid "spamming"
    labels = convert_tags(bug.tags)
    labels.append("Launchpad")
    issue.labels = labels
    issue.save(retry_transient_errors = True)
    print(" ==> %s" % issue.web_url)
    if not args.anonymous:
        mark_lp_bug_moved(bug_task, issue.web_url)
    if args.open_url:
        os.system("xdg-open " + issue.web_url)

def main():
    print("LP2GL", args)
    if not args.lp_project_name:
        print("Please specify a Launchpad project name (with -l)")
        return
    launchpad = get_launchpad()
    lp_project = launchpad.projects[args.lp_project_name]
    if args.reporter:
        bug_tasks = lp_project.searchTasks(
            status=["New", "Confirmed", "Triaged"],
            bug_reporter="https://api.launchpad.net/1.0/~" + args.reporter,
            omit_duplicates=True,
            order_by="datecreated")
    elif args.search_text:
        bug_tasks = lp_project.searchTasks(
            status=["New", "Confirmed", "Triaged", "In Progress"],
            search_text=args.search_text,
            omit_duplicates=True,
            order_by="datecreated")
    else:
        bug_tasks = lp_project.searchTasks(
            status=["New", "Confirmed", "Triaged"],
            omit_duplicates=True,
            order_by="datecreated")
    if args.gl_project_id:
        try:
            priv_token = os.environ['GITLAB_PRIVATE_TOKEN']
        except Exception as e:
            print("Please set the GITLAB_PRIVATE_TOKEN env variable!")
            return
        gl = gitlab.Gitlab('https://gitlab.com', private_token=priv_token)
        gl.auth()
        project = gl.projects.get(args.gl_project_id)
    else:
        print("Provide a Gitlab project ID to transfer the bugs ('-g')")
    batch_size = args.batch_size
    for bug_task in bug_tasks:
        if batch_size < 1 :
            break
        owner = bug_task.owner.name
        if args.open_url:
            os.system("xdg-open " + bug_task.bug.web_link)
        show_bug_task(bug_task)
        if args.gl_project_id:
            time.sleep(2) # To avoid "spamming"
            transfer_to_gitlab(launchpad, project, bug_task)
        batch_size -= 1
    print("All done.")

if __name__ == '__main__':
    main()
You need to specify at least a Launchpad project name with the -l parameter (for example -l qemu-kvm), and for simple initial tests it might be good to use -a for an anonymous Launchpad login, too (the Launchpad ticket won’t be updated in that case). Without further parameters, this will just list the tickets in the Launchpad project that are still open.
To transfer tickets to a Gitlab issue tracker, you need to specify the Gitlab project ID with the -g parameter (which can be found on the main page of your project on Gitlab) and provide a Gitlab access token for the API via the GITLAB_PRIVATE_TOKEN environment variable.
Anyway, if you want to use the script, I recommend testing it with anonymous access for Launchpad (i.e. with the -a parameter) and a dummy project on Gitlab first (which you can just delete afterwards). This way you can get a basic understanding and impression of the script before you use it for the final transfer of your bug tickets.
The two previous blog posts about why git forges are von Neumann machines and the Radicle peer-to-peer git forge explored models for git forges. In this final post I want to cover yet another model that draws from the previous ones but has its own unique twist.
I previously showed how applications can be built on centralized git forges using CI/CD functionality for executing code, webhooks for interacting with the outside world, and disjoint branches for storing data.
A more elegant architecture is a peer-to-peer one where instead of many clients and one server there are just peers. Each peer has full access to the data. There is no client/server application code split, instead each peer runs an application for itself.
First, this makes it easier to move the data to new hosting infrastructure or fork a project since all data resides in the git repository. Merge requests, issues, wikis, and even the app settings are all stored in the git repo itself.
Second, this gives more power to the users who can process data however they want without being limited by the server's API. All peers are on equal footing and users don't need permission to alter applications, because they run locally.
Finally, it is easier to develop a local application than a client/server application. Being able to open a file and tweak the code is immediate and less hassle than testing and deploying a server-side application.
Internet peer-to-peer systems typically still require some central point for bootstrapping and this is no exception. A publicly-accessible git repository is still needed so that peers can fetch and push changes. However, in this model the git server does not run application code but "git apps" like merge requests, issue trackers, wikis, etc can still be implemented. Here is how it works...
The git server is not allowed to run application code in our model, so apps like merge requests won't be processing data on the server side. However, the repository does need some primitives to make peer-to-peer git apps possible. These primitives are access control policies for refs and directories/files.
Peers run applications locally and the git server is "dumb" with the sole job of enforcing access control. You can imagine this like a multi-user UNIX machine where users have access to a shared directory. UNIX file permissions determine how processes can access the data. By choosing permissions carefully, multiple users can collaborate in the shared directory in a safe and controlled manner.
This is an anti-application server because no application code runs on the server side. The server is just a git repository that stores data and enforces access control on git push.
Repositories that accept push requests need a pre-receive hook (see githooks(5)) that checks incoming requests against the access control policy. If the request complies with the access control policy then the git push is accepted. Otherwise the git push is rejected and changes are not made to the git repository.
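As a rough illustration of such a hook, here is a minimal Python sketch. It relies on the documented pre-receive input format (one "old-oid new-oid ref-name" line per updated ref on stdin, with an all-zero object ID standing for "does not exist") and on git merge-base --is-ancestor to tell a fast-forward from a forced update. The function names, the is_allowed callback, and the rule that only refs under refs/tags/ count as tags are assumptions made for this example; the model described here does not say how branch and tag creation are told apart.

```python
import subprocess
import sys


def is_zero(oid):
    """An all-zero object ID means the ref does not exist on that side."""
    return set(oid) == {"0"}


def git_is_ancestor(old, new):
    """True if `old` is an ancestor of `new`, i.e. the update is a fast-forward."""
    result = subprocess.run(["git", "merge-base", "--is-ancestor", old, new])
    return result.returncode == 0


def classify_ref_update(old, new, ref, is_ancestor=git_is_ancestor):
    """Map one pre-receive input line to an operation name from the policy."""
    if is_zero(old):
        # Assumption: only refs/tags/* are tags, everything else is a branch.
        return "create-tag" if ref.startswith("refs/tags/") else "create-branch"
    if is_zero(new):
        return "delete"
    return "fast-forward" if is_ancestor(old, new) else "force"


def pre_receive(lines, is_allowed, is_ancestor=git_is_ancestor):
    """Return 0 to accept the whole push or 1 to reject it."""
    for line in lines:
        old, new, ref = line.split()
        operation = classify_ref_update(old, new, ref, is_ancestor)
        if not is_allowed(operation, ref):
            print("rejected: %s on %s" % (operation, ref), file=sys.stderr)
            return 1
    return 0
```

An actual hook would end with something like sys.exit(pre_receive(sys.stdin, is_allowed)), where is_allowed (hypothetical) consults the access control policy. A non-zero exit status from pre-receive makes git refuse the entire push.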
The first type of access control is on git refs. Git refs are the namespace where branches and tags are stored in a git repository. If a regular expression matches the ref and the operation type (create, fast-forward, force, delete) then it is allowed. For example, this policy rule allows any user to push to refs/heads/foo but force pushes and deletion are not allowed:
anyone create-branch,fast-forward ^heads/foo$
The operations available on refs include:
Operation | Description |
---|---|
create-branch | Push a new branch that doesn't exist yet |
create-tag | Push a new tag that doesn't exist yet |
fast-forward | Push a commit that is a descendant of the current commit |
force | Push a commit or tag replacing the previous ref |
delete | Delete a ref |
What's more interesting is that $user_id is expanded to the git push user's identifier so we can write rules to limit access to per-user ref namespaces:
anyone create-branch,fast-forward,force,delete ^heads/$user_id/.*$
This would allow Alice to push her own branches but Alice could not push to Bob's branches.
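To make the rule format concrete, here is a small Python sketch of how a single rule line could be parsed and evaluated. It assumes the three whitespace-separated fields shown above (group, comma-separated operations, regular expression), that the pattern is matched against the ref name with the leading refs/ removed, and that $user_id is substituted literally after escaping. The Rule type and the function names are made up for this example, and group membership is deliberately not checked here.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    group: str
    operations: frozenset
    pattern: str


def parse_rule(line):
    """Split 'group op1,op2 regex' into its three fields."""
    group, operations, pattern = line.split(None, 2)
    return Rule(group, frozenset(operations.split(",")), pattern.strip())


def rule_allows(rule, user_id, operation, ref):
    """Check the operation and the ref name against one rule."""
    if operation not in rule.operations:
        return False
    # Assumption: patterns are written relative to refs/.
    name = ref[len("refs/"):] if ref.startswith("refs/") else ref
    # Escape the user ID so it cannot smuggle regex syntax into the pattern.
    pattern = rule.pattern.replace("$user_id", re.escape(user_id))
    return re.search(pattern, name) is not None


rule = parse_rule("anyone create-branch,fast-forward,force,delete ^heads/$user_id/.*$")
print(rule_allows(rule, "ALICE", "force", "refs/heads/ALICE/topic"))
print(rule_allows(rule, "ALICE", "force", "refs/heads/BOB/topic"))
```

The two calls at the end correspond to Alice force-pushing one of her own branches and Alice trying the same on one of Bob's branches.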
We have covered how to define access control policies on refs. Access control policies are also needed on branches so that multiple users can modify the same branch in a controlled and safe manner. The syntax is similar but the policy applies to changes made by commits to directories/files (what git calls a tree). The following allows users to create files in a directory but not delete or modify them (somewhat similar to the UNIX restricted deletion or "sticky" bit on world-writable directories):
anyone create-file ^shared-dir/.*$
The operations available on branches include:
Operation | Description |
---|---|
create-directory | Create a new directory |
create-file | Create a new file |
create-symlink | Create a symlink |
modify | Change an existing file or symlink |
delete-file | Delete a file |
... |
$user_id expansion is also available for branch access control. Here the user can create, modify, and delete files in a per-user directory:
anyone create-file,modify,delete-file ^$user_id/.*$
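One way a hook could find out which of these operations a push performs is to look at the raw output of git diff-tree -r between the old and new commit. The Python sketch below parses lines in that raw format (":srcmode dstmode srcsha dstsha status", a tab, then the path) and maps them to operation names from the table. The mapping is an assumption made for this example: git does not track directories, so create-directory would have to be inferred separately, and deleting a symlink is reported here as delete-file.

```python
SYMLINK_MODE = "120000"


def classify_tree_change(raw_line):
    """Map one 'git diff-tree -r --raw' line to (operation, path)."""
    meta, path = raw_line.rstrip("\n").split("\t", 1)
    src_mode, dst_mode, _src_sha, _dst_sha, status = meta.lstrip(":").split()
    if status == "A":
        operation = "create-symlink" if dst_mode == SYMLINK_MODE else "create-file"
    elif status == "M":
        operation = "modify"
    elif status == "D":
        operation = "delete-file"
    else:
        # Type changes, renames, etc. are not covered by this sketch.
        raise ValueError("unsupported change status: " + status)
    return operation, path


sample = ":000000 100644 0000000 1234567 A\tshared-dir/notes.txt"
print(classify_tree_change(sample))
```

Each resulting (operation, path) pair could then be checked against the branch rules in the same way ref updates are checked against the refs rules.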
You might be wondering how user identifiers work. Git supports GPG-signed push requests with git push --signed. We can use the GPG key ID as the user identifier, eliminating the need for centralized user accounts. Remember that the GPG key ID is based on the public key. Key pairs are randomly generated and it is improbable that the same key will be generated by two different users. That said, GPG key ID uniqueness has been weak in the past when the default size was 32 bits. Git explicitly enables long 64-bit GPG key IDs but I wonder if collisions could be a problem. Maybe an ID with more bits based on the public key should be used instead, but for now let's assume the GPG key ID is unique.
The downside of this approach is that user IDs are not human-friendly. Git apps can allow the user to assign aliases to avoid displaying raw user IDs. Doing this automatically requires either an external ID issuer, such as one that confirms email address ownership, which is tedious for new users, or a registry of usernames stored in the git repo, which means a first-come-first-served policy for username allocation and possible conflicts when merging from two repositories that don't share history. Due to these challenges I think it makes sense to use raw GPG key IDs at the data storage level and make them prettier at the user interface level.
The GPG key ID approach works well for desktop clients but not for web clients. The web application (even if implemented on the client side) would need access to the private key so it can push to the git repository. Users should not trust remotely hosted web applications with their private keys. Maybe there is a standard Web API that can help but I'm not aware of one. More thought is needed here.
The pre-receive git hook checks that signature verification passed and has access to the GPG key ID in the GIT_PUSH_CERT_KEY environment variable. Then the access control policy can be checked.
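A minimal Python sketch of that first step is shown below. The GIT_PUSH_CERT_STATUS, GIT_PUSH_CERT_NONCE_STATUS, and GIT_PUSH_CERT_KEY variables are the ones git-receive-pack exports to hooks for signed pushes. The function name and the choice to accept only the status "G" (good signature) together with a nonce status of "OK" are assumptions for this example; a real deployment would have to decide how to treat keys of unknown validity and whether nonces are configured at all.

```python
import os


def pushing_user_id(environ=os.environ, require_nonce=True):
    """Return the GPG key ID of a correctly signed push, or None."""
    # "G" means the push certificate carries a good signature.
    if environ.get("GIT_PUSH_CERT_STATUS") != "G":
        return None
    # The nonce protects against replaying an old push certificate.
    if require_nonce and environ.get("GIT_PUSH_CERT_NONCE_STATUS") != "OK":
        return None
    return environ.get("GIT_PUSH_CERT_KEY") or None
```

A hook would reject the push outright when this returns None and otherwise use the returned key ID as the value substituted for $user_id.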
Access control is the first and most fundamental git app. The access control policies that were described above are stored as files in the apps/access-control branch in the repository. Pushes to that branch are also subject to access control checks. Here is the branch's initial layout:
branches/ - access control policies for branches
  owner.conf
groups/ - group definitions (see below)
  ...
refs/ - access control policies for refs
  owner.conf
The default branches/owner.conf access control policy is as follows:
owner create-file,create-directory,modify,delete ^.*$
The default refs/owner.conf access control policy is as follows:
owner create-branch,create-tag,fast-forward,force,delete ^.*$
This gives the owner the ability to push refs and modify branches as they wish. The owner can grant other users access by pushing additional access control policy files or changing existing files on the apps/access-control branch.
Each access control policy file in refs/ or branches/ is processed in turn. If no access control rule matches the operation then the entire git push is rejected.
Groups can be defined to alias one or more user identifiers. This avoids duplicating access control rules when more than one user should have the same access. There are two automatic groups: owner contains just the user who owns the git repository and anyone is the group of all users.
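The group lookup and the all-or-nothing behaviour can be sketched in a few lines of Python. The on-disk format of the group definitions is not specified here, so this example simply assumes they have already been loaded into a dictionary mapping group names to sets of user IDs; rules are assumed to be (group, operations, pattern) tuples, and all function names are hypothetical.

```python
import re


def in_group(group, user_id, owner_id, groups):
    """Resolve the two automatic groups first, then the named groups."""
    if group == "anyone":
        return True
    if group == "owner":
        return user_id == owner_id
    return user_id in groups.get(group, ())


def push_allowed(operations, rules, user_id, owner_id, groups):
    """Accept only if every (operation, name) pair matches some rule."""
    for operation, name in operations:
        matched = any(
            in_group(group, user_id, owner_id, groups)
            and operation in allowed
            and re.search(pattern.replace("$user_id", re.escape(user_id)), name)
            for group, allowed, pattern in rules
        )
        if not matched:
            return False  # one unmatched operation rejects the entire push
    return True
```

Because a single unmatched operation rejects everything, a push that mixes permitted and forbidden changes never leaves the repository in a half-applied state.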
This completes the description of the access control app. Now let's look at how other functionality is built on top of this.
A merge requests app can be built on top of this model. The refs access control policy is as follows:
# The data branch contains the titles, comments, etc
anyone modify ^apps/merge-reqs/data$

# Each merge request revision is pushed as a tag in a per-user namespace
anyone create-tag ^apps/merge-reqs/$user_id/[0-9]+-v[0-9]+$
The branch access control policy is:
# Merge requests are per-user and numbered
anyone create-directory ^merge-reqs/$user_id/[0-9]+$

# Title string
anyone create-file,modify ^merge-reqs/$user_id/[0-9]+/title$

# Labels (open, needs-review, etc) work like this:
#
#   merge-reqs/<user-id>/<merge-req-num>/labels/
#     needs-review -> /labels/needs-review
#     ...
#   labels/
#     needs-review/
#       <user-id>/
#         <merge-req-num> -> /merge-reqs/<user-id>/<merge-req-num>
#         ...
#       ...
#     ...
#
# This directory and symlink layout makes it possible to enumerate labels for a
# given merge request and to enumerate merge requests for a given label.
#
# Both the merge request author and maintainers can add/remove labels to/from a
# merge request.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/labels$
anyone create-symlink,delete ^merge-reqs/$user_id/[0-9]+/labels/.*$
maintainers create-symlink,delete ^merge-reqs/[^/]+/[0-9]+/labels/.*$
maintainers create-directory ^labels/[^/]+$
anyone create-symlink,delete ^labels/[^/]+/$user_id/[0-9]+$
maintainers create-symlink,delete ^labels/[^/]+/[^/]+/[0-9]+$

# Comments are stored as individual files in per-user directories. Each file
# contains a timestamp and the contents of the comment. The timestamp can be
# used to sort comments chronologically.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments$
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments/$user_id$
anyone create-file,modify ^merge-reqs/[^/]+/[0-9]+/comments/$user_id/[0-9]+$
When a user creates a merge request they provide a title, an initial comment, apply labels, and push a v1 tag for review and merging. Other users can comment by adding files into the merge request's per-user comments directory. Labels can be added and removed by changing symlinks in the labels directories.
The user can publish a new revision of the merge request by pushing a v2 tag and adding a comment describing the changes. Once the maintainers are satisfied they merge the final revision tag into the relevant branch (e.g. "main") and relabel the merge request from open/needs-review to closed/merged.
This workflow can be implemented by a tool that performs the necessary git operations so users do not need to understand the git app's internal data layout. Users just need to interact with the tool that displays merge requests, allows commenting, provides searches, etc. A natural way to implement this tool is as a git alias so it integrates alongside git's built-in commands.
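To give an idea of what such a tool does under the hood, here is a Python sketch that writes the files and symlinks for a new merge request into a checked-out copy of the data branch. The directory layout and symlink targets follow the policy comments above. The function name, the comment file format (a timestamp line followed by the text), and numbering the first comment 0 are assumptions for this example. Committing the result and pushing it together with the v1 tag are left out.

```python
import os
import time
from pathlib import Path


def create_merge_request(worktree, user_id, number, title, comment, labels):
    """Write the on-disk layout for one merge request; returns its directory."""
    root = Path(worktree)
    merge_req = root / "merge-reqs" / user_id / str(number)
    (merge_req / "labels").mkdir(parents=True)
    (merge_req / "title").write_text(title + "\n")
    for label in labels:
        # merge-reqs/<user-id>/<num>/labels/<label> -> /labels/<label>
        os.symlink("/labels/" + label, merge_req / "labels" / label)
        # labels/<label>/<user-id>/<num> -> /merge-reqs/<user-id>/<num>
        # (per the policy, the labels/<label> directory itself is normally
        # created by a maintainer beforehand)
        reverse_dir = root / "labels" / label / user_id
        reverse_dir.mkdir(parents=True, exist_ok=True)
        os.symlink("/merge-reqs/%s/%d" % (user_id, number),
                   reverse_dir / str(number))
    comments = merge_req / "comments" / user_id
    comments.mkdir(parents=True)
    # Assumed format: first line is a UNIX timestamp, the rest is the text.
    (comments / "0").write_text("%d\n%s\n" % (int(time.time()), comment))
    return merge_req
```

Relabeling a merge request later amounts to removing and creating the corresponding pair of symlinks, which is why both the forward and the reverse link are written here.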
One issue with this approach is that it uses the file system as a database. Performance and scalability are likely to be worse than using a database or application-specific file format. However, the reason for this approach is that it allows the access control app to enforce a policy that ensures users cannot modify or delete other users' data without running application-specific code on the server and while keeping everything stored in a git repository.
An example where this approach performs poorly is for full-text search. The application would need to search all title and comment files for a string. There is no index for efficient lookups. However, if applications find that git-grep(1) does not perform well they can maintain their own index and cache files locally.
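A local index of this kind can be quite small. The Python sketch below builds an in-memory inverted index over the title and comment files of a checked-out data branch, using the merge-reqs/ layout described earlier. The function names and the simple lowercase word tokenization are assumptions for this example, and persisting the index to a cache file is omitted.

```python
import re
from collections import defaultdict
from pathlib import Path


def build_index(worktree):
    """Map each lowercase word to the set of files that contain it."""
    root = Path(worktree)
    index = defaultdict(set)
    paths = list(root.glob("merge-reqs/*/*/title"))
    paths += root.glob("merge-reqs/*/*/comments/*/*")
    for path in paths:
        if not path.is_file():
            continue
        text = path.read_text(errors="replace").lower()
        for word in re.findall(r"\w+", text):
            index[word].add(path.relative_to(root).as_posix())
    return index


def search(index, word):
    """Return the sorted list of files containing the word."""
    return sorted(index.get(word.lower(), ()))
```

Since every peer has the full data, each one can rebuild or update such an index after fetching, without any help from the server.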
I hope that this has shown how git apps can be built without application code running on the server.
Now that we have the merge requests app it's time to think how a continuous integration service could interface with it. The goal is to run tests on each revision of a merge request and report failures so the author of the merge request can rectify the situation.
A CI bot watches the repository for changes. In particular, it needs to watch for tags created with the ref name apps/merge-reqs/[^/]+/[0-9]+-v[0-9]+.
When a new tag is found the CI bot checks it out and runs tests. The results of the tests are posted as a comment by creating a file in merge-reqs/<user-id>/<merge-req-num>/comments/<ci-bot-user-id>/0 on the apps/merge-reqs/data branch. A ci-pass or ci-fail label can also be applied to the merge request so that the CI status can be easily queried by users and tools.
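The ref-matching part of such a bot could look like the following Python sketch. It uses the tag name pattern given above to extract the author, merge request number, and revision, and builds the path of the comment file the bot would write. The function names are made up for this example, and running the tests and pushing the comment are not shown.

```python
import re

# apps/merge-reqs/<user-id>/<merge-req-num>-v<revision>
TAG_RE = re.compile(r"^apps/merge-reqs/([^/]+)/([0-9]+)-v([0-9]+)$")


def parse_merge_req_tag(ref_name):
    """Return (user_id, number, revision) or None if the ref is not a merge request tag."""
    match = TAG_RE.match(ref_name)
    if match is None:
        return None
    user_id, number, revision = match.groups()
    return user_id, int(number), int(revision)


def ci_comment_path(author_id, number, ci_bot_id, comment_id=0):
    """Path of the CI bot's comment file on the apps/merge-reqs/data branch."""
    return "merge-reqs/%s/%d/comments/%s/%d" % (
        author_id, number, ci_bot_id, comment_id)
```

The bot's own key ID appears in the comment path, so the existing per-user comment rules already allow it to post results without any special privileges.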
There are many loose ends. How can non-git users participate on issue trackers and wikis? It might be possible to implement a full peer as a client-side web application using isomorphic-git, a JavaScript git implementation. As mentioned above, the GPG key ID approach is not very browser-friendly because it requires revealing the private key to the web page, and since keys are user identifiers, using temporary keys does not work well.
The data model does not allow efficient queries. A full copy of the data is necessary in order to query it. That's acceptable for local applications because they can maintain their own indexes and are expected to keep the data for a long period of time. It works less well for short-lived web page sessions like a casual user filing a new bug on the issue tracker.
The git push --signed technique is not the only option. Git also supports signed commits and signed tags. The difference between signed pushes and signed tags/commits is significant. The signed push approach only validates the access control policy when the repository is changed and leaves no audit log for future reference. The signed commit/tag approach keeps the signatures in the git history. Signed commits/tags can be propagated in a peer-to-peer network and each peer can validate the access control policy itself. While signed commits/tags apply the access control policy to each object in the repository, signed pushes apply the access control policy to each change made to the repository. The difference is that it's easy to rebase and include work from different authors with signed pushes. Signed commits/tags require re-signing for rebasing and each commit is validated against its signature, which may be different from the user who is making the push request.
There are a lot of interesting approaches and trade-offs to explore here. This model we've discussed fits closely with how I've seen developers use git in open source projects. It is designed around a "main" repository/server that contributors push their code to. But each clone of the repository has all the data and can be published as a new "main" repository, if necessary.
Although these ideas are unfinished I decided to write them up with the knowledge that I probably won't implement them myself. QEMU is moving to GitLab with a traditional centralized git forge. I don't think this is the right time to develop this idea and try to convince the QEMU community to use it. For projects that have fewer infrastructure requirements it would give their contributors more power than being confined to a centralized git forge.
I hope this was an interesting read for anyone thinking about git forges and building git apps.
After releasing RHEL 8.7 a week before, Red Hat now published RHEL 9.1, see the press release here! It ships, among others:
Note that RHEL 9.1 is NOT an EUS (Extended Update Support) release, so it will go out of support with the GA of RHEL 9.2. For details, please see the "Red Hat Enterprise Linux Life Cycle" here.
The new version of IBM Cloud Infrastructure Center is available and has several improvements for KVM on IBM zSystems:
At KVM Forum 2022 Kevin Wolf and Stefano Garzarella gave a talk on qemu-storage-daemon, a way to get QEMU's storage functionality without running a VM. It's great for accessing disk images, basically taking the older qemu-nbd to the next level. The cool thing is this makes QEMU's software-defined storage functionality - block devices with snapshots, incremental backup, image file formats, etc - available to other programs. Backup and forensics tools as well as other types of programs can take advantage of qemu-storage-daemon.
Here is the full article about Accessing QEMU storage features without a VM. Enjoy!
Our solution assurance team published a new paper, providing guidance together with hints and tricks and practical examples to help you configure and use the Red Hat Enterprise Linux High Availability Add-on (Red Hat HA).
You can access the paper here.