Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools


Planet Feeds

March 25, 2020

Marcin Juszkiewicz

Sharing PCIe cards across architectures

A few days ago, during a conference call, one of my co-workers asked:

Has anyone ever tried PCI forwarding to an ARM VM on an x86 box?

As my machine was already open, I just turned it off and inserted a SATA controller into one of the unused PCI Express slots. After boot I started one of my AArch64 CirrOS VM instances and gave it this card. It worked perfectly:

[   21.603194] pcieport 0000:00:01.0: pciehp: Slot(0): Attention button pressed
[   21.603849] pcieport 0000:00:01.0: pciehp: Slot(0) Powering on due to button press
[   21.604124] pcieport 0000:00:01.0: pciehp: Slot(0): Card present
[   21.604156] pcieport 0000:00:01.0: pciehp: Slot(0): Link Up
[   21.739977] pci 0000:01:00.0: [1b21:0612] type 00 class 0x010601
[   21.740159] pci 0000:01:00.0: reg 0x10: [io  0x0000-0x0007]
[   21.740199] pci 0000:01:00.0: reg 0x14: [io  0x0000-0x0003]
[   21.740235] pci 0000:01:00.0: reg 0x18: [io  0x0000-0x0007]
[   21.740271] pci 0000:01:00.0: reg 0x1c: [io  0x0000-0x0003]
[   21.740306] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x001f]
[   21.740416] pci 0000:01:00.0: reg 0x24: [mem 0x00000000-0x000001ff]
[   21.742660] pci 0000:01:00.0: BAR 5: assigned [mem 0x10000000-0x100001ff]
[   21.742709] pci 0000:01:00.0: BAR 4: assigned [io  0x1000-0x101f]
[   21.742770] pci 0000:01:00.0: BAR 0: assigned [io  0x1020-0x1027]
[   21.742803] pci 0000:01:00.0: BAR 2: assigned [io  0x1028-0x102f]
[   21.742834] pci 0000:01:00.0: BAR 1: assigned [io  0x1030-0x1033]
[   21.742866] pci 0000:01:00.0: BAR 3: assigned [io  0x1034-0x1037]
[   21.742935] pcieport 0000:00:01.0: PCI bridge to [bus 01]
[   21.742961] pcieport 0000:00:01.0:   bridge window [io  0x1000-0x1fff]
[   21.744805] pcieport 0000:00:01.0:   bridge window [mem 0x10000000-0x101fffff]
[   21.745749] pcieport 0000:00:01.0:   bridge window [mem 0x8000000000-0x80001fffff 64bit pref]

Let’s go deeper

The next day I turned off my desktop for a CPU cooler upgrade. During the process I went through my box of expansion cards and plugged in an additional USB 3.0 controller (Renesas based). I also added a SATA hard drive and connected it to the previously added controller.

Once the computer was back online I created a new VM instance. This time I used Fedora 32 beta. But when I tried to add a PCI Express card I got an error:

Error while starting domain: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22
2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument)

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/", line 75, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/", line 111, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/", line 66, in newfn
    ret = fn(self, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/", line 1279, in startup
  File "/usr/lib64/python3.8/site-packages/", line 1234, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirt.libvirtError: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22
2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument)

Hmm. It worked before. I tried the other card, with the same effect.


I went to the #qemu IRC channel and started discussing the issue with QEMU developers. It turned out that probably no one had tried sharing expansion cards with a foreign-architecture guest (in TCG mode instead of same-architecture KVM mode).

As I had a VM instance where sharing a card worked, I started checking what was wrong. After some restarts it was clear that crossing 3054 MB of guest memory was enough to get VFIO errors like the ones above.


An issue that is not reported does not exist. So I opened a bug against QEMU and filled it with error messages, “lspci” output for the cards used, the QEMU command line (generated by libvirt) etc.

It looks like the problem lies in architecture differences between x86-64 (host) and aarch64 (guest). Let me quote Alex Williamson:

The issue is that the device needs to be able to DMA into guest RAM, and to do that transparently (ie. the guest doesn’t know it’s being virtualized), we need to map GPAs into the host IOMMU such that the guest interacts with the device in terms of GPAs, the host IOMMU translates that to HPAs. Thus the IOMMU needs to support GPA range of the guest as IOVA. However, there are ranges of IOVA space that the host IOMMU cannot map, for example the MSI range here is handled by the interrupt remapper, not the DMA translation portion of the IOMMU (on physical ARM systems these are one and the same, on x86 they are different components, using different mapping interfaces of the IOMMU). Therefore if the guest programmed the device to perform a DMA to 0xfee00000, the host IOMMU would see that as an MSI, not a DMA. When we do an x86 VM on an x86 host, both the host and the guest have complementary reserved regions, which avoids this issue.

Also, to expand on what I mentioned on IRC, every x86 host is going to have some reserved range below 4G for this purpose, but if the aarch64 VM has no requirements for memory below 4G, the starting GPA for the VM could be at or above 4G and avoid this issue.
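
The 3054 MB threshold seems to fit this explanation. Here is a small arithmetic sketch. It assumes that guest RAM starts at 0x40000000 (the IOVA shown in the vfio_dma_map() call above) and that the x86 MSI window is 0xfee00000-0xfeefffff; the upper bound is my assumption, as the quote only names 0xfee00000. The function names are made up for illustration.

```python
MIB = 1 << 20

# Assumption: start of guest RAM, taken from the IOVA in the error message.
GUEST_RAM_BASE = 0x40000000
# Assumption: x86 MSI window; only its start is named in the quote above.
MSI_START = 0xFEE00000
MSI_END = 0xFEF00000  # exclusive


def ram_overlaps_msi(ram_mib):
    """Return True if guest RAM of the given size reaches into the MSI window."""
    ram_end = GUEST_RAM_BASE + ram_mib * MIB  # exclusive end of guest RAM
    return GUEST_RAM_BASE < MSI_END and ram_end > MSI_START


def max_safe_ram_mib():
    """Largest RAM size (in MiB) that still ends at or below the MSI window."""
    return (MSI_START - GUEST_RAM_BASE) // MIB


print(max_safe_ram_mib())       # 3054
print(ram_overlaps_msi(3054))   # False
print(ram_overlaps_msi(4096))   # True, the 0x100000000 size from the error
```

Under these assumptions a guest with exactly 3054 MB ends right at 0xfee00000, and one more megabyte crosses into the reserved range.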

I have to admit that this is too low-level for me. I hope that the problem I hit will help someone to improve QEMU.

by Marcin Juszkiewicz at March 25, 2020 05:23 PM

March 04, 2020

Marcin Juszkiewicz

CirrOS 0.5.0 released

Someone may say that I am the main reason why the CirrOS project does releases.

In 2016 I got a task at Linaro to get it running on AArch64. More details are in my blog post ‘my work on changing CirrOS images’. The result was the 0.4.0 release.

Last year I got another task at Linaro. So we released version 0.5.0 today.

But that’s not how it happened.

Multiple contributors

Since the 0.4.0 release there have been changes from several developers.

Robin H. Johnson took care of kernel modules: added new ones, updated names. He also added several new features.

Murilo Opsfelder Araujo fixed the build on Ubuntu 16.04.3, as gcc changed its preprocessor output.

Jens Harbott took care of the lack of space for data read from config-drive.

Paul Martin upgraded the CirrOS build system to BuildRoot 2019.02.1 and bumped kernel/grub versions.

Maciej Józefczyk took care of metadata requests.

Marcin Sobczyk fixed the starting of Dropbear and dropped creation of the DSS ssh key, which was no longer supported.

My Linaro work

At Linaro I got a Jira card titled “Upgrade CirrOS’ kernel to Ubuntu 18.04’s kernel”.

This was needed as the 4.4 kernel was far too old and gave us several booting issues. Internally we had builds with the 4.15 kernel, but it had to be done properly and upstream.

So I fetched the code, did some test builds and started looking at how to improve the situation. I spoke with Scott Moser (owner of the CirrOS project) and he told me about his plans to migrate from Launchpad to GitHub. So we did that in December 2019 and then the fun started.

Continuous Integration

GitHub has several ways of adding CI to projects. First we tried GitHub Actions, but it turned out to be a paid service. I looked around and then decided to go with Travis CI.

Scott generated all the required keys and integration started. Soon we had every pull request going through CI. Then I added a simple script (bin/test-boot) so each image was booted after build. Scott improved the script and fixed a Power boot issue.

The next step was caching downloads and ccache files. This was a huge improvement!

In the meantime Travis bumped its free service to 5 simultaneous builders, which made our builds even faster.

CirrOS supports building only under Ubuntu LTS. But I use Fedora, so we merged two changes to make sure that the proper ‘grub(2)-mkimage’ command is used.

Kernel changes

The 4.4 kernel had to go. The first idea was to move to 4.18 from the Ubuntu 18.04 release. But if we upgrade, then why not go for the HWE one? I checked the 5.0 and 5.3 versions. As both worked fine we decided to go with the newer one.

Modules changes

During the start of a CirrOS image several kernel modules are loaded. But there were several “no kernel module found”-like messages for built-in ones.

We took care of it by querying the /sys/module/ directory, so now module loading is a quiet process. At the end a list of loaded ones is printed.
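
The idea can be sketched roughly like this. It is not the CirrOS code (which is a shell script) but a minimal Python illustration with made-up function names. It assumes that a module which is already loaded or built in shows up as a directory under /sys/module/, with dashes in the name replaced by underscores; not every built-in module is guaranteed to appear there.

```python
import os


def module_present(name, sys_module_dir="/sys/module"):
    """Check whether a module is already known to the running kernel.

    The kernel lists modules with '-' replaced by '_', so 'virtio-rng'
    is looked up as 'virtio_rng'.
    """
    entry = os.path.join(sys_module_dir, name.replace("-", "_"))
    return os.path.isdir(entry)


def modules_to_load(wanted, sys_module_dir="/sys/module"):
    """Return only the modules that still need to be loaded."""
    return [m for m in wanted if not module_present(m, sys_module_dir)]
```

Calling modules_to_load(['virtio-rng', 'virtio-gpu']) then returns only the names worth passing to modprobe, so nothing is attempted (and nothing is printed) for modules that are already there.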

VirtIO changes

A lot of things have happened since the 4.4 kernel. So we added several VirtIO modules.

One of the results is a working graphical console on AArch64, thanks to ‘virtio-gpu’ providing the framebuffer and ‘hid-generic’ handling USB input devices.

As lack of entropy is a common issue in VM instances, we added the ‘virtio-rng’ module. No more ‘uninitialized urandom read’ messages from the kernel.

Final words

Yesterday Scott created the 0.5.0 tag and CI built all release images. Then I wrote release notes (based on the ones from pre-releases). The Kolla project got a patch to move to the new version.

When is the next release? Looking at history someone may say 2023, as the previous one was in 2016. But who knows. Maybe we will get someone with a “please add s390x support” question ;D

by Marcin Juszkiewicz at March 04, 2020 10:53 AM

February 26, 2020

QEMU project

Announcing Google Summer of Code 2020 internships

QEMU is participating in Google Summer of Code 2020 again this year! Google Summer of Code (GSoC) is an internship program that brings students into open source development. GSoC offers full-time remote work opportunities for talented new developers wishing to get involved in our community.

Each intern works with one or more mentors who support them in their project. Code is submitted through QEMU’s open source development process, giving the intern experience in open source software development.

If you are interested in contributing to QEMU through a paid 12-week internship from May to August, take a look at our project ideas for more information. Applications are open March 16-31, 2020.

Please review the eligibility criteria before applying.

QEMU is also participating in the Outreachy internship program, so be sure to check that out as well!

February 26, 2020 07:00 AM

February 25, 2020

Stefano Garzarella

AF_VSOCK: nested VMs and loopback support available

During the last KVM Forum 2019, we discussed some next steps and several requests came from the audience.

In the last months, we worked on that and recent Linux releases contain interesting new features that we will describe in this blog post: nested VMs and local communication.

DevConf.CZ 2020

These updates and an introduction to AF_VSOCK were presented at DevConf.CZ 2020 during the “VSOCK: VM↔host socket with minimal configuration” talk. Slides available here.

Nested VMs

Before Linux 5.5, the AF_VSOCK core supported only one transport loaded at run time. That was a limit for nested VMs, because we need multiple transports loaded together.

Types of transport

Under the AF_VSOCK core, that provides the socket interface to the user space applications, we have several transports that implement the communication channel between guest and host.


vsock transports

These transports depend on the hypervisor and we can put them in two groups:

  • H2G (host to guest) transports: they run in the host and usually they provide the device emulation; currently we have vhost and vmci transports.
  • G2H (guest to host) transports: they run in the guest and usually they are device drivers; currently we have virtio, vmci, and hyperv transports.


In a nested VM environment, we need to load both G2H and H2G transports together in the L1 guest, for this reason, we implemented the multi-transports support to use vsock through nested VMs.


vsock and nested VMs

Starting from Linux 5.5, the AF_VSOCK can handle two types of transports loaded together at runtime:

  • H2G transport, to communicate with the guest
  • G2H transport, to communicate with the host.

So in the QEMU/KVM environment, the L1 guest will load both virtio-transport, to communicate with L0, and vhost-transport to communicate with L2.

Local Communication

Another feature recently added is the possibility to communicate locally on the same host. This feature, suggested by Richard WM Jones, can be very useful for testing and debugging applications that use AF_VSOCK without running VMs.

Linux 5.6 introduces a new transport called vsock-loopback, and a new well-known CID for local communication: VMADDR_CID_LOCAL (1). It’s a special CID to direct packets to the same host that generated them.


vsock loopback

Other CIDs can be used for the same purpose, but it’s preferable to use VMADDR_CID_LOCAL:

  • Local Guest CID
    • if G2H is loaded (e.g. running in a VM)
  • VMADDR_CID_HOST (2)
    • if H2G is loaded and G2H is not loaded (e.g. running on L0). If G2H is also loaded, then VMADDR_CID_HOST is used to reach the host

Richard recently used the vsock local communication to implement a regression test for nbdkit/libnbd vsock support, using the new VMADDR_CID_LOCAL.


# Listening on port 1234 using ncat(1)
l0$ nc --vsock -l 1234

# Connecting to the local host using VMADDR_CID_LOCAL (1)
l0$ nc --vsock 1 1234
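
The same round trip can be sketched in Python, which exposes AF_VSOCK through the standard socket module on Linux. This is only a minimal sketch: it assumes a kernel with the vsock-loopback transport available (Linux 5.6 or newer), the port number is arbitrary, and the function names are invented for this example. The CID is spelled out as a literal because not every Python version may define a constant for it.

```python
import socket
import threading

# Value of VMADDR_CID_LOCAL as given in the text above.
VMADDR_CID_LOCAL = 1


def serve_once(port, ready):
    """Accept one AF_VSOCK connection and echo back what it receives."""
    with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as srv:
        srv.bind((socket.VMADDR_CID_ANY, port))
        srv.listen(1)
        ready.set()  # tell the client that the listener is up
        conn, _peer = srv.accept()
        with conn:
            conn.sendall(conn.recv(1024))


def loopback_roundtrip(port=1234, payload=b"ping"):
    """Send payload to ourselves through VMADDR_CID_LOCAL and return the echo."""
    ready = threading.Event()
    server = threading.Thread(target=serve_once, args=(port, ready), daemon=True)
    server.start()
    ready.wait(timeout=5)
    with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as cli:
        cli.connect((VMADDR_CID_LOCAL, port))
        cli.sendall(payload)
        return cli.recv(1024)
```

On a host where the loopback transport is loaded, loopback_roundtrip() should return the payload unchanged; without it, the connect() call is expected to fail with an OSError.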


by Stefano Garzarella at February 25, 2020 07:30 PM

February 15, 2020

Stefan Hajnoczi

An introduction to GDB scripting in Python

Sometimes it's not humanly possible to inspect or modify data structures manually in a debugger because they are too large or complex to navigate. Think of a linked list with hundreds of elements, one of which you need to locate. Finding the needle in the haystack is only possible by scripting the debugger to automate repetitive steps.

This article gives an overview of the GNU Debugger's Python scripting support so that you can tackle debugging tasks that are not possible manually.

What scripting GDB in Python can do

GDB can load Python scripts to automate debugging tasks and to extend debugger functionality. I will focus mostly on automating debugging tasks, although extending the debugger is very powerful and rarely used.

Say you want to search a linked list for a particular node:

(gdb) p
(gdb) p
(gdb) p

Doing this manually can be impossible for lists with too many elements. GDB scripting support allows this task to be automated by writing a script that executes debugger commands and interprets the results.

Loading Python scripts

The source GDB command executes files ending with the .py extension in a Python interpreter. The interpreter has access to the gdb Python module that exposes debugging APIs so your script can control GDB.

$ cat
print('Hi from Python, this is GDB {}'.format(gdb.VERSION))
$ gdb
(gdb) source
Hi from Python, this is GDB Fedora

Notice that the gdb module is already imported. See the GDB Python API documentation for full details of this module.

It's also possible to run ad-hoc Python commands from the GDB prompt:

(gdb) py print('Hi')

Executing commands

GDB commands are executed using gdb.execute(command, from_tty, to_string). For example, gdb.execute('step') runs the step command. Output can be collected as a Python string by setting to_string to True. By default output goes to the interactive GDB session.

Although gdb.execute() is fundamental to GDB scripting, at best it allows screen-scraping (interpreting the output string) rather than a Pythonic way of controlling GDB. There is actually a full Python API that represents the debugged program's types and values in Python. Most scripts will use this API instead of simply executing GDB commands as if simulating an interactive shell session.
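
To make the screen-scraping limitation concrete, here is a minimal sketch that runs the print command through gdb.execute() and picks the value out of the returned string. The helper names are made up for this example, and the parser assumes the usual "$N = value" output format of print. The gdb module is imported inside the function because it only exists when the script runs inside GDB.

```python
import re


def parse_print_output(text):
    """Extract the value part from '$N = value' output of GDB's print command."""
    match = re.match(r'\$\d+ = (.*)', text.strip(), re.DOTALL)
    return match.group(1) if match else None


def print_expr(expression):
    """Run 'print <expression>' in GDB and return the value as a string."""
    import gdb  # only available inside a GDB session

    output = gdb.execute('print {}'.format(expression),
                         from_tty=False, to_string=True)
    return parse_print_output(output)
```

Everything comes back as text, so the script has to guess at formats and convert types by hand. That is the main reason to prefer the value API described next.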

Navigating program variables

The entry point to navigating program variables is gdb.parse_and_eval(expression). It returns a gdb.Value.

When a gdb.Value is a struct its fields can be indexed using value['field1']['child_field1'] syntax. The following example iterates a linked list:

elem = gdb.parse_and_eval('block_backends.tqh_first')
while elem:
    name = elem['name'].string()
    if name == 'drive2':
        print('Found {}'.format(elem['dev']))
    elem = elem['link']['tqe_next']

This script iterates the block_backends linked list and checks the name field of each element against "drive2". When it finds "drive2" it prints the dev field of that element.
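
The traversal can also be wrapped in a small generator so that the search logic stays separate from the list walking. This is a sketch and not part of GDB's API: tailq_foreach is a name invented here, and the link field name is a parameter because it differs between data structures. It relies only on indexing and truth testing, so it works with gdb.Value objects as well as with plain dictionaries outside GDB.

```python
def tailq_foreach(first, link_field='link'):
    """Yield each element of a tail queue, starting from its first element.

    'first' is expected to behave like a gdb.Value pointer: a NULL pointer
    is falsy and fields are accessed with elem['field'].
    """
    elem = first
    while elem:
        yield elem
        elem = elem[link_field]['tqe_next']


def find_by_name(first, wanted, link_field='link'):
    """Return the first element whose 'name' field equals wanted, else None."""
    for elem in tailq_foreach(first, link_field):
        name = elem['name']
        # gdb.Value strings need .string(); plain Python strings do not.
        if hasattr(name, 'string'):
            name = name.string()
        if name == wanted:
            return elem
    return None
```

Inside GDB this would be used as find_by_name(gdb.parse_and_eval('block_backends.tqh_first'), 'drive2').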

There is a lot more that GDB Python scripts can do but you'll have to check out the API documentation to learn about it.


Python scripts can automate tedious debugging tasks in GDB. Having the full power of Python and access to file I/O, HTTP requests, etc means pretty much any debugging task can be turned into a full-blown program. A subset of this was possible in the past through GDB command scripts, but Python is a much more flexible programming language and familiar to many developers (more so than GDB's own looping and logic commands!).

by Unknown at February 15, 2020 04:16 PM

February 10, 2020

Stefan Hajnoczi

Video for "virtio-fs: a shared file system for virtual machines" at FOSDEM '20 now available

The video and slides from my virtio-fs talk at FOSDEM '20 are now available!

virtio-fs is a shared file system that lets guests access a directory on the host. It can be used for many things, including secure containers, booting from a root directory, and testing code inside a guest.

The talk explains how virtio-fs works, including the Linux FUSE protocol that it's based on and how FUSE concepts are mapped to VIRTIO.

virtio-fs guest drivers have been available since Linux v5.4 and QEMU support will be available from QEMU v5.0 onwards.

Video (webm) (mp4)

Slides (PDF)

by Unknown at February 10, 2020 08:38 AM

February 09, 2020

Stefan Hajnoczi

Why CPU Utilization Metrics are Confusing

How much CPU is being used? Intuitively we would like to know the percentage of time being consumed. Popular utilities like top(1) and virt-top(1) do show percentages but the numbers can be weird. This post goes into how CPU utilization is accounted and why the numbers can be confusing.

Tools sometimes show CPU utilizations above 100%. Or we know a virtual machine is consuming all its CPU but only 12% CPU utilization is reported. Comparing CPU utilization metrics from different tools often reveals that the numbers they report are wildly different. What's going on?

How CPU Utilization is Measured

Imagine we want to measure the CPU utilization of an application on a simple computer with one CPU. Each time the application is scheduled on the CPU we record the time until it is next descheduled. The utilization is calculated by dividing the total CPU time that the application ran by the time interval being measured:

utilization = \frac{1}{T} \sum_{i=1}^{n} t_i

Here t_i is the execution time for each of the n times the application was scheduled and T is the time unit being measured (e.g. 1 second).

So far, so good. This is how CPU utilization times should work. Now let's look at why the percentages can be confusing.

CPU Utilization on Multi-Processor Systems

Modern computers from mobile phones to laptops to servers typically have multiple logical CPUs. They are called logical CPUs because they appear as a CPU to software regardless of whether they are implemented as a socket, a core, or an SMT hardware thread.

On multi-processor systems we need to adapt the CPU utilization formula to account for CPUs running in parallel. There are two ways to do this:

  1. Treat 100% as full utilization of all CPUs. top(1) calls this Solaris mode.
  2. Treat 100% as full utilization of one CPU. top(1) calls this Irix mode.

By default top(1) reports CPU utilization in Irix mode and virt-top(1) reports Solaris mode.

The implications of Solaris mode are that a single CPU being fully utilized is only reported as 1/N CPU utilization where N is the number of CPUs. On a system with a large number of CPUs the utilization percentages can be very low even though some CPUs are fully utilized. Even on my laptop with 4 logical CPUs that means a single-threaded application consuming a full CPU only reports 25% CPU utilization.

Irix mode produces more intuitive 0-100% numbers for single-threaded applications but multi-threaded applications may consume multiple CPUs and therefore exceed 100%, which looks a bit funny.
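
The difference between the two modes is easiest to see with numbers. The following is a minimal sketch of the two accounting methods as described above; the function name and parameters are chosen for this example and do not come from top(1) or virt-top(1).

```python
def utilization(run_times, interval, ncpus=1, mode='irix'):
    """Return CPU utilization in percent.

    run_times -- CPU time consumed in each scheduling slice (seconds),
                 summed over all CPUs the application ran on
    interval  -- length of the measurement interval (seconds)
    ncpus     -- number of logical CPUs in the system
    mode      -- 'irix': 100% means one fully used CPU
                 'solaris': 100% means all CPUs fully used
    """
    percent = 100.0 * sum(run_times) / interval
    if mode == 'solaris':
        percent /= ncpus
    elif mode != 'irix':
        raise ValueError('unknown mode: {}'.format(mode))
    return percent


# A single-threaded task that keeps one of 4 CPUs busy for a full second:
print(utilization([0.25, 0.75], 1.0, ncpus=4, mode='irix'))     # 100.0
print(utilization([0.25, 0.75], 1.0, ncpus=4, mode='solaris'))  # 25.0
# Two busy threads on the same machine exceed 100% in Irix mode:
print(utilization([1.0, 1.0], 1.0, ncpus=4, mode='irix'))       # 200.0
```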


Since there are two ways of accounting CPU utilization on multi-processor systems it is always necessary to know which method is being used. A percentage on its own is meaningless and might be misinterpreted.

This also explains why numbers reported by different tools can be so vastly different. It is necessary to check which accounting method is being used by both tools.

Documentation (and source code) often sheds light on which accounting method is used, but another way to check is by running a process that consumes a full CPU and then observing the CPU utilization that is reported. This can be done by running while true; do true; done in a shell and checking the CPU utilization numbers that are reported.

virt-top(1) has another peculiarity that must be taken into account. Its formula divides CPU time consumed by a guest by the total CPU time available on the host. If the guest has 4 vCPUs but the host has 8 physical CPUs, then the guest can only ever reach 50% because it will never use all physical CPUs at once.
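
This ceiling follows directly from the formula. A minimal sketch, assuming each vCPU can keep at most one host CPU busy at a time (the function name is invented for this example):

```python
def max_guest_utilization(vcpus, host_cpus):
    """Highest percentage a guest can reach when its CPU time is divided
    by the total CPU time available on the host."""
    busy_cpus = min(vcpus, host_cpus)  # a vCPU cannot use more than one CPU
    return 100.0 * busy_cpus / host_cpus


print(max_guest_utilization(4, 8))  # 50.0
```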


CPU utilization can be confusing on multi-processor systems, which is most computers today. Interpreting CPU utilization metrics requires knowing whether Solaris mode or Irix mode was used for calculation. Be careful with CPU utilization metrics!

by Unknown at February 09, 2020 06:41 PM

February 07, 2020

Stefan Hajnoczi

Apply for QEMU Outreachy 2020 May-August internships now!

QEMU is participating in the Outreachy open source internship program again this year. Check out the QEMU blog for more information about this 12-week full-time, paid, remote work internship working on QEMU.

by Unknown at February 07, 2020 04:58 PM

February 06, 2020

QEMU project

Announcing Outreachy 2020 internships

QEMU is participating in Outreachy again this year! Outreachy is an open source internship program for anyone who faces under-representation, systemic bias, or discrimination in the technology industry. Outreachy offers full-time remote work opportunities for talented new developers wishing to get involved in our community.

Each intern works with one or more mentors who support them in their project. Code is submitted through QEMU’s open source development process, giving the intern experience in open source software development.

If you are interested in contributing to QEMU through a paid 12-week internship from May to August, take a look at our project ideas for more information.

Please review the eligibility criteria before applying.

These internships are generously funded by Red Hat.

February 06, 2020 07:00 AM

February 05, 2020

Daniel Berrange

libvirt: an “embedded” QEMU driver mode for isolated usage

Since the project’s creation about 14 years ago, libvirt has grown enormously. In that time there has been a lot of code refactoring, but these were always fairly evolutionary changes; there has been little revolutionary change of the overall system architecture or some core technical decisions made early on. This blog post is one of a series examining recent technical decisions that can be considered more revolutionary to libvirt. This was the topic of a talk given at KVM Forum 2019 in Lyon.

Historical driver architecture

Historically the local stateful drivers in libvirt have supported one or two modes of access:

  • “system mode” – privileged libvirtd running as root, global per host
  • “session mode” – unprivileged libvirtd, isolated to individual non-root users

Within context of each daemon, VM name uniqueness is enforced. Operating via the daemon means that all applications connected to that same libvirtd get the same world view. This single world view is exactly what you want when dealing with server / cloud / desktop virtualization, because it means tools like ‘virt-top‘, ‘virt-viewer’, ‘virsh‘ can see the same VMs as virt-manager / oVirt / OpenStack / GNOME Boxes / etc.

There are other use cases for virtualization, however, where this single world view across applications may be much less desirable. Instead of spawning VMs for the purpose of running a full guest operating system, the VM is used as a building block for an application specific use case. I describe these use cases as “embedded virtualization”, with the libguestfs project being a well known long standing example. This uses a VM as a way to confine execution of its appliance, allowing safe manipulation of disk images. The libvirt-sandbox project is another example which provides a way to take binaries installed on the host OS and directly execute them inside a virtual machine, using 9p filesystem passthrough. More recently the Kata project aims to provide a docker compatible container runtime built using KVM.

In many, but not necessarily all, of these applications, it is unhelpful for the KVM instances that are launched to become visible to other applications like virt-manager / OpenStack. For example if Nova sees a libguestfs VM running in libvirt it won’t be able to correlate this VM with its own world view. There have been cases where a mgmt app would try to destroy these externally launched VMs in order to reconcile its world view.

There are other practicalities to consider when using a shared daemon like libvirtd. Each application has to ensure it creates a sensible unique name for each virtual machine, that won’t clash with names picked by other applications. Then there is the question of cleaning up resources such as log files left over from short lived VMs.

When spawning KVM via a separate daemon, the QEMU process is daemonized, such that it is disassociated from both libvirtd and the application which spawned it. It will only be cleaned up by an explicit API call to destroy it, or by the guest application shutting it down. For embedded use cases, it would be helpful if the VM would automatically die when the application which launched it dies. Libvirt introduces a notion of “auto destroy” to associate the lifetime of a VM with the client socket connection. It would be simpler if the VM process were simply in the same process group as the application, allowing normal OS level process tree pruning. The disassociated process context means that the QEMU process also loses the cgroup & namespace placement of the application using it.

An initial embedded libvirt driver

A possible answer to all these problems is to introduce the notion of an “embedded mode” for libvirt drivers. When using a libvirt driver in this mode, there is no libvirtd daemon involved, instead the libvirt driver code is loaded into the application process itself. In embedded mode the libvirt driver is operating against a custom directory prefix for reading and writing config / state files. The directory is private to each application which has an instance of the embedded driver open. Since the libvirt driver is directly loaded into the application, there is no RPC service exposed and thus there is no way to use virsh and other tools to access the driver. This is important to remember because it means there is no way to debug problems with embedded VMs using normal libvirt tools. For some applications this is acceptable as the VMs are short-lived & throw away, but for others this restriction might be unacceptable.

At the time of writing this post, support for embedded QEMU driver connections has merged to GIT master, and will be released in 6.1.0. In order to enable use of encrypted disks, there is also support for an embedded secret driver. The embedded driver feature is considered experimental initially, and so contrary to normal libvirt practice we’re not providing a strong upgrade compatibility guarantee. The API and XML formats won’t change, but the behavior of the embedded driver may still change.

Along with the embedded driver mode, is a new command line tool called virt-qemu-run. This is a simple tool using the embedded QEMU driver to run a single QEMU virtual machine, automatically exiting when QEMU exits, or tearing down QEMU if the tool exits abnormally. This can be used directly by users for self contained virtual machines, but it also serves as an example of how to use the embedded driver and has been important for measuring startup performance. This tool is also considered experimental and so its CLI syntax is subject to change in future.
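
As a rough illustration of what an application using the embedded driver might look like, here is a sketch using the libvirt Python bindings. Several things in it are assumptions rather than statements from this post: the connection URI form qemu:///embed?root=DIR, the need to register and run libvirt's default event loop in the application, and the helper names embed_uri and run_embedded, which are invented for this example. Since the feature is experimental, the details may differ in a released version.

```python
from urllib.parse import quote


def embed_uri(root_dir):
    """Build a connection URI for the embedded QEMU driver.

    Assumption: the driver is selected with 'qemu:///embed' and takes the
    private state directory as the 'root' query parameter.
    """
    return 'qemu:///embed?root=' + quote(root_dir)


def run_embedded(root_dir, domain_xml):
    """Start a transient guest with the embedded driver and wait for it to stop."""
    import libvirt  # third-party bindings, imported lazily

    # Assumption: with no libvirtd involved, the application itself has to
    # provide the event loop.
    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open(embed_uri(root_dir))
    try:
        dom = conn.createXML(domain_xml, 0)
        try:
            while dom.isActive():
                libvirt.virEventRunDefaultImpl()
        except libvirt.libvirtError:
            pass  # a transient domain disappears once QEMU has exited
    finally:
        conn.close()
```

This is roughly the job that virt-qemu-run does for a single guest, without the error handling and cleanup a real tool needs.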

In general the embedded mode drivers should offer the same range of functionality as the main system or session modes in libvirtd. To learn more about their usage and configuration, consult the three pages linked in the above paragraphs.

Further development work

During development of the embedded driver one of the problems that quickly became apparent was the time required to launch a virtual machine. When libvirtd starts up one of the things it does is to probe all installed QEMU binaries to determine what features they support. This can take 300-500 milliseconds per binary which doesn’t sound like much, but if you have all 30 QEMU binaries installed this is 10-15 seconds. The results of this probing are cached, avoiding repeated performance hits until something changes which would invalidate the information. The caching doesn’t help the embedded driver case though, because it is using a private directory tree for state and thus doesn’t see the cache from the system / session mode drivers. To deal with this problem the QEMU driver startup process was significantly refactored such that probing of QEMU binaries is delayed until the data is actually needed. This massively helps both the new embedded mode and existing system/session modes.
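
The effect of delaying the probe can be illustrated with a small sketch. This is not libvirt's implementation (which is written in C); the class and its methods are hypothetical and only show the general pattern of probing on first use and caching the result.

```python
class LazyCapsCache:
    """Probe each QEMU binary only when its capabilities are first requested."""

    def __init__(self, binaries, probe):
        self._binaries = list(binaries)
        self._probe = probe      # expensive function: binary path -> capabilities
        self._cache = {}

    def get(self, binary):
        """Return capabilities for one binary, probing it at most once."""
        if binary not in self._cache:
            self._cache[binary] = self._probe(binary)
        return self._cache[binary]

    def get_all(self):
        """Return capabilities for every binary; this triggers all pending probes."""
        return {binary: self.get(binary) for binary in self._binaries}
```

Starting a single guest then costs one probe via get(), while a request for the full host capabilities corresponds to get_all() and still pays for every binary.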

Unfortunately it is fairly common for applications to query the libvirt host capabilities and the returned data is required to report on all QEMU binaries, thus triggering the slow probing operation. There is a new API which allows probing of a single QEMU binary which applications are increasingly using, but there are still valid use cases for the general host capabilities information. To address the inherent design limitations of the current API, one or more replacements are required to allow more targeted information reporting to avoid the mass QEMU probe.

Attention will then need to switch to optimizing the startup procedure for spawning QEMU. There is one key point where libvirt uses QMP to ask the just launched QEMU what CPU features it has exposed to the guest OS. This results in a huge number of QMP calls, one for each CPU feature. This needs to be optimized, ideally down to 1 single QMP call, which might require QEMU enhancements to enable libvirt to get the required information more efficiently.

One of the goals of the embedded driver is to have the QEMU process inherit the application’s process context (cgroups, namespaces, CPU affinity, etc) by default and keep QEMU as a child of the application process. This does not currently happen as the embedded driver is re-using the existing startup code which moves QEMU into dedicated cgroups and explicitly resets CPU affinity, as well as daemonizing QEMU. The need to address these problems is one of the reasons the embedded mode is marked experimental with behaviour subject to change.

by Daniel Berrange at February 05, 2020 11:18 AM

February 04, 2020

Daniel Berrange

libvirt: split of the monolithic libvirtd daemon

Since the project’s creation about 14 years ago, libvirt has grown enormously. In that time there has been a lot of code refactoring, but these were always fairly evolutionary changes; there has been little revolutionary change of the overall system architecture or some core technical decisions made early on. This blog post is one of a series examining recent technical decisions that can be considered more revolutionary to libvirt. This was the topic of a talk given at KVM Forum 2019 in Lyon.

Monolithic daemon

Anyone who has used libvirt should be familiar with the libvirtd daemon which runs most of the virtualization and secondary drivers that libvirt distributes. Only a few libvirt drivers are stateless and run purely in the library. Internally libvirt has always tried to maintain a fairly modular architecture, with each hypervisor driver being separated from other drivers. There are also secondary drivers providing storage, network, firewall functionality which are notionally separate from all the virtualization drivers. Over time the separation has broken down with hypervisor drivers directly invoking internal methods from the secondary drivers, but last year there was a major effort to reverse this and re-gain full separation between every driver.

There are various problems with having a monolithic daemon like libvirtd. From a security POV, it is hard to provide any meaningful protections to libvirtd. The range of functionality it exposes provides an access level that is more or less equivalent to having a root shell. So although libvirtd runs with a “virtd_t” SELinux context, this should be considered little better than running “unconfined_t”. As well as providing direct local access to the APIs, the libvirtd daemon also has the job of exposing remote access over TCP, most commonly needed when doing live migration. Exposing the drivers directly over TCP is somewhat undesirable given the size of the attack surface they have.

The biggest problems users have seen are around reliability of the daemon. A bug in any single driver in libvirt can impact on the functionality of all other drivers. As an example, if something goes wrong in the libvirt storage mgmt APIs, this can harm management of any QEMU VMs. Problems can be things like crashes of the daemon due to memory corruption, or more subtle things like main event loop starvation due to long running file handle event callbacks, or accidental resource cleanup such as closing a file descriptor belonging to another thread.

Libvirt drivers are shipped as loadable modules, and an installation of libvirt does not have to include all drivers. Thus a minimal installation of libvirt is a lot smaller than users typically imagine it is. The existence of the monolithic libvirtd daemon, however, and the fact that many apps pull in broader RPM dependencies than they truly need, results in a perception that libvirt is bloated / heavyweight.

Modular daemons

With all this in mind, libvirt has started a move over to a new modular daemon model. In this new world, each driver in libvirt (both hypervisor drivers & secondary drivers) will be serviced by its own dedicated daemon. So there will be a “virtqemud”, “virtxend”, “virtstoraged”, “virtnwfilterd”, etc. Each of these daemons will only support access via a dedicated local UNIX domain socket, /run/libvirt/$DAEMONNAME, eg /run/libvirt/virtqemud. The libvirt client library will be able to connect to either the old monolithic daemon socket path /run/libvirt/libvirt-sock, or the new per-daemon socket. The hypervisor daemons will be able to open connections to the secondary daemons when required by requested functionality, eg to config a firewall for a QEMU guest NIC.

Remote off-host access to libvirt functionality will be handled via a new virtproxyd daemon which listens for TCP connections and forwards API calls over a local UNIX socket to whichever modular daemon needs to service it. This proxy daemon will also be responsible for handling the monolithic daemon UNIX domain socket path that old libvirt clients will be expecting to use.

Overall from an application developer POV, the change to modular daemons will be transparent at the API level. The main impact will be on deployment tools like Puppet / Ansible seeking to automate deployment of libvirt, which will need to be aware of these new daemons and their config files. The resulting architecture should be more reliable in operation and enable development of more restrictive security policies.

Both the existing libvirtd and the new modular daemons have been configured to make use of systemd socket activation and auto-shutdown after a timeout, so the daemons should only be launched when they actually need to do some work. Several daemons will still need to start up at boot to activate various resources (create the libvirt virbr0 bridge device, or auto-start VMs), but should stop quickly once this is done.

Migration timeframe

At the time of writing the modular daemons exist in libvirt releases and are built and installed by default. The libvirt client library, however, still defaults to connecting to the monolithic libvirtd UNIX socket. To the best of my knowledge, all distros with systemd use presets which favour the monolithic daemon too. IOW, thus far, nothing has changed from most users’ POV. In the near future, however, we intend to flip the switch in the build system such that the libvirt client library favours connections to the modular daemons, and encourage distros to change their systemd presets to match.

The libvirtd daemon will remain around, but deprecated, for some period of time before it is finally deleted entirely. When this deletion will happen is still TBD, but it is not less than 1 year away, and possibly as much as 2 years. The decision will be made based on how easily & quickly applications adapt to the new modular daemon world.

Future benefits

The modular daemon model opens up a number of interesting possibilities for addressing long standing problems with libvirt. For example, the QEMU driver in libvirt can operate in “system mode” where it is running as root and can expose all features of QEMU. There is also the “session mode” where it runs as an unprivileged user but with features dramatically reduced. For example, no firewall integration, drastically reduced network connectivity options, no PCI device assignment and so on. With the modular daemon model, a new hybrid approach is possible. A “session mode” QEMU driver can be enhanced to know how to talk to a “system mode” host device driver to do PCI device assignment (with suitable authentication prompts of course), likewise for network connectivity. This will make the unprivileged “session mode” QEMU driver a much more compelling choice for applications such as virt-manager or GNOME Boxes which prefer to run fully unprivileged.

by Daniel Berrange at February 04, 2020 01:00 PM

February 03, 2020

Daniel Berrange

Libvirt: programming language and document format consolidation

Since the project’s creation about 14 years ago, libvirt has grown enormously. In that time there has been a lot of code refactoring, but these were always fairly evolutionary changes; there has been little revolutionary change of the overall system architecture or some core technical decisions made early on. This blog post is one of a series examining recent technical decisions that can be considered more revolutionary to libvirt. This was the topic of a talk given at KVM Forum 2019 in Lyon.

Historical usage

In common with many projects which have been around for a similar time frame, libvirt has accumulated a variety of different programming languages and document formats, for a variety of tasks.

The main library is written in C, but around that there is the autotools build system which adds shell, make, autoconf, automake, libtool, m4, and other utilities like sed, awk, etc. Then there are many helper scripts used for code generation or testing which are variously written in shell, perl or python. For documentation, there are man pages written in POD, web docs written in HTML5 with an XSL templating system, and then some docs written in XML which generate HTML, and some docs generated from source code comments. Finally there are domain specific languages such as XDR for the RPC system.

There are a couple of issues with the situation libvirt finds itself in. The large number of different languages and formats places a broad knowledge burden on new contributors to the project. Some of the current choices are now fairly obscure & declining in popularity, thus not well known by potential project contributors. For example, Markdown and reStructuredText (RST) are more commonly known than Perl’s POD format. Developers are more likely to be competent in Python than in Perl. Some of the languages libvirt uses are simply too hard to deal with, for example it is a struggle to find anyone who can explain m4 or enjoys using it when writing configure scripts for autoconf.

Ultimately the issues all combine to have a couple of negative effects on the project. They drive away potential new contributors due to their relative obscurity. They reduce the efficiency of existing contributors due to their poor suitability for the tasks they are applied to.

Intended future usage

With the above problems in mind, there is a desire to consolidate and update the programming languages and document formats that libvirt uses. The goals are to reduce the knowledge burden on contributors, and simplify the development experience to make it easier to get the important work done. Approximately the intention is to get to a place where libvirt uses C for the core library code, Python for all helper scripts, RST for all documentation, and Meson for the build system. This should eliminate the use of shell, shell utilities like sed/awk, perl, POD, XSL, HTML5, m4, make, autoconf, automake. A decision about which static website builder to use to replace XSL hasn’t been made yet, but Sphinx and Pelican are examples of the kind of tools being considered. Getting to this desired end point will take a prolonged effort, but setting a clear direction now will assist contributors in understanding where to best spend time.

To kickstart this consolidation, at the end of last year almost all of the Perl build scripts were converted to Python. This was a manual line-by-line conversion, so the structure of the scripts didn’t really change much, just the minimal language syntax changes in most cases. There are still a handful of Perl scripts remaining, which are some of the most complicated ones. These really need a rewrite, rather than a plain syntax conversion, so will take more time to bring over to Python. A few shell scripts have also been converted to Python, with more to follow. After that all the POD man pages were fed through an automated conversion pipeline to create RST formatted man pages. In doing this, the man pages are now also published online as part of the libvirt website. Some of the existing HTML content has also been converted into RST.

The next really big step is converting the build system from autotools into Meson. This will eliminate some of the most complex, buggy and least understood parts of libvirt. It will be refreshing to have a build system which only requires knowledge of a single high level domain specific language, with Python for extensions, instead of autotools which requires knowledge of at least 6 different languages and the interactions between them.

by Daniel Berrange at February 03, 2020 12:03 PM

January 31, 2020

Daniel Berrange

Libvirt: use of GCC/Clang extension for automatic cleanup functions

Since the project’s creation about 14 years ago, libvirt has grown enormously. In that time there has been a lot of code refactoring, but these were always fairly evolutionary changes; there has been little revolutionary change of the overall system architecture or some core technical decisions made early on. This blog post is one of a series examining recent technical decisions that can be considered more revolutionary to libvirt. This was the topic of a talk given at KVM Forum 2019 in Lyon in October 2019.

Automatic memory cleanup

Libvirt has always aimed to be portable across a wide set of operating system platforms and this included portability to different compiler toolchains. In the early days of the project GCC was the most common target, but users did use the Solaris and Microsoft native compilers occasionally. Fast forward to today and the legacy UNIX platforms are much less relevant. Officially libvirt only targets Linux, FreeBSD, macOS and Windows as supported platforms and all of these have GCC or Clang or both available. These compilers are available on any platform that we’re likely to add in the future too. Conceivably people might still want to use Microsoft compilers, but their featureset is so poor compared to GCC/Clang that we long ago discounted them as a toolchain to support.

Thus, in the early part of last year, libvirt made the explicit decision to only support GCC and Clang henceforth. This in turn freed the project to take full advantage of extensions to the C language offered by these compilers.

The extension which motivated this decision was the cleanup attribute. This allows a variable declaration to have a function associated with it that will be automatically invoked when the variable goes out of scope. The most obvious use for these cleanup functions is to release heap memory associated with pointers, and this is exactly what libvirt wanted to do. This is not the only use case though, they are also convenient for other tasks such as closing file descriptors, decrementing reference counts, unlocking mutexes, and so on.

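The native C syntax for using this feature is fairly ugly (free_charp here stands for a hypothetical wrapper taking a char **, since the cleanup function is passed a pointer to the variable and free() cannot be named directly)

__attribute__((__cleanup__(free_charp))) char *foo = NULL;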

but this can be made more attractive via macros. For example, GLib provides several pretty macros to simplify life: g_autofree, g_autoptr and g_auto.

Thus the old libvirt coding pattern of

void dosomething(char *param) {
  char *foo = NULL;

  ...some code...

  foo = g_strdup_printf("Some string %s", param);
  if (something() < 0)
     goto cleanup;

  ... some more code...

 cleanup:
  free(foo);
}

Can be replaced by something like

void dosomething(char *param) {
  g_autofree char *foo = NULL;

  ...some code...

  foo = g_strdup_printf("Some string %s", param);
  if (something() < 0)
     return;

  ... some more code...
}

There are still some “gotchas” to be aware of. Care must be taken to ensure any variable declared with automatic cleanup is always initialized, otherwise the cleanup function will touch uninitialized stack data. If a pointer stored in an automatic cleanup variable needs to be returned to the caller of the method, the local variable must be NULLd out. Fortunately GLib provides a convenient helper g_steal_pointer for exactly this purpose.
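Both gotchas can be seen in a small self-contained sketch that uses the compiler attribute directly instead of GLib. The names free_charp(), steal_charp(), AUTO_FREE and make_copy() are hypothetical stand-ins invented for this illustration; they only mimic the roles that g_autofree and g_steal_pointer play and are not libvirt or GLib code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Counts cleanup invocations so the behaviour can be observed. */
int cleanup_calls;

/*
 * Cleanup handler: the compiler passes a pointer TO the variable,
 * so the argument is a char **, not the char * itself.
 */
void free_charp(char **p)
{
    cleanup_calls++;
    free(*p);       /* free(NULL) is a no-op */
}

/* Hypothetical macro, playing the role GLib's g_autofree plays. */
#define AUTO_FREE __attribute__((__cleanup__(free_charp)))

/* Hypothetical equivalent of g_steal_pointer for char pointers. */
char *steal_charp(char **p)
{
    char *ret = *p;

    *p = NULL;
    return ret;
}

/* Returns a heap copy of src, or NULL if fail is non-zero. */
char *make_copy(const char *src, int fail)
{
    AUTO_FREE char *tmp = NULL;     /* always initialize */

    assert(src != NULL);

    tmp = malloc(strlen(src) + 1);
    if (tmp == NULL)
        abort();
    strcpy(tmp, src);

    if (fail)
        return NULL;                /* tmp is freed automatically here */

    return steal_charp(&tmp);       /* tmp becomes NULL; caller owns result */
}
```

The handler runs on every exit from the scope, including the successful one, which is why the returned pointer has to be stolen first: by the time the handler fires, tmp is already NULL and nothing is freed from under the caller.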

The previous blog described how many goto jumps were eliminated by aborting on OOM, instead of trying to gracefully cleanup & report it. The remaining goto jumps were primarily for free’ing memory, closing file descriptors, and releasing mutexes, most of which can be eliminated with these cleanup functions.

The result is that the libvirt code can be dramatically simplified, which reduces the maint burden on libvirt contributors, allowing more time to be spent on coding features which matter to users. As an added benefit, in converting code over to use automatic cleanup functions we’ve fixed a number of memory leaks not previously detected, which reinforces the value of using this C extension.

Incidentally after this was introduced in libvirt last year, I suggested that QEMU also adopt use of automatic cleanup functions, since it has also mandated either GCC or Clang as the only supported compilers, and this was accepted.

by Daniel Berrange at January 31, 2020 12:03 PM

January 30, 2020

Daniel Berrange

Libvirt: adoption of GLib library to replace GNULIB & home grown code

Since the project’s creation about 14 years ago, libvirt has grown enormously. In that time there has been a lot of code refactoring, but these were always fairly evolutionary changes; there has been little revolutionary change of the overall system architecture or some core technical decisions made early on. This blog post is one of a series examining recent technical decisions that can be considered more revolutionary to libvirt. This was the topic of a talk given at KVM Forum 2019 in Lyon.

Portability and API abstractions

Libvirt traditionally targeted the POSIX standard API but there are a number of difficulties with this. Much of POSIX is optional so can not be assumed to exist on every platform. Plenty of platforms are non-compliant with the spec, or have different behaviour in scenarios where the spec allowed for multiple interpretations. To deal with this libvirt used the GNULIB project which is a copylib that attempts to fix POSIX non-compliance issues. It is very effective at this, but portability is only one of the problems with using POSIX APIs directly. It is a very low level API, so simple tasks like listening on a TCP socket require many complex API calls. Other APIs have poor designs by modern standards which make it easy for developers to introduce bugs. The malloc APIs are a particular case in point. As a result libvirt has created many higher level abstractions around the POSIX APIs. Looking at other modern programming languages though, such higher level abstractions are already a standard offering. This allows developers to focus on solving their application’s domain specific problems. Libvirt maintainers by contrast have spent a lot of time developing abstractions unrelated to virtualization such as object / class systems, DBus client APIs, hash tables / bitmaps, sockets / RPC systems, and much more. This is not a good use of limited resources in the long term.

Adoption of GLib

These problems are common to many applications / libraries that are written in C and thus there are a number of libraries that attempt to provide a high level “standard library”. The GLib library is one such effort from the GNOME project developers that has long been appealing. Some of libvirt’s internal APIs are inspired by those present in GLib, and it has been used by QEMU for a long time too. What prevented libvirt from using GLib in the past was the desire to catch, report and handle OOM errors. With the switch to aborting on OOM, the only blocker to use of GLib was eliminated.

The decision was thus made for libvirt to adopt the GLib library in the latter part of 2019. From the POV of application developers nothing will change in libvirt. The usage of GLib is purely internal, and so doesn’t leak into the public API exposed from libvirt, which remains compatible with what came before. In the case of QEMU/KVM hosts at least, there is also no change in what must be installed on hosts, since GLib was already a dependency of QEMU for many years. This will ultimately be a net win, as using GLib will eliminate other code in libvirt, reducing the installation footprint on aggregate between libvirt and QEMU.

With a large codebase such as libvirt’s, adopting GLib is not as quick as flicking a switch. Some key pieces of libvirt functionality have been ported to use GLib APIs completely, while in other cases the work is going to be an incremental ongoing effort over a long time. This offers plenty of opportunities for new contributors to jump in and make useful changes which are fairly easily understood & straightforward to implement.

Removal of GNULIB

One of the anticipated benefits of using GLib was that it would be able to obsolete a lot of the portability work that GNULIB does. The GNULIB project is strongly entangled with autotools as a build system, so is a blocker to the adoption of a different build system in libvirt. There has thus been an ongoing effort to eliminate GNULIB modules from libvirt code. In many cases, GLib does indeed provide a direct replacement for the functionality needed. One of the surprises though, is that a very large portion of GNULIB was completely redundant given libvirt’s stated set of OS platform build targets. There is no need to consider portability to a wide variety of old buggy UNIX variants (Solaris, HPUX, AIX, and so on) for libvirt. After a final big push over the last few weeks, a patch series has been posted which completes the removal of GNULIB from libvirt, which will merge in the 6.1.0 release.

The work has been tested across all the platforms covered by libvirt CI, which means RHEL-7, 8, Fedora 30, 31, rawhide, Ubuntu 16.04, 18.04, Debian 9, 10, sid, FreeBSD 11, 12, macOS 10.14 with XCode 10.3 and XCode 11.3, and MinGW64. There are certainly libvirt users on platforms not covered by CI. Those using other modern Linux distros should not see problems if using GLibC, as the combination of RHEL, Debian & Ubuntu testing should expose any latent bugs. The more likely places to see regressions will be if people are using libvirt on other *BSDs, or older Linux distros. Usage of alternative C library implementations on Linux is also an unknown, since there is no CI coverage for this. Support for older Linux distros is explicitly not a goal for libvirt and the project will willingly break old platforms. Support for other modern non-Linux OS, however, is potentially interesting. What is stopping such platforms being considered explicitly by libvirt is lack of any contributors willing to help provide a CI setup and deal with fixing portability problems. IOW, libvirt is willing to entertain the idea of supporting additional modern OS platforms if contributors want to work with the project to make it happen. The same applies to Linux distros using a non-GLibC impl.

by Daniel Berrange at January 30, 2020 12:24 PM

January 29, 2020

Daniel Berrange

Libvirt: abort() when seeing ENOMEM errors

Since the project’s creation about 14 years ago, libvirt has grown enormously. In that time there has been a lot of code refactoring, but these were always fairly evolutionary changes; there has been little revolutionary change of the overall system architecture or some core technical decisions made early on. This blog post is one of a series examining recent technical decisions that can be considered more revolutionary to libvirt. This was the topic of a talk given at KVM Forum 2019 in Lyon.

Detecting and reporting OOM

Libvirt has always taken the view that ANY error from a function / system call must be propagated back to the caller. The out of memory condition (ENOMEM / OOM) is just one of many errors that might be seen when calling APIs, and thus libvirt attempted to report this in the normal manner. OOM is not like most other errors though.

The first challenge with OOM is that checking for a NULL return from malloc() is error prone because the return value overloads the error indicator with the normal returned pointer. To address this, the libvirt coding style banned direct use of malloc() and created a wrapper API that returned the allocated pointer in an output parameter, leaving the return value solely as the error indicator, leading to a code pattern like:

  char *varname;

  if (VIR_ALLOC(varname) < 0) {
     ....handle OOM...
  }


This enabled use of the ‘return_check’ function attribute to get compile time validation that allocation errors were checked.

Checking for OOM is only half the problem. Handling OOM is the much more difficult issue. Libvirt uses a ‘goto error’ design pattern for error cleanup code paths. A surprisingly large number of these goto jumps only exist to handle OOM cleanup. Testing these code paths is non-trivial, if not impossible, in the real world. Libvirt integrated a way to force OOM on arbitrary allocations in its unit test suite. This was very successful at finding crashes and memory leaks in OOM handling code paths, but this only validates code that actually has unit test coverage. The number of bugs found in code that was tested for OOM gives very low confidence that other non-tested code would correctly handle OOM. The OOM testing is also incredibly slow to execute since it needs to repeatedly re-run the unit tests failing a different malloc() each time. The time required grows exponentially as the number of allocations increases.
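The wrapper pattern and the forced-OOM test loop described above can be sketched in a few lines of C. This is an illustration under stated assumptions, not libvirt's implementation: test_alloc(), do_work() and run_oom_tests() are hypothetical names, and the standard GCC/Clang warn_unused_result attribute stands in for the return check annotation.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Failure injection state: when fail_at > 0, the fail_at-th call to
 * test_alloc() reports OOM. alloc_calls counts calls in the current run.
 */
int fail_at;
int alloc_calls;

/*
 * Wrapper in the style described above: the pointer comes back via an
 * output parameter and the return value is purely an error indicator.
 * warn_unused_result makes the compiler flag callers that ignore it.
 */
__attribute__((__warn_unused_result__))
int test_alloc(void **out, size_t size)
{
    assert(out != NULL);

    *out = NULL;
    if (++alloc_calls == fail_at)
        return -1;                  /* simulated OOM */

    *out = calloc(1, size);
    return *out ? 0 : -1;
}

/* Code under test: needs two allocations, cleans up on any failure. */
int do_work(void)
{
    void *a = NULL, *b = NULL;
    int ret = -1;

    if (test_alloc(&a, 16) < 0)
        goto cleanup;
    if (test_alloc(&b, 32) < 0)
        goto cleanup;
    ret = 0;

 cleanup:
    free(b);
    free(a);
    return ret;
}

/*
 * Re-run do_work() once per allocation, failing a different
 * allocation each time. Returns the number of runs that reported OOM,
 * or -1 if the baseline run without injected failures does not pass.
 */
int run_oom_tests(void)
{
    int total, n, failures = 0;

    fail_at = 0;
    alloc_calls = 0;
    if (do_work() < 0)              /* baseline run, no injected failure */
        return -1;
    total = alloc_calls;

    for (n = 1; n <= total; n++) {
        fail_at = n;
        alloc_calls = 0;
        if (do_work() < 0)
            failures++;
    }
    fail_at = 0;
    return failures;
}
```

Even in this toy case the whole test body is executed once per allocation it contains, on top of the baseline run, which hints at why the approach becomes slow for tests that allocate heavily.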

Assuming the OOM condition is detected and a jump to the error handling path is taken, there is now the problem of getting the error report back to the user. Many of the libvirt drivers run inside the libvirtd daemon, with an RPC system used to send results back to the client application. Reporting the error via RPC messages is quite likely to need memory allocation which may well fail in an OOM scenario.

Is OOM reporting useful?

The paragraphs above describe why reporting OOM scenarios is impractical, verging on impossible, in the real world. Assuming it was possible to report reliably though, would it actually benefit any application using libvirt?

Linux systems generally default to having memory overcommit enabled, and when they run out of memory, the OOM killer will reap some unlucky process. IOW, on Linux, it is very rare for an application to ever see OOM reported from an allocation attempt. Libvirt is ported to non-Linux platforms which may manage memory differently and thus genuinely report OOM from malloc() calls. Those non-Linux users will be taking code paths that are never tested by the majority of libvirt users or developers. This gives low confidence for success.

Although libvirt provides a C library API as its core deliverable, few applications are written in C, most consume libvirt via a language binding with Python and Go believed to be the most commonly used. Handling OOM in non-C languages is even less practical/common than in C. Many libvirt applications are also already using libraries (GTK, GLib) that will abort on OOM. Overall there is little sign that any libvirt client application attempts to handle OOM in its own code, let alone care if libvirt can report it.

One important application process using the libvirt API though is the libvirtd daemon. In the very early days, if libvirtd stopped it would take down all running QEMU VMs, but this limitation was fixed over 10 years ago. To enable software upgrades on hosts with running VMs, libvirtd needs to be able to restart itself. As a result libvirtd maintains a record of important state on disk enabling it to carry on where it left off when starting up. Recovering from OOM by aborting and allowing the libvirtd to be restarted by systemd, would align with a code path that already needs to be well tested and supported for software upgrades.

Give up on OOM handling

With all the above in mind, the decision shouldn’t be a surprise. Libvirt has decided to stop attempting to handle ENOMEM from malloc() and related APIs and will instead immediately abort. The libvirtd daemon will automatically restart and carry on where it left off. The result is that the libvirt code can be dramatically simplified by removing many goto jumps and cleanup code blocks, which reduces the maint burden on libvirt contributors, allowing more time to be spent on coding features which matter to users.
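As a rough illustration of how much simpler calling code becomes, here is a minimal sketch of an allocator that aborts on failure, together with a caller that no longer needs any error path. The xalloc() and join_path() functions are hypothetical names made up for this example; they are not the libvirt or GLib functions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical allocator in the abort-on-OOM style: it either returns
 * zeroed memory or terminates the process, so callers need no error path.
 */
void *xalloc(size_t size)
{
    void *p = calloc(1, size ? size : 1);

    if (p == NULL) {
        fputs("out of memory\n", stderr);
        abort();
    }
    return p;
}

/* With no failure to propagate, no goto or cleanup label is needed. */
char *join_path(const char *dir, const char *name)
{
    size_t len;
    char *path;

    assert(dir != NULL && name != NULL);

    len = strlen(dir) + 1 + strlen(name) + 1;   /* dir + '/' + name + NUL */
    path = xalloc(len);
    snprintf(path, len, "%s/%s", dir, name);
    return path;
}
```

The caller of join_path() still owns and frees the result, but it never has to check for NULL or unwind partially completed work because of a failed allocation.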


by Daniel Berrange at January 29, 2020 03:07 PM

January 22, 2020

Cornelia Huck

Channel I/O, demystified

As promised, I will write some articles about one of the areas where s390x looks especially alien to people familiar with the usual suspects like x86: the native way of addressing I/O devices. 1

Channel I/O is a rather large topic, so I will concentrate on explaining it from a Linux (guest) and from a qemu/KVM (virtualization) point of view. This also means that I will prefer terminology that will make sense to somebody familiar with Linux (on x86) rather than the one used by e.g. a z/OS system programmer.

Links to the individual articles:
Channel I/O: What's in a channel subsystem?
Channel I/O: Talking to devices
Channel I/O: Types of devices
Channel I/O: More about channel paths
Channel Measurements: A Quick Overview

1. There is PCI on s390x, but it is a recent addition and its idiosyncrasies are better understood if you know how channel I/O works.

by Unknown at January 22, 2020 02:43 PM

Channel Measurements: A Quick Overview

The s390 channel subsystem can gather some statistics on I/O performance for you, which might be useful if you try to figure out why something is not performing as well as you'd expect it to. From a QEMU/KVM perspective, this is currently mainly useful on the host.

Channel monitoring for ccw devices

The first kind of channel measurements is those collected per subchannel. For a detailed overview of what actually happens there, turn to the Principles of Operation, Chapter 17 ("I/O Support Functions"), "Channel Monitoring". I'll cover here what will most likely be of interest to people running a Linux (host) system.

Enabling channel measurements

If you are running a non-vintage machine (i.e. a z990 or later), you will not need a system-wide setup. Older machines should be fine as well, if you do not want to measure more than 1024 devices.

To enable measurements for a specific ccw device (say, 0.0.1234), simply issue:

chccwdev -a cmb_enable=1 0.0.1234

Measurements collected

Under /sys/bus/ccw/devices/0.0.1234/, you should now have a new subdirectory called cmf, which contains some files. For a system that has been running for some time, the contents may look something like the following:

head cmf/*
==> cmf/avg_control_unit_queuing_time <==
==> cmf/avg_device_active_only_time <==
==> cmf/avg_device_busy_time <==
==> cmf/avg_device_connect_time <==
==> cmf/avg_device_disconnect_time <==
==> cmf/avg_function_pending_time <==
==> cmf/avg_initial_command_response_time <==
==> cmf/avg_sample_interval <==
==> cmf/avg_utilization <==
==> cmf/sample_count <==
==> cmf/ssch_rsch_count <==

Note that all values but sample_count and ssch_rsch_count are averaged over time. We also see that samples seem to have been taken whenever the driver issued a ssch.

The device in our example shows an avg_utilization of 0%, which is consistent with a device that mostly sits idle. But what about a device where something is actually happening?

head cmf/*
==> cmf/avg_control_unit_queuing_time <==
==> cmf/avg_device_active_only_time <==
==> cmf/avg_device_busy_time <==
==> cmf/avg_device_connect_time <==
==> cmf/avg_device_disconnect_time <==
==> cmf/avg_function_pending_time <==
==> cmf/avg_initial_command_response_time <==
==> cmf/avg_sample_interval <==
==> cmf/avg_utilization <==
==> cmf/sample_count <==
==> cmf/ssch_rsch_count <==

Here, we see a higher avg_utilization, but actually not that many ssch invocations. Interesting is the relatively high value of avg_device_disconnect_time: It indicates that there are quite long intervals where the device and the channel subsystem do not talk to each other. That might, for example, happen if other LPARs on the same system drive a lot of I/O via the same channel paths as the device.
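If you would rather collect these values from a program than with head, a small C helper along the following lines could read a single attribute. It is only a sketch: read_cmf_attr() is a hypothetical name, and it assumes the attribute starts with a plain unsigned decimal number (as a counter such as sample_count or ssch_rsch_count would); attributes in another format need different parsing.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one numeric attribute file (e.g. "<dir>/sample_count") into *value.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_cmf_attr(const char *dir, const char *attr, unsigned long long *value)
{
    char path[512];
    FILE *fp;
    int n, rc = -1;

    assert(dir != NULL && attr != NULL && value != NULL);

    /* Build "<dir>/<attr>", refusing paths that would be truncated. */
    n = snprintf(path, sizeof(path), "%s/%s", dir, attr);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%llu", value) == 1)
        rc = 0;
    fclose(fp);
    return rc;
}
```

The directory argument would be the cmf subdirectory of the device in question; the helper itself does not depend on any s390 specifics and can be tried against any file containing a number.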

Help, I cannot enable channel measurements on my device!

There's one drawback when trying to enable channel measurements on a live device: It needs to execute a msch, which only can be done on an idle subchannel. For devices that execute separate ssch invocations to go about their business (e.g. dasd), the common I/O layer can squeeze in the msch between ssch invocations and all is well. However, some devices use a long-running channel program, which will not conclude during the time the device is enabled; the most prominent examples are devices using QDIO, like zFCP adapters or OSA cards. In that case, the common I/O layer cannot squeeze in a msch; you might try disabling the device, but that's usually not something you want to do in a live system.

Extended channel measurements

What if you want to find out something not about an individual device, but for a channel path? There's a feature for that; you can issue
echo 1 > /sys/devices/css0/cm_enable
and will find new entries (measurement, measurement_chars) under the various chp0.xx objects.

Unfortunately, these attributes only provide some binary data, which does not seem to be publicly documented, and I'm not aware of any tool that can parse them.
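Even without a parser, you can at least check which channel paths expose the new attributes and how much data they carry. The following is a rough sketch, assuming the chp0.xx objects sit directly below the css0 directory; the function name channel_path_measurements is hypothetical, and the contents are returned as opaque bytes because their layout is not documented.

```python
from pathlib import Path


def channel_path_measurements(css_dir="/sys/devices/css0"):
    """Collect the raw measurement attributes of all chp0.xx objects.

    Assumption: the channel path objects are direct children of css_dir.
    The data is returned as uninterpreted bytes.
    """
    result = {}
    for chp in sorted(Path(css_dir).glob("chp0.*")):
        for name in ("measurement", "measurement_chars"):
            attr = chp / name
            if attr.is_file():
                result.setdefault(chp.name, {})[name] = attr.read_bytes()
    return result
```

Printing the length of each entry is probably the most you can do with the result until the format is documented.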

Channel measurements in QEMU guests

So far, all measurements have been collected on the host; but what about measurements in the guest?

The good news: You can turn on channel measurements for ccw devices in the guest. The bad news: They are not very useful.

Consider, for example, this virtio-ccw device:
head cmf/*
==> cmf/avg_control_unit_queuing_time <==
==> cmf/avg_device_active_only_time <==
==> cmf/avg_device_busy_time <==
==> cmf/avg_device_connect_time <==
==> cmf/avg_device_disconnect_time <==
==> cmf/avg_function_pending_time <==
==> cmf/avg_initial_command_response_time <==
==> cmf/avg_sample_interval <==
==> cmf/avg_utilization <==
==> cmf/sample_count <==
==> cmf/ssch_rsch_count <==

No samples, just a ssch count. Why? QEMU does not fully emulate the sampling infrastructure; only counting of ssch is done (which is very easy to implement). Moreover, virtio-ccw devices use channel programs mainly to set up queues, negotiate features, etc., so measurements here do not reflect what is going on on the virtqueues, which would be the interesting part for performance issues.

But what about a dasd passed through via vfio-ccw? That one should have more statistics, right?
head cmf/*
==> cmf/avg_control_unit_queuing_time <==
==> cmf/avg_device_active_only_time <==
==> cmf/avg_device_busy_time <==
==> cmf/avg_device_connect_time <==
==> cmf/avg_device_disconnect_time <==
==> cmf/avg_function_pending_time <==
==> cmf/avg_initial_command_response_time <==
==> cmf/avg_sample_interval <==
==> cmf/avg_utilization <==
==> cmf/sample_count <==
==> cmf/ssch_rsch_count <==

No samples, just a ssch count, again. Why? Currently, vfio-ccw uses the same emulation infrastructure as the other emulated devices. In the future, we may implement some kind of passthrough for channel measurements, but that requires some work.
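The "no samples, just a ssch count" pattern is easy to detect in a script. This sketch assumes that sample_count and ssch_rsch_count contain plain decimal integers and that "no samples" shows up as a sample_count of 0; the function name only_ssch_counted is made up for illustration.

```python
def only_ssch_counted(stats):
    """Return True if ssch invocations were counted but no samples were taken.

    stats maps cmf attribute names to their text values. Both values are
    assumed to be plain decimal integers.
    """
    samples = int(stats["sample_count"])
    sschs = int(stats["ssch_rsch_count"])
    return samples == 0 and sschs > 0
```

A True result for a device inside a guest hints that you are looking at QEMU's emulated counters rather than real channel measurements.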

by Cornelia Huck at January 22, 2020 02:42 PM

January 14, 2020

KVM on Z

KVM at SHARE Fort Worth 2020

Yes, we will be at SHARE in Fort Worth, TX, February 23-28!
Meet us at the following session in the Linux and VM/Virtualization track:
  • KVM on IBM Z News (Session #26219): Latest news on our development work with the open source community

by Stefan Raspl at January 14, 2020 08:22 AM

January 09, 2020

Gerd Hoffmann

PCI IDs for virtual devices

It's a bit complicated. There are a bunch of different vendor IDs in use in the linux/qemu virtualization world. I'm trying to cover them all.

Vendor ID 1af4 (Red Hat)

This is the Red Hat vendor ID. The device ID range from 1af4:1000 to 1af4:10ff is reserved for virtio-pci devices. For virtio 1.0 (and newer) devices there is a fixed relationship between virtio device ID and PCI device ID (offset 0x1040). So allocating a virtio device ID is enough; doing so implicitly allocates a PCI device ID too.

Qemu uses 1af4:1100 as PCI Subsystem ID for most devices.

The ivshmem device uses 1af4:1110.

Example (virtio scsi controller):

$ lspci -vnns3:
03:00.0 SCSI storage controller [0100]: Red Hat, Inc. \
		Virtio SCSI [1af4:1048] (rev 01)
	Subsystem: Red Hat, Inc. Device [1af4:1100]
	[ ... ]
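The fixed offset can be checked with a few lines of code. This is a minimal sketch; the function name virtio_pci_id is made up for illustration. It relies on the virtio device ID for SCSI hosts being 8, which yields the 1af4:1048 shown in the lspci output above.

```python
RED_HAT_VENDOR_ID = 0x1AF4
VIRTIO_1_0_OFFSET = 0x1040  # offset between virtio device ID and PCI device ID


def virtio_pci_id(virtio_device_id):
    """Return the vendor:device string for a virtio 1.0 (or newer) device."""
    pci_device_id = VIRTIO_1_0_OFFSET + virtio_device_id
    # The range reserved for virtio-pci devices is 1af4:1000 to 1af4:10ff.
    if virtio_device_id < 0 or pci_device_id > 0x10FF:
        raise ValueError("virtio device ID outside the reserved PCI ID range")
    return f"{RED_HAT_VENDOR_ID:04x}:{pci_device_id:04x}"


print(virtio_pci_id(8))  # virtio SCSI host
```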

Vendor ID 1b36 (Red Hat / Qumranet)

This was the Qumranet vendor ID, and with the Qumranet acquisition by Red Hat the PCI vendor ID moved to Red Hat too. The device ID range from 1b36:0001 to 1b36:00ff is reserved for qemu, for virtual devices which are not virtio.

Such devices can either be qemu specific (test device, rocker ethernet) or devices which guest drivers typically match by PCI class code (pci bridges, serial, sdhci, xhci, nvme, ...) so they don't need to masquerade as some existing physical device.

New PCI ID assignments for qemu should use this pool.

Current allocations are listed in docs/specs/pci-ids.txt (here is the qemu 4.2 version) and have a #define in include/hw/pci/pci.h. If needed, an additional file in docs/specs/ should have the device specification.

To allocate an ID, send a patch updating both docs/specs/pci-ids.txt and include/hw/pci/pci.h to the qemu-devel mailing list. Add me to Cc: to make sure it gets my attention. The commit message should describe the device you want to allocate an ID for.

The 1b36:0100 ID is used by the qxl display device.

Example (xhci usb controller):

$ lspci -vnns2:
02:00.0 USB controller [0c03]: Red Hat, Inc. QEMU XHCI Host \
		Controller [1b36:000d] (rev 01) (prog-if 30 [XHCI])
	Subsystem: Red Hat, Inc. Device [1af4:1100]
	[ ... ]

Vendor ID 1234

Even though there are #defines for PCI_VENDOR_ID_QEMU with this ID at various places this is not assigned to qemu. According to the PCI ID database this belongs to "Technical Corp".

Do not use this for new devices.

The qemu stdvga uses 1234:1111. This ID is hardcoded all over the place in guest drivers, so unfortunately we are stuck with it.

Luckily, no problems due to conflicts have shown up so far; it seems the owner doesn't use this vendor ID to build PCI devices.

Other Vendor IDs

It is perfectly fine to use IDs from other vendors for virtual qemu devices. They must be explicitly reserved for qemu though, to avoid possible conflicts with physical hardware from the same vendor. So, if you have your own vendor ID and want to contribute a device to qemu, you can also allocate a device ID from your ID space instead of taking one from the Red Hat 1b36 pool.

PCI IDs for emulated devices

When emulating existing physical hardware qemu will of course use the PCI IDs of the hardware being emulated, so the guest drivers will find the device. In most cases the PCI Subsystem ID 1af4:1100 indicates the device is emulated by qemu.

Example (sound device):

$ lspci -vnns1b
00:1b.0 Audio device [0403]: Intel Corporation 82801I (ICH9 Family) \
		HD Audio Controller [8086:293e] (rev 03)
	Subsystem: Red Hat, Inc. QEMU Virtual Machine [1af4:1100]
	[ ... ]
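The ID ranges described in this post can be summarized as a small lookup. This is just an illustrative sketch: the function names and the category strings are made up, and only the IDs mentioned in the text are covered.

```python
def parse_pci_id(text):
    """Split a 'vvvv:dddd' string into (vendor, device) integers."""
    vendor, device = text.split(":")
    return int(vendor, 16), int(device, 16)


def describe_pci_id(text):
    """Classify a vendor:device pair according to the ranges listed above."""
    vendor, device = parse_pci_id(text)
    if vendor == 0x1AF4 and 0x1000 <= device <= 0x10FF:
        return "virtio-pci device"
    if vendor == 0x1B36 and 0x0001 <= device <= 0x00FF:
        return "qemu virtual device (not virtio)"
    if (vendor, device) == (0x1B36, 0x0100):
        return "qxl display device"
    if (vendor, device) == (0x1234, 0x1111):
        return "qemu stdvga"
    return "not covered by the qemu ranges"


def has_qemu_subsystem_id(text):
    """True if the PCI Subsystem ID is the one qemu uses for most devices."""
    return parse_pci_id(text) == (0x1AF4, 0x1100)


print(describe_pci_id("1b36:000d"))
```

For the sound device example, describe_pci_id("8086:293e") reports that the ID is not covered by the qemu ranges, while has_qemu_subsystem_id("1af4:1100") is what gives the emulation away.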

by Gerd Hoffmann at January 09, 2020 11:00 PM

January 07, 2020

Gerd Hoffmann

sound configuration changes in qemu

The sound subsystem was the odd one in qemu. It was configured using environment variables like QEMU_AUDIO_DRV instead of command line switches. That time is over now. For qemu 4.2 we finally completed merging full -audiodev support, written by Zoltán Kővágó. Read on to learn what has changed.

Using the environment variables continues to work for now. But this configuration method is deprecated and will be dropped at some point in the future. Maybe in qemu 5.1 (earliest release the qemu deprecation policy would allow), maybe we'll leave the compatibility code in for a few more releases to allow a longer transition period given this is a rather fundamental change.

Creating audio backends

The new -audiodev command line switch creates an audio backend, similar to -netdev for network backends or -blockdev for block device backends. All sound backend configuration is done using -audiodev parameters.

The simplest case is to just specify the backend and assign an id to refer to it, like this:

qemu -audiodev spice,id=snd0

Some backends have additional configuration options. For pulseaudio for example it is possible to specify server hostname:

qemu -audiodev pa,id=snd0,server=localhost

Stream parameters can be specified separately for in (record) and out (playback). There are some parameters which are common for all backends (frequency, channels, ...) and backend-specific parameters like the pulseaudio stream name (visible in mixer applications like pavucontrol) or the alsa device:

qemu -audiodev pa,id=snd0,out.frequency=48000,
qemu -audiodev alsa,id=snd0,

Buffer sizes are specified in microseconds everywhere. So configuring alsa with a buffer for 10 milliseconds of sound data and 4 periods (2.5 milliseconds each) works this way:

qemu -audiodev alsa,id=snd0,out.buffer-length=10000,out.period-length=2500

The -audiodev switch accepts json too (similar to -blockdev), so the last example can also be specified this way:

qemu -audiodev "{
    'driver' : 'alsa',
    'id' : 'snd0',
    'out' : {
      'buffer-length' : 10000,
      'period-length' : 2500
    }
}"

Consult the audio qapi schema for all available config options (including documentation).
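The relationship between the two notations is mechanical: nested json objects become dotted keys in the key=value form. Here is a minimal sketch of that conversion. The helper names are made up for illustration, and rendering booleans as on/off is an assumption based on the mixing-engine=off and multi=on switches used in this post.

```python
def flatten(options, prefix=""):
    """Turn nested dicts into (dotted key, value) pairs, keeping the order."""
    pairs = []
    for key, value in options.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            pairs.extend(flatten(value, name + "."))
        elif isinstance(value, bool):
            pairs.append((name, "on" if value else "off"))
        else:
            pairs.append((name, value))
    return pairs


def audiodev_keyval(options):
    """Render a dict shaped like the json example as a key=value argument."""
    options = dict(options)          # do not modify the caller's dict
    driver = options.pop("driver")   # the driver comes first, without a key
    rest = ",".join(f"{key}={value}" for key, value in flatten(options))
    return f"{driver},{rest}" if rest else driver


config = {
    "driver": "alsa",
    "id": "snd0",
    "out": {"buffer-length": 10000, "period-length": 2500},
}
print("-audiodev", audiodev_keyval(config))
```

With the config shown, this prints the same argument as the key=value alsa example above.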

Using audio backends

That is the simple part. All sound devices got a new audiodev parameter to link the device with an audio backend, using the id assigned to the backend:

qemu -device usb-audio,audiodev=snd0


Note: Right now the audiodev= parameter is optional, for backward compatibility reasons. The parameter will become mandatory though when we drop the code for the deprecated environment variable configuration method. So I strongly recommend using the parameter explicitly everywhere, to be prepared for the future.

Multiple audio backends and 5:1 surround support

Finally, a more complex example to showcase what the new sound configuration allows for:

qemu \
  -audiodev pa,id=hda,out.mixing-engine=off \
  -audiodev pa,id=usb,out.mixing-engine=off \
  -device intel-hda -device hda-output,audiodev=hda \
  -device qemu-xhci -device usb-audio,audiodev=usb,multi=on


  • We can have multiple backends, by simply specifying -audiodev multiple times on the command line and assigning different ids. That can be useful even in case of two identical backends. With pulseaudio each backend is a separate stream and can be routed to different output devices on the host (using a pulse mixer app like pavucontrol).
  • Using the backend ids we assign one backend to the HDA device and the other to the USB device.
  • qemu has an internal audio mixer, for mixing audio streams from multiple devices into one output stream. The internal mixer can also do resampling and format conversion if needed. With the pulseaudio backend we don't need it as the pulseaudio daemon can handle all this for us. Also, the internal mixer is limited to mono and stereo; it can't handle multichannel (5:1) sound data. So we use mixing-engine=off to turn off the internal mixer.
  • The USB audio device has multichannel (5:1) support. This is disabled by default, though; the multi=on parameter turns it on.
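With several backends in play it is easy to mistype an id. The following sketch checks that every audiodev= reference on a command line points to a backend created with -audiodev. It is a hypothetical helper, not something qemu provides, and it only understands the key=value form, not json.

```python
def dangling_audiodev_refs(argv):
    """Return (device, backend id) pairs that refer to an undefined backend."""
    backends = set()
    references = []
    # Walk over (switch, argument) pairs of the command line.
    for switch, argument in zip(argv, argv[1:]):
        props = argument.split(",")
        if switch == "-audiodev":
            for prop in props[1:]:
                if prop.startswith("id="):
                    backends.add(prop[len("id="):])
        elif switch == "-device":
            for prop in props[1:]:
                if prop.startswith("audiodev="):
                    references.append((props[0], prop[len("audiodev="):]))
    return [(dev, ref) for dev, ref in references if ref not in backends]


example = [
    "qemu",
    "-audiodev", "pa,id=hda,out.mixing-engine=off",
    "-audiodev", "pa,id=usb,out.mixing-engine=off",
    "-device", "intel-hda", "-device", "hda-output,audiodev=hda",
    "-device", "qemu-xhci", "-device", "usb-audio,audiodev=usb,multi=on",
]
print(dangling_audiodev_refs(example))
```

For the command line from this section the list is empty, since both hda and usb are defined.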


by Gerd Hoffmann at January 07, 2020 11:00 PM

December 22, 2019

Stefano Garzarella

QEMU 4.2 mmap(2)s kernel and initrd

In order to save memory and boot time, QEMU 4.2 and later versions are able to mmap(2) the kernel and initrd specified with -kernel and -initrd parameters. This approach allows us to avoid reading and copying them into a buffer, saving memory and time.

The memory pages that contain kernel and initrd are shared between multiple VMs using the same kernel and initrd images. So, when many VMs are launched we can save memory by sharing pages, and save time by avoiding reading them each time from the disk.

This feature is automatically used for some targets with ELF kernels (e.g. x86_64 vmlinux ELF image with PVH entry point), but it is not available with compressed kernel images (e.g. bzImage).


The patches that implement this feature are merged upstream and released with QEMU 4.2.

The main change was to map kernel and initrd into the memory, instead of reading them, using g_mapped_file_*() APIs.
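To illustrate the idea outside of QEMU: the sketch below maps a file read-only instead of reading it into a buffer. QEMU itself uses the GLib g_mapped_file_*() APIs from C; Python's mmap module is only used here as an analogy, and the function name map_image is made up.

```python
import mmap


def map_image(path):
    """Map a (non-empty) file read-only instead of copying it into a buffer.

    Pages are loaded lazily on access and are backed by the page cache, so
    several processes mapping the same file can share them.
    """
    with open(path, "rb") as image:
        # Length 0 maps the whole file; the mapping stays valid after the
        # file object is closed.
        return mmap.mmap(image.fileno(), 0, access=mmap.ACCESS_READ)
```

An ELF kernel mapped this way starts with the bytes b"\x7fELF", which can be checked with map_image(path)[:4] without reading the rest of the image.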


We measured the memory footprint and the boot time using a standard QEMU build (qemu-system-x86_64) with a PVH kernel and initrd (cpio):

  • Initrd size: 3.0M
  • Kernel (vmlinux)
    • image size: 28M
    • sections size [size -A -d vmlinux]: 18.9M

Julio Montes did a very good analysis and he posted the results here:

Memory footprint

We used smem to measure USS and PSS:

  • USS (Unique Set Size): amount of memory that is committed to physical memory and is unique to a process; it is not shared with any other. It is the amount of memory that would be freed if the process were to terminate.
  • PSS (Proportional Set Size): This splits the accounting of shared pages that are committed to physical memory between all the processes that have them mapped.
$ smem -k | grep "PID\|$(pidof qemu-system-x86_64)"
  PID User     Command                         Swap      USS      PSS      RSS
24833 qemu     /usr/bin/qemu-system-x86_64        0    71.6M    74.3M    105.2

This is the memory footprint analysis, increasing the number of QEMU instances:

                           Memory footprint [MB]
     QEMU             before                 after
 # instances        USS     PSS           USS     PSS
      1           102.0   105.8         102.3   106.2
      2            94.6   101.2          72.3    90.1
      4            94.1    98.0          72.0    81.5
      8            94.0    96.2          71.8    76.9
     16            93.9    95.1          71.6    74.3

Boot time

We measured the boot time using the qemu-boot-time scripts described in this post.

This is the boot time analysis:

                                   Boot time [ms]
                          before                  after
 # trace points
 qemu_init_end:           63.85                   55.91
 linux_start_kernel:      82.11 (+18.26)          74.51 (+18.60)
 linux_start_user:       169.94 (+87.83)         159.06 (+84.56)


Mapping the kernel and initrd images into memory allowed us to save about 20 MB of memory when multiple VMs are started, and to speed up the boot by about 10 milliseconds.

Note: both gains depend directly on the size of the images.
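The rounded figures can be reproduced from the tables above. This small sketch takes the 16-instance row of the memory table and the boot time trace points and computes the differences; the variable and function names are only for illustration.

```python
# Values copied from the tables above (16 QEMU instances, MB and ms).
memory_before = {"USS": 93.9, "PSS": 95.1}
memory_after = {"USS": 71.6, "PSS": 74.3}
boot_before = {"qemu_init_end": 63.85, "linux_start_user": 169.94}
boot_after = {"qemu_init_end": 55.91, "linux_start_user": 159.06}


def savings(before, after, digits):
    """Return before - after for every key, rounded to the given digits."""
    return {key: round(before[key] - after[key], digits) for key in before}


print(savings(memory_before, memory_after, 1))  # memory saved per instance [MB]
print(savings(boot_before, boot_after, 2))      # boot time saved [ms]
```

This gives 22.3 MB (USS) and 20.8 MB (PSS) per instance, and 10.88 ms until linux_start_user, of which 7.94 ms are already gained at qemu_init_end.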

by Stefano Garzarella at December 22, 2019 04:46 PM

December 20, 2019

Cornelia Huck

A 2019 recap (and a bit of an outlook)

The holiday season for 2019 will soon be upon us, so I decided to do a quick recap of what I consider the highlights for this year, from my perspective, regarding s390x virtualization, and the wider ecosystem.


I attended the following conferences this year.

Linux Plumbers Conference

LPC 2019 was held in Lisbon, Portugal, on September 9-11. Of particular interest for me was the VFIO/IOMMU/PCI microconference. I talked a bit about cross-architecture considerations (and learned about some quirks on other architectures as well); the rest of the topics, while not currently concerning my work directly, were nevertheless helpful to move things forward. As usual at conferences, the hallway track is probably the most important one; I met some new folks, saw others once again, and talked more about s390 I/O stuff than I anticipated. I can recommend this conference for meeting people to talk to about (not only) deeply technical things.

KVM Forum

KVM Forum 2019 was held in Lyon, France, on October 30 - November 1. As usual, a great place to meet people and have discussions in the hallway, e.g. about vfio migration. No talk from me this year, but an assortment of interesting topics presented by others; I contributed to an article about the event. Of note from an s390x perspective were the talks about protected virtualization and nested testing mentioned in the article, and also the presentation on running kvm unit tests beyond KVM.

s390x changes in QEMU and elsewhere

There's a new machine (z15) on the horizon, but support for older things has been introduced or enhanced as well.


Lots of work has gone into tcg to emulate the vector instructions introduced with z13. Distributions are slowly switching to compiling for z13, which means gcc is generating vector instructions. As of QEMU 4.2, you should be able to boot recent distributions under tcg once again.


vfio-ccw has now gained support for sending HALT SUBCHANNEL and CLEAR SUBCHANNEL to the real device; this is useful e.g. for error handling, when you want to make sure an operation is really terminated at the device. Also, it is now possible to boot from a DASD attached via vfio-ccw.

vfio-ap has seen some improvements as well, including support for hotplugging the matrix device and for interrupts.

Guest side

A big change on the guest side of things was support for protected virtualization (also see the talk given at KVM Forum). This is a bit like AMD's SEV, but (of course) different. Current Linux kernels should be ready to run as a protected guest; host side support is still in progress (see below).

Other developments of interest

mdev, mdev everywhere

There has been a lot of activity around mediated devices this year. They have been successfully used in various places in the past (GPUs, vfio-ccw, vfio-ap, ...). A new development is trying to push parts of it into userspace ('muser', see the talk at KVM Forum). An attempt was made to make use of the mediating part without the vfio part, but that was met with resistance. Ideas for aggregation are still being explored.

In order to manage and persist mdev devices, we introduced the mdevctl tool, which is currently included in at least Fedora and Debian.

vfio migration

Efforts to introduce generic migration support for vfio (or at least in the first shot, for pci) are still ongoing. Current concerns mostly revolve around dirty page tracking. It might make sense to take a stab at wiring up vfio-ccw once the interface is stable.

What's up next?

While there probably will be some not-yet-expected developments next year, some things are bound to come around in 2020.

Protected virtualization

Patch sets for KVM and QEMU to support protected virtualization on s390 have already been posted this year; expect new versions of the patch sets to show up in 2020 (and hopefully make their way into the kernel and QEMU, respectively).


Patches to support detecting path status changes and relaying them to the guest have already been posted; expect an updated version to make its way into the kernel and QEMU in 2020. Also likely: further cleanups and bugfixes, and probably some kind of testing support, e.g. via kvm unit tests. Migration support also might be on that list.

virtio-fs support on s390x

Instead of using virtio-9p, virtio-fs is a much better way to share files between host and guest; expect support on s390x to become available once sharing of fds in QEMU without numa becomes possible. Shared memory regions on s390x (for DAX support) still need discussion, however.

by Cornelia Huck at December 20, 2019 03:32 PM

December 17, 2019

KVM on Z

SLES 12 SP5 released

SLES 12 SP5 is out! See this section in the release notes for a detailed look at IBM Z-specific changes in support of KVM.
Otherwise, the update ships the following code levels:
  • Linux kernel 4.12 (SP4: 4.12, unchanged),
  • QEMU v3.1 (SP4: v2.11), and
  • libvirt v5.1 (SP4: v4.0).
See previous blog entries on QEMU v2.12 and v3.0, and libvirt v4.7 and v4.10 for further details on new features that become available by the respective package updates.
Also take note of this article, which adds further details on IBM z15 support.

by Stefan Raspl at December 17, 2019 11:22 AM

December 13, 2019

KVM on Z

Documentation: New Performance Papers

Two new performance papers on KVM network performance were published:
See also the Performance Hot topics page in IBM Knowledge Center.

by Stefan Raspl at December 13, 2019 07:49 AM

QEMU v4.2 released

QEMU v4.2 is out. For highlights from a KVM on Z perspective see the Release Notes.

by Stefan Raspl at December 13, 2019 07:45 AM

QEMU project

QEMU version 4.2.0 released

We would like to announce the availability of the QEMU 4.2.0 release. This release contains 2200+ commits from 198 authors.

You can grab the tarball from our download page. The full list of changes is available in the Wiki.

Highlights include:

  • TCG plugin support for passive monitoring of instructions and memory accesses
  • block: NBD block driver now supports more efficient handling of copy-on-read requests
  • block: NBD server optimizations for copying of sparse images, and general fixes/improvements for NBD server/client implementations
  • block/crypto: improved performance for AES-XTS encryption for LUKS disk encryption
  • vfio-pci support for “failover_pair_id” property for easier migration of VFIO devices
  • virtio-mmio now supports virtio-compatible v2 personality and virtio 1.1 support for packed virtqueues
  • 68k: new “next-cube” machine for emulating a classic NeXTcube
  • 68k: new “q800” machine for emulating Macintosh Quadra 800
  • ARM: new “ast2600-evb” machine for emulating Aspeed AST2600 SoC
  • ARM: semihosting v2.0 support with STDOUT_STDERR/EXIT_EXTENDED extensions
  • ARM: KVM support for more than 256 CPUs
  • ARM: “virt” machine now supports memory hotplugging
  • ARM: improved TCG emulation performance
  • ARM: KVM support for SVE SIMD instructions on SVE-capable hardware
  • PowerPC: emulation support for mffsce, mffscrn, and mffscrni POWER9 instructions
  • PowerPC: “powernv” machine now supports Homer and OCC SRAM system devices
  • RISC-V: “-initrd” argument now supported
  • RISC-V: debugger can now see all architectural state
  • s390: emulation support for IEP (Instruction Execution Protection)
  • SPARC: “sun4u” IOMMU now supports “invert endianness” bit
  • x86: VMX features can be enabled/disabled via “-cpu” flags
  • x86: new “microvm” machine that uses virtio-mmio instead of PCI for use as baseline for performance optimizations
  • x86: emulation support for AVX512 BFloat16 extensions
  • x86: new CPU models for Denverton (server-class Atom-based SoC), Snowridge, and Dhyana
  • x86: macOS Hypervisor.framework support (“-accel hvf”) now considered stable
  • xtensa: new “virt” machine type
  • xtensa: call0 ABI support for user-mode emulation
  • and lots more…

Thank you to everyone involved!

December 13, 2019 06:00 AM

December 09, 2019

KVM on Z

KVM on IBM z15 Features

To take advantage of the new features of z15, the latest addition to the IBM Z family as previously announced here, use any of the following CPU models in your guest's domain XML:
  • Pre-defined model for z15
      <cpu mode='custom'>
  • Use z15 features in a migration-safe way (recommended). E.g. when running on z15 this will be a superset of the gen15a model, and feature existence will be verified on the target system prior to a live migration:
      <cpu mode='host-model'/>
  • Use z15 features in a non-migration-safe way. I.e. feature existence will not be verified on the target system prior to a live migration:
      <cpu mode='host-passthrough'/>
Here is a list of features of the new hardware generation as supported in Linux kernel 5.2 and QEMU 4.1, all activated by default in the CPU models listed above:
  • Miscellaneous Instructions
    Following the example of previous machines, new helper and general purpose instructions were added.
    In the z15 CPU models, the respective feature is:
      minste3     Miscellaneous-Instruction-Extensions Facility 3 
  • SIMD Extensions
    Following up on the SIMD instructions introduced with the previous z13 and z14 models, this feature provides further vector instructions, which can again be used in KVM guests.
    These new vector instructions can be used to improve decimal calculations as well as for implementing high performance variants of certain cryptographic operations.
    In the z15 CPU models, the respective features are:
      vxpdeh      Vector-Packed-Decimal-Enhancement Facility
      vxeh2       Vector enhancements facility 2
  • Deflate Conversion
    Provides acceleration for zlib compression and decompression.
    In the z15 CPU model, the respective feature is:
      dflt        Deflate conversion facility
  • MSA Updates
    z15 introduces a new Message Security Assist MSA9, providing elliptic curve cryptography. It supports message authentication, the generation of elliptic curve keys, and scalar multiplication.
    This feature can be exploited in KVM guests' kernels and userspace applications independently (i.e. a KVM guest's userspace applications can take advantage of these features irrespective of the guest's kernel version).
    In the z15 CPU model, the respective features are:
      msa9        Message-security-assist-extension 9 facility
      msa9_pckmo  Message-security-assist-extension 9 PCKMO
                  subfunctions for protected ECC keys
The z15 CPU model was backported into several Linux distributions. It is readily available in RHEL8.1, SLES 12 SP5, SLES 15 SP1 (via maintweb updates for kernel and qemu) and Ubuntu 18.04.
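When auditing existing guests, the CPU mode can be read straight from the domain XML. The sketch below maps the mode to the migration behavior described in the list above. The function names are hypothetical, the custom mode is deliberately left out because its migration behavior is not described here, and the XML is assumed to have <domain> as its root element.

```python
import xml.etree.ElementTree as ET

# Whether feature existence is verified on the target system prior to a
# live migration, as described for the two modes above.
VERIFIED_BEFORE_MIGRATION = {
    "host-model": True,
    "host-passthrough": False,
}


def cpu_mode(domain_xml):
    """Return the mode attribute of the <cpu> element, or None if absent."""
    cpu = ET.fromstring(domain_xml).find("cpu")
    return None if cpu is None else cpu.get("mode")


def verified_before_migration(domain_xml):
    """Return True/False for the modes covered above, None otherwise."""
    return VERIFIED_BEFORE_MIGRATION.get(cpu_mode(domain_xml))


print(verified_before_migration("<domain><cpu mode='host-model'/></domain>"))
```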

by Stefan Raspl at December 09, 2019 03:29 PM

Webinar: Dynamically provisioning resources to KVM hosted Linux virtual servers

Are you just getting started with KVM? In this session we review dynamic provisioning and de-provisioning of processor, memory, disk storage, and network connectivity for your KVM host and the guest virtual servers. Examples include both graphical management and command line interfaces. Come and see how simple it can be to manage your servers.

Richard Young, Executive IT Specialist, IBM Systems

Register here. You can check the system requirements here.
After registering, you will receive a confirmation email containing information about joining the webinar.

Replay & Archive
All sessions are recorded. For the archive as well as a replay and handout of this session and all previous webcasts see here.

by Stefan Raspl at December 09, 2019 01:17 PM

December 06, 2019

KVM on Z

Documentation: KVM Virtual Server Management Update

Intended for KVM virtual server administrators, this book illustrates how to set up, configure, and operate Linux on KVM instances and their virtual devices running on the KVM host and IBM Z hardware.

This major update includes information about VFIO pass-through devices, virtual server setup with the virt-install command, setting up virtual switches with VNIC characteristics, and other features that are available with the latest versions of QEMU and libvirt.

You can download the .pdf here.

by Stefan Raspl at December 06, 2019 07:14 AM

December 01, 2019

Fabiano Fidêncio

Adopting GitLab workflow

In October 2018 there was a short face-to-face meeting with a big part of the libosinfo maintainers, some contributors, and some users.

This short meeting took place during a lunch break on one of the KVM Forum 2018 days and, among other things, we discussed whether we should allow, and / or prefer, receiving patches through GitLab Merge Requests.

Here’s the announcement:

[Libosinfo] Merge Requests are enabled!

    From: Fabiano Fidêncio <fidencio redhat com>
    To: "libosinfo redhat com" <libosinfo redhat com>
    Subject: [Libosinfo] Merge Requests are enabled!
    Date: Fri, 21 Dec 2018 16:48:14 +0100


Although the preferred way to contribute to libosinfo, osinfo-db and
osinfo-db-tools is still sending patches to this ML, we've decided to
also enable Merge Requests on our gitlab!

Best Regards,
Fabiano Fidêncio

Now, one year after that decision, let’s check what has been done, review some numbers, and discuss my take, as one of the maintainers, on the decision we made.

2019, the experiment begins …

After the e-mail shown above was sent, I kept using the mailing list as the preferred way to submit and review patches, while keeping an eye on GitLab Merge Requests, until August 2019, when I fully switched to using GitLab instead of the mailing list.

… and what changed? …

Well, to be honest, not much. But in order to elaborate a little bit, I first have to describe my less-than-optimal workflow.

Even before describing my workflow, let me just make clear that:

  • I don’t have any scripts that would fetch the patches from my e-mail and apply them automagically for me;

  • I never ever got used to text-based mail clients (I’m a former Evolution developer, and I’ve been an Evolution user for several years);

Knowing those things, this is what my workflow looks like:

  • Development: I’ve been using GitLab for a few years as the main host of my forks of the projects I contribute to. When developing a new feature, I would:

    • Create a new branch;
    • Do the needed changes;
    • Push the new branch to the project on my GitLab account;
    • Submit the patches;
  • Review: It may sound weird, maybe it really is, but the way I do review patches is by:

    • Getting the patches submitted;
    • Applying atop of master;
    • Doing a git rebase -i so I can go through each one of the patches;
    • Then, for each one of the patches I would:
      • Add comments;
      • Do fix-up changes;
      • Squash my fixes atop of the original patch;
      • Move to the next patch;

And now, knowing my workflow, I can tell that pretty much nothing changed.

As part of the development workflow:

  • Submitting patches:

    • git publish -> click in the URL printed when a new branch is pushed to GitLab;
  • Reviewing patches:

    • Saving patch e-mails as mbox, applying them to my tree -> pull the MR

Everything else stays pretty much the same. I still do a git rebase -i and go through the patches, adding comments / fix-ups which, later on, I’ll have to organise and paste somewhere (either replying to the e-mail or adding to GitLab’s web UI), and that’s the part that consumes most of my time.

However, although the change was not big to me as a developer, some people had to adapt their workflow in order to start reviewing all the patches I’ve been submitting to GitLab. But let’s approach this later on … :-)

Anyways, it’s important to make it crystal clear that this is my personal experience and that I do understand that people who rely more heavily on text-based mail clients and / or with a bunch of scripts tailored for their development would have a different, way way different, experience.

… do we have more contributions since the switch? …

As of November 26th, I’ve checked the number of submissions we had on both the libosinfo mailing list and the libosinfo GitLab page during the current year.

Mind that I’m not counting my own submissions and that I’m counting each osinfo-db addition, which usually consists of adding data & tests, as a single submission.

As for the mailing list, we’ve received 32 patches; as for GitLab, we’ve received 34 patches.

Quite a similar number of contributions; let’s dig a little bit more.

The 32 patches sent to our mailing list came from 8 different contributors, and all of them had at least one previous patch merged in one of the libosinfo projects.

The 34 patches sent to our GitLab came from 15 different contributors and, from those, only 6 of them had at least one previous patch merged in one of the libosinfo projects, whilst 9 of them were first time contributors (and I hope they’ll stay around, I sincerely do ;-)).
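Put as ratios, the difference becomes more visible. This is a quick back-of-the-envelope sketch using only the numbers quoted above; the dictionary layout and the function name are made up for illustration.

```python
# Submissions during 2019, as counted above (own patches excluded).
channels = {
    "mailing list": {"patches": 32, "contributors": 8, "first_time": 0},
    "GitLab": {"patches": 34, "contributors": 15, "first_time": 9},
}


def summarize(stats):
    """Return (patches per contributor, share of first-time contributors)."""
    per_contributor = round(stats["patches"] / stats["contributors"], 2)
    first_time_share = round(stats["first_time"] / stats["contributors"], 2)
    return per_contributor, first_time_share


for name, stats in channels.items():
    per_contributor, first_time_share = summarize(stats)
    print(f"{name}: {per_contributor} patches per contributor, "
          f"{first_time_share:.0%} first-time contributors")
```

So the mailing list saw 4 patches per contributor and no newcomers, while GitLab saw about 2.27 patches per contributor, with 60% of the contributors being first-timers.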

Maybe one thing to consider here is whether forking a project on GitLab is easier than subscribing to a new mailing list when submitting a patch. This is something people usually do once per project they contribute to, but subscribing to a mailing list may actually be a barrier.

Some people would argue, though, that the barrier goes both ways, mainly considering that one may extensively contribute to projects using one or the other workflow. IMHO, that’s not exactly true. Subscribing to a mailing list and getting the patches correctly formatted feels more difficult than forking a repo and submitting a Merge Request.

In my personal case, the only projects I contribute to which still haven’t adopted the GitLab / GitHub workflow are the libvirt ones, although that may change in the near future, as mentioned by Daniel P. Berrangé in his KVM Forum 2019 talk.

… what are the pros and cons? …

When talking about the “pros” and “cons”, it is really hard to tell exactly which ones are objective and which ones are subjective.

  • pros

    • CI: The possibility to have a CI running for all libosinfo projects, running the tests we have on each MR, without any effort / knowledge of the contributor about this;

    • Tracking non-reviewed patches: Although this one may be related to each one’s workflow, it’s objective that figuring out which Merge Requests need review on GitLab is way easier for a new contributor than navigating through a mailing list;

    • Centralisation: This is one of the subjective ones, for sure. For libosinfo we have adopted GitLab as its issue tracker as well, which makes my life as maintainer quite easy as I have “Issues” and “Merge Requests” in a single place. It may not be true for different projects, though.

  • cons

    • Reviewing commit messages: It seems to be impossible to review commit messages, unless you make a comment about that. Making a comment, though, is not exactly practical as I cannot go specifically to the line I want to comment on and make a suggestion.

    • Creating an account on yet another service: This is another one of the subjective ones. It bothers me a lot, having to create an account on a different service in order to contribute to a project. This is my case with GitLab, GNOME GitLab, and GitHub. However, is that different from subscribing to a few different mailing lists? :-)
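
As a side note on the “Tracking non-reviewed patches” point, here is a minimal Python sketch of how one could list open Merge Requests that nobody has commented on yet, using GitLab’s REST API (`GET /projects/:id/merge_requests`). The project path, the helper names, and the idea of treating a zero `user_notes_count` as “not reviewed yet” are all assumptions made for illustration; it is not how libosinfo tracks reviews.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

GITLAB_API = "https://gitlab.com/api/v4"


def open_mrs_url(project_path, per_page=50):
    """Build the URL listing the open Merge Requests of a project."""
    # The API accepts the URL-encoded "namespace/project" path as the id
    project_id = quote(project_path, safe="")
    query = urlencode({"state": "opened", "per_page": per_page})
    return f"{GITLAB_API}/projects/{project_id}/merge_requests?{query}"


def without_discussion(merge_requests):
    """Keep only the Merge Requests that have no user comments yet.

    Assumption: no comments is used as a rough proxy for "not reviewed".
    """
    return [mr for mr in merge_requests if mr.get("user_notes_count", 0) == 0]


def fetch_open_mrs(project_path):
    """Fetch the first page of open Merge Requests (public projects only)."""
    with urlopen(open_mrs_url(project_path)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical project path, used only as an example
    for mr in without_discussion(fetch_open_mrs("libosinfo/libosinfo")):
        print(f"!{mr['iid']} {mr['title']}")
```

This only looks at the first page of results and ignores approvals, so take it as a starting point rather than a complete review queue.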

Those are, for me, the most prominent “pros” and “cons”. There are a few other things that I’ve seen people complain about, the most common one being related to changing their workflow. And this is something worth its own section! :-)

… is there something out there to make my workflow change easier? …

Yes and no. That’s a horrible answer, ain’t it? :-)

Daniel P. Berrangé has created a project called Bichon, which is a tool providing a terminal based user interface for reviewing GitLab Merge Requests.

Cool, right? In general, yes. But you have to keep in mind that the project is still in its embryonic stage. When it’s more mature, I’m pretty sure it’ll help people used to the mailing list workflow to easily adapt to the GitLab workflow without giving up the convenience of doing everything via the command line.

I’ve been using the tool for simple things, and I’ve been contributing to the tool with simple patches. It’s fair to say that I do prefer adding a comment to Merge Requests, Approving, and Merging them using Bichon rather than via the web UI. Is the tool enough to meet everyone’s needs? Of course not. Will it be? Hardly. But it may be enough to overcome the blockers of migrating away from the mailing list workflow.

… a few words from different contributors …

I’ve decided to ask Cole Robinson and Felipe Borges for a word or two about this subject as they are contributors / reviewers of libosinfo projects.

It should go without saying that their opinions should not be taken as “this workflow is better than the other”. However, take their words as valid points from people who are heavily using one workflow or the other, as Cole Robinson comes from the libvirt / virt-tools world, which relies heavily on mailing lists, and Felipe Borges comes from the GNOME world, which is a huge GitLab consumer.

“The change made things different for me, slightly worse but not in any disruptive way. The main place I feel the pain is putting review comments into a web UI rather than inline in email which is more natural for me. For a busier project than libosinfo I think the pain would ramp up, but it would also force me to adapt more. I’m still largely maintaining an email based review workflow and not living in GitLab / GitHub” - Cole Robinson

“The switch to Gitlab has significantly lowered the threshold for people getting started. The mailing list workflow has its advantages but it is an entry barrier for new contributors that don’t use native mail clients and that learned the Pull Request workflow promoted by GitLab/GitHub. New contributors now can easily browse the existing Issues and find something to work on, all in the same place. Reviewing contributions with inline discussions and being able to track the status of CI pipelines in the same interface is definitely a must. I’m sure Libosinfo foresees an increase in the number of contributors without losing existing ones, considering that another advantage of Gitlab is that it allows developers to interact with the service from email, similarly to the email-driven git workflow that we were using before.” - Felipe Borges

… is there any conclusion from the author’s side?

First of all, I have to emphasize two points:

  • Avoid keeping both workflows: Although we do that on libosinfo, it’s something I’d strongly discourage. It’s almost impossible to keep the information in sync in both places in a reasonable way.

  • Be aware of changes, be open to changes: As mentioned above, migrating from one workflow to another will be disruptive at some level. Is it actually a blocker? Although it was not for me, it may be for you. The thing to keep in mind here is to be aware of changes and welcome them, knowing you won’t have a 1:1 replacement for your current workflow.

With that said, I’m mostly happy with the change made. The number of old time contributors has not decreased and, at the same time, the number of first time contributors has increased.

Another interesting fact is that the number of contributions using the mailing list has decreased, as we’ve only had 4 contributions through this means since June 2019.

Well, that’s all I have to say about the topic. I sincerely hope that reading through this content somehow helps your project and its contributors get a better idea of the migration.

December 01, 2019 12:00 AM

Last updated: March 29, 2020 12:00 AM