Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools


Planet Feeds

August 05, 2020

Marcin Juszkiewicz

So your hardware is ServerReady?

Recently I changed my assignment at Linaro, from Cloud to Server Architecture. That means less time spent on Kolla things and more on server-related things. And at the start I got a project I had managed to forget about :D

SBSA reference platform in QEMU

In 2017 someone got the idea to make a new machine for QEMU: pure hardware emulation of an SBSA-compliant reference platform, without using any virtio components.

Hongbo Zhang wrote the code and got it merged into QEMU; Radosław Biernacki wrote basic support for EDK2 (also merged upstream). Out of the box it can boot to the UEFI shell. Linux is not bootable due to the lack of ACPI tables (DeviceTree is not an option here).

ACPI tables in firmware

Tanmay Jagdale works on adding ACPI tables in his fork of edk2-platforms. With this firmware Linux boots and can be used.

Testing tools

But what is the point of just having a reference platform if there is no testing? So I took a look and found two interesting tools:

Server Base System Architecture — Architecture Compliance Suite

The SBSA ACS tool requires ACPI tables to be present in order to work. Once started, it nicely checks how compliant your system is:

FS0:\> Sbsa.efi -p

 SBSA Architecture Compliance Suite
    Version 2.4

 Starting tests for level  4 (Print level is  3)

 Creating Platform Information Tables
 PE_INFO: Number of PE detected       :    3
 GIC_INFO: Number of GICD             :    1
 GIC_INFO: Number of ITS              :    1
 TIMER_INFO: Number of system timers  :    0
 WATCHDOG_INFO: Number of Watchdogs   :    0
 PCIE_INFO: Number of ECAM regions    :    2
 SMMU_INFO: Number of SMMU CTRL       :    0
 Peripheral: Num of USB controllers   :    1
 Peripheral: Num of SATA controllers  :    1
 Peripheral: Num of UART controllers  :    1

      ***  Starting PE tests ***
   1 : Check for number of PE            : Result:  PASS
   2 : Check for SIMD extensions                PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   3 : Check for 16-bit ASID support            PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   4 : Check MMU Granule sizes                  PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   5 : Check Cache Architecture                 PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   6 : Check HW Coherence support               PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   7 : Check Cryptographic extensions           PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   8 : Check Little Endian support              PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   9 : Check EL2 implementation                 PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  10 : Check AARCH64 implementation             PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  11 : Check PMU Overflow signal         : Result:  PASS
  12 : Check number of PMU counters             PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1
  13 : Check Synchronous Watchpoints            PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  14 : Check number of Breakpoints              PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  15 : Check Arch symmetry across PE            PSCI_CPU_ON: failure

       Reg compare failed for PE index=1 for Register: CCSIDR_EL1
       Current PE value = 0x0         Other PE value = 0x100FBDB30E8
       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  16 : Check EL3 implementation                 PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  17 : Check CRC32 instruction support          PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  18 : Check for PMBIRQ signal
       SPE not supported on this PE      : Result:  -SKIPPED- 1
  19 : Check for RAS extension                  PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1
  20 : Check for 16-Bit VMID                    PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1
  21 : Check for Virtual host extensions        PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1
  22 : Stage 2 control of mem and cache         PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure
: Result:  -SKIPPED- 1
  23 : Check for nested virtualization          PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure
: Result:  -SKIPPED- 1
  24 : Support Page table map size change       PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure
: Result:  -SKIPPED- 1
  25 : Check for pointer signing                PSCI_CPU_ON: failure

  25 : Check for pointer signing                PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure
: Result:  -SKIPPED- 1
  26 : Check Activity monitors extension        PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure
: Result:  -SKIPPED- 1
  27 : Check for SHA3 and SHA512 support        PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure
: Result:  -SKIPPED- 1

      *** One or more PE tests have failed... ***

      ***  Starting GIC tests ***
 101 : Check GIC version                 : Result:  PASS
 102 : If PCIe, then GIC implements ITS  : Result:  PASS
 103 : GIC number of Security states(2)  : Result:  PASS
 104 : GIC Maintenance Interrupt
       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1

      One or more GIC tests failed. Check Log

      *** Starting Timer tests ***
 201 : Check Counter Frequency           : Result:  PASS
 202 : Check EL0-Phy timer interrupt     : Result:  PASS
 203 : Check EL0-Virtual timer interrupt : Result:  PASS
 204 : Check EL2-phy timer interrupt     : Result:  PASS
 205 : Check EL2-Virtual timer interrupt
       v8.1 VHE not supported on this PE : Result:  -SKIPPED- 1
 206 : SYS Timer if PE Timer not ON
       PE Timers are not always-on.
       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1
 207 : CNTCTLBase & CNTBaseN access
       No System timers are defined      : Result:  -SKIPPED- 1

     *** Skipping remaining System timer tests ***

      *** One or more tests have Failed/Skipped.***

      *** Starting Watchdog tests ***
 301 : Check NS Watchdog Accessibility
       No Watchdogs reported          0
       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1
 302 : Check Watchdog WS0 interrupt
       No Watchdogs reported          0
       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1

      ***One or more tests have failed... ***

      *** Starting PCIe tests ***
 401 : Check ECAM Presence               : Result:  PASS
 402 : Check ECAM value in MCFG table    : Result:  PASS

        Unexpected exception occured
        FAR reported = 0xEBDAB180
        ESR reported = 0x97800010
     Total Tests run  =   42;  Tests Passed  =   11  Tests Failed =   22

      *** SBSA tests complete. Reset the system. ***

As you can see there is still a lot of work to do.
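
A log like this is easier to follow over time when the results are tallied. Here is a minimal Python sketch of such a tally. It is not part of SBSA ACS or of any CI job; the function name tally_results is made up for illustration, and it assumes the result markers always look like the ones in the log above ("Result:  PASS", "Result:  --FAIL-- 129", "Result:  -SKIPPED- 1").

```python
import re

# Matches the result markers as they appear in the log above.
RESULT_RE = re.compile(r"Result:\s+(PASS|--FAIL--|-SKIPPED-)")

# Map the raw markers to friendlier counter names.
NAMES = {"PASS": "pass", "--FAIL--": "fail", "-SKIPPED-": "skipped"}


def tally_results(log_text):
    """Count PASS / FAIL / SKIPPED result markers in an SBSA ACS log."""
    counts = {"pass": 0, "fail": 0, "skipped": 0}
    for match in RESULT_RE.finditer(log_text):
        counts[NAMES[match.group(1)]] += 1
    return counts


# A few lines copied from the log above.
sample = """\
   1 : Check for number of PE            : Result:  PASS
       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
       SPE not supported on this PE      : Result:  -SKIPPED- 1
 101 : Check GIC version                 : Result:  PASS
"""

print(tally_results(sample))  # {'pass': 2, 'fail': 1, 'skipped': 1}
```

The counts produced this way may differ from the "Total Tests run" line the tool prints itself, since the sketch only counts the markers it finds in the text.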

ACPI Tables View

This tool displays the content of ACPI tables in hex/ASCII format, followed by the same information interpreted field by field.

What makes it more useful is the “-r 2” argument, as it enables checking tables against the Server Base Boot Requirements (SBBR) v1.2 specification. On the SBSA reference platform with Tanmay’s firmware it lists two errors:

ERROR: SBBR v1.2: Mandatory DBG2 table is missing
ERROR: SBBR v1.2: Mandatory PPTT table is missing

Table Statistics:
        2 Error(s)
        0 Warning(s)

So the situation looks good, as those can be easily added.


So we have code to check and tools to do that. Add one to the other and you have a clear need for a CI job. So I wrote one for the Linaro CI infrastructure: “LDCG SBSA firmware”. It builds top-of-tree QEMU and EDK2, then boots the result and runs the above tools. Results are sent to a mailing list.


The Arm ServerReady compliance program provides a solution for servers that “just works”, allowing partners to deploy Arm servers with confidence. The program is based on industry standards and the Server Base System Architecture (SBSA) and Server Base Boot Requirement (SBBR) specifications, alongside Arm’s Server Architectural Compliance Suite (ACS). Arm ServerReady ensures that Arm-based servers work out-of-the-box, offering seamless interoperability with standard operating systems, hypervisors, and software.

In other words: if your hardware is SBSA compliant, then you can go through the SBBR compliance tests and then go and ask for a certification sticker or something like that.

But if your hardware is not SBSA compliant, then EBBR is all you can get. Far from being ServerReady. Never mind what people try to say: ServerReady requires SBBR, which requires SBSA.

Future work

More tests to integrate. ARM Enterprise ACS is next on my list.

by Marcin Juszkiewicz at August 05, 2020 11:53 AM

July 29, 2020

Cornelia Huck

Configuring mediated devices (Part 2)

In the last part of this article, I talked about configuring a mediated device directly via sysfs. This is a bit cumbersome, and you may want to make your configuration more permanent. Fortunately, there is tooling available for this.

driverctl: bind to the correct driver

driverctl is a tool to manage the driver that a device may bind to. As a device that is supposed to be used via vfio will need to be bound to a vfio driver instead of its 'normal' driver, it makes sense to add some configuration that makes sure that this binding is actually done automatically. While driverctl had originally been implemented to work with PCI devices, the css bus (for subchannel devices) supports management with driverctl as of Linux 5.3 as well. (The ap bus for crypto devices does not support setting driver overrides, as it implements a different mechanism.)

Example (vfio-ccw)

Let's reuse the example from the last post, where we wanted to assign the device behind subchannel 0.0.0313 to the guest. In order to set a driver override, use

[root@host ~]# driverctl -b css set-override 0.0.0313 vfio_ccw

If the subchannel is not already bound to the vfio_ccw driver, it will be unbound from its current driver and bound to vfio_ccw. Moreover, a udev rule to bind the subchannel to vfio_ccw automatically in the future will be added.

Unfortunately, a word of caution regarding the udev rule is in order: As uevents on the css bus for I/O subchannels are delayed until after device recognition has been performed, automatic binding may not work out as desired. We plan to address that in the future by reworking the way the css bus handles uevents; until then, you may have to trigger a rebind manually. Also, keep in mind that the subchannel id for a device may not be stable (as mentioned previously); automation should be used cautiously in that case.

mdevctl: manage mediated devices

The more tedious part of configuring a passthrough setup is configuring and managing mediated devices. To help with that, mdevctl has been written. It can create, modify, and remove mediated devices (and optionally make those changes persistent), work with configurations and devices created via other means, and list mediated devices and the different types that are supported.

Creating a mediated device

In order to create a mediated device, you need a uuid. You can either provide your own (as in the manual case), or let mdevctl pick one for you. In order to get the same configuration as in the manual configuration examples, let's create a vfio-ccw device with the same uuid as before.

The following command defines the same mediated device as in the manual example:

  [root@host ~]# mdevctl define -u 7e270a25-e163-4922-af60-757fc8ed48c6 -p 0.0.0313 -t vfio_ccw-io -a

Note the '-a', which instructs mdevctl to start the device automatically from now on.

After you've created the device, you can check which devices mdevctl is now aware of:

  [root@host ~]# mdevctl list -d
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io

Note that the '-d' instructs mdevctl to show defined, but not started devices.

Let's start the device:

  [root@host ~]# mdevctl start -u 7e270a25-e163-4922-af60-757fc8ed48c6
  [root@host ~]# mdevctl list -d
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io auto (active)

The mediated device is now ready to be used and can be passed to a guest.

Making your configuration persistent

If you already created a mediated device manually, you may want to reuse the existing configuration and make it persistent, instead of starting from scratch.

So, let's create another vfio-ccw the manual way:

  [root@host ~]# uuidgen
  [root@host ~]# echo "b29e4ca9-5cdb-4ee1-a01b-79085b9ab237" > /sys/bus/css/drivers/vfio_ccw/0.0.0314/mdev_supported_types/vfio_ccw-io/create

mdevctl now actually knows about the active device (in addition to the device we configured before):

  [root@host ~]# mdevctl list
  b29e4ca9-5cdb-4ee1-a01b-79085b9ab237 0.0.0314 vfio_ccw-io
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io (defined)

But it obviously does not have a definition for the manually created device:

  [root@host ~]# mdevctl list -d
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io auto (active)

On a restart, the new device would be gone again; but we can make it persistent:

  [root@host ~]# mdevctl define -u b29e4ca9-5cdb-4ee1-a01b-79085b9ab237
  [root@host ~]# mdevctl list
  b29e4ca9-5cdb-4ee1-a01b-79085b9ab237 0.0.0314 vfio_ccw-io (defined)
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io (defined)

If you check under /etc/mdevctl.d/, you will find that an appropriate JSON file has been created:

  [root@host ~]# cat /etc/mdevctl.d/0.0.0314/b29e4ca9-5cdb-4ee1-a01b-79085b9ab237
  {
    "mdev_type": "vfio_ccw-io",
    "start": "manual",
    "attrs": []
  }

(Note that this device is not automatically started by default.)
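
To make the shape of such a definition a bit more tangible, here is a minimal Python sketch that builds the same structure and prints it as JSON. It is only an illustration and not how mdevctl itself works; the helper name mdev_definition is made up, and the three keys are simply the ones visible in the file above.

```python
import json


def mdev_definition(mdev_type, start="manual", attrs=None):
    """Build a dict shaped like the definition file shown above.

    The keys mirror the ones visible in the example file; this helper is
    hypothetical and not part of mdevctl.
    """
    return {
        "mdev_type": mdev_type,
        "start": start,
        "attrs": list(attrs or []),
    }


definition = mdev_definition("vfio_ccw-io")
print(json.dumps(definition, indent=2))
```

Writing such files by hand should not normally be needed; letting mdevctl define create them keeps the definition and the tool's view of it in sync.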

Modifying an existing device

There are good reasons to modify an existing device: you may want to modify your setup, or, in the case of vfio-ap, you need to modify some attributes before being able to use the device in the first place.

Let's first create the device. This command creates the same device as created manually in the last post:

  [root@host ~]# mdevctl define -u "669d9b23-fe1b-4ecb-be08-a2fabca99b71" --parent matrix --type vfio_ap-passthrough
  [root@host ~]# mdevctl list -d
  669d9b23-fe1b-4ecb-be08-a2fabca99b71 matrix vfio_ap-passthrough manual

This device is not yet very useful, as you still need to assign some queues to it. It now looks like this:

  [root@host ~]# mdevctl list -d -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --dumpjson
  {
    "mdev_type": "vfio_ap-passthrough",
    "start": "manual"
  }

Let's modify the device and add some queues:

  [root@host ~]# mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_adapter --value=5
  [root@host ~]# mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_domain --value=4
  [root@host ~]# mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_domain --value=0xab

The device's JSON now looks like this:

  [root@host ~]# mdevctl list -d -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --dumpjson
  {
    "mdev_type": "vfio_ap-passthrough",
    "start": "manual",
    "attrs": [
      {
        "assign_adapter": "5"
      },
      {
        "assign_domain": "4"
      },
      {
        "assign_domain": "0xab"
      }
    ]
  }

This is now exactly what we had defined manually in the last post.

But what if you notice that you want domain 0x42 instead of domain 4? Just modify the definition. To make it easier to figure out how to specify the attribute to manipulate, use this output:

  [root@host ~]# mdevctl list -dv -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71
  669d9b23-fe1b-4ecb-be08-a2fabca99b71 matrix vfio_ap-passthrough manual
    @{0}: {"assign_adapter":"5"}
    @{1}: {"assign_domain":"4"}
    @{2}: {"assign_domain":"0xab"}

You want to remove attribute 1, and add a new value:

  [root@host ~]# mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --delattr --index=1
  [root@host ~]# mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_domain --value=0x42

Let's check that it now looks as desired:

  [root@host ~]# mdevctl list -dv -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71
  669d9b23-fe1b-4ecb-be08-a2fabca99b71 matrix vfio_ap-passthrough manual
    @{0}: {"assign_adapter":"5"}
    @{1}: {"assign_domain":"0xab"}
    @{2}: {"assign_domain":"0x42"}
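
Note how the indices shifted: the new attribute ended up at the end of the list, and the old index 2 became index 1. A minimal Python sketch of that behaviour, modelling the attributes as a plain ordered list (the helpers del_attr and add_attr are made up for illustration and are not mdevctl code):

```python
def del_attr(attrs, index):
    """Return a copy of attrs without the entry at position index."""
    return attrs[:index] + attrs[index + 1:]


def add_attr(attrs, name, value):
    """Return a copy of attrs with a new single-key entry appended."""
    return attrs + [{name: value}]


# The attribute list as it looked before the modification.
attrs = [
    {"assign_adapter": "5"},
    {"assign_domain": "4"},
    {"assign_domain": "0xab"},
]

attrs = del_attr(attrs, 1)                        # like --delattr --index=1
attrs = add_attr(attrs, "assign_domain", "0x42")  # like --addattr=... --value=0x42

for index, entry in enumerate(attrs):
    print(f"@{{{index}}}: {entry}")
```

Since the indices are positional, it is probably a good idea to list the attributes again before every removal rather than relying on an index noted earlier.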

Future development

While mdevctl works perfectly fine for managing individual mediated devices, it does not maintain a view of the complete system. This means you notice conflicts between two devices only when you try to activate the second one. In the case of vfio-ap, the rules to be considered are complex, and there is quite some potential for conflict. In order to catch that kind of problem early, we plan to add callouts to mdevctl, which would, for example, allow a validation tool to be invoked when a new device is added, but before it is activated. This is potentially useful for other device types as well.

by Cornelia Huck at July 29, 2020 12:10 PM

July 28, 2020

KVM on Z

KVM on IBM Z at Virtual SHARE

Don't miss our session dedicated to Secure Execution at this year's virtual SHARE:
Protecting Data In-use With Secure Execution, presented by Reinhard Bündgen, 3:50 PM - 4:35 PM EST on Tuesday, August 4.

by Stefan Raspl at July 28, 2020 07:45 PM

July 27, 2020

Cornelia Huck

Configuring mediated devices (Part 1)

vfio-mdev has become popular over the last few years for assigning certain classes of devices to guests. On the s390x side, vfio-ccw and vfio-ap are using the vfio-mdev framework for making channel devices and crypto adapters accessible to guests.
This and a follow-up article aim to give an overview of the infrastructure, how to set up and manage devices, and how to use tooling for this.

What is a mediated device?

A general overview

Mediated devices grew out of the need to build upon the existing vfio infrastructure in order to support more fine-grained management of resources. Some of the initial use cases included GPUs and (maybe somewhat surprisingly) s390 channel devices.

When using the mediated device (mdev) API, common tasks are performed in the mdev core driver (like device management), while device-specific tasks are done in a vendor driver. Current in-kernel examples of vendor drivers are the Intel vGPU driver, vfio-ccw, and vfio-ap.

Examples on s390


vfio-ccw can be used to assign channel devices. It is pretty straightforward: vfio-ccw is an alternative driver for I/O subchannels, and a single mediated device per subchannel is supported.


vfio-ap can be used to assign crypto cards/queues (APQNs). It is a bit more involved, requiring prior setup on the ap bus level and configuration of a 'matrix' device. Complex relationships between the resources that can be assigned to different guests exist. Configuration-wise, this is probably the most complex mediated device available today.

Configuring a mediated device: the manual way

Mediated devices can be configured manually via sysfs operations. This is a good way to see what actually happens, but probably not what you want to do as a general administration task. Tools to help here will be introduced in part 2 of this article.

I will show the steps for both vfio-ccw and vfio-ap, just to show two different approaches. (Both examples are also used in the QEMU documentation, in case this looks familiar.)

Binding to the correct driver


Assume you want to use a DASD with the device bus ID 0.0.2b09. As vfio-ccw operates on the subchannel level, you first need to locate the subchannel for this device:

   [root@host ~]# lscss | grep 0.0.2b09 | awk '{print $2}'

(A word of caution: a device is not guaranteed to use the same subchannel at all times; on LPARs, the subchannel number will usually be stable, but z/VM -- and QEMU -- assign subchannel numbers in a consecutive order. If you don't get any hotplug events for a device, the subchannel number will stay stable for at least as long as the guest is running, though.)

Now you need to unbind the subchannel device from the default I/O subchannel driver and bind it to the vfio-ccw driver (make sure the device is not in use!):

    [root@host ~]# echo 0.0.0313 > /sys/bus/css/devices/0.0.0313/driver/unbind
    [root@host ~]# echo 0.0.0313 > /sys/bus/css/drivers/vfio_ccw/bind
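
If you want to script this, the two writes can be generated from the subchannel ID. The following Python sketch only uses the two sysfs paths from the commands above; the function names are made up for illustration, and by default it just prints what it would do (actually writing to sysfs needs root and a real subchannel).

```python
CSS_BUS = "/sys/bus/css"


def rebind_steps(subchannel, new_driver="vfio_ccw"):
    """Return the (path, value) sysfs writes for a rebind, in order."""
    return [
        # 1. Unbind from whatever driver currently owns the subchannel.
        (f"{CSS_BUS}/devices/{subchannel}/driver/unbind", subchannel),
        # 2. Bind it to the new driver.
        (f"{CSS_BUS}/drivers/{new_driver}/bind", subchannel),
    ]


def apply_steps(steps, dry_run=True):
    """Print each write; only touch sysfs when dry_run is False."""
    for path, value in steps:
        print(f"echo {value} > {path}")
        if not dry_run:
            with open(path, "w") as attr:
                attr.write(value)


apply_steps(rebind_steps("0.0.0313"))
```

Keeping the dry run as the default makes it harder to unbind a device that is still in use by accident.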


You need to perform some preliminary configuration of your crypto adapters before you can use any of them with vfio-ap. If nothing different has been set up, a crypto adapter will only bind to the default device drivers, and you cannot use it via vfio-ap. In order to be able to bind an adapter to vfio-ap, you first need to modify the /sys/bus/ap/apmask and /sys/bus/ap/aqmask entries. Both are basically bitmasks that indicate that the matching adapter IDs (apmask) or queue indices (aqmask) can only be bound to the default drivers. If you want to use a certain APQN via vfio-ap, you need to unset the respective bits.

Let's assume you want to assign the APQNs (5, 4) and (5, ab). First, you need to make the adapter and the domains available to non-default drivers:

  [root@host ~]# echo -5 > /sys/bus/ap/apmask
  [root@host ~]# echo -4,-0xab > /sys/bus/ap/aqmask

This should result in the devices being bound to the vfio_ap driver (you can verify this by looking for them under /sys/bus/ap/drivers/vfio_ap/).
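
The effect of such a relative update can be pictured with a small model. The Python sketch below treats a mask as the set of IDs reserved for the default drivers and applies "-N" / "+N" tokens to it. This is a simplified illustration under my own assumptions, not the kernel's parser, and the starting set (everything reserved) is hypothetical.

```python
def apply_mask_update(reserved, spec):
    """Apply a relative update such as "-4,-0xab" to a set of reserved IDs.

    "-N" removes N from the set (making it available to non-default
    drivers), "+N" adds it back. Numbers may be decimal or hex.
    """
    result = set(reserved)
    for token in spec.split(","):
        sign, number = token[:1], token[1:]
        if sign not in ("+", "-") or not number:
            raise ValueError(f"bad token: {token!r}")
        ident = int(number, 0)  # base 0 accepts both "4" and "0xab"
        if sign == "-":
            result.discard(ident)
        else:
            result.add(ident)
    return result


# Hypothetical starting point: all 256 queue indices reserved for the
# default drivers.
aqmask = set(range(256))
aqmask = apply_mask_update(aqmask, "-4,-0xab")
print(4 in aqmask, 0xab in aqmask, len(aqmask))  # False False 254
```

An ID that is no longer in the set corresponds to a cleared bit, i.e. a queue index that vfio-ap is allowed to pick up.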

Create a mediated device

The basic workflow is "pick a uuid, create a mediated device identified by it".


For vfio-ccw, the two steps of the basic workflow are enough:

  [root@host ~]# uuidgen
  [root@host ~]# echo "7e270a25-e163-4922-af60-757fc8ed48c6" > \


For vfio-ap, you need a more involved approach. The uuid is used to create a mediated device under the 'matrix' device:

  [root@host ~]# uuidgen
  [root@host ~]# echo "669d9b23-fe1b-4ecb-be08-a2fabca99b71" > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/create

This mediated device will need to collect all APQNs that you want to pass to a specific guest. For that, you need to use the assign_adapter, assign_domain, and possibly assign_control_domain attributes (we'll ignore control domains for simplicity's sake.) All attributes have a companion unassign_ attribute to remove adapters/domains from the mediated device again. You can only assign adapters/domains that you removed from apmask/aqmask in the previous step. To follow up on our example again:

  [root@host ~]# echo 5 > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/assign_adapter
 [root@host ~]# echo 4 > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/assign_domain
 [root@host ~]# echo 0xab > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/assign_domain

If you want to make sure that the mediated device is set up correctly, check via

  [root@host ~]# cat /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/matrix

Configuring QEMU/libvirt

Your mediated device is now ready to be passed to a guest.


For vfio-ccw, let's assume you want the device to show up as device 0.0.1234 in the guest.

For the QEMU command line, use

-device vfio-ccw,devno=fe.0.1234,sysfsdev=\

For libvirt, use the following XML snippet in the <devices> section:

<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-ccw'>
  <source>
    <address uuid='7e270a25-e163-4922-af60-757fc8ed48c6'/>
  </source>
  <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x1234'/>
</hostdev>


For vfio-ap, any APQNs will show up in the guest exactly as they show up in the host (i.e., no remapping is possible).

For the QEMU command line, use

-device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/669d9b23-fe1b-4ecb-be08-a2fabca99b71

For libvirt, use the following XML snippet in the <devices> section:

<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-ap'>
  <source>
    <address uuid='669d9b23-fe1b-4ecb-be08-a2fabca99b71'/>
  </source>
</hostdev>
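
If you generate guest definitions from a script, a snippet like this can be built with Python's standard library. The sketch below is only an illustration: the function name mdev_hostdev_xml is made up, and it assumes the usual libvirt layout in which the uuid address sits inside a <source> element of the <hostdev>.

```python
import xml.etree.ElementTree as ET


def mdev_hostdev_xml(uuid, model):
    """Return a <hostdev> element for a mediated device as an XML string."""
    hostdev = ET.Element(
        "hostdev",
        {"mode": "subsystem", "type": "mdev", "managed": "no", "model": model},
    )
    # The mediated device is referenced by its uuid.
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address", {"uuid": uuid})
    return ET.tostring(hostdev, encoding="unicode")


snippet = mdev_hostdev_xml("669d9b23-fe1b-4ecb-be08-a2fabca99b71", "vfio-ap")
print(snippet)
```

For vfio-ccw you would additionally need the guest-side ccw address element; it is left out here to keep the sketch short.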


All this manual setup is a bit tedious; the next article in this series will look at some of the tooling that is available for mediated devices.

by Cornelia Huck at July 27, 2020 06:55 PM

July 18, 2020

Stefan Hajnoczi

Rethinking event loop integration for libraries

APIs for operations that take a long time are often asynchronous so that applications can continue with other tasks while an operation is running. Asynchronous APIs initiate an operation and then return immediately. The application is notified when the operation completes through a callback or by monitoring a file descriptor for activity (for example, when data arrives on a TCP socket).

Asynchronous applications are usually built around an event loop that waits for the next event and invokes a function to handle the event. Since the details of event loops differ between applications, libraries need to be designed carefully to integrate well with a variety of event loops.

The current model

A popular library with asynchronous APIs is the libcurl file transfer library that is used for making HTTP requests. It has the following (slightly simplified) event loop integration API:

#define CURL_WAIT_POLLIN    0x0001  /* Ready to read? */
#define CURL_WAIT_POLLOUT   0x0004  /* Ready to write? */

int socket_callback(CURL *easy,      /* easy handle */
                    int fd,          /* socket */
                    int what,        /* describes the socket */
                    void *userp,     /* private callback pointer */
                    void *socketp);  /* private socket pointer */

libcurl invokes the application's socket_callback() to start or stop monitoring file descriptors. When the application's event loop detects file descriptor activity, the application invokes libcurl's curl_multi_socket_action() API to let the library process the file descriptor.

There are variations on this theme but generally libraries expose file descriptors and event flags (read/write/error) so the application can monitor file descriptors from its own event loop. The library then performs the read(2) or write(2) call when the file descriptor becomes ready.
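
The pattern is easy to reproduce in a few lines. The following Python sketch uses a pipe and the standard selectors module to stand in for a socket and an application event loop. ToyLibrary and its socket_action() method are hypothetical names, loosely modelled on the libcurl flow described above, not a real API.

```python
import os
import selectors


class ToyLibrary:
    """Hypothetical library that owns a file descriptor and reads it itself."""

    def __init__(self, fd):
        self.fd = fd
        self.received = b""

    def socket_action(self, fd):
        # Step 2: the library performs the read(2) once it is told that
        # the file descriptor is ready.
        self.received += os.read(fd, 4096)


read_fd, write_fd = os.pipe()
lib = ToyLibrary(read_fd)

# The application registers the library's fd with its own event loop.
loop = selectors.DefaultSelector()
loop.register(lib.fd, selectors.EVENT_READ)

os.write(write_fd, b"hello")

# Step 1: wait for fd activity, then hand the fd back to the library.
for key, _events in loop.select(timeout=1):
    lib.socket_action(key.fd)

print(lib.received)

loop.close()
os.close(read_fd)
os.close(write_fd)
```

Note that two separate steps are involved for every piece of data: the wait in the application's event loop and the read inside the library.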

How io_uring changes the picture

The Linux io_uring API can be used to implement traditional event loops that monitor file descriptors. But it also supports asynchronous system calls like read(2) and write(2) (best used when IORING_FEAT_FAST_POLL is available). The latter is interesting because it combines two syscalls into a single efficient syscall:

  1. Waiting for file descriptor activity.
  2. Reading/writing the file descriptor.

Existing applications use syscalls like epoll_wait(2), poll(2), or the old select(2) to wait for file descriptor activity. They can also use io_uring's IORING_OP_POLL_ADD to achieve the same effect.

After the file descriptor becomes ready, a second syscall like read(2) or write(2) is required to actually perform I/O.

io_uring's asynchronous IORING_OP_READ or IORING_OP_WRITE (including variants for vectored I/O or sockets) only requires a single io_uring_enter(2) call. If io_uring sqpoll is enabled then a syscall may not even be required to submit these operations!

To summarize, it's more efficient to perform a single asynchronous read/write instead of first monitoring file descriptor activity and then performing a read(2) or write(2).

A new model

Existing library APIs do not fit the asynchronous read/write approach because they expect the application to wait for file descriptor activity and then for the library to invoke a syscall to perform I/O. A new model is needed where the library tells the application about I/O instead of asking the application to monitor file descriptors for activity.

The library can use a new callback API that lets the application perform asynchronous I/O:

/*
 * The application invokes this callback when an aio operation has completed.
 *
 * @cb_arg: the cb_arg passed to a struct aio_operations function by the library
 * @ret: the return value of the aio operation (negative errno for failure)
 */
typedef void aio_completion_fn(void *cb_arg, ssize_t ret);

/*
 * Asynchronous I/O operation callbacks provided to the library by the
 * application.
 *
 * These functions initiate an I/O operation and then return immediately. When
 * the operation completes the @cb callback is invoked with @cb_arg. Note that
 * @cb may be invoked before the function returns (typically in the case of an
 * early error).
 */
struct aio_operations {
    void (*read)(int fd, void *data, size_t len, aio_completion_fn *cb,
                 void *cb_arg);
    void (*write)(int fd, void *data, size_t len, aio_completion_fn *cb,
                  void *cb_arg);
};

The concept of monitoring file descriptor activity is gone. Instead the API focusses on asynchronous I/O operations that can be implemented by the application however it sees fit.

Applications using io_uring can use IORING_OP_READ and IORING_OP_WRITE to implement asynchronous operations efficiently. Traditional applications can still use their event loops but now also perform the read(2), write(2), etc syscalls on behalf of the library.
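
To illustrate the second case, here is a minimal Python sketch of what a traditional application could hand to the library. It mirrors the read/write callbacks and the completion function from the struct above, but the class name BlockingAioOperations is made up, and the operations are simply performed synchronously before the completion callback is invoked (the API explicitly allows the callback to run before the function returns). A real application would defer the I/O to its event loop or to io_uring.

```python
import os


class BlockingAioOperations:
    """Hypothetical application-side callbacks for an app without io_uring."""

    def read(self, fd, buf, cb, cb_arg):
        # The application, not the library, performs the read(2).
        try:
            data = os.read(fd, len(buf))
        except OSError as err:
            cb(cb_arg, -err.errno)  # negative errno for failure
            return
        buf[:len(data)] = data
        cb(cb_arg, len(data))

    def write(self, fd, data, cb, cb_arg):
        try:
            written = os.write(fd, data)
        except OSError as err:
            cb(cb_arg, -err.errno)
            return
        cb(cb_arg, written)


# "Library" side: it only asks for I/O and receives completions.
results = []


def on_done(cb_arg, ret):
    results.append((cb_arg, ret))


read_fd, write_fd = os.pipe()
ops = BlockingAioOperations()
buf = bytearray(16)

ops.write(write_fd, b"ping", on_done, "write")
ops.read(read_fd, buf, on_done, "read")
print(results, bytes(buf[:4]))

os.close(read_fd)
os.close(write_fd)
```

The library code stays the same no matter how the application implements the operations, which is the point of the new model.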

Some libraries don't need a full set of struct aio_operations callbacks because they only perform I/O in limited ways. For example, a library that only has a Linux eventfd can instead present this simplified API:

/*
 * Return an eventfd(2) file descriptor that the application must read from and
 * call lib_eventfd_fired() when a non-zero value was read.
 */
int lib_get_eventfd(struct libobject *obj);

/*
 * The application must call this function when the eventfd returned by
 * lib_get_eventfd() read a non-zero value.
 */
void lib_eventfd_fired(struct libobject *obj);

Although this simplified API is similar to traditional event loop integration APIs, it is now the application's responsibility to perform the eventfd read(2), not the library's. This way applications using io_uring can implement the read efficiently.

Does an extra syscall matter?

Whether it is worth eliminating the extra syscall depends on one's performance requirements. When I/O is relatively infrequent then the overhead of the additional syscall may not matter.

While working on QEMU I found that the extra read(2) on eventfds causes a measurable overhead.


Splitting file descriptor monitoring from I/O is suboptimal for Linux io_uring applications. Unfortunately, existing library APIs are often designed in this way. Letting the application perform asynchronous I/O on behalf of the library allows a more efficient implementation with io_uring while still supporting applications that use older event loops.

by Unknown at July 18, 2020 11:11 AM

July 12, 2020

Cole Robinson

virt-manager libvirt XML editor UI

virt-manager 2.2.0 was released in June of last year. It shipped with a major new feature: a libvirt XML viewing and editing UI for new and existing domains, pools, volumes, and networks.

Every VM, network, and storage object page has an XML tab at the top. Here's an example with that tab selected from the VM Overview section:

[Image: VM XML editor]

Here's an example of the XML view when just a disk is selected. Note it only shows that single device's libvirt XML:

[Image: Disk XML editor]

By default the XML is not editable; notice the warning at the top of the first image. After editing is enabled, the warning is gone, like in the second image. You can enable editing via Edit->Preferences from the main Manager window. Here's what the option looks like:

XML edit preference

A bit of background: We are constantly receiving requests to expose libvirt XML config options in virt-manager's UI. Some of these knobs are necessary for <1% of users but uninteresting to the rest. Some options are difficult to set from the command line because they must be set at VM install time, which means switching from virt-manager to virt-install, which is not trivial. And so on. When these options aren't added to the UI, it makes life difficult for those affected users. It's also difficult and draining to have these types of justification conversations on the regular.

The XML editing UI was added to relieve some of the pressure on virt-manager developers fielding these requests, and to give more power to advanced virt users. The users that know they need an advanced option are usually comfortable editing the libvirt XML directly. The XML editor doesn't detract from the existing UI much IMO, and it is uneditable by default to prevent less knowledgeable users from getting into trouble. It ain't gonna win any awards for great UI, but the feedback has been largely positive so far.

by Cole Robinson at July 12, 2020 04:00 AM

July 11, 2020

Cole Robinson

virt-convert tool removed in virt-manager.git

The next release of virt-manager will not ship the virt-convert tool; I removed it upstream with this commit.

Here's the slightly edited quote from my original proposal to remove it:

virt-convert takes an ovf/ova or vmx file and spits out libvirt XML. It started as a code drop a long time ago that could translate back and forth between vmx, ovf, and virt-image, a long dead appliance format. In 2014 I changed virt-convert to do vmx -> libvirt and ovf -> libvirt which was a CLI breaking change, but I never heard a peep of a complaint. It doesn't do a particularly thorough job at its intended goal, I've seen 2-3 bug reports in the past 5 years and generally it doesn't seem to have any users. Let's kill it. If anyone has the desire to keep it alive it could live as a separate project that's a wrapper around virt-install but there's no compelling reason to keep it in virt-manager.git IMO

That mostly sums it up. If there are any users of virt-convert out there, you can likely get similar results by extracting the relevant disk image from the .vmx or .ovf config, passing it to virt-manager or virt-install, and letting those tools fill in the defaults. In truth that's about all virt-convert did to begin with.

by Cole Robinson at July 11, 2020 04:00 AM

July 10, 2020

Cornelia Huck

s390x changes in QEMU 5.1

QEMU has entered softfreeze for 5.1, so it is time to summarize the s390x changes in that version.

Protected virtualization

One of the biggest features on the s390/KVM side in Linux 5.7 had been protected virtualization aka secure execution, which basically restricts the (untrusted) hypervisor from accessing all of the guest's memory and delegates many tasks to the (trusted) ultravisor. QEMU 5.1 introduces the QEMU part of the feature.
In order to be able to run protected guests, you need to run on a z15 or a LinuxONE III, with at least a 5.7 kernel. You also need an up-to-date s390-tools installation. Some details are available in the QEMU documentation. For more information about what protected virtualization is, watch this talk from KVM Forum 2019 and this talk from 36C3.


vfio-ccw has also seen some improvements over the last release cycle.
  • Requests that do not explicitly allow prefetching in the ORB are no longer rejected out of hand (although the kernel may still do so, if you run a pre-5.7 version.) The rationale behind this is that most device drivers never modify their channel programs dynamically, and the one common code path that does (IPL from DASD) is already accommodated by the s390-ccw bios. While you can instruct QEMU to ignore the prefetch requirement for selected devices, this is an additional administrative complication for little benefit; it is therefore no longer required.
  • In order to be able to relay changes in channel path status to the guest, two new regions have been added: a schib region to relay real data to stsch, and a crw region to relay channel reports. If, for example, a channel path is varied off on the host, all guests using a vfio-ccw device that uses this channel path now get a proper channel report for it.

Other changes

Other than the bigger features mentioned above, there have been the usual fixes, improvements, and cleanups, both in the main s390x QEMU code and in the s390-ccw bios.

    by Cornelia Huck at July 10, 2020 04:27 PM

    July 03, 2020

    QEMU project

    Anatomy of a Boot, a QEMU perspective

    Have you ever wondered about the process a machine goes through to get to the point of a usable system? This post will give an overview of how machines boot and how this matters to QEMU. We will discuss firmware and BIOSes and the things they do before the OS kernel is loaded and your usable system is finally ready.


    When a CPU is powered up it knows nothing about its environment. The internal state, including the program counter (PC), will be reset to a defined set of values and it will attempt to fetch the first instruction and execute it. It is the job of the firmware to bring a CPU up from the initial few instructions to running in a relatively sane execution environment. Firmware tends to be specific to the hardware in question and is stored on non-volatile memory (memory that survives a power off), usually a ROM or flash device on the computer's main board.

    Some examples of what firmware does include:

    Early Hardware Setup

    Modern hardware often requires configuring before it is usable. For example most modern systems won’t have working RAM until the memory controller has been programmed with the correct timings for whatever memory is installed on the system. Processors may boot with a very restricted view of the memory map until RAM and other key peripherals have been configured to appear in its address space. Some hardware may not even appear until some sort of blob has been loaded into it so it can start responding to the CPU.

    Fortunately for QEMU we don’t have to worry too much about this very low level configuration. The device model we present to the CPU at start-up will generally respond to IO access from the processor straight away.

    BIOS or Firmware Services

    In the early days of the PC era the BIOS or Basic Input/Output System provided an abstraction interface to the operating system which allowed the OS to do basic IO operations without having to directly drive the hardware. Since then the scope of these firmware services has grown as systems become more and more complex.

    Modern firmware often follows the Unified Extensible Firmware Interface (UEFI) which provides services like secure boot, persistent variables and external time-keeping.

    There can often be multiple levels of firmware service functions. For example systems which support secure execution enclaves generally have a firmware component that executes in this secure mode which the operating system can call in a defined secure manner to undertake security sensitive tasks on its behalf.

    Hardware Enumeration

    It is easy to assume that modern hardware is built to be discoverable and all the operating system needs to do is enumerate the various buses on the system to find out what hardware exists. While buses like PCI and USB do support discovery there is usually much more on a modern system than just these two things.

    This process of discovery can take some time as devices usually need to be probed and some time allowed for the buses to settle and the probe to complete. For purely virtual machines operating in on-demand cloud environments you may operate with stripped down kernels that only support a fixed expected environment so they can boot as fast as possible.

    In the embedded world it used to be acceptable to have a similar custom compiled kernel which knew where everything is meant to be. However this was a brittle approach and not very flexible. For example a general purpose distribution would have to ship a special kernel for each variant of hardware you wanted to run on. If you try and use a kernel compiled for one platform that nominally uses the same processor as another platform the result will generally not work.

    The more modern approach is to have a “generic” kernel that has a number of different drivers compiled in which are then enabled based on a hardware description provided by the firmware. This allows flexibility on both sides. The software distribution is less concerned about managing lots of different kernels for different pieces of hardware. The hardware manufacturer is also able to make small changes to the board over time to fix bugs or change minor components.

    The two main methods for this are the Advanced Configuration and Power Interface (ACPI) and Device Trees. ACPI originated from the PC world although it is becoming increasingly common for “enterprise” hardware like servers. Device Trees of various forms have existed for a while with perhaps the most common being Flattened Device Trees (FDT).

    Boot Code

    The line between firmware and boot code is a very blurry one. However from a functionality point of view we have moved from ensuring the hardware is usable as a computing device to finding and loading a kernel which is then going to take over control of the system. Modern firmware often has the ability to boot a kernel directly and in some systems you might chain through several boot loaders before the final kernel takes control.

    The boot loader needs to do 3 things:

    • find a kernel and load it into RAM
    • ensure the CPU is in the correct mode for the kernel to boot
    • pass any information the kernel may need to boot and can’t find itself

    Once it has done these things it can jump to the kernel and let it get on with things.


    The Kernel now takes over and will be in charge of the system from now on. It will enumerate all the devices on the system (again) and load drivers that can control them. It will then locate some sort of file-system and eventually start running programs that actually do work.

    Questions to ask yourself

    Having given this overview of booting here are some questions you should ask when diagnosing boot problems.


    • is the platform fixed or dynamic?
    • is the platform enumeratable (e.g. PCI/USB)?


    • is the firmware built for the platform you are booting?
    • does the firmware need storage for variables (boot index etc)?
    • does the firmware provide a service to kernels (e.g. ACPI/EFI)?


    • is the kernel platform specific or generic?
    • how will the kernel enumerate the platform?
    • can the kernel interface talk to the firmware?

    Final Thoughts

    When users visit the IRC channel to ask why a particular kernel won’t boot, our first response is almost always to check that the kernel is actually matched to the hardware being instantiated. For ARM boards in particular, just being built for the same processor is generally not enough, and hopefully having made it through this post you see why. This complexity is also the reason why we generally suggest using a tool like virt-manager to configure QEMU, as it is designed to ensure the right components and firmware are selected to boot a given system.

    by Alex Bennée at July 03, 2020 10:00 PM

    KVM on Z

    RHEL providing unlimited KVM Guests on IBM Z

    Red Hat has announced a new offering for IBM Z and LinuxONE here.
    Red Hat Enterprise Linux for IBM Z with premium support also includes
    • Red Hat Enterprise Linux Extended Update Support add-on (new)
    • Red Hat Enterprise Linux High Availability add-on (new)
    • Red Hat Smart Management (new)
    • Red Hat Insights (new)
    • Unlimited virtual guests (KVM)

    by Stefan Raspl at July 03, 2020 10:24 AM

    July 02, 2020

    Stefan Hajnoczi

    Avoiding bitrot in C macros

    A common approach to debug messages that can be toggled at compile-time in C programs is:

    #ifdef ENABLE_DEBUG
    #define DPRINTF(fmt, ...) do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
    #else
    #define DPRINTF(fmt, ...)
    #endif

    Usually the ENABLE_DEBUG macro is not defined in normal builds, so the C preprocessor expands the debug printfs to nothing. No messages are printed at runtime and the program's binary size is smaller since no instructions are generated for the debug printfs.

    This approach has the disadvantage that it suffers from bitrot, the tendency for source code to break over time when it is not actively built and used. Consider what happens when one of the variables used in the debug printf is not updated after being renamed:

    - int r;
    + int radius;
    DPRINTF("radius %d\n", r);

    The code continues to compile after r is renamed to radius because the DPRINTF() macro expands to nothing. The compiler does not syntax check the debug printf and misses that the outdated variable name r is still in use. When someone defines ENABLE_DEBUG months or years later, the compiler error becomes apparent and that person is confronted with fixing a new bug on top of whatever they were trying to debug when they enabled the debug printf!

    It's actually easy to avoid this problem by writing the macro differently:

    #ifndef ENABLE_DEBUG
    #define ENABLE_DEBUG 0
    #endif
    #define DPRINTF(fmt, ...) do { \
        if (ENABLE_DEBUG) { \
            fprintf(stderr, fmt, ## __VA_ARGS__); \
        } \
    } while (0)

    When ENABLE_DEBUG is not defined the macro expands to:

    do {
        if (0) {
            fprintf(stderr, fmt, ...);
        }
    } while (0)

    What is the difference? This time the compiler parses and syntax checks the debug printf even when it is disabled. Luckily compilers are smart enough to eliminate dead code, code that cannot be executed, so the binary size remains small.

    This applies not just to debug printfs. More generally, all preprocessor conditionals suffer from bitrot. If an #if ... #else ... #endif can be replaced with equivalent unconditional code then it's often worth doing.

    by Unknown at July 02, 2020 08:33 AM

    June 16, 2020

    Cole Robinson

    virt-manager is deprecated in RHEL (but only RHEL)

    TL;DR: I'm the primary author of virt-manager. virt-manager is deprecated in RHEL8 in favor of cockpit, but ONLY in RHEL8 and future RHEL releases. The upstream project virt-manager is still maintained and is still relevant for other distros.

    Google 'virt-manager deprecated' and you'll find some discussions suggesting virt-manager is no longer maintained, Cockpit is replacing virt-manager, virt-manager is going to be removed from every distro, etc. These conclusions are misinformed.

    The primary source for this confusion is the section 'virt-manager has been deprecated' from the RHEL8 release notes virtualization deprecation section. Relevant quote from the RHEL8.2 docs:

    The Virtual Machine Manager application, also known as virt-manager, has been deprecated. The RHEL 8 web console, also known as Cockpit, is intended to become its replacement in a subsequent release.

    What that means:

    • virt-manager is in RHEL8 and will be there for the lifetime of RHEL8.
    • Red Hat engineering effort assigned to the virt-manager UI has been reduced compared to previous RHEL versions.
    • The tentative plan is to not ship the virt-manager UI in RHEL9.

    Why is this happening? As I understand it, RHEL wants to roughly standardize on Cockpit as their host admin UI tool. It's a great project with great engineers and great UI designers. Red Hat is going all in on it for RHEL and aims to replace the mishmash of system-config-X tools and project specific admin frontends (like virt-manager) with a unified project. (Please note: this is my paraphrased understanding, I'm not speaking on behalf of Red Hat here.)

    Notice though, this is all about RHEL. virt-manager is not deprecated upstream, or deprecated in other distros automatically just because RHEL has made this decision. The upstream virt-manager project continues on and Red Hat continues to allocate resources to maintain it.

    Also, I'm distinguishing virt-manager UI from the virt-manager project, which includes tools like virt-install. I fully expect virt-install to be shipped in RHEL9 and actively maintained (FWIW Cockpit uses it behind the scenes).

    And even if the virt-manager UI is not in RHEL9 repos, it will likely end up shipped in EPEL, so the UI will still be available for install, just through external repos.

    Overall my personal opinion is that as long as libvirt+KVM is in use on Linux desktops and servers, virt-manager will be relevant. I don't expect anything to change in that area any time soon.

    by Cole Robinson at June 16, 2020 05:00 PM

    May 22, 2020

    Stefan Hajnoczi

    How to check VIRTIO feature bits inside Linux guests

    VIRTIO devices have feature bits that indicate the presence of optional features. The feature bit space is divided into core VIRTIO features (e.g. notify on empty), transport-specific features (PCI, MMIO, CCW), and device-specific features (e.g. virtio-net checksum offloading). This article shows how to check whether a feature is enabled inside Linux guests.

    The feature bits are used during VIRTIO device initialization to negotiate features between the device and the driver. The device reports a fixed set of features, typically all the features that the device implementors wanted to offer from the VIRTIO specification version that they developed against. The driver also reports features, typically all the features that the driver developers wanted to offer from the VIRTIO specification version that they developed against.

    Feature bit negotiation determines the subset of features supported by both the device and the driver. A new driver might not be able to enable all the features it supports if the device is too old. The same is true vice versa. This offers compatibility between devices and drivers. It also means that you don't know which features are enabled until the device and driver have negotiated them at runtime.
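The negotiation described above boils down to a bitwise AND of two feature masks. Here is a minimal sketch of that idea; the `negotiate` helper and the example masks are made up for illustration, and only the bit numbers 28 and 29 correspond to real definitions in linux/virtio_ring.h:

```python
# Bit numbers as defined in linux/virtio_ring.h.
VIRTIO_RING_F_INDIRECT_DESC = 28
VIRTIO_RING_F_EVENT_IDX = 29


def negotiate(device_features: int, driver_features: int) -> int:
    """Return the features offered by both sides (hypothetical helper)."""
    return device_features & driver_features


# Assumed example: an older device that only offers indirect descriptors.
device = 1 << VIRTIO_RING_F_INDIRECT_DESC
# Assumed example: a newer driver that also supports the event index.
driver = (1 << VIRTIO_RING_F_INDIRECT_DESC) | (1 << VIRTIO_RING_F_EVENT_IDX)

negotiated = negotiate(device, driver)
print(bool(negotiated & (1 << VIRTIO_RING_F_INDIRECT_DESC)))  # True
print(bool(negotiated & (1 << VIRTIO_RING_F_EVENT_IDX)))      # False
```

The driver supports the event index, but it ends up disabled because the device never offered it.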

    Where to find feature bit definitions

    VIRTIO feature bits are listed in the VIRTIO specification. You can also grep the linux/virtio_*.h header files:

    $ grep 'VIRTIO.*_F_' /usr/include/linux/virtio_*.h
    virtio_ring.h:#define VIRTIO_RING_F_INDIRECT_DESC 28
    virtio_ring.h:#define VIRTIO_RING_F_EVENT_IDX 29
    virtio_scsi.h:#define VIRTIO_SCSI_F_INOUT 0
    virtio_scsi.h:#define VIRTIO_SCSI_F_HOTPLUG 1
    virtio_scsi.h:#define VIRTIO_SCSI_F_CHANGE 2

    Here the VIRTIO_SCSI_F_INOUT (0) constant is for the 1st bit (1ull << 0). Bit-numbering can be confusing because different standards, vendors, and languages express it differently. Here it helps to think of a bit shift operation like 1 << BIT.

    How to check feature bits inside the guest

    The Linux virtio.ko driver that is used for all VIRTIO devices has a sysfs file called features. This file contains the feature bits in binary representation starting with the 1st bit on the left and more significant bits to the right. The reported bits are the subset that both the device and the driver support.

    To check if the virtio-blk device /dev/vda has the VIRTIO_RING_F_EVENT_IDX (29) bit set:

    $ python -c "print('$(</sys/block/vda/device/driver/virtio*/features)'[29])"

    Other device types can be found through similar sysfs paths.
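For checking more than one bit, the one-liner can be expanded into a small script. This is only a sketch: the helper names are made up, `read_features` assumes the sysfs path used above for /dev/vda, and the sample string is invented rather than read from a real device.

```python
import glob
from typing import List


def feature_enabled(features: str, bit: int) -> bool:
    """Check one bit in a sysfs 'features' string.

    The string holds one '0' or '1' character per bit, with bit 0 first.
    """
    features = features.strip()
    return 0 <= bit < len(features) and features[bit] == "1"


def enabled_bits(features: str) -> List[int]:
    """Return the numbers of all bits that are set."""
    return [i for i, c in enumerate(features.strip()) if c == "1"]


def read_features(pattern: str = "/sys/block/vda/device/driver/virtio*/features") -> str:
    """Read the features file of the first matching device (guest only)."""
    paths = glob.glob(pattern)
    if not paths:
        raise FileNotFoundError(pattern)
    with open(paths[0]) as f:
        return f.read()


# Invented sample: a 64-character string with bits 0, 29 and 32 set.
sample = "".join("1" if i in (0, 29, 32) else "0" for i in range(64)) + "\n"

print(feature_enabled(sample, 29))  # True
print(enabled_bits(sample))         # [0, 29, 32]
```

Inside a guest, `feature_enabled(read_features(), 29)` would perform the same VIRTIO_RING_F_EVENT_IDX check as the command above.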

    by Unknown at May 22, 2020 01:46 PM

    May 01, 2020

    Daniel Berrange

    ANNOUNCE: virt-viewer version 9.0 released

    I am happy to announce a new bugfix release of virt-viewer 9.0 (gpg), including experimental Windows installers for Win x86 MSI (gpg) and Win x64 MSI (gpg).

    Signatures are created with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

    With this release the project has moved over to use GitLab for its hosting needs instead of Pagure. Instead of sending patches to the old mailing list, we have adopted modern best practices and now welcome contributions as merge requests, from where they undergo automated CI testing of the build. Bug reports directed towards upstream maintainers should also be filed at the GitLab project now instead of the Red Hat Bugzilla.

    All historical releases are available from:

    Changes in this release include:

    • Project moved to
    • Allow toggling shared clipboard in remote-viewer
    • Fix handling when initial spice connection fails
    • Fix check for govirt library
    • Add bash completion of cli args
    • Improve errors in file transfer dialog
    • Fix ovirt foreign menu storage domains query
    • Prefer TLS certs from oVirt instead of CLI
    • Improve USB device cleanup when Ctrl-C is used
    • Remember monitor mappings across restarts
    • Add a default file extension to screenshots
    • Updated translations
    • Fix misc memory leaks

    by Daniel Berrange at May 01, 2020 05:19 PM

    April 30, 2020

    KVM on Z

    QEMU v5.0 released

    QEMU v5.0 is out. For highlights from a KVM on Z perspective see the Release Notes.

    by Stefan Raspl at April 30, 2020 05:28 PM

    Stefan Hajnoczi

    How the Linux VFS, block layer, and device drivers fit together

    The Linux kernel storage stack consists of several components including the Virtual File System (VFS) layer, the block layer, and device drivers. This article gives an overview of the main objects that a device driver interacts with and their relationships to each other. Actual I/O requests are not covered; instead the focus is on the objects representing the disk.

    Let's start with a diagram of the key data structures and then an explanation of how they work together.

    The Virtual File System (VFS) layer

    The VFS layer is where file system concepts like files and directories are handled. The VFS provides an interface that file systems like ext4, XFS, and NFS implement to register themselves with the kernel and participate in file system operations. The struct file_operations interface is the most interesting for device drivers as we are about to see.

    System calls like open(2), read(2), etc are handled by the VFS and dispatched to the appropriate struct file_operations functions.

    Block device nodes like /dev/sda are implemented in fs/block_dev.c, which forms a bridge between the VFS and the Linux block layer. The block layer handles the actual I/O requests and is aware of disk-specific information like capacity and block size.

    The main VFS concept that device drivers need to be aware of is struct block_device_operations and the struct block_device instances that represent block devices in Linux. A struct block_device connects the VFS inode and struct file_operations interface with the block layer struct gendisk and struct request_queue.

    In Linux there are separate device nodes for the whole device (/dev/sda) and its partitions (/dev/sda1, /dev/sda2, etc). This is handled by struct block_device so that a partition has a pointer to its parent in bd_contains.

    The block layer

    The block layer handles I/O request queues, disk partitions, and other disk-specific functionality. Each disk is represented by a struct gendisk and may have multiple struct hd_struct partitions. There is always part0, a special "partition" covering the entire block device.

    I/O requests are placed into queues for processing. Requests can be merged and scheduled by the block layer. Ultimately a device driver receives a request for submission to the physical device. Queues are represented by struct request_queue.

    The device driver

    The disk device driver registers a struct gendisk with the block layer and sets up the struct request_queue to receive requests that need to be submitted to the physical device.

    There is one struct gendisk for the entire device even though userspace may open struct block_device instances for multiple partitions on the disk. Disk partitions are not visible at the driver level because I/O requests have already had their Logical Block Address (LBA) adjusted with the partition start offset.
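The LBA adjustment can be illustrated with a few lines of arithmetic. This is a sketch of the idea rather than the kernel's code: the `Partition` class, the `to_disk_lba` helper and the sector numbers are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical stand-in for a partition: start and size in sectors."""
    start_sector: int
    nr_sectors: int


def to_disk_lba(part: Partition, lba: int) -> int:
    """Translate a partition-relative LBA into a whole-disk LBA."""
    if not 0 <= lba < part.nr_sectors:
        raise ValueError("LBA outside partition")
    return part.start_sector + lba


# Assumed layout: a first partition starting at sector 2048.
sda1 = Partition(start_sector=2048, nr_sectors=1_000_000)

print(to_disk_lba(sda1, 0))    # 2048
print(to_disk_lba(sda1, 100))  # 2148
```

By the time a request reaches the driver it only carries the whole-disk LBA, which is why the driver never needs to know about partitions.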

    How it all fits together

    The VFS is aware of the block layer struct gendisk. The device driver is aware of both the block layer and the VFS struct block_device. The block layer does not have direct connections to the other components but the device driver provides callbacks.

    One of the interesting aspects is that a device driver may drop its reference to struct gendisk but struct block_device instances may still have their references. In this case no I/O can occur anymore because the driver has stopped the disk and the struct request_queue, but userspace processes can still call into the VFS and struct block_device_operations callbacks in the device driver can still be invoked.

    Thinking about this case is why I drew the diagram and ended up writing about this topic!

    by Unknown at April 30, 2020 04:21 PM

    KVM on Z

    Redbook on KVM on Z

    A new Redbook titled "Virtualization Cookbook for IBM Z Volume 5 KVM" is available now. Among others, it covers tasks such as installation, host configuration and guest deployments for Linux distributions by Red Hat, SUSE and Ubuntu.

    by Stefan Raspl at April 30, 2020 10:41 AM

    April 29, 2020

    QEMU project

    QEMU version 5.0.0 released

    We’d like to announce the availability of the QEMU 5.0.0 release. This release contains 2800+ commits from 232 authors.

    You can grab the tarball from our download page. The full list of changes is available in the Wiki.

    Highlights include:

    • Support for passing host filesystem directory to guest via virtiofsd
    • Live migration support for external processes running on QEMU D-Bus
    • Support for using memory backends for main/“built-in” guest RAM
    • block: support for compressed backup images via block jobs
    • block: qemu-img: ‘measure’ command now supports LUKS images, ‘convert’ command now supports skipping zero’ing of target image
    • block: experimental support for qemu-storage-daemon, which provides access to QEMU block-layer/QMP features like block jobs or built-in NBD server without starting a full VM
    • ARM: support for the following architecture features: ARMv8.1 VHE/VMID16/PAN/PMU ARMv8.2 UAO/DCPoP/ATS1E1/TTCNP ARMv8.3 RCPC/CCIDX ARMv8.4 PMU/RCPC
    • ARM: support for Cortex-M7 CPU
    • ARM: new board support for tacoma-bmc, Netduino Plus 2, and Orangepi PC
    • ARM: ‘virt’ machine now supports vTPM and virtio-iommu devices
    • HPPA: graphical console support via HP Artist graphics device
    • MIPS: support for GINVT (global TLB invalidation) instruction
    • PowerPC: ‘pseries’ machine no longer requires reboot to negotiate between XIVE/XICS interrupt controllers when ic-mode=dual
    • PowerPC: ‘powernv’ machine can now emulate KVM hardware acceleration to run KVM guests while in TCG mode
    • PowerPC: support for file-backed NVDIMMs for persistent memory emulation
    • RISC-V: ‘virt’ and ‘sifive_u’ boards now support generic syscon drivers in Linux to control power/reboot
    • RISC-V: ‘virt’ board now supports Goldfish RTC
    • RISC-V: experimental support for v0.5 of draft hypervisor extension
    • s390: support for Adapter Interrupt Suppression while running in KVM mode
    • and lots more…

    Thank you to everyone involved!

    April 29, 2020 06:00 PM

    KVM on Z

    RHEL 7 Structure A (KVM host support) Support Lifecycle Extended

    Red Hat has updated the Red Hat Enterprise Linux Life Cycle, extending the full support lifecycle for Red Hat Enterprise Linux Structure A on IBM Z to May 31, 2021. See here for details, and here for an entry on the Red Hat Blog, referring to Structure A as "alt packages".
    The Structure A release provides updated kernel, QEMU and libvirt packages to run KVM on IBM Z.

    by Stefan Raspl at April 29, 2020 02:40 PM

    April 27, 2020

    KVM on Z

    Ubuntu 20.04 released

    Ubuntu Server 20.04 is out!
    It ships
    providing Secure Execution as announced here.
    For a detailed list of KVM on Z changes, see the release notes here.

    by Stefan Raspl at April 27, 2020 01:55 PM

    April 24, 2020

    ARM Datacenter Project

    NUMA balancing

    NUMA balancing impact on common benchmarks

    NUMA balancing can lead to performance degradation on NUMA-based arm64 systems when tasks migrate,
    and their memory accesses now suffer additional latency.


    System Information
    Architecture aarch64
    Processor version Kunpeng 920-6426
    CPUs 128
    NUMA nodes 4
    Kernel release 5.6.0+
    Node name ARMv2-3

    Test results


    perf bench -f simple sched pipe  
    Test Result
    numa_balancing-ON 10.012 (usecs/op)
    numa_balancing-OFF 10.509 (usecs/op)


    perf bench -f simple sched messaging -l 10000  
    Test Result
    numa_balancing-ON 6.417 (Sec)
    numa_balancing-OFF 6.494 (Sec)


    perf bench -f simple  mem memset -s 4GB -l 5 -f default  
    Test Result
    numa_balancing-ON 17.438783330964565 (GB/sec)
    numa_balancing-OFF 17.63163114627642 (GB/sec)


    perf bench -f simple futex wake -s -t 1024 -w 1  
    Test Result
    numa_balancing-ON 9.2742 (ms)
    numa_balancing-OFF 9.2178 (ms)


    sysbench cpu --time=10 --threads=64 --cpu-max-prime=10000 run  
    Test Result
    numa_balancing-ON 214960.28 (Events/sec)
    numa_balancing-OFF 214965.55 (Events/sec)


    sysbench memory --memory-access-mode=rnd --threads=64 run  
    Test Result
    numa_balancing-ON 1645 (MB/s)
    numa_balancing-OFF 1959 (MB/s)


    sysbench threads --threads=64 run  
    Test Result
    numa_balancing-ON 4604 (Events/sec)
    numa_balancing-OFF 5390 (Events/sec)


    sysbench mutex --mutex-num=1 --threads=512 run  
    Test Result
    numa_balancing-ON 33.2165 (Sec)
    numa_balancing-OFF 32.1088 (Sec)
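Since the tests report in different units, it can help to express the ON/OFF gap as a relative change. The snippet below is only a small sketch using two of the results above; the `relative_change` helper is our own illustration, and whether a negative change is good or bad depends on whether the metric is a throughput or a duration.

```python
def relative_change(on: float, off: float) -> float:
    """Percentage change of the numa_balancing-ON result relative to OFF."""
    return (on - off) / off * 100.0


# Figures copied from the sysbench results above (higher is better for both).
results = {
    "sysbench memory (MB/s)": (1645, 1959),
    "sysbench threads (Events/sec)": (4604, 5390),
}

for name, (on, off) in results.items():
    print(f"{name}: {relative_change(on, off):+.1f}% with numa_balancing ON")
```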

    by Peter at April 24, 2020 11:04 PM

    April 23, 2020

    ARM Datacenter Project

    LISA-QEMU Presentation

    We recently gave a presentation on LISA-QEMU to the Linaro Toolchain Working Group.

    This presentation highlights our work on LISA-QEMU and provides all the details on what LISA-QEMU is, why we established this project, and how to get up and running creating VMs with the tools we developed.

    Please visit the links below to view the presentation or meeting recording.

    by Rob Foley at April 23, 2020 03:27 PM

    April 22, 2020

    ARM Datacenter Project

    How to debug kernel using QEMU and aarch64 VM.

    QEMU is a great tool to use when you need to debug the kernel.
    There are many recipes online for this too; I have listed a few helpful ones at the end of the article for reference.

    We would like to share our steps for debugging the kernel, focused on aarch64 systems, as some of the steps might be slightly different for this type of system.

    First, create a directory to work in and run these commands to create the flash images:

    dd if=/dev/zero of=flash1.img bs=1M count=64
    dd if=/dev/zero of=flash0.img bs=1M count=64
    dd if=/usr/share/qemu-efi-aarch64/QEMU_EFI.fd of=flash0.img conv=notrunc

    Next, download a QEMU image. We will use an ubuntu image that we previously created.

    We should mention that our procedure involves building our own kernel from scratch, and feeding this image to QEMU.

    Thus the first step is to actually create a QEMU image. We will assume you already have an image to use. If not, check out our articles on:

    We prefer the first procedure using LISA-QEMU since we also have a helpful script to install your kernel into the VM image automatically.

    But don’t worry, if you want to take a different route we will show all the steps for that too!

    Installing Kernel

    You have a few options here. One is to boot the image and install the kernel manually; the other is to use the LISA-QEMU scripts to install it. The command below boots the image in case you want to take the manual approach: boot the image, scp in the kernel (maybe a .deb file), and install it manually with dpkg -i .deb.

    qemu/build/aarch64-softmmu/qemu-system-aarch64 -nographic \
                        -machine virt,gic-version=max -m 2G -cpu max \
                        -netdev user,id=vnet,hostfwd=: \
                        -device virtio-net-pci,netdev=vnet \
                        -drive file=./mini_ubuntu.img,if=none,id=drive0,cache=writeback \
                        -device virtio-blk,drive=drive0,bootindex=0 \
                        -drive file=./flash0.img,format=raw,if=pflash \
                        -drive file=./flash1.img,format=raw,if=pflash -smp 4

    To bring up QEMU with a kernel, typically you will need a kernel image (that you built), an initrd image (built after installing the kernel in your image), and the OS image (created above).

    Keep in mind the below steps assume a raw image. If you have a qcow2, then use qemu-img to convert it to raw first. For example:

    qemu-img convert -O raw my_image.qcow2 my_image_output.raw
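If you are not sure which format an image is in, a small check can decide whether the conversion is needed. The sketch below relies on the fact that qcow and qcow2 files start with the magic bytes "QFI" followed by 0xfb; the helper names is_qcow and ensure_raw are made up for this example, and qemu-img must be installed for the conversion branch to work.

```python
import subprocess

QCOW_MAGIC = b"QFI\xfb"  # first four bytes of a qcow/qcow2 image


def is_qcow(path):
    """Return True if the file starts with the qcow magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == QCOW_MAGIC


def ensure_raw(src, dst):
    """Return the path of a raw image, converting with qemu-img only when needed."""
    if not is_qcow(src):
        return src  # assumed to be raw already, nothing to do
    subprocess.run(["qemu-img", "convert", "-O", "raw", src, dst], check=True)
    return dst
```

For a definitive answer on any image, qemu-img info reports the detected format.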

    Below is how to mount an image to copy out files. You need to copy out the initrd in this case.

    $ mkdir mnt
    $ sudo losetup -f -P ubuntu.img
    $ sudo losetup -l
    NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE  DIO LOG-SEC
    /dev/loop0         0      0         0  0 ubuntu.img   0     512
    $ sudo mount /dev/loop0p2 ./mnt
    $ ls ./mnt/boot
    config-4.15.0-88-generic  grub                          initrd.img-5.5.11             vmlinuz-5.5.11
    config-5.5.11             initrd.img                    initrd.img.old                vmlinuz                    vmlinuz.old
    efi                       initrd.img-4.15.0-88-generic  vmlinuz-4.15.0-88-generic
    $ cp ./mnt/boot/initrd.img-5.5.11 .
    $ sudo umount ./mnt
    $ sudo losetup -d /dev/loop0

    Next, boot the kernel you built with your initrd. Note the kernel you built can be found at arch/arm64/boot/Image.

    This command line will bring up your kernel image with your initrd and your OS Image.

    One item you might need to customize is the “root=/dev/vda2” argument. This tells the kernel where to find your root filesystem. This might vary depending on your VM image.

    qemu/build/aarch64-softmmu/qemu-system-aarch64 -nographic\
                      -machine virt,gic-version=max -m 2G -cpu max\
                      -netdev user,id=vnet,hostfwd=:\
                      -device virtio-net-pci,netdev=vnet\
                      -drive file=./mini_ubuntu.img,if=none,id=drive0,cache=writeback\
                      -device virtio-blk,drive=drive0,bootindex=0\
                      -drive file=./flash0.img,format=raw,if=pflash\
                      -drive file=./flash1.img,format=raw,if=pflash -smp 4\
                      -kernel ./linux/arch/arm64/boot/Image\
                      -append "root=/dev/vda2 nokaslr console=ttyAMA0"\
                      -initrd ./initrd.img-5.5.11 -s -S

    -s tells QEMU to open a gdbserver on TCP port 1234 (it is shorthand for -gdb tcp::1234).
    -S will pause at startup, waiting for the debugger to attach.
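If port 1234 is already taken on your host, the longer -gdb form lets you pick another one. Below is a minimal sketch of how a launcher script might assemble these flags; the function names are hypothetical and not part of QEMU or LISA-QEMU.

```python
def gdb_debug_args(port=1234, wait_for_debugger=True):
    """Return the QEMU arguments that expose a gdbserver on the given TCP port."""
    # -s is shorthand for "-gdb tcp::1234"; any other port needs the long form.
    args = ["-s"] if port == 1234 else ["-gdb", f"tcp::{port}"]
    if wait_for_debugger:
        args.append("-S")  # freeze the CPU at startup until the debugger continues
    return args


def gdb_target_command(port=1234, host="localhost"):
    """Return the matching command to type at the (gdb) prompt."""
    return f"target remote {host}:{port}"
```

For example, gdb_debug_args(4321) yields the arguments -gdb tcp::4321 -S, and gdb_target_command(4321) gives the command to use when attaching.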

    Before we get started debugging, update your ~/.gdbinit with the following:

    add-auto-load-safe-path linux-5.5.11/scripts/gdb/

    In another window, start the debugger. Note, if you are on an x86 host debugging aarch64, then you need to use gdb-multiarch (sudo apt-get install gdb-multiarch). In our case below we are on an aarch64 host, so we just use gdb.

    It’s very important to note that we receive the “done” message below indicating symbols were loaded successfully, otherwise the following steps will not work.

    $ gdb linux-5.5.11/vmlinux
    GNU gdb (Ubuntu 8.1-0ubuntu3.2)
    Reading symbols from linux-5.5.11/vmlinux...done.

    Attach the debugger to the kernel. Remember the -s argument above? It told QEMU to use port :1234. We will connect to it now.

    (gdb) target remote localhost:1234
    Remote debugging using localhost:1234
    0x0000000000000000 in ?? ()

    That’s it. The debugger is connected.

    Now let’s test out the setup.
    Add a breakpoint in the kernel as a test.

    (gdb) hbreak start_kernel
    Hardware assisted breakpoint 1 at 0xffff800011330cdc: file init/main.c, line 577.
    (gdb) c
    Thread 1 hit Breakpoint 1, start_kernel () at init/main.c:577
    577 {
    (gdb) l
    572 {
    573     rest_init();
    574 }
    576 asmlinkage __visible void __init start_kernel(void)
    577 {
    578     char *command_line;
    579     char *after_dashes;
    581     set_task_stack_end_magic(&init_task);

    We hit the breakpoint!

    Remember above that we used the -S option to QEMU? This told QEMU to wait to start running the image until we connected the debugger. Thus once we hit continue, QEMU actually starts booting the kernel.


    by Rob Foley at April 22, 2020 10:51 AM

    April 20, 2020

    Stefan Hajnoczi

    virtio-fs has landed in QEMU 5.0!

    The virtio-fs shared host->guest file system has landed in QEMU 5.0! It consists of two parts: the QEMU -device vhost-user-fs-pci and the actual file server called virtiofsd. Guests need to have a virtio-fs driver in order to access shared file systems. In Linux the driver is called virtiofs.ko and has been upstream since Linux v5.4.

    Using virtio-fs

    Thanks to libvirt virtio-fs support, it's possible to share directory trees from the host with the guest like this:

    <filesystem type='mount' accessmode='passthrough'>
      <driver type='virtiofs'/>
      <binary xattr='on'>
        <lock posix='on' flock='on'/>
      </binary>
      <source dir='/path/on/host'/>
      <target dir='mount_tag'/>
    </filesystem>

    The host /path/on/host directory tree can be mounted inside the guest like this:

    # mount -t virtiofs mount_tag /mnt

    Applications inside the guest can then access the files as if they were local files. For more information about virtio-fs, see the project website.
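When the same share has to be added to several guests, the libvirt filesystem element can be generated instead of typed by hand. The following is a minimal sketch using only the Python standard library; the function name and its parameters are assumptions made for this example, and the result still has to be placed inside the devices section of a domain definition.

```python
import xml.etree.ElementTree as ET


def virtiofs_filesystem_xml(host_dir, mount_tag, xattr=True, locks=True):
    """Build a libvirt <filesystem> element for a virtio-fs share and return it as text."""
    fs = ET.Element("filesystem", type="mount", accessmode="passthrough")
    ET.SubElement(fs, "driver", type="virtiofs")
    binary = ET.SubElement(fs, "binary", xattr="on" if xattr else "off")
    state = "on" if locks else "off"
    ET.SubElement(binary, "lock", posix=state, flock=state)
    ET.SubElement(fs, "source", dir=host_dir)   # directory tree on the host
    ET.SubElement(fs, "target", dir=mount_tag)  # tag used by mount -t virtiofs in the guest
    return ET.tostring(fs, encoding="unicode")
```

Calling virtiofs_filesystem_xml("/path/on/host", "mount_tag") produces an element with the same structure as the snippet shown earlier.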

    How it works

    For the most part, -device vhost-user-fs-pci just facilitates the connection to virtiofsd where the real work happens. When guests submit file system requests they are handled directly by the virtiofsd process on the host and don't need to go via the QEMU process.

    virtiofsd is a FUSE file system daemon with virtio-fs extensions. virtio-fs is built on top of the FUSE protocol and therefore supports the POSIX file system semantics that applications expect from a native Linux file system. The Linux guest driver shares a lot of code with the traditional fuse.ko kernel module.

    Resources on virtio-fs

    I have given a few presentations on virtio-fs:

    Future features

    A key feature of virtio-fs is the ability to directly access the host page cache, eliminating the need to copy file contents into guest RAM. This so-called DAX support is not upstream yet.

    Live migration is not yet implemented. It is a little challenging to transfer all file system state to the destination host and seamlessly continue file system operation without remounting, but it should be doable.

    There is a Rust implementation of virtiofsd that is close to reaching maturity and will replace the C implementation. The advantage is that Rust has better memory and thread safety than C so entire classes of bugs can be eliminated. Also, the codebase is written from scratch whereas the C implementation was a combination of several existing pieces of software that were not designed together.

    by Unknown at April 20, 2020 02:23 PM

    April 16, 2020

    KVM on Z

    Secure Execution for IBM z15 arriving with New Models

    IBM announced the latest additions to its IBM z15 series:
    • IBM z15 Model T02
    • IBM LinuxONE III Model LT2
    A substantial part of the announcement is a new feature called Secure Execution. For a brief overview, see here. Secure Execution will become available in the following Linux distributions as announced by the respective distribution partners:
    We will publish more details on Secure Execution later. The impatient with an interest in lower level technical details might want to check out the presentations here and here.

    by Stefan Raspl at April 16, 2020 09:14 AM

    April 08, 2020

    Cornelia Huck

    s390x changes in QEMU 5.0

    QEMU is currently in hardfreeze, with the 5.0 release expected at the end of the month. Here's a quick list of some notable s390x changes.

    • You can finally enable Adapter Interrupt Suppression in the cpu model (ais=on) when running under KVM. This had been working under TCG for some time now, but KVM was missing an interface that was provided later -- and we finally actually check for that interface in QEMU. This is mostly interesting for PCI.
    • QEMU had been silently fixing odd memory sizes to something that can be reported via SCLP for some time. Silently changing user input is probably not such a good idea; compat machines will continue to do so to enable migration from old QEMUs for machines with odd sizes, but will print a warning now. If you have such an old machine (and you can modify it), it might be a good idea to either specify the memory size it gets rounded to or to switch to the 5.0 machine type, where memory sizes can be more fine-grained due to the removal of support for memory hotplug. We may want to get rid of the code doing the fixup at some time in the future.
    • QEMU now properly performs the whole set of initial, clear, and normal cpu resets.
    • And the usual fixes, cleanups, and improvements.
    For 5.1, expect more changes; support for protected virtualization will be a big item.

    by Cornelia Huck at April 08, 2020 01:58 PM

    ARM Datacenter Project

    How to easily install the kernel in a VM

    This article is a follow-up to an earlier article we wrote, Introducing LISA-QEMU.

    This article will outline the steps to install a kernel into a VM using some scripts we developed. In our case we have an x86_64 host and an aarch64 VM.

    We will assume you have cloned the LISA-QEMU repository already. As part of the LISA-QEMU integration we have added a script to automate the process of installing a kernel into a VM. The scripts we talk about below can be found in the LISA-QEMU GitHub repository.

    git clone
    cd lisa-qemu
    git submodule update --init --recursive

    We also assume you have built the kernel .deb install package. We covered the detailed steps in our README. You can also find needed dependencies for this article at that link.

    You can use the install script to generate a new image with the kernel of choice installed.
    Assuming you have the VM image that was created similar to the steps in this post, just launch a command like the below to install your kernel.

    $ sudo python3 scripts/ --kernel_pkg ../linux/linux-image-5.5.11_5.5.11-1_arm64.deb 
    scripts/ image: build/VM-ubuntu.aarch64/ubuntu.aarch64.img
    scripts/ kernel_pkg: ../linux/linux-image-5.5.11_5.5.11-1_arm64.deb
    Install kernel successful.
    Image path: /home/rob/qemu/lisa-qemu/build/VM-ubuntu.aarch64/ubuntu.aarch64.img.kernel-5.5.11-1
    To start this image run this command:
    python3 /home/rob/qemu/lisa-qemu/scripts/ -p /home/rob/qemu/lisa-qemu/build/VM-ubuntu.aarch64/ubuntu.aarch64.img.kernel-5.5.11-1

    We need sudo for these commands because mounting images requires root privileges.

    Note that the argument is:
    -p or --kernel_pkg argument with the .deb kernel package

    Also note that the last lines in the output show the command to issue to bring this image up.

    To start this image run this command:
    python3 /home/rob/qemu/lisa-qemu/scripts/ -p /home/rob/qemu/lisa-qemu/build/VM-ubuntu.aarch64/ubuntu.aarch64.img.kernel-5.5.11-1

    You might wonder where we got the VM image from?
    It was found in a default location after running our script. See this post for more details.

    If you want to supply your own image, we have an argument for that. :)
    --image argument with the VM image to start from.

    When supplying the image, the command line might look like the below.

    sudo python3 scripts/ --kernel_pkg ../linux/linux-image-5.5.11_5.5.11-1_arm64.deb --image build/VM-ubuntu.aarch64/ubuntu.aarch64.img

    There are a few options for installing the kernel.

    By default the script will attempt to install your kernel using a chroot environment. This is done for speed more than anything else, since in our case it is faster to use the chroot than to bring up the emulated aarch64 VM and install the kernel.

    We also support the --vm option, which will bring up the VM with QEMU and then install the kernel into it. If you run into issues with the chroot environment install, this would be a good alternative.

    An example of the VM install method.

    sudo python3 scripts/ --vm --kernel_pkg ../linux/linux-image-5.5.11_5.5.11-1_arm64.deb
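To summarize how the three options fit together, here is a hypothetical argument parser that mirrors them. It is not the actual LISA-QEMU script, only a sketch of the behavior described above: the function names are invented, and the default image path is taken from the example output earlier in this article.

```python
import argparse

# Default location assumed from the example output shown earlier.
DEFAULT_IMAGE = "build/VM-ubuntu.aarch64/ubuntu.aarch64.img"


def build_parser():
    """Hypothetical front end mirroring the options described in this article."""
    parser = argparse.ArgumentParser(description="Install a kernel .deb into a VM image")
    parser.add_argument("--kernel_pkg", "-p", required=True,
                        help=".deb kernel package to install")
    parser.add_argument("--image", default=DEFAULT_IMAGE,
                        help="VM image to start from")
    parser.add_argument("--vm", action="store_true",
                        help="boot the VM with QEMU to install instead of using a chroot")
    return parser


def install_method(args):
    """The chroot install is the default; --vm selects the slower VM-based install."""
    return "vm" if args.vm else "chroot"
```

With only --kernel_pkg given, install_method returns "chroot" and the default image is used; adding --vm switches to the VM method.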

    Thanks for taking the time to learn more about our work on LISA-QEMU!

    by Rob Foley at April 08, 2020 11:50 AM

    April 02, 2020

    ARM Datacenter Project

    LISA-QEMU Demo

    This article is a follow-up to an earlier article we wrote, Introducing LISA-QEMU.

    LISA-QEMU provides an integration which allows LISA to work with QEMU VMs. LISA’s goal is to help Linux kernel developers to measure the impact of modifications in core parts of the kernel. [1] Integration with QEMU will allow developers to test a wide variety of hardware configurations, including the ARM architecture and complex NUMA topologies.

    This demo will walk through all the steps needed to build and bring up an aarch64 VM on an x86 platform. Future articles will work through reconfiguring the hardware for these VMs, inserting a new kernel into these VMs and more!

    The first step is to get your Linux machine ready to run LISA-QEMU. In this step we will download all the dependencies needed. We assume Ubuntu in the below steps.

    apt-get build-dep -y qemu
    apt-get install -y python3-yaml wget git qemu-efi-aarch64 qemu-utils genisoimage qemu-user-static

    Now that we have the correct dependencies, let’s download the LISA-QEMU code.

    git clone
    cd lisa-qemu
    git submodule update --init --progress --recursive

    One note on the above. If you do not plan to use LISA, then you can leave off the --recursive and it will update much quicker.

    The next step is to build a new VM. This build command takes all the defaults. If you want to learn more about the possible options, take a look at --help.

    $ time python3 scripts/  --help
    usage: [-h] [--debug] [--dry_run] [--ssh]
                          [--image_type IMAGE_TYPE] [--image_path IMAGE_PATH]
                          [--config CONFIG] [--skip_qemu_build]
    Build the qemu VM image for use with lisa.
    optional arguments:
      -h, --help            show this help message and exit
      --debug, -D           enable debug output
      --dry_run             for debugging.  Just show commands to issue.
      --ssh                 Launch VM and open an ssh shell.
      --image_type IMAGE_TYPE, -i IMAGE_TYPE
                            Type of image to build.
                            From external/qemu/tests/vm.
                            default is ubuntu.aarch64
      --image_path IMAGE_PATH, -p IMAGE_PATH
                            Allows overriding path to image.
      --config CONFIG, -c CONFIG
                            config file.
                            default is conf/conf_default.yml.
      --skip_qemu_build     For debugging script.
      To select all defaults:
      Or select one or more arguments
        scripts/ -i ubuntu.aarch64 -c conf/conf_default.yml

    But we digress… OK, let’s build that image. Below is the command to do it.

    python3 scripts/

    You will see the progress of the build and other steps of the image creation on your screen. If you would like to see more comprehensive output and progress, use the --debug option.

    Depending on your system this might take many minutes. Below are some example times.

    50 minutes - Intel i7 laptop with 2 cores and 16 GB of memory
    6 minutes - Huawei Taishan 2286 V2 with 128 ARM cores and 512 GB of memory.

    Once the image creation is complete, you will see a message like the following.

    Image creation successful.
    Image path: /home/lisa-qemu/build/VM-ubuntu.aarch64/ubuntu.aarch64.img

    Now that we have an image, we can test it out by bringing up the image and opening an ssh connection to it.

    python3 scripts/

    The time to bring up the VM will vary based on your machine, but it should come up in about 2-3 minutes on most machines.

    You should expect to see the following as the system boots and we open an ssh connection to bring us to the guest prompt.

    $ python3 scripts/
    Conf:        /home/lisa-qemu/build/VM-ubuntu.aarch64/conf.yml
    Image type:  ubuntu.aarch64
    Image path:  /home/lisa-qemu/build/VM-ubuntu.aarch64/ubuntu.aarch64.img

    Now that the system is up and running, you could, for example, use it for a LISA test.

    In our case we issue one command to show that we are in fact on an aarch64 architecture with 8 cores.

    qemu@ubuntu-guest:~$ lscpu
    Architecture:        aarch64
    Byte Order:          Little Endian
    CPU(s):              8
    On-line CPU(s) list: 0-7
    Thread(s) per core:  1
    Core(s) per socket:  8
    Socket(s):           1
    NUMA node(s):        1
    Vendor ID:           0x00
    Model:               0
    Stepping:            0x0
    BogoMIPS:            125.00
    NUMA node0 CPU(s):   0-7
    Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma sha3 sm3 sm4 asimddp sha512 sve asimdfhm flagm
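If you want to check the topology from a script rather than by eye, the lscpu output is easy to turn into a dictionary. This is a minimal sketch; parse_lscpu is a made-up helper and it assumes the "key: value" layout shown above.

```python
def parse_lscpu(text):
    """Parse "Key:   value" lines from lscpu output into a dictionary of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            info[key.strip()] = value.strip()
    return info
```

Inside the guest, the text could come from subprocess.run(["lscpu"], capture_output=True, text=True).stdout, and the "Architecture" and "CPU(s)" entries can then be compared with the configuration you asked for.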

    Once you are done with the VM, you can close the VM simply by typing “exit” at the command prompt.

    qemu@ubuntu-guest:~$ exit
    Connection to closed by remote host.

    That’s it. The VM was gracefully powered off.

    We hope this article helped you understand just how easy it can be to build and launch a VM with LISA-QEMU!

    1. This definition can be found on the LISA github page 

    by Rob Foley at April 02, 2020 11:07 AM

    April 01, 2020

    ARM Datacenter Project

    Introducing LISA-QEMU

    LISA-QEMU provides an integration which allows LISA to work with QEMU VMs. LISA’s goal is to help Linux kernel developers to measure the impact of modifications in core parts of the kernel. [1] Integration with QEMU will allow developers to test a wide variety of hardware configurations, including the ARM architecture and complex NUMA topologies.

    One of our goals is to allow developers to test the impact of modifications on aarch64 architectures with complex NUMA topologies. Currently we are focusing on testing how the kernel CFS scheduler's task placement decisions interact with NUMA_BALANCING.

    In order to simplify and streamline the development process we created scripts and configuration files, which allow developers to quickly create QEMU VMs with a configurable number of cores and NUMA nodes. We also created a script to install a custom-built kernel on these VMs. Once a VM is configured with the desired topology and kernel version, developers can run interactive and/or automated LISA tests.
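To give a feel for what a configurable number of cores and NUMA nodes means at the QEMU level, here is a rough sketch that builds the corresponding command-line arguments. It does not reflect the LISA-QEMU configuration format or scripts; the function name and the even split of cores and memory across nodes are assumptions made for illustration.

```python
def numa_args(cores, nodes, mem_mb):
    """Build QEMU arguments that split cores and memory evenly across NUMA nodes."""
    if cores % nodes or mem_mb % nodes:
        raise ValueError("cores and memory must divide evenly across nodes")
    cores_per_node = cores // nodes
    mem_per_node = mem_mb // nodes
    args = ["-smp", str(cores), "-m", f"{mem_mb}M"]
    for node in range(nodes):
        first = node * cores_per_node
        last = first + cores_per_node - 1
        cpus = str(first) if first == last else f"{first}-{last}"
        # One memory backend per node, then a NUMA node bound to it.
        args += ["-object", f"memory-backend-ram,id=mem{node},size={mem_per_node}M",
                 "-numa", f"node,nodeid={node},cpus={cpus},memdev=mem{node}"]
    return args
```

For example, numa_args(8, 2, 2048) describes an 8-core guest with two nodes of four cores and 1024 MB each.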

    Please note that you do not need physical aarch64 hardware. In fact we have demoed this project on a laptop with a Core-i7-7600U CPU with two cores.

    Our approach is to contribute improvements in QEMU and LISA back to the mainstream. In our repository we will keep scripts and configurations belonging to the integration between LISA and QEMU.

    LISA Overview: The LISA project provides a toolkit that supports regression testing and interactive analysis of Linux kernel behavior. LISA’s goal is to help Linux kernel developers measure the impact of modifications in core parts of the kernel. LISA itself runs on a host machine, and uses the devlib toolkit to interact with the target via SSH, ADB or telnet. LISA provides features to describe workloads (notably using rt-app) and run them on targets. It can collect trace files from the target OS (e.g. systrace and ftrace traces) and parse them via the TRAPpy framework. These traces can then be analysed in order to examine detailed target behaviour during the workload’s execution. [1]

    Peter also contributed to this article.

    We also have articles on LISA-QEMU:

    1. This definition can be found on the LISA github page

    by Rob Foley at April 01, 2020 08:30 PM

    March 25, 2020

    Marcin Juszkiewicz

    Sharing PCIe cards across architectures

    Some days ago, during a conference call, one of my co-workers asked:

    Has anyone ever tried PCI forwarding to an ARM VM on an x86 box?

    As my machine was open anyway, I just turned it off and inserted a SATA controller into one of the unused PCI Express slots. After boot I started one of my AArch64 CirrOS VM instances and gave it this card. It worked perfectly:

    [   21.603194] pcieport 0000:00:01.0: pciehp: Slot(0): Attention button pressed
    [   21.603849] pcieport 0000:00:01.0: pciehp: Slot(0) Powering on due to button press
    [   21.604124] pcieport 0000:00:01.0: pciehp: Slot(0): Card present
    [   21.604156] pcieport 0000:00:01.0: pciehp: Slot(0): Link Up
    [   21.739977] pci 0000:01:00.0: [1b21:0612] type 00 class 0x010601
    [   21.740159] pci 0000:01:00.0: reg 0x10: [io  0x0000-0x0007]
    [   21.740199] pci 0000:01:00.0: reg 0x14: [io  0x0000-0x0003]
    [   21.740235] pci 0000:01:00.0: reg 0x18: [io  0x0000-0x0007]
    [   21.740271] pci 0000:01:00.0: reg 0x1c: [io  0x0000-0x0003]
    [   21.740306] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x001f]
    [   21.740416] pci 0000:01:00.0: reg 0x24: [mem 0x00000000-0x000001ff]
    [   21.742660] pci 0000:01:00.0: BAR 5: assigned [mem 0x10000000-0x100001ff]
    [   21.742709] pci 0000:01:00.0: BAR 4: assigned [io  0x1000-0x101f]
    [   21.742770] pci 0000:01:00.0: BAR 0: assigned [io  0x1020-0x1027]
    [   21.742803] pci 0000:01:00.0: BAR 2: assigned [io  0x1028-0x102f]
    [   21.742834] pci 0000:01:00.0: BAR 1: assigned [io  0x1030-0x1033]
    [   21.742866] pci 0000:01:00.0: BAR 3: assigned [io  0x1034-0x1037]
    [   21.742935] pcieport 0000:00:01.0: PCI bridge to [bus 01]
    [   21.742961] pcieport 0000:00:01.0:   bridge window [io  0x1000-0x1fff]
    [   21.744805] pcieport 0000:00:01.0:   bridge window [mem 0x10000000-0x101fffff]
    [   21.745749] pcieport 0000:00:01.0:   bridge window [mem 0x8000000000-0x80001fffff 64bit pref]

    Let’s go deeper

    The next day I turned off the desktop for a CPU cooler upgrade. During the process I went through my box of expansion cards and plugged in an additional USB 3.0 controller (Renesas based). I also added a SATA hard drive and connected it to the previously added controller.

    Once the computer was back online I created a new VM instance. This time I used Fedora 32 beta. But when I tried to add a PCI Express card I got an error:

    Error while starting domain: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22
    2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument)
    Traceback (most recent call last):
      File "/usr/share/virt-manager/virtManager/", line 75, in cb_wrapper
        callback(asyncjob, *args, **kwargs)
      File "/usr/share/virt-manager/virtManager/", line 111, in tmpcb
        callback(*args, **kwargs)
      File "/usr/share/virt-manager/virtManager/object/", line 66, in newfn
        ret = fn(self, *args, **kwargs)
      File "/usr/share/virt-manager/virtManager/object/", line 1279, in startup
      File "/usr/lib64/python3.8/site-packages/", line 1234, in create
        if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
    libvirt.libvirtError: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22
    2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument)

    Hmm. It worked before. Tried other card — with the same effect.


    I went to the #qemu IRC channel and started discussing the issue with QEMU developers. It turned out that probably no one had tried sharing expansion cards with a foreign-architecture guest (in TCG mode instead of same-architecture KVM mode).

    As I had a VM instance where sharing a card worked, I started checking what was wrong. After some restarts it was clear that crossing 3054 MB of guest memory was enough to get VFIO errors like the ones above.

    An issue not reported does not exist, so I opened a bug against QEMU. I filled it with error messages, “lspci” output for the cards used, the QEMU command line (generated by libvirt), etc.

    Looks like the problem lies in architecture differences between x86-64 (host) and aarch64 (guest). Let me quote Alex Williamson:

    The issue is that the device needs to be able to DMA into guest RAM, and to do that transparently (i.e. the guest doesn’t know it’s being virtualized), we need to map GPAs into the host IOMMU such that the guest interacts with the device in terms of GPAs, the host IOMMU translates that to HPAs. Thus the IOMMU needs to support GPA range of the guest as IOVA. However, there are ranges of IOVA space that the host IOMMU cannot map, for example the MSI range here is handled by the interrupt remapper, not the DMA translation portion of the IOMMU (on physical ARM systems these are one and the same, on x86 they are different components, using different mapping interfaces of the IOMMU). Therefore if the guest programmed the device to perform a DMA to 0xfee00000, the host IOMMU would see that as an MSI, not a DMA. When we do an x86 VM on an x86 host, both the host and the guest have complementary reserved regions, which avoids this issue.

    Also, to expand on what I mentioned on IRC, every x86 host is going to have some reserved range below 4G for this purpose, but if the aarch64 VM has no requirements for memory below 4G, the starting GPA for the VM could be at or above 4G and avoid this issue.
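The 3054 MB threshold I found by trial and error matches this explanation. Here is a small arithmetic sketch, assuming that guest RAM starts at 0x40000000 (the IOVA shown in the vfio_dma_map error above) and that the x86 MSI window covers 0xfee00000 to 0xfeefffff (the start address comes from the quote, the end address is my assumption about the usual size of that window). The function names are made up for this example.

```python
MIB = 1024 * 1024
GUEST_RAM_BASE = 0x40000000  # start of guest RAM, from the vfio_dma_map() error
X86_MSI_START = 0xFEE00000   # MSI window on the x86 host, not usable for DMA
X86_MSI_END = 0xFEEFFFFF     # assumed end of that window


def overlaps_msi_window(ram_mib, base=GUEST_RAM_BASE):
    """Return True if guest RAM of the given size would cover part of the MSI window."""
    ram_end = base + ram_mib * MIB - 1  # address of the last byte of guest RAM
    return base <= X86_MSI_END and ram_end >= X86_MSI_START


def max_safe_ram_mib(base=GUEST_RAM_BASE):
    """Largest guest RAM size in MiB that still ends below the MSI window."""
    return (X86_MSI_START - base) // MIB
```

Under these assumptions max_safe_ram_mib() gives 3054, overlaps_msi_window(3054) is False and overlaps_msi_window(3055) is True, which is the boundary I observed.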

    I have to admit that this is too low-level for me. I hope that the problem I hit will help someone to improve QEMU.

    by Marcin Juszkiewicz at March 25, 2020 05:23 PM

    Powered by Planet!
    Last updated: August 05, 2020 07:10 PM