Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

October 18, 2020

Gerd Hoffmann

Improving microvm support in qemu and seabios.

In version 4.2 the microvm machine type was added to qemu. The initial commit describes it this way:

It's a minimalist machine type without PCI nor ACPI support, designed for short-lived guests. microvm also establishes a baseline for benchmarking and optimizing both QEMU and guest operating systems, since it is optimized for both boot time and footprint.

The initial code uses the minimal qboot firmware to initialize the guest, load a linux kernel and boot it. Network, storage and other devices use virtio-mmio. Their configuration is passed to the linux kernel on the command line so the guest is able to find the devices.
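
For reference, the generated entries use the kernel's virtio_mmio.device=<size>@<base>:<irq> parameter syntax; the values below are purely illustrative:

virtio_mmio.device=512@0xfeb00000:12 virtio_mmio.device=512@0xfeb00200:13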

That works for direct kernel boot, using qemu -kernel vmlinuz, because qemu can easily patch the kernel command line then. But what if you want to boot, for example, the Fedora Cloud image, using the Fedora kernel stored within the image?

A better plan for device discovery

When not using direct kernel boot, patching the kernel command line for device discovery isn't going to fly, so we need something else. There are two established standard ways to do that in modern systems: device trees and ACPI. Both support virtio-mmio.

A device tree entry for virtio-mmio looks like this:

virtio_block@3000 {
	compatible = "virtio,mmio";
	reg = <0x3000 0x100>;
	interrupts = <41>;
}

And this is the ACPI DSDT version:

Device (VR06)
{
    Name (_HID, "LNRO0005")  // _HID: Hardware ID
    Name (_UID, 0x06)  // _UID: Unique ID
    Name (_CCA, One)  // _CCA: Cache Coherency Attribute
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
        Memory32Fixed (ReadWrite,
            0xFEB00C00,         // Address Base
            0x00000200,         // Address Length
            )
        Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
        {
            0x00000016,
        }
    })
}

Both carry essentially the same information: what kind of device it is and which resources (registers & interrupts) it uses.

On the arm platform both are established, with device trees being common for small board computers like the Raspberry Pi and ACPI being used in the arm server space. On the x86 platform we don't have much of a choice though. There are some niche attempts to establish device trees, the Google Android Goldfish platform for example, but for widespread support there is no way around ACPI.

The nice thing about arm servers using ACPI too is that this has paved the way for us. The linux kernel supports both device trees and ACPI for the discovery of virtio-mmio devices:

root@fedora-bios ~/acpi# modinfo virtio-mmio | grep alias
alias:          of:N*T*Cvirtio,mmioC*
alias:          of:N*T*Cvirtio,mmio
alias:          acpi*:LNRO0005:*

So linux kernel support is a solved problem already. Yay!

virtio-mmio support for seabios

If we want to load a kernel from the disk image, the firmware must be able to find and read the disk. seabios already has virtio-pci support for blk and scsi. The differences between the virtio-pci and virtio-mmio transports are not that big, and some infrastructure for different transport modes was already there to deal with legacy vs. modern virtio-pci. So adding virtio-mmio support to the drivers wasn't much of a problem.

But of course seabios also has the problem that it must discover the devices before it can initialize the drivers. Various approaches to find virtio-mmio devices using the available information sources were tried. All of them had one non-working corner case or another, except ACPI. So seabios ended up getting a simple DSDT parser for device discovery.

While at it, some other small fixes were added to seabios to make it work better with microvm. The hard dependency on the RTC CMOS has been removed, for example, so the latest seabios works fine with qemu -M microvm,rtc=off.

This ships with seabios version 1.14.

While we are speaking about seabios: when using a serial console I'd strongly recommend running with qemu -M microvm,graphics=off. That will enable serial console support in seabios. This is one of the tweaks done by the qemu -nographic shortcut. The machine option works with the pc and q35 machine types too.
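
For reference, a minimal serial-console microvm setup could look roughly like this (just a sketch; the disk image name is a placeholder and the exact option set depends on your needs):

$ qemu-system-x86_64 -M microvm,graphics=off,rtc=off \
    -enable-kvm -cpu host -m 512 \
    -nodefaults -display none -serial stdio \
    -drive id=disk0,file=fedora-cloud.qcow2,format=qcow2,if=none \
    -device virtio-blk-device,drive=disk0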

ACPI cleanups in qemu

Hooking up ACPI support for microvm on the qemu side turned out to be surprisingly difficult due to some historical baggage.

Years ago qemu used to have a static ACPI DSDT table. All ISA devices (serial & parallel ports, floppy, ...) were declared there, but they might not actually be present depending on the qemu configuration. The LPC/ISA bridge has some bits in PCI config space saying whether a device is actually present or not (qemu emulation follows physical hardware behavior here). So the devices have a _STA method looking up those bits and returning the device status. The guest had to run that method using its AML interpreter to figure out whether a declared device is actually there.

// this is the qemu q35 ISA bridge at 00:1f.0
Device (ISA)
{
    Name (_ADR, 0x001F0000)  // _ADR: Address
    // ... snip ...
    OperationRegion (LPCE, PCI_Config, 0x82, 0x02)
    Field (LPCE, AnyAcc, NoLock, Preserve)
    {
        CAEN,   1,  // serial port #1 enable bit
        CBEN,   1,  // serial port #2 enable bit
        LPEN,   1   // parallel port enable bit
    }
}

// serial port #1
Device (COM1)
{
    Name (_HID, EisaId ("PNP0501") /* 16550A-compatible COM Serial Port */)
    Name (_UID, One)  // _UID: Unique ID
    Method (_STA, 0, NotSerialized)  // _STA: Status
    {
        Local0 = CAEN /* \_SB_.PCI0.ISA_.CAEN */
        If ((Local0 == Zero))
        {
            // serial port #1 is disabled
            Return (Zero)
        }
        Else
        {
            // serial port #1 is enabled
            Return (0x0F)
        }
    }
    // ... snip ...
}

The microvm machine type simply has no PCI support, so that approach isn't going to fly. Also, these days all ACPI tables are dynamically generated anyway, so there is no reason to have the guest's AML interpreter go digging into PCI config space. Instead we can handle that in qemu when generating the DSDT table: disabled devices are simply not listed. For enabled devices this is enough:

// serial port #1
Device (COM1)
{
    Name (_HID, EisaId ("PNP0501") /* 16550A-compatible COM Serial Port */)
    Name (_UID, One)  // _UID: Unique ID
    Name (_STA, 0x0F)  // _STA: Status
    // ... snip ...
}

So I ended up reorganizing and simplifying the code which creates the DSDT entries for ISA devices. This landed in qemu version 5.1.

ACPI support for microvm

Now with the roadblocks out of the way it was finally possible to add ACPI support to microvm. There is little reason to worry about backward compatibility with historic x86 platforms here; old guests wouldn't be able to handle virtio-mmio anyway. So this takes a rather modern approach and looks more like an arm virt machine than an x86 q35 machine. Like arm it uses the generic event device for power management support.

ACPI support for microvm is switchable, similar to the other machine types, using the acpi=on|off machine option. The -no-acpi switch works too. By default ACPI support is enabled.

With ACPI enabled qemu uses the virtio-mmio-enabled seabios as firmware and doesn't bother patching the linux kernel command line for device discovery.

With ACPI disabled qemu continues to use qboot as firmware like older qemu versions do. Likewise it continues to add virtio-mmio devices to the linux kernel command line.

This will be available in qemu version 5.2. It is already merged in the master branch.

ACPI advantages

  • Number one is device discovery, obviously; this is why we started all this in the first place. seabios and the linux kernel find virtio-mmio devices automatically. You can boot Fedora cloud images in microvm without needing any tricks. Probably other distros work too, even though I didn't try that. Compiling the linux kernel with CONFIG_VIRTIO_MMIO=y (or =m & adding the module to the initramfs) is pretty much the only requirement for this to work.

  • Number two is device discovery too. ACPI will also tell the kernel which devices are not there. So with acpi=on the kernel simply skips the PS/2 probe if the DSDT doesn't list a keyboard controller. With acpi=off the kernel assumes legacy hardware, goes into probe-harder mode and needs one second to figure out that there really is no keyboard controller:

    [    0.414840] i8042: PNP: No PS/2 controller found.
    [    0.415287] i8042: Probing ports directly.
    [    1.454723] i8042: No controller found
    

    We see a similar effect with the real time clock. With acpi=off the kernel registers an IRQ line for the RTC even if the device isn't there.

  • Number three is (basic) power management. ACPI provides a virtual power button, so the guest will honor shutdown requests sent that way. ACPI also provides S5 (aka poweroff) support, so qemu gets a notification from the guest when the shutdown is done and can exit.

  • Number four is better IRQ routing. The linux kernel finds the IO-APIC declared in the APIC table and uses it for IRQ routing. It is possible to use lines 16-23 for the virtio-mmio devices, avoiding IRQ sharing. Also we can refine the configuration using IRQ flags in the DSDT table.

    With acpi=off this does not work reliably. I've seen the kernel ignore the IO-APIC in the past, though it doesn't always happen. It's not clear which factors play a role here; I didn't investigate that in detail. Maybe newer kernel versions are a bit more clever and find the IO-APIC even without ACPI.

Bottom line: ACPI helps move the microvm machine type forward towards a world without legacy x86 platform devices.

But isn't ACPI bloated and slow?

Well, on my microvm test guest all ACPI tables combined are less than 1k in size:

root@fedora-bios /sys/firmware/acpi/tables# find -type f | xargs ls -l
-r--------. 1 root root  78 Oct  2 09:36 ./APIC
-r--------. 1 root root 482 Oct  2 09:36 ./DSDT
-r--------. 1 root root 268 Oct  2 09:36 ./FACP

I wouldn't call that bloated. This is a rather small virtual machine; with larger configurations (more CPUs, more devices) the tables will of course grow a bit.

When testing boot times I found it is pretty hard to see any difference due to ACPI initialization. The noise (differences between 2-3 runs with identical configuration) is larger than the acpi=on/off difference. It seems to be at most a handful of milliseconds.

When trying that yourself take care to boot the kernel with 'quiet'. This is a good idea anyway if you want to boot as fast as possible. The kernel prints more boot information with acpi=on, so slow console logging can skew your numbers if you let the kernel print out everything.

Runtime differences should be zero. There is only one AML method in the DSDT table. It toggles the power button when a notification comes in from the generic event device. It runs only on generic event device interrupts.

USB support for microvm

qemu just got a sysbus (non-pci) version of the xhci host adapter. It is needed for some arm boards. The nice thing is that, now that we have ACPI, we can wire that up in microvm too: add it to the DSDT table and linux will find and use it:

Device (XHCI)
{
    Name (_HID, EisaId ("PNP0D10") /* XHCI USB Controller */)
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
        Memory32Fixed (ReadWrite,
            0xFE900000,         // Address Base
            0x00004000,         // Address Length
            )
        Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
        {
            0x0000000A,
        }
    })
}

USB support will be disabled by default; it can be enabled using the usual machine option: qemu -M microvm,usb=on.

Patches for qemu are in flight and should land in version 5.2.
Patches for seabios are merged and will be available in version 1.15.

PCIe support for microvm

There is one more arm platform thing we can reuse in microvm: the PCI Express host bridge. Again the same approach: wire everything up, declare it in the ACPI DSDT, and the linux kernel finds and uses it.

I'm not adding an ASL snippet this time. The PCIe host bridge is a complex device, so the description is a bit larger: it has GSI subdevices, IRQ routing information for each PCI slot, mmconfig configuration, etc. This also shows in the DSDT size (even though it is still less than half the size q35 has):

root@fedora-bios /sys/firmware/acpi/tables# ll DSDT
-r--------. 1 root root 3130 Oct  2 10:20 DSDT

PCIe support will be disabled by default; it can be enabled using the new pcie machine option: qemu -M microvm,pcie=on.
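
As an illustration of what this enables, a guest can then mix virtio-mmio and PCIe devices; a sketch using direct kernel boot (the kernel and image paths are placeholders):

$ qemu-system-x86_64 -M microvm,pcie=on -enable-kvm -m 512 \
    -nodefaults -display none -serial stdio \
    -kernel vmlinuz -append "console=ttyS0 root=/dev/vda" \
    -drive id=disk0,file=disk.qcow2,format=qcow2,if=none \
    -device virtio-blk-device,drive=disk0 \
    -netdev user,id=net0 \
    -device virtio-net-pci,netdev=net0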

This will be available in qemu version 5.2. It is already merged in the master branch.

Future work

My TODO list for qemu isn't very long:

  • Add a second IO-APIC, allowing more IRQ lines for more virtio-mmio devices. Experimental patches exist.

  • IOMMU support, using virtio-iommu. This depends on the ACPI spec update for virtio-iommu being finalized and support being merged into qemu and the linux kernel. The actual wireup for microvm should be easy once all this is done.

Outside qemu there are a few more items:

  • Investigate microvm PCIe support in seabios. Experimental patches exist. I'm not sure yet whether seabios should care though.

    The linux kernel can initialize the PCIe bus just fine on its own, whereas proper seabios support has its challenges. When running in real mode the mmconfig space cannot be reached, and legacy pci config space access via port 0xcf8 is not available, which breaks pcibios support. That is probably fixable with a temporary switch to 32-bit mode, at the cost of triggering a bunch of extra vmexits. Besides that, seabios will spend more time at boot initializing the pci devices.

    So, is it worth the effort? The benefit would be that seabios could support booting from pci devices on microvm then.

  • Maybe add microvm support to edk2/ovmf.

    It doesn't look that easy at a quick glance. ArmVirtPkg depends on device trees for virtio-mmio detection, so while we can re-use the virtio-mmio drivers we cannot re-use the device discovery code. Unless we maybe have qemu provide both ACPI tables and a device tree, even if ovmf happens to be the only device tree user.

    It also is not clear what other dragons (dependencies on classic x86 platform devices) are lurking in the ovmf codebase.

  • Support the new microvm features (possibly adding microvm support first) in other projects.

    Candidate number one is of course libvirt because it is the foundation for many other projects. Beside that, microvm support is probably most useful for cloud/container-style workloads, e.g. kata and kubevirt.

by Gerd Hoffmann at October 18, 2020 10:00 PM

October 09, 2020

Stefan Hajnoczi

Requirements for out-of-process device emulation

Over the past months I have participated in discussions about out-of-process device emulation. This post describes the requirements that have become apparent. I hope this will be a useful guide to understanding the big picture about out-of-process device emulation.

What is out-of-process device emulation?

Device emulation is traditionally implemented in the program that executes guest code. This approach is natural because accesses to device registers are trapped as part of the CPU run loop that sits at the core of an emulator or virtual machine monitor (VMM).

In some use cases it is advantageous to perform device emulation in separate processes. For example, software-defined network switches can minimize data copies by emulating network cards directly in the switch process. Out-of-process device emulation also enables privilege separation and tighter sandboxing for security.

Why are these requirements important?

When emulated devices are implemented in the VMM they use common VMM APIs. Adding new devices is relatively easy because the APIs are already there and the developer can focus on the device specifics. Out-of-process device emulation potentially leaves developers without APIs since the device emulation program is a separate program that literally starts from main(). Developers want to focus on implementing their specific device, not on solving general problems related to out-of-process device emulation infrastructure.

It is not only a lot of work to implement an out-of-process device completely from scratch, but there is also a risk of developing the wrong solution because some subtleties of device emulation are not obvious at first glance.

I hope sharing these requirements will help in the creation of common infrastructure so it's easy to implement high-quality out-of-process devices.

Not all use cases have the full set of requirements. Therefore it's best if requirements are addressed in separate, reusable libraries so that device implementors can pick the ones that are relevant to them.

Device emulation

Device resources

Devices provide resources that drivers interact with such as hardware registers, memory, or interrupts. The fundamental requirement of out-of-process device emulation is exposing device resources.

The following types of device resources are needed:

Synchronous MMIO/PIO accesses

The most basic device emulation operation is the hardware register access. This is a memory-mapped I/O (MMIO) or programmed I/O (PIO) access to the device. A read loads a value from a device register. A write stores a value to a device register. These operations are synchronous because the vCPU is paused until completion.

Asynchronous doorbells

Devices often have doorbell registers, allowing the driver to inform the device that new requests are ready for processing. The vCPU does not need to wait since the access is a posted write.

The kvm.ko ioeventfd mechanism can be used to implement asynchronous doorbells.

Shared device memory

Devices may have memory-like regions that the CPU can access (such as PCI Memory BARs). The device emulation process therefore needs to share a region of its memory space with the VMM so the guest can access it. This mechanism also allows device emulation to busy wait (poll) instead of using synchronous MMIO/PIO accesses or asynchronous doorbells for notifications.

Direct Memory Access (DMA)

Devices often require read and write access to a memory address space belonging to the CPU. This allows network cards to transmit packet payloads that are located in guest RAM, for example.

Early out-of-process device emulation interfaces simply shared guest RAM. They allowed DMA to any guest physical memory address. More advanced IOMMU and address space identifier mechanisms are now becoming ubiquitous. Therefore, new out-of-process device emulation interfaces should incorporate IOMMU functionality.

The key requirement for IOMMU mechanisms is allowing the VMM to grant access to a region of memory so the device emulation process can read from and/or write to it.

Interrupts

Devices notify the CPU using interrupts. An interrupt is simply a message sent by the device emulation process to the VMM. Interrupt configuration is flexible on modern devices, meaning the driver may be able to select the number of interrupts and a mapping (using one interrupt with multiple event sources). This can be implemented using the Linux eventfd mechanism or via in-band device emulation protocol messages, for example.

Extensibility for new bus types

It should be possible to support multiple bus types. vhost-user only supports vhost devices. VFIO is more extensible but currently focussed on PCI devices. It is likely that QEMU SysBus devices will be desirable for implementing ad-hoc out-of-process devices (especially for System-on-Chip target platforms).

Bus-level APIs, not protocol bindings

Developers should not need to learn the out-of-process device emulation protocol (vfio-user, etc). APIs should focus on bus-level concepts such as defining VIRTIO or PCI devices rather than protocol bindings for dealing with protocol messages, file descriptor passing, and shared memory.

In other words, developers should be thinking in terms of the problem domain, not worrying about how out-of-process device emulation is implemented. The protocol should be hidden behind bus-level APIs.

Multi-threading support from the beginning

Threading issues arise often in device emulation because asynchronous requests or multi-queue devices can be implemented using threads. Therefore it is necessary to clearly document what threading models are supported and how device lifecycle operations like reset interact with in-flight requests.

Live migration, live upgrade, and crash recovery

There are several related issues around device state and restarting the device emulation program without disrupting the guest.

Live migration

Live migration transfers the state of a device from one device emulation process to another (typically running on another host). This requires the following functionality:

Quiescing the device

Some devices can be live migrated at any point in time without any preparation, while others must be put into a quiescent state to avoid issues. An example is a storage controller that has a write request in flight. It is not safe to live migrate until the write request has completed or been canceled. Failure to wait might result in data corruption if the write takes effect after the destination has resumed execution.

Therefore it is necessary to quiesce a device. After this point there is no further device activity and no guest-visible changes will be made by the device.

Saving/loading device state

It must be possible to save and load device state. Device state includes the contents of hardware registers as well as device-internal state necessary for resuming operation.

It is typically necessary to determine whether the device emulation processes on the migration source and destination are compatible before attempting migration. This avoids migration failure when the destination tries to load the device state and discovers it doesn't support it. It may be desirable to support loading device state that was generated by a different implementation of the same device type (for example, two virtio-net implementations).

Dirty memory logging

Pre-copy live migration starts with an iterative phase where dirty memory pages are copied from the migration source to the destination host. Devices need to participate in dirty memory logging so that all written pages are transferred to the destination and no pages are "missed".

Crash recovery

If the device emulation process crashes it should be possible to restart it and resume device emulation without disrupting the guest (aside from a possible pause during reconnection).

Doing this requires maintaining device state (contents of hardware registers, etc) outside the device emulation process. This way the state remains even if the process crashes, and it can be resumed when a new process starts.

Live upgrade

It must be possible to upgrade the device emulation process and the VMM without disrupting the guest. Upgrading the device emulation process is similar to crash recovery in that the process terminates and a new one resumes with the previous state.

Device versioning

The guest-visible aspects of the device must be versioned. In the simplest case the device emulation program would have a --compat-version=N command-line option that controls which version of the device the guest sees. When guest-visible changes are made to the program the version number must be increased.

By giving control of the guest-visible device behavior it is possible to save/load and live migrate reliably. Otherwise loading device state in a newer device emulation program could affect the running guest. Guest drivers typically are not prepared for the device to change underneath them and doing so could result in guest crashes or data corruption.

Security

The trust model

The VMM must not trust the device emulation program. This is key to implementing privilege separation and the principle of least privilege. If a compromised device emulation program is able to gain control of the VMM then out-of-process device emulation has failed to provide isolation between devices.

The device emulation program must not trust the VMM to the extent that this is possible. For example, it must validate inputs so that the VMM cannot gain control of the device emulation process through memory corruptions or other bugs. This makes it so that even if the VMM has been compromised, access to device resources and associated system calls still requires further compromising the device emulation process.

Unprivileged operation

The device emulation program should run unprivileged to the extent that this is possible. If special permissions are required to access hardware resources then these resources can sometimes be provided via file descriptor passing by a more privileged parent process.

Sandboxing

Operating system sandboxing mechanisms can be applied to device emulation processes more effectively than monolithic VMMs. Seccomp can limit the Linux system calls that may be invoked. SELinux can restrict access to system resources.

Sandboxing is a common task that most device emulation programs need. Therefore it is a good candidate for a library or launcher tool that is shared by device emulation programs.

Management

Command-line interface

A common command-line interface should be defined where possible. For example, vhost-user's standard --socket-path=PATH argument makes it easy to launch any vhost-user device backend. Protocol-specific options (e.g. socket path) and device type-specific options (e.g. virtio-net) can be standardized.

Some options are necessarily specific to the device emulation program and therefore cannot be standardized.

The advantage of standard options is that management tools like libvirt can launch the device emulation programs without further user configuration.
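
To make this concrete, a backend following the spec's "Backend program conventions" and the corresponding QEMU wiring might look roughly like this (the backend binary name is hypothetical; vhost-user requires shared guest memory, hence the memfd memory backend):

# hypothetical device backend implementing the standard --socket-path option
$ my-vhost-user-blk --socket-path=/tmp/vhost-user-blk.sock

# connect QEMU to the backend over the UNIX domain socket
$ qemu-system-x86_64 -M q35 -m 2G \
    -object memory-backend-memfd,id=mem0,size=2G,share=on \
    -numa node,memdev=mem0 \
    -chardev socket,id=char0,path=/tmp/vhost-user-blk.sock \
    -device vhost-user-blk-pci,chardev=char0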

RPC interface

It may be necessary to issue commands at runtime. Examples include adjusting throttling limits, enabling/disabling logging, etc. These operations can be performed over an RPC interface.

Various RPC interfaces are used throughout open source virtualization software. Adopting a widely-used RPC protocol and standardizing commands is beneficial because it makes it easy to communicate with the software and management tools can support them relatively easily.

Conclusion

This was largely a brain dump but I hope it is useful food for thought as out-of-process device emulation interfaces are designed and developed. There is a lot more to it than simply implementing a protocol for device register accesses and guest RAM DMA. Developing open source libraries in Rust and C that can be used as needed will ensure that out-of-process devices are high-quality and easy for users to deploy.

by Stefan Hajnoczi at October 09, 2020 05:03 PM

KVM on Z

IBM Cloud Infrastructure Center adds Support for Red Hat KVM

The latest release of IBM Cloud Infrastructure Center 1.1.2 adds support for Red Hat KVM on IBM Z and LinuxONE.

IBM Cloud Infrastructure Center is an Infrastructure-as-a-Service (IaaS) management solution that provides on-premise cloud deployments of Linux virtual machines on IBM Z and LinuxONE.
As a component of the Z Hybrid Cloud solution stack, Cloud Infrastructure Center delivers simplified IaaS management across compute, network, and storage resources, making virtual infrastructure ready to be consumed by PaaS layer solutions.
Among others, version 1.1.2 provides new capabilities, including cloud deployments for Red Hat Enterprise Linux Kernel-based Virtual Machines (KVM).

See here for the official announcement, and here for a blog entry with further details.

by Stefan Raspl (noreply@blogger.com) at October 09, 2020 04:59 PM

September 27, 2020

Stefan Hajnoczi

On unifying vhost-user and VIRTIO

The recent development of a Linux driver framework called VIRTIO Data Path Acceleration (vDPA) has laid the groundwork for exciting new vhost-user features. The implications of vDPA have not yet rippled through the community so I want to share my thoughts on how the vhost-user protocol can take advantage of new ideas from vDPA.

This post is aimed at developers and assumes familiarity with the vhost-user protocol and VIRTIO. No knowledge of vDPA is required.

vDPA helps with writing drivers for hardware that supports VIRTIO offload. Its goal is to enable hybrid hardware/software VIRTIO devices, but as a nice side-effect it has overcome limitations in the kernel vhost interface. It turns out that applying ideas from vDPA to the vhost-user protocol solves the same issues there. In this article I'll show how extending the vhost-user protocol with vDPA has the following benefits:

  • Allows existing VIRTIO device emulation code to act as a vhost-user device backend.
  • Removes the need for shim devices in the virtual machine monitor (VMM).
  • Replaces undocumented conventions with a well-defined device model.

These things can be done while reusing existing vhost-user and VIRTIO code. In fact, this is especially good news for existing codebases like QEMU because they already have a wealth of vhost-user and VIRTIO code that can now finally be connected together!

Let's look at the advantages of extending vhost-user with vDPA first and then discuss how to do it.

Why extend vhost-user with vDPA?

Reusing VIRTIO emulation code for vhost-user backends

It is a common misconception that a vhost device is a VIRTIO device. VIRTIO devices are defined in the VIRTIO specification and consist of a configuration space, virtqueues, and a device lifecycle that includes feature negotiation. A vhost device is a subset of the corresponding VIRTIO device. The exact subset depends on the device type, and some vhost devices are closer to the full functionality of their corresponding VIRTIO device than others. The most well-known example is that vhost-net devices have rx/tx virtqueues but lack the virtio-net control virtqueue. Also, the configuration space and device lifecycle are only partially available to vhost devices.

This difference makes it impossible to use a VIRTIO device as a vhost-user device and vice versa. There is an impedance mismatch and missing functionality. That's a shame because existing VIRTIO device emulation code is mature and duplicating it to provide vhost-user backends creates additional work.

If there was a way to reuse existing VIRTIO device emulation code it would be easier to move to a multi-process architecture in QEMU. Want to run --netdev user,id=netdev0 --device virtio-net-pci,netdev=netdev0 in a separate, sandboxed process? Easy, run it as a vhost-user-net device instead of as virtio-net.

Making VMM device shims optional

Today each new vhost device type requires a shim device in the VMM. QEMU has --device vhost-user-blk-pci, --device vhost-user-input-pci, and so on. Why can't there be a single --device vhost-user device?

This limitation is a consequence of the fact that vhost devices are not full VIRTIO devices. In fact, a vhost device does not even have a way to report its device type (net, blk, scsi, etc). Therefore it is impossible for today's VMMs to offer a generic device. Each vhost device type requires a shim device.

In some cases a shim device is desirable because it allows the VMM to handle some aspects of the device instead of passing everything through to the vhost device backend. But requiring shims by default involves lots of tedious boilerplate code and prevents new device types from being used by older VMMs.

Providing a well-defined device model in vhost-user

Although vhost-user works well for users, it is difficult for developers to learn and extend. The protocol does not have a well-defined device model. Each device type has its own undocumented set of protocol messages that are used. For example, the vhost-user-blk device uses the configuration space whereas most other device types do not use the configuration space at all.

Since protocol use is not fully documented in the specification, developers might resort to reading Linux, QEMU, and DPDK code in order to figure out how to make their devices work. They typically have to debug vhost-user protocol messages and adjust their code until it appears to work. Hopefully the critical bugs are caught before the code ships. This is problematic because it's hard to produce high-quality vhost-user implementations.

Although the protocol specification can certainly be cleaned up, the problem is more fundamental. vhost-user badly needs a well-defined device model so that protocol usage is clear and uniform for all device types. The best way to do that is to simply adopt the VIRTIO device model. The VIRTIO specification already defines the device lifecycle and details of the device types. By making vhost-user devices full VIRTIO devices there is no need for additional vhost device specifications. The vhost-user specification just becomes a transport for the established VIRTIO device model. Luckily that is effectively what vDPA has done for kernel vhost ioctls.

How to do this in QEMU

The following QEMU changes are needed to implement vhost-user vDPA support. Below I will focus on vhost-user-net but most of the work is generic and benefits all device types.

Import vDPA ioctls into vhost-user

vDPA extends the Linux vhost ioctl interface. It uses a subset of vhost ioctls and adds new vDPA-specific ioctls that are implemented in the vhost_vdpa.ko kernel module. These new ioctls enable the full VIRTIO device model, including device IDs, the status register, configuration space, and so on.

In theory vhost-user could be fixed without vDPA, but it would involve effectively adding the same set of functionality that vDPA has already added onto kernel vhost. Reusing the vDPA ioctls allows VMMs to support both kernel vDPA and vhost-user with minimal code duplication.

This can be done by adding a VHOST_USER_PROTOCOL_F_VDPA feature bit to the vhost-user protocol. If both the vhost-user frontend and backend support vDPA then all vDPA messages are available. Otherwise they can either fall back on legacy vhost-user behavior or drop the connection.

The vhost-user specification could be split into a legacy section and a modern vDPA-enabled section. The modern protocol will remove vhost-user messages that are not needed by vDPA, simplifying the protocol for new developers while allowing existing implementations to support both with minimal changes.

One detail is that vDPA does not use the memory table mechanism for sharing memory. Instead it relies on the richer IOMMU message family that is optional in vhost-user today. This approach can be adopted in vhost-user too, making the IOMMU code path standard for all implementations and dropping the memory table mechanism.

Add vhost-user vDPA to the vhost code

QEMU's hw/virtio/vhost*.c code supports kernel vhost, vhost-user, and kernel vDPA. A vhost-user vDPA mode must be added to implement the new protocol. It can be implemented as a combination of the vhost-user and kernel vDPA code already in QEMU. Most likely the existing vhost-user code can simply be extended to enable vDPA support if the backend supports it.

Only small changes to hw/net/virtio-net.c and net/vhost-user.c are needed to use vhost-user vDPA with net devices. At that point QEMU can connect to a vhost-user-net device backend and use vDPA extensions.

Add a vhost-user vDPA VIRTIO transport

Next a vhost-user-net device backend can be put together using QEMU's virtio-net emulation. A translation layer is needed between the vhost-user protocol and the VirtIODevice type in QEMU. This can be done by implementing a new VIRTIO transport alongside the existing pci, mmio, and ccw transports. The transport processes vhost-user protocol messages from the UNIX domain socket and invokes VIRTIO device emulation code inside QEMU. It acts as a VIRTIO bus so that virtio-net-device, virtio-blk-device, and other device types can be plugged in.

This piece is the most involved but the vhost-user protocol communication part was already implemented in the virtio-vhost-user prototype that I worked on previously. Most of the communication code can be reused and the remainder is implementing the VirtioBusClass interface.

To summarize, a new transport acts as the vhost-user device backend and invokes QEMU VirtIODevice methods in response to vhost-user protocol messages. The syntax would be something like --device virtio-net-device,id=virtio-net0 --device vhost-user-backend,device=virtio-net0,addr.type=unix,addr.path=/tmp/vhost-user-net.sock.

Where this converges with multi-process QEMU

At this point QEMU can run ad-hoc vhost-user backends using existing VIRTIO device models. It is possible to go further by creating a qemu-dev launcher executable that implements the vhost-user spec's "Backend program conventions". This way a minimal device emulator executable hosts the device instead of a full system emulator.

The requirements for this are similar to the multi-process QEMU effort, which aims to run QEMU devices as separate processes. One of the main open questions is how to design build system and Kconfig support for building minimal device emulator executables.

In the case of vhost-user-net the qemu-dev-vhost-user-net executable would contain virtio-net-device, vhost-user-backend, any netdevs the user wishes to include, a QMP monitor, and a vhost-user backend command-line interface.

Where does this leave us? QEMU's existing VIRTIO device models can be used as vhost-user devices and run in separate processes from the VMM. It's a great way of reusing code and having the flexibility to deploy it in the way that makes most sense for the intended use case.

Conclusion

The vhost-user protocol can be simplified by adopting the vhost vDPA ioctls that have recently been introduced in Linux. This turns vhost-user into a VIRTIO transport and vhost-user devices become full VIRTIO devices. Existing VIRTIO device emulation code can then be reused in vhost-user device backends.

by Stefan Hajnoczi at September 27, 2020 09:26 AM

September 26, 2020

Cole Robinson

virt-install --cloud-init support

As part of GSOC 2019, Athina Plaskasoviti implemented --cloud-init support for virt-install. This post provides a bit more info about the feature.

Why cloud-init

For a long while, most mainstream Linux distros have shipped 'cloud images': raw or qcow2 formatted disk images with the distro minimally pre-installed on them. These images typically have cloud-init set to run on VM bootup. cloud-init can do a variety of things, like add users, change passwords, register ssh keys, and generally perform any desired action on the VM OS. This only works when cloud-init is passed the right configuration by whatever platform is starting the VM, like OpenStack or virt-install. cloud-init supports many different datasources for getting configuration outside the VM.

Historically though the problem is that slapping these images into virt-install or virt-manager gives crappy results, because these tools were not providing any datasource. In this case, cloud-init reverts to its distro default configured behavior, which in most cases is unusable. For example on Fedora, the result was: hang waiting for cloud-init data, time out, drop to login prompt with all accounts locked.

Prior to virt-install --cloud-init support, the simplest workaround was to use libguestfs, specifically virt-customize, to set a root password and disable cloud-init:

$ virt-customize -a MY-CLOUD-IMAGE.qcow2 \
    --root-password password:SUPER-SECRET-PASSWORD \
    --uninstall cloud-init

--cloud-init option

The --cloud-init option will tell virt-install to set up a nocloud datasource via a specially formatted .iso file that is generated on the fly, and only used for the first VM bootup.

The default behavior when --cloud-init is specified with no suboptions will do the following:

  • Generate a random root password and print it to the terminal.
  • Default to VM serial console access rather than graphical console access. Makes it easier to paste the password and gives more ssh-like behavior.
  • Sets the root password to expire on first login. So the temporary password is only used once.
  • Disables cloud-init for subsequent VM startups. Otherwise on the next VM boot you'd face locked accounts again.

The --cloud-init option also has suboptions to specify your own behavior, like transferring in a host ssh public key, passing in raw cloud-init user-data/meta-data, etc. See the virt-install --cloud-init man page section for the specifics.
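
For example, a rough sketch of booting a cloud image with an ssh key injected for root (the VM name, image path, and os-variant value are placeholders; the exact suboption spellings are documented in the man page):

$ virt-install --name f32-cloud --memory 2048 --vcpus 2 \
    --import --disk MY-CLOUD-IMAGE.qcow2 \
    --os-variant fedora32 \
    --cloud-init root-ssh-key=$HOME/.ssh/id_rsa.pub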

Room for improvement

This is all a step in the right direction but there's lots more we can do here:

  • Extend virt-install's --install option to learn how to fetch cloud images for the specified distro. libosinfo and osinfo-db already track cloud image download links for many distros so the info we need is already in place. We could make virt-install --install fedora32,cloud=yes a single way to pull down a cloud image, generate a cloud-init datasource, and create the VM in one shot.
  • Use --cloud-init by default when the user passes us a cloud-init enabled disk image. virt-customize has a lot of disk image detection smarts already, but we aren't using that in virt-install yet.
  • virt-manager UI support. There's an issue tracking this with some more thoughts in it.
  • Similar support for CoreOS Ignition which fulfills a similar purpose as cloud-init for distros like Fedora CoreOS. There's an issue tracking this too.

I personally don't have plans to work on these any time soon, but I'm happy to provide guidance if anyone is interested in helping out!

by Cole Robinson at September 26, 2020 04:00 AM

September 24, 2020

KVM on Z

Webinar: KVM Network Performance - Best Practices and Tuning Recommendations

Abstract
Join us for this webinar session to get an overview of the different networking configuration choices for running KVM guests on the IBM Z platforms, along with details on setup and capabilities. It also provides tuning recommendations for the KVM host and KVM guest environment to achieve greater network performance.

Speaker
Dr. Jürgen Dölle, Linux on IBM Z Performance Evaluation

Registration
Register here. You can check the system requirements here.
After registering, you will receive a confirmation email containing information about joining the webinar.

Replay & Archive
All sessions are recorded. For the archive as well as a replay and handout of this session and all previous webcasts see here.

by Stefan Raspl (noreply@blogger.com) at September 24, 2020 10:34 AM

September 17, 2020

Stefan Hajnoczi

Initial applications for Outreachy December-March internships Sept 20

QEMU is participating in the Outreachy December-March open source internship program. The internships are 12 weeks of full-time paid remote work contributing to QEMU/KVM. For more information about eligibility and what the internships are like, please see the Outreachy website.

The initial application deadline is September 20 and it only takes 5-30 minutes to complete: https://www.outreachy.org/apply/

If you or someone you know is considering doing an internship with QEMU, the Linux kernel, or another participating organization, please remember to submit an initial application by September 20th!

by Stefan Hajnoczi at September 17, 2020 05:07 PM

September 16, 2020

Cole Robinson

virt-manager 3.0.0 released!

Yesterday I released virt-manager 3.0.0. Despite the major version number bump, things shouldn't look too different from the previous release. For me the major version number bump reflects certain feature removals (like dropping virt-convert) and the large amount of internal code changes that were done, though there are a few long awaited features sprinkled in, like virt-install --cloud-init support, which I plan to write more about later.

This release includes:

  • virt-install --cloud-init support (Athina Plaskasoviti, Cole Robinson)
  • The virt-convert tool has been removed. Please use virt-v2v instead
  • A handful of UI XML configuration options have been removed, in an effort to reduce ongoing maintenance burden and to be more consistent with what types of UI knobs we expose. The XML editor is an alternative in most cases. For a larger discussion see this thread and virt-manager's DESIGN.md file.
  • The 'New VM' UI now has a 'Manual Install' option which creates a VM without any required install media
  • In the 'New VM' UI, the network/pxe install option has been removed. If you need network boot, choose 'Manual Install' and set the boot device after initial VM creation
  • 'Clone VM' UI has been reworked and simplified
  • 'Migrate VM' UI now has an XML editor for the destination VM
  • Global and per-vm option to disable graphical console autoconnect. This makes it easier to use virt-manager alongside another client like virt-viewer
  • virt-manager: set guest time after VM restore (Michael Weiser)
  • virt-manager: option to delete storage when removing disk device (Lily Nie)
  • virt-manager: show warnings if snapshot operation is unsafe (Michael Weiser)
  • Unattended install improvements (Fabiano Fidêncio)
  • cli: new --xml XPATH=VAL option for making direct XML changes
  • virt-install: new --reinstall=DOMAIN option
  • virt-install: new --autoconsole text|graphical|none option
  • virt-install: new --os-variant detect=on,require=on suboptions
  • cli: --clock, --keywrap, --blkiotune, --cputune additions (Athina Plaskasoviti)
  • cli: add --features kvm.hint-dedicated.state= (Menno Lageman)
  • cli: add --iommu option (Menno Lageman)
  • cli: Add --graphics websocket= support (Petr Benes)
  • cli: Add --disk type=nvme source.* suboptions
  • cli: Fill in all --filesystem suboptions
  • Translation string improvements (Pino Toscano)
  • Convert from .pod to .rst for man pages
  • Switch to pytest as our test runner
  • Massively improved unittest and uitest code coverage
  • Now using github issues as our bug tracker

by Cole Robinson at September 16, 2020 04:00 AM

September 14, 2020

QEMU project

An Overview of QEMU Storage Features

This article introduces QEMU storage concepts including disk images, emulated storage controllers, block jobs, the qemu-img utility, and qemu-storage-daemon. If you are new to QEMU or want an overview of storage functionality in QEMU then this article explains how things fit together.

Storage technologies

Persistently storing data and retrieving it later is the job of storage devices such as hard disks, solid state drives (SSDs), USB flash drives, network attached storage, and many others. Technologies vary in their storage capacity (disk size), access speed, price, and other factors but most of them follow the same block device model.

Block device I/O

Block devices are accessed in storage units called blocks. It is not possible to access individual bytes, instead an entire block must be transferred. Block sizes vary between devices with 512 bytes and 4KB block sizes being the most common.

As an emulator and virtualizer of computer systems, QEMU naturally has to offer block device functionality. QEMU is capable of emulating hard disks, solid state drives (SSDs), USB flash drives, SD cards, and more.

Storage for virtual machines

There is more to storage than just persisting data on behalf of a virtual machine. The lifecycle of a disk image includes several operations that are briefly covered below.

Disk image lifecycle

Virtual machines consist of device configuration (how much RAM, which graphics card, etc) and the contents of their disks. Transferring virtual machines either to migrate them between hosts or to distribute them to users is an important workflow that QEMU and its utilities support.

Much like ISO files are used to distribute operating system installer images, QEMU supports disk image file formats that are more convenient for transferring disk images than the raw contents of a disk. In fact, disk image file formats offer many other features such as the ability to import/export disks from other hypervisors, snapshots, and instantiating new disk images from a backing file.

Finally, managing disk images also involves the ability to take backups and restore them should it be necessary to roll back after the current disk contents have been lost or corrupted.

Emulated storage controllers

The virtual machine accesses block devices through storage controllers. These are the devices that the guest talks to in order to read or write blocks. Some storage controllers facilitate access to multiple block devices, such as a SCSI Host Bus Adapter that provides access to many SCSI disks.

Storage controllers vary in their features, performance, and guest operating system support. They expose a storage interface such as virtio-blk, NVMe, or SCSI. Virtual machines program storage controller registers to transfer data between memory buffers in RAM and block devices. Modern storage controllers support multiple request queues so that I/O can be processed in parallel at high rates.

The most common storage controllers in QEMU are virtio-blk, virtio-scsi, AHCI (SATA), IDE for legacy systems, and SD Card controllers on embedded or smaller boards.
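
For illustration, here is one way to attach a qcow2 disk image through a virtio-blk controller on the QEMU command line (the image file name is a placeholder):

$ qemu-system-x86_64 -M q35 -m 2G -enable-kvm \
    -blockdev driver=qcow2,node-name=disk0,file.driver=file,file.filename=vm.qcow2 \
    -device virtio-blk-pci,drive=disk0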

Disk image file formats

Disk image file formats handle the layout of blocks within a host file or device. The simplest format is the raw format where each block is located at its Logical Block Address (LBA) in the host file. This simple scheme does not offer much in the way of features.

QEMU’s native disk image format is QCOW2 and it offers a number of features:

  • Compactness - the host file grows as blocks are written so a sparse disk image can be much smaller than the virtual disk size.
  • Backing files - disk images can be based on a parent image so that a master image can be shared by virtual machines.
  • Snapshots - the state of the disk image can be saved and later reverted.
  • Compression - block compression reduces the image size.
  • Encryption - the disk image can be encrypted to protect data at rest.
  • Dirty bitmaps - backup applications can track changed blocks so that efficient incremental backups are possible.

A number of other disk image file formats are available for importing/exporting disk images for use with other software including VMware and Hyper-V.
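
As a quick illustration of the backing file and snapshot features, here is how they look with the qemu-img utility covered later in this article (file names are placeholders):

$ qemu-img create -f qcow2 base.qcow2 20G
$ qemu-img create -f qcow2 -b base.qcow2 -F qcow2 overlay.qcow2
$ qemu-img snapshot -c clean-install overlay.qcow2
$ qemu-img snapshot -l overlay.qcow2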

Block jobs

Block jobs are background operations that manipulate disk images:

  • Commit - merging backing files to shorten a backing file chain.
  • Backup - copying out a point-in-time snapshot of a disk.
  • Mirror - copying an image to a new destination while the virtual machine can still write to it.
  • Stream - populating a disk image from its backing file.
  • Create - creating new disk image files.

These background operations are powerful tools for building storage migration and backup workflows.

Some operations like mirror and stream can take a long time because they copy large amounts of data. Block jobs support throttling to limit the performance impact on virtual machines.

qemu-img and qemu-storage-daemon

The qemu-img utility manipulates disk images. It can create, resize, snapshot, repair, and inspect disk images. It has both human-friendly and JSON output formats, making it suitable for manual use as well as scripting.

qemu-storage-daemon exposes QEMU’s storage functionality in a server process without running a virtual machine. It can export disk images over the Network Block Device (NBD) protocol as well as run block jobs and other storage commands. This makes qemu-storage-daemon useful for applications that want to automate disk image manipulation.
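
As a sketch of what a qemu-storage-daemon invocation can look like (option spellings follow the current documentation and may vary between versions; paths are placeholders), a qcow2 image could be exported over NBD like this:

$ qemu-storage-daemon \
    --blockdev driver=file,node-name=file0,filename=disk.qcow2 \
    --blockdev driver=qcow2,node-name=disk0,file=file0 \
    --nbd-server addr.type=unix,addr.path=/tmp/nbd.sock \
    --export type=nbd,id=export0,node-name=disk0,writable=on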

Conclusion

QEMU presents block devices to virtual machines via emulated storage controllers. On the host side the disk image file format, block jobs, and qemu-img/qemu-storage-daemon utilities provide functionality for working with disk images. Future blog posts will dive deeper into some of these areas and describe best practices for configuring storage.

September 14, 2020 07:00 AM

September 08, 2020

KVM on Z

QEMU v5.1 released

QEMU v5.1 is out. Here are the highlights from a KVM on Z perspective:

  • Secure Guest support
  • vfio-ccw: Various enhancements
  • DASD IPL via vfio-ccw is now fully functional (requires Linux kernel 5.8)

For further details, see the Release Notes.

by Stefan Raspl (noreply@blogger.com) at September 08, 2020 01:58 PM

August 29, 2020

Stefan Hajnoczi

Using kcov code coverage with meson

The meson build system has built-in code coverage support, making it easy to identify lines of code that are not exercised by tests. Meson's code coverage support works with the gcov-based tools gcovr and lcov. This post shows how to use kcov with meson instead so that code coverage can be reported when gcov is unavailable.

How do code coverage tools work?

The gcov-based tools rely on compiler instrumentation, which both gcc and llvm support. Special compiler options instruct the compiler to emit instrumentation in every compiled function in order to record which lines of code are reached.

The kcov tool takes a different approach that does not require compiler support. It uses run-time instrumentation (like breakpoints) instead of compile-time instrumentation. This makes it possible to use kcov on existing binaries without recompilation, as long as debug information is available. The tool maps program instructions to lines of source code using the debug information.

There are pros and cons regarding exact features, performance, limitations, etc. For the most part the gcov approach works well when recompilation is possible and the compiler supports gcov. In other cases kcov is needed.

How to run meson tests under kcov

Meson's built-in code coverage support is designed for gcov and therefore works as a post-processing step after meson test was run. The workflow is different with kcov since the test itself must be run under kcov so it can instrument the process.

Run meson test as follows to get per-test coverage results:

$ meson test --wrapper='kcov kcov-output'

The $BUILD_DIR/kcov-output/ directory will contain the coverage results, one set for each test that was run.

Merging coverage results

If your goal is a single coverage percentage for the entire test suite, then the per-test results need to be merged. The following wrapper script can be used:

$ cat kcov-wrapper.sh
#!/bin/sh
# run each test binary under kcov, writing its coverage data to kcov-runs/<test name>
test_name=$(basename $1)
exec kcov kcov-runs/$test_name "$@"

And it is invoked like this:

$ rm -rf $BUILD_DIR/kcov-runs
$ mkdir $BUILD_DIR/kcov-runs
$ meson test --wrapper "$SOURCE_DIR/kcov-wrapper.sh"
$ rm -rf $BUILD_DIR/kcov-output
$ kcov --merge $BUILD_DIR/kcov-output $BUILD_DIR/kcov-runs/*

The merged results are located in the $BUILD_DIR/kcov-output/ directory.

Conclusion

Meson already has built-in support for gcov-based code coverage. If you cannot use gcov, then kcov is an alternative that is fairly easy to integrate into a meson project.

by Stefan Hajnoczi at August 29, 2020 01:04 PM

August 24, 2020

Stefan Hajnoczi

QEMU Internals: Event loops

This post explains event loops in QEMU v5.1.0 and their unique features compared to other event loop implementations. The APIs are not covered in detail here since they are explained in doc comments. Instead, the focus is on the big picture and why things work the way they do.

Event loops are central to many I/O-bound applications like network services and graphical desktop applications. QEMU also has I/O-bound work that fits well into an event loop. Examples include the QMP monitor, disk I/O, and timers.

An event loop monitors event sources for activity and invokes a callback function when an event occurs. This makes it possible to process multiple event sources within a single CPU thread. The application can appear to do multiple things at once without multithreading because it switches between handling different event sources. This architecture is common in Javascript, Python Twisted/asyncio, and many other environments. Sometimes the event loop is hidden underneath coroutines or async/await language features (QEMU has coroutines but often the event loop is still used directly).

The most important event sources in QEMU are:

  • File descriptors such as sockets and character devices.
  • Event notifiers (implemented as eventfds on Linux).
  • Timers for delayed function execution.
  • Bottom-halves (BHs) for invoking a function in another thread or deferring a function call to avoid reentrancy.

Event loops and threads

QEMU has several different types of threads:

  • vCPU threads that execute guest code and perform device emulation synchronously with respect to the vCPU.
  • The main loop that runs the event loops (yes, there is more than one!) used by many QEMU components.
  • IOThreads that run event loops for device emulation concurrently with vCPUs and "out-of-band" QMP monitor commands.

It's worth explaining how device emulation interacts with threads. When guest code accesses a device register the vCPU thread traps the access and dispatches it to an emulated device. The device's read/write function runs in the vCPU thread. The vCPU thread cannot resume guest code execution until the device's read/write function returns. This means long-running operations like emulating a timer chip or disk I/O cannot be performed synchronously in the vCPU thread since they would block the vCPU. Most devices solve this problem using the main loop thread's event loops. They add timer or file descriptor monitoring callbacks to the main loop and return back to guest code execution. When the timer expires or the file descriptor becomes ready the callback function runs in the main loop thread. The final part of emulating a guest timer or disk access therefore runs in the main loop thread and not a vCPU thread.
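
As an illustration, here is a minimal sketch of that pattern (not from the original post; the device name and callback are made up, and error handling, cleanup, and locking are omitted). The guest register write returns immediately and the timer callback later runs in the main loop thread:

#include "qemu/osdep.h"
#include "qemu/timer.h"

/* Hypothetical device state; only the timer matters for this sketch. */
typedef struct MyDeviceState {
    QEMUTimer *timer;
} MyDeviceState;

/* Runs in the main loop thread when the timer expires. */
static void my_device_timer_cb(void *opaque)
{
    MyDeviceState *s = opaque;
    /* ... finish the emulated operation, raise an interrupt, etc. ... */
    (void)s;
}

/* Called from the vCPU thread when the guest writes a device register.
 * Real devices usually create the timer once at realize time; it is done
 * here only to keep the sketch short. */
static void my_device_start_op(MyDeviceState *s)
{
    s->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, my_device_timer_cb, s);
    /* Expire 10 ms of guest time from now and return to the vCPU. */
    timer_mod(s->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 10 * SCALE_MS);
}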

Some devices perform the guest device register access in the main loop thread or an IOThread thanks to ioeventfd. ioeventfd is a Linux KVM API and also has a userspace fallback implementation for TCG that traps vCPU device accesses and writes to a file descriptor so another thread can handle the access.

The key point is that vCPU threads do not run an event loop. The main loop thread and IOThreads run event loops. vCPU threads can add event sources to the main loop or IOThread event loops. Callbacks run in the main loop thread or IOThreads.

How the main loop and IOThreads differ

The main loop and IOThreads share some code but are fundamentally different. The common code is called AioContext and is QEMU's native event loop API. Commonly-used functions include aio_set_fd_handler(), aio_set_event_notifier(), aio_timer_init(), and aio_bh_new().
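
For example, a bottom half can be used to defer a function call into the main loop's qemu_aio_context; this is a minimal sketch (the helper names are made up) built on the aio_bh_new() API mentioned above:

#include "qemu/osdep.h"
#include "qemu/main-loop.h"
#include "block/aio.h"

static void deferred_work(void *opaque)
{
    /* Runs in the main loop thread on a later event loop iteration. */
}

static void schedule_deferred_work(void)
{
    AioContext *ctx = qemu_get_aio_context();   /* the main loop context */
    QEMUBH *bh = aio_bh_new(ctx, deferred_work, NULL);

    /* The callback will run exactly once; the BH should eventually be
     * freed with qemu_bh_delete() when it is no longer needed. */
    qemu_bh_schedule(bh);
}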

The main loop actually has a glib GMainContext and two AioContext event loops. QEMU components can use any of these event loop APIs and the main loop combines them all into a single event loop function os_host_main_loop_wait() that calls qemu_poll_ns() to wait for event sources. This makes it possible to combine glib-based code with code using the native QEMU AioContext APIs.

The reason why the main loop has two AioContexts is because one, called iohandler_ctx, is used to implement older qemu_set_fd_handler() APIs whose handlers should not run when the other AioContext, called qemu_aio_context, is run using aio_poll(). The QEMU block layer and newer code uses qemu_aio_context while older code uses iohandler_ctx. Over time it may be possible to unify the two by converting iohandler_ctx handlers to safely execute in qemu_aio_context.

IOThreads have an AioContext and a glib GMainContext. The AioContext is run using the aio_poll() API, which enables the advanced features of the event loop. If a glib event loop is needed then the GMainContext can be run using g_main_loop_run() and the AioContext event sources will be included.

Code that relies on the AioContext aio_*() APIs will work with both the main loop and IOThreads. Older code using qemu_*() APIs only works with the main loop. glib code works with both the main loop and IOThreads.

The key difference between the main loop and IOThreads is that the main loop uses a traditional event loop that calls qemu_poll_ns() while IOThreads AioContext aio_poll() has advanced features that result in better performance.

AioContext features

AioContext has the following event loop features that traditional event loops do not have:

  • Adaptive polling support for lower latency but slightly higher CPU consumption. AioContext event sources can have a userspace polling function that detects events without performing syscalls (e.g. peeking at a memory location). This allows the event loop to avoid blocking syscalls that might lead the host kernel scheduler to yield the thread and put the physical CPU into a low power state. Keeping the CPU busy and avoiding entering the kernel minimizes latency.
  • O(1) time complexity with respect to the number of monitored file descriptors. When there are thousands of file descriptors O(n) APIs like poll(2) spend time scanning over all file descriptors, even those that have no activity. This scalability bottleneck can be avoided with the Linux io_uring and epoll APIs, both of which are supported by AioContext aio_poll().
  • Nanosecond timers. glib's event loop only has millisecond timers, which is not sufficient for emulating hardware timers.

These features are required for performance reasons. Unfortunately glib's event loop does not support them, otherwise QEMU could use GMainContext as its only event loop.

Conclusion

QEMU uses both its native AioContext event loop and glib's GMainContext. The QEMU main loop and IOThreads work differently, with IOThreads offering the best performance thanks to its AioContext aio_poll() event loop. Modern QEMU code should use AioContext APIs for optimal performance and so that the code can be used in both the main loop and IOThreads.

by Stefan Hajnoczi (noreply@blogger.com) at August 24, 2020 07:52 AM

August 12, 2020

Stefan Hajnoczi

Why QEMU should move from C to Rust

Welcome Redditors and HackerNews folks! This post is getting attention outside the QEMU community, so I'd like to highlight two things that may not be immediately clear: I am a QEMU maintainer and I'm not advocating to Rewrite It In Rust. Enjoy! :)

My KVM Forum 2018 presentation titled Security in QEMU: How Virtual Machines provide Isolation (pdf) (video) reviewed security bugs in QEMU and found the most common causes were C programming bugs. This includes buffer overflows, use-after-free, uninitialized memory, and more. In this post I will argue for using Rust as a safer language that prevents these classes of bugs.

In 2018 the choice of a safer language was not clear. C++ offered safe abstractions without an effective way to prohibit unsafe language features. Go also offered safety but with concerns about runtime costs. Rust looked promising but few people had deep experience with it. In 2018 I was not able to argue confidently for moving away from C in QEMU.

Now in 2020 the situation is clearer. C programming bugs are still the main cause of CVEs in QEMU. Rust has matured, its ecosystem is growing and healthy, and there are virtualization projects like Crosvm, Firecracker, and cloud-hypervisor that prove Rust is an effective language for writing Virtual Machine Monitors (VMM). In the QEMU community Paolo Bonzini and Sergio Lopez's work on rust-vmm and vhost-user code inspired me to look more closely at moving away from C.

Do we need to change programming language?

Most security bugs in QEMU are C programming bugs. This is easy to verify by looking through the CVE listings. Although I have only reviewed CVEs it seems likely that non-security bugs are also mostly C programming bugs.

Eliminating C programming bugs does not necessarily require switching programming languages. Other approaches to reducing bug rates in software include:

  • Coding style rules that forbid unsafe language features.
  • Building safe abstractions and prohibiting unsafe language features or library APIs.
  • Static checkers that scan source code for bugs.
  • Dynamic sanitizers that run software with instrumentation to identify bugs.
  • Unit testing and fuzzing.

The problem is, the QEMU community has been doing these things for years but new bugs are still introduced despite these efforts. It is certainly possible to spend more energy on these efforts but the evidence shows that bugs continue to slip through.

There are two issues with these approaches to reducing bugs. First, although these approaches help find existing bugs, eliminating classes of bugs so they cannot exist in the first place is a stronger approach. This is hard to do with C since the language is unsafe, placing the burden of safety on the programmer.

Second, much of the ability to write safe C code comes with experience. Custom conventions, APIs, tooling, and processes to reduce bugs are a hurdle for one-time contributors or newcomers. They make the codebase inaccessible unless we accept lower standards for some contributors. Code quality should depend as little on experience as possible but C is notorious for being a programming language that requires a lot of practice before you can write production-quality code.

Why Rust?

Safe languages eliminate memory safety bugs (and other classes like concurrency bugs). Rust made this a priority in its design:

  • Use-after-free, double-free, memory leaks, and other lifetime bugs are prevented at compile-time by the borrow checker where the compiler checks ownership of data.
  • Buffer overflows and other memory corruptions are prevented by compile-time and runtime bounds-checking.
  • Pointer dereference bugs are prevented by the absence of NULL pointers and strict ownership rules.
  • Uninitialized memory is prevented because all variables and fields must be initialized.

Rust programs can still "panic" at runtime when safety cannot be proven at compile time but this does not result in undefined behavior as seen in C programs. The program simply aborts with a backtrace. Bugs that could have resulted in arbitrary code execution in C become at most denial-of-service bugs in Rust. This reduces the severity of bugs.
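
A small self-contained illustration of the difference (this is not QEMU code, just an example): an out-of-bounds array access that would be undefined behavior in C becomes a clean, diagnosable panic in Rust.

fn main() {
    let buf = [0u8; 4];
    // The index comes from the environment, so the compiler cannot prove it
    // is in range and must emit a runtime bounds check.
    let i: usize = std::env::args().count() + 6; // always >= 7, out of bounds
    // In C the equivalent read would be undefined behavior (a silent
    // out-of-bounds read, an information leak, or a crash). Here the bounds
    // check fails and the program panics with a message like
    // "index out of bounds: the len is 4 but the index is 7".
    println!("{}", buf[i]);
}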

As a result of this language design, most C programming bugs that plague QEMU today are either caught by the compiler or turn into a safe program termination. It is reasonable to expect CVEs to decrease in both number and severity when switching to Rust.

At the same time Rust eliminates the need for many of the measures that the QEMU community added onto C because the Rust programming language and its compiler already enforce safety. This means newcomers and one-time contributors will not need QEMU-specific experience, can write production-quality code more easily, and can get their code merged more quickly. It also means reviewers will have to spend less time pointing out C programming bugs or asking for changes that comply with QEMU's way of doing things.

That said, Rust has a reputation for being a scary language due to the borrow checker. Most programmers have not thought about object lifetimes and ownership as systematically and explicitly as required by Rust. This raises the bar to learning the language, but I look at it this way: learning Rust is humanly possible, writing bug-free C code is not.

How can we change programming language?

When I checked in 2018 QEMU was 1.5 million lines of code. It has grown since then. Moving a large codebase to a new programming language is extremely difficult. If people want to convert QEMU to Rust that would be great, but I personally don't have the appetite to do it because I think the integration will be messy, result in a lot of duplication, and there is too much un(der)maintained code that is hard to convert.

The reason I am writing this post is because device emulation, the main security attack surface for VMMs, can be done in a separate program. That program can be written in any language and this is where Rust comes in. For vhost devices it is possible to write Rust device backends today and I hope this will become the default approach to writing new devices.

For non-vhost devices the vfio-user project is working on an interface for out-of-process device emulation. It will be possible to implement devices in Rust there too.

If you are implementing new device emulation code please consider doing it in Rust!

Conclusion

Most security bugs in QEMU today are C programming bugs. Switching to a safer programming language will significantly reduce security bugs in QEMU. Rust is now mature and proven enough to use as the language for device emulation code. Thanks to vhost-user and vfio-user using Rust for device emulation does not require a big conversion of QEMU code, it can simply be done in a separate program. This way attack surfaces can be written in Rust to make them less susceptible to security bugs going forward.

by Stefan Hajnoczi (noreply@blogger.com) at August 12, 2020 07:32 PM

August 11, 2020

QEMU project

QEMU version 5.1.0 released

We’d like to announce the availability of the QEMU 5.1.0 release. This release contains 2500+ commits from 235 authors.

You can grab the tarball from our download page. The full list of changes are available in the Wiki.

Highlights include:

  • ARM: support for ARMv8.2 TTS2UXN architecture feature
  • ARM: support for ARMv8.5 MemTag architecture feature
  • ARM: new board support for sonorapass-bmc
  • ARM: virt: support for memory hot-unplug
  • ARM: support for nvdimm hotplug for ACPI guests
  • AVR: new architecture support for AVR CPUs
  • AVR: new board support for Arduino Duemilanove, Arduino Mega 2560, Arduino Mega, and Arduino UNO
  • MIPS: support for Loongson 3A CPUs (R1 and R4)
  • MIPS: performance improvements for FPU and MSA instruction emulation
  • PowerPC: support for guest error recovery via FWNMI
  • RISC-V: support for SiFive E34 and Ibex CPUs
  • RISC-V: new board support for HiFive1 revB and OpenTitan
  • RISC-V: Spike machine now supports more than 1 CPU
  • s390: KVM support for protected virtualization (secure execution mode)
  • x86: improvements to HVF acceleration support on macOS
  • x86: reduced virtualization overhead for non-enlightened Windows guests via Windows ACPI Emulated Device Table
  • block: support for 2MB logical/physical blocksizes for virtual storage devices
  • crypto: support for passing secrets to QEMU via Linux keyring
  • crypto: support for LUKS keyslot management via qemu-img
  • NVMe: support for Persistent Memory Region from NVMe 1.4 spec
  • qemu-img: additional features added for map/convert/measure commands, as well as support for zstd compression
  • qemu-img: support for new ‘bitmap’ command for manipulating persistent bitmaps in qcow2 files
  • virtio: TCG guests can now use vhost-user threads
  • virtio: vhost-user now supports registering more than 8 RAM slots
  • and lots more…

Thank you to everyone involved!

August 11, 2020 11:00 PM

August 07, 2020

ARM Datacenter Project

How to setup NVMe/TCP with NVME-oF using KVM and QEMU

In this post we will explain how to connect two QEMU Guests (using KVM) with NVMe over Fabrics using TCP as the transport.

We will show how to connect an NVMe target to an NVMe initiator, using the NVMe/TCP transport. It is worth mentioning before we get started that we will use the term “target” for the guest which exports the NVMe target, and “initiator” for the guest which connects to it.

The target QEMU guest will export a simulated NVME drive which we will create from an image file. The initiator guest will connect to the target and will be able to access this NVME drive.

Note that this configuration is intended as an example for evaluation and/or development with NVMe-oF. The setup described here is not intended for a production environment.

First Step: Create a Guest

Before we can get started, we need to bring up our QEMU guests and get them sharing the same network.

Fortunately, we described in an earlier post how to setup a shared network for two QEMU guests. That’s a good place to start.

We also have other posts covering how to get an aarch64 VM up and running.

Kernel Configuration

Before we get started we will make sure that the guest’s kernel has all the right modules built in.

The guest’s kernel config should have these modules.

$ cat /boot/config-`uname -r` | grep NVME

# NVME Support
CONFIG_NVME_CORE=m
CONFIG_BLK_DEV_NVME=m
# CONFIG_NVME_MULTIPATH is not set
# CONFIG_NVME_HWMON is not set
CONFIG_NVME_FABRICS=m
CONFIG_NVME_FC=m
CONFIG_NVME_TCP=m
CONFIG_NVME_TARGET=m
CONFIG_NVME_TARGET_LOOP=m
CONFIG_NVME_TARGET_FC=m
# CONFIG_NVME_TARGET_FCLOOP is not set
CONFIG_NVME_TARGET_TCP=m
# end of NVME Support

nvme-cli

Make sure the nvme-cli is installed on the guests.

sudo apt-get install nvme-cli

Initiator Guest Setup

This is the QEMU command we use to bring up the initiator QEMU guest.

sudo qemu-system-aarch64 -nographic -machine virt,gic-version=max -m 8G -cpu max   \
       -drive file=./ubuntu20-a.img,if=none,id=drive0,cache=writeback              \
       -device virtio-blk,drive=drive0,bootindex=0                                 \
       -drive file=./flash0-a.img,format=raw,if=pflash                             \
       -drive file=./flash1-a.img,format=raw,if=pflash                             \
       -smp 4 -accel kvm -netdev bridge,id=hn1                                     \
       -device virtio-net,netdev=hn1,mac=e6:c8:ff:09:76:99

Target Guest Setup

When you bring up the target’s QEMU guest, be sure to include an NVME disk.

We can create the backing disk image as shown below.

qemu-img create -f qcow2 nvme.img 10G

When we bring up QEMU, add this set of options so that the guest sees the NVME disk.

-drive file=./nvme.img,if=none,id=nvme0 -device nvme,drive=nvme0,serial=1234

This is the QEMU command we use to bring up the target QEMU guest.

Note how we added in the options for the NVMe device.

sudo qemu-system-aarch64 -nographic -machine virt,gic-version=max -m 8G -cpu max \
       -drive file=./ubuntu20-b.img,if=none,id=drive0,cache=writeback              \ 
       -device virtio-blk,drive=drive0,bootindex=0                                 \ 
       -drive file=./flash0-b.img,format=raw,if=pflash                             \
       -drive file=./flash1-b.img,format=raw,if=pflash                             \
       -smp 4 -accel kvm -netdev bridge,id=hn1                                     \ 
       -device virtio-net,netdev=hn1,mac=e6:c8:ff:09:76:9c                         \ 
       -drive file=./nvme.img,if=none,id=nvme0 -device nvme,drive=nvme0,serial=1234 

Configure Target

Load the following modules on the target:

sudo modprobe nvmet
sudo modprobe nvmet-tcp

Next, create and configure an NVMe Target subsystem. This includes creating a namespace.

cd /sys/kernel/config/nvmet/subsystems
sudo mkdir nvme-test-target
cd nvme-test-target/
echo 1 | sudo tee -a attr_allow_any_host > /dev/null
sudo mkdir namespaces/1
cd namespaces/1

Before we can attach our NVMe device to this target, we need to find the name.

sudo nvme list

Node             SN     Model            Namespace Usage                      Format           FW Rev          
---------------- ------ ---------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     1234   QEMU NVMe Ctrl   1          10.74  GB /  10.74  GB    512   B +  0 B   1.0      

The next step attaches our NVMe device /dev/nvme0n1 to this target and enables it.

echo -n /dev/nvme0n1 |sudo tee -a device_path > /dev/null
echo 1|sudo tee -a enable > /dev/null

Next we will create an NVMe target port, and configure the IP address and other parameters.

sudo mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1

echo 192.168.0.16 |sudo tee -a addr_traddr > /dev/null

echo tcp|sudo tee -a addr_trtype > /dev/null
echo 4420|sudo tee -a addr_trsvcid > /dev/null
echo ipv4|sudo tee -a addr_adrfam > /dev/null

The final step creates a link to the subsystem from the port.

sudo ln -s /sys/kernel/config/nvmet/subsystems/nvme-test-target/ /sys/kernel/config/nvmet/ports/1/subsystems/nvme-test-target

At this point we should see a message in the dmesg log:

dmesg |grep "nvmet_tcp"

[81528.143604] nvmet_tcp: enabling port 1 (192.168.0.16:4420)

Mount Target on Initiator

Load the following modules on the initiator:

sudo modprobe nvme
sudo modprobe nvme-tcp

Next, check that we currently do not see any NVMe devices. The output of the following command should be blank.

sudo nvme list

Next, we will attempt to discover the remote target.

When we initially tried the “discover” command, we got an error indicating that a hostnqn was required. In the example below you will notice that we provide one.
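
As a side note (this step is not part of the original walkthrough): if the initiator does not already have a host NQN, nvme-cli can generate one; the value is conventionally stored in /etc/nvme/hostnqn and can also be passed directly via --hostnqn as we do below.

nvme gen-hostnqn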

sudo nvme discover -t tcp -a 192.168.0.16 -s 4420 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:1b4e28ba-2fa1-11d2-883f-0016d3ccabcd

Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified, sq flow control disable supported
portid:  1
trsvcid: 4420
subnqn:  nvme-test-target
traddr:  192.168.0.16
sectype: none

Using the subnqn as the -n argument, we will connect to the discovered target.

sudo nvme connect -t tcp -n nvme-test-target -a 192.168.0.16 -s 4420 --hostnqn=nqn.2014-08.org.nvmexpress:uuid:1b4e28ba-2fa1-11d2-883f-0016d3ccabcd

Success. We can immediately check the nvme list for the attached device.

sudo nvme list
Node             SN               Model  Namespace Usage                      Format           FW Rev  
---------------- ---------------- ------ --------- -------------------------- ---------------- --------
/dev/nvme0n1     84cfc88e9ba4a8f4 Linux  1          10.74  GB /  10.74  GB    512   B +  0 B   5.8.0-rc

To detach the target, run the following command on the initiator.

sudo nvme disconnect /dev/nvme0n1 -n nvme-test-target


by Rob Foley at August 07, 2020 11:52 AM

August 06, 2020

ARM Datacenter Project

How to connect two aarch64 QEMU guests with a bridge

In this post we will show how to share a network between two QEMU guests using a bridge.

There are many possible uses for this kind of setup. One of them is integration testing, for example running target and initiator code with one guest acting as the initiator and another as the target.

This post creates a bridge on the host, which the guests both share.

Create Bridge for Shared Network

We will first create the bridge and give it a “local” address, since for now we are not planning on exporting this network off the host. You will also notice that we give the host an IP address on this network, 192.168.0.1.

sudo ip link add br0 type bridge
sudo ip addr add 192.168.0.1/24 dev br0
sudo ip link set br0 up

Now we can check that the bridge exists and is ready (state is UP).

$ ip addr

6: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fe:06:bb:4c:37:a1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.1/24 scope global br0
       valid_lft forever preferred_lft forever
    inet6 fe80::8c39:6fff:fe23:ca06/64 scope link 
       valid_lft forever preferred_lft forever

Next we will tell QEMU about this bridge by adding it to the QEMU bridge configuration file.

$ echo 'allow br0' | sudo tee -a /etc/qemu/bridge.conf

Starting the guests

When we bring up the QEMU guests, we will provide the -netdev option to specify a bridge that our guests will use for their network. Below is an example of these network options.

-netdev bridge,id=hn1 -device virtio-net,netdev=hn1,mac=e6:c8:ff:09:76:99

Here are the full set of options to bring up our aarch64 guests.

Note that we specify a different MAC address for each guest.

guest A:

$ sudo qemu-system-aarch64 -nographic -machine virt,gic-version=max -m 8G -cpu max \
       -drive file=./ubuntu20-a.img,if=none,id=drive0,cache=writeback              \
       -device virtio-blk,drive=drive0,bootindex=0                                 \
       -drive file=./flash0-a.img,format=raw,if=pflash                             \
       -drive file=./flash1-a.img,format=raw,if=pflash                             \
       -smp 4 -accel kvm -netdev bridge,id=hn1                                     \
       -device virtio-net,netdev=hn1,mac=e6:c8:ff:09:76:99

guest B:

$ sudo qemu-system-aarch64 -nographic -machine virt,gic-version=max -m 8G -cpu max \
       -drive file=./ubuntu20-b.img,if=none,id=drive0,cache=writeback              \ 
       -device virtio-blk,drive=drive0,bootindex=0                                 \ 
       -drive file=./flash0-b.img,format=raw,if=pflash                             \
       -drive file=./flash1-b.img,format=raw,if=pflash                             \
       -smp 4 -accel kvm -netdev bridge,id=hn1                                     \ 
       -device virtio-net,netdev=hn1,mac=e6:c8:ff:09:76:9c

Once the guests are up, you can configure the IP addresses quickly for both guests via the below commands.

Note that we chose the IP address 192.168.0.8 for guest A and 192.168.0.16 for guest B.

$ ip addr

2: enp0s3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether e6:c8:ff:09:76:99 brd ff:ff:ff:ff:ff:ff

$ sudo ip addr add 192.168.0.8/24 dev enp0s3
$ sudo ip link set enp0s3 up

Testing the Shared Network

To test that the guest’s network is up, check ip addr again. It should show “state UP”.

$ ip addr

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether e6:c8:ff:09:76:99 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.8/24 scope global enp0s3
       valid_lft forever preferred_lft forever
    inet6 fe80::e4c8:ffff:fe09:7699/64 scope link 
       valid_lft forever preferred_lft forever

Then we can try pinging the host.

$ ping 192.168.0.1
PING 192.168.0.1 (192.168.0.1) 56(84) bytes of data.
64 bytes from 192.168.0.1: icmp_seq=1 ttl=64 time=0.193 ms

It works! Now let’s try pinging the other guest.

$ ping 192.168.0.16
PING 192.168.0.16 (192.168.0.16) 56(84) bytes of data.
64 bytes from 192.168.0.16: icmp_seq=1 ttl=64 time=0.358 ms

Also worked! The guests can now see each other and are sharing the same network.

Access to Network Beyond Host

Suppose we also want access to the external network. This can be achieved by simply adding a host network device to the bridge. In our case the device is enahisic2i0, and we add it to bridge br0:

sudo ip link set enahisic2i0 master br0

After that, we move the host's public IP address to the bridge. You might need to remove it from the device before adding it to the bridge.

sudo ip addr del 1.234.55.67/24 dev enahisic2i0
sudo ip addr add 1.234.55.67/24 dev br0

Finally, inside the guests, give them public IP addresses as well:

sudo ip addr add 1.234.55.65/24 dev enp0s3

Note that to access beyond your local subnet, you might need to add a default route:

sudo ip route add default via 1.234.55.1 dev enp0s3


by Rob Foley at August 06, 2020 07:52 PM

August 05, 2020

Marcin Juszkiewicz

So your hardware is ServerReady?

Recently I changed my assignment at Linaro, from Cloud to Server Architecture. Which means less time spent on Kolla things, more on server-related things. And right at the start I got a project I had managed to forget about :D

SBSA reference platform in QEMU

In 2017 someone got the idea to make a new machine type for QEMU: pure hardware emulation of an SBSA-compliant reference platform, without using virtio components.

Hongbo Zhang wrote the code and got it merged into QEMU, Radosław Biernacki wrote basic support for EDK2 (also merged upstream). Out of the box it can boot to the UEFI shell. Linux is not bootable due to the lack of ACPI tables (DeviceTree is not an option here).

ACPI tables in firmware

Tanmay Jagdale works on adding ACPI tables in his fork of edk2-platforms. With this firmware Linux boots and can be used.

Testing tools

But what is the point of just having a reference platform if there is no testing? So I took a look and found two interesting tools:

Server Base System Architecture — Architecture Compliance Suite

The SBSA ACS tool requires ACPI tables to be present in order to work. Once started, it nicely checks how compliant your system is:

FS0:\> Sbsa.efi -p

 SBSA Architecture Compliance Suite
    Version 2.4

 Starting tests for level  4 (Print level is  3)

 Creating Platform Information Tables
 PE_INFO: Number of PE detected       :    3
 GIC_INFO: Number of GICD             :    1
 GIC_INFO: Number of ITS              :    1
 TIMER_INFO: Number of system timers  :    0
 WATCHDOG_INFO: Number of Watchdogs   :    0
 PCIE_INFO: Number of ECAM regions    :    2
 SMMU_INFO: Number of SMMU CTRL       :    0
 Peripheral: Num of USB controllers   :    1
 Peripheral: Num of SATA controllers  :    1
 Peripheral: Num of UART controllers  :    1

      ***  Starting PE tests ***
   1 : Check for number of PE            : Result:  PASS
   2 : Check for SIMD extensions                PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   3 : Check for 16-bit ASID support            PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   4 : Check MMU Granule sizes                  PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   5 : Check Cache Architecture                 PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   6 : Check HW Coherence support               PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   7 : Check Cryptographic extensions           PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   8 : Check Little Endian support              PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
   9 : Check EL2 implementation                 PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  10 : Check AARCH64 implementation             PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  11 : Check PMU Overflow signal         : Result:  PASS
  12 : Check number of PMU counters             PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1
  13 : Check Synchronous Watchpoints            PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  14 : Check number of Breakpoints              PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  15 : Check Arch symmetry across PE            PSCI_CPU_ON: failure

       Reg compare failed for PE index=1 for Register: CCSIDR_EL1
       Current PE value = 0x0         Other PE value = 0x100FBDB30E8
       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  16 : Check EL3 implementation                 PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  17 : Check CRC32 instruction support          PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    1 for Level=  4 : Result:  --FAIL-- 129
  18 : Check for PMBIRQ signal
       SPE not supported on this PE      : Result:  -SKIPPED- 1
  19 : Check for RAS extension                  PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1
  20 : Check for 16-Bit VMID                    PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1
  21 : Check for Virtual host extensions        PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure

       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1
  22 : Stage 2 control of mem and cache         PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure
: Result:  -SKIPPED- 1
  23 : Check for nested virtualization          PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure
: Result:  -SKIPPED- 1
  24 : Support Page table map size change       PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure
: Result:  -SKIPPED- 1
  25 : Check for pointer signing                PSCI_CPU_ON: failure


  25 : Check for pointer signing                PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure
: Result:  -SKIPPED- 1
  26 : Check Activity monitors extension        PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure
: Result:  -SKIPPED- 1
  27 : Check for SHA3 and SHA512 support        PSCI_CPU_ON: failure
       PSCI_CPU_ON: failure
: Result:  -SKIPPED- 1

      *** One or more PE tests have failed... ***

      ***  Starting GIC tests ***
 101 : Check GIC version                 : Result:  PASS
 102 : If PCIe, then GIC implements ITS  : Result:  PASS
 103 : GIC number of Security states(2)  : Result:  PASS
 104 : GIC Maintenance Interrupt
       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1

      One or more GIC tests failed. Check Log

      *** Starting Timer tests ***
 201 : Check Counter Frequency           : Result:  PASS
 202 : Check EL0-Phy timer interrupt     : Result:  PASS
 203 : Check EL0-Virtual timer interrupt : Result:  PASS
 204 : Check EL2-phy timer interrupt     : Result:  PASS
 205 : Check EL2-Virtual timer interrupt
       v8.1 VHE not supported on this PE : Result:  -SKIPPED- 1
 206 : SYS Timer if PE Timer not ON
       PE Timers are not always-on.
       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1
 207 : CNTCTLBase & CNTBaseN access
       No System timers are defined      : Result:  -SKIPPED- 1

     *** Skipping remaining System timer tests ***

      *** One or more tests have Failed/Skipped.***

      *** Starting Watchdog tests ***
 301 : Check NS Watchdog Accessibility
       No Watchdogs reported          0
       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1
 302 : Check Watchdog WS0 interrupt
       No Watchdogs reported          0
       Failed on PE -    0 for Level=  4 : Result:  --FAIL-- 1

      ***One or more tests have failed... ***

      *** Starting PCIe tests ***
 401 : Check ECAM Presence               : Result:  PASS
 402 : Check ECAM value in MCFG table    : Result:  PASS

        Unexpected exception occured
        FAR reported = 0xEBDAB180
        ESR reported = 0x97800010
     -------------------------------------------------------
     Total Tests run  =   42;  Tests Passed  =   11  Tests Failed =   22
     ---------------------------------------------------------

      *** SBSA tests complete. Reset the system. ***

As you can see there is still a lot of work to do.

ACPI Tables View

This tool displays the content of ACPI tables in hex/ASCII format and then interprets the information field by field.
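
The post does not name the shell binary itself; assuming it is the standard acpiview command from the EDK2 UEFI shell, a plain dump of all tables is started like this:

FS0:\> acpiview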

What makes it even more useful is the “-r 2” argument, which enables checking the tables against the Server Base Boot Requirements (SBBR) v1.2 specification. On the SBSA reference platform with Tanmay’s firmware it lists two errors:

ERROR: SBBR v1.2: Mandatory DBG2 table is missing
ERROR: SBBR v1.2: Mandatory PPTT table is missing

Table Statistics:
        2 Error(s)
        0 Warning(s)

So the situation looks good, as those tables can easily be added.

CI

So we have code to check and tools to check it with. Put the two together and you have a clear need for a CI job. So I wrote one for the Linaro CI infrastructure: “LDCG SBSA firmware“. It builds the current top of tree of QEMU and EDK2, then boots the result and runs the above tools. Results are sent to a mailing list.

ServerReady?

The Arm ServerReady compliance program provides a solution for servers that “just works”, allowing partners to deploy Arm servers with confidence. The program is based on industry standards and the Server Base System Architecture (SBSA) and Server Base Boot Requirement (SBBR) specifications, alongside Arm’s Server Architectural Compliance Suite (ACS). Arm ServerReady ensures that Arm-based servers work out-of-the-box, offering seamless interoperability with standard operating systems, hypervisors, and software.

In other words: if your hardware is SBSA compliant then you can run the SBBR compliance tests and then go and ask for a certification sticker or something like that.

But if your hardware is not SBSA compliant then EBBR is all you can get. Far from being ServerReady. Never mind what people try to say — ServerReady requires SBBR, which requires SBSA.

Future work

More tests to integrate. ARM Enterprise ACS is next on my list.

by Marcin Juszkiewicz at August 05, 2020 11:53 AM

July 29, 2020

Cornelia Huck

Configuring mediated devices (Part 2)

In the last part of this article, I talked about configuring a mediated device directly via sysfs. This is a bit cumbersome, and you may want to make your configuration more permanent. Fortunately, there is tooling available for this.

driverctl: bind to the correct driver

driverctl is a tool to manage the driver that a device may bind to. As a device that is supposed to be used via vfio will need to be bound to a vfio driver instead of its 'normal' driver, it makes sense to add some configuration that makes sure that this binding is actually done automatically. While driverctl had originally been implemented to work with PCI devices, the css bus (for subchannel devices) supports management with driverctl as of Linux 5.3 as well. (The ap bus for crypto devices does not support setting driver overrides, as it implements a different mechanism.)

Example (vfio-ccw)

Let's reuse the example from the last post, where we wanted to assign the device behind subchannel 0.0.0313 to the guest. In order to set a driver override, use

[root@host ~]# driverctl -b css set-override 0.0.0313 vfio_ccw

If the subchannel is not already bound to the vfio_ccw driver, it will be unbound from its current driver and bound to vfio_ccw. Moreover, a udev rule to bind the subchannel to vfio_ccw automatically in the future will be added.

Unfortunately, a word of caution regarding the udev rule is in order: As uevents on the css bus for I/O subchannels are delayed until after device recognition has been performed, automatic binding may not work out as desired. We plan to address that in the future by reworking the way the css bus handles uevents; until then, you may have to trigger a rebind manually. Also, keep in mind that the subchannel id for a device may not be stable (as mentioned previously); automation should be used cautiously in that case.

mdevctl: manage mediated devices

The more tedious part of configuring a passthrough setup is configuring and managing mediated devices. To help with that, mdevctl has been written. It can create, modify, and remove mediated devices (and optionally make those changes persistent), work with configurations and devices created via other means, and list mediated devices and the different types that are supported.

Creating a mediated device

In order to create a mediated device, you need a uuid. You can either provide your own (as in the manual case), or let mdevctl pick one for you. In order to get the same configuration as in the manual configuration examples, let's create a vfio-ccw device with the same uuid as before.

The following command defines the same mediated device as in the manual example:
  
 [root@host ~]# mdevctl define -u 7e270a25-e163-4922-af60-757fc8ed48c6 -p 0.0.0313 -t vfio_ccw-io -a

Note the '-a', which instructs mdevctl to start the device automatically from now on.

After you've created the device, you can check which devices mdevctl is now aware of:

  [root@host ~] # mdevctl list -d
 7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io

Note that the '-d' instructs mdevctl to show defined, but not started devices.

Let's start the device:

  [root@host ~] # mdevctl start -u 7e270a25-e163-4922-af60-757fc8ed48c6
 [root@host ~] # mdevctl list -d
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io auto (active)

The mediated device is now ready to be used and can be passed to a guest.

Making your configuration persistent

If you already created a mediated device manually, you may want to reuse the existing configuration and make it persistent, instead of starting from scratch.

So, let's create another vfio-ccw device the manual way:

 [root@host ~] # uuidgen
  b29e4ca9-5cdb-4ee1-a01b-79085b9ab237
 [root@host ~] # echo "b29e4ca9-5cdb-4ee1-a01b-79085b9ab237" > /sys/bus/css/drivers/vfio_ccw/0.0.0314/mdev_supported_types/vfio_ccw-io/create

mdevctl now actually knows about the active device (in addition to the device we configured before):

  [root@host ~] # mdevctl list
  b29e4ca9-5cdb-4ee1-a01b-79085b9ab237 0.0.0314 vfio_ccw-io
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io (defined)

But it obviously does not have a definition for the manually created device:

  [root@host ~] # mdevctl list -d
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io auto (active)

On a restart, the new device would be gone again; but we can make it persistent:

  [root@host ~] # mdevctl define -u b29e4ca9-5cdb-4ee1-a01b-79085b9ab237
  [root@host ~] # mdevctl list
  b29e4ca9-5cdb-4ee1-a01b-79085b9ab237 0.0.0314 vfio_ccw-io (defined)
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io (defined)

If you check under /etc/mdevctl.d/, you will find that an appropriate JSON file has been created:

  [root@host ~] # cat /etc/mdevctl.d/0.0.0314/b29e4ca9-5cdb-4ee1-a01b-79085b9ab237 
  {
    "mdev_type": "vfio_ccw-io",
    "start": "manual",
    "attrs": []
  }

(Note that this device is not automatically started by default.)

Modifying an existing device

There are good reasons to modify an existing device: you may want to modify your setup, or, in the case of vfio-ap, you need to modify some attributes before being able to use the device in the first place.

Let's first create the device. This command creates the same device as created manually in the last post:

  [root@host ~] # mdevctl define -u "669d9b23-fe1b-4ecb-be08-a2fabca99b71" --parent matrix --type vfio_ap-passthrough
 [root@host ~] # mdevctl list -d
  669d9b23-fe1b-4ecb-be08-a2fabca99b71 matrix vfio_ap-passthrough manual

This device is not yet very useful, as you still need to assign some queues to it. It now looks like this:

  [root@host ~]  # mdevctl list -d -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --dumpjson
  {
    "mdev_type": "vfio_ap-passthrough",
    "start": "manual"
  }

Let's modify the device and add some queues:

  [root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_adapter --value=5
 [root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_domain --value=4
 [root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_domain --value=0xab

The device's JSON now looks like this:

  [root@host ~] # mdevctl list -d -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --dumpjson
{
  "mdev_type": "vfio_ap-passthrough",
  "start": "manual",
  "attrs": [
    {
      "assign_adapter": "5"
    },
    {
      "assign_domain": "4"
    },
    {
      "assign_domain": "0xab"
    }
  ]
}

This is now exactly what we had defined manually in the last post.

But what if you notice that you want domain 0x42 instead of domain 4? Just modify the definition. To make it easier to figure out how to specify the attribute to manipulate, use this output:

  [root@host ~] # mdevctl list -dv -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71
669d9b23-fe1b-4ecb-be08-a2fabca99b71 matrix vfio_ap-passthrough manual
  Attrs:
    @{0}: {"assign_adapter":"5"}
    @{1}: {"assign_domain":"4"}
    @{2}: {"assign_domain":"0xab"}

You want to remove attribute 1, and add a new value:

  [root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --delattr --index=1
  [root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_domain --value=0x42

Let's check that it now looks as desired:

  [root@host ~] # mdevctl list -dv -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71
669d9b23-fe1b-4ecb-be08-a2fabca99b71 matrix vfio_ap-passthrough manual
  Attrs:
    @{0}: {"assign_adapter":"5"}
    @{1}: {"assign_domain":"0xab"}
    @{2}: {"assign_domain":"0x42"}

Future development

While mdevctl works perfectly fine for managing individual mediated devices, it does not maintain a view of the complete system. This means you notice conflicts between two devices only when you try to activate the second one. In the case of vfio-ap, the rules to be considered are complex, and there is quite some potential for conflict. In order to be able to catch that kind of problem early, we plan to add callouts to mdevctl, which would, for example, allow invoking a validation tool when a new device is added, but before it is activated. This is potentially useful for other device types as well.

by Cornelia Huck (noreply@blogger.com) at July 29, 2020 01:12 PM

July 28, 2020

KVM on Z

KVM on IBM Z at Virtual SHARE

Don't miss our session dedicated to Secure Execution at this year's virtual SHARE:
Protecting Data In-use With Secure Execution, presented by Reinhard Bündgen, 3:50 PM - 4:35 PM EST on Tuesday, August 4.

by Stefan Raspl (noreply@blogger.com) at July 28, 2020 07:45 PM

July 27, 2020

Cornelia Huck

Configuring mediated devices (Part 1)

vfio-mdev has become popular over the last few years for assigning certain classes of devices to guests. On the s390x side, vfio-ccw and vfio-ap are using the vfio-mdev framework for making channel devices and crypto adapters accessible to guests.
This and a follow-up article aim to give an overview of the infrastructure, how to set up and manage devices, and how to use tooling for this.

What is a mediated device?

A general overview

Mediated devices grew out of the need to build upon the existing vfio infrastructure in order to support more fine grained management of resources. Some of the initial use cases included GPUs and (maybe somewhat surprisingly) s390 channel devices.

When using the mediated device (mdev) API, common tasks are performed in the mdev core driver (like device management), while device-specific tasks are done in a vendor driver. Current in-kernel examples of vendor drivers are the Intel vGPU driver, vfio-ccw, and vfio-ap.

Examples on s390

vfio-ccw

vfio-ccw can be used to assign channel devices. It is pretty straightforward: vfio-ccw is an alternative driver for I/O subchannels, and a single mediated device per subchannel is supported.

vfio-ap

vfio-ap can be used to assign crypto cards/queues (APQNs). It is a bit more involved, requiring prior setup on the ap bus level and configuration of a 'matrix' device. Complex relationships between the resources that can be assigned to different guests exist. Configuration-wise, this is probably the most complex mediated device available today.

Configuring a mediated device: the manual way

Mediated devices can be configured manually via sysfs operations. This is a good way to see what actually happens, but probably not what you want to do as a general administration task. Tools to help here will be introduced in part 2 of this article.

I will show the steps for both vfio-ccw and vfio-ap, just to show two different approaches. (Both examples are also used in the QEMU documentation, in case this looks familiar.)

Binding to the correct driver

vfio-ccw

Assume you want to use a DASD with the device bus ID 0.0.2b09. As vfio-ccw operates on the subchannel level, you first need to locate the subchannel for this device:

   [root@host ~]# lscss | grep 0.0.2b09 | awk '{print $2}'
  0.0.0313

(A word of caution: a device is not guaranteed to use the same subchannel at all times; on LPARs, the subchannel number will usually be stable, but z/VM -- and QEMU -- assign subchannel numbers in a consecutive order. If you don't get any hotplug events for a device, the subchannel number will stay stable for at least as long as the guest is running, though.)

Now you need to unbind the subchannel device from the default I/O subchannel driver and bind it to the vfio-ccw driver (make sure the device is not in use!):

    [root@host ~]# echo 0.0.0313 > /sys/bus/css/devices/0.0.0313/driver/unbind
    [root@host ~]# echo 0.0.0313 > /sys/bus/css/drivers/vfio_ccw/bind

vfio-ap

You need to perform some preliminary configuration of your crypto adapters before you can use any of them with vfio-ap. If nothing different has been set up, a crypto adapter will only bind to the default device drivers, and you cannot use it via vfio-ap. In order to be able to bind an adapter to vfio-ap, you first need to modify the /sys/bus/ap/apmask and /sys/bus/ap/aqmask entries. Both are basically bitmasks indicating that the matching adapter IDs (for apmask) and queue indices (for aqmask) can only be bound to the default drivers. If you want to use a certain APQN via vfio-ap, you need to unset the respective bits.

Let's assume you want to assign the APQNs (5, 4) and (5, ab). First, you need to make the adapter and the domains available to non-default drivers:

  [root@host ~]#  echo -5 > /sys/bus/ap/apmask
  [root@host ~]#  echo -4, -0xab > /sys/bus/ap/aqmask

This should result in the devices being bound to the vfio_ap driver (you can verify this by looking for them under /sys/bus/ap/drivers/vfio_ap/).

Create a mediated device

The basic workflow is "pick a uuid, create a mediated device identified by it".

vfio-ccw

For vfio-ccw, the two steps of the basic workflow are enough:

  [root@host ~]# uuidgen
  7e270a25-e163-4922-af60-757fc8ed48c6
  [root@host ~]# echo "7e270a25-e163-4922-af60-757fc8ed48c6" > \
    /sys/bus/css/devices/0.0.0313/mdev_supported_types/vfio_ccw-io/create

vfio-ap

For vfio-ap, you need a more involved approach. The uuid is used to create a mediated device under the 'matrix' device:

  [root@host ~] # uuidgen
  669d9b23-fe1b-4ecb-be08-a2fabca99b71
 [root@host ~]# echo "669d9b23-fe1b-4ecb-be08-a2fabca99b71" > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/create

This mediated device will need to collect all APQNs that you want to pass to a specific guest. For that, you need to use the assign_adapter, assign_domain, and possibly assign_control_domain attributes (we'll ignore control domains for simplicity's sake.) All attributes have a companion unassign_ attribute to remove adapters/domains from the mediated device again. You can only assign adapters/domains that you removed from apmask/aqmask in the previous step. To follow up on our example again:

  [root@host ~]# echo 5 > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/assign_adapter
 [root@host ~]# echo 4 > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/assign_domain
 [root@host ~]# echo 0xab > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/assign_domain

If you want to make sure that the mediated device is set up correctly, check via

  [root@host ~]# cat /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/matrix
  05.0004
  05.00ab
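
Should you later want to remove an adapter or domain from the mediated device again, the companion unassign_ attributes mentioned above are written in the same way; for example (following the same sysfs path layout as above):

  [root@host ~]# echo 4 > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/unassign_domain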

Configuring QEMU/libvirt

Your mediated device is now ready to be passed to a guest.

vfio-ccw

Let's assume you want the device to show up as device 0.0.1234 in the guest.

For the QEMU command line, use

-device vfio-ccw,devno=fe.0.1234,sysfsdev=\
    /sys/bus/mdev/devices/7e270a25-e163-4922-af60-757fc8ed48c6

For libvirt, use the following XML snippet in the <devices> section:

<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-ccw'>
  <source>
    <address uuid='7e270a25-e163-4922-af60-757fc8ed48c6'/>
  </source>
  <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x1234'/>
</hostdev>

vfio-ap

Any APQNs will show up in the guest exactly as they show up in the host (i.e., no remapping is possible.)

For the QEMU command line, use

-device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/669d9b23-fe1b-4ecb-be08-a2fabca99b71

For libvirt, use the following XML snippet in the <devices> section:

<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-ap'>
  <source>
    <address uuid='669d9b23-fe1b-4ecb-be08-a2fabca99b71'/>
  </source>
</hostdev>

Tooling

All this manual setup is a bit tedious; the next article in this series will look at some of the tooling that is available for mediated devices.

by Cornelia Huck (noreply@blogger.com) at July 27, 2020 06:55 PM

July 18, 2020

Stefan Hajnoczi

Rethinking event loop integration for libraries

APIs for operations that take a long time are often asynchronous so that applications can continue with other tasks while an operation is running. Asynchronous APIs initiate an operation and then return immediately. The application is notified when the operation completes through a callback or by monitoring a file descriptor for activity (for example, when data arrives on a TCP socket).

Asynchronous applications are usually built around an event loop that waits for the next event and invokes a function to handle the event. Since the details of event loops differ between applications, libraries need to be designed carefully to integrate well with a variety of event loops.

The current model

A popular library with asynchronous APIs is the libcurl file transfer library that is used for making HTTP requests. It has the following (slightly simplified) event loop integration API:


#define CURL_WAIT_POLLIN 0x0001 /* Ready to read? */
#define CURL_WAIT_POLLOUT 0x0004 /* Ready to write? */

int socket_callback(CURL *easy,      /* easy handle */
                    int fd,          /* socket */
                    int what,        /* describes the socket */
                    void *userp,     /* private callback pointer */
                    void *socketp);  /* private socket pointer */

libcurl invokes the application's socket_callback() to start or stop monitoring file descriptors. When the application's event loop detects file descriptor activity, the application invokes libcurl's curl_multi_socket_action() API to let the library process the file descriptor.

There are variations on this theme but generally libraries expose file descriptors and event flags (read/write/error) so the application can monitor file descriptors from its own event loop. The library then performs the read(2) or write(2) call when the file descriptor becomes ready.

How io_uring changes the picture

The Linux io_uring API (pdf) can be used to implement traditional event loops that monitor file descriptors. But it also supports asynchronous system calls like read(2) and write(2) (best used when IORING_FEAT_FAST_POLL is available). The latter is interesting because it combines two syscalls into a single efficient syscall:

  1. Waiting for file descriptor activity.
  2. Reading/writing the file descriptor.

Existing applications use syscalls like epoll_wait(2), poll(2), or the old select(2) to wait for file descriptor activity. They can also use io_uring's IORING_OP_POLL_ADD to achieve the same effect.

After the file descriptor becomes ready, a second syscall like read(2) or write(2) is required to actually perform I/O.

io_uring's asynchronous IORING_OP_READ or IORING_OP_WRITE (including variants for vectored I/O or sockets) only requires a single io_uring_enter(2) call. If io_uring sqpoll is enabled then a syscall may not even be required to submit these operations!

To summarize, it's more efficient to perform a single asynchronous read/write instead of first monitoring file descriptor activity and then performing a read(2) or write(2).

A new model

Existing library APIs do not fit the asynchronous read/write approach because they expect the application to wait for file descriptor activity and then for the library to invoke a syscall to perform I/O. A new model is needed where the library tells the application about I/O instead of asking the application to monitor file descriptors for activity.

The library can use a new callback API that lets the application perform asynchronous I/O:


/*
 * The application invokes this callback when an aio operation has completed.
 *
 * @cb_arg: the cb_arg passed to a struct aio_operations function by the library
 * @ret: the return value of the aio operation (negative errno for failure)
 */
typedef void aio_completion_fn(void *cb_arg, ssize_t ret);

/*
 * Asynchronous I/O operation callbacks provided to the library by the
 * application.
 *
 * These functions initiate an I/O operation and then return immediately. When
 * the operation completes the @cb callback is invoked with @cb_arg. Note that
 * @cb may be invoked before the function returns (typically in the case of an
 * early error).
 */
struct aio_operations {
    void read(int fd, void *data, size_t len, aio_completion_fn *cb,
              void *cb_arg);
    void write(int fd, void *data, size_t len, aio_completion_fn *cb,
               void *cb_arg);
    ...
};

The concept of monitoring file descriptor activity is gone. Instead the API focusses on asynchronous I/O operations that can be implemented by the application however it sees fit.

Applications using io_uring can use IORING_OP_READ and IORING_OP_WRITE to implement asynchronous operations efficiently. Traditional applications can still use their event loops but now also perform the read(2), write(2), etc syscalls on behalf of the library.
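
To make this concrete, here is a minimal sketch, not taken from any real library, of how an application could back such a read callback with liburing so that waiting and reading happen in a single submission (names like submit_read and read_request are made up; error handling is omitted):

#include <liburing.h>
#include <stdlib.h>
#include <sys/types.h>

typedef void aio_completion_fn(void *cb_arg, ssize_t ret);

struct read_request {
    aio_completion_fn *cb;
    void *cb_arg;
};

/* Set up elsewhere, e.g. with io_uring_queue_init(64, &ring, 0). */
static struct io_uring ring;

/* Submit an asynchronous read; there is no separate readiness step. */
static void submit_read(int fd, void *data, size_t len,
                        aio_completion_fn *cb, void *cb_arg)
{
    struct read_request *req = malloc(sizeof(*req));
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    req->cb = cb;
    req->cb_arg = cb_arg;
    io_uring_prep_read(sqe, fd, data, len, 0 /* offset */);
    io_uring_sqe_set_data(sqe, req);
    io_uring_submit(&ring);
}

/* Called from the application's event loop to dispatch completions. */
static void process_completions(void)
{
    struct io_uring_cqe *cqe;

    while (io_uring_peek_cqe(&ring, &cqe) == 0) {
        struct read_request *req = io_uring_cqe_get_data(cqe);

        req->cb(req->cb_arg, cqe->res);
        free(req);
        io_uring_cqe_seen(&ring, cqe);
    }
}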

Some libraries don't need a full set of struct aio_operations callbacks because they only perform I/O in limited ways. For example, a library that only has a Linux eventfd can instead present this simplified API:


/*
 * Return an eventfd(2) file descriptor that the application must read from and
 * call lib_eventfd_fired() when a non-zero value was read.
 */
int lib_get_eventfd(struct libobject *obj);

/*
 * The application must call this function when a read from the eventfd
 * returned by lib_get_eventfd() yields a non-zero value.
 */
void lib_eventfd_fired(struct libobject *obj);

Although this simplified API is similar to traditional event loop integration APIs it is now the application's responsibility to perform the eventfd read(2), not the library's. This way applications using io_uring can implement the read efficiently.
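
For instance, an io_uring application could drive the simplified eventfd API above roughly like this (a sketch with no error handling and a ring set up elsewhere): submit an asynchronous read of the 8-byte eventfd counter, call lib_eventfd_fired() when the completion arrives, and re-arm the read:

#include <liburing.h>
#include <stdint.h>

static uint64_t eventfd_value;   /* one outstanding read at a time in this sketch */

static void arm_lib_eventfd(struct io_uring *ring, struct libobject *obj)
{
    int fd = lib_get_eventfd(obj);

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fd, &eventfd_value, sizeof(eventfd_value), 0);
    io_uring_sqe_set_data(sqe, obj);
    io_uring_submit(ring);
}

/* Called from the application's completion loop when the eventfd read completes */
static void on_lib_eventfd(struct io_uring *ring, struct io_uring_cqe *cqe)
{
    struct libobject *obj = io_uring_cqe_get_data(cqe);

    if (cqe->res == sizeof(eventfd_value) && eventfd_value != 0) {
        lib_eventfd_fired(obj);
    }
    arm_lib_eventfd(ring, obj);   /* wait for the next event */
}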

Does an extra syscall matter?

Whether it is worth eliminating the extra syscall depends on one's performance requirements. When I/O is relatively infrequent then the overhead of the additional syscall may not matter.

While working on QEMU I found that the extra read(2) on eventfds causes a measurable overhead.

Conclusion

Splitting file descriptor monitoring from I/O is suboptimal for Linux io_uring applications. Unfortunately, existing library APIs are often designed in this way. Letting the application perform asynchronous I/O on behalf of the library allows a more efficient implementation with io_uring while still supporting applications that use older event loops.

by Unknown (noreply@blogger.com) at July 18, 2020 11:11 AM

July 12, 2020

Cole Robinson

virt-manager libvirt XML editor UI

virt-manager 2.2.0 was released in June of last year. It shipped with a major new feature: a libvirt XML viewing and editing UI for new and existing domains, pools, volumes, and networks.

Every VM, network, and storage object page has an XML tab at the top. Here's an example with that tab selected from the VM Overview section:

VM XML editor

Here's an example of the XML view when just a disk is selected. Note it only shows that single device's libvirt XML:

Disk XML editor

By default the XML is not editable; notice the warning at the top of the first image. After editing is enabled, the warning is gone, as in the second image. You can enable editing via Edit->Preferences from the main Manager window. Here's what the option looks like:

XML edit preference

A bit of background: We are constantly receiving requests to expose libvirt XML config options in virt-manager's UI. Some of these knobs are necessary for <1% of users but uninteresting to the rest. Some options are difficult to set from the command line because they must be set at VM install time, which means switching from virt-manager to virt-install, which is not trivial. And so on. When these options aren't added to the UI, it makes life difficult for those affected users. It's also difficult and draining to have these types of justification conversations on the regular.

The XML editing UI was added to relieve some of the pressure on virt-manager developers fielding these requests, and to give more power to advanced virt users. The users that know they need an advanced option are usually comfortable editing the libvirt XML directly. The XML editor doesn't detract from the existing UI much IMO, and it is uneditable by default to prevent less knowledgeable users from getting into trouble. It ain't gonna win any awards for great UI, but the feedback has been largely positive so far.

by Cole Robinson at July 12, 2020 04:00 AM

July 11, 2020

Cole Robinson

virt-convert tool removed in virt-manager.git

The next release of virt-manager will not ship the virt-convert tool; I removed it upstream with this commit.

Here's the slightly edited quote from my original proposal to remove it:

virt-convert takes an ovf/ova or vmx file and spits out libvirt XML. It started as a code drop a long time ago that could translate back and forth between vmx, ovf, and virt-image, a long dead appliance format. In 2014 I changed virt-convert to do vmx -> libvirt and ovf -> libvirt which was a CLI breaking change, but I never heard a peep of a complaint. It doesn't do a particularly thorough job at its intended goal, I've seen 2-3 bug reports in the past 5 years and generally it doesn't seem to have any users. Let's kill it. If anyone has the desire to keep it alive it could live as a separate project that's a wrapper around virt-install but there's no compelling reason to keep it in virt-manager.git IMO

That mostly sums it up. If there are any users of virt-convert out there, you can likely get similar results by extracting the relevant disk image from the .vmx or .ovf config, passing it to virt-manager or virt-install, and letting those tools fill in the defaults. In truth that's about all virt-convert did to begin with.

Please see virt-v2v for an actively maintained tool that can convert OVA/OVF appliances to libvirt + KVM. This redhat.com article describes an example conversion.

by Cole Robinson at July 11, 2020 04:00 AM

July 10, 2020

Cornelia Huck

s390x changes in QEMU 5.1

QEMU has entered softfreeze for 5.1, so it is time to summarize the s390x changes in that version.

Protected virtualization

One of the biggest features on the s390/KVM side in Linux 5.7 had been protected virtualization aka secure execution, which basically restricts the (untrusted) hypervisor's access to the guest's memory and delegates many tasks to the (trusted) ultravisor. QEMU 5.1 introduces the QEMU part of the feature.
In order to be able to run protected guests, you need to run on a z15 or a LinuxONE III with at least a 5.7 kernel. You also need an up-to-date s390-tools installation. Some details are available in the QEMU documentation. For more information about what protected virtualization is, watch this talk from KVM Forum 2019 and this talk from 36C3.

vfio-ccw

vfio-ccw has also seen some improvements over the last release cycle.
  • Requests that do not explicitly allow prefetching in the ORB are no longer rejected out of hand (although the kernel may still do so if you run a pre-5.7 version). The rationale behind this is that most device drivers never modify their channel programs dynamically, and the one common code path that does (IPL from DASD) is already accommodated by the s390-ccw bios. While you can instruct QEMU to ignore the prefetch requirement for selected devices, this is an additional administrative complication for little benefit; it is therefore no longer required.
  • In order to be able to relay changes in channel path status to the guest, two new regions have been added: a schib region to relay real data to stsch, and a crw region to relay channel reports. If, for example, a channel path is varied off on the host, all guests using a vfio-ccw device that uses this channel path now get a proper channel report for it.

Other changes

Other than the bigger features mentioned above, there have been the usual fixes, improvements, and cleanups, both in the main s390x QEMU code and in the s390-ccw bios.

    by Cornelia Huck (noreply@blogger.com) at July 10, 2020 04:27 PM

    July 03, 2020

    QEMU project

    Anatomy of a Boot, a QEMU perspective

    Have you ever wondered about the process a machine goes through to get to the point of a usable system? This post will give an overview of how machines boot and how this matters to QEMU. We will discuss firmware and BIOSes and the things they do before the OS kernel is loaded and your usable system is finally ready.

    Firmware

    When a CPU is powered up it knows nothing about its environment. The internal state, including the program counter (PC), will be reset to a defined set of values and it will attempt to fetch the first instruction and execute it. It is the job of the firmware to bring a CPU up from the initial few instructions to running in a relatively sane execution environment. Firmware tends to be specific to the hardware in question and is stored on non-volatile memory (memory that survives a power off), usually a ROM or flash device on the computer's main board.

    Some examples of what firmware does include:

    Early Hardware Setup

    Modern hardware often requires configuring before it is usable. For example most modern systems won’t have working RAM until the memory controller has been programmed with the correct timings for whatever memory is installed on the system. Processors may boot with a very restricted view of the memory map until RAM and other key peripherals have been configured to appear in its address space. Some hardware may not even appear until some sort of blob has been loaded into it so it can start responding to the CPU.

    Fortunately for QEMU we don’t have to worry too much about this very low level configuration. The device model we present to the CPU at start-up will generally respond to IO access from the processor straight away.

    BIOS or Firmware Services

    In the early days of the PC era the BIOS or Basic Input/Output System provided an abstraction interface to the operating system which allowed the OS to do basic IO operations without having to directly drive the hardware. Since then the scope of these firmware services has grown as systems become more and more complex.

    Modern firmware often follows the Unified Extensible Firmware Interface (UEFI) which provides services like secure boot, persistent variables and external time-keeping.

    There can often be multiple levels of firmware service functions. For example systems which support secure execution enclaves generally have a firmware component that executes in this secure mode which the operating system can call in a defined secure manner to undertake security sensitive tasks on its behalf.

    Hardware Enumeration

    It is easy to assume that modern hardware is built to be discoverable and all the operating system needs to do is enumerate the various buses on the system to find out what hardware exists. While buses like PCI and USB do support discovery there is usually much more on a modern system than just these two things.

    This process of discovery can take some time as devices usually need to be probed and some time allowed for the buses to settle and the probe to complete. For purely virtual machines operating in on-demand cloud environments you may operate with stripped down kernels that only support a fixed expected environment so they can boot as fast as possible.

    In the embedded world it used to be acceptable to have a similar custom compiled kernel which knew where everything was meant to be. However this was a brittle approach and not very flexible. For example a general purpose distribution would have to ship a special kernel for each variant of hardware you wanted to run on. If you try to use a kernel compiled for one platform on another platform that nominally uses the same processor, the result will generally not work.

    The more modern approach is to have a “generic” kernel that has a number of different drivers compiled in which are then enabled based on a hardware description provided by the firmware. This allows flexibility on both sides. The software distribution is less concerned about managing lots of different kernels for different pieces of hardware. The hardware manufacturer is also able to make small changes to the board over time to fix bugs or change minor components.

    The two main methods for this are the Advanced Configuration and Power Interface (ACPI) and Device Trees. ACPI originated from the PC world although it is becoming increasingly common for “enterprise” hardware like servers. Device Trees of various forms have existed for a while with perhaps the most common being Flattened Device Trees (FDT).

    Boot Code

    The line between firmware and boot code is a very blurry one. However from a functionality point of view we have moved from ensuring the hardware is usable as a computing device to finding and loading a kernel which is then going to take over control of the system. Modern firmware often has the ability to boot a kernel directly and in some systems you might chain through several boot loaders before the final kernel takes control.

    The boot loader needs to do 3 things:

    • find a kernel and load it into RAM
    • ensure the CPU is in the correct mode for the kernel to boot
    • pass any information the kernel may need to boot and can’t find itself

    Once it has done these things it can jump to the kernel and let it get on with things.

    Kernel

    The Kernel now takes over and will be in charge of the system from now on. It will enumerate all the devices on the system (again) and load drivers that can control them. It will then locate some sort of file-system and eventually start running programs that actually do work.

    Questions to ask yourself

    Having given this overview of booting here are some questions you should ask when diagnosing boot problems.

    Hardware

    • is the platform fixed or dynamic?
    • is the platform enumeratable (e.g. PCI/USB)?

    Firmware

    • is the firmware built for the platform you are booting?
    • does the firmware need storage for variables (boot index etc)?
    • does the firmware provide a service to kernels (e.g. ACPI/EFI)?

    Kernel

    • is the kernel platform specific or generic?
    • how will the kernel enumerate the platform?
    • can the kernel interface talk to the firmware?

    Final Thoughts

    When users visit the IRC channel to ask why a particular kernel won't boot, our first response is almost always to check that the kernel is actually matched to the hardware being instantiated. For ARM boards in particular, just being built for the same processor is generally not enough, and hopefully having made it through this post you see why. This complexity is also the reason why we generally suggest using a tool like virt-manager to configure QEMU, as it is designed to ensure the right components and firmware are selected to boot a given system.

    by Alex Bennée at July 03, 2020 10:00 PM

    KVM on Z

    RHEL providing unlimited KVM Guests on IBM Z

    Red Hat has announced a new offering for IBM Z and LinuxONE here.
    Red Hat Enterprise Linux for IBM Z with premium support also includes:
    • Red Hat Enterprise Linux Extended Update Support add-on (new)
    • Red Hat Enterprise Linux High Availability add-on (new)
    • Red Hat Smart Management (new)
    • Red Hat Insights (new)
    • Unlimited virtual guests (KVM)

    by Stefan Raspl (noreply@blogger.com) at July 03, 2020 10:24 AM

    July 02, 2020

    Stefan Hajnoczi

    Avoiding bitrot in C macros

    A common approach to debug messages that can be toggled at compile-time in C programs is:

    #ifdef ENABLE_DEBUG
    #define DPRINTF(fmt, ...) do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
    #else
    #define DPRINTF(fmt, ...)
    #endif

    Usually the ENABLE_DEBUG macro is not defined in normal builds, so the C preprocessor expands the debug printfs to nothing. No messages are printed at runtime and the program's binary size is smaller since no instructions are generated for the debug printfs.

    This approach has the disadvantage that it suffers from bitrot, the tendency for source code to break over time when it is not actively built and used. Consider what happens when one of the variables used in the debug printf is not updated after being renamed:

    - int r;
    + int radius;
    ...
    DPRINTF("radius %d\n", r);

    The code continues to compile after r is renamed to radius because the DPRINTF() macro expands to nothing. The compiler does not syntax check the debug printf and misses that the outdated variable name r is still in use. When someone defines ENABLE_DEBUG months or years later, the compiler error becomes apparent and that person is confronted with fixing a new bug on top of whatever they were trying to debug when they enabled the debug printf!

    It's actually easy to avoid this problem by writing the macro differently:

    #ifndef ENABLE_DEBUG
    #define ENABLE_DEBUG 0
    #endif
    #define DPRINTF(fmt, ...) do { \
            if (ENABLE_DEBUG) { \
                fprintf(stderr, fmt, ## __VA_ARGS__); \
            } \
        } while (0)

    When ENABLE_DEBUG is not defined the macro expands to:

    do {
        if (0) {
            fprintf(stderr, fmt, ...);
        }
    } while (0)

    What is the difference? This time the compiler parses and syntax checks the debug printf even when it is disabled. Luckily compilers are smart enough to eliminate dead code, code that cannot be executed, so the binary size remains small.

    This applies not just to debug printfs. More generally, all preprocessor conditionals suffer from bitrot. If an #if ... #else ... #endif can be replaced with equivalent unconditional code then it's often worth doing.
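
    For example (using made-up names like ENABLE_STATS and collect_stats() purely for illustration), the same trick applies to any conditionally compiled call:

    #include <stdio.h>

    #ifndef ENABLE_STATS
    #define ENABLE_STATS 0   /* normally defined by the build system */
    #endif

    static void collect_stats(void)
    {
        printf("collecting stats\n");
    }

    void frobnicate(void)
    {
        /* Always parsed and type checked; eliminated as dead code when
         * ENABLE_STATS is 0, just like the DPRINTF() example above. */
        if (ENABLE_STATS) {
            collect_stats();
        }
    }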

    by Unknown (noreply@blogger.com) at July 02, 2020 08:33 AM

    June 16, 2020

    Cole Robinson

    virt-manager is deprecated in RHEL (but only RHEL)

    TL;DR: I'm the primary author of virt-manager. virt-manager is deprecated in RHEL8 in favor of cockpit, but ONLY in RHEL8 and future RHEL releases. The upstream project virt-manager is still maintained and is still relevant for other distros.

    Google 'virt-manager deprecated' and you'll find some discussions suggesting virt-manager is no longer maintained, Cockpit is replacing virt-manager, virt-manager is going to be removed from every distro, etc. These conclusions are misinformed.

    The primary source for this confusion is the section 'virt-manager has been deprecated' from the RHEL8 release notes virtualization deprecation section. Relevant quote from the RHEL8.2 docs:

    The Virtual Machine Manager application, also known as virt-manager, has been deprecated. The RHEL 8 web console, also known as Cockpit, is intended to become its replacement in a subsequent release.

    What that means:

    • virt-manager is in RHEL8 and will be there for the lifetime of RHEL8.
    • Red Hat engineering effort assigned to the virt-manager UI has been reduced compared to previous RHEL versions.
    • The tentative plan is to not ship the virt-manager UI in RHEL9.

    Why is this happening? As I understand it, RHEL wants to roughly standardize on Cockpit as their host admin UI tool. It's a great project with great engineers and great UI designers. Red Hat is going all in on it for RHEL and aims to replace the mishmash of system-config-X tools and project-specific admin frontends (like virt-manager) with a unified project. (Please note: this is my paraphrased understanding, I'm not speaking on behalf of Red Hat here.)

    Notice though, this is all about RHEL. virt-manager is not deprecated upstream, or deprecated in other distros automatically just because RHEL has made this decision. The upstream virt-manager project continues on and Red Hat continues to allocate resources to maintain it.

    Also, I'm distinguishing virt-manager UI from the virt-manager project, which includes tools like virt-install. I fully expect virt-install to be shipped in RHEL9 and actively maintained (FWIW Cockpit uses it behind the scenes).

    And even if the virt-manager UI is not in RHEL9 repos, it will likely end up shipped in EPEL, so the UI will still be available for install, just through external repos.

    Overall my personal opinion is that as long as libvirt+KVM is in use on linux desktops and servers, virt-manager will be relevant. I don't expect anything to change in that area any time soon.

    by Cole Robinson at June 16, 2020 05:00 PM

    May 22, 2020

    Stefan Hajnoczi

    How to check VIRTIO feature bits inside Linux guests

    VIRTIO devices have feature bits that indicate the presence of optional features. The feature bit space is divided into core VIRTIO features (e.g. notify on empty), transport-specific features (PCI, MMIO, CCW), and device-specific features (e.g. virtio-net checksum offloading). This article shows how to check whether a feature is enabled inside Linux guests.

    The feature bits are used during VIRTIO device initialization to negotiate features between the device and the driver. The device reports a fixed set of features, typically all the features that the device implementors wanted to offer from the VIRTIO specification version that they developed against. The driver also reports features, typically all the features that the driver developers wanted to offer from the VIRTIO specification version that they developed against.

    Feature bit negotiation determines the subset of features supported by both the device and the driver. A new driver might not be able to enable all the features it supports if the device is too old. The same is true vice versa. This offers compatibility between devices and drivers. It also means that you don't know which features are enabled until the device and driver have negotiated them at runtime.
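
    As a rough illustration (not actual driver code), the negotiation boils down to a bitwise AND of the two feature masks, and checking an individual feature is then a simple bit test:

    #include <stdint.h>
    #include <stdbool.h>

    /* The enabled feature set is the intersection of what the device offers
     * and what the driver supports. */
    static uint64_t negotiate(uint64_t device_features, uint64_t driver_features)
    {
        return device_features & driver_features;
    }

    static bool feature_enabled(uint64_t negotiated, unsigned bit)
    {
        return negotiated & (1ull << bit);  /* e.g. 29 for VIRTIO_RING_F_EVENT_IDX */
    }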

    Where to find feature bit definitions

    VIRTIO feature bits are listed in the VIRTIO specification. You can also grep the linux/virtio-*.h header files:


    $ grep VIRTIO.*_F_ /usr/include/linux/virtio_*.h
    virtio_ring.h:#define VIRTIO_RING_F_INDIRECT_DESC 28
    virtio_ring.h:#define VIRTIO_RING_F_EVENT_IDX 29
    virtio_scsi.h:#define VIRTIO_SCSI_F_INOUT 0
    virtio_scsi.h:#define VIRTIO_SCSI_F_HOTPLUG 1
    virtio_scsi.h:#define VIRTIO_SCSI_F_CHANGE 2
    ...

    Here the VIRTIO_SCSI_F_INOUT (0) constant is for the 1st bit (1ull << 0). Bit-numbering can be confusing because different standards, vendors, and languages express it differently. Here it helps to think of a bit shift operation like 1 << BIT.

    How to check feature bits inside the guest

    The Linux virtio.ko driver that is used for all VIRTIO devices has a sysfs file called features. This file contains the feature bits in binary representation starting with the 1st bit on the left and more significant bits to the right. The reported bits are the subset that both the device and the driver support.

    To check if the virtio-blk device /dev/vda has the VIRTIO_RING_F_EVENT_IDX (29) bit set:


    $ python -c "print('$(</sys/block/vda/device/driver/virtio*/features)'[29])"
    01100010011101100000000000100010100

    Other device types can be found through similar sysfs paths.
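
    For a C equivalent of the python one-liner above, a small sketch like the following reads the sysfs features file and tests the character at the feature bit's index. The path has to be resolved per device since the virtio* component differs:

    #include <stdio.h>
    #include <string.h>

    /* Returns 1 if the bit is set, 0 if clear, -1 on error */
    int virtio_feature_enabled(const char *features_path, unsigned bit)
    {
        char bits[256];
        FILE *f = fopen(features_path, "r");

        if (!f || !fgets(bits, sizeof(bits), f)) {
            if (f) {
                fclose(f);
            }
            return -1;
        }
        fclose(f);

        if (bit >= strlen(bits) || (bits[bit] != '0' && bits[bit] != '1')) {
            return -1;   /* beyond the reported feature bits */
        }
        return bits[bit] == '1';
    }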

    by Unknown (noreply@blogger.com) at May 22, 2020 01:46 PM

    May 01, 2020

    Daniel Berrange

    ANNOUNCE: virt-viewer version 9.0 released

    I am happy to announce a new bugfix release of virt-viewer 9.0 (gpg), including experimental Windows installers for Win x86 MSI (gpg) and Win x64 MSI (gpg).

    Signatures are created with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

    With this release the project has moved over to use GitLab for its hosting needs instead of Pagure. Instead of sending patches to the old mailing list, we have adopted modern best practices and now welcome contributions as merge requests, from where they undergo automated CI testing of the build. Bug reports directed towards upstream maintainers should also be filed at the GitLab project now, instead of the Red Hat Bugzilla.

    All historical releases are available from:

    http://virt-manager.org/download/

    Changes in this release include:

    • Project moved to https://gitlab.com/virt-viewer/virt-viewer
    • Allow toggling shared clipboard in remote-viewer
    • Fix handling when initial spice connection fails
    • Fix check for govirt library
    • Add bash completion of cli args
    • Improve errors in file transfer dialog
    • Fix ovirt foreign menu storage domains query
    • Prefer TLS certs from oVirt instead of CLI
    • Improve USB device cleanup when Ctrl-C is used
    • Remember monitor mappings across restarts
    • Add a default file extension to screenshots
    • Updated translations
    • Fix misc memory leaks

    by Daniel Berrange at May 01, 2020 05:19 PM
