Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

June 10, 2018

KVM on Z

KVM at SHARE St. Louis 2018


Yes, we will be at SHARE in St. Louis this August!
See our sessions in the Linux and VM track as follows:


by Stefan Raspl (noreply@blogger.com) at June 10, 2018 09:17 PM

May 31, 2018

QEMU project

QEMU’s new -nic command line option

If you used QEMU in the past, you are probably familiar with the -net command line option, which can be used to configure a network connection for the guest, or with the -netdev option, which configures a network back-end. Yet, QEMU v2.12 introduces a third way to configure NICs, the -nic option.

The ChangeLog of QEMU v2.12 says that -nic can “quickly create a network front-end (emulated NIC) and a host back-end”. But why did QEMU need yet another way to configure the network, and how does it compare with -net and -netdev? To answer these questions, we need to look at the model behind network virtualization in QEMU.

As hinted by the ChangeLog entry, a network interface consists of two separate entities:

  1. The emulated hardware that the guest sees, i.e. the so-called NIC (network interface controller). On systems that support PCI cards, this could typically be an e1000 network card, an rtl8139 network card or a virtio-net device. This entity is also called the “front-end”.

  2. The network back-end on the host side, i.e. the interface that QEMU uses to exchange network packets with the outside (like other QEMU instances or other real hosts in your intranet or the internet). The common host back-ends are the “user” (a.k.a. SLIRP) back-end which provides access to the host’s network via NAT, the “tap” back-end which allows the guest to directly access the host’s network, or the “socket” back-end which can be used to connect multiple QEMU instances to simulate a shared network for their guests.

Based on this, it is already possible to define the most obvious difference between -net, -netdev and -nic: the -net option can create either a front-end or a back-end (and also does other things); -netdev can only create a back-end; while a single occurrence of -nic will create both a front-end and a back-end. But for the non-obvious differences, we also need to have a detailed look at the -net and -netdev options first …

The legacy -net option

QEMU’s initial way of configuring the network for the guest was the -net option. The emulated NIC hardware can be chosen with the -net nic,model=xyz,... parameter, and the host back-end with the -net <backend>,... parameter (e.g. -net user for the SLIRP back-end). However, the emulated NIC and the host back-end are not directly connected. They are rather both connected to an emulated hub (called “vlan” in older versions of QEMU). Therefore, if you start QEMU with -net nic,model=e1000 -net user -net nic,model=virtio -net tap for example, you get a setup where all the front-ends and back-ends are connected together via a hub:

Networking with -net

That means the e1000 NIC also gets the network traffic from the virtio-net NIC and both host back-ends… this is probably not what the users expected; it’s more likely that they wanted two separate networks in the guest, one for each NIC. Because -net always connects its NIC to a hub, you would have to tell QEMU to use two separate hubs, using the “vlan” parameter. For example -net nic,model=e1000,vlan=0 -net user,vlan=0 -net nic,model=virtio,vlan=1 -net tap,vlan=1 moves the virtio-net NIC and the “tap” back-end to a second hub (with ID #1).

Please note that the “vlan” parameter will be dropped in QEMU v3.0 since the term was rather confusing (it’s not related to IEEE 802.1Q for example) and caused a lot of misconfigurations in the past. Additional hubs can still be instantiated with -netdev (or -nic) and the special “hubport” back-end. The -net option itself will still stay around since it is still useful if you only want to use one front-end and one back-end together, or if you want to tunnel the traffic of multiple NICs through one back-end only (something like -net nic,model=e1000 -net nic,model=virtio -net l2tpv3,... for example).

The modern -netdev option

Besides the confusing “vlan” parameter of the -net option, there is one more major drawback with -net: the emulated hub between the NIC and the back-end gets in the way when the NIC front-end has to work closely together with the host back-end. For example, vhost acceleration cannot be enabled if you create a virtio-net device with -net nic,model=virtio.

To configure a network connection where the emulated NIC is directly connected to a host network back-end, without a hub in between, the well-established solution is to use the -netdev option for the back-end, together with -device for the front-end. Assuming that you want to configure the same devices as in the -net example above, you could use -netdev user,id=n1 -device e1000,netdev=n1 -netdev tap,id=n2 -device virtio-net,netdev=n2. This will give you straight 1:1 connections between the NICs and the host back-ends:

Networking with -netdev

Note that you can also still connect the devices to a hub with the special -netdev hubport back-end, but in most of the normal use cases, the use of a hub is not required anymore.
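
For illustration, here is a sketch of recreating the old shared-hub setup with the modern options (the device models are chosen arbitrarily): both NICs are attached to hub 0 via two hubports.

qemu-system-x86_64 \
  -netdev hubport,hubid=0,id=h1 -device e1000,netdev=h1 \
  -netdev hubport,hubid=0,id=h2 -device virtio-net-pci,netdev=h2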

Now while -netdev together with -device provide a very flexible and extensive way to configure a network connection, there are still two drawbacks with this option pair which prevented us from deprecating the legacy -net option completely:

  1. The -device option can only be used for pluggable NICs. Boards (e.g. embedded boards) which feature an on-board NIC cannot be configured with -device yet, so -net nic,netdev=<id> must be used here instead.

  2. In some cases, the -net option is easier to use (less to type). For example, assuming you want to set up a “tap” network connection and your default scripts /etc/qemu-ifup and -down are already in place, it’s enough to type -net nic -net tap to start your guest. To do the same with -netdev, you always have to specify an ID here, too, for example like this: -netdev tap,id=n1 -device e1000,netdev=n1.

The new -nic option

Looking at the disadvantages listed above, users could benefit from a convenience option that:

  • is easier to use (and shorter to type) than -netdev <backend>,id=<id> -device <dev>,netdev=<id>
  • can be used to configure on-board / non-pluggable NICs, too
  • does not place a hub between the NIC and the host back-end.

This is where the new -nic option kicks in: this option can be used to configure both the guest’s NIC hardware and the host back-end in one go. For example, instead of -netdev tap,id=n1 -device e1000,netdev=n1 you can simply type -nic tap,model=e1000. If you don’t care about the exact NIC model type, you can even omit the model=... parameter and type -nic tap. This is even shorter and more convenient than the previous shortest way of typing -net nic -net tap. To get a list of NIC models that you can use with this option, you can simply run QEMU with -nic model=help.
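
To illustrate, here are the long and the short spellings side by side (a sketch; the MAC address is made up):

# explicit back-end/front-end pair
qemu-system-x86_64 -netdev user,id=n1 -device virtio-net-pci,netdev=n1,mac=52:54:00:12:34:56

# equivalent -nic shorthand (QEMU v2.12 and newer)
qemu-system-x86_64 -nic user,model=virtio-net-pci,mac=52:54:00:12:34:56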

Besides being easier to use, the -nic option can be used to configure on-board NICs, too (just like the -net option). For machines that have on-board NICs, the first -nic option configures the first on-board NIC, the second -nic option configures the second on-board NIC, and so forth.

Conclusion

  • The new -nic option gives you an easy and quick way to configure the networking of your guest.
  • For more detailed configuration, e.g. when you need to tweak the details of the emulated NIC hardware, you can use -device together with -netdev.
  • The -net option should be avoided these days unless you really want to configure a set-up with a hub between the front-ends and back-ends.

by Thomas Huth at May 31, 2018 07:50 AM

May 24, 2018

Gerd Hoffmann

Fedora 28 images uploaded

Fedora 28 was released a few weeks ago. New Fedora 28 images are finally uploaded now.

There are no Raspberry Pi images any more. Just use the standard Fedora arm images; they work just fine for both arm (rpi 2) and aarch64 (rpi 3).

The efi images are for qemu. Some use grub2 as bootloader, some use systemd-boot; the filename indicates which one is used. The efi images can also be booted as a container, using systemd-nspawn --boot --image <file>, but you have to convert them to raw first as systemd-nspawn can't handle qcow2.
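
For example, converting and booting one of the images as a container could look like this (the file names are placeholders):

qemu-img convert -O raw fedora-28-efi-grub2-x86_64.qcow2 fedora-28.raw
sudo systemd-nspawn --boot --image fedora-28.raw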

As usual the images don't have a root password. You have to set one using virt-customize -a <image> --root-password password:<secret>, otherwise you can't log in after boot.
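
A concrete invocation could look like this (image name and password are placeholders):

virt-customize -a fedora-28-efi-grub2-x86_64.qcow2 --root-password password:mysecret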

The images have been created with imagefish.

by Gerd Hoffmann at May 24, 2018 10:00 PM

May 17, 2018

KVM on Z

Knowledge Series: Managing KVM on IBM Z with oVirt

See here for a new entry in our "knowledge" series, providing step-by-step instructions on how to include IBM Z hosts in an oVirt data center.

by Stefan Raspl (noreply@blogger.com) at May 17, 2018 02:51 PM

May 14, 2018

KVM on Z

Getting Started: RHEL 7.5 Instructions added

Instructions for RHEL7.5 were added to the Getting Started with KVM on Z series.
See here for the actual page.

by Stefan Raspl (noreply@blogger.com) at May 14, 2018 12:40 PM

Getting Started: Instructions for Ubuntu 16.04 to 17.10 added

Instructions for Ubuntu 16.04 to 17.10 were added to the Getting Started with KVM on Z series.
See here for the entry page.

by Stefan Raspl (noreply@blogger.com) at May 14, 2018 12:34 PM

Knowledge Series: Disk Performance Hints & Tips

See here for the second entry in our "knowledge" series, providing hints & tips on KVM on z disk performance.

by Stefan Raspl (noreply@blogger.com) at May 14, 2018 12:23 PM

May 13, 2018

Gerd Hoffmann

Welcome to Jekyll!

Switched my blog from wordpress to jekyll.

Automatic import handled most of the content. It didn’t cover syntax highlighting (an extra wordpress plugin) though, so that needed some manual intervention. Also copying over the (few) images was a manual process.

Configuration isn’t imported automatically, but that is just a matter of editing a few lines in _config.yml. Permalinks can be configured to be compatible with wordpress without much trouble, so all the article links stay valid.

My blog is stored in git now. Everything is just static pages. No database needed. No user data stored anywhere.

I’m pretty pleased with the result.

by Gerd Hoffmann at May 13, 2018 10:00 PM

May 04, 2018

KVM on Z

Ubuntu 18.04 released

Ubuntu Server 18.04 LTS is out! Support for IBM Z is available here.
It ships with updated packages of the KVM stack, and support for IBM z14 is readily in place.

Since this is a so-called LTS (Long Term Support) release providing approx. 5 years of support (in contrast to the usual 9 months for non-LTS releases), it is of particular interest to Ubuntu users looking for a stable environment for production deployments.

by Stefan Raspl (noreply@blogger.com) at May 04, 2018 08:47 PM

May 03, 2018

Cornelia Huck

A vfio-ccw primer

While basic support for vfio-ccw has been included in Linux and QEMU for some time, work has recently started to ramp up again and it seems like a good time to give some basic overview.

Why vfio-ccw?

Historically, QEMU on s390x presented paravirtualized virtio devices to the guest; first, via a protocol inspired by lguest, later, as emulated channel devices. This satisfies most needs (you get block devices, network devices, a console device, and lots more), but the device types are different from those found on LPARs or z/VM guests, and you may have a need to use e.g. a DASD directly.

For that reason, we want to do the same thing as on other platforms: pass a host device to the guest directly via vfio.

How does this work?

vfio-ccw uses the vfio mediated device framework; see the kernel documentation for an overview.

In a nutshell: The subchannel to be passed to the guest is unbound from its normal host driver (in this case, the I/O subchannel driver) and bound to the vfio-ccw driver. Any I/O request is intercepted and executed on the real device, and interrupts from the real device are relayed back to the guest.

Why subchannels and not ccw devices?

The initial attempt to implement this actually worked at the ccw device level. However, this means that the Linux common I/O layer in the host will perform various actions like handling of channel paths - which may interfere with what the guest is trying to do. Therefore, it seemed like a better idea to keep out of the way as much as possible and just implement a minimal subchannel driver that does not do much beyond what the guest actually triggered itself.

How is an actual I/O request processed?

When the guest is ready to use a channel device, it will issue I/O requests via channel programs (see here for an explanation on how that works and what things like scsw and orb mean.) The channel I/O instructions are mandatory SIE intercepts, so the host will get control for any START SUBCHANNEL the guest issues. QEMU is in charge of interpretation of channel I/O instructions, so it will process the ssch as a request to a pass-through device.

All channel I/O instructions are privileged, which means that the host kernel now needs to get involved again. QEMU does so by writing to an I/O region: the scsw (which contains, amongst other things, the fctl field specifying the start function) and the orb (pointing to the channel program). The host kernel driver now has enough information to actually issue the request on the real device after translating the ccw chain and its addresses to host addresses (involving pinning, idals and other things I will not explain here for brevity.)

After the device has processed the I/O request, it will make the subchannel status pending and generate an I/O interrupt. The host kernel driver collects the state and makes it available via the same I/O region (the IRB field), and afterwards triggers QEMU via an eventfd. QEMU now has all information needed to update its internal structures for the devices so that the guest can obtain the information related to the I/O request.

Isn't that all a bit too synchronous?

Yes, it is. Channel I/O is supposed to be asynchronous (give the device an I/O request, collect status later), but our implementation isn't yet. Why? Short answer: It is hard, and we wanted something to get us going. But this is on the list of things to be worked on.

Where is the IOMMU for this?

Due to the way channel programs work, we don't have a real IOMMU.

Does this cover everything supported by the architecture?

Not yet. Channel program wise, we support the format Linux drivers use. Also, we're emulating things like HALT SUBCHANNEL and CLEAR SUBCHANNEL in QEMU, while they really should be handed through to the device (support for this is in the works).

On the whole, you should be able to pass an ECKD DASD to a Linux guest without (known) issues.

How can I try this out?

Recent QEMU and Linux versions should have everything you need in the host; see this wiki entry for details. As a guest, any guest that can run under KVM should be fine.
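
As a rough sketch, following the kernel's vfio-ccw documentation (the subchannel number 0.0.0313 and the UUID handling are just examples), passing a device through involves these steps:

# unbind the subchannel from the I/O subchannel driver, bind it to vfio_ccw
echo 0.0.0313 > /sys/bus/css/devices/0.0.0313/driver/unbind
echo 0.0.0313 > /sys/bus/css/drivers/vfio_ccw/bind

# create a mediated device on top of the subchannel
uuid=$(uuidgen)
echo $uuid > /sys/bus/css/devices/0.0.0313/mdev_supported_types/vfio_ccw-io/create

# hand it to the guest
qemu-system-s390x ... -device vfio-ccw,sysfsdev=/sys/bus/mdev/devices/$uuid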

What's the deal with that "unrestricted cssids" thing?

If you look at this older article, you'll notice the 'fe' value for the cssid of virtio devices (with the promise to explain it later... which I sadly never did). The basic idea at the time was to put 'virtual' devices like virtio and 'non-virtual' devices like vfio-ccw into different channel subsystem images, so that e.g. channel paths (which are per channel subsystem image) don't clash. In other words, 'virtual' and 'non-virtual' devices (and channel paths) would have different cssids (the first part of their identifiers).

This sounded like a good idea at the time; however, there's a catch: A guest operating system will by default only see the devices in the default channel subsystem image. To see all of them, it needs to explicitly enable the Multiple Channel Subsystems Extended (MCSS-E) feature - and I do not know of any operating system that has done so as of today (not very surprising, as QEMU is the only implementation of MCSS-E I'm aware of).

To work around this, we originally introduced the 's390-squash-mcss' parameter to QEMU, which would put all devices into the default channel subsystem image. But as MCSS-E support is unlikely to arrive in any guest operating system anytime soon, we agreed to rather drop the restriction of virtual devices being in css fe and non-virtual devices everywhere else (since QEMU 2.12).

What are the plans for the future?

Several things are already actively worked on, while others may come up later.
  • Initial libvirt support for vfio-ccw has been posted here.
  • Reworking the Linux host driver to make things more asynchronous and to support halt/clear is in progress.
  • Improvements in channel path handling (for example, to enable the guest to see path availability changes) are also in progress. We may need to consider things like dasd reserve/release as well.

by Cornelia Huck (noreply@blogger.com) at May 03, 2018 11:56 AM

KVM on Z

QEMU v2.12 released

QEMU v2.12 is out. Here are the highlights from a KVM on Z perspective:
  • Added support for an interactive bootloader. As always, we strongly recommend using the existing support in libvirt.
    To enable/disable, add the following element to your guest definition:

       <os>
         <bootmenu enable='yes|no' timeout='n'/>
         ...
       </os>


    The timeout parameter specifies a timeout in milliseconds after which the default entry is chosen.
    Alternatively, set the attribute loadparm to PROMPT in the boot element of the respective disk to enable the boot menu without a timeout:

       <disk ...>
         <boot order='1' loadparm='PROMPT'/>
         ...
       </disk>


    Example:
    To enable the boot menu with a 32 second timeout in a guest's libvirt
    domain XML:

       <domain type='kvm'>
         <os>
           <bootmenu enable='yes' timeout='32000'/>
           ...
         </os>
       </domain>
  • Exposure of guest crash information: When a guest is started using libvirt and crashes due to a disabled wait, wrong interrupts or a program check loop, libvirt will print the information to the guest’s log, typically located in /var/log/libvirt/qemu.
    E.g. a crash due to a disabled wait results in an entry as follows:

       s390: psw-mask='0xXXXXXXXXXXXXXXXX', psw-addr='0xXXXXXXXXXXXXXXXX', crash reason: disabled wait


    Requires libvirt v4.2.
  • Added support for guests with more than 8TB of memory.

by Stefan Raspl (noreply@blogger.com) at May 03, 2018 09:49 AM

May 02, 2018

Richard Jones

Dockerfile for running libguestfs, virt-tools and virt-v2v

FROM fedora
RUN dnf install -y libguestfs libguestfs-tools-c virt-v2v \
                   libvirt-daemon libvirt-daemon-config-network

# https://bugzilla.redhat.com/show_bug.cgi?id=1045069
RUN useradd -ms /bin/bash v2v
USER v2v
WORKDIR /home/v2v

# This is required for virt-v2v because neither systemd nor
# root libvirtd runs, and therefore there is no virbr0, and
# therefore virt-v2v cannot set up the network through libvirt.
ENV LIBGUESTFS_BACKEND direct
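
A quick smoke test, assuming the Dockerfile above is saved in an empty directory (the image tag is arbitrary):

$ docker build -t virt-v2v .
$ docker run --rm -it virt-v2v virt-v2v --version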

by rich at May 02, 2018 03:35 PM

April 25, 2018

Peter Maydell

Debian on QEMU’s Raspberry Pi 3 model

For the QEMU 2.12 release we added support for a model of the Raspberry Pi 3 board (thanks to everybody involved in developing and upstreaming that code). The model is sufficient to boot a Debian image, so I wanted to write up how to do that.

Things to know before you start

Before I start, some warnings about the current state of the QEMU emulation of this board:

  • We don’t emulate the boot rom, so QEMU will not automatically boot from an SD card image. You need to manually extract the kernel, initrd and device tree blob from the SD image first. I’ll talk about how to do that below.
  • We don’t have an emulation of the BCM2835 USB controller. This means that there is no networking support, because on the raspi devices the ethernet hangs off the USB controller.
  • Our raspi3 model will only boot AArch64 (64-bit) kernels. If you want to boot a 32-bit kernel you should use the “raspi2” board model.
  • The QEMU model is missing models of some devices, and others are guesswork due to a lack of documentation of the hardware; so although the kernel I tested here will boot, it’s quite possible that other kernels may fail.

You’ll need the following things on your host system:

  • QEMU version 2.12 or better
  • libguestfs (on Debian and Ubuntu, install the libguestfs-tools package)

Getting the image

I’m using the unofficial preview images described on the Debian wiki.

$ wget https://people.debian.org/~stapelberg/raspberrypi3/2018-01-08/2018-01-08-raspberry-pi-3-buster-PREVIEW.img.xz
$ xz -d 2018-01-08-raspberry-pi-3-buster-PREVIEW.img.xz

Extracting the guest boot partition contents

I use libguestfs to extract files from the guest SD card image. There are other ways to do this but I think libguestfs is the easiest to use. First, check that libguestfs is working on your system:

$ virt-filesystems -a 2018-01-08-raspberry-pi-3-buster-PREVIEW.img
/dev/sda1
/dev/sda2

If this doesn’t work, then you should sort that out first. A couple of common reasons I’ve seen:

  • if you’re on Ubuntu then your kernels in /boot are installed not-world-readable; you can fix this with sudo chmod 644 /boot/vmlinuz*
  • if you’re running Virtualbox on the same host it will interfere with libguestfs’s attempt to run KVM; you can fix that by exiting Virtualbox

Now you can ask libguestfs to extract the contents of the boot partition:

$ mkdir bootpart
$ guestfish --ro -a 2018-01-08-raspberry-pi-3-buster-PREVIEW.img -m /dev/sda1

Then at the guestfish prompt type:

copy-out / bootpart/
quit

This should have copied various files into the bootpart/ subdirectory.
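
Equivalently, the whole extraction can be done non-interactively in one line:

$ guestfish --ro -a 2018-01-08-raspberry-pi-3-buster-PREVIEW.img -m /dev/sda1 copy-out / bootpart/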

Run the guest image

You should now be able to run the guest image:

$ qemu-system-aarch64 \
  -kernel bootpart/vmlinuz-4.14.0-3-arm64 \
  -initrd bootpart/initrd.img-4.14.0-3-arm64 \
  -dtb bootpart/bcm2837-rpi-3-b.dtb \
  -M raspi3 -m 1024 \
  -serial stdio \
  -append "rw earlycon=pl011,0x3f201000 console=ttyAMA0 loglevel=8 root=/dev/mmcblk0p2 fsck.repair=yes net.ifnames=0 rootwait memtest=1" \
  -drive file=2018-01-08-raspberry-pi-3-buster-PREVIEW.img,format=raw,if=sd

and have it boot to a login prompt (the root password for this Debian image is “raspberry”).

There will be several WARNING logs and backtraces printed by the kernel as it starts; these will have a backtrace like this:

[  145.157957] [] uart_get_baud_rate+0xe4/0x188
[  145.158349] [] pl011_set_termios+0x60/0x348
[  145.158733] [] uart_change_speed.isra.3+0x50/0x130
[  145.159147] [] uart_set_termios+0x7c/0x180
[  145.159570] [] tty_set_termios+0x168/0x200
[  145.159976] [] set_termios+0x2b0/0x338
[  145.160647] [] tty_mode_ioctl+0x358/0x590
[  145.161127] [] n_tty_ioctl_helper+0x54/0x168
[  145.161521] [] n_tty_ioctl+0xd4/0x1a0
[  145.161883] [] tty_ioctl+0x150/0xac0
[  145.162255] [] do_vfs_ioctl+0xc4/0x768
[  145.162620] [] SyS_ioctl+0x8c/0xa8

These are ugly but harmless. (The underlying cause is that QEMU doesn’t implement the undocumented ‘cprman’ clock control hardware, and so Linux thinks that the UART is running at a zero baud rate and complains.)

by pm215 at April 25, 2018 08:07 AM

QEMU project

QEMU version 2.12.0 released

We’d like to announce the availability of the QEMU 2.12.0 release. This release contains 2700+ commits from 204 authors.

You can grab the tarball from our download page. The full list of changes are available in the Wiki.

Highlights include:

  • Spectre/Meltdown mitigation support for x86/pseries/s390 guests. For more details see: https://www.qemu.org/2018/02/14/qemu-2-11-1-and-spectre-update/
  • Numerous block support improvements, including support for directly interacting with userspace NVMe driver, and general improvements to NBD server/client including more efficient reads of sparse files
  • Networking support for VMWare paravirtualized RDMA device (RDMA HCA and Soft-RoCE supported), CAN bus support via Linux SocketCAN and SJA1000-based PCI interfaces, and general improvements for dual-stack IPv4/IPv6 environments
  • GUI security/bug fixes, dmabufs support for GTK/Spice.
  • Better IPMI support for Platform Events and SEL logging in internal BMC emulation
  • SMBIOS support for “OEM Strings”, which can be used for automating guest image activation without relying on network-based querying
  • Disk cache information via virtio-balloon
  • ARM: AArch64 new instructions for FCMA/RDM and SIMD/FP16/crypto/complex number extensions
  • ARM: initial support for Raspberry Pi 3 machine type
  • ARM: Cortex-M33/Armv8-M emulation via new mps2-an505 board and many other improvements for M profile emulation
  • HPPA: support for full machine emulation (hppa-softmmu)
  • PowerPC: PPC4xx emulation improvements, including I2C bus support
  • PowerPC: new Sam460ex machine type
  • PowerPC: significant TCG performance improvements
  • PowerPC: pseries: support for Spectre/Meltdown mitigations
  • RISC-V: new RISC-V target via “spike_v1.9.1”, “spike_v1.10”, and “virt” machine types
  • s390: non-virtual devices no longer require dedicated channel subsystem and guest support for multiple CSSs
  • s390: general PCI improvements, MSI-X support for virtio-pci devices
  • s390: improved TCG emulation support
  • s390: KVM support for systems larger than 7.999TB
  • SPARC: sun4u power device emulation
  • SPARC: improved trace-event support and emulation/debug fixes
  • Tricore: new instruction variants for JEQ/JNE and 64-bit MOV
  • x86: Intel IOMMU support for 48-bit addresses
  • Xtensa: backend now uses libisa for instruction decoding/disassembly
  • Xtensa: multi-threaded TCG support and noMMU configuration variants
  • and lots more…

Thank you to everyone involved!

April 25, 2018 03:30 AM

April 24, 2018

Yoni Bettan

VirtIO

My name is Yonathan Bettan and I work at Red Hat in the virtualization KVM team.

This blog has 2 main purposes: the first is to give you an idea of what VirtIO is and why we should use it, while the second is to serve as a step-by-step guide describing how to write a VirtIO device from zero, with some code examples.

In addition, I will write a SIMPLE, documented VirtIO example device that you will be able to find in the Qemu project. Clone the https://github.com/qemu/qemu.git repo for the full project.

Motivation

Let us start with a NIC (Network Interface Controller) as an example to better understand virtualization. A NIC is responsible for transmitting and receiving packets through the network. The received packets are written into memory, and the packets to be sent are copied from memory to the NIC for transmission, with CPU intervention or without it (DMA). When the NIC finishes a specific task it sends an interrupt to the OS.

If we want a physical machine to have a network connection we will have to buy a NIC and with the same logic if we want a virtual machine (VM) to have a network connection we will need to supply a virtual NIC.

One possible solution is to make the hypervisor fully emulate the NIC according to its spec – Virtual device.

When a packet is sent by the guest OS it is sent to the virtual NIC (vNIC). For each byte of data we will get:

A virtual interrupt will be generated ==> a VMexit will occur ==> the hypervisor will send the data to the physical NIC (pNIC) ==> the pNIC will interrupt the host OS when it finishes the transaction ==> the hypervisor will finally interrupt the guest OS to notify it that the transaction is finished.

Here we can see a function of a NIC driver whose purpose is to read data from the device into a buffer.

NOTE: even if we use MMIO instead of PIO we still have a limitation on the MMIO write size, and each MMIO write generates a VMexit, so we may still have multiple VMexits.

The main benefit, in this case, is that the OS stays unchanged, because the virtual device acts like a physical device and the already-written NIC driver does the job correctly on the emulated device. On the other hand it works slowly, since each access to the vNIC generates a VMexit for each byte (as a pNIC would have done), but in reality this is not a real device (only code variables), so we don't need to VMexit on each byte; instead we can just write the whole buffer and generate a single VMexit.

Another possible solution is to give the guest direct access to the pNIC – Device assignment.

When a packet is sent by the guest OS it is sent to the vNIC. For each byte of data we will get:

The data is sent directly to the pNIC without hypervisor intervention ==> the pNIC will interrupt the guest OS directly when it finishes the transaction.

Now we have the maximum performance the HW can supply, but we need a separate pNIC for each guest plus another one for the host; this becomes expensive.

The tradeoff between Virtual devices and Device assignment is Paravirtual devices and their protocol, VirtIO.

This case is quite similar to the Virtual device case except for 2 facts: the first is that the emulated device doesn't pretend to act like a real device (there is no need to send virtual interrupts for each byte written, only a single virtual interrupt once the whole buffer is written), and the second is that we now have to write a new driver, since the original driver no longer fits the emulated HW.

We can now see the same function in the new NIC driver.

Another reason to use VirtIO devices is that Linux supports multiple hypervisors such as KVM, Xen, VMware etc., and therefore has drivers for each one of them. VirtIO provides driver unification: a uniform ABI for all those hypervisors. An ABI is an interface at the compiler level which describes how parameters are passed to functions (registers/stack), how interrupts are propagated, etc. VirtIO also provides device discovery and configuration.

Virtualization vs. Paravirtualization

  • Virtualization: The guest is unaware that it is being virtualized.
    Paravirtualization: The guest is aware that it is running on a hypervisor (and not on real HW).
  • Virtualization: No changes are required to the OS.
    Paravirtualization: Requires modification of the OS.
  • Virtualization: The hypervisor must emulate device HW, which leads to low performance.
    Paravirtualization: The guest and the hypervisor can work cooperatively to make this emulation efficient.

 


by Yonathan Bettan at April 24, 2018 12:26 PM

April 17, 2018

Fabian Deutsch

Running minikube v0.26.0 with CRIO and KVM nesting enabled by default

Probably not worth a post, as it’s mentioned in the readme, but CRIO was recently updated in minikube v0.26.0 which now makes it work like a charm.

When updating to 0.26 make sure to update the minikube binary, but also the docker-machine-driver-kvm2 binary.

Like in the past it is possible to switch to CRIO using

$ minikube start --container-runtime=cri-o
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Loading cached images from config file.
$

However, my favorite launch line is:

minikube start --container-runtime=cri-o --network-plugin=cni --bootstrapper=kubeadm --vm-driver=kvm2

Which will use CRIO as the container runtime, CNI for networking, kubeadm for bringing up kube inside a KVM VM.

April 17, 2018 08:05 AM

April 12, 2018

KVM on Z

White Paper: Exploiting HiperSockets in a KVM Environment Using IP Routing with Linux on Z

Our performance group has published a new white paper titled "Exploiting HiperSockets in a KVM Environment Using IP Routing with Linux on Z".
Abstract:
"The IBM Z platforms provide the HiperSockets technology feature for high-speed communications. This paper documents how to set up and configure KVM virtual machines to use HiperSockets with IP routing capabilities of the TCP/IP stack.
It provides a Network Performance comparison between various network configurations and illustrates how HiperSockets can achieve greater performance for many workload types, across a wide range of data-flow patterns, compared with using an OSA 10GbE card.
"
This white paper is available as .pdf and .html.

by Stefan Raspl (noreply@blogger.com) at April 12, 2018 04:25 PM

April 11, 2018

KVM on Z

RHEL 7.5 with support for KVM on Z available

Red Hat Enterprise Linux 7.5 is out. From the release notes, available here:
Availability across multiple architectures
To further support customer choice in computing architecture, Red Hat Enterprise Linux 7.5 is simultaneously available across all supported architectures, including x86, IBM Power, IBM z Systems, and 64-bit Arm.
Support for IBM Z is available through the kernel-alt package, as indicated earlier here, which provides Linux kernel 4.14. QEMU ships v2.10 via package qemu-kvm-ma, and libvirt is updated to v3.9.0 for all platforms.
Thus, all IBM z14 features as previously listed here are available.
Check these instructions on how to get started. 

by Stefan Raspl (noreply@blogger.com) at April 11, 2018 04:17 PM

April 09, 2018

Gerd Hoffmann

vgpu display support finally merged upstream

It took more than a year from the first working patches to the upstream merge. But now it's finally done. The linux kernel 4.16 (released on easter weekend) has the kernel-side code needed. The qemu code has been merged too (for gtk and spice user interfaces) and will be in the upcoming 2.12 release, which is in code freeze right now. The 2.12 release candidates already have the code, so you can grab one if you don't want to wait for the final release to play with this.

The vgpu code in the intel driver is off by default and must be enabled via a module option. And, while being at it, it is also suggested to load the kvmgt module. So I've dropped a config file with these lines ...

options i915 enable_gvt=1
softdep i915 pre: kvmgt

... into /etc/modprobe.d/. For some reason dracut didn't pick up the changes even after regenerating the initrd. Because of that I've blacklisted the intel driver (rd.driver.blacklist=i915 on the kernel command line) so the driver gets loaded later, after mounting the root filesystem, and modprobe actually sets the parameter.

With that in place you should have a /sys/class/mdev_bus directory with the intel gpu in there. You can create vgpu devices now. Check the mediated device documentation for details.
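
For reference, a minimal sketch of creating a vgpu through sysfs; the PCI address (0000:00:02.0) and the type name (i915-GVTg_V5_4) are assumptions, check mdev_supported_types on your system for the actual values:

# list the vgpu types the gpu offers
ls /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/

# create a vgpu instance with a fresh uuid
uuid=$(uuidgen)
echo $uuid > /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/create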

One final thing to take care of: currently using gvt mlocks all guest memory. For that to work the mlock limit (ulimit -l) must be high enough, otherwise the vgpu will not work correctly and you'll see a scrambled display. The limit can be configured in /etc/security/limits.conf.
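
Something along these lines in /etc/security/limits.conf should do; the user name and the 2 GB value are assumptions, size the limit to your guest memory:

# allow this user to mlock up to 2 GB (value in kbytes)
kraxel  soft  memlock  2097152
kraxel  hard  memlock  2097152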

Now let's use our new vgpu with qemu:

qemu-system-x86_64 \
     -enable-kvm \
     -m 1G \
     -nodefaults \
     -M graphics=off \
     -serial stdio \
     -display gtk,gl=on \
     -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID,display=on \
     -cdrom /vmdisk/iso/Fedora-Workstation-Live-x86_64-27-1.6.iso

Details on the non-obvious qemu switches:

-nodefaults
Do not create default devices (such as vga and nic).
-M graphics=off
Hint for the firmware that the guest runs without a graphical display. This enables serial console support in seabios. We use this here because the vgpu has no firmware support (i.e. no vgabios), therefore nothing is visible on the display until the i915 kernel module loads.
-display gtk,gl=on
Use gtk display, enable opengl.
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID,display=on
Add the vgpu to the guest, enable the display. Of course you have to replace UUID with your device.

libvirt support is still being worked on. Most bits are there, but some little details are missing. For example there is no way (yet) to tell libvirt that the guest doesn't need an emulated vga device, so you'll end up with two spice windows, one for the emulated vga and one for the vgpu. Other than that things work pretty straightforwardly. You need spice with opengl support enabled:

<graphics type='spice'>
  <listen type='none'/>
  <gl enable='yes'/>
</graphics>

And the vgpu must be added of course:

<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci'>
  <source>
    <address uuid='UUID'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</hostdev>

Then you can start the domain. Use "virt-viewer --attach guest" to connect to the guest. Note that guest and virt-viewer must run on the same machine; sending the vgpu display to a remote machine does not yet work.

by Gerd Hoffmann at April 09, 2018 11:17 AM

March 26, 2018

Cornelia Huck

s390x changes in QEMU 2.12

As QEMU is now in hard freeze for 2.12 (with the final release expected in mid/late April), now is a good point in time to summarize some of the changes that made it into QEMU 2.12 for s390x.

I/O Devices

  • Channel I/O: Any device can now be put into any channel subsystem image, regardless of whether it is a virtual device (like virtio-ccw) or a device passed through via vfio-ccw. This obsoletes the s390-squash-mcss option (which was needed to explicitly squash vfio-ccw devices into the default channel subsystem image in order to make them visible to guests not enabling MCSS-E).
  • PCI: Fixes and refactoring, including handling of subregions. This enables usage of virtio-pci devices on s390x (although only if MSI-X is enabled, as s390x depends on it.) Previously, you could add virtio-pci devices on s390x, but they were not usable.
    For more information about PCI, see this blog entry.

Booting and s390-ccw bios

  • Support for an interactive boot menu. Note that this is a bit different than on other architectures (although it hooks into the same infrastructure). The boot menu is written on the (virtual) disk via the 'zipl' program, and these entries need to be parsed and displayed via SCLP.

System Emulation

  • KVM: In case you were short on memory before: You can now run guests with 8 TB or more.
  • KVM: Support for the bpb and ppa15 CPU features (for spectre mitigation). These have been backported to 2.11.1 as well.
  • TCG: Lots of improvements: Implementation of missing instructions, full (non-experimental) SMP support.
  • TCG: Improvements in handling of the STSI instruction (you can look at some information obtained that way via /proc/sysinfo.) Note that a TCG guest reports itself as a KVM guest, rather than an LPAR: In many ways, a TCG guest is closer to KVM, and reporting itself as an LPAR makes the Linux guest code choose an undesired target for its console output by default.
  • TCG: Wire up the zPCI instructions; you can now use virtio-pci devices under TCG.
  • CPU models: Switch the 'qemu' model to a stripped-down z12, adding all features required by kernels on recent distributions. This means that you can now run recent distributions (Fedora 26/27, Ubuntu 18.04, ...) under TCG. Older distributions may not work (older kernels required some features not implemented under TCG), unless they were built for a z900 like Debian stable.

Miscellaneous

  • Support for memory hotplug via SCLP has been removed. This was an odd interface: Unlike as on other architectures, the guest could enable 'standby' memory if it had been supplied. Another problem was that this never worked with migration. Old command lines will continue to work, but no 'standby' memory will be available to the guest any more.
    Memory hotplug on s390x will probably come back in the future with an interface that matches better what is done elsewhere, likely via some paravirtualized interface. Support for the SCLP interface might come back in the future as well, implemented in an architecture-specific way that does not try to look like memory hotplug elsewhere.
  • And of course, the usual fixes, cleanups and other improvements.

by Cornelia Huck (noreply@blogger.com) at March 26, 2018 06:14 PM

KVM on Z

SLES12 SP3 Updates


SLES12SP3, released late last year, received a couple of mostly performance and security-related updates in support of IBM z14 and LinuxONE through the maintenance web updates.
In particular:

    by Stefan Raspl (noreply@blogger.com) at March 26, 2018 09:19 AM

    March 23, 2018

    Daniel Berrange

    ANNOUNCE: gtk-vnc 0.7.2 release

    I’m pleased to announce a new release of GTK-VNC, version 0.7.2. The release focus is on bug fixing, and addresses an important regression in TLS handling from the previous release.

    • Deprecated the manual python2 binding in favour of GObject introspection. It will be deleted in the next release.
    • Emit led state notification on connect
    • Fix incorrect copyright notices
    • Simplify shifted-tab key handling
    • Don’t short circuit TLS credential request
    • Improve check for keymap under XWayland
    • Update doap description of project
    • Modernize RPM specfile

    Thanks to all those who reported bugs and provided patches that went into this new release.

    by Daniel Berrange at March 23, 2018 02:20 PM

    March 17, 2018

    KVM on Z

    KVM on z14 features

    While the latest addition to the IBM Z family has been announced, here is a list of features in support of the new hardware generation in past releases of the Linux kernel, QEMU and libvirt, all activated by default in the z14 CPU model:
    • Instruction Execution Protection
      This feature provides KVM hypervisor support for the Instruction Execution Protection (IEP) facility in the z14. The IEP prevents code execution from memory regions marked as non-executable, improving the security model.
      Other than activating/deactivating this feature in the applicable CPU models in QEMU (which holds true for most hardware-related features on IBM Z in general), there are no switches associated with this feature.
      Requires Linux kernel 4.11 in the KVM host and guests, as well as QEMU v2.10 (host only).
      In the z14 CPU model, the respective feature is:
        iep       Instruction-execution-protection facility
    • SIMD Extensions
      Following up on the SIMD instructions introduced with the previous z13 model, the new z14 provides further vector instructions, which can again be used in KVM guests.
      These new vector instructions can be used to improve decimal calculations as well as for implementing high performance variants of certain cryptographic operations.
      Requires Linux kernel 4.11 as well as QEMU v2.10 in the KVM host, and binaries or a respective Java Runtime Environment in guests using the new vector instructions.
      In the z14 CPU model, the respective feature is:
        vxpd      Vector packed decimal facility
        vxeh      Vector enhancements facility
    • Keyless Guest Support
      This feature supports the so-called Keyless Subset (KSS) facility, a new feature of the z14 hardware. With the KSS facility enabled, a host is not required to perform the (costly) storage key initialization and management for KVM guests, unless a guest issues a storage key instruction.
      Requires Linux kernel 4.12 in the KVM host. As for the guests, note that starting with SLES12SP1, RHEL7.2 and Ubuntu 16.04, Linux on IBM Z does not issue any storage key operations anymore.
      This feature does not have a separate entry in the z14 CPU model.
    • CPUMF Basic Sample Configuration Level Indication
      Basic mode samples as defined in "The Load-Program-Parameter and the CPU-Measurement Facilities" (SA23-2260) do not provide an indication whether the sample was taken in a KVM host or guest. Beginning with z14, the hardware provides an indication of the configuration level (level of SIE, e.g. LPAR or KVM). This item exploits this information to make the perf guest/host decision reliable.
      Requires Linux kernel 4.12 in the KVM host.
      There is no separate entry in the z14 CPU model, since this feature applies to the host only.
    • Semaphore assist
      Improves performance of semaphore locks.
      Requires Linux kernel 4.7 and QEMU v2.10 in the KVM host. Exploitation in Linux kernels in guests is still in progress here, scheduled for 4.14.
      In the z14 CPU model, the respective feature is:
        sema      Semaphore-assist facility
    • Guarded storage
      This feature is specifically aimed at Java Virtual Machines running in KVM guests to run with fewer and shorter pauses for garbage collection.
      Requires Linux kernel 4.12 and QEMU 2.10 in the KVM host, and a Java Runtime Environment with respective support in the guests.
      In the z14 CPU model, the respective feature is:
        gs        Guarded-storage facility
    • MSA Updates
      z14 introduces 3 new Message Security Assists (MSA) for the following functionalities:
          MSA6: SHA3 hashing
          MSA7: A True Random Number Generator (TRNG)
          MSA8: The CIPHER MESSAGE WITH AUTHENTICATION instruction,
                which provides support for the Galois-counter-mode (GCM)
      MSA6 and MSA7 require Linux kernel 4.7, while MSA8 requires Linux kernel 4.12. All require QEMU v2.10 in the KVM host. These features can be exploited in KVM guests' kernels and userspace applications independently (i.e. a KVM guest's userspace applications can take advantage of these features irrespective of the guest's kernel version).
      In the z14 CPU model, the respective features are:
        msa6      Message-security-assist-extension 6 facility
        msa7      Message-security-assist-extension 7 facility
        msa8      Message-security-assist-extension 8 facility
    • Compression enhancements
      New instructions improve compression capabilities and performance.
      Requires Linux kernel 4.7 in the KVM host.
      In the z14 CPU model, the respective features are:
        opc       Order Preserving Compression facility
        eec       Entropy encoding compression facility
    • Miscellaneous instructions
      Details on these instructions are to be published in the forthcoming z14 Principles of Operation (PoP).
      Requires Linux kernel 4.7 and QEMU 2.10 in the KVM host, and binaries that were compiled for the z14 instruction set using binutils v2.28 and gcc v7.1 in the guests.
      In the z14 CPU model, the respective feature is:
        minste2   Miscellaneous-instruction-extensions facility 2
    Note: All versions specified are minimum versions.

    Further features will be announced in future blog posts as usual as they find their way into the respective Open Source projects.
    Also, don't forget to check this blog entry with further details on z14 in general and Linux on z in particular.

    by Stefan Raspl (noreply@blogger.com) at March 17, 2018 12:28 PM

    March 12, 2018

    Fabian Deutsch

    Running Ubuntu on Kubernetes with KubeVirt v0.3.0

    You have this image of a VM which you want to run - alongside containers - why? Well, you need it. Some people would say it’s dope, but sometimes you really need it, because it has an app you want to integrate with pods.

    Here is how you can do this with KubeVirt.

    1 Deploy KubeVirt

    Deploy KubeVirt on your cluster - or follow the demo guide to setup a fresh minikube cluster.

    2 Download Ubuntu

    While KubeVirt comes up (use kubectl get --all-namespaces pods), download Ubuntu Server

    3 Install kubectl plugin

    Make sure to have the latest or recent kubectl tool installed, and install the pvc plugin:

    curl -L https://github.com/fabiand/kubectl-plugin-pvc/raw/master/install.sh | bash
    

    4 Create disk

    Upload the Ubuntu server image:

    $ kubectl plugin pvc create ubuntu1704 1Gi $PWD/ubuntu-17.04-server-amd64.iso disk.img
    Creating PVC
    persistentvolumeclaim "ubuntu1704" created
    Populating PVC
    pod "ubuntu1704" created
    total 701444
    701444 -rw-rw-r--    1 1000     1000      685.0M Aug 25  2017 disk.img
    Cleanup
    pod "ubuntu1704" deleted
    

    5 Create and launch VM

    Create a VM:

    $ kubectl apply -f -
    apiVersion: kubevirt.io/v1alpha1
    kind: VirtualMachinePreset
    metadata:
      name: large
    spec:
      selector:
        matchLabels:
          kubevirt.io/size: large
      domain:
        resources:
          requests:
            memory: 1Gi
    ---
    apiVersion: kubevirt.io/v1alpha1
    kind: OfflineVirtualMachine
    metadata:
      name: ubuntu
    spec:
      running: true
      selector:
        matchLabels:
          guest: ubuntu
      template:
        metadata:
          labels: 
            guest: ubuntu
            kubevirt.io/size: large
        spec:
          domain:
            devices:
              disks:
                - name: ubuntu
                  volumeName: ubuntu
                  disk:
                    bus: virtio
          volumes:
            - name: ubuntu
              claimName: ubuntu1704
    

    6 Connect to VM

    $ ./virtctl-v0.3.0-linux-amd64 vnc --kubeconfig ~/.kube/config ubuntu
    

    Final notes - This is booting the Ubuntu ISO image. But this flow should work for existing images, which might be much more useful.

    March 12, 2018 03:56 PM

    March 06, 2018

    Fabian Deutsch

    v2v-job v0.2.0 POC for importing VMs into KubeVirt

    KubeVirt becomes usable. And to make it easier to use it would be nice to be able to import existing VMs. After all migration is a strong point of KubeVirt.

    virt-v2v is the tool of choice to convert some guest to run on the KVM hypervisor. What a great fit.

    Thus recently I started a little POC to check if this would really work.

    This post is just to wrap it up, as I just tagged v0.2.0 and finished a nice OVA import.

    What the POC does:

    • Take an URL pointing to an OVA
    • Download and convert the OVA to a domxml and raw disk image
    • Create a PVC and move the raw disk image to it
    • Create an OfflineVirtualMachine from the domxml using xslt

    This is pretty straight forward and currently living in a Job which can be found here: https://github.com/fabiand/v2v-job

    It’s actually using an OpenShift Template, but only works on Kubernetes so far, because I didn’t finish the RBAC profiles. However, using the oc tool you can even run it on Kubernetes without Template support by using:

    $ oc process --local -f manifests/template.yaml \
        -p SOURCE_TYPE=ova \
        -p SOURCE_NAME=http://192.168.42.1:8000/my.ova \
      | kubectl apply -f -
    serviceaccount "kubevirt-privileged" created
    job "v2v" created
    
    

    The interested reader can take a peek at the whole process in this log.

    And btw - This little but awesome patch on libguestfs by Pino - will help this job to auto-detect - well, guess - the guest operating system and set the OfflineVirtualMachine annotations correctly, in order to then - at runtime - apply the right VirtualMachinePresets, in order to launch the guest with optimized defaults.

    March 06, 2018 10:41 AM

    March 02, 2018

    Marcin Juszkiewicz

    OpenStack ‘Queens’ release done

    OpenStack community released ‘queens’ version this week. IMHO it is quite important moment for AArch64 community as well because it works out of the box for us.

    Gone are things like setting hw_firmware_type=uefi for each image you upload to Glance — Nova assumes UEFI to be the default firmware on AArch64 (unless you set the variable to a different value for some reason). This simplifies things, as users do not have to worry about it, and we should have fewer support questions on new setups of Linaro Developer Cloud (which will be based on ‘Queens’ instead of ‘Newton’).

    There is a working graphical console if your guest image uses a properly configured kernel (4.14 from Debian/stretch-backports works fine, 4.4 from Ubuntu/xenial (used by CirrOS) does not have graphics enabled). A handy feature which some users have already asked us about.

    The sad thing is the state of live migration on AArch64. It simply does not work through the whole stack (Nova, libvirt, QEMU) because we have no idea exactly what CPU we are running on and how compatible it is with other CPU cores. In theory live migration between the same type of processors (like XGene1 -> XGene1) should be possible, but we do not have even that level of information available. More information can be found in bug 1430987 reported against libvirt.

    Less sad part? We set cpu_mode to ‘host-passthrough’ by default now (in Nova), so no matter which deployment method is used, it should work out of the box.
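
    In nova.conf terms this amounts to something like the following sketch (option name as in the Nova documentation):

    [libvirt]
    cpu_mode = host-passthrough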

    When it comes to building (Kolla) and deploying (Kolla Ansible), most of the changes were done during the Pike cycle. During the Queens one, most of the changes were small tweaks here and there. I think that our biggest change was convincing everyone in Kolla(-ansible) to migrate from MariaDB 10.0.x (usually from external repositories) to 10.1.x taken from distribution (Debian) or from RDO.

    What will Rocky bring? Better hotplug for PCI Express machines (AArch64/virt, x86/q35 models) is one thing. I hope that live migration stuff situation will improve as well.

    by Marcin Juszkiewicz at March 02, 2018 01:22 PM

    February 23, 2018

    Fabian Deutsch

    KubeVirt v0.3.0-alpha.3: Kubernetes native networking and storage

    First post for quite some time. A side effect of being busy streamlining our KubeVirt user experience.

    KubeVirt v0.3.0 was not released at the beginning of the month.

    That release was intended to be a little bigger, because it included a large architecture change (to the good). The change itself was amazingly friendly and went in without many problems - even if it took some time.

    But, the work which was building upon this patch in the storage and network areas was delayed and didn’t make it in time. Thus we skipped the release in order to let storage and network catch up.

    The important thing about these two areas is that KubeVirt was able to connect a VM to a network, and was able to boot off an iSCSI target, but this was not really tightly integrated with Kubernetes.

    Now, just this week two patches landed which actually do integrate these areas with Kubernetes.

    Storage

    The first is storage - mainly written by Artyom, and finalized by David - which allows a user to use a persistent volume as the backing storage for a VM disk:

    metadata:
      name: myvm
    apiVersion: kubevirt.io/v1alpha1
    kind: VirtualMachine
    spec:
      domain:
        devices:
          disks:
          - name: mypvcdisk
            volumeName: mypvc
            lun: {}
      volumes:
        - name: mypvc
          persistentVolumeClaim:
            claimName: mypvc
    

    This means that any storage solution supported by Kubernetes to provide PVs can be used to store virtual machine images. This is a big step forward in terms of compatibility.

    This actually works by taking this claim, and attaching it to the VM’s pod definition, in order to let the kubelet then mount the respective volume to the VM’s pod. Once that is done, KubeVirt will take care to connect the disk image within that PV to the VM itself. This is only possible because the architecture change caused libvirt to run inside every VM pod, and thus allows the VM to consume the pod resources.

    Side note, another project is in progress to actually let a user upload a virtual machine disk to the cluster in a convenient way: https://github.com/kubevirt/containerized-data-importer.

    Network

    The second change is about network which Vladik worked on for some time. This change also required the architectural changes, in order to allow the VM and libvirt to consume the pod’s network resource.

    Just like with pods, the user does not need to do anything to get basic network connectivity. KubeVirt will connect the VM to the NIC of the pod in order to give it the most compatible integration. Thus you are now able to expose a TCP or UDP port of the VM to the outside world using regular Kubernetes Services, as sketched below.
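
    As a sketch, exposing SSH of a VM could then look like this; the label selector is an assumption and depends on how your VM pods are labelled:

    apiVersion: v1
    kind: Service
    metadata:
      name: vmservice
    spec:
      selector:
        kubevirt.io/domain: myvm    # assumed label on the VM's pod
      ports:
        - protocol: TCP
          port: 22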

    A side note here is that despite this integration we are now looking to enhance this further to allow the usage of sidecars like Istio.

    Alpha Release

    The three changes - and their delay - caused the delay of v0.3.0 - which will now be released in the beginning of March. But we have done a few pre-releases in order to allow interested users to try this code right now:

    KubeVirt v0.3.0-alpha.3 is the most recent alpha release and should work fairly well.

    More

    But these items were just a small fraction of what we have been doing.

    If you look at the kubevirt org on GitHub you will notice many more repositories there, covering storage, cockpit, and deployment with ansible - and it will be another post to write about how all of this is fitting together.

    Welcome aboard!

    KubeVirt is really speeding up and we are still looking for support. So if you are interested in working on a bleeding edge project tightly coupled with Kubernetes, but also having its own notion, and a great team, then just reach out to me.

    February 23, 2018 11:06 AM

    February 21, 2018

    Alex Bennée

    Workbooks for Benchmarking

    While working on a major re-factor of QEMU's softfloat code I've been doing a lot of benchmarking. It can be quite tedious work, as you need to be careful that you've run the correct steps on the correct binaries, and keeping notes is important. It is a task that cries out for scripting, but that in itself can be a compromise, as you end up stitching a pipeline of commands together in something like perl. You may script it all in a language designed for this sort of thing, like R, but then find your final upload step is a pain to implement.

    One solution to this is to use a literate programming workbook like this one. Literate programming is a style where you interleave your code with natural prose describing the steps you go through. This is different from simply having well-commented code in a source tree. For one thing, you do not have to leap around a large code base: everything you need is in the file you are reading, from top to bottom. There are many solutions out there, including various python-based examples. Of course, being a happy Emacs user, I use one of its stand-out features, org-mode, which comes with multi-language org-babel support. This allows me to document my benchmarking while scripting up the steps in a variety of “languages” depending on my needs at the time. Let's take a look at the first section:

    1 Binaries To Test

    Here we have several tables of binaries to test. We refer to the
    current benchmarking set from the next stage, Run Benchmark.

    For a final test we might compare the system QEMU with a reference
    build as well as our current build.

    | Binary                                                                        | title            |
    |-------------------------------------------------------------------------------+------------------|
    | /usr/bin/qemu-aarch64                                                         | system-2.5.log   |
    | ~/lsrc/qemu/qemu-builddirs/arm-targets.build/aarch64-linux-user/qemu-aarch64  | master.log       |
    | ~/lsrc/qemu/qemu.git/aarch64-linux-user/qemu-aarch64                          | softfloat-v4.log |

    Well, that is certainly fairly self-explanatory. These are named org-mode tables, which can be referred to in other code snippets and passed in as variables, as sketched below.
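
    In the org source, the wiring looks roughly like this - the names here are illustrative, not the exact ones from my workbook:

        #+name: benchmark-binaries
        | Binary                | title          |
        |-----------------------+----------------|
        | /usr/bin/qemu-aarch64 | system-2.5.log |

        #+begin_src python :var files=benchmark-binaries
          # "files" arrives as a list of [binary, logname] rows
          return files
        #+end_src

    So the next job is to run the benchmark itself: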

    2 Run Benchmark

    This runs the benchmark against each binary we have selected above.

        import subprocess
        import os

        # "files" (the table of binaries above) and "tests" are passed in
        # via org-babel :var headers.
        runs = []

        for qemu, logname in files:
            cmd = "taskset -c 0 %s ./vector-benchmark -n %s | tee %s" % (qemu, tests, logname)
            subprocess.call(cmd, shell=True)
            runs.append(logname)

        return runs
            
    

    So why use python as the test runner? Well, truth is, whenever I end up munging arrays in shell script I forget the syntax and end up jumping through all sorts of hoops. It is easier just to have some simple python. I use python again later to read the data back into an org-table so I can pass it to the next step, graphing.
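
    That read-back step is not shown in full here, but a minimal sketch looks something like the following - assuming, purely for illustration, that each log file contains one "<test-name> <nsecs/Kop>" pair per line (the real format may differ):

        # "runs" is the list of log files produced by the previous step.
        # Returning a list of lists makes org-babel render the result as a table.
        table = [["test", "run", "nsecs/Kop"]]
        for logname in runs:
            with open(logname) as f:
                for line in f:
                    name, value = line.split()
                    table.append([name, logname, float(value)])
        return table

    That table is then passed on to the graphing step: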

    set title "Vector Benchmark Results (lower is better)"
    set style data histograms
    set style fill solid 1.0 border lt -1
    
    set xtics rotate by 90 right
    set yrange [:]
    set xlabel noenhanced
    set ylabel "nsecs/Kop" noenhanced
    set xtics noenhanced
    set ytics noenhanced
    set boxwidth 1
    set xtics format ""
    set xtics scale 0
    set grid ytics
    set term pngcairo size 1200,500
    
    plot for [i=2:5] data using i:xtic(1) title columnhead
    

    This is a GNU Plot script which takes the data and plots an image from it. org-mode takes care of the details of marshalling the table data into GNU Plot so all this script is really concerned with is setting styles and titles. The language is capable of some fairly advanced stuff but I could always pre-process the data with something else if I needed to.

    Finally I need to upload my graph to an image hosting service to share with my colleagues. This can be done with an elaborate curl command, but I have another trick at my disposal thanks to the excellent restclient-mode. This mode is actually designed for interactive debugging of REST APIs, but it is also easy to use from an org-mode source block. So the whole thing looks like an HTTP session:

    :client_id = feedbeef
    
    # Upload images to imgur
    POST https://api.imgur.com/3/image
    Authorization: Client-ID :client_id
    Content-type: image/png
    
    < benchmark.png
    

    Finally, because the above dumps all the headers when run (which is very handy for debugging) while I actually only want the URL in most cases, I extract just the link. I can do this simply enough in elisp:

    #+name: post-to-imgur
    #+begin_src emacs-lisp :var json-string=upload-to-imgur()
      (when (string-match
             (rx "link" (one-or-more (any "\":" whitespace))
                 (group (one-or-more (not (any "\"")))))
             json-string)
        (match-string 1 json-string))
    #+end_src
    

    The :var line calls the restclient-mode function automatically and passes in its result, from which the elisp can then extract the final URL.

    And there you have it, my entire benchmarking workflow document in a single file which I can read through tweaking each step as I go. This isn’t the first time I’ve done this sort of thing. As I use org-mode extensively as a logbook to keep track of my upstream work I’ve slowly grown a series of scripts for common tasks. For example every patch series and pull request I post is done via org. I keep the whole thing in a git repository so each time I finish a sequence I can commit the results into the repository as a permanent record of what steps I ran.

    If you want even more inspiration I suggest you look at John Kitchin's scimax work. As a publishing scientist he makes extensive use of org-mode when writing his papers. He is able to combine the main prose with the code that plots the graphs and tables in a single source document, from which his camera-ready documents are generated. Should he ever need to reproduce any work, his exact steps are all there in the source document. Yet another example of why org-mode is awesome 😉

    by Alex at February 21, 2018 08:34 PM

    February 19, 2018

    Marcin Juszkiewicz

    Hotplug in VM. Easy to say…

    You run a VM instance. Never mind whether it is part of an OpenStack setup or just a local one started using Boxes, virt-manager, virsh or some other kind of front-end to the libvirt daemon. And then you want to add some virtual hardware to it. And another card, and one more controller…

    An easy scenario to imagine, right? What can go wrong, you say? A “No more available PCI slots.” message can happen. On the second or third card or controller… But how? Why?

    As I wrote in one of my previous posts, most VM instances are 90s-PC-hardware virtual boxes, with a simple PCI bus which accepts several cards being added or removed at any moment.

    But not on the AArch64 architecture. Nor on x86-64 with the Q35 machine type. What is the difference? Both are PCI Express machines, and by default they have far too few PCIe slots (called pcie-root-port in qemu/libvirt language). More about PCI Express support can be found on the PCI topology and hotplug page of the libvirt documentation.
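
    At the plain QEMU level, each such hotpluggable slot is provided by a pcie-root-port device that has to be added up front, roughly like this (the IDs, chassis numbers and addresses below are illustrative):

        -device pcie-root-port,id=pcie_port1,chassis=1,bus=pcie.0,addr=0x2 \
        -device pcie-root-port,id=pcie_port2,chassis=2,bus=pcie.0,addr=0x3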

    So I wrote a patch for Nova to make sure that enough slots will be available. And then I started testing. I tried a few different approaches, discussed ways of solving the problem with the upstream libvirt developers, and finally we selected the one and only proper way of doing it. Then I discussed failures with UEFI developers, and went to the QEMU authors for help. And I explained what I wanted to achieve, and why, to everyone in each of those four projects. At some point I was seeing pcie-root-port things everywhere…

    It turned out that the method of fixing it is kind of simple: we have to create the whole PCIe structure, with root port and slots, ourselves. This tells libvirt not to attempt any automatic adding of slots (which may be tricky if not configured properly, as you may end up with too few slots for basic add-ons).
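
    In libvirt domain XML terms, pre-creating the structure means spelling out the root ports yourself, along these lines (a minimal sketch - the index, chassis and port numbering are illustrative):

        <controller type='pci' index='0' model='pcie-root'/>
        <controller type='pci' index='1' model='pcie-root-port'>
          <target chassis='1' port='0x8'/>
        </controller>
        <controller type='pci' index='2' model='pcie-root-port'>
          <target chassis='2' port='0x9'/>
        </controller>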

    Then I went with the idea of using insane values. A VM with one hundred PCIe slots? Sure. So I made one, booted it, and then something weird happened: I landed in the UEFI shell instead of getting the system booted. Why? How? Where is my storage? Network? Etc.?

    It turns out that QEMU has limits. And libvirt has limits… All the ports/slots went onto one bus, and the memory for MMCONFIG and/or I/O space was exhausted. There are two interesting threads about it on the qemu-devel mailing list.

    So I added a magic number to my patch: 28. That many pcie-root-port entries in my AArch64 VM instance gave me a bootable system. I still have to check it on an x86-64/Q35 setup, but it should be more or less the same. I expect this patch to land in ‘Rocky’ (the next OpenStack release), and I will probably have to find a way to get it into ‘Queens’ as well, because this is what we are planning to use for the next edition of the Linaro Developer Cloud.

    Conclusion? Hotplug may be complicated. But issues with it can be solved.

    by Marcin Juszkiewicz at February 19, 2018 06:06 PM

    Cornelia Huck

    Notes on PCI on s390x

    As QEMU 2.12 will finally support PCI devices under s390x/tcg, I thought now is a good time to talk about some of the peculiarities of PCI on the mainframe.

    Oddities of PCI on s390x architecture

    Oddity #1: No MMIO, but instructions

    Everywhere else, you use MMIO when interacting with PCI devices. Not on s390x; you have a set of instructions instead. For example, if you want to read or write memory, you will need to use the PCILG or PCISTG instructions, and for refreshing translations, you will need to use the RPCIT instruction. Fortunately, these instructions can be matched to the primitives in the Linux kernel; unfortunately, all of these instructions are privileged, which leads us to

    Oddity #2: No user space I/O

    As any interaction with PCI devices needs to be done through privileged instructions, Linux user space can't interact with the devices directly; the Linux kernel needs to be involved in every case. This means that none of the PCI user space implementations popular on other platforms are available on s390x.

    Oddity #3: No topology, but FID and UID

    Usually, you'll find busses, slots and functions when you identify a certain PCI function. The PCI instructions on s390x, however, don't expose any topology to the caller. This means that an operating system will get a simple list of functions, with a function id (FID) that can be mapped to a physical slot, and a UID, which the Linux kernel will map to a domain number. A PCI identifier under Linux on s390x will therefore always be of the form <domain>:00:00.0.

    Implications for the QEMU implementation of PCI on s390x

    In order to support PCI on s390x in QEMU, some specialties had to be implemented.

    Instruction handlers

    Under KVM, every PCI instruction is intercepted and routed to user space. QEMU does the heavy lifting of emulating the operations and mapping them to generic PCI code. This also implied that PCI under tcg did not work until the instructions had been wired up; this has now finally happened and will be in the 2.12 release.

    Modelling and (lack of) topology

    QEMU PCI code expects the normal topology present on other platforms. However, this (made-up) topology will be invisible to guests, as the PCI instructions do not relay it. Instead, there is a special "zpci" device with "fid" and "uid" properties that can be linked to a normal PCI device. If no zpci device is specified, QEMU will autogenerate the FID and the UID.

    How can I try this out?

    If you do not have a real mainframe with a real PCI card, you can use virtio-pci devices as of QEMU 2.12 (or current git as of the time of this writing). If you do have a mainframe and a PCI card, you can use vfio-pci (but not yet via libvirt).
    Here's an example of how to specify a virtio-net-pci device for s390x, using tcg:

        s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg \
            -cpu qemu,zpci=on (...) \
            -device zpci,uid=12,fid=2,target=vpci02,id=zpci2 \
            -device virtio-net-pci,id="vpci02",addr=0x2
    Some notes on this:
    • You need to explicitly enable the "zpci" feature in the qemu cpu model. Two other features, "aen" and "ais", are enabled by default ("aen" and "zpci" are mandatory; "ais" is needed for Linux guest kernels prior to 4.15). If you use KVM, the host kernel also needs support for "ais".
    • The zpci device is joined with the PCI device via the "target" property. The virtio-net-pci device does not know anything about zpci devices.
    • Only virtio-pci devices using MSI-X will work on s390x.
    In the guest, this device will show up in lspci -v as

        000c:00:00.0 Ethernet controller: Red Hat, Inc. Virtio network device
                Subsystem: Red Hat, Inc. Device 0001
                Physical Slot: 00000002

    Note how the uid of 12 shows up as domain 000c, and the fid of 2 as physical slot 00000002.

    by Cornelia Huck (noreply@blogger.com) at February 19, 2018 01:49 PM
