Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

April 25, 2018

Peter Maydell

Debian on QEMU’s Raspberry Pi 3 model

For the QEMU 2.12 release we added support for a model of the Raspberry Pi 3 board (thanks to everybody involved in developing and upstreaming that code). The model is sufficient to boot a Debian image, so I wanted to write up how to do that.

Things to know before you start

Before I start, some warnings about the current state of the QEMU emulation of this board:

  • We don’t emulate the boot rom, so QEMU will not automatically boot from an SD card image. You need to manually extract the kernel, initrd and device tree blob from the SD image first. I’ll talk about how to do that below.
  • We don’t have an emulation of the BCM2835 USB controller. This means that there is no networking support, because on the raspi devices the ethernet hangs off the USB controller.
  • Our raspi3 model will only boot AArch64 (64-bit) kernels. If you want to boot a 32-bit kernel you should use the “raspi2” board model.
  • The QEMU model is missing models of some devices, and others are guesswork due to a lack of documentation of the hardware; so although the kernel I tested here will boot, it’s quite possible that other kernels may fail.

You’ll need the following things on your host system:

  • QEMU version 2.12 or better
  • libguestfs (on Debian and Ubuntu, install the libguestfs-tools package)
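
On Debian or Ubuntu hosts the libguestfs part is a single package install (QEMU 2.12 may be newer than what your distribution ships, in which case you will need to build it yourself):

$ sudo apt install libguestfs-tools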

Getting the image

I’m using the unofficial preview images described on the Debian wiki.

$ wget https://people.debian.org/~stapelberg/raspberrypi3/2018-01-08/2018-01-08-raspberry-pi-3-buster-PREVIEW.img.xz
$ xz -d 2018-01-08-raspberry-pi-3-buster-PREVIEW.img.xz

Extracting the guest boot partition contents

I use libguestfs to extract files from the guest SD card image. There are other ways to do this but I think libguestfs is the easiest to use. First, check that libguestfs is working on your system:

$ virt-filesystems -a 2018-01-08-raspberry-pi-3-buster-PREVIEW.img
/dev/sda1
/dev/sda2

If this doesn’t work, then you should sort that out first. A couple of common reasons I’ve seen:

  • if you’re on Ubuntu then your kernels in /boot are installed not-world-readable; you can fix this with sudo chmod 644 /boot/vmlinuz*
  • if you’re running Virtualbox on the same host it will interfere with libguestfs’s attempt to run KVM; you can fix that by exiting Virtualbox

Now you can ask libguestfs to extract the contents of the boot partition:

$ mkdir bootpart
$ guestfish --ro -a 2018-01-08-raspberry-pi-3-buster-PREVIEW.img -m /dev/sda1

Then at the guestfish prompt type:

copy-out / bootpart/
quit

This should have copied various files into the bootpart/ subdirectory.
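
If you prefer, the same extraction can be done non-interactively by passing the command to guestfish directly, and a quick ls confirms that the kernel, initrd and device tree blob used in the next step are there:

$ guestfish --ro -a 2018-01-08-raspberry-pi-3-buster-PREVIEW.img -m /dev/sda1 \
    copy-out / bootpart/
$ ls bootpart/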

Run the guest image

You should now be able to run the guest image:

$ qemu-system-aarch64 \
  -kernel bootpart/vmlinuz-4.14.0-3-arm64 \
  -initrd bootpart/initrd.img-4.14.0-3-arm64 \
  -dtb bootpart/bcm2837-rpi-3-b.dtb \
  -M raspi3 -m 1024 \
  -serial stdio \
  -append "rw earlycon=pl011,0x3f201000 console=ttyAMA0 loglevel=8 root=/dev/mmcblk0p2 fsck.repair=yes net.ifnames=0 rootwait memtest=1" \
  -drive file=2018-01-08-raspberry-pi-3-buster-PREVIEW.img,format=raw,if=sd

and have it boot to a login prompt (the root password for this Debian image is “raspberry”).

The kernel will print several WARNING logs as it starts; these will have a backtrace like this:

[  145.157957] [] uart_get_baud_rate+0xe4/0x188
[  145.158349] [] pl011_set_termios+0x60/0x348
[  145.158733] [] uart_change_speed.isra.3+0x50/0x130
[  145.159147] [] uart_set_termios+0x7c/0x180
[  145.159570] [] tty_set_termios+0x168/0x200
[  145.159976] [] set_termios+0x2b0/0x338
[  145.160647] [] tty_mode_ioctl+0x358/0x590
[  145.161127] [] n_tty_ioctl_helper+0x54/0x168
[  145.161521] [] n_tty_ioctl+0xd4/0x1a0
[  145.161883] [] tty_ioctl+0x150/0xac0
[  145.162255] [] do_vfs_ioctl+0xc4/0x768
[  145.162620] [] SyS_ioctl+0x8c/0xa8

These are ugly but harmless. (The underlying cause is that QEMU doesn’t implement the undocumented ‘cprman’ clock control hardware, and so Linux thinks that the UART is running at a zero baud rate and complains.)

by pm215 at April 25, 2018 08:07 AM

QEMU project

QEMU version 2.12.0 released

We’d like to announce the availability of the QEMU 2.12.0 release. This release contains 2700+ commits from 204 authors.

You can grab the tarball from our download page. The full list of changes is available in the Wiki.

Highlights include:

  • Spectre/Meltdown mitigation support for x86/pseries/s390 guests. For more details see: https://www.qemu.org/2018/02/14/qemu-2-11-1-and-spectre-update/
  • Numerous block support improvements, including support for directly interacting with userspace NVMe driver, and general improvements to NBD server/client including more efficient reads of sparse files
  • Networking support for VMWare paravirtualized RDMA device (RDMA HCA and Soft-RoCE supported), CAN bus support via Linux SocketCAN and SJA1000-based PCI interfaces, and general improvements for dual-stack IPv4/IPv6 environments
  • GUI security/bug fixes, dmabufs support for GTK/Spice.
  • Better IPMI support for Platform Events and SEL logging in internal BMC emulation
  • SMBIOS support for “OEM Strings”, which can be used for automating guest image activation without relying on network-based querying
  • Disk cache information via virtio-balloon
  • ARM: AArch64 new instructions for FCMA/RDM and SIMD/FP16/crypto/complex number extensions
  • ARM: initial support for Raspberry Pi 3 machine type
  • ARM: Cortex-M33/Armv8-M emulation via new mps2-an505 board and many other improvements for M profile emulation
  • HPPA: support for full machine emulation (hppa-softmmu)
  • PowerPC: PPC4xx emulation improvements, including I2C bus support
  • PowerPC: new Sam460ex machine type
  • PowerPC: significant TCG performance improvements
  • PowerPC: pseries: support for Spectre/Meltdown mitigations
  • RISC-V: new RISC-V target via “spike_v1.9.1”, “spike_v1.10”, and “virt” machine types
  • s390: non-virtual devices no longer require dedicated channel subsystem and guest support for multiple CSSs
  • s390: general PCI improvements, MSI-X support for virtio-pci devices
  • s390: improved TCG emulation support
  • s390: KVM support for systems larger than 7.999TB
  • SPARC: sun4u power device emulation
  • SPARC: improved trace-event support and emulation/debug fixes
  • Tricore: new instruction variants for JEQ/JNE and 64-bit MOV
  • x86: Intel IOMMU support for 48-bit addresses
  • Xtensa: backend now uses libisa for instruction decoding/disassembly
  • Xtensa: multi-threaded TCG support and noMMU configuration variants
  • and lots more…

Thank you to everyone involved!

April 25, 2018 03:30 AM

April 17, 2018

Fabian Deutsch

Running minikube v0.26.0 with CRIO and KVM nesting enabled by default

Probably not worth a post, as it’s mentioned in the readme, but CRIO was recently updated in minikube v0.26.0 which now makes it work like a charm.

When updating to 0.26 make sure to update the minikube binary, but also the docker-machine-driver-kvm2 binary.

Like in the past it is possible to switch to CRIO using

$ minikube start --container-runtime=cri-o
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Loading cached images from config file.
$

However, my favorite launch line is:

minikube start --container-runtime=cri-o --network-plugin=cni --bootstrapper=kubeadm --vm-driver=kvm2

This will use CRIO as the container runtime, CNI for networking, and kubeadm for bringing up Kubernetes inside a KVM VM.
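
To double-check that the cluster really came up with CRIO, kubectl can show the node's container runtime in its wide output:

$ kubectl get nodes -o wide    # the CONTAINER-RUNTIME column should report cri-o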

April 17, 2018 08:05 AM

April 12, 2018

KVM on Z

White Paper: Exploiting HiperSockets in a KVM Environment Using IP Routing with Linux on Z

Our performance group has published a new white paper titled "Exploiting HiperSockets in a KVM Environment Using IP Routing with Linux on Z".
Abstract:
"The IBM Z platforms provide the HiperSockets technology feature for high-speed communications. This paper documents how to set up and configure KVM virtual machines to use HiperSockets with IP routing capabilities of the TCP/IP stack.
It provides a Network Performance comparison between various network configurations and illustrates how HiperSockets can achieve greater performance for many workload types, across a wide range of data-flow patterns, compared with using an OSA 10GbE card.
"
This white paper is available as .pdf and .html.

by Stefan Raspl (noreply@blogger.com) at April 12, 2018 04:25 PM

April 11, 2018

KVM on Z

RHEL 7.5 with support for KVM on Z available

Red Hat Enterprise Linux 7.5 is out. From the release notes, available here:
Availability across multiple architectures
To further support customer choice in computing architecture, Red Hat Enterprise Linux 7.5 is simultaneously available across all supported architectures, including x86, IBM Power, IBM z Systems, and 64-bit Arm.
Support for IBM Z is available through the kernel-alt package, as indicated earlier here, which provides Linux kernel 4.14. QEMU ships v2.10 via package qemu-kvm-ma, and libvirt is updated to v3.9.0 for all platforms.
With that, all IBM z14 features as previously listed here are available.
Check these instructions on how to get started. 
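
A rough sketch of pulling the pieces onto such a system (package names as mentioned above; the exact repository and subscription setup is distribution-specific):

$ sudo yum install kernel-alt qemu-kvm-ma libvirt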

by Stefan Raspl (noreply@blogger.com) at April 11, 2018 04:17 PM

April 09, 2018

Gerd Hoffmann

vgpu display support finally merged upstream

It took more than a year from the first working patches to the upstream merge. But now it’s finally done. The linux kernel 4.16 (released on easter weekend) has the kernel-side code needed. The qemu code has been merged too (for gtk and spice user interfaces) and will be in the upcoming 2.12 release which is in code freeze right now. The 2.12 release candidates already have the code, so you can grab one if you don’t want to wait for the final release to play with this.

The vgpu code in the intel driver is off by default and must be enabled via a module option. While at it, it is also worth telling modprobe to pull in the kvmgt module. So I’ve dropped a config file with these lines …

options i915 enable_gvt=1
softdep i915 pre: kvmgt

… into /etc/modprobe.d/. For some reason dracut didn’t pick the changes up even after regenerating the initrd. Because of that I’ve blacklisted the intel driver (rd.driver.blacklist=i915 on the kernel command line) so the driver gets loaded later, after mounting the root filesystem, and modprobe actually sets the parameter.

With that in place you should have a /sys/class/mdev_bus directory with the intel gpu in there. You can create vgpu devices now. Check the mediated device documentation for details.
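
Creating one boils down to picking a supported type and writing a fresh UUID into its create node, roughly like this (the PCI address and type name are examples from a typical setup, yours will differ):

$ ls /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types
$ uuidgen | sudo tee /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/create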

One final thing to take care of: currently, using gvt mlocks all guest memory. For that to work the mlock limit (ulimit -l) must be high enough, otherwise the vgpu will not work correctly and you’ll see a scrambled display. The limit can be configured in /etc/security/limits.conf.
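
For example, an entry along these lines (user name and value are placeholders, pick what fits your setup):

youruser    soft    memlock    unlimited
youruser    hard    memlock    unlimited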

Now let’s use our new vgpu with qemu:

qemu-system-x86_64 \
     -enable-kvm \
     -m 1G \
     -nodefaults \
     -M graphics=off \
     -serial stdio \
     -display gtk,gl=on \
     -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID,display=on \
     -cdrom /vmdisk/iso/Fedora-Workstation-Live-x86_64-27-1.6.iso

Details on the non-obvious qemu switches:

-nodefaults
Do not create default devices (such as vga and nic).
-M graphics=off
Hint for the firmware that the guest runs without a graphical display. This enables serial console support in seabios. We use this here because the vgpu has no firmware support (i.e. no vgabios), therefore nothing is visible on the display until the i915 kernel module loads.
-display gtk,gl=on
Use gtk display, enable opengl.
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID,display=on
Add the vgpu to the guest, enable the display. Of course you have to replace UUID with your device.

libvirt support is still being worked on. Most bits are there, but some little details are missing. For example there is no way (yet) to tell libvirt the guest doesn’t need an emulated vga device, so you’ll end up with two spice windows, one for the emulated vga and one for the vgpu. Other than that, things work pretty much straightforwardly. You need spice with opengl support enabled:

<graphics type='spice'>
  <listen type='none'/>
  <gl enable='yes'/>
</graphics>

And the vgpu must be added of course:

<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci'>
  <source>
    <address uuid='UUID'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</hostdev>

Then you can start the domain. Use “virt-viewer --attach guest” to connect to the guest. Note that guest and virt-viewer must run on the same machine, sending the vgpu display to a remote machine does not yet work.

by Gerd Hoffmann at April 09, 2018 11:17 AM

March 26, 2018

Cornelia Huck

s390x changes in QEMU 2.12

As QEMU is now in hard freeze for 2.12 (with the final release expected in mid/late April), now is a good point in time to summarize some of the changes that made it into QEMU 2.12 for s390x.

I/O Devices

  • Channel I/O: Any device can now be put into any channel subsystem image, regardless of whether it is a virtual device (like virtio-ccw) or a device passed through via vfio-ccw. This obsoletes the s390-squash-mcss option (which was needed to explicitly squash vfio-ccw devices into the default channel subsystem image in order to make them visible to guests not enabling MCSS-E).
  • PCI: Fixes and refactoring, including handling of subregions. This enables usage of virtio-pci devices on s390x (although only if MSI-X is enabled, as s390x depends on it.) Previously, you could add virtio-pci devices on s390x, but they were not usable.
    For more information about PCI, see this blog entry.

Booting and s390-ccw bios

  • Support for an interactive boot menu. Note that this is a bit different than on other architectures (although it hooks into the same infrastructure). The boot menu is written on the (virtual) disk via the 'zipl' program, and these entries need to be parsed and displayed via SCLP.
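
A minimal sketch of asking for the menu on the command line (the disk image name is a placeholder; the menu entries themselves come from what zipl wrote to the disk):

qemu-system-s390x -M s390-ccw-virtio \
  -drive file=guest.qcow2,format=qcow2,if=none,id=d0 \
  -device virtio-blk-ccw,drive=d0 \
  -boot menu=on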

System Emulation

  • KVM: In case you were short on memory before: You can now run guests with 8 TB or more.
  • KVM: Support for the bpb and ppa15 CPU features (for spectre mitigation). These have been backported to 2.11.1 as well.
  • TCG: Lots of improvements: Implementation of missing instructions, full (non-experimental) SMP support.
  • TCG: Improvements in handling of the STSI instruction (you can look at some information obtained that way via /proc/sysinfo.) Note that a TCG guest reports itself as a KVM guest, rather than an LPAR: In many ways, a TCG guest is closer to KVM, and reporting itself as an LPAR makes the Linux guest code choose an undesired target for its console output by default.
  • TCG: Wire up the zPCI instructions; you can now use virtio-pci devices under TCG.
  • CPU models: Switch the 'qemu' model to a stripped-down z12, adding all features required by kernels on recent distributions. This means that you can now run recent distributions (Fedora 26/27, Ubuntu 18.04, ...) under TCG. Older distributions may not work (older kernels required some features not implemented under TCG), unless they were built for a z900 like Debian stable.

Miscellaneous

  • Support for memory hotplug via SCLP has been removed. This was an odd interface: Unlike on other architectures, the guest could enable 'standby' memory if it had been supplied. Another problem was that this never worked with migration. Old command lines will continue to work, but no 'standby' memory will be available to the guest any more.
    Memory hotplug on s390x will probably come back in the future with an interface that matches better what is done elsewhere, likely via some paravirtualized interface. Support for the SCLP interface might come back in the future as well, implemented in an architecture-specific way that does not try to look like memory hotplug elsewhere.
  • And of course, the usual fixes, cleanups and other improvements.

by Cornelia Huck (noreply@blogger.com) at March 26, 2018 06:14 PM

KVM on Z

SLES12 SP3 Updates


SLES12SP3, released late last year, received a couple of mostly performance and security-related updates in support of IBM z14 and LinuxONE through the maintenance web updates.
In particular:

    by Stefan Raspl (noreply@blogger.com) at March 26, 2018 09:19 AM

    March 23, 2018

    Daniel Berrange

    ANNOUNCE: gtk-vnc 0.7.2 release

    I’m pleased to announce a new release of GTK-VNC, version 0.7.2. The release focus is on bug fixing, and addresses an important regression in TLS handling from the previous release.

    • Deprecated the manual python2 binding in favour of GObject introspection. It will be deleted in the next release.
    • Emit led state notification on connect
    • Fix incorrect copyright notices
    • Simplify shifted-tab key handling
    • Don’t short circuit TLS credential request
    • Improve check for keymap under XWayland
    • Update doap description of project
    • Modernize RPM specfile

    Thanks to all those who reported bugs and provided patches that went into this new release.

    by Daniel Berrange at March 23, 2018 02:20 PM

    March 17, 2018

    KVM on Z

    KVM on z14 features

    With the latest addition to the IBM Z family now announced, here is a list of features in past releases of the Linux kernel, QEMU and libvirt that support specific capabilities of the new hardware generation, all activated by default in the z14 CPU model:
    • Instruction Execution Protection
      This feature provides KVM hypervisor support for the Instruction Execution Protection (IEP) facility in the z14. The IEP prevents code execution from memory regions marked as non-executable, improving the security model.
      Other than activating/deactivating this feature in the applicable CPU models in QEMU (which holds true for most hardware-related features on IBM Z in general), there are no switches associated with this feature.
      Requires Linux kernel 4.11 in the KVM host and guests, as well as QEMU v2.10 (host only).
      In the z14 CPU model, the respective feature is:
        iep       Instruction-execution-protection facility
    • SIMD Extensions
      Following up to the SIMD instructions as introduced with the previous z13 model, the new z14 provides further vector instructions, which can again be used in KVM guests.
      These new vector instructions can be used to improve decimal calculations as well as for implementing high performance variants of certain cryptographic operations.
      Requires Linux kernel 4.11 as well as QEMU v2.10 in the KVM host, and binaries or a respective Java Runtime Environment in guests using the new vector instructions.
      In the z14 CPU model, the respective feature is:
        vxpd      Vector packed decimal facility
        vxeh      Vector enhancements facility
    • Keyless Guest Support
      This feature supports the so-called Keyless Subset (KSS) facility, a new feature of the z14 hardware. With the KSS facility enabled, a host is not required to perform the (costly) storage key initialization and management for KVM guests, unless a guest issues a storage key instruction.
      Requires Linux kernel 4.12 in the KVM host. As for the guests, note that starting with SLES12SP1, RHEL7.2 and Ubuntu 16.04, Linux on IBM Z does not issue any storage key operations anymore.
      This feature does not have a separate entry in the z14 CPU model.
    • CPUMF Basic Sample Configuration Level Indication
      Basic mode samples as defined in "The Load-Program-Parameter and the CPU-Measurement Facilities" (SA23-2260) do not provide an indication whether the sample was taken in a KVM host or guest. Beginning with z14, the hardware provides an indication of the configuration level (level of SIE, e.g. LPAR or KVM). This item exploits this information to make the perf guest/host decision reliable.
      Requires Linux kernel 4.12 in the KVM host.
      There is no separate entry in the z14 CPU model, since this feature applies to the host only.
    • Semaphore assist
      Improves performance of semaphore locks.
      Requires Linux kernel 4.7 and QEMU v2.10 in the KVM host. Exploitation in Linux kernels in guests is still in progress here, scheduled for 4.14.
      In the z14 CPU model, the respective feature is:
        sema      Semaphore-assist facility
    • Guarded storage
      This feature is specifically aimed at Java Virtual Machines running in KVM guests to run with fewer and shorter pauses for garbage collection.
      Requires Linux kernel 4.12 and QEMU 2.10 in the KVM host, and a Java Runtime Environment with respective support in the guests.
      In the z14 CPU model, the respective feature is:
        gs        Guarded-storage facility
    • MSA Updates
      z14 introduces 3 new Message Security Assists (MSA) for the following functionalities:
          MSA6: SHA3 hashing
          MSA7: A True Random Number Generator (TRNG)
          MSA8: The CIPHER MESSAGE WITH AUTHENTICATION instruction,
                      which provides support for the Galois-counter-mode (GCM)
      MSA6 and MSA7 require Linux kernel 4.7, while MSA8 requires Linux kernel 4.12. All require QEMU v2.10 in the KVM host. These features can be exploited in KVM guests' kernels and userspace applications independently (i.e. a KVM guest's userspace applications can take advantage of these features irrespective of the guest's kernel version).
      In the z14 CPU model, the respective features are:
        msa6      Message-security-assist-extension 6 facility
        msa7      Message-security-assist-extension 7 facility
        msa8      Message-security-assist-extension 8 facility
    • Compression enhancements
      New instructions improve compression capabilities and performance.
      Requires Linux kernel 4.7 in the KVM host.
      In the z14 CPU model, the respective features are:
        opc       Order Preserving Compression facility
        eec       Entropy encoding compression facility
    • Miscellaneous instructions
      Details on these instructions are to be published in the forthcoming z14 Principles of Operation (PoP).
      Requires Linux kernel 4.7 and QEMU 2.10 in the KVM host, and binaries that were compiled for the z14 instruction set using binutils v2.28 and gcc v7.1 in the guests.
      In the z14 CPU model, the respective feature is:
        minste2   Miscellaneous-instruction-extensions facility 2
    Note: All versions specified are minimum versions.
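
    A minimal sketch of how these features surface on the QEMU command line, either implicitly via the z14 model or toggled individually on an older model where the host allows it (flags taken from the list above):

    qemu-system-s390x -M s390-ccw-virtio -cpu z14 ...
    qemu-system-s390x -M s390-ccw-virtio -cpu z13,gs=on,vxpd=on ...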

    Further features will be announced in future blog posts as usual as they find their way into the respective Open Source projects.
    Also, don't forget to check this blog entry with further details on z14 in general and Linux on z in particular.

    by Stefan Raspl (noreply@blogger.com) at March 17, 2018 12:28 PM

    March 12, 2018

    Fabian Deutsch

    Running Ubuntu on Kubernetes with KubeVirt v0.3.0

    You have this image, of a VM, which you want to run - alongside containers - why? - well, you need it. Some people would say it’s dope, but sometimes you really need it, because it has an app you want to integrate with pods.

    Here is how you can do this with KubeVirt.

    1 Deploy KubeVirt

    Deploy KubeVirt on your cluster - or follow the demo guide to setup a fresh minikube cluster.

    2 Download Ubuntu

    While KubeVirt comes up (use kubectl get --all-namespaces pods), download Ubuntu Server.
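
    A rough sketch of this step (the mirror URL is a placeholder for wherever you get the Ubuntu Server 17.04 ISO):

    kubectl get --all-namespaces pods
    wget "$UBUNTU_MIRROR/ubuntu-17.04-server-amd64.iso"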

    3 Install kubectl plugin

    Make sure to have the latest or recent kubectl tool installed, and install the pvc plugin:

    curl -L https://github.com/fabiand/kubectl-plugin-pvc/raw/master/install.sh | bash
    

    4 Create disk

    Upload the Ubuntu server image:

    $ kubectl plugin pvc create ubuntu1704 1Gi $PWD/ubuntu-17.04-server-amd64.iso disk.img
    Creating PVC
    persistentvolumeclaim "ubuntu1704" created
    Populating PVC
    pod "ubuntu1704" created
    total 701444
    701444 -rw-rw-r--    1 1000     1000      685.0M Aug 25  2017 disk.img
    Cleanup
    pod "ubuntu1704" deleted
    

    5 Create and launch VM

    Create a VM:

    $ kubectl apply -f -
    apiVersion: kubevirt.io/v1alpha1
    kind: VirtualMachinePreset
    metadata:
      name: large
    spec:
      selector:
        matchLabels:
          kubevirt.io/size: large
      domain:
        resources:
          requests:
            memory: 1Gi
    ---
    apiVersion: kubevirt.io/v1alpha1
    kind: OfflineVirtualMachine
    metadata:
      name: ubuntu
    spec:
      running: true
      selector:
        matchLabels:
          guest: ubuntu
      template:
        metadata:
          labels: 
            guest: ubuntu
            kubevirt.io/size: large
        spec:
          domain:
            devices:
              disks:
                - name: ubuntu
                  volumeName: ubuntu
                  disk:
                    bus: virtio
          volumes:
            - name: ubuntu
              persistentVolumeClaim:
                claimName: ubuntu1704
    

    6 Connect to VM

    $ ./virtctl-v0.3.0-linux-amd64 vnc --kubeconfig ~/.kube/config ubuntu
    

    Final notes - This is booting the Ubuntu ISO image. But this flow should work for existing images, which might be much more useful.

    March 12, 2018 03:56 PM

    March 06, 2018

    Fabian Deutsch

    v2v-job v0.2.0 POC for importing VMs into KubeVirt

    KubeVirt becomes usable. And to make it easier to use, it would be nice to be able to import existing VMs. After all, migration is a strong point of KubeVirt.

    virt-v2v is the tool of choice to convert some guest to run on the KVM hypervisor. What a great fit.

    Thus recently I started a little POC to check if this would really work.

    This post is just to wrap it up, as I just tagged v0.2.0 and finished a nice OVA import.

    What the POC does:

    • Take a URL pointing to an OVA
    • Download and convert the OVA to a domxml and raw disk image
    • Create a PVC and move the raw disk image to it
    • Create an OfflineVirtualMachine from the domxml using xslt

    This is pretty straightforward and currently lives in a Job which can be found here: https://github.com/fabiand/v2v-job

    It’s actually using an OpenShift Template, but only works on Kubernetes so far, because I didn’t finish the RBAC profiles. However, using the oc tool you can even run it on Kubernetes without Template support by using:

    $ oc process --local -f manifests/template.yaml \
        -p SOURCE_TYPE=ova \
        -p SOURCE_NAME=http://192.168.42.1:8000/my.ova \
      | kubectl apply -f -
    serviceaccount "kubevirt-privileged" created
    job "v2v" created
    
    

    The interested reader can take a peek at the whole process in this log.

    And btw - This little but awesome patch on libguestfs by Pino - will help this job to auto-detect - well, guess - the guest operating system and set the OfflineVirtualMachine annotations correctly, in order to then - at runtime - apply the right VirtualMachinePresets and launch the guest with optimized defaults.

    March 06, 2018 10:41 AM

    March 02, 2018

    Marcin Juszkiewicz

    OpenStack ‘Queens’ release done

    The OpenStack community released the ‘Queens’ version this week. IMHO it is quite an important moment for the AArch64 community as well because it works out of the box for us.

    Gone are things like setting hw_firmware_type=uefi for each image you upload to Glance — Nova assumes UEFI to be the default firmware on AArch64 (unless you set the variable to a different value for some reason). This simplifies things, as users do not have to worry about it, and we should have fewer support questions on new setups of Linaro Developer Cloud (which will be based on ‘Queens’ instead of ‘Newton’).
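
    For comparison, a sketch of the old per-image way of doing it (the property name is the one above, the remaining arguments are just illustrative):

    openstack image create --property hw_firmware_type=uefi \
        --disk-format qcow2 --container-format bare \
        --file debian-arm64.qcow2 debian-arm64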

    There is a working graphical console if your guest image uses a properly configured kernel (4.14 from Debian/stretch-backports works fine, 4.4 from Ubuntu/xenial (used by CirrOS) does not have graphics enabled). A handy feature which some users have already asked for.

    The sad thing is the state of live migration on AArch64. It simply does not work through the whole stack (Nova, libvirt, QEMU) because we have no idea exactly which CPU we are running on and how compatible it is with other CPU cores. In theory live migration between the same type of processors (like XGene1 -> XGene1) should be possible, but we do not have even that level of information available. More information can be found in bug 1430987 reported against libvirt.

    The less sad part? We now set cpu_model to ‘host-passthrough’ by default (in Nova), so no matter which deployment method is used it should work out of the box.

    When it comes to building (Kolla) and deploying (Kolla Ansible), most of the changes were done during the Pike cycle. During the Queens cycle most of the changes were small tweaks here and there. I think that our biggest change was convincing everyone in Kolla(-ansible) to migrate from MariaDB 10.0.x (usually from external repositories) to 10.1.x taken from the distribution (Debian) or from RDO.

    What will Rocky bring? Better hotplug for PCI Express machines (AArch64/virt, x86/q35 models) is one thing. I hope that live migration stuff situation will improve as well.

    by Marcin Juszkiewicz at March 02, 2018 01:22 PM

    February 23, 2018

    Fabian Deutsch

    KubeVirt v0.3.0-alpha.3: Kubernetes native networking and storage

    First post for quite some time. A side effect of being busy streamlining our KubeVirt user experience.

    KubeVirt v0.3.0 was not released at the beginning of the month.

    That release was intended to be a little bigger, because it included a large architecture change (to the good). The change itself was amazingly friendly and went in without many problems - even if it took some time.

    But, the work which was building upon this patch in the storage and network areas was delayed and didn’t make it in time. Thus we skipped the release in order to let storage and network catch up.

    The important thing about these two areas is that KubeVirt was able to connect a VM to a network, and was able to boot off an iSCSI target, but this was not really tightly integrated with Kubernetes.

    Now, just this week two patches landed which actually do integrate these areas with Kubernetes.

    Storage

    The first is storage - mainly written by Artyom, and finalized by David - which allows a user to use a persistent volume as the backing storage for a VM disk:

    metadata:
      name: myvm
    apiVersion: kubevirt.io/v1alpha1
    kind: VirtualMachine
    spec:
      domain:
        devices:
          disks:
          - name: mypvcdisk
            volumeName: mypvc
            lun: {}
      volumes:
        - name: mypvc
          persistentVolumeClaim:
            claimName: mypvc
    

    This means that any storage solution supported by Kubernetes to provide PVs can be used to store virtual machine images. This is a big step forward in terms of compatibility.

    This actually works by taking this claim and attaching it to the VM’s pod definition, in order to let the kubelet then mount the respective volume into the VM’s pod. Once that is done, KubeVirt will take care of connecting the disk image within that PV to the VM itself. This is only possible because the architecture change caused libvirt to run inside every VM pod, and thus allows the VM to consume the pod’s resources.

    Side note, another project is in progress to actually let a user upload a virtual machine disk to the cluster in a convenient way: https://github.com/kubevirt/containerized-data-importer.

    Network

    The second change is about network which Vladik worked on for some time. This change also required the architectural changes, in order to allow the VM and libvirt to consume the pod’s network resource.

    Just like with pods, the user does not need to do anything to get basic network connectivity. KubeVirt will connect the VM to the NIC of the pod in order to give it the most compatible integration. Thus you are now able to expose a TCP or UDP port of the VM to the outside world using regular Kubernetes Services.
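
    As a hedged sketch, exposing a port of the VM's virt-launcher pod with a regular Service could look like this (the pod name is whatever your VM's pod happens to be called; a hand-written Service with a label selector is the more durable variant):

    kubectl expose pod virt-launcher-myvm-abcde --name=myvm-ssh \
        --port=22 --target-port=22 --type=NodePort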

    A side note here is that despite this integration we are now looking to enhance this further to allow the usage of side cars like Istio.

    Alpha Release

    The three changes - and their delay - caused the delay of v0.3.0 - which will now be released in the beginning of March. But we have done a few pre-releases in order to allow interested users to try this code right now:

    KubeVirt v0.3.0-alpha.3 is the most recent alpha release and should work fairly well.

    More

    But these items were just a small fraction of what we have been doing.

    If you look at the kubevirt org on GitHub you will notice many more repositories there, covering storage, cockpit, and deployment with ansible - and it will take another post to write about how all of this fits together.

    Welcome aboard!

    KubeVirt is really speeding up and we are still looking for support. So if you are interested in working on a bleeding edge project tightly coupled with Kubernetes, but also having its own notion, and a great team, then just reach out to me.

    February 23, 2018 11:06 AM

    February 21, 2018

    Alex Bennée

    Workbooks for Benchmarking

    While working on a major re-factor of QEMU’s softfloat code I’ve been doing a lot of benchmarking. It can be quite tedious work as you need to be careful you’ve run the correct steps on the correct binaries and keeping notes is important. It is a task that cries out for scripting but that in itself can be a compromise as you end up stitching a pipeline of commands together in something like perl. You may script it all in a language designed for this sort of thing like R but then find your final upload step is a pain to implement.

    One solution to this is to use a literate programming workbook like this. Literate programming is a style where you interleave your code with natural prose describing the steps you go through. This is different from simply having well commented code in a source tree. For one thing you do not have to leap around a large code base as everything you need is in the file you are reading, from top to bottom. There are many solutions out there including various python based examples. Of course being a happy Emacs user I use one of its stand-out features, org-mode, which comes with multi-language org-babel support. This allows me to document my benchmarking while scripting up the steps in a variety of “languages” depending on my needs at the time. Let’s take a look at the first section:

    1 Binaries To Test

    Here we have several tables of binaries to test. We refer to the
    current benchmarking set from the next stage, Run Benchmark.

    For a final test we might compare the system QEMU with a reference
    build as well as our current build.

    | Binary                                                                        | title            |
    |-------------------------------------------------------------------------------+------------------|
    | /usr/bin/qemu-aarch64                                                         | system-2.5.log   |
    | ~/lsrc/qemu/qemu-builddirs/arm-targets.build/aarch64-linux-user/qemu-aarch64  | master.log       |
    | ~/lsrc/qemu/qemu.git/aarch64-linux-user/qemu-aarch64                          | softfloat-v4.log |

    Well that is certainly fairly explanatory. These are named org-mode tables which can be referred to in other code snippets and passed in as variables. So the next job is to run the benchmark itself:

    2 Run Benchmark

    This runs the benchmark against each binary we have selected above.

        import subprocess
        import os

        runs = []

        # 'files' and 'tests' are passed in by org-babel from the tables/variables above
        for qemu, logname in files:
            cmd = "taskset -c 0 %s ./vector-benchmark -n %s | tee %s" % (qemu, tests, logname)
            subprocess.call(cmd, shell=True)
            runs.append(logname)

        return runs
    

    So why use python as the test runner? Well truth is whenever I end up munging arrays in shell script I forget the syntax and end up jumping through all sorts of hoops. Easier just to have some simple python. I use python again later to read the data back into an org-table so I can pass it to the next step, graphing:

    set title "Vector Benchmark Results (lower is better)"
    set style data histograms
    set style fill solid 1.0 border lt -1
    
    set xtics rotate by 90 right
    set yrange [:]
    set xlabel noenhanced
    set ylabel "nsecs/Kop" noenhanced
    set xtics noenhanced
    set ytics noenhanced
    set boxwidth 1
    set xtics format ""
    set xtics scale 0
    set grid ytics
    set term pngcairo size 1200,500
    
    plot for [i=2:5] data using i:xtic(1) title columnhead
    

    This is a GNU Plot script which takes the data and plots an image from it. org-mode takes care of the details of marshalling the table data into GNU Plot so all this script is really concerned with is setting styles and titles. The language is capable of some fairly advanced stuff but I could always pre-process the data with something else if I needed to.

    Finally I need to upload my graph to an image hosting service to share with my colleagues. This can be done with an elaborate curl command but I have another trick at my disposal thanks to the excellent restclient-mode. This mode is actually designed for interactive debugging of REST APIs but it is also easy to use from an org-mode source block. So the whole thing looks like an HTTP session:

    :client_id = feedbeef
    
    # Upload images to imgur
    POST https://api.imgur.com/3/image
    Authorization: Client-ID :client_id
    Content-type: image/png
    
    < benchmark.png
    

    Finally because the above dumps all the headers when run (which is very handy for debugging) I actually only want the URL in most cases. I can do this simply enough in elisp:

    #+name: post-to-imgur
    #+begin_src emacs-lisp :var json-string=upload-to-imgur()
      (when (string-match
             (rx "link" (one-or-more (any "\":" whitespace))
                 (group (one-or-more (not (any "\"")))))
             json-string)
        (match-string 1 json-string))
    #+end_src
    

    The :var line calls the restclient-mode function automatically and passes it the result which it can then extract the final URL from.

    And there you have it, my entire benchmarking workflow documented in a single file which I can read through, tweaking each step as I go. This isn’t the first time I’ve done this sort of thing. As I use org-mode extensively as a logbook to keep track of my upstream work I’ve slowly grown a series of scripts for common tasks. For example every patch series and pull request I post is done via org. I keep the whole thing in a git repository so each time I finish a sequence I can commit the results into the repository as a permanent record of what steps I ran.

    If you want even more inspiration I suggest you look at John Kitchen’s scimax work. As a publishing scientist he makes extensive use of org-mode when writing his papers. He is able to include the main prose with the code to plot the graphs and tables in a single source document from which his camera ready documents are generated. Should he ever need to reproduce any work his exact steps are all there in the source document. Yet another example of why org-mode is awesome 😉

    by Alex at February 21, 2018 08:34 PM

    February 19, 2018

    Marcin Juszkiewicz

    Hotplug in VM. Easy to say…

    You run a VM instance. No matter whether it is part of an OpenStack setup or just a local one started using Boxes, virt-manager, virsh or another frontend of that kind to the libvirt daemon. And then you want to add some virtual hardware to it. And another card and one more controller…

    An easy to imagine scenario, right? What can go wrong, you say? A “No more available PCI slots.” message can happen. On the second or third card/controller… But how? Why?

    Like I wrote in one of my previous posts, most VM instances are 90s PC hardware virtual boxes, with a simple PCI bus which accepts several cards being added/removed at any moment.

    But not on the AArch64 architecture. Nor on x86-64 with the Q35 machine type. What is the difference? Both are PCI Express machines. And by default they have far too few PCIe slots (called pcie-root-port in qemu/libvirt language). More about PCI Express support can be found in the PCI topology and hotplug page of the libvirt documentation.

    So I wrote a patch to Nova to make sure that enough slots will be available. And then started testing. I tried a few different approaches, discussed ways of solving the problem with upstream libvirt developers and finally we selected the one and only proper way of doing it. Then I discussed failures with UEFI developers. And went to the Qemu authors for help. And explained what I want to achieve and why to everyone in each of those four projects. At some point I had seen pcie-root-port things everywhere…

    It turned out that the method of fixing it is kind of simple: we have to create the whole PCIe structure with root ports and slots, as sketched below. This tells libvirt not to try any automatic adding of slots (which may be tricky if not configured properly, as you may end up with too few slots for basic addons).
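
    At the qemu level the idea roughly amounts to preallocating a handful of hotpluggable root ports up front (ids and chassis numbers are illustrative; with the patch, libvirt/Nova generate the equivalent configuration for you):

    qemu-system-aarch64 -M virt ... \
        -device pcie-root-port,id=port1,chassis=1 \
        -device pcie-root-port,id=port2,chassis=2 \
        -device pcie-root-port,id=port3,chassis=3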

    Then I went with the idea of using insane values. A VM with one hundred PCIe slots? Sure. So I made one, booted it and then something weird happened: I landed in the UEFI shell instead of getting the system booted. Why? How? Where is my storage? Network? Etc?

    It turns out that Qemu has limits. And libvirt has limits… All ports/slots went into one bus and the memory for MMCONFIG and/or I/O space was gone. There are two interesting threads about it on the qemu-devel mailing list.

    So I added a magic number into my patch: 28 — this amount of pcie-root-port entries in my aarch64 VM instance was giving me a bootable system. I still have to check it on an x86-64/q35 setup but it should be more or less the same. I expect this patch to land in ‘Rocky’ (the next OpenStack release) and I will probably have to find a way to get it into ‘Queens’ as well because this is what we are planning to use for the next edition of Linaro Developer Cloud.

    Conclusion? Hotplug may be complicated. But issues with it can be solved.

    by Marcin Juszkiewicz at February 19, 2018 06:06 PM

    Cornelia Huck

    Notes on PCI on s390x

    As QEMU 2.12 will finally support PCI devices under s390x/tcg, I thought now is a good time to talk about some of the peculiarities of PCI on the mainframe.

    Oddities of PCI on s390x architecture

    Oddity #1: No MMIO, but instructions

    Everywhere else, you use MMIO when interacting with PCI devices. Not on s390x; you have a set of instructions instead. For example, if you want to read or write memory, you will need to use the PCILG or PCISTG instructions, and for refreshing translations, you will need to use the RPCIT instructions. Fortunately, these instructions can be matched to the primitives in the Linux kernel; unfortunately, all those instructions are privileged, which leads us to

    Oddity #2: No user space I/O

    As any interaction with PCI devices needs to be done through privileged instructions, Linux user space can't interact with the devices directly; the Linux kernel needs to be involved in every case. This means that there are none of the PCI user space implementations popular on other platforms available on s390x.

    Oddity #3: No topology, but FID and UID

    Usually, you'll find busses, slots and functions when you identify a certain PCI function. The PCI instructions on s390x, however, don't expose any topology to the caller. This means that an operating system will get a simple list of functions, with a function id (FID) that can be mapped to a physical slot and an UID, which the Linux kernel will map to a domain number. A PCI identifier under Linux on s390x will therefore always be of the form <domain>:00:00.0.

    Implications for the QEMU implementation of PCI on s390x

    In order to support PCI on s390x in QEMU, some specialties had to be implemented.

    Instruction handlers

    Under KVM, every PCI instruction is intercepted and routed to user space. QEMU does the heavy lifting of emulating the operations and mapping to generic PCI code. This also implied that PCI under tcg did not work until the instructions had been wired up; this has now finally happened and will be in the 2.12 release.

    Modelling and (lack of) topology

    QEMU PCI code expects the normal topology present on other platforms. However, this (made-up) topology will be invisible to guests, as the PCI instructions do not relay it. Instead, there is a special "zpci" device with "fid" and "uid" properties that can be linked to a normal PCI device. If no zpci device is specified, QEMU will autogenerate the FID and the UID.

    How can I try this out?

    If you do not have a real mainframe with a real PCI card, you can use virtio-pci devices as of QEMU 2.12 (or current git as of the time of this writing). If you do have a mainframe and a PCI card, you can use vfio-pci (but not yet via libvirt).
    Here's an example of how to specify a virtio-net-pci device for s390x, using tcg:
    s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on (...) -device zpci,uid=12,fid=2,target=vpci02,id=zpci2 -device virtio-net-pci,id="vpci02",addr=0x2
    Some notes on this:
    • You need to explicitly enable the "zpci" feature in the qemu cpu model. Two other features, "aen" and "ais", are enabled by default ("aen" and "zpci" are mandatory, "ais" is needed for Linux guest kernels prior to 4.15. If you use KVM, the host kernel also needs support for ais.)
    • The zpci device is joined with the PCI device via the "target" property. The virtio-net-pci device does not know anything about zpci devices.
    • Only virtio-pci devices using MSI-X will work on s390x.
    In the guest, this device will show up in lspci -v as
    000c:00:00.0 Ethernet controller: Red Hat, Inc. Virtio network device
    Subsystem: Red Hat, Inc. Device 0001
    Physical Slot: 00000002
     Note how the uid of 12 shows up as domain 000c and the fid of 2 as physical slot 00000002.

    by Cornelia Huck (noreply@blogger.com) at February 19, 2018 01:49 PM

    KVM on Z

    RHEL 7.5 Beta supports KVM on Z

    The Red Hat Enterprise Linux 7.5 Beta ships with support for KVM on Z through the kernel-alt packages. This will essentially ship Linux kernel 4.14.
    Here is the respective section from the release notes:
    KVM virtualization is now supported on IBM z Systems. However, this feature is only available in the newly introduced user space based on kernel version 4.14, provided by the kernel-alt packages.
    See here for further details.

    by Stefan Raspl (noreply@blogger.com) at February 19, 2018 08:47 AM

    February 14, 2018

    QEMU project

    QEMU 2.11.1 and making use of Spectre/Meltdown mitigation for KVM guests

    A previous post detailed how QEMU/KVM might be affected by Spectre/Meltdown attacks, and what the plan was to mitigate them in QEMU 2.11.1 (and eventually QEMU 2.12).

    QEMU 2.11.1 is now available, and contains the aforementioned mitigation functionality for x86 guests, along with additional mitigation functionality for pseries and s390x guests (ARM guests do not currently require additional QEMU patches). However, enabling this functionality requires additional configuration beyond just updating QEMU, which we want to address with this post.

    Please note that QEMU/KVM has at least the same requirements as other unprivileged processes running on the host with regard to Spectre/Meltdown mitigation. What is being addressed here is enabling a guest operating system to enable the same (or similar) mitigations to protect itself from unprivileged guest processes running under the guest operating system. Thus, the patches/requirements listed here are specific to that goal and should not be regarded as the full set of requirements to enable mitigations on the host side (though in some cases there is some overlap between the two with regard to required patches/etc).

    Also please note that this is a best-effort from the QEMU/KVM community, and these mitigations rely on a mix of additional kernel/firmware/microcode updates that are in some cases not yet available publicly, or may not yet be implemented in some distros, so users are highly encouraged to consult with their respective vendors/distros to confirm whether all the required components are in place. We do our best to highlight the requirements here, but this may not be an exhaustive list.

    Enabling mitigation features for x86 KVM guests

    Note: these mitigations are known to cause some performance degradation for certain workloads (whether used on host or guest), and for some Intel architectures alternative solutions like retpoline-based kernels may be available which may provide similar levels of mitigation with reduced performance impact. Please check with your distro/vendor to see what options are available to you.

    For x86 guests there are 2 additional CPU flags associated with Spectre/Meltdown mitigation: spec-ctrl, and ibpb:

    • spec-ctrl: exposes Indirect Branch Restricted Speculation (IBRS)
    • ibpb: exposes Indirect Branch Prediction Barriers

    These flags expose additional functionality made available through new microcode updates for certain Intel/AMD processors that can be used to mitigate various attack vectors related to Spectre. (Meltdown mitigation via KPTI does not require additional CPU functionality or microcode, and does not require an updated QEMU, only the related guest/host kernel patches).

    Utilizing this functionality requires guest/host kernel updates, as well as microcode updates for Intel and recent AMD processors. The status of these kernel patches upstream is still in flux, but most supported distros have some form of the patches that is sufficient to make use of the functionality. The current status/availability of microcode updates depends on your CPU architecture/model. Please check with your vendor/distro to confirm these prerequisites are available/installed.

    Generally, for Intel CPUs with updated microcode, spec-ctrl will enable both IBRS and IBPB functionality. For AMD EPYC processors, ibpb can be used to enable IBPB specifically, and is thought to be sufficient by itself for that particular architecture.

    These flags can be set in a similar manner as other CPU flags, i.e.:

    qemu-system-x86_64 -cpu qemu64,+spec-ctrl,... ...
    qemu-system-x86_64 -cpu IvyBridge,+spec-ctrl,... ...
    qemu-system-x86_64 -cpu EPYC,+ibpb,... ...
    etc...
    

    Additionally, for management stacks that lack support for setting specific CPU flags, a set of new CPU types have been added which enable the appropriate CPU flags automatically:

    qemu-system-x86_64 -cpu Nehalem-IBRS ...
    qemu-system-x86_64 -cpu Westmere-IBRS ...
    qemu-system-x86_64 -cpu SandyBridge-IBRS ...
    qemu-system-x86_64 -cpu IvyBridge-IBRS ...
    qemu-system-x86_64 -cpu Haswell-IBRS ...
    qemu-system-x86_64 -cpu Haswell-noTSX-IBRS ...
    qemu-system-x86_64 -cpu Broadwell-IBRS ...
    qemu-system-x86_64 -cpu Broadwell-noTSX-IBRS ...
    qemu-system-x86_64 -cpu Skylake-Client-IBRS ...
    qemu-system-x86_64 -cpu Skylake-Server-IBRS ...
    qemu-system-x86_64 -cpu EPYC-IBPB ...
    

    With these settings enabled, guests may still require additional configuration to enable IBRS/IBPB, which may vary somewhat from one distro to another. For RHEL guests, the following resource may be useful:

    With regard to migration compatibility, spec-ctrl/ibpb (or the corresponding CPU type) should be set the same on both source/target to maintain compatibility. Thus, guests will need to be rebooted to make use of the new features.

    Enabling mitigation features for pseries KVM guests

    For pseries guests there are 3 tri-state -machine options/capabilities relating to Spectre/Meltdown mitigation: cap-cfpc, cap-sbbc, cap-ibs, which each correspond to a set of host machine capabilities advertised by the KVM kernel module in new/patched host kernels that can be used to mitigate various aspects of Spectre/Meltdown:

    • cap-cfpc: Cache Flush on Privilege Change
    • cap-sbbc: Speculation Barrier Bounds Checking
    • cap-ibs: Indirect Branch Serialisation

    Each option can be set to one of “broken”, “workaround”, or “fixed”, which correspond, respectively, to instructing the guest whether the host is vulnerable, has OS-level workarounds available, or has hardware/firmware that does not require OS-level workarounds. Based on these options, QEMU will perform checks to validate whether the specified settings are available on the current host and pass these settings on to the guest kernel. At a minimum, any setting other than “broken” will require a host kernel that has some form of the following patches:

    commit 3214d01f139b7544e870fc0b7fcce8da13c1cb51
    KVM: PPC: Book3S: Provide information about hardware/firmware CVE workarounds
    
    commit 191eccb1580939fb0d47deb405b82a85b0379070
    powerpc/pseries: Add H_GET_CPU_CHARACTERISTICS flags & wrapper
    

    and whether a host will support “workaround” and “fixed” settings for each option will depend on the hardware/firmware level of the host system.

    In turn, to make use of “workaround” or “fixed” settings for each option, the guest kernel will require at least the following set of patches:

    These are available upstream and have been backported to a number of stable kernels. Please check with your vendor/distro to confirm the required hardware/firmware and guest kernel patches are available/installed.

    All three options, cap-cfpc, cap-sbbc, and cap-ibs default to “broken” to maintain compatibility with previous versions of QEMU and unpatched host kernels. To enable them you must start QEMU with the desired mitigation strategy specified explicitly. For example:

    qemu-system-ppc64 ... \
      -machine pseries-2.11,cap-cfpc=workaround,cap-sbbc=workaround,cap-ibs=fixed
    

    With regard to migration compatibility, setting any of these features to a value other than “broken” will require an identical setting for that option on the source/destination guest. To enable these settings your guests will need to be rebooted at some point.

    Enabling mitigation features for s390x KVM guests

    For s390x guests there are 2 CPU feature bits relating to Spectre/Meltdown:

    • bpb: Branch prediction blocking
    • ppa15: PPA15 is installed

    bpb requires a host kernel patched with:

    commit 35b3fde6203b932b2b1a5b53b3d8808abc9c4f60
    KVM: s390: wire up bpb feature
    

    and both bpb and ppa15 require a firmware with the appropriate support level as well as guest kernel patches to enable the functionality within guests. Please check with your distro/vendor to confirm.

    Both bpb and ppa15 are enabled by default when using “-cpu host” and when the host kernel supports these facilities. For other CPU models, the flags have to be set manually. For example:

    qemu-system-s390x -M s390-ccw-virtio-2.11 ... \
      -cpu zEC12,bpb=on,ppa15=on
    

    With regard to migration, enabling bpb or ppa15 feature flags requires that the source/target also has those flags enabled. Since this is enabled by default for ‘-cpu host’ (when available on the host), you must ensure that bpb=off,ppa15=off is used if you wish to maintain migration compatibility with existing guests when using ‘-cpu host’, or take steps to reboot guests with bpb/ppa15 enabled prior to migration.
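
    For example, a sketch of keeping such a guest migration-compatible while still using the host model:

    qemu-system-s390x -M s390-ccw-virtio-2.11 ... \
      -cpu host,bpb=off,ppa15=off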

    by Michael Roth at February 14, 2018 04:35 PM

    February 13, 2018

    Gerd Hoffmann

    local vgpu display – status update

    After a looong time things are finally landing upstream. Both the vfio API update for vgpu local display support and the support for it in the intel drm driver have been merged during the 4.16 merge window. It should not take too long to get the qemu changes merged upstream now; if all goes well the next release (2.12) will have it.

    Tina Zhang wrote a nice blog article with both technical background and instructions for those who want to play with it.

    by Gerd Hoffmann at February 13, 2018 07:54 AM

    February 09, 2018

    QEMU project

    Understanding QEMU devices

    Here are some notes that may help newcomers understand what is actually happening with QEMU devices:

    With QEMU, one thing to remember is that we are trying to emulate what an Operating System (OS) would see on bare-metal hardware. Most bare-metal machines are basically giant memory maps, where software poking at a particular address will have a particular side effect (the most common side effect is, of course, accessing memory; but other common regions in memory include the register banks for controlling particular pieces of hardware, like the hard drive or a network card, or even the CPU itself). The end-goal of emulation is to allow a user-space program, using only normal memory accesses, to manage all of the side-effects that a guest OS is expecting.

    As an implementation detail, some hardware, like x86, actually has two memory spaces, where I/O space uses different assembly instructions than normal; QEMU has to emulate these alternative accesses. Similarly, many modern CPUs expose a bank of CPU-local registers within the memory map, such as for an interrupt controller.

    With certain hardware, we have virtualization hooks where the CPU itself makes it easy to trap on just the problematic assembly instructions (those that access I/O space or CPU internal registers, and therefore require side effects different from a normal memory access), so that the guest executes the same assembly sequence as on bare metal, but that execution traps into user-space QEMU, which reacts to the instructions using just its normal user-space memory accesses before returning control to the guest. This is supported in QEMU through “accelerators”.

    Virtualizing accelerators, such as KVM, can let a guest run nearly as fast as bare metal, where the slowdowns are caused by each trap from guest back to QEMU (a vmexit) to handle a difficult assembly instruction or memory address. QEMU also supports other virtualizing accelerators (such as HAXM or macOS’s Hypervisor.framework).

    QEMU also has a TCG accelerator, which takes the guest assembly instructions and compiles them on the fly into comparable host instructions or calls to host helper routines; while not as fast as hardware acceleration, it allows cross-hardware emulation, such as running ARM code on x86.
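
    As a quick illustration (not part of the original post; the image and kernel file names are placeholders), the accelerator can be selected on the QEMU command line:

    # hardware-assisted virtualization with KVM (on a Linux host)
    qemu-system-x86_64 -accel kvm -m 2048 -drive file=guest.img,format=raw

    # pure TCG emulation, e.g. running an AArch64 guest on an x86 host
    qemu-system-aarch64 -accel tcg -M virt -cpu cortex-a57 -m 1024 \
      -kernel Image -append "console=ttyAMA0" -nographic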

    The next thing to realize is what is happening when an OS is accessing various hardware resources. For example, most operating systems ship with a driver that knows how to manage an IDE disk - the driver is merely software that is programmed to make specific I/O requests to a specific subset of the memory map (wherever the IDE bus lives, which is specific to the hardware board). When the IDE controller hardware receives those I/O requests it then performs the appropriate actions (via DMA transfers or other hardware action) to copy data from memory to persistent storage (writing to disk) or from persistent storage to memory (reading from the disk).

    When you first buy bare-metal hardware, your disk is uninitialized; you install the OS that uses the driver to make enough bare-metal accesses to the IDE hardware portion of the memory map to then turn the disk into a set of partitions and filesystems on top of those partitions.

    So, how does QEMU emulate this? In the big memory map it provides to the guest, it emulates an IDE disk at the same address as bare-metal would. When the guest OS driver issues particular memory writes to the IDE control registers in order to copy data from memory to persistent storage, the QEMU accelerator traps accesses to that memory region, and passes the request on to the QEMU IDE controller device model. The device model then parses the I/O requests, and emulates them by issuing host system calls. The result is that guest memory is copied into host storage.

    On the host side, the easiest way to emulate persistent storage is via treating a file in the host filesystem as raw data (a 1:1 mapping of offsets in the host file to disk offsets being accessed by the guest driver), but QEMU actually has the ability to glue together a lot of different host formats (raw, qcow2, qed, vhdx, …) and protocols (file system, block device, NBD, Ceph, gluster, …) where any combination of host format and protocol can serve as the backend that is then tied to the QEMU emulation providing the guest device.

    Thus, when you tell QEMU to use a host qcow2 file, the guest does not have to know qcow2, but merely has its normal driver make the same register reads and writes as it would on bare metal, which cause vmexits into QEMU code, then QEMU maps those accesses into reads and writes in the appropriate offsets of the qcow2 file. When you first install the guest, all the guest sees is a blank uninitialized linear disk (regardless of whether that disk is linear in the host, as in raw format, or optimized for random access, as in the qcow2 format); it is up to the guest OS to decide how to partition its view of the hardware and install filesystems on top of that, and QEMU does not care what filesystems the guest is using, only what pattern of raw disk I/O register control sequences are issued.
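
    To make this concrete (an example added here, not from the original post; the file names are made up), the host image can be created with qemu-img and then handed to the guest either as an emulated IDE disk or as a virtio disk:

    # create a 10 GB qcow2 image on the host
    qemu-img create -f qcow2 guest-disk.qcow2 10G

    # attach it to the guest as an emulated IDE disk ...
    qemu-system-x86_64 -accel kvm -m 2048 \
      -drive file=guest-disk.qcow2,format=qcow2,if=ide

    # ... or as a virtio disk (the guest needs a virtio-blk driver)
    qemu-system-x86_64 -accel kvm -m 2048 \
      -drive file=guest-disk.qcow2,format=qcow2,if=virtio

    Either way the guest just sees a blank 10 GB disk; only the host knows whether the bytes live in a raw or qcow2 file.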

    The next thing to realize is that emulating IDE is not always the most efficient way to do I/O. Every time the guest writes to the control registers, it has to go through special handling, and vmexits slow down emulation. Of course, different hardware models have different performance characteristics when virtualized. In general, however, what works best for real hardware does not necessarily work best for virtualization, and until recently, hardware was not designed to operate fast when emulated by software such as QEMU. Therefore, QEMU includes paravirtualized devices that are designed specifically for this purpose.

    The meaning of “paravirtualization” here is slightly different from the original one of “virtualization through cooperation between the guest and host”. The QEMU developers have produced a specification for a set of hardware registers and the behavior for those registers which are designed to result in the minimum number of vmexits possible while still accomplishing what a hard disk must do, namely, transferring data between normal guest memory and persistent storage. This specification is called virtio; using it requires installation of a virtio driver in the guest. While no physical device exists that follows the same register layout as virtio, the concept is the same: a virtio disk behaves like a memory-mapped register bank, where the guest OS driver then knows what sequence of register commands to write into that bank to cause data to be copied in and out of other guest memory. Much of the speedup in virtio comes from its design - the guest sets aside a portion of regular memory for the bulk of its command queue, and only has to kick a single register to then tell QEMU to read the command queue (fewer mapped register accesses mean fewer vmexits), coupled with handshaking guarantees that the guest driver won’t be changing the normal memory while QEMU is acting on it.

    As an aside, just like recent hardware is fairly efficient to emulate, virtio is evolving to be more efficient to implement in hardware, of course without sacrificing performance for emulation or virtualization. Therefore, in the future, you could stumble upon physical virtio devices as well.

    In a similar vein, many operating systems have support for a number of network cards, a common example being the e1000 card on the PCI bus. On bare metal, an OS will probe PCI space, see that a bank of registers with the signature for e1000 is populated, and load the driver that then knows what register sequences to write in order to let the hardware card transfer network traffic in and out of the guest. So QEMU has, as one of its many network card emulations, an e1000 device, which is mapped to the same guest memory region as a real one would live on bare metal.

    And once again, the e1000 register layout tends to require a lot of register writes (and thus vmexits) for the amount of work the hardware performs, so the QEMU developers have added the virtio-net card (a PCI hardware specification, although no bare-metal hardware exists yet that actually implements it), such that installing a virtio-net driver in the guest OS can then minimize the number of vmexits while still getting the same side-effects of sending network traffic. If you tell QEMU to start a guest with a virtio-net card, then the guest OS will probe PCI space and see a bank of registers with the virtio-net signature, and load the appropriate driver like it would for any other PCI hardware.
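
    For example (an illustration added here, not from the original post; the user-mode networking backend is just one possible choice), the same guest can be given either an emulated e1000 card or a virtio-net card:

    # emulated e1000 NIC
    qemu-system-x86_64 ... \
      -netdev user,id=net0 -device e1000,netdev=net0

    # paravirtualized virtio-net NIC (the guest needs a virtio-net driver)
    qemu-system-x86_64 ... \
      -netdev user,id=net0 -device virtio-net-pci,netdev=net0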

    In summary, even though QEMU was first written as a way of emulating hardware memory maps in order to virtualize a guest OS, it turns out that the fastest virtualization also depends on virtual hardware: a memory map of registers with particular documented side effects that has no bare-metal counterpart. And at the end of the day, all virtualization really means is running a particular set of assembly instructions (the guest OS) to manipulate locations within a giant memory map for causing a particular set of side effects, where QEMU is just a user-space application providing a memory map and mimicking the same side effects you would get when executing those guest instructions on the appropriate bare metal hardware.

    (This post is a slight update on an email originally posted to the qemu-devel list back in July 2017).

    by Eric Blake at February 09, 2018 07:30 PM

    Eduardo Otubo

    QEMU Sandboxing for dummies

    DevConf is an annual conference that takes place in Brno, Czech Republic. This year I applied for a talk to go over the work I have been developing since mid 2012: Security on QEMU/KVM Virtual Machines using SECCOMP. Since then I have become the maintainer of this feature in QEMU and released the second, improved version not long ago. In this post you'll find the slides and the full video of the presentation.


    QEMU Sandboxing for dummies by Eduardo Otubo

    So here we go, my very first experience lecturing in English, and what a catastrophe! In my defense the audience was very peculiar: not only was the person who started libseccomp there, but also my manager and the director of the department. Anxiety aside, I think it was an outstanding experience and I would do it again in the future. :-)



    by noreply@blogger.com (Eduardo Otubo) at February 09, 2018 04:17 PM

    Daniel Berrange

    The Fedora virtualization software archive (aka virt-ark)

    With libvirt releasing 11 times a year and QEMU releasing three times a year, there is quite a large set of historical releases available by now. Both projects have a need to maintain compatibility across releases in varying areas. For QEMU the most important thing is that versioned machine types present the same guest ABI across releases, i.e. a ‘pc-2.0.0’ machine on QEMU 2.0.0 should be identical to a ‘pc-2.0.0’ machine on QEMU 2.5.0. If this rule is violated, the ability to live migrate and save/restore is doomed. For libvirt the most important thing is that a given guest configuration should be usable across many QEMU versions, even if the command line arguments required to achieve the configuration in QEMU have changed. This is key to libvirt’s promise that upgrading either libvirt or QEMU will not break either previously running guests or future operations by the management tool. Finally, management applications using libvirt may promise that they’ll operate with any version of libvirt or QEMU from a given starting version onwards. This is key to ensuring a management application can be used on a wide range of distros each with different libvirt/QEMU versions. To achieve this the application must be confident it hasn’t unexpectedly made use of a feature that doesn’t exist in a previous version of libvirt/QEMU that is intended to be supported.

    The key to all this is of course automated testing. Libvirt keeps a record of capabilities associated with each QEMU version in its GIT repo along with various sample pairs of XML files and QEMU arguments. This is good for unit testing, but there’s some stuff that’s only really practical to validate well by running functional tests against each QEMU release. For live migration compatibility, it is possible to produce reports specifying the guest ABI for each machine type, on each QEMU version and compare them for differences. There are a huge number of combinations of command line args that affect ABI though, so it is useful to actually have the real binaries available for testing, even if only to dynamically generate the reports.

    The COPR repository

    With the background motivation out of the way, lets get to the point of this blog post. A while ago I created a Fedora copr repository that contained many libvirt builds. These were created in a bit of a hacky way making it hard to keep it up to date as new releases of libvirt come out, or as new Fedora repos need to be targeted. So in the past week, I’ve done a bit of work to put this on a more sustainable footing and also integrate QEMU builds.

    As a result, there is now a copr repo called ‘virt-ark‘ that currently targets Fedora 26 and 27, containing every QEMU version since 1.4.0 and every libvirt version since 1.2.0. That is 46 versions of libvirt dating back to Dec 2013, and 36 versions of QEMU dating back to Feb 2013. For QEMU I included all bugfix releases, which is why there are so many when there are only 3 major releases a year compared to libvirt’s 11 major versions a year.

    # rpm -qa | grep -E '(libvirt|qemu)-ark' | sort
    libvirt-ark-1_2_0-1.2.0-1.x86_64
    libvirt-ark-1_2_10-1.2.10-2.fc27.x86_64
    libvirt-ark-1_2_11-1.2.11-2.fc27.x86_64
    ...snip...
    libvirt-ark-3_8_0-3.8.0-2.fc27.x86_64
    libvirt-ark-3_9_0-3.9.0-2.fc27.x86_64
    libvirt-ark-4_0_0-4.0.0-2.fc27.x86_64
    qemu-ark-1_4_0-1.4.0-3.fc27.x86_64
    qemu-ark-1_4_1-1.4.1-3.fc27.x86_64
    qemu-ark-1_4_2-1.4.2-3.fc27.x86_64
    ...snip....
    qemu-ark-2_8_1-2.8.1-3.fc27.x86_64
    qemu-ark-2_9_0-2.9.0-2.fc27.x86_64
    qemu-ark-2_9_1-2.9.1-3.fc27.x86_64
    
    

    Notice how the package name includes the version string. Each package version installs into /opt/$APP/$VERSION, eg /opt/libvirt/1.2.0 or /opt/qemu/2.4.0, so you can have them all installed at once and happily co-exist.

    Using the custom versions

    To launch a particular version of libvirtd

    $ sudo /opt/libvirt/1.2.20/sbin/libvirtd

    The libvirt builds store all their configuration in /opt/libvirt/$VERSION/etc/libvirt and create UNIX sockets in /opt/libvirt/$VERSION/var/run, so they will (mostly) not conflict with the main Fedora-installed libvirt. As a result, though, you need to use the corresponding virsh binary to connect to it:

    $ /opt/libvirt/1.2.20/bin/virsh

    To test building or running another app against this version of libvirt set some environment variables

    export PKG_CONFIG_PATH=/opt/libvirt/1.2.20/lib/pkgconfig
    export LD_LIBRARY_PATH=/opt/libvirt/1.2.20/lib
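
    With those variables set, building and running a small test program against the chosen libvirt might look like this (test.c is just a placeholder name for your own code):

    $ gcc -o test test.c $(pkg-config --cflags --libs libvirt)
    $ ./test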
    

    For libvirtd to pick up a custom QEMU version, it must appear in $PATH before the QEMU from /usr when libvirtd is started, e.g.

    $ su -
    # export PATH=/opt/qemu/2.0.0/bin:$PATH
    # /opt/libvirt/1.2.20/sbin/libvirtd

    Alternatively just pass in the custom QEMU binary path in the guest XML (if the management app being tested supports that).
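
    For example (a rough sketch; this assumes the x86_64 system emulator binary name, and whether it is honoured depends on the management app being tested), the guest XML can point at a specific QEMU binary via the <emulator> element:

    <domain type='kvm'>
      ...
      <devices>
        <emulator>/opt/qemu/2.0.0/bin/qemu-system-x86_64</emulator>
        ...
      </devices>
    </domain>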

    The build mechanics

    When managing so many different versions of a software package you don’t want to be doing lots of custom work to create each one. Thus I have tried to keep everything as simple as possible. There is a Pagure hosted GIT repo containing the source for the builds. There are libvirt-ark.spec.in and qemu-ark.spec.in RPM specfile templates which are used for every version. No attempt is made to optimize the dependencies for each version, instead BuildRequires will just be the union of dependencies required across all versions. To keep build times down, for QEMU only the x86_64 architecture system emulator is enabled. In future I might enable the system emulators for other architectures that are commonly used (ppc, arm, s390), but won’t be enabling all the other ones QEMU has. The only trouble comes when newer Fedora releases include a change which breaks the build. This has happened a few times for both libvirt and QEMU. The ‘patches/‘ subdirectory thus contains a handful of patches acquired from upstream GIT repos to fix the builds. Essentially though I can run

    $  make copr APP=libvirt ARCHIVE_FMT=xz DOWNLOAD_URL=https://libvirt.org/sources/ VERSION=1.3.0

    Or

    $  make copr APP=qemu ARCHIVE_FMT=xz DOWNLOAD_URL=https://download.qemu.org/ VERSION=2.6.0

    And it will download the pristine upstream source, write a spec file including any patches found locally, create a src.rpm and upload this to the copr build service. I’ll probably automate this a little more in future to avoid having to pass so many args to make, by keeping a CSV file with all metadata for each version.

     

    by Daniel Berrange at February 09, 2018 12:12 PM

    February 06, 2018

    Marcin Juszkiewicz

    Graphical console in OpenStack/aarch64

    OpenStack users are used to having a graphical console available. They even take it for granted. But we lacked it…

    When we started working on OpenStack on 64-bit ARM there were many things missing. Most of them have been sorted out already. One thing was still in the queue: the graphical console. So two weeks ago I started looking at the issue.

    Whenever someone tried to use it, Nova reported one simple message: “No free USB ports.” You may ask what that has to do with the console? I wondered the same and started digging…

    As usual the reason was simple: yet another aarch64<>x86 difference in libvirt. It turned out that arm64 is one of those few architectures which do not get a USB host controller in the default setup. When Nova is told to provide a graphical console it adds Spice (or VNC) graphics, a video card and a USB tablet. But in our case the VM instance does not have any USB ports, so the VM start fails with the “No free USB ports” message.

    The solution was simple: let’s add a USB host controller to the VM instance. But where? Should libvirt do that or should Nova? I discussed it with libvirt developers and got mixed answers. I opened a bug for it and went off to hack on Nova.

    It turned out that the Nova code for building the guest configuration is not that complicated. I created a patch to add a USB host controller and waited for code reviews. There were many suggestions, opinions and questions. So I rewrote the code. And then again. And again. Finally the 15th version got a “looks good” from all reviewers and was merged.
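
    In libvirt terms the end result is roughly a USB controller element in the guest XML, something like the sketch below (the exact controller model Nova chooses may differ):

    <devices>
      <controller type='usb' index='0' model='qemu-xhci'/>
      <input type='tablet' bus='usb'/>
      <graphics type='vnc' autoport='yes'/>
    </devices>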

    And how does the result look? Boring, as it should be.

    by Marcin Juszkiewicz at February 06, 2018 07:16 PM

    QEMU project

    Presentations from DevConf and FOSDEM 2018

    During the past two weeks, there were two important conferences for Open Source developers in Europe, where you could also enjoy some QEMU related presentations. The following QEMU-related talks were held at the DevConf 2018 conference in Brno:

    And at the FOSDEM 2018 in Brussels, you could listen to the following QEMU related talks:

    More virtualization related talks can be found in the schedule from the DevConf and in the schedule from the FOSDEM.

    by Thomas Huth at February 06, 2018 04:00 PM

    February 01, 2018

    Marcin Juszkiewicz

    Everyone loves 90s PC hardware?

    Do you know what the most popular PC machine is nowadays? It is a “simple” PCI-based x86(-64) machine with the i440fx chipset and some expansion cards. Everyone is using them daily. Often you do not even realise that the things you do online are handled by machines from the 90s.

    Sounds like heresy? Who would use 90s hardware in the modern world? PCI cards were supposed to be replaced by PCI Express, right? No one uses USB 1.x host controllers, and it is hard to find a PS/2 mouse or keyboard in stores. And yet you would all be wrong…

    Most virtual machines in the x86 world are a weird mix of 90s hardware with a bunch of PCI cards to keep them usable in the modern world. Parallel ATA storage went to the trash, replaced by SCSI/SATA/virtual ones. The graphics card is usually a simple framebuffer without any 3D acceleration (so, like the 90s) with typical PS/2 input devices connected. And you have USB 1.1 and 2.0 controllers with one tablet connected. Sounds like the retro machines my friends prepare for retro events.

    You can upgrade to a USB 3.0 controller, or a graphics card with some 3D acceleration. Or add more memory and CPU cores than any i440fx-based PC owner ever dreamt of. But it is still 90s hardware.

    Want something more modern? You can migrate to PCI Express. But nearly no one does that in the x86 world. And in the AArch64 world we start from here.

    And that’s the reason why working with developers of projects related to virtualization (qemu, libvirt, openstack) can be frustrating.

    Hotplug issues? Which hotplug issues? My VM instance lets me plug in 10 cards while it is running, so where is the problem? The problem is that your machine is 90s hardware with a simple PCI bus and 31 slots present on the virtual mainboard, while a VM instance with PCI Express (aarch64, or x86 with the q35 model) has only TWO free slots present on the motherboard. And once they are used, no new slots appear. Unless you shut down, add free slots and power up again.
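
    To illustrate (an example added here, not from the original post): with a PCI Express machine type, hotplug-capable slots have to be pre-allocated as root ports when the VM is defined, for instance on the QEMU command line:

    qemu-system-x86_64 -M q35 ... \
      -device pcie-root-port,id=port1,chassis=1 \
      -device pcie-root-port,id=port2,chassis=2 \
      -device pcie-root-port,id=port3,chassis=3

    Each pcie-root-port provides one slot that a device can later be hotplugged into; once they are all occupied, adding more means shutting the guest down first.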

    Or take my recent work: adding a USB host controller. x86 has it because someone ‘made a mistake’ in the past and enabled it. But other architectures got sorted out and did not get it, so now all of them need to have it added when it is needed. I have a patch to Nova for it and I am rewriting it again and again to get it into an acceptable form.

    Partitioning is fun too. There are so many people who fear switching to GPT…

    by Marcin Juszkiewicz at February 01, 2018 09:40 AM

    January 26, 2018

    Stefan Hajnoczi

    How to modify kernel modules without recompiling the whole Linux kernel

    Do you need to recompile your Linux kernel in order to make a change to a module? What if you just want to try to fix a small bug while running a distro kernel package?

    The Linux kernel source tree is large and rebuilding from scratch is a productivity killer. This article covers how to make changes to kernel modules without rebuilding the entire kernel.

    I need to preface this by saying that I don't know if this is a "best practice". Maybe there are better ways but here is what I've been using recently.

    Step by step

    In most cases you can safely modify one or more kernel modules without rebuilding the whole kernel. Follow these steps:

    1. Get the kernel sources

    Download the kernel source tree corresponding to your current kernel version. How to get the kernel sources for the exact kernel package version you are currently running depends on your Linux distribution. On Fedora do the following:

    $ dnf download --source kernel # or specify the exact kernel-X.Y.Z-R package you need
    kernel-4.14.14-300.fc27.src.rpm 1.9 MB/s | 98 MB 00:50
    $ rpmbuild -rp kernel-4.14.14-300.fc27.src.rpm
    $ cd ~/rpmbuild/BUILD/kernel-4.14.fc27/linux-4.14.14-300.fc27.x86_64/

    If you can't figure out how to get the corresponding kernel sources, use uname -r to find the kernel version and grab the vanilla sources from git. This will work as long as the kernel package you are running hasn't been patched too heavily by the package maintainers:

    $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
    $ cd linux-stable
    $ uname -r
    4.14.14-300.fc27.x86_64
    $ git checkout v4.14.14 # let's hope this is close to what we're running!

    2. Get the kernel config file

    It is critical that you use the same .config file as your distro as some configuration options will build incompatible kernel modules. All will be good if your .config file matches your kernel's configuration, so grab it from /boot:

    $ cp /boot/config-$(uname -r) .config
    $ make oldconfig # this shouldn't produce much output

    3. Set the version string

    Kernel module versioning relies on a version string that is compiled into each kernel module. If the version string does not match your kernel's version then the module cannot be loaded. Be sure to set CONFIG_LOCALVERSION to match uname -r in the .config file:

    $ uname -r # we only want the stuff after the X.Y.Z version number
    4.14.14-300.fc27.x86_64
    $ sed -i 's/^CONFIG_LOCALVERSION=.*$/CONFIG_LOCALVERSION="-300.fc27.x86_64"/' .config

    4. Build your modules

    Use the out-of-tree build syntax to compile just the modules you need. In this example let's rebuild drivers/virtio modules:

    $ make modules_prepare
    $ make -j4 M=drivers/virtio modules # or whatever directory you want

    5. Install and copy your modules

    It can be useful to install the modules in a staging directory so they can be copied to remote machines or installed locally:

    $ mkdir /tmp/staging
    $ make M=drivers/virtio INSTALL_MOD_PATH=/tmp/staging modules_install
    $ scp /tmp/staging/lib/modules/4.14.14-300.fc27.x86_64/extra/* root@remote-host:/lib/modules/4.14.14-300.fc27.x86_64/kernel/drivers/virtio/

    Beware that some distros ship compressed kernel modules. Set CONFIG_MODULE_COMPRESS_XZ=y in the .config file to get .ko.xz files, for example.

    6. Reload modules or reboot the test machine

    Now that the new modules are in /lib/modules/... it's time to load them. If the old modules are currently loaded you may be able to rmmod them after terminating processes that rely on those modules. Then load the new modules using modprobe. If the old modules cannot be unloaded because the system depends on them, you need to reboot.

    If the modules you modified are loaded during early boot, you'll need to rebuild the initramfs. Make sure you have a backup initramfs in case the system fails to boot!
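
    On Fedora, for example, regenerating the initramfs for the running kernel could look like this (an illustration; the exact tooling depends on your distro):

    $ sudo dracut --force /boot/initramfs-$(uname -r).img $(uname -r)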

    Caveats

    This approach has limitations that mean it's mostly useful for debugging and development. For quality assurance testing it is better to follow a full build process that produces the same output that end users will install.

    Here are some things to be aware of:

    • Don't make .config changes unless you are sure they are compatible with the running kernel.
    • Do not introduce new module dependencies since this approach doesn't rebuild dependency information.
    • Do not change exported symbols if other kernel modules depend on the code you are changing, unless you also rebuild the modules that depend on yours.
    • Your modified modules will not be cryptographically signed and will taint the kernel if your distro kernel package is signed.

    What happens if things go wrong? Either you'll get an error when attempting to load the kernel module, or you might just get an oops when there is a crash due to ABI breakage.

    Conclusion

    This may seem like a long process but it's faster than recompiling a full kernel from scratch. Once you've got it working you can keep modifying code and rebuilding from Step 3.

    by stefanha (noreply@blogger.com) at January 26, 2018 09:43 AM

    January 19, 2018

    Eduardo Otubo

    Xen synchronicity between frontend and backend devices

    So I bumped into a problem last month and it took me quite some time to figure out the big picture, since I didn't find much documentation about it. Most of the help I could find came from good people, mostly maintainers, on the #xendevel channel on Freenode. So if you want to understand a little bit of Xen beyond the existing documentation, that's the place to ask.

    The problem is the following: I'm running RHEL on the Xen hypervisor, and whenever I try to unload and reload the xen_netfront kernel module I see output like this in dmesg:
    # modprobe -r xen_netfront

    # dmesg|tail
    [ 105.236836] xen:grant_table: WARNING: g.e. 0x903 still in use!
    [ 105.236839] deferring g.e. 0x903 (pfn 0x35805)
    [ 105.237156] xen:grant_table: WARNING: g.e. 0x904 still in use!
    [ 105.237160] deferring g.e. 0x904 (pfn 0x35804)
    [ 105.237163] xen:grant_table: WARNING: g.e. 0x905 still in use!
    [ 105.237166] deferring g.e. 0x905 (pfn 0x35803)
    [ 105.237545] xen:grant_table: WARNING: g.e. 0x906 still in use!
    [ 105.237550] deferring g.e. 0x906 (pfn 0x35802)
    [ 105.237553] xen:grant_table: WARNING: g.e. 0x907 still in use!
    [ 105.237556] deferring g.e. 0x907 (pfn 0x35801)

    Moreover, the interface is not usable either:

    # dmesg|tail
    [ 105.237163] xen:grant_table: WARNING: g.e. 0x905 still in use!
    [ 105.237166] deferring g.e. 0x905 (pfn 0x35803)
    [ 105.237545] xen:grant_table: WARNING: g.e. 0x906 still in use!
    [ 105.237550] deferring g.e. 0x906 (pfn 0x35802)
    [ 105.237553] xen:grant_table: WARNING: g.e. 0x907 still in use!
    [ 105.237556] deferring g.e. 0x907 (pfn 0x35801)
    [ 160.050882] xen_netfront: Initialising Xen virtual ethernet driver
    [ 160.066937] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
    [ 160.067270] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
    [ 160.069355] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready

    # ifconfig eth0
    eth0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
    ether 00:00:00:00:00:00 txqueuelen 1000 (Ethernet)
    RX packets 0 bytes 0 (0.0 B)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 0 bytes 0 (0.0 B)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

    # ifconfig eth0 up
    SIOCSIFFLAGS: Cannot assign requested address

    The first problem happens because the backend part of the module (xen_netback) is still using some pieces of memory (g.e. stands for grant entries) that are shared between guest and host. The ideal scenario would be to wait for the netback to free those entries and only then unload the netfront module. This was actually a bug in the synchronization between the netfront and netback parts.

    The state of the drivers is kept in separate structs, as defined in include/xen/xenbus.h:69:

    /* A xenbus device. */
    struct xenbus_device {
            const char *devicetype;
            const char *nodename;
            const char *otherend;
            int otherend_id;
            struct xenbus_watch otherend_watch;
            struct device dev;
            enum xenbus_state state;
            struct completion down;
            struct work_struct work;
    };

    And the netfront state can be seen from the hypervisor with the command:

    # xenstore-ls -fp
    [...]
    /local/domain/1/device/vif/0/state = "4" (n1,r0)
    [...]

    The number 4 indicates XenbusStateConnected (as defined in include/xen/interface/io/xenbus.h:17). So everything is a matter of waiting for one end to finish using the shared memory region before the other frees it; this first piece of the puzzle is solved by the following patch:

    diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
    index 8b8689c6d887..391432e2725d 100644
    --- a/drivers/net/xen-netfront.c
    +++ b/drivers/net/xen-netfront.c
    @@ -87,6 +87,8 @@ struct netfront_cb {
     /* IRQ name is queue name with "-tx" or "-rx" appended */
     #define IRQ_NAME_SIZE (QUEUE_NAME_SIZE + 3)

    +static DECLARE_WAIT_QUEUE_HEAD(module_unload_q);
    +
     struct netfront_stats {
             u64 packets;
             u64 bytes;
    @@ -2021,10 +2023,12 @@ static void netback_changed(struct xenbus_device *dev,
                     break;

             case XenbusStateClosed:
    +                wake_up_all(&module_unload_q);
                     if (dev->state == XenbusStateClosed)
                             break;
                     /* Missed the backend's CLOSING state -- fallthrough */
             case XenbusStateClosing:
    +                wake_up_all(&module_unload_q);
                     xenbus_frontend_closed(dev);
                     break;
             }
    @@ -2130,6 +2134,20 @@ static int xennet_remove(struct xenbus_device *dev)

             dev_dbg(&dev->dev, "%s\n", dev->nodename);

    +        if (xenbus_read_driver_state(dev->otherend) != XenbusStateClosed) {
    +                xenbus_switch_state(dev, XenbusStateClosing);
    +                wait_event(module_unload_q,
    +                           xenbus_read_driver_state(dev->otherend) ==
    +                           XenbusStateClosing);
    +
    +                xenbus_switch_state(dev, XenbusStateClosed);
    +                wait_event(module_unload_q,
    +                           xenbus_read_driver_state(dev->otherend) ==
    +                           XenbusStateClosed ||
    +                           xenbus_read_driver_state(dev->otherend) ==
    +                           XenbusStateUnknown);
    +        }
    +
             xennet_disconnect_backend(info);

             unregister_netdev(info->netdev);

    The second piece of the problem is that the interface is not usable when the module is loaded back. That happens because the device state is not initialized, so the backend does not notice the frontend and hence never connects the two drivers (frontend and backend) together. This was easily solved by the following patch:

    diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
    index c5a34671abda..9bd7ddeeb6a5 100644
    --- a/drivers/net/xen-netfront.c
    +++ b/drivers/net/xen-netfront.c
    @@ -1326,6 +1326,7 @@ static struct net_device *xennet_create_dev(struct xenbus_device *dev)

             netif_carrier_off(netdev);

    +        xenbus_switch_state(dev, XenbusStateInitialising);
             return netdev;

     exit:

    by noreply@blogger.com (Eduardo Otubo) at January 19, 2018 10:08 AM

    January 12, 2018

    Marcin Juszkiewicz

    YADIBP

    Let me introduce new awesome project: YADIBP. It is cool, foss, awesome, the one and only and invented here instead of there. And it does exactly what it has to do and in a way it has to be done. Truly cool and awesome.

    Using that tool you can build disk images with several supported Linux distributions. Or maybe even with any BSD distribution. And Haiku or ReactOS. Patches for AmigaOS exist too!

    Any architecture. Starting from 128 bit wide RUSC-VI to antique architectures like ia32 or m88k as long as you have either hardware or qemu port (patches for ARM fast models in progress).

    Just fetch from git and use. Written in BASIC so it should work everywhere. And if you lack BASIC interpreter then you can run it as Python or Erlang. Our developers are so cool and awesome!

    But let’s get back to reality: there are gazillions of projects for a tool which does one simple thing: build a disk image. And a gazillion more will still be written, because some people have that “Not Invented Here” syndrome.

    And I am getting tired of it.

    by Marcin Juszkiewicz at January 12, 2018 01:34 PM

    January 05, 2018

    QEMU project

    QEMU and the Spectre and Meltdown attacks

    As you probably know by now, three critical architectural flaws in CPUs have been recently disclosed that allow user processes to read kernel or hypervisor memory through cache side-channel attacks. These flaws, collectively named Meltdown and Spectre, affect in one way or another almost all processors that perform out-of-order execution, including x86 (from Intel and AMD), POWER, s390 and ARM processors.

    No microcode updates are required to block the Meltdown attack. In addition, the Meltdown flaw does not allow a malicious guest to read the contents of hypervisor memory. Fixing it only requires that the operating system separates the user and kernel address spaces (known as page table isolation for the Linux kernel), which can be done separately on the host and the guests. Therefore, this post will focus on Spectre, and especially on CVE-2017-5715.

    Fixing or mitigating Spectre in general, and CVE-2017-5715 in particular, requires cooperation between the processor and the operating system kernel or hypervisor; the processor can be updated through microcode or millicode patches to provide the required functionality.

    Among the three vulnerabilities, CVE-2017-5715 is notable because it allows guests to read potentially sensitive data from hypervisor memory. Patching the host kernel is sufficient to block attacks from guests to the host. On the other hand, in order to protect the guest kernel from a malicious userspace, updates are also needed to the guest kernel and, depending on the processor architecture, to QEMU.

    Just like on bare-metal, the guest kernel will use the new functionality provided by the microcode or millicode updates. When running under a hypervisor, processor emulation is mostly out of QEMU’s scope, so QEMU’s role in the fix is small, but nevertheless important. In the case of KVM:

    • QEMU configures the hypervisor to emulate a specific processor model. For x86, QEMU has to be aware of new CPUID bits introduced by the microcode update, and it must provide them to guests depending on how the guest is configured.

    • upon virtual machine migration, QEMU reads the CPU state on the source and transmits it to the destination. For x86, QEMU has to be aware of new model specific registers (MSRs).

    Right now, there are no public patches to KVM that expose the new CPUID bits and MSRs to the virtual machines, therefore there is no urgent need to update QEMU; remember that updating the host kernel is enough to protect the host from malicious guests. Nevertheless, updates will be posted to the qemu-devel mailing list in the next few days, and a 2.11.1 patch release will be released with the fix.

    Once updates are provided, live migration to an updated version of QEMU will not be enough to protect the guest kernel from guest userspace. Because the virtual CPU has to be changed to one with the new CPUID bits, the guest will have to be restarted.
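
    Purely as an illustration of what that will look like (not part of the original announcement; spec-ctrl is the x86 CPUID bit exposed by the later QEMU updates, and the CPU model shown is just an example), an updated guest would be started with the new feature requested explicitly:

    qemu-system-x86_64 -accel kvm ... \
      -cpu Skylake-Client,+spec-ctrl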

    As of today, the QEMU project is not aware of whether similar changes will be required for non-x86 processors. If so, they will also be posted to the mailing list and backported to recent stable releases.

    For more information on the vulnerabilities, please refer to the Google Security Blog and Google Project Zero posts on the topic, as well as the Spectre and Meltdown FAQ.

    5 Jan 2018: clarified the level of protection provided by the host kernel update; added a note on live migration; clarified the impact of Meltdown on virtualization hosts

    by Paolo Bonzini and Eduardo Habkost at January 05, 2018 02:00 PM
