Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools


Planet Feeds

March 24, 2017

Cole Robinson

Easy qemu commandline passthrough with virt-xml

Libvirt has supported qemu commandline option passthrough for qemu/kvm VMs for quite a while. The format for it is a bit of a pain though since it requires setting a magic xmlns value at the top of the domain XML. Basically doing it by hand kinda sucks.

In the recently released virt-manager 1.4.1, we added a virt-install/virt-xml option --qemu-commandline that tweaks option passthrough for new or existing VMs. So for example, if you wanted to add the qemu option string '-device FOO' to an existing VM named f25, you can do:

  ./virt-xml f25 --edit --confirm --qemu-commandline="-device FOO"

The output will look like:

--- Original XML
+++ Altered XML
@@ -1,4 +1,4 @@
-<domain type="kvm">
+<domain xmlns:qemu="" type="kvm">
<memory unit="KiB">4194304</memory>
@@ -104,4 +104,8 @@
<address type="pci" domain="0x0000" bus="0x00" slot="0x0a" function="0x0"/>
+ <qemu:commandline>
+ <qemu:arg value="-device"/>
+ <qemu:arg value="FOO"/>
+ </qemu:commandline>

Define 'f25' with the changed XML? (y/n):
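
To make the mapping from option string to XML concrete, here is a rough sketch of the idea in shell. The helper name qemu_args_to_xml is made up for illustration, and it assumes a naive split on whitespace; virt-xml's real parsing may well be smarter about quoting.

```shell
# Hypothetical helper (not part of virt-xml): emit one <qemu:arg> element
# per whitespace-separated word of the option string.
# Assumption: no quoted values containing spaces.
qemu_args_to_xml() {
    set -f                      # disable globbing while we word-split
    printf '<qemu:commandline>\n'
    for word in $1; do
        printf '  <qemu:arg value="%s"/>\n' "$word"
    done
    printf '</qemu:commandline>\n'
    set +f
}

qemu_args_to_xml "-device FOO"
```

Each word of the option string ends up as its own qemu:arg element, which is the same shape as the diff shown above.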

by Cole Robinson at March 24, 2017 10:30 PM

March 19, 2017

QEMU project

QEMU in the blogs: February 2017

Here is a short list of articles and blog posts about QEMU and KVM that were posted last month.

More virtualization blog posts can be found on the virt tools planet.

In other news, QEMU is now in hard freeze for release 2.9.0. The preliminary list of features is on the wiki.

by Paolo Bonzini at March 19, 2017 09:30 AM

March 12, 2017

Richard Jones

Tip: Run virt-inspector on a compressed disk (with nbdkit)

virt-inspector is a very convenient tool to examine a disk image and find out if it contains an operating system, what applications are installed and so on.

If you have an xz-compressed disk image, you can run virt-inspector on it without uncompressing it, using the magic of captive nbdkit. Here’s how:

nbdkit xz file=win7.img.xz \
    -U - \
    --run 'virt-inspector --format=raw -a nbd://?socket=$unixsocket'

What’s happening here is we run nbdkit with the xz plugin, and tell it to serve NBD over a randomly named Unix domain socket (-U -).

We then run virt-inspector as a sub-process. This is called “captive nbdkit”. (Nbdkit is “captive” here, because it will exit as soon as virt-inspector exits, so there’s no need to clean anything up.)

The $unixsocket variable expands to the name of the randomly generated Unix domain socket, forming a libguestfs NBD URL which allows virt-inspector to examine the raw uncompressed data exported by nbdkit.

The nbdkit xz plugin only uncompresses those blocks of the data which are actually accessed, so this is quite efficient.

by rich at March 12, 2017 03:44 PM

March 08, 2017

Cole Robinson

virt-manager 1.4.1 released!

I've just released virt-manager 1.4.1. The highlights are:
  • storage/nodedev event API support (Jovanka Gulicoska)
  • UI options for enabling spice GL (Marc-André Lureau)
  • Add default virtio-rng /dev/urandom for supported guest OS
  • Cloning and rename support for UEFI VMs (Pavel Hrdina)
  • libguestfs inspection UI improvements (Pino Toscano)
  • virt-install: Add --qemu-commandline
  • virt-install: Add --network vhostuser (Chen Hanxiao)
  • virt-install: Add --sysinfo (Charles Arnold)
Plus the usual slew of bug fixes and small improvements.

    by Cole Robinson at March 08, 2017 07:15 PM

    Cédric Bosdonnat

    System container images

    As of today, creating a root file system for a libvirt LXC system container is a pain. Docker's fun came with its image sharing idea... why couldn't we do the same for libvirt containers? Here is an attempt at this.

    To achieve such a goal we need:

    • container images
    • something to share them
    • a tool to pull and use them

    Container images

    Thanks to kiwi, the Open Build Service knows how to create images, even container images. There are even openSUSE Docker images. To use them as system container images, a few more packages need to be added to them. I thus forked the project on github and branched the OBS projects to get system container images for 42.1, 42.2 and Tumbleweed.

    Using them is as simple as downloading them, unpacking them and using them as a container's root file system. However, sharing them would be so much fun!

    Sharing images

    There is no need to reinvent the wheel to share the images. We can just consider them like any docker image. With the following commands we can import the image and push it to a remote registry.

    docker import openSUSE-42.2-syscontainer-guest-docker.x86_64.tar.xz system/opensuse-42.2
    docker tag system/opensuse-42.2 myregistry:5000/system/opensuse-42.2
    docker login myregistry:5000
    docker push myregistry:5000/system/opensuse-42.2

    The good thing with this is that we can even use the docker build and Dockerfile magic to create customized images and push them to the remote repository.

    Instantiating containers

    Now we need a tool to get the images from the remote docker registry. Fortunately there is a tool that helps a lot with this: skopeo. I wrote a small virt-bootstrap tool using it to instantiate the images as root file systems.

    Here is what instantiating a container looks like with it:

    virt-bootstrap --username myuser \
                   --root-password test \
                   docker://myregistry:5000/system/opensuse-42.2 /path/to/my/container
    virt-install --connect lxc:/// -n 422 --memory 250 --vcpus 1 \
                    --filesystem /path/to/my/container,/ \
                    --filesystem /etc/resolv.conf,/etc/resolv.conf \
                    --network network=default

    And voila! Creating an openSUSE 42.2 system container and running it with libvirt is now super easy!

    by Cédric Bosdonnat at March 08, 2017 03:37 PM

    March 07, 2017

    Zeeshan Ali Khattak

    GDP meets GSoC

    Are you a student? Passionate about Open Source? Want your code to run on the next generation of automobiles? You're in luck! Genivi Development Platform will be participating in Google Summer of Code this summer and you are welcome to participate. We have collected a bunch of ideas for what would be a good 3 month project for a student, but you're more than welcome to suggest your own project. The ideas page also has instructions on how to get started with GDP.

    We look forward to your participation!

    by zeenix at March 07, 2017 06:11 PM

    March 06, 2017

    Eduardo Habkost

    The long story of the query-cpu-model-expansion QEMU interface

    So, finally the query-cpu-model-expansion x86 implementation was merged into qemu.git, just before 2.9 soft freeze. Jiri Denemark already implemented the x86 libvirt code to use it. I just can’t believe this was finally done after so many years.

    It was a weird journey. It started almost 6 years ago with this message to qemu-devel:

    Date: Fri, 10 Jun 2011 18:36:37 -0300
    Subject: semantics of “-cpu host” and “check”/”enforce”

    …it continued on an interesting thread:

    Date: Tue, 6 Mar 2012 15:27:53 -0300
    Subject: Qemu, libvirt, and CPU models

    …on another very long one:

    Date: Fri, 9 Mar 2012 17:56:52 -0300
    Subject: Re: [Qemu-devel] [libvirt] Modern CPU models cannot be used with libvirt

    …and this one:

    Date: Thu, 21 Feb 2013 11:58:18 -0300
    Subject: libvirt<->QEMU interfaces for CPU models

    I don’t even remember how many different interfaces were proposed to provide what libvirt needed.

    We had a few moments where we hopped back and forth between “just let libvirt manage everything” and “let’s keep this managed by QEMU”.

    We took a while to get the QEMU community to decide how machine-type compatibility was supposed to be handled, and what to do with the weird CPU model config file we had.

    The conversion of CPUs to QOM was fun. I think it started in 2012 and was finished only in 2015. We thought QOM properties would solve all our problems, but then we found out that machine-types and global properties make the problem more complex. The existing interfaces would require making libvirt re-run QEMU multiple times to gather all the information it needed. While doing the QOM work, we spent some time fixing or working around issues with global properties, qdev “static” properties and QOM “dynamic” properties.

    In 2014, my focus was moved to machine-types, in the hope that we could finally expose machine-type-specific information to libvirt without re-running QEMU. Useful code refactoring was done for that, but in the end we never added the really relevant information to the query-machines QMP command.

    In the meantime, we had the fun TSX issues, and QEMU developers finally agreed to keep a few constraints on CPU model changes, that would make the problem a bit simpler.

    In 2015 IBM people started sending patches related to CPU models in s390x. We finally had a multi-architecture effort to make CPU model probing work. The work started by extending query-cpu-definitions, but it was not enough. In June 2016 they proposed a query-cpu-model-expansion API. It was finally merged in September 2016.

    I sent v1 of query-cpu-model-expansion for x86 in December 2016. After a few rounds of reviews, there was a proposal to use “-cpu max” to represent the “all features supported by this QEMU binary on this host”. v3 of the series was merged last week.

    I still can’t believe it finally happened.

    Special thanks to:

    • Igor Mammedov, for all the x86 QOM/properties work and all the valuable feedback.
    • David Hildenbrand and Michael Mueller, for moving forward the API design and the s390x implementation.
    • Jiri Denemark, for the libvirt work, valuable discussions and design feedback, and for the patience during the process.
    • Daniel P. Berrangé, for the valuable feedback and for helping make QEMU developers listen to libvirt developers.
    • Andreas Färber, for the work as maintainer of QOM and the CPU core, for leading the QOM conversion effort, and all the valuable feedback.
    • Markus Armbruster and Paolo Bonzini, for valuable feedback on design discussions.
    • Many others that were involved in the effort.

    by Eduardo Habkost at March 06, 2017 05:00 PM

    Video and slides for FOSDEM 2017 talk: QEMU Internal APIs

    The slides and videos for my FOSDEM 2017 talk (QEMU: internal APIs and conflicting world views) are available online.

    The subject I tried to cover is large for a 40-minute talk, but I think I managed to scratch its surface and give useful examples.

    Many thanks to the FOSDEM team of volunteers for the wonderful event.

    by Eduardo Habkost at March 06, 2017 03:00 AM

    February 24, 2017

    Ladi Prosek

    Running Hyper-V in a QEMU/KVM Guest

    This article provides a how-to on setting up nested virtualization, in particular running Microsoft Hyper-V as a guest of QEMU/KVM. The usual terminology is going to be used in the text: L0 is the bare-metal host running Linux with KVM and QEMU. L1 is L0’s guest, running Microsoft Windows Server 2016 with the Hyper-V role enabled. And L2 is L1’s guest, a virtual machine running Linux, Windows, or anything else. Only Intel hardware is considered here. It is possible that the same can be achieved with AMD’s hardware virtualization support but it has not been tested yet.

    A quick note on performance. Since the Intel VMX technology does not directly support nested virtualization in hardware, what L1 perceives as hardware-accelerated virtualization is in fact software emulation of VMX by L0. Thus, workloads will inevitably run slower in L2 compared to L1.

    Kernel / KVM

    A fairly recent kernel is required for Hyper-V on QEMU/KVM to function properly. The first commit known to work is 1dc35da, available in Linux 4.10 and newer.

    Nested Intel virtualization must be enabled. If the following command does not return “Y”, kvm-intel.nested=1 must be passed to the kernel as a parameter.

    $ cat /sys/module/kvm_intel/parameters/nested
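
The same check can be wrapped in a small script. This is only a sketch: the helper name nested_enabled is hypothetical, and treating both "Y" and "1" as "enabled" is an assumption made to cover kernels that print the parameter differently.

```shell
# Hypothetical helper: succeed if the parameter file reports nesting enabled.
# Assumption: "Y" and "1" both mean enabled; anything else means disabled.
nested_enabled() {
    param_file=${1:-/sys/module/kvm_intel/parameters/nested}
    [ -r "$param_file" ] || return 1
    case "$(cat "$param_file")" in
        Y|1) return 0 ;;
        *)   return 1 ;;
    esac
}

if nested_enabled; then
    echo "nested virtualization is enabled"
else
    echo "nested virtualization is off: pass kvm-intel.nested=1 to the kernel"
fi
```

The optional argument lets the function be pointed at a different file, which is handy for trying it out on a machine without the kvm_intel module loaded.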


    QEMU 2.7 should be enough to make nested virtualization work. As always, it is advisable to use the latest stable version available. SeaBIOS version 1.10 or later is required.

    The QEMU command line must include the +vmx cpu feature, for example:

    -cpu SandyBridge,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time,+vmx

    If QEMU warns about the vmx feature not being available on the host, nested virt has likely not been enabled in KVM (see the previous paragraph).


    Once the Windows L1 guest is installed, add the Hyper-V role as usual. Only Windows Server 2016 is known to support nested virtualization at the moment.

    If Windows complains about missing HW virtualization support, re-check QEMU and SeaBIOS versions. If the Hyper-V role is already installed and nested virt is misconfigured or not supported, the error shown by Windows tends to mention “Hyper-V components not running” like in the following screenshot.

    If everything goes well, both Gen 1 and Gen 2 Hyper-V virtual machines can be created and started. Here’s a screenshot of Windows XP 64-bit running as a guest in Windows Server 2016, which itself is a guest in QEMU/KVM.

    by ladipro at February 24, 2017 01:57 PM

    February 22, 2017

    Gerd Hoffmann

    vconsole 0.7 released

    vconsole is a virtual machine (serial) console manager, look here for details.

    No big changes from 0.6.

    Fetch the tarball here.

    by Gerd Hoffmann at February 22, 2017 08:15 AM

    February 21, 2017

    Gerd Hoffmann

    Fedora 25 images for qemu and raspberry pi 3 uploaded

    I’ve uploaded three new images to

    The fedora-25-rpi3 image is for the raspberry pi 3.
    The fedora-25-efi images are for qemu (virt machine type with edk2 firmware).

    The images don’t have a root password set. You must use libguestfs-tools to set the root password …

    virt-customize -a <image> --root-password "password:<your-password-here>"

    … otherwise you can’t login after boot.

    The rpi3 image is partitioned similarly to the official (armv7) fedora 25 images: the firmware is on a separate vfat partition that holds only firmware and uboot (mounted at /boot/fw). /boot is an ext2 filesystem now and holds the kernels only. Well, for compatibility reasons with the f24 images (all images use the same kernel rpms) firmware files are in /boot too, but they are not used. So, if you want to tweak something in config.txt, go to /boot/fw, not /boot.

    The rpi3 images also have swap commented out in /etc/fstab. The reason is that the swap partition must be reinitialized, because swap partitions generated when running on a 64k pages kernel (CONFIG_ARM64_64K_PAGES=y) are not compatible with 4k pages (CONFIG_ARM64_4K_PAGES=y). This must be fixed by running “swapon --fixpgsz <device>” once; then you can uncomment the swap line in /etc/fstab.

    by Gerd Hoffmann at February 21, 2017 08:32 AM

    February 16, 2017

    Daniel Berrange

    Setting up a nested KVM guest for developing & testing PCI device assignment with NUMA

    Over the past few years the OpenStack Nova project has gained support for managing VM usage of NUMA, huge pages and PCI device assignment. One of the more challenging aspects of this is the availability of hardware to develop and test against. In an ideal world it would be possible to emulate everything we need using KVM, enabling developers / test infrastructure to exercise the code without needing access to bare metal hardware supporting these features. KVM has long had support for emulating NUMA topology in guests, and a guest OS can use huge pages inside the guest. What was missing were pieces around PCI device assignment, namely IOMMU support and the ability to associate NUMA nodes with PCI devices.

    Coincidentally, a QEMU community member was already working on providing emulation of the Intel IOMMU. I made a request to the Red Hat KVM team to fill in the other missing gap related to NUMA / PCI device association. Doing this required writing code to emulate a PCI/PCI-E Expander Bridge (PXB) device, which provides a lightweight host bridge that can be associated with a NUMA node. Individual PCI devices are then attached to this PXB instead of the main PCI host bridge, thus gaining affinity with a NUMA node.

    With this, it is now possible to configure a KVM guest such that it can be used as a virtual host to test NUMA, huge page and PCI device assignment integration. The only real outstanding gap is support for emulating some kind of SRIOV network device, but even without this, it is still possible to test most of the Nova PCI device assignment logic; we’re merely restricted to using physical functions, with no virtual functions. This blog post will describe how to configure such a virtual host.

    First of all, this requires very new libvirt & QEMU to work, specifically you’ll want libvirt >= 2.3.0 and QEMU 2.7.0. We could technically support earlier QEMU versions too, but that’s pending on a patch to libvirt to deal with some command line syntax differences in QEMU for older versions. No currently released Fedora has new enough packages available, so even on Fedora 25, you must enable the “Virtualization Preview” repository on the physical host to try this out – F25 has new enough QEMU, so you just need a libvirt update.

    # curl --output /etc/yum.repos.d/fedora-virt-preview.repo
    # dnf upgrade

    For the sake of illustration I’m using Fedora 25 as the OS inside the virtual guest, but any other Linux OS will do just fine. The initial task is to install a guest with 8 GB of RAM & 8 CPUs using virt-install

    # cd /var/lib/libvirt/images
    # wget -O f25x86_64-boot.iso
    # virt-install --name f25x86_64  \
        --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
        --cdrom f25x86_64-boot.iso --os-type fedora23 \
        --ram 8000 --vcpus 8 \

    The guest needs to use host CPU passthrough to ensure it gets to see VMX, as well as other modern instructions, and it needs to have 3 virtual NUMA nodes. The first guest NUMA node will have 4 CPUs and 4 GB of RAM, while the second and third NUMA nodes will each have 2 CPUs and 2 GB of RAM. We are just going to let the guest float freely across host NUMA nodes since we don’t care about performance for dev/test, but in production you would certainly pin each guest NUMA node to a distinct host NUMA node.

        --cpu host,cell0.cpus=0-3,cell0.memory=4096000,cell1.cpus=4-5,cell1.memory=2048000,cell2.cpus=6-7,cell2.memory=2048000 \
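
As a sanity check on the numbers, the sketch below rebuilds the --cpu value from the three cells described in the text and compares the sum of the cell memory with the --ram value. The helper names build_cpu_arg and sum_cell_memory are invented for this example, and it assumes the cell memory values are in KiB while --ram is in MiB.

```shell
# Hypothetical helper: build the --cpu value from "cpus:memory_kib" cell specs.
build_cpu_arg() {
    arg="host"
    n=0
    for cell in "$@"; do
        cpus=${cell%%:*}        # part before the colon, e.g. 0-3
        mem=${cell##*:}         # part after the colon, e.g. 4096000
        arg="$arg,cell$n.cpus=$cpus,cell$n.memory=$mem"
        n=$((n + 1))
    done
    printf '%s\n' "$arg"
}

# Hypothetical helper: add up the memory (assumed KiB) of all cell specs.
sum_cell_memory() {
    total=0
    for cell in "$@"; do
        total=$((total + ${cell##*:}))
    done
    printf '%s\n' "$total"
}

printf '%s\n' "--cpu $(build_cpu_arg 0-3:4096000 4-5:2048000 6-7:2048000)"
printf '%s\n' "cells: $(sum_cell_memory 0-3:4096000 4-5:2048000 6-7:2048000) KiB"
printf '%s\n' "--ram 8000: $((8000 * 1024)) KiB"
```

Under those unit assumptions the three cells add up to exactly the amount given to --ram, so no guest memory is left outside a NUMA node.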

    QEMU emulates various different chipsets and historically for x86, the default has been to emulate the ancient PIIX4 (it is 20+ years old, dating from circa 1995). Unfortunately this is too ancient to be able to use the Intel IOMMU emulation with, so it is necessary to tell QEMU to emulate the marginally less ancient chipset Q35 (it is only 9 years old, dating from 2007).

        --machine q35

    The complete virt-install command line thus looks like

    # virt-install --name f25x86_64  \
        --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
        --cdrom f25x86_64-boot.iso --os-type fedora23 \
        --ram 8000 --vcpus 8 \
        --cpu host,cell0.cpus=0-3,cell0.memory=4096000,cell1.cpus=4-5,cell1.memory=2048000,cell2.cpus=6-7,cell2.memory=2048000 \
        --machine q35

    Once the installation is completed, shut down this guest since it will be necessary to make a number of changes to the guest XML configuration to enable features that virt-install does not know about, using “virsh edit“. With the use of Q35, the guest XML should initially show three PCI controllers present, a “pcie-root”, a “dmi-to-pci-bridge” and a “pci-bridge”

    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='dmi-to-pci-bridge'>
      <model name='i82801b11-bridge'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
    </controller>
    <controller type='pci' index='2' model='pci-bridge'>
      <model name='pci-bridge'/>
      <target chassisNr='2'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </controller>

    PCI endpoint devices are not themselves associated with NUMA nodes, rather the bus they are connected to has affinity. The default pcie-root is not associated with any NUMA node, but extra PCI-E Expander Bridge controllers can be added and associated with a NUMA node. So while in edit mode, add the following to the XML config

    <controller type='pci' index='3' model='pcie-expander-bus'>
      <target busNr='180'>
        <node>0</node>
      </target>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='4' model='pcie-expander-bus'>
      <target busNr='200'>
        <node>1</node>
      </target>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='5' model='pcie-expander-bus'>
      <target busNr='220'>
        <node>2</node>
      </target>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>

    It is not possible to plug PCI endpoint devices directly into the PXB, so the next step is to add PCI-E root ports into each PXB. We’ll need one port per device to be added, so 9 ports in total. This is where the requirement for libvirt >= 2.3.0 comes from: earlier versions mistakenly prevented you from adding more than one root port to the PXB

    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='6' port='0x0'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='7' port='0x8'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='8' port='0x10'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='9' port='0x0'/>
      <alias name='pci.9'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='10' port='0x8'/>
      <alias name='pci.10'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='11' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='11' port='0x10'/>
      <alias name='pci.11'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='12' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='12' port='0x0'/>
      <alias name='pci.12'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='13' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='13' port='0x8'/>
      <alias name='pci.13'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='14' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='14' port='0x10'/>
      <alias name='pci.14'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x02' function='0x0'/>
    </controller>
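
Typing nine nearly identical controllers by hand is error prone, so one option is to generate them. The loop below is only a sketch that reproduces the numbering pattern used above (three ports per expander bus, port numbers stepping by 8); the function name gen_root_ports is made up, and the output still has to be pasted into “virsh edit” yourself.

```shell
# Hypothetical generator for the nine pcie-root-port controllers.
# pxb  = index of the parent pcie-expander-bus (3, 4, 5), used as the bus.
# slot = position of the port on that bus (0, 1, 2).
gen_root_ports() {
    index=6
    for pxb in 3 4 5; do
        for slot in 0 1 2; do
            printf "<controller type='pci' index='%d' model='pcie-root-port'>\n" "$index"
            printf "  <model name='ioh3420'/>\n"
            printf "  <target chassis='%d' port='0x%x'/>\n" "$index" $((slot * 8))
            printf "  <alias name='pci.%d'/>\n" "$index"
            printf "  <address type='pci' domain='0x0000' bus='0x%02x' slot='0x%02x' function='0x0'/>\n" "$pxb" "$slot"
            printf "</controller>\n"
            index=$((index + 1))
        done
    done
}

gen_root_ports
```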

    Notice that the value of the ‘bus‘ attribute on each <address> element matches the value of the ‘index‘ attribute on the <controller> element of the parent device in the topology. The PCI controller topology now looks like this

    pcie-root (index == 0)
      +- dmi-to-pci-bridge (index == 1)
      |    |
      |    +- pci-bridge (index == 2)
      +- pcie-expander-bus (index == 3, numa node == 0)
      |    |
      |    +- pcie-root-port (index == 6)
      |    +- pcie-root-port (index == 7)
      |    +- pcie-root-port (index == 8)
      +- pcie-expander-bus (index == 4, numa node == 1)
      |    |
      |    +- pcie-root-port (index == 9)
      |    +- pcie-root-port (index == 10)
      |    +- pcie-root-port (index == 11)
      +- pcie-expander-bus (index == 5, numa node == 2)
           +- pcie-root-port (index == 12)
           +- pcie-root-port (index == 13)
           +- pcie-root-port (index == 14)

    All the existing devices are attached to the “pci-bridge” (the controller with index == 2). The devices we intend to use for PCI device assignment inside the virtual host will be attached to the new “pcie-root-port” controllers. We will provide 3 e1000e NICs per NUMA node, so that’s 9 devices in total to add

    <interface type='user'>
      <mac address='52:54:00:7e:6e:c6'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:c7'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:c8'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:d6'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:d7'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:d8'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:e6'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0c' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:e7'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:e8'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0e' slot='0x00' function='0x0'/>
    </interface>
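
The NIC definitions follow a simple pattern too: one NIC per root port, with the bus attribute equal to the root port's controller index (6 to 14) written in hex. Here is a sketch of a generator for them. The name gen_nics is hypothetical, and the MAC addresses simply reuse the scheme from the listing above.

```shell
# Hypothetical generator for the nine e1000e NICs, one per root port.
# The bus attribute is the index of the parent pcie-root-port (6..14) in hex.
gen_nics() {
    bus=6
    for node in c d e; do           # one MAC prefix letter per NUMA node
        for n in 6 7 8; do
            printf "<interface type='user'>\n"
            printf "  <mac address='52:54:00:7e:6e:%s%s'/>\n" "$node" "$n"
            printf "  <model type='e1000e'/>\n"
            printf "  <address type='pci' domain='0x0000' bus='0x%02x' slot='0x00' function='0x0'/>\n" "$bus"
            printf "</interface>\n"
            bus=$((bus + 1))
        done
    done
}

gen_nics
```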

    Note that we’re using the “user” networking, aka SLIRP. Normally one would never want to use SLIRP but we don’t care about actually sending traffic over these NICs, and so using SLIRP avoids polluting our real host with countless TAP devices.

    The final configuration change is to simply add the Intel IOMMU device

    <iommu model='intel'/>

    It is a capability integrated into the chipset, so it does not need any <address> element of its own. At this point, save the config and start the guest once more. Use the “virsh domifaddr” command to discover the IP address of the guest’s primary NIC and ssh into it.

    # virsh domifaddr f25x86_64
     Name       MAC address          Protocol     Address
     vnet0      52:54:00:10:26:7e    ipv4
    # ssh root@

    We can now run some sanity checks that everything visible in the guest matches what was enabled in the libvirt XML config in the host. For example, confirm the NUMA topology shows 3 nodes

    # dnf install numactl
    # numactl --hardware
    available: 3 nodes (0-2)
    node 0 cpus: 0 1 2 3
    node 0 size: 3856 MB
    node 0 free: 3730 MB
    node 1 cpus: 4 5
    node 1 size: 1969 MB
    node 1 free: 1813 MB
    node 2 cpus: 6 7
    node 2 size: 1967 MB
    node 2 free: 1832 MB
    node distances:
    node   0   1   2 
      0:  10  20  20 
      1:  20  10  20 
      2:  20  20  10 

    Confirm that the PCI topology shows the three PCI-E Expander Bridge devices, each with three NICs attached

    # lspci -t -v
    -+-[0000:dc]-+-00.0-[dd]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           +-01.0-[de]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           \-02.0-[df]----00.0  Intel Corporation 82574L Gigabit Network Connection
     +-[0000:c8]-+-00.0-[c9]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           +-01.0-[ca]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           \-02.0-[cb]----00.0  Intel Corporation 82574L Gigabit Network Connection
     +-[0000:b4]-+-00.0-[b5]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           +-01.0-[b6]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           \-02.0-[b7]----00.0  Intel Corporation 82574L Gigabit Network Connection
     \-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
                 +-01.0  Red Hat, Inc. QXL paravirtual graphic card
                 +-02.0  Red Hat, Inc. Device 000b
                 +-03.0  Red Hat, Inc. Device 000b
                 +-04.0  Red Hat, Inc. Device 000b
                 +-1d.0  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
                 +-1d.1  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
                 +-1d.2  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
                 +-1d.7  Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
                 +-1e.0-[01-02]----01.0-[02]--+-01.0  Red Hat, Inc Virtio network device
                 |                            +-02.0  Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller
                 |                            +-03.0  Red Hat, Inc Virtio console
                 |                            +-04.0  Red Hat, Inc Virtio block device
                 |                            \-05.0  Red Hat, Inc Virtio memory balloon
                 +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
                 +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
                 \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller
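
If the bus numbers at the top of the tree look unfamiliar, note that lspci prints them in hex, while the busNr attribute in the libvirt XML is decimal. A quick conversion (busnr_to_hex is just an illustrative name) shows how the three expander buses line up:

```shell
# Illustrative helper: print a decimal busNr the way lspci shows it (hex).
busnr_to_hex() {
    printf '%02x\n' "$1"
}

for busnr in 180 200 220; do
    echo "busNr $busnr -> PCI bus $(busnr_to_hex "$busnr")"
done
```

So busNr 180, 200 and 220 are the b4, c8 and dc buses in the lspci output, and the root ports hanging off each one get the bus numbers that follow it.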

    The IOMMU support will not be enabled yet as the kernel defaults to leaving it off. To enable it, we must update the kernel command line parameters with grub.

    # vi /etc/default/grub
    ....add "intel_iommu=on"...
    # grub2-mkconfig > /etc/grub2.cfg
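
If you would rather script the edit than open vi, something along these lines could work. It is a sketch under assumptions: add_kernel_arg is a made-up helper, it expects a single-line GRUB_CMDLINE_LINUX="..." entry in the file, and the argument must not contain a "/" character. The demo runs against a scratch file rather than the real /etc/default/grub.

```shell
# Hypothetical helper: append a kernel argument to GRUB_CMDLINE_LINUX
# in the given file, unless it is already there.
add_kernel_arg() {
    file=$1
    arg=$2
    # Already present on the GRUB_CMDLINE_LINUX line? Then do nothing.
    if grep -q "^GRUB_CMDLINE_LINUX=.*$arg" "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Insert " <arg>" just before the closing double quote.
    sed "s/^\(GRUB_CMDLINE_LINUX=\"[^\"]*\)\"/\1 $arg\"/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Demo on a scratch copy, not on the real config file.
demo=$(mktemp)
printf 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX="rhgb quiet"\n' > "$demo"
add_kernel_arg "$demo" intel_iommu=on
cat "$demo"
rm -f "$demo"
```

After running it against the real file, grub2-mkconfig still has to be run as shown above.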

    While the intel-iommu device in QEMU can do interrupt remapping, there is no way to enable that feature via libvirt at this time. So we need to set a hack for vfio

    echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > \

    This is also a good time to install libvirt and KVM inside the guest

    # dnf groupinstall "Virtualization"
    # dnf install libvirt-client
    # rm -f /etc/libvirt/qemu/networks/autostart/default.xml

    Note we’re disabling the default libvirt network, since it’ll clash with the IP address range used by this guest. An alternative would be to edit the default.xml to change the IP subnet.

    Now reboot the guest. When it comes back up, there should be a /dev/kvm device present in the guest.

    # ls -al /dev/kvm
    crw-rw-rw-. 1 root kvm 10, 232 Oct  4 12:14 /dev/kvm

    If this is not the case, make sure the physical host has nested virtualization enabled for the “kvm-intel” or “kvm-amd” kernel modules.

    The IOMMU should have been detected and activated

    # dmesg  | grep -i DMAR
    [    0.000000] ACPI: DMAR 0x000000007FFE2541 000048 (v01 BOCHS  BXPCDMAR 00000001 BXPC 00000001)
    [    0.000000] DMAR: IOMMU enabled
    [    0.203737] DMAR: Host address width 39
    [    0.203739] DMAR: DRHD base: 0x000000fed90000 flags: 0x1
    [    0.203776] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 12008c22260206 ecap f02
    [    2.910862] DMAR: No RMRR found
    [    2.910863] DMAR: No ATSR found
    [    2.914870] DMAR: dmar0: Using Queued invalidation
    [    2.914924] DMAR: Setting RMRR:
    [    2.914926] DMAR: Prepare 0-16MiB unity mapping for LPC
    [    2.915039] DMAR: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
    [    2.915140] DMAR: Intel(R) Virtualization Technology for Directed I/O

    The key message confirming everything is good is the last line there. If it is missing, something went wrong. Don’t be misled by the earlier “DMAR: IOMMU enabled” line, which merely says the kernel saw the “intel_iommu=on” command line option.
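
    The distinction between the two messages can be automated. This is a minimal sketch that classifies kernel log text using the two strings quoted in the log above; the function name and the three state labels are made up for illustration.

```python
# Message printed once the IOMMU is really in use (last line of the log above).
DMAR_ACTIVE = "DMAR: Intel(R) Virtualization Technology for Directed I/O"
# Message that only acknowledges the intel_iommu=on kernel argument.
DMAR_REQUESTED = "DMAR: IOMMU enabled"


def iommu_state(dmesg_text):
    """Classify kernel log text as 'active', 'requested-only' or 'absent'."""
    if DMAR_ACTIVE in dmesg_text:
        return "active"
    if DMAR_REQUESTED in dmesg_text:
        return "requested-only"
    return "absent"


sample = """\
[    0.000000] DMAR: IOMMU enabled
[    2.915140] DMAR: Intel(R) Virtualization Technology for Directed I/O
"""
print(iommu_state(sample))
```

    In practice the text would come from the output of dmesg rather than from a literal string.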

    The IOMMU should also have registered the PCI devices into various groups

    # dmesg  | grep -i iommu  |grep device
    [    2.915212] iommu: Adding device 0000:00:00.0 to group 0
    [    2.915226] iommu: Adding device 0000:00:01.0 to group 1
    [    5.588723] iommu: Adding device 0000:b5:00.0 to group 14
    [    5.588737] iommu: Adding device 0000:b6:00.0 to group 15
    [    5.588751] iommu: Adding device 0000:b7:00.0 to group 16
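
    To see which devices share a group, the log lines can be folded into a mapping. This sketch assumes the message wording shown above (“iommu: Adding device … to group …”); other kernel versions may phrase it differently, in which case the regular expression needs adjusting.

```python
import re
from collections import defaultdict

# Matches lines such as "iommu: Adding device 0000:b5:00.0 to group 14".
LINE_RE = re.compile(r"iommu: Adding device (\S+) to group (\d+)")


def iommu_groups(dmesg_text):
    """Return {group number: [PCI addresses]} parsed from kernel log text."""
    groups = defaultdict(list)
    for match in LINE_RE.finditer(dmesg_text):
        groups[int(match.group(2))].append(match.group(1))
    return dict(groups)


sample = """\
[    2.915212] iommu: Adding device 0000:00:00.0 to group 0
[    5.588723] iommu: Adding device 0000:b5:00.0 to group 14
[    5.588737] iommu: Adding device 0000:b6:00.0 to group 15
"""
for group, devices in sorted(iommu_groups(sample).items()):
    print(group, devices)
```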

    Libvirt meanwhile should have detected all the PCI controllers/devices

    # virsh nodedev-list --tree
      +- net_lo_00_00_00_00_00_00
      +- pci_0000_00_00_0
      +- pci_0000_00_01_0
      +- pci_0000_00_02_0
      +- pci_0000_00_03_0
      +- pci_0000_00_04_0
      +- pci_0000_00_1d_0
      |   |
      |   +- usb_usb2
      |       |
      |       +- usb_2_0_1_0
      +- pci_0000_00_1d_1
      |   |
      |   +- usb_usb3
      |       |
      |       +- usb_3_0_1_0
      +- pci_0000_00_1d_2
      |   |
      |   +- usb_usb4
      |       |
      |       +- usb_4_0_1_0
      +- pci_0000_00_1d_7
      |   |
      |   +- usb_usb1
      |       |
      |       +- usb_1_0_1_0
      |       +- usb_1_1
      |           |
      |           +- usb_1_1_1_0
      +- pci_0000_00_1e_0
      |   |
      |   +- pci_0000_01_01_0
      |       |
      |       +- pci_0000_02_01_0
      |       |   |
      |       |   +- net_enp2s1_52_54_00_10_26_7e
      |       |     
      |       +- pci_0000_02_02_0
      |       +- pci_0000_02_03_0
      |       +- pci_0000_02_04_0
      |       +- pci_0000_02_05_0
      +- pci_0000_00_1f_0
      +- pci_0000_00_1f_2
      |   |
      |   +- scsi_host0
      |   +- scsi_host1
      |   +- scsi_host2
      |   +- scsi_host3
      |   +- scsi_host4
      |   +- scsi_host5
      +- pci_0000_00_1f_3
      +- pci_0000_b4_00_0
      |   |
      |   +- pci_0000_b5_00_0
      |       |
      |       +- net_enp181s0_52_54_00_7e_6e_c6
      +- pci_0000_b4_01_0
      |   |
      |   +- pci_0000_b6_00_0
      |       |
      |       +- net_enp182s0_52_54_00_7e_6e_c7
      +- pci_0000_b4_02_0
      |   |
      |   +- pci_0000_b7_00_0
      |       |
      |       +- net_enp183s0_52_54_00_7e_6e_c8
      +- pci_0000_c8_00_0
      |   |
      |   +- pci_0000_c9_00_0
      |       |
      |       +- net_enp201s0_52_54_00_7e_6e_d6
      +- pci_0000_c8_01_0
      |   |
      |   +- pci_0000_ca_00_0
      |       |
      |       +- net_enp202s0_52_54_00_7e_6e_d7
      +- pci_0000_c8_02_0
      |   |
      |   +- pci_0000_cb_00_0
      |       |
      |       +- net_enp203s0_52_54_00_7e_6e_d8
      +- pci_0000_dc_00_0
      |   |
      |   +- pci_0000_dd_00_0
      |       |
      |       +- net_enp221s0_52_54_00_7e_6e_e6
      +- pci_0000_dc_01_0
      |   |
      |   +- pci_0000_de_00_0
      |       |
      |       +- net_enp222s0_52_54_00_7e_6e_e7
      +- pci_0000_dc_02_0
          +- pci_0000_df_00_0
              +- net_enp223s0_52_54_00_7e_6e_e8

    And if you look at a specific PCI device, it should report the NUMA node it is associated with and the IOMMU group it is part of

    # virsh nodedev-dumpxml pci_0000_df_00_0
      <capability type='pci'>
        <product id='0x10d3'>82574L Gigabit Network Connection</product>
        <vendor id='0x8086'>Intel Corporation</vendor>
        <iommuGroup number='10'>
          <address domain='0x0000' bus='0xdc' slot='0x02' function='0x0'/>
          <address domain='0x0000' bus='0xdf' slot='0x00' function='0x0'/>
        <numa node='2'/>
          <link validity='cap' port='0' speed='2.5' width='1'/>
          <link validity='sta' speed='2.5' width='1'/>

    Finally, libvirt should also be reporting the NUMA topology

    # virsh capabilities
      <cells num='3'>
        <cell id='0'>
          <memory unit='KiB'>4014464</memory>
          <pages unit='KiB' size='4'>1003616</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
            <sibling id='0' value='10'/>
            <sibling id='1' value='20'/>
            <sibling id='2' value='20'/>
          <cpus num='4'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
            <cpu id='1' socket_id='1' core_id='0' siblings='1'/>
            <cpu id='2' socket_id='2' core_id='0' siblings='2'/>
            <cpu id='3' socket_id='3' core_id='0' siblings='3'/>
        <cell id='1'>
          <memory unit='KiB'>2016808</memory>
          <pages unit='KiB' size='4'>504202</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
            <sibling id='0' value='20'/>
            <sibling id='1' value='10'/>
            <sibling id='2' value='20'/>
          <cpus num='2'>
            <cpu id='4' socket_id='4' core_id='0' siblings='4'/>
            <cpu id='5' socket_id='5' core_id='0' siblings='5'/>
        <cell id='2'>
          <memory unit='KiB'>2014644</memory>
          <pages unit='KiB' size='4'>503661</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
            <sibling id='0' value='20'/>
            <sibling id='1' value='20'/>
            <sibling id='2' value='10'/>
          <cpus num='2'>
            <cpu id='6' socket_id='6' core_id='0' siblings='6'/>
            <cpu id='7' socket_id='7' core_id='0' siblings='7'/>

    Everything should be ready and working at this point, so let’s try to install a nested guest and assign it one of the e1000e PCI devices. For simplicity we’ll just do the exact same install for the nested guest as we used for the top level guest we’re currently running in. The only difference is that we’ll assign it a PCI device

    # cd /var/lib/libvirt/images
    # wget -O f25x86_64-boot.iso
    # virt-install --name f25x86_64 --ram 2000 --vcpus 8 \
        --file /var/lib/libvirt/images/f25x86_64.img --file-size 10 \
        --cdrom f25x86_64-boot.iso --os-type fedora23 \
        --hostdev pci_0000_df_00_0 --network none

    If everything went well, you should now have a nested guest with an assigned PCI device attached to it.

    This turned out to be a rather long blog posting, but this is not surprising as we’re experimenting with some cutting edge KVM features, trying to emulate quite a complicated hardware setup that deviates a long way from a normal KVM guest setup. Perhaps in the future virt-install will be able to simplify some of this, but at least for the short-medium term there’ll be a fair bit of work required. The positive thing though is that this has clearly demonstrated that KVM is now advanced enough that you can reasonably expect to do development and testing of features like NUMA and PCI device assignment inside nested guests.

    The next step is to convince someone to add QEMU emulation of an Intel SRIOV network device….volunteers please :-)

    by Daniel Berrange at February 16, 2017 12:44 PM

    ANNOUNCE: libosinfo 1.0.0 release

    NB, this blog post was intended to be published back in November last year, but got forgotten in draft stage. Publishing now in case anyone missed the release…

    I am happy to announce a new release of libosinfo. Version 1.0.0 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.

    Changes in this release include:

    • Update loader to follow new layout for external database
    • Move all database files into separate osinfo-db package
    • Move osinfo-db-validate into osinfo-db-tools package

    As promised, this release of libosinfo has completed the separation of the library code from the database files. There are now three independently released artefacts:

    • libosinfo – provides the libosinfo shared library and most associated command line tools
    • osinfo-db – contains only the database XML files and RNG schema, no code at all.
    • osinfo-db-tools – a set of command line tools for managing deployment of osinfo-db archives for vendors & users.

    Before installing the 1.0.0 release of libosinfo it is necessary to install osinfo-db-tools, followed by osinfo-db. The download page has instructions for how to deploy the three components. In particular note that ‘osinfo-db’ does NOT contain any traditional build system, as the only files it contains are XML database files. So instead of unpacking the osinfo-db archive, use the osinfo-db-import tool to deploy it.

    by Daniel Berrange at February 16, 2017 11:19 AM

    February 15, 2017

    QEMU project

    Presentations from FOSDEM 2017

    Over the last weekend, on February 4th and 5th, the FOSDEM 2017 conference took place in Brussels.

    Some people from the QEMU community attended the Virtualisation and IaaS track, and the videos of their presentations are now available online, too:

    by Thomas Huth at February 15, 2017 02:49 PM

    February 14, 2017

    Stefan Hajnoczi

    Slides posted for "Using NVDIMM under KVM" talk

    I gave a talk on NVDIMM persistent memory at FOSDEM 2017. QEMU has gained support for emulated NVDIMMs and they can be used efficiently under KVM.

    Applications inside the guest access the physical NVDIMM directly with native performance when properly configured. These devices are DDR4 RAM modules so the access times are much lower than solid state (SSD) drives. I'm looking forward to hardware coming onto the market because it will change storage and databases in a big way.

    This talk covers what NVDIMM is, the programming model, and how it can be used under KVM. Slides are available here (PDF).

    Update: Video is available here.

    by stefanha at February 14, 2017 02:54 PM

    February 10, 2017

    Daniel Berrange

    ANNOUNCE: gtk-vnc 0.7.0 release including 2 security fixes

    I’m pleased to announce a new release of GTK-VNC, version 0.7.0. The release focus is on bug fixing and includes fixes for two publicly reported security bugs which allow a malicious server to exploit the client. Similar bugs were recently reported & fixed in other common VNC clients too.

    • CVE-2017-5884 – fix bounds checking for RRE, hextile and copyrect encodings
    • CVE-2017-5885 – fix color map index bounds checking
    • Add API to allow smooth scaling to be disabled
    • Workaround to help SPICE servers quickly drop VNC clients which mistakenly connect, by sending “RFB ” signature bytes early
    • Don’t accept color map entries for true-color pixel formats
    • Add missing vala .deps files for gvnc & gvncpulse
    • Avoid crash if host/port is NULL
    • Add precondition checks to some public APIs
    • Fix link to home page in README file
    • Fix misc memory leaks
    • Clamp cursor hot-pixel to within cursor region

    Thanks to all those who reported bugs and provided patches that went into this new release.

    by Daniel Berrange at February 10, 2017 04:45 PM

    The surprisingly complicated world of disk image sizes

    When managing virtual machines one of the key tasks is to understand the utilization of resources being consumed, whether RAM, CPU, network or storage. This post will examine different aspects of managing storage when using file based disk images, as opposed to block storage. When provisioning a virtual machine the tenant user will have an idea of the amount of storage they wish the guest operating system to see for their virtual disks. This is the easy part. It is simply a matter of telling ‘qemu-img’ (or a similar tool) ’40GB’ and it will create a virtual disk image that is visible to the guest OS as a 40GB volume. The virtualization host administrator, however, doesn’t particularly care about what size the guest OS sees. They are instead interested in how much space is (or will be) consumed in the host filesystem storing the image. With this in mind, there are four key figures to consider when managing storage:

    • Capacity – the size that is visible to the guest OS
    • Length – the current highest byte offset in the file.
    • Allocation – the amount of storage that is currently consumed.
    • Commitment – the amount of storage that could be consumed in the future.
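
    Two of these figures, length and allocation, can be read straight from the host filesystem. The following is a small sketch assuming a Linux host, where `st_blocks` is counted in 512-byte units; the helper name is hypothetical. It creates a sparse 10 MiB temporary file and reports both values.

```python
import os
import tempfile


def length_and_allocation(path):
    """Return (length, allocation) in bytes for a file on the host."""
    st = os.stat(path)
    # st_blocks counts 512-byte units, independent of the filesystem block size.
    return st.st_size, st.st_blocks * 512


with tempfile.NamedTemporaryFile() as tmp:
    # Extend the file without writing data, leaving a hole.
    tmp.truncate(10 * 1024 * 1024)
    tmp.flush()
    length, allocation = length_and_allocation(tmp.name)
    print("length:", length, "allocation:", allocation)
```

    On filesystems that support sparse files the allocation is expected to be far smaller than the length here. Capacity and commitment cannot be read this way in general, since they depend on the image format.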

    The relationship between these figures will vary according to the format of the disk image file being used. For the sake of illustration, raw and qcow2 files will be compared since they provide examples of the simplest file format and the most complicated file format used for virtual machines.

    Raw files

    In a raw file, the sectors visible to the guest are mapped 1-to-1 onto sectors in the host file. Thus the capacity and length values will always be identical for raw files – the length dictates the capacity and vice versa. The allocation value is slightly more complicated. Most filesystems do lazy allocation on blocks, so even if a file is 10 GB in length it is entirely possible for it to consume 0 bytes of physical storage, if nothing has been written to the file yet. Such a file is known as “sparse” or is said to have “holes” in its allocation. To maximize guest performance, it is common to tell the operating system to fully allocate a file at time of creation, either by writing zeros to every block (very slow) or via a special system call to instruct it to immediately allocate all blocks (very fast). So immediately after creating a new raw file, the allocation would typically either match the length, or be zero. In the latter case, as the guest writes to various disk sectors, the allocation of the raw file will grow. The commitment value refers to the upper bound for the allocation value, and for raw files, this will match the length of the file.

    While raw files look reasonably straightforward, some filesystems can create surprises. XFS has a concept of “speculative preallocation” where it may allocate more blocks than are actually needed to satisfy the current I/O operation. This is useful for files which are progressively growing, since it is faster to allocate 10 blocks all at once, than to allocate 10 blocks individually. So while a raw file’s allocation will usually never exceed the length, if XFS has speculatively preallocated extra blocks, it is possible for the allocation to exceed the length. The excess is usually pretty small though – bytes or KBs, not MBs. Btrfs meanwhile has a concept of “copy on write” whereby multiple files can initially share allocated blocks and when one file is written, it will take a private copy of the blocks written. In other words, to determine the usage of a set of files it is not sufficient to sum the allocation for each file, as that would over-count the true allocation due to block sharing.

    QCow2 files

    In a qcow2 file, the sectors visible to the guest are indirectly mapped to sectors in the host file via a number of lookup tables. A sector at offset 4096 in the guest may be stored at offset 65536 in the host. In order to perform this mapping, there are various auxiliary data structures stored in the qcow2 file. Describing all of these structures is beyond the scope of this post; read the specification instead. The key point is that, unlike raw files, the length of the file in the host has no relation to the capacity seen in the guest. The capacity is determined by a value stored in the file header metadata.

    By default, the qcow2 file will grow on demand, so the length of the file will gradually grow as more data is stored. It is possible to request preallocation, either just of file metadata, or of the full file payload too. Since the file grows on demand as data is written, traditionally it would never have any holes in it, so the allocation would always match the length (the previous caveat with respect to XFS speculative preallocation still applies though). Since the introduction of SSDs, however, the notion of explicitly cutting holes in files has become commonplace. When this is plumbed through from the guest, a guest initiated TRIM request will in turn create a hole in the qcow2 file, which will also issue a TRIM to the underlying host storage. Thus even though qcow2 files grow on demand, they may also become sparse over time, so the allocation may be less than the length.

    The maximum commitment for a qcow2 file is surprisingly hard to get an accurate answer to. To calculate it requires intimate knowledge of the qcow2 file format and even the type of data stored in it. There is allocation overhead from the data structures used to map guest sectors to host file offsets, which is directly proportional to the capacity and the qcow2 cluster size (a cluster is the qcow2 equivalent of the “sector” concept, except much bigger – 65536 bytes by default). Over time qcow2 has grown other data structures though, such as various bitmap tables tracking cluster allocation and recent writes. With the addition of LUKS support, there will be key data tables. Most significant though is that qcow2 can internally store entire VM snapshots containing the virtual device state, guest RAM and copy-on-write disk sectors. If snapshots are ignored, it is possible to calculate a value for the commitment, and it will be proportional to the capacity. If snapshots are used, however, all bets are off – the amount of storage that can be consumed is unbounded, so there is no commitment value that can be accurately calculated.
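
    As an illustration of the capacity being a header field rather than a property of the file length, here is a minimal sketch that decodes the first 32 bytes of a qcow2 header. The field layout (magic, version, backing file offset and size, cluster bits, virtual size, all big-endian) follows the qcow2 specification; the function name and the synthetic header used for the demonstration are assumptions made for this example.

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"


def read_qcow2_header(header_bytes):
    """Decode capacity and cluster size from the start of a qcow2 file."""
    # Big-endian fields: magic, version, backing_file_offset,
    # backing_file_size, cluster_bits, size (the guest-visible capacity).
    magic, version, _bf_offset, _bf_size, cluster_bits, size = struct.unpack(
        ">4sIQIIQ", header_bytes[:32]
    )
    if magic != QCOW2_MAGIC:
        raise ValueError("not a qcow2 image")
    return {
        "version": version,
        "cluster_size": 1 << cluster_bits,
        "capacity": size,
    }


# Synthetic header for a 40 GiB image with 64 KiB clusters (not a real file).
fake_header = struct.pack(">4sIQIIQ", QCOW2_MAGIC, 3, 0, 0, 16, 40 * 1024 ** 3)
print(read_qcow2_header(fake_header))
```

    For a real image the bytes would come from opening the file in binary mode and reading its first 32 bytes.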


    Considering the above information, for a newly created file the four size values would look like

    Format                      Capacity  Length  Allocation  Commitment
    raw (sparse)                40GB      40GB    0           40GB [1]
    raw (prealloc)              40GB      40GB    40GB [1]    40GB [1]
    qcow2 (grow on demand)      40GB      193KB   196KB       41GB [2]
    qcow2 (prealloc metadata)   40GB      41GB    6.5MB       41GB [2]
    qcow2 (prealloc all)        40GB      41GB    41GB        41GB [2]

    [1] XFS speculative preallocation may cause allocation/commitment to be very slightly higher than 40GB
    [2] use of internal snapshots may massively increase allocation/commitment

    For an application attempting to manage filesystem storage to ensure any future guest OS write will always succeed without triggering ENOSPC (out of space) in the host, the commitment value is critical to understand. If the length/allocation values are initially less than the commitment, they will grow towards it as the guest writes data. For raw files it is easy to determine commitment (XFS preallocation aside), but for qcow2 files it is unreasonably hard. Even ignoring internal snapshots, there is no API provided by libvirt that reports this value, nor is it exposed by QEMU or its tools. Determining the commitment for a qcow2 file requires the application to not only understand the qcow2 file format, but also directly query the header metadata to read internal parameters such as “cluster size” to be able to then calculate the required value. Without this, the best an application can do is to guess – e.g. add 2% to the capacity of the qcow2 file to determine likely commitment. Snapshots may make life even harder, but to be fair, qcow2 internal snapshots are best avoided regardless in favour of external snapshots. The lack of information around file commitment is a clear gap that needs addressing in both libvirt and QEMU.
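
    To make the guesswork concrete, here is a rough sketch of both approaches. The first function is the 2% rule of thumb. The second estimates only the two largest metadata structures, assuming one 8-byte L2 entry and one 16-bit refcount per cluster; it deliberately ignores the header, L1 table, refcount table, bitmaps, LUKS data and snapshots, so it is a lower bound and not an authoritative commitment figure. Both function names are hypothetical.

```python
GiB = 1024 ** 3


def commitment_guess(capacity, margin_percent=2):
    """Rule of thumb: capacity plus a fixed percentage of overhead."""
    return capacity + capacity * margin_percent // 100


def qcow2_metadata_estimate(capacity, cluster_size=65536, refcount_bits=16):
    """Approximate bytes of L2 tables plus refcount blocks for a full image."""
    clusters = -(-capacity // cluster_size)          # ceiling division
    l2_bytes = clusters * 8                          # one 8-byte entry per cluster
    refcount_bytes = clusters * refcount_bits // 8   # one refcount per cluster
    return l2_bytes + refcount_bytes


capacity = 40 * GiB
print("2% guess:", commitment_guess(capacity))
print("metadata estimate (MiB):", qcow2_metadata_estimate(capacity) / 1024 ** 2)
```

    For a 40GB image with the default cluster size the estimate comes to 6.25 MiB, which is in the same range as the 6.5MB allocation listed for the “prealloc metadata” case in the table above.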

    That all said, ensuring the sum of commitment values across disk images is within the filesystem free space is only one approach to managing storage. These days QEMU has the ability to live migrate virtual machines even when their disks are on host-local storage – it simply copies across the disk image contents too. So a valid approach is to mostly ignore future commitment implied by disk images, and instead just focus on the near term usage. For example, regularly monitor filesystem usage and if free space drops below some threshold, then migrate one or more VMs (and their disk images) off to another host to free up space for remaining VMs.
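
    A monitoring loop of that kind needs little more than a free space query. The sketch below uses Python’s `shutil.disk_usage`; the 10% threshold and the function names are arbitrary assumptions for illustration, and the migration step itself is left out.

```python
import shutil


def below_threshold(total, free, min_free_fraction):
    """True when the free space is under the requested fraction of the total."""
    return free < total * min_free_fraction


def images_dir_low_on_space(path, min_free_fraction=0.10):
    """Check the filesystem holding the disk images (threshold is arbitrary)."""
    usage = shutil.disk_usage(path)
    return below_threshold(usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    # "." stands in for an images directory such as /var/lib/libvirt/images.
    if images_dir_low_on_space("."):
        print("free space low: pick a VM to migrate to another host")
    else:
        print("free space OK")
```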

    by Daniel Berrange at February 10, 2017 03:58 PM

    February 08, 2017

    Daniel Berrange

    Commenting out XML snippets in libvirt guest config by stashing it as metadata

    Libvirt uses XML as the format for configuring objects it manages, including virtual machines. Sometimes when debugging / developing it is desirable to comment out sections of the virtual machine configuration to test some idea. For example, one might want to temporarily remove a secondary disk. It is not always desirable to just delete the configuration entirely, as it may need to be re-added immediately after. XML has support for comments <!-- .... some text --> which one might try to use to achieve this. Using comments in XML fed into libvirt, however, will result in an unwelcome surprise – the commented out text is thrown into /dev/null by libvirt.

    This is an unfortunate consequence of the way libvirt handles XML documents. It does not consider the XML document to be the master representation of an object’s configuration – a series of C structs are the actual internal representation. XML is simply a data interchange format for serializing structs into a text format that can be interchanged with the management application, or persisted on disk. So when receiving an XML document libvirt will parse it, extracting the pieces of information it cares about, which are then stored in memory in some structs, while the XML document is discarded (along with the comments it contained). Given this way of working, to preserve comments would require libvirt to add 100’s of extra fields to its internal structs and extract comments from every part of the XML document that might conceivably contain them. This is totally impractical to do in reality. The alternative would be to consider the parsed XML DOM as the canonical internal representation of the config. This is what the libvirt-gconfig library in fact does, but it means you can no longer just do simple field accesses to access information – getter/setter methods would have to be used, which quickly becomes tedious in C. It would also involve refactoring almost the entire libvirt codebase, so such a change in approach would realistically never be done.

    Given that it is not possible to use XML comments in libvirt, what other options might be available ?

    Many years ago libvirt added the ability to store arbitrary user defined metadata in domain XML documents. The caveat is that they have to be located in a specific place in the XML document as a child of the <metadata> tag, in a private XML namespace. This metadata facility can be used as a hack to temporarily stash some XML out of the way. Consider a guest which contains a disk to be “commented out”:

    <domain type="kvm">
        <disk type='file' device='disk'>
        <driver name='qemu' type='raw'/>
        <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
          <target dev='vda' bus='virtio'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>

    To stash the disk config as a piece of metadata requires changing the XML to

    <domain type="kvm">
        <s:disk xmlns:s="" type='file' device='disk'>
          <driver name='qemu' type='raw'/>
          <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
          <target dev='vda' bus='virtio'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>

    What we have done here is

    – Added a <metadata> element at the top level
    – Moved the <disk> element to be a child of <metadata> instead of a child of <devices>
    – Added an XML namespace to <disk> by giving it an ‘s:’ prefix and associating a URI with this prefix
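
    The same three steps can be scripted. This is a minimal sketch using Python’s standard `xml.etree.ElementTree`; the namespace URI is a placeholder chosen for this example (the URI used in the post is not shown above), and `stash_disk` is a hypothetical helper, not part of libvirt.

```python
import xml.etree.ElementTree as ET

# Placeholder URI for illustration; any private namespace URI would do.
STASH_NS = "http://example.org/xml-stash"
ET.register_namespace("s", STASH_NS)


def stash_disk(domain_xml, target_dev):
    """Move the <disk> with the given target dev under <metadata> as <s:disk>."""
    root = ET.fromstring(domain_xml)
    devices = root.find("devices")
    if devices is None:
        raise LookupError("domain has no <devices> element")
    for disk in devices.findall("disk"):
        target = disk.find("target")
        if target is not None and target.get("dev") == target_dev:
            devices.remove(disk)
            # Put the element into the private namespace: <disk> -> <s:disk>.
            disk.tag = "{%s}disk" % STASH_NS
            metadata = root.find("metadata")
            if metadata is None:
                metadata = ET.SubElement(root, "metadata")
            metadata.append(disk)
            return ET.tostring(root, encoding="unicode")
    raise LookupError("no disk with target dev %r" % target_dev)


sample = (
    "<domain type='kvm'><devices>"
    "<disk type='file' device='disk'><target dev='vda' bus='virtio'/></disk>"
    "</devices></domain>"
)
print(stash_disk(sample, "vda"))
```

    ElementTree declares the namespace prefix on the root element rather than on the stashed element itself, which is equivalent XML. The result would still need to be fed back through virsh edit or virsh define.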

    Libvirt only allows a single top level metadata element per namespace, so if there are multiple things to be stashed, just give them each a custom namespace, or introduce an arbitrary wrapper. Aside from mandating the use of a unique namespace, libvirt treats the metadata as entirely opaque and will not try to interpret or parse it in any way. Any valid XML construct can be stashed in the metadata, even invalid XML constructs, provided they are hidden inside a CDATA block. For example, if you’re using virsh edit to make some changes interactively and want to get out before finishing them, just stash the changes in a CDATA section, avoiding the need to worry about correctly closing the elements.

    <domain type="kvm">
        <s:stash xmlns:s="">
          <disk type='file' device='disk'>
            <driver name='qemu' type='raw'/>
            <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
            <target dev='vda' bus='virtio'/>
            <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
            <driver name='qemu' type='raw'/>
            ...i'll finish writing this later...

    Admittedly this is a somewhat cumbersome solution. In most cases it is probably simpler to just save the snippet of XML in a plain text file outside libvirt. This metadata trick, however, might just come in handy some times.

    As an aside, the real, intended usage of the <metadata> facility is to allow applications which interact with libvirt to store custom data they may wish to associate with the guest. As an example, the recently announced libvirt websockets console proxy uses it to record which consoles are to be exported. I know of few other real world applications using this metadata feature, however, so it is worth remembering it exists :-) System administrators are free to use it for local book keeping purposes too.

    by Daniel Berrange at February 08, 2017 07:14 PM

    Thomas Huth

    Testing the edk2 SMM driver stack with QEMU, KVM & libvirt

    Laszlo wrote an article over at the edk2 wiki about testing SMM with OVMF, in QEMU/KVM virtual machines managed by libvirt.

    The primary goal of the article is to help rapid development and testing of SMM-related firmware code (or any edk2 code in general).

    February 08, 2017 12:10 PM

    Alberto Garcia

    QEMU and the qcow2 metadata checks

    When choosing a disk image format for your virtual machine one of the factors to take into consideration is its I/O performance. In this post I’ll talk a bit about the internals of qcow2 and about one of the aspects that can affect its performance under QEMU: its consistency checks.

    As you probably know, qcow2 is QEMU’s native file format. The first thing that I’d like to highlight is that this format is perfectly fine in most cases and its I/O performance is comparable to that of a raw file. When it isn’t, chances are that this is due to an insufficiently large L2 cache. In one of my previous blog posts I wrote about the qcow2 L2 cache and how to tune it, so if your virtual disk is too slow, you should go there first.

    I also recommend Max Reitz and Kevin Wolf’s qcow2: why (not)? talk from KVM Forum 2015, where they talk about a lot of internal details and show some performance tests.

    qcow2 clusters: data and metadata

    A qcow2 file is organized into units of constant size called clusters. The cluster size defaults to 64KB, but a different value can be set when creating a new image:

    qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G

    Clusters can contain either data or metadata. A qcow2 file grows dynamically and only allocates space when it is actually needed, so apart from the header there’s no fixed location for any of the data and metadata clusters: they can appear mixed anywhere in the file.

    Here’s an example of what it looks like internally:

    In this example we can see the most important types of clusters that a qcow2 file can have:

    • Header: this one contains basic information such as the virtual size of the image, the version number, and pointers to where the rest of the metadata is located, among other things.
    • Data clusters: the data that the virtual machine sees.
    • L1 and L2 tables: a two-level structure that maps the virtual disk that the guest can see to the actual location of the data clusters in the qcow2 file.
    • Refcount table and blocks: a two-level structure with a reference count for each data cluster. Internal snapshots use this: a cluster with a reference count >= 2 means that it’s used by other snapshots, and therefore any modifications require a copy-on-write operation.
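
    The two-level L1/L2 lookup can be sketched as simple integer arithmetic. This assumes 8-byte L2 entries, as in the qcow2 specification, so that one L2 table occupying a single cluster holds cluster_size / 8 entries; the function name is hypothetical.

```python
def qcow2_indexes(guest_offset, cluster_size=65536):
    """Split a guest offset into L1 index, L2 index and offset within cluster."""
    l2_entries = cluster_size // 8           # 8-byte entries per L2 table
    cluster_index = guest_offset // cluster_size
    return {
        "l1_index": cluster_index // l2_entries,
        "l2_index": cluster_index % l2_entries,
        "offset_in_cluster": guest_offset % cluster_size,
    }


MiB = 1024 ** 2
# With 64 KiB clusters one L2 table maps 8192 clusters, i.e. 512 MiB of guest data.
print(qcow2_indexes(512 * MiB + 3 * 65536 + 5))
```

    The L1 entry then points at an L2 table, and the L2 entry at the data cluster in the file.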

    Metadata overlap checks

    In order to detect corruption when writing to qcow2 images QEMU (since v1.7) performs several sanity checks. They verify that QEMU does not try to overwrite sections of the file that are already being used for metadata. If this happens, the image is marked as corrupted and further access is prevented.
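
    Conceptually, each check is an interval overlap test between the region about to be written and the regions known to hold metadata. The following is only an illustration of that idea with a made-up layout; it is not QEMU’s implementation, and the function name and the offsets are invented for the example.

```python
def overlaps_metadata(write_offset, write_length, metadata_ranges):
    """Return the name of the first metadata region the write would touch."""
    write_end = write_offset + write_length
    for name, (start, length) in metadata_ranges.items():
        # Two half-open intervals overlap when each starts before the other ends.
        if write_offset < start + length and start < write_end:
            return name
    return None


# Hypothetical file layout: name -> (offset, length) in bytes.
layout = {
    "main-header": (0, 65536),
    "active-l1": (196608, 65536),
}
print(overlaps_metadata(131072, 65536, layout))   # data cluster between the two
print(overlaps_metadata(200000, 4096, layout))    # lands inside the L1 table
```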

    Although in most cases these checks are innocuous, under certain scenarios they can have a negative impact on disk write performance. This depends a lot on the case, and I want to stress that in most scenarios it doesn’t have any effect. When it does, the general rule is that you’re more likely to notice it if the storage backend is very fast or if the qcow2 image is very large.

    In these cases, and if I/O performance is critical for you, you might want to consider tweaking the images a bit or disabling some of these checks, so let’s take a look at them. There are currently eight different checks. They’re named after the metadata sections that they check, and can be divided into the following categories:

    1. Checks that run in constant time. These are equally fast for all kinds of images and I don’t think they’re worth disabling.
      • main-header
      • active-l1
      • refcount-table
      • snapshot-table
    2. Checks that run in variable time but don’t need to read anything from disk.
      • refcount-block
      • active-l2
      • inactive-l1
    3. Checks that need to read data from disk. There is just one check here and it’s only needed if there are internal snapshots.
      • inactive-l2

    By default all tests are enabled except for the last one (inactive-l2), because it needs to read data from disk.

    Disabling the overlap checks

    Tests can be disabled or enabled from the command line using the following syntax:

    -drive file=hd.qcow2,overlap-check.inactive-l2=on
    -drive file=hd.qcow2,overlap-check.snapshot-table=off

    It’s also possible to select the group of checks that you want to enable using the following syntax:

    -drive file=hd.qcow2,overlap-check.template=none
    -drive file=hd.qcow2,overlap-check.template=constant
    -drive file=hd.qcow2,overlap-check.template=cached
    -drive file=hd.qcow2,overlap-check.template=all

    Here, none means that no tests are enabled, constant enables all tests from group 1, cached enables all tests from groups 1 and 2, and all enables all of them.
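
    For scripts that assemble QEMU command lines, the grouping described above can be written down as a small table. This sketch only reproduces the option strings shown in this post; the helper name is hypothetical, and it accepts either a template or a single check, since those are the two forms demonstrated here.

```python
# Check names grouped as in the list above: constant time, cached, disk reads.
GROUP_1 = ["main-header", "active-l1", "refcount-table", "snapshot-table"]
GROUP_2 = ["refcount-block", "active-l2", "inactive-l1"]
GROUP_3 = ["inactive-l2"]

TEMPLATES = {
    "none": [],
    "constant": GROUP_1,
    "cached": GROUP_1 + GROUP_2,
    "all": GROUP_1 + GROUP_2 + GROUP_3,
}


def overlap_check_option(image, template=None, check=None, enabled=True):
    """Build the value for -drive using one template or one named check."""
    if (template is None) == (check is None):
        raise ValueError("pass exactly one of template or check")
    if template is not None:
        if template not in TEMPLATES:
            raise ValueError("unknown template: %s" % template)
        return "file=%s,overlap-check.template=%s" % (image, template)
    if check not in TEMPLATES["all"]:
        raise ValueError("unknown check: %s" % check)
    state = "on" if enabled else "off"
    return "file=%s,overlap-check.%s=%s" % (image, check, state)


print("-drive", overlap_check_option("hd.qcow2", template="cached"))
print(len(TEMPLATES["cached"]), "checks enabled by the cached template")
```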

    As I explained in the previous section, if you’re worried about I/O performance then the checks that are probably worth evaluating are refcount-block, active-l2 and inactive-l1. I’m not counting inactive-l2 because it’s off by default. Let’s look at the other three:

    • inactive-l1: This is a variable length check because it depends on the number of internal snapshots in the qcow2 image. However its performance impact is likely to be negligible in all cases so I don’t think it’s worth bothering with.
    • active-l2: This check depends on the virtual size of the image, and on the percentage that has already been allocated. This check might have some impact if the image is very large (several hundred GBs or more). In that case one way to deal with it is to create an image with a larger cluster size. This also has the nice side effect of reducing the amount of memory needed for the L2 cache.
    • refcount-block: This check depends on the actual size of the qcow2 file and it’s independent from its virtual size. This check is relatively expensive even for small images, so if you notice performance problems chances are that they are due to this one. The good news is that we have been working on optimizing it, so if it’s slowing down your VMs the problem might go away completely in QEMU 2.9.
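
    To get a feeling for why a larger cluster size helps with the active-l2 check, here is a back-of-the-envelope sketch. It assumes the standard qcow2 layout in which every L2 table occupies exactly one cluster and every L2 entry is 8 bytes; these numbers come from the qcow2 format, not from this post, and the function name is made up for the example.

```python
L2_ENTRY_SIZE = 8  # bytes per L2 entry (assumption taken from the qcow2 format)


def l2_metadata(virtual_size, cluster_size):
    """Return (number of L2 tables, total L2 bytes) for a fully allocated image."""
    entries_per_table = cluster_size // L2_ENTRY_SIZE
    bytes_per_table = entries_per_table * cluster_size  # guest data mapped by one table
    tables = -(-virtual_size // bytes_per_table)         # ceiling division
    return tables, tables * cluster_size


TiB = 1024 ** 4
for cluster in (64 * 1024, 2 * 1024 * 1024):
    tables, size = l2_metadata(TiB, cluster)
    print(f"cluster={cluster // 1024} KiB: {tables} L2 tables, {size // 1024 ** 2} MiB of L2 metadata")
```

    Fewer L2 tables means fewer entries to compare against on every write, and less memory for the L2 cache.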


    The qcow2 consistency checks are useful to detect data corruption, but they can affect write performance.

    If you’re unsure and you want to check it quickly, open an image with overlap-check.template=none and see for yourself, but remember again that this will only affect write operations. To obtain more reliable results you should also open the image with cache=none in order to perform direct I/O and bypass the page cache. I’ve seen performance increases of 50% and more, but whether you’ll see them depends a lot on your setup. In many cases you won’t notice any difference.

    I hope this post was useful to learn a bit more about the qcow2 format. There are other things that can help QEMU perform better, and I’ll probably come back to them in future posts, so stay tuned!


    My work in QEMU is sponsored by Outscale and has been made possible by Igalia and the help of the rest of the QEMU development team.

    by berto at February 08, 2017 08:52 AM

    February 04, 2017

    QEMU project

    A new website for QEMU

    At last, QEMU’s new website is up!

    The new site aims to be simple and provides the basic information needed to download and start contributing to QEMU. It complements the wiki, which remains the central point for developers to share information quickly with the rest of the community.

    We tried to test the website on most browsers and to make it lightweight and responsive. It is built using Jekyll and the source code for the website can be cloned from the qemu-web.git repository. Just like for any other project hosted by QEMU, the best way to propose or contribute a new change is by sending a patch through the mailing list.

    For example, if you would like to add a new screenshot to the homepage, you can clone the qemu-web.git repository, add a PNG file to the screenshots/ directory, and edit the _data/screenshots.yml file to include the new screenshot.

    Blog posts about QEMU are also welcome; they are simple HTML or Markdown files and are stored in the _posts/ directory of the repository.

    by Paolo Bonzini at February 04, 2017 07:40 AM

    January 24, 2017

    Gerd Hoffmann

    local display for intel vgpu starts working

    Intel vgpu guest display shows up in the qemu gtk window. This is a linux kernel booted to the dracut emergency shell prompt, with the framebuffer console running @ inteldrmfb. Booting a full linux guest with X11 and/or wayland has not been tested yet. There are rendering glitches too; running “dmesg” looks like this:

    So, a bunch of issues to solve before this is ready for users, but it’s a nice start.

    For the brave:

    host kernel:

    Take care, both branches are moving targets (aka: rebasing at times).

    by Gerd Hoffmann at January 24, 2017 08:38 PM

    January 18, 2017

    Gerd Hoffmann

    tweak arm images with libguestfs-tools

    So, when using the official fedora arm images on your raspberry pi (or any other arm board) you might have faced the problem that it is not easy to use them for a headless (i.e. no keyboard and display connected) machine. There is no default password; fedora asks you to set one on the first boot instead. From a security point of view that is surely better than shipping with a fixed password. But for headless machines it is quite inconvenient …

    Luckily there is an easy way out. You can use libguestfs-tools. The tools have been created to configure virtual machine images (this is where the name comes from). But the tools work fine with sdcards too.

    I’m using a usb sdcard reader which shows up as /dev/sdc on my system. I can just pass /dev/sdc as image to the tools (take care, the device is probably something else for you). For example, to set a root password:

    virt-customize -a /dev/sdc --root-password "password:<your-password-here>"

    The initial setup on the first boot is a systemd service, and it can be turned off by simply removing the symlinks which enable the service:

    virt-customize -a /dev/sdc \
      --delete /etc/systemd/system/ \
      --delete /etc/systemd/system/

    You can use virt-copy-in (or virt-tar-in) to copy config files to the disk image. Small (or empty) configuration files can also be created with the write command:

    virt-customize -a /dev/sdc --write "/.autorelabel:"

    Adding the .autorelabel file will force selinux relabeling on the first boot (takes a while). It is a good idea to do that in case you copy files to the sdcard, to make sure the new files are labeled correctly, especially if you copy security sensitive things like ssh keys or ssh config files. Without relabeling selinux will not allow sshd to access those files, which in turn can break remote logins.

    There is a lot more the virt-* tools can do for you. Check out the manual pages for more info. And you can easily script things: virt-customize has a --commands-from-file switch which accepts a file with a list of commands.
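
    A small sketch of how such a command file could be generated. It assumes the format described in the virt-customize manual page (one command per line, written without the leading "--"); the function name, the file name pi.cmds and the password are placeholders made up for this example.

```python
def customize_commands(root_password=None, delete=(), write=None):
    """Render virt-customize commands, one per line, without the leading '--'."""
    lines = []
    if root_password is not None:
        lines.append(f"root-password password:{root_password}")
    for path in delete:
        lines.append(f"delete {path}")
    for path, content in (write or {}).items():
        lines.append(f"write {path}:{content}")
    return "\n".join(lines) + "\n"


script = customize_commands(
    root_password="example-only",   # placeholder, pick your own
    write={"/.autorelabel": ""},    # empty file, forces selinux relabeling
)
print(script, end="")
# Save the text to a file (say, pi.cmds) and run:
#   virt-customize -a /dev/sdc --commands-from-file pi.cmds
```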

    by Gerd Hoffmann at January 18, 2017 10:56 AM

    January 17, 2017

    Gerd Hoffmann

    virtual gpu support landing upstream

    The upstreaming process of virtual gpu support (vgpu) made a big step forward with the 4.10 merge window. Two important pieces have been merged:

    First, the mediated device framework (mdev). Basically this allows kernel drivers to present virtual pci devices, using the vfio framework and interfaces. Both nvidia and intel will use mdev to partition the physical gpu of the host into multiple virtual devices which then can be assigned to virtual machines.

    Second, intel landed initial mdev support for the i915 driver too. There is quite some work left to do in future kernel releases though. Access to the guest display is not supported yet, so you must run x11vnc or similar tools in the guest to see the screen. Also there are some stability issues to find and fix.

    If you want to play with this nevertheless, here is how to do it. But be prepared for crashes, and better not try this on a production machine.

    On the host: create virtual devices

    On the host machine you obviously need a 4.10 kernel. Also the intel graphics device (igd) must be broadwell or newer. In the kernel configuration enable vfio and mdev (all CONFIG_VFIO_* options). Enable CONFIG_DRM_I915_GVT and CONFIG_DRM_I915_GVT_KVMGT for intel vgpu support. Building the mtty sample driver (CONFIG_SAMPLE_VFIO_MDEV_MTTY, a virtual serial port) can be useful too, for testing.

    Boot the new kernel. Load all modules: vfio-pci, vfio-mdev, optionally mtty. Also i915 and kvmgt of course, but that probably happened during boot already.

    Go to the /sys/class/mdev_bus directory. This should look like this:

    kraxel@broadwell ~# cd /sys/class/mdev_bus
    kraxel@broadwell .../class/mdev_bus# ls -l
    total 0
    lrwxrwxrwx. 1 root root 0 17. Jan 10:51 0000:00:02.0 -> ../../devices/pci0000:00/0000:00:02.0
    lrwxrwxrwx. 1 root root 0 17. Jan 11:57 mtty -> ../../devices/virtual/mtty/mtty

    Each driver with mdev support has a directory there. Go to $device/mdev_supported_types to check what kind of virtual devices you can create.

    kraxel@broadwell .../class/mdev_bus# cd 0000:00:02.0/mdev_supported_types
    kraxel@broadwell .../0000:00:02.0/mdev_supported_types# ls -l
    total 0
    drwxr-xr-x. 3 root root 0 17. Jan 11:59 i915-GVTg_V4_1
    drwxr-xr-x. 3 root root 0 17. Jan 11:57 i915-GVTg_V4_2
    drwxr-xr-x. 3 root root 0 17. Jan 11:59 i915-GVTg_V4_4

    As you can see intel supports three different configurations on my machine. They differ in the amount of video memory and in the number of instances you can create. Check the description and available_instance files in the directories:

    kraxel@broadwell .../0000:00:02.0/mdev_supported_types# cd i915-GVTg_V4_2
    kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# cat description 
    low_gm_size: 64MB
    high_gm_size: 192MB
    fence: 4
    kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# cat available_instance 

    Now it is possible to create virtual devices by writing a UUID into the create file:

    kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# uuid=$(uuidgen)
    kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# echo $uuid
    kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# sudo sh -c "echo $uuid > create"

    The new vgpu device will show up as subdirectory of the host gpu:

    kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# cd ../../$uuid
    kraxel@broadwell .../0000:00:02.0/f321853c-c584-4a6b-b99a-3eee22a3919c# ls -l
    total 0
    lrwxrwxrwx. 1 root root    0 17. Jan 12:31 driver -> ../../../../bus/mdev/drivers/vfio_mdev
    lrwxrwxrwx. 1 root root    0 17. Jan 12:35 iommu_group -> ../../../../kernel/iommu_groups/10
    lrwxrwxrwx. 1 root root    0 17. Jan 12:35 mdev_type -> ../mdev_supported_types/i915-GVTg_V4_2
    drwxr-xr-x. 2 root root    0 17. Jan 12:35 power
    --w-------. 1 root root 4096 17. Jan 12:35 remove
    lrwxrwxrwx. 1 root root    0 17. Jan 12:31 subsystem -> ../../../../bus/mdev
    -rw-r--r--. 1 root root 4096 17. Jan 12:35 uevent

    You can see the device landed in iommu group 10. We’ll need that in a moment.
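
    The same two steps (writing a UUID into the create file, then looking up the iommu group) can be scripted. This is only a sketch: the function names are made up, it has to run as root to write to sysfs, and the paths are the ones from my machine shown above.

```python
import os
import uuid
from pathlib import Path


def create_mdev(type_dir, dev_uuid=None):
    """Write a UUID into <type_dir>/create, like the echo command above."""
    dev_uuid = dev_uuid or str(uuid.uuid4())
    (Path(type_dir) / "create").write_text(dev_uuid + "\n")
    return dev_uuid


def iommu_group(device_dir):
    """Return the group number that the iommu_group symlink points to."""
    return os.path.basename(os.readlink(Path(device_dir) / "iommu_group"))


# Example (as root, on a host with the sysfs layout shown above):
#   gpu = "/sys/class/mdev_bus/0000:00:02.0"
#   dev = create_mdev(gpu + "/mdev_supported_types/i915-GVTg_V4_2")
#   print(dev, iommu_group(gpu + "/" + dev))
```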

    On the host: configure guests

    Ideally this would be as simple as adding <hostdev> to your guest’s libvirt xml config. The mdev devices don’t have a pci address on the host though, and because of that they must be passed to qemu using the sysfs device path instead of the pci address. libvirt doesn’t (yet) support sysfs paths though, so it is a bit more complicated for now. A lot of the setup libvirt does for hostdevs automatically must be done manually instead.

    First, we must allow qemu to access /dev. By default libvirt uses control groups to restrict access. That must be turned off. Edit /etc/libvirt/qemu.conf. Uncomment the cgroup_controllers line. Remove "devices" from the list. Restart libvirtd.

    Second, we must allow qemu to access the iommu group (10 in my case). A simple chmod will do:

    kraxel@broadwell ~# chmod 666 /dev/vfio/10

    Third, we must update the guest configuration:

    <domain type='kvm' xmlns:qemu=''>
      [ ... ]
      <currentMemory unit='KiB'>1048576</currentMemory>
      [ ... ]
        <qemu:arg value='-device'/>
        <qemu:arg value='vfio-pci,addr=05.0,sysfsdev=/sys/class/mdev_bus/0000:00:02.0/f321853c-c584-4a6b-b99a-3eee22a3919c'/>

    There is a special qemu namespace which can be used to pass extra command line arguments to qemu. We do this here to use a qemu feature not yet supported by libvirt (using sysfs paths for vfio-pci). Also we must explicitly allow locking down guest memory.
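
    If you create devices from a script, the two values that go into the qemu:arg elements can be generated as well. A minimal sketch; the function name is hypothetical, and the pci slot address 05.0 is simply the one used in my guest config.

```python
def vfio_mdev_args(parent, dev_uuid, addr="05.0"):
    """Return the two qemu arguments that are passed through <qemu:arg>."""
    sysfsdev = f"/sys/class/mdev_bus/{parent}/{dev_uuid}"
    return ["-device", f"vfio-pci,addr={addr},sysfsdev={sysfsdev}"]


for value in vfio_mdev_args("0000:00:02.0", "f321853c-c584-4a6b-b99a-3eee22a3919c"):
    print(f"<qemu:arg value='{value}'/>")
```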

    Now we are ready to go:

    kraxel@broadwell ~# virsh start --console $guest

    In the guest

    It is a good idea to prepare the guest a bit before adding the vgpu to the guest configuration. Set up a serial console, so you can talk to it even in case graphics are broken. Blacklist the i915 module and load it manually, at least until you have a known-working configuration. Also booting to runlevel 3 instead of 5 and starting the xorg server manually is better for now.

    For the guest machine intel recommends the 4.8 kernel. In theory newer kernels should work too, in practice they didn’t last time I tested (4.10-rc2). Also make sure the xorg server uses the modesetting driver, the intel driver didn’t work in my testing. This config file will do:

    root@guest ~# cat /etc/X11/xorg.conf.d/intel.conf 
    Section "Device"
            Identifier  "Card0"
    #       Driver      "intel"
            Driver      "modesetting"
            BusID       "PCI:0:5:0"
    EndSection

    I’m starting the xorg server with x11vnc, xterm and mwm (motif window manager) using this little script:

    #!/bin/sh
    # debug
    echo "# $0: DISPLAY=$DISPLAY"
    # start server
    if test "$DISPLAY" != ":4"; then
            echo "# $0: starting Xorg server"
            exec startx $0 -- /usr/bin/Xorg :4
            exit 1
    fi
    echo "# $0: starting session"
    # configure session
    xrdb $HOME/.Xdefaults
    # start clients
    x11vnc -rfbport 5904 &
    xterm &
    exec mwm

    The session runs on display 4, so you should be able to connect from the host this way:

    kraxel@broadwell ~# vncviewer $guest_ip:4

    Have fun!

    by Gerd Hoffmann at January 17, 2017 03:25 PM

    January 16, 2017

    Thomas Huth

    Controlling Open Firmware with -prom-env

    Open Firmware has a concept of environment configuration variables that are used to control the boot flow and other behavior of the firmware. On ppc64 systems, these variables are normally stored in the NVRAM of the machine (as defined in the LoPAPR specification, chapter, “System NVRAM Partition”), so that they are persistent between reboots - at least on real systems. With QEMU, the emulated NVRAM has got to be backed with a real file on the host, of course, to make the contents persistent. This can be done with the parameter -drive if=pflash,file=filename,format=raw of QEMU, for example.

    Now, if the backing file via pflash is not used, the contents of the NVRAM can be created by QEMU dynamically. QEMU features a dedicated parameter called -prom-env for setting the configuration variables in NVRAM. This has worked with all the OpenBIOS based machines in QEMU (like the mac99 or g3beige machine) for a long time already, and since version 2.8, released last December, QEMU also supports this option for the pseries machine, i.e. it can now also be used to control the boot behavior of the SLOF firmware of an sPAPR guest. So this is a good point in time to have a closer look at this parameter to explain how it can be used with the pseries machine…

    The supported environment variables and their values can be listed by typing printenv at the SLOF firmware prompt:

    0 > printenv  
    ---environment variable--------current value-------------default value------
       load-base                   4000                      4000 
       real-mode?                  true                      true
       direct-serial?              false                     false
       use-nvramrc?                false                     false
       selftest-#megs              0                         0 
       security-mode               0                         0 
       security-#badlogins         0                         0 
       screen-#rows                200                       200 
       screen-#columns             200                       200 
       oem-logo?                   false                     false
       oem-logo                    1e762bb0 0                1e762110 0 
       oem-banner?                 false                     false
       fcode-debug?                true                      true
       diag-switch?                false                     false
       boot-command                boot                      boot
       auto-boot?                  false                     true

    Many of the configuration variables are either only useful for debugging (like “fcode-debug?”), or even only there without real functionality since the IEEE 1275 standard mandates that they’ve got to be available (like “oem-logo”), but the implementation of the intended functionality did not make much sense in SLOF.

    Some other configuration variables are really useful, though. For example, if you want to prevent SLOF from booting automatically (so you can do some things at the firmware prompt before running the OS), you can start QEMU with -prom-env 'auto-boot?=false' to disable the auto-boot feature. Of course you could also hit the “s” key during boot to drop to the firmware prompt, but this can be quite annoying if you’re doing multiple things in parallel and thus easily miss the right moment. Using the configuration variable is much more convenient.

    Another very useful trick is that you can execute arbitrary Forth code during the boot process with the -prom-env parameter! The likely obvious way is to use the “nvramrc” variable. For example, if you start QEMU with the parameters -prom-env 'use-nvramrc?=true' -prom-env 'nvramrc=." Hello World!" cr', SLOF will print “Hello World!” during the boot process, followed by a carriage return. But there is another way for executing Forth code, which I personally prefer if I do not have to boot an OS afterwards (since you do not have to set two variables in this case): You can override the boot-command variable, which also contains Forth code. For example using the parameters -prom-env 'boot-command=." Hello World!" cr' will print “Hello World” during the boot process, too, just at a little bit later point in time. This can also be useful to run firmware reboot tests: When you run QEMU with -prom-env 'boot-command=reset-all', the firmware will reboot automatically each time instead of booting an operating system. Or use -prom-env 'boot-command=power-off' to shut down the VM automatically at the end of the firmware boot process.
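
    Since the Forth snippets contain quotes and spaces, getting the shell quoting right can be fiddly when such tests are scripted. Here is a small sketch that builds the argument list in Python instead; the helper name is made up, and the binary name and machine options are just an assumption about a typical pseries invocation.

```python
import shlex


def prom_env_args(variables):
    """Turn {'name': 'value'} into a flat list of -prom-env arguments."""
    args = []
    for name, value in variables.items():
        args += ["-prom-env", f"{name}={value}"]
    return args


cmd = ["qemu-system-ppc64", "-M", "pseries", "-nographic"]
cmd += prom_env_args({"use-nvramrc?": "true", "nvramrc": '." Hello World!" cr'})
print(shlex.join(cmd))  # shell-quoted form, for pasting into a terminal
```

    Passing the list to subprocess.run() directly avoids the shell altogether, so no extra quoting is needed in that case.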

    Something else that bugged me for a long time was the behavior of the input selection in SLOF. When you boot your pseries guest with a VGA graphics card, SLOF always automatically uses the emulated USB keyboard as input device. But if you want to debug the VGA or USB code in the firmware for example, it is much more convenient if you can interact with the firmware via the serial console (aka hvterm). So now that QEMU supports the -prom-env parameter for the pseries machine, too, I’ve recently added some code to SLOF that forces the firmware to stay with the serial input instead of switching to the emulated USB keyboard. You can control the behavior now with the “direct-serial?” configuration variable. If you start QEMU with the parameters -nographic -vga std -prom-env 'direct-serial?=true' for example, you can still interact with the firmware in the terminal window even though the firmware detected a graphics card and USB keyboard. (Note: This new feature is currently only available in the development version of SLOF, but it will be part of the SLOF release that will be shipped with QEMU version 2.9)

    January 16, 2017 10:45 AM

    January 06, 2017

    Ladi Prosek

    Extracting Windows VM crash dumps

    This post aims to provide an overview of the many options for extracting Windows crash dump files and their equivalents out of Windows virtual machines. The assumption is that a Windows instance has just crashed with a Blue Screen Of Death (BSOD) and a dump is deemed helpful to diagnose the issue. Note that most of the tips below apply to physical machines as well but the examples here focus on a QEMU/KVM virtual machine.

    1. Copy MEMORY.DMP after reboot

    This is straightforward as long as MEMORY.DMP has actually been generated and there is a simple way to copy it out of the VM. Note that when Windows is showing the BSOD screen and “dumping memory to disk”, it is not creating a crash dump file yet. To increase the chances that the memory dump file will be persisted, Windows writes it in the page file first. Then, on first reboot, it tries to extract it into a separate file, usually %SystemRoot%\MEMORY.DMP.

    A few things can go wrong here. First, if the system got super messed up by whatever caused the crash, the VM may not be able to boot again. No boot means no MEMORY.DMP and we’re out of luck. There may also not be enough free disk space to create the crash dump. This would, again, mean no MEMORY.DMP for us. Note that freeing disk space after the fact won’t help as the dump in the page file is already partially overwritten by the time the user gets to do anything about the disk space situation. Unless of course the disk can be accessed offline from the host or another VM. If you’re willing to do this, it is not necessary to reboot though (see below). Here’s an older blog post with a detailed description of the Windows behavior and useful tips.

    2. Extract the page file

    Little known is the fact that the page file itself can be fed into windbg and used in lieu of MEMORY.DMP (thanks to Vadim Rozenfeld for enlightening me!). The page file will be larger than MEMORY.DMP but it doesn’t require the reboot mentioned above. A Windows VM crashed and doesn’t boot back up? No problem, just extract the page file:

    $ guestmount -a Windows10.qcow2 -i --ro /mnt/windows
    $ cp /mnt/windows/pagefile.sys ~

    And then on a Windows machine:

    C:\>windbg -z path\to\pagefile.sys

    3. Process QEMU guest memory dump with Volatility

    This is not guaranteed to succeed but is worth a try as a last-ditch effort or if the VM is in production and a “crash” dump is required without actually crashing it.

    Side note: The easiest way to trigger a Windows VM crash is raising a non-maskable interrupt. This can be done by issuing the inject-nmi QMP command or its equivalent:

    Welcome to the QMP low-level shell!
    Connected to QEMU 2.8.50
    (QEMU) inject-nmi

    Back to the non-crashing scenario. First, get the guest memory dump:

    Welcome to the QMP low-level shell!
    Connected to QEMU 2.8.50
    (QEMU) dump-guest-memory paging=false protocol=file:/tmp/vmdump
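
    The shell above is a thin wrapper around JSON messages. For scripting, the same command can be encoded by hand; the sketch below only builds the messages and leaves out the socket handling. The function name is made up. A real client has to read the greeting from the QMP socket and send qmp_capabilities before any other command.

```python
import json


def qmp_message(command, **arguments):
    """Encode one QMP command as a JSON line."""
    msg = {"execute": command}
    if arguments:
        msg["arguments"] = arguments
    return json.dumps(msg).encode() + b"\r\n"


handshake = qmp_message("qmp_capabilities")  # must be sent first
dump = qmp_message("dump-guest-memory", paging=False, protocol="file:/tmp/vmdump")
print(handshake.decode().strip())
print(dump.decode().strip())
```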

    The file is a complete image of the guest physical memory in ELF format. It contains most of the data found in MEMORY.DMP but the structure is different. Use Volatility to convert it:

    python -f /tmp/vmdump --profile=Win10x64 raw2dmp -O /tmp/MEMORY.DMP
    Volatility Foundation Volatility Framework 2.5
    Writing data (5.00 MB chunks): 

    Note that you have to know or guess (or let Volatility guess) the version of Windows running in the VM, sometimes down to the Service Pack / build level. If all works well, the resulting MEMORY.DMP can be opened in windbg as usual. You just won’t see any bug check code and BSOD parameters (duh!) and probably no context either.

    4. Push dump data out of the VM programmatically

    This is advanced and more of a note to self, as I hope to turn it into a tool at some point. Windows offers two kernel APIs of interest.

    KeRegisterBugCheckReasonCallback with KbCallbackDumpIo
    In theory, this should allow a kernel driver to be invoked by Windows as the BSOD is happening and memory is being dumped to the page file. The driver is given a pointer to the memory blocks being dumped and can push this data out of the VM using, for instance, a virtual serial port. Quite understandably, the callback code running on BSOD is very limited in what it can do. Memory allocations are obviously out of the question as is everything else that might end up allocating. What’s worse is that it’s not clear if the addresses passed to the callback are virtual or physical. The first part of the dump (header) is indicated with virtual addresses. The second part (body) starts with a few virtual pages and is then followed by physical pages indicated with physical addresses. Secondary data is again virtual. I might be missing something but apparently I’m not the only one observing this. As it stands now, this API is not usable without ugly and fragile heuristics.

    This call produces the crash dump header, a 4096 byte data structure at offset 0 of MEMORY.DMP. And it works, as long as the physical memory layout that Windows works with is identical to that of whoever is producing the actual physical memory data to be appended to the header. Which sadly doesn’t seem to be the case with QEMU’s dump-guest-memory. So on one hand, having a valid and correct header saves us from the guesswork that Volatility has to do (no need to mess with Volatility “profiles”). But it’s still necessary to at least understand, or better patch the header to adapt it to the guest memory dump. Here’s a partial description of the dump header layout:

    by ladipro at January 06, 2017 02:21 PM

    January 05, 2017

    Daniel Berrange

    ANNOUNCE: New libvirt project Go XML parser model

    Shortly before Christmas, I announced the availability of new Go bindings for the libvirt API. This post announces a companion package for dealing with XML parsing/formatting in Go. The master repository is available on the libvirt GIT server, but it is expected that Go projects will consume it via an import of the github mirror, since the Go ecosystem is heavily github focused (e.g. can’t produce docs for stuff hosted on git)

    import (
      libvirtxml ""
      "encoding/xml"
    )

    domcfg := &libvirtxml.Domain{Type: "kvm", Name: "demo",
                                 UUID: "8f99e332-06c4-463a-9099-330fb244e1b3",
    }
    xmldoc, err := xml.Marshal(domcfg)

    API documentation is available on the godoc website.

    When dealing with the libvirt API, most applications will find themselves needing to either parse or format XML documents describing configuration of various libvirt objects. Traditionally this task has been left up to the application to deal with and as a result most applications end up creating some kind of structure / object model to represent the XML document in a more easily accessible manner. To try to reduce this duplicate effort, libvirt has already created the libvirt-glib package, which contains a libvirt-gconfig library mapping libvirt XML documents into the GObject world. This library is accessible to many programming languages via the magic of GObject Introspection, and while there is some work to support this in Go, it is not particularly mature at this time.

    In the Go world, there is a package “encoding/xml” which is able to transform between XML documents and Go structs, given suitable annotations on the struct fields. It is very easy to deal with, simply requiring someone to define a big set of structs with annotated fields to map to the XML document. There’s no real “code” to write as it is really a data definition task. Looking at applications using libvirt in Go, we see quite a few have already gone down this route for dealing with libvirt XML. It should be readily apparent that every application using libvirt in Go is essentially going to end up writing an identical set of structs to deal with the XML handling. This duplication of effort makes no sense at all, and as such, we have started this new libvirt-go-xml package to provide a standard set of Go structs to deal with libvirt XML. The current level of schema support is pretty minimal, covering the capabilities XML, secrets XML and a small amount of the domain XML, so we’d encourage anyone interested in this to contribute patches to expand the XML schema coverage.

    The following illustrates a further example of its usage in combination with the libvirt-go library (with error checking omitted for brevity):

    import (
      libvirt ""
      libvirtxml ""
      "encoding/xml"
      "fmt"
    )

    conn, err := libvirt.NewConnect("qemu:///system")
    dom, err := conn.LookupDomainByName("demo")
    xmldoc, err := dom.GetXMLDesc(0)
    domcfg := &libvirtxml.Domain{}
    err = xml.Unmarshal([]byte(xmldoc), domcfg)
    fmt.Printf("Virt type %s", domcfg.Type)


    by Daniel Berrange at January 05, 2017 12:15 PM

    December 22, 2016

    Eduardo Habkost

    QEMU APIs: introduction to QemuOpts

    This post is a short introduction to the QemuOpts API inside QEMU. This is part of a series, see the introduction for other pointers and additional information.

    QemuOpts was introduced in 2009. It is a simple abstraction that handles two tasks:

    1. Parsing of config files and command-line options
    2. Storage of configuration options

    Data structures

    The QemuOpts data model is pretty simple:

    • QemuOptsList carries the list of all options belonging to a given config group. Each entity is represented by a QemuOpts struct.

    • QemuOpts represents a set of key-value pairs. (Some of the code refers to that as a config group, but to avoid confusion with QemuOptsList, I will call them config sections).

    • QemuOpt is a single key=value pair.

    Some config groups have multiple QemuOpts structs (e.g. “drive”, “object”, “device”, that represent multiple drives, multiple objects, and multiple devices, respectively), while others always have only one QemuOpts struct (e.g. the “machine” config group).

    For example, the following command-line options:

    -drive id=disk1,file=disk.raw,format=raw,if=ide \
    -drive id=disk2,file=disk.qcow2,format=qcow2,if=virtio \
    -machine usb=on -machine accel=kvm

    are represented internally as:

    Diagram showing two QemuOptsList objects: qemu_drive_opts and qemu_machine_opts. qemu_drive_opts has two QemuOpts entries: disk1 and disk2. disk1 has three QemuOpt entries: file=disk.raw, format=raw, if=ide. disk2 has three QemuOpt entries: file=disk.qcow2, format=qcow2, if=virtio. qemu_machine_opts has one QemuOpts entry. The QemuOpts entry for qemu_machine_opts has two QemuOpt entries: usb=on, accel=kvm
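
    The same structure can be mimicked in a few lines of Python. This is only a toy model to make the three levels concrete, not a description of the actual C code: the class layout is an approximation, the parser ignores escaped commas and implied option names, and the merge_lists flag is modelled after the field of the same name in QemuOptsList.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class QemuOpts:
    """One config section; each dict item plays the role of a QemuOpt."""
    id: Optional[str]
    opts: Dict[str, str] = field(default_factory=dict)


@dataclass
class QemuOptsList:
    """One config group, holding one or more config sections."""
    name: str
    merge_lists: bool = False
    sections: List[QemuOpts] = field(default_factory=list)

    def parse(self, arg):
        pairs = dict(item.split("=", 1) for item in arg.split(","))
        opts_id = pairs.pop("id", None)
        if self.merge_lists and self.sections:
            target = self.sections[0]          # keep adding to the single section
        else:
            target = QemuOpts(opts_id)
            self.sections.append(target)
        target.opts.update(pairs)
        return target


drive = QemuOptsList("drive")
drive.parse("id=disk1,file=disk.raw,format=raw,if=ide")
drive.parse("id=disk2,file=disk.qcow2,format=qcow2,if=virtio")
machine = QemuOptsList("machine", merge_lists=True)
machine.parse("usb=on")
machine.parse("accel=kvm")
print([s.id for s in drive.sections], machine.sections[0].opts)
```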

    Data Types

    QemuOpts supports a limited number of data types for option values:

    • Strings
    • Boolean options
    • Numbers (integers)
    • Sizes


    Strings

    Strings are just used as-is, after the command-line or config file is parsed.

    Note: On the command-line, options are separated by commas, but commas inside option values can be escaped as ,,.
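
    To illustrate the escaping rule, here is a simplified splitter that treats a doubled comma as a literal comma. It is a sketch of the rule only, not a port of the QEMU parser, and the function name is made up.

```python
def split_opts(arg):
    """Split 'k=v,k=v' on single commas; ',,' stands for a literal comma."""
    items, current, i = [], "", 0
    while i < len(arg):
        if arg[i] == ",":
            if arg[i + 1:i + 2] == ",":   # escaped comma, keep one
                current += ","
                i += 2
                continue
            items.append(current)         # single comma ends the option
            current = ""
        else:
            current += arg[i]
        i += 1
    items.append(current)
    return dict(item.split("=", 1) for item in items)


print(split_opts("file=my,,disk.raw,format=raw"))
```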

    Boolean options

    The QemuOpt parser accepts only “on” and “off” as values for this option.

    Warning: note that this behavior is different from the QOM property parser. I plan to explore this in future posts.

    Numbers (integers)

    Numbers are supposed to be unsigned 64-bit integers. However, the code relies on the behavior of strtoull() and does not reject negative numbers. That means the parsed uint64_t value might be converted to a signed integer later. For example, the following command-line is not rejected by QEMU:

    $ qemu-system-x86_64 -smp cpus=-18446744073709551615,cores=1,threads=1

    I don’t know if there is existing code that requires negative numbers to be accepted by the QemuOpts parser. I assume it exists, so we couldn’t easily change the existing parsing rules without breaking existing code.
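
    The wrap-around can be reproduced outside of QEMU. The sketch below mimics what strtoull() does with a leading minus sign (the value is negated in the unsigned 64-bit type); it is an emulation for illustration, with a made-up function name, and it does not call the C function.

```python
def parse_u64_like_strtoull(text):
    """Mimic how a negative decimal string wraps around in a uint64_t."""
    value = int(text, 10)
    if abs(value) >= 2 ** 64:
        raise OverflowError("strtoull() would report ERANGE here")
    return value % 2 ** 64


print(parse_u64_like_strtoull("-18446744073709551615"))
print(parse_u64_like_strtoull("-1"))
```

    So the odd-looking -smp value above ends up as a perfectly ordinary small number, which is why nothing complains.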


    Sizes

    Sizes are represented internally as integers, but the parser accepts suffixes like K, M, G, T.

    qemu-system-x86_64 -m size=2G

    is equivalent to:

    qemu-system-x86_64 -m size=2048M

    Note: there are two different size-suffix parsers inside QEMU: one at util/cutils.c and another at util/qemu-option.c. Figuring out which one is going to be used is left as an exercise to the reader.
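
    A simplified version of such a suffix parser, assuming power-of-two multipliers and integer values only. The real parsers in QEMU handle more cases; this sketch, with a made-up function name, just shows why the two command lines above are equivalent.

```python
SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Parse '2G' or '2048M' into bytes; a bare number is taken as bytes."""
    suffix = text[-1]
    if suffix in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[suffix]
    return int(text)


print(parse_size("2G"), parse_size("2048M"))
```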

    Working around the QemuOpts parsers

    QEMU code sometimes uses tricks to avoid or work around the QemuOpts option value parsers:

    Example 1: using the raw option value

    It is possible to get the original raw option value as a string using qemu_opt_get(), even after it was already parsed. For example, the code that handles memory options in QEMU does that, to ensure a suffix-less number is interpreted as Mebibytes, not bytes:

        mem_str = qemu_opt_get(opts, "size");
        if (mem_str) {
            /* [...] */
            sz = qemu_opt_get_size(opts, "size", ram_size);
            /* Fix up legacy suffix-less format */
            if (g_ascii_isdigit(mem_str[strlen(mem_str) - 1])) {
                sz <<= 20;
                /* [...] */
            }
        }
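
    The effect of that fix-up, expressed as a small Python sketch: if the raw string ends in a digit, the parsed value is shifted by 20 bits, i.e. interpreted as Mebibytes. The function name and the suffix table are only for this example.

```python
def mem_size_bytes(mem_str):
    """Suffix-less values mean MiB for -m; suffixed values are parsed normally."""
    shifts = {"K": 10, "M": 20, "G": 30, "T": 40}   # suffix -> bit shift
    if mem_str[-1].isdigit():
        return int(mem_str) << 20                   # legacy suffix-less format
    return int(mem_str[:-1]) << shifts[mem_str[-1]]


print(mem_size_bytes("2048"), mem_size_bytes("2G"))
```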

    Example 2: empty option name list

    Some options do not use the QemuOpts value parsers at all, by not defining any option names in the QemuOptsList struct. In those cases, the option values are parsed and validated using different methods. Some examples:

    static QemuOptsList qemu_machine_opts = {
        .name = "machine",
        .implied_opt_name = "type",
        .merge_lists = true,
        .head = QTAILQ_HEAD_INITIALIZER(qemu_machine_opts.head),
        .desc = {
            /*
             * no elements => accept any
             * sanity checking will happen later
             * when setting machine properties
             */
            { }
        },
    };

    static QemuOptsList qemu_acpi_opts = {
        .name = "acpi",
        .implied_opt_name = "data",
        .head = QTAILQ_HEAD_INITIALIZER(qemu_acpi_opts.head),
        .desc = { { 0 } } /* validated with OptsVisitor */
    };

    This is a common pattern when options are translated to other data representations: mostly QOM properties or QAPI structs. I plan to explore this in a future blog post.

    The following config groups use this method and do their own parsing/validation of config options: acpi, device, drive, machine, net, netdev, numa, object, smbios, tpmdev
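
    The "no elements means accept any" rule can be sketched with simplified stand-in types. Everything here is hypothetical and only for illustration: struct opt_desc, struct opts_list and opt_name_accepted are not QEMU's definitions, and the fixed array size of 4 is an arbitrary choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_DESC 4              /* arbitrary limit for this sketch */

/* Hypothetical, simplified stand-ins for the real QEMU structures. */
struct opt_desc {
    const char *name;           /* NULL marks the end of the list */
};

struct opts_list {
    const char *name;
    struct opt_desc desc[MAX_DESC];
};

/* Return true if 'opt' may be used as an option name in 'list'. */
static bool opt_name_accepted(const struct opts_list *list, const char *opt)
{
    assert(list != NULL && opt != NULL);

    if (list->desc[0].name == NULL) {
        return true;            /* empty description list: accept any name */
    }
    for (size_t i = 0; i < MAX_DESC && list->desc[i].name != NULL; i++) {
        if (strcmp(list->desc[i].name, opt) == 0) {
            return true;        /* name is explicitly described */
        }
    }
    return false;
}
```

    A list declared with an empty description array accepts every option name and leaves validation to later code, while a list that names its options rejects anything else at parse time.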


    The QemuOpts code is responsible for two tasks:

    1. Parsing command-line options and config files
    2. Storage of configuration options

    This means config options are sometimes parsed by custom code and then converted to QemuOpts data structures. Storing config options inside QemuOpts allows the existing QEMU configuration to be written to a file using the -writeconfig command-line option.

    The original commit introducing -writeconfig describes it this way:

    In theory you should be able to do:

    qemu < machine config cmd line switches here > -writeconfig vm.cfg
    qemu -readconfig vm.cfg

    In practice it will not work. Not all command line switches are converted to QemuOpts, so you’ll have to keep the not-yet converted ones on the second line. Also there might be bugs lurking which prevent even the converted ones from working correctly.

    This has improved over the years, but the comment still applies today: most command-line options are converted to QemuOpts options, but not all of them.

    Further reading

    by Eduardo Habkost at December 22, 2016 02:00 AM

    December 17, 2016

    Stefan Hajnoczi

    13 years of using Linux

    I've been using Linux on both work and personal machines for 13 years. Over time I've tried various distributions, changed the nature of my work, and revisited other operating systems, only to arrive back at the same conclusion every time: Linux works best for me.

    The reason I started using Linux remains the reason why it's my operating system of choice today:

    It's free and easy to install most software under an open source license that allows both commercial and non-commercial use.

    That means software to do common tasks is available for free without limitations. The cost of entry for exploring and learning new things is zero.

    The amount of packaged software available in major Linux distributions is incredible. Niche open source operating systems don't have this wide selection of high-quality software. Proprietary operating systems have high-quality software but there is constant irritation in dealing with the artificial limitations of closed source software. The strength of Linux is this sweet spot between high-quality mainstream software and the advantages of open source software.

    The pain points of Linux have changed over the years. In the beginning hardware support was limited. This has largely been solved for laptop, desktop, and server hardware as vendors began to contribute drivers and publish hardware datasheets free of NDAs. Class-compliant USB devices also cut down on the number of vendor-specific drivers. Nowadays the reputation for limited hardware support is largely unjustified.

    Another issue that has subsided is the Windows-only software that kept many people tied to that platform. Two trends killed Windows-only software: the move to the web and the rise of the Mac. A lot of applications migrated to pure web applications without the need for ActiveX or Java applets with platform-specific code - and Adobe Flash is close to its end too. Ever since Macs rose to popularity again it was no longer acceptable to ship Windows-only software. As a result so many things are now on the web or cross-platform applications with Linux support.

    Migrating to Linux is still a big change just like switching from Windows to Mac is a big change. It will always be hard to overcome this, even with virtualization, because the virtual machine experience isn't seamless. Ultimately users need to pick native applications and import their existing data. And it's worth it because you get access to applications that can do almost everything without the hassles of proprietary platforms. That's the lasting advantage that Linux has over the competition.

    by stefanha at December 17, 2016 06:00 PM

    December 15, 2016

    Daniel Berrange

    ANNOUNCE: New libvirt project Go language bindings

    I’m happy to announce that the libvirt project is now supporting Go language bindings as a primary deliverable, joining Python and Perl as language bindings with 100% API coverage of the libvirt C library. The master repository is available on the libvirt GIT server, but it is expected that Go projects will consume it via an import of the github mirror, since the Go ecosystem is heavily github focused (e.g. the godoc website can’t produce docs for code hosted on the libvirt GIT server).

    import (
        libvirt ""
    )

    conn, err := libvirt.NewConnect("qemu:///system")

    API documentation is available on the godoc website.

    For a while now libvirt has relied on 3rd parties to provide Go language bindings. The one most people use was first created by Alex Zorin and then taken over by Kyle Kelly. There’s been a lot of excellent work put into these bindings; however, the API coverage largely stops at what was available in libvirt 1.2.2, with the exception of a few APIs from libvirt 1.2.14 which have to be enabled via Go build tags. Libvirt is now working on version 3.0.0 and many APIs have been added in that time, not to mention enums and other constants. Comparing the current libvirt-go API coverage against what the libvirt C library exposes reveals 163 missing functions (out of 476 total), 367 missing enum constants (out of 847 total) and 165 missing macro constants (out of 200 total). In other words, while a lot was already implemented, there was still a very long way to go.

    Initially I intended to contribute patches to the existing libvirt-go bindings to address the missing API coverage. In looking at the code, though, I had some concerns about the way some of the APIs had been exposed to Go. In the libvirt C library there is a set of APIs which accept or return a “virTypedParameterPtr” array, for cases where we need APIs to be easily extensible to handle the addition of arbitrary extra data fields in the future. The way these APIs work is one of the most ugly and unpleasant parts of the C API, and thus in language bindings we never expose the virTypedParameter concept directly, but instead map it into a more suitable language-specific data structure. In Perl and Python this meant mapping them to hash tables, which gives application developers a language-friendly way to interact with the APIs. Unfortunately the current Go API bindings have exposed the virTypedParameter concept directly to Go, and since Go does not support unions, the result is arguably even more unpleasant in Go than it already is in C.

    The second concern is with the way events are exposed to Go. In the C layer we have different callbacks that are needed for each event type, but only one method for registering callbacks, requiring an ugly type cast. This was again exposed directly in Go, meaning that the Go compiler can’t do strong type checking on the callback registration, instead only doing a runtime check at time of event dispatch. There were some other minor concerns about the Go API mapping, such as the fact that it needlessly exposed the “vir” prefix on all methods & constants despite already being in a “libvirt” package namespace, and that it returned a struct instead of a pointer to a struct for objects.

    Understandably the current maintainer had a desire to keep API compatibility going forward, so the decision was made to fork the existing libvirt-go codebase. This allowed us to take advantage of all the work put in so far, while fixing the design problems and extending the bindings to have 100% API coverage. The idea is that applications can then decide to opt in to the new Go binding at a point in time where they’re ready to adapt their code to the API changes.

    For users of the existing libvirt Go binding, converting to the new official libvirt Go binding requires a little bit of work, but nothing too serious and will simplify the code if using any of the typed parameter methods. The changes are roughly as follows:

    • The “VIR_” prefix is dropped from all constants. eg libvirt.VIR_DOMAIN_METADATA_DESCRIPTION becomes libvirt.DOMAIN_METADATA_DESCRIPTION
    • The “vir” prefix is dropped from all types. eg libvirt.virDomain becomes libvirt.Domain
    • Methods returning objects now return a pointer eg “*Domain” instead of “Domain”, allowing us to return the usual “nil” constant on error, instead of a struct with no underlying libvirt connection
    • The domain events DomainEventRegister method has been replaced by a separate method for each event type. eg DomainEventLifecycleRegister, DomainEventRebootRegister, etc giving compile time type checking of callbacks
    • The domain events API now accepts a single callback, instead of taking a pair of callbacks – the caller can create an anonymous function to invoke multiple things if required.
    • Methods accepting or returning typed parameters now have a formal struct defined to expose all the parameters in a manner that allows direct access without type casts and enables normal Go compile time type checking. eg the Domain.GetBlockIOTune method returns a DomainBlockIoTuneParameters struct
    • It is no longer necessary to use Go compiler build tags to access functionality in different libvirt versions. Through the magic of conditional compilation, the binding will transparently build against every libvirt version from 1.2.0 through 3.0.0
    • The binding can find libvirt via pkg-config making it easy to compile against a libvirt installed to a non-standard location by simply setting “PKG_CONFIG_PATH”
    • There is 100% coverage of all APIs [1], constants and macros, verified by the libvirt CI system, so that it always keeps up with GIT master of the Libvirt C library.
    • The error callback concept is removed from the binding as this is deprecated by libvirt due to not being thread safe. It was also redundant since every method already directly returns an error object.
    • There are now explicit types defined for all enums and methods which take flags or enums now use these types instead of “uint32”, again allowing stronger compiler type checking

    With the exception of the typed parameter changes, adapting existing apps should be a largely boring mechanical conversion to account for the renames.

    Again, without the effort put in by Alex Zorin, Kyle Kelly & other community contributors, creation of these new libvirt-go bindings would have taken at least 4-5 weeks instead of the 2 weeks of effort put into this. So there’s a huge debt owed to all the people who previously contributed to libvirt Go bindings. I hope that having these new bindings with guaranteed 100% API coverage will be of benefit to the Go community going forward.

    [1] At time of writing this is a slight lie, as I’ve not quite finished the virStream and virEvent callback method bindings, but this will be done shortly.

    by Daniel Berrange at December 15, 2016 11:56 AM

    Last updated: March 29, 2017 07:01 AM