Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

April 06, 2017

Zeeshan Ali Khattak

GNOME ❤ Rust Hackfest in Mexico

While I'm known as a Vala fanboy in GNOME, I've tried to stress time and again that I see Vala as more a practical solution than an ideal one. "Safe programming" has always been something that intrigued me, having dealt with numerous crashes and other hard-to-debug runtime issues in the past. So when I first heard of Rust some years back, it got me super excited, but it was not exactly stable and there was no integration with GNOME libraries or D-Bus, and hence it was not at all a viable option for developing desktop code. Lately (in the past 2 years) things have changed significantly. Not only do we have Rust 1.0, but we also have crates that provide integration with GNOME libraries and D-Bus. On top of that, some of us took steps to start converting some C code into Rust, and many of us started seriously talking with Rust hackers to make Rust a first-class programming language for GNOME.

To make things really go forward, we decided to arrange a hackfest, which took place last week at the Red Hat offices in Mexico City. The event was a big success in my opinion. The actual work done and started during the hackfest aside, it brought two communities much closer together and we learnt quite a lot from each other in a very short amount of time. The main topics at the hackfest were:
  • GObject-introspection consumption by Rust.
  • GObject creation from Rust.
  • Better out of the box Rust support in GNOME Builder
  • GMainLoop and Tokio integration
  • D-Bus bindings
While most folks were focused on the first three, and I did participate in discussions on all these topics (except for Builder, of which I don't know anything), I spent most of my time looking into the last one. D-Bus is widely used in the automotive industry these days and I serve that industry, so it made sense, aside from my personal interest in D-Bus. We established (some of it before the hackfest) that to make Rust attractive to C and Vala developers, we need to provide:
  1. Syntactic sugar for making D-Bus coding simple

    Very similar to what Vala offers. Antoni already had a project, dbus-macros, that targets this goal through the use of Rust's (powerful) macro system. So I spent a lot of time fixing and improving the dbus-macros crate. Having Antoni and other Rust experts in the same room greatly helped me get around some very hard-to-decipher compiler issues. I found out (the hard way) that while rustc is very good at spotting errors, it fails miserably to give you the proper context when it comes to macros. I complained enough about this to Mozilla folks that I'm sure they'll be looking into fixing that experience at some point in the near future. :)

    We also contacted the author of the dbus crate, David Henningsson, over e-mail about a few D-Bus related subjects (more below), including this one. (I was surprised to find out that he also lives in Sweden.) His opinion was that we probably should be using procedural macros for this. I agree with him, except that procedural macros are not yet in stable Rust. So for now, I decided to continue with the current approach of the project.

    During the hackfest, I became the maintainer of the dbus-macros crate since the first thing I did was to reduce the very small amount of code by 70%. Next, I created a backlog for myself and worked my way through it one issue at a time. I'm going to continue with that.

  2. Asynchronous D-Bus methods

    While the ability to make D-Bus method calls asynchronously from clients is very important (you don't want to block the UI for your IPC), it would also be very nice for services to be able to handle method calls asynchronously as well. Brian Anderson from Mozilla was working on this during the hackfest. His approach was to hack the dbus crate to add an async API through the use of the tokio crate. I spent most of the second day of the hackfest sitting next to Brian for some pair programming. The author of tokio, Alex Crichton, sitting next to us, helped us a lot in understanding the tokio API. In the end, Brian submitted a working proof of concept for client-side async calls, which will hopefully provide a very good basis for David's actual implementation.

  3. Code generation from D-Bus introspection XML

    With both GLib and Qt providing utilities to generate code for handling D-Bus for a decade now, most projects doing D-Bus make use of this. My intention was to look into this during the hackfest but, just before, I found out that David had not only already started this work in the dbus crate but also that his approach is exactly what I'd have gone for. So while I decided not to work on this, I did have lengthy (electronic) conversations with David about how to consolidate code generation with dbus-macros.

    Ideally, the API of the generated code should be very similar to the one you'd manually create using dbus-macros, to make it easy for developers to switch from one approach to another. But since David and I didn't agree with the current dbus-macros approach, I kind of gave up on this goal, at least for now. Once procedural macros stabilize, there is a good chance we will change dbus-macros (though it'll be a completely new version or maybe even a different crate) to make use of them, and we can revisit consolidation of code generation and dbus-macros.
A few weeks prior to the event, I decided to create a new project, gps-share. The aim is to provide the ability to share your (standalone) GPS device from your laptop/desktop to other devices on the network and at the same time add standalone GPS device support into Geoclue (without any new feature code in Geoclue). I decided to write it in Rust for a few reasons, one of them being my desire to learn enough about the language before the event (I hadn't written any serious/complicated code in Rust before) and another one was to have an actual test case for D-Bus adventures (it's supposed to talk to Avahi on D-Bus). I'm glad that I did that, since I encountered a few issues with dbus-macros when using them in gps-share and the awesome Mozilla folks were able to help me figure them out very quickly. Otherwise it would have taken me a very long time to figure out the issues.




On the last day of the hackfest, after a delicious lunch, we decided to go for a long stroll around Mexico City and hang out in the park, where we had more interesting conversations about life, the universe and everything (including Rust and GNOME).

After the hackfest, I stayed around for 3 more days. On Saturday, I mostly hung out with Federico, Christian, Antoni and Joaquín. We walked around the city center and watched Federico and Joaquín interviewed by Rancho Electronico folks. I was really excited to see that they use GNOME for their desktop and GStreamer for streaming. The guy handling the streaming was very glad to meet someone with GStreamer experience.

On Sunday, I rented a car and went on a hike at Tepoztlán with Felipe. Driving in Mexico wasn't easy, so having a Mexican with me helped a lot.


And on Monday, we drove to the Sun pyramid.


I would like to thank both the GNOME Foundation and my employer, Pelagicore, for sponsoring my participation in this event.


by noreply@blogger.com (zeenix) at April 06, 2017 11:31 AM

April 05, 2017

Thomas Huth

KVM with SELinux on a z/VM s390x machine

When you are trying to start a KVM guest via libvirt on an s390x Linux installation that is running on an older version of z/VM, you might run into the problem that QEMU refuses to start with this error message:

cannot set up guest memory 's390.ram': Permission denied.

This happens because older versions of z/VM (before version 6.3) do not support the so-called “enhanced suppression on protection facility” (ESOP) yet, so QEMU has to allocate the memory for the guest with a “hack”, and this hack uses mmap(… PROT_EXEC …) for the allocation.

Now this mmap() call is not allowed by the default SELinux rules (at least not on RHEL-based systems), so QEMU fails to allocate the memory for the guest here. Turning off SELinux completely just to run a KVM guest is of course a bad idea, but fortunately there is already a SELinux boolean value called virt_use_execmem which can be used to tune the behavior here:

setsebool virt_use_execmem 1

This configuration switch was originally introduced for running TCG guests (i.e. running QEMU without KVM), but in this case it also fixes the problem with KVM guests. However, since setting this SELinux boolean to 1 also slightly decreases security (though it is still better than disabling SELinux completely), you had better upgrade your z/VM to version 6.3 (or newer), or use a real LPAR for the KVM host installation instead, if that is feasible.
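
Before flipping the boolean, it can help to check its current state. The following is only a sketch: it assumes that getsebool prints lines of the form `name --> on` or `name --> off`, and the helper names `parse_getsebool` and `execmem_allowed` are made up for illustration.

```python
import subprocess


def parse_getsebool(output):
    """Parse 'name --> on|off' lines into a {name: bool} dict."""
    states = {}
    for line in output.splitlines():
        name, sep, value = line.partition("-->")
        if not sep:
            continue  # skip anything that is not a boolean line
        states[name.strip()] = value.strip() == "on"
    return states


def execmem_allowed():
    """Ask getsebool whether virt_use_execmem is on (needs the SELinux tools)."""
    result = subprocess.run(
        ["getsebool", "virt_use_execmem"],
        capture_output=True, text=True, check=True,
    )
    return parse_getsebool(result.stdout).get("virt_use_execmem", False)
```

Note that a plain setsebool call only changes the running policy; adding the -P option makes the setting persist across reboots.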

April 05, 2017 12:20 PM

March 24, 2017

Cole Robinson

Easy qemu commandline passthrough with virt-xml

Libvirt has supported qemu commandline option passthrough for qemu/kvm VMs for quite a while. The format for it is a bit of a pain though since it requires setting a magic xmlns value at the top of the domain XML. Basically doing it by hand kinda sucks.

In the recently released virt-manager 1.4.1, we added a virt-install/virt-xml option --qemu-commandline that tweaks option passthrough for new or existing VMs. So for example, if you wanted to add the qemu option string '-device FOO' to an existing VM named f25, you can do:

  ./virt-xml f25 --edit --confirm --qemu-commandline="-device FOO"

The output will look like:

--- Original XML
+++ Altered XML
@@ -1,4 +1,4 @@
-<domain type="kvm">
+<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
<name>f25</name>
<uuid>9b6f1795-c88b-452a-a54c-f8579ddc18dd</uuid>
<memory unit="KiB">4194304</memory>
@@ -104,4 +104,8 @@
<address type="pci" domain="0x0000" bus="0x00" slot="0x0a" function="0x0"/>
</rng>
</devices>
+ <qemu:commandline>
+ <qemu:arg value="-device"/>
+ <qemu:arg value="FOO"/>
+ </qemu:commandline>
</domain>

Define 'f25' with the changed XML? (y/n):
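
To illustrate what the "magic xmlns value" amounts to, here is a rough sketch that performs a similar transformation on a domain XML string with Python's standard ElementTree module. It is not how virt-xml is implemented; the function name `add_qemu_args` is hypothetical, and only the namespace URI and element names are taken from the diff above.

```python
import xml.etree.ElementTree as ET

# Namespace URI as it appears in the diff above.
QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"


def add_qemu_args(domain_xml, args):
    """Append <qemu:arg> entries to a domain XML string and return the result."""
    # Make the serializer emit the conventional "qemu" prefix.
    ET.register_namespace("qemu", QEMU_NS)
    root = ET.fromstring(domain_xml)
    cmdline = root.find("{%s}commandline" % QEMU_NS)
    if cmdline is None:
        cmdline = ET.SubElement(root, "{%s}commandline" % QEMU_NS)
    for arg in args:
        ET.SubElement(cmdline, "{%s}arg" % QEMU_NS, {"value": arg})
    return ET.tostring(root, encoding="unicode")


print(add_qemu_args('<domain type="kvm"><name>f25</name></domain>',
                    ["-device", "FOO"]))
```

Each word of the option string becomes its own qemu:arg element, which is why '-device FOO' turns into two entries.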

by Cole Robinson (noreply@blogger.com) at March 24, 2017 10:30 PM

March 19, 2017

QEMU project

QEMU in the blogs: February 2017

Here is a short list of articles and blog posts about QEMU and KVM that were posted last month.

More virtualization blog posts can be found on the virt tools planet.

In other news, QEMU is now in hard freeze for release 2.9.0. The preliminary list of features is on the wiki.

by Paolo Bonzini at March 19, 2017 09:30 AM

March 12, 2017

Richard Jones

Tip: Run virt-inspector on a compressed disk (with nbdkit)

virt-inspector is a very convenient tool to examine a disk image and find out if it contains an operating system, what applications are installed and so on.

If you have an xz-compressed disk image, you can run virt-inspector on it without uncompressing it, using the magic of captive nbdkit. Here’s how:

nbdkit xz file=win7.img.xz \
    -U - \
    --run 'virt-inspector --format=raw -a nbd://?socket=$unixsocket'

What’s happening here is we run nbdkit with the xz plugin, and tell it to serve NBD over a randomly named Unix domain socket (-U -).

We then run virt-inspector as a sub-process. This is called “captive nbdkit”. (Nbdkit is “captive” here, because it will exit as soon as virt-inspector exits, so there’s no need to clean anything up.)

The $unixsocket variable expands to the name of the randomly generated Unix domain socket, forming a libguestfs NBD URL which allows virt-inspector to examine the raw uncompressed data exported by nbdkit.

The nbdkit xz plugin only uncompresses those blocks of the data which are actually accessed, so this is quite efficient.
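
If you want to drive the same captive nbdkit from a script, the command can be assembled as an argument list. This is a minimal sketch, assuming nbdkit (with the xz plugin) and virt-inspector are installed; the function names are made up for illustration.

```python
import subprocess


def captive_inspector_argv(xz_image):
    """Build the nbdkit argv that runs virt-inspector as a captive sub-process."""
    # No shell is involved when a list is passed to subprocess, so the
    # literal text $unixsocket reaches nbdkit unexpanded; nbdkit fills in
    # the socket path itself when it runs the --run command.
    run_cmd = "virt-inspector --format=raw -a nbd://?socket=$unixsocket"
    return ["nbdkit", "xz", "file=" + xz_image, "-U", "-", "--run", run_cmd]


def inspect_compressed(xz_image):
    """Run the captive nbdkit and return whatever virt-inspector printed."""
    result = subprocess.run(
        captive_inspector_argv(xz_image),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

The single quotes in the shell version serve the same purpose as the list form here: they stop your own shell from expanding $unixsocket before nbdkit sees it.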


by rich at March 12, 2017 03:44 PM

March 08, 2017

Cole Robinson

virt-manager 1.4.1 released!

I've just released virt-manager 1.4.1. The highlights are:
  • storage/nodedev event API support (Jovanka Gulicoska)
  • UI options for enabling spice GL (Marc-André Lureau)
  • Add default virtio-rng /dev/urandom for supported guest OS
  • Cloning and rename support for UEFI VMs (Pavel Hrdina)
  • libguestfs inspection UI improvements (Pino Toscano)
  • virt-install: Add --qemu-commandline
  • virt-install: Add --network vhostuser (Chen Hanxiao)
  • virt-install: Add --sysinfo (Charles Arnold)
Plus the usual slew of bug fixes and small improvements.

    by Cole Robinson (noreply@blogger.com) at March 08, 2017 07:15 PM

    Cédric Bosdonnat

    System container images

    As of today, creating a libvirt LXC system container root file system is a pain. Docker's fun came with its image sharing idea... why couldn't we do the same for libvirt containers? What I present here is an attempt at this.

    To achieve such a goal we need:

    • container images
    • something to share them
    • a tool to pull and use them

    Container images

    OpenBuildService, thanks to kiwi, knows how to create images, even container images. There even are openSUSE Docker images. To use them as system container images, some more packages need to be added to those. I thus forked the project on github and branched the OBS projects to get system container images for 42.1, 42.2 and Tumbleweed.

    Using them is as simple as downloading them, unpacking them and using them as a container's root file system. However, sharing them would be so much fun!

    Sharing images

    There is no need to reinvent the wheel to share the images. We can just consider them like any docker image. With the following commands we can import the image and push it to a remote registry.

    docker import openSUSE-42.2-syscontainer-guest-docker.x86_64.tar.xz system/opensuse-42.2
    docker tag system/opensuse-42.2 myregistry:5000/system/opensuse-42.2
    docker login myregistry:5000
    docker push myregistry:5000/system/opensuse-42.2
    

    The good thing with this is that we can even use the docker build and Dockerfile magic to create customized images and push them to the remote repository.

    Instantiating containers

    Now we need a tool to get the images from the remote docker registry. Fortunately there is a tool that helps a lot with this: skopeo. I wrote a small virt-bootstrap tool using it to instantiate the images as root file systems.

    Here is what instantiating a container looks like with it:

    virt-bootstrap.py --username myuser \
                      --root-password test \
                      docker://myregistry:5000/system/opensuse-42.2 /path/to/my/container
    
    virt-install --connect lxc:/// -n 422 --memory 250 --vcpus 1 \
                    --filesystem /path/to/my/container,/ \
                    --filesystem /etc/resolv.conf,/etc/resolv.conf \
                    --network network=default
    

    And voila! Creating an openSUSE 42.2 system container and running it with libvirt is now super easy!

    by Cédric Bosdonnat at March 08, 2017 03:37 PM

    March 07, 2017

    Zeeshan Ali Khattak

    GDP meets GSoC

    Are you a student? Passionate about Open Source? Want your code to run on the next generation of automobiles? You're in luck! Genivi Development Platform will be participating in Google Summer of Code this summer and you are welcome to participate. We have collected a bunch of ideas for what would be a good 3 month project for a student, but you're more than welcome to suggest your own project. The ideas page also has instructions on how to get started with GDP.

    We look forward to your participation!

    by noreply@blogger.com (zeenix) at March 07, 2017 06:11 PM

    March 06, 2017

    Eduardo Habkost

    The long story of the query-cpu-model-expansion QEMU interface

    So, finally the query-cpu-model-expansion x86 implementation was merged into qemu.git, just before 2.9 soft freeze. Jiri Denemark already implemented the x86 libvirt code to use it. I just can’t believe this was finally done after so many years.

    It was a weird journey. It started almost 6 years ago with this message to qemu-devel:

    Date: Fri, 10 Jun 2011 18:36:37 -0300
    Subject: semantics of “-cpu host” and “check”/“enforce”

    …it continued on an interesting thread:

    Date: Tue, 6 Mar 2012 15:27:53 -0300
    Subject: Qemu, libvirt, and CPU models

    …on another very long one:

    Date: Fri, 9 Mar 2012 17:56:52 -0300
    Subject: Re: [Qemu-devel] [libvirt] Modern CPU models cannot be used with libvirt

    …and this one:

    Date: Thu, 21 Feb 2013 11:58:18 -0300
    Subject: libvirt<->QEMU interfaces for CPU models

    I don’t even remember how many different interfaces were proposed to provide what libvirt needed.

    We had a few moments where we hopped back and forth between “just let libvirt manage everything” and “let’s keep this managed by QEMU”.

    We took a while to get the QEMU community to decide how machine-type compatibility was supposed to be handled, and what to do with the weird CPU model config file we had.

    The conversion of CPUs to QOM was fun. I think it started in 2012 and was finished only in 2015. We thought QOM properties would solve all our problems, but then we found out that machine-types and global properties make the problem more complex. The existing interfaces would require making libvirt re-run QEMU multiple times to gather all the information it needed. While doing the QOM work, we spent some time fixing or working around issues with global properties, qdev “static” properties and QOM “dynamic” properties.

    In 2014, my focus was moved to machine-types, in the hope that we could finally expose machine-type-specific information to libvirt without re-running QEMU. Useful code refactoring was done for that, but in the end we never added the really relevant information to the query-machines QMP command.

    In the meantime, we had the fun TSX issues, and QEMU developers finally agreed to keep a few constraints on CPU model changes, that would make the problem a bit simpler.

    In 2015 IBM people started sending patches related to CPU models in s390x. We finally had a multi-architecture effort to make CPU model probing work. The work started by extending query-cpu-definitions, but it was not enough. In June 2016 they proposed a query-cpu-model-expansion API. It was finally merged in September 2016.

    I sent v1 of query-cpu-model-expansion for x86 in December 2016. After a few rounds of reviews, there was a proposal to use “-cpu max” to represent the “all features supported by this QEMU binary on this host”. v3 of the series was merged last week.
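
For readers who have not seen the interface, here is a small sketch of what a query-cpu-model-expansion request might look like on the QMP wire. The argument names (`type`, and `model` with a `name`) and the expansion types ("static", "full") reflect my understanding of the QMP schema and should be checked against the QEMU documentation; the helper function is made up for illustration.

```python
import json


def cpu_model_expansion_command(model_name, expansion_type="static"):
    """Build a query-cpu-model-expansion QMP command as a JSON string."""
    if expansion_type not in ("static", "full"):
        raise ValueError("expansion type must be 'static' or 'full'")
    command = {
        "execute": "query-cpu-model-expansion",
        "arguments": {
            "type": expansion_type,
            "model": {"name": model_name},
        },
    }
    return json.dumps(command)


# Ask what "-cpu max" expands to on this QEMU binary and host.
print(cpu_model_expansion_command("max"))
```

The string would be sent over an already negotiated QMP connection; opening and handshaking that connection is left out of the sketch.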

    I still can’t believe it finally happened.

    Special thanks to:

    • Igor Mammedov, for all the x86 QOM/properties work and all the valuable feedback.
    • David Hildenbrand and Michael Mueller, for moving forward the API design and the s390x implementation.
    • Jiri Denemark, for the libvirt work, valuable discussions and design feedback, and for the patience during the process.
    • Daniel P. Berrangé, for the valuable feedback and for helping make QEMU developers listen to libvirt developers.
    • Andreas Färber, for the work as maintainer of QOM and the CPU core, for leading the QOM conversion effort, and all the valuable feedback.
    • Markus Armbruster and Paolo Bonzini, for valuable feedback on design discussions.
    • Many others that were involved in the effort.

    by Eduardo Habkost at March 06, 2017 05:00 PM

    Video and slides for FOSDEM 2017 talk: QEMU Internal APIs

    The slides and videos for my FOSDEM 2017 talk (QEMU: internal APIs and conflicting world views) are available online.

    The subject I tried to cover is large for a 40-minute talk, but I think I managed to scratch its surface and give useful examples.

    Many thanks to the FOSDEM team of volunteers for the wonderful event.

    by Eduardo Habkost at March 06, 2017 03:00 AM

    February 24, 2017

    Ladi Prosek

    Running Hyper-V in a QEMU/KVM Guest

    This article provides a how-to on setting up nested virtualization, in particular running Microsoft Hyper-V as a guest of QEMU/KVM. The usual terminology is going to be used in the text: L0 is the bare-metal host running Linux with KVM and QEMU. L1 is L0’s guest, running Microsoft Windows Server 2016 with the Hyper-V role enabled. And L2 is L1’s guest, a virtual machine running Linux, Windows, or anything else. Only Intel hardware is considered here. It is possible that the same can be achieved with AMD’s hardware virtualization support but it has not been tested yet.

    Update 4/2017: AMD is broken, fix is coming.

    A quick note on performance. Since the Intel VMX technology does not directly support nested virtualization in hardware, what L1 perceives as hardware-accelerated virtualization is in fact software emulation of VMX by L0. Thus, workloads will inevitably run slower in L2 compared to L1.

    Kernel / KVM

    A fairly recent kernel is required for Hyper-V on QEMU/KVM to function properly. The first commit known to work is 1dc35da, available in Linux 4.10 and newer.

    Nested Intel virtualization must be enabled. If the following command does not return “Y”, kvm-intel.nested=1 must be passed to the kernel as a parameter.

    $ cat /sys/module/kvm_intel/parameters/nested
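
The same check can be scripted. The snippet below is a minimal sketch; the function name is hypothetical, and accepting "1" in addition to "Y" is an assumption about how other kernel versions may print the parameter.

```python
from pathlib import Path

NESTED_PARAM = Path("/sys/module/kvm_intel/parameters/nested")


def nested_enabled(param_path=NESTED_PARAM):
    """Return True if the kvm_intel 'nested' parameter reports enabled."""
    try:
        value = Path(param_path).read_text().strip()
    except OSError:
        return False  # module not loaded, or not an Intel host
    # The article expects "Y"; accepting "1" as well is an assumption.
    return value in ("Y", "1")


if not nested_enabled():
    print("Pass kvm-intel.nested=1 to the kernel to enable nested virtualization")
```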
    

    Update 4/2017: On newer Intel CPUs with PML (Page Modification Logging) support such as Kaby Lake, Skylake, and some server Broadwell chips, PML needs to be disabled by passing kvm-intel.pml=0 to the kernel as a parameter. Fix is coming.

    QEMU

    QEMU 2.7 should be enough to make nested virtualization work. As always, it is advisable to use the latest stable version available. SeaBIOS version 1.10 or later is required.

    The QEMU command line must include the +vmx cpu feature, for example:

    -cpu SandyBridge,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time,+vmx
    

    If QEMU warns about the vmx feature not being available on the host, nested virt has likely not been enabled in KVM (see the previous paragraph).

    Hyper-V

    Once the Windows L1 guest is installed, add the Hyper-V role as usual. Only Windows Server 2016 is known to support nested virtualization at the moment.

    If Windows complains about missing HW virtualization support, re-check QEMU and SeaBIOS versions. If the Hyper-V role is already installed and nested virt is misconfigured or not supported, the error shown by Windows tends to mention “Hyper-V components not running” like in the following screenshot.

    If everything goes well, both Gen 1 and Gen 2 Hyper-V virtual machines can be created and started. Here’s a screenshot of Windows XP 64-bit running as a guest in Windows Server 2016, which itself is a guest in QEMU/KVM.


    by ladipro at February 24, 2017 01:57 PM

    February 22, 2017

    Gerd Hoffmann

    vconsole 0.7 released

    vconsole is a virtual machine (serial) console manager, look here for details.

    No big changes from 0.6.

    Fetch the tarball here.

    by Gerd Hoffmann at February 22, 2017 08:15 AM

    February 21, 2017

    Gerd Hoffmann

    Fedora 25 images for qemu and raspberry pi 3 uploaded

    I’ve uploaded three new images to https://www.kraxel.org/repos/rpi2/images/.

    The fedora-25-rpi3 image is for the raspberry pi 3.
    The fedora-25-efi images are for qemu (virt machine type with edk2 firmware).

    The images don’t have a root password set. You must use libguestfs-tools to set the root password …

    virt-customize -a <image> --root-password "password:<your-password-here>"
    

    … otherwise you can’t login after boot.

    The rpi3 image is partitioned similarly to the official (armv7) fedora 25 images: The firmware is on a separate vfat partition for only firmware and uboot (mounted at /boot/fw). /boot is an ext2 filesystem now and holds the kernels only. Well, for compatibility reasons with the f24 images (all images use the same kernel rpms) firmware files are in /boot too, but they are not used. So, if you want to tweak something in config.txt, go to /boot/fw, not /boot.

    The rpi3 images also have swap commented out in /etc/fstab. The reason is that the swap partition must be reinitialized, because swap partitions generated when running on a 64k pages kernel (CONFIG_ARM64_64K_PAGES=y) are not compatible with 4k pages (CONFIG_ARM64_4K_PAGES=y). This must be fixed by running “swapon --fixpgsz <device>” once; then you can uncomment the swap line in /etc/fstab.

    by Gerd Hoffmann at February 21, 2017 08:32 AM

    February 16, 2017

    Daniel Berrange

    Setting up a nested KVM guest for developing & testing PCI device assignment with NUMA

    Over the past few years the OpenStack Nova project has gained support for managing VM usage of NUMA, huge pages and PCI device assignment. One of the more challenging aspects of this is availability of hardware to develop and test against. In the ideal world it would be possible to emulate everything we need using KVM, enabling developers / test infrastructure to exercise the code without needing access to bare metal hardware supporting these features. KVM has long had support for emulating NUMA topology in guests, and the guest OS can use huge pages inside the guest. What was missing were pieces around PCI device assignment, namely IOMMU support and the ability to associate NUMA nodes with PCI devices. Coincidentally, a QEMU community member was already working on providing emulation of the Intel IOMMU. I made a request to the Red Hat KVM team to fill in the other missing gap related to NUMA / PCI device association. Doing this required writing code to emulate a PCI/PCI-E Expander Bridge (PXB) device, which provides a lightweight host bridge that can be associated with a NUMA node. Individual PCI devices are then attached to this PXB instead of the main PCI host bridge, thus gaining affinity with a NUMA node. With this, it is now possible to configure a KVM guest such that it can be used as a virtual host to test NUMA, huge page and PCI device assignment integration. The only real outstanding gap is support for emulating some kind of SRIOV network device, but even without this, it is still possible to test most of the Nova PCI device assignment logic; we’re merely restricted to using physical functions, no virtual functions. This blog post will describe how to configure such a virtual host.

    First of all, this requires very new libvirt & QEMU to work, specifically you’ll want libvirt >= 2.3.0 and QEMU 2.7.0. We could technically support earlier QEMU versions too, but that’s pending on a patch to libvirt to deal with some command line syntax differences in QEMU for older versions. No currently released Fedora has new enough packages available, so even on Fedora 25, you must enable the “Virtualization Preview” repository on the physical host to try this out – F25 has new enough QEMU, so you just need a libvirt update.

    # curl --output /etc/yum.repos.d/fedora-virt-preview.repo https://fedorapeople.org/groups/virt/virt-preview/fedora-virt-preview.repo
    # dnf upgrade

    For the sake of illustration I’m using Fedora 25 as the OS inside the virtual guest, but any other Linux OS will do just fine. The initial task is to install a guest with 8 GB of RAM & 8 CPUs using virt-install:

    # cd /var/lib/libvirt/images
    # wget -O f25x86_64-boot.iso https://download.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/images/boot.iso
    # virt-install --name f25x86_64  \
        --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
        --cdrom f25x86_64-boot.iso --os-variant fedora23 \
        --ram 8000 --vcpus 8 \
        ...

    The guest needs to use host CPU passthrough to ensure it gets to see VMX as well as other modern instructions, and it needs 3 virtual NUMA nodes. The first guest NUMA node will have 4 CPUs and 4 GB of RAM, while the second and third NUMA nodes will each have 2 CPUs and 2 GB of RAM. We are just going to let the guest float freely across host NUMA nodes since we don’t care about performance for dev/test, but in production you would certainly pin each guest NUMA node to a distinct host NUMA node.

        ...
        --cpu host,cell0.id=0,cell0.cpus=0-3,cell0.memory=4096000,\
                   cell1.id=1,cell1.cpus=4-5,cell1.memory=2048000,\
                   cell2.id=2,cell2.cpus=6-7,cell2.memory=2048000 \
        ...
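
Writing the cell list by hand gets error-prone as the number of nodes grows. As a rough sketch, the --cpu value above could be generated from a description of the layout; the function name and the tuple format are made up for illustration, and the memory figures are simply the ones used in this post.

```python
def numa_cpu_option(cells, model="host"):
    """Build a virt-install --cpu value from (cpus, memory) pairs."""
    parts = [model]
    for cell_id, (cpus, memory) in enumerate(cells):
        prefix = "cell%d" % cell_id
        parts.append("%s.id=%d" % (prefix, cell_id))
        parts.append("%s.cpus=%s" % (prefix, cpus))
        parts.append("%s.memory=%d" % (prefix, memory))
    return ",".join(parts)


# One node with 4 CPUs, two nodes with 2 CPUs each, as described above.
layout = [("0-3", 4096000), ("4-5", 2048000), ("6-7", 2048000)]
print("--cpu " + numa_cpu_option(layout))
```

The result is a single comma-separated string, equivalent to the backslash-continued form shown in the command line.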
    

    QEMU emulates various different chipsets and historically for x86, the default has been to emulate the ancient PIIX4 (it is 20+ years old, dating from circa 1995). Unfortunately this is too ancient to be able to use the Intel IOMMU emulation with, so it is necessary to tell QEMU to emulate the marginally less ancient chipset Q35 (it is only 9 years old, dating from 2007).

        ...
        --machine q35
    

    The complete virt-install command line thus looks like

    # virt-install --name f25x86_64  \
        --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
        --cdrom f25x86_64-boot.iso --os-variant fedora23 \
        --ram 8000 --vcpus 8 \
        --cpu host,cell0.id=0,cell0.cpus=0-3,cell0.memory=4096000,\
                   cell1.id=1,cell1.cpus=4-5,cell1.memory=2048000,\
                   cell2.id=2,cell2.cpus=6-7,cell2.memory=2048000 \
        --machine q35
    

    Once the installation is completed, shut down this guest since it will be necessary to make a number of changes to the guest XML configuration to enable features that virt-install does not know about, using “virsh edit”. With the use of Q35, the guest XML should initially show three PCI controllers present: a “pcie-root”, a “dmi-to-pci-bridge” and a “pci-bridge”.

    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='dmi-to-pci-bridge'>
      <model name='i82801b11-bridge'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
    </controller>
    <controller type='pci' index='2' model='pci-bridge'>
      <model name='pci-bridge'/>
      <target chassisNr='2'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </controller>
    

    PCI endpoint devices are not themselves associated with NUMA nodes, rather the bus they are connected to has affinity. The default pcie-root is not associated with any NUMA node, but extra PCI-E Expander Bridge controllers can be added and associated with a NUMA node. So while in edit mode, add the following to the XML config

    <controller type='pci' index='3' model='pcie-expander-bus'>
      <target busNr='180'>
        <node>0</node>
      </target>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='4' model='pcie-expander-bus'>
      <target busNr='200'>
        <node>1</node>
      </target>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='5' model='pcie-expander-bus'>
      <target busNr='220'>
        <node>2</node>
      </target>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    

    It is not possible to plug PCI endpoint devices directly into the PXB, so the next step is to add PCI-E root ports into each PXB. We’ll need one port per device to be added, so 9 ports in total. This is where the requirement for libvirt >= 2.3.0 comes from: earlier versions mistakenly prevented you from adding more than one root port to the PXB.

    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='6' port='0x0'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='7' port='0x8'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='8' port='0x10'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='9' port='0x0'/>
      <alias name='pci.9'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='10' port='0x8'/>
      <alias name='pci.10'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='11' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='11' port='0x10'/>
      <alias name='pci.11'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='12' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='12' port='0x0'/>
      <alias name='pci.12'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='13' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='13' port='0x8'/>
      <alias name='pci.13'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='14' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='14' port='0x10'/>
      <alias name='pci.14'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x02' function='0x0'/>
    </controller>
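
    The nine root ports follow a regular pattern, so they too can be generated. This is only a sketch under the assumptions visible in the XML above: three ports per expander bus, chassis equal to the controller index, and port numbers stepping 0x0, 0x8, 0x10. The function name is hypothetical, and the <alias> element is left out on the assumption that libvirt fills it in itself.

```python
import xml.etree.ElementTree as ET


def root_ports(parent_indexes=(3, 4, 5), ports_per_parent=3, first_index=6):
    """Return pcie-root-port <controller> elements, grouped under each parent."""
    ports = []
    index = first_index
    for parent in parent_indexes:
        for slot in range(ports_per_parent):
            ctrl = ET.Element("controller", type="pci", index=str(index),
                              model="pcie-root-port")
            ET.SubElement(ctrl, "model", name="ioh3420")
            # Port numbers follow the 0x0, 0x8, 0x10 pattern used in the post.
            ET.SubElement(ctrl, "target", chassis=str(index),
                          port=hex(slot * 8))
            # The bus attribute is the index of the parent expander bus.
            ET.SubElement(ctrl, "address", type="pci", domain="0x0000",
                          bus=f"0x{parent:02x}", slot=f"0x{slot:02x}",
                          function="0x0")
            ports.append(ctrl)
            index += 1
    return ports


if __name__ == "__main__":
    for ctrl in root_ports():
        print(ET.tostring(ctrl, encoding="unicode"))
```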
    

    Notice that the value of the ‘bus’ attribute on each <address> element matches the value of the ‘index’ attribute on the <controller> element of the parent device in the topology. The PCI controller topology now looks like this

    pcie-root (index == 0)
      |
      +- dmi-to-pci-bridge (index == 1)
      |    |
      |    +- pci-bridge (index == 2)
      |
      +- pcie-expander-bus (index == 3, numa node == 0)
      |    |
      |    +- pcie-root-port (index == 6)
      |    +- pcie-root-port (index == 7)
      |    +- pcie-root-port (index == 8)
      |
      +- pcie-expander-bus (index == 4, numa node == 1)
      |    |
      |    +- pcie-root-port (index == 9)
      |    +- pcie-root-port (index == 10)
      |    +- pcie-root-port (index == 11)
      |
      +- pcie-expander-bus (index == 5, numa node == 2)
           |
           +- pcie-root-port (index == 12)
           +- pcie-root-port (index == 13)
           +- pcie-root-port (index == 14)
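
    Since every ‘bus’ value has to name the ‘index’ of some controller, a typo in one address is easy to make and hard to spot. The following is a small, hypothetical checker (not something libvirt ships) that reports any PCI address whose bus does not match a PCI controller index in the same XML fragment:

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<devices>
  <controller type='pci' index='0' model='pcie-root'/>
  <controller type='pci' index='3' model='pcie-expander-bus'>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
  </controller>
  <controller type='pci' index='6' model='pcie-root-port'>
    <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
  </controller>
  <interface type='user'>
    <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
  </interface>
</devices>
"""


def dangling_buses(xml_text):
    """Return (tag, bus) pairs whose bus matches no PCI controller index."""
    root = ET.fromstring(xml_text)
    indexes = {int(c.get("index"))
               for c in root.iter("controller") if c.get("type") == "pci"}
    problems = []
    for elem in root.iter():
        addr = elem.find("address")
        if addr is None or addr.get("type") != "pci":
            continue
        bus = int(addr.get("bus"), 16)  # accepts the '0x..' form used by libvirt
        if bus not in indexes:
            problems.append((elem.tag, bus))
    return problems


if __name__ == "__main__":
    print(dangling_buses(SAMPLE))
```

    In the sample the interface points at bus 0x07, for which no controller is defined, so it is the only entry reported.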
    

    All the existing devices are attached to the “pci-bridge” (the controller with index == 2). The devices we intend to use for PCI device assignment inside the virtual host will be attached to the new “pcie-root-port” controllers. We will provide 3 e1000e NICs per NUMA node, so that’s 9 devices in total to add

    <interface type='user'>
      <mac address='52:54:00:7e:6e:c6'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:c7'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:c8'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:d6'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:d7'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:d8'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:e6'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0c' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:e7'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:e8'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0e' slot='0x00' function='0x0'/>
    </interface>
    

    Note that we’re using the “user” networking, aka SLIRP. Normally one would never want to use SLIRP but we don’t care about actually sending traffic over these NICs, and so using SLIRP avoids polluting our real host with countless TAP devices.

    The final configuration change is to simply add the Intel IOMMU device

    <iommu model='intel'/>
    

    It is a capability integrated into the chipset, so it does not need any <address> element of its own. At this point, save the config and start the guest once more. Use the “virsh domifaddr” command to discover the IP address of the guest’s primary NIC and ssh into it.

    # virsh domifaddr f25x86_64
     Name       MAC address          Protocol     Address
    -------------------------------------------------------------------------------
     vnet0      52:54:00:10:26:7e    ipv4         192.168.122.3/24
    
    # ssh root@192.168.122.3
    

    We can now do some sanity checks that everything visible in the guest matches what was enabled in the libvirt XML config in the host. For example, confirm the NUMA topology shows 3 nodes

    # dnf install numactl
    # numactl --hardware
    available: 3 nodes (0-2)
    node 0 cpus: 0 1 2 3
    node 0 size: 3856 MB
    node 0 free: 3730 MB
    node 1 cpus: 4 5
    node 1 size: 1969 MB
    node 1 free: 1813 MB
    node 2 cpus: 6 7
    node 2 size: 1967 MB
    node 2 free: 1832 MB
    node distances:
    node   0   1   2 
      0:  10  20  20 
      1:  20  10  20 
      2:  20  20  10 
    

    Confirm that the PCI topology shows the three PCI-E Expander Bridge devices, each with three NICs attached

    # lspci -t -v
    -+-[0000:dc]-+-00.0-[dd]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           +-01.0-[de]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           \-02.0-[df]----00.0  Intel Corporation 82574L Gigabit Network Connection
     +-[0000:c8]-+-00.0-[c9]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           +-01.0-[ca]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           \-02.0-[cb]----00.0  Intel Corporation 82574L Gigabit Network Connection
     +-[0000:b4]-+-00.0-[b5]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           +-01.0-[b6]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           \-02.0-[b7]----00.0  Intel Corporation 82574L Gigabit Network Connection
     \-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
                 +-01.0  Red Hat, Inc. QXL paravirtual graphic card
                 +-02.0  Red Hat, Inc. Device 000b
                 +-03.0  Red Hat, Inc. Device 000b
                 +-04.0  Red Hat, Inc. Device 000b
                 +-1d.0  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
                 +-1d.1  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
                 +-1d.2  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
                 +-1d.7  Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
                 +-1e.0-[01-02]----01.0-[02]--+-01.0  Red Hat, Inc Virtio network device
                 |                            +-02.0  Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller
                 |                            +-03.0  Red Hat, Inc Virtio console
                 |                            +-04.0  Red Hat, Inc Virtio block device
                 |                            \-05.0  Red Hat, Inc Virtio memory balloon
                 +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
                 +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
                 \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller
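
    The bus numbers in the lspci output are the busNr values from the XML, printed in hex: 180, 200 and 220 become b4, c8 and dc. A tiny sketch of that arithmetic follows. It assumes, based only on the output above, that the root ports behind each expander bus receive the next consecutive bus numbers; the function name is made up for illustration.

```python
def guest_bus_numbers(bus_nr, root_ports=3):
    """Map a PXB busNr to the hex bus numbers lspci shows for it and its ports."""
    pxb = f"{bus_nr:02x}"
    # Assumption: each root port gets the next bus number after the PXB itself.
    ports = [f"{bus_nr + 1 + i:02x}" for i in range(root_ports)]
    return pxb, ports


if __name__ == "__main__":
    for bus_nr in (180, 200, 220):
        print(bus_nr, guest_bus_numbers(bus_nr))
```

    The same conversion explains interface names seen later such as enp181s0, where 181 is the decimal form of bus b5.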
    

    The IOMMU support will not be enabled yet as the kernel defaults to leaving it off. To enable it, we must update the kernel command line parameters with grub.

    # vi /etc/default/grub
    ....add "intel_iommu=on"...
    # grub2-mkconfig > /etc/grub2.cfg
    

    While the intel-iommu device in QEMU can do interrupt remapping, there is no way to enable that feature via libvirt at this time. So we need to set a hack for vfio

    echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > \
      /etc/modprobe.d/vfio.conf
    

    This is also a good time to install libvirt and KVM inside the guest

    # dnf groupinstall "Virtualization"
    # dnf install libvirt-client
    # rm -f /etc/libvirt/qemu/networks/autostart/default.xml
    

    Note we’re disabling the default libvirt network, since it’ll clash with the IP address range used by this guest. An alternative would be to edit the default.xml to change the IP subnet.

    Now reboot the guest. When it comes back up, there should be a /dev/kvm device present in the guest.

    # ls -al /dev/kvm
    crw-rw-rw-. 1 root kvm 10, 232 Oct  4 12:14 /dev/kvm
    

    If this is not the case, make sure the physical host has nested virtualization enabled for the “kvm-intel” or “kvm-amd” kernel modules.

    The IOMMU should have been detected and activated

    # dmesg  | grep -i DMAR
    [    0.000000] ACPI: DMAR 0x000000007FFE2541 000048 (v01 BOCHS  BXPCDMAR 00000001 BXPC 00000001)
    [    0.000000] DMAR: IOMMU enabled
    [    0.203737] DMAR: Host address width 39
    [    0.203739] DMAR: DRHD base: 0x000000fed90000 flags: 0x1
    [    0.203776] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 12008c22260206 ecap f02
    [    2.910862] DMAR: No RMRR found
    [    2.910863] DMAR: No ATSR found
    [    2.914870] DMAR: dmar0: Using Queued invalidation
    [    2.914924] DMAR: Setting RMRR:
    [    2.914926] DMAR: Prepare 0-16MiB unity mapping for LPC
    [    2.915039] DMAR: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
    [    2.915140] DMAR: Intel(R) Virtualization Technology for Directed I/O
    

    The key message confirming everything is good is the last line there – if that’s missing something went wrong – don’t be misled by the earlier “DMAR: IOMMU enabled” line which merely says the kernel saw the “intel_iommu=on” command line option.

    The IOMMU should also have registered the PCI devices into various groups

    # dmesg  | grep -i iommu  |grep device
    [    2.915212] iommu: Adding device 0000:00:00.0 to group 0
    [    2.915226] iommu: Adding device 0000:00:01.0 to group 1
    ...snip...
    [    5.588723] iommu: Adding device 0000:b5:00.0 to group 14
    [    5.588737] iommu: Adding device 0000:b6:00.0 to group 15
    [    5.588751] iommu: Adding device 0000:b7:00.0 to group 16
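
    Rather than grepping dmesg, the group membership can also be read from sysfs, where the kernel exposes one directory per group under /sys/kernel/iommu_groups. The helper below is a minimal sketch; the function name is my own, and the root path is a parameter so that it can be pointed at a test directory.

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group number: sorted device addresses} read from sysfs."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU not active, or not a Linux host
    for name in os.listdir(root):
        devices = os.path.join(root, name, "devices")
        if name.isdigit() and os.path.isdir(devices):
            groups[int(name)] = sorted(os.listdir(devices))
    return groups


if __name__ == "__main__":
    for number, devices in sorted(iommu_groups().items()):
        print(number, " ".join(devices))
```

    An empty result inside the guest would suggest the IOMMU was not activated, which is the same condition the dmesg check above is looking for.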
    

    Libvirt meanwhile should have detected all the PCI controllers/devices

    # virsh nodedev-list --tree
    computer
      |
      +- net_lo_00_00_00_00_00_00
      +- pci_0000_00_00_0
      +- pci_0000_00_01_0
      +- pci_0000_00_02_0
      +- pci_0000_00_03_0
      +- pci_0000_00_04_0
      +- pci_0000_00_1d_0
      |   |
      |   +- usb_usb2
      |       |
      |       +- usb_2_0_1_0
      |         
      +- pci_0000_00_1d_1
      |   |
      |   +- usb_usb3
      |       |
      |       +- usb_3_0_1_0
      |         
      +- pci_0000_00_1d_2
      |   |
      |   +- usb_usb4
      |       |
      |       +- usb_4_0_1_0
      |         
      +- pci_0000_00_1d_7
      |   |
      |   +- usb_usb1
      |       |
      |       +- usb_1_0_1_0
      |       +- usb_1_1
      |           |
      |           +- usb_1_1_1_0
      |             
      +- pci_0000_00_1e_0
      |   |
      |   +- pci_0000_01_01_0
      |       |
      |       +- pci_0000_02_01_0
      |       |   |
      |       |   +- net_enp2s1_52_54_00_10_26_7e
      |       |     
      |       +- pci_0000_02_02_0
      |       +- pci_0000_02_03_0
      |       +- pci_0000_02_04_0
      |       +- pci_0000_02_05_0
      |         
      +- pci_0000_00_1f_0
      +- pci_0000_00_1f_2
      |   |
      |   +- scsi_host0
      |   +- scsi_host1
      |   +- scsi_host2
      |   +- scsi_host3
      |   +- scsi_host4
      |   +- scsi_host5
      |     
      +- pci_0000_00_1f_3
      +- pci_0000_b4_00_0
      |   |
      |   +- pci_0000_b5_00_0
      |       |
      |       +- net_enp181s0_52_54_00_7e_6e_c6
      |         
      +- pci_0000_b4_01_0
      |   |
      |   +- pci_0000_b6_00_0
      |       |
      |       +- net_enp182s0_52_54_00_7e_6e_c7
      |         
      +- pci_0000_b4_02_0
      |   |
      |   +- pci_0000_b7_00_0
      |       |
      |       +- net_enp183s0_52_54_00_7e_6e_c8
      |         
      +- pci_0000_c8_00_0
      |   |
      |   +- pci_0000_c9_00_0
      |       |
      |       +- net_enp201s0_52_54_00_7e_6e_d6
      |         
      +- pci_0000_c8_01_0
      |   |
      |   +- pci_0000_ca_00_0
      |       |
      |       +- net_enp202s0_52_54_00_7e_6e_d7
      |         
      +- pci_0000_c8_02_0
      |   |
      |   +- pci_0000_cb_00_0
      |       |
      |       +- net_enp203s0_52_54_00_7e_6e_d8
      |         
      +- pci_0000_dc_00_0
      |   |
      |   +- pci_0000_dd_00_0
      |       |
      |       +- net_enp221s0_52_54_00_7e_6e_e6
      |         
      +- pci_0000_dc_01_0
      |   |
      |   +- pci_0000_de_00_0
      |       |
      |       +- net_enp222s0_52_54_00_7e_6e_e7
      |         
      +- pci_0000_dc_02_0
          |
          +- pci_0000_df_00_0
              |
              +- net_enp223s0_52_54_00_7e_6e_e8
    

    And if you look at a specific PCI device, it should report the NUMA node it is associated with and the IOMMU group it is part of

    # virsh nodedev-dumpxml pci_0000_df_00_0
    <device>
      <name>pci_0000_df_00_0</name>
      <path>/sys/devices/pci0000:dc/0000:dc:02.0/0000:df:00.0</path>
      <parent>pci_0000_dc_02_0</parent>
      <driver>
        <name>e1000e</name>
      </driver>
      <capability type='pci'>
        <domain>0</domain>
        <bus>223</bus>
        <slot>0</slot>
        <function>0</function>
        <product id='0x10d3'>82574L Gigabit Network Connection</product>
        <vendor id='0x8086'>Intel Corporation</vendor>
        <iommuGroup number='10'>
          <address domain='0x0000' bus='0xdc' slot='0x02' function='0x0'/>
          <address domain='0x0000' bus='0xdf' slot='0x00' function='0x0'/>
        </iommuGroup>
        <numa node='2'/>
        <pci-express>
          <link validity='cap' port='0' speed='2.5' width='1'/>
          <link validity='sta' speed='2.5' width='1'/>
        </pci-express>
      </capability>
    </device>
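
    If you want to check this for all nine NICs, reading the XML by eye gets old quickly. Here is a sketch that pulls the NUMA node and IOMMU group out of “virsh nodedev-dumpxml” output. The sample is trimmed from the output above, and the function name and the shape of the returned dict are purely illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<device>
  <name>pci_0000_df_00_0</name>
  <capability type='pci'>
    <iommuGroup number='10'>
      <address domain='0x0000' bus='0xdc' slot='0x02' function='0x0'/>
      <address domain='0x0000' bus='0xdf' slot='0x00' function='0x0'/>
    </iommuGroup>
    <numa node='2'/>
  </capability>
</device>
"""


def format_address(addr):
    """Render an <address> element in the usual dddd:bb:ss.f notation."""
    domain, bus, slot, function = (int(addr.get(k), 16)
                                   for k in ("domain", "bus", "slot", "function"))
    return f"{domain:04x}:{bus:02x}:{slot:02x}.{function:x}"


def pci_placement(nodedev_xml):
    """Extract the NUMA node and IOMMU group details of one PCI node device."""
    dev = ET.fromstring(nodedev_xml)
    cap = dev.find("capability[@type='pci']")
    numa = cap.find("numa")
    group = cap.find("iommuGroup")
    members = [] if group is None else [format_address(a)
                                        for a in group.findall("address")]
    return {
        "name": dev.findtext("name"),
        "numa_node": None if numa is None else int(numa.get("node")),
        "iommu_group": None if group is None else int(group.get("number")),
        "group_members": members,
    }


if __name__ == "__main__":
    print(pci_placement(SAMPLE))
```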
    

    Finally, libvirt should also be reporting the NUMA topology

    # virsh capabilities
    ...snip...
    <topology>
      <cells num='3'>
        <cell id='0'>
          <memory unit='KiB'>4014464</memory>
          <pages unit='KiB' size='4'>1003616</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='1' value='20'/>
            <sibling id='2' value='20'/>
          </distances>
          <cpus num='4'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
            <cpu id='1' socket_id='1' core_id='0' siblings='1'/>
            <cpu id='2' socket_id='2' core_id='0' siblings='2'/>
            <cpu id='3' socket_id='3' core_id='0' siblings='3'/>
          </cpus>
        </cell>
        <cell id='1'>
          <memory unit='KiB'>2016808</memory>
          <pages unit='KiB' size='4'>504202</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='20'/>
            <sibling id='1' value='10'/>
            <sibling id='2' value='20'/>
          </distances>
          <cpus num='2'>
            <cpu id='4' socket_id='4' core_id='0' siblings='4'/>
            <cpu id='5' socket_id='5' core_id='0' siblings='5'/>
          </cpus>
        </cell>
        <cell id='2'>
          <memory unit='KiB'>2014644</memory>
          <pages unit='KiB' size='4'>503661</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='20'/>
            <sibling id='1' value='20'/>
            <sibling id='2' value='10'/>
          </distances>
          <cpus num='2'>
            <cpu id='6' socket_id='6' core_id='0' siblings='6'/>
            <cpu id='7' socket_id='7' core_id='0' siblings='7'/>
          </cpus>
        </cell>
      </cells>
    </topology>
    ...snip...
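
    The same approach works for the capabilities document. As a sketch (the sample is a cut-down version of the output above and the function name is hypothetical), this summarizes the memory and CPUs of each NUMA cell so it can be compared against what numactl reported earlier:

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<capabilities>
  <host>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <memory unit='KiB'>4014464</memory>
          <cpus num='2'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
            <cpu id='1' socket_id='1' core_id='0' siblings='1'/>
          </cpus>
        </cell>
        <cell id='1'>
          <memory unit='KiB'>2016808</memory>
          <cpus num='1'>
            <cpu id='4' socket_id='4' core_id='0' siblings='4'/>
          </cpus>
        </cell>
      </cells>
    </topology>
  </host>
</capabilities>
"""


def numa_cells(capabilities_xml):
    """Return {cell id: {'memory_kib': ..., 'cpus': [...]}} from capabilities XML."""
    root = ET.fromstring(capabilities_xml)
    cells = {}
    for cell in root.iter("cell"):
        cells[int(cell.get("id"))] = {
            "memory_kib": int(cell.findtext("memory")),
            "cpus": [int(cpu.get("id")) for cpu in cell.iter("cpu")],
        }
    return cells


if __name__ == "__main__":
    for cell_id, info in numa_cells(SAMPLE).items():
        print(cell_id, info)
```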
    

    Everything should be ready and working at this point, so let’s try to install a nested guest, and assign it one of the e1000e PCI devices. For simplicity we’ll just do the exact same install for the nested guest as we used for the top level guest we’re currently running in. The only difference is that we’ll assign it a PCI device

    # cd /var/lib/libvirt/images
    # wget -O f25x86_64-boot.iso https://download.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/images/boot.iso
    # virt-install --name f25x86_64 --ram 2000 --vcpus 8 \
        --file /var/lib/libvirt/images/f25x86_64.img --file-size 10 \
        --cdrom f25x86_64-boot.iso --os-type fedora23 \
        --hostdev pci_0000_df_00_0 --network none

    If everything went well, you should now have a nested guest with an assigned PCI device attached to it.

    This turned out to be a rather long blog posting, but that is not surprising, as we’re experimenting with some cutting edge KVM features to emulate quite a complicated hardware setup that deviates a long way from a normal KVM guest setup. Perhaps in the future virt-install will be able to simplify some of this, but at least for the short-medium term there’ll be a fair bit of work required. The positive thing though is that this has clearly demonstrated that KVM is now advanced enough that you can reasonably expect to do development and testing of features like NUMA and PCI device assignment inside nested guests.

    The next step is to convince someone to add QEMU emulation of an Intel SRIOV network device….volunteers please :-)

    by Daniel Berrange at February 16, 2017 12:44 PM

    ANNOUNCE: libosinfo 1.0.0 release

    NB, this blog post was intended to be published back in November last year, but got forgotten in draft stage. Publishing now in case anyone missed the release…

    I am happy to announce that a new release of libosinfo, version 1.0.0, is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.

    Changes in this release include:

    • Update loader to follow new layout for external database
    • Move all database files into separate osinfo-db package
    • Move osinfo-db-validate into osinfo-db-tools package

    As promised, this release of libosinfo has completed the separation of the library code from the database files. There are now three independently released artefacts:

    • libosinfo – provides the libosinfo shared library and most associated command line tools
    • osinfo-db – contains only the database XML files and RNG schema, no code at all.
    • osinfo-db-tools – a set of command line tools for managing deployment of osinfo-db archives for vendors & users.

    Before installing the 1.0.0 release of libosinfo it is necessary to install osinfo-db-tools, followed by osinfo-db. The download page has instructions for how to deploy the three components. In particular note that ‘osinfo-db’ does NOT contain any traditional build system, as the only files it contains are XML database files. So instead of unpacking the osinfo-db archive, use the osinfo-db-import tool to deploy it.

    by Daniel Berrange at February 16, 2017 11:19 AM

    February 15, 2017

    QEMU project

    Presentations from FOSDEM 2017

    Over the last weekend, on February 4th and 5th, the FOSDEM 2017 conference took place in Brussels.

    Some people from the QEMU community attended the Virtualisation and IaaS track, and the videos of their presentations are now available online, too:

    by Thomas Huth at February 15, 2017 02:49 PM

    February 14, 2017

    Stefan Hajnoczi

    Slides posted for "Using NVDIMM under KVM" talk

    I gave a talk on NVDIMM persistent memory at FOSDEM 2017. QEMU has gained support for emulated NVDIMMs and they can be used efficiently under KVM.

    Applications inside the guest access the physical NVDIMM directly with native performance when properly configured. These devices are DDR4 RAM modules so the access times are much lower than solid state (SSD) drives. I'm looking forward to hardware coming onto the market because it will change storage and databases in a big way.

    This talk covers what NVDIMM is, the programming model, and how it can be used under KVM. Slides are available here (PDF).

    Update: Video is available here.

    by stefanha (noreply@blogger.com) at February 14, 2017 02:54 PM

    February 10, 2017

    Daniel Berrange

    ANNOUNCE: gtk-vnc 0.7.0 release including 2 security fixes

    I’m pleased to announce a new release of GTK-VNC, version 0.7.0. The release focus is on bug fixing and includes fixes for two publicly reported security bugs which allow a malicious server to exploit the client. Similar bugs were recently reported & fixed in other common VNC clients too.

    • CVE-2017-5884 – fix bounds checking for RRE, hextile and copyrect encodings
    • CVE-2017-5885 – fix color map index bounds checking
    • Add API to allow smooth scaling to be disabled
    • Workaround to help SPICE servers quickly drop VNC clients which mistakenly connect, by sending “RFB ” signature bytes early
    • Don’t accept color map entries for true-color pixel formats
    • Add missing vala .deps files for gvnc & gvncpulse
    • Avoid crash if host/port is NULL
    • Add precondition checks to some public APIs
    • Fix link to home page in README file
    • Fix misc memory leaks
    • Clamp cursor hot-pixel to within cursor region

    Thanks to all those who reported bugs and provided patches that went into this new release.

    by Daniel Berrange at February 10, 2017 04:45 PM

    The surprisingly complicated world of disk image sizes

    When managing virtual machines one of the key tasks is to understand the utilization of resources being consumed, whether RAM, CPU, network or storage. This post will examine different aspects of managing storage when using file based disk images, as opposed to block storage. When provisioning a virtual machine the tenant user will have an idea of the amount of storage they wish the guest operating system to see for their virtual disks. This is the easy part. It is simply a matter of telling ‘qemu-img’ (or a similar tool) ’40GB’ and it will create a virtual disk image that is visible to the guest OS as a 40GB volume. The virtualization host administrator, however, doesn’t particularly care about what size the guest OS sees. They are instead interested in how much space is (or will be) consumed in the host filesystem storing the image. With this in mind, there are four key figures to consider when managing storage:

    • Capacity – the size that is visible to the guest OS
    • Length – the current highest byte offset in the file.
    • Allocation – the amount of storage that is currently consumed.
    • Commitment – the amount of storage that could be consumed in the future.

    The relationship between these figures will vary according to the format of the disk image file being used. For the sake of illustration, raw and qcow2 files will be compared since they provide examples of the simplest file format and the most complicated file format used for virtual machines.

    Raw files

    In a raw file, the sectors visible to the guest are mapped 1-to-1 onto sectors in the host file. Thus the capacity and length values will always be identical for raw files – the length dictates the capacity and vice versa. The allocation value is slightly more complicated. Most filesystems do lazy allocation on blocks, so even if a file is 10 GB in length it is entirely possible for it to consume 0 bytes of physical storage, if nothing has been written to the file yet. Such a file is known as “sparse” or is said to have “holes” in its allocation. To maximize guest performance, it is common to tell the operating system to fully allocate a file at time of creation, either by writing zeros to every block (very slow) or via a special system call to instruct it to immediately allocate all blocks (very fast). So immediately after creating a new raw file, the allocation would typically either match the length, or be zero. In the latter case, as the guest writes to various disk sectors, the allocation of the raw file will grow. The commitment value refers to the upper bound for the allocation value, and for raw files, this will match the length of the file.

    While raw files look reasonably straightforward, some filesystems can create surprises. XFS has a concept of “speculative preallocation” where it may allocate more blocks than are actually needed to satisfy the current I/O operation. This is useful for files which are progressively growing, since it is faster to allocate 10 blocks all at once, than to allocate 10 blocks individually. So while a raw file’s allocation will usually never exceed the length, if XFS has speculatively preallocated extra blocks, it is possible for the allocation to exceed the length. The excess is usually pretty small though – bytes or KBs, not MBs. Btrfs meanwhile has a concept of “copy on write” whereby multiple files can initially share allocated blocks and when one file is written, it will take a private copy of the blocks written. IOW, to determine the usage of a set of files it is not sufficient to sum the allocation for each file as that would over-count the true allocation due to block sharing.
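
    The difference between length and allocation is easy to see with stat(). The sketch below assumes a POSIX host, where st_blocks is reported in 512-byte units. It creates a sparse file by truncating it to the desired length and then reports both figures; the helper names are mine.

```python
import os
import tempfile


def make_sparse(path, length):
    """Create a file of the given length without writing any data blocks."""
    with open(path, "wb") as f:
        f.truncate(length)


def length_and_allocation(path):
    """Return (length, allocation) in bytes for a file on a POSIX host."""
    st = os.stat(path)
    # st_blocks counts 512-byte units, independent of the filesystem block size.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        image = os.path.join(tmp, "disk.img")
        make_sparse(image, 10 * 1024 * 1024)
        length, allocation = length_and_allocation(image)
        print(f"length={length} allocation={allocation}")
```

    On most filesystems the allocation printed will be zero or very small; the exact number depends on the filesystem, which is rather the point of this section.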

    QCow2 files

    In a qcow2 file, the sectors visible to the guest are indirectly mapped to sectors in the host file via a number of lookup tables. A sector at offset 4096 in the guest may be stored at offset 65536 in the host. In order to perform this mapping, there are various auxiliary data structures stored in the qcow2 file. Describing all of these structures is beyond the scope of this post; read the specification instead. The key point is that, unlike raw files, the length of the file in the host has no relation to the capacity seen in the guest. The capacity is determined by a value stored in the file header metadata.

    By default, the qcow2 file will grow on demand, so the length of the file will gradually grow as more data is stored. It is possible to request preallocation, either just of file metadata, or of the full file payload too. Since the file grows on demand as data is written, traditionally it would never have any holes in it, so the allocation would always match the length (the previous caveat with respect to XFS speculative preallocation still applies though). Since the introduction of SSDs, however, the notion of explicitly cutting holes in files has become commonplace. When this is plumbed through from the guest, a guest initiated TRIM request will in turn create a hole in the qcow2 file, which will also issue a TRIM to the underlying host storage. Thus even though qcow2 files grow on demand, they may also become sparse over time, so the allocation may be less than the length.

    The maximum commitment for a qcow2 file is surprisingly hard to get an accurate answer to. To calculate it requires intimate knowledge of the qcow2 file format and even the type of data stored in it. There is allocation overhead from the data structures used to map guest sectors to host file offsets, which is directly proportional to the capacity and inversely proportional to the qcow2 cluster size (a cluster is the qcow2 equivalent of the “sector” concept, except much bigger – 65536 bytes by default). Over time qcow2 has grown other data structures though, such as various bitmap tables tracking cluster allocation and recent writes. With the addition of LUKS support, there will be key data tables. Most significant though is that qcow2 can internally store entire VM snapshots containing the virtual device state, guest RAM and copy-on-write disk sectors. If snapshots are ignored, it is possible to calculate a value for the commitment, and it will be proportional to the capacity. If snapshots are used, however, all bets are off – the amount of storage that can be consumed is unbounded, so there is no commitment value that can be accurately calculated.
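
    To give a feel for the size of the mapping overhead, here is a back-of-the-envelope sketch. It counts only the two largest structures, assuming 8 bytes of L2 table entry and one 16-bit refcount entry per cluster, and it ignores the header, the L1 table, refcounts for the metadata itself, bitmaps, encryption data and snapshots. Treat it as an illustration of the proportionality, not as a way to compute the real commitment.

```python
def qcow2_metadata_estimate(capacity, cluster_size=65536, refcount_bits=16):
    """Roughly estimate qcow2 mapping overhead in bytes for a given capacity."""
    clusters = -(-capacity // cluster_size)       # ceiling division
    l2_bytes = clusters * 8                       # one 64-bit L2 entry per cluster
    refcount_bytes = clusters * refcount_bits // 8
    return l2_bytes + refcount_bytes


if __name__ == "__main__":
    capacity = 40 * 1024 ** 3
    print(qcow2_metadata_estimate(capacity))                      # 6553600
    print(qcow2_metadata_estimate(capacity, cluster_size=4096))   # 104857600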

    Summary

    Considering the above information, for a newly created file the four size values would look like

    Format                     Capacity  Length  Allocation  Commitment
    raw (sparse)               40GB      40GB    0           40GB [1]
    raw (prealloc)             40GB      40GB    40GB [1]    40GB [1]
    qcow2 (grow on demand)     40GB      193KB   196KB       41GB [2]
    qcow2 (prealloc metadata)  40GB      41GB    6.5MB       41GB [2]
    qcow2 (prealloc all)       40GB      41GB    41GB        41GB [2]

    [1] XFS speculative preallocation may cause allocation/commitment to be very slightly higher than 40GB
    [2] use of internal snapshots may massively increase allocation/commitment

    For an application attempting to manage filesystem storage to ensure any future guest OS write will always succeed without triggering ENOSPC (out of space) in the host, the commitment value is critical to understand. If the length/allocation values are initially less than the commitment, they will grow towards it as the guest writes data. For raw files it is easy to determine commitment (XFS preallocation aside), but for qcow2 files it is unreasonably hard. Even ignoring internal snapshots, there is no API provided by libvirt that reports this value, nor is it exposed by QEMU or its tools. Determining the commitment for a qcow2 file requires the application to not only understand the qcow2 file format, but also directly query the header metadata to read internal parameters such as “cluster size” to be able to then calculate the required value. Without this, the best an application can do is to guess – e.g. add 2% to the capacity of the qcow2 file to determine likely commitment. Snapshots may make life even harder, but to be fair, qcow2 internal snapshots are best avoided regardless in favour of external snapshots. The lack of information around file commitment is a clear gap that needs addressing in both libvirt and QEMU.
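
    To illustrate what “directly query the header metadata” involves, the sketch below decodes the first 32 bytes of a qcow2 header, which hold the magic, the version, the backing file fields, the cluster bits and the virtual size, all big-endian. The function name and return shape are my own; a real application would also need to read the later header fields and cope with encryption and header extensions.

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"
# magic, version, backing_file_offset, backing_file_size, cluster_bits, size
HEADER_FORMAT = ">4sIQIIQ"


def parse_qcow2_header(data):
    """Decode capacity and cluster size from the start of a qcow2 file."""
    size = struct.calcsize(HEADER_FORMAT)
    if len(data) < size:
        raise ValueError("header too short")
    magic, version, _, _, cluster_bits, capacity = struct.unpack(
        HEADER_FORMAT, data[:size])
    if magic != QCOW2_MAGIC:
        raise ValueError("not a qcow2 image")
    return {"version": version,
            "cluster_size": 1 << cluster_bits,
            "capacity": capacity}


def read_qcow2_header(path):
    """Read and decode the header of the qcow2 file at path."""
    with open(path, "rb") as f:
        return parse_qcow2_header(f.read(struct.calcsize(HEADER_FORMAT)))
```

    With the capacity and cluster size in hand, an application can either attempt a proper calculation or fall back to the 2% guess mentioned above.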

    That all said, ensuring the sum of commitment values across disk images is within the filesystem free space is only one approach to managing storage. These days QEMU has the ability to live migrate virtual machines even when their disks are on host-local storage – it simply copies across the disk image contents too. So a valid approach is to mostly ignore future commitment implied by disk images, and instead just focus on the near term usage. For example, regularly monitor filesystem usage and if free space drops below some threshold, then migrate one or more VMs (and their disk images) off to another host to free up space for remaining VMs.
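
    The monitoring half of that approach needs very little code. This is a sketch only: the 10% threshold is an arbitrary example value, the function name is hypothetical, and choosing which VM to migrate and actually migrating it are deliberately left out.

```python
import shutil


def needs_evacuation(path, min_free_fraction=0.10):
    """Return True if free space on the filesystem holding path is below the threshold."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False  # pseudo-filesystem reporting no capacity; nothing to judge
    return usage.free / usage.total < min_free_fraction


if __name__ == "__main__":
    # The libvirt images directory used earlier would be a natural path to watch.
    print(needs_evacuation("."))
```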

    by Daniel Berrange at February 10, 2017 03:58 PM

    February 08, 2017

    Daniel Berrange

    Commenting out XML snippets in libvirt guest config by stashing it as metadata

    Libvirt uses XML as the format for configuring objects it manages, including virtual machines. Sometimes when debugging / developing it is desirable to comment out sections of the virtual machine configuration to test some idea. For example, one might want to temporarily remove a secondary disk. It is not always desirable to just delete the configuration entirely, as it may need to be re-added immediately after. XML has support for comments <!-- .... some text --> which one might try to use to achieve this. Using comments in XML fed into libvirt, however, will result in an unwelcome surprise – the commented out text is thrown into /dev/null by libvirt.

    This is an unfortunate consequence of the way libvirt handles XML documents. It does not consider the XML document to be the master representation of an object’s configuration – a series of C structs are the actual internal representation. XML is simply a data interchange format for serializing structs into a text format that can be interchanged with the management application, or persisted on disk. So when receiving an XML document libvirt will parse it, extracting the pieces of information it cares about which are then stored in memory in some structs, while the XML document is discarded (along with the comments it contained). Given this way of working, to preserve comments would require libvirt to add 100’s of extra fields to its internal structs and extract comments from every part of the XML document that might conceivably contain them. This is totally impractical to do in reality. The alternative would be to consider the parsed XML DOM as the canonical internal representation of the config. This is what the libvirt-gconfig library in fact does, but it means you can no longer just do simple field accesses to access information – getter/setter methods would have to be used, which quickly becomes tedious in C. It would also involve refactoring almost the entire libvirt codebase so such a change in approach would realistically never be done.

    Given that it is not possible to use XML comments in libvirt, what other options might be available?

    Many years ago libvirt added the ability to store arbitrary user defined metadata in domain XML documents. The caveat is that they have to be located in a specific place in the XML document as a child of the <metadata> tag, in a private XML namespace. This metadata facility can be used as a hack to temporarily stash some XML out of the way. Consider a guest which contains a disk to be “commented out”:

    <domain type="kvm">
      ...
      <devices>
        ...
        <disk type='file' device='disk'>
          <driver name='qemu' type='raw'/>
          <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
          <target dev='vda' bus='virtio'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
        </disk>
        ...
      </devices>
    </domain>
    

    To stash the disk config as a piece of metadata requires changing the XML to

    <domain type="kvm">
      ...
      <metadata>
        <s:disk xmlns:s="http://stashed.org/1" type='file' device='disk'>
          <driver name='qemu' type='raw'/>
          <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
          <target dev='vda' bus='virtio'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
        </s:disk>
      </metadata>
      ...
      <devices>
        ...
      </devices>
    </domain>
    

    What we have done here is

    – Added a <metadata> element at the top level
    – Moved the <disk> element to be a child of <metadata> instead of a child of <devices>
    – Added an XML namespace to <disk> by giving it an ‘s:’ prefix and associating a URI with this prefix
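
    The same three steps can be scripted. The following Python sketch uses the standard library’s ElementTree; the stash_disk() helper is hypothetical and only meant to show the transformation, not a tool that ships with libvirt.

```python
import xml.etree.ElementTree as ET

STASH_NS = "http://stashed.org/1"
# Make the serializer emit the "s:" prefix used in the example above.
ET.register_namespace("s", STASH_NS)


def stash_disk(domain_xml, target_dev):
    """Move the <disk> with the given target dev from <devices> to <metadata>."""
    domain = ET.fromstring(domain_xml)
    devices = domain.find("devices")
    if devices is None:
        return domain_xml
    metadata = domain.find("metadata")
    if metadata is None:
        # Add a <metadata> element at the top level
        metadata = ET.Element("metadata")
        domain.insert(0, metadata)
    for disk in devices.findall("disk"):
        target = disk.find("target")
        if target is not None and target.get("dev") == target_dev:
            devices.remove(disk)
            # Put the element itself into the private namespace
            disk.tag = "{%s}disk" % STASH_NS
            metadata.append(disk)
            break
    return ET.tostring(domain, encoding="unicode")
```

    The sketch stashes one disk only; calling it for a second disk would place two elements from the same namespace under <metadata>, which would need a wrapper element or a second namespace.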

    Libvirt only allows a single top level metadata element per namespace, so if there are multiple things to be stashed, just give them each a custom namespace, or introduce an arbitrary wrapper. Aside from mandating the use of a unique namespace, libvirt treats the metadata as entirely opaque and will not try to interpret or parse it in any way. Any valid XML construct can be stashed in the metadata, even invalid XML constructs, provided they are hidden inside a CDATA block. For example, if you’re using virsh edit to make some changes interactively and want to get out before finishing them, just stash the changes in a CDATA section, avoiding the need to worry about correctly closing the elements.

    <domain type="kvm">
      ...
      <metadata>
        <s:stash xmlns:s="http://stashed.org/1">
        <![CDATA[
          <disk type='file' device='disk'>
            <driver name='qemu' type='raw'/>
            <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
            <target dev='vda' bus='virtio'/>
            <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
          </disk>
          <disk>
            <driver name='qemu' type='raw'/>
            ...i'll finish writing this later...
        ]]>
        </s:stash>
      </metadata>
      ...
      <devices>
        ...
      </devices>
    </domain>
    

    Admittedly this is a somewhat cumbersome solution. In most cases it is probably simpler to just save the snippet of XML in a plain text file outside libvirt. This metadata trick, however, might just come in handy sometimes.

    As an aside the real, intended, usage of the <metadata> facility is to allow applications which interact with libvirt to store custom data they may wish to associate with the guest. As an example, the recently announced libvirt websockets console proxy uses it to record which consoles are to be exported. I know of few other real world applications using this metadata feature, however, so it is worth remembering it exists :-) System administrators are free to use it for local book keeping purposes too.

    by Daniel Berrange at February 08, 2017 07:14 PM

    Thomas Huth

    Testing the edk2 SMM driver stack with QEMU, KVM & libvirt

    Laszlo wrote an article over at the edk2 wiki about testing SMM with OVMF, in QEMU/KVM virtual machines managed by libvirt:

    https://github.com/tianocore/tianocore.github.io/wiki/Testing-SMM-with-QEMU,-KVM-and-libvirt

    The primary goal of the article is to help rapid development and testing of SMM-related firmware code (or any edk2 code in general).

    February 08, 2017 12:10 PM

    Alberto Garcia

    QEMU and the qcow2 metadata checks

    When choosing a disk image format for your virtual machine one of the factors to take into consideration is its I/O performance. In this post I’ll talk a bit about the internals of qcow2 and about one of the aspects that can affect its performance under QEMU: its consistency checks.

    As you probably know, qcow2 is QEMU’s native file format. The first thing that I’d like to highlight is that this format is perfectly fine in most cases and its I/O performance is comparable to that of a raw file. When it isn’t, chances are that this is due to an insufficiently large L2 cache. In one of my previous blog posts I wrote about the qcow2 L2 cache and how to tune it, so if your virtual disk is too slow, you should go there first.

    I also recommend Max Reitz and Kevin Wolf’s qcow2: why (not)? talk from KVM Forum 2015, where they talk about a lot of internal details and show some performance tests.

    qcow2 clusters: data and metadata

    A qcow2 file is organized into units of constant size called clusters. The cluster size defaults to 64KB, but a different value can be set when creating a new image:

    qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G

    Clusters can contain either data or metadata. A qcow2 file grows dynamically and only allocates space when it is actually needed, so apart from the header there’s no fixed location for any of the data and metadata clusters: they can appear mixed anywhere in the file.

    Here’s an example of what it looks like internally:

    In this example we can see the most important types of clusters that a qcow2 file can have:

    • Header: this one contains basic information such as the virtual size of the image, the version number, and pointers to where the rest of the metadata is located, among other things.
    • Data clusters: the data that the virtual machine sees.
    • L1 and L2 tables: a two-level structure that maps the virtual disk that the guest can see to the actual location of the data clusters in the qcow2 file.
    • Refcount table and blocks: a two-level structure with a reference count for each data cluster. Internal snapshots use this: a cluster with a reference count >= 2 means that it’s used by other snapshots, and therefore any modifications require a copy-on-write operation.
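
    To make the two-level L1/L2 mapping more concrete, here is a small Python sketch of how a guest offset is split into table indexes. It assumes the standard 8-byte table entries, and the function name is made up for this example.

```python
def split_guest_offset(offset, cluster_bits=16):
    """Split a guest disk offset into L1 index, L2 index and in-cluster offset."""
    # An L2 table fills one cluster and holds 8-byte entries,
    # so it has 2**(cluster_bits - 3) of them.
    l2_bits = cluster_bits - 3
    cluster_index = offset >> cluster_bits
    return {
        "l1_index": cluster_index >> l2_bits,
        "l2_index": cluster_index & ((1 << l2_bits) - 1),
        "offset_in_cluster": offset & ((1 << cluster_bits) - 1),
    }
```

    With the default 64KB clusters each L2 table has 8192 entries and therefore maps 512MB of guest disk, so split_guest_offset(512 * 1024 * 1024) lands on the first entry of the second L2 table.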

    Metadata overlap checks

    In order to detect corruption when writing to qcow2 images QEMU (since v1.7) performs several sanity checks. They verify that QEMU does not try to overwrite sections of the file that are already being used for metadata. If this happens, the image is marked as corrupted and further access is prevented.

    Although in most cases these checks are innocuous, under certain scenarios they can have a negative impact on disk write performance. This depends a lot on the case, and I want to insist that in most scenarios it doesn’t have any effect. When it does, the general rule is that you’ll have more chances of noticing it if the storage backend is very fast or if the qcow2 image is very large.

    In these cases, and if I/O performance is critical for you, you might want to consider tweaking the images a bit or disabling some of these checks, so let’s take a look at them. There are currently eight different checks. They’re named after the metadata sections that they check, and can be divided into the following categories:

    1. Checks that run in constant time. These are equally fast for all kinds of images and I don’t think they’re worth disabling.
      • main-header
      • active-l1
      • refcount-table
      • snapshot-table
    2. Checks that run in variable time but don’t need to read anything from disk.
      • refcount-block
      • active-l2
      • inactive-l1
    3. Checks that need to read data from disk. There is just one check here and it’s only needed if there are internal snapshots.
      • inactive-l2

    By default all tests are enabled except for the last one (inactive-l2), because it needs to read data from disk.

    Disabling the overlap checks

    Tests can be disabled or enabled from the command line using the following syntax:

    -drive file=hd.qcow2,overlap-check.inactive-l2=on
    -drive file=hd.qcow2,overlap-check.snapshot-table=off

    It’s also possible to select the group of checks that you want to enable using the following syntax:

    -drive file=hd.qcow2,overlap-check.template=none
    -drive file=hd.qcow2,overlap-check.template=constant
    -drive file=hd.qcow2,overlap-check.template=cached
    -drive file=hd.qcow2,overlap-check.template=all

    Here, none means that no tests are enabled, constant enables all tests from group 1, cached enables all tests from groups 1 and 2, and all enables all of them.
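
    The grouping can be written down as a small Python table, which is handy when generating -drive options from a script. The helper names are hypothetical, and letting individual overlap-check.* switches override a template is an assumption of this sketch.

```python
CONSTANT = ["main-header", "active-l1", "refcount-table", "snapshot-table"]
CACHED = CONSTANT + ["refcount-block", "active-l2", "inactive-l1"]
ALL = CACHED + ["inactive-l2"]
TEMPLATES = {"none": [], "constant": CONSTANT, "cached": CACHED, "all": ALL}


def enabled_checks(template="cached", overrides=None):
    """Return the set of checks enabled by a template plus on/off overrides."""
    enabled = set(TEMPLATES[template])
    for name, on in (overrides or {}).items():
        if name not in ALL:
            raise ValueError("unknown overlap check: " + name)
        if on:
            enabled.add(name)
        else:
            enabled.discard(name)
    return enabled


def drive_option(filename, template=None, overrides=None):
    """Build the value passed to -drive for the chosen checks."""
    parts = ["file=" + filename]
    if template is not None:
        parts.append("overlap-check.template=" + template)
    for name, on in sorted((overrides or {}).items()):
        parts.append("overlap-check.%s=%s" % (name, "on" if on else "off"))
    return ",".join(parts)
```

    The default template in the sketch is cached, since that matches the default described above: everything enabled except inactive-l2.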

    As I explained in the previous section, if you’re worried about I/O performance then the checks that are probably worth evaluating are refcount-block, active-l2 and inactive-l1. I’m not counting inactive-l2 because it’s off by default. Let’s look at the other three:

    • inactive-l1: This is a variable length check because it depends on the number of internal snapshots in the qcow2 image. However its performance impact is likely to be negligible in all cases so I don’t think it’s worth bothering with.
    • active-l2: This check depends on the virtual size of the image, and on the percentage that has already been allocated. This check might have some impact if the image is very large (several hundred GBs or more). In that case one way to deal with it is to create an image with a larger cluster size. This also has the nice side effect of reducing the amount of memory needed for the L2 cache.
    • refcount-block: This check depends on the actual size of the qcow2 file and it’s independent from its virtual size. This check is relatively expensive even for small images, so if you notice performance problems chances are that they are due to this one. The good news is that we have been working on optimizing it, so if it’s slowing down your VMs the problem might go away completely in QEMU 2.9.
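
    The effect of the cluster size on the L2 metadata is easy to quantify. This Python sketch (the function name is made up, and it assumes standard 8-byte L2 entries and a fully allocated image) returns the number of L2 tables and their total size.

```python
def l2_footprint(virtual_size, cluster_size):
    """Return (number of L2 tables, total bytes of L2 tables) for an image."""
    data_clusters = -(-virtual_size // cluster_size)   # round up
    entries_per_table = cluster_size // 8              # 8-byte L2 entries
    tables = -(-data_clusters // entries_per_table)
    return tables, tables * cluster_size
```

    For a 1TB image this gives 2048 tables (128MB) with 64KB clusters, but only 8 tables (8MB) with 1MB clusters.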

    Conclusion

    The qcow2 consistency checks are useful to detect data corruption, but they can affect write performance.

    If you’re unsure and you want to check it quickly, open an image with overlap-check.template=none and see for yourself, but remember again that this will only affect write operations. To obtain more reliable results you should also open the image with cache=none in order to perform direct I/O and bypass the page cache. I’ve seen performance increases of 50% and more, but whether you’ll see them depends a lot on your setup. In many cases you won’t notice any difference.

    I hope this post was useful to learn a bit more about the qcow2 format. There are other things that can help QEMU perform better, and I’ll probably come back to them in future posts, so stay tuned!

    Acknowledgments

    My work in QEMU is sponsored by Outscale and has been made possible by Igalia and the help of the rest of the QEMU development team.

    by berto at February 08, 2017 08:52 AM

    February 04, 2017

    QEMU project

    A new website for QEMU

    At last, QEMU’s new website is up!

    The new site aims to be simple and provides the basic information needed to download and start contributing to QEMU. It complements the wiki, which remains the central point for developers to share information quickly with the rest of the community.

    We tried to test the website on most browsers and to make it lightweight and responsive. It is built using Jekyll and the source code for the website can be cloned from the qemu-web.git repository. Just like for any other project hosted by QEMU, the best way to propose or contribute a new change is by sending a patch through the qemu-devel@nongnu.org mailing list.

    For example, if you would like to add a new screenshot to the homepage, you can clone the qemu-web.git repository, add a PNG file to the screenshots/ directory, and edit the _data/screenshots.yml file to include the new screenshot.

    Blog posts about QEMU are also welcome; they are simple HTML or Markdown files and are stored in the _posts/ directory of the repository.

    by Paolo Bonzini at February 04, 2017 07:40 AM

    January 24, 2017

    Gerd Hoffmann

    local display for intel vgpu starts working

    Intel vgpu guest display shows up in the qemu gtk window. This is a linux kernel booted to the dracut emergency shell prompt, with the framebuffer console running @ inteldrmfb. Booting a full linux guest with X11 and/or wayland not tested yet. There are rendering glitches too, running “dmesg” looks like this:

    So, a bunch of issues to solve before this is ready for users, but it’s a nice start.

    For the brave:

    host kernel: https://www.kraxel.org/cgit/linux/log/?h=intel-vgpu
    qemu: https://www.kraxel.org/cgit/qemu/log/?h=work/intel-vgpu

    Take care, both branches are moving targets (aka: rebasing at times).

    by Gerd Hoffmann at January 24, 2017 08:38 PM

    January 18, 2017

    Gerd Hoffmann

    tweak arm images with libguestfs-tools

    So, when using the official fedora arm images on your raspberry pi (or any other arm board) you might have faced the problem that it is not easy to use them for a headless (i.e. no keyboard and display connected) machine. There is no default password, fedora asks you to set one on the first boot instead. Which is from a security point of view surely better than shipping with a fixed password. But for headless machines it is quite inconvenient …

    Luckily there is an easy way out. You can use libguestfs-tools. The tools have been created to configure virtual machine images (this is where the name comes from). But the tools work fine with sdcards too.

    I’m using a usb sdcard reader which shows up as /dev/sdc on my system. I can just pass /dev/sdc as image to the tools (take care, the device is probably something else for you). For example, to set a root password:

    virt-customize -a /dev/sdc --root-password "password:<your-password-here>"
    

    The initial setup on the first boot is a systemd service, and it can be turned off by simply removing the symlinks which enable the service:

    virt-customize -a /dev/sdc \
      --delete /etc/systemd/system/multi-user.target.wants/initial-setup.service \
      --delete /etc/systemd/system/graphical.target.wants/initial-setup.service
    

    You can use virt-copy-in (or virt-tar-in) to copy config files to the disk image. Small (or empty) configuration files can also be created with the write command:

    virt-customize -a /dev/sdc --write "/.autorelabel:"
    

    Adding the .autorelabel file will force selinux relabeling on the first boot (takes a while). It is a good idea to do that in case you copy files to the sdcard, to make sure the new files are labeled correctly. Especially in case you copy security sensitive things like ssh keys or ssh config files. Without relabeling selinux will not allow sshd to access those files, which in turn can break remote logins.

    There is a lot more the virt-* tools can do for you. Check out the manual pages for more info. And you can easily script things, virt-customize has a --commands-from-file switch which accepts a file with a list of commands.
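
    If you prefer to drive the tool from Python, the commands shown above can be assembled into a single invocation. The helper below is only an illustration: its name and parameters are invented, and it sticks to the switches already used in this post.

```python
import shlex


def virt_customize_argv(image, root_password=None, delete=(), write=()):
    """Assemble one virt-customize command line from the pieces shown above."""
    argv = ["virt-customize", "-a", image]
    if root_password is not None:
        argv += ["--root-password", "password:" + root_password]
    for path in delete:
        argv += ["--delete", path]
    for path, content in write:
        # --write takes "FILE:CONTENT"; empty content creates an empty file
        argv += ["--write", "%s:%s" % (path, content)]
    return argv


def as_shell(argv):
    """Render the argument list as a copy-and-paste shell command."""
    return shlex.join(argv)
```

    The resulting list can be handed to subprocess.run(argv, check=True), or printed with as_shell() for review first.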

    by Gerd Hoffmann at January 18, 2017 10:56 AM

    January 17, 2017

    Gerd Hoffmann

    virtual gpu support landing upstream

    The upstreaming process of virtual gpu support (vgpu) made a big step forward with the 4.10 merge window. Two important pieces have been merged:

    First, the mediated device framework (mdev). Basically this allows kernel drivers to present virtual pci devices, using the vfio framework and interfaces. Both nvidia and intel will use mdev to partition the physical gpu of the host into multiple virtual devices which then can be assigned to virtual machines.

    Second, intel landed initial mdev support for the i915 driver too. There is quite some work left to do in future kernel releases though. Access to the guest display is not supported yet, so you must run x11vnc or similar tools in the guest to see the screen. Also there are some stability issues to find and fix.

    If you want to play with this nevertheless, here is how to do it. But be prepared for crashes, and better not try this on a production machine.

    On the host: create virtual devices

    On the host machine you obviously need a 4.10 kernel. Also the intel graphics device (igd) must be broadwell or newer. In the kernel configuration enable vfio and mdev (all CONFIG_VFIO_* options). Enable CONFIG_DRM_I915_GVT and CONFIG_DRM_I915_GVT_KVMGT for intel vgpu support. Building the mtty sample driver (CONFIG_SAMPLE_VFIO_MDEV_MTTY, a virtual serial port) can be useful too, for testing.

    Boot the new kernel. Load all modules: vfio-pci, vfio-mdev, optionally mtty. Also i915 and kvmgt of course, but that probably happened during boot already.

    Go to the /sys/class/mdev_bus directory. This should look like this:

    kraxel@broadwell ~# cd /sys/class/mdev_bus
    kraxel@broadwell .../class/mdev_bus# ls -l
    total 0
    lrwxrwxrwx. 1 root root 0 17. Jan 10:51 0000:00:02.0 -> ../../devices/pci0000:00/0000:00:02.0
    lrwxrwxrwx. 1 root root 0 17. Jan 11:57 mtty -> ../../devices/virtual/mtty/mtty
    

    Each driver with mdev support has a directory there. Go to $device/mdev_supported_types to check what kind of virtual devices you can create.

    kraxel@broadwell .../class/mdev_bus# cd 0000:00:02.0/mdev_supported_types
    kraxel@broadwell .../0000:00:02.0/mdev_supported_types# ls -l
    total 0
    drwxr-xr-x. 3 root root 0 17. Jan 11:59 i915-GVTg_V4_1
    drwxr-xr-x. 3 root root 0 17. Jan 11:57 i915-GVTg_V4_2
    drwxr-xr-x. 3 root root 0 17. Jan 11:59 i915-GVTg_V4_4
    

    As you can see intel supports three different configurations on my machine. The configurations differ (basically in the amount of video memory), as does the number of instances you can create. Check the description and available_instances files in the directories:

    kraxel@broadwell .../0000:00:02.0/mdev_supported_types# cd i915-GVTg_V4_2
    kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# cat description 
    low_gm_size: 64MB
    high_gm_size: 192MB
    fence: 4
    kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# cat available_instances
    2
    

    Now it is possible to create virtual devices by writing a UUID into the create file:

    kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# uuid=$(uuidgen)
    kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# echo $uuid
    f321853c-c584-4a6b-b99a-3eee22a3919c
    kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# sudo sh -c "echo $uuid > create"
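
    The same step as a Python sketch, in case you want to script it. The create_mdev() helper is hypothetical; on a real host it has to run as root and be pointed at one of the mdev_supported_types directories.

```python
import os
import tempfile
import uuid


def create_mdev(type_dir, dev_uuid=None):
    """Write a UUID into the 'create' file of an mdev type directory."""
    dev_uuid = dev_uuid or str(uuid.uuid4())
    with open(os.path.join(type_dir, "create"), "w") as f:
        f.write(dev_uuid + "\n")
    return dev_uuid


if __name__ == "__main__":
    # Dry run against a scratch directory instead of the real sysfs tree.
    with tempfile.TemporaryDirectory() as scratch:
        print(create_mdev(scratch))
```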
    

    The new vgpu device will show up as a subdirectory of the host gpu:

    kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# cd ../../$uuid
    kraxel@broadwell .../0000:00:02.0/f321853c-c584-4a6b-b99a-3eee22a3919c# ls -l
    total 0
    lrwxrwxrwx. 1 root root    0 17. Jan 12:31 driver -> ../../../../bus/mdev/drivers/vfio_mdev
    lrwxrwxrwx. 1 root root    0 17. Jan 12:35 iommu_group -> ../../../../kernel/iommu_groups/10
    lrwxrwxrwx. 1 root root    0 17. Jan 12:35 mdev_type -> ../mdev_supported_types/i915-GVTg_V4_2
    drwxr-xr-x. 2 root root    0 17. Jan 12:35 power
    --w-------. 1 root root 4096 17. Jan 12:35 remove
    lrwxrwxrwx. 1 root root    0 17. Jan 12:31 subsystem -> ../../../../bus/mdev
    -rw-r--r--. 1 root root 4096 17. Jan 12:35 uevent
    

    You can see the device landed in iommu group 10. We’ll need that in a moment.
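
    A script can pick up the group number from the iommu_group symlink instead of reading it off the listing. A minimal Python sketch with made-up helper names:

```python
import os


def group_from_link(link_target):
    """Extract the group number from an iommu_group symlink target."""
    return int(os.path.basename(link_target))


def iommu_group(device_dir):
    """Read the IOMMU group of a device from its sysfs directory."""
    return group_from_link(os.readlink(os.path.join(device_dir, "iommu_group")))


def vfio_node(group):
    """Path of the vfio character device for an IOMMU group."""
    return "/dev/vfio/%d" % group
```

    Here device_dir would be the directory of the newly created vgpu device shown above.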

    On the host: configure guests

    Ideally this would be as simple as adding <hostdev> to your guest’s libvirt xml config. The mdev devices don’t have a pci address on the host though, and because of that they must be passed to qemu using the sysfs device path instead of the pci address. libvirt doesn’t (yet) support sysfs paths though, so it is a bit more complicated for now. A lot of the setup libvirt does for hostdevs automatically must be done manually instead.

    First, we must allow qemu to access /dev. By default libvirt uses control groups to restrict access. That must be turned off. Edit /etc/libvirt/qemu.conf. Uncomment the cgroup_controllers line. Remove "devices" from the list. Restart libvirtd.

    Second, we must allow qemu to access the iommu group (10 in my case). A simple chmod will do:

    kraxel@broadwell ~# chmod 666 /dev/vfio/10
    

    Third, we must update the guest configuration:

    <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
      [ ... ]
      <currentMemory unit='KiB'>1048576</currentMemory>
      <memoryBacking>
        <locked/>
      </memoryBacking>
      [ ... ]
      <qemu:commandline>
        <qemu:arg value='-device'/>
        <qemu:arg value='vfio-pci,addr=05.0,sysfsdev=/sys/class/mdev_bus/0000:00:02.0/f321853c-c584-4a6b-b99a-3eee22a3919c'/>
      </qemu:commandline>
    </domain>
    

    There is a special qemu namespace which can be used to pass extra command line arguments to qemu. We do this here to use a qemu feature not yet supported by libvirt (use sysfs paths for vfio-pci). Also we must explicitly allow qemu to lock down guest memory.
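
    If you have more than one guest to update, the XML change can be generated. This Python sketch uses ElementTree; the add_mdev_passthrough() helper and its default guest address are assumptions for illustration, and it appends the new elements at the end of <domain> without trying to preserve any particular element order.

```python
import xml.etree.ElementTree as ET

QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"
# Make the serializer emit the "qemu:" prefix used in the config above.
ET.register_namespace("qemu", QEMU_NS)


def add_mdev_passthrough(domain_xml, sysfsdev, guest_addr="05.0"):
    """Add the qemu:commandline block and locked memory to a domain config."""
    domain = ET.fromstring(domain_xml)

    # Guest memory must be locked for device assignment.
    backing = domain.find("memoryBacking")
    if backing is None:
        backing = ET.SubElement(domain, "memoryBacking")
    if backing.find("locked") is None:
        ET.SubElement(backing, "locked")

    # Pass "-device vfio-pci,..." straight through to qemu.
    cmdline = ET.SubElement(domain, "{%s}commandline" % QEMU_NS)
    device = "vfio-pci,addr=%s,sysfsdev=%s" % (guest_addr, sysfsdev)
    for value in ("-device", device):
        ET.SubElement(cmdline, "{%s}arg" % QEMU_NS, {"value": value})
    return ET.tostring(domain, encoding="unicode")
```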

    Now we are ready to go:

    kraxel@broadwell ~# virsh start --console $guest
    

    In the guest

    It is a good idea to prepare the guest a bit before adding the vgpu to the guest configuration. Set up a serial console, so you can talk to it even in case graphics are broken. Blacklist the i915 module and load it manually, at least until you have a known-working configuration. Also booting to runlevel 3 (aka multi-user.target) instead of 5 (aka graphical.target) and starting the xorg server manually is better for now.

    For the guest machine intel recommends the 4.8 kernel. In theory newer kernels should work too, in practice they didn’t last time I tested (4.10-rc2). Also make sure the xorg server uses the modesetting driver, the intel driver didn’t work in my testing. This config file will do:

    root@guest ~# cat /etc/X11/xorg.conf.d/intel.conf 
    Section "Device"
            Identifier  "Card0"
    #       Driver      "intel"
            Driver      "modesetting"
            BusID       "PCI:0:5:0"
    EndSection
    

    I’m starting the xorg server with x11vnc, xterm and mwm (motif window manager) using this little script:

    #!/bin/sh
    
    # debug
    echo "# $0: DISPLAY=$DISPLAY"
    
    # start server
    if test "$DISPLAY" != ":4"; then
            echo "# $0: starting Xorg server"
            exec startx $0 -- /usr/bin/Xorg :4
            exit 1
    fi
    echo "# $0: starting session"
    
    # configure session
    xrdb $HOME/.Xdefaults
    
    # start clients
    x11vnc -rfbport 5904 &
    xterm &
    exec mwm
    

    The session runs on display 4, so you should be able to connect from the host this way:

    kraxel@broadwell ~# vncviewer $guest_ip:4
    

    Have fun!

    by Gerd Hoffmann at January 17, 2017 03:25 PM

    January 16, 2017

    Thomas Huth

    Controlling Open Firmware with -prom-env

    Open Firmware has a concept of environment configuration variables that are used to control the boot flow and other behavior of the firmware. On ppc64 systems, these variables are normally stored in the NVRAM of the machine (as defined in the LoPAPR specification, chapter 8.4.1.1, “System NVRAM Partition”), so that they are persistent between reboots - at least on real systems. With QEMU, the emulated NVRAM has got to be backed with a real file on the host, of course, to make the contents persistent. This can be done with the parameter -drive if=pflash,file=filename,format=raw of QEMU, for example.

    Now, if the backing file via pflash is not used, the contents of the NVRAM can be created by QEMU dynamically. QEMU features a dedicated parameter called -prom-env for setting the configuration variables in NVRAM. This has worked for a long time with all the OpenBIOS based machines in QEMU (like the mac99 or g3beige machine), and since version 2.8, released last December, QEMU also supports this option for the pseries machine, i.e. it can now also be used to control the boot behavior of the SLOF firmware of an sPAPR guest. So this is a good point in time to have a closer look at this parameter to explain how it can be used with the pseries machine…

    The supported environment variables and their values can be listed by typing printenv at the SLOF firmware prompt:

    0 > printenv  
    ---environment variable--------current value-------------default value------
       load-base                   4000                      4000 
       real-mode?                  true                      true
       direct-serial?              false                     false
       use-nvramrc?                false                     false
       selftest-#megs              0                         0 
       security-password                                     
       security-mode               0                         0 
       security-#badlogins         0                         0 
       screen-#rows                200                       200 
       screen-#columns             200                       200 
       output-device                                         
       oem-logo?                   false                     false
       oem-logo                    1e762bb0 0                1e762110 0 
       oem-banner?                 false                     false
       oem-banner                                            
       nvramrc                                               
       input-device                                          
       fcode-debug?                true                      true
       diag-switch?                false                     false
       diag-file                                             
       diag-device                                           
       boot-command                boot                      boot
       boot-file                                             
       boot-device                                           
       auto-boot?                  false                     true

    Many of the configuration variables are either only useful for debugging (like “fcode-debug?”), or even only there without real functionality since the IEEE 1275 standard mandates that they’ve got to be available (like “oem-logo”), but the implementation of the intended functionality did not make much sense in SLOF.

    Some other configuration variables are really useful, though. For example, if you want to prevent SLOF from booting automatically (so you can do some things at the firmware prompt before running the OS), you can start QEMU with -prom-env 'auto-boot?=false' to disable the auto-boot feature. Of course you could also hit the “s” key during boot to drop to the firmware prompt, but this can be quite annoying if you’re doing multiple things in parallel and thus easily miss the right point in time. Using the configuration variable is much more convenient.

    Another very useful trick is that you can execute arbitrary Forth code during the boot process with the -prom-env parameter! The likely obvious way is to use the “nvramrc” variable. For example, if you start QEMU with the parameters -prom-env 'use-nvramrc?=true' -prom-env 'nvramrc=." Hello World!" cr', SLOF will print “Hello World!” during the boot process, followed by a carriage return. But there is another way for executing Forth code, which I personally prefer if I do not have to boot an OS afterwards (since you do not have to set two variables in this case): You can override the boot-command variable, which also contains Forth code. For example using the parameters -prom-env 'boot-command=." Hello World!" cr' will print “Hello World” during the boot process, too, just at a little bit later point in time. This can also be useful to run firmware reboot tests: When you run QEMU with -prom-env 'boot-command=reset-all', the firmware will reboot automatically each time instead of booting an operating system. Or use -prom-env 'boot-command=power-off' to shut down the VM automatically at the end of the firmware boot process.
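
    Since the Forth snippets contain quotes and spaces, getting the shell quoting right can be fiddly. When starting QEMU from a script it is easier to build the argument list directly, as in this Python sketch. The helper names are made up, and the qemu-system-ppc64 -M pseries base command is just an assumed starting point.

```python
import shlex


def prom_env_args(variables):
    """Turn a dict of firmware variables into -prom-env arguments."""
    args = []
    for name, value in variables.items():
        args += ["-prom-env", "%s=%s" % (name, value)]
    return args


def qemu_command(variables, base=("qemu-system-ppc64", "-M", "pseries")):
    """Full argument list: assumed base command plus the -prom-env options."""
    return list(base) + prom_env_args(variables)


if __name__ == "__main__":
    # Print a shell-quoted version of the "Hello World" nvramrc example.
    print(shlex.join(qemu_command({
        "use-nvramrc?": "true",
        "nvramrc": '." Hello World!" cr',
    })))
```

    Passing the list from qemu_command() to subprocess.run() avoids the shell entirely, so no extra quoting is needed.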

    Something else that bugged me for a long time was the behavior of the input selection in SLOF. When you boot your pseries guest with a VGA graphics card, SLOF always automatically uses the emulated USB keyboard as input device. But if you want to debug the VGA or USB code in the firmware, for example, it is much more convenient if you can interact with the firmware via the serial console (a.k.a. hvterm). So now that QEMU supports the -prom-env parameter for the pseries machine, too, I’ve recently added some code to SLOF that forces the firmware to stay with the serial input instead of switching to the emulated USB keyboard. You can now control the behavior with the “direct-serial?” configuration variable. If you start QEMU with the parameters -nographic -vga std -prom-env 'direct-serial?=true', for example, you can still interact with the firmware in the terminal window even though the firmware detected a graphics card and USB keyboard. (Note: This new feature is currently only available in the development version of SLOF, but it will be part of the SLOF release that will be shipped with QEMU version 2.9.)

    January 16, 2017 10:45 AM

    January 06, 2017

    Ladi Prosek

    Extracting Windows VM crash dumps

    This post aims to provide an overview of the many options for extracting Windows crash dump files and their equivalents out of Windows virtual machines. The assumption is that a Windows instance has just crashed with a Blue Screen Of Death (BSOD) and a dump is deemed helpful to diagnose the issue. Note that most of the tips below apply to physical machines as well but the examples here focus on a QEMU/KVM virtual machine.

    1. Copy MEMORY.DMP after reboot

    This is straightforward as long as MEMORY.DMP has actually been generated and there is a simple way to copy it out of the VM. Note that when Windows is showing the BSOD screen and “dumping memory to disk”, it is not creating a crash dump file yet. To increase the chances that the memory dump file will be persisted, Windows writes it in the page file first. Then, on first reboot, it tries to extract it into a separate file, usually %SystemRoot%\MEMORY.DMP.

    A few things can go wrong here. First, if the system got super messed up by whatever caused the crash, the VM may not be able to boot again. No boot means no MEMORY.DMP and we’re out of luck. There may also not be enough free disk space to create the crash dump. This would, again, mean no MEMORY.DMP for us. Note that freeing disk space after the fact won’t help as the dump in the page file is already partially overwritten by the time the user gets to do anything about the disk space situation. Unless of course the disk can be accessed offline from the host or another VM. If you’re willing to do this, it is not necessary to reboot though (see below). Here’s an older blog post with a detailed description of the Windows behavior and useful tips.

    2. Extract the page file

    Little known is the fact that the page file itself can be fed into windbg and used in lieu of MEMORY.DMP (thanks to Vadim Rozenfeld for enlightening me!). The page file will be larger than MEMORY.DMP but it doesn’t require the reboot mentioned above. A Windows VM crashed and doesn’t boot back up? No problem, just extract the page file:

    $ guestmount -a Windows10.qcow2 -i --ro /mnt/windows
    $ cp /mnt/windows/pagefile.sys ~
    

    And then on a Windows machine:

    C:\>windbg -z path\to\pagefile.sys
    

    3. Process QEMU guest memory dump with Volatility

    This is not guaranteed to succeed but is worth a try as a last-ditch effort or if the VM is in production and a “crash” dump is required without actually crashing it.

    Side note: The easiest way to trigger a Windows VM crash is raising a non-maskable interrupt. This can be done by issuing the inject-nmi QMP command or its equivalent:

    Welcome to the QMP low-level shell!
    Connected to QEMU 2.8.50
    
    (QEMU) inject-nmi
    

    Back to the non-crashing scenario. First, get the guest memory dump:

    Welcome to the QMP low-level shell!
    Connected to QEMU 2.8.50
    
    (QEMU) dump-guest-memory paging=false protocol=file:/tmp/vmdump
    

    The file is a complete image of the guest physical memory in ELF format. It contains most of the data found in MEMORY.DMP but the structure is different. Use Volatility to convert it:

    $ python vol.py -f /tmp/vmdump --profile=Win10x64 raw2dmp -O /tmp/MEMORY.DMP
    Volatility Foundation Volatility Framework 2.5
    Writing data (5.00 MB chunks): 
    |..........................................................................
    ...........................................................................
    ...........................................................................
    ...........................................................................
    ...........................................................................
    ..............................|
    

    Note that you have to know or guess (or let Volatility guess) the version of Windows running in the VM, sometimes down to the Service Pack / build level. If all works well, the resulting MEMORY.DMP can be opened in windbg as usual. You just won’t see any bug check code and BSOD parameters (duh!) and probably no context either.

    4. Push dump data out of the VM programmatically

    This is advanced and more of a note to self, as I hope to turn it into a tool at some point. Windows offers two kernel APIs of interest.

    KeRegisterBugCheckReasonCallback with KbCallbackDumpIo
    In theory, this should allow a kernel driver to be invoked by Windows as the BSOD is happening and memory is being dumped to the page file. The driver is given a pointer to the memory blocks being dumped and can push this data out of the VM using, for instance, a virtual serial port. Quite understandably, the callback code running on BSOD is very limited in what it can do. Memory allocations are obviously out of the question as is everything else that might end up allocating. What’s worse is that it’s not clear if the addresses passed to the callback are virtual or physical. The first part of the dump (header) is indicated with virtual addresses. The second part (body) starts with a few virtual pages and is then followed by physical pages indicated with physical addresses. Secondary data is again virtual. I might be missing something but apparently I’m not the only one observing this. As it stands now, this API is not usable without ugly and fragile heuristics.

    KeInitializeCrashDumpHeader
    This call produces the crash dump header, a 4096-byte data structure at offset 0 of MEMORY.DMP. And it works, as long as the physical memory layout that Windows works with is identical to that of whoever is producing the actual physical memory data to be appended to the header. Which sadly doesn’t seem to be the case with QEMU’s dump-guest-memory. So on one hand, having a valid and correct header saves us from the guesswork that Volatility has to do (no need to mess with Volatility “profiles”). But it’s still necessary to at least understand, or better, patch the header to adapt it to the guest memory dump. Here’s a partial description of the dump header layout: computer.forensikblog.de/en/2006/03/dmp-file-structure.html.
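    A quick way to tell whether a blob of bytes looks like such a header is to check its signature. The sketch below relies on the commonly documented convention that a kernel dump header starts with the ASCII bytes “PAGEDUMP” (32-bit) or “PAGEDU64” (64-bit); the dumpKind helper is a name invented for this illustration, and the rest of the header layout is deliberately left alone.

```go
package main

import (
	"bytes"
	"fmt"
)

// dumpKind inspects the first eight bytes of a candidate dump header and
// reports which kind of kernel dump the signature announces. It returns an
// empty string if the signature is not recognised.
func dumpKind(header []byte) string {
	switch {
	case bytes.HasPrefix(header, []byte("PAGEDU64")):
		return "64-bit"
	case bytes.HasPrefix(header, []byte("PAGEDUMP")):
		return "32-bit"
	default:
		return ""
	}
}

func main() {
	// Two in-memory samples: a fake 64-bit header and the start of an ELF
	// file such as the one written by dump-guest-memory.
	samples := [][]byte{
		append([]byte("PAGEDU64"), make([]byte, 16)...),
		{0x7f, 'E', 'L', 'F'},
	}
	for _, s := range samples {
		if kind := dumpKind(s); kind != "" {
			fmt.Println("looks like a", kind, "kernel dump header")
		} else {
			fmt.Println("not a recognised dump header")
		}
	}
}
```

    A check like this only validates the signature. It says nothing about whether the memory layout described further down in the header matches the data that follows it.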


    by ladipro at January 06, 2017 02:21 PM

    January 05, 2017

    Daniel Berrange

    ANNOUNCE: New libvirt project Go XML parser model

    Shortly before Christmas, I announced the availability of new Go bindings for the libvirt API. This post announces a companion package for dealing with XML parsing/formatting in Go. The master repository is available on the libvirt git server, but it is expected that Go projects will consume it via an import of the GitHub mirror, since the Go ecosystem is heavily GitHub-focused (e.g. godoc.org can’t produce docs for stuff hosted on libvirt.org git).

    import (
      libvirtxml "github.com/libvirt/libvirt-go-xml"
      "encoding/xml"
    )
    
    domcfg := &libvirtxml.Domain{Type: "kvm", Name: "demo",
                                 UUID: "8f99e332-06c4-463a-9099-330fb244e1b3",
                                 ....}
    xmldoc, err := xml.Marshal(domcfg)
    

    API documentation is available on the godoc website.

    When dealing with the libvirt API, most applications will find themselves needing to either parse or format XML documents describing the configuration of various libvirt objects. Traditionally this task has been left up to the application to deal with, and as a result most applications end up creating some kind of structure / object model to represent the XML document in a more easily accessible manner. To try to reduce this duplicate effort, libvirt has already created the libvirt-glib package, which contains a libvirt-gconfig library mapping libvirt XML documents into the GObject world. This library is accessible to many programming languages via the magic of GObject Introspection, and while there is some work to support this in Go, it is not particularly mature at this time.

    In the Go world, there is a package “encoding/xml” which is able to transform between XML documents and Go structs, given suitable annotations on the struct fields. It is very easy to deal with, simply requiring someone to define a big set of structs with annotated fields to map to the XML document. There’s no real “code” to write as it is really a data definition task. Looking at applications using libvirt in Go, we see quite a few have already gone down this route for dealing with libvirt XML. It should be readily apparent that every application using libvirt in Go is essentially going to end up writing an identical set of structs to deal with the XML handling. This duplication of effort makes no sense at all, and as such, we have started this new libvirt-go-xml package to provide a standard set of Go structs to deal with libvirt XML. The current level of schema support is pretty minimal, supporting the capabilities XML, secrets XML and a small amount of the domain XML, so we’d encourage anyone interested in this to contribute patches to expand the XML schema coverage.
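    To make the “data definition” point concrete, here is a minimal sketch of what such an annotated struct can look like using only the standard library. The demoDomain type and the formatDomain/parseDomain helpers are hypothetical and heavily trimmed for illustration; they are not the definitions shipped in libvirt-go-xml.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// demoDomain is a hypothetical, cut-down model of a <domain> element.
// The struct tags tell encoding/xml how fields map to the document.
type demoDomain struct {
	XMLName xml.Name `xml:"domain"`
	Type    string   `xml:"type,attr"` // attribute: <domain type="...">
	Name    string   `xml:"name"`      // child element: <name>...</name>
	UUID    string   `xml:"uuid"`      // child element: <uuid>...</uuid>
}

// formatDomain serialises the struct into an XML document.
func formatDomain(d demoDomain) (string, error) {
	out, err := xml.Marshal(d)
	return string(out), err
}

// parseDomain fills a struct from an XML document.
func parseDomain(doc string) (demoDomain, error) {
	var d demoDomain
	err := xml.Unmarshal([]byte(doc), &d)
	return d, err
}

func main() {
	doc, err := formatDomain(demoDomain{
		Type: "kvm",
		Name: "demo",
		UUID: "8f99e332-06c4-463a-9099-330fb244e1b3",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(doc)

	// Round trip: parse the document we just produced.
	d, err := parseDomain(doc)
	if err != nil {
		panic(err)
	}
	fmt.Printf("Virt type %s\n", d.Type)
}
```

    Apart from the two thin wrappers there is no parsing logic at all, which is why every application tends to end up with near-identical struct definitions.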

    The following illustrates a further example of its usage in combination with the libvirt-go library (with error checking omitted for brevity):

    import (
      libvirt "github.com/libvirt/libvirt-go"
      libvirtxml "github.com/libvirt/libvirt-go-xml"
      "encoding/xml"
      "fmt"
    )
    
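    // Connect to the system-level QEMU/KVM driver.
    conn, err := libvirt.NewConnect("qemu:///system")
    // Look up the guest and fetch its XML description.
    dom, err := conn.LookupDomainByName("demo")
    xmldoc, err := dom.GetXMLDesc(0)
    
    // Parse the XML document into the libvirtxml struct model.
    domcfg := &libvirtxml.Domain{}
    err = xml.Unmarshal([]byte(xmldoc), domcfg)
    
    fmt.Printf("Virt type %s\n", domcfg.Type)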
    

     

    by Daniel Berrange at January 05, 2017 12:15 PM


    Powered by Planet!
    Last updated: April 24, 2017 04:01 PM