Virt Tools Blog Planet


August 24, 2016

Alex Williamson

KVM Forum 2016 - An Introduction to PCI Device Assignment with VFIO

Slides available here:

http://awilliam.github.io/presentations/KVM-Forum-2016

Video to come

by Alex Williamson (noreply@blogger.com) at August 24, 2016 03:46 PM

August 16, 2016

Daniel Berrange

Improving QEMU security part 7: TLS support for migration

This blog is part 7 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

The live migration feature in QEMU allows a running VM to be moved from one host to another with no noticeable interruption in service and minimal performance impact. The live migration data stream contains a serialized copy of the state of all emulated devices, along with all the guest RAM. In some versions of QEMU it is also used to transfer disk image content, but in modern QEMU use of the NBD protocol is preferred for this purpose. The guest RAM in particular can contain sensitive data that needs to be protected against any would-be attackers on the network between source and target hosts. There are a number of ways to provide such security using external tools/services, including VPNs, IPsec and SSH/stunnel tunnelling.

The libvirtd daemon often already has a secure connection between the source and destination hosts for its own purposes, so many years back support was added to libvirt to automatically tunnel the live migration data stream over libvirt’s own secure connection. This solved both the encryption and authentication problems at once, but there are some downsides to this approach. Tunnelling the connection means extra data copies for the live migration traffic, and when we look at guests with many GB of RAM, the number of data copies starts to matter. The libvirt tunnel only supports tunnelling of a single data connection, whereas in future QEMU may well wish to use multiple TCP connections for the migration data stream to improve the performance of post-copy. The use of NBD for storage migration is not supported with tunnelling via libvirt either, since it would require extra connections too. IOW, while tunnelling over libvirt was a useful short term hack to provide security, it has outlived its practicality.

It is clear that QEMU needs to support TLS encryption natively on its live migration connections. The QEMU migration code has historically had its own distinct I/O layer called QEMUFile, which mixes up tracking of migration state with connection establishment and I/O transfer support. As mentioned in a previous blog post, QEMU now has a general purpose I/O channel framework, so the bulk of the work involved converting the migration code over to use the QIOChannel classes and APIs, which greatly reduced the amount of code in the QEMU migration/ sub-folder as well as simplifying it somewhat. The TLS support involves the addition of two new parameters to the migration code. First, the “tls-creds” parameter provides the ID of a previously created TLS credential object, thus enabling use of TLS on the migration channel. This must be set on both the source and target QEMUs involved in the migration.

On the target host, QEMU would be launched with a set of TLS credentials for a server endpoint:

$ qemu-system-x86_64 -monitor stdio -incoming defer \
    -object tls-creds-x509,dir=/home/berrange/security/qemutls,endpoint=server,id=tls0 \
    ...other args...

To enable incoming TLS migration, two monitor commands are then used:

(qemu) migrate_set_str_parameter tls-creds tls0
(qemu) migrate_incoming tcp:myhostname:9000

On the source host, QEMU is launched in a similar manner, but using client endpoint credentials:

$ qemu-system-x86_64 -monitor stdio \
    -object tls-creds-x509,dir=/home/berrange/security/qemutls,endpoint=client,id=tls0 \
    ...other args...

To enable outgoing TLS migration, two monitor commands are then used:

(qemu) migrate_set_str_parameter tls-creds tls0
(qemu) migrate tcp:otherhostname:9000

The migration code supports a number of different protocols besides just “tcp:”. In particular it allows an “fd:” protocol to tell QEMU to use a passed-in file descriptor, and an “exec:” protocol to tell QEMU to launch an external command to tunnel the connection. It is desirable to be able to use TLS with these protocols too, but when using TLS the client QEMU needs to know the hostname of the target QEMU in order to correctly validate the x509 certificate it receives. Thus, a second “tls-hostname” parameter was added to allow QEMU to be informed of the hostname to use for x509 certificate validation when using a non-tcp migration protocol. This can be set on the source QEMU prior to starting the migration using the “migrate_set_str_parameter” monitor command:

(qemu) migrate_set_str_parameter tls-hostname myhost.mydomain
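
For example, an “exec:” migration over an ssh tunnel might be driven like this (an illustrative sketch only – the tunnel command is just a placeholder, not a recommended setup):

(qemu) migrate_set_str_parameter tls-creds tls0
(qemu) migrate_set_str_parameter tls-hostname myhost.mydomain
(qemu) migrate "exec:ssh myhost.mydomain nc localhost 9000"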

This feature has been under development for a while and was finally merged into QEMU GIT early in the 2.7.0 development cycle, so it will be available for use when 2.7.0 is released in a few weeks. With the arrival of the 2.7.0 release there will finally be TLS support across all QEMU host services where TCP connections are commonly used, namely VNC, SPICE, NBD, migration and character devices.


by Daniel Berrange at August 16, 2016 01:00 PM

Improving QEMU security part 6: TLS support for character devices

This blog is part 6 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

A number of QEMU device models and objects use character devices for providing connectivity with the outside world, including the QEMU monitor, serial ports, parallel ports, virtio serial channels, the RNG EGD object, CCID smartcard passthrough, the IPMI device, USB device redirection and vhost-user. While some of these will only ever need a character device configured with local connectivity, some will certainly need to make use of TCP connections to remote hosts. Historically these connections have always been entirely in clear text, which is unacceptable in the modern hostile network environment where even internal networks cannot be trusted. Clearly the QEMU character device code requires the ability to use TLS for encrypting sensitive data and providing some level of authentication on connections.

The QEMU character device code was mostly using GLib’s GIOChannel framework for doing I/O, but this has a number of unsatisfactory limitations. It cannot do vectored I/O, is not easily extensible, and does not concern itself at all with initial connection establishment. These are all reasons why the QIOChannel framework was added to QEMU. So the first step in supporting TLS on character devices was to convert the code over to use QIOChannel instead of GIOChannel. With that done, adding in support for TLS was quite straightforward, merely requiring the addition of a new configuration property (“tls-creds”) to set the desired TLS credentials.

For example, to run a QEMU VM with a serial port listening on IP 10.0.0.1, port 9000, acting as a TLS server:

$ qemu-system-x86_64 \
      -object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/qemutls \
      -chardev socket,id=s0,host=10.0.0.1,port=9000,tls-creds=tls0,server \
      -device isa-serial,chardev=s0 \
      ...other QEMU options...

It is possible to test connectivity to this TLS server using the gnutls-cli tool:

$ gnutls-cli --priority=NORMAL -p 9000 \
      --x509cafile=/home/berrange/security/qemutls/ca-cert.pem \
      127.0.0.1

In the above example, QEMU was running as a TCP server, and acting as the TLS server endpoint, but this matching is not required. It is valid to configure it to run as a TLS client if desired, though this would be somewhat uncommon.
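
For example, to exercise QEMU as a TLS client, it could be pointed at a gnutls-serv instance acting as the TLS server (a sketch, assuming server certificate and key files exist alongside the CA certificate used above):

$ gnutls-serv --priority=NORMAL -p 9000 --echo \
      --x509cafile=/home/berrange/security/qemutls/ca-cert.pem \
      --x509certfile=/home/berrange/security/qemutls/server-cert.pem \
      --x509keyfile=/home/berrange/security/qemutls/server-key.pem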

Of course you can connect 2 QEMU VMs together, both using TLS. Assuming the above QEMU is still running, we can launch a second QEMU connecting to it with:

$ qemu-system-x86_64 \
      -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/qemutls \
      -chardev socket,id=s0,host=10.0.0.1,port=9000,tls-creds=tls0 \
      -device isa-serial,chardev=s0 \
      ...other QEMU options...

Notice, we’ve changed the “endpoint” and removed the “server” option, so this second QEMU runs as a TCP client and acts as the TLS client endpoint.

This feature is available since the QEMU 2.6.0 release a few months ago.


by Daniel Berrange at August 16, 2016 12:11 PM

August 10, 2016

Zeeshan Ali Khattak

Life is change

Quite a few major life events happened or are happening this summer, so I thought I'd blog about them and some of the experiences I've had.

New job & new city/country

Yes, I found it hard to believe too that I'd ever be leaving Red Hat and the best manager I ever had (no offence to others, but competing with Matthias is just impossible), but I'll be moving to Gothenburg to join the Pelagicore folks as a Software Architect in just 2 weeks. I have always found Swedish to be a very cute language, so I'm looking forward to my attempt at learning Swedish. If only I had learnt Swedish rather than Finnish when I was in Finland.

BTW, I'm selling all my furniture so if you're in London and need some furniture, get in touch!

Fresh helicopter pilot

So after two years of hard work and sinking myself into bank loans, I finally did it! Last week I passed the skills test for the Private Pilot License (Helicopters) and I'm currently waiting anxiously for my license to come through (it usually takes at least two weeks). Once I have that, I can rent helicopters and take passengers with me. I'll be able to share the costs with passengers but I'm not allowed to make money out of it. The test was very tough and I came very close to failing at one particular point. The good news is that despite me being very tense, and very windy conditions on test day, the biggest negative point from my examiner was that I was being over-cautious and hence very slow. So I think it wasn't so bad.



There are a few differences from a driving test. A minor one is that in a driving test you are not expected to explain your steps but simply execute them, whereas in the skills test for flying you're expected to think everything out loud. But the most major difference is that in a driving test you are not expected to drive on your own until you pass the test, whereas for the flying test you are required to have flown solo for at least 10 hours, which needs to include a solo cross country flight of at least 100 nautical miles (185 km) involving 3 major aerodromes.  Mine involved Elstree, Cranfield and Duxford. I've been GPS logging while flying so I can show you the log of my qualifying solo cross country flight (click here to see details and notes):



I've still got a long way to go towards a Commercial License, but at least now I can share the cost with friends, so building hours towards the commercial license won't be so expensive (I hope). I've found a nice company in Gothenburg that trains in and rents helicopters, so I'm very much looking forward to flying over the coasts there. Wanna join? Let me know. :)

by noreply@blogger.com (zeenix) at August 10, 2016 01:24 PM

August 05, 2016

Gerd Hoffmann

Two new images uploaded

Uploaded two new images. First, a centos7 image for the raspberry pi 3 (arm64). Very similar to the fedora images, see the other raspberrypi posts for details. Second, an armv7 qemu image with grub2 boot loader. Boots with efi firmware (edk2/tianocore). No need to copy kernel+initrd from the image and pass that to qemu, making […]

by Gerd Hoffmann at August 05, 2016 08:24 AM

July 15, 2016

Alex Williamson

Intel Graphics assignment

Hey folks, it feels like it's time to mention that assignment of Intel graphics devices (IGD) is currently available in qemu.git and will be part of the upcoming QEMU 2.7 release.  There's already pretty thorough documentation of the modes available in the source tree, please give it a read.  There are two modes described there, "legacy" and "Universal Passthrough" (UPT); each has its pros and cons.  Which ones are available to you depends on your hardware.  UPT mode is only available for Broadwell and newer processors while legacy mode is available all the way back through SandyBridge.  If you have a processor older than SandyBridge, stop now, this is not going to work for you.  If you don't know what any of these strange names mean, head to Wikipedia and Ark to figure it out.

The high level overview is that "legacy" mode is much like our GeForce support: the IGD is meant to be the primary and exclusive graphics in the VM.  Additionally the IGD address in the VM must be at PCI 00:02.0, only SeaBIOS is currently supported, only the 440FX chipset model is supported (no Q35), the IGD device must be the primary host graphics device, and the host needs to be running kernel v4.6 or newer.  Clearly assigning the host primary graphics is a bit of an about-face for our GPU assignment strategy, but we depend on running the IGD video ROM, which depends on VGA and imposes most of the above requirements as well (oh, add CONFIG_VFIO_PCI_VGA to the requirements list).  I have yet to see an IGD ROM with UEFI support, which is why OVMF is not yet supported, but it seems possible to support with a CSM and some additional code in OVMF.

Legacy mode should work with both Linux and Windows guests (and hopefully others if you're so inclined).  The i915 driver does suffer from the typical video driver problem that sometimes the whole system explodes (not literally) when unbinding or re-binding the IGD to the driver.  Personally I avoid this by blacklisting the i915 driver.  Of course as some have found out trying to do this with discrete GPUs, there are plenty of other drivers ready to jump on the device to keep the console working.  The primary ones I've seen are vesafb and efifb; which one is used on your system depends on your host firmware settings, legacy BIOS vs UEFI respectively.  To disable these, simply add video=vesafb:off or video=efifb:off to the kernel command line (not sure which to use?  try both, video=vesafb:off,efifb:off).  The first thing you'll notice when you boot an IGD system with i915 blacklisted and the more basic framebuffer drivers disabled is that you don't get anything on the graphics head after grub.  Plan for this.  I use a serial console, but perhaps you're more comfortable running blind and willing to hope the system boots and you can ssh into it remotely.
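
For example, on a Fedora host one way to apply both the blacklist and the framebuffer disabling persistently is via grubby (a sketch; adapt to your distro's bootloader tooling):

$ sudo grubby --update-kernel=ALL \
      --args="modprobe.blacklist=i915 video=vesafb:off video=efifb:off"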

If you've followed along with this procedure, you should be able to simply create a <hostdev> entry in your libvirt XML, which ought to look something like this:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </hostdev>

Again, assigning the IGD device (which is always 00:02.0) to address 00:02.0 in the VM is required.  Delete the <video> and <graphics> sections and everything should just magically work.  Caveat emptor, my newest CPU is Broadwell, I've been told this works with Skylake, but IGD is hardly standardized and each new implementation seems to tweak things just a bit.

Some of you are probably also curious why this doesn't work on Q35, which leads into the discussion of UPT mode; IGD clearly is not a discrete GPU, but "integrated" not only means that the GPU is embedded in the system, in this case it means that the GPU is kind of smeared across the system.  This is why IGD assignment hasn't "just worked" and why you need a host kernel with support for exposing certain regions through vfio and a BIOS that's aware of IGD, and it needs to be at a specific address, etc, etc, etc.  One of those requirements is that the video ROM actually also cares about a few properties of the device at PCI address 00:1f.0, the ISA/LPC bridge.  Q35 includes its own bridge at that location and we cannot simply modify the IDs of that bridge for compatibility reasons.  Therefore that bridge being an implicit part of Q35 means that IGD assignment doesn't work on Q35.  This also means that PCI address 00:1f.0 is not available for use in a 440FX machine.

Ok, so UPT.  Intel has known for a while that the sprawl of IGD has made it difficult to deal with for device assignment.  To combat this, both software and hardware changes have been made that help to consolidate IGD to be more assignment-friendly.  Great news, right?  Well, sort of.  First off, in UPT mode the IGD is meant to be a secondary graphics device in the VM; there's no VGA mode support (oh, BTW, x-vga=on is automatically added by QEMU in legacy mode).  In fact, um, there's no output support of any kind by default in UPT mode.  How's this useful, you ask?  Well, between the emulated graphics and IGD you can set up mirroring so you actually have a remote-capable, hardware accelerated graphics VM.  Plus, if you add the option x-igd-opregion=on to the vfio-pci device, you can get output to a physical display, but there again you're going to need the host running kernel v4.6 or newer and the upcoming QEMU 2.7 support, while no-output UPT has probably actually worked for quite a while.  UPT mode has no requirements for the IGD PCI address, but note that most VM firmware, SeaBIOS or OVMF, will define the primary graphics as the one having the lowest PCI address.  Usually not a problem, but some of you create some crazy configs.  You'll also still need to do all the blacklisting and video disabling above, or just risk binding and unbinding i915 from the host, gambling each time whether it'll explode.
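
As a bare sketch, the assignment part of a UPT-mode command line could look like this (illustrative only; everything else about the VM configuration is up to you):

$ qemu-system-x86_64 ...other args... \
      -device vfio-pci,host=00:02.0,x-igd-opregion=on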

So UPT sounds great, except why is this opregion thing optional?  Well, it turns out that if you want to do that cool mirroring thing I mentioned above and a physical output is enabled with the opregion, you actually need to have a monitor attached to the device or else your apps don't get any hardware acceleration love.  Whereas if IGD doesn't know about any outputs, it's happy to apply hardware acceleration regardless of what's physically connected.  Sucks, but readers here should already know how to create wrapper scripts to add this extra option if they want it (similar to x-vga=on).  I don't think Intel really wants to support this hacky hybrid mode either, thus the experimental x- option prefix tag.
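
For example, a hypothetical wrapper script (pointed to by the VM's emulator setting) could splice the option in:

#!/bin/bash
# Hypothetical qemu wrapper: append x-igd-opregion=on to the assigned IGD device.
# 00:02.0 is the IGD host address discussed above; adjust to taste.
args=()
for arg in "$@"; do
    case "$arg" in
        vfio-pci,host=00:02.0*) arg="$arg,x-igd-opregion=on" ;;
    esac
    args+=("$arg")
done
exec /usr/bin/qemu-system-x86_64 "${args[@]}"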

Oh, one more gotcha for UPT mode, Intel seems to expect otherwise, but I've had zero success trying to run Linux guests with UPT.  Just go ahead and assume this is for your Windows guests only at this point.

What else... laptop displays should work, I believe switching outputs even works, but working on laptops is rather inconvenient since you're unlikely to have a serial console available.  Also note that while you can use input-linux to attach a laptop keyboard and mouse (not the trackpad, IME), I don't know how to make the hotkeys work, so that's a bummer.  Some IGD devices will generate DMAR error spew on the host when assigned, particularly the first time per host boot.  Don't be too alarmed by this, especially if it stops before the display is initialized.  This seems to be caused by resetting the IGD in an IOMMU context where it can't access its page tables setup by the BIOS/host.  Unless you have an ongoing spew of these, they can probably be ignored.  If you have something older than SandyBridge that you wish you could use this with and continued reading even after being told to stop, sorry, there was a hardware change at SandyBridge and I don't have anything older to test with and don't really want to support additional code for such outdated hardware.  Besides, those are pretty old and you need an excuse for an upgrade anyway.

With this support I've switched my desktop system so that the host actually runs from a USB stick and the previous bare-metal Fedora install is virtualized with IGD, running alongside my existing GeForce VM.  Give it a try and good luck.

by Alex Williamson (noreply@blogger.com) at July 15, 2016 05:34 PM

July 01, 2016

Daniel Berrange

ANNOUNCE: libosinfo 0.3.1 released

I am happy to announce a new release of libosinfo, version 0.3.1 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.

Changes in this release include:

Thanks to everyone who contributed towards this release.

A special note to downstream vendors/distributors.

The next major release of libosinfo will include a major change in the way libosinfo is released and distributed. The current single release will be replaced with three independently released artefacts: the libosinfo library itself, the osinfo-db-tools utilities, and the osinfo-db database content.

The libosinfo and osinfo-db-tools releases will be fairly infrequent, as they are today. The osinfo-db releases will be done very frequently, with automated releases made available no more than 1 day after updated DB content is submitted to the project.

by Daniel Berrange at July 01, 2016 10:31 AM

June 30, 2016

Daniel Berrange

ANNOUNCE: virt-viewer 4.0 release

I am happy to announce a new bugfix release of virt-viewer 4.0 (gpg), including experimental Windows installers for Win x86 MSI (gpg) and Win x64 MSI (gpg). Signatures are created with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

All historical releases are available from:

http://virt-manager.org/download/

Changes in this release include:

Thanks to everyone who contributed towards this release.

by Daniel Berrange at June 30, 2016 03:39 PM

June 29, 2016

Cole Robinson

UEFI virt roms now in official Fedora repos

Kamil got to it first, but just a note that UEFI roms for x86 and aarch64 virt are now shipped in the standard Fedora repos, where previously the recommended place to grab them was an external nightly repo. Kamil has updated the UEFI+QEMU wiki page to reflect this change.

On up-to-date Fedora 23+ these roms will be installed automatically with the relevant qemu packages, and libvirt is properly configured to advertise the rom files to applications, so enabling this with tools like virt-manager works out of the box.

For the curious, the reason we can now ship these binaries in Fedora is because the problematic EDK2 'FatPkg' code, which had a Fedora incompatible license, was replaced with an implementation with a less restrictive (and more Fedora friendly) license.

by Cole Robinson (noreply@blogger.com) at June 29, 2016 02:28 PM

June 27, 2016

Gerd Hoffmann

New Raspberry PI images uploaded.

I’ve uploaded new images. Both a Fedora 23 refresh and new Fedora 24 images. There are not many changes, almost all notes from the two older articles here and here still apply. Noteworthy change is that the 32bit images don’t have a 64bit kernel for the rpi3 any more, so both rpi2 and rpi3 boot […]

by Gerd Hoffmann at June 27, 2016 08:24 AM

June 18, 2016

Cole Robinson

virt-manager 1.4.0 release

I've just released virt-manager 1.4.0. Besides the spice GL bits that I previously talked about, there's nothing too exciting in this release except a lot of virt-install/virt-xml command line extensions.

The changelog highlights:

by Cole Robinson (noreply@blogger.com) at June 18, 2016 12:06 PM

June 10, 2016

Cole Robinson

check-pylint: mini tool for running pylint anywhere

pylint and pep8 are indispensable tools for python development IMO. For projects I maintain, I long ago added a 'setup pylint' sub-command to run both commands, and I've documented this as a necessary step in the contributor guidelines.

But over the years I've accumulated many repos for small bits of python code that never have need for a setup.py script, but I still want the convenience of being able to run pylint and pep8 with a single command and a reasonable set of options.

So, a while back I wrote this tiny 'check-pylint' script which does exactly that. The main bit it adds is automatically searching the current directory for python scripts and modules and passing them to pylint/pep8. From the README:

Simple helper script that scoops up all python modules and scripts beneath the current directory, and passes them through pylint and pep8. Has a bit of smarts to ignore .git directory, and handle files that don't end in .py

The point is that you can just fire off 'check-pylint' in any directory containing python code and get a quick report.

by Cole Robinson (noreply@blogger.com) at June 10, 2016 09:46 AM

May 22, 2016

Cole Robinson

spice OpenGL/virgl acceleration on Fedora 24

New in Fedora 24 virt is 3D accelerated SPICE graphics, via Virgl. This is kinda-sorta OpenGL passthrough from the VM up to the host machine. Much of the initial support has been around since qemu 2.5, but it's more generally accessible now that SPICE is in the mix, since that's the default display type used by virt-manager and gnome-boxes.

I'll explain below how you can test things on Fedora 24, but first let's cover the hurdles and caveats. This is far from being something that can be turned on by default and there are still serious integration issues to iron out. All of this is regarding usage with libvirt tools.

Caveats and known issues

Because of these issues, we haven't exposed this as a UI knob in any of the tools yet, to save us some redundant bug reports for the above issues from users who are just clicking a cool sounding check box :) Instead you'll need to explicitly opt in via the command line.

Testing it out

You'll need the following packages (or later) to test this:

Once you install a Fedora 24 VM through the standard methods, you can enable spice GL for your VM with these two commands:

$ virt-xml --connect $URI $VM_NAME --confirm --edit --video clearxml=yes,model=virtio,accel3d=yes
$ virt-xml --connect $URI $VM_NAME --confirm --edit --graphics clearxml=yes,type=spice,gl=on,listen=none

The first command will switch the graphics device to 'virtio' and enable the 3D acceleration setting. The second command will set up spice to listen locally only, and enable GL. Make sure to fully poweroff the VM afterwards for the settings to take effect. If you want to make the changes manually with 'virsh edit', the XML specifics are described in the spice GL documentation.

Once your VM has started up, you can verify that everything is working correctly by checking glxinfo output in the VM, 'virgl' should appear in the renderer string:

$ glxinfo | grep virgl
Device: virgl (0x1010)
OpenGL renderer string: Gallium 0.4 on virgl

And of course the more fun test of giving supertuxkart a spin :)

Credit to Dave Airlie, Gerd Hoffmann, and Marc-André Lureau for all the great work that got us to this point!

by Cole Robinson (noreply@blogger.com) at May 22, 2016 11:56 AM

May 12, 2016

Richard Jones

Libguestfs appliance boot in under 600ms

$ ./run ./utils/boot-benchmark/boot-benchmark
Warming up the libguestfs cache ...
Running the tests ...

test version: libguestfs 1.33.28
 test passes: 10
host version: Linux moo.home.annexia.org 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
    host CPU: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
     backend: direct               [to change set $LIBGUESTFS_BACKEND]
        qemu: /home/rjones/d/qemu/x86_64-softmmu/qemu-system-x86_64 [to change set $LIBGUESTFS_HV]
qemu version: QEMU emulator version 2.5.94, Copyright (c) 2003-2008 Fabrice Bellard
         smp: 1                    [to change use --smp option]
     memsize: 500                  [to change use --memsize option]
      append:                      [to change use --append option]

Result: 575.9ms ±5.3ms

There are various tricks here:

  1. I’m using the (still!) not upstream qemu DMA patches.
  2. I’ve compiled my own very minimal guest Linux kernel.
  3. I’m using my nearly upstream "crypto: Add a flag allowing the self-tests to be disabled at runtime." patch.
  4. I’ve got two sets of non-upstream libguestfs patches 1, 2
  5. I am not using libvirt, but if you do want to use libvirt, make sure you use the very latest version since it contains an important performance patch.

Previously


by rich at May 12, 2016 10:07 PM

Daniel Berrange

Analysis of techniques for ensuring migration completion with KVM

Live migration is a long standing feature in QEMU/KVM (and other competing virtualization platforms); however, by default it does not cope very well with guests whose workloads are very memory write intensive. It is very easy to create a guest workload that will ensure a migration will never complete in its default configuration. For example, a guest which continually writes to each byte in a 1 GB region of RAM will never successfully migrate over a 1Gb/sec NIC. Even with a 10Gb/s NIC, a slightly larger guest can dirty memory fast enough to prevent completion without an unacceptably large downtime at switchover. Thus over the years, a number of optional features have been developed for QEMU with the aim of helping migration to complete.

If you don’t want to read the background information on migration features and the testing harness, skip right to the end where there are a set of data tables showing charts of the results, followed by analysis of what this all means.

The techniques available

QEMU provides a number of tunables and strategies that can be applied to a migration: raising the permitted maximum downtime, raising the bandwidth limit, pausing the guest CPUs, auto-converge (progressively throttling guest CPUs), post-copy, multi-thread (MT) compression and XBZRLE compression. These are the scenarios exercised in the results tables below.

Measuring impact of the techniques

Understanding what the various techniques do in order to maximise chances of a successful migration is useful, but it is hard to predict how well they will perform in the real world when faced with varying workloads. In particular, are they actually capable of ensuring completion under worst case workloads and what level of performance impact do they actually have on the guest workload. This is a problem that the OpenStack Nova project is currently struggling to get a clear answer on, with a view to improving Nova’s management of libvirt migration. In order to try and provide some guidance in this area, I’ve spent a couple of weeks working on a framework for benchmarking QEMU guest performance when subjected to the various different migration techniques outlined above.

In OpenStack the goal is for migration to be a totally “hands off” operation for cloud administrators. They should be able to request a migration and then forget about it until it completes, without having to baby sit it to apply tuning parameters. The other goal is that the Nova API should not have to expose any hypervisor specific concepts such as post-copy, auto-converge, compression, etc. Essentially Nova itself has to decide which QEMU migration features to use and just “do the right thing” to ensure completion. Whatever approach is chosen needs to be able to cope with any type of guest workload, since the cloud admins will not have any visibility into what applications are actually running inside the guest. With this in mind, when it came to performance testing the QEMU migration features, it was decided to look at their behaviour when faced with the worst case scenario. Thus a stress program was written which would allocate many GB of RAM, and then spawn a thread on each vCPU that would loop forever xor’ing every byte of RAM against an array of bytes taken from /dev/random. This ensures that the guest is both heavy on reads and writes to memory, as well as creating RAM pages which are very hostile towards compression. This stress program was statically linked and built into a ramdisk as the /init program, so that Linux would boot and immediately run this stress workload in a fraction of a second. In order to measure performance of the guest, each time 1 GB of RAM has been touched, the program will print out details of how long it took to update this GB and an absolute timestamp. These records are captured over the serial console from the guest, to be later correlated with what is taking place on the host side wrt migration.

Next up it was time to create a tool to control QEMU from the host and manage the migration process, activating the desired features. A test scenario was defined which encodes details of what migration features are under test and their settings (number of iterations before activating post-copy, bandwidth limits, max downtime values, number of compression threads, etc). A hardware configuration was also defined which expressed the hardware characteristics of the virtual machine running the test (number of vCPUs, size of RAM, host NUMA memory & CPU binding, usage of huge pages, memory locking, etc). The tests/migration/guestperf.py tool provides the mechanism to invoke the test in any of the possible configurations. For example, to test post-copy migration, switching to post-copy after 3 iterations, allowing 1Gbs bandwidth, on a guest with 4 vCPUs and 8 GB of RAM, one might run:

$ tests/migration/guestperf.py --cpus 4 --mem 8 --post-copy --post-copy-iters 3 --bandwidth 125 --dst-host myotherhost --transport tcp --output postcopy.json

The postcopy.json file contains the full report of the test results. This includes all details of the test scenario and hardware configuration, migration status recorded at the start of each iteration over RAM, the host CPU usage recorded once a second, and the guest stress test output. The accompanying tests/migration/guestperf-plot.py tool can consume this data file and produce interactive HTML charts illustrating the results.

$ tests/migration/guestperf-plot.py --split-guest-cpu --qemu-cpu --vcpu-cpu --migration-iters --output postcopy.html postcopy.json

To assist in making comparisons between runs, however, a set of standardized test scenarios is also defined, which can be run via the tests/migration/guestperf-batch.py tool; in this case it is merely required to provide the desired hardware configuration:

$ tests/migration/guestperf-batch.py --cpus 4 --mem 8 --dst-host myotherhost --transport tcp --output myotherhost-4cpu-8gb

This will run all the standard defined test scenarios and save many data files in the myotherhost-4cpu-8gb directory. The same guestperf-plot.py tool can be used to create charts combining multiple data sets at once to allow easy comparison.
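
For example, to overlay the post-copy and auto-converge results from such a batch run on a single chart (the data file names here are illustrative guesses based on the scenario names):

$ tests/migration/guestperf-plot.py --split-guest-cpu --qemu-cpu --vcpu-cpu \
      --migration-iters --output compare.html \
      myotherhost-4cpu-8gb/post-copy-bw-unlimited.json \
      myotherhost-4cpu-8gb/auto-converge-bw-unlimited.json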

Performance results for QEMU 2.6

With the tools written, I went about running some tests against the QEMU GIT master codebase, which was effectively the same as the QEMU 2.6 code just released. The pair of hosts used were Dell PowerEdge R420 servers with 8 CPUs and 24 GB of RAM, spread across 2 NUMA nodes. The primary NICs were Broadcom Gigabit, but they were augmented with Mellanox 10-Gig-E RDMA capable NICs, which were picked for transfer of the migration traffic. For the tests I decided to collect data for two distinct hardware configurations, a small uniprocessor guest (1 vCPU and 1 GB of RAM) and a moderately sized multi-processor guest (4 vCPUs and 8 GB of RAM). Memory and CPU binding was specified such that the guests were confined to a single NUMA node to avoid performance measurements being skewed by cross-NUMA node memory accesses. The hosts and guests were all running the RHEL-7 3.10.0-0369.el7.x86_64 kernel.

To understand the impact of different network transports & their latency characteristics, the two hardware configurations were combinatorially expanded against 4 different network configurations – a local UNIX transport, a localhost TCP transport, a remote 10Gbs TCP transport and a remote 10Gbs RDMA transport.

The full set of results are linked from the tables that follow. The first link in each row gives a guest CPU performance comparison for each scenario in that row. The other cells in the row give the full host & guest performance details for that particular scenario.

UNIX socket, 1 vCPU, 1 GB RAM

Using UNIX socket migration to local host, guest configured with 1 vCPU and 1 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

UNIX socket, 4 vCPU, 8 GB RAM

Using UNIX socket migration to local host, guest configured with 4 vCPU and 8 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

TCP socket local, 1 vCPU, 1 GB RAM

Using TCP socket migration to local host, guest configured with 1 vCPU and 1 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

TCP socket local, 4 vCPU, 8 GB RAM

Using TCP socket migration to local host, guest configured with 4 vCPU and 8 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

TCP socket remote, 1 vCPU, 1 GB RAM

Using TCP socket migration to remote host, guest configured with 1 vCPU and 1 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

TCP socket remote, 4 vCPU, 8 GB RAM

Using TCP socket migration to remote host, guest configured with 4 vCPU and 8 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

RDMA socket, 1 vCPU, 1 GB RAM

Using RDMA socket migration to remote host, guest configured with 1 vCPU and 1 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

RDMA socket, 4 vCPU, 8 GB RAM

Using RDMA socket migration to remote host, guest configured with 4 vCPU and 8 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

Analysis of results

The charts above provide the full set of raw results, from which you are welcome to draw your own conclusions. The test harness is also posted on the qemu-devel mailing list and will hopefully be merged into GIT at some point, so anyone can repeat the tests or run tests to compare other scenarios. What follows now is my interpretation of the results and the interesting points they show.

Considering all the different features tested, post-copy is the clear winner. It was able to guarantee completion of migration every single time, regardless of guest RAM size, with minimal long-lasting impact on guest performance. While it did have a notable spike impacting guest performance at the time of the switch from pre-copy to post-copy phases, this impact was short lived, only a few seconds. The next best result was seen with auto-converge, which again managed to complete migration in the majority of cases. By comparison with post-copy, the worst case impact seen on guest CPU performance was of the same order of magnitude, but it lasted for a very, very long time, many minutes long. In addition, in more bandwidth limited scenarios, auto-converge was unable to throttle guest CPUs quickly enough to avoid hitting the overall 5 minute timeout, whereas post-copy would always succeed except in the most limited bandwidth scenarios (100Mbs – where no strategy can ever work). The other benefit of post-copy is that only the guest OS thread responsible for the page fault is delayed – other threads in the guest OS will continue running at normal speed if their RAM is already on the host. With auto-converge, all guest CPUs and threads are throttled regardless of whether they are responsible for dirtying memory. IOW post-copy has a targeted performance hit, whereas auto-converge is indiscriminate. Finally, as noted earlier, post-copy does have a failure scenario which can result in losing the VM if the network to the source host is lost for long enough to time out the TCP connection while in post-copy mode. This risk can be mitigated with redundancy at the network layer, and the VM is only at risk for the short period of time it is running in post-copy mode, which is mere seconds with a 10Gbs link.

It was expected that the compression features would fare badly given the guest workload, but the impact was far worse than expected, particularly for MT compression. Given the major requirements compression has in terms of host CPU time (MT compression) or host RAM (XBZRLE compression), they do not appear to be viable as general purpose features. They should only be used if the workloads are known to be compression friendly, the host has the CPU and/or RAM resources to spare, and neither post-copy nor auto-converge are possible to use. To make these features more practical to use in an automated, general purpose manner, QEMU would have to be enhanced to allow the mgmt application to have direct control over turning them on and off during migration. This would allow the app to try using compression, monitor its effectiveness and then turn compression off if it is being harmful, rather than having to abort the migration entirely and restart it.

There is scope for further testing with RDMA, since the hardware used for testing was limited to 10Gbs. Newer RDMA hardware is supposed to be capable of reaching higher speeds, 40Gbs, even 100Gbs, which would have a correspondingly positive impact on the ability to migrate. At least for any speeds of 10Gbs or less though, it does not appear worthwhile to use RDMA; apps would be better off using TCP in combination with post-copy.

In terms of network I/O, no matter what the guest workload, QEMU is generally capable of saturating whatever link is used for migration for as long as it takes to complete. It is very easy to create workloads that will never complete, and decreasing the available bandwidth just increases the chances that migration will never complete. It might be tempting to think that if you have 2 guests, it would take the same total time whether you migrate them one after the other, or migrate them in parallel. This is not necessarily the case though, as with a parallel migration the bandwidth will be shared between them, which increases the chances that neither guest will ever be able to complete. So as a general rule it appears wise to serialize all migration operations on a given host, unless there are multiple NICs available.

In summary, use post-copy if it is available, otherwise use auto-converge. Don’t bother with compression unless the workload is known to be very compression friendly. Don’t bother with RDMA unless it supports more than 10 Gbs, otherwise stick with plain TCP.

by Daniel Berrange at May 12, 2016 03:17 PM

April 29, 2016

Cole Robinson

Using CPU host-passthrough with virt-manager

I described virt-manager's CPU model default in this post. In that post I explained the difficulties of using either of the libvirt options for mirroring the host CPU: mode=host-model still has operational issues, and mode=host-passthrough isn't recommended for use with libvirt over supportability concerns.

Unfortunately since writing that post the situation hasn't improved any, and since host-passthrough is the only reliable way to expose the full capabilities of the host CPU to the VM, users regularly want to enable it. This is particularly apparent if trying to do nested virt, which often doesn't work on Intel CPUs unless host-passthrough is used.

However we don't explicitly expose this option in virt-manager since it's not generally recommended for libvirt usage. You can however still enable it in virt-manager by manually typing 'host-passthrough' into the CPU model field in the processor details page.
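
If you'd rather script it, a virt-xml invocation along these lines should also work (a sketch; $VM_NAME is a placeholder, and the change takes effect on the next cold boot):

$ virt-xml $VM_NAME --edit --cpu host-passthrough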

by Cole Robinson (noreply@blogger.com) at April 29, 2016 04:48 PM

UEFI support in virt-install and virt-manager

One of the new features in virt-manager 1.2.0 (from back in May) is user friendly support for enabling UEFI.

First a bit about terminology: When UEFI is packaged up to run in an x86 VM, it's often called OVMF. When UEFI is packaged up to run in an AArch64 VM, it's often called AAVMF. But I'll just refer to all of it as UEFI.

Using UEFI with virt-install and virt-manager


The first step to enable this for VMs is to install the binaries. UEFI still has some licensing issues that make it incompatible with Fedora's policies, so the bits are hosted in an external repo. Details for installing the repo and UEFI bits are over here.

Once the bits are installed (and you're on Fedora 22 or later), virt-manager and virt-install provide simple options to enable UEFI when creating VMs.

Marcin has a great post with screenshots describing this for virt-manager (for aarch64, but the steps are identical for x86 VMs).

For virt-install it's as simple as doing:

$ sudo virt-install --boot uefi ...

virt-install will get the binary paths from libvirt and set everything up with the optimal config. If virt-install can't figure out the correct parameters, like if no UEFI binaries are installed, you'll see an error like: ERROR    Error: --boot uefi: Don't know how to setup UEFI for arch 'x86'

See 'virt-install --boot help' if you need to tweak the parameters individually.
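
If you do need to spell things out manually, the invocation might look something along these lines (a sketch only – the exact --boot sub-option names vary across virt-install versions, and the paths shown are the edk2.git ones from the mapping below):

$ virt-install \
      --boot loader=/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd,loader_ro=yes,loader_type=pflash,nvram_template=/usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd \
      ...other options...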

Implementing support in applications


Libvirt needs to know about UEFI<->NVRAM config file mapping, so it can advertise it to tools like virt-manager/virt-install. Libvirt looks at a hardcoded list of known host paths to see if any firmware is installed, and if so, lists those paths in domain capabilities output (virsh domcapabilities). Libvirt in Fedora 22+ knows to look for the paths provided by the repo mentioned above, so just installing the firmware is sufficient to make libvirt advertise UEFI support.
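
A quick way to eyeball this is to grep the domain capabilities output for the loader element (illustrative; the paths listed will depend on which firmware packages are installed):

$ virsh domcapabilities | grep -A4 '<loader'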

The domain capabilities output only lists the firmware path and the associated variable store path. Notably lacking is any indication of what architecture the binaries are meant for. So tools need to determine this mapping themselves.

virt-manager/virt-install and libguestfs use a similar path matching heuristic. The libguestfs code is a good reference:

    match guest_arch with
    | "i386" | "i486" | "i586" | "i686" ->
       [ "/usr/share/edk2.git/ovmf-ia32/OVMF_CODE-pure-efi.fd",
         "/usr/share/edk2.git/ovmf-ia32/OVMF_VARS-pure-efi.fd" ]
    | "x86_64" ->
       [ "/usr/share/OVMF/OVMF_CODE.fd",
         "/usr/share/OVMF/OVMF_VARS.fd";
         "/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd",
         "/usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd" ]
    | "aarch64" ->
       [ "/usr/share/AAVMF/AAVMF_CODE.fd",
         "/usr/share/AAVMF/AAVMF_VARS.fd";
         "/usr/share/edk2.git/aarch64/QEMU_EFI-pflash.raw",
         "/usr/share/edk2.git/aarch64/vars-template-pflash.raw" ]
    | arch ->
       error (f_"don't know how to convert UEFI guests for architecture %s")
         guest_arch in

Having to track this in every app is quite crappy, but it's the only good solution at the moment. Hopefully long term libvirt will grow some solution that makes this easier for applications.

by Cole Robinson (noreply@blogger.com) at April 29, 2016 04:35 PM

Polkit password-less access for the 'libvirt' group

Many users, who admin their own machines, want to be able to use tools like virt-manager without having to enter a root password. Just google 'virt-manager without password' and see all the hits. I've seen many blogs and articles over the years describing various ways to work around it.

The password prompting is via libvirt's polkit integration. The idea is that we want the applications to run as a regular unprivileged user (running GUI apps as root is considered a no-no), and only use the root authentication for talking to system libvirt instance. Most workarounds suggest installing a polkit rule to allow your user, or a particular user group, to access libvirt without needing to enter the root password.

In libvirt v1.2.16 we finally added official support for this (and backported it to Fedora 22+). The group is predictably called 'libvirt'. This matches polkit rules that Debian and SUSE were already shipping too.
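
For reference, the rule itself is tiny. It looks roughly like this (the install path and exact contents may differ slightly between distros):

$ cat /usr/share/polkit-1/rules.d/50-libvirt.rules
polkit.addRule(function(action, subject) {
    if (action.id == "org.libvirt.unix.manage" &&
        subject.isInGroup("libvirt")) {
        return polkit.Result.YES;
    }
});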

So just add your user to the 'libvirt' group and enjoy passwordless virt-manager usage:

$ usermod --append --groups libvirt `whoami`

by Cole Robinson (noreply@blogger.com) at April 29, 2016 04:32 PM

qemu:///system vs qemu:///session

If you've spent time using libvirt apps like virt-manager, you've likely seen references to libvirt URIs. The URI is how users or apps tell libvirt what hypervisor (qemu, xen, lxc, etc) to connect to, what host it's on, what authentication method to use, and a few other bits. 

For QEMU/KVM (and a few other hypervisors), there's a concept of system URI vs session URI:

qemu:///system: libvirtd runs as root, and VMs are launched by that single privileged system daemon (the VM processes themselves run as the unprivileged 'qemu' user).

qemu:///session: libvirtd and the VMs all run as your regular unprivileged user; each user gets their own daemon and their own separate pool of VMs.

That describes the 'what', but the 'why' of it is a bigger story. The privilege level of the daemon and VMs have pros and cons depending on your usecase. The easiest way to understand the benefit of one over the other is to list the problems with each setup.


qemu:///system runs libvirtd as root, and access is mediated by polkit. This means if you are connecting to it as a regular user (like when launching virt-manager), you need to enter the host root password, which is annoying and not generally desktop usecase friendly. There are ways to work around it but they require explicit admin configuration.

Desktop use cases also suffer since VMs are running as the 'qemu' user, but the app (like virt-manager) is running as your local user. For example, say you download an ISO to $HOME and want to attach it to a VM. The VM is running as unprivileged user=qemu and can't access your $HOME, so libvirt has to change the ISO file owner to qemu:qemu and virt-manager has to give search access to $HOME for user=qemu. It's a pain for apps to handle, and it's confusing for users, but after dealing with it for a while in virt-manager we've made it generally work. (Though try giving a VM access to a file on a fat32 USB drive that was automounted by your desktop session...)


qemu:///session runs libvirtd and VMs as your unprivileged user. This integrates better with desktop use cases since permissions aren't an issue, no root password is required, and each user has their own separate pool of VMs.

However because nothing in the chain is privileged, any VM setup tasks that need host admin privileges aren't an option. Unfortunately this includes most general purpose networking options.

The default qemu mode in this case is usermode networking (or SLIRP). This is an IP stack implemented in userspace. This has many drawbacks: the VM cannot easily be accessed by the outside world, the VM can talk to the outside world but only over a limited number of networking protocols, and it's very slow.

There is an option for qemu:///session VMs to use a privileged networking setup, via the setuid qemu-bridge-helper. Basically the host admin sets up a bridge, adds it to a whitelist at /etc/qemu/bridge.conf, then it's available for unprivileged qemu instances. By default on Fedora this contains 'virbr0' which is the default virtual network bridge provided by the system libvirtd instance, and what qemu:///system VMs typically use.
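
The whitelist format is simply one bridge per 'allow' line, for example:

$ cat /etc/qemu/bridge.conf
allow virbr0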

gnome-boxes originally used usermode networking, but switched around the Fedora 21 timeframe to use virbr0 via qemu-bridge-helper. But that's dependent on virbr0 being set up correctly by the host admin, or via package install (libvirt-daemon-config-network package on Fedora).

qemu:///session also misses some less common features that require host admin privileges, like host PCI device assignment. Also VM autostart doesn't work as expected because the session daemon itself isn't autostarted.


Apps have to decide for themselves which libvirtd mode to use, depending on their use case.

qemu:///system is completely fine for big apps like oVirt and Openstack that require admin access to the virt hosts anyways.

virt-manager largely defaults to qemu:///system because that's what it has always done, and that default long precedes qemu-bridge-helper. We could switch but it would just trade one set of issues for another. virt-manager can be used with qemu:///session though (or any URI for that matter).

libguestfs uses qemu:///session since it avoids all the permission issues and the VM appliance doesn't really care about networking.

gnome-boxes prioritized desktop integration from day 1, so qemu:///session was the natural choice. But they've struggled with the networking issues in various forms.

Other apps are in a pickle: they would like to use qemu:///session to avoid the permission issues, but they also need to tweak the network setup. This is the case vagrant-libvirt currently finds itself in.

by Cole Robinson (noreply@blogger.com) at April 29, 2016 04:11 PM

github 'hub' command line tool

I don't often need to contribute patches to code hosted on github; most of the projects I contribute to are either old school and don't use github for anything but mirroring their main git repo, or are small projects I entirely maintain so I don't submit pull-requests.

But when I do need to submit patches, github's hub tool makes my life a lot simpler: it allows forking repositories and submitting pull-requests very easily from the command line.

The 'hub' tool wants to be installed as an alias for 'git'. I originally tried that, but it made my bash prompt insanely slow since I show the current git branch and dirty state in my bash prompt. When I first encountered this, I filed a bug against the hub tool (with a bogus workaround), and nowadays it seems they have a disclaimer in their README.

Their recommended fix is to s/git/command git/g in git-prompt.sh, which doesn't work too well if you use the linked fedora suggestion of pointing at the package installed file in /usr/share, so I avoid the alias. You can run 'hub' standalone, but instead I like to do:

$ sudo dnf install hub
$ sudo ln -s /usr/bin/hub /usr/libexec/git-core/git-hub

Then I can git hub fork and git hub pull-request all I want :)
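
A typical contribution flow then looks something like this (illustrative; the repo URL is a placeholder):

$ git clone https://github.com/example/project.git
$ cd project
$ git hub fork
$ git checkout -b my-fix
... hack, commit ...
$ git hub pull-request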

by Cole Robinson (noreply@blogger.com) at April 29, 2016 04:04 PM

April 22, 2016

Gerd Hoffmann

linux evdev input support in qemu 2.6

Another nice goodie coming in the qemu 2.6 release is support for linux evdev devices. qemu can pick up input events directly from the devices now, which is quite useful in case you don’t use gtk/sdl/vnc/spice for display output. Most common use case for this is vga pass-through. It’s used this way: qemu -object input-linux,id=$name,evdev=/dev/input/event$nr […]

by Gerd Hoffmann at April 22, 2016 12:01 PM

Fedora on Raspberry PI updates

Some updates for the Running Fedora on the Raspberry PI 2 article. There has been great progress on getting the Raspberry PI changes merged upstream in the last half year. u-boot meanwhile has decent support for the whole family, including 64bit support for the Raspberry PI 3. Raspberry PI 2 support has been merged upstream […]

by Gerd Hoffmann at April 22, 2016 05:57 AM

April 21, 2016

Gerd Hoffmann

fbida 2.12 released.

New fbida release is out. Big user-visible item is the new fbpdf application, a console pdf viewer using the poppler library. Can be used instead of the fbgs wrapper script. Download the bits here.

by Gerd Hoffmann at April 21, 2016 07:48 PM

latest updates from virtio-gpu department

It’s been a while since the last virtio-gpu status report. qemu 2.6 release is just around the corner, time for a writeup about what is coming. virgl support was added to qemu 2.5 already. It was supported only by the SDL2 and gtk UIs, which is a nice start, but has the drawback that qemu […]

by Gerd Hoffmann at April 21, 2016 04:15 PM

April 05, 2016

Daniel Berrange

Improving QEMU security part 5: TLS support for NBD server & client

This blog is part 5 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

For many years now QEMU has had code to support the NBD protocol, either as a client or as a server. The qemu-nbd command line tool can be used to export a disk image over NBD to a remote machine, or connect it directly to the local kernel’s NBD block device driver. The QEMU system emulators also have a block driver that acts as an NBD client, allowing VMs to be run from NBD volumes. More recently the QEMU system emulators gained the ability to export the disks from a running VM as named NBD volumes. The latter is particularly interesting because it is the foundation of live migration with block device replication, allowing VMs to be migrated even if you don’t have shared storage between the two hosts.

In common with most network block device protocols, NBD has never offered any kind of data security capability. Administrators are recommended to run NBD over a private LAN/vLAN, use network layer security like IPSec, or tunnel it over some other kind of secure channel. While all these options are capable of working, none are very convenient to use because they require extra setup steps outside of the basic operation of the NBD server/clients.

Libvirt has long had the ability to tunnel the QEMU migration channel over its own secure connection to the target host, but this has not been extended to cover the NBD channel(s) opened when doing block migration. While it could theoretically be extended to cover NBD, it would not be ideal from a performance POV because the libvirtd architecture means that the TLS encryption/decryption for multiple separate network connections would be handled by a single thread. For fast networks (10-GigE), libvirt will quickly become the bottleneck on performance even if the CPU has native support for AES.

Thus it was decided that the QEMU NBD client & server would need to be extended to support TLS encryption of the data channel natively. Initially the thought was to just add a flag to the client/server code to indicate that TLS was desired and run the TLS handshake before even starting the NBD protocol. After some discussion with the NBD maintainers though, it was decided to explicitly define a way to support TLS in the NBD protocol negotiation phase. The primary benefit of doing this is to allow clearer error reporting to the user if the client connects to a server requiring use of TLS and the client itself does not support TLS, or vice-versa. I.e. instead of just seeing what appears to be a mangled NBD handshake and not knowing what it means, the client can clearly report “This NBD server requires use of TLS encryption”.

The extension to the NBD protocol was fairly straightforward. After the initial NBD greeting (where the client & server agree the NBD protocol variant to be used) the client is able to request a number of protocol options. A new option was defined to allow the client to request TLS support. If the server agrees to use TLS, then they perform a standard TLS handshake and the rest of the NBD protocol carries on as normal. To prevent downgrade attacks, if the NBD server requires TLS and the client does not request the TLS option, then it will respond with an error and drop the client. In addition if the server requires TLS, then TLS must be the first option that the client requests – other options are only permitted once the TLS session is active & the server will again drop the client if it tries to request non-TLS options first.
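In rough pseudo-code, the client side of that negotiation looks something like the following. This is only a sketch with hypothetical helper functions, though NBD_OPT_STARTTLS and NBD_REP_ACK are the real protocol constants:

/* Sketch of the client side flow, not QEMU's actual code */
if (nbd_send_option_request(fd, NBD_OPT_STARTTLS, 0, NULL) < 0) {
    return -1;
}
if (nbd_receive_option_reply(fd) != NBD_REP_ACK) {
    /* Server refused TLS, or dropped us because it requires TLS first */
    return -1;
}
/* The standard TLS handshake now runs on the socket; every subsequent
 * option request and data transfer is protected by the session */
if (run_tls_handshake(fd) < 0) {
    return -1;
}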

The QEMU NBD implementation was originally using plain POSIX sockets APIs for all its I/O. So the first step in enabling TLS was to update the NBD code so that it used the new general purpose QEMU I/O channel  APIs instead. With that done it was simply a matter of instantiating a new QIOChannelTLS object at the correct part of the protocol handshake and adding various command line options to the QEMU system emulator and qemu-nbd program to allow the user to turn on TLS and configure x509 certificates.

Running an NBD server using TLS can be done as follows:

$ qemu-nbd --object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/qemutls \
           --tls-creds tls0 /path/to/disk/image.qcow2

On the client host, a QEMU guest can then be launched, connecting to this NBD server:

$ qemu-system-x86_64 -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/qemutls \
                     -drive driver=nbd,host=theotherhost,port=10809,tls-creds=tls0 \
                     ...other QEMU options...

Finally, to enable support for live migration with block device replication, the QEMU system monitor APIs gained support for a new parameter when starting the internal NBD server. All of this code was merged in time for the forthcoming QEMU 2.6 release. Work has not yet started on enabling TLS with NBD in libvirt, as there is little point in securing the NBD protocol streams until the primary live migration stream is using TLS. More on live migration in a future blog post, as that’s going to be QEMU 2.7 material now.
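For reference, once the internal NBD server has its new parameter, starting it with TLS from QMP is expected to look roughly like this (a sketch assuming a previously created tls0 credentials object):

{ "execute": "nbd-server-start",
  "arguments": { "addr": { "type": "inet",
                           "data": { "host": "0.0.0.0", "port": "10809" } },
                 "tls-creds": "tls0" } }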


by Daniel Berrange at April 05, 2016 09:35 AM

April 04, 2016

Daniel Berrange

Improving QEMU security part 4: generic I/O channel framework to simplify TLS

This blog is part 4 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

Part 2 of this series described the creation of a general purpose API for simplifying TLS session handling inside QEMU, particularly with a view to hiding the complexity of the handshake and x509 certificate validation. The VNC server was converted to use this API, which was a big benefit, but there was still a need to add extra code to support TLS in the I/O paths. Specifically, anywhere the VNC server would read/write on the network socket had to be made TLS aware, so that it would use the plain POSIX send/recv functions vs the TLS wrapped send/recv functions as appropriate. For the VNC server it is actually even more complex, because it also supports websockets, so each I/O point had to choose between plain, TLS, websockets and websockets plus TLS. As TLS support extends to other areas of QEMU this pattern would continue to complicate I/O paths in each backend.

Clearly there was a need for some form of I/O channel abstraction that would allow TLS to be enabled in each QEMU network backend without having to add conditional logic at every I/O send/recv call. Looking around at the QEMU subsystems that would ultimately need TLS support showed a variety of approaches currently in use.

The GIOChannel APIs used by the character device backend theoretically provide an extensible framework for I/O and there is even a TLS implementation of the GIOChannel API. The two limitations of GIOChannel for QEMU though are that it does not support scatter / gather / vectored I/O APIs and that it does not support file descriptor passing over UNIX sockets. The latter is not a show stopper, since you can still access the socket handle directly to send/recv file descriptors. The lack of vectored I/O though would be a significant issue for migration and NBD servers where performance is very important. While we could potentially extend GIOChannel to add support for new callbacks to do vectored I/O, by the time you’ve done that most of the original GIOChannel code isn’t going to be used, limiting the benefit of starting from GIOChannel as a base.

It is also clear that GIOChannel is really not something that is going to get any further development from the GLib maintainers, since their focus is on the new and much better GIO library. This supports file descriptor passing and TLS encryption, but again lacks support for vectored I/O. The bigger show stopper though is that getting access to the TLS support requires depending on a version of GLib that is much newer than what QEMU is willing to use.

The existing QEMUFile APIs could form the basis of a general purpose I/O channel system if they were untangled & extracted from the migration codebase. One limitation is that QEMUFile only concerns itself with I/O, not the initial channel establishment, which is left to the migration core code to deal with, so it did not actually provide very much of a foundation on which to build.

After looking through the various approaches in use in QEMU, and potentially available from GLib, it was decided that QEMU would be best served by creating a new general purpose I/O channel API. Thus a new QEMU subsystem was added in the io/ and include/io/ directories to provide a set of classes for I/O over a variety of different data channels. The core design aims were to use the QEMU object model (QOM) framework to provide a standard pattern for extending / subclassing, use the QEMU Error object for all error reporting, and to support file descriptor passing, main loop watch integration and coroutine integration. Overall the new design took many elements of its design from GIOChannel and the GIO library, and blended them with QEMU’s own codebase design. The initial goal was to provide enough functionality to convert the VNC server as a proof of concept, so classes were created covering plain sockets, TLS sessions and websockets, among others.
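To give a flavour of the resulting API, here is a minimal sketch (not code taken from QEMU itself) of connecting a client socket channel and writing to it; the SocketAddress setup is elided:

QIOChannelSocket *sioc = qio_channel_socket_new();
Error *err = NULL;

/* 'addr' is a SocketAddress describing the host/port to connect to */
if (qio_channel_socket_connect_sync(sioc, addr, &err) < 0) {
    error_report_err(err);
    return -1;
}

/* All I/O goes via the base class, regardless of the underlying transport */
if (qio_channel_write(QIO_CHANNEL(sioc), buf, buflen, &err) < 0) {
    error_report_err(err);
}
object_unref(OBJECT(sioc));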

To avoid making this blog posting even larger, I won’t go into details of these (the code is available in QEMU git for anyone who’s really interested), but instead illustrate it with a comparison of the VNC code before & after. First consider the original code in the VNC server for dealing with writing a buffer of data over a plain socket or websocket, with or without TLS enabled. The following functions existed in the VNC server code to handle all the combinations:

ssize_t vnc_tls_push(const char *buf, size_t len, void *opaque)
{
    VncState *vs = opaque;
    ssize_t ret;

 retry:
    ret = send(vs->csock, buf, len, 0);
    if (ret < 0) {
        if (errno == EINTR) {
            goto retry;
        }
        return -1;
    }
    return ret;
}

ssize_t vnc_client_write_buf(VncState *vs, const uint8_t *data, size_t datalen)
{
    ssize_t ret;
    int err = 0;
    if (vs->tls) {
        ret = qcrypto_tls_session_write(vs->tls, (const char *)data, datalen);
        if (ret < 0) {
            err = errno;
        }
    } else {
        ret = send(vs->csock, (const void *)data, datalen, 0);
        if (ret < 0) {
            err = socket_error();
        }
    }
    return vnc_client_io_error(vs, ret, err);
}

long vnc_client_write_ws(VncState *vs)
{
    vncws_encode_frame(&vs->ws_output, vs->output.buffer, vs->output.offset);
    buffer_reset(&vs->output);
    return vnc_client_write_buf(vs, vs->ws_output.buffer, vs->ws_output.offset);
}

static void vnc_client_write_locked(void *opaque)
{
    VncState *vs = opaque;

    if (vs->encode_ws) {
        vnc_client_write_ws(vs);
    } else {
        vnc_client_write_plain(vs);
    }
}

After conversion to use the new QIOChannel classes for sockets, websockets and TLS, all of the VNC server code above turned into

ssize_t vnc_client_write_buf(VncState *vs, const uint8_t *data, size_t datalen)
{
    Error *err = NULL;
    ssize_t ret;
    ret = qio_channel_write(vs->ioc, (const char *)data, datalen, &err);
    return vnc_client_io_error(vs, ret, &err);
}

It is clearly a major win for maintainability of the VNC server code to have all the TLS and websockets I/O support handled by the QIOChannel APIs. Supporting TLS and websockets no longer has any impact on the VNC server I/O paths. The only place where there is new code is the point where the TLS or websockets session is initiated: this now only requires instantiation of a suitable QIOChannel subclass and registering a callback to be run when the session handshake completes (or fails).

tls = qio_channel_tls_new_server(vs->ioc, vs->vd->tlscreds, vs->vd->tlsaclname, &err);
if (!tls) {
    vnc_client_error(vs);
    return 0;
}

object_unref(OBJECT(vs->ioc));
vs->ioc = QIO_CHANNEL(tls);

qio_channel_tls_handshake(tls, vnc_tls_handshake_done, vs, NULL);

Notice that the code is simply replacing the current QIOChannel handle ‘vs->ioc’ with an instance of the QIOChannelTLS class. The vnc_tls_handshake_done method is invoked when the TLS handshake is complete or failed and lets the VNC server continue with the next part of its authentication protocol, or drop the client connection as appropriate. So adding TLS session support to the VNC server comes in at about 10 lines of code now.


by Daniel Berrange at April 04, 2016 11:26 AM

April 01, 2016

Daniel Berrange

Improving QEMU security part 3: securely passing in credentials

This blog is part 3 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

When configuring a virtual machine, there are a number of places where QEMU requires some form of security sensitive credentials, typically passwords or encryption keys. Historically QEMU has had no standard approach for getting these credentials from the user, so things have grown in an ad-hoc manner with predictably awful results.

If the VNC server is configured to use basic VNC authentication, then it requires a password to be set. When I first wrote patches to add password auth to QEMU’s VNC server it was clearly not desirable to expose the password on the command line, so when configuring VNC you just request password authentication be enabled using -vnc 0.0.0.0:0,password and then have to use the monitor interface to set the actual password value “change vnc password“. Until a password has been set via the monitor, the VNC server should reject all clients, except that we’ve accidentally broken this in the past, allowing clients to connect when no server password is set :-(

The qcow & qcow2 disk image formats support use of AES for encryption (remember this is horribly broken) and so there needs to be a way to provide the decryption password for this. Originally you had to wait for QEMU to prompt for the disk password on the interactive console. This clearly doesn’t work very nicely when QEMU is being managed by libvirt, so we added another monitor command which allows apps to provide the disk password upfront, avoiding the need to prompt.

Fast forward a few years and QEMU’s block device layer gained support for various network protocols including iSCSI, RBD, FTP and HTTP(s). All of these potentially require authentication and thus a password needs to be provided to QEMU. The CURL driver for ftp, http(s) simply skipped support for authentication since there was no easy way to provide the passwords securely. Sadly, the iSCSI and RBD drivers simply decided to allow the password to be provided on the command line. Hence the passwords for RBD and iSCSI are visible in plain text in the process listing and in libvirt’s QEMU log files; since those logs often get attached to bug reports, this has resulted in a CVE being filed against libvirt. I also intend to add support for the LUKS format in the QEMU block layer, which will also require passwords to be provided securely to QEMU, and it would be desirable if the x509 keys provided to QEMU could be encrypted too.

Looking at this mess and the likely future requirements, it was clear that QEMU was in desperate need of a standard mechanism for securely receiving credentials from the user / management app (libvirt). There are a variety of channels via which credentials can be theoretically passed to QEMU:

As mentioned previously, using command line arguments or environment variables is not secure if the credential is passed in plain text, because they are visible in the process listing and log files. It would be possible to create a plain file on disk, write each password to it, and use file permissions to ensure only QEMU can read it. Using files is not too bad as long as your host filesystem is on encrypted storage. It has the minor complexity of having to dynamically create files on the fly each time you want to hotplug a new device using a password.

Most of these problems can be avoided by using an anonymous pipe, but this is more complicated for end users, because for hotplugging devices it would require passing file descriptors over a UNIX socket. Finally, the monitor provides a decent secure channel which users / mgmt apps will typically already have open via a UNIX socket. There is a chicken & egg problem with it though, because the credentials are often required at initial QEMU startup when parsing the command line arguments, and the monitor is not available that early.

After considering all the options, it was decided that using plain files and/or anonymous pipes to pass credentials would be the most desirable approach. The qemu_open() method has a convenient feature whereby there is a special path prefix that allows mgmt apps to pass a file descriptor across instead of a regular filename. To enable reuse of the existing -object command line argument and object_add monitor commands for defining credentials, the QEMU object model framework (QOM) was used to define a ‘secret’ object class. The ‘secret’ class has a ‘file’ property which gives the name of the file containing the credential. For example it could be used like this:

 # echo "letmein" > mydisk.pw
 # $QEMU -object secret,id=sec0,file=mydisk.pw
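If the management app would rather avoid even a transient file on disk, the same property should also accept an inherited file descriptor via the /dev/fdset support in qemu_open() mentioned above, along these lines:

 # $QEMU -add-fd fd=3,set=1 -object secret,id=sec0,file=/dev/fdset/1 3<mydisk.pw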

Having written this, I realized that it would be possible to allow passwords to be provided directly via the command line if we allowed secret data to be encrypted with a master key. The idea would be that when a QEMU process is first started, it gets given a new unique AES key via a file. The credentials for individual disks / servers would be encrypted with the master key and then passed directly on the command line. The benefit of this is that the mgmt app only needs to deal with a single file on disk with a well defined lifetime.

First a master key is generated and saved to a file in base64 format

 # openssl rand -base64 32 > master-key.b64

Let’s say we have two passwords we need to give to QEMU. We will thus need two initialization vectors:

 # openssl rand -base64 16 > sec0-iv.b64
 # openssl rand -base64 16 > sec1-iv.b64

Each password is now encrypted using the master key and its respective initialization vector

 # SEC0=$(printf "letmein" |
          openssl enc -aes-256-cbc -a \
             -K $(base64 -d master-key.b64 | hexdump -v -e '/1 "%02X"') \
             -iv $(base64 -d sec0-iv.b64 | hexdump -v -e '/1 "%02X"'))
 # SEC1=$(printf "1234567" |
          openssl enc -aes-256-cbc -a \
             -K $(base64 -d master-key.b64 | hexdump -v -e '/1 "%02X"') \
             -iv $(base64 -d sec1-iv.b64 | hexdump -v -e '/1 "%02X"'))

Finally, when QEMU is launched, three secrets are defined: the first gives the master key via a file, and the others provide the two encrypted user passwords

 # $QEMU \
      -object secret,id=secmaster,format=base64,file=master-key.b64 \
      -object secret,id=sec0,keyid=secmaster,format=base64,\
              data=$SEC0,iv=$(<sec0-iv.b64) \
      -object secret,id=sec1,keyid=secmaster,format=base64,\
              data=$SEC1,iv=$(<sec1-iv.b64) \
      ...other args using the secrets...

Now that we have a way to securely get credentials into QEMU, there just remains the task of associating the secrets with the things in QEMU that need to use them. The TLS credentials object added previously originally required the x509 server key to be provided in an unencrypted PEM file. The tls-creds-x509 object has now gained a new property “passwordid” which provides the ID of a secret object that defines the password to use for decrypting the x509 key.

 # $QEMU \
      -object secret,id=secmaster,format=base64,file=master-key.b64 \
      -object secret,id=sec0,keyid=secmaster,format=base64,\
              data=$SEC0,iv=$(<sec0-iv.b64) \
      -object tls-creds-x509,id=tls0,dir=/home/berrange/qemutls,endpoint=server,passwordid=sec0 \
      -vnc 0.0.0.0:0,tls-creds=tls0

Aside from adding support for encrypted x509 certificates, the RBD, iSCSI and CURL block drivers in QEMU have all been updated to allow authentication passwords to be provided using the ‘secret’ object type. Libvirt will shortly be gaining support to use this facility, which will address the long standing problem of RBD/iSCSI passwords being visible in clear text in the QEMU process command line arguments. All the enhancements described in this posting have been merged for the forthcoming QEMU 2.6.0 release so will soon be available to users. The corresponding enhancements to libvirt to make use of these features are under active development.
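By way of illustration, wiring one of these secrets up to an RBD disk should look something like the following; treat the password-secret property name and the rbd filename as illustrative and check the driver documentation for your QEMU version:

 # $QEMU \
      -object secret,id=sec0,file=rbd.pw \
      -drive file=rbd:pool/image:id=myuser,password-secret=sec0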

by Daniel Berrange at April 01, 2016 03:42 PM

Improving QEMU security part 2: generic TLS support

This blog is part 2 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

After the initial consolidation of cryptographic APIs into one area of the QEMU codebase, it was time to move onto step two, which is where things start to directly benefit the users. With the original patches to support TLS in the VNC server, configuration of the TLS credentials was done as part of the -vnc command line argument. The downside of such an approach is that as we add support for TLS to other arguments like -chardev, the user would have to supply the same TLS information in multiple places. So it was necessary to isolate the TLS credential configuration from the TLS session handling code, enabling a single set of TLS credentials to be associated with multiple network servers (they can of course each have a unique set of credentials if desired).

To achieve this, the new code made use of QEMU’s object model framework (QOM) to define a general TLS credentials interface and then provide implementations for anonymous credentials (totally insecure but needed for back compat with existing QEMU features) and for x509 certificates (the preferred & secure option). There are now two QOM types tls-creds-anon and tls-creds-x509 that can be created on the command line via QEMU’s -object argument, or in the monitor using the ‘object_add’ command. The VNC server was converted to use the new TLS credential objects for its configuration, so whereas in QEMU 2.4 VNC with TLS would be configured using

-vnc 0.0.0.0:0,tls,x509verify=/path/to/certificates/directory

As of QEMU 2.5 the preferred approach is to use the new credential objects

-object tls-creds-x509,id=tls0,endpoint=server,dir=/path/to/certificates/directory
-vnc 0.0.0.0:0,tls-creds=tls0

The old CLI syntax is still supported, but gets translated internally to create the right TLS credential objects. By default the x509 credentials will require that the client provide a certificate, which is equivalent to the traditional ‘x509verify‘ option for VNC. To remove the requirement for client certs, the ‘verify-peer=no‘ option can be given when creating the x509 credentials object.
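For example, a TLS-enabled VNC server that does not demand client certificates would be configured as:

-object tls-creds-x509,id=tls0,endpoint=server,verify-peer=no,dir=/path/to/certificates/directory
-vnc 0.0.0.0:0,tls-creds=tls0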

Generating correct x509 certificates is something that users often struggle with and when getting it wrong the failures are pretty hard to debug – usually just resulting in an unhelpful “handshake failed” type error message. To help troubleshoot problems, the new x509 credentials code in QEMU will sanity check all certificates it loads prior to using them. For example, it will check that the CA certificate has basic constraints set to indicate usage as a CA, catching problems where people give a server/client cert instead of a CA cert. Likewise it will check that the server certificate has basic constraints set to indicate usage in a server. It’ll check that the server certificate is actually signed by the CA that is provided and that none of the certs have expired already. These are all things that the client would check when it connects, so we’re not adding / removing security here, just helping administrators to detect misconfiguration of their TLS certificates as early as possible. These same checks have been done in libvirt for several years now and have been very beneficial in reducing the bug reports we get related to misconfiguration of TLS.
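Those same properties can also be inspected by hand with gnutls’s certtool before handing certificates to QEMU, which is a quick way to cross-check what QEMU will complain about; look for the Basic Constraints and expiration fields in the output:

$ certtool -i --infile ca-cert.pem
$ certtool -i --infile server-cert.pem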

With the generic TLS credential objects created, the second step was to create a general purpose API for handling the TLS protocol inside QEMU, especially simplifying the handshake, which requires a non-negligible amount of code. The TLS session APIs were designed such that they are independent of the underlying data transport, since while the VNC server always runs over TCP/UNIX sockets, other QEMU backends may wish to run TLS over non-socket based transports. Overall the API for dealing with TLS session establishment in QEMU can be used as follows

  static ssize_t mysock_send(const char *buf, size_t len,
                             void *opaque)
  {
      int fd = GPOINTER_TO_INT(opaque);

      return write(fd, buf, len);
  }

  static ssize_t mysock_recv(char *buf, size_t len,
                             void *opaque)
  {
      int fd = GPOINTER_TO_INT(opaque);

      return read(fd, buf, len);
  }

  static int mysock_run_tls(int sockfd,
                            QCryptoTLSCreds *creds,
                            Error **errp)
  {
      QCryptoTLSSession *sess;

      sess = qcrypto_tls_session_new(creds,
                                     "vnc.example.com",
                                     NULL,
                                     QCRYPTO_TLS_CREDS_ENDPOINT_CLIENT,
                                     errp);
      if (sess == NULL) {
          return -1;
      }

      qcrypto_tls_session_set_callbacks(sess,
                                        mysock_send,
                                        mysock_recv,
                                        GINT_TO_POINTER(sockfd));

      while (1) {
          if (qcrypto_tls_session_handshake(sess, errp) < 0) {
              qcrypto_tls_session_free(sess);
              return -1;
          }

          switch (qcrypto_tls_session_get_handshake_status(sess)) {
          case QCRYPTO_TLS_HANDSHAKE_COMPLETE:
              if (qcrypto_tls_session_check_credentials(sess, errp) < 0) {
                  qcrypto_tls_session_free(sess);
                  return -1;
              }
              goto done;
          case QCRYPTO_TLS_HANDSHAKE_RECVING:
              ...wait for G_IO_IN event on sockfd...
              break;
          case QCRYPTO_TLS_HANDSHAKE_SENDING:
              ...wait for G_IO_OUT event on sockfd...
              break;
          }
      }
    done:

      ...send/recv payload data on sess...

      qcrypto_tls_session_free(sess);
      return 0;
  }

The particularly important thing to note with this example is how the network service (e.g. VNC, NBD, chardev) that is enabling TLS no longer has to have any knowledge of x509 certificates. They are loaded automatically when the user provides the ‘-object tls-creds-x509‘ argument to QEMU, and they are validated automatically by the call to qcrypto_tls_session_handshake(). This makes it easy to add TLS support to other network backends in QEMU with minimal overhead, significantly lowering the risk of screwing up the security. Since it already had TLS support, the VNC server was converted to use this new TLS session API instead of using the gnutls APIs directly. Once again this had a very positive impact on maintainability of the VNC code, since it allowed countless #ifdef CONFIG_GNUTLS conditionals to be removed, which clarified the code flow significantly. This work on TLS and the VNC server was all merged for the 2.5 release of QEMU, so is already available to benefit users. There is still corresponding libvirt work to be done to convert over to use the new command line syntax for configuring TLS with QEMU.


by Daniel Berrange at April 01, 2016 08:51 AM

March 31, 2016

Daniel Berrange

Improving QEMU security part 1: crypto code consolidation

This blog is part 1 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

Many years ago I wrote patches for QEMU to enable use of TLS with the VNC server via the VeNCrypt protocol extension. In those patches I modified the VNC server code to directly call out to gnutls in various places to perform the TLS handshake, validate certificates and encrypt/decrypt data. Fast-forward 8 years and I’m once again looking at QEMU with a view to adding TLS encryption support to many other QEMU network services, in particular character device backends, migration and NBD. The TLS certificate handling code is complex enough that I really didn’t fancy repeating it in multiple different areas of the QEMU codebase, so I started thinking about extracting the TLS code from the VNC server for purposes of easier reuse.

Aside from VNC with TLS, QEMU uses cryptographic routines in a number of other areas: AES for qcow2 native encryption (which is horribly broken btw), single DES (yes, really single DES) in the VNC server for the awful VNC password authentication, SHA256 hashing in the quorum block driver, SHA1 hashing in the VNC websockets handshake, and AES in many of its CPU emulation backends for the various architecture specific AES acceleration instructions. QEMU actually has its own built-in impl of AES and DES that it uses, rather than calling out to a 3rd party crypto library, since the emulated CPU instructions need to run distinct internal steps of the AES algorithm, not merely consume the final output.

Looking to the future, as well as the expanded use of TLS, it was clear that the use of cryptography will only ever increase in QEMU. For example, support for a LUKS encryption driver in the block layer will need access to countless encryption ciphers and hashes. It would be possible to get access to ciphers and hashes via the gnutls APIs, but sadly gnutls doesn’t expose all the possible algorithms supported by the underlying libraries it uses. For added fun gnutls can be using either libgcrypt or nettle depending on what version of gnutls you have. So if QEMU wanted to get access to algorithms not exposed by gnutls, it would ideally have to support use of two different libraries. It was clear that QEMU would benefit from a consolidated internal API for dealing with anything related to encryption, to isolate the main bulk of the code from needing to directly deal with whatever 3rd party crypto libraries QEMU links to. Thus I created a new top level directory in the QEMU codebase, crypto/, with associated headers in include/crypto/, which will contain all the code for interfacing with gnutls, libgcrypt, nettle, and whatever other cryptographic libraries we might need in the future. First of all the existing AES and DES implementations were moved into this directory. Then I created APIs for dealing with hash and cipher algorithms.

The cipher APIs are written to preferentially use either nettle or libgcrypt, depending on which one gnutls links to, though this can be overridden via arguments to configure to force a particular choice. For those who really want to build without these 3rd party libraries, the APIs can be built to use the internal AES or DES impls as a fallback. A short example of encrypting data using AES-128 and CBC mode would look like this

  QCryptoCipher *cipher;
  uint8_t *key = ....;
  size_t keylen = 16;
  uint8_t *iv = ....;

  if (!qcrypto_cipher_supports(QCRYPTO_CIPHER_ALG_AES_128)) {
     error_setg(errp, "Feature <blah> requires AES cipher support");
     return -1;
  }

  cipher = qcrypto_cipher_new(QCRYPTO_CIPHER_ALG_AES_128,
                              QCRYPTO_CIPHER_MODE_CBC,
                              key, keylen,
                              errp);
  if (!cipher) {
     return -1;
  }

  if (qcrypto_cipher_set_iv(cipher, iv, keylen, errp) < 0) {
     qcrypto_cipher_free(cipher);
     return -1;
  }

  if (qcrypto_cipher_encrypt(cipher, rawdata, encdata, datalen, errp) < 0) {
     qcrypto_cipher_free(cipher);
     return -1;
  }

  qcrypto_cipher_free(cipher);

The hash algorithms still use the gnutls APIs, though that will change in the 2.7 series to directly use libgcrypt or nettle. The hash APIs are slightly simpler, since QEMU doesn’t (currently at least) need the ability to incrementally hash data, so the current APIs just support one-shot hashing of buffers.

  char *digest = NULL;

  if (!qcrypto_hash_supports(QCRYPTO_HASH_ALG_SHA256)) {
     error_setg(errp, "Feature <blah> requires sha256 hash support");
     return -1;
  }

  if (qcrypto_hash_digest(QCRYPTO_HASH_ALG_SHA256,
                          buf, len, &digest,
                          errp) < 0) {
     return -1;
  }

The qcrypto_hash_digest() method outputs the hash as printable hex characters. There is also qcrypto_hash_bytes() which returns the raw bytes, and qcrypto_hash_base64() which base64 encodes the result. As well as passing a single buffer, it is possible to provide a list of buffers in a ‘struct iovec’.
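As a sketch of how the vectored variant can be used (the qcrypto_hash_bytesv() signature shown matches the QEMU crypto headers of this era, but treat the example as illustrative rather than definitive):

  struct iovec iov[2] = {
      { .iov_base = hdr, .iov_len = hdrlen },
      { .iov_base = payload, .iov_len = payloadlen },
  };
  uint8_t *result = NULL;
  size_t resultlen = 0;

  /* Hash both buffers in one pass, as if they were contiguous */
  if (qcrypto_hash_bytesv(QCRYPTO_HASH_ALG_SHA256,
                          iov, 2, &result, &resultlen, errp) < 0) {
      return -1;
  }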

The calls to qcrypto_cipher_supports() and qcrypto_hash_supports() are entirely optional – errors will be raised by other methods if needed, but they offer the opportunity to emit friendly error messages in the code. For example the VNC server can explicitly say which feature it can’t support due to missing DES support. Just converting the existing code in QEMU code to use these new cipher/hash APIs already had significant benefit, because it allowed for many #ifdef CONFIG_GNUTLS statements to be removed from across the codebase, particularly the VNC server. The other benefit is that the internal AES and DES implementations are no longer used by any QEMU code, except for the CPU instruction emulation, which is not even used if running with KVM. So modern KVM accelerated guests will be using well supported, audited & certified cipher & hash implementations which is often important to enterprise distribution vendors. This first stage of consolidation was completed and merged for the QEMU 2.4 release series but it has been invisible to users, mostly just benefiting the QEMU & distro maintainers.


by Daniel Berrange at March 31, 2016 06:06 PM

March 25, 2016

Richard Jones

libguestfs appliance boot in under 1s

$ time LIBGUESTFS_BACKEND=direct LIBGUESTFS_HV=~/d/qemu/x86_64-softmmu/qemu-system-x86_64 guestfish -a /dev/null run

real	0m0.966s
user	0m0.623s
sys	0m0.281s

However I had to patch qemu to enable DMA loading of the kernel and initrd.


by rich at March 25, 2016 03:50 PM

