Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools


Planet Feeds

October 19, 2016

Amit Shah

Ten Years of KVM

We recently celebrated 25 years of Linux on the 25th anniversary of the famous email Linus sent to announce the start of the Linux project.  Going by the same yardstick, today marks the 10th anniversary of the KVM project — Avi Kivity first announced the project on 19 October 2006 with a posting to LKML.

The first patchset added support for hardware virtualization on Linux for Intel CPUs.  Support for AMD CPUs followed soon after.

KVM was subsequently merged into the upstream kernel on 10 December 2006 (commit 6aa8b732ca01c3d7a54e93f4d701b8aabbe60fb7).  Linux 2.6.20, released on 4 February 2007, was the first kernel release to include KVM.

KVM has come a long way in these 10 years.  I’m writing a detailed post about some of the history of the KVM project — stay tuned for that.

Till then, cheers!

by Amit Shah at October 19, 2016 04:35 PM

October 15, 2016

Alex Williamson

Intel processors with ACS support

If you've been keeping up with this blog then you understand a bit about IOMMU groups and device isolation.  In my howto series I describe the limitations of the Xeon E3 processor that I use in my example system and recommend Xeon E5 or higher processors to provide the best case device isolation for those looking to build a system.  Well, thanks to the vfio-users mailing list, it has come to my attention that there are in fact Core i7 processors with PCIe Access Control Services (ACS) support on the processor root ports.

Intel lists these processors as High End Desktop Processors; they include Socket 2011-v3 Haswell-E processors, Socket 2011 Ivy Bridge-E processors, and Socket 2011 Sandy Bridge-E processors.  The linked datasheets for each family clearly list ACS register capabilities.  Current listings for these processors include:

Haswell-E (LGA2011-v3)
i7-5960X (8-core, 3/3.5GHz)
i7-5930K (6-core, 3.2/3.8GHz)
i7-5820K (6-core, 3.3/3.6GHz)

Ivy Bridge-E (LGA2011)
i7-4960X (6-core, 3.6/4GHz)
i7-4930K (6-core, 3.4/3.6GHz)
i7-4820K (4-core, 3.7/3.9GHz)

Sandy Bridge-E (LGA2011)
i7-3960X (6-core, 3.3/3.9GHz)
i7-3970X (6-core, 3.5/4GHz)
i7-3930K (6-core, 3.2/3.8GHz)
i7-3820 (4-core, 3.6/3.8GHz)

These also appear to be the only Intel Core processors compatible with Socket 2011 and 2011-v3 found on X79 and X99 motherboards, so basing your platform around these chipsets will hopefully lead to success.  My recommendation is based only on published specs, not first-hand experience, though, so your mileage may vary.

Unfortunately there are not yet any Skylake-based "High End Desktop Processors".  From what we've seen on the mailing list, Skylake does not implement ACS on processor root ports, nor do we have quirks to enable isolation on the Z170 PCH root ports (which include a read-only ACS capability, effectively confirming the lack of isolation), and integrated I/O devices on the motherboard exhibit poor grouping with system management components (aside from the onboard I219 LOM, which we do have quirked).  This makes the currently available Skylake platforms a really bad choice for device assignment.

Based on this new data, I'll revise my recommendation for Intel platforms to include Xeon E5 and higher processors or Core i7 High End Desktop Processors (as listed by Intel).  Of course there are combinations where regular Core i5, i7 and Xeon E3 processors will work well; we simply need to be aware of their limitations and factor that into our system design.
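If you want to check what isolation you're actually getting on a given board, the IOMMU groups the kernel built are visible in sysfs. A minimal sketch (the helper name is my own, and it takes an alternate base path purely for testing; the real location is /sys/kernel/iommu_groups):

```shell
# Print each IOMMU group and the devices it contains.  Devices that
# share a group cannot be assigned to different VMs, so smaller
# groups generally mean better isolation.
list_iommu_groups() {
    base=${1:-/sys/kernel/iommu_groups}
    for group in "$base"/*/; do
        [ -d "$group" ] || continue
        printf 'IOMMU group %s:\n' "$(basename "$group")"
        for dev in "$group"devices/*; do
            [ -e "$dev" ] || continue
            printf '  %s\n' "$(basename "$dev")"
        done
    done
}

list_iommu_groups "$@"
```

Run it on the host after enabling the IOMMU; a GPU that lands in the same group as a root port or other unrelated functions is a sign that ACS is missing on that slot.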

EDIT (Oct 30 04:18 UTC 2015): A more subtle feature also found in these E-series processors is support for IOMMU super pages.  The datasheets for the E5 Xeons and these High End Desktop Processors indicate support for 2MB and 1GB IOMMU pages, while the standard Core i5 and i7 only support 4KB pages.  This means less space wasted on IOMMU page tables, more efficient table walks by the hardware, and less thrashing of the I/O TLB under I/O load, resulting in fewer I/O stalls.  Will you notice it?  Maybe.  VFIO will take advantage of IOMMU super pages any time we find a sufficiently sized range of contiguous pages.  To help ensure this happens, make use of hugepages in the VM.
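On the libvirt side, requesting hugepage backing for the guest is a small domain XML addition, placed directly under the <domain> element (a sketch; you still need to allocate hugepages on the host first, e.g. via vm.nr_hugepages):

```xml
<memoryBacking>
  <hugepages/>
</memoryBacking>
```

With this in place, QEMU allocates guest RAM from the host hugepage pool, giving the IOMMU contiguous ranges to map with super pages.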

EDIT (Oct 15 18:20 UTC 2016): Intel Broadwell-E processors have been out for some time and as we'd expect, the datasheets do indicate that ACS is supported.  So add to the list above:

Broadwell-E (LGA2011-v3)
i7-6950X (10-core, 3.0/3.5GHz)
i7-6900K (8-core, 3.2/3.7GHz)
i7-6850K (6-core, 3.6/3.8GHz)
i7-6800K (6-core, 3.4/3.6GHz)

by Alex Williamson at October 15, 2016 01:21 PM

October 13, 2016

Alex Williamson

How to improve performance in Windows 7

A contribution from Thomas Lindroth on the vfio-users mailing list:
I thought I'd share a trick for improving the performance on win7 guests. The
tl;dr version is add <feature policy='disable' name='hypervisor'/> to the
<cpu> section of your libvirt xml like so:

<cpu mode='host-passthrough'>
  <topology sockets='1' cores='3' threads='1'/>
  <feature policy='disable' name='hypervisor'/>
</cpu>

The long story is that according to Microsoft's documentation "On systems
where the TSC is not suitable for timekeeping, Windows automatically selects
a platform counter (either the HPET timer or the ACPI PM timer) as the basis
for QPC." QPC = QueryPerformanceCounter() which is a windows api for getting
timing info. Some Red Hat documentation says: "Windows 7 do not use the TSC as
a time source if the hypervisor-present bit is set". Instead it falls back to
acpi_pm, or to hpet if hpet is enabled in the XML.

The hypervisor-present bit is a fake cpuid flag qemu and other hypervisors
inject to show the guest it's running under a hypervisor. This is different
from the KVM signature that can be hidden with <kvm><hidden state='on'/></kvm>.
With the hypervisor flag disabled in the libvirt XML, Windows 7 started using
the TSC as its timing source for me.

Nvidia has a "Timer Function Performance" benchmark on their web page to
measure overhead from timers. With acpi_pm the timer query took 3,605ns on
average and with TSC 12.52ns. Passmark's CPU floating point performance
benchmark, which queries the timer 265,000 times/sec, went from 3952 points
with acpi_pm to 5594 points with TSC. The reason TSC is so much faster is that
both acpi_pm and hpet are emulated by qemu in userspace while TSC is handled by
KVM in kernel space.

All games I've tested use the timer at least 25,000 times/sec. I'm guessing
it's the graphics drivers doing that. Some games like Outlast query the timer
~275,000 times/sec. The performance of those games is basically limited by
how fast the host can do context switches, so I expect the improvement with
TSC is great in them. Unfortunately 3dmark's Fire Strike benchmark still does
25,000 queries/sec to the acpi_pm even with the hypervisor flag disabled.
There must be some other Windows API for using the "platform counter", as
Microsoft calls it, but most games don't use it.

Unless you are using Windows 7 you'll probably not benefit from this; Windows
10 is probably using the hypervclock instead. That Red Hat documentation
talking about the hypervisor bit was actually a guide for how to turn off the
TSC to "resolve guest timing issues". I don't experience any problems myself,
but if you've got one of those "clocksource tsc unstable" systems this might
not work so well.

Google says this is the NVIDIA timer benchmark.

It would be interesting to see how the hyper-v extensions compare and whether tests like Fire Strike actually make use of them. Thanks Thomas!

(copied with permission)

by Alex Williamson at October 13, 2016 05:09 PM

September 28, 2016

Gerd Hoffmann

Using virtio-gpu with libvirt and spice

It’s been a while since the last virtio-gpu status report, so here we go with an update. I gave a talk about qemu graphics at KVM Forum 2016 in Toronto, covering (among other things) virtio-gpu. Here are the slides. Below I’ll summarize the important stuff for those who want to play with it. The […]

by Gerd Hoffmann at September 28, 2016 01:34 PM

September 26, 2016

Alex Williamson

Passing QEMU command line options through libvirt

This one comes from Stefan's blog with some duct tape and baling wire courtesy of Laine.  I've talked previously about using wrapper scripts to launch QEMU, which typically use sed to insert options that libvirt doesn't know about.  This is far better than defining full vfio-pci devices using <qemu:arg> options, which many guides suggest, but which hides the devices from libvirt and causes all sorts of problems with device permissions, locked memory, etc.  But there's a nice compromise, as Stefan shows in his last example at the link above.  Say we only want to add x-vga=on to one of our hostdev entries in the libvirt domain XML.  We can do something like this:

  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.hostdev0.x-vga=on'/>
  </qemu:commandline>

The effect is that we add the option x-vga=on to the hostdev0 device, which is defined via a normal <hostdev> section in the XML and gets all the device permission and locked memory love from libvirt.  So which device is hostdev0?  Well, things get a little mushy there.  libvirt invents the names based on the order of the hostdev entries in the XML, so you can simply count (starting from zero) to pick the entry for the additional option.  It's a bit cleaner overall than needing to manage a wrapper script separately from the VM.  Also, don't forget that to use <qemu:commandline> you need to first enable the QEMU namespace in the XML by updating the first line in the domain XML to:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

Otherwise libvirt will promptly discard the extra options when you save the domain.
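As for figuring out which hostdev is which without counting by hand: libvirt's generated alias names do show up in the live XML of a running domain, so you can grep them out. A small sketch (the helper name is my own; it just filters XML on stdin):

```shell
# Print the hostdevN aliases libvirt generated for a domain.  Aliases
# only appear in the live XML of a running guest, so pipe in the
# output of `virsh dumpxml <name>` while the VM is up.
list_hostdev_aliases() {
    grep -o "alias name='hostdev[0-9]*'"
}
```

For example, `virsh dumpxml myguest | list_hostdev_aliases` (with "myguest" standing in for your domain name) lists the aliases in the same order as the hostdev entries.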

by Alex Williamson at September 26, 2016 06:00 PM

"Intel-IOMMU: enabled": It doesn't mean what you think it means

A quick post just because I keep seeing this in practically every how-to guide I come across.  The instructions grep dmesg for "IOMMU" and come up with either "Intel-IOMMU: enabled" or "DMAR: IOMMU enabled". Clearly that means it's enabled, right? Wrong. That line comes from a __setup() function that parses the options for "intel_iommu=". Nothing has been done at that point, not even a check to see if VT-d hardware is present. Pass intel_iommu=on as a boot option to an AMD system and you'll see this line. Yes, this is clearly not a very intuitive message. So for the record, the mouthful that you should be looking for is this line:

DMAR: Intel(R) Virtualization Technology for Directed I/O

or on older kernels the prefix is different:

PCI-DMA: Intel(R) Virtualization Technology for Directed I/O

When you see this, you're pretty much past all the failure points of initializing VT-d. FWIW, the "DMAR" flavors of the above appeared in v4.2, so on a more recent kernel, that's your better option.
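In script form, the check described above looks something like the following (a sketch; the function just filters kernel log text on stdin, so pipe dmesg into it):

```shell
# Report whether VT-d actually initialized.  Matching "Intel-IOMMU:
# enabled" is NOT sufficient -- that line only means the boot option
# was parsed.  Look for the DMAR (v4.2+) or PCI-DMA (older) banner.
check_vtd() {
    if grep -qE '(DMAR|PCI-DMA): Intel\(R\) Virtualization Technology for Directed I/O'; then
        echo "VT-d initialized"
    else
        echo "VT-d not initialized (a parsed intel_iommu=on does not count)"
    fi
}
```

Usage on the host: `dmesg | check_vtd`.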

by Alex Williamson at September 26, 2016 05:15 PM

September 17, 2016

Stefan Hajnoczi

Making I/O deterministic with blkdebug breakpoints

Recently I was looking for a deterministic way to reproduce a QEMU bug that occurred when a guest reboots while an IDE/ATA TRIM (discard) request is in flight. You can find the bug report here.

A related problem is how to trigger the code path where request A is in flight when request B is issued. Being able to do this is useful for writing test suites that check I/O requests interact correctly with each other.

Both of these scenarios require the ability to put an I/O request into a specific state and keep it there. This makes the request deterministic so there is no chance of it completing too early.

QEMU has a disk I/O error injection framework called blkdebug. Block drivers like qcow2 tell blkdebug about request states, making it possible to fail or suspend requests at certain points. For example, it's possible to suspend an I/O request when qcow2 decides to free a cluster.

The blkdebug documentation mentions the break command that can suspend a request when it reaches a specific state.

Here is how we can suspend a request when qcow2 decides to free a cluster:

$ qemu-system-x86_64 -drive if=ide,id=ide-drive,file=blkdebug::test.qcow2,format=qcow2
(qemu) qemu-io ide-drive "break cluster_free A"

QEMU prints a message when the request is suspended. It can be resumed with:

(qemu) qemu-io ide-drive "resume A"

The tag name 'A' is arbitrary and you can pick your own name or even suspend multiple requests at the same time.

Automated tests need to wait until a request has suspended. This can be done with the wait_break command:

(qemu) qemu-io ide-drive "wait_break A"

For another example of blkdebug breakpoints, see the qemu-iotests 046 test case.

by stefanha at September 17, 2016 04:08 PM

September 12, 2016

Gerd Hoffmann

Using virtio-input with libvirt

The new virtio input devices are not that new any more. Support was merged in qemu 2.4 (host) and linux kernel 4.1 (guest), which means that most distributions should have picked up support for virtio-input by now. libvirt gained support for virtio-input too (version 1.3.0 & newer), so using virtio-input devices is as simple as adding […]

by Gerd Hoffmann at September 12, 2016 12:41 PM

September 01, 2016

Alex Williamson

And now you're an expert

Video from my KVM Forum 2016 talk:

by Alex Williamson at September 01, 2016 02:07 PM

August 30, 2016

Kashyap Chamarthy

LinuxCon talk slides: “A Practical Look at QEMU’s Block Layer Primitives”

Last week I spent time at LinuxCon (and the co-located KVM Forum) in Toronto. I presented a talk on QEMU’s block layer primitives — specifically the QMP primitives block-commit, drive-mirror, drive-backup, and QEMU’s built-in NBD (Network Block Device) server.

Here are the slides.

by kashyapc at August 30, 2016 10:22 AM

August 24, 2016

Alex Williamson

KVM Forum 2016 - An Introduction to PCI Device Assignment with VFIO

Slides available here:

Video to come

by Alex Williamson at August 24, 2016 03:46 PM

August 16, 2016

Daniel Berrange

Improving QEMU security part 7: TLS support for migration

This blog is part 7 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

The live migration feature in QEMU allows a running VM to be moved from one host to another with no noticeable interruption in service and minimal performance impact. The live migration data stream will contain a serialized copy of the state of all emulated devices, along with all the guest RAM. In some versions of QEMU it is also used to transfer disk image content, but in modern QEMU use of the NBD protocol is preferred for this purpose. The guest RAM in particular can contain sensitive data that needs to be protected against any would-be attackers on the network between the source and target hosts. There are a number of ways to provide such security using external tools/services, including VPNs, IPsec, and SSH/stunnel tunnelling. The libvirtd daemon often already has a secure connection between the source and destination hosts for its own purposes, so many years back support was added to libvirt to automatically tunnel the live migration data stream over libvirt’s own secure connection. This solved both the encryption and authentication problems at once, but there are some downsides to this approach. Tunnelling the connection means extra data copies for the live migration traffic, and when we look at guests with many GB of RAM, the number of data copies starts to matter. The libvirt tunnel only supports tunnelling a single data connection, whereas in future QEMU may well wish to use multiple TCP connections for the migration data stream to improve the performance of post-copy. The use of NBD for storage migration is not supported with tunnelling via libvirt either, since it would require extra connections too. IOW, while tunnelling over libvirt was a useful short-term hack to provide security, it has outlived its practicality.

It is clear that QEMU needs to support TLS encryption natively on its live migration connections. The QEMU migration code has historically had its own distinct I/O layer called QEMUFile, which mixes up tracking of migration state with connection establishment and I/O transfer support. As mentioned in a previous blog post, QEMU now has a general purpose I/O channel framework, so the bulk of the work involved converting the migration code over to use the QIOChannel classes and APIs, which greatly reduced the amount of code in the QEMU migration/ sub-folder as well as simplifying it somewhat. The TLS support involves the addition of two new parameters to the migration code. First, the “tls-creds” parameter provides the ID of a previously created TLS credential object, thus enabling use of TLS on the migration channel. This must be set on both the source and target QEMUs involved in the migration.

On the target host, QEMU would be launched with a set of TLS credentials for a server endpoint:

$ qemu-system-x86_64 -monitor stdio -incoming defer \
    -object tls-creds-x509,dir=/home/berrange/security/qemutls,endpoint=server,id=tls0 \
    ...other args...

To enable incoming TLS migration, two monitor commands are then used:

(qemu) migrate_set_str_parameter tls-creds tls0
(qemu) migrate_incoming tcp:myhostname:9000

On the source host, QEMU is launched in a similar manner but using client endpoint credentials

$ qemu-system-x86_64 -monitor stdio \
    -object tls-creds-x509,dir=/home/berrange/security/qemutls,endpoint=client,id=tls0 \
    ...other args...

To enable outgoing TLS migration, two monitor commands are then used:

(qemu) migrate_set_str_parameter tls-creds tls0
(qemu) migrate tcp:otherhostname:9000

The migration code supports a number of different protocols besides just “tcp:“. In particular it allows an “fd:” protocol to tell QEMU to use a passed-in file descriptor, and an “exec:” protocol to tell QEMU to launch an external command to tunnel the connection. It is desirable to be able to use TLS with these protocols too, but when using TLS the client QEMU needs to know the hostname of the target QEMU in order to correctly validate the x509 certificate it receives. Thus, a second “tls-hostname” parameter was added to allow QEMU to be informed of the hostname to use for x509 certificate validation when using a non-tcp migration protocol. This can be set on the source QEMU prior to starting the migration using the “migrate_set_str_parameter” monitor command

(qemu) migrate_set_str_parameter tls-hostname myhost.mydomain

This feature has been under development for a while and finally merged into QEMU GIT early in the 2.7.0 development cycle, so will be available for use when 2.7.0 is released in a few weeks. With the arrival of the 2.7.0 release there will finally be TLS support across all QEMU host services where TCP connections are commonly used, namely VNC, SPICE, NBD, migration and character devices.

In this blog series:

by Daniel Berrange at August 16, 2016 01:00 PM

Improving QEMU security part 6: TLS support for character devices

This blog is part 6 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

A number of QEMU device models and objects use character devices for providing connectivity with the outside world, including the QEMU monitor, serial ports, parallel ports, virtio serial channels, the RNG EGD object, CCID smartcard passthrough, the IPMI device, USB device redirection and vhost-user. While some of these will only ever need a character device configured with local connectivity, some will certainly need to make use of TCP connections to remote hosts. Historically these connections have always been entirely in clear text, which is unacceptable in the modern hostile network environment where even internal networks cannot be trusted. Clearly the QEMU character device code requires the ability to use TLS for encrypting sensitive data and providing some level of authentication on connections.

The QEMU character device code was mostly using GLib’s GIOChannel framework for doing I/O, but this has a number of unsatisfactory limitations. It cannot do vectored I/O, is not easily extensible, and does not concern itself at all with initial connection establishment. These are all reasons why the QIOChannel framework was added to QEMU. So the first step in supporting TLS on character devices was to convert the code over to use QIOChannel instead of GIOChannel. With that done, adding in support for TLS was quite straightforward, merely requiring the addition of a new configuration property (“tls-creds“) to set the desired TLS credentials.

For example, to run a QEMU VM with a serial port listening on IP 10.0.0.1, port 9000, acting as a TLS server:

$ qemu-system-x86_64 \
      -object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/qemutls \
      -chardev socket,id=s0,host=10.0.0.1,port=9000,tls-creds=tls0,server \
      -device isa-serial,chardev=s0 \
      ...other QEMU options...

It is possible to test connectivity to this TLS server using the gnutls-cli tool:

$ gnutls-cli --priority=NORMAL -p 9000 \
     --x509cafile=/home/berrange/security/qemutls/ca-cert.pem \
     10.0.0.1
In the above example, QEMU was running as a TCP server, and acting as the TLS server endpoint, but this matching is not required. It is valid to configure it to run as a TLS client if desired, though this would be somewhat uncommon.

Of course you can connect 2 QEMU VMs together, both using TLS. Assuming the above QEMU is still running, we can launch a second QEMU connecting to it with

$ qemu-system-x86_64 \
      -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/qemutls \
      -chardev socket,id=s0,host=10.0.0.1,port=9000,tls-creds=tls0 \
      -device isa-serial,chardev=s0 \
      ...other QEMU options...

Notice, we’ve changed the “endpoint” and removed the “server” option, so this second QEMU runs as a TCP client and acts as the TLS client endpoint.

This feature is available since the QEMU 2.6.0 release a few months ago.

In this blog series:

by Daniel Berrange at August 16, 2016 12:11 PM

August 10, 2016

Zeeshan Ali Khattak

Life is change

Quite a few major life events happened/are happening this summer, so I thought I'd blog about them and some of the experiences I've had.

New job & new city/country

Yes, I found it hard to believe too that I'd ever be leaving Red Hat and the best manager I ever had (no offence to others, but competing with Matthias is just impossible), but I'll be moving to Gothenburg to join the Pelagicore folks as a Software Architect in just 2 weeks. I have always found Swedish to be a very cute language, so I'm looking forward to my attempt at learning it. If only I had learnt Swedish rather than Finnish when I was in Finland.

BTW, I'm selling all my furniture so if you're in London and need some furniture, get in touch!

Fresh helicopter pilot

So after two years of hard work and sinking myself into bank loans, I finally did it! Last week, I passed the skills test for the Private Pilot License (Helicopters) and am currently waiting anxiously for my license to come through (it usually takes at least two weeks). Once I have that, I can rent helicopters and take passengers with me. I'll be able to share the costs with passengers but I'm not allowed to make money out of it. The test was very tough and I came very close to failing at one particular point. The good news is that despite my being very tense, and very windy conditions on test day, the biggest negative point from my examiner was that I was being over-cautious and hence very slow. So I think it wasn't so bad.

There are a few differences from a driving test. A minor one is that in a driving test you are not expected to explain your steps but simply execute them, whereas in a skills test for flying you're expected to think everything out loud. But the most major difference is that in a driving test you are not expected to drive on your own until you pass the test, whereas in a flying test you are required to have flown solo for at least 10 hours, which needs to include a solo cross-country flight of at least 100 nautical miles (185 km) involving 3 major aerodromes.  Mine involved Elstree, Cranfield and Duxford.  I've been GPS logging while flying, so I can show you the log of my qualifying solo cross-country flight (click here to see details and notes):

I've still got a long way to go towards a Commercial License, but at least now I can share the costs with friends, so building hours towards it won't be so expensive (I hope). I've found a nice company in Gothenburg that trains in and rents helicopters, so I'm very much looking forward to flying over the coast there. Wanna join? Let me know. :)

by zeenix at August 10, 2016 01:24 PM

August 05, 2016

Gerd Hoffmann

Two new images uploaded

Uploaded two new images. First, a CentOS 7 image for the Raspberry Pi 3 (arm64). Very similar to the Fedora images; see the other Raspberry Pi posts for details. Second, an armv7 qemu image with the grub2 boot loader. It boots with EFI firmware (edk2/TianoCore). No need to copy kernel+initrd from the image and pass that to qemu, making […]

by Gerd Hoffmann at August 05, 2016 08:24 AM

July 15, 2016

Alex Williamson

Intel Graphics assignment

Hey folks, it feels like it's time to mention that assignment of Intel graphics devices (IGD) is currently available in qemu.git and will be part of the upcoming QEMU 2.7 release.  There's already pretty thorough documentation of the modes available in the source tree; please give it a read.  There are two modes described there, "legacy" and "Universal Passthrough" (UPT), each with its own pros and cons.  Which ones are available to you depends on your hardware.  UPT mode is only available for Broadwell and newer processors, while legacy mode is available all the way back through SandyBridge.  If you have a processor older than SandyBridge, stop now, this is not going to work for you.  If you don't know what any of these strange names mean, head to Wikipedia and Ark to figure it out.

The high level overview is that "legacy" mode is much like our GeForce support, the IGD is meant to be the primary and exclusive graphics in the VM.  Additionally the IGD address in the VM must be at PCI 00:02.0, only Seabios is currently supported, only the 440FX chipset model is supported (no Q35), the IGD device must be the primary host graphics device, and the host needs to be running kernel v4.6 or newer.  Clearly assigning the host primary graphics is a bit of an about-face for our GPU assignment strategy, but we depend on running the IGD video ROM, which depends on VGA and imposes most of the above requirements as well (oh add CONFIG_VFIO_PCI_VGA to the requirements list).  I have yet to see an IGD ROM with UEFI support, which is why OVMF is not yet supported, but seems possible to support with a CSM and some additional code in OVMF.

Legacy mode should work with both Linux and Windows guests (and hopefully others if you're so inclined).  The i915 driver does suffer from the typical video driver problem that sometimes the whole system explodes (not literally) when unbinding or re-binding the IGD to the driver.  Personally I avoid this by blacklisting the i915 driver.  Of course as some have found out trying to do this with discrete GPUs, there are plenty of other drivers ready to jump on the device to keep the console working.  The primary ones I've seen are vesafb and efifb, which one is used on your system depends on your host firmware settings, legacy BIOS vs UEFI respectively.  To disable these, simply add video=vesafb:off or video=efifb:off to the kernel command line (not sure which to use?  try both, video=vesafb:off,efifb:off).  The first thing you'll notice when you boot an IGD system with i915 blacklisted and the more basic framebuffer drivers disabled is that you don't get anything on the graphics head after grub.  Plan for this.  I use a serial console, but perhaps you're more comfortable running blind and willing to hope the system boots and you can ssh into it remotely.
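For reference, the blacklisting and framebuffer disabling described above boil down to a couple of host config lines (the paths shown are the conventional ones; adjust for your distro, and regenerate your initramfs if i915 is included in it):

```
# /etc/modprobe.d/blacklist-i915.conf
blacklist i915

# appended to the kernel command line in your bootloader config:
#   video=vesafb:off video=efifb:off
```

Remember this leaves the host with no graphical console after grub, so have a serial console or ssh access ready before rebooting.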

If you've followed along with this procedure, you should be able to simply create a <hostdev> entry in your libvirt XML, which ought to look something like this:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </hostdev>

Again, assigning the IGD device (which is always 00:02.0) to address 00:02.0 in the VM is required.  Delete the <video> and <graphics> sections and everything should just magically work.  Caveat emptor, my newest CPU is Broadwell, I've been told this works with Skylake, but IGD is hardly standardized and each new implementation seems to tweak things just a bit.

Some of you are probably also curious why this doesn't work on Q35, which leads into the discussion of UPT  mode; IGD clearly is not a discrete GPU, but "integrated" not only means that the GPU is embedded in the system, in this case it means that the GPU is kind of smeared across the system.  This is why IGD assignment hasn't "just worked" and why you need a host kernel with support for exposing certain regions through vfio and a BIOS that's aware of IGD, and it needs to be at a specific address, etc, etc, etc.  One of those requirements is that the video ROM actually also cares about a few properties of the device at PCI address 00:1f.0, the ISA/LPC bridge.  Q35 includes its own bridge at that location and we cannot simply modify the IDs of that bridge for compatibility reasons.  Therefore that bridge being an implicit part of Q35 means that IGD assignment doesn't work on Q35.  This also means that PCI address 00:1f.0 is not available for use in a 440FX machine.

Ok, so UPT.  Intel has known for a while that the sprawl of IGD has made it difficult to deal with for device assignment.  To combat this, both software and hardware changes have been made that help to consolidate IGD to be more assignment-friendly.  Great news, right?  Well, sort of.  First off, in UPT mode the IGD is meant to be a secondary graphics device in the VM; there's no VGA mode support (oh, BTW, x-vga=on is automatically added by QEMU in legacy mode).  In fact, um, there's no output support of any kind by default in UPT mode.  How's this useful, you ask?  Well, between the emulated graphics and IGD you can setup mirroring so you actually have a remote-capable, hardware accelerated graphics VM.  Plus, if you add the option x-igd-opregion=on to the vfio-pci device, you can get output to a physical display, but there again you're going to need the host running kernel v4.6 or newer and the upcoming QEMU 2.7 support, while no-output UPT has probably actually worked for quite a while.  UPT mode has no requirements for the IGD PCI address, but note that most VM firmware, SeaBIOS or OVMF, will define the primary graphics as the one having the lowest PCI address.  Usually not a problem, but some of you create some crazy configs.  You'll also still need to do all the blacklisting and video disabling above, or just risk binding and unbinding i915 from the host, gambling each time whether it'll explode.

So UPT sounds great, except why is this opregion thing optional?  Well, it turns out that if you want to do that cool mirroring thing I mention above and a physical output is enabled with the opregion, you actually need to have a monitor attached to the device or else your apps don't get any hardware acceleration love.  Whereas if IGD doesn't know about any outputs, it's happy to apply hardware acceleration regardless of what's physically connected.  Sucks, but readers here should already know how to create wrapper scripts to add this extra option if they want it (similar to x-vga=on).  I don't think Intel really wants to support this hacky hybrid mode either, thus the experimental x- option prefix tag.
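Such a wrapper boils down to one sed substitution over the QEMU arguments. A sketch of just the rewrite step (the function name is mine, and it assumes the device is specified as "-device vfio-pci,..."; a real wrapper would apply this to its arguments and then exec the real qemu binary):

```shell
# Append the experimental x-igd-opregion=on option to every vfio-pci
# device argument read on stdin.
add_opregion() {
    sed 's/-device vfio-pci\([^ ]*\)/-device vfio-pci\1,x-igd-opregion=on/'
}
```

For example, `-device vfio-pci,host=00:02.0` becomes `-device vfio-pci,host=00:02.0,x-igd-opregion=on` while all other arguments pass through untouched.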

Oh, one more gotcha for UPT mode, Intel seems to expect otherwise, but I've had zero success trying to run Linux guests with UPT.  Just go ahead and assume this is for your Windows guests only at this point.

What else... laptop displays should work, I believe switching outputs even works, but working on laptops is rather inconvenient since you're unlikely to have a serial console available.  Also note that while you can use input-linux to attach a laptop keyboard and mouse (not trackpad IME), I don't know how to make the hotkeys work, so that's a bummer.  Some IGD devices will generate DMAR error spew on the host when assigned, particularly the first time per host boot.  Don't be too alarmed by this, especially if it stops before the display is initialized.  This seems to be caused by resetting the IGD in an IOMMU context where it can't access its page tables setup by the BIOS/host.  Unless you have an ongoing spew of these, they can probably be ignored.  If you have something older than SandyBridge that you wish you could use this with and continued reading even after told to stop, sorry, there was a hardware change at SandyBridge and I don't have anything older to test with and don't really want to support additional code for such outdated hardware.  Besides, those are pretty old and you need an excuse for an upgrade anyway.

With this support I've switched my desktop system so that the host actually runs from a USB stick and the previous bare-metal Fedora install is virtualized with IGD, running alongside my existing GeForce VM.  Give it a try and good luck.

by Alex Williamson at July 15, 2016 05:34 PM

July 01, 2016

Daniel Berrange

ANNOUNCE: libosinfo 0.3.1 released

I am happy to announce a new release of libosinfo, version 0.3.1 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.

Changes in this release include:

  • Require glib2 >= 2.36
  • Replace GSimpleAsyncResult usage with GTask
  • Fix VPATH based builds
  • Don’t include autogenerated enum files in dist
  • Fix build with older GCC versions
  • Add/improve/fix data for
    • Debian
    • OpenSUSE
    • FreeBSD
    • Windows
    • RHEL
    • Ubuntu
  • Update README content
  • Fix string comparison for bootable media detection
  • Fix linker flags for OS-X & solaris
  • Fix darwin detection code
  • Fix multiple memory leaks

Thanks to everyone who contributed towards this release.

A special note to downstream vendors/distributors.

The next major release of libosinfo will include a major change in the way libosinfo is released and distributed. The current single release will be replaced with three independently released artefacts:

  • libosinfo – this will continue to provide the libosinfo shared library and most associated command line tools
  • osinfo-db – this will contain only the database XML files and RNG schema, no code at all.
  • osinfo-db-tools – a set of command line tools for managing deployment of osinfo-db archives for vendors & users.

The libosinfo and osinfo-db-tools releases will remain fairly infrequent, as they are today. The osinfo-db releases will be done very frequently, with automated releases made available no more than 1 day after updated DB content is submitted to the project.

by Daniel Berrange at July 01, 2016 10:31 AM

June 30, 2016

Daniel Berrange

ANNOUNCE: virt-viewer 4.0 release

I am happy to announce a new bugfix release of virt-viewer 4.0 (gpg), including experimental Windows installers for Win x86 MSI (gpg) and Win x64 MSI (gpg). Signatures are created with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

All historical releases are available from:

Changes in this release include:

  • Drop support for gtk2 builds
  • Require spice-gtk >= 0.31
  • Require glib2 >= 2.38
  • Require gtk3 >= 3.10
  • Require libvirt-glib >= 0.1.8
  • Increase minimum window size to 320×200 instead of 50×50
  • Remove use of GSLice
  • Don’t show usbredir button if not connected yet
  • Only compute monitor mapping in full screen
  • Don’t ignore usb-filter in spice vv-file
  • Port to use GtkApplication API
  • Don’t leave window open after connection failure
  • Validate symbols from max glib/gdk versions
  • Don’t use GtkStock
  • Don’t use gtk_widget_modify_{fg,bg} APIs
  • Drop use of built-in eventloop in favour of libvirt-glib
  • Don’t open X display while parsing command line
  • Fix window title
  • Use GResource for building ui files into binary
  • Fix crash with invalid spice monitor mapping
  • Add dialog to show file transfer progress and allow cancelling
  • Remove unused nsis installer support
  • Include adwaita icon theme in msi builds
  • Add more menu mnemonics
  • Fix support for windows consoles to allow I/O redirection
  • Add support for ovirt sso-token in vv-file
  • Fix crash with zooming window while not connected
  • Replace custom auto drawer widget with GtkRevealer
  • Add appdata file for gnome software
  • Misc other bug fixes
  • Refresh translations

Thanks to everyone who contributed towards this release.

by Daniel Berrange at June 30, 2016 03:39 PM

June 29, 2016

Cole Robinson

UEFI virt roms now in official Fedora repos

Kamil got to it first, but just a note that UEFI roms for x86 and aarch64 virt are now shipped in the standard Fedora repos, where previously the recommended place to grab them was an external nightly repo. Kamil has updated the UEFI+QEMU wiki page to reflect this change.

On up to date Fedora 23+ these roms will be installed automatically with the relevant qemu packages, and libvirt is properly configured to advertise the rom files to applications, so enabling this with tools like virt-manager is available out of the box.

For the curious, the reason we can now ship these binaries in Fedora is because the problematic EDK2 'FatPkg' code, which had a Fedora incompatible license, was replaced with an implementation with a less restrictive (and more Fedora friendly) license.

by Cole Robinson at June 29, 2016 02:28 PM

June 27, 2016

Gerd Hoffmann

New Raspberry PI images uploaded.

I’ve uploaded new images. Both a Fedora 23 refresh and new Fedora 24 images. There are not many changes, almost all notes from the two older articles here and here still apply. Noteworthy change is that the 32bit images don’t have a 64bit kernel for the rpi3 any more, so both rpi2 and rpi3 boot […]

by Gerd Hoffmann at June 27, 2016 08:24 AM

June 18, 2016

Cole Robinson

virt-manager 1.4.0 release

I've just released virt-manager 1.4.0. Besides the spice GL bits that I previously talked about, there's nothing too exciting in this release apart from a lot of virt-install/virt-xml command line extensions.

The changelog highlights:
  • virt-manager: spice GL console support (Marc-André Lureau, Cole Robinson)
  • Bump gtk and pygobject deps to 3.14
  • virt-manager: add checkbox to forget keyring password (Pavel Hrdina)
  • cli: add --graphics gl= (Marc-André Lureau)
  • cli: add --video accel3d= (Marc-André Lureau)
  • cli: add --graphics listen=none (Marc-André Lureau)
  • cli: add --transient flag (Richard W.M. Jones)
  • cli: --features gic= support, and set a default for it (Pavel Hrdina)
  • cli: Expose --video heads, ram, vram, vgamem
  • cli: add --graphics listen=socket
  • cli: add device address.type/address.bus/...
  • cli: add --disk seclabelX.model (and .label, .relabel)
  • cli: add --cpu (and .cpus, and .memory)
  • cli: add --network rom_bar= and rom_file=
  • cli: add --disk backing_format=

by Cole Robinson at June 18, 2016 12:06 PM

June 10, 2016

Cole Robinson

check-pylint: mini tool for running pylint anywhere

pylint and pep8 are indispensable tools for python development IMO. For projects I maintain, I long ago added a 'setup pylint' sub-command to run both commands, and I've documented this as a necessary step in the contributor guidelines.

But over the years I've accumulated many repos for small bits of python code that never have need for a script, but I still want the convenience of being able to run pylint and pep8 with a single command and a reasonable set of options.

So, a while back I wrote this tiny 'check-pylint' script which does exactly that. The main bit it adds is automatically searching the current directory for python scripts and modules and passing them to pylint/pep8. From the README:

Simple helper script that scoops up all python modules and scripts beneath the current directory, and passes them through pylint and pep8. Has a bit of smarts to ignore .git directory, and handle files that don't end in .py

The point is that you can just fire off 'check-pylint' in any directory containing python code and get a quick report.
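The file-scooping step described above might look something like the following.  This is a hypothetical sketch, not the actual check-pylint code: it collects .py files plus extension-less scripts with a python shebang, pruning .git along the way:

```python
import os

def find_python_files(topdir="."):
    """Collect .py files plus extension-less scripts whose first line is a
    python shebang, skipping anything under the .git directory.
    (Illustrative sketch, not the real check-pylint implementation.)"""
    found = []
    for root, dirs, files in os.walk(topdir):
        dirs[:] = [d for d in dirs if d != ".git"]  # prune .git in place
        for name in files:
            path = os.path.join(root, name)
            if name.endswith(".py"):
                found.append(path)
            elif "." not in name:
                try:
                    with open(path, "rb") as fh:
                        first = fh.readline()
                except OSError:
                    continue
                if first.startswith(b"#!") and b"python" in first:
                    found.append(path)
    return sorted(found)
```

The resulting list can then be passed straight to pylint and pep8 on the command line.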

by Cole Robinson at June 10, 2016 09:46 AM

May 22, 2016

Cole Robinson

spice OpenGL/virgl acceleration on Fedora 24

New in Fedora 24 virt is 3D accelerated SPICE graphics, via Virgl. This is kinda-sorta OpenGL passthrough from the VM up to the host machine. Much of the initial support has been around since qemu 2.5, but it's more generally accessible now that SPICE is in the mix, since that's the default display type used by virt-manager and gnome-boxes.

I'll explain below how you can test things on Fedora 24, but first let's cover the hurdles and caveats. This is far from being something that can be turned on by default and there's still serious integration issues to iron out. All of this is regarding usage with libvirt tools.

Caveats and known issues

Because of these issues, we haven't exposed this as a UI knob in any of the tools yet, to save us some redundant bug reports for the above issues from users who are just clicking a cool sounding check box :) Instead you'll need to explicitly opt in via the command line.

Testing it out

You'll need the following packages (or later) to test this:
  • qemu-2.6.0-2.fc24
  • libvirt-
  • virt-manager-1.3.2-4.20160520git2204de62d9.fc24
  • At least F24 beta on the host
  • Fedora 24 beta in the guest. Anything earlier is not going to actually enable the 3D acceleration. I have no idea about the state of other distributions. And to make it abundantly clear, this is Linux only and likely will be for a long time: I don't know if Windows driver support is even on the radar.
Once you install a Fedora 24 VM through the standard methods, you can enable spice GL for your VM with these two commands:

$ virt-xml --connect $URI $VM_NAME --confirm --edit --video clearxml=yes,model=virtio,accel3d=yes
$ virt-xml --connect $URI $VM_NAME --confirm --edit --graphics clearxml=yes,type=spice,gl=on,listen=none

The first command will switch the graphics device to 'virtio' and enable the 3D acceleration setting. The second command will set up spice to listen locally only, and enable GL. Make sure to fully poweroff the VM afterwards for the settings to take effect. If you want to make the changes manually with 'virsh edit', the XML specifics are described in the spice GL documentation.

Once your VM has started up, you can verify that everything is working correctly by checking glxinfo output in the VM, 'virgl' should appear in the renderer string:

$ glxinfo | grep virgl
Device: virgl (0x1010)
OpenGL renderer string: Gallium 0.4 on virgl

And of course the more fun test of giving supertuxkart a spin :)

Credit to Dave Airlie, Gerd Hoffmann, and Marc-André Lureau for all the great work that got us to this point!

by Cole Robinson at May 22, 2016 11:56 AM

May 13, 2016

Stefan Hajnoczi

Git: Internals of how objects are stored

Git has become the ubiquitous version control tool. A big community has grown around Git over the years. From its origins with the Linux kernel community to GitHub, the popular project hosting service, its core design has scaled. Well, it needed to: from day one the Linux source tree and commit history were large by most standards.

I consider Git one of the innovations (like BitTorrent and Bitcoin) that come along every few years to change the way we do things. These tools didn't necessarily invent brand new algorithms but they applied them in a way that was more effective than previous approaches. They have designs worth studying and have something we can learn from.

This post explains the internals of how git show 365e45161384f6dc704ec798828dc927d63e3b22 fetches the commit object for this SHA1. This SHA1 lookup is the main query operation for all object types in Git including commits, tags, and trees.

Repository setup

Let's start with a simple repo:

$ mkdir test && cd test
$ git init .
Initialized empty Git repository in /tmp/test/.git/
$ echo 'Hello world' >test.txt
$ git add test.txt && git commit -s

Here is the initial commit:

$ git show --format=raw
commit 365e45161384f6dc704ec798828dc927d63e3b22
tree 401ce7dbd55d28ea49c1c2f1c1439eb7d2b92427
author Stefan Hajnoczi <> 1463158107 -0700
committer Stefan Hajnoczi <> 1463158107 -0700

Initial checkin

Signed-off-by: Stefan Hajnoczi <>

diff --git a/test.txt b/test.txt
new file mode 100644
index 0000000..802992c
--- /dev/null
+++ b/test.txt
@@ -0,0 +1 @@
+Hello world

Loose objects

The function to look up an object from a binary SHA1 is struct object *parse_object(const unsigned char *sha1) in Git v2.8.2-396-g5fe494c. You can browse the source here. In Git the versioned data and metadata is stored as "objects" identified by their SHA1 hash. There are several types of objects so this function can look up commits, tags, trees, etc.

Git stores objects in several ways. The simplest is called a "loose object", which means that object 365e45161384f6dc704ec798828dc927d63e3b22 is stored zlib DEFLATE compressed in its own file. We can manually view the object like this:

$ openssl zlib -d <.git/objects/36/5e45161384f6dc704ec798828dc927d63e3b22
commit 241tree 401ce7dbd55d28ea49c1c2f1c1439eb7d2b92427
author Stefan Hajnoczi <> 1463158107 -0700
committer Stefan Hajnoczi <> 1463158107 -0700

Initial checkin

Signed-off-by: Stefan Hajnoczi <>

There! So this little command-line produces similar output to $ git show --format=raw 365e45161384f6dc704ec798828dc927d63e3b22 without using Git.

Note that openssl(1) is used in this example as a convenient command-line tool for decompressing zlib DEFLATE data. This has nothing to do with OpenSSL cryptography.

Git spreads objects across subdirectories in .git/objects/ by the SHA1's first byte to avoid having a single directory with a huge number of files. This helps file systems and userspace tools scale better when there are many commits.

Despite this directory layout it can still be inefficient to access loose objects. Each object requires open(2)/close(2) syscalls and possibly an access(2) call to check for existence first. This puts a lot of faith in cheap directory lookups in the file system and is not optimal when the number of objects grows.
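The loose-object scheme is small enough to reproduce in a few lines of Python (a sketch; git itself does this in C): the content is prefixed with a "<type> <size>\0" header, SHA1-hashed, and the hex digest's first byte picks the subdirectory.  The 802992c prefix below matches the test.txt blob shown in the diff earlier:

```python
import hashlib
import os
import zlib

def loose_object(obj_type, content):
    """Build a git loose object: content prefixed with a '<type> <size>\\0'
    header, hashed with SHA1, zlib-compressed, and filed under
    .git/objects/<first byte>/<remaining bytes>."""
    store = obj_type + b" " + str(len(content)).encode() + b"\x00" + content
    sha1 = hashlib.sha1(store).hexdigest()
    path = os.path.join(".git", "objects", sha1[:2], sha1[2:])
    return sha1, path, zlib.compress(store)

sha1, path, _ = loose_object(b"blob", b"Hello world\n")
print(sha1[:7], path)  # 802992c, matching the blob in the diff above
```

Round-tripping the compressed data through zlib gives back exactly the header-plus-content bytes that the openssl command above displays.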

Pack files

Git's solution to large numbers of objects is found in .git/objects/pack/. Each "pack" consists of an index file (.idx) and the pack itself (.pack). Here is how that looks in our test repository:

$ git repack
Counting objects: 3, done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
$ ls -l .git/objects/pack/
total 8
-r--r--r--. 1 stefanha stefanha 1156 May 13 10:35 pack-c63d94bfadeaba23a38c05c75f48c7535339d685.idx
-r--r--r--. 1 stefanha stefanha 246 May 13 10:35 pack-c63d94bfadeaba23a38c05c75f48c7535339d685.pack

In order to look up an object from a SHA1, the index file must be used:

The index file provides an efficient way to look up the byte offset into the pack file where the object is located. The index file's 256-entry lookup table takes the first byte of the SHA1 and produces the highest index for objects starting with that SHA1 byte. This lookup table makes it possible to search only objects starting with the same SHA1 byte while excluding all other objects. That's similar to the subdirectory layout used for loose objects!

The SHA1s of all objects inside the pack are stored in a sorted table. From the 1st SHA1 byte lookup we now know the lowest and highest index into the sorted SHA1 table where we can binary search for the SHA1 we are looking for. If the SHA1 cannot be found then the pack file does not contain that object.

Once the SHA1 has been found in the sorted SHA1 table, its index is used with the offsets table. The offset is the location in the pack file where the object is located. There is a minor complication here: offset table entries are only 32-bits so an extension is necessary for 64-bit offsets.

The 64-bit offset extension uses the top-most bit in the 32-bit offset table entry to indicate that a 64-bit offset is needed. This means the 32-bit offset table actually only works for offsets that fit into 31 bits! When the top-most bit is set the remaining bits form an index into the 64-bit offset table where the actual offset is located.
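Put together, the lookup just described can be sketched as follows.  The tables here are synthetic Python lists rather than data parsed from a real .idx file:

```python
import bisect

def pack_lookup(sha1, fanout, sha_table, offsets32, offsets64):
    """Find a pack-file offset for sha1 (bytes).  fanout[b] is the number
    of objects whose first SHA1 byte is <= b, bounding a binary search of
    the sorted SHA1 table; a 32-bit offset with the top bit set is really
    an index into the 64-bit offset table."""
    lo = fanout[sha1[0] - 1] if sha1[0] > 0 else 0
    hi = fanout[sha1[0]]
    i = bisect.bisect_left(sha_table, sha1, lo, hi)
    if i >= hi or sha_table[i] != sha1:
        return None  # object is not in this pack
    off = offsets32[i]
    if off & 0x80000000:  # only 31 bits usable; MSB flags the 64-bit table
        return offsets64[off & 0x7fffffff]
    return off
```

The fanout step is what makes the first-byte trick pay off: the binary search only ever touches the slice of the SHA1 table sharing that first byte.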

Each object in the pack file has a header describing the object type (commit, tree, tag, etc) and its size. After the header is the zlib DEFLATE compressed data.

Deltas

Although the pack file can contain simple zlib DEFLATE compressed data for objects, it can also contain a chain of delta objects. The chain is ultimately terminated by a non-delta "base object". Deltas avoid storing some of the same bytes again each time a modified object is added. A delta object describes only the changed data between its parent and itself. This can be much smaller than storing the full object.

Think about the case where a single line is added to a long text file. It would be nice to avoid storing all the unmodified lines again when the modified text file is added. This is exactly what deltas do!

In order to unpack a delta object it is necessary to start with the base object and apply the chain of delta objects, one after the other. I won't go into the details of the binary format of delta objects, but it is a memcpy(3) engine rather than a traditional text diff produced by diff(1). Each memcpy(3) operation needs offset and length parameters, as well as a source buffer - either the parent object or the delta data. This memcpy(3) engine approach is enough to delete bytes, add bytes, and copy bytes.
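As a sketch of that idea, here is a toy delta applier using explicit (op, ...) tuples instead of git's compact binary encoding, which packs the same copy/insert operations into variable-length opcodes:

```python
def apply_delta(base, ops):
    """Apply a simplified delta: each op either copies (offset, length)
    bytes from the base object or inserts literal bytes.  Git's real delta
    format encodes these same two operations in a compact binary form."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        elif op[0] == "insert":
            out += op[1]
    return bytes(out)

base = b"Hello world\n"
# Reuse "Hello" and "world\n" from the base, insert ", brave new " between.
delta = [("copy", 0, 5), ("insert", b", brave new "), ("copy", 6, 6)]
print(apply_delta(base, delta))  # b'Hello, brave new world\n'
```

Chained deltas simply repeat this process, feeding each result in as the base for the next delta in the chain.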

Summary

Git objects can be stored either as zlib DEFLATE compressed "loose object" files or inside a "pack file". The pack file reduces the need to access many files and also features deltas to reduce storage requirements.

The pack index file has a fairly straightforward layout for taking a SHA1 and finding the offset into the pack file. The layout is cache-friendly because it co-locates fields into separate tables instead of storing full records in a single table. This way the CPU cache only reads data that is being searched, not the other record fields that are irrelevant to the search.

The pack file format works as an immutable archive. It is optimized for lookups and does not support arbitrary modification well. An advantage of this approach is that the design is relatively simple and multiple git processes can perform lookups without fear of races.

I hope this gives a useful overview of git data storage internals. I'm not a git developer so I hope this description matches reality. Please point out my mistakes in the comments.

by stefanha at May 13, 2016 11:51 PM

May 12, 2016

Richard Jones

Libguestfs appliance boot in under 600ms

$ ./run ./utils/boot-benchmark/boot-benchmark
Warming up the libguestfs cache ...
Running the tests ...

test version: libguestfs 1.33.28
 test passes: 10
host version: Linux 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
    host CPU: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
     backend: direct               [to change set $LIBGUESTFS_BACKEND]
        qemu: /home/rjones/d/qemu/x86_64-softmmu/qemu-system-x86_64 [to change set $LIBGUESTFS_HV]
qemu version: QEMU emulator version 2.5.94, Copyright (c) 2003-2008 Fabrice Bellard
         smp: 1                    [to change use --smp option]
     memsize: 500                  [to change use --memsize option]
      append:                      [to change use --append option]

Result: 575.9ms ±5.3ms

There are various tricks here:

  1. I’m using the (still!) not upstream qemu DMA patches.
  2. I’ve compiled my own very minimal guest Linux kernel.
  3. I’m using my nearly upstream "crypto: Add a flag allowing the self-tests to be disabled at runtime." patch.
  4. I’ve got two sets of non-upstream libguestfs patches 1, 2
  5. I am not using libvirt, but if you do want to use libvirt, make sure you use the very latest version since it contains an important performance patch.


by rich at May 12, 2016 10:07 PM

Daniel Berrange

Analysis of techniques for ensuring migration completion with KVM

Live migration is a long standing feature in QEMU/KVM (and other competing virtualization platforms), however, by default it does not cope very well with guests whose workloads are very memory-write intensive. It is very easy to create a guest workload that will ensure a migration will never complete in its default configuration. For example, a guest which continually writes to each byte in a 1 GB region of RAM will never successfully migrate over a 1Gb/sec NIC. Even with a 10Gb/s NIC, a slightly larger guest can dirty memory fast enough to prevent completion without an unacceptably large downtime at switchover. Thus over the years, a number of optional features have been developed for QEMU with the aim of helping migration to complete.

If you don’t want to read the background information on migration features and the testing harness, skip right to the end where there are a set of data tables showing charts of the results, followed by analysis of what this all means.

The techniques available

  • Downtime tuning. Unless the guest is completely idle, it is never possible to get to a point where 100% of memory has been transferred to the target host. So at some point there needs to be a decision made about whether enough memory has been transferred to allow the switch over to the target host with an acceptable blackout period. The downtime tunable controls how long a blackout period is permitted during the switchover. QEMU measures the network transfer rate it is achieving and compares it to the amount of outstanding RAM to determine if it can be transferred within the configured downtime window. When migrating it is not desirable to set QEMU to use the maximum accepted downtime straightaway, as that guarantees that the guest will always suffer from the maximum downtime blackout. Instead, it is better to start off with a fairly small downtime value and increase the permitted downtime as time passes. The idea is to maximise the likelihood that migration can complete with a small downtime.
  • Bandwidth tuning. If the migration is taking place over a NIC that is used for other non-migration related actions, it may be desirable to prevent the migration stream from consuming all bandwidth. As noted earlier though, even a relatively small guest is capable of dirtying RAM fast enough that even a 10Gbs NIC will not be able to complete migration. Thus if the goal is to maximise the chances of getting a successful migration though, the aim should be to maximise the network bandwidth available to the migration operation. Following on from this, it is wise not to try to run multiple migration operations in parallel unless their transfer rates show that they are not maxing out the available bandwidth, as running parallel migrations may well mean neither will ever finish.
  • Pausing CPUs. The simplest and crudest mechanism for ensuring guest migration completes is to simply pause the guest CPUs. This prevents the guest from continuing to dirty memory and thus, even on the slowest network, it will ensure migration completes in a finite amount of time. The cost is that the guest workload will be completely stopped for a prolonged period of time. Think of pausing the guest as being equivalent to setting an arbitrarily long maximum permitted downtime. For example, assuming a guest with 8 GB of RAM and an idle 10Gbs NIC, in the worst case pausing would lead to an approx 6 second period of downtime. If higher speed NICs are available, the impact of pausing will decrease until it converges with a typical max downtime setting.
  • Auto-convergence. The rate at which a guest can dirty memory is related to the amount of time the guest CPUs are permitted to run for. Thus by throttling the CPU execution time it is possible to prevent the guest from dirtying memory so quickly and thus allow migration data transfer to keep ahead of RAM dirtying. If this feature is enabled, by default QEMU starts by cutting 20% of the guest vCPU execution time. At the start of each iteration over RAM, it will check progress during the previous two iterations. If insufficient forward progress is being made, it will repeatedly cut 10% off the running time allowed to vCPUs. QEMU will throttle CPUs all the way to 99%. This should guarantee that migration can complete on all but the most sluggish networks, but has a pretty high cost to guest CPU performance. It is also indiscriminate in that all guest vCPUs are throttled by the same factor, even if only one guest process is responsible for the memory dirtying.
  • Post-copy. Normally migration will only switch over to running on the target host once all RAM has been transferred. With post-copy, the goal is to transfer “enough” or “most” RAM across and then switch over to running on the target. When the target QEMU gets a fault for a memory page that has not yet been transferred, it’ll make an explicit out of band request for that page from the source QEMU. Since it is possible to switch to post-copy mode at any time, it avoids the entire problem of having to complete migration in a fixed downtime window. The cost is that while running in post-copy mode, guest page faults can be quite expensive, since there is a need to wait for the source host to transfer the memory page over to the target, which impacts performance of the guest during post-copy phase. If there is a network interruption while in post-copy mode it will also be impossible to recover. Since neither the source or target host has a complete view of the guest RAM it will be necessary to reboot the guest.
  • Compression. The migration pages are usually transferred to the target host as-is. For many guest workloads, memory page contents will be fairly easily compressible. So if there are available CPU cycles on the source host and the network bandwidth is a limiting factor, it may be worthwhile burning source CPUs in order to compress data transferred over the network. Depending on the level of compression achieved it may allow migration to complete. If the memory is not compression friendly though, it would be burning CPU cycles for no benefit. QEMU supports two compression methods, XBZRLE and multi-thread, either of which can be enabled. With XBZRLE a cache of previously sent memory pages is maintained that is sized to be some percentage of guest RAM. When a page is dirtied by the guest, QEMU compares the new page contents to that in the cache and then only sends a delta of the changes rather than the entire page. For this to be effective the cache size must generally be quite large – 50% of guest RAM would not be unreasonable. The alternative compression approach uses multiple threads which simply use zlib to directly compress the full RAM pages. This avoids the need to maintain a large cache of previous RAM pages, but is much more CPU intensive unless hardware acceleration is available for the zlib compression algorithm.
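The XBZRLE idea (sending only the bytes of a page that differ from the cached copy) can be illustrated with a toy encoder.  QEMU's real implementation uses a compact run-length byte encoding, so this sketch only captures the concept:

```python
def xbzrle_encode(old, new):
    """Toy XBZRLE-style encoder: emit (offset, bytes) runs where the new
    page differs from the cached old copy.  An unchanged page encodes to
    an empty list.  Not QEMU's actual wire format."""
    assert len(old) == len(new)
    runs, i, n = [], 0, len(new)
    while i < n:
        if old[i] == new[i]:
            i += 1
            continue
        j = i
        while j < n and old[j] != new[j]:
            j += 1
        runs.append((i, new[i:j]))
        i = j
    return runs

def xbzrle_decode(old, runs):
    """Rebuild the new page by patching the runs onto the cached copy."""
    page = bytearray(old)
    for offset, data in runs:
        page[offset:offset + len(data)] = data
    return bytes(page)

old = bytes(4096)                 # cached copy of a 4 KB page
new = bytearray(old)
new[100:104] = b"\x01\x02\x03\x04"  # guest dirtied 4 bytes
runs = xbzrle_encode(old, bytes(new))
print(runs)  # one small run instead of the whole 4 KB page
```

This also makes the cache-size requirement intuitive: without the old copy of a page on the source, there is nothing to diff against and the full page must be sent.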

Measuring impact of the techniques

Understanding what the various techniques do in order to maximise chances of a successful migration is useful, but it is hard to predict how well they will perform in the real world when faced with varying workloads. In particular, are they actually capable of ensuring completion under worst case workloads and what level of performance impact do they actually have on the guest workload. This is a problem that the OpenStack Nova project is currently struggling to get a clear answer on, with a view to improving Nova’s management of libvirt migration. In order to try and provide some guidance in this area, I’ve spent a couple of weeks working on a framework for benchmarking QEMU guest performance when subjected to the various different migration techniques outlined above.

In OpenStack the goal is for migration to be a totally “hands off” operation for cloud administrators. They should be able to request a migration and then forget about it until it completes, without having to baby sit it to apply tuning parameters. The other goal is that the Nova API should not have to expose any hypervisor specific concepts such as post-copy, auto-converge, compression, etc. Essentially Nova itself has to decide which QEMU migration features to use and just “do the right thing” to ensure completion. Whatever approach is chosen needs to be able to cope with any type of guest workload, since the cloud admins will not have any visibility into what applications are actually running inside the guest. With this in mind, when it came to performance testing the QEMU migration features, it was decided to look at their behaviour when faced with the worst case scenario. Thus a stress program was written which would allocate many GB of RAM, and then spawn a thread on each vCPU that would loop forever xor’ing every byte of RAM against an array of bytes taken from /dev/random. This ensures that the guest is both heavy on reads and writes to memory, as well as creating RAM pages which are very hostile towards compression. This stress program was statically linked and built into a ramdisk as the /init program, so that Linux would boot and immediately run this stress workload in a fraction of a second. In order to measure performance of the guest, each time 1 GB of RAM has been touched, the program will print out details of how long it took to update this GB and an absolute timestamp. These records are captured over the serial console from the guest, to be later correlated with what is taking place on the host side wrt migration.
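The actual stress program was a statically linked C binary run as /init, but purely for illustration its core workload might be sketched like this (buffer sizes and names here are made up; the real test used many GB and one thread per vCPU):

```python
import os
import time

def stress_pass(mem, key):
    """One pass of the stress workload: XOR every byte of mem against a
    random key, dirtying every page with compression-hostile data, and
    return how long the pass took."""
    start = time.monotonic()
    klen = len(key)
    for i in range(len(mem)):
        mem[i] ^= key[i % klen]
    return time.monotonic() - start

mem = bytearray(1 << 20)            # 1 MB here; the real guest used many GB
key = bytearray(os.urandom(4096))   # random key defeats compression
print("pass took %.3fs" % stress_pass(mem, key))
```

Logging the per-pass timings, as the real program did per GB touched, is what lets guest-side performance be correlated with the migration events on the host.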

Next up it was time to create a tool to control QEMU from the host and manage the migration process, activating the desired features. A test scenario was defined which encodes details of what migration features are under test and their settings (number of iterations before activating post-copy, bandwidth limits, max downtime values, number of compression threads, etc). A hardware configuration was also defined which expressed the hardware characteristics of the virtual machine running the test (number of vCPUs, size of RAM, host NUMA memory & CPU binding, usage of huge pages, memory locking, etc). The tests/migration/ tool provides the mechanism to invoke the test in any of the possible configurations. For example, to test post-copy migration, switching to post-copy after 3 iterations, allowing 1Gbs bandwidth on a guest with 4 vCPUs and 8 GB of RAM one might run

$ tests/migration/ --cpus 4 --mem 8 --post-copy --post-copy-iters 3 --bandwidth 125 --dst-host myotherhost --transport tcp --output postcopy.json

The postcopy.json file contains the full report of the test results. This includes all details of the test scenario and hardware configuration, migration status recorded at start of each iteration over RAM, the host CPU usage recorded once a second, and the guest stress test output. The accompanying tests/migration/ tool can consume this data file and produce interactive HTML charts illustrating the results.

$ tests/migration/ --split-guest-cpu --qemu-cpu --vcpu-cpu --migration-iters --output postcopy.html postcopy.json

To assist in making comparisons between runs, a set of standardized test scenarios is also defined which can be run via a tests/migration/ tool, in which case it is merely required to provide the desired hardware configuration

$ tests/migration/ --cpus 4 --mem 8 --dst-host myotherhost --transport tcp --output myotherhost-4cpu-8gb

This will run all the standard defined test scenarios and save many data files in the myotherhost-4cpu-8gb directory. The same tool can be used to create charts combining multiple data sets at once to allow easy comparison.

Performance results for QEMU 2.6

With the tools written, I went about running some tests against the QEMU GIT master codebase, which was effectively the same as the QEMU 2.6 code just released. The pair of hosts used were Dell PowerEdge R420 servers with 8 CPUs and 24 GB of RAM, spread across 2 NUMA nodes. The primary NICs were Broadcom Gigabit, but the hosts were augmented with Mellanox 10-Gig-E RDMA-capable NICs, which were picked for transfer of the migration traffic. For the tests I decided to collect data for two distinct hardware configurations, a small uniprocessor guest (1 vCPU and 1 GB of RAM) and a moderately sized multi-processor guest (4 vCPUs and 8 GB of RAM). Memory and CPU binding was specified such that the guests were confined to a single NUMA node to avoid performance measurements being skewed by cross-NUMA node memory accesses. The hosts and guests were all running the RHEL-7 3.10.0-0369.el7.x86_64 kernel.

To understand the impact of different network transports & their latency characteristics, the two hardware configurations were combinatorially expanded against 4 different network configurations – a local UNIX transport, a localhost TCP transport, a remote 10Gbs TCP transport and a remote 10Gbs RDMA transport.

The full set of results are linked from the tables that follow. The first link in each row gives a guest CPU performance comparison for each scenario in that row. The other cells in the row give the full host & guest performance details for that particular scenario.

UNIX socket, 1 vCPU, 1 GB RAM

Using UNIX socket migration to local host, guest configured with 1 vCPU and 1 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

UNIX socket, 4 vCPU, 8 GB RAM

Using UNIX socket migration to local host, guest configured with 4 vCPU and 8 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

TCP socket local, 1 vCPU, 1 GB RAM

Using TCP socket migration to local host, guest configured with 1 vCPU and 1 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

TCP socket local, 4 vCPU, 8 GB RAM

Using TCP socket migration to local host, guest configured with 4 vCPU and 8 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

TCP socket remote, 1 vCPU, 1 GB RAM

Using TCP socket migration to remote host, guest configured with 1 vCPU and 1 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

TCP socket remote, 4 vCPU, 8 GB RAM

Using TCP socket migration to remote host, guest configured with 4 vCPU and 8 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

RDMA socket, 1 vCPU, 1 GB RAM

Using RDMA socket migration to remote host, guest configured with 1 vCPU and 1 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

RDMA socket, 4 vCPU, 8 GB RAM

Using RDMA socket migration to remote host, guest configured with 4 vCPU and 8 GB of RAM

Scenario Tunable
Pause unlimited BW 0 iters 1 iters 5 iters 20 iters
Pause 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Post-copy unlimited BW 0 iters 1 iters 5 iters 20 iters
Post-copy 5 iters 100 mbs 300 mbs 1 gbs 10 gbs unlimited
Auto-converge unlimited BW 5% CPU step 10% CPU step 20% CPU step
Auto-converge 10% CPU step 100 mbs 300 mbs 1 gbs 10 gbs unlimited
MT compression unlimited BW 1 thread 2 threads 4 threads
XBZRLE compression unlimited BW 5% cache 10% cache 20% cache 50% cache

Analysis of results

The charts above provide the full set of raw results, from which you are welcome to draw your own conclusions. The test harness is also posted on the qemu-devel mailing list and will hopefully be merged into GIT at some point, so anyone can repeat the tests or run tests to compare other scenarios. What follows now is my interpretation of the results and interesting points they show

  • There is a clear periodic pattern in guest performance that coincides with the start of each migration iteration. Specifically at the start of each iteration there is a notable and consistent momentary drop in guest CPU performance. Picking an example where this effect is clearly visible – the 1 vCPU, 1GB RAM config with the “Pause 5 iters, 300 mbs” test – we can see the guest CPU performance drop from 200ms/GB of data modified, to 450ms/GB. QEMU maintains a bitmap associated with guest RAM to track which pages are dirtied by the guest while migration is running. At the start of each iteration over RAM, this bitmap has to be read and reset and this action is what is responsible for this momentary drop in performance.
  • With the larger guest sizes, there is a second roughly periodic but slightly more chaotic pattern in guest performance that is continual throughout migration. The magnitude of these spikes is about 1/2 that of those occurring at the start of each iteration. An example where this effect is clearly visible is the 4 vCPU, 8GB RAM config with the “Pause unlimited BW, 20 iters” test – we can see the guest CPU performance is dropping from 500ms/GB to between 700ms/GB and 800ms/GB. The host NUMA node that the guest is confined to has 4 CPUs and the guest itself has 4 CPUs. When migration is running, QEMU has a dedicated thread performing the migration data I/O and this is sharing time on the 4 host CPUs with the guest CPUs. So with QEMU emulator threads sharing the same pCPUs as the vCPU threads, we have 5 workloads competing for 4 CPUs. IOW, the frequent, slightly chaotic spikes in guest performance throughout the migration iteration are a result of overcommitting the host pCPUs. The magnitude of the spikes is directly proportional to the total transfer bandwidth permitted for the migration. This is not an inherent problem with migration – it would be possible to place QEMU emulator threads on a separate pCPU from vCPU threads if strong isolation is desired between the guest workload and migration processing.
  • The baseline guest CPU performance differs between the 1 vCPU, 1 GB RAM and 4 vCPU 8 GB RAM guests. Comparing the UNIX socket “Pause unlimited BW, 20 iters” test results for these 1 vCPU and 4 vCPU configs we see the former has a baseline performance of 200ms/GB of data modified while the latter has 400ms/GB of data modified. This is clearly nothing to do with migration at all. Naively one might think that going from 1 vCPU to 4 vCPUs would result in 4 times the performance, since we have 4 times more threads available to do work. What we’re seeing here is likely the result of hitting the memory bandwidth limit, so each vCPU is competing for memory bandwidth and thus the overall performance of each vCPU has decreased. So instead of getting x4 the performance going from 1 to 4 vCPUs only doubled the performance.
  • When post-copy is operating in its pre-copy phase, it has no measurable impact on the guest performance compared to when post-copy is not enabled at all. This can be seen by comparing the TCP socket “Pause 5 iters, 1 Gbs” test results with the “Post-copy 5 iters, 1 Gbs” test results. Both show the same baseline guest CPU performance and the same magnitude of spikes at the start of each iteration. This shows that it is viable to unconditionally enable the post-copy feature for all migration operations, even if the migration is likely to complete without needing to switch from pre-copy to post-copy phases. It provides the admin/app the flexibility to dynamically decide on the fly whether to switch to post-copy mode or stay in pre-copy mode until completion.
  • When post-copy migration switches from its pre-copy phase to the post-copy phase, there is a major but short-lived spike in guest CPU performance. What is happening here is that the guest has perhaps 80% of its RAM transferred to the target host when the post-copy phase starts, but the guest workload is touching some pages which are still on the source, so the page fault has to wait for the page to be transferred across the network. The magnitude of the spike and duration of the post-copy phase is related to the total guest RAM size and bandwidth available. Taking the remote TCP case with the 1 vCPU, 1 GB RAM hardware config for clarity, and comparing the “Post-copy 5 iters, 1Gbs” scenario with the “Post-copy 5 iters, 10Gbs” scenario, we can see the spike in guest performance is of the same order of magnitude in both cases. The overall time for each iteration of the pre-copy phase is clearly shorter in the 10Gbs case. If we further compare with the local UNIX domain socket, we can see the spike in performance is much lower at the post-copy phase. What this is telling us is that the magnitude of the spike in the post-copy phase is largely driven by the latency of transferring an out-of-band requested page from the source to the target, rather than by the overall bandwidth available. There are plans in QEMU to allow migration to use multiple TCP connections which should significantly reduce the post-copy latency spike, as the out-of-band requested pages will not get stalled behind a long TCP transmit queue for the background bulk-copy.
  • Auto-converge will often struggle to ensure convergence for larger guest sizes or when the bandwidth is limited. Considering the 4 vCPU, 8 GB RAM remote TCP test comparing effects of different bandwidth limits, we can see that with a 10Gbs bandwidth cap, auto-converge had to throttle to 80% to allow completion, while other tests show as much as 95% or even 99% in some cases. With a lower bandwidth limit of 1Gbs, the test case timed out after 5 minutes of running, having only throttled down by 20%, showing auto-converge is not nearly aggressive enough when faced with low bandwidth links. The worst case guest performance seen when running auto-converge with CPUs throttled to 80% was on a par with that seen with post-copy immediately after switching to the post-copy phase. The difference is that auto-converge shows that worst-case hit for a very long time during pre-copy, potentially many minutes, whereas post-copy only showed it for a few seconds.
  • Multi-thread compression was actively harmful to the chances of a successful migration. Considering the 4 vCPU, 8 GB RAM remote TCP test comparing thread counts, we can see that increasing the number of threads actually made performance worse, with fewer iterations over RAM being completed before the 5 minute timeout was hit. The longer each iteration takes, the more time the guest has to dirty RAM, so the less likely migration is to complete. There are two factors believed to be at work here that make the MT compression results so bad. First, as noted earlier, QEMU is confined to 4 pCPUs, so with 4 vCPUs running, the compression threads have to compete for time with the vCPU threads, slowing down the speed of compression. Second, the stress test workload run in the guest is writing completely random bytes which are a pathological input dataset for compression, allowing almost no compression. Given the fact that the compression was CPU limited though, even if there had been a good compression ratio, it would be unlikely to have a significant benefit since the increased time to iterate over RAM would allow the guest to dirty more data, eliminating the advantage of compressing it. If the QEMU emulator threads were given dedicated host pCPUs to run on it may have increased the performance somewhat, but then that assumes the host has CPUs free that are not running other guests.
  • XBZRLE compression fared a little better than MT compression. Again considering the 4 vCPU, 8 GB RAM remote TCP test comparing RAM cache sizing, we can see that the time required for each iteration over RAM did not noticeably increase. This shows that while XBZRLE compression did have a notable impact on guest CPU performance, it is not hitting a major bottleneck in processing each page, in contrast to MT compression. Again though, it did not help to achieve migration completion, with all tests timing out after 5 minutes or 30 iterations over RAM. This is due to the fact that the guest stress workload is again delivering input data that hits the pathological worst case in the algorithm. Faced with such a workload, no matter how much CPU time or RAM cache is available, XBZRLE can never have any positive impact on migration.
  • The RDMA data transport showed up a few of its quirks. First, by looking at the RDMA results comparing pause bandwidth, we can clearly identify a bug in QEMU’s RDMA implementation – it is not honouring the requested bandwidth limits – it always transfers at maximum link speed. Second, all the post-copy results show failure, confirming that post-copy is currently not compatible with RDMA migration. When comparing 10Gbs RDMA against 10Gbs TCP transports, there is no obvious benefit to using RDMA – it was not any more likely to complete migration in any of the test scenarios.

Considering all the different features tested, post-copy is the clear winner. It was able to guarantee completion of migration every single time, regardless of guest RAM size, with minimal long-lasting impact on guest performance. While it did have a notable spike impacting guest performance at the time of the switch from the pre-copy to the post-copy phase, this impact was short lived, only a few seconds. The next best result was seen with auto-converge, which again managed to complete migration in the majority of cases. By comparison with post-copy, the worst case impact seen on guest CPU performance was of the same order of magnitude, but it lasted for a very long time, many minutes. In addition, in more bandwidth limited scenarios, auto-converge was unable to throttle guest CPUs quickly enough to avoid hitting the overall 5 minute timeout, whereas post-copy would always succeed except in the most limited bandwidth scenarios (100Mbs – where no strategy can ever work). The other benefit of post-copy is that only the guest OS thread responsible for the page fault is delayed – other threads in the guest OS will continue running at normal speed if their RAM is already on the host. With auto-converge, all guest CPUs and threads are throttled regardless of whether they are responsible for dirtying memory. IOW post-copy has a targeted performance hit, whereas auto-converge is indiscriminate. Finally, as noted earlier, post-copy does have a failure scenario which can result in losing the VM if the network to the source host is lost for long enough to time out the TCP connection while in post-copy mode. This risk can be mitigated with redundancy at the network layer, and the VM is only at risk for the short period of time it is running in post-copy mode, which is mere seconds with a 10Gbs link.
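For a management application, both strategies are driven over QMP. The following is a minimal sketch of starting a post-copy capable migration over a connected QMP monitor socket; the helper names are hypothetical, while the command and capability names (qmp_capabilities, migrate-set-capabilities, postcopy-ram, migrate, migrate-start-postcopy) are QEMU's own:

```python
import json
import socket

def qmp_message(cmd: str, **args) -> bytes:
    """Encode one QMP command as a newline-terminated JSON object."""
    msg = {"execute": cmd}
    if args:
        msg["arguments"] = args
    return json.dumps(msg).encode() + b"\n"

def start_postcopy_migration(sock: socket.socket, uri: str) -> None:
    """Negotiate QMP, enable the postcopy-ram capability and kick off
    migration. Once pre-copy has run for the desired number of iterations,
    sending migrate-start-postcopy flips it into the post-copy phase."""
    reader = sock.makefile()
    reader.readline()  # consume the QMP greeting banner
    for msg in (
        qmp_message("qmp_capabilities"),
        qmp_message("migrate-set-capabilities",
                    capabilities=[{"capability": "postcopy-ram",
                                   "state": True}]),
        qmp_message("migrate", uri=uri),
    ):
        sock.sendall(msg)
        reader.readline()  # discard each {"return": ...} reply
```

Auto-converge is enabled the same way through its auto-converge capability, and in newer QEMU the throttling step can be tuned via migrate-set-parameters. Error handling and event processing are omitted here for brevity.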

It was expected that the compression features would fare badly given the guest workload, but the impact was far worse than expected, particularly for MT compression. Given the major requirements compression has in terms of host CPU time (MT compression) or host RAM (XBZRLE compression), they do not appear to be viable as general purpose features. They should only be used if the workloads are known to be compression friendly, the host has the CPU and/or RAM resources to spare, and neither post-copy nor auto-converge can be used. To make these features more practical to use in an automated, general purpose manner, QEMU would have to be enhanced to give the mgmt application direct control over turning them on and off during migration. This would allow the app to try using compression, monitor its effectiveness and then turn compression off if it is being harmful, rather than having to abort the migration entirely and restart it.
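To see why the stress workload is so hostile to XBZRLE in particular, it helps to look at the shape of the encoding. This is a toy sketch of the idea only (runs capped at 255 for simplicity), not QEMU's actual encoder:

```python
def xbzrle_sketch(old_page: bytes, new_page: bytes) -> bytes:
    """Toy version of the XBZRLE idea: encode the byte-wise XOR of the
    cached old page and the new page as alternating (zero-run length,
    data-run length, data bytes) records."""
    delta = bytes(a ^ b for a, b in zip(old_page, new_page))
    out = bytearray()
    i = 0
    while i < len(delta):
        zrun = 0
        while i < len(delta) and delta[i] == 0 and zrun < 255:
            zrun += 1
            i += 1
        start = i
        while i < len(delta) and delta[i] != 0 and i - start < 255:
            i += 1
        out.append(zrun)           # length of the unchanged run
        out.append(i - start)      # length of the changed run
        out += delta[start:i]      # the changed bytes themselves
    return bytes(out)
```

A page with a handful of changed bytes encodes into a few small records, but a page whose delta is essentially random expands rather than shrinks, which is exactly the pathological behaviour seen in the results above.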

There is scope for further testing with RDMA, since the hardware used for testing was limited to 10Gbs. Newer RDMA hardware is supposed to be capable of reaching higher speeds, 40Gbs, even 100 Gbs, which would have a correspondingly positive impact on ability to migrate. At least for any speeds of 10Gbs or less though, it does not appear worthwhile to use RDMA; apps would be better off using TCP in combination with post-copy.

In terms of network I/O, no matter what the guest workload, QEMU is generally capable of saturating whatever link is used for migration for as long as it takes to complete. It is very easy to create workloads that will never complete, and decreasing the available bandwidth just increases the chances that migration will never finish. It might be tempting to think that if you have 2 guests, it would take the same total time whether you migrate them one after the other, or migrate them in parallel. This is not necessarily the case though, as with a parallel migration the bandwidth will be shared between them, which increases the chances that neither guest will ever be able to complete. So as a general rule it appears wise to serialize all migration operations on a given host, unless there are multiple NICs available.

In summary, use post-copy if it is available, otherwise use auto-converge. Don’t bother with compression unless the workload is known to be very compression friendly. Don’t bother with RDMA unless it supports more than 10 Gbs, otherwise stick with plain TCP.

by Daniel Berrange at May 12, 2016 03:17 PM

April 29, 2016

Cole Robinson

Using CPU host-passthrough with virt-manager

I described virt-manager's CPU model default in this post. In that post I explained the difficulties of using either of the libvirt options for mirroring the host CPU: mode=host-model still has operational issues, and mode=host-passthrough isn't recommended for use with libvirt over supportability concerns.

Unfortunately since writing that post the situation hasn't improved any, and since host-passthrough is the only reliable way to expose the full capabilities of the host CPU to the VM, users regularly want to enable it. This is particularly apparent if trying to do nested virt, which often doesn't work on Intel CPUs unless host-passthrough is used.

However, we don't explicitly expose this option in virt-manager since it's not generally recommended for libvirt usage. You can still enable it in virt-manager though:
  • Navigate to VM Details->CPU
  • Enter host-passthrough in the CPU model field
  • Click Apply
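Behind the scenes this simply sets the CPU mode in the guest's libvirt domain XML, which you could equally add with virsh edit:

```xml
<cpu mode='host-passthrough'/>
```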

by Cole Robinson at April 29, 2016 04:48 PM

UEFI support in virt-install and virt-manager

One of the new features in virt-manager 1.2.0 (from back in May) is user friendly support for enabling UEFI.

First a bit about terminology: When UEFI is packaged up to run in an x86 VM, it's often called OVMF. When UEFI is packaged up to run in an AArch64 VM, it's often called AAVMF. But I'll just refer to all of it as UEFI.

Using UEFI with virt-install and virt-manager

The first step to enable this for VMs is to install the binaries. UEFI still has some licensing issues that make it incompatible with Fedora's policies, so the bits are hosted in an external repo. Details for installing the repo and UEFI bits are over here.

Once the bits are installed (and you're on Fedora 22 or later), virt-manager and virt-install provide simple options to enable UEFI when creating VMs.

Marcin has a great post with screenshots describing this for virt-manager (for aarch64, but the steps are identical for x86 VMs).

For virt-install it's as simple as doing:

$ sudo virt-install --boot uefi ...

virt-install will get the binary paths from libvirt and set everything up with the optimal config. If virt-install can't figure out the correct parameters, like if no UEFI binaries are installed, you'll see an error like: ERROR    Error: --boot uefi: Don't know how to setup UEFI for arch 'x86'

See 'virt-install --boot help' if you need to tweak the parameters individually.

Implementing support in applications

Libvirt needs to know about UEFI<->NVRAM config file mapping, so it can advertise it to tools like virt-manager/virt-install. Libvirt looks at a hardcoded list of known host paths to see if any firmware is installed, and if so, lists those paths in domain capabilities output (virsh domcapabilities). Libvirt in Fedora 22+ knows to look for the paths provided by the repo mentioned above, so just installing the firmware is sufficient to make libvirt advertise UEFI support.

The domain capabilities output only lists the firmware path and the associated variable store path. Notably lacking is any indication of what architecture the binaries are meant for. So tools need to determine this mapping themselves.

virt-manager/virt-install and libguestfs use a similar path matching heuristic. The libguestfs code is a good reference:

    match guest_arch with
    | "i386" | "i486" | "i586" | "i686" ->
        [ "/usr/share/edk2.git/ovmf-ia32/OVMF_CODE-pure-efi.fd",
          "/usr/share/edk2.git/ovmf-ia32/OVMF_VARS-pure-efi.fd" ]
    | "x86_64" ->
        [ "/usr/share/OVMF/OVMF_CODE.fd",
          "/usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd" ]
    | "aarch64" ->
        [ "/usr/share/AAVMF/AAVMF_CODE.fd",
          "/usr/share/edk2.git/aarch64/vars-template-pflash.raw" ]
    | arch ->
        error (f_"don't know how to convert UEFI guests for architecture %s")
          guest_arch in

Having to track this in every app is quite crappy, but it's the only good solution at the moment. Hopefully long term libvirt will grow some solution that makes this easier for applications.
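As an illustration, the same heuristic might look like this in Python. The helper name is hypothetical; the paths are those from the libguestfs snippet above and will vary per distro:

```python
# Firmware (loader, NVRAM variable template) pairs, keyed by guest arch.
# Paths taken from the libguestfs heuristic; adjust for your distro.
UEFI_FIRMWARE = {
    "i686":    ("/usr/share/edk2.git/ovmf-ia32/OVMF_CODE-pure-efi.fd",
                "/usr/share/edk2.git/ovmf-ia32/OVMF_VARS-pure-efi.fd"),
    "x86_64":  ("/usr/share/OVMF/OVMF_CODE.fd",
                "/usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd"),
    "aarch64": ("/usr/share/AAVMF/AAVMF_CODE.fd",
                "/usr/share/edk2.git/aarch64/vars-template-pflash.raw"),
}

def uefi_paths(guest_arch: str) -> tuple:
    """Return the (firmware, nvram-template) pair for a guest architecture,
    folding the 32-bit x86 variants together as the OCaml code does."""
    if guest_arch in ("i386", "i486", "i586"):
        guest_arch = "i686"
    try:
        return UEFI_FIRMWARE[guest_arch]
    except KeyError:
        raise ValueError(
            "don't know how to set up UEFI for arch %r" % guest_arch)
```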

by Cole Robinson at April 29, 2016 04:35 PM

Polkit password-less access for the 'libvirt' group

Many users, who admin their own machines, want to be able to use tools like virt-manager without having to enter a root password. Just google 'virt-manager without password' and see all the hits. I've seen many blogs and articles over the years describing various ways to work around it.

The password prompting is via libvirt's polkit integration. The idea is that we want the applications to run as a regular unprivileged user (running GUI apps as root is considered a no-no), and only use the root authentication for talking to system libvirt instance. Most workarounds suggest installing a polkit rule to allow your user, or a particular user group, to access libvirt without needing to enter the root password.

In libvirt v1.2.16 we finally added official support for this (and backported to Fedora22+). The group is predictably called 'libvirt'. This matches polkit rules that debian and suse were already shipping too.
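For reference, such a rule (the general shape the distros ship; the exact file name under /etc/polkit-1/rules.d/ varies) looks roughly like:

```
polkit.addRule(function(action, subject) {
    if (action.id == "org.libvirt.unix.manage" &&
        subject.isInGroup("libvirt")) {
        return polkit.Result.YES;
    }
});
```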

So just add your user to the 'libvirt' group and enjoy passwordless virt-manager usage:

$ usermod --append --groups libvirt `whoami`

by Cole Robinson at April 29, 2016 04:32 PM

qemu:///system vs qemu:///session

If you've spent time using libvirt apps like virt-manager, you've likely seen references to libvirt URIs. The URI is how users or apps tell libvirt what hypervisor (qemu, xen, lxc, etc) to connect to, what host it's on, what authentication method to use, and a few other bits. 

For QEMU/KVM (and a few other hypervisors), there's a concept of system URI vs session URI:
  • qemu:///system: Connects to the system libvirtd instance, the one launched by systemd. libvirtd is running as root, so has access to all host resources. qemu VMs are launched as the unprivileged 'qemu' user, though libvirtd can grant the VM selective access to root owned resources. Daemon config is in /etc/libvirt, VM logs and other bits are stored in /var/lib/libvirt. virt-manager and big management apps like Openstack and oVirt use this by default.
  • qemu:///session: Connects to a libvirtd instance running as the app user, the daemon is auto-launched if it's not already running. libvirt and all VMs run as the app user. All config and logs and disk images are stored in $HOME. This means each user has their own qemu:///session VMs, separate from all other users. gnome-boxes and libguestfs use this by default.

That describes the 'what', but the 'why' of it is a bigger story. The privilege level of the daemon and VMs have pros and cons depending on your usecase. The easiest way to understand the benefit of one over the other is to list the problems with each setup.

qemu:///system runs libvirtd as root, and access is mediated by polkit. This means if you are connecting to it as a regular user (like when launching virt-manager), you need to enter the host root password, which is annoying and not generally desktop usecase friendly. There are ways to work around it but it requires explicit admin configuration.

Desktop use cases also suffer since VMs are running as the 'qemu' user, but the app (like virt-manager) is running as your local user. For example, say you download an ISO to $HOME and want to attach it to a VM. The VM is running as unprivileged user=qemu and can't access your $HOME, so libvirt has to change the ISO file owner to qemu:qemu and virt-manager has to give search access to $HOME for user=qemu. It's a pain for apps to handle, and it's confusing for users, but after dealing with it for a while in virt-manager we've made it generally work. (Though try giving a VM access to a file on a fat32 USB drive that was automounted by your desktop session...)

qemu:///session runs libvirtd and VMs as your unprivileged user. This integrates better with desktop use cases since permissions aren't an issue, no root password is required, and each user has their own separate pool of VMs.

However because nothing in the chain is privileged, any VM setup tasks that need host admin privileges aren't an option. Unfortunately this includes most general purpose networking options.

The default qemu mode in this case is usermode networking (or SLIRP). This is an IP stack implemented in userspace. This has many drawbacks: the VM cannot easily be accessed by the outside world, the VM can talk to the outside world but only over a limited number of networking protocols, and it's very slow.

There is an option for qemu:///session VMs to use a privileged networking setup, via the setuid qemu-bridge-helper. Basically the host admin sets up a bridge, adds it to a whitelist at /etc/qemu/bridge.conf, then it's available for unprivileged qemu instances. By default on Fedora this contains 'virbr0' which is the default virtual network bridge provided by the system libvirtd instance, and what qemu:///system VMs typically use.

gnome-boxes originally used usermode networking, but switched around the Fedora 21 timeframe to use virbr0 via qemu-bridge-helper. But that's dependent on virbr0 being set up correctly by the host admin, or via package install (the libvirt-daemon-config-network package on Fedora).

qemu:///session also misses some less common features that require host admin privileges, like host PCI device assignment. Also VM autostart doesn't work as expected because the session daemon itself isn't autostarted.

Apps have to decide for themselves which libvirtd mode to use, depending on their use case.

qemu:///system is completely fine for big apps like oVirt and Openstack that require admin access to the virt hosts anyways.

virt-manager largely defaults to qemu:///system because that's what it has always done, and that default long precedes qemu-bridge-helper. We could switch but it would just trade one set of issues for another. virt-manager can be used with qemu:///session though (or any URI for that matter).

libguestfs uses qemu:///session since it avoids all the permission issues and the VM appliance doesn't really care about networking.

gnome-boxes prioritized desktop integration from day 1, so qemu:///session was the natural choice. But they've struggled with the networking issues in various forms.

Other apps are in a pickle: they would like to use qemu:///session to avoid the permission issues, but they also need to tweak the network setup. This is the case vagrant-libvirt currently finds itself in.

by Cole Robinson at April 29, 2016 04:11 PM

Powered by Planet!
Last updated: October 22, 2016 11:01 PM