Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools


Planet Feeds

January 19, 2018

Eduardo Otubo

Xen synchronicity between frontend and backend devices

I bumped into a problem last month, and it took me far too long to see the big picture because I could not find much documentation about it. Most of the help I got came from the good people on the #xendevel channel @ Freenode, mostly maintainers. So if you want to understand a little bit of Xen without pinging people on IRC, this post is the place.

The problem is the following: I'm running RHEL on the Xen hypervisor, and whenever I try to unload and reload the xen_netfront kernel module I see output like this in dmesg:

# modprobe -r xen_netfront
# dmesg|tail
[  105.236836] xen:grant_table: WARNING: g.e. 0x903 still in use!
[  105.236839] deferring g.e. 0x903 (pfn 0x35805)
[  105.237156] xen:grant_table: WARNING: g.e. 0x904 still in use!
[  105.237160] deferring g.e. 0x904 (pfn 0x35804)
[  105.237163] xen:grant_table: WARNING: g.e. 0x905 still in use!
[  105.237166] deferring g.e. 0x905 (pfn 0x35803)
[  105.237545] xen:grant_table: WARNING: g.e. 0x906 still in use!
[  105.237550] deferring g.e. 0x906 (pfn 0x35802)
[  105.237553] xen:grant_table: WARNING: g.e. 0x907 still in use!
[  105.237556] deferring g.e. 0x907 (pfn 0x35801)

Moreover, the interface is not usable after the module is loaded again:

# dmesg|tail
[  105.237163] xen:grant_table: WARNING: g.e. 0x905 still in use!
[  105.237166] deferring g.e. 0x905 (pfn 0x35803)
[  105.237545] xen:grant_table: WARNING: g.e. 0x906 still in use!
[  105.237550] deferring g.e. 0x906 (pfn 0x35802)
[  105.237553] xen:grant_table: WARNING: g.e. 0x907 still in use!
[  105.237556] deferring g.e. 0x907 (pfn 0x35801)
[  160.050882] xen_netfront: Initialising Xen virtual ethernet driver
[  160.066937] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[  160.067270] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[  160.069355] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
# ifconfig eth0
eth0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether 00:00:00:00:00:00  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
# ifconfig eth0 up
SIOCSIFFLAGS: Cannot assign requested address

The first problem happens because the backend part of the driver (xen_netback) is still using some pieces of memory (g.e. stands for grant entries) that are shared between guest and host. The ideal scenario would be to wait for netback to free those entries and only then unload the netfront module. This was actually a bug in the synchronization between the netfront and netback parts.

The state of the drivers is kept in separate structs, as defined in include/xen/xenbus.h:69:

/* A xenbus device. */
struct xenbus_device {
        const char *devicetype;
        const char *nodename;
        const char *otherend;
        int otherend_id;
        struct xenbus_watch otherend_watch;
        struct device dev;
        enum xenbus_state state;
        struct completion down;
        struct work_struct work;
};

And the netfront state can be seen from the hypervisor with the command:

# xenstore-ls -fp
/local/domain/1/device/vif/0/state = "4"   (n1,r0)

The number 4 indicates XenbusStateConnected (as defined in include/xen/interface/io/xenbus.h:17). So everything comes down to waiting for one end to finish using the memory region and for the other end to free it. This first piece of the puzzle is solved by this patch:

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 8b8689c6d887..391432e2725d 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -87,6 +87,8 @@ struct netfront_cb {
 /* IRQ name is queue name with "-tx" or "-rx" appended */

+static DECLARE_WAIT_QUEUE_HEAD(module_unload_q);
 struct netfront_stats {
        u64                     packets;
        u64                     bytes;
@@ -2021,10 +2023,12 @@ static void netback_changed(struct xenbus_device *dev,

        case XenbusStateClosed:
+               wake_up_all(&module_unload_q);
                if (dev->state == XenbusStateClosed)
                /* Missed the backend's CLOSING state -- fallthrough */
        case XenbusStateClosing:
+               wake_up_all(&module_unload_q);
@@ -2130,6 +2134,20 @@ static int xennet_remove(struct xenbus_device *dev)

        dev_dbg(&dev->dev, "%s\n", dev->nodename);

+       if (xenbus_read_driver_state(dev->otherend) != XenbusStateClosed) {
+               xenbus_switch_state(dev, XenbusStateClosing);
+               wait_event(module_unload_q,
+                          xenbus_read_driver_state(dev->otherend) ==
+                          XenbusStateClosing);
+               xenbus_switch_state(dev, XenbusStateClosed);
+               wait_event(module_unload_q,
+                          xenbus_read_driver_state(dev->otherend) ==
+                          XenbusStateClosed ||
+                          xenbus_read_driver_state(dev->otherend) ==
+                          XenbusStateUnknown);
+       }
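
As a quick reference for the state values that appear in xenstore and in the patch above, here is a minimal Go sketch of the xenbus state numbering. The numeric values follow enum xenbus_state in include/xen/interface/io/xenbus.h; the Go type and the ParseState helper are names made up for this illustration and are not part of any Xen tooling.

```go
package main

import (
	"fmt"
	"strconv"
)

// XenbusState mirrors the numbering of enum xenbus_state in
// include/xen/interface/io/xenbus.h. The Go type itself is illustrative.
type XenbusState int

const (
	XenbusStateUnknown       XenbusState = iota // 0
	XenbusStateInitialising                     // 1
	XenbusStateInitWait                         // 2
	XenbusStateInitialised                      // 3
	XenbusStateConnected                        // 4
	XenbusStateClosing                          // 5
	XenbusStateClosed                           // 6
	XenbusStateReconfiguring                    // 7
	XenbusStateReconfigured                     // 8
)

var stateNames = [...]string{
	"Unknown", "Initialising", "InitWait", "Initialised", "Connected",
	"Closing", "Closed", "Reconfiguring", "Reconfigured",
}

// String returns the state name, or a placeholder for out-of-range values.
func (s XenbusState) String() string {
	if s < 0 || int(s) >= len(stateNames) {
		return "Invalid(" + strconv.Itoa(int(s)) + ")"
	}
	return stateNames[s]
}

// ParseState (hypothetical helper) converts the text of a xenstore
// "state" node, such as "4", into a XenbusState.
func ParseState(value string) (XenbusState, error) {
	n, err := strconv.Atoi(value)
	if err != nil {
		return XenbusStateUnknown, err
	}
	return XenbusState(n), nil
}

func main() {
	s, err := ParseState("4")
	if err != nil {
		panic(err)
	}
	fmt.Println("frontend state:", s)
}
```

Read this way, the patch moves the frontend through Closing (5) and Closed (6) and waits for the backend to follow before the module goes away.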


The second piece of the problem is that the interface is not usable after the module is reloaded. The cause is that the device state is never initialized, so the backend does not notice the frontend and the two drivers (frontend and backend) are never connected. This was easily solved by the following patch:

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index c5a34671abda..9bd7ddeeb6a5 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1326,6 +1326,7 @@ static struct net_device *xennet_create_dev(struct xenbus_device *dev)
+       xenbus_switch_state(dev, XenbusStateInitialising);
        return netdev;

by Eduardo Otubo at January 19, 2018 10:08 AM

January 05, 2018

QEMU project

QEMU and the Spectre and Meltdown attacks

As you probably know by now, three critical architectural flaws in CPUs have been recently disclosed that allow user processes to read kernel or hypervisor memory through cache side-channel attacks. These flaws, collectively named Meltdown and Spectre, affect in one way or another almost all processors that perform out-of-order execution, including x86 (from Intel and AMD), POWER, s390 and ARM processors.

No microcode updates are required to block the Meltdown attack. In addition, the Meltdown flaw does not allow a malicious guest to read the contents of hypervisor memory. Fixing it only requires that the operating system separates the user and kernel address spaces (known as page table isolation for the Linux kernel), which can be done separately on the host and the guests. Therefore, this post will focus on Spectre, and especially on CVE-2017-5715.

Fixing or mitigating Spectre in general, and CVE-2017-5715 in particular, requires cooperation between the processor and the operating system kernel or hypervisor; the processor can be updated through microcode or millicode patches to provide the required functionality.

Among the three vulnerabilities, CVE-2017-5715 is notable because it allows guests to read potentially sensitive data from hypervisor memory. Patching the host kernel is sufficient to block attacks from guests to the host. On the other hand, in order to protect the guest kernel from a malicious userspace, updates are also needed to the guest kernel and, depending on the processor architecture, to QEMU.

Just like on bare-metal, the guest kernel will use the new functionality provided by the microcode or millicode updates. When running under a hypervisor, processor emulation is mostly out of QEMU’s scope, so QEMU’s role in the fix is small, but nevertheless important. In the case of KVM:

  • QEMU configures the hypervisor to emulate a specific processor model. For x86, QEMU has to be aware of new CPUID bits introduced by the microcode update, and it must provide them to guests depending on how the guest is configured.

  • upon virtual machine migration, QEMU reads the CPU state on the source and transmits it to the destination. For x86, QEMU has to be aware of new model specific registers (MSRs).

Right now, there are no public patches to KVM that expose the new CPUID bits and MSRs to the virtual machines, so there is no urgent need to update QEMU; remember that updating the host kernel is enough to protect the host from malicious guests. Nevertheless, updates will be posted to the qemu-devel mailing list in the next few days, and a 2.11.1 patch release will be published with the fix.

Once updates are provided, live migration to an updated version of QEMU will not be enough to protect the guest kernel from guest userspace. Because the virtual CPU has to be changed to one with the new CPUID bits, the guest will have to be restarted.

As of today, the QEMU project is not aware of whether similar changes will be required for non-x86 processors. If so, they will also be posted to the mailing list and backported to recent stable releases.

For more information on the vulnerabilities, please refer to the Google Security Blog and Google Project Zero posts on the topic, as well as the Spectre and Meltdown FAQ.

5 Jan 2018: clarified the level of protection provided by the host kernel update; added a note on live migration; clarified the impact of Meltdown on virtualization hosts

by Paolo Bonzini and Eduardo Habkost at January 05, 2018 02:00 PM

December 19, 2017

Thomas Huth

How to use Fedora 27 s390x ISO images with KVM

When you are trying to install an s390x Fedora 27 (or RHEL 7.4) KVM guest on an IBM z Systems mainframe by using one of the provided ISO DVD images (e.g. Fedora-Server-dvd-s390x-27-1.6.iso), there are two pitfalls you should be aware of.

The first problem is that QEMU’s firmware for the s390x guests can not directly boot the ISO images yet, since the images have been built without the so-called El-Torito boot information, which is required for bootable images here. This has recently been fixed in the Lorax tool, so this problem should hopefully be gone with Fedora 28. But for the current version of Fedora, you still have to fetch the kernel and initrd images from the ISO image and specify them with the --kernel and --initrd parameters when running QEMU.

The second problem only occurs if you try to use the ISO image via the “convenience” --cdrom parameter of QEMU: In this case the disk is configured as a so-called virtio-block device, which shows up as /dev/vd* in the guest’s file system. Unfortunately, the start-up scripts of Fedora expect a SCSI DVD drive instead (which should show up as /dev/sr* in the guest), so you eventually just get a “dracut-initqueue timeout” and an emergency shell if you try to boot your guest this way. Thus you can not use the --cdrom parameter here, and you have to configure a virtio-scsi CD-ROM drive for the ISO image instead.

In a nutshell, these are the steps that you have to take to install a Fedora 27 s390x KVM guest from an ISO image:

qemu-img create -f qcow2 fedora.qcow2 16G
sudo mount -o loop,ro Fedora-Server-dvd-s390x-27-1.6.iso /mnt
cp /mnt/images/kernel.img /mnt/images/initrd.img .
sudo umount /mnt
qemu-system-s390x -M accel=kvm -m 1G --nographic --device virtio-scsi \
  --drive file=Fedora-Server-dvd-s390x-27-1.6.iso,format=raw,if=none,id=c1 \
  --device scsi-cd,drive=c1 --hda fedora.qcow2 \
  --kernel kernel.img --initrd initrd.img

December 19, 2017 12:30 PM

December 07, 2017

Daniel Berrange

Full coverage of libvirt XML schemas achieved in libvirt-go-xml

In recent times I have been aggressively working to expand the coverage of libvirt XML schemas in the libvirt-go-xml project. Today this work has finally come to a conclusion, when I achieved what I believe to be effectively 100% coverage of all of the libvirt XML schemas. More on this later, but first some background on Go and XML….

For those who aren’t familiar with Go, the core library’s encoding/xml module provides a very easy way to consume and produce XML documents in Go code. You simply define a set of struct types and annotate their fields to indicate what elements & attributes each should map to. For example, given the Go structs:

type Person struct {
    XMLName xml.Name `xml:"person"`
    Name string `xml:"name,attr"`
    Age string `xml:"age,attr"`
    Home *Address `xml:"home"`
    Office *Address `xml:"office"`
}

type Address struct {
    Street string `xml:"street"`
    City string `xml:"city"`
}

You can parse/format XML documents looking like

<person name="Joe Blogs" age="24">
  <home>
    <street>Some where</street><city>London</city>
  </home>
  <office>
    <street>Some where else</street><city>London</city>
  </office>
</person>
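
To make that concrete, here is a small self-contained program that parses such a document with the structs from above and formats it back out. The parsePerson helper and the doc constant are only scaffolding for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Person struct {
	XMLName xml.Name `xml:"person"`
	Name    string   `xml:"name,attr"`
	Age     string   `xml:"age,attr"`
	Home    *Address `xml:"home"`
	Office  *Address `xml:"office"`
}

type Address struct {
	Street string `xml:"street"`
	City   string `xml:"city"`
}

// doc is a sample document matching the structs above.
const doc = `<person name="Joe Blogs" age="24">
  <home><street>Some where</street><city>London</city></home>
  <office><street>Some where else</street><city>London</city></office>
</person>`

// parsePerson fills a Person from an XML document.
func parsePerson(data string) (*Person, error) {
	var p Person
	if err := xml.Unmarshal([]byte(data), &p); err != nil {
		return nil, err
	}
	return &p, nil
}

func main() {
	p, err := parsePerson(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Name, "lives at", p.Home.Street, "and works in", p.Office.City)

	// Formatting goes through the same struct tags in the other direction.
	out, err := xml.MarshalIndent(p, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```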

Other programming languages I’ve used required a great deal more work when dealing with XML. For parsing, there’s typically a choice between an XML stream based parser where you have to react to tokens as they’re parsed and stuff them into structs, or a DOM object hierarchy from which you then have to pull data out into your structs. For outputting XML, apps either build up a DOM object hierarchy again, or dynamically format the XML document incrementally. Whichever approach is taken, it generally involves writing a lot of tedious & error prone boilerplate code. In most cases, the Go encoding/xml module eliminates all the boilerplate code, only requiring the data type definitions. This really makes dealing with XML a much more enjoyable experience, because you effectively don’t deal with XML at all!

There are some exceptions to this though, as the simple annotations can’t capture every nuance of many XML documents. For example, integer values are always parsed & formatted in base 10, so extra work is needed for base 16. There’s also no concept of unions in Go, or the XML annotations. In these edge cases custom marshalling / unmarshalling methods need to be written. BTW, this approach to XML is also taken for other serialization formats including JSON and YAML, with one struct field able to have many annotations so it can be serialized to a range of formats.
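
As an illustration of the base 16 case, here is a minimal sketch of a custom attribute type built on the encoding/xml UnmarshalerAttr and MarshalerAttr interfaces. The HexUint and PCIAddress types and the two helper functions are hypothetical names for this example, not the ones libvirt-go-xml uses.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strconv"
	"strings"
)

// HexUint is a hypothetical integer type written as "0x..." in attributes.
type HexUint uint64

// UnmarshalXMLAttr implements xml.UnmarshalerAttr and parses base 16.
func (h *HexUint) UnmarshalXMLAttr(attr xml.Attr) error {
	v, err := strconv.ParseUint(strings.TrimPrefix(attr.Value, "0x"), 16, 64)
	if err != nil {
		return err
	}
	*h = HexUint(v)
	return nil
}

// MarshalXMLAttr implements xml.MarshalerAttr and formats base 16.
func (h HexUint) MarshalXMLAttr(name xml.Name) (xml.Attr, error) {
	return xml.Attr{Name: name, Value: "0x" + strconv.FormatUint(uint64(h), 16)}, nil
}

// PCIAddress is an illustrative element with one hexadecimal attribute.
type PCIAddress struct {
	XMLName xml.Name `xml:"address"`
	Slot    HexUint  `xml:"slot,attr"`
}

// parseSlot extracts the slot number from an <address> element.
func parseSlot(doc string) (uint64, error) {
	var a PCIAddress
	if err := xml.Unmarshal([]byte(doc), &a); err != nil {
		return 0, err
	}
	return uint64(a.Slot), nil
}

// formatSlot produces an <address> element for the given slot number.
func formatSlot(slot uint64) (string, error) {
	out, err := xml.Marshal(PCIAddress{Slot: HexUint(slot)})
	return string(out), err
}

func main() {
	slot, err := parseSlot(`<address slot="0x1f"/>`)
	if err != nil {
		panic(err)
	}
	fmt.Println("slot as decimal:", slot)

	out, err := formatSlot(slot)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The struct that uses the field stays a plain annotated struct; only the scalar type carries the extra logic.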

Back to the point of the blog post, when I first started writing Go code using libvirt it was immediately obvious that everyone using libvirt from Go would end up re-inventing the wheel for XML handling. Thus about 1 year ago, I created the libvirt-go-xml project whose goal is to define a set of structs that can handle documents in every libvirt public XML schema. Initially the level of coverage was fairly light, and over the past year 18 different contributors have sent patches to expand the XML coverage in areas that their respective applications touched. It was clear, however, that taking an incremental approach would mean that libvirt-go-xml is forever trailing what libvirt itself supports. It needed an aggressive push to achieve 100% coverage of the XML schemas, or as near as practically identifiable.

Alongside each set of structs we had also been writing unit tests with a set of structs populated with data, and a corresponding expected XML document. The idea for writing the tests was that the author would copy a snippet of XML from a known good source, and then populate the structs that would generate this XML. In retrospect this was not a scalable approach, because there is an enormous range of XML documents that libvirt supports. A further complexity is that Go doesn’t generate XML documents in the exact same manner. For example, it never generates self-closing tags, instead always outputting a full opening & closing pair. This is semantically equivalent, but makes a plain string comparison of two XML documents impractical in the general case.
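
The self-closing tag difference is easy to reproduce. In this small sketch the Features struct and the roundTrip helper are made up for the example; the input uses a self-closing element and the output does not.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Features is an illustrative struct with one empty child element.
type Features struct {
	XMLName xml.Name  `xml:"features"`
	ACPI    *struct{} `xml:"acpi"`
}

// roundTrip parses a document into Features and formats it again.
func roundTrip(in string) (string, error) {
	var f Features
	if err := xml.Unmarshal([]byte(in), &f); err != nil {
		return "", err
	}
	out, err := xml.Marshal(&f)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	in := `<features><acpi/></features>`
	out, err := roundTrip(in)
	if err != nil {
		panic(err)
	}
	fmt.Println("input: ", in)
	fmt.Println("output:", out)
	fmt.Println("identical strings:", in == out)
}
```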

Considering the need to expand the XML coverage, and provide a more scalable testing approach, I decided to change approach. The libvirt.git tests/ directory currently contains 2739 XML documents that are used to validate libvirt’s own native XML parsing & formatting code. There is no better data set to use for validating the libvirt-go-xml coverage than this. Thus I decided to apply a round-trip testing methodology. The libvirt-go-xml code would be used to parse each sample XML document from libvirt.git, and then immediately serialize it back into a new XML document. Both the original and new XML documents would then be parsed generically to form a DOM hierarchy which can be compared for equivalence. Any place where documents differ would cause the test to fail and print details of where the problem is. For example:

$ go test -tags xmlroundtrip
--- FAIL: TestRoundTrip (1.01s)
	xml_test.go:384: testdata/libvirt/tests/vircaps2xmldata/vircaps-aarch64-basic.xml: \
            /capabilities[0]/host[0]/topology[0]/cells[0]/cell[0]/pages[0]: \
            element in expected XML missing in actual XML

This shows the filename that failed to correctly roundtrip, and the position within the XML tree that didn’t match. Here the NUMA cell topology has a <pages> element expected but not present in the newly generated XML. Now it was simply a matter of running the roundtrip test over & over & over & over & over & over & over……….& over & over & over, adding structs / fields for each omission that the test identified.
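
The generic comparison step can be sketched roughly as follows. This is not the actual test code from libvirt-go-xml: the Node type and the parse, diff and Compare functions are hypothetical, and the comparison is deliberately simple (children are matched by position).

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
)

// Node is a minimal generic DOM element.
type Node struct {
	Name     string
	Attrs    map[string]string
	Text     string
	Children []*Node
}

// parse builds a Node tree from an XML document using the token stream.
func parse(doc string) (*Node, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	root := &Node{}
	stack := []*Node{root}
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		top := stack[len(stack)-1]
		switch t := tok.(type) {
		case xml.StartElement:
			n := &Node{Name: t.Name.Local, Attrs: map[string]string{}}
			for _, a := range t.Attr {
				n.Attrs[a.Name.Local] = a.Value
			}
			top.Children = append(top.Children, n)
			stack = append(stack, n)
		case xml.EndElement:
			stack = stack[:len(stack)-1]
		case xml.CharData:
			top.Text += strings.TrimSpace(string(t))
		}
	}
	if len(root.Children) != 1 {
		return nil, errors.New("expected exactly one root element")
	}
	return root.Children[0], nil
}

// diff returns a description of the first difference, or "" if equivalent.
func diff(path string, want, got *Node) string {
	if want.Name != got.Name {
		return path + ": element name differs"
	}
	if want.Text != got.Text {
		return path + ": text differs"
	}
	if len(want.Attrs) != len(got.Attrs) {
		return path + ": attribute count differs"
	}
	for k, v := range want.Attrs {
		if gv, ok := got.Attrs[k]; !ok || gv != v {
			return path + ": attribute " + k + " differs"
		}
	}
	seen := map[string]int{}
	for i, wc := range want.Children {
		childPath := fmt.Sprintf("%s/%s[%d]", path, wc.Name, seen[wc.Name])
		seen[wc.Name]++
		if i >= len(got.Children) {
			return childPath + ": element in expected XML missing in actual XML"
		}
		if d := diff(childPath, wc, got.Children[i]); d != "" {
			return d
		}
	}
	if len(got.Children) > len(want.Children) {
		return path + ": extra element in actual XML"
	}
	return ""
}

// Compare parses both documents and reports the first difference found.
func Compare(expected, actual string) (string, error) {
	want, err := parse(expected)
	if err != nil {
		return "", err
	}
	got, err := parse(actual)
	if err != nil {
		return "", err
	}
	return diff("/"+want.Name+"[0]", want, got), nil
}

func main() {
	d, err := Compare(`<cell id="0"><pages/></cell>`, `<cell id="0"></cell>`)
	if err != nil {
		panic(err)
	}
	fmt.Println(d)
}
```

Because both sides go through the same generic parser, a self-closing tag and an explicit open/close pair compare as equal, which is exactly what a string comparison could not offer.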

After doing this for some time, libvirt-go-xml now has 586 structs defined containing 1816 fields, and has certified 100% coverage of all libvirt public XML schemas. Of course when I say 100% coverage, this is probably a lie, as I’m blindly assuming that the libvirt.git test suite has 100% coverage of all its own XML schemas. This is certainly a goal, but I’m confident there are cases where libvirt itself is missing test coverage. So if any omissions are identified in libvirt-go-xml, these are likely omissions in libvirt’s own testing.

On top of this, the XML roundtrip test is set to run in the libvirt jenkins and travis CI systems, so as libvirt extends its XML schemas, we’ll get build failures in libvirt-go-xml and thus know to add support there to keep up.

In expanding the coverage of XML schemas, a number of non-trivial changes were made to existing structs defined by libvirt-go-xml. These were mostly in places where we have to handle a union concept defined by libvirt. Typically with libvirt an element will have a “type” attribute, whose value then determines what child elements are permitted. Previously we had been defining a single struct, whose fields represented all possible children across all the permitted type values. This did not scale well and gave the developer no clue what content is valid for each type value. In the new approach, for each distinct type attribute value, we now define a distinct Go struct to hold the contents. This will cause API breakage for apps already using libvirt-go-xml, but on balance it is worth it to get a better structure over the long term.

There were also cases where a child XML element previously represented a single value and this was mapped to a scalar struct field. Libvirt then added one or more attributes on this element, meaning the scalar struct field had to turn into a struct field that points to another struct. These kinds of changes are unavoidable in any nice manner, so while we endeavour not to gratuitously change current structs, if the libvirt XML schema gains new content, it might trigger further changes in the libvirt-go-xml structs that are not 100% backwards compatible.
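
To give a flavour of the “one struct per type value” idea, here is a rough sketch with a custom UnmarshalXML method. The Disk, SourceFile, SourceBlock and rawDisk types are simplified stand-ins invented for this example; the real libvirt-go-xml structs are larger and differ in detail.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// One distinct struct per value of the "type" attribute.
type SourceFile struct{ Path string }
type SourceBlock struct{ Dev string }

// Disk holds exactly one of the variants; the others stay nil.
type Disk struct {
	File  *SourceFile
	Block *SourceBlock
}

// rawDisk is the flat wire representation used only while decoding.
type rawDisk struct {
	Type   string `xml:"type,attr"`
	Source struct {
		File string `xml:"file,attr"`
		Dev  string `xml:"dev,attr"`
	} `xml:"source"`
}

// UnmarshalXML picks the variant struct based on the type attribute.
func (d *Disk) UnmarshalXML(dec *xml.Decoder, start xml.StartElement) error {
	var raw rawDisk
	if err := dec.DecodeElement(&raw, &start); err != nil {
		return err
	}
	switch raw.Type {
	case "file":
		d.File = &SourceFile{Path: raw.Source.File}
	case "block":
		d.Block = &SourceBlock{Dev: raw.Source.Dev}
	default:
		return fmt.Errorf("unsupported disk type %q", raw.Type)
	}
	return nil
}

// parseDisk decodes a <disk> element into the union-style struct.
func parseDisk(doc string) (*Disk, error) {
	var d Disk
	if err := xml.Unmarshal([]byte(doc), &d); err != nil {
		return nil, err
	}
	return &d, nil
}

func main() {
	d, err := parseDisk(`<disk type="file"><source file="/var/lib/a.img"/></disk>`)
	if err != nil {
		panic(err)
	}
	fmt.Println("file-backed:", d.File != nil, "block-backed:", d.Block != nil)
}
```

With this shape, a developer looking at SourceFile sees only the fields that are valid for type="file", instead of one struct holding every possible child.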

Since we are now tracking libvirt.git XML schemas, going forward we’ll probably add tags in the libvirt-go-xml repo that correspond to each libvirt release. So for app developers we’ll encourage use of Go vendoring to pull in a precise version of libvirt-go-xml instead of blindly tracking master all the time.

by Daniel Berrange at December 07, 2017 02:14 PM

December 02, 2017

Stefan Hajnoczi

My favorite software engineering books

The programming books that I find most interesting are neither about computer science theory nor the latest technology fads. Instead they are about the thought process behind building software and the best practices for doing so.

Here are some of my favorite books on software engineering topics:

The Practice of Programming

The Practice of Programming is a great round-trip through the struggle of writing programs. It covers many aspects that come together when writing software, like design, algorithms, coding style, and testing. Especially useful early on in your programming journey as an overview of challenges that you'll face.

Programming Pearls

Programming Pearls is a collection of essays by Jon Bentley from Communications of the Association for Computing Machinery. There are "Aha!" moments throughout the essays as they discuss how to analyze problems and come up with good solutions.

Bonus link: if you enjoy the problem solving in this book, then check out Hacker's Delight for clever problem solving and optimizations using bit-twiddling.

The Pragmatic Programmer

The Pragmatic Programmer covers the mindset of systematic and mindful software development. It goes beyond best practices and explains key qualities and trade-offs in programming. Thinking about these issues allows you to customize your approach to software development and produce better programs.

Code Complete

Code Complete is a survey of programming best practices. It draws on research evidence on code quality and provides guidelines on coding style. A great book if you're thinking about how to improve the quality, clarity, and maintainability of your programs.

Applying UML and Patterns

Applying UML and Patterns helped me learn to break down requirements and come up with software designs. This is a great book to help you get past the stage where programs you write from scratch are unmaintainable spaghetti code. I don't know how well this book has aged and would probably ignore the details of UML and Use Cases but the essence remains valuable.

Bonus link: if this book helps you design programs from scratch, then Refactoring will help you recognize that software is "soft" and can be changed substantially in a safe way by following a disciplined approach.

Writing Secure Code

Writing Secure Code discusses software security and the various classes of bugs that lead to security holes. Security is essential for writing code because the majority of programs have a security boundary where they process untrusted inputs. It's important to have a background in security as well as language and technology specifics of secure coding.

Producing Open Source Software

Producing Open Source Software explains how open source projects and communities work. It covers topics like licenses, project governance, source control, code review, and more. If you are getting into contributing to open source or considering running an open source project then this book will prepare you.


These books all contributed to how I think about software development. Let me know which practical programming books you like in the comments!

by stefanha at December 02, 2017 01:05 PM

December 01, 2017

Daniel Berrange

Full colour emojis in virtual machine names in Fedora 27

Quite by chance today I discovered that Fedora 27 can display full colour glyphs for unicode characters that correspond to emojis, when the terminal displaying my mutt mail reader displayed someone’s name with a full colour glyph showing stars:

Mutt in GNOME terminal rendering color emojis in sender name

Chatting with David Gilbert on IRC I learnt that this is a new feature in Fedora 27 GNOME, thanks to recent work in the GTK/Pango stack. David then pointed out this works in libvirt, so I thought I would illustrate it.

Virtual machine name with full colour emojis rendered

No special hacks were required to do this; I simply entered the emojis as the virtual machine name when creating it from virt-manager’s wizard.

Virtual machine name with full colour emojis rendered

As mentioned previously, GNOME terminal displays colour emojis, so these virtual machine names appear nicely when using virsh and other command line tools.

Virtual machine name rendered with full colour emojis in terminal commands

The more observant readers will notice that the command line args have a bug as the snowman in the machine name is incorrectly rendered in the process listing. The actual data in /proc/$PID/cmdline is correct, so something about the “ps” command appears to be mangling it prior to output. It isn’t simply a font problem because other commands besides “ps” render properly, and if you grep the “ps” output for the snowman emoji no results are displayed.

by Daniel Berrange at December 01, 2017 01:28 PM

November 22, 2017

QEMU project

Accelerating QEMU on Windows with HAXM

In this post, I’m going to introduce a useful technique to people who are using, or are interested in using, QEMU on Windows. Basically, you can make the most of your hardware to accelerate QEMU virtual machines on Windows: starting with its 2.9.0 release, QEMU is able to take advantage of Intel HAXM to run x86 and x86_64 VMs with hardware acceleration.

If you have used QEMU on Linux, you have probably enjoyed the performance boost brought by KVM: the same VM runs a lot faster when you launch QEMU with the -accel kvm (or -enable-kvm) option, thanks to hardware-assisted virtualization. On Windows, you can achieve a similar speed-up with -accel hax (or -enable-hax), after completing a one-time setup process.

First, make sure your host system meets the requirements of HAXM:

  1. An Intel CPU that supports Intel VT-x with Extended Page Tables (EPT).
    • Intel CPUs that do not support the said feature are almost extinct now. If you have a Core i3/i5/i7, you should be good to go.
  2. Windows 7 or later.
    • HAXM works on both 32-bit and 64-bit versions of Windows. For the rest of this tutorial, I’ll assume you are running 64-bit Windows, which is far more popular than 32-bit nowadays.

Next, check your BIOS (or UEFI boot firmware) settings, and make sure VT-x (or Virtualization Technology, depending on your BIOS) is enabled. If there is also a setting for Execute Disable Bit, make sure that one is enabled as well. In most cases, both settings are enabled by default.

  • If your system is protected against changes to BIOS, e.g. you have enabled BitLocker Drive Encryption or any other tamper protection mechanism, you may need to take preventive measures to avoid being locked out after changing the said BIOS settings.

After that, if you are on Windows 8 or later, make sure Hyper-V is disabled. This is especially important for Windows 10, which enables Hyper-V by default. The reason is that Hyper-V makes exclusive use of VT-x, preventing HAXM and other third-party hypervisors (such as VMware and VirtualBox) from seeing that hardware feature. There are a number of ways to disable Hyper-V; one of them is to bring up the Start menu, type Windows Features and press Enter, uncheck Hyper-V in the resulting dialog, and click on OK to confirm.

  • Note that changing the Hyper-V setting could also trigger the alarm of the tamper protection mechanism (e.g. BitLocker) that may be enabled on your system. Again, make sure you won’t be locked out after the reboot.

Disabling Hyper-V in Windows Features

Now you’re ready to install HAXM, which needs to run as a kernel-mode driver on Windows so as to execute the privileged VT-x instructions. Simply download the latest HAXM release for Windows here, unzip, and run intelhaxm-android.exe to launch the installation wizard. (Despite the file name, Android is not the only guest OS that can be accelerated by HAXM.)

Installing HAXM on Windows

If you haven’t installed QEMU, now is the time to do it. I recommend getting the latest stable release from here. At the time of this writing, the latest stable release is 2.10.1 (build 20171006), so I downloaded qemu-w64-setup-20171006.exe, which is an easy-to-use installer.

With all that, we’re ready to launch a real VM in QEMU. You can use your favorite QEMU disk image, provided that the guest OS installed there is compatible with the x86 (i386) or x86_64 (amd64) architecture. My choice for this tutorial is debian_wheezy_amd64_standard.qcow2, which contains a fresh installation of the standard Debian Wheezy system for x86_64, available here. To boot it, open a new command prompt window, switch to your QEMU installation directory (e.g. cd "C:\Program Files\qemu"), and run:

qemu-system-x86_64.exe -hda X:\path\to\debian_wheezy_amd64_standard.qcow2 -accel hax

You don’t have to leave the screen as the VM boots up, because soon you’ll be able to see the Debian shell and log in.

Debian Wheezy (Standard) booted up in QEMU+HAXM

To feel the difference made by HAXM acceleration, shut down the VM, and relaunch it without -accel hax:

qemu-system-x86_64.exe -hda X:\path\to\debian_wheezy_amd64_standard.qcow2

If you’re still not impressed, try a more sophisticated VM image such as debian_wheezy_amd64_desktop.qcow2, which boots to a desktop environment. VMs like this are hardly usable without hardware acceleration.

That’s it! I hope HAXM gives you a more enjoyable QEMU experience on Windows. You may run into issues at some point, because there are, inevitably, bugs in HAXM (e.g. booting an ISO image from CD-ROM doesn’t work at the moment). But the good news is that HAXM is now open source on GitHub, so everyone can help improve it. Please create an issue on GitHub if you have a question, bug report or feature request.

by Yu Ning at November 22, 2017 07:00 AM

November 20, 2017

Richard Jones

libguestfs for RHEL 7.5 preview

As usual I’ve placed the proposed RHEL 7.5 libguestfs packages in a public repository so you can try them out.

Thanks to Pino Toscano for doing the packaging work.

by rich at November 20, 2017 09:56 AM

November 16, 2017

Alberto Garcia

“Improving the performance of the qcow2 format” at KVM Forum 2017

I was in Prague last month for the 2017 edition of the KVM Forum. There I gave a talk about some of the work that I’ve been doing this year to improve the qcow2 file format used by QEMU for storing disk images. The focus of my work is to make qcow2 faster and to reduce its memory requirements.

The video of the talk is now available and you can get the slides here.

The KVM Forum was co-located with the Open Source Summit and the Embedded Linux Conference Europe. Igalia sponsored both events once again, and I was there together with some of my colleagues. Juanjo Sánchez gave a talk about WPE, the WebKit port for embedded platforms that we released.

The video of his talk is also available.

by berto at November 16, 2017 10:16 AM

November 15, 2017

Stefan Hajnoczi

Video and slides available for "Applying Polling Techniques to QEMU: Reducing virtio-blk I/O Latency"

At KVM Forum 2017 I gave a talk about the AioContext polling optimization that was merged in QEMU 2.9. It reduces latency for virtio-blk and virtio-scsi devices with the iothread= option on high IOPS devices like recent NVMe PCIe SSD drives. It increases performance for latency-sensitive workloads and has been designed to avoid interfering with workloads that do not benefit from polling thanks to a self-tuning algorithm.

The video of the talk is now available:

The slides are available here (PDF).

by stefanha at November 15, 2017 08:50 PM

November 13, 2017

Stefan Hajnoczi

Common disk benchmarking mistakes

Collecting benchmark results is the first step to solving disk I/O performance problems. Unfortunately, many bug reports and performance investigations fall down at the first step because bogus benchmark data is collected. This post explains common mistakes when running disk I/O benchmarks.

Disk I/O patterns

Skip this section if you are already familiar with these terms. Before we begin, it is important to understand the different I/O patterns and how they are used in benchmarking.

Sequential vs random I/O is the access pattern in which data is read or written. Sequential I/O is in-order data access commonly found in workloads like streaming multimedia or writing log files. Random I/O is access of non-adjacent data commonly found when accessing many small files or on systems running multiple applications that access the disk at the same time. It is easy to prefetch sequential I/O so both disk read caches and operating system page caches may keep the next piece of data ready even before it is accessed. Random I/O does not offer opportunities for prefetching and is therefore a harder access pattern to optimize.

Block or request size is the amount of data transferred by a single access. Small request sizes are 512B through 4 KB, large request sizes are 64 KB through 128 KB, while very large request sizes could be 1 MB (although the maximum allowed request size ultimately depends on the hardware). Fewer requests are needed to transfer the same amount of data when the request size is larger. Therefore, throughput is usually higher at larger request sizes because less per-request overhead is incurred for the same amount of data.
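
To make the per-request overhead argument concrete, here is a small Python sketch. It counts how many requests are needed to move 1 GiB at different request sizes and feeds them into a toy model in which every request pays a fixed overhead on top of the transfer time. The overhead (20 microseconds) and the raw bandwidth (500 MiB/s) are made-up numbers chosen for illustration, not measurements of any real device.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def requests_needed(total_bytes, request_size):
    """Number of requests required to transfer total_bytes (rounded up)."""
    return -(-total_bytes // request_size)


def estimated_throughput(request_size, overhead_us=20, bandwidth_mib_s=500):
    """Toy model: each request pays a fixed overhead plus its transfer time.

    overhead_us and bandwidth_mib_s are illustrative assumptions.
    Returns the estimated throughput in MiB/s.
    """
    transfer_us = request_size / (bandwidth_mib_s * MIB) * 1_000_000
    total_us = overhead_us + transfer_us
    return request_size / MIB / (total_us / 1_000_000)


for size in (4 * KIB, 128 * KIB, MIB):
    print(f"{size // KIB:5d} KiB: {requests_needed(GIB, size):7d} requests, "
          f"~{estimated_throughput(size):.0f} MiB/s")
```

In this model the fixed overhead dominates at 4 KiB and becomes almost invisible at 1 MiB, which is the effect described above. Real devices behave less neatly, so treat it as intuition rather than prediction.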

Read vs write is the request type that determines whether data is transferred to or from the storage medium. Reads can be completed cheaply if data is already in the disk read cache and, failing that, the access time depends on the storage medium. Traditional spinning disks have significant average seek times in the range of 4-15 milliseconds, depending on the drive, when the head is not positioned in the read location, while solid-state storage devices might just take on the order of 10 microseconds. Writes can be completed cheaply by leaving data in the disk write cache unless the cache is full or the cache is disabled.

Queue depth is the number of in-flight I/O requests at a given time. Latency-sensitive workloads submit one request and wait for it to complete before submitting the next request. This is queue depth 1. Parallel workloads submit many requests without waiting for earlier requests to complete first. The maximum queue depth depends on the hardware with 64 being a common number. Maximum throughput is usually achieved when queue depth is fairly high because the disk can keep busy without waiting for the next request to be submitted and it may optimize the order in which requests are processed.
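
The relationship between queue depth, latency, and throughput can be approximated with Little's law. The sketch below assumes that the mean request latency stays constant as the queue depth grows, which real disks do not guarantee, so it gives an optimistic upper bound rather than a prediction. The 100 microsecond latency is an illustrative value.

```python
def iops(queue_depth, latency_us):
    """Requests per second sustained at a given queue depth and mean latency."""
    return queue_depth * 1_000_000 / latency_us


def throughput_mib_s(queue_depth, latency_us, request_size):
    """Throughput in MiB/s for fixed-size requests under the same assumption."""
    return iops(queue_depth, latency_us) * request_size / (1024 * 1024)


for depth in (1, 8, 64):
    print(f"queue depth {depth:2d}: {iops(depth, 100):9.0f} IOPS, "
          f"{throughput_mib_s(depth, 100, 4096):7.1f} MiB/s")
```

At queue depth 1 the only way to get more requests per second is lower latency, which is why latency-sensitive workloads are benchmarked separately from parallel ones.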

Random reads are a good way to force storage medium access and minimize cache hit rates. Sequential reads are a good way to maximize cache hit rates. Which I/O pattern is appropriate depends on your goals.

Real-life workloads are usually a mixture of sequential vs random, block sizes, reads vs writes, and the queue depth may vary over time. It is simplest to benchmark a specific I/O pattern in isolation but benchmark tools can also be configured to produce mixed I/O patterns like 70% reads/30% writes. The goal when configuring a benchmark is to produce the I/O pattern that is critical for real-life workload performance.

1. Use a real benchmarking tool

It is often tempting to use file utilities instead of real benchmarking tools because file utilities report I/O throughput like real benchmarking tools and time taken can be easily measured. Therefore it might seem like there is no need to install a real benchmarking tool when file utilities are already available on every system.

Do not use cp(1), scp(1), or even dd(1). Instead, use a real benchmark like fio(1).

What's the difference? Real benchmarking tools can be configured to produce specific I/O patterns, like 4 KB random reads with queue depth 8, whereas file utilities offer limited or no ability to choose the I/O pattern. Since disk performance varies depending on the I/O pattern, it is hard to understand or compare results between systems without full control over the I/O pattern.

The second reason why real benchmarking tools are necessary is that file utilities are not designed to exercise the disk; they are designed to manipulate files. This means file utilities spend time doing things that do not involve disk I/O and therefore produce misleading performance results. The most important example of this is that file utilities use the operating system's page cache and this can result in no disk I/O activity at all!

2. Bypass the page cache

One of the most common mistakes is forgetting to bypass the operating system's page cache. Files and block devices opened with the O_DIRECT flag perform I/O to the disk without going through the page cache. This is the best way to guarantee that the disk actually gets I/O requests. Files opened without this flag are in "buffered I/O" mode and that means I/O may be fulfilled entirely within the page cache in RAM without any disk I/O activity. If the goal is to benchmark disk performance then the page cache needs to be eliminated.

fio(1) jobs must use the direct=1 parameter to exercise the disk.

It is not sufficient to echo 3 > /proc/sys/vm/drop_caches before running the benchmark instead of using O_DIRECT. Although this command is often used to make non-disk benchmarks produce more consistent results between runs, it does not guarantee that the disk will actually receive I/O requests. In addition, the page cache interferes with the desired benchmark I/O pattern since page cache prefetch and writeback will alter the actual I/O pattern that the disk sees.
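
For readers who have not used O_DIRECT before, here is a minimal Python sketch of what a direct read looks like at the system call level. It is Linux-only and assumes a 4096-byte block size, which you should check against your device. O_DIRECT requires the buffer, offset, and length to be suitably aligned, so the buffer comes from an anonymous mmap (which is page-aligned). This is only meant to show the mechanism; fio(1) with direct=1 remains the right tool for benchmarking.

```python
import mmap
import os

BLOCK_SIZE = 4096  # assumed logical block size; check your device


def read_block_direct(path, offset=0, size=BLOCK_SIZE):
    """Read size bytes at offset, bypassing the page cache (Linux only)."""
    if offset % BLOCK_SIZE or size % BLOCK_SIZE:
        raise ValueError("offset and size must be multiples of the block size")
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, size)  # anonymous mappings are page-aligned
        try:
            nread = os.preadv(fd, [buf], offset)
            return buf[:nread]
        finally:
            buf.close()
    finally:
        os.close(fd)
```

Calling read_block_direct("/path/to/disk") returns the first block of the device. Some file systems (tmpfs, for example) reject O_DIRECT, in which case os.open() fails with an OSError.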

3. Bypass file systems and device mapper

fio(1) can do both file I/O and disk I/O benchmarking, so it's often mistakenly used in file I/O mode instead of disk I/O mode. When benchmarking disk performance it is best to eliminate file systems and device mapper targets to isolate raw disk I/O performance. File systems and device mapper targets may have their own internal bottlenecks, such as software locks, that are unrelated to disk performance. File systems and device mapper targets are also likely to modify the I/O pattern because they submit their own metadata I/O.

fio(1) jobs must use the filename=/path/to/disk parameter to do disk I/O benchmarking.

Without a block device filename parameter, the benchmark would create regular files on whatever file system is in use. Remember to double- and triple-check the block device filename before running benchmarks that write to the disk to avoid accidentally overwriting important data like the system root disk!

Example benchmark configurations

Here are a few example fio(1) jobs that you can use as a starting point.

High-throughput parallel reads

This job is a read-heavy workload with lots of parallelism that is likely to show off the device's best throughput:

ramp_time=10 # start measuring after warm-up time

offset_increment=128m # each job starts at a different offset

Latency-sensitive random reads

This job is a latency-sensitive workload that stresses per-request overhead and seek times:

ramp_time=10 # start measuring after warm-up time


Mixed workload

This job simulates a more real-life workload with an I/O pattern that contains both reads and writes:

ramp_time=10 # start measuring after warm-up time
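
Only fragments of the three job files are shown above, so here is a hedged sketch of what complete jobs along those lines could look like. It is a small Python helper that prints fio job sections; the option names (filename, direct, ioengine, rw, bs, iodepth, numjobs, offset_increment, rwmixread, ramp_time, runtime, time_based, group_reporting) are regular fio(1) options, but every value except ramp_time=10 and offset_increment=128m is my own assumption and not taken from the original jobs.

```python
def render_fio_job(name, **options):
    """Render one fio job section in fio's INI-style job file format."""
    lines = [f"[{name}]"]
    for key, value in options.items():
        # fio accepts boolean options without a value, e.g. "time_based"
        lines.append(key if value is True else f"{key}={value}")
    return "\n".join(lines) + "\n"


DISK = "/path/to/disk"  # placeholder: double-check before writing to it!

# Settings shared by all three jobs (values are assumptions).
COMMON = dict(filename=DISK, direct=1, ioengine="libaio",
              ramp_time=10, runtime=60, time_based=True)

JOBS = {
    "parallel-reads": dict(COMMON, rw="read", bs="128k", iodepth=64,
                           numjobs=4, offset_increment="128m",
                           group_reporting=True),
    "latency-randread": dict(COMMON, rw="randread", bs="4k", iodepth=1),
    "mixed": dict(COMMON, rw="randrw", rwmixread=70, bs="4k", iodepth=8),
}

for name, options in JOBS.items():
    print(render_fio_job(name, **options))
```

Save one printed section to a file and pass it to fio(1) to run it. The mixed job writes to the disk, so it destroys whatever data is stored there.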



There are several common issues with disk benchmarking that can lead to useless results. Using a real benchmarking tool and bypassing the page cache and file system are the basic requirements for useful disk benchmark results. If you have questions or suggestions about disk benchmarking, feel free to post a comment.

by stefanha at November 13, 2017 09:02 PM

November 09, 2017

Cornelia Huck

Notes from KVM Forum 2017

KVM Forum 2017 took place in Prague Oct 25 - 27 and I had the pleasure of attending. Let me share some of my notes and observations (not exhaustive in any way).

General notes

KVM Forum this year was quite large, but with enough space to sit down and talk to others (or do some hacking). As always, the hallway track was great to meet some people (both old acquaintances and folks you never met in real life before) and to discuss things face-to-face that would take more time done via the mailing lists or IRC.

The first day featured a single track (shared with OSS Europe), and also the invitation-only QEMU summit in the morning (minutes will be posted to qemu-devel). Days two and three were dual-track except for the first sessions. Obviously, this means I was only able to see a subset of sessions (and one also needs a break sometimes...); fortunately, videos are slowly making their way onto YouTube (check here for updates).


Christian Bornträger presented the KVM status, Paolo Bonzini the QEMU status and Peter Krempa the libvirt status. We seem to have a healthy development community, and interesting new topics still come up.


I listened with interest to Christoffer Dall's talks about KVM on ARM: Reducing hypervisor overhead, and nested virtualization. Seeing the virtualization architecture on ARM makes me really glad to work on s390, and it makes the efforts of the ARM folks all the more impressive (writing code for an architecture revision for which no hardware yet exists... oh dear).


Virtio is currently moving towards the 1.1 revision of the standard, with one of the biggest changes a new ring layout. Jens Freimann (who took over last minute from Michael S. Tsirkin, who unfortunately could not make it) presented about this ongoing work, and also gave some tips on how to get changes included in the standard (let me point to the OASIS virtio TC here). There was also a presentation about virtio-crypto, which I unfortunately was not able to attend.

VFIO (and the mediated device framework) continues to be a topic of interest. There were talks about enabling migration, buffer sharing, and adding support for a new platform bus. On a related note, we (the s390 maintainers) were able to sit down with Alex Williamson and Daniel Berrange to discuss the vfio-ap proposal for s390 crypto cards; this approach has a good chance of being workable.

Hannes Reinecke and Paolo Bonzini presented on the challenges of virtualizing SCSI Fibre Channel (NPIV): a good overview of what the challenges are and how we might be able to solve them.

Also of note: Improving virtio-blk performance with polling, introducing a paravirtualized RDMA device, and vhost-user-scsi and SPDK.


I'm currently trying to learn more about tcg and used the opportunity to attend two talks.

Alessandro Di Federico started his talk with a good general introduction to tcg. His work to split out a libtcg is interesting, but it might be done at the wrong level for usage with QEMU (or so I understood from the discussion.)

I also enjoyed Alex Bennée's talk about handling vectors in tcg (complete with an historical overview of vector instructions).


David Gilbert presented on the various reasons a migration might fail. Most migrations don't fail, so people tend to do it more and more in an automated way. If you have the misfortune of having a migration actually fail on you, his talk gives a lot of pointers on how to find out why it happened (and, hopefully, how to avoid the problem in the future).

Markus Armbruster gave an overview of the current QEMU command line infrastructure and how to improve it via QAPIfication (we probably need to sacrifice some rubber chickens backward compatibility).

Also: KubeVirt, running large deployments via libvirt, and the effects of changed defaults on users.


KVM Forum 2017 was really interesting (if somewhat exhausting) for learning about new developments and discussing with people; if you are working in the area, I can only recommend trying to attend KVM Forum.

by Cornelia Huck at November 09, 2017 01:41 PM

October 24, 2017

Kashyap Chamarthy

Documentation of QEMU Block Device Operations

QEMU Block Layer currently (as of QEMU 2.10) supports four major kinds of live block device jobs – stream, commit, mirror, and backup. These can be used to manipulate disk image chains to accomplish certain tasks, e.g.: live copy data from backing files into overlays; shorten long disk image chains by merging data from overlays into backing files; live synchronize data from a disk image chain (including current active disk) to another target image; and point-in-time (and incremental) backups of a block device.

To that end, I have recently written documentation (thanks to the QEMU Block Layer maintainers & developers for the reviews) of the usage of the following commands:

  • block-stream
  • block-commit
  • drive-mirror (and blockdev-mirror)
  • drive-backup (and blockdev-backup)

For each of the above block device jobs, QMP (QEMU Machine Protocol) invocation examples are documented.

Here’s the source. And here’s the Sphinx-rendered HTML version.

This documentation can be handy in those (debugging) scenarios when it’s instructive to look at what is happening behind the scenes of QEMU. For example, live storage migration (without shared storage setup) is one of the most common use-cases that takes advantage of the QMP drive-mirror command and QEMU’s built-in Network Block Device (NBD) server. Here’s the QMP-level workflow for it — this is the flow libvirt internally implements (with some additional niceties).
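
To give a flavour of what such a QMP invocation looks like on the wire, here is a minimal Python sketch that builds the JSON for a drive-mirror command pointing at an NBD target. The device name, host, port, and job ID are hypothetical placeholders. It also assumes that the destination QEMU has already started its NBD server and exported the disk, and that the QMP session has completed the qmp_capabilities handshake.

```python
import json


def qmp_command(execute, **arguments):
    """Serialize one QMP command as a JSON line."""
    # QMP argument names use dashes, Python keyword arguments cannot
    args = {key.replace("_", "-"): value for key, value in arguments.items()}
    return json.dumps({"execute": execute, "arguments": args})


# Hypothetical device name, host, port and job ID; adjust to your setup.
mirror = qmp_command(
    "drive-mirror",
    device="virtio0",
    target="nbd:dest-host:49153:exportname=virtio0",
    sync="full",
    format="raw",
    mode="existing",
    job_id="job0",
)
print(mirror)
```

The resulting line can be sent over the QMP monitor socket. The documentation linked above covers the complete workflow, including the NBD server setup on the destination.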

by kashyapc at October 24, 2017 12:06 PM

October 04, 2017

QEMU project

QEMU stable version 2.10.1 released

We are pleased to announce that the QEMU v2.10.1 stable release is now available! You can grab the tarball from our download page.

Version 2.10.1 is now tagged in the official qemu.git repository (where you can also find the changelog with details), and the stable-2.10 branch has been updated accordingly.

Apart from the normal range of general fixes, this update contains security fixes addressing guest-induced crashing of host QEMU process (CVE-2017-13672, CVE-2017-13673) and possible code injection into host QEMU process via a crafted multiboot ELF kernel when specified directly via QEMU command-line option (CVE-2017-14167). Please update accordingly.

October 04, 2017 09:00 AM

September 27, 2017

Gerd Hoffmann

Running macOS as guest in kvm.

There are various approaches to run macOS as guest under kvm. One is to add apple-specific features to OVMF, as described by Gabriel L. Somlo. I’ve chosen to use the Clover EFI bootloader instead. Here is what my setup looks like.

What you need

First a bootable installer disk image. You can create a bootable usbstick using the createinstallmedia tool shipped with the installer. You can then dd the usb stick to a raw disk image.

Next a clover disk image. I’ve created a script which uses guestfish to generate a disk image from a clover iso image, with a custom config file. The advantage of having a separate disk only for clover is that you can easily update clover, downgrade clover and tweak the clover configuration without booting the virtual machine. So, if something goes wrong recovering is a lot easier.

Qemu. Version 2.10 (or newer) strongly recommended. macOS versions up to 10.12.3 work fine in qemu 2.9. macOS 10.12.4 requires fixes for the qemu applesmc emulation which got merged for qemu 2.10.

OVMF. Latest git works fine for me. Older OVMF versions trip over a recent qemu update and then provide broken ACPI tables to the OS, with the result that macOS doesn’t boot, even though ovmf itself shows no signs of trouble.

Configuring your virtual machine

Here are snippets of my libvirt config, with comments explaining the important things:

<domain type='kvm' xmlns:qemu=''>

The xmlns:qemu entry is needed for qemu-specific tweaks, that way we can ask libvirt to add extra arguments to the qemu command line.

    <type arch='x86_64' machine='pc-q35-2.9'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram template='/usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd'>/var/lib/libvirt/qemu/nvram/macos-test-org-base_VARS.fd</nvram>
    <bootmenu enable='yes'/>

Using the q35 machine type here, and the cutting edge edk2 builds from my firmware repo.

  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>Penryn</model>
    <feature policy='require' name='invtsc'/>

CPU. Penryn is known-good. The invtsc feature is needed because macOS uses the TSC for timekeeping. When asking to provide a fixed TSC frequency qemu will store the TSC frequency in a hypervisor cpuid leaf. And macOS will pick it up there. Without that macOS does a wild guess, likely gets it very wrong, and wall clock in your guest runs either way too fast or way too slow.

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
      <source dev='/dev/path/to/lvm/volume'/>
      <target dev='sda' bus='sata'/>

This is the system disk that macOS is installed on. It is attached as a sata disk to the q35 ahci controller.

You also need the installmedia image. On a real macintosh you’ll typically use a usbstick. macOS doesn’t care much where the installer is stored though, so you can attach the image as sata disk too and it’ll work fine.

Finally you need the clover disk image. edk2 alone isn’t able to boot from the system or installer disk, so it’ll start clover. clover in turn will load the hfs+ filesystem driver so the other disks can be accessed, offer a boot menu, and allow you to boot macOS.

    <interface type='network'>
      <source network='default'/>
      <model type='e1000-82545em'/>

Ethernet card. macOS has drivers for this model. It seems to have problems with link detection now and then. Setting the link status to down for a moment, then to up again (using virsh domif-setlink), gets the virtual machine online.

    <input type='tablet' bus='usb'/>
    <input type='keyboard' bus='usb'/>

USB tablet and keyboard, as input devices. The tablet allows operating the mouse without pointer grabs, which is much more convenient than using a virtual mouse.

      <model type='vga' vram='65536'/>

Qemu standard vga.

    <qemu:arg value='-readconfig'/>
    <qemu:arg value='/path/to/macintosh.cfg'/>

This is the extra configuration item for stuff not supported by libvirt. The macintosh.cfg file looks like this, adding the emulated smc device:

[device "smc"]
driver = "isa-applesmc"
osk = "<insert-real-osk-here>"

You can run Gabriel’s SmcDumpKey tool on a macintosh to figure out what the osk is.

Configuring clover

I’m using config.plist.stripped.qemu as starting point. Here are the most important settings:

Kernel command line. There are lots of options you can start the kernel with. You might want to try "-v" to start the kernel in verbose mode where it prints boot messages to the screen (like the linux kernel without "quiet"). For trouble-shooting, or to impress your friends.
Name of the volume clover should boot from by default. Put your system disk name here, otherwise clover will wait forever for you to pick a boot menu item.
As the Name says, the Display Resolution. Note that OVMF has a Display Resolution Setting too. Hit ESC at the splash screen to enter the Setup, then go to Device Manager / OVMF Platform Configuration / Change Preferred Resolution. The two Settings must match, otherwise macOS will boot with a scrambled display.


That should be it. Boot the virtual machine. Installing and using macOS should work as usual.

Final comment

Gabriel also has some comments on the legal side of this. Summary: Probably allowed by Apple on macintosh hardware, i.e. when running linux on your macintosh, then run macOS as guest there. If in doubt check with your lawyer.

by Gerd Hoffmann at September 27, 2017 06:52 AM

September 08, 2017

Cédric Bosdonnat

Virt-bootstrap 1.0.0 released

Yesterday, virt-bootstrap came to life. This tool aims at simplifying the creation of root file systems for use with libvirt's LXC container drivers. I started prototyping it a few months ago and Radostin Stoyanov wonderfully took it over during this year's Google Summer of Code.

For most users, this tool will just be used by virt-manager (since version 1.4.2). But it can be used directly from any script or command line.

The nice thing about virt-bootstrap is that it will allow you to create a root file system out of existing docker images, tarballs or virt-builder templates. For example, the following command will get and unpack the official openSUSE docker image in /tmp/foo.

$ virt-bootstrap docker://opensuse /tmp/foo

Virt-bootstrap offers options to:

  • generate qcow2 image with backing chain instead of plain folder
  • apply user / group ID mapping
  • set the root password in the container

Enjoy easy container creation with the libvirt ecosystem, and have fun!

by Cédric Bosdonnat at September 08, 2017 07:24 AM

September 01, 2017

QEMU project

QEMU version 2.10.0 released

You can grab the tarball from our download page. The full list of changes is available in the Wiki.

Highlights include:

  • Support for ACPI NUMA distance info and control over CPU NUMA assignments via ‘-numa cpu’ parameters
  • Support for LUKS encryption format in qcow2 images
  • Monitor/Management interface improvements: additional debug information available through ‘info ramblock/cmma/register/qtree’, support for viewing connected clients via ‘info vnc’, improved parsing support for QMP protocol, and other additional commands
  • QXL and virtio-gpu support for controlling default display resolution
  • Support for vhost-user-scsi devices
  • NVMe emulation support for Write Zeroes command and Controller Memory Buffers
  • Guest agent support for querying guest hostname, users, timezone, and OS version/release information
  • ARM: KVM support for Raspberry Pi 3
  • ARM: emulation support for MPS2/MPS2+ FPGA-based dev boards
  • ARM: zynq: SPIPS flash support
  • ARM: exynos4210: hardware PRNG device, SDHCI, and system poweroff
  • Microblaze: support for CPU versions 9.4, 9.5, 9.6, and 10.0
  • MIPS: support for Enhanced Virtual Addressing (EVA)
  • MIPS: initrd support for kaslr-enabled kernels
  • OpenRISC: support for shadow registers, idle states, and numcores/coreid/EVAR/EPH registers
  • PowerPC: Multi-threaded TCG emulation support
  • PowerPC: OpenBIOS VGA driver for MacOS guests
  • PowerPC: pseries: KVM and emulation support for POWER9 guests
  • PowerPC: pseries: support for hash page table resizing
  • s390: channel device passthrough support via vfio-ccw
  • s390: support for channel-attached 3270 “green screen” devices for use as guest consoles or additional TTYs
  • s390: improved support for PCI (AEN, AIS, and zPCI)
  • s390: support for z14 CPU models and netboot/TFTP via CCW BIOS
  • s390: TCG support for atomic “LOAD AND x” and “COMPARE SWAP” operations, LOAD PROGRAM PARAMETER, extended facilities, CPU type, and many more less-common instructions
  • SH: TCG support for host atomic instructions for emulating tas.b and gUSA (user-space atomics), and support for fpchg/fsrra instructions
  • SPARC: fixes for booting Solaris 2.6 on sun4m/OpenBIOS machines
  • x86: Q35 MCH supports TSEG higher than 8MB
  • x86: SSE register access via gdbstub
  • Xen: support for multi-page shared rings, and 9pfs/virtfs backend
  • Xtensa: sim machine console can be directed to chardev via -serial
  • and lots more…

Thank you to everyone involved!

September 01, 2017 07:00 AM

August 27, 2017

Nathan Gauër

3D Acceleration using VirtIO

August 27, 2017 03:37 PM

August 15, 2017

Daniel Berrange

ANNOUNCE: virt-viewer 6.0 release

I am happy to announce a new bugfix release of virt-viewer 6.0 (gpg), including experimental Windows installers for Win x86 MSI (gpg) and Win x64 MSI (gpg). The virsh and virt-viewer binaries in the Windows builds should now successfully connect to libvirtd, following fixes to libvirt’s mingw port.

Signatures are created with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

All historical releases are available from:

Changes in this release include:

  • Mention use of ssh-agent in man page
  • Display connection issue warnings in main window
  • Switch to GTask API
  • Add support changing CD ISO with oVirt foreign menu
  • Update various outdated links in README
  • Avoid printing password in debug logs
  • Pass hostname to authentication dialog
  • Fix example URLs in man page
  • Add args to virt-viewer to specify whether to resolve VM based on ID, UUID or name
  • Fix misc runtime warnings
  • Improve support for extracting listening info from XML
  • Enable connecting to SPICE over UNIX socket
  • Fix warnings with newer GCCs
  • Allow controlling zoom level with keypad
  • Don’t close app during seamless migration
  • Don’t show toolbar in kiosk mode
  • Re-show auth dialog in kiosk mode
  • Don’t show error when cancelling auth
  • Change default screenshot name to ‘Screenshot.png’
  • Report errors when saving screenshot
  • Fix build with latest glib-mkenums

Thanks to everyone who contributed towards this release.

by Daniel Berrange at August 15, 2017 02:20 PM

ANNOUNCE: libosinfo 1.1.0 release

I am happy to announce a new release of libosinfo version 1.1.0 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.

Changes in this release include:

  • Force UTF-8 locale for new glib-mkenums
  • Avoid python warnings in example program
  • Misc test suite updates
  • Fix typo in error messages
  • Remove ISO header string padding
  • Disable bogus gcc warning about unsafe loop optimizations
  • Remove reference to
  • Don’t hardcode /usr/bin/perl, use /usr/bin/env
  • Support eject-after-install parameter in OsinfoMedia
  • Fix misc warnings in docs
  • Fix error propagation when loading DB
  • Add usb.ids / pci.ids locations for FreeBSD
  • Don’t include private headers in gir/vapi generation

Thanks to everyone who contributed towards this release.

by Daniel Berrange at August 15, 2017 11:09 AM

August 10, 2017

QEMU project

Deprecation of old parameters and features

QEMU has a lot of interfaces (like command line options or HMP commands) and old features (like certain devices) which are considered deprecated since other more generic or better interfaces/features have been established instead. While the QEMU developers generally try to keep each QEMU release compatible with the previous ones, the old legacy sometimes gets in the way when developing new code and/or imposes a considerable maintenance burden.

Thus we are currently considering getting rid of some of the old interfaces and features in a future release and have started to collect a list of such old items in our QEMU documentation. If you are running QEMU directly, please have a look at this deprecation chapter of the QEMU documentation to see whether you are still using one of these old interfaces or features, so you can adapt your setup to use the new interfaces/features instead. Or, if you think instead that one of the items should not be removed from QEMU at all, please speak up on the qemu-devel mailing list to explain why the interface or feature is still required.

by Thomas Huth at August 10, 2017 08:45 AM

August 08, 2017

Cornelia Huck

Channel I/O, demystified

As promised, I will write some articles about one of the areas where s390x looks especially alien to people familiar with the usual suspects like x86: the native way of addressing I/O devices. 1

Channel I/O is a rather large topic, so I will concentrate on explaining it from a Linux (guest) and from a qemu/KVM (virtualization) point of view. This also means that I will prefer terminology that will make sense to somebody familiar with Linux (on x86) rather than the one used by e.g. a z/OS system programmer.

Links to the individual articles:
Channel I/O: What's in a channel subsystem?
Channel I/O: Talking to devices
Channel I/O: Types of devices
Channel I/O: More about channel paths

1. There is PCI on s390x, but it is a recent addition and its idiosyncrasies are better understood if you know how channel I/O works.

by Cornelia Huck at August 08, 2017 05:04 PM

Channel I/O: More about channel paths

A recent discussion on qemu-devel touched upon some aspects of channel paths and their handling (or not-handling) in QEMU. I will try to summarize and give some further information here.

I previously published some information on channel paths here. This post will concentrate a bit more on aspects that are not yet relevant in QEMU, but may become so in the future.

To recap: Channel paths represent the means by which the mainframe talks to the device - they (somewhat) correspond to the actual cabling. Let's take a look at the output of lscss on a z/VM guest as an actual example:

Device   Subchan.  DevType CU Type Use  PIM PAM POM  CHPIDs
0.0.0150 0.0.0000  0000/00 3088/08      80  80  ff   08000000 00000000
0.0.0151 0.0.0001  0000/00 3088/08      80  80  ff   08000000 00000000
0.0.8000 0.0.0002  1732/01 1731/01 yes  80  80  ff   00000000 00000000
0.0.8001 0.0.0003  1732/01 1731/01 yes  80  80  ff   00000000 00000000
0.0.8002 0.0.0004  1732/01 1731/01 yes  80  80  ff   00000000 00000000
0.0.8003 0.0.0005  1732/01 1731/01      80  80  ff   01000000 00000000
0.0.8004 0.0.0006  1732/01 1731/01      80  80  ff   01000000 00000000
0.0.8005 0.0.0007  1732/01 1731/01      80  80  ff   01000000 00000000
0.0.0191 0.0.0008  3390/0a 3990/e9      e0  e0  ff   2a3a0900 00000000
0.0.208f 0.0.0009  3390/0c 3990/e9 yes  e0  e0  ff   3a2a1a00 00000000
0.0.218f 0.0.000a  3390/0c 3990/e9 yes  e0  e0  ff   2a3a0900 00000000
0.0.228f 0.0.000b  3390/0c 3990/e9 yes  e0  e0  ff   2a3a1a00 00000000
0.0.238f 0.0.000c  3390/0c 3990/e9 yes  e0  e0  ff   093a2a00 00000000
0.0.000c 0.0.000d  0000/00 2540/00      80  80  ff   08000000 00000000
0.0.000d 0.0.000e  0000/00 2540/00      80  80  ff   08000000 00000000
0.0.000e 0.0.000f  0000/00 1403/00      80  80  ff   08000000 00000000
0.0.0009 0.0.0010  0000/00 3215/00 yes  80  80  ff   08000000 00000000
0.0.0190 0.0.0011  3390/0a 3990/e9      e0  e0  ff   3a2a1a00 00000000
0.0.019d 0.0.0012  3390/0a 3990/e9      e0  e0  ff   093a2a00 00000000
0.0.019e 0.0.0013  3390/0a 3990/e9      e0  e0  ff   093a2a00 00000000
0.0.0592 0.0.0014  3390/0a 3990/e9      e0  e0  ff   3a2a1a00 00000000
0.0.ffff 0.0.0015  9336/10 6310/80      80  80  ff   08000000 00000000

A couple of interesting observations with regard to channel paths can be made here:
  • Devices 0.0.0150/0.0.0151, 0.0.000c/0.0.000d, 0.0.000e, 0.0.0009, and 0.0.ffff all share the same channel path, 08, despite being of different types (virtual CTC, virtual card punch/card reader/printer, virtual console, and virtual FBA DASD). This is because they are all emulated devices, and z/VM chooses to use the same virtual channel path for them.
  • Devices 0.0.8000 - 0.0.8002 use channel path 00 as their only channel path, as can be seen by the PIM being 80.
  • Devices 0.0.8000 - 0.0.8002 and 0.0.8003-0.0.8005 use the same channel path, respectively; that is because they make up the device triplet for an OSA device.
  • The remaining devices (all ECKD DASD) use several channel paths (09, 1a, 2a, 3a), but only three at a time (as evidenced by the PIM of e0), and in different combinations. This is probably a quirk of the individual setup for this guest.
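
Reading the path masks by eye takes some practice, so here is a small Python sketch that decodes them. It relies on the convention that the most significant bit of the 8-bit mask (0x80) refers to the first of the eight CHPID slots; the function name is my own and not part of any s390 tool.

```python
def usable_chpids(mask_hex, chpids):
    """Return the CHPIDs selected by an 8-bit path mask such as the PIM.

    mask_hex: mask column from lscss, e.g. "e0"
    chpids:   CHPIDs column from lscss, e.g. "2a3a0900 00000000"
    """
    mask = int(mask_hex, 16)
    digits = chpids.replace(" ", "")
    slots = [digits[i:i + 2] for i in range(0, 16, 2)]
    # The most significant mask bit (0x80) refers to the first CHPID slot.
    return [slot for bit, slot in enumerate(slots) if mask & (0x80 >> bit)]


print(usable_chpids("80", "00000000 00000000"))  # ['00']
print(usable_chpids("e0", "2a3a0900 00000000"))  # ['2a', '3a', '09']
```

The first call corresponds to device 0.0.8000 (one path, CHPID 00) and the second to DASD 0.0.0191 (three paths).
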
The output of lschp of the same guest looks like this:

CHPID  Vary  Cfg.  Type  Cmg  Shared  PCHID
0.00   1     1     11    -    -      (ff00)
0.01   1     1     11    -    -      (ff01)
0.08   1     1     1a    -    -       0598 
0.09   1     1     1a    -    -       0599 
0.0a   1     1     1a    -    -       059c 
0.0b   1     1     25    -    -       059d 
0.0c   1     1     1a    -    -       05ac 
0.17   1     1     11    -    -       05b4 
0.18   1     1     1a    -    -       05a0 
0.19   1     1     1a    -    -       05a1 
0.1a   1     1     1a    -    -       05a4 
0.1b   1     1     25    -    -       05a5 
0.1c   1     1     1a    -    -       05ad 
0.28   1     1     1a    -    -       05d8 
0.29   1     1     1a    -    -       05d9 
0.2a   1     1     1a    -    -       05dc 
0.2b   1     1     25    -    -       05dd 
0.2c   1     1     1a    -    -       05d0 
0.34   1     1     11    -    -       05ec 
0.35   1     1     11    -    -       05a8 
0.38   1     1     1a    -    -       05e0 
0.39   1     1     1a    -    -       05e1 
0.3a   1     1     1a    -    -       05e4 
0.3b   1     1     25    -    -       05e5 
0.3c   1     1     1a    -    -       05d1 
0.60   1     1     24    -    -      (070c)
0.61   1     1     24    -    -      (070d)
0.62   1     1     24    -    -      (070e)
0.63   1     1     24    -    -      (070f)

Here we find the various channel paths again, together with more:
  • There are several channel paths that are available to the guest, but not in use by any device currently available to the guest (and therefore not turning up in the output of lscss).
  • Channel paths 00 and 01 (used by the OSA cards) use an internal channel (the numbers in the last column are in brackets) - we can therefore conclude that the cards are virtualized by z/VM.
  • The channel path 08 (which is referenced by all virtual devices) is actually backed by a physical path (0598). I frankly have no idea why z/VM is doing that.
  • The channel paths used by the ECKD DASD (09, 1a, 2a, 3a) all are of the same type (1a - FICON, IIRC) and are backed by different physical paths (last column).
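With this many channel paths it can be easier to filter the lschp output than to read it by eye. Here is a minimal sketch, assuming the column layout shown above (CHPID in the first column, type in the fourth, PCHID in the last). The helper names internal_chpids and chpids_of_type are made up for this example and are not part of any existing tool.

```shell
# Both helpers read lschp-style output on stdin and only look at lines
# whose first field looks like a CHPID (for example 0.3a), so the header
# line is skipped automatically.

# Print CHPIDs whose PCHID is shown in brackets (internal channels).
internal_chpids() {
    awk '$1 ~ /^[0-9a-f]+\.[0-9a-f]+$/ && $NF ~ /^\(.*\)$/ { print $1 }'
}

# Print CHPIDs of the given type, e.g. "1a".
chpids_of_type() {
    awk -v type="$1" '$1 ~ /^[0-9a-f]+\.[0-9a-f]+$/ && $4 == type { print $1 }'
}
```

On the guest above, lschp | internal_chpids would print 0.00, 0.01 and 0.60 through 0.63, one per line.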
Various modifications can be done to the channel paths; under Linux, the chchp tool is useful for that. Let's try to vary off a path:

chchp -v 0 0.3a
Vary offline 0.3a... done.

lschp shows the changed state for the path:

0.3a   0     1     1a    -    -       05e4

The lscss output remains unchanged - which isn't surprising as doing a vary off only affects the state of the channel path within Linux: Linux will no longer use the path for I/O, but the path masks as managed by the hardware and z/VM are not changed.

Let's try to configure off another path:

chchp -c 0 0.2a
Configure standby 0.2a... failed - attribute value not as expected

That did not work as expected. Why? This is supposed to issue a SCLP command to set the channel path to standby - but my guest apparently does not have the rights or ability to do so. Which is a pity, as I would have liked to show the effects of configuring a channel path to standby:
  • It (unsurprisingly) changes the state in lschp.
  • It also changes the path masks, as shown in lscss.
  • It may generate a machine check with a channel report word (CRW) that informs the OS that something has happened to the channel path - this is dependent upon the environment, however.
So let's stop here. I'll continue with another setup, once I have it.

by Cornelia Huck at August 08, 2017 05:03 PM

July 29, 2017

Stefan Hajnoczi

Tracing userspace static probes with perf(1)

The perf(1) tool added support for userspace static probes in Linux 4.8. Userspace static probes are pre-defined trace points in userspace applications. Application developers add them so frequently needed lifecycle events are available for performance analysis, troubleshooting, and development.

Static userspace probes are more convenient than defining your own function probes from scratch. You can save time by using them and not worrying about where to add probes because that has already been done for you.

On my Fedora 26 machine the QEMU, gcc, and nodejs packages ship with static userspace probes. QEMU offers probes for vcpu events, disk I/O activity, device emulation, and more.

Without further ado, here is how to trace static userspace probes with perf(1)!

Scan the binary for static userspace probes

The perf(1) tool needs to scan the application's ELF binaries for static userspace probes and store the information in $HOME/.debug/usr/:

# perf buildid-cache --add /usr/bin/qemu-system-x86_64

List static userspace probes

Once the ELF binaries have been scanned you can list the probes as follows:

# perf list sdt_*:*

List of pre-defined events (to be used in -e):

sdt_qemu:aio_co_schedule [SDT event]
sdt_qemu:aio_co_schedule_bh_cb [SDT event]
sdt_qemu:alsa_no_frames [SDT event]

Let's trace something!

First add probes for the events you are interested in:

# perf probe sdt_qemu:blk_co_preadv
Added new event:
sdt_qemu:blk_co_preadv (on %blk_co_preadv in /usr/bin/qemu-system-x86_64)

You can now use it in all perf tools, such as:

perf record -e sdt_qemu:blk_co_preadv -aR sleep 1

Then capture trace data as follows:

# perf record -a -e sdt_qemu:blk_co_preadv
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 2.274 MB (4714 samples) ]

The trace can be printed using perf-script(1):

# perf script
qemu-system-x86 3425 [000] 2183.218343: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=0 arg4=512 arg5=0
qemu-system-x86 3425 [001] 2183.310712: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=0 arg4=512 arg5=0
qemu-system-x86 3425 [001] 2183.310904: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=512 arg4=512 arg5=0

If you want to get fancy it's also possible to write trace analysis scripts with perf-script(1). That's a topic for another post but see the --gen-script= option to generate a skeleton script.
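Short of a full perf-script(1) analysis script, a quick summary of the text output is often enough. The sketch below sums one numbered argument across all lines. It assumes the argN=value format shown above; the function name sum_probe_arg is made up for this example. Which argument carries which meaning has to come from the probe definition in the application source, so treating arg4 as a byte count is an assumption to check there.

```shell
# Sum one numeric probe argument (e.g. arg4) over `perf script` output
# read from stdin, and report how many events carried that argument.
sum_probe_arg() {
    awk -v key="$1" '
        {
            for (i = 1; i <= NF; i++) {
                n = split($i, kv, "=")
                if (n == 2 && kv[1] == key) { total += kv[2]; count++ }
            }
        }
        END { printf "%d events, %s total = %d\n", count, key, total }
    '
}
```

For example, perf script | sum_probe_arg arg4 gives one line with the event count and the sum of arg4.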

Current limitations

As of July 2017 there are a few limitations to be aware of:

Probe arguments are automatically numbered and do not have human-readable names. You will see arg1, arg2, etc and will need to reference the probe definition in the application source code to learn the meaning of the argument. Some versions of perf(1) may not even print arguments automatically since this feature was added later.

The contents of string arguments are not printed, only the memory address of the string.

Probes called from multiple call-sites in the application result in multiple perf probes. For example, if probe foo is called from 3 places you get sdt_myapp:foo, sdt_myapp:foo_1, and sdt_myapp:foo_2 when you run perf probe --add sdt_myapp:foo.

The SystemTap semaphores feature is not supported and such probes will not fire unless you manually set the semaphore inside your application or from another tool like GDB. This means that the sdt_myapp:foo will not fire if the application uses the MYAPP_FOO_ENABLED() macro like this: if (MYAPP_FOO_ENABLED()) MYAPP_FOO();.
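The multiple call-site limitation means that recording "one" probe can require several -e options. A small helper can build them from a list of event names. This is only a sketch: event_args is a made-up name, and the input is assumed to be one event per line with the event name as the first word (for instance copied from the output of perf probe --list).

```shell
# Read event names on stdin and print "-e ev -e ev_1 ..." for the given
# event and its numbered variants (<event>_N). Other events are ignored.
event_args() {
    awk -v ev="$1" '
        { name = $1 }
        name == ev ||
        (index(name, ev "_") == 1 && substr(name, length(ev) + 2) ~ /^[0-9]+$/) {
            printf "%s-e %s", sep, name
            sep = " "
        }
        END { if (sep != "") print "" }
    '
}
```

The printed options can then be passed to perf record -a in place of a single -e option.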

Some history and alternative tools

Static userspace probes were popularized by DTrace's <sys/sdt.h> header. Tracers that came after DTrace implemented the same interface for compatibility.

On Linux the initial tool for static userspace probes was SystemTap. In fact, the <sys/sdt.h> header file on my Fedora 26 system is still part of the systemtap-sdt-devel package.

More recently the GDB debugger gained support for static userspace probes. See the Static Probe Points documentation if you want to use userspace static probes from GDB.


It's very handy to have static userspace probing available alongside all the other perf(1) tracing features. There are a few limitations to keep in mind but if your tracing workflow is based primarily around perf(1) then you can now begin using static userspace probes without relying on additional tools.

by stefanha at July 29, 2017 11:22 AM

July 24, 2017

Peter Maydell

Installing Debian on QEMU’s 64-bit ARM “virt” board

This post is a 64-bit companion to an earlier post of mine where I described how to get Debian running on QEMU emulating a 32-bit ARM “virt” board. Thanks to commenter snak3xe for reminding me that I’d said I’d write this up…

Why the “virt” board?

For 64-bit ARM QEMU emulates many fewer boards, so “virt” is almost the only choice, unless you specifically know that you want to emulate one of the 64-bit Xilinx boards. “virt” supports PCI, virtio, a recent ARM CPU and large amounts of RAM. The only thing it doesn’t have out of the box is graphics.

Prerequisites and assumptions

I’m going to assume you have a Linux host, and a recent version of QEMU (at least QEMU 2.8). I also use libguestfs to extract files from a QEMU disk image, but you could use a different tool for that step if you prefer.

I’m going to document how to set up a guest which directly boots the kernel. It should also be possible to have QEMU boot a UEFI image which then boots the kernel from a disk image, but that’s not something I’ve looked into doing myself. (There may be tutorials elsewhere on the web.)

Getting the installer files

I suggest creating a subdirectory for these and the other files we’re going to create.

wget -O installer-linux
wget -O installer-initrd.gz

Saving them locally as installer-linux and installer-initrd.gz means they won’t be confused with the final kernel and initrd that the installation process produces.

(If we were installing on real hardware we would also need a “device tree” file to tell the kernel the details of the exact hardware it’s running on. QEMU’s “virt” board automatically creates a device tree internally and passes it to the kernel, so we don’t need to provide one.)


First we need to create an empty disk drive to install onto. I picked a 5GB disk but you can make it larger if you like.

qemu-img create -f qcow2 hda.qcow2 5G

(Oops — an earlier version of this blogpost created a “qcow” format image, which will work but is less efficient. If you created a qcow image by mistake, you can convert it to qcow2 with mv hda.qcow2 old-hda.qcow && qemu-img convert -O qcow2 old-hda.qcow hda.qcow2. Don’t try it while the VM is running! You then need to update your QEMU command line to say “format=qcow2” rather than “format=qcow”. You can delete the old-hda.qcow once you’ve checked that the new qcow2 file works.)
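If you are not sure which format an existing image is in, qemu-img info will tell you. Here is a minimal sketch that pulls the format out of its output; the function names image_format and needs_conversion are made up for this example, and the parsing assumes the usual "file format: ..." line in the human-readable output.

```shell
# Print the "file format" value from `qemu-img info` output on stdin.
image_format() {
    awk -F': ' '$1 == "file format" { print $2 }'
}

# Succeed only if the given image file is in the older qcow format.
needs_conversion() {
    [ "$(qemu-img info "$1" | image_format)" = "qcow" ]
}
```

For instance, needs_conversion hda.qcow2 && echo "convert this image first" prints the reminder only for an old-format image.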

Now we can run the installer:

qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel installer-linux \
  -initrd installer-initrd.gz \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic -no-reboot

The installer will display its messages on the text console (via an emulated serial port). Follow its instructions to install Debian to the virtual disk; it’s straightforward, but if you have any difficulty the Debian installation guide may help.

The actual install process will take a few hours as it downloads packages over the network and writes them to disk. It will occasionally stop to ask you questions.

Late in the process, the installer will print the following warning dialog:

   +-----------------| [!] Continue without boot loader |------------------+
   |                                                                       |
   |                       No boot loader installed                        |
   | No boot loader has been installed, either because you chose not to or |
   | because your specific architecture doesn't support a boot loader yet. |
   |                                                                       |
   | You will need to boot manually with the /vmlinuz kernel on partition  |
   | /dev/vda1 and root=/dev/vda2 passed as a kernel argument.             |
   |                                                                       |
   |                              <Continue>                               |
   |                                                                       |
   +-----------------------------------------------------------------------+

Press continue for now, and we’ll sort this out later.

Eventually the installer will finish by rebooting — this should cause QEMU to exit (since we used the -no-reboot option).

At this point you might like to make a copy of the hard disk image file, to save the tedium of repeating the install later.

Extracting the kernel

The installer warned us that it didn’t know how to arrange to automatically boot the right kernel, so we need to do it manually. For QEMU that means we need to extract the kernel the installer put into the disk image so that we can pass it to QEMU on the command line.

There are various tools you can use for this, but I’m going to recommend libguestfs, because it’s the simplest to use. To check that it works, let’s look at the partitions in our virtual disk image:

$ virt-filesystems -a hda.qcow2 

If this doesn’t work, then you should sort that out first. A couple of common reasons I’ve seen:

  • if you’re on Ubuntu then your kernels in /boot are installed not-world-readable; you can fix this with sudo chmod 644 /boot/vmlinuz*
  • if you’re running Virtualbox on the same host it will interfere with libguestfs’s attempt to run KVM; you can fix that by exiting Virtualbox

Looking at what’s in our disk we can see the kernel and initrd in /boot:

$ virt-ls -a hda.qcow2 /boot/

and we can copy them out to the host filesystem:

virt-copy-out -a hda.qcow2 /boot/vmlinuz-4.9.0-3-arm64 /boot/initrd.img-4.9.0-3-arm64 .

(We want the longer filenames, because vmlinuz and initrd.img are just symlinks and virt-copy-out won’t copy them.)

An important warning about libguestfs, or any other tools for accessing disk images from the host system: do not try to use them while QEMU is running, or you will get disk corruption when both the guest OS inside QEMU and libguestfs try to update the same image.

If you subsequently upgrade the kernel inside the guest, you’ll need to repeat this step to extract the new kernel and initrd, and then update your QEMU command line appropriately.
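Since the version number in the filenames changes with every kernel upgrade, it can be handy to work out the names rather than type them. This is a sketch under a few assumptions: the listing comes from virt-ls as above, the initrd is named initrd.img-<version> to match vmlinuz-<version>, and sort -V (GNU version sort) is available on the host. The function names are made up for this example.

```shell
# Pick the highest-versioned vmlinuz-* name from a /boot listing on stdin.
# The bare "vmlinuz" symlink is skipped because it has no version suffix.
newest_kernel() {
    grep '^vmlinuz-' | sort -V | tail -n 1
}

# Print (not run) the virt-copy-out command for a kernel and its initrd.
copy_out_command() {
    image=$1
    kernel=$2
    version=${kernel#vmlinuz-}
    echo "virt-copy-out -a $image /boot/$kernel /boot/initrd.img-$version ."
}
```

Typical use would be kernel=$(virt-ls -a hda.qcow2 /boot/ | newest_kernel) followed by copy_out_command hda.qcow2 "$kernel", then running the printed command once it looks right. As before, only do this while the VM is shut down.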


To run the installed system we need a different command line which boots the installed kernel and initrd, and passes the kernel the command line arguments the installer told us we’d need:

qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel vmlinuz-4.9.0-3-arm64 \
  -initrd initrd.img-4.9.0-3-arm64 \
  -append 'root=/dev/vda2' \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic

This should boot to a login prompt, where you can log in with the user and password you set up during the install.

The installation has an SSH client, so one easy way to get files in and out is to use “scp” from inside the VM to talk to an SSH server outside it. Or you can use libguestfs to write files directly into the disk image (for instance using virt-copy-in) — but make sure you only use libguestfs when the VM is not running, or you will get disk corruption.

by pm215 at July 24, 2017 09:25 AM

July 21, 2017

Ladi Prosek

Nesting Hyper-V in QEMU/KVM: Known issues

This is a follow-up to Running Hyper-V in a QEMU/KVM Guest published earlier this year. The article provided instructions on setting up Hyper-V in a QEMU/KVM Windows guest as enabled by a particular KVM patchset (on Intel hardware only, as it turned out later). Several issues have been found since then; some already fixed, some in the process of being fixed, and some still not fully understood.

This post aims to be an up-to-date list of issues related to Hyper-V on KVM, showing their current status and, where applicable, upstream commit IDs. The issues are ordered chronologically from the oldest ones to those found recently.

Issue description | Status | Public bug tracker
Hyper-V on KVM does not work at all (initial work item) | Fixed in kernel 4.10 | RHBZ 1326138
Hyper-V on KVM does not work on new Intel CPUs with PML | Fixed in kernel 4.11 | RHBZ 1440022
Hyper-V on KVM does not work on AMD CPUs | Fixed in kernel 4.12 for 1 vCPU and in kernel 4.13 for >1 vCPU | RHBZ 1440025
rtl8139 and e1000 QEMU network cards don’t work with Hyper-V enabled | Not fixed yet | RHBZ 1452546
L2 Linux guest in Hyper-V on KVM hangs on boot | Fixed in kernel 4.13 (71c2a2d0a8) | RHBZ 1457866
Windows TSC page does not work with Hyper-V enabled | Fixed in kernel 4.14 and in QEMU 2.11 | RHBZ 1464412
Hyper-V on KVM does not work with OVMF (required for secure boot) | Fixed in kernel 4.15 (cc3d967f7e) | RHBZ 1488203
Not Hyper-V but prevents virtualization-based security from running | 00663d047f, fda8f631ed | RHBZ 1496170, TianocoreBZ 727


by ladipro at July 21, 2017 07:34 AM

July 13, 2017

Stefan Hajnoczi

Packet capture coming to AF_VSOCK

For anyone interested in the AF_VSOCK zero-configuration host<->guest communications channel it's important to be able to observe traffic. Packet capture is commonly used to troubleshoot network problems and debug networking applications. Up until now it hasn't been available for AF_VSOCK.

In 2016 Gerard Garcia created the vsockmon Linux driver that enables AF_VSOCK packet capture. During the course of his excellent Google Summer of Code work he also wrote patches for libpcap, tcpdump, and Wireshark.

Recently I revisited Gerard's work because Linux 4.12 shipped with the new vsockmon driver, making it possible to finalize the userspace support for AF_VSOCK packet capture. And it's working beautifully:

I have sent the latest patches to the tcpdump and Wireshark communities so AF_VSOCK can be supported out-of-the-box in the future. For now you can also find patches in my personal repositories:

The basic flow is as follows:

# ip link add type vsockmon
# ip link set vsockmon0 up
# tcpdump -i vsockmon0
# ip link set vsockmon0 down
# ip link del vsockmon0
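If you capture often, it is easy to forget the teardown steps. The following is a sketch of a wrapper that always removes the interface again; the names run and capture_vsock and the DRY_RUN variable are made up for this example, and it assumes (as in the flow above) that the new interface is called vsockmon0. It needs root when not in dry-run mode.

```shell
# Execute a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Create the vsockmon interface, capture with tcpdump (extra arguments
# are passed through), then always bring the interface down and delete it.
capture_vsock() {
    run ip link add type vsockmon || return 1
    trap : INT    # keep this shell alive when Ctrl-C stops tcpdump
    run ip link set vsockmon0 up && run tcpdump -i vsockmon0 "$@"
    status=$?
    trap - INT
    run ip link set vsockmon0 down
    run ip link del vsockmon0
    return "$status"
}
```

Running DRY_RUN=1 first shows the commands without touching the system; after that, capture_vsock -c 10 would stop after ten packets.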

It's easiest to wait for distros to package Linux 4.12 and future versions of libpcap, tcpdump, and Wireshark. If you decide to build from source, make sure to build libpcap first and then tcpdump or Wireshark. The libpcap dependency is necessary so that tcpdump/Wireshark can access AF_VSOCK traffic.

by stefanha at July 13, 2017 03:31 PM

Gerd Hoffmann

Fresh Fedora 26 images uploaded

Fedora 26 is out of the door, and here are fresh fedora 26 images.

There are raspberry pi images. The aarch64 images require a model 3; the armv7 image boots on both model 2 and model 3. Unlike the images for the previous fedora releases, the new images use the standard fedora kernels instead of a custom kernel. So, the kernel update service for the older images will stop within the next few weeks.

There are efi images for qemu. The i386 and x86_64 images use systemd-boot as bootloader. grub2 doesn’t work due to bug 1196114 (unless you create a boot menu entry manually in uefi setup). The arm images use grub2 as bootloader. armv7 isn’t supported by systemd-boot in the first place. The aarch64 version throws an exception. The efi images can also be booted as a container, using "systemd-nspawn --boot --image <file>", but you have to convert them to raw first.

The images don’t have a root password. You have to set one using "virt-customize -a <image> --root-password password:<secret>", otherwise you can’t log in after boot.

The images have been created with imagefish.

by Gerd Hoffmann at July 13, 2017 12:33 PM

June 27, 2017

Richard Jones

virt-builder Debian 9 image available

Debian 9 (“Stretch”) was released last week and now it’s available in virt-builder, the fast way to build virtual machine disk images:

$ virt-builder -l | grep debian
debian-6                 x86_64     Debian 6 (Squeeze)
debian-7                 sparc64    Debian 7 (Wheezy) (sparc64)
debian-7                 x86_64     Debian 7 (Wheezy)
debian-8                 x86_64     Debian 8 (Jessie)
debian-9                 x86_64     Debian 9 (stretch)

$ virt-builder debian-9 \
    --root-password password:123456
[   0.5] Downloading:
[   1.2] Planning how to build this image
[   1.2] Uncompressing
[   5.5] Opening the new disk
[  15.4] Setting a random seed
virt-builder: warning: random seed could not be set for this type of guest
[  15.4] Setting passwords
[  16.7] Finishing off
                   Output file: debian-9.img
                   Output size: 6.0G
                 Output format: raw
            Total usable space: 3.9G
                    Free space: 3.1G (78%)

$ qemu-system-x86_64 \
    -machine accel=kvm:tcg -cpu host -m 2048 \
    -drive file=debian-9.img,format=raw,if=virtio \
    -serial stdio

by rich at June 27, 2017 09:01 AM

June 04, 2017

Richard Jones

New in libguestfs: Rewriting bits of the daemon in OCaml

libguestfs is a C library for creating and editing disk images. In the most common (but not the only) configuration, it uses KVM to sandbox access to disk images. The C library talks to a separate daemon running inside a KVM appliance, as in this Unicode-art diagram taken from the fine manual:

 ┌───────────────────┐
 │ main program      │
 │                   │
 │                   │           child process / appliance
 │                   │          ┌──────────────────────────┐
 │                   │          │ qemu                     │
 ├───────────────────┤   RPC    │      ┌─────────────────┐ │
 │ libguestfs  ◀╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍▶ guestfsd        │ │
 │                   │          │      ├─────────────────┤ │
 └───────────────────┘          │      │ Linux kernel    │ │
                                │      └────────┬────────┘ │
                                └───────────────│──────────┘
                                                │
                                                │ virtio-scsi
                                         ┌──────┴──────┐
                                         │  Device or  │
                                         │  disk image │
                                         └─────────────┘
The library has to be written in C because it needs to be linked to any main program. The daemon (guestfsd in the diagram) is also written in C. There’s no particular reason for that, though, except that it’s what we did historically.

The daemon is essentially a big pile of functions, most corresponding to a libguestfs API. Writing the daemon in C is painful to say the least. Because it’s a long-running process running in a memory-constrained environment, we have to be very careful about memory management, religiously checking every return from malloc, strdup etc., making even the simplest task non-trivial and full of untested code paths.

So last week I modified libguestfs so you can now write APIs in OCaml if you want to. OCaml is a high level language that compiles down to object files, and it’s entirely possible to link the daemon from a mix of C object files and OCaml object files. Another advantage of OCaml is that you can call from C ↔ OCaml with relatively little glue code (although a disadvantage is that you still need to write that glue mostly by hand). Most simple calls turn into direct CALL instructions with just a simple bitshift required to convert between ints and bools on the C and OCaml sides. More complex calls passing strings and structures are not too difficult either.
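To make the "simple bitshift" concrete: OCaml stores an integer n as the machine word (n << 1) + 1, and booleans as the integers 0 and 1 in the same encoding. The sketch below restates the conversions as shell arithmetic purely for illustration. The function names mirror the Val_int, Int_val and Val_bool macros from the OCaml C interface, but this is not libguestfs code.

```shell
# C int -> OCaml value: shift left by one and set the low tag bit.
val_int() { echo $(( ($1 << 1) + 1 )); }

# OCaml value -> C int: shift right by one, dropping the tag bit.
int_val() { echo $(( $1 >> 1 )); }

# C truth value -> OCaml bool: any non-zero input becomes true (3),
# zero becomes false (1).
val_bool() {
    if [ "$1" -ne 0 ]; then val_int 1; else val_int 0; fi
}
```

So the C integer 5 crosses the boundary as 11, and comes back as 5 after the shift in the other direction.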

OCaml also turns memory errors into a single exception, which unwinds the stack cleanly, so we don’t litter the code with memory handling. We can still run the mixed C/OCaml binary under valgrind.

Code gets quite a bit shorter. For example the case_sensitive_path API — all string handling and directory lookups — goes from 183 lines of C code to 56 lines of OCaml code (and much easier to understand too).

I’m reimplementing a few APIs in OCaml, but the plan is definitely not to convert them all. I think we’ll have C and OCaml APIs in the daemon for a very long time to come.

by rich at June 04, 2017 01:14 PM

Last updated: January 23, 2018 08:47 PM