Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools


December 06, 2016

Kashyap Chamarthy

QEMU Advent Calendar 2016

The QEMU Advent Calendar website 2016 features a QEMU disk image each day from 01-DEC to 24-DEC. Each day a new package becomes available for download (in tar.xz format), which contains a file describing the image (readme.txt or similar) and a little “run” shell script that starts QEMU with the recommended command-line parameters for the disk image.

“The disk images contain interesting operating systems and software that run under the QEMU emulator. Some of them are well-known or not-so-well-known operating systems, old and new, others are custom demos and neat algorithms.” [From the About section.]

This is brought to you by Thomas Huth (his initial announcement here) and yours truly.


Explore the last five days of images from the 2016 edition here! [Extract the download with, e.g. for Day 05: tar -xf day05.tar.xz]
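
A typical session might look like this (directory and script names can differ per image, so check the included readme):

tar -xf day05.tar.xz
cd day05
cat readme.txt
./run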

PS: We still have a few open slots, so please don’t hesitate to contact us if you have any fun disk image(s) to contribute.


by kashyapc at December 06, 2016 03:10 PM

December 01, 2016

Thomas Huth

QEMU Advent Calendar 2016 starts

The QEMU Advent Calendar 2016 website reveals a new disk image for download each day in December until 2016-12-24, to create a fun experience for the QEMU community, and to celebrate the 10th anniversary of KVM. Starting today, on December 1st, the first door of the advent calendar can now be opened – Get ready for some 16-bit retro feeling!

December 01, 2016 08:40 AM

November 29, 2016

Eduardo Habkost

An incomplete list of QEMU APIs

Having seen many people (including myself) confused about the purpose of some of QEMU’s internal APIs when reviewing and contributing code to QEMU, I am trying to document the things I have learned about them.

I want to make more detailed blog posts about some of them, stating their goals (as I perceive them), where they are used, and what we can expect to see happening to them in the future. When I do that, I will update this post to include pointers to the more detailed content.

QemuOpts

Introduced in 2009. Compared to the newer abstractions below, it is quite simple. As described in the original commit: it “stores device parameters in a better way than unparsed strings”. It is still used by configuration and command-line parsing code.
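
As a rough illustration (not lifted from the QEMU source), code consuming an option group such as -drive file=disk.img,if=virtio would read typed key/value pairs instead of an unparsed string, along these lines:

/* Sketch only: "opts" is assumed to hold the parsed -drive suboptions,
 * and the option names here are just examples. */
const char *file = qemu_opt_get(opts, "file");   /* "disk.img" */
const char *bus  = qemu_opt_get(opts, "if");     /* "virtio"   */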

Making QemuOpts work with the more modern abstractions (esp. QOM and QAPI) may be painful. Sometimes you can pretend it is not there, but you can’t run away from it if you are dealing with QEMU configuration or command-line parameters.

qdev

qdev was added to QEMU in 2009. qdev manages the QEMU device tree, based on a hierarchy of buses and devices. You can see the device tree managed by qdev using the info qtree monitor command in QEMU.

qdev allows device code to register implementations of device types. Machine code, on the other hand, would instantiate those devices and configure them by setting properties, and not accessing internal device data structures directly. Some devices can be plugged from the QEMU monitor or command-line, and their properties can be configured as arguments to the -device option or device_add command.
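
As a rough illustration, the same device type can be instantiated either on the command line or at runtime from the monitor, and the resulting tree inspected with info qtree:

qemu-system-x86_64 -M pc \
    -netdev user,id=net0 \
    -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:56

(qemu) device_add virtio-scsi-pci,id=scsi0
(qemu) info qtree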

From the original code:

The theory here is that it should be possible to create a machine without knowledge of specific devices. Historically board init routines have passed a bunch of arguments to each device, requiring the board know exactly which device it is dealing with. This file provides an abstract API for device configuration and initialization. Devices will generally inherit from a particular bus (e.g. PCI or I2C) rather than this API directly.

Some may argue that qdev doesn’t exist anymore, and was replaced by QOM. Others (including myself) describe it as being built on top of QOM. Either way you describe it, the same features provided by the original qdev code are provided by the QOM-based code living in hw/core.

See also:

  • KVM Forum 2010 talk by Markus Armbruster: “QEMU’s new device model qdev” (slides)
  • KVM Forum 2011 talk by Markus Armbruster: “QEMU’s device model qdev: Where do we go from here?” (slides. video)
  • KVM Forum 2013 talk by Andreas Färber: “Modern QEMU Devices” (slides, video)

QOM

QOM is short for QEMU Object Model and was introduced in 2011. It is heavily documented in its header file. It started as a generalization of qdev. Today the device tree and backend objects are managed through the QOM object tree.

From its documentation:

The QEMU Object Model provides a framework for registering user creatable types and instantiating objects from those types. QOM provides the following features:

  • System for dynamically registering types
  • Support for single-inheritance of types
  • Multiple inheritance of stateless interfaces

QOM also has a property system for introspection and object/device configuration. qdev’s property system is built on top of QOM’s property system.

Some QOM types and their properties are meant to be used internally only (e.g. some devices that are not pluggable and only created by machine code; accelerator objects). Some types can be instantiated and configured directly from the QEMU monitor or command-line (using, e.g., -device, device_add, -object, object-add).
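
As a small sketch (paths and property names are just examples), a backend object created with -object can afterwards be inspected through the QOM property system over QMP:

qemu-system-x86_64 ... -object memory-backend-ram,id=mem0,size=1G

{ "execute": "qom-list", "arguments": { "path": "/objects" } }
{ "execute": "qom-get", "arguments": { "path": "/objects/mem0", "property": "size" } }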

See also:

  • KVM Forum 2014 talk by Paolo Bonzini: “QOM exegesis and apocalypse” (slides, video).

VMState

VMState was introduced in 2009. It was added to change the device state saving/loading (for savevm and migration) from error-prone ad-hoc coding to a table-based approach.

From the original commit:

This patch introduces VMState infrastructure, to convert the save/load functions of devices to a table approach. This new approach has the following advantages:

  • it is type-safe
  • you can’t have load/save functions out of sync
  • will allow us to have new interesting commands, like dump, that shows all its internal state.
  • Just now, the only added type is arrays, but we can add structures.
  • Uses old load_state() function for loading old state.
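
To give a feel for the table-based approach, here is a minimal, hypothetical state description written with the VMState macros (MyDevState and its fields are made up for illustration):

typedef struct MyDevState {
    uint32_t reg;        /* some guest-visible register */
    uint8_t  irq_level;
} MyDevState;

static const VMStateDescription vmstate_mydev = {
    .name = "mydev",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT32(reg, MyDevState),
        VMSTATE_UINT8(irq_level, MyDevState),
        VMSTATE_END_OF_LIST()
    }
};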

See also:

  • KVM Forum 2010 talk by Juan Quintela: “Migration: How to hop from machine to machine without losing state” (slides)
  • KVM Forum 2011 talk by Juan Quintela: “Migration: one year later” (slides, video)
  • KVM Forum 2012 talk by Michael Roth: “QIDL: An Embedded Language to Serialize Guest Data Structures for Live Migration” (slides)

QMP

QMP is the QEMU Machine Protocol. Introduced in 2009. From its documentation:

The QEMU Machine Protocol (QMP) allows applications to operate a QEMU instance.

QMP is JSON based and features the following:

  • Lightweight, text-based, easy to parse data format
  • Asynchronous messages support (i.e. events)
  • Capabilities Negotiation
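
A minimal session looks roughly like this (sketched from memory, with the server greeting abbreviated); the capabilities negotiation happens before any other command is accepted:

<- { "QMP": { "version": { ... }, "capabilities": [] } }
-> { "execute": "qmp_capabilities" }
<- { "return": {} }
-> { "execute": "query-status" }
<- { "return": { "running": true, "singlestep": false, "status": "running" } }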

For detailed information on QMP’s usage, please, refer to the following files:

  • qmp-spec.txt QEMU Machine Protocol current specification
  • qmp-commands.txt QMP supported commands (auto-generated at build-time)
  • qmp-events.txt List of available asynchronous events

See also: KVM Forum 2010 talk by Luiz Capitulino, A Quick Tour of the QEMU Monitor Protocol.

QObject

QObject was introduced in 2009. It was added during the work to add QMP. It provides a generic QObject data type, and available subtypes include integers, strings, lists, and dictionaries. It includes reference counting. It was also called QEMU Object Model when the code was introduced, but do not confuse it with QOM.

It started as a simple implementation, but was later expanded to support all the data types defined in the QAPI schema (see below).

QAPI

QAPI was introduced in 2011. The original documentation (which can be outdated) can be seen at http://wiki.qemu.org/Features/QAPI.

From the original patch series:

Goals of QAPI

1) Make all interfaces consumable in C such that we can use the interfaces in QEMU

2) Make all interfaces exposed through a library using code generation from static introspection

3) Make all interfaces well specified in a formal schema

From the documentation:

QAPI is a native C API within QEMU which provides management-level functionality to internal and external users. For external users/processes, this interface is made available by a JSON-based wire format for the QEMU Monitor Protocol (QMP) for controlling qemu, as well as the QEMU Guest Agent (QGA) for communicating with the guest. The remainder of this document uses “Client JSON Protocol” when referring to the wire contents of a QMP or QGA connection.

To map Client JSON Protocol interfaces to the native C QAPI implementations, a JSON-based schema is used to define types and function signatures, and a set of scripts is used to generate types, signatures, and marshaling/dispatch code. This document will describe how the schemas, scripts, and resulting code are used.
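
To give a flavour of the schema language, a struct type and a command returning it are declared roughly like this (a simplified sketch modelled on entries similar to those in qapi-schema.json):

{ 'struct': 'NameInfo', 'data': { '*name': 'str' } }
{ 'command': 'query-name', 'returns': 'NameInfo' }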

See also:

  • KVM Forum 2011 talk by Anthony Liguori: “Code Generation for Fun and Profit” (slides, video)

Visitor API

QAPI includes an API to define and use visitors for the QAPI-defined data types. Visitors are the mechanism used to serialize QAPI data to/from the external world (e.g. through QMP, the command-line, or config files).

From its documentation:

The QAPI schema defines both a set of C data types, and a QMP wire format. QAPI objects can contain references to other QAPI objects, resulting in a directed acyclic graph. QAPI also generates visitor functions to walk these graphs. This file represents the interface for doing work at each node of a QAPI graph; it can also be used for a virtual walk, where there is no actual QAPI C struct.

There are four kinds of visitor classes: input visitors (QObject, string, and QemuOpts) parse an external representation and build the corresponding QAPI graph, output visitors (QObject and string) take a completed QAPI graph and generate an external representation, the dealloc visitor can take a QAPI graph (possibly partially constructed) and recursively free its resources, and the clone visitor performs a deep clone of one QAPI object to another. While the dealloc and QObject input/output visitors are general, the string, QemuOpts, and clone visitors have some implementation limitations; see the documentation for each visitor for more details on what it supports. Also, see visitor-impl.h for the callback contracts implemented by each visitor, and docs/qapi-code-gen.txt for more about the QAPI code generator.

The End

Although large, this list is incomplete. In the near future, I plan to write about QAPI, QOM, and QemuOpts, and how they work (and sometimes don’t work) together.

Most of the abstractions above are about data modeling, in one way or another. That’s not a coincidence: one of the things I want to write about is how sometimes those data abstractions have conflicting world views, and the issues resulting from that.

Feedback wanted: if you have any correction or suggestion to this list, please send your comments. You can use the GitHub page for the post to send comments or suggest changes, or just e-mail me.

by Eduardo Habkost at November 29, 2016 02:51 AM

November 15, 2016

Thomas Huth

QEMU Advent Calendar 2016 website now online

This year, we are celebrating the 10th anniversary of KVM (see Amit Shah’s article on LWN.net for more information), and the 25th anniversary of Linux, so to contribute to this celebration, Kashyap Chamarthy and I are preparing another edition of the QEMU Advent Calendar this year.

The QEMU Advent Calendar 2016 will be a website that reveals a new disk image for download each day in December until 2016-12-24. To motivate people to contribute some disk images to the calendar (we still need some to be able to provide one each day), the new website for the advent calendar is now already online at www.qemu-advent-calendar.org – but of course the doors with the disk images will not open before December 1st.

November 15, 2016 01:50 PM

Daniel Berrange

New TLS algorithm priority config for libvirt with gnutls on Fedora >= 25

Libvirt has long supported use of TLS for its remote API service, using the gnutls library as its backend. When negotiating a TLS session, there are a huge number of possible algorithms that could be used and the client & server need to decide on the best one, where “best” is commonly some notion of “most secure”. The preference for negotiation is expressed by simply having a list of possible algorithms, sorted best to worst, and the client & server choose the first matching entry in their respective lists. Historically libvirt has not expressed any interest in the handshake priority configuration, simply delegating the decision to the gnutls library on the basis that its developers knew better than libvirt developers which algorithms are best. In gnutls terminology, this means that libvirt has historically used the “DEFAULT” priority string.

The past year or two has seen a seemingly never ending stream of CVEs related to TLS, some of them particular to specific algorithms. The only way some of these flaws can be addressed is by discontinuing use of the affected algorithm. The TLS library implementations have to be fairly conservative in dropping algorithms, because this has an effect on consumers of the library in question. There is also potentially a significant delay between a decision to discontinue support for an algorithm, and updated libraries being deployed to hosts. To address this, Fedora 21 introduced the ability to define the algorithm priority strings in host configuration files, outside of the library code. This way system administrators can edit the file /etc/crypto-policies/config to change the algorithm priority for all apps using TLS on the host. After editing this file, the update-crypto-policies command is run to generate the library-specific configuration files. For example, it populates /etc/crypto-policies/back-ends/gnutls.config. In gnutls, use of this file is enabled by specifying that an application wants to use the “@SYSTEM” priority string.

This is a good step forward, as it takes the configuration out of source code and into data files, but it has limited flexibility because it applies to all apps on the host. There can be two apps on a host which have mutually incompatible views about what the best algorithm priority is. For example, a web browser will want to be fairly conservative in dropping algorithms to avoid breaking access to countless websites. An application like libvirtd though, where there is a well known set of servers and clients to connect in any site, can be fairly aggressive in only supporting the very best algorithms. What is desired is a way to override the algorithm priority per application. Now of course this can easily be done via the application’s own configuration file, and so libvirt has added a new parameter “tls_priority” to /etc/libvirt/libvirtd.conf

The downside of using the application’s own configuration is that the system administrator has to go hunting through many different files to update each application. It is much nicer to have a central location where the TLS priority settings for all applications can be controlled. What is desired is a way for libvirt to be built such that it can tell gnutls to first look for a libvirt-specific priority string, and then fall back to the global priority string. To address this, patches were written for GNUTLS to extend its priority string syntax. It is now possible for libvirt to pass “@LIBVIRT,SYSTEM” to gnutls as the priority. It will thus read /etc/crypto-policies/back-ends/gnutls.config, first looking for an entry matching “LIBVIRT” and then looking for an entry matching “SYSTEM”. To go along with the gnutls change, there is also an enhancement to the update-crypto-policies tool to allow application-specific entries to be included when generating the /etc/crypto-policies/back-ends/gnutls.config file. It is thus possible to configure the libvirt priority string by simply creating a file /etc/crypto-policies/local.d/gnutls-libvirt.config containing the desired string and re-running update-crypto-policies.
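
For example, something along these lines ought to work (the priority string is only a placeholder, and I am assuming the generated gnutls.config uses simple KEY=priority-string entries):

# cat /etc/crypto-policies/local.d/gnutls-libvirt.config
LIBVIRT=SECURE256

# update-crypto-policies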

In summary, the libvirt default priority settings are now:

  • RHEL-6/7 – NORMAL – a string hard coded in gnutls at build time
  • Fedora < 25 – @SYSTEM – a priority level defined by the sysadmin based on /etc/crypto-policies/config
  • Fedora >= 25 – @LIBVIRT,SYSTEM – a raw priority string defined in /etc/crypto-policies/local.d/gnutls-libvirt.config, falling back to /etc/crypto-policies/config if not present.

In all cases it is still possible to customize in /etc/libvirt/libvirtd.conf via the tls_priority setting, but it is recommended to use the global system /etc/crypto-policies facility where possible.
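
That setting uses the usual libvirtd.conf syntax, e.g. (the value shown is only an illustration):

# /etc/libvirt/libvirtd.conf
tls_priority = "@SYSTEM"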

by Daniel Berrange at November 15, 2016 12:08 PM

November 11, 2016

Daniel Berrange

New libvirt website design

The previous libvirt website design dated from circa 2008, just a few years after the libvirt project started. We have grown a lot of content since that time, but the overall styling and layout of the libvirt website has not substantially changed. Compared to websites for more recently launched projects, libvirt was starting to look rather outdated. So I spent a little time coming up with a new design for the libvirt website to bring it into the modern era. There were two core aspects to the new design: simplify the layout and navigation, and implement new branding.

From the layout / navigation POV we have killed the massive expanding menu that was on the left hand side of every page. It was not serving its purpose very effectively since it required far too many clicks & page loads to access some of the most frequently needed content. The front page now has direct links to key pieces of content (as identified from our web access stats), while the rest of the pages are directly visible in a flat menu on the “docs” page. The download page has been overhauled to recognise the fact that libvirt is shipping more than just the core C library – we have language bindings, object model mappings, docs and test suites. Also new is a page directly targeting new contributors with information about how to get involved in the project and the kind of help we’re looking for. The final notable change is the use of some jquery magic to pull in a feed of blog posts on the site front page.

From the branding POV, we have taken the opportunity to re-create the project logo. We sadly lost the original master vector artwork used to produce the libvirt logo eons ago, so only had a png file of it in certain resolutions. When creating docbook content, we did have a new SVG created that was intended to mirror the original logo, but it was quite crudely drawn. None the less it was a useful basis to start from to create new master logo graphics. As a result we now have an attractively rendered new logo for the project, available in two variants – a standard square(-ish) format

Libvirt logo

and in a horizontal banner format

Libvirt logo banner

With the new logo prepared, we took the colour palette and font used in the graphic and applied both to the main website content, bringing together a consistent style.

Libvirt website v1 (2006-2008)

libvirt-website-v1-download
libvirt-website-v1-index

Libvirt website v2 (2008-2016)

libvirt-website-v2-index
libvirt-website-v2-download

Libvirt website v3 (2016-)

libvirt-website-v3-index
libvirt-website-v3-download

by Daniel Berrange at November 11, 2016 04:16 PM

November 08, 2016

Gerd Hoffmann

raspberry pi status update

You might have noticed meanwhile that Fedora 25 ships with raspberry pi support and might have wondered what this means for my packages and images.

The fedora images use a different partition layout than my images. Specifically, the fedora images have a separate vfat partition for the firmware and uboot, and the /boot partition with the linux kernels lives on ext2. My images have a vfat /boot partition with everything (firmware, uboot, kernels), and the rpms in my repo will only work properly on such an sdcard. You can’t mix & match stuff and there is no easy way to switch from my sdcard layout to the fedora one.

Current plan forward:

I will continue to build rpms for armv7 (32bit) for a while for existing installs. There will be no new fedora 25 images though. For new devices or reinstalls I recommend using the official fedora images instead.

Fedora 25 has no aarch64 (64bit) support, although it is expected to land in one of the next releases. Most likely I’ll create new Fedora 25 images for aarch64 (after final release), and of course I’ll continue to build kernel updates too.

Finally some words on the upstream kernel status:

The 4.8 dwc2 usb host adapter driver has some serious problems on the raspberry pi. 4.7 works ok, and so do the 4.9-rc kernels. But 4.7 doesn’t get stable updates any more, so I jumped straight to the 4.9-rc kernels for mainline. You might have noticed already if you updated your rpi recently. The raspberry pi foundation kernels don’t suffer from that issue as they use a different (not upstream) driver for the dwc.

by Gerd Hoffmann at November 08, 2016 01:54 PM

November 04, 2016

Daniel Berrange

ANNOUNCE: libvirt-glib release 1.0.0

I am pleased to announce that a new release of the libvirt-glib package, version 1.0.0, is now available from

https://libvirt.org/sources/glib/

The packages are GPG signed with

Key fingerprint: DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

Changes in this release:

  • Switch to new release numbering scheme, major digit incremented each year, minor for each release, micro for stable branches (if any)
  • Fix Libs.private variable in pkg-config file
  • Fix git introspection warnings
  • Add ability to set SPICE gl property
  • Add support for virtio video model
  • Add support for 3d accel property
  • Add support for querying video model
  • Add support for host device config for PCI devs
  • Add docs for more APIs
  • Avoid unused variable warnings
  • Fix check for libvirt optional features to use pkg-config
  • Delete manually written python binding. All apps should use PyGObject with gobject introspection.
  • Allow schema to be NULL on config objects
  • Preserve unknown devices listed in XML
  • Add further test coverage

libvirt-glib comprises three distinct libraries:

  • libvirt-glib – Integrate with the GLib event loop and error handling
  • libvirt-gconfig – Representation of libvirt XML documents as GObjects
  • libvirt-gobject – Mapping of libvirt APIs into the GObject type system

NB: While libvirt aims to be API/ABI stable forever, with libvirt-glib we are not currently guaranteeing that libvirt-glib libraries are permanently API/ABI stable. That said, we do not expect to break the API/ABI for the foreseeable future and will always strive to avoid it.

Follow up comments about libvirt-glib should be directed to the regular libvir-list@redhat.com development list.

Thanks to all the people involved in contributing to this release.

 

by Daniel Berrange at November 04, 2016 05:30 PM

Gerd Hoffmann

advanced network booting

I use network booting a lot. Very convenient, I rarely have to fiddle with boot iso images these days. My setup has evolved over the years, and I’m going to describe here how it looks today.

general intro

First thing needed is a machine which runs the dhcp server. In a typical setup the home internet router also provides the dhcp service, but usually you can’t configure the dhcp server much. So I turned off the dhcp server in the router and configured a linux machine with dhcpd instead. These days this is a raspberry pi, running fedora 24, serving dhcp and dns.

I used to run this on my x86 server acting as NAS, but that turned out to be a bad idea. We have power failures now and then, the NAS then checks the filesystems at boot, which can easily take half an hour, and there used to be no dhcp and dns service during that time. In contrast, the raspberry pi is back in service in less than a minute.

Obviously the raspberry pi itself can’t get an ip address via dhcp; it needs a static address assigned instead. systemd-networkd does the job with this ethernet.network config file:

[Match]
Name=eth0

[Network]
Address=192.168.2.10/24
Gateway=192.168.2.1
DNS=127.0.0.1
Domains=home.kraxel.org

Address=, Gateway= and Domain= must be adjusted according to your local network setup of course. The same applies to all other config file snippets following below.

dhcp setup

Ok, let’s walk through the dhcpd.conf file:

option arch code 93 = unsigned integer 16;      # RFC4578

Defines option arch which will be used later.

default-lease-time 86400;                       # 1d
max-lease-time 604800;                          # 7d

With long lease times there will be less chatter in the log file, and nobody will lose network connectivity in case the dhcpd server is down for a moment (for example when fiddling with the config and dhcpd not restarting properly due to syntax errors).

ddns-update-style none;
authoritative;

No ddns used here. I assign static ip addresses instead (more on this below).

subnet 192.168.2.0 netmask 255.255.255.0 {
        # network
        range 192.168.2.200 192.168.2.249;
        option routers 192.168.2.1;

Default ip address and network of my router (Fritzbox). The small range configured here is used for dynamic ip addresses only; the room below 200 is left for static ip addresses.

        # dns, ntp
        option domain-name "home.kraxel.org";
        option domain-name-servers 192.168.2.10;
        option ntp-servers ntp.home.kraxel.org;

Oh, right, I almost forgot to mention: the raspberry pi also runs ntp, so not each and every machine has to talk to the ntp pool servers. We announce that here (together with the dns server) so the dhcp clients can pick up that information.

Nothing tricky so far, now the network boot configuration starts:

        if (option arch = 00:0b) {
                # EFI aarch64
                next-server 192.168.2.14;
                filename "efi-a64/grubaa64.efi";

Here I use the option arch to figure out what client is booting, to pick a matching boot file. This is for 64bit arm machines, loading the grub2 efi binary. grub in turn will look for a grub.cfg file in the efi-a64/ directory, so placing stuff in sub-directories separates things nicely.

        } else if (option arch = 00:09) {
                # EFI x64 -- ipxe
                next-server 192.168.2.14;
                filename "efi-x64/shim.efi";
        } else if (option arch = 00:07) {
                # EFI x64 -- ovmf
                next-server 192.168.2.14;
                filename "efi-x64/shim.efi";

Same for x86_64. For some reason ovmf (with the builtin virtio-net driver) and ipxe efi drivers use different arch ids to signal x86_64. I simply list both here to get things going no matter what.

Update (Nov 9th): 7 is the correct value for x86_64. Looks like RfC 4578 got this wrong initially. There is an errata for this. ipxe is fixed meanwhile.

        } else if ((exists vendor-class-identifier) and
                   (option vendor-class-identifier = "U-Boot.armv8")) {
                # rpi 3
                next-server 192.168.2.108;
                filename "rpi3/boot.conf";

This is uboot (64bit) on a raspberry pi 3 trying to netboot. Well, any aarch64 to be exact, but I don’t own any other device. The boot.conf file is a dummy and doesn’t exist. uboot will pick up the rpi3/ sub-directory though, and after failing to load boot.conf it will try to boot pxelinux style, i.e. look for config files in the rpi3/pxelinux.cfg/ directory.

        } else if ((exists vendor-class-identifier) and
                   (option vendor-class-identifier = "U-Boot.armv7")) {
                # rpi 2
                next-server 192.168.2.108;
                filename "rpi2/boot.conf";

Same for armv7 (32bit) uboot, raspberry pi 2 in my case. raspberry pi 3 with 32bit uboot will land here too.

        } else if ((exists user-class) and ((option user-class = "iPXE") or
                                            (option user-class = "gPXE"))) {
                # bios -- gpxe/ipxe
                next-server 192.168.2.14;
                filename "http://192.168.2.14/tftpboot/pxelinux.0";

This is for ipxe/gpxe on classic bios-based x86 machines. ipxe can load files over http, and pxelinux running with ipxe as network driver can do this too. So we can specify an http url as filename, and it works fine and is faster than using tftp.

        } else {
                # bios -- chainload gpxe
                next-server 192.168.2.14;
                filename "gpxe-undionly.kpxe";
        }

Everything else chainloads gpxe. I think this has been unused for ages; new physical machines have EFI these days, and virtual machines already use gpxe/ipxe roms when running with seabios.

}
include "/etc/dhcp/home.4.dhcp.inc";
include "/etc/dhcp/xeni.4.dhcp.inc";

Here the static ip addresses are assigned. This is in an include file because these files are generated by a script. I basically have a file with a table listing hostname, mac address and ip address, and my script generates include files for dhcpd.conf and dns zone files from that. Inside the include files there are lots of entries looking like this:

host photosmart {
        hardware ethernet 10:60:4b:11:71:34;
        fixed-address 192.168.2.53;
}

So all machines get a static ip address assigned, based on the mac address. Together with the dns configuration this allows connecting to the machines by hostname.

In case you want a different netboot configuration for a specific host it is also possible to add next-server and filename entries here to override the network defaults configured above.
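
Such an override would look roughly like this (hostname, mac address and addresses are made up):

host devboard {
        hardware ethernet 52:54:00:aa:bb:cc;
        fixed-address 192.168.2.60;
        # boot this host from a different tftp server / boot file
        next-server 192.168.2.108;
        filename "rpi3/boot.conf";
}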

That’s it for dhcp.

tftp setup

Ok, a tftp server is needed too. That can run on the same machine, but it doesn’t have to. You can even have multiple machines serving tftp. You might have noticed that not all entries above have the same next-server. The raspberry pi entries point to my workstation where I cross-compile arm kernels now and then, so I can boot them directly. All other entries point to the NAS where all the install trees are located. Getting tftp running is easy these days:

  1. dnf install tftp-server
  2. systemctl enable tftp
  3. Place the boot files in /var/lib/tftpboot

Placing the boot files is left as an exercise for the reader; otherwise this becomes too long. Maybe I’ll do a separate post on the topic later.

For now just a general hint: in.tftpd has a --verbose switch. Turning that on and watching the log is a good way to see which files clients are trying to load. Helpful for trouble-shooting.
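
On a systemd host one way to turn that on could be a drop-in override for the tftp service (unit name and paths may differ between distributions):

# systemctl edit tftp.service
[Service]
ExecStart=
ExecStart=/usr/sbin/in.tftpd -s --verbose /var/lib/tftpboot

# journalctl -u tftp.service -f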

httpd setup

Only needed if you want to boot over http as mentioned above. Unless you already have apache httpd running, you have to install and enable it of course. Then you can drop this tftpboot.conf file into /etc/httpd/conf.d to make /var/lib/tftpboot available over http too:

<Directory "/var/lib/tftpboot">
        Options Indexes FollowSymLinks Includes
        AllowOverride None
        Require ip 127.0.0.1/8 192.168.0.0/16
</Directory>

Alias   "/tftpboot"      "/var/lib/tftpboot"

libvirt setup

So, of course I want to netboot my virtual machines too, even if they are on a NAT-ed libvirt network. In that case they are not using the dhcp server running on my raspberry pi, but the dnsmasq server started by libvirt on the virtualization host. Luckily it is possible to configure network booting in the network configuration, like this:

<network>
  <name>default</name>
  <forward mode='nat'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.123.10' end='192.168.123.99'/>
      <bootp file='http://192.168.2.14/tftpboot/pxelinux.0'/>
    </dhcp>
  </ip>
</network>

That’ll work just fine for seabios guests, but not when running ovmf. Unfortunately libvirt doesn’t support serving different boot files depending on client architecture. But there is an easy way out: just define a separate network for ovmf guests:

<network>
  <name>efi-x64</name>
  <forward mode='nat'/>
  <ip address='192.168.132.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.132.10' end='192.168.132.99'/>
      <bootp file='efi-x64/shim.efi' server='192.168.2.14'/>
    </dhcp>
  </ip>
</network>

Ok, almost there. While the tcp-based http protocol goes through the NAT forwarding without a hitch, the udp-based tftp protocol doesn’t. It needs some extra help. The nf_nat_tftp kernel module handles that. You can use the systemd module load service to make sure it gets loaded on each boot.
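
Something like this should do the trick, assuming systemd’s modules-load.d mechanism:

echo nf_nat_tftp > /etc/modules-load.d/tftp.conf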

using qemu directly

The qemu user network stack has netboot support too; you point it to the tftp server this way:

qemu-system-x86_64 \
        -netdev user,id=net0,bootfile=http://192.168.2.14/tftpboot/pxelinux.0 \
        -device virtio-net-pci,netdev=net0 \
        [ more args follow here ]

In case the tftpboot directory is available on the local machine it is also possible to use the builtin tftp server instead:

qemu-system-x86_64 \
        -netdev user,id=net0,tftp=/var/lib/tftpboot,bootfile=pxelinux.0 \
        -device virtio-net-pci,netdev=net0 \
        [ more args follow here ]

by Gerd Hoffmann at November 04, 2016 12:25 PM

Amit Shah

Ten Years of KVM: Article on LWN.net

As promised in the earlier post, I’ve written an article on some of the history and the journey of the KVM project:

https://lwn.net/Articles/705160/

I was initially going to just do a writeup on this blog, but I asked the folks at LWN if they were interested.. and they were!  This is my first article for LWN.  I’ve followed the site and the excellent content for a really long time, and now I’m very thrilled to also be an author.

by Amit Shah at November 04, 2016 05:07 AM

November 03, 2016

Peter Maydell

Installing Debian on QEMU’s 32-bit ARM “virt” board

In this post I’m going to describe how to set up Debian on QEMU emulating a 32-bit ARM “virt” board. There are a lot of older tutorials out there which suggest using boards like “versatilepb” or “vexpress-a9”, but these days “virt” is a far better choice for most people, so some documentation of how to use it seems overdue. (I may do a followup post for 64-bit ARM later.)

Why the “virt” board?

QEMU has models of nearly 50 different ARM boards, which makes it difficult for new users to pick one which is right for their purposes. This wild profusion reflects a similar diversity in the real hardware world: ARM systems come in many different flavours with very different hardware components and capabilities. A kernel which is expecting to run on one system will likely not run on another. Many of QEMU’s models are annoyingly limited because the real hardware was also limited — there’s no PCI bus on most mobile devices, after all, and a fifteen year old development board wouldn’t have had a gigabyte of RAM on it.

My recommendation is that if you don’t know for certain that you want a model of a specific device, you should choose the “virt” board. This is a purely virtual platform designed for use in virtual machines, and it supports PCI, virtio, a recent ARM CPU and large amounts of RAM. The only thing it doesn’t have out of the box is graphics, but graphical programs on a fully emulated system run very slowly anyway so are best avoided.

Why Debian?

Debian has had good support for ARM for a long time, and with the Debian Jessie release it has a “multiplatform” kernel, so there’s no need to build a custom kernel. Because we’re installing a full distribution rather than a cut-down embedded environment, any development tools you need inside the VM will be easy to install later.

Prerequisites and assumptions

I’m going to assume you have a Linux host, and a recent version of QEMU (at least QEMU 2.6). I also use libguestfs to extract files from a QEMU disk image, but you could use a different tool for that step if you prefer.

Getting the installer files

I suggest creating a subdirectory for these and the other files we’re going to create.

To install on QEMU we will want the multiplatform “armmp” kernel and initrd from the Debian website:

wget -O installer-vmlinuz http://http.us.debian.org/debian/dists/jessie/main/installer-armhf/current/images/netboot/vmlinuz
wget -O installer-initrd.gz http://http.us.debian.org/debian/dists/jessie/main/installer-armhf/current/images/netboot/initrd.gz

Saving them locally as installer-vmlinuz and installer-initrd.gz means they won’t be confused with the final kernel and initrd that the installation process produces.

(If we were installing on real hardware we would also need a “device tree” file to tell the kernel the details of the exact hardware it’s running on. QEMU’s “virt” board automatically creates a device tree internally and passes it to the kernel, so we don’t need to provide one.)

Installing

First we need to create an empty disk drive to install onto. I picked a 5GB disk but you can make it larger if you like.

qemu-img create -f qcow2 hda.qcow2 5G

Now we can run the installer:

qemu-system-arm -M virt -m 1024 \
  -kernel installer-vmlinuz \
  -initrd installer-initrd.gz \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-device,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-device,netdev=mynet \
  -nographic -no-reboot

(I would have preferred to use QEMU’s PCI virtio devices, but unfortunately the Debian kernel doesn’t support them; a future Debian release very likely will, which would allow you to use virtio-blk-pci and virtio-net-pci instead of virtio-blk-device and virtio-net-device.)

The installer will display its messages on the text console (via an emulated serial port). Follow its instructions to install Debian to the virtual disk; it’s straightforward, but if you have any difficulty the Debian release manual may help.
(Don’t worry about all the warnings the installer kernel produces about GPIOs when it first boots.)

The actual install process will take a few hours as it downloads packages over the network and writes them to disk. It will occasionally stop to ask you questions.

Late in the process, the installer will print the following warning dialog:

   +-----------------| [!] Continue without boot loader |------------------+
   |                                                                       |
   |                       No boot loader installed                        |
   | No boot loader has been installed, either because you chose not to or |
   | because your specific architecture doesn't support a boot loader yet. |
   |                                                                       |
   | You will need to boot manually with the /vmlinuz kernel on partition  |
   | /dev/vda1 and root=/dev/vda2 passed as a kernel argument.             |
   |                                                                       |
   |                              <Continue>                               |
   |                                                                       |
   +-----------------------------------------------------------------------+  

Press continue for now, and we’ll sort this out later.

Eventually the installer will finish by rebooting — this should cause QEMU to exit (since we used the -no-reboot option).

At this point you might like to make a copy of the hard disk image file, to save the tedium of repeating the install later.

Extracting the kernel

The installer warned us that it didn’t know how to arrange to automatically boot the right kernel, so we need to do it manually. For QEMU that means we need to extract the kernel the installer put into the disk image so that we can pass it to QEMU on the command line.

There are various tools you can use for this, but I’m going to recommend libguestfs, because it’s the simplest to use. To check that it works, let’s look at the partitions in our virtual disk image:

$ virt-filesystems -a hda.qcow2 
/dev/sda1
/dev/sda2

If this doesn’t work, then you should sort that out first. A couple of common reasons I’ve seen:

  • if you’re on Ubuntu then your kernels in /boot are installed not-world-readable; you can fix this with sudo chmod 644 /boot/vmlinuz*
  • if you’re running Virtualbox on the same host it will interfere with libguestfs’s attempt to run KVM; you can fix that by exiting Virtualbox

Looking at what’s in our disk we can see the kernel and initrd in /boot:

$ virt-ls -a hda.qcow2 /boot/
System.map-3.16.0-4-armmp-lpae
config-3.16.0-4-armmp-lpae
initrd.img
initrd.img-3.16.0-4-armmp-lpae
lost+found
vmlinuz
vmlinuz-3.16.0-4-armmp-lpae

and we can copy them out to the host filesystem:

$ virt-copy-out -a hda.qcow2 /boot/vmlinuz-3.16.0-4-armmp-lpae /boot/initrd.img-3.16.0-4-armmp-lpae .

(We want the longer filenames, because vmlinuz and initrd.img are just symlinks and virt-copy-out won’t copy them.)

An important warning about libguestfs, or any other tools for accessing disk images from the host system: do not try to use them while QEMU is running, or you will get disk corruption when both the guest OS inside QEMU and libguestfs try to update the same image.

Running

To run the installed system we need a different command line which boots the installed kernel and initrd, and passes the kernel the command line arguments the installer told us we’d need:

qemu-system-arm -M virt -m 1024 \
  -kernel vmlinuz-3.16.0-4-armmp-lpae \
  -initrd initrd.img-3.16.0-4-armmp-lpae \
  -append 'root=/dev/vda2' \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-device,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-device,netdev=mynet \
  -nographic

This should boot to a login prompt, where you can log in with the user and password you set up during the install.

The installation has an SSH client, so one easy way to get files in and out is to use “scp” from inside the VM to talk to an SSH server outside it. Or you can use libguestfs to write files directly into the disk image (for instance using virt-copy-in) — but make sure you only use libguestfs when the VM is not running, or you will get disk corruption.

by pm215 at November 03, 2016 10:33 PM

October 26, 2016

Thomas Huth

GRUB 2 network booting with Open Firmware

While it’s also possible to load single files via the Open Firmware interface already (by typing something like “boot net:” at the firmware prompt for example), you need a secondary boot loader like yaboot or GRUB 2 to handle more complex boot scenarios like loading both kernel and initrd.

GRUB 2 is able to load files via network, too, and this also works when it is running on an Open Firmware implementation like SLOF. However, while there are a couple of pages on the web that describe how to do network booting with GRUB 2 when it is running on an x86-based PC, I haven’t found a single page that tells you how to do network booting in GRUB when it is running on a POWER-based machine. So here’s how to do it – you basically have got to put something like this in your grub.cfg file:

 menuentry 'Network boot' --class os {
     insmod net
     insmod ofnet
     insmod tftp
     insmod gzio
     insmod part_gpt
     # TFTP server address (seems not to work via DHCP configuration): 
     set net_default_server=10.0.2.2
     # DHCP setup:
     net_bootp
     echo 'Network status: '
     net_ls_cards
     net_ls_addr
     net_ls_routes
     echo 'Loading Linux ...'
     linux (tftp)/linux
     echo 'Loading initial ramdisk ...'
     initrd (tftp)/initrd
 }

The tricky line here was the “insmod ofnet” (which adds the Open Firmware network driver), the other parts are pretty much the same as on other architectures.

October 26, 2016 03:51 PM

Welcome to Thomas Huth's blog!

Welcome to the blog of Thomas Huth. I’m working for Red Hat on various virtualization related Open Source software (mainly KVM, QEMU and SLOF on POWER machines), so you’ll find information related to these topics in this blog.

October 26, 2016 11:04 AM

October 19, 2016

Amit Shah

Ten Years of KVM

We recently celebrated 25 years of Linux on the 25th anniversary of the famous email Linus sent to announce the start of the Linux project.  Going by the same yardstick, today marks the 10th anniversary of the KVM project — Avi Kivity first announced the project on the 19th Oct, 2006 by this posting on LKML:

http://lkml.iu.edu/hypermail/linux/kernel/0610.2/1369.html

The first patchset added support for hardware virtualization on Linux for the Intel CPUs.  Support for AMD CPUs followed soon:

http://lkml.iu.edu/hypermail/linux/kernel/0611.3/0850.html

KVM was subsequently merged in the upstream kernel on the 10th December 2006 (commit 6aa8b732ca01c3d7a54e93f4d701b8aabbe60fb7).  Linux 2.6.20, released on 4 Feb 2007 was the first kernel release to include KVM.

KVM has come a long way in these 10 years.  I’m writing a detailed post about some of the history of the KVM project — stay tuned for that. [Update 3 Nov 2016: I’ve written that article now at LWN.net: https://lwn.net/Articles/705160/]

Till then, cheers!

by Amit Shah at October 19, 2016 04:35 PM

October 15, 2016

Alex Williamson

Intel processors with ACS support

If you've been keeping up with this blog then you understand a bit about IOMMU groups and device isolation.  In my howto series I describe the limitations of the Xeon E3 processor that I use in my example system and recommend Xeon E5 or higher processors to provide the best case device isolation for those looking to build a system.  Well, thanks to the vfio-users mailing list, it has come to my attention that there are in fact Core i7 processors with PCIe Access Control Services (ACS) support on the processor root ports.

Intel lists these processors as High End Desktop Processors; they include Socket 2011-v3 Haswell E processors, Socket 2011 Ivy Bridge E processors, and Socket 2011 Sandy Bridge E processors.  The linked datasheets for each family clearly list ACS register capabilities.  Current listings for these processors include:

Haswell-E (LGA2011-v3)
i7-5960X (8-core, 3/3.5GHz)
i7-5930K (6-core, 3.2/3.8GHz)
i7-5820K (6-core, 3.3/3.6GHz)

Ivy Bridge-E (LGA2011)
i7-4960X (6-core, 3.6/4GHz)
i7-4930K (6-core, 3.4/3.6GHz)
i7-4820K (4-core, 3.7/3.9GHz)

Sandy Bridge-E (LGA2011)
i7-3960X (6-core, 3.3/3.9GHz)
i7-3970X (6-core, 3.5/4GHz)
i7-3930K (6-core, 3.2/3.8GHz)
i7-3820 (4-core, 3.6/3.8GHz)

These also appear to be the only Intel Core processors compatible with Socket 2011 and 2011-v3 found on X79 and X99 motherboards, so basing your platform around these chipsets will hopefully lead to success.  My recommendation is based only on published specs, not first hand experience though, so your mileage may vary.
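
To double-check a given board, you can look for the ACS capability on the processor root ports with lspci (00:01.0 is just an example root port address):

sudo lspci -s 00:01.0 -vvv | grep -i 'access control'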

Unfortunately there are not yet any Skylake based "High End Desktop Processors" and from what we've seen on the mailing list, Skylake does not implement ACS on processor root ports, nor do we have quirks to enable isolation on the Z170 PCH root ports (which include a read-only ACS capability, effectively confirming lack of isolation), and integrated I/O devices on the motherboard exhibit poor grouping with system management components (aside from the onboard I219 LOM, which we do have quirked).  This makes the currently available Skylake platforms a really bad choice for doing device assignment.

Based on this new data, I'll revise my recommendation for Intel platforms to include Xeon E5 and higher processors or Core i7 High End Desktop Processors (as listed by Intel).  Of course there are combinations where regular Core i5, i7 and Xeon E3 processors will work well, we simply need to be aware of their limitations and factor that into our system design.

EDIT (Oct 30 04:18 UTC 2015): A more subtle feature also found in these E series processors is support for IOMMU super pages.  The datasheets for the E5 Xeons and these High End Desktop Processors indicate support for 2MB and 1GB IOMMU pages while the standard Core i5 and i7 only support 4KB pages.  This means less space wasted for the IOMMU page tables, more efficient table walks by the hardware, and less thrashing of the I/O TLB under I/O load resulting in I/O stalls.  Will you notice it?  Maybe.  VFIO will take advantage of IOMMU super pages any time we find a sufficiently sized range of contiguous pages.  To help ensure this happens, make use of hugepages in the VM.

EDIT (Oct 15 18:20 UTC 2016): Intel Broadwell-E processors have been out for some time and as we'd expect, the datasheets do indicate that ACS is supported.  So add to the list above:

Broadwell-E (LGA2011-v3)
i7-6950X (10-core, 3.0/3.5GHz)
i7-6900K (8-core, 3.2/3.7GHz)
i7-6850K (6-core, 3.6/3.8GHz)
i7-6800K (6-core, 3.4/3.6GHz)

by Alex Williamson (noreply@blogger.com) at October 15, 2016 01:21 PM

October 13, 2016

Alex Williamson

How to improve performance in Windows 7

A contribution from Thomas Lindroth on the vfio-users mailing list:
I thought I'd share a trick for improving the performance on win7 guests. The
tl;dr version is add <feature policy='disable' name='hypervisor'/> to the
<cpu> section of your libvirt xml like so:

<cpu mode='host-passthrough'>
<topology sockets='1' cores='3' threads='1'/>
<feature policy='disable' name='hypervisor'/>
</cpu>

The long story is that according to Microsoft's documentation "On systems
where the TSC is not suitable for timekeeping, Windows automatically selects
a platform counter (either the HPET timer or the ACPI PM timer) as the basis
for QPC." QPC = QueryPerformanceCounter() which is a windows api for getting
timing info. Some redhat documentation says: "Windows 7 do not use the TSC as
a time source if the hypervisor-present bit is set". Instead it falls back on
acpi_pm or hpet if hpet is enabled in the xml.

The hypervisor present bit is a fake cpuid flag qemu and other hypervisors
injects to show the guest it's running under a hypervisor. This is different
from the KVM signature that can be hidden with <kvm><hidden state='on'>.
With the hypervisor flag disabled in libvirt xml windows 7 started using TSC
as timing source for me.

Nvidia has a "Timer Function Performance" benchmark on their web page to
measure overhead from timers. With acpi_pm the timer query took 3,605ns on
average and with TSC 12.52ns. Passmark's CPU floating point performance
benchmark, which query timers 265,000 times/sec, went from 3952 points with
acpi_pm to 5594 points with TSC. The reason TSC is so much faster is because
both acpi_pm and hpet are emulated by qemu in userspace and TSC is handled by
KVM in kernel space.

All games I've tested use the timer at least 25,000 times/sec. I'm guessing
it's the graphics drivers doing that. Some games like Outlast query the timer
~275,000 times/sec. The performance for those games is basically limited by
how fast the host can do context switches. I expect the performance
improvement with TSC is great in those games. Unfortunately 3dmark's fire
strike benchmark still does 25,000 queries/sec to the acpi_pm even with the
hypervisor flag hidden. There must be some other windows api for using the
"platform counter" as Microsoft calls it but most games don't use it.

Unless you are using windows 7 you'll probably not benefit from this. Windows
10 is probably using the hypervclock instead. That redhat documentation
talking about the hypervisor bit was actually a guide for how to turn off TSC
to "resolve guest timing issues". I don't experience any problems myself but
if you got one of those "clocksource tsc unstable" systems this might not
work so well.

Google says this is the NVIDIA timer benchmark.

It would be interesting to see how the hyper-v extensions compare and whether tests like Fire Strike actually make use of them. Thanks Thomas!

(copied with permission)

by Alex Williamson (noreply@blogger.com) at October 13, 2016 05:09 PM

September 28, 2016

Gerd Hoffmann

Using virtio-gpu with libvirt and spice

It’s been a while since the last virtio-gpu status report. So, here we go with an update.

I gave a talk about qemu graphics at KVM Forum 2016 in Toronto, covering (among other things) virtio-gpu. Here are the slides. I’ll summarize the important stuff below for those who want to play with it. The new blog picture above is the Toronto skyline btw.

First the good news: Almost everything needed is upstream meanwhile, so using virtio-gpu (with virgl aka opengl acceleration) is a lot easier these days. Fedora 24 supports it out-of-the-box for both host and guest.

Requirements for the guest:

  • linux kernel 4.4 (version 4.2 without opengl)
  • mesa 11.1
  • xorg server 1.19, or 1.18 with commit "5627708 dri2: add virtio-gpu pci ids" backported

Requirements for the host:

  • qemu 2.6
  • virglrenderer
  • spice server 0.13.2 (development release)
  • spice-gtk 0.32 (used by virt-viewer & friends)
    Note that 0.32 got a new shared library major version, therefore the tools using this must be compiled against that version.
  • mesa 10.6
  • libepoxy 1.3.1

The libvirt domain config snippet:

<graphics type='spice'>
  <listen type='none'/>
  <gl enable='yes'/>
</graphics>
<video>
  <model type='virtio'/>
</video>

And the final important bit is that spice needs a unix socket connection for opengl to work, and you’ll get it by attaching the viewer to qemu this way:

virt-viewer --attach $domain

If everything went fine you should see this in the guest’s linux kernel log:

root@fedora ~# dmesg | grep '\[drm\]'
[drm] Initialized drm 1.1.0 20060810
[drm] pci: virtio-vga detected
[drm] virgl 3d acceleration enabled
[drm] virtio vbuffers: 272 bufs, 192B each, 51kB total.
[drm] number of scanouts: 1
[drm] number of cap sets: 1
[drm] cap set 0: id 1, max-version 1, max-size 308
[drm] Initialized virtio_gpu 0.0.1 0 on minor 0

And glxinfo should print something like this:

root@fedora ~# glxinfo | grep ^OpenGL
OpenGL vendor string: Red Hat
OpenGL renderer string: Gallium 0.4 on virgl
OpenGL core profile version string: 3.3 (Core Profile) Mesa 12.0.2
OpenGL core profile shading language version string: 3.30
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 3.0 Mesa 12.0.2
OpenGL shading language version string: 1.30
OpenGL context flags: (none)
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.0 Mesa 12.0.2
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.00
OpenGL ES profile extensions:

by Gerd Hoffmann at September 28, 2016 01:34 PM

September 26, 2016

Alex Williamson

Passing QEMU command line options through libvirt

This one comes from Stefan's blog with some duct tape and bailing wire courtesy of Laine.  I've talked previously about using wrapper scripts to launch QEMU, which typically use sed to insert options that libvirt doesn't know about.  This is by far better than defining full vfio-pci devices using <qemu:arg> options, which many guides suggest, but it hides the devices from libvirt and causes all sorts of problems with device permissions and locked memory, etc.  But, there's a nice compromise as Stefan shows in his last example at the link above.  Say we only want to add x-vga=on to one of our hostdev entries in the libvirt domain XML.  We can do something like this:

<qemu:commandline>
  <qemu:arg value='-set'/>
  <qemu:arg value='device.hostdev0.x-vga=on'/>
</qemu:commandline>

The effect is that we add the option x-vga=on to the hostdev0 device, which is defined via a normal <hostdev> section in the XML and gets all the device permission and locked memory love from libvirt.  So which device is hostdev0?  Well, things get a little mushy there.  libvirt invents the names based on the order of the hostdev entries in the XML, so you can simply count (starting from zero) to pick the entry for the additional option.  It's a bit cleaner overall than needing to manage a wrapper script separately from the VM.  Also, don't forget that to use <qemu:commandline> you need to first enable the QEMU namespace in the XML by updating the first line in the domain XML to:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

Otherwise libvirt will promptly discard the extra options when you save the domain.

by Alex Williamson (noreply@blogger.com) at September 26, 2016 06:00 PM

"Intel-IOMMU: enabled": It doesn't mean what you think it means

A quick post just because I keep seeing this in practically every how-to guide I come across.  The instructions grep dmesg for "IOMMU" and come up with either "Intel-IOMMU: enabled" or "DMAR: IOMMU enabled". Clearly that means it's enabled, right? Wrong. That line comes from a __setup() function that parses the options for "intel_iommu=". Nothing has been done at that point, not even a check to see if VT-d hardware is present. Pass intel_iommu=on as a boot option to an AMD system and you'll see this line. Yes, this is clearly not a very intuitive message. So for the record, the mouthful that you should be looking for is this line:

DMAR: Intel(R) Virtualization Technology for Directed I/O

or on older kernels the prefix is different:

PCI-DMA: Intel(R) Virtualization Technology for Directed I/O

When you see this, you're pretty much past all the failure points of initializing VT-d. FWIW, the "DMAR" flavors of the above appeared in v4.2, so on a more recent kernel, that's your better option.
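
If you just want a quick one-liner that works on both old and new kernels, grepping for the common part of the message is enough (a trivial sketch):

$ dmesg | grep "Directed I/O"

If that prints the line above (with either the DMAR or PCI-DMA prefix), VT-d initialization got past the failure points; if it prints nothing, it did not.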

by Alex Williamson (noreply@blogger.com) at September 26, 2016 05:15 PM

September 17, 2016

Stefan Hajnoczi

Making I/O deterministic with blkdebug breakpoints

Recently I was looking for a deterministic way to reproduce a QEMU bug that occurred when a guest reboots while an IDE/ATA TRIM (discard) request is in flight. You can find the bug report here.

A related problem is how to trigger the code path where request A is in flight when request B is issued. Being able to do this is useful for writing test suites that check I/O requests interact correctly with each other.

Both of these scenarios require the ability to put an I/O request into a specific state and keep it there. This makes the request deterministic so there is no chance of it completing too early.

QEMU has a disk I/O error injection framework called blkdebug. Block drivers like qcow2 tell blkdebug about request states, making it possible to fail or suspend requests at certain points. For example, it's possible to suspend an I/O request when qcow2 decides to free a cluster.

The blkdebug documentation mentions the break command that can suspend a request when it reaches a specific state.

Here is how we can suspend a request when qcow2 decides to free a cluster:


$ qemu-system-x86_64 -drive if=ide,id=ide-drive,file=blkdebug::test.qcow2,format=qcow2
(qemu) qemu-io ide-drive "break cluster_free A"

QEMU prints a message when the request is suspended. It can be resumed with:


(qemu) qemu-io ide-drive "resume A"

The tag name 'A' is arbitrary and you can pick your own name or even suspend multiple requests at the same time.

Automated tests need to wait until a request has suspended. This can be done with the wait_break command:


(qemu) qemu-io ide-drive "wait_break A"

For another example of blkdebug breakpoints, see the qemu-iotests 046 test case.

by stefanha (noreply@blogger.com) at September 17, 2016 04:08 PM

September 12, 2016

Gerd Hoffmann

Using virtio-input with libvirt

The new virtio input devices are not that new any more. Support was merged in qemu 2.4 (host) and linux kernel 4.1 (guest), which means that most distributions should have picked up support for virtio-input by now. libvirt gained support for virtio-input too (version 1.3.0 & newer), so using virtio-input devices is as simple as adding

<input type='tablet' bus='virtio'/>

to your domain xml configuration, or replacing the usb tablet with a virtio tablet, possibly eliminating the need to have a usb host adapter in your virtual machine, as often the usb tablet is the only usb device.
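
For completeness, the same device can be added without libvirt directly on the QEMU command line (a minimal sketch):

$ qemu-system-x86_64 -device virtio-tablet-pci ...other args...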

There are also virtio keyboard and mouse devices. Using them on x86 isn’t very useful as every virtual machine has ps/2 keyboard and mouse anyway. For ppc64, arm and aarch64 architectures the virtio keyboard is a possible alternative to the usb keyboard:

<input type='keyboard' bus='virtio'/>

At the moment the firmware (edk2/slof) lacks support for virtio keyboards, so switching from usb to virtio loses the ability to do any keyboard input before the linux kernel driver loads, i.e. you can’t operate the grub boot menu. I hope this changes in the future.

If you have to stick to the usb keyboard or usb tablet due to missing guest drivers for virtio input, I strongly suggest using xhci as the usb host adapter:

<controller type='usb' model='nec-xhci'/>
<input type='tablet' bus='usb'/>

xhci emulation needs noticeably fewer cpu cycles when compared to uhci, ohci and ehci host adapters. That of course requires xhci driver support in the guest, but that should be less of an issue these days. Windows 7 is probably the only guest still in widespread use without xhci support.

by Gerd Hoffmann at September 12, 2016 12:41 PM

September 01, 2016

Alex Williamson

And now you're an expert

Video from my KVM Forum 2016 talk:


by Alex Williamson (noreply@blogger.com) at September 01, 2016 02:07 PM

August 30, 2016

Kashyap Chamarthy

LinuxCon talk slides: “A Practical Look at QEMU’s Block Layer Primitives”

Last week I spent time at LinuxCon (and the co-located KVM Forum) Toronto. I presented a talk on QEMU’s block layer primitives. Specifically, the QMP primitives block-commit, drive-mirror, drive-backup, and QEMU’s built-in NBD (Network Block Device) server.

Here are the slides.


by kashyapc at August 30, 2016 10:22 AM

August 24, 2016

Alex Williamson

KVM Forum 2016 - An Introduction to PCI Device Assignment with VFIO

Slides available here:

http://awilliam.github.io/presentations/KVM-Forum-2016

Video to come

by Alex Williamson (noreply@blogger.com) at August 24, 2016 03:46 PM

August 16, 2016

Daniel Berrange

Improving QEMU security part 7: TLS support for migration

This blog is part 7 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

The live migration feature in QEMU allows a running VM to be moved from one host to another with no noticeable interruption in service and minimal performance impact. The live migration data stream will contain a serialized copy of the state of all emulated devices, along with all the guest RAM. In some versions of QEMU it is also used to transfer disk image content, but in modern QEMU use of the NBD protocol is preferred for this purpose. The guest RAM in particular can contain sensitive data that needs to be protected against any would-be attackers on the network between source and target hosts. There are a number of ways to provide such security using external tools/services including VPNs, IPsec, SSH/stunnel tunnelling. The libvirtd daemon often already has a secure connection between the source and destination hosts for its own purposes, so many years back support was added to libvirt to automatically tunnel the live migration data stream over libvirt’s own secure connection. This solved both the encryption and authentication problems at once, but there are some downsides to this approach. Tunnelling the connection means extra data copies for the live migration traffic, and when we look at guests with RAM many GB in size, the number of data copies will start to matter. The libvirt tunnel only supports tunnelling a single data connection, and in future QEMU may well wish to use multiple TCP connections for the migration data stream to improve performance of post-copy. The use of NBD for storage migration is not supported with tunnelling via libvirt, since it would require extra connections too. IOW while tunnelling over libvirt was a useful short term hack to provide security, it has outlived its practicality.

It is clear that QEMU needs to support TLS encryption natively on its live migration connections. The QEMU migration code has historically had its own distinct I/O layer called QEMUFile which mixes up tracking of migration state with the connection establishment and I/O transfer support. As mentioned in a previous blog post, QEMU now has a general purpose I/O channel framework, so the bulk of the work involved converting the migration code over to use the QIOChannel classes and APIs, which greatly reduced the amount of code in the QEMU migration/ sub-folder as well as simplifying it somewhat. The TLS support involves the addition of two new parameters to the migration code. First, the “tls-creds” parameter provides the ID of a previously created TLS credential object, thus enabling use of TLS on the migration channel. This must be set on both the source and target QEMUs involved in the migration.

On the target host, QEMU would be launched with a set of TLS credentials for a server endpoint:

$ qemu-system-x86_64 -monitor stdio -incoming defer \
    -object tls-creds-x509,dir=/home/berrange/security/qemutls,endpoint=server,id=tls0 \
    ...other args...

To enable incoming TLS migration, two monitor commands are then used:

(qemu) migrate_set_str_parameter tls-creds tls0
(qemu) migrate_incoming tcp:myhostname:9000

On the source host, QEMU is launched in a similar manner, but using client endpoint credentials:

$ qemu-system-x86_64 -monitor stdio \
    -object tls-creds-x509,dir=/home/berrange/security/qemutls,endpoint=client,id=tls0 \
    ...other args...

To enable outgoing TLS migration, two monitor commands are then used:

(qemu) migrate_set_str_parameter tls-creds tls0
(qemu) migrate tcp:otherhostname:9000

The migration code supports a number of different protocols besides just “tcp:“. In particular it allows an “fd:” protocol to tell QEMU to use a passed-in file descriptor, and an “exec:” protocol to tell QEMU to launch an external command to tunnel the connection. It is desirable to be able to use TLS with these protocols too, but when using TLS the client QEMU needs to know the hostname of the target QEMU in order to correctly validate the x509 certificate it receives. Thus, a second “tls-hostname” parameter was added to allow QEMU to be informed of the hostname to use for x509 certificate validation when using a non-tcp migration protocol. This can be set on the source QEMU prior to starting the migration using the “migrate_set_str_parameter” monitor command

(qemu) migrate_set_str_parameter tls-hostname myhost.mydomain
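
As a hedged example (the hostname and the netcat tunnel are purely illustrative), an outgoing migration over the “exec:” protocol with TLS would then look something like:

(qemu) migrate_set_str_parameter tls-creds tls0
(qemu) migrate_set_str_parameter tls-hostname myhost.mydomain
(qemu) migrate "exec:nc myhost.mydomain 9000"

The target QEMU needs the matching setup, i.e. its own tls-creds parameter and a migrate_incoming command using whatever transport the exec: command connects to.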

This feature has been under development for a while and finally merged into QEMU GIT early in the 2.7.0 development cycle, so will be available for use when 2.7.0 is released in a few weeks. With the arrival of the 2.7.0 release there will finally be TLS support across all QEMU host services where TCP connections are commonly used, namely VNC, SPICE, NBD, migration and character devices.

In this blog series:

by Daniel Berrange at August 16, 2016 01:00 PM

Improving QEMU security part 6: TLS support for character devices

This blog is part 6 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

A number of QEMU device models and objects use character devices for providing connectivity with the outside world, including the QEMU monitor, serial ports, parallel ports, virtio serial channels, RNG EGD object, CCID smartcard passthrough, IPMI device, USB device redirection and vhost-user. While some of these will only ever need a character device configured with local connectivity, some will certainly need to make use of TCP connections to remote hosts. Historically these connections have always been entirely in clear text, which is unacceptable in the modern hostile network environment where even internal networks cannot be trusted. Clearly the QEMU character device code requires the ability to use TLS for encrypting sensitive data and providing some level of authentication on connections.

The QEMU character device code was mostly using GLib’s GIOChannel framework for doing I/O, but this has a number of unsatisfactory limitations. It cannot do vectored I/O, is not easily extensible and does not concern itself at all with initial connection establishment. These are all reasons why the QIOChannel framework was added to QEMU. So the first step in supporting TLS on character devices was to convert the code over to use QIOChannel instead of GIOChannel. With that done, adding in support for TLS was quite straightforward, merely requiring addition of a new configuration property (“tls-creds”) to set the desired TLS credentials.

For example, to run a QEMU VM with a serial port listening on IP 10.0.0.1, port 9000, acting as a TLS server:

$ qemu-system-x86_64 \
      -object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/qemutls \
      -chardev socket,id=s0,host=10.0.0.1,port=9000,tls-creds=tls0,server \
      -device isa-serial,chardev=s0
      ...other QEMU options...
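
For reference, the directory passed via dir= is expected to contain PEM files with well-known names. For a server endpoint that means something like this (a sketch, using the path from the example above):

$ ls /home/berrange/qemutls
ca-cert.pem  server-cert.pem  server-key.pem

A client endpoint instead looks for client-cert.pem and client-key.pem alongside ca-cert.pem.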

It is possible to test connectivity to this TLS server using the gnutls-cli tool:

$ gnutls-cli --priority=NORMAL -p 9000 \
--x509cafile=/home/berrange/security/qemutls/ca-cert.pem \
127.0.0.1

In the above example, QEMU was running as a TCP server, and acting as the TLS server endpoint, but this matching is not required. It is valid to configure it to run as a TLS client if desired, though this would be somewhat uncommon.

Of course you can connect 2 QEMU VMs together, both using TLS. Assuming the above QEMU is still running, we can launch a second QEMU connecting to it with

$ qemu-system-x86_64 \
      -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/qemutls \
      -chardev socket,id=s0,host=10.0.0.1,port=9000,tls-creds=tls0 \
      -device isa-serial,chardev=s0
      ...other QEMU options...

Notice that we’ve changed “endpoint” to client and removed the “server” option, so this second QEMU runs as a TCP client and acts as the TLS client endpoint.

This feature is available since the QEMU 2.6.0 release a few months ago.

In this blog series:

by Daniel Berrange at August 16, 2016 12:11 PM

August 10, 2016

Zeeshan Ali Khattak

Life is change

Quite a few major life events happened (or are happening) this summer, so I thought I'd blog about them and some of the experiences I had.

New job & new city/country

Yes, I found it hard to believe too that I'd ever be leaving Red Hat and the best manager I ever had (no offence to others, but competing with Matthias is just impossible), but I'll be moving to Gothenburg to join the Pelagicore folks as a Software Architect in just 2 weeks. I have always found Swedish to be a very cute language, so I'm looking forward to my attempt at learning it. If only I had learnt Swedish rather than Finnish when I was in Finland.

BTW, I'm selling all my furniture so if you're in London and need some furniture, get in touch!

Fresh helicopter pilot

So after two years of hard work and sinking myself into bank loans, I finally did it! Last week, I passed the skills test for the Private Pilot License (Helicopters) and am currently waiting anxiously for my license to come through (it usually takes at least two weeks). Once I have that, I can rent helicopters and take passengers with me. I'll be able to share the costs with passengers but I'm not allowed to make money out of it. The test was very tough and I came very close to failing at one particular point. The good news is that despite me being very tense and the very windy conditions on test day, the biggest negative point from my examiner was that I was being over-cautious and hence very slow. So I think it wasn't so bad.



There are a few differences to a driving test. A minor one is that in a driving test, you are not expected to explain your steps but simply execute, whereas in the skills test for flying, you're expected to think everything out loud. But the biggest difference is that in a driving test, you are not expected to drive on your own until you pass the test, whereas in a flying test, you are required to have flown solo for at least 10 hours, which needs to include a solo cross country flight of at least 100 nautical miles (185 km) involving 3 major aerodromes.  Mine involved Estree, Cranfield and Duxford. I've been GPS logging while flying so I can show you the log of my qualifying solo cross country flight (click here to see details and notes):



I've still got a long way to go towards the Commercial License, but at least now I can share the costs with friends, so building hours towards the commercial license won't be so expensive (I hope). I've found a nice company in Gothenburg that trains in and rents helicopters, so I'm very much looking forward to flying over the coast there. Wanna join? Let me know. :)

by noreply@blogger.com (zeenix) at August 10, 2016 01:24 PM

August 05, 2016

Gerd Hoffmann

Two new images uploaded

Uploaded two new images.

First, a centos7 image for the raspberry pi 3 (arm64). It is very similar to the fedora images, see the other raspberrypi posts for details.

Second, an armv7 qemu image with the grub2 boot loader. It boots with efi firmware (edk2/tianocore). There is no need to copy kernel+initrd from the image and pass that to qemu, making the boot process much more convenient. Looks like this will not land in Fedora though. The Fedora maintainers prefer to stick with u-boot for armv7 and plan to improve u-boot instead so it works better with qemu. That will of course solve the boot issue too, but it doesn’t exist today.

Usage: Fetch edk2.git-arm from the firmware repo. Use this to define a libvirt guest:

<domain type='qemu'>
  <name>fedora-armv7</name>
  <memory unit='KiB'>1048576</memory>
  <os>
    <type arch='armv7l' machine='virt'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/edk2.git/arm/QEMU_EFI-pflash.raw</loader>
    <nvram template='/usr/share/edk2.git/arm/vars-template-pflash.raw'/>
  </os>
  <features>
    <gic version='2'/>
  </features>
  <cpu mode='custom' match='exact'>
    <model fallback='allow'>cortex-a15</model>
  </cpu>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/path/to/arm-qemu-f24-uefi.raw'/>
      <target dev='sda' bus='scsi'/>
      <boot order='1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </controller>
    <console/>
  </devices>
</domain>
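
Assuming the XML above is saved as fedora-armv7.xml (an illustrative filename), the guest can then be defined and started the usual way:

$ virsh define fedora-armv7.xml
$ virsh start fedora-armv7
$ virsh console fedora-armv7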

Update 1: root password for the qemu uefi image is “uefi”.
Update 2: comments closed due to spam flood, sorry.

by Gerd Hoffmann at August 05, 2016 08:24 AM

July 15, 2016

Alex Williamson

Intel Graphics assignment

Hey folks, it feels like it's time to mention that assignment of Intel graphics devices (IGD) is currently available in qemu.git and will be part of the upcoming QEMU 2.7 release.  There's already pretty thorough documentation of the modes available in the source tree, please give it a read.  There are two modes described there, "legacy" and "Universal Passthrough" (UPT), each have their pros and cons.  Which ones are available to you depends on your hardware.  UPT mode is only available for Broadwell and newer processors while legacy mode is available all the way back through SandyBridge.  If you have a processor older than SandyBridge, stop now, this is not going to work for you.  If you don't know what any of these strange names mean, head to Wikipedia and Ark to figure it out.

The high level overview is that "legacy" mode is much like our GeForce support, the IGD is meant to be the primary and exclusive graphics in the VM.  Additionally the IGD address in the VM must be at PCI 00:02.0, only Seabios is currently supported, only the 440FX chipset model is supported (no Q35), the IGD device must be the primary host graphics device, and the host needs to be running kernel v4.6 or newer.  Clearly assigning the host primary graphics is a bit of an about-face for our GPU assignment strategy, but we depend on running the IGD video ROM, which depends on VGA and imposes most of the above requirements as well (oh add CONFIG_VFIO_PCI_VGA to the requirements list).  I have yet to see an IGD ROM with UEFI support, which is why OVMF is not yet supported, but seems possible to support with a CSM and some additional code in OVMF.

Legacy mode should work with both Linux and Windows guests (and hopefully others if you're so inclined).  The i915 driver does suffer from the typical video driver problem that sometimes the whole system explodes (not literally) when unbinding or re-binding the IGD to the driver.  Personally I avoid this by blacklisting the i915 driver.  Of course as some have found out trying to do this with discrete GPUs, there are plenty of other drivers ready to jump on the device to keep the console working.  The primary ones I've seen are vesafb and efifb, which one is used on your system depends on your host firmware settings, legacy BIOS vs UEFI respectively.  To disable these, simply add video=vesafb:off or video=efifb:off to the kernel command line (not sure which to use?  try both, video=vesafb:off,efifb:off).  The first thing you'll notice when you boot an IGD system with i915 blacklisted and the more basic framebuffer drivers disabled is that you don't get anything on the graphics head after grub.  Plan for this.  I use a serial console, but perhaps you're more comfortable running blind and willing to hope the system boots and you can ssh into it remotely.
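
As a concrete sketch of the above (file names and the exact mechanism are distro-specific, so treat this as illustrative), blacklisting i915 and disabling the basic framebuffers could look like:

# /etc/modprobe.d/blacklist-i915.conf
blacklist i915

# appended to the kernel command line, e.g. via GRUB_CMDLINE_LINUX in /etc/default/grub
video=vesafb:off video=efifb:off

Depending on the distro you may also need to regenerate the initramfs so the blacklist is honored there, and re-run the grub configuration update after editing the command line.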

If you've followed along with this procedure, you should be able to simply create a <hostdev> entry in your libvirt XML, which ought to look something like this:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </hostdev>

Again, assigning the IGD device (which is always 00:02.0) to address 00:02.0 in the VM is required.  Delete the <video> and <graphics> sections and everything should just magically work.  Caveat emptor, my newest CPU is Broadwell, I've been told this works with Skylake, but IGD is hardly standardized and each new implementation seems to tweak things just a bit.

Some of you are probably also curious why this doesn't work on Q35, which leads into the discussion of UPT  mode; IGD clearly is not a discrete GPU, but "integrated" not only means that the GPU is embedded in the system, in this case it means that the GPU is kind of smeared across the system.  This is why IGD assignment hasn't "just worked" and why you need a host kernel with support for exposing certain regions through vfio and a BIOS that's aware of IGD, and it needs to be at a specific address, etc, etc, etc.  One of those requirements is that the video ROM actually also cares about a few properties of the device at PCI address 00:1f.0, the ISA/LPC bridge.  Q35 includes its own bridge at that location and we cannot simply modify the IDs of that bridge for compatibility reasons.  Therefore that bridge being an implicit part of Q35 means that IGD assignment doesn't work on Q35.  This also means that PCI address 00:1f.0 is not available for use in a 440FX machine.

Ok, so UPT.  Intel has known for a while that the sprawl of IGD has made it difficult to deal with for device assignment.  To combat this, both software and hardware changes have been made that help to consolidate IGD to be more assignment-friendly.  Great news, right?  Well sort of.  First off, in UPT mode the IGD is meant to be a secondary graphics device in the VM, there's no VGA mode support (oh, BTW, x-vga=on is automatically added by QEMU in legacy mode).  In fact, um, there's no output support of any kind by default in UPT mode.  How's this useful you ask, well between the emulated graphics and IGD you can set up mirroring so you actually have a remote-capable, hardware accelerated graphics VM.  Plus, if you add the option x-igd-opregion=on to the vfio-pci device, you can get output to a physical display, but there again you're going to need the host running kernel v4.6 or newer and the upcoming QEMU 2.7 support, while no-output UPT has probably actually worked for quite a while.  UPT mode has no requirements for the IGD PCI address, but note that most VM firmware, SeaBIOS or OVMF, will define the primary graphics as the one having the lowest PCI address.  Usually not a problem, but some of you create some crazy configs.  You'll also still need to do all the blacklisting and video disabling above, or just risk binding and unbinding i915 from the host, gambling each time whether it'll explode.

So UPT sounds great except why is this opregion thing optional?  Well, it turns out that if you want to do that cool mirroring thing I mention above and a physical output is enabled with the opregion, you actually need to have a monitor attached to the device or else your apps don't get any hardware acceleration love.  Whereas if IGD doesn't know about any outputs, it's happy to apply hardware acceleration regardless of what's physically connected.  Sucks, but readers here should already know how to create wrapper scripts to add this extra option if they want it (similar to x-vga=on).  I don't think Intel really wants to support this hacky hybrid mode either, thus the experimental x- option prefix tag.
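
One way to add it without a separate wrapper script is the <qemu:commandline> trick from the "Passing QEMU command line options through libvirt" post elsewhere on this page. A hedged sketch, assuming the IGD is the first assigned device and therefore named hostdev0 by libvirt:

<qemu:commandline>
  <qemu:arg value='-set'/>
  <qemu:arg value='device.hostdev0.x-igd-opregion=on'/>
</qemu:commandline>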

Oh, one more gotcha for UPT mode, Intel seems to expect otherwise, but I've had zero success trying to run Linux guests with UPT.  Just go ahead and assume this is for your Windows guests only at this point.

What else... laptop displays should work, I believe switching outputs even works, but working on laptops is rather inconvenient since you're unlikely to have a serial console available.  Also note that while you can use input-linux to attach a laptop keyboard and mouse (not trackpad IME), I don't know how to make the hotkeys work, so that's a bummer.  Some IGD devices will generate DMAR error spew on the host when assigned, particularly the first time per host boot.  Don't be too alarmed by this, especially if it stops before the display is initialized.  This seems to be caused by resetting the IGD in an IOMMU context where it can't access its page tables setup by the BIOS/host.  Unless you have an ongoing spew of these, they can probably be ignored.  If you have something older than SandyBridge that you wish you could use this with and continued reading even after told to stop, sorry, there was a hardware change at SandyBridge and I don't have anything older to test with and don't really want to support additional code for such outdated hardware.  Besides, those are pretty old and you need an excuse for an upgrade anyway.

With this support I've switched my desktop system so that the host actually runs from a USB stick and the previous bare-metal Fedora install is virtualized with IGD, running alongside my existing GeForce VM.  Give it a try and good luck.

by Alex Williamson (noreply@blogger.com) at July 15, 2016 05:34 PM

July 01, 2016

Daniel Berrange

ANNOUNCE: libosinfo 0.3.1 released

I am happy to announce a new release of libosinfo, version 0.3.1 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.

Changes in this release include:

  • Require glib2 >= 2.36
  • Replace GSimpleAsyncResult usage with GTask
  • Fix VPATH based builds
  • Don’t include autogenerated enum files in dist
  • Fix build with older GCC versions
  • Add/improve/fix data for
    • Debian
    • SLES/SLED
    • OpenSUSE
    • FreeBSD
    • Windows
    • RHEL
    • Ubuntu
  • Update README content
  • Fix string comparison for bootable media detection
  • Fix linker flags for OS-X & solaris
  • Fix darwin detection code
  • Fix multiple memory leaks

Thanks to everyone who contributed towards this release.

A special note to downstream vendors/distributors.

The next major release of libosinfo will include a major change in the way libosinfo is released and distributed. The current single release will be replaced with three independently released artefacts:

  • libosinfo – this will continue to provide the libosinfo shared library and most associated command line tools
  • osinfo-db – this will contain only the database XML files and RNG schema, no code at all.
  • osinfo-db-tools – a set of command line tools for managing deployment of osinfo-db archives for vendors & users.

The libosinfo and osinfo-db-tools releases will continue to be fairly infrequent, as they are today. The osinfo-db releases will be done very frequently, with automated releases made available no more than 1 day after updated DB content is submitted to the project.

by Daniel Berrange at July 01, 2016 10:31 AM

