Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

Subscriptions

Planet Feeds

August 15, 2017

Daniel Berrange

ANNOUNCE: virt-viewer 6.0 release

I am happy to announce a new bugfix release of virt-viewer 6.0 (gpg), including experimental Windows installers for Win x86 MSI (gpg) and Win x64 MSI (gpg). The virsh and virt-viewer binaries in the Windows builds should now successfully connect to libvirtd, following fixes to libvirt’s mingw port.

Signatures are created with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)
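For example, verifying a downloaded tarball against that key might look like this (a sketch; the exact tarball and signature file names depend on what you downloaded):

gpg --recv-keys DAF3A6FDB26B62912D0E8E3FBE86EBB415104FDF
gpg --verify virt-viewer-6.0.tar.gz.asc virt-viewer-6.0.tar.gz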

All historical releases are available from:

http://virt-manager.org/download/

Changes in this release include:

  • Mention use of ssh-agent in man page
  • Display connection issue warnings in main window
  • Switch to GTask API
  • Add support for changing CD ISO with oVirt foreign menu
  • Update various outdated links in README
  • Avoid printing password in debug logs
  • Pass hostname to authentication dialog
  • Fix example URLs in man page
  • Add args to virt-viewer to specify whether to resolve VM based on ID, UUID or name
  • Fix misc runtime warnings
  • Improve support for extracting listening info from XML
  • Enable connecting to SPICE over UNIX socket
  • Fix warnings with newer GCCs
  • Allow controlling zoom level with keypad
  • Don’t close app during seamless migration
  • Don’t show toolbar in kiosk mode
  • Re-show auth dialog in kiosk mode
  • Don’t show error when cancelling auth
  • Change default screenshot name to ‘Screenshot.png’
  • Report errors when saving screenshot
  • Fix build with latest glib-mkenums

Thanks to everyone who contributed towards this release.

by Daniel Berrange at August 15, 2017 02:20 PM

ANNOUNCE: libosinfo 1.1.0 release

I am happy to announce a new release of libosinfo version 1.1.0 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.

Changes in this release include:

  • Force UTF-8 locale for new glib-mkenums
  • Avoid python warnings in example program
  • Misc test suite updates
  • Fix typo in error messages
  • Remove ISO header string padding
  • Disable bogus gcc warning about unsafe loop optimizations
  • Remove reference to fedorahosted.org
  • Don’t hardcode /usr/bin/perl, use /usr/bin/env
  • Support eject-after-install parameter in OsinfoMedia
  • Fix misc warnings in docs
  • Fix error propagation when loading DB
  • Add usb.ids / pci.ids locations for FreeBSD
  • Don’t include private headers in gir/vapi generation

Thanks to everyone who contributed towards this release.

by Daniel Berrange at August 15, 2017 11:09 AM

August 10, 2017

QEMU project

Deprecation of old parameters and features

QEMU has a lot of interfaces (like command line options or HMP commands) and old features (like certain devices) which are considered deprecated since other, more generic or better interfaces/features have been established instead. While the QEMU developers generally try to keep each QEMU release compatible with the previous ones, this legacy sometimes gets in the way when developing new code and/or imposes quite a maintenance burden.

Thus we are currently considering getting rid of some of the old interfaces and features in a future release, and have started to collect a list of such old items in our QEMU documentation. If you are running QEMU directly, please have a look at the deprecation chapter of the QEMU documentation to see whether you are still using one of these old interfaces or features, so you can adapt your setup to use the new interfaces/features instead. Or, if you think that one of the items should not be removed from QEMU at all, please speak up on the qemu-devel mailing list to explain why the interface or feature is still required.

by Thomas Huth at August 10, 2017 08:45 AM

July 29, 2017

Stefan Hajnoczi

Tracing userspace static probes with perf(1)

The perf(1) tool added support for userspace static probes in Linux 4.8. Userspace static probes are pre-defined trace points in userspace applications. Application developers add them so frequently needed lifecycle events are available for performance analysis, troubleshooting, and development.

Static userspace probes are more convenient than defining your own function probes from scratch. You can save time by using them and not worrying about where to add probes because that has already been done for you.

On my Fedora 26 machine the QEMU, gcc, and nodejs packages ship with static userspace probes. QEMU offers probes for vcpu events, disk I/O activity, device emulation, and more.

Without further ado, here is how to trace static userspace probes with perf(1)!

Scan the binary for static userspace probes

The perf(1) tool needs to scan the application's ELF binaries for static userspace probes and store the information in $HOME/.debug/usr/:

# perf buildid-cache --add /usr/bin/qemu-system-x86_64

List static userspace probes

Once the ELF binaries have been scanned you can list the probes as follows:

# perf list sdt_*:*

List of pre-defined events (to be used in -e):

sdt_qemu:aio_co_schedule [SDT event]
sdt_qemu:aio_co_schedule_bh_cb [SDT event]
sdt_qemu:alsa_no_frames [SDT event]
...

Let's trace something!

First add probes for the events you are interested in:

# perf probe sdt_qemu:blk_co_preadv
Added new event:
sdt_qemu:blk_co_preadv (on %blk_co_preadv in /usr/bin/qemu-system-x86_64)

You can now use it in all perf tools, such as:

perf record -e sdt_qemu:blk_co_preadv -aR sleep 1

Then capture trace data as follows:

# perf record -a -e sdt_qemu:blk_co_preadv
^C
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 2.274 MB perf.data (4714 samples) ]

The trace can be printed using perf-script(1):

# perf script
qemu-system-x86 3425 [000] 2183.218343: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=0 arg4=512 arg5=0
qemu-system-x86 3425 [001] 2183.310712: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=0 arg4=512 arg5=0
qemu-system-x86 3425 [001] 2183.310904: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=512 arg4=512 arg5=0
...

If you want to get fancy it's also possible to write trace analysis scripts with perf-script(1). That's a topic for another post but see the --gen-script= option to generate a skeleton script.
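As a starting point, a minimal sketch (this assumes a perf.data file recorded as above and a perf build with Python scripting support; perf-script.py is the file name perf generates by default):

# perf script --gen-script=python    # writes a skeleton perf-script.py for the events in perf.data
# perf script -s perf-script.py      # run the analysis script over the recorded trace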

Current limitations

As of July 2017 there are a few limitations to be aware of:

Probe arguments are automatically numbered and do not have human-readable names. You will see arg1, arg2, etc and will need to reference the probe definition in the application source code to learn the meaning of the argument. Some versions of perf(1) may not even print arguments automatically since this feature was added later.

The contents of string arguments are not printed, only the memory address of the string.

Probes called from multiple call-sites in the application result in multiple perf probes. For example, if probe foo is called from 3 places you get sdt_myapp:foo, sdt_myapp:foo_1, and sdt_myapp:foo_2 when you run perf probe --add sdt_myapp:foo.

The SystemTap semaphores feature is not supported and such probes will not fire unless you manually set the semaphore inside your application or from another tool like GDB. This means that the sdt_myapp:foo will not fire if the application uses the MYAPP_FOO_ENABLED() macro like this: if (MYAPP_FOO_ENABLED()) MYAPP_FOO();.

Some history and alternative tools

Static userspace probes were popularized by DTrace's <sys/sdt.h> header. Tracers that came after DTrace implemented the same interface for compatibility.

On Linux the initial tool for static userspace probes was SystemTap. In fact, the <sys/sdt.h> header file on my Fedora 26 system is still part of the systemtap-sdt-devel package.

More recently the GDB debugger gained support for static userspace probes. See the Static Probe Points documentation if you want to use userspace static probes from GDB.

Conclusion

It's very handy to have static userspace probing available alongside all the other perf(1) tracing features. There are a few limitations to keep in mind but if your tracing workflow is based primarily around perf(1) then you can now begin using static userspace probes without relying on additional tools.

by stefanha (noreply@blogger.com) at July 29, 2017 12:22 PM

July 24, 2017

Peter Maydell

Installing Debian on QEMU’s 64-bit ARM “virt” board

This post is a 64-bit companion to an earlier post of mine where I described how to get Debian running on QEMU emulating a 32-bit ARM “virt” board. Thanks to commenter snak3xe for reminding me that I’d said I’d write this up…

Why the “virt” board?

For 64-bit ARM QEMU emulates many fewer boards, so “virt” is almost the only choice, unless you specifically know that you want to emulate one of the 64-bit Xilinx boards. “virt” supports PCI, virtio, a recent ARM CPU and large amounts of RAM. The only thing it doesn’t have out of the box is graphics.

Prerequisites and assumptions

I’m going to assume you have a Linux host, and a recent version of QEMU (at least QEMU 2.8). I also use libguestfs to extract files from a QEMU disk image, but you could use a different tool for that step if you prefer.

I’m going to document how to set up a guest which directly boots the kernel. It should also be possible to have QEMU boot a UEFI image which then boots the kernel from a disk image, but that’s not something I’ve looked into doing myself. (There may be tutorials elsewhere on the web.)

Getting the installer files

I suggest creating a subdirectory for these and the other files we’re going to create.

wget -O installer-linux http://http.us.debian.org/debian/dists/stretch/main/installer-arm64/current/images/netboot/debian-installer/arm64/linux
wget -O installer-initrd.gz http://http.us.debian.org/debian/dists/stretch/main/installer-arm64/current/images/netboot/debian-installer/arm64/initrd.gz

Saving them locally as installer-linux and installer-initrd.gz means they won’t be confused with the final kernel and initrd that the installation process produces.

(If we were installing on real hardware we would also need a “device tree” file to tell the kernel the details of the exact hardware it’s running on. QEMU’s “virt” board automatically creates a device tree internally and passes it to the kernel, so we don’t need to provide one.)

Installing

First we need to create an empty disk drive to install onto. I picked a 5GB disk but you can make it larger if you like.

qemu-img create -f qcow2 hda.qcow2 5G

Now we can run the installer:

qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel installer-linux \
  -initrd installer-initrd.gz \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic -no-reboot

The installer will display its messages on the text console (via an emulated serial port). Follow its instructions to install Debian to the virtual disk; it’s straightforward, but if you have any difficulty the Debian installation guide may help.

The actual install process will take a few hours as it downloads packages over the network and writes them to disk. It will occasionally stop to ask you questions.

Late in the process, the installer will print the following warning dialog:

   +-----------------| [!] Continue without boot loader |------------------+
   |                                                                       |
   |                       No boot loader installed                        |
   | No boot loader has been installed, either because you chose not to or |
   | because your specific architecture doesn't support a boot loader yet. |
   |                                                                       |
   | You will need to boot manually with the /vmlinuz kernel on partition  |
   | /dev/vda1 and root=/dev/vda2 passed as a kernel argument.             |
   |                                                                       |
   |                              <Continue>                               |
   |                                                                       |
   +-----------------------------------------------------------------------+  

Press continue for now, and we’ll sort this out later.

Eventually the installer will finish by rebooting — this should cause QEMU to exit (since we used the -no-reboot option).

At this point you might like to make a copy of the hard disk image file, to save the tedium of repeating the install later.
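For example (a trivial sketch; the backup file name is arbitrary):

cp hda.qcow2 hda.qcow2.fresh-install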

Extracting the kernel

The installer warned us that it didn’t know how to arrange to automatically boot the right kernel, so we need to do it manually. For QEMU that means we need to extract the kernel the installer put into the disk image so that we can pass it to QEMU on the command line.

There are various tools you can use for this, but I’m going to recommend libguestfs, because it’s the simplest to use. To check that it works, let’s look at the partitions in our virtual disk image:

$ virt-filesystems -a hda.qcow2 
/dev/sda1
/dev/sda2

If this doesn’t work, then you should sort that out first. A couple of common reasons I’ve seen:

  • if you’re on Ubuntu then your kernels in /boot are installed not-world-readable; you can fix this with sudo chmod 644 /boot/vmlinuz*
  • if you’re running Virtualbox on the same host it will interfere with libguestfs’s attempt to run KVM; you can fix that by exiting Virtualbox

Looking at what’s in our disk we can see the kernel and initrd in /boot:

$ virt-ls -a hda.qcow2 /boot/
System.map-4.9.0-3-arm64
config-4.9.0-3-arm64
initrd.img
initrd.img-4.9.0-3-arm64
initrd.img.old
lost+found
vmlinuz
vmlinuz-4.9.0-3-arm64
vmlinuz.old

and we can copy them out to the host filesystem:

virt-copy-out -a hda.qcow2 /boot/vmlinuz-4.9.0-3-arm64 /boot/initrd.img-4.9.0-3-arm64 .

(We want the longer filenames, because vmlinuz and initrd.img are just symlinks and virt-copy-out won’t copy them.)

An important warning about libguestfs, or any other tools for accessing disk images from the host system: do not try to use them while QEMU is running, or you will get disk corruption when both the guest OS inside QEMU and libguestfs try to update the same image.

If you subsequently upgrade the kernel inside the guest, you’ll need to repeat this step to extract the new kernel and initrd, and then update your QEMU command line appropriately.
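For example, if the guest later moved to a hypothetical 4.9.0-4 kernel, the extraction would look like this (again, only with the VM shut down):

virt-ls -a hda.qcow2 /boot/
virt-copy-out -a hda.qcow2 /boot/vmlinuz-4.9.0-4-arm64 /boot/initrd.img-4.9.0-4-arm64 .

and you would then point -kernel and -initrd at the new files in the command line below.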

Running

To run the installed system we need a different command line which boots the installed kernel and initrd, and passes the kernel the command line arguments the installer told us we’d need:

qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel vmlinuz-4.9.0-3-arm64 \
  -initrd initrd.img-4.9.0-3-arm64 \
  -append 'root=/dev/vda2' \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic

This should boot to a login prompt, where you can log in with the user and password you set up during the install.

The installation has an SSH client, so one easy way to get files in and out is to use “scp” from inside the VM to talk to an SSH server outside it. Or you can use libguestfs to write files directly into the disk image (for instance using virt-copy-in) — but make sure you only use libguestfs when the VM is not running, or you will get disk corruption.
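As a sketch (the host name and file paths are placeholders, and the virt-copy-in step must only be run while the guest is shut down):

scp /path/to/file user@host.example.com:          # run inside the running guest
virt-copy-in -a hda.qcow2 somefile /home/user/    # run on the host, guest powered off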

by pm215 at July 24, 2017 09:25 AM

July 21, 2017

Ladi Prosek

Nesting Hyper-V in QEMU/KVM: Known issues

This is a follow-up to Running Hyper-V in a QEMU/KVM Guest published earlier this year. The article provided instructions on setting up Hyper-V in a QEMU/KVM Windows guest as enabled by a particular KVM patchset (on Intel hardware only, as it turned out later). Several issues have been found since then; some already fixed, some in the process of being fixed, and some still not fully understood.

This post aims to be an up-to-date list of issues related to Hyper-V on KVM, showing their current status and, where applicable, upstream commit IDs. The issues are ordered chronologically from the oldest ones to those found recently.

  • Hyper-V on KVM does not work at all (initial work item)
    Status: Fixed in kernel 4.10 (7ca29de213, ee146c1c10, 9ed38ffad4, 1dc35dacc1)
    Bug tracker: RHBZ 1326138
  • Hyper-V on KVM does not work on new Intel CPUs with PML
    Status: Fixed in kernel 4.11 (ab007cc94f, 1fb883bb82)
    Bug tracker: RHBZ 1440022
  • Hyper-V on KVM does not work on AMD CPUs
    Status: Fixed in kernel 4.12 for 1 vCPU (405a353a0e) and in kernel 4.13 for >1 vCPU (4aebd0e9ca, ab2f4d73eb, 9b61174793, a12713c25b, 1a5e185294)
    Bug tracker: RHBZ 1440025
  • rtl8139 and e1000 QEMU network cards don’t work with Hyper-V enabled
    Status: Not fixed yet
    Bug tracker: RHBZ 1452546
  • L2 Linux guest in Hyper-V on KVM hangs on boot
    Status: Fixed in kernel 4.13 (2cf0284223, 71c2a2d0a8)
    Bug tracker: RHBZ 1457866
  • Windows TSC page does not work with Hyper-V enabled
    Status: Not fixed yet
    Bug tracker: RHBZ 1464412


by ladipro at July 21, 2017 07:34 AM

July 13, 2017

Stefan Hajnoczi

Packet capture coming to AF_VSOCK

For anyone interested in the AF_VSOCK zero-configuration host<->guest communications channel it's important to be able to observe traffic. Packet capture is commonly used to troubleshoot network problems and debug networking applications. Up until now it hasn't been available for AF_VSOCK.

In 2016 Gerard Garcia created the vsockmon Linux driver that enables AF_VSOCK packet capture. During the course of his excellent Google Summer of Code work he also wrote patches for libpcap, tcpdump, and Wireshark.

Recently I revisited Gerard's work because Linux 4.12 shipped with the new vsockmon driver, making it possible to finalize the userspace support for AF_VSOCK packet capture. And it's working beautifully:

I have sent the latest patches to the tcpdump and Wireshark communities so AF_VSOCK can be supported out-of-the-box in the future. For now you can also find patches in my personal repositories:

The basic flow is as follows:


# ip link add type vsockmon
# ip link set vsockmon0 up
# tcpdump -i vsockmon0
# ip link set vsockmon0 down
# ip link del vsockmon0

It's easiest to wait for distros to package Linux 4.12 and future versions of libpcap, tcpdump, and Wireshark. If you decide to build from source, make sure to build libpcap first and then tcpdump or Wireshark. The libpcap dependency is necessary so that tcpdump/Wireshark can access AF_VSOCK traffic.
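A rough sketch of that build order, assuming the patched source trees are checked out side by side (tcpdump's configure normally picks up a sibling ../libpcap build automatically):

cd libpcap && ./configure && make && cd ..
cd tcpdump && ./configure && make && cd ..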

by stefanha (noreply@blogger.com) at July 13, 2017 04:31 PM

Gerd Hoffmann

Fresh Fedora 26 images uploaded

Fedora 26 is out of the door, and here are fresh fedora 26 images.

There are raspberry pi images. The aarch64 image requires a model 3, the armv7 image boots on both model 2 and model 3. Unlike the images for the previous fedora releases the new images use the standard fedora kernels instead of a custom kernel. So, the kernel update service for the older images will stop within the next few weeks.

There are efi images for qemu. The i386 and x86_64 images use systemd-boot as bootloader. grub2 doesn’t work due to bug 1196114 (unless you create a boot menu entry manually in uefi setup). The arm images use grub2 as bootloader; armv7 isn’t supported by systemd-boot in the first place, and the aarch64 version throws an exception. The efi images can also be booted as containers, using "systemd-nspawn --boot --image <file>", but you have to convert them to raw first.
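The raw conversion itself is a one-liner with qemu-img (the file names here are placeholders; adjust them, and the input format, to whatever you actually downloaded):

qemu-img convert -O raw fedora-26-efi-x86_64.qcow2 fedora-26-efi-x86_64.raw
systemd-nspawn --boot --image fedora-26-efi-x86_64.raw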

The images don’t have a root password. You have to set one using virt-customize -a <image> --root-password "password:<secret>", otherwise you can’t log in after boot.

The images have been created with imagefish.

by Gerd Hoffmann at July 13, 2017 12:33 PM

June 27, 2017

Richard Jones

virt-builder Debian 9 image available

Debian 9 (“Stretch”) was released last week and now it’s available in virt-builder, the fast way to build virtual machine disk images:

$ virt-builder -l | grep debian
debian-6                 x86_64     Debian 6 (Squeeze)
debian-7                 sparc64    Debian 7 (Wheezy) (sparc64)
debian-7                 x86_64     Debian 7 (Wheezy)
debian-8                 x86_64     Debian 8 (Jessie)
debian-9                 x86_64     Debian 9 (stretch)

$ virt-builder debian-9 \
    --root-password password:123456
[   0.5] Downloading: http://libguestfs.org/download/builder/debian-9.xz
[   1.2] Planning how to build this image
[   1.2] Uncompressing
[   5.5] Opening the new disk
[  15.4] Setting a random seed
virt-builder: warning: random seed could not be set for this type of guest
[  15.4] Setting passwords
[  16.7] Finishing off
                   Output file: debian-9.img
                   Output size: 6.0G
                 Output format: raw
            Total usable space: 3.9G
                    Free space: 3.1G (78%)

$ qemu-system-x86_64 \
    -machine accel=kvm:tcg -cpu host -m 2048 \
    -drive file=debian-9.img,format=raw,if=virtio \
    -serial stdio

by rich at June 27, 2017 09:01 AM

June 04, 2017

Richard Jones

New in libguestfs: Rewriting bits of the daemon in OCaml

libguestfs is a C library for creating and editing disk images. In the most common (but not the only) configuration, it uses KVM to sandbox access to disk images. The C library talks to a separate daemon running inside a KVM appliance, as in this Unicode-art diagram taken from the fine manual:

 ┌───────────────────┐
 │ main program      │
 │                   │
 │                   │           child process / appliance
 │                   │          ┌──────────────────────────┐
 │                   │          │ qemu                     │
 ├───────────────────┤   RPC    │      ┌─────────────────┐ │
 │ libguestfs  ◀╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍▶ guestfsd        │ │
 │                   │          │      ├─────────────────┤ │
 └───────────────────┘          │      │ Linux kernel    │ │
                                │      └────────┬────────┘ │
                                └───────────────│──────────┘
                                                │
                                                │ virtio-scsi
                                         ┌──────┴──────┐
                                         │  Device or  │
                                         │  disk image │
                                         └─────────────┘

The library has to be written in C because it needs to be linked to any main program. The daemon (guestfsd in the diagram) is also written in C. But there’s not so much a specific reason for that, except that’s what we did historically.

The daemon is essentially a big pile of functions, most corresponding to a libguestfs API. Writing the daemon in C is painful to say the least. Because it’s a long-running process running in a memory-constrained environment, we have to be very careful about memory management, religiously checking every return from malloc, strdup etc., making even the simplest task non-trivial and full of untested code paths.

So last week I modified libguestfs so you can now write APIs in OCaml if you want to. OCaml is a high level language that compiles down to object files, and it’s entirely possible to link the daemon from a mix of C object files and OCaml object files. Another advantage of OCaml is that you can call from C ↔ OCaml with relatively little glue code (although a disadvantage is that you still need to write that glue mostly by hand). Most simple calls turn into direct CALL instructions with just a simple bitshift required to convert between ints and bools on the C and OCaml sides. More complex calls passing strings and structures are not too difficult either.

OCaml also turns memory errors into a single exception, which unwinds the stack cleanly, so we don’t litter the code with memory handling. We can still run the mixed C/OCaml binary under valgrind.

Code gets quite a bit shorter. For example the case_sensitive_path API — all string handling and directory lookups — goes from 183 lines of C code to 56 lines of OCaml code (and much easier to understand too).

I’m reimplementing a few APIs in OCaml, but the plan is definitely not to convert them all. I think we’ll have C and OCaml APIs in the daemon for a very long time to come.


by rich at June 04, 2017 01:14 PM

May 16, 2017

QEMU project

Presentations from DevConf 2017

There were a couple of QEMU / virtualization related talks at the DevConf 2017 conference, which took place back at the end of January, but so far we had missed gathering the links to the recordings of these talks. So here is the list now:

by Thomas Huth at May 16, 2017 02:00 PM

May 13, 2017

Nathan Gauër

GSoC | Log#2: API Forwarding

May 13, 2017 06:52 PM

GSoC | Log#1: Project presentation

May 13, 2017 06:31 PM

Linux graphic stack: an overview

May 13, 2017 03:37 PM

April 25, 2017

Gerd Hoffmann

meson experiments

Seems the world of build systems is going to change. The traditional approach is make, pimped up with autoconf and automake. But some newcomers provide new approaches to building your projects.

First, there is ninja-build. It’s a workhorse, roughly comparable to make. It isn’t really designed to be used standalone though. Typically the lowlevel ninja build files are generated by some highlevel build tool, similar to how Makefiles are generated by autotools.

Second, there is meson, a build tool which (on unix) by default uses ninja as backend. meson appears to become pretty popular.

So, let's have a closer look at it. I’m working on drminfo right now, a tool to dump information about drm devices, which also comes with a simple test tool, rendering a test image to the display. It is pretty small and doesn’t even use autotools, perfect for trying out something new. Also nice for this post, as the build files are pretty small.

So, here is the Makefile:

CC      ?= gcc
CFLAGS  ?= -Os -g -std=c99
CFLAGS  += -Wall

TARGETS := drminfo drmtest gtktest

drminfo : CFLAGS += $(shell pkg-config --cflags libdrm cairo pixman-1)
drminfo : LDLIBS += $(shell pkg-config --libs libdrm cairo pixman-1)

drmtest : CFLAGS += $(shell pkg-config --cflags libdrm gbm epoxy cairo cairo-gl pixman-1)
drmtest : LDLIBS += $(shell pkg-config --libs libdrm gbm epoxy cairo cairo-gl pixman-1)
drmtest : LDLIBS += -ljpeg

gtktest : CFLAGS += $(shell pkg-config --cflags gtk+-3.0 cairo pixman-1)
gtktest : LDLIBS += $(shell pkg-config --libs gtk+-3.0 cairo pixman-1)
gtktest : LDLIBS += -ljpeg

all: $(TARGETS)

clean:
        rm -f $(TARGETS)
        rm -f *~ *.o

drminfo: drminfo.o drmtools.o
drmtest: drmtest.o drmtools.o render.o image.o
gtktest: gtktest.o render.o image.o

Thanks to pkg-config there is no need to use autotools just to figure out the cflags and libraries needed, and the Makefile is short and easy to read. The only thing here you might not be familiar with is target-specific variables.

Now, compare with the meson.build file:

project('drminfo', 'c')

# pkg-config deps
libdrm_dep    = dependency('libdrm')
gbm_dep       = dependency('gbm')
epoxy_dep     = dependency('epoxy')
cairo_dep     = dependency('cairo')
cairo_gl_dep  = dependency('cairo-gl')
pixman_dep    = dependency('pixman-1')
gtk3_dep      = dependency('gtk+-3.0')

# libjpeg dep
jpeg_dep      = declare_dependency(link_args : '-ljpeg')

drminfo_srcs  = [ 'drminfo.c', 'drmtools.c' ]
drmtest_srcs  = [ 'drmtest.c', 'drmtools.c', 'render.c', 'image.c' ]
gtktest_srcs  = [ 'gtktest.c', 'render.c', 'image.c' ]

drminfo_deps  = [ libdrm_dep, cairo_dep, pixman_dep ]
drmtest_deps  = [ libdrm_dep, gbm_dep, epoxy_dep,
                  cairo_dep, cairo_gl_dep, pixman_dep, jpeg_dep ]
gtktest_deps  = [ gtk3_dep,
                  cairo_dep, cairo_gl_dep, pixman_dep, jpeg_dep ]

executable('drminfo',
           sources      : drminfo_srcs,
           dependencies : drminfo_deps)
executable('drmtest',
           sources      : drmtest_srcs,
           dependencies : drmtest_deps)
executable('gtktest',
           sources      : gtktest_srcs,
           dependencies : gtktest_deps,
           install      : false)

A pretty straightforward translation. So, what are the differences?

First, meson and ninja have built-in support for a bunch of features. No need to put anything into your build files to use them, they are just there:

  • Automatic header dependencies. When a header file changes all source files which include it get rebuilt.
  • Automatic build system dependencies: When meson.build changes the ninja build files are updated.
  • Rebuilds on command changes: When the build command line for a target changes the target is rebuilt.

Sure, you can do all that with make too, the linux kernel build system does it for example. But then your Makefiles will be an order of magnitude larger than the one shown above, because all the clever stuff is in the build files instead of the build tool.

Second, meson keeps the object files strictly separated by target. The project has some source files shared by multiple executables. drmtools.c for example is used by both drminfo and drmtest. With the Makefile above it gets built once; meson builds it separately for each target, with the cflags for the specific target.

Another nice feature is that ninja automatically does parallel builds. It figures the number of processors available and runs (by default) that many jobs.
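For completeness, here is roughly how such a meson project is configured and built (a sketch, not part of the original Makefile comparison; "build" is just the name of the build directory):

meson build        # configure: writes the ninja build files into ./build
ninja -C build     # compile, in parallel by default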

Overall I’m pretty pleased, and I’ll probably use meson more frequently in the future. If you want to try it out too, I’d suggest starting with the tutorial.

by Gerd Hoffmann at April 25, 2017 08:36 AM

April 05, 2017

Thomas Huth

KVM with SELinux on a z/VM s390x machine

When you are trying to start a KVM guest via libvirt on an s390x Linux installation that is running on an older version of z/VM, you might run into the problem that QEMU refuses to start with this error message:

cannot set up guest memory 's390.ram': Permission denied.

This happens because older versions of z/VM (before version 6.3) do not support the so-called “enhanced suppression on protection facility” (ESOP) yet, so QEMU has to allocate the memory for the guest with a “hack”, and this hack uses mmap(… PROT_EXEC …) for the allocation.

Now this mmap() call is not allowed by the default SELinux rules (at least not on RHEL-based systems), so QEMU fails to allocate the memory for the guest here. Turning off SELinux completely just to run a KVM guest is of course a bad idea, but fortunately there is already a SELinux boolean value called virt_use_execmem which can be used to tune the behavior here:

setsebool virt_use_execmem 1

This configuration switch was originally introduced for running TCG guests (i.e. running QEMU without KVM), but in this case it also fixes the problem with KVM guests. Anyway, since setting this SELinux variable to 1 is also a slight decrease in security (though still better than disabling SELinux completely), you should rather upgrade your z/VM to version 6.3 (or newer) or use a real LPAR for the KVM host installation instead, if that is feasible.
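For reference, setsebool without further options only changes the running policy; you can check the current value and make the change persistent across reboots like this:

getsebool virt_use_execmem
setsebool -P virt_use_execmem 1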

April 05, 2017 12:20 PM

March 24, 2017

Cole Robinson

Easy qemu commandline passthrough with virt-xml

Libvirt has supported qemu commandline option passthrough for qemu/kvm VMs for quite a while. The format for it is a bit of a pain though since it requires setting a magic xmlns value at the top of the domain XML. Basically doing it by hand kinda sucks.

In the recently released virt-manager 1.4.1, we added a virt-install/virt-xml option --qemu-commandline that tweaks option passthrough for new or existing VMs. So for example, if you wanted to add the qemu option string '-device FOO' to an existing VM named f25, you can do:

  ./virt-xml f25 --edit --confirm --qemu-commandline="-device FOO"

The output will look like:

--- Original XML
+++ Altered XML
@@ -1,4 +1,4 @@
-<domain type="kvm">
+<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
<name>f25</name>
<uuid>9b6f1795-c88b-452a-a54c-f8579ddc18dd</uuid>
<memory unit="KiB">4194304</memory>
@@ -104,4 +104,8 @@
<address type="pci" domain="0x0000" bus="0x00" slot="0x0a" function="0x0"/>
</rng>
</devices>
+ <qemu:commandline>
+ <qemu:arg value="-device"/>
+ <qemu:arg value="foo"/>
+ </qemu:commandline>
</domain>

Define 'f25' with the changed XML? (y/n):

by Cole Robinson (noreply@blogger.com) at March 24, 2017 10:30 PM

March 19, 2017

QEMU project

QEMU in the blogs: February 2017

Here is a short list of articles and blog posts about QEMU and KVM that were posted last month.

More virtualization blog posts can be found on the virt tools planet.

In other news, QEMU is now in hard freeze for release 2.9.0. The preliminary list of features is on the wiki.

by Paolo Bonzini at March 19, 2017 09:30 AM

March 12, 2017

Richard Jones

Tip: Run virt-inspector on a compressed disk (with nbdkit)

virt-inspector is a very convenient tool to examine a disk image and find out if it contains an operating system, what applications are installed and so on.

If you have an xz-compressed disk image, you can run virt-inspector on it without uncompressing it, using the magic of captive nbdkit. Here’s how:

nbdkit xz file=win7.img.xz \
    -U - \
    --run 'virt-inspector --format=raw -a nbd://?socket=$unixsocket'

What’s happening here is we run nbdkit with the xz plugin, and tell it to serve NBD over a randomly named Unix domain socket (-U -).

We then run virt-inspector as a sub-process. This is called “captive nbdkit”. (Nbdkit is “captive” here, because it will exit as soon as virt-inspector exits, so there’s no need to clean anything up.)

The $unixsocket variable expands to the name of the randomly generated Unix domain socket, forming a libguestfs NBD URL which allows virt-inspector to examine the raw uncompressed data exported by nbdkit.

The nbdkit xz plugin only uncompresses those blocks of the data which are actually accessed, so this is quite efficient.


by rich at March 12, 2017 03:44 PM

March 08, 2017

Cole Robinson

virt-manager 1.4.1 released!

I've just released virt-manager 1.4.1. The highlights are:
  • storage/nodedev event API support (Jovanka Gulicoska)
  • UI options for enabling spice GL (Marc-André Lureau)
  • Add default virtio-rng /dev/urandom for supported guest OS
  • Cloning and rename support for UEFI VMs (Pavel Hrdina)
  • libguestfs inspection UI improvements (Pino Toscano)
  • virt-install: Add --qemu-commandline
  • virt-install: Add --network vhostuser (Chen Hanxiao)
  • virt-install: Add --sysinfo (Charles Arnold)
Plus the usual slew of bug fixes and small improvements.

    by Cole Robinson (noreply@blogger.com) at March 08, 2017 07:15 PM

    Cédric Bosdonnat

    System container images

    As of today, creating a libvirt LXC system container root file system is a pain. Docker's fun came with its image-sharing idea... why couldn't we do the same for libvirt containers? I will present here an attempt at this.

    To achieve such a goal we need:

    • container images
    • something to share them
    • a tool to pull and use them

    Container images

    OpenBuildService, thanks to kiwi, knows how to create images, even container images. There are even openSUSE Docker images. To use them as system container images, some more packages need to be added to them. I thus forked the project on GitHub and branched the OBS projects to get system container images for 42.1, 42.2 and Tumbleweed.

    Using them is as simple as downloading them, unpacking them and using them as a container's root file system. However, sharing them would be even more fun!

    Sharing images

    There is no need to reinvent the wheel to share the images. We can just consider them like any docker image. With the following commands we can import the image and push it to a remote registry.

    docker import openSUSE-42.2-syscontainer-guest-docker.x86_64.tar.xz system/opensuse-42.2
    docker tag system/opensuse-42.2 myregistry:5000/system/opensuse-42.2
    docker login myregistry:5000
    docker push myregistry:5000/system/opensuse-42.2
    

    The good thing with this is that we can even use the docker build and Dockerfile magic to create customized images and push them to the remote repository.
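    As a sketch of that customization flow (the Dockerfile contents, package choice and tag name are made up for illustration):

    echo 'FROM myregistry:5000/system/opensuse-42.2'  > Dockerfile
    echo 'RUN zypper --non-interactive install vim'  >> Dockerfile
    docker build -t myregistry:5000/system/opensuse-42.2-custom .
    docker push myregistry:5000/system/opensuse-42.2-custom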

    Instantiating containers

    Now we need a tool to get the images from the remote docker registry. Fortunately there is a tool that helps a lot with this: skopeo. I wrote a small virt-bootstrap tool using it to instantiate the images as root file systems.

    Here is what instantiating a container looks like with it:

    virt-bootstrap.py --username myuser \
                      --root-password test \
                      docker://myregistry:5000/system/opensuse-42.2 /path/to/my/container
    
    virt-install --connect lxc:/// -n 422 --memory 250 --vcpus 1 \
                    --filesystem /path/to/my/container,/ \
                    --filesystem /etc/resolv.conf,/etc/resolv.conf \
                    --network network=default
    

    And voila! Creating an openSUSE 42.2 system container and running it with libvirt is now super easy!

    by Cédric Bosdonnat at March 08, 2017 03:37 PM

    March 06, 2017

    Eduardo Habkost

    The long story of the query-cpu-model-expansion QEMU interface

    So, finally the query-cpu-model-expansion x86 implementation was merged into qemu.git, just before 2.9 soft freeze. Jiri Denemark already implemented the x86 libvirt code to use it. I just can’t believe this was finally done after so many years.

    It was a weird journey. It started almost 6 years ago with this message to qemu-devel:

    Date: Fri, 10 Jun 2011 18:36:37 -0300
    Subject: semantics of “-cpu host” and “check”/”enforce”

    …it continued on an interesting thread:

    Date: Tue, 6 Mar 2012 15:27:53 -0300
    Subject: Qemu, libvirt, and CPU models

    …on another very long one:

    Date: Fri, 9 Mar 2012 17:56:52 -0300
    Subject: Re: [Qemu-devel] [libvirt] Modern CPU models cannot be used with libvirt

    …and this one:

    Date: Thu, 21 Feb 2013 11:58:18 -0300
    Subject: libvirt<->QEMU interfaces for CPU models

    I don’t even remember how many different interfaces were proposed to provide what libvirt needed.

    We had a few moments where we hopped back and forth between “just let libvirt manage everything” to “let’s keep this managed by QEMU”.

    We took a while to get the QEMU community to decide how machine-type compatibility was supposed to be handled, and what to do with the weird CPU model config file we had.

    The conversion of CPUs to QOM was fun. I think it started in 2012 and was finished only in 2015. We thought QOM properties would solve all our problems, but then we found out that machine-types and global properties make the problem more complex. The existing interfaces would require making libvirt re-run QEMU multiple times to gather all the information it needed. While doing the QOM work, we spent some time fixing or working around issues with global properties, qdev “static” properties and QOM “dynamic” properties.

    In 2014, my focus was moved to machine-types, in the hope that we could finally expose machine-type-specific information to libvirt without re-running QEMU. Useful code refactoring was done for that, but in the end we never added the really relevant information to the query-machines QMP command.

    In the meantime, we had the fun TSX issues, and QEMU developers finally agreed to keep a few constraints on CPU model changes, that would make the problem a bit simpler.

    In 2015 IBM people started sending patches related to CPU models in s390x. We finally had a multi-architecture effort to make CPU model probing work. The work started by extending query-cpu-definitions, but it was not enough. In June 2016 they proposed a query-cpu-model-expansion API. It was finally merged in September 2016.

    I sent v1 of query-cpu-model-expansion for x86 in December 2016. After a few rounds of reviews, there was a proposal to use “-cpu max” to represent “all features supported by this QEMU binary on this host”. v3 of the series was merged last week.

    I still can’t believe it finally happened.

    Special thanks to:

    • Igor Mammedov, for all the x86 QOM/properties work and all the valuable feedback.
    • David Hildenbrand and Michael Mueller, for moving forward the API design and the s390x implementation.
    • Jiri Denemark, for the libvirt work, valuable discussions and design feedback, and for the patience during the process.
    • Daniel P. Berrangé, for the valuable feedback and for helping making QEMU developers listen to libvirt developers.
    • Andreas Färber, for the work as maintainer of QOM and the CPU core, for leading the QOM conversion effort, and all the valuable feedback.
    • Markus Armbruster and Paolo Bonzini, for valuable feedback on design discussions.
    • Many others that were involved in the effort.

    by Eduardo Habkost at March 06, 2017 05:00 PM

    Video and slides for FOSDEM 2017 talk: QEMU Internal APIs

    The slides and videos for my FOSDEM 2017 talk (QEMU: internal APIs and conflicting world views) are available online.

    The subject I tried to cover is large for a 40-minute talk, but I think I managed to scratch its surface and give useful examples.

    Many thanks for the FOSDEM team of volunteers for the wonderful event.

    by Eduardo Habkost at March 06, 2017 03:00 AM

    February 24, 2017

    Ladi Prosek

    Running Hyper-V in a QEMU/KVM Guest

    This article provides a how-to on setting up nested virtualization, in particular running Microsoft Hyper-V as a guest of QEMU/KVM. The usual terminology is going to be used in the text: L0 is the bare-metal host running Linux with KVM and QEMU. L1 is L0’s guest, running Microsoft Windows Server 2016 with the Hyper-V role enabled. And L2 is L1’s guest, a virtual machine running Linux, Windows, or anything else. Only Intel hardware is considered here. It is possible that the same can be achieved with AMD’s hardware virtualization support but it has not been tested yet.

    Update 4/2017: AMD is broken, fix is coming.
    Update 7/2017: Check out Nesting Hyper-V in QEMU/KVM: Known issues for a list of known issues and their status

    A quick note on performance. Since the Intel VMX technology does not directly support nested virtualization in hardware, what L1 perceives as hardware-accelerated virtualization is in fact software emulation of VMX by L0. Thus, workloads will inevitably run slower in L2 compared to L1.

    Kernel / KVM

    A fairly recent kernel is required for Hyper-V on QEMU/KVM to function properly. The first commit known to work is 1dc35da, available in Linux 4.10 and newer.

    Nested Intel virtualization must be enabled. If the following command does not return “Y”, kvm-intel.nested=1 must be passed to the kernel as a parameter.

    $ cat /sys/module/kvm_intel/parameters/nested
    

    Update 4/2017: On newer Intel CPUs with PML (Page Modification Logging) support such as Kaby Lake, Skylake, and some server Broadwell chips, PML needs to be disabled by passing kvm-intel.pml=0 to the kernel as a parameter. Fix is coming.
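    If you need to set either of these module parameters, here is a sketch of doing it persistently via modprobe.d as root (pml=0 only matters on the PML-capable CPUs mentioned above; reload the module with no VMs running, or reboot, for the change to take effect):

    echo "options kvm-intel nested=1 pml=0" > /etc/modprobe.d/kvm-intel.conf
    modprobe -r kvm_intel
    modprobe kvm_intel
    cat /sys/module/kvm_intel/parameters/nested    # should now print Y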

    QEMU

    QEMU 2.7 should be enough to make nested virtualization work. As always, it is advisable to use the latest stable version available. SeaBIOS version 1.10 or later is required.
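    A quick way to check what you have (the seabios package name shown is the Fedora one; other distributions may differ):

    qemu-system-x86_64 --version
    rpm -q seabios-bin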

    The QEMU command line must include the +vmx cpu feature, for example:

    -cpu SandyBridge,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time,+vmx
    

    If QEMU warns about the vmx feature not being available on the host, nested virt has likely not been enabled in KVM (see the previous paragraph).

    Hyper-V

    Once the Windows L1 guest is installed, add the Hyper-V role as usual. Only Windows Server 2016 is known to support nested virtualization at the moment.

    If Windows complains about missing HW virtualization support, re-check QEMU and SeaBIOS versions. If the Hyper-V role is already installed and nested virt is misconfigured or not supported, the error shown by Windows tends to mention “Hyper-V components not running” like in the following screenshot.

    If everything goes well, both Gen 1 and Gen 2 Hyper-V virtual machines can be created and started. Here’s a screenshot of Windows XP 64-bit running as a guest in Windows Server 2016, which itself is a guest in QEMU/KVM.


    by ladipro at February 24, 2017 01:57 PM

    February 22, 2017

    Gerd Hoffmann

    vconsole 0.7 released

    vconsole is a virtual machine (serial) console manager, look here for details.

    No big changes from 0.6.

    Fetch the tarball here.

    by Gerd Hoffmann at February 22, 2017 08:15 AM

    February 21, 2017

    Gerd Hoffmann

    Fedora 25 images for qemu and raspberry pi 3 uploaded

    I’ve uploaded three new images to https://www.kraxel.org/repos/rpi2/images/.

    The fedora-25-rpi3 image is for the raspberry pi 3.
    The fedora-25-efi images are for qemu (virt machine type with edk2 firmware).

    The images don’t have a root password set. You must use libguestfs-tools to set the root password …

    virt-customize -a <image> --root-password "password:<your-password-here>"
    

    … otherwise you can’t login after boot.

    The rpi3 image is partitioned similarly to the official (armv7) fedora 25 images: The firmware is on a separate vfat partition (mounted at /boot/fw) holding only the firmware and uboot. /boot is an ext2 filesystem now and holds only the kernels. Well, for compatibility reasons with the f24 images (all images use the same kernel rpms) firmware files are in /boot too, but they are not used. So, if you want to tweak something in config.txt, go to /boot/fw, not /boot.

    The rpi3 images also have swap commented out in /etc/fstab. The reason is that the swap partition must be reinitialized: swap partitions generated when running on a 64k pages kernel (CONFIG_ARM64_64K_PAGES=y) are not compatible with 4k pages (CONFIG_ARM64_4K_PAGES=y). This can be fixed by running “swapon --fixpgsz <device>” once; then you can uncomment the swap line in /etc/fstab.
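    Spelled out, assuming the swap partition is /dev/mmcblk0p3 (check /etc/fstab or lsblk for the real device on your image):

    swapon --fixpgsz /dev/mmcblk0p3
    # then uncomment the swap line in /etc/fstab and either reboot or run:
    swapon -a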

    by Gerd Hoffmann at February 21, 2017 08:32 AM

    February 16, 2017

    Daniel Berrange

    Setting up a nested KVM guest for developing & testing PCI device assignment with NUMA

    Over the past few years the OpenStack Nova project has gained support for managing VM usage of NUMA, huge pages and PCI device assignment. One of the more challenging aspects of this is availability of hardware to develop and test against. In the ideal world it would be possible to emulate everything we need using KVM, enabling developers / test infrastructure to exercise the code without needing access to bare metal hardware supporting these features. KVM has long had support for emulating NUMA topology in guests, and the guest OS can use huge pages inside the guest. What was missing were pieces around PCI device assignment, namely IOMMU support and the ability to associate NUMA nodes with PCI devices.

    Coincidentally a QEMU community member was already working on providing emulation of the Intel IOMMU. I made a request to the Red Hat KVM team to fill in the other missing gap related to NUMA / PCI device association. To do this required writing code to emulate a PCI/PCI-E Expander Bridge (PXB) device, which provides a lightweight host bridge that can be associated with a NUMA node. Individual PCI devices are then attached to this PXB instead of the main PCI host bridge, thus gaining affinity with a NUMA node.

    With this, it is now possible to configure a KVM guest such that it can be used as a virtual host to test NUMA, huge page and PCI device assignment integration. The only real outstanding gap is support for emulating some kind of SRIOV network device, but even without this, it is still possible to test most of the Nova PCI device assignment logic – we’re merely restricted to using physical functions, no virtual functions. This blog post will describe how to configure such a virtual host.

    First of all, this requires very new libvirt & QEMU to work, specifically you’ll want libvirt >= 2.3.0 and QEMU 2.7.0. We could technically support earlier QEMU versions too, but that’s pending on a patch to libvirt to deal with some command line syntax differences in QEMU for older versions. No currently released Fedora has new enough packages available, so even on Fedora 25, you must enable the “Virtualization Preview” repository on the physical host to try this out – F25 has new enough QEMU, so you just need a libvirt update.

    # curl --output /etc/yum.repos.d/fedora-virt-preview.repo https://fedorapeople.org/groups/virt/virt-preview/fedora-virt-preview.repo
    # dnf upgrade

    For the sake of illustration I’m using Fedora 25 as the OS inside the virtual guest, but any other Linux OS will do just fine. The initial task is to install a guest with 8 GB of RAM & 8 CPUs using virt-install

    # cd /var/lib/libvirt/images
    # wget -O f25x86_64-boot.iso https://download.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/images/boot.iso
    # virt-install --name f25x86_64  \
        --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
        --cdrom f25x86_64-boot.iso --os-type fedora23 \
        --ram 8000 --vcpus 8 \
        ...

    The guest needs to use host CPU passthrough to ensure the guest gets to see VMX as well as other modern instructions, and it needs to have 3 virtual NUMA nodes. The first guest NUMA node will have 4 CPUs and 4 GB of RAM, while the second and third NUMA nodes will each have 2 CPUs and 2 GB of RAM. We are just going to let the guest float freely across host NUMA nodes since we don’t care about performance for dev/test, but in production you would certainly pin each guest NUMA node to a distinct host NUMA node.

        ...
        --cpu host,cell0.id=0,cell0.cpus=0-3,cell0.memory=4096000,\
                   cell1.id=1,cell1.cpus=4-5,cell1.memory=2048000,\
                   cell2.id=2,cell2.cpus=6-7,cell2.memory=2048000 \
        ...
    

    QEMU emulates various different chipsets, and historically for x86 the default has been to emulate the ancient PIIX4 (it is 20+ years old, dating from circa 1995). Unfortunately this is too ancient to be able to use the Intel IOMMU emulation with, so it is necessary to tell QEMU to emulate the marginally less ancient Q35 chipset (it is only 9 years old, dating from 2007).

        ...
        --machine q35
    

    The complete virt-install command line thus looks like

    # virt-install --name f25x86_64  \
        --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
        --cdrom f25x86_64-boot.iso --os-type fedora23 \
        --ram 8000 --vcpus 8 \
        --cpu host,cell0.id=0,cell0.cpus=0-3,cell0.memory=4096000,\
                   cell1.id=1,cell1.cpus=4-5,cell1.memory=2048000,\
                   cell2.id=2,cell2.cpus=6-7,cell2.memory=2048000 \
        --machine q35
    

    Once the installation is completed, shut down this guest since it will be necessary to make a number of changes to the guest XML configuration to enable features that virt-install does not know about, using “virsh edit”. With the use of Q35, the guest XML should initially show three PCI controllers present, a “pcie-root”, a “dmi-to-pci-bridge” and a “pci-bridge”

    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='dmi-to-pci-bridge'>
      <model name='i82801b11-bridge'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
    </controller>
    <controller type='pci' index='2' model='pci-bridge'>
      <model name='pci-bridge'/>
      <target chassisNr='2'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </controller>
    

    PCI endpoint devices are not themselves associated with NUMA nodes, rather the bus they are connected to has affinity. The default pcie-root is not associated with any NUMA node, but extra PCI-E Expander Bridge controllers can be added and associated with a NUMA node. So while in edit mode, add the following to the XML config

    <controller type='pci' index='3' model='pcie-expander-bus'>
      <target busNr='180'>
        <node>0</node>
      </target>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='4' model='pcie-expander-bus'>
      <target busNr='200'>
        <node>1</node>
      </target>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='5' model='pcie-expander-bus'>
      <target busNr='220'>
        <node>2</node>
      </target>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    

    It is not possible to plug PCI endpoint devices directly into the PXB, so the next step is to add PCI-E root ports into each PXB – we’ll need one port per device to be added, so 9 ports in total. This is where the requirement for libvirt >= 2.3.0 comes in – earlier versions mistakenly prevented you from adding more than one root port to the PXB

    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='6' port='0x0'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='7' port='0x8'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='8' port='0x10'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='9' port='0x0'/>
      <alias name='pci.9'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='10' port='0x8'/>
      <alias name='pci.10'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='11' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='11' port='0x10'/>
      <alias name='pci.11'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='12' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='12' port='0x0'/>
      <alias name='pci.12'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='13' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='13' port='0x8'/>
      <alias name='pci.13'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='14' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='14' port='0x10'/>
      <alias name='pci.14'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x02' function='0x0'/>
    </controller>
    

    Notice that the value of the ‘bus’ attribute on the <address> element matches the value of the ‘index’ attribute on the <controller> element of the parent device in the topology. The PCI controller topology now looks like this

    pcie-root (index == 0)
      |
      +- dmi-to-pci-bridge (index == 1)
      |    |
      |    +- pci-bridge (index == 2)
      |
      +- pcie-expander-bus (index == 3, numa node == 0)
      |    |
      |    +- pcie-root-port (index == 6)
      |    +- pcie-root-port (index == 7)
      |    +- pcie-root-port (index == 8)
      |
      +- pcie-expander-bus (index == 4, numa node == 1)
      |    |
      |    +- pcie-root-port (index == 9)
      |    +- pcie-root-port (index == 10)
      |    +- pcie-root-port (index == 11)
      |
      +- pcie-expander-bus (index == 5, numa node == 2)
           |
           +- pcie-root-port (index == 12)
           +- pcie-root-port (index == 13)
           +- pcie-root-port (index == 14)
    

    All the existing devices are attached to the “pci-bridge” (the controller with index == 2). The devices we intend to use for PCI device assignment inside the virtual host will be attached to the new “pcie-root-port” controllers. We will provide 3 e1000e NICs per NUMA node, so that’s 9 devices in total to add

    <interface type='user'>
      <mac address='52:54:00:7e:6e:c6'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:c7'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:c8'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:d6'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:d7'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:d8'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:e6'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0c' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:e7'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
    </interface>
    <interface type='user'>
      <mac address='52:54:00:7e:6e:e8'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x0e' slot='0x00' function='0x0'/>
    </interface>
    

    Note that we’re using “user” networking, aka SLIRP. Normally one would never want to use SLIRP, but since we don’t care about actually sending traffic over these NICs, it avoids polluting our real host with countless TAP devices.
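
    Rather than pasting nine <interface> blocks into the domain XML by hand, each one can also be attached with virsh while the guest is shut off. A minimal sketch, assuming one of the snippets above has been saved to a file called nic.xml (a name chosen just for this example), is

    # virsh attach-device f25x86_64 nic.xml --config

    The --config flag updates the persistent definition rather than a running guest; repeat the command for each of the nine snippets.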

    The final configuration change is to simply add the Intel IOMMU device

    <iommu model='intel'/>
    

    It is a capability integrated into the chipset, so it does not need any <address> element of its own. At this point, save the config and start the guest once more. Use the “virsh domifaddr” command to discover the IP address of the guest’s primary NIC and ssh into it.

    # virsh domifaddr f25x86_64
     Name       MAC address          Protocol     Address
    -------------------------------------------------------------------------------
     vnet0      52:54:00:10:26:7e    ipv4         192.168.122.3/24
    
    # ssh root@192.168.122.3
    

    We can now do some sanity checks to confirm that everything visible in the guest matches what was enabled in the libvirt XML config on the host. For example, confirm that the NUMA topology shows 3 nodes

    # dnf install numactl
    # numactl --hardware
    available: 3 nodes (0-2)
    node 0 cpus: 0 1 2 3
    node 0 size: 3856 MB
    node 0 free: 3730 MB
    node 1 cpus: 4 5
    node 1 size: 1969 MB
    node 1 free: 1813 MB
    node 2 cpus: 6 7
    node 2 size: 1967 MB
    node 2 free: 1832 MB
    node distances:
    node   0   1   2 
      0:  10  20  20 
      1:  20  10  20 
      2:  20  20  10 
    

    Confirm that the PCI topology shows the three PCI-E Expander Bridge devices, each with three NICs attached

    # lspci -t -v
    -+-[0000:dc]-+-00.0-[dd]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           +-01.0-[de]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           \-02.0-[df]----00.0  Intel Corporation 82574L Gigabit Network Connection
     +-[0000:c8]-+-00.0-[c9]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           +-01.0-[ca]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           \-02.0-[cb]----00.0  Intel Corporation 82574L Gigabit Network Connection
     +-[0000:b4]-+-00.0-[b5]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           +-01.0-[b6]----00.0  Intel Corporation 82574L Gigabit Network Connection
     |           \-02.0-[b7]----00.0  Intel Corporation 82574L Gigabit Network Connection
     \-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
                 +-01.0  Red Hat, Inc. QXL paravirtual graphic card
                 +-02.0  Red Hat, Inc. Device 000b
                 +-03.0  Red Hat, Inc. Device 000b
                 +-04.0  Red Hat, Inc. Device 000b
                 +-1d.0  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
                 +-1d.1  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
                 +-1d.2  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
                 +-1d.7  Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
                 +-1e.0-[01-02]----01.0-[02]--+-01.0  Red Hat, Inc Virtio network device
                 |                            +-02.0  Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller
                 |                            +-03.0  Red Hat, Inc Virtio console
                 |                            +-04.0  Red Hat, Inc Virtio block device
                 |                            \-05.0  Red Hat, Inc Virtio memory balloon
                 +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
                 +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
                 \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller
    

    The IOMMU support will not be enabled yet as the kernel defaults to leaving it off. To enable it, we must update the kernel command line parameters with grub.

    # vi /etc/default/grub
    ....add "intel_iommu=on"...
    # grub2-mkconfig > /etc/grub2.cfg
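
    On Fedora the same change can also be scripted rather than edited by hand; a sketch using the stock grubby tool would be

    # grubby --update-kernel=ALL --args="intel_iommu=on"

    Either way, after the reboot later in this post, "cat /proc/cmdline" should show the intel_iommu=on option.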
    

    While the intel-iommu device in QEMU can do interrupt remapping, there is no way to enable that feature via libvirt at this time, so we need to set a hack for vfio

    echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > \
      /etc/modprobe.d/vfio.conf
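
    Once the vfio_iommu_type1 module has been loaded (it is pulled in automatically the first time a device is assigned), the setting can be double-checked through sysfs; the expected output is

    # cat /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts
    Y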
    

    This is also a good time to install libvirt and KVM inside the guest

    # dnf groupinstall "Virtualization"
    # dnf install libvirt-client
    # rm -f /etc/libvirt/qemu/networks/autostart/default.xml
    

    Note that we’re disabling the default libvirt network, since it would clash with the IP address range used by this guest. An alternative would be to edit the default.xml to change the IP subnet.
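
    A sketch of that alternative, keeping the network but moving it to a different subnet (192.168.124.0/24 here is just an arbitrary example), would be

    # virsh net-edit default
    ....change the <ip address='192.168.122.1' netmask='255.255.255.0'> element to use 192.168.124.1....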

    Now reboot the guest. When it comes back up, there should be a /dev/kvm device present in the guest.

    # ls -al /dev/kvm
    crw-rw-rw-. 1 root kvm 10, 232 Oct  4 12:14 /dev/kvm
    

    If this is not the case, make sure the physical host has nested virtualization enabled for the “kvm-intel” or “kvm-amd” kernel modules.
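
    A quick way to check and, if needed, enable that on the physical host is shown below for Intel (the kvm_amd module has an equivalent parameter); the kvm-nested.conf file name is just an example, and reloading the module requires that no guests are running

    # cat /sys/module/kvm_intel/parameters/nested
    N
    # echo "options kvm-intel nested=1" > /etc/modprobe.d/kvm-nested.conf
    # modprobe -r kvm_intel && modprobe kvm_intel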

    The IOMMU should have been detected and activated

    # dmesg  | grep -i DMAR
    [    0.000000] ACPI: DMAR 0x000000007FFE2541 000048 (v01 BOCHS  BXPCDMAR 00000001 BXPC 00000001)
    [    0.000000] DMAR: IOMMU enabled
    [    0.203737] DMAR: Host address width 39
    [    0.203739] DMAR: DRHD base: 0x000000fed90000 flags: 0x1
    [    0.203776] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 12008c22260206 ecap f02
    [    2.910862] DMAR: No RMRR found
    [    2.910863] DMAR: No ATSR found
    [    2.914870] DMAR: dmar0: Using Queued invalidation
    [    2.914924] DMAR: Setting RMRR:
    [    2.914926] DMAR: Prepare 0-16MiB unity mapping for LPC
    [    2.915039] DMAR: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
    [    2.915140] DMAR: Intel(R) Virtualization Technology for Directed I/O
    

    The key message confirming that everything is good is the last line there; if it is missing, something went wrong. Don’t be misled by the earlier “DMAR: IOMMU enabled” line, which merely says the kernel saw the “intel_iommu=on” command line option.

    The IOMMU should also have registered the PCI devices into various groups

    # dmesg  | grep -i iommu  |grep device
    [    2.915212] iommu: Adding device 0000:00:00.0 to group 0
    [    2.915226] iommu: Adding device 0000:00:01.0 to group 1
    ...snip...
    [    5.588723] iommu: Adding device 0000:b5:00.0 to group 14
    [    5.588737] iommu: Adding device 0000:b6:00.0 to group 15
    [    5.588751] iommu: Adding device 0000:b7:00.0 to group 16
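
    The same grouping information is exposed under sysfs, which remains available after the dmesg buffer has rotated; each group is a directory whose devices subdirectory lists its members

    # ls /sys/kernel/iommu_groups/
    # ls /sys/kernel/iommu_groups/16/devices/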
    

    Libvirt meanwhile should have detected all the PCI controllers/devices

    # virsh nodedev-list --tree
    computer
      |
      +- net_lo_00_00_00_00_00_00
      +- pci_0000_00_00_0
      +- pci_0000_00_01_0
      +- pci_0000_00_02_0
      +- pci_0000_00_03_0
      +- pci_0000_00_04_0
      +- pci_0000_00_1d_0
      |   |
      |   +- usb_usb2
      |       |
      |       +- usb_2_0_1_0
      |         
      +- pci_0000_00_1d_1
      |   |
      |   +- usb_usb3
      |       |
      |       +- usb_3_0_1_0
      |         
      +- pci_0000_00_1d_2
      |   |
      |   +- usb_usb4
      |       |
      |       +- usb_4_0_1_0
      |         
      +- pci_0000_00_1d_7
      |   |
      |   +- usb_usb1
      |       |
      |       +- usb_1_0_1_0
      |       +- usb_1_1
      |           |
      |           +- usb_1_1_1_0
      |             
      +- pci_0000_00_1e_0
      |   |
      |   +- pci_0000_01_01_0
      |       |
      |       +- pci_0000_02_01_0
      |       |   |
      |       |   +- net_enp2s1_52_54_00_10_26_7e
      |       |     
      |       +- pci_0000_02_02_0
      |       +- pci_0000_02_03_0
      |       +- pci_0000_02_04_0
      |       +- pci_0000_02_05_0
      |         
      +- pci_0000_00_1f_0
      +- pci_0000_00_1f_2
      |   |
      |   +- scsi_host0
      |   +- scsi_host1
      |   +- scsi_host2
      |   +- scsi_host3
      |   +- scsi_host4
      |   +- scsi_host5
      |     
      +- pci_0000_00_1f_3
      +- pci_0000_b4_00_0
      |   |
      |   +- pci_0000_b5_00_0
      |       |
      |       +- net_enp181s0_52_54_00_7e_6e_c6
      |         
      +- pci_0000_b4_01_0
      |   |
      |   +- pci_0000_b6_00_0
      |       |
      |       +- net_enp182s0_52_54_00_7e_6e_c7
      |         
      +- pci_0000_b4_02_0
      |   |
      |   +- pci_0000_b7_00_0
      |       |
      |       +- net_enp183s0_52_54_00_7e_6e_c8
      |         
      +- pci_0000_c8_00_0
      |   |
      |   +- pci_0000_c9_00_0
      |       |
      |       +- net_enp201s0_52_54_00_7e_6e_d6
      |         
      +- pci_0000_c8_01_0
      |   |
      |   +- pci_0000_ca_00_0
      |       |
      |       +- net_enp202s0_52_54_00_7e_6e_d7
      |         
      +- pci_0000_c8_02_0
      |   |
      |   +- pci_0000_cb_00_0
      |       |
      |       +- net_enp203s0_52_54_00_7e_6e_d8
      |         
      +- pci_0000_dc_00_0
      |   |
      |   +- pci_0000_dd_00_0
      |       |
      |       +- net_enp221s0_52_54_00_7e_6e_e6
      |         
      +- pci_0000_dc_01_0
      |   |
      |   +- pci_0000_de_00_0
      |       |
      |       +- net_enp222s0_52_54_00_7e_6e_e7
      |         
      +- pci_0000_dc_02_0
          |
          +- pci_0000_df_00_0
              |
              +- net_enp223s0_52_54_00_7e_6e_e8
    

    And if you look at a specific PCI device, it should report the NUMA node it is associated with and the IOMMU group it is part of

    # virsh nodedev-dumpxml pci_0000_df_00_0
    <device>
      <name>pci_0000_df_00_0</name>
      <path>/sys/devices/pci0000:dc/0000:dc:02.0/0000:df:00.0</path>
      <parent>pci_0000_dc_02_0</parent>
      <driver>
        <name>e1000e</name>
      </driver>
      <capability type='pci'>
        <domain>0</domain>
        <bus>223</bus>
        <slot>0</slot>
        <function>0</function>
        <product id='0x10d3'>82574L Gigabit Network Connection</product>
        <vendor id='0x8086'>Intel Corporation</vendor>
        <iommuGroup number='10'>
          <address domain='0x0000' bus='0xdc' slot='0x02' function='0x0'/>
          <address domain='0x0000' bus='0xdf' slot='0x00' function='0x0'/>
        </iommuGroup>
        <numa node='2'/>
        <pci-express>
          <link validity='cap' port='0' speed='2.5' width='1'/>
          <link validity='sta' speed='2.5' width='1'/>
        </pci-express>
      </capability>
    </device>
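
    The same details can be cross-checked directly in sysfs; given the iommuGroup and numa values above, output along these lines is expected

    # readlink /sys/bus/pci/devices/0000:df:00.0/iommu_group
    ../../../../kernel/iommu_groups/10
    # cat /sys/bus/pci/devices/0000:df:00.0/numa_node
    2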
    

    Finally, libvirt should also be reporting the NUMA topology

    # virsh capabilities
    ...snip...
    <topology>
      <cells num='3'>
        <cell id='0'>
          <memory unit='KiB'>4014464</memory>
          <pages unit='KiB' size='4'>1003616</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='1' value='20'/>
            <sibling id='2' value='20'/>
          </distances>
          <cpus num='4'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
            <cpu id='1' socket_id='1' core_id='0' siblings='1'/>
            <cpu id='2' socket_id='2' core_id='0' siblings='2'/>
            <cpu id='3' socket_id='3' core_id='0' siblings='3'/>
          </cpus>
        </cell>
        <cell id='1'>
          <memory unit='KiB'>2016808</memory>
          <pages unit='KiB' size='4'>504202</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='20'/>
            <sibling id='1' value='10'/>
            <sibling id='2' value='20'/>
          </distances>
          <cpus num='2'>
            <cpu id='4' socket_id='4' core_id='0' siblings='4'/>
            <cpu id='5' socket_id='5' core_id='0' siblings='5'/>
          </cpus>
        </cell>
        <cell id='2'>
          <memory unit='KiB'>2014644</memory>
          <pages unit='KiB' size='4'>503661</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='20'/>
            <sibling id='1' value='20'/>
            <sibling id='2' value='10'/>
          </distances>
          <cpus num='2'>
            <cpu id='6' socket_id='6' core_id='0' siblings='6'/>
            <cpu id='7' socket_id='7' core_id='0' siblings='7'/>
          </cpus>
        </cell>
      </cells>
    </topology>
    ...snip...
    

    Everything should be ready and working at this point, so let’s try to install a nested guest and assign it one of the e1000e PCI devices. For simplicity we’ll do the exact same install for the nested guest as we used for the top level guest we’re currently running in. The only difference is that we’ll assign it a PCI device

    # cd /var/lib/libvirt/images
    # wget -O f25x86_64-boot.iso https://download.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/images/boot.iso
    # virt-install --name f25x86_64 --ram 2000 --vcpus 8 \
        --file /var/lib/libvirt/images/f25x86_64.img --file-size 10 \
        --cdrom f25x86_64-boot.iso --os-variant fedora23 \
        --hostdev pci_0000_df_00_0 --network none

    If everything went well, you should now have a nested guest with an assigned PCI device attached to it.
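
    A couple of quick checks can confirm this: in the level 1 guest the nested guest’s XML should now contain a <hostdev> element for the chosen NIC, and inside the nested guest the 82574L device should show up in lspci

    # virsh dumpxml f25x86_64 | grep -A 2 '<hostdev'
    # lspci | grep 82574L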

    This turned out to be a rather long blog posting, but that is not surprising, as we’re experimenting with some cutting edge KVM features to emulate quite a complicated hardware setup that deviates a long way from a normal KVM guest setup. Perhaps in the future virt-install will be able to simplify some of this, but at least for the short to medium term there’ll be a fair bit of manual work required. The positive thing, though, is that this has clearly demonstrated that KVM is now advanced enough that you can reasonably expect to do development and testing of features like NUMA and PCI device assignment inside nested guests.

    The next step is to convince someone to add QEMU emulation of an Intel SRIOV network device….volunteers please :-)

    by Daniel Berrange at February 16, 2017 12:44 PM

    ANNOUNCE: libosinfo 1.0.0 release

    NB, this blog post was intended to be published back in November last year, but got forgotten in draft stage. Publishing now in case anyone missed the release…

    I am happy to announce a new release of libosinfo, version 1.0.0 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.

    Changes in this release include:

    • Update loader to follow new layout for external database
    • Move all database files into separate osinfo-db package
    • Move osinfo-db-validate into osinfo-db-tools package

    As promised, this release of libosinfo has completed the separation of the library code from the database files. There are now three independently released artefacts:

    • libosinfo – provides the libosinfo shared library and most associated command line tools
    • osinfo-db – contains only the database XML files and RNG schema, no code at all.
    • osinfo-db-tools – a set of command line tools for managing deployment of osinfo-db archives for vendors & users.

    Before installing the 1.0.0 release of libosinfo it is necessary to install osinfo-db-tools, followed by osinfo-db. The download page has instructions for how to deploy the three components. In particular note that ‘osinfo-db’ does NOT contain any traditional build system, as the only files it contains are XML database files. So instead of unpacking the osinfo-db archive, use the osinfo-db-import tool to deploy it.
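
    For example, deploying a database archive system-wide looks roughly like this (the archive name shown is just an illustration; use whichever release you downloaded):

    # osinfo-db-import --system osinfo-db-20161026.tar.xz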

    by Daniel Berrange at February 16, 2017 11:19 AM

    February 15, 2017

    QEMU project

    Presentations from FOSDEM 2017

    Over the last weekend, on February 4th and 5th, the FOSDEM 2017 conference took place in Brussels.

    Some people from the QEMU community attended the Virtualisation and IaaS track, and the videos of their presentations are now available online, too:

    by Thomas Huth at February 15, 2017 02:49 PM

    February 14, 2017

    Stefan Hajnoczi

    Slides posted for "Using NVDIMM under KVM" talk

    I gave a talk on NVDIMM persistent memory at FOSDEM 2017. QEMU has gained support for emulated NVDIMMs and they can be used efficiently under KVM.

    Applications inside the guest access the physical NVDIMM directly with native performance when properly configured. These devices are DDR4 RAM modules, so access times are much lower than those of solid state (SSD) drives. I'm looking forward to hardware coming onto the market because it will change storage and databases in a big way.

    This talk covers what NVDIMM is, the programming model, and how it can be used under KVM. Slides are available here (PDF).
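
    As a rough sketch of the sort of configuration the talk covers, an emulated NVDIMM backed by an ordinary file can be added to a guest with a QEMU invocation along these lines (paths and sizes are placeholders, and the usual disk and display options are omitted):

    $ qemu-system-x86_64 -machine pc,nvdimm=on \
        -m 2G,slots=2,maxmem=4G \
        -object memory-backend-file,id=mem1,share=on,mem-path=/tmp/nvdimm0,size=1G \
        -device nvdimm,memdev=mem1,id=nvdimm1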

    Update: Video is available here.

    by stefanha (noreply@blogger.com) at February 14, 2017 02:54 PM

