This post is a 64-bit companion to an earlier post of mine where I described how to get Debian running on QEMU emulating a 32-bit ARM “virt” board. Thanks to commenter snak3xe for reminding me that I’d said I’d write this up…
For 64-bit ARM QEMU emulates many fewer boards, so “virt” is almost the only choice, unless you specifically know that you want to emulate one of the 64-bit Xilinx boards. “virt” supports PCI, virtio, a recent ARM CPU and large amounts of RAM. The only thing it doesn’t have out of the box is graphics.
I’m going to assume you have a Linux host, and a recent version of QEMU (at least QEMU 2.8). I also use libguestfs to extract files from a QEMU disk image, but you could use a different tool for that step if you prefer.
I’m going to document how to set up a guest which directly boots the kernel. It should also be possible to have QEMU boot a UEFI image which then boots the kernel from a disk image, but that’s not something I’ve looked into doing myself. (There may be tutorials elsewhere on the web.)
To start, download the installer kernel and initrd; I suggest creating a subdirectory for these and the other files we’re going to create.
wget -O installer-linux http://http.us.debian.org/debian/dists/stretch/main/installer-arm64/current/images/netboot/debian-installer/arm64/linux
wget -O installer-initrd.gz http://http.us.debian.org/debian/dists/stretch/main/installer-arm64/current/images/netboot/debian-installer/arm64/initrd.gz
Saving them locally as installer-linux and installer-initrd.gz means they won’t be confused with the final kernel and initrd that the installation process produces.
(If we were installing on real hardware we would also need a “device tree” file to tell the kernel the details of the exact hardware it’s running on. QEMU’s “virt” board automatically creates a device tree internally and passes it to the kernel, so we don’t need to provide one.)
First we need to create an empty disk drive to install onto. I picked a 5GB disk but you can make it larger if you like.
qemu-img create -f qcow2 hda.qcow2 5G
Now we can run the installer:
qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel installer-linux \
  -initrd installer-initrd.gz \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic -no-reboot
The installer will display its messages on the text console (via an emulated serial port). Follow its instructions to install Debian to the virtual disk; it’s straightforward, but if you have any difficulty the Debian installation guide may help.
The actual install process will take a few hours as it downloads packages over the network and writes them to disk. It will occasionally stop to ask you questions.
Late in the process, the installer will print the following warning dialog:
+-----------------| [!] Continue without boot loader |------------------+
|                                                                       |
| No boot loader installed                                              |
| No boot loader has been installed, either because you chose not to or |
| because your specific architecture doesn't support a boot loader yet. |
|                                                                       |
| You will need to boot manually with the /vmlinuz kernel on partition  |
| /dev/vda1 and root=/dev/vda2 passed as a kernel argument.             |
|                                                                       |
|                               <Continue>                              |
|                                                                       |
+-----------------------------------------------------------------------+
Press continue for now, and we’ll sort this out later.
Eventually the installer will finish by rebooting — this should cause QEMU to exit (since we used the -no-reboot option).
At this point you might like to make a copy of the hard disk image file, to save the tedium of repeating the install later.
The installer warned us that it didn’t know how to arrange to automatically boot the right kernel, so we need to do it manually. For QEMU that means we need to extract the kernel the installer put into the disk image so that we can pass it to QEMU on the command line.
There are various tools you can use for this, but I’m going to recommend libguestfs, because it’s the simplest to use. To check that it works, let’s look at the partitions in our virtual disk image:
$ virt-filesystems -a hda.qcow2
/dev/sda1
/dev/sda2
If this doesn’t work, then you should sort that out first. A couple of common reasons I’ve seen:
the kernels in /boot on the host are installed not-world-readable; you can fix this with
sudo chmod 644 /boot/vmlinuz*
Looking at what’s in our disk we can see the kernel and initrd in /boot:
$ virt-ls -a hda.qcow2 /boot/
System.map-4.9.0-3-arm64
config-4.9.0-3-arm64
initrd.img
initrd.img-4.9.0-3-arm64
initrd.img.old
lost+found
vmlinuz
vmlinuz-4.9.0-3-arm64
vmlinuz.old
and we can copy them out to the host filesystem:
virt-copy-out -a hda.qcow2 /boot/vmlinuz-4.9.0-3-arm64 /boot/initrd.img-4.9.0-3-arm64 .
(We want the longer filenames, because vmlinuz and initrd.img are just symlinks and virt-copy-out won’t copy them.)
An important warning about libguestfs, or any other tools for accessing disk images from the host system: do not try to use them while QEMU is running, or you will get disk corruption when both the guest OS inside QEMU and libguestfs try to update the same image.
If you subsequently upgrade the kernel inside the guest, you’ll need to repeat this step to extract the new kernel and initrd, and then update your QEMU command line appropriately.
To run the installed system we need a different command line which boots the installed kernel and initrd, and passes the kernel the command line arguments the installer told us we’d need:
qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel vmlinuz-4.9.0-3-arm64 \
  -initrd initrd.img-4.9.0-3-arm64 \
  -append 'root=/dev/vda2' \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic
This should boot to a login prompt, where you can log in with the user and password you set up during the install.
The installed system has an SSH client, so one easy way to get files in and out is to use “scp” from inside the VM to talk to an SSH server outside it. Or you can use libguestfs to write files directly into the disk image (for instance using virt-copy-in) — but make sure you only use libguestfs when the VM is not running, or you will get disk corruption.
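If you would rather push files from the host side, one option (a sketch, not part of the original recipe; the host port 2222 is an arbitrary choice) is to add a port forward to the user-mode networking option and then point scp at it from the host while the guest is running:

-netdev user,id=mynet,hostfwd=tcp::2222-:22
scp -P 2222 somefile user@localhost:

The first line replaces the plain -netdev user option in the QEMU command line above; the second is run on the host and lands the file in the guest user’s home directory.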
This is a follow-up to Running Hyper-V in a QEMU/KVM Guest published earlier this year. The article provided instructions on setting up Hyper-V in a QEMU/KVM Windows guest as enabled by a particular KVM patchset (on Intel hardware only, as it turned out later). Several issues have been found since then; some already fixed, some in the process of being fixed, and some still not fully understood.
This post aims to be an up-to-date list of issues related to Hyper-V on KVM, showing their current status and, where applicable, upstream commit IDs. The issues are ordered chronologically from the oldest ones to those found recently.
|Issue description|Status|Public bug tracker|
|Hyper-V on KVM does not work at all (initial work item)|Fixed in kernel 4.10| |
|Hyper-V on KVM does not work on new Intel CPUs with PML|Fixed in kernel 4.11| |
|Hyper-V on KVM does not work on AMD CPUs|Fixed in kernel 4.12 for 1 vCPU| |
For anyone interested in the AF_VSOCK zero-configuration host<->guest communications channel it's important to be able to observe traffic. Packet capture is commonly used to troubleshoot network problems and debug networking applications. Up until now it hasn't been available for AF_VSOCK.
In 2016 Gerard Garcia created the vsockmon Linux driver that enables AF_VSOCK packet capture. During the course of his excellent Google Summer of Code work he also wrote patches for libpcap, tcpdump, and Wireshark.
Recently I revisited Gerard's work because Linux 4.12 shipped with the new vsockmon driver, making it possible to finalize the userspace support for AF_VSOCK packet capture. And it's working beautifully.
I have sent the latest patches to the tcpdump and Wireshark communities so AF_VSOCK can be supported out-of-the-box in the future. For now you can also find patches in my personal repositories:
The basic flow is as follows:
# ip link add type vsockmon
# ip link set vsockmon0 up
# tcpdump -i vsockmon0
# ip link set vsockmon0 down
# ip link del vsockmon0
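To keep a capture around for later analysis in Wireshark, a small variation of the flow above writes packets to a file instead of printing them (the filename is an arbitrary choice):

# tcpdump -i vsockmon0 -w vsock.pcap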
It's easiest to wait for distros to package Linux 4.12 and future versions of libpcap, tcpdump, and Wireshark. If you decide to build from source, make sure to build libpcap first and then tcpdump or Wireshark. The libpcap dependency is necessary so that tcpdump/Wireshark can access AF_VSOCK traffic.
Fedora 26 is out of the door, and here are fresh fedora 26 images.
There are raspberry pi images. The aarch64 image requires a model 3; the armv7 image boots on both model 2 and model 3. Unlike the images for the previous fedora releases, the new images use the standard fedora kernels instead of a custom kernel. So, the kernel update service for the older images will stop within the next few weeks.
There are efi images for qemu. The i386 and x86_64 images use systemd-boot as bootloader; grub2 doesn’t work due to bug 1196114 (unless you create a boot menu entry manually in uefi setup). The arm images use grub2 as bootloader: armv7 isn’t supported by systemd-boot in the first place, and the aarch64 version throws an exception. The efi images can also be booted as containers, using "systemd-nspawn --boot --image <file>", but you have to convert them to raw first.
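A sketch of that container route, assuming an x86_64 image (the filenames here are placeholders for whatever you actually downloaded):

$ qemu-img convert -O raw fedora-26-efi-x86_64.qcow2 fedora-26-efi-x86_64.raw
# systemd-nspawn --boot --image fedora-26-efi-x86_64.raw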
The images don’t have a root password. You have to set one using virt-customize -a <image> --root-password "password:<secret>", otherwise you can’t log in after boot.
The images have been created with imagefish.
Debian 9 (“Stretch”) was released last week and now it’s available in virt-builder, the fast way to build virtual machine disk images:
$ virt-builder -l | grep debian
debian-6                 x86_64     Debian 6 (Squeeze)
debian-7                 sparc64    Debian 7 (Wheezy) (sparc64)
debian-7                 x86_64     Debian 7 (Wheezy)
debian-8                 x86_64     Debian 8 (Jessie)
debian-9                 x86_64     Debian 9 (stretch)

$ virt-builder debian-9 \
    --root-password password:123456
[   0.5] Downloading: http://libguestfs.org/download/builder/debian-9.xz
[   1.2] Planning how to build this image
[   1.2] Uncompressing
[   5.5] Opening the new disk
[  15.4] Setting a random seed
virt-builder: warning: random seed could not be set for this type of guest
[  15.4] Setting passwords
[  16.7] Finishing off
                   Output file: debian-9.img
                   Output size: 6.0G
                 Output format: raw
            Total usable space: 3.9G
                    Free space: 3.1G (78%)

$ qemu-system-x86_64 \
    -machine accel=kvm:tcg -cpu host -m 2048 \
    -drive file=debian-9.img,format=raw,if=virtio \
    -serial stdio
libguestfs is a C library for creating and editing disk images. In the most common (but not the only) configuration, it uses KVM to sandbox access to disk images. The C library talks to a separate daemon running inside a KVM appliance, as in this Unicode-art diagram taken from the fine manual:
 ┌───────────────────┐
 │    main program   │
 │                   │
 │                   │           child process / appliance
 │                   │          ┌──────────────────────────┐
 │                   │          │ qemu                     │
 ├───────────────────┤   RPC    │      ┌─────────────────┐ │
 │ libguestfs  ◀╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍▶ │    guestfsd     │ │
 │                   │          │      ├─────────────────┤ │
 └───────────────────┘          │      │  Linux kernel   │ │
                                │      └────────┬────────┘ │
                                └───────────────│──────────┘
                                                │
                                                │ virtio-scsi
                                         ┌──────┴──────┐
                                         │  Device or  │
                                         │  disk image │
                                         └─────────────┘
The library has to be written in C because it needs to be linked to any main program. The daemon (guestfsd in the diagram) is also written in C, but there’s not so much a specific reason for that, except that it’s what we did historically.
The daemon is essentially a big pile of functions, most corresponding to a libguestfs API. Writing the daemon in C is painful to say the least. Because it’s a long-running process running in a memory-constrained environment, we have to be very careful about memory management, religiously checking every return from strdup etc., making even the simplest task non-trivial and full of untested code paths.
So last week I modified libguestfs so you can now write APIs in OCaml if you want to. OCaml is a high level language that compiles down to object files, and it’s entirely possible to link the daemon from a mix of C object files and OCaml object files. Another advantage of OCaml is that you can call from C ↔ OCaml with relatively little glue code (although a disadvantage is that you still need to write that glue mostly by hand). Most simple calls turn into direct CALL instructions with just a simple bitshift required to convert between ints and bools on the C and OCaml sides. More complex calls passing strings and structures are not too difficult either.
OCaml also turns memory errors into a single exception, which unwinds the stack cleanly, so we don’t litter the code with memory handling. We can still run the mixed C/OCaml binary under valgrind.
Code gets quite a bit shorter. For example the case_sensitive_path API — all string handling and directory lookups — goes from 183 lines of C code to 56 lines of OCaml code (and much easier to understand too).
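For a sense of what that particular API does from the user’s side, here is a sketch using guestfish (the disk image and paths are arbitrary examples): it resolves a case-insensitive path to the real case-sensitive name stored on the filesystem.

$ guestfish -a windows.img -i case-sensitive-path /windows/system32/drivers
/WINDOWS/System32/Drivers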
I’m reimplementing a few APIs in OCaml, but the plan is definitely not to convert them all. I think we’ll have C and OCaml APIs in the daemon for a very long time to come.
There were a couple of QEMU / virtualization related talks at the DevConf 2017 conference that took place at the end of January, but so far we had not gathered the links to the recordings of these talks. So here is now the list:
How to write a legacy storage device emulator by John Snow
GPU Passthrough using GNOME Boxes by Felipe Borges
Self-virtualizing Linux on x86 by Radim Krčmář
New ways to remote desktops with GStreamer integration by Victor Toso de Carvalho and Pavel Grunt
Can VMs networking benefit from DPDK by Maxime Coquelin and Victor Kaplansky
Windows VMs on KVM: The Good, the Bad and the Ugly by Ladi Prosek
Pet VMs in Kubernetes? WTH by Fabian Deutsch
Virtualization development improved with Lago by Rafael Martins
An ARMful of guests: virtualization on 64-bit ARM by Andrea Bolognani
First, there is ninja-build. It’s a workhorse, roughly comparable to make. It isn’t really designed to be used standalone though. Typically the low-level ninja build files are generated by some high-level build tool, similar to how Makefiles are generated by autotools.
Second, there is meson, a build tool which (on unix) by default uses ninja as backend. meson appears to be getting pretty popular.
So, let’s have a closer look at it. I’m working on drminfo right now, a tool to dump information about drm devices, which also comes with a simple test tool that renders a test image to the display. It is pretty small and doesn’t even use autotools, so it is perfect for trying out something new. Also nice for this post, as the build files are pretty small.
So, here is the Makefile:
CC      ?= gcc
CFLAGS  ?= -Os -g -std=c99
CFLAGS  += -Wall

TARGETS := drminfo drmtest gtktest

drminfo : CFLAGS += $(shell pkg-config --cflags libdrm cairo pixman-1)
drminfo : LDLIBS += $(shell pkg-config --libs libdrm cairo pixman-1)

drmtest : CFLAGS += $(shell pkg-config --cflags libdrm gbm epoxy cairo cairo-gl pixman-1)
drmtest : LDLIBS += $(shell pkg-config --libs libdrm gbm epoxy cairo cairo-gl pixman-1)
drmtest : LDLIBS += -ljpeg

gtktest : CFLAGS += $(shell pkg-config --cflags gtk+-3.0 cairo pixman-1)
gtktest : LDLIBS += $(shell pkg-config --libs gtk+-3.0 cairo pixman-1)
gtktest : LDLIBS += -ljpeg

all: $(TARGETS)

clean:
	rm -f $(TARGETS)
	rm -f *~ *.o

drminfo: drminfo.o drmtools.o
drmtest: drmtest.o drmtools.o render.o image.o
gtktest: gtktest.o render.o image.o
Thanks to pkg-config there is no need to use autotools just to figure the cflags and libraries needed, and the Makefile is short and easy to read. The only thing here you might not be familiar with are target specific variables.
Now, compare with the meson.build file:
project('drminfo', 'c')

# pkg-config deps
libdrm_dep   = dependency('libdrm')
gbm_dep      = dependency('gbm')
epoxy_dep    = dependency('epoxy')
cairo_dep    = dependency('cairo')
cairo_gl_dep = dependency('cairo-gl')
pixman_dep   = dependency('pixman-1')
gtk3_dep     = dependency('gtk+-3.0')

# libjpeg dep
jpeg_dep     = declare_dependency(link_args : '-ljpeg')

drminfo_srcs = [ 'drminfo.c', 'drmtools.c' ]
drmtest_srcs = [ 'drmtest.c', 'drmtools.c', 'render.c', 'image.c' ]
gtktest_srcs = [ 'gtktest.c', 'render.c', 'image.c' ]

drminfo_deps = [ libdrm_dep, cairo_dep, pixman_dep ]
drmtest_deps = [ libdrm_dep, gbm_dep, epoxy_dep,
                 cairo_dep, cairo_gl_dep, pixman_dep, jpeg_dep ]
gtktest_deps = [ gtk3_dep,
                 cairo_dep, cairo_gl_dep, pixman_dep, jpeg_dep ]

executable('drminfo',
           sources      : drminfo_srcs,
           dependencies : drminfo_deps)
executable('drmtest',
           sources      : drmtest_srcs,
           dependencies : drmtest_deps)
executable('gtktest',
           sources      : gtktest_srcs,
           dependencies : gtktest_deps,
           install      : false)
Pretty straightforward translation. So, what are the differences?
First, meson and ninja have built-in support for a bunch of features. There is no need to put anything into your build files to use them; they are just there.
Sure, you can do all that with make too; the linux kernel build system does it, for example. But then your Makefiles will be an order of magnitude larger than the one shown above, because all the clever stuff is in the build files instead of the build tool.
Second, meson keeps the object files strictly separated by target. The project has some source files shared by multiple executables; drmtools.c for example is used by both drminfo and drmtest. With the Makefile above it gets built once. meson builds it separately for each target, with the cflags for the specific target.
Another nice feature is that ninja automatically does parallel builds. It figures the number of processors available and runs (by default) that many jobs.
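For reference, a minimal sketch of the whole build flow with these files (the build directory name is an arbitrary choice; ninja’s -j flag can override the automatic job count if needed):

$ meson build
$ ninja -C build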
Overall I’m pretty pleased, and I’ll probably use meson more frequently in the future. If you want to try it out too, I’d suggest starting with the tutorial.
When you are trying to start a KVM guest via libvirt on an s390x Linux installation that is running on an older version of z/VM, you might run into the problem that QEMU refuses to start with this error message:
cannot set up guest memory 's390.ram': Permission denied.
This happens because older versions of z/VM (before version 6.3) do not support the so-called “enhanced suppression on protection facility” (ESOP) yet, so QEMU has to allocate the memory for the guest with a “hack”, and this hack uses mmap(… PROT_EXEC …) for the allocation.
Now this mmap() call is not allowed by the default SELinux rules (at least not on RHEL-based systems), so QEMU fails to allocate the memory for the guest here. Turning off SELinux completely just to run a KVM guest is of course a bad idea, but fortunately there is already a SELinux boolean value called virt_use_execmem which can be used to tune the behavior here:
setsebool virt_use_execmem 1
This configuration switch was originally introduced for running TCG guests (i.e. running QEMU without KVM), but in this case it also fixes the problem with KVM guests. Anyway, since setting this SELinux variable to 1 is also a slight decrease in security (but still better than disabling SELinux completely), you had better upgrade your z/VM to version 6.3 (or newer) or use a real LPAR for the KVM host installation instead, if that is feasible.
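Note that setsebool without further options only changes the value until the next reboot; the standard -P flag writes the setting persistently:

setsebool -P virt_use_execmem 1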
--- Original XML
+++ Altered XML
@@ -1,4 +1,4 @@
+<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
@@ -104,4 +104,8 @@
<address type="pci" domain="0x0000" bus="0x00" slot="0x0a" function="0x0"/>
+ <qemu:arg value="-device"/>
+ <qemu:arg value="foo"/>
Define 'f25' with the changed XML? (y/n):
Here is a short list of articles and blog posts about QEMU and KVM, that were posted last month.
QEMU and the qcow2 metadata checks by Alberto Garcia
Running Hyper-V in a QEMU/KVM Guest by Ladislav Prošek
Setting up a nested KVM guest for developing & testing PCI device assignment with NUMA by Daniel P. Berrangé
Testing SMM with QEMU, KVM and libvirt by László Érsek
The surprisingly complicated world of disk image sizes by Daniel P. Berrangé
More virtualization blog posts can be found on the virt tools planet.
In other news, QEMU is now in hard freeze for release 2.9.0. The preliminary list of features is on the wiki.
virt-inspector is a very convenient tool to examine a disk image and find out if it contains an operating system, what applications are installed and so on.
nbdkit xz file=win7.img.xz \
    -U - \
    --run 'virt-inspector --format=raw -a nbd://?socket=$unixsocket'
What’s happening here is we run nbdkit with the xz plugin, and tell it to serve NBD over a randomly named Unix domain socket (-U -).
We then run virt-inspector as a sub-process. This is called “captive nbdkit”. (Nbdkit is “captive” here, because it will exit as soon as virt-inspector exits, so there’s no need to clean anything up.)
The $unixsocket variable expands to the name of the randomly generated Unix domain socket, forming a libguestfs NBD URL which allows virt-inspector to examine the raw uncompressed data exported by nbdkit.
The nbdkit xz plugin only uncompresses those blocks of the data which are actually accessed, so this is quite efficient.
As of today, creating a libvirt lxc system container root file system is a pain. Docker's fun came with its image sharing idea... so why couldn't we do the same for libvirt containers? What I will present here is an attempt at this.
To achieve such a goal we need:
OpenBuildService, thanks to kiwi, knows how to create images, even container images. There even are openSUSE Docker images. To use them as system container images, some more packages need to be added to them. I thus forked the project on github and branched the OBS projects to get system container images for 42.1, 42.2 and Tumbleweed.
Using them is as simple as downloading them, unpacking them and using them as a container's root file system. However, sharing them would be so much more fun!
There is no need to reinvent the wheel to share the images. We can just consider them like any docker image. With the following commands we can import the image and push it to a remote registry.
docker import openSUSE-42.2-syscontainer-guest-docker.x86_64.tar.xz system/opensuse-42.2
docker tag system/opensuse-42.2 myregistry:5000/system/opensuse-42.2
docker login myregistry:5000
docker push myregistry:5000/system/opensuse-42.2
The good thing with this is that we can even use the docker build and Dockerfile magic to create customized images and push them to the remote registry.
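A sketch of that route (the tag and registry address are arbitrary examples, and the Dockerfile holds whatever customization you need):

docker build -t myregistry:5000/system/opensuse-42.2-custom .
docker push myregistry:5000/system/opensuse-42.2-custom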
Now we need a tool to get the images from the remote docker registry. Fortunately there is a tool that helps a lot with this: skopeo. I wrote a small virt-bootstrap tool using it to instantiate the images as root file systems.
Here is what instantiating a container looks like with it:
virt-bootstrap.py --username myuser \
                  --root-password test \
                  docker://myregistry:5000/system/opensuse-42.2 \
                  /path/to/my/container

virt-install --connect lxc:/// -n 422 --memory 250 --vcpus 1 \
             --filesystem /path/to/my/container,/ \
             --filesystem /etc/resolv.conf,/etc/resolv.conf \
             --network network=default
And voila! Creating an openSUSE 42.2 system container and running it with libvirt is now super easy!
So, finally the query-cpu-model-expansion x86 implementation was merged just before 2.9 soft freeze. Jiri Denemark already implemented the x86 libvirt code to use it. I just can’t believe this was finally done after so many years.
It was a weird journey. It started almost 6 years ago with this message to qemu-devel:
Date: Fri, 10 Jun 2011 18:36:37 -0300
Subject: semantics of “-cpu host” and “check”/”enforce”
…it continued on an interesting thread:
Date: Tue, 6 Mar 2012 15:27:53 -0300
Subject: Qemu, libvirt, and CPU models
…on another very long one:
Date: Fri, 9 Mar 2012 17:56:52 -0300
Subject: Re: [Qemu-devel] [libvirt] Modern CPU models cannot be used with libvirt
…and this one:
Date: Thu, 21 Feb 2013 11:58:18 -0300
Subject: libvirt<->QEMU interfaces for CPU models
I don’t even remember how many different interfaces were proposed to provide what libvirt needed.
We had a few moments where we hopped back and forth between “just let libvirt manage everything” to “let’s keep this managed by QEMU”.
We took a while to get the QEMU community to decide how machine-type compatibility was supposed to be handled, and what to do with the weird CPU model config file we had.
The conversion of CPUs to QOM was fun. I think it started in 2012 and was finished only in 2015. We thought QOM properties would solve all our problems, but then we found out that machine-types and global properties make the problem more complex. The existing interfaces would require making libvirt re-run QEMU multiple times to gather all the information it needed. While doing the QOM work, we spent some time fixing or working around issues with global properties, qdev “static” properties and QOM “dynamic” properties.
In 2014, my focus was moved to machine-types, in the hope that we could finally expose machine-type-specific information to libvirt without re-running QEMU. Useful code refactoring was done for that, but in the end we never added the really relevant information to the query-machines QMP command.
In the meantime, we had the fun TSX issues, and QEMU developers finally agreed to keep a few constraints on CPU model changes, that would make the problem a bit simpler.
In 2015 IBM people started sending patches related to CPU models in s390x. We finally had a multi-architecture effort to make CPU model probing work. The work started by extending query-cpu-definitions, but it was not enough. In June 2016 they proposed the query-cpu-model-expansion API. It was finally merged in September 2016.
I sent v1 of query-cpu-model-expansion for x86 in December 2016. After a few rounds of reviews, there was a proposal to use “-cpu max” to represent “all features supported by this QEMU binary on this host”. v3 of the series was merged last week.
I still can’t believe it finally happened.
Special thanks to:
The slides and videos for my FOSDEM 2017 talk (QEMU: internal APIs and conflicting world views) are available online.
The subject I tried to cover is large for a 40-minute talk, but I think I managed to scratch its surface and give useful examples.
Many thanks to the FOSDEM team of volunteers for the wonderful event.
This article provides a how-to on setting up nested virtualization, in particular running Microsoft Hyper-V as a guest of QEMU/KVM. The usual terminology is going to be used in the text: L0 is the bare-metal host running Linux with KVM and QEMU. L1 is L0’s guest, running Microsoft Windows Server 2016 with the Hyper-V role enabled. And L2 is L1’s guest, a virtual machine running Linux, Windows, or anything else. Only Intel hardware is considered here. It is possible that the same can be achieved with AMD’s hardware virtualization support but it has not been tested yet.
Update 4/2017: AMD is broken, fix is coming.
Update 7/2017: Check out Nesting Hyper-V in QEMU/KVM: Known issues for a list of known issues and their status
A quick note on performance. Since the Intel VMX technology does not directly support nested virtualization in hardware, what L1 perceives as hardware-accelerated virtualization is in fact software emulation of VMX by L0. Thus, workloads will inevitably run slower in L2 compared to L1.
A fairly recent kernel is required for Hyper-V on QEMU/KVM to function properly. The first commit known to work is 1dc35da, available in Linux 4.10 and newer.
Nested Intel virtualization must be enabled. If the following command does not return “Y”, kvm-intel.nested=1 must be passed to the kernel as a parameter.
$ cat /sys/module/kvm_intel/parameters/nested
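If it does not, one way to enable nesting persistently is via a modprobe option file (a sketch; the kvm_intel module has to be reloaded, or the host rebooted, for the change to take effect, and no guests may be running while it is unloaded):

# echo "options kvm_intel nested=1" > /etc/modprobe.d/kvm_intel.conf
# modprobe -r kvm_intel && modprobe kvm_intel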
Update 4/2017: On newer Intel CPUs with PML (Page Modification Logging) support such as Kaby Lake, Skylake, and some server Broadwell chips, PML needs to be disabled by passing kvm-intel.pml=0 to the kernel as a parameter. Fix is coming.
QEMU 2.7 should be enough to make nested virtualization work. As always, it is advisable to use the latest stable version available. SeaBIOS version 1.10 or later is required.
The QEMU command line must include the +vmx cpu feature, for example:
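A minimal sketch of such a command line (the CPU model and memory size are arbitrary choices; the relevant part is the +vmx flag, and -cpu host achieves the same when the host CPU is exposed directly):

qemu-system-x86_64 -enable-kvm -m 4096 -cpu Haswell,+vmx ...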
If QEMU warns about the vmx feature not being available on the host, nested virt has likely not been enabled in KVM (see the previous paragraph).
Once the Windows L1 guest is installed, add the Hyper-V role as usual. Only Windows Server 2016 is known to support nested virtualization at the moment.
If Windows complains about missing HW virtualization support, re-check QEMU and SeaBIOS versions. If the Hyper-V role is already installed and nested virt is misconfigured or not supported, the error shown by Windows tends to mention “Hyper-V components not running” like in the following screenshot.
If everything goes well, both Gen 1 and Gen 2 Hyper-V virtual machines can be created and started. Here’s a screenshot of Windows XP 64-bit running as a guest in Windows Server 2016, which itself is a guest in QEMU/KVM.
I’ve uploaded three new images to https://www.kraxel.org/repos/rpi2/images/.
The fedora-25-rpi3 image is for the raspberry pi 3.
The fedora-25-efi images are for qemu (virt machine type with edk2 firmware).
The images don’t have a root password set. You must use libguestfs-tools to set the root password …
virt-customize -a <image> --root-password "password:<your-password-here>"
… otherwise you can’t login after boot.
The rpi3 image is partitioned similar to the official (armv7) fedora 25 images: the firmware and uboot are on a separate vfat partition (mounted at /boot/fw), and /boot is an ext2 filesystem now which holds only the kernels. Well, for compatibility reasons with the f24 images (all images use the same kernel rpms) firmware files are in /boot too, but they are not used. So, if you want to tweak something in config.txt, go to /boot/fw, not /boot.
The rpi3 images also have swap commented out in /etc/fstab. The reason is that the swap partition must be reinitialized: swap partitions generated when running on a 64k pages kernel (CONFIG_ARM64_64K_PAGES=y) are not compatible with 4k pages (CONFIG_ARM64_4K_PAGES=y). This can be fixed by running “swapon --fixpgsz <device>” once; then you can uncomment the swap line in /etc/fstab.
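A sketch of that fix (the partition name is only an example; check the image's actual layout with lsblk or fdisk -l first):

# swapon --fixpgsz /dev/mmcblk0p3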
Over the past few years the OpenStack Nova project has gained support for managing VM usage of NUMA, huge pages and PCI device assignment. One of the more challenging aspects of this is availability of hardware to develop and test against. In the ideal world it would be possible to emulate everything we need using KVM, enabling developers / test infrastructure to exercise the code without needing access to bare metal hardware supporting these features. KVM has long had support for emulating NUMA topology in guests, and the guest OS can use huge pages inside the guest. What was missing were pieces around PCI device assignment, namely IOMMU support and the ability to associate NUMA nodes with PCI devices.

Coincidentally a QEMU community member was already working on providing emulation of the Intel IOMMU. I made a request to the Red Hat KVM team to fill in the other missing gap related to NUMA / PCI device association. To do this required writing code to emulate a PCI/PCI-E Expander Bridge (PXB) device, which provides a lightweight host bridge that can be associated with a NUMA node. Individual PCI devices are then attached to this PXB instead of the main PCI host bridge, thus gaining affinity with a NUMA node. With this, it is now possible to configure a KVM guest such that it can be used as a virtual host to test NUMA, huge page and PCI device assignment integration.

The only real outstanding gap is support for emulating some kind of SRIOV network device, but even without this, it is still possible to test most of the Nova PCI device assignment logic – we’re merely restricted to using physical functions, no virtual functions. This blog post will describe how to configure such a virtual host.
First of all, this requires very new libvirt & QEMU to work, specifically you’ll want libvirt >= 2.3.0 and QEMU 2.7.0. We could technically support earlier QEMU versions too, but that’s pending on a patch to libvirt to deal with some command line syntax differences in QEMU for older versions. No currently released Fedora has new enough packages available, so even on Fedora 25, you must enable the “Virtualization Preview” repository on the physical host to try this out – F25 has new enough QEMU, so you just need a libvirt update.
# curl --output /etc/yum.repos.d/fedora-virt-preview.repo https://fedorapeople.org/groups/virt/virt-preview/fedora-virt-preview.repo
# dnf upgrade
For the sake of illustration I’m using Fedora 25 as the OS inside the virtual guest, but any other Linux OS will do just fine. The initial task is to install a guest with 8 GB of RAM & 8 CPUs using virt-install
# cd /var/lib/libvirt/images
# wget -O f25x86_64-boot.iso https://download.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/images/boot.iso
# virt-install --name f25x86_64 \
    --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
    --cdrom f25x86_64-boot.iso --os-type fedora23 \
    --ram 8000 --vcpus 8 \
    ...
The guest needs to use host CPU passthrough to ensure the guest gets to see VMX, as well as other modern instructions and have 3 virtual NUMA nodes. The first guest NUMA node will have 4 CPUs and 4 GB of RAM, while the second and third NUMA nodes will each have 2 CPUs and 2 GB of RAM. We are just going to let the guest float freely across host NUMA nodes since we don’t care about performance for dev/test, but in production you would certainly pin each guest NUMA node to a distinct host NUMA node.
    ...
    --cpu host,cell0.id=0,cell0.cpus=0-3,cell0.memory=4096000,\
               cell1.id=1,cell1.cpus=4-5,cell1.memory=2048000,\
               cell2.id=2,cell2.cpus=6-7,cell2.memory=2048000 \
    ...
QEMU emulates various different chipsets and historically for x86, the default has been to emulate the ancient PIIX4 (it is 20+ years old, dating from circa 1995). Unfortunately this is too ancient to be able to use the Intel IOMMU emulation with, so it is necessary to tell QEMU to emulate the marginally less ancient chipset Q35 (it is only 9 years old, dating from 2007).
... --machine q35
The complete virt-install command line thus looks like
# virt-install --name f25x86_64 \
    --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
    --cdrom f25x86_64-boot.iso --os-type fedora23 \
    --ram 8000 --vcpus 8 \
    --cpu host,cell0.id=0,cell0.cpus=0-3,cell0.memory=4096000,\
               cell1.id=1,cell1.cpus=4-5,cell1.memory=2048000,\
               cell2.id=2,cell2.cpus=6-7,cell2.memory=2048000 \
    --machine q35
Once the installation is completed, shut down this guest, since it will be necessary to make a number of changes to the guest XML configuration to enable features that virt-install does not know about, using “virsh edit”. With the use of Q35, the guest XML should initially show three PCI controllers present, a “pcie-root”, a “dmi-to-pci-bridge” and a “pci-bridge”
<controller type='pci' index='0' model='pcie-root'/>
<controller type='pci' index='1' model='dmi-to-pci-bridge'>
  <model name='i82801b11-bridge'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
</controller>
<controller type='pci' index='2' model='pci-bridge'>
  <model name='pci-bridge'/>
  <target chassisNr='2'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</controller>
PCI endpoint devices are not themselves associated with NUMA nodes, rather the bus they are connected to has affinity. The default pcie-root is not associated with any NUMA node, but extra PCI-E Expander Bridge controllers can be added and associated with a NUMA node. So while in edit mode, add the following to the XML config
<controller type='pci' index='3' model='pcie-expander-bus'>
  <target busNr='180'>
    <node>0</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='4' model='pcie-expander-bus'>
  <target busNr='200'>
    <node>1</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</controller>
<controller type='pci' index='5' model='pcie-expander-bus'>
  <target busNr='220'>
    <node>2</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</controller>
It is not possible to plug PCI endpoint devices directly into the PXB, so the next step is to add PCI-E root ports into each PXB – we’ll need one port per device to be added, so 9 ports in total. This is where the requirement for libvirt >= 2.3.0 comes in – earlier versions mistakenly prevented you from adding more than one root port to the PXB.
<controller type='pci' index='6' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='6' port='0x0'/>
  <alias name='pci.6'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='7' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='7' port='0x8'/>
  <alias name='pci.7'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='8' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='8' port='0x10'/>
  <alias name='pci.8'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='9' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='9' port='0x0'/>
  <alias name='pci.9'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='10' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='10' port='0x8'/>
  <alias name='pci.10'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='11' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='11' port='0x10'/>
  <alias name='pci.11'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='12' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='12' port='0x0'/>
  <alias name='pci.12'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='13' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='13' port='0x8'/>
  <alias name='pci.13'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='14' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='14' port='0x10'/>
  <alias name='pci.14'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x02' function='0x0'/>
</controller>
Notice that the value in the ‘bus’ attribute on the <address> element matches the value of the ‘index’ attribute on the <controller> element of the parent device in the topology. The PCI controller topology now looks like this
pcie-root (index == 0)
|
+- dmi-to-pci-bridge (index == 1)
|    |
|    +- pci-bridge (index == 2)
|
+- pcie-expander-bus (index == 3, numa node == 0)
|    |
|    +- pcie-root-port (index == 6)
|    +- pcie-root-port (index == 7)
|    +- pcie-root-port (index == 8)
|
+- pcie-expander-bus (index == 4, numa node == 1)
|    |
|    +- pcie-root-port (index == 9)
|    +- pcie-root-port (index == 10)
|    +- pcie-root-port (index == 11)
|
+- pcie-expander-bus (index == 5, numa node == 2)
     |
     +- pcie-root-port (index == 12)
     +- pcie-root-port (index == 13)
     +- pcie-root-port (index == 14)
All the existing devices are attached to the “pci-bridge” (the controller with index == 2). The devices we intend to use for PCI device assignment inside the virtual host will be attached to the new “pcie-root-port” controllers. We will provide 3 e1000e devices per NUMA node, so that’s 9 devices in total to add
<interface type='user'>
  <mac address='52:54:00:7e:6e:c6'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:c7'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:c8'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:d6'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:d7'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:d8'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:e6'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0c' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:e7'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:e8'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0e' slot='0x00' function='0x0'/>
</interface>
Note that we’re using the “user” networking, aka SLIRP. Normally one would never want to use SLIRP but we don’t care about actually sending traffic over these NICs, and so using SLIRP avoids polluting our real host with countless TAP devices.
The final configuration change is to simply add the Intel IOMMU device to the guest XML.
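With libvirt 2.1.0 and newer this can be expressed as a single element inside <devices> (a sketch using libvirt's standard IOMMU device syntax):

  <iommu model='intel'/>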
It is a capability integrated into the chipset, so it does not need any <address> element of its own. At this point, save the config and start the guest once more. Use the “virsh domifaddr” command to discover the IP address of the guest’s primary NIC and ssh into it.
# virsh domifaddr f25x86_64
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet0      52:54:00:10:26:7e    ipv4         192.168.122.3/24

# ssh root@192.168.122.3
We can now do some sanity check that everything visible in the guest matches what was enabled in the libvirt XML config in the host. For example, confirm the NUMA topology shows 3 nodes
# dnf install numactl
# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3
node 0 size: 3856 MB
node 0 free: 3730 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1813 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1832 MB
node distances:
node   0   1   2
  0:  10  20  20
  1:  20  10  20
  2:  20  20  10
Confirm that the PCI topology shows the three PCI-E Expander Bridge devices, each with three NICs attached
# lspci -t -v
-+-[0000:dc]-+-00.0-[dd]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           +-01.0-[de]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           \-02.0-[df]----00.0  Intel Corporation 82574L Gigabit Network Connection
 +-[0000:c8]-+-00.0-[c9]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           +-01.0-[ca]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           \-02.0-[cb]----00.0  Intel Corporation 82574L Gigabit Network Connection
 +-[0000:b4]-+-00.0-[b5]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           +-01.0-[b6]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           \-02.0-[b7]----00.0  Intel Corporation 82574L Gigabit Network Connection
 \-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
             +-01.0  Red Hat, Inc. QXL paravirtual graphic card
             +-02.0  Red Hat, Inc. Device 000b
             +-03.0  Red Hat, Inc. Device 000b
             +-04.0  Red Hat, Inc. Device 000b
             +-1d.0  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
             +-1d.1  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
             +-1d.2  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
             +-1d.7  Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
             +-1e.0-[01-02]----01.0---+-01.0  Red Hat, Inc Virtio network device
             |                        +-02.0  Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller
             |                        +-03.0  Red Hat, Inc Virtio console
             |                        +-04.0  Red Hat, Inc Virtio block device
             |                        \-05.0  Red Hat, Inc Virtio memory balloon
             +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
             +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
             \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller
The IOMMU support will not be enabled yet as the kernel defaults to leaving it off. To enable it, we must update the kernel command line parameters with grub.
# vi /etc/default/grub
....add "intel_iommu=on"...
# grub2-mkconfig > /etc/grub2.cfg
While the intel-iommu device in QEMU can do interrupt remapping, there is no way to enable that feature via libvirt at this time. So we need to set a hack for vfio
echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > \ /etc/modprobe.d/vfio.conf
This is also a good time to install libvirt and KVM inside the guest
# dnf groupinstall "Virtualization"
# dnf install libvirt-client
# rm -f /etc/libvirt/qemu/networks/autostart/default.xml
Note we’re disabling the default libvirt network, since it’ll clash with the IP address range used by this guest. An alternative would be to edit the default.xml to change the IP subnet.
Now reboot the guest. When it comes back up, there should be a /dev/kvm device present in the guest.
# ls -al /dev/kvm
crw-rw-rw-. 1 root kvm 10, 232 Oct  4 12:14 /dev/kvm
If this is not the case, make sure the physical host has nested virtualization enabled for the “kvm-intel” or “kvm-amd” kernel modules.
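A quick check on the physical host for the Intel case (kvm_amd exposes an equivalent parameter); it should print Y:

# cat /sys/module/kvm_intel/parameters/nested
Y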
The IOMMU should have been detected and activated
# dmesg | grep -i DMAR
[    0.000000] ACPI: DMAR 0x000000007FFE2541 000048 (v01 BOCHS  BXPCDMAR 00000001 BXPC 00000001)
[    0.000000] DMAR: IOMMU enabled
[    0.203737] DMAR: Host address width 39
[    0.203739] DMAR: DRHD base: 0x000000fed90000 flags: 0x1
[    0.203776] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 12008c22260206 ecap f02
[    2.910862] DMAR: No RMRR found
[    2.910863] DMAR: No ATSR found
[    2.914870] DMAR: dmar0: Using Queued invalidation
[    2.914924] DMAR: Setting RMRR:
[    2.914926] DMAR: Prepare 0-16MiB unity mapping for LPC
[    2.915039] DMAR: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
[    2.915140] DMAR: Intel(R) Virtualization Technology for Directed I/O
The key message confirming everything is good is the last line there – if that’s missing something went wrong – don’t be misled by the earlier “DMAR: IOMMU enabled” line which merely says the kernel saw the “intel_iommu=on” command line option.
The IOMMU should also have registered the PCI devices into various groups
# dmesg | grep -i iommu | grep device
[    2.915212] iommu: Adding device 0000:00:00.0 to group 0
[    2.915226] iommu: Adding device 0000:00:01.0 to group 1
...snip...
[    5.588723] iommu: Adding device 0000:b5:00.0 to group 14
[    5.588737] iommu: Adding device 0000:b6:00.0 to group 15
[    5.588751] iommu: Adding device 0000:b7:00.0 to group 16
Libvirt meanwhile should have detected all the PCI controllers/devices
# virsh nodedev-list --tree computer | +- net_lo_00_00_00_00_00_00 +- pci_0000_00_00_0 +- pci_0000_00_01_0 +- pci_0000_00_02_0 +- pci_0000_00_03_0 +- pci_0000_00_04_0 +- pci_0000_00_1d_0 | | | +- usb_usb2 | | | +- usb_2_0_1_0 | +- pci_0000_00_1d_1 | | | +- usb_usb3 | | | +- usb_3_0_1_0 | +- pci_0000_00_1d_2 | | | +- usb_usb4 | | | +- usb_4_0_1_0 | +- pci_0000_00_1d_7 | | | +- usb_usb1 | | | +- usb_1_0_1_0 | +- usb_1_1 | | | +- usb_1_1_1_0 | +- pci_0000_00_1e_0 | | | +- pci_0000_01_01_0 | | | +- pci_0000_02_01_0 | | | | | +- net_enp2s1_52_54_00_10_26_7e | | | +- pci_0000_02_02_0 | +- pci_0000_02_03_0 | +- pci_0000_02_04_0 | +- pci_0000_02_05_0 | +- pci_0000_00_1f_0 +- pci_0000_00_1f_2 | | | +- scsi_host0 | +- scsi_host1 | +- scsi_host2 | +- scsi_host3 | +- scsi_host4 | +- scsi_host5 | +- pci_0000_00_1f_3 +- pci_0000_b4_00_0 | | | +- pci_0000_b5_00_0 | | | +- net_enp181s0_52_54_00_7e_6e_c6 | +- pci_0000_b4_01_0 | | | +- pci_0000_b6_00_0 | | | +- net_enp182s0_52_54_00_7e_6e_c7 | +- pci_0000_b4_02_0 | | | +- pci_0000_b7_00_0 | | | +- net_enp183s0_52_54_00_7e_6e_c8 | +- pci_0000_c8_00_0 | | | +- pci_0000_c9_00_0 | | | +- net_enp201s0_52_54_00_7e_6e_d6 | +- pci_0000_c8_01_0 | | | +- pci_0000_ca_00_0 | | | +- net_enp202s0_52_54_00_7e_6e_d7 | +- pci_0000_c8_02_0 | | | +- pci_0000_cb_00_0 | | | +- net_enp203s0_52_54_00_7e_6e_d8 | +- pci_0000_dc_00_0 | | | +- pci_0000_dd_00_0 | | | +- net_enp221s0_52_54_00_7e_6e_e6 | +- pci_0000_dc_01_0 | | | +- pci_0000_de_00_0 | | | +- net_enp222s0_52_54_00_7e_6e_e7 | +- pci_0000_dc_02_0 | +- pci_0000_df_00_0 | +- net_enp223s0_52_54_00_7e_6e_e8
And if you look at a specific PCI device, it should report the NUMA node it is associated with and the IOMMU group it is part of
# virsh nodedev-dumpxml pci_0000_df_00_0
<device>
  <name>pci_0000_df_00_0</name>
  <path>/sys/devices/pci0000:dc/0000:dc:02.0/0000:df:00.0</path>
  <parent>pci_0000_dc_02_0</parent>
  <driver>
    <name>e1000e</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>223</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x10d3'>82574L Gigabit Network Connection</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <iommuGroup number='10'>
      <address domain='0x0000' bus='0xdc' slot='0x02' function='0x0'/>
      <address domain='0x0000' bus='0xdf' slot='0x00' function='0x0'/>
    </iommuGroup>
    <numa node='2'/>
    <pci-express>
      <link validity='cap' port='0' speed='2.5' width='1'/>
      <link validity='sta' speed='2.5' width='1'/>
    </pci-express>
  </capability>
</device>
Finally, libvirt should also be reporting the NUMA topology
# virsh capabilities
...snip...
    <topology>
      <cells num='3'>
        <cell id='0'>
          <memory unit='KiB'>4014464</memory>
          <pages unit='KiB' size='4'>1003616</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='1' value='20'/>
            <sibling id='2' value='20'/>
          </distances>
          <cpus num='4'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
            <cpu id='1' socket_id='1' core_id='0' siblings='1'/>
            <cpu id='2' socket_id='2' core_id='0' siblings='2'/>
            <cpu id='3' socket_id='3' core_id='0' siblings='3'/>
          </cpus>
        </cell>
        <cell id='1'>
          <memory unit='KiB'>2016808</memory>
          <pages unit='KiB' size='4'>504202</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='20'/>
            <sibling id='1' value='10'/>
            <sibling id='2' value='20'/>
          </distances>
          <cpus num='2'>
            <cpu id='4' socket_id='4' core_id='0' siblings='4'/>
            <cpu id='5' socket_id='5' core_id='0' siblings='5'/>
          </cpus>
        </cell>
        <cell id='2'>
          <memory unit='KiB'>2014644</memory>
          <pages unit='KiB' size='4'>503661</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='20'/>
            <sibling id='1' value='20'/>
            <sibling id='2' value='10'/>
          </distances>
          <cpus num='2'>
            <cpu id='6' socket_id='6' core_id='0' siblings='6'/>
            <cpu id='7' socket_id='7' core_id='0' siblings='7'/>
          </cpus>
        </cell>
      </cells>
    </topology>
...snip...
Everything should be ready and working at this point, so lets try and install a nested guest, and assign it one of the e1000e PCI devices. For simplicity we’ll just do the exact same install for the nested guest, as we used for the top level guest we’re currently running in. The only difference is that we’ll assign it a PCI device
# cd /var/lib/libvirt/images
# wget -O f25x86_64-boot.iso https://download.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/images/boot.iso
# virt-install --name f25x86_64 --ram 2000 --vcpus 8 \
    --file /var/lib/libvirt/images/f25x86_64.img --file-size 10 \
    --cdrom f25x86_64-boot.iso --os-type fedora23 \
    --hostdev pci_0000_df_00_0 --network none
If everything went well, you should now have a nested guest with an assigned PCI device attached to it.
This turned out to be a rather long blog posting, but this is not surprising as we’re experimenting with some cutting edge KVM features trying to emulate quite a complicated hardware setup, that deviates from normal KVM guest setup quite a way. Perhaps in the future virt-install will be able to simplify some of this, but at least for the short-medium term there’ll be a fair bit of work required. The positive thing though is that this has clearly demonstrated that KVM is now advanced enough that you can now reasonably expect to do development and testing of features like NUMA and PCI device assignment inside nested guests.
The next step is to convince someone to add QEMU emulation of an Intel SRIOV network device….volunteers please :-)
NB, this blog post was intended to be published back in November last year, but got forgotten in draft stage. Publishing now in case anyone missed the release…
I am happy to announce a new release of libosinfo, version 1.0.0 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.
Changes in this release include:
As promised, this release of libosinfo has completed the separation of the library code from the database files. There are now three independently released artefacts: libosinfo (the library code), osinfo-db (the XML database files) and osinfo-db-tools (the tools for managing the database).
Before installing the 1.0.0 release of libosinfo it is necessary to install osinfo-db-tools, followed by osinfo-db. The download page has instructions for how to deploy the three components. In particular note that ‘osinfo-db’ does NOT contain any traditional build system, as the only files it contains are XML database files. So instead of unpacking the osinfo-db archive, use the osinfo-db-import tool to deploy it.
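A sketch of that deployment step, with a placeholder archive name (see the osinfo-db-import man page for the --user/--local/--system location options):

$ osinfo-db-import --local osinfo-db-<version>.tar.xz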
Over the last weekend, on February 4th and 5th, the FOSDEM 2017 conference took place in Brussels.
Some people from the QEMU community attended the Virtualisation and IaaS track, and the videos of their presentations are now available online, too:
Network Block Device – how, what, why by Wouter Verhelst
Using NVDIMM under KVM – Applications of persistent memory in virtualization by Stefan Hajnoczi
I gave a talk on NVDIMM persistent memory at FOSDEM 2017. QEMU has gained support for emulated NVDIMMs and they can be used efficiently under KVM.
Applications inside the guest access the physical NVDIMM directly with native performance when properly configured. These devices are DDR4 RAM modules so the access times are much lower than solid state (SSD) drives. I'm looking forward to hardware coming onto the market because it will change storage and databases in a big way.
This talk covers what NVDIMM is, the programming model, and how it can be used under KVM. Slides are available here (PDF).
Update: Video is available here.
I’m pleased to announce a new release of GTK-VNC, version 0.7.0. The release focus is on bug fixing and includes fixes for two publicly reported security bugs which allow a malicious server to exploit the client. Similar bugs were recently reported & fixed in other common VNC clients too.
Thanks to all those who reported bugs and provided patches that went into this new release.
When managing virtual machines one of the key tasks is to understand the utilization of resources being consumed, whether RAM, CPU, network or storage. This post will examine different aspects of managing storage when using file based disk images, as opposed to block storage.

When provisioning a virtual machine the tenant user will have an idea of the amount of storage they wish the guest operating system to see for their virtual disks. This is the easy part. It is simply a matter of telling ‘qemu-img’ (or a similar tool) ’40GB’ and it will create a virtual disk image that is visible to the guest OS as a 40GB volume. The virtualization host administrator, however, doesn’t particularly care about what size the guest OS sees. They are instead interested in how much space is (or will be) consumed in the host filesystem storing the image. With this in mind, there are four key figures to consider when managing storage: the capacity (the virtual disk size the guest OS sees), the length (the size of the image file as reported by the host filesystem), the allocation (the amount of host storage actually allocated to the file) and the commitment (the maximum amount of host storage the file could grow to consume).
The relationship between these figures will vary according to the format of the disk image file being used. For the sake of illustration, raw and qcow2 files will be compared, since they provide examples of the simplest and the most complicated file formats used for virtual machines.
In a raw file, the sectors visible to the guest are mapped 1-to-1 onto sectors in the host file. Thus the capacity and length values will always be identical for raw files – the length dictates the capacity and vice-versa. The allocation value is slightly more complicated. Most filesystems do lazy allocation on blocks, so even if a file is 10 GB in length it is entirely possible for it to consume 0 bytes of physical storage, if nothing has been written to the file yet. Such a file is known as “sparse” or is said to have “holes” in its allocation. To maximize guest performance, it is common to tell the operating system to fully allocate a file at time of creation, either by writing zeros to every block (very slow) or via a special system call to instruct it to immediately allocate all blocks (very fast). So immediately after creating a new raw file, the allocation would typically either match the length, or be zero. In the latter case, as the guest writes to various disk sectors, the allocation of the raw file will grow. The commitment value refers to the upper bound for the allocation value, and for raw files, this will match the length of the file.
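A quick way to see the length vs allocation distinction with a raw file (a sketch; the file name and size are arbitrary): ls reports the length, while du reports the blocks actually allocated, which for a freshly created sparse file is close to zero.

$ qemu-img create -f raw test.img 10G
$ ls -lh test.img
$ du -h test.img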
While raw files look reasonably straightforward, some filesystems can create surprises. XFS has a concept of “speculative preallocation” where it may allocate more blocks than are actually needed to satisfy the current I/O operation. This is useful for files which are progressively growing, since it is faster to allocate 10 blocks all at once, than to allocate 10 blocks individually. So while a raw file’s allocation will usually never exceed the length, if XFS has speculatively preallocated extra blocks, it is possible for the allocation to exceed the length. The excess is usually pretty small though – bytes or KBs, not MBs. Btrfs meanwhile has a concept of “copy on write” whereby multiple files can initially share allocated blocks and when one file is written, it will take a private copy of the blocks written. In other words, to determine the usage of a set of files it is not sufficient to sum the allocation for each file, as that would over-count the true allocation due to block sharing.
In a qcow2 file, the sectors visible to the guest are indirectly mapped to sectors in the host file via a number of lookup tables. A sector at offset 4096 in the guest may be stored at offset 65536 in the host. In order to perform this mapping, there are various auxiliary data structures stored in the qcow2 file. Describing all of these structures is beyond the scope of this post; read the specification instead. The key point is that, unlike raw files, the length of the file in the host has no relation to the capacity seen in the guest. The capacity is determined by a value stored in the file header metadata. By default, the qcow2 file will grow on demand, so the length of the file will gradually grow as more data is stored. It is possible to request preallocation, either just of file metadata, or of the full file payload too. Since the file grows on demand as data is written, traditionally it would never have any holes in it, so the allocation would always match the length (the previous caveat about XFS speculative preallocation still applies, though). Since the introduction of SSDs, however, the notion of explicitly cutting holes in files has become commonplace. When this is plumbed through from the guest, a guest initiated TRIM request will in turn create a hole in the qcow2 file, which will also issue a TRIM to the underlying host storage. Thus even though qcow2 files grow on demand, they may also become sparse over time, so the allocation may be less than the length.
The maximum commitment for a qcow2 file is surprisingly hard to determine accurately. To calculate it requires intimate knowledge of the qcow2 file format and even the type of data stored in it. There is allocation overhead from the data structures used to map guest sectors to host file offsets, which is directly proportional to the capacity and the qcow2 cluster size (a cluster is the qcow2 equivalent of the “sector” concept, except much bigger – 65536 bytes by default). Over time qcow2 has grown other data structures too, such as various bitmap tables tracking cluster allocation and recent writes. With the addition of LUKS support, there will be key data tables. Most significant, though, is that qcow2 can internally store entire VM snapshots containing the virtual device state, guest RAM and copy-on-write disk sectors. If snapshots are ignored, it is possible to calculate a value for the commitment, and it will be proportional to the capacity. If snapshots are used, however, all bets are off – the amount of storage that can be consumed is unbounded, so there is no commitment value that can be accurately calculated.
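As a hedged illustration of the capacity/length distinction (the file name, size and options are just examples), the capacity of a qcow2 file is set at creation time, while its length and allocation start small and grow on demand:
qemu-img create -f qcow2 -o cluster_size=65536,preallocation=metadata demo.qcow2 40G
qemu-img info demo.qcow2    # "virtual size" is the capacity; "disk size" reflects the allocation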
Considering the above information, for a newly created file the four size values would look like
| Format | Capacity | Length | Allocation | Commitment |
| raw (sparse) | 40GB | 40GB | 0 | 40GB |
| raw (prealloc) | 40GB | 40GB | 40GB | 40GB |
| qcow2 (grow on demand) | 40GB | 193KB | 196KB | 41GB |
| qcow2 (prealloc metadata) | 40GB | 41GB | 6.5MB | 41GB |
| qcow2 (prealloc all) | 40GB | 41GB | 41GB | 41GB |
Notes:
– XFS speculative preallocation may cause allocation/commitment to be very slightly higher than 40GB
– use of internal snapshots may massively increase allocation/commitment
For an application attempting to manage filesystem storage to ensure any future guest OS write will always succeed without triggering ENOSPC (out of space) in the host, the commitment value is critical to understand. If the length/allocation values are initially less than the commitment, they will grow towards it as the guest writes data. For raw files it is easy to determine commitment (XFS preallocation aside), but for qcow2 files it is unreasonably hard. Even ignoring internal snapshots, there is no API provided by libvirt that reports this value, nor is it exposed by QEMU or its tools. Determining the commitment for a qcow2 file requires the application not only to understand the qcow2 file format, but also to directly query the header metadata to read internal parameters such as the “cluster size”, in order to calculate the required value. Without this, the best an application can do is to guess – e.g. add 2% to the capacity of the qcow2 file to determine the likely commitment. Snapshots make life even harder, but to be fair, qcow2 internal snapshots are best avoided regardless in favour of external snapshots. The lack of information around file commitment is a clear gap that needs addressing in both libvirt and QEMU.
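A minimal sketch of that guessing approach, assuming jq is available and ignoring internal snapshots entirely (the image name is illustrative):
virtual=$(qemu-img info --output=json demo.qcow2 | jq '.["virtual-size"]')
echo $(( virtual + virtual * 2 / 100 ))    # likely commitment in bytes: capacity plus ~2% metadata overhead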
That all said, ensuring the sum of commitment values across disk images is within the filesystem free space is only one approach to managing storage. These days QEMU has the ability to live migrate virtual machines even when their disks are on host-local storage – it simply copies across the disk image contents too. So a valid approach is to mostly ignore future commitment implied by disk images, and instead just focus on the near term usage. For example, regularly monitor filesystem usage and if free space drops below some threshold, then migrate one or more VMs (and their disk images) off to another host to free up space for remaining VMs.
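A rough sketch of that approach might look like the following, where the 90% threshold, the guest name and the destination host are all illustrative assumptions:
usage=$(df --output=pcent /var/lib/libvirt/images | tail -1 | tr -dc '0-9')
if [ "$usage" -gt 90 ]; then
    # live migrate the guest together with its local disk image to free up space
    virsh migrate --live --copy-storage-all demo qemu+ssh://otherhost/system
fi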
Libvirt uses XML as the format for configuring objects it manages, including virtual machines. Sometimes when debugging or developing it is desirable to comment out sections of the virtual machine configuration to test some idea. For example, one might want to temporarily remove a secondary disk. It is not always desirable to just delete the configuration entirely, as it may need to be re-added immediately after. XML has support for comments of the form
<!-- .... some text -->
which one might try to use to achieve this. Using comments in XML fed into libvirt, however, results in an unwelcome surprise – the commented-out text is thrown into /dev/null by libvirt.
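This is easy to demonstrate for yourself; in this hedged sketch the guest name “demo” and the comment text are arbitrary:
virsh dumpxml demo > demo.xml
# hand edit demo.xml to add e.g. <!-- spare disk, disabled --> then feed it back:
virsh define demo.xml
virsh dumpxml demo | grep -c 'spare disk'    # prints 0 – the comment has been discarded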
This is an unfortunate consequence of the way libvirt handles XML documents. It does not consider the XML document to be the master representation of an object’s configuration – a series of C structs are the actual internal representation. XML is simply a data interchange format for serializing structs into a text format that can be interchanged with the management application, or persisted on disk. So when receiving an XML document libvirt will parse it, extracting the pieces of information it cares about, which are then stored in memory in some structs, while the XML document is discarded (along with the comments it contained). Given this way of working, to preserve comments would require libvirt to add hundreds of extra fields to its internal structs and extract comments from every part of the XML document that might conceivably contain them. This is totally impractical to do in reality. The alternative would be to consider the parsed XML DOM as the canonical internal representation of the config. This is what the libvirt-gconfig library in fact does, but it means you can no longer just do simple field accesses to get at information – getter/setter methods would have to be used, which quickly becomes tedious in C. It would also involve refactoring almost the entire libvirt codebase, so such a change in approach would realistically never be done.
Given that it is not possible to use XML comments in libvirt, what other options might be available?
Many years ago libvirt added the ability to store arbitrary user defined metadata in domain XML documents. The caveat is that it has to be located in a specific place in the XML document, as a child of the <metadata> tag, in a private XML namespace. This metadata facility can be used as a hack to temporarily stash some XML out of the way. Consider a guest which contains a disk to be “commented out”:
<domain type="kvm">
  ...
  <devices>
    ...
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </disk>
    ...
  </devices>
</domain>
To stash the disk config as a piece of metadata requires changing the XML to
<domain type="kvm">
  ...
  <metadata>
    <s:disk xmlns:s="http://stashed.org/1" type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </s:disk>
  </metadata>
  ...
  <devices>
    ...
  </devices>
</domain>
What we have done here is
– Added a <metadata> element at the top level
– Moved the <disk> element to be a child of <metadata> instead of a child of <devices>
– Added an XML namespace to <disk> by giving it an ‘s:’ prefix and associating a URI with this prefix
Libvirt only allows a single top level metadata element per namespace, so if there are multiple things to be stashed, just give them each a custom namespace, or introduce an arbitrary wrapper. Aside from mandating the use of a unique namespace, libvirt treats the metadata as entirely opaque and will not try to interpret or parse it in any way. Any valid XML construct can be stashed in the metadata; even invalid XML constructs can be stashed, provided they are hidden inside a CDATA block. For example, if you’re using virsh edit to make some changes interactively and want to get out before finishing them, just stash the changes in a CDATA section, avoiding the need to worry about correctly closing the elements.
<domain type="kvm">
  ...
  <metadata>
    <s:stash xmlns:s="http://stashed.org/1">
    <![CDATA[
      <disk type='file' device='disk'>
        <driver name='qemu' type='raw'/>
        <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
        <target dev='vda' bus='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
      </disk>
      <disk>
        <driver name='qemu' type='raw'/>
        ...i'll finish writing this later...
    ]]>
    </s:stash>
  </metadata>
  ...
  <devices>
    ...
  </devices>
</domain>
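To pull the stashed snippet back out later, one option (assuming xmllint is installed; the guest name is illustrative) is to extract it from the dumped XML by element name, side-stepping the namespace declaration:
virsh dumpxml demo | xmllint --xpath "//*[local-name()='stash']" -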
Admittedly this is a somewhat cumbersome solution. In most cases it is probably simpler to just save the snippet of XML in a plain text file outside libvirt. This metadata trick, however, might just come in handy sometimes.
As an aside, the real, intended usage of the <metadata> facility is to allow applications which interact with libvirt to store custom data they may wish to associate with the guest. As an example, the recently announced libvirt websockets console proxy uses it to record which consoles are to be exported. I know of few other real world applications using this metadata feature, however, so it is worth remembering it exists :-) System administrators are free to use it for local book keeping purposes too.
Laszlo wrote an article over at the edk2 wiki about testing SMM with OVMF, in QEMU/KVM virtual machines managed by libvirt:
The primary goal of the article is to help rapid development and testing of SMM-related firmware code (or any edk2 code in general).