I am happy to announce a new bugfix release of virt-viewer 6.0 (gpg), including experimental Windows installers for Win x86 MSI (gpg) and Win x64 MSI (gpg). The
virt-viewer binaries in the Windows builds should now successfully connect to libvirtd, following fixes to libvirt’s mingw port.
Signatures are created with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R).
All historical releases are available from:
Changes in this release include:
Thanks to everyone who contributed towards this release.
I am happy to announce a new release of libosinfo version 1.1.0 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.
Changes in this release include:
Thanks to everyone who contributed towards this release.
QEMU has a lot of interfaces (like command line options or HMP commands) and old features (like certain devices) which are considered deprecated, since other more generic or better interfaces and features have been established instead. While the QEMU developers generally try to keep each QEMU release compatible with the previous ones, the old legacy sometimes gets in the way when developing new code and/or imposes quite a maintenance burden.
Thus we are currently considering getting rid of some of the old interfaces and features in a future release, and have started to collect a list of such old items in our QEMU documentation. If you are running QEMU directly, please have a look at the deprecation chapter of the QEMU documentation to see whether you are still using one of these old interfaces or features, so you can adapt your setup to use the new interfaces/features instead. Or, if you think that one of the items should not be removed from QEMU at all, please speak up on the qemu-devel mailing list to explain why the interface or feature is still required.
The perf(1) tool added support for userspace static probes in Linux 4.8. Userspace static probes are pre-defined trace points in userspace applications. Application developers add them so frequently needed lifecycle events are available for performance analysis, troubleshooting, and development.
Static userspace probes are more convenient than defining your own function probes from scratch. You can save time by using them and not worrying about where to add probes because that has already been done for you.
On my Fedora 26 machine the QEMU, gcc, and nodejs packages ship with static userspace probes. QEMU offers probes for vcpu events, disk I/O activity, device emulation, and more.
Without further ado, here is how to trace static userspace probes with perf(1)!
The perf(1) tool needs to scan the application's ELF binaries for static userspace probes and store the information in $HOME/.debug/usr/:
# perf buildid-cache --add /usr/bin/qemu-system-x86_64
Once the ELF binaries have been scanned you can list the probes as follows:
# perf list sdt_*:*

List of pre-defined events (to be used in -e):

  sdt_qemu:aio_co_schedule                           [SDT event]
  sdt_qemu:aio_co_schedule_bh_cb                     [SDT event]
  sdt_qemu:alsa_no_frames                            [SDT event]
First add probes for the events you are interested in:
# perf probe sdt_qemu:blk_co_preadv
Added new event:
sdt_qemu:blk_co_preadv (on %blk_co_preadv in /usr/bin/qemu-system-x86_64)
You can now use it in all perf tools, such as:
perf record -e sdt_qemu:blk_co_preadv -aR sleep 1
Then capture trace data as follows:
# perf record -a -e sdt_qemu:blk_co_preadv
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 2.274 MB perf.data (4714 samples) ]
The trace can be printed using perf-script(1):
# perf script
qemu-system-x86 3425  2183.218343: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=0 arg4=512 arg5=0
qemu-system-x86 3425  2183.310712: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=0 arg4=512 arg5=0
qemu-system-x86 3425  2183.310904: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=512 arg4=512 arg5=0
If you want to get fancy it's also possible to write trace analysis scripts with perf-script(1). That's a topic for another post but see the --gen-script= option to generate a skeleton script.
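For example, assuming your perf build was compiled with Python scripting support, a skeleton can be generated and run like this:

```shell
# Generate a skeleton Python analysis script from the current perf.data;
# perf writes it to perf-script.py in the working directory:
perf script --gen-script=python

# Run the generated script over the trace:
perf script -s perf-script.py
```

The skeleton contains one handler function per traced event, so you only need to fill in the analysis logic.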
As of July 2017 there are a few limitations to be aware of:
Probe arguments are automatically numbered and do not have human-readable names. You will see arg1, arg2, etc and will need to reference the probe definition in the application source code to learn the meaning of the argument. Some versions of perf(1) may not even print arguments automatically since this feature was added later.
The contents of string arguments are not printed, only the memory address of the string.
Probes called from multiple call-sites in the application result in multiple perf probes. For example, if probe foo is called from 3 places you get sdt_myapp:foo, sdt_myapp:foo_1, and sdt_myapp:foo_2 when you run perf probe --add sdt_myapp:foo.
The SystemTap semaphores feature is not supported and such probes will not fire unless you manually set the semaphore inside your application or from another tool like GDB. This means that the sdt_myapp:foo will not fire if the application uses the MYAPP_FOO_ENABLED() macro like this: if (MYAPP_FOO_ENABLED()) MYAPP_FOO();.
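If you do need such a probe to fire, one workaround is to flip the semaphore from outside the process. With systemtap's <sys/sdt.h> the semaphore is by convention a global variable named <provider>_<probe>_semaphore, so for the hypothetical sdt_myapp:foo example something like this should work:

```shell
# Set the SDT semaphore in the running process so that MYAPP_FOO_ENABLED()
# becomes true (myapp and the symbol name are hypothetical examples):
gdb -p "$(pidof myapp)" -batch \
    -ex 'set variable myapp_foo_semaphore = 1'
```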
Static userspace probes were popularized by DTrace's <sys/sdt.h> header. Tracers that came after DTrace implemented the same interface for compatibility.
On Linux the initial tool for static userspace probes was SystemTap. In fact, the <sys/sdt.h> header file on my Fedora 26 system is still part of the systemtap-sdt-devel package.
It's very handy to have static userspace probing available alongside all the other perf(1) tracing features. There are a few limitations to keep in mind but if your tracing workflow is based primarily around perf(1) then you can now begin using static userspace probes without relying on additional tools.
This post is a 64-bit companion to an earlier post of mine where I described how to get Debian running on QEMU emulating a 32-bit ARM “virt” board. Thanks to commenter snak3xe for reminding me that I’d said I’d write this up…
For 64-bit ARM QEMU emulates many fewer boards, so “virt” is almost the only choice, unless you specifically know that you want to emulate one of the 64-bit Xilinx boards. “virt” supports PCI, virtio, a recent ARM CPU and large amounts of RAM. The only thing it doesn’t have out of the box is graphics.
I’m going to assume you have a Linux host, and a recent version of QEMU (at least QEMU 2.8). I also use libguestfs to extract files from a QEMU disk image, but you could use a different tool for that step if you prefer.
I’m going to document how to set up a guest which directly boots the kernel. It should also be possible to have QEMU boot a UEFI image which then boots the kernel from a disk image, but that’s not something I’ve looked into doing myself. (There may be tutorials elsewhere on the web.)
I suggest creating a subdirectory for these and the other files we’re going to create.
wget -O installer-linux http://http.us.debian.org/debian/dists/stretch/main/installer-arm64/current/images/netboot/debian-installer/arm64/linux
wget -O installer-initrd.gz http://http.us.debian.org/debian/dists/stretch/main/installer-arm64/current/images/netboot/debian-installer/arm64/initrd.gz
Saving them locally as installer-linux and installer-initrd.gz means they won’t be confused with the final kernel and initrd that the installation process produces.
(If we were installing on real hardware we would also need a “device tree” file to tell the kernel the details of the exact hardware it’s running on. QEMU’s “virt” board automatically creates a device tree internally and passes it to the kernel, so we don’t need to provide one.)
First we need to create an empty disk drive to install onto. I picked a 5GB disk but you can make it larger if you like.
qemu-img create -f qcow2 hda.qcow2 5G
Now we can run the installer:
qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel installer-linux \
  -initrd installer-initrd.gz \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic -no-reboot
The installer will display its messages on the text console (via an emulated serial port). Follow its instructions to install Debian to the virtual disk; it’s straightforward, but if you have any difficulty the Debian installation guide may help.
The actual install process will take a few hours as it downloads packages over the network and writes them to disk. It will occasionally stop to ask you questions.
Late in the process, the installer will print the following warning dialog:
+-----------------| [!] Continue without boot loader |------------------+
|                                                                       |
|                       No boot loader installed                        |
| No boot loader has been installed, either because you chose not to or |
| because your specific architecture doesn't support a boot loader yet. |
|                                                                       |
| You will need to boot manually with the /vmlinuz kernel on partition  |
| /dev/vda1 and root=/dev/vda2 passed as a kernel argument.             |
|                                                                       |
|                              <Continue>                               |
|                                                                       |
+-----------------------------------------------------------------------+
Press continue for now, and we’ll sort this out later.
Eventually the installer will finish by rebooting; this should cause QEMU to exit (since we used the -no-reboot option).
At this point you might like to make a copy of the hard disk image file, to save the tedium of repeating the install later.
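One way to do that, assuming qemu-img is available, is to keep the freshly installed image as a read-only backing file and do all further work in a copy-on-write overlay (filenames here are just examples):

```shell
# Preserve the pristine installed image:
mv hda.qcow2 stretch-installed.qcow2

# Create a copy-on-write overlay named hda.qcow2, so the QEMU command
# lines in this post keep working unchanged; writes go to the overlay
# and the backing file stays untouched:
qemu-img create -f qcow2 \
    -o backing_file=stretch-installed.qcow2,backing_fmt=qcow2 \
    hda.qcow2
```

To start over from the fresh install, just delete the overlay and recreate it.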
The installer warned us that it didn’t know how to arrange to automatically boot the right kernel, so we need to do it manually. For QEMU that means we need to extract the kernel the installer put into the disk image so that we can pass it to QEMU on the command line.
There are various tools you can use for this, but I’m going to recommend libguestfs, because it’s the simplest to use. To check that it works, let’s look at the partitions in our virtual disk image:
$ virt-filesystems -a hda.qcow2
/dev/sda1
/dev/sda2
If this doesn’t work, then you should sort that out first. A couple of common reasons I’ve seen:
The files in /boot are installed not-world-readable; you can fix this with
sudo chmod 644 /boot/vmlinuz*
Looking at what’s in our disk we can see the kernel and initrd in /boot:
$ virt-ls -a hda.qcow2 /boot/
System.map-4.9.0-3-arm64
config-4.9.0-3-arm64
initrd.img
initrd.img-4.9.0-3-arm64
initrd.img.old
lost+found
vmlinuz
vmlinuz-4.9.0-3-arm64
vmlinuz.old
and we can copy them out to the host filesystem:
virt-copy-out -a hda.qcow2 /boot/vmlinuz-4.9.0-3-arm64 /boot/initrd.img-4.9.0-3-arm64 .
(We want the longer filenames, because vmlinuz and initrd.img are just symlinks and virt-copy-out won’t copy them.)
An important warning about libguestfs, or any other tools for accessing disk images from the host system: do not try to use them while QEMU is running, or you will get disk corruption when both the guest OS inside QEMU and libguestfs try to update the same image.
If you subsequently upgrade the kernel inside the guest, you’ll need to repeat this step to extract the new kernel and initrd, and then update your QEMU command line appropriately.
To run the installed system we need a different command line which boots the installed kernel and initrd, and passes the kernel the command line arguments the installer told us we’d need:
qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel vmlinuz-4.9.0-3-arm64 \
  -initrd initrd.img-4.9.0-3-arm64 \
  -append 'root=/dev/vda2' \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic
This should boot to a login prompt, where you can log in with the user and password you set up during the install.
The installed system has an SSH client, so one easy way to get files in and out is to use “scp” from inside the VM to talk to an SSH server outside it. Or you can use libguestfs to write files directly into the disk image (for instance using virt-copy-in) — but make sure you only use libguestfs when the VM is not running, or you will get disk corruption.
This is a follow-up to Running Hyper-V in a QEMU/KVM Guest published earlier this year. The article provided instructions on setting up Hyper-V in a QEMU/KVM Windows guest as enabled by a particular KVM patchset (on Intel hardware only, as it turned out later). Several issues have been found since then; some already fixed, some in the process of being fixed, and some still not fully understood.
This post aims to be an up-to-date list of issues related to Hyper-V on KVM, showing their current status and, where applicable, upstream commit IDs. The issues are ordered chronologically from the oldest ones to those found recently.
| Issue description                                       | Status                          | Public bug tracker |
|---------------------------------------------------------|---------------------------------|--------------------|
| Hyper-V on KVM does not work at all (initial work item) | Fixed in kernel 4.10            |                    |
| Hyper-V on KVM does not work on new Intel CPUs with PML | Fixed in kernel 4.11            |                    |
| Hyper-V on KVM does not work on AMD CPUs                | Fixed in kernel 4.12 for 1 vCPU |                    |
For anyone interested in the AF_VSOCK zero-configuration host<->guest communications channel it's important to be able to observe traffic. Packet capture is commonly used to troubleshoot network problems and debug networking applications. Up until now it hasn't been available for AF_VSOCK.
In 2016 Gerard Garcia created the vsockmon Linux driver that enables AF_VSOCK packet capture. During the course of his excellent Google Summer of Code work he also wrote patches for libpcap, tcpdump, and Wireshark.
Recently I revisited Gerard's work because Linux 4.12 shipped with the new vsockmon driver, making it possible to finalize the userspace support for AF_VSOCK packet capture. And it's working beautifully:
I have sent the latest patches to the tcpdump and Wireshark communities so AF_VSOCK can be supported out-of-the-box in the future. For now you can also find patches in my personal repositories:
The basic flow is as follows:
# ip link add type vsockmon
# ip link set vsockmon0 up
# tcpdump -i vsockmon0
# ip link set vsockmon0 down
# ip link del vsockmon0
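You can also capture to a file for later analysis instead of printing packets live (vsock.pcap is just an example filename):

```shell
# Write AF_VSOCK traffic to a pcap file:
tcpdump -i vsockmon0 -w vsock.pcap

# Later, re-read the capture with tcpdump, or open it in Wireshark:
tcpdump -r vsock.pcap
```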
It's easiest to wait for distros to package Linux 4.12 and future versions of libpcap, tcpdump, and Wireshark. If you decide to build from source, make sure to build libpcap first and then tcpdump or Wireshark. The libpcap dependency is necessary so that tcpdump/Wireshark can access AF_VSOCK traffic.
Fedora 26 is out of the door, and here are fresh fedora 26 images.
There are Raspberry Pi images. The aarch64 image requires a model 3; the armv7 image boots on both model 2 and model 3. Unlike the images for the previous fedora releases, the new images use the standard fedora kernels instead of a custom kernel, so the kernel update service for the older images will stop within the next few weeks.
There are efi images for qemu. The i386 and x86_64 images use systemd-boot as bootloader; grub2 doesn’t work due to bug 1196114 (unless you create a boot menu entry manually in uefi setup). The arm images use grub2 as bootloader: armv7 isn’t supported by systemd-boot in the first place, and the aarch64 version throws an exception. The efi images can also be booted as containers, using "systemd-nspawn --boot --image <file>", but you have to convert them to raw first.
The images don’t have a root password. You have to set one using "virt-customize -a <image> --root-password password:<secret>", otherwise you can’t log in after boot.
The images have been created with imagefish.
Debian 9 (“Stretch”) was released last week and now it’s available in virt-builder, the fast way to build virtual machine disk images:
$ virt-builder -l | grep debian
debian-6        x86_64     Debian 6 (Squeeze)
debian-7        sparc64    Debian 7 (Wheezy) (sparc64)
debian-7        x86_64     Debian 7 (Wheezy)
debian-8        x86_64     Debian 8 (Jessie)
debian-9        x86_64     Debian 9 (stretch)

$ virt-builder debian-9 \
    --root-password password:123456
[   0.5] Downloading: http://libguestfs.org/download/builder/debian-9.xz
[   1.2] Planning how to build this image
[   1.2] Uncompressing
[   5.5] Opening the new disk
[  15.4] Setting a random seed
virt-builder: warning: random seed could not be set for this type of guest
[  15.4] Setting passwords
[  16.7] Finishing off
                   Output file: debian-9.img
                   Output size: 6.0G
                 Output format: raw
            Total usable space: 3.9G
                    Free space: 3.1G (78%)

$ qemu-system-x86_64 \
    -machine accel=kvm:tcg -cpu host -m 2048 \
    -drive file=debian-9.img,format=raw,if=virtio \
    -serial stdio
libguestfs is a C library for creating and editing disk images. In the most common (but not the only) configuration, it uses KVM to sandbox access to disk images. The C library talks to a separate daemon running inside a KVM appliance, as in this Unicode-art diagram taken from the fine manual:
 ┌───────────────────┐
 │ main program      │
 │                   │
 │                   │           child process / appliance
 │                   │          ┌──────────────────────────┐
 │                   │          │ qemu                     │
 ├───────────────────┤   RPC    │      ┌─────────────────┐ │
 │ libguestfs  ◀╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍▶ guestfsd        │ │
 │                   │          │      ├─────────────────┤ │
 └───────────────────┘          │      │ Linux kernel    │ │
                                │      └────────┬────────┘ │
                                └───────────────│──────────┘
                                                │
                                                │ virtio-scsi
                                         ┌──────┴──────┐
                                         │  Device or  │
                                         │  disk image │
                                         └─────────────┘
The library has to be written in C because it needs to be linked to any main program. The daemon (guestfsd in the diagram) is also written in C, but there’s not so much a specific reason for that, except that’s what we did historically.
The daemon is essentially a big pile of functions, most corresponding to a libguestfs API. Writing the daemon in C is painful to say the least. Because it’s a long-running process running in a memory-constrained environment, we have to be very careful about memory management, religiously checking every return from strdup etc., making even the simplest task non-trivial and full of untested code paths.
So last week I modified libguestfs so you can now write APIs in OCaml if you want to. OCaml is a high level language that compiles down to object files, and it’s entirely possible to link the daemon from a mix of C object files and OCaml object files. Another advantage of OCaml is that you can call from C ↔ OCaml with relatively little glue code (although a disadvantage is that you still need to write that glue mostly by hand). Most simple calls turn into direct CALL instructions with just a simple bitshift required to convert between ints and bools on the C and OCaml sides. More complex calls passing strings and structures are not too difficult either.
OCaml also turns memory errors into a single exception, which unwinds the stack cleanly, so we don’t litter the code with memory handling. We can still run the mixed C/OCaml binary under valgrind.
Code gets quite a bit shorter. For example the case_sensitive_path API — all string handling and directory lookups — goes from 183 lines of C code to 56 lines of OCaml code (and much easier to understand too).
I’m reimplementing a few APIs in OCaml, but the plan is definitely not to convert them all. I think we’ll have C and OCaml APIs in the daemon for a very long time to come.
There were a couple of QEMU / virtualization related talks at the DevConf 2017 conference, which took place at the end of January already, but so far we had missed gathering the links to the recordings of these talks. So here is the list now:
How to write a legacy storage device emulator by John Snow
GPU Passthrough using GNOME Boxes by Felipe Borges
Self-virtualizing Linux on x86 by Radim Krčmář
New ways to remote desktops with GStreamer integration by Victor Toso de Carvalho and Pavel Grunt
Can VMs networking benefit from DPDK by Maxime Coquelin and Victor Kaplansky
Windows VMs on KVM: The Good, the Bad and the Ugly by Ladi Prosek
Pet VMs in Kubernetes? WTH by Fabian Deutsch
Virtualization development improved with Lago by Rafael Martins
An ARMful of guests: virtualization on 64-bit ARM by Andrea Bolognani
First, there is ninja-build. It’s a workhorse, roughly comparable to make, but it isn’t really designed to be used standalone: typically the low-level ninja build files are generated by some high-level build tool, similar to how Makefiles are generated by autotools.
Second, there is meson, a build tool which (on unix) uses ninja as its default backend. meson appears to be becoming pretty popular.
So, lets have a closer look at it. I’m working on drminfo right now, a tool to dump information about drm devices, which also comes with a simple test tool, rendering a test image to the display. It is pretty small, doesn’t even use autotools, perfect for trying out something new. Also nice for this post as the build files are pretty small.
So, here is the Makefile:
CC     ?= gcc
CFLAGS ?= -Os -g -std=c99
CFLAGS += -Wall

TARGETS := drminfo drmtest gtktest

drminfo : CFLAGS += $(shell pkg-config --cflags libdrm cairo pixman-1)
drminfo : LDLIBS += $(shell pkg-config --libs libdrm cairo pixman-1)

drmtest : CFLAGS += $(shell pkg-config --cflags libdrm gbm epoxy cairo cairo-gl pixman-1)
drmtest : LDLIBS += $(shell pkg-config --libs libdrm gbm epoxy cairo cairo-gl pixman-1)
drmtest : LDLIBS += -ljpeg

gtktest : CFLAGS += $(shell pkg-config --cflags gtk+-3.0 cairo pixman-1)
gtktest : LDLIBS += $(shell pkg-config --libs gtk+-3.0 cairo pixman-1)
gtktest : LDLIBS += -ljpeg

all: $(TARGETS)

clean:
	rm -f $(TARGETS)
	rm -f *~ *.o

drminfo: drminfo.o drmtools.o
drmtest: drmtest.o drmtools.o render.o image.o
gtktest: gtktest.o render.o image.o
Thanks to pkg-config there is no need to use autotools just to figure the cflags and libraries needed, and the Makefile is short and easy to read. The only thing here you might not be familiar with are target specific variables.
Now, compare with the meson.build file:
project('drminfo', 'c')

# pkg-config deps
libdrm_dep   = dependency('libdrm')
gbm_dep      = dependency('gbm')
epoxy_dep    = dependency('epoxy')
cairo_dep    = dependency('cairo')
cairo_gl_dep = dependency('cairo-gl')
pixman_dep   = dependency('pixman-1')
gtk3_dep     = dependency('gtk+-3.0')

# libjpeg dep
jpeg_dep = declare_dependency(link_args : '-ljpeg')

drminfo_srcs = [ 'drminfo.c', 'drmtools.c' ]
drmtest_srcs = [ 'drmtest.c', 'drmtools.c', 'render.c', 'image.c' ]
gtktest_srcs = [ 'gtktest.c', 'render.c', 'image.c' ]

drminfo_deps = [ libdrm_dep, cairo_dep, pixman_dep ]
drmtest_deps = [ libdrm_dep, gbm_dep, epoxy_dep,
                 cairo_dep, cairo_gl_dep, pixman_dep, jpeg_dep ]
gtktest_deps = [ gtk3_dep, cairo_dep, cairo_gl_dep, pixman_dep, jpeg_dep ]

executable('drminfo',
           sources      : drminfo_srcs,
           dependencies : drminfo_deps)
executable('drmtest',
           sources      : drmtest_srcs,
           dependencies : drmtest_deps)
executable('gtktest',
           sources      : gtktest_srcs,
           dependencies : gtktest_deps,
           install      : false)
A pretty straightforward translation. So, what are the differences?
First, meson and ninja have built-in support for a bunch of features. No need to put anything into your build files to use them; they are just there:
Sure, you can do all that with make too; the linux kernel build system does it, for example. But then your Makefiles will be an order of magnitude larger than the one shown above, because all the clever stuff is in the build files instead of the build tool.
Second, meson keeps the object files strictly separated by target. The project has some source files shared by multiple executables; drmtools.c for example is used by both drminfo and drmtest. With the Makefile above it gets built once. meson builds it separately for each target, with the cflags for the specific target.
Another nice feature is that ninja automatically does parallel builds. It figures the number of processors available and runs (by default) that many jobs.
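For reference, the typical out-of-tree workflow with these build files looks like this (the build directory name is arbitrary):

```shell
meson build           # configure: reads meson.build, generates ninja files in ./build
ninja -C build        # build, parallelized across all CPUs automatically
ninja -C build clean  # remove build products
```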
Overall I’m pretty pleased, and I’ll probably use meson more frequently in the future. If you want to try it out too, I’d suggest starting with the tutorial.
When you are trying to start a KVM guest via libvirt on an s390x Linux installation that is running on an older version of z/VM, you might run into the problem that QEMU refuses to start with this error message:
cannot set up guest memory 's390.ram': Permission denied.
This happens because older versions of z/VM (before version 6.3) do not support the so-called “enhanced suppression on protection facility” (ESOP) yet, so QEMU has to allocate the memory for the guest with a “hack”, and this hack uses mmap(… PROT_EXEC …) for the allocation.
Now this mmap() call is not allowed by the default SELinux rules (at least not on RHEL-based systems), so QEMU fails to allocate the memory for the guest here. Turning off SELinux completely just to run a KVM guest is of course a bad idea, but fortunately there is already a SELinux boolean value called virt_use_execmem which can be used to tune the behavior here:
setsebool virt_use_execmem 1
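Note that setsebool without further options only changes the value until the next reboot. With -P the change is also written to the policy store permanently, and getsebool shows the current state (both need SELinux tooling and root):

```shell
setsebool -P virt_use_execmem 1   # -P makes the change persistent across reboots
getsebool virt_use_execmem        # should report: virt_use_execmem --> on
```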
This configuration switch was originally introduced for running TCG guests (i.e. running QEMU without KVM), but in this case it also fixes the problem with KVM guests. Anyway, since setting this SELinux variable to 1 also slightly decreases security (though it is still better than disabling SELinux completely), you should rather upgrade your z/VM to version 6.3 (or newer), or use a real LPAR for the KVM host installation instead, if that is feasible.
--- Original XML
+++ Altered XML
@@ -1,4 +1,4 @@
-<domain type="kvm">
+<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
@@ -104,4 +104,8 @@
       <address type="pci" domain="0x0000" bus="0x00" slot="0x0a" function="0x0"/>
+    <qemu:arg value="-device"/>
+    <qemu:arg value="foo"/>

Define 'f25' with the changed XML? (y/n):
Here is a short list of articles and blog posts about QEMU and KVM, that were posted last month.
QEMU and the qcow2 metadata checks by Alberto Garcia
Running Hyper-V in a QEMU/KVM Guest by Ladislav Prošek
Setting up a nested KVM guest for developing & testing PCI device assignment with NUMA by Daniel P. Berrangé
Testing SMM with QEMU, KVM and libvirt by László Érsek
The surprisingly complicated world of disk image sizes by Daniel P. Berrangé
More virtualization blog posts can be found on the virt tools planet.
In other news, QEMU is now in hard freeze for release 2.9.0. The preliminary list of features is on the wiki.
virt-inspector is a very convenient tool to examine a disk image and find out if it contains an operating system, what applications are installed and so on.
nbdkit xz file=win7.img.xz \
    -U - \
    --run 'virt-inspector --format=raw -a nbd://?socket=$unixsocket'
What’s happening here is we run nbdkit with the xz plugin, and tell it to serve NBD over a randomly named Unix domain socket (the -U - option).
We then run virt-inspector as a sub-process. This is called “captive nbdkit”. (Nbdkit is “captive” here, because it will exit as soon as virt-inspector exits, so there’s no need to clean anything up.)
The $unixsocket variable expands to the name of the randomly generated Unix domain socket, forming a libguestfs NBD URL which allows virt-inspector to examine the raw uncompressed data exported by nbdkit.
The nbdkit xz plugin only uncompresses those blocks of the data which are actually accessed, so this is quite efficient.
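The same captive trick works with any NBD-aware client, not just virt-inspector; for example (a sketch using qemu-img as the client):

```shell
# Serve the xz-compressed image over a private Unix socket and let
# qemu-img inspect the uncompressed data; nbdkit exits when qemu-img does:
nbdkit xz file=win7.img.xz \
    -U - \
    --run 'qemu-img info nbd:unix:$unixsocket'
```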
As of today, creating a libvirt LXC system container root file system is a pain. A big part of Docker’s fun comes from its image-sharing idea... so why couldn’t we do the same for libvirt containers? What follows is an attempt at this.
To achieve such a goal we need:
OpenBuildService, thanks to kiwi, knows how to create images, even container images. There are even openSUSE Docker images. To use them as system container images, some more packages need to be added to them. I thus forked the project on github and branched the OBS projects to get system container images for 42.1, 42.2 and Tumbleweed.
Using them is as simple as downloading them, unpacking them and using them as a container’s root file system. However, sharing them would be even more fun!
There is no need to reinvent the wheel to share the images. We can just consider them like any docker image. With the following commands we can import the image and push it to a remote registry.
docker import openSUSE-42.2-syscontainer-guest-docker.x86_64.tar.xz system/opensuse-42.2
docker tag system/opensuse-42.2 myregistry:5000/system/opensuse-42.2
docker login myregistry:5000
docker push myregistry:5000/system/opensuse-42.2
The good thing with this is that we can even use the docker build and Dockerfile magic to create customized images and push them to the remote registry.
Now we need a tool to get the images from the remote docker registry. Hopefully there is a tool that helps a lot with this: skopeo. I wrote a small virt-bootstrap tool using it to instantiate the images as root file systems.
Here is what instantiating a container looks like with it:
virt-bootstrap.py --username myuser \
    --root-password test \
    docker://myregistry:5000/system/opensuse-42.2 /path/to/my/container

virt-install --connect lxc:/// -n 422 --memory 250 --vcpus 1 \
    --filesystem /path/to/my/container,/ \
    --filesystem /etc/resolv.conf,/etc/resolv.conf \
    --network network=default
And voila! Creating an openSUSE 42.2 system container and running it with libvirt is now super easy!
So, finally the query-cpu-model-expansion x86 implementation was merged just before 2.9 soft freeze. Jiri Denemark already implemented the x86 libvirt code to use it. I just can’t believe this was finally done after so many years.
It was a weird journey. It started almost 6 years ago with this message to qemu-devel:
Date: Fri, 10 Jun 2011 18:36:37 -0300
Subject: semantics of “-cpu host” and “check”/”enforce”
…it continued on an interesting thread:
Date: Tue, 6 Mar 2012 15:27:53 -0300
Subject: Qemu, libvirt, and CPU models
…on another very long one:
Date: Fri, 9 Mar 2012 17:56:52 -0300
Subject: Re: [Qemu-devel] [libvirt] Modern CPU models cannot be used with libvirt
…and this one:
Date: Thu, 21 Feb 2013 11:58:18 -0300
Subject: libvirt<->QEMU interfaces for CPU models
I don’t even remember how many different interfaces were proposed to provide what libvirt needed.
We had a few moments where we hopped back and forth between “just let libvirt manage everything” to “let’s keep this managed by QEMU”.
We took a while to get the QEMU community to decide how machine-type compatibility was supposed to be handled, and what to do with the weird CPU model config file we had.
The conversion of CPUs to QOM was fun. I think it started in 2012 and was finished only in 2015. We thought QOM properties would solve all our problems, but then we found out that machine-types and global properties make the problem more complex. The existing interfaces would require making libvirt re-run QEMU multiple times to gather all the information it needed. While doing the QOM work, we spent some time fixing or working around issues with global properties, qdev “static” properties and QOM “dynamic” properties.
In 2014, my focus was moved to machine-types, in the hope that we could finally expose machine-type-specific information to libvirt without re-running QEMU. Useful code refactoring was done for that, but in the end we never added the really relevant information to the query-machines QMP command.
In the meantime, we had the fun TSX issues, and QEMU developers finally agreed to keep a few constraints on CPU model changes, that would make the problem a bit simpler.
In 2015 IBM people started sending patches related to CPU models in s390x. We finally had a multi-architecture effort to make CPU model probing work. The work started by extending query-cpu-definitions, but it was not enough. In June 2016 they proposed the query-cpu-model-expansion API. It was finally merged in September 2016.
I sent v1 of query-cpu-model-expansion for x86 in December 2016. After a few rounds of reviews, there was a proposal to use “-cpu max” to represent “all features supported by this QEMU binary on this host”. v3 of the series was merged last week.
I still can’t believe it finally happened.
Special thanks to:
The slides and videos for my FOSDEM 2017 talk (QEMU: internal APIs and conflicting world views) are available online.
The subject I tried to cover is large for a 40-minute talk, but I think I managed to scratch its surface and give useful examples.
Many thanks to the FOSDEM team of volunteers for the wonderful event.
This article provides a how-to on setting up nested virtualization, in particular running Microsoft Hyper-V as a guest of QEMU/KVM. The usual terminology is going to be used in the text: L0 is the bare-metal host running Linux with KVM and QEMU. L1 is L0’s guest, running Microsoft Windows Server 2016 with the Hyper-V role enabled. And L2 is L1’s guest, a virtual machine running Linux, Windows, or anything else. Only Intel hardware is considered here. It is possible that the same can be achieved with AMD’s hardware virtualization support but it has not been tested yet.
Update 4/2017: Nested virtualization is currently broken on AMD; a fix is coming.
Update 7/2017: Check out Nesting Hyper-V in QEMU/KVM: Known issues for a list of known issues and their status.
A quick note on performance. Since the Intel VMX technology does not directly support nested virtualization in hardware, what L1 perceives as hardware-accelerated virtualization is in fact software emulation of VMX by L0. Thus, workloads will inevitably run slower in L2 compared to L1.
A fairly recent kernel is required for Hyper-V on QEMU/KVM to function properly. The first commit known to work is 1dc35da, available in Linux 4.10 and newer.
Nested Intel virtualization must be enabled. If the following command does not return “Y”, kvm-intel.nested=1 must be passed to the kernel as a parameter.
$ cat /sys/module/kvm_intel/parameters/nested
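If it is not enabled, the module option can also be set without editing the kernel command line. The modprobe.d file name below is just a convention, not mandated:

```shell
# Enable nested VMX for the current boot (the module must not be in use,
# i.e. no VMs running):
modprobe -r kvm_intel
modprobe kvm_intel nested=1

# Make it persistent across reboots:
echo "options kvm_intel nested=1" > /etc/modprobe.d/kvm-nested.conf
```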
Update 4/2017: On newer Intel CPUs with PML (Page Modification Logging) support such as Kaby Lake, Skylake, and some server Broadwell chips, PML needs to be disabled by passing kvm-intel.pml=0 to the kernel as a parameter. Fix is coming.
QEMU 2.7 should be enough to make nested virtualization work. As always, it is advisable to use the latest stable version available. SeaBIOS version 1.10 or later is required.
The QEMU command line must include the +vmx cpu feature, for example:
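A minimal invocation could look like the following sketch; the CPU model, memory size, SMP count and disk image name are illustrative, not taken from this post (with nested virt enabled, “-cpu host” also exposes vmx):

```shell
# Boot the Windows Server 2016 L1 guest with VMX exposed to it.
# "ws2016.qcow2" is a hypothetical disk image name.
qemu-system-x86_64 -enable-kvm \
    -cpu Haswell,+vmx \
    -m 8192 -smp 4 \
    -drive file=ws2016.qcow2,format=qcow2
```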
If QEMU warns about the vmx feature not being available on the host, nested virt has likely not been enabled in KVM (see the previous paragraph).
Once the Windows L1 guest is installed, add the Hyper-V role as usual. Only Windows Server 2016 is known to support nested virtualization at the moment.
If Windows complains about missing HW virtualization support, re-check QEMU and SeaBIOS versions. If the Hyper-V role is already installed and nested virt is misconfigured or not supported, the error shown by Windows tends to mention “Hyper-V components not running” like in the following screenshot.
If everything goes well, both Gen 1 and Gen 2 Hyper-V virtual machines can be created and started. Here’s a screenshot of Windows XP 64-bit running as a guest in Windows Server 2016, which itself is a guest in QEMU/KVM.
I’ve uploaded three new images to https://www.kraxel.org/repos/rpi2/images/.
The fedora-25-rpi3 image is for the raspberry pi 3.
The fedora-25-efi images are for qemu (virt machine type with edk2 firmware).
The images don’t have a root password set. You must use libguestfs-tools to set the root password …
virt-customize -a <image> --root-password "password:<your-password-here>"
… otherwise you can’t login after boot.
The rpi3 image is partitioned similarly to the official (armv7) Fedora 25 images: the firmware and uboot live on a separate vfat partition (mounted at /boot/fw). /boot is an ext2 filesystem now and holds only the kernels. Well, for compatibility with the f24 images (all images use the same kernel rpms) firmware files are in /boot too, but they are not used. So, if you want to tweak something in config.txt, go to /boot/fw, not /boot.
The rpi3 images also have swap commented out in /etc/fstab. The reason is that the swap partition must be reinitialized: swap partitions created while running a 64k-pages kernel (CONFIG_ARM64_64K_PAGES=y) are not compatible with 4k pages (CONFIG_ARM64_4K_PAGES=y). Fix this by running “swapon --fixpgsz <device>” once; then you can uncomment the swap line in /etc/fstab.
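Concretely, assuming the swap partition is /dev/mmcblk0p3 (the device name here is only a guess; check yours with lsblk or blkid), the fix looks like:

```shell
# Re-create the swap signature with the running kernel's page size
swapon --fixpgsz /dev/mmcblk0p3

# Uncomment the swap line in /etc/fstab, then activate all swap
sed -i '/swap/s/^#//' /etc/fstab
swapon -a
```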
Over the past few years the OpenStack Nova project has gained support for managing VM usage of NUMA, huge pages and PCI device assignment. One of the more challenging aspects of this is availability of hardware to develop and test against. In the ideal world it would be possible to emulate everything we need using KVM, enabling developers / test infrastructure to exercise the code without needing access to bare metal hardware supporting these features. KVM has long had support for emulating NUMA topology in guests, and the guest OS can use huge pages inside the guest. What was missing were pieces around PCI device assignment, namely IOMMU support and the ability to associate NUMA nodes with PCI devices.
Coincidentally a QEMU community member was already working on providing emulation of the Intel IOMMU. I made a request to the Red Hat KVM team to fill in the other missing gap related to NUMA / PCI device association. To do this required writing code to emulate a PCI/PCI-E Expander Bridge (PXB) device, which provides a lightweight host bridge that can be associated with a NUMA node. Individual PCI devices are then attached to this PXB instead of the main PCI host bridge, thus gaining affinity with a NUMA node.
With this, it is now possible to configure a KVM guest such that it can be used as a virtual host to test NUMA, huge page and PCI device assignment integration. The only real outstanding gap is support for emulating some kind of SRIOV network device, but even without this, it is still possible to test most of the Nova PCI device assignment logic – we’re merely restricted to using physical functions, not virtual functions. This blog post will describe how to configure such a virtual host.
First of all, this requires very new libvirt & QEMU to work; specifically you’ll want libvirt >= 2.3.0 and QEMU >= 2.7.0. We could technically support earlier QEMU versions too, but that’s pending a patch to libvirt to deal with some command line syntax differences in QEMU for older versions. No currently released Fedora has new enough packages available, so even on Fedora 25, you must enable the “Virtualization Preview” repository on the physical host to try this out – F25 has new enough QEMU, so you just need a libvirt update.
# curl --output /etc/yum.repos.d/fedora-virt-preview.repo https://fedorapeople.org/groups/virt/virt-preview/fedora-virt-preview.repo
# dnf upgrade
For the sake of illustration I’m using Fedora 25 as the OS inside the virtual guest, but any other Linux OS will do just fine. The initial task is to install a guest with 8 GB of RAM & 8 CPUs using virt-install
# cd /var/lib/libvirt/images
# wget -O f25x86_64-boot.iso https://download.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/images/boot.iso
# virt-install --name f25x86_64 \
    --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
    --cdrom f25x86_64-boot.iso --os-type fedora23 \
    --ram 8000 --vcpus 8 \
    ...
The guest needs to use host CPU passthrough to ensure the guest gets to see VMX, as well as other modern instructions and have 3 virtual NUMA nodes. The first guest NUMA node will have 4 CPUs and 4 GB of RAM, while the second and third NUMA nodes will each have 2 CPUs and 2 GB of RAM. We are just going to let the guest float freely across host NUMA nodes since we don’t care about performance for dev/test, but in production you would certainly pin each guest NUMA node to a distinct host NUMA node.
...
    --cpu host,cell0.id=0,cell0.cpus=0-3,cell0.memory=4096000,\
cell1.id=1,cell1.cpus=4-5,cell1.memory=2048000,\
cell2.id=2,cell2.cpus=6-7,cell2.memory=2048000 \
...
QEMU emulates various different chipsets, and historically for x86 the default has been to emulate the ancient PIIX4 (it is 20+ years old, dating from circa 1995). Unfortunately this is too ancient to support the Intel IOMMU emulation, so it is necessary to tell QEMU to emulate the marginally less ancient Q35 chipset (it is only 9 years old, dating from 2007).
... --machine q35
The complete virt-install command line thus looks like
# virt-install --name f25x86_64 \
    --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
    --cdrom f25x86_64-boot.iso --os-type fedora23 \
    --ram 8000 --vcpus 8 \
    --cpu host,cell0.id=0,cell0.cpus=0-3,cell0.memory=4096000,\
cell1.id=1,cell1.cpus=4-5,cell1.memory=2048000,\
cell2.id=2,cell2.cpus=6-7,cell2.memory=2048000 \
    --machine q35
Once the installation is complete, shut down this guest, since it will be necessary to make a number of changes to the guest XML configuration using “virsh edit” to enable features that virt-install does not know about. With the use of Q35, the guest XML should initially show three PCI controllers present: a “pcie-root”, a “dmi-to-pci-bridge” and a “pci-bridge”
<controller type='pci' index='0' model='pcie-root'/>
<controller type='pci' index='1' model='dmi-to-pci-bridge'>
  <model name='i82801b11-bridge'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
</controller>
<controller type='pci' index='2' model='pci-bridge'>
  <model name='pci-bridge'/>
  <target chassisNr='2'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</controller>
PCI endpoint devices are not themselves associated with NUMA nodes, rather the bus they are connected to has affinity. The default pcie-root is not associated with any NUMA node, but extra PCI-E Expander Bridge controllers can be added and associated with a NUMA node. So while in edit mode, add the following to the XML config
<controller type='pci' index='3' model='pcie-expander-bus'>
  <target busNr='180'>
    <node>0</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='4' model='pcie-expander-bus'>
  <target busNr='200'>
    <node>1</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</controller>
<controller type='pci' index='5' model='pcie-expander-bus'>
  <target busNr='220'>
    <node>2</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</controller>
It is not possible to plug PCI endpoint devices directly into the PXB, so the next step is to add PCI-E root ports into each PXB – we’ll need one port per device to be added, so 9 ports in total. This is where the requirement for libvirt >= 2.3.0 comes in – earlier versions mistakenly prevented you from adding more than one root port to the PXB
<controller type='pci' index='6' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='6' port='0x0'/>
  <alias name='pci.6'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='7' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='7' port='0x8'/>
  <alias name='pci.7'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='8' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='8' port='0x10'/>
  <alias name='pci.8'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='9' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='9' port='0x0'/>
  <alias name='pci.9'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='10' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='10' port='0x8'/>
  <alias name='pci.10'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='11' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='11' port='0x10'/>
  <alias name='pci.11'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='12' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='12' port='0x0'/>
  <alias name='pci.12'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='13' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='13' port='0x8'/>
  <alias name='pci.13'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='14' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='14' port='0x10'/>
  <alias name='pci.14'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x02' function='0x0'/>
</controller>
Notice that the value of the ‘bus’ attribute on the <address> element matches the value of the ‘index’ attribute on the <controller> element of the parent device in the topology. The PCI controller topology now looks like this
pcie-root (index == 0)
|
+- dmi-to-pci-bridge (index == 1)
|  |
|  +- pci-bridge (index == 2)
|
+- pcie-expander-bus (index == 3, numa node == 0)
|  |
|  +- pcie-root-port (index == 6)
|  +- pcie-root-port (index == 7)
|  +- pcie-root-port (index == 8)
|
+- pcie-expander-bus (index == 4, numa node == 1)
|  |
|  +- pcie-root-port (index == 9)
|  +- pcie-root-port (index == 10)
|  +- pcie-root-port (index == 11)
|
+- pcie-expander-bus (index == 5, numa node == 2)
   |
   +- pcie-root-port (index == 12)
   +- pcie-root-port (index == 13)
   +- pcie-root-port (index == 14)
All the existing devices are attached to the “pci-bridge” (the controller with index == 2). The devices we intend to use for PCI device assignment inside the virtual host will be attached to the new “pcie-root-port” controllers. We will provide three e1000e devices per NUMA node, so that’s 9 devices in total to add
<interface type='user'>
  <mac address='52:54:00:7e:6e:c6'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:c7'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:c8'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:d6'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:d7'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:d8'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:e6'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0c' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:e7'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:e8'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0e' slot='0x00' function='0x0'/>
</interface>
Note that we’re using the “user” networking, aka SLIRP. Normally one would never want to use SLIRP but we don’t care about actually sending traffic over these NICs, and so using SLIRP avoids polluting our real host with countless TAP devices.
The final configuration change is to simply add the Intel IOMMU device
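Assuming libvirt’s native IOMMU device support (available since libvirt 2.1.0), this is the element to add under <devices>:

```xml
<iommu model='intel'/>
```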
It is a capability integrated into the chipset, so it does not need any <address> element of its own. At this point, save the config and start the guest once more. Use the “virsh domifaddr” command to discover the IP address of the guest’s primary NIC and ssh into it.
# virsh domifaddr f25x86_64
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet0      52:54:00:10:26:7e    ipv4         192.168.122.3/24

# ssh firstname.lastname@example.org
We can now do some sanity checks that everything visible in the guest matches what was enabled in the libvirt XML config in the host. For example, confirm the NUMA topology shows 3 nodes
# dnf install numactl
# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3
node 0 size: 3856 MB
node 0 free: 3730 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1813 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1832 MB
node distances:
node   0   1   2
  0:  10  20  20
  1:  20  10  20
  2:  20  20  10
Confirm that the PCI topology shows the three PCI-E Expander Bridge devices, each with three NICs attached
# lspci -t -v
-+-[0000:dc]-+-00.0-[dd]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           +-01.0-[de]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           \-02.0-[df]----00.0  Intel Corporation 82574L Gigabit Network Connection
 +-[0000:c8]-+-00.0-[c9]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           +-01.0-[ca]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           \-02.0-[cb]----00.0  Intel Corporation 82574L Gigabit Network Connection
 +-[0000:b4]-+-00.0-[b5]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           +-01.0-[b6]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           \-02.0-[b7]----00.0  Intel Corporation 82574L Gigabit Network Connection
 \-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
             +-01.0  Red Hat, Inc. QXL paravirtual graphic card
             +-02.0  Red Hat, Inc. Device 000b
             +-03.0  Red Hat, Inc. Device 000b
             +-04.0  Red Hat, Inc. Device 000b
             +-1d.0  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
             +-1d.1  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
             +-1d.2  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
             +-1d.7  Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
             +-1e.0-[01-02]----01.0-+-01.0  Red Hat, Inc Virtio network device
             |                      +-02.0  Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller
             |                      +-03.0  Red Hat, Inc Virtio console
             |                      +-04.0  Red Hat, Inc Virtio block device
             |                      \-05.0  Red Hat, Inc Virtio memory balloon
             +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
             +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
             \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller
The IOMMU support will not be enabled yet as the kernel defaults to leaving it off. To enable it, we must update the kernel command line parameters with grub.
# vi /etc/default/grub
....add "intel_iommu=on"...
# grub2-mkconfig > /etc/grub2.cfg
While the intel-iommu device in QEMU can do interrupt remapping, there is no way to enable that feature via libvirt at this time. So we need to set a hack for vfio
echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > \
    /etc/modprobe.d/vfio.conf
This is also a good time to install libvirt and KVM inside the guest
# dnf groupinstall "Virtualization"
# dnf install libvirt-client
# rm -f /etc/libvirt/qemu/networks/autostart/default.xml
Note we’re disabling the default libvirt network, since it’ll clash with the IP address range used by this guest. An alternative would be to edit the default.xml to change the IP subnet.
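If you go with the subnet-editing alternative instead, something along these lines works; the 192.168.130.x range is just an arbitrary choice that avoids the clash:

```shell
# Rewrite the default network to use a non-clashing subnet
virsh net-dumpxml default > default.xml
sed -i 's/192\.168\.122/192.168.130/g' default.xml
virsh net-define default.xml
```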
Now reboot the guest. When it comes back up, there should be a /dev/kvm device present in the guest.
# ls -al /dev/kvm
crw-rw-rw-. 1 root kvm 10, 232 Oct  4 12:14 /dev/kvm
If this is not the case, make sure the physical host has nested virtualization enabled for the “kvm-intel” or “kvm-amd” kernel modules.
The IOMMU should have been detected and activated
# dmesg | grep -i DMAR
[    0.000000] ACPI: DMAR 0x000000007FFE2541 000048 (v01 BOCHS  BXPCDMAR 00000001 BXPC 00000001)
[    0.000000] DMAR: IOMMU enabled
[    0.203737] DMAR: Host address width 39
[    0.203739] DMAR: DRHD base: 0x000000fed90000 flags: 0x1
[    0.203776] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 12008c22260206 ecap f02
[    2.910862] DMAR: No RMRR found
[    2.910863] DMAR: No ATSR found
[    2.914870] DMAR: dmar0: Using Queued invalidation
[    2.914924] DMAR: Setting RMRR:
[    2.914926] DMAR: Prepare 0-16MiB unity mapping for LPC
[    2.915039] DMAR: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
[    2.915140] DMAR: Intel(R) Virtualization Technology for Directed I/O
The key message confirming everything is good is the last line there – if that’s missing, something went wrong. Don’t be misled by the earlier “DMAR: IOMMU enabled” line, which merely says the kernel saw the “intel_iommu=on” command line option.
The IOMMU should also have registered the PCI devices into various groups
# dmesg | grep -i iommu | grep device
[    2.915212] iommu: Adding device 0000:00:00.0 to group 0
[    2.915226] iommu: Adding device 0000:00:01.0 to group 1
...snip...
[    5.588723] iommu: Adding device 0000:b5:00.0 to group 14
[    5.588737] iommu: Adding device 0000:b6:00.0 to group 15
[    5.588751] iommu: Adding device 0000:b7:00.0 to group 16
Libvirt meanwhile should have detected all the PCI controllers/devices
# virsh nodedev-list --tree
computer
  |
  +- net_lo_00_00_00_00_00_00
  +- pci_0000_00_00_0
  +- pci_0000_00_01_0
  +- pci_0000_00_02_0
  +- pci_0000_00_03_0
  +- pci_0000_00_04_0
  +- pci_0000_00_1d_0
  |   |
  |   +- usb_usb2
  |       |
  |       +- usb_2_0_1_0
  |
  +- pci_0000_00_1d_1
  |   |
  |   +- usb_usb3
  |       |
  |       +- usb_3_0_1_0
  |
  +- pci_0000_00_1d_2
  |   |
  |   +- usb_usb4
  |       |
  |       +- usb_4_0_1_0
  |
  +- pci_0000_00_1d_7
  |   |
  |   +- usb_usb1
  |       |
  |       +- usb_1_0_1_0
  |       +- usb_1_1
  |           |
  |           +- usb_1_1_1_0
  |
  +- pci_0000_00_1e_0
  |   |
  |   +- pci_0000_01_01_0
  |       |
  |       +- pci_0000_02_01_0
  |       |   |
  |       |   +- net_enp2s1_52_54_00_10_26_7e
  |       |
  |       +- pci_0000_02_02_0
  |       +- pci_0000_02_03_0
  |       +- pci_0000_02_04_0
  |       +- pci_0000_02_05_0
  |
  +- pci_0000_00_1f_0
  +- pci_0000_00_1f_2
  |   |
  |   +- scsi_host0
  |   +- scsi_host1
  |   +- scsi_host2
  |   +- scsi_host3
  |   +- scsi_host4
  |   +- scsi_host5
  |
  +- pci_0000_00_1f_3
  +- pci_0000_b4_00_0
  |   |
  |   +- pci_0000_b5_00_0
  |       |
  |       +- net_enp181s0_52_54_00_7e_6e_c6
  |
  +- pci_0000_b4_01_0
  |   |
  |   +- pci_0000_b6_00_0
  |       |
  |       +- net_enp182s0_52_54_00_7e_6e_c7
  |
  +- pci_0000_b4_02_0
  |   |
  |   +- pci_0000_b7_00_0
  |       |
  |       +- net_enp183s0_52_54_00_7e_6e_c8
  |
  +- pci_0000_c8_00_0
  |   |
  |   +- pci_0000_c9_00_0
  |       |
  |       +- net_enp201s0_52_54_00_7e_6e_d6
  |
  +- pci_0000_c8_01_0
  |   |
  |   +- pci_0000_ca_00_0
  |       |
  |       +- net_enp202s0_52_54_00_7e_6e_d7
  |
  +- pci_0000_c8_02_0
  |   |
  |   +- pci_0000_cb_00_0
  |       |
  |       +- net_enp203s0_52_54_00_7e_6e_d8
  |
  +- pci_0000_dc_00_0
  |   |
  |   +- pci_0000_dd_00_0
  |       |
  |       +- net_enp221s0_52_54_00_7e_6e_e6
  |
  +- pci_0000_dc_01_0
  |   |
  |   +- pci_0000_de_00_0
  |       |
  |       +- net_enp222s0_52_54_00_7e_6e_e7
  |
  +- pci_0000_dc_02_0
      |
      +- pci_0000_df_00_0
          |
          +- net_enp223s0_52_54_00_7e_6e_e8
And if you look at a specific PCI device, it should report the NUMA node it is associated with and the IOMMU group it is part of
# virsh nodedev-dumpxml pci_0000_df_00_0
<device>
  <name>pci_0000_df_00_0</name>
  <path>/sys/devices/pci0000:dc/0000:dc:02.0/0000:df:00.0</path>
  <parent>pci_0000_dc_02_0</parent>
  <driver>
    <name>e1000e</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>223</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x10d3'>82574L Gigabit Network Connection</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <iommuGroup number='10'>
      <address domain='0x0000' bus='0xdc' slot='0x02' function='0x0'/>
      <address domain='0x0000' bus='0xdf' slot='0x00' function='0x0'/>
    </iommuGroup>
    <numa node='2'/>
    <pci-express>
      <link validity='cap' port='0' speed='2.5' width='1'/>
      <link validity='sta' speed='2.5' width='1'/>
    </pci-express>
  </capability>
</device>
Finally, libvirt should also be reporting the NUMA topology
# virsh capabilities
...snip...
    <topology>
      <cells num='3'>
        <cell id='0'>
          <memory unit='KiB'>4014464</memory>
          <pages unit='KiB' size='4'>1003616</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='1' value='20'/>
            <sibling id='2' value='20'/>
          </distances>
          <cpus num='4'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
            <cpu id='1' socket_id='1' core_id='0' siblings='1'/>
            <cpu id='2' socket_id='2' core_id='0' siblings='2'/>
            <cpu id='3' socket_id='3' core_id='0' siblings='3'/>
          </cpus>
        </cell>
        <cell id='1'>
          <memory unit='KiB'>2016808</memory>
          <pages unit='KiB' size='4'>504202</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='20'/>
            <sibling id='1' value='10'/>
            <sibling id='2' value='20'/>
          </distances>
          <cpus num='2'>
            <cpu id='4' socket_id='4' core_id='0' siblings='4'/>
            <cpu id='5' socket_id='5' core_id='0' siblings='5'/>
          </cpus>
        </cell>
        <cell id='2'>
          <memory unit='KiB'>2014644</memory>
          <pages unit='KiB' size='4'>503661</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='20'/>
            <sibling id='1' value='20'/>
            <sibling id='2' value='10'/>
          </distances>
          <cpus num='2'>
            <cpu id='6' socket_id='6' core_id='0' siblings='6'/>
            <cpu id='7' socket_id='7' core_id='0' siblings='7'/>
          </cpus>
        </cell>
      </cells>
    </topology>
...snip...
Everything should be ready and working at this point, so let’s try to install a nested guest and assign it one of the e1000e PCI devices. For simplicity we’ll just do the exact same install for the nested guest as we used for the top level guest we’re currently running in. The only difference is that we’ll assign it a PCI device
# cd /var/lib/libvirt/images
# wget -O f25x86_64-boot.iso https://download.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/images/boot.iso
# virt-install --name f25x86_64 --ram 2000 --vcpus 8 \
    --file /var/lib/libvirt/images/f25x86_64.img --file-size 10 \
    --cdrom f25x86_64-boot.iso --os-type fedora23 \
    --hostdev pci_0000_df_00_0 --network none
If everything went well, you should now have a nested guest with an assigned PCI device attached to it.
This turned out to be a rather long blog posting, but this is not surprising as we’re experimenting with some cutting edge KVM features trying to emulate quite a complicated hardware setup, that deviates from normal KVM guest setup quite a way. Perhaps in the future virt-install will be able to simplify some of this, but at least for the short-medium term there’ll be a fair bit of work required. The positive thing though is that this has clearly demonstrated that KVM is now advanced enough that you can now reasonably expect to do development and testing of features like NUMA and PCI device assignment inside nested guests.
The next step is to convince someone to add QEMU emulation of an Intel SRIOV network device….volunteers please :-)
NB, this blog post was intended to be published back in November last year, but got forgotten in draft stage. Publishing now in case anyone missed the release…
I am happy to announce a new release of libosinfo, version 1.0.0 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.
Changes in this release include:
As promised, this release of libosinfo has completed the separation of the library code from the database files. There are now three independently released artefacts:
Before installing the 1.0.0 release of libosinfo it is necessary to install osinfo-db-tools, followed by osinfo-db. The download page has instructions for how to deploy the three components. In particular note that ‘osinfo-db’ does NOT contain any traditional build system, as the only files it contains are XML database files. So instead of unpacking the osinfo-db archive, use the osinfo-db-import tool to deploy it.
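The deployment flow thus looks roughly like this; the download URL and tarball version are placeholders here, so check the download page for the real current release:

```shell
# osinfo-db-tools must already be installed so osinfo-db-import is available.
# The version string below is hypothetical.
wget https://releases.pagure.org/libosinfo/osinfo-db-VERSION.tar.xz
osinfo-db-import --system osinfo-db-VERSION.tar.xz
```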
Over the last weekend, on February 4th and 5th, the FOSDEM 2017 conference took place in Brussels.
Some people from the QEMU community attended the Virtualisation and IaaS track, and the videos of their presentations are now available online, too:
Network Block Device – how, what, why by Wouter Verhelst
Using NVDIMM under KVM – Applications of persistent memory in virtualization by Stefan Hajnoczi
I gave a talk on NVDIMM persistent memory at FOSDEM 2017. QEMU has gained support for emulated NVDIMMs and they can be used efficiently under KVM.
Applications inside the guest access the physical NVDIMM directly with native performance when properly configured. These devices are DDR4 RAM modules so the access times are much lower than solid state (SSD) drives. I'm looking forward to hardware coming onto the market because it will change storage and databases in a big way.
This talk covers what NVDIMM is, the programming model, and how it can be used under KVM. Slides are available here (PDF).
Update: Video is available here.