As usual I’ve placed the proposed RHEL 7.5 libguestfs packages in a public repository so you can try them out.
Thanks to Pino Toscano for doing the packaging work.
I was in Prague last month for the 2017 edition of the KVM Forum. There I gave a talk about some of the work that I’ve been doing this year to improve the qcow2 file format used by QEMU for storing disk images. The focus of my work is to make qcow2 faster and to reduce its memory requirements.
The KVM Forum was co-located with the Open Source Summit and the Embedded Linux Conference Europe. Igalia was sponsoring both events once again and I was also there together with some of my colleagues. Juanjo Sánchez gave a talk about WPE, the WebKit port for embedded platforms that we released.
The video of his talk is also available.
At KVM Forum 2017 I gave a talk about the AioContext polling optimization that was merged in QEMU 2.9. It reduces latency for virtio-blk and virtio-scsi devices with the iothread= option on high IOPS devices like recent NVMe PCIe SSDs. It increases performance for latency-sensitive workloads and has been designed to avoid interfering with workloads that do not benefit from polling thanks to a self-tuning algorithm.
The video of the talk is now available:
The slides are available here (PDF).
Collecting benchmark results is the first step to solving disk I/O performance problems. Unfortunately, many bug reports and performance investigations fall down at the first step because bogus benchmark data is collected. This post explains common mistakes when running disk I/O benchmarks.
Before we begin, it is important to understand the different I/O patterns and how they are used in benchmarking. Skip this section if you are already familiar with these terms.
Sequential vs random I/O is the access pattern in which data is read or written. Sequential I/O is in-order data access commonly found in workloads like streaming multimedia or writing log files. Random I/O is access of non-adjacent data commonly found when accessing many small files or on systems running multiple applications that access the disk at the same time. It is easy to prefetch sequential I/O so both disk read caches and operating system page caches may keep the next piece of data ready even before it is accessed. Random I/O does not offer opportunities for prefetching and is therefore a harder access pattern to optimize.
Block or request size is the amount of data transferred by a single access. Small request sizes are 512B through 4 KB, large request sizes are 64 KB through 128 KB, while very large request sizes could be 1 MB (although the maximum allowed request size ultimately depends on the hardware). Fewer requests are needed to transfer the same amount of data when the request size is larger. Therefore, throughput is usually higher at larger request sizes because less per-request overhead is incurred for the same amount of data.
Read vs write is the request type that determines whether data is transferred to or from the storage medium. Reads can be completed cheaply if data is already in the disk read cache and, failing that, the access time depends on the storage medium. Traditional spinning disks have significant average seek times in the range of 4-15 milliseconds, depending on the drive, when the head is not positioned in the read location, while solid-state storage devices might just take on the order of 10 microseconds. Writes can be completed cheaply by leaving data in the disk write cache unless the cache is full or the cache is disabled.
Queue depth is the number of in-flight I/O requests at a given time. Latency-sensitive workloads submit one request and wait for it to complete before submitting the next request. This is queue depth 1. Parallel workloads submit many requests without waiting for earlier requests to complete first. The maximum queue depth depends on the hardware with 64 being a common number. Maximum throughput is usually achieved when queue depth is fairly high because the disk can keep busy without waiting for the next request to be submitted and it may optimize the order in which requests are processed.
Random reads are a good way to force storage medium access and minimize cache hit rates. Sequential reads are a good way to maximize cache hit rates. Which I/O pattern is appropriate depends on your goals.
Real-life workloads are usually a mixture of sequential vs random, block sizes, reads vs writes, and the queue depth may vary over time. It is simplest to benchmark a specific I/O pattern in isolation but benchmark tools can also be configured to produce mixed I/O patterns like 70% reads/30% writes. The goal when configuring a benchmark is to produce the I/O pattern that is critical for real-life workload performance.
It is often tempting to use file utilities instead of real benchmarking tools because file utilities report I/O throughput like real benchmarking tools and time taken can be easily measured. Therefore it might seem like there is no need to install a real benchmarking tool when file utilities are already available on every system.
Do not use cp(1), scp(1), or even dd(1). Instead, use a real benchmark like fio(1).
What's the difference? Real benchmarking tools can be configured to produce specific I/O patterns, like 4 KB random reads with queue depth 8, whereas file utilities offer limited or no ability to choose the I/O pattern. Since disk performance varies depending on the I/O pattern, it is hard to understand or compare results between systems without full control over the I/O pattern.
The second reason why real benchmarking tools are necessary is that file utilities are not designed to exercise the disk, they are designed to manipulate files. This means file utilities spend time doing things that do not involve disk I/O and therefore produce misleading performance results. The most important example of this is that file utilities use the operating system's page cache and this can result in no disk I/O activity at all!
One of the most common mistakes is forgetting to bypass the operating system's page cache. Files and block devices opened with the O_DIRECT flag perform I/O to the disk without going through the page cache. This is the best way to guarantee that the disk actually gets I/O requests. Files opened without this flag are in "buffered I/O" mode and that means I/O may be fulfilled entirely within the page cache in RAM without any disk I/O activity. If the goal is to benchmark disk performance then the page cache needs to be eliminated.
fio(1) jobs must use the direct=1 parameter to exercise the disk.
It is not sufficient to echo 3 > /proc/sys/vm/drop_caches before running the benchmark instead of using O_DIRECT. Although this command is often used to make non-disk benchmarks produce more consistent results between runs, it does not guarantee that the disk will actually receive I/O requests. In addition, the page cache interferes with the desired benchmark I/O pattern since page cache prefetch and writeback will alter the actual I/O pattern that the disk sees.
fio(1) can do both file I/O and disk I/O benchmarking, so it's often mistakenly used in file I/O mode instead of disk I/O mode. When benchmarking disk performance it is best to eliminate file systems and device mapper targets to isolate raw disk I/O performance. File systems and device mapper targets may have their own internal bottlenecks, such as software locks, that are unrelated to disk performance. File systems and device mapper targets are also likely to modify the I/O pattern because they submit their own metadata I/O.
fio(1) jobs must use filename=/path/to/disk to do disk I/O benchmarking.
Without a block device filename parameter, the benchmark would create regular files on whatever file system is in use. Remember to double- and triple-check the block device filename before running benchmarks that write to the disk to avoid accidentally overwriting important data like the system root disk!
Here are a few example fio(1) jobs that you can use as a starting point.
This job is a read-heavy workload with lots of parallelism that is likely to show off the device's best throughput:
ramp_time=10 # start measuring after warm-up time
offset_increment=128m # each job starts at a different offset
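A fuller job file along these lines might look as follows; this is only a sketch, and the device path (/dev/sdX), request size, queue depth, and runtimes are placeholder values to adapt to your hardware:

[global]
filename=/dev/sdX      # raw block device - double-check before running!
direct=1               # bypass the page cache
ioengine=libaio
runtime=120
time_based=1
ramp_time=10           # start measuring after warm-up time

[seq-read-throughput]
rw=read                # sequential reads
bs=128k                # large request size
iodepth=64             # high queue depth
numjobs=4              # several parallel jobs
offset_increment=128m  # each job starts at a different offset
group_reporting=1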
This job is a latency-sensitive workload that stresses per-request overhead and seek times:
ramp_time=10 # start measuring after warm-up time
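Again as a sketch with placeholder values (/dev/sdX, runtimes), a queue depth 1 random read job could look like this:

[global]
filename=/dev/sdX      # raw block device - double-check before running!
direct=1               # bypass the page cache
ioengine=libaio
runtime=120
time_based=1
ramp_time=10           # start measuring after warm-up time

[rand-read-latency]
rw=randread            # random reads force storage medium access
bs=4k                  # small request size
iodepth=1              # one request in flight at a time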
This job simulates a more real-life workload with an I/O pattern that contains both reads and writes:
ramp_time=10 # start measuring after warm-up time
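As a sketch (placeholder values again), a mixed job that issues 70% reads and 30% writes might be:

[global]
filename=/dev/sdX      # raw block device - double-check before running!
direct=1               # bypass the page cache
ioengine=libaio
runtime=120
time_based=1
ramp_time=10           # start measuring after warm-up time

[mixed-70-30]
rw=randrw              # mixed random reads and writes
rwmixread=70           # 70% reads / 30% writes
bs=4k
iodepth=8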
There are several common issues with disk benchmarking that can lead to useless results. Using a real benchmarking tool and bypassing the page cache and file system are the basic requirements for useful disk benchmark results. If you have questions or suggestions about disk benchmarking, feel free to post a comment.
The QEMU block layer currently (as of QEMU 2.10) supports four major kinds of live block device jobs – stream, commit, mirror, and backup. These can be used to manipulate disk image chains to accomplish certain tasks, e.g.: live copy data from backing files into overlays; shorten long disk image chains by merging data from overlays into backing files; live synchronize data from a disk image chain (including current active disk) to another target image; and point-in-time (and incremental) backups of a block device.
To that end, I have recently written documentation (thanks to the QEMU block layer maintainers & developers for the reviews) on the usage of the following commands:
For each of the above block device jobs, QMP (QEMU Machine Protocol) invocation examples are documented.
This documentation can be handy in those (debugging) scenarios when it’s instructive to look at what is happening behind the scenes of QEMU. For example, live storage migration (without shared storage setup) is one of the most common use-cases that takes advantage of the QMP
drive-mirror command and QEMU’s built-in Network Block Device (NBD) server. Here’s the QMP-level workflow for it — this is the flow libvirt internally implements (with some additional niceties).
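Roughly, and with placeholder hostnames, ports and device IDs, the QMP sequence looks like this (a simplified sketch of that flow, not libvirt's exact implementation):

# On the destination host: start the NBD server and expose the target disk
{ "execute": "nbd-server-start",
  "arguments": { "addr": { "type": "inet",
                           "data": { "host": "dst-host", "port": "49153" } } } }
{ "execute": "nbd-server-add",
  "arguments": { "device": "drive-virtio-disk0", "writable": true } }

# On the source host: mirror the disk to the destination's NBD export
{ "execute": "drive-mirror",
  "arguments": { "device": "drive-virtio-disk0",
                 "target": "nbd:dst-host:49153:exportname=drive-virtio-disk0",
                 "sync": "full", "format": "raw", "mode": "existing" } }

# Once the BLOCK_JOB_READY event arrives, migrate the VM state and then
# finish the mirror job with block-job-complete (or block-job-cancel).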
We are pleased to announce that the QEMU v2.10.1 stable release is now available! You can grab the tarball from our download page.
Apart from the normal range of general fixes, this update contains security fixes addressing guest-induced crashing of the host QEMU process (CVE-2017-13672, CVE-2017-13673) and possible code injection into the host QEMU process via a crafted multiboot ELF kernel when specified directly via a QEMU command-line option (CVE-2017-14167). Please update accordingly.
There are various approaches to run macOS as guest under kvm. One is to add Apple-specific features to OVMF, as described by Gabriel L. Somlo. I’ve chosen to use the Clover EFI bootloader instead. Here is what my setup looks like.
What you need
First, a bootable installer disk image. You can create a bootable USB stick using the createinstallmedia tool shipped with the installer. You can then dd the USB stick to a raw disk image.
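For example, if the stick shows up as /dev/disk2 on the Mac (a placeholder, check with diskutil list first), something along these lines produces a raw image:

# verify the device name with 'diskutil list' before running this (macOS dd syntax)
sudo dd if=/dev/disk2 of=installer.img bs=1m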
Next a clover disk image. I’ve created a script which uses guestfish to generate a disk image from a clover iso image, with a custom config file. The advantage of having a separate disk only for clover is that you can easily update clover, downgrade clover and tweak the clover configuration without booting the virtual machine. So, if something goes wrong recovering is a lot easier.
Qemu. Version 2.10 (or newer) strongly recommended. macOS versions up to 10.12.3 work fine in qemu 2.9. macOS 10.12.4 requires fixes for the qemu applesmc emulation which got merged for qemu 2.10.
OVMF. Latest git works fine for me. Older OVMF versions trip over a recent qemu update and then provide broken ACPI tables to the OS, with the result that macOS doesn’t boot, even though OVMF itself shows no signs of trouble.
Configuring your virtual machine
Here are snippets of my libvirt config, with comments explaining the important things:
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
The xmlns:qemu entry is needed for qemu-specific tweaks; that way we can ask libvirt to add extra arguments to the qemu command line.
<os>
  <type arch='x86_64' machine='pc-q35-2.9'>hvm</type>
  <loader readonly='yes' type='pflash'>/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
  <nvram template='/usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd'>/var/lib/libvirt/qemu/nvram/macos-test-org-base_VARS.fd</nvram>
  <bootmenu enable='yes'/>
</os>
Using the q35 machine type here, and the cutting edge edk2 builds from my firmware repo.
<cpu mode='custom' match='exact' check='partial'>
  <model fallback='allow'>Penryn</model>
  <feature policy='require' name='invtsc'/>
</cpu>
CPU. Penryn is known-good. The invtsc feature is needed because macOS uses the TSC for timekeeping. When asked to provide a fixed TSC frequency, qemu stores the TSC frequency in a hypervisor cpuid leaf, and macOS will pick it up there. Without that, macOS makes a wild guess, likely gets it very wrong, and the wall clock in your guest runs either way too fast or way too slow.
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
  <source dev='/dev/path/to/lvm/volume'/>
  <target dev='sda' bus='sata'/>
</disk>
This is the system disk that macOS will be installed on, attached as a SATA disk to the q35 AHCI controller.
You also need the installmedia image. On a real Macintosh you’d typically use a USB stick. macOS doesn’t care much where the installer is stored though, so you can attach the image as a SATA disk too and it’ll work fine.
Finally you need the clover disk image. edk2 alone isn’t able to boot from the system or installer disk, so it’ll start clover. clover in turn will load the hfs+ filesystem driver so the other disks can be accessed, offer a boot menu, and allow booting macOS.
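For reference, the installer and clover images could be attached following the same pattern as the system disk above; a sketch only, with placeholder image paths and target names:

<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source file='/var/lib/libvirt/images/macos-installer.img'/>
  <target dev='sdb' bus='sata'/>
</disk>
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source file='/var/lib/libvirt/images/clover.img'/>
  <target dev='sdc' bus='sata'/>
</disk>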
<interface type='network'>
  <source network='default'/>
  <model type='e1000-82545em'/>
</interface>
Ethernet card. macOS has drivers for this model. It seems to have problems with link detection now and then; setting the link status to down for a moment, then up again (using virsh domif-setlink) gets the virtual machine online.
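For example (the domain name "macos" and the MAC address are placeholders; the interface can be found with virsh domiflist):

virsh domif-setlink macos 52:54:00:12:34:56 down
virsh domif-setlink macos 52:54:00:12:34:56 up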
<input type='tablet' bus='usb'/>
<input type='keyboard' bus='usb'/>
USB tablet and keyboard as input devices. The tablet allows operating the mouse without pointer grabs, which is much more convenient than using a virtual mouse.
<video>
  <model type='vga' vram='65536'/>
</video>
Qemu standard vga.
<qemu:commandline>
  <qemu:arg value='-readconfig'/>
  <qemu:arg value='/path/to/macintosh.cfg'/>
</qemu:commandline>
This is the extra configuration item for stuff not supported by libvirt. The
macintosh.cfg file looks like this, adding the emulated smc device:
[device "smc"] driver = "isa-applesmc" osk = "<insert-real-osk-here>"
You can run Gabriel’s SmcDumpKey tool on a Macintosh to figure out what the osk is.
I’m using config.plist.stripped.qemu as starting point. Here are the most important settings:
-v" to start the kernel in verbose mode where it prints boot messages to the screen (like the linux kernel without "
quiet"). For trouble-shooting, or to impress your friends.
The screen resolution configured in clover must match the one set in the firmware setup (Device Manager / OVMF Platform Configuration / Change Preferred Resolution). If the two settings don’t match, macOS will boot with a scrambled display.
That should be it. Boot the virtual machine. Installing and using macOS should work as usual.
Gabriel also has some comments on the legal side of this. Summary: probably allowed by Apple on Macintosh hardware, i.e. when running Linux on your Macintosh and then running macOS as a guest there. If in doubt check with your lawyer.
Yesterday, virt-bootstrap came to life. This tool aims at simplifying the creation of root file systems for use with libvirt's LXC container drivers. I started prototyping it a few months ago and Radostin Stoyanov wonderfully took it over during this year's Google Summer of Code.
For most users, this tool will just be used by virt-manager (since version 1.4.2). But it can be used directly from any script or command line.
The nice thing about virt-bootstrap is that it allows you to create a root file system out of existing docker images, tarballs or virt-builder templates. For example, the following command will get and unpack the official openSUSE docker image into /tmp/foo:
$ virt-bootstrap docker://opensuse /tmp/foo
Virt-bootstrap offers options to:
Enjoy easy container creation with the libvirt ecosystem, and have fun!
Thank you to everyone involved!
I am happy to announce a new bugfix release of virt-viewer 6.0 (gpg), including experimental Windows installers for Win x86 MSI (gpg) and Win x64 MSI (gpg). The
virt-viewer binaries in the Windows builds should now successfully connect to libvirtd, following fixes to libvirt’s mingw port.
Signatures are created with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)
All historical releases are available from:
Changes in this release include:
Thanks to everyone who contributed towards this release.
I am happy to announce a new release of libosinfo version 1.1.0 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.
Changes in this release include:
Thanks to everyone who contributed towards this release.
QEMU has a lot of interfaces (like command line options or HMP commands) and old features (like certain devices) which are considered deprecated, since other more generic or better interfaces/features have been established instead. While the QEMU developers generally try to keep each QEMU release compatible with the previous ones, the old legacy sometimes gets in the way when developing new code and/or causes quite a maintenance burden.
Thus we are currently considering getting rid of some of the old interfaces and features in a future release, and have started to collect a list of such old items in our QEMU documentation. If you are running QEMU directly, please have a look at this deprecation chapter of the QEMU documentation to see whether you are still using one of these old interfaces or features, so you can adapt your setup to use the new interfaces/features instead. Or if you rather think that one of the items should not be removed from QEMU at all, please speak up on the qemu-devel mailing list to explain why the interface or feature is still required.
The perf(1) tool added support for userspace static probes in Linux 4.8. Userspace static probes are pre-defined trace points in userspace applications. Application developers add them so frequently needed lifecycle events are available for performance analysis, troubleshooting, and development.
Static userspace probes are more convenient than defining your own function probes from scratch. You can save time by using them and not worrying about where to add probes because that has already been done for you.
On my Fedora 26 machine the QEMU, gcc, and nodejs packages ship with static userspace probes. QEMU offers probes for vcpu events, disk I/O activity, device emulation, and more.
Without further ado, here is how to trace static userspace probes with perf(1)!
The perf(1) tool needs to scan the application's ELF binaries for static userspace probes and store the information in $HOME/.debug/usr/:
# perf buildid-cache --add /usr/bin/qemu-system-x86_64
Once the ELF binaries have been scanned you can list the probes as follows:
# perf list sdt_*:*
List of pre-defined events (to be used in -e):
sdt_qemu:aio_co_schedule [SDT event]
sdt_qemu:aio_co_schedule_bh_cb [SDT event]
sdt_qemu:alsa_no_frames [SDT event]
First add probes for the events you are interested in:
# perf probe sdt_qemu:blk_co_preadv
Added new event:
sdt_qemu:blk_co_preadv (on %blk_co_preadv in /usr/bin/qemu-system-x86_64)
You can now use it in all perf tools, such as:
perf record -e sdt_qemu:blk_co_preadv -aR sleep 1
Then capture trace data as follows:
# perf record -a -e sdt_qemu:blk_co_preadv
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 2.274 MB perf.data (4714 samples) ]
The trace can be printed using perf-script(1):
# perf script
qemu-system-x86 3425  2183.218343: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=0 arg4=512 arg5=0
qemu-system-x86 3425  2183.310712: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=0 arg4=512 arg5=0
qemu-system-x86 3425  2183.310904: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=512 arg4=512 arg5=0
If you want to get fancy it's also possible to write trace analysis scripts with perf-script(1). That's a topic for another post but see the --gen-script= option to generate a skeleton script.
As of July 2017 there are a few limitations to be aware of:
Probe arguments are automatically numbered and do not have human-readable names. You will see arg1, arg2, etc and will need to reference the probe definition in the application source code to learn the meaning of the argument. Some versions of perf(1) may not even print arguments automatically since this feature was added later.
The contents of string arguments are not printed, only the memory address of the string.
Probes called from multiple call-sites in the application result in multiple perf probes. For example, if probe foo is called from 3 places you get sdt_myapp:foo, sdt_myapp:foo_1, and sdt_myapp:foo_2 when you run perf probe --add sdt_myapp:foo.
The SystemTap semaphores feature is not supported and such probes will not fire unless you manually set the semaphore inside your application or from another tool like GDB. This means that the sdt_myapp:foo will not fire if the application uses the MYAPP_FOO_ENABLED() macro like this: if (MYAPP_FOO_ENABLED()) MYAPP_FOO();.
Static userspace probes were popularized by DTrace's <sys/sdt.h> header. Tracers that came after DTrace implemented the same interface for compatibility.
On Linux the initial tool for static userspace probes was SystemTap. In fact, the <sys/sdt.h> header file on my Fedora 26 system is still part of the systemtap-sdt-devel package.
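As an illustration (not taken from QEMU; the provider and probe names here are made up), an application defines such a probe with a single macro from that header:

#include <sys/sdt.h>

int main(void)
{
    int requests = 42;

    /* Emits the sdt_myapp:request_done probe point with one argument.
     * It becomes traceable after 'perf buildid-cache --add ./myapp'. */
    DTRACE_PROBE1(myapp, request_done, requests);
    return 0;
}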
It's very handy to have static userspace probing available alongside all the other perf(1) tracing features. There are a few limitations to keep in mind but if your tracing workflow is based primarily around perf(1) then you can now begin using static userspace probes without relying on additional tools.
This post is a 64-bit companion to an earlier post of mine where I described how to get Debian running on QEMU emulating a 32-bit ARM “virt” board. Thanks to commenter snak3xe for reminding me that I’d said I’d write this up…
For 64-bit ARM QEMU emulates many fewer boards, so “virt” is almost the only choice, unless you specifically know that you want to emulate one of the 64-bit Xilinx boards. “virt” supports PCI, virtio, a recent ARM CPU and large amounts of RAM. The only thing it doesn’t have out of the box is graphics.
I’m going to assume you have a Linux host, and a recent version of QEMU (at least QEMU 2.8). I also use libguestfs to extract files from a QEMU disk image, but you could use a different tool for that step if you prefer.
I’m going to document how to set up a guest which directly boots the kernel. It should also be possible to have QEMU boot a UEFI image which then boots the kernel from a disk image, but that’s not something I’ve looked into doing myself. (There may be tutorials elsewhere on the web.)
We need the Debian installer kernel and initrd; I suggest creating a subdirectory for these and the other files we’re going to create.
wget -O installer-linux http://http.us.debian.org/debian/dists/stretch/main/installer-arm64/current/images/netboot/debian-installer/arm64/linux
wget -O installer-initrd.gz http://http.us.debian.org/debian/dists/stretch/main/installer-arm64/current/images/netboot/debian-installer/arm64/initrd.gz
Saving them locally as installer-linux and installer-initrd.gz means they won’t be confused with the final kernel and initrd that the installation process produces.
(If we were installing on real hardware we would also need a “device tree” file to tell the kernel the details of the exact hardware it’s running on. QEMU’s “virt” board automatically creates a device tree internally and passes it to the kernel, so we don’t need to provide one.)
First we need to create an empty disk drive to install onto. I picked a 5GB disk but you can make it larger if you like.
qemu-img create -f qcow2 hda.qcow2 5G
(Oops — an earlier version of this blogpost created a “qcow” format image, which will work but is less efficient. If you created a qcow image by mistake, you can convert it to qcow2 with
mv hda.qcow2 old-hda.qcow && qemu-img convert -O qcow2 old-hda.qcow hda.qcow2. Don’t try it while the VM is running! You then need to update your QEMU command line to say “format=qcow2” rather than “format=qcow”. You can delete the
old-hda.qcow once you’ve checked that the new qcow2 file works.)
Now we can run the installer:
qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel installer-linux \
  -initrd installer-initrd.gz \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic -no-reboot
The installer will display its messages on the text console (via an emulated serial port). Follow its instructions to install Debian to the virtual disk; it’s straightforward, but if you have any difficulty the Debian installation guide may help.
The actual install process will take a few hours as it downloads packages over the network and writes them to disk. It will occasionally stop to ask you questions.
Late in the process, the installer will print the following warning dialog:
 +-----------------| [!] Continue without boot loader |------------------+
 |                                                                       |
 |                        No boot loader installed                       |
 | No boot loader has been installed, either because you chose not to or |
 | because your specific architecture doesn't support a boot loader yet. |
 |                                                                       |
 | You will need to boot manually with the /vmlinuz kernel on partition  |
 | /dev/vda1 and root=/dev/vda2 passed as a kernel argument.             |
 |                                                                       |
 |                               <Continue>                              |
 |                                                                       |
 +-----------------------------------------------------------------------+
Press continue for now, and we’ll sort this out later.
Eventually the installer will finish by rebooting — this should cause QEMU to exit (since we used the -no-reboot option).
At this point you might like to make a copy of the hard disk image file, to save the tedium of repeating the install later.
The installer warned us that it didn’t know how to arrange to automatically boot the right kernel, so we need to do it manually. For QEMU that means we need to extract the kernel the installer put into the disk image so that we can pass it to QEMU on the command line.
There are various tools you can use for this, but I’m going to recommend libguestfs, because it’s the simplest to use. To check that it works, let’s look at the partitions in our virtual disk image:
$ virt-filesystems -a hda.qcow2
/dev/sda1
/dev/sda2
If this doesn’t work, then you should sort that out first. A couple of common reasons I’ve seen:
The kernels in /boot are installed not-world-readable; you can fix this with
sudo chmod 644 /boot/vmlinuz*
Looking at what’s in our disk we can see the kernel and initrd in /boot:
$ virt-ls -a hda.qcow2 /boot/
System.map-4.9.0-3-arm64
config-4.9.0-3-arm64
initrd.img
initrd.img-4.9.0-3-arm64
initrd.img.old
lost+found
vmlinuz
vmlinuz-4.9.0-3-arm64
vmlinuz.old
and we can copy them out to the host filesystem:
virt-copy-out -a hda.qcow2 /boot/vmlinuz-4.9.0-3-arm64 /boot/initrd.img-4.9.0-3-arm64 .
(We want the longer filenames, because vmlinuz and initrd.img are just symlinks and virt-copy-out won’t copy them.)
An important warning about libguestfs, or any other tools for accessing disk images from the host system: do not try to use them while QEMU is running, or you will get disk corruption when both the guest OS inside QEMU and libguestfs try to update the same image.
If you subsequently upgrade the kernel inside the guest, you’ll need to repeat this step to extract the new kernel and initrd, and then update your QEMU command line appropriately.
To run the installed system we need a different command line which boots the installed kernel and initrd, and passes the kernel the command line arguments the installer told us we’d need:
qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel vmlinuz-4.9.0-3-arm64 \
  -initrd initrd.img-4.9.0-3-arm64 \
  -append 'root=/dev/vda2' \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic
This should boot to a login prompt, where you can log in with the user and password you set up during the install.
The installed system has an SSH client, so one easy way to get files in and out is to use “scp” from inside the VM to talk to an SSH server outside it. Or you can use libguestfs to write files directly into the disk image (for instance using virt-copy-in, as sketched below) — but make sure you only use libguestfs when the VM is not running, or you will get disk corruption.
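For example, something like this copies a file into the guest’s root home directory (the file name and destination are placeholders):

# only run this while the VM is shut down!
virt-copy-in -a hda.qcow2 myfile.txt /root/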
This is a follow-up to Running Hyper-V in a QEMU/KVM Guest published earlier this year. The article provided instructions on setting up Hyper-V in a QEMU/KVM Windows guest as enabled by a particular KVM patchset (on Intel hardware only, as it turned out later). Several issues have been found since then; some already fixed, some in the process of being fixed, and some still not fully understood.
This post aims to be an up-to-date list of issues related to Hyper-V on KVM, showing their current status and, where applicable, upstream commit IDs. The issues are ordered chronologically from the oldest ones to those found recently.
Issue description                                        | Status                          | Public bug tracker
---------------------------------------------------------|---------------------------------|-------------------
Hyper-V on KVM does not work at all (initial work item)  | Fixed in kernel 4.10            |
Hyper-V on KVM does not work on new Intel CPUs with PML  | Fixed in kernel 4.11            |
Hyper-V on KVM does not work on AMD CPUs                 | Fixed in kernel 4.12 for 1 vCPU |
For anyone interested in the AF_VSOCK zero-configuration host<->guest communications channel it's important to be able to observe traffic. Packet capture is commonly used to troubleshoot network problems and debug networking applications. Up until now it hasn't been available for AF_VSOCK.
In 2016 Gerard Garcia created the vsockmon Linux driver that enables AF_VSOCK packet capture. During the course of his excellent Google Summer of Code work he also wrote patches for libpcap, tcpdump, and Wireshark.
Recently I revisited Gerard's work because Linux 4.12 shipped with the new vsockmon driver, making it possible to finalize the userspace support for AF_VSOCK packet capture. And it's working beautifully:
I have sent the latest patches to the tcpdump and Wireshark communities so AF_VSOCK can be supported out-of-the-box in the future. For now you can also find patches in my personal repositories:
The basic flow is as follows:
# ip link add type vsockmon
# ip link set vsockmon0 up
# tcpdump -i vsockmon0
# ip link set vsockmon0 down
# ip link del vsockmon0
It's easiest to wait for distros to package Linux 4.12 and future versions of libpcap, tcpdump, and Wireshark. If you decide to build from source, make sure to build libpcap first and then tcpdump or Wireshark. The libpcap dependency is necessary so that tcpdump/Wireshark can access AF_VSOCK traffic.
Fedora 26 is out of the door, and here are fresh fedora 26 images.
There are raspberry pi images. The aarch64 image requires a model 3, the armv7 image boots on both model 2 and model 3. Unlike the images for previous fedora releases, the new images use the standard fedora kernels instead of a custom kernel. So, the kernel update service for the older images will stop within the next few weeks.
There are efi images for qemu. The i386 and x86_64 images use systemd-boot as bootloader. grub2 doesn’t work due to bug 1196114 (unless you create a boot menu entry manually in uefi setup). The arm images use grub2 as bootloader; armv7 isn’t supported by systemd-boot in the first place, and the aarch64 version throws an exception. The efi images can also be booted as containers, using "systemd-nspawn --boot --image <file>", but you have to convert them to raw first.
The images don’t have a root password. You have to set one using virt-customize -a <image> --root-password "password:<secret>", otherwise you can’t log in after boot.
The images have been created with imagefish.
Debian 9 (“Stretch”) was released last week and now it’s available in virt-builder, the fast way to build virtual machine disk images:
$ virt-builder -l | grep debian
debian-6                 x86_64     Debian 6 (Squeeze)
debian-7                 sparc64    Debian 7 (Wheezy) (sparc64)
debian-7                 x86_64     Debian 7 (Wheezy)
debian-8                 x86_64     Debian 8 (Jessie)
debian-9                 x86_64     Debian 9 (stretch)

$ virt-builder debian-9 \
    --root-password password:123456
[   0.5] Downloading: http://libguestfs.org/download/builder/debian-9.xz
[   1.2] Planning how to build this image
[   1.2] Uncompressing
[   5.5] Opening the new disk
[  15.4] Setting a random seed
virt-builder: warning: random seed could not be set for this type of guest
[  15.4] Setting passwords
[  16.7] Finishing off
                   Output file: debian-9.img
                   Output size: 6.0G
                 Output format: raw
            Total usable space: 3.9G
                    Free space: 3.1G (78%)

$ qemu-system-x86_64 \
    -machine accel=kvm:tcg -cpu host -m 2048 \
    -drive file=debian-9.img,format=raw,if=virtio \
    -serial stdio
libguestfs is a C library for creating and editing disk images. In the most common (but not the only) configuration, it uses KVM to sandbox access to disk images. The C library talks to a separate daemon running inside a KVM appliance, as in this Unicode-art diagram taken from the fine manual:
 ┌───────────────────┐
 │ main program      │
 │                   │
 │                   │           child process / appliance
 │                   │          ┌──────────────────────────┐
 │                   │          │ qemu                     │
 ├───────────────────┤   RPC    │      ┌─────────────────┐ │
 │ libguestfs  ◀╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍▶ guestfsd        │ │
 │                   │          │      ├─────────────────┤ │
 └───────────────────┘          │      │ Linux kernel    │ │
                                │      └────────┬────────┘ │
                                └───────────────│──────────┘
                                                │
                                                │ virtio-scsi
                                         ┌──────┴──────┐
                                         │ Device or   │
                                         │ disk image  │
                                         └─────────────┘
The library has to be written in C because it needs to be linked to any main program. The daemon (
guestfsd in the diagram) is also written in C. But there’s not so much a specific reason for that, except that’s what we did historically.
The daemon is essentially a big pile of functions, most corresponding to a libguestfs API. Writing the daemon in C is painful to say the least. Because it’s a long-running process running in a memory-constrained environment, we have to be very careful about memory management, religiously checking every return from
strdup etc., making even the simplest task non-trivial and full of untested code paths.
So last week I modified libguestfs so you can now write APIs in OCaml if you want to. OCaml is a high level language that compiles down to object files, and it’s entirely possible to link the daemon from a mix of C object files and OCaml object files. Another advantage of OCaml is that you can call OCaml from C with relatively little glue code (although a disadvantage is that you still need to write that glue mostly by hand). Most simple calls turn into direct CALL instructions with just a simple bitshift required to convert between ints and bools on the C and OCaml sides. More complex calls passing strings and structures are not too difficult either.
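To give a flavour of that glue, here is a minimal sketch (not the actual daemon code, and the implementation is a stub): the OCaml side registers a closure under a well-known name, and the C side can then fetch it with caml_named_value() and invoke it with caml_callback() from <caml/callback.h>.

(* OCaml side: implement an API and register it for the C code to call. *)
let do_case_sensitive_path (path : string) : string =
  (* real lookup logic would go here *)
  String.lowercase_ascii path

let () = Callback.register "do_case_sensitive_path" do_case_sensitive_path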
OCaml also turns memory errors into a single exception, which unwinds the stack cleanly, so we don’t litter the code with memory handling. We can still run the mixed C/OCaml binary under valgrind.
Code gets quite a bit shorter. For example the case_sensitive_path API — all string handling and directory lookups — goes from 183 lines of C code to 56 lines of OCaml code (and much easier to understand too).
I’m reimplementing a few APIs in OCaml, but the plan is definitely not to convert them all. I think we’ll have C and OCaml APIs in the daemon for a very long time to come.
There were a couple of QEMU / virtualization related talks at the DevConf 2017 conference that took place at the end of January already, but so far we did not get around to gathering the links to the recordings of these talks. So here is now the list:
How to write a legacy storage device emulator by John Snow
GPU Passthrough using GNOME Boxes by Felipe Borges
Self-virtualizing Linux on x86 by Radim Krčmář
New ways to remote desktops with GStreamer integration by Victor Toso de Carvalho and Pavel Grunt
Can VMs networking benefit from DPDK by Maxime Coquelin and Victor Kaplansky
Windows VMs on KVM: The Good, the Bad and the Ugly by Ladi Prosek
Pet VMs in Kubernetes? WTH by Fabian Deutsch
Virtualization development improved with Lago by Rafael Martins
An ARMful of guests: virtualization on 64-bit ARM by Andrea Bolognani
First, there is ninja-build. It’s a workhorse, roughly comparable to make. It isn’t really designed to be used standalone though. Typically the low-level ninja build files are generated by some high-level build tool, similar to how Makefiles are generated by autotools.
Second, there is meson, a build tool which (on unix) by default uses ninja as backend. meson appears to be getting pretty popular.
So, let’s have a closer look at it. I’m working on drminfo right now, a tool to dump information about drm devices, which also comes with a simple test tool that renders a test image to the display. It is pretty small and doesn’t even use autotools, so it’s perfect for trying out something new. Also nice for this post, as the build files are pretty small.
So, here is the Makefile:
CC      ?= gcc
CFLAGS  ?= -Os -g -std=c99
CFLAGS  += -Wall

TARGETS := drminfo drmtest gtktest

drminfo : CFLAGS += $(shell pkg-config --cflags libdrm cairo pixman-1)
drminfo : LDLIBS += $(shell pkg-config --libs libdrm cairo pixman-1)

drmtest : CFLAGS += $(shell pkg-config --cflags libdrm gbm epoxy cairo cairo-gl pixman-1)
drmtest : LDLIBS += $(shell pkg-config --libs libdrm gbm epoxy cairo cairo-gl pixman-1)
drmtest : LDLIBS += -ljpeg

gtktest : CFLAGS += $(shell pkg-config --cflags gtk+-3.0 cairo pixman-1)
gtktest : LDLIBS += $(shell pkg-config --libs gtk+-3.0 cairo pixman-1)
gtktest : LDLIBS += -ljpeg

all: $(TARGETS)

clean:
	rm -f $(TARGETS)
	rm -f *~ *.o

drminfo: drminfo.o drmtools.o
drmtest: drmtest.o drmtools.o render.o image.o
gtktest: gtktest.o render.o image.o
Thanks to pkg-config there is no need to use autotools just to figure out the cflags and libraries needed, and the Makefile is short and easy to read. The only thing here you might not be familiar with is target-specific variables.
Now, compare with the meson.build file:
project('drminfo', 'c')

# pkg-config deps
libdrm_dep   = dependency('libdrm')
gbm_dep      = dependency('gbm')
epoxy_dep    = dependency('epoxy')
cairo_dep    = dependency('cairo')
cairo_gl_dep = dependency('cairo-gl')
pixman_dep   = dependency('pixman-1')
gtk3_dep     = dependency('gtk+-3.0')

# libjpeg dep
jpeg_dep     = declare_dependency(link_args : '-ljpeg')

drminfo_srcs = [ 'drminfo.c', 'drmtools.c' ]
drmtest_srcs = [ 'drmtest.c', 'drmtools.c', 'render.c', 'image.c' ]
gtktest_srcs = [ 'gtktest.c', 'render.c', 'image.c' ]

drminfo_deps = [ libdrm_dep, cairo_dep, pixman_dep ]
drmtest_deps = [ libdrm_dep, gbm_dep, epoxy_dep, cairo_dep, cairo_gl_dep, pixman_dep, jpeg_dep ]
gtktest_deps = [ gtk3_dep, cairo_dep, cairo_gl_dep, pixman_dep, jpeg_dep ]

executable('drminfo',
           sources      : drminfo_srcs,
           dependencies : drminfo_deps)
executable('drmtest',
           sources      : drmtest_srcs,
           dependencies : drmtest_deps)
executable('gtktest',
           sources      : gtktest_srcs,
           dependencies : gtktest_deps,
           install      : false)
A pretty straightforward translation. So, what are the differences?
First, meson and ninja have built-in support for a bunch of features. No need to put anything into your build files to use them, they are just there.
Sure, you can do all that with make too, the linux kernel build system does it for example. But then your Makefiles will be an order of magnitude larger than the one shown above, because all the clever stuff is in the build files instead of the build tool.
Second, meson keeps the object files strictly separated by target. The project has some source files shared by multiple executables; drmtools.c for example is used by both drminfo and drmtest. With the Makefile above it gets built once. meson builds it separately for each target, with the cflags for the specific target.
Another nice feature is that ninja automatically does parallel builds. It figures the number of processors available and runs (by default) that many jobs.
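For completeness, the day-to-day workflow with these tools is roughly the following (the build directory name is arbitrary):

meson build/           # configure; generates the ninja files in build/
ninja -C build/        # build everything, in parallel by default
ninja -C build/ clean  # remove build results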
Overall I’m pretty pleased, and I’ll probably use meson more frequently in the future. If you want to try it out too, I’d suggest starting with the tutorial.