Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

October 04, 2017

QEMU project

QEMU stable version 2.10.1 released

We are pleased to announce that the QEMU v2.10.1 stable release is now available! You can grab the tarball from our download page.

Version 2.10.1 is now tagged in the official qemu.git repository (where you can also find the changelog with details), and the stable-2.10 branch has been updated accordingly.

Apart from the normal range of general fixes, this update contains security fixes addressing guest-induced crashes of the host QEMU process (CVE-2017-13672, CVE-2017-13673) and possible code injection into the host QEMU process via a crafted multiboot ELF kernel when one is specified directly via a QEMU command-line option (CVE-2017-14167). Please update accordingly.

October 04, 2017 09:00 AM

September 27, 2017

Gerd Hoffmann

Running macOS as guest in kvm.

There are various approaches to running macOS as a guest under kvm. One is to add apple-specific features to OVMF, as described by Gabriel L. Somlo. I’ve chosen to use the Clover EFI bootloader instead. Here is what my setup looks like.

What you need

First a bootable installer disk image. You can create a bootable usbstick using the createinstallmedia tool shipped with the installer. You can then dd the usb stick to a raw disk image.

Next a clover disk image. I’ve created a script which uses guestfish to generate a disk image from a clover iso image, with a custom config file. The advantage of having a separate disk only for clover is that you can easily update clover, downgrade clover and tweak the clover configuration without booting the virtual machine. So if something goes wrong, recovering is a lot easier.

Qemu. Version 2.10 (or newer) strongly recommended. macOS versions up to 10.12.3 work fine in qemu 2.9. macOS 10.12.4 requires fixes for the qemu applesmc emulation which got merged for qemu 2.10.

OVMF. Latest git works fine for me. Older OVMF versions trip over a recent qemu update and then provide broken ACPI tables to the OS, with the result that macOS doesn’t boot, even though ovmf itself shows no signs of trouble.
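
A quick way to check the qemu requirement from the list above is to compare the installed version against 2.10. The sketch below assumes the version output has the usual "QEMU emulator version X.Y.Z" shape and that GNU sort (for -V) is available; qemu_version_ok is a made-up helper name, not something shipped with qemu.

```shell
# qemu_version_ok VERSION_OUTPUT MINIMUM
# Succeeds if the version found in VERSION_OUTPUT is >= MINIMUM.
# Hypothetical helper; relies on GNU "sort -V" for version ordering.
qemu_version_ok() {
    have=$(printf '%s\n' "$1" |
        sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n 1)
    [ -n "$have" ] || return 1
    lowest=$(printf '%s\n%s\n' "$2" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Usage on the host (the binary name is an assumption, adjust as needed):
#   qemu_version_ok "$(qemu-system-x86_64 --version)" 2.10 ||
#       echo "qemu is older than 2.10"
```

Note that a distribution package may carry backported fixes, so a failed check is a hint to look closer rather than proof that macOS 10.12.4 will not boot.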

Configuring your virtual machine

Here are snippets of my libvirt config, with comments explaining the important things:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

The xmlns:qemu entry is needed for qemu-specific tweaks, that way we can ask libvirt to add extra arguments to the qemu command line.

  <os>
    <type arch='x86_64' machine='pc-q35-2.9'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram template='/usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd'>/var/lib/libvirt/qemu/nvram/macos-test-org-base_VARS.fd</nvram>
    <bootmenu enable='yes'/>
  </os>

Using the q35 machine type here, and the cutting edge edk2 builds from my firmware repo.

  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>Penryn</model>
    <feature policy='require' name='invtsc'/>
  </cpu>

CPU. Penryn is known-good. The invtsc feature is needed because macOS uses the TSC for timekeeping. When asked to provide a fixed TSC frequency, qemu stores the TSC frequency in a hypervisor cpuid leaf, and macOS picks it up there. Without that, macOS makes a wild guess, likely gets it very wrong, and the wall clock in your guest runs either way too fast or way too slow.

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
      <source dev='/dev/path/to/lvm/volume'/>
      <target dev='sda' bus='sata'/>
    </disk>

This is the system disk macOS is installed on. Attached as sata disk to the q35 ahci controller.

You also need the installmedia image. On a real macintosh you’ll typically use a usbstick. macOS doesn’t care much where the installer is stored though, so you can attach the image as sata disk too and it’ll work fine.

Finally you need the clover disk image. edk2 alone isn’t able to boot from the system or installer disk, so it’ll start clover. clover in turn loads the hfs+ filesystem driver so the other disks can be accessed, offers a boot menu and lets you boot macOS.

    <interface type='network'>
      <source network='default'/>
      <model type='e1000-82545em'/>
    </interface>

Ethernet card. macOS has drivers for this model. It seems to have problems with link detection now and then. Setting the link status to down for a moment, then to up again (using virsh domif-setlink), gets the virtual machine online.
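
The link toggle can be wrapped in a small shell function. This is only a sketch: the domain name and interface name in the example are placeholders (virsh domiflist shows the real interface names for a domain), and the VIRSH override exists purely so the function can be tried as a dry run.

```shell
# bounce_link DOMAIN INTERFACE [PAUSE_SECONDS]
# Takes the guest's virtual link down, waits, and brings it back up.
# VIRSH can be overridden (e.g. VIRSH=echo) for a dry run.
bounce_link() {
    dom=$1
    iface=$2
    pause=${3:-2}
    ${VIRSH:-virsh} domif-setlink "$dom" "$iface" down || return 1
    sleep "$pause"
    ${VIRSH:-virsh} domif-setlink "$dom" "$iface" up
}

# Example (domain and interface names are placeholders):
#   bounce_link macos-test vnet0
```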

    <input type='tablet' bus='usb'/>
    <input type='keyboard' bus='usb'/>

USB tablet and keyboard, as input devices. The tablet lets you operate the mouse without pointer grabs, which is much more convenient than using a virtual mouse.

    <video>
      <model type='vga' vram='65536'/>
    </video>

Qemu standard vga.

  <qemu:commandline>
    <qemu:arg value='-readconfig'/>
    <qemu:arg value='/path/to/macintosh.cfg'/>
  </qemu:commandline>

This is the extra configuration item for stuff not supported by libvirt. The macintosh.cfg file looks like this, adding the emulated smc device:

[device "smc"]
driver = "isa-applesmc"
osk = "<insert-real-osk-here>"

You can run Gabriel’s SmcDumpKey tool on a macintosh to figure out what the osk is.

Configuring clover

I’m using config.plist.stripped.qemu as starting point. Here are the most important settings:

Boot/Arguments
Kernel command line. There are lots of options you can start the kernel with. You might want to try "-v" to start the kernel in verbose mode where it prints boot messages to the screen (like the linux kernel without "quiet"). For trouble-shooting, or to impress your friends.
Boot/DefaultVolume
Name of the volume clover should boot from by default. Put your system disk name here, otherwise clover will wait forever for you to pick a boot menu item.
GUI/ScreenResolution
As the name says, the display resolution. Note that OVMF has a display resolution setting too. Hit ESC at the splash screen to enter the setup, then go to Device Manager / OVMF Platform Configuration / Change Preferred Resolution. The two settings must match, otherwise macOS will boot with a scrambled display.

Go!

That should be it. Boot the virtual machine. Installing and using macOS should work as usual.

Final comment

Gabriel also has some comments on the legal side of this. Summary: Probably allowed by Apple on macintosh hardware, i.e. when you run linux on your macintosh and then run macOS as a guest there. If in doubt check with your lawyer.

by Gerd Hoffmann at September 27, 2017 06:52 AM

September 08, 2017

Cédric Bosdonnat

Virt-bootstrap 1.0.0 released

Yesterday, virt-bootstrap came to life. This tool aims at simplifying the creation of root file systems for use with libvirt's LXC container drivers. I started prototyping it a few months ago and Radostin Stoyanov wonderfully took it over during this year's Google Summer of Code.

For most users, this tool will just be used by virt-manager (since version 1.4.2). But it can be used directly from any script or command line.

The nice thing about virt-bootstrap is that it allows you to create a root file system out of existing docker images, tarballs or virt-builder templates. For example, the following command will get and unpack the official openSUSE docker image in /tmp/foo.

$ virt-bootstrap docker://opensuse /tmp/foo

Virt-bootstrap offers options to:

  • generate qcow2 image with backing chain instead of plain folder
  • apply user / group ID mapping
  • set the root password in the container

Enjoy easy container creation with the libvirt ecosystem, and have fun!

by Cédric Bosdonnat at September 08, 2017 07:24 AM

September 01, 2017

QEMU project

QEMU version 2.10.0 released

You can grab the tarball from our download page. The full list of changes is available in the Wiki.

Highlights include:

  • Support for ACPI NUMA distance info and control over CPU NUMA assignments via ‘-numa cpu’ parameters
  • Support for LUKS encryption format in qcow2 images
  • Monitor/Management interface improvements: additional debug information available through ‘info ramblock/cmma/register/qtree’, support for viewing connected clients via ‘info vnc’, improved parsing support for QMP protocol, and other additional commands
  • QXL and virtio-gpu support for controlling default display resolution
  • Support for vhost-user-scsi devices
  • NVMe emulation support for Write Zeroes command and Controller Memory Buffers
  • Guest agent support for querying guest hostname, users, timezone, and OS version/release information
  • ARM: KVM support for Raspberry Pi 3
  • ARM: emulation support for MPS2/MPS2+ FPGA-based dev boards
  • ARM: zynq: SPIPS flash support
  • ARM: exynos4210: hardware PRNG device, SDHCI, and system poweroff
  • Microblaze: support for CPU versions 9.4, 9.5, 9.6, and 10.0
  • MIPS: support for Enhanced Virtual Addressing (EVA)
  • MIPS: initrd support for kaslr-enabled kernels
  • OpenRISC: support for shadow registers, idle states, and numcores/coreid/EVAR/EPH registers
  • PowerPC: Multi-threaded TCG emulation support
  • PowerPC: OpenBIOS VGA driver for MacOS guests
  • PowerPC: pseries: KVM and emulation support for POWER9 guests
  • PowerPC: pseries: support for hash page table resizing
  • s390: channel device passthrough support via vfio-ccw
  • s390: support for channel-attached 3270 “green screen” devices for use as guest consoles or additional TTYs
  • s390: improved support for PCI (AEN, AIS, and zPCI)
  • s390: support for z14 CPU models and netboot/TFTP via CCW BIOS
  • s390: TCG support for atomic “LOAD AND x” and “COMPARE SWAP” operations, LOAD PROGRAM PARAMETER, extended facilities, CPU type, and many more less-common instructions
  • SH: TCG support for host atomic instructions for emulating tas.b and gUSA (user-space atomics), and support for fpchg/fsrra instructions
  • SPARC: fixes for booting Solaris 2.6 on sun4m/OpenBIOS machines
  • x86: Q35 MCH supports TSEG higher than 8MB
  • x86: SSE register access via gdbstub
  • Xen: support for multi-page shared rings, and 9pfs/virtfs backend
  • Xtensa: sim machine console can be directed to chardev via -serial
  • and lots more…

Thank you to everyone involved!

September 01, 2017 07:00 AM

August 27, 2017

Nathan Gauër

3D Acceleration using VirtIO

August 27, 2017 03:37 PM

August 15, 2017

Daniel Berrange

ANNOUNCE: virt-viewer 6.0 release

I am happy to announce a new bugfix release of virt-viewer 6.0 (gpg), including experimental Windows installers for Win x86 MSI (gpg) and Win x64 MSI (gpg). The virsh and virt-viewer binaries in the Windows builds should now successfully connect to libvirtd, following fixes to libvirt’s mingw port.

Signatures are created with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

All historical releases are available from:

http://virt-manager.org/download/

Changes in this release include:

  • Mention use of ssh-agent in man page
  • Display connection issue warnings in main window
  • Switch to GTask API
  • Add support for changing CD ISO with oVirt foreign menu
  • Update various outdated links in README
  • Avoid printing password in debug logs
  • Pass hostname to authentication dialog
  • Fix example URLs in man page
  • Add args to virt-viewer to specify whether to resolve VM based on ID, UUID or name
  • Fix misc runtime warnings
  • Improve support for extracting listening info from XML
  • Enable connecting to SPICE over UNIX socket
  • Fix warnings with newer GCCs
  • Allow controlling zoom level with keypad
  • Don’t close app during seamless migration
  • Don’t show toolbar in kiosk mode
  • Re-show auth dialog in kiosk mode
  • Don’t show error when cancelling auth
  • Change default screenshot name to ‘Screenshot.png’
  • Report errors when saving screenshot
  • Fix build with latest glib-mkenums

Thanks to everyone who contributed towards this release.

by Daniel Berrange at August 15, 2017 02:20 PM

ANNOUNCE: libosinfo 1.1.0 release

I am happy to announce that a new release of libosinfo, version 1.1.0, is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.

Changes in this release include:

  • Force UTF-8 locale for new glib-mkenums
  • Avoid python warnings in example program
  • Misc test suite updates
  • Fix typo in error messages
  • Remove ISO header string padding
  • Disable bogus gcc warning about unsafe loop optimizations
  • Remove reference to fedorahosted.org
  • Don’t hardcode /usr/bin/perl, use /usr/bin/env
  • Support eject-after-install parameter in OsinfoMedia
  • Fix misc warnings in docs
  • Fix error propagation when loading DB
  • Add usb.ids / pci.ids locations for FreeBSD
  • Don’t include private headers in gir/vapi generation

Thanks to everyone who contributed towards this release.

by Daniel Berrange at August 15, 2017 11:09 AM

August 10, 2017

QEMU project

Deprecation of old parameters and features

QEMU has a lot of interfaces (like command line options or HMP commands) and old features (like certain devices) which are considered deprecated since other more generic or better interfaces/features have been established instead. While the QEMU developers generally try to keep each QEMU release compatible with the previous ones, the old legacy sometimes gets in the way when developing new code and/or imposes quite a maintenance burden.

Thus we are currently considering getting rid of some of the old interfaces and features in a future release and have started to collect a list of such old items in our QEMU documentation. If you are running QEMU directly, please have a look at this deprecation chapter of the QEMU documentation to see whether you are still using one of these old interfaces or features, so you can adapt your setup to use the new interfaces/features instead. Or if you think that one of the items should not be removed from QEMU at all, please speak up on the qemu-devel mailing list to explain why the interface or feature is still required.

by Thomas Huth at August 10, 2017 08:45 AM

August 08, 2017

Cornelia Huck

Channel I/O, demystified

As promised, I will write some articles about one of the areas where s390x looks especially alien to people familiar with the usual suspects like x86: the native way of addressing I/O devices. 1

Channel I/O is a rather large topic, so I will concentrate on explaining it from a Linux (guest) and from a qemu/KVM (virtualization) point of view. This also means that I will prefer terminology that will make sense to somebody familiar with Linux (on x86) rather than the one used by e.g. a z/OS system programmer.

Links to the individual articles:
Channel I/O: What's in a channel subsystem?
Channel I/O: Talking to devices
Channel I/O: Types of devices
Channel I/O: More about channel paths

1. There is PCI on s390x, but it is a recent addition and its idiosyncrasies are better understood if you know how channel I/O works.

by Cornelia Huck (noreply@blogger.com) at August 08, 2017 06:04 PM

Channel I/O: More about channel paths

A recent discussion on qemu-devel touched upon some aspects of channel paths and their handling (or not-handling) in QEMU. I will try to summarize and give some further information here.

I previously published some information on channel paths here. This post will concentrate a bit more on aspects that are not yet relevant in QEMU, but may become so in the future.

To recap: Channel paths represent the means by which the mainframe talks to the device - they (somewhat) correspond to the actual cabling. Let's take a look at the output of lscss on a z/VM guest as an actual example:

Device   Subchan.  DevType CU Type Use  PIM PAM POM  CHPIDs
----------------------------------------------------------------------
0.0.0150 0.0.0000  0000/00 3088/08      80  80  ff   08000000 00000000
0.0.0151 0.0.0001  0000/00 3088/08      80  80  ff   08000000 00000000
0.0.8000 0.0.0002  1732/01 1731/01 yes  80  80  ff   00000000 00000000
0.0.8001 0.0.0003  1732/01 1731/01 yes  80  80  ff   00000000 00000000
0.0.8002 0.0.0004  1732/01 1731/01 yes  80  80  ff   00000000 00000000
0.0.8003 0.0.0005  1732/01 1731/01      80  80  ff   01000000 00000000
0.0.8004 0.0.0006  1732/01 1731/01      80  80  ff   01000000 00000000
0.0.8005 0.0.0007  1732/01 1731/01      80  80  ff   01000000 00000000
0.0.0191 0.0.0008  3390/0a 3990/e9      e0  e0  ff   2a3a0900 00000000
0.0.208f 0.0.0009  3390/0c 3990/e9 yes  e0  e0  ff   3a2a1a00 00000000
0.0.218f 0.0.000a  3390/0c 3990/e9 yes  e0  e0  ff   2a3a0900 00000000
0.0.228f 0.0.000b  3390/0c 3990/e9 yes  e0  e0  ff   2a3a1a00 00000000
0.0.238f 0.0.000c  3390/0c 3990/e9 yes  e0  e0  ff   093a2a00 00000000
0.0.000c 0.0.000d  0000/00 2540/00      80  80  ff   08000000 00000000
0.0.000d 0.0.000e  0000/00 2540/00      80  80  ff   08000000 00000000
0.0.000e 0.0.000f  0000/00 1403/00      80  80  ff   08000000 00000000
0.0.0009 0.0.0010  0000/00 3215/00 yes  80  80  ff   08000000 00000000
0.0.0190 0.0.0011  3390/0a 3990/e9      e0  e0  ff   3a2a1a00 00000000
0.0.019d 0.0.0012  3390/0a 3990/e9      e0  e0  ff   093a2a00 00000000
0.0.019e 0.0.0013  3390/0a 3990/e9      e0  e0  ff   093a2a00 00000000
0.0.0592 0.0.0014  3390/0a 3990/e9      e0  e0  ff   3a2a1a00 00000000
0.0.ffff 0.0.0015  9336/10 6310/80      80  80  ff   08000000 00000000

A couple of interesting observations with regard to channel paths can be made here:
  • Devices 0.0.0150/0.0.0151, 0.0.000c/0.0.000d, 0.0.000e, 0.0.0009, and 0.0.ffff all share the same channel path, 08, despite being of different types (virtual CTC, virtual card punch/card reader/printer, virtual console, and virtual FBA DASD). This is because they are all emulated devices, and z/VM chooses to use the same virtual channel path for them.
  • Devices 0.0.8000 - 0.0.8002 use channel path 00 as their only channel path, as can be seen by the PIM being 80.
  • Devices 0.0.8000 - 0.0.8002 and 0.0.8003 - 0.0.8005 use the same channel path, respectively; that is because they make up the device triplet for an OSA device.
  • The remaining devices (all ECKD DASD) use several channel paths (09, 1a, 2a, 3a), but only three at a time (as evidenced by the PIM of e0), and also in different combinations. This is probably a quirk of the individual setup for this guest.
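
The relationship between a path mask and the CHPIDs column in the lscss output can be made explicit with a small helper. This is a sketch assuming the usual convention that the most significant mask bit (0x80) refers to the first of the eight CHPID slots; decode_paths is a made-up name and not part of s390-tools.

```shell
# decode_paths MASK CHPIDS...
# MASK is a two-digit hex path mask (e.g. the PIM column), CHPIDS the
# two hex words from the CHPIDs column. Prints the CHPIDs whose mask
# bit is set; bit 0x80 selects the first slot, 0x40 the second, etc.
decode_paths() {
    mask=$((0x$1)); shift
    chpids=$(printf '%s' "$*" | tr -d ' ')
    out=""
    i=0
    while [ "$i" -lt 8 ]; do
        if [ $(( mask & (0x80 >> i) )) -ne 0 ]; then
            start=$(( i * 2 + 1 ))
            out="$out $(printf '%s' "$chpids" | cut -c "$start-$((start + 1))")"
        fi
        i=$(( i + 1 ))
    done
    echo "${out# }"
}

# Device 0.0.0191 from the listing above:
#   decode_paths e0 2a3a0900 00000000    # prints: 2a 3a 09
```

The mask is needed because an unused slot is shown as 00, which would otherwise be indistinguishable from channel path 00.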
The output of lschp of the same guest looks like this:

CHPID  Vary  Cfg.  Type  Cmg  Shared  PCHID
============================================
0.00   1     1     11    -    -      (ff00)
0.01   1     1     11    -    -      (ff01)
0.08   1     1     1a    -    -       0598 
0.09   1     1     1a    -    -       0599 
0.0a   1     1     1a    -    -       059c 
0.0b   1     1     25    -    -       059d 
0.0c   1     1     1a    -    -       05ac 
0.17   1     1     11    -    -       05b4 
0.18   1     1     1a    -    -       05a0 
0.19   1     1     1a    -    -       05a1 
0.1a   1     1     1a    -    -       05a4 
0.1b   1     1     25    -    -       05a5 
0.1c   1     1     1a    -    -       05ad 
0.28   1     1     1a    -    -       05d8 
0.29   1     1     1a    -    -       05d9 
0.2a   1     1     1a    -    -       05dc 
0.2b   1     1     25    -    -       05dd 
0.2c   1     1     1a    -    -       05d0 
0.34   1     1     11    -    -       05ec 
0.35   1     1     11    -    -       05a8 
0.38   1     1     1a    -    -       05e0 
0.39   1     1     1a    -    -       05e1 
0.3a   1     1     1a    -    -       05e4 
0.3b   1     1     25    -    -       05e5 
0.3c   1     1     1a    -    -       05d1 
0.60   1     1     24    -    -      (070c)
0.61   1     1     24    -    -      (070d)
0.62   1     1     24    -    -      (070e)
0.63   1     1     24    -    -      (070f)

Here we find the various channel paths again, together with more:
  • There are several channel paths that are available to the guest, but not in use by any device currently available to the guest (and therefore not turning up in the output of lscss).
  • Channel paths 00 and 01 (used by the OSA cards) use an internal channel (the numbers in the last column are in brackets) - we can therefore conclude that the cards are virtualized by z/VM.
  • The channel path 08 (which is referenced by all virtual devices) is actually backed by a physical path (0598). I frankly have no idea why z/VM is doing that.
  • The channel paths used by the ECKD DASD (09, 1a, 2a, 3a) all are of the same type (1a - FICON, IIRC) and are backed by different physical paths (last column).
Various modifications can be done to the channel paths; under Linux, the chchp tool is useful for that. Let's try to vary off a path:

chchp -v 0 0.3a
Vary offline 0.3a... done.

lschp shows the changed state for the path:

0.3a   0     1     1a    -    -       05e4

The lscss output remains unchanged - which isn't surprising as doing a vary off only affects the state of the channel path within Linux: Linux will no longer use the path for I/O, but the path masks as managed by the hardware and z/VM are not changed.
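
When scripting around chchp, it can help to pull the Vary and Cfg. columns for a single path out of the lschp output. A minimal sketch, assuming the column layout shown above (CHPID first, then Vary, then Cfg.); chp_state is a hypothetical helper name.

```shell
# chp_state CHPID
# Reads lschp output on stdin and prints the vary and configure state
# of the given channel path, e.g. "vary=0 cfg=1".
chp_state() {
    awk -v id="$1" '$1 == id { print "vary=" $2, "cfg=" $3 }'
}

# Usage on the guest:
#   lschp | chp_state 0.3a
```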

Let's try to configure off another path:

chchp -c 0 0.2a
Configure standby 0.2a... failed - attribute value not as expected

That did not work as expected. Why? This is supposed to issue a SCLP command to set the channel path to standby - but my guest apparently does not have the rights or ability to do so. Which is a pity, as I would have liked to show the effects of configuring a channel path to standby:
  • It (unsurprisingly) changes the state in lschp.
  • It also changes the path masks, as shown in lscss.
  • It may generate a machine check with a channel report word (CRW) that informs the OS that something has happened to the channel path - this is dependent upon the environment, however.
So let's stop here. I'll continue with another setup, once I have it.

by Cornelia Huck (noreply@blogger.com) at August 08, 2017 06:03 PM

July 29, 2017

Stefan Hajnoczi

Tracing userspace static probes with perf(1)

The perf(1) tool added support for userspace static probes in Linux 4.8. Userspace static probes are pre-defined trace points in userspace applications. Application developers add them so frequently needed lifecycle events are available for performance analysis, troubleshooting, and development.

Static userspace probes are more convenient than defining your own function probes from scratch. You can save time by using them and not worrying about where to add probes because that has already been done for you.

On my Fedora 26 machine the QEMU, gcc, and nodejs packages ship with static userspace probes. QEMU offers probes for vcpu events, disk I/O activity, device emulation, and more.

Without further ado, here is how to trace static userspace probes with perf(1)!

Scan the binary for static userspace probes

The perf(1) tool needs to scan the application's ELF binaries for static userspace probes and store the information in $HOME/.debug/usr/:

# perf buildid-cache --add /usr/bin/qemu-system-x86_64

List static userspace probes

Once the ELF binaries have been scanned you can list the probes as follows:

# perf list sdt_*:*

List of pre-defined events (to be used in -e):

sdt_qemu:aio_co_schedule [SDT event]
sdt_qemu:aio_co_schedule_bh_cb [SDT event]
sdt_qemu:alsa_no_frames [SDT event]
...

Let's trace something!

First add probes for the events you are interested in:

# perf probe sdt_qemu:blk_co_preadv
Added new event:
sdt_qemu:blk_co_preadv (on %blk_co_preadv in /usr/bin/qemu-system-x86_64)

You can now use it in all perf tools, such as:

perf record -e sdt_qemu:blk_co_preadv -aR sleep 1

Then capture trace data as follows:

# perf record -a -e sdt_qemu:blk_co_preadv
^C
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 2.274 MB perf.data (4714 samples) ]

The trace can be printed using perf-script(1):

# perf script
qemu-system-x86 3425 [000] 2183.218343: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=0 arg4=512 arg5=0
qemu-system-x86 3425 [001] 2183.310712: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=0 arg4=512 arg5=0
qemu-system-x86 3425 [001] 2183.310904: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=512 arg4=512 arg5=0
...

If you want to get fancy it's also possible to write trace analysis scripts with perf-script(1). That's a topic for another post but see the --gen-script= option to generate a skeleton script.
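
Short of a full perf-script(1) analysis script, the plain text output can already be summarized with awk. The sketch below counts blk_co_preadv events and adds up arg4. That arg4 is the request length in bytes is an assumption based on the sample output above, so check the probe definition in the QEMU source for your version; sum_read_bytes is a made-up name.

```shell
# sum_read_bytes
# Reads "perf script" output on stdin, counts sdt_qemu:blk_co_preadv
# events and sums their arg4 values (assumed to be bytes per request).
sum_read_bytes() {
    awk '/sdt_qemu:blk_co_preadv/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^arg4=/) {
                sub(/^arg4=/, "", $i)
                total += $i
                n++
            }
    }
    END { printf "%d requests, %d bytes\n", n, total }'
}

# Usage:
#   perf script | sum_read_bytes
```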

Current limitations

As of July 2017 there are a few limitations to be aware of:

Probe arguments are automatically numbered and do not have human-readable names. You will see arg1, arg2, etc and will need to reference the probe definition in the application source code to learn the meaning of the argument. Some versions of perf(1) may not even print arguments automatically since this feature was added later.

The contents of string arguments are not printed, only the memory address of the string.

Probes called from multiple call-sites in the application result in multiple perf probes. For example, if probe foo is called from 3 places you get sdt_myapp:foo, sdt_myapp:foo_1, and sdt_myapp:foo_2 when you run perf probe --add sdt_myapp:foo.

The SystemTap semaphores feature is not supported and such probes will not fire unless you manually set the semaphore inside your application or from another tool like GDB. This means that the sdt_myapp:foo will not fire if the application uses the MYAPP_FOO_ENABLED() macro like this: if (MYAPP_FOO_ENABLED()) MYAPP_FOO();.

Some history and alternative tools

Static userspace probes were popularized by DTrace's <sys/sdt.h> header. Tracers that came after DTrace implemented the same interface for compatibility.

On Linux the initial tool for static userspace probes was SystemTap. In fact, the <sys/sdt.h> header file on my Fedora 26 system is still part of the systemtap-sdt-devel package.

More recently the GDB debugger gained support for static userspace probes. See the Static Probe Points documentation if you want to use userspace static probes from GDB.

Conclusion

It's very handy to have static userspace probing available alongside all the other perf(1) tracing features. There are a few limitations to keep in mind but if your tracing workflow is based primarily around perf(1) then you can now begin using static userspace probes without relying on additional tools.

by stefanha (noreply@blogger.com) at July 29, 2017 12:22 PM

July 24, 2017

Peter Maydell

Installing Debian on QEMU’s 64-bit ARM “virt” board

This post is a 64-bit companion to an earlier post of mine where I described how to get Debian running on QEMU emulating a 32-bit ARM “virt” board. Thanks to commenter snak3xe for reminding me that I’d said I’d write this up…

Why the “virt” board?

For 64-bit ARM, QEMU emulates many fewer boards, so “virt” is almost the only choice, unless you specifically know that you want to emulate one of the 64-bit Xilinx boards. “virt” supports PCI, virtio, a recent ARM CPU and large amounts of RAM. The only thing it doesn’t have out of the box is graphics.

Prerequisites and assumptions

I’m going to assume you have a Linux host, and a recent version of QEMU (at least QEMU 2.8). I also use libguestfs to extract files from a QEMU disk image, but you could use a different tool for that step if you prefer.

I’m going to document how to set up a guest which directly boots the kernel. It should also be possible to have QEMU boot a UEFI image which then boots the kernel from a disk image, but that’s not something I’ve looked into doing myself. (There may be tutorials elsewhere on the web.)

Getting the installer files

I suggest creating a subdirectory for these and the other files we’re going to create.

wget -O installer-linux http://http.us.debian.org/debian/dists/stretch/main/installer-arm64/current/images/netboot/debian-installer/arm64/linux
wget -O installer-initrd.gz http://http.us.debian.org/debian/dists/stretch/main/installer-arm64/current/images/netboot/debian-installer/arm64/initrd.gz

Saving them locally as installer-linux and installer-initrd.gz means they won’t be confused with the final kernel and initrd that the installation process produces.

(If we were installing on real hardware we would also need a “device tree” file to tell the kernel the details of the exact hardware it’s running on. QEMU’s “virt” board automatically creates a device tree internally and passes it to the kernel, so we don’t need to provide one.)

Installing

First we need to create an empty disk drive to install onto. I picked a 5GB disk but you can make it larger if you like.

qemu-img create -f qcow hda.qcow2 5G

Now we can run the installer:

qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel installer-linux \
  -initrd installer-initrd.gz \
  -drive if=none,file=hda.qcow2,format=qcow,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic -no-reboot

The installer will display its messages on the text console (via an emulated serial port). Follow its instructions to install Debian to the virtual disk; it’s straightforward, but if you have any difficulty the Debian installation guide may help.

The actual install process will take a few hours as it downloads packages over the network and writes them to disk. It will occasionally stop to ask you questions.

Late in the process, the installer will print the following warning dialog:

   +-----------------| [!] Continue without boot loader |------------------+
   |                                                                       |
   |                       No boot loader installed                        |
   | No boot loader has been installed, either because you chose not to or |
   | because your specific architecture doesn't support a boot loader yet. |
   |                                                                       |
   | You will need to boot manually with the /vmlinuz kernel on partition  |
   | /dev/vda1 and root=/dev/vda2 passed as a kernel argument.             |
   |                                                                       |
   |                              <Continue>                               |
   |                                                                       |
   +-----------------------------------------------------------------------+  

Press continue for now, and we’ll sort this out later.

Eventually the installer will finish by rebooting — this should cause QEMU to exit (since we used the -no-reboot option).

At this point you might like to make a copy of the hard disk image file, to save the tedium of repeating the install later.

Extracting the kernel

The installer warned us that it didn’t know how to arrange to automatically boot the right kernel, so we need to do it manually. For QEMU that means we need to extract the kernel the installer put into the disk image so that we can pass it to QEMU on the command line.

There are various tools you can use for this, but I’m going to recommend libguestfs, because it’s the simplest to use. To check that it works, let’s look at the partitions in our virtual disk image:

$ virt-filesystems -a hda.qcow2 
/dev/sda1
/dev/sda2

If this doesn’t work, then you should sort that out first. A couple of common reasons I’ve seen:

  • if you’re on Ubuntu then your kernels in /boot are installed not-world-readable; you can fix this with sudo chmod 644 /boot/vmlinuz*
  • if you’re running Virtualbox on the same host it will interfere with libguestfs’s attempt to run KVM; you can fix that by exiting Virtualbox

Looking at what’s in our disk we can see the kernel and initrd in /boot:

$ virt-ls -a hda.qcow2 /boot/
System.map-4.9.0-3-arm64
config-4.9.0-3-arm64
initrd.img
initrd.img-4.9.0-3-arm64
initrd.img.old
lost+found
vmlinuz
vmlinuz-4.9.0-3-arm64
vmlinuz.old

and we can copy them out to the host filesystem:

virt-copy-out -a hda.qcow2 /boot/vmlinuz-4.9.0-3-arm64 /boot/initrd.img-4.9.0-3-arm64 .

(We want the longer filenames, because vmlinuz and initrd.img are just symlinks and virt-copy-out won’t copy them.)

An important warning about libguestfs, or any other tools for accessing disk images from the host system: do not try to use them while QEMU is running, or you will get disk corruption when both the guest OS inside QEMU and libguestfs try to update the same image.

If you subsequently upgrade the kernel inside the guest, you’ll need to repeat this step to extract the new kernel and initrd, and then update your QEMU command line appropriately.

Running

To run the installed system we need a different command line which boots the installed kernel and initrd, and passes the kernel the command line arguments the installer told us we’d need:

qemu-system-aarch64 -M virt -m 1024 -cpu cortex-a53 \
  -kernel vmlinuz-4.9.0-3-arm64 \
  -initrd initrd.img-4.9.0-3-arm64 \
  -append 'root=/dev/vda2' \
  -drive if=none,file=hda.qcow2,format=qcow2,id=hd \
  -device virtio-blk-pci,drive=hd \
  -netdev user,id=mynet \
  -device virtio-net-pci,netdev=mynet \
  -nographic

This should boot to a login prompt, where you can log in with the user and password you set up during the install.

The installation has an SSH client, so one easy way to get files in and out is to use “scp” from inside the VM to talk to an SSH server outside it. Or you can use libguestfs to write files directly into the disk image (for instance using virt-copy-in) — but make sure you only use libguestfs when the VM is not running, or you will get disk corruption.

by pm215 at July 24, 2017 09:25 AM

July 21, 2017

Ladi Prosek

Nesting Hyper-V in QEMU/KVM: Known issues

This is a follow-up to Running Hyper-V in a QEMU/KVM Guest published earlier this year. The article provided instructions on setting up Hyper-V in a QEMU/KVM Windows guest as enabled by a particular KVM patchset (on Intel hardware only, as it turned out later). Several issues have been found since then; some already fixed, some in the process of being fixed, and some still not fully understood.

This post aims to be an up-to-date list of issues related to Hyper-V on KVM, showing their current status and, where applicable, upstream commit IDs. The issues are ordered chronologically from the oldest ones to those found recently.

Each entry lists the issue description, its status, and the public bug tracker reference.

  • Hyper-V on KVM does not work at all (initial work item)
    Status: Fixed in kernel 4.10 (7ca29de213, ee146c1c10, 9ed38ffad4, 1dc35dacc1)
    Public bug tracker: RHBZ 1326138

  • Hyper-V on KVM does not work on new Intel CPUs with PML
    Status: Fixed in kernel 4.11 (ab007cc94f, 1fb883bb82)
    Public bug tracker: RHBZ 1440022

  • Hyper-V on KVM does not work on AMD CPUs
    Status: Fixed in kernel 4.12 for 1 vCPU (405a353a0e) and in kernel 4.13 for >1 vCPU (4aebd0e9ca, ab2f4d73eb, 9b61174793, a12713c25b, 1a5e185294)
    Public bug tracker: RHBZ 1440025

  • rtl8139 and e1000 QEMU network cards don’t work with Hyper-V enabled
    Status: Not fixed yet
    Public bug tracker: RHBZ 1452546

  • L2 Linux guest in Hyper-V on KVM hangs on boot
    Status: Fixed in kernel 4.13 (2cf0284223, 71c2a2d0a8)
    Public bug tracker: RHBZ 1457866

  • Windows TSC page does not work with Hyper-V enabled
    Status: Fixed in kernel 4.14 (72c139bacf) and in QEMU 2.11 (1d268dece4, ddb98b5a9f, 4bb95b82df, d72bc7f6f8)
    Public bug tracker: RHBZ 1464412

  • Hyper-V on KVM does not work with OVMF (required for secure boot)
    Status: Not fixed yet
    Public bug tracker: RHBZ 1488203

  • Not Hyper-V but prevents virtualization-based security from running
    Status: Not fixed yet
    Public bug tracker: RHBZ 1496170, TianocoreBZ 727


by ladipro at July 21, 2017 07:34 AM

July 13, 2017

Stefan Hajnoczi

Packet capture coming to AF_VSOCK

For anyone interested in the AF_VSOCK zero-configuration host<->guest communications channel it's important to be able to observe traffic. Packet capture is commonly used to troubleshoot network problems and debug networking applications. Up until now it hasn't been available for AF_VSOCK.

In 2016 Gerard Garcia created the vsockmon Linux driver that enables AF_VSOCK packet capture. During the course of his excellent Google Summer of Code work he also wrote patches for libpcap, tcpdump, and Wireshark.

Recently I revisited Gerard's work because Linux 4.12 shipped with the new vsockmon driver, making it possible to finalize the userspace support for AF_VSOCK packet capture. And it's working beautifully:

I have sent the latest patches to the tcpdump and Wireshark communities so AF_VSOCK can be supported out-of-the-box in the future. For now you can also find patches in my personal repositories:

The basic flow is as follows:


# ip link add type vsockmon
# ip link set vsockmon0 up
# tcpdump -i vsockmon0
# ip link set vsockmon0 down
# ip link del vsockmon0
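
If tcpdump is interrupted part-way through the flow above, the vsockmon0 interface stays behind. A minimal Python sketch that runs the same five commands and always attempts the teardown; the function name capture_vsock and its injectable run parameter are illustrative assumptions, and it assumes root privileges and that the kernel names the new interface vsockmon0:

```python
import subprocess

def capture_vsock(run=subprocess.run, ifname="vsockmon0"):
    """Run the capture flow from the post and clean up the interface afterwards.

    `run` is injectable (a hypothetical convenience) so the command sequence
    can be inspected without root privileges.
    """
    # Create the interface; the flow assumes the kernel names it vsockmon0.
    run(["ip", "link", "add", "type", "vsockmon"], check=True)
    try:
        run(["ip", "link", "set", ifname, "up"], check=True)
        # Blocks until tcpdump exits.
        run(["tcpdump", "-i", ifname], check=False)
    finally:
        # Best-effort cleanup, mirroring the last two commands of the flow.
        run(["ip", "link", "set", ifname, "down"], check=False)
        run(["ip", "link", "del", ifname], check=False)
```

Calling capture_vsock() with no arguments executes the real commands; passing a stub for run only records them.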

It's easiest to wait for distros to package Linux 4.12 and future versions of libpcap, tcpdump, and Wireshark. If you decide to build from source, make sure to build libpcap first and then tcpdump or Wireshark. The libpcap dependency is necessary so that tcpdump/Wireshark can access AF_VSOCK traffic.

by stefanha (noreply@blogger.com) at July 13, 2017 04:31 PM

Gerd Hoffmann

Fresh Fedora 26 images uploaded

Fedora 26 is out the door, and here are fresh Fedora 26 images.

There are raspberry pi images. The aarch64 image requires a model 3; the armv7 image boots on both models 2 and 3. Unlike the images for previous Fedora releases, the new images use the standard Fedora kernels instead of a custom kernel. So the kernel update service for the older images will stop within the next few weeks.

There are efi images for qemu. The i386 and x86_64 images use systemd-boot as bootloader. grub2 doesn’t work due to bug 1196114 (unless you create a boot menu entry manually in uefi setup). The arm images use grub2 as bootloader. armv7 isn’t supported by systemd-boot in the first place. The aarch64 version throws an exception. The efi images can also be booted as a container, using "systemd-nspawn --boot --image <file>", but you have to convert them to raw first.

The images don’t have a root password. You have to set one using "virt-customize -a <image> --root-password password:<secret>", otherwise you can’t log in after boot.

The images have been created with imagefish.

by Gerd Hoffmann at July 13, 2017 12:33 PM

June 27, 2017

Richard Jones

virt-builder Debian 9 image available

Debian 9 (“Stretch”) was released last week and now it’s available in virt-builder, the fast way to build virtual machine disk images:

$ virt-builder -l | grep debian
debian-6                 x86_64     Debian 6 (Squeeze)
debian-7                 sparc64    Debian 7 (Wheezy) (sparc64)
debian-7                 x86_64     Debian 7 (Wheezy)
debian-8                 x86_64     Debian 8 (Jessie)
debian-9                 x86_64     Debian 9 (stretch)

$ virt-builder debian-9 \
    --root-password password:123456
[   0.5] Downloading: http://libguestfs.org/download/builder/debian-9.xz
[   1.2] Planning how to build this image
[   1.2] Uncompressing
[   5.5] Opening the new disk
[  15.4] Setting a random seed
virt-builder: warning: random seed could not be set for this type of guest
[  15.4] Setting passwords
[  16.7] Finishing off
                   Output file: debian-9.img
                   Output size: 6.0G
                 Output format: raw
            Total usable space: 3.9G
                    Free space: 3.1G (78%)

$ qemu-system-x86_64 \
    -machine accel=kvm:tcg -cpu host -m 2048 \
    -drive file=debian-9.img,format=raw,if=virtio \
    -serial stdio

by rich at June 27, 2017 09:01 AM

June 04, 2017

Richard Jones

New in libguestfs: Rewriting bits of the daemon in OCaml

libguestfs is a C library for creating and editing disk images. In the most common (but not the only) configuration, it uses KVM to sandbox access to disk images. The C library talks to a separate daemon running inside a KVM appliance, as in this Unicode-art diagram taken from the fine manual:

 ┌───────────────────┐
 │ main program      │
 │                   │
 │                   │           child process / appliance
 │                   │          ┌──────────────────────────┐
 │                   │          │ qemu                     │
 ├───────────────────┤   RPC    │      ┌─────────────────┐ │
 │ libguestfs  ◀╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍▶ guestfsd        │ │
 │                   │          │      ├─────────────────┤ │
 └───────────────────┘          │      │ Linux kernel    │ │
                                │      └────────┬────────┘ │
                                └───────────────│──────────┘
                                                │
                                                │ virtio-scsi
                                         ┌──────┴──────┐
                                         │  Device or  │
                                         │  disk image │
                                         └─────────────┘

The library has to be written in C because it needs to be linked to any main program. The daemon (guestfsd in the diagram) is also written in C. But there’s no specific reason for that, except that’s what we did historically.

The daemon is essentially a big pile of functions, most corresponding to a libguestfs API. Writing the daemon in C is painful to say the least. Because it’s a long-running process running in a memory-constrained environment, we have to be very careful about memory management, religiously checking every return from malloc, strdup etc., making even the simplest task non-trivial and full of untested code paths.

So last week I modified libguestfs so you can now write APIs in OCaml if you want to. OCaml is a high level language that compiles down to object files, and it’s entirely possible to link the daemon from a mix of C object files and OCaml object files. Another advantage of OCaml is that you can call from C ↔ OCaml with relatively little glue code (although a disadvantage is that you still need to write that glue mostly by hand). Most simple calls turn into direct CALL instructions with just a simple bitshift required to convert between ints and bools on the C and OCaml sides. More complex calls passing strings and structures are not too difficult either.
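
To make the "simple bitshift" concrete: OCaml stores an int as a machine word with the value shifted left by one bit and the lowest bit set, and the Val_int / Int_val macros in OCaml's C headers convert between the two forms. The Python model below only illustrates that arithmetic; the function names mirror the C macros and nothing here is libguestfs code.

```python
def val_int(n):
    """C int -> OCaml value: shift left one bit and set the tag bit."""
    return (n << 1) | 1

def int_val(v):
    """OCaml value -> C int: an arithmetic right shift drops the tag bit."""
    return v >> 1

def val_bool(b):
    """OCaml booleans are the tagged ints for 0 (false) and 1 (true)."""
    return val_int(1 if b else 0)

def bool_val(v):
    """OCaml boolean value -> truth value."""
    return int_val(v) != 0

print(val_int(3), int_val(val_int(-3)), val_bool(True))
```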

OCaml also turns memory errors into a single exception, which unwinds the stack cleanly, so we don’t litter the code with memory handling. We can still run the mixed C/OCaml binary under valgrind.

Code gets quite a bit shorter. For example the case_sensitive_path API — all string handling and directory lookups — goes from 183 lines of C code to 56 lines of OCaml code (and much easier to understand too).

I’m reimplementing a few APIs in OCaml, but the plan is definitely not to convert them all. I think we’ll have C and OCaml APIs in the daemon for a very long time to come.


by rich at June 04, 2017 01:14 PM

June 01, 2017

Cornelia Huck

Linux 4.12 and QEMU 2.10 will have basic support for vfio-ccw

If you want to pass through some channel devices to your guest, you will be able to do so with a host kernel >= 4.12 and a QEMU >= 2.10.

For some hints about configuration and restrictions, see this entry in the QEMU wiki.

by Cornelia Huck (noreply@blogger.com) at June 01, 2017 03:06 PM

May 16, 2017

QEMU project

Presentations from DevConf 2017

There were a couple of QEMU / virtualization related talks at the DevConf 2017 conference, which took place at the end of January, but until now we had not gathered the links to the recordings of these talks. So here is the list:

by Thomas Huth at May 16, 2017 02:00 PM

May 13, 2017

Nathan Gauër

GSoC | Log#2: API Forwarding

May 13, 2017 06:52 PM

GSoC | Log#1: Project presentation

May 13, 2017 06:31 PM

Linux graphic stack: an overview

May 13, 2017 03:37 PM

May 10, 2017

Cornelia Huck

Channel I/O: Types of devices

The last posts in this series tried to examine some basic principles of channel I/O. But what kinds of devices are actually available?

I'll focus on the device types that are available to a guest running under QEMU.

The most important channel devices for QEMU guests (and often, the only ones present in a guest) are virtio-ccw devices. These have been used in the previous examples. Think of them as the channel I/O equivalent of virtio-pci devices: that is, a device that is discoverable in the guest and acts as a means to access the virtio device.
All virtio-ccw devices share the following characteristics:
  • Fully virtual (i.e., fully emulated). There is no "real hardware" virtio-ccw device.
  • A control unit type of 0x3832.
  • One virtual channel path, type 0x32.

New (well, to QEMU) and just recently added (will be in 2.10) are 3270 devices (the channel-attached variety). The classic green-screen console; some details about what works (and what doesn't yet) and how to set this up may be found in the QEMU wiki.
3270 devices have the following characteristics:
  • Fully virtual (emulated). While you could pass through 3270 devices while running under z/VM, this depends on the not-yet-merged vfio-ccw infrastructure (see below) and does not really make much sense.
  • A control unit type of 0x3270.
  • One virtual channel path, type 0x1a.

Still being worked on, but on a good track, is the vfio-ccw infrastructure. The kernel part has been merged for 4.12, the QEMU part will hopefully be merged soon. vfio-ccw brings the same functionality to channel devices that vfio-pci brought to pci devices: Give hardware devices to the guest to use. This is still quite experimental, and has only really been tested with one device type yet: ECKD DASD.
'DASD' basically refers to disks; this wikipedia article explains more. 'ECKD' refers to the data recording format; this wikipedia article probably explains more than you ever wanted to know. Linux accesses ECKD DASD as block devices with some minor oddities.
If you pass through an ECKD DASD, you can expect the following to show up in your guest:
  • A device that corresponds one-to-one to a device on the host, although it might have a different device number (depending on how it was configured).
  • A control unit and device type corresponding to ECKD DASD.  A control unit type of 0x3990 and a device type of 0x3390 are the most likely.
  • One to eight channel paths, corresponding to real channel paths.
vfio-ccw opens the way to expose all kinds of channel devices to QEMU/KVM guests: FBA DASD, channel-attached tapes - basically everything that is supported by Linux on the host.

by Cornelia Huck (noreply@blogger.com) at May 10, 2017 04:44 PM

April 25, 2017

Gerd Hoffmann

meson experiments

Seems the world of build systems is going to change. Traditional approach is make. Pimped up with autoconf and automake. But some newcomers provide new approaches to building your projects.

First, there is ninja-build. It’s a workhorse, roughly comparable to make. It isn’t really designed to be used standalone though. Typically the lowlevel ninja build files are generated by some highlevel build tool, similar to how Makefiles are generated by autotools.

Second, there is meson, a build tool which (on unix) by default uses ninja as backend. meson appears to be becoming pretty popular.

So, let’s have a closer look at it. I’m working on drminfo right now, a tool to dump information about drm devices, which also comes with a simple test tool, rendering a test image to the display. It is pretty small, doesn’t even use autotools, perfect for trying out something new. Also nice for this post as the build files are pretty small.

So, here is the Makefile:

CC      ?= gcc
CFLAGS  ?= -Os -g -std=c99
CFLAGS  += -Wall

TARGETS := drminfo drmtest gtktest

drminfo : CFLAGS += $(shell pkg-config --cflags libdrm cairo pixman-1)
drminfo : LDLIBS += $(shell pkg-config --libs libdrm cairo pixman-1)

drmtest : CFLAGS += $(shell pkg-config --cflags libdrm gbm epoxy cairo cairo-gl pixman-1)
drmtest : LDLIBS += $(shell pkg-config --libs libdrm gbm epoxy cairo cairo-gl pixman-1)
drmtest : LDLIBS += -ljpeg

gtktest : CFLAGS += $(shell pkg-config --cflags gtk+-3.0 cairo pixman-1)
gtktest : LDLIBS += $(shell pkg-config --libs gtk+-3.0 cairo pixman-1)
gtktest : LDLIBS += -ljpeg

all: $(TARGETS)

clean:
        rm -f $(TARGETS)
        rm -f *~ *.o

drminfo: drminfo.o drmtools.o
drmtest: drmtest.o drmtools.o render.o image.o
gtktest: gtktest.o render.o image.o

Thanks to pkg-config there is no need to use autotools just to figure out the cflags and libraries needed, and the Makefile is short and easy to read. The only thing here you might not be familiar with is target-specific variables.

Now, compare with the meson.build file:

project('drminfo', 'c')

# pkg-config deps
libdrm_dep    = dependency('libdrm')
gbm_dep       = dependency('gbm')
epoxy_dep     = dependency('epoxy')
cairo_dep     = dependency('cairo')
cairo_gl_dep  = dependency('cairo-gl')
pixman_dep    = dependency('pixman-1')
gtk3_dep      = dependency('gtk+-3.0')

# libjpeg dep
jpeg_dep      = declare_dependency(link_args : '-ljpeg')

drminfo_srcs  = [ 'drminfo.c', 'drmtools.c' ]
drmtest_srcs  = [ 'drmtest.c', 'drmtools.c', 'render.c', 'image.c' ]
gtktest_srcs  = [ 'gtktest.c', 'render.c', 'image.c' ]

drminfo_deps  = [ libdrm_dep, cairo_dep, pixman_dep ]
drmtest_deps  = [ libdrm_dep, gbm_dep, epoxy_dep,
                  cairo_dep, cairo_gl_dep, pixman_dep, jpeg_dep ]
gtktest_deps  = [ gtk3_dep,
                  cairo_dep, cairo_gl_dep, pixman_dep, jpeg_dep ]

executable('drminfo',
           sources      : drminfo_srcs,
           dependencies : drminfo_deps)
executable('drmtest',
           sources      : drmtest_srcs,
           dependencies : drmtest_deps)
executable('gtktest',
           sources      : gtktest_srcs,
           dependencies : gtktest_deps,
           install      : false)

Pretty straightforward translation. So, what are the differences?

First, meson and ninja have built-in support for a bunch of features. No need to put anything into your build files to use them, they are just there:

  • Automatic header dependencies. When a header file changes all source files which include it get rebuilt.
  • Automatic build system dependencies: When meson.build changes the ninja build files are updated.
  • Rebuilds on command changes: When the build command line for a target changes the target is rebuilt.

Sure, you can do all that with make too, the Linux kernel build system does it for example. But then your Makefiles will be an order of magnitude larger than the one shown above, because all the clever stuff is in the build files instead of the build tool.

Second, meson keeps the object files strictly separated by target. The project has some source files shared by multiple executables. drmtools.c for example is used by both drminfo and drmtest. With the Makefile above it gets built once. meson builds it separately for each target, with the cflags for the specific target.

Another nice feature is that ninja automatically does parallel builds. It figures out the number of processors available and runs (by default) that many jobs.

Overall I’m pretty pleased, I’ll probably use meson more frequently in the future. If you want to try it out too, I’d suggest starting with the tutorial.

by Gerd Hoffmann at April 25, 2017 08:36 AM

April 12, 2017

Cornelia Huck

SELinux vs. QEMU/KVM

Trying to run QEMU/KVM under an older version of z/VM? Make sure to read Thomas' hint.

by Cornelia Huck (noreply@blogger.com) at April 12, 2017 12:14 PM

April 05, 2017

Thomas Huth

KVM with SELinux on a z/VM s390x machine

When you are trying to start a KVM guest via libvirt on an s390x Linux installation that is running on an older version of z/VM, you might run into the problem that QEMU refuses to start with this error message:

cannot set up guest memory 's390.ram': Permission denied.

This happens because older versions of z/VM (before version 6.3) do not support the so-called “enhanced suppression on protection facility” (ESOP) yet, so QEMU has to allocate the memory for the guest with a “hack”, and this hack uses mmap(… PROT_EXEC …) for the allocation.

Now this mmap() call is not allowed by the default SELinux rules (at least not on RHEL-based systems), so QEMU fails to allocate the memory for the guest here. Turning off SELinux completely just to run a KVM guest is of course a bad idea, but fortunately there is already a SELinux boolean value called virt_use_execmem which can be used to tune the behavior here:

setsebool virt_use_execmem 1

This configuration switch was originally introduced for running TCG guests (i.e. running QEMU without KVM), but in this case it also fixes the problem with KVM guests. However, since setting this SELinux variable to 1 also slightly decreases security (though it is still better than disabling SELinux completely), you are better off upgrading z/VM to version 6.3 (or newer) or using a real LPAR for the KVM host installation instead, if that is feasible.

April 05, 2017 12:20 PM

March 30, 2017

Cornelia Huck

Oldies but Goldies: Channel I/O KVM Forum 2012 talk

Some of the information has been superseded in the meantime, but the slides from my talk at the 2012 KVM Forum contain some information that may still be interesting. (Sadly, no video of the talk was recorded.)

by Cornelia Huck (noreply@blogger.com) at March 30, 2017 01:30 PM

March 28, 2017

Cornelia Huck

Channel I/O: Talking to devices

Having a nice set of channel devices available to your OS is all fine and good; but how do you actually talk to them? This post attempts to give a high-level overview, while also explaining some more acronyms.

Let's look again at the example configuration from the last post:

Device   Subchan.  DevType CU Type Use  PIM PAM POM  CHPIDs
----------------------------------------------------------------------
0.0.0000 0.0.0000  0000/00 3832/01 yes  80  80  ff   00000000 00000000
0.0.0042 0.0.0001  0000/00 3832/02 yes  80  80  ff   00000000 00000000

The second device is the virtio-blk device 0.0.0042 on subchannel 0.0.0001, having channel path 0. Being virtio, this is a very simplified variation of what you'd see on real hardware (although this can also be a benefit in some ways). Think of it as the following:

Device 0.0.0042 is accessed via channel path 0, and subchannel 0.0.0001 is used as a means to address it.

The access (channel path) is configured in the hypervisor (or in the hardware definitions). The subchannel is what the OS will use as a target for I/O instructions and how it can associate I/O interrupts with the device they are for.

I/O instructions? There's a whole zoo of them, but they share some characteristics:
  • They take a subchannel identifier as parameter.
  • They are privileged: I.e., on a Linux system, they can only be issued by the kernel and not from user space.
Check the Principles of Operation (SA22-7832), chapter 14, for the whole story. Here, I'll concentrate just on two instructions:
  • START SUBCHANNEL (SSCH) - start a channel program
  • TEST SUBCHANNEL (TSCH) - retrieve subchannel status
So, what's a channel program? It's basically a set of instructions sent to the control unit and executed. You can even branch in them. The basic unit of those instructions is the channel command word (CCW). The ccw is such a basic characteristic of channel I/O that it is used throughout Linux and QEMU for channel devices: For example, Linux has ccw_devices on the ccw bus, and QEMU has CcwDevices, most notably VirtIOCcwDevices (like the devices in the example).

ccws consist of three parts:
  • The command. This falls into the categories of read (read data from the device), write (write data to the device) or control (for example, rewinding a tape). An 8-bit value.
  • The flags, which control error handling or program flow. I'll ignore them for simplicity here.
  • The data address. This is an address in memory where data is written to (read) or read from (write).
Let's take an example. The SENSE ID command is a basic operation supported by both virtual devices like the virtio devices and real hardware devices. It is used to obtain configuration information from the device, like the CU type information in the output above. It is usually the first ccw an operating system issues to the device.

The operating system will assemble a ccw: The command code will be 0xe4 for SENSE ID, and the data address will point to a location where the OS wants to have the obtained information. The OS will also assemble a so-called ORB (operation request block), which, amongst other things, points to the assembled ccw (or to the first one in a chain). This ORB and the subchannel id are the two parameters for the SSCH instruction. If all goes well, the OS will receive a condition code 0 and knows that it will be signalled asynchronously once the channel program has been processed (successfully or with errors)[1].
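
As a rough illustration of what "assembling a ccw" involves, the sketch below packs a SENSE ID ccw in the 8-byte format-1 layout that Linux describes as struct ccw1: command code, flags, a 16-bit byte count (which the simplified description above leaves out), and a data address. The buffer address and length are hypothetical values picked for the example, and the flags are left at zero as in the text.

```python
import struct

CCW_CMD_SENSE_ID = 0xE4  # command code named in the text

def pack_ccw1(cmd, flags, count, data_addr):
    """Pack a format-1 ccw: cmd (8 bit), flags (8 bit), count (16 bit), address (32 bit)."""
    if not 0 <= data_addr < (1 << 31):
        raise ValueError("format-1 ccw data addresses are limited to 31 bits")
    # s390 is big-endian, hence the ">" byte order.
    return struct.pack(">BBHI", cmd, flags, count, data_addr)

# Hypothetical buffer: 64 bytes at address 0x10000, no flags set.
ccw = pack_ccw1(CCW_CMD_SENSE_ID, 0x00, 64, 0x10000)
print(ccw.hex())
```

The ORB would then carry the address of this ccw; it is omitted here because its layout is not covered in this post.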

Processing of the actual channel command is done asynchronously by real hardware (QEMU does it synchronously for simplicity reasons). The result is that the wanted data is put into the memory area referred to by the ccw[2]. Subsequently, the subchannel is made status pending: Information is ready for retrieval by the OS.

Usually, the OS wants to have a notification that the subchannel became status pending; this is done via an I/O interrupt. I/O interrupts on s390 carry extra status which is written to the low memory area of the cpu receiving the interrupt; amongst other things, this status contains the subchannel id.

Next, the OS needs to actually retrieve the status information: This is done via the TSCH instruction, which in turn makes the subchannel no longer status pending and ready for the next I/O request via SSCH. The status contains enough information for the OS to determine whether the request was successful (and the sense id information has been stored), or whether there was an error.[3]
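
The SSCH / status pending / TSCH cycle described above can be pictured with a toy state machine. This is purely a teaching model under simplifying assumptions: the class and method names are made up, the channel program is just a Python callable, it runs synchronously (as the text says QEMU does), and only the condition codes 0 and 1 are modelled.

```python
class ToySubchannel:
    """Toy model of the SSCH/TSCH handshake; not how QEMU or hardware implements it."""

    def __init__(self):
        self.status_pending = False
        self.status = None

    def ssch(self, channel_program):
        """START SUBCHANNEL: cc 0 if accepted, cc 1 if status is still pending."""
        if self.status_pending:
            return 1
        # Run the program right away, then make the subchannel status pending.
        self.status = channel_program()
        self.status_pending = True
        return 0

    def tsch(self):
        """TEST SUBCHANNEL: (0, status) if status was pending, else (1, None)."""
        if not self.status_pending:
            return 1, None
        status, self.status = self.status, None
        self.status_pending = False
        return 0, status

sch = ToySubchannel()
print(sch.ssch(lambda: "sense id data stored"))  # request accepted
print(sch.ssch(lambda: "second request"))        # refused until TSCH is issued
print(sch.tsch())                                # status retrieved and cleared
print(sch.ssch(lambda: "next request"))          # subchannel is usable again
```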

Of course, this is all only scratching the surface of channel programs; interested readers can peek at the Linux kernel and QEMU to get a feel for both parts or at the Principles of Operation for the whole story.[4]

1. In the Linux source code, you'll find this under drivers/s390/cio/
2. In the QEMU source code, you'll find channel command interpretation under hw/s390x/css.c
3. Again, you'll find this under drivers/s390/cio/ in the Linux source code
4. Command chaining, channel path management, I/O instructions to terminate a channel program are just some of the interesting topics.

by Cornelia Huck (noreply@blogger.com) at March 28, 2017 02:32 PM

March 24, 2017

Cole Robinson

Easy qemu commandline passthrough with virt-xml

Libvirt has supported qemu commandline option passthrough for qemu/kvm VMs for quite a while. The format for it is a bit of a pain though since it requires setting a magic xmlns value at the top of the domain XML. Basically doing it by hand kinda sucks.

In the recently released virt-manager 1.4.1, we added a virt-install/virt-xml option --qemu-commandline that tweaks option passthrough for new or existing VMs. So for example, if you wanted to add the qemu option string '-device FOO' to an existing VM named f25, you can do:

  ./virt-xml f25 --edit --confirm --qemu-commandline="-device FOO"

The output will look like:

--- Original XML
+++ Altered XML
@@ -1,4 +1,4 @@
-<domain type="kvm">
+<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
<name>f25</name>
<uuid>9b6f1795-c88b-452a-a54c-f8579ddc18dd</uuid>
<memory unit="KiB">4194304</memory>
@@ -104,4 +104,8 @@
<address type="pci" domain="0x0000" bus="0x00" slot="0x0a" function="0x0"/>
</rng>
</devices>
+ <qemu:commandline>
+ <qemu:arg value="-device"/>
+ <qemu:arg value="FOO"/>
+ </qemu:commandline>
</domain>

Define 'f25' with the changed XML? (y/n):
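
For comparison, this is roughly what the XML change amounts to when done by hand. The sketch uses only Python's standard xml.etree.ElementTree; the helper name add_qemu_args and the trimmed-down domain XML are assumptions for illustration, and this is not how virt-xml itself is implemented.

```python
import xml.etree.ElementTree as ET

# The "magic" namespace that libvirt expects for qemu passthrough.
QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"
ET.register_namespace("qemu", QEMU_NS)

def add_qemu_args(domain_xml, args):
    """Append a <qemu:commandline> block with one <qemu:arg> per argument."""
    root = ET.fromstring(domain_xml)
    cmdline = ET.SubElement(root, f"{{{QEMU_NS}}}commandline")
    for arg in args:
        ET.SubElement(cmdline, f"{{{QEMU_NS}}}arg", {"value": arg})
    return ET.tostring(root, encoding="unicode")

# Hypothetical, heavily trimmed domain XML.
domain = '<domain type="kvm"><name>f25</name></domain>'
print(add_qemu_args(domain, "-device FOO".split()))
```

The sketch always appends a new block; it does not merge into an existing <qemu:commandline> element.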

by Cole Robinson (noreply@blogger.com) at March 24, 2017 10:30 PM

March 21, 2017

Cornelia Huck

Channel I/O: What's in a channel subsystem?

When you start trying to get familiar with channel I/O and its concepts, one thing you notice is usually a host of very similar-sounding acronyms that are easily confused. The easiest way to get a hold of this is probably to look at a small machine started by QEMU and to examine what a Linux guest sees.

So, let's start with the following command line: 

s390x-softmmu/qemu-system-s390x -machine s390-ccw-virtio,accel=kvm -m 1024 -nographic -drive file=/dev/dasdb,if=none,id=drive-virtio-disk0,format=raw,serial=ccwdasd1,cache=none -device virtio-blk-ccw,devno=fe.0.0042,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,scsi=off

(Note that this assumes you're running an s390x system and have a bootable system on /dev/dasdb.)

This will start up a machine with two channel devices: one virtio-blk (as specified on the command line) and one virtio-net (always autogenerated unless explicitly turned off).

Let's log into the guest via the console[1] and examine what channel devices Linux sees:

[root@localhost ~]# lscss
Device   Subchan.  DevType CU Type Use  PIM PAM POM  CHPIDs
----------------------------------------------------------------------
0.0.0000 0.0.0000  0000/00 3832/01 yes  80  80  ff   00000000 00000000
0.0.0042 0.0.0001  0000/00 3832/02 yes  80  80  ff   00000000 00000000

Let's go through this information column-by-column.

Device is the identifier for the device, which is unique guest-wide. The xx.y.zzzz format (often called bus id) is specific to Linux (and has leaked over into QEMU) and is made up of the following elements:
  • The channel subsystem id (cssid) xx (here: 0, as on all current Linux systems)
  • The subchannel set id (ssid) y (here: 0, can be any value from 0-3 on current Linux systems)
  • The device number (devno) zzzz (here: 0000 and 0042, respectively; can be any value from 0-0xffff)
The two values in this example have different origins:
  • 0.0.0000 (the virtio-net device) has been autogenerated.
  • 0.0.0042 (the virtio-blk device) has been specified on the command line.
But wait: The value on the command line was fe.0.0042, wasn't it? I will explain this in a later post; just remember for now that you specify the cssid fe for a virtio device on the QEMU command line and it will show up as cssid 0 in the Linux guest.
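
A small Python sketch of how the xx.y.zzzz format splits into its three elements; the function name parse_bus_id is made up, and the ssid range of 0-3 is the one given above.

```python
import re

# cssid: one or two hex digits, ssid: 0-3, devno: four hex digits.
BUS_ID_RE = re.compile(r"^([0-9a-f]{1,2})\.([0-3])\.([0-9a-f]{4})$", re.IGNORECASE)

def parse_bus_id(bus_id):
    """Split a bus id such as 'fe.0.0042' into (cssid, ssid, devno) integers."""
    match = BUS_ID_RE.match(bus_id)
    if match is None:
        raise ValueError("not a valid bus id: %r" % bus_id)
    cssid, ssid, devno = (int(part, 16) for part in match.groups())
    return cssid, ssid, devno

print(parse_bus_id("0.0.0042"))   # as seen inside the guest
print(parse_bus_id("fe.0.0042"))  # as given on the QEMU command line
```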

The devno basically belongs to the device; cssid and ssid indicate the addressing within the channel subsystem, which is why we encounter them again in the next id, Subchan.

This is the identifier for the subchannel, which is basically the means to actually address the device. It again uses the xx.y.zzzz format and is made up of the following elements:
  • The cssid xx (same as for the device)
  • The ssid y (again, the same as for the device)
  • The subchannel number zzzz (here: 0000 and 0001, respectively; generally not the same as the devno, although it can be any value from 0-0xffff as well)
These values are always autogenerated by QEMU (i.e., you can't specify them on the command line). They basically depend on the order in which devices are initialized (either from the initial command line, autogenerated or via device hotplug) - the only restriction is that the cssid and ssid are set by the device's bus id, if specified. The reasoning behind this is that a subchannel is only a means to access the device and as such needs only to be unique, but not pre-defined.

In contrast to the bus id for a device (which is a Linux and QEMU construct), the bus id for a subchannel actually has an equivalent in the architecture: the subchannel-identification word (often referred to as schid in Linux and QEMU), which is basically a 32 bit value composed of the cssid, the ssid, and the subchannel number. This is used to address a device via a certain subchannel by the various channel I/O related instructions.
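
To show how the three parts fit into one 32 bit value, here is a sketch in Python. The bit positions are an assumption taken from my reading of the Linux kernel's struct subchannel_id (cssid in the top 8 bits, a 2 bit ssid, a bit that is always one, and the subchannel number in the low 16 bits); the extra bit Linux sets alongside a non-zero cssid is left out for brevity, so check the Principles of Operation before relying on this layout.

```python
def make_schid(cssid, ssid, sch_no):
    """Compose a 32 bit subchannel-identification word (assumed layout, see above)."""
    if not (0 <= cssid <= 0xff and 0 <= ssid <= 3 and 0 <= sch_no <= 0xffff):
        raise ValueError("cssid, ssid or subchannel number out of range")
    return (cssid << 24) | (ssid << 17) | (1 << 16) | sch_no

def split_schid(schid):
    """Return (cssid, ssid, sch_no) from a word built by make_schid."""
    return (schid >> 24) & 0xff, (schid >> 17) & 0x3, schid & 0xffff

# Subchannel 0.0.0001 from the example guest.
print(hex(make_schid(0, 0, 0x0001)))
```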

The next two columns, DevType and CU Type, are part of the self description element of channel devices: The concept is that the operating system asks the device nicely to identify itself and the device responds with information about its type and what it can do. The device and the control unit are, in principle, two separate, cascaded entities; for virtio purposes, you can think of the device as the virtio backend (like the virtio-blk device) and of the control unit as the virtio proxy device (like the pci device used to access virtio devices on other platforms). That's also the reason why the device type is always zero for virtio devices. The control unit type is of the form aaaa/bb and consists of the following elements:
  • The type aaaa (a value from 0-0xffff; 0x3832 denotes a virtio-ccw control unit)
  • The model bb (a value from 0-0xff; for virtio devices, this is the device id as specified by the virtio standard)
In our example, we can therefore see that device 0.0.0000 is a virtio-net device (CU model 1) and device 0.0.0042 is a virtio-blk device (CU model 2).
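The decoding done by eye above can be sketched in a few lines of Python. The function name `describe_cu_type` is hypothetical, and the name table is only an excerpt: ids 1 and 2 come from the example above, the remaining entries are taken from the device id list in the virtio specification.

```python
VIRTIO_CCW_CU_TYPE = 0x3832

# Excerpt of virtio device ids; entries other than 1 and 2 are not
# mentioned in the text above and come from the virtio specification.
VIRTIO_DEVICE_NAMES = {
    1: "virtio-net",
    2: "virtio-blk",
    3: "virtio-console",
    4: "virtio-rng",
    5: "virtio-balloon",
    8: "virtio-scsi",
    9: "virtio-9p",
}


def describe_cu_type(cu_type: str) -> str:
    """Interpret a CU Type column value of the form aaaa/bb (both hex)."""
    type_text, model_text = cu_type.split("/")
    cu, model = int(type_text, 16), int(model_text, 16)
    if cu != VIRTIO_CCW_CU_TYPE:
        return "not a virtio-ccw control unit"
    # Fall back to the raw id for devices missing from the excerpt.
    return VIRTIO_DEVICE_NAMES.get(model, f"virtio device id {model}")


if __name__ == "__main__":
    for value in ("3832/01", "3832/02"):
        print(value, "->", describe_cu_type(value))
```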

The next column, Use, points to a big difference from other I/O architectures: In order to be able to use a subchannel to talk to a device, the operating system first needs to enable it. For virtio devices, this is done by the Linux driver by default (see the 'yes' for all devices); for other device types, this needs to be triggered by Linux user space (which implies that you can't simply go ahead and use a device, you always need to do some kind of setup).

The last four columns, PIM, PAM, POM and CHPIDs, deal with channel paths, a topic that is completely irrelevant for QEMU guests, but very interesting on real hardware. Just a quick overview:
  • PIM (path installed mask), PAM (path available mask) and POM (path operational mask) are all 8-bit values in which each bit corresponds to one of eight channel paths. If the bit for a channel path is set in all three masks, that channel path can be used for I/O.
  • CHPIDs are channel path identifiers: Each channel path has an id from 0-0xff, which is unique in combination with the relevant cssid. For virtio devices, there's only one valid channel path with the id 0 (see footnote 2).
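The rule for the three masks boils down to a bitwise AND. The following Python sketch shows it under two assumptions that are not spelled out above: the most significant bit of each mask belongs to the first of the eight CHPID slots (this matches how Linux walks the masks), and the CHPIDs column is printed as sixteen hex digits, two per slot. Function names are made up for the example.

```python
def parse_chpids(column: str) -> list:
    """Turn a CHPIDs column such as '00000000 00000000' into eight ids."""
    chpids = list(bytes.fromhex(column))  # fromhex skips the space
    if len(chpids) != 8:
        raise ValueError("expected eight channel path ids")
    return chpids


def usable_chpids(pim: int, pam: int, pom: int, chpids: list) -> list:
    """Return the ids of all paths whose bit is set in PIM, PAM and POM.

    Assumption: bit 0x80 corresponds to the first CHPID slot, 0x40 to
    the second, and so on.
    """
    usable = pim & pam & pom & 0xFF
    return [chpids[i] for i in range(8) if usable & (0x80 >> i)]


if __name__ == "__main__":
    # A virtio device: one path, installed, available and operational.
    print(usable_chpids(0x80, 0x80, 0x80, parse_chpids("00000000 00000000")))
```

For a virtio device this yields a single usable path with id 0; the seven remaining slots also read 0 but are masked out, which is what makes the raw column hard to interpret.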
Channel paths on real hardware correspond (put simply) to the connections between the actual mainframe and e.g. the storage server containing the disk devices. The setup is usually redundant, and load balancing and failover are possible between the paths. The channel paths are not per-device; usually, a set of devices shares a set of channel paths. For a virtual setup like a QEMU guest with only virtio devices, there is no real equivalent for this. Therefore, there's only a virtual channel path which does nothing but satisfy the architecture. This means that the output of the following command is not very interesting for our example guest:
[root@localhost ~]# lschp
CHPID  Vary  Cfg.  Type  Cmg  Shared  PCHID
============================================
0.00   1     -     32    -    -       -   

  • CHPID is the channel-path identifier of the form xx.nn, where xx is the cssid and nn the chpid. This is always 0.00 on virtio-only guests.
  • Vary means that the channel path is online to the guest. You don't want to change this for the only path.
  • Type is the channel-path type. 0x32 is a reserved type for the virtio virtual channel path.
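For completeness, a small Python sketch that picks the three fields discussed above out of one data row of the lschp output. It assumes whitespace-separated columns in the order shown in the listing and hex notation for the CHPID and Type columns; `parse_lschp_row` is a hypothetical helper, not part of s390-tools.

```python
def parse_lschp_row(line: str) -> dict:
    """Extract cssid, chpid, online state and type from one lschp row."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"unexpected lschp row: {line!r}")
    cssid_text, chpid_text = fields[0].split(".")
    return {
        "cssid": int(cssid_text, 16),
        "chpid": int(chpid_text, 16),
        "online": fields[1] == "1",   # the Vary column
        "type": int(fields[3], 16),   # 0x32: virtio virtual channel path
    }


if __name__ == "__main__":
    print(parse_lschp_row("0.00   1     -     32    -    -       -   "))
```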
All of this does not explain how Linux actually talks to those devices (and how QEMU emulates this). I'll get to that in a future post.
1. A VT220 compatible console via SCLP is automatically generated.
2. Which, in hindsight, turned out to not be the cleverest choice - see the confusing output of lscss.

by Cornelia Huck (noreply@blogger.com) at March 21, 2017 05:27 PM


Last updated: October 20, 2017 10:01 AM