Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

February 24, 2017

Ladi Prosek

Running Hyper-V in a QEMU/KVM Guest

This article provides a how-to on setting up nested virtualization, in particular running Microsoft Hyper-V as a guest of QEMU/KVM. The usual terminology will be used in the text: L0 is the bare-metal host running Linux with KVM and QEMU. L1 is L0’s guest, running Microsoft Windows Server 2016 with the Hyper-V role enabled. And L2 is L1’s guest, a virtual machine running Linux, Windows, or anything else. Only Intel hardware is considered here. It is possible that the same can be achieved with AMD’s hardware virtualization support, but it has not been tested yet.

A quick note on performance. Since the Intel VMX technology does not directly support nested virtualization in hardware, what L1 perceives as hardware-accelerated virtualization is in fact software emulation of VMX by L0. Thus, workloads will inevitably run slower in L2 compared to L1.

Kernel / KVM

A fairly recent kernel is required for Hyper-V on QEMU/KVM to function properly. The first commit known to work is 1dc35da, available in Linux 4.10 and newer.

Nested Intel virtualization must be enabled. If the following command does not return “Y”, kvm-intel.nested=1 must be passed to the kernel as a parameter.

$ cat /sys/module/kvm_intel/parameters/nested
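To avoid passing the kernel parameter on every boot, the option can also be set persistently via modprobe configuration. This is a sketch; the file name kvm-nested.conf is arbitrary, and all VMs must be shut down before the module can be reloaded:

```shell
# Persistently enable nested VMX for the kvm_intel module.
echo "options kvm_intel nested=1" > /etc/modprobe.d/kvm-nested.conf

# Reload the module so the option takes effect (no VMs may be running).
modprobe -r kvm_intel
modprobe kvm_intel

# Verify:
cat /sys/module/kvm_intel/parameters/nested
```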

QEMU

QEMU 2.7 should be enough to make nested virtualization work. As always, it is advisable to use the latest stable version available. SeaBIOS version 1.10 or later is required.

The QEMU command line must include the +vmx cpu feature, for example:

-cpu SandyBridge,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time,+vmx

If QEMU warns about the vmx feature not being available on the host, nested virt has likely not been enabled in KVM (see the previous paragraph).

Hyper-V

Once the Windows L1 guest is installed, add the Hyper-V role as usual. Only Windows Server 2016 is known to support nested virtualization at the moment.

If Windows complains about missing HW virtualization support, re-check the QEMU and SeaBIOS versions. If the Hyper-V role is already installed and nested virt is misconfigured or not supported, the error shown by Windows tends to mention “Hyper-V components not running”, as in the following screenshot.

If everything goes well, both Gen 1 and Gen 2 Hyper-V virtual machines can be created and started. Here’s a screenshot of Windows XP 64-bit running as a guest in Windows Server 2016, which itself is a guest in QEMU/KVM.


by ladipro at February 24, 2017 01:57 PM

February 22, 2017

Gerd Hoffmann

vconsole 0.7 released

vconsole is a virtual machine (serial) console manager; look here for details.

No big changes from 0.6.

Fetch the tarball here.

by Gerd Hoffmann at February 22, 2017 08:15 AM

February 21, 2017

Gerd Hoffmann

Fedora 25 images for qemu and raspberry pi 3 uploaded

I’ve uploaded three new images to https://www.kraxel.org/repos/rpi2/images/.

The fedora-25-rpi3 image is for the raspberry pi 3.
The fedora-25-efi images are for qemu (virt machine type with edk2 firmware).

The images don’t have a root password set. You must use libguestfs-tools to set the root password …

virt-customize -a <image> --root-password "password:<your-password-here>"

… otherwise you can’t login after boot.

The rpi3 image is partitioned similarly to the official (armv7) Fedora 25 images: the firmware and uboot live on a separate vfat partition (mounted at /boot/fw), and /boot is now an ext2 filesystem holding only the kernels. Well, for compatibility with the f24 images (all images use the same kernel rpms) firmware files are present in /boot too, but they are not used. So if you want to tweak something in config.txt, go to /boot/fw, not /boot.

The rpi3 images also have swap commented out in /etc/fstab. The reason is that the swap partition must be reinitialized: swap partitions created while running on a 64k pages kernel (CONFIG_ARM64_64K_PAGES=y) are not compatible with 4k pages (CONFIG_ARM64_4K_PAGES=y). Fix this by running “swapon --fixpgsz <device>” once; then you can uncomment the swap line in /etc/fstab.
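Concretely, the repair looks like this (the device name below is an example only; check /etc/fstab or lsblk for the real swap partition):

```shell
# Reinitialize the swap signature for the running kernel's page size.
# /dev/mmcblk0p3 is a hypothetical device name.
swapon --fixpgsz /dev/mmcblk0p3

# Then uncomment the swap line in /etc/fstab so swap is enabled on boot.
```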

by Gerd Hoffmann at February 21, 2017 08:32 AM

February 16, 2017

Daniel Berrange

Setting up a nested KVM guest for developing & testing PCI device assignment with NUMA

Over the past few years the OpenStack Nova project has gained support for managing VM usage of NUMA, huge pages and PCI device assignment. One of the more challenging aspects of this is the availability of hardware to develop and test against. In an ideal world it would be possible to emulate everything we need using KVM, enabling developers and test infrastructure to exercise the code without needing access to bare metal hardware supporting these features.

KVM has long had support for emulating NUMA topology in guests, and guest OSes can use huge pages inside the guest. What was missing were pieces around PCI device assignment, namely IOMMU support and the ability to associate NUMA nodes with PCI devices. Coincidentally a QEMU community member was already working on providing emulation of the Intel IOMMU. I made a request to the Red Hat KVM team to fill in the other missing gap related to NUMA / PCI device association. Doing so required writing code to emulate a PCI/PCI-E Expander Bridge (PXB) device, which provides a lightweight host bridge that can be associated with a NUMA node. Individual PCI devices are then attached to this PXB instead of the main PCI host bridge, thus gaining affinity with a NUMA node.

With this, it is now possible to configure a KVM guest such that it can be used as a virtual host to test NUMA, huge page and PCI device assignment integration. The only real outstanding gap is support for emulating some kind of SRIOV network device, but even without this it is still possible to test most of the Nova PCI device assignment logic – we’re merely restricted to using physical functions, not virtual functions. This blog post describes how to configure such a virtual host.

First of all, this requires a very new libvirt & QEMU to work; specifically you’ll want libvirt >= 2.3.0 and QEMU >= 2.7.0. We could technically support earlier QEMU versions too, but that is pending a libvirt patch to deal with some command line syntax differences in older QEMU. No currently released Fedora has new enough packages available, so even on Fedora 25 you must enable the “Virtualization Preview” repository on the physical host to try this out – F25 has a new enough QEMU, so you just need a libvirt update.

# curl --output /etc/yum.repos.d/fedora-virt-preview.repo https://fedorapeople.org/groups/virt/virt-preview/fedora-virt-preview.repo
# dnf upgrade

For the sake of illustration I’m using Fedora 25 as the OS inside the virtual guest, but any other Linux OS will do just fine. The initial task is to install a guest with 8 GB of RAM & 8 CPUs using virt-install

# cd /var/lib/libvirt/images
# wget -O f25x86_64-boot.iso https://download.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/images/boot.iso
# virt-install --name f25x86_64  \
    --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
    --cdrom f25x86_64-boot.iso --os-type fedora23 \
    --ram 8000 --vcpus 8 \
    ...

The guest needs to use host CPU passthrough to ensure it gets to see VMX as well as other modern instructions, and it will have 3 virtual NUMA nodes. The first guest NUMA node will have 4 CPUs and 4 GB of RAM, while the second and third NUMA nodes will each have 2 CPUs and 2 GB of RAM. We are just going to let the guest float freely across host NUMA nodes since we don’t care about performance for dev/test, but in production you would certainly pin each guest NUMA node to a distinct host NUMA node.

    ...
    --cpu host,cell0.id=0,cell0.cpus=0-3,cell0.memory=4096000,\
               cell1.id=1,cell1.cpus=4-5,cell1.memory=2048000,\
               cell2.id=2,cell2.cpus=6-7,cell2.memory=2048000 \
    ...

QEMU emulates various different chipsets, and historically for x86 the default has been to emulate the ancient PIIX4 (it is 20+ years old, dating from circa 1995). Unfortunately this is too ancient to be usable with the Intel IOMMU emulation, so it is necessary to tell QEMU to emulate the marginally less ancient Q35 chipset (it is only 9 years old, dating from 2007).

    ...
    --machine q35

The complete virt-install command line thus looks like

# virt-install --name f25x86_64  \
    --file /var/lib/libvirt/images/f25x86_64.img --file-size 20 \
    --cdrom f25x86_64-boot.iso --os-type fedora23 \
    --ram 8000 --vcpus 8 \
    --cpu host,cell0.id=0,cell0.cpus=0-3,cell0.memory=4096000,\
               cell1.id=1,cell1.cpus=4-5,cell1.memory=2048000,\
               cell2.id=2,cell2.cpus=6-7,cell2.memory=2048000 \
    --machine q35

Once the installation has completed, shut down this guest, since it will be necessary to make a number of changes to the guest XML configuration – using “virsh edit” – to enable features that virt-install does not know about. With the use of Q35, the guest XML should initially show three PCI controllers present: a “pcie-root”, a “dmi-to-pci-bridge” and a “pci-bridge”

<controller type='pci' index='0' model='pcie-root'/>
<controller type='pci' index='1' model='dmi-to-pci-bridge'>
  <model name='i82801b11-bridge'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
</controller>
<controller type='pci' index='2' model='pci-bridge'>
  <model name='pci-bridge'/>
  <target chassisNr='2'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</controller>

PCI endpoint devices are not themselves associated with NUMA nodes, rather the bus they are connected to has affinity. The default pcie-root is not associated with any NUMA node, but extra PCI-E Expander Bridge controllers can be added and associated with a NUMA node. So while in edit mode, add the following to the XML config

<controller type='pci' index='3' model='pcie-expander-bus'>
  <target busNr='180'>
    <node>0</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='4' model='pcie-expander-bus'>
  <target busNr='200'>
    <node>1</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</controller>
<controller type='pci' index='5' model='pcie-expander-bus'>
  <target busNr='220'>
    <node>2</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</controller>

It is not possible to plug PCI endpoint devices directly into the PXB, so the next step is to add PCI-E root ports into each PXB – we’ll need one port per device to be added, so 9 ports in total. This is where the requirement for libvirt >= 2.3.0 comes in – earlier versions mistakenly prevented you from adding more than one root port to a PXB

<controller type='pci' index='6' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='6' port='0x0'/>
  <alias name='pci.6'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='7' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='7' port='0x8'/>
  <alias name='pci.7'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='8' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='8' port='0x10'/>
  <alias name='pci.8'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='9' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='9' port='0x0'/>
  <alias name='pci.9'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='10' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='10' port='0x8'/>
  <alias name='pci.10'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='11' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='11' port='0x10'/>
  <alias name='pci.11'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='12' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='12' port='0x0'/>
  <alias name='pci.12'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='13' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='13' port='0x8'/>
  <alias name='pci.13'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='14' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='14' port='0x10'/>
  <alias name='pci.14'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x02' function='0x0'/>
</controller>

Notice that the value of the ‘bus’ attribute on each <address> element matches the value of the ‘index’ attribute on the <controller> element of the parent device in the topology. The PCI controller topology now looks like this

pcie-root (index == 0)
  |
  +- dmi-to-pci-bridge (index == 1)
  |    |
  |    +- pci-bridge (index == 2)
  |
  +- pcie-expander-bus (index == 3, numa node == 0)
  |    |
  |    +- pcie-root-port (index == 6)
  |    +- pcie-root-port (index == 7)
  |    +- pcie-root-port (index == 8)
  |
  +- pcie-expander-bus (index == 4, numa node == 1)
  |    |
  |    +- pcie-root-port (index == 9)
  |    +- pcie-root-port (index == 10)
  |    +- pcie-root-port (index == 11)
  |
  +- pcie-expander-bus (index == 5, numa node == 2)
       |
       +- pcie-root-port (index == 12)
       +- pcie-root-port (index == 13)
       +- pcie-root-port (index == 14)

All the existing devices are attached to the “pci-bridge” (the controller with index == 2). The devices we intend to use for PCI device assignment inside the virtual host will be attached to the new “pcie-root-port” controllers. We will provide 3 e1000e NICs per NUMA node, so that’s 9 devices in total to add

<interface type='user'>
  <mac address='52:54:00:7e:6e:c6'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:c7'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:c8'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:d6'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:d7'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:d8'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:e6'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0c' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:e7'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
</interface>
<interface type='user'>
  <mac address='52:54:00:7e:6e:e8'/>
  <model type='e1000e'/>
  <address type='pci' domain='0x0000' bus='0x0e' slot='0x00' function='0x0'/>
</interface>

Note that we’re using “user” networking, aka SLIRP. Normally one would never want to use SLIRP, but we don’t care about actually sending traffic over these NICs, and using SLIRP avoids polluting our real host with countless TAP devices.

The final configuration change is to simply add the Intel IOMMU device

<iommu model='intel'/>

It is a capability integrated into the chipset, so it does not need any <address> element of its own. At this point, save the config and start the guest once more. Use the “virsh domifaddr” command to discover the IP address of the guest’s primary NIC, and ssh into it.

# virsh domifaddr f25x86_64
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet0      52:54:00:10:26:7e    ipv4         192.168.122.3/24

# ssh root@192.168.122.3

We can now do some sanity checks that everything visible in the guest matches what was enabled in the libvirt XML config on the host. For example, confirm that the NUMA topology shows 3 nodes

# dnf install numactl
# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3
node 0 size: 3856 MB
node 0 free: 3730 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1813 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1832 MB
node distances:
node   0   1   2 
  0:  10  20  20 
  1:  20  10  20 
  2:  20  20  10 

Confirm that the PCI topology shows the three PCI-E Expander Bridge devices, each with three NICs attached

# lspci -t -v
-+-[0000:dc]-+-00.0-[dd]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           +-01.0-[de]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           \-02.0-[df]----00.0  Intel Corporation 82574L Gigabit Network Connection
 +-[0000:c8]-+-00.0-[c9]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           +-01.0-[ca]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           \-02.0-[cb]----00.0  Intel Corporation 82574L Gigabit Network Connection
 +-[0000:b4]-+-00.0-[b5]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           +-01.0-[b6]----00.0  Intel Corporation 82574L Gigabit Network Connection
 |           \-02.0-[b7]----00.0  Intel Corporation 82574L Gigabit Network Connection
 \-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
             +-01.0  Red Hat, Inc. QXL paravirtual graphic card
             +-02.0  Red Hat, Inc. Device 000b
             +-03.0  Red Hat, Inc. Device 000b
             +-04.0  Red Hat, Inc. Device 000b
             +-1d.0  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
             +-1d.1  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
             +-1d.2  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
             +-1d.7  Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
             +-1e.0-[01-02]----01.0-[02]--+-01.0  Red Hat, Inc Virtio network device
             |                            +-02.0  Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller
             |                            +-03.0  Red Hat, Inc Virtio console
             |                            +-04.0  Red Hat, Inc Virtio block device
             |                            \-05.0  Red Hat, Inc Virtio memory balloon
             +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
             +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
             \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller

The IOMMU support will not be enabled yet as the kernel defaults to leaving it off. To enable it, we must update the kernel command line parameters with grub.

# vi /etc/default/grub
....add "intel_iommu=on"...
# grub2-mkconfig > /etc/grub2.cfg
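The edit itself can be scripted with sed if you prefer. The following sketch demonstrates the substitution on a sample file whose contents are an assumption; apply the same expression to /etc/default/grub after backing it up:

```shell
# Create a sample /etc/default/grub-style file (contents assumed).
echo 'GRUB_CMDLINE_LINUX="rhgb quiet"' > /tmp/grub.sample

# Insert intel_iommu=on just before the closing quote of the
# GRUB_CMDLINE_LINUX line.
sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"/\1 intel_iommu=on"/' /tmp/grub.sample

cat /tmp/grub.sample
# GRUB_CMDLINE_LINUX="rhgb quiet intel_iommu=on"
```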

While the intel-iommu device in QEMU can do interrupt remapping, there is no way to enable that feature via libvirt at this time. So we need to set a hack for vfio

echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > \
  /etc/modprobe.d/vfio.conf

This is also a good time to install libvirt and KVM inside the guest

# dnf groupinstall "Virtualization"
# dnf install libvirt-client
# rm -f /etc/libvirt/qemu/networks/autostart/default.xml

Note we’re disabling the default libvirt network, since it’ll clash with the IP address range used by this guest. An alternative would be to edit the default.xml to change the IP subnet.
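If you’d rather not delete files by hand, the same effect can be achieved with virsh (net-destroy and net-autostart are standard libvirt commands):

```shell
# Stop the default network if it is currently running...
virsh net-destroy default

# ...and prevent it from starting automatically on boot.
virsh net-autostart default --disable
```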

Now reboot the guest. When it comes back up, there should be a /dev/kvm device present in the guest.

# ls -al /dev/kvm
crw-rw-rw-. 1 root kvm 10, 232 Oct  4 12:14 /dev/kvm

If this is not the case, make sure the physical host has nested virtualization enabled for the “kvm-intel” or “kvm-amd” kernel modules.

The IOMMU should have been detected and activated

# dmesg  | grep -i DMAR
[    0.000000] ACPI: DMAR 0x000000007FFE2541 000048 (v01 BOCHS  BXPCDMAR 00000001 BXPC 00000001)
[    0.000000] DMAR: IOMMU enabled
[    0.203737] DMAR: Host address width 39
[    0.203739] DMAR: DRHD base: 0x000000fed90000 flags: 0x1
[    0.203776] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 12008c22260206 ecap f02
[    2.910862] DMAR: No RMRR found
[    2.910863] DMAR: No ATSR found
[    2.914870] DMAR: dmar0: Using Queued invalidation
[    2.914924] DMAR: Setting RMRR:
[    2.914926] DMAR: Prepare 0-16MiB unity mapping for LPC
[    2.915039] DMAR: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
[    2.915140] DMAR: Intel(R) Virtualization Technology for Directed I/O

The key message confirming everything is good is the last line there – if that’s missing, something went wrong. Don’t be misled by the earlier “DMAR: IOMMU enabled” line, which merely says the kernel saw the “intel_iommu=on” command line option.

The IOMMU should also have registered the PCI devices into various groups

# dmesg  | grep -i iommu  |grep device
[    2.915212] iommu: Adding device 0000:00:00.0 to group 0
[    2.915226] iommu: Adding device 0000:00:01.0 to group 1
...snip...
[    5.588723] iommu: Adding device 0000:b5:00.0 to group 14
[    5.588737] iommu: Adding device 0000:b6:00.0 to group 15
[    5.588751] iommu: Adding device 0000:b7:00.0 to group 16

Libvirt meanwhile should have detected all the PCI controllers/devices

# virsh nodedev-list --tree
computer
  |
  +- net_lo_00_00_00_00_00_00
  +- pci_0000_00_00_0
  +- pci_0000_00_01_0
  +- pci_0000_00_02_0
  +- pci_0000_00_03_0
  +- pci_0000_00_04_0
  +- pci_0000_00_1d_0
  |   |
  |   +- usb_usb2
  |       |
  |       +- usb_2_0_1_0
  |         
  +- pci_0000_00_1d_1
  |   |
  |   +- usb_usb3
  |       |
  |       +- usb_3_0_1_0
  |         
  +- pci_0000_00_1d_2
  |   |
  |   +- usb_usb4
  |       |
  |       +- usb_4_0_1_0
  |         
  +- pci_0000_00_1d_7
  |   |
  |   +- usb_usb1
  |       |
  |       +- usb_1_0_1_0
  |       +- usb_1_1
  |           |
  |           +- usb_1_1_1_0
  |             
  +- pci_0000_00_1e_0
  |   |
  |   +- pci_0000_01_01_0
  |       |
  |       +- pci_0000_02_01_0
  |       |   |
  |       |   +- net_enp2s1_52_54_00_10_26_7e
  |       |     
  |       +- pci_0000_02_02_0
  |       +- pci_0000_02_03_0
  |       +- pci_0000_02_04_0
  |       +- pci_0000_02_05_0
  |         
  +- pci_0000_00_1f_0
  +- pci_0000_00_1f_2
  |   |
  |   +- scsi_host0
  |   +- scsi_host1
  |   +- scsi_host2
  |   +- scsi_host3
  |   +- scsi_host4
  |   +- scsi_host5
  |     
  +- pci_0000_00_1f_3
  +- pci_0000_b4_00_0
  |   |
  |   +- pci_0000_b5_00_0
  |       |
  |       +- net_enp181s0_52_54_00_7e_6e_c6
  |         
  +- pci_0000_b4_01_0
  |   |
  |   +- pci_0000_b6_00_0
  |       |
  |       +- net_enp182s0_52_54_00_7e_6e_c7
  |         
  +- pci_0000_b4_02_0
  |   |
  |   +- pci_0000_b7_00_0
  |       |
  |       +- net_enp183s0_52_54_00_7e_6e_c8
  |         
  +- pci_0000_c8_00_0
  |   |
  |   +- pci_0000_c9_00_0
  |       |
  |       +- net_enp201s0_52_54_00_7e_6e_d6
  |         
  +- pci_0000_c8_01_0
  |   |
  |   +- pci_0000_ca_00_0
  |       |
  |       +- net_enp202s0_52_54_00_7e_6e_d7
  |         
  +- pci_0000_c8_02_0
  |   |
  |   +- pci_0000_cb_00_0
  |       |
  |       +- net_enp203s0_52_54_00_7e_6e_d8
  |         
  +- pci_0000_dc_00_0
  |   |
  |   +- pci_0000_dd_00_0
  |       |
  |       +- net_enp221s0_52_54_00_7e_6e_e6
  |         
  +- pci_0000_dc_01_0
  |   |
  |   +- pci_0000_de_00_0
  |       |
  |       +- net_enp222s0_52_54_00_7e_6e_e7
  |         
  +- pci_0000_dc_02_0
      |
      +- pci_0000_df_00_0
          |
          +- net_enp223s0_52_54_00_7e_6e_e8

And if you look at a specific PCI device, it should report the NUMA node it is associated with and the IOMMU group it is part of

# virsh nodedev-dumpxml pci_0000_df_00_0
<device>
  <name>pci_0000_df_00_0</name>
  <path>/sys/devices/pci0000:dc/0000:dc:02.0/0000:df:00.0</path>
  <parent>pci_0000_dc_02_0</parent>
  <driver>
    <name>e1000e</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>223</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x10d3'>82574L Gigabit Network Connection</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <iommuGroup number='10'>
      <address domain='0x0000' bus='0xdc' slot='0x02' function='0x0'/>
      <address domain='0x0000' bus='0xdf' slot='0x00' function='0x0'/>
    </iommuGroup>
    <numa node='2'/>
    <pci-express>
      <link validity='cap' port='0' speed='2.5' width='1'/>
      <link validity='sta' speed='2.5' width='1'/>
    </pci-express>
  </capability>
</device>

Finally, libvirt should also be reporting the NUMA topology

# virsh capabilities
...snip...
<topology>
  <cells num='3'>
    <cell id='0'>
      <memory unit='KiB'>4014464</memory>
      <pages unit='KiB' size='4'>1003616</pages>
      <pages unit='KiB' size='2048'>0</pages>
      <pages unit='KiB' size='1048576'>0</pages>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='1' value='20'/>
        <sibling id='2' value='20'/>
      </distances>
      <cpus num='4'>
        <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
        <cpu id='1' socket_id='1' core_id='0' siblings='1'/>
        <cpu id='2' socket_id='2' core_id='0' siblings='2'/>
        <cpu id='3' socket_id='3' core_id='0' siblings='3'/>
      </cpus>
    </cell>
    <cell id='1'>
      <memory unit='KiB'>2016808</memory>
      <pages unit='KiB' size='4'>504202</pages>
      <pages unit='KiB' size='2048'>0</pages>
      <pages unit='KiB' size='1048576'>0</pages>
      <distances>
        <sibling id='0' value='20'/>
        <sibling id='1' value='10'/>
        <sibling id='2' value='20'/>
      </distances>
      <cpus num='2'>
        <cpu id='4' socket_id='4' core_id='0' siblings='4'/>
        <cpu id='5' socket_id='5' core_id='0' siblings='5'/>
      </cpus>
    </cell>
    <cell id='2'>
      <memory unit='KiB'>2014644</memory>
      <pages unit='KiB' size='4'>503661</pages>
      <pages unit='KiB' size='2048'>0</pages>
      <pages unit='KiB' size='1048576'>0</pages>
      <distances>
        <sibling id='0' value='20'/>
        <sibling id='1' value='20'/>
        <sibling id='2' value='10'/>
      </distances>
      <cpus num='2'>
        <cpu id='6' socket_id='6' core_id='0' siblings='6'/>
        <cpu id='7' socket_id='7' core_id='0' siblings='7'/>
      </cpus>
    </cell>
  </cells>
</topology>
...snip...

Everything should be ready and working at this point, so let’s try to install a nested guest and assign it one of the e1000e PCI devices. For simplicity we’ll just do the exact same install for the nested guest as we used for the top level guest we’re currently running in. The only difference is that we’ll assign it a PCI device

# cd /var/lib/libvirt/images
# wget -O f25x86_64-boot.iso https://download.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/images/boot.iso
# virt-install --name f25x86_64 --ram 2000 --vcpus 8 \
    --file /var/lib/libvirt/images/f25x86_64.img --file-size 10 \
    --cdrom f25x86_64-boot.iso --os-type fedora23 \
    --hostdev pci_0000_df_00_0 --network none

If everything went well, you should now have a nested guest with an assigned PCI device attached to it.

This turned out to be a rather long blog post, but that is not surprising: we’re experimenting with some cutting edge KVM features, trying to emulate quite a complicated hardware setup that deviates a fair way from a normal KVM guest setup. Perhaps in the future virt-install will be able to simplify some of this, but at least for the short to medium term a fair bit of manual work will be required. The positive thing, though, is that this has clearly demonstrated that KVM is now advanced enough that you can reasonably expect to do development and testing of features like NUMA and PCI device assignment inside nested guests.

The next step is to convince someone to add QEMU emulation of an Intel SRIOV network device….volunteers please :-)

by Daniel Berrange at February 16, 2017 12:44 PM

ANNOUNCE: libosinfo 1.0.0 release

NB, this blog post was intended to be published back in November last year, but got forgotten in draft stage. Publishing now in case anyone missed the release…

I am happy to announce a new release of libosinfo, version 1.0.0 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.

Changes in this release include:

  • Update loader to follow new layout for external database
  • Move all database files into separate osinfo-db package
  • Move osinfo-db-validate into osinfo-db-tools package

As promised, this release of libosinfo has completed the separation of the library code from the database files. There are now three independently released artefacts:

  • libosinfo – provides the libosinfo shared library and most associated command line tools
  • osinfo-db – contains only the database XML files and RNG schema, no code at all.
  • osinfo-db-tools – a set of command line tools for managing deployment of osinfo-db archives for vendors & users.

Before installing the 1.0.0 release of libosinfo it is necessary to install osinfo-db-tools, followed by osinfo-db. The download page has instructions for how to deploy the three components. In particular note that ‘osinfo-db’ does NOT contain any traditional build system, as the only files it contains are XML database files. So instead of unpacking the osinfo-db archive, use the osinfo-db-import tool to deploy it.

by Daniel Berrange at February 16, 2017 11:19 AM

February 15, 2017

QEMU project

Presentations from FOSDEM 2017

Over the last weekend, on February 4th and 5th, the FOSDEM 2017 conference took place in Brussels.

Some people from the QEMU community attended the Virtualisation and IaaS track, and the videos of their presentations are now available online, too.

by Thomas Huth at February 15, 2017 02:49 PM

February 14, 2017

Stefan Hajnoczi

Slides posted for "Using NVDIMM under KVM" talk

I gave a talk on NVDIMM persistent memory at FOSDEM 2017. QEMU has gained support for emulated NVDIMMs and they can be used efficiently under KVM.

Applications inside the guest access the physical NVDIMM directly with native performance when properly configured. These devices are DDR4 RAM modules so the access times are much lower than solid state (SSD) drives. I'm looking forward to hardware coming onto the market because it will change storage and databases in a big way.

This talk covers what NVDIMM is, the programming model, and how it can be used under KVM. Slides are available here (PDF).

Update: Video is available here.

by stefanha (noreply@blogger.com) at February 14, 2017 02:54 PM

February 10, 2017

Daniel Berrange

ANNOUNCE: gtk-vnc 0.7.0 release including 2 security fixes

I’m pleased to announce a new release of GTK-VNC, version 0.7.0. The release focus is on bug fixing, and it includes fixes for two publicly reported security bugs which allow a malicious server to exploit the client. Similar bugs were recently reported & fixed in other common VNC clients too.

  • CVE-2017-5884 – fix bounds checking for RRE, hextile and copyrect encodings
  • CVE-2017-5885 – fix color map index bounds checking
  • Add API to allow smooth scaling to be disabled
  • Workaround to help SPICE servers quickly drop VNC clients which mistakenly connect, by sending “RFB ” signature bytes early
  • Don’t accept color map entries for true-color pixel formats
  • Add missing vala .deps files for gvnc & gvncpulse
  • Avoid crash if host/port is NULL
  • Add precondition checks to some public APIs
  • Fix link to home page in README file
  • Fix misc memory leaks
  • Clamp cursor hot-pixel to within cursor region

Thanks to all those who reported bugs and provided patches that went into this new release.

by Daniel Berrange at February 10, 2017 04:45 PM

The surprisingly complicated world of disk image sizes

When managing virtual machines one of the key tasks is to understand the utilization of resources being consumed, whether RAM, CPU, network or storage. This post will examine different aspects of managing storage when using file based disk images, as opposed to block storage.

When provisioning a virtual machine the tenant user will have an idea of the amount of storage they wish the guest operating system to see for their virtual disks. This is the easy part. It is simply a matter of telling ‘qemu-img’ (or a similar tool) ’40GB’ and it will create a virtual disk image that is visible to the guest OS as a 40GB volume. The virtualization host administrator, however, doesn’t particularly care about what size the guest OS sees. They are instead interested in how much space is (or will be) consumed in the host filesystem storing the image. With this in mind, there are four key figures to consider when managing storage:

  • Capacity – the size that is visible to the guest OS
  • Length – the current highest byte offset in the file.
  • Allocation – the amount of storage that is currently consumed.
  • Commitment – the amount of storage that could be consumed in the future.

The relationship between these figures will vary according to the format of the disk image file being used. For the sake of illustration, raw and qcow2 files will be compared, since they are examples of the simplest and the most complicated file formats used for virtual machines.

Raw files

In a raw file, the sectors visible to the guest are mapped 1-to-1 onto sectors in the host file. Thus the capacity and length values will always be identical for raw files – the length dictates the capacity and vice-versa. The allocation value is slightly more complicated. Most filesystems do lazy allocation of blocks, so even if a file is 10 GB in length it is entirely possible for it to consume 0 bytes of physical storage, if nothing has been written to the file yet. Such a file is known as “sparse”, or is said to have “holes” in its allocation. To maximize guest performance, it is common to tell the operating system to fully allocate a file at time of creation, either by writing zeros to every block (very slow) or via a special system call to instruct it to immediately allocate all blocks (very fast). So immediately after creating a new raw file, the allocation would typically either match the length, or be zero. In the latter case, as the guest writes to various disk sectors, the allocation of the raw file will grow. The commitment value refers to the upper bound for the allocation value, and for raw files, this will match the length of the file.

While raw files look reasonably straightforward, some filesystems can create surprises. XFS has a concept of “speculative preallocation” where it may allocate more blocks than are actually needed to satisfy the current I/O operation. This is useful for files which are progressively growing, since it is faster to allocate 10 blocks all at once, than to allocate 10 blocks individually. So while a raw file’s allocation will usually never exceed the length, if XFS has speculatively preallocated extra blocks, it is possible for the allocation to exceed the length. The excess is usually pretty small though – bytes or KBs, not MBs. Btrfs meanwhile has a concept of “copy on write” whereby multiple files can initially share allocated blocks and when one file is written, it will take a private copy of the blocks written. IOW, to determine the usage of a set of files it is not sufficient to sum the allocation for each file, as that would over-count the true allocation due to block sharing.

QCow2 files

In a qcow2 file, the sectors visible to the guest are indirectly mapped to sectors in the host file via a number of lookup tables. A sector at offset 4096 in the guest may be stored at offset 65536 in the host. In order to perform this mapping, there are various auxiliary data structures stored in the qcow2 file. Describing all of these structures is beyond the scope of this post; read the specification instead. The key point is that, unlike raw files, the length of the file in the host has no relation to the capacity seen in the guest. The capacity is determined by a value stored in the file header metadata. By default, the qcow2 file will grow on demand, so the length of the file will gradually grow as more data is stored. It is possible to request preallocation, either just of file metadata, or of the full file payload too. Since the file grows on demand as data is written, traditionally it would never have any holes in it, so the allocation would always match the length (the previous caveat with respect to XFS speculative preallocation still applies though). Since the introduction of SSDs, however, the notion of explicitly cutting holes in files has become commonplace. When this is plumbed through from the guest, a guest initiated TRIM request will in turn create a hole in the qcow2 file, which will also issue a TRIM to the underlying host storage. Thus even though qcow2 files grow on demand, they may also become sparse over time, so allocation may be less than the length. The maximum commitment for a qcow2 file is surprisingly hard to determine accurately. Calculating it requires intimate knowledge of the qcow2 file format and even the type of data stored in it. There is allocation overhead from the data structures used to map guest sectors to host file offsets, which is directly proportional to the capacity and the qcow2 cluster size (a cluster is the qcow2 equivalent of the “sector” concept, except much bigger – 65536 bytes by default).
Over time qcow2 has grown other data structures though, such as various bitmap tables tracking cluster allocation and recent writes. With the addition of LUKS support, there will be key data tables. Most significant, though, is that qcow2 can internally store entire VM snapshots containing the virtual device state, guest RAM and copy-on-write disk sectors. If snapshots are ignored, it is possible to calculate a value for the commitment, and it will be proportional to the capacity. If snapshots are used, however, all bets are off – the amount of storage that can be consumed is unbounded, so there is no commitment value that can be accurately calculated.
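A rough feel for the mapping overhead can be had with some shell arithmetic. This is only a sketch: it counts just the 8-byte L2 entries and the default 16-bit refcount entries for a 40GB image with 64KB clusters, and ignores the L1 table, refcount table, header and alignment:

```shell
# Estimate qcow2 mapping-metadata overhead for a 40GB image with the
# default 64KB cluster size. Each guest cluster needs an 8-byte L2 entry
# and (by default) a 2-byte refcount entry. Figures are approximate.
capacity=$((40 * 1024 * 1024 * 1024))
cluster=$((64 * 1024))
clusters=$((capacity / cluster))
l2=$((clusters * 8))          # 8-byte L2 entries
refcount=$((clusters * 2))    # 16-bit refcount entries
echo "L2 tables:       $((l2 / 1024 / 1024)) MiB"
echo "refcount blocks: $((refcount / 1024 / 1024)) MiB"
```

That is a little over 6 MiB of mapping metadata, which is why the commitment figures in the summary table below round up to 41GB rather than 40GB.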

Summary

Considering the above information, for a newly created file the four size values would look like

Format                     Capacity  Length  Allocation  Commitment
raw (sparse)               40GB      40GB    0           40GB [1]
raw (prealloc)             40GB      40GB    40GB [1]    40GB [1]
qcow2 (grow on demand)     40GB      193KB   196KB       41GB [2]
qcow2 (prealloc metadata)  40GB      41GB    6.5MB       41GB [2]
qcow2 (prealloc all)       40GB      41GB    41GB        41GB [2]

[1] XFS speculative preallocation may cause allocation/commitment to be very slightly higher than 40GB
[2] use of internal snapshots may massively increase allocation/commitment

For an application attempting to manage filesystem storage to ensure any future guest OS write will always succeed without triggering ENOSPC (out of space) in the host, the commitment value is critical to understand. If the length/allocation values are initially less than the commitment, they will grow towards it as the guest writes data. For raw files it is easy to determine commitment (XFS preallocation aside), but for qcow2 files it is unreasonably hard. Even ignoring internal snapshots, there is no API provided by libvirt that reports this value, nor is it exposed by QEMU or its tools. Determining the commitment for a qcow2 file requires the application to not only understand the qcow2 file format, but also directly query the header metadata to read internal parameters such as “cluster size” to be able to then calculate the required value. Without this, the best an application can do is to guess – e.g. add 2% to the capacity of the qcow2 file to determine likely commitment. Snapshots make life even harder, but to be fair, qcow2 internal snapshots are best avoided regardless in favour of external snapshots. The lack of information around file commitment is a clear gap that needs addressing in both libvirt and QEMU.

That all said, ensuring the sum of commitment values across disk images is within the filesystem free space is only one approach to managing storage. These days QEMU has the ability to live migrate virtual machines even when their disks are on host-local storage – it simply copies across the disk image contents too. So a valid approach is to mostly ignore future commitment implied by disk images, and instead just focus on the near term usage. For example, regularly monitor filesystem usage and if free space drops below some threshold, then migrate one or more VMs (and their disk images) off to another host to free up space for remaining VMs.
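A minimal sketch of that monitoring approach (the directory and threshold are illustrative; a real system would respond by migrating a VM rather than just printing a warning):

```shell
#!/bin/sh
# Warn when the filesystem backing the disk images runs low on space.
# /var/lib/libvirt/images and the 20% threshold are illustrative choices.
dir=${1:-/var/lib/libvirt/images}
threshold=20

used=$(df --output=pcent "$dir" | tail -n 1 | tr -dc '0-9')
free=$((100 - used))

if [ "$free" -lt "$threshold" ]; then
    echo "WARNING: only ${free}% free on the filesystem holding $dir"
else
    echo "OK: ${free}% free on the filesystem holding $dir"
fi
```

Run periodically (e.g. from cron), the WARNING branch is where a management layer would pick a victim VM and live migrate it, with its disk images, to another host.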

by Daniel Berrange at February 10, 2017 03:58 PM

February 08, 2017

Daniel Berrange

Commenting out XML snippets in libvirt guest config by stashing it as metadata

Libvirt uses XML as the format for configuring objects it manages, including virtual machines. Sometimes when debugging / developing it is desirable to comment out sections of the virtual machine configuration to test some idea. For example, one might want to temporarily remove a secondary disk. It is not always desirable to just delete the configuration entirely, as it may need to be re-added immediately after. XML has support for comments <!-- .... some text --> which one might try to use to achieve this. Using comments in XML fed into libvirt, however, will result in an unwelcome surprise – the commented out text is thrown into /dev/null by libvirt.

This is an unfortunate consequence of the way libvirt handles XML documents. It does not consider the XML document to be the master representation of an object’s configuration – a series of C structs are the actual internal representation. XML is simply a data interchange format for serializing structs into a text format that can be interchanged with the management application, or persisted on disk. So when receiving an XML document libvirt will parse it, extracting the pieces of information it cares about, which are then stored in memory in some structs, while the XML document is discarded (along with the comments it contained). Given this way of working, to preserve comments would require libvirt to add 100’s of extra fields to its internal structs and extract comments from every part of the XML document that might conceivably contain them. This is totally impractical to do in reality. The alternative would be to consider the parsed XML DOM as the canonical internal representation of the config. This is what the libvirt-gconfig library in fact does, but it means you can no longer just do simple field accesses to get at information – getter/setter methods would have to be used, which quickly becomes tedious in C. It would also involve refactoring almost the entire libvirt codebase, so such a change in approach would realistically never be done.

Given that it is not possible to use XML comments in libvirt, what other options might be available?

Many years ago libvirt added the ability to store arbitrary user defined metadata in domain XML documents. The caveat is that they have to be located in a specific place in the XML document as a child of the <metadata> tag, in a private XML namespace. This metadata facility can be used as a hack to temporarily stash some XML out of the way. Consider a guest which contains a disk to be “commented out”:

<domain type="kvm">
  ...
  <devices>
    ...
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </disk>
    ...
  </devices>
</domain>

To stash the disk config as a piece of metadata requires changing the XML to

<domain type="kvm">
  ...
  <metadata>
    <s:disk xmlns:s="http://stashed.org/1" type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </s:disk>
  </metadata>
  ...
  <devices>
    ...
  </devices>
</domain>

What we have done here is

– Added a <metadata> element at the top level
– Moved the <disk> element to be a child of <metadata> instead of a child of <devices>
– Added an XML namespace to <disk> by giving it an ‘s:’ prefix and associating a URI with this prefix

Libvirt only allows a single top level metadata element per namespace, so if there are multiple things to be stashed, just give them each a custom namespace, or introduce an arbitrary wrapper. Aside from mandating the use of a unique namespace, libvirt treats the metadata as entirely opaque and will not try to interpret or parse it in any way. Any valid XML construct can be stashed in the metadata; even invalid XML constructs can be, provided they are hidden inside a CDATA block. For example, if you’re using virsh edit to make some changes interactively and want to get out before finishing them, just stash the changes in a CDATA section, avoiding the need to worry about correctly closing the elements.

<domain type="kvm">
  ...
  <metadata>
    <s:stash xmlns:s="http://stashed.org/1">
    <![CDATA[
      <disk type='file' device='disk'>
        <driver name='qemu' type='raw'/>
        <source file='/home/berrange/VirtualMachines/demo.qcow2'/>
        <target dev='vda' bus='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
      </disk>
      <disk>
        <driver name='qemu' type='raw'/>
        ...i'll finish writing this later...
    ]]>
    </s:stash>
  </metadata>
  ...
  <devices>
    ...
  </devices>
</domain>

Admittedly this is a somewhat cumbersome solution. In most cases it is probably simpler to just save the snippet of XML in a plain text file outside libvirt. This metadata trick, however, might just come in handy some times.
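When the time comes to restore the snippet, the stashed element can be fished back out of the dumped XML with ordinary text tools; a crude sed range match is enough for this ad-hoc use (“demo” is a hypothetical guest name):

```shell
# Print the stashed <s:disk> element from the guest config so it can be
# pasted back under <devices> (and then removed from <metadata>).
virsh dumpxml demo | sed -n '/<s:disk /,/<\/s:disk>/p'
```

The sed expression simply prints every line from the opening <s:disk> tag through the closing </s:disk> tag, which is adequate as long as the stash contains one such element.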

As an aside, the real, intended usage of the <metadata> facility is to allow applications which interact with libvirt to store custom data they may wish to associate with the guest. As an example, the recently announced libvirt websockets console proxy uses it to record which consoles are to be exported. I know of few other real world applications using this metadata feature, however, so it is worth remembering it exists :-) System administrators are free to use it for local book keeping purposes too.

by Daniel Berrange at February 08, 2017 07:14 PM

Thomas Huth

Testing the edk2 SMM driver stack with QEMU, KVM & libvirt

Laszlo wrote an article over at the edk2 wiki about testing SMM with OVMF, in QEMU/KVM virtual machines managed by libvirt:

https://github.com/tianocore/tianocore.github.io/wiki/Testing-SMM-with-QEMU,-KVM-and-libvirt

The primary goal of the article is to help rapid development and testing of SMM-related firmware code (or any edk2 code in general).

February 08, 2017 12:10 PM

Alberto Garcia

QEMU and the qcow2 metadata checks

When choosing a disk image format for your virtual machine one of the factors to take into consideration is its I/O performance. In this post I’ll talk a bit about the internals of qcow2 and about one of the aspects that can affect its performance under QEMU: its consistency checks.

As you probably know, qcow2 is QEMU’s native file format. The first thing that I’d like to highlight is that this format is perfectly fine in most cases and its I/O performance is comparable to that of a raw file. When it isn’t, chances are that this is due to an insufficiently large L2 cache. In one of my previous blog posts I wrote about the qcow2 L2 cache and how to tune it, so if your virtual disk is too slow, you should go there first.

I also recommend Max Reitz and Kevin Wolf’s qcow2: why (not)? talk from KVM Forum 2015, where they talk about a lot of internal details and show some performance tests.

qcow2 clusters: data and metadata

A qcow2 file is organized into units of constant size called clusters. The cluster size defaults to 64KB, but a different value can be set when creating a new image:

qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G

Clusters can contain either data or metadata. A qcow2 file grows dynamically and only allocates space when it is actually needed, so apart from the header there’s no fixed location for any of the data and metadata clusters: they can appear mixed anywhere in the file.

Here’s an example of what it looks like internally:

[figure: layout of a qcow2 file, with header, L1/L2 table, refcount and data clusters interleaved]

In this example we can see the most important types of clusters that a qcow2 file can have:

  • Header: this one contains basic information such as the virtual size of the image, the version number, and pointers to where the rest of the metadata is located, among other things.
  • Data clusters: the data that the virtual machine sees.
  • L1 and L2 tables: a two-level structure that maps the virtual disk that the guest can see to the actual location of the data clusters in the qcow2 file.
  • Refcount table and blocks: a two-level structure with a reference count for each data cluster. Internal snapshots use this: a cluster with a reference count >= 2 means that it’s used by other snapshots, and therefore any modifications require a copy-on-write operation.

Metadata overlap checks

In order to detect corruption when writing to qcow2 images QEMU (since v1.7) performs several sanity checks. They verify that QEMU does not try to overwrite sections of the file that are already being used for metadata. If this happens, the image is marked as corrupted and further access is prevented.

Although in most cases these checks are innocuous, under certain scenarios they can have a negative impact on disk write performance. This depends a lot on the case, and I want to insist that in most scenarios it doesn’t have any effect. When it does, the general rule is that you’ll have more chances of noticing it if the storage backend is very fast or if the qcow2 image is very large.

In these cases, and if I/O performance is critical for you, you might want to consider tweaking the images a bit or disabling some of these checks, so let’s take a look at them. There are currently eight different checks. They’re named after the metadata sections that they check, and can be divided into the following categories:

  1. Checks that run in constant time. These are equally fast for all kinds of images and I don’t think they’re worth disabling.
    • main-header
    • active-l1
    • refcount-table
    • snapshot-table
  2. Checks that run in variable time but don’t need to read anything from disk.
    • refcount-block
    • active-l2
    • inactive-l1
  3. Checks that need to read data from disk. There is just one check here and it’s only needed if there are internal snapshots.
    • inactive-l2

By default all tests are enabled except for the last one (inactive-l2), because it needs to read data from disk.

Disabling the overlap checks

Tests can be disabled or enabled from the command line using the following syntax:

-drive file=hd.qcow2,overlap-check.inactive-l2=on
-drive file=hd.qcow2,overlap-check.snapshot-table=off

It’s also possible to select the group of checks that you want to enable using the following syntax:

-drive file=hd.qcow2,overlap-check.template=none
-drive file=hd.qcow2,overlap-check.template=constant
-drive file=hd.qcow2,overlap-check.template=cached
-drive file=hd.qcow2,overlap-check.template=all

Here, none means that no tests are enabled, constant enables all tests from group 1, cached enables all tests from groups 1 and 2, and all enables all of them.

As I explained in the previous section, if you’re worried about I/O performance then the checks that are probably worth evaluating are refcount-block, active-l2 and inactive-l1. I’m not counting inactive-l2 because it’s off by default. Let’s look at the other three:

  • inactive-l1: This is a variable length check because it depends on the number of internal snapshots in the qcow2 image. However its performance impact is likely to be negligible in all cases so I don’t think it’s worth bothering with.
  • active-l2: This check depends on the virtual size of the image, and on the percentage that has already been allocated. This check might have some impact if the image is very large (several hundred GBs or more). In that case one way to deal with it is to create an image with a larger cluster size. This also has the nice side effect of reducing the amount of memory needed for the L2 cache.
  • refcount-block: This check depends on the actual size of the qcow2 file and it’s independent from its virtual size. This check is relatively expensive even for small images, so if you notice performance problems chances are that they are due to this one. The good news is that we have been working on optimizing it, so if it’s slowing down your VMs the problem might go away completely in QEMU 2.9.
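To put numbers on the cluster-size suggestion above, the active L2 metadata (which both the active-l2 check and the L2 cache have to cover) shrinks in proportion as clusters grow. A rough estimate in shell arithmetic, using an arbitrary 500GB image as the “very large” case:

```shell
# L2 metadata size = virtual size / cluster size * 8 bytes per entry.
# The 500GB virtual size is an arbitrary "very large image" example.
capacity=$((500 * 1024 * 1024 * 1024))
for cluster_kb in 64 2048; do
    l2=$((capacity / (cluster_kb * 1024) * 8))
    echo "${cluster_kb}KB clusters -> $((l2 / 1024)) KiB of L2 tables"
done
```

Going from 64KB to 2MB clusters cuts the L2 metadata by a factor of 32, which shrinks both the data the check has to walk and the L2 cache needed to map the whole image.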

Conclusion

The qcow2 consistency checks are useful to detect data corruption, but they can affect write performance.

If you’re unsure and you want to check it quickly, open an image with overlap-check.template=none and see for yourself, but remember again that this will only affect write operations. To obtain more reliable results you should also open the image with cache=none in order to perform direct I/O and bypass the page cache. I’ve seen performance increases of 50% and more, but whether you’ll see them depends a lot on your setup. In many cases you won’t notice any difference.

I hope this post was useful to learn a bit more about the qcow2 format. There are other things that can help QEMU perform better, and I’ll probably come back to them in future posts, so stay tuned!

Acknowledgments

My work in QEMU is sponsored by Outscale and has been made possible by Igalia and the help of the rest of the QEMU development team.

by berto at February 08, 2017 08:52 AM

February 04, 2017

QEMU project

A new website for QEMU

At last, QEMU’s new website is up!

The new site aims to be simple and provides the basic information needed to download and start contributing to QEMU. It complements the wiki, which remains the central point for developers to share information quickly with the rest of the community.

We tried to test the website on most browsers and to make it lightweight and responsive. It is built using Jekyll and the source code for the website can be cloned from the qemu-web.git repository. Just like for any other project hosted by QEMU, the best way to propose or contribute a new change is by sending a patch through the qemu-devel@nongnu.org mailing list.

For example, if you would like to add a new screenshot to the homepage, you can clone the qemu-web.git repository, add a PNG file to the screenshots/ directory, and edit the _data/screenshots.yml file to include the new screenshot.

Blog posts about QEMU are also welcome; they are simple HTML or Markdown files and are stored in the _posts/ directory of the repository.

by Paolo Bonzini at February 04, 2017 07:40 AM

January 24, 2017

Gerd Hoffmann

local display for intel vgpu starts working

Intel vgpu guest display shows up in the qemu gtk window. This is a linux kernel booted to the dracut emergency shell prompt, with the framebuffer console running @ inteldrmfb. Booting a full linux guest with X11 and/or wayland has not been tested yet. There are rendering glitches too; running “dmesg” looks like this:

[screenshot: guest console showing glitched “dmesg” output]

So, a bunch of issues to solve before this is ready for users, but it’s a nice start.

For the brave:

host kernel: https://www.kraxel.org/cgit/linux/log/?h=intel-vgpu
qemu: https://www.kraxel.org/cgit/qemu/log/?h=work/intel-vgpu

Take care, both branches are moving targets (aka: rebasing at times).

by Gerd Hoffmann at January 24, 2017 08:38 PM

January 18, 2017

Gerd Hoffmann

tweak arm images with libguestfs-tools

So, when using the official fedora arm images on your raspberry pi (or any other arm board) you might have faced the problem that it is not easy to use them for a headless (i.e. no keyboard and display connected) machine. There is no default password; fedora asks you to set one on the first boot instead. Which is, from a security point of view, surely better than shipping with a fixed password. But for headless machines it is quite inconvenient …

Luckily there is an easy way out. You can use libguestfs-tools. The tools have been created to configure virtual machine images (this is where the name comes from). But the tools work fine with sdcards too.

I’m using a usb sdcard reader which shows up as /dev/sdc on my system. I can just pass /dev/sdc as image to the tools (take care, the device is probably something else for you). For example, to set a root password:

virt-customize -a /dev/sdc --root-password "password:<your-password-here>"

The initial setup on the first boot is a systemd service, and it can be turned off by simply removing the symlinks which enable the service:

virt-customize -a /dev/sdc \
  --delete /etc/systemd/system/multi-user.target.wants/initial-setup.service \
  --delete /etc/systemd/system/graphical.target.wants/initial-setup.service

You can use virt-copy-in (or virt-tar-in) to copy config files to the disk image. Small (or empty) configuration files can also be created with the write command:

virt-customize -a /dev/sdc --write "/.autorelabel:"

Adding the .autorelabel file will force selinux relabeling on the first boot (takes a while). It is a good idea to do that in case you copy files to the sdcard, to make sure the new files are labeled correctly. Especially in case you copy security sensitive things like ssh keys or ssh config files. Without relabeling selinux will not allow sshd access those files, which in turn can break remote logins.

There is a lot more the virt-* tools can do for you. Check out the manual pages for more info. And you can easily script things: virt-customize has a --commands-from-file switch which accepts a file with a list of commands.

by Gerd Hoffmann at January 18, 2017 10:56 AM

January 17, 2017

Gerd Hoffmann

virtual gpu support landing upstream

The upstreaming process of virtual gpu support (vgpu) made a big step forward with the 4.10 merge window. Two important pieces have been merged:

First, the mediated device framework (mdev). Basically this allows kernel drivers to present virtual pci devices, using the vfio framework and interfaces. Both nvidia and intel will use mdev to partition the physical gpu of the host into multiple virtual devices which then can be assigned to virtual machines.

Second, intel landed initial mdev support for the i915 driver too. There is quite some work left to do in future kernel releases though. Access to the guest display is not supported yet, so you must run x11vnc or similar tools in the guest to see the screen. Also there are some stability issues to find and fix.

If you want to play with this nevertheless, here is how to do it. But be prepared for crashes, and better don’t try this on a production machine.

On the host: create virtual devices

On the host machine you obviously need a 4.10 kernel. Also the intel graphics device (igd) must be broadwell or newer. In the kernel configuration enable vfio and mdev (all CONFIG_VFIO_* options). Enable CONFIG_DRM_I915_GVT and CONFIG_DRM_I915_GVT_KVMGT for intel vgpu support. Building the mtty sample driver (CONFIG_SAMPLE_VFIO_MDEV_MTTY, a virtual serial port) can be useful too, for testing.
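For reference, a minimal .config fragment covering the options mentioned above might look like the following (symbol names as of the 4.10 merge window; double-check them against your tree):

```
CONFIG_VFIO=m
CONFIG_VFIO_IOMMU_TYPE1=m
CONFIG_VFIO_PCI=m
CONFIG_VFIO_MDEV=m
CONFIG_VFIO_MDEV_DEVICE=m
CONFIG_DRM_I915_GVT=y
CONFIG_DRM_I915_GVT_KVMGT=m
# optional, for testing with the virtual serial port sample driver:
CONFIG_SAMPLE_VFIO_MDEV_MTTY=m
```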

Boot the new kernel. Load all modules: vfio-pci, vfio-mdev, optionally mtty. Also i915 and kvmgt of course, but that probably happened during boot already.

Go to the /sys/class/mdev_bus directory. This should look like this:

kraxel@broadwell ~# cd /sys/class/mdev_bus
kraxel@broadwell .../class/mdev_bus# ls -l
total 0
lrwxrwxrwx. 1 root root 0 17. Jan 10:51 0000:00:02.0 -> ../../devices/pci0000:00/0000:00:02.0
lrwxrwxrwx. 1 root root 0 17. Jan 11:57 mtty -> ../../devices/virtual/mtty/mtty

Each driver with mdev support has a directory there. Go to $device/mdev_supported_types to check what kind of virtual devices you can create.

kraxel@broadwell .../class/mdev_bus# cd 0000:00:02.0/mdev_supported_types
kraxel@broadwell .../0000:00:02.0/mdev_supported_types# ls -l
total 0
drwxr-xr-x. 3 root root 0 17. Jan 11:59 i915-GVTg_V4_1
drwxr-xr-x. 3 root root 0 17. Jan 11:57 i915-GVTg_V4_2
drwxr-xr-x. 3 root root 0 17. Jan 11:59 i915-GVTg_V4_4

As you can see intel supports three different configurations on my machine. The configuration (basically the amount of video memory) differs, and the number of instances you can create. Check the description and available_instance files in the directories:

kraxel@broadwell .../0000:00:02.0/mdev_supported_types# cd i915-GVTg_V4_2
kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# cat description 
low_gm_size: 64MB
high_gm_size: 192MB
fence: 4
kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# cat available_instance 
2

Now it is possible to create virtual devices by writing a UUID into the create file:

kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# uuid=$(uuidgen)
kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# echo $uuid
f321853c-c584-4a6b-b99a-3eee22a3919c
kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# sudo sh -c "echo $uuid > create"

The new vgpu device will show up as subdirectory of the host gpu:

kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# cd ../../$uuid
kraxel@broadwell .../0000:00:02.0/f321853c-c584-4a6b-b99a-3eee22a3919c# ls -l
total 0
lrwxrwxrwx. 1 root root    0 17. Jan 12:31 driver -> ../../../../bus/mdev/drivers/vfio_mdev
lrwxrwxrwx. 1 root root    0 17. Jan 12:35 iommu_group -> ../../../../kernel/iommu_groups/10
lrwxrwxrwx. 1 root root    0 17. Jan 12:35 mdev_type -> ../mdev_supported_types/i915-GVTg_V4_2
drwxr-xr-x. 2 root root    0 17. Jan 12:35 power
--w-------. 1 root root 4096 17. Jan 12:35 remove
lrwxrwxrwx. 1 root root    0 17. Jan 12:31 subsystem -> ../../../../bus/mdev
-rw-r--r--. 1 root root 4096 17. Jan 12:35 uevent

You can see the device landed in iommu group 10. We’ll need that in a moment.

On the host: configure guests

Ideally this would be as simple as adding a <hostdev> entry to your guest’s libvirt xml config. The mdev devices don’t have a pci address on the host though, and because of that they must be passed to qemu using the sysfs device path instead of the pci address. libvirt doesn’t (yet) support sysfs paths though, so it is a bit more complicated for now. A lot of the setup libvirt does automatically for hostdevs must be done manually instead.

First, we must allow qemu to access /dev. By default libvirt uses control groups to restrict access; that must be turned off. Edit /etc/libvirt/qemu.conf, uncomment the cgroup_controllers line, remove "devices" from the list, and restart libvirtd.

Second, we must allow qemu to access the iommu group (10 in my case). A simple chmod will do:

kraxel@broadwell ~# chmod 666 /dev/vfio/10

Third, we must update the guest configuration:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  [ ... ]
  <currentMemory unit='KiB'>1048576</currentMemory>
  <memoryBacking>
    <locked/>
  </memoryBacking>
  [ ... ]
  <qemu:commandline>
    <qemu:arg value='-device'/>
    <qemu:arg value='vfio-pci,addr=05.0,sysfsdev=/sys/class/mdev_bus/0000:00:02.0/f321853c-c584-4a6b-b99a-3eee22a3919c'/>
  </qemu:commandline>
</domain>

There is a special qemu XML namespace which can be used to pass extra command line arguments to qemu. We use it here to access a qemu feature not yet supported by libvirt (sysfs paths for vfio-pci). We must also explicitly allow guest memory to be locked, via the <memoryBacking> element.

Now we are ready to go:

kraxel@broadwell ~# virsh start --console $guest

In the guest

It is a good idea to prepare the guest a bit before adding the vgpu to the guest configuration. Set up a serial console, so you can talk to the guest even if graphics are broken. Blacklist the i915 module and load it manually, at least until you have a known-working configuration. Also, booting to runlevel 3 (aka multi-user.target) instead of 5 (aka graphical.target) and starting the xorg server manually is better for now.

For the guest machine Intel recommends the 4.8 kernel. In theory newer kernels should work too; in practice they didn't the last time I tested (4.10-rc2). Also make sure the xorg server uses the modesetting driver; the intel driver didn't work in my testing. This config file will do:

root@guest ~# cat /etc/X11/xorg.conf.d/intel.conf 
Section "Device"
        Identifier  "Card0"
#       Driver      "intel"
        Driver      "modesetting"
        BusID       "PCI:0:5:0"
EndSection

I’m starting the xorg server with x11vnc, xterm and mwm (motif window manager) using this little script:

#!/bin/sh

# debug
echo "# $0: DISPLAY=$DISPLAY"

# start server
if test "$DISPLAY" != ":4"; then
        echo "# $0: starting Xorg server"
        exec startx $0 -- /usr/bin/Xorg :4
        exit 1
fi
echo "# $0: starting session"

# configure session
xrdb $HOME/.Xdefaults

# start clients
x11vnc -rfbport 5904 &
xterm &
exec mwm

The session runs on display 4, so you should be able to connect from the host this way:

kraxel@broadwell ~# vncviewer $guest_ip:4

Have fun!

by Gerd Hoffmann at January 17, 2017 03:25 PM

January 16, 2017

Thomas Huth

Controlling Open Firmware with -prom-env

Open Firmware has a concept of environment configuration variables that are used to control the boot flow and other behavior of the firmware. On ppc64 systems, these variables are normally stored in the NVRAM of the machine (as defined in the LoPAPR specification, chapter 8.4.1.1, “System NVRAM Partition”), so that they are persistent between reboots - at least on real systems. With QEMU, the emulated NVRAM of course has to be backed by a real file on the host to make the contents persistent. This can be done with QEMU's -drive if=pflash,file=filename,format=raw parameter, for example.

Now, if a backing file via pflash is not used, the contents of the NVRAM can be created by QEMU dynamically. QEMU features a dedicated parameter called -prom-env for setting the configuration variables in NVRAM. This has worked with all the OpenBIOS-based machines in QEMU (like the mac99 or g3beige machines) for a long time already, but since version 2.8 was released last December, QEMU also supports this option for the pseries machine, i.e. it can now be used to control the boot behavior of the SLOF firmware of an sPAPR guest as well. So this is a good point in time to take a closer look at this parameter and explain how it can be used with the pseries machine…

The supported environment variables and their values can be listed by typing printenv at the SLOF firmware prompt:

0 > printenv  
---environment variable--------current value-------------default value------
   load-base                   4000                      4000 
   real-mode?                  true                      true
   direct-serial?              false                     false
   use-nvramrc?                false                     false
   selftest-#megs              0                         0 
   security-password                                     
   security-mode               0                         0 
   security-#badlogins         0                         0 
   screen-#rows                200                       200 
   screen-#columns             200                       200 
   output-device                                         
   oem-logo?                   false                     false
   oem-logo                    1e762bb0 0                1e762110 0 
   oem-banner?                 false                     false
   oem-banner                                            
   nvramrc                                               
   input-device                                          
   fcode-debug?                true                      true
   diag-switch?                false                     false
   diag-file                                             
   diag-device                                           
   boot-command                boot                      boot
   boot-file                                             
   boot-device                                           
   auto-boot?                  false                     true

Many of the configuration variables are either only useful for debugging (like “fcode-debug?”), or are only present without real functionality because the IEEE 1275 standard mandates their availability (like “oem-logo”) while implementing the intended functionality did not make much sense in SLOF.

Some other configuration variables are really useful, though. For example, if you want to prevent SLOF from booting automatically (so you can do some things at the firmware prompt before running the OS), you can start QEMU with -prom-env 'auto-boot?=false' to disable the auto-boot feature. Of course you could also hit the “s” key during boot to drop to the firmware prompt, but this can be quite annoying if you’re doing multiple things in parallel and thus easily miss the right point in time. Using the configuration variable is much more convenient.

Another very useful trick is that you can execute arbitrary Forth code during the boot process with the -prom-env parameter! The obvious way is to use the “nvramrc” variable. For example, if you start QEMU with the parameters -prom-env 'use-nvramrc?=true' -prom-env 'nvramrc=." Hello World!" cr', SLOF will print “Hello World!” during the boot process, followed by a carriage return. But there is another way to execute Forth code, which I personally prefer if I do not have to boot an OS afterwards (since you do not have to set two variables in this case): you can override the boot-command variable, which also contains Forth code. For example, the parameter -prom-env 'boot-command=." Hello World!" cr' will print “Hello World!” during the boot process too, just at a slightly later point in time. This can also be useful to run firmware reboot tests: when you run QEMU with -prom-env 'boot-command=reset-all', the firmware will reboot automatically each time instead of booting an operating system. Or use -prom-env 'boot-command=power-off' to shut down the VM automatically at the end of the firmware boot process.

Something else that bugged me for a long time was the behavior of the input selection in SLOF. When you boot your pseries guest with a VGA graphics card, SLOF always automatically uses the emulated USB keyboard as the input device. But if you want to debug the VGA or USB code in the firmware, for example, it is much more convenient to interact with the firmware via the serial console (aka hvterm). So now that QEMU supports the -prom-env parameter for the pseries machine, too, I’ve recently added some code to SLOF that forces the firmware to stay with the serial input instead of switching to the emulated USB keyboard. You can control the behavior now with the “direct-serial?” configuration variable. If you start QEMU with the parameters -nographic -vga std -prom-env 'direct-serial?=true' for example, you can still interact with the firmware in the terminal window even though the firmware detected a graphics card and USB keyboard. (Note: This new feature is currently only available in the development version of SLOF, but it will be part of the SLOF release that will be shipped with QEMU version 2.9.)

January 16, 2017 10:45 AM

January 06, 2017

Ladi Prosek

Extracting Windows VM crash dumps

This post aims to provide an overview of the many options for extracting Windows crash dump files and their equivalents out of Windows virtual machines. The assumption is that a Windows instance has just crashed with a Blue Screen Of Death (BSOD) and a dump is deemed helpful to diagnose the issue. Note that most of the tips below apply to physical machines as well, but the examples here focus on a QEMU/KVM virtual machine.

1. Copy MEMORY.DMP after reboot

This is straightforward as long as MEMORY.DMP has actually been generated and there is a simple way to copy it out of the VM. Note that when Windows is showing the BSOD screen and “dumping memory to disk”, it is not creating a crash dump file yet. To increase the chances that the memory dump will be persisted, Windows writes it to the page file first. Then, on the first reboot, it tries to extract it into a separate file, usually %SystemRoot%\MEMORY.DMP.

A few things can go wrong here. First, if the system got super messed up by whatever caused the crash, the VM may not be able to boot again. No boot means no MEMORY.DMP and we’re out of luck. There may also not be enough free disk space to create the crash dump. This would, again, mean no MEMORY.DMP for us. Note that freeing disk space after the fact won’t help as the dump in the page file is already partially overwritten by the time the user gets to do anything about the disk space situation. Unless of course the disk can be accessed offline from the host or another VM. If you’re willing to do this, it is not necessary to reboot though (see below). Here’s an older blog post with a detailed description of the Windows behavior and useful tips.

2. Extract the page file

Little known is the fact that the page file itself can be fed into windbg and used in lieu of MEMORY.DMP (thanks to Vadim Rozenfeld for enlightening me!). The page file will be larger than MEMORY.DMP but it doesn’t require the reboot mentioned above. A Windows VM crashed and doesn’t boot back up? No problem, just extract the page file:

$ guestmount -a Windows10.qcow2 -i --ro /mnt/windows
$ cp /mnt/windows/pagefile.sys ~

And then on a Windows machine:

C:\>windbg -z path\to\pagefile.sys

3. Process QEMU guest memory dump with Volatility

This is not guaranteed to succeed but is worth a try as a last-ditch effort, or if the VM is in production and a “crash” dump is required without actually crashing it.

Side note: the easiest way to trigger a Windows VM crash is raising a non-maskable interrupt. This can be done by issuing the inject-nmi QMP command or one of its equivalents, such as virsh inject-nmi:

Welcome to the QMP low-level shell!
Connected to QEMU 2.8.50

(QEMU) inject-nmi

Back to the non-crashing scenario. First, get the guest memory dump:

Welcome to the QMP low-level shell!
Connected to QEMU 2.8.50

(QEMU) dump-guest-memory paging=false protocol=file:/tmp/vmdump

The file is a complete image of the guest physical memory in ELF format. It contains most of the data found in MEMORY.DMP but the structure is different. Use Volatility to convert it:

python vol.py -f /tmp/vmdump --profile=Win10x64 raw2dmp -O /tmp/MEMORY.DMP
Volatility Foundation Volatility Framework 2.5
Writing data (5.00 MB chunks): 
|..........................................................................
...........................................................................
...........................................................................
...........................................................................
...........................................................................
..............................|

Note that you have to know or guess (or let Volatility guess) the version of Windows running in the VM, sometimes down to the Service Pack / build level. If all works well, the resulting MEMORY.DMP can be opened in windbg as usual. You just won’t see any bug check code and BSOD parameters (duh!) and probably no context either.
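Before hunting for the right Volatility profile, it can be worth sanity-checking that the file really is the ELF image described above. A minimal Python sketch (the /tmp/vmdump path is just the example from earlier):

```python
def is_elf_dump(header: bytes) -> bool:
    # dump-guest-memory (protocol=file:...) writes an ELF image,
    # so the file must start with the 4-byte ELF magic.
    return header[:4] == b"\x7fELF"

# On a real dump you would check the first bytes of the file, e.g.:
#   with open("/tmp/vmdump", "rb") as f:
#       ok = is_elf_dump(f.read(4))
# Quick self-check on synthetic data:
assert is_elf_dump(b"\x7fELF\x02\x01\x01\x00")
assert not is_elf_dump(b"PAGEDU64")  # that's a native Windows dump header
```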

4. Push dump data out of the VM programmatically

This is advanced and more of a note to self, as I hope to turn it into a tool at some point. Windows offers two kernel APIs of interest.

KeRegisterBugCheckReasonCallback with KbCallbackDumpIo
In theory, this should allow a kernel driver to be invoked by Windows as the BSOD is happening and memory is being dumped to the page file. The driver is given a pointer to the memory blocks being dumped and can push this data out of the VM using, for instance, a virtual serial port. Quite understandably, the callback code running on BSOD is very limited in what it can do. Memory allocations are obviously out of the question as is everything else that might end up allocating. What’s worse is that it’s not clear if the addresses passed to the callback are virtual or physical. The first part of the dump (header) is indicated with virtual addresses. The second part (body) starts with a few virtual pages and is then followed by physical pages indicated with physical addresses. Secondary data is again virtual. I might be missing something but apparently I’m not the only one observing this. As it stands now, this API is not usable without ugly and fragile heuristics.

KeInitializeCrashDumpHeader
This call produces the crash dump header, a 4096-byte data structure at offset 0 of MEMORY.DMP. And it works, as long as the physical memory layout that Windows works with is identical to that of whoever is producing the actual physical memory data to be appended to the header. Which sadly doesn’t seem to be the case with QEMU’s dump-guest-memory. So on one hand, having a valid and correct header saves us from the guesswork that Volatility has to do (no need to mess with Volatility “profiles”). But it’s still necessary to at least understand, or better, patch the header to adapt it to the guest memory dump. Here’s a partial description of the dump header layout: computer.forensikblog.de/en/2006/03/dmp-file-structure.html.
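For reference, the very beginning of that header can be inspected with a few lines of Python. This sketch decodes only the 8-byte signature at offset 0, following the dump-header layout described in the blog post linked above; all other header fields are left alone:

```python
def dump_flavor(header: bytes) -> str:
    # MEMORY.DMP starts with "PAGE" followed by a validity tag that
    # distinguishes 32-bit from 64-bit dumps.
    if header[:4] != b"PAGE":
        return "not a crash dump"
    tag = header[4:8]
    if tag == b"DUMP":
        return "32-bit dump"
    if tag == b"DU64":
        return "64-bit dump"
    return "unknown dump type"

assert dump_flavor(b"PAGEDU64" + b"\x00" * 4088) == "64-bit dump"
assert dump_flavor(b"\x7fELF" + b"\x00" * 4092) == "not a crash dump"
```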


by ladipro at January 06, 2017 02:21 PM

January 05, 2017

Daniel Berrange

ANNOUNCE: New libvirt project Go XML parser model

Shortly before Christmas, I announced the availability of new Go bindings for the libvirt API. This post announces a companion package for dealing with XML parsing/formatting in Go. The master repository is available on the libvirt GIT server, but it is expected that Go projects will consume it via an import of the github mirror, since the Go ecosystem is heavily github focused (e.g. godoc.org can’t produce docs for stuff hosted on libvirt.org git).

import (
  libvirtxml "github.com/libvirt/libvirt-go-xml"
  "encoding/xml"
)

domcfg := &libvirtxml.Domain{Type: "kvm", Name: "demo",
                             UUID: "8f99e332-06c4-463a-9099-330fb244e1b3",
                             ....}
xmldoc, err := xml.Marshal(domcfg)

API documentation is available on the godoc website.

When dealing with the libvirt API, most applications will find themselves needing to either parse or format XML documents describing the configuration of various libvirt objects. Traditionally this task has been left up to the application to deal with, and as a result most applications end up creating some kind of structure / object model to represent the XML document in a more easily accessible manner. To try to reduce this duplicated effort, libvirt has already created the libvirt-glib package, which contains a libvirt-gconfig library mapping libvirt XML documents into the GObject world. This library is accessible to many programming languages via the magic of GObject Introspection, and while there is some work to support this in Go, it is not particularly mature at this time.

In the Go world, there is a package “encoding/xml” which is able to transform between XML documents and Go structs, given suitable annotations on the struct fields. It is very easy to deal with, simply requiring someone to define a big set of structs with annotated fields that map to the XML document. There’s no real “code” to write, as it is really a data definition task. Looking at applications using libvirt in Go, we see quite a few have already gone down this route for dealing with libvirt XML. It should be readily apparent that every application using libvirt in Go is essentially going to end up writing an identical set of structs to deal with the XML handling. This duplication of effort makes no sense at all, and as such, we have started this new libvirt-go-xml package to provide a standard set of Go structs to deal with libvirt XML. The current level of schema support is pretty minimal, covering the capabilities XML, secrets XML and a small amount of the domain XML, so we’d encourage anyone interested in this to contribute patches to expand the XML schema coverage.

The following illustrates a further example of its usage in combination with the libvirt-go library (with error checking omitted for brevity):

import (
  libvirt "github.com/libvirt/libvirt-go"
  libvirtxml "github.com/libvirt/libvirt-go-xml"
  "encoding/xml"
  "fmt"
)

conn, err := libvirt.NewConnect("qemu:///system")
dom, _ := conn.LookupDomainByName("demo")
xmldoc, err := dom.GetXMLDesc(0)

domcfg := &libvirtxml.Domain{}
err = xml.Unmarshal([]byte(xmldoc), domcfg)

fmt.Printf("Virt type %s", domcfg.Type)

 

by Daniel Berrange at January 05, 2017 12:15 PM

December 21, 2016

Eduardo Habkost

QEMU APIs: introduction to QemuOpts

This post is a short introduction to the QemuOpts API inside QEMU. This is part of a series, see the introduction for other pointers and additional information.

QemuOpts was introduced in 2009. It is a simple abstraction that handles two tasks:

  1. Parsing of config files and command-line options
  2. Storage of configuration options

Data structures

The QemuOpts data model is pretty simple:

  • QemuOptsList carries the list of all options belonging to a given config group. Each entity is represented by a QemuOpts struct.

  • QemuOpts represents a set of key-value pairs. (Some of the code refers to that as a config group, but to avoid confusion with QemuOptsList, I will call them config sections).

  • QemuOpt is a single key=value pair.

Some config groups have multiple QemuOpts structs (e.g. “drive”, “object”, “device”, that represent multiple drives, multiple objects, and multiple devices, respectively), while others always have only one QemuOpts struct (e.g. the “machine” config group).

For example, the following command-line options:

-drive id=disk1,file=disk.raw,format=raw,if=ide \
-drive id=disk2,file=disk.qcow2,format=qcow2,if=virtio \
-machine usb=on -machine accel=kvm

are represented internally as:

Diagram showing two QemuOptsList objects: qemu_drive_opts and qemu_machine_opts. qemu_drive_opts has two QemuOpts entries: disk1 and disk2. disk1 has three QemuOpt entries: file=disk.raw, format=raw, if=ide. disk2 has three QemuOpt entries: file=disk.qcow2, format=qcow2, if=virtio. qemu_machine_opts has one QemuOpts entry. The QemuOpts entry for qemu_machine_opts has two QemuOpt entries: usb=on, accel=kvm.
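As a rough mental model (plain Python dictionaries standing in for the C structs, not QEMU code), the diagram above corresponds to something like:

```python
# QemuOptsList -> config group; each key is the id of one QemuOpts struct;
# each inner dict holds that QemuOpts' QemuOpt key=value pairs.
qemu_drive_opts = {
    "disk1": {"file": "disk.raw",   "format": "raw",   "if": "ide"},
    "disk2": {"file": "disk.qcow2", "format": "qcow2", "if": "virtio"},
}

# The "machine" group always has a single (id-less) QemuOpts struct;
# repeated -machine options are merged into it (merge_lists = true).
qemu_machine_opts = {None: {"usb": "on", "accel": "kvm"}}

assert qemu_drive_opts["disk2"]["format"] == "qcow2"
assert qemu_machine_opts[None] == {"usb": "on", "accel": "kvm"}
```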

Data Types

QemuOpts supports a limited number of data types for option values:

  • Strings
  • Boolean options
  • Numbers (integers)
  • Sizes

Strings

Strings are just used as-is, after the command-line or config file is parsed.

Note: On the command-line, options are separated by commas, but commas inside option values can be escaped as ,,.
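To illustrate the escaping rule, here is a toy re-implementation of the comma splitting in Python (an illustrative sketch, not QEMU's actual parser):

```python
def split_opts(s):
    # Split a QEMU option string on commas, treating ",," as a
    # literal comma inside an option value.
    parts, cur, i = [], "", 0
    while i < len(s):
        if s[i] == ",":
            if i + 1 < len(s) and s[i + 1] == ",":  # escaped comma
                cur += ","
                i += 2
                continue
            parts.append(cur)  # unescaped comma: option separator
            cur = ""
            i += 1
        else:
            cur += s[i]
            i += 1
    parts.append(cur)
    return parts

assert split_opts("file=a,,b,format=raw") == ["file=a,b", "format=raw"]
```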

Boolean options

The QemuOpts parser accepts only “on” and “off” as values for boolean options.

Warning: note that this behavior is different from the QOM property parser. I plan to explore this in future posts.

Numbers (integers)

Numbers are supposed to be unsigned 64-bit integers. However, the code relies on the behavior of strtoull() and does not reject negative numbers; the parsed uint64_t value might then be converted to a signed integer later. For example, the following command line is not rejected by QEMU:

$ qemu-system-x86_64 -smp cpus=-18446744073709551615,cores=1,threads=1

I don’t know if there is existing code that requires negative numbers to be accepted by the QemuOpts parser. I assume it exists, so we couldn’t easily change the existing parsing rules without breaking existing code.
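The reason the command line above is accepted becomes clearer when you mimic what strtoull() does with a leading minus sign: it parses the magnitude and then negates it in unsigned 64-bit arithmetic, i.e. reduces it modulo 2**64. A small Python illustration (not QEMU code):

```python
def strtoull_like(s):
    # Mimic strtoull(): parse the magnitude, negate on a leading '-',
    # and wrap the result into the unsigned 64-bit range.
    neg = s.startswith("-")
    val = int(s.lstrip("+-"))
    return (-val if neg else val) % 2**64

# So the "weird" -smp value above is actually accepted as cpus=1:
assert strtoull_like("-18446744073709551615") == 1
assert strtoull_like("18446744073709551615") == 2**64 - 1
```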

Sizes

Sizes are represented internally as integers, but the parser accepts suffixes like K, M, G, T.

qemu-system-x86_64 -m size=2G

is equivalent to:

qemu-system-x86_64 -m size=2048M

Note: there are two different size-suffix parsers inside QEMU: one at util/cutils.c and another at util/qemu-option.c. Figuring out which one is going to be used is left as an exercise to the reader.
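For illustration, here is a simplified size parser with the usual power-of-1024 suffixes (a sketch only; as noted above, QEMU itself has two real implementations with their own quirks):

```python
def parse_size(s):
    # Accept an optional K/M/G/T suffix, each a power of 1024.
    suffixes = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}
    if s and s[-1].upper() in suffixes:
        return int(s[:-1]) * suffixes[s[-1].upper()]
    return int(s)

# -m size=2G and -m size=2048M describe the same amount of memory:
assert parse_size("2G") == parse_size("2048M") == 2 * (1 << 30)
```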

Working around the QemuOpts parsers

QEMU code sometimes uses tricks to avoid or work around the QemuOpts option value parsers:

Example 1: using the raw option value

It is possible to get the original raw option value as a string using qemu_opt_get(), even after it was already parsed. For example, the code that handles memory options in QEMU does that, to ensure a suffix-less number is interpreted as Mebibytes, not bytes:

    mem_str = qemu_opt_get(opts, "size");
    if (mem_str) {
        /* [...] */
        sz = qemu_opt_get_size(opts, "size", ram_size);
        /* Fix up legacy suffix-less format */
        if (g_ascii_isdigit(mem_str[strlen(mem_str) - 1])) {
            sz <<= 20;
            /* [...] */
        }
    }
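Rendered in Python, the fix-up logic of that C snippet boils down to the following (an illustrative translation, not the actual QEMU code):

```python
def legacy_mem_size(mem_str, parsed_bytes):
    # If the user gave a bare number (last character is a digit),
    # the value is interpreted as MiB, so shift it up by 20 bits.
    if mem_str and mem_str[-1].isdigit():
        return parsed_bytes << 20
    return parsed_bytes

assert legacy_mem_size("128", 128) == 128 * 1024 * 1024  # "-m 128" -> 128 MiB
assert legacy_mem_size("1G", 1 << 30) == 1 << 30         # suffix given -> kept
```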

Example 2: empty option name list

Some options do not use the QemuOpts value parsers at all, by not defining any option names in the QemuOptsList struct. In those cases, the option values are parsed and validated using different methods. Some examples:

static QemuOptsList qemu_machine_opts = {
    .name = "machine",
    .implied_opt_name = "type",
    .merge_lists = true,
    .head = QTAILQ_HEAD_INITIALIZER(qemu_machine_opts.head),
    .desc = {
        /*
         * no elements => accept any
         * sanity checking will happen later
         * when setting machine properties
         */
        { }
    },
};
static QemuOptsList qemu_acpi_opts = {
    .name = "acpi",
    .implied_opt_name = "data",
    .head = QTAILQ_HEAD_INITIALIZER(qemu_acpi_opts.head),
    .desc = { { 0 } } /* validated with OptsVisitor */
};

This is a common pattern when options are translated to other data representations: mostly QOM properties or QAPI structs. I plan to explore this in a future blog post.

The following config groups use this method and do their own parsing/validation of config options: acpi, device, drive, machine, net, netdev, numa, object, smbios, tpmdev

-writeconfig

The QemuOpts code is responsible for two tasks:

  1. Parsing command-line options and config files
  2. Storage of configuration options

This means that config options are sometimes parsed by custom code and then converted to QemuOpts data structures. Storing config options inside QemuOpts allows the existing QEMU configuration to be written to a file using the -writeconfig command-line option.

The original commit introducing -writeconfig describes it this way:

In theory you should be able to do:

qemu < machine config cmd line switches here > -writeconfig vm.cfg
qemu -readconfig vm.cfg

In practice it will not work. Not all command line switches are converted to QemuOpts, so you’ll have to keep the not-yet converted ones on the second line. Also there might be bugs lurking which prevent even the converted ones from working correctly.

This has improved over the years, but the comment still applies today: most command-line options are converted to QemuOpts options, but not all of them.

Further reading

by Eduardo Habkost at December 21, 2016 11:00 PM

December 17, 2016

Stefan Hajnoczi

13 years of using Linux

I've been using Linux on both work and personal machines for 13 years. Over time I've tried various distributions, changed the nature of my work, and revisited other operating systems to arrive back to the same conclusion every time: Linux works best for me.

The reason I started using Linux remains the reason why it's my operating system of choice today:

It's free and easy to install most software under an open source license that allows both commercial and non-commercial use.

That means software to do common tasks is available for free without limitations. The cost of entry for exploring and learning new things is zero.

The amount of packaged software available in major Linux distributions is incredible. Niche open source operating systems don't have this wide selection of high-quality software. Proprietary operating systems have high-quality software but there is constant irritation in dealing with the artificial limitations of closed source software. The strength of Linux is this sweet spot between high-quality mainstream software and the advantages of open source software.

The pain points of Linux have changed over the years. In the beginning hardware support was limited. This has largely been solved for laptop, desktop, and server hardware as vendors began to contribute drivers and publish hardware datasheets free of NDAs. Class-compliant USB devices also cut down on the number of vendor-specific drivers. Nowadays the reputation for limited hardware support is largely unjustified.

Another issue that has subsided is the Windows-only software that kept many people tied to that platform. Two trends killed Windows-only software: the move to the web and the rise of the Mac. A lot of applications migrated to pure web applications without the need for ActiveX or Java applets with platform-specific code - and Adobe Flash is close to its end too. Ever since Macs rose to popularity again it was no longer acceptable to ship Windows-only software. As a result so many things are now on the web or cross-platform applications with Linux support.

Migrating to Linux is still a big change just like switching from Windows to Mac is a big change. It will always be hard to overcome this, even with virtualization, because the virtual machine experience isn't seamless. Ultimately users need to pick native applications and import their existing data. And it's worth it because you get access to applications that can do almost everything without the hassles of proprietary platforms. That's the lasting advantage that Linux has over the competition.

by stefanha (noreply@blogger.com) at December 17, 2016 06:00 PM

December 15, 2016

Daniel Berrange

ANNOUNCE: New libvirt project Go language bindings

I’m happy to announce that the libvirt project is now supporting Go language bindings as a primary deliverable, joining Python and Perl as language bindings with 100% API coverage of the libvirt C library. The master repository is available on the libvirt GIT server, but it is expected that Go projects will consume it via an import of the github mirror, since the Go ecosystem is heavily github focused (e.g. godoc.org can’t produce docs for stuff hosted on libvirt.org git).

import (
    libvirt "github.com/libvirt/libvirt-go"
)

conn, err := libvirt.NewConnect("qemu:///system")
...

API documentation is available on the godoc website.

For a while now libvirt has relied on 3rd parties to provide Go language bindings. The one most people use was first created by Alex Zorin and then taken over by Kyle Kelly. There’s been a lot of excellent work put into these bindings; however, the API coverage largely stops at what was available in libvirt 1.2.2, with the exception of a few APIs from libvirt 1.2.14 which have to be enabled via Go build tags. Libvirt is now working on version 3.0.0 and many APIs have been added in that time, not to mention enums and other constants. Comparing the current libvirt-go API coverage against what the libvirt C library exposes reveals 163 missing functions (out of 476 total), 367 missing enum constants (out of 847 total) and 165 missing macro constants (out of 200 total). IOW, while a lot is already implemented, there was still a very long way to go.

Initially I intended to contribute patches to address the missing API coverage to the existing libvirt-go bindings. In looking at the code, though, I had some concerns about the way some of the APIs had been exposed to Go. In the libvirt C library there is a set of APIs which accept or return a “virTypedParameterPtr” array, for cases where we need APIs to be easily extensible to handle additions of arbitrary extra data fields in the future. The way these APIs work is one of the most ugly and unpleasant parts of the C API, and thus in language bindings we never expose the virTypedParameter concept directly, but instead map it into a more suitable language-specific data structure. In Perl and Python this meant mapping them to hash tables, which gives application developers a language-friendly way to interact with the APIs. Unfortunately the current Go API bindings have exposed the virTypedParameter concept directly to Go, and since Go does not support unions, the result is arguably even more unpleasant in Go than it already is in C. The second concern is with the way events are exposed to Go – in the C layer we have different callbacks that are needed for each event type, but only one method for registering callbacks, requiring an ugly type cast. This was again exposed directly in Go, meaning that the Go compiler can’t do strong type checking on the callback registration, instead only doing a runtime check at time of event dispatch. There were some other minor concerns about the Go API mapping, such as the fact that it needlessly exposed the “vir” prefix on all methods & constants despite already being in a “libvirt” package namespace, and that it returned a struct instead of a pointer to a struct for objects. Understandably the current maintainer had a desire to keep API compatibility going forward, so the decision was made to fork the existing libvirt-go codebase.
This allowed us to take advantage of all the work put in so far, while fixing the design problems, and also extending them to have 100% API coverage. The idea is that applications can then decide to opt-in to the new Go binding at a point in time where they’re ready to adapt their code to the API changes.

For users of the existing libvirt Go binding, converting to the new official libvirt Go binding requires a little bit of work, but nothing too serious and will simplify the code if using any of the typed parameter methods. The changes are roughly as follows:

  • The “VIR_” prefix is dropped from all constants. eg libvirt.VIR_DOMAIN_METADATA_DESCRIPTION becomes libvirt.DOMAIN_METADATA_DESCRIPTION
  • The “vir” prefix is dropped from all types. eg libvirt.virDomain becomes libvirt.Domain
  • Methods returning objects now return a pointer, eg “*Domain” instead of “Domain”, allowing us to return the usual “nil” constant on error, instead of a struct with no underlying libvirt connection
  • The domain events DomainEventRegister method has been replaced by a separate method for each event type. eg DomainEventLifecycleRegister, DomainEventRebootRegister, etc giving compile time type checking of callbacks
  • The domain events API now accepts a single callback, instead of taking a pair of callbacks – the caller can create an anonymous function to invoke multiple things if required.
  • Methods accepting or returning typed parameters now have a formal struct defined to expose all the parameters in a manner that allows direct access without type casts and enables normal Go compile time type checking. eg the Domain.GetBlockIOTune method returns a DomainBlockIoTuneParameters struct
  • It is no longer necessary to use Go compiler build tags to access functionality in different libvirt versions. Through the magic of conditional compilation, the binding will transparently build against every libvirt version from 1.2.0 through 3.0.0
  • The binding can find libvirt via pkg-config making it easy to compile against a libvirt installed to a non-standard location by simply setting “PKG_CONFIG_PATH”
  • There is 100% coverage of all APIs [1], constants and macros, verified by the libvirt CI system, so that it always keeps up with GIT master of the Libvirt C library.
  • The error callback concept is removed from the binding as this is deprecated by libvirt due to not being thread safe. It was also redundant since every method already directly returns an error object.
  • There are now explicit types defined for all enums and methods which take flags or enums now use these types instead of “uint32”, again allowing stronger compiler type checking

With the exception of the typed parameter changes, adapting existing apps should be a largely mechanical conversion to the new names.

Again, without the effort put in by Alex Zorin, Kyle Kelly and other community contributors, creation of these new libvirt-go bindings would have taken at least 4-5 weeks instead of the 2 weeks of effort put into this. So there’s a huge debt owed to all the people who previously contributed to libvirt Go bindings. I hope that having these new bindings with guaranteed 100% API coverage will be of benefit to the Go community going forward.

[1] At time of writing this is a slight lie, as I’ve not quite finished the virStream and virEvent callback method bindings, but this will be done shortly.

by Daniel Berrange at December 15, 2016 11:56 AM

December 06, 2016

Kashyap Chamarthy

QEMU Advent Calendar 2016

The QEMU Advent Calendar 2016 website features a QEMU disk image each day from 01-DEC to 24-DEC. Each day a new package becomes available for download (in tar.xz format) which contains a file describing the image (readme.txt or similar), and a little “run” shell script that starts QEMU with the recommended command-line parameters for the disk image.

The disk images contain interesting operating systems and software that run under the QEMU emulator. Some of them are well-known or not-so-well-known operating systems, old and new, others are custom demos and neat algorithms. [From the About section.]

This is brought to you by Thomas Huth (his initial announcement here) and yours truly.


Explore the last five days of images from the 2016 edition here! [Extract the download with, e.g. for Day 05: tar -xf day05.tar.xz]

PS: We still have a few open slots, so please don’t hesitate to contact us if you have any fun disk image(s) to contribute.


by kashyapc at December 06, 2016 03:10 PM

December 01, 2016

Thomas Huth

QEMU Advent Calendar 2016 starts

The QEMU Advent Calendar 2016 website reveals a new disk image for download each day in December until 2016-12-24, to create a fun experience for the QEMU community, and to celebrate the 10th anniversary of KVM. Starting today, on December 1st, the first door of the advent calendar can now be opened – Get ready for some 16-bit retro feeling!

December 01, 2016 08:40 AM

November 29, 2016

Eduardo Habkost

An incomplete list of QEMU APIs

Having seen many people (including myself) feeling confused about the purpose of some of QEMU’s internal APIs when reviewing and contributing code to QEMU, I am trying to document what I have learned about them.

I want to make more detailed blog posts about some of them, stating their goals (as I perceive them), where they are used, and what we can expect to see happening to them in the future. When I do that, I will update this post to include pointers to the more detailed content.

QemuOpts

Introduced in 2009. Compared to the newer abstractions below, it is quite simple. As described in the original commit: it “stores device parameters in a better way than unparsed strings”. It is still used by configuration and command-line parsing code.

Making QemuOpts work with the more modern abstractions (esp. QOM and QAPI) may be painful. Sometimes you can pretend it is not there, but you can’t run away from it if you are dealing with QEMU configuration or command-line parameters.

See also: the Introduction to QemuOpts blog post.

qdev

qdev was added to QEMU in 2009. qdev manages the QEMU device tree, based on a hierarchy of buses and devices. You can see the device tree managed by qdev using the info qtree monitor command in QEMU.

qdev allows device code to register implementations of device types. Machine code, on the other hand, would instantiate those devices and configure them by setting properties, and not accessing internal device data structures directly. Some devices can be plugged from the QEMU monitor or command-line, and their properties can be configured as arguments to the -device option or device_add command.

From the original code:

The theory here is that it should be possible to create a machine without knowledge of specific devices. Historically board init routines have passed a bunch of arguments to each device, requiring the board know exactly which device it is dealing with. This file provides an abstract API for device configuration and initialization. Devices will generally inherit from a particular bus (e.g. PCI or I2C) rather than this API directly.

Some may argue that qdev doesn’t exist anymore, and was replaced by QOM. Others (including myself) describe it as being built on top of QOM. Either way you describe it, the same features provided by the original qdev code are provided by the QOM-based code living in hw/core.

See also:

  • KVM Forum 2010 talk by Markus Armbruster: “QEMU’s new device model qdev” (slides)
  • KVM Forum 2011 talk by Markus Armbruster: “QEMU’s device model qdev: Where do we go from here?” (slides. video)
  • KVM Forum 2013 talk by Andreas Färber: “Modern QEMU Devices” (slides, video)

QOM

QOM is short for QEMU Object Model and was introduced in 2011. It is heavily documented in its header file. It started as a generalization of qdev. Today the device tree and backend objects are managed through the QOM object tree.

From its documentation:

The QEMU Object Model provides a framework for registering user creatable types and instantiating objects from those types. QOM provides the following features:

  • System for dynamically registering types
  • Support for single-inheritance of types
  • Multiple inheritance of stateless interfaces

QOM also has a property system for introspection and object/device configuration. qdev’s property system is built on top of QOM’s property system.

Some QOM types and their properties are meant to be used internally only (e.g. some devices that are not pluggable and only created by machine code; accelerator objects). Some types can be instantiated and configured directly from the QEMU monitor or command-line (using, e.g., -device, device_add, -object, object-add).

See also:

  • KVM Forum 2014 talk by Paolo Bonzini: “QOM exegesis and apocalypse” (slides, video).

VMState

VMState was introduced in 2009. It was added to change the device state saving/loading (for savevm and migration) from error-prone ad-hoc coding to a table-based approach.

From the original commit:

This patch introduces VMState infrastructure, to convert the save/load functions of devices to a table approach. This new approach has the following advantages:

  • it is type-safe
  • you can’t have load/save functions out of sync
  • will allows us to have new interesting commands, like dump , that shows all its internal state.
  • Just now, the only added type is arrays, but we can add structures.
  • Uses old load_state() function for loading old state.

See also:

  • KVM Forum 2010 talk by Juan Quintela: “Migration: How to hop from machine to machine without losing state” (slides)
  • KVM Forum 2011 talk by Juan Quintela: “Migration: one year later” (slides, video)
  • KVM Forum 2012 talk by Michael Roth: “QIDL: An Embedded Language to Serialize Guest Data Structures for Live Migration” (slides)

QMP

QMP is the QEMU Machine Protocol. Introduced in 2009. From its documentation:

The QEMU Machine Protocol (QMP) allows applications to operate a QEMU instance.

QMP is JSON based and features the following:

  • Lightweight, text-based, easy to parse data format
  • Asynchronous messages support (i.e. events)
  • Capabilities Negotiation

For detailed information on QMP’s usage, please, refer to the following files:

  • qmp-spec.txt QEMU Machine Protocol current specification
  • qmp-commands.txt QMP supported commands (auto-generated at build-time)
  • qmp-events.txt List of available asynchronous events

See also: KVM Forum 2010 talk by Luiz Capitulino, A Quick Tour of the QEMU Monitor Protocol.

QObject

QObject was introduced in 2009. It was added during the work to add QMP. It provides a generic QObject data type, and available subtypes include integers, strings, lists, and dictionaries. It includes reference counting. It was also called QEMU Object Model when the code was introduced, but do not confuse it with QOM.

It started as a simple implementation, but was later expanded to support all the data types defined in the QAPI schema (see below).

QAPI

QAPI was introduced in 2011. The original documentation (which can be outdated) can be seen at http://wiki.qemu.org/Features/QAPI.

From the original patch series:

Goals of QAPI

1) Make all interfaces consumable in C such that we can use the interfaces in QEMU

2) Make all interfaces exposed through a library using code generation from static introspection

3) Make all interfaces well specified in a formal schema

From the documentation:

QAPI is a native C API within QEMU which provides management-level functionality to internal and external users. For external users/processes, this interface is made available by a JSON-based wire format for the QEMU Monitor Protocol (QMP) for controlling qemu, as well as the QEMU Guest Agent (QGA) for communicating with the guest. The remainder of this document uses “Client JSON Protocol” when referring to the wire contents of a QMP or QGA connection.

To map Client JSON Protocol interfaces to the native C QAPI implementations, a JSON-based schema is used to define types and function signatures, and a set of scripts is used to generate types, signatures, and marshaling/dispatch code. This document will describe how the schemas, scripts, and resulting code are used.

See also:

  • KVM Forum 2011 talk by Anthony Liguori: “Code Generation for Fun and Profit” (slides, video)

Visitor API

QAPI includes an API to define and use visitors for the QAPI-defined data types. Visitors are the mechanism used to serialize QAPI data to/from the external world (e.g. through QMP, the command-line, or config files).

From its documentation:

The QAPI schema defines both a set of C data types, and a QMP wire format. QAPI objects can contain references to other QAPI objects, resulting in a directed acyclic graph. QAPI also generates visitor functions to walk these graphs. This file represents the interface for doing work at each node of a QAPI graph; it can also be used for a virtual walk, where there is no actual QAPI C struct.

There are four kinds of visitor classes: input visitors (QObject, string, and QemuOpts) parse an external representation and build the corresponding QAPI graph, output visitors (QObject and string) take a completed QAPI graph and generate an external representation, the dealloc visitor can take a QAPI graph (possibly partially constructed) and recursively free its resources, and the clone visitor performs a deep clone of one QAPI object to another. While the dealloc and QObject input/output visitors are general, the string, QemuOpts, and clone visitors have some implementation limitations; see the documentation for each visitor for more details on what it supports. Also, see visitor-impl.h for the callback contracts implemented by each visitor, and docs/qapi-code-gen.txt for more about the QAPI code generator.

The End

Although large, this list is incomplete. In the near future, I plan to write about QAPI, QOM, and QemuOpts, and how they work (and sometimes don’t work) together.

Most of the abstractions above are about data modeling, in one way or another. That’s not a coincidence: one of the things I want to write about is how those data abstractions sometimes have conflicting world views, and the issues resulting from that.

Feedback wanted: if you have any correction or suggestion to this list, please send your comments. You can use the GitHub page for the post to send comments or suggest changes, or just e-mail me.

by Eduardo Habkost at November 29, 2016 02:51 AM

November 15, 2016

Thomas Huth

QEMU Advent Calendar 2016 website now online

This year, we are celebrating the 10th anniversary of KVM (see Amit Shah’s article on LWN.net for more information), and the 25th anniversary of Linux, so to contribute to this celebration, Kashyap Chamarthy and I are preparing another edition of the QEMU Advent Calendar this year.

The QEMU Advent Calendar 2016 will be a website that reveals a new disk image for download each day in December until 2016-12-24. To motivate people to contribute some disk images to the calendar (we still need some to be able to provide one each day), the new website for the advent calendar is now already online at www.qemu-advent-calendar.org – but of course the doors with the disk images will not open before December 1st.

November 15, 2016 01:50 PM

Daniel Berrange

New TLS algorithm priority config for libvirt with gnutls on Fedora >= 25

Libvirt has long supported use of TLS for its remote API service, using the gnutls library as its backend. When negotiating a TLS session, there are a huge number of possible algorithms that could be used and the client & server need to decide on the best one, where “best” is commonly some notion of “most secure”. The preference for negotiation is expressed by simply having a list of possible algorithms, sorted best to worst, and the client & server choose the first matching entry in their respective lists. Historically libvirt has not expressed any interest in the handshake priority configuration, simply delegating the decision to the gnutls library on the basis that its developers knew better than libvirt developers which are best. In gnutls terminology, this means that libvirt has historically used the “DEFAULT” priority string.

The past year or two has seen a seemingly never ending stream of CVEs related to TLS, some of them particular to specific algorithms. The only way some of these flaws can be addressed is by discontinuing use of the affected algorithm. The TLS library implementations have to be fairly conservative in dropping algorithms, because this has an effect on consumers of the library in question. There is also potentially a significant delay between a decision to discontinue support for an algorithm, and updated libraries being deployed to hosts. To address this, Fedora 21 introduced the ability to define the algorithm priority strings in host configuration files, outside of the library code. With this, system administrators can edit the file /etc/crypto-policies/config to change the algorithm priority for all apps using TLS on the host. After editing this file, the update-crypto-policies command is run to generate the library specific configuration files. For example, it populates /etc/crypto-policies/back-ends/gnutls.config. In gnutls, use of this file is enabled by specifying that an application wants to use the “@SYSTEM” priority string.

This is a good step forward, as it takes the configuration out of source code and into data files, but it has limited flexibility because it applies to all apps on the host. There can be two apps on a host which have mutually incompatible views about what the best algorithm priority is. For example, a web browser will want to be fairly conservative in dropping algorithms to avoid breaking access to countless websites. An application like libvirtd though, where there is a well known set of servers and clients to connect in any site, can be fairly aggressive in only supporting the very best algorithms. What is desired is a way to override the algorithm priority per application. Now of course this can easily be done via the application’s own configuration file, and so libvirt has added a new parameter “tls_priority” to /etc/libvirt/libvirtd.conf

The downside of using the application’s own configuration is that the system administrator has to go hunting through many different files to update each application. It is much nicer to have a central location where the TLS priority settings for all applications can be controlled. What is desired is a way for libvirt to be built such that it can tell gnutls to first look for a libvirt specific priority string, and then fall back to the global priority string. To address this, patches were written for GNUTLS to extend its priority string syntax. It is now possible for libvirt to pass “@LIBVIRT,SYSTEM” to gnutls as the priority. It will thus read /etc/crypto-policies/back-ends/gnutls.config first looking for an entry matching “LIBVIRT” and then looking for an entry matching “SYSTEM“. To go along with the gnutls change, there is also an enhancement to the update-crypto-policies tool to allow application specific entries to be included when generating the /etc/crypto-policies/back-ends/gnutls.config file. It is thus possible to configure the libvirt priority string by simply creating a file /etc/crypto-policies/local.d/gnutls-libvirt.config containing the desired string and re-running update-crypto-policies.
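As a concrete sketch, the local.d file could look like the fragment below. The priority string shown here is a hypothetical example only – choose one appropriate for your site, per the gnutls priority string documentation:

```
# /etc/crypto-policies/local.d/gnutls-libvirt.config
# Example only: restrict libvirt to TLS 1.2 with strong ciphersuites.
LIBVIRT=SECURE256:+SECURE128:-VERS-TLS-ALL:+VERS-TLS1.2
```

After re-running update-crypto-policies, the generated /etc/crypto-policies/back-ends/gnutls.config gains a “LIBVIRT” entry, which a libvirt built with the “@LIBVIRT,SYSTEM” priority will use in preference to the “SYSTEM” entry.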

In summary, the libvirt default priority settings are now:

  • RHEL-6/7 – NORMAL – a string hard coded in gnutls at build time
  • Fedora 21-24 – @SYSTEM – a priority level defined by the sysadmin based on /etc/crypto-policies/config
  • Fedora >= 25 – @LIBVIRT,SYSTEM – a raw priority string defined in /etc/crypto-policies/local.d/gnutls-libvirt.config, falling back to /etc/crypto-policies/config if not present.

In all cases it is still possible to customize in /etc/libvirt/libvirtd.conf via the tls_priority setting, but it is recommended to use the global system /etc/crypto-policies facility where possible.

by Daniel Berrange at November 15, 2016 12:08 PM

November 11, 2016

Daniel Berrange

New libvirt website design

The previous libvirt website design dated from circa 2008, just a few years after the libvirt project started. We have grown a lot of content since that time, but the overall styling and layout of the libvirt website has not substantially changed. Compared to websites for more recently launched projects, libvirt was starting to look rather outdated. So I spent a little time coming up with a new design for the libvirt website to bring it into the modern era. There were two core aspects to the new design: simplify the layout and navigation, and implement new branding.

From the layout / navigation POV we have killed the massive expanding menu that was on the left hand side of every page. It was not serving its purpose very effectively since it required far too many clicks & page loads to access some of the most frequently needed content. The front page now has direct links to key pieces of content (as identified from our web access stats), while the rest of the pages are directly visible in a flat menu on the “docs” page. The download page has been overhauled to recognise the fact that libvirt is shipping more than just the core C library – we have language bindings, object model mappings, docs and test suites. Also new is a page directly targeting new contributors with information about how to get involved in the project and the kind of help we’re looking for. The final notable change is the use of some jQuery magic to pull in a feed of blog posts to the site front page.

From the branding POV, we have taken the opportunity to re-create the project logo. We sadly lost the original master vector artwork used to produce the libvirt logo eons ago, so only had a png file of it in certain resolutions. When creating docbook content, we did have a new SVG created that was intended to mirror the original logo, but it was quite crudely drawn. None the less it was a useful basis to start from to create new master logo graphics. As a result we now have an attractively rendered new logo for the project, available in two variants – a standard square(-ish) format

Libvirt logo

and in a horizontal banner format

Libvirt logo banner

With the new logo prepared, we took the colour palette and font used in the graphic and applied both to the main website content, bringing together a consistent style.

Libvirt website v1 (2006-2008)

libvirt-website-v1-download
libvirt-website-v1-index

Libvirt website v2 (2008-2016)

libvirt-website-v2-index
libvirt-website-v2-download

Libvirt website v3 (2016-)

libvirt-website-v3-index
libvirt-website-v3-download

by Daniel Berrange at November 11, 2016 04:16 PM

November 08, 2016

Gerd Hoffmann

raspberry pi status update

You might have noticed meanwhile that Fedora 25 ships with raspberry pi support and might have wondered what this means for my packages and images.

The fedora images use a different partition layout than my images. Specifically the fedora images have a separate vfat partition for the firmware and uboot, and the /boot partition with the linux kernels lives on ext2. My images have a vfat /boot partition with everything (firmware, uboot, kernels), and the rpms in my repo will only work properly on such an sdcard. You can’t mix & match stuff and there is no easy way to switch from my sdcard layout to the fedora one.

Current plan forward:

I will continue to build rpms for armv7 (32bit) for a while for existing installs. There will be no new fedora 25 images though. For new devices or reinstalls I recommend using the official fedora images instead.

Fedora 25 has no aarch64 (64bit) support, although it is expected to land in one of the next releases. Most likely I’ll create new Fedora 25 images for aarch64 (after final release), and of course I’ll continue to build kernel updates too.

Finally some words on the upstream kernel status:

The 4.8 dwc2 usb host adapter driver has some serious problems on the raspberry pi. 4.7 works ok, and so do the 4.9-rc kernels. But 4.7 doesn’t get stable updates any more, so I jumped straight to the 4.9-rc kernels for mainline. You might have noticed already if you updated your rpi recently. The raspberry pi foundation kernels don’t suffer from that issue as they use a different (not upstream) driver for the dwc.

by Gerd Hoffmann at November 08, 2016 01:54 PM

November 04, 2016

Daniel Berrange

ANNOUNCE: libvirt-glib release 1.0.0

I am pleased to announce that a new release of the libvirt-glib package, version 1.0.0, is now available from

https://libvirt.org/sources/glib/

The packages are GPG signed with

Key fingerprint: DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

Changes in this release:

  • Switch to new release numbering scheme, major digit incremented each year, minor for each release, micro for stable branches (if any)
  • Fix Libs.private variable in pkg-config file
  • Fix git introspection warnings
  • Add ability to set SPICE gl property
  • Add support for virtio video model
  • Add support for 3d accel property
  • Add support for querying video model
  • Add support for host device config for PCI devs
  • Add docs for more APIs
  • Avoid unused variable warnings
  • Fix check for libvirt optional features to use pkg-config
  • Delete manually written python binding. All apps should use PyGObject with gobject introspection.
  • Allow schema to be NULL on config objects
  • Preserve unknown devices listed in XML
  • Add further test coverage

libvirt-glib comprises three distinct libraries:

  • libvirt-glib – Integrate with the GLib event loop and error handling
  • libvirt-gconfig – Representation of libvirt XML documents as GObjects
  • libvirt-gobject – Mapping of libvirt APIs into the GObject type system

NB: While libvirt aims to be API/ABI stable forever, with libvirt-glib we are not currently guaranteeing that the libvirt-glib libraries are permanently API/ABI stable. That said, we do not expect to break the API/ABI for the foreseeable future and will always strive to avoid it.

Follow up comments about libvirt-glib should be directed to the regular libvir-list@redhat.com development list.

Thanks to all the people involved in contributing to this release.

 

by Daniel Berrange at November 04, 2016 05:30 PM


Powered by Planet!
Last updated: February 28, 2017 12:01 PM