Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

November 25, 2021

KVM on Z

RHEL 8.5 AV Released

RHEL 8.5 Advanced Virtualization (AV) is out! See the official announcement and the release notes.

KVM is supported via Advanced Virtualization, which provides

  • QEMU v6.0, supporting virtio-fs on IBM Z
  • libvirt v7.6

Furthermore, RHEL 8.5 AV now supports persisting mediated devices.

For a detailed list of Linux on Z-specific changes, also see this blog entry at Red Hat.

IBM-specific documentation for Red Hat Enterprise Linux 8.5 is available at IBM Documentation here (in particular: Device Drivers, Features and Commands on Red Hat Enterprise Linux 8.5).
See here for how to enable AV in RHEL 8 installs.

by Stefan Raspl (noreply@blogger.com) at November 25, 2021 11:05 AM

November 23, 2021

KVM on Z

New Community: Compass L

Do you know Compass L yet...? This community offers a great opportunity to interact with other users, developers and architects of Linux and KVM on IBM Z!

For further information, see the flyer below, or head over right away and join the community here.


by Stefan Raspl (noreply@blogger.com) at November 23, 2021 11:46 AM

November 21, 2021

Gerd Hoffmann

processing patch mails with b4 and notmuch

This blog post describes my mail setup, with a focus on how I handle patch email. Let's start with a general mail overview. I'm not going too deep into the details here; the internet has plenty of documentation and configuration tutorials.

Outgoing mail

Most of my machines have a local postfix configured for outgoing mail. My workstation and my laptop forward all mail (over vpn) to the company internal email server. All I need for this to work is a relayhost line in /etc/postfix/main.cf:

relayhost = [smtp.corp.redhat.com]

Most unix utilities (including git send-email) try to send mails using /usr/sbin/sendmail by default. This tool will place the mail in the postfix queue for processing. The name of the binary is a convention dating back to the days when sendmail was the one and only unix mail processing daemon.
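
For example, a quick smoke test of that path can be as simple as feeding a message to /usr/sbin/sendmail on stdin (the addresses below are just placeholders):

/usr/sbin/sendmail -t <<'EOF'
From: me@example.com
To: you@example.com
Subject: test mail via the local postfix relay

Hello, just checking that outgoing mail works.
EOF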

Incoming mail

All my mail is synced to local maildir storage. I'm using offlineimap for the job. Plenty of other tools exist, isync is another popular choice.

Local mail storage has the advantage that reading mail is faster, especially if you have a slow internet link. It also makes it easy to index and search all your mail with notmuch.
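
For example, after a sync the typical notmuch workflow is just a handful of commands (the search terms below are only examples):

# index newly synced mail
notmuch new

# find patches from a specific sender
notmuch search from:someone@example.com and subject:PATCH

# print a complete thread for a given message id
notmuch show --entire-thread=true id:some-message-id@example.com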

Filtering mail

I'm using server-side filtering. The major advantage is that I always have the same view of all my mail. I can use a mail client on my workstation, the web interface or a mobile phone. It doesn't matter, I always see the same folder structure.

Reading mail

All modern email clients should be able to use maildir folders. I'm using neomutt. I have also used thunderbird and evolution in the past. All work fine.

The reason I use neomutt is that it is simply faster than GUI-based mailers, which matters when you have to handle a lot of email. It is also very easy to hook up scripts, which is very useful when it comes to patch processing.

Outgoing patches

I'm using git send-email for the simple cases and git-publish for the more complex ones. Here "simple" typically means single changes (not a patch series) where it is unlikely that I will have to send another version addressing review comments.
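
For such simple cases sending the topmost commit off to a list is a one-liner (list address just as an example):

git send-email --to=qemu-devel@nongnu.org -1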

git publish keeps track of the revisions you have sent by storing a git tag in your repo. It also stores the cover letter and the list of people Cc'ed on the patch, so sending out a new revision of a patch series is much easier than with plain git send-email.

git publish also features config profiles. This is helpful for larger projects where different subsystems use different mailing lists (and possibly different development branches too).
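
As a rough sketch, a profile lives in a .gitpublish file in the repository and looks something like this (key names from memory, so double-check against the git-publish documentation for your version):

[gitpublishprofile "qemu"]
base = master
prefix = PATCH
to = qemu-devel@nongnu.org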

Incoming patches

So, here comes the more interesting part: hooking scripts into neomutt for patch processing. Let's start with the config (~/.muttrc) snippet:

# patch processing
bind	index,pager	p	noop			# default: print
macro	index,pager	pa	"<pipe-entry>~/.mutt/bin/patch-apply.sh<enter>"
macro	index,pager	pl	"<pipe-entry>~/.mutt/bin/patch-lore.sh<enter>"

First I map the 'p' key to noop (instead of print, which is the default configuration), which allows using two-key combinations starting with 'p' for patch processing. Then 'pa' is configured to run my patch-apply.sh script, and 'pl' runs patch-lore.sh.

Let's have a look at the patch-apply.sh script which applies a single patch:

#!/bin/sh

# store patch
file="$(mktemp ${TMPDIR-/tmp}/mutt-patch-apply-XXXXXXXX)"
trap "rm -f $file" EXIT
cat > "$file"

# find project
source ~/.mutt/bin/patch-find-project.sh
if test "$project" = ""; then
        echo "ERROR: can't figure project"
        exit 1
fi

# go!
clear
cd $HOME/projects/$project
branch=$(git rev-parse --abbrev-ref HEAD)

clear
echo "#"
echo "# try applying patch to $project, branch $branch"
echo "#"

if git am --message-id --3way --ignore-whitespace --whitespace=fix "$file"; then
        echo "#"
        echo "# OK"
        echo "#"
else
        echo "# FAILED, cleaning up"
        cp -v .git/rebase-apply/patch patch-apply-failed.diff
        cp -v "$file" patch-apply-failed.mail
        git am --abort
        git reset --hard
fi

The mail is passed to the script on stdin, so the first thing the script does is store that mail in a temporary file. Next it tries to figure out which project the patch is for. The logic for that is in a separate file so other scripts can share it, see below. Finally it tries to apply the patch using git am. In case of a failure it stores both the decoded patch and the complete email before cleaning up and exiting.

Now for patch-find-project.sh. This script snippet tries to figure out the project by checking which mailing list the mail was sent to:

#!/bin/sh
if test "$PATCH_PROJECT" != ""; then
        project="$PATCH_PROJECT"
elif grep -q -e "devel@edk2.groups.io" "$file"; then
        project="edk2"
elif grep -q -e "qemu-devel@nongnu.org" "$file"; then
        project="qemu"
# [ ... more checks snipped ... ]
fi
if test "$project" = ""; then
        echo "Can't figure project automatically."
        echo "Use env var PATCH_PROJECT to specify."
fi

The PATCH_PROJECT environment variable can be used to override the autodetect logic if needed.
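
That also means the scripts can be driven from a plain shell instead of neomutt, for example with a patch saved to a file (the file name is just an example):

PATCH_PROJECT=qemu ~/.mutt/bin/patch-apply.sh < some-patch.eml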

The last script is patch-lore.sh. That one tries to apply a complete patch series, with the help of the b4 tool. b4 makes patch series management an order of magnitude simpler. It will find the latest revision of a patch series, bring the patches into the correct order, pick up tags (Reviewed-by, Tested-by, etc.) from replies, check signatures, and more.

#!/bin/sh

# store patch
file="$(mktemp ${TMPDIR-/tmp}/mutt-patch-queue-XXXXXXXX)"
trap "rm -f $file" EXIT
cat > "$file"

# find project
source ~/.mutt/bin/patch-find-project.sh
if test "$project" = ""; then
	echo "ERROR: can't figure project"
	exit 1
fi

# find msgid
msgid=$(grep -i -e "^message-id:" "$file" | head -n 1 \
	| sed -e 's/.*<//' -e 's/>.*//')

# go!
clear
cd $HOME/projects/$project
branch=$(git rev-parse --abbrev-ref HEAD)

clear
echo "#"
echo "# try queuing patch (series) for $project, branch $branch"
echo "#"
echo "# msgid: $msgid"
echo "#"

# create work dir
WORK="${TMPDIR-/tmp}/${0##*/}-$$"
mkdir "$WORK" || exit 1
trap 'rm -rf $file "$WORK"' EXIT

echo "# fetching from lore ..."
echo "#"
b4 am	--outdir "$WORK" \
	--apply-cover-trailers \
	--sloppy-trailers \
	$msgid || exit 1

count=$(ls $WORK/*.mbx 2>/dev/null | wc -l)
if test "$count" = "0"; then
	echo "#"
	echo "# got nothing, trying notmuch instead ..."
	echo "#"
	echo "# update db ..."
	notmuch new
	echo "# find thread ..."
	notmuch show \
		--format=mbox \
		--entire-thread=true \
		id:$msgid > $WORK/notmuch.thread
	echo "# process mails ..."
	b4 am	--outdir "$WORK" \
		--apply-cover-trailers \
		--sloppy-trailers \
		--use-local-mbox $WORK/notmuch.thread \
		$msgid || exit 1
	count=$(ls $WORK/*.mbx 2>/dev/null | wc -l)
fi

echo "#"
echo "# got $count patches, trying to apply ..."
echo "#"
if git am -m -3 $WORK/*.mbx; then
	echo "#"
	echo "# OK"
	echo "#"
else
	echo "# FAILED, cleaning up"
	git am --abort
	git reset --hard
fi

The first part of the script (store mail, find project) is the same as in patch-apply.sh. Then the script extracts the message id of the mail passed in and feeds that into b4. b4 will try to find the email thread on lore.kernel.org. In case this doesn't return any results the script queries notmuch for the email thread instead and feeds that into b4 using the --use-local-mbox switch.

Finally it tries to apply the complete patch series prepared by b4 with git am.
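
The same works manually too, in case you want to try b4 outside of the scripts (the message id below is a placeholder):

b4 am --outdir /tmp 20210901123456.1234-1-someone@example.com
git am -m -3 /tmp/*.mbx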

So, with all that in place, applying a patch series is just two key strokes in neomutt. Well, almost. I still need a terminal on the side which I use to make sure the correct branch is checked out, to run build tests, etc.

by Gerd Hoffmann at November 21, 2021 11:00 PM

November 09, 2021

Stefan Hajnoczi

Peer-to-peer applications with Urbit

This article gives an overview of the architecture of Urbit applications. I spent a weekend trying out Urbit, reading documentation, and digging through the source code. I'm always on the lookout for the next milestone system that will change the internet and computing landscape. In particular, I find decentralized and peer-to-peer systems interesting because I have a sense that the internet is not quite right. It could be better if only someone could figure out how to do that and make it mainstream.

Urbit is an operating system and network designed to give users control by running applications on personal servers instead of on centralized servers operated by the application creators. This means data is stored on personal servers and is not immediately accessible to application creators. Both the Urbit operating system and network run on top of existing computing infrastructure. It's not a baremetal operating system (it runs under Linux, macOS, and Windows) or a new Layer 3 network protocol (it uses UDP). If you want to know more there is an overview of Urbit here.

The operating function

The Urbit kernel, Arvo, is a single-function operating system in the sense of purely functional programming. The operating system function takes the previous state and input events and produces the next state and output events. This means that the state of the system can be saved after each invocation. If there is a power failure or the system otherwise stops execution it's easy to resume it later from the last state.

Urbit has a virtual machine and runtime that supports this programming environment. The low-level language is called Nock and the higher-level language is called Hoon. I haven't investigated them in detail, but they appear to support deterministic purely functional programming with I/O and other side-effects kept outside via monads and passing around inputs like the current time.

Applications

Applications, also called agents, follow the purely functional model where they produce the next state as their result. Agents expose their services in three ways:

  1. Peek, a read-only query that fetches data without changing state.
  2. Poke, a command and response similar to a Remote Procedure Call (RPC).
  3. Subscriptions, a stream of updates that may be delivered over time until the subscription is closed.

For example, an application that keeps a counter can define a poke interface for incrementing the counter and a peek interface for querying its value. A subscription can be used to receive an update whenever the counter changes.

Urbit supports peeks, pokes, and subscriptions over the network. This is how applications running on different personal servers can communicate. If we want to replicate a remote counter we can subscribe to it and then poke our local counter every time an update is received. This replication model leads to the store/hook/view architecture, a way of splitting applications into components that support local state, remote replication, and a user interface. In our counter example the store would be the counter, the hook would be the code that replicates remote counters, and the view would provide any logic needed for the user interface to control the counter.

Interacting with the outside world

User interfaces for applications are typically implemented in Landscape, a web-based user interface for interacting with Urbit from your browser. The user interface can be a modern JavaScript application that communicates with the agent running inside Urbit via the HTTP JSON API. This API supports peeks, pokes, and subscriptions. In other words, the application's backend is implemented as an Urbit agent while the frontend is a regular client-side web application.

Of course there are also APIs for data storage, networking, HTTP, etc. For example, the weather widget in Landscape fetches the weather from a web service using an HTTP request.

Urbit also supports peer discovery so you can resolve the funny IDs like ~bitbet-bolbel and establish connections to remote Urbit instances. The IDs are allocated hierarchically and ultimately registered on the Ethereum blockchain.

Criticisms

Keep in mind I only spent a weekend investigating Urbit so I don't understand the full system and could be wrong about what I have described. Also, I've spent a lot of time and am therefore invested in Linux and conventional programming environments. Feedback from the Urbit community is welcome, just send me an email or message me on IRC or Matrix.

The application and network model is intended for personal servers. I don't think people want personal servers. It's been tried before by Sandstorm, FreedomBox, and various projects without mainstream success. I think a more interesting model for the powerful devices we run today is one without any "server" at all. Instead of having an always-on server that is hosted somewhere, apps should be able to replicate and sync directly between a laptop and a phone. Having the option to run a personal server for always-on services like chat rooms or file hosting is nice, but many things don't need this. I wish Urbit was less focussed on personal servers and more on apps that replicate and sync directly between each other.

Urbit is obfuscated by the most extreme not invented here (NIH) syndrome I have ever seen. I tried to keep the terminology at a minimum in this article, so it might not be obvious unless you dive into the documentation or source code yourself. Not only is most of it a reinvention of existing stuff but it also uses new terminology for everything. It was disappointing to find that what first appeared like an alien system that might hold interesting discoveries was just a quirky reimplementation of existing concepts.

It will be difficult for Urbit to catch on as a platform since it has no common terminology with existing programming environments. If you want to write an app for Urbit using the Hoon programming language you'll have to wade through a lot of NIH at every level of the stack (programming language, operating system, APIs). There is an argument that reinventing everything allows the whole system to be small and self-contained, but in practice that's not true since Landscape apps are JavaScript web applications. They drag in the entire conventional computing environment that Urbit was supposed to replace. I wonder if the same kind of system can be built on top of a browser plus Deno with WebRTC for the server side, reusing existing technology that is actively developed by teams much larger than Urbit. It seems like a waste because Urbit doesn't really innovate in VMs, programming languages, etc yet the NIH stuff needs to be maintained.

Finally, for a system that is very much exposed to the network, I didn't see a strong discipline or much support for writing secure applications. The degree of network transparency that Urbit agents have also means that they present an attack surface. I would have expected the documentation and APIs/tooling to steer developers in a direction where it's hard to make mistakes. My impression is that a lot of the attack surface in agents is hand coded and security issues could become commonplace when Urbit gains more apps written by a wider developer community.

Despite this criticism I really enjoyed playing with Urbit. It is a cool rabbit hole to go down.

Conclusion

Urbit applications boil down to a relatively familiar interface similar to what can be done with gRPC: command/response, querying data, and subscriptions. The Urbit network allows applications to talk to each other directly in a peer-to-peer fashion. Users run apps on personal servers instead of centralized servers operated by the application creators (like Twitter, Facebook, etc). If Urbit can attract enough early adopters then it could become an interesting operating system and ecosystem that overcomes some of the issues of today's centralized internet. If you're wondering, I think it's worth spending a weekend exploring Urbit!

by Stefan Hajnoczi at November 09, 2021 01:54 PM

November 01, 2021

Cornelia Huck

Blog update

I have moved my blog to a new location and done some other changes at the same time.

  • This blog is now generated via Jekyll (a huge thank you to the authors here!) This makes posts easier for me to write (especially when formatting command output and similar), and gets rid of intrusive scripts such as those on the Blogger platform.
  • This blog’s title is now “KVM, QEMU, and more.” Observant readers may have noticed that I dropped the “Big Iron” part. I may still post s390x-specific content, but in general, I plan to write more about topics that are not architecture-specific.

And yes, I plan to actually post something new this year ;)

by Cornelia Huck at November 01, 2021 11:00 PM

October 18, 2021

KVM on Z

Ubuntu 21.10 released

Ubuntu Server 21.10 is out!
It ships

  • Linux kernel 5.13 (including, among others, features as described here and here)
  • QEMU v6.0
  • libvirt v7.6

See the release notes here, and the blog entry at Canonical with Z-specific highlights here.

by Stefan Raspl (noreply@blogger.com) at October 18, 2021 09:57 AM

October 15, 2021

Stefan Hajnoczi

A new approach to usermode networking with passt

There is a new project called passt that Stefano Brivio has been working on to implement usermode networking, the magic that forwards network packets between QEMU guests and the host network.

passt is designed as a replacement for QEMU's --netdev user (also known as slirp), a feature that is commonly used but not really considered production-ready. What passt improves on is security and performance, finally making usermode networking production-ready. That's all you need to know to try it out but I thought the internals of how passt works are interesting, so this article explains the approach.

Why usermode networking is necessary

Guests send and receive Ethernet frames through emulated network interface cards like virtio-net. Those packets need to be injected into the host network but popular operating systems don't provide an API for sending and receiving Ethernet frames because that poses a security risk (spoofing) or could simply interfere with other applications on the host.

Actually that's not quite true, operating systems do provide specialized APIs for injecting Ethernet frames but they come with limitations. For example, the Linux tun/tap driver requires additional network configuration steps as well as administrator privileges. Sometimes it's not possible to take advantage of tap due to these limitations and we really need a solution for unprivileged users. That's what usermode networking is about.

Transmuting Ethernet frames to Socket API calls

Since an unprivileged user cannot inject Ethernet frames into the host network, we have to make do with the POSIX Sockets API that is available to unprivileged users. Each Ethernet frame sent by the guest needs to be converted into equivalent Sockets API calls on the host so that the desired effect is achieved even though we weren't able to transmit the original Ethernet frame byte-for-byte. Incoming packets from the external network need to be received via the Sockets API and repackaged into Ethernet frames that the guest network interface card can receive.

In networking parlance this conversion between Ethernet frames and Sockets API calls is a Layer 2 (Data Link Layer)/Layer 4 (Transport Layer) conversion. The Ethernet frames have additional packet headers including the Ethernet header, IP header, and the TCP/UDP header that the Sockets API calls don't include. Careful use of the Sockets API makes it possible to synthesize Ethernet frames that are similar enough to the original ones that the guest can communicate successfully.

For the most part this conversion requires parsing and building, respectively, packet headers in a straightforward way. The TCP protocol makes things more interesting though because a TCP connection involves non-trivial state that is normally handled by the TCP/IP stack. For example, data sent over a TCP connection might arrive out of order or some chunks may have been dropped. There is also a state machine for the TCP connection lifecycle including its famous three-way handshake. This means TCP connections must be carefully tracked so that these low-level protocol features can be simulated correctly.

How passt works

Passt runs as an unprivileged host userspace process that is connected to QEMU through --netdev socket, a way to transfer Ethernet frames from QEMU's network interface card emulation to another process like passt. When passt reads an Ethernet frame carrying, say, a UDP message from the guest, it sends the data onwards through an equivalent AF_INET SOCK_DGRAM socket on the host. It also keeps the socket open so replies can be read on the host and then packaged into Ethernet frames that are written to the guest. The effect of this is that guest network communication appears like it's coming from the passt process on the host and integrates nicely into host networking.

How TCP works is a bit more interesting. Since TCP connections require acknowledgement messages for reliable delivery, passt uses the recvmmsg(2) MSG_PEEK flag to fetch data while keeping it queued in the host network stack's rcvbuf until the guest acknowledges it. This avoids extra buffer management code in passt and is part of its strategy of implementing only a subset of TCP. There is no need to duplicate the full TCP/IP stack since the host and guest already have them, but achieving this requires detailed knowledge of TCP so that passt can maintain just enough state.

Incoming connections are handled by port forwarding. This means passt can bind to port 2222 on the host and forward connections to port 22 inside the guest. This is very useful for usermode networking since the user may not have permission to bind to low-numbered ports on the host or there might already be host services listening on those ports. If you don't want to mess with port forwarding you can use passt's all mode, which simply listens on all non-ephemeral ports (basically a brute force approach).
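
Assuming host port 2222 has been forwarded to the guest's port 22 that way, logging into the guest is as simple as (user name is an example):

ssh -p 2222 user@127.0.0.1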

A few basic network protocols are necessary for network communication: ARP, ICMP, DHCP, DNS, and IPv6 services. Passt offers these because the guest cannot talk to those services on the host network directly. They can be disabled when the guest has knowledge of the network configuration and doesn't need them.

Why passt is unique

Thanks to running in a separate process from QEMU and by taking a minimalist approach, passt is able to tighten security. Its seccomp filters are stronger than anything the larger QEMU process could do. The code is clean and specifically designed for security and simplicity. I think writing passt in C was a missed opportunity. Some users may rule it out entirely for this reason. Userspace parsing of untrusted network packets should be done in a memory-safe programming language nowadays. Nevertheless, it's a step up from slirp, which has a poor track record of security issues and runs as part of the QEMU process.

I'm excited to see how passt performs in comparison to slirp. passt uses techniques like recvmmsg(2)/sendmmsg(2) to batch message transfer and reads multiple Ethernet frames from the guest in a single syscall to amortize the cost of syscalls over multiple packets. There is no dynamic memory allocation and packet headers are pre-populated to minimize the number of CPU cycles spent in the data path. While this is promising, QEMU's --netdev socket isn't the fastest (packets first take a trip through QEMU's net subsystem queues), but still a good trade-off between performance and simplicity/security. Based on reading the code, I think passt will be faster than slirp but I haven't benchmarked it myself.

There is another mode that passt supports for containers instead of virtualization. Although it's not relevant to QEMU, this so-called pasta mode is a cool feature for container networking. In this mode pasta connects a network namespace with the outside world (init namespace) through a tap device. This might become passt's killer feature, because the same software can be used for both virtualization and containers, so why bother investing in two separate solutions?

Conclusion

Passt is a promising replacement for slirp (on Linux hosts at least). It looks like there will finally be a production-ready usermode networking feature for QEMU that is fast and secure. Passt's functionality is generic enough that other projects besides QEMU will be able to use it, which is great because this kind of networking code is non-trivial to develop. I look forward to passt becoming available for use with QEMU in Linux distributions soon!

by Stefan Hajnoczi at October 15, 2021 07:00 PM

October 13, 2021

KVM on Z

Documentation Update: KVM Virtual Server Management

Intended for KVM virtual server administrators, the "Linux on Z and LinuxONE - KVM Virtual Server Management" book illustrates how to set up, configure, and operate Linux on KVM instances and their virtual devices running on the KVM host and IBM Z hardware.

This major update includes libvirt commands and XML elements for managing the lifecycle of VFIO mediated devices, performance tuning tips, and a simplified method for configuring a virtual server for IBM Secure Execution for Linux.

by Stefan Raspl (noreply@blogger.com) at October 13, 2021 09:47 AM

September 16, 2021

Stefan Hajnoczi

KVM Forum 2021 Highlights

Here are highlights from this year's KVM Forum conference, the yearly event around QEMU, KVM, and related communities like VIRTIO and rust-vmm.

Video recordings will be posted soon. In the meantime, here are short summaries of what I learnt. You can find slides to many of these talks in the links below.

vDPA

vDPA is a Linux driver framework for developing hybrid hardware/software VIRTIO devices.

In Hyperscale vDPA, Jason Wang covered ways to create fine-grained virtual devices for virtual machines and containers using vDPA. This means offering a way to slice up a physical device into many virtual devices with some aspects of the virtual device handled in hardware and others in software. This is the direction that networking and accelerator devices are heading in and is actively being discussed in the VIRTIO community. Many different approaches are possible and Jason's talk enumerates some of them:

  • An interface for management commands (a virtio-pci capability, a management virtqueue)
  • DMA isolation (PCIe PASID, a platform-independent device MMU)
  • More than 2048 MSI-X interrupts (virtio-pci capability for VIRTIO-specific MSI-X tables)

Another new vDPA project was presented by Yongji Xie. VDUSE - vDPA Device in Userspace showed how vDPA devices can be implemented in userspace. Although this is roughly the same use case as vhost-user, it has the unique advantage of allowing containers and bare metal to attach devices. An untrusted userspace process emulates the device and the host kernel can either present a vhost device to virtual machines or attach to the userspace device. This is a neat way to develop software devices that can also benefit container workloads.

Stefano Garzarella covered the new unified virtio-blk storage stack in vdpa-blk: Unified Hardware and Software Offload for virtio-blk. The goal is to support hardware virtio-blk devices, an optimized host kernel software device, and still offer QEMU block layer features like qcow2 images. This allows the fast path to go directly to hardware or an optimized in-kernel device while software storage features can still be used when desired via a slow path.

vfio-user

VFIO User - Using VFIO as the IPC Protocol in Multi-process QEMU focussed on the new out-of-process device interface that John Johnson, Jagannathan Raman, and Elena Ufimtseva have been working on together with others. This new protocol allows PCI (and perhaps other busses in the future) devices to be implemented as separate processes. QEMU communicates with the device over a UNIX domain socket. The approach is similar to vhost-user except the protocol messages are based on the Linux VFIO ioctl interface instead of the vhost ioctls.

While vhost-user has been in use for a number of years for VIRTIO-based devices, vfio-user now makes it possible to implement non-VIRTIO devices as separate processes. There were several other talks about vfio-user at KVM Forum 2021 that you can also check out.

virtiofs

In Towards High-availability for Virtio-fs, Jiachen Zheng and Yongji Xie explained how they extended virtiofs to handle crash recovery and live updates. These features are challenging for any program with a lot of state because care must be taken to maintain a consistent snapshot to resume from in the case of a restart. They tackled this by storing Linux file handles and a journal in a shm file. This required some changes to QEMU's virtiofsd data structures that make them suitable for storing in shm, and a journal that makes it possible to provide idempotency for operations like mkdir that would otherwise fail if replayed.

Virtual IOMMUs

Suravee Suthikulpanit and Wei Huang gave a talk titled Analysis of AMD HW-assisted vIOMMU Implementation and Performance. AMD is working on a hardware implementation of a virtual IOMMU that allows guests to specify DMA permissions for guest memory. This functionality is important for VFIO device assignment within guests, for example. Although it can be done in software via emulation of real IOMMUs or the virtio-iommu device that was designed specifically for virtual machines, implementing the vIOMMU in real hardware has performance advantages. One interesting feature of the hardware-assisted vIOMMU is that it natively supports encrypted memory for AMD SEV-SNP guests, something that is slow and clumsy to do in software.

by Stefan Hajnoczi at September 16, 2021 04:17 PM

September 06, 2021

Gerd Hoffmann

Advanced network booting for virtual machines

Network booting is cool. Once you have set up everything you can stop juggling iso images in your virtual machine configs. Instead you just kick a network boot and pick whatever you want to install from the boot menu delivered by the boot server.

This article is not about the basics of setting up a boot server. The internet has tons of tutorials on how to install a tftp server and how to boot your favorite OS from tftp. This article will focus on configuring network boot for libvirt-managed virtual machines.

Before we get started ...

The config file snippets are examples from my home network; home.kraxel.org is the local domain and 192.168.2.14 is the machine acting as boot server here. You have to replace those to match your setup of course. The same is true for the boot file names.

The default libvirt network uses 192.168.122.0/24. In case you use that unmodified these addresses will work fine for you and in fact they should already be in your libvirt network configuration. If you have changed the default libvirt network I expect you know what you have to do 😎.

Step one: very basic netboot setup

That is pretty simple. libvirt has support for that, so all you have to do is add a bootp tag with the ip address of your tftp server and the boot file name to the network config.

<network>
  [ ... ]
  <ip address='192.168.122.1' netmask='255.255.255.0'>                                        
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254'/>                                    
      <bootp file='pxelinux.0' server='192.168.2.14'/>
    </dhcp>
  </ip>
</network>

You can edit the network configuration using virsh net-edit name. The default libvirt network is simply named default. The network needs a restart to apply any changes (virsh net-destroy name; virsh net-start name).
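
For the default network that boils down to:

virsh net-edit default
virsh net-destroy default
virsh net-start default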

That was easy, right? Well, maybe not. In case this is not working for you, try running modprobe nf_nat_tftp. tftp uses udp, which means there are no connections at ip level, so the kernel has to look into the tftp packets to figure out how to route them correctly for a masqueraded network. The nf_nat_tftp kernel module does exactly that.

Note: Recent libvirt versions seem to take care to load nf_nat_tftp if needed, so there is a chance this works out-of-the-box for you.
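
A quick way to check and fix this (the modules-load.d snippet is optional and only needed if you want the module loaded on every boot):

# load the module if it is not loaded already
lsmod | grep -q nf_nat_tftp || sudo modprobe nf_nat_tftp

# optionally make that persistent across reboots
echo nf_nat_tftp | sudo tee /etc/modules-load.d/nf_nat_tftp.conf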

Nevertheless, that leads straight to the question: do we actually need tftp?

Step two: replace tftp with http

As you might have guessed the answer is no.

The ipxe boot roms support booting from http by simply specifying a URL instead of a filename as the boot file. This was never formally specified though, so unfortunately you can't expect this to work with every boot rom. For qemu-powered virtual machines this isn't a problem at all because the qemu boot roms are built from ipxe. With physical machines you might have to jump through some extra hoops to chainload ipxe (not covered here).

The easiest way to get this going is to install apache on your tftp boot server, then configure a virtual host with the tftproot as document root. You can do so by dropping a snippet like this into /etc/httpd/conf.d/:

<Directory "/var/lib/tftpboot">
        Options Indexes FollowSymLinks
        AllowOverride None
	Require all granted
</Directory>
<VirtualHost *:80>
        ServerName boot.home.kraxel.org
        DocumentRoot /var/lib/tftpboot
</VirtualHost>

Enabling indexing is not needed for boot server functionality, but might be handy if you want to access the boot server with your web browser for troubleshooting.

Using the tftproot as document root has the advantage that the paths are identical for both tftp and http boot, so your pxelinux and grub configuration files should continue to work unmodified.
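
A quick way to double-check that the boot file is reachable over both protocols (host name and ip address are from my setup, and the tftp check requires a curl build with tftp support):

curl -sI http://boot.home.kraxel.org/pxelinux.0 | head -n1
curl -s -o /dev/null tftp://192.168.2.14/pxelinux.0 && echo "tftp ok"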

Now you can go edit your libvirt network config and replace the bootp configuration with this:

<bootp file='http://boot.home.kraxel.org/pxelinux.0'/>

Done. Don't forget to restart the network to apply the changes. Booting should be noticeably faster now (especially when fetching larger initrds), and any NAT traversal problems should be gone too.

Extra tip for lazy people

When using http you can boot from pretty much any server on the internet, there is no need to setup your own. You can use for example the boot server provided by netboot.xyz with a large collection of operating systems available as live systems and for install. Here is the bootp snippet for this:

<bootp file='http://boot.netboot.xyz/ipxe/netboot.xyz.lkrn'/>

In most cases you probably want to have a local boot server for faster installs. But for a one-time test install of a new distro this might be handier than downloading the install iso.

Step three: what about UEFI?

For EFI guests the pxelinux.0 is pretty much useless indeed, so we must do something else for them. The first question is: how do we figure out that this is an EFI guest asking for a boot file? Let's have a look at the dhcp request; a BIOS guest goes first. Captured using tcpdump -i virbr0 -v port bootps:

[ ... ] 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:89:32:47 [ ... ]
	  Client-Ethernet-Address 52:54:00:89:32:47 (oui Unknown)
	  Vendor-rfc1048 Extensions
            [ ... ]
	    ARCH Option 93, length 2: 0
	    Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"

Now a request from a (x64) EFI guest:

[ ... ] 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:89:32:47 [ ... ]
	  Client-Ethernet-Address 52:54:00:89:32:47 (oui Unknown)
	  Vendor-rfc1048 Extensions
            [ ... ]
	    ARCH Option 93, length 2: 7
	    Vendor-Class Option 60, length 32: "PXEClient:Arch:00007:UNDI:003001"

See? The EFI guest uses arch 7 instead of 0, in both option 93 and option 60. So we will use that.

Unfortunately libvirt has no direct support for that. But libvirt uses dnsmasq as dhcp (and dns) server for the virtual networks. dnsmasq has support for this, and starting with libvirt version 5.6.0 it is possible to specify any dnsmasq config option in your libvirt network configuration using the dnsmasq xml namespace.

dnsmasq uses the concept of tags to implement this. Requests can be tagged using matches, and configuration directives can be applied to requests with certain tags. So, here is what it looks like, using the efi-x64-pxe tag for x64 efi guests and /arch-x86_64/grubx64.efi as the boot file.

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  [ ... ]
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254'/>
      <bootp file='http://boot.home.kraxel.org/pxelinux.0'/>
    </dhcp>
  </ip>
  <dnsmasq:options>
    <dnsmasq:option value='#'/>
    <dnsmasq:option value='dhcp-match=set:efi-x64-pxe,option:client-arch,7'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-x64-pxe,/arch-x86_64/grubx64.efi,,192.168.2.14'/>
  </dnsmasq:options>
</network>

dnsmasq uses '#' for comments, and it is here only to visually separate entries a bit. It will also be in the dnsmasq config files created by libvirt (in /var/lib/libvirt/dnsmasq/).
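
If you want to verify what libvirt actually passed on to dnsmasq, just look at the generated config file, which is named after the network:

cat /var/lib/libvirt/dnsmasq/default.conf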

Step four: Can UEFI guests use http too?

Sure. You might have already noticed that the UEFI boot manager has both UEFI PXEv4 and UEFI HTTPv4 entries. Here is what happens when you pick the latter:

[ ... ] 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:89:32:47 [ ... ]
	  Client-Ethernet-Address 52:54:00:89:32:47 (oui Unknown)
	  Vendor-rfc1048 Extensions
            [ ... ]
	    ARCH Option 93, length 2: 16
	    Vendor-Class Option 60, length 33: "HTTPClient:Arch:00016:UNDI:003001"

It's arch 16 now. Also option 60 starts with HTTPClient instead of PXEClient. So we can simply add another arch match to identify http clients.

Another detail we need to take care of is that the UEFI http boot client expects a reply with option 60 set to HTTPClient, otherwise it will be ignored. So we need to take care of that too, using dhcp-option-force. Here we go, using tag efi-x64-http for http clients:

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  [ ... ]
  <dnsmasq:options>
    <dnsmasq:option value='#'/>
    <dnsmasq:option value='dhcp-match=set:efi-x64-pxe,option:client-arch,7'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-x64-pxe,/arch-x86_64/grubx64.efi,,192.168.2.14'/>
    <dnsmasq:option value='#'/>
    <dnsmasq:option value='dhcp-match=set:efi-x64-http,option:client-arch,16'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-x64-http,http://boot.home.kraxel.org/arch-x86_64/grubx64.efi'/>
    <dnsmasq:option value='dhcp-option-force=tag:efi-x64-http,60,HTTPClient'/>
  </dnsmasq:options>
</network>

Extra tip for lazy people, now with UEFI

Complete example, defining a new libvirt network named netboot.xyz. You can store that in some file, then use virsh net-define file to create the network.

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  <name>netboot.xyz</name>
  <forward mode='nat'/>
  <bridge name='netboot0' stp='on' delay='0'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.123.10' end='192.168.123.99'/>
      <bootp file='http://boot.netboot.xyz/ipxe/netboot.xyz.lkrn'/>
    </dhcp>
  </ip>
  <dnsmasq:options>
    <dnsmasq:option value='dhcp-match=set:efi-x64-http,option:client-arch,16'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-x64-http,http://boot.netboot.xyz/ipxe/netboot.xyz.efi'/>
    <dnsmasq:option value='dhcp-option-force=tag:efi-x64-http,60,HTTPClient'/>
  </dnsmasq:options>
</network>

Then, in your guest domain configuration, use <source network='netboot.xyz'/> to use the new network. With this both BIOS and UEFI guests can netboot from netboot.xyz. With UEFI you have to take care to pick the UEFI HTTPv4 entry from the firmware boot menu.
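
Assuming you stored the XML above as netboot-xyz.xml, defining and starting the network looks like this:

virsh net-define netboot-xyz.xml
virsh net-start netboot.xyz
virsh net-autostart netboot.xyz

And the matching interface element in the guest domain configuration could look like this (virtio model picked just as an example):

<interface type='network'>
  <source network='netboot.xyz'/>
  <model type='virtio'/>
</interface>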

Step five: architecture experiments

There is a world beyond x86. The arch field not only specifies the system architecture (bios vs. uefi) and the boot protocol (pxe vs. http), but also the cpu architecture. Here are the ones relevant for qemu:

Code   Architecture
0x00   BIOS pxeboot (both i386 and x86_64)
0x06   EFI pxeboot, IA32 (i386)
0x07   EFI pxeboot, X64 (x86_64)
0x0a   EFI pxeboot, ARM (v7)
0x0b   EFI pxeboot, AA64 (v8 / aarch64)
0x12   powerpc64
0x16   EFI httpboot, X64
0x18   EFI httpboot, ARM
0x19   EFI httpboot, AA64
0x31   s390x

So, if you want to play with arm or powerpc without owning such a machine you can let qemu emulate it with tcg. If you want to netboot it -- no problem, just add a few more lines to your network configuration. Here is an example for aarch64:

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  [ ... ]
  <dnsmasq:options>
    [ ... ]
    <dnsmasq:option value='#'/>
    <dnsmasq:option value='dhcp-match=set:efi-aa64-pxe,option:client-arch,b'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-aa64-pxe,/arch-aarch64/grubaa64.efi,,192.168.2.14'/>
    <dnsmasq:option value='#'/>
    <dnsmasq:option value='dhcp-match=set:efi-aa64-http,option:client-arch,19'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-aa64-http,http://boot.home.kraxel.org/arch-aarch64/grubaa64.efi'/>
    <dnsmasq:option value='dhcp-option-force=tag:efi-aa64-http,60,HTTPClient'/>
  </dnsmasq:options>
</network>

In case you are wondering why I place the grub binaries in subdirectories: grub tries to fetch the config file from the same directory, so that way I get per-arch config files and they are named /arch-aarch64/grub.cfg, /arch-x86_64/grub.cfg and so on. A nice side effect is that the toplevel directory is a bit less cluttered with files.
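
So the document root served via tftp and http ends up looking roughly like this:

/var/lib/tftpboot/
├── pxelinux.0
├── arch-x86_64/
│   ├── grubx64.efi
│   └── grub.cfg
└── arch-aarch64/
    ├── grubaa64.efi
    └── grub.cfg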

And beyond libvirt?

Well, the fundamental idea doesn't change. Look at the arch option, then send different replies depending on what you find there. With other dhcp servers the syntax is different, but the pattern is the same. Here is a sample snippet for the isc dhcp server shipped with most linux distributions:

option arch code 93 = unsigned integer 16;

subnet 192.168.2.0 netmask 255.255.255.0 {
        [ ... ]

        if (option arch = 00:16) {
		option vendor-class-identifier "HTTPClient";
		filename "http://boot.home.kraxel.org/arch-x86_64/grubx64.efi";
	} else if (option arch = 00:07) {
		next-server 192.168.2.14;
		filename "/arch-x86_64/grubx64.efi";
	} else {
		next-server 192.168.2.14;
		filename "/pxelinux.0";
	}
}

by Gerd Hoffmann at September 06, 2021 10:00 PM

QEMU project

Presenting guest images as raw image files with FUSE

Sometimes, there is a VM disk image whose contents you want to manipulate without booting the VM. One way of doing this is to use libguestfs, which can boot a minimal Linux VM to provide the host with secure access to the disk’s contents. For example, guestmount allows you to mount a guest filesystem on the host, without requiring root rights.

However, maybe you cannot or do not want to use libguestfs, e.g. because you do not have KVM available in your environment, and so it becomes too slow; or because you do not want to go through a guest OS, but want to access the raw image data directly on the host, with minimal overhead.

Note: Guest images can generally be arbitrarily modified by VM guests. If you have an image to which an untrusted guest had write access at some point, you must treat any data and metadata on this image as potentially having been modified in a malicious manner. Parsing anything must be done with caution. Note that many existing tools are not careful in this regard; for example, filesystem drivers generally deliberately do not have protection against maliciously corrupted filesystems. This is why, in contrast, accessing an image through libguestfs is considered secure: the actual access happens in a libvirt-managed VM guest.

From this point, we assume you are aware of the security caveats and still want to access and manipulate image data on the host.

Now, unless your image is already in raw format, you will be faced with the problem of getting it into raw format. The tools that you might want to use for image manipulation generally only work on raw images (because that is how block device files appear), like:

  • dd to just copy data to and from given offsets,
  • parted to manipulate the partition table,
  • kpartx to present all partitions as block devices,
  • mount to access filesystems’ contents.

So if you want to use such tools on image files e.g. in QEMU’s qcow2 format, you will need to translate them into raw images first, for example by:

  • Exporting the image file with qemu-nbd -c as an NBD block device file,
  • Converting between image formats using qemu-img convert,
  • Accessing the image from a guest, where it appears as a normal block device.

Unfortunately, none of these methods is perfect: qemu-nbd -c generally requires root rights; converting to a temporary raw copy requires additional disk space and the conversion process takes time; and accessing the image from a guest is basically what libguestfs does (i.e., if that is what you want, then you should probably use libguestfs).

As of QEMU 6.0, there is another method, namely FUSE block exports. Conceptually, these are rather similar to using qemu-nbd -c, but they do not require root rights.

Note: FUSE block exports are a feature that can be enabled or disabled during the build process with --enable-fuse or --disable-fuse, respectively; omitting either configure option will enable the feature if and only if libfuse3 is present. It is possible that the QEMU build you are using does not have FUSE block export support, because it was not compiled in.
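
If you build QEMU yourself, you can force the decision at configure time, for example:

$ ./configure --enable-fuse
$ make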

FUSE (Filesystem in Userspace) is a technology to let userspace processes provide filesystem drivers. For example, sshfs is a program that allows mounting remote directories from a machine accessible via SSH.

QEMU can use FUSE to make a virtual block device appear as a normal file on the host, so that tools like kpartx can interact with it regardless of the image format, like in the following example:

$ qemu-img create -f raw foo.img 20G
Formatting 'foo.img', fmt=raw size=21474836480

$ parted -s foo.img \
    'mklabel msdos' \
    'mkpart primary ext4 2048s 100%'

$ qemu-img convert -p -f raw -O qcow2 foo.img foo.qcow2 && rm foo.img
    (100.00/100%)

$ file foo.qcow2
foo.qcow2: QEMU QCOW2 Image (v3), 21474836480 bytes

$ sudo kpartx -l foo.qcow2

$ qemu-storage-daemon \
    --blockdev node-name=prot-node,driver=file,filename=foo.qcow2 \
    --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
    --export \
    type=fuse,id=exp0,node-name=fmt-node,mountpoint=foo.qcow2,writable=on \
    &
[1] 200495

$ file foo.qcow2
foo.qcow2: DOS/MBR boot sector; partition 1 : ID=0x83, start-CHS (0x10,0,1),
end-CHS (0x3ff,3,32), startsector 2048, 41940992 sectors

$ sudo kpartx -av foo.qcow2
add map loop0p1 (254:0): 0 41940992 linear 7:0 2048

In this example, we create a partition on a newly created raw image. We then convert this raw image to qcow2 and discard the original. Because a tool like kpartx cannot parse the qcow2 format, it reports no partitions to be present in foo.qcow2.

Using the QEMU storage daemon, we then create a FUSE export for the image that apparently turns it into a raw image, which makes the content and thus the partitions visible to file and kpartx. Now, we can use kpartx to access the partition in foo.qcow2 under /dev/mapper/loop0p1.
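
From here on the usual tools work as if this were a raw block device. One possible continuation of the session above (device name as reported by kpartx):

$ sudo mkfs.ext4 /dev/mapper/loop0p1
$ sudo mount /dev/mapper/loop0p1 /mnt
$ echo 'hello from the host' | sudo tee /mnt/greeting > /dev/null
$ sudo umount /mnt
$ sudo kpartx -d foo.qcow2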

So how does this work? How can the QEMU storage daemon make a qcow2 image appear as a raw image?

File mounts

To transparently translate a file into a different format, like we did above, we make use of two little-known facts about filesystems and the VFS on Linux. The first one of these we can explain immediately, for the second one we will need some more information about how FUSE exports work, so that secret will be lifted later (down in the “Mounting an image on itself” section).

Here is the first secret: Filesystems do not need to have a root directory. They only need a root node. A regular file is a node, so a filesystem that only consists of a single regular file is perfectly valid.

Note that this is not about filesystems with just a single file in their root directory, but about filesystems that really do not have a root directory.

Conceptually, every filesystem is a tree, and mounting works by replacing one subtree of the global VFS tree by the mounted filesystem’s tree. Normally, a filesystem’s root node is a directory, like in the following example:

Regular filesystem: Root directory is mounted to a directory mount point
Fig. 1: Mounting a regular filesystem with a directory as its root node

Here, the directory /foo and its content (the files /foo/a and /foo/b) are shadowed by the new filesystem (showing /foo/x and /foo/y).

Note that a filesystem’s root node generally has no name. After mounting, the filesystem’s root directory’s name is determined by the original name of the mount point. (“/” is not a name. It specifically is a directory without a name.)

Because a tree does not need to have multiple nodes but may consist of just a single leaf, a filesystem with a file for its root node works just as well, though:

Mounting a file root node to a regular file mount point
Fig. 2: Mounting a filesystem with a regular (unnamed) file as its root node

Here, FS B only consists of a single node, a regular file with no name. (As above, a filesystem’s root node is generally unnamed.) Consequently, the mount point for it must also be a regular file (/foo/a in our example), and just like before, the content of /foo/a is shadowed, and when opening it, one will instead see the contents of FS B’s unnamed root node.

QEMU block exports

Before we can see what FUSE exports are and how they work, we should explore QEMU block exports in general.

QEMU allows exporting block nodes via various protocols (as of 6.0: NBD, vhost-user, FUSE). A block node is an element of QEMU’s block graph (see e.g. Managing the New Block Layer, a talk given at KVM Forum 2017), which can for example be attached to guest devices. Here is a very simple example:

Block graph: image file <-> file node (label: prot-node) <-> qcow2 node (label: fmt-node) <-> virtio-blk guest device
Fig. 3: A simple block graph for attaching a qcow2 image to a virtio-blk guest device

This is the simplest example for a block graph that connects a virtio-blk guest device to a qcow2 image file. The file block driver, instanced in the form of a block node named prot-node, accesses the actual file and provides the node above it access to the raw content. This node above, named fmt-node, is handled by the qcow2 block driver, which is capable of interpreting the qcow2 format. Parents of this node will therefore see the actual content of the virtual disk that is represented by the qcow2 image. There is only one parent here, which is the virtio-blk guest device, which will thus see the virtual disk.

The command line to achieve the above could look something like this:

$ qemu-system-x86_64 \
    -blockdev node-name=prot-node,driver=file,filename=$image_path \
    -blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
    -device virtio-blk,drive=fmt-node,share-rw=on

Besides attaching guest devices to block nodes, you can also export them for users outside of qemu, for example via NBD. Say you have a QMP channel open for the QEMU instance above, then you could do this:

{
    "execute": "nbd-server-start",
    "arguments": {
        "addr": {
            "type": "inet",
            "data": {
                "host": "localhost",
                "port": "10809"
            }
        }
    }
}
{
    "execute": "block-export-add",
    "arguments": {
        "type": "nbd",
        "id": "exp0",
        "node-name": "fmt-node",
        "name": "guest-disk",
        "writable": true
    }
}

This opens an NBD server on localhost:10809, which exports fmt-node (under the NBD export name guest-disk). The block graph looks as follows:

Same block graph as fig. 3, but with an NBD server attached to fmt-node
Fig. 4: Block graph extended by an NBD server

NBD clients connecting to this server will see the raw disk as seen by the guest – we have exported the guest disk:

$ qemu-img info nbd://localhost/guest-disk
image: nbd://localhost:10809/guest-disk
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: unavailable

QEMU storage daemon

If you are not running a guest, and so do not need guest devices, but all you want is to use the QEMU block layer (for example to interpret the qcow2 format) and export nodes from the block graph, then you can use the more lightweight QEMU storage daemon instead of a full-blown QEMU process:

$ qemu-storage-daemon \
    --blockdev node-name=prot-node,driver=file,filename=$image_path \
    --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
    --nbd-server addr.type=inet,addr.host=localhost,addr.port=10809 \
    --export \
    type=nbd,id=exp0,node-name=fmt-node,name=guest-disk,writable=on

Which creates the following block graph:

Block graph: image file <-> file node (label: prot-node) <-> qcow2 node (label: fmt-node) <-> NBD server
Fig. 5: Exporting a qcow2 image over NBD

FUSE block exports

Besides NBD exports, QEMU also supports vhost-user and FUSE exports. FUSE block exports make QEMU become a FUSE driver that provides a filesystem that consists of only a single node, namely a regular file that has the raw contents of the exported block node. QEMU will automatically mount this filesystem on a given existing regular file (which acts as the mount point, as described in the “File mounts” section).

Thus, FUSE exports can be used like this:

$ touch mount-point

$ qemu-storage-daemon \
  --blockdev node-name=prot-node,driver=file,filename=$image_path \
  --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
  --export \
  type=fuse,id=exp0,node-name=fmt-node,mountpoint=mount-point,writable=on

The mount point now appears as the raw VM disk that is stored in the qcow2 image:

$ qemu-img info mount-point
image: mount-point
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: 196 KiB

And mount tells us that this is indeed its own filesystem:

$ mount | grep mount-point
/dev/fuse on /tmp/mount-point type fuse (rw,nosuid,nodev,relatime,user_id=1000,
group_id=100,default_permissions,allow_other,max_read=67108864)

The block graph looks like this:

Block graph: image file <-> file node (label: prot-node) <-> qcow2 node (label: fmt-node) <-> FUSE server <-> exported file
Fig. 6: Exporting a qcow2 image over FUSE

Closing the storage daemon (e.g. with Ctrl-C) automatically unmounts the export, turning the mount point back into an empty normal file:

$ mount | grep -c mount-point
0

$ qemu-img info mount-point
image: mount-point
file format: raw
virtual size: 0 B (0 bytes)
disk size: 0 B

Mounting an image on itself

So far, we have seen what FUSE exports are, how they work, and how they can be used. However, in the very first example in this blog post, we did not export the raw image on some empty regular file that just serves as a mount point – no, we turned the original qcow2 image itself into a raw image.

How does that work?

What happens to the old tree under a mount point?

Mounting a filesystem only shadows the mount point’s original content, it does not remove it. The original content can no longer be looked up via its (absolute) path, but it is still there, much like a file that has been unlinked but is still open in some process. Here is an example:

First, create some file in some directory, and have some process keep it open:

$ mkdir foo

$ echo 'Is anyone there?' > foo/bar

$ irb
irb(main):001:0> f = File.open('foo/bar', 'r+')
=> #<File:foo/bar>
irb(main):002:0> ^Z
[1]  + 35494 suspended  irb

Next, mount something on the directory:

$ sudo mount -t tmpfs tmpfs foo

The file cannot be found anymore (because foo’s content is shadowed by the mounted filesystem), but the process that kept it open can still read from it, and write to it:

$ ls foo

$ cat foo/bar
cat: foo/bar: No such file or directory

$ fg
f.read
irb(main):002:0> f.read
=> "Is anyone there?\n"
irb(main):003:0> f.puts('Hello from the shadows!')
=> nil
irb(main):004:0> exit

$ ls foo

$ cat foo/bar
cat: foo/bar: No such file or directory

Unmounting the filesystem lets us see our file again, with its updated content:

$ sudo umount foo

$ ls foo
bar

$ cat foo/bar
Is anyone there?
Hello from the shadows!

Letting a FUSE export shadow its image file

The same principle applies to file mounts: The original inode is shadowed (along with its content), but it is still there for any process that opened it before the mount occurred. Because QEMU (or the storage daemon) opens the image file before mounting the FUSE export, you can therefore specify an image’s path as the mount point for its corresponding export:

$ qemu-img create -f qcow2 foo.qcow2 20G
Formatting 'foo.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off
 compression_type=zlib size=21474836480 lazy_refcounts=off refcount_bits=16

$ qemu-img info foo.qcow2
image: foo.qcow2
file format: qcow2
virtual size: 20 GiB (21474836480 bytes)
disk size: 196 KiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false

$ qemu-storage-daemon --blockdev \
   node-name=node0,driver=qcow2,file.driver=file,file.filename=foo.qcow2 \
   --export \
   type=fuse,id=node0-export,node-name=node0,mountpoint=foo.qcow2,writable=on &
[1] 40843

$ qemu-img info foo.qcow2
image: foo.qcow2
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: 196 KiB

$ kill %1
[1]  + 40843 done       qemu-storage-daemon --blockdev  --export

In graph form, that looks like this:

Two graphs: First, foo.qcow2 is opened by QEMU; second, a FUSE server exports the raw disk under foo.qcow2, thus shadowing the original foo.qcow2
Fig. 7: Exporting a qcow2 image via FUSE on its own path

QEMU (or the storage daemon in this case) keeps the original (qcow2) file open, and so it keeps access to it, even after the mount. However, any other process that opens the image by name (i.e. open("foo.qcow2")) will open the raw disk image exported by QEMU. Therefore, it looks like the qcow2 image is in raw format now.

qemu-fuse-disk-export.py

Because the QEMU storage daemon command line tends to become kind of long, I’ve written a script to facilitate the process: qemu-fuse-disk-export.py (direct download link). This script automatically detects the image format, and its --daemonize option allows safe use in scripts, where it is important that the process blocks until the export is fully set up.

Using qemu-fuse-disk-export.py, the above example looks like this:

$ qemu-img info foo.qcow2 | grep 'file format'
file format: qcow2

$ qemu-fuse-disk-export.py foo.qcow2 &
[1] 13339
All exports set up, ^C to revert

$ qemu-img info foo.qcow2 | grep 'file format'
file format: raw

$ kill -SIGINT %1
[1]  + 13339 done       qemu-fuse-disk-export.py foo.qcow2

$ qemu-img info foo.qcow2 | grep 'file format'
file format: qcow2

Or, with --daemonize/-d:

$ qemu-img info foo.qcow2 | grep 'file format'
file format: qcow2

$ qemu-fuse-disk-export.py -dp qfde.pid foo.qcow2

$ qemu-img info foo.qcow2 | grep 'file format'
file format: raw

$ kill -SIGINT $(cat qfde.pid)

$ qemu-img info foo.qcow2 | grep 'file format'
file format: qcow2

Bringing it all together

Now we know how to make disk images in any format understood by QEMU appear as raw images. We can thus run any application on them that works with such raw disk images:

$ qemu-fuse-disk-export.py \
    -dp qfde.pid \
    Arch-Linux-x86_64-basic-20210711.28787.qcow2

$ parted Arch-Linux-x86_64-basic-20210711.28787.qcow2 p
WARNING: You are not superuser.  Watch out for permissions.
Model:  (file)
Disk /tmp/Arch-Linux-x86_64-basic-20210711.28787.qcow2: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name  Flags
 1      1049kB  2097kB  1049kB                     bios_grub
 2      2097kB  42.9GB  42.9GB  btrfs

$ sudo kpartx -av Arch-Linux-x86_64-basic-20210711.28787.qcow2
add map loop0p1 (254:0): 0 2048 linear 7:0 2048
add map loop0p2 (254:1): 0 83881951 linear 7:0 4096

$ sudo mount /dev/mapper/loop0p2 /mnt/tmp

$ ls /mnt/tmp
bin   boot  dev  etc  home  lib  lib64  mnt  opt  proc  root  run  sbin  srv
swap  sys   tmp  usr  var

$ echo 'Hello, qcow2 image!' > /mnt/tmp/home/arch/hello

$ sudo umount /mnt/tmp

$ sudo kpartx -d Arch-Linux-x86_64-basic-20210711.28787.qcow2
loop deleted : /dev/loop0

$ kill -SIGINT $(cat qfde.pid)

And launching the image, in the guest we see:

[arch@archlinux ~] cat hello
Hello, qcow2 image!

A note on allow_other

In the example presented in the above section, we access the exported image with a different user than the one who exported it (to be specific, we export it as a normal user, and then access it as root). This does not work prior to QEMU 6.1:

$ qemu-fuse-disk-export.py -dp qfde.pid foo.qcow2

$ sudo stat foo.qcow2
stat: cannot statx 'foo.qcow2': Permission denied

QEMU 6.1 has introduced support for FUSE’s allow_other mount option. Without that option, only the user who exported the image has access to it. By default, if the system allows for non-root users to add allow_other to FUSE mount options, QEMU will add it, and otherwise omit it. It does so by simply attempting to mount the export with allow_other first, and if that fails, it will try again without. (You can also force the behavior with the allow_other=(on|off|auto) export parameter.)

Non-root users can pass allow_other if and only if /etc/fuse.conf contains the user_allow_other option.
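If your distribution does not ship that option by default, it can be enabled by appending the line to /etc/fuse.conf, roughly like this (adjust to your distribution's conventions):

$ grep -q '^user_allow_other' /etc/fuse.conf || \
    echo 'user_allow_other' | sudo tee -a /etc/fuse.conf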

Conclusion

As shown in this blog post, FUSE block exports are a relatively simple way to access images in any format understood by QEMU as if they were raw images. Any tool that can manipulate raw disk images can thus manipulate images in any format, simply by having the QEMU storage daemon provide a translation layer. By mounting the FUSE export on the original image path, this translation layer will effectively be invisible, and the original image will look like it is in raw format, so it can directly be accessed by those tools.

The current main disadvantage of FUSE exports is that they offer relatively poor performance. That should be fine as long as your use case is just light manipulation of some VM images, like manually modifying some files on them. However, we have not yet really tried to optimize performance, so if more serious use cases appear that would require better performance, we can try to improve it.

by Hanna Reitz at September 06, 2021 06:30 PM

August 24, 2021

QEMU project

QEMU version 6.1.0 released

We’d like to announce the availability of the QEMU 6.1.0 release. This release contains 3000+ commits from 221 authors.

You can grab the tarball from our download page. The full list of changes are available in the Wiki.

Highlights include:

  • block: support for changing block node options after creation via ‘blockdev-reopen’ QMP command
  • Crypto: more performant backend recommendations and improved documentation
  • I2C: emulation support for I2C muxes (pca9546, pca9548) and PMBus
  • TCG Plugins: now enabled by default, with new execlog and cache modelling plugins.
  • ARM: new board support for Aspeed (rainier-bmc, quanta-q7l1), npcm7xx (quanta-gbs-bmc), and Cortex-M3 (stm32vldiscovery) based machines
  • ARM: Aspeed support of Hash and Crypto Engine
  • ARM: emulation support for SVE2 (including bfloat16), integer matrix multiply accumulate operations, TLB invalidate in Outer Shareable domain, TLB range invalidate, and more.
  • PowerPC: pseries: support for detecting hotplug failures in newer guests
  • PowerPC: pseries: increased maximum CPU count
  • PowerPC: pseries: emulation support for some POWER10 prefixed instructions
  • PowerPC: new board support for Genesi/bPlan Pegasos II (pegasos2)
  • RISC-V: updates to OpenTitan platform support, including OpenTitan timer
  • RISC-V: support for virtio-vga
  • RISC-V: documentation improvements and general code cleanups/fixes
  • s390: emulation support for the vector-enhancements facility
  • s390: support for gen16 CPU models
  • x86: new Intel CPU model versions with support for XSAVES instruction
  • x86: added ACPI based PCI hotplug support for Q35 machine (now the default)
  • x86: improvements to emulation of AMD virtualization extensions
  • and lots more…

Thank you to everyone involved!

August 24, 2021 08:22 PM

August 19, 2021

QEMU project

Cache Modelling TCG Plugin

Caches are a key way of enabling modern CPUs to keep running at full speed by avoiding the need to fetch data and instructions from the comparatively slow system memory. As a result, understanding cache behaviour is a key part of performance optimisation.

TCG plugins provide a means to instrument generated code for both user-mode and full system emulation. This includes the ability to intercept every memory access and instruction execution. This post introduces a new TCG plugin that simulates configurable, separate L1 instruction and data caches.

While different microarchitectures often have different approaches at the very low level, the core concepts of caching are universal. As QEMU is not a microarchitectural emulator we model an ideal caching system with a few simple parameters. By doing so, we can adequately simulate the behaviour of L1 private (per-core) caches.

Overview

The plugin simulates how user-configured L1 caches would behave when given a working set defined by a program in user-mode, or by a system-wide workload. Subsequently, it logs performance statistics along with the top N cache-thrashing instructions.

Configurability

The plugin is configurable in terms of the following parameters; an example invocation follows the list:

  • icache size parameters: icachesize, iblksize, iassoc, all of which take a numeric value
  • dcache size parameters: dcachesize, dblksize, dassoc, all of which take a numeric value
  • Eviction policy: evict=lru|rand|fifo
  • How many top-most thrashing instructions to log: limit=TOP_N
  • How many core caches to keep track of: cores=N_CORES
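For illustration, a run that switches to FIFO eviction, logs the top 20 thrashing instructions, and models two cores might look roughly like this (the program and log file names are placeholders; a full default invocation is shown in the Usage section below):

./qemu-x86_64 $(QEMU_ARGS) \
  -plugin ./contrib/plugins/libcache.so,evict=fifo,limit=20,cores=2 \
  -d plugin \
  -D cache.log \
  ./my_program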

Multicore caching

Multicore caching is achieved by having independent L1 caches for each available core.

In full-system emulation, the number of available vCPUs is known to the plugin at plugin installation time, so separate caches are maintained for those.

In user-space emulation, the index of the vCPU initiating a memory access increases monotonically and is bounded only by how many threads the kernel allows to be created. The approach used is to allocate a static number of caches and map all memory accesses onto those cores. This approximation is sufficiently close to real systems, since having more threads than cores results in the threads being interleaved across the available cores, where they might thrash each other anyway.

Design and implementation

General structure

A generic cache data structure, Cache, is used to model either an icache or dcache. For each known core, the plugin maintains an icache and a dcache. On a memory access coming from a core, the corresponding cache is interrogated.

Each cache has a number of cache sets that are used to store the actual cached locations alongside metadata that backs the eviction algorithms. The structure of a cache with n sets and m blocks per set is summarized in the following figure:

cache structure

Eviction algorithms

The plugin supports three eviction algorithms:

  • Random eviction
  • Least recently used (LRU)
  • FIFO eviction

Random eviction

On a cache miss that requires eviction, a randomly chosen block is evicted to make room for the newly-fetched block.

Random eviction effectively requires no per-set metadata.

Least recently used (LRU)

For each set, a generation number is maintained that is incremented on each memory access. The current generation number is assigned to the block currently being accessed. On a cache miss, the block with the lowest generation number is evicted.

FIFO eviction

A FIFO queue instance is maintained for each set. On a cache miss, the evicted block is the first-in block, and the newly-fetched block is enqueued as the last-in block.

Usage

A simple example now demonstrates how to use the plugin with a program that performs matrix multiplication, and how the plugin helps identify code that thrashes the cache.

A program, test_mm, uses the following function to carry out matrix multiplication:

void mm(int n, int m1[n][n], int m2[n][n], int res[n][n])
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int k = 0; k < n; k++) {
                int op1 = m1[i][k];
                int op2 = m2[k][j];
                sum += op1 * op2;
            }
            res[i][j] = sum;
        }
    }
}

Running mm_test inside QEMU using the following command:

./qemu-x86_64 $(QEMU_ARGS) \
  -plugin ./contrib/plugins/libcache.so,dcachesize=8192,dassoc=4,dblksize=64,\
      icachesize=8192,iassoc=4,iblksize=64 \
  -d plugin \
  -D matmul.log \
  ./mm_test

The preceding command will run QEMU and attach the plugin with the following configuration:

  • dcache: cache size = 8KBs, associativity = 4, block size = 64B.
  • icache: cache size = 8KBs, associativity = 4, block size = 64B.
  • Default eviction policy is LRU (used for both caches).
  • Default number of cores is 1.

The following data is logged in matmul.log:

core #, data accesses, data misses, dmiss rate, insn accesses, insn misses, imiss rate
0       4908419        274545          5.5933%  8002457        1005            0.0126%

address, data misses, instruction
0x4000001244 (mm), 262138, movl (%rdi, %rsi, 4), %esi
0x400000121c (mm), 5258, movl (%rdi, %rsi, 4), %esi
0x4000001286 (mm), 4096, movl %edi, (%r8, %rsi, 4)
0x400000199c (main), 257, movl %edx, (%rax, %rcx, 4)

...

We can observe two things from the logs:

  • The most cache-thrashing instructions belong to a symbol called mm, which happens to be the matrix multiplication function.
  • Some array-indexing instructions are generating the greatest share of data misses.

test_mm does a bunch of operations other than matrix multiplication. However, using the plugin data, we can narrow our investigation space to mm, which happens to generate about 98% of the overall number of misses.

Now we need to find out why the instruction at address 0x4000001224 is thrashing the cache. Looking at the disassembly of the program, using objdump -Sl test_mm:

/path/to/test_mm.c:11 (discriminator 3)
                int op2 = m2[k][j];  <- The line of code we're interested in
    1202:   8b 75 c0               mov    -0x40(%rbp),%esi
    1205:   48 63 fe               movslq %esi,%rdi
    1208:   48 63 f2               movslq %edx,%rsi
    120b:   48 0f af f7            imul   %rdi,%rsi
    120f:   48 8d 3c b5 00 00 00   lea    0x0(,%rsi,4),%rdi
    1216:   00
    1217:   48 8b 75 a8            mov    -0x58(%rbp),%rsi
    121b:   48 01 f7               add    %rsi,%rdi
    121e:   8b 75 c8               mov    -0x38(%rbp),%esi
    1221:   48 63 f6               movslq %esi,%rsi
    1224:   8b 34 b7               mov    (%rdi,%rsi,4),%esi
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^
    1227:   89 75 d4               mov    %esi,-0x2c(%rbp)

It can be seen that the most problematic instruction is associated with loading m2[k][j]. This happens because we’re traversing m2 in column-wise order. So if the matrix m2 is larger than the data cache, we end up fetching blocks from which we use only a single integer and never touch again before they are evicted.

A simple solution to this problem is to transpose the second matrix and access it in a row-wise order.

After editing the program to transpose m2 before calling mm, and running it inside QEMU with the plugin attached using the same configuration as before, the following data is logged in matmul.log:

core #, data accesses, data misses, dmiss rate, insn accesses, insn misses, imiss rate
0       4998994        24235           0.4848%  8191937        1009            0.0123%

address, data misses, instruction
0x4000001244 (mm), 16447, movl (%rdi, %rsi, 4), %esi
0x4000001359 (tran), 3994, movl (%rcx, %rdx, 4), %ecx
0x4000001aa7 (main), 257, movl %edx, (%rax, %rcx, 4)
0x4000001a72 (main), 257, movl %ecx, (%rax, %rdx, 4)

...

It can be seen that a small number of misses is now generated at transposition time in tran. The rest of the matrix multiplication is carried out using the same procedure, but multiplying m1[i][k] by m2[j][k]. So m2 is traversed row-wise and hence utilizes cache space much more efficiently.

Multi-core caching

The plugin accepts a cores=N_CORES argument that represents the number of cores that the plugin must keep track of. Memory accesses generated by excess threads will be served through the available core caches. The model is an approximation, as described above, and is closest to idealized behaviour when the program creates no more threads than there are cores available; otherwise inter-thread thrashing will invariably occur.

An example usage of the plugin using the cores argument to use 4 per-core caches against a multithreaded program:

./qemu-x86_64 $(QEMU_ARGS) \
    -plugin ./contrib/plugins/libcache.so,cores=4 \
    -d plugin \
    -D logfile \
    ./threaded_prog

This reports the following:

core #, data accesses, data misses, dmiss rate, insn accesses, insn misses, imiss rate
0       76739          4195          5.411666%  242616         1555            0.6409%
1       29029          932           3.211106%  70939          988             1.3927%
2       6218           285           4.511835%  15702          382             2.4328%
3       6608           297           4.411946%  16342          384             2.3498%
sum     118594         5709          4.811139%  345599         3309            0.9575%

...

Conclusion

By emulating simple configurations of icache and dcache we can gain insights into how a working set is utilizing cache memory. Simplicity is the goal, and the L1 cache is emphasized because poor L1 utilization can severely hurt overall system performance.

This plugin is made as part of my GSoC participation for the year 2021 under the mentorship of Alex Bennée.

List of posted patches related to the plugin:

The first series (plugins: New TCG plugin for cache modelling), along with the bug fix patches, is already merged into the QEMU main tree. The remaining patches are merged into the plugins/next tree and are awaiting merging into the main tree, since we’re in a release cycle as of the time of posting.

by Mahmoud Mandour at August 19, 2021 08:00 AM

July 30, 2021

KVM on Z

qeth Devices: Promiscuous Modes, Live Guest Migration, and more

qeth devices, namely OSA-Express and HiperSockets, have a vast array of functionality that is easy to get lost in. This entry illustrates some of the most commonly sought features, while trying to avoid confusing the reader with too much background information.

TLDR:

  • IBM z14: For KVM, always use separate OSA devices on source and target for LGM; For OVS, use a primary bridgeport with OSA, and VNIC characteristics with HiperSockets.
  • IBM z15: For KVM with OVS, use VNIC characteristics for any qeth device; for KVM and MacVTap, use VNIC characteristics if you want to use the same device on source and target system in LGM scenarios.

 

Bridgeport Mode

Initially, the only way to enable promiscuous mode on OSA-Express adapters and HiperSockets was through the so-called bridgeport mode. The concept of the bridgeport mode distinguishes between ports as follows:

  • Primary bridgeport: The primary bridgeport receives all traffic for all destination addresses unknown to the device. Or, in other words: if the device receives data for a destination unknown to it, instead of dropping it, it forwards it to the current primary bridgeport. This further implies that as soon as an operating system registers a MAC address with the device, traffic destined for that MAC address becomes "invisible" to the bridgeport.
    Note: Only a single operating system can use the primary bridgeport on an adapter at any time.
  • Secondary bridgeport: Whenever the operating system that currently has the primary bridgeport gives up on it, one of the secondary bridgeports will become the new primary. An arbitrary number of operating systems can register as a secondary bridgeport.

Bridgeport mode is available in Layer 2 mode only. Furthermore, HiperSockets devices need to be defined as external-bridged in IOCDS.
Use attributes in the device's sysfs directory as follows:

  • bridge_role: Set the desired role (none, primary, or secondary), and query the current one.
  • bridge_state: Query the current state (active or inactive)

Bridgeport mode effectively provides a promiscuous mode. But note that in addition to enabling the primary bridgeport mode, the promiscuous mode still has to be set on the respective interface!
All in all, here is what typical usage of this feature looks like:

  $ echo primary >/sys/devices/qeth/0.0.bd00/bridge_role

  # verify that we got primary bridgeport, not secondary, and are active:

  $ cat /sys/devices/qeth/0.0.bd00/bridge_state 
  active
  $ cat /sys/devices/qeth/0.0.bd00/bridge_role

  primary

  # enable promiscuous mode on the interface
  $ ip link set <interface> promisc on

The downside of this approach is that only a single operating system per device can enable the primary bridgeport mode, which only scales so far. Therefore, something better, with more functionality, was introduced to the platform.


VNIC Characteristics

Introduced with IBM z14/LinuxONE II for HiperSockets, and IBM z15/LinuxONE III for OSA, the VNIC characteristics feature provides promiscuous mode for multiple operating systems attached to the same device, and provides additional functionality which can be very handy especially with KVM.
The VNIC characteristics can be controlled through a number of attributes located in an extra subdirectory called vnicc in the device's sysfs directory.

Let us focus on two main functionalities.

Promiscuous Mode

Technically, VNIC does not provide a traditional promiscuous mode (just like bridgeport mode did not in the literal sense), but rather emulates a self-learning switch. However, for users looking for a promiscuous mode that is usable in conjunction with a Linux bridge or an Open vSwitch, the end-result is the same.

To activate, set the attributes as follows:

  echo 1 > /sys/devices/qeth/0.0.bd00/vnicc/flooding
  echo 1 > /sys/devices/qeth/0.0.bd00/vnicc/mcast_flooding
  echo 1 > /sys/devices/qeth/0.0.bd00/vnicc/learning

Again, in addition to enabling the promiscuous mode on the device, the promiscuous mode still has to be set on the respective interface:

  ip link set <interface> promisc on
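To double-check that the interface flag is actually set, the detailed link view can be consulted (purely a sanity check, not required for operation):

  ip -d link show <interface> | grep promiscuity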


KVM Live Guest Migration

Providing connectivity to virtual servers running in KVM, administrators have two choices to provide connectivity:

  • Via Open vSwitch: Requires a promiscuous mode, see above. Virtual servers migrated between the two Open vSwitches will have uninterrupted connectivity thanks to the devices being configured in promiscuous mode, provided that the networking architecture is set up accordingly. The two Open vSwitches may or may not share the same networking device.
  • Via MAC Address Takeover: This is only required in case both the source and the target KVM host share the same device and use MacVTap to connect to it. While the traffic will still run through the same device, some handshaking has to take place to make sure that the MAC address is configured correctly, and traffic is forwarded to the target KVM host once migration has completed. This has to be authorized - otherwise, an attacker could divert traffic elsewhere.

Luckily, VNIC characteristics offers functionality for MAC address takeover, too. To enable, set the VNIC characteristics as follows:

On the source KVM host:

  echo 1 > /sys/devices/qeth/0.0.bd00/vnicc/takeover_learning

On the target KVM host:

  echo 1 > /sys/devices/qeth/0.0.bd00/vnicc/takeover_setvmac


Final Words

Note that bridgeport mode and VNIC characteristics are mutually exclusive! As soon as even a single VNIC-characteristics-related attribute is activated, bridgeport-related functionality is no longer available until that VNIC characteristic is disabled again.

Furthermore, check your Linux distribution's tools on how to persist the changes outlined above. On many distros, chzdev (comes with the s390-tools package) does the job, but not (yet) on all.

This article only provides a brief overview. Both the promiscuous mode and the VNIC characteristics have a lot more to them than what was covered here; the goal was merely to provide just enough information to get readers started with the most common use cases. For a deeper understanding, check the respective sections in the Device Drivers, Features, and Commands book.

by Stefan Raspl (noreply@blogger.com) at July 30, 2021 04:07 PM

July 02, 2021

Stefan Hajnoczi

Slides available for "Bring Your Own Virtual Devices: Frameworks for Software and Hardware Device Virtualization"

The PDF slides for my "Bring Your Own Virtual Devices: Frameworks for Software and Hardware Device Virtualization" talk from the 16th Workshop on Virtualization in High-Performance Cloud Computing are now available.

This talk covers out-of-process device interfaces including vhost (kernel), vhost-user, Linux VFIO, mdev, vfio-user, vDPA, and VDUSE. It gives a brief overview of each interface, how it works, and how to develop your own devices.

The growing number of out-of-process device interfaces available in QEMU/KVM can make it hard to understand and compare them. Each of these interfaces is designed for different use cases. For example, whether you want to pass through hardware or implement the device in software, if you want to implement your device in the host kernel or in host userspace, etc. This talk will give you the necessary knowledge to compare these interfaces yourself so you can decide which one is most appropriate for your use case.

For more information about the design of out-of-process device interfaces, see also my previous blog post about requirements for out-of-process devices.

by Unknown (noreply@blogger.com) at July 02, 2021 03:42 PM

June 30, 2021

KVM on Z

Webinar: 2021 Linux on IBM Z and LinuxONE Technical Client Workshop

Join us for the 2021 Linux on IBM Z and LinuxONE Virtual Client Workshop!

Abstract

Get the latest news about the Linux exploitation and advantages of the IBM Z and LinuxONE platform in this technical workshop. Presented by our developers and solution architects, the training focuses on the latest news and technical information for Linux on IBM Z, LinuxONE, z/VM, and KVM, such as Red Hat OpenShift Container Platform, Red Hat OpenShift Container Storage, Security, Performance, Networking and Virtualization. You will have the chance to interact directly with IBM developers and solution experts during the event, especially in the interactive workgroup sessions, which will be held on the last day.

This workshop is free of charge.

Agenda Highlights
  • What's New on RHOCP on IBM Z & LinuxONE 
  • Hybrid Cloud and why RHOCP on IBM Z & LinuxONE can enable highest flexibility
  • Introduction of Red Hat OpenShift Container Storage
  • Red Hat OpenShift Container Platform on IBM Z & LinuxONE: CPU Consumption Demystified
  • Cloud Ready Development can now profit from multi-architecture capabilities and several features in RHOCP on IBM Z
  • FUJITSU Enterprise Postgres: Finally! An OCP-certified Database for Linux on IBM Z and LinuxONE that exploits our hardware capabilities
  • Reduce your IT costs with IBM LinuxONE
  • How IBM Cloud Paks drive business value and lower IT costs
  • z/VM Platform Update
  • Linux and KVM on IBM Z and LinuxONE - What's New
  • kdump - Recommendations for Linux on IBM Z and LinuxONE
  • Elasticsearch on IBM Z - Performance Experiences, Hints and Tips
  • Crypto Update
  • Fully homomorphic encryption Introduction and Update
  • Putting SMC-Dv2 to work
  • Java on IBM Z - News, Updates, and other Pulp Fiction
  • Various workgroup sessions

Schedules & Registration

Americas, Europe, Middle East & Africa
July 12-16, every day 8:30 - 11:30 AM EST / 14:30 - 17:30 CET
Register here.

Asia Pacific
July 27-29, 2021, every day 8:30 - 11:30 AM CET / 2:30 - 5:30 PM Singapore time
Register here.

by Stefan Raspl (noreply@blogger.com) at June 30, 2021 10:08 PM

June 25, 2021

KVM on Z

SLES 15 SP3 Released

SUSE Linux Enterprise Server 15 SP3 is out! See the official announcement and the release notes. It provides

  • QEMU v5.2, supporting virtio-fs on IBM Z
  • libvirt v7.1
For a detailed list of IBM Z and LinuxONE-specific (non-KVM) features see here.

by Stefan Raspl (noreply@blogger.com) at June 25, 2021 03:06 PM

June 21, 2021

Gerd Hoffmann

My kubernetes test cluster, overview.

This is an article series about my kubernetes test cluster.

  1. Cluster node installation on fedora and basic cluster setup.
  2. Planned: Setup ingress and other useful cluster services.

by Gerd Hoffmann at June 21, 2021 10:00 PM

My kubernetes test cluster, part one — install.

I'm running a kubernetes test cluster in my home network. It is used to learn kubernetes and try out various things, for example kata containers and kubevirt. Not used much (yet?) for actual development.

After mentioning it here and there some people asked for details, so here we go. I'll go describe my setup, with some kubernetes and container basics sprinkled in.

This is part one of an article series and will cover cluster node installation and basic cluster setup.

The cluster nodes

Most cluster nodes are dual-core virtual machines. The control-plane node (formerly known as master node) has 8G of memory, most worker nodes have 4G of memory. It is a mix of x86_64 and aarch64 nodes. Kubernetes names these architectures amd64 and arm64, which is easily confused, so take care 😎.

The virtual nodes use bridged networking. So no separate network; they simply show up on my 192.168.2.0/24 home network just like the physically connected machines. They get a static IP address assigned by the DHCP server, and I can easily ssh into each node.

All cluster nodes run Fedora 34, Server Edition.

Node configuration

I have a git repository with some config files, to simplify rebuilding a cluster node from scratch. The repository also has some shell scripts with the commands listed later in this blog post.

Lets go over the config files one by one.

$ cat /etc/sysctl.d/kubernetes.conf
kernel.printk=4
net.ipv4.ip_forward=1
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-ip6tables=1

This is needed for kubernetes networking.

$ cat /etc/modules-load.d/kubernetes.conf
# networking
bridge
br_netfilter
# kata
vhost
vhost_net
vhost_vsock

Load some kernel modules needed at boot. Again for kubernetes networking. Also vhost support which is needed by kata containers.
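To activate all of this right away without a reboot, the modules can be loaded and the sysctl settings re-read by hand; something along these lines should do (the net.bridge.* keys only exist once br_netfilter is loaded):

$ sudo modprobe -a bridge br_netfilter vhost vhost_net vhost_vsock
$ sudo sysctl --system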

$ cat /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-$basearch
enabled=0
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg

The upstream kubernetes rpm repository. Note this is not enabled (enabled=0) because I don't want normal fedora system updates to also update the kubernetes packages. For installing/updating kubernetes packages I can enable the repo using dnf --enablerepo=kubernetes ....

Package installation

Given that I want to play with different container runtimes I've decided to use cri-o, which allows doing just that. Fedora has packages. They are in a module though, so that must be enabled first.

$ sudo dnf module list cri-o
$ sudo dnf module enable cri-o:${version}

The cri-o version should match the kubernetes version you want to run. That is not the case in my cluster right now because I only learned that after setting up the cluster, so obviously the sky isn't falling in case they don't match. The next time I update the cluster I'll bring them into sync.

Now we can go install the packages from the fedora repos. cri-o, runc (default container runtime), and a handful of useful utilities.

$ sudo dnf install podman skopeo buildah runc cri-o cri-tools \
    containernetworking-plugins bridge-utils telnet jq

Next in line are the kubernetes packages from the google repo. The repo has all versions, not only the most recent, so you can ask for the version you want and you'll get it. As mentioned above the repo must be enabled on the command line.

$ sudo dnf install --enablerepo=kubernetes \
    {kubectl,kubeadm,kubelet}-${version}

Configure and start services

kubelet needs some configuration, my git repo with the config files has this:

$ cat /etc/sysconfig/kubelet
KUBELET_EXTRA_ARGS=--cgroup-driver=systemd --fail-swap-on=false

Asking kubelet to delegate all cgroups work to systemd is needed to make kubelet work with cgroups v2. With that in place we can reload the configuration and start the services:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now crio
$ sudo systemctl enable --now kubelet

Kubernetes cluster nodes need a few firewall entries so the nodes can speak to each other. I was too lazy to set up all that and just turned off the firewall. The cluster isn't reachable from the internet anyway, so 🤷.

$ sudo systemctl disable --now firewalld

Initialize the control plane node

All the preparing steps up to this point are the same for all cluster nodes. Now we go initialize the control plane node.

$ sudo kubeadm init \
	--pod-network-cidr=10.85.0.0/16 \
	--kubernetes-version=${version} \
	--ignore-preflight-errors=Swap

Picked the 10.85.0.0/16 network because that happens to be the default network used by cri-o, see /etc/cni/net.d/100-crio-bridge.conf.

This command will take a while. It will pull kubernetes container images from the internet, start them using the kubelet service, and finally initialize the cluster.

kubeadm will write the config file needed to access the cluster with kubectl to /etc/kubernetes/admin.conf. It'll make you cluster root. Kubernetes names this cluster-admin role in the rbac (role based access control) scheme.

For my devel cluster I simply use that file as-is instead of setting up some more advanced user authentication and access control. I place a copy of the file at $HOME/.kube/config (the default location used by kubectl). Copying the file to other machines works, so I can also run kubectl on my laptop or workstation instead of ssh'ing into the control plane node.
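In case you are looking for the exact commands, this is roughly what kubeadm itself suggests at the end of init:

$ mkdir -p $HOME/.kube
$ sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config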

Time to run the first kubectl command to see whether everything worked:

$ kubectl get nodes
NAME                        STATUS   ROLES                  AGE   VERSION
k8s-node1.home.kraxel.org   Ready    control-plane,master   5m    v1.21.1

Yay! First milestone.

Side note: single node cluster

By default kubeadm init adds a taint to the control plane node so kubernetes wouldn't schedule pods there:

$ kubectl describe node k8s-node1.home.kraxel.org | grep NoSchedule
Taints:             node-role.kubernetes.io/master:NoSchedule

If you want to go for a single node cluster, all you have to do is remove that taint so kubernetes will schedule and run your pods directly on your new and shiny control plane node. The magic words for that are:

$ kubectl taint nodes --all node-role.kubernetes.io/master-

Done. You can start playing with the cluster now.

If you want to add one or more worker nodes to the cluster instead and watch kubernetes distribute the load, read on ...

Initialize worker nodes

The worker nodes need a bootstrap token to authenticate when they want to join the cluster. The kubeadm init command creates a token and will also print the kubeadm join command needed to join. If you don't have that any more, no problem: you can always get the token later using kubeadm token list. In case the token has expired (they are valid for a day or so) you can create a new one using kubeadm token create. Besides the token, kubeadm also needs the hostname and port to be used to connect to the control plane node. Default port for the kubernetes API is 6443, so ...

$ sudo kubeadm join "k8s-node1.home.kraxel.org:6443" \
	--token "${token}" \
	--discovery-token-unsafe-skip-ca-verification \
	--ignore-preflight-errors=Swap

... and check results:

$ kubectl get nodes
NAME                        STATUS   ROLES                  AGE   VERSION
k8s-node1.home.kraxel.org   Ready    control-plane,master   22m   v1.21.1
k8s-node2.home.kraxel.org   Ready    <none>                 2m    v1.21.1

The node may show up in "NotReady" state for a while when it has already registered but hasn't completed initialization yet.

Now repeat that procedure on every node you want to add to the cluster.
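As a shortcut, recent kubeadm versions can also hand you a ready-to-use join command, including a freshly created token:

$ kubeadm token create --print-join-command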

Side note: scripting kubernetes with json

Both kubeadm and kubectl can return the data you ask for in various formats. By default they print a nice, human-readable table to the terminal. But you can also ask for yaml, json and others using the -o or --output switch. Specifically, json is very useful for scripting: you can pipe the output through the jq utility (you might have noticed it in the list of packages to install at the start of this blog post) to fish out the items you actually need.

For starters two simple examples. You can get the raw bootstrap token this way:

$ kubeadm token list -o json | jq -r .token

Or check out some node details:

$ kubectl get node k8s-node1.home.kraxel.org -o json | jq .status.nodeInfo
{
  "architecture": "amd64",
  "bootID": "a18dcad0-3427-4a12-a238-7b815fe45ea0",
  "containerRuntimeVersion": "cri-o://1.19.0-dev",
  "kernelVersion": "5.12.9-300.fc34.x86_64",
  "kubeProxyVersion": "v1.21.1",
  "kubeletVersion": "v1.21.1",
  "machineID": "a2b3a7ba9ec54b2d84b66d70156702d2",
  "operatingSystem": "linux",
  "osImage": "Fedora 34 (Thirty Four)",
  "systemUUID": "7f4854c4-2b92-4fea-9bb7-3d28537af675"
}

There are way more possible use cases. When reading config and patch files kubectl likewise accepts both yaml and json as input.
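For example, a label can be attached to one of the worker nodes with an inline json merge patch (the label itself is made up, just to show the syntax):

$ kubectl patch node k8s-node2.home.kraxel.org \
    -p '{"metadata":{"labels":{"disk":"ssd"}}}'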

Pod networking with flannel

There is one more basic thing to set up: installing a network fabric to get the pod network going. This is needed to allow pods running on different cluster nodes to talk to each other. When running a single node cluster this can be skipped.

There are a bunch of different solutions out there; I've settled on flannel in "host-gw" mode. First download kube-flannel.yml from github. Then tweak the configuration: make sure the network matches the pod network passed to kubeadm init, and change the backend. Here are the changes I've made:

--- kube-flannel.yml	2021-04-26 11:15:09.820696429 +0200
+++ kube-flannel-local.yml	2021-04-26 11:15:18.403551923 +0200
@@ -125,9 +125,9 @@
     }
   net-conf.json: |
     {
-      "Network": "10.244.0.0/16",
+      "Network": "10.85.0.0/16",
       "Backend": {
-        "Type": "vxlan"
+        "Type": "host-gw"
       }
     }
 ---

Now apply the yaml file to install flannel:

$ kubectl apply -f kube-flannel-local.yml

The flannel pods are created in the kube-system namespace, you can check the status this way:

$ kubectl get pods -n kube-system
NAME                            READY   STATUS    RESTARTS   AGE
[ ... ]
kube-flannel-ds-5l7x6           1/1     Running   0          3m
kube-flannel-ds-7xjtz           1/1     Running   0          3m
[ ... ]

Once all pods are up and running your pod network should be working. One nice thing with "host-gw" mode is that this uses standard network routing of the cluster nodes and you can inspect the state with standard linux tools:

$ ip route list | grep 10.85
10.85.0.0/24 dev cni0 proto kernel scope link src 10.85.0.1 
10.85.1.0/24 via 192.168.2.112 dev enp2s0
[ ... ]

Each cluster node gets a /24 subnet of the pod network assigned. The cni0 device is the subnet of the local node. The other subnets are routed to the other cluster nodes. Pretty straightforward.

Rounding up

So, that's it for part one. The internet has tons of kubernetes tutorials and examples which you can try on the cluster now. One good starting point is Kubernetes by example.

My plan for part two of this article series is installing and configuring some useful cluster services, with one of them being ingress which is needed to access your cluster services with a web browser.

by Gerd Hoffmann at June 21, 2021 10:00 PM

June 20, 2021

Stefan Hajnoczi

My performance benchmarking workflow (2021)

Benchmarking computer systems is time-consuming because setting up the necessary environment involves a lot of work. Over time I have built a workflow that mitigates the cost of setting up benchmarks and allows me to analyze performance more effectively. This blog post covers my workflow as of 2021.

Performance investigations often follow these steps:

  1. Set up hardware and software.
  2. Run initial benchmarks to verify that the bottleneck under investigation is being triggered.
  3. Collect a full set of benchmark results and monitoring data.
  4. Analyze results and form a hypothesis about the bottleneck.
  5. Implement a proof-of-concept optimization to test the hypothesis.
  6. Go to Step 3 until the desired benchmark results are reached, keeping those optimizations that helped.

This is a long process that is costly to pause/resume or replicate again in the future. Setting up hardware and software manually is both time-consuming and error-prone. Therefore we don't want to do it more than once. There is a risk that replicating the benchmark on another machine will fail to produce identical results due to differences in environments.

The consequence of high-overhead processes is that we minimize their use since we cannot afford to run through the process as often as we'd like. This means we cannot answer all the performance questions we'd like to and therefore our understanding is limited. We cannot discover all the truths that would enable us to make performance improvements.

A more lightweight process would encourage experimentation and lead to higher productivity.

An ideal workflow

In a low-overhead world I would like to do the following:

  1. Set up hardware and software once only and be able to return to that state again in the future at the press of a button.
  2. Capture the full benchmarking environment so the configuration can be inspected and modified easily.
  3. Store benchmark results so that each run is available for further analysis in the future.

The workflow is actually quite similar to developing code with git:

  1. Create a topic branch for this performance investigation.
  2. Add an environment definition to produce the desired hardware and software state.
  3. Run the benchmark and collect the results.
  4. Commit the environment and results.
  5. Go to Step 2 to modify the environment (e.g. apply proof-of-concept patches to software) and repeat.

Since the performance investigation is captured in a git branch it's easy to switch to another investigation without losing history.

This is actually what I do! Git provides the storage and time machine functionality for easily pausing/resuming or replicating performance investigations.
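To make that a bit more concrete, one iteration of such an investigation might look roughly like this (branch, playbook and inventory names are just placeholders):

$ git checkout -b latency-investigation
$ $EDITOR playbook.yml                        # tweak the environment definition
$ ansible-playbook -i inventory playbook.yml  # set up hosts, run benchmarks
$ git add -A && git commit -m 'benchmark run: baseline'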

Ansible

Ansible provides the automation system necessary to put hardware and software into the desired state for benchmarking. Ansible's killer feature is the large ecosystem of modules for handling tasks like installing packages, configuring virtual machines and containers, etc. I find Ansible more productive than Python or shell scripting thanks to Ansible's modules collection.

I've begun collecting Ansible tasks for Linux KVM development in virt-tasks. If you're wondering what the configuration for running a benchmark looks like, here is an Ansible playbook that builds QEMU and a guest kernel, creates a Fedora 34 virtual machine, runs the fio disk I/O benchmark, and collects the results:


---
- hosts: hosts
  tasks:
    - include_tasks: tasks/build-qemu.yml
      vars:
        - repo: https://gitlab.com/qemu-project/qemu.git
        - version: v6.0.0

    - name: create disk image
      include_tasks: tasks/virt-builder-create-image.yml
      vars:
        - os_version: fedora-34
        - size: 32G
        - output: /var/lib/libvirt/images/test.img
        - format: raw

    - name: build guest kernel
      include_tasks: tasks/build-kernel.yml
      vars:
        - repo: https://gitlab.com/stefanha/linux.git
        - version: cpuidle-haltpoll-virtqueue
        - config_src_path: files/.config

    - name: stop vm
      virt:
        name: test
        state: shutdown
      ignore_errors: yes

    - name: start vm
      include_tasks: tasks/start-vm.yml
      vars:
        - xml: "{{ lookup('file', 'files/test.xml') }}"
        - host: 192.168.122.192

- hosts: vms
  tasks:
    - name: install fio and rsync
      dnf:
        state: present
        name:
          - fio
          - rsync

    - name: run fio
      script: files/fio.sh

    - name: fetch fio output files
      synchronize:
        src: fio-output/
        dest: notebook/fio-output/poll_source-off
        mode: pull

    - name: run fio
      script: files/fio.sh --enable

    - name: fetch fio output files
      synchronize:
        src: fio-output/
        dest: notebook/fio-output/poll_source-on
        mode: pull

- hosts: hosts
  tasks:
    - name: stop vm
      virt:
        name: test
        state: shutdown

An important point is that the Ansible playbook sets up the full environment, runs all benchmarks, and collects the results. Unlike a playbook that runs a single benchmark this one runs the full suite so that the playbook captures the entire environment that produced the results. This distinction is important when tweaking the benchmark configuration or trying out proof-of-concept optimizations. Each git commit needs to encompass the full environment so that the performance investigation is reproducible and can be resumed in the future.

Jupyter

I recently started using JupyterLab notebooks for data analysis. It provides a convenient environment for graphing results and organizing them in documents.

Thanks to the official Jupyter container images you can get a full Python data analysis environment running with just one command:


$ podman run --userns=keep-id -e JUPYTER_ENABLE_LAB=yes -p 8888:8888 --rm -v "$PWD":/home/jovyan/work:z jupyter/scipy-notebook

So far I have only scratched the surface of JupyterLab. It works well for visualizing data although the way I currently use it is not much different from writing a Python matplotlib script and running it from the command-line. In time I'll get a better appreciation for its strengths and weaknesses.

Conclusion

A git-based workflow that automates benchmark setup and stores the results goes a long way towards mitigating the high overhead of performance investigations. If I'm interrupted or need to switch to a different machine it's easy to resume the investigation. Having the results and details of the environment stored together makes it possible to revisit benchmark runs in the future to reproduce or tweak them. The combination of git, Ansible, and Jupyter achieves this workflow quite well, but if you're familiar with other tools I'd love to hear!

by Unknown (noreply@blogger.com) at June 20, 2021 02:04 PM

May 31, 2021

KVM on Z

RHEL 8.4 Released

RHEL 8.4 is out! See the official announcement and the release notes.

KVM is supported via Advanced Virtualization, and provides

  • QEMU v5.2, supporting virtio-fs on IBM Z
  • libvirt v7.0

Furthermore, RHEL 8.4 now supports graphical installation for guest installs. Just add the input, graphics and video arguments (the first two lines of the command below) to your virt-install command line for an RHEL 8.4 install on an RHEL 8.4 KVM host:

    virt-install --input keyboard,bus=virtio --input mouse,bus=virtio \
    --graphics vnc --video virtio \
    --disk size=8 --memory 2048 --name rhel84 \
    --cdrom /var/lib/libvirt/images/RHEL-8.4.0-20210503.1-s390x-dvd1.iso

And the installation will enter the fancy graphical installer:

Make sure to have package virt-viewer installed on the host, and X forwarding enabled (option -X for ssh).

This new support also allows graphical installs started in cockpit:

by Stefan Raspl (noreply@blogger.com) at May 31, 2021 04:55 PM

May 30, 2021

Gerd Hoffmann

Adding cut+paste support to qemu

The spice project has supported cut+paste for ages; now the rest of qemu is playing catch-up.

Implementation options

So, what are the choices for implementing cut+paste support? Without guest cooperation the only possible way would be to send text as keystrokes to the guest. That approach has a number of drawbacks:

  • It works for text only.
  • It is one-way (host to guest) only.
  • Has keyboard mapping problems even when limiting to us-ascii,
    sending unicode (ä ø Я € © 漢字 ❤ 😎) reliably is impossible.
  • Too slow for larger text blocks.

So, this is not something to consider seriously. Instead we need help from the guest, which is typically implemented with some agent process running inside the guest. The options are:

  1. Write a new cut+paste agent.
  2. Add cut+paste support to the qemu guest agent.
  3. Use the spice agent which already supports cut+paste.

Reusing the spice agent has some major advantages. For starters there is no need to write any new guest code for this. Less work for developers and maintainers. Also the agent has been packaged for years by most distributions (typically the package is named spice-vdagent). So it is easily available, making things easier for users, and guest images with the agent installed work out-of-the-box.

Downside is that this is a bit confusing as you need the spice agent in the guest even when not using spice on the host. So I'm writing this article to address that ...

Some background on spice cut+paste

The spice guest agent is not a single process but two: One global daemon running as system service (spice-vdagentd) and one process (spice-vdagent) running in desktop session context.

The desktop process will handle everything which needs access to your display server. That includes cut+paste support. It will also talk to the system service. The system service in turn connects to the host using a virtio-serial port. It will relay data messages between desktop process and host and also process some of the requests (mouse messages for example) directly.

On the host side qemu simply forwards the agent data stream to the spice client and visa versa. So effectively the spice guest agent can communicate directly with the spice client. It's configured this way:

qemu-system-x86_64 [ ... ] \
  -chardev spicevmc,id=ch1,name=vdagent \
  -device virtio-serial-pci \
  -device virtserialport,chardev=ch1,id=ch1,name=com.redhat.spice.0
  • spicevmc: This is the data channel to the spice client.
  • virtio-serial: The virtio device which manages the ports.
  • virtserialport: The port for the guest/host connection. It'll show up as /dev/virtio-ports/com.redhat.spice.0 inside the guest.

The qemu clipboard implementation.

The central piece of code is the new qemu clipboard manager (ui/clipboard.c). Initially it supports only plain text. The interfaces are designed for multiple data types though, so adding support for more data types later on is possible.

There are three peers which can talk to the qemu clipboard manager:

  • vnc: The vnc server got clipboard support (ui/vnc-clipboard.c), so vnc clients with cut+paste support can exchange data with the qemu clipboard.
  • gtk: The gtk ui got clipboard support too (ui/gtk-clipboard.c) and connects the qemu clipboard manager with your desktop clipboard.
  • vdagent: Qemu got an implementation of the spice agent protocol (ui/vdagent.c), which connects the guest to the qemu clipboard.

This landed in the qemu upstream repo a few days ago and will be shipped with the qemu 6.1 release.

Configure the qemu vdagent

The qemu vdagent is implemented as chardev. It is a drop-in replacement for the spicevmc chardev, and instead of forwarding everything to the spice client it implements the spice agent protocol and parses the messages itself. So only the chardev configuration changes, the virtserialport stays as-is:

qemu-system-x86_64 [ ... ] \
  -chardev qemu-vdagent,id=ch1,name=vdagent,clipboard=on \
  -device virtio-serial-pci \
  -device virtserialport,chardev=ch1,id=ch1,name=com.redhat.spice.0

The vdagent has two options to enable/disable vdagent protocol features:

  • mouse={on,off}: Enable/disable mouse messages. When enabled, absolute mouse events can travel this way instead of using a usb or virtio tablet device for that. Default is on.
  • clipboard={on,off}: Enable/disable clipboard support. Default is off (for security reasons).

Future work

No immediate plans right now, but I have some ideas what could be done:

  • Add more peers: Obvious candidates are the other UIs (SDL, cocoa). Possibly also more guest protocols; I think vmware supports cut+paste too (via vmport and agent).
  • Add more data types: Image support is a hot candidate. Chances are high that this involves more than just passing data: spice uses png as baseline image format, whereas vnc uses bmp. So qemu most likely has to do image format conversions.

Maybe I look into them when I find some time. No promise though. Patches are welcome.

by Gerd Hoffmann at May 30, 2021 10:00 PM

May 20, 2021

Gerd Hoffmann

virtio-gpu and qemu graphics in 2021

Time for an update, a few things did happen since the previous update in November 2019.

virtio-gpu features

Progress is rather slow in qemu due to shifted priorities. That doesn't mean virglrenderer development is completely stalled though. crosvm (aka Chrome OS Virtual Machine Monitor) has virtio-gpu support too and is pushing forward virglrenderer development these days. There is good progress in virglrenderer library (although I don't follow that closely any more these days), crosvm and linux kernel driver.

Lets go through the feature list from 2019 for a quick update first:

  • shared mappings: Not a separate feature, it's part of blob resources now.
  • blob resources: Specification is final. The linux kernel got support, crosvm too as far as I know. qemu lags behind.
  • metadata query: Implemented using execbuffer commands, which is the communication path between guest driver and virglrenderer, so virtio-gpu doesn't need any changes for this.
  • host memory: Also part of blob resources.
  • vulkan support: Not fully sure where we stand here. Blob resources are designed with vulkan memory management needs in mind, so there shouldn't be any blockers left in the virtio-gpu guest/host interface. It should "only" be a matter of coding things up in guest driver and virglrenderer. And adding blob resource support to qemu, of course.

Another feature which was added to virtio-gpu which is not on the 2019 list is UUID support. This allows to attach a UUID to a virtio-gpu resource, which is specifically useful for dma-buf sharing between virtio drivers inside the guest. A guest driver importing a virtio-gpu resource can send the UUID to the host device for lookup, so the host devices can easily share the resource too.

virtio-gpu state in qemu

Not much progress here up to qemu 6.0. There are a few changes merged or in the pipeline for the next release (6.1) though.

First, the virtio-gpu device is split. It will lose the virgl=on|off property. There will be two devices instead: virtio-vga and virtio-vga-gl (same for the other device variants).

This will de-clutter the source code and it will also remove the hard virglrenderer dependency from virtio-gpu. With modular qemu builds you'll now have two modules: one for the simple virtio-vga device, without any external dependencies, and one for the virtio-vga-gl device, which supports virgl and thus depends on the virglrenderer library.
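To check which of the two device variants a given build or package install actually provides, the device list can be queried; a quick sanity check might be:

$ qemu-system-x86_64 -device help | grep virtio-vga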

Second, blob resource support for the simple virtio-vga device is in progress, and it will bring support for shared resource mappings to qemu. This will accelerate the display path due to less or no copying of pixel data.

You may wonder why this is useful for the non-virgl device. The use case is 3D-rendering using a pci-assigned GPU (or vgpu). In that model the GPU handles only the rendering, virtio-gpu handles the display scanout, and framebuffers are shared between drivers using dma-bufs. If you worked with arm socs before this may sound familiar because they often handle rendering and scanout with separate hardware blocks too. So virt graphics will use the same approach, and userspace (xorg + wayland) luckily is already prepared for it.

modular graphics in qemu

When building qemu with configure --enable-modules you'll get a modular qemu build, which means some functionality is built as separate modules (aka shared objects) which are loaded on demand. This allows distributions to move some functionality into separate, optional sub-packages, which is especially useful for modules which depend on shared libraries. That way you can have a rather lightweight qemu install by simply not installing the sub-packages you don't need.
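
As a small sketch (target list, build directory and module file names are assumptions and may differ per qemu version), a modular build boils down to something like this:

 # configure and build qemu with modules enabled
 ./configure --enable-modules --target-list=x86_64-softmmu
 make -j$(nproc)

 # the modularized functionality ends up as loadable .so files next to the binary
 ls build/*.so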

Block backend drivers were modularized first, audio backend drivers next. The easy UI code was modularized early on too: curses, gtk and sdl.

Last year we took a big step forward in modularizing qemu graphics. Two features got added to qemu: first, support for building devices as modules, and second, support for modules depending on other modules. This allowed building pretty much all qemu graphics code that has external shared library dependencies as modules:

opengl, egl-headless
depend on mesa libraries and drivers
spice-core, spice-app
depend on libspice-server (plus more indirect deps)
depend on qemu opengl module
qxl device
depends on libspice-server too
depends on qemu spice-core module
virtio-vga-gl device
depends on virglrenderer

You can see the results on Fedora 34. Installing qemu-system-x86-core on a fresh system installs 29 packages, summing up to an 18M download and 74M installed size. Installing qemu-system-x86 (which pulls in all module sub-packages) on top adds 125 more packages with 45M download size and 145M installed size.
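
In practice the choice is simply which package you install (package names as quoted above):

 sudo dnf install qemu-system-x86-core    # lightweight install, no optional modules
 sudo dnf install qemu-system-x86         # pulls in all module sub-packages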

linux kernel drm driver updates

As already mentioned above the virtio drm driver got support for blob resources. Also a collection of bugfixes all over the place.

The ttm (drm memory manager) got a bunch of cleanups over the last few upstream kernel releases. A number of sanity checks have been added along the way, and they flagged several issues in the qxl drm driver (which uses ttm to manage video memory). That in turn led to a series of bugfixes and some other improvements in the qxl drm driver. These were merged upstream for linux 5.13; the most important fixes have been backported to the 5.10+ stable branches.

by Gerd Hoffmann at May 20, 2021 10:00 PM

April 30, 2021

QEMU project

QEMU version 6.0.0 released

We’d like to announce the availability of the QEMU 6.0.0 release. This release contains 3300+ commits from 268 authors.

You can grab the tarball from our download page. The full list of changes are available in the Wiki.

Highlights include:

  • 68k: new ‘virt’ machine type based on virtio devices
  • ARM: support for ARMv8.1-M ‘Helium’ architecture and Cortex-M55 CPU
  • ARM: support for ARMv8.4 TTST, SEL2, and DIT extensions
  • ARM: ARMv8.5 MemTag extension now available for both system and usermode emulation
  • ARM: support for new mps3-an524, mps3-an547 board models
  • ARM: additional device emulation support for xlnx-zynqmp, xlnx-versal, sbsa-ref, npcm7xx, and sabrelite board models
  • Hexagon: new emulation support for Qualcomm hexagon DSP units
  • MIPS: new Loongson-3 ‘virt’ machine type
  • PowerPC: external BMC support for powernv machine type
  • PowerPC: pseries machines now report memory unplug failures to management tools, as well as retrying unsuccessful CPU unplug requests
  • RISC-V: Microchip PolarFire board now supports QSPI NOR flash
  • Tricore: support for new TriBoard board model emulating Infineon TC27x SoC
  • x86: AMD SEV-ES support for running guests with secured CPU register state
  • x86: TCG emulation support for protection keys (PKS)

  • ACPI: support for assigning NICs to known names in guest OS independently of PCI slot placement
  • NVMe: new emulation support for v1.4 spec with many new features, experimental support for Zoned Namespaces, multipath I/O, and End-to-End Data Protection.
  • virtiofs: performance improvements with new USE_KILLPRIV_V2 guest feature
  • VNC: virtio-vga support for scaling resolution based on client window size
  • QMP: backup jobs now support multiple asynchronous requests in parallel

  • and lots more…

Thank you to everyone involved!

April 30, 2021 12:39 AM

April 27, 2021

Thomas Huth

How to check your shell scripts for portability

This blog is mainly a reminder for myself for the various possibilities to check my shell scripts for portability, but maybe it’s helpful for some other people, too.

First, why bother? Well, while bash is the default /bin/sh shell on many rpm-based Linux distributions (so it’s also the default shell on the systems I’m developing with and thus referring to here), it’s often not the case on other Linux distributions like Debian or Alpine, and it’s certainly not the case on non-Linux systems like the various *BSD flavors or illumos based installations.

Test your scripts with other shells

The most obvious suggestion is, of course, to run your script with a different shell than bash to see whether it works as expected.

Using dash

The probably most important thing to check is whether your script works with dash. dash is the default /bin/sh shell on most Debian-based distributions, so if you want to make sure that your script also works on such systems, this is the bare minimum that you should check. The basic idea of dash is to run scripts as fast as possible, without adding bloat to the shell. Therefore the shell is restricted to a minimum with regard to the syntax that it understands, and with regard to the user interface, e.g. the interactive shell prompt is way less comfortable compared with shells like bash.

Since dash is also available in Fedora and in RHEL via EPEL, its installation is as easy as typing something like:

 sudo dnf install dash

Checking your scripts with dash is therefore almost no additional effort, and it is a very good place to start.
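
Running a script against dash is then a one-liner (the script name is just a placeholder):

 # execute the script with dash instead of bash
 dash ./script-to-test.sh

 # or only parse it without executing, to catch syntax errors early
 dash -n ./script-to-test.sh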

Using posh

posh stands for “Policy-compliant Ordinary SHell” – it’s another shell that has been developed within the Debian project to check shell scripts for POSIX compliance. Unlike dash, the syntax that this shell understands is really restricted to the bare minimum set that the POSIX standard suggests for shells, so if your script works with posh, you can be pretty sure that it is portable to most POSIX-compliant shells indeed.

Unfortunately, I haven’t seen a pre-compiled binary of posh for Fedora or RHEL yet, and I haven’t spotted a dedicated website for this shell either, so the installation is a little bit more complicated compared to dash. The best thing you can do on a non-Debian based system is to download the tar.xz source package from https://packages.debian.org/sid/posh and compile it on your own:

 wget http://deb.debian.org/debian/pool/main/p/posh/posh_0.14.1.tar.xz
 tar -xaf posh_0.14.1.tar.xz
 cd posh-0.14.1/
 autoreconf -i
 ./configure
 make
 ./posh ~/script-to-test.sh

Using virtual machines

Of course you can also check your scripts on other systems using virtual machines, e.g. on guest installations with FreeBSD, NetBSD, OpenBSD or one of the illumos distributions. But since this is quite some additional effort (e.g. you have to boot a guest and make your script available to it), I normally skip this step – testing with dash and posh catches most of the issues already anyway.

Test your scripts with shell checkers

There are indeed programs that help you to check the syntax of your shell scripts. The two I've been using so far are checkbashisms and ShellCheck:

Using checkbashisms

checkbashisms is a Perl script, again maintained by the Debian people, that checks for portability issues which occur when a shell script has been written only with bash in mind. It is part of the devscripts package in Debian. Fortunately, the script is also available in Fedora by installing the so-called devscripts-checkbashisms package (which can also be used on RHEL, by the way). checkbashisms focuses on the syntax constructs that are typically only available in bash, so this is a good and easy way to check your scripts on distributions where /bin/sh is bash by default.
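
Usage is straightforward; assuming the Fedora package mentioned above, it looks something like this:

 sudo dnf install devscripts-checkbashisms
 checkbashisms ./script-to-test.sh    # reports possible bashisms with the offending lines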

Using ShellCheck

ShellCheck is another static shell script analyzer tool, which is available for most distributions or can be installed via the instructions provided on its GitHub page. The nice thing about ShellCheck is that the authors even provide the possibility to check your script via upload on their www.shellcheck.net website, so for small, public scripts you don't have to install anything at all to try it out; just copy and paste your script into the text box on the website.
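
For local use it looks roughly like this (the Fedora package name is an assumption and may differ on other distributions):

 sudo dnf install ShellCheck
 shellcheck ./script-to-test.sh    # prints warnings with SC codes and hints how to fix them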

April 27, 2021 09:00 AM

April 22, 2021

Daniel Berrange

ANNOUNCE: virt-viewer release 10.0

I am happy to announce a new bugfix release of virt-viewer 10.0 (gpg), including experimental Windows installers for Win x86 MSI (gpg) and Win x64 MSI (gpg).

Signatures are created with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

With this release the project replaced the autotools build system with Meson and Ninja, and re-designed the user interface to eliminate the menu bar.

All historical releases are available from:

http://virt-manager.org/download/

Changes in this release include:

  • Switch to use Meson for build system instead of autotools
  • Require libvirt >= 1.2.8
  • Redesign UI to use title bar widget instead of menu bar
  • Request use of dark theme by default, if available
  • Don’t filter out oVirt DATA storage domains for ISO image sharing
  • Add --keymap arg to allow keys to be remapped
  • Display error message if no extension is present for screenshot filename
  • Fix misc memory leaks
  • Use nicer error message if no ISOs are available
  • Use more explicit accelerator hint to distinguish left and right ctrl/alt keys
  • Report detailed file transfer errors
  • Use standard about dialog
  • Refresh and improve translations
  • Install appstream data file in preferred location
  • Refresh appstream data file contents
  • Display VM title when listing VMs, if available
  • Display VM description as tooltip, if available
  • Sort VM names when listing
  • Enable ASLR and NX for Windows builds
  • Add --shared arg to request a shared session for VNC
  • Disable all accels when not grabbed in kiosk mode
  • Allow num keypad to be used for zoom changes
  • Disable grab sequence in kiosk mode to prevent escape
  • Allow zoom hotkeys to be set on the command line / vv file
  • Display error message if VNC connection fails
  • Fix warnings about atomics with new GLib
  • Remove use of deprecated GTK APIs
  • Document cursor ungrab sequence in man pages
  • Honour Ctrl-C when auth dialog is active
  • Minor UI tweaks to auth dialog
  • Support VM power control actions with VNC
  • Add --cursor arg to control whether a local pointer is rendered with VNC
  • Add --auto-resize arg and menu to control whether to resize the remote framebuffer to match local window size
  • Add support for remote framebuffer resize with VNC
  • Handle case sensitivity when parsing accelerator mappings

by Daniel Berrange at April 22, 2021 05:39 PM

April 15, 2021

Daniel Berrange

ANNOUNCE: gtk-vnc release 1.2.0 available

I’m pleased to announce a new release of GTK-VNC, version 1.2.0.

https://download.gnome.org/sources/gtk-vnc/1.2/gtk-vnc-1.2.0.tar.xz (213K)
sha256sum: 7aaf80040d47134a963742fb6c94e970fcb6bf52dc975d7ae542b2ef5f34b94a

Changes in this release include

  • Add API to request fixed zoom level
  • Add API to request fixed aspect ratio when scaling
  • Add APIs for client initiated desktop resize
  • Implement “Extended Desktop Resize” VNC extension
  • Implement “Desktop Rename” VNC extension
  • Implement “Last Rect” VNC extension
  • Implement “XVP” (power control) VNC extension
  • Implement VeNCrypt “plain” auth mode
  • Implement alpha cursor VNC extension
  • Use GTK preferred width/height helpers for resizing
  • Fix misc docs/introspection annotation bugs
  • Honour meson warninglevel setting for compiler flags
  • Fix JPEG decoding in low colour depth modes
  • Fix minor memory leaks
  • Add header file macros for checking API version
  • Change some meson options from “bool” to “feature”
  • Validate GLib/GTK min/max symbol versions at build time
  • Avoid recreating framebuffer if size/format is unchanged
  • Emit resize signal after WMVi update
  • Various fixes & enhancements to python demo program
  • Ensure Gir files build against local libs
  • Enable stack protector on more platforms
  • Don’t force disable introspection on windows
  • Relax min x11 deps for older platforms
  • Avoid mutex deadlock on FreeBSD in test suite
  • Stop using deprecated GLib thread APIs
  • Stop using deprecated GLib main loop APIs
  • Stop using deprecated GObject class private data APIs
  • Add fixes for building on macOS
  • Fix deps for building example program
  • Update translations

Thanks to all those who reported bugs and provided patches that went into this new release.

by Daniel Berrange at April 15, 2021 10:56 AM

April 06, 2021

KVM on Z

Webinar: Red Hat OpenShift for IBM Z and LinuxONE on RHEL 8.3 KVM

Join us for our webinar on Wednesday, April 21, 11:00 AM - 12:00 PM EST!

Abstract

Red Hat OpenShift is available on RHEL 8.3 KVM starting with Red Hat OpenShift version 4.7 on IBM Z and LinuxONE. We discuss the deployment of a Red Hat OpenShift Cluster on RHEL KVM from a high-level perspective, including supported configurations and requirements, especially the available network and storage options.
Furthermore, we explain the installation steps of Red Hat OpenShift 4.7 on RHEL KVM in detail, including best practices and a short excursion on cluster debugging.

Speakers

  • Dr. Wolfgang Voesch, Iteration Manager - OpenShift on IBM Z and LinuxONE
  • Holger Wolf, Product Owner - OpenShift on Linux on IBM Z and LinuxONE

Registration

Register here. You can check the system requirements here.
After registering, you will receive a confirmation email containing information about joining the webinar.

Replay & Archive

All sessions are recorded. For the archive as well as a replay and handout of this session and all previous webinars see here.

by Stefan Raspl (noreply@blogger.com) at April 06, 2021 01:22 PM

April 05, 2021

Stefan Hajnoczi

Learning programming languages

You may be productive and comfortable in one programming language but find the idea of learning a new programming language daunting. Or you may know and use multiple programming languages but haven't learnt a new one in a while. Or you might be a programming language geek who is just curious about how others dive into new programming languages and get productive quickly. No matter how easy or difficult it is for you to engage in new programming languages, this article explains how I like to learn new programming languages. Although people learn best in different ways, I hope you'll find my thought process interesting even if you decide to take a different approach.

Background

Language N+1

This article isn't aimed at learning to program. Learning your first programming language is much harder than learning an additional one. The reason is that many abstract concepts are involved in computer programming. When you first encounter programming, most languages require you to understand concepts like iteration, scopes, (im)mutability, arrays, modules, functions, and much more. The good news is that when you learn an additional language you'll already be familiar with common concepts and can therefore take a more streamlined approach in order to get up to speed quickly.

Courses, videos, exercises

There is a lot of educational material online that teaches various programming languages, but I don't find structured courses, videos, or exercises efficient. If you already know common programming concepts and have an idea of what you want to build in the new programming language, then it's more efficient to chart your own course. Materials that let you jump/skip around will let you focus on information that is novel and that you actually need. Working through a series of exercises that someone else designed may be time spent practicing the wrong things since usually you are the one with the best idea of what to practice. Courses, videos, and exercises tend to be an "on the rails" experience where you are exposed to information in a linear fashion whether it's useful at the moment or not.

1. Understanding the computational model

The first question about a new programming language is "what is its computational model?". Sadly, many language manuals and websites do not describe the computational model beyond what programming paradigms are supported (object-oriented, concatenative, functional, logic programming, etc). The actual computational model may only become fully apparent later. Or it might be expressed in too much detail in a language standards document to be of use early on. In any case, it's worthwhile reading the programming language's website for information on the computational model to grasp the big picture.

It's the computational model that you need to understand in order to write programs. Often we think about syntax and language features too much when learning a new language. The computational model informs us how to break down requirements into programs. We approach logic programming differently from object-oriented programming in how we organize data and code. The syntax and to an extent even the language features don't matter.

Understanding the computational model also helps you situate the new programming language relative to others, especially programming languages that you already know. It will give you an idea of how different programming will be and where you'll need to learn new concepts.

2. The language tutorial

After familiarizing yourself with the computational model of the programming language, the next step is to learn the basic syntax and concepts. Most modern programming languages have an official tutorial available online. The tutorial introduces the language elements, usually with short examples, and its table of contents gives an overview of what the language consists of. The tutorial can be completed in a few hours or days. Unlike full courses, official programming language tutorials often lend themselves to non-linear reading, which is helpful when certain aspects of the language are already familiar or will not be relevant to you.

I remember reading the Python tutorial in an afternoon years ago, but watch out: at this point you might be able to write valid syntax but you won't be writing idiomatic code yet. There's that saying "you can write FORTRAN in any language". In order to write programs that are expressed naturally and take advantage of the language effectively, more effort will be necessary.

3. Writing toy programs

After becoming aware of the language elements the next step is to explore how the language works. This can be done by writing small programs. Often these toy programs are familiar tasks you've already solved in other languages. If you want to write games, maybe it's Pong. If you write web applications, it could be a todo list. There are lots of different well-known programs to write.

During the course of writing toy programs you'll encounter syntax errors or issues with the program. Learning to interpret common error messages is important because they will come up in more complicated scenarios later where it can be harder to resolve them if you haven't seen them before.

You'll also hit common tasks for which you need to find solutions in the standard library or language reference manual. Whether it's parsing command-line options, regular expression matching, HTTP requests, or error handling, the language probably has a way of doing it. Toy programs present a simple environment in which to explore the basic facilities of a programming language.

4. Gaining a deeper appreciation for the language

Once you have written some toy programs you'll be able to start writing your own programs that solve new problems. At this stage you start being productive but there is still more to learn. In particular, the language's idioms and patterns must be studied in order to write natural code. Once I have experience with the basics of a language I like to read the source code to the standard library, popular libraries, and popular applications. In the beginning this is hard because they use unfamiliar language features or library dependencies, but after following up on unknown parts of one program, you'll find it becomes easier to read other programs because your knowledge of the language has expanded.

At this point it is also worth looking for style guides, manuals on language idioms, and documentation on common gotchas or anti-patterns. These will provide the information about thinking natively in the new programming language. This is what's needed to become fluent in the language and capable of reading and writing real programs confidently.

Although I have presented steps in a linear order, learning complex subjects is often an iterative process. Sometimes I find myself jumping back and forth between steps as my understanding evolves.

Conclusion

Learning a new programming language is time-consuming no matter how you do it. However, it doesn't all need to happen upfront, and after a few days of reading the documentation and experimenting with toy programs, it's possible to perform basic tasks. Learning how to use a language effectively by studying popular programs and reading guides is the quickest way I've found to reach fluency. Finally, it just takes practice!

by Unknown (noreply@blogger.com) at April 05, 2021 09:29 PM

March 26, 2021

KVM on Z

Installing Red Hat OpenShift on KVM on Z

While there is no documentation on how to install Red Hat OCP on Linux on Z with a static IP under KVM today, the instructions here will get you almost there. However, there are a few parts within the section Creating Red Hat Enterprise Linux CoreOS (RHCOS) machines that require attention. Here is an updated version that will get you through:
 
4. You can use an empty QCOW2 image: Using the prepared one will also work, but it will be overwritten anyway.

5. Start the guest with the following modified command-line:
  $ virt-install --noautoconsole \
     --boot kernel=/bootkvm/rhcos-4.7.0-s390x-live-kernel-s390x,initrd=/bootkvm/rhcos-4.7.0-s390x-live-initramfs.s390x.img,kernel_args='rd.neednet=1 dfltcc=off coreos.inst.install_dev=/dev/vda coreos.live.rootfs_url=https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.7/4.7.0/rhcos-4.7.0-s390x-live-rootfs.s390x.img coreos.inst.ignition_url=http://192.168.5.106:8080/ignition/bootstrap.ign ip=192.168.5.11::192.168.5.1:24:bootstrap-0.pok-241-macvtap-mars.com::none nameserver=9.1.1.1' \
     --connect qemu:///system \
     --name bootstrap-0 \
     --memory 16384 \
     --vcpus 8 \
     --disk /home/libvirt/images/bootstrap-0.qcow2 \
     --accelerate \
     --import \
     --network network=macvtap-mv1 \
     --qemu-commandline="-drive if=none,id=ignition,format=raw,file=/bootkvm/bootstrap.ign,readonly=on -device virtio-blk,serial=ignition,drive=ignition"

Note the following changes:

  • Use the live installer kernel, initrd (you can get them from the redhat mirror) and parmline (this you need to create yourself once for each guest) in the --boot parameter. This is basically like installing on z/VM, and will write the image to your QCOW2 image with the correct static IP configuration. Keep in mind that the ignition file needs to be provided by an http/s server for this method to work (see the sketch after this list).
  • dfltcc=off is required for IBM z15 and LinuxONE III
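
A minimal sketch of serving the ignition files over http, assuming the directory layout and port used in the virt-install command above (any web server will do, of course):

 # serve e.g. /srv/www/ignition/bootstrap.ign on port 8080;
 # the directory is an assumption, adjust to where your ignition files live
 cd /srv/www && python3 -m http.server 8080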

6. To restart the guest later on, you will need to change the guest definition to boot from the QCOW2 image.
When the kernel parms are passed into the installer, the domain xml will look like this once the guest is installed and running:
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-rhel8.2.0'>hvm</type>
    <kernel>/bootkvm/rhcos-4.7.0-s390x-live-kernel-s390x</kernel>
    <initrd>/bootkvm/rhcos-4.7.0-s390x-live-initramfs.s390x.img</initrd>
    <cmdline>rd.neednet=1 dfltcc=off coreos.inst.install_dev=/dev/vda coreos.live.rootfs_url=https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.7/4.7.0/rhcos-4.7.0-s390x-live-rootfs.s390x.img coreos.inst.ignition_url=http://192.168.5.106:8080/ignition/worker.ign ip=192.168.5.49::192.168.5.1:24:worker-1.pok-241-macvtap-mars.com::none nameserver=1.1.1.1</cmdline>
    <boot dev='hd'/>
  </os>

However, this domain XML still points at the installation media, hence a reboot will not work (it will merely restart the installation).
Remove the <kernel>, <initrd>, <cmdline> elements, so that all that is left is the following:
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-rhel8.2.0'>hvm</type>
    <boot dev='hd'/>
  </os>

With this, the guest will start successfully.
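
The edit and restart itself can be done with virsh; a short sketch, using the guest name from the example above:

 virsh edit bootstrap-0      # remove the <kernel>, <initrd> and <cmdline> elements
 virsh destroy bootstrap-0   # stop the guest if the installer is still running
 virsh start bootstrap-0     # now boots from the installed QCOW2 image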

 [Content contributed by Alexander Klein]

by Stefan Raspl (noreply@blogger.com) at March 26, 2021 08:59 PM

Powered by Planet!
Last updated: November 29, 2021 02:06 AM