Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools


Planet Feeds

February 24, 2021

Stefan Hajnoczi

Milestone Systems: Software that changes how things are done

Every few years a project comes out with a new approach that becomes influential. Often it involves combining existing concepts in a novel way. People argue about whether the project is actually novel or whether it was just in the right place at the right time and popularized existing technology. Regardless, I find these projects fascinating and try to learn about them because they are milestones that future systems are based on.

Here is a short list of projects that I think fall into this category. I hope you enjoy them (if you haven't already explored them). Send me your picks!


Tor is an onion router. It enables (mostly) anonymous communication by tunneling encrypted connections. The client does not know the IP address of the server (when connecting to so-called hidden services), the server does not know the IP address of the client, and the intermediate hops only know about their immediate predecessor and successor.

The design of Tor is described in a paper.


BitTorrent is a decentralized peer-to-peer file sharing protocol that can be used to reduce load on file hosting servers and improve download times. It's commonly used to share copyrighted material, but is also used by Linux distributions to publish ISO images and by software update systems.

A central aspect of BitTorrent is that peers exchange pieces of the file amongst themselves thanks to a Merkle tree. Pieces received from untrusted peers are checked against the file's Merkle tree to ensure that data has not been corrupted or manipulated.
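As a rough illustration of how a piece from an untrusted peer can be checked against a Merkle tree, here is a minimal sketch. It is a simplification under stated assumptions: SHA-256 as the hash, a last-node-duplicated rule for odd levels, and the helper names `merkle_levels`, `proof_for`, and `verify_piece` are all choices made for this example. The real hash function, piece size, and tree layout are defined by the protocol specification and are not reproduced here.

```python
import hashlib


def sha256(data):
    return hashlib.sha256(data).digest()


def merkle_levels(pieces):
    """Return every level of the tree: leaf hashes first, the root last."""
    if not pieces:
        raise ValueError("at least one piece is required")
    level = [sha256(piece) for piece in pieces]
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2:
            level = level + [level[-1]]  # odd level: pair the last node with itself
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def proof_for(levels, index):
    """Collect the sibling hashes needed to recompute the root from one leaf."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])  # sibling sits next to us in the pair
        index //= 2
    return proof


def verify_piece(piece, index, proof, root):
    """Check a piece against a trusted root using only the sibling hashes."""
    digest = sha256(piece)
    for sibling in proof:
        if index % 2 == 0:
            digest = sha256(digest + sibling)
        else:
            digest = sha256(sibling + digest)
        index //= 2
    return digest == root
```

The point of the structure is that a downloader only needs the trusted root hash up front. Each piece can then be verified with a handful of sibling hashes, even when both the piece and the siblings come from an untrusted peer.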

A paper about the economics of BitTorrent described some of the ideas behind it. The actual protocol is described by the protocol specification.


Git is the most popular version control system as of 2020. It replaced the older CVS and Subversion systems that were widely used before it. Other systems like Mercurial, Darcs, Perforce, and BitKeeper had similar use cases and ideas.

Git is a content-addressable object store with a convention for representing trees of files as well as commits and tags. I wrote about how the object store is implemented here if you want to learn about pack files and deltas.
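To make "content-addressable" concrete, the sketch below computes object IDs the way Git does for its default SHA-1 object format: the ID is the hash of a small header (`<type> <size>` followed by a NUL byte) plus the content. The in-memory `ObjectStore` class is a toy invented for this example; it ignores pack files, deltas, and the on-disk directory layout.

```python
import hashlib
import zlib


def encode_object(obj_type, content):
    """Prefix the content with Git's '<type> <size>\\0' header."""
    return f"{obj_type} {len(content)}".encode() + b"\0" + content


def git_object_id(obj_type, content):
    """Object ID = SHA-1 over header plus content (default object format)."""
    return hashlib.sha1(encode_object(obj_type, content)).hexdigest()


class ObjectStore:
    """Toy in-memory store keyed by object ID (illustrative only)."""

    def __init__(self):
        self._objects = {}

    def put(self, obj_type, content):
        oid = git_object_id(obj_type, content)
        # Loose objects are zlib-compressed; identical content maps to one entry.
        self._objects[oid] = zlib.compress(encode_object(obj_type, content))
        return oid

    def get(self, oid):
        raw = zlib.decompress(self._objects[oid])
        header, _, content = raw.partition(b"\0")
        obj_type, _size = header.decode().split(" ")
        return obj_type, content
```

Because the key is derived from the content, storing the same file twice costs nothing extra, and any corruption changes the ID. For example, `git_object_id("blob", b"")` yields `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the same ID `git hash-object` reports for an empty file.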


Bitcoin is a decentralized currency, also known as cryptocurrency. A network of mutually untrusted nodes maintains a ledger called the blockchain that records transactions. Bitcoin is famous for mining where nodes compete to solve a computationally-expensive problem in order to extend the ledger.

What is interesting about Bitcoin is that the blockchain prevents abuse as long as at least half of the nodes are not controlled or colluding. In other words, it is a decentralized consensus - although there can be short-lived splits where not all nodes agree on the current state.
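The mining idea can be sketched with a toy hash chain. This is not Bitcoin's actual block format: the real system hashes a binary block header and compares the result against a numeric target, whereas this example assumes text payloads, a single SHA-256, and a "leading zero hex digits" difficulty rule. All function names are made up for the sketch.

```python
import hashlib

GENESIS = "0" * 64  # placeholder hash that the first block points back to


def block_hash(prev_hash, payload, nonce):
    return hashlib.sha256(f"{prev_hash}|{payload}|{nonce}".encode()).hexdigest()


def mine(prev_hash, payload, difficulty):
    """Brute-force a nonce so the block hash starts with `difficulty` zeros."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = block_hash(prev_hash, payload, nonce)
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1


def extend(blocks, payload, difficulty):
    """Return a new chain with one more mined block appended."""
    prev_hash = blocks[-1][2] if blocks else GENESIS
    nonce, digest = mine(prev_hash, payload, difficulty)
    return blocks + [(payload, nonce, digest)]


def verify_chain(blocks, difficulty):
    """Cheaply re-check every block's work and its link to the previous block."""
    prefix = "0" * difficulty
    prev_hash = GENESIS
    for payload, nonce, digest in blocks:
        if not digest.startswith(prefix):
            return False
        if block_hash(prev_hash, payload, nonce) != digest:
            return False
        prev_hash = digest
    return True
```

Finding a nonce is expensive while checking one is cheap, and each block commits to the hash of its predecessor. Rewriting an old transaction therefore means redoing the work for that block and every block after it.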

The Bitcoin paper gives an overview of how the system works.


I hope this was a fun post that motivated you to look at a system you haven't studied yet or made you think about systems that you consider milestone systems. Please get in touch if you want to share yours!

by Unknown at February 24, 2021 08:29 PM

KVM on Z

QEMU v5.2 released [UPDATE Feb 24, 2021]

QEMU v5.2 is out. Highlights from a KVM on Z perspective:

  • PCI passthrough support now extends beyond RoCE Express cards to other PCI devices, e.g. NVMe devices. However, ISM devices, as needed for SMC-D, require extra support and cannot be used at this point.
  • virtiofs support via virtio-fs-ccw: Shared filesystem allowing KVM guests to access host directories.
    Use cases:
    • Container image access in lightweight VMs (e.g. in Kata Containers)
    • CI/CD and development enablement
    • Filesystem as a service, to easily switch backends
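    To use, add the following to the guest's domain XML on the host:
          <memoryBacking>
            <access mode='shared'/>
          </memoryBacking>
          <!-- under <devices>: -->
          <filesystem type='mount'>
            <driver type='virtiofs'/>
            <source dir='/<hostpath>'/>
            <target dir='mount_tag'/>
          </filesystem>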

    Then mount in guests as follows:
      # mount -t virtiofs mount_tag /mnt/<path>
    Requires Linux kernel 5.4 and libvirt v7.0.

For further details, see the Release Notes.

UPDATE: A previous version had falsely listed ISM devices as supported.

by Stefan Raspl at February 24, 2021 05:13 PM

Red Hat OpenShift Container Platform 4.7 Released

Red Hat OCP 4.7 is out!

Among others, it adds support for KVM on Z as provided by RHEL 8.3 as the hypervisor for user-provisioned infrastructure.

See here for the full list of IBM Z-specific changes and improvements.

by Stefan Raspl at February 24, 2021 04:54 PM

February 17, 2021

QEMU project

QEMU is applying to Google Summer of Code and Outreachy 2021

QEMU is applying to Google Summer of Code 2021 and is participating in Outreachy May-August 2021. Both of these open source internship programs offer remote work opportunities for new developers wishing to get involved in our community.

Interns work with mentors who support them in their project. The code developed during the project is submitted via the same open source development process that all QEMU code follows. This gives interns experience with contributing to open source software.

QEMU’s mentors are experienced contributors who enjoy working with talented individuals who are getting started in open source. You can find a list of project ideas that mentors are proposing here.


Outreachy initial applications are open until February 22nd at 16:00 UTC. Outreachy’s goal is to increase diversity in open source and is open to anyone who faces under-representation, systemic bias, or discrimination in the technology industry of their country.

You can learn more about Outreachy May-August and how to apply at the Outreachy website.

Google Summer of Code

Google Summer of Code (GSoC) is a 10-week internship for students. Applications are open from March 29th to April 13th. You can find the details of how to apply at the Google Summer of Code website.

Google will announce accepted organizations on March 9th. QEMU is applying and we hope to mentor GSoC interns again this year!

Please review the eligibility criteria for GSoC before applying.

by Stefan Hajnoczi at February 17, 2021 07:00 AM

February 16, 2021

Stefan Hajnoczi

Video and slides available for "The Evolution of File Descriptor Monitoring in Linux"

My FOSDEM 2021 talk "The Evolution of File Descriptor Monitoring in Linux: From select(2) to io_uring" is now available.

The talk compares the file descriptor monitoring system calls available in Linux and discusses their design. Benchmark results show how well they scale when there are many file descriptors. I hope this is a useful overview to this important kernel feature that GUI applications, network services, and many other programs rely on.

If you are interested in API design and performance, this talk highlights how different approaches like stateless vs stateful APIs can affect performance and how to minimize the number of API calls through careful design.
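The stateless versus stateful distinction can be shown in a few lines. In this sketch, `select.select()` stands in for the stateless style, where the full interest list is handed over on every call. Python's `selectors.DefaultSelector` stands in for the stateful style; it picks epoll on Linux and other mechanisms elsewhere. The helper names are invented for the example, and io_uring is not covered.

```python
import select
import selectors
import socket


def ready_stateless(socks, timeout=0):
    """select(2)-style: the whole interest list is passed on every call."""
    readable, _, _ = select.select(socks, [], [], timeout)
    return readable


class StatefulMonitor:
    """epoll(7)-style: register descriptors once, then just wait for events."""

    def __init__(self):
        self._selector = selectors.DefaultSelector()

    def add(self, sock):
        self._selector.register(sock, selectors.EVENT_READ)

    def remove(self, sock):
        self._selector.unregister(sock)

    def ready(self, timeout=0):
        return [key.fileobj for key, _events in self._selector.select(timeout)]

    def close(self):
        self._selector.close()
```

With the stateless call, the cost of describing the interest set is paid on every wait. With the stateful monitor it is paid once in `add()`, which matters when there are many descriptors and only a few are active at any time.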


by Unknown at February 16, 2021 09:24 AM

February 15, 2021

Daniel Berrange

ANNOUNCE: libvirt-glib release 4.0.0

I am pleased to announce that a new release of the libvirt-glib package, version 4.0.0, is now available from

The packages are GPG signed with

Key fingerprint: DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

Changes in this release:

  • Replace autotools build system with meson
  • Mandate libvirt >= 1.2.8
  • Mandate libxml2 >= 2.9.1
  • Mandate glib >= 2.48.0
  • Mandate gobject-introspection >= 1.46.0
  • Fix docs incompatibility with gtk-doc >= 1.30
  • Updated translations
  • Misc API docs fixes
  • Add constants related to NVRAM during domain delete
  • Add domain config API for controller ports attribute
  • Fix compat with newer glib by avoiding volatile for enum types

Thanks to everyone who contributed to this new release.

by Daniel Berrange at February 15, 2021 12:36 PM

February 03, 2021

Stefan Hajnoczi

Keeping a clean git commit history

Does the commit history of your source repository look like this:

f02af91822 docs: fix incorrect subheadings
dd7bee8b38 cli: add --import option
900ca2936a cli: extract move_topic() helper function

Or like this:

7011cc9868 lunch time
a07c82331d resolve code review comments
331d79a8ff more fixes


The first is a clean git commit history where each commit has a clear purpose and is a single logical change. The second is a messy commit history where the commits have no inherent structure:

  • "more fixes" does not describe clearly what is being fixed and the plural ("fixes") hints it may contain multiple logical changes instead of just one.
  • "resolve code review comments" contains changes requested by code reviewers in relation to another commit, it's not a self-contained logical change.
  • "lunch time" is an unfinished commit that was created because the programmer wanted to save their work.

Commit anti-patterns

The example above illustrates several anti-patterns:

Vague commit messages

If the commit message is vague and does not express a clear purpose, then it is hard to know what a commit does from the commit message. If git-log(1) doesn't provide useful information about commits then one has to resort to searching the code diffs. That is very tedious and sometimes it's almost impossible to come up with a good code search query while a clear commit message would have been easy to search. So clear commit messages are the first step towards clean commit history.

Doing too many things in one commit

Commits that make several logical code changes are hard to review and impede backporting fixes to stable branches. For example, a commit that fixes a bug as well as adding a new feature may need to be rewritten for a stable branch. If instead the code had been split into two commits, then the bug fix commit could have been backported easily. Therefore it is good practice to separate distinct bug fixes, features, and other logical code changes into separate commits.

Addressing code review comments

The code review and testing history is usually not useful information once a commit has been merged. For example, if there was a continuous integration (CI) test failure and a pull request needed to be changed, then the change should be made directly to the buggy commit so that the final commit passes the tests. No one needs to know about the code review or testing history once the code is merged and keeping these artifacts makes the commit history unwieldy by spreading a logical code change across multiple incomplete commits.

Saving work

There are valid reasons to temporarily save your work in a commit, but work-in-progress (WIP) commits should be cleaned up before merging them. For example, sometimes people make arbitrary commits to save work at the end of the day. That is fine in a local branch, but those temporary commits can be restructured into clean commits using git-rebase(1). No one else needs to know about temporary commits.

Broken commits

It can be easy to accidentally include a commit that does not build or fails tests if a later commit happens to resolve the issue. Since the later commit hides the issue it may not be apparent when testing the branch. When reordering commits the risk of introducing broken commits increases because those commits were originally written in a different order. I use the git-rebase(1) exec action to build and run tests after every commit to detect broken commits when doing extensive rebases.

Why clean commit history is important

Not all reasons for maintaining a clean commit history are obvious. Each of the anti-patterns above makes commit history less useful, so if you value any of the following reasons for keeping a clean commit history, all of the anti-patterns need to be avoided.

Code review

Reviewers have an easier time reading clean commits than an unstructured series of commits. For example, if there is a broken commit because a function is used before it is defined in a later commit, then that affects code reviewers who read the commits linearly. They will be puzzled by the non-existent function and unable to decide whether it is being used correctly because it has not been defined yet. Although code reviewers could put in extra effort to reread the commits multiple times and try to remember the misordered changes, it's better to let code reviewers spend time on real issues rather than on untangling poorly structured commits.

Capturing the rationale for code changes

When each commit is a single logical change it becomes possible to write good commit descriptions that give the rationale for the code change. Explanations for why a code change is necessary, as well as links to issue trackers, email discussions, etc can be valuable when revisiting the commit history later. If commits contain multiple logical code changes or are incomplete then it is hard to include a good commit description, so the commit history is less useful when referring back to it later on.

Making cherry-picking easy

Many software projects maintain stable branches that still receive bug fixes for some time. This allows development to introduce new features and less mature code while users can run a mature stable release. However, maintaining stable branches can be time-consuming. Maintainers need to identify commits suitable for stable branches and cherry-pick or backport them. This requires clean commit history so that bug fixes can be applied in isolation without dragging in other code changes that do not fit the criteria for stable branches.

Enabling git-bisect(1)

When a bug is observed it may not be clear which commit introduced it. The git-bisect(1) command systematically searches the commit history and identifies the commit that caused the bug. However, git-bisect(1) only works with clean commit history. If there are broken commits then bisection becomes unreliable because some portions of commit history cannot be tested. Poorly structured commits, such as huge changes that do many different things, also make it difficult to identify which line caused the bug even when git-bisect(1) has determined which commit is to blame.
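The search that bisection performs is an ordinary binary search over the commit list. The sketch below is not git's implementation; it assumes a linear history ordered oldest to newest, a known-good first commit, a known-bad last commit, and a hypothetical `is_bad` callback that builds and tests one commit.

```python
def bisect_first_bad(commits, is_bad):
    """Return (first bad commit, number of commits tested).

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the culprit is also bad.
    """
    good, bad = 0, len(commits) - 1  # indexes of a known-good and known-bad commit
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1
        if is_bad(commits[mid]):
            bad = mid   # culprit is at mid or earlier
        else:
            good = mid  # culprit is after mid
    return commits[bad], tested
```

The search only works if every midpoint can be built and tested. A commit that does not compile cannot be classified as good or bad, which is why the real tool offers `git bisect skip` and why broken commits make the result less precise.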

When clean commit history does not matter

I have found that the main reason not everyone practices clean commit history is that they may not need any of this. Small projects developed by a single author, in particular, may involve little code review, backporting to stable branches, or use of git-bisect(1). In that case the effort required to split code changes into clean commits and write good commit messages may seem unjustified. Of course this can change, but once the commit history is messy there is not much that can be done about it. So it's worth thinking carefully about whether to take shortcuts.

Another factor is poor tooling. The code review tools in Gerrit and GitHub have historically made it hard to practice clean commit history. They were not designed for reviewing commit series and favored anti-patterns like squashing everything into a single commit or adding additional commits to address code review feedback. These are tool limitations and luckily GitHub code review has become better over the years. Tools that encourage you to review a commit series as a single diff are not conducive to clean commit history.

Finally, clean commit history requires proficiency with git-rebase(1) and that you are comfortable with the idea of rewriting your local branch to clean it up before publishing it. It takes a little practice to become competent at reordering, squashing, and splitting commits. The process can be a little scary, although git-reflog(1) makes it possible to undo even the most serious errors where commits were accidentally lost. On a related note, some people falsely believe that a pull or merge request branch should not be rebased. Although it is good practice to avoid rewriting history of branches that other people track, rewriting history and force-pushing a pull request is different. Most of the time no one else will maintain a local branch based on it and therefore force-pushing will not inconvenience anyone. Even if it is necessary to develop branches based on someone else's not-yet-merged branches, one needs to weigh the trade-offs of having to do more work in the short-term with the drawbacks of having a messy commit history forever.


I hope this is a useful summary of why each commit should have a clear purpose and embody a single logical change. For source repositories that are used by more than one person it is especially important to think about commit best practices. Clean commit history facilitates better code review, bug-finding, and maintaining stable branches. Beyond that it also provides a useful form of communication and sharing knowledge about the codebase that is missing when commit history is disregarded.

by Unknown at February 03, 2021 09:25 AM

February 01, 2021

KVM on Z

Webcast: IBM Secure Execution for Linux Introduction and Demo

IBM Secure Execution for Linux allows you to build a Trusted Execution Environment for IBM Z and LinuxONE that helps protect data in use.
This webcast gives an overview of the value and the key concepts of the technology, followed by a hands-on demo, outlining the steps needed to secure Linux workloads.

    UPDATE: A recording of the event is now available here.

Audience: Clients, Business Partners, IT Architects, Systems Admins

Speaker: Viktor Mihajlovski, Linux on IBM Z Development, Product Owner for KVM on IBM Z

Date: November 18, 11:00 AM - 12:15 PM EST

Registration: Register here, and check system requirements here.

by Stefan Raspl at February 01, 2021 08:15 PM

January 30, 2021

Stefan Hajnoczi

Why learning the Vim text editor is worth it

Vim logo, GNU General Public License v2 or Later

Many tools come and go as our software and devices change - or we get bored and want to try something new and shiny. One of the few exceptions for me has been the Vim text editor, which I use for programming, emails, and writing every day. In this post I want to share why Vim is remarkable but more generally why learning a text editor is a great investment.

It may not be obvious why text editors are useful tools. Many programs, like web browsers, email clients, and integrated development environments (IDEs), have built-in text editing functionality. Why use a separate text editor? Text editors go deep. They are much more powerful than text boxes in browsers and email clients while being more general than IDEs. While IDEs often have excellent programming language-specific functionality, they are rarely used for other text editing tasks like writing emails or documents because they are specialized tools. Text editors strike a good balance of powerful editing and support for programming without being boxed into a narrow use-case.

Getting started with a text editor is easy. Vim implements the arcane vi editor user interface but has many videos, cheatsheets, and tutorials that make it fun to try. Really getting familiar with the features and customizing the editor to your own needs takes time though. This is true for any popular text editor because the number of settings, extensions, or plugins available can be huge. However, once you are familiar with a text editor you will have a powerful tool that can be used for most tasks involving writing or manipulating text. The time investment will pay off as you use the editor for todo lists, emails, documentation, programming, configuration files, and more.

This explains why I've found Vim a useful and enduring tool that I use daily. But what makes it a particularly strong text editor compared to the other options? Text editors go in and out of fashion all the time. I remember many that attracted attention for a time but then faded away. Vim has remained popular and I think there are a few reasons for that.

Powerful text editing plus IDE-like functionality. The vi user interface is actually a language of text editing operators. The keys you press aren't just keyboard shortcuts, they are like a bytecode (!) for a text manipulation CPU that is Turing-complete. Years ago I wrote vi macros that solve the stable marriage problem to demonstrate this. For many geeks this alone might be enough to convince you to learn Vim! But on top of this crazy text editing power Vim also has IDE-like functionality including syntax highlighting, completion, code search and navigation, compiler error navigation, and diffing. There is a large collection of plugins and scripts if you want to extend Vim's functionality even further.

Vim is ubiquitous. It runs on all major operating systems but furthermore it is found on devices from tiny Wi-Fi routers to the largest servers. It has a GUI but also a terminal interface if you are connecting to a remote machine over SSH. I always find it strange when I see people use their editor of choice on their laptop but then use another, less-familiar editor when connecting to remote machines. Learn Vim and you can use it everywhere!

Keyboard-friendly. Constantly moving my hand between the mouse and keyboard is tiring and distracting. Vim has excellent keyboard support and many things can be done without leaving the home row on the keyboard, including navigating, inserting, and deleting text. I find there is no need to use the arrow keys, mouse, or anything that is hard to reach.

If you are looking for a powerful text editor that you can use for many years then I recommend Vim. It's also worth looking at Emacs, which has a different angle but is also a good time investment. Looking back on 17 years of using Vim, I'm happy I stopped switching between language-specific IDEs and instead found a text editor capable of handling all tasks.

by Unknown at January 30, 2021 11:02 AM

January 22, 2021

Stefano Garzarella

SOCAT now supports AF_VSOCK

SOCAT is a CLI utility which enables the concatenation of two sockets together. It establishes two bidirectional byte streams and transfers data between them.

socat supports several address types (e.g. TCP, UDP, UNIX domain sockets, etc.) to construct the streams. The latest version 1.7.4, released earlier this year [2021-01-04], also supports AF_VSOCK addresses:

  • VSOCK-LISTEN:<port>
    • Listens on <port> and accepts a VSOCK connection.
  • VSOCK-CONNECT:<cid>:<port>
    • Establishes a VSOCK stream connection to the specified <cid> and <port>.


If you are interested in VSOCK, I’ll talk with Andra Paraschiv (AWS) about it at FOSDEM 2021. The talk is titled Leveraging virtio-vsock in the cloud and containers and it’s scheduled for Saturday, February 6th 2021 at 11:30 AM (CET).

We will show cool VSOCK use cases and some demos about developing, debugging, and measuring the VSOCK performance, including socat demos.


socat could be very useful for concatenating and redirecting sockets. In this section we will see some examples.

All examples below refer to a guest with CID 42 that we created using virt-builder and virt-install .

VM setup

virt-builder is able to download the installer and create the disk image with Fedora 33 or other distros. It is also able to set the root password and inject the ssh public key, simplifying the creation of guest disk image:


host$ virt-builder --root-password=password:mypassword \
        --ssh-inject root:file:/home/user/.ssh/ \
        --output=${VM_IMAGE} \
        --format=qcow2 --size 10G --selinux-relabel \
        --update fedora-33

Once the disk image is ready, we create our VM with virt-install. We can specify the VM settings like the number of vCPUs, the amount of RAM, and the CID assigned to the VM (42 in this example):

host$ virt-install --name vsockguest \
        --ram 2048 --vcpus 2 --os-variant fedora33 \
        --import --disk path=${VM_IMAGE},bus=virtio \
        --graphics none --vsock cid.address=42

After the creation of the VM, we will remain attached to the console and we can detach from it by pressing ctrl-].

We can reattach to the console in this way:

host$ virsh console vsockguest

If the VM is turned off, we can boot it and attach directly to the console in this way:

host$ virsh start --console vsockguest

ncat like

It’s possible to use socat like ncat, transferring stdin and stdout via VSOCK.

Guest listening

In this example we start socat in the guest listening on port 1234:

guest$ socat - VSOCK-LISTEN:1234

Then we connect from the host using the CID 42 assigned to the VM:

host$ socat - VSOCK-CONNECT:42:1234

At this point we can exchange characters between guest and host, since stdin and stdout are linked through the VSOCK socket.

Host listening

In this example we do the opposite, starting socat in the host listening on port 1234:

host$ socat - VSOCK-LISTEN:1234

Then, in the guest, we connect to the host using the well-known CID 2, which is always used to reach the host:

guest$ socat - VSOCK-CONNECT:2:1234

ssh over VSOCK

The coolest feature of socat is that it can concatenate sockets of different address families, so in this example we redirect ssh traffic through the VSOCK socket exposed by the VM.

This example could be useful if the VM doesn’t have any NIC attached and we want to provide some network connectivity, such as ssh access.

First of all, in the guest we start socat, linking a VSOCK socket listening on port 22 to a TCP socket that connects to the local TCP port 22 where the ssh server is listening:

guest$ socat VSOCK-LISTEN:22,reuseaddr,fork TCP:localhost:22

On the host we link a TCP socket listening on a port of our choice (e.g. 4321) to the guest port 22 just opened using VSOCK:

host$ socat TCP4-LISTEN:4321,reuseaddr,fork VSOCK-CONNECT:42:22

Finally from the host we can connect to the guest using ssh on the local port 4321, where socat is listening:

host$ ssh -p 4321 root@localhost

socat redirects all the traffic between the sockets and allows us to use ssh over VSOCK to reach the guest.

by Stefano Garzarella at January 22, 2021 02:16 PM

January 19, 2021

QEMU project

Configuring virtio-blk and virtio-scsi Devices

The previous article in this series introduced QEMU storage concepts. Now we move on to look at the two most popular emulated storage controllers for virtualization: virtio-blk and virtio-scsi.

This post provides recommendations for configuring virtio-blk and virtio-scsi and how to choose between the two devices. The recommendations provide good performance in a wide range of use cases and are suitable as default settings in tools that use QEMU.

Virtio storage devices

Key points

  • Prefer virtio storage devices over other emulated storage controllers.
  • Use the latest virtio drivers.

Virtio devices are recommended over other emulated storage controllers as they are generally the most performant and fully-featured storage controllers in QEMU.

Unlike emulations of hardware storage controllers, virtio-blk and virtio-scsi are specifically designed and optimized for virtualization. The details of how they work are published for driver and device implementors in the VIRTIO specification.

Virtio drivers are available for Linux, Windows, and other operating systems. Installing the latest version is recommended for the latest bug fixes and performance enhancements.

If virtio drivers are not available, the AHCI (SATA) device is widely supported by modern x86 operating systems and can be used as a fallback. On non-x86 guests the default storage controller can be used as a fallback.

Comparing virtio-blk and virtio-scsi

Key points

  • Prefer virtio-blk in performance-critical use cases.
  • Prefer virtio-scsi for attaching more than 28 disks or for full SCSI support.
  • With virtio-scsi, use scsi-block for SCSI passthrough and otherwise use scsi-hd.

Two virtio storage controllers are available: virtio-blk and virtio-scsi.


The virtio-blk device presents a block device to the virtual machine. Each virtio-blk device appears as a disk inside the guest. virtio-blk was available before virtio-scsi and is the most widely deployed virtio storage controller.

The virtio-blk device offers high performance thanks to a thin software stack and is therefore a good choice when performance is a priority. It does not support non-disk devices such as CD-ROM drives.

CD-ROMs and in general any application that sends SCSI commands are better served by the virtio-scsi device, which has full SCSI support. SCSI passthrough was removed from the Linux virtio-blk driver in v5.6 in favor of using virtio-scsi.

Virtual machines that require access to many disks can hit limits based on availability of PCI slots, which are under contention with other devices exposed to the guest, such as NICs. For example a typical i440fx machine type default configuration allows for about 28 disks. It is possible to use multi-function devices to pack multiple virtio-blk devices into a single PCI slot at the cost of losing hotplug support, or additional PCI busses can be defined. Generally though it is simpler to use a single virtio-scsi PCI adapter instead.


The virtio-scsi device presents a SCSI Host Bus Adapter to the virtual machine. SCSI offers a richer command set than virtio-blk and supports more use cases.

Each device supports up to 16,383 LUNs (disks) per target and up to 255 targets. This allows a single virtio-scsi device to handle all disks in a virtual machine, although defining more virtio-scsi devices makes it possible to tune for NUMA topology as we will see in a later blog post.
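For a sense of scale, the limits quoted above can be multiplied out. This is plain arithmetic on the figures from the text; the constant and function names are just labels for the sketch.

```python
# Limits quoted in the text for a single virtio-scsi device.
LUNS_PER_TARGET = 16_383
TARGETS_PER_DEVICE = 255

# Approximate virtio-blk disk count for a default i440fx machine (from the text).
VIRTIO_BLK_DISKS_PER_MACHINE = 28


def addressable_luns(devices=1):
    """Upper bound on LUNs reachable through `devices` virtio-scsi adapters."""
    return devices * TARGETS_PER_DEVICE * LUNS_PER_TARGET


print(addressable_luns())  # 4177665
```

One virtio-scsi adapter occupying a single PCI slot can therefore address millions of LUNs, compared with roughly 28 virtio-blk disks in a default i440fx configuration.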

Emulated LUNs can be exposed as hard disk drives or CD-ROMs. Physical SCSI devices can be passed through into the virtual machine, including CD-ROM drives, tapes, and other devices besides hard disk drives.

Clustering software that uses SCSI Persistent Reservations is supported by virtio-scsi, but not by virtio-blk.

Performance of virtio-scsi may be lower than virtio-blk due to a thicker software stack, but in many use cases, this is not a significant factor. The following graph compares 4KB random read performance at various queue depths:

[Figure: Comparing virtio-blk and virtio-scsi performance]

virtio-scsi configuration

The following SCSI devices are available with virtio-scsi:

Device        SCSI Passthrough  Performance
scsi-hd       No                Highest
scsi-block    Yes               Lower
scsi-generic  Yes               Lowest

The scsi-hd device is suitable for disk image files and host block devices when SCSI passthrough is not required.

The scsi-block device offers SCSI passthrough and is preferred over scsi-generic due to higher performance.

The following graph compares the sequential I/O performance of these devices using virtio-scsi with an iothread:

[Figure: Comparing scsi-hd, scsi-block, and scsi-generic performance]


The virtio-blk and virtio-scsi devices offer a choice between a single block device and a full-fledged SCSI Host Bus Adapter. Virtualized guests typically use one or both of them depending on functional and performance requirements. This post compared the two and offered recommendations on how to choose between them.

The next post in this series will discuss the iothreads feature that both virtio-blk and virtio-scsi support for increased performance.

by Stefan Hajnoczi and Sergio Lopez at January 19, 2021 07:00 AM

December 11, 2020

KVM on Z

New Publications


The following videos and publications are now available on the IBM Knowledge Center:

For more updates, check this blog entry.

by Stefan Raspl at December 11, 2020 08:04 PM

Stefan Hajnoczi

Building a git forge using git apps

The two previous blog posts about why git forges are von Neumann machines and the Radicle peer-to-peer git forge explored models for git forges. In this final post I want to cover yet another model that draws from the previous ones but has its own unique twist.

Peer-to-peer git apps

I previously showed how applications can be built on centralized git forges using CI/CD functionality for executing code, webhooks for interacting with the outside world, and disjoint branches for storing data.

A more elegant architecture is a peer-to-peer one where instead of many clients and one server there are just peers. Each peer has full access to the data. There is no client/server application code split, instead each peer runs an application for itself.

First, this makes it easier to move the data to new hosting infrastructure or fork a project since all data resides in the git repository. Merge requests, issues, wikis, and even the app settings are all stored in the git repo itself.

Second, this gives more power to the users who can process data however they want without being limited by the server's API. All peers are on equal footing and users don't need permission to alter applications, because they run locally.

Finally, it is easier to develop a local application than a client/server application. Being able to open a file and tweak the code is immediate and less hassle than testing and deploying a server-side application.

Internet peer-to-peer systems typically still require some central point for bootstrapping and this is no exception. A publicly-accessible git repository is still needed so that peers can fetch and push changes. However, in this model the git server does not run application code but "git apps" like merge requests, issue trackers, wikis, etc can still be implemented. Here is how it works...

The anti-application server

The git server is not allowed to run application code in our model, so apps like merge requests won't be processing data on the server side. However, the repository does need some primitives to make peer-to-peer git apps possible. These primitives are access control policies for refs and directories/files.

Peers run applications locally and the git server is "dumb" with the sole job of enforcing access control. You can imagine this like a multi-user UNIX machine where users have access to a shared directory. UNIX file permissions determine how processes can access the data. By choosing permissions carefully, multiple users can collaborate in the shared directory in a safe and controlled manner.

This is an anti-application server because no application code runs on the server side. The server is just a git repository that stores data and enforces access control on git push.

Access control

Repositories that accept push requests need a pre-receive hook (see githooks(5)) that checks incoming requests against the access control policy. If the request complies with the access control policy then the git push is accepted. Otherwise the git push is rejected and changes are not made to the git repository.

The first type of access control is on git refs. Git refs are the namespace where branches and tags are stored in a git repository. If a regular expression matches the ref and the operation type (create, fast-forward, force, delete) then it is allowed. For example, this policy rule allows any user to push to refs/heads/foo but force pushes and deletion are not allowed:

anyone create,fast-forward ^heads/foo$

The operations available on refs include:

create-branch - Push a new branch that doesn't exist yet
create-tag - Push a new tag that doesn't exist yet
fast-forward - Push a commit that is a descendent of the current commit
force - Push a commit or tag replacing the previous ref
delete - Delete a ref

What's more interesting is that $user_id is expanded to the git push user's identifier so we can write rules to limit access to per-user ref namespaces:

anyone create-branch,fast-forward,force,delete ^heads/$user_id/.*$

This would allow Alice to push her own branches but Alice could not push to Bob's branches.
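The rule format above can be sketched in a few lines of Python. This is only an illustration, not an existing tool: the function name ref_rule_allows, the whitespace-separated rule layout, and the decision to escape the user ID before substituting it into the regular expression are all assumptions. Group membership is ignored here.

```python
import re


def ref_rule_allows(rule: str, user_id: str, operation: str, ref: str) -> bool:
    """Return True if one policy rule permits `operation` on `ref`.

    `rule` is assumed to look like "<group> <op>[,<op>...] <regex>".
    The group column is parsed but not checked in this sketch.
    """
    _group, ops, pattern = rule.split(None, 2)
    if operation not in ops.split(","):
        return False
    # Expand $user_id to the pushing user's identifier. Escaping the ID is an
    # assumption made here so that it cannot inject regex metacharacters.
    pattern = pattern.replace("$user_id", re.escape(user_id))
    return re.search(pattern, ref) is not None


rule = "anyone create-branch,fast-forward,force,delete ^heads/$user_id/.*$"
alice = "A1B2C3D4E5F60718"  # made-up key ID
bob = "0011223344556677"    # made-up key ID

print(ref_rule_allows(rule, alice, "force", f"heads/{alice}/topic"))
print(ref_rule_allows(rule, alice, "force", f"heads/{bob}/topic"))
```

The first call is allowed because the ref lives in Alice's own namespace; the second is refused because the expanded pattern only matches refs under her key ID.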

We have covered how to define access control policies on refs. Access control policies are also needed on branches so that multiple users can modify the same branch in a controlled and safe manner. The syntax is similar but the policy applies to changes made by commits to directories/files (what git calls a tree). The following allows users to create files in a directory but not delete or modify them (somewhat similar to the UNIX restricted deletion or "sticky" bit on world-writable directories):

anyone create-file ^shared-dir/.*$

The operations available on branches include:

create-directory - Create a new directory
create-file - Create a new file
create-symlink - Create a symlink
modify - Change an existing file or symlink
delete-file - Delete a file

$user_id expansion is also available for branch access control. Here the user can create, modify, and delete files in a per-user directory:

anyone create-file,modify,delete-file ^$user_id/.*$
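To apply such rules, the server has to turn each pushed commit into a list of the branch operations listed above. One possible way, sketched here under assumptions, is to parse the raw output of git diff-tree -r -t --raw, where each line carries the old and new file modes and a status letter. The function name classify_change and the mapping of status letters to operations are hypothetical; in particular, treating every deletion as delete-file is a simplification.

```python
def classify_change(raw_line: str) -> tuple[str, str]:
    """Map one `git diff-tree --raw` line to (operation, path).

    A raw line looks like ":<old-mode> <new-mode> <old-id> <new-id> <status>\t<path>".
    Mode 040000 is a tree (directory) and 120000 is a symlink.
    """
    meta, path = raw_line.rstrip("\n").split("\t", 1)
    _old_mode, new_mode, _old_id, _new_id, status = meta.lstrip(":").split()
    if status == "A":
        if new_mode == "040000":
            return "create-directory", path
        if new_mode == "120000":
            return "create-symlink", path
        return "create-file", path
    if status == "D":
        # Assumption: deletions of any entry type are reported as delete-file.
        return "delete-file", path
    if status in ("M", "T"):
        return "modify", path
    raise ValueError(f"unsupported status {status!r} for {path}")


line = ":000000 100644 0000000 e69de29 A\tshared-dir/notes.txt"
print(classify_change(line))
```

Each resulting (operation, path) pair can then be checked against the branch policy in the same way as ref operations.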

User IDs

You might be wondering how user identifiers work. Git supports GPG-signed push requests with git push --signed. We can use the GPG key ID as the user identifier, eliminating the need for centralized user accounts. Remember that the GPG key ID is based on the public key. Key pairs are randomly generated and it is improbable that the same key will be generated by two different users. That said, GPG key ID uniqueness has been weak in the past when the default size was 32 bits. Git explicitly enables long 64-bit GPG key IDs but I wonder if collisions could be a problem. Maybe an ID with more bits based on the public key should be used instead, but for now let's assume the GPG key ID is unique.
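To get a feel for the numbers, the chance of an accidental collision can be estimated with the birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n users and b-bit IDs. The sketch below is illustrative only: the function name is made up, and it assumes key IDs behave like uniformly random values, so it says nothing about deliberately crafted collisions.

```python
import math


def collision_probability(num_keys: int, id_bits: int) -> float:
    """Approximate chance that at least two of `num_keys` random IDs collide."""
    exponent = num_keys * (num_keys - 1) / (2 * 2**id_bits)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


for bits in (32, 64):
    p = collision_probability(100_000, bits)
    print(f"{bits}-bit IDs, 100000 users: p = {p:.3g}")
```

Under these assumptions, 100,000 users with 32-bit IDs are more likely than not to see a collision, while with 64-bit IDs an accidental collision is very unlikely.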

The downside of this approach is that user IDs are not human-friendly. Git apps can allow the user to assign aliases to avoid displaying raw user IDs. Doing this automatically requires either an external ID issuer (for example, confirming email address ownership), which is tedious for new users, or a registry of usernames stored in the git repo, which means a first-come-first-served policy for username allocation and possible conflicts when merging from two repositories that don't share history. Due to these challenges I think it makes sense to use raw GPG key IDs at the data storage level and make them prettier at the user interface level.

The GPG key ID approach works well for desktop clients but not for web clients. The web application (even if implemented on the client side) would need access to the private key so it can push to the git repository. Users should not trust remotely hosted web applications with their private keys. Maybe there is a standard Web API that can help but I'm not aware of one. More thought is needed here.

The pre-receive git hook checks that signature verification passed and has access to the GPG key ID in the GIT_PUSH_CERT_KEY environment variable. Then the access control policy can be checked.
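A pre-receive hook along these lines might start out as follows. This is a rough sketch, not a finished implementation: pre-receive hooks read one "<old> <new> <ref>" line per updated ref on standard input, and GIT_PUSH_CERT_STATUS and GIT_PUSH_CERT_KEY are set by git for signed pushes, but the function names, the choice of which signature statuses to accept, and the decision not to distinguish create-branch from create-tag are assumptions made for brevity.

```python
import os
import subprocess
import sys


def git_is_ancestor(old: str, new: str) -> bool:
    """True if `old` is an ancestor of `new` (asks git; needs a repository)."""
    cmd = ["git", "merge-base", "--is-ancestor", old, new]
    return subprocess.run(cmd).returncode == 0


def classify_ref_update(old: str, new: str, is_ancestor=git_is_ancestor) -> str:
    """Classify one ref update as create, delete, fast-forward, or force."""
    if set(old) == {"0"}:      # all-zero old value: the ref does not exist yet
        return "create"
    if set(new) == {"0"}:      # all-zero new value: the ref is being deleted
        return "delete"
    return "fast-forward" if is_ancestor(old, new) else "force"


def main() -> int:
    # Assumption: accept good signatures, including keys of unknown validity.
    if os.environ.get("GIT_PUSH_CERT_STATUS") not in {"G", "U"}:
        print("rejected: use git push --signed", file=sys.stderr)
        return 1
    user_id = os.environ["GIT_PUSH_CERT_KEY"]
    for line in sys.stdin:
        old, new, ref = line.split()
        operation = classify_ref_update(old, new)
        # A real hook would check (user_id, operation, ref) against the policy
        # here and return non-zero to reject the whole push on a mismatch.
        print(f"{user_id}: {operation} {ref}", file=sys.stderr)
    return 0
```

Installed as hooks/pre-receive, the script would finish by calling sys.exit(main()); a non-zero exit status makes git refuse the push.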

Access control is a git app

Access control is the first and most fundamental git app. The access control policies that were described above are stored as files in the apps/access-control branch in the repository. Pushes to that branch are also subject to access control checks. Here is the branch's initial layout:

branches/ - access control policies for branches
groups/ - group definitions (see below)
refs/ - access control policies for refs

The default branches/owner.conf access control policy is as follows:

owner create-file,create-directory,modify,delete ^.*$

The default refs/owner.conf access control policy is as follows:

owner create-branch,create-tag,fast-forward,force,delete ^.*$

This gives the owner the ability to push refs and modify branches as they wish. The owner can grant other users access by pushing additional access control policy files or changing existing files on the apps/access-control branch.

Each access control policy file in refs/ or branches/ is processed in turn. If no access control rule matches the operation then the entire git push is rejected.

Groups can be defined to alias one or more user identifiers. This avoids duplicating access control rules when more than one user should have the same access. There are two automatic groups: owner contains just the user who owns the git repository and anyone is the group of all users.
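The evaluation described above (rules checked in turn, groups expanded to user IDs, and the whole push rejected if any operation matches no rule) could look roughly like this. The data structures and the name push_allowed are hypothetical; policy file parsing is left out.

```python
import re


def push_allowed(rules, groups, user_id, operations):
    """Return True only if every operation in the push matches some rule.

    rules      -- list of (group, set_of_operations, regex) tuples
    groups     -- dict mapping group name to a set of user IDs
    operations -- list of (operation, target) pairs, where target is a ref
                  name or a file path
    """
    def is_member(group):
        # "anyone" is the automatic group of all users.
        return group == "anyone" or user_id in groups.get(group, set())

    def operation_allowed(operation, target):
        for group, ops, pattern in rules:
            expanded = pattern.replace("$user_id", re.escape(user_id))
            if is_member(group) and operation in ops and re.search(expanded, target):
                return True
        return False

    return all(operation_allowed(op, target) for op, target in operations)


groups = {"owner": {"OWNERKEY"}}  # made-up key ID
rules = [
    ("owner", {"create-file", "create-directory", "modify", "delete"}, r"^.*$"),
    ("anyone", {"create-file"}, r"^shared-dir/.*$"),
]
push = [("create-file", "shared-dir/a.txt"), ("modify", "README")]
print(push_allowed(rules, groups, "OWNERKEY", push))
print(push_allowed(rules, groups, "ALICEKEY", push))
```

The second push is refused as a whole: Alice may create the file in shared-dir, but nothing permits her to modify README.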

This completes the description of the access control app. Now let's look at how other functionality is built on top of this.

The merge requests app

A merge requests app can be built on top of this model. The refs access control policy is as follows:

# The data branch contains the titles, comments, etc
anyone fast-forward ^apps/merge-reqs/data$

# Each merge request revision is pushed as a tag in a per-user namespace
anyone create-tag ^apps/merge-reqs/$user_id/[0-9]+-v[0-9]+$

The branch access control policy is:

# Merge requests are per-user and numbered
anyone create-directory ^merge-reqs/$user_id/[0-9]+$

# Title string
anyone create-file,modify ^merge-reqs/$user_id/[0-9]+/title$

# Labels (open, needs-review, etc) work like this:
#   merge-reqs/<user-id>/<merge-req-num>/labels/
#     needs-review -> /labels/needs-review
#     ...
#   labels/
#     needs-review/
#       <user-id>/
#         <merge-req-num> -> /merge-reqs/<user-id>/<merge-req-num>
#         ...
#       ...
#     ...
# This directory and symlink layout makes it possible to enumerate labels for a
# given merge request and to enumerate merge requests for a given label.
# Both the merge request author and maintainers can add/remove labels to/from a
# merge request.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/labels$
anyone create-symlink,delete ^merge-reqs/$user_id/[0-9]+/labels/.*$
maintainers create-symlink,delete ^merge-reqs/[^/]+/[0-9]+/labels/.*$
maintainers create-directory ^labels/[^/]+$
anyone create-symlink,delete ^labels/[^/]+/$user_id/[0-9]+$
maintainers create-symlink,delete ^labels/[^/]+/[^/]+/[0-9]+$

# Comments are stored as individual files in per-user directories. Each file
# contains a timestamp and the contents of the comment. The timestamp can be
# used to sort comments chronologically.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments$
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments/$user_id$
anyone create-file,modify ^merge-reqs/[^/]+/[0-9]+/comments/$user_id/[0-9]+$

When a user creates a merge request they provide a title, an initial comment, apply labels, and push a v1 tag for review and merging. Other users can comment by adding files into the merge request's per-user comments directory. Labels can be added and removed by changing symlinks in the labels directories.

The user can publish a new revision of the merge request by pushing a v2 tag and adding a comment describing the changes. Once the maintainers are satisfied they merge the final revision tag into the relevant branch (e.g. "main") and relabel the merge request from open/needs-review to closed/merged.

This workflow can be implemented by a tool that performs the necessary git operations so users do not need to understand the git app's internal data layout. Users just need to interact with the tool that displays merge requests, allows commenting, provides searches, etc. A natural way to implement this tool is as a git alias so it integrates alongside git's built-in commands.

One issue with this approach is that it uses the file system as a database. Performance and scalability are likely to be worse than using a database or application-specific file format. However, the reason for this approach is that it allows the access control app to enforce a policy that ensures users cannot modify or delete other users' data, without running application-specific code on the server and while keeping everything stored in a git repository.

An example where this approach performs poorly is for full-text search. The application would need to search all title and comment files for a string. There is no index for efficient lookups. However, if applications find that git-grep(1) does not perform well they can maintain their own index and cache files locally.

I hope that this has shown how git apps can be built without application code running on the server.

Continuous integration bots

Now that we have the merge requests app it's time to think how a continuous integration service could interface with it. The goal is to run tests on each revision of a merge request and report failures so the author of the merge request can rectify the situation.

A CI bot watches the repository for changes. In particular, it needs to watch for tags created with the ref name apps/merge-reqs/[^/]+/[0-9]+-v[0-9]+.

When a new tag is found the CI bot checks it out and runs tests. The results of the tests are posted as a comment by creating a file in merge-reqs/<user-id>/<merge-req-num>/comments/<ci-bot-user-id>/0 on the apps/merge-reqs/data branch. A ci-pass or ci-fail label can also be applied to the merge request so that the CI status can be easily queried by users and tools.
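The bookkeeping part of such a bot is small. The sketch below only shows how a tag name could be mapped to the comment file the bot would create; the function name, the bot's user ID, and the comment numbering are hypothetical, and fetching, running tests, and pushing are left out.

```python
import re

# Same shape as the ref pattern the bot watches for, with named groups added.
TAG_RE = re.compile(
    r"^apps/merge-reqs/(?P<user>[^/]+)/(?P<num>[0-9]+)-v(?P<rev>[0-9]+)$"
)


def ci_comment_path(tag_ref: str, bot_user_id: str, comment_num: int = 0):
    """Return the comment file path for a merge request tag, or None."""
    match = TAG_RE.match(tag_ref)
    if match is None:
        return None  # not a merge request revision tag, nothing to do
    return (
        f"merge-reqs/{match['user']}/{match['num']}"
        f"/comments/{bot_user_id}/{comment_num}"
    )


print(ci_comment_path("apps/merge-reqs/ALICEKEY/7-v2", "CIBOTKEY"))
print(ci_comment_path("heads/main", "CIBOTKEY"))
```

The bot would write the test results to the returned path on the apps/merge-reqs/data branch and push that commit with its own signed push, so the comment rules in the policy apply to it like to any other user.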

Going further

There are many loose ends. How can non-git users participate on issue trackers and wikis? It might be possible to implement a full peer as a client-side web application using isomorphic-git, a JavaScript git implementation. As mentioned above, the GPG key ID approach is not very browser-friendly because it requires revealing the private key to the web page and, since keys are user identifiers, using temporary keys does not work well.

The data model does not allow efficient queries. A full copy of the data is necessary in order to query it. That's acceptable for local applications because they can maintain their own indexes and are expected to keep the data for a long period of time. It works less well for short-lived web page sessions like a casual user filing a new bug on the issue tracker.

The git push --signed technique is not the only option. Git also supports signed commits and signed tags. The difference between signed pushes and signed tags/commits is significant. The signed push approach only validates the access control policy when the repository is changed and leaves no audit log for future reference. The signed commit/tag approach keeps the signatures in the git history. Signed commits/tags can be propagated in a peer-to-peer network and each peer can validate the access control policy itself. While signed commits/tags apply the access control policy to each object in the repository, signed pushes apply the access control policy to each change made to the repository. The difference is that it's easy to rebase and include work from different authors with signed pushes. Signed commits/tags require re-signing for rebasing and each commit is validated against its signature, which may be different from the user who is making the push request.

There are a lot of interesting approaches and trade-offs to explore here. This model we've discussed fits closely with how I've seen developers use git in open source projects. It is designed around a "main" repository/server that contributors push their code to. But each clone of the repository has all the data and can be published as a new "main" repository, if necessary.

Although these ideas are unfinished I decided to write them up with the knowledge that I probably won't implement them myself. QEMU is moving to GitLab with a traditional centralized git forge. I don't think this is the right time to develop this idea and try to convince the QEMU community to use it. For projects that have fewer infrastructure requirements it would give their contributors more power than being confined to a centralized git forge.

I hope this was an interesting read for anyone thinking about git forges and building git apps.

by Unknown at December 11, 2020 02:17 PM

December 08, 2020

QEMU project

QEMU version 5.2.0 released

We’d like to announce the availability of the QEMU 5.2.0 release. This release contains 3200+ commits from 216 authors.

You can grab the tarball from our download page. The full list of changes is available in the Wiki.

Note that QEMU has switched build systems so you will need to install ninja to compile it. See the “Build Information” section of the Changelog for more information about this change.

Highlights include:

  • block: support for using qemu-storage-daemon as a vhost-user-blk device backend, and new ‘block-export-add’ QMP command which replaces now-deprecated ‘nbd-server-add’ with support for qemu-storage-daemon
  • block: qcow2 support for subcluster-based allocation (via extended_l2=on qemu-img option), improvements to NBD client network stalls, and qemu-nbd support for exposing multiple dirty bitmaps at once
  • migration: higher-bandwidth encrypted migration via TLS+multifd, new ‘block-bitmap-mapping’ option for finer-grained control over which bitmaps to migrate, and support for migration over a ‘vsock’ device (for nested environments and certain hardware classes)
  • qemu-ga: support for guest-get-devices, guest-get-disks, and guest-ssh-{get,add,remove}-authorized-keys commands
  • virtiofs: virtiofsd support for new options to control how xattr names are seen by the guest, specify sandboxing alternative to pivot_root, and allowing different host mounts to be seen as separate submounts in the guest to avoid inode clashes
  • ARM: new board support for mps2-an386 (Cortex-M4 based), mps2-an500 (Cortex-M7 based), raspi3ap (Raspberry Pi 3 model A+), raspi0 (Raspberry Pi Zero), raspi1ap (Raspberry Pi A+), and npcm750-evb/quanta-gsj (Nuvoton iBMC)
  • ARM: ARMv8.2 FEAT_FP16 (half-precision floating point) support for AArch32 (already supported for AArch64)
  • ARM: virt: support for kvm-steal-time accounting
  • HPPA: support for booting NetBSD and older Linux distros like debian-0.5 and debian-0.6.1
  • PowerPC: pseries: improved support for user-specified NUMA distance topologies
  • RISC-V: live migration support
  • RISC-V: experimental hypervisor support updated to v0.6.1 and other improvements
  • RISC-V: support for NUMA sockets on virt/Spike machine types
  • s390: KVM support for diagnose 0x318 instruction, TCG support for additional z14 instructions
  • s390: vfio-pci devices now report real hardware features for functions instead of just emulated values
  • Xtensa: DFPU co-processor with single/double-precision FP opcodes is now supported
  • x86: improved support for asynchronous page faults via new kvm-async-pf-int -cpu option
  • and lots more…

Thank you to everyone involved!

December 08, 2020 09:00 PM

December 06, 2020

Stefan Hajnoczi

Git Forge Apps: Why git forges are serverless computing providers

In this post I want to explore an idea for a new type of application made possible by the power of git forges like GitLab and GitHub. I don't have a proof-of-concept and in fact we'll discuss hurdles that could make the idea infeasible. But it's an interesting way of thinking that is worth sharing as it offers a new way of looking at and using git forges.

Git forges offer git repository hosting at their core and then add on related features like pull requests, wikis, issue trackers, static website hosting, continuous integration, and more. They are now a long way away from the initial idea of a hosted git repository where you can publish code. They do so much more. They have a von Neumann architecture and are Turing complete. It's possible to do interesting things with that.

A git forge app (GFA) is an application that runs on a git forge. A hosted git repository is not just a place to publish and back up the source code. It's where the app runs. You can view the git forge as a Platform-as-a-Service (PaaS) or serverless computing provider. The app doesn't need to be deployed elsewhere, because the git forge itself is the execution environment.

A GFA must be able to:

  1. Process data.
  2. Store data.
  3. Interact with the outside world.

This is basically what computer applications do. Git forges have become powerful enough to do this.

Imagine the following: you fork a repository and this instantiates the GFA. The GFA is now running under your git forge account. The web interface is available at https://<user><app>/ and HTTP POST requests can also be used to interact with the application. The GFA could be a todo list, an RSS reader, a blog, etc. To the website visitor it appears like any other web application but everything happens within the git forge and no other hosting provider is necessary.

Let's look at how it's possible to use a git forge as the execution environment for an app.

Processing Data

The first order of business is executing code so that the application can process data. A configuration file is placed into a git repository to define runnable jobs. The job execution systems offered by GitLab and GitHub are GitLab CI and GitHub Actions, respectively. When the job is triggered it executes in a temporary environment, for example a Linux container image. This allows code hosted in the git repository to run on a server somewhere for a little while. There is a free allowance that grants a certain number of hours of unpaid execution time. This is how GFAs process data.

For example, say we are building an RSS reader. Our repo looks like this:

.gitlab-ci.yml - the job configuration file - download RSS feeds and check for new posts

The .gitlab-ci.yml file will contain a scheduled pipeline that runs every 15 minutes.

Storing Data

Applications need to store data persistently. The primary way of doing that in a git forge is by storing data in the git repository itself. Mutable data can be kept on a separate branch called data and can be force-pushed to avoid forever increasing the repository size when we don't need to store previous revisions of the data.

This works because each git branch has an independent commit history. The file namespace it stores is completely separate from other branches. This means it's possible to have the GFA's code on the main branch and to store data on the data branch without upsetting the commit history or files on the main branch.

Data storage is available to jobs because they are allowed to manipulate their own git repository thanks to an authentication token. On GitHub GITHUB_TOKEN allows jobs to push to their git repository. This gives jobs a way of storing data.

There are other forms of data storage besides the git repository. Artifacts are files produced by a job that can be downloaded via a URL. Artifacts are subject to expiry time and file size limitations. Caching is available to temporarily store files between job runs, although this data can be lost at any time.

At this point you may be thinking that this is nice but there is no way to store private data if the git repository is public. Git forges offer a secrets mechanism where variables can be stored privately and only made available during job execution. This can be used to stash an encryption key so that the data is stored in public but is encrypted. Anyone can download the encrypted data but they will not have the key needed to decrypt it. Some applications can also take advantage of client-side encryption and homomorphic encryption.

Interacting with the Outside World

Git forges offer a wealth of ways to interact with the outside world, called triggers. Triggers can run a job when an HTTP POST request is made. POST requests typically need a trigger token for authentication, but that token can be published if the triggered job is safe to be invoked by anyone. Both browser clients and Webhooks can invoke triggers through HTTP POST requests.

For example, imagine our RSS reader app needs an API to mark feeds as read. We define an HTTP POST trigger that runs a script that updates the feeds stored in the data branch. Since only the git repository owner should be able to invoke this trigger we do not publish the trigger token. Instead it is stored encrypted in the repository and the user provides a passphrase for decrypting this secret.

At this point it is also useful to offer a web interface. This is possible thanks to the static pages hosting features called GitLab Pages and GitHub Pages. Pushing HTML, JavaScript, CSS, and images to a special branch publishes the static website at https://<user><app>/.

For the most part GFAs are asynchronous: they cannot handle HTTP requests synchronously like a web application that outputs an HTTP response. Incoming HTTP requests simply schedule GFA jobs that run sometime later. There are a few ways around this. The browser can poll for results using XMLHttpRequest or similar techniques. Alternatively, a GFA can fire up a job that runs for several minutes, although I haven't investigated if there is a practical way to communicate with a web application running in a job (I guess incoming HTTPS is tricky to achieve).

Triggers offer a lot more fun than what I've already covered. They can respond to the creation and modification of wiki pages, issues, and pull request comments. This means it's possible to use those entities for interacting with the outside world. The GFA could act as a chat bot on its issue tracker, for example.


Git forges weren't designed for GFAs. But then they weren't initially designed to be Turing complete or to offer ways to interact with the outside world either. That was added incrementally as demand for that functionality grew. Maybe git forges will evolve into full-blown serverless hosting competitors to today's cloud hosting providers.

GFAs are a hack that uses features like static pages, CI jobs, and webhooks in a creative way to run applications on a git forge. Building GFAs that are actually useful is likely to hit some challenges:

  • No synchronous requests - it is hard to implement search queries and other dynamic behavior in GFAs because they are primarily asynchronous (and slow!). This limitation matters for certain classes of applications. A lot can be done client-side instead to make up for this deficiency. But maybe someone can figure out how to do synchronous requests in GFAs.
  • Security - I have outlined how to make data publicly available and also how to encrypt it so only the git repository owner can view the plaintext. This is enough for personal web applications, but it's not sufficient for multi-user applications. Maybe git forges offer a form of authentication that works with GFAs so multiple users can store private data in a single GFA instance (the git repository owner could view all users' data but other users could not).
  • Free usage tiers - job execution time, storage capacity, and request throttling will limit how resource-intensive a GFA can be before it outgrows the free tier and eventually even the paid tier.

The first generation of GFAs could be written from scratch with each job carefully designed. Then a second generation of GFAs could be built on top of frameworks that abstract the tedious git forge integration with standard APIs and data models. For example, a Vue.js frontend could use a key/value store API and the whole thing can be hosted as a GFA.

Even if GFAs don't become a thing because git forges decide there is not enough demand to make them work really well, changing your perspective to think of git forges in this way is valuable. For example, I have a git repo for building a container image that I deploy on my server and that pushes output files to a web server host. All of that can be replaced with a git repository that runs a job and publishes to GitLab Pages. This simplifies things and frees me from maintaining infrastructure.


Git forges offer jobs for processing data, git repositories and artifacts for storing data, and triggers for interacting with the outside world. It is possible to build applications that exist solely as a git repository on a git forge. There is no longer a need to deploy code because the git forge itself is powerful enough to run non-trivial applications. I look forward to how this evolves and whether git forges eventually become full-blown cloud service providers.

by Unknown at December 06, 2020 10:02 AM

Understanding Peer-to-Peer Git Forges with Radicle

Git is a distributed version control system and does not require a central server. Although repositories are usually published at a well-known location for convenient cloning and fetching of the latest changes, this is actually not necessary. Each clone can have the full commit history and evolve independently. Furthermore, code changes can be exchanged via email or other means. Finally, even the clone itself does not need to be made from a well-known domain that hosts a git repository (see git-bundle(1)).

Given that git itself is already fully decentralized one would think there is no further work to do. I came across the Radicle project and its somewhat psychedelic website. Besides having a website with a wild color scheme, the project aims to offer a social coding experiment or git forge functionality using a peer-to-peer network architecture. According to the documentation the motivation seems to be that git's built-in functionality works but is not user-friendly enough to make it accessible. In particular, it lacks social coding features.

The goal is to add git forge features like project and developer discovery, issue trackers, wikis, etc. Additional, distinctly decentralized functionality involving Ethereum is also touched on, as a way to anchor project metadata, pay contributors, etc. Radicle is still in early development so these features are not yet implemented. Here is my take on the How it Works documentation, which is a little confusing due to its early stage and some incomplete sentences or typos. I don't know whether my understanding actually corresponds to the Radicle implementation that exists today or its eventual vision, because I haven't studied the code or tried running the software. However, the ideas that the documentation has brought up are interesting and fruitful in their own right, so I wanted to share them and explain them in my own words in case you also find them worth exploring.

The git data model

Let's quickly review the git data model because it is important for understanding peer-to-peer git forges. A git repository contains a refs/ subdirectory that provides a namespace for local branch heads (refs/heads/), local and remotely fetched tags (refs/tags/), and remotely fetched branches (refs/remotes/<remote>/). Actually this namespace layout is just a convention for everyday git usage and it's possible to use the refs/ namespace differently as we will see. The git client fetches refs from a remote according to a refspec rule that maps remote refs to local refs. This gives the client the power to fetch only certain refs from the server. The client can also put them in a different location in its local refs/ directory than the server. For details, see the git-fetch(1) man page.
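The refspec mapping can be illustrated with a simplified model. The function below handles only the common "<src>:<dst>" form with at most one * wildcard on each side and ignores the leading + (which permits non-fast-forward updates); it is a sketch of the idea, not a reimplementation of git's refspec handling, and the name map_ref is made up.

```python
def map_ref(refspec: str, remote_ref: str):
    """Return the local ref that `remote_ref` maps to, or None if no match."""
    src, dst = refspec.lstrip("+").split(":", 1)
    if "*" not in src:
        return dst if remote_ref == src else None
    prefix, suffix = src.split("*", 1)
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    # The part matched by * on the remote side is substituted on the local side.
    matched = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", matched, 1)


default = "+refs/heads/*:refs/remotes/origin/*"
print(map_ref(default, "refs/heads/main"))
print(map_ref(default, "refs/tags/v1.0"))
print(map_ref("refs/heads/*:refs/peers/alice/heads/*", "refs/heads/topic"))
```

The last call shows the point made above: nothing forces fetched branches into refs/remotes/, so a client can file them under any local namespace it likes (refs/peers/alice/ is an invented example).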

Refs files contain the commit hash of an object stored in the repository's object database. An object can be a commit, tree (directory), tag, or a blob (file). Branch refs point to the latest commit object. A commit object refers to a tree object that may refer to further tree objects for sub-directories and finally the blob objects that make up the files being stored. Note that a git repository supports disjoint branches that share no history. Perhaps the most well-known example of disjoint branches are the GitHub Pages and GitLab Pages features where these git forges publish static websites from the HTML/CSS/JavaScript/image files on a specific branch in the repository. That branch shares no version history with other branches and the directories/files typically have no similarity to the repository's main branch.

Now we have covered enough git specifics to talk about peer-to-peer git forges. If you want to learn more about how git objects are actually stored, check out my article on the repository layout and pack files.

Identity and authority

Normally a git repository has one or more owners who are allowed to push refs. No one else has permission to modify the refs namespace. What if we tried to share a single refs namespace with the whole world and everyone could push? There would be chaos due to naming conflicts and malicious users would delete or change other users' refs. So it seems like an unworkable idea unless there is some way to enforce structure on the global refs namespace.

Peer-to-peer systems have solutions to these problems. First, a unique identity can be created by picking a random number with a sufficient number of bits so that the chance of collision is improbable. That unique identity can be used as a prefix in the global ref namespace to avoid accidental collisions. Second, there needs to be a way to prevent unauthorized users from modifying the part of the global namespace that is owned by other users.

Public-key cryptography provides the primitive for achieving both these things. A public key or its hash can serve as the unique identifier that provides identity and prevents accidental collisions. Ownership can be enforced by verifying that changes to the global namespace are signed with the private key corresponding to the unique identity.

For example, we fetch the following refs from a peer:


This is a simplified example based on the Radicle documentation. Here identity is the unique identity based on a public key. Remember no one else in the world has the same identity because the chance of generating the same public key is improbable. The heads/ refs are normal git refs to commit objects - these are published branches. The signed_refs ref points to a git object that contains a list of commit hashes and a signature generated using the private key. The signature can be verified using the public key.

Next we need to verify these changes to check that they were created with the private key that is only known to the identity's owner. First, we check the signature on the object pointed to by the signed_refs ref. If the signature is not valid we reject these changes and do not store them in our local repository. Next, we look up each ref in heads/ against the list in signed_refs. If a ref is missing from the list then we reject these refs and do not allow them into our local repository.
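The two verification steps can be sketched in Python. This is not Radicle's code or wire format: the data layout (plain dicts mapping ref names to commit hashes) and the function names are assumptions, and the signature scheme is a toy textbook RSA with a tiny key that is trivially breakable. It only serves to show the private-key/public-key asymmetry.

```python
import hashlib

# Toy textbook RSA key (p=61, q=53). Far too small to be secure; illustration only.
N, E, D = 3233, 17, 2753  # modulus, public exponent, private exponent


def digest(refs):
    """Hash a {ref name: commit hash} mapping to an integer below N."""
    payload = "\n".join(f"{name} {commit}" for name, commit in sorted(refs.items()))
    return int.from_bytes(hashlib.sha256(payload.encode()).digest(), "big") % N


def sign(refs):
    """Owner side: sign the ref list with the private exponent D."""
    return pow(digest(refs), D, N)


def verify_fetch(heads, signed_refs, signature):
    """Peer side: accept fetched heads only if the list is signed and covers them."""
    if pow(signature, E, N) != digest(signed_refs):
        return False  # step 1: invalid signature, reject everything
    # Step 2: every fetched head must appear in the signed list with the same commit.
    return all(signed_refs.get(name) == commit for name, commit in heads.items())


signed = {"heads/main": "a" * 40}
sig = sign(signed)
print(verify_fetch({"heads/main": "a" * 40}, signed, sig))  # authorized ref
print(verify_fetch({"heads/evil": "b" * 40}, signed, sig))  # ref not in the signed list
```

Only the holder of D can produce a signature that passes the check, while anyone who knows N and E can verify it, which is why refs can be relayed through untrusted peers.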

This scheme lends itself to peer-to-peer systems because the refs can be propagated (copied) between peers and verified at each step. The identity owner does not need to be present at each copy step since their cryptographic signature is all we need to be certain that they authorized these refs. So I can receive refs originally created by identity A from peer B and still be sure that peer B did not modify them since identity A's signature is intact.

Now we have a global refs namespace that is partitioned so that each identity is able to publish refs and peers can verify that these changes are authorized.


It may not be clear yet that it's not necessary to clone the entire global namespace. In fact, it's possible that no single peer will ever have a full copy of the entire global namespace! That's because this is a distributed system. Peers only fetch refs that they care about from their peers. Peers fetch from each other and this forms a network. The network does not need to be fully connected and it's possible to have multiple clusters of peers running without full global connectivity.

To bootstrap the global namespace there are seed repositories. Seeds are a common concept in peer-to-peer systems. They provide an entry point for new peers to learn about and start participating with other peers. In BitTorrent this is called a "tracker" rather than a "seed".

According to the Radicle documentation it is possible to directly fetch from peers. This probably means a git-daemon(1) or git-http-backend(1) needs to be accessible on the public internet. Many peers will not have sufficient network connectivity due to NAT limitations. I guess Radicle does not expect every user to participate as a repository.

Interestingly, there is a gossip system for propagating refs through the network. Let's revisit the refs for an identity in the global namespace:


We can publish identities that we track in remotes/. It's a recursive refs layout. This is how someone tracking our refs can find out about related identities and their refs.

Thanks to git's data model the commit, tree, and blob objects can be shared even though we duplicate refs published by another identity. Since git is a content-addressable object database the data is stored once even though multiple refs point to it.
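Content addressing is easy to demonstrate. The Python sketch below computes the object ID git assigns to a blob (the SHA-1 of a `blob <size>\0` header followed by the content, as used by SHA-1 repositories) and uses a plain dict as a stand-in for the object database. The ref names are made up for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID of a blob: SHA-1 over the header "blob <size>\\0" plus the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


object_db = {}  # stand-in for the object database, keyed by object ID
refs = {}       # ref name -> object ID


def publish(ref_name, content):
    oid = git_blob_id(content)
    object_db[oid] = content  # storing the same content twice is a no-op
    refs[ref_name] = oid


publish("alice/heads/main", b"hello world\n")
publish("bob/remotes/alice/heads/main", b"hello world\n")
print(len(refs), "refs,", len(object_db), "stored object")
```

Two refs under different identities point at the same object ID, so the data is stored only once. Real refs point at commits rather than blobs, but commits, trees, and tags are addressed the same way.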

Now we not only have a global namespace where anyone can publish git refs, but also ways to build a peer-to-peer network and propagate data throughout the network. It's important to note that data is only propagated if peers are interested in fetching it. Peers are not forced to store data that they are not interested in.

How data is stored locally

Let's bring the pieces together and show how the system stores data. The peer creates a local git repository called the monorepo for the purpose of storing portions of the global namespace. It fetches refs from seeds or direct peers to get started. Thanks to the remotes/ refs it also learns about other refs on the network that it did not request directly.

This git repository is just a data store; it is not usable for normal git workflows. The conventional git branch and git tag commands would not work well with the global namespace layout and verification requirements. Instead we can clone a local file:/// repository from the monorepo that fetches a subset of the refs into the conventional git refs layout. The files can be shared because git-clone(1) supports hard links to local repositories. Thanks to githooks(5) and/or extensible git-push(1) remote helper support it's possible to generate the necessary global namespace metadata (e.g. signatures) when we push from the local clone to the local monorepo. The monorepo can then publish the final refs to other peers.

Building a peer-to-peer git forge

There are neat ideas in Radicle and it remains to be seen how well it will grow to support git forge functionality. A number of challenges need to be addressed:

  • Usability - Radicle is a middle-ground between centralized git forges and email-based decentralized development. The goal is to be easy to use like git forges. Peer-to-peer systems often have challenges providing a human-friendly interface on top of public key identities (having usernames without centralized user accounts). Users will probably prefer to think in terms of repositories, merge requests, issues, and wikis instead of peers, gossip, identities, etc.
  • Security - The global namespace and peer-to-peer model is a target that malicious users will attack by trying to impersonate or steal identities, flood the system with garbage, game reputation systems with sockpuppets, etc.
  • Scalability - Peers only care about certain repositories and don't want to be slowed down by all the other refs in the global namespace. The recursive refs layout seems like it could cause performance problems and maybe users will configure clients to limit the depth to a low number like 3. At first glance Radicle should be able to scale well since peers only need to fetch refs they are interested in, but it's a novel way of using git refs, so we can expect scalability problems as the system grows.
  • Data model - How will this data model grow to handle wikis, issue trackers, etc? Issue tracker comments are an example of a data structure that requires conflict resolution in a distributed system. If two users post comments on an issue, how will this be resolved without a conflict? Luckily there is quite a lot of research on distributed data structures such as Conflict-free Replicated Data Types (CRDTs) and it may be possible to avoid most conflicts by eliminating concepts like linear comment numbering.
  • CI/CD - As mentioned in my blog post about why git forges are von Neumann machines, git forges are more than just data stores. They also have a computing model, initially used for Continuous Integration and Continuous Delivery, but really a general serverless computing platform. This is hard to do securely and without unwanted resource usage in a peer-to-peer system. Maybe Radicle will use Ethereum for compute credits?
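To illustrate the data model point about issue comments, here is a minimal Python sketch of a grow-only set, one of the simplest CRDTs. It is not how Radicle stores issues; the function names and the choice of keying comments by a content hash instead of a linear comment number are assumptions made for the example.

```python
import hashlib


def comment_id(author, text):
    """Content-derived ID, so no peer has to hand out 'comment #N'."""
    return hashlib.sha256(f"{author}\n{text}".encode()).hexdigest()


def add_comment(replica, author, text):
    """Return a new replica state with the comment added."""
    updated = dict(replica)
    updated[comment_id(author, text)] = (author, text)
    return updated


def merge(a, b):
    """Set union: commutative, associative, and idempotent, so it never conflicts."""
    return {**a, **b}


base = add_comment({}, "alice", "Build fails on s390x")
peer1 = add_comment(base, "bob", "Works for me on x86")
peer2 = add_comment(base, "carol", "I can reproduce it")
print(merge(peer1, peer2) == merge(peer2, peer1))
```

Both peers end up with the same three comments regardless of the order in which they exchange state. What a grow-only set cannot express is deletion or a stable display order, which is where richer CRDTs come in.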


Radicle is a cool idea and I look forward to seeing where it goes. It is still at an early stage but already shows interesting approaches with the global refs namespace and monorepo data store.

by Unknown at December 06, 2020 10:02 AM

December 03, 2020

Alberto Garcia

Subcluster allocation for qcow2 images

In previous blog posts I talked about QEMU’s qcow2 file format and how to make it faster. This post gives an overview of how the data is structured inside the image and how that affects performance, and this presentation at KVM Forum 2017 goes further into the topic.

This time I will talk about a new extension to the qcow2 format that seeks to improve its performance and reduce its memory requirements.

Let’s start by describing the problem.

Limitations of qcow2

One of the most important parameters when creating a new qcow2 image is the cluster size. Much like a filesystem’s block size, the qcow2 cluster size indicates the minimum unit of allocation. One difference however is that while filesystems tend to use small blocks (4 KB is a common size in ext4, ntfs or hfs+) the standard qcow2 cluster size is 64 KB. This adds some overhead because QEMU always needs to write complete clusters so it often ends up doing copy-on-write and writing to the qcow2 image more data than what the virtual machine requested. This gets worse if the image has a backing file because then QEMU needs to copy data from there, so a write request not only becomes larger but it also involves additional read requests from the backing file(s).

Because of that qcow2 images with larger cluster sizes tend to:

  • grow faster, wasting more disk space and duplicating data.
  • increase the amount of necessary I/O during cluster allocation,
    reducing the allocation performance.

Unfortunately, reducing the cluster size is in general not an option because it also has an impact on the amount of metadata used internally by qcow2 (reference counts, guest-to-host cluster mapping). Decreasing the cluster size increases the number of clusters and the amount of necessary metadata. This has a direct negative impact on I/O performance, which can be mitigated by caching the metadata in RAM, at the cost of higher memory requirements (the aforementioned post covers this in more detail).
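A quick back-of-the-envelope calculation shows how fast the mapping metadata grows. This Python sketch assumes the standard qcow2 layout in which each data cluster is described by one 8-byte L2 entry; it ignores reference counts and L1 tables, so the real totals are somewhat higher.

```python
L2_ENTRY_BYTES = 8  # one standard L2 entry per data cluster

KiB, MiB, GiB, TiB = 2 ** 10, 2 ** 20, 2 ** 30, 2 ** 40


def l2_metadata_bytes(virtual_size, cluster_size):
    """Bytes of L2 tables needed to map a fully allocated image."""
    clusters = -(-virtual_size // cluster_size)  # ceiling division
    return clusters * L2_ENTRY_BYTES


for cluster_size in (64 * KiB, 4 * KiB):
    size = l2_metadata_bytes(1 * TiB, cluster_size)
    print(f"{cluster_size // KiB} KiB clusters: {size // MiB} MiB of L2 tables")
```

For a 1 TiB image this gives 128 MiB of L2 tables with 64 KiB clusters, but 2048 MiB with 4 KiB clusters, which is why simply shrinking the cluster size to match the guest filesystem block size is not attractive.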

Subcluster allocation

The problems described in the previous section are well-known consequences of the design of the qcow2 format and they have been discussed over the years.

I have been working on a way to improve the situation and the work is now finished and available in QEMU 5.2 as a new extension to the qcow2 format called extended L2 entries.

The so-called L2 tables are used to map guest addresses to data clusters. With extended L2 entries we can store more information about the status of each data cluster, and this allows us to have allocation at the subcluster level.

The basic idea is that data clusters are now divided into 32 subclusters of the same size, and each one of them can be allocated separately. This allows combining the benefits of larger cluster sizes (less metadata and RAM requirements) with the benefits of smaller units of allocation (less copy-on-write, smaller images). If the subcluster size matches the block size of the filesystem used inside the virtual machine then we can eliminate the need for copy-on-write entirely.

So with subcluster allocation we get:

  • Sixteen times less metadata per unit of allocation, greatly reducing the amount of necessary L2 cache.
  • Much faster I/O during allocation when the image has a backing file, up to 10-15 times more I/O operations per second for the same cluster size in my tests (see chart below).
  • Smaller images and less duplication of data.
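The "sixteen times" figure can be checked with a little arithmetic. The Python sketch below assumes the entry sizes given in the qcow2 specification (8 bytes for a standard L2 entry, 16 bytes for an extended one, which adds a 64-bit subcluster bitmap); those sizes are not stated in this post.

```python
SUBCLUSTERS_PER_CLUSTER = 32
STANDARD_L2_ENTRY_BYTES = 8
EXTENDED_L2_ENTRY_BYTES = 16  # assumption taken from the qcow2 spec, see lead-in

KiB = 2 ** 10


def subcluster_size(cluster_size):
    """Size of the unit of allocation when extended L2 entries are enabled."""
    return cluster_size // SUBCLUSTERS_PER_CLUSTER


def l2_bytes_per_allocation_unit(extended):
    """L2 metadata needed per unit of allocation (cluster or subcluster)."""
    if extended:
        return EXTENDED_L2_ENTRY_BYTES / SUBCLUSTERS_PER_CLUSTER
    return float(STANDARD_L2_ENTRY_BYTES)


print(subcluster_size(128 * KiB) // KiB, "KiB subclusters with 128 KiB clusters")
print(l2_bytes_per_allocation_unit(False) / l2_bytes_per_allocation_unit(True))
```

An image with 128 KiB clusters allocates in 4 KiB units, the common filesystem block size mentioned earlier, while using half a byte of L2 metadata per unit instead of eight.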

This figure shows the average number of I/O operations per second that I get with 4KB random write requests to an empty 40GB image with a fully populated backing file.

I/O performance comparison between traditional and extended qcow2 images

Things to take into account:

  • The performance improvements described earlier happen during allocation. Writing to already allocated (sub)clusters won’t be any faster.
  • If the image does not have a backing file chances are that the allocation performance is equally fast, with or without extended L2 entries. This depends on the filesystem, so it should be tested before enabling this feature (but note that the other benefits mentioned above still apply).
  • Images with extended L2 entries are sparse, that is, they have holes and because of that their apparent size will be larger than the actual disk usage.
  • It is not recommended to enable this feature in compressed images, as compressed clusters cannot take advantage of any of the benefits.
  • Images with extended L2 entries cannot be read with older versions of QEMU.

How to use this?

Extended L2 entries are available starting from QEMU 5.2. Due to the nature of the changes it is unlikely that this feature will be backported to an earlier version of QEMU.

In order to test this you simply need to create an image with extended_l2=on, and you also probably want to use a larger cluster size (the default is 64 KB, remember that every cluster has 32 subclusters). Here is an example:

$ qemu-img create -f qcow2 -o extended_l2=on,cluster_size=128k img.qcow2 1T

And that’s all you need to do. Once the image is created all allocations will happen at the subcluster level.

More information

This work was presented at the 2020 edition of the KVM Forum. Here is the video recording of the presentation, where I cover all this in more detail:

You can also find the slides here.


This work has been possible thanks to Outscale, who have been sponsoring Igalia and my work in QEMU.

Igalia and Outscale

And thanks of course to the rest of the QEMU development team for their feedback and help with this!

by berto at December 03, 2020 06:15 PM

December 02, 2020

Stefan Hajnoczi

Software Freedom Conservancy announces donation matching

Software Freedom Conservancy, the non-profit charity home of QEMU, Git, Inkscape, and many other free and open source software projects, is running its annual fundraiser. They have announced a generous donation matching pledge so donations made until January 15th 2021 will be doubled! Full details are here.

What makes Software Freedom Conservancy important, besides being a home for numerous high-profile free and open source software projects, is that it is backed by individuals and works for the public interest. It is not a trade association funded by companies to represent their interests. With the success of free and open source software it's important we don't lose these freedoms or use them just to benefit businesses. That's why I support Software Freedom Conservancy.

by Unknown at December 02, 2020 12:47 PM

December 01, 2020

Thomas Huth

QEMU Advent Calendar 2020 starts

Starting today, on December 1st, the first door of the QEMU Advent Calendar 2020 can now be opened! The advent calendar reveals a new disk image or something similar for download on each of the first 24 days in December 2020, to create a fun experience for the QEMU community, brightening your days in the winter season, and to provide some good images for testing various aspects of QEMU. There is also the possibility to contribute your own ideas for disk images - so if you’re interested in helping, please have a look at the announcement in the QEMU blog for details.

December 01, 2020 06:15 AM

November 27, 2020

Stefan Hajnoczi

Call for QEMU Advent Calendar 2020 disk images

QEMU Advent Calendar publishes a disk image surprise each day from December 1-24. It's a QEMU community tradition that is back again this year!

If you want to contribute disk images to this year's advent calendar (puzzles, games, niche operating systems, retro computing fun, etc), please check out the call for disk images for details.

by Unknown at November 27, 2020 07:19 AM

November 26, 2020

Thomas Huth

Secure Execution for zKVM Introduction and Demo

The recent generation of the mainframe (i.e. the z15) has the possibility to run protected KVM guests, so that the administrator of the host LPAR does not have the possibility anymore to read or alter the memory of a guest. This feature is called “Secure Execution” on the IBM Z.

IBM now published a nice introduction and demo of this new feature, which is in my opinion quite worthwhile to watch if you want to get a basic idea about running protected guests:

The video contains a short introduction into the concepts of the secure execution feature, followed by a demo that shows how to prepare a protected guest VM.

November 26, 2020 01:45 PM

QEMU project

QEMU Advent Calendar 2020 Announcement and Call for Images

QEMU Advent Calendar 2020 is around the corner and we are looking for volunteers to contribute disk images that showcase something cool, bring back retro computing memories, or simply entertain with a puzzle or game.

QEMU Advent Calendar publishes a QEMU disk image each day from December 1-24. Each image is a surprise designed to delight an audience consisting of the QEMU community and beyond. You can see previous years at

You can help us make this year’s calendar awesome by:

Here are the requirements for disk images:

  • Content must be freely redistributable (i.e. no proprietary license that prevents distribution). For GPL based software, you need to provide the source code, too.
  • Provide a name and a short description of the disk image (e.g. with hints on what to try)
  • Provide a ./run shell script that prints out the name and description/hints and launches QEMU
  • Provide a 320x240 screenshot/image/logo for the website
  • Size should ideally be under 100 MB per disk image (but if a few images are bigger, that should be OK, too)

Check out this disk image as an example of how to distribute an image. Links to files over 25MB are appreciated in lieu of email attachments.

PS: QEMU Advent Calendar is a secular calendar (not religious). The idea is to create a fun experience for the QEMU community which can be shared with everyone. You don’t need to celebrate Christmas or another religious festival to participate!

November 26, 2020 09:00 AM

November 24, 2020

Cornelia Huck

s390x changes in QEMU 5.2

As, once again, a new QEMU release is around the corner, the time has come to list some s390x changes in there.

  • TCG has gained emulation support for some additional instructions that had been introduced with the z14. More enhancements needed to be able to run distributions built for the z14 will likely come in the future.
  • When running under KVM, QEMU now supports the diagnose 0x318 instruction. This can be used to set some diagnostic information (such as the operating system), which may be helpful when servicing the hardware. With this comes support for extended SCCBs; this is needed as the facility indication for diag318 encroaches into the control block used for reporting CPU information. A guest needs support for extended SCCBs to be able to see information for all CPUs if diag318 support is provided.
  • You can now use virtiofs on s390x, thanks to some endianness fixes, and a vhost-user-fs-ccw device has been added.
  • Up to now, both fully emulated PCI functions and PCI functions passed via vfio-pci reported the same values when the guest issued CLP instructions. However, the passed through functions may use different values for things such as the supported DMA range. If the host kernel supplies the respective capabilities for the vfio-pci device, QEMU can now provide the real values in the CLP queries.
  • zPCI is now also able to honour vfio DMA limits, if passed via the vfio-pci device, and can trigger the guest to flush its DMA mappings when needed.
  • The s390-ccw bios now tries harder to find a bootable device, if the first device is not suitable. This brings s390x booting a bit closer to what other architectures do.
  • And the usual fixes and cleanups.

by Cornelia Huck at November 24, 2020 04:34 PM

November 17, 2020

KVM on Z

RHEL 8.3 Released

RHEL 8.3 is out, see the official announcement.
The unquestionable highlight from a KVM on Z perspective is certainly the addition of the Secure Execution functionality. Also, ECKD DASD can be transparently passed through to KVM guests, allowing full exploitation of all DASD features, including raw track access and IPL.

For further details on the changes, see the release notes, and the respective blog entry in the Linux on Z blog here.

by Stefan Raspl at November 17, 2020 09:31 AM

November 04, 2020

QEMU project

Using virtio-fs on a unikernel

This article provides an overview of virtio-fs, a novel way for sharing the host file system with guests and OSv, a specialized, lightweight operating system (unikernel) for the cloud, as well as how these two fit together.


Virtio-fs is a new host-guest shared filesystem, purpose-built for local file system semantics and performance. To that end, it takes full advantage of the host’s and the guest’s colocation on the same physical machine, unlike network-based efforts, like virtio-9p.

As the name suggests, virtio-fs builds on virtio for providing an efficient transport: it is included in the (currently draft, to become v1.2) virtio specification as a new device. The protocol used by the device is a slightly extended version of FUSE, providing a solid foundation for all file system operations native on Linux. Implementation-wise, on the QEMU side, it takes the approach of splitting between the guest interface (handled by QEMU) and the host file system interface (the device “backend”). The latter is handled by virtiofsd (“virtio-fs daemon”), running as a separate process, utilizing the vhost-user protocol to communicate with QEMU.

One prominent performance feature of virtio-fs is the DAX (Direct Access) window. It’s a shared memory window between the host and the guest, exposed as device memory (a PCI BAR) to the latter. Upon request, the host (QEMU) maps file contents to the window for the guest to access directly. This yields performance gains by taking VMEXITs out of the read/write data path and bypassing the guest page cache on Linux, while not counting against the VM’s memory (since it’s just device memory, managed on the host).

virtio-fs DAX architecture

Virtio-fs is under active development, with its community focusing on a pair of implementations: the device in QEMU and the device driver in Linux. Both components are already available upstream in their initial iterations, while upstreaming continues, e.g. with DAX window support.


OSv is a unikernel (framework). The two defining characteristics of a unikernel are:

  • Application-specialized: a unikernel is an executable machine image, consisting of an application and supporting code (drivers, memory management, runtime etc.) linked together, running in a single address space (typically in guest “kernel mode”).
  • Library OS: each unikernel only contains the functionality mandated by its application in terms of non-application code, i.e. no unused drivers, or even whole subsystems (e.g. networking, if the application doesn’t use the network).

OSv in particular strives for binary compatibility with Linux, using a dynamic linker. This means that applications built for Linux should run as OSv unikernels without requiring modifications or even rebuilding, at least most of the time. Of course, not the whole Linux ABI is supported, with system calls like fork() and relatives missing by design in all unikernels, which lack the notion of a process. Despite this limitation, OSv is quite full featured, with full SMP support, virtual memory, a virtual file system (and many filesystem implementations, including ZFS) as well as a mature networking stack, based on the FreeBSD sources.

At this point, one is sure to wonder “Why bother with unikernels?”. The problem they were originally introduced to solve is the bloated software stack in modern cloud computing. Running general-purpose operating systems as guests, typically for a single application/service, on top of a hypervisor which already takes care of isolation and provides a standard device model means duplication, as well as loss of efficiency. This is where unikernels come in, trying to be just enough to support a single application and as light-weight as possible, based on the assumption that they are executing inside a VM. Below is an illustration of the comparison between general-purpose OS, unikernels and containers (as another approach to the same problem, for completeness).

Unikernels vs GPOS vs containers

OSv, meet virtio-fs

As is apparent e.g. from the container world, it is very common for applications running in isolated environments (such as containers, or unikernels even more so) to require host file system access. Whereas containers sharing the host kernel thus have an obvious, controlled path to the host file system, with unikernels this has been more complex: all solutions were somewhat heavyweight, requiring a network link or indirection through network protocols. Virtio-fs then provided a significantly more attractive route: straight-forward mapping of fs operations (via FUSE), reusing the existing virtio transport and decent performance without high memory overhead.

The OSv community quickly identified the opportunity and came up with a read-only implementation on its side, when executing under QEMU. This emphasized being lightweight complexity-wise, while catering to many of its applications’ requirements (they are stateless, think e.g. serverless). Notably, it includes support for the DAX window (even before that’s merged in upstream QEMU), providing excellent performance, directly rivalling that of its local (non-shared) counterparts such as ZFS and ROFS (an OSv-specific read-only file system).

One central point is OSv’s support for booting from virtio-fs: this enables deploying a modified version or a whole new application without rebuilding the image, just by adjusting its root file system contents on the host. Last, owing to the DAX window practically providing low-overhead access to the host’s page cache, scalability is also expected to excel, with it being a common concern due to the potentially high density of unikernels per host.

For example, to build the cli OSv image, bootable from virtio-fs, using the core OSv build system:

scripts/build fs=virtiofs export=all image=cli

This results in a minimal image (just the initramfs), while the root fs contents are placed in a directory on the host (build/export here, by default).

Running the above image is just a step away (you may want to use the virtio-fs development version of QEMU, e.g. for DAX window support):

scripts/ --virtio-fs-tag=myfs --virtio-fs-dir=$(pwd)/build/export

This orchestrates running both virtiofsd and QEMU, using the contents of build/export as the root file system. Any changes to this directory, directly from the host will be visible in the guest without re-running the previous build step.


OSv has gained a prominent new feature, powered by virtio-fs and its QEMU implementation. This allows efficient, lightweight and performant access to the host’s file system, thanks to the native virtio transport, usage of the FUSE protocol and the DAX window architecture. In turn, it enables use cases like rapid unikernel reconfiguration.

by Fotis Xenakis at November 04, 2020 12:00 AM

October 30, 2020

KVM on Z

KVM on IBM Z at Virtual SHARE

Don't miss our session dedicated to Secure Execution at this year's virtual SHARE:
Protecting Data In-use With Secure Execution, presented by Reinhard Bündgen, 3:50 PM - 4:35 PM EST on Tuesday, August 4.

by Stefan Raspl at October 30, 2020 02:15 PM

October 29, 2020

KVM on Z

2020 Linux on IBM Z and LinuxONE Client Workshop, November 9-14

Get the latest news about the Linux exploitation and advantages of the IBM Z and LinuxONE platform in this technical 5 day workshop. Focusing also on new solutions and capabilities, such as Hybrid Cloud, Red Hat OpenShift Container Platform, Security, Performance, Networking and Virtualization presented by our development experts. We will start with introduction sessions on the first day, continue with 3 days of technical deep dive topics, and resume with the latest client experiences, and a panel discussion on the last day. You will also get a chance to interact directly with IBM developers and solution experts during the workshop.

Agenda highlights:

  • Introduction day
    • Linux on IBM Z and LinuxONE introduction and client stories
    • Virtualization options with Linux on IBM Z and LinuxONE
    • Red Hat OpenShift Container Platform in the Hybrid Cloud Strategy for IBM Z & LinuxONE
    • Cloud Paks for IBM Z overview
  • Deep dive sessions
    • What's new Linux on IBM Z and LinuxONE
    • z/VM Platform update
    • IBM Secure Execution for Linux - Introduction and Overview
    • Hyper Protect Virtual Server onPrem - Differences to Cloud
    • Red Hat OpenShift on IBM Z - Performance Experiences, Hints and Tips & Networking
    • Securing the Workloads on Red Hat OpenShift on IBM Z
    • IBM z15 Hardware Compression
    • Preparing for Multifactor Authentication for z/VM
    • Best Practices of Installing Red Hat OpenShift on IBM Z
    • Competitive application performance on Red Hat OpenShift
    • Cloud Paks for Data on IBM Z and LinuxONE
    • KVM Network Performance - Best Practices and Tuning Recommendations
  • Experiences and open discussion
    • Client experiences and lessons learned
    • Panel discussion for all

Date: November 9 - 13, 2020
Time schedule: 8:30 AM - 12:00 PM EDT / 2:30 - 6:00 PM CEST daily

Registration is open here till November 5.

by Stefan Raspl at October 29, 2020 09:41 AM

October 26, 2020

KVM on Z

Ubuntu 20.10 released

Ubuntu Server 20.10 is out!
It ships

See the release notes here, and the blog entry at Canonical with Z-specific highlights here.

by Stefan Raspl at October 26, 2020 11:13 AM

October 18, 2020

Gerd Hoffmann

Improving microvm support in qemu and seabios.

In version 4.2 the microvm machine type was added to qemu. The initial commit describes it this way:

It's a minimalist machine type without PCI nor ACPI support, designed for short-lived guests. microvm also establishes a baseline for benchmarking and optimizing both QEMU and guest operating systems, since it is optimized for both boot time and footprint.

The initial code uses the minimal qboot firmware to initialize the guest, to load a linux kernel and boot it. For network/storage/etc devices virtio-mmio is used. The configuration is passed to the linux kernel on the command line so the guest is able to find the devices.

That works for direct kernel boot, using qemu -kernel vmlinuz, because qemu can easily patch the kernel command line then. But what if you want to boot, for example, the Fedora Cloud image, using the Fedora kernel stored within the image?

A better plan for device discovery

When not using direct kernel boot patching the kernel command line for device discovery isn't going to fly, so we need something else. There are two established standard ways to do that in modern systems. One is device trees. The other is ACPI. Both have support for virtio-mmio.

A device tree entry for virtio-mmio looks like this:

virtio_block@3000 {
	compatible = "virtio,mmio";
	reg = <0x3000 0x100>;
	interrupts = <41>;
};

And this is the ACPI DSDT version:

Device (VR06)
{
    Name (_HID, "LNRO0005")  // _HID: Hardware ID
    Name (_UID, 0x06)  // _UID: Unique ID
    Name (_CCA, One)  // _CCA: Cache Coherency Attribute
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
        Memory32Fixed (ReadWrite,
            0xFEB00C00,         // Address Base
            0x00000200,         // Address Length
            )
        Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
    })
}

Both carry essentially the same information: what kind of device it is and which resources (registers & interrupt) it uses.

On the arm platform both are established, with device trees being common for small board computers like the raspberry pi and ACPI being used in the arm server space. On the x86 platform we don't have much of a choice though. There are some niche attempts to establish device trees, the google android goldfish platform for example. But for widespread support there is no way around using ACPI.

The nice thing about arm server using ACPI too is that this paved the way for us. The linux kernel supports both device tree and ACPI for the discovery of virtio-mmio devices:

root@fedora-bios ~/acpi# modinfo virtio-mmio | grep alias
alias:          of:N*T*Cvirtio,mmioC*
alias:          of:N*T*Cvirtio,mmio
alias:          acpi*:LNRO0005:*

So linux kernel support is a solved problem already. Yay!
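
As a rough illustration of what those alias lines mean: module aliases are shell-style glob patterns that get compared against the modalias string a device reports. The sketch below mimics that comparison with Python's fnmatch module. The alias patterns are the ones from the modinfo output above; the modalias strings in the comments and the function name are hypothetical examples, not captured from a real guest.

```python
from fnmatch import fnmatchcase

# Alias patterns as printed by `modinfo virtio-mmio`.
VIRTIO_MMIO_ALIASES = [
    "of:N*T*Cvirtio,mmioC*",
    "of:N*T*Cvirtio,mmio",
    "acpi*:LNRO0005:*",
]


def driver_matches(modalias: str) -> bool:
    """Return True if any virtio-mmio alias pattern matches the modalias."""
    return any(fnmatchcase(modalias, pattern) for pattern in VIRTIO_MMIO_ALIASES)


# Hypothetical modalias strings:
#   "acpi:LNRO0005:"                        ACPI device with _HID LNRO0005
#   "of:Nvirtio_blockT(null)Cvirtio,mmio"   device tree node, compatible "virtio,mmio"
```

Either discovery path ends up matching the same driver, which is why no kernel changes were needed.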

virtio-mmio support for seabios

If we want to load a kernel from the disk image the firmware must be able to find and read the disk. We already have virtio-pci support for blk and scsi in seabios. The differences between virtio-pci and virtio-mmio transports are not that big. Also some infrastructure for different transport modes was already there to deal with legacy vs. modern virtio-pci. So adding virtio-mmio support to the drivers wasn't much of a problem.

But of course seabios also has the problem that it must discover the devices before it can initialize the driver. Various approaches to find virtio-mmio devices using the available information sources were tried. All of them had one non-working corner case or another, except using ACPI. So seabios ended up getting a simple DSDT parser for device discovery.

While at it, some other small fixes were added to seabios too, to make it work better with microvm. The hard dependency on the RTC CMOS has been removed for example, so the latest seabios works fine with qemu -M microvm,rtc=off.

This ships with seabios version 1.14.

While speaking about seabios: When using a serial console I'd strongly recommend running with qemu -M microvm,graphics=off. That will enable serial console support in seabios. This is one of the tweaks done by the qemu -nographic shortcut. The machine option works with the pc and q35 machine types too.

ACPI cleanups in qemu

Hooking up ACPI support for microvm on the qemu side turned out to be surprisingly difficult due to some historical baggage.

Years ago qemu used to have a static ACPI DSDT table. All ISA devices (serial & parallel ports, floppy, ...) were declared there, but they might not actually be present depending on qemu configuration. The LPC/ISA bridge has some bits in pci config space saying whether a device is actually present or not (qemu emulation follows physical hardware behavior here). So the devices had a _STA method looking up those bits and returning the device status. The guest had to run the method using its AML interpreter to figure out whether the declared device is actually there.

// this is the qemu q35 ISA bridge at 00:1f.0
Device (ISA)
{
    Name (_ADR, 0x001F0000)  // _ADR: Address
    // ... snip ...
    OperationRegion (LPCE, PCI_Config, 0x82, 0x02)
    Field (LPCE, AnyAcc, NoLock, Preserve)
    {
        CAEN,   1,  // serial port #1 enable bit
        CBEN,   1,  // serial port #2 enable bit
        LPEN,   1   // parallel port enable bit
    }
}

// serial port #1
Device (COM1)
{
    Name (_HID, EisaId ("PNP0501") /* 16550A-compatible COM Serial Port */)
    Name (_UID, One)  // _UID: Unique ID
    Method (_STA, 0, NotSerialized)  // _STA: Status
    {
        Local0 = CAEN /* \_SB_.PCI0.ISA_.CAEN */
        If ((Local0 == Zero))
        {
            // serial port #1 is disabled
            Return (Zero)
        }
        Else
        {
            // serial port #1 is enabled
            Return (0x0F)
        }
    }
    // ... snip ...
}

The microvm machine type simply has no PCI support, so that approach isn't going to fly. Also these days all ACPI tables are dynamically generated anyway, so there is no reason to have the guest's AML interpreter go dig into pci config space. Instead we can handle that in qemu when generating the DSDT table. Disabled devices are simply not listed. For enabled devices this is enough:

// serial port #1
Device (COM1)
{
    Name (_HID, EisaId ("PNP0501") /* 16550A-compatible COM Serial Port */)
    Name (_UID, One)  // _UID: Unique ID
    Name (_STA, 0x0F)  // _STA: Status
    // ... snip ...
}

So I've ended up reorganizing and simplifying the code which creates the DSDT entries for ISA devices. This landed in qemu version 5.1.
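
To make the difference between the two schemes concrete, here is a small Python model of both. It is only a sketch of the idea, not qemu's table generator: the bit positions follow the LPCE field layout in the ASL snippet above, while the COM2 and LPT1 device names and both function names are assumptions made for the example.

```python
from typing import List

# Bit positions follow the LPCE field: the first field entry is bit 0.
CAEN = 1 << 0  # serial port #1 enable bit
CBEN = 1 << 1  # serial port #2 enable bit
LPEN = 1 << 2  # parallel port enable bit

# COM2 and LPT1 are assumed names; only COM1 appears in the snippet above.
ISA_DEVICES = [("COM1", CAEN), ("COM2", CBEN), ("LPT1", LPEN)]


def sta_via_aml(enable_bits: int, device_bit: int) -> int:
    """Old scheme: the guest runs _STA, which reads the enable bit at runtime."""
    return 0x0F if enable_bits & device_bit else 0x00


def devices_to_declare(enable_bits: int) -> List[str]:
    """New scheme: the table generator lists only the enabled devices."""
    return [name for name, bit in ISA_DEVICES if enable_bits & bit]
```

With the second scheme the guest never has to touch pci config space, because the decision was already made when the table was built.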

ACPI support for microvm

Now with the roadblocks out of the way it was finally possible to add acpi support to microvm. There is little reason to worry about backward compatibility with historic x86 platforms here, old guests wouldn't be able to handle virtio-mmio anyway. So this takes a rather modern approach and looks more like an arm virt machine than an x86 q35 machine. Like arm it uses the generic event device for power management support.

ACPI support for microvm is switchable, similar to the other machine types, using the acpi=on|off machine option. The -no-acpi switch works too. By default ACPI support is enabled.

With ACPI enabled qemu uses virtio-mmio enabled seabios as firmware and doesn't bother patching the linux kernel command line for device discovery.

With ACPI disabled qemu continues to use qboot as firmware like older qemu versions do. Likewise it continues to add virtio-mmio devices to the linux kernel command line.

This will be available in qemu version 5.2. It is already merged in the master branch.

ACPI advantages

  • Number one is device discovery obviously, this is why we started all this in the first place. seabios and linux kernel find virtio-mmio devices automatically. You can boot Fedora cloud images in microvm without needing any tricks. Probably other distros too, even though I didn't try that. Compiling the linux kernel with CONFIG_VIRTIO_MMIO=y (or =m & adding the module to initramfs) is pretty much the only requirement for this to work.

  • Number two is device discovery too. ACPI will also tell the kernel which devices are not there. So with acpi=on the kernel simply skips the PS/2 probe in case the DSDT doesn't list a keyboard controller. With acpi=off the kernel assumes legacy hardware, goes into probe-harder mode and needs one second to figure out that there really is no keyboard controller:

    [    0.414840] i8042: PNP: No PS/2 controller found.
    [    0.415287] i8042: Probing ports directly.
    [    1.454723] i8042: No controller found

    We have a similar effect with the real time clock. With acpi=off the kernel goes and registers an IRQ line for the RTC even in case the device isn't there.

  • Number three is (basic) power management. ACPI provides a virtual power button, so the guest will honor shutdown requests sent that way. ACPI also provides S5 (aka poweroff) support, so qemu gets a notification from the guest when the shutdown is done and can exit.

  • Number four is better IRQ routing. The linux kernel finds the IO-APIC declared in the APIC table and uses it for IRQ routing. It is possible to use lines 16-23 for the virtio-mmio devices, avoiding IRQ sharing. Also we can refine the configuration using IRQ flags in the DSDT table.

    With acpi=off this does not work reliably. I've seen the kernel ignore the IO-APIC in the past. It doesn't always happen though. It is not clear which factors play a role here, I didn't investigate that in detail. Maybe newer kernel versions are a bit more clever here and find the IO-APIC even without ACPI.

Bottom line: ACPI helps move the microvm machine type forward towards a world without legacy x86 platform devices.

But isn't ACPI bloated and slow?

Well, on my microvm test guest all ACPI tables combined are less than 1k in size:

root@fedora-bios /sys/firmware/acpi/tables# find -type f | xargs ls -l
-r--------. 1 root root  78 Oct  2 09:36 ./APIC
-r--------. 1 root root 482 Oct  2 09:36 ./DSDT
-r--------. 1 root root 268 Oct  2 09:36 ./FACP

I wouldn't call that bloated. This is a rather small virtual machine, with larger configurations (more CPUs, more devices) the tables will grow a bit of course.
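
The arithmetic behind "less than 1k" is quick to check. The sizes below are copied from the listing above; the helper that walks /sys/firmware/acpi/tables is an untested-on-your-machine sketch with an assumed name, included in case you want to compare against your own guest.

```python
from pathlib import Path
from typing import Dict

# Sizes in bytes, copied from the directory listing above.
LISTED_SIZES = {"APIC": 78, "DSDT": 482, "FACP": 268}


def total_bytes(sizes: Dict[str, int]) -> int:
    """Sum the sizes of all ACPI tables."""
    return sum(sizes.values())


def acpi_table_sizes(root: str = "/sys/firmware/acpi/tables") -> Dict[str, int]:
    """Collect table sizes from sysfs, recursing like `find -type f`."""
    base = Path(root)
    return {str(p.relative_to(base)): p.stat().st_size
            for p in base.rglob("*") if p.is_file()}
```

For the listing above, total_bytes(LISTED_SIZES) is 78 + 482 + 268 = 828 bytes.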

When testing boot times I figured it is pretty hard to find any differences due to ACPI initialization. The noise (differences when doing 2-3 runs with identical configuration) is larger than the acpi=on/off difference. Seems to be at most a handful of milliseconds.

When trying that yourself take care to boot the kernel with 'quiet'. This is a good idea anyway if you want to boot as fast as possible. The kernel prints more boot information with acpi=on, so slow console logging can skew your numbers if you let the kernel print out everything.

Runtime differences should be zero. There is only one AML method in the DSDT table. It toggles the power button when a notification comes in from the generic event device. It runs only on generic event device interrupts.

USB support for microvm

qemu just got a sysbus (non-pci) version of the xhci host adapter. It is needed for some arm boards. The nice thing is that, now that we have ACPI, we can just wire that up in microvm too, add it in the DSDT table, then linux will find and use it:

Device (XHCI)
{
    Name (_HID, EisaId ("PNP0D10") /* XHCI USB Controller */)
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
        Memory32Fixed (ReadWrite,
            0xFE900000,         // Address Base
            0x00004000,         // Address Length
            )
        Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
        {
            // ... snip ...
        }
    })
}

USB support will be disabled by default, it can be enabled using the usual machine option: qemu -M microvm,usb=on.

Patches for qemu are in flight, should land in version 5.2
Patches for seabios are merged, will be available in version 1.15

PCIe support for microvm

There is one more arm platform thing we can reuse in microvm: The PCI Express host bridge. Again the same approach: Wire everything up, declare it in the ACPI DSDT. Linux kernel finds and uses it.

Not adding an asl snippet this time. The PCIe host bridge is a complex device so the description is a bit larger. It has GSI subdevices, IRQ routing information for each PCI slot, mmconfig configuration etc. Also shows in the DSDT size (even though that is still less than half the size q35 has):

root@fedora-bios /sys/firmware/acpi/tables# ll DSDT
-r--------. 1 root root 3130 Oct  2 10:20 DSDT

PCIe support will be disabled by default, it can be enabled using the new pcie machine option: qemu -M microvm,pcie=on.

This will be available in qemu version 5.2. It is already merged in the master branch.

Future work

My TODO list for qemu isn't very long:

  • Add second IO-APIC, allowing more IRQ lines for more virtio-mmio devices. Experimental patches exist.

  • IOMMU support, using virtio-iommu. Depends on ACPI spec update for virtio-iommu being finalized and support being merged into qemu and linux kernel. The actual wireup for microvm should be easy once all this is done.

Outside qemu there are a few more items:

  • Investigate microvm PCIe support in seabios. Experimental patches exist. I'm not sure yet whether seabios should care though.

    The linux kernel can initialize the PCIe bus just fine on its own, whereas proper seabios support has its challenges. When running in real mode the mmconfig space can not be reached, and legacy pci config space access via port 0xcf8 is not available, which breaks pcibios support. This is probably fixable with a temporary switch to 32bit mode, at the cost of triggering a bunch of extra vmexits. Besides that, seabios will spend more time at boot, initializing the pci devices.

    So, is it worth the effort? The benefit would be that seabios could support booting from pci devices on microvm then.

  • Maybe add microvm support to edk2/ovmf.

    This does not look that easy at a quick glance. ArmVirtPkg depends on device trees for virtio-mmio detection, so while we can re-use the virtio-mmio drivers we can not re-use device discovery code. Unless we maybe have qemu provide both ACPI tables and a device tree, even if ovmf happens to be the only device tree user.

    It also is not clear what other dragons (dependencies on classic x86 platform devices) are lurking in the ovmf codebase.

  • Support the new microvm features (possibly adding microvm support first) in other projects.

    Candidate number one is of course libvirt because it is the foundation for many other projects. Besides that, microvm support is probably mostly useful for cloud/container-style workloads, i.e. kata and kubevirt.

by Gerd Hoffmann at October 18, 2020 10:00 PM

October 09, 2020

Stefan Hajnoczi

Requirements for out-of-process device emulation

Over the past months I have participated in discussions about out-of-process device emulation. This post describes the requirements that have become apparent. I hope this will be a useful guide to understanding the big picture about out-of-process device emulation.

What is out-of-process device emulation?

Device emulation is traditionally implemented in the program that executes guest code. This approach is natural because accesses to device registers are trapped as part of the CPU run loop that sits at the core of an emulator or virtual machine monitor (VMM).

In some use cases it is advantageous to perform device emulation in separate processes. For example, software-defined network switches can minimize data copies by emulating network cards directly in the switch process. Out-of-process device emulation also enables privilege separation and tighter sandboxing for security.

Why are these requirements important?

When emulated devices are implemented in the VMM they use common VMM APIs. Adding new devices is relatively easy because the APIs are already there and the developer can focus on the device specifics. Out-of-process device emulation potentially leaves developers without APIs since the device emulation program is a separate program that literally starts from main(). Developers want to focus on implementing their specific device, not on solving general problems related to out-of-process device emulation infrastructure.

It is not only a lot of work to implement an out-of-process device completely from scratch, but there is also a risk of developing the wrong solution because some subtleties of device emulation are not obvious at first glance.

I hope sharing these requirements will help in the creation of common infrastructure so it's easy to implement high-quality out-of-process devices.

Not all use cases have the full set of requirements. Therefore it's best if requirements are addressed in separate, reusable libraries so that device implementors can pick the ones that are relevant to them.

Device emulation

Device resources

Devices provide resources that drivers interact with such as hardware registers, memory, or interrupts. The fundamental requirement of out-of-process device emulation is exposing device resources.

The following types of device resources are needed:

Synchronous MMIO/PIO accesses

The most basic device emulation operation is the hardware register access. This is a memory-mapped I/O (MMIO) or programmed I/O (PIO) access to the device. A read loads a value from a device register. A write stores a value to a device register. These operations are synchronous because the vCPU is paused until completion.
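
As a minimal sketch of what such an access handler has to do, the following Python class models a small bank of little-endian registers. The class and method names are hypothetical; a real device would attach side effects to individual registers instead of treating them as plain storage.

```python
class RegisterBank:
    """A block of device registers accessed by offset and access width."""

    VALID_WIDTHS = (1, 2, 4, 8)

    def __init__(self, size: int) -> None:
        self._regs = bytearray(size)

    def _check(self, offset: int, length: int) -> None:
        if length not in self.VALID_WIDTHS:
            raise ValueError(f"unsupported access width {length}")
        if offset < 0 or offset + length > len(self._regs):
            raise ValueError(f"access at {offset:#x} is out of range")

    def read(self, offset: int, length: int) -> int:
        """Load a value; the vCPU stays paused until this returns."""
        self._check(offset, length)
        return int.from_bytes(self._regs[offset:offset + length], "little")

    def write(self, offset: int, length: int, value: int) -> None:
        """Store a value, truncated to the access width."""
        self._check(offset, length)
        value &= (1 << (8 * length)) - 1
        self._regs[offset:offset + length] = value.to_bytes(length, "little")
```

In an out-of-process design each read() and write() becomes a round trip between the VMM and the device emulation process, which is why the cost of synchronous accesses matters.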

Asynchronous doorbells

Devices often have doorbell registers, allowing the driver to inform the device that new requests are ready for processing. The vCPU does not need to wait since the access is a posted write.

The kvm.ko ioeventfd mechanism can be used to implement asynchronous doorbells.
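
The posted-write pattern can be sketched without KVM. In the Python model below an in-process queue stands in for the eventfd, so ring() returns immediately while a worker thread picks up the notification later. All names are hypothetical, and one simplification should be kept in mind: a real eventfd is a counter that coalesces multiple rings, whereas this stand-in keeps every ring as a separate item.

```python
import queue
import threading
from typing import List

STOP = -1  # sentinel telling the worker to exit


class Doorbell:
    """In-process stand-in for an eventfd-style doorbell."""

    def __init__(self) -> None:
        self._kicks: "queue.SimpleQueue[int]" = queue.SimpleQueue()

    def ring(self, queue_index: int) -> None:
        """Called on behalf of the vCPU; returns without waiting for the device."""
        self._kicks.put(queue_index)

    def wait(self) -> int:
        """Called by the device thread; blocks until the doorbell is rung."""
        return self._kicks.get()


def device_worker(doorbell: Doorbell, processed: List[int]) -> None:
    """Handle doorbell rings until the STOP sentinel arrives."""
    while True:
        index = doorbell.wait()
        if index == STOP:
            return
        processed.append(index)
```

The point of the design is that the vCPU never blocks on the device: the write is recorded and the guest continues running.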

Shared device memory

Devices may have memory-like regions that the CPU can access (such as PCI Memory BARs). The device emulation process therefore needs to share a region of its memory space with the VMM so the guest can access it. This mechanism also allows device emulation to busy wait (poll) instead of using synchronous MMIO/PIO accesses or asynchronous doorbells for notifications.
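
A minimal sketch of the sharing mechanism, assuming a Unix-like host: both sides map the same file, so a store through one mapping is visible through the other. Here both mappings live in one Python process for simplicity; in a real setup the file descriptor would be passed to the other process, and the function name is made up for the example.

```python
import mmap
import tempfile
from typing import IO, Tuple


def map_shared_region(size: int) -> Tuple[IO[bytes], mmap.mmap, mmap.mmap]:
    """Create a backing file and map it twice with shared semantics."""
    backing = tempfile.TemporaryFile()
    backing.truncate(size)
    # mmap.mmap() defaults to a shared, read-write mapping on Unix.
    vmm_view = mmap.mmap(backing.fileno(), size)
    device_view = mmap.mmap(backing.fileno(), size)
    return backing, vmm_view, device_view
```

A device that polls its view of the region sees guest writes without any MMIO trap or doorbell being involved.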

Direct Memory Access (DMA)

Devices often require read and write access to a memory address space belonging to the CPU. This allows network cards to transmit packet payloads that are located in guest RAM, for example.

Early out-of-process device emulation interfaces simply shared guest RAM. This allowed DMA to any guest physical memory address. More advanced IOMMU and address space identifier mechanisms are now becoming ubiquitous. Therefore, new out-of-process device emulation interfaces should incorporate IOMMU functionality.

The key requirement for IOMMU mechanisms is allowing the VMM to grant access to a region of memory so the device emulation process can read from and/or write to it.
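
One way to picture that requirement is a table of granted regions that every DMA access is checked against. The Python sketch below is a toy model under that assumption; the class names, the flat bytearray standing in for guest RAM, and the linear lookup are all illustrative choices rather than any existing interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mapping:
    iova: int        # address as seen by the device
    length: int      # size of the granted region in bytes
    ram_offset: int  # where the region starts in the backing memory
    readable: bool
    writable: bool


class DmaWindow:
    """Checks every DMA access against the regions granted by the VMM."""

    def __init__(self, guest_ram: bytearray) -> None:
        self._ram = guest_ram
        self._mappings: List[Mapping] = []

    def grant(self, mapping: Mapping) -> None:
        if mapping.ram_offset + mapping.length > len(self._ram):
            raise ValueError("mapping extends past the backing memory")
        self._mappings.append(mapping)

    def revoke(self, iova: int) -> None:
        self._mappings = [m for m in self._mappings if m.iova != iova]

    def _translate(self, iova: int, length: int, write: bool) -> int:
        if length <= 0:
            raise ValueError("length must be positive")
        for m in self._mappings:
            if m.iova <= iova and iova + length <= m.iova + m.length:
                if not (m.writable if write else m.readable):
                    raise PermissionError("access type not granted")
                return m.ram_offset + (iova - m.iova)
        raise PermissionError("address range not mapped")

    def dma_read(self, iova: int, length: int) -> bytes:
        start = self._translate(iova, length, write=False)
        return bytes(self._ram[start:start + length])

    def dma_write(self, iova: int, data: bytes) -> None:
        start = self._translate(iova, len(data), write=True)
        self._ram[start:start + len(data)] = data
```

Compared with sharing all of guest RAM, the device can only reach what was explicitly granted, and access stops as soon as a mapping is revoked.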


Interrupts

Devices notify the CPU using interrupts. An interrupt is simply a message sent by the device emulation process to the VMM. Interrupt configuration is flexible on modern devices, meaning the driver may be able to select the number of interrupts and a mapping (using one interrupt with multiple event sources). This can be implemented using the Linux eventfd mechanism or via in-band device emulation protocol messages, for example.

Extensibility for new bus types

It should be possible to support multiple bus types. vhost-user only supports vhost devices. VFIO is more extensible but currently focussed on PCI devices. It is likely that QEMU SysBus devices will be desirable for implementing ad-hoc out-of-process devices (especially for System-on-Chip target platforms).

Bus-level APIs, not protocol bindings

Developers should not need to learn the out-of-process device emulation protocol (vfio-user, etc). APIs should focus on bus-level concepts such as defining VIRTIO or PCI devices rather than protocol bindings for dealing with protocol messages, file descriptor passing, and shared memory.

In other words, developers should be thinking in terms of the problem domain, not worrying about how out-of-process device emulation is implemented. The protocol should be hidden behind bus-level APIs.

Multi-threading support from the beginning

Threading issues arise often in device emulation because asynchronous requests or multi-queue devices can be implemented using threads. Therefore it is necessary to clearly document what threading models are supported and how device lifecycle operations like reset interact with in-flight requests.

Live migration, live upgrade, and crash recovery

There are several related issues around device state and restarting the device emulation program without disrupting the guest.

Live migration

Live migration transfers the state of a device from one device emulation process to another (typically running on another host). This requires the following functionality:

Quiescing the device

Some devices can be live migrated at any point in time without any preparation, while others must be put into a quiescent state to avoid issues. An example is a storage controller that has a write request in flight. It is not safe to live migrate until the write request has completed or been canceled. Failure to wait might result in data corruption if the write takes effect after the destination has resumed execution.

Therefore it is necessary to quiesce a device. After this point there is no further device activity and no guest-visible changes will be made by the device.
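
A common way to implement this is to count in-flight requests, refuse new ones once quiescing has started, and wait for the count to reach zero. The Python sketch below shows that pattern with a condition variable; the class and method names are hypothetical and request cancellation is left out.

```python
import threading
from typing import Optional


class RequestTracker:
    """Tracks in-flight requests so the device can be quiesced."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._quiescing = False

    def begin(self) -> bool:
        """Register a new request; returns False once quiescing has started."""
        with self._cond:
            if self._quiescing:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one request as completed."""
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def quiesce(self, timeout: Optional[float] = None) -> bool:
        """Stop accepting requests and wait until none are in flight."""
        with self._cond:
            self._quiescing = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)

    def resume(self) -> None:
        """Accept requests again, for example after a failed migration."""
        with self._cond:
            self._quiescing = False
```

Only after quiesce() has returned True is it safe to save the device state, because nothing can change guest-visible state behind the migration's back.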

Saving/loading device state

It must be possible to save and load device state. Device state includes the contents of hardware registers as well as device-internal state necessary for resuming operation.

It is typically necessary to determine whether the device emulation processes on the migration source and destination are compatible before attempting migration. This avoids migration failure when the destination tries to load the device state and discovers it doesn't support it. It may be desirable to support loading device state that was generated by a different implementation of the same device type (for example, two virtio-net implementations).
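
A sketch of what a save/load pair with a compatibility check might look like is shown below. The JSON encoding, the field names and the version numbering are assumptions chosen to keep the example short; real device state formats are usually binary and more carefully specified.

```python
import json
from typing import Dict, Iterable


def save_state(device_type: str, version: int, registers: Dict[str, int]) -> bytes:
    """Serialize device state together with the metadata needed to validate it."""
    blob = {"device_type": device_type, "version": version, "registers": registers}
    return json.dumps(blob, sort_keys=True).encode()


def load_state(blob: bytes, device_type: str,
               supported_versions: Iterable[int]) -> Dict[str, int]:
    """Reject state from a different device type or an unsupported version."""
    state = json.loads(blob)
    if state.get("device_type") != device_type:
        raise ValueError(f"state is for {state.get('device_type')!r}, not {device_type!r}")
    if state.get("version") not in set(supported_versions):
        raise ValueError(f"unsupported state version {state.get('version')!r}")
    return state["registers"]
```

Because the metadata is independent of the implementation, the same check works when source and destination are different programs that emulate the same device type.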

Dirty memory logging

Pre-copy live migration starts with an iterative phase where dirty memory pages are copied from the migration source to the destination host. Devices need to participate in dirty memory logging so that all written pages are transferred to the destination and no pages are "missed".
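
The bookkeeping itself is simple: every device write to guest memory marks the touched pages, and each migration iteration fetches and clears the set. The Python sketch below assumes 4 KiB pages and uses hypothetical names; production code would use a bitmap rather than a set.

```python
from typing import List, Set

PAGE_SIZE = 4096  # assumed page size


class DirtyLog:
    """Records which guest pages a device has written to."""

    def __init__(self) -> None:
        self._dirty: Set[int] = set()

    def mark(self, addr: int, length: int) -> None:
        """Mark every page touched by a write of `length` bytes at `addr`."""
        if length <= 0:
            return
        first = addr // PAGE_SIZE
        last = (addr + length - 1) // PAGE_SIZE
        self._dirty.update(range(first, last + 1))

    def fetch_and_clear(self) -> List[int]:
        """Return the dirty page numbers and start a new logging round."""
        pages = sorted(self._dirty)
        self._dirty.clear()
        return pages
```

Note that a write crossing a page boundary has to mark both pages, which is the kind of detail that is easy to get wrong when every device implements logging from scratch.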

Crash recovery

If the device emulation process crashes it should be possible to restart it and resume device emulation without disrupting the guest (aside from a possible pause during reconnection).

Doing this requires maintaining device state (contents of hardware registers, etc) outside the device emulation process. This way the state remains even if the process crashes and it can be resumed when a new process starts.

Live upgrade

It must be possible to upgrade the device emulation process and the VMM without disrupting the guest. Upgrading the device emulation process is similar to crash recovery in that the process terminates and a new one resumes with the previous state.

Device versioning

The guest-visible aspects of the device must be versioned. In the simplest case the device emulation program would have a --compat-version=N command-line option that controls which version of the device the guest sees. When guest-visible changes are made to the program the version number must be increased.

By giving control of the guest-visible device behavior it is possible to save/load and live migrate reliably. Otherwise loading device state in a newer device emulation program could affect the running guest. Guest drivers typically are not prepared for the device to change underneath them and doing so could result in guest crashes or data corruption.
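
A sketch of how such an option could gate guest-visible behavior is shown below, using Python's argparse. The --compat-version option comes from the text above; the feature names and the versions at which they appear are invented for the example.

```python
import argparse
from typing import List, Sequence

# Hypothetical table: the version in which each guest-visible feature appeared.
FEATURE_INTRODUCED_IN = {
    "indirect-descriptors": 1,
    "multiqueue": 2,
    "discard": 3,
}
LATEST_VERSION = max(FEATURE_INTRODUCED_IN.values())


def guest_visible_features(compat_version: int) -> List[str]:
    """Return the features a guest sees at the given compatibility version."""
    return sorted(name for name, since in FEATURE_INTRODUCED_IN.items()
                  if since <= compat_version)


def parse_args(argv: Sequence[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="example device emulation program")
    parser.add_argument("--compat-version", type=int, default=LATEST_VERSION,
                        choices=range(1, LATEST_VERSION + 1),
                        help="device version presented to the guest")
    return parser.parse_args(argv)
```

A newer program started with --compat-version=2 then presents exactly what the version 2 program did, so state saved by the older program can be loaded without the guest noticing a change.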


The trust model

The VMM must not trust the device emulation program. This is key to implementing privilege separation and the principle of least privilege. If a compromised device emulation program is able to gain control of the VMM then out-of-process device emulation has failed to provide isolation between devices.

The device emulation program must not trust the VMM to the extent that this is possible. For example, it must validate inputs so that the VMM cannot gain control of the device emulation process through memory corruptions or other bugs. This makes it so that even if the VMM has been compromised, access to device resources and associated system calls still requires further compromising the device emulation process.
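
In practice, not trusting the other side mostly means checking every field of every message before acting on it. The Python sketch below validates a register access request; the wire format (one opcode byte, a 64-bit offset, a 32-bit length, then the payload), the region size and all names are assumptions for illustration and do not correspond to an existing protocol.

```python
import struct
from typing import Tuple

REGION_SIZE = 0x1000             # size of the register region, assumed
HEADER = struct.Struct("<BQI")   # opcode, offset, length (hypothetical format)
OP_READ, OP_WRITE = 0, 1


def parse_request(message: bytes) -> Tuple[int, int, int, bytes]:
    """Validate an access request before it touches any device state."""
    if len(message) < HEADER.size:
        raise ValueError("message shorter than header")
    opcode, offset, length = HEADER.unpack_from(message)
    if opcode not in (OP_READ, OP_WRITE):
        raise ValueError(f"unknown opcode {opcode}")
    if length not in (1, 2, 4, 8):
        raise ValueError(f"invalid access width {length}")
    if offset > REGION_SIZE - length:
        raise ValueError("access is outside the register region")
    payload = message[HEADER.size:]
    if opcode == OP_WRITE and len(payload) != length:
        raise ValueError("write payload does not match length")
    if opcode == OP_READ and payload:
        raise ValueError("read request carries unexpected payload")
    return opcode, offset, length, payload
```

The bounds check is written as offset > REGION_SIZE - length rather than offset + length > REGION_SIZE, a habit worth keeping in languages where the addition could overflow.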

Unprivileged operation

The device emulation program should run unprivileged to the extent that this is possible. If special permissions are required to access hardware resources then these resources can sometimes be provided via file descriptor passing by a more privileged parent process.
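
File descriptor passing works over UNIX domain sockets using SCM_RIGHTS ancillary data. The Python sketch below follows the recipe from the standard library documentation for sendmsg() and recvmsg(); the function names and the single-byte message are choices made for this example.

```python
import array
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    """Send one file descriptor as SCM_RIGHTS ancillary data."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    sock.sendmsg([b"F"], ancillary)


def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor sent by send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, kind, data in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_RIGHTS:
            # Ignore a truncated trailing integer, if any.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    if len(fds) != 1:
        raise RuntimeError("expected exactly one file descriptor")
    return fds[0]
```

The privileged parent opens the resource and hands over only the descriptor, so the device emulation program never needs the permission to open it itself.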


Sandboxing

Operating system sandboxing mechanisms can be applied to device emulation processes more effectively than monolithic VMMs. Seccomp can limit the Linux system calls that may be invoked. SELinux can restrict access to system resources.

Sandboxing is a common task that most device emulation programs need. Therefore it is a good candidate for a library or launcher tool that is shared by device emulation programs.


Command-line interface

A common command-line interface should be defined where possible. For example, vhost-user's standard --socket-path=PATH argument makes it easy to launch any vhost-user device backend. Protocol-specific options (e.g. socket path) and device type-specific options (e.g. virtio-net) can be standardized.

Some options are necessarily specific to the device emulation program and therefore cannot be standardized.

The advantage of standard options is that management tools like libvirt can launch the device emulation programs without further user configuration.

RPC interface

It may be necessary to issue commands at runtime. Examples include adjusting throttling limits, enabling/disabling logging, etc. These operations can be performed over an RPC interface.

Various RPC interfaces are used throughout open source virtualization software. Adopting a widely-used RPC protocol and standardizing commands is beneficial because it makes it easy to communicate with the software and management tools can support them relatively easily.
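
To give an idea of the shape such an interface can take, here is a small JSON command dispatcher in Python. The execute/arguments/return/error layout is loosely modeled on QMP-style messages; the command names set-throttle and set-logging and the class itself are hypothetical.

```python
import json
from typing import Any, Callable, Dict


class DeviceControl:
    """Handles runtime commands sent as one JSON object per line."""

    def __init__(self) -> None:
        self.throttle_iops = 0          # 0 means unlimited in this sketch
        self.logging_enabled = False
        self._commands: Dict[str, Callable[..., Dict[str, Any]]] = {
            "set-throttle": self._set_throttle,
            "set-logging": self._set_logging,
        }

    def _set_throttle(self, iops: int) -> Dict[str, Any]:
        if not isinstance(iops, int) or iops < 0:
            raise ValueError("iops must be a non-negative integer")
        self.throttle_iops = iops
        return {}

    def _set_logging(self, enabled: bool) -> Dict[str, Any]:
        if not isinstance(enabled, bool):
            raise ValueError("enabled must be a boolean")
        self.logging_enabled = enabled
        return {}

    def handle(self, line: str) -> str:
        """Run one command and return the JSON reply."""
        try:
            request = json.loads(line)
            handler = self._commands[request["execute"]]
            result = handler(**request.get("arguments", {}))
            return json.dumps({"return": result})
        except (KeyError, TypeError, ValueError) as exc:
            return json.dumps({"error": {"desc": str(exc)}})
```

If device emulation programs agree on the framing and on common command names, a management tool needs only one client implementation for all of them.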


Conclusion

This was largely a brain dump but I hope it is useful food for thought as out-of-process device emulation interfaces are designed and developed. There is a lot more to it than simply implementing a protocol for device register accesses and guest RAM DMA. Developing open source libraries in Rust and C that can be used as needed will ensure that out-of-process devices are high-quality and easy for users to deploy.

by Unknown at October 09, 2020 05:03 PM

Powered by Planet!
Last updated: March 04, 2021 06:10 AM