Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

March 12, 2024

Stefan Hajnoczi

How to access libvirt domains in KubeVirt

KubeVirt makes it possible to run virtual machines on Kubernetes alongside container workloads. Virtual machines are configured using VirtualMachineInstance YAML. But under the hood of KubeVirt lies the same libvirt tooling that is commonly used to run KVM virtual machines on Linux. Accessing libvirt can be convenient for development and troubleshooting.

Note that bypassing KubeVirt must be done carefully. Doing this in production may interfere with running VMs. If a feature is missing from KubeVirt, then please request it.

The following diagram shows how the user's VirtualMachineInstance is turned into a libvirt domain:

Accessing virsh

Libvirt's virsh command-line tool is available inside the virt-launcher Pod that runs a virtual machine. First determine vm1's virt-launcher Pod name by filtering on its label (thanks to Alice Frosi for this trick!):

$ kubectl get pod -l vm.kubevirt.io/name=vm1
NAME                      READY   STATUS    RESTARTS   AGE
virt-launcher-vm1-5gxvg   2/2     Running   0          8m13s

Find the name of the libvirt domain (this is guessable but it doesn't hurt to check):

$ kubectl exec virt-launcher-vm1-5gxvg -- virsh list
 Id   Name          State
-----------------------------
 1    default_vm1   running

Arbitrary virsh commands can be invoked. Here is an example of dumping the libvirt domain XML:

$ kubectl exec virt-launcher-vm1-5gxvg -- virsh dumpxml default_vm1
<domain type='kvm' id='1'>
  <name>default_vm1</name>
...

Viewing libvirt logs and the full QEMU command-line

The libvirt logs are captured by Kubernetes so you can view them with kubectl logs <virt-launcher-pod-name>. If you don't know the virt-launcher pod name, check with kubectl get pod and look for your virtual machine's name.

The full QEMU command-line is part of the libvirt logs, but unescaping the JSON string is inconvenient. Here is another way to get the full QEMU command-line:

$ kubectl exec <virt-launcher-pod-name> -- ps aux | grep qemu

Customizing KubeVirt's libvirt domain XML

KubeVirt has a feature for customizing libvirt domain XML called hook sidecars. After the libvirt XML is generated, it is sent to a user-defined container that processes the XML and returns it back. The libvirt domain is defined using this processed XML. To learn more about how it works, check out the documentation.

Hook sidecars are available when the Sidecar feature gate is enabled in the kubevirt/kubevirt custom resource. Normally only the cluster administrator can modify the kubevirt CR, so be sure to check when trying this feature:

$ kubectl auth can-i update  kubevirt/kubevirt -n kubevirt
yes

Although you can provide a complete container image for the hook sidecar, there is a shortcut if you just want to run a script. A generic hook sidecar image is available that launches a script which can be provided as a ConfigMap. Here is example YAML including a ConfigMap that I've used to test the libvirt IOThread Virtqueue Mapping feature:

---
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration: 
      featureGates:
        - Sidecar
---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: "fedora"
spec:
  storage:
    accessModes:
        - ReadWriteOnce
    resources:
      requests:
        storage: 5Gi
  source:
    http:
      url: "https://download.fedoraproject.org/pub/fedora/linux/releases/38/Cloud/x86_64/images/Fedora-Cloud-Base-38-1.6.x86_64.raw.xz"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: sidecar-script
data:
  my_script.sh: |
    #!/usr/bin/env python3
    import xml.etree.ElementTree as ET
    import os.path
    import sys
    
    NUM_IOTHREADS = 4
    VOLUME_NAME = 'data' # VirtualMachine volume name
    
    def main(xml):
        domain = ET.fromstring(xml)
    
        domain.find('iothreads').text = str(NUM_IOTHREADS)
    
        disk = domain.find(f"./devices/disk/alias[@name='ua-{VOLUME_NAME}']..")
        driver = disk.find('driver')
        del driver.attrib['iothread']
        iothreads = ET.SubElement(driver, 'iothreads')
        for i in range(NUM_IOTHREADS):
            iothread = ET.SubElement(iothreads, 'iothread')
            iothread.set('id', str(i + 1))
    
        ET.dump(domain)
    
    if __name__ == "__main__":
        # Workaround for https://github.com/kubevirt/kubevirt/issues/11276
        if os.path.exists('/tmp/ran-once'):
            main(sys.argv[4])
        else:
            open('/tmp/ran-once', 'wb')
            print(sys.argv[4])
---
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  creationTimestamp: 2018-07-04T15:03:08Z
  generation: 1
  labels:
    kubevirt.io/os: linux
  name: vm1
  annotations:
    hooks.kubevirt.io/hookSidecars: '[{"args": ["--version", "v1alpha3"],
      "image": "kubevirt/sidecar-shim:20240108_99b6c4bdb",
      "configMap": {"name": "sidecar-script",
                    "key": "my_script.sh",
                    "hookPath": "/usr/bin/onDefineDomain"}}]'
spec:
  domain:
    ioThreadsPolicy: auto
    cpu:
      cores: 8
    devices:
      blockMultiQueue: true
      disks:
      - disk:
          bus: virtio
        name: disk0
      - disk:
          bus: virtio
        name: data
    machine:
      type: q35
    resources:
      requests:
        memory: 1024M
  volumes:
  - name: disk0
    persistentVolumeClaim:
      claimName: fedora
  - name: data
    emptyDisk:
      capacity: 8Gi

If you need to go down one level further and customize the QEMU command-line, see my post on passing QEMU command-line options in libvirt domain XML.

More KubeVirt debugging tricks

The official KubeVirt documentation has a Virtualization Debugging section with more tricks for customizing libvirt logging, launching QEMU with strace or gdb, etc. Thanks to Alice Frosi for sharing the link!

Conclusion

It is possible to get libvirt access in KubeVirt for development and testing. This can make troubleshooting easier and it gives you the full range of libvirt domain XML if you want to experiment with features that are not yet exposed by KubeVirt.

by Unknown (noreply@blogger.com) at March 12, 2024 07:55 PM

February 12, 2024

Stefano Garzarella

vDPA: support for block devices in Linux and QEMU

A vDPA device is a type of device that follows the virtio specification for its datapath but has a vendor-specific control path.

vDPA devices can be either physically implemented in hardware or emulated in software.

vDPA overview

A small vDPA parent driver in the host kernel is required only for the control path. The main advantage is the unified software stack for all vDPA devices:

  • vhost interface (vhost-vdpa) for userspace or guest virtio driver, like a VM running in QEMU
  • virtio interface (virtio-vdpa) for bare-metal or containerized applications running in the host
  • management interface (vdpa netlink) for instantiating devices and configuring virtio parameters

Useful Resources

Many blog posts and talks have been published in recent years that can help you better understand vDPA and its use cases. On vdpa-dev.gitlab.io we have collected some of them; I suggest you at least explore the ones listed there.

Block devices

Most of the work in vDPA has been driven by network devices, but in recent years, we have also developed support for block devices.

The main use case is definitely leveraging the hardware to directly emulate the virtio-blk device and support different network storage backends such as Ceph RBD or iSCSI. This is the goal of some SmartNICs or DPUs, which are able to emulate virtio-net devices of course, but also virtio-blk for network storage.

The abstraction provided by vDPA also makes software accelerators possible, similar to existing vhost or vhost-user devices. We discussed this at KVM Forum 2021.

We talked about the fast path and the slow path in that talk. When QEMU needs to handle requests, like supporting live migration or executing I/O throttling, it uses the slow path. During the slow path, the device exposed to the guest is emulated in QEMU. QEMU intercepts the requests and forwards them to the vDPA device by taking advantage of the driver implemented in libblkio. On the other hand, when QEMU doesn’t need to intervene, the fast path comes into play. In this case, the vDPA device can be directly exposed to the guest, bypassing QEMU’s emulation.

libblkio exposes a common API for accessing block devices in userspace and supports several drivers. We will focus on the virtio-blk-vhost-vdpa driver, which is used by the virtio-blk-vhost-vdpa block device in QEMU. It only supports the slow path for now, but in the future it should be able to switch to the fast path automatically. QEMU has supported libblkio drivers since QEMU 7.2, so you can use the following options to attach a vDPA block device to a VM:

   -blockdev node-name=drive_src1,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \
   -device virtio-blk-pci,id=src1,bootindex=2,drive=drive_src1 \

In any case, to fully leverage the performance of a vDPA hardware device, we can always use the generic vhost-vdpa-device-pci device offered by QEMU, which supports any vDPA device and exposes it directly to the guest. Of course, QEMU is not able to intercept requests in this scenario and therefore some features offered by its block layer (e.g. live migration, disk formats, etc.) are not supported. Since QEMU 8.0, you can use the following option to attach a generic vDPA device to a VM:

    -device vhost-vdpa-device-pci,vhostdev=/dev/vhost-vdpa-0

At KVM Forum 2022, Alberto Faria and Stefan Hajnoczi introduced libblkio, while Kevin Wolf and I discussed its usage in the QEMU Storage Daemon (QSD).

Software devices

One of the significant benefits of vDPA is its strong abstraction, enabling the implementation of virtio devices in both hardware and software, whether in the kernel or in user space. This unification under a single framework, where devices appear identical to QEMU, facilitates the seamless integration of hardware and software components.

Kernel devices

Regarding in-kernel devices, starting from Linux v5.13, there exists a simple simulator designed for development and debugging purposes. It is available through the vdpa-sim-blk kernel module, which emulates a 128 MB ramdisk. As highlighted in the presentation at KVM Forum 2021, a future device in the kernel (similar to the repeatedly proposed but never merged vhost-blk) could potentially offer excellent performance. Such a device could be used as an alternative when hardware is unavailable, for instance, facilitating live migration in any system, regardless of whether the destination system features a SmartNIC/DPU or not.

User space devices

In user space, instead, we can use VDUSE. QSD supports it and thus allows us to export any disk image supported by QEMU as a vDPA device in this way:

qemu-storage-daemon \
    --blockdev file,filename=/path/to/disk.qcow2,node-name=file \
    --blockdev qcow2,file=file,node-name=qcow2 \
    --export type=vduse-blk,id=vduse0,name=vduse0,node-name=qcow2,writable=on

Containers, VMs, or bare-metal

As mentioned in the introduction, vDPA supports different buses such as vhost-vdpa and virtio-vdpa. This flexibility enables the utilization of vDPA devices with virtual machines or user space drivers (e.g., libblkio) through the vhost-vdpa bus. Additionally, it allows interaction with applications running directly on the host or within containers via the virtio-vdpa bus.

The vdpa tool in iproute2 facilitates the management of vdpa devices through netlink, enabling the allocation and deallocation of these devices.

Starting with Linux 5.17, vDPA drivers support driver_override. This enhancement allows dynamic reconfiguration at runtime, permitting the migration of a device from one bus to another in this way:

# load vdpa buses
$ modprobe -a virtio-vdpa vhost-vdpa
# load vdpa-blk in-kernel simulator
$ modprobe vdpa-sim-blk

# instantiate a new vdpasim_blk device called `vdpa0`
$ vdpa dev add mgmtdev vdpasim_blk name vdpa0

# `vdpa0` is attached to the first vDPA bus driver loaded
$ driverctl -b vdpa list-devices
vdpa0 virtio_vdpa

# change the `vdpa0` bus to `vhost-vdpa`
$ driverctl -b vdpa set-override vdpa0 vhost_vdpa

# `vdpa0` is now attached to the `vhost-vdpa` bus
$ driverctl -b vdpa list-devices
vdpa0 vhost_vdpa [*]

# Note: driverctl(8) integrates with udev so the binding is preserved.

Examples

Below are several examples of how to use VDUSE and the QEMU Storage Daemon with VMs (QEMU) or containers (podman). These steps are easily adaptable to any hardware that supports virtio-blk devices via vDPA.

qcow2 image available for host applications and containers

# load vdpa buses
$ modprobe -a virtio-vdpa vhost-vdpa

# create an empty qcow2 image
$ qemu-img create -f qcow2 test.qcow2 10G

# load vduse kernel module
$ modprobe vduse

# launch QSD exposing the `test.qcow2` image as `vduse0` vDPA device
$ qemu-storage-daemon --blockdev file,filename=test.qcow2,node-name=file \
  --blockdev qcow2,file=file,node-name=qcow2 \
  --export vduse-blk,id=vduse0,name=vduse0,num-queues=1,node-name=qcow2,writable=on &

# instantiate the `vduse0` device (same name used in QSD)
$ vdpa dev add name vduse0 mgmtdev vduse

# be sure to attach it to the `virtio-vdpa` device to use with host applications
$ driverctl -b vdpa set-override vduse0 virtio_vdpa

# device exposed as a virtio device, but attached to the host kernel
$ lsblk -pv
NAME     TYPE TRAN   SIZE RQ-SIZE  MQ
/dev/vda disk virtio  10G     256   1

# start a container with `/dev/vda` attached
podman run -it --rm --device /dev/vda --group-add keep-groups fedora:39 bash

Launch a VM using a vDPA device

# download Fedora cloud image (or use any other bootable image you want)
$ wget https://download.fedoraproject.org/pub/fedora/linux/releases/39/Cloud/x86_64/images/Fedora-Cloud-Base-39-1.5.x86_64.qcow2

# launch QSD exposing the VM image as `vduse1` vDPA device
$ qemu-storage-daemon \
  --blockdev file,filename=Fedora-Cloud-Base-39-1.5.x86_64.qcow2,node-name=file \
  --blockdev qcow2,file=file,node-name=qcow2 \
  --export vduse-blk,id=vduse1,name=vduse1,num-queues=1,node-name=qcow2,writable=on &

# instantiate the `vduse1` device (same name used in QSD)
$ vdpa dev add name vduse1 mgmtdev vduse

# initially it's attached to the host (`/dev/vdb`), because `virtio-vdpa`
# is the first kernel module we loaded
$ lsblk -pv
NAME     TYPE TRAN   SIZE RQ-SIZE  MQ
/dev/vda disk virtio  10G     256   1
/dev/vdb disk virtio   5G     256   1
$ lsblk /dev/vdb
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
vdb    251:16   0    5G  0 disk 
├─vdb1 251:17   0    1M  0 part 
├─vdb2 251:18   0 1000M  0 part 
├─vdb3 251:19   0  100M  0 part 
├─vdb4 251:20   0    4M  0 part 
└─vdb5 251:21   0  3.9G  0 part 

# and it is identified as `virtio1` in the host
$ ls /sys/bus/vdpa/devices/vduse1/
driver  driver_override  power  subsystem  uevent  virtio1

# attach it to the `vhost-vdpa` device to use the device with VMs
$ driverctl -b vdpa set-override vduse1 vhost_vdpa

# `/dev/vdb` is not available anymore
$ lsblk -pv
NAME     TYPE TRAN   SIZE RQ-SIZE  MQ
/dev/vda disk virtio  10G     256   1

# the device is identified as `vhost-vdpa-1` in the host
$ ls /sys/bus/vdpa/devices/vduse1/
driver  driver_override  power  subsystem  uevent  vhost-vdpa-1
$ ls -l /dev/vhost-vdpa-1
crw-------. 1 root root 511, 0 Feb 12 17:58 /dev/vhost-vdpa-1

# launch QEMU using `/dev/vhost-vdpa-1` device with the
# `virtio-blk-vhost-vdpa` libblkio driver
$ qemu-system-x86_64 -m 512M -smp 2 -M q35,accel=kvm,memory-backend=mem \
  -object memory-backend-memfd,share=on,id=mem,size="512M" \
  -blockdev node-name=drive0,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-1,cache.direct=on \
  -device virtio-blk-pci,drive=drive0

# `virtio-blk-vhost-vdpa` blockdev can be used with any QEMU block layer
# features (e.g. live migration, I/O throttling).
# In this example we are using I/O throttling:
$ qemu-system-x86_64 -m 512M -smp 2 -M q35,accel=kvm,memory-backend=mem \
  -object memory-backend-memfd,share=on,id=mem,size="512M" \
  -blockdev node-name=drive0,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-1,cache.direct=on \
  -blockdev node-name=throttle0,driver=throttle,file=drive0,throttle-group=limits0 \
  -object throttle-group,id=limits0,x-iops-total=2000 \
  -device virtio-blk-pci,drive=throttle0

# Alternatively, we can use the generic `vhost-vdpa-device-pci` to take
# advantage of all the performance, but without having any QEMU block layer
# features available
$ qemu-system-x86_64 -m 512M -smp 2 -M q35,accel=kvm,memory-backend=mem \
  -object memory-backend-memfd,share=on,id=mem,size="512M" \
  -device vhost-vdpa-device-pci,vhostdev=/dev/vhost-vdpa-0

by sgarzare@redhat.com (Stefano Garzarella) at February 12, 2024 05:42 PM

January 26, 2024

Stefan Hajnoczi

Key-Value Stores: The Foundation of File Systems and Databases

File systems and relational databases are like cousins. They share more than is apparent at first glance.

It's not immediately obvious that relational databases and file systems rely upon the same underlying concept. That underlying concept is the key-value store and this article explores how both file systems and databases can be implemented on top of key-value stores.

The key-value store interface

Key-value stores provide an ordered map data structure. A map is a data structure that supports storing and retrieving values from a collection of key-value pairs. It's called a map because it is like a mathematical relation from a given key to an associated value. These are the key-value pairs that a key-value store holds. Finally, ordered means that the collection can be traversed in sorted key order. Not all key-value store implementations support ordered traversal, but both file systems and databases need this property as we shall see.

Here is a key-value store with an integer key and a string value:
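
 Key   Value
 ---   -------
  2    "red"
 14    "green"
 17    "blue"

The string values here are just placeholders; what matters is the integer keys and their sorted order.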

Notice that the keys can be enumerated in sorted order: 2 → 14 → 17.

A key-value store provides the following interface for storing and retrieving values by a given key:

  • put(Key, Value) - an insert/update operation that stores a value for a given key
  • get(Key) -> Value - a lookup operation that retrieves the most recently stored value for a given key
  • first() -> Key, last() -> Key, next(Key) -> Key, prev(Key) -> Key - a cursor API that enumerates keys in sorted order
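
To make this interface concrete, here is a minimal in-memory sketch in Python. It is a toy built on a sorted list with the bisect module rather than the B+ tree or LSM-tree a real implementation would use, and it adds a delete() helper that the later sketches rely on:

import bisect

class KVStore:
    """Toy ordered key-value store: put/get plus cursor-style traversal."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> most recently stored value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values[key]          # raises KeyError if absent

    def delete(self, key):
        del self._values[key]
        self._keys.pop(bisect.bisect_left(self._keys, key))

    def first(self):
        return self._keys[0]

    def last(self):
        return self._keys[-1]

    def next(self, key):
        i = bisect.bisect_right(self._keys, key)
        return self._keys[i] if i < len(self._keys) else None

    def prev(self, key):
        i = bisect.bisect_left(self._keys, key)
        return self._keys[i - 1] if i > 0 else None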

You've probably seen this sort of API if you have explored libraries like LevelDB, RocksDB, LMDB, BoltDB, etc or used NoSQL key-value stores. File systems and databases usually implement their own customized key-value stores rather than use these off-the-shelf solutions.

Why key-value stores are necessary

Let's look at how the key-value store interface relates to disks. Disks present a range of blocks that can be read or written at their block addresses. Disks can be thought of as arrays in programming. They have O(1) lookup and update time complexity, but inserting or removing a value before the end of the array is O(n) because subsequent elements need to be copied. They are efficient for dense datasets where every element is populated but inefficient for sparse datasets that involve insertion and removal.

Workloads that involve insertion or removal are not practical when the cost is O(n) for realistic sizes of n. That's why programs often use in-memory data structures like hash tables or balanced trees instead of arrays. Key-value stores can be thought of as the on-disk equivalent to these in-memory data structures. Inserting or removing values from a key-value store takes sub-linear time, perhaps O(log n) or even better amortized time. We won't go into the data structures used to implement key-value stores, but B+ trees and Log-Structured Merge-Trees are popular choices.

This gives us an intuition about when key-value stores are needed and why they are an effective tool. Next, let's look at how file systems and databases can be built on top of key-value stores.

Building a file system on a key-value store

First let's start with how data is stored in files. A file system locates file data on disk by translating file offsets to Logical Block Addresses (LBAs). This is necessary because file data may not be stored contiguously on disk and files can be sparse with unallocated "holes" where nothing has been written yet. Thus, each file can be implemented as a key-value store with <Offset, <LBA, Length>> key-value pairs that comprise the translations needed to locate data on disk:

Reading and writing to the file involves looking up Offset -> LBA translations and inserting new translations when new blocks are allocated for the file. This is a good fit for a key-value store, but it's not the only place where file systems employ key-value stores.
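
As a sketch of what that lookup involves, here is how a read at a file offset might be resolved against an Offset -> <LBA, Length> store, reusing the toy KVStore from above (block-sized units and all values purely illustrative):

def lookup_extent(extents, offset):
    """Translate a file offset into (lba, blocks remaining in the extent)."""
    try:
        extents.get(offset)
        start = offset                 # an extent begins exactly here
    except KeyError:
        start = extents.prev(offset)   # otherwise use the closest extent before it
    if start is None:
        return None                    # hole: nothing allocated this early in the file
    lba, length = extents.get(start)
    delta = offset - start
    if delta >= length:
        return None                    # offset falls in a hole after this extent
    return lba + delta, length - delta

extents = KVStore()
extents.put(0, (1000, 8))    # 8 blocks at file offset 0 stored at LBA 1000
extents.put(16, (2048, 8))   # 8 blocks at file offset 16 stored at LBA 2048
print(lookup_extent(extents, 18))   # -> (2050, 6)
print(lookup_extent(extents, 10))   # -> None (a hole)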

File systems track free blocks that are not in use by files or metadata so that the block allocator can quickly satisfy allocation requests. This can be implemented as a key-value store with <LBA, Length> key-value pairs representing all free LBA ranges.

If the block allocator needs to satisfy contiguous allocation requests then a second key-value store with <Length, LBA> key-value pairs can serve as an efficient lookup or index. A best-fit allocator uses this key-value store by looking up the requested contiguous allocation size. Either a free LBA range of the matching size will be found, or, when the lookup fails, traversing to the next ordered key finds a bigger free range capable of satisfying the allocation request. This is an important pattern with key-value stores: we can have one main key-value store plus one or more indices that are derived from the same key-value pairs but use a different datum as the key than the primary key-value store, allowing efficient lookups and ordered traversal. The same pattern will come up in databases too.
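
Here is a sketch of how those two stores cooperate during a best-fit allocation, again reusing the toy KVStore. Composite (Length, LBA) tuples are used as keys in the index so that equal lengths don't collide, and -1 is a sentinel that sorts before any real LBA:

def best_fit_alloc(free_by_lba, free_by_len, size):
    """Carve `size` blocks out of the smallest free range that fits."""
    key = free_by_len.next((size, -1))   # smallest (length, lba) with length >= size
    if key is None:
        return None                      # no free range is big enough
    length, lba = key
    free_by_len.delete(key)              # remove the chosen range from both stores...
    free_by_lba.delete(lba)
    if length > size:                    # ...and put back whatever is left over
        free_by_lba.put(lba + size, length - size)
        free_by_len.put((length - size, lba + size), None)
    return lba

free_by_lba, free_by_len = KVStore(), KVStore()
for lba, length in [(100, 4), (200, 16), (300, 8)]:   # made-up free ranges
    free_by_lba.put(lba, length)
    free_by_len.put((length, lba), None)
print(best_fit_alloc(free_by_lba, free_by_len, 6))    # -> 300 (the 8-block range fits best)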

Next, let's look at how to represent directory metadata in a key-value store. Files are organized into a hierarchy of directories (or folders). The file system stores the directory entries belonging to each directory. Each directory can be organized as a key-value store with filenames as keys and inode numbers as values. Path traversal consists of looking up directory entries in each directory along file path components like home, user, and file in the path /home/user/file. When a file is created, a new directory entry is inserted. When a file is deleted, its directory entry is removed. The contents of a directory can be listed by traversing the keys.
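
A sketch of that traversal, with one toy KVStore per directory and made-up inode numbers:

def path_lookup(dirs, root_inode, path):
    """Resolve a path by walking filename -> inode entries, one directory at a time."""
    inode = root_inode
    for component in path.strip('/').split('/'):
        inode = dirs[inode].get(component)   # KeyError here means "No such file or directory"
    return inode

root, home, user = KVStore(), KVStore(), KVStore()
root.put('home', 2)
home.put('user', 3)
user.put('file', 4)
dirs = {1: root, 2: home, 3: user}               # inode number -> that directory's store
print(path_lookup(dirs, 1, '/home/user/file'))   # -> 4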

Some file systems like BTRFS use key-value stores for other on-disk structures such as snapshots, checksums, etc., too. There is even a root key-value store in BTRFS from which all these other key-value stores can be looked up. We'll see that the same concept of a "forest of trees" or a root key-value store that points to other key-value stores also appears in databases below.

Building a database on a key-value store

The core concept in relational databases is the table, which contains the rows of the data we wish to store. The table columns are the various fields that are stored by each row. One or more columns make up the primary key by which table lookups are typically performed. The table can be implemented as a key-value store using the primary key columns as the key and the remainder of the columns as the value:

This key-value store can look up rows in the table by their Id. What if we want to look up a row by Username instead?

To enable efficient lookups by Username, a secondary key-value store called an index maintains a mapping from Username to Id. The index does not duplicate all the columns in the table, just the Username and Id. To perform a query like SELECT * FROM Users WHERE Username = 'codd', the index is first used to look up the Id and then the remainder of the columns are looked up from the table.
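
The two-step lookup can be sketched with the same toy KVStore, using an invented Email column to stand in for "the remainder of the columns":

users, by_username = KVStore(), KVStore()            # table and its secondary index
users.put(1, ('codd', 'codd@example.com'))           # Id -> (Username, Email)
users.put(2, ('hopper', 'hopper@example.com'))
for user_id in (1, 2):
    username, _ = users.get(user_id)
    by_username.put(username, user_id)               # Username -> Id

# Roughly what SELECT * FROM Users WHERE Username = 'codd' does:
user_id = by_username.get('codd')                    # 1. index lookup gives the Id
row = (user_id,) + users.get(user_id)                # 2. fetch the rest from the table
print(row)                                           # -> (1, 'codd', 'codd@example.com')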

SQLite's file format documentation shows the details of how data is organized along these lines and the power of key-value stores. The file format has a header that references the "table b-tree" that points to the roots of all tables. This means there is an entry point key-value store that points to all the other key-value stores associated with tables, indices, etc. in the database. This is similar to the forest of trees we saw in the BTRFS file system, where the key-value store acts as the central data structure tying everything together.

Conclusion

If a disk is like an array in programming, then a key-value store is like a dict. It offers a convenient interface for storing and retrieving sparse data with good performance. Both file systems and databases are abundant with sparse data and therefore fit naturally on top of key-value stores. The actual key-value store implementations inside file systems and databases may be specialized variants of B-trees and other data structures that don't even call themselves key-value stores, but the fundamental abstraction upon which file systems and databases are built is the key-value store.

by Unknown (noreply@blogger.com) at January 26, 2024 01:16 AM

January 02, 2024

Stefan Hajnoczi

QEMU AioContext removal and how it was done

This post is about the AioContext lock removal in QEMU 9.0 (planned for release in 2024), how we got here, and what it means for multi-threaded code in QEMU.

Early QEMU as a single-threaded program

Until 2009 QEMU was largely a single-threaded program. This had the benefit that the code didn't need to consider thread-safety and was thus simpler and less bug-prone. The main loop interleaved running the next piece of guest code and handling external events such as timers, disk I/O, and network I/O. This architecture had the downside that emulating multi-processor guests was bottlenecked by the single host CPU on which QEMU ran. There was no parallelism and this became problematic as multi-processor guests became popular.

Multi-threading with vCPU threads and the Big QEMU Lock

The architecture was modified to support running dedicated vCPU threads for KVM guests. This made parallelism possible for multi-processor guests but the feature was initially only available for KVM guests. The Multi-Threaded TCG (MTTCG) feature eventually allowed translated code to also take advantage of vCPU threads in 2016.

A straightforward approach to making all existing code thread-safe was taken: the Big QEMU Lock (BQL) was introduced to serialize access to QEMU's internal state. The BQL is a single global mutex that is used to protect the majority of QEMU's internal state. KVM vCPU threads do not need access to QEMU's internal state while executing guest code, so they don't hold the BQL most of the time. The main loop thread drops the BQL while blocking in ppoll(2) and this allows vCPU threads to acquire the lock when they come out of guest code.

Multi-threading with IOThreads and the AioContext lock

Although the vCPU bottleneck had been solved, device emulation still ran with the BQL held. This meant that only a single QEMU thread could process I/O requests at a time. For I/O bound workloads this was a bottleneck and especially disk I/O performance suffered due to this limitation. My first attempt at removing the bottleneck in 2012 amounted to writing a new "dataplane" code path outside the BQL, but it lacked the features that users needed, like disk image file formats, I/O throttling, etc., because it couldn't use the existing code that relied on the BQL. The long term solution would be introducing thread-safety to the existing code and that led to the creation of the AioContext lock.

The AioContext lock was like a mini-BQL but for an event loop (QEMU calls this an AioContext) instead of the entire program. Initially the event loop would acquire the lock while running event handlers, thereby ensuring mutual exclusion for all handlers associated with the event loop. Another thread could acquire the lock to stop the event loop from running and safely access variables. This was a crude approach though and propagated the BQL way of thinking further. QEMU began to suffer from deadlocks and race conditions now that multi-threading was possible. Although I wrote developer documentation about how the model worked, it became tricky to gain confidence in the safety of the code as the whole QEMU block layer needed to grapple with AioContext locking and did so incompletely and inconsistently.

The upshot of all of this was that disk I/O processing could run in a dedicated event loop thread (QEMU calls this an IOThread) while the QEMU monitor could acquire the AioContext lock for a brief moment to inspect the emulated disk for an "info block" monitor command, for example. Unlike the earlier "dataplane" approach, it was now possible for the QEMU block layer to run outside the BQL and instead rely on the AioContext lock.

Removing the AioContext lock

Paolo Bonzini had the idea to gradually eliminate the AioContext lock in favor of fine-grained locks because we kept hitting problems with the AioContext lock that I described above. His insight was to change the model so that handler functions would explicitly take their AioContext's lock instead of acquiring the lock around the entire event loop iteration. The advantage of letting handlers take the lock was that they could also replace it with another mechanism. Eventually it would be possible to move away from the AioContext lock.

What came after was a multi-year journey that I credit to Paolo's vision. Emanuele Giuseppe Esposito worked with Paolo on putting fine-grained locking into practice and on sorting through the entire QEMU block layer to determine under which threads and locks variables were accessed. This was a massive effort and required a lot of persistence. Kevin Wolf figured out how to use clang's Thread Safety Analysis (TSA) to check some of the locking rules at compile time. Kevin also spent a lot of time protecting the block driver graph with a reader/writer lock so that in-flight I/O does not crash amidst modifications to the graph. Emanuele and Kevin gave a talk at KVM Forum 2023 about the larger QEMU multi-queue block layer effort and the slides are available here (PDF).

Once everything that previously relied on the AioContext lock had switched to another form of thread-safety, it was possible to remove the AioContext lock as nothing used it anymore. The BQL is still widely used and covers global state that is accessed from few threads. Code that can run in any IOThread now uses its own locks or other mechanisms. The complexity of the codebase is still roughly the same as with the AioContext lock, but now there are fine-grained locks, which are easier to understand and there are fewer undocumented locking assumptions that led to deadlocks and races in the past.

Conclusion

QEMU's AioContext lock enabled multi-threading but was also associated with deadlocks and race conditions due to its ambient nature. From QEMU 9.0 onwards, QEMU will switch to fine-grained locks that are more localized and make thread-safety more explicit. Changing locking in a large program is time-consuming and difficult. It took a multi-person, multi-year effort to complete this work, but it forms the basis for further work, including the QEMU multi-queue block layer effort that pushes multi-threading further in QEMU.

by Unknown (noreply@blogger.com) at January 02, 2024 08:08 PM

January 01, 2024

Stefan Hajnoczi

Upcoming talk: "Trust, confidentiality, and hardening: the virtio lessons" at LPC 2023

Update: The video is now available here and the slides are available here (PDF).

I will be at Linux Plumbers Conference 2023 to present "Trust, confidentiality, and hardening: the virtio lessons" at 2:30pm on Wednesday, November 15th. Michael Tsirkin and I prepared this talk about the evolution of the trust model of the Linux VIRTIO drivers. It explores how the drivers have been hardened in response to new use cases for VIRTIO, including Linux VDUSE, hardware VIRTIO devices, and Confidential Computing.

Details are available on the LPC schedule. Come watch the talk to find out how drivers work when you can't trust the hypervisor!

by Unknown (noreply@blogger.com) at January 01, 2024 10:52 AM

Storage literature notes on free space management and snapshots

I recently looked at papers about free space management and snapshots in storage systems like file systems, volume managers, and key-value stores. I'm publishing my notes in case you find them useful, but the real value might simply be the links to papers in this field. They might be a useful starting point for someone wishing to read into this field.

My aim was to get an overview of data structures and algorithms used in modern storage systems for tracking free space and snapshotting objects.

Literature

  • The Zettabyte File system (2003)
    • The Storage Pool Allocator (SPA) provides allocation and freeing of blocks across physical disks. It deals in disk virtual addresses (DVAs) so the caller is unaware of which disk the storage is located on. Blocks can be migrated between devices without changing their DVA because the SPA can just update translation metadata.
      • A slab allocator is used to satisfy contiguous block allocation requests of power-of-2 sizes (see details). Each device is divided into ~200 “metaslabs” (i.e. 0.5% of the device).
      • Allocations in a metaslab are written into a log called a space map and rewritten when the log becomes too long (see details). In memory, range trees are built from the on-disk log so that free space can be looked up by offset or length (see details).
    • All blocks are checksummed. Checksums are stored along with the block pointer, so the integrity of the entire tree is protected via the checksum. When data is mirrored across drives it is possible to fix checksum failures.
    • The Data Management Unit (DMU) provides an object storage interface for creating, accessing, and deleting objects on top of the SPA.
    • The ZFS POSIX Layer (ZPL) implements POSIX file system semantics using the DMU to create objects for directories, files, etc.
    • When there are too many data blocks to store the block pointers, ZFS uses indirect blocks (up to 6 levels). Indirect blocks are blocks containing block pointers.
  • B-trees, Shadowing, and Clones (2006)
    • Uses a copy-on-write B+-tree to implement an object storage device (OSD).
    • Requests are written to a log for recovery in between B+-tree checkpoints.
    • B+-tree pages are kept cached in memory until checkpoint write-out so that multiple updates to the same page are batched.
    • Hierarchical reference counts are used on tree nodes. This makes refcounts lazy and avoids having to increment/decrement refcounts on all blocks upfront.
  • FlexVol: Flexible, Efficient File Volume Virtualization in WAFL (2008)
    • Introduces logical volumes into WAFL so that multiple file systems can be managed on the same physical storage with separate snapshots, policies, etc.
    • Delayed Block Freeing: do not actually free blocks and instead defer until 2% of blocks are ready to be freed in the background.
    • Cloning Volumes from Snapshots works like backing file chains in qcow2 or VMDK. WAFL knows which Snapshots are referenced and won’t free their metadata and blocks because Clone Volumes may still be using them. Clone Volumes can be detached from their Snapshots by copying out the data blocks to new blocks.
  • Tracking Back References in a Write-Anywhere File System (2010)
    • Log-structured back references are write-optimized so that block allocation, snapshot creation, etc efficiently record users of physical blocks. This information is needed during defragmentation and other data reorganization operations.
    • Serves queries from physical block address to logical block (inode, offset).
    • Implemented using a log-structured merge tree (requires periodic compaction) and a Bloom filter.
  • MDB: A Memory-Mapped Database and Backend for OpenLDAP (2011)
    • LMDB is a read-optimized key-value store implemented as a copy-on-write B+-tree
    • Concurrency model: 1 writer and N readers at the same time
    • Entire database file is mmapped but writes and flushes use syscalls
    • Freelist B+-tree tracks free pages in database file
  • BTRFS: The Linux B-tree filesystem (2012)
    • Extent-based free space management
      • Extent allocation tree stores back references, allowing extents to be moved later
      • Relies on contiguous free space, so background defragmentation is necessary
    • Sub-volume tree nodes are reference counted
    • A 4KB write creates new inodes, file extents, checksums, and back references and corresponding b-tree spine nodes. When there are multiple modifications, spatial locality (sequential I/O or inode changes in a directory) helps batch these changes together resulting in fewer than N new nodes for N operations. Random I/O is less efficient.
  • GCTrees: Garbage Collecting Snapshots (2015)
    • Rodeh's hierarchical reference counting delays refcount updates by keeping refcounts on tree nodes and updating only the node's refcount closest to the root. Further tree modifications might eventually make it necessary to update subsets of refcounts in tree leaves. This can be combined with a refcount log to reduce the random I/O involved in updating many scattered refcounts.
    • GCTree nodes store an offset to the parent GCTree node and a borrowed bitmap tracking which blocks are shared with the parent.
      • When a GCTree is deleted:
        • Blocks are ignored when the borrowed bit is set
        • The borrowed bit is checked in immediate child GCTree nodes to determine if the remaining blocks are still in use:
          • If not in use, free the block
          • If in use, clear the borrowed bit in the child to transfer ownership of the block to the child (paper doesn't explain how this works when multiple immediate children borrow the same block because this research only considers read-only snapshots without writeable clone support)
        • The linked list (relationship between GCTree nodes) is updated
  • Algorithms and Data Structures for Efficient Free Space Reclamation in WAFL (2017)
    • WAFL keeps free space metadata up-to-date instead of eventually consistent (relying on scanning metadata in the background to identify free space).
    • Free space is recorded in a bitmap called activemap. Blocks are allocated near each other (e.g. contiguous), if possible, to minimize updates to the activemap.
    • WAFL implements background and inline defragmentation to make contiguous free space available.
    • File deletion does not instantly clear bits in the activemap because doing so would be expensive on large files. Deleted files are incrementally freed across checkpoints.
    • The Batched Free Log (BFLog) holds deleted blocks and sorts them so they can be deleted incrementally.
  • How to Copy Files (2020)
    • Aims to create what they call "nimble clones" (fast creation, fast read/write I/O, and efficient space utilization)
    • Read performance with btrfs, ZFS, xfs degrades after successive rounds of clone + write. The intuition is that at some point it's better to copy the blocks to avoid fragmentation instead of sharing them.
      • They call this Copy-on-Abundant-Write (CAW)
    • Implemented in BetrFS, a file system based on a Bε-tree key-value store that uses path names as keys instead of inode numbers.
      • Uses hierarchical reference counts to track nodes
      • Free space is tracked in a bitmap in the node translation table, which is used for indirection to avoid rewriting nodes when physical block locations are updated
      • Didn't look in detail at the Bε-tree DAG technique introduced to implement efficient copies

Data structures

  • B+ trees: common in file systems and databases for ordered indexes
  • Bitmaps: widely used to track block allocation
  • Bloom filters: probabilistic data structure for set membership tests sacrificing accuracy (there can be false positives) for low space requirements
  • Skip lists: a probabilistic O(log n) multi-level linked list data structure built atop a sorted linked list, but not as popular as B+ trees for on-disk structures

by Unknown (noreply@blogger.com) at January 01, 2024 10:50 AM

December 20, 2023

QEMU project

QEMU version 8.2.0 released

We’d like to announce the availability of the QEMU 8.2.0 release. This release contains 3200+ commits from 238 authors.

You can grab the tarball from our download page. The full list of changes is available in the changelog.

Highlights include:

  • New virtio-sound device emulation
  • New virtio-gpu rutabaga device emulation used by Android emulator
  • New hv-balloon dynamic memory protocol device emulation for Hyper-V guests
  • New Universal Flash Storage device emulation
  • Network Block Device (NBD) 64-bit offsets for improved performance
  • dump-guest-memory now supports the standard kdump format
  • ARM: Xilinx Versal board now models the CFU/CFI, and the TRNG device
  • ARM: CPU emulation support for cortex-a710 and neoverse-n2
  • ARM: architectural feature support for PACQARMA3, EPAC, Pauth2, FPAC, FPACCOMBINE, TIDCP1, MOPS, HBC, and HPMN0
  • HPPA: CPU emulation support for 64-bit PA-RISC 2.0
  • HPPA: machine emulation support for C3700, including Astro memory controller and four Elroy PCI bridges
  • LoongArch: ISA support for LASX extension and PRELDX instruction
  • LoongArch: CPU emulation support for la132
  • RISC-V: ISA/extension support for AIA virtualization support via KVM, and vector cryptographic instructions
  • RISC-V: Numerous extension/instruction cleanups, fixes, and reworks
  • s390x: support for vfio-ap passthrough of crypto adapter for protected virtualization guests
  • Tricore: support for TC37x CPU which implements ISA v1.6.2
  • Tricore: support for CRCN, FTOU, FTOHP, and HPTOF instructions
  • x86: Xen support for PV console and network devices
  • and lots more…

Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!

December 20, 2023 04:33 PM

December 14, 2023

Gerd Hoffmann

W^X in UEFI firmware and the linux boot chain.

What is W^X?

If this sounds familiar to you, it probably is. It means that memory should be either writable ("W", typically data) or executable ("X", typically code), but not both. Elsewhere in the software industry this has been standard security practice for ages. Now it starts to take off for UEFI firmware too.

This is a deep dive into recent changes, in both code (firmware) and administration (secure boot signing), the consequences this has for Linux, and the current state of affairs.

Changes in the UEFI spec and edk2

All UEFI memory allocations carry a memory type (EFI_MEMORY_TYPE). UEFI has tracked since day one whether a memory allocation is meant for code or data, among a bunch of other properties such as boot service vs. runtime service memory.

For a long time it didn't matter much in practice. The concept of virtual memory does not exist for UEFI. IA32 builds even run with paging disabled (and this is unlikely to change until the architecture disappears into irrelevance). Other architectures use identity mappings.

While UEFI does not use address translation, nowadays it can use page tables to enforce memory attributes, including (but not limited to) write and execute permissions. When configured to do so it will set code pages to R-X and data pages to RW- instead of using RWX everywhere, so code using memory types incorrectly will trigger page faults.

New in the UEFI spec (added in version 2.10) is the EFI_MEMORY_ATTRIBUTE_PROTOCOL. Sometimes properties of memory regions need to change, and this protocol can be used to do so. One example is a self-uncompressing binary, where the memory region the binary gets unpacked to initially must be writable. Later (parts of) the memory region must be flipped from writable to executable.

As of today (Dec 2023) edk2 has a EFI_MEMORY_ATTRIBUTE_PROTOCOL implementation for the ARM and AARCH64 architectures, so this is present in the ArmVirt firmware builds but not in the OVMF builds.

Changed secure boot signing requirements

In an effort to improve firmware security in general and especially for secure boot Microsoft changed the requirements for binaries they are willing to sign with their UEFI CA key.

One key requirement added is that the binary layout must make it possible to enforce memory attributes with page tables, i.e. PE binary sections must be aligned to the page size (4k). Sections also can't be both writable and executable. And the application must be able to deal with data sections being mapped as not executable (NX_COMPAT).

These requirements apply to the binary itself (i.e. shim.efi for linux systems) and everything loaded by the binary (i.e. grub.efi, fwupd.efi and the linux kernel).

Where does linux stand?

We had, and partly still have, a bunch of problems in all components involved in the linux boot process, i.e. shim.efi, grub.efi and the efi stub of the linux kernel.

Some are old bugs such as memory types not being used correctly, which start to cause problems due to the firmware becoming more strict. Some are new problems due to Microsoft raising the bar for PE binaries, typically sections not being page-aligned. The latter are easily fixed in most cases, often it is just a matter of adding alignment to the right places in the linker scripts.

Let's have a closer look at the components one by one:

shim.efi

shim added code to use the new EFI_MEMORY_ATTRIBUTE_PROTOCOL before it was actually implemented by any firmware. Then this was released completely untested. That did not work out very well: we got a nice time bomb, and edk2 implementing EFI_MEMORY_ATTRIBUTE_PROTOCOL for arm triggered it ...

Fixed in main branch, no release yet.

Getting new shim.efi binaries signed by Microsoft depends on the complete boot chain being compliant with the new requirements, which prevents shim bugfixes being shipped to users right now.

That should be solved soon though, see the kernel section below.

grub.efi

grub.efi used to use memory types incorrectly.

Fixed upstream years ago, case closed.

Well, in theory. Upstream grub development goes at glacial speeds, so all distros carry a big stack of downstream patches. Not surprisingly that leads to upstream fixes being absorbed slowly and also to bugs getting reintroduced.

So, in practice we still have buggy grub versions in the wild. It is getting better though.

The linux kernel

The linux kernel efi stub had its fair share of bugs too. On non-x86 architectures (arm, riscv, ...) all issues were fixed a few releases ago. They all share much of the efi stub code base and also use the same self-decompressing method (CONFIG_EFI_ZBOOT=y).

On x86 this all took a bit longer to sort out. For historical reasons x86 can't use the zboot approach used by the other architectures. At least as long as we need hybrid BIOS/UEFI kernels, which most likely will be a number of years still.

The final x86 patch series was merged during the 6.7 merge window. So we should have a fixed stable kernel in early January 2024, and distros picking up the new kernel in the following weeks or months, which in turn should finally unblock shim updates.

There should be enough time to get everything sorted for the spring distro releases (Fedora 40, Ubuntu 24.04).

edk2 config options

edk2 has a bunch of config options to fine tune the firmware behavior, both compile time and runtime. The relevant ones for the problems listed above are:

PcdDxeNxMemoryProtectionPolicy

Compile time option. Use the --pcd switch of the edk2 build script to set it. It's a bitmask, with one bit for each memory type, specifying whether the firmware should apply memory protections for that particular memory type by setting the flags in the page tables accordingly.

Strict configuration is PcdDxeNxMemoryProtectionPolicy = 0xC000000000007FD5. This is also the default for ArmVirt builds.

Bug compatible configuration is PcdDxeNxMemoryProtectionPolicy = 0xC000000000007FD1. This excludes the EfiLoaderData memory type from memory protections, so using EfiLoaderData allocations for code will not trigger page faults, which is a very common pattern seen in boot loader bugs.

PcdUninstallMemAttrProtocol

Compile time option, for ArmVirt only. Brand new, committed to the edk2 repo this week (Dec 12th 2023). When set to TRUE the EFI_MEMORY_ATTRIBUTE_PROTOCOL will be uninstalled. Default is FALSE.

Setting this to TRUE will work around the shim bug.

opt/org.tianocore/UninstallMemAttrProtocol

Runtime option, for ArmVirt only. Also new. Can be set using -fw_cfg on the qemu command line: -fw_cfg name=opt/org.tianocore/UninstallMemAttrProtocol,string=y|n. This is a runtime override for PcdUninstallMemAttrProtocol. Works for both enabling and disabling the shim bug workaround.

In the future PcdDxeNxMemoryProtectionPolicy will probably disappear in favor of memory profiles, which will allow configuring the same settings (plus a few more) at runtime.

Hands on, part #1 — using fedora edk2 builds

The default builds in the edk2-ovmf and edk2-aarch64 packages are configured to be bug compatible, so VMs should boot fine even in case the guests are using a buggy boot chain.

While this is great for end users it doesn't help much for bootloader development and testing, so there are alternatives. The edk2-experimental package comes with a collection of builds better suited for that use case, configured with strict memory protections and (on aarch64) EFI_MEMORY_ATTRIBUTE_PROTOCOL enabled, so you can see buggy builds actually crash and burn. 🔥

AARCH64 architecture

For AARCH64 this is /usr/share/edk2/experimental/QEMU_EFI-strictnx-pflash.raw. The magic words for libvirt are:

<domain type='kvm'>
[ ... ]
  <os>
    <type arch='aarch64' machine='virt'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/edk2/experimental/QEMU_EFI-strictnx-pflash.raw</loader>
    <nvram template='/usr/share/edk2/aarch64/vars-template-pflash.raw'/>
  </os>
[ ... ]

If a page fault happens you will get this line ...

  Synchronous Exception at 0x00000001367E6578

... on the serial console, followed by a stack trace and register dump.

X64 architecture

For X64 this is /usr/share/edk2/experimental/OVMF_CODE_4M.secboot.strictnx.qcow2. Needs edk2-20231122-12.fc39 or newer. The magic words for libvirt are:

<domain type='kvm'>
[ ... ]
  <os>
    <type arch='x86_64' machine='q35'>hvm</type>
    <loader readonly='yes' secure='yes' type='pflash' format='qcow2'>/usr/share/edk2/experimental/OVMF_CODE_4M.secboot.strictnx.qcow2</loader>
    <nvram template='/usr/share/edk2/ovmf/OVMF_VARS_4M.secboot.qcow2' format='qcow2'/>
  </os>
[ ... ]

It is also a good idea to add a debug console to capture the firmware log:

    <serial type='null'>
      <log file='/path/to/firmware.log' append='off'/>
      <target type='isa-debug' port='1'>
        <model name='isa-debugcon'/>
      </target>
      <address type='isa' iobase='0x402'/>
    </serial>

If you are lucky the page fault is logged there, also with a register dump. If you are not so lucky the VM will just reset and reboot.

Hands on, part #2 — using virt-firmware

The virt-firmware project is a collection of python modules and scripts for working with efi variables, efi varstores and also pe binaries. In case your distro doesn't have packages you can install it using pip like most python packages.

virt-fw-vars

The virt-fw-vars utility can work with efi varstores. For example it is used to create the OVMF_VARS*secboot* files, enrolling the secure boot certificates into the efi security databases.

The simplest operation is to print the variable store:

virt-fw-vars --input /usr/share/edk2/ovmf/OVMF_VARS_4M.secboot.qcow2 \
             --print --verbose | less

When updating edk2 varstores virt-fw-vars always needs both input and output files. If you want to change an existing variable store, both input and output can point to the same file. For example you can turn on shim logging for an existing libvirt guest this way:

virt-fw-vars --input /var/lib/libvirt/qemu/nvram/${guest}_VARS.qcow2 \
             --output /var/lib/libvirt/qemu/nvram/${guest}_VARS.qcow2 \
             --set-shim-verbose

The next virt-firmware version will get a new --inplace switch to avoid listing the file twice on the command line for this use case.

If you want to start from scratch you can use an empty variable store from /usr/share/edk2 as input. For example when creating a new variable store template with the test CA certificate (shipped with pesign.rpm) enrolled additionally:

dnf install -y pesign
certutil -L -d /etc/pki/pesign-rh-test -n "Red Hat Test CA" -a \
             | openssl x509 -text > rh-test-ca.txt
virt-fw-vars --input /usr/share/edk2/ovmf/OVMF_VARS_4M.qcow2 \
             --output OVMF_VARS_4M.secboot.rhtest.qcow2 \
             --enroll-redhat --secure-boot \
             --add-db OvmfEnrollDefaultKeys rh-test-ca.txt

The test CA will be used by all Fedora, CentOS Stream and RHEL build infrastructure to sign unofficial builds, for example when doing scratch builds in koji or when building rpms locally on your developer workstation. If you want test such builds in a VM, with secure boot enabled, this is a convenient way to do it.

pe-inspect

pe-inspect is useful for having a look at EFI binaries. If it isn't present, try pe-listsigs. Initially the utility only listed the signatures, but it was extended over time to show more information, so I added the pe-inspect alias later on.

Below is the output for a 6.6 x86 kernel; you can see it does not have the patches to page-align the sections:

# file: /boot/vmlinuz-6.6.4-200.fc39.x86_64
#    section: file 0x00000200 +0x00003dc0  virt 0x00000200 +0x00003dc0  r-x (.setup)
#    section: file 0x00003fc0 +0x00000020  virt 0x00003fc0 +0x00000020  r-- (.reloc)
#    section: file 0x00003fe0 +0x00000020  virt 0x00003fe0 +0x00000020  r-- (.compat)
#    section: file 0x00004000 +0x00df6cc0  virt 0x00004000 +0x05047000  r-x (.text)
#    sigdata: addr 0x00dfacc0 +0x00000d48
#       signature: len 0x5da, type 0x2
#          certificate
#             subject CN: Fedora Secure Boot Signer
#             issuer  CN: Fedora Secure Boot CA
#       signature: len 0x762, type 0x2
#          certificate
#             subject CN: kernel-signer
#             issuer  CN: fedoraca

pe-inspect also knows the names for a number of special sections and supports decoding and pretty-printing them, for example here:

# file: /usr/lib/systemd/boot/efi/systemd-bootx64.efi
#    section: file 0x00000400 +0x00011a00  virt 0x00001000 +0x0001191f  r-x (.text)
#    section: file 0x00011e00 +0x00003a00  virt 0x00013000 +0x00003906  r-- (.rodata)
#    section: file 0x00015800 +0x00000400  virt 0x00017000 +0x00000329  rw- (.data)
#    section: file 0x00015c00 +0x00000200  virt 0x00018000 +0x00000030  r-- (.sdmagic)
#       #### LoaderInfo: systemd-boot 254.7-1.fc39 ####
#    section: file 0x00015e00 +0x00000200  virt 0x00019000 +0x00000049  r-- (.osrel)
#    section: file 0x00016000 +0x00000200  virt 0x0001a000 +0x000000de  r-- (.sbat)
#       sbat,1,SBAT Version,sbat,1,https://github.com/rhboot/shim/blob/main/SBAT.md
#       systemd,1,The systemd Developers,systemd,254,https://systemd.io/
#       systemd.fedora,1,Fedora Linux,systemd,254.7-1.fc39,https://bugzilla.redhat.com/
#    section: file 0x00016200 +0x00000200  virt 0x0001b000 +0x00000084  r-- (.reloc)

virt-fw-sigdb

The last utility I want to introduce is virt-fw-sigdb, which can create, parse and modify signature databases. The signature database format is used by the firmware to store certificates and hashes in EFI variables. But sometimes the format is used for files too. virt-firmware has the functionality anyway, so I've added a small frontend utility to work with those files.

One file in signature database format is /etc/pki/ca-trust/extracted/edk2/cacerts.bin, which contains the list of trusted CAs in signature database format. It can be used to pass the CA list to the VM firmware for TLS connections (https network boot).

Shim also uses that format when compiling multiple certificates into the built-in VENDOR_DB or VENDOR_DBX databases.

Final remarks

That's it for today, folks. Hope you find this useful.

by Gerd Hoffmann at December 14, 2023 11:00 PM

December 11, 2023

KVM on Z

Red Hat Ansible Automation Platform available on IBM Z and LinuxONE

While Linux on IBM Z and LinuxONE has long been usable as a target for Ansible scripts, the backend had to run on other architectures. But no longer: starting today, the entire Red Hat Ansible Automation Platform is becoming available on IBM Z and LinuxONE!

See here for more details, and here for the formal announcement from Red Hat.

by Stefan Raspl (noreply@blogger.com) at December 11, 2023 06:46 PM

December 10, 2023

Alex Bennée

A Systems Programmer's Perspectives on Generative AI

Like many people over the last few months I've been playing with a number of Large Language Models (LLMs). LLMs are perhaps best typified by the current media star ChatGPT. It is hard to avoid the current media buzz while every tech titan is developing their "AI" play and people are exposed to tools where the label of Artificial Intelligence is liberally applied. The ability of these models to spit out competent, comprehensible text seemingly represents a step change compared to previous generations of tech.

I thought I would try and collect some of my thoughts and perspectives on this from the point of view of a systems programmer. For those not familiar with the term, it refers to the low-level work of providing the platforms for the applications people actually use. In my case a lot of the work I do is on QEMU, which involves emulating the very lowest level instructions a computer can execute: the simple arithmetic and comparison of numbers that all code is eventually expressed as.

Magic numbers and computing them

I claim no particular expertise on machine learning so expect this to be a very superficial explanation of what's going on.

In normal code the CPU tends to execute a lot of different instruction sequences as a program runs through solving the problem you have set it. The code that calculates where to draw your window will be different to the code checking the network for new data, or the logic that stores information safely on your file system. Each of those tasks is decomposed and abstracted into simpler and simpler steps until eventually it is simple arithmetic dictating what the processor should do next. You occasionally see hot spots where a particular sequence of instructions is doing a lot of heavy lifting. There is a whole discipline devoted to managing computational complexity and ensuring algorithms are as efficient as possible.

However the various technologies that are currently wowing the world work very differently. They are models of various networks represented by a series of magic numbers or "weights" arranged in a hierarchical structure of interconnected matrices. While there is a lot of nuance to how problems are encoded and fed into these models, fundamentally the core piece of computation is multiplying a bunch of numbers with another bunch of numbers and feeding the results into the next layer of the network. At the end of the process the model spits out a prediction of what the most likely next word is going to be. After selecting one, the cycle repeats, taking into account our expanded context to predict the most likely next word.

The "models" that drive these things are described mostly by the number of parameters they have. This encompasses the number of inputs and outputs they have and the number of numbers in between. For example common small open source models start at 3 billion parameters with 7, 13 and 34 billion also being popular sizes. Beyond that it starts getting hard to run models locally on all but the most tricked out desktop PCs. As a developer my desktop is pretty beefy (32 cores, 64Gb RAM) and can chew through computationally expensive builds pretty easily. However as I can't off-load processing onto my GPU a decent sized model will chug out a few words a second while maxing out my CPU. The ChatGPT v4 model is speculated to run about 1.7 trillion parameters which needs to be run on expensive cloud hardware - I certainly don't envy OpenAI their infrastructure bill.

Of course the computational power needed to run these models is a mere fraction of what it took to train them. In fact the bandwidth and processing requirements are so large it pays to develop custom silicon that is really good at multiplying large amounts of numbers and not much else. You can get a lot more bang for your buck compared to running those calculations on a general purpose CPU designed for tackling a wide range of computation problems.

The Value of Numbers

Because of the massive investment in synthesising these magic numbers they themselves become worth something. The "magic sauce" behind a model is more about how it was trained and what data was used to do it. We already know it's possible to encode society's biases into models through sloppy selection of the input data. One of the principal criticisms of proprietary generative models is how opaque the training methods are, making it hard to judge their safety. The degree to which models may regurgitate data without any transformation is hard to quantify when you don't know what went into it.

As I'm fundamentally more interested in knowing how the technology I use works under the hood, it's fortunate there is a growing open source community working on building their own models. Credit should be given to Meta who made their language model LLaMA 2 freely available on fairly permissive terms. Since then there has been an explosion of open source projects that can run the models (e.g: llama.cpp, Ollama) and provide front-ends (e.g: Oobabooga's text generation UI, Ellama front-end for Emacs) for them.
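
Getting one of these running locally is now only a couple of commands. A rough sketch with Ollama (the model tag is just an illustration, pick whatever fits your hardware):

$ ollama pull llama2:7b
$ ollama run llama2:7b "Explain what a TLB does in two sentences"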

Smaller Magic Numbers

The principal place where this work is going on is Hugging Face. Think of it as the GitHub of the machine learning community. It provides an environment for publishing and collaborating on data sets and models as well as hosting and testing their effectiveness in various benchmarks. This makes experimenting with models accessible to developers who aren't part of the well funded research divisions of the various tech titans. Datasets for example come with cards which describe the sources that went into these multi-terabyte files.

One example of such a dataset is the RedPajama dataset. This is an open source initiative to recreate the LLaMA training data which combines data from the open web as well as numerous permissively licensed sources such as Wikipedia, GitHub, StackExchange and ArXiv. This dataset has been used to train models like OpenLLaMA in an attempt to provide an unencumbered version of Meta's LLaMA 2. However training up these foundational models is an expensive and time consuming task; the real action is in taking these models and then fine tuning them for particular tasks.

To fine tune a model you first take a general purpose model and further train it against data with a specific task in mind. The purpose of this is not only to make your new model better suited for a particular task but also to optimise the number of calculations that model has to do to achieve acceptable results. This is also where the style of prompting will be set as you feed the model examples of the sort of questions and answers you want it to give.

There are further stages that can be applied, including "alignment", where you ensure results are broadly in tune with the values of the organisation. This is the reason the various chatbots around won't readily cough up the recipe to build nukes or make it easier to explicitly break the law. This can be augmented with Reinforcement Learning from Human Feedback (RLHF), which is practically the purpose of every CAPTCHA you'll have filled in over the last 25 years online.

Finally the model can be quantised to make it more manageable. This takes advantage of the fact that a lot of the numbers will have a negligible effect on the result for a wide range of inputs. In those cases there is no point storing them at full precision. As computation is a function of the number of bits of information being processed this also reduces the cost of computation. While phones and other devices are increasingly including dedicated hardware to process these models they are still constrained by physics - the more you process the more heat you need to dissipate, the more battery you use and the more bandwidth you consume. Obviously the more aggressively you quantise the models the worse they will perform, so there is an engineering trade off to make. Phones work best with multiple highly tuned models solving specific tasks as efficiently as possible. Fully flexible models giving a J.A.R.V.I.S like experience will probably always need to run in the cloud where thermal management is simply an exercise in plumbing.

Making magic numbers work for you

Before we discuss using models I want to cover 3 more concepts: "prompts", "context" and "hallucinations".

The prompt is the closest thing there is to "programming" the model. The prompt can be purely explicit or include other inputs behind the scenes. For example the prompt can instruct the model to be friendly or terse, decorate code snippets with markdown, make changes as diffs or in full functions. Generally the more explicit your prompt is about what you want the better the result you get from the model. Prompt engineering has the potential to be one of those newly created job titles that will have to replace the jobs obsoleted by advancing AI. One of the ways to embed AI APIs into your app is to create a task specific prompt that will be put in front of user input that guides the results to what you want.

The "context" is the rest of the input into the model. That could be the current conversation in a chat or the current page of source code in a code editor. The larger the context the more reference the model has for its answer although that does come at the cost of even more composition as the context makes for more input parameters into the model.

In a strong candidate for 2023's word of the year, "hallucination" describes the quirky and sometimes unsettling behaviour of models outputting weird, sometimes contradictory information. They will sincerely and confidently answer questions with blatant lies or start regurgitating training data when given certain prompts. It is a salient reminder that the statistical nature of these generative models means they occasionally spout complete rubbish. They are also very prone to following the lead of their users - the longer you chat with a model the more likely it is to end up agreeing with you.

So let's talk about what these models can and can't do. As a developer one of the areas I'm most interested in is their ability to write code. Systems code especially is an exercise in precisely instructing a computer what to do in explicit situations. I'd confidently predicted my job would be one of the last to succumb to the advance of AI, as systems aren't something you can get "mostly" right. It was quite a shock when I first saw quite how sophisticated the generated code can be.

Code Review

One of the first things I asked ChatGPT to do was review a function I'd written. It managed to make 6 observations about the code, 3 of which were actual logic problems I'd missed and 3 of which were general points about variable naming and comments. The prompt is pretty important though. If not constrained to point out actual problems, LLMs have a tendency to spit out rather generic advice about writing clean, well commented code.

They can be super useful when working with an unfamiliar language or framework. If you are having trouble getting something to work it might be faster to ask an LLM how to fix your function than spending time reading multiple StackOverflow answers to figure out what you've misunderstood. If compiler errors are confusing, supplying the message alongside the code can often be helpful in understanding what's going on.

Writing Code

However rather than just suggesting changes, one very tempting use case is writing code from scratch based on a description of what you want. Here the context is very important: the more detail you provide, the better the chance of generating something useful. My experience has been that the solutions are usually fairly rudimentary and can often benefit from a manual polishing step once you have something working.

For my QEMU KVM Forum 2023 Keynote I got ChatGPT to write the first draft of a number of my data processing scripts. However it missed obvious optimisations by repeatedly reading values inside inner loops that made the scripts slower than they needed to be.

If the task is a straight transformation they are very good. Ask an LLM to convert a function in one language into another and it will do a pretty good job - and probably with fewer mistakes than your first attempt. However there are limitations. For example I asked a model to convert some Aarch64 assembler into the equivalent 32 bit Arm assembler. It did a very good job of the mechanical part of that but missed the subtle differences in how to set up the MMU. This resulted in code which compiled but didn't work until debugged by a human who was paying close attention to the architecture documentation as they went.

One of the jobs LLMs are very well suited for is writing code that matches an existing template. For example if you are mechanically transforming a bunch of enums into a function to convert them to strings you need only do a few examples before there is enough context for the LLM to reliably figure out what you are doing. LLMs are a lot more powerful than a simple template expansion because you don't need to explicitly define a template first. The same is true of tasks like generating test fixtures for your code.

There is a potential trap however with using LLMs to write code. As there is no source code and the proprietary models are fairly cagey about exactly what data they were trained on, there are worries about them committing copyright infringement. There are active debates ongoing in the open source community (e.g. on qemu-devel) about the potential ramifications of a model regurgitating its training data. Without clarity on what license that data has there is a risk of contaminating projects with code of unknown provenance. While I'm sure these issues will be resolved in time it's certainly a problem you need to be cognisant of.

Prose

Writing prose is a much more natural problem territory for LLMs and an area where low-effort text generation will be rapidly replaced by generative models like ChatGPT. "My" previous blog post was mostly written by ChatGPT based on a simple brief and a few requests for rewrites in a chat session. While it made the process fairly quick the result comes across as a little bland and "off". I find there is a tendency for LLMs to fall back on fairly obvious generalisations and erase any unique authorial voice there may have been.

However if you give them enough structure it's very easy to get an LLM to expand a bullet list into more flowery prose. They are more powerful when being fed a large piece of text and asked to summarise key information in a more accessible way.

They are certainly an easy way to give a first pass review of your writing although I try to re-phrase things myself rather than accept suggestions verbatim to keep my voice coming through the text.

Final Thoughts

The recent advances in LLMs and the public's exposure to popular tools like ChatGPT have certainly propelled the topic of AI into the zeitgeist. While we are almost certainly approaching the "Peak of Inflated Expectations" stage of the hype cycle, they will undoubtedly be an important step on the road to the eventual goal of Artificial General Intelligence (AGI). We are still a long way from being able to ask computers to solve complex problems the way they can in, for example, Star Trek. However in their current form they will certainly have a big impact on the way we work over the next decade or so.

It's important as a society we learn about how they are built, what their limitations are, and understand the computational cost and resultant impact on the environment. It will be a while before I'd want to trust a set of magic numbers over a carefully developed algorithm to actuate the control surfaces on a plane I'm flying on. However they are already well placed to help us learn new information through interactive questioning and summarising random information on the internet. We must learn to recognise when we've gone down a hallucinatory rabbit hole and verify what we've learned with reference to trusted sources.

by alex at December 10, 2023 07:28 PM

December 08, 2023

KVM on Z

New Linux on IBM Z & LinuxONE Forum at Open Mainframe Project

The Open Mainframe Project has launched a new forum dedicated to Linux on Z. It can be found here, and is intended to complement existing facilities like the mailing lists hosted at Marist College. Any topic around Linux on Z, including virtualization as provided by z/VM and KVM, is fair game, and you may use it to ask questions, share useful hints and tips, or simply have a casual conversation about some aspect of the platform!

by Stefan Raspl (noreply@blogger.com) at December 08, 2023 11:29 AM

December 01, 2023

KVM on Z

New Releases: RHEL 8.9 and RHEL 9.3 on IBM Z & LinuxONE

Both Red Hat Enterprise Linux 8.9 and 9.3 are out! See the press release here, and Red Hat's blog entry here!

Both releases ship

  • s390-tools v2.27 (renamed to s390utils)
  • smc-tools v1.8.2
  • openCryptoki v3.21

Further information can be found in the release notes for RHEL 8.9 and RHEL 9.3.

by Stefan Raspl (noreply@blogger.com) at December 01, 2023 11:26 AM

November 30, 2023

Gerd Hoffmann

physical address space in qemu

The physical address space is where all memory and most IO resources are located: PCI memory bars, PCI MMIO bars, platform devices like lapic, io-apic, hpet, tpm, ...

On your linux machine you can use lscpu to see the size of the physical address space:

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
                         ^^^^^^^^^^^^^^^^
[ ... ]

In /proc/iomem you can see how the address space is used. Note that the actual addresses are only shown to root.
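
For example, to check where the firmware placed the PCI windows (run it as root so the real addresses are shown):

$ sudo grep -i 'pci bus' /proc/iomem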

The physical address space problem on x86_64

The very first x86_64 processor (AMD Opteron) shipped with a physical address space of 40 bits (aka one TeraByte). So when qemu added support for the (back then) new architecture the qemu vcpu likewise got 40 bits of physical address space, probably assuming that this would be a safe baseline. It is still the default in qemu (version 8.1 as of today) for backward compatibility reasons.

Enter Intel. The first 64-bit processors shipped by Intel featured only 36 bits of physical address space. More recent Intel processors have 39, 42 or more physical address bits. The problem is that this limit applies not only to the real physical address space, but also to Extended Page Tables (EPT), which means the physical address space of virtual machines is limited too.

So, the problem is the virtual machine firmware does not know how much physical address space it actually has. When checking CPUID it gets back 40 bits, but it could very well be it actually has only 36 bits.

Traditional firmware behavior

To address that problem the virtual machine firmware was very conservative with address space usage, to avoid crossing the unknown limit.

OVMF used to have a MMIO window with fixed size (32GB), which was based on the first multiple of 32GB after normal RAM. So a typical, smallish virtual machine had 0 -> 32GB for RAM and 32GB -> 64GB for IO, staying below the limit for 36 bits of physical address space (which equals 64GB).

VMs having more than 30GB of RAM will need address space above 32GB for RAM, which pushes the IO window above the 64GB limit. The assumption that hosts which have enough physical memory to run such big virtual machines also have a physical address space larger than 64GB seems to have worked well enough.

Nevertheless the fixed 32G-sized IO window became increasingly problematic. Memory sizes are growing, not only for main memory, but also for device memory. GPUs have gigabytes of memory these days.

Config options in qemu

Qemu has three -cpu options to control the physical address space advertised to the guest, and has had them for quite a while already; example command lines follow the option descriptions below.

host-phys-bits={on,off}
When enabled qemu will use the host's physical address bits for the guest, i.e. the guest can see the actual limit. I recommend enabling this everywhere.
Upstream default: off (except for -cpu host where it is on).
Some downstream linux distro builds flip this to on by default.
host-phys-bits-limit=bits
Is used only with host-phys-bits=on. Can be used to reduce the number of physical address space bits communicated to the guest. Useful for live migration compatibility in case your machine cluster has machines with different physical address space sizes.
phys-bits=bits
Is used only with host-phys-bits=off. Can be used to set the number of physical address space bits to any value you want, including non-working values. Use only if you know what you are doing, it's easy to shoot yourself in the foot with this one.
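
As a hedged sketch of how these look on an actual command line (all other device and disk options are omitted, and the CPU model names are just examples):

$ qemu-system-x86_64 -accel kvm -cpu host,host-phys-bits=on ...
$ qemu-system-x86_64 -accel kvm -cpu Skylake-Server,host-phys-bits=on,host-phys-bits-limit=39 ...
$ qemu-system-x86_64 -accel kvm -cpu qemu64,phys-bits=40 ...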

Changes in OVMF

Recent OVMF versions (edk2-stable202211 and newer) try to figure out the size of the physical address space using a heuristic: in case the physical address space bits value received via CPUID is 40 or below, it is checked against known-good values, which are 36 and 39 for Intel processors and 40 for AMD processors. If that check passes, or the number of bits is 41 or higher, OVMF assumes qemu is configured with host-phys-bits=on and the value can be trusted.

In case there is no trustworthy phys-bits value OVMF will continue with the traditional behavior described above.

In case OVMF trusts the phys-bits value it will apply some OVMF-specific limitations before actually using it:

  • The concept of virtual memory does not exist in UEFI, so the firmware will identity-map everything. Without 5-level paging (which is not yet supported in OVMF) at most 128TB (phys-bits=47) can be identity-mapped, so OVMF can not use more than that.
    The actual limit is phys-bits=46 (64TB) for now due to older linux kernels (4.15) having problems if OVMF uses phys-bits=47.
  • In case gigabyte pages are not available OVMF will not use more than phys-bits=40 (1TB). This avoids high memory usage and long boot times due to OVMF creating lots of page tables for the identity mapping.

The final phys-bits value will be used to calculate the size of the physical address space available. The 64-bit IO window will be placed as high as possible, i.e. at the end of the physical address space. The size of the IO window and also the size of the PCI bridge windows (for prefetchable 64-bit bars) will be scaled up with the physical address space, i.e. on machines with a larger physical address space you will also get larger IO windows.

Changes in SeaBIOS

Starting with version 1.16.3 SeaBIOS uses a heuristic similar to OVMF's to figure out whether there is a trustworthy phys-bits value.

If that is the case SeaBIOS will enable the 64-bit IO window by default and place it at the end of the address space like OVMF does. SeaBIOS will also scale the size of the IO window with the size of the address space.

Although the overall behavior is similar there are some noteworthy differences:

  • SeaBIOS will not enable the 64-bit IO window in case there is no RAM above 4G, for better compatibility with old -- possibly 32-bit -- guests.
  • SeaBIOS will not enable the 64-bit IO window in case the CPU has no support for long mode (i.e. it is a 32-bit processor), likewise for better compatibility with old guests.
  • SeaBIOS will limit phys-bits to 46, similar to OVMF, likewise for better compatibility with old guests. SeaBIOS does not use paging though and does not care about support for gigabyte pages, so it will never limit phys-bits to 40.
  • SeaBIOS has a list of devices which will never be placed in the 64-bit IO window. This list includes devices where SeaBIOS drivers must be able to access the PCI bars. SeaBIOS runs in 32-bit mode so these PCI bars must be mapped below 4GB.

Changes in qemu

Starting with release 8.2 the firmware images bundled with upstream qemu are new enough to include the OVMF and SeaBIOS changes described above.

Live migration and changes in libvirt

The new firmware behavior triggered a few bugs elsewhere ...

When doing live migration the vcpu configuration on source and target host must be identical. That includes the size of the physical address space.

libvirt can calculate the cpu baseline for a given cluster, i.e. create a vcpu configuration which is compatible with all cluster hosts. That calculation did not include the size of the physical address space though.

With the traditional, very conservative firmware behavior this bug did not cause problems in practice, but with OVMF starting to use the full physical address space live migrations in heterogeneous clusters started to fail because of that.

In libvirt 9.5.0 and newer this has been fixed.

Troubleshooting tips

In general, it is a good idea to set the qemu config option host-phys-bits=on.

In case guests can't deal with PCI bars being mapped at high addresses the host-phys-bits-limit=bits option can be used to limit the address space usage. I'd suggest sticking to values seen in actual processors, so 40 for AMD and 39 for Intel are good candidates.

In case you are running 32-bit guests with a lot of memory (which btw isn't a good idea performance-wise) you might need to turn off long mode support to force the PCI bars to be mapped below 4G. This can be done by simply using qemu-system-i386 instead of qemu-system-x86_64, or by explicitly setting lm=off in the -cpu options.
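
A minimal sketch of the two alternatives (all other options omitted):

$ qemu-system-i386 ...
$ qemu-system-x86_64 -accel kvm -cpu host,host-phys-bits=on,lm=off ...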

by Gerd Hoffmann at November 30, 2023 11:00 PM

Daniel Berrange

ANNOUNCE: libvirt-glib release 5.0.0

I am pleased to announce that a new release of the libvirt-glib package, version 5.0.0, is now available from

https://libvirt.org/sources/glib/

The packages are GPG signed with

Key fingerprint: DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

Changes in this release:

  • Fix compatibility with libxml2 >= 2.12.0
  • Bump min libvirt version to 2.3.0
  • Bump min meson to 0.56.0
  • Require use of GCC >= 4.8 / CLang > 3.4 / XCode CLang > 5.1
  • Mark USB disks as removable by default
  • Add support for audio device backend config
  • Add support for DBus graphics backend config
  • Add support for controlling firmware feature flags
  • Improve compiler flag handling in meson
  • Extend library version script handling to FreeBSD
  • Fix pointer sign issue in capabilities config API
  • Fix compat with gnome.mkenums() in Meson 0.60.0
  • Avoid compiler warnings from gi-ir-scanner generated code by not setting glib version constraints
  • Be more robust about NULL GError parameters
  • Disable unimportant cast alignment compiler warnings
  • Use ‘pragma once’ in all header files
  • Updated translations

Thanks to everyone who contributed to this new release.

by Daniel Berrange at November 30, 2023 02:59 PM

October 20, 2023

KVM on Z

New Release: Ubuntu 23.10

Canonical released a new version of their Ubuntu server offering: Ubuntu Server 23.10!

See the announcement on the mailing list here, and the blog entry at Canonical with Z-specific highlights here.

by Stefan Raspl (noreply@blogger.com) at October 20, 2023 02:13 PM

New Release: Ubuntu 23.04

Canonical released a new version of their Ubuntu server offering: Ubuntu Server 23.04!

See the announcement on the mailing list here, and the blog entry at Canonical with Z-specific highlights here.

by Stefan Raspl (noreply@blogger.com) at October 20, 2023 02:11 PM

October 06, 2023

Daniel Berrange

Bye Bye BIOS: a tool for when you need to warn users the VM image is EFI only

The x86 platform has been ever so slowly moving towards a world where EFI is used to boot everything, with legacy BIOS put out to pasture. Virtual machines in general have been somewhat behind the cutting edge in this respect though. This has mostly been due to the virtualization and cloud platforms being somewhat slow in enabling use of EFI at all, let alone making it the default. In a great many cases the platforms still default to using BIOS unless explicitly asked to use EFI. With this in mind most of the mainstream distros tend to provide general purpose disk images built such that they can boot under either BIOS or EFI, thus adapting to whatever environment the user deploys them in.

In recent times there is greater interest in the use of TPM sealing and SecureBoot for protecting guest secrets (eg LUKS passphrases), the introduction of UKIs as the means to extend the SecureBoot signature to close the initrd/cmdline hole, and the advent of confidential virtualization technology. These all combine to increase the likelihood that a virtual machine image will exclusively target EFI, fully discontinuing support for legacy BIOS.

This presents a bit of a usability trapdoor for people deploying images though, as it has been taken for granted that BIOS boot always works. If one takes an EFI only disk image and attempts to boot it via legacy BIOS, the user is likely to get an entirely blank graphical display and/or serial console, with no obvious hint that EFI is required. Even if the requirement for EFI is documented, it is inevitable that users will make mistakes.

Can we do better than this ? Of course we can.

Enter ‘Bye Bye BIOS‘  (https://gitlab.com/berrange/byebyebios)

This is a simple command line tool that, when pointed at a disk image, will inject an MBR sector that prints out a message to the user on the primary VGA display and serial port informing them that UEFI is required, then puts the CPUs in a ‘hlt‘ loop.

The usage is as follows, with a guest serial port connected to the local terminal:

$ byebyebios test.img
$ qemu-system-x86_64 \
    -blockdev driver=file,filename=test.img,node-name=img \
    -device virtio-blk,drive=img \
    -m 2000 -serial stdio

STOP: Machine was booted from BIOS or UEFI CSM
 _    _         _   _ ___________ _____   ___
| \  | |       | | | |  ___|  ___|_   _| |__ \
|  \ | | ___   | | | | |__ | |_    | |      ) |
| . `  |/ _ \  | | | |  __||  _|   | |     / /
| |\   | (_) | | |_| | |___| |    _| |_   |_|
\_| \_/ \___/   \___/\____/\_|    \___/   (_)

Installation requires UEFI firmware to boot

Meanwhile the graphical console shows the same:

QEMU showing “No UEFI” message when booted from BIOS

The message shown here is a default, but it can be customized by pointing to an alternative message file

$ echo "Bye Bye BIOS" | figlet -f bubble | unix2dos > msg.txt
$ byebyebios --message msg.txt test.img
$ qemu-system-x86_64 \
    -blockdev driver=file,filename=test.img,node-name=img \
    -device virtio-blk,drive=img \
    -m 2000 -serial stdio

  _   _   _     _   _   _     _   _   _   _
 / \ / \ / \   / \ / \ / \   / \ / \ / \ / \
( B | y | e ) ( B | y | e ) ( B | I | O | S )
 \_/ \_/ \_/   \_/ \_/ \_/   \_/ \_/ \_/ \_/

The code behind this is simplicity itself, just a short piece of x86 asm

$ cat bootstub.S
# SPDX-License-Identifier: MIT-0

.code16
.global bye_bye_bios

bye_bye_bios:
  mov $something_important, %si
  mov $0xe, %ah
  mov $0x3f8,%dx

say_a_little_more:
  lodsb
  cmp $0, %al
  je this_is_the_end
  int $0x10
  outb %al,%dx
  jmp say_a_little_more

this_is_the_end:
  hlt
  jmp this_is_the_end

something_important:
# The string message will be appended here at time of install

This is compiled with the GNU assembler to create an i486 ELF object file

$ as -march i486 -mx86-used-note=no --32 -o bootstub.o bootstub.S

From this ELF object file we have to extract the raw machine code bytes

$ ld -m elf_i386 --oformat binary -e bye_bye_bios -Ttext 0x7c00 -o bootstub.bin bootstub.o

The byebyebios python tool takes this bootstub.bin, appends the text message and NUL terminator, padding to fill 446 bytes, then adds a dummy partition table and boot signature to fill the whole 512-byte sector.

With the boot stub binary at 21 bytes in size, this leaves 424 bytes available for the message to display to the user, which is ample for the purpose.
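
For reference, the arithmetic behind those numbers is just the classic MBR layout (the 21-byte and 424-byte figures are from the stub above):

  446 bytes   boot code: 21-byte stub + up to 424 bytes of message + NUL terminator
   64 bytes   partition table (4 entries of 16 bytes each)
    2 bytes   boot signature (0x55 0xAA)
  ---------
  512 bytes   one whole sector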

In conclusion, if you need to ship an EFI only virtual machine image, do your users a favour and use byebyebios to add a dummy MBR to tell them that the image is EFI only when they inevitably make a mistake and run it under legacy BIOS.

 

by Daniel Berrange at October 06, 2023 01:53 PM

October 03, 2023

Marcin Juszkiewicz

Testing *BSD on SBSA Reference Platform

The SystemReady specification mentions that a system to be certified needs to be able to boot several operating systems:

In addition, OS installation and boot logs are required:

  • Windows PE boot log, from a GPT partitioned disk, is required.
  • VMware ESXi-Arm installation and boot logs are recommended.
  • Installation and boot logs from two of the Linux distros or BSDs are required.

All logs must be submitted using the ES/SR template.

In choosing the Linux distros or BSDs, maximize the coverage by diversifying the heritage. For example, the following shows the grouping of the heritage:

  • RHEL/Fedora/CentOS/AlmaLinux/Rocky Linux/Oracle Linux/Anolis OS
  • SLES/openSUSE
  • Ubuntu/Debian
  • CBL-Mariner
  • NetBSD/OpenBSD/FreeBSD

So during the last week I went through the *BSD ones.
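
All tests below were run against the sbsa-ref machine. The invocations looked roughly like this; the firmware file names are placeholders for local EDK2/TF-A builds, see the Linaro documentation for how to produce them:

$ qemu-system-aarch64 -machine sbsa-ref -m 4G \
    -pflash SBSA_FLASH0.fd -pflash SBSA_FLASH1.fd \
    -drive file=miniroot73.img,format=raw \
    -serial stdio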

OpenBSD

Started with the “download OpenBSD” page and found out that there is no installation ISO for the aarch64 architecture. Not good.

So I fetched the miniroot73.img disk image instead and went on with booting:

>> OpenBSD/arm64 BOOTAA64 1.16
boot>
cannot open sd0a:/etc/random.seed: No such file or directory
booting sd0a:/bsd: 2798224+1058776+12709688+630920 [229059+91+651336+254968]=0x1
3ce628
FACP DBG2 MCFG SPCR APIC SSDT PPTT GTDT BGRT
Copyright (c) 1982, 1986, 1989, 1991, 1993
        The Regents of the University of California.  All rights reserved.
Copyright (c) 1995-2023 OpenBSD. All rights reserved.  https://www.OpenBSD.org

OpenBSD 7.3 (RAMDISK) #1941: Sat Mar 25 14:42:22 MDT 2023
    deraadt@arm64.openbsd.org:/usr/src/sys/arch/arm64/compile/RAMDISK
real mem  = 4287451136 (4088MB)
avail mem = 4073807872 (3885MB)
random: boothowto does not indicate good seed
mainbus0 at root: ACPI
psci0 at mainbus0: PSCI 1.1, SMCCC 1.2
cpu0 at mainbus0 mpidr 0: ARM Cortex-A57 r1p0
cpu0: 48KB 64b/line 3-way L1 PIPT I-cache, 32KB 64b/line 2-way L1 D-cache
cpu0: 2048KB 64b/line 16-way L2 cache
cpu0: CRC32,SHA2,SHA1,AES+PMULL,ASID16
efi0 at mainbus0: UEFI 2.7
efi0: EFI Development Kit II / SbsaQemu rev 0x10000
smbios0 at efi0: SMBIOS 3.4.0
smbios0: vendor EFI Development Kit II / SbsaQemu version "1.0" date 09/15/2023
smbios0: QEMU QEMU SBSA-REF Machine
agintc0 at mainbus0 shift 4:3 nirq 256 nredist 2: "interrupt-controller"
agtimer0 at mainbus0: 62500 kHz
acpi0 at mainbus0: ACPI 6.0
acpi0: tables DSDT FACP DBG2 MCFG SPCR APIC SSDT PPTT GTDT BGRT
acpimcfg0 at acpi0
acpimcfg0: addr 0xf0000000, bus 0-255
pluart0 at acpi0 COM0 addr 0x60000000/0x1000 irq 33
pluart0: console
ahci0 at acpi0 AHC0 addr 0x60100000/0x10000 irq 42: AHCI 1.0
ahci0: port 0: 1.5Gb/s
ahci0: port 1: 1.5Gb/s
scsibus0 at ahci0: 32 targets
sd0 at scsibus0 targ 0 lun 0: <ATA, QEMU HARDDISK, 2.5+> t10.ATA_QEMU_HARDDISK_QM00001_
sd0: 43MB, 512 bytes/sector, 88064 sectors, thin
sd1 at scsibus0 targ 1 lun 0: <ATA, QEMU HARDDISK, 2.5+> t10.ATA_QEMU_HARDDISK_QM00003_
sd1: 504MB, 512 bytes/sector, 1032192 sectors, thin
ehci0 at acpi0 USB0 addr 0x60110000/0x10000 irq 43panic: uvm_fault failed: ffffff800034c3e8 esr 96000050 far ffffff8066ef5048

The operating system has halted.
Please press any key to reboot.

As you can see it hung on an attempt to initialize the USB controller. Which shows that our move from EHCI to XHCI was not properly tested ;(

The problem was that our virtual hardware (QEMU) had an XHCI (USB 3) controller on the non-discoverable platform bus, but the firmware (EDK2) claimed it was an EHCI (USB 2) one.

This got solved with Yuquan Wang’s patch moving EDK2 to initialize and describe the XHCI USB controller (the change is already merged upstream). After rebuilding EDK2, OpenBSD booted fine right to the installation prompt (previous messages skipped):

xhci0 at acpi0 USB0 addr 0x60110000/0x10000 irq 43, xHCI 0.0
usb0 at xhci0: USB revision 3.0
uhub0 at usb0 configuration 1 interface 0 "Generic xHCI root hub" rev 3.00/1.00 addr 1
acpipci0 at acpi0 PCI0
pci0 at acpipci0
0:1:0: rom address conflict 0xfffc0000/0x40000
0:2:0: rom address conflict 0xffff8000/0x8000
"Red Hat Host" rev 0x00 at pci0 dev 0 function 0 not configured
em0 at pci0 dev 1 function 0 "Intel 82574L" rev 0x00: msi, address 52:54:00:12:34:56
"Bochs VGA" rev 0x02 at pci0 dev 2 function 0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
simplefb0 at mainbus0: 1280x800, 32bpp
wsdisplay0 at simplefb0 mux 1
wsdisplay0: screen 0 added (std, vt100 emulation)
uhidev0 at uhub0 port 1 configuration 1 interface 0 "QEMU QEMU USB Keyboard" rev 2.00/0.00 addr 2
uhidev0: iclass 3/1
ukbd0 at uhidev0
wskbd0 at ukbd0 mux 1
wskbd0: connecting to wsdisplay0
uhidev1 at uhub0 port 2 configuration 1 interface 0 "QEMU QEMU USB Tablet" rev 2.00/0.00 addr 3
uhidev1: iclass 3/0
uhid at uhidev1 not configured
softraid0 at root
scsibus1 at softraid0: 256 targets
root on rd0a swap on rd0b dump on rd0b
WARNING: CHECK AND RESET THE DATE!
erase ^?, werase ^W, kill ^U, intr ^C, status ^T

Welcome to the OpenBSD/arm64 7.3 installation program.
(I)nstall, (U)pgrade, (A)utoinstall or (S)hell?

After this I added booting OpenBSD to the QEMU tests for the SBSA Reference Platform to make sure that we have something non-Linux based there.

FreeBSD

The next one was FreeBSD. And here the situation started to get weird…

First I took the 13.2 release, used firmware with the XHCI information, and was greeted with:

Copyright (c) 1992-2021 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC arm64
FreeBSD clang version 14.0.5 (https://github.com/llvm/llvm-project.git llvmorg-14.0.5-0-gc12386ae247c)
VT(efifb): resolution 1280x800
module firmware already present!
real memory  = 4294967296 (4096 MB)
avail memory = 4160204800 (3967 MB)
Starting CPU 1 (1)
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
arc4random: WARNING: initial seeding bypassed the cryptographic random device because it was not yet seeded and the knob 'bypass_before_seeding' was enabled.
random: entropy device external interface
MAP 100fbdf0000 mode 2 pages 128
MAP 100fbe70000 mode 2 pages 160
MAP 100fbf10000 mode 2 pages 80
MAP 100fbfb0000 mode 2 pages 80
MAP 100ff500000 mode 2 pages 400
MAP 100ff690000 mode 2 pages 592
MAP 10000000 mode 0 pages 1728
MAP 60010000 mode 0 pages 1
kbd0 at kbdmux0
acpi0: <LINARO SBSAQEMU>
acpi0: Power Button (fixed)
acpi0: Sleep Button (fixed)
acpi0: Could not update all GPEs: AE_NOT_CONFIGURED
psci0: <ARM Power State Co-ordination Interface Driver> on acpi0
gic0: <ARM Generic Interrupt Controller v3.0> iomem 0x40060000-0x4007ffff,0x40080000-0x4407ffff on acpi0
its0: <ARM GIC Interrupt Translation Service> mem 0x44081000-0x440a0fff on gic0
generic_timer0: <ARM Generic Timer> irq 3,4,5 on acpi0
Timecounter "ARM MPCore Timecounter" frequency 62500000 Hz quality 1000
Event timer "ARM MPCore Eventtimer" frequency 62500000 Hz quality 1000
efirtc0: <EFI Realtime Clock>
efirtc0: registered as a time-of-day clock, resolution 1.000000s
uart0: <PrimeCell UART (PL011)> iomem 0x60000000-0x60000fff irq 0 on acpi0
uart0: console (115200,n,8,1)
ahci0: <AHCI SATA controller> iomem 0x60100000-0x6010ffff irq 1 on acpi0
ahci0: AHCI v1.00 with 6 1.5Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich4: <AHCI channel> at channel 4 on ahci0
ahcich5: <AHCI channel> at channel 5 on ahci0
xhci0: <Generic USB 3.0 controller> iomem 0x60110000-0x6011ffff irq 2 on acpi0
xhci0: 32 bytes context size, 32-bit DMA

And it hung there…

Let’s check newer FreeBSD

I contacted people on the #freebsd IRC channel, and Mina Galić (meena on IRC) asked me to boot the FreeBSD 14 or 15 images. So I tried both:

xhci0: <Generic USB 3.0 controller> iomem 0x60110000-0x6011ffff irq 2 on acpi0
xhci0: 32 bytes context size, 64-bit DMA
usbus0 on xhci0

The system booted further. Note the “64-bit DMA” information instead of the “32-bit DMA” from the 13.2 release. I reported bug 274237 for it. On the same day the required change was identified and marked for potential backport.

AHCI issue

But that was not the only problem. It turned out that none of the AHCI devices were found… so there was no way to run the installer:

ahci0: <AHCI SATA controller> iomem 0x60100000-0x6010ffff irq 1 on acpi0
ahci0: AHCI v1.00 with 6 1.5Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich4: <AHCI channel> at channel 4 on ahci0
ahcich5: <AHCI channel> at channel 5 on ahci0
[..]
Release APs...done
Trying to mount root from cd9660:/dev/iso9660/13_2_RELEASE_AARCH64_BO [ro]...
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
ahcich0: Poll timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 00000000 rs 00000002 tfd 170 serr 00000000 cmd 0000c017
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted

I checked QEMU 7.2 (from the Fedora package) and it booted fine. 8.0.5 failed, 8.0.0 booted. Hm… I started “git bisect” to find out which change broke it. After several rebuilds I found the commit to blame:

commit 7bcd32128b227cee1fb39ff242d486ed9fff7648
Author: Niklas Cassel <niklas.cassel@wdc.com>
Date:   Fri Jun 9 16:08:40 2023 +0200

    hw/ide/ahci: simplify and document PxCI handling

    The AHCI spec states that:
    For NCQ, PxCI is cleared on command queued successfully.

Is it AArch64 only or not?

The next step: checking whether it is a global problem or an aarch64-only one.

I built the x86-64 emulation component and checked the Q35 machine (which also uses AHCI). And FreeBSD failed in exactly the same way. This made bug reporting a lot easier, as there were several architectures and more users affected.

I mailed the author and the QEMU developers about it, described the problem, gave the exact command-line arguments for QEMU, etc. Niklas Cassel replied:

I will have a look at this.

So it will be done.

NetBSD

Here the situation was a bit similar to the FreeBSD one.

I fetched the NetBSD 9.3 image and booted it, just to see it hang (printk.time removed from the output):

Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
    2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,
    2018, 2019, 2020, 2021, 2022
    The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.

NetBSD 9.3 (GENERIC64) #0: Thu Aug  4 15:30:37 UTC 2022
 mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/evbarm/compile/GENERIC64
total memory = 4075 MB
avail memory = 3929 MB
running cgd selftest aes-xts-256 aes-xts-512 done
armfdt0 (root)
simplebus0 at armfdt0: QEMU QEMU SBSA-REF Machine
simplebus1 at simplebus0
acpifdt0 at simplebus0
acpifdt0: using EFI runtime services for RTC
ACPI: RSDP 0x00000100FC020018 000024 (v02 LINARO)
ACPI: XSDT 0x00000100FC02FE98 00006C (v01 LINARO SBSAQEMU 20200810 LNRO 00000001)
ACPI: FACP 0x00000100FC02FB98 000114 (v06 LINARO SBSAQEMU 20200810 LNRO 00000001)
ACPI: DSDT 0x00000100FC02E998 000CD8 (v02 LINARO SBSAQEMU 20200810 INTL 20220331)
ACPI: DBG2 0x00000100FC02FA98 00005C (v00 LINARO SBSAQEMU 20200810 LNRO 00000001)
ACPI: MCFG 0x00000100FC02FE18 00003C (v01 LINARO SBSAQEMU 20200810 LNRO 00000001)
ACPI: SPCR 0x00000100FC02FF98 000050 (v02 LINARO SBSAQEMU 20200810 LNRO 00000001)
ACPI: IORT 0x00000100FC027518 0000DC (v00 LINARO SBSAQEMU 20200810 LNRO 00000001)
ACPI: APIC 0x00000100FC02E498 000108 (v04 LINARO SBSAQEMU 20200810 LNRO 00000001)
ACPI: SSDT 0x00000100FC02E898 000067 (v02 LINARO SBSAQEMU 20200810 LNRO 00000001)
ACPI: PPTT 0x00000100FC02FD18 0000B8 (v02 LINARO SBSAQEMU 20200810 LNRO 00000001)
ACPI: GTDT 0x00000100FC02E618 000084 (v03 LINARO SBSAQEMU 20200810 LNRO 00000001)
ACPI: 2 ACPI AML tables successfully acquired and loaded
acpi0 at acpifdt0: Intel ACPICA 20190405
cpu0 at acpi0: unknown CPU (ID = 0x411fd402)
cpu0: package 0, core 0, smt 0
cpu0: IC enabled, DC enabled, EL0/EL1 stack Alignment check enabled
cpu0: Cache Writeback Granule 16B, Exclusives Reservation Granule 16B
cpu0: Dcache line 64, Icache line 64
cpu0: L1 0KB/64B 4-way PIPT Instruction cache
cpu0: L1 0KB/64B 4-way PIPT Data cache
cpu0: L2 0KB/64B 8-way PIPT Unified cache
cpu0: revID=0x0, 4k table, 16k table, 64k table, 16bit ASID
cpu0: auxID=0x1011111110212120, GICv3, CRC32, SHA1, AES+PMULL, rounding, NaN propagation, denormals, 32x64bitRegs, Fused Multiply-Add
cpu1 at acpi0: unknown CPU (ID = 0x411fd402)
cpu1: package 0, core 1, smt 0
gicvthree0 at acpi0: GICv3
gicvthree0: ITS #0 at 0x44081000
gicvthree0: ITS [#0] Devices table @ 0x10009210000/0x80000, Cacheable WA WB, Inner shareable
gicvthree0: ITS [#1] Collections table @ 0x10009290000/0x10000, Cacheable WA WB, Inner shareable

As the 9.3 release is quite old, I also tested the NetBSD 10 beta:

gicvthree0 at acpi0: GICv3
gicvthree0: ITS #0 at 0x44081000
gicvthree0: ITS [#0] Devices table @ 0x10008a60000/0x80000, Cacheable WA WB, Inner shareable
gicvthree0: ITS [#1] Collections table @ 0x10008ae0000/0x10000, Cacheable WA WB, Inner shareable
gtmr0 at acpi0: irq 27
armgtmr0 at gtmr0: Generic Timer (62500 kHz, virtual)
plcom0 at acpi0 (COM0, ARMH0011-0): mem 0x60000000-0x60000fff irq 33
plcom0: console
[..]
NetBSD-10.0_BETA Install System

I went to the #netbsd channel on IRC and started a discussion. Michael van Elst (mlelstv on irc) gave me a helping hand and debugged the problem. Looks like the kernel went into an infinite loop while parsing the GTDT table from ACPI. Newer branches of NetBSD have an additional check there.

Filed bug 57642 for it. And, like in the FreeBSD case, it looks like a backport to the stable branch is needed.

Summary

Testing platforms for SystemReady compliance needs to include *BSD systems. Linux and NetBSD were fine with our USB controller mess: they gave a “something is wrong” message and went on. FreeBSD and OpenBSD complained and stopped the boot process.

We also need to do more testing before merging big changes in the future. This USB controller mess could have been avoided or handled better.

by Marcin Juszkiewicz at October 03, 2023 06:46 PM

September 15, 2023

Marcin Juszkiewicz

SBSA Reference Platform update

Several changes have been made since my previous post on the topic, so after some discussions I decided to write a post about it.

There are improvements, fixes and even issues with the BSA specification.

Versioning related changes

The SBSA Reference Platform (“sbsa-ref” for short) is now at version 0.3. Note that this is an internal number; the machine name is still the same.

The first bump was adding GIC data into the (minimalistic) device-tree so the firmware can configure it without using any magic numbers (as it did before).

The second update added GIC ITS (Interrupt Translation Service) support, which means that we can have MSI-X interrupts and complex PCI Express setups.

The third time we said goodbye to the USB 2.0 (EHCI) host controller. It never worked and only generated kernel warnings. The XHCI (USB 3) controller is used instead now. EDK2 enablement is still work in progress.

Firmware improvements

Most of the versioning updates involved firmware changes. Information about hardware details gets passed from the virtual hardware level to the operating system in standard, well-defined ways:

  • Trusted Firmware (TF-A) gets minimalistic Device-Tree from QEMU
  • UEFI (EDK2) uses Secure Monitor Calls to get information from TF-A
  • operating system uses ACPI tables

This way we were able to get rid of some of the “magic numbers” in the firmware components.

CPU updates

We can use the Neoverse V1 CPU core now. It uses the Arm v8.4 architecture and brings SVE and a bunch of other interesting features. You may need to update Trusted Firmware to make use of it.
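
Selecting the new core is just a matter of the -cpu option; a rough sketch (firmware and disk options omitted):

$ qemu-system-aarch64 -machine sbsa-ref -cpu neoverse-v1 -m 4G ...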

QEMU also got Arm Cortex-A710 CPU core support. It is the first Arm v9.0 core there. Due to its 2^40 (40-bit) physical address space we cannot use it for sbsa-ref, but it prepares the code for the Neoverse N2/V2 cores.

PCI Express changes and disputes

The SBSA Reference Platform passes most of the BSA ACS tests from the PCI Express module:

      *** Starting PCIe tests ***

Operating System View:
 801 : Check ECAM Presence                        : Result:  PASS
 802 : PE - ECAM Region accessibility check       : Result:  PASS
 803 : All EP/Sw under RP in same ECAM Region     : Result:  PASS
 804 : Check RootPort NP Memory Access            : Result:  PASS
 805 : Check RootPort P Memory Access             : Result:  PASS
 806 : Legacy int must be SPI & lvl-sensitive
       Checkpoint --  2                           : Result:  SKIPPED
 808 : Check all 1's for out of range             : Result:  PASS
 809 : Vendor specfic data are PCIe compliant     : Result:  PASS
 811 : Check RP Byte Enable Rules                 : Result:  PASS
 817 : Check Direct Transl P2P Support
       Checkpoint --  1                           : Result:  SKIPPED
 818 : Check RP Adv Error Report
       Checkpoint --  1                           : Result:  SKIPPED
 819 : RP must suprt ACS if P2P Txn are allow
       Checkpoint --  1                           : Result:  SKIPPED
 820 : Type 0/1 common config rule                : Result:  PASS
 821 : Type 0 config header rules                 : Result:  PASS
 822 : Check Type 1 config header rules
       BDF 0x400 : SLT attribute mismatch: 0xFF020100 instead of 0x20100
       BDF 0x500 : SLT attribute mismatch: 0xFF030300 instead of 0x30300
       BDF 0x600 : SLT attribute mismatch: 0xFF040400 instead of 0x40400
       BDF 0x700 : SLT attribute mismatch: 0xFF050500 instead of 0x50500
       BDF 0x800 : SLT attribute mismatch: 0xFF060600 instead of 0x60600
       BDF 0x900 : SLT attribute mismatch: 0xFF080700 instead of 0x80700
       BDF 0x10000 : SLT attribute mismatch: 0xFF020201 instead of 0x20201
       Failed on PE -    0
       Checkpoint --  7                           : Result:  FAIL
 824 : Device capabilities reg rule               : Result:  PASS
 825 : Device Control register rule               : Result:  PASS
 826 : Device cap 2 register rules                : Result:  PASS
 830 : Check Cmd Reg memory space enable
       BDF 400 MSE functionality failure
       Failed on PE -    0
       Checkpoint --  1                           : Result:  FAIL
 831 : Check Type0/1 BIST Register rule           : Result:  PASS
 832 : Check HDR CapPtr Register rule             : Result:  PASS
 833 : Check Max payload size supported           : Result:  PASS
 835 : Check Function level reset                 : Result:  PASS
 836 : Check ARI forwarding enable rule           : Result:  PASS
 837 : Check Config Txn for RP in HB              : Result:  PASS
 838 : Check all RP in HB is in same ECAM         : Result:  PASS
 839 : Check MSI support for PCIe dev             : Result:  PASS
 840 : PCIe RC,PE - Same Inr Shareable Domain     : Result:  PASS
 841 : NP type-1 PCIe supp 32-bit only
       NP type-1 pcie is not 32-bit mem type
       Failed on PE -    0
       Checkpoint --  1                           : Result:  FAIL
 842 : PASID support atleast 16 bits
       Checkpoint --  3                           : Result:  SKIPPED

      One or more PCIe tests failed or were skipped.

     -------------------------------------------------------
     Total Tests run  =   30  Tests Passed  =   22  Tests Failed =    3
     -------------------------------------------------------

As you can see, some of them still require work.

Root ports SLT issue

I reported the problem with test 822 to the QEMU developers and it turned out to be a bug there. I got a patch from Michael S. Tsirkin (one of the QEMU PCI maintainers) and it made the test pass. I hope it will be merged soon.

PCIe to PCI bridge issue

I wonder how many SBSA physical platforms will use one of those. Probably none, but my testing setup has one.

And it makes test 841 fail. This time the problem requires more discussion, because the BSA specification says (chapter E.2, PCI Express Memory Space):

When PCI Express memory space is mapped as normal memory, the system must support unaligned accesses to that region. PCI Type 1 headers, used in PCI-to-PCI bridges, and therefore in root ports and switches, have to be programmed with the address space resources claimed by the given bridge. For non-prefetchable (NP) memory, Type 1 headers only support 32-bit addresses. This implies that endpoints on the other end of a PCI-to-PCI bridge only support 32-bit NP BARs.

On the other side we have the PCI Express Base Specification Revision 6.0 which, in chapter 7.5.1.2.1, says that a BAR can be either 32 or 64 bits wide:

Base Address registers that map into Memory Space can be 32 bits or 64 bits wide (to support mapping into a 64-bit address space) with bit 0 hardwired to 0b. For Memory Base Address registers, bits 2 and 1 have an encoded meaning as shown in Table 7-9. Bit 3 should be set to 1b if the data is prefetchable and set to 0b otherwise. A Function is permitted to mark a range as prefetchable if there are no side effects on reads, the Function returns all bytes on reads regardless of the byte enables, and host bridges can merge processor writes into this range without causing errors. Bits 3-0 are read-only.

Table 7-9 Memory Base Address Register Bits 2:1 Encoding

Bits 2:1(b) Meaning
00 Base register is 32 bits wide and can be mapped anywhere in the 32 address bit Memory Space.
01 Reserved
10 Base register is 64 bits wide and can be mapped anywhere in the 64 address bit Memory Space.
11 Reserved

And the pcie-pci-bridge device in QEMU uses a 64-bit BAR.

I opened a support ticket for it at Arm. We will see how it ends.

Non-secure EL2 virtual timer

The Arm v8.1 architecture brought the Virtual Host Extension (VHE for short), and it added one more timer: the non-secure EL2 virtual timer.

BSA ACS checks for it and we were failing:

 226 : Check NS EL2-Virt timer PPI Assignment         START

       NS EL2 Virtual timer interrupt 28 not received
       Failed on PE -    0
       B_PPI_02
       Checkpoint --  4                           : Result:  FAIL
         END

Turned out that everything needed to make it pass was already present in QEMU, except the code to enable it for our platform. Two lines of code were enough.

After I sent my small patch, Leif Lindholm extracted the timer definitions into a separate include file and cleaned up the code around it to make it easier to compare the QEMU code with the BSA specification.

Result? Test passes:

 226 : Check NS EL2-Virt timer PPI Assignment         START

       Received vir el2 interrupt
       B_PPI_02
                                       : Result:  PASS
         END

Summary

The SBSA Reference Platform in QEMU gets better and better with time. We can emulate more complex systems, and information about hardware details gets passed from the virtual hardware level to the operating system in standard, well-defined ways.

We still have test failures, but fewer than in the past.

by Marcin Juszkiewicz at September 15, 2023 01:30 PM

September 09, 2023

Stefan Hajnoczi

How nostr could enable peer-to-peer apps

I hacked up a prototype multi-player game in just static HTML/JS files. The game runs in players' browsers without the need for a centralized game server. This peer-to-peer model - getting rid of the server - is something I've been interested in for a long time. I finally discovered a way to make it work without hosting my own server or relying on a hosted service that requires API keys, accounts, or payments. That missing piece came in the form of nostr, a decentralized network protocol that I'll get into later.

Recently p2panda and Veilid were released. They are decentralized application frameworks. Neither has the exact properties I like, but that spurred me to work on a prototype game that shows the direction that I find promising for decentralized applications.

Distributed application models

Most distributed applications today are built on a centralized client-server model. Applications are not a single program, but two programs. A client application on the user's device communicates with a server application on the application owner's machines. The way it works is pretty simple: the server holds the data and the client sends commands to operate on the data.

The centralized client-server model is kind of a drag because you need to develop two separate programs and maintain a server so that the application remains online at all times. Non-technical users can't really host the application themselves. It costs money to run the server. If the application owner decides to pull the plug on the server then users cannot use the application anymore. Bad practices of locking in, hoarding, and selling user data as well as monitoring and manipulating user behavior are commonplace because the server controls access to user data.

Peer-to-peer applications solve many of these issues. The advantages are roughly:

  • Eliminating the cost, effort, and skill required to maintain servers.
  • Improving user privacy by not involving a server.
  • Operating without constant internet connectivity.
  • Enabling users to run the application even after the developer has stopped supporting it.
  • Reducing application complexity by replacing client/server with a single program.

How to make a peer-to-peer application

This needs to work for web, mobile, and desktop applications because people switch between these three environments all the time. It would be impractical if the solution does not support all environments. The web is the most restrictive environment, mostly for security reasons. Many technologies are not available on the web, including networking APIs that desktop peer-to-peer applications tend to rely on. But if a solution works on the web, then mobile and desktop applications are likely to be able to use the same technology and interoperate with web applications.

Luckily the web environment has one technology that can be used to build peer-to-peer applications: WebRTC. Implementations are available for mobile and desktop environments as well. WebRTC's DataChannels can be thought of as network connections that transfer messages between two devices. They are the primitive for communicating in a peer-to-peer application in place of the HTTPS, TCP, or UDP connections that most existing applications use today.

Unfortunately WebRTC is not fully peer-to-peer because it relies on a "signaling server" for connection establishment. The signaling server exchanges connectivity information so that a peer-to-peer connection can be negotiated. This negotiation process does not always succeed, by the way, so in some cases it is not possible to create a peer-to-peer connection. I have no solution for that without hosting servers.

The crux of using WebRTC is that a signaling server is needed, but we don't want to host one for each application. Over the years I've investigated existing peer-to-peer networks like Tor and WebTorrent to see if they could act as the signaling server. I didn't find one that is usable from the web environment (it's too restricted) until now.

It turns out that nostr, originally designed for social network applications but now being used for a bunch of different applications, is web-friendly and could act as a WebRTC signaling server quite easily. In my prototype I abused the encrypted direct message (NIP-04) feature for WebRTC signaling. It works but has the downside that the nostr relay wastes storage because there is no need to preserve the messages. That can be fixed by assigning an "ephemeral kind" so the relay knows it can discard messages after delivery.

(Another option is to build a free public WebRTC signaling service. Its design would be remarkably close to the nostr protocol, so I decided not to reinvent the wheel. If anyone wants to create a public service, let me know and I can share ideas and research.)

Once connectivity has been established via WebRTC, it's up to the application to decide how to communicate. It could be a custom protocol like the JSON messages that my prototype uses, it could be the nostr protocol, it could be HTTP, or literally anything.

The user experience

Here is how my game prototype works:

  1. Player A opens the game web page (just static files hosted on GitLab Pages) and clicks "Host" game.
  2. Player A shares the game link with player B.
  3. Player B opens the game link and uses nostr to exchange WebRTC signaling messages encrypted with the other player's public key.
  4. A WebRTC DataChannel is negotiated and nostr is no longer used once the peer-to-peer connection is established.
  5. The game proceeds with player A and B exchanging game messages over the DataChannel.

In order to connect apps, a user must share a public key with the other user. The public key allows the other user to connect. In my prototype the player hosting the game gets a URL that can be shared with the other player. When the other player visits the URL they will join the game because the public key is embedded in the URL. The role of the public key is similar to the idea behind INET256's "stable addresses derived from public keys".

When devices go offline it is no longer possible to connect to them. This is not a problem for short-lived use cases like playing a game of chess or synchronizing the state of an RSS reader application between a phone and a laptop. For long-lived use cases like a discussion forum or a team chat there are two options: a fully peer-to-peer replicated and eventually consistent data model or a traditional centralized server hosted on a supernode. Both of these options are possible.

Try it out

You can try out my prototype in your web browser. It's a 2-player tic-tac-toe game: https://gitlab.com/stefanha/tic-tac-toe-p2p/. If the game does not start, try it again (sorry, I hacked it up in a weekend and it's not perfect).

If you want to discuss or share other peer-to-peer application approaches, see my contact details here.

by Unknown (noreply@blogger.com) at September 09, 2023 01:02 AM

August 22, 2023

QEMU project

QEMU version 8.1.0 released

We’d like to announce the availability of the QEMU 8.1.0 release. This release contains 2900+ commits from 250 authors.

You can grab the tarball from our download page. The full list of changes is available in the changelog.

Highlights include:

  • VFIO: improved live migration support, no longer an experimental feature
  • GTK GUI now supports multi-touch events
  • ARM, PowerPC, and RISC-V can now use AES acceleration on host processor
  • PCIe: new QMP commands to inject CXL General Media events, DRAM events and Memory Module events
  • ARM: KVM VMs on a host which supports MTE (the Memory Tagging Extension) can now use MTE in the guest
  • ARM: emulation support for bpim2u (Banana Pi BPI-M2 Ultra) board and neoverse-v1 (Cortex Neoverse-V1) CPU
  • ARM: new architectural feature support for: FEAT_PAN3 (SCTLR_ELx.EPAN), FEAT_LSE2 (Large System Extensions v2), and experimental support for FEAT_RME (Realm Management Extensions)
  • Hexagon: new instruction support for v68/v73 scalar, and v68/v69 HVX
  • Hexagon: gdbstub support for HVX
  • MIPS: emulation support for Ingenic XBurstR1/XBurstR2 CPUs, and MXU instructions
  • PowerPC: TCG SMT support, allowing pseries and powernv to run with up to 8 threads per core
  • PowerPC: emulation support for Power9 DD2.2 CPU model, and perf sampling support for POWER CPUs
  • RISC-V: ISA extension support for BF16/Zfa, and disassembly support for Zcm/Zinx/XVentanaCondOps/Xthead
  • RISC-V: CPU emulation support for Veyron V1
  • RISC-V: numerous KVM/emulation fixes and enhancements
  • s390: instruction emulation fixes for LDER, LCBB, LOCFHR, MXDB, MXDBR, EPSW, MDEB, MDEBR, MVCRL, LRA, CKSM, CLM, ICM, MC, STIDP, EXECUTE, and CLGEBR(A)
  • SPARC: updated target/sparc to use tcg_gen_lookup_and_goto_ptr() for improved performance
  • Tricore: emulation support for TC37x CPU that supports ISA v1.6.2 instructions
  • Tricore: instruction emulation of POPCNT.W, LHA, CRC32L.W, CRC32.B, SHUFFLE, SYSCALL, and DISABLE
  • x86: CPU model support for GraniteRapids
  • and lots more…

Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!

August 22, 2023 04:42 PM

August 10, 2023

KVM on Z

2023 IBM TechXchange EMEA Client Workshop for Linux on IBM Z and LinuxONE

Interested in the latest news on Linux on IBM Z & LinuxONE? Come and meet us at the 2023 IBM TechXchange EMEA Client Workshop for Linux on IBM Z and LinuxONE on September 19-20 in Ehningen, Germany!

Register here.

by Stefan Raspl (noreply@blogger.com) at August 10, 2023 01:24 PM

July 10, 2023

KVM on Z

KVM in Linux Distributions in 2Q 2023

The second quarter of 2023 was quite productive in terms of new Linux distribution releases and the KVM-related features shipped in them. Here they are, in chronological order.

Ubuntu 23.04 

The most recent Ubuntu release contains the following new KVM-related functionality:
  • Interpretive vfio-pci support for ISM: Allows pass-through of ISM devices to KVM guests, enabling high-bandwidth and low-latency network communications using SMC-D.
  • Encrypted dump for Secure Execution: Enhances problem determination capabilities while not compromising the security of secure KVM guests.
  • Bus id for subchannels: Allows you to identify passthrough CCW devices by their device bus id in the host without jumping through hoops.
  • Driverctl now lists persisted overrides: Makes it easier to identify and manage passthrough devices (see the example below).
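
As an illustration (the PCI address is a placeholder), a persisted override is what you set when binding a device to vfio-pci for passthrough, and such overrides can now be listed directly:

$ driverctl set-override 0000:00:00.0 vfio-pci
$ driverctl list-overrides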

RHEL 8.8 and 9.2

While being distinct releases, both share the same set of new functionality, which is:
  • Interpretive vfio-pci support for ISM: Allows pass-through of ISM devices to KVM guests, enabling high-bandwidth and low-latency network communication using SMC-D.
  • Encrypted dump for Secure Execution: Enhances problem determination capabilities while not compromising the security of secure KVM guests.
  • Bus id for subchannels: Allows you to identify passthrough CCW devices by their device bus id in the host without jumping through hoops.
  • Dynamic configuration updates for vfio-ap: Allows you to hot plug and unplug Crypto domains of a Crypto passthrough configuration for running KVM guests.

SLES 15 SP5

Being a large service pack, there are numerous new features delivered:
  • Interpretive vfio-pci support for ISM: Allows pass-through of ISM devices to KVM guests, enabling high-bandwidth and low-latency network communication using SMC-D.
  • Encrypted dump for Secure Execution: Enhances problem determination capabilities while not compromising the security of secure KVM guests.
  • Bus id for subchannels: Allows you to identify passthrough CCW devices by their device bus id in the host without jumping through hoops.
  • Driverctl now lists persisted overrides: Makes it easier to identify and manage passthrough devices.
  • Persistent configuration for vfio-ap: The s390-tools command zdev can now be used to persist Crypto passthrough configurations.
  • Dynamic configuration updates for vfio-ap: Allows you to hot plug and unplug Crypto domains of a Crypto passthrough configuration for running KVM guests.
  • Remote attestation for Secure Execution: Provides cryptographic evidence of workload authenticity and integrity, facilitating integration into common Confidential Computing frameworks.
  • Support for long kernel command lines of up to 64 KB in length: Allows you, for example, to specify plenty of I/O devices.
...and other minor improvements

by Hendrik Brueckner (noreply@blogger.com) at July 10, 2023 02:49 PM

June 24, 2023

Thomas Huth

New KVM features in RHEL 9.2 and 8.8 on IBM Z

A couple of weeks ago, Red Hat Enterprise Linux 9.2 and Red Hat Enterprise Linux 8.8 have been released – time to look at the new features here with regard to KVM virtualization on IBM Z systems.

Rebased versions in RHEL 9.2

The KVM code in the 5.14-based kernel of RHEL 9.2 has been refreshed to the state of the upstream 6.0 kernel.

Additionally, many packages from the virtualization stack have been rebased in RHEL 9.2. The following versions are now available:

  • QEMU 7.2.0 (updated from 7.0.0 in RHEL 9.1)
  • libvirt 9.0.0 (updated from 8.5.0 in RHEL 9.1)
  • virt-install 4.1.0 (updated from 4.0.0 in RHEL 9.1)
  • libguestfs 1.48.4
  • libslirp 4.4.0

Speaking of libslirp, a new alternative to the “slirp” user-mode networking called passt has been added in RHEL 9.2 for the first time. It can be used by installing the “passt” package and adjusting the XML definition of your guest accordingly. “passt” should provide better performance than “slirp” and was designed with security in mind.
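
The exact XML depends on the libvirt version; with the libvirt 9.0.0 shipped here, switching a user-mode interface over to passt should look roughly like this (the virtio model is just an example):

  <interface type='user'>
    <backend type='passt'/>
    <model type='virtio'/>
  </interface>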

Besides the generic new features that are available thanks to the rebased packages in RHEL 9.2, there are also some cool new IBM Z-specific features which have been explicitly backported to the RHEL 9.2 and 8.8 code base:

Secure execution guest dump encryption with customer keys

When running secure guests it is of course normally not possible to dump the guest’s memory from the host (e.g. with virsh dump --memory-only) since the memory pages of the guest are not available to the host system anymore.

However, in some cases (e.g. when debugging a misbehaving or crashing kernel in the guest), the owner of the guest VM still might want to get a dump of the guest memory – just without providing it in clear text to the administrator of the host system. With RHEL 9.2 and 8.8, this is now possible on the new z16 mainframe. Please see the related documentation from IBM to learn how to set up such a dump.

Crypto passthrough hotplug

vfio-ap crypto adapters can now be hot-plugged into guests at runtime, too, which gives you more flexibility without the need to shut down your guests to change their configuration.
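
As a rough sketch of how such a hotplug looks (assuming a vfio-ap mediated device has already been created on the host; the UUID, file name and guest name are placeholders), the device is described in a small XML snippet and attached with virsh:

  <hostdev mode='subsystem' type='mdev' model='vfio-ap'>
    <source>
      <address uuid='669d9b23-fe1b-4ecb-be08-a2fabca99b71'/>
    </source>
  </hostdev>

$ virsh attach-device guest1 vfio-ap-device.xml --live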

Enhanced interpretation for PCI functions

The kernel code in RHEL 9.2 and 8.8 can now enable a new firmware/hardware feature of the recent IBM Z machines that can speed up the performance of passthrough PCI devices (more events can be handled within the guest, without intervention of the KVM hypervisor). Additionally, this now also allows passing ISM PCI devices through to KVM guests (which was not possible before).

June 24, 2023 10:45 AM

May 23, 2023

Marcin Juszkiewicz

Versioning of sbsa-ref machine

QEMU emulates several machines. One of them is “sbsa-ref”, which stands for SBSA Reference Platform. In simpler words: an Arm server.

In the past I worked on it whenever my help was needed. We have CI jobs which run some tests (SBSA ACS, BSA ACS) and do some checks to see where we stand with SBSA compliance.

Versioning?

One day there was a discussion that we needed a way to recognize variants of “sbsa-ref” in some sane way. The idea was to get rid of most of the hardcoded values and provide a way to pass data from QEMU up to the firmware.

We started by adding “platform version major/minor” fields to the DeviceTree, starting with “0.0” as the value. For some time nothing changed here, as some of the people working on the SBSA Reference Platform changed jobs and others worked on other parts of it.
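
If you want to look at those fields yourself, QEMU can dump the generated DeviceTree; the flash image names below are whatever your firmware build produced, and the property names are written from memory, so check them against your QEMU version:

$ qemu-system-aarch64 -machine sbsa-ref,dumpdtb=sbsa.dtb \
      -pflash SBSA_FLASH0.fd -pflash SBSA_FLASH1.fd
$ dtc -I dtb -O dts sbsa.dtb | grep machine-version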

Note that this is different from other QEMU targets. We do not go the “sbsa-ref-8.0”, “sbsa-ref-8.1” way, as that would add maintenance work without any gain for us.

During the last Linaro Connect we had some discussion on how we want to proceed, and some more afterwards (as not everyone got there, due to UK visa issues).

The plan

The plan is simple:

  • QEMU adds data into DeviceTree
  • TF-A parses DeviceTree and extracts data from it
  • TF-A provides Secure Monitor Calls (SMC) with data from DT
  • EDK2 uses SMC to gather data from TF-A
  • EDK2 creates ACPI tables
  • OS uses ACPI to get hardware information

Implementation

After settling on the plan I created a bunch of Jira tickets and started writing code. Some changes were new, some were adapted from our work-in-progress ones.

0.0: Platform version SMC

Trusted Firmware (TF-A) reads the DeviceTree from QEMU and provides the platform version (PV from now on) up to the firmware via an SMC. EDK2 reads it and does nothing with it (as expected).

0.1: GIC data SMC

The firmware now knows which platform version we are on, so it can act on it. We bump the value in QEMU and provide the Arm GIC addresses via further SMCs.

TF-A uses those values instead of hardcoded ones to initialize the GIC. Then EDK2 does the same.

If such firmware boots on an older QEMU, then the hardcoded values are used and the machine is still operational.

0.2: GIC ITS SMC

Here things start to get more interesting. We add Interrupt Translation Service (ITS) support to the GIC, which means we have LPIs, MSI(-X), etc. In other words: normal, working PCI Express with root ports, proper interrupts and so on.

From the code side it is like the previous step: QEMU adds the address to the DT, TF-A reads it and provides it via an SMC to EDK2.

If such firmware boots on an older QEMU, then the ITS is not initialized, as it was not present in a PV 0.0 system.

0.x: PCIe SMC

Normal PCI Express is present, so let's get rid of the hardcoded values. Similar steps and behaviour as above.

0.y: go PCIe!

At this step we have a normal, working PCI Express structure. So let's get rid of some platform devices and replace them with expansion cards:

  • AHCI (SATA storage)
  • EHCI (USB 2.0)

We can use an “ich9-ahci” card instead of the former and “qemu-xhci” for the latter.
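
Both replacements are ordinary PCI Express cards that QEMU users can already plug into a machine today, roughly like this (the disk image name is a placeholder and the rest of the command line is omitted):

$ qemu-system-aarch64 ... \
      -device qemu-xhci \
      -device ich9-ahci,id=ahci \
      -drive if=none,id=disk0,file=disk.img,format=raw \
      -device ide-hd,drive=disk0,bus=ahci.0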

This step is EDK2-only, as we do not touch those parts in TF-A. There is no real code yet, as it needs some conditions added to the existing ASL code so that the operating system will not get that information in the DSDT table.

Again: if booted on a lower PV, the hardcoded values are used.

Other changes

Recently some additional changes to “sbsa-ref” were merged.

We exchanged the graphics card from the basic VGA one on the legacy PCI bus to the Bochs one (which uses PCI Express). From the firmware or OS point of view not much changed, as both were already supported.

The other change was the default processor. QEMU 8.0 brought emulation of the Arm Neoverse-N1 CPU. It has been enabled in TF-A for a while, so we switched to using it by default (instead of the ancient Cortex-A57). With the move from Armv8.0 to Armv8.2 we got several new CPU features and functionalities.

Future

The above steps are cleanups preparing “sbsa-ref” for future work. We want to be able to change the hardware definition more, for example to select an exact GIC model (like GIC-600) instead of the generic “arm-gic-v3” one.

The SBSA Reference Platform is a system where most expansion is expected to happen by adding PCI Express cards.

by Marcin Juszkiewicz at May 23, 2023 10:43 AM

KVM on Z

WARNING: Updating from RHEL8.6 to RHEL8.7/RHEL8.8 may break Network Access with RoCE Express Adapters

RoCE interfaces may lose their IP settings due to an unexpected change of the network interface name.

The RoCE Express adapters can lose their IP settings due to an unexpected change of the network interface name if both of the following conditions are met:

  • User upgrades from a RHEL 8.6 system or earlier.
  • The RoCE card is enumerated by UID.

To work around this problem:

Create the file /etc/systemd/network/98-rhel87-s390x.link with the following content:

[Match]
Architecture=s390x
KernelCommandLine=!net.naming-scheme=rhel-8.7

[Link]
NamePolicy=kernel database slot path
AlternativeNamesPolicy=database slot path
MACAddressPolicy=persistent

After rebooting the system for the changes to take effect, you can safely upgrade to RHEL 8.7 or later.

Note:
RoCE interfaces that are enumerated by function ID (FID, indicated by the ens prefix in their interface names) are not unique and not affected by this issue. Set the kernel parameter net.naming-scheme=rhel-8.7 to switch to predictable interface names with the eno prefix. See the Networking with RoCE Express book for further details.
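
On RHEL the kernel parameter can be added persistently with grubby, for example (run as root; shown here only as an illustration):

$ grubby --update-kernel=ALL --args="net.naming-scheme=rhel-8.7"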

by Stefan Raspl (noreply@blogger.com) at May 23, 2023 09:08 AM

April 28, 2023

KVM on Z

2023 Client Workshop for Linux on IBM Z and LinuxONE North America

Join us at the 2023 Client Workshop for Linux on IBM Z and LinuxONE for North America in-person at the IBM Innovation Studio in Poughkeepsie, NY!
 
Go here to register.

Dates and Times

  • Dates: Wednesday, May 24th and Thursday, May 25th
  • Times: 9:30 AM until 5:30 PM local time on both days

Address

  • IBM Poughkeepsie
  • 705 Development Court
  • Poughkeepsie, NY

General Information 

On May 24th and May 25th, you will have the opportunity to get the latest news and technical information on Hybrid Cloud with Linux on IBM zSystems, LinuxONE, KVM, and z/VM. The training will be delivered onsite at the IBM Innovation Studio Poughkeepsie. You will have the chance to interact directly with our IBM experts and join various small workgroup sessions on these two days.

Enrollment and Costs/Fees 

The class is free of charge. Travel & living expenses are on your own.

Target Audience

Linux on IBM zSystems and LinuxONE Clients, Partners and Software Vendors, and IBM zSystems Technical Sales

Topic Highlights

  • Hybrid Cloud on IBM zSystems with Linux and z/OS
  • Getting the most out of the Latest Features in Linux on IBM Z and LinuxONE
  • The Gravity of Data - Co-locating z/OS and Linux
  • IBM LinuxONE Security and Compliance Center Overview & Demo
  • Common z/VM Hurdles and How to Overcome Them
  • Update on z/VM Express System Installation
  • z/VM Security Session
  • IBM Z next middleware HW overview
  • Quantum Deep Dive
  • Red Hat OpenShift, Storage and Solutioning Values
  • Turbonomic resource management with OpenShift on IBM zSystems
  • Running Databases on LinuxONE
  • Confidential computing - Confidential Containers
  • Cloud Native DevSecOps using OpenShift Pipelines on IBM LinuxONE
  • Various Workgroup Sessions

by Stefan Raspl (noreply@blogger.com) at April 28, 2023 09:25 AM

April 27, 2023

Stefan Hajnoczi

libblkio 1.3.0 is out

The 1.3.0 release of the libblkio high-performance block device I/O library is out. libblkio provides an API that abstracts various storage interfaces that are efficient but costly to integrate into applications including io_uring, NVMe uring_cmd, virtio-blk-pci, vdpa-blk, and more. Switching between them is very easy using libblkio and gives you flexibility to target both kernel and userspace storage interfaces in your application.

Linux packaging work has progressed over the past few months. Arch Linux, Fedora, and CentOS Stream now carry libblkio packages and more will come in the future. This makes it easier to use libblkio in your applications because you don't need to compile it from source yourself.

In this release the vdpa-blk storage interface support has been improved. vdpa-blk is a virtio-blk-based storage interface designed for hardware implementation, typically on Data Processing Unit (DPU) PCIe adapters. Applications can use vdpa-blk to talk directly to the hardware from userspace. This approach can be used either as part of a hypervisor like QEMU or simply to accelerate I/O-bound userspace applications. QEMU uses libblkio to make vdpa-blk devices available to guests.

The downloads and release notes are available here.

by Unknown (noreply@blogger.com) at April 27, 2023 08:58 PM

April 20, 2023

QEMU project

QEMU version 8.0.0 released

We’d like to announce the availability of the QEMU 8.0.0 release. This release contains 2800+ commits from 238 authors.

You can grab the tarball from our download page. The full list of changes is available in the changelog.

Highlights include:

  • ARM: emulation support for FEAT_EVT, FEAT_FGT, and AArch32 ARMv8-R
  • ARM: CPU emulation for Cortex-A55 and Cortex-R52, and new Olimex STM32 H405 machine type
  • ARM: gdbstub support for M-profile system registers
  • HPPA: fid (Floating-Point Identify) instruction support and 32-bit emulation improvements
  • RISC-V: additional ISA and Extension support for smstateen, native debug icount trigger, cache-related PMU events in virtual mode, Zawrs/Svadu/T-Head/Zicond extensions, and ACPI support
  • RISC-V: updated machine support for OpenTitan, PolarFire, and OpenSBI
  • RISC-V: wide ranges of fixes covering PMP propagation for TLB, mret exceptions, uncompressed instructions, and other emulation/virtualization improvements
  • s390x: improved zPCI passthrough device handling
  • s390x: support for asynchronous teardown of memory of secure KVM guests during reboot
  • x86: support for Xen guests under KVM with Linux v5.12+
  • x86: new SapphireRapids CPU model
  • x86: TCG support for FSRM, FZRM, FSRS, and FSRC CPUID flags
  • virtio-mem: support for using preallocation in conjunction with live migration
  • VFIO: experimental migration support updated to v2 VFIO migration protocol
  • qemu-nbd: improved efficiency over TCP and when using TLS
  • and lots more…

Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!

April 20, 2023 06:53 PM

April 17, 2023

Marcin Juszkiewicz

“Ten” years at Linaro

Some time ago I reached “ten” years at Linaro. Why the quotation marks? Because it was 3 + 7 rather than 10 years straight: the first three years as a Canonical contractor, now seven years as a Red Hat employee assigned as a Member Engineer.

My first three years at Linaro

NewCo or NewCore? Or Ubuntu on ARM?

In 2010 I signed a contract with Canonical as a “Foundation OS Engineer”. Once there, I signed another paper which moved me to the NewCo project (also called NewCore, but NewCo is the name on the paper I signed).

On 30th April 2010 I got “Welcome to Linaro” e-mail.

Then UDS-M happened, where we were hiding under the “Ubuntu on Arm” name (despite the fact that Ubuntu already had such a team).

Ah, it is Linaro now :D

On 3rd June 2010 Linaro was officially announced. No more hiding, we went public with the name.

My team

I became a part of the Developer Platform team and worked mostly on toolchain packages for Ubuntu and later Debian. There were funny moments at sprints/summits/connects when I joked that I had two managers to listen to (one from the Toolchain Working Group and one from Developer Platform).

There were Debian/Ubuntu developers there and people from other environments.

At some point it was renamed to “Base and Baselines”. Or “Bed and Breakfast”, as most of the time “BB” was used instead of the full name.

AArch64 bring up

In 2012 I dusted off my OpenEmbedded knowledge and started working on the AArch64 architecture bring-up. Lots of not-yet-public patches were in use. The fun of seeing the “Hello world!” message in an emulator, printed by an OS image I had built from scratch, is something I hope to never forget.

Each time I am choosing a mug for my coffee I see the Pac-Man one I bought during the Linaro/Arm AArch64 sprint we had in October 2012.

My AArch64 mug

The End

There was a lot of noise in 2011 about the deal between Canonical and Linaro. Several engineers at Linaro were from Canonical and there was some messy situation related to money.

It ended with people leaving every quarter. Some moved back to Canonical, some changed jobs and got hired directly by Linaro. There were also people who moved to Linaro member companies and stayed in their Linaro positions. Some people left both companies and went to other jobs.

I was supposed to leave Linaro in 2012 but it was postponed by half a year. So I left after about 37 months.

Second round

Time passed, and I was working at Red Hat on making AArch64 a first-class citizen in the RHEL and Fedora Linux distributions. Then one day my manager asked whether I wanted to work at Linaro again.

I did some research, discussed it with friends at Linaro, and on 8th April 2016 I was back.

My team II

This time I became part of LEG: the Linaro Enterprise Group. Servers, data centres, etc. There were several teams to choose from and I ended up in the SDI (Software Defined Infrastructure) one. We were behind the LDC (Linaro Developer Cloud) project.

At some point LEG became LDCG (Linaro Datacenter and Cloud Group). Some years later LDCG lost the “and Cloud” part, as AWS and other cloud providers started offering AArch64 systems, so we did not have to deal with it any more.

OpenStack all over

The first version of LDC was Debian-based with OpenStack ‘Liberty’. Then came weird months when we had to reinvent the deployment several times. It was a mess most of the time.

So we abandoned our own solutions and went with OpenStack Kolla. I quickly became one of the core developers there. LDC moved to being container-based.

In 2022 we stopped working on OpenStack. LDC is now used only for internal projects.

Building stuff

With my “give me software and I will build it” mantra I also ended up as a kind of CI-jobs developer for the LEG teams: Apache Bigtop, Apache Arrow, TensorFlow, EDK2 and several other projects. Some used containers, some ran shell scripts, some Ansible.

SBSA Reference Platform

I was involved in some of the work around getting QEMU to emulate the SBSA Reference Platform (the “sbsa-ref” machine). I created some CI jobs to run test suites, build firmware images, etc. After running the test suites I created a bunch of issues in Jira so we could track how things go.

During recent months I have become more involved. I am testing patches, running them through both the SBSA and BSA Arm Compliance Suites, and reporting the results.

I have my own set of scripts to handle the logs, to make it easier to track where things stand now.

Summary

At past Linaro Connect events there was always a moment when they announced people who had worked at Linaro for 5 (or, later, 10) years. I have to admit: I felt envious several times.

And when I hit 5 years straight at Linaro we had the COVID-19 pandemic, so there was no event.

by Marcin Juszkiewicz at April 17, 2023 10:36 AM

Powered by Planet!
Last updated: March 19, 2024 08:05 AM