Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

September 13, 2019

KVM on Z

IBM z15 announced

Today, IBM announced the new IBM Z models.
Furthermore, check the updated IBM Z tested platforms matrix here.
We will look at features in support of the new IBM Z model in a separate blog entry soon.

by Stefan Raspl (noreply@blogger.com) at September 13, 2019 09:42 AM

September 02, 2019

KVM on Z

virt-manager 2.2 released

virt-manager v2.2 was released a while ago. One feature we contributed is the ability to choose a temporary boot device. Here is a quick write-up on how to use that feature.

virt-xml is a simple command line tool for editing domain definitions. It can be used interactively or for batch processing. Starting with virt-manager v2.2, virt-xml can boot a guest with a temporarily changed domain definition, i.e. with a boot configuration other than the one in the guest's current domain definition. This is especially useful as the IBM Z architecture allows for only a single boot device, and therefore the boot order settings do not work the way they do on other platforms: If the first boot device fails to boot, no attempt is made to boot from the next boot device. In addition, the architecture/BIOS has no support for interactively changing the boot device during the boot/IPL process.
Therefore, two new command line options were introduced:
  • --no-define makes any changes to the domain definition transient (i.e. the guest's persistent domain XML will not be modified)
  • --start allows the user to start the domain after the changes to the domain XML were applied
Here is a simple example illustrating the usage:
  1. First, select the device to be changed using a selector. In this example, the unique target name of the disk is used. See man virt-xml for a list of further possibilities.
  2. Temporarily modify the boot order, assign the first slot to device vdc, and start the guest right away:

      $ virt-xml --edit target='vdc' --disk='boot_order=1' --start \
                 --no-define sample_domain

    Note: If there was another device that already had boot_order=1, its boot order would be incremented.
As soon as the guest is stopped, the changes will vanish.
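
To verify that the change really is transient, you can compare the live and the persistent domain XML. Here is a minimal sketch using standard virsh commands (sample_domain and vdc are the names from the example above):

      $ virsh dumpxml sample_domain | grep -B2 "boot order"
      $ virsh dumpxml --inactive sample_domain | grep -B2 "boot order"

The first command shows the boot order element on the vdc disk while the guest is running; the persistent view from the second command should not contain it (assuming the persistent definition did not set a per-device boot order to begin with).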

    by Stefan Raspl (noreply@blogger.com) at September 02, 2019 03:22 PM

    August 29, 2019

    KVM on Z

    Webinar: How to Virtualize with KVM in Live Demo, August 28

    Abstract
    We will explain basic KVM concepts, including CPU and memory virtualization, storage and network management, as well as give a brief overview of commonalities and differences with other virtualization environments. Furthermore, a live demo will show how to use the KVM management tools to create and install Linux guests, and how to operate and monitor them.

    Speaker
    Christian Bornträger, Chief Product Owner Linux and KVM on IBM Z.

    Registration
    Register here. You can check the system requirements here.
    After registering, you will receive a confirmation email containing information about joining the webinar.

    Replay & Archive
    All sessions are recorded. For the archive as well as a replay and handout of this session and all previous webcasts see here.

    by Stefan Raspl (noreply@blogger.com) at August 29, 2019 12:10 PM

    August 19, 2019

    KVM on Z

    QEMU v4.1 released

    QEMU v4.1 is out. For highlights from a KVM on Z perspective see the Release Notes.
    Note: The DASD IPL feature is still considered experimental.

    by Stefan Raspl (noreply@blogger.com) at August 19, 2019 01:14 PM

    August 16, 2019

    QEMU project

    QEMU version 4.1.0 released

    We would like to announce the availability of the QEMU 4.1.0 release. This release contains 2000+ commits from 176 authors.

    You can grab the tarball from our download page. The full list of changes is available in the Wiki.

    Highlights include:

    • ARM: FPU emulation support for Cortex-M CPUs, FPU fixes for Cortex-R5F
    • ARM: ARMv8.5-RNG extension support for CPU-generated random numbers
    • ARM: board build options now configurable via new Kconfig-based system
    • ARM: Exynos4210 SoC model now supports PL330 DMA controllers
    • MIPS: improved emulation performance of numerous MSA instructions, mostly integer and data permuting operations
    • MIPS: improved support for MSA ASE instructions on big-endian hosts, handling for ‘division by zero’ cases now matches reference hardware
    • PowerPC: pseries: support for NVIDIA V100 GPU/NVLink2 passthrough via VFIO
    • PowerPC: pseries: in-kernel acceleration for XIVE interrupt controller
    • PowerPC: pseries: support for hot-plugging PCI host bridges
    • PowerPC: emulation optimizations for vector (Altivec/VSX) instructions
    • RISC-V: support for new “spike” machine model
    • RISC-V: ISA 1.11.0 support for privileged architectures
    • RISC-V: improvements for 32-bit syscall ABI, illegal instruction handling, and built-in debugger
    • RISC-V: support for CPU topology in device trees
    • s390: bios support for booting from ECKD DASD assigned to guest via vfio-ccw
    • s390: emulation support for all “Vector Facility” instructions
    • s390: additional facilities and support for gen15 machines, including support for AP Queue Interruption Facility for using interrupts for vfio-ap devices
    • SPARC: sun4m/sun4u: fixes when running with -vga none (OpenBIOS)
    • x86: emulation support for new Hygon Dhyana and Intel SnowRidge CPU models
    • x86: emulation support for RDRAND extension
    • x86: md-clear/mds-no feature flags, for detection/mitigation of MDS vulnerabilities (CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091)
    • x86: CPU die topology now configurable using -smp …,dies=
    • Xtensa: support for memory protection unit (MPU) option
    • Xtensa: support for Exclusive Access option
    • GUI: virtio-gpu 2D/3D rendering may now be offloaded to an external vhost-user process, such as QEMU vhost-user-gpu
    • GUI: semihosting output can now be redirected to a chardev backend
    • qemu-img: added a --salvage option to qemu-img convert, which prevents the conversion process from aborting on I/O errors (can be used for example to salvage partially corrupted qcow2 files)
    • qemu-img: qemu-img rebase works now even when the input file doesn’t have a backing file yet
    • VMDK block driver now has read-only support for the seSparse subformat
    • GPIO: support for SiFive GPIO controller
    • and lots more…

    Thank you to everyone involved!

    August 16, 2019 05:50 AM

    August 09, 2019

    KVM on Z

    New Documentation: Configuring Crypto Express Adapters for KVM Guests


    See here for a new documentation release on how to configure Crypto Express adapters for KVM guests.

    by Stefan Raspl (noreply@blogger.com) at August 09, 2019 02:49 PM

    August 07, 2019

    Daniel Berrange

    ANNOUNCE: gtk-vnc 1.0.0 release

    I’m pleased to announce a new release of GTK-VNC, version 1.0.0.

    https://download.gnome.org/sources/gtk-vnc/1.0/gtk-vnc-1.0.0.tar.xz (211K)
    sha256sum: a81a1f1a79ad4618027628ffac27d3391524c063d9411c7a36a5ec3380e6c080
    

    Pay particular attention to the first two major changes in
    this release:

    • Autotools build system replaced with meson
    • Support for GTK-2 is dropped. GTK-3 is mandatory
    • Support for libview is dropped in example program
    • Improvements to example demos
    • Use MAP_ANON if MAP_ANONYMOUS doesn’t exist to help certain macOS versions
    • Fix crash when connection attempt fails early
    • Initialize gcrypt early in auth process
    • Emit vnc-auth-failure signal when SASL auth fails
    • Emit vnc-error signal when authentication fails
    • Fix double free when failing to read certificates
    • Run unit tests in RPM build
    • Modernize RPM spec
    • Fix race condition in unit tests
    • Fix install of missing header for cairo framebuffer
    • Fix typo in gir package name
    • Add missing VncPixelFormat file to gir data

    Thanks to all those who reported bugs and provided patches that went into this new release.

    by Daniel Berrange at August 07, 2019 03:06 PM

    August 05, 2019

    Stefan Hajnoczi

    Determining why a Linux syscall failed

    One is often left wondering what caused an errno value when a system call fails. Figuring out the reason can be tricky because a single errno value can have multiple causes. Applications get an errno integer and no additional information about what went wrong in the kernel.

    There are several ways to determine the reason for a system call failure (from easiest to most involved):

    1. Check the system call's man page for the meaning of the errno value. Sometimes this is enough to explain the failure.
    2. Check the kernel log using dmesg(1). If something went seriously wrong (like a hardware error) then there may be detailed error information. It may help to increase the kernel log level (see the sketch after this list).
    3. Read the kernel source code to understand various error code paths and identify the most relevant one.
    4. Use the function graph tracer to see which code path was taken.
    5. Add printk() calls, recompile the kernel (module), and rerun to see the output.
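
    For instance, raising the console log level (with dmesg -n, or by writing to /proc/sys/kernel/printk) before reproducing the failure can surface messages that would otherwise be filtered out. A minimal sketch using standard dmesg(1) options:

    # dmesg -n debug
    ...reproduce the failing system call, then inspect the log...
    # dmesg --level=err,warn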

    Reading the man page and checking dmesg(1) are fairly easy for application developers and do not require knowledge of kernel internals. If this does not produce an answer then it is necessary to look closely at the kernel source code to understand a system call's error code paths.

    This post discusses the function graph tracer and how it can be used to identify system call failures without recompiling the kernel. This is useful because running a custom kernel may not be possible (e.g. due to security or reliability concerns) and recompiling the kernel is slow.

    An example

    In order to explore some debugging techniques let's take the io_uring_setup(2) system call as an example. It is failing with ENOMEM but the system is not under memory pressure, so ENOMEM is not expected.

    The io_uring_setup(2) source code (fs/io_uring.c) contains many ENOMEM locations but it is not possible to conclusively identify which one is failing. The next step is to determine which code path is taken using dynamic instrumentation.

    The function graph tracer

    The Linux function graph tracer records kernel function entries and returns so that function call relationships are made apparent. The io_uring_setup(2) system call is failing with ENOMEM but it is unclear at which point in the system call this happens. It is possible to find the answer by studying the function call graph produced by the tracer and following along in the Linux source code.

    Since io_uring_setup(2) is a system call it's not an ordinary C function definition and has a special symbol name in the kernel ELF file. It is possible to look up the (architecture-specific) symbol for the currently running kernel:

    # grep io_uring_setup /proc/kallsyms
    ...
    ffffffffbd357130 T __x64_sys_io_uring_setup

    Let's trace all __x64_sys_io_uring_setup calls:

    # cd /sys/kernel/debug/tracing
    # echo '__x64_sys_io_uring_setup' > set_graph_function
    # echo 'function_graph' >current_tracer
    # cat trace_pipe >/tmp/trace.log
    ...now run the application in another terminal...
    ^C
    The trace contains many successful io_uring_setup(2) calls that look like this:
     1)               |  __x64_sys_io_uring_setup() {
     1)               |    io_uring_setup() {
     1)               |      capable() {
     1)               |        ns_capable_common() {
     1)               |          security_capable() {
     1)   0.199 us    |            cap_capable();
     1)   7.095 us    |          }
     1)   7.594 us    |        }
     1)   8.059 us    |      }
     1)               |      kmem_cache_alloc_trace() {
     1)               |        _cond_resched() {
     1)   0.244 us    |          rcu_all_qs();
     1)   0.708 us    |        }
     1)   0.208 us    |        should_failslab();
     1)   0.220 us    |        memcg_kmem_put_cache();
     1)   2.201 us    |      }
     ...
     1)               |      fd_install() {
     1)   0.223 us    |        __fd_install();
     1)   0.643 us    |      }
     1) ! 190.396 us  |    }
     1) ! 216.236 us  |  }

    Although the goal is to understand system call failures, looking at a successful invocation can be useful too. Failed calls in trace output can be identified on the basis that they differ from successful calls. This knowledge can be valuable when searching through large trace files. A failed io_uring_setup(2) call aborts early and does not invoke fd_install(). Now it is possible to find a failed call amongst all the io_uring_setup(2) calls:

     2)               |  __x64_sys_io_uring_setup() {
     2)               |    io_uring_setup() {
     2)               |      capable() {
     2)               |        ns_capable_common() {
     2)               |          security_capable() {
     2)   0.236 us    |            cap_capable();
     2)   0.872 us    |          }
     2)   1.419 us    |        }
     2)   1.951 us    |      }
     2)   0.419 us    |      free_uid();
     2)   3.389 us    |    }
     2) + 48.769 us   |  }

    The fs/io_uring.c code shows the likely error code paths:

        account_mem = !capable(CAP_IPC_LOCK);

        if (account_mem) {
                ret = io_account_mem(user,
                        ring_pages(p->sq_entries, p->cq_entries));
                if (ret) {
                        free_uid(user);
                        return ret;
                }
        }

        ctx = io_ring_ctx_alloc(p);
        if (!ctx) {
                if (account_mem)
                        io_unaccount_mem(user, ring_pages(p->sq_entries,
                                                p->cq_entries));
                free_uid(user);
                return -ENOMEM;
        }

    But is there enough information in the trace to determine which of these return statements is executed? The trace shows free_uid(), so we can be confident that both these code paths are valid candidates. Looking back at the success code path, we can use kmem_cache_alloc_trace() as a landmark: it is called by io_ring_ctx_alloc(), so we should see kmem_cache_alloc_trace() in the trace before free_uid() if the second return statement were taken. Since it does not appear in the trace output, we conclude that the first return statement is being taken!

    When trace output is inconclusive

    Function graph tracer output only shows functions in the ELF file. When the compiler inlines code, no entry or return is recorded in the function graph trace. This can make it hard to identify the exact return statement taken in a long function. Functions containing few function calls and many conditional branches are also difficult to analyze from just a function graph trace.

    We can enhance our understanding of the trace by adding dynamic probes that record function arguments, local variables, and/or return values via perf-probe(1). By knowing these values we can make inferences about the code path being taken.
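
    For example, a return probe on io_ring_ctx_alloc() would show directly whether the allocation failed. A minimal sketch (this assumes kernel debuginfo is available; perf-probe(1) names the generated event <function>__return by default):

    # perf probe 'io_ring_ctx_alloc%return $retval'
    # perf record -e probe:io_ring_ctx_alloc__return -aR -- sleep 30
    ...run the application in another terminal while recording...
    # perf script
    # perf probe --del io_ring_ctx_alloc__return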

    If this is not enough to infer which code path is being taken, detailed code coverage information is necessary.

    One way to approximate code coverage is using a sampling CPU profiler, like perf(1), and letting it run under load for some time to gather statistics on which code paths are executed frequently. This is not as precise as code coverage tools, which record each branch encountered in a program, but it can be enough to observe code paths in functions that are not amenable to the function graph tracer due to the low number of function calls.

    This is done as follows:

    1. Run the system call in question in a tight loop so the CPU is spending a significant amount of time in the code path you wish to observe.
    2. Start perf record -a and let it run for 30 seconds.
    3. Stop perf-record(1) and run perf-report(1) to view the annotated source code of the function in question.

    The error code path should have a significant number of profiler samples and it should be prominent in the perf-report(1) annotated output.
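
    A minimal sketch of this workflow (the tight loop around the failing io_uring_setup(2) call is a hypothetical reproducer of your own):

    ...run the reproducer loop in another terminal...
    # perf record -a -- sleep 30
    # perf report
    # perf annotate __x64_sys_io_uring_setup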

    Conclusion

    Determining the cause for a system call failure can be hard work. The function graph tracer is helpful in shedding light on the code paths being taken by the kernel. Additional debugging is possible using perf-probe(1) and the sampling profiler, so that in most cases it's not necessary to recompile the kernel with printk() just to learn why a system call is failing.

    by Unknown (noreply@blogger.com) at August 05, 2019 03:54 PM

    July 30, 2019

    Cole Robinson

    Blog moved to Pelican and GitHub Pages

    I've moved my blog from blogger.com to a static site generated with Pelican and hosted on GitHub Pages. This is a dump of some of the details.

    The content is hosted in three branches across two repos:

    The motivation for the split is that according to this pelican SEO article, master branches of GitHub repos are indexed by google, so if you store HTML content in a master branch your canonical blog might be battling your GitHub repo in the search results. And since you can only put content in the master branch of a $username.github.io repo, I added a separate blog.git repo. Maybe I could shove all the content into the blog/gh-pages branch, but I think dealing with multiple subdomains prevents it. I've already spent too much time playing with all this stuff though, so that's for another day to figure out. Of course, suggestions welcome; blog comments are enabled with Disqus.

    One issue I hit is that pushing updated content to blog/gh-pages doesn't consistently trigger a new GitHub Pages deployment. There's a bunch of hits about this around the web (this stackoverflow post in particular) but no authoritative explanation about what criteria GitHub Pages uses to determine whether to redeploy. The simplest 'fix' I found is to tweak the index.html content via the GitHub web UI and commit the change which seems to consistently trigger a refresh as reported by the repo's deployments page.

    You may notice the blog looks a lot like stock Jekyll with its minima theme. I didn't find any Pelican theme that I liked as much as minima, so I grabbed the CSS from a minima instance and started adapting the Pelican simple-bootstrap theme to use it. The end result is basically a simple reimplementation of minima for Pelican. I learned a lot in the process but it likely would have been much simpler if I just used Jekyll in the first place, but I'm in too deep to switch now!

    by Cole Robinson at July 30, 2019 07:30 PM

    July 12, 2019

    KVM on Z

    KVM at SHARE Pittsburgh 2019

    Yes, we will be at SHARE in Pittsburgh this August!
    See the following session in the Linux and VM/Virtualization track:

    • KVM on IBM Z News (Session #25978): Latest news on our development work with the open source community

    by Stefan Raspl (noreply@blogger.com) at July 12, 2019 04:42 PM

    July 10, 2019

    Cornelia Huck

    s390x changes in QEMU 4.1

    QEMU has just entered hard freeze for 4.1, so the time is here again to summarize the s390x changes for that release.

    TCG

    • All instructions that have been introduced with the "Vector Facility" in the z13 machines are now emulated by QEMU. In particular, this allows Linux distributions built for z13 or later to be run under TCG (vector instructions are generated when we compile for z13; other z13 facilities are optional.)

    CPU Models

    • As the needed prerequisites in TCG now have been implemented, the "qemu" cpu model now includes the "Vector Facility" and has been bumped to a stripped-down z13.
    • Models for the upcoming gen15 machines (the official name is not yet known) and some new facilities have been added.
    • If the host kernel supports it, we now indicate the AP Queue Interruption facility. This is used by vfio-ap and allows providing AP interrupts to the guest.

    I/O Devices

    • vfio-ccw has gained support for relaying HALT SUBCHANNEL and CLEAR SUBCHANNEL requests from the guest to the device, if the host kernel vfio-ccw driver supports it. Otherwise, these instructions continue to be emulated by QEMU, as before.
    • The bios now supports IPLing (booting) from DASD attached via vfio-ccw.

    Booting

    • The bios tolerates signatures written by zipl, if present; but it does not actually handle them. See the 'secure' option for zipl introduced in s390-tools 2.9.0.

    And the usual fixes and cleanups.

    by Cornelia Huck (noreply@blogger.com) at July 10, 2019 02:16 PM

    July 08, 2019

    KVM on Z

    SLES 15 SP1 released

    SLES 15 SP1 is out! See the announcement and their release notes with Z-specific changes.
    It ships the following code level updates:
    • QEMU v3.1 (GA: v2.11)
    • libvirt v5.1 (GA: v4.0)
    See previous blog entries on QEMU v2.12, v3.0 and v3.1 for details on new features that become available with the QEMU package update.
    Furthermore, SLES 15 SP1 introduces the kvm_stat tool, which can be used for guest event analysis.
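
    A quick way to try it (a minimal sketch; kvm_stat needs root privileges and at least one running KVM guest to show meaningful numbers):

       # interactive, top-like view of KVM trace events
       kvm_stat
       # print a single one-shot snapshot and exit (handy for logging)
       kvm_stat -1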

    by Stefan Raspl (noreply@blogger.com) at July 08, 2019 10:14 PM

    June 16, 2019

    Gerd Hoffmann

    macos guest support for qemu

    display support

    After one of the minor updates (10.14.3 or 10.14.4, don't remember) my macOS Mojave guest started to switch video modes at boot. Also the "Display" panel in "System Preferences" started to offer three video modes to choose from. Unfortunately FullHD (aka 1920x1080) is not on the list.

    Decided to look into this. Assuming that macOS learned to switch video modes using the EFI GOP interface, I tweaked the mode list in OVMF (QemuVideoDxe driver). No effect. Huh?

    Next I looked at the list of drivers, using kextstat. Found an AppleVirtualGraphics.kext entry. Interesting. Checking out Info.plist (in /System/Library/Extensions/AppleVirtualGraphics.kext/Contents) I found this:

        [ ... ]
        <key>IOKitPersonalities</key>
        <dict>
            <key>AppleBochVGAFB</key>
            <dict>
                [ ... ]
                <key>IOPCIPrimaryMatch</key>
                <string>0x11111234&amp;0xFFFFFFFF</string>
                [ ... ]
            </dict>
            <key>AppleCirrusGD5446FB</key>
            <dict>
                [ ... ]
                <key>IOPCIPrimaryMatch</key>
                <string>0x00001013&amp;0x0000FFFF</string>
                [ ... ]
            </dict>
        </dict>
        [ ... ]

    So recent macOS Mojave ships with a driver for qemu stdvga and qemu cirrus vga. Nice. Unfortunately the question of how to switch the display into 1920x1080 mode (to match the host's display resolution) isn't solved yet.

    virtio support

    While looking around I noticed there is an AppleVirtIO.kext too, with this in Info.plist:

        [ ... ]
        <key>IOKitPersonalities</key>
        <dict>
            <key>AppleVirtIO9P</key>
            [ ... ]
            <key>AppleVirtIO9PVFS</key>
            [ ... ]
            <key>AppleVirtIOBlock</key>
            [ ... ]
            <key>AppleVirtIOConsole</key>
            [ ... ]
            <key>AppleVirtIOPCITransport</key>
            [ ... ]
        </dict>

    Apparently a virtio driver with support for virtio-console/serial, virtio-blk and virtio-9p.

    Tried to switch the system disk from sata to virtio-blk. Clover seems to be slightly confused. It stops showing the nice disk icons. But booting macOS works fine regardless, at least when using the transitional device, so the driver seems to support legacy mode only.

    virtio-9p for filesystem sharing looks pretty interesting too. So, let's try that (in libvirt xml):

      [ ... ]
      <devices>
        [ ... ]
        <filesystem type='mount' accessmode='mapped'>
          <source dir='/path/to/some/host/directory'/>
          <target dir='/test9p'/>
          <readonly/>
        </filesystem>
        [ ... ]
      </devices>
      [ ... ]

    macOS does not seem to mount the filesystem automatically, but it is easy to do using the terminal: create the target directory first (sudo mkdir /test9p), then run sudo mount_9p. Done.
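
    The same steps as a copy & paste snippet (the /test9p target directory matches the libvirt example above):

        # inside the macOS guest
        sudo mkdir /test9p
        sudo mount_9p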

    by Gerd Hoffmann at June 16, 2019 10:00 PM

    June 05, 2019

    Gerd Hoffmann

    recent qemu sound improvements

    The qemu sound system got a bunch of improvements in 2018 and 2019.

    New in qemu 3.0

    The hda emulation uses a high resolution timer now to better emulate the timing-sensitive dma transfer of sound samples. Credits for this implementation go to Martin Schrodt.

    Unfortunately this is incompatible with older qemu versions, so it is only enabled for 3.0+ machine type versions. Upgrading qemu is not enough to get this; you also have to make sure you are using a new enough machine type (qemu -M command line switch).

    libvirt stores the machine type in the domain xml when the guest is created. It is never updated automatically. So have a look at your domain configuration (using virsh edit domain-name for example) and check the version is 3.0 or newer:

    [ ... ]
      <os>
        <type arch='x86_64' machine='pc-q35-3.0'>hvm</type>
                                            ^^^
    [ ... ]

    New in qemu 3.1

    The pulseaudio backend got fixes in 3.1, so if you are using pulse you should upgrade to at least qemu version 3.1.

    New in qemu upcoming 4.0

    Yet another pulseaudio bugfix.

    Initial support for the -audiodev command line switch was finally merged. So audio support is no longer the odd one out that is configured in a completely different way, using environment variables instead of command line switches. Credits for this go to Kővágó, Zoltán.

    In the pipeline

    There are more -audiodev improvements in the pipeline, they are expected to land upstream in the 4.1 or 4.2 devel cycle.

    Latency tuning

    While we are at it, one final note:

    Bugs in qemu sound device emulation and audio backends are not the only possible root cause for bad sound quality. Crackling sound -- typically caused by buffer underruns -- can also be caused by latency problems elsewhere in qemu.

    One known offender is disk I/O, specifically the linux aio support which isn't as async as it should be and blocks now and then. linux aio support is configured with io=native for block device backends.

    A better choice is io=threads. In libvirt xml:

    [ ... ]
      <devices>
        <disk type='...' device='disk'>
          <driver name='qemu' type='...' cache='none' io='threads'/>
                                                      ^^^^^^^^^^^^
    [ ... ]

    Another known issue is spice audio compression, so better turn that off when using spice:

    [ ... ]
        <graphics type='spice'>
          [ ... ]
          <playback compression='off'/>
        </graphics>
    [ ... ]

    by Gerd Hoffmann at June 05, 2019 10:00 PM

    May 22, 2019

    QEMU project

    QEMU 4.0 adds micro:bit emulation support

    micro:bit emulation support is available from QEMU 4.0 onwards and can be used for low-level software testing and development. Unlike existing micro:bit simulators, QEMU performs full-system emulation and actually runs the same ARM code as the real hardware. This blog post explains what full-system emulation means and why QEMU is now a useful tool for developing micro:bit software.

    The micro:bit is a tiny ARM board designed for teaching. It is increasingly being used around the world to expose children to computers, programming, and electronics in a low-cost way with an active online community that shares project ideas, lesson plans, and programming tips.

    micro:bit board

    Simulators and emulators

    Simulators are used for many tasks from mobile app development to performance analysis of computer hardware. It is possible to develop code using a simulator without having access to real hardware. Oftentimes using a simulator is more convenient than flashing and debugging programs on real hardware.

    Emulators allow programs written for one computer system to run on a different computer system. They use techniques like machine code interpreters and just-in-time compilers to execute guest programs that do not run natively on the host computer. Each CPU instruction must be correctly implemented by the emulator so it can run guest software.

    How existing micro:bit simulators work

    Simulators can be implemented at various layers in the software stack. The MakeCode editor for JavaScript development includes a micro:bit simulator:

    MakeCode editor

    This simulator does not execute any ARM code and is therefore not running the same CPU instructions as a real micro:bit. Instead it reuses the JavaScript engine already available in your web browser to execute micro:bit JavaScript programs. This is achieved by providing the micro:bit JavaScript APIs that micro:bit programs expect. The programs don’t need to know whether those APIs are implemented by the real micro:bit software stack or whether they are actually calling into the MakeCode simulator.

    In the screenshot above the micro:bit program calls showString("Hello world!") and this becomes a call into the MakeCode simulator code to render images of LEDs in the web browser. On real hardware the code path is different and eventually leads to an LED matrix driver that lights up the LEDs by driving output pins on the micro:bit board.

    Full-system emulation

    Unlike the MakeCode simulator, QEMU emulates the micro:bit CPU and boots from the same ARM code as the real micro:bit board. The simulation happens at the CPU instruction and hardware interface level instead of at the JavaScript API level. This is called full-system emulation because the entire guest software environment is present.

    What are the advantages of full-system emulation?

    • Programs written in any language can run (MicroPython, mbed C/C++, etc)
    • Boot, device driver, and language run-time code can be tested
    • Bugs in lower layers of the software stack can be reproduced
    • CPU architecture-specific bugs can be reproduced (stack and memory corruption bugs)
    • A debugger can be connected to inspect the entire software stack

    The main disadvantage of full-system emulation is that the performance overhead is higher since simulation happens at the CPU instruction level. Programs consist of many CPU instructions so the task of emulation is performance-sensitive. Luckily the micro:bit’s CPU is much less powerful than CPUs available in our laptops and desktops, so programs execute at a reasonable speed.

    Running micro:bit programs on QEMU

    QEMU emulates the core devices on the micro:bit, including the serial port (UART) and timers. This is enough for developing and testing low-level software but does not offer the LEDs, radio, and other devices that most micro:bit programs rely on. These devices might be emulated by QEMU in the future, but for now the main use of QEMU is for developing and testing low-level micro:bit code.

    To run test.hex:

    $ qemu-system-arm -M microbit -device loader,file=test.hex -serial stdio
    

    Any output written to the serial port is printed to the terminal by QEMU.

    Debugging micro:bit programs with QEMU and GDB

    QEMU has GDB guest debugging support. This means GDB can connect to QEMU in order to debug the guest software. This is similar to debugging a real system over JTAG, except no hardware is necessary!

    Connect with GDB to debug the guest:

    $ qemu-system-arm -M microbit -device loader,file=test.hex -s
    $ gdb
    (gdb) target remote tcp:127.0.0.1:1234
    (gdb) x/10i $pc
    => 0x161c4:	ldr	r3, [r4, #0]
       0x161c6:	cmp	r3, #0
       0x161c8:	beq.n	0x161d2
       0x161ca:	ldr	r3, [pc, #48]	; (0x161fc)
       0x161cc:	ldr	r3, [r3, #0]
       0x161ce:	cmp	r3, #0
       0x161d0:	bne.n	0x161d8
       0x161d2:	movs	r0, #6
       0x161d4:	bl	0x16160
       0x161d8:	ldr	r0, [r4, #0]
    

    Having a debugger is very powerful. QEMU can also load ELF files in addition to the popular .hex files used for micro:bit programs. ELF files can contain debugging information that enables source-level debugging so GDB can display function and variable names as well as listing the source code instead of showing assembly instructions.
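
    For example, booting an ELF build of the guest program and pointing GDB at the same file gives source-level debugging right away. A minimal sketch (test.elf and the arm-none-eabi-gdb cross-debugger are assumptions; any GDB built with ARM support works):

    $ qemu-system-arm -M microbit -device loader,file=test.elf -s -S
    $ arm-none-eabi-gdb test.elf
    (gdb) target remote tcp:127.0.0.1:1234
    (gdb) break main
    (gdb) continue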

    Conclusion

    QEMU now offers a platform for developing and testing micro:bit programs. It is open to future extension, hopefully to emulate more devices and offer a graphical user interface.

    micro:bit emulation was contributed by Julia Suvorova and Steffen Görtz as part of their Outreachy and Google Summer of Code internships with QEMU. Jim Mussared, Joel Stanley, and Stefan Hajnoczi acted as mentors and contributed patches as well.

    May 22, 2019 10:45 AM

    May 17, 2019

    KVM on Z

    QEMU v3.1 released

    QEMU v3.1 is out. Besides a number of small enhancements, here are some items that we would like to highlight from a KVM on Z perspective:
    • Huge Pages Support: KVM guests can now utilize 1MB pages. As this removes one layer of address translation for the guest backing, fewer page faults need to be processed, and fewer translation lookaside buffer (TLB) entries are needed to hold translations. This, as well as the TLB improvements in z14, will improve KVM guest performance.
      To use:
      Create the config file /etc/modprobe.d/kvmhpage.conf with the following content to enable huge pages for KVM:

         options kvm hpage=1


      Furthermore, add the following line to /etc/sysctl.conf to reserve N huge pages:

         vm.nr_hugepages = N

      Alternatively, append the following statement to the kernel parameter line in case support is compiled into the kernel: kvm.hpage=1 hugepages=N.
      Note that there are ways to add huge pages dynamically after boot, but due to effects like memory fragmentation, it is preferable to define huge pages as early as possible.
      If successful, the file /proc/sys/vm/nr_hugepages should show N huge pages. See here for further documentation.
      Then, to enable huge pages for a guest, add the following element to the respective domain XML:

         <memoryBacking>
           <hugepages/>
         </memoryBacking>


      The use of huge pages in the host is orthogonal to the use of huge pages in the guest. Both will improve the performance independently by reducing the number of page faults and the number of page table walks after a TLB miss.
      The biggest performance improvement can be achieved by using huge pages in both host and guest, e.g. with libhugetlbfs, as this will also make use of the larger 1M TLB entries in the hardware.
      Requires Linux kernel 4.19.
    • vfio-ap: The Adjunct Processor (AP) facility is an IBM Z cryptographic facility comprised of three AP instructions and up to 256 cryptographic adapter cards. Each adapter card is partitioned into up to 85 domains, each of which provides cryptographic services. An AP queue is the means by which AP messages are sent to and received from an AP adapter. Each AP queue is connected to a particular domain within a particular adapter. vfio-ap enables assignment of a subset of AP adapters and domains to one or more guests such that each guest has exclusive access to a discrete set of AP queues.
      Here is a small sample script illustrating host setup:

         # load vfio-ap device driver
         modprobe vfio-ap

         # reserve domain 7 for use by KVM guests
         echo -0x7 > /sys/bus/ap/aqmask
         # to reserve all domains of an adapter, use the following
         # line instead (by uncommenting it), and replace NN with the
         # adapter number:
         # echo -0xNN > /sys/bus/ap/apmask

         # create a mediated device (mdev) to provide userspace access
         # to a device in a secure manner
         UUID=e926839d-a0b4-4f9c-95d0-c9b34190c4ba
         echo $UUID > /sys/devices/vfio_ap/matrix/mdev_supported_types/ \
                      vfio_ap-passthrough/create

         # assign adapter, domain and control domain
         echo 0x3 > /sys/devices/vfio_ap/matrix/${UUID}/assign_adapter
         echo 0x7 > /sys/devices/vfio_ap/matrix/${UUID}/assign_domain
         echo 0x7 > /sys/devices/vfio_ap/matrix/${UUID}/ \
                    assign_control_domain


      To make use of the AP device in a KVM guest, add the following element to the respective domain XML:

         <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-ap'>
           <source>
             <address uuid='e926839d-a0b4-4f9c-95d0-c9b34190c4ba'/>
           </source>
         </hostdev>


      Once complete, use the passthrough device in a KVM guest just like a regular crypto adapter.
      Requires Linux kernel 4.20 and libvirt 4.9, and is also available in RHEL 8, Ubuntu 18.04 and SLES 15 SP1.

    by Stefan Raspl (noreply@blogger.com) at May 17, 2019 11:33 PM

    May 07, 2019

    KVM on Z

    RHEL 8 released

    Red Hat Enterprise Linux 8 is out! See the announcement and their release notes with Z-specific changes.
    It ships the following code levels:

    by Stefan Raspl (noreply@blogger.com) at May 07, 2019 04:52 PM

    April 30, 2019

    KVM on Z

    QEMU v4.0 released

    QEMU v4.0 is out. Besides a number of small enhancements, here are some items that we would like to highlight from a KVM on Z perspective:
    • CPU models for z14 GA2 as follows:
         $ qemu-system-s390x -cpu help -enable-kvm | grep z14.2
         s390 z14.2-base      IBM z14 GA2           (static, migration-safe)
         s390 z14.2           IBM z14 GA2           (migration-safe)
    • vfio-ap now supports hotplugging of vfio-ap devices.

    by Stefan Raspl (noreply@blogger.com) at April 30, 2019 08:17 AM

    April 24, 2019

    QEMU project

    QEMU version 4.0.0 released

    We would like to announce the availability of the QEMU 4.0.0 release. This release contains 3100+ commits from 220 authors.

    You can grab the tarball from our download page. The full list of changes are available in the Wiki.

    Highlights include:

    • ARM: ARMv8+ extensions for SB, PredInv, HPD, LOR, FHM, AA32HPD, PAuth, JSConv, CondM, FRINT, and BTI
    • ARM: new emulation support for “Musca” and “MPS2” development boards
    • ARM: virt: support for >255GB of RAM and u-boot “noload” image types
    • ARM: improved emulation of ARM PMU
    • HPPA: support for TLB protection IDs and TLB trace events
    • MIPS: support for multi-threaded TCG emulation
    • MIPS: emulation support for I7200 and I6500 CPUs, QMP-based querying of CPU types, and improved support for SAARI and SAAR configuration registers
    • MIPS: improvements to Interthread Communication Unit, Fulong 2E machine types, and end-user documentation.
    • PowerPC: pseries/powernv: support for POWER9 large decrementer
    • PowerPC: pseries: emulation support for XIVE interrupt controller
    • PowerPC: pseries: support for hotplugging PCI host bridges (PHBs)
    • PowerPC: pseries: Spectre/Meltdown mitigations enabled by default, additional support for count-cache-flush mitigation
    • RISC-V: virt: support for PCI and USB
    • RISC-V: support for TSR, TW, and TVM fields of mstatus, FS field now supports three states (dirty, clean, and off)
    • RISC-V: built-in gdbserver supports register lists via XML files
    • s390: support for z14 GA 2 CPU model, Multiple-epoch and PTFF features now enabled in z14 CPU model by default
    • s390: vfio-ap: now supports hot plug/unplug, and no longer inhibits memory ballooning
    • s390: emulation support for floating-point extension facility and vector support instructions
    • x86: HAX accelerator now supported on POSIX hosts other than Darwin, including Linux and NetBSD
    • x86: Q35: advertised PCIe root port speeds will now optimally default to maximum link speed (16GT/s) and width (x32) provided by PCIe 4.0 for QEMU 4.0+ machine types; older machine types will retain 2.5GT/x1 defaults for compatibility.
    • x86: Xen PVH images can now be booted with “-kernel” option
    • Xtensa: xtfpga: improved SMP support for linux (interrupt distributor, IPI, and runstall) and new SMP-capable test_mmuhifi_c3 core configuration
    • Xtensa: support for Flexible length instructions extension (FLIX)
    • GUI: new ‘-display spice-app’ to configure/launch a Spice client GUI with a similar UI to QEMU GTK. VNC server now supports access controls via tls-authz/sasl-authz options
    • QMP: support for “out-of-band” command execution, can be useful for postcopy migration recovery. Additional QMP commands for working with block devices and dirty bitmaps
    • VFIO: EDID interface for supported mdev (Intel vGPU for kernel 5.0+), allows resolution setting via xres/yres options.
    • Xen: new ‘xen-disk’ device which can create a Xen PV disk backend, and performance improvements for Xen PV disk backend.
    • Network Block Device: improved tracing and error diagnostics, improved client compatibility with buggy NBD server implementations, new --bitmap, --list, --tls-authz options for qemu-nbd
    • virtio-blk now supports DISCARD and WRITE_ZEROES
    • pvrdma device now supports RDMA Management Datagram services (MAD)
    • and lots more…

    Thank you to everyone involved!

    April 24, 2019 05:45 AM

    April 18, 2019

    Stefan Hajnoczi

    What's new in VIRTIO 1.1?

    The VIRTIO 1.1 specification has been published! This article covers the major new features in this specification.

    New Devices

    The following new devices are defined:

    • virtio-input is a Linux evdev input device (mouse, keyboard, joystick)
    • virtio-gpu is a 2D graphics device (with 3D support planned)
    • virtio-vsock is a host<->guest socket communications device
    • virtio-crypto is a cryptographic accelerator device

    New Device Features

    virtio-net

    virtio-blk

    virtio-balloon

    New Core Features

    There is a new virtqueue memory layout called packed virtqueues. The old layout is called split virtqueues because the avail and used rings are separate from the descriptor table. The new packed virtqueue layout uses just a single descriptor table as the single ring. The layout is optimized for a friendlier CPU cache footprint and there are several features that devices can exploit for better performance.

    The VIRTIO_F_NOTIFICATION_DATA feature is an optimization mainly for hardware implementations of VIRTIO. The driver writes extra information as part of the Available Buffer Notification. Thanks to the information included in the notification, the device does not need to fetch this information from memory anymore. This is useful for PCI hardware implementations where minimizing DMA operations improves performance significantly.

    by Unknown (noreply@blogger.com) at April 18, 2019 12:56 PM

    April 09, 2019

    Cole Robinson

    Host 'Network Interfaces' panel removed from virt-manager

    I released virt-manager 2.0.0 in October 2018. Since the release contained the full port to python3, it seemed like a good opportunity to drop some baggage from the app.

    The biggest piece we removed was the UI for managing host network interfaces. This is the Connection Details->Network Interfaces panel, and the New Interface wizard for defining host network definitions for things like bridges, bonds, and vlan devices. The main screen of the old UI looked like this:

    virt-manager host interfaces panel

    Some history

    Behind the scenes, this UI was using libvirt's Interface APIs, which also power the virsh iface-* commands. These APIs are little more than a wrapper around the netcf library.

    netcf aimed to be a linux distro independent API for network device configuration. On Red Hat distros this meant turning the API's XML format into an /etc/sysconfig/network script. There were even pie-in-the-sky ideas about NetworkManager one day using netcf.

    In practice though the library never really took off. It was years before a debian backend showed up, contributed by a Red Hatter in the hope of increasing library uptake, though it didn't seem to help. netcf basically only existed to serve the libvirt Interface APIs, yet those APIs were never really used by any major libvirt consuming app, besides virt-manager. And in virt-manager's case it was largely just slapping some UI over the XML format and lifecycle operations.

    For virt-manager's usecases we hoped that netcf would make it trivial to bridge the host's network interface, which when used with VMs would give them first class IP addresses on the host network setup, not NAT like the 'default' virtual network. Unfortunately, though the UI would create the ifcfg files well enough, behind the scenes nothing played well with NetworkManager for years and years. The standard suggestion was to disable NetworkManager if you wanted to bridge your host NIC. Not very user friendly. Some people did manage to use the UI to that effect but it was never a trivial process.

    The state today

    Nowadays NetworkManager can handle bridging natively and is much more powerful than what virt-manager/libvirt/netcf provide. The virt-manager UI was more likely to shoot you in the foot than make things simple. And it had become increasingly clear that virt-manager was not the place to maintain host network config UI.

    So we made the decision to drop all this from virt-manager in 2.0.0. netcf and the libvirt interface APIs still exist. If you're interested in some more history on the interface API/netcf difficulties, check out Laine's email to virt-tools-list.

    by Cole Robinson at April 09, 2019 06:01 PM

    April 02, 2019

    Gerd Hoffmann

    drminfo 6 released

    drminfo is a small collection of tools for drm and fbdev devices. They print device information and can run some basic tests.

    New in version 6 is a number of avocado test cases for qemu display devices (stdvga, cirrus, qxl and virtio).

    drminfo has a homepage and a git repository.
    My copr repo has Fedora and EPEL rpm packages.

    by Gerd Hoffmann at April 02, 2019 10:00 PM

    March 12, 2019

    Cornelia Huck

    s390x changes in QEMU 4.0

    QEMU is now entering softfreeze for the 4.0 release (expected in April), so here is the usual summary of s390x changes in that release.

    CPU Models

    • A cpu model for the z14 GA 2 has been added. Currently, no new features have been added.
    • The cpu model for z14 now does, however, include the multiple epoch and PTFF enhancement features per default.
    • The 'qemu' cpu model now includes the zPCI feature per default. No more prerequisites are needed for pci support (see below).

    Devices


    • QEMU for s390x is now always built with pci support. If we want to provide backwards compatibility, we cannot simply disable pci (we need the s390 pci host bus); it is easier to simply make pci mandatory. Note that disabling pci was never supported by the normal build system anyway.
    • zPCI devices have gained support for instruction counters (on a Linux guest, these are exposed through /sys/kernel/debug/pci/<function>/statistics).
    • zPCI devices always lacked support for migrating their s390-specific state (not implemented...); if you tried to migrate a guest with a virtio-pci device on s390x, odd things might happen. To avoid surprises, the 'zpci' devices are now explicitly marked as unmigratable. (Support for migration will likely be added in the future.)
    • Hot(un)plug of the vfio-ap matrix device is now supported.
    • Adding a vfio-ap matrix device no longer inhibits usage of a memory ballooner: Memory usage by vfio-ap does not clash with the concept of a memory balloon.

    TCG

    • Support for the floating-point extension facility has been added.
    • The first part of support for z13 vector instructions has been added (vector support instructions). Expect support for the remaining vector instructions in the next release; it should support enough of the instructions introduced with z13 to be able to run a distribution built for that cpu. 

    by Cornelia Huck (noreply@blogger.com) at March 12, 2019 06:20 PM

    March 11, 2019

    KVM on Z

    libvirt v4.10 released, providing PCI passthrough support

    libvirt v4.10, available for download at the libvirt project website, adds support for PCI passthrough devices on IBM Z (requires Linux kernel 4.14 and QEMU v2.11).
    To setup passthrough for a PCI device, follow these steps:
    1. Make sure the vfio-pci module is available, e.g. using the modinfo command:
         $ modinfo vfio-pci
         filename:       /lib/modules/4.18.0/kernel/drivers/vfio/pci/vfio-pci.ko
         description:    VFIO PCI - User Level meta-driver
    2. Verify that the pciutils package, providing the lspci command et al, is available using your distro's package manager
    3. Determine the PCI device's address using the lspci command:
         $ lspci
         0002:06:00.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family
                      [ConnectX-3/ConnectX-3 Pro Virtual Function]
    4. Add the following element to the guest domain XML's devices section:
         <hostdev mode='subsystem' type='pci' managed='yes'>
           <source>
             <address domain='0x0002' bus='0x06' slot='0x00' function='0x0'/>
           </source>
         </hostdev>

      Note that if attribute managed is set to no (which is the default), it becomes the user's duty to unbind the PCI device from the respective device driver, and rebind to vfio-pci in the host prior to starting the guest.
    Once done and the guest is started, running the lspci command in the guest should show the PCI device, and one can proceed to configure it as needed.
    It is well worth checking out the expanded domain XML:
        <hostdev mode='subsystem' type='pci' managed='yes'>
          <source>
            <address domain='0x0002' bus='0x06' slot='0x00' function='0x0'/>
          </source>
          <address type='pci' domain='0x0002' bus='0x00' slot='0x01' function='0x0'>
            <zpci uid='0x0001' fid='0x00000000'/>
          </address>
        </hostdev>

    Theoretically, the PCI address in the guest can change between boots. However, the <zpci> element guarantees address persistence inside of the guest. The actual address of the passthrough device is based solely on the uid attribute: The uid becomes the PCI domain, and all remaining values of the address (PCI bus, slot and function) are set to zero. Therefore, in this example, the PCI address in the guest would be 0001:00:00.0.
    Take note of the fid attribute, whose value is required to hotplug/hotunplug PCI devices within a guest.
    Furthermore note that the target PCI address is not visible anywhere (except within the QEMU process) at all. I.e. it is not related to the PCI address as observed within the KVM guest, and could be set to an arbitrary value. However, choosing the "wrong" values might have undesired subtle side effects with QEMU. Therefore, we strongly recommend not to specify a target address, and to rather rely on the auto-assignment. I.e. if the guest's PCI address has to be chosen, at a maximum restrict the target address element to uid (for PCI address definition) and fid (so that e.g. scripts in the guest for hotplugging PCI devices can rely on a specific value) as follows:
       <address type='pci'>
         <zpci uid='0x0001' fid='0x00000000'/>
       </address>


    For further (rather technical) details see here and here (git commit).

    by Stefan Raspl (noreply@blogger.com) at March 11, 2019 03:11 PM

    March 02, 2019

    Gerd Hoffmann

    EDID support for qemu

    Over the last months I've worked on adding EDID support to qemu. This allows passing all kinds of information about the (virtual) display to the guest: preferred video mode, display resolution, monitor name, monitor serial number and more. Current focus is getting the infrastructure in place. Once we have this we can build new features on top. HiDPI support comes to mind for example.

    New in qemu 3.1

    In qemu 3.1 the EDID generator code and support for the qemu stdvga were added. Right now EDID support is turned off by default, use edid=on to enable it. With EDID enabled you can also use the xres and yres properties to set the preferred video mode. Here is an example: qemu -device VGA,edid=on,xres=1280,yres=800

    The qemu-edid utility has been added too. Its main purpose is to allow testing the generator code without having to boot a guest; typically the qemu-edid output is piped into the edid-decode utility to verify the generator works correctly. If you need an EDID blob for other reasons you might find this useful.
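
    For example (a minimal sketch; edid-decode has to be installed separately):

        qemu-edid | edid-decode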

    New in linux kernel 5.0

    Infrastructure work: Some interface updates have been merged:

    • EDID support has been added to the virtio-gpu protocol.
    • The vfio mdev interface for vgpus got EDID support too.

    We also got EDID support in drm drivers for both qemu stdvga (bochs-drm.ko) and virtio-gpu.

    If both guest (linux kernel) and host (qemu) are new enough, the drm drivers will use the information from the edid blob to create the video mode list. It will also be available in sysfs, and you can use edid-decode to get a human-readable version: edid-decode /sys/class/drm/card0-Virtual-1/edid

    Planned for qemu 4.0

    Support for EDID in virtio-gpu will be added; it is already merged in the master branch. It is also turned off by default, use edid=on to enable it (similar to stdvga).

    Support for EDID in vgpus (i.e. vfio-mdev) is planned too; patches are out for review right now. Note that the kernel's mdev driver must support that too.

    The qemu macos driver for the stdvga has been updated to support EDID.

    Future plans

    Intel is working on adding EDID support to gvt (the intel graphics mdev driver). Should land in the 5.1 or 5.2 linux kernel merge window.

    Once the EDID support has gotten some real world testing it will be enabled by default for both stdvga and virtio-gpu. Unless something unexpected happens, that will probably be in qemu 4.1.

    As already mentioned above looking at HiDPI support (starting with the gtk UI probably) is something I plan to look at when I find some time.

    by Gerd Hoffmann at March 02, 2019 11:00 PM

    February 28, 2019

    Stefan Hajnoczi

    QEMU accepted into Google Summer of Code and Outreachy 2019

    QEMU is participating in the Google Summer of Code and Outreachy open source internship programs again this year. These 12-week, full-time, paid, remote work internships allow people interested in contributing to QEMU to get started. Each intern works with one or more mentors, experienced developers who can answer questions. This is a great way to try out working on open source if you are considering it as a career.

    For more information (including eligibility requirements), see our GSoC and our Outreachy pages.

    by Unknown (noreply@blogger.com) at February 28, 2019 05:17 PM

    February 27, 2019

    Gerd Hoffmann

    ramfb display in qemu

    ramfb is a very simple framebuffer display device. It is intended to be configured by the firmware and used as boot framebuffer, until the guest OS loads a real GPU driver.

    The framebuffer memory is allocated from guest RAM and initialized using the firmware config interface (fw_cfg). edk2 (uefi firmware) has ramfb support. There also is a vgabios, which emulates vga text mode and renders it to the framebuffer.

    The most interesting use case for this is boot display support for vgpus. vfio has a non-hot-pluggable variant, which allows enabling ramfb support: qemu -device vfio-pci-nohotplug,ramfb=on,... Once the guest OS has initialized the vgpu, qemu will show the vgpu display. Otherwise the ramfb framebuffer is used. The firmware messages, boot loader menu and efifb/vesafb output will all show via ramfb.

    There also is a standalone device, mostly intended for testing: qemu -vga none -device ramfb.

    Even though it is possible to use ramfb as primary display, it isn't a good idea to actually do that, as this isn't very efficient.

    by Gerd Hoffmann at February 27, 2019 11:00 PM

    February 26, 2019

    QEMU project

    Announcing GSoC and Outreachy 2019 internships

    QEMU is once again participating in Google Summer of Code and Outreachy this year! These open source internship programs offer full-time remote work opportunities for talented new developers wishing to get involved in our community.

    Each intern works with one or more mentors who support them in their project. Code is submitted according to QEMU’s normal development process, giving the intern experience in open source software development. Our projects range from device emulation to performance optimization to test infrastructure.

    If you are interested in contributing to QEMU through a paid 12-week internship from May to August 2019, take a look at our GSoC page and our Outreachy page for more information.

    Both GSoC and Outreachy have eligibility criteria, which you can review here (GSoC) and here (Outreachy) before applying.

    You can read about projects that were completed in 2018 here.

    These internships are generously funded by Google (GSoC) and Red Hat (Outreachy).

    February 26, 2019 07:00 AM

    February 18, 2019

    Daniel Berrange

    Easier QEMU live tracing using systemtap

    QEMU is able to leverage a number of live tracing systems, with the choice made at build time (a configure sketch follows the list) between:

    • log – printf formatted string for each event sent into QEMU’s logging system which writes to stderr
    • syslog – printf formatted string for each event sent via syslog
    • simple – binary data stream for each event written to a file or fifo pipe
    • ftrace – printf formatted string for each event sent to kernel ftrace facility
    • dtrace – user space probe markers dynamically enabled via dtrace or systemtap
    • ust – user space probe markers dynamically enabled via LTT-ng
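
    The backend selection happens when QEMU is configured for the build. As a rough sketch (the flag spelling should hold for recent QEMU versions, but double-check against your tree; several comma-separated backends can be combined), enabling the dtrace backend looks like this:

    $ ./configure --enable-trace-backends=dtrace    # or e.g. log,dtrace
    $ make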

    Upstream QEMU enables the “log” trace backend by default since it is cross-platform portable and very simple to use by adding “-d trace:PATTERN” on the QEMU command line. For example to enable logging of all trace events in the QEMU I/O subsystem (aka “qio“) we can

    $ qemu -d trace:qio* ...some args...
    23266@1547735759.137292:qio_channel_socket_new Socket new ioc=0x563a8a39d400
    23266@1547735759.137305:qio_task_new Task new task=0x563a891d0570 source=0x563a8a39d400 func=0x563a86f1e6c0 opaque=0x563a89078000
    23266@1547735759.137326:qio_task_thread_start Task thread start task=0x563a891d0570 worker=0x563a86f1ce50 opaque=0x563a891d9d90
    23273@1547735759.137491:qio_task_thread_run Task thread run task=0x563a891d0570
    23273@1547735759.137503:qio_channel_socket_connect_sync Socket connect sync ioc=0x563a8a39d400 addr=0x563a891d9d90
    23273@1547735759.138108:qio_channel_socket_connect_fail Socket connect fail ioc=0x563a8a39d400
    

    This is very simple and surprisingly effective much of the time, but it is not without its downsides

    • Inactive probes have non-negligible performance impact on hot codepaths
    • It is targeted at human consumption, so it is not easy to process reliably with machines
    • It requires adding arguments to QEMU’s command line so is not easy to enable in many cases
    • It is custom to QEMU so does not facilitate getting correlated traces across the whole system

    For these reasons, some downstreams chose not to use the default “log” backend. Both Fedora and RHEL have instead enabled the “dtrace” backend, which can be leveraged via systemtap on Linux. This provides a very powerful tracing system, but the cost is that the previously simple task of printing a formatted string when a probe point fires has become MUCH more complicated. For example, getting output equivalent to that seen with QEMU’s log backend would require

    # cat > trace.stp <<EOF
    probe qemu.system.x86_64.qio_task_new {
        printf("%d@%d qio_task_new Task new task=%p source=%p func=%p opaque=%p\n", 
               pid(), gettimeofday_ns(), task, source, func, opaque)
    }
    EOF
    # stap trace.stp
    22806@1547735341399862570 qio_task_new Task new task=0x56135cd66eb0 source=0x56135d1d7c00 func=0x56135af746c0 opaque=0x56135bf06400
    

    Repeat that code snippet for every qio* probe point you want to watch, figuring out the set of args it has available to print. This quickly becomes tedious for what should be a simple logging job, especially if you need to reference null-terminated strings from userspace.

    After cursing this difficulty one time too many, it occurred to me that QEMU could easily do more to make life easier for systemtap users. The QEMU build system is already auto-generating all the trace backend specific code from a generic description of probes in the QEMU source tree. It has a format string which is used in the syslog, log and ftrace backends, but this is ignored for the dtrace backend. It did not take much to change the code generator so that it can use this format string to generate a convenient systemtap tapset representing the above manually written probe:

    probe qemu.system.x86_64.log.qio_task_new = qemu.system.x86_64.qio_task_new ?
    {
        printf("%d@%d qio_task_new Task new task=%p source=%p func=%p opaque=%p\n",
               pid(), gettimeofday_ns(), task, source, func, opaque)
    }
    

    This can be trivially executed with minimal knowledge of systemtap tapset language required

    # stap -e "qemu.system.x86_64.log.qio_task_new{}"
    22806@1547735341399862570 qio_task_new Task new task=0x56135cd66eb0 source=0x56135d1d7c00 func=0x56135af746c0 opaque=0x56135bf06400
    

    Even better, we have now gained the ability to use wildcards too

    # stap -e "qemu.system.x86_64.log.qio*{}"
    23266@1547735759.137292:qio_channel_socket_new Socket new ioc=0x563a8a39d400
    23266@1547735759.137305:qio_task_new Task new task=0x563a891d0570 source=0x563a8a39d400 func=0x563a86f1e6c0 opaque=0x563a89078000
    23266@1547735759.137326:qio_task_thread_start Task thread start task=0x563a891d0570 worker=0x563a86f1ce50 opaque=0x563a891d9d90
    23273@1547735759.137491:qio_task_thread_run Task thread run task=0x563a891d0570
    23273@1547735759.137503:qio_channel_socket_connect_sync Socket connect sync ioc=0x563a8a39d400 addr=0x563a891d9d90
    23273@1547735759.138108:qio_channel_socket_connect_fail Socket connect fail ioc=0x563a8a39d400
    

    Users still, however, need to be aware of the naming convention for QEMU’s systemtap tapsets and how it maps to the particular QEMU binary that is used, and they must not forget the trailing “{}”. Thus I decided to go one step further and ship a small helper tool to make it even easier to use

    $ qemu-trace-stap run qemu-system-x86_64 'qio*'
    22806@1547735341399856820 qio_channel_socket_new Socket new ioc=0x56135d1d7c00
    22806@1547735341399862570 qio_task_new Task new task=0x56135cd66eb0 source=0x56135d1d7c00 func=0x56135af746c0 opaque=0x56135bf06400
    22806@1547735341399865943 qio_task_thread_start Task thread start task=0x56135cd66eb0 worker=0x56135af72e50 opaque=0x56135c071d70
    22806@1547735341399976816 qio_task_thread_run Task thread run task=0x56135cd66eb0
    

    The second argument to this tool is the QEMU binary filename to be traced, which can be relative (to search $PATH) or absolute. What is clever is that it will set the SYSTEMTAP_TAPSET env variable to point to the right location to find the corresponding tapset definition. This is very useful when you have multiple copies of QEMU on the system and need to make sure systemtap traces the right one.

    The ‘qemu-trace-stap‘ script takes a verbose arg so you can understand what it is running behind the scenes:

    $ qemu-trace-stap run /home/berrange/usr/qemu-git/bin/qemu-system-x86_64 'qio*'
    Using tapset dir '/home/berrange/usr/qemu-git/share/systemtap/tapset' for binary '/home/berrange/usr/qemu-git/bin/qemu-system-x86_64'
    Compiling script 'probe qemu.system.x86_64.log.qio* {}'
    Running script, <Ctrl>-c to quit
    ...trace output...
    

    It can enable multiple probes at once

    $ qemu-trace-stap run qemu-system-x86_64 'qio*' 'qcrypto*' 'buffer*'
    

    By default it monitors all existing running processes and all future launched processes. This can be restricted to a specific PID using the --pid arg

    $ qemu-trace-stap run --pid 2532 qemu-system-x86_64 'qio*'
    

    Finally, if you can’t remember which probes are valid, it can tell you

    $ qemu-trace-stap list qemu-system-x86_64
    ahci_check_irq
    ahci_cmd_done
    ahci_dma_prepare_buf
    ahci_dma_prepare_buf_fail
    ahci_dma_rw_buf
    ahci_irq_lower
    ...snip...
    

    This new functionality was merged into QEMU upstream a short while ago and will be included in the QEMU 4.0 release coming at the end of April.

    by Daniel Berrange at February 18, 2019 03:00 PM

    January 28, 2019

    Thomas Huth

    How to create small VMs with buildroot

    I have already run into the situation a couple of times where I wanted to provide a small guest disk image to other people. For example, one time I wanted to provide a test application like LTP to colleagues via a server where I only had some limited disk quota available. Back then I was still able to resolve the problem by installing a stock Linux distribution together with the test software into a normal qcow2 image, and then shrinking the image with qemu-img convert and xz to approximately 500 MiB.
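
    For reference, that shrinking step was roughly along the following lines (a sketch with placeholder file names; the -c switch enables qcow2 compression):

      $ qemu-img convert -O qcow2 -c test-appliance.qcow2 test-appliance-small.qcow2
      $ xz -9 test-appliance-small.qcow2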

    But when I started to think about the QEMU advent calendar 2018, where I wanted to provide many small images for various different target architectures, it was clear to me that I needed a different approach. First, the disk images needed to be much smaller due to network traffic constraints, and for many of the “non-mainstream” target architectures (like MicroBlaze or Xtensa) you also cannot easily get a standard Linux distribution that installs without problems on the machines that QEMU provides.

    Instead of using a pre-built Linux distribution, it would also be possible to cross-compile the kernel and user space programs and build a small disk image with that on your own. However, figuring out how to do that for multiple target architectures would have been very cumbersome and time consuming.

    So after doing some research, I finally discovered buildroot, which is an excellent framework for doing exactly what I wanted: It allows you to create small disk images for non-x86 target CPUs, with all the magic of cross-compiling and image creation wrapped into its internal scripts, and with a very flexible Kconfig-style configuration system on top.

    For those who are interested, here is a short description of how to use buildroot to create a small guest disk image:

    1. Download the version that you like to use from the buildroot download page and unpack it:
      $ wget https://buildroot.org/downloads/buildroot-2018.02.9.tar.bz2
      $ tar -xaf buildroot-2018.02.9.tar.bz2 
      $ cd buildroot-2018.02.9/
      
    2. Now you have to choose for which CPU and machine target you want to build. Have a look at the pre-defined config files and then select one. In the following example, I’m going to use the “pseries” POWER machine:
      $ cd configs/
      $ ls qemu*
      qemu_aarch64_virt_defconfig         qemu_nios2_10m50_defconfig
      qemu_arm_versatile_defconfig        qemu_or1k_defconfig
      qemu_arm_versatile_nommu_defconfig  qemu_ppc64le_pseries_defconfig
      qemu_arm_vexpress_defconfig         qemu_ppc64_pseries_defconfig
      qemu_m68k_mcf5208_defconfig         qemu_ppc_g3beige_defconfig
      qemu_m68k_q800_defconfig            qemu_ppc_mpc8544ds_defconfig
      qemu_microblazebe_mmu_defconfig     qemu_ppc_virtex_ml507_defconfig
      qemu_microblazeel_mmu_defconfig     qemu_sh4eb_r2d_defconfig
      qemu_mips32r2el_malta_defconfig     qemu_sh4_r2d_defconfig
      qemu_mips32r2_malta_defconfig       qemu_sparc64_sun4u_defconfig
      qemu_mips32r6el_malta_defconfig     qemu_sparc_ss10_defconfig
      qemu_mips32r6_malta_defconfig       qemu_x86_64_defconfig
      qemu_mips64el_malta_defconfig       qemu_x86_defconfig
      qemu_mips64_malta_defconfig         qemu_xtensa_lx60_defconfig
      qemu_mips64r6el_malta_defconfig     qemu_xtensa_lx60_nommu_defconfig
      qemu_mips64r6_malta_defconfig
      $ cd ..
      $ make qemu_ppc64_pseries_defconfig
      
    3. Now run make menuconfig to fine-tune your build. I recommend having a look at the following settings first:
      • In the Toolchain section, you might need to enable other languages like C++ in case it is required for the application that you want to ship in the image.
      • In the System Configuration section, change the System Banner to something that better suits your disk image.
      • Check the Kernel section to see whether the right kernel settings are used here. The defaults should be fine most of the time, but in case you want to use a newer kernel version for example, or a different kernel config file, you can adjust it here. Note that you also should adjust the kernel header version in the Toolchain section if you change the kernel version here.
      • Have a look at the Target packages section – maybe the application that you want to include is already available in the base buildroot system. In that case you can already enable it here.
      • Check the Filesystem images section and decide which kind of image you want to ship later. For example, for most of the QEMU advent calendar images, I used a simple initrd only, so I unchecked the ext2/3/4 root filesystem here and used initial RAM filesystem linked into linux kernel instead.
    4. Now save your configuration, exit the config menu, and type make for a first test to see whether it produces a usable image. Note: Don’t use the -j parameter of make here; buildroot will figure that out on its own.

    5. Once the build has finished successfully, have a look at the output/images/ directory. You can start your guest with the results from there to give it a try. For example, if you built with the ppc64 pseries configuration, with the initrd linked into the kernel:
      $ qemu-system-ppc64 -M pseries -m 1G -kernel output/images/vmlinux
      

      You should see the kernel booting up, and if you have a look at the serial console, there is also a getty running where you can log in as root and look around.
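
      If you decided not to link the initial RAM filesystem into the kernel, you can pass it separately instead. A sketch, assuming buildroot's default output file name rootfs.cpio for the cpio filesystem image; -nographic puts the serial console on your terminal:

      $ qemu-system-ppc64 -M pseries -m 1G -nographic \
            -kernel output/images/vmlinux -initrd output/images/rootfs.cpio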

    6. To customize your build, you will sooner or later want to add additional files to the image, for example some additional init scripts in the /etc/init.d/ folder. Or in the above case, it would be good to also have a getty running on the graphical console. To add custom files, the best way is to create an overlay folder which will be copied into the destination filesystem during the make process:
      $ mkdir -p overlay/etc/init.d
      $ cp my-startup-script.sh overlay/etc/init.d/S99myscript  # If you have one
      $ cp output/target/etc/inittab overlay/etc/inittab
      $ echo 'tty1::respawn:/sbin/getty -L tty1 0 linux' >> overlay/etc/inittab
      

      Then run make menuconfig and set the Root filesystem overlay directories option in the System Configuration section to the overlay folder that you have just created. Run make again and the next time you start your guest, you should see the new files in the image, e.g. also a getty running on the graphical console. Note: Do not try to add or change files directly in the output/target/ folder. That looks tempting at first, but this is just a temporary folder used by the build system, which can be overwritten at any time and will be erased when you run make clean, for example.
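
      In case you are wondering what such a start-up script could contain, here is a minimal sketch (the script name S99myscript and the binary /usr/bin/myapp are placeholders for illustration only; buildroot's default init runs the /etc/init.d/S* scripts with the "start" argument at boot):

      #!/bin/sh
      # /etc/init.d/S99myscript - run by the default init scripts at boot
      case "$1" in
          start)
              echo "Starting myapp ..."
              /usr/bin/myapp &
              ;;
          stop)
              killall myapp
              ;;
          *)
              echo "Usage: $0 {start|stop}"
              ;;
      esac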

    7. If you need to tweak the kernel configuration, you can run make linux-menuconfig and make the appropriate changes there. For example, if you want to get keyboard input for the ppc64 pseries machine on the graphical console, you should enable the USB XHCI driver in the kernel, too. Once you are happy with the kernel configuration, save it, exit the menu and type make linux-rebuild && make. Note: To avoid the kernel config getting reset after you run make clean at a later point in time, you should copy output/build/linux-*/.config to a safe location. Then run make menuconfig, change the Kernel -> Kernel configuration setting to Use a custom config file and set the Configuration file path to the copied file.
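
      A rough command sketch of that workflow (the destination file name my-kernel.config is just an example for the saved copy):

      $ make linux-menuconfig        # e.g. enable the USB XHCI driver
      $ make linux-rebuild && make
      $ cp output/build/linux-*/.config ../my-kernel.config   # keep the config safe from "make clean"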

    8. If you want to add additional software to your image, you basically have to provide a Config.in file and a *.mk file. I recommend having a look at the various packages in the package/ directory: use a package from there with a similar build system as a template, and have a closer look at the buildroot manual for details. Tweaking the build system of your software to cross-compile properly can sometimes be a little bit tricky, but most software that uses standard systems like autoconf should be fine.
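
      To give an idea of the shape of those files, here is a minimal sketch for a hypothetical autotools-based package called "myapp" (all names, versions and URLs are placeholders for illustration):

      # package/myapp/Config.in
      config BR2_PACKAGE_MYAPP
          bool "myapp"
          help
            My custom application.

      # package/myapp/myapp.mk
      MYAPP_VERSION = 1.0
      MYAPP_SITE = http://www.example.com/downloads
      MYAPP_SOURCE = myapp-$(MYAPP_VERSION).tar.gz
      MYAPP_LICENSE = GPL-2.0

      $(eval $(autotools-package))

      The new Config.in also needs to be referenced from package/Config.in (via a source "package/myapp/Config.in" line) so that the package shows up in the Target packages menu.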

    That’s it. You should now be able to package your software in really small VM images. Of course, there are still lots of other settings that you can tweak in the buildroot environment – if you need any of these, just have a look at the buildroot manual for more information.

    January 28, 2019 02:20 PM
