Blogging about open source virtualization

News from QEMU, KVM, libvirt, libguestfs, virt-manager and related tools

December 06, 2019

KVM on Z

Documentation: KVM Virtual Server Management Update

Intended for KVM virtual server administrators, this book illustrates how to set up, configure, and operate Linux on KVM instances and their virtual devices running on the KVM host and IBM Z hardware.

This major update includes information about VFIO pass-through devices, virtual server setup with the virt-install command, setting up virtual switches with VNIC characteristics, and other features that are available with the latest versions of QEMU and libvirt.

You can download the .pdf here.

by Stefan Raspl (noreply@blogger.com) at December 06, 2019 06:26 AM

December 01, 2019

Fabiano Fidêncio

Adopting GitLab workflow

In October 2018 there was a short face-to-face meeting with a large part of the libosinfo maintainers, some contributors, and some users.

This short meeting took place during a lunch break on one of the KVM Forum 2018 days and, among other things, we discussed whether we should allow, and/or prefer, receiving patches through GitLab Merge Requests.

Here’s the announcement:

[Libosinfo] Merge Requests are enabled!

    From: Fabiano Fidêncio <fidencio redhat com>
    To: "libosinfo redhat com" <libosinfo redhat com>
    Subject: [Libosinfo] Merge Requests are enabled!
    Date: Fri, 21 Dec 2018 16:48:14 +0100

People,

Although the preferred way to contribute to libosinfo, osinfo-db and
osinfo-db-tools is still sending patches to this ML, we've decided to
also enable Merge Requests on our gitlab!

Best Regards,
--
Fabiano Fidêncio

Now, one year after that decision, let’s check what has been done, review some numbers, and discuss my take, as one of the maintainers, on the decision we made.

2019, the experiment begins …

After the e-mail shown above was sent, I kept using the mailing list as the preferred way to submit and review patches, keeping an eye on GitLab Merge Requests, until August 2019, when I fully switched to GitLab instead of the mailing list.

… and what changed? …

Well, to be honest, not much. But in order to expand on that a little, I have to describe my not-so-optimal workflow.

Even before describing my workflow, let me just make clear that:

  • I don’t have any scripts that would fetch the patches from my e-mail and apply them automagically for me;

  • I never ever got used to text-based mail clients (I’m a former Evolution developer and have been an Evolution user for several years);

Knowing those things, this is what my workflow looks like:

  • Development: I’ve been using GitLab for a few years as the main host of my forks of the projects I contribute to. When developing a new feature, I would:

    • Create a new branch;
    • Do the needed changes;
    • Push the new branch to the project on my GitLab account;
    • Submit the patches;
  • Review: It may sound weird, maybe it really is, but the way I review patches is:

    • Getting the patches submitted;
    • Applying atop of master;
    • Doing a git rebase -i so I can go through each one of the patches;
    • Then, for each one of the patches I would:
      • Add comments;
      • Do fix-up changes;
      • Squash my fixes atop of the original patch;
      • Move to the next patch;

And now, knowing my workflow, I can tell that pretty much nothing changed.

As part of the development workflow:

  • Submitting patches:

    • git publish -> click on the URL printed when a new branch is pushed to GitLab;
  • Reviewing patches:

    • Saving patch e-mails as mbox, applying them to my tree -> pull the MR

Everything else stays pretty much the same. I still do a git rebase -i and go through the patches, adding comments / fix-ups which, later on, I’ll have to organise and paste somewhere (either replying to the e-mail or adding to GitLab’s web UI), and that’s the part which consumes most of my time.
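
To make the difference concrete, here is a rough command-line sketch of the two review flows; the branch names, merge request number, remote name, and file path below are made up for illustration:

    # mailing-list review: save the series as an mbox and apply it locally
    git checkout -b review-series origin/master
    git am /path/to/series.mbox
    # walk through the patches one by one, adding fix-ups and squashing them
    git rebase -i origin/master

    # GitLab review: fetch the merge request head instead of applying an mbox
    git fetch origin refs/merge-requests/42/head:review-mr-42
    git checkout review-mr-42
    git rebase -i origin/master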

However, although the change was not big for me as a developer, some people had to adapt their workflow in order to start reviewing all the patches I’ve been submitting to GitLab. But let’s come back to this later on … :-)

Anyway, it’s important to make crystal clear that this is my personal experience, and I do understand that people who rely more heavily on text-based mail clients and/or on a bunch of scripts tailored to their development would have a different, perhaps very different, experience.

… do we have more contributions since the switch? …

As of November 26th, I checked the number of submissions we had on both the libosinfo mailing list and the libosinfo GitLab page during the current year.

Mind that I’m not counting my own submissions, and that I’m counting an osinfo-db addition, which usually consists of adding data & tests, as a single submission.

As for the mailing list, we’ve received 32 patches; as for GitLab, we’ve received 34 patches.

Quite a similar number of contributions; let’s dig a little deeper.

The 32 patches sent to our mailing list came from 8 different contributors, and all of them had at least one previous patch merged in one of the libosinfo projects.

The 34 patches sent to our GitLab came from 15 different contributors and, from those, only 6 of them had at least one previous patch merged in one of the libosinfo projects, whilst 9 of them were first time contributors (and I hope they’ll stay around, I sincerely do ;-)).

Maybe one thing to consider here is whether forking a project on GitLab is easier than subscribing to a new mailing list when submitting a patch. This is something people usually do once per project they contribute to, but subscribing to a mailing list may actually be a barrier.

Some people would argue, though, that it’s a barrier both ways, mainly considering one may extensively contribute to projects using one or the other workflow. IMHO, that’s not exactly true. Subscribing to a mailing list and getting the patches correctly formatted feels more difficult than forking a repo and submitting a Merge Request.

In my personal case, the only projects I contribute to which still haven’t adopted the GitLab / GitHub workflow are the libvirt ones, although that may change in the near future, as mentioned by Daniel P. Berrangé in his KVM Forum 2019 talk.

… what are the pros and cons? …

When talking about the “pros” and “cons”, it is really hard to say exactly which ones are objective and which are subjective.

  • pros

    • CI: The possibility of having CI running for all libosinfo projects, running the tests we have on each MR, without requiring any effort or knowledge from the contributor;

    • Tracking non-reviewed patches: Although this one may be related to each one’s workflow, it’s objective that figuring out which Merge Requests need review on GitLab is way easier for a new contributor than navigating through a mailing list;

    • Centralisation: This is one of the subjective ones, for sure. For libosinfo we have adopted GitLab as its issue tracker as well, which makes my life as maintainer quite easy as I have “Issues” and “Merge Requests” in a single place. It may not be true for different projects, though.

  • cons

    • Reviewing commit messages: It seems to be impossible to review commit messages, unless you make a general comment about them. Making such a comment, though, is not exactly practical, as I cannot go to the specific line I want to comment on and make a suggestion.

    • Creating an account to yet another service: This is another one of the subjective ones. It bothers me a lot, having to create an account on a different service in order to contribute to a project. This is my case with GitLab, GNOME GitLab, and GitHub. However, is that different from subscribing to a few different mailing lists? :-)

Those are, for me, the most prominent “pros” and “cons”. There are a few other things I’ve seen people complain about, the most common one being related to changing their workflow. And that is something worth its own section! :-)

… is there something out there to make my workflow change easier? …

Yes and no. That’s a horrible answer, ain’t it? :-)

Daniel P. Berrangé has created a project called Bichon, which is a tool providing a terminal based user interface for reviewing GitLab Merge Requests.

Cool, right? In general, yes. But you have to keep in mind that the project is still in its embryonic stage. When it is more mature, I’m pretty sure it’ll help people used to the mailing list workflow adapt easily to the GitLab workflow without leaving behind the convenience of doing everything via the command line.

I’ve been using the tool for simple things, and I’ve been contributing to it with simple patches. It’s fair to say that I do prefer adding a comment to Merge Requests, approving, and merging them using Bichon rather than via the web UI. Is the tool enough to satisfy all people’s needs? Of course not. Will it be? Hardly. But it may be enough to overcome the blockers of migrating away from the mailing list workflow.

… a few words from different contributors …

I’ve decided to ask Cole Robinson and Felipe Borges for a word or two about this subject, as they are contributors / reviewers of the libosinfo projects.

It should go without saying that their opinions should not be taken as “this workflow is better than the other”. However, take their words as valid points from people who are heavily using one workflow or the other, as Cole Robinson comes from the libvirt / virt-tools world, which relies heavily on mailing lists, and Felipe Borges comes from the GNOME world, which is a huge GitLab consumer.

“The change made things different for me, slightly worse but not in any disruptive way. The main place I feel the pain is putting review comments into a web UI rather than inline in email which is more natural for me. For a busier project than libosinfo I think the pain would ramp up, but it would also force me to adapt more. I’m still largely maintaining an email based review workflow and not living in GitLab / GitHub” - Cole Robinson

“The switch to Gitlab has significantly lowered the threshold for people getting started. The mailing list workflow has its advantages but it is an entry barrier for new contributors that don’t use native mail clients and that learned the Pull Request workflow promoted by GitLab/GitHub. New contributors now can easily browse the existing Issues and find something to work on, all in the same place. Reviewing contributions with inline discussions and being able to track the status of CI pipelines in the same interface is definitely a must. I’m sure Libosinfo foresees an increase in the number of contributors without losing existing ones, considering that another advantage of Gitlab is that it allows developers to interact with the service from email, similarly to the email-driven git workflow that we were using before.” - Felipe Borges

… is there any conclusion from the author’s side?

First of all, I have to emphasize two points:

  • Avoid keeping both workflows: Although we do that on libosinfo, it’s something I’d strongly discourage. It’s almost impossible to keep the information in sync in both places in a reasonable way.

  • Be aware of changes, be welcome to changes: As mentioned above, migrating from one workflow to another will be disruptive at some level. Is it actually a blocker? Although it was not for me, it may be for you. The thing to keep in mind here is to be aware of changes and welcome them knowing you won’t have a 1:1 replacement for your current workflow.

With that said, I’m mostly happy with the change made. The number of old time contributors has not decreased and, at the same time, the number of first time contributors has increased.

Another interesting fact is that the number of contributions using the mailing list has decreased, as we have had only 4 contributions through this channel since June 2019.

Well, that’s all I have to say about the topic. I sincerely hope that reading through this content somehow helps your project and its contributors get a better idea of what the migration involves.

December 01, 2019 12:00 AM

November 29, 2019

Stefan Hajnoczi

Visiting the Centre for Computing History

I visited the Centre for Computing History today in Cambridge, UK. It's home to old machines from computer science history, 80s home computers, 90s games consoles, and much more. It was nice to see familiar machines that I used to play with back in the day. This post has pictures from the visit to the museum.

The journey starts with the Megaprocessor, a computer built from 15,000 transistors with countless LEDs that allow you to see what's happening inside the machine while a program runs.

The Megaprocessor has its own instruction set architecture (ISA) with 4 General Purpose Registers (GPRs). The contents of the GPRs are visible on seven-segment displays and LEDs.

The instruction decoder looks fun. Although I didn't look in detail, it seems to be an old-school decoder where each bit in an instruction is hardcoded to enable or disable certain hardware units. No microcoded instructions here!

Ada Lovelace is considered the first programmer thanks to her work on the Analytical Engine. On a Women in Computer Science note, I learnt that Margaret Hamilton coined the term "software engineering". Hedy Lamarr also has an interesting background: movie star and inventor. There are posters throughout the museum featuring profiles on women in computer science that are quite interesting.

The museum is very hands-on with machines available to use and other items like books open to visitors. If nostalgia strikes and you want to sit down and play a game or program in BASIC, or just explore an exotic old machine, you can just do it! That is quite rare for a museum since these are historic items that can be fragile or temperamental.

Moving along in chronological order, here is the PDP-11 minicomputer that UNIX ran on in 1970! I've seen them in museums before but have yet to interact with one.

In the 1980s the MicroVAX ran VMS or ULTRIX. I've read about these machines but they were before my time! It's cool to see one.

This HP Graphics Terminal was amusing. I don't see anything graphical about ASCII art, but I think the machine was connected to a plotter/printer.

The museum has a lot of microcomputers from the 1980s including brands I've never heard of. There were also machines with laserdiscs or more obscure optical media, precursors of what eventually became the "multi-media" buzzword in the 90s when CD-ROMs became mainstream.

Speaking of optical media, here is a physical example of bitrot, the deterioration of data media or software!

Classic home computers: ZX Spectrum, Commodore 64, Atari ST Mega 2, and Acorn. The museum has machines that were popular in the UK, so the selection is a little different from what you find in the Computer History Museum in Mountain View, CA, USA.

There are games consoles from the 80s, 90s, and 2000s. The Gameboy holds a special place for me. Both the feel of the buttons and the look of the display still seem right in this age of high resolution color displays.

The museum has both the popular Nintendo, SEGA, and Sony consoles as well as rarer specimens that I've never seen in real life before. It was cool to see an Intellivision, Jaguar, etc.

Back to UNIX. This SGI Indy brought back memories. I remember getting a used one in order to play with the IRIX operating system. It was already an outdated machine at the time but the high resolution graphics and the camera were clearly ahead of their time.

Say hello to an old friend. I remember having exactly the same 56K modem! What a wonderful dial-up sound :).

And finally, the Palm pilot. Too bad that the company failed, they had neat hardware before smartphones came along. I remember programming and reverse engineering on the Palm.

Conclusion

If you visit Cambridge, UK be sure to check out the Centre for Computing History. It has an excellent collection of home computers and games consoles. I hope it will be expanded to cover the 2000s internet era too (old web browsers, big websites that no longer exist, early media streaming, etc).

by Unknown (noreply@blogger.com) at November 29, 2019 07:33 PM

November 27, 2019

Gerd Hoffmann

virtio gpu status and plans

Time for a status update on virtio-gpu development, so let's go ...

Before we begin: If you want to follow development and discussions more closely, head over to the virgl project at freedesktop gitlab. The git repos are there, and most discussions are happening in gitlab issues.

What happened

Over the course of the last year (2019) a bunch of cleanups happened in the virtio-gpu linux kernel driver code base, to prepare for the planned changes and to reduce the code size by using common drm code. Memory management was switched over to use the shmem helpers instead of ttm. fbdev support (for the framebuffer console) was switched over to the generic drm fbdev emulation. fence support was improved. Various bugfixes.

Planned feature: shared mappings

A small change to reduce image data copying. Currently resources have a guest and a host buffer, and there are transfer commands to copy data between the two. Shared mappings allow the host to use the guest buffer directly.

On the guest side this is pretty simple: the guest only needs to inform the host that a shared mapping for the given resource is fine, i.e. that the host might see changes without explicit transfer commands.

On the host side this is a bit more involved. Qemu will create a dma-buf for the resource, using the udmabuf driver. That in turn allows qemu to create a linear mapping of the resource, even if it is scattered in guest memory. That way the resource can be used directly (i.e. a pixman image can be created for it, ...)

Status: Almost ready to be submitted upstream.

Planned feature: blob resources

Currently virtio-gpu resources always have a format attached to them. So they are all-in-one objects, handling both memory management and image metadata. Combining both works ok for opengl, because we have a 1:1 relationship there. It will not work for vulkan though, because memory management works radically differently there.

The drm subsystem has separate entities too: gem objects for memory management and framebuffer objects for the format metadata. That is difficult to model for virtio-gpu. virtio-gpu supports only a single format for dumb framebuffers because of that, and dma-buf imports can't be supported either.

Blob resources will fix all that. A blob resource is just a bunch of bytes, i.e. it has only the memory management aspect.

Status: Proof-of-concept works. Various details are to be hashed out. Next in line after "shared mappings" feature.

Planned feature: metadata query

Allows guest userspace to query the host renderer for capabilities and allocation requirements. Similar to capsets but more flexible.

Status: Some test code exists. Not fully clear yet whether this makes sense as a feature on its own. Maybe it'll be folded into vulkan support.

Planned feature: host memory

For some features -- coherent memory for example, which is required by vulkan and at least one opengl extension -- the host gpu driver must allocate resources. So the model of using guest allocated memory for resources doesn't work, we have to map the host-allocated resources into the guest address space instead.

Virtio recently got support for (host-managed) shared memory, because virtiofs needs this for dax support. The plan is to use that for virtio-gpu too.

Status: Incomplete test code exists. A lot of the kernel driver cleanups were done to prepare the driver for this.

Planned feature: vulkan support

It's coming, but will surely take a while to actually arrive. As you have probably noticed while reading the article, the plans are already made with vulkan in mind, even for features that are useful without vulkan, so we don't have to change things again when vulkan actually lands.

by Gerd Hoffmann at November 27, 2019 11:00 PM

Stefan Hajnoczi

Software Freedom Conservancy donation matching is back!

Software Freedom Conservancy is a non-profit that provides a home for Git, QEMU, Inkscape, and many other popular open source projects. Conservancy is also influential in promoting free software and open source licenses, including best practices for license compliance. They help administer the Outreachy open source internship program that encourages diversity in open source. They are a small organization with just 5 full-time employees taking on many tasks important in the open source community.

The yearly donation matching event has started again, so now is the best time to become a supporter by donating!

by Unknown (noreply@blogger.com) at November 27, 2019 09:31 AM

November 26, 2019

Daniel Berrange

ANNOUNCE: libvirt-glib release 3.0.0

I am pleased to announce that a new release of the libvirt-glib package, version 3.0.0, is now available from

https://libvirt.org/sources/glib/

The packages are GPG signed with

Key fingerprint: DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

Changes in this release:

  • Add support for bochs video device
  • Add API to query firmware config
  • Improve testing coverage
  • Validate min/max glib API versions in use
  • Remove deprecated G_PARAM_PRIVATE
  • Fix docs build linking problems
  • Convert python demos to be python 3 compatible & use modern best practice for pyobject introspection bindings
  • Add API to query domain capabilities
  • Refresh translations
  • Simplify build process for handling translations
  • Fix some memory leaks
  • Add API for setting storage volume features

Thanks to everyone who contributed to this new release through testing, patches, bug reports, translations and more.

by Daniel Berrange at November 26, 2019 01:15 PM

November 21, 2019

QEMU project

Presentations from KVM Forum 2019

Last month, the KVM Forum 2019 took place in Lyon, France. The conference also featured quite a lot of talks about QEMU, and now the videos of the presentations are available online. So for those who could not attend, here is the list of QEMU-related presentations:

More interesting virtualization-related talks can be found in the KVM Forum Youtube Channel, and for LWN readers, there is “A recap of KVM Forum 2019” article, too.

by Thomas Huth at November 21, 2019 04:30 PM

November 19, 2019

Stefan Hajnoczi

Video and slides available for "virtio-fs: A Shared File System for Virtual Machines"

This year I presented virtio-fs at KVM Forum 2019 with David Gilbert and Miklos Szeredi. virtio-fs is a host<->guest file system that allows guests to access a shared directory on the host. We've been working on virtio-fs together with Vivek Goyal and community contributors since 2018 and are excited that it is now being merged upstream in Linux and QEMU.

virtio-fs gives guests file system access without the need for disk image files or copying files between the guest and host. You can even boot a guest from a directory on the host without a disk image file. Kata Containers 1.7 and later ship with virtio-fs support for running VM-isolated containers.

What is new and interesting about virtio-fs is that it takes advantage of the co-location of guests and the hypervisor to avoid file server communication and to provide local file system semantics. The guest can map the contents of files from the host page cache. This bypasses the guest page cache to reduce memory footprint and avoid copying data into guest RAM. Network file systems and earlier attempts at paravirtualized file systems, like virtio-9p, cannot do this since they are designed for message-passing communication only.
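
For readers who want to try it, the typical wiring looks roughly like the sketch below. The option names follow the upstream virtio-fs documentation, but exact flags can vary between virtiofsd and QEMU versions, and the paths, tag, and sizes here are only placeholders:

    # host: export a directory through the virtiofsd daemon
    virtiofsd --socket-path=/tmp/vhost-fs.sock -o source=/srv/shared -o cache=always &

    # host: attach a vhost-user-fs device backed by that socket
    # (virtio-fs requires shared guest memory, hence the memory backend setup)
    qemu-system-x86_64 -enable-kvm -m 4G \
      -object memory-backend-file,id=mem,size=4G,mem-path=/dev/shm,share=on \
      -numa node,memdev=mem \
      -chardev socket,id=char0,path=/tmp/vhost-fs.sock \
      -device vhost-user-fs-pci,chardev=char0,tag=myfs \
      [ ... ]

    # guest: mount the shared directory using the tag configured above
    mount -t virtiofs myfs /mnt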

To learn more about virtio-fs, check out the video or slides (PDF) from the presentation.

by Unknown (noreply@blogger.com) at November 19, 2019 08:25 AM

November 15, 2019

KVM on Z

KVM on IBM z15 Features

To take advantage of the new features of z15, the latest addition to the IBM Z family as previously announced here, use any of the following CPU models in your guest's domain XML:
  • Pre-defined model for z15
      <cpu mode='custom'>
        <model>gen15a</model>
      </cpu>
  • Use z15 features in a migration-safe way (recommended). E.g. when running on z15 this will be a superset of the gen15a model, and feature existence will be verified on the target system prior to a live migration:
      <cpu mode='host-model'/>
  • Use z15 features in a non-migration-safe way. I.e. feature existence will not be verified on the target system prior to a live migration:
      <cpu mode='host-passthrough'/>
Here is a list of features of the new hardware generation as supported in Linux kernel 5.2 and QEMU 4.1, all activated by default in the CPU models listed above:
  • Miscellaneous Instructions
    Following the example of previous machines, new helper and general purpose instructions were added:
      minste3     Miscellaneous-Instruction-Extensions Facility 3 
  • SIMD Extensions
    Following up on the SIMD instructions introduced with the previous z13 and z14 models, this feature provides further vector instructions, which can again be used in KVM guests.
    These new vector instructions can be used to improve decimal calculations as well as for implementing high performance variants of certain cryptographic operations.
    In the z15 CPU models, the respective feature is:
      vxpdeh      Vector-Packed-Decimal-Enhancement Facility
      vxeh2       Vector enhancements facility 2
     
  • Deflate Conversion
    Provide acceleration for zlib compression and decompression
    In the z15 CPU model, the respective feature is:
      dflt        Deflate conversion facility
  • MSA Updates
    z15 introduces a new Message Security Assist MSA9, providing elliptic curve cryptography. It supports message authentication, the generation of elliptic curve keys, and scalar multiplication.
    This feature can be exploited in KVM guests' kernels and userspace applications independently (i.e. a KVM guest's userspace applications can take advantage of these features irrespective of the guest's kernel version).
    In the z15 CPU model, the respective feature is:
      msa9        Message-security-assist-extension 9 facility
      msa9_pckmo  Message-security-assist-extension 9 PCKMO
                  subfunctions for protected ECC keys
The z15 CPU model was backported into several Linux distributions. It is readily available in RHEL8.1, SLES 15 SP1 (via maintweb updates for kernel and qemu) and Ubuntu 18.04.
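
To double-check what is actually available on a given system, the following commands can help. This is only a rough sketch; the exact output format differs between kernel, QEMU, and libvirt versions:

    # feature names advertised by the host kernel (on a z15 host this includes e.g. dflt)
    grep ^features /proc/cpuinfo

    # CPU models known to this QEMU build
    qemu-system-s390x -cpu help | grep gen15

    # the host-model CPU definition libvirt would expand, including features such as dflt or msa9
    virsh domcapabilities | sed -n '/<mode name=.host-model./,/<\/mode>/p'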

by Stefan Raspl (noreply@blogger.com) at November 15, 2019 09:53 AM

QEMU project

Micro-Optimizing KVM VM-Exits

Background on VM-Exits

KVM (Kernel-based Virtual Machine) is the Linux kernel module that allows a host to run virtualized guests (Linux, Windows, etc). The KVM “guest execution loop”, with QEMU (the open source emulator and virtualizer) as its user space, is roughly as follows: QEMU issues the ioctl(), KVM_RUN, to tell KVM to prepare to enter the CPU’s “Guest Mode” – a special processor mode which allows guest code to safely run directly on the physical CPU. The guest code, which is inside a “jail” and thus cannot interfere with the rest of the system, keeps running on the hardware until it encounters a request it cannot handle. Then the processor gives the control back (referred to as “VM-Exit”) either to kernel space, or to the user space to handle the request. Once the request is handled, native execution of guest code on the processor resumes again. And the loop goes on.

There are dozens of reasons for VM-Exits (Intel’s Software Developer Manual outlines 64 “Basic Exit Reasons”). For example, when a guest needs to emulate the CPUID instruction, it causes a “light-weight exit” to kernel space, because CPUID (among a few others) is emulated in the kernel itself, for performance reasons. But when the kernel cannot handle a request, e.g. to emulate certain hardware, it results in a “heavy-weight exit” to QEMU, to perform the emulation. These VM-Exits and subsequent re-entries (“VM-Enters”), even the light-weight ones, can be expensive. What can be done about it?
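
Before optimizing, it helps to measure which exit reasons dominate for a given guest. One common way to do that on the host is perf's KVM support; a minimal sketch follows, where QEMU_PID is a placeholder for the guest's QEMU process ID:

    # record VM-Exit events for the guest for ten seconds
    perf kvm stat record -p "$QEMU_PID" sleep 10

    # summarize exit reasons, counts and latencies
    perf kvm stat report --event=vmexit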

Guest workloads that are hard to virtualize

At the 2019 edition of the KVM Forum in Lyon, kernel developer Andrea Arcangeli addressed the kernel part of minimizing VM-Exits.

His talk touched on the cost of VM-Exits into the kernel, especially for guest workloads (e.g. enterprise databases) that are sensitive to their performance penalty. However, these workloads cannot avoid triggering VM-Exits with a high frequency. Andrea then outlined some of the optimizations he’s been working on to improve the VM-Exit performance in the KVM code path – especially in light of applying mitigations for speculative execution flaws (Spectre v2, MDS, L1TF).

Andrea gave a brief recap of the mitigations for the different kinds of speculative execution attacks (retpolines, IBPB, PTI, SSBD, etc). Following that, he outlined the performance impact of Spectre-v2 mitigations in the context of KVM.

The microbenchmark: CPUID in a one million loop

Andrea constructed a synthetic microbenchmark program (without any GCC optimizations or caching) which runs the CPUID instructions one million times in a loop. This microbenchmark is meant to focus on measuring the performance of a specific area of the code – in this case, to test the latency of VM-Exits.

While stressing that the results of these microbenchmarks do not represent real-world workloads, he had two goals in mind with it: (a) explain how the software mitigation works; and (b) to justify to the broader community the value of the software optimizations he’s working on in KVM.

Andrea then reasoned through several interesting graphs that show how CPU computation time gets impacted when you disable or enable the various kernel-space mitigations for Spectre v2, L1TF, MDS, et al.

The proposal: “KVM Monolithic”

Based on his investigation, Andrea proposed a patch series, “KVM monolithic”, to get rid of the KVM common module, ‘kvm.ko’. Instead the KVM common code gets linked twice, once into each of the vendor-specific KVM modules, ‘kvm-intel.ko’ and ‘kvm-amd.ko’.

The reason for doing this is that the ‘kvm.ko’ module indirectly calls (via the “retpoline” technique) the vendor-specific KVM modules at every VM-Exit, several times. These indirect calls—via function pointers in the C source code—were not optimal before, but the “retpoline” mitigation (which isolates indirect branches, that allow a CPU to execute code from arbitrary locations, from speculative execution) for Spectre v2 compounds the problem, as it degrades performance.

This approach will result in a few MiB of increased disk space for ‘kvm-intel.ko’ and ‘kvm-amd.ko’, but the upside in saved indirect calls, and the elimination of “retpoline” overhead at run-time more than compensate for it.

With the “KVM Monolithic” patch series applied, Andrea’s microbenchmarks show a double-digit improvement in performance with default mitigations (for Spectre v2, et al) enabled on both Intel ‘VMX’ and AMD ‘SVM’. And with ‘spectre_v2=off’ or for CPUs with IBRS_ALL in ARCH_CAPABILITIES, “KVM monolithic” still improve[s] performance, albeit on the order of 1%.

Conclusion

Removal of the common KVM module has a non-negligible positive performance impact. And the “KVM Monolithic” patch series is still actively being reviewed, modulo some pending clean-ups. Based on the upstream review discussion, KVM Maintainer, Paolo Bonzini, and other reviewers seemed amenable to merging the series.

Although we will still have to deal with mitigations for ‘indirect branch prediction’ for a long time, reducing the VM-Exit latency is important in general, and more specifically for guest workloads that happen to trigger frequent VM-Exits, without having to disable Spectre v2 mitigations on the host, as Andrea stated in the cover letter of his patch series.

by Kashyap Chamarthy at November 15, 2019 05:00 AM

November 09, 2019

Stefano Garzarella

KVM Forum 2019: virtio-vsock in QEMU, Firecracker and Linux

Slides and recording are available for the “virtio-vsock in QEMU, Firecracker and Linux: Status, Performance and Challenges“ talk that Andra Paraschiv and I presented at KVM Forum 2019. This was the 13th edition of the KVM Forum conference. It took place in Lyon, France in October 2019.

We talked about the current status and future work on the VSOCK drivers in Linux and how Firecracker and QEMU provide the virtio-vsock device.

Summary

Initially, Andra gave an overview of VSOCK; she described the state of the art and the key features:

  • it is very simple to configure: the host assigns a unique CID (Context-ID) to each guest, and no configuration is needed inside the guest;

  • it provides the AF_VSOCK address family, allowing user space applications in the host and guest to communicate using the standard POSIX Socket API (e.g. bind, listen, accept, connect, send, recv, etc.); a quick command-line sketch follows this list
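
As a quick illustration of that transparency, here is a sketch using ncat, whose AF_VSOCK support (the --vsock option, added in nmap 7.80 and listed later in this post) should be checked against your installed version; the CID and port are arbitrary examples:

    # host (CID 2 is the well-known host address): listen on vsock port 1234
    ncat --vsock -l 1234

    # guest: connect to the host on that port and send a message
    echo "hello from the guest" | ncat --vsock 2 1234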

Andra also described common use cases for VSOCK, such as guest agents (clipboard sharing, remote console, etc.), network applications using SOCK_STREAM, and services provided by the hypervisor to the guests.

Going into the implementation details, Andra explained how the device in the guest communicates with the vhost backend in the host, exchanging data and events (i.e. ioeventfd, irqfd).

Firecracker

Focusing on Firecracker, Andra gave a brief overview on this new VMM (Virtual Machine Monitor) written in Rust and she explained why, in the v0.18.0 release, they switched from the experimental vhost-vsock implementation to a vhost-less solution:

  • focus on security impact
  • less dependency on host kernel features

This change required device emulation in Firecracker, which implements the virtio-vsock device model over MMIO. The device is exposed on the host using UDS (Unix Domain Sockets).

Andra described how Firecracker maps the VSOCK ports onto the uds_path specified in the VM configuration (the host-initiated handshake is sketched right after the list):

  • Host-Initiated Connections

    • Guest: create an AF_VSOCK socket and listen() on PORT
    • Host: connect() to AF_UNIX at uds_path
    • Host: send() “CONNECT PORT\n”
    • Guest: accept() the new connection
  • Guest-Initiated Connections

    • Host: create and listen() on an AF_UNIX socket at uds_path_PORT
    • Guest: create an AF_VSOCK socket and connect() to HOST_CID and PORT
    • Host: accept() the new connection
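
The host-initiated direction can be exercised by hand against the UNIX socket Firecracker exposes. This is only a very rough sketch: the socket path and port are placeholders, and the exact CONNECT/OK handshake is described in Firecracker's vsock documentation:

    # guest: listen on vsock port 52 (here using ncat's --vsock support)
    ncat --vsock -l 52

    # host: connect to Firecracker's UNIX socket interactively, then type
    # "CONNECT 52"; Firecracker replies "OK <port>" and forwards the connection
    socat - UNIX-CONNECT:/tmp/firecracker-vsock.sock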

Finally, she showed the performance of this solution, running the iperf-vsock benchmark while varying the size of the buffer used in Firecracker to transfer packets between the virtio-vsock device and the UNIX domain socket. The throughput on the guest to host path reaches 10 Gbps.

QEMU

In the second part of the talk, I described the QEMU implementation. QEMU provides the virtio-vsock device using the vhost-vsock kernel module.

The vsock device in QEMU handles only:

  • configuration
    • user or management tool can configure the guest CID
  • live-migration
    • connected SOCK_STREAM sockets become disconnected. Applications must handle a connection reset error and should reconnect.
    • the guest CID may not be available on the new host because it can be assigned to another VM. In this case the guest is notified about the CID change.

The vhost-vsock kernel module handles the communication with the guest, providing in-kernel virtio device emulation, in order to have very high performance and to interface directly with the host socket layer. In this way, host applications can also directly use the POSIX Socket API to communicate with the guest. So guest and host applications can be swapped, changing only the destination CID.
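
Concretely, wiring a guest up only requires loading the vhost transport and assigning a CID on the QEMU command line; a minimal sketch, where CID 3 is an arbitrary example:

    # host: make sure the vhost transport is available
    modprobe vhost_vsock

    # host: expose the virtio-vsock device with guest CID 3
    qemu-system-x86_64 -enable-kvm -m 2G \
      -device vhost-vsock-pci,guest-cid=3 \
      [ ... ]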

virtio-vsock Linux drivers

After that, I told the story of VSOCK in the Linux tree, started in 2013 when the first implementation was merged, and the changes in the last year.

These changes mainly regard fixes, but for the virtio/vhost transports we also improved the performance with two simple changes released with Linux v5.4:

  • reducing the number of credit update messages exchanged
  • increasing the size of packets queued in the virtio-vsock device from 4 KB up to 64 KB, the maximum packet size handled by virtio-vsock devices.

With these changes we are able to reach ~40 Gbps in the Guest -> Host path, because the guest can now send up to 64 KB packets directly to the host; for the Host -> Guest path, we reached ~25 Gbps, because the host is still using 4 KB buffer preallocated by the guest.

Tools and languages that support VSOCK

In the last few years, several applications, tools, and languages started to support VSOCK and I listed them to update the audience:

  • Tools:

    • wireshark >= 2.40 [2017-07-19]
    • iproute2 >= 4.15 [2018-01-28]
      • ss
    • tcpdump
      • merged in master [2019-04-16]
    • nmap >= 7.80 [2019-08-10]
      • ncat
      • nbd
    • nbdkit >= 1.15.5 [2019-10-19]
    • libnbd >= 1.1.6 [2019-10-19]
    • iperf-vsock
      • iperf3 fork
  • Languages:

    • C
      • glibc >= 2.18 [2013-08-10]
    • Python
      • python >= 3.7 alpha 1 [2017-09-19]
    • Golang
    • Rust
      • libc crate >= 0.2.59 [2019-07-08]
        • struct sockaddr_vm
        • VMADDR_* macros
      • nix crate >= 0.15.0 [2019-08-10]
        • VSOCK supported in the socket API (nix::sys::socket)

Next steps

Concluding, I went through the next challenges that we are going to face:

  • multi-transport, to use VSOCK in a nested VM environment. We are limited by the fact that the current implementation can handle only one transport loaded at run time, so we can’t load virtio_transport and vhost_transport together in the L1 guest. I already sent some patches upstream [RFC, v1], but they are still in progress.

  • network namespace support to create independent addressing domains with VSOCK socket. This could be useful for partitioning VMs in different domains or, in a nested VM environment, to isolate host applications from guest applications bound to the same port.

  • virtio-net as a transport for virtio-vsock, to avoid re-implementing features already available in virtio-net, such as mergeable buffers, page allocation, and small packet handling.

From the audience

Other points to be addressed came from the comments we received from the audience:

  • loopback device: it could be very useful for developers to test applications that use VSOCK sockets. The current implementation supports loopback only in the guest, but it would be better to support it also in the host, adding the VMADDR_CID_LOCAL special address.

  • VM to VM communication was asked by several people. Introducing it in the VSOCK core could complicate the protocol, the addressing and could require some sort of firewall. For now we do not have in mind to do it, but I developed a simple user space application to solve this issue: vsock-bridge. In order to improve the performance of this solution, we will consider the possibility to add sendfile(2) or MSG_ZEROCOPY support to the AF_VSOCK core.

  • virtio-vsock Windows drivers are not planned to be addressed, but contributions are welcome. Other virtio Windows drivers are available in the vm-guest-drivers-windows repository.

Stay tuned!

by sgarzare@redhat.com (Stefano Garzarella) at November 09, 2019 05:45 PM

November 07, 2019

QEMU project

Fuzzing QEMU Device Emulation

QEMU (https://www.qemu.org/) emulates a large number of network cards, disk controllers, and other devices needed to simulate a virtual computer system, called the “guest”.

The guest is untrusted and QEMU may even be used to run malicious software, so it is important that bugs in emulated devices do not allow the guest to compromise QEMU and escape the confines of the guest. For this reason a Google Summer of Code project was undertaken to develop fuzz tests for emulated devices.

QEMU device emulation attack surface

Fuzzing is a testing technique that feeds random inputs to a program in order to trigger bugs. Random inputs can be generated quickly without relying on human guidance and this makes fuzzing an automated testing approach.

Device Fuzzing

Emulated devices are exposed to the guest through a set of registers and also through data structures located in guest RAM that are accessed by the device in a process known as Direct Memory Access (DMA). Fuzzing emulated devices involves mapping random inputs to the device registers and DMA memory structures in order to explore code paths in QEMU’s device emulation code.

Device fuzzing overview

Fuzz testing discovered an assertion failure in the virtio-net network card emulation code in QEMU that can be triggered by a guest. Fixing such bugs is usually easy once fuzz testing has generated a reproducer.

Modern fuzz testing intelligently selects random inputs such that new code paths are explored and previously-tested code paths are not tested repeatedly. This is called coverage-guided fuzzing and involves an instrumented program executable so the fuzzer can detect the code paths that are taken for a given input. This was surprisingly effective at automatically exploring the input space of emulated devices in QEMU without requiring the fuzz test author to provide detailed knowledge of device internals.

How Fuzzing was Integrated into QEMU

Device fuzzing in QEMU is driven by the open source libfuzzer library (https://llvm.org/docs/LibFuzzer.html). A special build of QEMU includes device emulation fuzz tests and launches without running a normal guest. Instead the fuzz test directly programs device registers and stores random data into DMA memory structures.
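
For readers unfamiliar with libFuzzer, the overall workflow looks roughly like the generic sketch below. This is not QEMU's actual build recipe, and the file names are placeholders (my_fuzz_target.c is assumed to define the LLVMFuzzerTestOneInput() entry point):

    # build the target with libFuzzer and AddressSanitizer instrumentation
    clang -g -fsanitize=fuzzer,address my_fuzz_target.c -o my-fuzzer

    # run it against a corpus directory; libFuzzer mutates inputs and uses
    # coverage feedback to find inputs that reach new code paths
    ./my-fuzzer -max_len=4096 corpus/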

The next step for the QEMU project will be to integrate fuzzing into Google’s OSS-Fuzz (https://google.github.io/oss-fuzz/) continuous fuzzing service. This will ensure that fuzz tests are automatically run after new code is merged into QEMU and bugs are reported to the community.

Conclusion

Fuzzing emulated devices has already revealed bugs in QEMU that would have been time-consuming to find through manual testing approaches. So far only a limited number of devices have been fuzz-tested and we hope to increase this number now that the foundations have been laid. The goal is to integrate these fuzz tests into OSS-Fuzz so that fuzz testing happens continuously.

This project would not have been possible without Google’s generous funding of Google Summer of Code. Alexander Oleinik developed the fuzzing code and was mentored by Bandan Das, Paolo Bonzini, and Stefan Hajnoczi.

by Stefan Hajnoczi and Alexander Oleinik at November 07, 2019 05:50 AM

November 06, 2019

Cornelia Huck

s390x changes in QEMU 4.2

You know the drill: QEMU is entering freeze (this time for 4.2), and there's a post on the s390x changes for the upcoming release.

TCG

  • Emulation for IEP (Instruction Execution Protection), a z14 feature, has been added.
  • A bunch of fixes in the vector instruction emulation and in the fault-handling code.

KVM

  • For quite some time now, the code has been implicitly relying on the presence of the 'flic' (floating interrupt controller) KVM device (which had been added in Linux 3.15). Nobody really complained, so we won't try to fix this up and instead make the dependency explicit.
  • The KVM memslot handling was reworked to be actually sane. Unfortunately, this breaks migration of huge KVM guests with more than 8TB of memory from older QEMUs. Migration of guests with less than 8TB continues to work, and there's no planned breakage of migration of >8TB guests starting with 4.2.

CPU models

  • We now know that the gen15a is called 'z15', so reflect this in the cpu model description.
  • The 'qemu' and the 'max' models gained some more features.
  • Under KVM, 'query-machines' will now return the correct default cpu model ('host-s390x-cpu').

Misc

  • The usual array of bugfixes, including in SCLP handling and in the s390-ccw bios.

by Cornelia Huck (noreply@blogger.com) at November 06, 2019 03:12 PM

October 20, 2019

KVM on Z

Ubuntu 19.10 released

Ubuntu Server 19.10 is out!
For a detailed list of KVM on Z changes, see the release notes here.

by Stefan Raspl (noreply@blogger.com) at October 20, 2019 01:12 PM

October 16, 2019

Fabiano Fidêncio

Libosinfo (Part I)

This is the first blog post of a series which will cover Libosinfo, what it is, who uses it, how it is used, how to manage it, and, finally, how to contribute to it.

A quick overview

Libosinfo is the operating system information database. As a project, it consists of three different parts, with the goal of providing a single place containing all the required information about an operating system in order to provision and manage it in a virtualized environment.

The project allows management applications to:

  • Automatically identify which operating system an ISO image or an installation tree is intended for;

  • Find the download location of installable ISO and LiveCD images;

  • Find the location of installation trees;

  • Query the minimum, recommended, and maximum CPU / memory / disk resources for an operating system;

  • Query the hardware supported by an operating system;

  • Generate scripts suitable for automating “Server” and “Workstation” installations;

The library (libosinfo)

The library API is written in C, taking advantage of GLib and GObject. Thanks to GObject Introspection, the API is automatically available in all dynamic programming languages with bindings for GObject (JavaScript, Perl, Python, and Ruby). Auto-generated bindings for Vala are also provided.

As part of libosinfo, three tools are provided (example invocations follow the list):

  • osinfo-detect: Used to detect an Operating System from a given ISO or installation tree.

  • osinfo-install-script: Used to generate a “Server” or “Workstation” install-script to perform automated installation of an Operating System;

  • osinfo-query: Used to query information from the database;
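
A couple of example invocations (the ISO path is just a placeholder):

    # identify which operating system an ISO belongs to
    osinfo-detect /path/to/Fedora-Workstation-Live-x86_64-30-1.2.iso

    # query the database, e.g. list all operating systems from a given vendor
    osinfo-query os vendor="Fedora Project"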

The database (osinfo-db)

The database is written in XML and it can either be consumed via libosinfo APIs or directly via management applications’ own code.

It contains information about the operating systems, devices, installation scripts, platforms, and datamaps (keyboard and language mappings for Windows and Linux OSes).

The database tools (osinfo-db-tools)

These are tools that can be used to manage the database, which is distributed as a tarball archive.

  • osinfo-db-import: Used to import an osinfo database archive;

  • osinfo-db-export: Used to export an osinfo database archive;

  • osinfo-db-validate: Used to validate the XML files in one of the osinfo database locations for compliance with the RNG schema.

  • osinfo-db-path: Used to report the paths associated with the standard database locations;

The consumers …

Libosinfo and osinfo-db have management applications as their target audience. Currently the libosinfo project is consumed by big players in the virtual machine management environment such as OpenStack Nova, virt-manager, GNOME Boxes, Cockpit Machines, and KubeVirt.

… a little bit about them …

  • OpenStack Nova: An OpenStack project that provides a way to provision virtual machines, baremetal servers, and (with limited support) system containers.

  • virt-manager: An application for managing virtual machines through libvirt.

  • GNOME Boxes: A simple application to view, access, and manage remote and virtual systems.

  • Cockpit Machines: A Cockpit extension to manage virtual machines running on the host.

  • KubeVirt: Virtual Machine Management on Kubernetes.

… and why they use it

  • Download ISOs: As libosinfo provides the ISO URLs, management applications can offer the user the option to download a specific operating system;

  • Automatically detect the ISO being used: As libosinfo can detect the operating system of an ISO, management applications can use this info to set reasonable default values for resources, to select the hardware supported, and to perform unattended installations.

  • Start tree installation: As libosinfo provides the tree installation URLs, management applications can use it to start a network-based installation without having to download the whole operating system ISO;

  • Set reasonable default values for RAM, CPU, and disk resources: As libosinfo knows the values that are recommended by the operating system’s vendors, management applications can rely on that when setting the default resources for an installation.

  • Automatically set the hardware supported: As libosinfo provides the list of hardware supported by an operating system, management applications can choose the best defaults based on this information, without taking the risk of ending up with a non-bootable guest.

  • Unattended install: as libosinfo provides unattended installations scripts for CentOS, Debian, Fedora, Fedora Silverblue, Microsoft Windows, OpenSUSE, Red Hat Enterprise Linux, and Ubuntu, management applications can perform unattended installations for both “Workstation” and “Server” profiles.

What’s next?

The next blog post will provide a “demo” of an unattended installation using both GNOME Boxes and virt-install and, based on that, explain how libosinfo is internally used by these projects.

By doing that, we’ll both cover how libosinfo can be used and also demonstrate how it can ease the usage of those management applications.

October 16, 2019 12:00 AM

September 26, 2019

Gerd Hoffmann

VGA and other display devices in qemu

There are a lot of emulated display devices available in qemu. This blog post introduces them, explains the differences between them, and describes the use cases they are good for.

The TL;DR version is in the recommendations section at the end of the article.

standard VGA

  • qemu: -vga std or -device VGA
  • libvirt: <model type='vga'/>
  • ✓ VGA compatible
  • ✓ vgabios support
  • ✓ UEFI support (QemuVideoDxe)
  • ✓ linux driver (bochs-drm.ko)

This is the default display device (on x86). It provides full VGA compatibility and support for a simple linear framebuffer (using the bochs dispi interface). It is the best choice compatibility wise, pretty much any guest should be able to bring up a working display on this device. Performance or usability can be better with other devices, see discussion below.

The device has 16 MB of video memory by default. This can be changed using the vgamem_mb property, -device VGA,vgamem_mb=32 for example will double the amount of video memory. The size must be a power of two, the valid range is 1 MB to 256 MB.

The linux driver supports page-flipping, so having room for 3-4 framebuffers is a good idea. The driver can leave the framebuffers in vram then instead of swapping them in and out. FullHD (1920x1080) for example needs a bit more than 8 MB for a single framebuffer, so 32 or 64 MB would be a good choice for that.

The UEFI setup allows you to choose the display resolution which OVMF will use to initialize the display at boot. Press ESC at the tianocore splash screen to enter setup, then go to "Device Manager" → "OVMF Platform Configuration".

bochs display device

  • qemu: -device bochs-display
  • libvirt: <model type='bochs'/>
  • ✗ not VGA compatible
  • ✓ vgabios support
  • ✓ UEFI support (QemuVideoDxe)
  • ✓ linux driver (bochs-drm.ko)

This device supports a simple linear framebuffer. It also uses the bochs dispi interface for modesetting, therefore the linear framebuffer configuration is fully compatible to the standard VGA device.

The bochs display is not VGA compatible though. There is no support for text mode, planar video modes, memory windows at 0xa0000 and other legacy VGA features in the virtual hardware.

The main advantage over standard VGA is that this device is a lot simpler. The code size and complexity needed to emulate this device are an order of magnitude smaller, resulting in a reduced attack surface. Another nice feature is that you can place this device in a PCI Express slot.

For UEFI guests it is safe to use the bochs display device instead of the standard VGA device. The firmware will setup a linear framebuffer as GOP anyway and never use any legacy VGA features.

For BIOS guests this device might be usable as well, depending on whether they depend on direct VGA hardware access or not. There is a vgabios which supports text rendering on a linear framebuffer, so software which uses the vgabios services for text output will continue to work. Linux bootloaders typically fall into this category. The linux text mode console (vgacon) uses direct hardware access and does not work. The framebuffer console (fbcon running on vesafb or bochs-drm) works.

virtio vga

  • qemu: -vga virtio or -device virtio-vga
  • libvirt: <model type='virtio'/> (on x86).
  • ✓ VGA compatible
  • ✓ vgabios support
  • ✓ UEFI support (QemuVideoDxe)
  • ✓ linux driver (virtio-gpu.ko)

This is a modern, virtio-based display device designed for virtual machines. It comes with VGA compatibility mode. You need a guest driver to make full use of this device. If your guest OS has no driver it should still show a working display thanks to the VGA compatibility mode, but the device will not provide any advantages over standard VGA then.

This device has (optional) hardware-assisted opengl acceleration support. This can be enabled using the virgl=on property, which in turn needs opengl support enabled (gl=on) in the qemu display.

This device has multihead support, which can be enabled using the max_outputs=2 property.
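
Putting these options together, here is a sketch of an invocation with virgl acceleration and two outputs, assuming a QEMU build with virglrenderer support; the display backend and the rest of the command line depend on your setup:

    qemu-system-x86_64 -enable-kvm -m 4G \
      -device virtio-vga,virgl=on,max_outputs=2 \
      -display gtk,gl=on \
      [ ... ]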

This device has no dedicated video memory (except for VGA compatibility), gpu data will be stored in main memory instead. Therefore this device has no config options for video memory size.

This is the place where most development happens, support for new, cool features will most likely be added to this device.

virtio gpu

  • qemu: -device virtio-gpu-pci
  • libvirt: <model type='virtio'/> (on arm).
  • ✗ not VGA compatible
  • ✗ no vgabios support
  • ✓ UEFI support (VirtioGpuDxe)
  • ✓ linux driver (virtio-gpu.ko)

This device lacks VGA compatibility mode but is otherwise identical to the virtio vga device. UEFI firmware can handle this, and if your guest has drivers too, you can use this instead of virtio-vga. This will reduce the attack surface (no complex VGA emulation support) and reduce the memory footprint by 8 MB (no pci memory bar for VGA compatibility). This device can be placed in a PCI Express slot.

vhost-user virtio gpu

There is a vhost-user variant of both virtio vga and virtio gpu. This allows running the virtio-gpu emulation in a separate process. This is good from the security perspective, especially if you want to use virgl 3D acceleration, and it also helps with opengl performance.

Run the gpu emulation process (see contrib/vhost-user-gpu/ in the qemu source tree):

./vhost-user-gpu --virgl -s vgpu.sock

Run qemu:

qemu \
  -chardev socket,id=vgpu,path=vgpu.sock \
  -device vhost-user-vga,chardev=vgpu \
  [ ... ]

libvirt support is in the works.

qxl vga

  • qemu: -vga qxl or -device qxl-vga.
  • libvirt: <model type='qxl' primary='yes'/>.
  • ✓ VGA compatible
  • ✓ vgabios support
  • ✓ UEFI support (QemuVideoDxe)
  • ✓ linux driver (qxl.ko)
  • ✓ windows driver

This is a slightly dated display device designed for virtual machines. It comes with VGA compatibility mode. You need a guest driver to make full use of this device. If your guest OS has no driver it should still show a working display thanks to the VGA compatibility mode, but the device will not provide any advantages over standard VGA then.

This device has support for 2D acceleration. This becomes more and more useless though as modern display devices don't have dedicated 2D acceleration support any more and use the 3D engine for everything. The same happens on the software side, modern desktops are rendering with opengl or vulkan instead of using 2D acceleration.

Spice and qxl support offloading 2D acceleration to the spice client (typically virt-viewer these days). That is quite complex, and with 2D acceleration being on the way out this becomes increasingly useless too. You might want to pick some simpler device for security reasons.

This device has multihead support, which can be enabled using the max_outputs=2 property. The linux driver will use this; the windows driver expects multiple devices instead (see below).

The amount of video memory for this device is configurable using the ram_size_mb and vram_size_mb properties for the two pci memory bars. The default is 64 MB for both, which should be plenty for typical use cases. When using 4K display resolution or multihead support you should assign more video memory though. When using small resolutions like 1024x768 you can assign less video memory to reduce the memory footprint.
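
For example, a sketch of bumping both memory bars for a 4K or multihead setup:

    qemu-system-x86_64 -enable-kvm -m 4G \
      -device qxl-vga,ram_size_mb=128,vram_size_mb=128 \
      [ ... ]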

qxl

  • qemu: -device qxl.
  • libvirt: <model type='qxl' primary='no'/>.

This device lacks VGA compatibility mode but is otherwise identical to the qxl vga device. Providing multihead support for windows guests is pretty much the only use case for this device. The windows guest driver expects one qxl device per secondary display (additionally to one qxl-vga device for the primary display).

cirrus vga

  • qemu: -vga cirrus or -device cirrus-vga.
  • libvirt: <model type='cirrus'/>.
  • ✓ VGA compatible
  • ✓ vgabios support
  • ✓ UEFI support (QemuVideoDxe)
  • ✓ linux driver (cirrus.ko)

Emulates a Cirrus SVGA device which used to be modern in the 1990s, more than 20 years ago. For the most part my blog article from 2014 is still correct; the device is mostly useful for guests which are equally old and ship with a driver for cirrus vga devices.

Two things have changed in the meantime though: Since qemu version 2.2 cirrus is no longer the default vga device. Also, the cirrus driver in the linux kernel has been completely rewritten. In kernel 5.2 & newer the cirrus driver uses a shadow framebuffer and converts formats on the fly to hide some of the cirrus oddities from userspace (Xorg/wayland), so things are working a bit better now. That doesn't cure everything though; in particular, the available display resolutions are still constrained by the small amount of video memory.

ati vga

  • qemu: -device ati-vga.
  • ✓ VGA compatible
  • ✓ vgabios support
  • ✗ no UEFI support

Emulates two ATI SVGA devices; the model property can be used to pick the variant. model=rage128p selects the "Rage 128 Pro" and model=rv100 selects the "Radeon RV100".
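
For example (a minimal sketch, other options omitted):

qemu-system-x86_64 \
  -device ati-vga,model=rage128p \
  [ ... ]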

The devices are newer (late 1990s / early 2000s) and more modern than the cirrus VGA. Nevertheless the use case is very similar: guests of a similar age which ship with drivers for those devices.

This device was added to qemu only recently and development is still in progress. The fundamentals are working (modesetting, hardware cursor), and the most important 2D accel operations are implemented too. 3D acceleration is not implemented yet.

Linux has both drm and fbdev drivers for these devices. The drm drivers are not working yet because the emulation is still incomplete (which hopefully changes in the future). The fbdev drivers are working. Modern linux distros prefer the drm drivers though, so you probably have to build your own kernel if you want to use this device.

ramfb

  • qemu: -device ramfb.
  • ✗ not VGA compatible
  • ✓ vgabios support
  • ✓ UEFI support (QemuRamfbDxe)

Very simple display device. It uses a framebuffer stored in guest memory. The firmware initializes it and allows using it as a boot display (grub boot menu, efifb, ...) without needing complex legacy VGA emulation. Details can be found here.

no display device

  • qemu: -vga none -nographic.

You don't have to use a display device. If you don't need one you can run your guests with a serial console instead.
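
For example, a Linux guest can be run entirely on the serial console like this (a minimal sketch; the kernel image path and command line are placeholders):

qemu-system-x86_64 -vga none -nographic \
  -kernel /path/to/bzImage \
  -append 'console=ttyS0' \
  [ ... ]

With -nographic the guest's serial port (and the qemu monitor) is multiplexed onto stdio, so the console output shows up directly in the terminal.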

embedded devices

There are a bunch of other display devices. Those are typically SoC-specific and used by embedded board emulation. They are mentioned here only for completeness. You can't select the display device for embedded boards; the qemu emulation simply matches the physical hardware here.

recommendations

For the desktop use case (assuming display performance matters and/or you need multihead support), in order of preference:

For the server use case (assuming the GUI is rarely used, or not at all), in order of preference:

On arm systems display devices with a pci memory bar do not work, which reduces the choices a lot. We are left with:

by Gerd Hoffmann at September 26, 2019 10:00 PM

September 13, 2019

KVM on Z

IBM z15 announced

Today, IBM announced the new IBM Z models:
Furthermore, check the updated IBM Z tested platforms matrix here.
We will look at features in support of the new IBM Z model in a separate blog entry soon.

by Stefan Raspl (noreply@blogger.com) at September 13, 2019 09:42 AM

September 02, 2019

KVM on Z

virt-manager 2.2 released

virt-manager v2.2 was released a while ago. One feature we contributed is the ability to choose a temporary boot device. Here is a quick write-up on how to use that feature.

virt-xml is a simple command line tool for editing domain definitions. It can be used interactively or for batch processing. Starting with virt-manager v2.2, virt-xml allows to boot a guest with a temporarily changed domain definition. This allows us to specify a boot configuration other than the guest's current domain definition. This is especially useful as the IBM Z architecture allows for only a single boot device, and therefore the boot order settings do not work the way they do on other platforms: If the first boot device fails to boot, there is no attempt made to boot from the next boot device. In addition, the architecture/BIOS has no support for interactively changing the boot device during the boot/IPL process.
Therefore, two new command line options were introduced:
  • --no-define makes any changes to the domain definition transient (i.e. the guest's persistent domain XML will not be modified)
  • --start allows the user to start the domain after the changes to the domain XML were applied
Here is a simple example illustrating the usage:
  1. First, select the device to be changed using a selector. In this example, the unique target name of the disk is used. See man virt-xml for a list of further possibilities.
  2. Temporarily modify the boot order, assign the first slot to device vdc, and start the guest right away:

      $ virt-xml --edit target='vdc' --disk='boot_order=1' --start \
                 --no-define sample_domain

    Note: If there was another device that already had boot_order=1, its boot order would be incremented.
As soon as the guest is stopped, the changes vanish.

    by Stefan Raspl (noreply@blogger.com) at September 02, 2019 03:22 PM

    August 29, 2019

    KVM on Z

    Webinar: How to Virtualize with KVM in Live Demo, August 28

    Abstract
    We will explain basic KVM concepts, including CPU and memory virtualization, storage, and network management, and give a brief overview of commonalities and differences with other virtualization environments. Furthermore, a live demo will show how to use the KVM management tools to create and install Linux guests, and how to operate and monitor them.

    Speaker
    Christian Bornträger, Chief Product Owner Linux and KVM on IBM Z.

    Registration
    Register here. You can check the system requirements here.
    After registering, you will receive a confirmation email containing information about joining the webinar.

    Replay & Archive
    All sessions are recorded. For the archive as well as a replay and handout of this session and all previous webcasts see here.

    by Stefan Raspl (noreply@blogger.com) at August 29, 2019 12:10 PM

    August 24, 2019

    Stefano Garzarella

    How to measure the boot time of a Linux VM with QEMU/KVM

    The stefano-garzarella/qemu-boot-time repository contains a Python perf-script and (Linux, QEMU, SeaBIOS) patches to measure the boot time of a Linux VM with QEMU/KVM.

    Using I/O writes, we can trace events to measure the time consumed during the boot phase by the different components:

    We extended the I/O port addresses and values defined in qboot/benchmark.h, adding new trace points to trace the kernel boot time.

    In the repository you can find patches for Linux, QEMU, and SeaBIOS to add the I/O writes in the components involved during the boot, and a Python perf-script useful to process the data recorded through perf using perf-script’s built-in Python interpreter.

    Trace points

    The benchmark.h file contains the following trace points used in the patches:

    • QEMU
      • qemu_init_end: first kvm_entry (i.e. QEMU initialization has finished)
    • Firmware (SeaBIOS + optionrom or qboot)
      • fw_start: first entry of the firmware
      • fw_do_boot: after the firmware initialization (e.g. PCI setup, etc.)
      • linux_start_boot: before the jump to the Linux kernel
      • linux_start_pvhboot: before the jump to the Linux PVH kernel
    • Linux Kernel
      • linux_start_kernel: first entry of the Linux kernel
      • linux_start_user: before starting the init process

    Custom trace points

    If you want to add new trace points, you can simply add an I/O write to the LINUX_EXIT_PORT or FW_EXIT_PORT I/O port with a value (> 7) that identifies the trace point:

        outb(10, LINUX_EXIT_PORT);

    The perf script output will contain an Exit point 10 line that identifies your custom trace point:

     qemu_init_end: 143.770419
     fw_start: 143.964328 (+0.193909)
     fw_do_boot: 164.71107 (+20.746742)
     Exit point 10: 165.396804 (+0.685734)
     linux_start_kernel: 165.979486 (+0.582682)
     linux_start_user: 272.178335 (+106.198849)

    How to use

    Clone qemu-boot-time repository

    REPOS="$HOME/repos"
    cd ${REPOS}
    git clone https://github.com/stefano-garzarella/qemu-boot-time.git

    Apply patches to Linux, QEMU and SeaBIOS

    Trace points are printed only if they are recorded, so you can enable just a few of them, patching only the components you are interested in.

    Linux

    Apply the patches/linux.patch to your Linux kernel in order to trace kernel events

    cd ${REPOS}/linux
    git checkout -b benchmark
    git am ${REPOS}/qemu-boot-time/patches/linux.patch

    QEMU

    Apply the patches/qemu.patch to your QEMU in order to trace optionrom events

    cd ${REPOS}/qemu
    git checkout -b benchmark
    git am ${REPOS}/qemu-boot-time/patches/qemu.patch
    
    mkdir build-benchmark
    cd build-benchmark
    ../configure --target-list=x86_64-softmmu ...
    make

    You can use qemu-system-x86_64 -L ${REPOS}/qemu/build-benchmark/pc-bios/optionrom/ ... to use the patched optionrom.

    SeaBIOS

    Apply the patches/seabios.patch to your SeaBIOS in order to trace bios events

    cd ${REPOS}/seabios
    git checkout -b benchmark
    git am ${REPOS}/qemu-boot-time/patches/seabios.patch
    
    make clean distclean
    cp ${REPOS}/qemu/roms/config.seabios-256k .config
    make oldnoconfig

    You can use qemu-system-x86_64 -bios ${REPOS}/seabios/out/bios.bin ... to use the patched SeaBIOS image.

    qboot

    qboot already defines trace points; we just need to compile it with BENCHMARK_HACK defined:

    cd ${REPOS}/qboot
    make clean
    BIOS_CFLAGS="-DBENCHMARK_HACK=1" make

    You can use qemu-system-x86_64 -bios ${REPOS}/qboot/bios.bin ... to use the qboot image.

    Enable KVM events

    The following steps allow perf record to get the kvm trace events:

    echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
    echo -1 > /proc/sys/kernel/perf_event_paranoid
    mount -o remount,mode=755 /sys/kernel/debug
    mount -o remount,mode=755 /sys/kernel/debug/tracing

    Record the trace events

    Start perf record to get the trace events

    PERF_DATA="qemu_perf.data"
    perf record -a -e kvm:kvm_entry -e kvm:kvm_pio -e sched:sched_process_exec \
                -o $PERF_DATA &
    PERF_PID=$!

    You can run QEMU multiple times to also get some statistics (Avg/Min/Max):

    qemu-system-x86_64 -machine q35,accel=kvm \
                       -bios seabios/out/bios.bin \
                       -L qemu/build-benchmark/pc-bios/optionrom/ \
                       -kernel linux/bzImage ...
    qemu-system-x86_64 -machine q35,accel=kvm \
                       -bios seabios/out/bios.bin \
                       -L qemu/build-benchmark/pc-bios/optionrom/ \
                       -kernel linux/bzImage ...
    qemu-system-x86_64 -machine q35,accel=kvm \
                       -bios seabios/out/bios.bin \
                       -L qemu/build-benchmark/pc-bios/optionrom/ \
                       -kernel linux/bzImage ...

    Stop perf record

    kill $PERF_PID

    Process the trace recorded using the qemu-perf-script.py

    Note: times are printed in milliseconds

    perf script -s ${REPOS}/qemu-boot-time/perf-script/qemu-perf-script.py -i $PERF_DATA
    
    in trace_begin
    sched__sched_process_exec     1 55061.435418353   289738 qemu-system-x86
    kvm__kvm_entry           1 55061.466887708   289741 qemu-system-x86
    kvm__kvm_pio             1 55061.467070650   289741 qemu-system-x86      rw=1, port=0xf5, size=1, count=1, val=1
    
    kvm__kvm_pio             1 55061.475818073   289741 qemu-system-x86      rw=1, port=0xf5, size=1, count=1, val=4
    
    kvm__kvm_pio             1 55061.477168037   289741 qemu-system-x86      rw=1, port=0xf4, size=1, count=1, val=3
    
    kvm__kvm_pio             1 55061.558779540   289741 qemu-system-x86      rw=1, port=0xf4, size=1, count=1, val=5
    
    kvm__kvm_pio             1 55061.686849663   289741 qemu-system-x86      rw=1, port=0xf4, size=1, count=1, val=6
    
    sched__sched_process_exec     4 55067.461869075   289793 qemu-system-x86
    kvm__kvm_entry           4 55067.496402472   289796 qemu-system-x86
    kvm__kvm_pio             4 55067.496555385   289796 qemu-system-x86      rw=1, port=0xf5, size=1, count=1, val=1
    
    kvm__kvm_pio             4 55067.505067184   289796 qemu-system-x86      rw=1, port=0xf5, size=1, count=1, val=4
    
    kvm__kvm_pio             4 55067.506395502   289796 qemu-system-x86      rw=1, port=0xf4, size=1, count=1, val=3
    
    kvm__kvm_pio             4 55067.584029910   289796 qemu-system-x86      rw=1, port=0xf4, size=1, count=1, val=5
    
    kvm__kvm_pio             4 55067.704751791   289796 qemu-system-x86      rw=1, port=0xf4, size=1, count=1, val=6
    
    sched__sched_process_exec     0 55070.073823767   289827 qemu-system-x86
    kvm__kvm_entry           0 55070.110507211   289830 qemu-system-x86
    kvm__kvm_pio             0 55070.110694645   289830 qemu-system-x86      rw=1, port=0xf5, size=1, count=1, val=1
    
    kvm__kvm_pio             1 55070.120092692   289830 qemu-system-x86      rw=1, port=0xf5, size=1, count=1, val=4
    
    kvm__kvm_pio             1 55070.121437922   289830 qemu-system-x86      rw=1, port=0xf4, size=1, count=1, val=3
    
    kvm__kvm_pio             1 55070.198628779   289830 qemu-system-x86      rw=1, port=0xf4, size=1, count=1, val=5
    
    kvm__kvm_pio             1 55070.315734630   289830 qemu-system-x86      rw=1, port=0xf4, size=1, count=1, val=6
    
    in trace_end
    Trace qemu-system-x86
    1) pid 289738
     qemu_init_end: 31.469355
     fw_start: 31.652297 (+0.182942)
     fw_do_boot: 40.39972 (+8.747423)
     linux_start_boot: 41.749684 (+1.349964)
     linux_start_kernel: 123.361187 (+81.611503)
     linux_start_user: 251.43131 (+128.070123)
    2) pid 289793
     qemu_init_end: 34.533397
     fw_start: 34.68631 (+0.152913)
     fw_do_boot: 43.198109 (+8.511799)
     linux_start_boot: 44.526427 (+1.328318)
     linux_start_kernel: 122.160835 (+77.634408)
     linux_start_user: 242.882716 (+120.721881)
    3) pid 289827
     qemu_init_end: 36.683444
     fw_start: 36.870878 (+0.187434)
     fw_do_boot: 46.268925 (+9.398047)
     linux_start_boot: 47.614155 (+1.34523)
     linux_start_kernel: 124.805012 (+77.190857)
     linux_start_user: 241.910863 (+117.105851)
    
    Avg
     qemu_init_end: 34.228732
     fw_start: 34.403161 (+0.174429)
     fw_do_boot: 43.288918 (+8.885757)
     linux_start_boot: 44.630088 (+1.34117)
     linux_start_kernel: 123.442344 (+78.812256)
     linux_start_user: 245.408296 (+121.965952)
    
    Min
     qemu_init_end: 31.469355
     fw_start: 31.652297 (+0.182942)
     fw_do_boot: 40.39972 (+8.747423)
     linux_start_boot: 41.749684 (+1.349964)
     linux_start_kernel: 122.160835 (+80.411151)
     linux_start_user: 241.910863 (+119.750028)
    
    Max
     qemu_init_end: 36.683444
     fw_start: 36.870878 (+0.187434)
     fw_do_boot: 46.268925 (+9.398047)
     linux_start_boot: 47.614155 (+1.34523)
     linux_start_kernel: 124.805012 (+77.190857)
     linux_start_user: 242.882716 (+118.077704)

    by sgarzare@redhat.com (Stefano Garzarella) at August 24, 2019 01:03 PM

    August 23, 2019

    Stefano Garzarella

    QEMU 4.0 boots uncompressed Linux x86_64 kernel

    QEMU 4.0 is now able to boot directly into the uncompressed Linux x86_64 kernel binary with minimal firmware involvement using the PVH entry point defined in the x86/HVM direct boot ABI. (CONFIG_PVH=y must be enabled in the Linux config file).

    The x86/HVM direct boot ABI was initially developed for Xen guests, but with the latest changes in both QEMU and Linux, QEMU is able to use that same entry point for booting KVM guests.

    Prerequisites

    • QEMU >= 4.0
    • Linux kernel >= 4.21
      • CONFIG_PVH=y enabled
      • vmlinux uncompressed image

    How to use

    To boot the PVH kernel image, you can use the -kernel parameter specifying the path to the vmlinux image.

    qemu-system-x86_64 -machine q35,accel=kvm \
        -kernel /path/to/vmlinux \
        -drive file=/path/to/rootfs.ext2,if=virtio,format=raw \
        -append 'root=/dev/vda console=ttyS0' -vga none -display none \
        -serial mon:stdio

    The -initrd and -append parameters are also supported, just as for compressed kernel images.

    Details

    QEMU will automatically recognize if the vmlinux image has the PVH entry point and it will use SeaBIOS with the new pvh.bin optionrom to load the uncompressed image into the guest VM.
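
    Whether a given vmlinux provides the PVH entry point can be checked by looking for the Xen ELF note that advertises it, for example (a minimal sketch using binutils' readelf; the path is a placeholder):

    readelf -n /path/to/vmlinux | grep -A 2 Xen

    If no "Xen" note shows up, the kernel was probably built without CONFIG_PVH=y.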

    As an alternative, qboot can be used to load the PVH image.

    Performance

    Perf script and patches used to measure boot time: https://github.com/stefano-garzarella/qemu-boot-time.

    The following values are expressed in milliseconds [ms]

    • QEMU (q35 machine) + SeaBIOS + bzImage

      • qemu_init_end: 36.072056
      • linux_start_kernel: 114.669522 (+78.597466)
      • linux_start_user: 191.748567 (+77.079045)
    • QEMU (q35 machine) + SeaBIOS + vmlinux(PVH)

      • qemu_init_end: 51.588200
      • linux_start_kernel: 62.124665 (+10.536465)
      • linux_start_user: 139.460582 (+77.335917)
    • QEMU (q35 machine) + qboot + bzImage

      • qemu_init_end: 36.443638
      • linux_start_kernel: 106.73115 (+70.287512)
      • linux_start_user: 184.575531 (+77.844381)
    • QEMU (q35 machine) + qboot + vmlinux(PVH)

      • qemu_init_end: 51.877656
      • linux_start_kernel: 56.710735 (+4.833079)
      • linux_start_user: 133.808972 (+77.098237)
    • Tracepoints:

      • qemu_init_end: first kvm_entry (i.e. QEMU initialization has finished)
      • linux_start_kernel: first entry of the Linux kernel (start_kernel())
      • linux_start_user: before starting the init process

    Patches

    Linux patches merged upstream in Linux 4.21:

    QEMU patches merged upstream in QEMU 4.0:

    qboot patches merged upstream:

    by sgarzare@redhat.com (Stefano Garzarella) at August 23, 2019 01:26 PM

    August 22, 2019

    Stefano Garzarella

    iperf3-vsock: how to measure VSOCK performance

    The iperf-vsock repository contains a few patches that add support for the VSOCK address family to iperf3. In this way iperf3 can be used to measure the performance between guest and host using VSOCK sockets.

    The VSOCK address family facilitates communication between virtual machines and the host they are running on.

    To test VSOCK sockets (Linux only), you must use the new --vsock option on both server and client. Other iperf3 options (e.g. -t, -l, -P, -R, --bidir) are well supported by VSOCK tests.
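
    For example, a client run combining some of these options could look like this (illustrative values; 3 is assumed to be the guest CID, as in the example further below):

    # 30 second test, 4 parallel streams, reversed direction (the server transmits)
    iperf3 --vsock -c 3 -t 30 -P 4 -R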

    Prerequisites

    • Linux host kernel >= 4.8
    • Linux guest kernel >= 4.8
    • QEMU >= 2.8

    Build iperf3-vsock from source

    Clone repository

    git clone https://github.com/stefano-garzarella/iperf-vsock
    cd iperf-vsock

    Building

    mkdir build
    cd build
    ../configure
    make

    (Note: If configure fails, try running ./bootstrap.sh first)

    Example with Fedora 30 (host and guest):

    Host: start the VM

    GUEST_CID=3
    sudo modprobe vhost_vsock
    sudo qemu-system-x86_64 -m 1G -smp 2 -cpu host -M accel=kvm	\
         -drive if=virtio,file=/path/to/fedora.img,format=qcow2     \
         -device vhost-vsock-pci,guest-cid=${GUEST_CID}

    Guest: start iperf server

    # SELinux can block you, so you can write a policy or temporarily disable it
    sudo setenforce 0
    iperf-vsock/build/src/iperf3 --vsock -s

    Host: start iperf client

    iperf-vsock/build/src/iperf3 --vsock -c ${GUEST_CID}

    Output

    Connecting to host 3, port 5201
    [  5] local 2 port 4008596529 connected to 3 port 5201
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec  1.30 GBytes  11.2 Gbits/sec
    [  5]   1.00-2.00   sec  1.67 GBytes  14.3 Gbits/sec
    [  5]   2.00-3.00   sec  1.57 GBytes  13.5 Gbits/sec
    [  5]   3.00-4.00   sec  1.49 GBytes  12.8 Gbits/sec
    [  5]   4.00-5.00   sec   971 MBytes  8.15 Gbits/sec
    [  5]   5.00-6.00   sec  1.01 GBytes  8.71 Gbits/sec
    [  5]   6.00-7.00   sec  1.44 GBytes  12.3 Gbits/sec
    [  5]   7.00-8.00   sec  1.62 GBytes  13.9 Gbits/sec
    [  5]   8.00-9.00   sec  1.61 GBytes  13.8 Gbits/sec
    [  5]   9.00-10.00  sec  1.63 GBytes  14.0 Gbits/sec
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-10.00  sec  14.3 GBytes  12.3 Gbits/sec                  sender
    [  5]   0.00-10.00  sec  14.3 GBytes  12.3 Gbits/sec                  receiver
    
    iperf Done.

    by sgarzare@redhat.com (Stefano Garzarella) at August 22, 2019 03:52 PM

    August 19, 2019

    KVM on Z

    QEMU v4.1 released

    QEMU v4.1 is out. For highlights from a KVM on Z perspective see the Release Notes.
    Note: The DASD IPL feature is still considered experimental.

    by Stefan Raspl (noreply@blogger.com) at August 19, 2019 01:14 PM

    August 16, 2019

    QEMU project

    QEMU version 4.1.0 released

    We would like to announce the availability of the QEMU 4.1.0 release. This release contains 2000+ commits from 176 authors.

    You can grab the tarball from our download page. The full list of changes is available in the Wiki.

    Highlights include:

    • ARM: FPU emulation support for Cortex-M CPUs, FPU fixes for Cortex-R5F
    • ARM: ARMv8.5-RNG extension support for CPU-generated random numbers
    • ARM: board build options now configurable via new Kconfig-based system
    • ARM: Exynos4210 SoC model now supports PL330 DMA controllers
    • MIPS: improved emulation performance of numerous MSA instructions, mostly integer and data permuting operations
    • MIPS: improved support for MSA ASE instructions on big-endian hosts, handling for ‘division by zero’ cases now matches reference hardware
    • PowerPC: pseries: support for NVIDIA V100 GPU/NVLink2 passthrough via VFIO
    • PowerPC: pseries: in-kernel acceleration for XIVE interrupt controller
    • PowerPC: pseries: support for hot-plugging PCI host bridges
    • PowerPC: emulation optimizations for vector (Altivec/VSX) instructions
    • RISC-V: support for new “spike” machine model
    • RISC-V: ISA 1.11.0 support for privileged architectures
    • RISC-V: improvements for 32-bit syscall ABI, illegal instruction handling, and built-in debugger
    • RISC-V: support for CPU topology in device trees
    • s390: bios support for booting from ECKD DASD assigned to guest via vfio-ccw
    • s390: emulation support for all “Vector Facility” instructions
    • s390: additional facilities and support for gen15 machines, including support for AP Queue Interruption Facility for using interrupts for vfio-ap devices
    • SPARC: sun4m/sun4u: fixes when running with -vga none (OpenBIOS)
    • x86: emulation support for new Hygon Dhyana and Intel SnowRidge CPU models
    • x86: emulation support for RDRAND extension
    • x86: md-clear/mds-no feature flags, for detection/mitigation of MDS vulnerabilities (CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091)
    • x86: CPU die topology now configurable using -smp …,dies=
    • Xtensa: support for memory protection unit (MPU) option
    • Xtensa: support for Exclusive Access option
    • GUI: virtio-gpu 2D/3D rendering may now be offloaded to an external vhost-user process, such as QEMU vhost-user-gpu
    • GUI: semihosting output can now be redirected to a chardev backend
    • qemu-img: added a --salvage option to qemu-img convert, which prevents the conversion process from aborting on I/O errors (can be used for example to salvage partially corrupted qcow2 files)
    • qemu-img: qemu-img rebase works now even when the input file doesn’t have a backing file yet
    • VMDK block driver now has read-only support for the seSparse subformat
    • GPIO: support for SiFive GPIO controller
    • and lots more…

    Thank you to everyone involved!

    August 16, 2019 05:50 AM

    August 09, 2019

    KVM on Z

    New Documentation: Configuring Crypto Express Adapters for KVM Guests


    See here for a new documentation release on how to configure Crypto Express adapters for KVM guests.

    by Stefan Raspl (noreply@blogger.com) at August 09, 2019 02:49 PM

    August 07, 2019

    Daniel Berrange

    ANNOUNCE: gtk-vnc 1.0.0 release

    I’m pleased to announce a new release of GTK-VNC, version 1.0.0.

    https://download.gnome.org/sources/gtk-vnc/1.0/gtk-vnc-1.0.0.tar.xz (211K)
    sha256sum: a81a1f1a79ad4618027628ffac27d3391524c063d9411c7a36a5ec3380e6c080
    

    Pay particular attention to the first two major changes in this release:

    • Autotools build system replaced with meson
    • Support for GTK-2 is dropped. GTK-3 is mandatory
    • Support for libview is dropped in example program
    • Improvements to example demos
    • Use MAP_ANON if MAP_ANONYMOUS doesn’t exist to help certain macOS versions
    • Fix crash when connection attempt fails early
    • Initialize gcrypt early in auth process
    • Emit vnc-auth-failure signal when SASL auth fails
    • Emit vnc-error signal when authentication fails
    • Fix double free when failing to read certificates
    • Run unit tests in RPM build
    • Modernize RPM spec
    • Fix race condition in unit tests
    • Fix install of missing header for cairo framebuffer
    • Fix typo in gir package name
    • Add missing VncPixelFormat file to gir data

    Thanks to all those who reported bugs and provided patches that went into this new release.

    by Daniel Berrange at August 07, 2019 03:06 PM

    August 05, 2019

    Stefan Hajnoczi

    Determining why a Linux syscall failed

    One is often left wondering what caused an errno value when a system call fails. Figuring out the reason can be tricky because a single errno value can have multiple causes. Applications get an errno integer and no additional information about what went wrong in the kernel.

    There are several ways to determine the reason for a system call failure (from easiest to most involved):

    1. Check the system call's man page for the meaning of the errno value. Sometimes this is enough to explain the failure.
    2. Check the kernel log using dmesg(1). If something went seriously wrong (like a hardware error) then there may be detailed error information. It may help to increase the kernel log level (see the example after this list).
    3. Read the kernel source code to understand various error code paths and identify the most relevant one.
    4. Use the function graph tracer to see which code path was taken.
    5. Add printk() calls, recompile the kernel (module), and rerun to see the output.
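
    For step 2, the console log level can be raised like this (a minimal sketch using util-linux dmesg; writing to /proc/sys/kernel/printk works as well):

    # show the current log levels, then let debug messages reach the console
    cat /proc/sys/kernel/printk
    dmesg -n debug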

    Reading the man page and checking dmesg(1) are fairly easy for application developers and do not require knowledge of kernel internals. If this does not produce an answer then it is necessary to look closely at the kernel source code to understand a system call's error code paths.

    This post discusses the function graph tracer and how it can be used to identify system call failures without recompiling the kernel. This is useful because running a custom kernel may not be possible (e.g. due to security or reliability concerns) and recompiling the kernel is slow.

    An example

    In order to explore some debugging techniques let's take the io_uring_setup(2) system call as an example. It is failing with ENOMEM but the system is not under memory pressure, so ENOMEM is not expected.

    The io_uring_setup(2) source code (fs/io_uring.c) contains many ENOMEM locations but it is not possible to conclusively identify which one is failing. The next step is to determine which code path is taken using dynamic instrumentation.

    The function graph tracer

    The Linux function graph tracer records kernel function entries and returns so that function call relationships are made apparent. The io_uring_setup(2) system call is failing with ENOMEM but it is unclear at which point in the system call this happens. It is possible to find the answer by studying the function call graph produced by the tracer and following along in the Linux source code.

    Since io_uring_setup(2) is a system call it's not an ordinary C function definition and has a special symbol name in the kernel ELF file. It is possible to look up the (architecture-specific) symbol for the currently running kernel:

    # grep io_uring_setup /proc/kallsyms
    ...
    ffffffffbd357130 T __x64_sys_io_uring_setup

    Let's trace all __x64_sys_io_uring_setup calls:

    # cd /sys/kernel/debug/tracing
    # echo '__x64_sys_io_uring_setup' > set_graph_function
    # echo 'function_graph' >current_tracer
    # cat trace_pipe >/tmp/trace.log
    ...now run the application in another terminal...
    ^C
    The trace contains many successful io_uring_setup(2) calls that look like this:

     1)               |  __x64_sys_io_uring_setup() {
     1)               |    io_uring_setup() {
     1)               |      capable() {
     1)               |        ns_capable_common() {
     1)               |          security_capable() {
     1)   0.199 us    |            cap_capable();
     1)   7.095 us    |          }
     1)   7.594 us    |        }
     1)   8.059 us    |      }
     1)               |      kmem_cache_alloc_trace() {
     1)               |        _cond_resched() {
     1)   0.244 us    |          rcu_all_qs();
     1)   0.708 us    |        }
     1)   0.208 us    |        should_failslab();
     1)   0.220 us    |        memcg_kmem_put_cache();
     1)   2.201 us    |      }
     ...
     1)               |      fd_install() {
     1)   0.223 us    |        __fd_install();
     1)   0.643 us    |      }
     1) ! 190.396 us  |    }
     1) ! 216.236 us  |  }

    Although the goal is to understand system call failures, looking at a successful invocation can be useful too. Failed calls in trace output can be identified on the basis that they differ from successful calls. This knowledge can be valuable when searching through large trace files. A failed io_uring_setup(2) call aborts early and does not invoke fd_install(). Now it is possible to find a failed call amongst all the io_uring_setup(2) calls:

     2)               |  __x64_sys_io_uring_setup() {
     2)               |    io_uring_setup() {
     2)               |      capable() {
     2)               |        ns_capable_common() {
     2)               |          security_capable() {
     2)   0.236 us    |            cap_capable();
     2)   0.872 us    |          }
     2)   1.419 us    |        }
     2)   1.951 us    |      }
     2)   0.419 us    |      free_uid();
     2)   3.389 us    |    }
     2) + 48.769 us   |  }

    The fs/io_uring.c code shows the likely error code paths:

        account_mem = !capable(CAP_IPC_LOCK);

        if (account_mem) {
                ret = io_account_mem(user,
                                ring_pages(p->sq_entries, p->cq_entries));
                if (ret) {
                        free_uid(user);
                        return ret;
                }
        }

        ctx = io_ring_ctx_alloc(p);
        if (!ctx) {
                if (account_mem)
                        io_unaccount_mem(user, ring_pages(p->sq_entries,
                                                p->cq_entries));
                free_uid(user);
                return -ENOMEM;
        }

    But is there enough information in the trace to determine which of these return statements is executed? The trace shows free_uid(), so we can be confident that both these code paths are valid candidates. By looking back at the success code path we can use kmem_cache_alloc_trace() as a landmark. It is called by io_ring_ctx_alloc(), so we should see kmem_cache_alloc_trace() in the trace before free_uid() if the second return statement is taken. Since it does not appear in the trace output we conclude that the first return statement is being taken!

    When trace output is inconclusive

    Function graph tracer output only shows functions in the ELF file. When the compiler inlines code, no entry or return is recorded in the function graph trace. This can make it hard to identify the exact return statement taken in a long function. Functions containing few function calls and many conditional branches are also difficult to analyze from just a function graph trace.

    We can enhance our understanding of the trace by adding dynamic probes that record function arguments, local variables, and/or return values via perf-probe(2). By knowing these values we can make inferences about the code path being taken.
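
    For the io_uring example, such a probe could look like this (a hedged sketch: it assumes io_account_mem has not been inlined and that kernel debuginfo is available; event and symbol names may differ between kernel versions):

    # add a return probe that records io_account_mem's return value
    perf probe 'io_account_mem%return $retval'
    # record the new event system-wide while reproducing the failure
    perf record -e probe:io_account_mem__return -a -- sleep 10
    perf script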

    If this is not enough to infer which code path is being taken, detailed code coverage information is necessary.

    One way to approximate code coverage is using a sampling CPU profiler, like perf(1), and letting it run under load for some time to gather statistics on which code paths are executed frequently. This is not as precise as code coverage tools, which record each branch encountered in a program, but it can be enough to observe code paths in functions that are not amenable to the function graph tracer due to the low number of function calls.

    This is done as follows:

    1. Run the system call in question in a tight loop so the CPU is spending a significant amount of time in the code path you wish to observe.
    2. Start perf record -a and let it run for 30 seconds.
    3. Stop perf-record(1) and run perf-report(1) to view the annotated source code of the function in question.

    The error code path should have a significant number of profiler samples and it should be prominent in the perf-report(1) annotated output.
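
    A minimal sketch of steps 2 and 3 (the symbol name is illustrative):

    # sample all CPUs for 30 seconds while the syscall loop is running
    perf record -a -- sleep 30
    # inspect the annotated code of the function of interest
    perf annotate --stdio io_uring_setup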

    Conclusion

    Determining the cause for a system call failure can be hard work. The function graph tracer is helpful in shedding light on the code paths being taken by the kernel. Additional debugging is possible using perf-probe(2) and the sampling profiler, so that in most cases it's not necessary to recompile the kernel with printk() just to learn why a system call is failing.

    by Unknown (noreply@blogger.com) at August 05, 2019 03:54 PM

    July 30, 2019

    Cole Robinson

    Blog moved to Pelican and GitHub Pages

    I've moved my blog from blogger.com to a static site generated with Pelican and hosted on GitHub Pages. This is a dump of some of the details.

    The content is hosted in three branches across two repos:

    The motivation for the split is that according to this pelican SEO article, master branches of GitHub repos are indexed by google, so if you store HTML content in a master branch your canonical blog might be battling your GitHub repo in the search results. And since you can only put content in the master branch of a $username.github.io repo, I added a separate blog.git repo. Maybe I could shove all the content into the blog/gh-pages branch, but I think dealing with multiple subdomains prevents it. I've already spent too much time playing with all this stuff though, so that's for another day to figure out. Of course, suggestions are welcome; blog comments are enabled with Disqus.

    One issue I hit is that pushing updated content to blog/gh-pages doesn't consistently trigger a new GitHub Pages deployment. There's a bunch of hits about this around the web (this stackoverflow post in particular) but no authoritative explanation of what criteria GitHub Pages uses to decide whether to redeploy. The simplest 'fix' I found is to tweak the index.html content via the GitHub web UI and commit the change, which seems to consistently trigger a refresh as reported by the repo's deployments page.

    You may notice the blog looks a lot like stock Jekyll with its minima theme. I didn't find any Pelican theme that I liked as much as minima, so I grabbed the CSS from a minima instance and started adapting the Pelican simple-bootstrap theme to use it. The end result is basically a simple reimplementation of minima for Pelican. I learned a lot in the process; it likely would have been much simpler if I had just used Jekyll in the first place, but I'm in too deep to switch now!

    by Cole Robinson at July 30, 2019 07:30 PM

    July 12, 2019

    KVM on Z

    KVM at SHARE Pittsburgh 2019

    Yes, we will be at SHARE in Pittsburgh this August!
    See the following session in the Linux and VM/Virtualization track:

    • KVM on IBM Z News (Session #25978): Latest news on our development work with the open source community

    by Stefan Raspl (noreply@blogger.com) at July 12, 2019 04:42 PM

    July 10, 2019

    Cornelia Huck

    s390x changes in QEMU 4.1

    QEMU has just entered hard freeze for 4.1, so the time is here again to summarize the s390x changes for that release.

    TCG

    • All instructions that have been introduced with the "Vector Facility" in the z13 machines are now emulated by QEMU. In particular, this allows Linux distributions built for z13 or later to be run under TCG (vector instructions are generated when we compile for z13; other z13 facilities are optional.)

    CPU Models

    • As the needed prerequisites in TCG now have been implemented, the "qemu" cpu model now includes the "Vector Facility" and has been bumped to a stripped-down z13.
    • Models for the upcoming gen15 machines (the official name is not yet known) and some new facilities have been added.
    • If the host kernel supports it, we now indicate the AP Queue Interruption facility. This is used by vfio-ap and allows AP interrupts to be provided to the guest.

    I/O Devices

    • vfio-ccw has gained support for relaying HALT SUBCHANNEL and CLEAR SUBCHANNEL requests from the guest to the device, if the host kernel vfio-ccw driver supports it. Otherwise, these instructions continue to be emulated by QEMU, as before.
    • The bios now supports IPLing (booting) from DASD attached via vfio-ccw.

    Booting

    • The bios tolerates signatures written by zipl, if present; but it does not actually handle them. See the 'secure' option for zipl introduced in s390-tools 2.9.0.
    And the usual fixes and cleanups.

    by Cornelia Huck (noreply@blogger.com) at July 10, 2019 02:16 PM
