Benchmarking in user space: were there any IRQs?

An interesting question on StackOverflow asks whether it is possible to determine if a benchmarking sample taken in userspace has been poisoned by an IRQ.

I assume that you have shielded your benchmarking thread to the extent possible.

Furthermore, you should not jump into kernel space from within your benchmarked code path: upon return, your thread might get scheduled away for some time.

However, you cannot get rid of all interrupts on a CPU core: on Linux, the local APIC timer interrupt, interprocessor interrupts (IPIs) and others are used for internal purposes. For example, timer interrupts ensure that threads eventually get scheduled. Likewise, IPIs are used to trigger actions on other cores, such as TLB shootdowns.
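To see how often these internal interrupts fire on a given core, the per-CPU counters in /proc/interrupts give a rough picture. This is not the tracing approach described below, just a coarse sanity check. It is a minimal sketch, and the function name irq_count is made up for illustration. Rows such as LOC (local timer), RES (rescheduling), CAL (function call) and TLB (TLB shootdowns) correspond to the interrupts mentioned above.

```shell
# Print the counter of interrupt row LABEL (e.g. LOC, RES, CAL, TLB) for CPU
# number CPU, reading /proc/interrupts-formatted text from stdin.
# irq_count is a hypothetical helper name, not part of any existing tool.
irq_count() {
    label=$1
    cpu=$2
    awk -v label="$label:" -v cpu="$cpu" '
        # Header line: find the column belonging to the requested CPU.
        # Data rows carry the row label in field 1, hence the offset of one.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == ("CPU" cpu)) col = i + 1; next }
        $1 == label && col { print $col; found = 1 }
        END { exit !found }
    '
}

# Usage sketch (CPU 3 is a placeholder for your shielded core):
#   before=$(irq_count LOC 3 < /proc/interrupts)
#   ... run the benchmark pinned to CPU 3 ...
#   after=$(irq_count LOC 3 < /proc/interrupts)
#   echo "local timer interrupts: $((after - before))"
```

Reading /proc/interrupts enters the kernel, so do it before and after the measured region, never inside it.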

Now, thanks to the Linux tracing infrastructure, it is possible to tell from userspace whether a hardirq has happened during a certain period of time.

Tracing interrupts

One minor complication is that Linux treats two classes of interrupts differently with respect to tracing:

  1. First, there are the "real" external hardware interrupts raised by real devices such as network adapters, sound cards and the like.
  2. Second, there are the interrupts for internal use by Linux.

Both are hardirqs in the sense that the processor asynchronously transfers control to an interrupt service routine (ISR) as dictated by the interrupt descriptor table (IDT).

Usually, in Linux, the ISR is simply a trampoline stub written in assembly which transfers control to a high-level handler written in C.

For the external interrupts, a common stub is defined for each, which transfers control to a common higher-level C routine (do_IRQ). Within the upper layers, a runtime decision is made whether to trace or not. For the internal interrupts, on the other hand, there are two stubs for each interrupt: one for tracing off and one for tracing on. The former calls an unconditionally non-tracing higher-level C routine specific to the interrupt in question, while the latter calls an unconditionally tracing one.

For details, refer to arch/x86/entry/entry_64.S and arch/x86/kernel/irqinit.c in the Linux kernel sources.

This is why there is one tracing event for each internal interrupt:

# sudo perf list | grep irq_vectors:
  irq_vectors:call_function_entry                    [Tracepoint event]
  irq_vectors:call_function_exit                     [Tracepoint event]
  irq_vectors:call_function_single_entry             [Tracepoint event]
  irq_vectors:call_function_single_exit              [Tracepoint event]
  irq_vectors:deferred_error_apic_entry              [Tracepoint event]
  irq_vectors:deferred_error_apic_exit               [Tracepoint event]
  irq_vectors:error_apic_entry                       [Tracepoint event]
  irq_vectors:error_apic_exit                        [Tracepoint event]
  irq_vectors:irq_work_entry                         [Tracepoint event]
  irq_vectors:irq_work_exit                          [Tracepoint event]
  irq_vectors:local_timer_entry                      [Tracepoint event]
  irq_vectors:local_timer_exit                       [Tracepoint event]
  irq_vectors:reschedule_entry                       [Tracepoint event]
  irq_vectors:reschedule_exit                        [Tracepoint event]
  irq_vectors:spurious_apic_entry                    [Tracepoint event]
  irq_vectors:spurious_apic_exit                     [Tracepoint event]
  irq_vectors:thermal_apic_entry                     [Tracepoint event]
  irq_vectors:thermal_apic_exit                      [Tracepoint event]
  irq_vectors:threshold_apic_entry                   [Tracepoint event]
  irq_vectors:threshold_apic_exit                    [Tracepoint event]
  irq_vectors:x86_platform_ipi_entry                 [Tracepoint event]
  irq_vectors:x86_platform_ipi_exit                  [Tracepoint event]

while there is only a single general tracing event for the external interrupts:

# sudo perf list | grep irq:
  irq:irq_handler_entry                              [Tracepoint event]
  irq:irq_handler_exit                               [Tracepoint event]
  irq:softirq_entry                                  [Tracepoint event]
  irq:softirq_exit                                   [Tracepoint event]
  irq:softirq_raise                                  [Tracepoint event]

So, trace all these IRQ *_entry events for the duration of your benchmarked code path and you will know whether your benchmark sample has been poisoned by an IRQ or not.
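As a rough illustration, the *_entry events can be picked out of the perf list output and handed to perf stat. This is only a sketch: the helper name hardirq_entry_events is made up, and I assume here that irq:softirq_entry is left out because softirqs are not hardirqs (add it if you care about those as well). Note that perf stat counts over the whole run of the command, not over a single benchmark sample.

```shell
# Read `perf list` output on stdin and print a comma-separated list of the
# hardirq entry tracepoints: every irq_vectors:*_entry event plus
# irq:irq_handler_entry. hardirq_entry_events is a hypothetical helper name.
hardirq_entry_events() {
    awk '
        $1 ~ /^irq_vectors:.*_entry$/ || $1 == "irq:irq_handler_entry" {
            printf "%s%s", sep, $1
            sep = ","
        }
        END { print "" }
    '
}

# Usage sketch (needs root; CPU 3 and ./benchmark are placeholders):
#   events=$(perf list 2>/dev/null | hardirq_entry_events)
#   perf stat -a -C 3 -e "$events" -- taskset -c 3 ./benchmark
```

A count of zero for every event means that no traceable hardirq hit that CPU while the command ran.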

Unfortunately, the list of *_entry events does not include every internal IRQ theoretically possible: for some of them, the tracing version of the trampoline stub equals its non-tracing counterpart. These are:

irq_move_cleanup_interrupt
reboot_interrupt
uv_bau_message_intr1
kvm_posted_intr_ipi
kvm_posted_intr_wakeup_ipi
xen_hvm_callback_vector
hyperv_callback_vector

Fortunately, all of these can be avoided by sticking to a few rules:

  1. Do not move interrupts while benchmarking: disable irqbalance.
  2. Do not reboot while benchmarking.
  3. uv_bau_message_intr1 can only happen if CONFIG_X86_UV=y is set, that is, on SGI UV systems.
  4. Do not run VMs while benchmarking.

Tracing exceptions

In principle, an exception is simply an interrupt generated by the local processor core (or thread) itself in order to signal to the OS that a certain event has occurred. Examples are page faults, math exceptions, machine check exceptions and so on. Neglecting the fact that a little bit more information is passed to the ISR than for interrupts, control transfer for exceptions is handled in exactly the same way as described for interrupts above: an entry is looked up within the IDT and a jump is made to the trampoline stub specified therein.

In Linux, the trampoline stubs for exceptions are similar to those for internally used interrupts: a non-tracing and a tracing version specific to the exception in question are set up for each. The big difference is that, for all exceptions but the page fault, the tracing version of the stub equals its non-tracing counterpart and thus does not do any tracing. Namely,

divide_error
overflow
bounds
invalid_op
device_not_available
double_fault
coprocessor_segment_overrun
invalid_TSS
segment_not_present
spurious_interrupt_bug
coprocessor_error
alignment_check
simd_coprocessor_error
debug
int3
stack_segment
general_protection
machine_check

cannot be traced. However, if your code is sufficiently well-behaved, i.e. does not divide by zero etc., these should not happen to you at all. The tracing events for the page fault exception are called exceptions:page_fault_user and exceptions:page_fault_kernel.

Tracing NMIs

NMIs are treated a little bit specially and are not caught by any of the events above. However, there is a tracepoint event for those as well: nmi:nmi_handler. According to arch/x86/kernel/nmi.c:nmi_handle(), these events are only triggered if NMI handlers have been registered. However, there is always at least one such handler: arch_trigger_all_cpu_backtrace_handler(). Thus, for every NMI, we should get at least one nmi:nmi_handler event.

Summary

Some interrupts and exceptions are untraceable but avoidable. Most notably, irqbalance should be disabled in order to avoid irq_move_cleanup_interrupt. If you stick to the rules, you can tell reliably whether your benchmark has been interrupted by an external event.

For your convenience, I have put together a little piece of code for tracking IRQs during your benchmarked code path: http://nicst.de/git/?p=lxdetectirq. See the included example.c for usage. Note that access to /sys/kernel/debug is needed in order to determine tracepoint IDs.
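To give an idea of what determining a tracepoint ID involves: each tracepoint exposes its numeric ID in an id file below the tracing events directory in debugfs, and perf_event_open(2) takes that ID in the config field for events of type PERF_TYPE_TRACEPOINT. The following is a minimal sketch of the lookup only. The function name tracepoint_id and the TRACEFS_EVENTS override are made up for illustration and are not taken from lxdetectirq.

```shell
# Print the numeric ID of a tracepoint given as SUBSYSTEM:EVENT,
# e.g. irq:irq_handler_entry or irq_vectors:local_timer_entry.
# TRACEFS_EVENTS is a hypothetical override for the events directory;
# by default the usual debugfs location is used (readable by root only).
tracepoint_id() {
    events_dir=${TRACEFS_EVENTS:-/sys/kernel/debug/tracing/events}
    subsys=${1%%:*}   # part before the first colon
    event=${1#*:}     # part after the first colon
    cat "$events_dir/$subsys/$event/id"
}

# Usage sketch (needs root):
#   tracepoint_id irq:irq_handler_entry
#   tracepoint_id nmi:nmi_handler
```

The IDs are assigned by the running kernel, so look them up at runtime rather than hard-coding them.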

Author: Nicolai Stange

Created: 2018-03-11 Sun 11:26
