Benchmarking in user space: were there any IRQs?
An interesting question on StackOverflow asks if it is possible to determine whether some benchmarking sample taken in userspace had been poisoned by an IRQ.
I assume that you have shielded your benchmarking thread to the extent possible:
- It has exclusive access to its CPU core (not only HyperThread), see https://rt.wiki.kernel.org/index.php/Cpuset_Management_Utility/tutorial on how to manage this easily.
- Interrupt affinities have been moved away from that core, see https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-cpu-irq.html
- If possible, run a nohz kernel in order to minimize timer ticks.
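The first two items above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `irq_affinity_mask` and `pin_to_cpu` are made up for this example, and the mask helper assumes CPU ids below 32 so that a single hexadecimal word is enough.

```python
import os


def irq_affinity_mask(all_cpus, shielded_cpu):
    """Return a hex bitmask covering every CPU except the shielded one.

    Hypothetical helper for illustration; assumes CPU ids below 32.
    """
    cpus = set(all_cpus) - {shielded_cpu}
    if not cpus:
        raise ValueError("no CPU left to handle interrupts")
    if max(cpus) >= 32:
        raise ValueError("this sketch only handles CPU ids below 32")
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu  # one bit per CPU that may still receive the IRQ
    return format(mask, "x")


def pin_to_cpu(cpu):
    """Restrict the calling thread to a single CPU (Linux only)."""
    os.sched_setaffinity(0, {cpu})
```

On a four-core machine with core 2 shielded, `irq_affinity_mask(range(4), 2)` yields `b` (CPUs 0, 1 and 3). Writing that value to `/proc/irq/<n>/smp_affinity` as root is how an interrupt's affinity is moved away from the shielded core.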
Furthermore, you should not jump into kernel space from within your benchmarked code path: upon return, your thread might get scheduled away for some time.
However, you simply can't get rid of all interrupts on a CPU core: on Linux, the local APIC timer interrupts, interprocessor interrupts (IPIs) and others are used for internal purposes. For example, timer interrupts are used to ensure that threads eventually get scheduled. Likewise, IPIs are used to trigger actions on other cores, such as TLB shootdowns.
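To see which of these internal interrupts still arrive on a shielded core, one coarse option is to compare two snapshots of `/proc/interrupts`. This is not the tracepoint approach this article is about, only a quick sanity check; the function names below are made up for this sketch. Reading the file is itself a system call, so take the snapshots outside the timed region.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: {"CPU0": count, ...}}."""
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()  # header row, e.g. ["CPU0", "CPU1"]
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        per_cpu = rest.split()[:len(cpus)]
        if len(per_cpu) < len(cpus) or not all(f.isdigit() for f in per_cpu):
            continue  # rows such as ERR/MIS carry a single global count
        counts[label.strip()] = dict(zip(cpus, map(int, per_cpu)))
    return counts


def changed_on_cpu(before, after, cpu):
    """Return {label: delta} for interrupt rows whose count grew or shrank on cpu."""
    return {
        label: after[label][cpu] - before[label][cpu]
        for label in after
        if label in before and after[label][cpu] != before[label][cpu]
    }
```

A typical use would be `changed_on_cpu(parse_interrupts(a), parse_interrupts(b), "CPU3")`, where `a` and `b` are the file contents read before and after the run. Rows such as `LOC` (local timer) and `RES` (rescheduling IPIs) are the ones most likely to show up.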
Now, thanks to the Linux tracing infrastructure, it is possible to tell from userspace whether a hardirq has happened during a certain period of time.
One minor complication is that Linux treats two classes of interrupts differently with respect to tracing:
- First, there are the "real" external hardware interrupts raised by real devices like network adapters, sound cards and the like.
- Second, there are the interrupts for internal use by Linux.
Both are hardirqs in the sense that the processor asynchronously transfers control to an interrupt service routine (ISR) as dictated by the interrupt descriptor table (IDT).
Usually, in Linux, the ISR is simply a trampoline stub written in assembly which transfers control to a high-level handler written in C.
For the external interrupts, a common stub is defined for each, which transfers control to a common higher-level C routine (do_IRQ). Within the upper layers, a runtime decision is made whether to trace or not. For the internal interrupts, on the other hand, there are two stubs for each interrupt: one with tracing on and one with tracing off. The tracing stub calls an unconditionally tracing higher-level C routine specific to the interrupt in question, while the non-tracing stub calls an unconditionally non-tracing one. For details, refer to arch/x86/kernel/irqinit.c in the Linux kernel sources.
This is why there is one tracing event for each internal interrupt:
# sudo perf list | grep irq_vectors:
  irq_vectors:call_function_entry           [Tracepoint event]
  irq_vectors:call_function_exit            [Tracepoint event]
  irq_vectors:call_function_single_entry    [Tracepoint event]
  irq_vectors:call_function_single_exit     [Tracepoint event]
  irq_vectors:deferred_error_apic_entry     [Tracepoint event]
  irq_vectors:deferred_error_apic_exit      [Tracepoint event]
  irq_vectors:error_apic_entry              [Tracepoint event]
  irq_vectors:error_apic_exit               [Tracepoint event]
  irq_vectors:irq_work_entry                [Tracepoint event]
  irq_vectors:irq_work_exit                 [Tracepoint event]
  irq_vectors:local_timer_entry             [Tracepoint event]
  irq_vectors:local_timer_exit              [Tracepoint event]
  irq_vectors:reschedule_entry              [Tracepoint event]
  irq_vectors:reschedule_exit               [Tracepoint event]
  irq_vectors:spurious_apic_entry           [Tracepoint event]
  irq_vectors:spurious_apic_exit            [Tracepoint event]
  irq_vectors:thermal_apic_entry            [Tracepoint event]
  irq_vectors:thermal_apic_exit             [Tracepoint event]
  irq_vectors:threshold_apic_entry          [Tracepoint event]
  irq_vectors:threshold_apic_exit           [Tracepoint event]
  irq_vectors:x86_platform_ipi_entry        [Tracepoint event]
  irq_vectors:x86_platform_ipi_exit         [Tracepoint event]
while there is only a single general tracing event for the external interrupts:
# sudo perf list | grep irq:
  irq:irq_handler_entry                     [Tracepoint event]
  irq:irq_handler_exit                      [Tracepoint event]
  irq:softirq_entry                         [Tracepoint event]
  irq:softirq_exit                          [Tracepoint event]
  irq:softirq_raise                         [Tracepoint event]
So, trace all these IRQ *_entry events for the duration of your benchmarked code path and you know whether your benchmark sample has been poisoned by an IRQ or not.
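As a rough illustration of what "trace all the *_entry events" could look like from the command line, the following Python sketch extracts the hardirq entry events from `perf list` output and assembles a `perf stat` invocation that counts them on one CPU. The function names are made up for this example, and counting with `perf stat` over a whole `sleep` is a much coarser tool than tracing around a single benchmarked code path.

```python
def entry_events(perf_list_output):
    """Extract hardirq *_entry tracepoint names from `perf list` output."""
    events = []
    for line in perf_list_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        if ":" not in name or not name.endswith("_entry"):
            continue
        if name.startswith("irq:softirq"):
            continue  # softirqs are not hardirqs, so they are left out here
        events.append(name)
    return events


def perf_stat_command(events, cpu, seconds=1):
    """Build a `perf stat` argv that counts the given events on one CPU."""
    return [
        "perf", "stat",
        "-a", "-C", str(cpu),     # system-wide mode, restricted to this CPU
        "-e", ",".join(events),   # comma-separated list of tracepoint events
        "--", "sleep", str(seconds),
    ]
```

The resulting list can be passed to `subprocess.run` as root. A non-zero count for any event means the CPU took at least one such interrupt during the measurement window.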
Unfortunately, the list of *_entry events does not include every internal IRQ theoretically possible. That is, for some interrupts the tracing trampoline stub is identical to its non-tracing counterpart, so nothing gets traced. This affects:
irq_move_cleanup_interrupt, reboot_interrupt, uv_bau_message_intr1, kvm_posted_intr_ipi, kvm_posted_intr_wakeup_ipi, xen_hvm_callback_vector, hyperv_callback_vector
These are untraceable, but they can be avoided:
- Do not move interrupts while benchmarking: disable irqbalance.
- Do not reboot while benchmarking.
- uv_bau_message_intr1 can only happen if you have got CONFIG_X86_UV=yes, that is, on SGI UV systems.
- Do not run VMs while benchmarking.
In principle, an exception is simply an interrupt generated by the local processor core (or thread) itself in order to signal to the OS that a certain event has occurred. Examples are page faults, math exceptions, machine check exceptions and so on. Neglecting the fact that a little bit more information is passed to the ISR than for interrupts, control transfer for exceptions is handled in exactly the same way as described for interrupts above: an entry is looked up within the IDT and a jump is made to the trampoline stub specified therein.
In Linux, the trampoline stubs for exceptions are in a way similar to the trampoline stubs for internally used interrupts: a non-tracing and a tracing version specific to the exception in question are set up for each. The big difference is that, for all but the page fault exception, the tracing versions of these stubs are identical to their non-tracing counterparts and thus don't do any tracing. Namely,
divide_error overflow bounds invalid_op device_not_available double_fault coprocessor_segment_overrun invalid_TSS segment_not_present spurious_interrupt_bug coprocessor_error alignment_check simd_coprocessor_error debug int3 stack_segment general_protection machine_check
all can't be traced. However, if your code is sufficiently
well-behaved, i.e. not dividing by zero etc., these shouldn't happen
to you at all. The tracing events for the page fault exception are
NMIs are treated a little bit specially and are not caught by any of the events above. However, there's a tracepoint event for those as well: nmi:nmi_handler. According to arch/x86/kernel/nmi.c:nmi_handle(), these events only get triggered if NMI handlers have been registered. However, there's always at least one such handler: arch_trigger_all_cpu_backtrace_handler(). Thus, for every NMI, we should get at least one such event.
Some interrupts and exceptions are untraceable but avoidable. Most notably, irqbalance should be disabled in order to avoid irq_move_cleanup_interrupt. If you stick to the rules, you can tell reliably whether your benchmarking has been interrupted by an IRQ.
For your convenience, I have put together a little piece of code for tracking IRQs during your benchmarked code path: http://nicst.de/git/?p=lxdetectirq. See the usage information included there. Note that access to /sys/kernel/debug is needed in order to determine tracepoint IDs.
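The tracepoint ID lookup mentioned above can be sketched as follows. Each tracepoint exposes its numeric ID in an `id` file under `tracing/events/<subsystem>/<event>/` in debugfs, and that number is what `perf_event_open(2)` expects in `perf_event_attr.config` when the event type is `PERF_TYPE_TRACEPOINT`. The function name and the `debugfs` parameter are made up for this example, and this is not necessarily how lxdetectirq does it.

```python
from pathlib import Path


def tracepoint_id(event, debugfs="/sys/kernel/debug"):
    """Look up the numeric ID of a tracepoint such as "irq:irq_handler_entry".

    `debugfs` is a hypothetical parameter naming the debugfs mount point;
    reading below it usually requires root.
    """
    subsystem, name = event.split(":", 1)
    id_file = Path(debugfs) / "tracing" / "events" / subsystem / name / "id"
    return int(id_file.read_text().strip())
```

For instance, `tracepoint_id("irq_vectors:local_timer_entry")` returns the ID for the local timer entry event on the running kernel. The value differs between kernels, so it has to be looked up at runtime rather than hard-coded.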