
Linux on Power

Operating system tuning for latency

Disable unnecessary services

The set of running services tends to change over time and between distributions. Carefully analyze the list of services enabled on the system and permanently disable all that are not required.

To get the list of running services:

# systemctl list-units --type=service

To stop a service immediately:

# systemctl stop <service>

To permanently disable a service (upon next boot):

# systemctl disable <service>
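
To see which services are enabled to start at boot (as opposed to those currently running):

# systemctl list-unit-files --type=service --state=enabled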

Disable snooze

Availability: IBM PowerVM, SLES11SP2, RHEL6.5, RHEL7 (kernel 3.7), powerpc-utils 1.2.20

“snooze_delay” is the delay, in milliseconds, before an idle processor yields itself back to the hypervisor.  A latency-sensitive virtual server may want to make this as high as possible, to avoid losing the processor any earlier than necessary (such as before the end of a timeslice for shared processors).  Other, non-latency-sensitive virtual servers sharing the processors may want to set this value low or to zero, to yield the processor back as soon as possible.

# ppc64_cpu --snooze_delay=<n>

However, the strongest recommendation is to use dedicated processors, which postpones giving control to the hypervisor.

Note: support for setting snooze_delay=-1 (infinite: do not snooze) went into kernel 3.7 and, as of this writing (January 2013), is not in RHEL 6.3, whereas SLES11 SP2 does appear to have support. powerpc-utils 1.2.20 or above is required.
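
On kernels with the support described above, snooze can be disabled entirely:

# ppc64_cpu --snooze_delay=-1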

Disable CPU frequency scaling

Availability: x86, PowerKVM, and non-virtualized

Set the maximum frequency available for scaling to the maximum possible frequency:

# max_freq=$(for i in $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies); do echo $i; done | sort -nr | head -1)
# for cpu_max_freq in /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq; do echo $max_freq > $cpu_max_freq; done

Set the frequency governor to performance:

# for governor in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $governor; done
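
To verify that the settings took effect, read the values back from sysfs:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq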

Disable WAKEUP_PREEMPT

Availability: All platforms

The Linux Completely Fair Scheduler (CFS) first appeared in the 2.6.23 release of the Linux kernel in October 2007. The algorithms used in the CFS provide efficient scheduling for a wide variety of systems and workloads. However, for some workloads there is one behavior of the CFS that can cost a few percent of CPU utilization.

In the CFS, a thread that submits I/O, blocks, and then is notified of the I/O completion preempts the currently running thread and is run instead. This behavior is great for applications, such as video streaming, that need low latency for handling the I/O, but it can actually hurt performance in some cases. For example, when a thread submits I/O such as sending a response out on the network, that thread is in no hurry to handle the I/O completion: upon completion, it is simply finished with its work. Moreover, when an I/O completion thread preempts the currently running thread, it prevents that thread from making progress and can ruin some of the cache warmth the thread has built up. Since there is no immediate need to handle the I/O completion, the current thread should be allowed to run, and the I/O completion thread should be scheduled to run just like any other process.

The CFS has a list of scheduling features that can be enabled or disabled. The setting of these features is available through the debugfs file system. One of the features is WAKEUP_PREEMPT. It tells the scheduler that an I/O thread that was woken up should preempt the currently running thread, which is the default behavior as described above. To disable this feature, you set NO_WAKEUP_PREEMPT (not to be confused with NO_WAKEUP_PREEMPTION) in the scheduler’s features.

# mount -t debugfs debugfs /sys/kernel/debug
# echo NO_WAKEUP_PREEMPT > /sys/kernel/debug/sched_features
# umount /sys/kernel/debug
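
To confirm the change, read the feature list back while debugfs is mounted; a disabled feature appears with the NO_ prefix:

# mount -t debugfs debugfs /sys/kernel/debug
# cat /sys/kernel/debug/sched_features
# umount /sys/kernel/debug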

Disable scheduling preemption

Availability: All platforms

You can use the sched_min_granularity_ns parameter to disable preemption. sched_min_granularity_ns is the number of nanoseconds a process is guaranteed to run before it can be preempted. Setting the parameter to one half of the value of the sched_latency_ns parameter effectively disables preemption. sched_latency_ns is the period over which CFS tries to fairly schedule all the tasks on the runqueue. All of the tasks on the runqueue are guaranteed to be scheduled once within this period. So, the greatest amount of time a task can be given to run is inversely correlated with the number of tasks; fewer tasks means they each get to run longer. Since the smallest number of tasks needed for one to preempt another is two, setting sched_min_granularity_ns to half of sched_latency_ns means the second task will not be allowed to preempt the first task.

The scheduling parameters are located in the /proc/sys/kernel/ directory. Here is some sample bash code for disabling preemption.

# LATENCY=$(cat /proc/sys/kernel/sched_latency_ns)
# echo $((LATENCY/2)) > /proc/sys/kernel/sched_min_granularity_ns

The parameter sched_wakeup_granularity_ns is similar to the sched_min_granularity_ns parameter. The documentation is a little fuzzy on how this parameter actually works. It controls the ability of tasks being woken to preempt the current task. The smaller the value, the easier it is for the task to force the preemption. Setting sched_wakeup_granularity_ns to one half of sched_latency_ns can also help alleviate the scheduling preemption problem.
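
Following the same pattern as above, a sketch that sets sched_wakeup_granularity_ns to one half of sched_latency_ns:

# LATENCY=$(cat /proc/sys/kernel/sched_latency_ns)
# echo $((LATENCY/2)) > /proc/sys/kernel/sched_wakeup_granularity_ns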

Disable multipath

Availability: All platforms

Multipath is a means of using multiple paths to the same storage device as a single device with redundancy, to provide higher bandwidth, higher availability, or both. Two or more logical storage devices are bound together into a single virtual storage device.

multipathd is a service daemon that monitors the multipath devices for failure, and by default, it polls the devices every 5 seconds.

Ideally you should avoid this polling altogether to avoid the disruption. However, if multipathd must run, it is possible to increase the “polling_interval” setting in the /etc/multipath.conf file.
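
For example, a defaults stanza along these lines in /etc/multipath.conf raises the polling interval to five minutes (the value is in seconds):

defaults {
    polling_interval 300
}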

In some configurations, it is difficult or nearly impossible to install the system without multipath being enabled automatically by the installation process.  However, multipath can still be removed from an installed system:

  1. Remove the device from /etc/multipath.conf (see the sample blacklist stanza after this list):
    • Remove the device from the blacklist_exceptions section if it is already there, or
    • Add the device to the blacklist section
  2. Disable multipathd at reboot:
    • On RHEL, # systemctl disable multipathd
  3. Adjust /etc/fstab, if necessary:
    • LVM will rescan at boot and needs no changes in /etc/fstab
    • For physical partitions, entries in /etc/fstab should be changed from paths like /dev/mapper/mpathx to /dev/sdy
  4. Disable multipath during boot:
    1. On RHEL, dracut --verbose --force --omit multipath
    2. Reboot
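
A minimal sketch of a blacklist stanza for /etc/multipath.conf; the WWID shown is a placeholder and must be replaced with the WWID of the actual device (for example, as reported by multipath -ll):

blacklist {
    wwid "3600508b4000156d700012000000b0000"
}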

Mitigate pdflush

Availability: All platforms

pdflush wakes up periodically to flush dirty superblocks to storage.  According to “R.I.P. pdflush”, ext4 filesystems (among others) do not even take advantage of the mechanism that pdflush uses, so on systems using ext4 exclusively, pdflush will always wake up and find nothing to do.

It is impossible to turn off pdflush, as it is an integral kernel thread, but you can tell it to wake up far less often; in this example, every 360000 centiseconds, or one hour:

# echo 360000 > /proc/sys/vm/dirty_writeback_centisecs
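
To make the setting persist across reboots, the equivalent sysctl can be placed in /etc/sysctl.conf (or a file under /etc/sysctl.d/):

vm.dirty_writeback_centisecs = 360000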

Disable RTAS event scanning

Availability: Power, non-virtualized only

The Power Architecture Platform Reference (PAPR), a standard to which IBM Power Systems conform, dictates that the platform’s Run-Time Abstraction Services (RTAS) must be periodically scanned for new event reports.  In the Linux kernel, this is implemented as a daemon, of sorts (in reality, it’s a self-rescheduling workqueue item).  Unfortunately, it is also defined in PAPR to do subsequent scans from different cores.  As of this writing, the current implementation in the Linux kernel will schedule itself on the next online core, regardless of any other restrictions like cgroups or isolcpus settings, so all online cores will eventually be hit.

At present, there is no trivial method (other than changes to the Linux kernel source code) to disable this scan for PowerVM or PowerKVM guests.

You could disable it by recompiling the kernel with the appropriate code in rtasd.c disabled.  At the time of this writing, it is not clear whether there are negative side effects to disabling the RTAS event scans, so this is not recommended.

Note that when running non-virtualized (the operating system is running directly on OPAL firmware), RTAS event scanning is not performed.

Mitigate decrementer overflow interrupt

Availability: Power, but still under development

The decrementer register on IBM Power Systems servers is a 32-bit quantity, and is decremented at the frequency of the time base, which on IBM Power Systems servers is 512 MHz.  The maximum value of the register is 2^31 - 1, or 2147483647.  At 512 MHz, this decrements to zero in about 4.2 seconds.  So, a completely idle processor will still see a decrementer interrupt every 4.2 seconds.  There is currently no way to eliminate this interrupt.

The following patch will mitigate the interrupt by avoiding some optional housekeeping done when the interrupt occurs when the only action necessary is to reset it: https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-October/121795.html

On a POWER8 system, the interrupt duration was reduced from about 30 microseconds to about 200 nanoseconds.

Remove processors from scheduling eligibility permanently

Availability: All platforms

Keep the Linux kernel scheduler from even considering a set of CPUs for tasks.

Use the kernel command line parameter isolcpus=<cpu list> to isolate a set of CPUs from consideration for scheduling tasks. For example, add isolcpus=1-63 to reserve all but CPU 0 (on a system with 64 CPUs) for specific application use. Use cgroups/cset, taskset, or numactl commands to force tasks explicitly on those CPUs.
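
For example, after booting with isolcpus=1-63, an application (./myapp below is a hypothetical placeholder) can be bound to the isolated CPUs with either of the following:

# taskset -c 1-63 ./myapp
# numactl --physcpubind=1-63 ./myapp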

Remove processors from scheduling eligibility dynamically

Availability: All platforms

Also known as CPU shielding. See the cpuset documentation for additional information.

# cset shield --cpu=4-63 --kthread=on

All existing tasks, including kernel threads, that can be migrated off the shielded CPUs (4-63 in the example above) will be moved to the unshielded CPUs (everything else).

cset is a front-end to the cgroup infrastructure.  It is also possible to manipulate cgroups manually.

To set up a cgroup, “mycgroup”, with a subset of the available CPUs and all memory:

# mount -t tmpfs tmpfs /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpuset
# mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
# cd /sys/fs/cgroup/cpuset
# mkdir mycgroup
# cd mycgroup
# echo 8-95 > cpuset.cpus
# cat /sys/fs/cgroup/cpuset/cpuset.mems > cpuset.mems
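
Tasks are then placed in the cgroup by writing their PIDs to its tasks file. For example, still in the mycgroup directory, move the current shell (and thus its future children) into the cgroup:

# echo $$ > tasks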

Disable scheduler domain rebalancing

Availability: All platforms

Set the SD_LOAD_BALANCE bit to zero in the flags of every critical CPU's scheduler domains, for example:

for cpu in $(seq 1 63); do
    for domain in /proc/sys/kernel/sched_domain/cpu$cpu/domain*/flags; do
        flags=$(cat $domain)
        # clear the low-order bit of flags (SD_LOAD_BALANCE)
        echo $(( flags & ~1 )) > $domain
    done
done

Disable watchdogs

Availability: All platforms

Watchdog kernel threads wake up regularly. To disable the watchdog kernel threads:

# sysctl kernel.nmi_watchdog=0

Kernel parameters:

nmi_watchdog=0 nowatchdog nosoftlockup

Use static network configuration

The DHCP client must periodically renew the lease for its IP address, at an interval specified by the DHCP server.  It is preferable to avoid this by using static network configuration.
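
As a sketch, a static configuration on a RHEL-style system for a hypothetical interface eth0 might look like this in /etc/sysconfig/network-scripts/ifcfg-eth0 (all addresses are placeholders):

DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.0.2.10
NETMASK=255.255.255.0
GATEWAY=192.0.2.1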

Disable inactive network interfaces

It has been observed that physically disconnected interfaces which were nevertheless configured as “UP” were generating system interrupts. Make sure that no unnecessary network interfaces are configured, even if they are physically disconnected.
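
To take an unused interface down immediately (eth1 is a placeholder name):

# ip link set dev eth1 down

On a RHEL-style system, setting ONBOOT=no in the interface's ifcfg file keeps it down across reboots.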

Use static and/or one-shot IRQ balancing

To stop immediately (will restart on reboot):

# systemctl stop irqbalance

(the service name may vary: irqbalance, irqbalancer, or …)

To prevent the service from starting at the next boot (does not stop it immediately):

# systemctl disable irqbalance

To balance IRQs once only and exit:

# IRQBALANCE_ONESHOT=1 IRQBALANCE_BANNED_CPUS=ffffffff,fffffff0 irqbalance
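
With irqbalance stopped, individual IRQs can also be pinned manually by writing a CPU mask to the corresponding smp_affinity file. For example, to steer a hypothetical IRQ 42 to CPU 0:

# echo 1 > /proc/irq/42/smp_affinity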

Fragment high-cost TCE-related hypervisor calls

Availability: PowerVM

Add kernel command line parameters:

  • bulk_remove=off
  • multitce=off
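
On a RHEL7-style system, one way to add these (and the other kernel command line parameters mentioned in this document) is to append them to the GRUB_CMDLINE_LINUX line in /etc/default/grub and regenerate the configuration (the output path may vary by platform):

# grub2-mkconfig -o /boot/grub2/grub.cfg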

Go tickless (or not)

Eliminating scheduler clock interrupts would be ideal, but is currently difficult in practice.  Work associated with the tick can “back up”, making the less-frequent interrupts take longer.  If unable to eliminate ticks altogether, it may be better to keep them (nohz=off) and/or increase their frequency (recompile kernel with CONFIG_HZ_250 or 300 or 1000).

Availability: RHEL7

To eliminate most ticks, get a kernel with CONFIG_NO_HZ_FULL.  This enables the kernel to significantly reduce timer interrupts on CPUs where there is a single runnable task and thus no need for scheduling.  CPUs thus enabled are called “adaptive ticks” CPUs.  The capability is enabled in RHEL7 kernels.  However, by default, no CPUs are defined as adaptive ticks CPUs.  To enable a set of CPUs to be adaptive ticks CPUs, add nohz_full=<cpulist> to the kernel command line.

To eliminate more ticks, recompile the kernel with CONFIG_NO_HZ_FULL_ALL.
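
For example, to make CPUs 1-63 adaptive-tick CPUs while also isolating them from the scheduler as described earlier, the kernel command line would include:

isolcpus=1-63 nohz_full=1-63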

Reduce scheduling migrations

Availability: All platforms

Disable SD_WAKE_AFFINE, SD_WAKE_BALANCE, or both, in the same manner as SD_LOAD_BALANCE above.
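
A sketch following the SD_LOAD_BALANCE loop above; the value 0x20 for SD_WAKE_AFFINE is an assumption and must be verified against include/linux/sched.h for the running kernel:

FLAG=0x20    # assumed SD_WAKE_AFFINE bit; verify against your kernel source
for cpu in $(seq 1 63); do
    for domain in /proc/sys/kernel/sched_domain/cpu$cpu/domain*/flags; do
        flags=$(cat $domain)
        # clear the flag bit, leaving all other bits intact
        echo $(( flags & ~FLAG )) > $domain
    done
done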

vmstat offloading

Availability: Submitted for upstream

Utilization statistics, commonly displayed with the vmstat command, are gathered at the beginning and end of interrupt processing, lengthening the interrupt processing time.  Some of this processing can be offloaded to a sacrificial thread, reducing interrupt latencies.  This is currently being added to upstream kernels and is not yet in any enterprise distribution.  (a.k.a. the “Lameter patches”)

RCU offloading

Availability: Targeted upstream

RCU performs some housekeeping during interrupt processing.  There are some recent patches being pushed into upstream kernels to move this processing to a sacrificial core, thus reducing interrupt latencies accordingly.

Kernel parameter: rcu_nocbs=<cpu list>

Or, kernel config CONFIG_RCU_NOCB_CPU_ALL=y

set clu cto 無法安裝 urn ems water 了解 源代碼管理 Setting up a EDK II build environment on Windows and Linux:搭建Windows和Linux開發環境[2.2] 2015-07 北