
Linux on Power

Operating system tuning for latency

Disable unnecessary services

The set of running services tends to change over time and between distributions. Carefully analyze the list of services enabled on the system and permanently disable all that are not required.

To get the list of running services:

# systemctl list-units --type=service

To stop a service immediately:

# systemctl stop <service>

To permanently disable a service (upon next boot):

# systemctl disable <service>
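
To see which services are enabled to start at boot (as opposed to those currently running):

# systemctl list-unit-files --type=service --state=enabled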

Disable snooze

Availability: IBM PowerVM, SLES11SP2, RHEL6.5, RHEL7 (kernel 3.7), powerpc-utils 1.2.20

“snooze_delay” is the delay, in milliseconds, before an idle processor yields itself back to the hypervisor.  A latency-sensitive virtual server may want to make this as high as possible, to avoid losing the processor any earlier than necessary (such as before the end of a timeslice for shared processors).  Other, non-latency-sensitive virtual servers sharing the processors may want to set this value low or to zero, to yield the processor back as soon as possible.

# ppc64_cpu --snooze_delay=<n>

However, the strongest recommendation is to use dedicated processors, which postpones giving control to the hypervisor.

Note: support for setting snooze_delay=-1 (infinite: do not snooze) went into kernel 3.7 and, as of this writing (January 2013), is not in RHEL 6.3, whereas SLES11 SP2 does appear to have support. powerpc-utils 1.2.20 or above is required.
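
On kernels with the support described above, snooze can be disabled entirely:

# ppc64_cpu --snooze_delay=-1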

Disable CPU frequency scaling

Availability: x86, PowerKVM, and non-virtualized

Set the maximum frequency available for scaling to the maximum possible frequency:

# max_freq=$(for i in $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies); do echo $i; done | sort -nr | head -1)
# for cpu_max_freq in /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq; do echo $max_freq > $cpu_max_freq; done

Set the frequency governor to performance:

# for governor in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $governor; done
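
To verify that the settings took effect, read the values back from sysfs:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq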

Disable WAKEUP_PREEMPT

Availability: All platforms

The Linux Completely Fair Scheduler (CFS) first appeared in the 2.6.23 release of the Linux kernel in October 2007. The algorithms used in the CFS provide efficient scheduling for a wide variety of systems and workloads. However, for some workloads there is one behavior of the CFS that can cost a few percent of CPU utilization.

In the CFS, a thread that submits I/O, blocks, and then is notified of the I/O completion preempts the currently running thread and is run instead. This behavior is great for applications, such as video streaming, that need low latency for handling the I/O, but it can actually hurt performance in some cases. For example, when a thread submits I/O such as sending a response out on the network, that thread is in no hurry to handle the I/O completion: upon completion, it is simply finished with its work. Moreover, when an I/O completion thread preempts the currently running thread, it prevents that thread from making progress and can ruin some of the cache warmth the thread has built up. Since there is no immediate need to handle the I/O completion, the current thread should be allowed to run, and the I/O completion thread should be scheduled to run just like any other process.

The CFS has a list of scheduling features that can be enabled or disabled. The setting of these features is available through the debugfs file system. One of the features is WAKEUP_PREEMPT. It tells the scheduler that an I/O thread that was woken up should preempt the currently running thread, which is the default behavior as described above. To disable this feature, you set NO_WAKEUP_PREEMPT (not to be confused with NO_WAKEUP_PREEMPTION) in the scheduler’s features.

# mount -t debugfs debugfs /sys/kernel/debug
# echo NO_WAKEUP_PREEMPT > /sys/kernel/debug/sched_features
# umount /sys/kernel/debug
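
To confirm the change, read the feature list back while debugfs is mounted; a disabled feature appears with the NO_ prefix:

# mount -t debugfs debugfs /sys/kernel/debug
# cat /sys/kernel/debug/sched_features
# umount /sys/kernel/debug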

Disable scheduling preemption

Availability: All platforms

You can use the sched_min_granularity_ns parameter to disable preemption. sched_min_granularity_ns is the number of nanoseconds a process is guaranteed to run before it can be preempted. Setting the parameter to one half of the value of the sched_latency_ns parameter effectively disables preemption. sched_latency_ns is the period over which CFS tries to fairly schedule all the tasks on the runqueue. All of the tasks on the runqueue are guaranteed to be scheduled once within this period. So, the greatest amount of time a task can be given to run is inversely correlated with the number of tasks; fewer tasks means they each get to run longer. Since the smallest number of tasks needed for one to preempt another is two, setting sched_min_granularity_ns to half of sched_latency_ns means the second task will not be allowed to preempt the first task.

The scheduling parameters are located in the /proc/sys/kernel/ directory. Here is some sample bash code for disabling preemption.

# LATENCY=$(cat /proc/sys/kernel/sched_latency_ns)
# echo $((LATENCY/2)) > /proc/sys/kernel/sched_min_granularity_ns

The parameter sched_wakeup_granularity_ns is similar to the sched_min_granularity_ns parameter. The documentation is a little fuzzy on how this parameter actually works. It controls the ability of tasks being woken to preempt the current task. The smaller the value, the easier it is for the task to force the preemption. Setting sched_wakeup_granularity_ns to one half of sched_latency_ns can also help alleviate the scheduling preemption problem.
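
Following the same pattern as above, a sketch that sets sched_wakeup_granularity_ns to one half of sched_latency_ns:

# LATENCY=$(cat /proc/sys/kernel/sched_latency_ns)
# echo $((LATENCY/2)) > /proc/sys/kernel/sched_wakeup_granularity_ns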

Disable multipath

Availability: All platforms

Multipath is a means of using multiple paths to the same storage device as a single device with redundancy, to provide higher bandwidth, higher availability, or both. Two or more logical storage devices are bound together into a single virtual storage device.

multipathd is a service daemon that monitors the multipath devices for failure, and by default, it polls the devices every 5 seconds.

Ideally you should avoid this polling altogether to avoid the disruption. However, if multipathd must run, it is possible to increase the “polling_interval” setting in the /etc/multipath.conf file.
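
For example, a defaults stanza along these lines in /etc/multipath.conf raises the polling interval to five minutes (the value is in seconds):

defaults {
    polling_interval 300
}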

In some configurations, it is difficult or nearly impossible to install the system without multipath being enabled automatically by the installation process.  However, multipath can still be removed from an installed system:

  1. Remove the device from /etc/multipath.conf (see the sample blacklist stanza after this list):
    • Remove the device from the blacklist_exceptions section if it is already there, or
    • Add the device to the blacklist section
  2. Disable multipathd at reboot:
    • On RHEL, # systemctl disable multipathd
  3. Adjust /etc/fstab, if necessary:
    • LVM will rescan at boot and needs no changes in /etc/fstab
    • For physical partitions, entries in /etc/fstab should be changed from paths like /dev/mapper/mpathx to /dev/sdy
  4. Disable multipath during boot:
    1. On RHEL, dracut --verbose --force --omit multipath
    2. Reboot
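
A minimal sketch of a blacklist stanza for /etc/multipath.conf; the WWID shown is a placeholder and must be replaced with the WWID of the actual device (for example, as reported by multipath -ll):

blacklist {
    wwid "3600508b4000156d700012000000b0000"
}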

Mitigate pdflush

Availability: All platforms

pdflush wakes up periodically to flush dirty superblocks to storage.  According to “R.I.P. pdflush”, ext4 filesystems (among others) do not even take advantage of the mechanism that pdflush uses, so on systems using ext4 exclusively, pdflush will always wake up and find nothing to do.

It is impossible to turn off pdflush, as it is an integral kernel thread, but you can tell it to wake up far less often; in this example, every 360000 centiseconds, or one hour:

# echo 360000 > /proc/sys/vm/dirty_writeback_centisecs
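
To make the setting persist across reboots, the equivalent sysctl can be placed in /etc/sysctl.conf (or a file under /etc/sysctl.d/):

vm.dirty_writeback_centisecs = 360000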

Disable RTAS event scanning

Availability: Power, non-virtualized only

The Power Architecture Platform Reference (PAPR), a standard to which IBM Power Systems conform, dictates that the platform’s Run-Time Abstraction Services (RTAS) must be periodically scanned for new event reports.  In the Linux kernel, this is implemented as a daemon, of sorts (in reality, it’s a self-rescheduling workqueue item).  Unfortunately, it is also defined in PAPR to do subsequent scans from different cores.  As of this writing, the current implementation in the Linux kernel will schedule itself on the next online core, regardless of any other restrictions like cgroups or isolcpus settings, so all online cores will eventually be hit.

At present, there is no trivial method (other than changes to the Linux kernel source code) to disable this scan for PowerVM or PowerKVM guests.

You could disable it by recompiling the kernel with the appropriate code in rtasd.c disabled.  At the time of this writing, it is not clear whether there are negative side effects to disabling the RTAS event scans, so this is not recommended.

Note that when running non-virtualized (the operating system is running directly on OPAL firmware), RTAS event scanning is not performed.

Mitigate decrementer overflow interrupt

Availability: Power, but still under development

The decrementer register on IBM Power Systems servers is a 32-bit quantity, and is decremented at the frequency of the time base, which on IBM Power Systems servers is 512 MHz.  The maximum value of the register is 2^31 - 1, or 2147483647.  At 512 MHz, this decrements to zero in about 4.2 seconds.  So, a completely idle processor will still see a decrementer interrupt every 4.2 seconds.  There is currently no way to eliminate this interrupt.

The following patch will mitigate the interrupt by avoiding some optional housekeeping done when the interrupt occurs when the only action necessary is to reset it: https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-October/121795.html

On a POWER8 system, the interrupt duration was reduced from about 30 microseconds to about 200 nanoseconds.

Remove processors from scheduling eligibility permanently

Availability: All platforms

Keep the Linux kernel scheduler from even considering a set of CPUs for tasks.

Use the kernel command line parameter isolcpus=<cpu list> to isolate a set of CPUs from consideration for scheduling tasks. For example, add isolcpus=1-63 to reserve all but CPU 0 (on a system with 64 CPUs) for specific application use. Use cgroups/cset, taskset, or numactl commands to force tasks explicitly on those CPUs.
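
For example, after booting with isolcpus=1-63, an application (./myapp below is a hypothetical placeholder) can be bound to the isolated CPUs with either of the following:

# taskset -c 1-63 ./myapp
# numactl --physcpubind=1-63 ./myapp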

Remove processors from scheduling eligibility dynamically

Availability: All platforms

Also known as CPU shielding. See the cpuset documentation for additional information.

# cset shield --cpu=4-63 --kthread=on

All existing tasks, including kernel threads, that can be migrated off the shielded CPUs (4-63 in the example above) will be moved to the unshielded CPUs (everything else).

cset is a front-end to the cgroup infrastructure.  It is also possible to manipulate cgroups manually.

To set up a cgroup, “mycgroup”, with a subset of the available CPUs and all memory:

# mount -t tmpfs tmpfs /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpuset
# mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
# cd /sys/fs/cgroup/cpuset
# mkdir mycgroup
# cd mycgroup
# echo 8-95 > cpuset.cpus
# cat /sys/fs/cgroup/cpuset/cpuset.mems > cpuset.mems
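
Tasks are then placed in the cgroup by writing their PIDs to its tasks file. For example, still in the mycgroup directory, move the current shell (and thus its future children) into the cgroup:

# echo $$ > tasks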

Disable scheduler domain rebalancing

Availability: All platforms

Set the SD_LOAD_BALANCE bit to zero in the flags of every critical CPU's scheduler domains, for example:

for cpu in $(seq 1 63); do
    for domain in /proc/sys/kernel/sched_domain/cpu$cpu/domain*/flags; do
        flags=$(cat $domain)
        # clear the low-order bit of flags (SD_LOAD_BALANCE)
        echo $(( flags & ~1 )) > $domain
    done
done

Disable watchdogs

Availability: All platforms

Watchdog kernel threads wake up regularly. To disable the watchdog kernel threads:

# sysctl kernel.nmi_watchdog=0

Kernel parameters:

nmi_watchdog=0 nowatchdog nosoftlockup

Use static network configuration

The DHCP client must periodically renew the lease for its IP address, at an interval specified by the DHCP server.  It is preferable to avoid this by using static network configuration.
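
As a sketch, a static configuration on a RHEL-style system for a hypothetical interface eth0 might look like this in /etc/sysconfig/network-scripts/ifcfg-eth0 (all addresses are placeholders):

DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.0.2.10
NETMASK=255.255.255.0
GATEWAY=192.0.2.1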

Disable inactive network interfaces

It has been observed that physically disconnected interfaces which were nevertheless configured as “UP” were generating system interrupts. Make sure that no unnecessary network interfaces are configured, even if they are physically disconnected.
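
To take an unused interface down immediately (eth1 is a placeholder name):

# ip link set dev eth1 down

On a RHEL-style system, setting ONBOOT=no in the interface's ifcfg file keeps it down across reboots.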

Use static and/or one-shot IRQ balancing

To stop immediately (will restart on reboot):

# systemctl stop irqbalance

(the service name may vary: irqbalance, irqbalancer, or …)

To prevent the service from starting at the next boot (does not stop it immediately):

# systemctl disable irqbalance

To balance IRQs once only and exit:

# IRQBALANCE_ONESHOT=1 IRQBALANCE_BANNED_CPUS=ffffffff,fffffff0 irqbalance
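
With irqbalance stopped, individual IRQs can also be pinned manually by writing a CPU mask to the corresponding smp_affinity file. For example, to steer a hypothetical IRQ 42 to CPU 0:

# echo 1 > /proc/irq/42/smp_affinity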

Fragment high-cost TCE-related hypervisor calls

Availability: PowerVM

Add kernel command line parameters:

  • bulk_remove=off
  • multitce=off
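
On a RHEL7-style system, one way to add these (and the other kernel command line parameters mentioned in this document) is to append them to the GRUB_CMDLINE_LINUX line in /etc/default/grub and regenerate the configuration (the output path may vary by platform):

# grub2-mkconfig -o /boot/grub2/grub.cfg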

Go tickless (or not)

Eliminating scheduler clock interrupts would be ideal, but is currently difficult in practice.  Work associated with the tick can “back up”, making the less-frequent interrupts take longer.  If unable to eliminate ticks altogether, it may be better to keep them (nohz=off) and/or increase their frequency (recompile kernel with CONFIG_HZ_250 or 300 or 1000).

Availability: RHEL7

To eliminate most ticks, get a kernel with CONFIG_NO_HZ_FULL.  This enables the kernel to significantly reduce timer interrupts on CPUs where there is a single runnable task and thus no need for scheduling.  CPUs thus enabled are called “adaptive ticks” CPUs.  The capability is enabled in RHEL7 kernels.  However, by default, no CPUs are defined as adaptive ticks CPUs.  To enable a set of CPUs to be adaptive ticks CPUs, add nohz_full=<cpulist> to the kernel command line.

To eliminate more ticks, recompile the kernel with CONFIG_NO_HZ_FULL_ALL.
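
For example, to make CPUs 1-63 adaptive-tick CPUs while also isolating them from the scheduler as described earlier, the kernel command line would include:

isolcpus=1-63 nohz_full=1-63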

Reduce scheduling migrations

Availability: All platforms

Disable SD_WAKE_AFFINE, SD_WAKE_BALANCE, or both, in the same manner as SD_LOAD_BALANCE above.
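
A sketch following the SD_LOAD_BALANCE loop above; the value 0x20 for SD_WAKE_AFFINE is an assumption and must be verified against include/linux/sched.h for the running kernel:

FLAG=0x20    # assumed SD_WAKE_AFFINE bit; verify against your kernel source
for cpu in $(seq 1 63); do
    for domain in /proc/sys/kernel/sched_domain/cpu$cpu/domain*/flags; do
        flags=$(cat $domain)
        # clear the flag bit, leaving all other bits intact
        echo $(( flags & ~FLAG )) > $domain
    done
done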

vmstat offloading

Availability: Submitted for upstream

Utilization statistics, commonly displayed with the vmstat command, are gathered at the beginning and end of interrupt processing, lengthening the interrupt processing time.  Some of this processing can be offloaded to a sacrificial thread, reducing interrupt latencies.  This is currently being added to upstream kernels and is not yet in any enterprise distribution.  (a.k.a. the “Lameter patches”)

RCU offloading

Availability: Targeted upstream

RCU performs some housekeeping during interrupt processing.  There are some recent patches being pushed into upstream kernels to move this processing to a sacrificial core, thus reducing interrupt latencies accordingly.

Kernel parameter: rcu_nocbs=<cpu list>

Or, kernel config CONFIG_RCU_NOCB_CPU_ALL=y

set clu cto 無法安裝 urn ems water 了解 源代碼管理 Setting up a EDK II build environment on Windows and Linux:搭建Windows和Linux開發環境[2.2] 2015-07 北