Netty High-Concurrency Performance Optimization

阿新 • Published: 2018-11-09

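I have recently been writing a prototype of a backend middleware component whose main job is message dispatch and pass-through forwarding. Since it had to be implemented in Java, Netty was the obvious first choice of network framework, and I used Netty 4. Netty is indeed efficient: a fairly high TPS is reachable without much effort. Along the way I still ran into several problems that I consider classic yet hard to find material on online, so I summarize them here.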

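1. Excessive Context Switches

During load testing I monitored the kernel with nmon and found the context switch rate above 300,000 per second. That is clearly abnormal, but what in a JVM could cause so many context switches? According to my earlier reading notes on process scheduling from the "dinosaur book" (Operating System Concepts) and the Wikipedia article on context switches, a process or thread is switched out for the following reasons:

I/O wait: in a multitasking system, a process issues an I/O request but the device is not ready yet, so the process blocks and enters the Wait state.
Time slice exhausted: in a time-sharing multitasking system, the process has used up the time slice the kernel gave it, so it enters the Ready state and waits to be scheduled again.
Hardware interrupt: in a preemptive time-sharing system, an I/O device can raise an interrupt at any moment; the CPU stops the running process to handle the interrupt, and the process enters the Ready state.

Based on this analysis, I focused on the first and second causes.

Process vs. thread context switches

My earlier reading notes covered the causes of process context switches, so how do thread context switches differ? As expected, StackOverflow has a question on this, "thread context switch vs process context switch":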

“The main distinction between a thread switch and a process switch is that during a thread switch, the virtual memory space remains the same, while it does not during a process switch. Both types involve handing control over to the operating system kernel to perform the context switch. The process of switching in and out of the OS kernel along with the cost of switching out the registers is the largest fixed cost of performing a context switch.
A more fuzzy cost is that a context switch messes with the processors cacheing mechanisms. Basically, when you context switch, all of the memory addresses that the processor “remembers” in it’s cache effectively become useless. The one big distinction here is that when you change virtual memory spaces, the processor’s Translation Lookaside Buffer (TLB) or equivalent gets flushed making memory accesses much more expensive for a while. This does not happen during a thread switch.”

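From the top-voted answer I learned that both process and thread context switches involve entering and leaving the kernel and saving and restoring registers, and that this is their largest fixed cost. A thread switch is still lighter than a process switch: the key difference is that the virtual address space stays the same, so CPU caches such as the TLB are not invalidated. Note, however, that another question, "What is the overhead of a context-switch?", mentions that technology introduced by Intel and AMD in 2008 may avoid invalidating the TLB as well. Look into it if you are interested.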

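1.1 Non-blocking I/O

For the first cause, I/O wait, the most direct remedy is non-blocking I/O. In Netty this means using NIO on both the server and the client.

It is also worth explaining how to write to a Netty Channel on your own initiative, because the examples found online all look the same: the server writes its response inside the Handler after receiving a request, and even the client examples send data from the Handler once the channel becomes active. My use case is message pass-through, where messages are sent to downstream systems asynchronously and without blocking, so those examples are of no use. Here is my approach.

On the server side, after a request arrives, obtain the Channel in channelRead0() through ctx.channel() and keep it somewhere, in a ThreadLocal variable or by any other means, as long as the Channel is retained. When the response needs to be sent, write to the retained Channel directly. See Section 4 for details.

The client works the same way: after starting the client, obtain the Channel, and write to it whenever data needs to be sent.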

EventLoopGroup group = new NioEventLoopGroup();
Bootstrap b = new Bootstrap();
b.group(group)
    .channel(NioSocketChannel.class)
    .remoteAddress(host, port)
    .handler(new ChannelInitializer<SocketChannel>() {
        @Override
        protected void initChannel(SocketChannel ch) throws Exception {
            ch.pipeline().addLast(...);
        }
    });

try {
    // Block until the connection is established, then keep the Channel for later writes
    ChannelFuture future = b.connect().sync();
    this.channel = future.channel();
}
catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore the interrupt flag
    throw new IllegalStateException("Error when start netty client: addr=[" + host + ":" + port + "]", e);
}

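1.2 Fewer threads

With too many threads, each thread gets a smaller share of CPU time, and the CPU must keep switching so that every thread gets a chance to run; every switch means saving and restoring a thread context. So I checked the EventLoopGroup used for Netty's I/O workers. As analyzed earlier in "Netty 4 Source Code Analysis: Server Startup", the default number of threads in an EventLoopGroup is twice the number of CPU cores. I therefore configured the NioEventLoopGroup thread count by hand to reduce the number of I/O threads.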

private void doStartNettyServer(int port) throws InterruptedException {
        EventLoopGroup bossGroup = new NioEventLoopGroup();
        EventLoopGroup workerGroup = new NioEventLoopGroup(4);
        try {
            ServerBootstrap b = new ServerBootstrap()
                    .group(bossGroup, workerGroup)
                    .channel(NioServerSocketChannel.class)
                    .localAddress(port)
                    .childHandler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        public void initChannel(SocketChannel ch) throws Exception {
                            ch.pipeline().addLast(...);
                        }
                    });

            // Bind and start to accept incoming connections.
            ChannelFuture f = b.bind(port).sync();

            // Wait until the server socket is closed.
            f.channel().closeFuture().sync();
        } finally {
            bossGroup.shutdownGracefully();
            workerGroup.shutdownGracefully();
        }
    }
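
As a rough, JDK-only illustration of the default mentioned above, the sketch below mirrors how the worker count is derived: twice the number of cores unless overridden, and never less than one. The class and method names are made up for this example, and the override parameter stands in for Netty's io.netty.eventLoopThreads system property; this is not Netty's actual code.

```java
/** Hypothetical helper for illustration; not part of Netty. */
final class EventLoopThreads {

    /**
     * @param cores    number of available processors
     * @param override explicit thread count, or null to fall back to the default
     * @return the number of event loop threads, at least 1
     */
    static int workerThreads(int cores, Integer override) {
        int n = (override != null) ? override : cores * 2;
        return Math.max(1, n);
    }
}
```

Calling EventLoopThreads.workerThreads(Runtime.getRuntime().availableProcessors(), null) on an 8-core machine gives 16, which is why passing an explicit 4 to the NioEventLoopGroup constructor in the server code above cuts the I/O thread count noticeably.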

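Because I was also using Akka as the business thread pool, I looked at how to change Akka's default configuration. Create a configuration file named application.conf; it is loaded automatically when the ActorSystem is created. The following file defines a custom dispatcher: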

my-dispatcher {
  # Dispatcher is the name of the event-based dispatcher
  type = Dispatcher
  mailbox-type = "akka.dispatch.SingleConsumerOnlyUnboundedMailbox"
  # What kind of ExecutionService to use
  executor = "fork-join-executor"
  # Configuration for the fork join pool
  fork-join-executor {
    # Min number of threads to cap factor-based parallelism number to
    parallelism-min = 2
    # Parallelism (threads) ... ceil(available processors * factor)
    parallelism-factor = 1.0
    # Max number of threads to cap factor-based parallelism number to
    parallelism-max = 16
  }
  # Throughput defines the maximum number of messages to be
  # processed per actor before the thread jumps to the next actor.
  # Set to 1 for as fair as possible.
  throughput = 100
}

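In short, the key settings are:

parallelism-factor: determines the pool size as ceil(available processors * factor), bounded by parallelism-min and parallelism-max (so, surprisingly, parallelism-max alone does not set the size).
throughput: the maximum number of messages an actor processes before the thread moves on to the next actor; 1 switches most often and is the fairest setting.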
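
The sizing rule can be written out as a few lines of Java. This is only a sketch of the arithmetic described in the configuration comments, with a made-up class name; it is not Akka code.

```java
/** Hypothetical helper that reproduces the sizing arithmetic; not Akka's API. */
final class ForkJoinSizing {

    /** ceil(processors * factor), clamped to the range [min, max]. */
    static int parallelism(int processors, double factor, int min, int max) {
        int scaled = (int) Math.ceil(processors * factor);
        return Math.min(Math.max(scaled, min), max);
    }
}
```

With the settings shown above (factor 1.0, min 2, max 16), an 8-core machine gets 8 threads; parallelism-max only matters once cores times factor exceeds 16.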

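Since this article is mainly about Netty, I will not explain these settings in detail; see the official documentation on Dispatchers and Mailboxes. Creating an actor with a specific dispatcher is simple. The following shows how to specify the dispatcher when creating a typed actor.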

TypedActor.get(system).typedActorOf(
            new TypedProps<MyActorImpl>(
                    MyActor.class,
                    new Creator<MyActorImpl>() {
                        @Override
                        public MyActorImpl create() throws Exception {
                            return new MyActorImpl(XXX);
                        }
                    }
            ).withDispatcher("my-dispatcher")
    );

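1.3 Removing the business thread pool

Despite all the configuration changes above, and although jstack confirmed that the thread settings had taken effect, the context switch rate did not improve. So I removed the Akka-based business thread pool altogether to cut thread context switches at the root, and the rate dropped from over 300,000 to 160,000 at once! After a lot of searching I found a post on StackOverflow in which one sentence opened my eyes:

And if the recommendation is not to block in the event loop, then this can be done in an application thread. But that would imply an extra context switch. This extra context switch may not be acceptable to latency sensitive applications.

With that clue I went straight to the Netty source and found that operations such as channel.write() are indeed not always executed on the calling thread. Netty consistently uses executor.inEventLoop() to check whether the current thread is the Channel's EventLoop thread; if it is not, the operation is wrapped in a task and handed to the EventLoop to execute: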

private void write(Object msg, boolean flush, ChannelPromise promise) {

        AbstractChannelHandlerContext next = findContextOutbound();
        EventExecutor executor = next.executor();
        if (executor.inEventLoop()) {
            next.invokeWrite(msg, promise);
            if (flush) {
                next.invokeFlush();
            }
        } else {
            int size = channel.estimatorHandle().size(msg);
            if (size > 0) {
                ChannelOutboundBuffer buffer = channel.unsafe().outboundBuffer();
                // Check for null as it may be set to null if the channel is closed already
                if (buffer != null) {
                    buffer.incrementPendingOutboundBytes(size);
                }
            }
            Runnable task;
            if (flush) {
                task = WriteAndFlushTask.newInstance(next, msg, size, promise);
            }  else {
                task = WriteTask.newInstance(next, msg, size, promise);
            }
            safeExecute(executor, task, promise, msg);
        }
    }

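A business thread pool turns out to be a double-edged sword. Handing tasks to it for asynchronous execution shortens the time Netty's I/O threads are occupied and relieves pressure on them, but it also increases the number of thread context switches. With the optimizations above, the context switch rate under load finally fell from over 300,000 per second to about 80,000, a clearly visible improvement!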
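
To make the hand-off easier to see, here is a minimal JDK-only model of the inEventLoop() check. It is a sketch under simplifying assumptions, not Netty code: MiniEventLoop and all of its members are invented for this example, and a single-thread executor stands in for the EventLoop.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Toy model of an event loop; all names are hypothetical. */
final class MiniEventLoop {
    private volatile Thread loopThread;
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "mini-event-loop");
        t.setDaemon(true);
        loopThread = t; // remember which thread is "the" loop thread
        return t;
    });
    /** Records how each write was executed. */
    final List<String> log = Collections.synchronizedList(new ArrayList<>());

    boolean inEventLoop() {
        return Thread.currentThread() == loopThread;
    }

    void write(String msg) {
        if (inEventLoop()) {
            doWrite(msg, "direct");        // already on the loop thread: run inline
        } else {
            // Called from another thread: queue a task, which costs a thread hand-off
            executor.execute(() -> doWrite(msg, "handed-off"));
        }
    }

    private void doWrite(String msg, String how) {
        log.add(how + ":" + msg);
    }

    /** Runs a task on the loop thread and waits for it to finish. */
    void runOnLoopAndWait(Runnable task) {
        try {
            executor.submit(task).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

A write issued from a business thread is logged as handed-off, while the same call made from inside a task already running on the loop is logged as direct. Every handed-off write is one more wake-up of the loop thread, which is where the extra context switches come from.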

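2. System call overhead

A system call generally involves a mode transition (also called a mode switch) from user space to kernel space, and this transition has a cost of its own.

Mode Switch vs. Context Switch

StackOverflow really does have a question for everything. Thread context switches were covered above, but how do they relate to switching between kernel mode and user mode? Does a mode switch count as a kind of context switch? The question "Does there have to be a mode switch for something to qualify as a context switch?" answers this: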

“A mode switch happens inside one process. A context switch involves more than one process (or thread). Context switch happens only in kernel mode. If context switching happens between two user mode processes, first cpu has to change to kernel mode, perform context switch, return back to user mode and so on. So there has to be a mode switch associated with a context switch. But a context switch doesn’t imply a mode switch (could be done by the hardware alone). A mode switch does not require a context switch either.”

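A context switch has to be performed in the kernel. Put simply, the thread deliberately triggers a software interrupt (analogous to a hardware interrupt, which is triggered from outside), so a context switch is normally accompanied by a mode switch. According to the answer, some hardware can perform the switch by itself (I do not fully understand this case), and some CPUs do not even have the Ring 0 to Ring 3 privilege levels we usually talk about. A mode switch, on the other hand, does not depend on a context switch at all; if a connection must be drawn, then as Wikipedia puts it, in some systems a context switch may happen during a mode switch.

The system calls Netty makes most often are network I/O operations. The most direct way to reduce their frequency is therefore to buffer outgoing data and flush the buffer only after a certain data size, number of writes, or time interval has been reached.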
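
The flush policy can be modelled without Netty. The sketch below is an assumption-laden illustration: BatchingWriter is a made-up class, each entry in flushed stands for one write system call, and the time-interval trigger is left out to keep it short.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Toy model of "buffer, then flush on size or count"; all names are hypothetical. */
final class BatchingWriter {
    private final int maxBytes;
    private final int maxWrites;
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private int pendingWrites;
    /** Each element stands for one flush, i.e. one system call in the real thing. */
    final List<byte[]> flushed = new ArrayList<>();

    BatchingWriter(int maxBytes, int maxWrites) {
        this.maxBytes = maxBytes;
        this.maxWrites = maxWrites;
    }

    void write(byte[] data) {
        pending.write(data, 0, data.length);
        pendingWrites++;
        // Flush once either threshold is reached
        if (pending.size() >= maxBytes || pendingWrites >= maxWrites) {
            flush();
        }
    }

    void flush() {
        if (pendingWrites == 0) {
            return; // nothing buffered
        }
        flushed.add(pending.toByteArray());
        pending.reset();
        pendingWrites = 0;
    }
}
```

In a Netty pipeline the same idea roughly maps to calling write() per message and flush() only when a threshold is hit, instead of writeAndFlush() for every message.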

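For problems such as an undersized buffer or writes arriving too fast, Netty provides the writeBufferLowWaterMark and writeBufferHighWaterMark options. Once the pending outbound data exceeds the high water mark, Channel.isWritable() returns false until the amount falls below the low water mark. Netty does not block or reject writes by itself, so the application should check writability and stop writing to keep the buffer from growing without bound. This feels somewhat similar to the traffic shaping feature Netty provides. I have not studied it in depth yet; interested readers can explore it on their own.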
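
The high/low water mark behaviour is a simple hysteresis, which the following JDK-only sketch tries to capture. WritabilityTracker is an invented name and the class only models the bookkeeping; it is not how Netty's outbound buffer is actually structured.

```java
/** Toy model of water-mark hysteresis; all names are hypothetical. */
final class WritabilityTracker {
    private final long lowWaterMark;
    private final long highWaterMark;
    private long pendingBytes;
    private boolean writable = true;

    WritabilityTracker(long lowWaterMark, long highWaterMark) {
        this.lowWaterMark = lowWaterMark;
        this.highWaterMark = highWaterMark;
    }

    /** Called when data is queued for sending. */
    void onEnqueued(long bytes) {
        pendingBytes += bytes;
        if (pendingBytes > highWaterMark) {
            writable = false; // above the high mark: signal back-pressure
        }
    }

    /** Called when queued data has actually been written out. */
    void onFlushed(long bytes) {
        pendingBytes -= bytes;
        if (pendingBytes < lowWaterMark) {
            writable = true;  // only recover once below the low mark
        }
    }

    boolean isWritable() {
        return writable;
    }
}
```

The gap between the two marks keeps the flag from flapping: after the high mark is crossed, draining a little is not enough to become writable again.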

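3. Implementing Zero Copy

"Netty: The Definitive Guide (2nd edition)" (《Netty權威指南（第二版）》) has a section devoted to Netty's zero copy, but it covers the zero-copy features inside Netty. What I want to discuss here is how to achieve zero copy in application code; the most typical scenario is message pass-through. Pass-through does not require parsing the whole message: knowing which downstream system to forward it to is enough. So we parse only part of the message, leave the message as a whole untouched in a direct buffer, and finally write that buffer straight into the Channel connected to the downstream system. An application-level zero-copy implementation therefore has two parts: configuring direct buffers, and passing the buffer along without copying.

3.1 Memory pool

Memory management is another benefit of using Netty. A single line of configuration is enough to get the benefits of a memory pool. Under the hood, Netty implements a Java variant of the jemalloc allocator (remember the one bundled with Redis?), which does all the dirty work for us!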

ServerBootstrap b = new ServerBootstrap()
            .group(bossGroup, workerGroup)
            .channel(NioServerSocketChannel.class)
            .localAddress(port)
            .childOption(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT)
            .childHandler(new ChannelInitializer<SocketChannel>() {
                @Override
                public void initChannel(SocketChannel ch) throws Exception {
                    ch.pipeline().addLast(...);
                }
            });

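3.2 Application-level Zero Copy

By default, Netty releases the ByteBuf automatically when the handler extends SimpleChannelInboundHandler. That is, when our overridden channelRead0() returns, the ByteBuf has served its purpose and is released by Netty (if it is pooled, it is returned to the memory pool).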

public abstract class SimpleChannelInboundHandler<I> extends ChannelInboundHandlerAdapter {

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        boolean release = true;
        try {
            if (acceptInboundMessage(msg)) {
                @SuppressWarnings("unchecked")
                I imsg = (I) msg;
                channelRead0(ctx, imsg);
            } else {
                release = false;
                ctx.fireChannelRead(msg);
            }
        } finally {
            if (autoRelease && release) {
                ReferenceCountUtil.release(msg);
            }
        }
    }
}

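Netty uses reference counting to decide when to reclaim a buffer, so to keep using a ByteBuf without Netty freeing it, we must increase its reference count. If any Handler in the ChannelPipeline calls ByteBuf.retain() to increment the count by one, Netty's automatic release will not free it. We then release it in the Encoder of the client connected to the downstream system once the message has been sent, which gives the effect of zero-copy pass-through: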

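public class RespEncoder extends MessageToByteEncoder<Msg> {

    @Override
    protected void encode(ChannelHandlerContext ctx, Msg msg, ByteBuf out) throws Exception {
        // Raw in Msg is the ByteBuf retained upstream
        ByteBuf raw = msg.getRaw();
        try {
            // Bytes [0, readerIndex) are the part already consumed by the decoder;
            // note that writeBytes() copies them into out
            out.writeBytes(raw, 0, raw.readerIndex());
        } finally {
            // Drop the reference taken by retain(), even if the write fails
            raw.release();
        }
    }

}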
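
The retain/release bookkeeping is easy to get wrong, so here is a small JDK-only model of it. RefCountedBuffer is a hypothetical class written for this example; it imitates the counting semantics described above and is not Netty's ByteBuf.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy reference-counted buffer; all names are hypothetical. */
final class RefCountedBuffer {
    // A new buffer starts with one reference, held by whoever created it
    private final AtomicInteger refCnt = new AtomicInteger(1);

    int refCnt() {
        return refCnt.get();
    }

    /** Adds one reference; fails if the buffer was already freed. */
    RefCountedBuffer retain() {
        for (;;) {
            int current = refCnt.get();
            if (current <= 0) {
                throw new IllegalStateException("buffer already released");
            }
            if (refCnt.compareAndSet(current, current + 1)) {
                return this;
            }
        }
    }

    /** Drops one reference; returns true if this call freed the buffer. */
    boolean release() {
        for (;;) {
            int current = refCnt.get();
            if (current <= 0) {
                throw new IllegalStateException("buffer already released");
            }
            if (refCnt.compareAndSet(current, current - 1)) {
                return current == 1;
            }
        }
    }
}
```

In pass-through terms: the inbound handler calls retain() (count 2), the framework's automatic release brings it back to 1 without freeing anything, and the final release() in the downstream encoder is the one that frees the buffer.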

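4. State handling under concurrency

Writing asynchronously to a retained Channel (Section 1.1) and flushing the buffer according to some rule (Section 2) both involve stored state. If that state is accessed concurrently, watch out for race conditions so that updates do not conflict or get lost.

4.1 Storing the Channel

How does a Handler on the Netty server hold on to the Channel? My approach is to create a Session object that holds the Channel in channelActive() or on the first call to channelRead0(). As analyzed earlier in "Netty 4 Source Code Analysis: Request Handling", in the Netty 4 thread model one EventLoop thread may serve many clients, but any given client is served by exactly one EventLoop thread. Each client also gets its own Handler instance (provided the ChannelInitializer creates a new one per connection), which is used until the connection closes.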

public class FrontendHandler extends SimpleChannelInboundHandler<Msg> {

    // SessionFactory is an application-defined type (the name is assumed here)
    private final SessionFactory factory;
    // Accessed only from this Channel's EventLoop thread
    private Session session;

    public FrontendHandler(SessionFactory factory) {
        this.factory = factory;
    }

    @Override
    public void channelActive(ChannelHandlerContext ctx) throws Exception {
        session = factory.createSession(ctx.channel());
        super.channelActive(ctx);
    }

    @Override
    protected void channelRead0(final ChannelHandlerContext ctx, Msg msg) throws Exception {
        session.handleRequest(msg);
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        session = null;
        super.channelInactive(ctx);
    }

}

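4.2 Decoder state

Because of TCP packet coalescing and fragmentation ("sticky" and split packets), a Decoder inevitably has to keep some intermediate parsing state. Netty uses the same Decoder instance for the whole lifetime of a client connection, so the intermediate state must be reset after each message has been parsed, otherwise later messages will be parsed incorrectly.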

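public class RespDecoder extends ReplayingDecoder<Void> {

    public RespDecoder() {
        doCleanUp();
    }

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out)
            throws Exception {
        // doParseMsg() returns true once a complete message has been parsed
        if (doParseMsg(in)) {
            doSendToHandler(out);
            // Reset the intermediate state before the next message
            doCleanUp();
        }
    }
}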
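
Since the helper methods in the listing above are elided, a self-contained model may help. The sketch below assumes a made-up wire format (one length byte followed by that many payload bytes) and a made-up FrameDecoder class; it only demonstrates why the state has to be reset after each complete message.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Toy stateful decoder for "1 length byte + payload" frames; all names are hypothetical. */
final class FrameDecoder {
    private int expected = -1; // -1 means the length byte has not been read yet
    private final ByteArrayOutputStream body = new ByteArrayOutputStream();

    /** Feeds one network chunk and returns every frame completed by it. */
    List<byte[]> feed(byte[] chunk) {
        List<byte[]> out = new ArrayList<>();
        for (byte b : chunk) {
            if (expected < 0) {
                expected = b & 0xFF; // start of a new frame
            } else {
                body.write(b);       // payload byte of the current frame
            }
            if (expected >= 0 && body.size() == expected) {
                out.add(body.toByteArray());
                cleanUp();           // without this, the next frame would be misparsed
            }
        }
        return out;
    }

    private void cleanUp() {
        expected = -1;
        body.reset();
    }
}
```

Feeding the bytes of two frames split across three chunks yields nothing for the first chunk and one frame for each of the next two, regardless of where the chunk boundaries fall.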

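5. Summary

5.1 The ever-changing Netty

Before summarizing, let me grumble about the pace of Netty's changes, which I both love and hate. From Netty 3 to Netty 4 the API went through an "earthquake"; many sample programs online are based on Netty 3, so when learning Netty 4 I found that many examples no longer ran. Beyond the API, internal changes such as the thread model go without saying. I thought I could relax once I was on Netty 4, but the thread model has changed yet again in Netty 5! Here is what the official documentation says; take note if you plan to upgrade.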

Even more flexible thread model

In Netty 4.x each EventLoop is tightly coupled with a fixed thread that executes all I/O events of its registered Channels and any tasks submitted to it. Starting with version 5.0 an EventLoop does no longer use threads directly but instead makes use of an Executor abstraction. That is, it takes an Executor object as a parameter in its constructor and instead of polling for I/O events in an endless loop each iteration is now a task that is submitted to this Executor. Netty 4.x would simply spawn its own threads and completely ignore the fact that it’s part of a larger system. Starting with Netty 5.0, developers can run Netty and the rest of the system in the same thread pool and potentially improve performance by applying better scheduling strategies and through less scheduling overhead (due to fewer threads). It shall be mentioned, that this change does not in any way affect the way ChannelHandlers are developed. From a developer’s point of view, the only thing that changes is that it’s no longer guaranteed that a ChannelHandler will always be executed by the same thread. It is, however, guaranteed that it will never be executed by two or more threads at the same time. Furthermore, Netty will also take care of any memory visibility issues that might occur. So there’s no need to worry about thread-safety and volatile variables within a ChannelHandler.

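According to the official documentation, Netty no longer guarantees that a given Handler instance always runs on the same thread, so using ThreadLocal in a Handler becomes a risky practice!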
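
The risk is easy to reproduce with plain JDK threads. In the sketch below, which uses invented names and two single-thread executors in place of a handler that migrates between threads, a value stored in a ThreadLocal on one thread is simply absent on the other.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Demonstrates that ThreadLocal state does not follow work across threads; names are hypothetical. */
final class ThreadLocalHazard {
    static final ThreadLocal<String> SESSION = new ThreadLocal<>();

    /** Stores a value on one thread, then returns what a different thread reads. */
    static String storeThenReadOnOtherThread(String value) {
        ExecutorService first = Executors.newSingleThreadExecutor();
        ExecutorService second = Executors.newSingleThreadExecutor();
        try {
            Runnable store = () -> SESSION.set(value);
            Callable<String> read = SESSION::get;
            first.submit(store).get();        // "handler" runs on thread A
            return second.submit(read).get(); // the next event runs on thread B
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            first.shutdown();
            second.shutdown();
        }
    }
}
```

The second thread reads null, which is why per-connection state is safer in a field of the Handler instance (as in Section 4.1) than in a ThreadLocal.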

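5.2 High-concurrency programming techniques

After all the investigation and effort above, TPS finally rose from a few thousand to about 50,000. I learned a lot about network programming and performance tuning that I did not know before, which was very satisfying! To summarize, the optimization strategies for high-concurrency middleware are:

Control the number of threads: under high concurrency, a large number of threads makes context switching very noticeable, and for work that does not block, threads beyond the number of CPU cores bring little benefit. Unless the operations are particularly time-consuming, a business thread pool does more harm than good. Netty 5 gives us the option of supplying the underlying thread pool, which allows better control over the thread count and scheduling strategy of the whole middleware.
Non-blocking I/O: to get more done with fewer threads, avoiding blocking is a must.
Fewer system calls: a mode switch costs much less than a context switch, but frequent syscalls should still be kept to a minimum.
Zero copy of data: copying each message out of the direct buffer (native memory outside the Java heap) into heap memory on every pass-through adds up to a considerable cost.
Protection of shared state: concurrent processing inside the middleware is also a key factor in performance.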
