Linux Network Performance Optimization: Tips for Optimizing Linux Network Throughput and Latency


The network is a critical point of failure for almost all modern infrastructure: nearly every system is connected to the Internet, except perhaps isolated military ones. The network is also what lets you scale under high load - you can add more servers, work on tasks in parallel, or push part of the workload to other machines.

OK, we have a high load - what do we do next? The easiest solution is to add more servers and InfiniBand or fiber networks, but it is also the most expensive. In this post we will examine another option: optimizing network throughput and latency without hardware upgrades, saving money while making the current system architecture capable of transferring more data, faster. When web pages load fast, your customers are happy; when they don't, customers start looking for an alternative service. Research that has measured average web page load times puts the figure at under 30 seconds.

A few words about FOSS (Open Source) solutions are in order: the open nature of Linux kernel development allows heavy tuning and exposes rich system settings that are simply not available in proprietary systems such as Microsoft Windows.

To apply any of the following settings immediately, use the sysctl command: sysctl -w key=value. To make a change permanent, add it to /etc/sysctl.conf in key = value format, then either reboot or apply it right away with sysctl -p.

Be careful and test every command before relying on it in production. If anything goes wrong, please leave a comment below this post.

Let's start with the next generation - XDP and eBPF

eBPF is a Linux kernel technology that enables the creation of custom, high-performance, and secure network filtering and processing rules and provides a flexible API for applications.

XDP (eXpress Data Path) is a high-performance networking technology that accelerates packet processing by handling incoming packets directly on the NIC within the kernel space, bypassing the traditional network stack. This approach enables fast and efficient processing, reducing latency and improving overall network performance.

XDP and eBPF work together to unlock network performance with:

  • Fast and efficient packet processing at the earliest stage of the network stack (XDP)
  • Custom, high-performance, and secure network filtering and processing rules (eBPF)

Together, they provide a powerful combination for improving network performance, reducing latency, and enhancing security.
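
To make this concrete, here is a minimal XDP program sketch in C (our illustration, not code from the article): it inspects each incoming frame at the earliest hook point and drops UDP packets addressed to a hypothetical port 9999 before they reach the kernel network stack, passing everything else through. It assumes the libbpf development headers are installed; the build and attach commands in the comments are likewise only examples.

// xdp_filter.c - minimal XDP sketch: drop UDP packets to a hypothetical port.
// Build (example):  clang -O2 -g -target bpf -c xdp_filter.c -o xdp_filter.o
// Attach (example): ip link set dev eth0 xdp obj xdp_filter.o sec xdp
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_filter(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;               /* bounds checks keep the BPF verifier happy */
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    struct udphdr *udp = (void *)(ip + 1);   /* assumes no IPv4 options, for brevity */
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    if (udp->dest == bpf_htons(9999))        /* hypothetical port to filter */
        return XDP_DROP;                     /* dropped before the kernel stack sees it */

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";

The same hook, combined with eBPF maps, is what load balancers and DDoS filters use to make per-packet decisions at line rate.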

MTU (Maximum Transmission Unit)

Increasing the MTU on Linux can reduce fragmentation and improve performance: packets become larger, so more data is transferred per packet without fragmentation. Warning: setting it too high risks packet loss or corruption, and every device along the path must support the larger frames (jumbo frames). Use this option with care and find the optimal size by running real tests on your network. The default value is 1500.

Recommended option: sudo ip link set dev eth0 mtu 9000
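
An application can also check the configured MTU at run time and size its messages accordingly. Below is a small C sketch (our illustration, not from the article) that reads the MTU with the SIOCGIFMTU ioctl; the interface name eth0 is a placeholder.

/* mtu_query.c - read an interface's MTU from an application. */
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for this ioctl */
    if (fd < 0) { perror("socket"); return 1; }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* placeholder interface name */

    if (ioctl(fd, SIOCGIFMTU, &ifr) < 0)
        perror("ioctl(SIOCGIFMTU)");
    else
        printf("eth0 MTU: %d\n", ifr.ifr_mtu);

    close(fd);
    return 0;
}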

Transmit (TX) Queue Length

This parameter controls the maximum number of packets allowed in the transmit queue of a network interface device in Linux. The default value varies across Linux distributions, but 1000 is the most common.

Recommended option: sudo ip link set dev eth0 txqueuelen 1000

TCP Congestion Control

Congestion control is the TCP protocol's mechanism for managing data flow over a network and preventing congestion: queuing delays, packet loss, or the blocking of new connections.

TCP supports several congestion control algorithms; to use one, its kernel module must be loaded. Reno is the classic algorithm, Cubic is the default in most modern distributions, and BBR is also available. The algorithms available in your kernel can be listed with: sysctl net.ipv4.tcp_available_congestion_control.

Recommended option: sysctl -w net.ipv4.tcp_congestion_control=cubic
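
The algorithm can also be selected per socket instead of system-wide. The following C sketch (our illustration, not from the article) uses the TCP_CONGESTION socket option and assumes the requested algorithm - bbr here - is built into or loaded in the running kernel.

/* cc_per_socket.c - choose a congestion control algorithm for one socket. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    const char *algo = "bbr";   /* assumption: the tcp_bbr module is available */
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)) < 0)
        perror("setsockopt(TCP_CONGESTION)");

    /* Read back what the kernel actually applied to this socket. */
    char buf[16] = {0};
    socklen_t len = sizeof(buf);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) == 0)
        printf("congestion control for this socket: %s\n", buf);

    close(fd);
    return 0;
}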

TCP_NODELAY

By default, TCP uses Nagle's algorithm to aggregate small packets into larger ones before sending them. This batching saves bandwidth but adds latency for small, interactive writes. Enabling or disabling TCP_NODELAY globally isn't recommended, because it affects many applications at once and can cause problems. If you cannot change an application's source code, one approach is to preload a compiled library that sets TCP_NODELAY on its sockets, like this: LD_PRELOAD=./path/to/lib.so app.
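
For applications whose source you do control, the option is simply set per socket. A minimal C sketch of that (our illustration, not the preloaded library mentioned above):

/* nodelay.c - disable Nagle's algorithm on a single socket. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int one = 1;
    /* 1 = send segments as soon as possible, trading bandwidth efficiency
     * for lower per-message latency. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
        perror("setsockopt(TCP_NODELAY)");

    /* ... connect() and write() small, latency-sensitive messages here ... */

    close(fd);
    return 0;
}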

net.core.rmem_max and net.core.wmem_max

Increasing the maximum TCP read and write buffers allows a larger window size, which increases the amount of data that can be in flight before the sender has to stop and wait for an acknowledgement. This reduces the impact of latency and improves throughput on high-bandwidth links.
The default value is 262144; we recommend increasing it to the maximum supported by the kernel.
Recommended options:

  • sysctl -w net.core.rmem_max=1048576
  • sysctl -w net.core.wmem_max=1048576
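
These sysctls only raise the ceiling; an individual application still has to ask for larger buffers, and the kernel clamps its request to rmem_max / wmem_max. A hedged C sketch (our illustration, not from the article) using SO_RCVBUF and SO_SNDBUF:

/* bufsize.c - request larger per-socket buffers, capped by rmem_max/wmem_max. */
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int size = 1048576;   /* 1 MiB, matching the rmem_max/wmem_max values above */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
        perror("setsockopt(SO_RCVBUF)");
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
        perror("setsockopt(SO_SNDBUF)");

    /* Read back what the kernel granted (it doubles the requested value
     * internally to leave room for bookkeeping). */
    int rcv = 0, snd = 0;
    socklen_t len = sizeof(int);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv, &len);
    len = sizeof(int);
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd, &len);
    printf("granted: rcvbuf=%d sndbuf=%d\n", rcv, snd);

    close(fd);
    return 0;
}

Note that explicitly setting SO_RCVBUF or SO_SNDBUF turns off the kernel's buffer auto-tuning for that socket, so only pin the sizes when measurements show a benefit.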

TCP window size

The TCP window size determines how much data a peer can receive before it must send an acknowledgement (ACK) back to the sender. The default window in most Linux distributions is limited to 64 KB.
The net.ipv4.tcp_rmem setting takes three values: the minimum, default, and maximum TCP socket receive buffer size.
The default values are:

  • Minimum: 4096
  • Default: 87380
  • Maximum: 174760

Recommended options:

  • sysctl -w net.ipv4.tcp_rmem='16777216 16777216 16777216'

net.ipv4.tcp_wmem

Very similar to tcp_rmem above, this also takes three values, but for the send buffer: minimum - the smallest send buffer size guaranteed to a newly created socket, default - the initial size of a TCP socket's send buffer, and maximum - the largest send buffer size used for auto-tuned TCP sockets. The default values are 4096 bytes, 16 KB, and 4 MB.

Recommended options:

  • sysctl -w net.ipv4.tcp_wmem='16777216 16777216 16777216'

net.core.netdev_max_backlog

This parameter sets the maximum size of the network interface's receive queue, which holds frames after they have been removed from the interface's ring buffer but before the kernel has processed them. On fast network interfaces it is useful to increase the queue size so it does not fill up and force packets to be dropped and retransmitted. The default value is 1000.

Recommended option: sysctl -w net.core.netdev_max_backlog=25000

TCP low latency

The default settings are normally optimized for maximum throughput, with latency as a secondary concern, but some applications such as trading and video chat need low latency more than raw throughput. The default value is 0. Note that on recent kernels this sysctl is kept only for compatibility and no longer changes behaviour.

Recommended option: sysctl -w net.ipv4.tcp_low_latency=1

TFO - TCP Fast Open

TCP Fast Open is a technique that allows the client to send data in the first packet of a TCP connection, eliminating the need for a full Round-Trip-Time (RTT) delay. This approach avoids the traditional three-way handshake required for repeated connections, allowing faster communication between the client and server. By sending the first packet immediately, TFO reduces the time required to establish a connection, allowing data transfer to begin sooner.

Check the default TFO state in your kernel: sysctl net.ipv4.tcp_fastopen.
Enable TFO for outgoing connections on the client with sysctl -w net.ipv4.tcp_fastopen=1 and for incoming connections on the server with sysctl -w net.ipv4.tcp_fastopen=2 (use 3 to enable both).
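
On the application side, a TFO client sends its first payload with sendto() and the MSG_FASTOPEN flag instead of calling connect() first. The sketch below is our illustration, not code from the article; the address, port, and request string are placeholders.

/* tfo_client.c - send the first request during the TCP handshake. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000   /* kernel flag value, for older libc headers */
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(8080);                        /* placeholder port */
    inet_pton(AF_INET, "192.0.2.10", &srv.sin_addr);     /* placeholder address */

    const char *req = "GET / HTTP/1.0\r\n\r\n";
    /* No separate connect(): when a TFO cookie from a previous connection is
     * cached, the request travels in the SYN; otherwise the kernel quietly
     * falls back to a normal three-way handshake. */
    if (sendto(fd, req, strlen(req), MSG_FASTOPEN,
               (struct sockaddr *)&srv, sizeof(srv)) < 0)
        perror("sendto(MSG_FASTOPEN)");

    close(fd);
    return 0;
}

On the server, TFO is enabled per listening socket with setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen)), where qlen bounds the queue of pending Fast Open requests.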

TCP fin timeout

This parameter matters when working with unstable or highly loaded network connections, because it controls how long half-closed connections (sockets in FIN-WAIT-2) are kept before the kernel reclaims them. Each such socket consumes on average about 1.5 KB of memory, so it makes sense to clear them out promptly. The default timeout is 60 seconds.

Recommended option: sysctl -w net.ipv4.tcp_fin_timeout=10

TCP limit output bytes

The TCP limit output bytes setting caps the amount of unsent data a single TCP socket may queue below the TCP layer, in the qdisc and device queues. Keeping this queue small prevents one connection from flooding the path with buffered data, which would otherwise build up congestion and add latency for all other traffic.

The default value is 262144.

Recommended option: sysctl -w net.ipv4.tcp_limit_output_bytes=131072

TCP max tw buckets

This kernel parameter controls the maximum number of TCP time-wait (tw) buckets allowed in the system; a tw bucket stores information about a closed TCP connection. If the number of time-wait sockets exceeds this limit, the excess ones are destroyed immediately and a warning is printed. The limit protects the system from simple Denial of Service (DoS) attacks that try to exhaust memory with time-wait sockets. The default value is 262144.

Recommended option: sysctl -w net.ipv4.tcp_max_tw_buckets=450000

TCP Window Scaling

TCP window scaling is a mechanism that allows TCP to advertise receive windows larger than 64 KB, which is essential on paths with a large bandwidth-delay product. Larger windows improve network performance by keeping more data in flight and reducing the number of round trips spent waiting for acknowledgments. On modern kernels this feature is usually enabled by default; a value of 0 disables it.

Recommended option: sysctl -w net.ipv4.tcp_window_scaling=1

Low latency network tuning

IRQ Processing

The irqbalance daemon is a system utility that optimizes CPU usage by balancing interrupt load across multiple CPUs. It identifies the most active interrupt sources and distributes them across the available CPUs, assigning each source to its own CPU, which spreads the workload and minimizes cache misses for interrupt handlers. This helps balance CPU load and improve overall system performance.

To run irqbalance once for testing, use the irqbalance --oneshot command, adding the --debug flag if you need to see what it is doing.

ethtool

The ethtool utility is a command line tool used to retrieve and modify network interface settings, as well as gather statistics on network driver and hardware performance. It provides a way to monitor and configure network interface settings such as speed, duplex, link status, and more.

These ethtool options can be useful for low latency:

  • Interrupt coalescing: ethtool -c / -C - view and tune the NIC's interrupt batching policy.
  • Ring buffers: ethtool -g / -G - reducing buffer sizes can also improve latency.
  • Offload capabilities: ethtool -k / -K - improve performance by offloading network processing from software to the NIC hardware.

Busy Polling

Busy polling is a technique where the kernel polls the network device's receive queue directly instead of waiting for an interrupt, bypassing interrupt handling, scheduling, and context switching at the cost of extra CPU usage.

Check if it is currently enabled: ethtool -k eth0 | grep busy-poll.

Recommended option: sysctl -w net.core.busy_poll=1 (the value is the time to busy-poll, in microseconds).
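
Busy polling can also be enabled for a single latency-critical socket rather than globally. The C sketch below (our illustration, not from the article) uses SO_BUSY_POLL; the 50-microsecond budget is an arbitrary example value, not a recommendation from the article.

/* busy_poll.c - enable busy polling on one socket. */
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46   /* kernel option value, for older libc headers */
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int usec = 50;   /* microseconds to busy-poll on blocking receives */
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec)) < 0)
        perror("setsockopt(SO_BUSY_POLL)");   /* may require CAP_NET_ADMIN */

    /* ... bind() and recv() latency-sensitive traffic here ... */

    close(fd);
    return 0;
}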

This is the end for now

Thanks for reading, please share your personal network optimizations in the comments below.
