Tuning FreeBSD to serve 100-200 thousand connections

I’m finally back. Here is a translation of Igor Sysoev’s talk given at the RIT conference. Igor Sysoev is the author of nginx, one of the most widely used lightweight HTTP servers in Russia and worldwide.

I also use nginx as a reverse proxy and load balancer in my own project.

mbuf clusters

FreeBSD stores network data in mbuf clusters of 2KB each, but only about 1500 bytes of each cluster are used (the size of an Ethernet packet).

mbufs

For each mbuf cluster an “mbuf” structure is needed; it is 256 bytes in size and is used to organize mbuf clusters into chains. An mbuf can also hold about 100 bytes of useful data itself, but this is not always used.

If the server has 1GB of RAM or more, 25 thousand mbuf clusters are created by default, but in some cases that is not enough.

When no free mbuf clusters are available, FreeBSD enters the zonelimit state and stops responding to any network requests. You can see it as the `zoneli` state in the output of the `top` command.

The only way to recover is to log in through the local console and reboot the system; it is impossible to kill a process in the `zoneli` state. The same problem exists in Linux 2.6.x, but there even the local console stops working.

There is a patch that fixes this: it returns an ENOBUFS error, which signals that the cluster limit has been hit, and the program can then close some connections when it receives the error. Unfortunately this patch has not been merged into FreeBSD yet.

The number of mbuf clusters in use can be checked with the following command:

> netstat -m
1/1421/1425 mbufs in use (current/cache/total)
0/614/614/25600 mbuf clusters in use (current/cache/total/max)

You can increase the number of mbuf clusters through the kern.ipc.nmbclusters parameter:

> sysctl kern.ipc.nmbclusters=65536
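
On FreeBSD versions where kern.ipc.nmbclusters is a run-time sysctl, the change can also be made permanent across reboots via /etc/sysctl.conf (the value below is simply the one used above):

/etc/sysctl.conf:
kern.ipc.nmbclusters=65536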

In earlier FreeBSD versions the number of mbuf clusters can be configured only at boot time:

/boot/loader.conf:

kern.ipc.nmbclusters=65536

25000 mbuf clusters take about 50MB of memory, 32000 take about 74MB, and 65000 about 144MB (memory use roughly doubles along with the cluster count, since each cluster costs 2KB plus its 256-byte mbuf). 65000 is a boundary value, and I cannot recommend exceeding it without first increasing the kernel address space.

Increasing the amount of memory available to the kernel

On the i386 architecture the kernel address space is 1GB by default. To raise it to 2GB, add the following line to the kernel configuration file (KVA_PAGES is counted in 4MB units, so 512 × 4MB = 2GB):

options KVA_PAGES=512

On amd64 the KVA is always 2GB and there is no way to increase it yet.

In addition to increasing the address space, it is possible to raise the limit on physical memory available to the kernel (320MB by default). Let’s increase it to 1GB:

/boot/loader.conf:

vm.kmem_size=1G

And reserve 275MB of that space for mbuf clusters:

sysctl kern.ipc.nmbclusters=262144

Establishing a connection: syncache and syncookies

Roughly 100 bytes are needed to serve a single connection.
An unfinished (embryonic) connection in the syncache also takes about 100 bytes.
With the default settings the syncache can hold information about roughly 15,000 connections (hashsize 512 × bucketlimit 30 ≈ 15,360 entries).

Syncache parameters can be seen with the `sysctl net.inet.tcp.syncache` command (they are read-only).

Syncache parameters can be changed only at boot time:

/boot/loader.conf:
net.inet.tcp.syncache.hashsize=1024
net.inet.tcp.syncache.bucketlimit=100

When a new connection does not fit into the overfull syncache, FreeBSD switches to syncookies (TCP SYN cookies). This feature is enabled with:

sysctl net.inet.tcp.syncookies=1

The syncache population and the syncookie statistics can be seen with the `netstat -s -p tcp` command.

Once the connection is established, it is placed into the “listen socket queue”.

Queue statistics can be seen with the `netstat -Lan` command.

The queue can be increased with the `sysctl kern.ipc.somaxconn=4096` command.
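
On the nginx side, the listening socket can be given a matching backlog through the listen directive’s backlog parameter (4096 here only mirrors the sysctl value above):

nginx.conf:
listen 80 default backlog=4096;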

When the connection is accepted, FreeBSD creates the socket structures for it.

To raise the limit on open sockets:

sysctl kern.ipc.maxsockets=204800

In earlier versions:

/boot/loader.conf:
kern.ipc.maxsockets=204800

The current state can be seen with the following command:

> vmstat -z
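
The output of `vmstat -z` is long; to watch only the socket- and TCP-related zones it can be filtered, for example:

> vmstat -z | grep -iE 'socket|tcp'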

tcb hash

If the server handles several tens of thousands of connections, the tcb hash allows the kernel to quickly find the connection that each incoming TCP packet belongs to.

The tcb hash has 512 entries by default.

The current size can be seen with:

sysctl net.inet.tcp.tcbhashsize

It can be changed only at boot time:

/boot/loader.conf:
net.inet.tcp.tcbhashsize=4096

Files

Applications work not with sockets but with files, so every socket needs a file descriptor. To raise the limits:

sysctl kern.maxfiles=204800
sysctl kern.maxfilesperproc=200000
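
To see how many file descriptors are currently open system-wide and compare that against these limits, check the kern.openfiles counter:

> sysctl kern.openfiles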

These options can be changed on a live system, but they do not affect already running processes. nginx is able to change its open files limit on the fly:

nginx.conf:
worker_rlimit_nofile 200000;
events {
worker_connections 200000;
}

receive buffers

Buffers for incoming data. They are 64KB by default; if there are no large uploads, the size can be lowered to 8KB (this reduces the chance of buffer exhaustion during a DDoS attack):

sysctl net.inet.tcp.recvspace=8192

For nginx:

nginx.conf:
listen 80 default rcvbuf=8k;

send buffers

Buffers for outgoing data. They are 32KB by default. If the data being sent is usually small, or mbuf clusters are in short supply, the size may be decreased:

sysctl net.inet.tcp.sendspace=16384

For nginx:

nginx.conf:
listen 80 default sndbuf=16k;

When the server has written data to a socket but the client does not read it, the data stays in the kernel for several minutes even after the connection is closed on timeout. nginx has an option to discard that data once the timeout fires:

nginx.conf:
reset_timedout_connections on;

sendfile

Another way to save mbuf clusters is sendfile. It sends data from the kernel’s file buffers straight to the network interface, without using any intermediate buffers.

To enable in nginx:

nginx.conf:
sendfile on;

(you should explicitly switch it off if you are serving files from a partition mounted via smbfs or cifs – ReRePi)

On the i386 platform with 1GB of memory or more, 6656 sendfile buffers are allocated, which is usually enough. On the amd64 platform a more efficient implementation is used and sendfile buffers are not needed at all.

When the sendfile buffers run out, the process gets stuck in the `sfbufa` state; things return to normal after the limit is increased:

/boot/loader.conf:
kern.ipc.nsfbufs=10240
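
On kernels that actually use sendfile buffers (i386), their current, peak, and maximum counts are reported by the same `netstat -m` command used for the mbuf statistics:

> netstat -m | grep sfbufs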

TIME_WAIT

After a connection is closed, the socket enters the TIME_WAIT state, in which it can live for 60 seconds by default. This time can be changed with a sysctl: the value is the MSL in milliseconds, and TIME_WAIT lasts two MSL periods (2 × 30000 ms = 60 seconds):

sysctl net.inet.tcp.msl=30000
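
On busy servers handling many short-lived connections, the MSL is often lowered further; for instance, a value of 5000 (purely an illustrative figure, not from the report) shortens TIME_WAIT to 2 × 5000 ms = 10 seconds:

sysctl net.inet.tcp.msl=5000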

TCP/IP ports

Outgoing connections are bound to ports from the 49152-65535 range (about 16 thousand ports). It is better to widen it (to 1024-65535):

sysctl net.inet.ip.portrange.first=1024
sysctl net.inet.ip.portrange.last=65535

To allocate ports in natural (sequential) order instead of randomly (so that a port is not reused for a new connection while it is still in TIME_WAIT):

sysctl net.inet.ip.portrange.randomized=0

FreeBSD 6.2 added the ability to skip the TIME_WAIT state for localhost connections:

sysctl net.inet.tcp.nolocaltimewait=1


14 Responses to “Tuning FreeBSD to serve 100-200 thousand connections”

  1. Kai Says:

    Hi,

    do you mean kern.ipc.nmbclusters or really kern.ipc.mbclusters?

    I found nothing about kern.ipc.mbclusters, can you help me?

    Thank you

    PS: very useful translation

  2. rerepi Says:

    I mean kern.ipc.nmbclusters everywhere. ipc.mbclusters was a typo and it is fixed now.

    Thank you.

  3. Brad Davis Says:

    In the TIME_WAIT section you have an example to set the sysctl:

    sysctl net.inet.tcp.mls=30000

    The mls should be msl, so:

    sysctl net.inet.tcp.msl=30000

  4. rerepi Says:

    thanks, will fix now.

  5. eclosion Says:

    worker_limit_nofile 200000

    should be worker_rlimit_nofile 200000;

  6. in2 Says:

    The kernel virtual address space was increased to 6GB after FreeBSD 7.2, so a larger kern.ipc.nmbclusters should be fine.

  7. Jérôme Says:

    Some of this could be valuable with varnish too (RP/cache only).

    http://varnish-cache.org/wiki/Performance

    Thanks !

  8. anonymous Says:

    “25000 mbuf clusters takes bout a 50Mb in the memory, 32000 – 74Mb, 65000 – 144 Mb (raises by the power of 2). ”

    Here you mean doubled, not a power of 2. Nice article.

  9. rerepi Says:

    You’re right, it is doubled, not powered. Thanks.

  10. Moxo Says:

    Hi, sorry for my English.
    Second: what should kern.ipc.nmbclusters be set to for 25,000 connections with a 64 KB TCP buffer? I got 1,600,000, is that right?

    Thanks for everything.

  11. Moxo Says:

    Hi, sorry for my English
    second: what would the size of kern.ipc.nmbclusters be with 25,000 connections and a 64 kb tcp buffer? 1.6 million is what I got, is that right?

    thanks for everything.
    jaja sorry

  12. rerepi Says:

    not sure I got your question

  13. Jason Wieland Says:

    listen 80 default sendbuf=16k;

    Should be
    listen 80 default sndbuf=16k;

  14. Sabbasth Says:

    Still very relevant for newbies, well explained.

    Thx !

    Some kern.ipc.mbclusters are still here instead of kern.ipc.nmbclusters but it’s fine.
