Monday, March 30, 2015

Performance Tuning: java socket timeout

In my previous post I wrote about the socket connect timeout. In this one I am going to write about the other low-level socket timeout - the SO_TIMEOUT. Although most java programmers are probably never going to use the socket API directly, these timeouts are often configurable for higher-level protocols and must be well understood.

The socket timeout (SO_TIMEOUT) is the maximal amount of time that the application is willing to wait for a single read call on the socket to complete.

This timeout only limits the duration of a single invocation of the read method. Consider a higher-level protocol in which your application needs to read a whole line from the TCP connection. For this you could wrap the socket's input stream into a When you read a line from the buffered reader, it will in turn repeatedly call read on the socket and gather the received bytes until a new line has been detected and then return the whole line at once. For a line of 10 bytes (including the new-line) this could translate to up to 10 calls of the read method. Provided that you have configured the socket timeout, this would translate to an overall duration of 10 times the socket timeout - assuming that  each read call returns 1 byte just before the timeout has been reached. As you see, the socket timeout is not well-suited to set timeouts for higher-level protocols. However, without that timeout you will not be able to react to other timeouts.

Let us first see what happens under the hood when no timeout is set - based on the following simple java code:

1. Socket socket = new Socket();
2. socket.getInputStream().read();

This translates to the following syscalls on a linux machine:

1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000091>
2. recvfrom(6,  <unfinished ...>
3.  <... recvfrom resumed> "Hello back\n", 8192, 0, NULL, NULL) = 11 <4.823417>

On line 2 we see that the JVM invokes the recvfrom syscall on the socket. This is a blocking operation and will only return when data has been received or an error has been detected. In this case, we have received data after 4.823417 seconds. Unfortunately, this call could potentially block forever and in turn block the calling java thread until the JVM has been restarted (calling Thread.interrupt() from within the JVM will not help). The situation might sound like something very exotic but is in practice quite frequent when complex network infrastructure is involved. Imagine that your application has successfully established a connection and now wants to fetch data from the other side. If you forcefully restart a firewall on the way (not giving it a chance to send RST (reset) packets to all connected parties to allow them to gracefully close their connections) or physically terminate the connection and then restart the other party or do a similar thing that causes it to forget about the connection, you will end up in a state known as a half-open connection. A connection can remain in this state forever and so will your read method and in turn the thread that has called it. There will be no way to recover from this without a restart of the JVM.

In order to prevent such scenarios, it is always a good idea to set the socket timeout (e.g. 4444 milliseconds) like this:

1. Socket socket = new Socket();
2. socket.setSoTimeout(4444);
3. socket.getInputStream().read();

If you do that, the JVM will always first poll the socket for data with this timeout and only call the blocking recvfrom syscall if data is available (and the blocking recvfrom call will return very fast in this case). The trace for this code looks like this:

1. socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 6 <0.000078>
2. poll([{fd=6, events=POLLIN|POLLERR}], 1, 4444 <unfinished ...>
3. <... poll resumed> ) = 1 ([{fd=6, revents=POLLIN}]) <1.265994>
4. recvfrom(6, "Hi\n", 8192, 0, NULL, NULL) = 3 <0.000155>

We see that in this case the JVM first invokes the poll syscall with the SO_TIMEOUT (line 2) and calls the recvfrom syscall only after the poll returned that there is data to fetch (line 3) . In case of a half-open connection, the poll syscall would return a timeout after around 4444 ms and give the control back to the JVM. It in turn will raise a to inform your thread about the timeout:

1. poll([{fd=6, events=POLLIN|POLLERR}], 1, 4444 <unfinished ...>
2. <... poll resumed> ) = 0 (Timeout) <4.448941>

On line 2 we see that the poll syscall returns a timeout code after 4.448941 seconds. Your application can handle the timeout anyway it wants - either repeating the call (giving the other side some more time to respond) or closing the connection.

In this article we have seen that reading from a TCP socket without setting the socket timeout might lead to java threads being blocked forever. Although the SO_TIMEOUT cannot be used as a substitute for higher-level protocol timeouts (e.g. the whole HTTP response must be received within 4 seconds) it still allows the JVM to take back control from time to time and decide whether to repeat the read call or give up if a higher-level protocol timeout has been reached. If there is one thing that you should remember from this post, it is that you should always set the socket timeout.