Our journey to F5 LTM OneConnect Profile

Let me first set the context by describing our application and its production setup. It is an internal web application used by the support team across multiple countries. The application requires session persistence; without it, our users’ work in progress gets interrupted. Persistence can be a divisive topic, but any discussion for or against it is moot here since we do not have a choice. I personally prefer to design non-persistent applications, but let us not discuss that further. We have a pool of WebLogic servers load balanced behind an F5 LTM virtual server. We use LTM’s Cookie Insert method for session persistence. With Cookie Insert, LTM adds a cookie (BIGipServer*OurPoolName*) for our domain, with the encoded (and, if required, encrypted) IP address and port number of the pool member as its value. Once this cookie is set, subsequent HTTP requests from the browser are routed to the same application server, which ensures session persistence. The load balancer cookie’s expiry is set to session, so it is cleared when the browser is closed at the end of the shift.
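
To make the "encoded IP address and port number" part concrete, here is a minimal Python sketch that decodes an unencrypted BIG-IP persistence cookie value. It assumes the default encoding for an IPv4 pool member (IP octets stored in reversed order as one decimal number, port stored with its two bytes swapped); the function name and the sample value are only for illustration, and encrypted cookies cannot be decoded this way.

```python
import socket
import struct

def decode_bigip_cookie(value):
    """Decode an unencrypted BIG-IP cookie value into (ip, port).

    Assumes the default IPv4 format "<ip>.<port>.0000".
    """
    ip_part, port_part, _ = value.split(".")
    # The IP is a decimal integer whose bytes are the octets in reversed order.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_part)))
    # The port is a decimal integer with its two bytes swapped.
    port = struct.unpack(">H", struct.pack("<H", int(port_part)))[0]
    return ip, port

# Sample value for a hypothetical pool member 10.1.1.100:8080
print(decode_bigip_cookie("1677787402.36895.0000"))
```

This is handy when you are reading request headers in developer tools and want to know which pool member a browser is pinned to.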

During releases, we deploy our new code to a set of servers which are NOT serving traffic (backup servers). We then add those servers to the current LTM pool and disable all other servers, which still have the old code. At this point, the LTM server pool has servers with old code (marked disabled) as well as servers with new code (marked active). If the load balancer cookie exists in the HTTP request, LTM routes that request to the application server named in the cookie, whether that server is active or disabled. If the load balancer cookie is absent, LTM load balances the request to an active server. This arrangement ensures that existing users who are already working on our application do not get interrupted during releases. It also ensures that new users are always load balanced to an active server with the new code. We ask our users to “logout”, “close browser” and “login back” once they are done with their current work. Closing the browser clears the load balancer cookie since it is session based, so the users end up connecting to an active server when they log back in. Once all our users have switched to active servers, we remove the disabled servers from the LTM pool. This cycle repeats for every release.
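
The routing rule described above can be sketched in a few lines of Python. This is only a model of the behavior as described in this post, not LTM's implementation; the member names are hypothetical and "first active member" stands in for the real load balancing algorithm.

```python
def route_request(pool, cookie=None):
    """Pick a pool member for a request.

    pool   -- dict mapping member name to "active" or "disabled"
    cookie -- member named by the persistence cookie, or None if absent
    """
    # A persistence cookie wins, even if the member is disabled,
    # as long as that member is still in the pool.
    if cookie is not None and cookie in pool:
        return cookie
    # No usable cookie: load balance among active members only.
    active = sorted(name for name, state in pool.items() if state == "active")
    if not active:
        raise RuntimeError("no active pool members")
    return active[0]  # stand-in for the real load balancing algorithm

pool = {"old-1": "disabled", "new-1": "active"}
print(route_request(pool, cookie="old-1"))  # existing user stays on old code
print(route_request(pool))                  # new user lands on new code
```

Once "old-1" is removed from the pool, a stale cookie no longer matches any member and the request falls through to an active server.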

Everything was fine until we decided to let our users “logout” and “login back” during releases without closing the browser. Why is closing the browser a big deal? Users do not open our application directly in a browser. They use an integrated desktop application which hosts dozens of different applications they need. Launching the desktop application takes care of opening all of those applications, including ours. It does many more useful things too, but those are outside the scope of this blog. It takes ~5 minutes to launch the desktop application since it has to prepare and launch all the applications hosted within it. We wanted to eliminate this outage time, and it is a big deal for us (~1000 users * 5 minutes).

First Problem – As long as the load balancer cookie exists in the HTTP request, LTM routes it to the same app server even if that server is marked as disabled. So we have to clear the load balancer cookie during logout. F5 provides a powerful tool for manipulating HTTP traffic in the form of iRules, a scripting facility based on Tcl with F5’s own extensions. This problem is easily solvable with an iRule.

F5 LTM iRule to clear load balancer cookie

If the “Always Send Cookie” option is disabled in the Cookie persistence profile, LTM sends the load balancer cookie only when it is missing from the request. Subsequent HTTP responses will not carry this cookie in the response header, so you cannot simply expire it with the “HTTP::cookie expires” command, because the cookie is not always present in the response. You have to insert a cookie in the response header with the same name and path as your load balancer cookie and then set its expiry. If the “Always Send Cookie” option is enabled, LTM sends this cookie in every HTTP response, so there is no need to add it in the iRule; we just need to expire it. To handle both cases, we check whether the cookie is present in the response and add it only if it is missing. The following iRule covers both scenarios:

when HTTP_REQUEST {
	set logout 0

	# Set logout to 1 if the URI contains "LogOut" (the match is case-sensitive)
	if { [HTTP::uri] contains "LogOut" } {
		set logout 1
	}
}
when HTTP_RESPONSE {
	if { $logout == 1 } {
		# Add the cookie to the response header only if it is missing. The load balancer
		# cookie typically has the path "/", but double check this for your setup.
		if { not ( [HTTP::cookie exists "BIGipServerYourPoolName"] ) } {
			HTTP::cookie insert name "BIGipServerYourPoolName" value "DummyValue" path "/"
		}

		# Set the expiry of the load balancer cookie to a past date (1 second after the
		# epoch), which forces the browser to clear the cookie.
		HTTP::cookie expires "BIGipServerYourPoolName" 1 absolute
	}
}

This iRule clears the load balancer cookie irrespective of whether its expiry is set to Session or to a specific time. Note that you can clear any cookie in the browser with this approach, not just the load balancer cookie. As I mentioned earlier, iRules are a very useful tool for playing with data.
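
For reference, here is a small Python sketch of the kind of response header the logout response should end up carrying. It only illustrates the "expire at 1 second after the epoch" idea; the helper name is hypothetical and the exact attribute order and formatting that BIG-IP emits may differ.

```python
from email.utils import formatdate

def expire_cookie_header(name, path="/"):
    """Build a Set-Cookie header that expires the named cookie in the past."""
    # 1 second after the epoch, formatted as an HTTP date in GMT.
    expires = formatdate(1, usegmt=True)
    return f"Set-Cookie: {name}=DummyValue; expires={expires}; path={path}"

print(expire_cookie_header("BIGipServerYourPoolName"))
```

The name and path must match the original cookie, otherwise the browser treats it as a different cookie and keeps the old one.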

Ok perfect. With this iRule in place, we decided to test our application. Our test setup had one disabled server and one active server behind a stage LTM. The test user was initially connected to the disabled server. Logging out and logging in should have switched the user to the active server. But it did not happen. The test user was still routed to the disabled server when logging in. We monitored network traffic and analyzed the request/response headers. The logout response properly instructed the browser to clear the load balancer cookie, and the subsequent login request did not have the load balancer cookie in its header. But LTM was still routing that request to the same disabled server. WTF? (Why This Failed?).

It was totally strange and we suspected almost everything, from the iRule to the browser to the developer tools. We tried different iRule options and nothing worked. Occasionally, a few login requests were routed to the active server, but there was no clear pattern. Everything worked fine if we closed the browser and opened it again. We kept testing to find some pattern. While observing network traffic, we noticed that the request was sent to the active server whenever the browser established a new TCP connection during login. It was routed to the disabled server whenever the browser reused an existing TCP connection (keep-alive connection).

We used the CurrPorts tool to confirm whether TCP keep-alive connections were causing this inconsistent behavior. CurrPorts lets you monitor and close any open TCP connection. After logout, we could see one or more open TCP connections held by the browser. If we explicitly closed all of those TCP connections using CurrPorts and logged back in, the request was always routed to the active server. But if we logged in without closing the keep-alive connections, the request was routed to the disabled server. This confirmed the root cause.

With some help from F5 DevCentral and F5 articles, we found the following statement in an F5 support article.

By default, the BIG-IP system performs load balancing for each TCP connection, rather than for each HTTP request. After the initial TCP connection is load balanced, all HTTP requests seen on the same connection are sent to the same pool member.

This completely explains the behavior we saw in our testing. The subsequent login request on the same TCP connection was routed to the disabled server because that TCP connection had initially been load balanced (during logout or some action before logout) to the disabled server.
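
The effect is easy to reproduce in a toy model. The Python sketch below is my own simplification of the behavior quoted above: the load balancing decision is made once per client TCP connection and then remembered, so a later request on the same connection never gets a fresh decision. Class and member names are hypothetical.

```python
class PerConnectionLB:
    """Toy model: one load balancing decision per client TCP connection."""

    def __init__(self, pool):
        self.pool = pool            # member name -> "active" / "disabled"
        self.conn_member = {}       # client connection id -> chosen member

    def _pick(self, cookie):
        if cookie is not None and cookie in self.pool:
            return cookie           # persistence cookie wins
        active = sorted(n for n, s in self.pool.items() if s == "active")
        return active[0]            # stand-in for the real algorithm

    def handle(self, conn_id, cookie=None):
        # Only the first request on a connection triggers a decision.
        if conn_id not in self.conn_member:
            self.conn_member[conn_id] = self._pick(cookie)
        return self.conn_member[conn_id]

lb = PerConnectionLB({"old-1": "disabled", "new-1": "active"})
print(lb.handle(conn_id=1, cookie="old-1"))  # work before logout -> old-1
print(lb.handle(conn_id=1))                  # login, cookie cleared, same connection -> old-1
print(lb.handle(conn_id=2))                  # login on a new connection -> new-1
```

The second call is exactly our failing test case: the cookie is gone, but the keep-alive connection is still pinned to the disabled server.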

At first I wondered whether this was a bug on F5’s side, but it seems to be by design, given that they have documented the behavior rather than fixing it. But why would F5 design it like this? I have not found an article that explains the reason, but my guess is that it saves the time needed to inspect every request and make a load balancing decision for each one. The saving would be even more significant for HTTPS traffic.

F5 proposes two solutions to this problem.

Solution 1

The first solution is to call LB::detach in an iRule for every HTTP request. This detaches the server-side connection for each request, forcing a new load balancing decision for subsequent requests on keep-alive connections. The overhead is the time required to establish a new TCP connection between LTM and our application server. Since both reside in the same data center, the TCP connection time will typically be in the low double-digit milliseconds, but that still makes it worse than our current production setup. Imagine a page with 4 images, 3 JS files and 2 CSS files: that is 10 requests including the page itself, so the total delay for a single page load can add up to three-digit milliseconds, and that is significant. We decided not to compromise loading time for a solution that is useful only during releases.

when HTTP_REQUEST {
	if { [HTTP::cookie names] contains "BIGipServer" } {

		catch {LB::detach}

	}
}
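
The back-of-the-envelope overhead estimate for this approach can be written out as a tiny Python calculation. The per-connection setup times (10 ms and 20 ms) are assumed values for illustration, and the model assumes one new server-side connection per request with the costs simply added up, which is a worst case since browsers fetch resources in parallel.

```python
def detach_overhead_ms(requests_per_page, connect_ms):
    """Worst-case extra delay if every request pays one TCP connection setup."""
    return requests_per_page * connect_ms

# 1 page + 4 images + 3 JS files + 2 CSS files
requests_per_page = 1 + 4 + 3 + 2

for connect_ms in (10, 20):  # assumed connection setup times
    total = detach_overhead_ms(requests_per_page, connect_ms)
    print(f"{connect_ms} ms per connection -> up to {total} ms per page load")
```

Even at the low end of the assumed range the total reaches 100 ms per page load, which is why we did not want to pay this cost on every request.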

Hey F5, why don’t you provide an option like this? Instead of blindly detaching the server-side connection, make the load balancing decision first and detach only if the load balancing algorithm picks a server other than the one the current TCP connection goes to. LTM can reuse the existing TCP connection if it needs to connect to the same server anyway. That would definitely be better than LB::detach for our problem.

Solution 2

The second solution is to use a ‘OneConnect’ profile. With this solution, LTM maintains a pool of TCP connections to each application server and reuses those connections for requests from any client. It means a few server-side connections can efficiently handle requests from many users, as long as the users are not all making requests at the same time. If any connection in the pool is idle, LTM uses it instead of establishing a new connection. So a spike in client connections does not cause a matching spike in connections on the server side of LTM. Since connections in the pool can be kept idle for a long time, the TCP connection setup time is cut out of most HTTP requests. F5 advertises the OneConnect profile better than I can, so I will not talk about its benefits further. Here are some test results published by F5. This solution is theoretically better than our current setup and definitely a lot better than using LB::detach. So we were convinced to go with the OneConnect profile. Performance evaluation will happen in later phases anyway.
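
The connection reuse idea can be illustrated with a toy Python model of a server-side connection pool. This is a sketch of the general technique only, not of how LTM implements OneConnect; the class and method names are made up for the example.

```python
class ServerSidePool:
    """Toy model of a reusable pool of server-side connections."""

    def __init__(self):
        self.idle = []      # connections currently not carrying a request
        self.opened = 0     # total connections ever opened to the server

    def acquire(self):
        # Reuse an idle connection if one exists, otherwise open a new one.
        if self.idle:
            return self.idle.pop()
        self.opened += 1
        return f"conn-{self.opened}"

    def release(self, conn):
        # Request finished: park the connection for the next client.
        self.idle.append(conn)

pool = ServerSidePool()

# 100 different clients, one request each, never overlapping in time.
for client in range(100):
    conn = pool.acquire()
    pool.release(conn)

print(pool.opened)  # connections opened to the server for 100 clients
```

New connections are opened only when requests overlap in time, so the number of server-side connections follows concurrency rather than the number of users.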

Next, we had to decide on the OneConnect profile configuration values. The values depend entirely on your application, environment, traffic type, and so on. There is no single perfect value. My advice would be to start with values based on your current knowledge of these factors, monitor how the OneConnect profile performs in production with them, and adjust if needed.

There are several articles on F5 DevCentral that explain the significance of these values, so I will quickly go through our setup.

Source Mask – 0.0.0.0 is the most efficient option since any connection in the connection pool can be used for requests originating from any client. 255.255.255.255 is the safest value since a server-side connection in the pool will be reused only for the client for which it was established, so we don’t have to worry about the impact of sharing one TCP connection between multiple clients. Many people suggest this value because of that safety factor. But it does not give that safety if requests come through a proxy server or if you have SNAT configured in LTM. We have a SNAT pool configured on our virtual server. With a SNAT pool, the client IP is mapped to one of the IPs in the SNAT pool and the source mask is applied to that mapped IP. For example, if your SNAT pool has 5 IP addresses, the source mask is always applied to one of those five addresses. So we had already lost the safety factor associated with 255.255.255.255, and hence it is wise to go for the efficient option. We can run specific tests to check that requests from multiple clients sent over the same TCP connection are handled properly by the application server. For example, Person B should not see the profile information of Person A just because the request from Person B is sent over the TCP connection established for Person A. We decided to use 0.0.0.0.
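
The effect of the source mask, with and without SNAT, can be shown with a short Python sketch. It assumes, as described above, that the mask is applied to the address after SNAT; the client addresses, the SNAT address and the fixed client-to-SNAT mapping are all hypothetical.

```python
import ipaddress

def reuse_key(source_ip, mask):
    """Apply a source mask to an IPv4 address; connections whose
    sources produce the same key are eligible to be shared."""
    ip = int(ipaddress.IPv4Address(source_ip))
    m = int(ipaddress.IPv4Address(mask))
    return str(ipaddress.IPv4Address(ip & m))

# Two different clients (hypothetical addresses)
client_a, client_b = "10.20.1.15", "10.20.7.99"

# Without SNAT, 255.255.255.255 keeps the clients apart.
print(reuse_key(client_a, "255.255.255.255"), reuse_key(client_b, "255.255.255.255"))

# With SNAT, both clients may be mapped to the same SNAT pool address
# (illustrative fixed mapping), so the mask no longer separates them.
snat = {client_a: "192.0.2.10", client_b: "192.0.2.10"}
print(reuse_key(snat[client_a], "255.255.255.255"), reuse_key(snat[client_b], "255.255.255.255"))

# 0.0.0.0 puts every source in one shared group.
print(reuse_key(client_a, "0.0.0.0"), reuse_key(client_b, "0.0.0.0"))
```

With SNAT in place the second and third cases both let the two clients share connections, which is why the strict mask bought us nothing.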

Maximum Size – This is the maximum number of TCP connections LTM keeps in the connection pool. Please note that this value applies to the whole server pool and not to a single server (thanks to clarifications from Vijay Emarose and F5 team here). We are planning to set it to 12000, based on the number of TCP connections established during peak hours with our current setup.

Maximum Age – 8 hours. A TCP connection is recycled once it exceeds this age.

Maximum Reuse – 10000. Keep in mind that a single page load may trigger multiple HTTP requests (for images, JS, CSS and so on). A low value forces the TCP connection to be recycled often (well before the 8 hours set in Maximum Age).

Idle Timeout Override – This is the time for which an idle TCP connection is kept in the connection pool. A connection is recycled if it stays idle for longer than this. If your application has varying traffic (peak traffic during certain hours), this value determines how quickly your connection pool drains after the peak. Keep it in the range of minutes. We set it to 15 minutes.

Limit Type – Idle
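
To see how these limits interact, here is a simplified Python model of our chosen values. It reflects only my reading of the settings as described above, not BIG-IP's actual implementation, and the names are hypothetical. The 10 requests per page figure is borrowed from the page example under Solution 1.

```python
from dataclasses import dataclass

@dataclass
class OneConnectSettings:
    max_size: int = 12000             # connections kept in the pool
    max_age_s: int = 8 * 3600         # Maximum Age: 8 hours
    max_reuse: int = 10000            # Maximum Reuse
    idle_timeout_s: int = 15 * 60     # Idle Timeout Override: 15 minutes

def should_recycle(settings, age_s, reuse_count, idle_s):
    """True if a pooled connection has hit any of the three limits."""
    return (
        age_s > settings.max_age_s
        or reuse_count >= settings.max_reuse
        or idle_s > settings.idle_timeout_s
    )

def page_loads_before_recycle(settings, requests_per_page):
    """How many page loads one connection can serve before Maximum Reuse hits."""
    return settings.max_reuse // requests_per_page

s = OneConnectSettings()
print(should_recycle(s, age_s=3600, reuse_count=500, idle_s=60))    # still usable
print(should_recycle(s, age_s=3600, reuse_count=500, idle_s=1000))  # idle too long
print(page_loads_before_recycle(s, requests_per_page=10))
```

The last line shows why a low Maximum Reuse value bites sooner than it looks: the limit counts requests, not page loads.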

With these values, we started testing our application with the OneConnect profile for the first time. We found a way to monitor OneConnect behavior (such as the number of open connections in the pool) using OneConnect statistics. We launched our application, watched the connection count in LTM and waited for the connections to close after 15 minutes of idle time. Instead, the connection count in the pool dropped to zero within a minute. One more problem to tackle.

A quick search on DevCentral turned up an article describing a bug in OneConnect statistics. To confirm whether this was a real issue or just a bug in the statistics, we ran netstat on the app server and monitored the TCP connections from LTM. The connections were dropping after a minute, which confirmed that this was a real problem to be fixed. A TCP connection can be closed by either party. We checked our application server first. tcp_keepalive_time on the Linux machine was set to 2 hours, so we suspected a problem on the LTM side. We checked the TCP profile, the HTTP profile and so on, and everything seemed perfect.

But wait, we are running WebLogic Server on top of Linux, and the TCP connections are established to the port on which WebLogic is listening. So, is it possible that WebLogic is controlling this behavior? Exploring the protocol settings in the WebLogic Admin console showed a configuration called “HTTPS Duration”, which determines how long WebLogic waits before closing an inactive HTTPS connection. It was set to 65 seconds, which is in line with the behavior we observed in our testing. The maximum value we can set for “HTTPS Duration” is 360 seconds (6 minutes), even though we wanted 15 minutes. Anyway, we have no other choice and are happy with 6 minutes as well. With this change, OneConnect statistics and netstat confirmed that an idle connection is closed only after 6 minutes. Finally a green signal on our way!!!
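
The lesson here is that an idle connection lives only as long as the shorter of the two timeouts, since either side can close it. A minimal Python sketch of that reasoning, using the numbers from our setup (the function name is made up):

```python
def effective_idle_lifetime_s(ltm_idle_timeout_s, server_idle_timeout_s):
    """An idle connection is closed by whichever side times out first."""
    return min(ltm_idle_timeout_s, server_idle_timeout_s)

LTM_IDLE_TIMEOUT_S = 15 * 60  # OneConnect Idle Timeout Override

# Before the change: WebLogic "HTTPS Duration" of 65 seconds
print(effective_idle_lifetime_s(LTM_IDLE_TIMEOUT_S, 65))

# After the change: WebLogic maximum of 360 seconds
print(effective_idle_lifetime_s(LTM_IDLE_TIMEOUT_S, 360))
```

So with the WebLogic maximum in place, the 15 minute value on the LTM side never actually comes into play for idle connections.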

We are NOT systems administrators or network, infrastructure or operations engineers. We are application owners who love to learn new things and solve any problem that comes our way. This has been a great learning experience, and we are now in a position to guide others in using OneConnect. If it improves our overall application performance, we are going to advocate this solution to other teams. Why not???

There is still some way to go. We have to test this solution thoroughly with different use cases, as I hinted above. Of course, performance testing is very important. I will share anything interesting we learn during those tests.

Please do share your opinion in the Comments section below.

 

