Suppose you’re lucky enough to have a customer who can use more than one datacenter and is asking you to design the changes to the existing WAS environment.
Now you have two datacenters, a network dispatcher, IP sprayers, WAS servers, a clustered application and a problem.
The customer is asking for something important: service availability across disasters and application changes.
How do you manage the WAS environment?
Let’s go step by step: how many cells?
It’s better to use one cell in each datacenter. The higher the isolation, the higher the protection in case of an unplanned outage. There are also network issues to consider, and the number of JVM processes can be too much for a single core group. Tom Alcott described this point of view very well on developerWorks (“Can I run a WebSphere Application Server cell over multiple data centers?”).
Good. Now suppose we have two cells.
As you can see, requests come from the internet through the network switch, the IP sprayers redirect them to the IBM HTTP Servers, and the WAS plug-ins route them to the WAS servers.
In each cell we have two nodes and a horizontal application cluster where the application is installed.
Next question: how should the cells work? Together or not?
In detail, the question is: what’s the topology, active/active or active/passive?
Tom Alcott again explains the difference between these topologies (http://www.ibm.com/developerworks/websphere/techjournal/0707_col_alcott/0707_col_alcott.html).
The common part is data replication. If you don’t replicate the application state and the application data, you can’t get failover at the application level. This means we need a database replicated across the two datacenters (no joke to administer).
If you need the cells to work together in an active/active configuration, you need to replace the network switch with a network dispatcher (or layer 7 switch) so you can add CBR (Content Based Routing) rules based on the cluster CloneIDs.
This way the rules redirect each request to the right cell and the cells work together to load balance the traffic: server affinity ensures we don’t mix application sessions across the cells.
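To make the affinity mechanism concrete, here is a minimal Python sketch of the decision a CloneID-based CBR rule encodes. WAS appends the clone ID of the serving cluster member to the JSESSIONID cookie (session id, then a colon, then the clone ID), and the rule routes follow-up requests to the cell that owns that clone. The clone names and the clone-to-cell mapping below are made up for illustration; a real dispatcher expresses this as routing rules, not code.

```python
# Illustrative sketch of the routing decision a CloneID-based CBR rule makes.
# WAS session cookies look like "0000<session-id>:<clone-id>" (additional
# clone IDs may be appended after further colons on failover).

def target_cell(jsessionid, clone_to_cell, default_cell):
    """Pick the cell that owns the clone named in the cookie."""
    parts = jsessionid.split(':')
    for clone_id in parts[1:]:           # everything after the session id
        cell = clone_to_cell.get(clone_id)
        if cell is not None:
            return cell                  # stick to the cell with affinity
    return default_cell                  # no affinity yet: load balance

# Hypothetical clone-to-cell mapping for our two cells:
CLONES = {'vcellA1': 'cell1', 'vcellA2': 'cell1',
          'vcellB1': 'cell2', 'vcellB2': 'cell2'}

print(target_cell('0000a9f3k2:vcellB1', CLONES, 'cell1'))  # -> cell2
```

A request carrying no clone ID (a new session) falls through to the default, so new traffic can still be balanced across both cells.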
So we have two cells in two datacenters and different levels of failover.
First level: a cluster member or an entire node can fail; plug-in failover will solve the problem.
Second level: an IHS can fail; the IP sprayer will handle it.
Third level: an IP sprayer or the cell 1 datacenter can fail; the network switch will handle it.
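The first level is driven entirely by the plug-in configuration. As a sketch, a plugin-cfg.xml fragment for one cell’s cluster might look like this (server and host names are invented; RetryInterval is the number of seconds the plug-in waits before retrying a member it marked down):

```xml
<ServerCluster Name="AppCluster" LoadBalance="Round Robin" RetryInterval="60">
  <!-- CloneID ties the session cookie to the member that created it -->
  <Server Name="node1_member1" CloneID="vcellA1" WaitForContinue="false">
    <Transport Hostname="node1.example.com" Port="9080" Protocol="http"/>
  </Server>
  <Server Name="node2_member2" CloneID="vcellA2" WaitForContinue="false">
    <Transport Hostname="node2.example.com" Port="9080" Protocol="http"/>
  </Server>
  <PrimaryServers>
    <Server Name="node1_member1"/>
    <Server Name="node2_member2"/>
  </PrimaryServers>
</ServerCluster>
```

If node1_member1 stops answering, the plug-in marks it down and routes its requests to node2_member2 until the retry interval expires and the member answers again.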
Good, no? But what else?
What happens to the user session when we have a failure?
If the failure is at the WAS level and we set up session persistence, nothing happens.
The plug-in simply redirects the request, the new server retrieves the session from memory (memory-to-memory replication) or from the database (database persistence), and replies to the user.
If the failure is at the datacenter level (a real disaster), we need to set up database persistence on both clusters and share the session database across the datacenters. This way, when one cell fails, the other will recover the session state from the session database.
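The recovery path can be sketched in plain Python, with dicts standing in for the in-memory replica and the shared session database (the real WAS session manager does this lookup internally; the names here are illustrative):

```python
# Sketch of the session lookup order after a failover: try the local
# replica first, then fall back to the shared session database.
# local_cache and session_db are stand-ins (plain dicts) for the
# in-memory replication domain and the cross-datacenter session DB.

def get_session(session_id, local_cache, session_db):
    session = local_cache.get(session_id)
    if session is None:
        # Cache miss: the request came from the failed cell, so recover
        # the state from the database shared by both datacenters.
        session = session_db.get(session_id)
        if session is not None:
            local_cache[session_id] = session  # keep it local from now on
    return session

# A request lands on cell 2 after cell 1 died:
shared_db = {'0000a9f3k2': {'user': 'mario', 'cart': ['book']}}
cache = {}
print(get_session('0000a9f3k2', cache, shared_db))
```

The key point is the fallback: without the shared database the surviving cell would have no copy of the session and the user would start from scratch.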
Obviously, to achieve this the application needs to be exactly the same on both clusters.
In the end we found that, to achieve complete failover using two datacenters, we need two cells, a network dispatcher if we want an active/active topology, and specific configurations if we need to recover sessions across cells.