WAS failover using different datacenters

Suppose you’re lucky enough that your customer can use more than one datacenter, and they ask you to design the changes to the existing WAS environment.

Now you have two datacenters, a network dispatcher, IP sprayers, WAS servers, a clustered application and a problem.

The customer is asking for one important thing: service availability across disasters and application changes.

How should we manage the WAS environment?

Let’s go step by step: how many cells?

It’s better to use one cell in every datacenter. The higher the isolation, the higher the protection in case of an unplanned outage. There are also network issues to consider, and the JVM processes can become too many for a single core group. Tom Alcott described this point of view very well on developerWorks (“Can I run a WebSphere Application Server cell over multiple data centers?”).

Good. Suppose now that we have two cells.

As you can see, requests come from the internet through the network switch, the IP sprayers redirect them to the IBM HTTP Servers, and the WAS plug-ins route them to the WAS servers.

In every cell we have two nodes and a horizontal application cluster where the application is installed.

Next question: how should the cells work? Together or not?

In detail, the question is: what’s the topology? Active/active or active/passive?

Tom Alcott again explains the difference between these topologies very well (http://www.ibm.com/developerworks/websphere/techjournal/0707_col_alcott/0707_col_alcott.html).

The common part is data replication. If you don’t replicate the application state and the application data, you can’t get failover at the application level. This means we need a database replicated across the two datacenters (no joke to administer).

If you need the cells to work together in an active/active configuration, you need to replace the network switch with a network dispatcher (or layer 7 switch) so you can add CBR (Content Based Routing) rules based on the cluster clone IDs.

This way the rules redirect each request to the right cell, and the cells work together to load balance the traffic: server affinity ensures we don’t mix application sessions across the cells.
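For reference, the clone IDs the CBR rules key on are the ones the plug-in generator writes into plugin-cfg.xml and appends to the session cookie. A rough sketch (hostnames, cluster names and clone IDs below are invented):

```xml
<!-- Excerpt in the style of a generated plugin-cfg.xml; all values invented -->
<ServerCluster Name="AppCluster_dc1" LoadBalance="Round Robin">
  <Server Name="node1_member1" CloneID="14mvb7a2k">
    <Transport Hostname="node1.dc1.example.com" Port="9080" Protocol="http"/>
  </Server>
  <Server Name="node2_member2" CloneID="14mvb7c31">
    <Transport Hostname="node2.dc1.example.com" Port="9080" Protocol="http"/>
  </Server>
</ServerCluster>
```

The plug-in appends the clone ID to the session cookie (e.g. `JSESSIONID=0000A2MB4IJozU_VM8:14mvb7a2k`), so a CBR rule on the dispatcher can match that suffix and keep each request in the datacenter that owns the session.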

We then have two cells on two datacenters and different levels of failover.

First level: a cluster member or an entire node fails. The plug-in failover will solve the problem.

Second level: an IHS fails. The IP sprayer will handle it.

Third level: an IP sprayer or the entire cell 1 datacenter fails. The network switch will handle it.

Good, no? But what else?

What happens to the user session when we have a failure?

If the failure is at the WAS level and we have set up session persistence, nothing happens.

The plug-in simply redirects the request, the session is recovered from memory (using memory-to-memory replication) or from the database (database persistence), and the reply reaches the user.
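Both persistence options serialize the session attributes, so every object you put in the HttpSession must implement java.io.Serializable (and so must everything it references). A minimal sketch (class names are my own) that round-trips an attribute the way the session manager would when writing it to a replica JVM or the session database:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical session attribute: must be Serializable, or session
// persistence/replication will fail at runtime.
class CartItem implements Serializable {
    private static final long serialVersionUID = 1L;
    final String sku;
    final int quantity;
    CartItem(String sku, int quantity) { this.sku = sku; this.quantity = quantity; }
}

public class SessionRoundTrip {
    // Serialize then deserialize, as the session manager does when the
    // attribute is persisted and later recovered on another server.
    static Object roundTrip(Object attribute) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(attribute);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        CartItem restored = (CartItem) roundTrip(new CartItem("SKU-42", 3));
        System.out.println(restored.sku + " x" + restored.quantity);
    }
}
```

If an attribute is not serializable, the failover works for routing but the recovered session arrives without that attribute, which is much harder to debug in production.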

If the failure is at the datacenter level (a real disaster), we need to set up database persistence on both clusters and share the session database across the datacenters. This way, when a cell fails, the other will recover the session state from the session database.

Obviously, to achieve this goal the application needs to be exactly the same on both clusters.
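In the same spirit, the Servlet specification expects a web module whose sessions can move between JVMs to be marked as distributable in its deployment descriptor. A minimal web.xml excerpt (Servlet 2.4 style, for illustration only):

```xml
<!-- Marking the module distributable declares that its sessions
     may be replicated or persisted across JVMs -->
<web-app xmlns="http://java.sun.com/xml/ns/j2ee" version="2.4">
    <distributable/>
</web-app>
```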

In summary, to achieve complete failover across two datacenters we need to create two cells, use a network dispatcher if we want an active/active topology, and apply specific configurations if we need to recover sessions across cells.



5 comments on “WAS failover using different datacenters”
  1. Hi Marco, this is an interesting configuration.
    Can you describe the replication model that you have built for the JMS engines?
    Can you preserve the correct message queue order between different sites?
    Thanks in advance for your reply.

  2. enzo says:

    Well done post, Marco!
    I’m not on the product yet, but I will get there soon! Keep an eye on this new blog, born at the end of 2008!

  3. Ciao Marco, have you evaluated the use of “eXtreme Scale” (aka XD data grid) for session replication? I know it’s well suited for WAN scenarios.

  4. mfabbri says:

    Ciao Enzo!
    Thanks for the compliments: if you need ideas or material about multi-cell management, email me and I will try to give you another point of view.

  5. mfabbri says:

    Ciao Diego!
    Thanks for the reply! Your comment is the same one Tom Alcott made to me! We exchanged some emails and he suggested using eXtreme Scale to “truly” manage the session replication. I’m in contact with two XD gurus in India and I’ll let you know if there is news from a project.
