I mentioned in my post about layered architectures that I would discuss factors that can throttle throughput from the J2EE application server to the database (or persistence layer). To start of with, I will focus of the ethos of designing distributed object architectures, in the words of Martin Fowler:-
". . . Hence, we get to my First Law of Distributed Object Design: Don't distribute your objects!"
Distributing components between J2EE application servers in the J2EE world, involves the use of remote method invocation (RMI) calls to remote interfaces, assuming we are talking about synchronous communication here. Remote method calls to beans deployed to remote servers incurs the following performance overheads:-
- Copy by value
- Network latency
- Every incoming RMI call needs to be serviced by its own Java thread
- Servicing incoming connections from the TCP-IP stack causes context switches
The original text behind this design ethos appears in Martin's book "Patterns Of Enterprise Architecture", there is also a very similar article that Martin has published in Dr Dobbs journal which can be found here. 
If all the beans are deployed to the same application server, this will eliminate all the bottlenecks except for that of copy by value. However, as you will see, the overhead of pass by copy is still significant and will both increase heap utilisation and extend the part of major garbage collection that looks for dangling objects on the heap. On the last project I worked on at the time of writing this, we found that:-
- The 10g database time (not to be confused with wall clock time) was low.
 
 
- Average number of sessions from ADDM reports was low.
 
 
- The CPU on the WebSphere server was maxed out.
 
Profiling the code with a Java profiler revealed that 50% of the CPU time was being expended on a method with a name like com.ibm.CORBA.util.copyObject. This was because the business logic or domain beans were calling the database access beans via remote interfaces, which uses pass by value.
For the benefit of anyone reading this who is not that familiar with the J2EE world, I will elaborate on pass by reference, pass by value and remote and local homes. One of the original design goals being the J2EE standard was to present a framework in which distributed applications could easily be produced, this was based on CORBA and the synchronous communications method was RMI. Prior to the EJB 2.1 standard, it was assumed that when one enterprise bean called another bean, the 'other' bean was always remote, hence a remote interface and copy by value was always used. However, as time went by people realised that despite this, a lot of applications used beans that were deployed to the same J2EE app server and hence suffered this performance penalty for no added benefit. Thus, in the EJB 2.1 standard 'local' interfaces came about. Beans talking to each other using local interfaces use the much more efficient pass by reference method of passing object references rather than whole objects by value. 
Fortunately with WebSphere there is an option available on the Object Request Broker (ORB),    to turn pass by reference 'On', this forces all the beans deployed to the same EJB container using the same class loader to use pass by reference, irrespective of whether they have been coded to use remote interfaces. In the case of the application that I was looking at, this had the rather large impact of:-
- Reducing batch process run times by 50%.
- Reducing CPU consumption on the server hosting the WebSphere application server by 50%.
- Doubling the 10g db time that a batch process spent in the database.
- Increasing the average number of sessions active whilst the batch process was running, again this is an Oracle 10g time metric which is the database time divided by the actual elapsed "walk clock" time.
Setting the class loader policy on for the application server to single also helps. In tuning WebSphere, I also looked at the:-
- server JVM
- client JVM
- ORB with regard to the setting up a JNI reader thread pool and the fragment size
 
- JDBC pool, both the size of the pool, pool utilization and cached statement discard rate.
 
- EJB container, specifically the EJB cache size
 
- Log4j logging level
- number of batch process threads
 
 
- turning off the diagnostic trace
To the best of my knowledge, the only thing I didn't look at was turning the PMI off completely.  For both the client and server JVM, I found that using the JVM otpions as used by IBM in  a 
Solaris SPECjAppserver2004 submission yielded the best results. Refer to 
this link for a blog on which some of the IBMers who carry out this work write, in particular Andrew Spyker, WebSphere performance architect.
However, out of all the things I looked at, I got the biggest bang for my buck from turning pass by reference to 'On' by a long chalk. This comes with a word of caution though, if your software is designed such that beans that receive RMI calls modify objects and the calling bean is not expecting this, wild and erratic application can result. It is for this very reason, or so I believe , that pass by reference is now turned 'Off' by default on WebSphere 6.1.
This brings me on to J2EE clustering and shared nothing architectures. WebSphere network deployment, the clustered version of the WebSphere uses a shared nothing architecture, similar to that discussed in this article here, which again re-iterates the point about distributed architectures and performance:-
"The most scalable architectures are sharing-nothing clusters. In such clusters, each node has the same functionalities, and knows nothing about the existence of other nodes. The load balancer will decide how to dispatch the requests to the back end server instances. It is rare that a load balancer will become a bottleneck, because a load balancer only need to do some very simple jobs, such as distributing requests, health checking, and session sticking. By adding more nodes, the computing ability of the cluster is increasing almost lineally, if the back end database or other information systems are powerful enough." 
With WebSphere network deployment, the same ear file is deployed to each node in the cluster, or cell, to use IBM terminology, workload distribution is managed in a round robin fashion via the workload manager. WebSphere eXtreme scale adds some extra intelligence to this by distributing the work load based on the current available capacity of nodes in the cluster.
What exacerbated my original problem of the massive copy by object overhead was the batch processing design, which is fundamentally this. The problem being that because Oracle is called for every individual item in the workload, there is heavy traffic between the business logic layer and the persistence layer. Accessing a database item by item is really inefficient in terms of:-
1. Hard and soft parsing, in particular I found that setting the batch process thread count above 32 stressing the library cache pin mutex. Tanel Poder has written a good article on the library cache pin mutex and extended use of mutexes in 11g here.
2. Latching
3. Network round trips and network latency
4. Physical and logical reads
5. JDBC calls
Some people get very agitated about batch process designs that process the workload item by item and in principle this results in certain inefficiencies when accessing the database. However, the performance targets for the batch processes where the same as those for the system this new one was replacing, but with 200% of the workload and two years worth of "Database aging", i.e. the database had been artificially aged to give the appearance of the application having been in use for two years. In summary, all batch processes beat their  counter parts in the original application by at least a factor of 100%.
The morals of this story are:-
1. Distributed object architectures are bad for performance
2. Your application needs to be profiled in order to work out where the time is going, in the example I have highlighted, the database was never the bottleneck.
3. A holistic approach to looking at the architecture needs to be taken with regard to achieving good performance and scalability.
I will get around to discussing approaches on alleviating the stresses that item by item oriented batch processing imposes on the Oracle database in a future post.