A Different Frame of Mind: jax-ws

This post is primarily here because I am having trouble with a client-side JAX-WS implementation using Weblogic Server 10.3.3's JAX-WS implementation, and I can't seem to find any-one who is having a similar issue. I have talked with Oracle about this issue, and will update this post as information comes about.

I will freely admit that it is very possible this entire post and all of the issues I'm trying to solve, really only exists in my mind. I may be totally wrong here. I really hope I'm not, but then again......

Also, please forgive the lack of color/source/pretty references in this blog.. I am fairly new to blogging, and unfortunately do not have the time to go through and make things pretty. Heck, I barely have enough time to just get the content out here. maybe I'll come back and update this post with the "pretties" in the future.

I am writing an integration webservice which will consume various webservices from a couple different backend systems. I want to be able to parallelize non-dependent service calls and be able to cancel requests that take too long. The need to be able to cancel requests that exceed a specified SLA is paramount, since the requirement is to meet a very hard-sla with either a valid response or an appropriate error message if one or more of our backends are failing to respond or are taking too long. we should definitely NOT hang indefinitely waiting for backend calls.

to accomplish these requirements I am trying to use the Async Polling mode client apis that are defined in the JAX-WS specification (v2.x). I could use the Callback apis, however the orchestration of my services would be overly complex; besides the Polling mode API's seem a better fit for my requirements.

On the surface the client side API's and runtime appear to work as the specification outlines. Generation of client stubs based on a given WSDL which includes the async as well as the standard sync methods is straight forward. Implementing a sample client which utilizes these sync and async API's is also simple, and if you watch the execution of the test client, the output and behavior seems to work as expected: Sync calls work, async calls work, and in the case of async calls, the inherited functionality provided by the Future<> interface that the Async calls seems to behave exactly like my requirements need.

So, what is the problem I am facing?

The problem is related to the requirements:

1) That I be able to cancel an in progress async request
2) That I be able to cancel a request that has yet to actually execute

I do realize that there are semantic problems with canceling a request that has already been sent to the backend system. Typically you can't necessarily tell the back end system "Sorry, stop what you're doing, I'm tired of waiting now". All you can do is move on, and stop waiting. The backend system would either figure out that you've stopped waiting at some point, or it would remain blissfully ignorant and finish processing your request. This is pretty standard behavior, and is exactly what happens when you configure a response timeout on the underlying Socket.

In the case of an in-progress request, I would expect a cancelation to cause the thread that is stuck waiting for the backend to respond to stop waiting (via an interruption) and the client socket to be closed; as described above the backend system would necessarily not be notified that we've moved on.

Now, if for some reason the request hasn't actually made it to the point where it's been sent to the backend system and we attempt to cancel it, then I would expect that whatever else happens, that request would never be sent to the backend. The request would be treated as a no-op.

The problem I am facing is that none of these expectations turn out to be what happens in reality. in reality the behavior appears to be correct from the client API; but if you watch very carefully, in the background, you will notice that requests you think have been canceled are actually executed/sent to the backend system.

to demonstrate this, as I mentioned earlier, I have written a simple hello-world type JAX-WS based service and client.

You can find the entire project on github.com here:

git://github.com/helterscelter/jaxws_async_client_sample.git

or you can browse the source via your browser here:

http://github.com/helterscelter/jaxws_async_client_sample

All of the code in this example should be built using the included ANT build.xml. the only dependency an installation of Oracle Weblogic v10.3.3

to build the sample, edit the buid.properties file to update the wls.home variable with the location you have installed weblogic server under. Then run ant against the "all" target. this will compile all client side artifacts, and the sample service implementation to test the client against.

The sample Service implementation is a very basic service which takes an arbitrary string as input, and returns that string as part of the output. the service attempts to simulate in a very basic way, a service that may return very quickly or which may return after a relatively long time. (it does this by sleeping a random time between 0 and 10 seconds) other than this very simple logic the service is very much unchanged from the automatically generated service stub that the "wsdlc" ant task generates.

The sample client is also fairly simple.

The general flow is:

we initiate a sync call, and show the output,
initiate 10 async calls,
immediately cancel all 10 of the async calls
show that the Result object for each async request thinks that it has been canceled
then we attempt to get a result for each async call

The key here, is that in all cases the perceived output from the above is exactly what the spec says should happen.

see ./sample/output_v1.0.log in the sample project
http://github.com/helterscelter/jaxws_async_client_sample/blob/v1.4.0/sample/output_v1.0.log

I know, I know. So far, I've spent a whole lot of time explaining to you just exactly how the JAX-WS implementation provided by Weblogic does exactly what the spec says it should do.

So you may be asking, What is the problem? Why are you even bothering with this?

The problem starts showing it's self when you look at the error log output from the Weblogic instance that hosts the sample service. If you look there, you'll notice a good number of exceptions which look like this:

<[ServletContext@7313242[app:HelloWorld module:/helloWorld path:/helloWorld spec-version:2.5], request: weblogic.servlet.internal.ServletRequestImpl@ec72d0[

POST /helloWorld HTTP/1.1
Content-type: text/xml;charset="utf-8"
Soapaction: "http://www.example.org/HelloWorld/HelloWorld"
Accept: text/xml, multipart/related, text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
User-Agent: Oracle JAX-WS 2.1.5
Connection: keep-alive
Content-Length: 277

]] Root cause of ServletException.
javax.xml.ws.WebServiceException: com.ctc.wstx.exc.WstxIOException: Connection reset by peer: socket write error
   at com.sun.xml.ws.encoding.StreamSOAPCodec.encode(StreamSOAPCodec.java:118)
   at com.sun.xml.ws.encoding.SOAPBindingCodec.encode(SOAPBindingCodec.java:258)
   at com.sun.xml.ws.transport.http.HttpAdapter.encodePacket(HttpAdapter.java:368)
   at com.sun.xml.ws.transport.http.HttpAdapter.access$100(HttpAdapter.java:102)
   at com.sun.xml.ws.transport.http.HttpAdapter$HttpToolkit.handle(HttpAdapter.java:534)
   Truncated. see log file for complete stacktrace

If the client application is behaving the way we expect it to, then Why do we see these errors in the log? They had to be caused by a request being in progress and then something causing the socket to be closed before the request could be completed. but how can this be when our example client seemingly succeeded in canceling all of our requests, and exited without any unexpected results?

to help figure this out, we enable a java system property in our client application, which causes the JAX-WS client runtime to dump all of it's request and response operations from the wire to stdout:

-Dcom.sun.xml.ws.transport.http.client.HttpTransportPipe.dump="true"

see ./sample/output_v1.1.log
http://github.com/helterscelter/jaxws_async_client_sample/blob/v1.4.0/sample/output_v1.1.log

now when you run the client, you see that the JAX-WS runtime sends all 10 of our Async requests to the backend/sample service. however our client exits before any of these requests can be completed, hence the server reports a bunch of errors when trying to write the reply to the socket.

This still seems like it's expected behavior, doesn't it?

What happens, if instead of exiting immediately after we've validated that all our async requests are canceled, we simulate a longer running process? After all, when this goes into production the backend service calls we're making will be part of a process which is expected to run 24x7. So, lets introduce a 30 second delay at the end of the sample client, and see what happens.

see ./sample/output_v1.2.log
http://github.com/helterscelter/jaxws_async_client_sample/blob/v1.4.0/sample/output_v1.2.log

Interesting, isn't it? Every one of the Async client requests that we thought we canceled actually completes, and a valid response is returned to the client.

it was at this point that I started asking Oracle support questions. initially they came back with the explanation that the behavior I was seeing is caused essentially by a Race condition.

because the first thing my sample client does is to make a sync jax-ws call, the jax-ws runtime takes the opportunity to fully initialize it's infrastructure and open a pool of connections to the backend service. this is intended to make subsequent calls to the service faster because everything it needs to actually make the call would be setup already. Then when I attempt to make my async calls, the async requests are initiated immediately on the already existing connections. once the request actually goes out over the socket, they said that the request couldn't really be canceled.

If I wanted my sample client to behave the way I expect it to behave, I essentially needed to remove the sync call at the beginning of the client, so that when My async calls were made, the Socket connections wouldn't have been setup yet, and the client code would be able to cancel the async requests before the jax-ws runtime could initialize the connections and send the requests to the server.

This seemed like a reasonable thing to try, so I did.

please refer to ./samples/output_v1.3.log
http://github.com/helterscelter/jaxws_async_client_sample/blob/v1.4.0/sample/output_v1.3.log

as you can see, the behavior didn't change much. all of the requests are still sent, and all of the replies are still received.

The final thing I have tried is to attempt and see what is going on under the scenes with the async methods (without digging through tons of source)

to help in figuring out whats going on, I implemented a simple Executor service, which I initialize the HelloWorld_Service() service with. the custom executor basically logs the duration, start and stop time of each Runnable that is passed to it. The executor also utilizes a custom work queue that logs some information about each Runnable that is added or removed from the work queue.

see ./samples/output_v1.4.log
http://github.com/helterscelter/jaxws_async_client_sample/blob/v1.4.0/sample/output_v1.4.log

At this point I have to say something smells foul. I expected to see the com.sun.xml.ws.client.AsyncResponseImpl instances being pushed onto the Executor service's queue. what I did not expect to see was another Async task being pushed into the run queue as a result of the initial AsyncResponseImpl running:

Offer: engine-Oracle JAX-WS 2.1.5: Stub for http://www.example.org/helloWorldfiber--1

java.lang.Thread.getStackTrace(Unknown Source)
test.client.TimingThreadPool$LoggingArrayBlockingQueue.offer(TimingThreadPool.java:92)
java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
com.sun.xml.ws.api.pipe.Engine.addRunnable(Engine.java:76)
com.sun.xml.ws.api.pipe.Fiber.start(Fiber.java:262)
com.sun.xml.ws.client.Stub.processAsync(Stub.java:349)
com.sun.xml.ws.client.sei.SEIStub.doProcessAsync(SEIStub.java:156)
com.sun.xml.ws.client.sei.AsyncMethodHandler$SEIAsyncInvoker.do_run(AsyncMethodHandler.java:220)
com.sun.xml.ws.client.AsyncInvoker.run(AsyncInvoker.java:72)
com.sun.xml.ws.client.AsyncResponseImpl.run(AsyncResponseImpl.java:84)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
java.lang.Thread.run(Unknown Source)

apparently some part of the JAX-WS runtime actually submits a com.sun.xml.ws.api.pipe.Fiber to the run queue which is what actually does the SOAP request to the backend service!

When we cancel the Result<> object that is returned from the Async invocation, that cancel does not prevent these PIPEs from running on the queue and making the request.

Have I finally succeeded demonstrating the bug I've suspected all along? Does my argument make sense?

Even as I write this, I have to admit that I'm not 100% sure. I suppose my next step would be to dive into the JAX-WS code, and see if I can figure out what it's doing?

I'll keep you posted.

Update: 03/04/2011

I should have updated this entry a good while back. it turns out that I'm not completely crazy. after much conversation with Oracle they have issued patch RHEL against Weblogic 10.3.3 to address this issue..

A Different Frame of Mind

Syntax Hilighter

Tuesday, July 13, 2010

Oracle Weblogic 10.3.3 JAX-WS Client