Published: October 2, 2013
Quick note, I have no personal or professional connection to anyone at HHS, QSSI, or really anyone in government. I'm just someone curious about how large applications are put together.
It's a bit bizarre to watch national news play out firsthand, and even more bizarre when the situation is one you deal with every day. Like, apparently, 4.7 million other people, I decided to try out the national health care exchange, and like many of them I was greeted with a message saying the system was under heavy load and unavailable. Most notably, the security questions rarely loaded properly. After checking back a few hours later and seeing the system was still unavailable, I started to do some digging.
I pulled up the web inspector in my browser and started to look for the backend calls that were failing. I found the security questions call that kept failing, and luckily for me, server-side debugging was (and still is) turned on. Here is part of an actual error:
java.net.SocketTimeoutException: Read timed out
    java.net.SocketInputStream.socketRead0(Native Method)
    java.net.SocketInputStream.read(SocketInputStream.java:129)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:695)
    sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
    sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
    java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
    org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.handleResponseInternal(HTTPConduit.java:1542)
    org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.handleResponse(HTTPConduit.java:1494)
    org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.close(HTTPConduit.java:1402)
    org.apache.cxf.transport.AbstractConduit.close(AbstractConduit.java:56)
    org.apache.cxf.transport.http.HTTPConduit.close(HTTPConduit.java:649)
    org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:62)
    org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:263)
    org.apache.cxf.endpoint.ClientImpl.doInvoke(ClientImpl.java:533)
    org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:463)
    org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:366)
    org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:319)
    org.apache.cxf.frontend.ClientProxy.invokeSync(ClientProxy.java:88)
    org.apache.cxf.jaxws.JaxWsClientProxy.invoke(JaxWsClientProxy.java:134)
    $Proxy3741.fetchSecurityQuestions(Unknown Source)
    gov.hhs.cms.eidm.ws.proxy.client.BaseEidmProxyServiceClient.fetchSecurityQuestions(BaseEidmProxyServiceClient.java:201)
    gov.hhs.cms.ffe.ee.rest.MyAccountEIDMUnsecuredIntegrationImpl.fetchAllSecurityQuestions(MyAccountEIDMUnsecuredIntegrationImpl.java:963)
This stack trace is pretty revealing and actually quite typical. Reading it from the bottom up, the front-end application servers are making a web service call (through Apache CXF) to the EIDM (Enterprise Identity Management) proxy to get the security questions, and the connection is timing out while waiting for a response. This explains quite a few of the issues plaguing the site. Essentially, the user data store is overloaded.
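To make the failure mode concrete, here is a minimal sketch of how a read timeout like the one in the trace comes about. It is not the exchange's code: the class name, the `/questions` path, and the timeout values are made up for illustration, and a local socket that never answers stands in for an overloaded backend.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;

/** Hypothetical demo: every name and number here is an assumption for illustration. */
final class ReadTimeoutDemo {

    /** Returns true if the HTTP call gave up with a read timeout. */
    static boolean timesOut(int readTimeoutMillis) {
        // A listening socket that accepts connections at the OS level but
        // never sends a response, much like a backend that is too busy to reply.
        try (ServerSocket stalled = new ServerSocket(0)) {
            URI uri = URI.create("http://127.0.0.1:" + stalled.getLocalPort() + "/questions");
            HttpURLConnection conn =
                    (HttpURLConnection) uri.toURL().openConnection(Proxy.NO_PROXY);
            conn.setConnectTimeout(1000);
            conn.setReadTimeout(readTimeoutMillis);
            try {
                conn.getResponseCode(); // blocks until a response arrives or the timeout fires
                return false;
            } catch (SocketTimeoutException e) {
                return true; // same exception type as the top of the trace
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller connects fine and sends its request; it is the wait for the reply that runs out of time, which is why the trace bottoms out in `socketRead0`.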
Above is my guess at the architecture of the health care exchange based on the interaction I've had with it so far. That is to say, I've been able to sign up, but still not successfully sign in. There are likely many more sub-systems powering the health care exchanges, but to explain the current state of the site, only the ones displayed above are really necessary.
The user connects to the application server to sign up, and the application server checks how many sessions it has available. If it has session capacity, it sends the user to the sign-up page; if it doesn't, it serves that wonderful "at capacity" page and tells the browser to try again in 30 seconds. This continues until the user is granted a session. Once the session is granted, the user begins the sign-up process. In the background, the webpage makes a second call out to the application server to get the possible security questions.
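The admission step described above could be sketched roughly like this. This is my guess at the shape of the logic, not the site's actual implementation; `SessionGate`, `Decision`, and the session limit are hypothetical names and values, and only the 30 second retry interval comes from what I saw in the browser.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical admission gate; names and limits are assumptions for illustration. */
final class SessionGate {

    /** Outcome of one attempt to get a session. */
    enum Decision { ADMITTED, AT_CAPACITY }

    /** Retry interval the "at capacity" page hands to the browser. */
    static final int RETRY_AFTER_SECONDS = 30;

    private final Semaphore slots;

    SessionGate(int maxSessions) {
        this.slots = new Semaphore(maxSessions);
    }

    /** Non-blocking: either claims a session slot or reports that the waiting page should be shown. */
    Decision tryAdmit() {
        return slots.tryAcquire() ? Decision.ADMITTED : Decision.AT_CAPACITY;
    }

    /** Called when a session ends or expires, freeing the slot for the next user. */
    void release() {
        slots.release();
    }
}
```

A request handler would call `tryAdmit()` once per incoming user, render the sign-up page on `ADMITTED`, and otherwise render the waiting page with a refresh after `RETRY_AFTER_SECONDS`. Lowering the number passed to the constructor is the "fewer sessions" knob.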
Now, I want to stop here and say that the entire process up until this point seems to be working perfectly. In my personal opinion, this is a great way to handle large swings in traffic without showing the user a blank error page. It is after this point, however, that error messages like the one above start to show up. The calls from the application servers to the read proxy servers are more than those proxy servers can handle.
How are they trying to solve this? It appears that so far the site has tried to fix the problem by decreasing the number of sessions allowed on their application servers, but from my continuing experience on the site that doesn't appear to be helping. It really looks like there needs to be a fix made on the EIDM side to allow for greater capacity. This could mean more aggressive prefetching of data that doesn't change often from the databases to the proxy servers, sharding of basic user data to allow that data to be stored in a faster cache, or simply adding more read replicas of the database.
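As one illustration of the first idea, keeping rarely changing data such as the list of security questions close to the caller, here is a minimal time-based cache sketch. It assumes the question list is the same for every user and can be slightly stale; `CachedQuestions` and its parameters are hypothetical, and the `backend` supplier stands in for whatever the real EIDM call is.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical read-through cache; not based on the exchange's actual code. */
final class CachedQuestions {

    private final Supplier<List<String>> backend; // the slow call, e.g. to the identity service
    private final long ttlNanos;                  // how long a fetched list is considered fresh
    private List<String> cached;                  // null until the first successful fetch
    private long fetchedAtNanos;

    CachedQuestions(Supplier<List<String>> backend, long ttlMillis) {
        this.backend = backend;
        this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMillis);
    }

    /** Returns the cached list, calling the backend only when the cache is empty or expired. */
    synchronized List<String> get() {
        long now = System.nanoTime();
        if (cached == null || now - fetchedAtNanos >= ttlNanos) {
            cached = backend.get();
            fetchedAtNanos = now;
        }
        return cached;
    }
}
```

With something like this in front of the lookup, thousands of sign-up pages would translate into one backend call per expiry window instead of one per page view. It does nothing for per-user data, which is where sharding or read replicas would come in.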
All of this, however, is speculation from a random end user of the site and a designer of large web applications. My final words are to wish everyone working on this project at HHS and QSSI (see their HHS EIDM contract press release) good luck.