Dealing with Failure

When you rely on third-party systems for your site to work, prepare for the unexpected.

So, it was that time of the week when I handle accounting stuff, and I noticed that my online accounting package had posted a notice about its IRS integrations being down. That's not a big deal. When you write a web-based system, you have to expect third-party integrations to be unavailable some of the time.

However, when I went into the tax payment section, it was a whole different story.

I was greeted with slow load times, and then a stack trace!

This violates two principles of software design:

1) Never show customers stack traces

Not only was the stack trace shown, but they also provided me special debug variables for the session.

I'm guessing somebody didn't set up a flag for disabling these in production, but it is (evidently) a Java J2EE-based platform, so I'm not sure who missed that.

What should you do when building an app? Be sure to test the failure paths in your platform, and ensure that you aren't leaking any customer or internal server state.
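Here's a minimal sketch of that idea in Python (not the accounting package's actual code, which I haven't seen). The names `render_safely` and `tax_payments_page` are made up for illustration. The full traceback goes to the server log under a short reference ID, and the customer only ever sees a generic message.

```python
import logging
import uuid

logger = logging.getLogger("app.errors")


def render_safely(handler, *args, debug=False):
    """Run a page handler; on failure return a generic error body.

    Hypothetical helper: the traceback is logged server-side with a
    reference ID that the customer can quote to support.
    """
    try:
        return 200, handler(*args)
    except Exception:
        ref = uuid.uuid4().hex[:8]
        # logger.exception records the traceback in the server log only.
        logger.exception("unhandled error ref=%s", ref)
        if debug:
            raise  # re-raise for developers; never enable in production
        return 500, f"Something went wrong on our end. Reference: {ref}"


def tax_payments_page(session):
    # Stand-in for a page that blows up while rendering.
    raise MemoryError("simulated failure")


status, body = render_safely(
    tax_payments_page, {"user": "alice", "token": "secret"}
)
```

The important part is that `debug` defaults to off, so forgetting a flag fails safe instead of leaking session state.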

2) Prepare for the worst

The stack trace showed an Out Of Memory error. My guess is that there are no timeouts on the IRS integrations, so requests take longer and longer to respond and pile up until memory runs out.
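To make the timeout idea concrete, here's a rough sketch using Python's standard `concurrent.futures` module. The helper `call_with_timeout` and the function `fetch_irs_status` are hypothetical names of mine, and the sleep stands in for a hung integration.

```python
import concurrent.futures
import time

# Small shared pool; the size here is an arbitrary assumption.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Wait at most timeout_s seconds for fn(); otherwise return fallback."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the call has not started yet; a running
        # call keeps going in the background until it finishes.
        future.cancel()
        return fallback
    except Exception:
        # The integration failed outright; degrade the same way.
        return fallback


def fetch_irs_status():
    # Hypothetical slow third-party call.
    time.sleep(0.5)
    return "ok"


result = call_with_timeout(fetch_irs_status, 0.05, fallback=None)
```

Note that this only bounds how long the caller waits. The slow call still occupies a worker thread, so in a real system you would also want a timeout on the network client itself.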

There are dozens of articles on ways to reduce these kinds of failures, but here are a few ideas to point you in the right direction.

  • Async queues for any third-party integration
  • Separate pool of workers to free up web frontend
  • Polling on the website, rather than waiting for data on a long connection
  • Gracefully degrading the experience when data cannot be received quickly
  • Constant amount of work for success or failure states
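A few of the ideas above (a queue, a separate worker, and polling) can be sketched together in a handful of lines. This is an in-process toy with made-up names (`submit`, `poll`, `worker`); a real deployment would more likely use a proper job queue and separate worker processes.

```python
import queue
import threading
import uuid

jobs = {}  # job_id -> {"state": ..., "result": ...}
jobs_lock = threading.Lock()
# Bounded, so a backlog of slow jobs cannot grow without limit.
work_queue = queue.Queue(maxsize=100)


def submit(task):
    """Called from the web request: enqueue the work and return at once."""
    job_id = uuid.uuid4().hex
    with jobs_lock:
        jobs[job_id] = {"state": "pending", "result": None}
    try:
        work_queue.put_nowait((job_id, task))
    except queue.Full:
        # Degrade gracefully: refuse the job rather than block the page.
        with jobs_lock:
            jobs[job_id]["state"] = "failed"
    return job_id


def poll(job_id):
    """Called by the page on a timer instead of holding a long connection."""
    with jobs_lock:
        return dict(jobs[job_id])


def worker():
    """Runs apart from the web frontend and talks to the third party."""
    while True:
        job_id, task = work_queue.get()
        try:
            result, state = task(), "done"
        except Exception:
            result, state = None, "failed"
        with jobs_lock:
            jobs[job_id].update(state=state, result=result)
        work_queue.task_done()


threading.Thread(target=worker, daemon=True).start()

ok_id = submit(lambda: "payment history")
bad_id = submit(lambda: 1 / 0)  # simulated integration failure
work_queue.join()  # only for this demo; a real page would keep polling
```

The web request does the same small amount of work whether the integration is up or down: it enqueues a job and returns an ID.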

What did it look like?

In case you are curious, here's what the relevant part of the site displayed to me. I sanitized it and removed the formatting and private details, so don't get too hack happy.

Happy developing!

Posted by Marshall on 2016-01-08.
