
The Bat-Phone rings at the worst possible time. In the middle of dinner, deep into the weekend, or just as you’ve fallen asleep. And when it does, you know one thing for certain – something is broken...
My most inopportune moment was 30 minutes before a show I was about to play at my favourite rock venue. I had a replacement ready (for the gig), but the show was with a member of the Lumineers, and I just HAD to play.
So… The call came in - a batch hadn’t run for a client. I scrambled to check the system and found that the instance responsible for running the batches was simply off. A quick restart, a SOAP call to trigger the batch service, and... Phew, we were back in business! Crisis averted – and just in time for me to grab my bass guitar and step onto the stage.
But that night, I couldn’t stop thinking about it. The fix had been simple, but the underlying issue nagged at me. What if they had called 30 minutes later? What if I hadn’t been able to restart the instance? I certainly didn’t want to drive to the office on a Sunday afternoon. That’s when I started digging deeper into batch processing and fault tolerance.
At the time, our batch processing relied on static configurations and database flags. If a node went down, the process simply wouldn’t run until someone stepped in. Worse, if a batch failed midway, there was no recovery path: it left data in an inconsistent state, and someone had to manually clean up and reprocess records. I knew there had to be a better way.
I introduced Hazelcast for distributed processing, leveraging leader election to ensure that batch execution wouldn’t be tied to a single instance. If one node went down, another could take over, eliminating single points of failure. I also implemented failover handling so that if a batch failed partway through, it could resume processing from the last successful step rather than restarting from scratch or corrupting data.
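To make that concrete, here is a minimal sketch of the two ideas, assuming Hazelcast 5.x is on the classpath. Every node runs the same code, but only the oldest cluster member (a common Hazelcast leader-election pattern) executes the batch, and a distributed map records the last completed step so a rerun resumes where it left off. The map name, batch identifier, and step logic are illustrative placeholders, not the original production code.

```java
import com.hazelcast.cluster.Member;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

import java.util.List;

public class ResilientBatchRunner {

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Leader election via the oldest-member pattern: the first member in
        // the cluster view is the leader. If it dies, the next-oldest member
        // becomes the leader and picks up the batch on the next trigger.
        Member oldest = hz.getCluster().getMembers().iterator().next();
        if (!oldest.localMember()) {
            return; // not the leader; another instance will run the batch
        }

        // Checkpointing: a distributed map records the last completed step,
        // so a failed or restarted run resumes instead of starting from scratch.
        IMap<String, Integer> checkpoints = hz.getMap("batch-checkpoints"); // hypothetical map name

        String batchId = "nightly-client-batch"; // hypothetical batch identifier
        List<String> steps = List.of("extract", "transform", "load", "notify");

        int startFrom = checkpoints.getOrDefault(batchId, 0);
        for (int i = startFrom; i < steps.size(); i++) {
            runStep(steps.get(i));           // placeholder for the real step logic
            checkpoints.put(batchId, i + 1); // mark this step as done
        }
        checkpoints.remove(batchId);         // batch finished cleanly
    }

    private static void runStep(String step) {
        System.out.println("Running step: " + step);
    }
}
```

Because the checkpoint map lives in the cluster rather than on any one node, whichever member takes over can read the same progress marker and carry on from there.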
What started as a midafternoon firefight ended up reshaping how I thought about software. I wasn’t just thinking about getting tickets done and getting through my week of support; I was thinking about resilience, automation, and reliability. At the time I didn’t have a name for it, but it was the start of my journey into thinking about well-architected frameworks.
Support has a way of doing that. It forces you to see software as more than just code - it’s a living system that real people rely on. When you’re on the receiving end of urgent incidents, you start to internalize what makes software fragile and, more importantly, how to build it to be stronger. You think harder about the impact of failure, and about how to avoid it the next time.

It’s easy to see support as a distraction, something to endure until you can get back to "real" development. But if you embrace it, support can shape you into a better engineer - one who doesn’t just write code, but who builds systems that last.
Yours faithfully, James Murray, Bassline Byte Sage
