New chat and our 72 hours in hell

Hello Fellow Strategists!

Last Wednesday, we released our brand new chat system. With its 1-on-1 and custom group chat abilities, it has the potential to significantly improve in-game communication. But it is new, with a new interface and a new way of doing things, so it will take some getting used to. Once you experience the power of a custom battle group or a cross-clan admin chat, I hope you will love it!

The upgraded chat uses new technology, so we tested it extensively for many months: on desktop since November via the Throne Room, then on mobile devices since mid-January with our mobile Throne Room release. It was not possible to release the new and old chat side by side, so we gathered as much data as we could and made many improvements and changes, all to ensure that the full release went smoothly. Alas, it was not to be!

The day after the release, we noticed an increase of 100% or more in average run times across all our functions. Here is a screenshot of our health status tool:

That set off alarm bells, and our 72-hour marathon started. The problems were intermittent: no issues for 55 minutes out of an hour, then 5 minutes of serious slowdown (these numbers are just for illustration, not actual figures). The worst kind of problem! You work really hard and fast for 5 minutes, then analyze your often incomplete data for the next 55, hoping you will be ready to catch more data during the next slowdown.

Over time, the slowdowns occurred more frequently and lasted longer. The next day revealed even worse performance degradation:

By that point we were on track, though. We identified bugs in the chat login code. Small bugs, so insignificant they would not even be noticeable to you guys, but there was a tiny horde of them! We fixed them one by one on Thursday and Friday. Performance improved and we were almost 100% back to normal by Friday afternoon, but we still did not know what the real cause of the slowdown was.

That mystery was finally unveiled on Saturday morning as we investigated why some maintenance jobs were running for hours rather than minutes. Oh, the horror that was revealed! In our over-zealous effort to ensure the chat code release was smooth, we had created a number of performance measurement hooks. But we made one tiny little mistake. Each error carried a lot of info, and I mean a lot: 100MB worth of error info per error! That was the real cause of the intermittent problems. Our error management and categorization system was not designed to deal with so many monsters and started to chug! Yup, we lost at tower defense...
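One common guard against oversized error reports is a size cap applied before the report ever reaches the error system. Here is a minimal Python sketch of the idea. It is not the actual fix described in this post. The function name `cap_error_payload`, the 64 KB limit, and the field names are all hypothetical, chosen only for illustration.

```python
import json

# Hypothetical limit: anything larger is replaced by a short summary.
MAX_ERROR_BYTES = 64 * 1024


def cap_error_payload(payload: dict, max_bytes: int = MAX_ERROR_BYTES) -> dict:
    """Return the payload unchanged if it is small enough, else a trimmed summary."""
    # Measure the serialized size, since that is what gets stored and processed.
    encoded = json.dumps(payload, default=str).encode("utf-8")
    if len(encoded) <= max_bytes:
        return payload
    # Keep only a short message and record how big the original was.
    return {
        "message": str(payload.get("message", ""))[:1024],
        "truncated": True,
        "original_bytes": len(encoded),
    }
```

With a cap like this, a runaway diagnostic hook still produces a report, but the error system only ever sees a small, bounded record plus a note on how large the original was.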

You see, we were prepared for the worst! If an error occurred, we knew exactly when and how... and... that killed us... (Editor's Note: The lesson here is to simply live in ignorance, Greg.)

Ashamed as I am to write this, my only consolation is that I can report that we are back to normal now:

If anyone has issues with chat not loading, or anything else, please do message support and we will help you. The in-game contact form is ideal, but if you cannot get to it, email