Issue Summary:
EngagementHQ (EHQ) had a 75-minute downtime in Australia and New Zealand on Saturday, 24 November 2018. The issue started at around 12:41 AM AEST and the last reported error was at 01:56 AM AEST.
Services in our other regions - Canada, the UK and the US - were NOT impacted.
The downtime occurred because of errors during low-level maintenance. As with all our work, we scheduled this maintenance well outside of business hours in case we encountered issues in the process.
Root Cause:
As EHQ grows and consumes more and more disk space on our servers, we are constantly adjusting our processes to cater for future growth.
While we were moving some non-critical backend files on our servers and splitting each large file into multiple smaller files, an error occurred that consumed all available temporary storage and prevented necessary processes from running. Rectifying this error took longer than expected, which caused the lengthy downtime.
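To illustrate the failure mode described above, the following is a minimal sketch, not the procedure actually used during the maintenance, of how a file-splitting job could check free space before writing each piece and stop before storage is exhausted. The names `split_file`, `has_headroom`, `InsufficientSpaceError` and the `reserve_bytes` parameter are hypothetical and chosen only for this example.

```python
import os
import shutil


class InsufficientSpaceError(RuntimeError):
    """Raised when writing the next chunk would eat into the reserved space."""


def has_headroom(directory, bytes_needed, reserve_bytes):
    # shutil.disk_usage reports total, used and free bytes for the
    # filesystem that contains `directory`.
    free = shutil.disk_usage(directory).free
    return free - bytes_needed >= reserve_bytes


def split_file(src_path, dest_dir, chunk_bytes, reserve_bytes):
    """Split src_path into numbered chunk files under dest_dir.

    Stops with InsufficientSpaceError instead of filling the disk.
    `reserve_bytes` is a hypothetical safety margin left for other processes.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    with open(src_path, "rb") as src:
        while True:
            chunk = src.read(chunk_bytes)
            if not chunk:
                break
            # Check before every write so the job aborts early.
            if not has_headroom(dest_dir, len(chunk), reserve_bytes):
                raise InsufficientSpaceError(
                    f"stopped after {len(written)} chunks: not enough free space"
                )
            part_path = os.path.join(dest_dir, f"part-{len(written):05d}")
            with open(part_path, "wb") as part:
                part.write(chunk)
            written.append(part_path)
    return written
```

A job written this way fails fast with a clear error while other processes still have room to run, rather than discovering the problem only once storage is already full.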
Corrective and Preventive Measures:
Our server hosts had assured us that the maintenance would have no chance of adversely affecting EHQ sites. However, mistakes in executing the maintenance impacted the sites and brought them down.
Multiple monitoring and alarm systems are in place, and these should have triggered earlier to alert the teams to stop work immediately. However, the alarms failed to trigger, which delayed rectification and ultimately led to the lengthy downtime. We are still investigating why the monitoring failed.