On October 1, 2020, the Tokyo Stock Exchange (TSE) halted trading for an entire day because of something IT professionals tend to overlook: a piece of hardware.
A crucial data storage and distribution device in the exchange's “Arrowhead” trading system had malfunctioned, and the automatic backup failed to initiate. Arrowhead is the heartbeat of the TSE: it sends and receives commands, routes data and, most importantly, monitors trades. Without it, the exchange was forced to shut down for a full day, its longest sustained outage since 1999.
What happened?
The Arrowhead system is a hardware and software suite developed by Fujitsu that includes two shared disk devices. On the day in question, the primary device, “Number 1 shared disk,” encountered a memory error. When this occurred, the secondary device, “Number 2 shared disk,” should have taken over automatically in a failover procedure – essentially a handshake – to keep processes running as normal. That did not happen. A manual failover would have had to be forced, and that would have required restarting the entire system, which was out of the question because orders, trades and data were already beginning to back up. TSE officials decided to halt trading and resume operations the next day.
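The intended failover behavior can be sketched in a few lines of Python. This is only an illustration, not Fujitsu's or the TSE's actual design: the `SharedDisk` class, its `is_healthy` flag, and the `select_active` function are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SharedDisk:
    """Hypothetical stand-in for one of the two shared disk devices."""
    name: str
    is_healthy: bool = True


class NoHealthyDiskError(RuntimeError):
    """Raised when neither device can serve requests."""


def select_active(primary: SharedDisk, secondary: SharedDisk) -> SharedDisk:
    """Return the device that should serve requests.

    Prefers the primary, fails over to the secondary when the primary
    is unhealthy, and raises if both are down.
    """
    if primary.is_healthy:
        return primary
    if secondary.is_healthy:
        return secondary
    raise NoHealthyDiskError("both shared disks are unavailable")


disk1 = SharedDisk("Number 1 shared disk")
disk2 = SharedDisk("Number 2 shared disk")

disk1.is_healthy = False  # simulate the memory error on the primary
active = select_active(disk1, disk2)
print(active.name)  # Number 2 shared disk
```

In a real system the health signal would come from monitoring rather than a flag set by hand, and the switch itself is the step that has to be exercised regularly to know it works.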
What could have prevented this?
- Testing recovery strategies – Testing recovery strategies and documenting the results improves an organization's ability to recover when time is of the essence. Testing exposes critical recovery gaps so they can be addressed and resolved before a real incident. Running test scenarios better prepares an organization and can greatly reduce the impact of downtime during a real-life disruption.
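One way to make such tests repeatable is to script each drill and record the outcome. The sketch below is a minimal example under stated assumptions: `run_recovery_drill`, `DrillResult`, and `fake_failover` are hypothetical names, and the five-second target is an arbitrary placeholder for whatever recovery time an organization has actually set.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillResult:
    name: str
    succeeded: bool
    duration_seconds: float
    within_target: bool


def run_recovery_drill(
    name: str, recover: Callable[[], bool], target_seconds: float
) -> DrillResult:
    """Run one recovery procedure, time it, and record the outcome."""
    start = time.monotonic()
    try:
        succeeded = bool(recover())
    except Exception:
        succeeded = False  # a crash during recovery counts as a failed drill
    duration = time.monotonic() - start
    within_target = succeeded and duration <= target_seconds
    return DrillResult(name, succeeded, duration, within_target)


def fake_failover() -> bool:
    """Placeholder for a real recovery step (hypothetical)."""
    return True


result = run_recovery_drill("storage failover", fake_failover, target_seconds=5.0)
print(result)
```

Keeping the results from each run gives the documented history the testing step calls for, and a drill that fails or overruns its target points directly at a recovery gap.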
Additional Backup and Recovery Considerations
- Configuring backups – Configuring backup jobs is vital to all types of businesses. Backing up device configurations and data makes organizations resilient to disruptive events. The ability to recover quickly and completely is invaluable.
- Monitoring backups – While configuring backups is critical, what good is a backup job if it fails to complete successfully? Monitoring backup jobs is equally important, as it ensures data is in a healthy state and ready for recovery.
- Restoration testing of backups – A crucial part of successful data backups is testing the ability to restore that data. Restoration testing helps ensure data and systems can be recovered as intended. Businesses should conduct backup restoration testing frequently to confirm that they are prepared for an unscheduled outage or disaster-type event similar to the one the TSE faced.
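As a rough illustration of restoration testing, the Python sketch below restores a backup into a scratch directory and compares checksums. It is a simplified example, not a recommendation of any particular backup product: the file names and the `verify_restore` helper are hypothetical, and a plain file copy stands in for the real backup and restore steps.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original: Path, backup: Path, scratch_dir: Path) -> bool:
    """Restore the backup into scratch_dir and compare it with the original."""
    restored = scratch_dir / backup.name
    shutil.copy2(backup, restored)  # stands in for a real restore step
    return sha256_of(restored) == sha256_of(original)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    original = root / "orders.db"
    original.write_bytes(b"example trade data")
    backup = root / "orders.db.bak"
    shutil.copy2(original, backup)  # the "backup job"
    scratch = root / "restore-test"
    scratch.mkdir()
    restore_ok = verify_restore(original, backup, scratch)
    print("restore verified:", restore_ok)
```

In practice the original may no longer exist when a restore is needed, so a checksum recorded at backup time is usually the better reference. The point is that a restore is only proven once the restored data has actually been checked.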
If you have questions about your backup and recovery strategies or your disaster recovery plan, please connect with us. We would welcome a discussion.