Case Study: Disaster Recovery
Disaster recovery and contingency planning are a concern for every business, no matter how large or small. Successful disaster recovery requires forward planning and systems configured to provide protection.
Testing and monitoring of the plan are vital to its success.
About the Client
The client in this case study is a major high street grocer. Cyberdan Ltd look after their Oracle systems, which are used for warehouse operations to keep stores stocked. There are three main warehouses, each running its own third-party warehouse application. Each Live environment has a failover environment based in one of the other warehouses and utilises Oracle Data Guard for resilience. Operations are 24/7.
Initial Findings
During an initial check of the environment it was found that one of the failover sites had been out of sync for several weeks. All three sites were using a manual style of Data Guard, in which redo logs were transferred to the failover site after being archived and then applied to the failover database. If the archival process was interrupted, the missing logs had to be transferred by hand and registered with the failover database before being applied manually to bring it back into sync.
The archived logs were kept for several days, but in this case that was not long enough to re-sync the Live and failover databases. In the past the whole failover database had required rebuilding to bring it back into sync, which obviously took time and resources. The setup also exposed the client to potential data loss should archived logs fail to be transferred, and it did not cover the live (online) redo logs at all. A live transactional log can take up to an hour to fill and archive, so the business was exposed to up to an hour of data loss.
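For context, catching up in the manual arrangement meant copying each missing archived log to the failover host and then registering and applying it by hand, along the lines of the following sketch (the file path is illustrative):

    -- On the failover database, after copying the missing archived log across by hand
    ALTER DATABASE REGISTER LOGFILE '/arch/WHSE1/arch_1_12345.arc';
    -- Restart managed recovery so the registered log is applied
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;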
Remedial Work
The out-of-sync database was put back in sync by taking an incremental backup of the Live database from the point (SCN) at which the two databases diverged. This reduced the rebuild time from days to 90 minutes and reduced the size of the backup to be transferred over the network, thereby minimising the impact on other business systems.
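For readers interested in the mechanics, this is Oracle's standard technique of rolling a standby forward with an RMAN incremental backup taken from the standby's last SCN. A simplified sketch follows; the SCN, paths and surrounding steps (such as refreshing the standby control file) are illustrative and vary by environment:

    -- On the failover (standby) database: find the SCN it has reached
    SELECT current_scn FROM v$database;

    # On the Live database, in RMAN: back up only the changes made since that SCN
    BACKUP INCREMENTAL FROM SCN 1234567890 DATABASE FORMAT '/backup/resync/standby_%U';

    # Copy the backup pieces to the failover host, then in RMAN on the failover database:
    CATALOG START WITH '/backup/resync/';
    RECOVER DATABASE NOREDO;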
The databases were moved from the manual style of Data Guard to the Data Guard broker. The broker manages the movement of logs using Oracle Net services (TNS) rather than file-system transfers. The Live database ships redo to the failover site, and in the event of a network outage the failover site can request any missing logs from the Live system; between these two processes the logs remain consistent.
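Moving to the broker is done through Oracle's DGMGRL command-line tool. A minimal sketch of the sort of configuration involved, assuming illustrative database names whse1_live (Live) and whse1_stby (failover) and that the broker process has already been started on both databases:

    DGMGRL> CREATE CONFIGURATION whse1_dg AS
              PRIMARY DATABASE IS whse1_live
              CONNECT IDENTIFIER IS whse1_live;
    DGMGRL> ADD DATABASE whse1_stby AS
              CONNECT IDENTIFIER IS whse1_stby
              MAINTAINED AS PHYSICAL;
    DGMGRL> ENABLE CONFIGURATION;
    DGMGRL> SHOW CONFIGURATION;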
To address the issue of only archived logs being transmitted to the failover site, and the potential loss of data held in the live transaction logs, Oracle standby redo logs were set up and utilised. Current changes in the live logs are now sent immediately to the failover database and applied in real time.
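A brief sketch of what this involves, with illustrative file paths and a single redo thread: standby redo log groups are added on both databases (conventionally one more group than the online redo logs, at the same size), and the failover database applies redo from them as it arrives:

    -- On both the Live and failover databases: add standby redo log groups
    ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 ('/u01/oradata/WHSE1/srl01.log') SIZE 512M;
    ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 ('/u01/oradata/WHSE1/srl02.log') SIZE 512M;

    -- On the failover database: apply redo from the standby logs in real time
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;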
Finally, monitoring was set up to check the status of the Live and failover logs. Any discrepancy raises an email alert for Cyberdan to investigate.
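The checks behind those alerts can be as simple as querying the standard Data Guard views on the failover database and comparing the figures against a threshold; a sketch of the sort of queries involved:

    -- How far behind are redo transport and redo apply?
    SELECT name, value FROM v$dataguard_stats WHERE name IN ('transport lag', 'apply lag');

    -- Highest log sequence received versus highest applied, per thread
    SELECT thread#, applied, MAX(sequence#) AS max_seq
      FROM v$archived_log
     GROUP BY thread#, applied;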
Disaster! or Not!
Several months after the remedial work had been completed, a third party replaced a hardware component in the client's Live server. Within 30 minutes the server had gone down, with smoke coming out of the chassis, and it would not start.
Cyberdan switched database operations to the failover database. The time from the server going up in smoke to the failover database becoming Live was 40 minutes, including the diagnostics on the Live server needed to determine that a failover was required.
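With the broker in place, the failover itself comes down to a single broker command issued against the failover database (the database name here is illustrative):

    DGMGRL> CONNECT sys@whse1_stby
    DGMGRL> FAILOVER TO whse1_stby;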
Because Cyberdan hold monitoring data and regularly check Data Guard status during health checks, they had confidence that the failover database was up to date, and there was no data loss.
Operations remained on the failover server for six weeks until the previous Live server had been repaired and a suitable time slot became available.
The previous Live server was re-synced from the failover and a switchover back to the Live server was performed out of hours.
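With the broker, the old Live database can be reinstated as a standby of the failover (provided Flashback Database is enabled; otherwise it is rebuilt from the current Live), after which the planned move back is a switchover rather than another failover. A sketch using the same illustrative names as above:

    DGMGRL> REINSTATE DATABASE whse1_live;
    DGMGRL> SWITCHOVER TO whse1_live;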