Follow-up after the incident on August 2, 2018
What happened?
On August 2, 2018, the HZDR data center lost its power supply, and all services subsequently had to be brought up again. The GitLab@HZDR service was affected as well: its VM was not reachable until around 1 p.m. on August 3.
After the VMware virtualization platform was back up, the GitLab machine responded to ping but could not be reached via any other protocol such as HTTP or SSH. As soon as the VM was restarted into a valid state, the GitLab service came up without any manual intervention.
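A host that answers ping can still have no working services, as was the case here. The following is a minimal sketch of a check that probes the service ports directly instead of relying on ping. The function names and the choice of ports 22 and 443 are assumptions made for illustration; they are not taken from the actual monitoring setup.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_services(host: str, ports: dict[str, int]) -> dict[str, bool]:
    """Probe each named TCP port and report which ones accept connections."""
    return {name: port_open(host, port) for name, port in ports.items()}


# Example call (hypothetical host name, assumed ports):
#   check_services("gitlab.example.org", {"ssh": 22, "https": 443})
```

A successful TCP connection only shows that something is listening on the port. A fuller check would also send an HTTP request and inspect the response.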
How could the service be made resilient to the cause of the problem?
By running virtual machines on two different, independent virtualization platforms, so that the machines in the healthy environment could take over.
Measures to further improve the reliability of the RODARE service:
- Set up GitLab in a multi-node HA environment. This is already prepared on https://vlsgit1.fz-rossendorf.de; the relocation still needs to be scheduled. Ideally, the multi-node environment would use VMs on two independent virtualization platforms. (#14 (closed))
- After the relocation, use the current machine as a test environment. Set up an automated job that restores the backup of the production environment once a day. This way, the integrity and proper functioning of the backup can be verified. (#21 (closed))
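The daily restore job from the last item could start from something like the sketch below. It assumes an Omnibus GitLab installation whose backup archives end in `_gitlab_backup.tar` and have already been copied to the test machine. The helper names are hypothetical. The sketch only selects the newest archive and builds the restore command; it does not run it.

```python
from pathlib import Path

# Assumption: Omnibus GitLab names its archives "<id>_gitlab_backup.tar".
SUFFIX = "_gitlab_backup.tar"


def latest_backup_id(backup_dir: Path) -> str:
    """Return the BACKUP identifier of the most recently modified archive."""
    archives = sorted(
        backup_dir.glob(f"*{SUFFIX}"), key=lambda p: p.stat().st_mtime
    )
    if not archives:
        raise FileNotFoundError(f"no backup archive found in {backup_dir}")
    # The identifier is the file name without the fixed suffix.
    return archives[-1].name[: -len(SUFFIX)]


def restore_command(backup_id: str) -> list[str]:
    """Build the restore command for the given backup identifier."""
    # Assumption: the rake task is used; newer GitLab releases provide
    # `gitlab-backup restore BACKUP=<id>` as the equivalent command.
    return [
        "gitlab-rake",
        "gitlab:backup:restore",
        f"BACKUP={backup_id}",
        "force=yes",  # skip the interactive confirmation prompts
    ]
```

A scheduler such as cron could call these helpers once a day and pass the resulting list to `subprocess.run(..., check=True)`, so that a failed restore produces a non-zero exit status and can be alerted on. A real job would also have to stop the services that access the database before restoring and supply the matching `gitlab-secrets.json`.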