We've now seen this a few times in the dev environment. The latest incident was on 21/03, when the whole dev environment went down; it took more than 7 hours to recover.
We need to figure out why this happens and how to prevent it from happening in prod.
We tested the following scenarios:
- One master with a read-only disk while performing deployments (see the sketch after this list)
- One master down while performing deployments
- One master down, downscaling the nodes, then turning the master back on
- Restarting the alternate masters
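For reference, this is roughly how the read-only-disk condition can be forced on a test master. The mount point is an assumption (it depends on where the master keeps its state), so treat this as a sketch rather than the exact commands we ran:

# Assumption: the master's state lives on its own mount point, e.g. /var/lib/dcos.
# remount,ro only works if the path is a real mount point, not a plain directory.
sudo mount -o remount,ro /var/lib/dcos

# ... run deployments and observe the cluster ...

# Restore the disk to read-write once the test is done.
sudo mount -o remount,rw /var/lib/dcos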
So far we have not been able to reproduce the failure scenario.
I have created the attached disaster recovery document. It would be great if we could review the options and procedures as a team, to make sure we're on the same page.
Additionally, I believe I also have a strong root-cause candidate for the disaster scenario: time synchronisation between the servers. I noticed this problem on a couple of the production nodes, where the clocks were all out of sync according to Mesos.
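A quick way to spot this kind of skew is to compare epoch seconds across the fleet. This is just a sketch; the hostnames below are placeholders for our real master/agent inventory:

# Hostnames are placeholders; substitute the real inventory.
for host in dcos-master-1 dcos-agent-public-1 dcos-agent-public-2; do
    # Print the remote clock next to the local one to spot skew.
    echo "$host: remote=$(ssh "$host" date +%s) local=$(date +%s)"
done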
On a healthy node the timedatectl output looks like this:
digitransitprodadmin@dcos-agent-public-543859600000B:~$ timedatectl status
      Local time: Tue 2017-04-18 10:02:36 UTC
  Universal time: Tue 2017-04-18 10:02:36 UTC
        RTC time: Fri 2017-06-16 12:53:30
       Time zone: Etc/UTC (UTC, +0000)
 Network time on: yes
NTP synchronized: yes
 RTC in local TZ: no
The determining factor is the "NTP synchronized: yes" line.
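Something along these lines should let us check that flag across the whole fleet, and re-enable NTP where it has been switched off. Again, the host list is a placeholder:

# Hostnames are placeholders; substitute the real inventory.
for host in dcos-master-1 dcos-agent-public-1 dcos-agent-public-2; do
    echo "== $host =="
    # Show only the NTP sync status line from timedatectl.
    ssh "$host" timedatectl status | grep -i 'NTP synchronized'
    # Uncomment to turn NTP back on where it was disabled:
    # ssh "$host" sudo timedatectl set-ntp true
done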
I have also created a short account of the investigation for ease of reference.