Mesos on Azure fails catastrophically

Expected Behavior

None

Actual Behavior

None

Steps to Reproduce

None

Environment

None

Description

We've now seen this a few times in the dev environment. The latest incident was on 21/03, when the whole dev environment went down; it took more than 7 hours to recover.

We need to figure out why this happens and how to prevent it from happening in production.

Activity

Mark Bryan
April 4, 2017, 7:46 AM

We tested the following scenarios.

  • 1 master with a read-only disk, while doing deployments

  • 1 master down, while doing deployments

  • 1 master down, downscaling nodes, then turning the master back on

  • Restarting alternative masters

So far we have not been able to reproduce the failure scenario.

Mark Bryan
April 18, 2017, 12:00 PM

I created the attached disaster recovery document. If we could review the options and procedure as a team to make sure we're on the same page, that would be great.

Mark Bryan
April 18, 2017, 12:08 PM

Additionally, I believe I also have a very good candidate for the root cause of the disaster scenario: time synchronisation between the servers. I noticed this problem on a couple of the nodes in production; the clocks were all out of sync according to Mesos.

On a healthy node, the time looks like this:

digitransitprodadmin@dcos-agent-public-543859600000B:~$ timedatectl status
      Local time: Tue 2017-04-18 10:02:36 UTC
  Universal time: Tue 2017-04-18 10:02:36 UTC
        RTC time: Fri 2017-06-16 12:53:30
       Time zone: Etc/UTC (UTC, +0000)
 Network time on: yes
NTP synchronized: yes
 RTC in local TZ: no

The determining factor is NTP synchronized: yes.
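As a quick sanity check, something along these lines could be run against all masters and agents to spot out-of-sync clocks. This is a rough sketch only: the host names are placeholders and it assumes SSH access to the nodes from an admin host.

for host in dcos-master-0 dcos-agent-public-0 dcos-agent-private-0; do   # placeholder host names
  echo "== ${host} =="
  # flag any node where systemd does not report the clock as NTP-synchronised
  ssh "${host}" 'timedatectl status | grep -E "Network time on|NTP synchronized"'
done

Any node reporting NTP synchronized: no would be a candidate for the clock-drift problem described above.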

I have written up a short account of the investigation for easy reference.

Assignee

Antero Karppinen

Reporter

Sami Siren

More details from

None

Priority

High

Recurrence

None

User Agent

None

URL

None

Components

None

Story Points

None

Labels

None