Build Mesos Multi-node HA Cluster

For brevity’s sake detailed notes will not be included with each step. My previous Setup Mesos Multi-node Cluster on Ubuntu post can be used for reference as it contains an explanation for each action and configuration change.

To begin we will create 6 new Ubuntu 14.04.1 LTS servers that will be used throughout this tutorial.

Node Name    IP Address     CPUs    Memory
node-01      10.1.200.11    1       512MB
node-02      10.1.200.12    1       512MB
node-03      10.1.200.13    1       512MB
node-04      10.1.200.30    1       512MB
node-05      10.1.200.40    2       1024MB
node-06      10.1.200.41    2       1024MB

Standalone Cluster

Just as in the Setup Standalone Mesos on Ubuntu post, the Mesos Master, Mesos Slave, Marathon, and ZooKeeper services are co-located on the same node.

node-01

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos marathon
$ echo 10.1.200.11 | sudo tee /etc/mesos-master/hostname
$ echo 10.1.200.11 | sudo tee /etc/mesos-master/ip
$ echo Cluster01 | sudo tee /etc/mesos-master/cluster
$ echo 1 | sudo tee /etc/mesos-master/quorum
$ echo 10.1.200.11 | sudo tee /etc/mesos-slave/hostname
$ echo 10.1.200.11 | sudo tee /etc/mesos-slave/ip
$ echo "cgroups/cpu,cgroups/mem" | sudo tee /etc/mesos-slave/isolation
$ echo zk://10.1.200.11:2181/mesos | sudo tee /etc/mesos/zk
$ echo 1 | sudo tee /etc/zookeeper/conf/myid
$ sudo reboot
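
The value written to /etc/mesos/zk is a ZooKeeper URL of the form zk://host:port[,host:port,...]/znode-path. As a quick illustration, here is a small shell helper (hypothetical, not part of Mesos or ZooKeeper) that lists the ensemble members in such a URL using parameter expansion:

```shell
# zk_hosts: print one ensemble member per line from a Mesos-style zk URL.
# Illustrative helper only -- not part of Mesos or ZooKeeper.
zk_hosts() {
  url=${1#zk://}        # strip the zk:// scheme
  url=${url%/*}         # strip the znode path (e.g. /mesos)
  echo "$url" | tr ',' '\n'
}

zk_hosts "zk://10.1.200.11:2181/mesos"
```

Later sections expand this URL to two and then three ensemble members; the znode path (/mesos) stays the same.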

Master/ZooKeeper/Marathon and Slave Cluster

In this scenario the Mesos Master, ZooKeeper, and Marathon services are on one node and a separate Mesos Slave is built.

node-01

The only change is that the Mesos Slave service is disabled from starting automatically.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos marathon
$ echo manual | sudo tee /etc/init/mesos-slave.override
$ echo 10.1.200.11 | sudo tee /etc/mesos-master/hostname
$ echo 10.1.200.11 | sudo tee /etc/mesos-master/ip
$ echo Cluster01 | sudo tee /etc/mesos-master/cluster
$ echo 1 | sudo tee /etc/mesos-master/quorum
$ echo zk://10.1.200.11:2181/mesos | sudo tee /etc/mesos/zk
$ echo 1 | sudo tee /etc/zookeeper/conf/myid
$ sudo reboot

node-05

The Mesos Master and ZooKeeper services are disabled from starting automatically.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos
$ echo manual | sudo tee /etc/init/mesos-master.override
$ echo manual | sudo tee /etc/init/zookeeper.override
$ echo 10.1.200.40 | sudo tee /etc/mesos-slave/hostname
$ echo 10.1.200.40 | sudo tee /etc/mesos-slave/ip
$ echo "cgroups/cpu,cgroups/mem" | sudo tee /etc/mesos-slave/isolation
$ echo zk://10.1.200.11:2181/mesos | sudo tee /etc/mesos/zk
$ sudo reboot

Redundant Master/ZooKeeper/Marathon and Slave Cluster

In this scenario a second host running Mesos Master, ZooKeeper, and Marathon services is provisioned.

node-01

The ZooKeeper URL is expanded to include the addresses of both Node-01 and Node-02, the ZooKeeper configuration file has both ZooKeeper servers specified, and Marathon is started in HA mode.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos marathon
$ echo manual | sudo tee /etc/init/mesos-slave.override
$ echo 10.1.200.11 | sudo tee /etc/mesos-master/hostname
$ echo 10.1.200.11 | sudo tee /etc/mesos-master/ip
$ echo Cluster01 | sudo tee /etc/mesos-master/cluster
$ echo 1 | sudo tee /etc/mesos-master/quorum
$ echo zk://10.1.200.11:2181,10.1.200.12:2181/mesos | sudo tee /etc/mesos/zk
$ echo 1 | sudo tee /etc/zookeeper/conf/myid
$ sudo reboot

Overwrite the /etc/zookeeper/conf/zoo.cfg to read:

tickTime=2000  
dataDir=/var/lib/zookeeper/  
clientPort=2181  
initLimit=5  
syncLimit=2  
server.1=10.1.200.11:2888:3888  
server.2=10.1.200.12:2888:3888  

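The number written to /etc/zookeeper/conf/myid must correspond to one of the server.N entries in zoo.cfg. A small sanity check along these lines (a hypothetical helper, not something shipped with ZooKeeper) can catch a mismatch before rebooting:

```shell
# check_myid: verify that the id in the myid file has a matching
# server.N entry in zoo.cfg. Illustrative helper only.
# usage: check_myid /etc/zookeeper/conf/myid /etc/zookeeper/conf/zoo.cfg
check_myid() {
  myid=$(cat "$1")
  if grep -q "^server\.${myid}=" "$2"; then
    echo "myid ${myid}: ok"
  else
    echo "myid ${myid}: no matching server.${myid} entry in $2"
  fi
}
```

On node-01 this would report "myid 1: ok" against the zoo.cfg above; on node-02 the myid file contains 2 and matches server.2.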
Overwrite the /etc/init/marathon.conf to read:

description "Marathon scheduler for Mesos"

start on runlevel [2345]  
stop on runlevel [!2345]

respawn  
respawn limit 10 5

exec /usr/local/bin/marathon --ha --hostname 10.1.200.11  

node-02

The ZooKeeper URL is expanded to include the addresses of both Node-01 and Node-02, the ZooKeeper configuration file has both ZooKeeper servers specified, and Marathon is started in HA mode.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos
$ echo manual | sudo tee /etc/init/mesos-slave.override
$ echo 10.1.200.12 | sudo tee /etc/mesos-master/hostname
$ echo 10.1.200.12 | sudo tee /etc/mesos-master/ip
$ echo Cluster01 | sudo tee /etc/mesos-master/cluster
$ echo 1 | sudo tee /etc/mesos-master/quorum
$ echo zk://10.1.200.11:2181,10.1.200.12:2181/mesos | sudo tee /etc/mesos/zk
$ echo 2 | sudo tee /etc/zookeeper/conf/myid
$ sudo reboot

Overwrite the /etc/zookeeper/conf/zoo.cfg to read:

tickTime=2000  
dataDir=/var/lib/zookeeper/  
clientPort=2181  
initLimit=5  
syncLimit=2  
server.1=10.1.200.11:2888:3888  
server.2=10.1.200.12:2888:3888  

Overwrite the /etc/init/marathon.conf to read:

description "Marathon scheduler for Mesos"

start on runlevel [2345]  
stop on runlevel [!2345]

respawn  
respawn limit 10 5

exec /usr/local/bin/marathon --ha --hostname 10.1.200.12  

node-05

The ZooKeeper URL is expanded to include the addresses of both Node-01 and Node-02.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos
$ echo manual | sudo tee /etc/init/mesos-master.override
$ echo manual | sudo tee /etc/init/zookeeper.override
$ echo 10.1.200.40 | sudo tee /etc/mesos-slave/hostname
$ echo 10.1.200.40 | sudo tee /etc/mesos-slave/ip
$ echo "cgroups/cpu,cgroups/mem" | sudo tee /etc/mesos-slave/isolation
$ echo zk://10.1.200.11:2181,10.1.200.12:2181/mesos | sudo tee /etc/mesos/zk
$ sudo reboot

Three Master/ZooKeeper/Marathon and Slave Cluster

The ZooKeeper URL is expanded to include the addresses of Node-01, Node-02, and Node-03. The third ZooKeeper server is added to the ZooKeeper configuration file.

For the first time we are increasing the quorum value in the Mesos Master configuration. A quorum is a strict majority of the ensemble: floor(N/2) + 1, where N is the number of Masters. A few examples to demonstrate how this scales up:

  • with 1 Master: quorum=1
  • with 2 Masters: quorum=2
  • with 3 Masters: quorum=2
  • with 4 Masters: quorum=3
  • with 5 Masters: quorum=3

Technically an even number of peers is supported, but it is rarely used because an even-sized cluster requires proportionally more nodes to form a quorum than an odd-sized one, without any gain in fault tolerance. For example, a cluster of 4 Masters requires 3 to form a quorum but can only tolerate the failure or isolation of a single node. A cluster of 5 also requires 3 to form a quorum but can tolerate the failure or isolation of 2 nodes. For this reason it is more efficient and fault tolerant to run clusters of 3, 5, 7, etc. nodes.

In production a cluster of 5 nodes is recommended over 3 because it allows you to take 1 server out of service (e.g. planned maintenance) and still maintain quorum if one of the remaining nodes unexpectedly fails. In a 3 node cluster, if you take 1 out of service for maintenance and a remaining server fails, ZooKeeper cannot maintain quorum and stops responding to requests.
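
The quorum and failure-tolerance arithmetic above can be sketched with a couple of shell functions (purely illustrative; these helpers are not part of Mesos or ZooKeeper):

```shell
# quorum: strict majority of N masters, i.e. floor(N/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

# tolerated: how many masters can fail while a quorum survives
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "masters=${n} quorum=$(quorum ${n}) tolerated=$(tolerated ${n})"
done
```

This reproduces the list above and shows why 4 Masters tolerate no more failures than 3: both tolerate exactly 1.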

node-01

The ZooKeeper URL is expanded to include the addresses of Node-01, Node-02 and Node-03. The ZooKeeper configuration file has the three ZooKeeper servers specified.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos
$ echo manual | sudo tee /etc/init/mesos-slave.override
$ echo 10.1.200.11 | sudo tee /etc/mesos-master/hostname
$ echo 10.1.200.11 | sudo tee /etc/mesos-master/ip
$ echo Cluster01 | sudo tee /etc/mesos-master/cluster
$ echo 2 | sudo tee /etc/mesos-master/quorum
$ echo zk://10.1.200.11:2181,10.1.200.12:2181,10.1.200.13:2181/mesos | sudo tee /etc/mesos/zk
$ echo 1 | sudo tee /etc/zookeeper/conf/myid
$ sudo reboot

Overwrite the /etc/zookeeper/conf/zoo.cfg to read:

tickTime=2000  
dataDir=/var/lib/zookeeper/  
clientPort=2181  
initLimit=5  
syncLimit=2  
server.1=10.1.200.11:2888:3888  
server.2=10.1.200.12:2888:3888  
server.3=10.1.200.13:2888:3888  

Overwrite the /etc/init/marathon.conf to read:

description "Marathon scheduler for Mesos"

start on runlevel [2345]  
stop on runlevel [!2345]

respawn  
respawn limit 10 5

exec /usr/local/bin/marathon --ha --hostname 10.1.200.11  

node-02

The ZooKeeper URL is expanded to include the addresses of Node-01, Node-02 and Node-03. The ZooKeeper configuration file has the three ZooKeeper servers specified.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos
$ echo manual | sudo tee /etc/init/mesos-slave.override
$ echo 10.1.200.12 | sudo tee /etc/mesos-master/hostname
$ echo 10.1.200.12 | sudo tee /etc/mesos-master/ip
$ echo Cluster01 | sudo tee /etc/mesos-master/cluster
$ echo 2 | sudo tee /etc/mesos-master/quorum
$ echo zk://10.1.200.11:2181,10.1.200.12:2181,10.1.200.13:2181/mesos | sudo tee /etc/mesos/zk
$ echo 2 | sudo tee /etc/zookeeper/conf/myid
$ sudo reboot

Overwrite the /etc/zookeeper/conf/zoo.cfg to read:

tickTime=2000  
dataDir=/var/lib/zookeeper/  
clientPort=2181  
initLimit=5  
syncLimit=2  
server.1=10.1.200.11:2888:3888  
server.2=10.1.200.12:2888:3888  
server.3=10.1.200.13:2888:3888  

Overwrite the /etc/init/marathon.conf to read:

description "Marathon scheduler for Mesos"

start on runlevel [2345]  
stop on runlevel [!2345]

respawn  
respawn limit 10 5

exec /usr/local/bin/marathon --ha --hostname 10.1.200.12  

node-03

The ZooKeeper URL is expanded to include the addresses of Node-01, Node-02 and Node-03. The ZooKeeper configuration file has the three ZooKeeper servers specified.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos
$ echo manual | sudo tee /etc/init/mesos-slave.override
$ echo 10.1.200.13 | sudo tee /etc/mesos-master/hostname
$ echo 10.1.200.13 | sudo tee /etc/mesos-master/ip
$ echo Cluster01 | sudo tee /etc/mesos-master/cluster
$ echo 2 | sudo tee /etc/mesos-master/quorum
$ echo zk://10.1.200.11:2181,10.1.200.12:2181,10.1.200.13:2181/mesos | sudo tee /etc/mesos/zk
$ echo 3 | sudo tee /etc/zookeeper/conf/myid
$ sudo reboot

Overwrite the /etc/zookeeper/conf/zoo.cfg to read:

tickTime=2000  
dataDir=/var/lib/zookeeper/  
clientPort=2181  
initLimit=5  
syncLimit=2  
server.1=10.1.200.11:2888:3888  
server.2=10.1.200.12:2888:3888  
server.3=10.1.200.13:2888:3888  

Overwrite the /etc/init/marathon.conf to read:

description "Marathon scheduler for Mesos"

start on runlevel [2345]  
stop on runlevel [!2345]

respawn  
respawn limit 10 5

exec /usr/local/bin/marathon --ha --hostname 10.1.200.13  

node-05

The ZooKeeper URL is expanded to include the addresses of Node-01, Node-02, and Node-03.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos
$ echo manual | sudo tee /etc/init/mesos-master.override
$ echo manual | sudo tee /etc/init/zookeeper.override
$ echo 10.1.200.40 | sudo tee /etc/mesos-slave/hostname
$ echo 10.1.200.40 | sudo tee /etc/mesos-slave/ip
$ echo "cgroups/cpu,cgroups/mem" | sudo tee /etc/mesos-slave/isolation
$ echo zk://10.1.200.11:2181,10.1.200.12:2181,10.1.200.13:2181/mesos | sudo tee /etc/mesos/zk
$ sudo reboot

Three Master/ZooKeeper/Marathon and Multiple Slave Cluster

The only change from the previous environment is that a second Mesos Slave is provisioned.

node-01

No change.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos
$ echo manual | sudo tee /etc/init/mesos-slave.override
$ echo 10.1.200.11 | sudo tee /etc/mesos-master/hostname
$ echo 10.1.200.11 | sudo tee /etc/mesos-master/ip
$ echo Cluster01 | sudo tee /etc/mesos-master/cluster
$ echo 2 | sudo tee /etc/mesos-master/quorum
$ echo zk://10.1.200.11:2181,10.1.200.12:2181,10.1.200.13:2181/mesos | sudo tee /etc/mesos/zk
$ echo 1 | sudo tee /etc/zookeeper/conf/myid
$ sudo reboot

Overwrite the /etc/zookeeper/conf/zoo.cfg to read:

tickTime=2000  
dataDir=/var/lib/zookeeper/  
clientPort=2181  
initLimit=5  
syncLimit=2  
server.1=10.1.200.11:2888:3888  
server.2=10.1.200.12:2888:3888  
server.3=10.1.200.13:2888:3888  

Overwrite the /etc/init/marathon.conf to read:

description "Marathon scheduler for Mesos"

start on runlevel [2345]  
stop on runlevel [!2345]

respawn  
respawn limit 10 5

exec /usr/local/bin/marathon --ha --hostname 10.1.200.11  

node-02

No change.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos
$ echo manual | sudo tee /etc/init/mesos-slave.override
$ echo 10.1.200.12 | sudo tee /etc/mesos-master/hostname
$ echo 10.1.200.12 | sudo tee /etc/mesos-master/ip
$ echo Cluster01 | sudo tee /etc/mesos-master/cluster
$ echo 2 | sudo tee /etc/mesos-master/quorum
$ echo zk://10.1.200.11:2181,10.1.200.12:2181,10.1.200.13:2181/mesos | sudo tee /etc/mesos/zk
$ echo 2 | sudo tee /etc/zookeeper/conf/myid
$ sudo reboot

Overwrite the /etc/zookeeper/conf/zoo.cfg to read:

tickTime=2000  
dataDir=/var/lib/zookeeper/  
clientPort=2181  
initLimit=5  
syncLimit=2  
server.1=10.1.200.11:2888:3888  
server.2=10.1.200.12:2888:3888  
server.3=10.1.200.13:2888:3888  

Overwrite the /etc/init/marathon.conf to read:

description "Marathon scheduler for Mesos"

start on runlevel [2345]  
stop on runlevel [!2345]

respawn  
respawn limit 10 5

exec /usr/local/bin/marathon --ha --hostname 10.1.200.12  

node-03

No change.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos
$ echo manual | sudo tee /etc/init/mesos-slave.override
$ echo 10.1.200.13 | sudo tee /etc/mesos-master/hostname
$ echo 10.1.200.13 | sudo tee /etc/mesos-master/ip
$ echo Cluster01 | sudo tee /etc/mesos-master/cluster
$ echo 2 | sudo tee /etc/mesos-master/quorum
$ echo zk://10.1.200.11:2181,10.1.200.12:2181,10.1.200.13:2181/mesos | sudo tee /etc/mesos/zk
$ echo 3 | sudo tee /etc/zookeeper/conf/myid
$ sudo reboot

Overwrite the /etc/zookeeper/conf/zoo.cfg to read:

tickTime=2000  
dataDir=/var/lib/zookeeper/  
clientPort=2181  
initLimit=5  
syncLimit=2  
server.1=10.1.200.11:2888:3888  
server.2=10.1.200.12:2888:3888  
server.3=10.1.200.13:2888:3888  

Overwrite the /etc/init/marathon.conf to read:

description "Marathon scheduler for Mesos"

start on runlevel [2345]  
stop on runlevel [!2345]

respawn  
respawn limit 10 5

exec /usr/local/bin/marathon --ha --hostname 10.1.200.13  

node-05

No change.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos
$ echo manual | sudo tee /etc/init/mesos-master.override
$ echo manual | sudo tee /etc/init/zookeeper.override
$ echo 10.1.200.40 | sudo tee /etc/mesos-slave/hostname
$ echo 10.1.200.40 | sudo tee /etc/mesos-slave/ip
$ echo "cgroups/cpu,cgroups/mem" | sudo tee /etc/mesos-slave/isolation
$ echo zk://10.1.200.11:2181,10.1.200.12:2181,10.1.200.13:2181/mesos | sudo tee /etc/mesos/zk
$ sudo reboot

node-06

A second Mesos Slave is provisioned.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
$ CODENAME=$(lsb_release -cs)
$ echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list  
$ sudo apt-get -y update
$ sudo apt-get -y install mesos
$ echo manual | sudo tee /etc/init/mesos-master.override
$ echo manual | sudo tee /etc/init/zookeeper.override
$ echo 10.1.200.41 | sudo tee /etc/mesos-slave/hostname
$ echo 10.1.200.41 | sudo tee /etc/mesos-slave/ip
$ echo "cgroups/cpu,cgroups/mem" | sudo tee /etc/mesos-slave/isolation
$ echo zk://10.1.200.11:2181,10.1.200.12:2181,10.1.200.13:2181/mesos | sudo tee /etc/mesos/zk
$ sudo reboot

Notes

Note 1

If you choose to install Marathon on its own node separate from Mesos you will need to install the Mesos packages and simply stop the mesos-master, mesos-slave, and zookeeper services from running. Per Connor Doyle at Mesosphere, “Right now there is a hard dependency on libmesos for the native half of the JNI Mesos bindings. We’ll work to lift this dependency in a future release as newer “pure” JVM bindings reach maturity.” This is as of 29 August 2014, so at some point this note won’t be valid.

Note 2

I ran into seemingly random issues with Marathon not being able to register with ZooKeeper or the Mesos Master. Besides not being able to start any tasks through Marathon, the main symptom was that the Marathon framework always showed that it had re-registered “Just now” in the “Frameworks” section of the Mesos web console. If you encounter the same issue you’ll likely see entries in the mesos-master.INFO or syslog logs similar to the following:

I0829 16:05:01.096535  1118 master.cpp:1423] Received re-registration request from framework 20140829-001207-16842879-5050-1056-0000 at scheduler-1689eb90-5823-40da-a658-02c9204759a2@127.0.1.1:42226  
I0829 16:05:01.096768  1118 master.cpp:1474] Re-registering framework 20140829-001207-16842879-5050-1056-0000 at scheduler-1689eb90-5823-40da-a658-02c9204759a2@127.0.1.1:42226  
I0829 16:05:01.096971  1118 master.cpp:1501] Framework 20140829-001207-16842879-5050-1056-0000 failed over  
I0829 16:05:01.097206  1118 hierarchical_allocator_process.hpp:375] Activated framework 20140829-001207-16842879-5050-1056-0000  
I0829 16:05:01.097550  1120 master.cpp:725] Framework 20140829-001207-16842879-5050-1056-0000 disconnected  
I0829 16:05:01.097663  1120 master.cpp:1655] Deactivating framework 20140829-001207-16842879-5050-1056-0000  
I0829 16:05:01.097796  1120 master.cpp:747] Giving framework 20140829-001207-16842879-5050-1056-0000 1weeks to failover  
I0829 16:05:01.097892  1124 hierarchical_allocator_process.hpp:405] Deactivated framework 20140829-001207-16842879-5050-1056-0000  

After far too many hours of troubleshooting and posting logs to the Marathon Framework Google Group, a helpful tip led me to the culprit. On Debian systems, a 127.0.1.1 entry is added to the /etc/hosts file during installation, containing the hostname of the system and a canonical FQDN if a domain name is specified. For example, here is the default hosts file on node-01:

$ cat /etc/hosts
127.0.0.1    localhost  
127.0.1.1    node-01.homelab.local node-01

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes  
ff02::2 ip6-allrouters  

You might be tempted to simply comment out the 127.0.1.1 line. However, as detailed in this post, doing so can lead to issues depending on what other services you are running. Per the Debian reference documentation:

Some software (e.g., GNOME) expects the system hostname to be resolvable to an IP address with a canonical fully qualified domain name. This is really improper because system hostnames and domain names are two very different things; but there you have it. In order to support that software, it is necessary to ensure that the system hostname can be resolved. Most often this is done by putting a line in /etc/hosts containing some IP address and the system hostname. If your system has a permanent IP address then use that; otherwise use the address 127.0.1.1.

Since I am using static addressing in this case I was able to comment out the 127.0.1.1 line and add an entry for the address of each host with the hostname and FQDN. Here is how node-01’s hosts file ended up:

$ cat /etc/hosts
127.0.0.1    localhost  
#127.0.1.1    node-01.homelab.local node-01

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes  
ff02::2 ip6-allrouters

10.1.200.11    node-01.homelab.local node-01  
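
A quick way to confirm the fix is to check whether a hosts file still maps the node’s hostname to 127.0.1.1. The helper below is purely illustrative (it is not part of any tool mentioned in this post):

```shell
# hosts_ok: warn if the given hosts file maps the hostname to 127.0.1.1.
# Illustrative helper only.
# usage: hosts_ok /etc/hosts node-01
hosts_ok() {
  if grep "^127\.0\.1\.1[[:space:]]" "$1" | grep -q "$2"; then
    echo "warn: $2 resolves to 127.0.1.1"
  else
    echo "ok"
  fi
}
```

Run against the corrected file above, this prints "ok"; against the default Debian hosts file it warns, which is the condition that broke Marathon registration.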

Summary

In this post I started with a single node hosting the Mesos Master, Mesos Slave, ZooKeeper, and Marathon services and incrementally added redundant Master/ZooKeeper/Marathon and Slave nodes to approach a highly available architecture that might be used in production. While I would never recommend running exactly this scenario in production, hopefully it helps to illustrate the process involved in adding different node types to expand a Mesos cluster.