Etcd v2.3.7 (#224)

* baseline

* proxy health check with wait period, DR fix, upgrade to 2.3.7

* update readme

* latest image, add notes for stale metadata race condition

* bundle proxy, latest image fixes
James Oliver 2016-08-03 15:32:40 -07:00 committed by Bill Maxwell
parent 4bed5baabf
commit d36a0b65cb
4 changed files with 87 additions and 1 deletion


@@ -0,0 +1,39 @@
# Etcd
A distributed key/value store that provides a reliable way to store data across a cluster of machines
### General
The template deploys an N-node cluster. Only 1 node is allowed per host, which is enforced by the scheduler anti-affinity label in the bundled compose file. If fewer than N hosts are available, the largest possible cluster is built with the available resources. Adding more hosts at a later date scales the cluster up to the maximum desired size.
### Upgrades
Starting with `2.3.6-rancher4`, upgrades are fully supported and require no user intervention beyond navigating the UI and selecting the desired version.
### Resiliency
Etcd can survive `floor(N/2)` recoverable or unrecoverable failures while maintaining 100% uptime. For recoverable host failures, such as a power cycle, the service self-heals. For unrecoverable host failures, the service self-heals once sufficient resources are allocated to the environment, so it is worth allocating an extra host beyond the specified cluster size. For instance, a 3-node deployment in a 4-host environment may survive 2 host failures, given enough time in between for the cluster to self-heal. The amount of time that must pass between host failures is indeterminate and depends on how much data must be replicated to the new node.
Etcd can survive up to `N` recoverable failures, but downtime will be experienced once a majority of nodes (quorum) is lost.
Etcd can survive up to `N-1` unrecoverable failures, but enters a disaster state that requires some user intervention to recover. Recovery from this state is mostly, but not fully, automated. This is a very bad situation to be in and should never occur if appropriate precautions are taken, such as spreading hosts across availability zones.
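For reference, the failure tolerance implied by `floor(N/2)` for common cluster sizes:

| Cluster size (N) | Quorum | Failures tolerated with no downtime |
|------------------|--------|-------------------------------------|
| 3                | 2      | 1                                   |
| 5                | 3      | 2                                   |
| 7                | 4      | 3                                   |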
### Disaster Recovery (DR)
If a majority of nodes are unrecoverably lost, you must rebuild the cluster from one of the surviving nodes. The process involves selecting a survivor, transforming it into a standalone node, and adding new hosts so the service can scale back up to the desired size.
In more detail, follow these steps:
1. Determine if your lost hosts are truly unrecoverable. If on bare metal, this involves fixing or replacing hardware and rebooting. If in the cloud, check whether the host was stopped and attempt to start it. If a host comes back online, wait at least 5 minutes to allow the Network Agent to repair itself. You can verify that the network is repaired by following [these steps](http://docs.rancher.com/rancher/latest/en/faqs/troubleshooting/#containers-on-hosts-unable-to-ping-each-other-how-to-check-that-the-hosts-can-ping-each-other) for the recovered host. If recovery fails, remove the dead hosts from the environment.
2. Find 1 surviving container. Survivors are in the running state (green circle in the UI) and are DR candidates. From the container's dropdown menu, select `Execute Shell`, type `disaster`, and hit enter. The script automatically restarts the container in disaster recovery mode. Once recovery completes, etcd begins servicing requests and downstream containers should return to a functional state. You can verify cluster health with the commands shown after these steps.
3. Add more hosts so your etcd service can scale back up to the desired size. In the unlikely event that a majority of hosts failed simultaneously and you had a surplus of hosts, unhealthy etcd containers will have been scheduled to the other hosts. Wait 5 minutes for these containers to become healthy; if they are still initializing after 5 minutes, restart them.
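As a quick sanity check after recovery, you can query etcd from inside any running etcd container (a sketch; it assumes `etcdctl` is available in the container and uses its default local client endpoint):

```sh
# Run these from `Execute Shell` on a running etcd container.
# List the current cluster members and their peer/client URLs.
etcdctl member list
# Report per-member and overall cluster health.
etcdctl cluster-health
```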
### Limitations
For 3+ node deployments in environments using etcd heavily, it is theoretically possible (though improbable) for etcd to temporarily lose quorum during an upgrade. The next template will make use of new features in etcd v3.0.0 that expose the raft index to clients, making it possible to determine whether an upgraded node has caught up with the rest of the cluster. This will deprecate the use of non-deterministic waiting periods.
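For illustration, such a check might look like the sketch below. This is not part of the current template and assumes the v3 `etcdctl` tooling is available; comparing the raft index of an upgraded member against the leader's shows whether it has caught up.

```sh
# Print per-endpoint status; the output includes each member's raft term and raft index.
ETCDCTL_API=3 etcdctl endpoint status --endpoints=http://127.0.0.1:2379
```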
### Changelog
* Upgrade to etcd v2.3.7
* Re-work DR script to automate restart of container and perform backup only after etcd termination
* Add a proxy health check with a waiting period before reporting healthy, preventing upgrades from losing quorum in most cases (see the example below)
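The health check defined in the catalog configuration included in this commit targets the bundled proxy on port 2378 at `/health`. A minimal way to exercise it by hand (a sketch; replace the address with the managed-network IP of an etcd container):

```sh
# Query the bundled health-check proxy directly; 10.42.0.10 is a placeholder managed-network IP.
curl -s http://10.42.0.10:2378/health
```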


@@ -0,0 +1,18 @@
etcd:
image: rancher/etcd:v2.3.7-6
labels:
io.rancher.scheduler.affinity:container_label_ne: io.rancher.stack_service.name=$${stack_name}/$${service_name}
io.rancher.sidekicks: data
environment:
RANCHER_DEBUG: '${DEBUG}'
volumes_from:
- data
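# Data sidekick: a start-once busybox container that exits immediately and keeps the /data volume for etcd (mounted via volumes_from).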
data:
image: busybox
command: /bin/true
net: none
volumes:
- /data
labels:
io.rancher.container.start_once: 'true'


@@ -0,0 +1,29 @@
.catalog:
version: 2.3.7
minimum_rancher_version: v1.1.1
questions:
- variable: SCALE
description: Desired cluster size. 3, 5, or 7 are good choices. You will need this many hosts to reach your desired scale.
label: Number of Nodes
required: true
default: 3
type: int
- variable: DEBUG
description: Enable or disable verbose logging
label: Debug
required: true
default: true
type: boolean
etcd:
retain_ip: true
scale_policy:
min: 1
max: ${SCALE}
increment: 1
health_check:
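    # Checked against the bundled proxy on port 2378 (not etcd's client port 2379); the proxy waits before reporting healthy (see the README changelog).
    # interval and response_timeout are in milliseconds.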
port: 2378
request_line: /health
interval: 5000
response_timeout: 3000
unhealthy_threshold: 2
healthy_threshold: 2


@@ -1,5 +1,5 @@
name: Etcd
description: |
A distributed key/value store that provides a reliable way to store data across a cluster of machines
-version: 2.3.6-rancher4
+version: 2.3.7
category: Clustering