Merge pull request #183 from github/raft-sqlite
WIP: orchestrator/raft[/sqlite]
Showing 202 changed files with 240,542 additions and 691 deletions.
@@ -10,4 +10,5 @@ vagrant/db4-post-install.sh
vagrant/vagrant-ssh-key
vagrant/vagrant-ssh-key.pub
Godeps/_workspace
.gopath/
main
@@ -0,0 +1,40 @@
# Configuration: raft

Set up an [orchestrator/raft](raft.md) cluster for high availability.

Assuming you will run `orchestrator/raft` on a `3`-node setup, you will configure this on each node:

```json
"RaftEnabled": true,
"RaftBind": "<ip.of.this.orchestrator.node>",
"DefaultRaftPort": 10008,
"RaftNodes": [
  "<ip.of.orchestrator.node1>",
  "<ip.of.orchestrator.node2>",
  "<ip.of.orchestrator.node3>"
],
```

Some breakdown:

- `RaftEnabled` must be set to `true`; otherwise `orchestrator` runs in shared-backend mode.
- `RaftBind` must be set; use the IP address of the local host. This IP will also appear as one of the entries in `RaftNodes`.
- `DefaultRaftPort` can be set to any port, but must be consistent across all deployments.
- `RaftNodes` should list all nodes of the raft cluster. This list consists of IP addresses (not host names) and includes the value of this host itself, as given in `RaftBind`.

As an example, the following might be a working setup:

```json
"RaftEnabled": true,
"RaftBind": "10.0.0.2",
"DefaultRaftPort": 10008,
"RaftNodes": [
  "10.0.0.1",
  "10.0.0.2",
  "10.0.0.3"
],
```

### Backend DB

A `raft` setup supports either a `MySQL` or a `SQLite` backend DB. See the [backend](configuration-backend.md) configuration page for either. Read the [high availability](high-availability.md) page for scenarios, possibilities, and reasons to use either.

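To illustrate, choosing the bundled `SQLite` backend can be as small as the following sketch. The data file path below is only an example; confirm exact parameter names on the [backend](configuration-backend.md) configuration page:

```json
"BackendDB": "sqlite",
"SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db",
```
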
@@ -1,57 +1,102 @@
# Orchestrator High Availability

`orchestrator` runs as a highly available service. This document lists the various ways of achieving HA for `orchestrator`, as well as less or not highly available setups.

### TL;DR ways to get HA

HA is achieved by choosing either:

- An `orchestrator/raft` setup, where `orchestrator` nodes communicate by raft consensus. Each `orchestrator` node [has a private database backend](#ha-via-raft), either `MySQL` or `sqlite`. See also the [orchestrator/raft documentation](raft.md).
- A [shared backend](#ha-via-shared-backend) setup. Multiple `orchestrator` nodes all talk to the same backend, which may be a Galera/XtraDB Cluster/InnoDB Cluster/NDB Cluster. Synchronization is done at the database level.

See also: [orchestrator/raft vs. synchronous replication setup](raft-vs-sync-repl.md)

### Availability types

You may choose different availability types, based on your requirements:

- No high availability: the easiest, simplest setup; good for testing or dev setups. Can use `MySQL` or `sqlite`.
- Semi HA: the backend is based on normal MySQL replication. `orchestrator` does not eat its own dog food and cannot fail over its own backend.
- HA: as depicted above; no single point of failure. Different solutions have different trade-offs in terms of resource utilization, supported software and type of client access.

All of these options are discussed below.

### No high availability

![orchestrator no HA](images/orchestrator-ha--no-ha.png)

This setup is good for CI testing, for local dev machines or for other experiments. It is a single `orchestrator` node with a single DB backend.

The DB backend may be a `MySQL` server or a `sqlite` DB, the latter bundled within `orchestrator` (no dependencies, no additional software required).

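To illustrate, pointing such a single node at a `MySQL` backend takes only a handful of parameters. The following is a minimal sketch; host, schema name and credentials are placeholder assumptions, and the authoritative parameter list is in the backend configuration documentation:

```json
"MySQLOrchestratorHost": "<mysql.backend.host>",
"MySQLOrchestratorPort": 3306,
"MySQLOrchestratorDatabase": "orchestrator",
"MySQLOrchestratorUser": "<user>",
"MySQLOrchestratorPassword": "<password>",
```
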
### Semi HA

![orchestrator semi HA](images/orchestrator-ha--semi-ha.png)

This setup provides semi HA for `orchestrator`. Two variations are available:

- Multiple `orchestrator` nodes talk to the same backend database. HA of the `orchestrator` services is achieved. However, HA of the backend database is not. The backend database may be a `master` with replicas, but `orchestrator` is unable to eat its own dog food and fail over its very own backend DB.

  If the backend `master` dies, it takes someone or something else to fail the `orchestrator` service over onto a promoted replica.

- Multiple `orchestrator` services all talk to a proxy server, which load balances an active-active `MySQL` master-master setup with `STATEMENT`-based replication (a configuration sketch appears below).

  - The proxy always directs to the same server (e.g. the `first` algorithm for `HAProxy`), unless that server is dead.
  - Death of the active master causes `orchestrator` to talk to the other master, which may be somewhat behind. `orchestrator` will typically self-reapply the missing changes by nature of its continuous discovery.
  - `orchestrator` queries guarantee that `STATEMENT`-based replication will not cause duplicate errors, and the master-master setup will always achieve consistency.
  - `orchestrator` will be able to recover from the death of a backend master even if in the middle of running a recovery (the recovery will re-initiate on the alternate master).
  - **Split brain is possible**. Depending on your setup, physical locations and type of proxy, different `orchestrator` service nodes may end up speaking to different backend `MySQL` servers. This scenario can lead to two `orchestrator` services which consider themselves "active", both of which will run failovers independently, which would lead to topology corruption.

To access your `orchestrator` service you may speak to any healthy node.

Both of these setups are well known to run in production for very large environments.

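For the proxy-fronted variation, each `orchestrator` node would carry an identical backend configuration pointing at the proxy rather than at any particular master. A rough sketch, where the proxy hostname and credentials are placeholder assumptions:

```json
"MySQLOrchestratorHost": "<proxy.host>",
"MySQLOrchestratorPort": 3306,
"MySQLOrchestratorDatabase": "orchestrator",
"MySQLOrchestratorUser": "<user>",
"MySQLOrchestratorPassword": "<password>",
```

The proxy itself (e.g. `HAProxy` with the `first` balancing algorithm) decides which of the two masters actually receives the traffic.
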
### HA via shared backend

![orchestrator HA via shared backend](images/orchestrator-ha--shared-backend.png)

HA is achieved by a highly available shared backend. Existing solutions are:

- Galera
- XtraDB Cluster
- InnoDB Cluster
- NDB Cluster

In all of the above the MySQL nodes run synchronous replication (using the common terminology).

Two variations exist:

- Your Galera/XtraDB Cluster/InnoDB Cluster runs with a single writer node. Multiple `orchestrator` nodes speak to the single writer DB, probably via proxy. If the writer DB fails, the backend cluster promotes a different DB as writer; it is up to your proxy to identify that and direct `orchestrator`'s traffic to the newly promoted server.

- Your Galera/XtraDB Cluster/InnoDB Cluster runs in multi-writer mode. A nice setup would couple each `orchestrator` node with a DB server, possibly on the very same box (a configuration sketch appears below). Since replication is synchronous there is no split brain. Only one `orchestrator` node can ever be the leader, and that leader will only speak with a consensus of the DB nodes.

In this setup there could be a substantial amount of traffic between the MySQL nodes. In cross-DC setups this may imply larger commit latencies (each commit may need to travel cross-DC).

To access your `orchestrator` service you may speak to any healthy node. It is advisable to speak only to the leader via proxy (use `/api/leader-check` as the HTTP health check for your proxy).

The latter setup is known to run in production in a very large environment, on `3`- or `5`-node setups.

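In the multi-writer variation, coupling each `orchestrator` node with the cluster node on the same box means the backend portion of every node's config can simply point at the local server. A minimal sketch, with credentials as placeholder assumptions:

```json
"MySQLOrchestratorHost": "127.0.0.1",
"MySQLOrchestratorPort": 3306,
"MySQLOrchestratorDatabase": "orchestrator",
"MySQLOrchestratorUser": "<user>",
"MySQLOrchestratorPassword": "<password>",
```
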
### HA via raft

![orchestrator HA via raft](images/orchestrator-ha--raft.png)

`orchestrator` nodes communicate directly with one another via the `raft` consensus algorithm. Each `orchestrator` node has its own private backend database. This can be `MySQL` or `sqlite`.

Only one `orchestrator` node assumes leadership, and it is always part of a consensus. However, all other nodes are independently active and are polling your topologies.

In this setup there is:

- No communication between the DB nodes.
- Minimal communication between the `orchestrator` nodes.
- `*n` communication to `MySQL` topology nodes. A `3`-node setup means each topology `MySQL` server is probed by `3` different `orchestrator` nodes, independently.

It is recommended to run a `3`-node or a `5`-node setup.

`sqlite` is embedded within `orchestrator` and does not require an external dependency. `MySQL` outperforms `sqlite` on busy setups.

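Putting it together, each raft node might combine the raft settings with its own private `sqlite` backend along the following lines. This is only a rough sketch; the IPs and data file path are placeholders, and exact backend parameter names should be verified against the configuration documentation:

```json
"RaftEnabled": true,
"RaftBind": "<ip.of.this.orchestrator.node>",
"DefaultRaftPort": 10008,
"RaftNodes": [
  "<ip.of.orchestrator.node1>",
  "<ip.of.orchestrator.node2>",
  "<ip.of.orchestrator.node3>"
],
"BackendDB": "sqlite",
"SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db",
```
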
To access your `orchestrator` service you may **only** speak to the leader node.

- Use `/api/leader-check` as the HTTP health check for your proxy.
- Or use [orchestrator-client](orchestrator-client.md) with multiple `orchestrator` backends; `orchestrator-client` will figure out the identity of the leader and send requests to it.

![orchestrator HA via raft](images/orchestrator-ha--raft-proxy.png)

`orchestrator/raft` is a newer development, and is being tested in production at this time. Please read the [orchestrator/raft documentation](raft.md) for all implications.