Merge pull request #183 from github/raft-sqlite
WIP: orchestrator/raft[/sqlite]
Showing 202 changed files with 240,542 additions and 691 deletions.
@@ -10,4 +10,5 @@ vagrant/db4-post-install.sh
vagrant/vagrant-ssh-key
vagrant/vagrant-ssh-key.pub
Godeps/_workspace
.gopath/
main
@@ -0,0 +1,40 @@
# Configuration: raft

Set up an [orchestrator/raft](raft.md) cluster for high availability.

Assuming you will run `orchestrator/raft` on a `3`-node setup, you will configure this on each node:

```json
"RaftEnabled": true,
"RaftBind": "<ip.of.this.orchestrator.node>",
"DefaultRaftPort": 10008,
"RaftNodes": [
  "<ip.of.orchestrator.node1>",
  "<ip.of.orchestrator.node2>",
  "<ip.of.orchestrator.node3>"
],
```

Some breakdown:

- `RaftEnabled` must be set to `true`; otherwise `orchestrator` runs in shared-backend mode.
- `RaftBind` must be set; use the IP address of the local host. This IP will also appear as one of the entries in `RaftNodes`.
- `DefaultRaftPort` can be set to any port, but must be consistent across all deployments.
- `RaftNodes` should list all nodes of the raft cluster. This list consists of IP addresses (not host names) and includes the value of this host itself, as given in `RaftBind`.

As an example, the following might be a working setup:

```json
"RaftEnabled": true,
"RaftBind": "10.0.0.2",
"DefaultRaftPort": 10008,
"RaftNodes": [
  "10.0.0.1",
  "10.0.0.2",
  "10.0.0.3"
],
```

### Backend DB

A `raft` setup supports either a `MySQL` or a `SQLite` backend DB. See the [backend](configuration-backend.md) configuration page for either. Read the [high availability](high-availability.md) page for scenarios, possibilities, and reasons to use either.

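To illustrate, choosing the bundled `SQLite` backend can be as small as the following sketch. The data file path below is only an example; confirm exact parameter names on the [backend](configuration-backend.md) configuration page:

```json
"BackendDB": "sqlite",
"SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db",
```
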
@@ -1,57 +1,102 @@
# Orchestrator High Availability

`orchestrator` runs as a highly available service. This document lists the various ways of achieving HA for `orchestrator`, as well as less or not highly available setups.

### TL;DR ways to get HA

HA is achieved by choosing either:

- An `orchestrator/raft` setup, where `orchestrator` nodes communicate by raft consensus. Each `orchestrator` node [has a private database backend](#ha-via-raft), either `MySQL` or `sqlite`. See also the [orchestrator/raft documentation](raft.md).
- A [shared backend](#ha-via-shared-backend) setup. Multiple `orchestrator` nodes all talk to the same backend, which may be a Galera/XtraDB Cluster/InnoDB Cluster/NDB Cluster. Synchronization is done at the database level.

See also: [orchestrator/raft vs. synchronous replication setup](raft-vs-sync-repl.md)

### Availability types

You may choose different availability types, based on your requirements:

- No high availability: the easiest, simplest setup; good for testing or dev setups. Can use `MySQL` or `sqlite`.
- Semi HA: the backend is based on normal MySQL replication. `orchestrator` does not eat its own dog food and cannot fail over its own backend.
- HA: as depicted above; no single point of failure. Different solutions have different trade-offs in terms of resource utilization, supported software and type of client access.

All of these options are discussed below.

### No high availability

![orchestrator no HA](images/orchestrator-ha--no-ha.png)

This setup is good for CI testing, for local dev machines or for other experiments. It is a single `orchestrator` node with a single DB backend.

The DB backend may be a `MySQL` server or a `sqlite` DB, the latter bundled within `orchestrator` (no dependencies, no additional software required).

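To illustrate, pointing such a single node at a `MySQL` backend takes only a handful of parameters. The following is a minimal sketch; host, schema name and credentials are placeholder assumptions, and the authoritative parameter list is in the backend configuration documentation:

```json
"MySQLOrchestratorHost": "<mysql.backend.host>",
"MySQLOrchestratorPort": 3306,
"MySQLOrchestratorDatabase": "orchestrator",
"MySQLOrchestratorUser": "<user>",
"MySQLOrchestratorPassword": "<password>",
```
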
### Semi HA

![orchestrator semi HA](images/orchestrator-ha--semi-ha.png)

This setup provides semi HA for `orchestrator`. Two variations are available:

- Multiple `orchestrator` nodes talk to the same backend database. HA of the `orchestrator` services is achieved. However, HA of the backend database is not. The backend database may be a `master` with replicas, but `orchestrator` is unable to eat its own dog food and fail over its very own backend DB.

  If the backend `master` dies, it takes someone or something else to fail the `orchestrator` service over onto a promoted replica.

- Multiple `orchestrator` services all talk to a proxy server, which load balances an active-active `MySQL` master-master setup with `STATEMENT`-based replication (a configuration sketch appears below).

  - The proxy always directs to the same server (e.g. the `first` algorithm for `HAProxy`), unless that server is dead.
  - Death of the active master causes `orchestrator` to talk to the other master, which may be somewhat behind. `orchestrator` will typically self-reapply the missing changes by nature of its continuous discovery.
  - `orchestrator` queries guarantee that `STATEMENT`-based replication will not cause duplicate errors, and the master-master setup will always achieve consistency.
  - `orchestrator` will be able to recover from the death of a backend master even if in the middle of running a recovery (the recovery will re-initiate on the alternate master).
  - **Split brain is possible**. Depending on your setup, physical locations and type of proxy, different `orchestrator` service nodes may end up speaking to different backend `MySQL` servers. This scenario can lead to two `orchestrator` services which consider themselves "active", both of which will run failovers independently, which would lead to topology corruption.

To access your `orchestrator` service you may speak to any healthy node.

Both of these setups are well known to run in production for very large environments.

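For the proxy-fronted variation, each `orchestrator` node would carry an identical backend configuration pointing at the proxy rather than at any particular master. A rough sketch, where the proxy hostname and credentials are placeholder assumptions:

```json
"MySQLOrchestratorHost": "<proxy.host>",
"MySQLOrchestratorPort": 3306,
"MySQLOrchestratorDatabase": "orchestrator",
"MySQLOrchestratorUser": "<user>",
"MySQLOrchestratorPassword": "<password>",
```

The proxy itself (e.g. `HAProxy` with the `first` balancing algorithm) decides which of the two masters actually receives the traffic.
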
### HA via shared backend

![orchestrator HA via shared backend](images/orchestrator-ha--shared-backend.png)

HA is achieved by a highly available shared backend. Existing solutions are:

- Galera
- XtraDB Cluster
- InnoDB Cluster
- NDB Cluster

In all of the above the MySQL nodes run synchronous replication (using the common terminology).

Two variations exist:

- Your Galera/XtraDB Cluster/InnoDB Cluster runs with a single writer node. Multiple `orchestrator` nodes speak to the single writer DB, probably via proxy. If the writer DB fails, the backend cluster promotes a different DB as writer; it is up to your proxy to identify that and direct `orchestrator`'s traffic to the newly promoted server.

- Your Galera/XtraDB Cluster/InnoDB Cluster runs in multi-writer mode. A nice setup would couple each `orchestrator` node with a DB server, possibly on the very same box (a configuration sketch appears below). Since replication is synchronous there is no split brain. Only one `orchestrator` node can ever be the leader, and that leader will only speak with a consensus of the DB nodes.

In this setup there could be a substantial amount of traffic between the MySQL nodes. In cross-DC setups this may imply larger commit latencies (each commit may need to travel cross-DC).

To access your `orchestrator` service you may speak to any healthy node. It is advisable to speak only to the leader via proxy (use `/api/leader-check` as the HTTP health check for your proxy).

The latter setup is known to run in production in a very large environment, on `3`- or `5`-node setups.

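In the multi-writer variation, coupling each `orchestrator` node with the cluster node on the same box means the backend portion of every node's config can simply point at the local server. A minimal sketch, with credentials as placeholder assumptions:

```json
"MySQLOrchestratorHost": "127.0.0.1",
"MySQLOrchestratorPort": 3306,
"MySQLOrchestratorDatabase": "orchestrator",
"MySQLOrchestratorUser": "<user>",
"MySQLOrchestratorPassword": "<password>",
```
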
### HA via raft

![orchestrator HA via raft](images/orchestrator-ha--raft.png)

`orchestrator` nodes communicate directly with one another via the `raft` consensus algorithm. Each `orchestrator` node has its own private backend database. This can be `MySQL` or `sqlite`.

Only one `orchestrator` node assumes leadership, and it is always part of a consensus. However, all other nodes are independently active and are polling your topologies.

In this setup there is:

- No communication between the DB nodes.
- Minimal communication between the `orchestrator` nodes.
- `*n` communication to `MySQL` topology nodes. A `3`-node setup means each topology `MySQL` server is probed by `3` different `orchestrator` nodes, independently.

It is recommended to run a `3`-node or a `5`-node setup.

`sqlite` is embedded within `orchestrator` and does not require an external dependency. `MySQL` outperforms `sqlite` on busy setups.

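Putting it together, each raft node might combine the raft settings with its own private `sqlite` backend along the following lines. This is only a rough sketch; the IPs and data file path are placeholders, and exact backend parameter names should be verified against the configuration documentation:

```json
"RaftEnabled": true,
"RaftBind": "<ip.of.this.orchestrator.node>",
"DefaultRaftPort": 10008,
"RaftNodes": [
  "<ip.of.orchestrator.node1>",
  "<ip.of.orchestrator.node2>",
  "<ip.of.orchestrator.node3>"
],
"BackendDB": "sqlite",
"SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db",
```
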
To access your `orchestrator` service you may **only** speak to the leader node.

- Use `/api/leader-check` as the HTTP health check for your proxy.
- Or use [orchestrator-client](orchestrator-client.md) with multiple `orchestrator` backends; `orchestrator-client` will figure out the identity of the leader and send requests to it.

![orchestrator HA via raft](images/orchestrator-ha--raft-proxy.png)

`orchestrator/raft` is a newer development, and is being tested in production at this time. Please read the [orchestrator/raft documentation](raft.md) for all implications.