diff options
Diffstat (limited to 'doc/administration/consul.md')
-rw-r--r-- | doc/administration/consul.md | 239 |
1 files changed, 239 insertions, 0 deletions
diff --git a/doc/administration/consul.md b/doc/administration/consul.md new file mode 100644 index 00000000000..ae22d8bd4d0 --- /dev/null +++ b/doc/administration/consul.md @@ -0,0 +1,239 @@ +--- +type: reference +--- + +# How to set up Consul **(PREMIUM ONLY)** + +A Consul cluster consists of both +[server and client agents](https://www.consul.io/docs/agent). +The servers run on their own nodes and the clients run on other nodes that in +turn communicate with the servers. + +GitLab Premium includes a bundled version of [Consul](https://www.consul.io/) +a service networking solution that you can manage by using `/etc/gitlab/gitlab.rb`. + +## Configure the Consul nodes + +NOTE: **Important:** +Before proceeding, refer to the +[available reference architectures](reference_architectures/index.md#available-reference-architectures) +to find out how many Consul server nodes you should have. + +On **each** Consul server node perform the following: + +1. Follow the instructions to [install](https://about.gitlab.com/install/) + GitLab by choosing your preferred platform, but do not supply the + `EXTERNAL_URL` value when asked. +1. Edit `/etc/gitlab/gitlab.rb`, and add the following by replacing the values + noted in the `retry_join` section. In the example below, there are three + nodes, two denoted with their IP, and one with its FQDN, you can use either + notation: + + ```ruby + # Disable all components except Consul + roles ['consul_role'] + + # Consul nodes: can be FQDN or IP, separated by a whitespace + consul['configuration'] = { + server: true, + retry_join: %w(10.10.10.1 consul1.gitlab.example.com 10.10.10.2) + } + + # Disable auto migrations + gitlab_rails['auto_migrate'] = false + ``` + +1. [Reconfigure GitLab](restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes + to take effect. +1. Run the following command to ensure Consul is both configured correctly and + to verify that all server nodes are communicating: + + ```shell + sudo /opt/gitlab/embedded/bin/consul members + ``` + + The output should be similar to: + + ```plaintext + Node Address Status Type Build Protocol DC + CONSUL_NODE_ONE XXX.XXX.XXX.YYY:8301 alive server 0.9.2 2 gitlab_consul + CONSUL_NODE_TWO XXX.XXX.XXX.YYY:8301 alive server 0.9.2 2 gitlab_consul + CONSUL_NODE_THREE XXX.XXX.XXX.YYY:8301 alive server 0.9.2 2 gitlab_consul + ``` + + If the results display any nodes with a status that isn't `alive`, or if any + of the three nodes are missing, see the [Troubleshooting section](#troubleshooting-consul). + +## Upgrade the Consul nodes + +To upgrade your Consul nodes, upgrade the GitLab package. + +Nodes should be: + +- Members of a healthy cluster prior to upgrading the Omnibus GitLab package. +- Upgraded one node at a time. + +Identify any existing health issues in the cluster by running the following command +within each node. The command will return an empty array if the cluster is healthy: + +```shell +curl http://127.0.0.1:8500/v1/health/state/critical +``` + +Consul nodes communicate using the raft protocol. If the current leader goes +offline, there needs to be a leader election. A leader node must exist to facilitate +synchronization across the cluster. If too many nodes go offline at the same time, +the cluster will lose quorum and not elect a leader due to +[broken consensus](https://www.consul.io/docs/internals/consensus.html). + +Consult the [troubleshooting section](#troubleshooting-consul) if the cluster is not +able to recover after the upgrade. The [outage recovery](#outage-recovery) may +be of particular interest. + +NOTE: **Note:** +GitLab uses Consul to store only transient data that is easily regenerated. If +the bundled Consul was not used by any process other than GitLab itself, then +[rebuilding the cluster from scratch](#recreate-from-scratch) is fine. + +## Troubleshooting Consul + +Below are some useful operations should you need to debug any issues. +You can see any error logs by running: + +```shell +sudo gitlab-ctl tail consul +``` + +### Check the cluster membership + +To determine which nodes are part of the cluster, run the following on any member in the cluster: + +```shell +sudo /opt/gitlab/embedded/bin/consul members +``` + +The output should be similar to: + +```plaintext +Node Address Status Type Build Protocol DC +consul-b XX.XX.X.Y:8301 alive server 0.9.0 2 gitlab_consul +consul-c XX.XX.X.Y:8301 alive server 0.9.0 2 gitlab_consul +consul-c XX.XX.X.Y:8301 alive server 0.9.0 2 gitlab_consul +db-a XX.XX.X.Y:8301 alive client 0.9.0 2 gitlab_consul +db-b XX.XX.X.Y:8301 alive client 0.9.0 2 gitlab_consul +``` + +Ideally all nodes will have a `Status` of `alive`. + +### Restart Consul + +If it is necessary to restart Consul, it is important to do this in +a controlled manner to maintain quorum. If quorum is lost, to recover the cluster, +you will need to follow the Consul [outage recovery](#outage-recovery) process. + +To be safe, it's recommended that you only restart Consul in one node at a time to +ensure the cluster remains intact. For larger clusters, it is possible to restart +multiple nodes at a time. See the +[Consul consensus document](https://www.consul.io/docs/internals/consensus.html#deployment-table) +for how many failures it can tolerate. This will be the number of simultaneous +restarts it can sustain. + +To restart Consul: + +```shell +sudo gitlab-ctl restart consul +``` + +### Consul nodes unable to communicate + +By default, Consul will attempt to +[bind](https://www.consul.io/docs/agent/options.html#_bind) to `0.0.0.0`, but +it will advertise the first private IP address on the node for other Consul nodes +to communicate with it. If the other nodes cannot communicate with a node on +this address, then the cluster will have a failed status. + +If you are running into this issue, you will see messages like the following in `gitlab-ctl tail consul` output: + +```plaintext +2017-09-25_19:53:39.90821 2017/09/25 19:53:39 [WARN] raft: no known peers, aborting election +2017-09-25_19:53:41.74356 2017/09/25 19:53:41 [ERR] agent: failed to sync remote state: No cluster leader +``` + +To fix this: + +1. Pick an address on each node that all of the other nodes can reach this node through. +1. Update your `/etc/gitlab/gitlab.rb` + + ```ruby + consul['configuration'] = { + ... + bind_addr: 'IP ADDRESS' + } + ``` + +1. Reconfigure GitLab; + + ```shell + gitlab-ctl reconfigure + ``` + +If you still see the errors, you may have to +[erase the Consul database and reinitialize](#recreate-from-scratch) on the affected node. + +### Consul does not start - multiple private IPs + +In case that a node has multiple private IPs, Consul will be confused as to +which of the private addresses to advertise, and then immediately exit on start. + +You will see messages like the following in `gitlab-ctl tail consul` output: + +```plaintext +2017-11-09_17:41:45.52876 ==> Starting Consul agent... +2017-11-09_17:41:45.53057 ==> Error creating agent: Failed to get advertise address: Multiple private IPs found. Please configure one. +``` + +To fix this: + +1. Pick an address on the node that all of the other nodes can reach this node through. +1. Update your `/etc/gitlab/gitlab.rb` + + ```ruby + consul['configuration'] = { + ... + bind_addr: 'IP ADDRESS' + } + ``` + +1. Reconfigure GitLab; + + ```shell + gitlab-ctl reconfigure + ``` + +### Outage recovery + +If you lost enough Consul nodes in the cluster to break quorum, then the cluster +is considered failed, and it will not function without manual intervention. +In that case, you can either recreate the nodes from scratch or attempt a +recover. + +#### Recreate from scratch + +By default, GitLab does not store anything in the Consul node that cannot be +recreated. To erase the Consul database and reinitialize: + +```shell +sudo gitlab-ctl stop consul +sudo rm -rf /var/opt/gitlab/consul/data +sudo gitlab-ctl start consul +``` + +After this, the node should start back up, and the rest of the server agents rejoin. +Shortly after that, the client agents should rejoin as well. + +#### Recover a failed node + +If you have taken advantage of Consul to store other data and want to restore +the failed node, follow the +[Consul guide](https://learn.hashicorp.com/tutorials/consul/recovery-outage) +to recover a failed cluster. |