Diffstat (limited to 'doc/administration/database_load_balancing.md')
-rw-r--r-- | doc/administration/database_load_balancing.md | 277 |
1 files changed, 277 insertions, 0 deletions
diff --git a/doc/administration/database_load_balancing.md b/doc/administration/database_load_balancing.md
new file mode 100644
index 00000000000..7f3be402b84
--- /dev/null
+++ b/doc/administration/database_load_balancing.md
@@ -0,0 +1,277 @@
+# Database Load Balancing **[PREMIUM ONLY]**
+
+> [Introduced][ee-1283] in [GitLab Premium][eep] 9.0.
+
+Distribute read-only queries among multiple database servers.
+
+## Overview
+
+Database load balancing improves the distribution of database workloads across
+multiple computing resources. Load balancing aims to optimize resource use,
+maximize throughput, minimize response time, and avoid overload of any single
+resource. Using multiple components with load balancing instead of a single
+component may increase reliability and availability through redundancy.
+[_Wikipedia article_][wikipedia]
+
+When database load balancing is enabled in GitLab, the load is balanced using
+a simple round-robin algorithm, without any external dependencies such as Redis.
+Load balancing is not enabled for Sidekiq as this would lead to consistency
+problems, and Sidekiq mostly performs writes anyway.
+
+In the following image, you can see the load is balanced rather evenly among
+all the secondaries (`db4`, `db5`, `db6`). Because `SELECT` queries are not
+sent to the primary (unless necessary), the primary (`db3`) hardly has any load.
+
+
+## Requirements
+
+For load balancing to work you will need at least PostgreSQL 9.2 or newer,
+[**MySQL is not supported**][db-req]. You also need to make sure that you have
+at least 1 secondary in [hot standby][hot-standby] mode.
+
+Load balancing also requires that the configured hosts **always** point to the
+primary, even after a database failover. Furthermore, the additional hosts to
+balance load among must **always** point to secondary databases. This means that
+you should put a load balancer in front of every database, and have GitLab connect
+to those load balancers.
+
+For example, say you have a primary (`db1.gitlab.com`) and two secondaries,
+`db2.gitlab.com` and `db3.gitlab.com`. For this setup you will need to have 3
+load balancers, one for every host. For example:
+
+* `primary.gitlab.com` forwards to `db1.gitlab.com`
+* `secondary1.gitlab.com` forwards to `db2.gitlab.com`
+* `secondary2.gitlab.com` forwards to `db3.gitlab.com`
+
+Now let's say that a failover happens and db2 becomes the new primary. This
+means forwarding should now happen as follows:
+
+* `primary.gitlab.com` forwards to `db2.gitlab.com`
+* `secondary1.gitlab.com` forwards to `db1.gitlab.com`
+* `secondary2.gitlab.com` forwards to `db3.gitlab.com`
+
+GitLab does not take care of this for you, so you will need to do so yourself.
+
+Finally, load balancing requires that GitLab can connect to all hosts using the
+same credentials and port as configured in the
+[Enabling load balancing](#enabling-load-balancing) section. Using
+different ports or credentials for different hosts is not supported.
+
+## Use cases
+
+- For GitLab instances with thousands of users and high traffic, you can use
+  database load balancing to reduce the load on the primary database and
+  increase responsiveness, thus resulting in faster page load inside GitLab.
+
+## Enabling load balancing
+
+For the environment in which you want to use load balancing, you'll need to add
+the following. This will balance the load between `host1.example.com` and
+`host2.example.com`.
+
+**In Omnibus installations:**
+
+1. Edit `/etc/gitlab/gitlab.rb` and add the following line:
+
+    ```ruby
+    gitlab_rails['db_load_balancing'] = { 'hosts' => ['host1.example.com', 'host2.example.com'] }
+    ```
+
+1. Save the file and [reconfigure GitLab][] for the changes to take effect.
+
+---
+
+**In installations from source:**
+
+1. Edit `/home/git/gitlab/config/database.yml` and add or amend the following lines:
+
+    ```yaml
+    production:
+      username: gitlab
+      database: gitlab
+      encoding: unicode
+      load_balancing:
+        hosts:
+          - host1.example.com
+          - host2.example.com
+    ```
+
+1. Save the file and [restart GitLab][] for the changes to take effect.
+
+## Service Discovery
+
+> [Introduced][ee-5883] in [GitLab Premium][eep] 11.0.
+
+Service discovery allows GitLab to automatically retrieve a list of secondary
+databases to use, instead of having to manually specify these in the
+`database.yml` configuration file. Service discovery works by periodically
+checking a DNS A record, using the IPs returned by this record as the addresses
+for the secondaries. For service discovery to work, all you need is a DNS server
+and an A record containing the IP addresses of your secondaries.
+
+To use service discovery you need to change your `database.yml` configuration
+file so it looks like the following:
+
+```yaml
+production:
+  username: gitlab
+  database: gitlab
+  encoding: unicode
+  load_balancing:
+    discover:
+      nameserver: localhost
+      record: secondary.postgresql.service.consul
+      port: 8600
+      interval: 60
+      disconnect_timeout: 120
+```
+
+Here the `discover:` section specifies the configuration details to use for
+service discovery.
+
+### Configuration
+
+The following options can be set:
+
+| Option | Description | Default |
+|----------------------|---------------------------------------------------------------------------------------------------|-----------|
+| `nameserver` | The nameserver to use for looking up the DNS record. | localhost |
+| `record` | The A record to look up. This option is required for service discovery to work. | |
+| `port` | The port of the nameserver. | 8600 |
+| `interval` | The minimum time in seconds between checking the DNS record. | 60 |
+| `disconnect_timeout` | The time in seconds after which an old connection is closed, after the list of hosts was updated. | 120 |
+| `use_tcp` | Lookup DNS resources using TCP instead of UDP | false |
+
+The `interval` value specifies the _minimum_ time between checks. If the A
+record has a TTL greater than this value, then service discovery will honor said
+TTL. For example, if the TTL of the A record is 90 seconds, then service
+discovery will wait at least 90 seconds before checking the A record again.
+
+When the list of hosts is updated, it might take a while for the old connections
+to be terminated. The `disconnect_timeout` setting can be used to enforce an
+upper limit on the time it will take to terminate all old database connections.
+
+Some nameservers (like [Consul][consul-udp]) can return a truncated list of hosts when
+queried over UDP. To overcome this issue, you can use TCP for querying by setting
+`use_tcp` to `true`.
+
+### Forking
+
+If you use an application server that forks, such as Unicorn, you _have to_
+update your Unicorn configuration to start service discovery _after_ a fork.
+Failure to do so will lead to service discovery only running in the parent
+process. If you are using Unicorn, then you can add the following to your
+Unicorn configuration file:
+
+```ruby
+after_fork do |server, worker|
+  defined?(Gitlab::Database::LoadBalancing) &&
+    Gitlab::Database::LoadBalancing.start_service_discovery
+end
+```
+
+This will ensure that service discovery is started in both the parent and all
+child processes.
+
+## Balancing queries
+
+Read-only `SELECT` queries will be balanced among all the secondary hosts.
+Everything else (including transactions) will be executed on the primary.
+Queries such as `SELECT ... FOR UPDATE` are also executed on the primary.
+
+## Prepared statements
+
+Prepared statements don't work well with load balancing and are disabled
+automatically when load balancing is enabled. This should have no impact on
+response timings.
+
+## Primary sticking
+
+After a write has been performed, GitLab will stick to using the primary for a
+certain period of time, scoped to the user that performed the write. GitLab will
+revert to using secondaries either when they have caught up or after 30
+seconds.
+
+## Failover handling
+
+In the event of a failover or an unresponsive database, the load balancer will
+try to use the next available host. If no secondaries are available the
+operation is performed on the primary instead.
+
+In the event of a connection error being produced when writing data, the
+operation will be retried up to 3 times using an exponential back-off.
+
+When using load balancing, you should be able to safely restart a database server
+without it immediately leading to errors being presented to the users.
+
+## Logging
+
+The load balancer logs various messages, such as:
+
+* When a host is marked as offline
+* When a host comes back online
+* When all secondaries are offline
+
+Each log message contains the tag `[DB-LB]` to make searching/filtering of such
+log entries easier. For example:
+
+```
+[DB-LB] Host 10.123.2.5 came back online
+[DB-LB] Marking host 10.123.2.7 as offline
+[DB-LB] Marking host 10.123.2.7 as offline
+[DB-LB] Marking host 10.123.2.7 as offline
+[DB-LB] Marking host 10.123.2.7 as offline
+[DB-LB] Marking host 10.123.2.7 as offline
+[DB-LB] Host 10.123.2.6 came back online
+[DB-LB] Marking host 10.123.2.7 as offline
+[DB-LB] Marking host 10.123.2.7 as offline
+[DB-LB] Marking host 10.123.2.7 as offline
+[DB-LB] Host 10.123.2.7 came back online
+[DB-LB] Host 10.123.2.7 came back online
+```
+
+## Handling Stale Reads
+
+> [Introduced][ee-3526] in [GitLab Premium][eep] 10.3.
+
+To prevent reading from an outdated secondary the load balancer will check if it
+is in sync with the primary. If the data is determined to be recent enough the
+secondary can be used, otherwise it will be ignored. To reduce the overhead of
+these checks, we only perform them at certain intervals.
+
+There are three configuration options that influence this behaviour:
+
+| Option | Description | Default |
+|------------------------------|----------------------------------------------------------------------------------------------------------------|------------|
+| `max_replication_difference` | The amount of data (in bytes) a secondary is allowed to lag behind when it hasn't replicated data for a while. | 8 MB |
+| `max_replication_lag_time` | The maximum number of seconds a secondary is allowed to lag behind before we stop using it. | 60 seconds |
+| `replica_check_interval` | The minimum number of seconds we have to wait before checking the status of a secondary. | 60 seconds |
+
+The defaults should be sufficient for most users. Should you want to change them
+you can specify them in `config/database.yml` like so:
+
+```yaml
+production:
+  username: gitlab
+  database: gitlab
+  encoding: unicode
+  load_balancing:
+    hosts:
+      - host1.example.com
+      - host2.example.com
+    max_replication_difference: 16777216 # 16 MB
+    max_replication_lag_time: 30
+    replica_check_interval: 30
+```
+
+[hot-standby]: https://www.postgresql.org/docs/9.6/static/hot-standby.html
+[ee-1283]: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/1283
+[eep]: https://about.gitlab.com/pricing/
+[reconfigure gitlab]: restart_gitlab.md#omnibus-gitlab-reconfigure "How to reconfigure Omnibus GitLab"
+[restart gitlab]: restart_gitlab.md#installations-from-source "How to restart GitLab"
+[wikipedia]: https://en.wikipedia.org/wiki/Load_balancing_(computing)
+[db-req]: ../install/requirements.md#database
+[ee-3526]: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/3526
+[ee-5883]: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/5883
+[consul-udp]: https://www.consul.io/docs/agent/dns.html#udp-based-dns-queries
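
The routing rules from the "Balancing queries" and "Failover handling" sections of the added document can be illustrated with a small, standalone Ruby sketch. It is not part of this commit and is not GitLab's implementation: the class `RoundRobinBalancer` and its methods are hypothetical names chosen for illustration, and the sketch ignores transactions, primary sticking, and replication lag. It only shows round-robin selection of secondaries for plain `SELECT` statements, with a fallback to the primary when no secondary is online.

```ruby
# Hypothetical illustration only; none of these names are GitLab's API.
class RoundRobinBalancer
  def initialize(primary, secondaries)
    @primary = primary
    @secondaries = secondaries
    @offline = {}
    @index = 0
  end

  # Record that a secondary should be skipped until it comes back.
  def mark_offline(host)
    @offline[host] = true
  end

  def mark_online(host)
    @offline.delete(host)
  end

  # Returns the host that should run the given SQL statement.
  # Writes and `SELECT ... FOR UPDATE` always go to the primary.
  def host_for(sql)
    return @primary unless read_only?(sql)

    next_secondary || @primary
  end

  private

  # Assumption: a statement is read-only if it starts with SELECT
  # and does not contain FOR UPDATE.
  def read_only?(sql)
    statement = sql.strip
    statement.match?(/\Aselect\b/i) && !statement.match?(/\bfor\s+update\b/i)
  end

  # Walks the secondaries in round-robin order and returns the first
  # one that is online, or nil when all of them are offline.
  def next_secondary
    @secondaries.size.times do
      host = @secondaries[@index % @secondaries.size]
      @index += 1
      return host unless @offline[host]
    end
    nil
  end
end

balancer = RoundRobinBalancer.new(
  'primary.gitlab.com',
  ['secondary1.gitlab.com', 'secondary2.gitlab.com']
)
puts balancer.host_for('SELECT * FROM users')
puts balancer.host_for('UPDATE users SET name = NULL')
```

Keeping the offline set separate from the host list means a host that comes back online rejoins the rotation without reordering the other hosts.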