---
table_display_block: true
---

# GitLab Scalability

This document assumes working knowledge of the [GitLab architecture](architecture.md).

Before we discuss the current limits of GitLab scalability and future directions, let's begin with a few sample flows for some of the most frequent activities that occur today:

## Example 1: Git fetch over SSH

```mermaid
sequenceDiagram
    participant Client
    participant sshd
    participant gitlab_shell
    participant Rails
    participant Redis
    participant PostgreSQL
    participant Gitaly
    Note over Client,gitlab_shell: $ git pull
    Client->>gitlab_shell: ssh git@gitlab.com git-upload-pack group/project.git
    gitlab_shell->>Rails: HTTP POST /api/v4/internal/authorized_keys
    Rails->>PostgreSQL: Look up fingerprint
    PostgreSQL->>Rails: Found key
    Rails->>gitlab_shell: 200 OK
    gitlab_shell->>Rails: HTTP POST /api/v4/internal/allowed
    Rails->>Redis: Read cache data
    Redis->>Rails: Cache data
    Rails->>PostgreSQL: Look up user/authorized projects/keys/etc.
    PostgreSQL->>Rails: Database rows
    Rails->>Gitaly: RPCs for checking push rules (e.g. FindCommit)
    Gitaly->>Rails: Gitaly response data
    Rails->>gitlab_shell: 200 OK
    gitlab_shell->>Gitaly: gitaly-upload-pack
    Gitaly->>gitlab_shell: Git data
    gitlab_shell->>Client: Git data
```

TODO:

## Git fetch over HTTPS

## Git push over SSH

## Loading merge requests (/project/merge_requests/:iid)

## Runner CI jobs

## API: /api/v4/projects

### Microservice Review

Over the past year, we've seen a number of incidents arising from degradation of one or more services:

#### sshd

sshd (under Ubuntu 16.04, not Ubuntu 14.04) has generally been rock solid. However, it requires careful tuning to work reliably at scale. For example, as discussed in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7168:

1. HAProxy should be configured with `leastconn`
1. sshd `MaxStartups` needs to be tuned

##### Observability

sshd does not provide any way to be monitored directly via Prometheus metrics. There are verbosity levels that can be turned up, but they are not on by default. We may want to consider contributing better logging and/or direct instrumentation.

#### gitlab-shell

[gitlab-shell](https://gitlab.com/gitlab-org/gitlab-shell) started out as a pure Ruby project but has been almost entirely rewritten in Go for performance. It used to handle both incoming Git SSH traffic and Git hooks (e.g. pre-receive, post-receive, etc.), but now all Git hooks have been moved into Gitaly where they belong, alongside the Git repositories.

Rewriting in Go is essential for scalability: each time the Ruby version of gitlab-shell runs, it needs to load its Ruby dependencies, parse its YAML config file, and then do its work. This can take on the order of 200-300 milliseconds to complete, adding unnecessary latency.

##### Observability

gitlab-shell currently runs short-lived processes that cannot easily be monitored with Prometheus. gitlab-shell could benefit from pushing metrics to a Prometheus endpoint.

#### Rails

As seen in the diagrams above, Rails handles internal API checks from gitlab-shell and Workhorse. These requests are among the most frequently-used API requests, so it is imperative that they be extremely fast and reliable.

##### /api/v4/internal/authorized_keys

This endpoint has a simple job: validate that the SSH key presented by the user exists in the database. As we have seen in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7168, the P99 duration of this endpoint is fast enough, but there is significant queueing delay that is concerning.
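The check itself amounts to little more than a single indexed lookup. A minimal Go sketch of that check as a standalone HTTP service might look like the following; the table and column names, request shape, and connection details are illustrative assumptions, not the actual GitLab schema or internal API contract:

```go
// Sketch of the authorized_keys check as a standalone service.
// All names here (keys table, fingerprint column, query parameter,
// listen address) are assumptions for illustration.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"net/http"

	_ "github.com/lib/pq" // PostgreSQL driver
)

func authorizedKeysHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		fingerprint := r.URL.Query().Get("fingerprint")

		// The entire check is a single indexed lookup.
		var id int64
		err := db.QueryRow(
			"SELECT id FROM keys WHERE fingerprint = $1", fingerprint,
		).Scan(&id)

		switch {
		case err == sql.ErrNoRows:
			w.WriteHeader(http.StatusNotFound)
		case err != nil:
			w.WriteHeader(http.StatusInternalServerError)
		default:
			// The caller only needs to know the key exists and its ID.
			fmt.Fprintf(w, `{"id": %d}`, id)
		}
	}
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/gitlabhq_production?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/api/v4/internal/authorized_keys", authorizedKeysHandler(db))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```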
Given its simplicity and performance implications, we may want to consider moving this check outside of Rails and into a dedicated service.

###### Observability

For all internal API routes, we currently have no insight into how much time is spent in queuing here. We have an open issue to route this through Workhorse: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/4583.

##### /api/v4/internal/allowed

The `/internal/allowed` endpoint is used to check whether a certain user or SSH key has access to upload or download repository data. This endpoint has been a constant source of problems over the years, both from a reliability and a performance standpoint. For example:

1. Deploy tokens not working
1. Push rules timing out
1. Path locks: https://gitlab.com/gitlab-org/gitlab-ce/issues/55137
1. LFS pointer checks failing: https://gitlab.com/gitlab-org/gitlab-ee/issues/10799
1. Repository size limits: https://gitlab.com/gitlab-org/gitlab-ee/issues/11126

Because of push rules, this endpoint often needs to communicate with Gitaly to scan commits on disk.

#### Redis

#### PgBouncer

#### PostgreSQL

#### Gitaly
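As the sequence diagram above shows, Rails issues `FindCommit` RPCs to Gitaly while evaluating push rules. As a rough sketch of what such a call looks like against Gitaly's gRPC interface, the following Go example uses the published `gitalypb` proto package; the socket path, storage name, and repository path are illustrative assumptions:

```go
// Sketch of a FindCommit RPC against Gitaly. Connection address and
// repository identifiers below are illustrative, not real values.
package main

import (
	"context"
	"fmt"
	"log"

	"gitlab.com/gitlab-org/gitaly/proto/go/gitalypb"
	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("unix:///var/opt/gitlab/gitaly/gitaly.socket", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := gitalypb.NewCommitServiceClient(conn)
	resp, err := client.FindCommit(context.Background(), &gitalypb.FindCommitRequest{
		Repository: &gitalypb.Repository{
			StorageName:  "default",
			RelativePath: "group/project.git",
		},
		Revision: []byte("HEAD"),
	})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(resp.GetCommit().GetId())
}
```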