---
table_display_block: true
---
# GitLab Scalability
This document assumes a working knowledge of the [GitLab
architecture](architecture.md). Before we discuss the current limits of
GitLab scalability and possible future directions, let's begin with a
few sample flows for some of the most frequent activities that occur
today:
## Example 1: Git fetch over SSH
```mermaid
sequenceDiagram
participant Client
participant sshd
participant gitlab_shell
participant Rails
participant Redis
participant PostgreSQL
participant Gitaly
Note over Client,gitlab_shell: $ git pull
Client->>gitlab_shell: ssh git@gitlab.com git-upload-pack group/project.git
gitlab_shell->>Rails: HTTP POST /api/v4/internal/authorized_keys
Rails->>PostgreSQL: Look up fingerprint
PostgreSQL->>Rails: Found key
Rails->>gitlab_shell: 200 OK
gitlab_shell->>Rails: HTTP POST /api/v4/internal/allowed
Rails->>Redis: Read cache data
Redis->>Rails: Cache data
Rails->>PostgreSQL: Look up user/authorized projects/keys/etc.
PostgreSQL->>Rails: Database rows
Rails->>Gitaly: RPCs for checking push rules (e.g. FindCommit)
Gitaly->>Rails: Gitaly response data
Rails->>gitlab_shell: 200 OK
gitlab_shell->>Gitaly: gitaly-upload-pack
Gitaly->>gitlab_shell: Git data
gitlab_shell->>Client: Git data
```
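The very first hop in this flow depends on sshd handing key validation
off to gitlab-shell, which in turn calls the internal API. A minimal
sketch of the relevant `sshd_config` wiring is below; the helper path
and arguments are illustrative assumptions based on a typical
installation, not values copied from production:

```
# Sketch only: look keys up through gitlab-shell instead of reading a
# static authorized_keys file. Paths and arguments are illustrative.
Match User git
  AuthorizedKeysCommand /opt/gitlab-shell/bin/gitlab-shell-authorized-keys-check git %u %k
  AuthorizedKeysCommandUser git
```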
TODO:
## Git fetch over HTTPS
## Git push over SSH
## Loading merge requests (/project/merge_requests/:iid)
## Runner CI jobs
## API: /api/v4/projects
### Microservice Review
Over the past year, we've seen a number of incidents arising from
degradation of one or more services:
#### sshd
sshd (under Ubuntu 16.04, not Ubuntu 14.04) has generally been rock
solid. However, it requires careful tuning to make it work reliably at
scale. For example, as discussed in
https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7168:
1. HAProxy should be configured with `leastconn`
1. sshd `MaxStartups` needs to be tuned (both settings are sketched below)
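A rough sketch of both settings follows. The values shown are
illustrative assumptions for this document, not the numbers used on
GitLab.com; the right values depend on the size of the SSH fleet and
the connection rate it has to absorb.

```
# haproxy.cfg (sketch): send each new SSH connection to the backend
# server with the fewest established connections, rather than
# round-robin.
backend ssh
  mode tcp
  balance leastconn
  server git-01 10.0.0.11:22 check
  server git-02 10.0.0.12:22 check
```

```
# /etc/ssh/sshd_config (sketch): raise the number of concurrent
# unauthenticated connections sshd accepts before it starts randomly
# dropping new ones. The format is start:rate:full.
MaxStartups 100:30:200
```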
##### Observability
sshd does not provide any way to monitor it directly via Prometheus
metrics. There are verbosity levels that can be turned up, but they are
not on by default. We may want to consider contributing better logging
and/or direct instrumentation.
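Until then, the available knobs are sshd's own log verbosity combined
with a log-to-metrics exporter. The snippet below is a sketch of that
workaround, not something we run today; `LogLevel` is a standard sshd
option, and the choice of exporter (for example mtail or grok_exporter)
is left open.

```
# /etc/ssh/sshd_config (sketch): VERBOSE logs key fingerprints and
# per-connection detail that a log-based exporter could turn into
# Prometheus metrics.
LogLevel VERBOSE
```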
#### gitlab-shell
[gitlab-shell](https://gitlab.com/gitlab-org/gitlab-shell) started out
as a pure Ruby project but has been almost entirely rewritten in Go for
performance. It used to handle both incoming Git SSH traffic and Git
hooks (e.g. pre-receive, post-receive, etc.), but all Git hooks have now
been moved into Gitaly where they belong, alongside the Git
repositories.
Rewriting in Go is essential for scalability because each time the Ruby
version of gitlab-shell runs, it needs to load its Ruby dependencies and
parse its YAML config file before it can do any work. This can take on
the order of 200-300 milliseconds per invocation, adding unnecessary
latency to every SSH operation.
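A compiled Go binary pays none of that interpreter start-up cost: the
only per-invocation work is reading the config file and dispatching the
command. The sketch below is a simplified, hypothetical entry point in
the spirit of the Go rewrite; it is not the actual gitlab-shell code,
and the config field name is an assumption.

```go
// Sketch of a gitlab-shell-style entry point (hypothetical, simplified).
package main

import (
	"fmt"
	"os"
	"strings"

	"gopkg.in/yaml.v3"
)

type Config struct {
	GitlabURL string `yaml:"gitlab_url"` // internal API base URL (assumed field name)
}

func main() {
	// A compiled binary starts in a few milliseconds; reading and parsing
	// the config is the only start-up work left.
	raw, err := os.ReadFile("config.yml")
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading config:", err)
		os.Exit(1)
	}

	var cfg Config
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		fmt.Fprintln(os.Stderr, "parsing config:", err)
		os.Exit(1)
	}

	// sshd places the requested Git command in SSH_ORIGINAL_COMMAND.
	switch cmd := os.Getenv("SSH_ORIGINAL_COMMAND"); {
	case strings.HasPrefix(cmd, "git-upload-pack"):
		// check /api/v4/internal/allowed at cfg.GitlabURL, then hand off to Gitaly
	case strings.HasPrefix(cmd, "git-receive-pack"):
		// same, for pushes
	default:
		fmt.Println("This account can only be used for Git operations.")
	}
}
```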
##### Observability
gitlab-shell currently runs as short-lived processes that cannot easily
be monitored with Prometheus. gitlab-shell could benefit from pushing
its metrics to an endpoint Prometheus can scrape, such as a Pushgateway.
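A sketch of what that could look like with the Go Prometheus client and
a Pushgateway is below. The Pushgateway address, job name, and metric
are assumptions made for illustration; none of this exists in
gitlab-shell today.

```go
// Sketch: a short-lived process pushing one metric to a Pushgateway.
// Address, job name, and metric name are hypothetical.
package main

import (
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	start := time.Now()

	// ... handle the SSH command ...

	duration := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "gitlab_shell_command_duration_seconds",
		Help: "Wall-clock time of a single gitlab-shell invocation.",
	})
	duration.Set(time.Since(start).Seconds())

	if err := push.New("http://pushgateway.example.com:9091", "gitlab_shell").
		Collector(duration).
		Push(); err != nil {
		log.Printf("could not push metrics: %v", err)
	}
}
```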
#### Rails
As seen in the diagrams above, Rails handles internal API checks from
gitlab-shell and Workhorse. These are among the most frequently used API
endpoints, so it is imperative that they be extremely fast and reliable.
##### /api/v4/internal/authorized_keys
This endpoint has a simple job: validate that the SSH key presented by
the user exists in the database. As we have seen in
https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7168, the
P99 duration of this endpoint is fast enough, but the significant
queueing delay in front of it is concerning.
Given its simplicity and performance implications, we may want to
consider moving this check out of Rails and into a dedicated service.
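As a sketch of what such a dedicated service might look like (entirely
hypothetical; the endpoint, query, and schema assumptions below are for
illustration, not an agreed design):

```go
// Hypothetical standalone authorized_keys lookup service (sketch only).
// It answers a single question: does a key with this fingerprint exist?
package main

import (
	"database/sql"
	"log"
	"net/http"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://gitlab@localhost/gitlabhq_production?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/authorized_keys", func(w http.ResponseWriter, r *http.Request) {
		fingerprint := r.URL.Query().Get("fingerprint")

		var id int64
		err := db.QueryRow("SELECT id FROM keys WHERE fingerprint_sha256 = $1", fingerprint).Scan(&id)
		switch {
		case err == sql.ErrNoRows:
			http.Error(w, "key not found", http.StatusNotFound)
		case err != nil:
			http.Error(w, "lookup failed", http.StatusInternalServerError)
		default:
			w.WriteHeader(http.StatusOK)
		}
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```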
###### Observability
For all internal API routes, we currently have no idea how much time
requests spend queueing before Rails sees them. We have an open issue to
route this traffic through Workhorse:
https://gitlab.com/gitlab-org/omnibus-gitlab/issues/4583.
##### /api/v4/internal/allowed
The `/internal/allowed` endpoint is used to check whether a certain user
or SSH key has access to upload or download repository data.
This endpoint has been a constant source of problems over the years, both
from a reliability and a performance standpoint. For example:
1. Deploy tokens not working
1. Push rules timing out
1. Path locks: https://gitlab.com/gitlab-org/gitlab-ce/issues/55137
1. LFS pointer checks fail: https://gitlab.com/gitlab-org/gitlab-ee/issues/10799
1. Repository size limits: https://gitlab.com/gitlab-org/gitlab-ee/issues/11126
Because of push rules, this endpoint often needs to communicate with
Gitaly to scan commits on disk.
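To make the shape of this call concrete, here is a rough sketch of the
kind of request gitlab-shell makes for a push. The field names and
values are illustrative approximations of the internal API contract, not
an exact specification:

```
POST /api/v4/internal/allowed

action=git-receive-pack
protocol=ssh
key_id=123
project=group/project
changes=0000000000000000000000000000000000000000 95ce4b5e95ce4b5e95ce4b5e95ce4b5e95ce4b5e refs/heads/main
```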
#### Redis
#### PgBouncer
#### PostgreSQL
#### Gitaly