| author | GitLab Bot <gitlab-bot@gitlab.com> | 2023-03-02 21:12:31 +0000 |
|---|---|---|
| committer | GitLab Bot <gitlab-bot@gitlab.com> | 2023-03-02 21:12:31 +0000 |
| commit | a88ffaad9f9f321a71657ac0aca4d947f5bc6a5b (patch) | |
| tree | c4baeb79c24082ea806b8ce4f27fcc94903b742e /doc/development/database | |
| parent | 3c4d101de003ea292be5ac5baf5d73c6c2747367 (diff) | |
| download | gitlab-ce-a88ffaad9f9f321a71657ac0aca4d947f5bc6a5b.tar.gz | |
Add latest changes from gitlab-org/gitlab@master
Diffstat (limited to 'doc/development/database')
| -rw-r--r-- | doc/development/database/clickhouse/merge_request_analytics.md | 355 |
| -rw-r--r-- | doc/development/database/clickhouse/tiered_storage.md | 138 |
| -rw-r--r-- | doc/development/database/index.md | 2 |
3 files changed, 495 insertions, 0 deletions
diff --git a/doc/development/database/clickhouse/merge_request_analytics.md b/doc/development/database/clickhouse/merge_request_analytics.md
new file mode 100644
index 00000000000..34da71d6c4c
--- /dev/null
+++ b/doc/development/database/clickhouse/merge_request_analytics.md
@@ -0,0 +1,355 @@

---
stage: Data Stores
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---

# Merge request analytics with ClickHouse

The [merge request analytics feature](../../../user/analytics/merge_request_analytics.md)
shows statistics about the merged merge requests in the project and also exposes record-level metadata.
Aggregations include:

- **Average time to merge**: The duration between the creation time and the merge time.
- **Monthly aggregations**: A chart covering 12 months of merged merge requests.

Under the chart, the user can see the paginated list of merge requests, 12 merge requests per page.

You can filter by:

- Author
- Assignee
- Labels
- Milestone
- Source branch
- Target branch

## Current performance problems

- The aggregation queries require specialized indexes, which cost additional
  disk space (index-only scans).
- Querying the whole 12 months is slow (statement timeout). Instead, the frontend
  requests data per month (12 database queries).
- Even with specialized indexes, making the feature available on the group level
  would not be feasible due to the large volume of merge requests.

## Example queries

Get the number of merge requests merged in a given month:

```sql
SELECT COUNT(*)
FROM "merge_requests"
INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
WHERE (NOT EXISTS
         (SELECT 1
          FROM "banned_users"
          WHERE (merge_requests.author_id = banned_users.user_id)))
  AND "merge_request_metrics"."target_project_id" = 278964
  AND "merge_request_metrics"."merged_at" >= '2022-12-01 00:00:00'
  AND "merge_request_metrics"."merged_at" <= '2023-01-01 00:00:00'
```

The `merge_request_metrics` table was de-normalized (by adding `target_project_id`)
to improve the first-page load time. The query itself works well for smaller date ranges;
however, it can time out as the date range increases.

After an extra filter is added, the query becomes more complex because it must also
filter the `merge_requests` table:

```sql
SELECT COUNT(*)
FROM "merge_requests"
INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
WHERE (NOT EXISTS
         (SELECT 1
          FROM "banned_users"
          WHERE (merge_requests.author_id = banned_users.user_id)))
  AND "merge_requests"."author_id" IN
        (SELECT "users"."id"
         FROM "users"
         WHERE (LOWER("users"."username") IN (LOWER('ahegyi'))))
  AND "merge_request_metrics"."target_project_id" = 278964
  AND "merge_request_metrics"."merged_at" >= '2022-12-01 00:00:00'
  AND "merge_request_metrics"."merged_at" <= '2023-01-01 00:00:00'
```

To calculate the mean time to merge, we also query the total time between the
merge request creation time and merge time:

```sql
SELECT EXTRACT(epoch
               FROM SUM(AGE(merge_request_metrics.merged_at, merge_request_metrics.created_at)))
FROM "merge_requests"
INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
WHERE (NOT EXISTS
         (SELECT 1
          FROM "banned_users"
          WHERE (merge_requests.author_id = banned_users.user_id)))
  AND "merge_requests"."author_id" IN
        (SELECT "users"."id"
         FROM "users"
         WHERE (LOWER("users"."username") IN (LOWER('ahegyi'))))
  AND "merge_request_metrics"."target_project_id" = 278964
  AND "merge_request_metrics"."merged_at" >= '2022-08-01 00:00:00'
  AND "merge_request_metrics"."merged_at" <= '2022-09-01 00:00:00'
  AND "merge_request_metrics"."merged_at" > "merge_request_metrics"."created_at"
LIMIT 1
```

## Store merge request data in ClickHouse

Several other use cases exist for storing and querying merge request data in
ClickHouse. In this document, we focus on this particular feature.

The core data exists in the `merge_request_metrics` and in the `merge_requests`
database tables. Some filters require extra tables to be joined:

- `banned_users`: Filter out merge requests created by banned users.
- `labels`: A merge request can have one or more assigned labels.
- `assignees`: A merge request can have one or more assignees.
- `merge_request_metrics`: The `merged_at` column, used by the date range filter, is located in this table.

The `merge_requests` table contains data that can be filtered directly:

- **Author**: via the `author_id` column.
- **Milestone**: via the `milestone_id` column.
- **Source branch**.
- **Target branch**.
- **Project**: via the `project_id` column.

### Keep ClickHouse data up to date

Replicating or syncing the `merge_requests` table is unfortunately not enough.
Separate queries to associated tables are required to insert one de-normalized
`merge_requests` row into the ClickHouse database.

Change detection is non-trivial to implement. A few corners we could cut:

- The feature is available for GitLab Premium and GitLab Ultimate customers.
  We don't have to sync all the data, but instead sync only the `merge_requests` records
  which are part of licensed groups.
- Data changes (often) happen via the `MergeRequest` services, where bumping the
  `updated_at` timestamp column is mostly consistent. Some sort of incremental
  synchronization process could be implemented (see the sketch after this list).
- We only need to query the merged merge requests. After the merge, the record rarely changes.
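A minimal sketch of what the PostgreSQL side of such an incremental synchronization could look like, assuming a hypothetical `last_synced_at` cursor stored by the sync job (the timestamp literal and the batch size below are placeholders):

```sql
-- Hypothetical incremental sync query: fetch merged merge requests that changed
-- since the last successful sync. The timestamp stands in for the stored
-- `last_synced_at` cursor; the batch size is arbitrary.
SELECT "merge_requests"."id",
       "merge_requests"."updated_at",
       "merge_request_metrics"."merged_at"
FROM "merge_requests"
INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
WHERE "merge_request_metrics"."merged_at" IS NOT NULL
  AND "merge_requests"."updated_at" > '2023-03-01 00:00:00' -- last_synced_at
ORDER BY "merge_requests"."updated_at", "merge_requests"."id"
LIMIT 1000;
```

Each returned row would still need to be de-normalized (labels, assignees, license check) before being inserted into ClickHouse in batches.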
### Database table structure

The database table structure uses de-normalization to make all required columns
available in one database table. This eliminates the need for `JOINs`.

```sql
CREATE TABLE merge_requests
(
    `id` UInt64,
    `project_id` UInt64 DEFAULT 0 NOT NULL,
    `author_id` UInt64 DEFAULT 0 NOT NULL,
    `milestone_id` UInt64 DEFAULT 0 NOT NULL,
    `label_ids` Array(UInt64) DEFAULT [] NOT NULL,
    `assignee_ids` Array(UInt64) DEFAULT [] NOT NULL,
    `source_branch` String DEFAULT '' NOT NULL,
    `target_branch` String DEFAULT '' NOT NULL,
    `merged_at` DateTime64(6, 'UTC') NOT NULL,
    `created_at` DateTime64(6, 'UTC') DEFAULT now() NOT NULL,
    `updated_at` DateTime64(6, 'UTC') DEFAULT now() NOT NULL
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY (project_id, merged_at, id);
```

Similarly to the [activity data example](gitlab_activity_data.md), we use the
`ReplacingMergeTree` engine. Several columns of the merge request record may change,
so keeping the table up-to-date is important.

The database table is ordered by the `project_id, merged_at, id` columns. This ordering
optimizes the table data for our use case: querying the `merged_at` column in a project.
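Keep in mind that `ReplacingMergeTree` removes duplicated row versions only during background merges, so a query may still see multiple versions of the same merge request. A minimal sketch for reading only the latest version of each row, at the cost of extra work at query time:

```sql
-- FINAL forces ClickHouse to collapse row versions at query time, keeping the
-- row with the highest `updated_at` value for each sorting key.
SELECT *
FROM merge_requests FINAL
WHERE project_id = 200
ORDER BY merged_at DESC
LIMIT 10;
```

For the aggregations in this document, an occasional not-yet-collapsed duplicate is usually an acceptable trade-off, so the queries below don't use `FINAL`.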
## Rewrite the count query

First, let's generate some data for the table.

```sql
INSERT INTO merge_requests (id, project_id, author_id, milestone_id, label_ids, merged_at, created_at)
SELECT id, project_id, author_id, milestone_id, label_ids, merged_at, created_at
FROM generateRandom('id UInt64, project_id UInt8, author_id UInt8, milestone_id UInt8, label_ids Array(UInt8), merged_at DateTime64(6, \'UTC\'), created_at DateTime64(6, \'UTC\')')
LIMIT 1000000;
```

NOTE:
Some integer data types were cast as `UInt8`, so it is highly probable that they
have the same values across different rows.

The original count query only aggregated data for one month. With ClickHouse, we can
attempt to aggregate the data for the whole year.

PostgreSQL-based count query:

```sql
SELECT COUNT(*)
FROM "merge_requests"
INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
WHERE (NOT EXISTS
         (SELECT 1
          FROM "banned_users"
          WHERE (merge_requests.author_id = banned_users.user_id)))
  AND "merge_request_metrics"."target_project_id" = 278964
  AND "merge_request_metrics"."merged_at" >= '2022-12-01 00:00:00'
  AND "merge_request_metrics"."merged_at" <= '2023-01-01 00:00:00'
```

ClickHouse query:

```sql
SELECT
    toYear(merged_at) AS year,
    toMonth(merged_at) AS month,
    COUNT(*)
FROM merge_requests
WHERE
    project_id = 200
    AND merged_at BETWEEN '2022-01-01 00:00:00'
    AND '2023-01-01 00:00:00'
GROUP BY year, month
```

The query processed significantly fewer rows than the one million rows we generated.
The table's `ORDER BY` clause (the primary key) helps the query execution:

```plaintext
11 rows in set. Elapsed: 0.010 sec.
Processed 8.19 thousand rows, 131.07 KB (783.45 thousand rows/s., 12.54 MB/s.)
```

## Rewrite the mean time to merge query

The query calculates the mean time to merge as
`duration(created_at, merged_at) / merge_request_count`. The calculation is done in
several steps:

1. Request the monthly counts and the monthly duration values.
1. Sum the counts to get the yearly count.
1. Sum the durations to get the yearly duration.
1. Divide the durations by the count.

In ClickHouse, we can calculate the mean time to merge with one query:

```sql
SELECT
    SUM(
        dateDiff('second', created_at, merged_at) / 3600 / 24
    ) / COUNT(*) AS mean_time_to_merge -- mean_time_to_merge is in days
FROM merge_requests
WHERE
    project_id = 200
    AND merged_at BETWEEN '2022-01-01 00:00:00'
    AND '2023-01-01 00:00:00'
```

## Filtering

The database queries above can be used as base queries. You can add more filters.
For example, filtering for a label and a milestone:

```sql
SELECT
    toYear(merged_at) AS year,
    toMonth(merged_at) AS month,
    COUNT(*)
FROM merge_requests
WHERE
    project_id = 200
    AND milestone_id = 15
    AND has(label_ids, 118) -- the label_ids array includes 118
    AND merged_at BETWEEN '2022-01-01 00:00:00'
    AND '2023-01-01 00:00:00'
GROUP BY year, month
```

Optimizing a particular filter is usually done with a database index. This particular
query reads about 8000 rows:

```plaintext
1 row in set. Elapsed: 0.016 sec.
Processed 8.19 thousand rows, 589.99 KB (505.38 thousand rows/s., 36.40 MB/s.)
```
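Before adding an index, it can be useful to check how much data the primary key already prunes for this filter. A possible sketch using `EXPLAIN` (the exact output format depends on the ClickHouse version):

```sql
-- Show which parts and granules the primary key (and any skipping indexes)
-- select for the filtered query.
EXPLAIN indexes = 1
SELECT COUNT(*)
FROM merge_requests
WHERE
    project_id = 200
    AND milestone_id = 15
    AND has(label_ids, 118)
    AND merged_at BETWEEN '2022-01-01 00:00:00' AND '2023-01-01 00:00:00';
```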
Adding an index on `milestone_id`:

```sql
ALTER TABLE merge_requests
    ADD INDEX milestone_id_index milestone_id TYPE minmax GRANULARITY 10;

ALTER TABLE merge_requests
    MATERIALIZE INDEX milestone_id_index;
```

On the generated data, adding the index didn't improve the performance.

### Banned users filter

A recently added feature in GitLab filters out merge requests where the author is
banned by the admins. The banned users are tracked on the instance level in the
`banned_users` database table.

#### Idea 1: Enumerate the banned user IDs

This would require no structural changes to the ClickHouse database schema.
We could query the banned users in the project and filter the values out at query time.

Get the banned users (in PostgreSQL):

```sql
SELECT user_id FROM banned_users
```

In ClickHouse:

```sql
SELECT
    toYear(merged_at) AS year,
    toMonth(merged_at) AS month,
    COUNT(*)
FROM merge_requests
WHERE
    author_id NOT IN (1, 2, 3, 4) -- banned users
    AND project_id = 200
    AND milestone_id = 15
    AND has(label_ids, 118) -- the label_ids array includes 118
    AND merged_at BETWEEN '2022-01-01 00:00:00'
    AND '2023-01-01 00:00:00'
GROUP BY year, month
```

The problem with this approach is that the number of banned users could increase
significantly, which would make the query bigger and slower.

#### Idea 2: Replicate the `banned_users` table

Assuming that the `banned_users` table doesn't grow to millions of rows, we could
attempt to periodically sync the whole table to ClickHouse. With this approach,
a mostly consistent `banned_users` table could be used in the ClickHouse database query:

```sql
SELECT
    toYear(merged_at) AS year,
    toMonth(merged_at) AS month,
    COUNT(*)
FROM merge_requests
WHERE
    author_id NOT IN (SELECT user_id FROM banned_users)
    AND project_id = 200
    AND milestone_id = 15
    AND has(label_ids, 118) -- the label_ids array includes 118
    AND merged_at BETWEEN '2022-01-01 00:00:00' AND '2023-01-01 00:00:00'
GROUP BY year, month
```

Alternatively, the `banned_users` table could be stored as a
[dictionary](https://clickhouse.com/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts)
to further improve the query performance.
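A possible sketch of such a dictionary, assuming the replicated table is called `banned_users` and contains `user_id` and `created_at` columns (the layout and refresh interval are illustrative only):

```sql
-- Build an in-memory dictionary on top of the replicated banned_users table.
-- LIFETIME controls how often ClickHouse refreshes the dictionary from its source.
CREATE DICTIONARY banned_users_dict
(
    user_id UInt64,
    created_at DateTime
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(TABLE 'banned_users'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);
```

The banned-author condition in the query above could then become `NOT dictHas('banned_users_dict', author_id)` instead of the `NOT IN` subquery.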
#### Idea 3: Alter the feature

For analytical calculations, it might be acceptable to drop this particular filter.
This approach assumes that including the merge requests of banned users doesn't skew
the statistics significantly.

diff --git a/doc/development/database/clickhouse/tiered_storage.md b/doc/development/database/clickhouse/tiered_storage.md
new file mode 100644
index 00000000000..d9026f47e28
--- /dev/null
+++ b/doc/development/database/clickhouse/tiered_storage.md
@@ -0,0 +1,138 @@

---
stage: Data Stores
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---

# Tiered Storage in ClickHouse

NOTE:
The MergeTree table engine in ClickHouse supports tiered storage.
See the documentation for [Using Multiple Block Devices for Data Storage](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes)
for details on setup and further explanation.

Quoting from the [MergeTree documentation](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes):

<!-- vale gitlab.Simplicity = NO -->

> MergeTree family table engines can store data on multiple block devices. For example,
> it can be useful when the data of a certain table are implicitly split into "hot" and "cold".
> The most recent data is regularly requested but requires only a small amount of space.
> On the contrary, the fat-tailed historical data is requested rarely.

<!-- vale gitlab.Simplicity = YES -->

When used with remote storage backends such as
[Amazon S3](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-s3),
this makes a very efficient storage scheme. It allows for storage policies, which
keep data on local disks for a period of time and then move it to object storage.

An [example configuration](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes_configure) can look like this:

```xml
<storage_configuration>
    <disks>
        <fast_ssd>
            <path>/mnt/fast_ssd/clickhouse/</path>
        </fast_ssd>
        <gcs>
            <support_batch_delete>false</support_batch_delete>
            <type>s3</type>
            <endpoint>https://storage.googleapis.com/${BUCKET_NAME}/${ROOT_FOLDER}/</endpoint>
            <access_key_id>${SERVICE_ACCOUNT_HMAC_KEY}</access_key_id>
            <secret_access_key>${SERVICE_ACCOUNT_HMAC_SECRET}</secret_access_key>
            <metadata_path>/var/lib/clickhouse/disks/gcs/</metadata_path>
        </gcs>
        ...
    </disks>
    ...
    <policies>
        <move_from_local_disks_to_gcs> <!-- policy name -->
            <volumes>
                <hot> <!-- volume name -->
                    <disk>fast_ssd</disk> <!-- disk name -->
                </hot>
                <cold>
                    <disk>gcs</disk>
                </cold>
            </volumes>
            <move_factor>0.2</move_factor>
            <!-- The move factor determines when to move data from hot volume to cold.
                 See ClickHouse docs for more details. -->
        </move_from_local_disks_to_gcs>
        ....
    </policies>
</storage_configuration>
```

In this storage policy, two volumes are defined: `hot` and `cold`. When the `hot` volume
runs out of free space (as controlled by `move_factor`), the data is moved to
Google Cloud Storage (GCS).

If this storage policy is not the default, create tables by attaching the storage policy. For example:

```sql
CREATE TABLE key_value_table (
    event_date Date,
    key String,
    value String
) ENGINE = MergeTree
ORDER BY (key)
PARTITION BY toYYYYMM(event_date)
SETTINGS storage_policy = 'move_from_local_disks_to_gcs'
```

NOTE:
In this storage policy, the move happens implicitly. It is also possible to keep
_hot_ data on local disks for a fixed period of time and then move it to the _cold_ volume.

This approach is possible with
[Table TTLs](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree-table-ttl),
which are also available with the MergeTree table engine.

The ClickHouse documentation shows this feature in detail, in the example of
[implementing a hot - warm - cold architecture](https://clickhouse.com/docs/en/guides/developer/ttl/#implementing-a-hotwarmcold-architecture).

You can take a similar approach for the example shown above. First, adjust the storage policy:

```xml
<storage_configuration>
    ...
    <policies>
        <local_disk_and_gcs> <!-- policy name -->
            <volumes>
                <hot> <!-- volume name -->
                    <disk>fast_ssd</disk> <!-- disk name -->
                </hot>
                <cold>
                    <disk>gcs</disk>
                </cold>
            </volumes>
        </local_disk_and_gcs>
        ....
    </policies>
</storage_configuration>
```

Then create the table as:

```sql
CREATE TABLE another_key_value_table (
    event_date Date,
    key String,
    value String
) ENGINE = MergeTree
ORDER BY (key)
PARTITION BY toYYYYMM(event_date)
TTL
    event_date TO VOLUME 'hot',
    event_date + INTERVAL 1 YEAR TO VOLUME 'cold'
SETTINGS storage_policy = 'local_disk_and_gcs';
```

This creates the table so that data older than 1 year (evaluated against the
`event_date` column) is moved to GCS. Such a storage policy can be helpful for append-only
tables (like audit events) where only the most recent data is accessed frequently.
You can also drop the data altogether, which can be a regulatory requirement.
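To verify that parts are actually moving between the two volumes, you can inspect the `system.parts` system table. A small sketch, assuming the table above exists in the current database:

```sql
-- List the active data parts of the table and the disk each part currently lives on.
SELECT partition, name, rows, disk_name
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'another_key_value_table'
  AND active
ORDER BY partition;
```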
We don't mention modifying TTLs in this guide, but that is possible as well.
See the ClickHouse documentation on
[modifying TTL](https://clickhouse.com/docs/en/sql-reference/statements/alter/ttl/#modify-ttl)
for details.

diff --git a/doc/development/database/index.md b/doc/development/database/index.md
index 8f22eaac496..eaf37cd7699 100644
--- a/doc/development/database/index.md
+++ b/doc/development/database/index.md
@@ -109,6 +109,8 @@ including the major methods:
 - [Introduction](clickhouse/index.md)
 - [Optimizing query execution](clickhouse/optimization.md)
 - [Rebuild GitLab features using ClickHouse 1: Activity data](clickhouse/gitlab_activity_data.md)
+- [Rebuild GitLab features using ClickHouse 2: Merge Request analytics](clickhouse/merge_request_analytics.md)
+- [Tiered Storage in ClickHouse](clickhouse/tiered_storage.md)

 ## Miscellaneous