From 9cbd5994c059769ecd5794574b5df84082f06600 Mon Sep 17 00:00:00 2001
From: James Ramsay
Date: Mon, 5 Aug 2019 11:41:31 +0000
Subject: Add minimal partial clone docs

Partial Clone and Sparse Checkout are the native Git approach to
enormous repositories. We should document the state of these features
and update as our support and Git's support for these features improves.
---
 doc/topics/git/partial_clone.md | 147 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 147 insertions(+)
 create mode 100644 doc/topics/git/partial_clone.md

diff --git a/doc/topics/git/partial_clone.md b/doc/topics/git/partial_clone.md
new file mode 100644
index 00000000000..9b8cf269684
--- /dev/null
+++ b/doc/topics/git/partial_clone.md
@@ -0,0 +1,147 @@
+# Partial Clone for Large Repositories
+
+CAUTION: **Alpha:**
+Partial Clone is an experimental feature, and will significantly increase
+Gitaly resource utilization when performing a partial clone, and decrease
+performance of subsequent fetch operations.
+
+As Git repositories become very large, usability decreases as performance
+decreases. One major challenge is cloning the repository, because Git will
+download the entire repository including every commit and every version of
+every object. This can be slow to transfer, and require large amounts of disk
+space.
+
+Historically, performing a **shallow clone**
+([`--depth`](https://www.git-scm.com/docs/git-clone#Documentation/git-clone.txt---depthltdepthgt))
+has been the only way to reduce the amount of data transferred when cloning
+a Git repository. This does not, however, allow filtering by sub-tree which is
+important for monolithic repositories containing many projects, or by object
+size preventing unnecessary large objects being downloaded.
+
+[Partial clone](https://github.com/git/git/blob/master/Documentation/technical/partial-clone.txt)
+is a performance optimization that "allows Git to function without having a
+complete copy of the repository. The goal of this work is to allow Git better
+handle extremely large repositories."
+
+Specifically, using partial clone, it should be possible for Git to natively
+support:
+
+- large objects, instead of using [Git LFS](https://git-lfs.github.com/)
+- enormous repositories
+
+Briefly, partial clone works by:
+
+- excluding objects from being transferred when cloning or fetching a
+repository using a new `--filter` flag
+- downloading missing objects on demand
+
+Follow [Git for enormous repositories](https://gitlab.com/groups/gitlab-org/-/epics/773) for roadmap and updates.
+
+## Enabling partial clone
+
+GitLab 12.1 uses Git 2.21.0, which has an arbitrary file access security
+vulnerability when `uploadpack.allowFilter` is enabled, so this option should
+not be enabled in production environments.
+
+A feature flag is planned to enable `uploadpack.allowFilter` and
+`uploadpack.allowAnySHA1InWant` once the version of Git used by GitLab has been
+updated to Git 2.22.0.
+
+Follow [this issue](https://gitlab.com/gitlab-org/gitaly/issues/1553) for
+updates.
+
+## Excluding objects by size
+
+Partial Clone allows large objects to be stored directly in the Git repository,
+and be excluded from clones as desired by the user. This eliminates the
+error-prone process of deciding which objects should be stored in LFS or not.
+Using partial clone, all files – large or small – may be treated the same.
+
+With the `uploadpack.allowFilter` and `uploadpack.allowAnySHA1InWant` options
+enabled on the Git server:
+
+```bash
+# clone the repo, excluding blobs larger than 1 megabyte
+git clone --filter=blob:limit=1m git@gitlab.com:example/jumbo-repo
+
+# in the checkout step of the clone, and any subsequent operations
+# any blobs that are needed will be downloaded on demand
+git -C jumbo-repo checkout feature-branch
+```
+
+## Excluding objects by path
+
+Partial Clone allows clones to be filtered by path using a format similar to a
+`.gitignore` file stored inside the repository.
+
+With the `uploadpack.allowFilter` and `uploadpack.allowAnySHA1InWant` options
+enabled on the Git server:
+
+1. **Create a filter spec.** For example, consider a monolithic repository with
+many applications, each in a different subdirectory in the root. Create a file
+`shiny-app/.gitfilterspec` using the GitLab web interface:
+    
+    ```.gitignore
+    # Only the paths listed in the file will be downloaded when performing a
+    # partial clone using `--filter=sparse:oid=master:shiny-app/.gitfilterspec`
+    
+    # Explicitly include filterspec needed to configure sparse checkout with
+    # git config --local core.sparsecheckout true
+    # git show master:shiny-app/.gitfilterspec >> .git/info/sparse-checkout
+    shiny-app/.gitfilterspec
+    
+    # Shiny App
+    shiny-app/
+    
+    # Dependencies
+    shimmery-app/
+    shared-component-a/
+    shared-component-b/
+    ```
+
+2. **Create a new Git repository and fetch.** Support for `--filter=sparse:oid`
+using the clone command is incomplete, so we will emulate the clone command
+by hand, using `git init` and `git fetch`. Follow
+[gitaly#1769](https://gitlab.com/gitlab-org/gitaly/issues/1769) for updates.
+
+   ```bash
+   # Create a new directory for the Git repository
+   mkdir jumbo-repo && cd jumbo-repo
+   
+   # Initialize a new Git repository
+   git init
+   
+   # Add the remote
+   git remote add origin git@gitlab.com:example/jumbo-repo
+   
+   # Enable partial clone support for the remote
+   git config --local extensions.partialClone origin
+   
+   # Fetch the filtered set of objects using the filterspec stored on the
+   # server. WARNING: this step is slow!
+   git fetch --filter=sparse:oid=master:shiny-app/.gitfilterspec origin
+   
+   # Optional: observe there are missing objects that we have not fetched
+   git rev-list --all --quiet --objects --missing=print | wc -l
+   ```
+   
+   CAUTION: **IDE and Shell integrations:**
+   Git integrations with `bash`, `zsh`, etc. and editors that automatically
+   show Git status information often run `git fetch` which will fetch the
+   entire repository. You may need to disable or reconfigure these
+   integrations.
+
+3. **Sparse checkout** must be enabled and configured to prevent objects from
+other paths being downloaded automatically when checking out branches. Follow
+[gitaly#1765](https://gitlab.com/gitlab-org/gitaly/issues/1765) for updates.
+
+   ```bash
+   # Enable sparse checkout
+   git config --local core.sparsecheckout true
+   
+   # Configure sparse checkout
+   git show master:shiny-app/.gitfilterspec >> .git/info/sparse-checkout
+
+   # Checkout master
+   git checkout master
+   ```
-- 
cgit v1.2.1

From 828c52d092744e6381f7c1534d932ff4127e864e Mon Sep 17 00:00:00 2001
From: Achilleas Pipinellis
Date: Mon, 5 Aug 2019 12:27:58 +0000
Subject: Fix some Markdown lint errors

Introduced in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30295
---
 doc/topics/git/partial_clone.md | 66 ++++++++++++++++++++---------------------
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/doc/topics/git/partial_clone.md b/doc/topics/git/partial_clone.md
index 9b8cf269684..f2951308ba1 100644
--- a/doc/topics/git/partial_clone.md
+++ b/doc/topics/git/partial_clone.md
@@ -32,7 +32,7 @@ support:
 Briefly, partial clone works by:
 
 - excluding objects from being transferred when cloning or fetching a
-repository using a new `--filter` flag
+  repository using a new `--filter` flag
 - downloading missing objects on demand
 
 Follow [Git for enormous repositories](https://gitlab.com/groups/gitlab-org/-/epics/773) for roadmap and updates.
@@ -78,53 +78,53 @@ With the `uploadpack.allowFilter` and `uploadpack.allowAnySHA1InWant` options
 enabled on the Git server:
 
 1. **Create a filter spec.** For example, consider a monolithic repository with
-many applications, each in a different subdirectory in the root. Create a file
-`shiny-app/.gitfilterspec` using the GitLab web interface:
-    
-    ```.gitignore
-    # Only the paths listed in the file will be downloaded when performing a
-    # partial clone using `--filter=sparse:oid=master:shiny-app/.gitfilterspec`
-    
-    # Explicitly include filterspec needed to configure sparse checkout with
-    # git config --local core.sparsecheckout true
-    # git show master:shiny-app/.gitfilterspec >> .git/info/sparse-checkout
-    shiny-app/.gitfilterspec
-    
-    # Shiny App
-    shiny-app/
-    
-    # Dependencies
-    shimmery-app/
-    shared-component-a/
-    shared-component-b/
-    ```
+   many applications, each in a different subdirectory in the root. Create a file
+   `shiny-app/.gitfilterspec` using the GitLab web interface:
+
+   ```.gitignore
+   # Only the paths listed in the file will be downloaded when performing a
+   # partial clone using `--filter=sparse:oid=master:shiny-app/.gitfilterspec`
+
+   # Explicitly include filterspec needed to configure sparse checkout with
+   # git config --local core.sparsecheckout true
+   # git show master:shiny-app/.gitfilterspec >> .git/info/sparse-checkout
+   shiny-app/.gitfilterspec
+
+   # Shiny App
+   shiny-app/
+
+   # Dependencies
+   shimmery-app/
+   shared-component-a/
+   shared-component-b/
+   ```
 
 2. **Create a new Git repository and fetch.** Support for `--filter=sparse:oid`
-using the clone command is incomplete, so we will emulate the clone command
-by hand, using `git init` and `git fetch`. Follow
-[gitaly#1769](https://gitlab.com/gitlab-org/gitaly/issues/1769) for updates.
+   using the clone command is incomplete, so we will emulate the clone command
+   by hand, using `git init` and `git fetch`. Follow
+   [gitaly#1769](https://gitlab.com/gitlab-org/gitaly/issues/1769) for updates.
 
    ```bash
    # Create a new directory for the Git repository
    mkdir jumbo-repo && cd jumbo-repo
-   
+
    # Initialize a new Git repository
    git init
-   
+
    # Add the remote
    git remote add origin git@gitlab.com:example/jumbo-repo
-   
+
    # Enable partial clone support for the remote
    git config --local extensions.partialClone origin
-   
+
    # Fetch the filtered set of objects using the filterspec stored on the
    # server. WARNING: this step is slow!
    git fetch --filter=sparse:oid=master:shiny-app/.gitfilterspec origin
-   
+
    # Optional: observe there are missing objects that we have not fetched
    git rev-list --all --quiet --objects --missing=print | wc -l
    ```
-   
+
    CAUTION: **IDE and Shell integrations:**
    Git integrations with `bash`, `zsh`, etc. and editors that automatically
    show Git status information often run `git fetch` which will fetch the
@@ -132,13 +132,13 @@ by hand, using `git init` and `git fetch`. Follow
    integrations.
 
 3. **Sparse checkout** must be enabled and configured to prevent objects from
-other paths being downloaded automatically when checking out branches. Follow
-[gitaly#1765](https://gitlab.com/gitlab-org/gitaly/issues/1765) for updates.
+   other paths being downloaded automatically when checking out branches. Follow
+   [gitaly#1765](https://gitlab.com/gitlab-org/gitaly/issues/1765) for updates.
 
    ```bash
    # Enable sparse checkout
    git config --local core.sparsecheckout true
-   
+
    # Configure sparse checkout
    git show master:shiny-app/.gitfilterspec >> .git/info/sparse-checkout
 
-- 
cgit v1.2.1
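
Review note on the series above: the size filter documented in the first commit can be tried without touching a production Gitaly node by pointing a partial clone at a throwaway local repository. The sketch below is not part of either commit and rests on assumptions: the temporary paths, file names, committer identity and the 2 MiB test blob are all hypothetical, and it assumes a Git version recent enough to honor `--filter` over the `file://` transport.

```shell
#!/bin/sh
set -eu

# Throwaway "server" repository; every path and name here is illustrative.
work=$(mktemp -d)
git init --quiet "$work/server"
cd "$work/server"

printf 'small\n' > small.txt
# A 2 MiB file stands in for a large asset (above the 1 MiB filter limit).
head -c 2097152 /dev/zero > large.bin
git add small.txt large.bin
git -c user.name=Example -c user.email=example@example.com \
    commit --quiet -m "Add sample files"

# Server-side options named in the documentation being added.
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true

# Clone through file:// so the transfer goes through upload-pack;
# --no-checkout avoids the on-demand download that a checkout would trigger.
git clone --quiet --no-checkout --filter=blob:limit=1m \
    "file://$work/server" "$work/client"

# Count objects the client references but has not downloaded.
missing=$(git -C "$work/client" rev-list --all --quiet --objects --missing=print \
    | wc -l | tr -d ' ')
echo "missing objects: $missing"
```

The reported count should be non-zero, because the large blob is left on the "server" until something in the client needs it; running `git checkout` in the client afterwards would fetch it on demand.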