author | Douwe Maan <douwe@gitlab.com> | 2015-12-08 16:13:59 +0000 |
---|---|---|
committer | Douwe Maan <douwe@gitlab.com> | 2015-12-08 16:13:59 +0000 |
commit | 033947de90163aeadf0b1ae2c0f5be1b8529088b (patch) | |
tree | 82a8d489a6265e90a032d50e76f73dca9b3c7b61 | |
parent | a80f0f66e3c55e9793007bad6162eabba5c8582c (diff) | |
parent | 23f383ef69889c9829ad36afa53b5abfbf4b5511 (diff) | |
download | gitlab-ce-033947de90163aeadf0b1ae2c0f5be1b8529088b.tar.gz |
Merge branch 'sync-all-repos' into 'master'
Sync all repos
Scripts and documentation for moving repos, used on gitlab.com.
See merge request !1439
-rwxr-xr-x | bin/parallel-rsync-repos | 54 |
-rw-r--r-- | doc/operations/moving_repositories.md | 180 |
-rw-r--r-- | doc/raketasks/list_repos.md | 30 |
-rw-r--r-- | lib/tasks/gitlab/list_repos.rake | 17 |
4 files changed, 281 insertions, 0 deletions
diff --git a/bin/parallel-rsync-repos b/bin/parallel-rsync-repos
new file mode 100755
index 00000000000..21921148fa0
--- /dev/null
+++ b/bin/parallel-rsync-repos
@@ -0,0 +1,54 @@
+#!/usr/bin/env bash
+# this script should run as the 'git' user, not root, because 'root' should not
+# own intermediate directories created by rsync.
+#
+# Example invocation:
+# find /var/opt/gitlab/git-data/repositories -maxdepth 2 | \
+# parallel-rsync-repos transfer-success.log /var/opt/gitlab/git-data/repositories /mnt/gitlab/repositories
+#
+# You can also rsync to a remote destination.
+#
+# parallel-rsync-repos transfer-success.log /var/opt/gitlab/git-data/repositories user@host:/mnt/gitlab/repositories
+#
+# If you need to pass extra options to rsync, set the RSYNC variable
+#
+# env RSYNC='rsync --rsh="foo bar"' parallel-rsync-repos transfer-success.log /src dest
+#
+
+LOGFILE=$1
+SRC=$2
+DEST=$3
+
+if [ -z "$LOGFILE" ] || [ -z "$SRC" ] || [ -z "$DEST" ] ; then
+  echo "Usage: $0 LOGFILE SRC DEST"
+  exit 1
+fi
+
+if [ -z "$JOBS" ] ; then
+  JOBS=10
+fi
+
+if [ -z "$RSYNC" ] ; then
+  RSYNC=rsync
+fi
+
+if ! cd $SRC ; then
+  echo "cd $SRC failed"
+  exit 1
+fi
+
+rsyncjob() {
+  relative_dir="./${1#$SRC}"
+
+  if ! $RSYNC --delete --relative -a "$relative_dir" "$DEST" ; then
+    echo "rsync $1 failed"
+    return 1
+  fi
+
+  echo "$1" >> $LOGFILE
+}
+
+export LOGFILE SRC DEST RSYNC
+export -f rsyncjob
+
+parallel -j$JOBS --progress rsyncjob
diff --git a/doc/operations/moving_repositories.md b/doc/operations/moving_repositories.md
new file mode 100644
index 00000000000..39086b7a251
--- /dev/null
+++ b/doc/operations/moving_repositories.md
@@ -0,0 +1,180 @@
+# Moving repositories managed by GitLab
+
+Sometimes you need to move all repositories managed by GitLab to
+another filesystem or another server. In this document we will look
+at some of the ways you can copy all your repositories from
+`/var/opt/gitlab/git-data/repositories` to `/mnt/gitlab/repositories`.
+
+We will look at three scenarios: the target directory is empty, the
+target directory contains an outdated copy of the repositories, and
+how to deal with thousands of repositories.
+
+**Each of the approaches we list can/will overwrite data in the
+target directory `/mnt/gitlab/repositories`. Do not mix up the
+source and the target.**
+
+## Target directory is empty: use a tar pipe
+
+If the target directory `/mnt/gitlab/repositories` is empty the
+simplest thing to do is to use a tar pipe. This method has low
+overhead and tar is almost always already installed on your system.
+However, it is not possible to resume an interrupted tar pipe: if
+that happens then all data must be copied again.
+
+```
+# As the git user
+tar -C /var/opt/gitlab/git-data/repositories -cf - -- . |\
+  tar -C /mnt/gitlab/repositories -xf -
+```
+
+If you want to see progress, replace `-xf` with `-xvf`.
+
+### Tar pipe to another server
+
+You can also use a tar pipe to copy data to another server. If your
+'git' user has SSH access to the newserver as 'git@newserver', you
+can pipe the data through SSH.
+
+```
+# As the git user
+tar -C /var/opt/gitlab/git-data/repositories -cf - -- . |\
+  ssh git@newserver tar -C /mnt/gitlab/repositories -xf -
+```
+
+If you want to compress the data before it goes over the network
+(which will cost you CPU cycles) you can replace `ssh` with `ssh -C`.
+
+## The target directory contains an outdated copy of the repositories: use rsync
+
+If the target directory already contains a partial / outdated copy
+of the repositories it may be wasteful to copy all the data again
+with tar. In this scenario it is better to use rsync. This utility
+is either already installed on your system or easily installable
+via apt, yum etc.
+
+```
+# As the 'git' user
+rsync -a --delete /var/opt/gitlab/git-data/repositories/. \
+  /mnt/gitlab/repositories
+```
+
+The `/.` in the command above is very important, without it you can
+easily get the wrong directory structure in the target directory.
+If you want to see progress, replace `-a` with `-av`.
+
+### Single rsync to another server
+
+If the 'git' user on your source system has SSH access to the target
+server you can send the repositories over the network with rsync.
+
+```
+# As the 'git' user
+rsync -a --delete /var/opt/gitlab/git-data/repositories/. \
+  git@newserver:/mnt/gitlab/repositories
+```
+
+## Thousands of Git repositories: use one rsync per repository
+
+Every time you start an rsync job it has to inspect all files in
+the source directory, all files in the target directory, and then
+decide what files to copy or not. If the source or target directory
+has many contents this startup phase of rsync can become a burden
+for your GitLab server. In cases like this you can make rsync's
+life easier by dividing its work in smaller pieces, and sync one
+repository at a time.
+
+In addition to rsync we will use [GNU
+Parallel](http://www.gnu.org/software/parallel/). This utility is
+not included in GitLab so you need to install it yourself with apt
+or yum. Also note that the GitLab scripts we used below were added
+in GitLab 8.1.
+
+** This process does not clean up repositories at the target location that no
+longer exist at the source. ** If you start using your GitLab instance with
+`/mnt/gitlab/repositories`, you need to run `gitlab-rake gitlab:cleanup:repos`
+after switching to the new repository storage directory.
+
+### Parallel rsync for all repositories known to GitLab
+
+This will sync repositories with 10 rsync processes at a time. We keep
+track of progress so that the transfer can be restarted if necessary.
+
+First we create a new directory, owned by 'git', to hold transfer
+logs. We assume the directory is empty before we start the transfer
+procedure, and that we are the only ones writing files in it.
+
+```
+# Omnibus
+sudo mkdir /var/opt/gitlab/transfer-logs
+sudo chown git:git /var/opt/gitlab/transfer-logs
+
+# Source
+sudo -u git -H mkdir /home/git/transfer-logs
+```
+
+We seed the process with a list of the directories we want to copy.
+
+```
+# Omnibus
+sudo -u git sh -c 'gitlab-rake gitlab:list_repos > /var/opt/gitlab/transfer-logs/all-repos-$(date +%s).txt'
+
+# Source
+cd /home/git/gitlab
+sudo -u git -H sh -c 'bundle exec rake gitlab:list_repos > /home/git/transfer-logs/all-repos-$(date +%s).txt'
+```
+
+Now we can start the transfer. The command below is idempotent, and
+the number of jobs done by GNU Parallel should converge to zero. If it
+does not some repositories listed in all-repos-1234.txt may have been
+deleted/renamed before they could be copied.
+
+```
+# Omnibus
+sudo -u git sh -c '
+cat /var/opt/gitlab/transfer-logs/* | sort | uniq -u |\
+  /usr/bin/env JOBS=10 \
+  /opt/gitlab/embedded/service/gitlab-rails/bin/parallel-rsync-repos \
+  /var/opt/gitlab/transfer-logs/succes-$(date +%s).log \
+  /var/opt/gitlab/git-data/repositories \
+  /mnt/gitlab/repositories
+'
+
+# Source
+cd /home/git/gitlab
+sudo -u git -H sh -c '
+cat /home/git/transfer-logs/* | sort | uniq -u |\
+  /usr/bin/env JOBS=10 \
+  bin/parallel-rsync-repos \
+  /home/git/transfer-logs/succes-$(date +%s).log \
+  /home/git/repositories \
+  /mnt/gitlab/repositories
+'
+```
+
+### Parallel rsync only for repositories with recent activity
+
+Suppose you have already done one sync that started after 2015-10-1 12:00 UTC.
+Then you might only want to sync repositories that were changed via GitLab
+_after_ that time. You can use the 'SINCE' variable to tell 'rake
+gitlab:list_repos' to only print repositories with recent activity.
+
+```
+# Omnibus
+sudo gitlab-rake gitlab:list_repos SINCE='2015-10-1 12:00 UTC' |\
+  sudo -u git \
+  /usr/bin/env JOBS=10 \
+  /opt/gitlab/embedded/service/gitlab-rails/bin/parallel-rsync-repos \
+  succes-$(date +%s).log \
+  /var/opt/gitlab/git-data/repositories \
+  /mnt/gitlab/repositories
+
+# Source
+cd /home/git/gitlab
+sudo -u git -H bundle exec rake gitlab:list_repos SINCE='2015-10-1 12:00 UTC' |\
+  sudo -u git -H \
+  /usr/bin/env JOBS=10 \
+  bin/parallel-rsync-repos \
+  succes-$(date +%s).log \
+  /home/git/repositories \
+  /mnt/gitlab/repositories
+```
diff --git a/doc/raketasks/list_repos.md b/doc/raketasks/list_repos.md
new file mode 100644
index 00000000000..476428eb4f5
--- /dev/null
+++ b/doc/raketasks/list_repos.md
@@ -0,0 +1,30 @@
+# Listing repository directories
+
+You can print a list of all Git repositories on disk managed by
+GitLab with the following command:
+
+```
+# Omnibus
+sudo gitlab-rake gitlab:list_repos
+
+# Source
+cd /home/git/gitlab
+sudo -u git -H bundle exec rake gitlab:list_repos RAILS_ENV=production
+```
+
+If you only want to list projects with recent activity you can pass
+a date with the 'SINCE' environment variable. The time you specify
+is parsed by the Rails [TimeZone#parse
+function](http://api.rubyonrails.org/classes/ActiveSupport/TimeZone.html#method-i-parse).
+
+```
+# Omnibus
+sudo gitlab-rake gitlab:list_repos SINCE='Sep 1 2015'
+
+# Source
+cd /home/git/gitlab
+sudo -u git -H bundle exec rake gitlab:list_repos RAILS_ENV=production SINCE='Sep 1 2015'
+```
+
+Note that the projects listed are NOT sorted by activity; they use
+the default ordering of the GitLab Rails application.
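Note on the restartable transfer described in doc/operations/moving_repositories.md: the `cat ... | sort | uniq -u` pipeline appears to work because the transfer-logs directory holds both the seed list and the success logs, so a path that has been copied occurs twice and is filtered out. The following is a minimal sketch of that behaviour, not part of the commit; the temporary directory, file names and repository paths are made up for illustration.

```shell
#!/usr/bin/env bash
# Illustration only: the log directory and repository paths are hypothetical.
logdir=$(mktemp -d)

# Seed list, in the one-path-per-line form that 'gitlab:list_repos' prints.
printf '%s\n' /repos/a/one.git /repos/a/two.git /repos/b/three.git \
  > "$logdir/all-repos-1.txt"

# Success log from an interrupted first run: only two repositories were copied.
printf '%s\n' /repos/a/one.git /repos/b/three.git \
  > "$logdir/succes-2.log"

# 'uniq -u' keeps lines that occur exactly once in the sorted input, i.e.
# paths that are in the seed list but not yet in any success log.
remaining=$(cat "$logdir"/* | sort | uniq -u)
echo "$remaining"

rm -rf "$logdir"
```

Under these assumptions only `/repos/a/two.git` is left to transfer. A path that appeared in two seed files would also occur twice and be dropped, which is presumably why the documentation asks for an empty log directory before starting.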
diff --git a/lib/tasks/gitlab/list_repos.rake b/lib/tasks/gitlab/list_repos.rake
new file mode 100644
index 00000000000..c7596e7abcb
--- /dev/null
+++ b/lib/tasks/gitlab/list_repos.rake
@@ -0,0 +1,17 @@
+namespace :gitlab do
+  task list_repos: :environment do
+    scope = Project
+    if ENV['SINCE']
+      date = Time.parse(ENV['SINCE'])
+      warn "Listing repositories with activity or changes since #{date}"
+      project_ids = Project.where('last_activity_at > ? OR updated_at > ?', date, date).pluck(:id).sort
+      namespace_ids = Namespace.where(['updated_at > ?', date]).pluck(:id).sort
+      scope = scope.where('id IN (?) OR namespace_id in (?)', project_ids, namespace_ids)
+    end
+    scope.find_each do |project|
+      base = File.join(Gitlab.config.gitlab_shell.repos_path, project.path_with_namespace)
+      puts base + '.git'
+      puts base + '.wiki.git'
+    end
+  end
+end
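Note on `rsyncjob` in bin/parallel-rsync-repos: each absolute path printed by `gitlab:list_repos` is turned into a path relative to `$SRC` before it is handed to rsync. The sketch below reproduces only that parameter expansion; the repository path is a made-up example, and the variable name `repo` stands in for the function's `$1`.

```shell
#!/usr/bin/env bash
# Illustration only: the repository path below is a hypothetical example.
SRC=/var/opt/gitlab/git-data/repositories
repo="$SRC/group/project.git"

# Same expansion as in rsyncjob: strip the $SRC prefix, then prepend './'.
# The stripped remainder keeps its leading slash, hence the double slash.
relative_dir="./${repo#$SRC}"
echo "$relative_dir"
```

This prints `.//group/project.git`. Because the script has already changed into `$SRC`, the relative path still points at the repository, and with `--relative` rsync should recreate `group/project.git` under the destination rather than copying the contents directly into it.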