From 69d6d1a76de7c9f4c1274ada238fe5295fe7dc30 Mon Sep 17 00:00:00 2001
From: Sam Thursfield
Date: Fri, 13 Oct 2017 13:10:15 +0100
Subject: Rename README so it gets displayed in GitLab

---
 README.md   | 584 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 README.mdwn | 584 ------------------------------------------------------------
 2 files changed, 584 insertions(+), 584 deletions(-)
 create mode 100644 README.md
 delete mode 100644 README.mdwn

diff --git a/README.md b/README.md
new file mode 100644
index 00000000..2f4c08d5
--- /dev/null
+++ b/README.md
@@ -0,0 +1,584 @@

Baserock project public infrastructure
======================================

This repository contains the definitions for all of the Baserock Project's
infrastructure. This includes every service used by the project, except for
the mailing lists (hosted by [Pepperfish]), the wiki (hosted by [Branchable])
and the GitLab CI runners (set up by Javier Jardón).

Some of these systems are Baserock systems. This has proved an obstacle to
keeping them up to date with security updates, and we plan to switch everything
to run on mainstream distros in future.

All files necessary for (re)deploying the systems should be contained in this
Git repository. Private tokens should be encrypted using
[ansible-vault](https://www.ansible.com/blog/2014/02/19/ansible-vault).

[Pepperfish]: http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo
[Branchable]: http://www.branchable.com/


General notes
-------------

When instantiating a machine that will be public, remember to give shell
access to everyone on the ops team. This can be done using a post-creation
customisation script that injects all of their SSH keys. The SSH public
keys of the Baserock Operations team are collected in
`baserock-ops-team.cloud-config`.

Ensure SSH password login is disabled in all systems you deploy! See:  for
why. The Ansible playbook `admin/sshd_config.yaml` can ensure that all systems
have password login disabled.


Administration
--------------

You can use [Ansible] to automate tasks on the baserock.org systems.

To run a playbook:

    ansible-playbook -i hosts $PLAYBOOK.yaml

To run an ad-hoc command (upgrading, for example):

    ansible -i hosts fedora -m command -a 'sudo dnf update -y'
    ansible -i hosts ubuntu -m command -a 'sudo apt-get update -y'

[Ansible]: http://www.ansible.com


Security updates
----------------

Fedora security updates can be watched here: . Ubuntu issues security
advisories here: . The Baserock reference systems don't have such a service.
The [LWN Alerts](https://lwn.net/Alerts/) service gives you info from all
major Linux distributions.

If there is a vulnerability discovered in some software we use, we might need
to upgrade all of the systems that use that component at baserock.org.

Bear in mind some systems are not accessible except via the frontend-haproxy
system. Those are usually less at risk than those that face the web directly.
Also bear in mind we use OpenStack security groups to block most ports.

### Prepare the patch for Baserock systems

First, you need to update the Baserock reference system definitions with a
fixed version of the component. Build that and test that it works. Submit
the patch to gerrit.baserock.org, get it reviewed, and merged. Then cherry-pick
that patch into infrastructure.git.
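For example, once the fix has landed in definitions.git, the cherry-pick into
an infrastructure.git checkout might look like this (the commit ID is a
placeholder):

    git fetch git://git.baserock.org/baserock/baserock/definitions master
    # <commit-id-of-the-fix> stands in for the SHA of the merged fix
    git cherry-pick <commit-id-of-the-fix>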
This is a long-winded process. There are shortcuts you can take, although
someone still has to complete the full process described above at some point.

* You can modify the infrastructure.git definitions directly and start rebuilding
  the infrastructure systems right away, to avoid waiting for the Baserock patch
  review process.

* You can add the new version of the component as a stratum that sits above
  everything else in the build graph. For example, to do a 'hot-fix' for GLIBC,
  add a 'glibc-hotfix' stratum containing the new version to all of the systems
  you need to upgrade. Rebuilding them will be quick because you just need to
  build GLIBC, and can reuse the cached artifacts for everything else. The new
  GLIBC will overwrite the one that is lower down in the build graph in the
  resulting filesystem. Of course, if the new version of the component is not
  ABI compatible then this approach will break things. Be careful.

### Check the inventory

Make sure the Ansible inventory file is up to date, and that you have access to
all machines. Run this:

    ansible \* -i ./hosts -m ping

You should see lots of this sort of output:

    mail | success >> {
        "changed": false,
        "ping": "pong"
    }

    frontend-haproxy | success >> {
        "changed": false,
        "ping": "pong"
    }

You may find some host key errors like this:

    paste | FAILED => SSH Error: Host key verification failed.
    It is sometimes useful to re-run the command using -vvvv, which prints SSH
    debug output to help diagnose the issue.

If you have a host key problem, that could be because somebody redeployed
the system since the last time you connected to it with SSH, and did not
transfer the SSH host keys from the old system to the new system. Check with
other ops team members about this. If you are sure the new host keys can
be trusted, you can remove the old ones with `ssh-keygen -R 192.168.x.y`,
where 192.168.x.y is the internal IP address of the machine. You'll then be
prompted to accept the new ones when you run Ansible again.

Once all machines respond to the Ansible 'ping' module, double-check that
every machine you can see in the OpenStack Horizon dashboard has a
corresponding entry in the 'hosts' file, to ensure the next steps operate
on all of the machines.

### Check and upgrade Fedora systems

> Bear in mind that only the latest 2 versions of Fedora receive security
> updates. If any machines are not running the latest version of Fedora,
> you should redeploy them with the latest version. See the instructions below
> on how to (re)deploy each machine. You should deploy a new instance of a
> system and test it *before* terminating the existing instance. Switching
> over should be a matter of changing either its floating IP address or the
> IP address in baserock_frontend/haproxy.cfg.

You can find out what version of Fedora is in use with this command:

    ansible fedora -i hosts -m setup -a 'filter=ansible_distribution_version'

Check what version of a package is in use with this command (using GLIBC as an
example). You can compare this against Fedora package changelogs at
[Koji](https://koji.fedoraproject.org).

    ansible fedora -i hosts -m command -a 'rpm -q glibc --qf "%{VERSION}.%{RELEASE}\n"'

You can see what updates are available using the `dnf updateinfo info` command:

    ansible -i hosts fedora -m command -a 'dnf updateinfo info glibc'

You can then use `dnf upgrade -y` to install all available updates, or give the
name of a package to upgrade just that package. Be aware that DNF is quite
slow, and if you forget to pass `-y` then it will hang forever waiting for
input.
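For example, to upgrade just GLIBC across the Fedora machines (the package
name here is only an illustration; substitute whatever component is affected):

    ansible -i hosts fedora -m command -a 'sudo dnf upgrade -y glibc'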
You will then need to restart services. The `dnf needs-restarting` command
might be useful, but rebooting the whole machine is probably easiest.

### Check and upgrade Ubuntu systems

> Bear in mind that only the latest release and the latest LTS release of
> Ubuntu receive any security updates.

Find out what version of Ubuntu is in use with this command:

    ansible ubuntu -i hosts -m setup -a 'filter=ansible_distribution_version'

Check what version of a given package is in use with this command (using GLIBC
as an example).

    ansible -i hosts ubuntu -m command -a 'dpkg-query --show libc6'

Check for available updates, and what they contain:

    ansible -i hosts ubuntu -m command -a 'apt-cache policy libc6'
    ansible -i hosts ubuntu -m command -a 'apt-get changelog libc6' | head -n 20

You can update all the packages with:

    ansible -i hosts ubuntu -m command -a 'apt-get upgrade -y' --sudo

You will then need to restart services. Rebooting the machine is probably
easiest.

### Check and upgrade Baserock systems

Check what version of a given package is in use with this command (using GLIBC
as an example). Ideally the Baserock reference systems would have a query tool
for this info, but for now we have to look at the JSON metadata file directly.

    ansible -i hosts baserock -m command \
        -a "grep '\"\(sha1\|repo\|original_ref\)\":' /baserock/glibc-bins.meta"

The default Baserock machine layout uses Btrfs for the root filesystem. Filling
up a Btrfs disk results in unpredictable behaviour. Before deploying any system
upgrades, check that each machine has enough free disk space to hold an
upgrade. Allow at least 4GB of free space, to be safe.

    ansible -i hosts baserock -m command -a "df -h /"

A good way to free up space is to remove old system-versions using the
`system-version-manager` tool. There may be other things that are
unnecessarily taking up space in the root file system, too.

Ideally, at this point you've prepared a patch for definitions.git to fix
the security issue in the Baserock reference systems, and it has been merged.
In that case, pull from the reference systems into infrastructure.git, using
`git pull git://git.baserock.org/baserock/baserock/definitions master`.

If the necessary patch isn't merged in definitions.git, it's still best to
merge 'master' from there into infrastructure.git, and then cherry-pick the
patch from Gerrit on top.

You then need to build and upgrade the systems one by one. Do this from the
'devel-system' machine in the same OpenStack cloud that hosts the
infrastructure. Baserock upgrades currently involve transferring the whole
multi-gigabyte system image, so you *must* have a fast connection to the
target.

Each Baserock system has its own deployment instructions. Each should have
a deployment .morph file that you can pass to `morph upgrade`. For example,
to deploy an upgrade to git.baserock.org:

    morph upgrade --local-changes=ignore \
        baserock_trove/baserock_trove.morph gbo.VERSION_LABEL=2016-02-19

Once this completes successfully, rebooting the system should bring up the
new version. You may want to check that the new `/etc` is correct; you can
do this inside the machine by mounting `/dev/vda` and looking in
`systems/$VERSION_LABEL/run/etc`.
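A minimal check from inside the machine might look like this (a sketch: the
mount point and the version label are example values):

    mount /dev/vda /mnt
    ls /mnt/systems/2016-02-19/run/etc   # version label shown is an example
    umount /mnt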
If you want to revert the upgrade, use `system-version-manager list` and
`system-version-manager set-default <version>` to set the previous
version as the default, then reboot. If the system doesn't boot at all,
reboot it while you have the graphical console open in Horizon, and you
should be able to press `ESC` fast enough to get the boot menu open. This
will allow booting into previous versions of the system. (You shouldn't
have any problems, though, since of course we test everything regularly.)

Beware of .

For cache.baserock.org, you can reuse the deployment instructions for
git.baserock.org. Try:

    morph upgrade --local-changes=ignore \
        baserock_trove/baserock_trove.morph \
        gbo.update-location=root@cache.baserock.org \
        gbo.VERSION_LABEL=2016-02-19


Deployment to OpenStack
-----------------------

The intention is that all of the systems defined here are deployed to an
OpenStack cloud. The instructions here hardcode some details about the specific
tenancy at [DataCentred](http://www.datacentred.io) that the Baserock project
uses. It should be easy to adapt them for other OpenStack hosts, though.

### Credentials

The instructions below assume you have the following environment variables set
according to the OpenStack host you are deploying to:

 - `OS_AUTH_URL`
 - `OS_TENANT_NAME`
 - `OS_USERNAME`
 - `OS_PASSWORD`

When using `morph deploy` to deploy to OpenStack, you will need to set these
variables instead, because currently Morph does not honour the standard ones.
See: .

 - `OPENSTACK_USER=$OS_USERNAME`
 - `OPENSTACK_PASSWORD=$OS_PASSWORD`
 - `OPENSTACK_TENANT=$OS_TENANT_NAME`

The `location` field in the deployment .morph file will also need to point to
the correct `$OS_AUTH_URL`.
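Concretely, before running `morph deploy` you might set something like the
following (a minimal sketch; the values simply mirror the standard OpenStack
variables listed above):

    export OPENSTACK_USER="$OS_USERNAME"
    export OPENSTACK_PASSWORD="$OS_PASSWORD"
    export OPENSTACK_TENANT="$OS_TENANT_NAME"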
### Firewall / Security Groups

The instructions assume the presence of a set of security groups. You can
create these by running the following Ansible playbook:

    ansible-playbook -i hosts firewall.yaml

### Placeholders

The commands below use a couple of placeholders like `$network_id`; you can set
them in your environment so that you can copy and paste the commands below
as-is.

 - `export fedora_image_id=...` (find this with `glance image-list`)
 - `export network_id=...` (find this with `neutron net-list`)
 - `export keyname=...` (find this with `nova keypair-list`)

The `$fedora_image_id` should reference a Fedora Cloud image. You can import
these from . At the time of writing, these instructions were tested with
Fedora Cloud 23 for x86_64.


Backups
-------

Backups of git.baserock.org's data volume are run by, and stored on, a
Codethink-managed machine named 'access'. They will need to be migrated off
this system before long. The backups are taken without pausing services or
snapshotting the data, so they will not be 100% clean. The current
git.baserock.org data volume does not use LVM and cannot be easily snapshotted.

Backups of 'gerrit' and 'database' are handled by the
'baserock_backup/backup.py' script. This currently runs on an instance in
Codethink's internal OpenStack cloud.

Instances themselves are not backed up. In the event of a crisis we will
redeploy them from the infrastructure.git repository. There should be nothing
valuable stored outside of the data volumes that are backed up.

To prepare the infrastructure to run the backup scripts you will need to run
the following playbooks:

    ansible-playbook -i hosts baserock_frontend/instance-backup-config.yml
    ansible-playbook -i hosts baserock_database/instance-backup-config.yml
    ansible-playbook -i hosts baserock_gerrit/instance-backup-config.yml

NOTE: to run these playbooks you need to have the public SSH key of the backup
instance in `keys/backup.key.pub`.


Systems
-------

### Front-end

The front-end provides a reverse proxy, to allow more flexible routing than
simply pointing each subdomain to a different instance using separate public
IPs. It also provides a starting point for future load-balancing and failover
configuration.

To deploy this system:

    nova boot frontend-haproxy \
        --key-name=$keyname \
        --flavor=dc1.1x0 \
        --image=$fedora_image_id \
        --nic="net-id=$network_id" \
        --security-groups default,gerrit,shared-artifact-cache,web-server \
        --user-data ./baserock-ops-team.cloud-config
    ansible-playbook -i hosts baserock_frontend/image-config.yml
    ansible-playbook -i hosts baserock_frontend/instance-config.yml
    ansible-playbook -i hosts baserock_frontend/instance-backup-config.yml

    ansible -i hosts -m service -a 'name=haproxy enabled=true state=started' \
        --sudo frontend-haproxy

The baserock_frontend system is stateless.

Full HAProxy 1.5 documentation: .

If you want to add a new service to the Baserock Project infrastructure via
the frontend, do the following:

- request a subdomain that points at 185.43.218.170 (the frontend)
- alter the haproxy.cfg file in the baserock_frontend/ directory in this repo
  as necessary to proxy requests to the real instance
- run the baserock_frontend/instance-config.yml playbook
- run `ansible -i hosts -m service -a 'name=haproxy enabled=true
  state=restarted' --sudo frontend-haproxy`

OpenStack doesn't provide any kind of internal DNS service, so you must use
the fixed IP address of each instance in the proxy configuration.

The internal IP address of this machine is hardcoded in some places (beyond
the usual haproxy.cfg file); use `git grep` to find all of them. You'll need
to update all the relevant config files. We really need some internal DNS
system to avoid this hassle.

### Trove

To deploy to production, run these commands in a Baserock 'devel'
or 'build' system.

    nova volume-create \
        --display-name git.baserock.org-home \
        --display-description '/home partition of git.baserock.org' \
        --volume-type Ceph \
        300

    git clone git://git.baserock.org/baserock/baserock/infrastructure.git
    cd infrastructure

    morph build systems/trove-system-x86_64.morph
    morph deploy baserock_trove/baserock_trove.morph

    nova boot git.baserock.org \
        --key-name $keyname \
        --flavor 'dc1.8x16' \
        --image baserock_trove \
        --nic "net-id=$network_id,v4-fixed-ip=192.168.222.58" \
        --security-groups default,git-server,web-server,shared-artifact-cache \
        --user-data baserock-ops-team.cloud-config

    nova volume-attach git.baserock.org <volume-id> /dev/vdb

    # Note: if this floating IP is not available, you will have to change
    # the DNS records at the DNS provider.
    nova add-floating-ip git.baserock.org 185.43.218.183

    ansible-playbook -i hosts baserock_trove/instance-config.yml

    # Before configuring the Trove you will need to create some SSH
    # keys for it. You can also use existing keys.

    mkdir private
    ssh-keygen -N '' -f private/lorry.key
    ssh-keygen -N '' -f private/worker.key
    ssh-keygen -N '' -f private/admin.key

    # Now you can finish the configuration of the Trove with:

    ansible-playbook -i hosts baserock_trove/configure-trove.yml
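Once the playbooks have finished, a quick sanity check of the new Trove might
look like this (a sketch using the public endpoints mentioned elsewhere in
this document; adjust the hostname if the instance is not yet behind the
production DNS):

    # Assumes DNS already points at this Trove
    curl -I https://git.baserock.org/
    git ls-remote git://git.baserock.org/baserock/baserock/definitions HEAD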
### OSTree artifact cache

To deploy this system to production:

    nova volume-create \
        --display-name ostree-volume \
        --display-description 'OSTree cache volume' \
        --volume-type Ceph \
        300

    nova boot ostree.baserock.org \
        --key-name $keyname \
        --flavor dc1.2x8.40 \
        --image $fedora_image_id \
        --nic "net-id=$network_id,v4-fixed-ip=192.168.222.153" \
        --security-groups default,web-server \
        --user-data ./baserock-ops-team.cloud-config

    nova volume-attach ostree.baserock.org <volume-id> /dev/vdb

    ansible-playbook -i hosts baserock_ostree/image-config.yml
    ansible-playbook -i hosts baserock_ostree/instance-config.yml
    ansible-playbook -i hosts baserock_ostree/ostree-access-config.yml

SSL certificates
================

The certificates used for our infrastructure are provided for free
by Let's Encrypt. These certificates expire every 3 months. Here we
explain how to renew the certificates, and how to deploy them.

Generation of certificates
--------------------------

> Note: This should be automated in the next upgrade. The instructions
> sound like a lot of effort.

To generate the SSL certs, first you need to clone the following repositories:

    git clone https://github.com/lukas2511/letsencrypt.sh.git
    git clone https://github.com/mythic-beasts/letsencrypt-mythic-dns01.git

The version used the first time was `0.4.0`, with SHA
`116386486b3749e4c5e1b4da35904f30f8b2749b` (noted here in case future releases
break these instructions).

Now, inside the letsencrypt.sh repo, create a `domains.txt` file listing the
domains and subdomains:

    cd letsencrypt.sh
    cat >domains.txt <<'EOF'
    baserock.org
    docs.baserock.org download.baserock.org irclogs.baserock.org ostree.baserock.org paste.baserock.org spec.baserock.org
    git.baserock.org
    EOF

And the `config` file needed:

    cat >config <<'EOF'
    CONTACT_EMAIL="admin@baserock.org"
    HOOK="../letsencrypt-mythic-dns01/letsencrypt-mythic-dns01.sh"
    CHALLENGETYPE="dns-01"
    EOF

Create a `dnsapi.config.txt` file with the decrypted contents of
`private/dnsapi.config.txt`. To show the contents of this file, run the
following in an `infrastructure.git` repo checkout:

    ansible-vault view private/dnsapi.config.txt

Now, to generate the certs, run:

    ./dehydrated -c

> If this is the first time, you will be asked to run
> `./dehydrated --register --accept-terms`

All the generated certificates will be in the `certs` folder.
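To confirm that renewal actually produced fresh certificates, you can check
the expiry date of one of them (one domain shown as an example; the file
layout is the one dehydrated produces):

    openssl x509 -noout -enddate -in certs/git.baserock.org/cert.pem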
To construct the certificate files that live in the `certs` and `private`
folders of this repository, you will have to:

    cd certs
    mkdir -p tmp/private tmp/certs

    # Create some full certs including the key, for the services that need
    # them in this form
    cat git.baserock.org/cert.csr git.baserock.org/cert.pem git.baserock.org/chain.pem git.baserock.org/privkey.pem > tmp/private/git-with-key.pem
    cat docs.baserock.org/cert.csr docs.baserock.org/cert.pem docs.baserock.org/chain.pem docs.baserock.org/privkey.pem > tmp/private/frontend-with-key.pem

    # Copy key files
    cp git.baserock.org/privkey.pem tmp/private/git.pem
    cp docs.baserock.org/privkey.pem tmp/private/frontend.pem

    # Copy cert files
    cp git.baserock.org/cert.csr tmp/certs/git.csr
    cp git.baserock.org/cert.pem tmp/certs/git.pem
    cp git.baserock.org/chain.pem tmp/certs/git-chain.pem
    cp docs.baserock.org/cert.csr tmp/certs/frontend.csr
    cp docs.baserock.org/cert.pem tmp/certs/frontend.pem
    cp docs.baserock.org/chain.pem tmp/certs/frontend-chain.pem

    # Create full certs without keys
    cat git.baserock.org/cert.csr git.baserock.org/cert.pem git.baserock.org/chain.pem > tmp/certs/git-full.pem
    cat docs.baserock.org/cert.csr docs.baserock.org/cert.pem docs.baserock.org/chain.pem > tmp/certs/frontend-full.pem

Before replacing the current ones, make sure you **encrypt** the ones that
contain keys (located in the `private` folder):

    ansible-vault encrypt tmp/private/*

And copy them into the repository:

    cp tmp/certs/* ../../certs/
    cp tmp/private/* ../../private/


Deploy certificates
-------------------

For `git.baserock.org`, just run:

    ansible-playbook -i hosts baserock_trove/configure-trove.yml

This playbook copies the certificates to the Trove and runs the scripts that
configure them.

For the frontend, run:

    ansible-playbook -i hosts baserock_frontend/instance-config.yml
    ansible -i hosts -m service -a 'name=haproxy enabled=true state=restarted' --sudo frontend-haproxy

This installs the certificates and then restarts the services that need them.


GitLab CI runners setup
=======================

Baserock uses [GitLab CI] for build and test automation. For performance
reasons we provide our own runners and avoid using the free, shared runners
provided by GitLab. The runners are hosted at [DigitalOcean] and managed by
the 'baserock' team account there.

There is a persistent 'manager' machine with a public IP of 138.68.143.2 that
runs GitLab Runner and [docker-machine]. This doesn't run any builds itself --
we use the [autoscaling feature] of GitLab Runner to spawn new VMs for building
in. The configuration for this is in `/etc/gitlab-runner/config.toml`.

Each build occurs in a Docker container on one of the transient VMs. As per
the [\[runners.docker\] section] of `config.toml`, each gets a newly created
volume mounted at `/cache`. The YBD and BuildStream cache directories are
located there because jobs were running out of disk space when using the
default configuration.

There is a second persistent machine with a public IP of 46.101.48.48 that
hosts a Docker registry and a [Minio] cache. These services run as Docker
containers. The Docker registry exists to cache the Docker images we use,
which improves the spin-up time of the transient builder VMs, as documented
[here](https://docs.gitlab.com/runner/configuration/autoscale.html#distributed-docker-registry-mirroring).
The Minio cache is used for the [distributed caching] feature of GitLab CI.
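To inspect the autoscaling setup, you can list the registered runners and the
current transient build VMs from the manager machine (a sketch; it assumes you
have root SSH access there, and uses the standard GitLab Runner and
docker-machine commands):

    # Assumes root SSH access to the manager machine
    ssh root@138.68.143.2 'gitlab-runner list; docker-machine ls'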
[GitLab CI]: https://about.gitlab.com/features/gitlab-ci-cd/
[DigitalOcean]: https://cloud.digitalocean.com/
[docker-machine]: https://docs.docker.com/machine/
[autoscaling feature]: https://docs.gitlab.com/runner/configuration/autoscale.html
[Minio]: https://www.minio.io/
[\[runners.docker\] section]: https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-docker-section
[distributed caching]: https://docs.gitlab.com/runner/configuration/autoscale.html#distributed-runners-caching