diff --git a/README.md b/README.md
new file mode 100644
index 00000000..2f4c08d5
--- /dev/null
+++ b/README.md
@@ -0,0 +1,584 @@
+Baserock project public infrastructure
+======================================
+
+This repository contains the definitions for all of the Baserock Project's
+infrastructure. This includes every service used by the project, except for
+the mailing lists (hosted by [Pepperfish]), the wiki (hosted by [Branchable]),
+and the GitLab CI runners (set up by Javier Jardón).
+
+Some of these systems are Baserock systems. This has proved an obstacle to
+keeping them up to date with security updates, and we plan to switch everything
+to run on mainstream distros in future.
+
+All files necessary for (re)deploying the systems should be contained in this
+Git repository. Private tokens should be encrypted using
+[ansible-vault](https://www.ansible.com/blog/2014/02/19/ansible-vault).
+
+[Pepperfish]: http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo
+[Branchable]: http://www.branchable.com/
+
+
+General notes
+-------------
+
+When instantiating a machine that will be public, remember to give shell
+access to everyone on the ops team. This can be done using a post-creation
+customisation script that injects all of their SSH keys. The SSH public
+keys of the Baserock Operations team are collected in
+`baserock-ops-team.cloud-config`.
+
+Ensure SSH password login is disabled in all systems you deploy! See:
+<https://testbit.eu/is-ssh-insecure/> for why. The Ansible playbook
+`admin/sshd_config.yaml` can ensure that all systems have password login
+disabled.
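+
+You can run that playbook and then spot-check the effective sshd configuration
+on each host; a minimal sketch, assuming the `hosts` inventory at the root of
+this repo:
+
+    ansible-playbook -i hosts admin/sshd_config.yaml
+    ansible \* -i hosts -m command -a '/usr/sbin/sshd -T' --sudo | grep -i passwordauthentication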
+
+
+Administration
+--------------
+
+You can use [Ansible] to automate tasks on the baserock.org systems.
+
+To run a playbook:
+
+ ansible-playbook -i hosts $PLAYBOOK.yaml
+
+To run an ad-hoc command (upgrading, for example):
+
+ ansible -i hosts fedora -m command -a 'sudo dnf update -y'
+    ansible -i hosts ubuntu -m command -a 'sudo apt-get update'
+    ansible -i hosts ubuntu -m command -a 'sudo apt-get upgrade -y'
+
+[Ansible]: http://www.ansible.com
+
+
+Security updates
+----------------
+
+Fedora security updates can be watched here:
+<https://bodhi.fedoraproject.org/updates/?type=security>. Ubuntu issues
+security advisories here: <http://www.ubuntu.com/usn/>.
+The Baserock reference systems don't have such a service. The [LWN
+Alerts](https://lwn.net/Alerts/) service summarises security alerts from all
+major Linux distributions.
+
+If there is a vulnerability discovered in some software we use, we might need
+to upgrade all of the systems that use that component at baserock.org.
+
+Bear in mind some systems are not accessible except via the frontend-haproxy
+system. Those are usually less at risk than those that face the web directly.
+Also bear in mind we use OpenStack security groups to block most ports.
+
+### Prepare the patch for Baserock systems
+
+First, you need to update the Baserock reference system definitions with a
+fixed version of the component. Build that and test that it works. Submit
+the patch to gerrit.baserock.org, get it reviewed, and merged. Then cherry
+pick that patch into infrastructure.git.
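+
+A sketch of the final cherry-pick step, assuming the fix is the tip commit on
+'master' of definitions.git:
+
+    cd infrastructure
+    git fetch git://git.baserock.org/baserock/baserock/definitions master
+    git cherry-pick FETCH_HEAD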
+
+This is a long-winded process. There are shortcuts you can take, although
+someone still has to complete the process described above at some point.
+
+* You can modify the infrastructure.git definitions directly and start rebuilding
+ the infrastructure systems right away, to avoid waiting for the Baserock patch
+ review process.
+
+* You can add the new version of the component as a stratum that sits above
+ everything else in the build graph. For example, to do a 'hot-fix' for GLIBC,
+ add a 'glibc-hotfix' stratum containing the new version to all of the systems
+ you need to upgrade. Rebuilding them will be quick because you just need to
+ build GLIBC, and can reuse the cached artifacts for everything else. The new
+ GLIBC will overwrite the one that is lower down in the build graph in the
+ resulting filesystem. Of course, if the new version of the component is not
+ ABI compatible then this approach will break things. Be careful.
+
+### Check the inventory
+
+Make sure the Ansible inventory file is up to date, and that you have access to
+all machines. Run this:
+
+ ansible \* -i ./hosts -m ping
+
+You should see lots of this sort of output:
+
+ mail | success >> {
+ "changed": false,
+ "ping": "pong"
+ }
+
+ frontend-haproxy | success >> {
+ "changed": false,
+ "ping": "pong"
+ }
+
+You may find some host key errors like this:
+
+ paste | FAILED => SSH Error: Host key verification failed.
+ It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.
+
+If you have a host key problem, that could be because somebody redeployed
+the system since the last time you connected to it with SSH, and did not
+transfer the SSH host keys from the old system to the new system. Check with
+other ops team members about this. If you are sure the new host keys can
+be trusted, you can remove the old ones with `ssh-keygen -R 192.168.x.y`,
+where 192.168.x.y is the internal IP address of the machine. You'll then be
+prompted to accept the new ones when you run Ansible again.
+
+Once all machines respond to the Ansible 'ping' module, double check that
+every machine you can see in the OpenStack Horizon dashboard has a
+corresponding entry in the 'hosts' file, to ensure the next steps operate
+on all of the machines.
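+
+A quick way to cross-check is to compare the two lists; a sketch, assuming
+your OpenStack credentials are set in the environment:
+
+    nova list                             # instances in the tenancy
+    ansible \* -i hosts --list-hosts      # instances Ansible knows about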
+
+### Check and upgrade Fedora systems
+
+> Bear in mind that only the latest 2 versions of Fedora receive security
+> updates. If any machines are not running the latest version of Fedora,
+> you should redeploy them with the latest version. See the instructions below
+> on how to (re)deploy each machine. You should deploy a new instance of a
+> system and test it *before* terminating the existing instance. Switching
+> over should be a matter of changing either its floating IP address or the
+> IP address in baserock_frontend/haproxy.cfg.
+
+You can find out what version of Fedora is in use with this command:
+
+ ansible fedora -i hosts -m setup -a 'filter=ansible_distribution_version'
+
+Check what version of a package is in use with this command (using GLIBC as an
+example). You can compare this against Fedora package changelogs at
+[Koji](https://koji.fedoraproject.org).
+
+ ansible fedora -i hosts -m command -a 'rpm -q glibc --qf "%{VERSION}.%{RELEASE}\n"'
+
+You can see what updates are available using the `dnf updateinfo info` command.
+
+ ansible -i hosts fedora -m command -a 'dnf updateinfo info glibc'
+
+You can then use `dnf upgrade -y` to install all available updates. Or give the
+name of a package to update just that package. Be aware that DNF is quite slow,
+and if you forget to pass `-y` then it will hang forever waiting for input.
+
+You will then need to restart services. The `dnf needs-restarting` command might be
+useful, but rebooting the whole machine is probably easiest.
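+
+A sketch of the upgrade-and-restart sequence for a single package, with GLIBC
+again as the example:
+
+    ansible -i hosts fedora -m command -a 'sudo dnf upgrade -y glibc'
+    ansible -i hosts fedora -m command -a 'sudo dnf needs-restarting'
+    # rebooting is often simpler than restarting individual services
+    ansible -i hosts fedora -m command -a 'sudo systemctl reboot'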
+
+### Check and upgrade Ubuntu systems
+
+> Bear in mind that only the latest release and the latest LTS release of
+> Ubuntu receive security updates.
+
+Find out what version of Ubuntu is in use with this command:
+
+ ansible ubuntu -i hosts -m setup -a 'filter=ansible_distribution_version'
+
+Check what version of a given package is in use with this command (using GLIBC
+as an example).
+
+ ansible -i hosts ubuntu -m command -a 'dpkg-query --show libc6'
+
+Check for available updates, and what they contain:
+
+ ansible -i hosts ubuntu -m command -a 'apt-cache policy libc6'
+ ansible -i hosts ubuntu -m command -a 'apt-get changelog libc6' | head -n 20
+
+You can update all the packages with:
+
+ ansible -i hosts ubuntu -m command -a 'apt-get upgrade -y' --sudo
+
+You will then need to restart services. Rebooting the machine is probably
+easiest.
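+
+On Ubuntu, packages that require a reboot to take effect create a marker
+file; checking for it shows which hosts still need restarting:
+
+    ansible -i hosts ubuntu -m command -a 'ls /var/run/reboot-required'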
+
+### Check and upgrade Baserock systems
+
+Check what version of a given package is in use with this command (using GLIBC
+as an example). Ideally Baserock reference systems would have a query tool for
+this info, but for now we have to look at the JSON metadata file directly.
+
+ ansible -i hosts baserock -m command \
+ -a "grep '\"\(sha1\|repo\|original_ref\)\":' /baserock/glibc-bins.meta"
+
+The default Baserock machine layout uses Btrfs for the root filesystem. Filling
+up a Btrfs disk results in unpredictable behaviour. Before deploying any system
+upgrades, check that each machine has enough free disk space to hold an
+upgrade. Allow for at least 4GB of free space, to be safe.
+
+ ansible -i hosts baserock -m command -a "df -h /"
+
+A good way to free up space is to remove old system-versions using the
+`system-version-manager` tool. There may be other things that are
+unnecessarily taking up space in the root file system, too.
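+
+For example, to list the system versions on each machine and then remove an
+old one (the version label here is hypothetical):
+
+    ansible -i hosts baserock -m command -a 'system-version-manager list'
+    ansible -i hosts baserock -m command -a 'system-version-manager remove 2016-01-11'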
+
+Ideally, at this point you've prepared a patch for definitions.git to fix
+the security issue in the Baserock reference systems, and it has been merged.
+In that case, pull from the reference systems into infrastructure.git, using
+`git pull git://git.baserock.org/baserock/baserock/definitions master`.
+
+If the necessary patch isn't merged in definitions.git, it's still best to
+merge 'master' from there into infrastructure.git, and then cherry-pick the
+patch from Gerrit on top.
+
+You then need to build and upgrade the systems one by one. Do this from the
+'devel-system' machine in the same OpenStack cloud that hosts the
+infrastructure. Baserock upgrades currently involve transferring the whole
+multi-gigabyte system image, so you *must* have a fast connection to the
+target.
+
+Each Baserock system has its own deployment instructions. Each should have
+a deployment .morph file that you can pass to `morph upgrade`. For example,
+to deploy an upgrade to git.baserock.org:
+
+ morph upgrade --local-changes=ignore \
+ baserock_trove/baserock_trove.morph gbo.VERSION_LABEL=2016-02-19
+
+Once this completes successfully, rebooting the system should bring up the
+new version. You may want to check that the new `/etc` is correct; you can
+do this inside the machine by mounting `/dev/vda` and looking in
+`systems/$VERSION_LABEL/run/etc`.
+
+If you want to revert the upgrade, use `system-version-manager list` and
+`system-version-manager set-default <old-version>` to set the previous
+version as the default, then reboot. If the system doesn't boot at all,
+reboot it while you have the graphical console open in Horizon, and you
+should be able to press `ESC` fast enough to get the boot menu open. This
+will allow booting into previous versions of the system. (You shouldn't
+have any problems though since of course we test everything regularly).
+
+Beware of <https://storyboard.baserock.org/#!/story/77>.
+
+For cache.baserock.org, you can reuse the deployment instructions for
+git.baserock.org. Try:
+
+    morph upgrade --local-changes=ignore \
+        baserock_trove/baserock_trove.morph \
+        gbo.update-location=root@cache.baserock.org \
+        gbo.VERSION_LABEL=2016-02-19
+
+Deployment to OpenStack
+-----------------------
+
+The intention is that all of the systems defined here are deployed to an
+OpenStack cloud. The instructions here hardcode some details about the specific
+tenancy at [DataCentred](http://www.datacentred.io) that the Baserock project
+uses. It should be easy to adapt them for other OpenStack hosts, though.
+
+### Credentials
+
+The instructions below assume you have the following environment variables set
+according to the OpenStack host you are deploying to:
+
+ - `OS_AUTH_URL`
+ - `OS_TENANT_NAME`
+ - `OS_USERNAME`
+ - `OS_PASSWORD`
+
+When using `morph deploy` to deploy to OpenStack, you will need to set these
+variables, because currently Morph does not honour the standard ones. See:
+<https://storyboard.baserock.org/#!/story/35>.
+
+ - `OPENSTACK_USER=$OS_USERNAME`
+ - `OPENSTACK_PASSWORD=$OS_PASSWORD`
+ - `OPENSTACK_TENANT=$OS_TENANT_NAME`
+
+The `location` field in the deployment .morph file will also need to point to
+the correct `$OS_AUTH_URL`.
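+
+A sketch of the variable mapping, before running `morph deploy`:
+
+    export OPENSTACK_USER="$OS_USERNAME"
+    export OPENSTACK_PASSWORD="$OS_PASSWORD"
+    export OPENSTACK_TENANT="$OS_TENANT_NAME"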
+
+### Firewall / Security Groups
+
+The instructions assume the presence of a set of security groups. You can
+create these by running the following Ansible playbook.
+
+ ansible-playbook -i hosts firewall.yaml
+
+### Placeholders
+
+The commands below use a couple of placeholders like `$network_id`. You can
+set them in your environment to allow you to copy and paste the commands
+below as-is.
+
+ - `export fedora_image_id=...` (find this with `glance image-list`)
+ - `export network_id=...` (find this with `neutron net-list`)
+ - `export keyname=...` (find this with `nova keypair-list`)
+
+The `$fedora_image_id` should reference a Fedora Cloud image. You can import
+these from <http://www.fedoraproject.org/>. At the time of writing, these
+instructions were tested with Fedora Cloud 23 for x86_64.
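+
+A sketch of importing a downloaded image with the old `glance` CLI (the image
+filename here is hypothetical):
+
+    glance image-create --name 'Fedora-Cloud-Base-23' \
+        --disk-format qcow2 --container-format bare \
+        --file Fedora-Cloud-Base-23.x86_64.qcow2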
+
+Backups
+-------
+
+Backups of git.baserock.org's data volume are run by and stored on a
+Codethink-managed machine named 'access'. They will need to migrate off this
+system before long. The backups are taken without pausing services or
+snapshotting the data, so they will not be 100% clean. The current
+git.baserock.org data volume does not use LVM and cannot be easily snapshotted.
+
+Backups of 'gerrit' and 'database' are handled by the
+'baserock_backup/backup.py' script. This currently runs on an instance in
+Codethink's internal OpenStack cloud.
+
+Instances themselves are not backed up. In the event of a crisis we will
+redeploy them from the infrastructure.git repository. There should be nothing
+valuable stored outside of the data volumes that are backed up.
+
+To prepare the infrastructure to run the backup scripts you will need to run
+the following playbooks:
+
+ ansible-playbook -i hosts baserock_frontend/instance-backup-config.yml
+ ansible-playbook -i hosts baserock_database/instance-backup-config.yml
+ ansible-playbook -i hosts baserock_gerrit/instance-backup-config.yml
+
+NOTE: to run these playbooks you need to have the public SSH key of the backups
+instance in `keys/backup.key.pub`.
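+
+If you don't have that key to hand, it can be copied from the backups
+instance (the user and path here are assumptions):
+
+    scp user@backups-instance:.ssh/id_rsa.pub keys/backup.key.pub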
+
+
+Systems
+-------
+
+### Front-end
+
+The front-end provides a reverse proxy, to allow more flexible routing than
+simply pointing each subdomain to a different instance using separate public
+IPs. It also provides a starting point for future load-balancing and failover
+configuration.
+
+To deploy this system:
+
+ nova boot frontend-haproxy \
+ --key-name=$keyname \
+ --flavor=dc1.1x0 \
+ --image=$fedora_image_id \
+ --nic="net-id=$network_id" \
+ --security-groups default,gerrit,shared-artifact-cache,web-server \
+ --user-data ./baserock-ops-team.cloud-config
+ ansible-playbook -i hosts baserock_frontend/image-config.yml
+ ansible-playbook -i hosts baserock_frontend/instance-config.yml
+ ansible-playbook -i hosts baserock_frontend/instance-backup-config.yml
+
+ ansible -i hosts -m service -a 'name=haproxy enabled=true state=started' \
+ --sudo frontend-haproxy
+
+The baserock_frontend system is stateless.
+
+Full HAProxy 1.5 documentation: <https://cbonte.github.io/haproxy-dconv/configuration-1.5.html>.
+
+If you want to add a new service to the Baserock Project infrastructure via
+the frontend, do the following:
+
+- request a subdomain that points at 185.43.218.170 (frontend)
+- alter the haproxy.cfg file in the baserock_frontend/ directory in this repo
+ as necessary to proxy requests to the real instance
+- run the baserock_frontend/instance-config.yml playbook
+- run `ansible -i hosts -m service -a 'name=haproxy enabled=true
+ state=restarted' --sudo frontend-haproxy`
+
+OpenStack doesn't provide any kind of internal DNS service, so you must use
+the fixed IP address of each instance in `haproxy.cfg`.
+
+The internal IP address of this machine is hardcoded in some places (beyond
+the usual haproxy.cfg file); use `git grep` to find all of them. You'll need
+to update all the relevant config files. We really need some internal DNS
+system to avoid this hassle.
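+
+For example, to find every file that mentions an address on the internal
+subnet:
+
+    git grep -n '192\.168\.222\.'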
+
+### Trove
+
+To deploy to production, run these commands in a Baserock 'devel'
+or 'build' system.
+
+ nova volume-create \
+ --display-name git.baserock.org-home \
+ --display-description '/home partition of git.baserock.org' \
+ --volume-type Ceph \
+ 300
+
+ git clone git://git.baserock.org/baserock/baserock/infrastructure.git
+ cd infrastructure
+
+ morph build systems/trove-system-x86_64.morph
+ morph deploy baserock_trove/baserock_trove.morph
+
+ nova boot git.baserock.org \
+ --key-name $keyname \
+ --flavor 'dc1.8x16' \
+ --image baserock_trove \
+ --nic "net-id=$network_id,v4-fixed-ip=192.168.222.58" \
+ --security-groups default,git-server,web-server,shared-artifact-cache \
+ --user-data baserock-ops-team.cloud-config
+
+ nova volume-attach git.baserock.org <volume-id> /dev/vdb
+
+    # Note: if this floating IP is not available, you will have to update
+    # the DNS records at the DNS provider.
+ nova add-floating-ip git.baserock.org 185.43.218.183
+
+ ansible-playbook -i hosts baserock_trove/instance-config.yml
+
+ # Before configuring the Trove you will need to create some ssh
+ # keys for it. You can also use existing keys.
+
+ mkdir private
+ ssh-keygen -N '' -f private/lorry.key
+ ssh-keygen -N '' -f private/worker.key
+ ssh-keygen -N '' -f private/admin.key
+
+ # Now you can finish the configuration of the Trove with:
+
+ ansible-playbook -i hosts baserock_trove/configure-trove.yml
+
+### OSTree artifact cache
+
+To deploy this system to production:
+
+ nova volume-create \
+ --display-name ostree-volume \
+ --display-description 'OSTree cache volume' \
+ --volume-type Ceph \
+ 300
+
+ nova boot ostree.baserock.org \
+ --key-name $keyname \
+ --flavor dc1.2x8.40 \
+ --image $fedora_image_id \
+ --nic "net-id=$network_id,v4-fixed-ip=192.168.222.153" \
+ --security-groups default,web-server \
+ --user-data ./baserock-ops-team.cloud-config
+
+ nova volume-attach ostree.baserock.org <volume-id> /dev/vdb
+
+ ansible-playbook -i hosts baserock_ostree/image-config.yml
+ ansible-playbook -i hosts baserock_ostree/instance-config.yml
+ ansible-playbook -i hosts baserock_ostree/ostree-access-config.yml
+
+SSL certificates
+================
+
+The certificates used for our infrastructure are provided for free
+by Let's Encrypt. These certificates expire every 3 months. Here we
+will explain how to renew the certificates, and how to deploy them.
+
+Generation of certificates
+--------------------------
+
+> Note: This should be automated in the next upgrade. The instructions
+> sound like a lot of effort.
+
+To generate the SSL certs, first you need to clone the following repositories:
+
+ git clone https://github.com/lukas2511/letsencrypt.sh.git
+ git clone https://github.com/mythic-beasts/letsencrypt-mythic-dns01.git
+
+The version used the first time was `0.4.0` with sha
+`116386486b3749e4c5e1b4da35904f30f8b2749b` (just in case future releases
+break these instructions).
+
+Now, inside the letsencrypt.sh repo, create a `domains.txt` file listing the
+subdomains:
+
+ cd letsencrypt.sh
+ cat >domains.txt <<'EOF'
+ baserock.org
+ docs.baserock.org download.baserock.org irclogs.baserock.org ostree.baserock.org paste.baserock.org spec.baserock.org
+ git.baserock.org
+ EOF
+
+Then create the `config` file it needs:
+
+ cat >config <<'EOF'
+ CONTACT_EMAIL="admin@baserock.org"
+ HOOK="../letsencrypt-mythic-dns01/letsencrypt-mythic-dns01.sh"
+ CHALLENGETYPE="dns-01"
+ EOF
+
+Create a `dnsapi.config.txt` with the decrypted contents of
+`private/dnsapi.config.txt`. To show the contents of this file, run the
+following in an `infrastructure.git` repo checkout:
+
+ ansible-vault view private/dnsapi.config.txt
+
+Now, to generate the certs, run:
+
+ ./dehydrated -c
+
+> If this is the first time, you will get asked to run
+> `./dehydrated --register --accept-terms`
+
+The generated certificates will be in the `certs` folder. To construct the
+certificate files stored in this repo's `certs` and `private` directories:
+
+ cd certs
+ mkdir -p tmp/private tmp/certs
+
+ # Create some full certs including key for some services that need it this way
+ cat git.baserock.org/cert.csr git.baserock.org/cert.pem git.baserock.org/chain.pem git.baserock.org/privkey.pem > tmp/private/git-with-key.pem
+ cat docs.baserock.org/cert.csr docs.baserock.org/cert.pem docs.baserock.org/chain.pem docs.baserock.org/privkey.pem > tmp/private/frontend-with-key.pem
+
+ # Copy key files
+ cp git.baserock.org/privkey.pem tmp/private/git.pem
+ cp docs.baserock.org/privkey.pem tmp/private/frontend.pem
+
+ # Copy cert files
+ cp git.baserock.org/cert.csr tmp/certs/git.csr
+ cp git.baserock.org/cert.pem tmp/certs/git.pem
+ cp git.baserock.org/chain.pem tmp/certs/git-chain.pem
+ cp docs.baserock.org/cert.csr tmp/certs/frontend.csr
+ cp docs.baserock.org/cert.pem tmp/certs/frontend.pem
+ cp docs.baserock.org/chain.pem tmp/certs/frontend-chain.pem
+
+ # Create full certs without keys
+    cat git.baserock.org/cert.csr git.baserock.org/cert.pem git.baserock.org/chain.pem > tmp/certs/git-full.pem
+    cat docs.baserock.org/cert.csr docs.baserock.org/cert.pem docs.baserock.org/chain.pem > tmp/certs/frontend-full.pem
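+
+Before encrypting, it may be worth checking that each private key matches its
+certificate; a standard OpenSSL check (the two digests should be identical):
+
+    openssl x509 -noout -modulus -in tmp/certs/git.pem | openssl md5
+    openssl rsa -noout -modulus -in tmp/private/git.pem | openssl md5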
+
+Before replacing the current ones, make sure you **encrypt** the ones that
+contain keys (located in the `private` folder):
+
+ ansible-vault encrypt tmp/private/*
+
+And copy them to the repo:
+
+ cp tmp/certs/* ../../certs/
+ cp tmp/private/* ../../private/
+
+
+Deploy certificates
+-------------------
+
+For `git.baserock.org` just run:
+
+ ansible-playbook -i hosts baserock_trove/configure-trove.yml
+
+This playbook will copy the certificates to the Trove and run the scripts
+that configure them.
+
+For the frontend, run:
+
+ ansible-playbook -i hosts baserock_frontend/instance-config.yml
+ ansible -i hosts -m service -a 'name=haproxy enabled=true state=restarted' --sudo frontend-haproxy
+
+This will install the certificates and then restart the services that use
+them.
+
+
+GitLab CI runners setup
+=======================
+
+Baserock uses [GitLab CI] for build and test automation. For performance reasons
+we provide our own runners and avoid using the free, shared runners provided by
+GitLab. The runners are hosted at [DigitalOcean] and managed by the 'baserock'
+team account there.
+
+There is a persistent 'manager' machine with a public IP of 138.68.143.2 that
+runs GitLab Runner and [docker-machine]. This doesn't run any builds itself --
+we use the [autoscaling feature] of GitLab Runner to spawn new VMs for building
+in. The configuration for this is in `/etc/gitlab-runner/config.toml`.
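+
+To see which transient builder VMs are currently alive, log into the manager
+machine and ask docker-machine (a sketch):
+
+    docker-machine ls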
+
+Each build occurs in a Docker container on one of the transient VMs. As per
+the [\[runners.docker\] section] of `config.toml`, each gets a newly created
+volume mounted at `/cache`. The YBD and BuildStream cache directories are
+located here because jobs were running out of disk space when using the
+default configuration.
+
+There is a second persistent machine with a public IP of 46.101.48.48 that
+hosts a Docker registry and a [Minio] cache. These services run as Docker
+containers. The Docker registry exists to cache the Docker images we use,
+which improves the spin-up time of the transient builder VMs, as documented
+[here](https://docs.gitlab.com/runner/configuration/autoscale.html#distributed-docker-registry-mirroring).
+The Minio cache is used for the [distributed caching] feature of GitLab CI.
+
+
+[GitLab CI]: https://about.gitlab.com/features/gitlab-ci-cd/
+[DigitalOcean]: https://cloud.digitalocean.com/
+[docker-machine]: https://docs.docker.com/machine/
+[autoscaling feature]: https://docs.gitlab.com/runner/configuration/autoscale.html
+[Minio]: https://www.minio.io/
+[\[runners.docker\] section]: https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-docker-section
+[distributed caching]: https://docs.gitlab.com/runner/configuration/autoscale.html#distributed-runners-caching