Baserock project public infrastructure
=======================================

This repository contains the definitions for all of the Baserock Project's
infrastructure. This includes every service used by the project, except for
the mailing lists (hosted by [Pepperfish]), the wiki (hosted by [Branchable])
and the GitLab CI runners (set up by Javier Jardón).

Some of these systems are Baserock systems. This has proved an obstacle to
keeping them up to date with security updates, and we plan to switch
everything to run on mainstream distros in future.

All files necessary for (re)deploying the systems should be contained in this
Git repository. Private tokens should be encrypted using
[ansible-vault](https://www.ansible.com/blog/2014/02/19/ansible-vault).

[Pepperfish]: http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo
[Branchable]: http://www.branchable.com/

General notes
-------------

When instantiating a machine that will be public, remember to give shell
access to everyone on the ops team. This can be done using a post-creation
customisation script that injects all of their SSH keys. The SSH public keys
of the Baserock Operations team are collected in
`baserock-ops-team.cloud-config`.

Ensure SSH password login is disabled in all systems you deploy! See: for
why. The Ansible playbook `admin/sshd_config.yaml` can ensure that all
systems have password login disabled.

Administration
--------------

You can use [Ansible] to automate tasks on the baserock.org systems.

To run a playbook:

    ansible-playbook -i hosts $PLAYBOOK.yaml

To run an ad-hoc command (upgrading, for example):

    ansible -i hosts fedora -m command -a 'sudo dnf update -y'

[Ansible]: http://www.ansible.com

Security updates
----------------

Fedora security updates can be watched here: . The Baserock reference
systems don't have such a service. The [LWN Alerts](https://lwn.net/Alerts/)
service gives you info from all major Linux distributions.

If there is a vulnerability discovered in some software we use, we might need
to upgrade all of the systems that use that component at baserock.org.

Bear in mind some systems are not accessible except via the frontend-haproxy
system. Those are usually less at risk than those that face the web directly.
Also bear in mind we use OpenStack security groups to block most ports.

### Prepare the patch for Baserock systems

First, you need to update the Baserock reference system definitions with a
fixed version of the component. Build that and test that it works. Submit
the patch to gerrit.baserock.org, get it reviewed, and merged. Then cherry
pick that patch into infrastructure.git (a sketch of this last step appears
after the list below).

This is a long-winded process. There are shortcuts you can take, although
someone still has to complete the process described above at some point.

* You can modify the infrastructure.git definitions directly and start
  rebuilding the infrastructure systems right away, to avoid waiting for the
  Baserock patch review process.

* You can add the new version of the component as a stratum that sits above
  everything else in the build graph. For example, to do a 'hot-fix' for
  GLIBC, add a 'glibc-hotfix' stratum containing the new version to all of
  the systems you need to upgrade. Rebuilding them will be quick because you
  just need to build GLIBC, and can reuse the cached artifacts for everything
  else. The new GLIBC will overwrite the one that is lower down in the build
  graph in the resulting filesystem. Of course, if the new version of the
  component is not ABI compatible then this approach will break things. Be
  careful.
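As a rough sketch of that cherry-pick step, assuming the fix has already been
reviewed on Gerrit (the Gerrit remote URL and `refs/changes` path shown in
the comment are illustrative; copy the real ones from the Gerrit web UI):

    # Pull the merged definitions.git fix into infrastructure.git:
    git clone git://git.baserock.org/baserock/baserock/infrastructure.git
    cd infrastructure
    git pull git://git.baserock.org/baserock/baserock/definitions master
    # If the fix is not merged yet, cherry-pick it from Gerrit instead
    # (remote URL and refs/changes path below are made-up examples):
    # git fetch https://gerrit.baserock.org/baserock/baserock/definitions \
    #     refs/changes/NN/NNNN/N && git cherry-pick FETCH_HEAD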
### Check the inventory

Make sure the Ansible inventory file is up to date, and that you have access
to all machines. Run this:

    ansible \* -i ./hosts -m ping

You should see lots of this sort of output:

    frontend-haproxy | success >> {
        "changed": false,
        "ping": "pong"
    }

You may find some host key errors like this:

    paste | FAILED => SSH Error: Host key verification failed.
    It is sometimes useful to re-run the command using -vvvv, which prints SSH
    debug output to help diagnose the issue.

If you have a host key problem, that could be because somebody redeployed
the system since the last time you connected to it with SSH, and did not
transfer the SSH host keys from the old system to the new system. Check with
other ops team members about this. If you are sure the new host keys can be
trusted, you can remove the old ones with `ssh-keygen -R 10.3.x.y`, where
10.3.x.y is the internal IP address of the machine. You'll then be prompted
to accept the new ones when you run Ansible again.

Once all machines respond to the Ansible 'ping' module, double check that
every machine you can see in the OpenStack Horizon dashboard has a
corresponding entry in the 'hosts' file, to ensure the next steps operate on
all of the machines.

### Check and upgrade Fedora systems

> Bear in mind that only the latest 2 versions of Fedora receive security
> updates.

If any machines are not running the latest version of Fedora, you should
redeploy them with the latest version. See the instructions below on how to
(re)deploy each machine. You should deploy a new instance of a system and
test it *before* terminating the existing instance. Switching over should be
a matter of changing either its floating IP address or the IP address in
baserock_frontend/haproxy.conf.

You can find out what version of Fedora is in use with this command:

    ansible fedora -i hosts -m setup -a 'filter=ansible_distribution_version'

Check what version of a package is in use with this command (using GLIBC as
an example). You can compare this against Fedora package changelogs at
[Koji](https://koji.fedoraproject.org).

    ansible fedora -i hosts -m command -a 'rpm -q glibc --qf "%{VERSION}.%{RELEASE}\n"'

You can see what updates are available using the `dnf updateinfo info`
command:

    ansible -i hosts fedora -m command -a 'dnf updateinfo info glibc'

You can then use `dnf upgrade -y` to install all available updates. Or give
the name of a package to update just that package. Be aware that DNF is
quite slow, and if you forget to pass `-y` then it will hang forever waiting
for input.

You will then need to restart services. The `dnf needs-restarting` command
might be useful, but rebooting the whole machine is probably easiest.

### Check and upgrade Baserock systems

Check what version of a given package is in use with this command (using
GLIBC as an example). Ideally Baserock reference systems would have a query
tool for this info, but for now we have to look at the JSON metadata file
directly.

    ansible -i hosts baserock -m command \
        -a "grep '\"\(sha1\|repo\|original_ref\)\":' /baserock/glibc-bins.meta"

The default Baserock machine layout uses Btrfs for the root filesystem.
Filling up a Btrfs disk results in unpredictable behaviour. Before deploying
any system upgrades, check that each machine has enough free disk space to
hold an upgrade. Allow for at least 4GB free space, to be safe.

    ansible -i hosts baserock -m command -a "df -h /"

A good way to free up space is to remove old system-versions using the
`system-version-manager` tool.
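A minimal sketch of freeing space that way. The version label below is made
up, and the `remove` subcommand is an assumption about the
`system-version-manager` on the target machine; check the output of
`system-version-manager list` before deleting anything:

    # List deployed system versions on every Baserock machine:
    ansible -i hosts baserock -m command -a 'system-version-manager list'

    # Then, on a machine that is low on space, remove an old version by its
    # label (the label and the internal IP are placeholders):
    ssh root@10.3.x.y system-version-manager remove factory-2015-01-01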
There may be other things that are unnecessarily taking up space in the root
file system, too.

Ideally, at this point you've prepared a patch for definitions.git to fix the
security issue in the Baserock reference systems, and it has been merged. In
that case, pull from the reference systems into infrastructure.git, using
`git pull git://git.baserock.org/baserock/baserock/definitions master`. If
the necessary patch isn't merged in definitions.git, it's still best to merge
'master' from there into infrastructure.git, and then cherry-pick the patch
from Gerrit on top.

You then need to build and upgrade the systems one by one. Do this from the
'devel-system' machine in the same OpenStack cloud that hosts the
infrastructure. Baserock upgrades currently involve transferring the whole
multi-gigabyte system image, so you *must* have a fast connection to the
target.

Each Baserock system has its own deployment instructions. Each should have a
deployment .morph file that you can pass to `morph upgrade`. For example, to
deploy an upgrade to git.baserock.org:

    morph upgrade --local-changes=ignore \
        baserock_trove/baserock_trove.morph gbo.VERSION_LABEL=2016-02-19

Once this completes successfully, rebooting the system should bring up the
new system. You may want to check that the new `/etc` is correct; you can do
this inside the machine by mounting `/dev/vda` and looking in
`systems/$VERSION_LABEL/run/etc`.

If you want to revert the upgrade, use `system-version-manager list` and
`system-version-manager set-default <version>` to set the previous version
as the default, then reboot. If the system doesn't boot at all, reboot it
while you have the graphical console open in Horizon, and you should be able
to press `ESC` fast enough to get the boot menu open. This will allow
booting into previous versions of the system. (You shouldn't have any
problems though, since of course we test everything regularly.)

Beware of .

For cache.baserock.org, you can reuse the deployment instructions for
git.baserock.org. Try:

    morph upgrade --local-changes=ignore \
        baserock_trove/baserock_trove.morph \
        gbo.update-location=root@cache.baserock.org gbo.VERSION_LABEL=2016-02-19

Deployment to OpenStack
-----------------------

The intention is that all of the systems defined here are deployed to an
OpenStack cloud. The instructions here hardcode some details about the
specific tenancy at [CityCloud](https://citycontrolpanel.com/) that the
Baserock project uses. It should be easy to adapt them for other OpenStack
hosts, though.

### Credentials

The instructions below assume you have the following environment variables
set according to the OpenStack host you are deploying to:

- `OS_AUTH_URL`
- `OS_TENANT_NAME`
- `OS_USERNAME`
- `OS_PASSWORD`

For CityCloud you also need to ensure that `OS_REGION_NAME` is set to `Lon1`
(for the London datacentre).

When using `morph deploy` to deploy to OpenStack, you will need to set these
variables as well, because currently Morph does not honour the standard
ones. See: .

- `OPENSTACK_USER=$OS_USERNAME`
- `OPENSTACK_PASSWORD=$OS_PASSWORD`
- `OPENSTACK_TENANT=$OS_TENANT_NAME`

The `location` field in the deployment .morph file will also need to point
to the correct `$OS_AUTH_URL`.
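As a concrete sketch of the above (the standard `OS_*` values themselves
come from your OpenStack account and are not shown here):

    # CityCloud London region, plus the Morph-specific variables derived
    # from the standard OpenStack ones:
    export OS_REGION_NAME=Lon1
    export OPENSTACK_USER="$OS_USERNAME"
    export OPENSTACK_PASSWORD="$OS_PASSWORD"
    export OPENSTACK_TENANT="$OS_TENANT_NAME"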
### Firewall / Security Groups

The instructions assume the presence of a set of security groups. You can
create these by running the following Ansible playbook:

    ansible-playbook -i hosts firewall.yaml

### Placeholders

The commands below use a couple of placeholders like `$network_id`. You can
set them in your environment so that you can copy and paste the commands
below as-is.

- `export fedora_image_id=...` (find this with `glance image-list`)
- `export network_id=...` (find this with `neutron net-list`)
- `export keyname=...` (find this with `nova keypair-list`)

The `$fedora_image_id` should reference a Fedora Cloud image. You can import
these from . At the time of writing, these instructions were tested with
Fedora Cloud 26 for x86_64.

Backups
-------

Backups of git.baserock.org's data volume are run by and stored on a
Codethink-managed machine named 'access'. They will need to migrate off this
system before long.

The backups are taken without pausing services or snapshotting the data, so
they will not be 100% clean. The current git.baserock.org data volume does
not use LVM and cannot be easily snapshotted.

Systems
-------

### Front-end

The front-end provides a reverse proxy, to allow more flexible routing than
simply pointing each subdomain to a different instance using separate public
IPs. It also provides a starting point for future load-balancing and
failover configuration.

To deploy this system:

    nova boot frontend-haproxy \
        --key-name=$keyname \
        --flavor=1C-1GB \
        --image=$fedora_image_id \
        --nic="net-id=$network_id" \
        --security-groups default,shared-artifact-cache,web-server \
        --user-data ./baserock-ops-team.cloud-config
    ansible-playbook -i hosts baserock_frontend/image-config.yml
    ansible-playbook -i hosts baserock_frontend/instance-config.yml \
        --vault-password-file=~/vault-infra-pass
    ansible-playbook -i hosts baserock_frontend/instance-backup-config.yml
    ansible -i hosts -m service -a 'name=haproxy enabled=true state=started' \
        --sudo frontend-haproxy

The baserock_frontend system is stateless.

Full HAProxy 1.5 documentation: .

If you want to add a new service to the Baserock Project infrastructure via
the frontend, do the following:

- request a subdomain that points at 37.153.173.19 (frontend)
- alter the haproxy.cfg file in the baserock_frontend/ directory in this
  repo as necessary to proxy requests to the real instance
- run the baserock_frontend/instance-config.yml playbook
- run `ansible -i hosts -m service -a 'name=haproxy enabled=true state=restarted' --sudo frontend-haproxy`

OpenStack doesn't provide any kind of internal DNS service, so you must use
the fixed IP of each instance.

The internal IP address of this machine is hardcoded in some places (beyond
the usual haproxy.cfg file); use `git grep` to find all of them. You'll need
to update all the relevant config files. We really need some internal DNS
system to avoid this hassle.
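For example, a quick way to find those hardcoded addresses before changing
anything (the address below is a placeholder; substitute the frontend's real
fixed IP):

    # Search the whole repository for the frontend's internal address:
    frontend_internal_ip=192.168.222.XX    # substitute the real fixed IP
    git grep -n "$frontend_internal_ip"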
### General webserver

The general-purpose webserver provides downloads, plus IRC logging and a
pastebin service.

To deploy to production:

    openstack volume create \
        --description 'Webserver volume' \
        --size 150 \
        webserver-volume
    nova boot webserver \
        --key-name $keyname \
        --flavor 2C-8GB \
        --image $fedora_image_id \
        --nic "net-id=$network_id" \
        --security-groups default,web-server,haste-server \
        --user-data ./baserock-ops-team.cloud-config
    nova volume-attach webserver <volume ID of webserver-volume> /dev/vdb

    ansible-playbook -i hosts baserock_webserver/image-config.yml
    ansible-playbook -i hosts baserock_webserver/instance-config.yml
    ansible-playbook -i hosts baserock_webserver/instance-gitlab-bot-config.yml \
        --vault-password-file ~/vault-infra-pass
    ansible-playbook -i hosts baserock_webserver/instance-hastebin-config.yml \
        --vault-password-file ~/vault-infra-pass
    ansible-playbook -i hosts baserock_webserver/instance-irclogs-config.yml

The webserver machine runs [Cherokee](http://cherokee-project.com/). You can
use the `cherokee-admin` configuration UI by connecting to the webserver
over SSH and including this in your SSH command line:
`-L9090:localhost:9090`. When you run `sudo cherokee-admin` on the server,
you'll be able to browse to it locally on your machine at
`https://localhost:9090/`. You also have to modify the security groups
temporarily to allow that port through.

### Trove

To deploy to production, run these commands in a Baserock 'devel' or 'build'
system.

    nova volume-create \
        --display-name git.baserock.org-home \
        --display-description '/home partition of git.baserock.org' \
        --volume-type Ceph \
        300

    git clone git://git.baserock.org/baserock/baserock/infrastructure.git
    cd infrastructure

    morph build systems/trove-system-x86_64.morph
    morph deploy baserock_trove/baserock_trove.morph

    nova boot git.baserock.org \
        --key-name $keyname \
        --flavor 'dc1.8x16' \
        --image baserock_trove \
        --nic "net-id=$network_id,v4-fixed-ip=192.168.222.58" \
        --security-groups default,git-server,web-server,shared-artifact-cache \
        --user-data baserock-ops-team.cloud-config

    nova volume-attach git.baserock.org <volume ID of git.baserock.org-home> /dev/vdb

    # Note, if this floating IP is not available, you will have to change
    # the DNS in the DNS provider.
    nova add-floating-ip git.baserock.org 37.153.173.36

    ansible-playbook -i hosts baserock_trove/instance-config.yml

    # Before configuring the Trove you will need to create some SSH
    # keys for it. You can also use existing keys.
    mkdir private
    ssh-keygen -N '' -f private/lorry.key
    ssh-keygen -N '' -f private/worker.key
    ssh-keygen -N '' -f private/admin.key

    # Now you can finish the configuration of the Trove with:
    ansible-playbook -i hosts baserock_trove/configure-trove.yml

### OSTree artifact cache

To deploy this system to production:

    openstack volume create \
        --description 'OSTree cache volume' \
        --size 300 \
        ostree-volume
    nova boot ostree.baserock.org \
        --key-name $keyname \
        --flavor 2C-8GB \
        --image $fedora_image_id \
        --nic "net-id=$network_id" \
        --security-groups default,web-server \
        --user-data ./baserock-ops-team.cloud-config
    nova volume-attach ostree.baserock.org <volume ID of ostree-volume> /dev/vdb

    ansible-playbook -i hosts baserock_ostree/image-config.yml
    ansible-playbook -i hosts baserock_ostree/instance-config.yml
    ansible-playbook -i hosts baserock_ostree/ostree-access-config.yml
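A quick sanity check after each of the deployments above, to confirm that
the data volume actually ended up attached to the right instance. This is a
sketch using the OSTree cache as the example; the field names and output
layout vary between client versions:

    # Ask Nova which volumes the instance has, and Cinder for the volume state:
    nova show ostree.baserock.org | grep -i volumes
    openstack volume show ostree-volume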
SSL certificates
================

The certificates used for our infrastructure are provided for free by Let's
Encrypt. These certificates expire every 3 months. Here we will explain how
to renew the certificates, and how to deploy them.

Generation of certificates
--------------------------

> Note: This should be automated in the next upgrade. The instructions
> sound like a lot of effort.

To generate the SSL certs, first you need to clone the following
repositories:

    git clone https://github.com/lukas2511/letsencrypt.sh.git
    git clone https://github.com/mythic-beasts/letsencrypt-mythic-dns01.git

The version used the first time was `0.4.0`, with sha
`116386486b3749e4c5e1b4da35904f30f8b2749b` (just in case future releases
break these instructions).

Now, inside the letsencrypt.sh repo, create a `domains.txt` file with the
information of the subdomains:

    cd letsencrypt.sh
    cat >domains.txt <<'EOF'
    baserock.org
    docs.baserock.org
    download.baserock.org
    irclogs.baserock.org
    ostree.baserock.org
    paste.baserock.org
    spec.baserock.org
    git.baserock.org
    EOF

And the `config` file needed:

    cat >config <<'EOF'
    CONTACT_EMAIL="admin@baserock.org"
    HOOK="../letsencrypt-mythic-dns01/letsencrypt-mythic-dns01.sh"
    CHALLENGETYPE="dns-01"
    EOF

Create a `dnsapi.config.txt` with the contents of `private/dnsapi.config.txt`
decrypted. To show the contents of this file, run the following in an
`infrastructure.git` repo checkout:

    ansible-vault view private/dnsapi.config.txt

Now, to generate the certs, run:

    ./dehydrated -c

> If this is the first time, you will be asked to run
> `./dehydrated --register --accept-terms`

In the `certs` folder you will have all the certificates generated. To
construct the certificates that are present in `certs` and `private` you
will have to:

    cd certs
    mkdir -p tmp/private tmp/certs

    # Create some full certs including the key, for the services that need
    # them in this form
    cat git.baserock.org/cert.csr git.baserock.org/cert.pem git.baserock.org/chain.pem git.baserock.org/privkey.pem > tmp/private/git-with-key.pem
    cat docs.baserock.org/cert.csr docs.baserock.org/cert.pem docs.baserock.org/chain.pem docs.baserock.org/privkey.pem > tmp/private/frontend-with-key.pem

    # Copy key files
    cp git.baserock.org/privkey.pem tmp/private/git.pem
    cp docs.baserock.org/privkey.pem tmp/private/frontend.pem

    # Copy cert files
    cp git.baserock.org/cert.csr tmp/certs/git.csr
    cp git.baserock.org/cert.pem tmp/certs/git.pem
    cp git.baserock.org/chain.pem tmp/certs/git-chain.pem
    cp docs.baserock.org/cert.csr tmp/certs/frontend.csr
    cp docs.baserock.org/cert.pem tmp/certs/frontend.pem
    cp docs.baserock.org/chain.pem tmp/certs/frontend-chain.pem

    # Create full certs without keys
    cat git.baserock.org/cert.csr git.baserock.org/cert.pem git.baserock.org/chain.pem > tmp/certs/git-full.pem
    cat docs.baserock.org/cert.csr docs.baserock.org/cert.pem docs.baserock.org/chain.pem > tmp/certs/frontend-full.pem

Before replacing the current ones, make sure you **encrypt** the ones that
contain keys (located in the `private` folder):

    ansible-vault encrypt tmp/private/*

And copy them to the repo:

    cp tmp/certs/* ../../certs/
    cp tmp/private/* ../../private/

Deploy certificates
-------------------

For `git.baserock.org`, just run:

    ansible-playbook -i hosts baserock_trove/configure-trove.yml

This playbook will copy the certificates to the Trove and run the scripts
that will configure them.

For the frontend, run:

    ansible-playbook -i hosts baserock_frontend/instance-config.yml
    ansible -i hosts -m service -a 'name=haproxy enabled=true state=restarted' \
        --sudo frontend-haproxy

This will install the certificates and then restart the services needed.
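Once deployed, a quick way to confirm the renewal actually took effect. This
can be run from any machine with `openssl` installed, against any of the
domains listed in `domains.txt`:

    # Print the validity dates of the certificate currently being served:
    echo | openssl s_client -connect git.baserock.org:443 \
        -servername git.baserock.org 2>/dev/null | openssl x509 -noout -dates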
GitLab CI runners setup
=======================

Baserock uses [GitLab CI] for build and test automation. For performance
reasons we provide our own runners and avoid using the free, shared runners
provided by GitLab.

The runners are hosted at [DigitalOcean] and managed by the 'baserock' team
account there. There is a persistent 'manager' machine with a public IP of
138.68.143.2 that runs GitLab Runner and [docker-machine]. This doesn't run
any builds itself -- we use the [autoscaling feature] of GitLab Runner to
spawn new VMs for building in. The configuration for this is in
`/etc/gitlab-runner/config.toml`.

Each build occurs in a Docker container on one of the transient VMs. As per
the [\[runners.docker\] section] of `config.toml`, each gets a newly created
volume mounted at `/cache`. The YBD and BuildStream cache directories are
located there, because jobs were running out of disk space when using the
default configuration.

There is a second persistent machine with a public IP of 46.101.48.48 that
hosts a Docker registry and a [Minio] cache. These services run as Docker
containers. The Docker registry exists to cache the Docker images we use,
which improves the spin-up time of the transient builder VMs, as documented
[here](https://docs.gitlab.com/runner/configuration/autoscale.html#distributed-docker-registry-mirroring).
The Minio cache is used for the [distributed caching] feature of GitLab CI.

[GitLab CI]: https://about.gitlab.com/features/gitlab-ci-cd/
[DigitalOcean]: https://cloud.digitalocean.com/
[docker-machine]: https://docs.docker.com/machine/
[autoscaling feature]: https://docs.gitlab.com/runner/configuration/autoscale.html
[Minio]: https://www.minio.io/
[\[runners.docker\] section]: https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-docker-section
[distributed caching]: https://docs.gitlab.com/runner/configuration/autoscale.html#distributed-runners-caching
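A minimal sketch of checking on the autoscaling setup from the manager
machine, assuming you have SSH access to it. These are standard GitLab
Runner and docker-machine commands, not anything specific to this
repository:

    # On the manager (138.68.143.2): confirm the registered runner can still
    # contact GitLab, and list the transient builder VMs currently alive.
    sudo gitlab-runner verify
    sudo docker-machine ls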