% Architecture of daemonised Lorry Controller
% Codethink Ltd

Introduction
============

This is an architecture document for Lorry Controller. It is aimed at
those who develop the software, or develop against its HTTP API.

See the file `README.md` for general information about Lorry
Controller.

Requirements
============

Some concepts/terminology:

* CONFGIT is the git repository Lorry Controller uses for its
  configuration.

* Lorry specification: the configuration for Lorry to mirror an
  upstream version control repository or tarball. Note that a
  `.lorry` file may contain several specifications.

* Upstream Host: a git hosting server that Lorry Controller mirrors
  from.

* Host specification: which Upstream Host to mirror. This gets broken
  into generated Lorry specifications, one per git repository on the
  other Host. There can be many Host specifications to mirror many
  Hosts.

* Downstream Host: a git hosting server that Lorry Controller mirrors
  to.

* run queue: all the Lorry specifications (from CONFGIT or generated
  from the Host specifications) a Lorry Controller knows about; this
  is the set of things that get scheduled. The queue has a linear
  order (the first job in the queue is the next job to execute).

* job: an instance of executing a Lorry specification. Each job has
  an identifier and associated data (such as the output provided by
  the running job, and whether it succeeded).

* admin: a person who can control or reconfigure a Lorry Controller
  instance. All users of the HTTP API are admins, for example.

For historical reasons, Hosts are also referred to as Troves in many
places.

Original set of requirements, which have been broken down and
detailed below:

* Lorry Controller should be capable of being reconfigured at runtime
  to allow new tasks to be added and old tasks to be removed.
  (RC/ADD, RC/RM, RC/START)

* Lorry Controller should not allow all tasks to become stuck if one
  task is taking a long time. (RR/MULTI)

* Lorry Controller should not allow stuck tasks to remain stuck
  forever. (Configurable timeout? Monitoring of disk usage or CPU to
  see if work is being done?) (RR/TIMEOUT)

* Lorry Controller should be able to be controlled at runtime to
  allow:

    - querying of the current task set (RQ/SPECS, RQ/SPEC)
    - querying of currently running tasks (RQ/RUNNING)
    - promotion or demotion of a task in the queue (RT/TOP, RT/BOT)
    - supporting the health monitoring, to allow appropriate alerts
      to be sent out (MON/STATIC, MON/DU)

The detailed requirements (prefixed by a unique identifier, which is
used elsewhere to refer to the exact requirement):

* (FW) Lorry Controller can access Upstream Hosts from behind
  firewalls.

    * (FW/H) Lorry Controller can access the Upstream Host using HTTP
      or HTTPS only, without using ssh, in order to get a list of
      repositories to mirror. (Lorry itself also needs to be able to
      access the Upstream Host using HTTP or HTTPS only, bypassing
      ssh, but that's a Lorry problem and outside the scope of Lorry
      Controller, so it will need to be dealt with separately.)

    * (FW/C) Lorry Controller does not verify SSL/TLS certificates
      when accessing the Upstream Host.

* (RC) Lorry Controller can be reconfigured at runtime.

    * (RC/ADD) A new Lorry specification can be added to CONFGIT, and
      a running Lorry Controller will add it to its run queue as soon
      as it is notified of the change.

    * (RC/RM) A Lorry specification can be removed from CONFGIT, and
      a running Lorry Controller will remove it from its run queue as
      soon as it is notified of the change.

    * (RC/START) A Lorry Controller reads CONFGIT when it starts,
      updating its run queue if anything has changed.

* (RT) Lorry Controller can be controlled at runtime.

    * (RT/KILL) An admin can get their Lorry Controller to stop a
      running job.

    * (RT/TOP) An admin can get their Lorry Controller to move a
      Lorry spec to the beginning of the run queue.

    * (RT/BOT) An admin can get their Lorry Controller to move a
      Lorry spec to the end of the run queue.

    * (RT/QSTOP) An admin can stop their Lorry Controller from
      scheduling any new jobs.

    * (RT/QSTART) An admin can get their Lorry Controller to start
      scheduling jobs again.

* (RQ) Lorry Controller can be queried at runtime.

    * (RQ/RUNNING) An admin can list all currently running jobs.

    * (RQ/ALLJOBS) An admin can list all finished jobs that the Lorry
      Controller still remembers.

    * (RQ/SPECS) An admin can list all existing Lorry specifications
      in the run queue.

    * (RQ/SPEC) An admin can query existing Lorry specifications in
      the run queue for any information the Lorry Controller holds
      for them, such as the last time they successfully finished
      running.

* (RR) Lorry Controller is reasonably robust.

    * (RR/CONF) Lorry Controller ignores any broken Lorry or Host
      specifications in CONFGIT, and runs without them.

    * (RR/TIMEOUT) Lorry Controller stops a job that runs for too
      long.

    * (RR/MULTI) Lorry Controller can run multiple jobs at the same
      time, and lets the maximal number of such jobs be configured by
      the admin.

    * (RR/DU) Lorry Controller (and the way it runs Lorry) is
      designed to be frugal about disk space usage.

    * (RR/CERT) Lorry Controller tells Lorry to not worry about
      unverifiable SSL/TLS certificates and to continue even if the
      certificate can't be verified or the verification fails.

* (RS) Lorry Controller is reasonably scalable.

    * (RS/SPECS) Lorry Controller works for the number of Lorry
      specifications we have on git.baserock.org (a number that will
      increase, and is currently about 500).

    * (RS/GITS) Lorry Controller works for mirroring git.baserock.org
      (about 500 git repositories).

    * (RS/HW) Lorry Controller may assume that CPU, disk, and
      bandwidth are sufficient, though they should not be needlessly
      wasted.

* (MON) Lorry Controller can be monitored from the outside.

    * (MON/STATIC) Lorry Controller updates, at least once a minute,
      a static HTML file that shows its current status with
      sufficient detail that an admin knows if things get stuck or
      break.

    * (MON/DU) Lorry Controller measures, at least, the disk usage of
      each job and Lorry specification.

* (SEC) Lorry Controller is reasonably secure.

    * (SEC/API) Access to the Lorry Controller run-time query and
      control interfaces is managed with iptables (for now).

    * (SEC/CONF) Access to CONFGIT is managed by the git server that
      hosts it. (Gitano on Trove.)

Architecture design
===================

Constraints
-----------

Python is not good at multiple threads (partly due to the global
interpreter lock), and mixing threads and executing subprocesses is
quite tricky to get right in general. Thus, this design splits the
software into a threaded web application (using the bottle.py
framework) and one or more single-threaded worker processes to
execute Lorry.

Entities
--------

* An admin is a human being or some software using the HTTP API to
  communicate with the Lorry Controller.

* Lorry Controller runs Lorry appropriately, and consists of several
  components described below.

* The Downstream Host is as defined in Requirements.

* An Upstream Host is as defined in Requirements. There can be
  multiple Upstream Hosts.

Components of Lorry Controller
------------------------------

* CONFGIT is a git repository for Lorry Controller configuration,
  which the Lorry Controller (see WEBAPP below) can access and pull
  from. Pushing is not required and should be prevented by Gitano.
  CONFGIT is hosted on the Downstream Host.

* STATEDB is persistent storage for the Lorry Controller's state:
  what Lorry specs it knows about (provided by the admin, or
  generated from a Host spec by Lorry Controller itself), their
  ordering, jobs that have been run or are being run, information
  about the jobs, etc. The idea is that the Lorry Controller process
  can terminate (cleanly or by crashing), be restarted, and continue
  approximately from where it was. Persistent storage is also useful
  when there are multiple processes involved, due to how bottle.py
  and WSGI work. STATEDB is implemented using sqlite3.

* WEBAPP is the controlling part of Lorry Controller. It maintains
  the run queue, and provides an HTTP API for monitoring and
  controlling Lorry Controller. WEBAPP is implemented as a bottle.py
  application. bottle.py runs the WEBAPP code in multiple threads to
  improve concurrency.

* MINION runs jobs (external processes) on behalf of WEBAPP. It
  communicates with WEBAPP over HTTP: it requests a job to run,
  starts it, and while it waits, sends partial output to the WEBAPP
  every few seconds and asks the WEBAPP whether the job should be
  aborted or not. MINION may eventually run on a different host than
  WEBAPP, for added scalability.

Components external to Lorry Controller
---------------------------------------

* A web server. This runs the Lorry Controller WEBAPP, using WSGI so
  that multiple instances (processes) can run at once, and thus serve
  many clients.

* bottle.py is a Python microframework for web applications. It sits
  between the web server itself and the WEBAPP code.

* systemd is the operating system component that starts services and
  processes.

How the components work together
--------------------------------

* Each WEBAPP instance is started by the web server, when a request
  comes in. The web server is started by a systemd unit.

* Each MINION instance is started by a systemd unit. Each MINION
  handles one job at a time, and doesn't block other MINIONs from
  running other jobs. The admins decide how many MINIONs run at once,
  depending on hardware resources and other considerations.
  (RR/MULTI)

* An admin communicates with the WEBAPP only, by making HTTP
  requests. Each request is either a query (GET) or a command (POST).
  Queries report state as stored in STATEDB. Commands cause the
  WEBAPP instance to do something and alter STATEDB accordingly. (A
  sketch of such requests follows this list.)

* When an admin makes changes to CONFGIT, and pushes them to the
  Downstream Host, the Host's git post-update hook makes an HTTP
  request to WEBAPP to update STATEDB from CONFGIT. (RC/ADD, RC/RM)

* Each MINION likewise communicates only with the WEBAPP, using HTTP
  requests. MINION requests a job to run (which triggers WEBAPP's job
  scheduling), and then reports results to the WEBAPP (which causes
  WEBAPP to store them in STATEDB), which tells MINION whether to
  continue running the job or not (RT/KILL). There is no separate
  scheduling process: all scheduling happens when there is a MINION
  available.

* At system start up, a systemd unit makes an HTTP request to WEBAPP
  to make it refresh STATEDB from CONFGIT. (RC/START)

* A systemd timer unit makes an HTTP request to get WEBAPP to refresh
  the static HTML status page. (MON/STATIC)
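
To make this HTTP traffic concrete, here is a minimal sketch of the
kind of client-side requests described above: one query (GET) and one
command (POST). It assumes WEBAPP is reachable at
`http://localhost:12765` (12765 is the WEBAPP port mentioned under
"Code structure" below; a real deployment may differ), and it uses the
`/1.0/list-queue` and `/1.0/read-configuration` requests documented in
"The WEBAPP" section. This is not code shipped with Lorry Controller,
only an illustration of the request pattern.

    # Illustrative client for the WEBAPP HTTP API; the base URL is an
    # assumption for a local deployment.

    import json
    import urllib.request

    BASE_URL = 'http://localhost:12765'

    def list_queue():
        '''Query: return the ids of all Lorry specs in the run queue.'''
        with urllib.request.urlopen(BASE_URL + '/1.0/list-queue') as response:
            return json.loads(response.read())

    def read_configuration():
        '''Command: ask WEBAPP to re-read CONFGIT and update STATEDB.

        This is the kind of request the git post-update hook and the
        startup systemd unit make.
        '''
        request = urllib.request.Request(
            BASE_URL + '/1.0/read-configuration', data=b'', method='POST')
        with urllib.request.urlopen(request) as response:
            return response.read()

    if __name__ == '__main__':
        read_configuration()
        print(list_queue())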

In summary: systemd starts WEBAPP and MINIONs, and whenever a MINION
can do work, it asks WEBAPP for something to do, and reports back
results. Meanwhile, the admin can query and control via HTTP requests
to WEBAPP, and WEBAPP instances communicate via STATEDB.

The WEBAPP
----------

The WEBAPP provides an HTTP API as described below.

Run queue management:

* `POST /1.0/stop-queue` causes WEBAPP to stop scheduling new jobs to
  run. Any currently running jobs are not affected. (RT/QSTOP)

* `POST /1.0/start-queue` causes WEBAPP to start scheduling jobs
  again. (RT/QSTART)

* `GET /1.0/list-queue` causes WEBAPP to return a JSON list of ids of
  all Lorry specifications in the run queue, in the order they are in
  the run queue. (RQ/SPECS)

* `POST /1.0/move-to-top` with `path=lorryspecid` as the body, where
  `lorryspecid` is the id (path) of a Lorry specification in the run
  queue, causes WEBAPP to move the specified spec to the head of the
  run queue, and store this in STATEDB. It doesn't affect currently
  running jobs. (RT/TOP)

* `POST /1.0/move-to-bottom` with `path=lorryspecid` in the body is
  like `/move-to-top`, but moves the spec to the end of the run
  queue. (RT/BOT)

Running job management:

* `GET /1.0/list-running-jobs` causes WEBAPP to return a JSON list of
  ids of all currently running jobs. (RQ/RUNNING)

* `GET /1.0/job/` causes WEBAPP to return a JSON map (dict) with all
  the information about the specified job.

* `POST /1.0/stop-job` with `job_id=jobid`, where `jobid` is the id
  of a running job, causes WEBAPP to record in STATEDB that the job
  is to be killed, and waits for it to be killed. (The killing is
  done when MINION gets around to it.) This request returns as soon
  as the STATEDB change is done.

* `GET /1.0/list-jobs` causes WEBAPP to return a JSON list of ids of
  all jobs, running or finished, that it knows about. (RQ/ALLJOBS)

* `GET /1.0/list-jobs-html` is the same as `list-jobs`, but returns
  an HTML page instead.

* `POST /1.0/remove-job` with `job_id=jobid` in the body removes a
  stopped job from the state database.

* `POST /1.0/remove-ghost-jobs` looks for any running jobs in STATEDB
  that haven't been updated (with `job-update`, see below) in a long
  time (see `--ghost-timeout`), and marks them as terminated. This is
  used to catch situations where a MINION fails to tell the WEBAPP
  that a job has terminated.

Other status queries:

* `GET /1.0/status` causes WEBAPP to return a JSON object that
  describes the state of Lorry Controller. This information is meant
  to be programmatically usable, and may or may not be the same as in
  the HTML page.

* `GET /1.0/status-html` causes WEBAPP to return an HTML page that
  describes the state of Lorry Controller. This also updates an
  on-disk copy of the HTML page, which the web server is configured
  to serve using a normal HTTP request. This is the primary interface
  for human admins to look at the state of Lorry Controller.
  (MON/STATIC)

* `GET /1.0/lorry/` causes WEBAPP to return a JSON map (dict) with
  all the information about the specified Lorry specification.
  (RQ/SPEC)

Requests for MINION:

* `GET /1.0/give-me-job` is used by MINION to get a new job to run.
  WEBAPP will either return a JSON object describing the job to run,
  or return a status code indicating that there is nothing to do.
  WEBAPP will respond immediately, even if there is nothing for
  MINION to do, and MINION will then sleep for a while before it
  tries again. WEBAPP updates STATEDB to record that the job is
  allocated to a MINION.

* `POST /1.0/job-update` is used by MINION to push updates about the
  job it is running to WEBAPP. The body sets the fields `exit` (exit
  code of the program, or `no` if not set), `stdout` (some output
  from the job's standard output) and `stderr` (ditto, but standard
  error output). There MUST be at least one `job-update` call: the
  one which indicates that the job has terminated. WEBAPP responds
  with a status indicating whether the job should continue to run or
  be terminated (RR/TIMEOUT). WEBAPP records the job as terminated
  only after MINION tells it the job has been terminated. MINION
  makes the `job-update` request frequently, even if the job has
  produced no output, so that WEBAPP can update a timestamp in
  STATEDB to indicate that the job is still alive. (A sketch of the
  MINION side of this exchange follows the list.)
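
The following is a minimal sketch of the MINION side of this
protocol. The WEBAPP address, the `job_id` and `argv` field names,
and the shape of the `give-me-job` response are assumptions made for
illustration; the real handling lives in `lorry-controller-minion`
and the WEBAPP route classes. A real MINION also sends periodic
partial updates while the job runs and honours a "kill the job"
reply, which this sketch omits.

    # Sketch of the give-me-job / job-update exchange, not the real MINION.

    import json
    import subprocess
    import time
    import urllib.error
    import urllib.parse
    import urllib.request

    BASE_URL = 'http://localhost:12765'     # assumed WEBAPP address

    def give_me_job():
        '''Ask WEBAPP for a job; return a dict, or None if there is nothing to do.'''
        try:
            with urllib.request.urlopen(BASE_URL + '/1.0/give-me-job') as response:
                return json.loads(response.read())
        except urllib.error.HTTPError:
            return None     # WEBAPP signalled "nothing to do" with a status code

    def job_update(job_id, exit_code, stdout, stderr):
        '''Report output, and eventually the exit code, of a job to WEBAPP.'''
        fields = {
            'job_id': job_id,                           # assumed field name
            'exit': 'no' if exit_code is None else str(exit_code),
            'stdout': stdout,
            'stderr': stderr,
        }
        data = urllib.parse.urlencode(fields).encode('utf-8')
        with urllib.request.urlopen(BASE_URL + '/1.0/job-update', data=data):
            pass

    def main_loop():
        while True:
            job = give_me_job()
            if job is None:
                time.sleep(10)                          # idle; ask again later
                continue
            # A real MINION builds a Lorry command line from the job
            # description; `argv` is a hypothetical field name.
            result = subprocess.run(job['argv'], capture_output=True, text=True)
            # At least one job-update is required: the one reporting termination.
            job_update(job['job_id'], result.returncode,
                       result.stdout, result.stderr)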

Other requests:

* `POST /1.0/read-configuration` causes WEBAPP to update its copy of
  CONFGIT and update STATEDB based on the new configuration, if it
  has changed. Returns an OK/ERROR status. (RC/ADD, RC/RM, RC/START)

  This is called by systemd units at system startup and periodically
  (perhaps once a minute) otherwise. It can also be triggered by an
  admin (there is a button on the `/1.0/status-html` web page).

* `POST /1.0/ls-troves` causes WEBAPP to refresh its list of
  repositories in each Upstream Host, if the current list is too old
  (see the `ls-interval` setting for each Upstream Host in
  `lorry-controller.conf`). This gets called from a systemd timer
  unit at a suitable interval.

* `POST /1.0/force-ls-troves` causes the repository refresh to happen
  for all Upstream Hosts, regardless of whether it is due or not.
  This can be called manually by an admin.

The MINION
----------

* Do `GET /1.0/give-me-job` to WEBAPP.

* If it didn't get a job, sleep a while and try again.

* If it did get a job, fork and exec that.

* In a loop: wait for output from the job (or its termination), for a
  suitably short period of time, with `select` or a similar
  mechanism, and send anything (if anything) it gets to WEBAPP. If
  the WEBAPP told it to kill the job, kill it, then send an update to
  that effect to WEBAPP.

* Go back to the top to request a new job.

Old job removal
---------------

To avoid STATEDB filling up with logs of old jobs, a systemd timer
unit runs occasionally to remove jobs that are so old that nobody
cares about them anymore. To make it easier to experiment with the
logic of choosing what to remove (age only? keep failed ones?
something else?), the removal is kept outside the WEBAPP.

STATEDB
-------

The STATEDB has several tables. This section explains them.

The `running_queue` table has a single column (`running`) and a
single row, and is used to store a single boolean value that
specifies whether WEBAPP is giving out jobs to run from the
run-queue. This value is controlled by the `/1.0/start-queue` and
`/1.0/stop-queue` requests.

The `lorries` table implements the run-queue: all the Lorry specs
that WEBAPP knows about. It has the following columns (a sqlite3
sketch of the schema follows the list):

* `path` is the path of the git repository on the Downstream Host,
  i.e., the git repository to which Lorry will push. This is a unique
  identifier. It is used, for example, to determine if a Lorry spec
  is obsolete after a CONFGIT update.

* `text` has the text of the Lorry spec. This may be read from a file
  or generated by Lorry Controller itself. This text will be given to
  Lorry when a job is run.

* `generated` is set to 0 or 1, depending on whether the Lorry spec
  came from an actual `.lorry` file or was generated by Lorry
  Controller.
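
The sketch below shows, in sqlite3 terms, roughly what the two tables
described above look like. The authoritative schema is whatever
`lorrycontroller/statedb.py` creates (see "Code structure" below);
the real STATEDB likely has further tables and columns not shown
here, and the column types are assumptions.

    # Toy version of the running_queue and lorries tables, for
    # illustration only; the real schema lives in statedb.py.

    import sqlite3

    def create_sketch_schema(filename='/tmp/statedb-sketch.sqlite3'):
        conn = sqlite3.connect(filename)
        with conn:  # commit on success
            # Single row, single column: is WEBAPP handing out jobs at all?
            conn.execute(
                'CREATE TABLE IF NOT EXISTS running_queue (running INTEGER)')
            # The run-queue: one row per Lorry spec that WEBAPP knows about.
            conn.execute(
                'CREATE TABLE IF NOT EXISTS lorries ('
                '    path TEXT PRIMARY KEY,'    # repo path on the Downstream Host
                '    text TEXT,'                # the Lorry spec text given to Lorry
                '    generated INTEGER'         # 0: from a .lorry file, 1: generated
                ')')
        return conn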

Code structure
==============

The Lorry Controller code base is laid out as follows:

* `lorry-controller-webapp` is the main program of WEBAPP. It sets up
  the bottle.py framework. All the implementations for the various
  HTTP requests are in classes in the `lorrycontroller` Python
  package, as subclasses of the `LorryControllerRoute` class. The
  main program uses introspection ("magic") to find the subclasses
  automatically and sets up the bottle.py routes accordingly. This
  makes it possible to spread the code into simple classes; bottle's
  normal way (with the `@app.route` decorator) seemed to make that
  harder and to require everything in the same class.

* `lorrycontroller` is a Python package with:

    - the HTTP request handlers (`LorryControllerRoute` and its
      subclasses)
    - management of STATEDB (the `statedb` module)
    - support for various Downstream and Upstream Host types (the
      `hosts`, `gitano`, `gerrit`, `gitlab`, and `local` modules)
    - some helpful utilities (the `proxy` module)

* `lorry-controller-minion` is the entirety of the MINION, except
  that it uses the `lorrycontroller.setup_proxy` function. The MINION
  is kept very simple on purpose: all the interesting logic is in the
  WEBAPP instead.

* `static` has static content to be served over HTTP; primarily, the
  CSS file for the HTML interfaces. When LC is integrated within the
  Downstream Host, the web server gets configured to serve these
  files directly. The `static` directory will be accessible over
  plain HTTP on port 80, and on port 12765 via the WEBAPP, to allow
  HTML pages to refer to it via a simple path.

* `templates` contains bottle.py HTML templates for various pages.

* `etc` contains files to be installed in `/etc` when LC is installed
  on a Baserock system. Primarily this is the web server (lighttpd)
  configuration to invoke WEBAPP.

* `units` contains various systemd units that start services and run
  time-based jobs.

* `yarns.webapp` contains an integration test suite for WEBAPP. This
  is run by the `./check` script. The `./test-wait-for-port` script
  is used by the yarns.

Example
-------

As an example, to modify how the `/1.0/status-html` request works,
you would look at its implementation in `lorrycontroller/status.py`,
and perhaps also at the HTML templates in `templates/*.tpl`.

STATEDB
-------

The persistent state of WEBAPP is stored in an Sqlite3 database. All
access to STATEDB within WEBAPP is via the
`lorrycontroller/statedb.py` code module. That means there are no SQL
statements outside `statedb.py` at all, nor is it OK to add any. If
the interface provided by the `StateDB` class isn't sufficient,
modify the class suitably, but do not add any new SQL outside it.

All access from outside of WEBAPP happens via WEBAPP's HTTP API. Only
the WEBAPP is allowed to touch STATEDB in any way.

The bottle.py framework runs multiple threads of WEBAPP code. The
threads communicate only via STATEDB; there is no shared state in
memory. SQL's locking is used for mutual exclusion. The `StateDB`
class acts as a context manager for Python's `with` statements to
provide locking.

To access STATEDB with locking, use code such as this:

    with self.open_statedb() as statedb:
        hosts = statedb.get_hosts()
        for host in hosts:
            statedb.remove_host(host)

The code executed by the `with` statement runs under the lock, and
the lock gets released automatically, even if there is an exception.

(You could manage locks manually. It's a good way to build character
and learn why using the context manager is really simple and leads to
more correct code.)
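
For readers unfamiliar with the pattern, the following is a minimal
sketch of a context manager in the same spirit. It is not the real
`StateDB` class: the class name, the use of `BEGIN IMMEDIATE` for
locking, and the example query method are illustrative assumptions;
the actual implementation is in `lorrycontroller/statedb.py`.

    # Sketch of a locking context manager over sqlite3 (not the real StateDB).

    import sqlite3

    class SketchStateDB:

        def __init__(self, filename):
            # isolation_level=None: transactions are managed explicitly below.
            self._conn = sqlite3.connect(filename, isolation_level=None)

        def __enter__(self):
            # Take a write lock on the database for the duration of the block.
            self._conn.execute('BEGIN IMMEDIATE')
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            # Commit on success, roll back on error; either way the lock is
            # released, even if the block raised an exception.
            if exc_type is None:
                self._conn.execute('COMMIT')
            else:
                self._conn.execute('ROLLBACK')
            return False    # do not suppress exceptions

        def get_lorry_paths(self):
            '''Example query: ids (paths) of all Lorry specs in the run queue.'''
            return [row[0] for row in
                    self._conn.execute('SELECT path FROM lorries')]

Used as `with SketchStateDB(filename) as statedb: ...`, this gives
the same shape of code as the `open_statedb` example above: the body
of the `with` block holds the lock for exactly as long as it runs.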