commit 74ffb6dbd4d21888d147bf912cc0de75e85d609e
tree   d2687e1c75073ceadd6409d8efd08123d73c4ffc
parent f4be06b2fe0655f64e7eb7d2b235f18d35c35696
author    Ben Hutchings <ben.hutchings@codethink.co.uk>  2020-07-15 16:22:51 +0100
committer Ben Hutchings <ben.hutchings@codethink.co.uk>  2020-07-15 16:29:20 +0100
Add .md extension to Markdown documents
This will cause them to be rendered on GitLab and other git hosts'
web interfaces.
Diffstat (limited to 'ARCH')
-rw-r--r-- | ARCH | 505
1 file changed, 0 insertions(+), 505 deletions(-)
% Architecture of daemonised Lorry Controller
% Codethink Ltd

Introduction
============

This is an architecture document for Lorry Controller. It is aimed at
those who develop the software, or develop against its HTTP API. See
the file `README` for general information about Lorry Controller.


Requirements
============

Some concepts/terminology:

* CONFGIT is the git repository Lorry Controller uses for its
  configuration.

* Lorry specification: the configuration that tells Lorry to mirror an
  upstream version control repository or tarball. Note that a `.lorry`
  file may contain several specifications.

* Upstream Host: a git hosting server that Lorry Controller mirrors
  from.

* Host specification: which Upstream Host to mirror. This gets
  broken into generated Lorry specifications, one per git repository
  on the other Host. There can be many Host specifications, to
  mirror many Hosts.

* Downstream Host: a git hosting server that Lorry Controller mirrors
  to.

* run queue: all the Lorry specifications (from CONFGIT or generated
  from the Host specifications) that a Lorry Controller knows about;
  this is the set of things that get scheduled. The queue has a linear
  order (the first job in the queue is the next job to execute).

* job: an instance of executing a Lorry specification. Each job has an
  identifier and associated data (such as the output produced by the
  running job, and whether it succeeded).

* admin: a person who can control or reconfigure a Lorry Controller
  instance. All users of the HTTP API are admins, for example.

For historical reasons, Hosts are also referred to as Troves in many
places.

The original set of requirements, which have been broken down and
detailed below:

* Lorry Controller should be capable of being reconfigured at runtime
  to allow new tasks to be added and old tasks to be removed.
  (RC/ADD, RC/RM, RC/START)

* Lorry Controller should not allow all tasks to become stuck if one
  task is taking a long time. (RR/MULTI)

* Lorry Controller should not allow stuck tasks to remain stuck
  forever. (A configurable timeout? Monitoring of disk usage or CPU to
  see if work is being done?) (RR/TIMEOUT)

* Lorry Controller should be able to be controlled at runtime to allow:
    - Querying of the current task set (RQ/SPECS, RQ/SPEC)
    - Querying of currently running tasks (RQ/RUNNING)
    - Promotion or demotion of a task in the queue (RT/TOP, RT/BOT)
    - Support for health monitoring, to allow appropriate alerts
      to be sent out (MON/STATIC, MON/DU)

The detailed requirements (each prefixed by a unique identifier, which
is used elsewhere to refer to the exact requirement):

* (FW) Lorry Controller can access Upstream Hosts from behind
  firewalls.
    * (FW/H) Lorry Controller can access the Upstream Host using HTTP
      or HTTPS only, without using ssh, in order to get a list of
      repositories to mirror. (Lorry itself also needs to be able to
      access the Upstream Host using HTTP or HTTPS only, bypassing
      ssh, but that's a Lorry problem and outside the scope of Lorry
      Controller, so it will need to be dealt with separately.)
    * (FW/C) Lorry Controller does not verify SSL/TLS certificates
      when accessing the Upstream Host.
* (RC) Lorry Controller can be reconfigured at runtime.
    * (RC/ADD) A new Lorry specification can be added to CONFGIT, and
      a running Lorry Controller will add it to its run queue as soon
      as it is notified of the change.
    * (RC/RM) A Lorry specification can be removed from CONFGIT, and a
      running Lorry Controller will remove it from its run queue as
      soon as it is notified of the change.
    * (RC/START) Lorry Controller reads CONFGIT when it starts,
      updating its run queue if anything has changed.
* (RT) Lorry Controller can be controlled at runtime.
    * (RT/KILL) An admin can get their Lorry Controller to stop a
      running job.
    * (RT/TOP) An admin can get their Lorry Controller to move a Lorry
      spec to the beginning of the run queue.
    * (RT/BOT) An admin can get their Lorry Controller to move a Lorry
      spec to the end of the run queue.
    * (RT/QSTOP) An admin can stop their Lorry Controller from
      scheduling any new jobs.
    * (RT/QSTART) An admin can get their Lorry Controller to start
      scheduling jobs again.
* (RQ) Lorry Controller can be queried at runtime.
    * (RQ/RUNNING) An admin can list all currently running jobs.
    * (RQ/ALLJOBS) An admin can list all finished jobs that the Lorry
      Controller still remembers.
    * (RQ/SPECS) An admin can list all existing Lorry specifications
      in the run queue.
    * (RQ/SPEC) An admin can query existing Lorry specifications in
      the run queue for any information the Lorry Controller holds
      about them, such as the last time they successfully finished
      running.
* (RR) Lorry Controller is reasonably robust.
    * (RR/CONF) Lorry Controller ignores any broken Lorry or Host
      specifications in CONFGIT, and runs without them.
    * (RR/TIMEOUT) Lorry Controller stops a job that runs for too
      long.
    * (RR/MULTI) Lorry Controller can run multiple jobs at the same
      time, and lets the maximum number of such jobs be configured by
      the admin.
    * (RR/DU) Lorry Controller (and the way it runs Lorry) is designed
      to be frugal about disk space usage.
    * (RR/CERT) Lorry Controller tells Lorry not to worry about
      unverifiable SSL/TLS certificates, and to continue even if a
      certificate can't be verified or the verification fails.
* (RS) Lorry Controller is reasonably scalable.
    * (RS/SPECS) Lorry Controller works for the number of Lorry
      specifications we have on git.baserock.org (a number that will
      increase, and is currently about 500).
    * (RS/GITS) Lorry Controller works for mirroring git.baserock.org
      (about 500 git repositories).
    * (RS/HW) Lorry Controller may assume that CPU, disk, and
      bandwidth are sufficient, though they should not be needlessly
      wasted.
* (MON) Lorry Controller can be monitored from the outside.
    * (MON/STATIC) Lorry Controller updates, at least once a minute, a
      static HTML file, which shows its current status in sufficient
      detail that an admin knows if things get stuck or break.
    * (MON/DU) Lorry Controller measures, at least, the disk usage of
      each job and Lorry specification.
* (SEC) Lorry Controller is reasonably secure.
    * (SEC/API) Access to the Lorry Controller's run-time query and
      control interfaces is managed with iptables (for now).
    * (SEC/CONF) Access to CONFGIT is managed by the git server that
      hosts it. (Gitano on Trove.)

Architecture design
===================

Constraints
-----------

Python is not good at multiple threads (partly due to the global
interpreter lock), and mixing threads and executing subprocesses is
quite tricky to get right in general. Thus, this design splits the
software into a threaded web application (using the bottle.py
framework) and one or more single-threaded worker processes that
execute Lorry.

Entities
--------

* An admin is a human being, or some software, using the HTTP API to
  communicate with the Lorry Controller.
* Lorry Controller runs Lorry appropriately, and consists of several
  components described below.
* The Downstream Host is as defined in Requirements.
* An Upstream Host is as defined in Requirements. There can be
  multiple Upstream Hosts.

Components of Lorry Controller
------------------------------

* CONFGIT is a git repository for Lorry Controller configuration,
  which the Lorry Controller (see WEBAPP below) can access and pull
  from. Pushing is not required and should be prevented by Gitano.
  CONFGIT is hosted on the Downstream Host.
* STATEDB is persistent storage for the Lorry Controller's state: what
  Lorry specs it knows about (provided by the admin, or generated from
  a Host spec by Lorry Controller itself), their ordering, jobs that
  have been run or are being run, information about the jobs, etc. The
  idea is that the Lorry Controller process can terminate (cleanly or
  by crashing), be restarted, and continue approximately from where it
  was. Persistent storage is also useful because there are multiple
  processes involved, due to how bottle.py and WSGI work. STATEDB is
  implemented using sqlite3.

* WEBAPP is the controlling part of Lorry Controller, which maintains
  the run queue, and provides an HTTP API for monitoring and
  controlling Lorry Controller. WEBAPP is implemented as a bottle.py
  application. bottle.py runs the WEBAPP code in multiple threads to
  improve concurrency.

* MINION runs jobs (external processes) on behalf of WEBAPP. It
  communicates with WEBAPP over HTTP: it requests a job to run, starts
  it, and while it waits, sends partial output to WEBAPP every few
  seconds and asks WEBAPP whether the job should be aborted or not.
  MINION may eventually run on a different host from WEBAPP, for added
  scalability.

Components external to Lorry Controller
---------------------------------------

* A web server. This runs the Lorry Controller WEBAPP, using WSGI so
  that multiple instances (processes) can run at once, and thus serve
  many clients.

* bottle.py is a Python microframework for web applications. It sits
  between the web server itself and the WEBAPP code.

* systemd is the operating system component that starts services and
  processes.

How the components work together
--------------------------------

* Each WEBAPP instance is started by the web server when a request
  comes in. The web server is started by a systemd unit.

* Each MINION instance is started by a systemd unit.
  Each MINION
  handles one job at a time, and doesn't block other MINIONs from
  running other jobs. The admins decide how many MINIONs run at once,
  depending on hardware resources and other considerations. (RR/MULTI)

* An admin communicates with the WEBAPP only, by making HTTP requests.
  Each request is either a query (GET) or a command (POST). Queries
  report state as stored in STATEDB. Commands cause the WEBAPP
  instance to do something and alter STATEDB accordingly.

* When an admin makes changes to CONFGIT, and pushes them to the
  Downstream Host, the Host's git post-update hook makes an HTTP
  request to WEBAPP to update STATEDB from CONFGIT. (RC/ADD, RC/RM)

* Each MINION likewise communicates only with the WEBAPP, using HTTP
  requests. MINION requests a job to run (which triggers WEBAPP's job
  scheduling), and then reports results to the WEBAPP (which causes
  WEBAPP to store them in STATEDB), which tells MINION whether to
  continue running the job or not (RT/KILL). There is no separate
  scheduling process: all scheduling happens when there is a MINION
  available.

* At system start-up, a systemd unit makes an HTTP request to WEBAPP
  to make it refresh STATEDB from CONFGIT. (RC/START)

* A systemd timer unit makes an HTTP request to get WEBAPP to refresh
  the static HTML status page. (MON/STATIC)

In summary: systemd starts WEBAPP and the MINIONs, and whenever a
MINION can do work, it asks WEBAPP for something to do and reports
back the results. Meanwhile, admins can query and control via HTTP
requests to WEBAPP, and WEBAPP instances communicate via STATEDB.

The WEBAPP
----------

The WEBAPP provides an HTTP API as described below.

Run queue management:

* `POST /1.0/stop-queue` causes WEBAPP to stop scheduling new jobs to
  run. Any currently running jobs are not affected. (RT/QSTOP)

* `POST /1.0/start-queue` causes WEBAPP to start scheduling jobs
  again.
  (RT/QSTART)

* `GET /1.0/list-queue` causes WEBAPP to return a JSON list of the ids
  of all Lorry specifications in the run queue, in the order they are
  in the run queue. (RQ/SPECS)

* `POST /1.0/move-to-top` with `path=lorryspecid` as the body, where
  `lorryspecid` is the id (path) of a Lorry specification in the run
  queue, causes WEBAPP to move the specified spec to the head of the
  run queue, and store this in STATEDB. It doesn't affect currently
  running jobs. (RT/TOP)

* `POST /1.0/move-to-bottom` with `path=lorryspecid` in the body is
  like `/move-to-top`, but moves the spec to the end of the run queue.
  (RT/BOT)

Running job management:

* `GET /1.0/list-running-jobs` causes WEBAPP to return a JSON list of
  the ids of all currently running jobs. (RQ/RUNNING)

* `GET /1.0/job/<jobid>` causes WEBAPP to return a JSON map (dict)
  with all the information about the specified job.

* `POST /1.0/stop-job` with `job_id=jobid`, where `jobid` is the id of
  a running job, causes WEBAPP to record in STATEDB that the job is to
  be killed. (The actual killing is done when MINION gets around to
  it.) This request returns as soon as the STATEDB change is done.

* `GET /1.0/list-jobs` causes WEBAPP to return a JSON list of the ids
  of all jobs, running or finished, that it knows about. (RQ/ALLJOBS)

* `GET /1.0/list-jobs-html` is the same as `list-jobs`, but returns an
  HTML page instead.

* `POST /1.0/remove-job` with `job_id=jobid` in the body removes a
  stopped job from the state database.

* `POST /1.0/remove-ghost-jobs` looks for any running jobs in STATEDB
  that haven't been updated (with `job-update`, see below) in a long
  time (see `--ghost-timeout`), and marks them as terminated. This is
  used to catch situations where a MINION fails to tell the WEBAPP
  that a job has terminated.

Other status queries:

* `GET /1.0/status` causes WEBAPP to return a JSON object that
  describes the state of Lorry Controller.
  This information is meant to be programmatically usable, and may or
  may not be the same as in the HTML page.

* `GET /1.0/status-html` causes WEBAPP to return an HTML page that
  describes the state of Lorry Controller. This also updates an
  on-disk copy of the HTML page, which the web server is configured to
  serve in response to a normal HTTP request. This is the primary
  interface for human admins to look at the state of Lorry Controller.
  (MON/STATIC)

* `GET /1.0/lorry/<lorryspecid>` causes WEBAPP to return a JSON map
  (dict) with all the information about the specified Lorry
  specification. (RQ/SPEC)


Requests for MINION:

* `GET /1.0/give-me-job` is used by MINION to get a new job to run.
  WEBAPP will either return a JSON object describing the job to run,
  or return a status code indicating that there is nothing to do.
  WEBAPP responds immediately, even if there is nothing for MINION to
  do, and MINION will then sleep for a while before it tries again.
  WEBAPP updates STATEDB to record that the job is allocated to a
  MINION.

* `POST /1.0/job-update` is used by MINION to push updates about the
  job it is running to WEBAPP. The body sets the fields `exit` (the
  exit code of the program, or `no` if it has not exited yet),
  `stdout` (some output from the job's standard output) and `stderr`
  (ditto, but for standard error output). There MUST be at least one
  `job-update` call that indicates the job has terminated. WEBAPP
  responds with a status indicating whether the job should continue to
  run or be terminated (RR/TIMEOUT). WEBAPP records the job as
  terminated only after MINION tells it the job has been terminated.
  MINION makes the `job-update` request frequently, even if the job
  has produced no output, so that WEBAPP can update a timestamp in
  STATEDB to indicate that the job is still alive.
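Taken together, `give-me-job` and `job-update` define MINION's whole
side of the protocol. The exchange can be sketched with Python's
standard library alone. This is an illustrative sketch, not the real
`lorry-controller-minion`: the WEBAPP address, and the presence of a
`job_id` field in the `job-update` body, are assumptions (the text
names that field only for `/1.0/stop-job`).

```python
import json
import urllib.error
import urllib.parse
import urllib.request

WEBAPP_URL = 'http://localhost:12765'  # assumed address of the WEBAPP


def encode_job_update(job_id, exit_code, stdout, stderr):
    """Build the form body for POST /1.0/job-update.

    Per the text, `exit` carries the exit code, or the string 'no'
    while the job is still running; `job_id` is an assumed field name.
    """
    return urllib.parse.urlencode({
        'job_id': job_id,
        'exit': 'no' if exit_code is None else exit_code,
        'stdout': stdout,
        'stderr': stderr,
    }).encode('ascii')


def give_me_job():
    """GET /1.0/give-me-job: a JSON job description, or None if idle."""
    try:
        with urllib.request.urlopen(WEBAPP_URL + '/1.0/give-me-job') as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError:
        return None  # WEBAPP signals "nothing to do" with a status code


def job_update(body):
    """POST /1.0/job-update; the reply says whether to keep running."""
    req = urllib.request.Request(WEBAPP_URL + '/1.0/job-update', data=body)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

A MINION built on these helpers would loop: call `give_me_job()`,
sleep for a while if it returns None, otherwise fork and exec the job
and call `job_update()` every few seconds until the final update that
reports termination.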
Other requests:

* `POST /1.0/read-configuration` causes WEBAPP to update its copy of
  CONFGIT and update STATEDB based on the new configuration, if it has
  changed. Returns an OK/ERROR status. (RC/ADD, RC/RM, RC/START)

  This is called by systemd units at system start-up, and periodically
  (perhaps once a minute) after that. It can also be triggered by an
  admin (there is a button on the `/1.0/status-html` web page).

* `POST /1.0/ls-troves` causes WEBAPP to refresh its list of the
  repositories on each Upstream Host, if the current list is too old
  (see the `ls-interval` setting for each Upstream Host in
  `lorry-controller.conf`). This gets called from a systemd timer unit
  at a suitable interval.

* `POST /1.0/force-ls-troves` causes the repository refresh to happen
  for all Upstream Hosts, regardless of whether it is due or not. This
  can be called manually by an admin.


The MINION
----------

* Do `GET /1.0/give-me-job` to WEBAPP.
* If you didn't get a job, sleep a while and try again.
* If you did get a job, fork and exec it.
* In a loop: wait for output from the job (or its termination), for a
  suitably short period of time, with `select` or a similar mechanism,
  and send anything (if anything) you get to WEBAPP. If the WEBAPP
  told you to kill the job, kill it, then send an update to that
  effect to WEBAPP.
* Go back to the top to request a new job.


Old job removal
---------------

To avoid STATEDB filling up with the logs of old jobs, a systemd timer
unit runs occasionally to remove jobs so old that nobody cares about
them anymore. To make it easier to experiment with the logic of
choosing what to remove (age only? keep failed ones? something else?),
the removal is kept outside the WEBAPP.


STATEDB
-------

STATEDB has several tables. This section explains them.
The `running_queue` table has a single column (`running`) and a single
row, and is used to store a single boolean value that specifies
whether WEBAPP is giving out jobs to run from the run queue. This
value is controlled by the `/1.0/start-queue` and `/1.0/stop-queue`
requests.

The `lorries` table implements the run queue: all the Lorry specs that
WEBAPP knows about. It has the following columns:

* `path` is the path of the git repository on the Downstream Host,
  i.e., the git repository to which Lorry will push. This is a unique
  identifier. It is used, for example, to determine whether a Lorry
  spec is obsolete after a CONFGIT update.
* `text` is the text of the Lorry spec. This may be read from a file
  or generated by Lorry Controller itself. This text is given to Lorry
  when a job is run.
* `generated` is set to 0 or 1, depending on whether the Lorry spec
  came from an actual `.lorry` file or was generated by Lorry
  Controller.


Code structure
==============

The Lorry Controller code base is laid out as follows:

* `lorry-controller-webapp` is the main program of WEBAPP. It sets up
  the bottle.py framework. All the implementations of the various HTTP
  requests are in classes in the `lorrycontroller` Python package, as
  subclasses of the `LorryControllerRoute` class. The main program
  uses introspection ("magic") to find the subclasses automatically
  and set up the bottle.py routes correctly. This makes it possible to
  spread the code across simple classes; bottle's normal way (with the
  `@app.route` decorator) seemed to make that harder and to require
  everything in the same class.
* `lorrycontroller` is a Python package with:

    - the HTTP request handlers (`LorryControllerRoute` and its
      subclasses)
    - management of STATEDB (the `statedb` module)
    - support for various Downstream and Upstream Host types (the
      `hosts`, `gitano`, `gerrit`, `gitlab`, and `local` modules)
    - some helpful utilities (the `proxy` module)

* `lorry-controller-minion` is the entirety of the MINION, except that
  it uses the `lorrycontroller.setup_proxy` function. The MINION is
  kept very simple on purpose: all the interesting logic is in the
  WEBAPP instead.

* `static` has static content to be served over HTTP, primarily the
  CSS file for the HTML interfaces. When LC is integrated with the
  Downstream Host, the web server is configured to serve these files
  directly. The `static` directory will be accessible over plain HTTP
  on port 80, and on port 12765 via the WEBAPP, to allow HTML pages to
  refer to it via a simple path.

* `templates` contains bottle.py HTML templates for various pages.

* `etc` contains files to be installed in `/etc` when LC is installed
  on a Baserock system. Primarily this is the web server (lighttpd)
  configuration to invoke WEBAPP.

* `units` contains various systemd units that start services and run
  time-based jobs.

* `yarns.webapp` contains an integration test suite for WEBAPP. This
  is run by the `./check` script. The `./test-wait-for-port` script is
  used by the yarns.

Example
-------

As an example, to modify how the `/1.0/status-html` request works, you
would look at its implementation in `lorrycontroller/status.py`, and
perhaps also at the HTML templates in `templates/*.tpl`.

STATEDB
-------

The persistent state of WEBAPP is stored in an Sqlite3 database. All
access to STATEDB within WEBAPP is via the `lorrycontroller/statedb.py`
code module. That means there are no SQL statements outside
`statedb.py` at all, nor is it OK to add any.
If the interface provided by the `StateDB` class isn't sufficient,
then modify the class suitably, but do not add any new SQL outside it.

All access from outside WEBAPP happens via WEBAPP's HTTP API. Only the
WEBAPP is allowed to touch STATEDB in any way.

The bottle.py framework runs multiple threads of WEBAPP code. The
threads communicate only via STATEDB. There is no shared state in
memory. SQL's locking is used for mutual exclusion.

The `StateDB` class acts as a context manager for Python's `with`
statements, to provide locking. To access STATEDB with locking, use
code such as this:

    with self.open_statedb() as statedb:
        hosts = statedb.get_hosts()
        for host in hosts:
            statedb.remove_host(host)

The code executed by the `with` statement runs under the lock, and the
lock is released automatically even if there is an exception.

(You could manage locks manually. It's a good way to build character,
and to learn why using the context manager is really simple and leads
to more correct code.)
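The context-manager pattern above can be sketched with Python's
`sqlite3` module. This is a minimal illustration, not the real
`StateDB` class in `lorrycontroller/statedb.py`: the `hosts` table and
its `name` column are invented for the example; only the
`get_hosts`/`remove_host` names and the locking shape follow the text.

```python
import sqlite3


class StateDB:
    """Sketch of a lockable state database (illustrative only)."""

    def __init__(self, filename):
        # isolation_level=None stops the sqlite3 module from opening
        # transactions implicitly, so locking is controlled explicitly.
        self.conn = sqlite3.connect(filename, isolation_level=None)

    def __enter__(self):
        # BEGIN IMMEDIATE takes sqlite's write lock up front, so
        # concurrent WEBAPP threads exclude each other via SQL locking.
        self.conn.execute('BEGIN IMMEDIATE')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Commit on success, roll back on exception; either way the
        # lock is released when the `with` block ends.
        if exc_type is None:
            self.conn.commit()
        else:
            self.conn.rollback()
        return False  # propagate any exception

    def get_hosts(self):
        # The `hosts` table and `name` column are assumed for the example.
        return [row[0] for row in
                self.conn.execute('SELECT name FROM hosts')]

    def remove_host(self, name):
        self.conn.execute('DELETE FROM hosts WHERE name = ?', (name,))
```

A `with` block over such an object then behaves exactly as described
above: the lock is held for the duration of the block and released
automatically, even when an exception propagates out of it.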