% Architecture of daemonised Lorry Controller
% Codethink Ltd

Introduction
============

This is an architecture document for Lorry Controller. It is aimed at
those who develop the software, or develop against its HTTP API. See
the file `README.md` for general information about Lorry Controller.


Requirements
============

Some concepts/terminology:

* CONFGIT is the git repository Lorry Controller uses for its
  configuration.

* Lorry specification: the configuration that tells Lorry to mirror
  an upstream version control repository or tarball. Note that a
  `.lorry` file may contain several specifications.

* Upstream Host: a git hosting server that Lorry Controller mirrors
  from.

* Host specification: which Upstream Host to mirror. This gets
  broken into generated Lorry specifications, one per git repository
  on the other Host. There can be many Host specifications to
  mirror many Hosts.

* Downstream Host: a git hosting server that Lorry Controller mirrors
  to.

* run queue: all the Lorry specifications (from CONFGIT or generated
  from the Host specifications) a Lorry Controller knows about; this
  is the set of things that get scheduled. The queue has a linear
  order (the first job in the queue is the next job to execute).

* job: an instance of executing a Lorry specification. Each job has an
  identifier and associated data (such as the output provided by the
  running job, and whether it succeeded).

* admin: a person who can control or reconfigure a Lorry Controller
  instance. All users of the HTTP API are admins, for example.

For historical reasons, Hosts are also referred to as Troves in many
places.
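For illustration, a minimal `.lorry` file with a single specification
might look like the following. The repository name and URL are made up
for this example; see the Lorry documentation for the authoritative
format.

    {
        "example-project": {
            "type": "git",
            "url": "git://upstream.example.com/example-project.git"
        }
    }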
Original set of requirements, which have been broken down and detailed
below:

* Lorry Controller should be capable of being reconfigured at runtime
  to allow new tasks to be added and old tasks to be removed.
  (RC/ADD, RC/RM, RC/START)

* Lorry Controller should not allow all tasks to become stuck if one
  task is taking a long time. (RR/MULTI)

* Lorry Controller should not allow stuck tasks to remain stuck
  forever. (Configurable timeout? Monitoring of disk usage or CPU to
  see if work is being done?) (RR/TIMEOUT)

* Lorry Controller should be able to be controlled at runtime to allow:
  - Querying of the current task set (RQ/SPECS, RQ/SPEC)
  - Querying of currently running tasks (RQ/RUNNING)
  - Promotion or demotion of a task in the queue (RT/TOP, RT/BOT)
  - Support for health monitoring, so that appropriate alerts can be
    sent out (MON/STATIC, MON/DU)

The detailed requirements (prefixed by a unique identifier, which is
used elsewhere to refer to the exact requirement):

* (FW) Lorry Controller can access Upstream Hosts from behind firewalls.
  * (FW/H) Lorry Controller can access the Upstream Host using HTTP or
    HTTPS only, without using ssh, in order to get a list of
    repositories to mirror. (Lorry itself also needs to be able to
    access the Upstream Host using HTTP or HTTPS only, bypassing
    ssh, but that's a Lorry problem and outside the scope of Lorry
    Controller, so it'll need to be dealt with separately.)
  * (FW/C) Lorry Controller does not verify SSL/TLS certificates
    when accessing the Upstream Host.
* (RC) Lorry Controller can be reconfigured at runtime.
  * (RC/ADD) A new Lorry specification can be added to CONFGIT, and
    a running Lorry Controller will add it to its run queue as
    soon as it is notified of the change.
  * (RC/RM) A Lorry specification can be removed from CONFGIT, and a
    running Lorry Controller will remove it from its run queue as
    soon as it is notified of the change.
  * (RC/START) A Lorry Controller reads CONFGIT when it starts,
    updating its run queue if anything has changed.
* (RT) Lorry Controller can be controlled at runtime.
  * (RT/KILL) An admin can get their Lorry Controller to stop a
    running job.
  * (RT/TOP) An admin can get their Lorry Controller to move a Lorry
    spec to the beginning of the run queue.
  * (RT/BOT) An admin can get their Lorry Controller to move a Lorry
    spec to the end of the run queue.
  * (RT/QSTOP) An admin can stop their Lorry Controller from
    scheduling any new jobs.
  * (RT/QSTART) An admin can get their Lorry Controller to start
    scheduling jobs again.
* (RQ) Lorry Controller can be queried at runtime.
  * (RQ/RUNNING) An admin can list all currently running jobs.
  * (RQ/ALLJOBS) An admin can list all finished jobs that the Lorry
    Controller still remembers.
  * (RQ/SPECS) An admin can list all existing Lorry specifications
    in the run queue.
  * (RQ/SPEC) An admin can query existing Lorry specifications in
    the run queue for any information the Lorry Controller holds for
    them, such as the last time they successfully finished running.
* (RR) Lorry Controller is reasonably robust.
  * (RR/CONF) Lorry Controller ignores any broken Lorry or Host
    specifications in CONFGIT, and runs without them.
  * (RR/TIMEOUT) Lorry Controller stops a job that runs for too
    long.
  * (RR/MULTI) Lorry Controller can run multiple jobs at the same
    time, and lets the maximal number of such jobs be configured by
    the admin.
  * (RR/DU) Lorry Controller (and the way it runs Lorry) is
    designed to be frugal about disk space usage.
  * (RR/CERT) Lorry Controller tells Lorry to not worry about
    unverifiable SSL/TLS certificates and to continue even if the
    certificate can't be verified or the verification fails.
* (RS) Lorry Controller is reasonably scalable.
  * (RS/SPECS) Lorry Controller works for the number of Lorry
    specifications we have on git.baserock.org (a number that will
    increase, and is currently about 500).
  * (RS/GITS) Lorry Controller works for mirroring git.baserock.org
    (about 500 git repositories).
  * (RS/HW) Lorry Controller may assume that CPU, disk, and
    bandwidth are sufficient, though they should not be needlessly
    wasted.
* (MON) Lorry Controller can be monitored from the outside.
  * (MON/STATIC) Lorry Controller updates at least once a minute a
    static HTML file, which shows its current status with sufficient
    detail that an admin knows if things get stuck or break.
  * (MON/DU) Lorry Controller measures, at least, the disk usage of
    each job and Lorry specification.
* (SEC) Lorry Controller is reasonably secure.
  * (SEC/API) Access to the Lorry Controller run-time query and
    control interfaces is managed with iptables (for now).
  * (SEC/CONF) Access to CONFGIT is managed by the git server that
    hosts it. (Gitano on Trove.)

Architecture design
===================

Constraints
-----------

Python is not good at multiple threads (partly due to the global
interpreter lock), and mixing threads and executing subprocesses is
quite tricky to get right in general. Thus, this design splits the
software into a threaded web application (using the bottle.py
framework) and one or more single-threaded worker processes that
execute Lorry.

Entities
--------

* An admin is a human being or some software using the HTTP API to
  communicate with the Lorry Controller.
* Lorry Controller runs Lorry appropriately, and consists of several
  components described below.
* The Downstream Host is as defined in Requirements.
* An Upstream Host is as defined in Requirements. There can be
  multiple Upstream Hosts.

Components of Lorry Controller
------------------------------

* CONFGIT is a git repository for Lorry Controller configuration,
  which the Lorry Controller (see WEBAPP below) can access and pull
  from. Pushing is not required and should be prevented by Gitano.
  CONFGIT is hosted on the Downstream Host.

* STATEDB is persistent storage for the Lorry Controller's state: what
  Lorry specs it knows about (provided by the admin, or generated from
  a Host spec by Lorry Controller itself), their ordering, jobs that
  have been run or are being run, information about the jobs, etc. The
  idea is that the Lorry Controller process can terminate (cleanly or
  by crashing), be restarted, and continue approximately from where it
  was. Persistent storage is also useful because there are multiple
  processes involved, due to how bottle.py and WSGI work. STATEDB is
  implemented using sqlite3.

* WEBAPP is the controlling part of Lorry Controller, which maintains
  the run queue, and provides an HTTP API for monitoring and
  controlling Lorry Controller. WEBAPP is implemented as a bottle.py
  application. bottle.py runs the WEBAPP code in multiple threads to
  improve concurrency.

* MINION runs jobs (external processes) on behalf of WEBAPP. It
  communicates with WEBAPP over HTTP: it requests a job to run, starts
  it, and while the job runs, sends partial output to the WEBAPP every
  few seconds and asks the WEBAPP whether the job should be aborted or
  not. MINION may eventually run on a different host than WEBAPP, for
  added scalability.

Components external to Lorry Controller
---------------------------------------

* A web server. This runs the Lorry Controller WEBAPP, using WSGI so
  that multiple instances (processes) can run at once, and thus serve
  many clients.

* bottle.py is a Python microframework for web applications. It sits
  between the web server itself and the WEBAPP code.

* systemd is the operating system component that starts services and
  processes.

How the components work together
--------------------------------

* Each WEBAPP instance is started by the web server, when a request
  comes in. The web server is started by a systemd unit.

* Each MINION instance is started by a systemd unit. Each MINION
  handles one job at a time, and doesn't block other MINIONs from
  running other jobs. The admins decide how many MINIONs run at once,
  depending on hardware resources and other considerations. (RR/MULTI)

* An admin communicates with the WEBAPP only, by making HTTP requests.
  Each request is either a query (GET) or a command (POST). Queries
  report state as stored in STATEDB. Commands cause the WEBAPP
  instance to do something and alter STATEDB accordingly.

* When an admin makes changes to CONFGIT, and pushes them to the
  Downstream Host, the Host's git post-update hook makes an HTTP
  request to WEBAPP to update STATEDB from CONFGIT (see the sketch
  after this list). (RC/ADD, RC/RM)

* Each MINION likewise communicates only with the WEBAPP, using HTTP
  requests. MINION requests a job to run (which triggers WEBAPP's job
  scheduling), and then reports results to the WEBAPP (which causes
  WEBAPP to store them in STATEDB), which tells MINION whether to
  continue running the job or not (RT/KILL). There is no separate
  scheduling process: all scheduling happens when there is a MINION
  available.

* At system start up, a systemd unit makes an HTTP request to WEBAPP
  to make it refresh STATEDB from CONFGIT. (RC/START)

* A timer unit for systemd makes an HTTP request to get WEBAPP to
  refresh the static HTML status page. (MON/STATIC)

In summary: systemd starts WEBAPP and MINIONs, and whenever a
MINION can do work, it asks WEBAPP for something to do, and reports
back results. Meanwhile, admins can query and control via HTTP requests
to WEBAPP, and WEBAPP instances communicate via STATEDB.
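The post-update hook mentioned above only needs to make one HTTP
request. The sketch below is illustrative, not the actual hook: the
host name is a placeholder, and the port is taken from the `static`
discussion under "Code structure" below. The endpoint itself,
`/1.0/read-configuration`, is described in the next section.

    #!/usr/bin/env python
    # Hypothetical git post-update hook: tell WEBAPP that CONFGIT has
    # changed, so it re-reads its configuration. (RC/ADD, RC/RM)
    import urllib.request

    # The WEBAPP address is an assumption made for this sketch.
    url = 'http://localhost:12765/1.0/read-configuration'
    # Passing a (here empty) body makes urlopen send a POST request.
    with urllib.request.urlopen(url, data=b'') as response:
        print(response.read().decode('utf-8'))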
The WEBAPP
----------

The WEBAPP provides an HTTP API as described below.

Run queue management:

* `POST /1.0/stop-queue` causes WEBAPP to stop scheduling new jobs to
  run. Any currently running jobs are not affected. (RT/QSTOP)

* `POST /1.0/start-queue` causes WEBAPP to start scheduling jobs
  again. (RT/QSTART)

* `GET /1.0/list-queue` causes WEBAPP to return a JSON list of ids of
  all Lorry specifications in the run queue, in the order they are in
  the run queue. (RQ/SPECS)

* `POST /1.0/move-to-top` with `path=lorryspecid` as the body, where
  `lorryspecid` is the id (path) of a Lorry specification in the run
  queue, causes WEBAPP to move the specified spec to the head of the
  run queue, and store this in STATEDB. It doesn't affect currently
  running jobs. (RT/TOP)

* `POST /1.0/move-to-bottom` with `path=lorryspecid` in the body is
  like `/move-to-top`, but moves the spec to the end of the run queue.
  (RT/BOT)
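As an example of how an admin (or admin tooling) might drive these
requests from Python, consider the sketch below. The base URL is an
assumption; the request paths and the `path` body field are as
described above.

    # Illustrative admin client for the run queue API; the base URL
    # is a placeholder for wherever WEBAPP actually listens.
    import json
    import urllib.parse
    import urllib.request

    BASE = 'http://localhost:12765/1.0'

    # List the run queue (RQ/SPECS): a GET request returning JSON.
    with urllib.request.urlopen(BASE + '/list-queue') as response:
        queue = json.loads(response.read().decode('utf-8'))

    # Demote the first spec to the end of the queue (RT/BOT): a POST
    # request with a form-encoded body of path=lorryspecid.
    if queue:
        body = urllib.parse.urlencode({'path': queue[0]}).encode('ascii')
        urllib.request.urlopen(BASE + '/move-to-bottom', data=body)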
Running job management:

* `GET /1.0/list-running-jobs` causes WEBAPP to return a JSON list of
  ids of all currently running jobs. (RQ/RUNNING)

* `GET /1.0/job/<jobid>` causes WEBAPP to return a JSON map (dict)
  with all the information about the specified job.

* `POST /1.0/stop-job` with `job_id=jobid` in the body, where `jobid`
  is the id of a running job, causes WEBAPP to record in STATEDB that
  the job is to be killed; the actual killing is done when MINION gets
  around to it. This request returns as soon as the STATEDB change is
  done, without waiting for the job to die.

* `GET /1.0/list-jobs` causes WEBAPP to return a JSON list of ids
  of all jobs, running or finished, that it knows about. (RQ/ALLJOBS)

* `GET /1.0/list-jobs-html` is the same as `list-jobs`, but returns an
  HTML page instead.

* `POST /1.0/remove-job` with `job_id=jobid` in the body removes a
  stopped job from the state database.

* `POST /1.0/remove-ghost-jobs` looks for any running jobs in STATEDB
  that haven't been updated (with `job-update`, see below) in a long
  time (see `--ghost-timeout`), and marks them as terminated. This is
  used to catch situations where a MINION fails to tell the WEBAPP
  that a job has terminated.

Other status queries:

* `GET /1.0/status` causes WEBAPP to return a JSON object that
  describes the state of Lorry Controller. This information is meant
  to be programmatically usable, and may or may not be the same as in
  the HTML page.

* `GET /1.0/status-html` causes WEBAPP to return an HTML page that
  describes the state of Lorry Controller. This also updates an
  on-disk copy of the HTML page, which the web server is configured to
  serve using a normal HTTP request. This is the primary interface for
  human admins to look at the state of Lorry Controller. (MON/STATIC)

* `GET /1.0/lorry/<lorryspecid>` causes WEBAPP to return a JSON map
  (dict) with all the information about the specified Lorry
  specification. (RQ/SPEC)

Requests for MINION:

* `GET /1.0/give-me-job` is used by MINION to get a new job to run.
  WEBAPP will either return a JSON object describing the job to run,
  or return a status code indicating that there is nothing to do.
  WEBAPP will respond immediately, even if there is nothing for MINION
  to do, and MINION will then sleep for a while before it tries again.
  WEBAPP updates STATEDB to record that the job is allocated to a
  MINION.

* `POST /1.0/job-update` is used by MINION to push updates about the
  job it is running to WEBAPP. The body sets the fields `exit` (exit
  code of the program, or `no` if not set), `stdout` (some output from
  the job's standard output) and `stderr` (ditto, but standard error
  output). There MUST be at least one `job-update` call indicating
  that the job has terminated. WEBAPP responds with a status
  indicating whether the job should continue to run or be terminated
  (RR/TIMEOUT). WEBAPP records the job as terminated only after MINION
  tells it the job has been terminated. MINION makes the `job-update`
  request frequently, even if the job has produced no output, so that
  WEBAPP can update a timestamp in STATEDB to indicate the job is
  still alive.

Other requests:

* `POST /1.0/read-configuration` causes WEBAPP to update its copy of
  CONFGIT and update STATEDB based on the new configuration, if it has
  changed. Returns OK/ERROR status. (RC/ADD, RC/RM, RC/START)

  This is called by systemd units at system startup and periodically
  (perhaps once a minute) otherwise. It can also be triggered by an
  admin (there is a button on the `/1.0/status-html` web page).

* `POST /1.0/ls-troves` causes WEBAPP to refresh its list of
  repositories in each Upstream Host, if the current list is too old
  (see the `ls-interval` setting for each Upstream Host in
  `lorry-controller.conf`). This gets called from a systemd timer unit
  at a suitable interval.

* `POST /1.0/force-ls-troves` causes the repository refresh to happen
  for all Upstream Hosts, regardless of whether it is due or not. This
  can be called manually by an admin.

The MINION
----------

The MINION's main loop works roughly like this (a code sketch follows
the list):

* Do `GET /1.0/give-me-job` to WEBAPP.
* If it didn't get a job, sleep a while and try again.
* If it did get a job, fork and exec it.
* In a loop: wait, for a suitably short period of time, for output
  from the job (or for its termination), using `select` or a similar
  mechanism, and send anything you get to WEBAPP. If the WEBAPP told
  us to kill the job, kill it, then send an update to that effect to
  WEBAPP.
* Go back to the top to request a new job.
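A highly simplified sketch of that loop follows. It is not the actual
MINION code: the base URL, the JSON field names (`argv`, `job_id`) and
the "`null` means no job" convention are assumptions made for the
illustration, error handling is omitted, and the `select`-based
partial-output loop is collapsed into a single wait.

    # Simplified MINION loop sketch; protocol details are assumed.
    import json
    import subprocess
    import time
    import urllib.parse
    import urllib.request

    BASE = 'http://localhost:12765/1.0'

    while True:
        # Ask WEBAPP for work (give-me-job).
        with urllib.request.urlopen(BASE + '/give-me-job') as resp:
            job = json.loads(resp.read().decode('utf-8'))
        if job is None:
            time.sleep(10)  # nothing to do; try again in a while
            continue

        # Run the job. The real MINION streams partial output with
        # periodic job-update requests while the job runs, and kills
        # the job if WEBAPP says so (RT/KILL); this sketch just waits.
        proc = subprocess.Popen(
            job['argv'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = proc.communicate()

        # Final job-update, indicating termination (exit code is set).
        body = urllib.parse.urlencode({
            'job_id': job['job_id'],
            'exit': str(proc.returncode),
            'stdout': stdout.decode('utf-8', 'replace'),
            'stderr': stderr.decode('utf-8', 'replace'),
        }).encode('utf-8')
        urllib.request.urlopen(BASE + '/job-update', data=body)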
Old job removal
---------------

To avoid STATEDB filling up with logs of old jobs, a systemd timer
unit will run occasionally to remove jobs so old that nobody cares
about them anymore. To make it easier to experiment with the logic of
choosing what to remove (age only? keep failed ones? something else?),
the removal is kept outside the WEBAPP.


STATEDB
-------

The STATEDB has several tables. This section explains them.

The `running_queue` table has a single column (`running`) and a single
row, and is used to store a single boolean value that specifies
whether WEBAPP is giving out jobs to run from the run queue. This
value is controlled by the `/1.0/start-queue` and `/1.0/stop-queue`
requests.

The `lorries` table implements the run queue: all the Lorry specs that
WEBAPP knows about. It has the following columns:

* `path` is the path of the git repository on the Downstream Host,
  i.e., the git repository to which Lorry will push. This is a unique
  identifier. It is used, for example, to determine if a Lorry spec
  is obsolete after a CONFGIT update.
* `text` has the text of the Lorry spec. This may be read from a file
  or generated by Lorry Controller itself. This text will be given to
  Lorry when a job is run.
* `generated` is set to 0 or 1, depending on whether the Lorry came
  from an actual `.lorry` file or was generated by Lorry Controller.
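In sqlite3 terms, those two tables might be created roughly as in the
sketch below. This is illustrative only, not the actual schema: column
types, and any further columns and tables the real STATEDB has (job
bookkeeping, timestamps, and so on), are not shown.

    # Illustrative sqlite3 schema for the two tables described above.
    import sqlite3

    conn = sqlite3.connect('/tmp/example-statedb.sqlite3')
    conn.executescript('''
        CREATE TABLE IF NOT EXISTS running_queue (
            running INTEGER  -- boolean: 1 = hand out jobs, 0 = paused
        );
        CREATE TABLE IF NOT EXISTS lorries (
            path TEXT PRIMARY KEY,  -- repo path on the Downstream Host
            text TEXT,              -- the Lorry spec given to Lorry
            generated INTEGER       -- 1 if generated from a Host spec
        );
    ''')
    conn.commit()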
Code structure
==============

The Lorry Controller code base is laid out as follows:

* `lorry-controller-webapp` is the main program of WEBAPP. It sets up
  the bottle.py framework. All the implementations for the various
  HTTP requests are in classes in the `lorrycontroller` Python
  package, as subclasses of the `LorryControllerRoute` class. The main
  program uses introspection ("magic") to find the subclasses
  automatically and sets up the bottle.py routes correctly (see the
  sketch after this list). This makes it possible to spread the code
  into simple classes; bottle's normal way (with the `@app.route`
  decorator) seemed to make that harder, and to require everything in
  the same class.

* `lorrycontroller` is a Python package with:

  - the HTTP request handlers (`LorryControllerRoute` and its
    subclasses)
  - management of STATEDB (`statedb` module)
  - support for various Downstream and Upstream Host types
    (`hosts`, `gitano`, `gerrit`, `gitlab`, `local` modules)
  - some helpful utilities (`proxy` module)

* `lorry-controller-minion` is the entirety of the MINION, except that
  it uses the `lorrycontroller.setup_proxy` function.
  The MINION is kept very simple on purpose: all the interesting logic
  is in the WEBAPP instead.

* `static` has static content to be served over HTTP, primarily the
  CSS file for the HTML interfaces. When Lorry Controller is
  integrated with the Downstream Host, the web server gets configured
  to serve these files directly. The `static` directory will be
  accessible over plain HTTP on port 80, and on port 12765 via the
  WEBAPP, to allow HTML pages to refer to it via a simple path.

* `templates` contains bottle.py HTML templates for various pages.

* `etc` contains files to be installed in `/etc` when Lorry Controller
  is installed on a Baserock system. Primarily this is the web server
  (lighttpd) configuration to invoke WEBAPP.

* `units` contains various systemd units that start services and run
  time-based jobs.

* `yarns.webapp` contains an integration test suite for WEBAPP.
  This is run by the `./check` script. The `./test-wait-for-port`
  script is used by the yarns.
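The following sketch illustrates the route auto-discovery idea. Only
the `LorryControllerRoute` name comes from the description above; the
attribute and method names (`http_method`, `path`, `run`) are
assumptions made for the sketch, not the real API.

    # Illustrative sketch of route classes found via introspection,
    # instead of decorating handler functions with @app.route.
    import bottle

    class LorryControllerRoute(object):
        '''Base class: each subclass implements one HTTP request.'''
        http_method = None
        path = None

        def run(self, **kwargs):
            raise NotImplementedError()

    class ListQueue(LorryControllerRoute):
        http_method = 'GET'
        path = '/1.0/list-queue'

        def run(self, **kwargs):
            return {'queue': []}  # placeholder response body

    app = bottle.Bottle()
    # Find all subclasses and register a bottle route for each one.
    for cls in LorryControllerRoute.__subclasses__():
        route = cls()
        app.route(path=route.path, method=route.http_method,
                  callback=route.run)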
Example
-------

As an example, to modify how the `/1.0/status-html` request works, you
would look at its implementation in `lorrycontroller/status.py`, and
perhaps also at the HTML templates in `templates/*.tpl`.

STATEDB
-------

The persistent state of WEBAPP is stored in an sqlite3 database. All
access to STATEDB within WEBAPP goes via the
`lorrycontroller/statedb.py` code module. That means there are no SQL
statements outside `statedb.py` at all, nor is it OK to add any. If
the interface provided by the `StateDB` class isn't sufficient, then
modify the class suitably, but do not add any new SQL outside it.

All access from outside of WEBAPP happens via WEBAPP's HTTP API.
Only the WEBAPP is allowed to touch STATEDB in any way.

The bottle.py framework runs multiple threads of WEBAPP code. The
threads communicate only via STATEDB. There is no shared state in
memory. SQLite's locking is used for mutual exclusion.

The `StateDB` class acts as a context manager for Python's `with`
statements to provide locking. To access STATEDB with locking, use
code such as this:

    with self.open_statedb() as statedb:
        hosts = statedb.get_hosts()
        for host in hosts:
            statedb.remove_host(host)

The code executed by the `with` statement is run under lock, and the
lock gets released automatically even if there is an exception.

(You could manage locks manually. It's a good way to build character,
and to learn why using the context manager is really simple and leads
to more correct code.)
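For readers unfamiliar with the pattern, a bare-bones version of such
a context manager might look like the sketch below. This is not the
real `StateDB` class: the lock granularity and the query method are
assumptions chosen to fit the tables described earlier.

    # Bare-bones sketch of a locking context manager around sqlite3.
    # threading.Lock only serialises threads within one process; as
    # noted above, exclusion between WSGI processes comes from
    # SQLite's own locking.
    import sqlite3
    import threading

    class StateDB(object):

        _lock = threading.Lock()  # shared by all threads in a process

        def __init__(self, filename):
            self._filename = filename
            self._conn = None

        def __enter__(self):
            # Take the lock and open the database for the duration
            # of the "with" block.
            self._lock.acquire()
            self._conn = sqlite3.connect(self._filename)
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            # Commit on success, roll back on error, and always
            # release the lock, even when an exception was raised.
            if exc_type is None:
                self._conn.commit()
            else:
                self._conn.rollback()
            self._conn.close()
            self._lock.release()
            return False  # never suppress exceptions

        def get_lorry_paths(self):
            # Hypothetical query method against the lorries table.
            rows = self._conn.execute('SELECT path FROM lorries')
            return [row[0] for row in rows]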