author    Ben Hutchings <ben.hutchings@codethink.co.uk> 2020-07-15 16:22:51 +0100
committer Ben Hutchings <ben.hutchings@codethink.co.uk> 2020-07-15 16:29:20 +0100
commit    74ffb6dbd4d21888d147bf912cc0de75e85d609e (patch)
tree      d2687e1c75073ceadd6409d8efd08123d73c4ffc /ARCH
parent    f4be06b2fe0655f64e7eb7d2b235f18d35c35696 (diff)
download  lorry-controller-74ffb6dbd4d21888d147bf912cc0de75e85d609e.tar.gz
Add .md extension to Markdown documents
This will cause them to be rendered on GitLab and other git hosts' web interfaces.
Diffstat (limited to 'ARCH')
-rw-r--r--  ARCH  505
1 file changed, 0 insertions(+), 505 deletions(-)
diff --git a/ARCH b/ARCH
deleted file mode 100644
index 5f7ce7b..0000000
--- a/ARCH
+++ /dev/null
@@ -1,505 +0,0 @@
-% Architecture of daemonised Lorry Controller
-% Codethink Ltd
-
-Introduction
-============
-
-This is an architecture document for Lorry Controller. It is aimed at
-those who develop the software, or develop against its HTTP API. See
-the file `README` for general information about Lorry Controller.
-
-
-Requirements
-============
-
-Some concepts/terminology:
-
-* CONFGIT is the git repository Lorry Controller uses for its
- configuration.
-
-* Lorry specification: the configuration telling Lorry how to mirror
- an upstream version control repository or tarball. Note that a
- `.lorry` file may contain several specifications.
-
-* Upstream Host: a git hosting server that Lorry Controller mirrors
- from.
-
-* Host specification: which Upstream Host to mirror. This gets
- broken into generated Lorry specifications, one per git repository
- on the Upstream Host. There can be many Host specifications to
- mirror many Hosts.
-
-* Downstream Host: a git hosting server that Lorry Controller mirrors
- to.
-
-* run queue: all the Lorry specifications (from CONFGIT or generated
- from the Host specifications) a Lorry Controller knows about; this
- is the set of things that get scheduled. The queue has a linear
- order (first job in the queue is the next job to execute).
-
-* job: An instance of executing a Lorry specification. Each job has an
- identifier and associated data (such as the output provided by the
- running job, and whether it succeeded).
-
-* admin: a person who can control or reconfigure a Lorry Controller
- instance. All users of the HTTP API are admins, for example.
-
-For historical reasons, Hosts are also referred to as Troves in many
-places.
-
-The original set of requirements, which has been broken down and
-detailed below:
-
-* Lorry Controller should be capable of being reconfigured at runtime
- to allow new tasks to be added and old tasks to be removed.
- (RC/ADD, RC/RM, RC/START)
-
-* Lorry Controller should not allow all tasks to become stuck if one
- task is taking a long time. (RR/MULTI)
-
-* Lorry Controller should not allow stuck tasks to remain stuck
- forever. (Configurable timeout? monitoring of disk usage or CPU to
- see if work is being done?) (RR/TIMEOUT)
-
-* Lorry Controller should be able to be controlled at runtime to allow:
- - Querying of the current task set (RQ/SPECS, RQ/SPEC)
- - Querying of currently running tasks (RQ/RUNNING)
- - Promotion or demotion of a task in the queue (RT/TOP, RT/BOT)
- - Supporting of the health monitoring to allow appropriate alerts
- to be sent out (MON/STATIC, MON/DU)
-
-The detailed requirements (prefixed by a unique identifier, which is
-used elsewhere to refer to the exact requirement):
-
-* (FW) Lorry Controller can access Upstream Hosts from behind firewalls.
- * (FW/H) Lorry Controller can access the Upstream Host using HTTP or
- HTTPS only, without using ssh, in order to get a list of
- repositories to mirror. (Lorry itself also needs to be able to
- access the Upstream Host using HTTP or HTTPS only, bypassing
- ssh, but that's a Lorry problem and outside the scope of Lorry
- Controller, so it'll need to be dealt with separately.)
- * (FW/C) Lorry Controller does not verify SSL/TLS certificates
- when accessing the Upstream Host.
-* (RC) Lorry Controller can be reconfigured at runtime.
- * (RC/ADD) A new Lorry specification can be added to CONFGIT, and
- a running Lorry Controller will add it to its run queue as
- soon as it is notified of the change.
- * (RC/RM) A Lorry specification can be removed from CONFGIT, and a
- running Lorry Controller will remove it from its run queue as
- soon as it is notified of the change.
- * (RC/START) A Lorry Controller reads CONFGIT when it starts,
- updating its run queue if anything has changed.
-* (RT) Lorry Controller can be controlled at runtime.
- * (RT/KILL) An admin can get their Lorry Controller to stop a
- running job.
- * (RT/TOP) An admin can get their Lorry Controller to move a Lorry
- spec to the beginning of the run queue.
- * (RT/BOT) An admin can get their Lorry Controller to move a Lorry
- spec to the end of the run queue.
- * (RT/QSTOP) An admin can stop their Lorry Controller from
- scheduling any new jobs.
- * (RT/QSTART) An admin can get their Lorry Controller to start
- scheduling jobs again.
-* (RQ) Lorry Controller can be queried at runtime.
- * (RQ/RUNNING) An admin can list all currently running jobs.
- * (RQ/ALLJOBS) An admin can list all finished jobs that the Lorry
- Controller still remembers.
- * (RQ/SPECS) An admin can list all existing Lorry specifications
- in the run queue.
- * (RQ/SPEC) An admin can query existing Lorry specifications in
- the run queue for any information the Lorry Controller holds for
- them, such as the last time they successfully finished running.
-* (RR) Lorry Controller is reasonably robust.
- * (RR/CONF) Lorry Controller ignores any broken Lorry or Host
- specifications in CONFGIT, and runs without them.
- * (RR/TIMEOUT) Lorry Controller stops a job that runs for too
- long.
- * (RR/MULTI) Lorry Controller can run multiple jobs at the same
- time, and lets the admin configure the maximum number of
- concurrent jobs.
- * (RR/DU) Lorry Controller (and the way it runs Lorry) is
- designed to be frugal about disk space usage.
- * (RR/CERT) Lorry Controller tells Lorry to not worry about
- unverifiable SSL/TLS certificates and to continue even if the
- certificate can't be verified or the verification fails.
-* (RS) Lorry Controller is reasonably scalable.
- * (RS/SPECS) Lorry Controller works for the number of Lorry
- specifications we have on git.baserock.org (a number that will
- increase, and is currently about 500).
- * (RS/GITS) Lorry Controller works for mirroring git.baserock.org
- (about 500 git repositories).
- * (RS/HW) Lorry Controller may assume that CPU, disk, and
- bandwidth are sufficient, though they should not be needlessly
- wasted.
-* (MON) Lorry Controller can be monitored from the outside.
- * (MON/STATIC) Lorry Controller updates a static HTML file at
- least once a minute, showing its current status in sufficient
- detail that an admin can tell if things get stuck or break.
- * (MON/DU) Lorry Controller measures, at least, the disk usage of
- each job and Lorry specification.
-* (SEC) Lorry Controller is reasonably secure.
- * (SEC/API) Access to the Lorry Controller run-time query and
- controller interfaces is managed with iptables (for now).
- * (SEC/CONF) Access to CONFGIT is managed by the git server that
- hosts it. (Gitano on Trove.)
-
-Architecture design
-===================
-
-Constraints
------------
-
-Python is not good at multiple threads (partly due to the global
-interpreter lock), and mixing threads and executing subprocesses is
-quite tricky to get right in general. Thus, this design splits the
-software into a threaded web application (using the bottle.py
-framework) and one or more single-threaded worker processes to execute
-Lorry.
-
-Entities
---------
-
-* An admin is a human being or some software using the HTTP API to
- communicate with the Lorry Controller.
-* Lorry Controller runs Lorry appropriately, and consists of several
- components described below.
-* The Downstream Host is as defined in Requirements.
-* An Upstream Host is as defined in Requirements. There can be
- multiple Upstream Hosts.
-
-Components of Lorry Controller
-------------------------------
-
-* CONFGIT is a git repository for Lorry Controller configuration,
- which the Lorry Controller (see WEBAPP below) can access and pull
- from. Pushing is not required and should be prevented by Gitano.
- CONFGIT is hosted on the Downstream Host.
-
-* STATEDB is persistent storage for the Lorry Controller's state: what
- Lorry specs it knows about (provided by the admin, or generated from
- a Host spec by Lorry Controller itself), their ordering, jobs that
- have been run or are being run, information about the jobs, etc. The
- idea is that the Lorry Controller process can terminate (cleanly or
- by crashing), and be restarted, and continue approximately from
- where it was. Also, persistent storage is useful if there are
- multiple processes involved due to how bottle.py and WSGI work.
- STATEDB is implemented using sqlite3.
-
-* WEBAPP is the controlling part of Lorry Controller, which maintains
- the run queue, and provides an HTTP API for monitoring and
- controlling Lorry Controller. WEBAPP is implemented as a bottle.py
- application. bottle.py runs the WEBAPP code in multiple threads to
- improve concurrency.
-
-* MINION runs jobs (external processes) on behalf of WEBAPP. It
- communicates with WEBAPP over HTTP: it requests a job to run,
- starts it, and, while the job runs, sends partial output to the
- WEBAPP every few seconds and asks the WEBAPP whether the job should
- be aborted. MINION may eventually run on a different host than
- WEBAPP, for added scalability.
-
-Components external to Lorry Controller
----------------------------------------
-
-* A web server. This runs the Lorry Controller WEBAPP, using WSGI so
- that multiple instances (processes) can run at once, and thus serve
- many clients.
-
-* bottle.py is a Python microframework for web applications. It sits
- between the web server itself and the WEBAPP code.
-
-* systemd is the operating system component that starts services and
- processes.
-
-How the components work together
---------------------------------
-
-* Each WEBAPP instance is started by the web server, when a request
- comes in. The web server is started by a systemd unit.
-
-* Each MINION instance is started by a systemd unit. Each MINION
- handles one job at a time, and doesn't block other MINIONs from
- running other jobs. The admins decide how many MINIONs run at once,
- depending on hardware resources and other considerations. (RR/MULTI)
-
-* An admin communicates with the WEBAPP only, by making HTTP requests.
- Each request is either a query (GET) or a command (POST). Queries
- report state as stored in STATEDB. Commands cause the WEBAPP
- instance to do something and alter STATEDB accordingly.
-
-* When an admin makes changes to CONFGIT, and pushes them to the Downstream
- Host, the Host's git post-update hook makes an HTTP request to
- WEBAPP to update STATEDB from CONFGIT; see the hook sketch at the
- end of this section. (RC/ADD, RC/RM)
-
-* Each MINION likewise communicates only with the WEBAPP using HTTP
- requests. MINION requests a job to run (which triggers WEBAPP's job
- scheduling), and then reports results to the WEBAPP (which causes
- WEBAPP to store them in STATEDB), which tells MINION whether to
- continue running the job or not (RT/KILL). There is no separate
- scheduling process: all scheduling happens when there is a MINION
- available.
-
-* At system startup, a systemd unit makes an HTTP request to WEBAPP
- to make it refresh STATEDB from CONFGIT. (RC/START)
-
-* A timer unit for systemd makes an HTTP request to get WEBAPP to
- refresh the static HTML status page. (MON/STATIC)
-
-In summary: systemd starts WEBAPP and MINIONs, and whenever a
-MINION can do work, it asks WEBAPP for something to do, and reports
-back results. Meanwhile, admin can query and control via HTTP requests
-to WEBAPP, and WEBAPP instances communicate via STATEDB.
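-
-The post-update hook mentioned above can be trivial. A hypothetical
-sketch, assuming the WEBAPP is reachable from the git server, using
-the `read-configuration` request described later:
-
-    #!/usr/bin/env python
-    # Hypothetical git post-update hook on the Downstream Host: poke
-    # the WEBAPP so it re-reads CONFGIT. The URL is an assumption.
-    import urllib.request
-
-    urllib.request.urlopen(
-        'http://localhost:12765/1.0/read-configuration', data=b'')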
-
-The WEBAPP
-----------
-
-The WEBAPP provides an HTTP API as described below.
-
-Run queue management:
-
-* `POST /1.0/stop-queue` causes WEBAPP to stop scheduling new jobs to
- run. Any currently running jobs are not affected. (RT/QSTOP)
-
-* `POST /1.0/start-queue` causes WEBAPP to start scheduling jobs
- again. (RT/QSTART)
-
-* `GET /1.0/list-queue` causes WEBAPP to return a JSON list of ids of
- all Lorry specifications in the run queue, in the order they are in
- the run queue. (RQ/SPECS)
-
-* `POST /1.0/move-to-top` with `path=lorryspecid` as the body, where
- `lorryspecid` is the id (path) of a Lorry specification in the run
- queue, causes WEBAPP to move the specified spec to the head of the
- run queue, and store this in STATEDB. It doesn't affect currently
- running jobs. (RT/TOP)
-
-* `POST /1.0/move-to-bottom` with `path=lorryspecid` in the body is
- like `/move-to-top`, but moves the job to the end of the run queue.
- (RT/BOT)
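-
-As an illustration, the queue can be driven with nothing but the
-Python standard library. The sketch below lists the queue and
-promotes the last spec to the top; the base URL is an assumption
-(the WEBAPP listens on port 12765 in the integrated setup described
-under "Code structure" below):
-
-    import json
-    import urllib.parse
-    import urllib.request
-
-    BASE = 'http://localhost:12765'  # assumed WEBAPP address
-
-    def get_json(path):
-        # GET a request path and decode the JSON response body.
-        with urllib.request.urlopen(BASE + path) as response:
-            return json.loads(response.read().decode('utf-8'))
-
-    def post(path, **fields):
-        # POST form-encoded fields to a request path.
-        data = urllib.parse.urlencode(fields).encode('ascii')
-        with urllib.request.urlopen(BASE + path, data=data) as response:
-            return response.read()
-
-    queue = get_json('/1.0/list-queue')
-    if queue:
-        post('/1.0/move-to-top', path=queue[-1])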
-
-Running job management:
-
-* `GET /1.0/list-running-jobs` causes WEBAPP to return a JSON list of
- ids of all currently running jobs. (RQ/RUNNING)
-
-* `GET /1.0/job/<jobid>` causes WEBAPP to return a JSON map (dict)
- with all the information about the specified job.
-
-* `POST /1.0/stop-job` with `job_id=jobid` in the body, where `jobid`
- is the id of a running job, causes WEBAPP to record in STATEDB that
- the job is to be killed; the actual killing happens when the job's
- MINION next checks in. This request returns as soon as the STATEDB
- change is done.
-
-* `GET /1.0/list-jobs` causes WEBAPP to return a JSON list of ids
- of all jobs, running or finished, that it knows about. (RQ/ALLJOBS)
-
-* `GET /1.0/list-jobs-html` is the same as `list-jobs`, but returns an
- HTML page instead.
-
-* `POST /1.0/remove-job` with `job_id=jobid` in the body, removes a
- stopped job from the state database.
-
-* `POST /1.0/remove-ghost-jobs` looks for any running jobs in STATEDB
- that haven't been updated (with `job-update`, see below) in a long
- time (see `--ghost-timeout`), and marks them as terminated. This is
- used to catch situations when a MINION fails to tell the WEBAPP that
- a job has terminated.
-
-Other status queries:
-
-* `GET /1.0/status` causes WEBAPP to return a JSON object that
- describes the state of Lorry Controller. This information is meant
- to be programmatically useable and may or may not be the same as in
- the HTML page.
-
-* `GET /1.0/status-html` causes WEBAPP to return an HTML page that
- describes the state of Lorry Controller. This also updates an
- on-disk copy of the HTML page, which the web server is configured to
- serve using a normal HTTP request. This is the primary interface for
- human admins to look at the state of Lorry Controller. (MON/STATIC)
-
-* `GET /1.0/lorry/<lorryspecid>` causes WEBAPP to return a JSON map
- (dict) with all the information about the specified Lorry
- specification. (RQ/SPEC)
-
-
-Requests for MINION:
-
-* `GET /1.0/give-me-job` is used by MINION to get a new job to run.
- WEBAPP will either return a JSON object describing the job to run,
- or return a status code indicating that there is nothing to do.
- WEBAPP will respond immediately, even if there is nothing for MINION
- to do, and MINION will then sleep for a while before it tries again.
- WEBAPP updates STATEDB to record that the job is allocated to a
- MINION.
-
-* `POST /1.0/job-update` is used by MINION to push updates about the
- job it is running to WEBAPP. The body sets fields `exit` (exit code
- of program, or `no` if not set), `stdout` (some output from the
- job's standard output) and `stderr` (ditto, but standard error
- output). There MUST be at least one `job-update` call: the final
- one, which indicates that the job has terminated. WEBAPP responds
- with a status
- indicating whether the job should continue to run or be terminated
- (RR/TIMEOUT). WEBAPP records the job as terminated only after MINION
- tells it the job has been terminated. MINION makes the `job-update`
- request frequently, even if the job has produced no output, so that
- WEBAPP can update a timestamp in STATEDB to indicate the job is
- still alive.
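-
-A minimal sketch of a `job-update` call, assuming form-encoded fields
-and a local WEBAPP; the `job_id` field and the exact encoding are
-assumptions, since this document only fixes the field meanings:
-
-    import urllib.parse
-    import urllib.request
-
-    def send_job_update(job_id, exit_code, stdout, stderr):
-        # `exit` is the numeric exit code once known, or 'no' while
-        # the job is still running.
-        fields = {
-            'job_id': job_id,
-            'exit': 'no' if exit_code is None else str(exit_code),
-            'stdout': stdout,
-            'stderr': stderr,
-        }
-        data = urllib.parse.urlencode(fields).encode('ascii')
-        with urllib.request.urlopen(
-                'http://localhost:12765/1.0/job-update',
-                data=data) as response:
-            return response.read()  # tells MINION to continue or kill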
-
-Other requests:
-
-* `POST /1.0/read-configuration` causes WEBAPP to update its copy of
- CONFGIT and update STATEDB based on the new configuration, if it has
- changed. Returns OK/ERROR status. (RC/ADD, RC/RM, RC/START)
-
- This is called by systemd units at system startup and periodically
- (perhaps once a minute) otherwise. It can also be triggered by an
- admin (there is a button on the `/1.0/status-html` web page).
-
-* `POST /1.0/ls-troves` causes WEBAPP to refresh its list of
- repositories in each Upstream Host, if the current list is too old
- (see the `ls-interval` setting for each Upstream Host in
- `lorry-controller.conf`). This gets called from a systemd timer unit
- at a suitable interval.
-
-* `POST /1.0/force-ls-troves` causes the repository refresh to happen
- for all Upstream Hosts, regardless of whether it is due or not. This
- can be called manually by an admin.
-
-
-The MINION
-----------
-
-* Do `GET /1.0/give-me-job` to WEBAPP.
-* If it didn't get a job, sleep a while and try again.
-* If it did get a job, fork and exec that.
-* In a loop: wait for output from the job (or its termination), for a
- suitably short period of time, with `select` or a similar mechanism,
- and send anything you get to WEBAPP. If the WEBAPP told us to kill
- the job, kill it, then send an update to that effect to WEBAPP.
-* Go back to the top to request a new job.
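-
-A condensed sketch of that loop, reusing the `send_job_update` helper
-sketched under "Requests for MINION" above. The job's JSON fields
-(`job_id`, `argv`) and the "nothing to do" and "kill" responses are
-assumptions; the structure is what matters:
-
-    import json
-    import select
-    import subprocess
-    import time
-    import urllib.request
-
-    def give_me_job():
-        # Ask WEBAPP for work; assume an empty or null body means
-        # there is nothing to do right now.
-        url = 'http://localhost:12765/1.0/give-me-job'
-        with urllib.request.urlopen(url) as response:
-            body = response.read()
-        return json.loads(body) if body and body != b'null' else None
-
-    def main_loop():
-        while True:
-            job = give_me_job()
-            if job is None:
-                time.sleep(10)  # nothing to do; sleep, then ask again
-                continue
-            p = subprocess.Popen(job['argv'], stdout=subprocess.PIPE,
-                                 stderr=subprocess.PIPE)
-            while p.poll() is None:
-                # Wait briefly for output so we report to WEBAPP
-                # regularly even when the job is quiet.
-                ready, _, _ = select.select(
-                    [p.stdout, p.stderr], [], [], 5.0)
-                out = p.stdout.read1(4096) if p.stdout in ready else b''
-                err = p.stderr.read1(4096) if p.stderr in ready else b''
-                answer = send_job_update(
-                    job['job_id'], None,
-                    out.decode('utf-8', 'replace'),
-                    err.decode('utf-8', 'replace'))
-                if answer == b'kill':  # assumed reply format
-                    p.kill()
-            # The mandatory final update: report the exit code so
-            # WEBAPP records the job as terminated.
-            send_job_update(job['job_id'], p.returncode, '', '')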
-
-
-Old job removal
----------------
-
-To avoid the STATEDB filling up with logs of old jobs, a systemd timer
-unit will run occasionally to remove jobs so old that nobody cares about
-them anymore. To make it easier to experiment with the logic of
-choosing what to remove (age only? keep failed ones? something else?)
-the removal is kept outside the WEBAPP.
-
-
-STATEDB
--------
-
-The STATEDB has several tables. This section explains them.
-
-The `running_queue` table has a single column (`running`) and a single
-row, and is used to store a single boolean value that specifies
-whether WEBAPP is giving out jobs to run from the run-queue. This
-value is controlled by `/1.0/start-queue` and `/1.0/stop-queue`
-requests.
-
-The `lorries` table implements the run-queue: all the Lorry specs that
-WEBAPP knows about. It has the following columns:
-
-* `path` is the path of the git repository on the Downstream Host, i.e.,
- the git repository to which Lorry will push. This is a unique
- identifier. It is used, for example, to determine if a Lorry spec
- is obsolete after a CONFGIT update.
-* `text` has the text of the Lorry spec. This may be read from a file
- or generated by Lorry Controller itself. This text will be given to
- Lorry when a job is run.
-* `generated` is set to 0 or 1, depending on whether the Lorry came
- from an actual `.lorry` file or was generated by Lorry Controller.
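-
-A plausible schema for these two tables, sketched with Python's
-sqlite3 module; the column types and database filename are
-assumptions, as the text above only fixes the column names and their
-meanings:
-
-    import sqlite3
-
-    connection = sqlite3.connect('statedb.sqlite')
-    connection.executescript('''
-        CREATE TABLE IF NOT EXISTS running_queue (
-            running INTEGER NOT NULL  -- boolean: give out jobs or not
-        );
-        CREATE TABLE IF NOT EXISTS lorries (
-            path TEXT PRIMARY KEY,     -- repo path on Downstream Host
-            text TEXT NOT NULL,        -- the Lorry spec itself
-            generated INTEGER NOT NULL -- 1 if generated from Host spec
-        );
-    ''')
-    connection.commit()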
-
-
-Code structure
-==============
-
-The Lorry Controller code base is laid out as follows:
-
-* `lorry-controller-webapp` is the main program of WEBAPP. It sets up
- the bottle.py framework. All the implementations for the various
- HTTP requests are in classes in the `lorrycontroller` Python
- package, as subclasses of the `LorryControllerRoute` class. The main
- program uses introspection ("magic") to find the subclasses
- automatically and sets up the bottle.py routes correctly; see the
- sketch at the end of this section. This makes it possible to spread
- the code across simple classes; bottle's normal way (the
- `@app.route` decorator) seemed to make that harder, and to require
- everything to live in one class.
-
-* `lorrycontroller` is a Python package with:
-
- - The HTTP request handlers (`LorryControllerRoute` and its subclasses)
- - Management of STATEDB (`statedb` module)
- - Support for various Downstream and Upstream Host types
- (`hosts`, `gitano`, `gerrit`, `gitlab`, `local` modules)
- - Some helpful utilities (`proxy` module)
-
-* `lorry-controller-minion` is the entirety of the MINION, except that
- it uses the `lorrycontroller.setup_proxy` function.
- The MINION is kept very simple on purpose: all the interesting logic
- is in the WEBAPP instead.
-
-* `static` has static content to be served over HTTP. Primarily, the
- CSS file for the HTML interfaces. When LC is integrated within the
- Downstream Host, the web server gets configured to serve these files directly.
- The `static` directory will be accessible over plain HTTP on port
- 80, and on port 12765 via the WEBAPP, to allow HTML pages to refer
- to it via a simple path.
-
-* `templates` contains bottle.py HTML templates for various pages.
-
-* `etc` contains files to be installed in `/etc` when LC is installed
- on a Baserock system. Primarily this is the web server (lighttpd)
- configuration to invoke WEBAPP.
-
-* `units` contains various systemd units that start services and run
- time-based jobs.
-
-* `yarns.webapp` contains an integration test suite for WEBAPP.
- This is run by the `./check` script. The `./test-wait-for-port`
- script is used by the yarns.
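-
-A sketch of the introspection described above, under the assumption
-that each route class declares `http_method` and `path` attributes
-(the attribute names here are illustrative, not the actual ones):
-
-    import bottle
-
-    class LorryControllerRoute(object):
-        # Subclasses declare a method and path, and implement run().
-        http_method = 'GET'
-        path = None
-
-        def run(self, **kwargs):
-            raise NotImplementedError()
-
-    class StatusHTML(LorryControllerRoute):
-        path = '/1.0/status-html'
-
-        def run(self, **kwargs):
-            return '<html>...</html>'
-
-    app = bottle.Bottle()
-    for subclass in LorryControllerRoute.__subclasses__():
-        # __subclasses__ only lists direct subclasses, which is
-        # enough for this sketch.
-        route = subclass()
-        app.route(path=route.path, method=route.http_method,
-                  callback=route.run)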
-
-Example
--------
-
-As an example, to modify how the `/1.0/status-html` request works, you
-would look at its implementation in `lorrycontroller/status.py`, and
-perhaps also the HTML templates in `templates/*.tpl`.
-
-STATEDB
--------
-
-The persistent state of WEBAPP is stored in an SQLite3 database. All
-access to STATEDB within WEBAPP is via the
-`lorrycontroller/statedb.py` code module. That means there are no SQL
-statements outside `statedb.py` at all, nor is it OK to add any. If
-the interface provided by the `StateDB` class isn't sufficient, then
-modify the class suitably, but do not add any new SQL outside it.
-
-All access from outside of WEBAPP happens via WEBAPP's HTTP API.
-Only the WEBAPP is allowed to touch STATEDB in any way.
-
-The bottle.py framework runs multiple threads of WEBAPP code. The
-threads communicate only via STATEDB. There is no shared state in
-memory. SQLite's locking is used for mutual exclusion.
-
-The `StateDB` class acts as a context manager for Python's `with`
-statements to provide locking. To access STATEDB with locking, use
-code such as this:
-
-    with self.open_statedb() as statedb:
-        hosts = statedb.get_hosts()
-        for host in hosts:
-            statedb.remove_host(host)
-
-The code executed by the `with` statement is run under lock, and the
-lock gets released automatically even if there is an exception.
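-
-A sketch of how `StateDB` can provide this, assuming one sqlite3
-connection per instance (method and table names beyond those used
-above are illustrative):
-
-    import sqlite3
-
-    class StateDB(object):
-
-        def __init__(self, filename):
-            # isolation_level=None puts the connection in autocommit
-            # mode, so we control transactions explicitly below.
-            self.conn = sqlite3.connect(filename, isolation_level=None)
-
-        def __enter__(self):
-            # BEGIN IMMEDIATE takes the write lock now rather than at
-            # the first write, giving simple mutual exclusion.
-            self.conn.execute('BEGIN IMMEDIATE')
-            return self
-
-        def __exit__(self, exc_type, exc_value, traceback):
-            if exc_type is None:
-                self.conn.commit()
-            else:
-                self.conn.rollback()
-            return False  # propagate any exception
-
-        def get_hosts(self):
-            # Table and column names are assumptions for this sketch.
-            return [row[0] for row in
-                    self.conn.execute('SELECT host FROM hosts')]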
-
-(You could manage locks manually. It's a good way to build character
-and learn why using the context manager is really simple and leads to
-more correct code.)