diff --git a/ARCH.md b/ARCH.md
new file mode 100644
index 0000000..77218f1
--- /dev/null
+++ b/ARCH.md
@@ -0,0 +1,505 @@
+% Architecture of daemonised Lorry Controller
+% Codethink Ltd
+
+Introduction
+============
+
+This is an architecture document for Lorry Controller. It is aimed at
+those who develop the software, or develop against its HTTP API. See
+the file `README.md` for general information about Lorry Controller.
+
+
+Requirements
+============
+
+Some concepts/terminology:
+
+* CONFGIT is the git repository Lorry Controller uses for its
+ configuration.
+
+* Lorry specification: the configuration that tells Lorry to mirror
+  an upstream version control repository or tarball. Note that a
+  `.lorry` file may contain several specifications.
+
+* Upstream Host: a git hosting server that Lorry Controller mirrors
+ from.
+
+* Host specification: a specification of which Upstream Host to
+  mirror. This gets broken into generated Lorry specifications, one
+  per git repository on that Upstream Host. There can be many Host
+  specifications, to mirror many Hosts.
+
+* Downstream Host: a git hosting server that Lorry Controller mirrors
+ to.
+
+* run queue: all the Lorry specifications (from CONFGIT or generated
+ from the Host specifications) a Lorry Controller knows about; this
+ is the set of things that get scheduled. The queue has a linear
+ order (first job in the queue is the next job to execute).
+
+* job: An instance of executing a Lorry specification. Each job has an
+ identifier and associated data (such as the output provided by the
+ running job, and whether it succeeded).
+
+* admin: a person who can control or reconfigure a Lorry Controller
+ instance. All users of the HTTP API are admins, for example.
+
+For historical reasons, Hosts are also referred to as Troves in many
+places.
+
+The original set of requirements, which have been broken down and
+detailed below:
+
+* Lorry Controller should be capable of being reconfigured at runtime
+ to allow new tasks to be added and old tasks to be removed.
+ (RC/ADD, RC/RM, RC/START)
+
+* Lorry Controller should not allow all tasks to become stuck if one
+ task is taking a long time. (RR/MULTI)
+
+* Lorry Controller should not allow stuck tasks to remain stuck
+ forever. (Configurable timeout? monitoring of disk usage or CPU to
+ see if work is being done?) (RR/TIMEOUT)
+
+* Lorry Controller should be able to be controlled at runtime to allow:
+ - Querying of the current task set (RQ/SPECS, RQ/SPEC)
+ - Querying of currently running tasks (RQ/RUNNING)
+ - Promotion or demotion of a task in the queue (RT/TOP, RT/BOT)
+ - Supporting of the health monitoring to allow appropriate alerts
+ to be sent out (MON/STATIC, MON/DU)
+
+The detailed requirements (prefixed by a unique identifier, which is
+used elsewhere to refer to the exact requirement):
+
+* (FW) Lorry Controller can access Upstream Hosts from behind firewalls.
+ * (FW/H) Lorry Controller can access the Upstream Host using HTTP or
+ HTTPS only, without using ssh, in order to get a list of
+ repositories to mirror. (Lorry itself also needs to be able to
+ access the Upstream Host using HTTP or HTTPS only, bypassing
+ ssh, but that's a Lorry problem and outside the scope of Lorry
+    Controller, so it'll need to be dealt with separately.)
+ * (FW/C) Lorry Controller does not verify SSL/TLS certificates
+ when accessing the Upstream Host.
+* (RC) Lorry Controller can be reconfigured at runtime.
+ * (RC/ADD) A new Lorry specification can be added to CONFGIT, and
+    a running Lorry Controller will add it to its run queue as
+ soon as it is notified of the change.
+ * (RC/RM) A Lorry specification can be removed from CONFGIT, and a
+ running Lorry Controller will remove it from its run queue as
+ soon as it is notified of the change.
+ * (RC/START) A Lorry Controller reads CONFGIT when it starts,
+ updating its run queue if anything has changed.
+* (RT) Lorry Controller can be controlled at runtime.
+ * (RT/KILL) An admin can get their Lorry Controller to stop a
+ running job.
+ * (RT/TOP) An admin can get their Lorry Controller to move a Lorry
+ spec to the beginning of the run queue.
+ * (RT/BOT) An admin can get their Lorry Controller to move a Lorry
+ spec to the end of the run queue.
+ * (RT/QSTOP) An admin can stop their Lorry Controller from
+ scheduling any new jobs.
+ * (RT/QSTART) An admin can get their Lorry Controller to start
+ scheduling jobs again.
+* (RQ) Lorry Controller can be queried at runtime.
+ * (RQ/RUNNING) An admin can list all currently running jobs.
+ * (RQ/ALLJOBS) An admin can list all finished jobs that the Lorry
+ Controller still remembers.
+ * (RQ/SPECS) An admin can list all existing Lorry specifications
+ in the run queue.
+ * (RQ/SPEC) An admin can query existing Lorry specifications in
+ the run queue for any information the Lorry Controller holds for
+ them, such as the last time they successfully finished running.
+* (RR) Lorry Controller is reasonably robust.
+ * (RR/CONF) Lorry Controller ignores any broken Lorry or Host
+ specifications in CONFGIT, and runs without them.
+ * (RR/TIMEOUT) Lorry Controller stops a job that runs for too
+ long.
+ * (RR/MULTI) Lorry Controller can run multiple jobs at the same
+    time, and lets the maximum number of such jobs be configured by
+ the admin.
+ * (RR/DU) Lorry Controller (and the way it runs Lorry) is
+ designed to be frugal about disk space usage.
+ * (RR/CERT) Lorry Controller tells Lorry to not worry about
+ unverifiable SSL/TLS certificates and to continue even if the
+ certificate can't be verified or the verification fails.
+* (RS) Lorry Controller is reasonably scalable.
+ * (RS/SPECS) Lorry Controller works for the number of Lorry
+ specifications we have on git.baserock.org (a number that will
+ increase, and is currently about 500).
+ * (RS/GITS) Lorry Controller works for mirroring git.baserock.org
+ (about 500 git repositories).
+ * (RS/HW) Lorry Controller may assume that CPU, disk, and
+    bandwidth are sufficient, though they should not be needlessly
+    wasted.
+* (MON) Lorry Controller can be monitored from the outside.
+  * (MON/STATIC) Lorry Controller updates a static HTML file at least
+    once a minute; the file shows its current status with sufficient
+ detail that an admin knows if things get stuck or break.
+ * (MON/DU) Lorry Controller measures, at least, the disk usage of
+ each job and Lorry specification.
+* (SEC) Lorry Controller is reasonably secure.
+ * (SEC/API) Access to the Lorry Controller run-time query and
+ controller interfaces is managed with iptables (for now).
+ * (SEC/CONF) Access to CONFGIT is managed by the git server that
+ hosts it. (Gitano on Trove.)
+
+Architecture design
+===================
+
+Constraints
+-----------
+
+Python is not good at multiple threads (partly due to the global
+interpreter lock), and mixing threads and executing subprocesses is
+quite tricky to get right in general. Thus, this design splits the
+software into a threaded web application (using the bottle.py
+framework) and one or more single-threaded worker processes to execute
+Lorry.
+
+Entities
+--------
+
+* An admin is a human being or some software using the HTTP API to
+ communicate with the Lorry Controller.
+* Lorry Controller runs Lorry appropriately, and consists of several
+ components described below.
+* The Downstream Host is as defined in Requirements.
+* An Upstream Host is as defined in Requirements. There can be
+ multiple Upstream Hosts.
+
+Components of Lorry Controller
+------------------------------
+
+* CONFGIT is a git repository for Lorry Controller configuration,
+ which the Lorry Controller (see WEBAPP below) can access and pull
+ from. Pushing is not required and should be prevented by Gitano.
+ CONFGIT is hosted on the Downstream Host.
+
+* STATEDB is persistent storage for the Lorry Controller's state: what
+ Lorry specs it knows about (provided by the admin, or generated from
+ a Host spec by Lorry Controller itself), their ordering, jobs that
+ have been run or are being run, information about the jobs, etc. The
+ idea is that the Lorry Controller process can terminate (cleanly or
+ by crashing), and be restarted, and continue approximately from
+ where it was. Also, a persistent storage is useful if there are
+ multiple processes involved due to how bottle.py and WSGI work.
+ STATEDB is implemented using sqlite3.
+
+* WEBAPP is the controlling part of Lorry Controller, which maintains
+ the run queue, and provides an HTTP API for monitoring and
+ controlling Lorry Controller. WEBAPP is implemented as a bottle.py
+ application. bottle.py runs the WEBAPP code in multiple threads to
+ improve concurrency.
+
+* MINION runs jobs (external processes) on behalf of WEBAPP. It
+  communicates with WEBAPP over HTTP: it requests a job to run,
+  starts it, and while the job runs it sends partial output to the
+  WEBAPP every few seconds and asks the WEBAPP whether the job
+  should be aborted. MINION may eventually run on a different host
+  than WEBAPP, for added scalability.
+
+Components external to Lorry Controller
+---------------------------------------
+
+* A web server. This runs the Lorry Controller WEBAPP, using WSGI so
+ that multiple instances (processes) can run at once, and thus serve
+ many clients.
+
+* bottle.py is a Python microframework for web applications. It sits
+ between the web server itself and the WEBAPP code.
+
+* systemd is the operating system component that starts services and
+ processes.
+
+How the components work together
+--------------------------------
+
+* Each WEBAPP instance is started by the web server, when a request
+ comes in. The web server is started by a systemd unit.
+
+* Each MINION instance is started by a systemd unit. Each MINION
+ handles one job at a time, and doesn't block other MINIONs from
+ running other jobs. The admins decide how many MINIONs run at once,
+ depending on hardware resources and other considerations. (RR/MULTI)
+
+* An admin communicates with the WEBAPP only, by making HTTP requests.
+ Each request is either a query (GET) or a command (POST). Queries
+ report state as stored in STATEDB. Commands cause the WEBAPP
+ instance to do something and alter STATEDB accordingly.
+
+* When an admin makes changes to CONFGIT, and pushes them to the Downstream
+ Host, the Host's git post-update hook makes an HTTP request to
+ WEBAPP to update STATEDB from CONFGIT. (RC/ADD, RC/RM)
+
+* Each MINION likewise communicates only with the WEBAPP using HTTP
+ requests. MINION requests a job to run (which triggers WEBAPP's job
+ scheduling), and then reports results to the WEBAPP (which causes
+ WEBAPP to store them in STATEDB), which tells MINION whether to
+ continue running the job or not (RT/KILL). There is no separate
+ scheduling process: all scheduling happens when there is a MINION
+ available.
+
+* At system start up, a systemd unit makes an HTTP request to WEBAPP
+ to make it refresh STATEDB from CONFGIT. (RC/START)
+
+* A timer unit for systemd makes an HTTP request to get WEBAPP to
+ refresh the static HTML status page. (MON/STATIC)
+
+In summary: systemd starts WEBAPP and MINIONs, and whenever a
+MINION can do work, it asks WEBAPP for something to do, and reports
+back results. Meanwhile, admin can query and control via HTTP requests
+to WEBAPP, and WEBAPP instances communicate via STATEDB.
+
+The WEBAPP
+----------
+
+The WEBAPP provides an HTTP API as described below.
+
+Run queue management:
+
+* `POST /1.0/stop-queue` causes WEBAPP to stop scheduling new jobs to
+ run. Any currently running jobs are not affected. (RT/QSTOP)
+
+* `POST /1.0/start-queue` causes WEBAPP to start scheduling jobs
+ again. (RT/QSTART)
+
+* `GET /1.0/list-queue` causes WEBAPP to return a JSON list of ids of
+ all Lorry specifications in the run queue, in the order they are in
+ the run queue. (RQ/SPECS)
+
+* `POST /1.0/move-to-top` with `path=lorryspecid` as the body, where
+ `lorryspecid` is the id (path) of a Lorry specification in the run
+ queue, causes WEBAPP to move the specified spec to the head of the
+ run queue, and store this in STATEDB. It doesn't affect currently
+ running jobs. (RT/TOP)
+
+* `POST /1.0/move-to-bottom` with `path=lorryspecid` in the body is
+ like `/move-to-top`, but moves the job to the end of the run queue.
+ (RT/BOT)
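+
+As an illustration of how an admin (or a script acting for one) might
+use these requests, here is a minimal sketch in Python. The WEBAPP
+address and the Lorry spec path are assumptions made for the example;
+only the endpoints and the `path` form field come from the API above.
+
+    import json
+    import urllib.parse
+    import urllib.request
+
+    BASE = 'http://localhost:12765'  # assumed WEBAPP address
+
+    # List the ids of all Lorry specs in the run queue (RQ/SPECS).
+    with urllib.request.urlopen(BASE + '/1.0/list-queue') as response:
+        queue = json.loads(response.read().decode('utf-8'))
+    print(queue)
+
+    # Move one spec to the head of the run queue (RT/TOP). The spec id
+    # 'upstream/example.git' is invented for the example.
+    body = urllib.parse.urlencode({'path': 'upstream/example.git'})
+    urllib.request.urlopen(BASE + '/1.0/move-to-top',
+                           data=body.encode('ascii'))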
+
+Running job management:
+
+* `GET /1.0/list-running-jobs` causes WEBAPP to return a JSON list of
+ ids of all currently running jobs. (RQ/RUNNING)
+
+* `GET /1.0/job/<jobid>` causes WEBAPP to return a JSON map (dict)
+ with all the information about the specified job.
+
+* `POST /1.0/stop-job` with `job_id=jobid` in the body, where `jobid`
+  is the id of a running job, causes WEBAPP to record in STATEDB that
+  the job is to be killed. The actual killing is done when MINION
+  gets around to it; the request returns as soon as the STATEDB
+  change is done, without waiting for the job to be killed.
+
+* `GET /1.0/list-jobs` causes WEBAPP to return a JSON list of ids
+ of all jobs, running or finished, that it knows about. (RQ/ALLJOBS)
+
+* `GET /1.0/list-jobs-html` is the same as `list-jobs`, but returns an
+ HTML page instead.
+
+* `POST /1.0/remove-job` with `job_id=jobid` in the body, removes a
+ stopped job from the state database.
+
+* `POST /1.0/remove-ghost-jobs` looks for any running jobs in STATEDB
+ that haven't been updated (with `job-update`, see below) in a long
+ time (see `--ghost-timeout`), and marks them as terminated. This is
+ used to catch situations when a MINION fails to tell the WEBAPP that
+ a job has terminated.
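+
+A similar sketch for job management, with the same assumption about
+the WEBAPP address; the job id is simply whatever `list-running-jobs`
+returned:
+
+    import json
+    import urllib.parse
+    import urllib.request
+
+    BASE = 'http://localhost:12765'  # assumed WEBAPP address
+
+    # Find the currently running jobs (RQ/RUNNING).
+    with urllib.request.urlopen(BASE + '/1.0/list-running-jobs') as response:
+        running = json.loads(response.read().decode('utf-8'))
+
+    # Ask WEBAPP to stop the first one (RT/KILL). This only records the
+    # kill in STATEDB; a MINION does the actual killing later.
+    if running:
+        body = urllib.parse.urlencode({'job_id': running[0]})
+        urllib.request.urlopen(BASE + '/1.0/stop-job',
+                               data=body.encode('ascii'))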
+
+Other status queries:
+
+* `GET /1.0/status` causes WEBAPP to return a JSON object that
+ describes the state of Lorry Controller. This information is meant
+  to be programmatically usable and may or may not be the same as in
+ the HTML page.
+
+* `GET /1.0/status-html` causes WEBAPP to return an HTML page that
+ describes the state of Lorry Controller. This also updates an
+ on-disk copy of the HTML page, which the web server is configured to
+ serve using a normal HTTP request. This is the primary interface for
+ human admins to look at the state of Lorry Controller. (MON/STATIC)
+
+* `GET /1.0/lorry/<lorryspecid>` causes WEBAPP to return a JSON map
+ (dict) with all the information about the specified Lorry
+ specification. (RQ/SPEC)
+
+
+Requests for MINION:
+
+* `GET /1.0/give-me-job` is used by MINION to get a new job to run.
+ WEBAPP will either return a JSON object describing the job to run,
+ or return a status code indicating that there is nothing to do.
+ WEBAPP will respond immediately, even if there is nothing for MINION
+ to do, and MINION will then sleep for a while before it tries again.
+ WEBAPP updates STATEDB to record that the job is allocated to a
+ MINION.
+
+* `POST /1.0/job-update` is used by MINION to push updates about the
+ job it is running to WEBAPP. The body sets fields `exit` (exit code
+ of program, or `no` if not set), `stdout` (some output from the
+ job's standard output) and `stderr` (ditto, but standard error
+  output). There MUST be at least one `job-update` call that reports
+  that the job has terminated. WEBAPP responds with a status
+  indicating whether the job should continue to run or be terminated
+ (RR/TIMEOUT). WEBAPP records the job as terminated only after MINION
+ tells it the job has been terminated. MINION makes the `job-update`
+ request frequently, even if the job has produced no output, so that
+ WEBAPP can update a timestamp in STATEDB to indicate the job is
+ still alive.
+
+Other requests:
+
+* `POST /1.0/read-configuration` causes WEBAPP to update its copy of
+ CONFGIT and update STATEDB based on the new configuration, if it has
+ changed. Returns OK/ERROR status. (RC/ADD, RC/RM, RC/START)
+
+ This is called by systemd units at system startup and periodically
+ (perhaps once a minute) otherwise. It can also be triggered by an
+ admin (there is a button on the `/1.0/status-html` web page).
+
+* `POST /1.0/ls-troves` causes WEBAPP to refresh its list of
+ repositories in each Upstream Host, if the current list is too old
+ (see the `ls-interval` setting for each Upstream Host in
+ `lorry-controller.conf`). This gets called from a systemd timer unit
+ at a suitable interval.
+
+* `POST /1.0/force-ls-troves` causes the repository refresh to happen
+ for all Upstream Hosts, regardless of whether it is due or not. This
+ can be called manually by an admin.
+
+
+The MINION
+----------
+
+* Do `GET /1.0/give-me-job` to WEBAPP.
+* If it didn't get a job, sleep for a while and try again.
+* If it did get a job, fork and exec it.
+* In a loop: wait, for a suitably short period of time, for output
+  from the job (or for its termination), using `select` or a similar
+  mechanism, and send whatever output there is (if any) to WEBAPP. If
+  WEBAPP says the job should be killed, kill it, then send an update
+  to that effect to WEBAPP.
+* Go back to the top to request a new job.
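+
+A very rough sketch of that loop, to show its shape only; this is not
+the real `lorry-controller-minion`. The WEBAPP address, the `argv`
+field in the job description, the `kill` field in the reply, and the
+way "nothing to do" is signalled are all assumptions made for the
+sketch; only the two requests and the `exit`, `stdout` and `stderr`
+fields come from the API described earlier.
+
+    import json
+    import os
+    import select
+    import subprocess
+    import time
+    import urllib.parse
+    import urllib.request
+
+    BASE = 'http://localhost:12765'  # assumed WEBAPP address
+
+    def send_update(fields):
+        # Post a job-update. How the update identifies its job is not
+        # specified in this document, so it is left out of the sketch.
+        body = urllib.parse.urlencode(fields).encode('ascii')
+        with urllib.request.urlopen(BASE + '/1.0/job-update', data=body) as r:
+            return json.loads(r.read().decode('utf-8'))
+
+    while True:
+        # Ask WEBAPP for something to do; sleep and retry if there is
+        # nothing (here "nothing" is assumed to decode to a false value).
+        with urllib.request.urlopen(BASE + '/1.0/give-me-job') as r:
+            job = json.loads(r.read().decode('utf-8'))
+        if not job:
+            time.sleep(10)
+            continue
+
+        process = subprocess.Popen(job['argv'],
+                                   stdout=subprocess.PIPE,
+                                   stderr=subprocess.PIPE)
+        pipes = {process.stdout: 'stdout', process.stderr: 'stderr'}
+
+        while True:
+            # Wait briefly for output from the job, then report whatever
+            # there is (possibly nothing) to WEBAPP.
+            output = {'stdout': b'', 'stderr': b''}
+            readable, _, _ = select.select(list(pipes), [], [], 5.0)
+            for pipe in readable:
+                output[pipes[pipe]] += os.read(pipe.fileno(), 4096)
+            exit_code = process.poll()
+            reply = send_update({
+                'exit': 'no' if exit_code is None else exit_code,
+                'stdout': output['stdout'].decode('utf-8', 'replace'),
+                'stderr': output['stderr'].decode('utf-8', 'replace'),
+            })
+            if exit_code is not None:
+                # A real MINION would drain any remaining output before
+                # (or with) the final update; the sketch just stops here.
+                break
+            if reply.get('kill'):
+                process.terminate()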
+
+
+Old job removal
+---------------
+
+To avoid the STATEDB filling up with logs of old jobs, a systemd timer
+unit will run occasionally to remove jobs so old that nobody cares
+about them anymore. To make it easier to experiment with the logic
+for choosing what to remove (age only? keep failed ones? something
+else?), the removal is kept outside the WEBAPP.
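+
+Such a cleanup tool could be as small as the following sketch. It
+only uses the HTTP API described earlier; the WEBAPP address and the
+`is_too_old` policy are placeholders, since the removal policy is
+exactly the part that is meant to be easy to change.
+
+    import json
+    import urllib.parse
+    import urllib.request
+
+    BASE = 'http://localhost:12765'  # assumed WEBAPP address
+
+    def is_too_old(job_info):
+        # Placeholder policy: decide from the information WEBAPP holds
+        # about the job whether anybody still cares about it.
+        return False
+
+    with urllib.request.urlopen(BASE + '/1.0/list-jobs') as response:
+        job_ids = json.loads(response.read().decode('utf-8'))
+
+    for job_id in job_ids:
+        with urllib.request.urlopen(BASE + '/1.0/job/%s' % job_id) as response:
+            job_info = json.loads(response.read().decode('utf-8'))
+        if is_too_old(job_info):
+            body = urllib.parse.urlencode({'job_id': job_id})
+            urllib.request.urlopen(BASE + '/1.0/remove-job',
+                                   data=body.encode('ascii'))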
+
+
+STATEDB
+-------
+
+The STATEDB has several tables. This section explains them.
+
+The `running_queue` table has a single column (`running`) and a single
+row, and is used to store a single boolean value that specifies
+whether WEBAPP is giving out jobs to run from the run-queue. This
+value is controlled by `/1.0/start-queue` and `/1.0/stop-queue`
+requests.
+
+The `lorries` table implements the run-queue: all the Lorry specs that
+WEBAPP knows about. It has the following columns:
+
+* `path` is the path of the git repository on the Downstream Host, i.e.,
+ the git repository to which Lorry will push. This is a unique
+ identifier. It is used, for example, to determine if a Lorry spec
+ is obsolete after a CONFGIT update.
+* `text` has the text of the Lorry spec. This may be read from a file
+ or generated by Lorry Controller itself. This text will be given to
+ Lorry when a job is run.
+* `generated` is set to 0 if the Lorry spec came from an actual
+  `.lorry` file, and to 1 if it was generated by Lorry Controller.
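+
+To make the table descriptions above concrete, a schema along the
+following lines would match them. This is a sketch only: the column
+types, the database file path and the constraints are assumptions,
+and the real schema is whatever `lorrycontroller/statedb.py` creates.
+
+    import sqlite3
+
+    # The STATEDB path is invented for the sketch.
+    conn = sqlite3.connect('/var/lib/lorry-controller/statedb.sqlite')
+    conn.executescript('''
+        -- Single row, single boolean: is WEBAPP handing out jobs?
+        CREATE TABLE IF NOT EXISTS running_queue (
+            running INTEGER
+        );
+
+        -- The run queue itself: one row per Lorry spec WEBAPP knows about.
+        CREATE TABLE IF NOT EXISTS lorries (
+            path TEXT PRIMARY KEY,  -- repository path on the Downstream Host
+            text TEXT,              -- the Lorry spec text given to Lorry
+            generated INTEGER       -- 1 if generated from a Host spec
+        );
+    ''')
+    conn.commit()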
+
+
+Code structure
+==============
+
+The Lorry Controller code base is laid out as follows:
+
+* `lorry-controller-webapp` is the main program of WEBAPP. It sets up
+ the bottle.py framework. All the implementations for the various
+ HTTP requests are in classes in the `lorrycontroller` Python
+ package, as subclasses of the `LorryControllerRoute` class. The main
+ program uses introspection ("magic") to find the subclasses
+  automatically and sets up the bottle.py routes correctly (see the
+  sketch after this list). This makes it possible to spread the code
+  into simple classes; bottle's normal way (with the `@app.route`
+  decorator) seemed to make that harder and require everything in the
+  same class.
+
+* `lorrycontroller` is a Python package with:
+
+ - The HTTP request handlers (`LorryControllerRoute` and its subclasses)
+ - Management of STATEDB (`statedb` module)
+ - Support for various Downstream and Upstream Host types
+ (`hosts`, `gitano`, `gerrit`, `gitlab`, `local` modules)
+ - Some helpful utilities (`proxy` module)
+
+* `lorry-controller-minion` is the entirety of the MINION, except that
+ it uses the `lorrycontroller.setup_proxy` function.
+ The MINION is kept very simple on purpose: all the interesting logic
+ is in the WEBAPP instead.
+
+* `static` has static content to be served over HTTP, primarily the
+  CSS file for the HTML interfaces. When LC is integrated within the
+ Downstream Host, the web server gets configured to serve these files directly.
+ The `static` directory will be accessible over plain HTTP on port
+ 80, and on port 12765 via the WEBAPP, to allow HTML pages to refer
+ to it via a simple path.
+
+* `templates` contains bottle.py HTML templates for various pages.
+
+* `etc` contains files to be installed in `/etc` when LC is installed
+ on a Baserock system. Primarily this is the web server (lighttpd)
+ configuration to invoke WEBAPP.
+
+* `units` contains various systemd units that start services and run
+ time-based jobs.
+
+* `yarns.webapp` contains an integration test suite for WEBAPP.
+ This is run by the `./check` script. The `./test-wait-for-port`
+ script is used by the yarns.
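+
+The introspection mentioned above for `lorry-controller-webapp` can be
+pictured with a sketch like the following. The handler class and its
+`path`, `http_method` and `run` attributes are invented for the
+illustration; only `bottle`, the `lorrycontroller` package and the
+`LorryControllerRoute` base class come from the code base itself.
+
+    import bottle
+
+    import lorrycontroller
+
+    class StatusHTML(lorrycontroller.LorryControllerRoute):
+        # Invented handler: the real /1.0/status-html handler lives in
+        # lorrycontroller/status.py.
+        http_method = 'GET'
+        path = '/1.0/status-html'
+
+        def run(self, **kwargs):
+            return '<html>...</html>'
+
+    def setup_routes(app):
+        # Find every subclass of LorryControllerRoute and register a
+        # bottle.py route for it, instead of decorating each handler
+        # with @app.route by hand.
+        for klass in lorrycontroller.LorryControllerRoute.__subclasses__():
+            handler = klass()
+            app.route(path=handler.path, method=handler.http_method,
+                      callback=handler.run)
+
+    app = bottle.Bottle()
+    setup_routes(app)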
+
+Example
+-------
+
+As an example, to modify how the `/1.0/status-html` request works, you
+would look at its implementation in `lorrycontroller/status.py`, and
+perhaps also the HTML templates in `templates/*.tpl`.
+
+STATEDB
+-------
+
+The persistent state of WEBAPP is stored in an sqlite3 database. All
+access to STATEDB within WEBAPP is via the
+`lorrycontroller/statedb.py` code module. That means there are no SQL
+statements outside `statedb.py` at all, nor is it OK to add any. If
+the interface provided by the `StateDB` class isn't sufficient, then
+modify the class suitably, but do not add any new SQL outside it.
+
+All access from outside of WEBAPP happens via WEBAPP's HTTP API.
+Only the WEBAPP is allowed to touch STATEDB in any way.
+
+The bottle.py framework runs multiple threads of WEBAPP code. The
+threads communicate only via STATEDB. There is no shared state in
+memory. sqlite3's locking is used for mutual exclusion.
+
+The `StateDB` class acts as a context manager for Python's `with`
+statements to provide locking. To access STATEDB with locking, use
+code such as this:
+
+ with self.open_statedb() as statedb:
+ hosts = statedb.get_hosts()
+ for host in hosts:
+            statedb.remove_host(host)
+
+The code executed by the `with` statement is run under lock, and the
+lock gets released automatically even if there is an exception.
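+
+One way such a context manager might be structured is sketched below.
+This is not the real `StateDB` class (that lives in
+`lorrycontroller/statedb.py`); it only illustrates the pattern of
+taking the sqlite3 lock on entry and releasing it on exit.
+
+    import sqlite3
+
+    class StateDB(object):
+        '''Sketch of the locking pattern only, not the real class.'''
+
+        def __init__(self, filename):
+            # isolation_level=None: we manage transactions explicitly.
+            self._conn = sqlite3.connect(filename, isolation_level=None)
+
+        def __enter__(self):
+            # Take the database lock up front, so concurrent WEBAPP
+            # threads exclude each other for the whole with-block.
+            self._conn.execute('BEGIN IMMEDIATE')
+            return self
+
+        def __exit__(self, exc_type, exc_value, traceback):
+            # Commit on success, roll back on exception; either way
+            # the lock is released when the with-block ends.
+            if exc_type is None:
+                self._conn.execute('COMMIT')
+            else:
+                self._conn.execute('ROLLBACK')
+            return False  # let any exception propagate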
+
+(You could manage locks manually. It's a good way to build character
+and learn why using the context manager is really simple and leads to
+more correct code.)