author	Lars Wirzenius <lars.wirzenius@codethink.co.uk>	2014-01-20 14:24:27 +0000
committer	Lars Wirzenius <lars.wirzenius@codethink.co.uk>	2014-04-15 13:29:27 +0000
commit	4fc162b07b2e9d8489e16ed647e5d96f5c66e10a (patch)
tree	ac2a2a5b86a5d789bd28b383851b28d7f293b928 /ARCH
parent	716ad28c18ac00c52797dc42c843569b1834fb88 (diff)
download	lorry-controller-4fc162b07b2e9d8489e16ed647e5d96f5c66e10a.tar.gz
Add new Lorry Controller
Diffstat (limited to 'ARCH')
-rw-r--r--	ARCH	413
1 files changed, 413 insertions, 0 deletions
diff --git a/ARCH b/ARCH
new file mode 100644
index 0000000..c1cb979
--- /dev/null
+++ b/ARCH
@@ -0,0 +1,413 @@
+% Architecture of daemonised Lorry Controller
+% Codethink Ltd
+
+Introduction
+============
+
+This is an architecture document for Lorry Controller. It is aimed at
+those who develop the software.
+
+Lorry is a tool in Baserock for mirroring code from whatever format
+upstream provides into git repositories, converting it to git as
+needed. Lorry Controller is a service, running on a Trove, which runs
+Lorry against all configured upstreams, including other Troves.
+
+Lorry Controller reads a configuration from a git repository. That
+configuration includes specifications of which upstreams to
+mirror/convert. This includes what upstream Troves to mirror. Lorry
+Controller instructs Lorry to push to a Trove's git repositories.
+
+Lorry specifications, and upstream Trove specifications, may include
+scheduling information, which the Lorry Controller uses to decide when
+to execute which specification.
+
+Requirements
+============
+
+Some concepts/terminology:
+
+* CONFGIT is the git repository the Lorry Controller instance uses for
+ its configuration.
+* Lorry specification: which upstream version control repository or
+ tarball to mirror.
+* Trove specification: which upstream Trove to mirror. This gets
+ broken into generated Lorry specifications, one per git repository
+ on the upstream Trove. There can be many Trove specifications to
+ mirror many Troves.
+* job: An instance of executing a Lorry specification. Each job has an
+ identifier and associated data (such as the output provided by the
+ running job, and whether it succeeded).
+* run queue: all the Lorry specifications (from CONFGIT or generated
+  from the Trove specifications) a Lorry Controller knows about; this
+ is the set of things that get scheduled. The queue has a linear
+ order (first job in the queue is the next job to execute).
+* admin: a person who can control or reconfigure a Lorry Controller
+ instance.
+
+Original set of requirements, which have been broken down and detailed
+below:
+
+* Lorry Controller should be capable of being reconfigured at runtime
+ to allow new tasks to be added and old tasks to be removed.
+ (RC/ADD, RC/RM, RC/START)
+* Lorry Controller should not allow all tasks to become stuck if one
+ task is taking a long time. (RR/MULTI)
+* Lorry Controller should not allow stuck tasks to remain stuck
+ forever. (Configurable timeout? monitoring of disk usage or CPU to
+ see if work is being done?) (RR/TIMEOUT)
+* Lorry Controller should be able to be controlled at runtime to allow:
+  - Querying of the current task set (RQ/SPECS, RQ/SPEC)
+  - Querying of currently running tasks (RQ/RUNNING)
+  - Promotion or demotion of a task in the queue (RT/TOP, RT/BOT)
+  - Support for health monitoring, to allow appropriate alerts
+    to be sent out (MON/STATIC, MON/DU)
+
+The detailed requirements (prefixed by a unique identifier, which is
+used elsewhere to refer to the exact requirement):
+
+* (FW) Lorry Controller can access upstream Troves from behind firewalls.
+  * (FW/H) Lorry Controller can access the upstream Trove using HTTP or
+    HTTPS only, without using ssh, in order to get a list of
+    repositories to mirror. (Lorry itself also needs to be able to
+    access the upstream Trove using HTTP or HTTPS only, bypassing
+    ssh, but that's a Lorry problem and outside the scope of Lorry
+    Controller, so it'll need to be dealt with separately.)
+ * (FW/C) Lorry Controller does not verify SSL/TLS certificates
+ when accessing the upstream Trove.
+* (RC) Lorry Controller can be reconfigured at runtime.
+  * (RC/ADD) A new Lorry specification can be added to CONFGIT, and
+    a running Lorry Controller will add it to its run queue as
+    soon as it is notified of the change.
+ * (RC/RM) A Lorry specification can be removed from CONFGIT, and a
+ running Lorry Controller will remove it from its run queue as
+ soon as it is notified of the change.
+ * (RC/START) A Lorry Controller reads CONFGIT when it starts,
+ updating its run queue if anything has changed.
+* (RT) Lorry Controller can be controlled at runtime.
+ * (RT/KILL) An admin can get their Lorry Controller to stop a running job.
+ * (RT/TOP) An admin can get their Lorry Controller to move a Lorry spec to
+ the beginning of the run queue.
+ * (RT/BOT) An admin can get their Lorry Controller to move a Lorry
+ spec to the end of the run queue.
+ * (RT/QSTOP) An admin can stop their Lorry Controller from scheduling any new
+ jobs.
+ * (RT/QSTART) An admin can get their Lorry Controller to start
+ scheduling jobs again.
+* (RQ) Lorry Controller can be queried at runtime.
+ * (RQ/RUNNING) An admin can list all currently running jobs.
+ * (RQ/ALLJOBS) An admin can list all finished jobs that the Lorry
+ Controller still remembers.
+ * (RQ/SPECS) An admin can list all existing Lorry specifications
+ in the run queue.
+ * (RQ/SPEC) An admin can query existing Lorry specifications in
+ the run queue for any information the Lorry Controller holds for
+ them, such as the last time they successfully finished running.
+* (RR) Lorry Controller is reasonably robust.
+ * (RR/CONF) Lorry Controller ignores any broken Lorry or Trove
+ specifications in CONFGIT, and runs without them.
+ * (RR/TIMEOUT) Lorry Controller stops a job that runs for too
+ long.
+ * (RR/MULTI) Lorry Controller can run multiple jobs at the same
+ time, and lets the maximal number of such jobs be configured by
+ the admin.
+ * (RR/DU) Lorry Controller (and the way it runs Lorry) is
+ designed to be frugal about disk space usage.
+ * (RR/CERT) Lorry Controller tells Lorry to not worry about
+ unverifiable SSL/TLS certificates and to continue even if the
+ certificate can't be verified or the verification fails.
+* (RS) Lorry Controller is reasonably scalable.
+  * (RS/SPECS) Lorry Controller works for the number of Lorry
+    specifications we have on git.baserock.org (a number that will
+    increase, and is currently about 500).
+  * (RS/GITS) Lorry Controller works for mirroring git.baserock.org
+    (about 500 git repositories).
+  * (RS/HW) Lorry Controller may assume that CPU, disk, and
+    bandwidth are sufficient, though they should not be needlessly
+    wasted.
+* (MON) Lorry Controller can be monitored from the outside.
+  * (MON/STATIC) Lorry Controller updates a static HTML file at least
+    once a minute; the file shows its current status in sufficient
+    detail that an admin can tell if things get stuck or break.
+ * (MON/DU) Lorry Controller measures, at least, the disk usage of
+ each job and Lorry specification.
+* (SEC) Lorry Controller is reasonably secure.
+  * (SEC/API) Access to the Lorry Controller run-time query and
+    control interfaces is managed with iptables (for now).
+ * (SEC/CONF) Access to CONFGIT is managed by the git server that
+ hosts it. (Gitano on Trove.)
+
+Architecture design
+===================
+
+Constraints
+-----------
+
+Python is not good at multiple threads (partly due to the global
+interpreter lock), and mixing threads and executing subprocesses is
+quite tricky to get right in general. Thus, this design avoids using
+threads.
+
+Entities
+--------
+
+* An admin is a human being that communicates with the Lorry
+ Controller using an HTTP API. They might do it using a command line
+ client.
+* Lorry Controller runs Lorry appropriately, and consists of several
+ components described below.
+* The local Trove is where Lorry Controller tells its Lorry to push
+ the results.
+* Upstream Trove is a Trove that Lorry Controller mirrors to the local
+ Trove. There can be multiple upstream Troves.
+
+Components of Lorry Controller
+------------------------------
+
+* CONFGIT is a git repository for Lorry Controller configuration,
+ which the Lorry Controller can access and pull from. Pushing is not
+ required and should be prevented by Gitano. CONFGIT is hosted on the
+ local Trove.
+* STATEDB is persistent storage for the Lorry Controller's state: what
+ Lorry specs it knows about (provided by the admin, or generated from
+ a Trove spec by Lorry Controller itself), their ordering, jobs that
+ have been run or are being run, information about the jobs, etc.
+ The idea is that the Lorry Controller process can terminate (cleanly
+ or by crashing), and be restarted, and continue approximately where
+ it was. Also, a persistent storage is useful if there are multiple
+ processes involved due to how bottle.py and WSGI work. STATEDB is
+ implemented using sqlite3.
+* WEBAPP is the controlling part of Lorry Controller, which maintains
+  the run queue, and provides an HTTP API for monitoring and
+  controlling Lorry Controller. WEBAPP is implemented as a bottle.py
+  application.
+* MINION runs jobs (external processes) on behalf of WEBAPP. It
+  communicates with WEBAPP over HTTP: it requests a job to run, starts
+  it, and while the job runs it sends partial output to the WEBAPP and
+  asks the WEBAPP whether the job should be aborted or not. MINION
+  may eventually run on a different host than WEBAPP, for added
+  scalability.
+
+Components external to Lorry Controller
+---------------------------------------
+
+* A web server. This runs the Lorry Controller WEBAPP, using WSGI so
+ that multiple instances (processes) can run at once, and thus serve
+ many clients.
+* bottle.py is a Python microframework for web applications. We
+ already have it in Baserock, where we use it for morph-cache-server,
+ and it seems to be acceptable.
+* systemd is the operating system component that starts services and
+ processes.
+
+How the components work together
+--------------------------------
+
+* Each WEBAPP instance is started by the web server, when a request
+ comes in. The web server is started by a systemd unit.
+* Each MINION instance is started by a systemd unit. Each MINION
+ handles one job at a time, and doesn't block other MINIONs from
+ running other jobs. The admins decide how many MINIONs run at once,
+ depending on hardware resources and other considerations. (RR/MULTI)
+* An admin communicates with the WEBAPP only, by making HTTP requests.
+ Each request is either a query (GET) or a command (POST). Queries
+ report state as stored in STATEDB. Commands cause the WEBAPP
+ instance to do something and alter STATEDB accordingly.
+* When an admin makes changes to CONFGIT, and pushes them to the local
+ Trove, the Trove's git post-update hook makes an HTTP request to
+ WEBAPP to update STATEDB from CONFGIT. (RC/ADD, RC/RM)
+* Each MINION likewise communicates only with the WEBAPP using HTTP
+ requests. MINION requests a job to run (which triggers WEBAPP's job
+ scheduling), and then reports results to the WEBAPP (which causes
+ WEBAPP to store them in STATEDB), which tells MINION whether to
+ continue running the job or not (RT/KILL). There is no separate
+ scheduling process: all scheduling happens when there is a MINION
+ available.
+* At system start up, a systemd unit makes an HTTP request to WEBAPP
+ to make it refresh STATEDB from CONFGIT. (RC/START)
+* A timer unit for systemd makes an HTTP request to get WEBAPP to
+ refresh the static HTML status page. (MON/STATIC)
+
+In summary: systemd starts WEBAPP and MINIONs, and whenever a
+MINION can do work, it asks WEBAPP for something to do, and reports
+back results. Meanwhile, admin can query and control via HTTP requests
+to WEBAPP, and WEBAPP instances communicate via STATEDB.
+
+The WEBAPP
+----------
+
+The WEBAPP provides an HTTP API as described below.
+
+Requests for admins:
+
+* `GET /1.0/status` causes WEBAPP to return a JSON object that
+ describes the state of Lorry Controller. This information is meant
+ to be programmatically useable and may or may not be the same as in
+ the HTML page.
+* `POST /1.0/stop-queue` causes WEBAPP to stop scheduling new jobs to
+ run. Any currently running jobs are not affected. (RT/QSTOP)
+* `POST /1.0/start-queue` causes WEBAPP to start scheduling jobs
+ again. (RT/QSTART)
+
+* `GET /1.0/list-queue` causes WEBAPP to return a JSON list of ids of
+ all Lorry specifications in the run queue, in the order they are in
+ the run queue. (RQ/SPECS)
+* `GET /1.0/lorry/<lorryspecid>` causes WEBAPP to return a JSON map
+ (dict) with all the information about the specified Lorry
+ specification. (RQ/SPEC)
+* `POST /1.0/move-to-top/<lorryspecid>` where `lorryspecid` is the id
+ of a Lorry specification in the run queue, causes WEBAPP to move the
+ specified spec to the head of the run queue, and store this in
+ STATEDB. It doesn't affect currently running jobs. (RT/TOP)
+* `POST /1.0/move-to-bottom/<lorryspecid>` is like `/move-to-top`, but
+ moves the job to the end of the run queue. (RT/BOT)
+
+* `GET /1.0/list-running-jobs` causes WEBAPP to return a JSON list of
+ ids of all currently running jobs. (RQ/RUNNING)
+* `GET /1.0/job/<jobid>` causes WEBAPP to return a JSON map (dict)
+ with all the information about the specified job.
+* `POST /1.0/stop-job/<jobid>` where `jobid` is an id of a running job,
+  causes WEBAPP to record in STATEDB that the job is to be killed; the
+  actual killing happens when MINION next gets around to it. This
+  request returns as soon as the STATEDB change is done.
+* `GET /1.0/list-all-jobs` causes WEBAPP to return a JSON list of ids
+ of all jobs, running or finished, that it knows about. (RQ/ALLJOBS)
+
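+To make the shape of this API concrete, here is a minimal sketch of how
+a couple of the admin requests might look as bottle.py routes. The
+`statedb` object and its methods are hypothetical stand-ins for
+whatever the real STATEDB layer ends up providing; this illustrates the
+routing and JSON handling, not the actual implementation.
+
+```python
+import json
+
+import bottle
+
+app = bottle.Bottle()
+statedb = None  # set to a real STATEDB wrapper object in the WEBAPP
+
+
+@app.get('/1.0/status')
+def status():
+    # Return a machine-readable summary of Lorry Controller's state.
+    bottle.response.content_type = 'application/json'
+    return json.dumps({
+        'queue-running': statedb.get_queue_running(),
+        'running-jobs': statedb.get_running_job_ids(),
+    })
+
+
+@app.post('/1.0/move-to-top/<lorryspecid>')
+def move_to_top(lorryspecid):
+    # Reorder the run queue in STATEDB; running jobs are not affected.
+    if not statedb.has_lorry_spec(lorryspecid):
+        bottle.abort(404, 'unknown Lorry specification')
+    statedb.move_to_queue_head(lorryspecid)
+    return {'status': 'ok'}
+```
+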
+Requests for MINION:
+
+* `GET /1.0/give-me-job` is used by MINION to get a new job to run.
+ WEBAPP will either return a JSON object describing the job to run,
+ or return a status code indicating that there is nothing to do.
+ WEBAPP will respond immediately, even if there is nothing for MINION
+ to do, and MINION will then sleep for a while before it tries again.
+ WEBAPP updates STATEDB to record that the job is allocated to a
+ MINION.
+* `POST /1.0/job-update` is used by MINION to push updates about the
+ job it is running to WEBAPP. The body is a JSON object containing
+ additional information about the job, such as data from its
+ stdout/stderr, and current resource usage. There MUST be at least
+ one `job-update` call, which indicates the job has terminated.
+ WEBAPP responds with a status indicating whether the job should
+ continue to run or be terminated (RR/TIMEOUT). WEBAPP records the
+ job as terminated only after MINION tells it the job has been
+ terminated. MINION makes the `job-update` request frequently, even
+ if the job has produced no output, so that WEBAPP can update a
+ timestamp in STATEDB to indicate the job is still alive.
+
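+Continuing the bottle.py sketch from the admin requests above (same
+hypothetical `app` and `statedb` objects), the scheduling side of
+`give-me-job` might look roughly like this. The STATEDB helper names,
+the JSON field names, and the use of a 204 status for "nothing to do"
+are all assumptions made for illustration:
+
+```python
+@app.get('/1.0/give-me-job')
+def give_me_job():
+    # All scheduling happens here: if the queue is running and a Lorry
+    # spec is due, allocate it to the requesting MINION and record the
+    # new job in STATEDB.
+    if not statedb.get_queue_running():
+        bottle.response.status = 204  # queue stopped; nothing to do
+        return
+    spec = statedb.next_spec_to_run()  # hypothetical run queue lookup
+    if spec is None:
+        bottle.response.status = 204  # run queue is empty
+        return
+    job_id = statedb.create_job(spec['path'])
+    return {
+        'job-id': job_id,
+        'path': spec['path'],
+        'lorry-spec': spec['text'],
+    }
+```
+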
+Other requests:
+
+* `POST /1.0/read-configuration` causes WEBAPP to update its copy of
+ CONFGIT and update STATEDB based on the new configuration, if it has
+ changed. Returns OK/ERROR status. (RC/ADD, RC/RM, RC/START)
+* `GET /1.0/status-html` causes WEBAPP to return an HTML page that
+ describes the state of Lorry Controller. This also updates an
+ on-disk copy of the HTML page, which the web server is configured to
+ serve using a normal HTTP request. (MON/STATIC)
+
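+A sketch of how `status-html` might both serve and refresh the on-disk
+copy of the page, continuing the same hypothetical `app` and `statedb`
+objects as above; the output path and the rendering helper are invented
+for illustration:
+
+```python
+STATUS_HTML_PATH = '/srv/lorry-controller/status.html'  # assumed location
+
+
+@app.get('/1.0/status-html')
+def status_html():
+    # Render the status page, refresh the static copy that the web
+    # server serves directly (MON/STATIC), and return the same HTML.
+    html = render_status_page(statedb)  # hypothetical rendering helper
+    with open(STATUS_HTML_PATH, 'w') as f:
+        f.write(html)
+    bottle.response.content_type = 'text/html'
+    return html
+```
+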
+The MINION
+----------
+
+* Do `GET /1.0/give-me-job` to WEBAPP.
+* If it didn't get a job, sleep for a while and try again.
+* If it did get a job, fork and exec it.
+* In a loop: wait for output from the job (or its termination) for a
+  suitably short period of time, using `select` or a similar mechanism,
+  and send whatever output there is (if any) to WEBAPP. If the WEBAPP
+  told us to kill the job, kill it, then send an update to that effect
+  to WEBAPP.
+* Go back to top to request new job.
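+
+A condensed sketch of that loop, leaving out error handling; the WEBAPP
+address, the Lorry command line, and the field names in the JSON
+messages are all assumptions made for the sake of illustration:
+
+```python
+import json
+import os
+import select
+import subprocess
+import time
+import urllib.request
+
+WEBAPP_URL = 'http://localhost:12765'  # assumed WEBAPP address
+
+
+def webapp_request(method, path, body=None):
+    # Minimal HTTP helper; the real MINION may do this differently.
+    data = json.dumps(body).encode('utf-8') if body is not None else None
+    req = urllib.request.Request(
+        WEBAPP_URL + path, data=data, method=method,
+        headers={'Content-Type': 'application/json'})
+    with urllib.request.urlopen(req) as resp:
+        payload = resp.read()
+    return json.loads(payload) if payload else None
+
+
+def run_one_job():
+    job = webapp_request('GET', '/1.0/give-me-job')
+    if not job:
+        time.sleep(10)  # nothing to do; ask again later
+        return
+    proc = subprocess.Popen(
+        ['lorry', job['path']],  # illustrative; real arguments will differ
+        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+    exit_code = None
+    while exit_code is None:
+        # Wait briefly for output from the job, then report to WEBAPP.
+        readable, _, _ = select.select([proc.stdout], [], [], 1.0)
+        output = os.read(proc.stdout.fileno(), 4096) if readable else b''
+        exit_code = proc.poll()
+        reply = webapp_request('POST', '/1.0/job-update', {
+            'job-id': job['job-id'],
+            'output': output.decode('utf-8', errors='replace'),
+            'exit': exit_code,
+        })
+        if exit_code is None and reply and reply.get('kill'):
+            proc.terminate()
+            exit_code = proc.wait()
+            # Tell WEBAPP the job has now terminated (RT/KILL).
+            webapp_request('POST', '/1.0/job-update', {
+                'job-id': job['job-id'], 'output': '', 'exit': exit_code,
+            })
+
+
+if __name__ == '__main__':
+    while True:
+        run_one_job()
+```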
+
+STATEDB
+-------
+
+The STATEDB has several tables. This section explains them.
+
+The `running_queue` table has a single column (`running`) and a single
+row, and is used to store a single boolean value that specifies
+whether WEBAPP is giving out jobs to run from the run-queue. This
+value is controlled by `/1.0/start-queue` and `/1.0/stop-queue`
+requests.
+
+The `lorries` table implements the run-queue: all the Lorry specs that
+WEBAPP knows about. It has the following columns:
+
+* `path` is the path of the git repository on the local Trove, i.e.,
+ the git repository to which Lorry will push. This is a unique
+ identifier. It is used, for example, to determine if a Lorry spec
+ is obsolete after a CONFGIT update.
+* `text` has the text of the Lorry spec. This may be read from a file
+ or generated by Lorry Controller itself. This text will be given to
+ Lorry when a job is run.
+* `generated` is set to 0 or 1, depending on whether the Lorry spec came
+  from an actual `.lorry` file or was generated by Lorry Controller.
+
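+A sketch of how these two tables might be created with sqlite3; the
+column types and the database file location are assumptions, and the
+job-related tables implied elsewhere in this document are not shown:
+
+```python
+import sqlite3
+
+
+def open_statedb(path='/var/lib/lorry-controller/statedb.sqlite'):
+    # Open (or create) STATEDB and make sure the tables exist.
+    conn = sqlite3.connect(path)
+    conn.execute(
+        'CREATE TABLE IF NOT EXISTS running_queue (running INTEGER)')
+    conn.execute(
+        'CREATE TABLE IF NOT EXISTS lorries ('
+        ' path TEXT PRIMARY KEY,'
+        ' text TEXT,'
+        ' generated INTEGER)')
+    conn.commit()
+    return conn
+```
+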
+Implementation plan
+===================
+
+The following is meant to be a good sequence of steps for implementing
+the design described above.
+
+* Make a skeleton Lorry Controller and yarn test suite for it (2d)
+
+  Write a simplistic skeleton of a Lorry Controller WEBAPP and MINION,
+  and a few representative tests for them using yarn. The goal here is
+  not to have applications that do something real, or tests that test
+  something real, but to have a base upon which to start building, and
+  especially to make it easy to write tests (including new step
+  implementations) in the future.
+
+* Implement /1.0/status and /1.0/status-html in Lorry Controller
+  WEBAPP (1d)
+
+  This is the very basic core of the status reporting. Every
+  subsequent change will include updating the status reporting as
+  necessary.
+
+* Implement /1.0/status/disk-free-bytes in Lorry Controller WEBAPP (1d)
+
+* Implement /1.0/stop-queue and /1.0/start-queue in Lorry Controller
+ WEBAPP (1d)
+
+ This should just affect the bit in STATEDB that decides whether we
+ are currently running jobs from the run queue or not. This
+ implementation step does not need to actually implement running
+ jobs.
+
+* Implement /1.0/read-configuration and /1.0/list-queue in Lorry
+ Controller WEBAPP (3d) (S10450)
+
+ This requires implementing parsing of the configuration files in
+ CONFGIT, generation of Lorry specs from Trove specs,
+ adding/removing/updating specs in the run queue according to
+ changes. list-queue needs to be implemented so that the results of
+ read-configuration can be verified.
+
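+  As an illustration of the generation step, the following sketch turns
+  a list of repository paths from an upstream Trove into generated rows
+  for the `lorries` table. How the repository list is obtained over
+  HTTP(S), and the exact contents of a generated Lorry spec, are
+  assumptions here and will be settled during implementation.
+
+  ```python
+  import json
+
+
+  def generate_lorry_specs(trove_name, trove_url, repo_paths):
+      # For each repository on the upstream Trove, produce a
+      # (path, text, generated) row for the lorries table.  The spec
+      # fields below are illustrative only.
+      rows = []
+      for repo in repo_paths:
+          local_path = '%s/%s' % (trove_name, repo)
+          spec = {
+              local_path: {
+                  'type': 'git',
+                  'url': '%s/%s' % (trove_url.rstrip('/'), repo),
+              },
+          }
+          rows.append((local_path, json.dumps(spec, indent=4), 1))
+      return rows
+  ```
+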
+* Implement running jobs in Lorry Controller WEBAPP (1d) (S10451)
+
+ Requests /1.0/give-me-job, /1.0/job-update,
+ /1.0/list-running-jobs, /1.0/stop-job/. These do not actually run
+ anything, of course, since that is a job for MINION, but they
+ change the state of the job in STATEDB, and that's what needs to
+ be implemented and tested.
+
+* Implement MINION in Lorry Controller (1d) (S10452)
+
+* Implement /1.0/move-to-top/ and /1.0/move-to-bottom/ in Lorry
+ Controller WEBAPP (1d) (S10453)
+
+* Implement /1.0/list-all-jobs, /1.0/job/ in Lorry Controller
+ WEBAPP (1d) (S10454)
+
+* Implement /1.0/lorry/ in Lorry Controller WEBAPP (1d) (S10455)
+
+* Add new Lorry Controller to Trove (2d) (S10456)
+
+ Replace old Lorry Controller with new one, and add any systemd
+ units needed to make it functional. Create at least a very basic
+ sanity check, using yarn, to verify that a deployed, running
+ system has a working Lorry Controller.
+
+* Review Lorry Controller situation and decide on further work
+
+ No implementation plan survives contact with reality, and thus
+ things will need to be reviewed at the end, in case something has
+ been forgotten or requirements have changed.