Running jobs
============

This chapter contains tests that verify that WEBAPP schedules jobs,
accepts job output, and lets the admin kill running jobs.

Run a job successfully
----------------------

To start with, with an empty run-queue, nothing should be scheduled.

    SCENARIO run a job
    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP

We stop the queue first.

    WHEN admin makes request POST /1.0/stop-queue

Then make sure we don't get a job when we request one.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to null

    WHEN admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to []

Add a Lorry spec to the run-queue, and check that it looks OK.

    GIVEN Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    WHEN admin makes request POST /1.0/read-configuration
    AND admin makes request GET /1.0/lorry/upstream/foo
    THEN response has jobs set to []

Request a job. We still shouldn't get a job, since the queue isn't set
to run yet.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to null

Enable the queue, and off we go.

    WHEN admin makes request POST /1.0/start-queue
    AND admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1
    AND response has path set to "upstream/foo"

    WHEN admin makes request GET /1.0/lorry/upstream/foo
    THEN response has running_job set to 1
    AND response has jobs set to [1]

    WHEN admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to [1]

Requesting another job should now again return null.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to null

Inform WEBAPP the job is finished.
    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=0
    THEN response has kill set to false

    WHEN admin makes request GET /1.0/lorry/upstream/foo
    THEN response has running_job set to null
    AND response has jobs set to [1]
    AND response has failed_jobs set to []

    WHEN admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to []

Cleanup.

    FINALLY WEBAPP terminates

Run a job that fails
--------------------

Lorry Controller needs to be able to deal with jobs that fail. It also
needs to be able to list them correctly to the user.

    SCENARIO run a job that fails
    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP
    AND Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    WHEN admin makes request POST /1.0/read-configuration
    AND admin makes request POST /1.0/start-queue

Initially, the lorry spec should have no jobs or failed jobs listed.

    WHEN admin makes request GET /1.0/lorry/upstream/foo
    THEN response has jobs set to []
    AND response has failed_jobs set to []

MINION requests a job.

    WHEN MINION makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1
    AND response has path set to "upstream/foo"

Now, when MINION updates WEBAPP about the job, indicating that it has
failed, admin will see that the lorry spec lists the job among its
failed jobs.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=1
    AND admin makes request GET /1.0/lorry/upstream/foo
    THEN response has jobs set to [1]
    AND response has failed_jobs set to [1]

Cleanup.

    FINALLY WEBAPP terminates

Limit number of jobs running at the same time
---------------------------------------------

WEBAPP can be told to limit the number of jobs running at the same
time.

Set things up.
Note that we have two local Lorry files, so that we could, in
principle, run two jobs at the same time.

    SCENARIO limit concurrent jobs
    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    AND Lorry file CONFGIT/bar.lorry with {"bar":{"type":"git","url":"git://bar"}}
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP
    WHEN admin makes request POST /1.0/read-configuration

Check the current value of the `max_jobs` setting.

    WHEN admin makes request GET /1.0/get-max-jobs
    THEN response has max_jobs set to null

Set the limit to 1.

    WHEN admin makes request POST /1.0/set-max-jobs with max_jobs=1
    THEN response has max_jobs set to 1

    WHEN admin makes request GET /1.0/get-max-jobs
    THEN response has max_jobs set to 1

Get a job. This should succeed.

    WHEN MINION makes request POST /1.0/give-me-job with host=testhost&pid=1
    THEN response has job_id set to 1

Get a second job. This should not succeed.

    WHEN MINION makes request POST /1.0/give-me-job with host=testhost&pid=2
    THEN response has job_id set to null

Finish the first job, then get a new job. This should succeed.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=0
    AND MINION makes request POST /1.0/give-me-job with host=testhost&pid=2
    THEN response has job_id set to 2

Cleanup.

    FINALLY WEBAPP terminates

Stop job in the middle
----------------------

We need to be able to stop jobs while they're running as well. We
start by setting up everything so that a job is running, the same way
we did for the successful job scenario.
    SCENARIO stop a job while it's running
    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP
    AND Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    WHEN admin makes request POST /1.0/read-configuration
    AND admin makes request POST /1.0/start-queue
    AND admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1
    AND response has path set to "upstream/foo"

Admin will now ask WEBAPP to kill the job. This only sets a field in
the STATEDB.

    WHEN admin makes request POST /1.0/stop-job with job_id=1
    THEN response has kill set to true

Now, when MINION updates the job, WEBAPP will tell it to kill the job.
MINION will do so, and then update the job again.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=no
    THEN response has kill set to true

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=1

Admin will now see that the job has, indeed, been killed.

    WHEN admin makes request GET /1.0/lorry/upstream/foo
    THEN response has running_job set to null

    WHEN admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to []

Check that the job can be run successfully again. In 2014, we found a
bug where a lorry that had ever been set to be killed would never
again run successfully.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 2
    AND response has path set to "upstream/foo"

    WHEN MINION makes request POST /1.0/job-update with job_id=2&exit=no
    THEN response has kill set to false

Cleanup.

    FINALLY WEBAPP terminates

Stop a job that runs too long
-----------------------------

Sometimes a job gets "stuck" and should be killed.
The `lorry-controller.conf` file has an optional `lorry-timeout` field
to set the timeout, and WEBAPP will tell MINION to kill a job when it
has been running for too long.

Some setup. Set the `lorry-timeout` to a known value. It doesn't
matter what the value is, since we'll be telling WEBAPP to fake its
sense of time, so that the test suite is not timing sensitive. We
wouldn't want the test suite to fail when running on slow devices.

    SCENARIO stop stuck job
    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND lorry-controller.conf in CONFGIT has lorry-timeout set to 1 for everything
    AND Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP
    WHEN admin makes request POST /1.0/read-configuration

Pretend it is the start of time.

    WHEN admin makes request POST /1.0/pretend-time with now=0
    AND admin makes request GET /1.0/status
    THEN response has timestamp set to "1970-01-01 00:00:00 UTC"

Start the job.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1

Check that the job info contains a start time.

    WHEN admin makes request GET /1.0/job/1
    THEN response has job_started set

Pretend it is now much later, or at least later than the specified
timeout.

    WHEN admin makes request POST /1.0/pretend-time with now=2

Pretend to be a MINION that reports an update on the job. WEBAPP
should now tell us to kill the job.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=no
    THEN response has kill set to true

Kill the job, as requested.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=1

Verify that we can run the job successfully after it has been killed
once by timeout. In 2014 we had a bug where this would not happen,
because a lorry that had ever been killed would never run successfully
again.
    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 2

    WHEN MINION makes request POST /1.0/job-update with job_id=2&exit=no
    THEN response has kill set to false

Cleanup.

    FINALLY WEBAPP terminates

Forget jobs whose MINION is gone
--------------------------------

A job's status is updated when a MINION uses the `/1.0/job-update`
call, and when the MINION uses that call to report that the job has
finished, the STATEDB is updated accordingly. However, sometimes the
MINION never tells WEBAPP that the job is finished. This can happen
for a variety of reasons, including (but not limited to) these:

* MINION crashes.
* WEBAPP is unavailable.
* The host reboots, killing both MINION and WEBAPP.

If this happens, STATEDB still marks the job as running, and WEBAPP
won't start a new job for that lorry specification. To deal with this,
we need a way to clean up "ghost jobs" like these. We do this with the
`/1.0/remove-ghost-jobs` API call, which marks as finished all jobs
that haven't had a `job-update` call in a long time.

    SCENARIO forget jobs without MINION updates in a long time

Set up a WEBAPP that uses a CONFGIT with a Lorry file, so we can start
a job.

    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP

Pretend it is a known time (specifically, the beginning of the epoch).
This is needed so we can trigger the ghost job timeout later.

    WHEN admin makes request POST /1.0/pretend-time with now=0

Tell WEBAPP to read the configuration.

    WHEN admin makes request POST /1.0/read-configuration

Start a new job.
    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1

Verify that the job is in the list of running jobs.

    WHEN admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to [1]

Remove any ghosts. There aren't any yet, so nothing should be removed.

    WHEN admin makes request POST /1.0/remove-ghost-jobs
    AND admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to [1]

Now, pretend a long time has passed, and clean up the ghost job. The
default value for the ghost timeout is reasonably short (less than a
day), so we pretend it is about 10 days later (one million seconds).

    WHEN admin makes request POST /1.0/pretend-time with now=1000000
    AND admin makes request POST /1.0/remove-ghost-jobs
    AND admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to []

Further, if we request a new job now, we'll get one for the same lorry
specification.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 2
    AND response has path set to "upstream/foo"

Finally, clean up.

    FINALLY WEBAPP terminates

Remove a terminated job
-----------------------

WEBAPP doesn't remove jobs automatically; it needs to be told to
remove them.

    SCENARIO remove job

Setup.

    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP
    GIVEN Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    WHEN admin makes request POST /1.0/read-configuration

Start job 1.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1

Try to remove job 1 while it is running. This should fail.
    WHEN admin makes request POST /1.0/remove-job with job_id=1
    THEN response has reason set to "still running"

Finish the job.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=0
    WHEN admin makes request GET /1.0/list-jobs
    THEN response has job_ids set to [1]

Remove it.

    WHEN admin makes request POST /1.0/remove-job with job_id=1
    AND admin makes request GET /1.0/list-jobs
    THEN response has job_ids set to []

Cleanup.

    FINALLY WEBAPP terminates

Remove old terminated jobs with helper program
----------------------------------------------

There is a helper program to remove old jobs automatically.

    SCENARIO remove old terminated jobs

Setup.

    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP
    GIVEN Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    WHEN admin makes request POST /1.0/read-configuration

Start job 1. We start it at a known time of 100 seconds since the
epoch, so that we can control when jobs become old.

    WHEN admin makes request POST /1.0/pretend-time with now=100
    AND admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1

Remove old jobs while job 1 is running, still pretending the time is
100 seconds since the epoch. This should leave job 1 running.

    WHEN admin removes old jobs at 100
    AND admin makes request GET /1.0/list-jobs
    THEN response has job_ids set to [1]

Finish the job.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=0
    WHEN admin makes request GET /1.0/list-jobs
    THEN response has job_ids set to [1]

Remove old jobs, still at 100 seconds. Job 1 should remain, as it has
only just finished.

    WHEN admin removes old jobs at 100
    AND admin makes request GET /1.0/list-jobs
    THEN response has job_ids set to [1]

Let a long time pass, and remove old jobs again. Job 1 should now go
away.
    WHEN admin removes old jobs at 100000000000
    AND admin makes request GET /1.0/list-jobs
    THEN response has job_ids set to []

Cleanup.

    FINALLY WEBAPP terminates
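The scheduling rules exercised by this chapter (no jobs while the
queue is stopped, at most `max_jobs` concurrent jobs, one running job
per lorry, the kill flag set by `stop-job` or a timeout, and ghost-job
removal based on the last `job-update` time) can be summarised in a
small model. This is a hypothetical in-memory sketch for illustration
only, not the actual WEBAPP implementation; the class and method names
here are assumptions, not part of the HTTP API.

```python
class RunQueueSketch:
    """Toy model of WEBAPP's job scheduling rules (illustrative only)."""

    def __init__(self):
        self.running = True       # start-queue / stop-queue
        self.max_jobs = None      # set-max-jobs; None means unlimited
        self.next_job_id = 1
        self.lorries = {}         # lorry path -> running job id, or None
        self.jobs = {}            # job id -> state dict

    def add_lorry(self, path):
        self.lorries.setdefault(path, None)

    def give_me_job(self, now):
        # No jobs are handed out while the queue is stopped.
        if not self.running:
            return None
        # Respect the concurrency limit, if one is set.
        n_running = sum(1 for j in self.jobs.values() if j['exit'] is None)
        if self.max_jobs is not None and n_running >= self.max_jobs:
            return None
        # Pick a lorry with no job currently running; each new job gets
        # a fresh id and a clean kill flag (the 2014 bug was, in effect,
        # a kill flag that outlived the killed job).
        for path, job_id in self.lorries.items():
            if job_id is None:
                new_id = self.next_job_id
                self.next_job_id += 1
                self.lorries[path] = new_id
                self.jobs[new_id] = {
                    'path': path, 'started': now, 'updated': now,
                    'exit': None, 'kill': False,
                }
                return new_id
        return None

    def stop_job(self, job_id):
        # Only sets a field; MINION learns of it on its next job-update.
        self.jobs[job_id]['kill'] = True

    def job_update(self, job_id, now, exit=None, timeout=None):
        job = self.jobs[job_id]
        job['updated'] = now
        if exit is not None:
            # The job has terminated; free the lorry for new jobs.
            job['exit'] = exit
            self.lorries[job['path']] = None
        elif timeout is not None and now - job['started'] > timeout:
            # Job has run past its lorry-timeout; tell MINION to kill it.
            job['kill'] = True
        return job['kill']

    def remove_ghost_jobs(self, now, ghost_timeout):
        # Forget running jobs whose MINION has not reported in a long time.
        for job in self.jobs.values():
            if job['exit'] is None and now - job['updated'] > ghost_timeout:
                job['exit'] = 'ghost'
                self.lorries[job['path']] = None
```

For example, with one lorry the sketch reproduces the chapter's
behaviour: `give_me_job` returns `None` while the queue is stopped or
while that lorry already has a running job, and returns a fresh job id
again once the previous job reports its exit, is killed, or is removed
as a ghost.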