path: root/distbuild/worker_build_scheduler.py
Commit message (Author, Age, Files, Lines)
* worker_build_scheduler: Consider active jobs when creating/cancelling [baserock/pedroalvarez/fix-distbuild-bug] (Pedro Alvarez, 2016-07-24, 1 file, -5/+11)
| In WorkerBuildQueuer._handle_request we considered only running jobs when checking whether a new job would be a duplicate. This made `morph distbuild` build some components more than once. This commit considers active jobs instead, both when creating jobs and when cancelling them.
| Change-Id: Ib0a7296d453ccd0b8c636c7506d9f1da82acc462
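The duplicate check this commit describes can be sketched roughly as follows. This is an illustration only, not the code from worker_build_scheduler.py: the `Job` and `JobQueue` shapes, the state names, the `get_or_create` method, and the reading of "active" as "queued or running" are all assumptions made for the example.

```python
class Job:
    """Hypothetical job record; the state names are assumptions."""

    def __init__(self, job_id, artifact_basename):
        self.id = job_id
        self.artifact_basename = artifact_basename
        self.state = 'queued'  # assumed lifecycle: queued, running, complete/failed
        self.initiators = []

    def is_active(self):
        # Assumption: "active" means not yet finished (queued or running).
        return self.state in ('queued', 'running')


class JobQueue:
    """Hypothetical queue that avoids creating duplicate jobs."""

    def __init__(self):
        self.jobs = []
        self._next_id = 0

    def get_or_create(self, artifact_basename, initiator_id):
        # Check every active job, not only the running ones, so that a
        # request for an artifact that is still queued reuses that job.
        for job in self.jobs:
            if job.artifact_basename == artifact_basename and job.is_active():
                job.initiators.append(initiator_id)
                return job
        job = Job(self._next_id, artifact_basename)
        self._next_id += 1
        job.initiators.append(initiator_id)
        self.jobs.append(job)
        return job


queue = JobQueue()
first = queue.get_or_create('abc123.chunk.example', 'initiator-1')
second = queue.get_or_create('abc123.chunk.example', 'initiator-2')
```

Here the second request attaches to the job that is still queued instead of creating a new one, so the artifact would be built once for both initiators.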
* distbuild: When a build finishes, say which worker it was built on (Sam Thursfield, 2015-10-07, 1 file, -3/+4)
| Change-Id: I493fced8cf2664283923f6f41097ca991d3fc3de
* distbuild: Fix crash when worker disconnects (Sam Thursfield, 2015-06-24, 1 file, -1/+1)
| Bad function prototype meant that the mechanism for handling workers disconnecting actually caused the controller to crash instead.
| Change-Id: I8ceb6ad027ba2481c0c4c335e1760692823c208b
* Disable WC exec-output messages in log by default (Richard Ipsum, 2015-05-18, 1 file, -1/+3)
| Change-Id: I01a60d4ec187d5fab060f40947d97aa97013f7a7
* distbuild: Set job status to failed when sending exec-cancel (Adam Coldrick, 2015-05-12, 1 file, -0/+8)
| Currently jobs may continue running after exec-cancel is sent if exec-response takes a while to be sent back. This commit makes the job's state be set to 'failed' when exec-cancel is sent, so that the wait for exec-response doesn't matter.
| Change-Id: I858d9efcba38c81a912cf57aee2bdd8c02cb466b
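A minimal sketch of the idea in this commit, assuming a simplified connection object: the job is marked 'failed' at the moment exec-cancel is sent, so a late exec-response cannot revive it. The class name, the `sent` list standing in for the real connection, and the message fields (`type`, `id`, `exit`) are assumptions for illustration.

```python
class Job:
    """Hypothetical job record with a simple state string."""

    def __init__(self, job_id):
        self.id = job_id
        self.state = 'running'


class WorkerConnectionSketch:
    """Hypothetical stand-in for the controller side of a worker connection."""

    def __init__(self):
        self.sent = []   # records outgoing messages instead of using a socket
        self.jobs = {}   # job id -> Job

    def start(self, job):
        self.jobs[job.id] = job

    def cancel(self, job_id):
        job = self.jobs[job_id]
        self.sent.append({'type': 'exec-cancel', 'id': job_id})
        # Mark the job failed right away rather than waiting for the
        # worker's exec-response to arrive.
        job.state = 'failed'

    def handle_exec_response(self, msg):
        job = self.jobs.get(msg['id'])
        if job is None or job.state == 'failed':
            return False  # late response for a cancelled job: ignore it
        job.state = 'complete' if msg['exit'] == 0 else 'failed'
        return True


conn = WorkerConnectionSketch()
job = Job(1)
conn.start(job)
conn.cancel(1)
late = conn.handle_exec_response({'type': 'exec-response', 'id': 1, 'exit': 0})
```

After `cancel(1)`, the job is already in the 'failed' state, and the delayed exec-response is ignored instead of marking the job complete.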
* Revert "distbuild: Track worker jobs using artifact basename only" (Adam Coldrick, 2015-05-12, 1 file, -29/+48)
| This reverts commit 75ef3e9585091b463b60d2981b3b7283a2ea8eab.
| It turns out that the JobQueue may need to handle more than one build of the same artifact at once, as one may be in the process of being cancelled when another build of the same artifact is requested. So they do need an ID separate from the artifact ID.
| Change-Id: Ifa0c06987795a4aebdadbd9927de27919377b0a2
* Clean up artifact serialisation (Adam Coldrick, 2015-05-12, 1 file, -5/+3)
| We no longer serialise whole artifacts, so it doesn't make sense for things to still refer to serialise-artifact and similar.
| Change-Id: Id4d563a07041bbce77f13ac71dc3f7de39df5e23
* distbuild: Builds currently break due to job being set twice (Lauren Perry, 2015-05-11, 1 file, -1/+0)
| Remove extra job set line as self._current_job no longer exists in worker_build_scheduler.py
| Change-Id: I8849742587f11f83ebba64f48eaf97fac83e6589
* distbuild: Allow WorkerConnection to track multiple in-flight jobs (Sam Thursfield, 2015-05-07, 1 file, -108/+115)
| Although in theory a worker should only ever have one job at once, in practice this assumption doesn't hold, and can cause serious confusion. The worker (implemented in the JsonRouter class) will actually queue up exec-request messages and run the oldest one first.
| I saw a case where, due to a build not being correctly cancelled, the WorkerConnection.current_job attribute got out of sync with what the worker was actually building. This led to an error when trying to fetch the built artifacts, as the controller tried to fetch artifacts for something that wasn't actually built yet, and everything got stuck.
| To prevent this from happening, we either need to remove the exec-request queue in the worker-daemon process, or make the WorkerConnection class cope with multiple jobs at once. The latter seems like the more robust approach, so I have done that.
| Another bug this fixes is the issue where, if the 'Computing build graph' (serialise-artifact) step of a build completes on the controller while one of its WorkerConnection objects is waiting for artifacts to be fetched by the shared cache from the worker, the build hangs. This would happen because the WorkerConnection assumed that any HelperResponse message it saw was the result of its request, so would send a _JobFinished before caching had actually finished if there was an unrelated HelperResponse received in the meantime. It now checks the request ID of the HelperResponse before calling the code that is now in the new _handle_helper_result_for_job() function.
| Change-Id: Ia961f333f9dae77405b58c82c99a56e4c43e1628
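The request-ID check described in the last paragraph of this commit message might look something like the sketch below. It is not the real WorkerConnection code: the `HelperResponse` class here is a simplified stand-in, and `InFlightJobs`, `request_caching`, `handle_helper_response`, and the `ok` field are hypothetical names chosen for the example.

```python
class HelperResponse:
    """Simplified stand-in for a helper response message (assumed fields)."""

    def __init__(self, request_id, ok):
        self.request_id = request_id
        self.ok = ok


class InFlightJobs:
    """Hypothetical tracker for several jobs waiting on caching requests."""

    def __init__(self):
        self._caching = {}   # helper request id -> job name
        self.finished = []
        self.failed = []

    def request_caching(self, request_id, job_name):
        self._caching[request_id] = job_name

    def handle_helper_response(self, response):
        # Only react to responses whose request ID matches one of our own
        # caching requests; anything else belongs to another component.
        job_name = self._caching.pop(response.request_id, None)
        if job_name is None:
            return None
        if response.ok:
            self.finished.append(job_name)
        else:
            self.failed.append(job_name)
        return job_name


tracker = InFlightJobs()
tracker.request_caching('req-7', 'chunk-a')
unrelated = tracker.handle_helper_response(HelperResponse('req-99', True))
matched = tracker.handle_helper_response(HelperResponse('req-7', True))
```

The unrelated response leaves the job untouched, so nothing is reported as finished until the response for 'req-7' arrives.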
* distbuild: Track worker jobs using artifact basename only (Sam Thursfield, 2015-05-07, 1 file, -34/+23)
| Rather than generating IDs for each job, identify them by what artifact is going to be built. Artifact cache IDs need to be unique in any case.
| Change-Id: I37a0277931c45a8fb6e37ae7c2a6a942ae732fdd
* distbuild: Track state of a job in the Job class (Sam Thursfield, 2015-05-07, 1 file, -22/+31)
| This is a bit more comprehensive than the previous approach of using public instance attributes, and I find it easier to reason about.
| Change-Id: I2942ecf53c95e29893dc0982d38aec689ebfa614
* distbuild: Make Jobs class into a more generic JobQueue (Sam Thursfield, 2015-05-07, 1 file, -11/+17)
| The intention is to allow workers to use this class for job tracking, in addition to the controller.
| Change-Id: I355861086764476b383266bab7e850af5e05bc54
* distbuild: Fix NameError when worker disconnects (Sam Thursfield, 2015-04-28, 1 file, -1/+1)
| Change-Id: Ifdaa92c209a4ca488c4447911bef9b1bf7d61438
* Make distbuild use an ArtifactReference not an Artifact internally when building (Adam Coldrick, 2015-04-24, 1 file, -13/+16)
| We no longer serialise entire artifacts, so the output of deserialise_artifact is an ArtifactReference. This commit changes stuff in distbuild to know how to deal with that rather than an Artifact.
| Change-Id: I79b40d041700a85c25980e3bd70cd34dedd2a113
* distbuild: Remove unneeded debugging statement (Sam Thursfield, 2015-04-09, 1 file, -6/+0)
| A JsonMachine object can be set to log all messages that it sends, we don't need to handle it in the WorkerConnection class as well.
| Change-Id: Idfdc06953363a016708b5dda50c978eb93b1113c
* distbuild: Make 'Current jobs' log message more useful (Sam Thursfield, 2015-04-09, 1 file, -2/+11)
| It's good to know which jobs are in progress and which are queued, when reading morph-controller.log.
| Old output:
|   2015-04-09 10:40:58 DEBUG Current jobs: ['3f647933a1effbb128c857225ba77e9aa775d92314ef0acf3e58e084a7248c73.chunk.stage1-binutils-misc', 'd7279e4179a31d8a3a98c27d5b01ad1bb7387c7fab623fee1086ab68af2784bb.chunk.stage2-fhs-dirs-misc']
| New output:
|   2015-04-09 10:40:58 DEBUG Current jobs: ['3f647933a1effbb128c857225ba77e9aa775d92314ef0acf3e58e084a7248c73.chunk.stage1-binutils-misc (given to worker1:3434)', 'd7279e4179a31d8a3a98c27d5b01ad1bb7387c7fab623fee1086ab68af2784bb.chunk.stage2-fhs-dirs-misc (given to worker2:3434)']
| Change-Id: Ie89e6723b0da5f930813591a3166301fd3966804
* Use the modern way of the GPL copyright header: URL instead real address (Javier Jardón, 2015-03-16, 1 file, -2/+1)
| Change-Id: I992dc0c1d40f563ade56a833162d409b02be90a0
* Merge branch 'sam/distbuild-build-logs' (Sam Thursfield, 2015-03-11, 1 file, -54/+81)
|\
| | Reviewed-By: Adam Coldrick <adam.coldrick@codethink.co.uk>
| | Reviewed-By: Richard Maw <richard.maw@codethink.co.uk>
| * distbuild: Fix build logs being sent to the wrong log files (Sam Thursfield, 2015-02-18, 1 file, -54/+81)
| | For a while we have seen an issue where output from build A would end up in the log file of some other random chunk.
| | The problem turns out to be that the WorkerConnection class in the controller-daemon assumes cancellation is instantaneous. If a build was cancelled, the WorkerConnection would send a cancel message for the job it was running, and then start a new job. However, the worker-daemon process would have a backlog of exec-output messages and a delayed exec-response message from the old job. The controller would receive these and would assume that they were for the new job, without checking the job ID in the messages. Thus they would be sent to the wrong log file.
| | To fix this, the WorkerConnection class now tracks jobs by job ID, and the code should be generally more robust when unexpected messages are received.
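The fix described in this commit amounts to routing each exec-output message by the job ID it carries. A minimal sketch under stated assumptions follows: `LogRouter` and its methods are hypothetical, the logs are in-memory buffers rather than files, and the message fields `id` and `stdout` are assumed for illustration.

```python
import io


class LogRouter:
    """Hypothetical router that writes build output to per-job logs."""

    def __init__(self):
        self._logs = {}  # job id -> in-memory log buffer

    def start_job(self, job_id):
        self._logs[job_id] = io.StringIO()

    def finish_job(self, job_id):
        # Stop tracking the job and return everything logged for it.
        return self._logs.pop(job_id).getvalue()

    def handle_exec_output(self, msg):
        # Look the log up by the job ID in the message, rather than
        # assuming the output belongs to whichever job started last.
        log = self._logs.get(msg['id'])
        if log is None:
            return False  # backlog from a cancelled or unknown job: drop it
        log.write(msg['stdout'])
        return True


router = LogRouter()
router.start_job('job-1')
router.finish_job('job-1')   # job-1 was cancelled; its log is closed
router.start_job('job-2')
stale = router.handle_exec_output({'id': 'job-1', 'stdout': 'old output\n'})
fresh = router.handle_exec_output({'id': 'job-2', 'stdout': 'new output\n'})
job2_log = router.finish_job('job-2')
```

The delayed output from 'job-1' is dropped because that job is no longer tracked, so only the output tagged 'job-2' ends up in the second job's log.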
| * Update copyright years (Sam Thursfield, 2015-02-18, 1 file, -1/+1)
| |
* | distbuild: Be more robust when a worker disconnects (Sam Thursfield, 2015-02-03, 1 file, -8/+35)
|/
| The logic to handle a worker disconnecting was broken. The WorkerConnection object would remove itself from the main loop as soon as the worker disconnected. But it would not get removed from the list of available workers that the WorkerBuildQueue maintains. So the controller would continue sending messages to this dead connection, and the builds it sent would hang forever for a response.
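The bookkeeping problem described in this commit can be illustrated with a small sketch: a disconnected worker has to be dropped from the list of available workers, or jobs keep being handed to it. The class and method names below are hypothetical and are not taken from worker_build_scheduler.py.

```python
class WorkerBuildQueuerSketch:
    """Hypothetical queuer that tracks which workers can accept jobs."""

    def __init__(self):
        self.available_workers = []
        self.pending_jobs = []

    def worker_connected(self, worker_name):
        self.available_workers.append(worker_name)

    def worker_disconnected(self, worker_name):
        # Forget the worker so no further jobs are dispatched to it.
        if worker_name in self.available_workers:
            self.available_workers.remove(worker_name)

    def dispatch(self):
        # Hand the oldest pending job to the first available worker, if any.
        if not self.pending_jobs or not self.available_workers:
            return None
        return self.available_workers.pop(0), self.pending_jobs.pop(0)


queuer = WorkerBuildQueuerSketch()
queuer.worker_connected('worker1:3434')
queuer.worker_connected('worker2:3434')
queuer.pending_jobs.append('job-a')
queuer.worker_disconnected('worker1:3434')
assignment = queuer.dispatch()
```

With the disconnected worker removed, the pending job goes to 'worker2:3434' instead of being sent to a dead connection.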
* Fix issues with distbuild caused by moving to building per-source (Richard Maw, 2014-10-08, 1 file, -14/+13)
|
* Fix copyright years of distbuild code. (Sam Thursfield, 2014-09-11, 1 file, -1/+1)
|
* Note future improvement for fetching artifacts from remote cache (Sam Thursfield, 2014-06-10, 1 file, -0/+3)
|
* Move presence check for job into remove function (Richard Ipsum, 2014-06-04, 1 file, -6/+6)
| We always want to warn if we attempt to remove a job that's not present
* Add comment to explain the use of _JobFailed event (Richard Ipsum, 2014-06-04, 1 file, -0/+6)
|
* Make Jobs finish when caching is complete (Richard Ipsum, 2014-06-03, 1 file, -24/+24)
| If a new build request makes a request for an artifact that is currently being cached then the artifact will be needlessly rebuilt. To avoid this the new build request should wait for caching to finish.
| We rename _ExecStarted, _ExecEnded, _ExecFailed to _JobStarted, _JobFinished, _JobFailed and Job's is_building attribute is renamed to running.
* Make job fail if caching fails (Richard Ipsum, 2014-06-03, 1 file, -0/+5)
| This fixes the bug that causes the distbuild controller to crash when population of the artifact cache fails.
* Merge remote-tracking branch 'origin/sam/distbuild-logs-2' (Sam Thursfield, 2014-05-14, 1 file, -0/+1)
|\
| | Reviewed-By: Richard Ipsum <richard.ipsum@codethink.co.uk>
| | Reviewed-By: Lars Wirzenius <lars.wirzenius@codethink.co.uk>
| * distbuild: Include .build-log when copying chunk artifacts to the Trove (Sam Thursfield, 2014-05-14, 1 file, -0/+1)
| | Users need to be able to see logs of all builds, not just those that failed.
* | Make distbuild put worker logs onto stdout (Richard Ipsum, 2014-05-14, 1 file, -0/+1)
|/
* Add _ExecFailed event (Richard Ipsum, 2014-05-06, 1 file, -4/+24)
| To cancel jobs cleanly we need to know when a job has failed.
* Use messages to update job state (Richard Ipsum, 2014-05-06, 1 file, -3/+24)
|
* Add cancelling to WorkerBuildScheduler (Richard Ipsum, 2014-05-06, 1 file, -13/+93)
|
* Remove unused import and method (Richard Ipsum, 2014-05-06, 1 file, -3/+0)
| add_initiator() isn't necessary given lists have a remove method.
* Remove route map (Richard Ipsum, 2014-04-24, 1 file, -1/+0)
|
* WorkerConnection: _maybe_handle_helper_result (Richard Ipsum, 2014-04-23, 1 file, -9/+4)
| Put our _exec_response_msg into WorkerBuildFinished event, it's essentially the same as _finished_msg, just a different name
| Get our artifact's cache key from the job
* WorkerConnection: _request_caching (Richard Ipsum, 2014-04-23, 1 file, -9/+6)
| Now we just get everything from the job object
* WorkerConection: _handle_exec_response (Richard Ipsum, 2014-04-23, 1 file, -9/+8)
| The exec_response_msg also needs to be sent to a number of initiators, so we give it a list of ids not just one. The exec_response_msg will be sent to the controller once the artifacts have been cached successfully.
| There's no longer any need to use a route map to retrieve the id of the initiator, since this is stored with the job
* WorkerConnection: _handle_exec_output (Richard Ipsum, 2014-04-23, 1 file, -2/+2)
| msg now contains a list of initiator ids rather than a single one, since BuiltOutput needs to be sent to a number of initiators
* WorkerBuildQueuer: Use job's artifact and id (Richard Ipsum, 2014-04-23, 1 file, -14/+17)
| Each job is given a unique id, so we don't need to generate an id for each exec request
| this means we can remove use of route map since we can use the job's id for the exec request
* Remove cancel (Richard Ipsum, 2014-04-23, 1 file, -6/+1)
| This method no longer works, we will replace it soon.
* Change event names back (Richard Ipsum, 2014-04-23, 1 file, -3/+3)
| The name change from BuildFailed -> JobFailed etc was unintentionally merged into master, undo this.
* WorkerConnection: misc attributes (Richard Ipsum, 2014-04-23, 1 file, -0/+6)
| _job is the job this worker is carrying out
| _exec_response_msg will contain the response the worker sends back to us when it finishes the build.
* WorkerBuildQueuer: replace request queue with jobs (Richard Ipsum, 2014-04-23, 1 file, -25/+67)
|
* Add Jobs and Job classes (Richard Ipsum, 2014-04-23, 1 file, -10/+49)
|
* Make WorkerBuildCaching carry a list of ids (Richard Ipsum, 2014-04-23, 1 file, -5/+2)
| We need to be able to send this message to a number of initiators
* Add new build messages to worker build scheduler (Richard Ipsum, 2014-04-23, 1 file, -2/+12)
|
* Merge branch 'baserock/richardipsum/distbuild_improve_annotation3' (Richard Ipsum, 2014-04-15, 1 file, -1/+2)
|\
| | Conflicts: distbuild/build_controller.py
| | Reviewed by: Lars Wirzenius, Daniel Silverstone, Sam Thursfield
| * Set body and headers in message (Richard Ipsum, 2014-04-11, 1 file, -1/+2)
| | body and headers must now be specified for http-request message.