Follow-up from "Add jobids"
The following discussion from !6 should be addressed:
- @david.schaefer started a discussion: (+2 comments) Only partly related to this MR, but a point that seems important to me nonetheless: we now have a fire-and-forget regime here (and of course also in the case of the `SlurmScheduler`). A job is submitted, and all further job communication is supposed to go through the MQTT broker. This works fine as long as the job completes successfully, but it breaks down if the submitted job dies along the way or, even worse, never starts in the first place. How do we handle these situations, e.g. to restart a job or at least log the incident? Any ideas @palmb, @martin.abbrent?
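
To make the question concrete, here is a minimal sketch of one possible direction, not something proposed in the discussion itself: a small watchdog that remembers each submitted job id and flags jobs that never report or that go silent. It assumes that jobs publish some kind of message (status or heartbeat) carrying their job id via the broker, which the thread does not confirm. The class `JobWatchdog`, its method names, and the timeout values are all hypothetical, and the wiring to the actual MQTT client and schedulers is deliberately left out.

```python
import time
from dataclasses import dataclass, field


@dataclass
class JobWatchdog:
    """Hypothetical helper: tracks submitted job ids and flags silent jobs."""

    start_timeout: float = 60.0      # assumed: max seconds until the first message
    heartbeat_timeout: float = 30.0  # assumed: max seconds between later messages
    _submitted: dict = field(default_factory=dict, init=False)  # job_id -> submit time
    _last_seen: dict = field(default_factory=dict, init=False)  # job_id -> last message time

    def submitted(self, job_id, now=None):
        """Record a job right after it was handed to the scheduler."""
        self._submitted[job_id] = time.monotonic() if now is None else now

    def message_received(self, job_id, now=None):
        """Call this from the broker's message callback for every job message."""
        if job_id in self._submitted:
            self._last_seen[job_id] = time.monotonic() if now is None else now

    def finished(self, job_id):
        """Stop tracking a job that reported successful completion."""
        self._submitted.pop(job_id, None)
        self._last_seen.pop(job_id, None)

    def check(self, now=None):
        """Return {job_id: reason} for every job that looks stuck or dead."""
        now = time.monotonic() if now is None else now
        problems = {}
        for job_id, submit_time in self._submitted.items():
            last = self._last_seen.get(job_id)
            if last is None:
                # No message at all since submission.
                if now - submit_time > self.start_timeout:
                    problems[job_id] = "never started"
            elif now - last > self.heartbeat_timeout:
                # The job reported at least once, then went silent.
                problems[job_id] = "died"
        return problems
```

Calling `check()` periodically would give a place to log the incident or trigger a resubmission. Whether a silent job is really dead would still need to be confirmed with the scheduler, since a timeout alone cannot tell a crashed job from a slow one.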