Suspend and Resume Parallel Jobs in Grid Engine (2013-07-31)

Tightly integrated parallel jobs are under full control of Grid Engine (accounting/resource limitation/etc.). But what happens when a tightly integrated parallel job is suspended?

By default, Grid Engine suspends a job with SIGSTOP and resumes it with SIGCONT. SIGSTOP cannot be caught, so a job has no chance to react to it, for example by forwarding the suspension to its clients. To work around this, Grid Engine can send a catchable notification signal a few seconds in advance: the lead time is the notify value in the queue configuration, and the job must be submitted with -notify (see the qsub man page). This is not always a good solution because it delays the suspension. Another option is to configure a different, catchable signal for suspension, such as SIGTSTP. The signal, or even a script, can be set in the queue configuration (suspend_method).

But who gets the signals? This depends on the host configuration. There is an execution daemon parameter called SUSPEND_PE_TASKS (see man sge_conf). If it is set to false, only the master task gets the signal, and the master task itself is expected to suspend the slave tasks. If it is set to true, all parallel tasks of the job are signalled, in arbitrary order. The default of SUSPEND_PE_TASKS (when not configured) changed some years ago from false to true, so it is better to set it explicitly in the host or global configuration (qconf -mconf global) when using suspend and resume with parallel jobs. This setting is a good candidate to move into job classes.
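
When SUSPEND_PE_TASKS is false and suspend_method is a catchable signal such as SIGTSTP, the master task has to pass the suspension on itself. The following is only a minimal sketch of that idea in Python, not something Grid Engine ships: the function names forward_signal and install_suspend_forwarding are made up for illustration, and it only reaches worker processes running on the same host as the master. How a master reaches tasks on remote hosts depends on the parallel environment and is not covered here.

```python
import os
import signal


def forward_signal(pids, signum):
    """Send signum to every pid; return the pids that were still alive."""
    delivered = []
    for pid in pids:
        try:
            os.kill(pid, signum)
            delivered.append(pid)
        except ProcessLookupError:
            # The worker has already exited; nothing to suspend or resume.
            pass
    return delivered


def install_suspend_forwarding(worker_pids):
    """Hypothetical helper: make this (master) process forward suspend/resume.

    Assumes the queue's suspend_method is SIGTSTP and that resume uses the
    default SIGCONT.
    """

    def on_suspend(signum, frame):
        # Stop the workers first, then stop the master itself.
        forward_signal(worker_pids, signal.SIGSTOP)
        os.kill(os.getpid(), signal.SIGSTOP)

    def on_resume(signum, frame):
        # The master is already running again; wake up the workers.
        forward_signal(worker_pids, signal.SIGCONT)

    signal.signal(signal.SIGTSTP, on_suspend)
    signal.signal(signal.SIGCONT, on_resume)
```

A master task would call install_suspend_forwarding with the PIDs of the workers it started and then carry on with its normal work. The workers are stopped with SIGSTOP here because they do not need to react themselves; only the master needs a catchable signal.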