Univa Grid Engine supports two levels of checkpointing: the user level
       and a transparent level provided by the operating system. User-level
       checkpointing refers to applications that do their own checkpointing
       by writing restart files at certain times or algorithmic steps and by
       properly processing these restart files when restarted.
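
       As an illustration of user-level checkpointing, the following is a
       minimal sketch of a job that records its progress in a restart file
       and resumes from it when restarted. The function name run_steps, the
       restart file layout (a single step counter) and the step count are
       assumptions made for this example; they are not part of Univa Grid
       Engine.

```shell
#!/bin/sh
# Sketch of user-level checkpointing: a loop that records its progress in a
# restart file and resumes from it. All names here are hypothetical.

# run_steps RESTART_FILE TOTAL_STEPS
# Resumes after the last step recorded in RESTART_FILE, if that file exists.
run_steps() {
    restart_file=$1
    total=$2
    step=0
    # Process an existing restart file: pick up where the last run stopped.
    if [ -f "$restart_file" ]; then
        step=$(cat "$restart_file")
    fi
    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        # ... one unit of real work would go here ...
        # Write to a temporary file first and rename it, so an abort in the
        # middle of the write does not leave a truncated restart file behind.
        echo "$step" > "$restart_file.tmp"
        mv "$restart_file.tmp" "$restart_file"
    done
    echo "finished at step $step"
}
```

       A job script would call, for example, run_steps "$HOME/myjob.state"
       1000 and could be aborted and restarted at any point without
       repeating completed steps.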

       Transparent checkpointing has to be provided by the operating system
       and is usually integrated in the operating system kernel. An example
       of a kernel-integrated checkpointing facility is the Hibernator
       package from Softway for SGI IRIX platforms.

       Checkpointing jobs need to be identified to the Univa Grid Engine
       system by using the -ckpt option of the qsub(1) command. The argument
       to this flag refers to a so-called checkpointing environment, which
       defines the attributes of the checkpointing method to be used (see
       checkpoint(5) for details). Checkpointing environments are set up
       with the qconf(1) options -ackpt, -dckpt, -mckpt and -sckpt. The
       qsub(1) option -c can be used to override the when attribute of the
       referenced checkpointing environment.
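
       The submission described above can be sketched as a small helper that
       only prints the qsub(1) command line instead of running it. The
       helper name ckpt_submit_cmd, the checkpointing environment name
       my_ckpt and the script name job.sh are hypothetical; only the -ckpt
       and -c options are taken from the text.

```shell
#!/bin/sh
# ckpt_submit_cmd ENV OCCASION SCRIPT
# Prints (does not run) a qsub command line for a checkpointing job.
ckpt_submit_cmd() {
    ckpt_env=$1     # hypothetical checkpointing environment name
    occasion=$2     # value for -c; empty keeps the environment's "when"
    script=$3       # hypothetical job script
    if [ -n "$occasion" ]; then
        echo "qsub -ckpt $ckpt_env -c $occasion $script"
    else
        echo "qsub -ckpt $ckpt_env $script"
    fi
}

# Use the "when" attribute configured in the environment.
ckpt_submit_cmd my_ckpt "" job.sh
# Override it so that the job is checkpointed when it is suspended.
ckpt_submit_cmd my_ckpt x job.sh
```

       The environment itself would first have to be created by an
       administrator, for example with qconf -ackpt my_ckpt, and can be
       inspected with qconf -sckpt my_ckpt.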

       If a queue is of the type CHECKPOINTING, jobs need to have the
       checkpointing attribute flagged (see the -ckpt option to qsub(1)) to
       be permitted to run in such a queue. In contrast to regular batch
       jobs, checkpointing jobs are aborted under conditions in which batch
       or interactive jobs are suspended or even stay unaffected. These
       conditions are:

       o  Explicit suspension of the queue or job via qmod(1) by the cluster
          administration or a queue owner, if the x occasion specifier (see
          qsub(1) -c and checkpoint(5)) was assigned to the job.

       o  A load average value exceeding the suspend threshold as configured
          for the corresponding queues (see queue_conf(5)).

       o  Shutdown of the Univa Grid Engine execution daemon sge_execd(8)
          that is responsible for the checkpointing job.

       After being aborted, the jobs migrate to other queues unless they
       were submitted to one specific queue by an explicit user request.
       This migration of jobs results in dynamic load balancing. Note:
       Aborting a checkpointed job frees all resources (memory, swap space)
       which the job occupies at that time. This is in contrast to suspended
       regular jobs, which still occupy swap space.

       When a job migrates to a queue on another machine, at present no
       files are transferred automatically to that machine. This means that
       all files which are used throughout the entire job, including restart
       files, executables and scratch files, must be visible on that machine
       or transferred explicitly ... may suffer long turnaround times.
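
       Because files are not transferred automatically on migration, a job
       script can stage them itself before the real work starts. The
       following is a minimal sketch under the assumption that a shared
       directory is visible on every execution host; the function name
       stage_in and all paths are hypothetical.

```shell
#!/bin/sh
# stage_in SHARED_DIR LOCAL_DIR FILE...
# Copies each named file from SHARED_DIR to LOCAL_DIR unless it is already
# present there. Returns non-zero if a directory or file cannot be set up.
stage_in() {
    shared_dir=$1
    local_dir=$2
    shift 2
    mkdir -p "$local_dir" || return 1
    for name in "$@"; do
        if [ ! -e "$local_dir/$name" ]; then
            cp "$shared_dir/$name" "$local_dir/$name" || return 1
        fi
    done
}
```

       A job script might call stage_in /shared/myjob /tmp/myjob
       restart.dat input.dat as its first step, so that a restart on a new
       machine finds the same files as the original run.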

       sge_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5), Univa Grid
       Engine Installation and Administration Guide, Univa Grid Engine User's
       Guide

       See sge_intro(1) for a full statement of rights and permissions.

UGE 8.0.0                $Date: 2009/06/16 13:58:24 $              SGE_CKPT(1)
