V:4.1.4/Karajan:Restart Log Library

From Java CoG Kit


The Restart Log Library provides fault tolerance in a style similar to that of Condor DAGMan. The execution of certain operations is recorded in a log file on disk. If a failure occurs, execution can be resumed using the information saved in the log file, and operations that previously completed successfully are not re-executed. The library defines two elements: restartLog and logged.

This mechanism offers no guarantee of semantic consistency after a restart if the control flow is influenced by factors that change between the original execution and the resumed execution. It is generally safe to use this mechanism with for and parallelFor if the iteration values are fixed. Additionally, return values of logged elements are not recorded. Consequently, a logged element that is skipped on resume will not return anything.

Usage example:

import("sys.k")
import("task.k")
import("rlog.k")

logged(
  transfer(srcfile="a.txt", desthost="host.example.org", provider="gt2")
)

parallel(
  logged(
    execute(executable="/bin/cat", arguments="a.txt", stdout="b.txt", 
      host="host.example.org", provider="gt2")
  )
  logged(
    execute(executable="/bin/cat", arguments="a.txt", stdout="c.txt", 
      host="host.example.org", provider="gt2")
  )
)

parallelFor(out, list("b.txt", "c.txt")
  logged(
    transfer(srcfile=out, srchost="host.example.org", provider="gt2")
  )
)

Resuming:

cog-workflow workflow.k -rlog:resume=workflow.0.rlog

rlog:restartLog


rlog:restartLog(*resume, *name, restartlog)

restartLog performs the following functions:

  1. Opens a log file. The prefix of the log file name is taken from the *name argument or, if the *name argument is missing, from the file name of the script currently being executed. The actual file name is obtained by appending a dot character ("."), a unique numeric identifier, and the ".rlog" extension. restartLog tries successively increasing numeric identifiers, starting from 0 (zero). If a log file with a given identifier already exists, or if an exclusive lock on the file cannot be obtained, the next number is tried. The exclusive lock on the log file prevents other processes from using the same file.
  2. Accepts arguments on the restartlog channel and writes them to the log file. After each value is written, the file buffers are immediately flushed to disk using the FileDescriptor.sync() method.
  3. In the case of a restart, it also parses the previous log and builds the data that the logged elements need. If restartLog cannot acquire an exclusive lock on the log file during a restart, it does not attempt to use another log file, but fails instead.
  4. Upon successful completion, it closes and deletes the log file.
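
The file selection, locking, and flushing steps above can be illustrated with a minimal Java sketch. It is not the library's implementation: the class name `RestartLogSketch`, its methods, and the one-entry-per-line format are assumptions made for illustration. Only the use of `FileDescriptor.sync()` comes from the description above.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not the Karajan restartLog code.
final class RestartLogSketch {
    private final Path path;
    private final FileOutputStream out;
    private final FileLock lock;

    private RestartLogSketch(Path path, FileOutputStream out, FileLock lock) {
        this.path = path;
        this.out = out;
        this.lock = lock;
    }

    // Tries <prefix>.0.rlog, <prefix>.1.rlog, ... and returns the first file
    // that did not exist before and could be locked exclusively.
    static RestartLogSketch open(Path dir, String prefix) throws IOException {
        for (int id = 0; ; id++) {
            Path candidate = dir.resolve(prefix + "." + id + ".rlog");
            try {
                Files.createFile(candidate); // fails if the file already exists
            } catch (FileAlreadyExistsException e) {
                continue; // identifier taken, try the next one
            }
            FileOutputStream out = new FileOutputStream(candidate.toFile());
            FileLock lock;
            try {
                lock = out.getChannel().tryLock(); // null if another process holds it
            } catch (OverlappingFileLockException e) {
                lock = null; // already locked inside this JVM
            }
            if (lock != null) {
                return new RestartLogSketch(candidate, out, lock);
            }
            out.close(); // could not lock, move on to the next identifier
        }
    }

    // Appends one entry (assumed format: one line per entry) and forces it to disk.
    void record(String entry) throws IOException {
        out.write((entry + "\n").getBytes(StandardCharsets.UTF_8));
        out.flush();
        out.getFD().sync(); // flush OS buffers so the entry survives a crash
    }

    // Successful completion: release the lock, close the file, delete the log.
    void finish() throws IOException {
        lock.release();
        out.close();
        Files.delete(path);
    }

    Path path() {
        return path;
    }
}
```

In this sketch, two calls to `RestartLogSketch.open(dir, "workflow")` in the same directory yield `workflow.0.rlog` and `workflow.1.rlog`. Calling `sync()` after every entry trades write throughput for durability: an entry that has been recorded is on disk before the next operation starts.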

Restarts can be triggered in two ways:

  1. By passing the *resume argument with the full file name of a log file.
  2. By specifying the -rlog:resume=<logname> command line argument to the script (not to the interpreter).

rlog:logged


rlog:logged()

Logged elements execute their child elements and, upon termination, return two values on the restartlog channel: an identifier that uniquely identifies the logged element within a given execution, and a thread id, which is used to differentiate between concurrent runs of the same element. The same (element id; thread id) pair can occur multiple times in a log file, but this only reflects successive executions of an element within the same thread.

If a restart is in effect, logged elements will analyze data parsed from the log, and if there exists an entry with the same (element id; thread id) pair which has a count larger than 0, the count will be decreased and logged will complete without executing its arguments/child elements. It will consequently not return any values whatsoever, not even values that were returned by its child elements in a previous run. It is therefore recommended that logged only wrap elements that do not return values.
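
The skip-on-resume bookkeeping described above can be sketched in Java as follows. This is an illustration under stated assumptions, not the library's code: the class `ResumeCounts` and its methods are hypothetical, and the count for a pair is assumed to be the number of times that pair appears in the parsed log.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Karajan logged implementation.
final class ResumeCounts {
    // Remaining recorded completions per (element id; thread id) pair.
    private final Map<List<String>, Integer> remaining = new HashMap<>();

    // Each parsed entry is {elementId, threadId}; a repeated pair raises its count.
    ResumeCounts(List<String[]> parsedEntries) {
        for (String[] entry : parsedEntries) {
            remaining.merge(Arrays.asList(entry[0], entry[1]), 1, Integer::sum);
        }
    }

    // Returns true if the element completed in the previous run and should be
    // skipped now; each skip consumes one recorded completion.
    boolean shouldSkip(String elementId, String threadId) {
        List<String> key = Arrays.asList(elementId, threadId);
        Integer count = remaining.get(key);
        if (count == null || count <= 0) {
            return false; // no completion left: execute the child elements
        }
        remaining.put(key, count - 1);
        return true;
    }
}
```

With a parsed log containing the pair (e1; t0) twice, the first two calls to `shouldSkip("e1", "t0")` return true and the third returns false, so an element that completed twice before the failure is skipped twice and then executed normally. A skipped element produces no return values, which matches the recommendation to wrap only elements that do not return values.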
