Java CoG Kit Long Running Jobs
Gregor von Laszewski and Kaizar Amin and ...
Corresponding Author: gregor@mcs.anl.gov
Abstract
TBD, once the other parts are finished.
Before making the mods, please review http://www.cogkit.org/w/index.php/Talk:Java_CoG_Kit_Workflow_Examples#Proposed_Enhancement:_Graphs
I would like to have an integrated solution.
The above was designed before I received Kaizar's proposal. It was clearly based on the original abstractions, but if we can get a higher-level API for cog-task and cog-set, we may just use that to interface with Karajan.
Introduction
This section explains how to run very long-running jobs with the help of the Java CoG Kit. This feature is implemented in two special providers, termed
- gt2ft and
- gt4ft (NOT YET IMPLEMENTED)
While a long job is running, it is assumed that the client may go offline and come back online at any time. The user must check on the status through a polling command. The advantages of this use case are that
- (a) network connectivity does not need to be maintained throughout the period the job is running.
- (b) the JVM in which the CoG Kit runs can be restarted. Hence it is possible to shut down your computer and continue working at a later time.
The disadvantage is that querying the state is more costly, so this approach is intended for jobs that run for a long period (WE NEED A PERFORMANCE EXPERIMENT THAT CONTRASTS THIS.)
State model
Detailed description
TBD
- Specified:
- Submitted:
- Running:
- Completed:
- Pending:
- Failed:
- Suspended: not implemented
Functionality
The gt2ft provider has the following features:
- It extends the GT2 provider, so all functionality of the gt2 provider is also available in the gt2ft provider.
- If the submitted task has a status of "unsubmitted", it does a fresh submission.
- If the status of the task is "submitted", "active", or "suspended", the task was previously submitted and this time we simply want to reconnect to it. Thus the provider does a BIND to the existing task and continues with status notification as usual.
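The submit-or-bind rule in the list above can be sketched as a small decision function. This is only an illustration in Python, not the gt2ft provider's code: the names `Action` and `choose_action` are hypothetical, the status strings are the ones used in the text, and raising an error for any other status is an assumption, since the text does not say what happens in that case.

```python
from enum import Enum


class Action(Enum):
    SUBMIT = "fresh submission"
    BIND = "bind to existing task"


# Status names exactly as used in the text; the real provider may
# represent them differently.
REBIND_STATUSES = {"submitted", "active", "suspended"}


def choose_action(status: str) -> Action:
    """Return what this sketch would do for a task in the given status."""
    normalized = status.strip().lower()
    if normalized == "unsubmitted":
        return Action.SUBMIT
    if normalized in REBIND_STATUSES:
        return Action.BIND
    # Assumption: other statuses (for example completed or failed) are not
    # covered by the text, so this sketch refuses to guess.
    raise ValueError(f"no submit/bind rule for status {status!r}")
```

Keeping the rule in one function makes the open question visible: what should happen for a task that is already completed or failed still needs to be decided.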
Managing a single task
Submitting a long running task
The enhanced fault-tolerant feature can be accessed easily from the cog-job-submit command found in the bin directory of the Java CoG Kit. We have also added a cog-get-status command (USED TO BE cog-checkpoint-submit).
The cog-task -submit launcher supports an option "-c". When this option is specified, the submitted task is checkpointed, allowing the user to reconnect to it at a later time, for example after a failure has been detected.
Assuming /home/user/runLong is a script that runs for a very long time (let us assume two months), we checkpoint it in checkpoint.xml:
cog-task -submit -p gt2ft \
    -s hot.mcs.anl.gov \
    -e /home/user/runLong \
    -args "-i 20 -s 2" \
    -c checkpoint.xml
At the time of submission, we will get output such as
PUT OUTPUT WITHOUT DEBUGGING HERE. I ASSUME IT WILL BE
Job submitted to: hot.mcs.anl.gov
Checkpointfile: checkpoint.xml
Command: /home/user/runLong -i 20 -s 2
Environment Vars: none
Assuming we have debugging switched on, we get the following output:
DEBUG [org.globus.cog.abstraction.examples.execution.JobSubmission] - Task Identity: urn:cog-1116756045868
DEBUG [org.globus.cog.abstraction.impl.common.AbstractionFactory] - Instantiating org.globus.cog.abstraction.impl.execution.gt2.GlobusSecurityContextImpl for provider gt2ft
DEBUG [org.globus.cog.abstraction.impl.common.AbstractionClassLoader] - Using system class loader for provider gt2ft
DEBUG [org.globus.cog.abstraction.impl.common.AbstractionFactory] - Instantiating org.globus.cog.abstraction.impl.execution.gt2ft.TaskHandlerImpl for provider gt2ft
DEBUG [org.globus.cog.abstraction.impl.execution.gt2ft.JobSubmissionTaskHandler] - RSL: &(executable=/home/user/runLong)(arguments=-i 20 -s 2)
DEBUG [org.globus.cog.abstraction.examples.execution.JobSubmission] - Status changed to Submitted
Task checkpointed to file: checkpoint.xml
THIS DEBUG OUTPUT DOES NOT SHOW THE RESOURCE ON WHICH WE SUBMITTED.
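When submissions are scripted, the command line from this section can be assembled as an argument list instead of a hand-quoted string. The following is a minimal Python sketch that uses only the options shown in this section; the helper name `build_submit_command` is hypothetical and not part of the Java CoG Kit.

```python
import shlex


def build_submit_command(host, executable, arguments, checkpoint_file,
                         provider="gt2ft"):
    """Return the cog-task argument vector for a checkpointed submission."""
    return [
        "cog-task", "-submit",
        "-p", provider,
        "-s", host,
        "-e", executable,
        # The job arguments travel as ONE argument, like the quoted
        # string in the command shown in this section.
        "-args", arguments,
        "-c", checkpoint_file,
    ]


command = build_submit_command(
    host="hot.mcs.anl.gov",
    executable="/home/user/runLong",
    arguments="-i 20 -s 2",
    checkpoint_file="checkpoint.xml",
)
# shlex.join adds shell quoting where it is needed, so the printed line
# can be pasted into a shell.
print(shlex.join(command))
```

Assuming cog-task is on the PATH, the same list could be handed to `subprocess.run(command)`, which avoids quoting mistakes around the `-args` value.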
Checking the status of a long running job
Assume we lost the connection, rebooted our machine, or simply some time has passed. To check the status of the job, we can use the following command:
./cog-task -status -c checkpoint.xml
This command contacts the appropriate remote system and checks the status. It will print
PUT THE OUTPUT HERE
With debug output enabled, you see
DEBUG [org.globus.cog.abstraction.impl.common.AbstractionFactory] - Instantiating org.globus.cog.abstraction.impl.execution.gt2.GlobusSecurityContextImpl for provider gt2ft
DEBUG [org.globus.cog.abstraction.impl.common.AbstractionClassLoader] - Using system class loader for provider gt2ft
DEBUG [org.globus.cog.abstraction.impl.common.AbstractionFactory] - Instantiating org.globus.cog.abstraction.impl.execution.gt2ft.TaskHandlerImpl for provider gt2ft
DEBUG [org.globus.cog.abstraction.impl.execution.gt2ft.JobSubmissionTaskHandler] - Task binding successful
DEBUG [org.globus.cog.abstraction.impl.execution.gt2ft.JobSubmissionTaskHandler] - Task identity:urn:cog-1116756045868
DEBUG [org.globus.cog.abstraction.impl.execution.gt2ft.JobSubmissionTaskHandler] - Previous status = Submitted
DEBUG [org.globus.cog.abstraction.examples.xml.XML2Task] - Status changed to Completed
DEBUG [org.globus.cog.abstraction.examples.xml.XML2Task] - Output = null
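Until the plain status output is defined, a script can recover the status from the debug output shown above. The sketch below is a Python illustration that relies only on the two message texts visible in that output ("Previous status = ..." and "Status changed to ..."); the function name `status_transition` is hypothetical, and the message format is an assumption that may change between releases.

```python
import re

STATUS_CHANGE = re.compile(r"Status changed to (\w+)")
PREVIOUS_STATUS = re.compile(r"Previous status = (\w+)")


def status_transition(log_text):
    """Return (previous, current) status names found in the debug output."""
    previous = PREVIOUS_STATUS.search(log_text)
    changes = STATUS_CHANGE.findall(log_text)
    return (
        previous.group(1) if previous else None,
        changes[-1] if changes else None,  # the last reported change wins
    )


# Two lines taken from the debug output shown above.
sample = (
    "DEBUG [org.globus.cog.abstraction.impl.execution.gt2ft."
    "JobSubmissionTaskHandler] - Previous status = Submitted\n"
    "DEBUG [org.globus.cog.abstraction.examples.xml.XML2Task] - "
    "Status changed to Completed\n"
)
print(status_transition(sample))
```

Scraping log messages is fragile; reading the status from the checkpoint file is likely the more robust route once that is possible.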
Obtaining information about a Task
When doing
cog-task -info checkpoint.xml
the checkpointed task looks like this:
identity: 1116756507318
name: myTask
type: Job Submission
service.identity: 1116756507319
service.provider: gt2ft
service.type: Job Submission
service.Contact: hot.mcs.anl.gov
specification.type: JobSpecification
specification.executable: /home/amin/goWorld
specification.arguments: -i 20 -s 2
specification.batchjob: false
specification.redirected: false
specification.localexec: false
attribute.name.globusid: https://hot.mcs.anl.gov:50001/28882/1116756633/
status.state: Submitted
status.time.submitted: 2005-05-22T05:08:32.496
When doing
cog-task -info checkpoint.xml -format xml
the checkpointed task looks like this:
<?xml version="1.0"?>
<task>
  <identity>1116756507318</identity>
  <name>myTask</name>
  <type>Job Submission</type>
  <serviceList>
    <service>
      <identity>1116756507319</identity>
      <provider>gt2ft</provider>
      <type>Job Submission</type>
      <serviceContact>hot.mcs.anl.gov</serviceContact>
    </service>
  </serviceList>
  <specification>
    <JobSpecification>
      <executable>/home/user/longRun</executable>
      <arguments>-i 20 -s 2</arguments>
      <batchJob>false</batchJob>
      <redirected>false</redirected>
      <localExecutable>false</localExecutable>
    </JobSpecification>
  </specification>
  <attributeList>
    <attribute name="globusid"
               value="https://hot.mcs.anl.gov:50001/28882/1116756633/"/>
  </attributeList>
  <status>Submitted</status>
  <submittedTime>2005-05-22T05:08:32.496</submittedTime>
</task>
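Because the checkpoint is plain XML, other tools can read it without going through the Java CoG Kit. The following Python sketch extracts a few fields with the standard library. It assumes the element names shown in the listing above, uses a shortened copy of that listing as input, and the function name `summarize_checkpoint` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the checkpoint shown above.
CHECKPOINT = """<?xml version="1.0"?>
<task>
  <name>myTask</name>
  <serviceList>
    <service>
      <provider>gt2ft</provider>
      <serviceContact>hot.mcs.anl.gov</serviceContact>
    </service>
  </serviceList>
  <attributeList>
    <attribute name="globusid"
               value="https://hot.mcs.anl.gov:50001/28882/1116756633/"/>
  </attributeList>
  <status>Submitted</status>
  <submittedTime>2005-05-22T05:08:32.496</submittedTime>
</task>
"""


def summarize_checkpoint(xml_text):
    """Extract a few descriptive fields from a checkpointed task."""
    task = ET.fromstring(xml_text)
    globus_id = task.find("attributeList/attribute[@name='globusid']")
    return {
        "name": task.findtext("name"),
        "provider": task.findtext("serviceList/service/provider"),
        "contact": task.findtext("serviceList/service/serviceContact"),
        "status": task.findtext("status"),
        # The attribute may be absent, e.g. before the first submission
        # (an assumption; the text does not say).
        "globusid": None if globus_id is None else globus_id.get("value"),
    }


summary = summarize_checkpoint(CHECKPOINT)
print(summary["status"], summary["globusid"])
```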
Managing a set of tasks
As it is very likely that you will need to manage multiple long running jobs at a time, the Java CoG Kit provides convenient ways to make this simpler. We allow checkpointing multiple jobs into a directory or into a database (NOTE THAT THE DATABASE IS NOT YET IMPLEMENTED).
Specifying the checkpoint system
The system used for checkpointing (file or database) is specified with the command
cog-set -checkpoint -type directory -location <path to the directory>
for a directory based location in which subsequent checkpoint files are written, or
cog-set -checkpoint -type database \
    -location mysql://<path to the database> \
    -password password
If the password is not specified, a GUI will appear to ask you for it.
Labels
To make it simple for the user, we have augmented the job submission command with a label option that allows user-defined labels to refer to a specific job. Grid middleware assigns each job a unique ID, which in most cases is not easy for the user to remember. Hence the label feature allows the user to choose labels that are more convenient. If a label has already been defined, it must first be removed:
cog-set -delete label
To list the information attached to a job, we have defined the following options
cog-set -info
lists the jobs in the set in a convenient ASCII table. To change the format to XML you can use
cog-set -info -format xml
To just obtain the information for a single job, you can use the label option
cog-set -info -label myjob
returns the information for the job myjob. If the job is not available, an error is returned.
To list jobs that are in a particular state, the state option can be used. Hence the command
cog-set -info -state failed
will list all jobs that have failed.
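For a directory-based checkpoint location, filtering by state amounts to reading the status of every checkpoint file. The Python sketch below illustrates the idea under two assumptions that the text does not confirm: each job is stored as one XML file with a `<status>` element, as in the single-task checkpoint shown earlier, and the file name (without `.xml`) is the job label. The function name `jobs_in_state` is hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def jobs_in_state(directory, state):
    """Return the labels of checkpointed jobs whose status matches state."""
    wanted = state.lower()
    labels = []
    for path in sorted(Path(directory).glob("*.xml")):
        status = ET.parse(str(path)).getroot().findtext("status", default="")
        if status.lower() == wanted:
            labels.append(path.stem)  # assumption: file name is the label
    return labels


# Demonstration with two tiny, made-up checkpoint files.
with tempfile.TemporaryDirectory() as tmp:
    for label, status in [("myjob", "Failed"), ("other", "Submitted")]:
        Path(tmp, f"{label}.xml").write_text(
            f"<task><name>{label}</name><status>{status}</status></task>"
        )
    failed = jobs_in_state(tmp, "failed")

print(failed)
```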
Adding tasks
To add a task to the set you can use the command
cog-set -label label -add ....
Please note that adding a task does not submit the task to the backend system; it just adds a placeholder to the set for submission at a later time. This feature is useful to generate a number of jobs before submitting them. It is only useful if a sophisticated scheduler is used in conjunction with the set. At this time we recommend using cog-set -submit instead of the add function.
Submitting tasks
A task can be submitted with the command
cog-set -label label -submit ...
It adds the job to the set and submits the job to the Grid. After the submit option, the usual job options are specified. However, if the job has already been added before, then just the label can be used to conduct the submission.
cog-set -label label -submit
Order
In the future we will add the ability to define orders for sets submitted to the Grid. This can be achieved in one of two ways. First, through a file that contains the labels of the jobs, to support parameter studies.
cog-set -order filename
Or through the definition of explicit dependencies between the jobs
cog-set -dependency label1 label2
The order feature is naturally most useful with the add feature, as it allows generating a schedule of the tasks. However, it is also possible to add additional tasks at runtime. In this scenario we assume a particular set of jobs is already running. We can now add additional jobs through the add and the submit commands during runtime. Avoiding deadlocks is the responsibility of the user.
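Since deadlock avoidance is left to the user, it can help to check a planned set of dependencies before declaring them. The Python sketch below (Python 3.9 or later, for `graphlib`) computes a valid submission order and reports a cycle if one exists. It is not part of cog-set. The mapping from each label to the labels it waits for is a hypothetical input format, and the direction of `cog-set -dependency label1 label2` is not defined in this text, so it is not modeled here.

```python
from graphlib import CycleError, TopologicalSorter


def submission_order(waits_for):
    """Return labels in an order that respects the dependencies.

    waits_for maps each label to the set of labels it must wait for.
    """
    try:
        return list(TopologicalSorter(waits_for).static_order())
    except CycleError as error:
        # graphlib stores the detected cycle as the second element of args.
        cycle = error.args[1]
        raise ValueError("dependency cycle: " + " -> ".join(cycle)) from error


order = submission_order({"post": {"sim"}, "sim": {"prep"}, "prep": set()})
print(order)
```

A cycle such as `{"a": {"b"}, "b": {"a"}}` raises `ValueError`, which corresponds to the deadlock case the user is expected to avoid.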
Displaying a set with dependencies
To display a set with dependencies, we can use the info option with some additional options
cog-set -info -dependencies
lists the jobs but adds a column in which the labels of the parents are listed
cog-set -info -dependencies -resolved no
lists only the parents that have not yet been resolved and that the job is still waiting for. Adding the label to any of the commands restricts the output to just that label.
cog-set -info -dependencies -label label
would return the information of the task with the label
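The listing of unresolved parents can be pictured with a short sketch. The Python code below is an illustration only: it assumes that a parent counts as resolved once its state is "Completed", which this text does not define, and the function name `unresolved_parents` and both input mappings are hypothetical.

```python
def unresolved_parents(parents, states, label=None):
    """Map each job to the parents it is still waiting for.

    parents maps a label to the set of its parent labels;
    states maps a label to its current state name.
    """
    labels = [label] if label is not None else sorted(parents)
    return {
        job: sorted(
            parent for parent in parents.get(job, ())
            # Assumption: only a Completed parent counts as resolved.
            if states.get(parent) != "Completed"
        )
        for job in labels
    }


parents = {"post": {"sim", "prep"}, "sim": {"prep"}, "prep": set()}
states = {"prep": "Completed", "sim": "Running", "post": "Specified"}
print(unresolved_parents(parents, states))
print(unresolved_parents(parents, states, label="post"))
```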
To return a graphical representation of the job dependencies and their states, one can use the command
cog-set -info -graph filename.png -format dot
or
cog-set -info -graph filename.png -format karajan
to return a PNG that internally uses either the dot engine (in case it is installed on your system) or the Karajan engine, which comes with the Java CoG Kit.
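For the dot variant, the graph has to be expressed in the Graphviz DOT language before it can be rendered. The Python sketch below shows one possible encoding, with one node per job (labelled with its state) and one edge from each parent to its child. The function name `to_dot` and the input mappings are hypothetical, and the sketch assumes labels contain no double quotes. It does not describe how cog-set produces its graph.

```python
def to_dot(parents, states):
    """Render jobs and their dependencies as Graphviz DOT text."""
    lines = ["digraph jobs {"]
    for job in sorted(parents):
        state = states.get(job, "unknown")
        # "\n" inside a DOT label starts a new line in the rendered node.
        lines.append(f'  "{job}" [label="{job}\\n{state}"];')
    for job in sorted(parents):
        for parent in sorted(parents[job]):
            lines.append(f'  "{parent}" -> "{job}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(
    {"sim": {"prep"}, "prep": set()},
    {"prep": "Completed", "sim": "Running"},
)
print(dot_text)
```

If Graphviz is installed, saving the text to a file such as jobs.dot and running `dot -Tpng jobs.dot -o filename.png` renders it as a PNG.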
Manual Pages
The above changes the job submit command to
cog-task and cog-setup
We need to discuss whether we keep cog-submit or use cog-task instead.
Here we put in the reference manuals for all of the commands
cog-task
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
SEE ALSO
BUGS
EXAMPLES
- pointers to CVSVIEW
cog-set
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
SEE ALSO
BUGS
EXAMPLES
- pointers to CVSVIEW
cog-job-submit (deprecated)
Deprecated in favour of cog-task and cog-set
Integration into Karajan
This is described in more detail in
http://www.cogkit.org/w/index.php/Talk:Java_CoG_Kit_Workflow_Examples#Proposed_Enhancement:_Graphs
Specific Implementation Issues
Globus Toolkit 2
See the Globus Toolkit 3 classical model.
Globus Toolkit 3
These features will not be supported in the OGSI-based services. In GT3 we recommend that you use the GT2 classical services. Hence our provider is called gt2ft.
A prototype is available but does not yet conform to the commands presented in this guide.
Globus Toolkit 4
A system will be implemented.
